Matrix Multiply: A Case Study
6.172, Fall 2010
Saman Amarasinghe
Matrix Multiply
Matrix multiply is a fundamental operation in many computations
  Examples: video encoding, weather simulation, computer graphics

[Diagram: A = B x C]

for(int i = 0; i < x; i++)
  for(int j = 0; j < y; j++)
    for(int k = 0; k < z; k++)
      A[i][j] += B[i][k] * C[k][j];
Matrix Representation
I'd like my matrix representation to be
  Object oriented
  Immutable
  Able to represent both integers and doubles
[Class diagram: Matrix holds MatrixRow[]; IntegerRow and DoubleRow extend MatrixRow; rows hold Value objects]
public class Value {
  final MatrixType type;
  final int iVal;
  final double dVal;

  Value(int i) {
    type = MatrixType.INTEGER;
    iVal = i;
    dVal = 0;
  }
  Value(double d) {
    type = MatrixType.FLOATING_POINT;
    dVal = d;
    iVal = 0;
  }
  int getInt() throws Exception {
    if(type == MatrixType.INTEGER)
      return iVal;
    else
      throw new Exception();
  }
  double getDouble() throws Exception {
    if(type == MatrixType.FLOATING_POINT)
      return dVal;
    else
      throw new Exception();
  }
}
public class Matrix {
  final MatrixRow[] rows;
  final int nRows, nColumns;
  final MatrixType type;

  Matrix(int rows, int cols, MatrixType type) {
    this.type = type;
    this.nRows = rows;
    this.nColumns = cols;
    this.rows = new MatrixRow[this.nRows];
    for(int i=0; i<this.nRows; i++)
      this.rows[i] = (type == MatrixType.INTEGER)?
        new IntegerRow(this.nColumns) : new DoubleRow(this.nColumns);
  }
}
public class Matrix {
  private Matrix(MatrixRow[] rows, MatrixType type, int nRows, int nCols) {
    this.rows = rows;
    this.nRows = nRows;
    this.nColumns = nCols;
    this.type = type;
  }
  public Matrix update(int row, int col, Value val) throws Exception {
    MatrixRow[] newRows = new MatrixRow[nRows];
    for(int i=0; i<nRows; i++)
      newRows[i] = (i == row) ? rows[i].update(col, val) : rows[i];
    return new Matrix(newRows, type, nRows, nColumns);
  }
  Value get(int row, int col) throws Exception {
    return rows[row].get(col);
  }
}
public abstract class MatrixRow {
  abstract Value get(int col) throws Exception;
  abstract public MatrixRow update(int col, Value val) throws Exception;
}
public class DoubleRow extends MatrixRow {
  final Double[] theRow;
  public final int numColumns;

  DoubleRow(int ncols) {
    this.numColumns = ncols;
    theRow = new Double[ncols];
    for(int i=0; i < ncols; i++)
      theRow[i] = new Double(0);
  }
  private DoubleRow(Double[] row, int cols) {
    this.theRow = row;
    this.numColumns = cols;
  }
  public MatrixRow update(int col, Value val) throws Exception {
    Double[] row = new Double[numColumns];
    for(int i=0; i < numColumns; i++)
      row[i] = (i==col) ? (new Double(val.getDouble())) : theRow[i];
    return new DoubleRow(row, numColumns);
  }
  public Value get(int col) {
    return new Value(theRow[col]);
  }
}
public class MatrixMultiply {
  public static long testMM(int x, int y, int z) {
    Matrix A = new Matrix(x, y, MatrixType.FLOATING_POINT);
    Matrix B = new Matrix(x, z, MatrixType.FLOATING_POINT);
    Matrix C = new Matrix(z, y, MatrixType.FLOATING_POINT);
    long started = System.nanoTime();
    try {
      for(int i = 0; i < x; i++)
        for(int j = 0; j < y; j++)
          for(int k = 0; k < z; k++)
            A = A.update(i, j, new Value(A.get(i, j).getDouble() +
                                         B.get(i, k).getDouble() *
                                         C.get(k, j).getDouble()));
    } catch(Exception e) {
    }
    long time = System.nanoTime();
    long timeTaken = (time - started);
    System.out.println("Time:" + timeTaken/1000000 + "ms");
    return timeTaken;
  }
}
Performance
1024x1024 matrix multiply
Is the performance good?
  It took almost 5 hours to multiply two 1024x1024 matrices
  1024^3 = 1,073,741,824 iterations
  Each iteration is a multiply, an add, 3 index updates, and a branch check: about 6 ops
  1,073,741,824 * 6 = 6,442,450,944 operations
  Operations per second = 6,442,450,944 / 17,094 = 376,880, or about 3.77x10^5
  My PC runs at 3.15 GHz, i.e. 3.15x10^9 cycles per second
  That comes to about 8,358 cycles per visible operation
How can we improve performance?
Profiling
Look deeply into the program execution
Find out where you are spending your time
  By method
  By line
Lots of interesting information
  Time spent
  Cumulative time spent
  Number of invocations
  Etc.
Great way to zero in on what matters: hotspots
  If 90% of the time is in one routine, inefficiencies in the rest of the program don't matter
  Also, are the hotspots doing what you expect them to do?
Profile Data

Method                                              Num Calls   Method Time   Cumulative Time
java.lang.Double.<init>(double)                     3,157,263   52,100        52,100
DoubleRow.<init>(int)                               3,072       51,120        102,980
DoubleRow.update(int, Value)                        11,535      31,630        32,610
Matrix.update(int, int, Value)                      11,535      30,740        63,540
MatrixMultiply.testMM(int, int, int)                1           1,790         172,410
DoubleRow.get(int)                                  34,605      1,290         1,870
Matrix.get(int, int)                                34,605      1,170         3,040
Value.getDouble()                                   46,140      1,000         1,000
Value.<init>(double)                                46,140      810           810
DoubleRow.<init>(Double[ ], int)                    11,535      310           480
MatrixRow.<init>()                                  14,607      220           220
Matrix.<init>(MatrixRow[ ], MatrixType, int, int)   11,534      190           190
Matrix.<init>(int, int, MatrixType)                 3           40            103,020
Main.<init>()                                       1           10            172,420
<ROOT>.<ROOT>                                       -           -             172,420
Main.main(String[ ])                                1           -             172,420
java.lang.Object.<init>()                           72,285      -             -
java.lang.System.nanoTime()                         1           -             -
java.lang.StringBuilder.append(int)                 7           -             -
MatrixType.<clinit>()                               1           -             -
java.lang.StringBuilder.append(String)              7           -             -
java.lang.StringBuilder.<init>()                    1           -             -
java.lang.StringBuilder.toString()                  1           -             -
java.io.PrintStream.println(String)                 1           -             -
MatrixType.<init>(String, int)                      2           -             -
java.lang.Double.doubleValue()                      34,605      -             -
java.lang.Enum.<init>(String, int)                  2           -             -
Issues with Immutability
Updating one location makes a copy of the matrix
  About 2*N object copies for each update
  N^3 updates, so on the order of N^4 copies are made
Copying is costly
  Cost of making the duplicates
  Cost of garbage collecting the freed objects
  Huge memory footprint
Can we do better?
Matrix Representation
I'd like my matrix representation to be
  Object oriented
  Immutable (this revision gives up immutability: the matrix is now updated in place)
  Able to represent both integers and doubles
public class Matrix {
  MatrixRow[] rows;
  final int nRows, nColumns;
  final MatrixType type;

  Matrix(int rows, int cols, MatrixType type) {
    this.type = type;
    this.nRows = rows;
    this.nColumns = cols;
    this.rows = new MatrixRow[this.nRows];
    for(int i=0; i<this.nRows; i++)
      this.rows[i] = (type == MatrixType.INTEGER)?
        new IntegerRow(this.nColumns) : new DoubleRow(this.nColumns);
  }
  void set(int row, int col, Value v) throws Exception {
    rows[row].set(col, v);
  }
  Value get(int row, int col) throws Exception {
    return rows[row].get(col);
  }
}
public class DoubleRow extends MatrixRow {
double[] theRow;
public final int numColumns;
DoubleRow(int ncols) {
this.numColumns = ncols;
theRow = new double[ncols];
}
public void set(int col, Value val) throws Exception {
theRow[col] = val.getDouble();
}
public Value get(int col) {
return new Value(theRow[col]);
}
}
How much do you think the performance will improve?
Performance
Profile Data

Method                                   Num Calls    Method Time   Cumulative Time
MatrixMultiply.testMM(int, int, int)     1            40,076        171,425
Value.getDouble()                        1,958,974    36,791        36,791
Matrix.get(int, int)                     1,469,230    27,725        64,624
DoubleRow.get(int)                       1,692,307    25,343        36,900
Value.<init>(double)                     1,958,974    15,501        15,501
Matrix.set(int, int, Value)              489,743      13,032        35,220
DoubleRow.set(int, Value)                489,743      12,932        22,188
DoubleRow.<init>(int)                    372          21            23
MatrixRow.<init>()                       372          2             2
Matrix.<init>(int, int, MatrixType)      3            2             25
Main.<init>()                            1            1             171,426
java.io.PrintStream.println(String)      1            -             -
java.lang.StringBuilder.append(int)      7            -             -
java.lang.System.nanoTime()              1            -             -
Main.main(String[ ])                     1            -             171,426
MatrixType.<clinit>()                    1            -             -
java.lang.StringBuilder.append(String)   7            -             -
java.lang.StringBuilder.<init>()         1            -             -
MatrixType.<init>(String, int)           2            -             -
java.lang.StringBuilder.toString()       1            -             -
java.lang.Enum.<init>(String, int)       2            -             -
<ROOT>.<ROOT>                            -            -             171,426
java.lang.Object.<init>()                19,592,818   -             -
Issues with Dynamic Dispatch
Method call overhead
  Multiple subtypes: which method gets called depends on the object
  Each method call needs to look up the object's type in a dispatch table
  Dynamic dispatch is an address lookup plus an indirect branch
Indirect branches are costly
  Modern microprocessors are deeply pipelined
    12 pipeline stages in Core 2 Duo, 20 in Pentium 4
    i.e. hundreds of instructions in flight
  Need to be able to keep fetching the next instructions before executing them
    Normal instructions: keep fetching the next instructions
    Direct branch: target address known, can fetch ahead from the target
      works for conditional branches by predicting the branch
    Indirect branch: target unknown, need to wait until the address fetch completes,
      which causes a pipeline stall (see the sketch below)
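To make the indirect-branch cost concrete, here is a minimal C sketch (mine, not lecture code) of what a dynamically dispatched row access boils down to: the call goes through a function pointer stored with the object, so the processor cannot know the branch target until the pointer load completes.

/* Hypothetical illustration of dynamic dispatch: a get() called through a
 * per-object function pointer (a one-entry "vtable"). */
#include <stdio.h>

typedef struct MatrixRow MatrixRow;
struct MatrixRow {
    double (*get)(const MatrixRow *self, int col);  /* indirect call target */
    double *data;
};

static double double_row_get(const MatrixRow *self, int col) {
    return self->data[col];
}

int main(void) {
    double storage[4] = {1.0, 2.0, 3.0, 4.0};
    MatrixRow row = { double_row_get, storage };
    double sum = 0.0;
    for (int col = 0; col < 4; col++)
        sum += row.get(&row, col);   /* indirect branch on every element access */
    /* The direct, inlinable form the later versions reach: sum += storage[col]; */
    printf("sum = %f\n", sum);
    return 0;
}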
Matrix Representation
I'd like my matrix representation to be
  Object oriented
  Immutable
  Able to represent both integers and doubles (this revision gives up the generic
  Value/MatrixRow types and specializes to doubles)
[Class diagram: DoubleMatrix holds DoubleRow[]]
public class DoubleMatrix {
  final DoubleRow[] rows;
  final int nRows, nColumns;

  DoubleMatrix(int rows, int cols) {
    this.nRows = rows;
    this.nColumns = cols;
    this.rows = new DoubleRow[this.nRows];
    for(int i=0; i<this.nRows; i++)
      this.rows[i] = new DoubleRow(this.nColumns);
  }
  void set(int row, int col, double v) {
    rows[row].set(col, v);
  }
  double get(int row, int col) {
    return rows[row].get(col);
  }
}
public final class DoubleRow {
  double[] theRow;
  public final int numColumns;

  DoubleRow(int ncols) {
    this.numColumns = ncols;
    theRow = new double[ncols];
  }
  public void set(int col, double val) {
    theRow[col] = val;
  }
  public double get(int col) {
    return theRow[col];
  }
}
Performance
Profile Data

Method                                   Num Calls   Method Time   Cumulative Time
Matrix.get(int, int)                     1,943,313   66,120        100,310
MatrixMultiply.testMM(int, int, int)     1           44,590        179,960
DoubleRow.get(int)                       1,943,313   34,190        34,190
Matrix.set(int, int, double)             647,770     22,950        34,940
DoubleRow.set(int, double)               647,770     11,990        11,990
DoubleRow.<init>(int)                    3,072       70            70
Matrix.<init>(int, int)                  3           50            120
<ROOT>.<ROOT>                            -           -             179,960
Main.main(String[ ])                     1           -             179,960
Main.<init>()                            1           -             179,960
java.lang.Object.<init>()                3,076       -             -
java.lang.System.nanoTime()              1           -             -
java.lang.StringBuilder.toString()       1           -             -
java.lang.StringBuilder.<init>()         1           -             -
java.lang.StringBuilder.append(int)      7           -             -
java.lang.StringBuilder.append(String)   7           -             -
java.io.PrintStream.println(String)      1           -             -
Profile Data (the three versions side by side)

Immutable
Method                                              Num Calls   Method Time   Cumulative Time
java.lang.Double.<init>(double)                     3,157,263   52,100        52,100
DoubleRow.<init>(int)                               3,072       51,120        102,980
DoubleRow.update(int, Value)                        11,535      31,630        32,610
Matrix.update(int, int, Value)                      11,535      30,740        63,540
MatrixMultiply.testMM(int, int, int)                1           1,790         172,410
DoubleRow.get(int)                                  34,605      1,290         1,870
Matrix.get(int, int)                                34,605      1,170         3,040
Value.getDouble()                                   46,140      1,000         1,000
Value.<init>(double)                                46,140      810           810
DoubleRow.<init>(Double[ ], int)                    11,535      310           480
MatrixRow.<init>()                                  14,607      220           220
Matrix.<init>(MatrixRow[ ], MatrixType, int, int)   11,534      190           190

Mutable
Method                                 Num Calls   Method Time   Cumulative Time
MatrixMultiply.testMM(int, int, int)   1           40,076        171,425
Value.getDouble()                      1,958,974   36,791        36,791
Matrix.get(int, int)                   1,469,230   27,725        64,624
DoubleRow.get(int)                     1,469,230   25,343        36,900
Value.<init>(double)                   1,958,974   15,501        15,501
Matrix.set(int, int, Value)            489,743     13,032        35,220
DoubleRow.set(int, Value)              489,743     12,932        22,188

Doubles only
Method                                 Num Calls   Method Time   Cumulative Time
Matrix.get(int, int)                   1,943,313   66,120        100,310
MatrixMultiply.testMM(int, int, int)   1           44,590        179,960
DoubleRow.get(int)                     1,943,313   34,190        34,190
Matrix.set(int, int, double)           647,770     22,950        34,940
DoubleRow.set(int, double)             647,770     11,990        11,990
DoubleRow.<init>(int)                  3,072       70            70
Issues with Object Oriented
Memory fragmentation
  Objects are allocated independently
  They end up scattered all over memory
  If the data is contiguous in memory, getting to the next element is just an index
  increment (see the sketch below)
Method call overhead
  Method calls are expensive
  Cannot optimize the loop body because of the method call
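A sketch of the layout difference (my illustration, using the IND macro that appears later in the lecture): with double** every row is a separate heap object that can land anywhere, while a single block keeps the whole matrix contiguous.

#include <stdlib.h>

#define IND(A, x, y, d) A[(x)*(d)+(y)]   /* row-major indexing into one block */

/* Fragmented layout: x independent row allocations scattered across the heap. */
double **alloc_fragmented(int x, int y) {
    double **m = (double **)malloc(sizeof(double *) * x);
    for (int i = 0; i < x; i++)
        m[i] = (double *)calloc(y, sizeof(double));
    return m;
}

/* Contiguous layout: one allocation, element (i, j) accessed as IND(m, i, j, y). */
double *alloc_contiguous(int x, int y) {
    return (double *)calloc((size_t)x * y, sizeof(double));
}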
Matrix Representation
I'd like my matrix representation to be
  Object oriented (this final revision gives up the object wrappers: plain double[][] arrays)
  Immutable
  Able to represent both integers and doubles
double[][] A = new double[x][y];
double[][] B = new double[x][z];
double[][] C = new double[z][y];
long started = System.nanoTime();
for(int i = 0; i < x; i++)
  for(int j = 0; j < y; j++)
    for(int k = 0; k < z; k++)
      A[i][j] += B[i][k]*C[k][j];
long ended = System.nanoTime();
Performance
From Java to C
  Java: memory bounds checks; bytecode is first interpreted and then JITted
  (fast compilation, no time to generate the best code)
  C: no bounds checks; the Intel C compiler compiles the program directly
  into x86 assembly
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

uint64_t read_timestamp_counter(void);   /* timing helper, defined elsewhere */

uint64_t testMM(const int x, const int y, const int z)
{
  double **A;
  double **B;
  double **C;
  uint64_t started, ended;
  uint64_t timeTaken;
  int i, j, k;

  A = (double**)malloc(sizeof(double *)*x);
  B = (double**)malloc(sizeof(double *)*x);
  C = (double**)malloc(sizeof(double *)*z);
  for (i = 0; i < x; i++)
    A[i] = (double *) malloc(sizeof(double)*y);
  for (i = 0; i < x; i++)
    B[i] = (double *) malloc(sizeof(double)*z);
  for (i = 0; i < z; i++)
    C[i] = (double *) malloc(sizeof(double)*y);

  started = read_timestamp_counter();
  for(i = 0; i < x; i++)
    for(j = 0; j < y; j++)
      for(k = 0; k < z; k++)
        A[i][j] += B[i][k] * C[k][j];
  ended = read_timestamp_counter();

  timeTaken = (ended - started);
  printf("Time: %f ms\n", timeTaken/3158786.0);
  return timeTaken;
}
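The listing above calls read_timestamp_counter(), which the slides never show. A minimal sketch of such a helper for x86-64 with GCC/ICC-style inline assembly might look like this (an assumption about how the helper was written; the course's actual version may differ):

#include <stdint.h>

/* Read the x86 time-stamp counter (rdtsc). The result is in cycles;
 * the slides convert to milliseconds by dividing by 3,158,786.0 cycles/ms. */
static inline uint64_t read_timestamp_counter(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}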
Performance
Profiling with Performance Counters
Modern hardware counts events
  A lot more information than just execution time
CPI: Clock cycles Per Instruction
  Measures whether instructions are stalling
L1 and L2 Cache Miss Rate
  Are your accesses using the cache well, or is the cache misbehaving?
Instructions Retired
  How many instructions got executed

  CPI    L1 Miss Rate   L2 Miss Rate   Percent SSE Instructions   Instructions Retired
  4.78   0.24           0.02           43%                        13,137,280,000
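On Linux, one way to read such counters from inside a program is the perf_event_open interface; the sketch below is my own illustration (not the tool used for the numbers above) and counts retired instructions around a region of code.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* instructions retired */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region to measure, e.g. the matrix multiply loops ... */
    volatile double sink = 0.0;
    for (int i = 0; i < 1000000; i++) sink += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}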
Issues with Matrix Representation
Scanning the memory
  [Diagram: memory access pattern of A = B x C]
  Contiguous accesses are better (see the sketch below)
  Data is fetched as cache lines (Core 2 Duo: 64-byte L2 cache line)
  With contiguous data, a single cache-line fetch supports 8 reads of doubles
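A minimal sketch (not lecture code) of the access-pattern point: the same N x N sum walked row-wise touches consecutive doubles, so one 64-byte cache line serves 8 reads, while the column-wise walk strides N doubles per step and tends to pay a cache-line fetch on nearly every access.

#define N 1024
#define IND(A, x, y, d) A[(x)*(d)+(y)]

/* Row-wise: consecutive addresses; 8 doubles per cache-line fetch. */
double sum_rowwise(const double *M) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += IND(M, i, j, N);
    return s;
}

/* Column-wise: a stride of N doubles (8 KB); each access is likely a new cache line. */
double sum_colwise(const double *M) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += IND(M, i, j, N);
    return s;
}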
Preprocessing of Data
In Matrix Multiply
  n^3 computation
  n^2 data
Possibility of preprocessing the data before the computation
  n^2 data, so n^2 preprocessing
  Can make the n^3 part happen faster
One matrix doesn't have good cache behavior
  Transpose that matrix
    n^2 operations
  Will make the main matrix multiply loop run faster
#define IND(A, x, y, d) A[(x)*(d)+(y)]

A  = (double *)malloc(sizeof(double)*x*y);
B  = (double *)malloc(sizeof(double)*x*z);
C  = (double *)malloc(sizeof(double)*y*z);
Cx = (double *)malloc(sizeof(double)*y*z);

started = read_timestamp_counter();
for(j = 0; j < y; j++)
  for(k = 0; k < z; k++)
    IND(Cx,j,k,z) = IND(C, k, j, y);
for(i = 0; i < x; i++)
  for(j = 0; j < y; j++)
    for(k = 0; k < z; k++)
      IND(A, i, j, y) += IND(B, i, k, z)*IND(Cx, j, k, z);
ended = read_timestamp_counter();
timeTaken = (ended - started);
printf("Time: %f ms\n", timeTaken/3158786.0);
Performance
Profile Data
The Memory System
The memory system dilemma
  Small amount of memory: fast access
  Large amount of memory: slow access
  How do you get a lot of memory and still access it very fast?
Cache Hierarchy
  Store the most probable accesses in a small amount of memory with fast access
  Hardware heuristics determine what will be in each cache and when
  [Diagram: two processors, each with a 64 kB L1, sharing a 2 MB L2, a 16 MB L3,
  and gigabytes of main memory; access latencies range from roughly 1-3 cycles (L1)
  through 14 cycles (L2) to 100+ cycles (memory)]
The temperamental cache
  If your access pattern matches the heuristics of the hardware: blazingly fast
  Otherwise: dog slow
Data Reuse
Data reuse
  A change of computation order can reduce the number of loads into cache
Calculating a row (1024 values of A)
  A: 1024*1 = 1,024 + B: 384*1 = 384 + C: 1024*384 = 393,216, total 394,624 loads
Blocked Matrix Multiply (a 32x32 block = 1024 values of A)
  A: 32*32 = 1,024 + B: 384*32 = 12,288 + C: 32*384 = 12,288, total 25,600 loads
[Diagram: A (1024x1024) = B (1024x384) x C (384x1024)]
Changing the Program
Many ways to get to the same result
  Change the execution order
  Change the algorithm
  Change the data structures
Some changes can perturb the results
  Select a different but equivalent answer
  Reorder arithmetic operations
    (a + b) + c  vs.  a + (b + c)   (see the example below)
  Drop/change precision
  Operate within an acceptable error range
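A tiny example of why reordering arithmetic can select a "different but equivalent" answer: floating-point addition is not associative, so the two groupings below print different values (this is my illustration, not from the slides).

#include <stdio.h>

int main(void) {
    double a = 1e16, b = -1e16, c = 1.0;
    /* (a + b) + c evaluates to 1.0, but b + c rounds back to -1e16,
     * so a + (b + c) evaluates to 0.0. */
    printf("(a + b) + c = %g\n", (a + b) + c);
    printf("a + (b + c) = %g\n", a + (b + c));
    return 0;
}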
#define min(a, b) (((a) < (b)) ? (a) : (b))   /* helper used by the blocked loops */

started = read_timestamp_counter();
for(j2 = 0; j2 < y; j2 += block_x)
  for(k2 = 0; k2 < z; k2 += block_y)
    for(i = 0; i < x; i++)
      for(j = j2; j < min(j2 + block_x, y); j++)
        for(k = k2; k < min(k2 + block_y, z); k++)
          IND(A,i,j,y) += IND(B,i,k,z) * IND(C,k,j,z);
ended = read_timestamp_counter();
timeTaken = (ended - started);
printf("Time: %f ms\n", timeTaken/3158786.0);
Performance
Profile Data
Instruction Level Optimizations
Modern processors have many other performance tricks
  Instruction Level Parallelism
    2 integer, 2 floating point, and 1 MMX/SSE unit
  MMX/SSE Instructions
    Can do the same operation on multiple contiguous data items at the same time
  Cache hierarchy
  Prefetching of data
Nudge the Compiler
Need to nudge the compiler to generate the vector code
  Remove any perceived dependences (see the sketch below)
  Bind most constant variables to their constant values
  Possible use of compiler #pragmas
  Use the vectorization report to see why a loop is not vectorizing
  Another option is to write vector assembly code
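As one illustration of removing perceived dependences (a sketch in the spirit of the slides, not the exact course code): C99 restrict promises the compiler that the three arrays never alias, and the constant bound N gives the vectorizer a known trip count, which together make the inner loop a good candidate for SSE code.

#define N 1024
#define IND(A, x, y, d) A[(x)*(d)+(y)]

/* Multiply using the transposed copy Cx, as in the preceding slides. */
void mm_kernel(double *restrict A,
               const double *restrict B,
               const double *restrict Cx)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                IND(A, i, j, N) += IND(B, i, k, N) * IND(Cx, j, k, N);
}

Compiling a kernel like this with the icc flags on the next slide, together with the vectorization report, is how you check whether the loop actually vectorized.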
#define N 1024
#define BLOCK_X 256
#define BLOCK_Y 1024
#define IND(A, x, y, d) A[(x)*(d)+(y)]

started = read_timestamp_counter();
for(j = 0; j < N; j++)
  for(k = 0; k < N; k++)
    IND(Cx,j,k,N) = IND(C, k, j, N);
for(j2 = 0; j2 < N; j2 += BLOCK_X)
  for(k2 = 0; k2 < N; k2 += BLOCK_Y)
    for(i = 0; i < N; i++)
      for(j = 0; j < BLOCK_X; j++)
        for(k = 0; k < BLOCK_Y; k++)
          IND(A,i,j+j2,N) += IND(B,i,k+k2,N) * IND(Cx,j+j2,k+k2,N);
ended = read_timestamp_counter();
timeTaken = (ended - started);
printf("Time: %f ms\n", timeTaken/3158786.0);
Play with the compiler flags
  icc -help          (find the best flags)
  icc -c -O3 -xT -msse3 mxm.c
Use information from icc
  icc -vec-report5
Generate assembly and stare!
  icc -S -fsource-asm -fverbose-asm
Tweak the program until the compiler is happy

Inner loop: SSE instructions

;;; for(j2 = 0; j2 < N; j2 += BLOCK_X)
        xorl %edx, %edx
        xorl %eax, %eax
        xorps %xmm0, %xmm0
;;; for(k2 = 0; k2 < N; k2 += BLOCK_Y)
;;; for(i = 0; i < N; i++)
        xorl %ebx, %ebx
        xorl %ecx, %ecx
;;; for(j = 0; j < BLOCK_X; j++)
        xorl %r9d, %r9d
;;; for(k = 0; k < BLOCK_Y; k++)
;;; IND(A,i,j+j2,N) += IND(B,i,k+k2,N) * IND(Cx,j+j2,k+k2,N);
        movslq %ecx, %r8
        lea (%rdx,%rcx), %esi
        movslq %esi, %rdi
        shlq $3, %rdi
        movslq %eax, %rsi
        shlq $3, %rsi
..B1.13:
        movaps %xmm0, %xmm2
        movsd A(%rdi), %xmm1
        xorl %r10d, %r10d
..B1.14:
        movaps B(%r10,%r8,8), %xmm3
        mulpd Cx(%r10,%rsi), %xmm3
        addpd %xmm3, %xmm1
        movaps 16+B(%r10,%r8,8), %xmm4
        mulpd 16+Cx(%r10,%rsi), %xmm4
        addpd %xmm4, %xmm2
        movaps 32+B(%r10,%r8,8), %xmm5
        mulpd 32+Cx(%r10,%rsi), %xmm5
        addpd %xmm5, %xmm1
        movaps 48+B(%r10,%r8,8), %xmm6
        mulpd 48+Cx(%r10,%rsi), %xmm6
        addpd %xmm6, %xmm2
        movaps 64+B(%r10,%r8,8), %xmm7
        mulpd 64+Cx(%r10,%rsi), %xmm7
        addpd %xmm7, %xmm1
        movaps 80+B(%r10,%r8,8), %xmm8
        mulpd 80+Cx(%r10,%rsi), %xmm8
        addpd %xmm8, %xmm2
        movaps 96+B(%r10,%r8,8), %xmm9
        mulpd 96+Cx(%r10,%rsi), %xmm9
        addpd %xmm9, %xmm1
        movaps 112+B(%r10,%r8,8), %xmm10
        mulpd 112+Cx(%r10,%rsi), %xmm10
        addpd %xmm10, %xmm2
        addq $128, %r10
        cmpq $8192, %r10
        jl ..B1.14          # Prob 99%
Performance
Profile Data
Tuned Libraries
BLAS Library
  Hand-tuned library in C/assembly to take full advantage of the hardware
  See http://www.netlib.org/blas/
Intel Math Kernel Library
  Experts at Intel figuring out how to get the maximum performance for
  commonly used math routines
  They have a specially tuned BLAS library for x86
int main(int argc, char *argv[])
{
  double *A, *B, *C;
  uint64_t started, ended, timeTaken;

  A = (double *)calloc( N*N, sizeof( double ) );
  B = (double *)calloc( N*N, sizeof( double ) );
  C = (double *)calloc( N*N, sizeof( double ) );
  int i, j;

  started = read_timestamp_counter();
  //enum ORDER {CblasRowMajor=101, CblasColMajor=102};
  //enum TRANSPOSE {CblasNotrans=111, CblasTrans=112, CblasConjtrans=113};
  //void gemm(CBLAS_ORDER Order, CBLAS_TRANSPOSE TransB, CBLAS_TRANSPOSE TransC,
  //          int M, int N, int K,
  //          double alpha,
  //          double B[], int strideB,
  //          double C[], int strideC,
  //          double beta,
  //          double A[], int strideA)
  // A = alpha * B x C + beta * A
  cblas_dgemm(CblasColMajor, CblasTrans, CblasTrans, N, N, N, 1, B, N, C, N, 0, A, N);
  ended = read_timestamp_counter();
  timeTaken = (ended - started);
  printf("Time: %f ms\n", timeTaken/3158786.0);
Performance
Profile Data
Parallel Execution
Multicores are here
  2 to 6 cores in a processor,
  1 to 4 processors in a box
  The cloud machines have 2 processors with 6 cores each (12 cores total)
Use concurrency for parallel execution
  Divide the computation into multiple independent/concurrent computations
  Run the computations in parallel (see the sketch below)
  Synchronize at the end
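A minimal sketch of that recipe for the transposed multiply, using OpenMP (one common choice; not necessarily what the course uses): each thread gets a disjoint set of rows of A, so the iterations are independent, and the parallel loop's implicit barrier is the synchronization at the end.

#include <omp.h>

#define N 1024
#define IND(A, x, y, d) A[(x)*(d)+(y)]

/* Rows of A are divided among the cores; no locking is needed because
 * no two threads ever write the same element of A. */
void mm_parallel(double *A, const double *B, const double *Cx)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                IND(A, i, j, N) += IND(B, i, k, N) * IND(Cx, j, k, N);
}

Build with an OpenMP-enabled compiler flag (for example -fopenmp with GCC) so the pragma takes effect.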