Matrix Multiply: A Case Study
6.172, Fall 2010
Saman Amarasinghe
Matrix Multiply
Matrix multiply is a fundamental operation in many computations
  Examples: video encoding, weather simulation, computer graphics

[Diagram: A = B x C]

for(int i = 0; i < x; i++)
  for(int j = 0; j < y; j++)
    for(int k = 0; k < z; k++)
      A[i][j] += B[i][k] * C[k][j];
Matrix Representation
I'd like my matrix representation to be
  Object oriented
  Immutable
  Able to represent both integers and doubles
[Class diagram: Matrix holds MatrixRow[]; IntegerRow and DoubleRow extend MatrixRow; rows hold Value objects]
public class Value {
  final MatrixType type;
  final int iVal;
  final double dVal;

  Value(int i) {
    type = MatrixType.INTEGER;
    iVal = i;
    dVal = 0;
  }
  Value(double d) {
    type = MatrixType.FLOATING_POINT;
    dVal = d;
    iVal = 0;
  }
  int getInt() throws Exception {
    if(type == MatrixType.INTEGER)
      return iVal;
    else
      throw new Exception();
  }
  double getDouble() throws Exception {
    if(type == MatrixType.FLOATING_POINT)
      return dVal;
    else
      throw new Exception();
  }
}
public class Matrix {
  final MatrixRow[] rows;
  final int nRows, nColumns;
  final MatrixType type;

  Matrix(int rows, int cols, MatrixType type) {
    this.type = type;
    this.nRows = rows;
    this.nColumns = cols;
    this.rows = new MatrixRow[this.nRows];
    for(int i=0; i<this.nRows; i++)
      this.rows[i] = (type == MatrixType.INTEGER)?
        new IntegerRow(this.nColumns) : new DoubleRow(this.nColumns);
  }
}
public class Matrix {
  private Matrix(MatrixRow[] rows, MatrixType type, int nRows, int nCols) {
    this.rows = rows;
    this.nRows = nRows;
    this.nColumns = nCols;
    this.type = type;
  }
  public Matrix update(int row, int col, Value val) throws Exception {
    MatrixRow[] newRows = new MatrixRow[nRows];
    for(int i=0; i<nRows; i++)
      newRows[i] = (i == row) ? rows[i].update(col, val) : rows[i];
    return new Matrix(newRows, type, nRows, nColumns);
  }
  Value get(int row, int col) throws Exception {
    return rows[row].get(col);
  }
}
public abstract class MatrixRow {
  abstract Value get(int col) throws Exception;
  abstract public MatrixRow update(int col, Value val) throws Exception;
}
public class DoubleRow extends MatrixRow {
  final Double[] theRow;
  public final int numColumns;

  DoubleRow(int ncols) {
    this.numColumns = ncols;
    theRow = new Double[ncols];
    for(int i=0; i < ncols; i++)
      theRow[i] = new Double(0);
  }
  private DoubleRow(Double[] row, int cols) {
    this.theRow = row;
    this.numColumns = cols;
  }
  public MatrixRow update(int col, Value val) throws Exception {
    Double[] row = new Double[numColumns];
    for(int i=0; i < numColumns; i++)
      row[i] = (i==col) ? (new Double(val.getDouble())) : theRow[i];
    return new DoubleRow(row, numColumns);
  }
  public Value get(int col) {
    return new Value(theRow[col]);
  }
}
public class MatrixMultiply {
  public static long testMM(int x, int y, int z) {
    Matrix A = new Matrix(x, y, MatrixType.FLOATING_POINT);
    Matrix B = new Matrix(x, z, MatrixType.FLOATING_POINT);
    Matrix C = new Matrix(z, y, MatrixType.FLOATING_POINT);
    long started = System.nanoTime();
    try {
      for(int i = 0; i < x; i++)
        for(int j = 0; j < y; j++)
          for(int k = 0; k < z; k++)
            A = A.update(i, j, new Value(A.get(i, j).getDouble() +
                                         B.get(i, k).getDouble() *
                                         C.get(k, j).getDouble()));
    } catch(Exception e) {
    }
    long time = System.nanoTime();
    long timeTaken = (time - started);
    System.out.println("Time:" + timeTaken/1000000 + "ms");
    return timeTaken;
  }
}
Performance
1024x1024 matrix multiply
Is the performance good?
  It took almost 5 hours to multiply two 1024x1024 matrices
  1024^3 = 1,073,741,824 iterations
  Each iteration is a multiply, an add, 3 index updates, and a branch check: about 6 ops
  1,073,741,824 * 6 = 6,442,450,944 operations
  Operations per second = 6,442,450,944 / 17,094 = 376,880, or about 3.77x10^5
  My PC runs at 3.15 GHz, i.e. 3.15x10^9 cycles per second
  That comes to about 8,358 cycles per visible operation
How can we improve performance?
Profiling
Look deeply into the program execution
Find out where you are spending your time
  By method
  By line
Lots of interesting information
  Time spent
  Cumulative time spent
  Number of invocations
  Etc.
Great way to zero in on what matters: hotspots
  If 90% of the time is in one routine, inefficiencies in the rest of the program don't matter
  Also, are the hotspots doing what you expect them to do?
Profile Data

Method                                              Num Calls   Method Time   Cumulative Time
java.lang.Double.<init>(double)                     3,157,263   52,100        52,100
DoubleRow.<init>(int)                               3,072       51,120        102,980
DoubleRow.update(int, Value)                        11,535      31,630        32,610
Matrix.update(int, int, Value)                      11,535      30,740        63,540
MatrixMultiply.testMM(int, int, int)                1           1,790         172,410
DoubleRow.get(int)                                  34,605      1,290         1,870
Matrix.get(int, int)                                34,605      1,170         3,040
Value.getDouble()                                   46,140      1,000         1,000
Value.<init>(double)                                46,140      810           810
DoubleRow.<init>(Double[ ], int)                    11,535      310           480
MatrixRow.<init>()                                  14,607      220           220
Matrix.<init>(MatrixRow[ ], MatrixType, int, int)   11,534      190           190
Matrix.<init>(int, int, MatrixType)                 3           40            103,020
Main.<init>()                                       1           10            172,420
<ROOT>.<ROOT>                                       -           -             172,420
Main.main(String[ ])                                1           -             172,420
java.lang.Object.<init>()                           72,285      -             -
java.lang.System.nanoTime()                         1           -             -
java.lang.StringBuilder.append(int)                 7           -             -
MatrixType.<clinit>()                               1           -             -
java.lang.StringBuilder.append(String)              7           -             -
java.lang.StringBuilder.<init>()                    1           -             -
java.lang.StringBuilder.toString()                  1           -             -
java.io.PrintStream.println(String)                 1           -             -
MatrixType.<init>(String, int)                      2           -             -
java.lang.Double.doubleValue()                      34,605      -             -
java.lang.Enum.<init>(String, int)                  2           -             -
Issues with Immutability
Updating one location makes a copy of the matrix
  About 2*N object copies for each update
  N^3 updates, so on the order of N^4 copies are made
Copying is costly
  Cost of making the duplicates
  Cost of garbage collecting the freed objects
  Huge memory footprint
Can we do better?
Matrix Representation
I'd like my matrix representation to be
  Object oriented
  Immutable (this revision gives up immutability: the matrix is now updated in place)
  Able to represent both integers and doubles
public class Matrix {
  MatrixRow[] rows;
  final int nRows, nColumns;
  final MatrixType type;

  Matrix(int rows, int cols, MatrixType type) {
    this.type = type;
    this.nRows = rows;
    this.nColumns = cols;
    this.rows = new MatrixRow[this.nRows];
    for(int i=0; i<this.nRows; i++)
      this.rows[i] = (type == MatrixType.INTEGER)?
        new IntegerRow(this.nColumns) : new DoubleRow(this.nColumns);
  }
  void set(int row, int col, Value v) throws Exception {
    rows[row].set(col, v);
  }
  Value get(int row, int col) throws Exception {
    return rows[row].get(col);
  }
}
public class DoubleRow extends MatrixRow {
double[] theRow;
public final int numColumns;
DoubleRow(int ncols) {
this.numColumns = ncols;
theRow = new double[ncols];
}
public void set(int col, Value val) throws Exception {
theRow[col] = val.getDouble();
}
public Value get(int col) {
return new Value(theRow[col]);
}
}
How much do you think the performance will improve?
Performance
Profile Data

Method                                   Num Calls    Method Time   Cumulative Time
MatrixMultiply.testMM(int, int, int)     1            40,076        171,425
Value.getDouble()                        1,958,974    36,791        36,791
Matrix.get(int, int)                     1,469,230    27,725        64,624
DoubleRow.get(int)                       1,692,307    25,343        36,900
Value.<init>(double)                     1,958,974    15,501        15,501
Matrix.set(int, int, Value)              489,743      13,032        35,220
DoubleRow.set(int, Value)                489,743      12,932        22,188
DoubleRow.<init>(int)                    372          21            23
MatrixRow.<init>()                       372          2             2
Matrix.<init>(int, int, MatrixType)      3            2             25
Main.<init>()                            1            1             171,426
java.io.PrintStream.println(String)      1            -             -
java.lang.StringBuilder.append(int)      7            -             -
java.lang.System.nanoTime()              1            -             -
Main.main(String[ ])                     1            -             171,426
MatrixType.<clinit>()                    1            -             -
java.lang.StringBuilder.append(String)   7            -             -
java.lang.StringBuilder.<init>()         1            -             -
MatrixType.<init>(String, int)           2            -             -
java.lang.StringBuilder.toString()       1            -             -
java.lang.Enum.<init>(String, int)       2            -             -
<ROOT>.<ROOT>                            -            -             171,426
java.lang.Object.<init>()                19,592,818   -             -
Issues with Dynamic Dispatch
Method call overhead
  Multiple subtypes: which method gets called depends on the object
  Each method call needs to look up the object's type in a dispatch table
  Dynamic dispatch is an address lookup plus an indirect branch
Indirect branches are costly
  Modern microprocessors are deeply pipelined
    12 pipeline stages in Core 2 Duo, 20 in Pentium 4
    i.e. hundreds of instructions in flight
  Need to be able to keep fetching the next instructions before executing them
    Normal instructions: keep fetching the next instructions
    Direct branch: target address known, can fetch ahead from the target
      works for conditional branches by predicting the branch
    Indirect branch: target unknown, need to wait until the address fetch completes,
      which causes a pipeline stall (see the sketch below)
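To make the indirect-branch cost concrete, here is a minimal C sketch (mine, not lecture code) of what a dynamically dispatched row access boils down to: the call goes through a function pointer stored with the object, so the processor cannot know the branch target until the pointer load completes.

/* Hypothetical illustration of dynamic dispatch: a get() called through a
 * per-object function pointer (a one-entry "vtable"). */
#include <stdio.h>

typedef struct MatrixRow MatrixRow;
struct MatrixRow {
    double (*get)(const MatrixRow *self, int col);  /* indirect call target */
    double *data;
};

static double double_row_get(const MatrixRow *self, int col) {
    return self->data[col];
}

int main(void) {
    double storage[4] = {1.0, 2.0, 3.0, 4.0};
    MatrixRow row = { double_row_get, storage };
    double sum = 0.0;
    for (int col = 0; col < 4; col++)
        sum += row.get(&row, col);   /* indirect branch on every element access */
    /* The direct, inlinable form the later versions reach: sum += storage[col]; */
    printf("sum = %f\n", sum);
    return 0;
}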
Matrix Representation
I'd like my matrix representation to be
  Object oriented
  Immutable
  Able to represent both integers and doubles (this revision gives up the generic
  Value/MatrixRow types and specializes to doubles)
[Class diagram: DoubleMatrix holds DoubleRow[]]
public class DoubleMatrix {
  final DoubleRow[] rows;
  final int nRows, nColumns;

  DoubleMatrix(int rows, int cols) {
    this.nRows = rows;
    this.nColumns = cols;
    this.rows = new DoubleRow[this.nRows];
    for(int i=0; i<this.nRows; i++)
      this.rows[i] = new DoubleRow(this.nColumns);
  }
  void set(int row, int col, double v) {
    rows[row].set(col, v);
  }
  double get(int row, int col) {
    return rows[row].get(col);
  }
}
public final class DoubleRow {
  double[] theRow;
  public final int numColumns;

  DoubleRow(int ncols) {
    this.numColumns = ncols;
    theRow = new double[ncols];
  }
  public void set(int col, double val) {
    theRow[col] = val;
  }
  public double get(int col) {
    return theRow[col];
  }
}
Performance
Profile Data

Method                                   Num Calls   Method Time   Cumulative Time
Matrix.get(int, int)                     1,943,313   66,120        100,310
MatrixMultiply.testMM(int, int, int)     1           44,590        179,960
DoubleRow.get(int)                       1,943,313   34,190        34,190
Matrix.set(int, int, double)             647,770     22,950        34,940
DoubleRow.set(int, double)               647,770     11,990        11,990
DoubleRow.<init>(int)                    3,072       70            70
Matrix.<init>(int, int)                  3           50            120
<ROOT>.<ROOT>                            -           -             179,960
Main.main(String[ ])                     1           -             179,960
Main.<init>()                            1           -             179,960
java.lang.Object.<init>()                3,076       -             -
java.lang.System.nanoTime()              1           -             -
java.lang.StringBuilder.toString()       1           -             -
java.lang.StringBuilder.<init>()         1           -             -
java.lang.StringBuilder.append(int)      7           -             -
java.lang.StringBuilder.append(String)   7           -             -
java.io.PrintStream.println(String)      1           -             -
Profile Data (the three versions side by side)

Immutable
Method                                              Num Calls   Method Time   Cumulative Time
java.lang.Double.<init>(double)                     3,157,263   52,100        52,100
DoubleRow.<init>(int)                               3,072       51,120        102,980
DoubleRow.update(int, Value)                        11,535      31,630        32,610
Matrix.update(int, int, Value)                      11,535      30,740        63,540
MatrixMultiply.testMM(int, int, int)                1           1,790         172,410
DoubleRow.get(int)                                  34,605      1,290         1,870
Matrix.get(int, int)                                34,605      1,170         3,040
Value.getDouble()                                   46,140      1,000         1,000
Value.<init>(double)                                46,140      810           810
DoubleRow.<init>(Double[ ], int)                    11,535      310           480
MatrixRow.<init>()                                  14,607      220           220
Matrix.<init>(MatrixRow[ ], MatrixType, int, int)   11,534      190           190

Mutable
Method                                 Num Calls   Method Time   Cumulative Time
MatrixMultiply.testMM(int, int, int)   1           40,076        171,425
Value.getDouble()                      1,958,974   36,791        36,791
Matrix.get(int, int)                   1,469,230   27,725        64,624
DoubleRow.get(int)                     1,469,230   25,343        36,900
Value.<init>(double)                   1,958,974   15,501        15,501
Matrix.set(int, int, Value)            489,743     13,032        35,220
DoubleRow.set(int, Value)              489,743     12,932        22,188

Doubles only
Method                                 Num Calls   Method Time   Cumulative Time
Matrix.get(int, int)                   1,943,313   66,120        100,310
MatrixMultiply.testMM(int, int, int)   1           44,590        179,960
DoubleRow.get(int)                     1,943,313   34,190        34,190
Matrix.set(int, int, double)           647,770     22,950        34,940
DoubleRow.set(int, double)             647,770     11,990        11,990
DoubleRow.<init>(int)                  3,072       70            70
Issues with Object Oriented
Memory fragmentation
  Objects are allocated independently
  They end up scattered all over memory
  If the data is contiguous in memory, getting to the next element is just an index
  increment (see the sketch below)
Method call overhead
  Method calls are expensive
  Cannot optimize the loop body because of the method call
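A sketch of the layout difference (my illustration, using the IND macro that appears later in the lecture): with double** every row is a separate heap object that can land anywhere, while a single block keeps the whole matrix contiguous.

#include <stdlib.h>

#define IND(A, x, y, d) A[(x)*(d)+(y)]   /* row-major indexing into one block */

/* Fragmented layout: x independent row allocations scattered across the heap. */
double **alloc_fragmented(int x, int y) {
    double **m = (double **)malloc(sizeof(double *) * x);
    for (int i = 0; i < x; i++)
        m[i] = (double *)calloc(y, sizeof(double));
    return m;
}

/* Contiguous layout: one allocation, element (i, j) accessed as IND(m, i, j, y). */
double *alloc_contiguous(int x, int y) {
    return (double *)calloc((size_t)x * y, sizeof(double));
}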
Matrix Representation
I'd like my matrix representation to be
  Object oriented (this final revision gives up the object wrappers: plain double[][] arrays)
  Immutable
  Able to represent both integers and doubles
double[][] A = new double[x][y];
double[][] B = new double[x][z];
double[][] C = new double[z][y];
long started = System.nanoTime();
for(int i = 0; i < x; i++)
  for(int j = 0; j < y; j++)
    for(int k = 0; k < z; k++)
      A[i][j] += B[i][k]*C[k][j];
long ended = System.nanoTime();
Performance
From Java to C
  Java: memory bounds checks; bytecode is first interpreted and then JITted
  (fast compilation, no time to generate the best code)
  C: no bounds checks; the Intel C compiler compiles the program directly
  into x86 assembly
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

uint64_t read_timestamp_counter(void);   /* timing helper, defined elsewhere */

uint64_t testMM(const int x, const int y, const int z)
{
  double **A;
  double **B;
  double **C;
  uint64_t started, ended;
  uint64_t timeTaken;
  int i, j, k;

  A = (double**)malloc(sizeof(double *)*x);
  B = (double**)malloc(sizeof(double *)*x);
  C = (double**)malloc(sizeof(double *)*z);
  for (i = 0; i < x; i++)
    A[i] = (double *) malloc(sizeof(double)*y);
  for (i = 0; i < x; i++)
    B[i] = (double *) malloc(sizeof(double)*z);
  for (i = 0; i < z; i++)
    C[i] = (double *) malloc(sizeof(double)*y);

  started = read_timestamp_counter();
  for(i = 0; i < x; i++)
    for(j = 0; j < y; j++)
      for(k = 0; k < z; k++)
        A[i][j] += B[i][k] * C[k][j];
  ended = read_timestamp_counter();

  timeTaken = (ended - started);
  printf("Time: %f ms\n", timeTaken/3158786.0);
  return timeTaken;
}
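The listing above calls read_timestamp_counter(), which the slides never show. A minimal sketch of such a helper for x86-64 with GCC/ICC-style inline assembly might look like this (an assumption about how the helper was written; the course's actual version may differ):

#include <stdint.h>

/* Read the x86 time-stamp counter (rdtsc). The result is in cycles;
 * the slides convert to milliseconds by dividing by 3,158,786.0 cycles/ms. */
static inline uint64_t read_timestamp_counter(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}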
Performance
Profiling with Performance Counters
Modern hardware counts events
  A lot more information than just execution time
CPI: Clock cycles Per Instruction
  Measures whether instructions are stalling
L1 and L2 Cache Miss Rate
  Are your accesses using the cache well, or is the cache misbehaving?
Instructions Retired
  How many instructions got executed

  CPI    L1 Miss Rate   L2 Miss Rate   Percent SSE Instructions   Instructions Retired
  4.78   0.24           0.02           43%                        13,137,280,000
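On Linux, one way to read such counters from inside a program is the perf_event_open interface; the sketch below is my own illustration (not the tool used for the numbers above) and counts retired instructions around a region of code.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* instructions retired */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region to measure, e.g. the matrix multiply loops ... */
    volatile double sink = 0.0;
    for (int i = 0; i < 1000000; i++) sink += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}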
Issues with Matrix Representation
Scanning the memory
  [Diagram: memory access pattern of A = B x C]
  Contiguous accesses are better (see the sketch below)
  Data is fetched as cache lines (Core 2 Duo: 64-byte L2 cache line)
  With contiguous data, a single cache-line fetch supports 8 reads of doubles
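A minimal sketch (not lecture code) of the access-pattern point: the same N x N sum walked row-wise touches consecutive doubles, so one 64-byte cache line serves 8 reads, while the column-wise walk strides N doubles per step and tends to pay a cache-line fetch on nearly every access.

#define N 1024
#define IND(A, x, y, d) A[(x)*(d)+(y)]

/* Row-wise: consecutive addresses; 8 doubles per cache-line fetch. */
double sum_rowwise(const double *M) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += IND(M, i, j, N);
    return s;
}

/* Column-wise: a stride of N doubles (8 KB); each access is likely a new cache line. */
double sum_colwise(const double *M) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += IND(M, i, j, N);
    return s;
}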
Preprocessing of Data
In Matrix Multiply
  n^3 computation
  n^2 data
Possibility of preprocessing the data before the computation
  n^2 data, so n^2 preprocessing
  Can make the n^3 part happen faster
One matrix doesn't have good cache behavior
  Transpose that matrix
    n^2 operations
  Will make the main matrix multiply loop run faster
#define IND(A, x, y, d) A[(x)*(d)+(y)]

A  = (double *)malloc(sizeof(double)*x*y);
B  = (double *)malloc(sizeof(double)*x*z);
C  = (double *)malloc(sizeof(double)*y*z);
Cx = (double *)malloc(sizeof(double)*y*z);

started = read_timestamp_counter();
for(j = 0; j < y; j++)
  for(k = 0; k < z; k++)
    IND(Cx,j,k,z) = IND(C, k, j, y);
for(i = 0; i < x; i++)
  for(j = 0; j < y; j++)
    for(k = 0; k < z; k++)
      IND(A, i, j, y) += IND(B, i, k, z)*IND(Cx, j, k, z);
ended = read_timestamp_counter();
timeTaken = (ended - started);
printf("Time: %f ms\n", timeTaken/3158786.0);
Performance
Profile Data
The Memory System
The memory system dilemma
  Small amount of memory: fast access
  Large amount of memory: slow access
  How do you get a lot of memory and still access it very fast?
Cache Hierarchy
  Store the most probable accesses in a small amount of memory with fast access
  Hardware heuristics determine what will be in each cache and when
  [Diagram: two processors, each with a 64 kB L1, sharing a 2 MB L2, a 16 MB L3,
  and gigabytes of main memory; access latencies range from roughly 1-3 cycles (L1)
  through 14 cycles (L2) to 100+ cycles (memory)]
The temperamental cache
  If your access pattern matches the heuristics of the hardware: blazingly fast
  Otherwise: dog slow
Data Reuse
Data reuse
  A change of computation order can reduce the number of loads into cache
Calculating a row (1024 values of A)
  A: 1024*1 = 1,024 + B: 384*1 = 384 + C: 1024*384 = 393,216, total 394,624 loads
Blocked Matrix Multiply (a 32x32 block = 1024 values of A)
  A: 32*32 = 1,024 + B: 384*32 = 12,288 + C: 32*384 = 12,288, total 25,600 loads
[Diagram: A (1024x1024) = B (1024x384) x C (384x1024)]
Changing the Program
Many ways to get to the same result
  Change the execution order
  Change the algorithm
  Change the data structures
Some changes can perturb the results
  Select a different but equivalent answer
  Reorder arithmetic operations
    (a + b) + c  vs.  a + (b + c)   (see the example below)
  Drop/change precision
  Operate within an acceptable error range
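A tiny example of why reordering arithmetic can select a "different but equivalent" answer: floating-point addition is not associative, so the two groupings below print different values (this is my illustration, not from the slides).

#include <stdio.h>

int main(void) {
    double a = 1e16, b = -1e16, c = 1.0;
    /* (a + b) + c evaluates to 1.0, but b + c rounds back to -1e16,
     * so a + (b + c) evaluates to 0.0. */
    printf("(a + b) + c = %g\n", (a + b) + c);
    printf("a + (b + c) = %g\n", a + (b + c));
    return 0;
}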
#define min(a, b) (((a) < (b)) ? (a) : (b))   /* helper used by the blocked loops */

started = read_timestamp_counter();
for(j2 = 0; j2 < y; j2 += block_x)
  for(k2 = 0; k2 < z; k2 += block_y)
    for(i = 0; i < x; i++)
      for(j = j2; j < min(j2 + block_x, y); j++)
        for(k = k2; k < min(k2 + block_y, z); k++)
          IND(A,i,j,y) += IND(B,i,k,z) * IND(C,k,j,z);
ended = read_timestamp_counter();
timeTaken = (ended - started);
printf("Time: %f ms\n", timeTaken/3158786.0);
Performance
Profile Data
Instruction Level Optimizations
Modern processors have many other performance tricks
  Instruction Level Parallelism
    2 integer, 2 floating point, and 1 MMX/SSE unit
  MMX/SSE Instructions
    Can do the same operation on multiple contiguous data items at the same time
  Cache hierarchy
  Prefetching of data
Nudge the Compiler
Need to nudge the compiler to generate the vector code
  Remove any perceived dependences (see the sketch below)
  Bind most constant variables to their constant values
  Possible use of compiler #pragmas
  Use the vectorization report to see why a loop is not vectorizing
  Another option is to write vector assembly code
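As one illustration of removing perceived dependences (a sketch in the spirit of the slides, not the exact course code): C99 restrict promises the compiler that the three arrays never alias, and the constant bound N gives the vectorizer a known trip count, which together make the inner loop a good candidate for SSE code.

#define N 1024
#define IND(A, x, y, d) A[(x)*(d)+(y)]

/* Multiply using the transposed copy Cx, as in the preceding slides. */
void mm_kernel(double *restrict A,
               const double *restrict B,
               const double *restrict Cx)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                IND(A, i, j, N) += IND(B, i, k, N) * IND(Cx, j, k, N);
}

Compiling a kernel like this with the icc flags on the next slide, together with the vectorization report, is how you check whether the loop actually vectorized.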
#define N 1024
#define BLOCK_X 256
#define BLOCK_Y 1024
#define IND(A, x, y, d) A[(x)*(d)+(y)]

started = read_timestamp_counter();
for(j = 0; j < N; j++)
  for(k = 0; k < N; k++)
    IND(Cx,j,k,N) = IND(C, k, j, N);
for(j2 = 0; j2 < N; j2 += BLOCK_X)
  for(k2 = 0; k2 < N; k2 += BLOCK_Y)
    for(i = 0; i < N; i++)
      for(j = 0; j < BLOCK_X; j++)
        for(k = 0; k < BLOCK_Y; k++)
          IND(A,i,j+j2,N) += IND(B,i,k+k2,N) * IND(Cx,j+j2,k+k2,N);
ended = read_timestamp_counter();
timeTaken = (ended - started);
printf("Time: %f ms\n", timeTaken/3158786.0);
Play with the compiler flags
  icc -help          (find the best flags)
  icc -c -O3 -xT -msse3 mxm.c
Use information from icc
  icc -vec-report5
Generate assembly and stare!
  icc -S -fsource-asm -fverbose-asm
Tweak the program until the compiler is happy

Inner loop: SSE instructions

;;; for(j2 = 0; j2 < N; j2 += BLOCK_X)
        xorl %edx, %edx
        xorl %eax, %eax
        xorps %xmm0, %xmm0
;;; for(k2 = 0; k2 < N; k2 += BLOCK_Y)
;;; for(i = 0; i < N; i++)
        xorl %ebx, %ebx
        xorl %ecx, %ecx
;;; for(j = 0; j < BLOCK_X; j++)
        xorl %r9d, %r9d
;;; for(k = 0; k < BLOCK_Y; k++)
;;; IND(A,i,j+j2,N) += IND(B,i,k+k2,N) * IND(Cx,j+j2,k+k2,N);
        movslq %ecx, %r8
        lea (%rdx,%rcx), %esi
        movslq %esi, %rdi
        shlq $3, %rdi
        movslq %eax, %rsi
        shlq $3, %rsi
..B1.13:
        movaps %xmm0, %xmm2
        movsd A(%rdi), %xmm1
        xorl %r10d, %r10d
..B1.14:
        movaps B(%r10,%r8,8), %xmm3
        mulpd Cx(%r10,%rsi), %xmm3
        addpd %xmm3, %xmm1
        movaps 16+B(%r10,%r8,8), %xmm4
        mulpd 16+Cx(%r10,%rsi), %xmm4
        addpd %xmm4, %xmm2
        movaps 32+B(%r10,%r8,8), %xmm5
        mulpd 32+Cx(%r10,%rsi), %xmm5
        addpd %xmm5, %xmm1
        movaps 48+B(%r10,%r8,8), %xmm6
        mulpd 48+Cx(%r10,%rsi), %xmm6
        addpd %xmm6, %xmm2
        movaps 64+B(%r10,%r8,8), %xmm7
        mulpd 64+Cx(%r10,%rsi), %xmm7
        addpd %xmm7, %xmm1
        movaps 80+B(%r10,%r8,8), %xmm8
        mulpd 80+Cx(%r10,%rsi), %xmm8
        addpd %xmm8, %xmm2
        movaps 96+B(%r10,%r8,8), %xmm9
        mulpd 96+Cx(%r10,%rsi), %xmm9
        addpd %xmm9, %xmm1
        movaps 112+B(%r10,%r8,8), %xmm10
        mulpd 112+Cx(%r10,%rsi), %xmm10
        addpd %xmm10, %xmm2
        addq $128, %r10
        cmpq $8192, %r10
        jl ..B1.14          # Prob 99%
Performance
Profile Data
Tuned Libraries
BLAS Library
  Hand-tuned library in C/assembly to take full advantage of the hardware
  See http://www.netlib.org/blas/
Intel Math Kernel Library
  Experts at Intel figuring out how to get the maximum performance for
  commonly used math routines
  They have a specially tuned BLAS library for x86
int main(int argc, char *argv[])
{
  double *A, *B, *C;
  uint64_t started, ended, timeTaken;

  A = (double *)calloc( N*N, sizeof( double ) );
  B = (double *)calloc( N*N, sizeof( double ) );
  C = (double *)calloc( N*N, sizeof( double ) );
  int i, j;

  started = read_timestamp_counter();
  //enum ORDER {CblasRowMajor=101, CblasColMajor=102};
  //enum TRANSPOSE {CblasNotrans=111, CblasTrans=112, CblasConjtrans=113};
  //void gemm(CBLAS_ORDER Order, CBLAS_TRANSPOSE TransB, CBLAS_TRANSPOSE TransC,
  //          int M, int N, int K,
  //          double alpha,
  //          double B[], int strideB,
  //          double C[], int strideC,
  //          double beta,
  //          double A[], int strideA)
  // A = alpha * B x C + beta * A
  cblas_dgemm(CblasColMajor, CblasTrans, CblasTrans, N, N, N, 1, B, N, C, N, 0, A, N);
  ended = read_timestamp_counter();
  timeTaken = (ended - started);
  printf("Time: %f ms\n", timeTaken/3158786.0);
Performance
Profile Data
Parallel Execution
Multicores are here
  2 to 6 cores in a processor,
  1 to 4 processors in a box
  The cloud machines have 2 processors with 6 cores each (12 cores total)
Use concurrency for parallel execution
  Divide the computation into multiple independent/concurrent computations
  Run the computations in parallel (see the sketch below)
  Synchronize at the end
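A minimal sketch of that recipe for the transposed multiply, using OpenMP (one common choice; not necessarily what the course uses): each thread gets a disjoint set of rows of A, so the iterations are independent, and the parallel loop's implicit barrier is the synchronization at the end.

#include <omp.h>

#define N 1024
#define IND(A, x, y, d) A[(x)*(d)+(y)]

/* Rows of A are divided among the cores; no locking is needed because
 * no two threads ever write the same element of A. */
void mm_parallel(double *A, const double *B, const double *Cx)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                IND(A, i, j, N) += IND(B, i, k, N) * IND(Cx, j, k, N);
}

Build with an OpenMP-enabled compiler flag (for example -fopenmp with GCC) so the pragma takes effect.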