
HW-6

Shiva Prashanth Sriramsetty


016766256

1.

a) Memory Access Order

Given matrices a[3][5], b[5][4], and c[3][4] in row-major order:

• Row-major order: Elements of each row are stored sequentially in memory.


• Access order for the triply nested loop (a sketch of the assumed loop nest appears after the summary below):

1. Matrix a: Accesses are a[i][k] for a fixed i. Memory access order:


o i=0, k=0,1,2,3,4
o i=1, k=0,1,2,3,4
o i=2, k=0,1,2,3,4

Sequential: for each row i, the elements of a are accessed sequentially (stride 1).

2. Matrix b: Accesses are b[k][j] for a fixed j. Memory access order:
o j=0, k=0,1,2,3,4
o j=1, k=0,1,2,3,4
o j=2, k=0,1,2,3,4
o j=3, k=0,1,2,3,4

Non-sequential: elements of b are not accessed sequentially; for a fixed j, consecutive values of k land one full row (N = 4 elements) apart in row-major storage, so the accesses stride down a column.

3. Matrix c: Accesses are c[i][j]. Memory access order:


o i=0, j=0,1,2,3
o i=1, j=0,1,2,3
o i=2, j=0,1,2,3

Sequential: elements of c are accessed sequentially for a given i.

Summary of Sequential Access

• Sequential: a, c
• Non-sequential: b
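
For reference, a minimal sketch of the loop nest assumed by the analysis above (the assignment's exact code is not reproduced here, so the i/j/k ordering is an assumption consistent with the access patterns described):

#define M 3
#define K 5
#define N 4

// Assumed loop nest: k is innermost, so a[i][k] and c[i][j] walk along rows
// (stride 1), while b[k][j] jumps a full row (N elements) between accesses.
void matmul(float a[M][K], float b[K][N], float c[M][N]) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                c[i][j] += a[i][k] * b[k][j];
}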
Part (b)

Now, matrix b is transposed to b[N][K], meaning indices are swapped, and the innermost
statement becomes:

c[i][j] += a[i][k] * b[j][k];

Access Analysis

1. Matrix a: Access pattern remains the same (a[i][k] for a fixed i), which is sequential.
2. Matrix b: Accesses are now b[j][k] for a fixed j. Memory access order:
o j=0,k=0,1,2,3,4
o j=1,k=0,1,2,3,4
o j=2,k=0,1,2,3,4
o j=3,k=0,1,2,3,4

Sequential: elements of b are now accessed sequentially for each j.

3. Matrix c: Access pattern (c[i][j]) remains the same, which is sequential.

Summary of Sequential Access

• Sequential: a, b, c
• Non-sequential: None

Part (c)

To use the hardware_dot function, we rewrite the loop with b transposed, so that for a fixed j the elements b[j][k] are contiguous and can be streamed into the 8-wide dot-product unit:

float a[M][K], b[N][K], c[M][N]; // M=3, N=4, K=5 (K assumed padded to a multiple of 8)

for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
        float accumulator = 0.0f;
        for (int k = 0; k < K; k += 8) { // process 8 elements per call
            // accumulator += dot(a[i][k..k+7], b[j][k..k+7])
            hardware_dot(&accumulator, &a[i][k], &b[j][k]);
        }
        c[i][j] += accumulator;
    }
}

Explanation

• The loop over k is modified to process 8 elements at a time.


• The hardware_dot function computes the dot product of 8 elements and adds it into the accumulator (a scalar stand-in illustrating the assumed semantics follows below).
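
The exact behavior of hardware_dot is defined by the assignment; as a sketch for checking the loop structure above, a scalar stand-in under the assumption that hardware_dot(acc, x, y) adds the 8-element dot product of x and y into *acc would look like this:

void hardware_dot(float *acc, const float *x, const float *y) {
    // Assumed semantics: *acc += x[0]*y[0] + ... + x[7]*y[7]
    for (int e = 0; e < 8; ++e)
        *acc += x[e] * y[e];
}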
Part (d)

Using the hardware_saxpy function with the original b (not transposed):

float a[M][K], b[K][N], c[M][N]; // M=3, N=4, K=5 (N assumed padded to a multiple of 8)

for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
        c[i][j] = 0.0f; // clear the output row before accumulating
    }
    for (int k = 0; k < K; ++k) {
        for (int j = 0; j < N; j += 8) { // 8 consecutive output columns per call
            // c[i][j..j+7] += a[i][k] * b[k][j..j+7]
            hardware_saxpy(&c[i][j], a[i][k], &b[k][j]);
        }
    }
}

Explanation

• Because b keeps its original row-major layout, the slice b[k][j..j+7] is contiguous in memory, so each hardware_saxpy call scales it by a[i][k] and accumulates directly into eight consecutive outputs c[i][j..j+7].
• After the k loop finishes, each c[i][j] holds the complete dot product of row i of a with column j of b, so no temporary array or final reduction is needed (a scalar stand-in for the assumed hardware_saxpy semantics follows below).
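
As with hardware_dot, the following is only a scalar stand-in, under the assumption that hardware_saxpy(y, a, x) performs y[e] += a * x[e] for e = 0..7:

void hardware_saxpy(float *y, float a, const float *x) {
    // Assumed semantics: 8-element SAXPY, y[0..7] += a * x[0..7]
    for (int e = 0; e < 8; ++e)
        y[e] += a * x[e];
}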

• LLM Answer



2. Part (a):

For each layer, the input activations (which are also the output activations of the previous layer) for batch size N are:

• Size per layer (bytes) = N × 2,048 bytes (each activation is 8 bits = 1 byte, and each layer has 2,048 neurons).

Calculations for each batch size:


Batch Size N     Size per Layer (bytes)            Total Input & Output Data (bytes)
128              128 × 2,048 = 262,144             262,144 × 2 = 524,288
256              256 × 2,048 = 524,288             524,288 × 2 = 1,048,576
512              512 × 2,048 = 1,048,576           1,048,576 × 2 = 2,097,152
1,024            1,024 × 2,048 = 2,097,152         2,097,152 × 2 = 4,194,304
2,048            2,048 × 2,048 = 4,194,304         4,194,304 × 2 = 8,388,608

Transfer Time over PCIe Gen3 x16 (100 Gbit/s):

• Transfer time (seconds) = Total data size in bits / PCIe Bandwidth


• Calculations:

Batch Size N     Transfer Time
128              4,194,304 bits / (100×10^9 bits/s) ≈ 41.9 μs
256              8,388,608 / (100×10^9) ≈ 83.9 μs
512              16,777,216 / (100×10^9) ≈ 167.8 μs
1,024            33,554,432 / (100×10^9) ≈ 335.5 μs
2,048            67,108,864 / (100×10^9) ≈ 671.1 μs
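
A quick C sketch that reproduces the numbers in both tables above (the 2,048-byte layer width and the 100 Gbit/s PCIe figure are taken from the text):

#include <stdio.h>

int main(void) {
    const double pcie_bps = 100e9;                      // PCIe Gen3 x16, ~100 Gbit/s
    const int batches[] = {128, 256, 512, 1024, 2048};
    for (int i = 0; i < 5; ++i) {
        double per_layer = batches[i] * 2048.0;          // bytes (1 byte per activation)
        double total     = 2.0 * per_layer;              // input + output activations
        double t_us      = total * 8.0 / pcie_bps * 1e6; // bits / (bits/s), in microseconds
        printf("N=%4d  layer=%9.0f B  total=%9.0f B  transfer=%6.1f us\n",
               batches[i], per_layer, total, t_us);
    }
    return 0;
}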

Part (b):

• Time to read all weights:

Time = Total weight size / Memory bandwidth
     = 20 MB / 30 GiB/s
     = 20×10^6 bytes / (30 × 2^30 bytes/s) ≈ 620.8 μs

• Time to read a 256×256 weight tile:

Tile size = 256 × 256 × 1 byte = 65,536 bytes

Time = 65,536 bytes / (30 × 2^30 bytes/s) ≈ 2.034 μs
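
The same arithmetic as a small C sketch (the 20 MB of weights and the 30 GiB/s figure are taken from the text):

#include <stdio.h>

int main(void) {
    const double bw = 30.0 * 1073741824.0;   // 30 GiB/s weight-memory bandwidth, in bytes/s
    double all_weights = 20e6;               // 20 MB of 8-bit weights
    double tile        = 256.0 * 256.0;      // one 256x256 weight tile, in bytes
    printf("all weights: %.1f us\n", all_weights / bw * 1e6);  // ~621 us
    printf("one tile:    %.3f us\n", tile / bw * 1e6);         // ~2.034 us
    return 0;
}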


Part (c):

Each of the 256×256 matrix-unit elements performs 2 operations (a multiply and an accumulate) per cycle:

• Operations per cycle = 256 × 256 × 2 = 131,072
• Total operations per second = 131,072 × clock frequency
• Setting the total equal to the stated peak of 92 TOPS:

92×10^12 = 131,072 × clock frequency

Solving gives clock frequency = 92×10^12 / 131,072 ≈ 702 MHz, which matches the TPU's roughly 700 MHz clock and so confirms the stated peak performance.

Part (d):

• Time to load weight tile (cycles):

Cycles = time to load a tile × clock frequency = 2.034×10^−6 s × 702×10^6 Hz ≈ 1,428 cycles

Break-even batch size: B ≈ 1,428 (the batch must be at least this large for compute on one tile to hide the tile's load time).
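
A small numeric check for parts (c) and (d), using the same figures as above (92 TOPS peak and 30 GiB/s weight bandwidth):

#include <stdio.h>

int main(void) {
    double ops_per_cycle = 256.0 * 256.0 * 2.0;           // 131,072 ops per cycle
    double clock_hz      = 92e12 / ops_per_cycle;          // ~701.9 MHz
    double tile_load_s   = 65536.0 / (30.0 * 1073741824.0);
    double cycles        = tile_load_s * clock_hz;         // ~1,428 cycles
    printf("implied clock: %.1f MHz\n", clock_hz / 1e6);
    printf("tile-load cycles (break-even batch): %.0f\n", cycles);
    return 0;
}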

Part (e):

• Total operations for batch size 128:


• Total operations = 20×10^6 weights × 2 ops/weight × 128 (batch) = 5.12×10^9
• Compute times:

Intel Haswell (1 TFLOPS): 5.12×10^9/1×10^12=5.12 ms

NVIDIA K80 (3 TFLOPS): 5.12×10^9/3×10^12≈1.71 ms

• TPU weight load time: ≈ 620.8 μs (from part (b)), well below either CPU or GPU compute time.

Part (f):

Total elapsed time:

Total time = PCIe transfer time + weight load time + compute time
           = 41.9 μs + 620.8 μs + 55.6 μs = 718.3 μs

(The TPU compute time is 5.12×10^9 operations / 92×10^12 ops/s ≈ 55.6 μs.)
• Fraction of PCIe bandwidth used:

Fraction = Total data size / (PCIe bandwidth × Total time)
         = 4,194,304 bits / (100×10^9 bits/s × 718.3×10^−6 s) ≈ 5.84%
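
A quick check of the part (f) arithmetic (the 41.9 / 620.8 / 55.6 μs components are the ones derived above):

#include <stdio.h>

int main(void) {
    double t_total_us = 41.9 + 620.8 + 55.6;          // PCIe + weight load + compute
    double bits       = 524288.0 * 8.0;               // batch-128 input + output data
    double fraction   = bits / (100e9 * t_total_us * 1e-6);
    printf("total: %.1f us, PCIe utilization: %.2f%%\n", t_total_us, fraction * 100.0);
    return 0;
}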

Part (g):

Compute time per layer: 1.024×10^9 operations / (92×10^12 ops/s) ≈ 11.13 μs (4×10^6 weights per layer × 2 ops × batch of 128)

Communication time per layer boundary: 2,097,152 bits / (100×10^9 bits/s) ≈ 20.97 μs (128 × 2,048 bytes of activations)

Total latency: 5 × 11.13 μs + 4 × 20.97 μs ≈ 139.5 μs

Throughput: 128 / (139.5×10^−6 s) ≈ 917,000 inferences/s

Compared to a single TPU's throughput of ≈ 178,200 inferences/s (128 / 718.3 μs from part (f)), this pipelined configuration offers both higher throughput and lower latency.
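
A sketch of the part (g) arithmetic (five layers, one per TPU, with the per-layer operation count and activation size used above):

#include <stdio.h>

int main(void) {
    double compute_us = 1.024e9 / 92e12 * 1e6;                 // ~11.13 us per layer
    double link_us    = (128.0 * 2048.0 * 8.0) / 100e9 * 1e6;  // ~20.97 us per hop
    double latency_us = 5.0 * compute_us + 4.0 * link_us;      // ~139.5 us end to end
    printf("latency: %.1f us, throughput: %.0f inferences/s\n",
           latency_us, 128.0 / (latency_us * 1e-6));
    return 0;
}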

Answer to Part (h):

• Total CPU time per batch: 128 × 50 μs = 6,400 core-μs

• Number of cores needed:

Cores = 6,400 core-μs / 718.3 μs ≈ 8.91 cores

Therefore, at least 9 CPU cores are required to drive the TPU at batch size 128.
3. (a) Rewriting the Program with Vector and Tile Lengths

Given the natural vector length of 256 and weight tiles of 256 × 256 (65,536 bytes), we can
rewrite the program addresses and sizes to align with these constraints.

• Input Activations:

o Each 256-element vector of 8-bit activations is 256 bytes.

o Total vectors: 512 KB/256 bytes=2048 vectors.

• Accumulators:

o Each accumulator group holds 256 32-bit values, totaling 1 KB.


o Total accumulators used: 524,288 bytes/1 KB=512 accumulator groups.

• Output Activations:

o Each output vector is 256 bytes (after activation).

o Total vectors: 2048 vectors.

Rewritten Program:

read_host u#0, 2048 vectors // Read 2048 vectors into unified buffer starting at u#0

read_weights w#0, 1 tile // Read one weight tile (256×256 weights)

matmul u#0, a#0, 2048 vectors // Multiply input vectors by the weight tile, results into accumulators starting at a#0

activate u#2048, a#0, 2048 vectors // Apply activation function, store outputs at u#2048

write_host u#2048, 2048 vectors // Write 2048 output vectors back to the host

Answers:
• Number of Input Vectors Read: 2048 vectors

• Bytes of Accumulator Values Used: 2,048 vectors × 256 elements/vector × 4 bytes/element = 2 MB

• Number of Output Vectors Written: 2048 vectors

(b) Handling a 1024 × 256 Weight Matrix

• To handle a weight matrix of size 1024 × 256, we split it into four 256 × 256 weight tiles
because the matrix unit size remains 256 × 256.
Modified Program:

read_host u#0, 2048 vectors

// Loop over the 4 weight tiles

for tile_index from 0 to 3:

read_weights w#(tile_index * tile_size), 1 tile

if tile_index == 0:

matmul u#0, a#0, 2048 vectors // Overwrite accumulators on first pass

else:

matmul_add u#0, a#0, 2048 vectors // Accumulate results on subsequent passes

activate u#2048, a#0, 2048 vectors

write_host u#2048, 2048 vectors

Answers:

• Changes to the Program:

o Use the matmul_add instruction to accumulate results over four passes.

o No additional accumulators are needed.

• Number of Weight Tiles Needed: 4 tiles of 256 × 256

(c) Handling a 256 × 512 Weight Matrix

• We split the 256 × 512 weight matrix into two 256 × 256 tiles.

Modified Program:

read_host u#0, 2048 vectors

// First half of the weights

read_weights w#0, 1 tile

matmul u#0, a#0, 2048 vectors

activate u#2048, a#0, 2048 vectors

// Second half of the weights


read_weights w#(1 * tile_size), 1 tile

matmul u#0, a#0, 2048 vectors

activate u#4096, a#0, 2048 vectors

write_host u#2048, 4096 vectors // Write both halves of the output

Answers:

• Need for More Accumulators: No, we reuse the same accumulators by overwriting them.

• Number of Weight Tiles Needed: 2 tiles of 256 × 256

• Order in Weight DRAM: Store the two tiles sequentially, corresponding to the two halves
of the weight matrix.

(d) Handling a 1024 × 768 Weight Matrix

We divide the 1024 × 768 weight matrix into 12 tiles of 256 × 256.

Modified Program:

for N_chunk from 0 to 2:

    zero_accumulators a#0   // optional: the overwriting matmul below also resets them

    for K_chunk from 0 to 3:

        read_host u#0, 2048 vectors       // input vectors for this K_chunk, re-read from the host
        read_weights w#(N_chunk*4 + K_chunk), 1 tile

        if K_chunk == 0:
            matmul u#0, a#0, 2048 vectors       // overwrite the accumulators on the first pass
        else:
            matmul_add u#0, a#0, 2048 vectors   // accumulate results on later passes

    activate u#(2048 + 2048 * N_chunk), a#0, 2048 vectors   // place outputs after the input region

write_host u#2048, 6144 vectors   // write all output vectors back to the host

Answers:

• Number of Weight Tiles Needed: 12 tiles of 256 × 256


• Order in Weight DRAM: store the 12 tiles in the order they are consumed by the loop nest, i.e. N_chunk-major, K_chunk-minor (tile index N_chunk × 4 + K_chunk).

• Times Each Input Activation Is Read: 3 times, once for each N_chunk.
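
A plain-C sketch of the same tiling order, to make the data-reuse pattern explicit (the 3 output chunks and 4 input chunks come from the 1024 × 768 matrix and 256 × 256 tile size; this is an illustration, not TPU code):

#include <stdio.h>

enum { K_CHUNKS = 4, N_CHUNKS = 3 };   // 1024/256 input chunks, 768/256 output chunks

int main(void) {
    int input_reads[K_CHUNKS] = {0};
    for (int n = 0; n < N_CHUNKS; ++n) {          // one accumulator pass per output chunk
        for (int k = 0; k < K_CHUNKS; ++k) {      // tiles streamed in order n*K_CHUNKS + k
            input_reads[k]++;                     // input chunk k is touched again this pass
        }
    }
    for (int k = 0; k < K_CHUNKS; ++k)
        printf("input chunk %d read %d times\n", k, input_reads[k]);   // 3 each
    return 0;
}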

(e) Discussion on Reading Input Activations Once

To read each 256-element input activation vector just once, we would need to process all
computations involving that input in one pass. This requires:

• Number of Accumulators Needed:


o Total output activations: 2,048 (batch) × 768 (outputs) = 1,572,864

o Number of accumulator groups (256 elements each): 1,572,864 / 256 = 6,144

• Size of Accumulator Memory:

o 6,144 groups × 256 elements × 4 bytes = 6 MB

Contrast with the TPU Design:

• The TPU uses 4096 accumulators, allowing double-buffering:

o One set of 2048 accumulators is being filled by the matrix unit.

o Another set is being processed by the activation unit.

• This design balances the need for accumulator memory with efficient data reuse,
accepting multiple reads of input activations to keep accumulator size manageable.

Summary:

• Larger Accumulator Memory Trade-Off: Reading input activations only once requires
significantly more accumulator memory.
• TPU Approach: By reusing accumulators and accepting multiple reads of input data, the
TPU optimizes for a balance between hardware resources and computational efficiency.
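
A quick check of the accumulator sizing in (e) (the 2,048-vector batch and the 768-wide output layer are taken from the text):

#include <stdio.h>

int main(void) {
    double batch   = 2048.0;                  // input vectors held in the unified buffer
    double outputs = 768.0;                   // output width of the 1024x768 layer
    double groups  = batch * outputs / 256.0; // 256-element accumulator groups -> 6,144
    double bytes   = batch * outputs * 4.0;   // 32-bit accumulators -> 6 MB
    printf("accumulator groups: %.0f, memory: %.0f MB\n", groups, bytes / (1 << 20));
    return 0;
}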
LLM Answer

4. (a): 1×1 Convolutional Kernel

1. Matrix Dimensions:
o Input depth = 3

o Output depth = 48
A 1×1 convolution is mathematically equivalent to a matrix multiplication of dimensions 3 × 48.

2. ALUs Available:
o The TPU has 65,536 ALUs.

3. Matrix Multiplication:
Matrix multiplication of a 3 × 48 matrix involves computing 3 × 48 = 144 multiplications.

4. Fraction of ALUs Used:

Fraction used = multiplications performed / total ALUs = 144 / 65,536 ≈ 0.0022, i.e. about 0.22%.

(b): Smallest Square Image Size for Batch Size of 1

1. Batch Size and TPU Efficiency:
The TPU reaches balanced compute and memory bandwidth at a batch size of 1,400 for an image of size S × S. To process a batch size of 1 at the same balance point, the work per batch must match the balanced case.

2. Balanced Case Computations:

Total computations at batch size 1,400 = 1,400 × S^2

3. Single Batch Case Computations:

For batch size 1:

Total computations = 1 × s^2

4. Efficiency Requirement:
Scaling the balanced-case work down by the batch-size ratio gives the requirement:

s^2 = S^2 / 1,400

5. Smallest Image Size:
With S = 220 (the original image size):

s^2 = 220^2 / 1,400 = 48,400 / 1,400 ≈ 34.57

Taking the square root:

s ≈ √34.57 ≈ 5.88

Rounding up to the nearest whole number, the smallest square image size is s = 6, i.e. a 6 × 6 image.
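
A one-line numeric check of the derivation above:

#include <stdio.h>
#include <math.h>

int main(void) {
    double S = 220.0, balanced_batch = 1400.0;
    double s = sqrt(S * S / balanced_batch);    // sqrt(48,400 / 1,400) ~ 5.88
    printf("smallest square side: %.2f -> round up to %d\n", s, (int)ceil(s));
    return 0;
}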
(c): Striding and Restacked Convolutional Kernel

1. Restacked Kernel and Input Dimensions:

o Input data: 220×220×3 → 55×55×48 (stride of 4).

o Kernel: 7×7×3×48 → 2×2×48×48.

For the 2×2 kernel, we perform 4 matrix multiplications:

o Each matrix multiplication involves a 48×48 weight matrix.


2. Matrix Multiplications Performed:
A 48 × 48 matrix multiplication requires 48 × 48 = 2,304 multiplications.

Since there are 4 matrix multiplications:

4×2304=9216 multiplications

3. Fraction of ALUs Used:

Fraction used = 9,216 / 65,536 ≈ 0.14, i.e. about 14%.

Final Results:
• (a) Fraction of ALUs used for 1×1 convolutional kernel: 0.22%.

• (b) Smallest square image size for batch size of 1: 6×6.

• (c) Fraction of ALUs used with restacked convolutional kernel: 14%.

5.

For number of blocks =2048:


For number of blocks =1024:
