278 hw5
1.
Part (a)
Access Analysis
1. Matrix a: Accesses are a[i][k] for a fixed i. Sequential: k is the innermost loop index, so consecutive accesses walk along a row of a in row-major storage.
2. Matrix b: Accesses are b[k][j] for a fixed j. Memory access order:
o j=0, k=0,1,2,3,4
o j=1, k=0,1,2,3,4
o j=2, k=0,1,2,3,4
o j=3, k=0,1,2,3,4
Non-sequential: Elements of b are not accessed sequentially because k varies faster than
j in row-major storage.
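Concretely, b[k][j] lives at element offset k × N + j from the start of b, so stepping k by one moves N elements at a time while j stays fixed.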
• Sequential: a, c
• Non-sequential: b
Part (b)
Now, matrix b is transposed to b[N][K], meaning indices are swapped, and the innermost
statement becomes:
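In C-like form (assuming the same i/j/k loop nest as in part (a)), the updated inner statement is:

c[i][j] += a[i][k] * b[j][k];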
Access Analysis
1. Matrix a: Access pattern remains the same (a[i][k] for a fixed i), which is sequential.
2. Matrix b: Accesses are now b[j][k] for a fixed j. Memory access order:
o j=0, k=0,1,2,3,4
o j=1, k=0,1,2,3,4
o j=2, k=0,1,2,3,4
o j=3, k=0,1,2,3,4
Sequential: With b transposed, k indexes adjacent elements within row j of b, so consecutive accesses are contiguous in row-major storage.
• Sequential: a, b, c
• Non-sequential: None
Part (c)
To use the hardware_dot function, we need to rewrite the loop with transposed b:
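A minimal sketch of the rewritten loop nest, assuming hardware_dot(x, y, len) returns the dot product of two contiguous length-len arrays (the exact signature is an assumption) and writing M for the number of rows of a and c; with b stored transposed as b[N][K], row a[i] and row b[j] are both contiguous:

for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
        c[i][j] = hardware_dot(a[i], b[j], K);  // row i of a dotted with row j of transposed b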
Explanation
• The hardware_saxpy function multiplies the scalar a[i][k] by 8 elements of a row of b at a time and accumulates the products into a temporary 8-element array (see the sketch below).
• After the k loop completes, the temporary array holds the final results c[i][j], ..., c[i][j+7].
LLM Answer
2. Part (a):
For each layer, the input activations (which are the output activations of the previous layer) for batch size N are:
• Size per layer (bytes) = N × 2,048 bytes (each activation is 8 bits, i.e. 1 byte, and there are 2,048 neurons per layer).
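For example, at the batch size of 128 used later in this problem, this is 128 × 2,048 = 262,144 bytes = 256 KiB of activations per layer.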
Part (b):
Each of the 256 × 256 = 65,536 matrix unit elements performs 2 operations (a multiply and an accumulate) per cycle, so:
92 × 10^12 ops/s = 131,072 ops/cycle × Clock frequency
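Solving for the clock rate: Clock frequency = 92 × 10^12 / 131,072 ≈ 7.02 × 10^8 Hz ≈ 700 MHz, which matches the TPU's nominal 700 MHz clock.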
Part (d):
Part (e):
Part (f):
Part (g):
Compared to a single TPU throughput of ≈178,185 inferences/s, this configuration offers higher
throughput and lower latency.
Therefore, at least 9 CPU cores are required to drive the TPU at batch size 128.
3. (a) Rewriting the Program with Vector and Tile Lengths
Given the natural vector length of 256 and weight tiles of 256 × 256 (65,536 bytes), we can
rewrite the program addresses and sizes to align with these constraints.
• Input Activations: 2048 input vectors in the unified buffer, starting at u#0.
• Accumulators: 2048 accumulator vectors, starting at a#0.
• Output Activations: 2048 output vectors in the unified buffer, starting at u#2048.
Rewritten Program:
read_host u#0, 2048 vectors // Read 2048 vectors into unified buffer starting at u#0
matmul u#0, a#0, 2048 vectors // Multiply input vectors with weights, store in accumulators starting at a#0
activate u#2048, a#0, 2048 vectors // Apply activation function, store outputs at u#2048
write_host u#2048, 2048 vectors // Write 2048 output vectors back to the host
Answers:
• Number of Input Vectors Read: 2048 vectors
• To handle a weight matrix of size 1024 × 256, we split it into four 256 × 256 weight tiles
because the matrix unit size remains 256 × 256.
Modified Program:
if tile_index == 0:
    // first tile: the matmul overwrites the accumulators
else:
    // remaining tiles: the matmul accumulates into the existing partial sums
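A hedged sketch of how the four tiles could be sequenced, assuming the matmul instruction can either overwrite or accumulate into the accumulators (written here as an accumulate option), that the weight tiles stream in from weight DRAM in stored order, and that the host lays out the four 256-element chunks of the 2048 inputs contiguously (chunk t at u#(2048 * t)); the addresses are illustrative:

read_host u#0, 8192 vectors                 // 2048 inputs × 4 chunks of 256 elements each
for tile_index = 0 to 3:
    if tile_index == 0:
        matmul u#0, a#0, 2048 vectors                               // overwrite accumulators
    else:
        matmul u#(2048 * tile_index), a#0, 2048 vectors, accumulate // add this tile's partial products
activate u#8192, a#0, 2048 vectors
write_host u#8192, 2048 vectors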
Answers:
• We split the 256 × 512 weight matrix into two 256 × 256 tiles.
Modified Program:
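A hedged sketch of one way this program could look, assuming the two weight tiles stream in from weight DRAM in the order they are stored and that each matmul simply overwrites the accumulators (addresses are illustrative):

read_host u#0, 2048 vectors
matmul u#0, a#0, 2048 vectors        // tile 0: output columns 0-255
activate u#2048, a#0, 2048 vectors   // first half of every output
matmul u#0, a#0, 2048 vectors        // tile 1: overwrite accumulators with columns 256-511
activate u#4096, a#0, 2048 vectors   // second half of every output
write_host u#2048, 4096 vectors      // both halves; the host reassembles the 512-element outputs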
Answers:
• Need for More Accumulators: No, we reuse the same accumulators by overwriting them.
• Order in Weight DRAM: Store the two tiles sequentially, corresponding to the two halves
of the weight matrix.
We divide the 1024 × 768 weight matrix into 12 tiles of 256 × 256.
Modified Program:
zero_accumulators a#0
if K_chunk == 0:
    matmul u#0, a#0, 2048 vectors // Overwrite accumulators with the first K chunk's partial products
else:
    // later K chunks: accumulate into the existing partial sums
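A hedged sketch of the full 12-tile structure, assuming an accumulate option on matmul, weight tiles streaming from weight DRAM in stored order, 8192 input vectors at u#0 (2048 inputs × 4 chunks), and outputs starting at u#8192; names such as out_chunk and the address arithmetic are illustrative:

for out_chunk = 0 to 2:                     // three 256-column groups of outputs
    zero_accumulators a#0
    for K_chunk = 0 to 3:                   // four 256-row groups of weights
        if K_chunk == 0:
            matmul u#0, a#0, 2048 vectors                            // overwrite accumulators
        else:
            matmul u#(2048 * K_chunk), a#0, 2048 vectors, accumulate // add partial sums
    activate u#(8192 + 2048 * out_chunk), a#0, 2048 vectors
write_host u#8192, 6144 vectors             // 2048 outputs × 3 chunks of 256 elements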
Answers:
To read each 256-element input activation vector just once, we would need to process all
computations involving that input in one pass. This requires:
• This design balances accumulator memory against data reuse: it accepts multiple reads of the input activations in order to keep the accumulator size manageable (see the rough numbers below).
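As a rough illustration using the 1024 × 768 layout above: reading each 256-element input chunk only once would require keeping the partial sums for all three output-column tiles live at the same time, i.e. roughly 3 × 2048 = 6144 accumulator vectors instead of 2048.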
Summary:
• Larger Accumulator Memory Trade-Off: Reading input activations only once requires
significantly more accumulator memory.
• TPU Approach: By reusing accumulators and accepting multiple reads of input data, the
TPU optimizes for a balance between hardware resources and computational efficiency.
LLM Answer
4.
a.
b.
c.
d.
1. Matrix Dimensions:
o Input depth = 3
o Output depth = 48
A 1×1 convolution is mathematically equivalent to a matrix multiplication of dimensions 3 × 48.
2. ALUs Available:
o The TPU has 65,536 ALUs.
3. Matrix Multiplication:
Matrix multiplication for a 3 × 48 matrix involves computing 3 × 48 = 144 multiplications.
Total computations = 1 × s²
4. Efficiency Requirement:
These computations must match:
1 × s² = (1/1400) × S²
s² = S²/1400
Compute:
s² = 48,400/1400 ≈ 34.57, so s ≈ 5.9
48 × 48 = 2,304 multiplications.
4 × 2,304 = 9,216 multiplications.
Final Results:
• (a) Fraction of ALUs used for the 1×1 convolutional kernel: 144 / 65,536 ≈ 0.22%.
5.