Vishwa HLD LLD - Ver0.1

AI Accelerator Design
CONTENTS
1 Introduction
1.1 Background
2 Design and Approach
2.1 Design Alternatives
2.2 The Accelerator Architecture
2.3 HLD
2.3.1 Fetch Module
2.3.2 Load Module
2.3.3 Compute Module
2.3.4 Store Module
2.3.5 Concurrency
2.3.6 Unified Virtual Addressing (UVA)
2.4 LLD
2.4.1 Micro Architecture
2.4.2 Transcendental Functions
3.1.1 DMA Controller
3.1.2 Supported Data Types
3.1.3 Feature & Weight Matrix Buffers
3.1.4 Zero Skipping / Sparsity Functional Block
3.1.5 Data Optimisation
3.1.6 Memory Hierarchy & Optimization
3.1.7 Activation Block
3.1.8 Pooling Block
3.1.9 Hierarchical Tiling
3.1.10 Clock and PLLs
3.1.11 Power Module
3.1.12 Interrupt Signal
3.1.13 Partial Sums
3.2 Software Stack
3.2.1 Distributed Computing / Parallelism Techniques
3.2.2 Tools
3.3 Process Flow
4 Contributions and Acknowledgements

TABLE OF FIGURES

Figure 5.1: Process Flow

TABLE OF TABLES

Table 2.1: Global Scenario

Table 13.1: List of contributors

1 INTRODUCTION

1.1 BACKGROUND

Large language models (LLMs) such as GPT-4, DeepSeek, and Perplexity have taken AI solutions to a new level by generating responses in natural language, program code, and more. Investing in the research and development (R&D) of AI accelerators is a strategic move for companies, governments, and institutions because of the transformative impact of artificial intelligence (AI) across industries. As AI models grow in complexity and size, general-purpose processors (such as CPUs) struggle to keep up, and focused R&D in AI accelerator design can unlock significant advantages.

The development of indigenous microprocessors and AI/ML hardware accelerators is gaining significant momentum through partnerships, driven by the growing need for data sovereignty, security, and technological self-reliance. Countries are actively investing in local processor ecosystems to reduce dependency on foreign semiconductor supply chains and to build specialized hardware that caters to emerging AI/ML workloads.

Combining indigenous microprocessor development with specialized AI accelerators offers several strategic advantages and is a step towards technological independence.

Data Sovereignty: Ensures that sensitive data remains within national borders, enhancing security.

Customization: Indigenous AI accelerators can be optimized for regional AI applications like agriculture,
healthcare, and natural language processing in native languages.

Cost Efficiency: Locally designed accelerators can reduce licensing costs and dependencies on foreign
chipmakers.

Strategic investment in talent development, research partnerships, and manufacturing capabilities will be
crucial for success in this frontier.

Memory wall & opportunity for custom hardware designs:

Over the last decade, the compute power delivered by VLSI/semiconductor chips has increased by about 10x, whereas memory access capacity has increased by only about 4x. CPUs execute floating point operations (FLOPs) much faster than memory bandwidth and capacity can keep up with, creating the "memory wall", where the memory system can no longer feed the compute units efficiently. This disproportionate growth in the compute-to-memory ratio creates an opportunity for custom designs that meet training and inference needs in the AI/ML domain.

AI applications require millions of input and weight values to be fetched from memory to the compute units. Systems like HPC clusters with large numbers of general-purpose, high-performance CPUs are overkill for large volumes of MAC operations. Graphics Processing Units (GPUs) are better suited to accelerating AI/ML workloads, but their architecture does not fully capitalize on power and computational efficiency through efficient data movement.

Lightweight PEs (Processing Elements) can bring down power and area costs while increasing computation speed through an optimized data flow. This customized hardware design leads to AI/ML accelerators that achieve accelerated operations at lower power.

Indigenous AI/ML Hardware Accelerator:

The overall objective is to design and develop an indigenous AI accelerator chip for the future needs of exa-scale computing, supporting both training and inference under a HW-SW co-development methodology. Since this effort is fairly involved and complex, it is being pursued in collaboration with industry and a semiconductor fabrication unit.

Feature set:

The feature set targeted for the AI/ML accelerator is as follows:

 AI accelerator for both training and inference workloads
 On-chip scratch memory & matrix buffers
 Systolic array-based parallel matrix multiplication
 Acceleration through activation functions and pooling
 Supported computation formats: INT8, INT16, INT32, FP16, FP32
 Data fetch & computation parallelization through DMA
 Acceleration through multiple clusters
 Handling sparsity through zero-skipping algorithms

We are confident that implementing and commissioning the product will make a positive impact in the following ways.

Economic and Strategic Importance: AI is a key driver of economic growth and innovation across sectors like
healthcare, finance, manufacturing, and transportation. Countries and companies investing in AI accelerator
R&D can secure a leadership position in the global AI race, reducing dependence on foreign technology.

National Security and Sovereignty: AI accelerators are critical for defense, cybersecurity, and intelligence
applications. Countries investing in domestic AI hardware reduce reliance on foreign suppliers, enhancing
national security and technological sovereignty.

Environmental Impact: AI accelerators are designed to be more energy-efficient, reducing the carbon footprint
of AI training and inference. This aligns with global sustainability goals and regulatory requirements.

2 DESIGN AND APPROACH

2.1 DESIGN ALTERNATIVES

Design crossroad #1:

The way the AI accelerator is connected to the CPU or host gives rise to different design approaches, depending on the application and deployment requirements.

1. AI-Accelerator as a Card:
In addition to standard motherboard components, computers often need to be equipped with
other parts and components to achieve the desired functionality based on application requirements. In
this context one of the component connection methods will be through PCIe.
PCIe, or Peripheral Component Interconnect Express, is an interface standard for connecting high-speed input/output (HSIO) components. Every high-performance computer motherboard has a number of PCIe slots that can be used to add GPUs, RAID cards, Wi-Fi cards, accelerator cards or SSD (solid-state drive) add-on cards. The primary benefits of PCIe are higher bandwidth, faster speed, lower latency, and greater utility.

Accelerator cards provide servers with added processing power optimized to handle
application specific workloads. Using a standard PCI-Express (PCIe) connector to a server
motherboard or backplane, accelerator cards utilize GPUs, FPGAs, or specialized ASICs,
which require an array of low jitter reference clocks for PCIe.

2. AI-Accelerator connected over interconnect:


In this approach, the Neural Processing Unit (NPU) is connected to the host processor over the system bus. The on-chip NPU is attached to the main processor at the cross points of the system bus.

Fig: Typical system configuration block diagram (multiple CPUs and AI accelerators (AiA) connected over the system bus)

Based on the instructions, the NPU unit/co-processor is accessed by the host processor. All AI-related operations are handled by the NPU core, while non-AI instructions are handled by the host core.

Design crossroad #2:

Performance:

A generic design provides a way to develop IP that can be reused to realise hardware for different target areas or applications. Performance of the AI accelerator is determined by the number of scalar/floating-point operations per second. For a 5 POPS (peta operations per second) configuration, we can choose chiplets with 4 units of 64x64 systolic array blocks that handle 128-bit SIMD operations for the chosen data type. Designed to run at 500 MHz, such a configuration can perform 32K operations in a clock cycle. Two chiplets running at 1.2 GHz with the above SA units can give (4 units * 64 * 64 * 4^3 INT8 ops * 1.2 GHz * 2 MAC ops * 2 chiplets) about 5 POPS of performance.
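As a sanity check on such headline numbers, peak throughput can be estimated from the array geometry, SIMD lane count, MACs per cycle and clock. The short C sketch below shows one way to do this; the parameter values (4 SA units, 64x64 PEs, 64 INT8 lanes, 1.2 GHz, 2 chiplets) are taken from the formula above and are design assumptions, not measured figures.

#include <stdio.h>

/* Estimate peak throughput of a systolic-array accelerator:
 * peak_ops = units * rows * cols * simd_lanes * ops_per_mac * clock_hz * chiplets */
static double peak_ops(int units, int rows, int cols, int simd_lanes,
                       int ops_per_mac, double clock_hz, int chiplets)
{
    return (double)units * rows * cols * simd_lanes * ops_per_mac
           * clock_hz * chiplets;
}

int main(void)
{
    /* Values assumed from the 5 POPS configuration discussed above. */
    double ops = peak_ops(4, 64, 64, 64 /* INT8 lanes per PE, per the formula */,
                          2 /* multiply + add */, 1.2e9, 2);
    printf("Peak throughput: %.2f POPS\n", ops / 1e15);   /* prints ~5.03 POPS */
    return 0;
}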

Performance of the AI-Accelerator is determined by the number of operations per second.
Depending on the targeted application we have to choose the amount of parallelism needed and the
components required for the accelerator.

1. 2 TOPS Accelerator:

To achieve the operation count required by the application (2 TOPS), we need only one Matrix Multiplication Unit (MXMU), which contains the input and weight feature matrix buffers, the systolic array, the accumulator, the activation function unit and the pooling unit. Along with the MXMU there are other units such as the DMA unit, scratch memory, data optimizer and sparsity units; all these units are collectively called an NPU cluster. The combination of one NPU cluster, the NPU scheduler and the command queue is termed an NPU core, which acts as the AI accelerator.

2. 50 TOPS Accelerator:
To reach an operation count of 50 TOPS, we use the same structural units as in the 2 TOPS design but include more MXMU units, i.e., 4 units. In the MXMU we use a 3D systolic array, where each processing element acts as a SIMD register. With the help of the NPU scheduler and command queue, instructions are distributed to all MXMU units in parallel to obtain maximum throughput from the NPU.

3. 1000 TOPS Accelerator system:

In addition to the above-mentioned architecture, we increase the number of clusters and NPU cores to achieve the desired operation count. Increasing the cluster and core count to 4 gives around 1000 TOPS. All four cores are connected in a network to achieve maximum performance. The following sections provide more information about the interconnects and memory units for the network configuration.

2.2 THE ACCELERATOR ARCHITECTURE

The core of the accelerator is the systolic array (SA). It is designed to perform matrix operations of maximum size 64x64 in one go. Once the input matrix and weight matrix are loaded into the buffers, it performs the chosen operation, such as multiplication, and the resultant matrix is moved into the accumulator block. Activation and pooling operations can follow once the basic operation is performed. This can happen in parallel while the DMA fetches data for subsequent operations. A generic design that can scale up based on the needs of the targeted application is being micro-designed, with a prototype on FPGA.

Fig: FPGA prototype block diagram (host connected over PCIe/XDMA and AXI-Lite; an AXI interconnect links the fetch, load, compute and store modules; input, weight and output BRAMs with a BRAM acting as HBM; synchronization queues and barrier responses connect the modules)
The core of the accelerator is the Compute IP with 4 systolic array (SA) units of size 64x64; it is designed to perform 16K MAC operations in one go. Once the input matrix and weight matrix are loaded into the buffers, it performs the chosen operation, such as multiplication, and the resultant matrix is moved into the accumulator block.

Terminology:

• SA.2D – systolic array, 64 x 64 x 16 bit / cycle = 4K MAC (FP16) = 8K MAC (INT8)

• SA.3D – systolic array, 64 x 64 x 128-bit SIMD / cycle = 32K MAC (FP16) = 64K MAC (INT8)

• RCU – Reconfiguration Control Unit; dynamic reconfiguration refers to changing the number of rows, columns and SIMD lanes in a systolic array

• MXU (Matrix-multiply unit) = IFM buffer + WM buffer + SA.xD + Accumulator + Activation + Pooling

• Cluster = M# MXUs + DMA engine + SRAMs + Vector Unit + Pre-Features

• NPU Core = Scheduler + Cmd Queue + N# Clusters

• AI Accel = Scalar core + NPU Core

 Command queue – PCIe
 Interrupt
 DDR access for DMA transfer

2.3 HLD

2.3.1 FETCH MODULE


This section provides details on the design and functionality of the fetch module, implemented in SystemVerilog. The fetch module fetches instructions from the HBM, decodes them, and drives them through a FIFO buffer into the command queues of the load, compute and store modules.

o LOAD instructions (input and weight) are driven to the load command queue to be used by the load module.
o STORE instructions are pushed to the store command queue to be used by the store module.
o COMPUTE instructions are pushed to the compute command queue to be used by the compute module.

CONTROL REGISTERS

The control logic manages the start, idle, done, and autostart signals to coordinate the
operation of the fetch module based on AXI write transactions and processing completion.
Control registers facilitate the initiation and termination of the fetch operation.

Start Signal (start):

 Initiates the fetch module execution.


 Controlled via an 8-bit addressable register in hexadecimal format.
 The start bit is asserted (1) to trigger the fetch module.

Idle Signal (idle):

 Ensures the hardware is in an idle state where it is ready to accept the new set of
instructions.
 The module remains inactive when idle = 1.
 Transition to an active state occurs when start is asserted.

Done Signal (done):

 Indicates completion of the hardware operation.

 The done bit is asserted (1) upon successful execution.
 Typically used to signal AXI write acknowledgment or trigger subsequent processing
stages.

Autostart Signal (autostart):

 Enables automatic start of hardware.


 When autostart = 1 and idle = 1, the hardware restarts without external intervention.
 Ensures continuous operation in scenarios requiring repetitive data fetching.

This control mechanism ensures synchronized operation between AXI transactions and fetch
processing while maintaining appropriate state transitions for efficient execution.
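A minimal host-side view of this start/idle/done/autostart handshake can be sketched in C. The register offset and bit positions below are illustrative assumptions (the actual register map is not specified in this document); the logic simply mirrors the control flow described above.

#include <stdint.h>

/* Illustrative control-register bit layout for the fetch module.
 * Bit positions are assumptions, not the real register map. */
#define CTRL_START_BIT      (1u << 0)  /* write 1 to start the fetch module */
#define CTRL_DONE_BIT       (1u << 1)  /* set by hardware on completion     */
#define CTRL_IDLE_BIT       (1u << 2)  /* set while the module is idle      */
#define CTRL_AUTOSTART_BIT  (1u << 7)  /* restart automatically when idle   */

/* Pointer to the AXI-Lite mapped control register; platform code must set it. */
static volatile uint32_t *fetch_ctrl;

/* Kick off one fetch pass and wait for completion (polling variant). */
static void fetch_run_once(void)
{
    while (!(*fetch_ctrl & CTRL_IDLE_BIT))   /* wait until the module is idle  */
        ;
    *fetch_ctrl |= CTRL_START_BIT;           /* assert start                   */
    while (!(*fetch_ctrl & CTRL_DONE_BIT))   /* poll done; a real driver would */
        ;                                    /* typically be interrupt-driven  */
}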

DECODE

The decode unit interprets the instructions fetched from the instruction FIFO and
appropriately drives the instruction to the respective instruction queues based on the opcode.

2.3.2 LOAD MODULE


Transfers data from memory to compute units, from HBM to internal memory (input BRAM
and weight BRAMs). Supports AXI or custom memory interfaces.

A LOAD instruction can be either an input instruction or a weight instruction; these are driven to the load command queue to be used by the load module.

Each instruction is 128 bits wide and comes from HBM, where the instruction stream written by the compiler is stored. Based on the instruction, the module determines whether it refers to input data or weight data. The HBM is connected to the Load module through the AXI interconnect.

Data Transfer from HBM to Input & Weight BRAMs via AXI for LOAD Instruction

1. Instruction Fetch & Decode:

o The fetch module retrieves no. of instructions and start address (offset address) from HBM
using an AXI-based interface.

o These instructions are decoded and categorized into Load, Compute, and Store command
queues.

o The LOAD instruction identifies whether the fetched data is input activations or weights.

2. LOAD Command Execution:

o When a LOAD instruction is received, it is queued in the Load Command Queue.

o The Load Module is triggered to fetch the required data from HBM.

3. HBM to BRAM Data Movement via AXI:

o The HBM controller handles the memory access requests.

o Data transfer happens via AXI4 interfaces.

o The input activations and weights are streamed into separate BRAMs (Block RAMs) based
on instruction metadata.

4. Data flow management:

o The load module manages data flow between HBM and BRAMs, ensuring efficient loading of
the data.

o The Load Module writes the fetched data into:

 Input BRAM

 Weight BRAM

5. Ready for Compute Module:

o The Load module generates a forward synchronization signal once the data is stored in the BRAMs, after which the Compute Module can access it for further processing.

6. Load Completion & Execution Readiness

 Once the load instruction is completed, the compute unit can access the data from BRAM without needing to go back to HBM, based on the forward synchronization signal, which indicates whether the data is an input or a weight.

The input BRAM has 4 blocks; each block holds a 64x64 matrix with 1 byte per element, matching the 64x64 systolic array.

The synchronization queues are forward-only; they tell the Compute module what kind of data the instruction carries, i.e. whether it holds input BRAM data or weight BRAM data.
Input type → input matrix / weight matrix
No. of blocks → 4
If it is a weight matrix, it is forwarded as four weight matrices to the four different instances of Compute, based on the instruction.

Four weight BRAMs are connected to the output of the load module.

The weights are transferred from HBM to the Load module based on the address obtained after decoding the instruction in the fetch module.

2.3.3 COMPUTE MODULE


The core of the accelerator is the Compute IP with 4 systolic array (SA) units of size 64x64; it is designed to perform 16K MAC operations in one go. Once the input matrix and weight matrix are loaded into the buffers, it performs the chosen operation, such as multiplication, and the resultant matrix is moved into the accumulator block.

Dual-Mode PEs for Handling Both Dense and Sparse Computations

Systolic arrays traditionally optimize for dense matrix operations, but many workloads (e.g., pruned neural
networks, sparse matrices) exhibit sparsity. Dual-mode PEs allow the same hardware to efficiently switch
between dense and sparse computation modes, improving flexibility and efficiency.

1. Key Challenges in Dual-Mode PEs

Before diving into solutions, let’s outline the main challenges:

1. Data Representation

o Dense: Regular, structured data (e.g., full matrices).

o Sparse: Irregular, compressed formats (e.g., CSR, CSC).

2. Computation Efficiency

o Dense: Maximize MAC throughput.

o Sparse: Skip zeros, avoid redundant computations.

3. Memory Access Patterns

o Dense: Predictable, streaming accesses.

o Sparse: Irregular, pointer-chasing.

4. Control Overhead

o Switching modes should not introduce excessive latency.

2. Architectural Approaches for Dual-Mode PEs

Several techniques enable systolic PEs to handle both dense and sparse computations efficiently.

A. Reconfigurable Data Paths

 Approach: Dynamically switch between dense and sparse computation logic.

 Techniques:

o Mode bit in PE control: A configuration flag selects between dense/sparse modes.

o Multiplexed datapaths:

 Dense mode: Standard MAC units with full-precision operands.

 Sparse mode: Zero-skipping logic + compressed operand handling.

 Example:

o NVIDIA’s Ampere Tensor Cores support both dense and structured-sparse modes.

o Eyeriss v2 (MIT) uses reconfigurable PEs for CNN workloads.

B. Compressed-Sparse Data Handling

 Approach: Store sparse inputs in compressed formats (e.g., CSR) but decompress on-the-fly for dense-
like processing.

 Techniques:

o Sparse-to-dense conversion unit: Expands compressed data before feeding into PEs.

o Run-length encoding (RLE): Skip zero blocks efficiently.

 Example:

o SIGMA (Intel) uses a sparse-aware systolic array with compressed inputs.

C. Dynamic Zero-Skipping

 Approach: Detect and skip zero operands at runtime, even in "dense" mode.

 Techniques:

o Operand gating: If an input is zero, disable the MAC operation.

o Prediction-based skipping: Use metadata (bitmasks) to preemptively skip zeros.

 Example:

o Google’s TPU v4 uses conditional execution for sparse weights.

D. Hybrid Dataflows

 Approach: Use different systolic dataflows for dense vs. sparse.

 Techniques:

o Dense mode: Weight-stationary (optimal for GEMM).

o Sparse mode: Output-stationary (avoids redundant computations).

 Example:

o FlexFlow (Stanford) dynamically switches dataflows based on sparsity.

E. Sparse Accumulation Support

 Approach: Modify PE accumulation logic to handle sparse outputs.

 Techniques:

o Sparse accumulator (SPA): Only stores non-zero partial sums.

o Bitmask-based reduction: Skip zero contributions during accumulation.

 Example:

o STELLAR (ETH Zurich) uses SPAs for sparse-DNN acceleration.
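To make the dense/sparse contrast concrete, the C sketch below shows a dense dot product with operand gating (zero skipping) next to a CSR-style sparse row against a dense vector. It is a software model for illustration only, not the PE datapath; the CSR field names follow the usual convention and are assumptions here.

#include <stddef.h>
#include <stdint.h>

/* Dense mode with operand gating: a MAC is skipped whenever either operand is zero. */
static int32_t dense_dot_zero_skip(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] == 0 || b[i] == 0)    /* zero skipping: gate the MAC */
            continue;
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}

/* Sparse mode: one CSR row (non-zero values + column indices) against a dense
 * vector, so zeros are never stored or visited at all. */
static int32_t csr_row_dot(const int8_t *values, const int32_t *col_idx,
                           size_t nnz, const int8_t *dense_vec)
{
    int32_t acc = 0;
    for (size_t i = 0; i < nnz; i++)
        acc += (int32_t)values[i] * dense_vec[col_idx[i]];
    return acc;
}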

3. Trade-offs in Dual-Mode PEs

Design Choice | Advantages | Drawbacks
Reconfigurable Datapaths | Flexible, best of both worlds | Area overhead, control complexity
Compressed Data Handling | Saves bandwidth, works with dense PEs | Decompression latency
Dynamic Zero-Skipping | No format conversion, low overhead | Only saves compute, not memory
Hybrid Dataflows | Optimized for each mode | Complex to synchronize
Sparse Accumulation | Reduces redundant writes | Needs extra storage for metadata

2.3.4 STORE MODULE

2.3.5 CONCURRENCY
Overlapped Compute & Communication

2.3.6 UNIFIED VIRTUAL ADDRESSING (UVA)

Unified Virtual Addressing (UVA) is a memory addressing feature supported by modern GPUs (such as NVIDIA's
CUDA-capable GPUs) and CPUs, which allows them to share a single virtual address space across both host
(CPU) and device (GPU) memory. To support UVA, the following hardware functionalities are required:

1. Memory Management Unit (MMU) with Page Tables

 The GPU must have an MMU capable of handling virtual-to-physical address translations, similar to a
CPU MMU.

 Support for page tables that map virtual addresses to either CPU or GPU physical memory.

 The MMU should handle page faults, allowing data to be migrated between CPU and GPU memory
transparently (if supported by the system).

2. Address Translation Services (ATS) or IOMMU/SMMU

 For systems where the GPU accesses CPU memory directly (e.g., via PCIe), an IOMMU (Input-Output
Memory Management Unit) or SMMU (System MMU) is needed to translate GPU virtual addresses
to CPU physical addresses.

 This ensures secure and correct access to host memory when the GPU references a CPU-mapped
address.

3. Unified Address Space Support

 The GPU and CPU must agree on a shared virtual address range, meaning pointers can refer to either
CPU or GPU memory without explicit distinction.

 The hardware must recognize whether an address belongs to CPU or GPU memory and route accesses
accordingly.

4. Cache Coherence (Optional but Beneficial)

 Some systems (like NVIDIA's GPUs with UVA + NVLink or AMD's Infinity Fabric) support varying
degrees of cache coherence between CPU and GPU.

 If full coherence is not supported, software must manage data consistency explicitly (e.g., via
CUDA cudaMemcpy or synchronization primitives).

5. PCIe BAR (Base Address Register) and Peer-to-Peer (P2P) Support

 For GPUs connected via PCIe, PCIe Base Address Registers (BARs) must be configured to allow the
GPU to access CPU memory regions.

 If multiple GPUs are involved, Peer-to-Peer (P2P) transfers must be supported to allow direct GPU-to-
GPU memory access under UVA.

6. OS and Driver Support

 While not strictly hardware, the operating system and GPU drivers must coordinate memory
allocation and page table management to ensure UVA works correctly.

 The OS must allow the GPU driver to manage its own page tables or integrate with the system MMU.

2.4 LLD

2.4.1 MICRO ARCHITECTURE

2.4.2 TRANSCENDENTAL FUNCTIONS:
Functions that cannot be expressed as a finite combination of algebraic operations (addition, subtraction, multiplication, division, and root extraction) are discussed in this sub-section. These functions go beyond algebraic equations and often involve infinite series, integrals, or differential equations.

Examples of Transcendental Functions

Exponential functions: e^x, 2^x
Logarithmic functions: ln(x), log_b(x)
Trigonometric functions: sin(x), cos(x), tan(x)
Inverse trigonometric functions: arcsin(x), arccos(x)
Hyperbolic functions: sinh(x), cosh(x)
Special functions: Gamma function, Bessel functions, elliptic integrals, etc.

🔹 Why Are Transcendental Functions Important?

Machine Learning & AI: Used in activation functions (e.g., sigmoid, softmax).
AI Accelerators (TPUs, GPUs, etc.): Hardware optimizations for fast transcendental function computation.

🔹 Computation of Transcendental Functions in Hardware

Since transcendental functions involve infinite series, AI accelerators and processors approximate them
efficiently using:

🔸 1. Taylor Series Approximation

Many transcendental functions are computed using truncated Taylor series.


Example: Taylor series expansion of e^x

e^x = 1 + x + x^2/2! + x^3/3! + ...

Used in FPGA, GPU, and TPU accelerators for fast computation.
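A minimal truncated-Taylor evaluation of e^x might look like the sketch below; the term count (8) and the use of double precision are arbitrary illustrative choices, whereas a hardware implementation would typically use fixed point together with range reduction.

#include <stdio.h>

/* Approximate e^x with a truncated Taylor series of `terms` terms. */
static double exp_taylor(double x, int terms)
{
    double sum = 1.0;    /* n = 0 term */
    double term = 1.0;
    for (int n = 1; n < terms; n++) {
        term *= x / n;   /* builds x^n / n! incrementally */
        sum += term;
    }
    return sum;
}

int main(void)
{
    /* Accurate for small |x|; larger |x| needs range reduction first. */
    printf("exp_taylor(1.0) = %f\n", exp_taylor(1.0, 8));  /* ~2.71825 */
    return 0;
}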

🔸 2. CORDIC Algorithm (For Trigonometric & Logarithmic Functions)

CORDIC (COordinate Rotation DIgital Computer) is a hardware-friendly iterative method for computing:

 Trig functions: sin(x), cos(x), arctan(x)

 Hyperbolic functions: sinh(x), cosh(x)

 Logarithm & exponentiation

✅ Used in AI chips, embedded processors, and FPGAs since it requires only shifts and adds (no multiplication).
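For reference, a compact software model of rotation-mode CORDIC for sin/cos is sketched below. The iteration count, Q16 scaling and the use of libm atan() to precompute the angle table are illustrative choices; a hardware version would keep the table in ROM and use purely fixed-point shift-and-add datapaths (the sketch also assumes arithmetic right shifts on negative values).

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define CORDIC_ITERS 16
#define Q 16                                   /* Q16 fixed point: 1.0 == 1 << 16 */
#define TO_FIX(x) ((int32_t)lrint((x) * (1 << Q)))

/* Rotation-mode CORDIC: sin and cos of `angle` in radians, |angle| < ~1.74. */
static void cordic_sincos(double angle, double *s, double *c)
{
    static int32_t atan_tab[CORDIC_ITERS];
    static int init = 0;
    if (!init) {                               /* table precomputed once (ROM in HW) */
        for (int i = 0; i < CORDIC_ITERS; i++)
            atan_tab[i] = TO_FIX(atan(ldexp(1.0, -i)));   /* atan(2^-i) */
        init = 1;
    }

    int32_t x = TO_FIX(0.6072529350088813);    /* 1/K: pre-scaled CORDIC gain */
    int32_t y = 0;
    int32_t z = TO_FIX(angle);

    for (int i = 0; i < CORDIC_ITERS; i++) {   /* only shifts, adds and a table */
        int32_t xs = x >> i, ys = y >> i;
        if (z >= 0) { x -= ys; y += xs; z -= atan_tab[i]; }
        else        { x += ys; y -= xs; z += atan_tab[i]; }
    }
    *c = x / (double)(1 << Q);
    *s = y / (double)(1 << Q);
}

int main(void)
{
    double s, c;
    cordic_sincos(0.5, &s, &c);
    printf("cordic sin=%f cos=%f  libm sin=%f cos=%f\n", s, c, sin(0.5), cos(0.5));
    return 0;
}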

🔸 3. Lookup Tables (LUTs)

 Precompute transcendental function values for specific inputs.

 Used in low-power AI inference accelerators to avoid complex computations.

🔹 Transcendental Functions in AI & Deep Learning

1️⃣ Sigmoid Activation:

σ(x) = 1 / (1 + e^(-x))

⚡ Computed efficiently using lookup tables and exponential approximations.

2️⃣ Softmax Function:

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

⚡ Implemented in hardware accelerators (TPUs, GPUs) using optimized exponentiation techniques.

3️⃣ Fourier Transforms (FFT) in Signal Processing & AI:

 Requires trigonometric function calculations for sin/cos wave transforms.

🔹 Summary

Function Type | Examples | Computation Methods in AI Hardware
Exponential | e^x, 2^x | Taylor series, LUTs, CORDIC
Logarithmic | ln(x), log_2(x) | CORDIC, polynomial approximation
Trigonometric | sin(x), cos(x) | CORDIC, LUTs, Taylor series
Hyperbolic | sinh(x), cosh(x) | Taylor series, LUTs
Special Functions | Γ(x), J_n(x) | Approximation methods, AI optimization

3.1.1 DMA CONTROLLER


The DMA controller is expected to transfer data from DDR to the local scratch memory. This can happen in parallel with AI operations on the data fetched in the previous DMA cycle.

3.1.2 SUPPORTED DATA TYPES


A data type is an attribute associated with a piece of data that tells the system how to interpret its value. The data types supported by the proposed hardware are INT8, INT16, FP32, FP16 and BF16.

Data types:

Data types play a major role in determining the accuracy of the outcome of an AI model. Typically, training workloads use the FP32 (single-precision floating point) data type for inputs and weights for better accuracy. Once the model is trained, the trained weight data can be quantised to INT8 for inference without compromising the outcome of the application. Industry has introduced BF16, which provides the same value range as FP32 with precision comparable to FP16, for better performance. Newer data types such as FP8 and INT4 are being considered for futuristic needs.

3.1.3 FEATURE & WEIGHT MATRIX BUFFERS


Input matrices need to be copied into the SRAM associated with the SA. Once these buffers are filled with the correct input and weight data, the SA blocks can be initiated to perform the required tensor operations like convolution, activation, and pooling. While this is happening, the DMA can fetch the next set of data into scratch memory for better throughput and hence better efficiency.

3.1.4 ZERO SKIPPING / SPARSITY FUNCTIONAL BLOCK

3.1.5 DATA OPTIMISATION


Optimization is the process of iteratively training the model to maximise or minimise an objective function. It is one of the most important steps in machine learning for getting better results. Every optimization problem has three components: an objective function, decision variables, and constraints. Symmetric and asymmetric quantization are the two types of quantization used here.
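As a concrete illustration of the two quantization flavours mentioned above, the sketch below maps FP32 values to INT8 with a symmetric scale and with an asymmetric scale plus zero point; the clamping ranges and rounding choices are illustrative assumptions rather than the tool flow actually used.

#include <math.h>
#include <stdint.h>

/* Symmetric INT8 quantization: 0.0 maps to 0, scale derived from |max|. */
static int8_t quant_symmetric(float x, float abs_max)
{
    float scale = abs_max / 127.0f;
    long  q = lrintf(x / scale);
    if (q > 127)  q = 127;
    if (q < -127) q = -127;
    return (int8_t)q;
}

/* Asymmetric INT8 quantization: [min, max] mapped onto [-128, 127] via a zero point. */
static int8_t quant_asymmetric(float x, float min, float max)
{
    float scale = (max - min) / 255.0f;
    long  zero_point = lrintf(-min / scale) - 128;  /* integer code representing 0.0 */
    long  q = lrintf(x / scale) + zero_point;
    if (q > 127)  q = 127;
    if (q < -128) q = -128;
    return (int8_t)q;
}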

Goals:

✔ Optimizes input data reuse to minimize memory bandwidth usage.

✔ Minimizes the number of partial sum accumulations to reduce redundant memory accesses.

✔ Ensures full utilization of the systolic array for maximum throughput.

Optimizing Data Access (Weight-Stationary & Double Buffering)

Dataflow: Weight-Stationary for Efficient Access

 Weights (B) are preloaded into the systolic array, minimizing weight movement.

 Inputs (A) stream in a wavefront and shift through the PEs.

 Partial sums accumulate downward and are collected in the accumulator units.

Double Buffering to Overlap Computation and Data Movement

 While one tile is being processed, the next tile is loaded from DRAM to SRAM.

 This ensures full utilization of the systolic array without idle cycles.
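A software analogue of this double-buffering (ping-pong) scheme is sketched below: while the compute step works on one tile buffer, the next tile is fetched into the other. The buffer sizes and function names are illustrative assumptions, and the DMA is modelled as a plain copy; in hardware the prefetch and compute genuinely overlap rather than running back to back.

#include <stdint.h>
#include <string.h>

#define TILE_ELEMS (64 * 64)

static int8_t  tile_buf[2][TILE_ELEMS];  /* two on-chip tile buffers (ping-pong) */
static int64_t checksum;                 /* stand-in result of the compute step  */

/* Stand-in for the DMA engine: fetch tile `t` from DRAM into `dst`. */
static void dma_fetch_tile(int8_t *dst, const int8_t *dram, int t)
{
    memcpy(dst, dram + (size_t)t * TILE_ELEMS, TILE_ELEMS);
}

/* Stand-in for the systolic-array computation on one tile. */
static void compute_tile(const int8_t *tile)
{
    for (int i = 0; i < TILE_ELEMS; i++)
        checksum += tile[i];
}

static void process_all_tiles(const int8_t *dram, int num_tiles)
{
    dma_fetch_tile(tile_buf[0], dram, 0);        /* prime the first buffer         */
    for (int t = 0; t < num_tiles; t++) {
        int cur = t & 1;
        if (t + 1 < num_tiles)                   /* prefetch the next tile ...     */
            dma_fetch_tile(tile_buf[cur ^ 1], dram, t + 1);
        compute_tile(tile_buf[cur]);             /* ... while this one is computed */
    }
}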

DRAM Access Optimization Using Banked Memory

 Data is striped across multiple DRAM banks to maximize parallel memory access.

 Reduces memory latency by ensuring each PE gets its data in parallel.

One Column of Blocks of C in On-Chip Memory

- Only Once Per Column


- Avoids frequent off-chip writes, improving memory efficiency

Weight-Stationary Dataflow (B in On-Chip Memory)

- Minimizes memory bandwidth usage

Double Buffering (Prefetch Next A Tile)

- Hides memory latency

SIMD Vectorization (Parallel Computation)

- Speeds up execution
- Abstracts data types from SA

3.1.6 MEMORY HIERARCHY & OPTIMIZATION


 On-Chip Memory: Use on-chip memory to store 4 tiles of A, 4 tiles of B, and partial sums for a 4x4 block of
C.

Comparison: Latency & Power (Typical Values)

Memory | Latency (ns) | Bandwidth (GB/s) | Energy (pJ/bit)
Register | ~1 | — | ~0.1
L1/SMEM | ~5–10 | 10–1000 (local) | ~0.3
L2 Cache | 20–40 | 2–5x L1 | ~1.0
HBM2 | ~300 | 900–1600 | ~15–30
DDR5 | ~80–150 | 50–100 | ~15–30
PCIe 4.0 | ~1000+ | 32–64 | ~100+

Note: Bandwidth and power vary by vendor and workload; HBM has higher bandwidth density due to 3D
stacking, even if latency is high.

Feature | HBM | DDR
Latency | Higher | Lower
Bandwidth | MUCH higher (900+ GB/s) | 50–100 GB/s
Power efficiency | Better (per GB/s) | Worse
Integration | On-package (close to die) | Off-chip (on motherboard)
Use Case | GPU/TPU/AI ASIC | CPU, general workloads

HBM is more parallelized — thousands of pins, huge bandwidth. DDR is cheaper and lower latency per access,
but doesn't scale for parallel workloads.

3.1.7 ACTIVATION BLOCK


Activation functions (AFs) are essential components of neural networks, enabling the
learning of abstract features through nonlinear transformations. An activation function
determines whether a neuron should be activated by computing the weighted sum of inputs
and adding a bias term. It plays a crucial role in enabling neural networks to make complex
decisions and predictions. By introducing non-linearity, activation functions allow the model
to learn and represent intricate patterns in data. Without this non-linearity, a neural
network would function like a simple linear regression model, regardless of the number of
layers. An effective activation function should possess several key properties:

• it must introduce non-linearity into the optimization landscape to enhance training


convergence;
• it should maintain computational efficiency without significantly increasing model
complexity;
• it must support stable gradient flow during training to prevent issues like vanishing or
exploding gradients; and
• it should preserve the data distribution to facilitate better network training.

An activation function like ReLU, Sigmoid, or Tanh can be computed in one or a few clock cycles on an FPGA, depending on its complexity and implementation method:

ReLU (Rectified Linear Unit)

 Equation: f(x)=max(0,x)
 Hardware Implementation: Simple comparison and multiplexer.
 Clock cycles: 1 clock cycle

Sigmoid (Logistic Function)

 Equation: f(x) = 1 / (1 + e^(-x))
 Hardware Implementation:
o Can be implemented using LUT-based approximation, CORDIC, or Taylor series.
o CORDIC or Taylor series may take multiple cycles.
 Clock cycles:
o CORDIC/Taylor: Multiple cycles (5–15 cycles)

Softmax (Exponentiation & Normalization)

 Equation: S(x_i) = e^(x_i) / Σ_j e^(x_j)

In recent years, various activation functions have been explored to meet these criteria and improve deep learning performance. This section highlights advancements in activation functions, providing insight into their characteristics and effectiveness.

One of the most important parameters of a CNN model is the activation function. Activation functions are used to learn and approximate any kind of continuous and complex relationship between the variables of the network. In simple words, they decide which information should be propagated in the forward direction and which should not at the end of the network.

 Performs activation functions such as Sigmoid and the ReLU family (ReLU, Leaky ReLU, PReLU, ReLU6) etc.
 Provides a programmable look-up table (LUT) that supports any current or future activation function, including tanh, sigmoid, MISH, SWISH etc.
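A simple software model of a LUT-based sigmoid, of the kind such a programmable table could hold, is sketched below; the table size (256 entries), the covered input range (±8) and the nearest-entry lookup without interpolation are illustrative assumptions.

#include <math.h>
#include <stdio.h>

#define LUT_SIZE 256
#define SIG_MIN  (-8.0f)
#define SIG_MAX  ( 8.0f)

static float sigmoid_lut[LUT_SIZE];

/* Fill the table once; in hardware this would be the programmable LUT contents. */
static void sigmoid_lut_init(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        float x = SIG_MIN + (SIG_MAX - SIG_MIN) * i / (LUT_SIZE - 1);
        sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
    }
}

/* Nearest-entry lookup; saturates outside the covered range. */
static float sigmoid_approx(float x)
{
    if (x <= SIG_MIN) return sigmoid_lut[0];
    if (x >= SIG_MAX) return sigmoid_lut[LUT_SIZE - 1];
    int idx = (int)((x - SIG_MIN) / (SIG_MAX - SIG_MIN) * (LUT_SIZE - 1) + 0.5f);
    return sigmoid_lut[idx];
}

int main(void)
{
    sigmoid_lut_init();
    printf("sigmoid(1.0) ~ %f (exact %f)\n",
           sigmoid_approx(1.0f), 1.0 / (1.0 + exp(-1.0)));
    return 0;
}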

3.1.8 POOLING BLOCK
In most cases, a convolutional layer is followed by a pooling layer. The primary aim of this layer is to decrease the size of the convolved feature map to reduce computational cost. This is done by decreasing the connections between layers, and the operation acts independently on each feature map. Depending on the method used, there are several types of pooling operations. Pooling basically summarises the features generated by a convolution layer. Two common pooling methods are average pooling and max pooling.
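For illustration, 2x2 max pooling with stride 2 over a single feature map can be modelled as below; the INT8 data type and the assumption of even dimensions are illustrative choices.

#include <stdint.h>

/* 2x2 max pooling with stride 2 over one H x W INT8 feature map.
 * Output is (H/2) x (W/2); H and W are assumed even. */
static void max_pool_2x2(const int8_t *in, int8_t *out, int h, int w)
{
    for (int r = 0; r < h; r += 2) {
        for (int c = 0; c < w; c += 2) {
            int8_t m = in[r * w + c];
            if (in[r * w + c + 1]       > m) m = in[r * w + c + 1];
            if (in[(r + 1) * w + c]     > m) m = in[(r + 1) * w + c];
            if (in[(r + 1) * w + c + 1] > m) m = in[(r + 1) * w + c + 1];
            out[(r / 2) * (w / 2) + c / 2] = m;
        }
    }
}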

3.1.9 HIERARCHICAL TILING

Data type: INT8

Tiling Stage | Tile Size | Purpose
Global Tile | 8kB × 8kB | Minimizes DRAM access: (1) wgt-mem kept constant, (2) inp-mem streamed in wavefront / ping-pong fashion, (3) psum-mem saves one column of the result
Execution Tile | 256B × 256B | Fits into SRAM efficiently and maps to the 64 × 64 PE systolic array with 128-bit SIMD
Compute Tile | 4B × 4B | Fits into a 128-bit SIMD register for INT8 MAC computations

Since each 8K × 8K matrix has 64M elements, direct execution on a 64×64 systolic array is not possible. Instead, we perform hierarchical tiling.
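A quick tile-count check under the sizes above (taken from the table and treated as assumptions; SIMD lanes are ignored here for simplicity):

#include <stdio.h>

int main(void)
{
    const int matrix_dim = 8192;   /* 8K x 8K global matrix, INT8 (1 byte/element) */
    const int exec_tile  = 256;    /* 256 x 256-element execution tile             */
    const int sa_dim     = 64;     /* 64 x 64 systolic array                       */

    int tiles_per_dim  = matrix_dim / exec_tile;                       /* 32   */
    int exec_tiles     = tiles_per_dim * tiles_per_dim;                /* 1024 */
    int sa_blocks_tile = (exec_tile / sa_dim) * (exec_tile / sa_dim);  /* 16   */

    printf("execution tiles per matrix: %d\n", exec_tiles);
    printf("64x64 sub-blocks per execution tile: %d\n", sa_blocks_tile);
    return 0;
}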

3.1.10 CLOCK AND PLLS

3.1.11 POWER MODULE

3.1.12 INTERRUPT SIGNAL


To indicate completion of the MM & O/P buffer is copied back to DDR?
which you must connect to an interrupt controller?

3.1.13 PARTIAL SUMS

If accumulation cannot be completed in one pass, intermediate sums are stored in on-chip SRAM
instead of DRAM.

SA Accumulation

Output stationary allows accumulated results in PEs.

Local Accumulation / On-chip accumulation (SRAM)

Instead of sending every partial sum to global memory (slow), partial sums are first accumulated in
registers or SRAM before writing the final result.

Reduction Tree

A reduction tree is a hierarchical summation method that reduces multiple values in parallel instead of sequentially. Pairs of elements are summed in parallel, reducing the summation depth from O(N) for sequential addition to O(log N), while the total number of additions remains N-1.
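A minimal in-place pairwise (tree) reduction is sketched below for illustration; in a hardware reduction tree each level of additions is a parallel adder stage rather than a loop iteration.

#include <stdint.h>

/* In-place pairwise (tree) reduction of n values; returns the total.
 * Each outer iteration halves the number of live partial sums, giving
 * about log2(n) levels; one level corresponds to one adder stage in hardware. */
static int32_t tree_reduce(int32_t *v, int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            v[i] += v[i + stride];
    return v[0];
}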

Explanation of the Modified Loop Order

1. Outer Loop (i): Iterate over the rows of Matrix A.

2. Middle Loop (j): Iterate over the columns of Matrix B.

3. Inner Loop (k): Iterate over the shared dimension in steps of 4. This allows loading 4 tiles of A and 4
tiles of B at a time.

4. Load 4 Tiles of A: Load 4 tiles from the same row of Matrix A (e.g., A[i][k:k+4]).

5. Load 4 Tiles of B: Load 4 tiles from the corresponding columns of Matrix B (e.g., B[k:k+4][j]).

6. Compute Partial Sums: Use the loaded tiles to compute partial sums for a 4x4 block of Matrix C.

7. Accumulate Partial Sums: Accumulate the partial sums in on-chip memory.

8. Write Results: After processing all k values, write the final result for the 4x4 block of C to off-chip memory.
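The loop order and on-chip accumulation described above can be modelled in C roughly as follows. The tile sizes (4x4 compute tiles, 32 tiles along the shared dimension) follow the example workflow below, and the buffer layouts and function names are illustrative assumptions rather than the accelerator's actual memory map.

#include <stdint.h>
#include <string.h>

#define T        4    /* compute-tile dimension (4x4 elements)           */
#define KT       32   /* number of tiles along the shared (k) dimension  */
#define K_GROUP  4    /* k-tiles loaded together per inner-loop step     */

/* One 4x4-tile MAC: acc += a_tile * b_tile (INT8 inputs, INT32 accumulation). */
static void tile_mac(int32_t acc[T][T], const int8_t a[T][T], const int8_t b[T][T])
{
    for (int r = 0; r < T; r++)
        for (int c = 0; c < T; c++)
            for (int k = 0; k < T; k++)
                acc[r][c] += (int32_t)a[r][k] * b[k][c];
}

/* Compute one 4x4 output block of C with the loop order described above:
 * the k loop advances in groups of K_GROUP tiles, partial sums stay "on chip"
 * (here: the caller-provided acc array) and are written out once at the end. */
static void compute_c_block(int32_t acc[T][T],
                            const int8_t a_tiles[KT][T][T],  /* row i of A tiles    */
                            const int8_t b_tiles[KT][T][T])  /* column j of B tiles */
{
    memset(acc, 0, sizeof(int32_t) * T * T);
    for (int k = 0; k < KT; k += K_GROUP)       /* inner loop, step of 4      */
        for (int g = 0; g < K_GROUP; g++)       /* the 4 co-loaded tile pairs */
            tile_mac(acc, a_tiles[k + g], b_tiles[k + g]);
    /* acc now holds the final 4x4 block of C, ready to be written off-chip. */
}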

Benefits of This Approach

1. Increased Data Reuse:

o Each tile of Matrix A is reused for 4 tiles of Matrix B, and each tile of Matrix B is reused for 4
tiles of Matrix A.

o This maximizes the utilization of on-chip memory and minimizes redundant data transfers.

2. Reduced Memory Traffic:

o Loading 4 tiles of A and 4 tiles of B at a time reduces the number of memory accesses by a
factor of 4.

o Partial sums for a 4x4 block of C are accumulated in on-chip memory, reducing the number
of writes to off-chip memory.

3. Efficient Use of On-Chip Memory:

o By processing 4x4 blocks of C at a time, the on-chip memory is used more efficiently,
reducing the need to spill partial sums to off-chip memory.

Example Workflow

For a specific example, consider computing the 4x4 block C[0:4][0:4]:

1. Initialize C[0:4][0:4] to zero in on-chip memory.

2. For each k from 0 to 31 in steps of 4:

o Load 4 tiles of A: A[0][k:k+4].

o Load 4 tiles of B: B[k:k+4][0].

o Compute partial sums for C[0:4][0:4] using the loaded tiles.

o Accumulate the partial sums in on-chip memory.

3. Write the final result for C[0:4][0:4] to off-chip memory.

Repeat this process for all 4x4 blocks of Matrix C.

Memory Hierarchy Optimization

 On-Chip Memory: Use on-chip memory to store 4 tiles of A, 4 tiles of B, and partial sums for a 4x4
block of C.

 Off-Chip Memory: Only access off-chip memory to load tiles of A and B and to write the final results
for C.

Scaling to Larger Matrices

For larger matrices (e.g., 1024x1024), the same tiling sequence can be applied, but the number of tiles will
increase (e.g., 64x64 tiles for 1024x1024 matrices). The key principles of data reuse and minimizing partial sum
writes remain the same.

By loading 4 tiles from the same row of Matrix A and 4 tiles from the corresponding columns of Matrix B, this
approach significantly reduces memory traffic and improves the efficiency of partial sum handling, making it
highly suitable for hardware accelerators like systolic arrays.

Optimizations Applied

✅ Tiling Strategy → Uses N×N tiles for efficient on-chip memory usage.
✅ Weight-Stationary Dataflow → B stays fixed in on-chip memory to reduce data movement.
✅ Double Buffering → Overlaps computation & memory fetch to eliminate stalls.
✅ SIMD Vectorization → Computes multiple elements in parallel for faster execution.
✅ Efficient Partial Sum Accumulation → Uses registers to store partial sums, reducing off-chip memory writes.

3.2 SOFTWARE STACK

3.2.1 DISTRIBUTED COMPUTING / PARALLELISM TECHNIQUES


Data parallelism -- is achieved by partitioning inputs and outputs to create multiple independent
data streams that are processed by different PCUs.

Tensor parallelism -- is achieved by forking into data parallel streams, then joining them.

Pipeline parallelism -- is achieved by chaining multiple PCUs together to fuse operations and
increase operational intensity.

Factor | Multi-Accel / Single-Node | Multi-Node
Communication | PCIe, NVLink | Ethernet, InfiniBand (faster)
Scaling | Limited by GPU count on a node | Scales beyond a single node
Use-Case | Single server, multiple GPUs | Cluster-based AI workloads
Best For | High-performance inference | Distributed training, cloud AI

3.2.2 TOOLS

Tools help maintain a high utilization factor of the hardware for those diverse convolutions.

Tools quantize a neural network to make it suitable for low-precision hardware through three processes: Profiling, Quantization, and Compensation.

– Profiling is the process of acquiring statistics of the activations for each channel or layer when a large set of sample images is run through the network.

– Quantization is the process of determining an optimal fractional length for activations, based on the profiling, as well as for weights/biases.

– Compensation is the process of compensating biases by comparing the original model with the quantized model.

Profiling tools

Quantization tools

The NPU uses offline tools to optimize the code. At runtime, the application processor passes this

optimized trained model to the NPU.

Pass your trained model through the quantization tool. This tool quantizes weights to 8-bit and
activations to 8-bit or 16-bit values.

Pass the quantized model to the compiler. This tool optimizes the model for this NPU and outputs an
optimized model that contains a command stream for the NPU.

3.3 PROCESS FLOW

Figure 5.1: Process Flow

4 CONTRIBUTIONS AND ACKNOWLEDGEMENTS

Table 13.1: List of contributors

Sr. No. Name of the Author/Contributor Affiliation
