Vishwa HLD LLD - Ver0.1
Design
CONTENTS
1 Introduction..................................................................................................................................................4
1.1 Design alternatives....................................................................................................................... 4
2 Design and Approach....................................................................................................................................5
2.1 Design alternatives....................................................................................................................... 5
2.2 The accelerator architecture......................................................................................................... 8
2.3 HLD............................................................................................................................................ 10
2.4 LLD............................................................................................................................................. 11
2.4.1 Micro Architecture....................................................................................................................11
2.4.2 Transcendental Functions:........................................................................................................11
9.1.1 DMA controller..........................................................................................................13
9.1.2 Supported data types................................................................................................13
9.1.3 Feature & weight matrix buffers................................................................................14
9.1.4 Zero skipping / sparsity functional block....................................................................14
9.1.5 Data optimisation......................................................................................................14
9.1.6 Activation block.........................................................................................................16
9.1.7 Pooling block.............................................................................................................17
9.1.8 Hierarchical Tiling......................................................................................................17
9.1.10 Clock and power module...........................................................................................18
9.1.11 Interrupt signal....................................................................................................18
9.1.12 Partial Sums...............................................................................................................18
9.2 Software stack............................................................................................................ 18
9.2.1 Distributed computing / parallelism techniques........................................................19
9.2.2 Tools...........................................................................................................................19
9.3 Process Flow............................................................................................................................... 20
10 Contributions and Acknowledgements...................................................................................................21
TABLE OF FIGURES
TABLE OF TABLES
1 INTRODUCTION
1.1 BACKGROUND
Large language models (LLMs) like GPT-4, DeepSeek, and Perplexity have taken AI solutions to a new level by
generating responses in natural language, program code, and more. Investing in the research and development
(R&D) of AI accelerators is a strategic move for companies, governments, and institutions because of the
transformative impact of artificial intelligence (AI) across industries. As AI models grow in complexity and size,
general-purpose processors (like CPUs) struggle to keep up. Focused R&D in AI accelerator design can unlock
significant advantages.
The development of indigenous microprocessors and AI/ML hardware accelerators is gaining significant
momentum through partnerships, driven by the growing need for data sovereignty, security, and technological
self-reliance. Countries are actively investing in local processor ecosystems to reduce dependency on foreign
semiconductor supply chains and to build specialized hardware that caters to emerging AI/ML workloads.
Combining indigenous microprocessor development with specialized AI accelerators offers several strategic
advantages and is a step towards technological independence.
Data Sovereignty: Ensures that sensitive data remains within national borders, enhancing security.
Customization: Indigenous AI accelerators can be optimized for regional AI applications like agriculture,
healthcare, and natural language processing in native languages.
Cost Efficiency: Locally designed accelerators can reduce licensing costs and dependencies on foreign
chipmakers.
Strategic investment in talent development, research partnerships, and manufacturing capabilities will be
crucial for success in this frontier.
In the last decade, the compute power delivered by VLSI / semiconductor chips has increased by about 10x,
whereas memory access capacity has increased by only about 4x. CPUs compute floating point operations
(FLOPs) much faster than memory bandwidth and capacity can grow, creating the memory wall, where the
memory system can no longer feed the compute units efficiently. This disproportionate growth in the
compute-to-memory ratio creates the opportunity for custom designs that meet training and inference needs
in the AI-ML domain.
AI applications require millions of inputs and weights to be fetched from memory to the compute units.
Systems like HPC clusters with a large number of general-purpose, high-performance CPUs are overkill for
large volumes of simple MAC operations. Graphics Processing Units (GPUs) are better suited for accelerating
AI-ML workloads, but their architecture does not fully capitalize on power and computational efficiency
through efficient data movement.
Lightweight PEs (Processing Elements) can bring down power and area costs while speeding up computation
through an optimized data flow. Such customized hardware design leads to AI-ML accelerators that deliver
accelerated operations at lower power.
The overall objective is to design and develop an indigenous AI accelerator chip for the future needs of exa-
scale computing that supports both training and inference, under a HW-SW co-development methodology.
Since this effort is fairly involved and complex, it is being pursued in collaboration with industry and a
semiconductor fabrication unit.
Feature set:
We are confident that implementing and commissioning the product will make a positive impact in the
following ways.
Economic and Strategic Importance: AI is a key driver of economic growth and innovation across sectors like
healthcare, finance, manufacturing, and transportation. Countries and companies investing in AI accelerator
R&D can secure a leadership position in the global AI race, reducing dependence on foreign technology.
National Security and Sovereignty: AI accelerators are critical for defense, cybersecurity, and intelligence
applications. Countries investing in domestic AI hardware reduce reliance on foreign suppliers, enhancing
national security and technological sovereignty.
Environmental Impact: AI accelerators are designed to be more energy-efficient, reducing the carbon footprint
of AI training and inference. This aligns with global sustainability goals and regulatory requirements.
Design crossroad #1:
The way the AI-Accelerator is connected to the CPU or host leads to different design approaches, depending
on the application and deployment requirements.
1. AI-Accelerator as a Card:
In addition to standard motherboard components, computers often need to be equipped with
other parts and components to achieve the desired functionality based on application requirements. In
this context, one common connection method is PCIe.
PCIe, or Peripheral Component Interconnect Express, is an interface standard for connecting
high-speed input output (HSIO) components. Every high-performance computer motherboard has a
number of PCIe slots you can use to add GPUs, RAID cards, Wi-Fi cards, Accelerator cards or SSD
(solid-state drive) add-on cards. The primary benefits of PCIe are that it offers higher bandwidth, faster
speed, lower latency, and more utility.
Accelerator cards provide servers with added processing power optimized to handle
application specific workloads. Using a standard PCI-Express (PCIe) connector to a server
motherboard or backplane, accelerator cards utilize GPUs, FPGAs, or specialized ASICs,
which require an array of low jitter reference clocks for PCIe.
[Figure: AI-Accelerator card attached to host CPUs via PCIe]
Performance:
A generic design provides a way to develop IP that can be reused to realise hardware for different target
areas or applications. The performance of the AI-Accelerator is determined by the number of scalar/floating-point
operations per second. For a 5 POPS (Peta Operations Per Second) configuration, we can choose chiplets with 4
units of 64x64 systolic array blocks that handle 128-bit SIMD operations for the chosen data type. Designed to
run at 500 MHz, such a chiplet can perform 32K operations per clock cycle. Two chiplets running at
1.2 GHz with the above SA units can deliver about 5 POPS (4 units * 64 * 64 PEs * 64 INT8 SIMD ops * 1.2 GHz *
2 ops per MAC * 2 chiplets).
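As a quick sanity check on the throughput formula above, the following Python sketch reproduces the calculation. The parameter values mirror the formula as stated (64 INT8 SIMD ops per PE per cycle, 1.2 GHz clock, 2 chiplets) and are illustrative rather than final silicon numbers.

    # Illustrative throughput check for the 5 POPS configuration described above.
    # All parameters mirror the formula in the text; adjust them for other configurations.

    sa_units_per_chiplet = 4        # 64x64 systolic array blocks per chiplet
    sa_rows, sa_cols     = 64, 64   # PEs per systolic array
    int8_ops_per_pe      = 64       # SIMD INT8 operations per PE per cycle, as stated above
    clock_hz             = 1.2e9    # operating frequency
    ops_per_mac          = 2        # a MAC counts as a multiply plus an add
    chiplets             = 2

    ops_per_second = (sa_units_per_chiplet * sa_rows * sa_cols
                      * int8_ops_per_pe * clock_hz * ops_per_mac * chiplets)

    print(f"Peak throughput: {ops_per_second / 1e15:.2f} POPS")   # ~5.03 POPS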
Performance of the AI-Accelerator is determined by the number of operations per second.
Depending on the targeted application we have to choose the amount of parallelism needed and the
components required for the accelerator.
1. 2 TOPS Accelerator:
In order to achieve the operation count required by the application (2 TOPS), we need only
one Matrix Multiplication Unit (MXMU), which contains the input and weight feature matrix
buffers, systolic array, accumulator, activation function unit, and pooling unit. Along with
the MXMU there are other units such as the DMA unit, scratch memory, data optimizer, and
sparsity units; all these units are collectively called the NPU cluster. The combination of one
NPU cluster, the NPU scheduler, and the command queue is termed the NPU core, which acts
as the AI accelerator.
2. 50 TOPS Accelerator:
To reach an operation count of 50 TOPS we use the same structural units as in the
2 TOPS design, but with more MXMU units, i.e., 4 units. In the MXMU we use a 3D systolic
array, where each processing element in the systolic array acts as a SIMD register. With the
help of the NPU scheduler and command queue, instructions are distributed to all the MXMU
units in parallel to achieve maximum throughput from the NPU.
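The structural hierarchy described in the two configurations above can be summarised in a small sketch. This is a minimal illustration only; the class and field names are placeholders, not identifiers from the RTL or driver code.

    from dataclasses import dataclass

    @dataclass
    class MXMU:
        """Matrix Multiplication Unit: buffers, systolic array, accumulator, activation, pooling."""
        systolic_array_dim: int = 64        # 64x64 systolic array
        simd_width_bits: int = 128          # each PE acts as a SIMD register (3D SA)

    @dataclass
    class NPUCluster:
        """MXMU(s) plus DMA, scratch memory, data optimizer and sparsity units."""
        mxmus: list
        # DMA, scratch memory, data optimizer and sparsity units omitted for brevity

    @dataclass
    class NPUCore:
        """NPU cluster + NPU scheduler + command queue = the AI accelerator."""
        cluster: NPUCluster

    # 2 TOPS configuration: a single MXMU in the cluster
    core_2tops = NPUCore(NPUCluster(mxmus=[MXMU()]))

    # 50 TOPS configuration: four MXMUs fed in parallel by the scheduler / command queue
    core_50tops = NPUCore(NPUCluster(mxmus=[MXMU() for _ in range(4)]))

    print(len(core_2tops.cluster.mxmus), len(core_50tops.cluster.mxmus))   # 1 4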
The core of the accelerator is the systolic array. It is designed to perform matrix operations of maximum size
64x64 in one go. Once the input matrix and weight matrix are loaded into the buffers, it performs the chosen
operation, such as multiplication, and the resultant matrix is moved into the accumulator block. Activation and
pooling operations can follow once the basic operation is performed. This can happen in parallel while the
DMA is fetching the data for subsequent operations. A generic design that can scale up based on the needs of
the targeted application is being micro-architected, with a prototype on FPGA.
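The overlap between compute and DMA prefetch mentioned above can be sketched as a simple double-buffering loop. This is a behavioral illustration in Python with assumed tile lists; buffer management in the actual hardware is handled by the DMA controller and synchronization logic.

    import numpy as np

    def matmul_with_prefetch(a_tiles, b_tiles):
        """Process a stream of 64x64 tile pairs while 'prefetching' the next pair.

        In hardware the DMA fills one buffer while the systolic array consumes the
        other; here the swap of the two buffer slots stands in for that overlap.
        """
        buffers = [None, None]                    # ping-pong tile buffers
        buffers[0] = (a_tiles[0], b_tiles[0])     # initial fill (DMA)
        results = []
        for i in range(len(a_tiles)):
            nxt = (i + 1) % 2
            if i + 1 < len(a_tiles):              # DMA prefetch of the next tile pair
                buffers[nxt] = (a_tiles[i + 1], b_tiles[i + 1])
            a, b = buffers[i % 2]                 # compute on the current buffer
            results.append(a @ b)                 # 64x64 tile multiply into the accumulator
        return results

    tiles_a = [np.random.rand(64, 64) for _ in range(3)]
    tiles_b = [np.random.rand(64, 64) for _ in range(3)]
    out = matmul_with_prefetch(tiles_a, tiles_b)
    print(len(out), out[0].shape)                 # 3 (64, 64)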
[Figure: FPGA prototype block diagram showing the host connected through XDMA over AXI / AXI-Lite, BRAM acting as HBM, and synchronization queues between modules]
The core of the accelerator is the Compute IP with 4 systolic array (SA) units of size 64x64. It is designed to
perform 16K MAC operations in one go. Once the input matrix and weight matrix are loaded into the buffer, it
performs the chosen operation, such as multiplication, and the resultant matrix is moved into the accumulator block.
Terminology:
• SA.2D
• SA.3D – systolic array 64 x 64 x 128-bit SIMD per cycle = 32K MAC (FP16) = 64K MAC (INT8)
2.3 HLD
o LOAD instructions, covering both input and weight data, are driven to the load command queue to
be reused by the load module.
o STORE instructions are pushed to the store command queue to be reused by the store module.
o COMPUTE instructions are pushed to the compute command queue to be reused by the
compute module.
CONTROL REGISTERS
The control logic manages the start, idle, done, and autostart signals to coordinate the
operation of the fetch module based on AXI write transactions and processing completion.
Control registers facilitate the initiation and termination of the fetch operation.
Ensures the hardware is in an idle state where it is ready to accept a new set of
instructions.
The module remains inactive when idle = 1.
Transition to an active state occurs when start is asserted.
The done bit is asserted (1) upon successful execution.
Typically used to signal AXI write acknowledgment or trigger subsequent processing
stages.
This control mechanism ensures synchronized operation between AXI transactions and fetch
processing while maintaining appropriate state transitions for efficient execution.
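A behavioral sketch of this start / idle / done handshake is shown below. The signal names follow the text; the method names and the autostart behavior modeled here are illustrative assumptions.

    class FetchControl:
        """Behavioral model of the fetch-module control registers (start / idle / done)."""

        def __init__(self, autostart=False):
            self.idle = 1        # ready to accept a new set of instructions
            self.done = 0
            self.autostart = autostart

        def write_start(self):
            """AXI write to the start bit: leave idle and begin fetching."""
            if self.idle:
                self.idle = 0
                self.done = 0

        def fetch_complete(self):
            """Called when the fetch of the instruction block finishes."""
            self.done = 1        # signals AXI acknowledgment / triggers the next stage
            self.idle = 1        # ready for the next instruction block
            if self.autostart:   # autostart immediately re-arms the module
                self.write_start()

    ctrl = FetchControl()
    ctrl.write_start()
    print(ctrl.idle, ctrl.done)   # 0 0  (busy)
    ctrl.fetch_complete()
    print(ctrl.idle, ctrl.done)   # 1 1  (done, ready again)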
DECODE
The decode unit interprets the instructions fetched from the instruction FIFO and
appropriately drives the instruction to the respective instruction queues based on the opcode.
LOAD instructions can carry either an input instruction or a weight instruction; based on the given
instruction, they are driven to the load command queue to be reused by the load module.
Each instruction is 128 bits wide and comes from HBM, where the compiler-generated instructions are
stored. Based on the instruction fields, the hardware decides whether the instruction refers to input data
or weight data. HBM is connected to the Load module through the AXI interconnect.
Data Transfer from HBM to Input & Weight BRAMs via AXI for LOAD Instruction
o The fetch module retrieves the number of instructions and the start address (offset address) from HBM
using an AXI-based interface.
o These instructions are decoded and categorized into Load, Compute, and Store command
queues.
o The LOAD instruction identifies whether the fetched data is input activations or weights.
o The Load Module is triggered to fetch the required data from HBM.
o The input activations and weights are streamed into separate BRAMs (Block RAMs) based
on instruction metadata.
o The load module manages data flow between HBM and BRAMs, ensuring efficient loading of
the data.
[Figure: Input BRAM and Weight BRAM fed by the Load module]
o The Load module generates a forward synchronization signal once the data is stored in the BRAMs, so the
Compute module can access it for further processing.
Once the load instruction is completed, the compute unit can access the data from BRAM without
needing to go back to HBM, based on the forward synchronization signal, which indicates whether the data
is input or weight.
Input BRAM has 4 blocks; each block holds a 64x64 input matrix with 1 byte per element.
The synchronization queues are forward in direction and tell Compute the type of data the instruction carries,
i.e., whether it refers to input BRAM data or weight BRAM data.
Input type: input matrix / weight matrix
No. of blocks: 4
If it is a weight matrix, it is forwarded as four weight matrices to the four different instances of Compute,
based on the instruction.
Four Weight BRAMs are connected to the output of the Load module.
The weights are transferred from HBM to the Load module based on the address obtained after decoding the
instruction in the fetch module.
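The forward synchronization between the Load and Compute modules can be modeled with a simple queue, as sketched below. BRAM contents are represented as dictionaries and the token format is an assumption made for illustration.

    from queue import Queue

    # Behavioral sketch of the forward synchronization between Load and Compute.
    input_bram  = {}                 # block_id -> 64x64 input tile
    weight_bram = {}                 # block_id -> 64x64 weight tile
    fwd_sync_q  = Queue()            # forward synchronization queue (Load -> Compute)

    def load_module(kind, block_id, tile):
        """Store a tile fetched from HBM into the proper BRAM and signal Compute."""
        (input_bram if kind == "input" else weight_bram)[block_id] = tile
        fwd_sync_q.put((kind, block_id))          # forward sync: data is ready

    def compute_module():
        """Consume one sync token and read the corresponding BRAM block."""
        kind, block_id = fwd_sync_q.get()
        bram = input_bram if kind == "input" else weight_bram
        return kind, block_id, bram[block_id]

    load_module("weight", 0, tile=[[0] * 64 for _ in range(64)])
    kind, blk, data = compute_module()
    print(kind, blk, len(data))      # weight 0 64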
Systolic arrays traditionally optimize for dense matrix operations, but many workloads (e.g., pruned neural
networks, sparse matrices) exhibit sparsity. Dual-mode PEs allow the same hardware to efficiently switch
between dense and sparse computation modes, improving flexibility and efficiency.
1. Data Representation
2. Computation Efficiency
4. Control Overhead
Several techniques enable systolic PEs to handle both dense and sparse computations efficiently.
Techniques:
o Multiplexed datapaths:
Example:
o NVIDIA’s Ampere Tensor Cores support both dense and structured-sparse modes.
Approach: Store sparse inputs in compressed formats (e.g., CSR) but decompress on-the-fly for dense-
like processing.
Techniques:
o Sparse-to-dense conversion unit: Expands compressed data before feeding into PEs.
Example:
C. Dynamic Zero-Skipping
Approach: Detect and skip zero operands at runtime, even in "dense" mode.
Techniques:
Example:
D. Hybrid Dataflows
Techniques:
Example:
Techniques:
Example:
Technique | Advantages | Drawbacks
Reconfigurable Datapaths | Flexible, best of both worlds | Area overhead, control complexity
Compressed Data Handling | Saves bandwidth, works with dense PEs | Decompression latency
Dynamic Zero-Skipping | No format conversion, low overhead | Only saves compute, not memory
Sparse Accumulation | Reduces redundant writes | Needs extra storage for metadata
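To make these trade-offs concrete, the sketch below contrasts a dense dot product with a dynamically zero-skipping version and a CSR-style compressed version. The data and the per-row compression format are illustrative.

    def dense_dot(a_row, b_col):
        """Dense mode: every MAC is issued regardless of operand value."""
        acc = 0
        for a, b in zip(a_row, b_col):
            acc += a * b
        return acc

    def zero_skip_dot(a_row, b_col):
        """Dynamic zero-skipping: detect zero operands at runtime and skip the MAC."""
        acc = 0
        for a, b in zip(a_row, b_col):
            if a != 0 and b != 0:          # skip the multiply when either operand is zero
                acc += a * b
        return acc

    def to_csr_row(a_row):
        """Compress one row to (values, column indices), a per-row CSR-style format."""
        idx = [j for j, v in enumerate(a_row) if v != 0]
        return [a_row[j] for j in idx], idx

    def csr_dot(vals, idx, b_col):
        """Sparse mode: iterate only over the stored non-zeros of the compressed row."""
        return sum(v * b_col[j] for v, j in zip(vals, idx))

    row = [0, 3, 0, 0, 5, 0, 0, 1]
    col = [2, 0, 4, 1, 1, 0, 7, 2]
    vals, idx = to_csr_row(row)
    print(dense_dot(row, col), zero_skip_dot(row, col), csr_dot(vals, idx, col))   # 7 7 7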
2.3.4 STORE MODULE
2.3.5 CONCURRENCY
Overlapped Compute & Communication
Unified Virtual Addressing (UVA) is a memory addressing feature supported by modern GPUs (such as NVIDIA's
CUDA-capable GPUs) and CPUs, which allows them to share a single virtual address space across both host
(CPU) and device (GPU) memory. To support UVA, the following hardware functionalities are required:
The GPU must have an MMU capable of handling virtual-to-physical address translations, similar to a
CPU MMU.
Support for page tables that map virtual addresses to either CPU or GPU physical memory.
The MMU should handle page faults, allowing data to be migrated between CPU and GPU memory
transparently (if supported by the system).
For systems where the GPU accesses CPU memory directly (e.g., via PCIe), an IOMMU (Input-Output
Memory Management Unit) or SMMU (System MMU) is needed to translate GPU virtual addresses
to CPU physical addresses.
This ensures secure and correct access to host memory when the GPU references a CPU-mapped
address.
The GPU and CPU must agree on a shared virtual address range, meaning pointers can refer to either
CPU or GPU memory without explicit distinction.
The hardware must recognize whether an address belongs to CPU or GPU memory and route accesses
accordingly.
Some systems (like NVIDIA's GPUs with UVA + NVLink or AMD's Infinity Fabric) support varying
degrees of cache coherence between CPU and GPU.
If full coherence is not supported, software must manage data consistency explicitly (e.g., via
CUDA cudaMemcpy or synchronization primitives).
For GPUs connected via PCIe, PCIe Base Address Registers (BARs) must be configured to allow the
GPU to access CPU memory regions.
If multiple GPUs are involved, Peer-to-Peer (P2P) transfers must be supported to allow direct GPU-to-
GPU memory access under UVA.
While not strictly hardware, the operating system and GPU drivers must coordinate memory
allocation and page table management to ensure UVA works correctly.
The OS must allow the GPU driver to manage its own page tables or integrate with the system MMU.
2.4 LLD
2.4.2 TRANSCENDENTAL FUNCTIONS:
Functions that cannot be expressed as a finite combination of algebraic operations (addition, subtraction,
multiplication, division, and root extraction) are discussed in this sub-section. These functions go beyond
algebraic equations and often involve infinite series, integrals, or differential equations.
Hyperbolic Functions: sinh(x), cosh(x)
Special Functions: Gamma function, Bessel functions, Elliptic integrals, etc.
Machine Learning & AI: Used in activation functions (e.g., sigmoid, softmax).
AI Accelerators (TPUs, GPUs, etc.): Hardware optimizations for fast transcendental function computation.
Since transcendental functions involve infinite series, AI accelerators and processors approximate them
efficiently using:
CORDIC (COordinate Rotation DIgital Computer) is a hardware-friendly iterative method for computing:
✅ Used in AI chips, embedded processors, and FPGAs since it requires only shifts and adds (no multiplication).
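A floating-point sketch of CORDIC in circular rotation mode is given below; in hardware the same loop is implemented in fixed point, where the multiplications by 2^-i become arithmetic shifts. The iteration count is an illustrative choice.

    import math

    def cordic_sin_cos(theta, iterations=16):
        """Compute (sin, cos) of theta (|theta| < ~1.74 rad) by iterative micro-rotations.

        Each iteration rotates the vector by +/- atan(2^-i); in hardware the
        multiplications by 2^-i are plain arithmetic shifts.
        """
        atan_table = [math.atan(2.0 ** -i) for i in range(iterations)]
        # Pre-computed gain correction K = prod(1 / sqrt(1 + 2^-2i))
        k = 1.0
        for i in range(iterations):
            k /= math.sqrt(1.0 + 2.0 ** (-2 * i))

        x, y, z = k, 0.0, theta           # start from the gain-corrected unit vector
        for i in range(iterations):
            d = 1.0 if z >= 0 else -1.0   # rotate toward the residual angle
            x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * atan_table[i]
        return y, x                        # (sin(theta), cos(theta))

    s, c = cordic_sin_cos(0.5)
    print(round(s, 4), round(c, 4))        # approx. 0.4794 and 0.8776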
softmax(x_i) = e^{x_i} / Σ_j e^{x_j}
Data types:
Data types play a major role in determining the accuracy of the outcome produced by the AI model. Typically,
training workloads use the FP32 (single-precision floating point) data type for inputs and weights for better
accuracy. Once the model is trained, the trained weight data can be quantized to INT8 for inference without
degrading the outcome of the application. Industry has come up with BF16, which gives the same value range
as FP32 with reduced precision, for better performance. Newer emerging data types such as BF16, FP8, and
INT4 are being considered for futuristic needs.
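As an illustration of INT8 quantization of trained FP32 weights, the sketch below uses symmetric per-tensor scaling. This is one common scheme, not necessarily the exact scheme used by the toolchain.

    import numpy as np

    def quantize_int8(weights_fp32):
        """Symmetric per-tensor quantization of FP32 weights to INT8."""
        scale = np.abs(weights_fp32).max() / 127.0          # map the largest magnitude to 127
        q = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        """Recover an FP32 approximation of the original weights."""
        return q.astype(np.float32) * scale

    w = np.random.randn(64, 64).astype(np.float32)
    q, scale = quantize_int8(w)
    err = np.abs(w - dequantize(q, scale)).max()
    print(q.dtype, f"max abs error = {err:.4f}")            # int8, error bounded by ~scale/2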
Goals:
✔ Minimizes the number of partial sum accumulations to reduce redundant memory accesses.
Weights (B) are preloaded into the systolic array, minimizing weight movement.
Partial sums accumulate downward and are collected in the accumulator units.
While one tile is being processed, the next tile is loaded from DRAM to SRAM.
This ensures full utilization of the systolic array without idle cycles.
Data is striped across multiple DRAM banks to maximize parallel memory access.
- Speeds up execution
- Abstracts data types from SA
Memory level | Latency (ns) | Bandwidth (GB/s) | Power/energy (relative)
Register | ~1 | — | ~0.1
PCIe 4.0 | ~1000+ | 32–64 | ~100+
Note: Bandwidth and power vary by vendor and workload; HBM has higher bandwidth density due to 3D
stacking, even though its access latency is higher.
HBM is more parallelized — thousands of pins, huge bandwidth. DDR is cheaper and lower latency per access,
but doesn't scale for parallel workloads.
An activation function like ReLU, Sigmoid, or Tanh can be computed in as little as a single clock
cycle on an FPGA, depending on its complexity and the implementation method:
ReLU:
Equation: f(x) = max(0, x)
Hardware Implementation: Simple comparison and multiplexer.
Clock cycles: 1 clock cycle
Sigmoid:
Equation: f(x) = 1 / (1 + e^(-x))
Hardware Implementation:
o Can be implemented using LUT-based approximation, CORDIC, or Taylor series.
o CORDIC or Taylor series may take multiple cycles.
Clock cycles:
o CORDIC/Taylor: Multiple cycles (5–15 cycles)
Softmax:
Equation: S(x_i) = e^{x_i} / Σ_j e^{x_j}
In recent years, various activation functions have been explored to meet these
criteria and improve deep learning performance. This survey highlights advancements in
activation functions, providing insights into their characteristics and effectiveness. By
examining different activation functions and their impact, this study aims to contribute
valuable knowledge to the deep learning community.
One of the most important parameters of a CNN model is the activation function.
Activation functions are used to learn and approximate any kind of continuous and complex relationship
between the variables of the network. In simple words, an activation function decides which information
of the model should fire in the forward direction and which should not at the end of the network.
The activation block performs activation functions such as Sigmoid and the ReLU family (ReLU, Leaky ReLU,
PReLU, ReLU6), etc. It also provides a programmable look-up table (LUT) that supports any current or future
activation function, including tanh, sigmoid, MISH, SWISH, etc.
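A behavioral sketch of such a programmable LUT is shown below: the table is filled offline for any chosen function (sigmoid here) over a fixed input range, and a runtime lookup needs only indexing. The range, table depth, and nearest-entry policy are illustrative choices.

    import math

    class ActivationLUT:
        """Programmable look-up table covering [x_min, x_max] with 'entries' samples."""

        def __init__(self, func, x_min=-8.0, x_max=8.0, entries=257):
            self.x_min, self.x_max, self.entries = x_min, x_max, entries
            step = (x_max - x_min) / (entries - 1)
            self.table = [func(x_min + i * step) for i in range(entries)]   # programmed offline

        def lookup(self, x):
            """Nearest-entry lookup; inputs outside the range saturate to the table ends."""
            x = min(max(x, self.x_min), self.x_max)
            i = round((x - self.x_min) / (self.x_max - self.x_min) * (self.entries - 1))
            return self.table[i]

    sigmoid = ActivationLUT(lambda x: 1.0 / (1.0 + math.exp(-x)))
    relu    = ActivationLUT(lambda x: max(0.0, x))       # the same hardware, reprogrammed
    print(round(sigmoid.lookup(0.0), 3), relu.lookup(-2.0))   # 0.5 0.0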
3.1.8 POOLING BLOCK
In most cases, a Convolutional Layer is followed by a Pooling Layer. The primary aim
of this layer is to decrease the size of the convolved feature map to reduce computational
cost. This is achieved by reducing the connections between layers, and pooling operates
independently on each feature map. Depending on the method used, there are
several types of pooling operations. Pooling basically summarises the features generated by a
convolution layer. Two common pooling methods are average pooling and max pooling.
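A short sketch of 2x2 max and average pooling over a feature map is given below; the window size and stride are the common 2x2 / stride-2 choice, used here purely for illustration.

    import numpy as np

    def pool2d(fmap, window=2, stride=2, mode="max"):
        """Reduce a 2-D feature map with non-overlapping max or average pooling."""
        h, w = fmap.shape
        out_h, out_w = (h - window) // stride + 1, (w - window) // stride + 1
        out = np.empty((out_h, out_w), dtype=fmap.dtype if mode == "max" else np.float32)
        for i in range(out_h):
            for j in range(out_w):
                patch = fmap[i * stride:i * stride + window, j * stride:j * stride + window]
                out[i, j] = patch.max() if mode == "max" else patch.mean()
        return out

    fmap = np.arange(16).reshape(4, 4)
    print(pool2d(fmap, mode="max"))      # [[ 5  7] [13 15]]
    print(pool2d(fmap, mode="average"))  # [[ 2.5  4.5] [10.5 12.5]]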
Compute Tile: 4 B x 4 B, which fits into a 128-bit SIMD register for INT8 MAC computations.
Since each 8K × 8K matrix has 64M elements, direct execution on a 64×64 systolic array is not possible.
Instead, we perform hierarchical tiling.
If accumulation cannot be completed in one pass, intermediate sums are stored in on-chip SRAM
instead of DRAM.
SA Accumulation
Instead of sending every partial sum to global memory (slow), partial sums are first accumulated in
registers or SRAM before writing the final result.
Reduction Tree
A reduction tree is a hierarchical summation method that reduces multiple values in parallel instead of
sequentially. Pairs of elements are summed in parallel, reducing the summation depth from O(N) to O(log N),
whereas sequential addition takes O(N) time.
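The sketch below contrasts sequential accumulation with a pairwise reduction tree. Both perform N-1 additions; the benefit of the tree is the O(log N) depth when the pairwise additions execute in parallel.

    def sequential_sum(values):
        """O(N) additions, executed one after another (depth = N - 1)."""
        acc = 0
        for v in values:
            acc += v
        return acc

    def tree_sum(values):
        """Pairwise reduction: the same N - 1 additions, but only log2(N) levels deep."""
        level = list(values)
        while len(level) > 1:
            if len(level) % 2:                      # carry an odd element to the next level
                level.append(0)
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        return level[0]

    partial_sums = [3, 1, 4, 1, 5, 9, 2, 6]
    print(sequential_sum(partial_sums), tree_sum(partial_sums))   # 31 31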
3. Inner Loop (k): Iterate over the shared dimension in steps of 4. This allows loading 4 tiles of A and 4
tiles of B at a time.
4. Load 4 Tiles of A: Load 4 tiles from the same row of Matrix A (e.g., A[i][k:k+4]).
5. Load 4 Tiles of B: Load 4 tiles from the corresponding columns of Matrix B (e.g., B[k:k+4][j]).
6. Compute Partial Sums: Use the loaded tiles to compute partial sums for a 4x4 block of Matrix C.
8. Write Results: After processing all k values, write the final result for the 4x4 block of C to off-chip
memory.
o Each tile of Matrix A is reused for 4 tiles of Matrix B, and each tile of Matrix B is reused for 4
tiles of Matrix A.
o This maximizes the utilization of on-chip memory and minimizes redundant data transfers.
o Loading 4 tiles of A and 4 tiles of B at a time reduces the number of memory accesses by a
factor of 4.
o Partial sums for a 4x4 block of C are accumulated in on-chip memory, reducing the number
of writes to off-chip memory.
o By processing 4x4 blocks of C at a time, the on-chip memory is used more efficiently,
reducing the need to spill partial sums to off-chip memory.
Example Workflow
On-Chip Memory: Use on-chip memory to store 4 tiles of A, 4 tiles of B, and partial sums for a 4x4
block of C.
Off-Chip Memory: Only access off-chip memory to load tiles of A and B and to write the final results
for C.
For larger matrices (e.g., 1024x1024), the same tiling sequence can be applied, but the number of tiles will
increase (e.g., 64x64 tiles for 1024x1024 matrices). The key principles of data reuse and minimizing partial sum
writes remain the same.
By loading 4 tiles from the same row of Matrix A and 4 tiles from the corresponding columns of Matrix B, this
approach significantly reduces memory traffic and improves the efficiency of partial sum handling, making it
highly suitable for hardware accelerators like systolic arrays.
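The loop structure described above can be prototyped directly. The sketch below uses small 8x8 tiles of a 32x32 matrix so it runs quickly, and it simplifies the 4x4 output-block reuse to one output tile per iteration, but the inner loop (k stepped in groups of 4, partial sums kept on chip, a single write-back per tile) mirrors the description.

    import numpy as np

    TILE = 8                                   # stand-in for the 64x64 hardware tile

    def tiled_matmul(A, B, tile=TILE, k_group=4):
        """Tile-level matmul following the loop nest above (k stepped in groups of 4).

        For each output tile C[i][j], groups of 4 A-tiles from row i and 4 B-tiles
        from column j are 'loaded' and their products accumulated in an on-chip
        partial-sum buffer before the finished tile is written back.
        """
        n = A.shape[0] // tile                 # number of tiles per dimension (square matrices)
        C = np.zeros_like(A, dtype=np.float64)
        for i in range(n):                     # tile-row of C
            for j in range(n):                 # tile-column of C
                psum = np.zeros((tile, tile))  # partial sums stay "on chip"
                for k0 in range(0, n, k_group):            # inner loop over k in steps of 4
                    for k in range(k0, min(k0 + k_group, n)):
                        a = A[i*tile:(i+1)*tile, k*tile:(k+1)*tile]   # tile A[i][k]
                        b = B[k*tile:(k+1)*tile, j*tile:(j+1)*tile]   # tile B[k][j]
                        psum += a @ b          # systolic-array tile multiply and accumulate
                C[i*tile:(i+1)*tile, j*tile:(j+1)*tile] = psum        # single write-back per tile
        return C

    A = np.random.rand(32, 32)
    B = np.random.rand(32, 32)
    print(np.allclose(tiled_matmul(A, B), A @ B))   # True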
Optimizations Applied
✅ Tiling Strategy → Uses N×N tiles for efficient on-chip memory usage.
✅ Weight-Stationary Dataflow → B stays fixed in on-chip memory to reduce data movement.
✅ Double Buffering → Overlaps computation & memory fetch to eliminate stalls.
✅ SIMD Vectorization → Computes multiple elements in parallel for faster execution.
✅ Efficient Partial Sum Accumulation → Uses registers to store partial sums, reducing off-chip memory writes.
Tensor parallelism -- is achieved by forking into data parallel streams, then joining them.
Pipeline parallelism -- is achieved by chaining multiple PCUs together to fuse operations and
increase operational intensity.
3.2.2 TOOLS
Tools help maintain a high utilization factor of the hardware across diverse convolutions.
Tools quantize a neural network to make it suitable for low-precision hardware through three
processes: Profiling, Quantization, and Compensation.
– Profiling is a process to acquire statistics of the activations for each channel or layer when
a large number of sample images are fed into the network.
Profiling tools
Quantization tools
The NPU uses offline tools to optimize the code. At runtime, the application processor passes the resulting
optimized command stream to the NPU.
Pass your trained model through the quantization tool. This tool quantizes weights to 8-bit and
activations to 8-bit or 16-bit values.
Pass the quantized model to the compiler. This tool optimizes the model for this NPU and outputs an
optimized model that contains a command stream for the NPU.
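A toy sketch of the profiling step feeding the quantization step is given below (per-channel max statistics producing activation scales). The function names, statistics, and scaling rule are illustrative and do not represent the actual offline tool APIs.

    import numpy as np

    def profile_activations(sample_batches):
        """Profiling: accumulate per-channel max-magnitude statistics over sample inputs."""
        stats = None
        for batch in sample_batches:                       # batch shape: (N, channels)
            batch_max = np.abs(batch).max(axis=0)
            stats = batch_max if stats is None else np.maximum(stats, batch_max)
        return stats

    def activation_scales(stats, bits=8):
        """Quantization: derive a per-channel scale so the observed range fits the integer grid."""
        qmax = 2 ** (bits - 1) - 1                         # 127 for 8-bit, 32767 for 16-bit
        return stats / qmax

    samples = [np.random.randn(512, 16) * (i + 1) for i in range(4)]   # stand-in sample images
    scales = activation_scales(profile_activations(samples), bits=8)
    print(scales.shape, scales.dtype)                      # (16,) float64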
4 CONTRIBUTIONS AND ACKNOWLEDGEMENTS