ONNX and Python to C++:
State-of-the-Art Graph Compilation
Nigel Drego
Co-founder & CTO
Quadric
©2025 Quadric Inc.
About Quadric
Pure play Semiconductor IP Licensing
• Processor IP & Software Tools
Edge / device AI/ML Inference + DSP processing
HQ: Silicon Valley – Burlingame CA
Venture Capital funded
Mar 2021: First working silicon
May 2023: First IP delivery, DevStudio Online
Dec 2024: First multicore cluster delivery
Patents: 28+ Granted
Chimera GPNPU Block Diagram
A Hybrid Between a CPU and a Systolic Array
Quadric SDK Overview
• Chimera runs graph code, C++ code and Python code
• The Chimera SDK runs on user premises or in the user's cloud
• A pre-packaged Docker container version is available
Chimera Graph Compiler (CGC)
• Import
  • Graph analysis (operator compatibility, quantization completeness, custom operator identification)
  • Graph simplification / canonicalization
  • Constant propagation (see the sketch after this list)
• Graph optimization
  • Quantization scheduling (with activation range analysis)
• Graph serialization
  • Fuse operators and select implementations to minimize data movement costs
• Memory optimization
  • Tensor format layout analysis (both L2 and LRM)
  • Intelligent weight prefetching
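As a concrete illustration of the constant-propagation step, here is a minimal sketch (our own, not CGC source): any node whose inputs are all constants is evaluated at compile time and replaced by a constant node; a real pass would repeat this over the whole graph to a fixed point.

#include <cstddef>
#include <memory>
#include <vector>

// Illustrative only: a toy graph node, not CGC's IR.
struct Node {
    enum class Kind { Constant, Add, Mul } kind;
    float value = 0.0f;                         // meaningful when Constant
    std::vector<std::shared_ptr<Node>> inputs;  // empty for constants

    bool allInputsConstant() const {
        for (const auto& in : inputs)
            if (in->kind != Kind::Constant) return false;
        return !inputs.empty();
    }
};

// Fold one node in place if every input is a known constant.
void foldIfConstant(Node& n) {
    if (n.kind == Node::Kind::Constant || !n.allInputsConstant()) return;
    float acc = n.inputs[0]->value;
    for (std::size_t i = 1; i < n.inputs.size(); ++i)
        acc = (n.kind == Node::Kind::Add) ? acc + n.inputs[i]->value
                                          : acc * n.inputs[i]->value;
    n = Node{Node::Kind::Constant, acc, {}};    // producers become dead code
}

int main() {
    auto a = std::make_shared<Node>(Node{Node::Kind::Constant, 2.0f, {}});
    auto b = std::make_shared<Node>(Node{Node::Kind::Constant, 3.0f, {}});
    Node add{Node::Kind::Add, 0.0f, {a, b}};
    foldIfConstant(add);                        // add becomes Constant(5.0f)
    return add.value == 5.0f ? 0 : 1;
}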
[Flow diagram: INT8/16 ONNX input → Graph Import and Optimizations → Successive Lowering & Optimization Passes → Memory Usage Optimizations → C++ output using the CCL API]
Chimera Graph Compiler - High-Level Architecture
• Chimera Graph Compiler (CGC) leverages the TVM project framework
• Architecture-aware, Quadric-specific “middle-end”
• Injects Quadric-specific data layouts early in the lowering flow
• Hardware-aware op mapping and selection
• Memory scheduling to maximize L2MEM & LRM utilization (lowers power)
• Direct control-flow mapping from Relay -> C++
• Creates highly optimized code for the GPNPU
[Pipeline diagram: TF / PyTorch / ONNX → Relay IR → Quadric Operator Mapping Optimizations → Quadric Data-layout Assignment → Quadric LRM/L2 Optimization → Quadric C++. Fully customized middle- and back-end in TVM.]
Layout Optimization
• Transform non-NCHW formats to NCHW where possible
• Many shape-transformation ops can be eliminated, even in Multi-Head Self-Attention
• Example: LayerNorm operating over the channel dimension
  • Solution: analyze dimension dependencies, then sink the transpose across ops (see the sketch below)
• Can handle arbitrary sequences of shape-transformation operations
NCHW = Number of samples, Channels, Height, Width
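A minimal sketch of the transpose-sinking idea (ours, not CGC's actual pass): a dimension-dependent op placed after a transpose is equivalent to the same op on a remapped axis placed before the transpose, so the transpose can sink past it and cancel against later shape ops.

#include <array>
#include <cstddef>

// Axis `axis` of Transpose_perm(x) reads axis perm[axis] of x, so
// LayerNorm_axis(Transpose_perm(x)) == Transpose_perm(LayerNorm_perm[axis](x)).
constexpr std::size_t sinkAxisThroughTranspose(
    const std::array<std::size_t, 4>& perm, std::size_t axis) {
    return perm[axis];
}

int main() {
    // NHWC -> NCHW transpose: output axis i reads input axis perm[i].
    constexpr std::array<std::size_t, 4> nhwcToNchw = {0, 3, 1, 2};
    // LayerNorm over axis 1 (C) of the NCHW result equals LayerNorm over
    // axis 3 (C) of the original NHWC tensor, with the transpose sunk below it.
    static_assert(sinkAxisThroughTranspose(nhwcToNchw, 1) == 3);
    return 0;
}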
Layout Optimization
• Process the graph edge-wise: an edge has a producing operator (e.g., Conv) and a consuming operator (e.g., ReLU)
• Use information about each op’s consumed layout(s) and produced layout(s):
  • “Layout” refers to data placement in the PE array, not in L2 or external memory
  • An op can have multiple implementations, with different consumed/produced layouts
• Enumerate producer/consumer layout pairs and select the pair with the lowest cost (sketched below)
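A sketch of the selection step described above (illustrative; CGC's cost model and layout set are not public): enumerate every producer/consumer implementation pair on an edge and keep the cheapest, charging a penalty when the produced and consumed layouts differ.

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

enum class Layout { RowMajor, ChannelMajor, Tiled };  // placeholder layout names

struct Impl {
    Layout layout;      // layout produced (producer) or consumed (consumer)
    std::int64_t cost;  // modeled cost of running this implementation
};

struct Choice { std::size_t producer, consumer; std::int64_t cost; };

// Cost of reshuffling data in the PE array between layouts (0 when they match).
std::int64_t conversionCost(Layout from, Layout to) {
    return from == to ? 0 : 100;  // placeholder penalty
}

Choice pickLayoutPair(const std::vector<Impl>& producerImpls,
                      const std::vector<Impl>& consumerImpls) {
    Choice best{0, 0, std::numeric_limits<std::int64_t>::max()};
    for (std::size_t p = 0; p < producerImpls.size(); ++p)
        for (std::size_t c = 0; c < consumerImpls.size(); ++c) {
            const std::int64_t total =
                producerImpls[p].cost + consumerImpls[c].cost +
                conversionCost(producerImpls[p].layout, consumerImpls[c].layout);
            if (total < best.cost) best = {p, c, total};
        }
    return best;
}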
[Memory hierarchy diagram: External Memory → L2 Memory → N x N PE array (PE-0,0 … PE-N-1,N-1), each PE with its own Local Register Memory]
Deep LRM-based Fusion
• Other NPUs use “fusion” to mean keeping data on-chip
• Chimera fusion keeps data within each PE’s LRM
• Element-wise ops are fused by keeping values in registers (see the sketch below)
• Other ops fuse via LRM-based fusion:
  • Convolutions
  • MHSA
  • Normalizations (e.g., LayerNorm, GroupNorm)
Fusion in Local Memory (FILM) saves power and increases performance.
Fusion example: Blog link
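The bandwidth argument behind FILM can be shown with a toy example (ours; Chimera's real generated code keeps values in qVar_t registers, as the next slide shows): unfused elementwise ops stream the whole tensor through memory once per op, while the fused loop touches memory once in total.

#include <algorithm>
#include <vector>

// Unfused: three passes, each a full read + write of the tensor.
void scaleBiasReluUnfused(std::vector<float>& x, float scale, float bias) {
    for (auto& v : x) v *= scale;             // pass 1
    for (auto& v : x) v += bias;              // pass 2
    for (auto& v : x) v = std::max(v, 0.0f);  // pass 3
}

// Fused: one pass; the intermediate values never leave registers.
void scaleBiasReluFused(std::vector<float>& x, float scale, float bias) {
    for (auto& v : x)
        v = std::max(v * scale + bias, 0.0f);
}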
CGC Generated Code Example
Architected to Run a Single Code Stream For Scalar, Vector & Matrix Computations
for (int32_t tb_y1 = 0; tb_y1 < 14; ++tb_y1) {
for (int32_t tb_x1 = 0; tb_x1 < 14; ++tb_x1) {
container::NDArray<qVar_t<int8_t>, 27> ocm_tensor_0_rf;
ocm_tensor_0_flow_1.read(ocm_tensor_0_rf);
container::NDArray<qVar_t<int8_t>, 32> T_qlinear_conv2d_rf;
BroadcastFlow<TensorAccessor<MinRoiDescriptor<OcmTensor<std::int8_t, 1, 1, 1, 1280>,... >
for (int32_t ch1 = 0; ch1 < 32; ++ch1) {
qVar_t<int32_t> _0 = nn::convTileBlockInt8<std::int32_t, 27, 1, 0, false>(ocm_tensor_0_rf);
qVar_t<int32_t> _11 = qBroadcast<0, std::int32_t, BroadcastAction::POP>;
T_qlinear_conv2d_rf[(ch1)] =
    math::min(math::max(cgc::fxRoundPosInf<23>(math::fxMul<31>(math::min(math::max(math::fxMul<2>((math::min(math::max(cgc::fxRoundPosInf<2>(math::fxMul<29>((_0 + _11), 4679030)), -128), 127)), 58507024), 0, 3.2212255e+09f, 1231605867)), -128), 127));
}
container::NDArray<qVar_t<int8_t>, 32> T_qlinear_conv2d_rf1;
for (int32_t ch2 = 0; ch2 < 32; ++ch2) {
qVar_t<int32_t> _2 = nn::groupwiseConvTileBlockInt8<std::int32_t, 1, 3, 1, 0, false>(T_qlinear_conv2d_rf,ch2);
qVar_t<int32_t> _3 = qBroadcast<0, std::int32_t, BroadcastAction::POP>;
T_qlinear_conv2d_rf1[(ch2)] = math::min(math::max(cgc::fxRoundPosInf<22>......
[Slide callouts on the code above: scalar variables; Flow/DMA APIs; vector/matrix variables; fused elementwise ops (any activation can be coded up); de-quantize math; fused convolution]
Example of auto-generated convolution code running on Chimera
Custom Operator Support
• The CGC compiler supports hybrid lowering (sketched below)
• Automatic code generation for the whole graph or a partial graph
• Custom implementation of nodes/subgraphs
  • e.g., NMS, proprietary layers, custom operators
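One way to picture hybrid lowering (a sketch under our own assumptions, not CGC internals): the compiler keeps a table of ops it can auto-generate; any node it does not recognize is emitted as a call into a user-supplied C++ custom operator, which the LLVM toolchain links in alongside the generated source.

#include <functional>
#include <map>
#include <string>

// Hypothetical dispatch: known ops get generated kernels; unknown ops fall
// back to a custom-operator call implemented by the user in C++.
std::string lowerNode(
    const std::string& opName,
    const std::map<std::string, std::function<std::string()>>& generators) {
    if (auto it = generators.find(opName); it != generators.end())
        return it->second();                                // auto-generated code
    return "customOp_" + opName + "(inputs, outputs);\n";   // custom C++ op
}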
[Toolflow diagram: ONNX Graph → CGC compiler → auto-generated source (C++); the auto-generated source plus the custom operator (C++) feed an LLVM-based C++ compiler → Quadric binary]
BEVDepth – The Key Component Is in CUDA
Mixed ONNX + CUDA code
BEVDepth is a hybrid algorithm that includes common neural network graph constructs as well as DSP algorithms for voxel pooling, implemented in CUDA. Voxel pooling is ~60% of the algorithmic compute cost, so how and where you run voxel pooling determines overall BEVDepth performance and efficiency.
BEVDepth on dual QC-Ultra (dual Chimera Ultra configuration, 64 TOPS):
  Original repo: GitHub repo
  Network config: nuScenes configurations
  Total ops (G): 700
  Parameters (M): 76.25
  Input size: 6 x 3 x 256 x 704
  Inference rate (FPS, with voxel pooling): 26.30
Reference: arXiv paper
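For orientation, here is a plain scalar reference for what voxel pooling computes (our reading of the data sizes above, not the CUDA or CCL kernel; we drop z for clarity, whereas the CCL code shown later folds z into its channel indexing): each point scatter-adds its feature vector into one BEV grid cell.

#include <array>
#include <cstddef>
#include <vector>

// geomXYZ:  numPoints integer voxel coordinates (x, y, z)
// features: numPoints x numChannels point features
// outGrid:  numChannels x gridH x gridW accumulated BEV features (flattened)
void voxelPoolingReference(const std::vector<std::array<int, 3>>& geomXYZ,
                           const std::vector<std::vector<float>>& features,
                           std::vector<float>& outGrid,
                           int numChannels, int gridH, int gridW) {
    for (std::size_t p = 0; p < geomXYZ.size(); ++p) {
        const int x = geomXYZ[p][0];
        const int y = geomXYZ[p][1];
        if (x < 0 || x >= gridW || y < 0 || y >= gridH) continue;  // off-grid point
        for (int c = 0; c < numChannels; ++c)      // the contended accumulation
            outGrid[(c * gridH + y) * gridW + x]   // that CUDA performs with
                += features[p][c];                  // atomicAdd
    }
}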
BEVDepth – Simple Port from CUDA
[Side-by-side code comparison: CUDA implementation vs. Chimera implementation]
Custom Operator: Voxel Pooling
• Program at the PE level, with awareness of the entire PE array
  • Similar to CUDA’s thread/warp model
• Data access:
  • Flows, with automatic double-buffering
  • Random access
void voxelPooling(…) {
…
auto featuresReadFlow = createGeneralReadFlow (imgFeatures, ocmMemAlloc);
ExtFlow<FlowType::Read, ..> geomInElsFlow{geomXYZ, ocmMemAlloc};
for(std::uint32_t chnOffset = 0; chnOffset < numChannels; chnOffset += core_array::coreDim) {
container::NDArray<qVar_t<std::int32_t>, …> qOut = {0};
for(std::uint32_t row = 0; row < numPoints; row += mroiHeight) {
auto qFeatures = featuresReadFlow.read();
auto geomBuf = geomInElsFlow.template read<false>();
rau::config(geomBuf);
for(std::int32_t mroiRowIdx = 0; mroiRowIdx < numTilesPerMroi; mroiRowIdx++) {
container::NDArray<qVar_t<GeomT>, 3> qGeomXYZ;
container::NDArray<qVar_t<std::uint32_t>, 3> qAddrs;
…
rau::load::tiles(qAddrs, qGeomXYZ, geomBuf);
qVar_t<std::int16_t> qX = qGeomXYZ[0];
qVar_t<std::int16_t> qY = qGeomXYZ[1];
qVar_t<std::int16_t> qZ = qGeomXYZ[2];
qVar_t<std::int16_t> qBaseChnIdx = qZ * (OutShape::NUM_CHN + core_array::coreDim) + chnOffset + qCol<>;
voxelAtomicAdd<OutShape>(qFeatures[mroiRowIdx], qBaseChnIdx, qY, qX, qOut);
}
}
rau::config(outGrid);
qVar_t<std::uint32_t> qChnIdx = chnOffset + qCol<>;
for(std::size_t i = 0; i < qOut.size(); i++) {
qVar_t<std::uint32_t> qRowIdx = i / OutShape::NUM_TILES_PER_COL;
qVar_t<std::uint32_t> qColIdx = (i % OutShape::NUM_TILES_PER_COL) * core_array::coreDim + qRow<>;
rau::store::oneTile(0, qChnIdx, qRowIdx, qColIdx, qOut[i], outGrid);
}
}
}
Voxel Pooling: Atomic Add
• Atomic add implemented without atomic primitives
• All data is exchanged within the PE array: each PE packs (distance, value, channel) into a single word and rotates it through its north/south neighbors; a PE accumulates an incoming value when the packet’s distance matches the current shift
• Huge performance gain relative to CUDA
void voxelAtomicAdd(…) {
static_assert(sizeof(InT) == 1, "Input type must be 1 byte type.");
qVar_t<std::int8_t> qValid = qArrayCore & (qColIdx >= 0) & (qColIdx < OcmShape::NUM_COLS) & (qRowIdx >= 0) &
(qRowIdx < OcmShape::NUM_ROWS) & (qBaseChnIdx >= 0) &
(qBaseChnIdx < OcmShape::NUM_CHN);
qVar_t<std::int8_t> qDist = std::numeric_limits<std::int8_t>::min();
qVar_t<std::uint16_t> qChn = 0;
if(qValid) {
qVar_t<std::int16_t> qLinearizedIdx = qRowIdx * 128 + qColIdx;
qDist = qLinearizedIdx % core_array::coreDim - qRow<>;
qChn = qLinearizedIdx / core_array::coreDim;
}
container::NDArray<qVar_t<std::int8_t>, 2> qUnpacked1 = {qDist, qVal};
auto qPacked1 = qUnpacked1.template asReinterpretCast<qVar_t<std::uint16_t>, 1>();
container::NDArray<qVar_t<std::uint16_t>, 2> qUnpacked = {qPacked1[0], qChn};
auto qPacked = qUnpacked.template asReinterpretCast<qVar_t<std::uint32_t>, 1>();
qOut[qChn] += qVal * (qDist == 0);
setCleanStateForNeighborMovement<InT>(0);
qNorthSouthBcast<std::uint32_t> = qPacked[0];
#pragma unroll
for(std::int32_t move = 1; move < core_array::coreDim; move++, rot180()) {
container::NDArray<qVar_t<std::uint32_t>, 1> qPackedSouth = {qSouth<std::uint32_t>};
auto qUnpacked1South = qPackedSouth.template asReinterpretCast<qVar_t<std::uint16_t>, 2>();
qVar_t<std::uint16_t> qChnSouth = qUnpacked1South[1];
auto qUnpackedSouth =
(qUnpacked1South.template slice<1>(0)).template asReinterpretCast<qVar_t<std::int8_t>, 2>();
qOut[qChnSouth] += qUnpackedSouth[1] * (qUnpackedSouth[0] == -move);
container::NDArray<qVar_t<std::uint32_t>, 1> qPackedNorth = {qNorth<std::uint32_t>};
auto qUnpacked1North = qPackedNorth.template asReinterpretCast<qVar_t<std::uint16_t>, 2>();
qVar_t<std::uint16_t> qChnNorth = qUnpacked1North[1];
auto qUnpackedNorth = (qUnpacked1North.template slice<1>(0)).template asReinterpretCast<qVar_t<std::int8_t>, 2>();
qOut[qChnNorth] += qUnpackedNorth[1] * (qUnpackedNorth[0] == move);
}
}
Combine Model + Custom Operator w/ Python
@chipy.ccl_custom_op(ccl_func_name="voxelPooling")
def voxel_pooling(geom_xyz, img_feats):
    # Add python implementation to check CCL implementation against
    pass


@chipy.func()
def full_model(model_path, imgs, points, ida_mats):
    geom_xyz, feats = chipy.infer_onnx(model_path, ida_mats, imgs, points)
    return voxel_pooling(geom_xyz, feats)
Voxel Pooling: Chimera Performance
Twice the Performance
1/8th of the Bandwidth Required
10 - 80x Lower Power*
                                  Nvidia RTX 3090               Quadric QC-Ultra
                                  (full chip)                   (quad-core IP)
Coding language                   CUDA C++                      CCL C++
Clock freq                        1.4 GHz                       1.4 GHz
ALU capacity (ALU count)          41,984                        4,096
                                  (from 10,496 shader cores)
Memory bandwidth                  936 GB/s                      128 GB/s (32 GB/s per core)
Voxel pooling performance         23 ms                         11 ms
(1 iteration)

Voxel pooling data sizes: geom_xyz [6, 112, 16, 44, 3]; img_feat [6, 112, 16, 44, 80]; out_grid [80, 128, 128]
* The RTX 3090 card hosts a GA102 GPU: 8 nm process, 628 mm², 450 W (max). Voxel pooling power is difficult to measure directly on the RTX 3090. Quadric power estimates are for the cores only, not including the system memory interface.
A full, detailed tutorial with complete voxel pooling source code is available on DevStudio.
Benefit of Pure Compiler Approach
Unlocking New Networks Far Faster Than Manual Porting
• Total number of demonstration networks on Quadric’s online DevStudio
• Shows the rapid “unlocking” of new networks that automatically compile, purely from source repos without any modifications, with successive compiler releases
Release Over Release Performance Upgrades
Code generation optimizations continue to unlock large performance gains, compounding over time
19
Up to 50%
23.08
Up to 33%
23.10
Up to 42%
24.01
Up to 22%
24.04
Up to 76%
24.07
Up to 106%
24.09
Up to 68%
24.12
• Percentages represent the largest release-over-release performance gains for existing networks (different networks in each comparison period)
• All networks auto-compiled from original FP32 source repos: no network mods, no layer changes, no manual preparation
CGC Compiler Maturity
Memory Management Refinement
[Chart: “required” L2 memory (MBytes) needed to run ResNet-50, tracked over the two-year maturity curve from the first functional release of the CGC graph compiler to now; it now works in 1 MB]
Conclusions
• A pure graph compiler approach to AI/ML inference yields far better results than manual porting and optimization
• Developer productivity: most networks “just compile” out of the box
• Compiler maturity: “free” improvements with each release into the future
• C++ programmability: runs more than just “graph code”
• Programming in C++, graph code and Python makes embedded AI inference almost as easy as datacenter AI training & inference
Resources @ EVS’25
Visit us on the web: https://www.quadric.io
Try Quadric DevStudio: https://studio.quadric.io/
2025 Embedded Vision Summit: please visit us at Booth #821
Thank You