Day5_03: Converting Neural Network Models into Optimized Code
Amit Kumar
5th Aug 2021
STM32Cube.AI
A comprehensive STM32 AI ecosystem
[Ecosystem diagram: application frameworks and applicative examples (Function Packs) on top of an AI model converter, with pre- and post-processing libraries, a quantizer, a graph optimizer, and a memory optimizer, all targeting edge hardware]
A tool to seamlessly integrate AI (machine learning and deep learning) into your projects
The 3 pillars of STM32Cube.AI
• Graph optimizer: automatically improve performance through graph simplifications & optimizations that benefit STM32 target HW architectures
• Quantized model support: import your quantized ANN to be compatible with STM32 embedded architectures while keeping its performance
• Memory optimizer: optimize memory allocation to get the best performance while respecting the constraints of your embedded design

STM32Cube.AI is free of charge, available both as a graphical interface and as a command-line tool.
Graph optimizer
• Loss-less conversion: the graph simplifications do not change the model's numerical behavior
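One classic example of a loss-less graph simplification is folding a batch-normalization layer into the weights of the preceding layer. The sketch below shows the idea for a dense layer in plain Python; it is illustrative only, not ST's actual implementation.

```python
# Illustrative sketch of a loss-less graph optimization: folding a
# batch-normalization layer into the preceding dense layer's weights,
# so one layer (and its memory traffic) disappears from the graph.

def dense(x, w, b):
    """y[j] = sum_i x[i] * w[i][j] + b[j]"""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Per-channel normalization: gamma * (y - mean) / sqrt(var + eps) + beta"""
    return [gamma[j] * (yj - mean[j]) / (var[j] + eps) ** 0.5 + beta[j]
            for j, yj in enumerate(y)]

def fold_bn_into_dense(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return (w2, b2) such that dense(x, w2, b2) == batchnorm(dense(x, w, b))."""
    scale = [gamma[j] / (var[j] + eps) ** 0.5 for j in range(len(b))]
    w2 = [[w[i][j] * scale[j] for j in range(len(b))] for i in range(len(w))]
    b2 = [(b[j] - mean[j]) * scale[j] + beta[j] for j in range(len(b))]
    return w2, b2
```

After folding, the two-layer subgraph is replaced by a single dense layer that produces identical outputs, which is what "loss-less" means here.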
Quantized model support
Simply use quantized networks to reduce memory footprint and inference time.

STM32Cube.AI supports quantized neural network models with all parameter formats:
• FP32
• Int8
• Mixed binary Int1 to Int8 (QKeras*, Larq.dev*)

*Please contact [email protected] to request the relevant version of STM32Cube.AI.

[Chart: latency & memory comparison for quantized models; Flash (kB) vs latency (ms) for FP32, Int8, and Int1 + Int8 variants of the same model]
HW target: NUCLEO-STM32H743ZI2 @ 480 MHz
Model: low-complexity handwritten digit reading
Accuracy: >97% for all quantized models
Tested database: MNIST dataset
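As background on what "Int8" support means, here is a minimal sketch of 8-bit affine quantization, the common scheme used by upstream tools such as TensorFlow Lite. It is illustrative only; STM32Cube.AI imports models that were already quantized by such tools.

```python
# Minimal sketch of 8-bit affine quantization: real values are mapped
# to int8 through a scale and a zero-point. Weights shrink from
# 4 bytes (FP32) to 1 byte (int8), at the cost of a small rounding error.

def quantize_params(xmin, xmax, qmin=-128, qmax=127):
    """Derive scale and zero-point covering the range [xmin, xmax]."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x, scale, zp, qmin=-128, qmax=127):
    return max(qmin, min(qmax, round(x / scale) + zp))

def dequantize(q, scale, zp):
    return (q - zp) * scale

scale, zp = quantize_params(-1.0, 1.0)
vals = [-1.0, -0.5, 0.0, 0.25, 1.0]
restored = [dequantize(quantize(v, scale, zp), scale, zp) for v in vals]
# Each restored value differs from the original by at most ~scale/2.
```

The scatter plot above reflects exactly this trade: Int8 and Int1 + Int8 points need far less Flash and run faster, while accuracy stays above 97% for this model.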
Memory optimizer
[Bar chart comparing latency (ms), Flash, and RAM for the same model deployed with X-CUBE-AI v7.2.0 vs TFLite for Microcontrollers (TFLm) v2.7.0; lower is better for all three]
HW target: STM32H723 (Flash: 1 Mbyte, RAM: 564 Kbytes) @ 550 MHz
SW versions: X-Cube.AI v7.2.0, TFLm v2.7.0
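The intuition behind an activation-memory optimizer can be sketched in a few lines: in a sequential network only the current layer's input and output need to be alive at the same time, so activation buffers can be reused instead of allocated once per tensor. The tensor sizes below are invented for illustration, and the scheme is far simpler than what the tool actually does.

```python
# Why buffer reuse shrinks RAM: compare allocating one buffer per
# activation tensor against reusing buffers in a purely sequential
# (linear) graph, where peak memory is the largest input+output pair.

def naive_ram(tensor_sizes):
    """One dedicated buffer per intermediate tensor."""
    return sum(tensor_sizes)

def reuse_ram(tensor_sizes):
    """Linear graph: peak = max over layers of (input size + output size)."""
    return max(tensor_sizes[i] + tensor_sizes[i + 1]
               for i in range(len(tensor_sizes) - 1))

sizes = [3072, 16384, 8192, 4096, 256, 10]   # bytes per activation tensor (made up)
print(naive_ram(sizes), reuse_ram(sizes))    # reuse needs noticeably less RAM
```

Real allocators must also handle branches and residual connections, where several tensors overlap in time, but the principle is the same: pack live tensors into the smallest possible arena.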
Making Edge AI possible with all STM32 portfolio
STM32Cube.AI is compatible with all STM32 series
• STM32MP1 (MPU): 4158 CoreMark, up to 800 MHz Cortex-A7 + 209 MHz Cortex-M4
• STM32WL (wireless MCU): 162 CoreMark, 48 MHz Cortex-M4 + 48 MHz Cortex-M0+
• STM32WB (wireless MCU): 216 CoreMark, 64 MHz Cortex-M4 + 32 MHz Cortex-M0+
• Introducing support for mixed-precision quantization and binary neural networks (BNN) for STM32
• Supporting pre-trained quantized models from:
  • qKeras
  • Larq
• Addition of a new kernel performance enhancement for further optimization of memory footprint and power consumption
• Extended support of scikit-learn ML algorithms
• Support for TensorFlow v2.9 models
• Support for new ONNX operators (refer to the documentation for an exhaustive list)
STM32Cube.AI user flow (1/3)
STM32Cube.AI user flow (2/3)
Step 1: Select MCU
Step 2: Optimize and validate
• Model complexity and footprint analysis
• Fine-tune memory allocation with optimizations and the GUI
• Optimize system parameters and the clock tree
• Extend the model with your own custom layers
STM32Cube.AI user flow (3/3)
Step 1: Select MCU
Step 2: Optimize and validate
• Generate an application template
• Integrate your application-specific code in your favorite IDE
• Perform system tests
Possible conversion strategies:
Network code generation and interpreter
• The X-CUBE-AI Expansion Package integrates a specific path that generates a ready-to-use STM32 IDE project embedding a TensorFlow Lite for Microcontrollers run-time (also called TFLm) and its associated TFLite model. This can be considered an alternative to the default X-CUBE-AI solution for deploying an AI application based on a TFLite model.
Possible conversion strategies:
Network code generation and interpreter
• More flexible: TensorFlow Lite interpreter mode, with the TFLm run-time running on the STM32 device
• More optimized: optimized C code generated by STM32Cube.AI, with its own network run-time
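The trade-off between the two strategies can be shown with a toy contrast (the operations below are invented, not real kernels): an interpreter dispatches over a stored graph description at run time, while code generation emits a fixed call sequence for one specific graph.

```python
# Toy contrast: interpreter (flexible, per-op dispatch overhead) vs
# generated code (fixed call sequence, no dispatch, no graph storage).

OPS = {
    "add1": lambda x: x + 1,
    "double": lambda x: x * 2,
}

graph = ["add1", "double", "add1"]   # model description kept at run time

def run_interpreter(x, graph):
    """Walk the stored graph, looking up each op as we go."""
    for op in graph:
        x = OPS[op](x)
    return x

def run_generated(x):
    """What a code generator would emit for this particular graph."""
    x = x + 1
    x = x * 2
    return x + 1

assert run_interpreter(3, graph) == run_generated(3) == 9
```

The interpreter can run any graph built from its op set without recompiling; the generated version trades that flexibility for smaller footprint and faster execution, which is the distinction the slide draws.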
STM32Cube.AI vs TensorFlow Lite for Microcontrollers
Results with STM32Cube.AI v7.2
Model: image classification; MCU/board: STM32U585, NUCLEO-U575ZI-Q

Runtime            Inference time (ms)   Flash (KiB)   RAM (KiB)
X-CUBE-AI          148                   142           50
TFLM               253                   149           55
% TFLM vs Cube.AI  +71                   +5            +9
Two user interfaces of X-CUBE-AI
• To provide a better user experience, X-CUBE-AI offers two user interfaces: a GUI and a CLI
• The GUI can be used within STM32CubeMX
« stm32ai » main command(s)
Input: a pre-trained model (topology and weights)
Reported outputs:
• Memory requirements (ro/rw)
• Processing requirements (MACC)
• Complexity per layer
Analyze command
• The 'analyze' command is the primary command used to import, parse, check, and render an uploaded pre-trained model. Its detailed report provides the main system metrics needed to know whether the generated code can be deployed on an STM32 device. It also includes rendering information by layer and/or operator.
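The "processing requirements (MACC)" figure counts multiply-accumulate operations. The standard per-layer formulas can be sketched as follows; the layer shapes below are made up for illustration, and this is not ST's exact accounting.

```python
# Back-of-the-envelope MACC counting, the kind of figure 'analyze'
# reports per layer: one MACC = one multiply + one accumulate.

def dense_macc(n_in, n_out):
    """Fully connected layer: every input feeds every output."""
    return n_in * n_out

def conv2d_macc(h_out, w_out, k_h, k_w, c_in, c_out):
    """2-D convolution: one kernel window per output element."""
    return h_out * w_out * k_h * k_w * c_in * c_out

# A tiny hypothetical digit-reading CNN:
total = (conv2d_macc(26, 26, 3, 3, 1, 8)     # 3x3 conv, 1 -> 8 channels
         + conv2d_macc(11, 11, 3, 3, 8, 16)  # 3x3 conv, 8 -> 16 channels
         + dense_macc(16 * 5 * 5, 10))       # final classifier
print(total)
```

Dividing such a MACC count by the MCU's effective MACC/cycle throughput gives a rough latency estimate, which is why the metric matters for deployment decisions.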
Validate command
• The 'validate' command allows you to import, render, and validate the generated C files.
Validation
• This step covers the different metrics (and associated computing flows) used to evaluate the performance of the generated C files (the C model). The proposed metrics should be considered generic indicators that allow the predictions of the C model to be compared numerically against the predictions of the original model.
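Typical indicators for such a numerical comparison can be sketched as follows. This is illustrative; refer to the X-CUBE-AI documentation for the exact metrics the tool reports.

```python
# Generic error indicators for comparing C-model outputs against the
# original model's predictions on the same inputs.
import math

def mae(ref, out):
    """Mean absolute error."""
    return sum(abs(r - o) for r, o in zip(ref, out)) / len(ref)

def rmse(ref, out):
    """Root mean squared error."""
    return math.sqrt(sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref))

def l2r(ref, out):
    """Relative L2 error: ||ref - out|| / ||ref||."""
    num = math.sqrt(sum((r - o) ** 2 for r, o in zip(ref, out)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den

ref = [0.10, 0.70, 0.20]   # original-model output (e.g. softmax scores)
out = [0.11, 0.69, 0.20]   # C-model output (e.g. after int8 quantization)
print(mae(ref, out), rmse(ref, out), l2r(ref, out))
```

Small values on all three indicators give confidence that the generated C model numerically matches the original; for classifiers, top-class agreement over a test set is a useful complementary check.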
Don't go alone
Find out more at www.st.com/stm32ai