tinyML Development TFL CMSIS Ethos-U55


tinyML development with TensorFlow Lite for Microcontrollers using CMSIS-NN and Ethos-U55

AI Virtual Tech Talks Series

Fredrik Knutsson, Felix Johnny Thomasmathibalan


June 30, 2020
AI Virtual Tech Talks Series

Date                               Title                                                                                Host
See Arm’s YouTube for recording    Machine learning for embedded systems at the edge                                    Arm and NXP
Today                              TensorFlow Lite for Microcontrollers using Arm’s CMSIS-NN and Ethos-U55              Arm
July 14                            Demystify artificial intelligence on Arm MCUs                                        Cartesiam.ai
July 28                            Speech recognition on Arm Cortex-M                                                   Fluent.ai
August 11                          Getting started with Arm Cortex-M software development and Arm Development Studio    Arm
August 25                          Efficient ML across Arm from Cortex-M to Web Assembly                                Edge Impulse

Visit: developer.arm.com/solutions/machine-learning-on-arm/ai-virtual-tech-talks
© 2020 Arm Limited
Today’s speakers

• Fredrik Knutsson, ML Software Team Lead
• Felix Johnny Thomasmathibalan, ML Engineer


Agenda
• TensorFlow Lite for Microcontrollers (TFLu)
• CMSIS-NN
  • Neural network kernels developed to maximize performance on Cortex-M CPUs
• Ethos-U55
  • A new class of machine learning (ML) processor, called a microNPU, specifically designed to accelerate ML inference in area-constrained embedded and IoT devices
• Integration: TFLu, Ethos-U55 and CMSIS-NN
  • CMSIS-NN and Ethos-U55 integrated with TensorFlow Lite for Microcontrollers
• Demo: CMSIS-NN / TFLu speed-up on Arduino


TensorFlow Lite for Microcontrollers (TFLu)
• Version of TensorFlow Lite designed to execute neural networks on microcontrollers, starting at only a few kB of memory
• Designed to be portable even to 'bare metal' systems
• The core runtime is ~20 kB.
• Examples/demos
  • Micro speech: Detects simple commands such as yes, no and silence.
  • Person detection: Detects whether a person is in the room or not.
  • Magic wand (gesture recognition), image recognition demos, etc.
• Can generate projects for multiple platforms, for example Mbed OS and Arduino
• Over 50 operators currently supported, and growing quickly
• Many integrated operator optimizations


CMSIS-NN
Efficient neural network kernels for Arm Cortex-M CPUs via TFLu
CMSIS
Pathway to the Arm ecosystem

• Cortex Microcontroller Software Interface Standard
• Consistent, generic, and standardized software building blocks
• Available for all Cortex-M and Cortex-A5, Cortex-A7 and Cortex-A9 processors
• Open source, with public development on GitHub: https://github.com/ARM-software/CMSIS_5

• 6,000+ devices supported with CMSIS
• Used in many projects: > 1,200,000 source files public on GitHub
• Device family packs: > 3,000,000 pack downloads in past 6 months
CMSIS-NN
Part of CMSIS that provides optimized ML kernel implementations

[Diagram: CMSIS software stack]
• Top level: Application code, TFLu and Debugger
• CMSIS-Pack
• CMSIS-RTOS: Real-time execution
• CMSIS-NN: Machine learning
• CMSIS-DSP: Signal processing
• CMSIS-Driver: Middleware interface
• Peripheral HAL: Device specific
• CMSIS-SVD: Peripheral description
• CMSIS-CORE: Processor core and peripheral access
• CMSIS-DAP: Debug access
• CMSIS-Zone: System partitioning
• System-on-chip: Arm® Cortex® processor, communication peripherals, specialized peripherals, CoreSight™ debug logic, Access Filter (MPU, SAU)


8-bit MAC as SIMD operation
Load data -> MAC -> Load data -> MAC -> ... -> Save data

DSP Extension
• A max capability of 2 MACs/cycle.
• Cortex-M4 processor: 1 MAC/cycle
• Cortex-M7 processor: 2 MAC/cycle (dual issue)

M-profile Vector Extension (Helium tech.)
• Cortex-M55 processor: 8 MAC/cycle
• MAC operands use vector registers (128 bit) and the result is stored in a 32-bit GP register.
• $y \mathrel{+}= \sum_{i=1}^{16} a_i \cdot b_i$, in two cycles

[Diagram: two 128-bit vector registers (from Q0 to Q7) hold the operands a1 ... a16 and b1 ... b16; the products are accumulated into y in a 32-bit GP register (R0 to R12)]

MAC: Multiply Accumulate
SIMD: Single Instruction Multiple Data
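The accumulate step on this slide can be illustrated with a small plain-Python model. This is only a sketch of the arithmetic, not of the Helium instructions themselves: the function names `mac16_int8` and `dot_int8` are hypothetical, and register widths, overflow behaviour and cycle timing are deliberately not modelled.

```python
def mac16_int8(acc, a, b):
    """Return acc plus the dot product of two 16-element int8 vectors.

    Models one "16 lanes in, one scalar accumulator out" step.
    """
    if len(a) != 16 or len(b) != 16:
        raise ValueError("each operand must hold exactly 16 lanes")
    if any(not -128 <= v <= 127 for v in (*a, *b)):
        raise ValueError("lanes must fit in a signed 8-bit value")
    return acc + sum(x * y for x, y in zip(a, b))


def dot_int8(a, b):
    """Dot product of two equal-length int8 sequences, 16 lanes per step.

    Follows the slide's pattern: load data -> MAC -> load data -> MAC -> ...
    """
    if len(a) != len(b) or len(a) % 16 != 0:
        raise ValueError("lengths must match and be a multiple of 16")
    acc = 0
    for i in range(0, len(a), 16):
        acc = mac16_int8(acc, a[i:i + 16], b[i:i + 16])
    return acc
```

For example, `dot_int8([127] * 32, [127] * 32)` takes two 16-lane steps. The result is far outside the 8-bit range, which is why the accumulator needs to be wider than the operands.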
Performance Results - TFLu runtime with CMSIS-NN
On a Cortex-M55 system

• These numbers show current CMSIS-NN performance improvements on an FPGA reference system
• Continuously improving performance

[Chart: Ref Kernels vs. CMSIS-NN on Cortex-M55 for MobileNet V2 and Wav2letter; speed-up labels 10x and 11x]


Ethos-U55: Accelerating ML Compute further using microNPUs
Ethos-U55: First microNPU for Cortex-M CPUs
• Neural network processor for Cortex-M systems
  • Works alongside Cortex-M55, Cortex-M7, Cortex-M33 and Cortex-M4 processors
• Designed for embedded type systems
  • Fast on-chip SRAM and a slower system flash
• Heavy compute operators for CNN and RNN accelerated in hardware.
• Support for efficient weight compression
  • Compression typically offline
  • Decompression on-the-fly
• Configurations: 32, 64, 128 or 256 MAC/cc (MACs per clock cycle)
  • 8-bit activations use 1 cc per MAC
  • 16-bit activations use 2 cc per MAC

[Block diagram: Arm Ethos-U55 microNPU with a configurable MAC Engine, Elementwise Engine, Local Memory, Weight Decode, Control Unit and DMA, connected to a Cortex-M CPU, System FLASH and System SRAM]
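The MAC/cc configurations and the per-MAC cycle costs listed above allow a rough, idealized cycle estimate. The helper below is a hypothetical sketch (the name `ideal_mac_cycles` is not an Arm API), and it assumes the MAC engine is fully utilized. It ignores memory bandwidth, weight decoding and scheduling, so real figures will be higher.

```python
import math

# Per-MAC cost in clock cycles, as listed on the slide.
CC_PER_MAC = {8: 1, 16: 2}
# MAC engine configurations listed on the slide.
CONFIGS = (32, 64, 128, 256)


def ideal_mac_cycles(num_macs, macs_per_cc=256, activation_bits=8):
    """Idealized lower bound on the clock cycles spent on MAC operations."""
    if macs_per_cc not in CONFIGS:
        raise ValueError("unsupported MAC/cc configuration")
    if activation_bits not in CC_PER_MAC:
        raise ValueError("activations must be 8 or 16 bits")
    return math.ceil(num_macs * CC_PER_MAC[activation_bits] / macs_per_cc)
```

Under these assumptions, a layer with one million MACs needs at least 3,907 cycles on the 256 MAC/cc configuration with 8-bit activations, and 62,500 cycles on the 32 MAC/cc configuration with 16-bit activations.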


Ethos-U55 Optimized Software Flow

Host computer (Offline)
• Train network in TensorFlow
• Quantize it to Int8 TFL flatbuffer file (.tflite file)
• Vela compiler identifies graphs to run on Ethos-U55
  • Optimizes, schedules and allocates these graphs
  • Lossless compression, reducing size of tflite file

Target / Device
• Runtime executable file on device
• Accelerates kernels on Ethos-U55. Driver handles the communication
• The remaining layers are executed on Cortex-M
  • CMSIS-NN optimized kernels if available
  • Fallback on the TFLu reference kernels

[Diagram: numbered flow (1 to 5) from the TF Framework through the TF quantization tooling (TOCO), the TFL flatbuffer file and the Vela compiler on the host computer, to the TFLu Runtime on the target, which uses the Ethos-U55 Driver for the Ethos-U55 microNPU, and CMSIS-NN optimized kernels or Ref. kernels on the Cortex-M CPU]

TFL: TensorFlow Lite
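The "Quantize it to Int8" step maps real values to 8-bit integers through a scale and a zero point, following the relation real_value = (int8_value - zero_point) * scale used by TensorFlow Lite. The functions below are a minimal illustration of that mapping for a single value. The names are hypothetical, and this is not the converter's implementation, which also has to choose the scale and zero point from calibration data.

```python
INT8_MIN, INT8_MAX = -128, 127


def quantize(real_value, scale, zero_point):
    """Map a real value to int8, saturating at the int8 limits.

    Note: Python's round() rounds halves to even, which is a
    simplification chosen for this sketch.
    """
    q = round(real_value / scale) + zero_point
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate real value from its int8 representation."""
    return (q - zero_point) * scale
```

With `scale=0.5` and `zero_point=-3`, the value 1.0 quantizes to -1 and dequantizes back to 1.0. Values far outside the representable range saturate at -128 or 127.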
Vela Compiler
A Python-based optimizer executed on your computer

• Reads a tflite file, writes a modified tflite file
• Generates commands for microNPU
• Optimizes scheduling of subgraphs
• Lossless compression of weights
• Reduces SRAM and Flash footprint
  • Enabling networks previously not feasible in embedded systems!
• Open source

[Diagram: the Vela compiler rewrites a graph of "other", conv, pool, FC and "other" operators so that conv, pool and FC become a single microNPU custom op, while the "other" operators are left unchanged]
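The graph rewrite described on this slide can be illustrated with a toy partitioner. This is not how Vela is implemented: it assumes a purely linear chain of operators, and the supported-operator set and the tuple used to stand in for the microNPU custom op are hypothetical choices made for this sketch.

```python
def partition(ops, npu_supported=frozenset({"conv", "pool", "FC"})):
    """Group each run of consecutive NPU-supported ops into one custom op.

    A custom op is represented as a tuple of the ops it replaces;
    unsupported ops are passed through unchanged as strings.
    """
    result = []
    for op in ops:
        if op not in npu_supported:
            result.append(op)
        elif result and isinstance(result[-1], tuple):
            # Extend the custom op that is currently being built.
            result[-1] = result[-1] + (op,)
        else:
            # Start a new custom op.
            result.append((op,))
    return result
```

For the operator sequence on the slide, `partition(["other", "conv", "pool", "FC", "other"])` keeps both "other" operators on the CPU and folds the three middle operators into one custom op.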


Ethos-U55 Performance Results
Using 256 MACs/Cycle configuration vs. Cortex-M4 using CMSIS-NN optimizations

[Charts: Cortex-M4 (CMSIS-NN) vs. Ethos-U55 for Wav2letter and MobileNet V2; speed-up labels 280x and 300x]
Ethos-U55 & CMSIS-NN: Integration with TensorFlow Lite for Microcontrollers
Software Stack Integration
Add CMSIS-NN and Ethos-U55 under the same stack

• TFLu is built as a lib, then linked with the application
• Optimized kernels are enabled by using "TAGS" in the TFLu build system
• Software is open source
  • Vela compiler, Ethos-U55 driver, TFLu and CMSIS-NN

[Diagram: Application on top of TF Lite micro; below it, Reference and CMSIS-NN kernels run on the Cortex-M, and the Ethos-U55 driver targets the Ethos-U55]


Optimize Where it Matters…
…and always have a fallback path

• Reference kernels always a possibility
• For more horsepower - CMSIS-NN
• For most horsepower - Ethos-U55

Kernel      TFLu reference implementation    CMSIS-NN (fast)    NPU (faster)
Kernel 1    ✓                                ✓                  ✓
Kernel 2    ✓                                ✓                  ✓
Kernel 3    ✓                                ✓                  ✓
Kernel 4    ✓                                ✓                  ✓
Kernel 5    ✓                                ✓
Kernel 6    ✓
Kernel 7    ✓
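The fallback idea on this slide amounts to a priority lookup: take the fastest implementation that exists for a kernel, otherwise fall back to the reference one. The sketch below models only that decision. In the actual flow the choice is made offline by the Vela compiler and at build time through TAGS, not by a runtime function like this, and the names used here are hypothetical.

```python
# Fastest first; "reference" is the guaranteed fallback.
PRIORITY = ("ethos-u", "cmsis-nn")


def select_implementation(available, priority=PRIORITY):
    """Pick the highest-priority implementation available for a kernel.

    `available` is the set of optimized implementations that exist for
    the kernel; the reference implementation is assumed to always exist.
    """
    for name in priority:
        if name in available:
            return name
    return "reference"
```

For example, a kernel with both optimized variants resolves to "ethos-u", one with only a CMSIS-NN variant resolves to "cmsis-nn", and one with neither resolves to "reference".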


Build TFLu with Ethos-U55 and CMSIS-NN
Access to optimized kernels through TFLu, simple example
• Step 1: Clone TensorFlow repository from GitHub
git clone https://github.com/tensorflow/tensorflow

• Step 2: Compile it using TAGS, listed in priority order.
make -f tensorflow/lite/micro/tools/make/Makefile TAGS="ethos-u cmsis-nn" TARGET=<your cortex-m plus ethos-u55 board> person_detection_int8


Demo: Person detection with CMSIS-NN and TFLu
The Hardware
Arduino Nano 33 BLE Sense + Arducam Mini 2MP Plus
• Powered by Arm’s Cortex-M4 CPU
• 1 MB flash, 256 kB SRAM, 64 MHz


Step-by-step
Utilize CMSIS-NN in TFLu on an Arduino Nano 33 BLE Sense
• Step 1 (optional): Clone TensorFlow repository from GitHub
git clone https://github.com/tensorflow/tensorflow

• Step 2 (optional): Generate an Arduino project
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arduino TAGS=cmsis-nn generate_arduino_zip

• Step 3 (optional): Unzip the generated project into your Arduino libraries folder
unzip tensorflow_lite.zip -d ~/Arduino/libraries/

• Step 4: Compile and flash the demo using the Arduino IDE
  • Check the “person detection experimental” example in the library “Arduino_TensorFlowLite”, which is a one-button install using the Arduino IDE library manager.


Useful links
• TFLu + CMSIS-NN instructions: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/kernels/cmsis-nn/README.md
• TFLu + Ethos-U55 instructions: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/kernels/ethos-u/README.md
• CMSIS GitHub: https://github.com/ARM-software/CMSIS_5
• Person Detection Int8 example: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/person_detection_experimental
• Arm AI: https://www.arm.com/solutions/artificial-intelligence/iot-endpoint-devices
• ML platform Ethos-U landing page: https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u/+/refs/heads/master/README.md


Contact us!

• Fredrik Knutsson (freddan80 @ GitHub)
• Felix Johnny Thomasmathibalan (felix-johnny @ GitHub) and www.instagram.com/photoquiver/
• Jens Elofsson (jenselofsson @ GitHub)
• Måns Nilsson (mansnils @ GitHub)
• Patrik Laurell (patriklaurell @ GitHub)
• Magnus Midholt (mmidholt @ GitHub)


Questions?
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
شكرًا
תודה
Join our next virtual tech talk:
Demystify artificial intelligence on Arm MCUs
Tuesday 14 July

Register here:
developer.arm.com/solutions/machine-learning-on-arm/ai-virtual-tech-talks
The Arm trademarks featured in this presentation are registered
trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks
featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks