Embedded Artificial Intelligence
Devices, Embedded Systems, and Industrial
Applications
RIVER PUBLISHERS SERIES IN COMMUNICATIONS AND
NETWORKING
Series Editors:
ABBAS JAMALIPOUR
The University of Sydney
Australia
MARINA RUGGIERI
University of Rome Tor Vergata
Italy
The “River Publishers Series in Communications and Networking” is a series of
comprehensive academic and professional books which focus on communication and network
systems. Topics range from the theory and use of systems involving all terminals, computers,
and information processors to wired and wireless networks and network layouts, protocols,
architectures, and implementations. Also covered are developments stemming from new
market demands in systems, products, and technologies such as personal communications
services, multimedia systems, enterprise networks, and optical communications.
The series includes research monographs, edited volumes, handbooks and textbooks,
providing professionals, researchers, educators, and advanced students in the field with an
invaluable insight into the latest research and developments.
Topics included in this series include:
• Communication theory
• Multimedia systems
• Network architecture
• Optical communications
• Personal communication services
• Telecoms networks
• Wi-Fi network protocols
For a list of other books in this series, visit www.riverpublishers.com
Embedded Artificial Intelligence
Devices, Embedded Systems, and Industrial
Applications
Editors
Ovidiu Vermesan
SINTEF, Norway
Mario Diaz Nava
STMicroelectronics, France
Björn Debaillie
imec, Belgium
River Publishers
Published 2023 by River Publishers
River Publishers
Alsbjergvej 10, 9260 Gistrup, Denmark
www.riverpublishers.com
Distributed exclusively by Routledge
4 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
605 Third Avenue, New York, NY 10017, USA
Embedded Artificial Intelligence / by Ovidiu Vermesan, Mario Diaz Nava,
Björn Debaillie.
© 2023 River Publishers. All rights reserved. No part of this publication may
be reproduced, stored in a retrieval system, or transmitted in any form or by
any means, mechanical, photocopying, recording or otherwise, without prior
written permission of the publishers.
Routledge is an imprint of the Taylor & Francis Group, an informa
business
ISBN 978-87-7022-821-3 (print)
ISBN 978-10-0088-191-2 (online)
ISBN 978-1-003-39444-0 (ebook master)
While every effort is made to provide dependable information, the
publisher, authors, and editors cannot be held responsible for any errors
or omissions.
Dedication
“The question is not whether intelligent machines can have any emotions, but
whether machines can be intelligent without any emotions.”
– Marvin Minsky
“Our ultimate objective is to make programs that learn from their experience
as effectively as humans do. We shall. . . say that a program has common sense
if it automatically deduces for itself a sufficiently wide class of immediate
consequences of anything it is told and what it already knows.”
– John McCarthy
“It is customary to offer a grain of comfort, in the form of a statement that
some peculiarly human characteristic could never be imitated by a machine.
I cannot offer any such comfort, for I believe that no such bounds can be set.”
– Alan Turing
Acknowledgement
The editors would like to thank all the contributors for their support in the
planning and preparation of this book. The recommendations and opinions
expressed in the book are those of the editors, authors, and contributors
and do not necessarily represent those of any organizations, employers, or
companies.
Ovidiu Vermesan
Mario Diaz Nava
Björn Debaillie
Contents
Preface ix
Editors Biography xiii
List of Figures xv
List of Tables xxiii
1. Power Optimized Wafermap Classification for Semiconductor
Process Monitoring 1
Ana Pinzari, Thomas Baumela, Liliana Andrade, Marcello
Coppola, and Frédéric Pétrot
2. Low-power Analog In-memory Computing Neuromorphic
Circuits 15
Roland Müller, Bijoy Kundu, Elmar Herzer, Claudia Schuhmann,
and Loreto Mateu
3. Tools and Methodologies for Edge-AI Mixed-Signal Inference
Accelerators 25
Loreto Mateu, Johannes Leugering, Roland Müller, Yogesh
Patil, Maen Mallah, Marco Breiling, and Ferdinand Pscheidl
4. Low-Power Vertically Stacked One Time Programmable
Multi-bit IGZO-Based BEOL Compatible Ferroelectric TFT
Memory Devices with Lifelong Retention for Monolithic 3D
Inference Engine Applications 37
Sourav De, Sunanda Thunder, David Lehninger, Michael P.M.
Jank, Maximilian Lederer, Yannick Raffel, Konrad Seidel,
and Thomas Kämpfe
5. Generating Trust in Hardware through Physical Inspection 45
Bernhard Lippmann, Matthias Ludwig, and Horst Gieser
6. Meeting the Latency and Energy Constraints on Timing-
critical Edge-AI Systems 61
Ivan Miro-Panades, Inna Kucher, Vincent Lorrain,
and Alexandre Valentian
7. Sub-mW Neuromorphic SNN Audio Processing Applications
with Rockpool and Xylo 69
Hannah Bos and Dylan Muir
8. An Embedding Workflow for Tiny Neural Networks on Arm
Cortex-M0(+) Cores 79
Jianyu Zhao, Cecilia Carbonelli, and Wolfgang Furtner
9. Edge AI Platforms for Predictive Maintenance in Industrial
Applications 89
Ovidiu Vermesan and Marcello Coppola
10. Food Ingredients Recognition Through Multi-label Learning 105
Rameez Ismail and Zhaorui Yuan
Index 117
Preface
Embedded Artificial Intelligence
Embedded edge artificial intelligence (AI) reduces latency, increases the
speed of processing tasks, and lowers bandwidth requirements by reducing
the amount of data transmitted. It also reduces costs by introducing
cost-effective and efficient low-power hardware solutions that allow data to
be processed locally.
New embedded AI techniques offer high data security, decreasing the
risks to sensitive and confidential data and increasing the dependability of
autonomous technologies.
Embedded edge devices are becoming more and more complex,
heterogeneous, and powerful as they incorporate a combination of hardware
components like central processing units (CPUs), microcontroller units
(MCUs), graphics processing units (GPUs), digital signal processors
(DSPs), image signal processors (ISPs), neural processing units (NPUs),
field-programmable gate arrays (FPGAs), application-specific integrated
circuits (ASICs) and other accelerators to run multiple forms of machine
learning (ML), deep learning (DL) and spiking neural network (SNN)
algorithms. Embedded edge devices with dedicated accelerators can perform
matrix multiplication significantly faster than CPUs, and ML/DL algorithms
implemented in AI frameworks and edge AI platforms can efficiently exploit
these hardware components.
Processing pipelines, toolchains, and flexible edge AI software
architectures can provide optimised run-time support for specific
system-on-a-chip (SoC), system-on-module (SoM) and application types.
These tools can facilitate the full exploitation of heterogeneous SoC/SoM
capabilities for ML/DL and maximise component reuse at the edge.
The book offers complete coverage of the topics presented at the
International Workshop on Embedded Artificial Intelligence (EAI) -
“Devices, Systems, and Industrial Applications”, held in Milan, Italy, on
19 September 2022 as part of the ESSCIRC/ESSDERC 2022 European Solid-State
Circuits and Devices Conference, combining ideas and concepts
developed by researchers and practitioners working on creating edge AI
methods, techniques, and tools for industrial applications.
The book explores the challenges faced by AI technologies embedded
into electronic systems and applied to various industrial sectors by
highlighting essential topics, such as embedded AI for semiconductor
manufacturing; trustworthiness, verification, validation and benchmarking
of AI systems and technologies; the design of novel AI-based hardware
architectures; neuromorphic implementations; edge AI platforms; and AI-
based workflows deployments on hardware.
This book is a valuable resource for researchers, post-graduate students,
practitioners and technology developers interested in gaining insight into
embedded AI, ML, DL, SNN and the technology trends advancing intelligent
processing at the edge. It covers several embedded AI research topics and is
structured into ten articles. A brief introduction of each article is discussed in
the following paragraphs.
Ana Pinzari, Thomas Baumela, Liliana Andrade, Marcello Coppola
and Frédéric Pétrot: “Power Optimised Wafermap Classification for
Semiconductor Process Monitoring” introduce a power-efficient neural
network architecture specifically designed for embedded system boards that
include microcontrollers and edge tensor processing units. Experiments show
that the analysis of the control of wafers can be achieved in real-time with an
accuracy of 99.9% (float) and 97.3% (8-bit integer) using less than 2 W.
Roland Müller, Bijoy Kundu, Elmar Herzer, Claudia Schuhmann and
Loreto Mateu: “Low-Power Analog In-memory Computing Neuromorphic
Circuits” present the ASIC design and validation results of a neuromorphic
circuit comprising synaptic weights and neurons. This design includes batch
normalization, activation function, and offset cancellation circuits. The ASIC
shows excellent results: 12 nJ per inference with 5 μs latency.
Loreto Mateu, Johannes Leugering, Roland Müller, Yogesh Patil, Maen
Mallah, Marco Breiling and Ferdinand Pscheidl: “Tools and Methodologies
for Edge-AI Mixed-Signal Inference Accelerators" present how a toolchain
that facilitates the design, training, and deployment of artificial neural
networks in dedicated hardware accelerators makes it possible to optimize and
verify the hardware design, reach the targeted KPIs, and reduce the
time-to-market.
Sourav De, Sunanda Thunder, David Lehninger, Michael P.M. Jank,
Maximilian Lederer, Yannick Raffel, Konrad Seidel, and Thomas Kämpfe:
“Low-Power Vertically Stacked One Time Programmable Multi-bit IGZO-
Based BEOL Compatible Ferroelectric TFT Memory Devices with Lifelong
Retention for Monolithic 3D-Inference Engine Applications” discuss and
demonstrate an IGZO-based one-time programmable FeFET memory device
with multilevel coding and lifelong retention capability. The synaptic device
achieves 97% accuracy for inference-only applications with MNIST data,
with an accuracy degradation of only 1.5% over 10 years. The proposed
inference engine also shows superior energy efficiency and cell area.
Bernhard Lippmann, Matthias Ludwig, and Horst Gieser: “Generating
Trust in Hardware through Physical Inspection” address the image processing
methods for physical inspection within the semiconductor manufacturing
process and physical layout to provide trustworthiness in the produced
microelectronics hardware. The results are presented for a 28 nm process,
including a proposed quantitative trust evaluation scheme based on feature
similarities.
Ivan Miro-Panades, Inna Kucher, Vincent Lorrain, and Alexandre
Valentian: “Meeting the Latency and Energy Constraints on Timing-critical
Edge-AI Systems” explore a novel architectural approach to overcoming these
latency and energy limitations by using the attention mechanism of the human
brain. The energy-efficient design includes a small NN topology (i.e.,
MobileNet-V1) that can be completely integrated on-chip; heavily quantized
(4-bit) weights and activations; and fixed bio-inspired extraction layers in
order to limit the embedded memory capacity to 600 kB.
Hannah Bos and Dylan Muir: “Sub-mW Neuromorphic SNN Audio
Processing Applications with Rockpool and Xylo” apply a new SNN
architecture designed for temporal signal processing, using a pyramid of
synaptic time constants to extract signal features at a range of temporal
scales. The architecture was demonstrated on an ambient audio classification
task, deployed to the Xylo SNN inference processor in streaming mode. The
application achieves high accuracy (98%) and low latency (100 ms) at low
power (<100 μW dynamic inference power).
Jianyu Zhao, Cecilia Carbonelli and Wolfgang Furtner: “An Embedding
Workflow for Tiny Neural Networks on Arm Cortex-M0(+) Cores” describe and
propose an end-to-end embedding workflow focused on tiny neural network
deployment on Arm® Cortex®-M0(+) cores. With this, the memory footprint
could be reduced by up to 73.9%. While reducing the manual
effort of network embedding to the minimum, the workflow remains flexible
enough to allow for customizable bit shifts and different layer combinations.
Ovidiu Vermesan and Marcello Coppola: “Edge AI Platforms for
Predictive Maintenance in Industrial Applications” provide an assessment
and comparative analysis of several existing edge AI platforms and workflows,
covering some of the most essential architectural elements of differentiation
(AEDs) in edge AI-based industrial applications, such as analytic capabilities
in the time and frequency domains, feature visualisation and exploration,
microcontroller (Arm® Cortex®-M cores) emulation and live tests, and support
for ML, DL, and the ML core capabilities implemented in the sensors.
Rameez Ismail and Zhaorui Yuan: “Food Ingredients Recognition
Through Multi-label Learning” describe deep multi-label learning approaches
and related models to detect an arbitrary number of ingredients in a dish
image. With an average precision score of 78.4% using a challenging dataset
(Nutrition5K), this approach forms a strong baseline for future exploration.
Editors Biography
Ovidiu Vermesan holds a PhD degree in microelectronics and a Master
of International Business (MIB) degree. He is Chief Scientist at SINTEF
Digital, Oslo, Norway. His research interests are in smart systems integration,
mixed-signal embedded electronics, analogue neural networks, edge artificial
intelligence and cognitive communication systems. Dr. Vermesan received
SINTEF’s 2003 award for research excellence for his work on the
implementation of a biometric sensor system. He is currently working
on projects addressing nanoelectronics, integrated sensor/actuator systems,
communication, cyber–physical systems (CPSs) and Industrial Internet of
Things (IIoT), with applications in green mobility, energy, autonomous
systems, and smart cities. He has authored or co-authored over 100 technical
articles, conference/workshop papers and holds several patents. He is
actively involved in the activities of the European Partnership for Key Digital
Technologies (KDT). He has coordinated and managed various national, EU
and other international projects related to smart sensor systems, integrated
electronics, electromobility and intelligent autonomous systems such as
E3 Car, POLLUX, CASTOR, IoE, MIRANDELA, IoF2020, AUTOPILOT,
AutoDrive, ArchitectECA2030, AI4DI, AI4CSM. Dr. Vermesan actively
participates in national, Horizon Europe and other international initiatives by
coordinating the technical activities and managing the various projects. He is
the coordinator of the IoT European Research Cluster (IERC) and a member
of the board of the Alliance for Internet of Things Innovation (AIOTI). He
is currently the technical co-coordinator of the Artificial Intelligence for
Digitising Industry (AI4DI) project.
Mario Diaz Nava has a Ph.D. and an M.S., both in computer science,
from the Institut National Polytechnique de Grenoble, France, and a B.S. in
communications and electronics engineering from the Instituto Politécnico
Nacional, Mexico. He has worked at STMicroelectronics since 1990. He has
occupied different positions (Designer, Architect, Design Manager, Project
Leader, Program Manager) in various STMicroelectronics research and
development organisations. His selected project experience is related to the
specifications and design of communication circuits (ATM, VDSL, Ultra
wideband), digital and analogue design methodologies, system architecture
and program management. He currently has the position of ST Grenoble
R&D Cooperative Programs Manager, and he has actively participated, for
the last five years, in several H2020 IoT projects (ACTIVATE, IoF2020,
Brain-IoT), working in key areas such as Security and Privacy, Smart
Farming, IoT System modelling, and edge computing. He is currently leading
the ANDANTE project devoted to developing neuromorphic ASICS for
efficient AI/ML solutions at the edge. He has published more than 35 articles
in these areas. He is currently a member of the Technical Expert Group of the
PENTA/Xecs European Eureka cluster and a Chapter chair member of the
ECSEL/KDT Strategic Research Innovation Agenda. He is an IEEE member.
He participated in the standardisation of several communication technologies
in the ATM Forum, ETSI, ANSI and ITU-T standardisation bodies.
Björn Debaillie leads imec’s collaborative R&D activities on cutting-edge
IoT technologies. As program manager, he is responsible for
the operational management across programs and projects, and focusses
on strategic collaborations and partnerships, innovation management, and
public funding policies. As chief of staff, he is responsible for executive
finance and operations management and transformations. Björn coordinates
semiconductor-oriented public funded projects and seeds new initiatives on
high-speed communications and neuromorphic sensing. He currently leads
the 35 M€ TEMPO project on neuromorphic hardware technologies, enabling
low-power chips for computation-intensive AI applications
(www.tempo-ecsel.eu). Björn holds patents and has authored international
papers published
in various journals and conference proceedings. He also received several
awards, was elected as IEEE Senior Member and is acting in a wide range
of expert boards, technical program committees, and scientific/strategic think
tanks.
List of Figures
Chapter 1
Figure 1  Synthetic examples of wafermap failure patterns. From left to right and top to bottom, the big-cluster, wide-dense-edge-donut, fingerprint, complete-wafer, horizontal-dots-lines, and matrix classes are shown.  4
Figure 2  Proposed CNN architecture.  5
Figure 3  Model scores: the evolution of (a) accuracy and (b) loss during the learning and testing processes.  6
Figure 4  Experiment setups: (a) Google's Coral setup, (b) STM32MP1 setup. Both are powered through a small power-meter allowing to record the power consumption of the entire board.  9
Figure 5  Power measures: (a) instantaneous power (idle on the bottom part of each bar), (b) power efficiency in inferences per second per watt.  10
Chapter 2
Figure 1  Top-level block diagram of the DNN circuit including digital control and circuitry used to test the ASIC.  16
Figure 2  Synaptic weight circuit using variable resistors.  17
Figure 3  Voltage divider synaptic weight implementation: (a) schematic and (b) equivalent circuit illustration.  17
Figure 4  Differential voltage divider synaptic weight for positive and negative weight values.  18
Figure 5  ReLU activation function and buffer circuit.  19
Figure 6  Transfer function of the neuron over PVT variations.  20
Figure 7  Transient simulation results of the output voltages of all three layers.  20
Figure 8  Different combinations of weight value, batch normalization offset and resulting output voltage to test the synaptic weights.  22
Figure 9  Different number of weights in use for different gain configurations.  22
Chapter 3
Figure 1  Toolchain for the design of edge-AI mixed-signal inference accelerators.  26
Figure 2  Fault aware quantizer (FAQ) implementation.  27
Figure 3  Block diagram of a Hardware-Aware Training layer.  28
Figure 4  Overview of the Mapper Tool.  28
Figure 5  Overview of the Compiler Tool.  29
Figure 6  Circuit hierarchy created by the Neural Network Hardware Generator.  32
Chapter 4
Figure 1  The P-V response of HZO-based ferroelectric capacitors with IGZO as the bottom electrode shows asymmetric swing with negligible negative switching. The absence of negative polarizations plays an essential role in facilitating OTP features in Fe-TFTs.  39
Figure 2  (a) WRITE operation in IGZO-based Fe-TFTs with 500 ns wide pulses of amplitude 3 V and 7 V. (b) 2 bits/cell operation for IGZO-based Fe-TFT OTP devices.  39
Figure 3  (a) The measured retention characteristics show stable retention of 4 states for ten years without any loss. (b) Benchmarking the retention performance (relative Vth shift w.r.t. MW) proves that IGZO-based OTP devices have maximum long-term data retention capability.  40
Figure 4  (a) The modus operandi of the MLP NN. (b) The reported inference engine shows life-long lossless inference operation.  40
Chapter 5
Figure 1  Different stages of the IC development flow from the initial design phase until the final product.  46
Figure 2  Abstraction levels of computing systems. This work focuses on the physical layers of the abstraction stack with an emphasis on the physical layout and manufacturing technology.  46
Figure 3  Reverse engineering process overview.  47
Figure 4  Examples of labelled data showcasing the different ROIs: green – VIA; yellow – metal; teal – local silicon oxidation; red – poly; blue – deep trench isolation.  49
Figure 5  Example cross-section image with annotated metal and contact/VIA features.  49
Figure 6  Typical SEM images using two different detectors. (a) shows the scan with an InLens detector with a field of view of 30 μm and a working distance of 10 mm. (b) shows the scan with an ET-SE2 detector with a field of view of 20 μm and a working distance of 10 mm.  50
Figure 7  Image segmentation using threshold algorithm.  51
Figure 8  Image segmentation using custom algorithm SEMSeg.  51
Figure 9  SEMSeg algorithm, segmentation of SEM images with overlapping foreground and background colour level.  51
Figure 10  Contact detection using Hough algorithm.  51
Figure 11  Contact detection using modified Hough algorithm.  52
Figure 12  Standard cell identification process. (a) SEM image displaying poly-silicon, active area, and contacts. (b) Detection of power lines (VDD, VSS). (c) Segmented standard cells using custom image processing algorithms. (d) Classification of different standard cells.  52
Figure 13  Segmentation of polysilicon, active area delayering SEM images into standard cells using a custom algorithm with domain knowledge.  52
Figure 14  Extracting transistor-level netlist from std. cell SEM images displaying polysilicon, contact, and active area layouts.  53
Figure 15  Three different VIA extraction methods.  53
Figure 16  Metal layer extraction with deep learning.  53
Figure 17  28 nm technology, delayering study on test samples. Optical microscope and SEM images are used for quality assurance.  53
Figure 18  Overview of different metal segmentation and VIA detection tasks of the 28 nm test chip sample.  54
Figure 19  Polygon extraction example with deep learning.  55
Figure 20  Reconstruction capabilities of DL.  55
Chapter 6
Figure 1  Illustration of the two visual pathways or streams in the visual cortex, used for extracting different information.  62
Figure 2  N2D2 framework.  63
Figure 3  Forward and backward quantization passes.  63
Figure 4  Backward propagation with QAT.  63
Figure 5  Comparison of several NN topologies, as function of number of operations (X-axis), classification accuracy (Y-axis) and number of parameters (size of the circle) [9].  65
Figure 6  (a) Standard convolution; (b) depth-wise + point-wise convolution.  65
Figure 7  NeuroCorgi initial floorplan, illustrating the placement of the different NN layers.  66
Chapter 7
Figure 1  Spiking network architecture for temporal signal processing.  72
Figure 2  Architecture of the digital spiking neural network inference processor “Xylo”.  73
Figure 3  Digital LIF neurons on Xylo.  73
Figure 4  Feedforward weights mapped to Xylo architecture.  74
Figure 5  Distribution of correct classification latency.  74
Figure 6  Audio classification results on audio samples for each class (columns).  74
Chapter 8
Figure 1  An overview of the embedding workflow proposed in this work.  81
Figure 2  A signed 8-bit fixed-point representation of a fractional number, with the binary point positioned after the 4th bit.  82
Figure 3  Example of the configuration txt file for a classifier trained for the iris dataset. The model has four inputs, three outputs and two dense layers, each with a different activation function.  82
Figure 4  Example of the txt file for the quantised 8-bit parameters from the model mentioned in Fig. 3. The 6 and 5 at the beginning represent the numbers of bits used for the fractional part of the numbers for the first and the second dense layers, respectively. They are followed by weights and biases quantised accordingly.  82
Figure 5  Header and C files from the C library.  83
Figure 6  (a) The C structure to save 8-bit parameters for a dense layer, including a double pointer for the 2D weight array, a pointer to the 1D bias array, an unsigned 8-bit integer to suggest the type of activation function, an unsigned 8-bit integer for the size of the layer output and another for the bit shift. (b) The corresponding implementation of an 8-bit dense layer, which besides layer parameters also takes a pointer to a 1D array for the input values and modifies the array pointed by the output pointer in return.  83
Figure 7  Example of the generated C source code. (a) Definition of a comprehensive parameter structure, which consists of a pointer for the parameter structure of each of the hidden layers and two integers for the dimensions of the network input and output. (b) Declaration of the model parameter (only the first layer is shown due to space limitations) with automatically filled model parameters read from the parameter text file illustrated in Fig. 3. (c) Simple network implementation with available layer implementations.  83
Figure 8  A low-cost environmental sensing platform with a neural network embedded on an Infineon PSoC® Analogue Coprocessor in the same package.  84
Figure 9  Network architecture. The example network has 14 extracted features, 15 timesteps, 20 hidden units, and 2 output values.  85
Figure 10  Comparison of the estimated gas concentrations from the floating-point, 16-bit fixed-point and 8-bit fixed-point algorithm implementations.  85
Chapter 9
Figure 1  Industrial edge AI system architecture.  92
Figure 2  STM32 Arm Cortex-M4F microcontroller architecture.  93
Figure 3  Cross-platform conversion of sensor data (collected vs uploaded).  97
Figure 4  F class along the 3 axes (NEAI).  98
Figure 5  F class along the 3 axes over a 50 ms window (EI).  98
Figure 6  F class along the y-axis (Qeexo) in time and spectrogram.  98
Figure 7  Dimension reduction with two components (PC1 and PC2) with PCA (left) and PCA+t-SNE (right).  99
Figure 8  Dimension reduction with three components (PC1, PC2 and PC3) with PCA (left) and PCA+t-SNE (right).  99
Figure 9  Benchmarking with NEAI. All correctly classified (green dots).  100
Figure 10  Benchmarking with EI. Confusion matrix and data explorer. Correctly classified (green dots) and misclassified (red dots).  100
Figure 11  Benchmarking with Qeexo. Confusion matrix for SVM model (top). Overview of trained models (bottom).  101
Figure 12  Live streaming evaluation of the SVM trained model using the NEAI emulator (left) and the trained ANN model in Qeexo (right).  101
Figure 13  EI ANN model testing with test datasets (Arm® Cortex®-M4 MCU STM32L4R9 not yet supported).  101
Figure 14  Qeexo testing with test datasets collected (SVM model).  101
Chapter 10
Figure 1  The envisioned approach towards automated nutrition assessment. The assessment block consists of two ML functions, one for detecting the ingredients and the other to estimate the quantity, and a nutrients database look-up service.  106
Figure 2  The meta-structure used for the evaluated models.  108
Figure 3  The decoding mechanisms evaluated in this work: (a) the GAP-based decoder averages the latent features before projecting them onto a read-out layer, and (b) the ML-Decoder computes a response, a linear combination of value vectors, against each external query vector. The responses are then projected onto the read-out layer through a group decoding scheme.  110
Figure 4  Examples of dish images from the test dataset.  110
Figure 5  Prediction results for the dish images selected from the test dataset for illustrative purposes. The detection confidence for each ingredient is shown. The green colour represents a true positive detection while the red stands for the false positives.  112
List of Tables
Chapter 1
Table I  Description of the model architecture (k (n×n) means k kernels of size n×n)  5
Table II  Inference accuracy on the data and validation sets  6
Table III  Inference accuracy on the data and validation sets after post-training quantization  8
Table IV  Coral board: latency measurements for a throughput of 10,000 images and on 5 consecutive tests  9
Table V  Performance and power efficiency of both boards  10
Chapter 2
Table I  Weight Value and Input Resistance of the Voltage Divider Synaptic Weight  17
Table II  Key Performance Indicators for this Work  21
Chapter 4
Table I  Benchmarking  41
Chapter 5
Table I  Challenges for IC Reverse Engineering  48
Table II  Rule-of-thumb performance options for the extraction of classical vs. DL-based VIA detection  54
Chapter 7
Table I  Ambient audio scene classification accuracy  74
Table II  Continuous power measurements  75
Table III  Per-inference energy measurements  75
Table IV  Per-neuron per-inference energy comparison  75
Chapter 8
Table I  Performance Comparison of Different Network Implementations  85
Chapter 10
Table I  Performance Evaluation and Benchmark  112
Power Optimized Wafermap
Classification for Semiconductor
Process Monitoring
Ana Pinzari, Thomas Baumela, Liliana Andrade, Marcello Coppola,
and Frédéric Pétrot
Abstract—Today, the exploitation of AI solutions is very immersive and has wide applicability in virtually all industrial fields. In many sectors, the quality of the final product is the key to profitability. For the semiconductor industry, this translates into production yield, the ratio of functional dies over the total number of dies produced. The control of wafer fabrication in the semiconductor industry is a fundamental task to ensure high yield. Analysis of the distribution of non-functional dies on a wafer is a necessary step to identify process drifts leading to their root causes. Current approaches use large-scale state-of-the-art neural networks running on GPUs to perform this analysis. Aiming at power efficiency, we propose a neural network architecture specifically designed to target embedded devices such as the STMicroelectronics MP1 board or Google’s Coral board that includes an edge tensor processing unit. Experiments show that we achieve this analysis in real-time with an accuracy of 99.9% (float) and 97.3% (8-bit integer) using less than 2 W.

Index Terms—Process Control, Wafermap Classification, Deep Learning, Convolutional Neural Network Optimization, Hyper-parameter Tuning, Low-Energy Consumption, Power Reduction.

This work was conducted under the framework of the ECSEL AI4DI “Artificial Intelligence for Digitising Industry” project. The project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 826060. The JU receives support from the European Union’s Horizon 2020 research. (Corresponding author: A. Pinzari.)
A. Pinzari is with the Institute of Engineering Univ. Grenoble Alpes, France (e-mail: [email protected]).
T. Baumela is with the Institute of Engineering Univ. Grenoble Alpes, France (e-mail: [email protected]).
L. Andrade is with the Institute of Engineering Univ. Grenoble Alpes, France (e-mail: [email protected]).
M. Coppola is with ST Microelectronics, Grenoble, 38000 France ([email protected]).
F. Pétrot is with the Institute of Engineering Univ. Grenoble Alpes, France ([email protected]).

I. INTRODUCTION

Yield is paramount in semiconductor process manufacturing, as it determines the financial viability of a production line for a given process. Since the start of the microelectronic VLSI industry, yield control has been the focus of the process engineers [1], and although the technology has matured in a tremendous manner, today’s challenges are equally difficult to take up [2].

One important step in yield control is wafer testing (also known as Circuit Probe). A specialized test equipment is used to test each die on the wafer, and indicates which dies are functional and which are not. This is used to build a binary wafermap [3], i.e., a 2-dimensional image in which each pixel represents a die. A white pixel indicates a functional die, while a black one indicates a non-functional die.
The distribution of the black pixels is attached to a class of classical issues, which gives the process engineer leads to identify the causes of the malfunctions.

Given the ability of recent neural networks to accurately perform classification, and the importance of wafermap classification in the semiconductor fabrication process, a lot of research has been done on the subject recently. Automating this process makes sense: human classification of abstract dot distributions is feasible for a small number of classes, but very complex when the number of classes is high. The existing approaches use increasingly complex state-of-the-art network architectures, to gain a few tens of percent in accuracy. Given the low throughput of the test equipment, we believe that searching for a much smaller ad-hoc architecture that will be able to run on a micro-controller at production line speed will lead to much better power efficiency while only marginally degrading accuracy.

In this paper we present our low-energy wafermap classification approach, put to work on actual foundry data from STMicroelectronics 28 nm fabrication facilities, which uses a purposely defined neural network, and makes use of advanced quantization techniques to work on a limited number of bits. This has the nice property of both limiting the memory used for parameter storage and minimizing the complexity of the operators used for computation.

This paper reviews the related works in Section II, introduces a brief description of the dataset used for training and classification in Section III, presents the main contributions related to the proposed neural network architecture in Section IV, describes the experiments and implementation performed on different embedded devices in Section V, and concludes in Section VI.

II. RELATED WORKS

Although statistical approaches have long been used to classify wafermaps in high-volume production [4], it is only recently that the use of deep neural networks has been proposed to that aim. This is due to two factors: first, the breakthrough brought by AlexNet [5], making neural networks clearly superior to any other algorithm for classification, and second, the availability of an open wafermap dataset (WM-811K) donated to the community by a major foundry.

Wu et al. [6] introduced the approach and made the dataset public, which led to a very large number of papers being published around this dataset, far too many to cite here. They all share one characteristic: the use of the most recent neural network architecture of the time, generally needing more computations and more parameters than the previous one, to gain little in accuracy. The WM-811K dataset is quite heterogeneous in size, shape, and number of examples per class, and has a small number of classes, 9, whereas industrial practice is more in the order of 50, and class labelling is sometimes not correct. So, each published work does its own pre-processing to resize wafermaps and carries out data augmentation to even the class cardinality, making fair comparisons difficult.

In the next paragraphs we give an overview of recent research. In this context, [7] introduces a CNN model that can be used to classify patterns and to identify the cause of defects by using an image retrieval technique. To avoid the use of imbalanced datasets, the authors generate synthetic wafermaps modelled using a Poisson distribution.
[8] uses a CNN to automatically extract features and identify defects, in addition to using batch normalization and spatial dropout techniques to improve classification performance. During data pre-processing they use random rotations, horizontal flipping, width and height shifts, shearing range, channel shifting and zooming as data augmentation techniques. The final dataset used by the authors has 90K wafer defect images and is fully balanced with 10K images per class. [9] proposes a new classification system that, based on active learning of a CNN, allows the selection of a reduced and representative subset of unlabelled wafermaps that will be inspected by experts. Based on a four-step model, a small LeNet-5-like CNN architecture is trained using the initial labelled dataset, an uncertainty prediction on the unlabelled wafermaps is calculated, and using top-K selection methods, a new set of wafermaps is extracted to be manually inspected and merged with the original dataset. [10] also uses a CNN to detect defects, but with the particularity that wafermaps are augmented using rotation techniques and converted into 2D arrays before training. This approach avoids the problems related to wafer size variations. The authors also apply data augmentation before training to take into account possible rotations of the input wafermaps. [11] proposes to use a CNN encoder-decoder for data augmentation and a depth-wise separable convolution for classification. The greatest advantage of this approach is that the proposed model reduces the number of parameters by 30% and the amount of calculation by 75% on the WM-811K dataset. [12] proposes a neural network pre-training method based on self-supervised learning to improve classification performance on imbalanced datasets. A CNN encoder is responsible for learning features by mapping similar wafermaps into a same space. The authors argue that self-supervised training methods improve classification performance when datasets with much unlabelled data are used, which is the case of the WM-811K dataset. [13] presents experimentations with simplified AlexNet, MobileNetV1, and VGG, so as to limit the number of parameters they require. This latter architecture leads to the best accuracy while requiring the least parameters.

Overall, among these works and others, only [11] evokes power efficiency considerations in a short paragraph, simply saying that the network fits into a NVidia Jetson Nano board (≈ 10 W) and performs a 5 frames per second inference on 64×64 images. None refers to power reduction techniques or power/accuracy trade-offs for neural network inference.

III. DESCRIPTION OF THE DATASET

Although identified long ago, the quality of data is a too often neglected issue when dealing with neural networks [14]. First, the number of elements must be large enough for the complexity of the problem, typically a few thousands for simple problems to a few millions for more complex ones. These elements must also be well balanced between the classes and labelled with care.

After a broad analysis of possible failures during the manufacturing process, the process engineers have defined 58 possible wafermap failure patterns. The wafers have a notch that is used to precisely align them within the machine during fabrication. This makes the orientation known, and allows considering, e.g., vertical and horizontal patterns differently. Each class contains about 2,200 images, resulting in a complete and well-balanced data set of 121,550 images.
Although built on purpose for data-confidentiality reasons, Figure 1 shows six wafermaps representative of 6 categories among the 58.

Fig. 1. Synthetic examples of wafermap failure patterns. From left to right and top to bottom, the big-cluster, wide-dense-edge-donut, fingerprint, complete-wafer, horizontal-dots-lines, and matrix classes are shown.

The binary wafermap images have an original resolution of 401×401. As the size of the input images has a major impact on the size of the network, it is very useful to resize them to reduce the number of parameters and computations. This process is worthwhile if the loss of information is minimal. A failure category has its own characteristics, so resizing must be done in a way that respects the patterns of the category to which an image belongs. After several experiments using state-of-the-art network architectures, we determined the target size to be 224×224 pixels. As a resizing method, we applied the nearest neighbour interpolation algorithm [15], which produces grayscale images. We then binarize the images simply by considering all non-white pixels as black and all white pixels as white. We observed that with this strategy there was no loss of information, and the image retains its own failure characteristics. With such reduced-size black and white images, the memory space necessary to encode the image is minimized, and, even more interestingly, the inputs to the neural network are single bits.
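To make this pre-processing concrete, the sketch below shows the resize-then-binarize step just described; the Pillow-based loader, the array shapes, and the use of 255 as the white level are our assumptions, not details from the paper.

    import numpy as np
    from PIL import Image

    TARGET_SIZE = (224, 224)  # input resolution determined experimentally above

    def preprocess_wafermap(wafermap: np.ndarray) -> np.ndarray:
        """Resize a 401x401 binary wafermap with nearest-neighbour
        interpolation, then re-binarize the resulting grayscale image."""
        img = Image.fromarray(wafermap.astype(np.uint8))
        # Nearest-neighbour interpolation preserves the failure pattern
        img = img.resize(TARGET_SIZE, resample=Image.NEAREST)
        arr = np.asarray(img)
        # Every non-white pixel (non-functional die) becomes 1, white stays 0,
        # so each network input is a single bit
        return (arr < 255).astype(np.uint8)

Note that with pure nearest-neighbour resampling a binary image stays binary, so the re-binarization only matters if another interpolation method is swapped in; keeping it makes the pipeline robust to that choice.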
Now that we have explained our pre-processing steps, we can perform the neural network architecture search step.

IV. AD-HOC NEURAL NETWORK

Since our hardware targets are small, embedded devices in the watt power range, we must minimize the number of multiplication-accumulation operations, the size of the operands of these operations, and the number and size of the network parameters. We have two assets for that. First, our classification problem is somehow easy, compared to the current CIFAR100 or ImageNet challenges: we have black and white bitmaps instead of 256-level or even true-colour pictures, and the number of classes is large but not huge. Second, much recent effort has been put into getting rid of large floating-point representations for the data and parameters.

A. Neural Network Architecture Definition

Finding the appropriate NN architecture for a given problem is not a well formalized problem. This can be automated using AutoML [16] or Neural Architecture Search [17] approaches, but this requires an enormous amount of resources, and is beyond the scope of our work. So, we classically got inspired by the existing CNN approaches and, following intuition, we made a few well-chosen experiments to converge towards a suitable architecture. We noticed that some defects on the wafermaps are similar but have different scales, which is something the inception layers, as introduced in GoogLeNet [18], can deal with well. As our goal is to limit the number of layers, and to minimize the number of parameters while keeping a high prediction accuracy, we shall introduce only very few of them given their complexity. In the end, we introduced a single inception layer within our architecture, which significantly improved the final accuracy.

Our final network architecture is depicted in Figure 2. Our model has a bit less than 0.5 million parameters, to be compared with the 58 million of AlexNet and the 6 million of GoogLeNet. There are 17 network layers (convolution, subsampling and fully connected layers) of which 10 are learnable layers. The input shape is a 224×224×1 bit image which is subsampled across the entire architecture into a 7×7×116 dimensional vector.

Fig. 2. Proposed CNN architecture.

Table I summarizes the description of our architecture. The model starts with a 7×7 convolution layer, followed by a max-pooling layer and an inception block. The first convolution has no padding, which allows the feature detector to work only on the pixels of the image. This upper part of the diagram extracts the most important features of the image. Next, the pooling layer maximizes the response of each feature map and reduces the space exploration of feature maps. The bottleneck layer (i.e., 1×1 convolution) reduces the number of parameters, and the last two consecutive 3×3 convolutional layers take the role to increase the feature map volume again, before being flattened and passed to the classification function.

TABLE I
Description of the model architecture (k (n×n) means k kernels of size n×n)

    Conv2D             32 (7×7), stride of 2, no padding
    MaxPool2D          (2×2), stride of 2
    Inception Block    32 (1×1), 8 (1×1), 8 (1×1), MaxPool (3×3)
                       32 (3×3), 32 (5×5), 32 (1×1)
    MaxPool2D          (3×3), stride of 2
    Conv2D             12 (1×1)
    Conv2D             116 (3×3), stride of 2, padding with zeros
    Conv2D             116 (3×3), stride of 2, padding with zeros
    Dense/Softmax      58
    Nr of parameters   478,150
    FLOPs              125,518,940
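For illustration, the topology of Table I can be written down in Keras as below. The internal wiring of the inception block (which 1×1 reduction feeds the 3×3 and 5×5 convolutions, and where the pooling branch's 1×1 sits) is our assumption, following the GoogLeNet pattern; with this wiring the sketch reproduces the 478,150 parameters and the 7×7×116 output volume stated above.

    import tensorflow as tf
    from tensorflow.keras import layers

    def inception_block(x):
        # Branch filter counts from Table I; wiring assumed GoogLeNet-style
        b1 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)
        b2 = layers.Conv2D(8, 1, padding="same", activation="relu")(x)
        b2 = layers.Conv2D(32, 3, padding="same", activation="relu")(b2)
        b3 = layers.Conv2D(8, 1, padding="same", activation="relu")(x)
        b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(b3)
        b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
        b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)
        return layers.Concatenate()([b1, b2, b3, b4])

    inputs = tf.keras.Input(shape=(224, 224, 1))
    x = layers.Conv2D(32, 7, strides=2, activation="relu")(inputs)  # no padding
    x = layers.MaxPooling2D(2, strides=2)(x)
    x = inception_block(x)
    x = layers.MaxPooling2D(3, strides=2)(x)
    x = layers.Conv2D(12, 1, activation="relu")(x)                  # bottleneck
    x = layers.Conv2D(116, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(116, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)                                         # 7x7x116
    outputs = layers.Dense(58, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)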
The intermediate layers use ReLU as activation function and the last fully connected layer uses a Softmax activation to produce a probability score over the 58 classes (the sum of all probabilities must equal 1). Since we are facing a multi-class classification problem, categorical cross-entropy is used as a feedback signal, and the weight adjustments during the backpropagation steps are carried out by the Adam optimizer. During the learning process an important aspect is the choice of hyperparameters. By varying the batch size and learning rate, the accuracy is remarkably improved. The regularization of these hyper-parameters allows finding an optimal balance between the learning of the model and its evaluation on the test data. For our model, we started with a batch size of 512 and a learning rate of $10^{-3}$, which we successively reduced to 256, 128 and finally 32, while decaying the learning rate to $10^{-5}$. The choice of the hyper-parameters remains empirical; to the best of our knowledge, there is no exact approach automating the learning process. The hyper-parameters are selected according to the specificity of the network architecture and the type of data.

Fig. 3 shows the performance of the model on a balanced dataset of 116,000 images (2,000 images per class). We may observe that from the 10th epoch onwards, the loss function on the evaluation (test) data starts to increase. In the field of machine learning, this case is called overfitting. Overfitting, or overgeneralization, is the situation in which the model learns well on the training data but does not generalize well on the test data. To avoid it, we stop the training of the model early and continue with the tuning of hyperparameters, in our case, by setting the mini-batch size and learning rate smaller. In both graphs, we observe that from the 20th epoch onwards, the model learns well and the evaluation on the test data occurs correctly. Moreover, the loss functions decrease continuously until the end of the learning phase. We reach a predictive accuracy of our model of 99.9%.

Fig. 3. Model scores: the evolution of (a) accuracy and (b) loss during the learning and testing processes.

Table II gives the performance of our model on our dataset and on validation data.

TABLE II
Inference accuracy on the data and validation sets

                           Dataset    Validation Data    Test Data
    Nr of images           116,000    23,200             3,750
    Top-1 Accuracy (%)     99.92      99.93              99.84
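A minimal sketch of this training procedure follows, assuming the model above and in-memory arrays x_train/y_train and x_val/y_val with one-hot labels; the exact pairing of batch sizes with learning rates and the early-stopping settings are illustrative assumptions.

    import tensorflow as tf

    # Batch size decayed 512 -> 256 -> 128 -> 32 while the learning rate
    # decays from 1e-3 towards 1e-5, as described in the text
    schedule = [(512, 1e-3), (256, 3e-4), (128, 1e-4), (32, 1e-5)]

    for batch_size, lr in schedule:
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train, y_train,
                  batch_size=batch_size,
                  epochs=10,
                  validation_data=(x_val, y_val),
                  # stop early once the validation loss starts rising (overfitting)
                  callbacks=[tf.keras.callbacks.EarlyStopping(
                      monitor="val_loss", patience=2, restore_best_weights=True)])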
Now that we have a model that is highly accurate while relatively small in terms of parameters, we further limit its computation and storage needs by applying quantization.

B. Quantization Principles

Quantization consists of reducing the number of bits necessary to represent a value. Its use in neural networks is not new [19], [20], but the introduction of deep convolutional neural networks has led to many works this past decade. There are now many different quantization approaches, ranging from quantizing only the parameters, to quantizing both parameters (often only weights, not biases) and activations, quantizing on 16, 8, or even 2 or 1 bit, etc. The approaches using the smaller bit sizes are meaningful for hardware implementations only, [21]–[24] to name a few. For the sake of this work, which targets off-the-shelf micro-controller-based boards, we will restrict ourselves to an 8-bit quantization of both the weights and activations, which is well suited to byte-based computation in software, or with existing hardware accelerators (either ad-hoc or performing matrix-vector or matrix-matrix multiplications). As a result, the most demanding part of the neuron output computation ($v_j = \sum_{i=0}^{n-1} x_i w_{ij}$) uses only 8-bit integer multiplications. This is key because the area and power complexity of a multiplier is in $O(b^2)$, where $b$ is the number of bits of the inputs. Each multiplication produces a $2b$-bit result, which is accumulated with the adder to produce a $(2b + \log_2 n)$-bit result, $n$ being the number of inputs of the neuron. Using a 32-bit addition is a safe guess here, as there are very few chances that the accumulation takes place with more than $2^{16}$ inputs. It is also safe to have the bias $b_j$ on 32 bits, as this is a single addition performed after all integer multiplications ($o_j = v_j + b_j$). It might even be the initial value of the accumulator.

As TensorFlow was the first framework to provide 8-bit integer arithmetic fine-tuned implementations for micro-controllers (using e.g., SIMD instructions) and the Google TPU [25], we opted for using it given our high power-efficiency goal. We briefly summarize here the quantization approach that is advocated by and implemented in this framework, and thoroughly detailed in [26]. For a given convolutional layer, the quantization process produces in addition an offset (called zero-point, $zp$), and for each output channel of the layer a scale under the form of an integer multiplicand $M$ and a shift $s$. The scale factor and offset must be applied before the activation function, leading (roughly, as the idea is to divide by $2^s$, which is not a raw shift for negative values) to $y_j = (o_j \times M \gg s) + zp$. These operations, done only once per kernel, are typically 32-bit, and the result is saturated to −128 or 127.
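The toy example below walks through this all-integer pipeline once; the values of M, s and zp are illustrative only, and a 64-bit accumulator is used here for simplicity where the 32-bit scheme described above (with proper rounding) would be used on the target.

    import numpy as np

    def quantized_neuron(x, w, bias, M, s, zp):
        """8-bit multiplies, wide accumulate, then per-channel rescale:
        y_j = ((v_j + b_j) * M >> s) + zp, saturated to int8."""
        acc = np.int64(bias) + np.sum(x.astype(np.int64) * w.astype(np.int64))
        y = (acc * M) >> s          # multiply by M then shift: ~divide by 2**s
        y = y + zp                  # add the output zero-point
        return int(np.clip(y, -128, 127))

    rng = np.random.default_rng(0)
    x = rng.integers(-128, 128, size=64, dtype=np.int8)   # int8 activations
    w = rng.integers(-128, 128, size=64, dtype=np.int8)   # int8 weights
    print(quantized_neuron(x, w, bias=512, M=1_342_177, s=31, zp=-3))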
From a practical point of view, there are two main ways of quantizing a network: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ consists of finding offsets and scale values to approximate the weights of an already trained network. Post-training quantization works quite well on large networks, especially when lowering weight size to 8 bits or above. To further reduce bit size without incurring high accuracy losses, it is usually necessary to use QAT. This consists of training the network by considering the low-precision behaviour during the process.

Google’s TensorFlow-Lite (TF-Lite) open-source framework provides an API to convert and interpret quantized networks. Given our target, that is micro-controllers possibly backed by an accelerator, for which lower than 8-bit precision is useless, we use the PTQ method. It produces weights and biases quantized to a fixed-point precision of 8-bit integers using the approach mentioned above and required by integer-only accelerators. PTQ takes a fully trained model and doesn’t require additional modifications for conversion into a quantized model. Nevertheless, an important point for the conversion process is to provide a representative dataset, i.e., a small subset of the original dataset which covers the entire value space. This gives the quantization process the range of input values, and it can then find the most appropriate 8-bit representation (multiplicand $M$ and shift $s$) for each weight and activation value. To achieve the best possible performance, i.e., ensure that all computations are done using SIMD instructions or outsourced to the TPU, it is recommended to strictly stick to the 8-bit data type. For this purpose, we perform full integer optimization with the TF-Lite converter, i.e., the inputs and the outputs use 8 bits.
one by one along the day. Moreover, it would give results that would be hard to interpret, as the pace and sizes of data batch transfers vary very much from one foundry to another.

Fig. 4. Experiment setups: (a) Google's Coral setup; (b) STM32MP1 setup. Both are powered through a small power-meter allowing to record the power consumption of the entire board.

B. Experimental Results

The inference latency is a measure of the real-time performance of the model. Experiments performed on 10,000 random images show that the average inference latency is about 1.11 ms. Table IV shows the inference latencies on 5 consecutive tests. Each time, the maximum latency corresponds to the first inference, which is longer due to the parameter caching of the model in the on-chip memory. Once the model has been loaded, the model weights are reused for the next inferences, and so inference latencies are about three times lower.

TABLE IV
CORAL BOARD: LATENCY MEASUREMENTS FOR A THROUGHPUT OF 10,000 IMAGES AND ON 5 CONSECUTIVE TESTS

Experiment         #1      #2      #3      #4      #5
Min Latency (ms)   0.959   0.956   0.956   0.952   0.952
Max Latency (ms)   2.916   3.568   4.160   4.278   3.825
Avg Latency (ms)   1.119   1.110   1.110   1.110   1.110
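A measurement loop of the following shape reproduces such statistics; it is a sketch under our own assumptions (a compiled model.tflite file and single-channel 224×224 int8 inputs already resident in memory), not the authors' benchmark harness.

```python
import time
import numpy as np
import tensorflow as tf

# Images are preloaded in memory, mirroring the protocol above. The first
# (cache-warming) inference is kept in the statistics, which is why the
# maximum latency is noticeably larger than the average.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

images = np.random.randint(-128, 127, (10_000, 1, 224, 224, 1), dtype=np.int8)
latencies = []
for image in images:
    start = time.perf_counter()
    interpreter.set_tensor(inp["index"], image)
    interpreter.invoke()
    latencies.append((time.perf_counter() - start) * 1e3)  # milliseconds

print(f"min {min(latencies):.3f} ms, max {max(latencies):.3f} ms, "
      f"avg {sum(latencies) / len(latencies):.3f} ms")
```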
Both boards achieve an accuracy of 97%, confirming the numbers we measured on the host using the TensorFlow framework. Regarding inference time, the Coral reaches a rate of ≈903 inferences per second. The MP1, on its side, performs at 5.5 inferences per second, without hardware support for inference. These results are good regarding our target problem and are actually very satisfying considering the size of the data (224×224 pixels per image) and the number of classes (58). Real time for wafer manufacturing means that we must keep the pace of the testing equipment within the production line, which yields a wafer every dozens of seconds at least. A minimum of around 5 inferences per second for the MP1 is thus well enough and gives a margin of progression both on production line speed and data complexity. For instance, it gives room for higher-resolution wafermaps or more classes.

The power measures we have made show that the Coral board has an idling instantaneous power of 3.3 W while the MP1 stays at 1 W. This of course is due to the internal hardware: the Coral embeds a higher-grade processor and a hardware tensor processing unit. The Coral board has a cooling fan which might add to its overhead (the MP1 being a fanless board), although the fan never actually ran during our experiments. We nevertheless decided not to remove it from the equation, to stay in a realistic use case, as it could be measured on an
actual device attached to a machine in the production line. In other words, we measure the whole board's components, including DRAM, peripheral accesses and the Linux kernel running on their cores, not specifically the coprocessor making the inference [27].

Fig. 5. Power measures: (a) Instantaneous power (idle on the bottom part of each bar); (b) Power efficiency in inferences per second per watt.

The left part of Figure 5 shows instantaneous power, both for the idle and running state of each board. When performing an inference, the Coral board consumes a total of about 4.1 W while the MP1 runs at 1.3 W. This represents a 25% increase for the Coral and a 30% increase for the MP1. However, considering the error bars, both power increases can be considered similar. The right part of Figure 5 also shows an interesting indicator, the performance per watt, to compare how both boards perform and which one is the most efficient.

Table V gathers the numbers we get for each board. Results show without surprise that the Coral with its TPU is 164× faster than the MP1. When it comes to inferences per second per watt, the Coral performs better with 220, against 4.2 for the MP1. This gives a 52× better power efficiency for the Coral board, which is easily explained by the dedicated ASIC for neural network acceleration. We still note that the TPU is not exploited at its maximum by the TensorFlow Lite backend, as the peak performance would lead to a 2 W increase in power consumption.

TABLE V
PERFORMANCE AND POWER EFFICIENCY OF BOTH BOARDS

Board      Inference Performance   Average Power   Performance per Watt
Coral      902 i/s                 4.1 W           220.16 i/s/W
STM32MP1   5.5 i/s                 1.3 W           4.21 i/s/W
Similar or better efficiency might be achievable with low-power GPUs such as the Nvidia Tegra. However, we are not convinced that this is much of an improvement considering the key metric of total energy consumption: they would indeed run faster, with an increased power of around 10 W, but the total consumed energy would be higher as we choose more and more power-hungry, fast hardware, because of the nature of how batches of wafermaps are generated by the test equipment. Faster devices would stay inactive most of the time, wasting their idle power while waiting for the next data batch. Thus, deliberately accepting less throughput still delivers the required throughput for the industrial semiconductor use case we consider, while being among the lowest power consumption solutions we can get for this type of classification problem. A good future improvement of such a solution would be to better tune hardware performance, to either downclock, standby or even shut down the inference platform at the right moment to save even more power. Finally, the market cost of small boards such as the ones used in these experiments is well under GPU solutions, making them even more attractive considering both the cost of the initial purchase and the maintenance replacement cost.

VI. CONCLUSION AND FUTURE WORK

Wafermap classification is an important step in semiconductor process control. While the throughput of the test equipment is low compared to, e.g., video rate, it runs around the clock. Therefore,
having a low-footprint, accurate, power-efficient solution usable directly on the industrial machines is of interest to spare energy. To that end, we presented in this paper a purpose-defined neural network architecture that features a low parameter count, which we further quantize to limit the computation and memory resources necessary to perform inference. We implement this network on micro-controller boards with and without a hardware inference accelerator and show that it can perform inferences fast enough to follow the test equipment pace, at a 4 inferences per second per watt cost for a small microcontroller board and around 220 i/s/W using a low-cost embedded TPU accelerator.

We proposed an approach centered on neural network model optimization, demonstrating that it is a good approach toward low-power deep learning solutions. This approach can be applied to other industrial use-cases sharing the same dataset features. In particular, datasets with low interference, such as our black and white wafermaps generated by consistent test equipment, are well suited. For instance, industries such as railway or photovoltaic manufacturing have test equipment generating very similar data with small changes between them. In the end, GoogLeNet, ResNet and others are very effective also with much more complex data such as 24-bit real-life photographs, but they are overkill solutions when applied to very specific industrial applications. As promising as this seems, one of the most important aspects is the training dataset. A clean training dataset is an absolute necessity before applying any sort of deep learning approach and must be the first priority. This means that it must be large enough, well labeled and well balanced. Only that prerequisite enables efficient model optimization, eventually allowing to downscale inference platforms.

With an appropriate model, further optimization can be made by focusing on the actual inference implementation. First, the kernel implementation can be optimized with power usage in mind. For instance, some instructions are by nature more power consuming than others, such as memory loads and stores, or branching instructions. Work focusing on instruction-level power consumption optimization could thus be used to trade more power efficiency against performance, e.g., by redoing a computation rather than storing and then loading an intermediate result. Secondly, hardware implementation solutions allow to optimize inference power efficiency even further. Multiple levels of abstraction can be used, depending on how much work we are willing to put into such implementations, extending from High Level Synthesis (HLS) solutions to HDL solutions. The interesting point with hardware solutions is that model quantization can be further pushed toward ternary or binary models, as demonstrated in other application domains [28]. This allows very efficient matrix multiplications, saving even more power while accuracy is only slightly degraded.

ACKNOWLEDGEMENTS

The authors would like to thank Maxime Martin, STMicroelectronics, Crolles, for providing them the dataset and process-related information. F. Pétrot would also like to acknowledge the support of the French Agence Nationale de la Recherche (ANR) through the MIAI@Grenoble Alpes ANR-19-P3IA-0003 grant.

REFERENCES

[1] B. T. Murphy, "Cost-size optima of monolithic integrated circuits," Proceedings of the IEEE, vol. 52, no. 12, pp. 1537-1545, 1964.
[2] K. Park and H. Simka, "Advanced interconnect challenges beyond 5nm and possible solutions," in 2021 IEEE International Interconnect Technology Conference (IITC). IEEE, pp. 1-3, 2021.
[3] M. H. Hansen, V. N. Nair, and D. J. Friedman, "Monitoring wafer map data from integrated circuit fabrication processes for spatially clustered defects," Technometrics, vol. 39, no. 3, pp. 241-253, 1997.
[4] F. Duvivier, "Automatic detection of spatial signature on wafermaps in a high volume production," in International Symposium on Defect and Fault Tolerance in VLSI Systems. IEEE, pp. 61-66, 1999.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, 2012.
[6] M.-J. Wu, J.-S. R. Jang, and J.-L. Chen, "Wafer map failure pattern recognition and similarity ranking for large-scale data sets," IEEE Transactions on Semiconductor Manufacturing, vol. 28, no. 1, pp. 1-12, 2015.
[7] T. Nakazawa and D. V. Kulkarni, "Wafer Map Defect Pattern Classification and Image Retrieval Using Convolutional Neural Network," IEEE Transactions on Semiconductor Manufacturing, vol. 31, no. 2, pp. 309-314, 2018.
[8] M. Saqlain, Q. Abbas, and J. Y. Lee, "A Deep Convolutional Neural Network for Wafer Defect Identification on an Imbalanced Dataset in Semiconductor Manufacturing Processes," IEEE Transactions on Semiconductor Manufacturing, vol. 33, no. 3, pp. 436-444, 2020.
[9] J. Shim, S. Kang, and S. Cho, "Active Learning of Convolutional Neural Network for Cost-Effective Wafer Map Pattern Classification," IEEE Transactions on Semiconductor Manufacturing, vol. 33, no. 2, pp. 258-266, 2020.
[10] R. Wang and N. Chen, "Defect Pattern Recognition on Wafers using Convolutional Neural Networks," Quality and Reliability Engineering International, vol. 36, no. 4, pp. 1245-1257, 2020.
[11] T.-H. Tsai and Y.-C. Lee, "A light-weight neural network for wafer map classification based on data augmentation," IEEE Transactions on Semiconductor Manufacturing, vol. 33, no. 4, pp. 663-672, 2020.
[12] H. Kahng and S. B. Kim, "Self-supervised representation learning for wafer bin map defect pattern classification," IEEE Transactions on Semiconductor Manufacturing, vol. 34, no. 1, pp. 74-86, 2021.
[13] L. Andrade, T. Baumela, F. Pétrot, D. Briand, O. Bichler, and M. Coppola, Efficient Deep Learning Approach for Fault Detection in the Semiconductor Industry, ser. Series in Communications and Networking. River Publishers, ch. 2.2, pp. 131-146, 2021.
[14] C. Cortes, L. D. Jackel, and W. P. Chiang, "Limits on learning machine accuracy imposed by data quality," Advances in Neural Information Processing Systems, vol. 7, 1994.
[15] D. H. McLain, "Two dimensional interpolation from random data," The Computer Journal, vol. 19, no. 2, pp. 178-181, 1976.
[16] F. Hutter, L. Kotthoff, and J. Vanschoren, Automated machine learning: methods, systems, challenges. Springer Nature, 2019.
[17] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," The Journal of Machine Learning Research, vol. 20, no. 1, pp. 1997-2017, 2019.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9, 2015.
[19] G. Dundar and K. Rose, "The effects of quantization on multilayer neural networks," IEEE Transactions on Neural Networks, vol. 6, no. 6, pp. 1446-1451, 1995.
[20] B. Hoskins, M. Haskard, and G. Curkowicz, "A vlsi implementation of multi-layer neural network with ternary activation functions and limited integer weights," in 1995 20th International Conference on Microelectronics, pp. 843-846, 1995.
[21] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "Yodann: An ultra-low power convolutional neural network accelerator based on binary weights," in IEEE Computer Society Annual Symposium on VLSI, pp. 236-241, 2016.
[22] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "Finn: A framework for fast, scalable binarized neural network inference," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 65-74, Feb. 2017.
[23] A. Prost-Boucle, A. Bourge, and F. Pétrot, "High-efficiency convolutional ternary neural networks with custom adder trees and weight compression," ACM Transactions on Reconfigurable Technology and Systems, vol. 11, no. 3, pp. 1-24, 2018.
[24] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang, "Accelerating binarized convolutional neural networks with software-programmable fpgas," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, pp. 15-24, 2017.
[25] N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma et al., "Ten lessons from three generations shaped google's tpuv4i: Industrial product," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, pp. 1-14, 2021.
[26] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer arithmetic-only inference," in IEEE conference on computer vision and pattern recognition, pp. 2704-2713, 2018.
[27] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "How to evaluate deep neural network processors: Tops/w (alone) considered harmful," IEEE Solid-State Circuits Magazine, vol. 12, no. 3, pp. 28-41, 2020.
[28] A. De Vita, D. Pau, L. Di Benedetto, A. Rubino, F. Pétrot, and G. D. Licciardo, "Low power tiny binary neural network with improved accuracy in human recognition systems," in 23rd Euromicro Conference on Digital System Design. IEEE, pp. 309-315, 2020.
Ana Pinzari received the Ph.D. degree in computer science at Université de Technologie de Compiègne, Compiègne, France, in 2012, on the subject of defining a methodology for the execution of real-time image processing algorithms. Then, she worked in a scientific research organization, ECSI (European Electronic Chips & Systems Initiative), focusing on new design methods, tools and standards for the design of complex electronic systems. Besides dissemination activities of European EDA research projects, she took part in the definition and implementation of renowned international conferences focusing on specification and design languages (FDL, DVCon Europe), signal and image processing (DASIP) and debug (S4D). She joined Grenoble INP, TIMA laboratory, in 2021, as a post-doctoral research engineer. Her current research interest is in exploring optimization methods for the hardware implementation of neural networks.

Thomas Baumela received his Ph.D. degree in computer science at Université Grenoble Alpes, Grenoble, France, in 2021, working on the hardware and software integration of devices in FPGA using a co-designed message-based approach at the TIMA Laboratory. He then stayed at the TIMA laboratory as a post-doctoral researcher, working on the implementation and integration of hardware-accelerated neural networks.

Liliana Andrade received her engineering degree from Universidad de Los Andes, Mérida, Venezuela, in 2012, and her Ph.D. degree in computer science, telecommunications and electronics from Université Pierre et Marie Curie (Paris VI), Paris, France, in January 2016. Since September 2017, she is with the TIMA laboratory, System Level Synthesis team, in Grenoble, France. She is associate professor in computer science at the Polytechnical School of Université Grenoble Alpes. Her research interests include system-level modeling, design and validation of systems on chip, and acceleration of heterogeneous systems simulation.

Marcello Coppola is technical Director at STMicroelectronics. He has more than 25 years of industry experience with an extended network within the research community and major funding agencies, with the primary focus on the development of breakthrough technologies. He is a technology innovator, with the ability to accurately predict technology trends. He is involved in many European research projects targeting Industrial IoT and IoT, cyber physical systems, Smart Agriculture, AI, Low power, Security, 5G, and design technologies for Multicore and Many-core System-on-Chip, with particular emphasis on architecture and network-on-chip. He has published more than 50 scientific publications and holds over 26 issued patents. He authored chapters in 12 edited print books, and he is one of the main authors of the "Design of Cost-Efficient Interconnect Processing Units: Spidergon STNoC" book. Until 2018, he was part of the IEEE Computing Now Journal technical editorial board. He contributed to the security chapter of the Strategic Research Agenda (SRA) to set the scene on R&D on Embedded Intelligent Systems in Europe. He is serving under different roles in numerous top international conferences and workshops. He graduated in Computer Science from the University of Pisa, Italy, in 1992.

Frédéric Pétrot received the Ph.D. degree in computer science from Université Pierre et Marie Curie (now Sorbonne Université), Paris, France, in 1994, working on the Alliance CAD system for VLSI design. He became assistant professor there in 1995, and worked on higher levels of abstraction for ESL. He joined Grenoble-INP/Ensimag, Grenoble (now Institute of Engineering Univ. Grenoble Alpes), France, as a professor in 2004. His research takes place in the TIMA Laboratory and focuses on the specification, simulation, and implementation of multiprocessor systems on chip architectures, including circuits, system software, and Computer-Aided-Design aspects. Since 2019, he holds the Digital HW AI Architectures chair of the Grenoble Multidisciplinary Institute in Artificial Intelligence.
Low-power Analog In-memory
Computing Neuromorphic Circuits
Roland Müller, Bijoy Kundu, Elmar Herzer, Claudia Schuhmann, and
Loreto Mateu
Abstract—The presented neuromorphic circuits comprise synaptic weights and neurons including batch normalization, activation function, and offset cancelation circuits. These neuromorphic circuits comprise an effective 3.5-bit weight storage based on binary memory cells, while the analog multiplication and addition operation is based on a voltage divider principle. To experimentally prove the working principle, three fully connected layers (50x20, 20x10 and 10x4) have been designed. The connection between these layers is realized completely in the analog domain, without ADCs and DACs in between. An inference state machine takes care of pipelining the layers for proper operation during inference. The schematic and layout of the neuromorphic circuits comprised in these layers have been automatically generated with an in-house designed automation framework. This framework, called UnilibPlus, is a Python-based Cadence Virtuoso add-on. Simulation results of weight loading, transfer of input values, inference, and reading of inference results via the SPI interface show a correct operation of the designed ASIC, with 12 nJ per inference and 5 μs latency.

Index Terms—AI accelerators, Application specific integrated circuits, Analog processing circuits, Neural networks, Neural network hardware, SRAM cells

The research leading to these results has received funding from the Electronic Components and Systems for European Leadership Joint Undertaking under grant agreement No 826655 – project TEMPO. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation programme and Belgium, France, Germany, Netherlands, Switzerland. TEMPO has also received funding from the German Federal Ministry of Education and Research (BMBF) under Grant No. 16ESE0405. The authors are responsible for the content of this publication. Roland Müller, Bijoy Kundu, Elmar Herzer, Claudia Schuhmann and Loreto Mateu are with the Advanced Analog Circuits group at Fraunhofer IIS, Am Wolfsmantel 3, 91058 Erlangen.

I. INTRODUCTION

THE outperforming results of deep neural networks (DNNs) on cloud-based platforms have pushed the development of new DNN models and hardware components for edge devices. Since the requirements for cloud and edge computing in terms of performance are different, new architectures are explored. Inference accelerators for edge applications target low energy consumption per inference. Since the multiply-accumulate (MAC) operation together with the activation function are the main operations to perform in DNNs, inference accelerators perform these operations efficiently. This work explores a fully analog implementation approach for DNNs by using SRAM cells for the storage of the synaptic weights in the crossbar arrays in 28-nm GlobalFoundries technology. This work presents: (i) a crossbar implementation per neural network layer with 3.5-bit accuracy for weight storage (equivalent to three positive, three negative and zero weight values); (ii) a pipeline between layers arranged via finite state machines (FSMs); (iii) the use of SRAM cells for in-memory computing, which ensures the porting of the circuit to other technology nodes and reduces data movement; (iv) a
voltage divider approach for the synaptic weight circuit; (v) a fully analog design of the neural network without ADCs and DACs between layers, in order to get rid of their overhead [1] for the DNN design; (vi) weight loading via an SPI interface to update the synaptic weights of the layers; and (vii) circuit blocks included to enable the test of all neural network building blocks.

The paper is organized as follows: Section II provides an overview of the ASIC architecture, a short description at block level as well as a more detailed explanation of the synaptic and neuron circuits. Section III summarizes the simulation results obtained at neuron level and during inference. Design-for-testing techniques have been used in the ASIC design to be able to scan the correct functionality of each one of the circuits that conform the crossbar arrays of the three layers separately. Section IV provides an overview of the functional tests and evaluation tests strategy for the verification of the designed circuits and the evaluation of their performance. Finally, Section V provides a summary of the remaining work and concluding remarks.

II. ARCHITECTURE FOR AN ANALOG IMPLEMENTATION OF NNS

A. Overall Architecture

Fig. 1 shows the top-level block diagram of the inference accelerator ASIC, which consists of a Digital Control block, a Frontend circuit, a 3-layer Deep Neural Network (DNN), a Test Multiplexer and a Capture Stage. The Digital Control block includes a Serial Peripheral Interface (SPI) to control the configuration of the ASIC and to enable the transfer of input and output data during inference. Additionally, three different types of finite state machines (FSMs) are included: a Frontend-FSM (FE-FSM), three Inference-FSMs (3x I-FSM) and a Capture FSM. These state machines control the execution and timing of the inference executed in the DNN.

Fig. 1. Top-level block diagram of the DNN circuit including digital control and circuitry used to test the ASIC.

The Frontend block distributes the 8-bit serial input data to registers for each input of the first DNN layer. The registers are connected to a combination of a DAC and a Driver circuit. The DAC converts the digital input data to the analog domain, whilst the Driver provides a low output resistance to drive the input resistance of the DNN. The outputs of the Frontend are then used as the Analog Input Data to the DNN, with the DNN containing three fully connected (FC) layers with input-output dimensions of 50×20, 20×10 and 10×4. After all layers have finished processing their inputs, the Analog Output Data of the DNN is captured by four 8-bit ADCs (one per output neuron of the last layer), implementing the conversion back from the analog to the digital domain. These Digital Output Data can then be read out via SPI. Finally, the Test Multiplexer provides access to all frontend and neuron output voltages for measurement and verification purposes.
The architecture of the chip enables the measurement, evaluation, and verification of almost every building block within the DNN separately. Therefore, block-level simulation results can be correlated with the respective measurements. The results can further be used to extrapolate the circuit's behavior to bigger-scale networks that shall be implemented.

B. Voltage Divider Approach for Synaptic Weights

Synaptic weights in analog DNN accelerators are one of the major components in such circuits. They can be implemented in multiple ways, with the standard being a voltage-to-current conversion using a variable resistor, as shown in Fig. 2 [1]. In this implementation, the voltages across the resistors R1 to RN are converted into a current by the variable resistors, with the resistor values being set to a certain value provided by the neural network model. The resulting currents are then summed up into the final output current. This implementation requires the voltage at the Output node to be constant (virtual ground) and the current to be converted back to a voltage (I-V conversion) to forward the calculated value as the input voltage to the next layer. Different approaches can be followed to implement the virtual ground and the I-V conversion. For example, a transimpedance amplifier (TIA) or a shunt resistor can be used. The latter can only approximately provide a virtual ground if its resistance is small enough, which would lead to very low output voltages. Using a TIA, the virtual ground requirement is fulfilled, but it must feed back the same current as IOut. This leads to increased energy consumption, whilst stability proves to be critical, as the resistance connected to its input is dependent on the used weight values.

Fig. 2. Synaptic weight circuit using variable resistors.

Because of the previously mentioned issues, a different approach has been implemented in our ASIC design. For this, we use a configurable voltage divider [2], as shown in Fig. 3, replacing the variable resistor circuit.

Fig. 3. Voltage divider synaptic weight implementation: (a) schematic and (b) equivalent circuit illustration.

Each synaptic weight circuit consists of two resistors R1 and R2. Both resistors can be connected either to the Input node or to the analog ground. This is illustrated in Fig. 3(a) by the switches S1 and S2. The resistors are sized such that the output resistance ROut, illustrated in Fig. 3(b), stays constant regardless of the switch configuration, whilst the weight value k is equivalent to the voltage divider ratio. Table I shows the possible weight values k and the corresponding input resistance RIn for the four different switch configurations, under the condition that R1 = 2·R2.

TABLE I
WEIGHT VALUE AND INPUT RESISTANCE OF THE VOLTAGE DIVIDER SYNAPTIC WEIGHT

S1       S2       k      RIn
Ground   Ground   0      –
Input    Ground   0.33   R1
Ground   Input    0.67   R2
Input    Input    1      R1 || R2
If multiple such synaptic weight circuits are connected in parallel to form a crossbar array output, the output voltage is defined by (1):

V_Out = (1/N) Σ_{i=1}^{N} V_In,i · k_i    (1)

where N is the number of inputs connected together to a common output through synaptic weights, V_In,i is the input voltage for input i, and k_i is the weight value for input i, as shown in Table I. The output voltage is the average of the weighted sum of the input voltages, which leads to a reduction of the effective weight values. Therefore, a recovery gain needs to be applied to the outputs so as to bring them into a usable range.
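To illustrate (1) and the role of the recovery gain, the following sketch models one differential crossbar column in Python; the input voltages and weight selections are invented examples, not simulation data.

```python
import numpy as np

# The four divider ratios of Table I; a pair (k_pos, k_neg) per input
# emulates the signed weights realized by the differential circuit.
K_LEVELS = np.array([0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])

def column_output(v_in, k_pos, k_neg):
    """Eq. (1) for both branches: averaged weighted sum of input voltages."""
    n = len(v_in)
    return np.dot(v_in, k_pos) / n, np.dot(v_in, k_neg) / n

v_in = np.array([0.25, 0.30, 0.20])   # input voltages in volts (made up)
k_pos = K_LEVELS[[3, 0, 1]]           # positive-branch switch settings
k_neg = K_LEVELS[[0, 2, 0]]           # negative-branch switch settings
v_p, v_n = column_output(v_in, k_pos, k_neg)
# The averaging by 1/N shrinks the signal; the neuron's recovery gain
# must bring this differential output back into a usable range.
print(f"differential output: {(v_p - v_n) * 1e3:.1f} mV")
```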
The voltage divider approach for the circuit must implement a gain due to the
synaptic weight does not require an I-V averaging operation in the crossbar. Ad
conversion because every output of the ditionally, a batch normalization circuit
crossbar array is a voltage. This improves must be implemented which consists of
the accuracy compared with the cur a variable gain with a following offset
rent based solution, because the weight addition. Finally, a defined nonlinearity is
value only depends on the matching of required to perform the activation func
unit resistors. The energy consumption tion. Equation (2) defines the operation
is also reduced because only a voltage that the neuron implements
amplifier is needed which can be imple
mented in low power switch capacitor VNeuron,Out = f (VNeuron, In ·kNeuron
topology.
+VO f f set ) (2)
Further, as the output is a voltage,
the signal routing within the crossbar is where VNeuron,In , is the input voltage,
simplified as parasitic resistances at the VNeuron,Out is the output voltage, VO f f set
outputs do not influence the calculation is the batch normalizations offset voltage
result. and kNeuron represents the combination
Finally, to implement both positive of the mentioned recovery gain and the
and negative weight values, the synaptic batch normalizations variable gain. The
weight circuit was replicated to create function f is defined as the nonlinear
a differential output pair like shown in activation function, which in this case is
Fig. 4. If a positive weight value is re a Rectified Linear Unit (ReLU) function.
The neuron circuit comprises two amplification stages: a switched-capacitor variable gain stage [4] to implement the batch normalization circuit, and a buffer stage including the ReLU function. In between both stages, a sample-and-hold
stage is placed to decouple and pipeline the operation of the layers in the neural network.

As stated, the activation function should be a ReLU function, which has a constant output value for negative input values and a linearly rising output value for positive input values. Ideally, the maximum output value is not limited, but the output voltage cannot exceed its positive supply voltage. Therefore, for high input voltages, the output voltage will result in a constant value. This behavior equals the DC transfer function of an operational amplifier, which was therefore chosen as the architecture for the activation function. The output range reaches from 200 mV to 400 mV at a supply voltage of 1 V. The limits were chosen such that there is enough headroom to the negative supply voltage and such that the output does not exceed half of the supply voltage. This eases the requirement for the switch resistance in the synapse circuits, since an NMOS switch is sufficient.

Fig. 5 shows a simplified schematic of the implemented ReLU activation function. Transistors M3 and M4 form a source follower output stage. This configuration was selected due to the relatively low load resistance from the following layer, of approximately 1 kΩ. The upper limit is defined by Vdd2, which is set to 400 mV and is therefore limited by the supply voltage. The lower limit cannot be generated in the same manner, because the input voltage of the output stage is limited to Vss. Therefore, the gate-source voltage of transistor M4 would be 0 if the required output voltage is also 0. In this case, the output voltage cannot reach 0 if a load resistance is connected, as transistor M4 would not drive any current. To avoid this effect, a limiting amplifier (Limiter) was introduced that compares the output voltage to a reference voltage (Ref) of 200 mV and sets the output voltage to the reference voltage in case the output voltage drops below the reference voltage [5].

Fig. 5. ReLU activation function and buffer circuit.

The transition between the linear and the limited region of the activation function is implemented by the two-gate amplification stage, which consists of the transistors M1 and M2. In the linear region, M1 is turned completely on, leaving the Limiter with no effect on the circuit behavior. In the region where the Limiter is active, the current through M1 is regulated such that the output voltage stays constant.

III. SIMULATION RESULTS

Corner and mismatch simulations have been performed for all analog circuits. Digital simulations for the digital core, including the SPI module and the FSMs, have also been performed. This section provides further details of the simulation results obtained for the neuron circuit and for the inference of the three layers.

A. Neuron

The neuron circuit was simulated for different operating conditions, including different batch normalization gain and offset values, load resistances, load voltages and input voltages. Fig. 6 shows the simulation results for the transfer function of the neuron for different process, voltage, and temperature (PVT) variations, with 5% supply voltage variation
and a temperature range of 0 °C to 70 °C. As required, the output voltage stays constant for negative input voltages, rises linearly for positive voltages, and is clipped for differential input voltages exceeding 200 mV. Further, the transitions between the linear and constant regions are well defined, which is also required for a successful training and inference of a neural network.

Fig. 6. Transfer function of the neuron over PVT variations.

B. Inference

The inference was simulated with multiple random weight and input value patterns. In a first step, the weights are loaded onto the neural network, and afterwards multiple inferences have been executed with different input data value sets. Fig. 7 shows the transient simulation results for a single inference, where each plot shows the output voltages of the neurons per layer. The operation of each neuron starts with an offset sampling phase. During this phase, the calculations in the batch normalization circuits are executed, and the result is sampled on an internal capacitor. Afterwards, the activation function is calculated, leading to a valid output value to be used in the next layer for calculating the multiply-and-accumulate results. The same process is executed in each layer of the neural network, in the pipelining process described in Section II.C. Within the neuron, the input voltage is amplified and then sampled in a sample-and-hold stage. Right after the sampling has happened in one layer, the preceding neurons are switched off to reduce the energy consumption.

Fig. 7. Transient simulation results of the output voltages of all three layers.

The three vertical markers in Fig. 7 highlight the time points at which the result of each layer is valid. The neuron output voltages at these points have been compared to an ideal calculation executed in Python. The average difference resulted in 1.9 mV and the maximum difference in 6.9 mV between ideal calculations and simulations. As such, we derive a 1% average and 3% maximum difference for a neuron output range of 200 mV.

C. Key Performance Indicators

Table II summarizes the main performance indicators of the work presented in this paper. The energy consumption of the neural network was estimated based on a Python script containing electrical simulation data of the neuron energy consumption in relation to its operating conditions. Then, the operating conditions for each neuron in the neural network are calculated for different random weight and input patterns. These operating conditions are then used to get the respective energy consumption for each neuron from the previous characterization. Finally, all the energy consumption values are summed, which leads to the
final energy consumption of the complete neural network, with an average result of 12 nJ.

TABLE II
KEY PERFORMANCE INDICATORS FOR THIS WORK

Parameter             Value
CMOS Technology       28-nm GF
Synapse Memory Size   4.96 Kb
Weight precision      3.5 b (7 levels)
Core Area             940 μm x 890 μm
Power Supply          1.0 V
Energy/Inference      12 nJ
Latency               5 μs
Power Efficiency      193 GOPS/W
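The estimation flow just described can be summarized by a runnable sketch of the same structure; here `energy_lut` is a stand-in for the pre-characterized electrical-simulation data, and its coefficients are placeholders rather than the real characterization.

```python
import numpy as np

rng = np.random.default_rng(0)
LAYERS = [(50, 20), (20, 10), (10, 4)]  # (inputs, neurons) per FC layer

def energy_lut(gain, v_in):
    # Placeholder for the characterized neuron energy model (J/operation);
    # in the real flow this interpolates electrical simulation data.
    return 1.2e-10 * (1.0 + gain) * (1.0 + v_in / 0.4)

def estimate_inference_energy(n_patterns=100):
    totals = []
    for _ in range(n_patterns):
        total = 0.0
        for _, n_out in LAYERS:
            gains = rng.uniform(0.5, 2.0, n_out)   # per-neuron BN gain
            v_ins = rng.uniform(0.0, 0.3, n_out)   # per-neuron input level
            total += energy_lut(gains, v_ins).sum()
        totals.append(total)
    return np.mean(totals)

print(f"average energy per inference ≈ {estimate_inference_energy() * 1e9:.1f} nJ")
```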
IV. EVALUATION AND VERIFICATION

The inference accelerator ASIC is designed as a test chip and provides access to all neuron outputs via the test multiplexer, as illustrated in Fig. 1. To verify the functionality and evaluate the performance of all circuit components based on measurements, test patterns are required. These test patterns are automatically generated by a Python script that creates the required files for weight loading and inference. In addition, the expected neuron output voltages are calculated for later comparison and validation. This section describes the tests and test patterns used for evaluation and verification of the ASIC.
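A pattern generator in the spirit of that script might look as follows; the `.npz` file layout, the idealized layer model, and every name here are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
LAYERS = [(50, 20), (20, 10), (10, 4)]
# Seven signed weight levels (3.5 bits), from the differential divider pairs.
K_LEVELS = np.array([-1.0, -2 / 3, -1 / 3, 0.0, 1 / 3, 2 / 3, 1.0])

def make_pattern(path="pattern.npz"):
    """Emit random weights, inputs and the expected (ideal) neuron outputs."""
    weights = [rng.choice(K_LEVELS, size=(n_out, n_in)) for n_in, n_out in LAYERS]
    x = rng.uniform(0.2, 0.3, LAYERS[0][0])   # analog input voltages (V)
    expected = x
    for w in weights:
        # Idealized layer: averaged weighted sum (Eq. (1)) followed by ReLU.
        expected = np.maximum(w @ expected / w.shape[1], 0.0)
    np.savez(path, x=x, expected=expected,
             **{f"w{i}": w for i, w in enumerate(weights)})
    return expected

print(make_pattern())
```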
A. Functional Test

Functional tests validate the expected operation of all circuit building blocks. Therefore, the test patterns must configure the ASIC in such a way that one component is tested at a time.

Frontend Test

First, the digital input value for one of the frontend DACs at a time is swept through its complete range, whilst the frontend driver's output voltage is measured and compared with the expected value.

Neuron Test

In a second test, each neuron is validated individually. This is done by setting all the weight values to 0, which ensures that there is no influence between the neurons. Afterwards, the batch normalization offset is swept through its complete range, which leads to a varying output voltage of the neuron. For each neuron and each batch normalization offset value, the neuron output voltage is measured and compared with the expected value.

Weight Test

After testing the neuron circuit's functionality, the weights can be tested. This is done by setting the output voltage of a frontend driver or a neuron in a layer to the center of its output range, whilst sweeping one of the connected weights through its possible values and measuring the output voltage of the following neuron.

In the case of negative weight values and a weight value of 0, the neuron's output voltage would always be 200 mV due to the limiter circuit. Therefore, the batch normalization offset value is used in this case to set the neuron's output voltage to a value greater than 200 mV. The different combinations used are summarized in Fig. 8.

Batch Normalization Gain Test

The final functional test verifies the batch normalization gain configuration by sweeping the gain value through its overall range. In this test, according to the selected gain value, one or more synaptic weights must be used to set the
neuron's output voltage to a measurable value. The number of weights is selected such that the output voltage of the currently tested neuron is close to 300 mV and the input voltage to the weights in use never exceeds 300 mV. Fig. 9 shows two different configurations for gain values of 0.5 and 1, where three and two synaptic weights are used, respectively.

Fig. 8. Different combinations of weight value, batch normalization offset and resulting output voltage to test the synaptic weights.

Fig. 9. Different number of weights in use for different gain configurations.

B. Evaluation Test

The evaluation tests first verify if the complete inference is executed correctly. Secondly, the energy consumption in different configurations is measured.

Random Test Pattern

The first evaluation test uses random weight and input value patterns. For each weight pattern, multiple input value patterns are used. This test executes inferences with pseudo-realistic values and therefore is a good indicator for the energy consumption in realistic scenarios. Furthermore, its results can be used to verify the calculated energy consumption results described in Section III.C.

Best and Worst Case Energy Consumption

The final test configures the ASIC in such a way that the lowest and highest energy consumption is achieved. On the one hand, the test case with the lowest energy consumption occurs when all weights are set to 0 whilst the digital input values are also 0. On the other hand, the highest energy consumption requires half of the weights in each column of the crossbar array to be set to their maximum value whilst the other weights are set to their minimum value, with the neural network's input voltages and the batch normalization offset values all set to their maximum value.

V. CONCLUSIONS AND FUTURE WORK

While this work shows the architecture and circuits implemented for a fully analog in-memory computing implementation of a DNN based on SRAM cells, as well as the simulation results and the test and evaluation strategy, measurements are still pending and will be executed as soon as the ASIC is fabricated and packaged.
The SRAM-based in-memory computing analog DNN implementation presented in this work shows good energy efficiency performance, which makes it attractive for small models with fixed sizes of neurons per layer. In comparison with many emerging non-volatile memory (eNVM) technologies like FeRAM, PCM, RRAM or FeFETs, SRAM cells are a mature technology that can be implemented in any technology node. While those eNVMs do not add any leakage to the overall ASIC while no inference is performed, SRAM is a good choice for applications in which inferences are continuously running.

The circuits presented here have been simulated across corners and mismatch and provide very good accuracy when comparing their results with ideal DNN models. Thus, the synaptic divider approach and overall architecture prove that a fully analog computation of neural networks can be implemented.

REFERENCES

[1] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars," SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 14–26, June 2016, https://doi.org/10.1145/3007787.3001139.
[2] E. Herzer and R. Müller, "Electronic Circuit for Calculating Weighted Sums," EP3992862, May 4, 2022.
[3] R. Müller et al., "Hardware/Software Co-Design of an Automatically Generated Analog NN," in Proc. SAMOS 2021, Lecture Notes in Computer Science, vol. 13227, pp. 385–400, Springer, Cham, https://doi.org/10.1007/978-3-031-04580-6_26.
[4] C. Enz and G. C. Temes, "Circuit techniques for reducing the effects of op-amp imperfections: autozeroing, correlated double sampling, and chopper stabilization," Proc. of the IEEE, vol. 84, no. 11, pp. 1584-1614, Nov. 1996, doi: 10.1109/5.542410.
[5] C. Cini, C. Diazzi and P. Erratico, "Limited output operational amplifier," U.S. Patent 4,672,326, June 9, 1987.

Roland Müller, born on 19 January 1994, obtained his B.Eng. at the OTH Regensburg in 2017 and his M.Sc. in 2019 at the FAU Erlangen, both in electrical engineering. In May 2019, he joined the department of Integrated Circuits and Systems at Fraunhofer IIS, Erlangen (Germany), where he is working in the field of analog-mixed signal design of neural network accelerators and design automation for such circuits. Currently, he is pursuing his PhD. His main research interests include low-power analog-mixed signal circuits, neuromorphic computing and electronic design automation.

Bijoy Kundu received the B.Eng. degree in Electronics and Telecommunication Eng. from IIEST, Shibpur, India, in 2013, and the M.Sc. degree in Information and Communication Eng. from TU Darmstadt, Germany, in 2017. He joined Fraunhofer IIS as a research engineer in 2017 and is currently working there in the Advanced Analog Circuits group. His main research interests include low-power analog-mixed signal circuits, data converters, and neuromorphic computing.
Elmar Herzer received his Dipl.-Ing. degree in electrical engineering from the University of Stuttgart, Germany, in 1997. Since 1997 he has been with the department for analog and mixed signal IC Design at Fraunhofer IIS, Erlangen. He has been involved in numerous ASIC projects with mixed signal sensor frontends for automotive, industrial and consumer applications, mainly with 3D-Hall sensors, where he focused on low-noise pre-amplification and Delta-Sigma ADCs. As a chief engineer he guided the 22 nm SOI design enablement and the IIS 22 nm design flow. He is currently working in the field of analog-mixed signal design of neural network accelerators as a system architect. His main research interests include low-noise analog-mixed signal circuits, approximate computing and analog design automation. He has filed 12 patents.

Claudia Schuhmann obtained her diploma in physics in 1986 at the FAU Erlangen. Until 1995 she worked on analogue design implementation in the IC department of SIEMENS AG. In 1999 she joined Fraunhofer IIS, where she is working in the field of digital design. Her main focus is on the implementation and physical verification of digital and mixed signal designs in deep submicron technologies.

Loreto Mateu obtained her B.S. in Industrial Engineering in 1999, her M.S. in Electronic Engineering in 2002 and her PhD degree in June 2009, with a thesis titled Energy Harvesting from Human Passive Power, at the Universitat Politècnica de Catalunya. In June 2007, she joined Fraunhofer IIS as a research engineer and became chief scientist in 2012. Since 2018 she is group manager of the Advanced Analog Circuits group. Her research interests include ultra-low power design, AC-DC and DC-DC converters, neuromorphic hardware and energy harvesting. She holds several patents and is co-author of a book, a book chapter and international papers published in journals and conference proceedings.
Tools and Methodologies for
Edge-AI Mixed-Signal Inference
Accelerators
Loreto Mateu, Johannes Leugering, Roland Müller, Yogesh Patil, Maen
Mallah, Marco Breiling, and Ferdinand Pscheidl
Abstract—The ANDANTE project aims to tackle the hardware/software co-design challenge that arises from the development of novel (neuromorphic) edge-AI accelerators. For this purpose, Fraunhofer IIS and EMFT, among other partners, developed several tools to facilitate the design, training and deployment of artificial neural networks in dedicated hardware accelerators. These tools provide hardware-aware training, automatic hardware generation, compilers, estimation of KPIs like energy consumption, and simulation under consideration of the constraints imposed by the targeted hardware implementation and use cases. The development of such a tool chain is a multidisciplinary effort combining neural network algorithm design, software development and integrated circuit design. We show how such a toolchain allows to optimize and verify the hardware design, reach the targeted KPIs, and reduce the time-to-market.

Index Terms—AI accelerators, analog processing circuits, application specific integrated circuits, mixed-signal simulation, neural networks, neural network hardware, neuromorphic computing, system on chip.

The research leading to these results has received funding from the Electronic Components and Systems for European Leadership Joint Undertaking under grant agreement No 876925 – project ANDANTE. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation programme and France, Belgium, Germany, The Netherlands, Portugal, Spain, Switzerland. ANDANTE has also received funding from the German Federal Ministry of Education and Research (BMBF) under Grant No. 16MEE0116 and 16MEE0117. The authors are responsible for the content of this publication. Loreto Mateu, Johannes Leugering, Roland Müller, Yogesh Patil, Maen Mallah and Marco Breiling are with the Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 3, 91058 Erlangen, Germany. Ferdinand Pscheidl is with the Fraunhofer Research Institution for Microsystems and Solid State Technologies EMFT, Hansastraße 27d, 80686 Munich, Germany.

I. INTRODUCTION

THE rapid adoption of deep learning in recent years has led to a growing demand for efficient AI hardware accelerators. In particular, edge applications with very specific and stringent KPI requirements, e.g. in terms of power consumption or latency, stand to benefit from highly customized solutions. However, designing a custom inference accelerator for edge AI applications requires a multi-disciplinary hardware/software co-design approach that combines neural network algorithm design and software development with integrated circuit design. To reach the targeted KPIs within a short time-to-market despite the complexity of this task, dedicated tools and workflows for optimizing and verifying the hardware design are mandatory. Taken together, such tools form a hardware/software co-design tool chain that, we argue, is crucial to make the development of custom
neuromorphic hardware accelerators viable.

In this paper, we present an example of such a tool chain, developed in the ANDANTE project, for a mixed-signal inference accelerator with analog In-Memory Computing (IMC), and explain the rationale behind it.

The paper is organized as follows: after providing a brief overview of the entire tool chain in Section II, we explain each of its components in Section III, followed by concluding remarks in Section IV.

II. TOOLCHAIN OVERVIEW FOR NEUROMORPHIC COMPUTING

The design of an inference accelerator with high performance, e.g. in terms of its energy efficiency, latency and throughput, requires more than good circuit design. It also requires tools that minimize the memory footprint, data movement and access, and that can provide (optimized) sets of instructions for the specific hardware. To form a coherent tool chain, these tools need to be mutually compatible, and they need to be designed with the target hardware in mind.

Fig. 1 shows the different components of one of the toolchains used in the ANDANTE project for the design of optimized edge-AI inference accelerators. It comprises five specific tools, each explained in the following section, that take as input the hardware building blocks from which the system is constructed, the neural network architectures and the labeled data to train them, as well as a description of the hardware system architecture and its limitations. As output, these tools provide (part of) the hardware design, the neural networks and instructions to deploy on it, as well as estimations of its KPIs.

Fig. 1. Toolchain for the design of edge-AI mixed-signal inference accelerators.

The Hardware-Aware Training tool takes the labeled data set and the NN model as input and generates a quantized model. In this way, the quantized model takes into consideration the quantization of the NN parameters and some errors induced by the hardware components. Afterwards, the quantized NN model is provided to the Mapper & Compiler tool to obtain the compiled program that will be used by both the Architecture Exploration and Simulation tool and the Deployment & Runtime API. The NN Hardware Generator tool partially automates the generation of the custom mixed-signal inference accelerator.
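The data flow between these tools can be pictured as a simple pipeline; the following stub is purely illustrative, with invented function names and return values.

```python
# Sketch of the tool-chain data flow described above; every function body is
# a stand-in for one ANDANTE tool, and all names/signatures are ours.
def hardware_aware_training(nn_model, labeled_data):
    return {"model": nn_model, "quantized": True}       # quantized NN model

def map_and_compile(quantized_model, arch_spec):
    return ["load weights", "run layer 0", "..."]       # compiled program

def explore_and_simulate(program, arch_spec):
    return {"energy": "estimate", "latency": "estimate"}  # KPI estimates

def generate_hardware(building_blocks, arch_spec):
    return "partially generated mixed-signal accelerator"

arch_spec = {"name": "mixed-signal IMC accelerator"}
quantized = hardware_aware_training({"layers": [...]}, labeled_data=[])
program = map_and_compile(quantized, arch_spec)
kpis = explore_and_simulate(program, arch_spec)
```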
State-of-the-art design and simulation environments for neural networks like N2D2 [1] allow quantization-aware training, but do not provide the possibility to train the network with the weight variations introduced by the IMC analog circuits. Already-existing state-of-the-art mapping tools for digital accelerators like Timeloop can also be used to represent mixed-signal inference accelerators containing simple analog crossbar arrays [2]. However, complex constraints related to dataflow, computational resources and scheduling of standalone accelerators cannot be integrated in Timeloop. As the existing tools are not able to provide enough functionality and flexibility to incorporate the restrictions and artifacts produced by an analog neural network accelerator, a new framework of tools
dedicated to such circuits has to be developed.

III. DESCRIPTIONS OF THE INDIVIDUAL TOOLS

A. Hardware-Aware Training

Hardware-Aware Training (HAT) consists of two main components: Quantization-Aware Training (QAT) and Fault-Aware Training (FAT). In this section, we explain why both are required and present our approach to constructing the NNs and training them.

Low-power NNs heavily rely on reducing the processing power of the network, as they are deployed at the edge with a limited power budget. It has been shown that limiting the NN parameters and arithmetic to fixed-point has a big impact on reducing the energy consumption, as it allows a smaller memory footprint of the neural network with less memory access and smaller multipliers [3] [4] [5]. This approach requires quantizing the weights, activation and bias parameters with different bit widths and fixed-point formats. Generally, there are two main approaches for NN quantization: I) quantizing a pre-trained NN (Post-Training Quantization) and II) including the quantization in training (QAT). The first works well when the NN is quantized to 16 or (in some cases) 8 bits, while the latter is proven to achieve better performance (accuracy) when severe quantization is required (e.g. 1 up to 4 bits) [5] [6] [7].

Additionally, non-ideal circuit behaviors (errors) can degrade the performance of the NN. The HW errors can be mitigated by modelling and injecting them in SW during training. This has been shown to produce more reliable NNs, since they are robust against the errors they were trained with [9]. This is referred to as FAT. A HAT tool was implemented including both QAT and FAT.

The HAT tool uses the QAT built into Xilinx Brevitas [8], and extends it to include FAT. A Fault-Aware Quantizer (FAQ) is implemented, including both Quantization (Q) and Fault injection before (PreQ Op) and after (PostQ Op) the quantization operation, see Fig. 2. Any quantizer (FAQ or the standard Brevitas Q) or no quantizer (NoQ) can be used with the weights, inputs, bias and/or outputs of any layer (convolutional, fully connected, activation, pooling, etc.), see Fig. 3. For example, we could quantize some inputs with the FAQ and others with the standard Brevitas Q.

Fig. 2. Fault-aware quantizer (FAQ) implementation.

Moreover, the faults/errors that occur in hardware can be injected in two places: before (PreQ Op) or after (PostQ Op) the quantization (Q) takes place. PreQ Op and PostQ Op are modular and can model any error associated with a hardware implementation. Currently, the HAT implements relative noise (1), absolute noise (2), scaling (3) and bit flips:

x = x + x · n,    (1)
x = x + n,    (2)
x = x · s,    (3)

where x is the input value, n is the noise and s is the scaling factor.
(QAT). The first works well when the NN However, any other error source could
is quantized to 16 or (in some cases) 8 be modeled and added easily. In this way,
bits while the latter is proven to achieve the HAT tool can be used for different
better performance (accuracy) when se hardware implementations that may re
vere quantization is required (e.g. 1 up quire different errors/variation implemen
to 4 bits) [5] [6] [7]. tations.
Additionally, non-ideal circuit behav Here, we list few examples where er
iors (errors) can degrade the performance rors can be injected/modeled with PreQ
of the NN. The HW errors can be miti Op or PostQ Op of the FAQ.
gated by modelling and injecting them in
SW during training. This has been shown
to produce more reliable NNs, since they
are robust against the errors, they were
trained with [9]. This is referred to as
FAT. A HAT tool was implemented in Fig. 2. Fault aware quantizer (FAQ) implementa
cluding both QAT and FAT. tion.
1) In analog crossbars, where the synaptic weights are implemented with memristors, the nominal value of these memristors may vary in the manufacturing process. This error can be modeled as noise in the PostQ Op of the FAQ at the weight.
2) The electrical noise affects the input signal (partial sum) before applying the ADC (the quantizer in this analogy). This error can be modeled as noise in the PreQ Op of the FAQ at the layer output.
3) Stuck-at-0 and stuck-at-1 errors [9] result in a processing element being stuck at one value. This can be modeled by fixing the values in the PostQ Op of the FAQ at the layer output.
4) Bit flips can be modeled in the PostQ Op of the FAQ at the layer output or input.
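To make the error models (1)–(3) and the PreQ/Q/PostQ structure of Fig. 2 concrete, the following is a minimal PyTorch-style sketch. All names (FaultAwareQuantizer, rel_sigma, bitflip_p, etc.) are illustrative assumptions; the actual HAT tool extends Brevitas and is not reproduced here.

```python
import torch

class FaultAwareQuantizer(torch.nn.Module):
    """Illustrative fault-aware quantizer (FAQ): fault injection before
    (PreQ Op) and after (PostQ Op) a uniform fixed-point quantization.
    A sketch of the concept, not the actual HAT/Brevitas API."""

    def __init__(self, bits=4, rel_sigma=0.0, abs_sigma=0.0,
                 scale=1.0, bitflip_p=0.0):
        super().__init__()
        self.bits, self.rel_sigma, self.abs_sigma = bits, rel_sigma, abs_sigma
        self.scale, self.bitflip_p = scale, bitflip_p

    def pre_q(self, x):
        # Relative noise (1): x = x + x*n, e.g. memristor conductance spread.
        if self.rel_sigma > 0:
            x = x + x * torch.randn_like(x) * self.rel_sigma
        # Absolute noise (2): x = x + n, e.g. electrical noise before the ADC.
        if self.abs_sigma > 0:
            x = x + torch.randn_like(x) * self.abs_sigma
        return x

    def quantize(self, x):
        # Uniform symmetric fixed-point quantization with a straight-through
        # estimator so that QAT gradients can flow through the rounding.
        qmax = 2 ** (self.bits - 1) - 1
        step = (x.detach().abs().max() / qmax).clamp_min(1e-8)
        q = torch.clamp(torch.round(x / step), -qmax - 1, qmax)
        return x + (q * step - x).detach()

    def post_q(self, x):
        # Scaling (3): x = x*s, e.g. a gain error of the readout circuit.
        x = x * self.scale
        # Bit flips, modelled here (simplistically) as random sign flips.
        if self.bitflip_p > 0:
            flip = (torch.rand_like(x) < self.bitflip_p).float()
            x = x * (1.0 - 2.0 * flip)
        return x

    def forward(self, x):
        return self.post_q(self.quantize(self.pre_q(x)))
```

A layer wrapper could then apply one such FAQ instance to its weights and another to its outputs, mirroring the block diagram of Fig. 3.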
B. Mapper & Compiler

Convolutional Neural Networks (CNNs) are widely used in computer vision and audio applications. CNNs consist of different layers: Convolutional, Fully Connected, Pooling, Padding, and Normalization layers. Padding and Pooling layers add pre- and post-processing of the data on the chip. The main operations to be executed by an accelerator are multiply-accumulate (MAC) operations. With the increasing size of neural networks and the requirement to reduce the energy consumption and increase the throughput, the execution of the MAC operations on the targeted hardware accelerator has to be optimized, which can be achieved by optimizing the data movement. The use of analog crossbar arrays has surpassed the energy efficiency achieved by digital processing accelerators for computing the MAC operations [10] [11]. To make inference on the accelerator efficient, the compiler attempts to allocate adequate on-chip resources for the processing of the NN layers and schedules the instructions accordingly.

Fig. 4. Overview of the Mapper Tool.

In mixed-signal inference accelerators, a mapper and compiler process the quantized neural network models, stored in an Intermediate Representation (IR) format like the ONNX format, together with the architecture specification. The Mapper explores the mapping design space and sends the valid mappings to the Analyzer, see Fig. 4, following a pre-defined set of rules:

1) Realize the critical constraints (like computational elements, ADC bit-width and its availability to the processing core, stationarity of dimension) and exploit the hardware limitations.
2) For mixed-signal inference accelerators based on a crossbar architecture, maintain the stationarity of the weights. If required, include reconfiguration of weights and try to minimize the reloading of weights from the global buffer.
3) Explore parallel computation of outputs on the hardware to reduce latency.
4) Fetch/write and process data in CWHN (first #channels, #width, #height, #batch of filters) format.
The mapper takes into account constraints like neural network layer dimensions and limited on-chip resources like memory sizes, computational block dimensions, and Network-on-Chip (NoC) control, which induce numerous ways of mapping the same network layer on the same hardware. The Analyzer within the Design Space Explorer discards such invalid mappings that cannot satisfy the physical limitations of the hardware, while valid mappings are translated to an intermediate mapping output format. The outcome of these different mapping possibilities will vary in terms of energy consumption and latency.

Fig. 5. Overview of the Compiler Tool.

Fig. 5 shows an overview of a compiler tool and its sub-modules for a mixed-signal inference accelerator with analog in-memory computing. The Evaluator evaluates the valid mappings based on the expected performance heuristic (currently, maximum usage of the on-chip computational resources). Following this, the Scheduler converts the mapping into appropriate instructions. For standalone accelerators with custom instruction sets, a dedicated compiler needs to be developed. The intermediate mapping output generated by the tool can be used by any compiler that can translate the mapping information, reducing the exploration time. Further, the compiler can pipeline the layers to process the computations in parallel. For example: as soon as layer n-1 generates sufficient outputs to start processing layer n, the output buffer of layer n-1 sends the inputs to the on-chip locations where layer n will be processed.
In summary, the goal of such a Mapper & Compiler tool is to efficiently explore the design space and ensure the best possible mapping for the accelerator.
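The enumerate–filter–score loop performed by the Mapper, Analyzer and Evaluator can be sketched in a few lines. This is a toy model under stated assumptions: the names (Mapping, valid_mappings, core_size) and the simple tiling constraints are illustrative only; the real tool handles far richer constraints (NoC control, weight stationarity, CWHN ordering, ADC availability).

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class Mapping:
    """One candidate mapping of a layer: how its output channels and input
    rows are tiled across on-chip crossbar cores (illustrative format)."""
    channel_tile: int
    row_tile: int

def valid_mappings(out_channels, rows, n_cores, core_size):
    """Mapper: enumerate tilings; Analyzer: discard mappings that violate
    physical limits (a tile must fit a core, all tiles must fit the chip)."""
    for ct, rt in product(range(1, out_channels + 1), range(1, rows + 1)):
        tiles = -(-out_channels // ct) * -(-rows // rt)  # ceil division
        if ct * rt <= core_size and tiles <= n_cores:
            yield Mapping(ct, rt), tiles

def best_mapping(out_channels, rows, n_cores, core_size):
    """Evaluator: score valid mappings by the heuristic named in the text,
    i.e. maximum usage of the on-chip computational resources."""
    def utilization(m_t):
        _, tiles = m_t  # fraction of allocated crossbar cells holding weights
        return (out_channels * rows) / (tiles * core_size)
    return max(valid_mappings(out_channels, rows, n_cores, core_size),
               key=utilization)

# Example: map a 64x128 weight matrix onto 8 cores of 32x32 cells each.
print(best_mapping(out_channels=64, rows=128, n_cores=8, core_size=32 * 32))
```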
C. Architecture Exploration & Simulation

In order to provide scalability with respect to smaller or larger NNs for various use cases, mixed-signal accelerators combine analog in-memory computing with a multi-core system architecture and distributed memory. The design complexity of such SoCs, e.g. in terms of the number of transistors and other digital or analog components, rules out simulations at transistor level or RTL simulations of the entire system (i.e. at top level) with realistic use cases, due to the simulation times needed on state-of-the-art simulation clusters. Therefore, a functional
model of the SoC with decreased simulation time will make automated architectural design space exploration feasible. Also, it will enable the optimization of the aforementioned hardware KPIs, system-level simulations, and the validation of the Mapper & Compiler output at an early stage.
An architecture exploration tool makes it possible to optimize the system architecture for a given use case or, conversely, to optimize a NN under the constraints imposed by the system architecture, while estimating the relevant KPIs for a given configuration of the hardware architecture and NN model. The performance of such complex systems-on-chips (SoCs) can be measured in terms of use-case-specific key performance indicators (KPIs) like energy consumption, latency per inference and inference accuracy [12]. The KPIs depend on many aspects of the architecture, e.g. memory type and sizes, number of processing units [13] [14], bit accuracy, communication bandwidths, technology node, as well as hardware-induced inaccuracies like noise or quantization errors [15]. Even though the hardware-aware training (HAT) tool already estimates some of these KPIs, multiple hardware effects – like splitting up the calculation of a layer into multiple processing cores – are not taken into consideration.
Therefore, a functional model of the SoC sacrifices accuracy in favor of speed by transitioning from transistor- or gate-level models to more abstract models, e.g. pin- and cycle-accurate models at the module interfaces, or even more abstract transaction-level models (TLM).
Besides reducing the simulation time, having a functional model makes it possible to carry out an architecture exploration, design and verification workflow already at an early stage of development. This greatly simplifies hardware and software stack verification (e.g. verifying the compiler output) and reduces the time to market for the developed complex heterogeneous SoCs that require close HW/SW co-design.
To ensure that the system-level simulation and verification of such an abstract model is accurate, it needs to be complemented by and compared against more detailed models of the individual modules at various lower levels of abstraction. The right level of detail can also vary from module to module; for example, while the internal timing of some processor modules may be safely abstracted, the timing of other components such as the network-on-chip (NoC) and bus arbitration circuits may be critical for the operation of the system (e.g. contention) and, therefore, has to be modeled cycle-accurately. This can be especially challenging during early development, where hardware specifications are not yet stable, and any changes need to be continuously integrated throughout these multiple levels.
Several tools exist to address this problem of modelling complex systems across multiple levels of abstraction; similar to prior work [16], the tool chain described here employs SystemC [17], [18], which can be used with commercial simulation and verification tools like Incisive (from Cadence Design Systems) [19] and ModelSim (from Siemens) [20].
In the ANDANTE project, a functional SystemC model of one of the mixed-signal inference accelerators has been developed; it can efficiently simulate the system's behavior at a higher level of abstraction, while its correctness can be verified by lower-level hardware simulations of its individual parts. This abstract model is used to test and verify the designed hardware, to estimate the KPIs (like accuracy, latency, throughput and energy consumption), and to make
informed architecture and design choices in order to optimize these KPIs (e.g. selecting the ADC bit resolution).

D. Deployment & Runtime API

Before the hardware can be deployed in production, various tests need to be performed to verify the correct operation of the hardware and to refine the estimated KPIs with real-world measurements. Therefore, we designed a broad suite of automated measurements and tests that compare the ground-truth data produced by hardware measurements against the expected outputs provided by the simulation tools. To support such automated testing, the hardware must be designed accordingly from early on (“design-for-testing”).
To finally deploy the developed hardware in production, simplified, user-friendly APIs are needed to communicate parameters, data and results between the chip and the host hardware. Depending on the host systems for a given application, such an API may need to support microprocessors, e.g. via a C library, and/or conventional computers, e.g. via a Python library. In the ANDANTE project, our work has so far focused on APIs for testing, but an API for deployment outside of the lab is planned for future work.
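The measurement-versus-simulation comparison described above can be sketched as follows. The host-side names (AcceleratorAPI-style run_inference, the stub chip) are hypothetical; the real ANDANTE test APIs are chip-specific and not shown here.

```python
import numpy as np

def check_against_simulation(api, simulate, test_vectors, atol=1e-2):
    """Run each test vector on the chip via the runtime API and compare the
    measured outputs against the expected outputs of the simulation tools."""
    failures = []
    for i, x in enumerate(test_vectors):
        measured = api.run_inference(x)   # chip, via the runtime API
        expected = simulate(x)            # functional/SystemC model output
        if not np.allclose(measured, expected, atol=atol):
            failures.append((i, float(np.max(np.abs(measured - expected)))))
    return failures

# Example with a stub 'chip' that behaves like the simulation plus noise.
class StubAPI:
    def run_inference(self, x):
        return x + np.random.normal(0.0, 1e-3, len(x))

vectors = [np.random.rand(8).astype(np.float32) for _ in range(16)]
print(check_against_simulation(StubAPI(), lambda x: x, vectors))
```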
E. Neural Network Hardware Generator

To execute neural networks efficiently in terms of energy consumption, latency and accuracy, dedicated hardware is needed, especially in edge applications. Further, the configuration of the circuits must be set according to the requirements of the neural network to be executed, which includes, for example, the maximum kernel size of a convolutional layer that the hardware can execute, how many neurons can be computed in parallel, or the number of quantization levels of the weights. Analog circuits dedicated to neuromorphic hardware consist of the assembly of thousands of identical unit circuits, like synapses and neurons, which have to be placed and routed. With each change of one or multiple of the available hardware parameters, the circuits must be redone, which leads to a long time-to-market, especially if an analog mixed-signal accelerator is to be implemented, since only minimal automation is available for analog circuit design. Therefore, a hardware generator tool is required for the automatic circuit design of neural network accelerators.
The Neural Network Hardware Generator tool implements the automation of the circuit design for dedicated neural network accelerators. It takes the architecture specifications, like the number of weights and neurons, as well as block-level implementations as input, and assembles them into the final accelerator circuit. The block-level circuits are synapse and neuron circuits together with their required framework circuits. According to the architecture specifications, the tool then instantiates and connects these block-level modules to finalize the accelerator circuits.
Within the ANDANTE project, the functionality of our Neural Network Hardware Generator Tool was extended for a multi-purpose accelerator architecture with analog mixed-signal data processing, based on the Fraunhofer IIS internal automation framework UnilibPlus [21]. Fig. 6 shows the hierarchy of the designed circuit, where the green blocks are manually designed block-level circuits, whilst the blue blocks are automatically created by the tool. Thus, the analog core is automatically generated, at schematic and layout levels, from the Crossbar Array and ADC Array cells.
Fig. 6. Circuit hierarchy created by the Neural Network Hardware Generator.

The Neural Network Hardware Generator Tool takes the user-created ADC unit cell, places the instances together, routes the inputs to the correct place where they should be connected to the Crossbar Array, executes the power routing and finally interconnects the ADCs, generating the ADC Array cell. The Crossbar Array cell is generated from Crossbar Array Segment cells. These Crossbar Array Segments contain the Crossbar Array Elements, which in turn are generated from base-level unit cells, which in this case are the synapse circuit (AWE), the row and column selection logic (Row-Column AND) and the generated multiplexer cells, which are used to interconnect the Crossbar Array Elements to each other and to the ADC Array. Moreover, the hierarchy shown in Fig. 6 can be changed in order to explore different architectures of the analog core, while using the same unit cells, to examine possible reductions in area and to provide options for tradeoffs between performance, leakage and energy consumption.
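The hierarchical assembly performed by the generator can be illustrated with a toy data model. The class and builder names (Cell, build_crossbar_segment) are assumptions for this sketch; the real tool operates on schematic and layout databases via UnilibPlus, which is not shown.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    """A circuit cell: either a manually designed unit cell or a hierarchy
    level assembled by the generator (loosely following Fig. 6)."""
    name: str
    children: list = field(default_factory=list)

def build_crossbar_segment(rows, cols):
    # Each Crossbar Array Element is built from the base-level unit cells.
    units = [Cell("AWE"), Cell("Row-Column AND"), Cell("MUX")]
    elems = [Cell(f"Element[{r},{c}]", list(units))
             for r in range(rows) for c in range(cols)]
    return Cell(f"CrossbarSegment{rows}x{cols}", elems)

def build_analog_core(n_segments, rows, cols, n_adcs):
    crossbar = Cell("CrossbarArray",
                    [build_crossbar_segment(rows, cols)
                     for _ in range(n_segments)])
    adc_array = Cell("ADCArray", [Cell(f"ADC[{i}]") for i in range(n_adcs)])
    # Power routing and interconnect steps would be performed here.
    return Cell("AnalogCore", [crossbar, adc_array])

def count_cells(cell):
    return 1 + sum(count_cells(c) for c in cell.children)

core = build_analog_core(n_segments=4, rows=8, cols=8, n_adcs=8)
print(count_cells(core), "cell instances in the generated hierarchy")
```

Because the whole hierarchy is produced by one parameterized procedure, alternative analog-core architectures can be generated simply by changing the arguments, which is the point made in the surrounding text.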
This tool not only reduces the design time, and therefore the time-to-market, but also enhances the stability of the design process. Neural network accelerators are immensely complex and large circuits, which leads to a high possibility of errors if the design is created manually. By automating the design process, examples can be created, simulated and verified to ensure a correct implementation. Any up- or downscaled version of these exemplary circuits is then considered to be correct, since the process for implementing them is executed by the same algorithm. Therefore, the possibility of errors in the design is reduced. Furthermore, it enables the parallel development of the neural network algorithms and the neural network circuits, as many different versions of the circuits and algorithms can be created and tested within a reasonable time frame.

IV. CONCLUSION AND FUTURE WORK

The development and deployment of custom AI accelerators is a multidisciplinary effort that should be addressed from a holistic system perspective. In particular, careful co-design of the neural networks, software and hardware is required to achieve a good utilization of the hardware in practice. In the ANDANTE project, we developed a stack of tools to support this design flow, from the assessment of relevant KPIs, to hardware-aware training of neural networks, to simulators, compilers and drivers for the hardware, all the way down to (partially) automated generation of the hardware itself. We argue that this approach, laborious as it may be, is highly beneficial, if not necessary, for the design of complex AI accelerators.
To facilitate similar AI accelerator design efforts in the future, further work should be invested in integrating these and/or similar tools into a cohesive and general framework. Moreover, such a framework is necessary for the benchmarking of AI accelerators based on use-case requirements.
REFERENCES

[1] “CEA-LIST / N2D2”. [Online]. Available: https://github.com/CEA-LIST/N2D2
[2] A. Parashar et al., “Timeloop: A systematic approach to DNN accelerator evaluation,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, 2019, pp. 304-315.
[3] B. Moons, K. Goetschalckx, N. Van Berckelaer and M. Verhelst, “Minimum energy quantized neural networks,” in Proc. 2017 51st Asilomar Conference on Signals, Systems, and Computers, 2017, pp. 1921-1925, doi: 10.1109/ACSSC.2017.8335699.
[4] J. Johnson, “Rethinking floating point for deep learning,” arXiv preprint arXiv:1811.01721, 2018.
[5] Q. Ducasse, P. Cotret, L. Lagadec and R. Stewart, “Benchmarking Quantized Neural Networks on FPGAs with FINN,” arXiv preprint arXiv:2102.01341, 2021.
[6] M. Courbariaux, Y. Bengio and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” Advances in Neural Information Processing Systems, 28, pp. 3123-3131, Nov. 2015.
[7] M. Rastegari, V. Ordonez, J. Redmon and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision, Springer, Cham, 2016, pp. 525-542.
[8] A. Pappalardo, G. Franco, and N. Fraser, “Xilinx/brevitas: Cnv test reference vectors r0,” May 2020. [Online]. Available: https://doi.org/10.5281/zenodo.3824904
[9] U. Zahid, G. Gambardella, N. J. Fraser, M. Blott, and K. Vissers, “FAT: Training Neural Networks for Reliable Inference Under Hardware Faults,” in 2020 IEEE International Test Conference (ITC), Nov. 2020, pp. 1-10, doi: 10.1109/ITC44778.2020.9325249.
[10] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam and Y. Chen, “DaDianNao: A Neural Network Supercomputer,” IEEE Transactions on Computers, 66, pp. 73-88, 2017.
[11] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA ’16), IEEE Press, pp. 14-26, 2016, https://doi.org/10.1109/ISCA.2016.12.
[12] S. Narduzzi, L. Mateu, P. Jokic, E. Azarkhish and A. Dunbar, “Benchmarking Neuromorphic Computing for Inference,” in Industrial Artificial Intelligence Technologies and Applications, River Publishers, 2022, pp. 1-16.
[13] X. Peng, S. Huang, H. Jiang, A. Lu, and S. Yu, “DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 11, pp. 2306-2319, Nov. 2021, doi: 10.1109/TCAD.2020.3043731.
[14] Y. N. Wu, J. S. Emer, and V. Sze, “Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2019, pp. 1-8, doi: 10.1109/ICCAD45719.2019.8942149.
[15] L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M. Verhelst, “ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration for DNN Accelerators,” IEEE Transactions on Computers, vol. 70, no. 8, pp. 1160-1174, Aug. 2021, doi: 10.1109/TC.2021.3059962.
[16] D. Bortolotti, C. Pinto, A. Marongiu, M. Ruggiero, and L. Benini, “VirtualSoC: A Full-System Simulation Environment for Massively Parallel Heterogeneous System-on-Chip,” in 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, May 2013, pp. 2182-2187, doi: 10.1109/IPDPSW.2013.177.
[17] P. R. Panda, “SystemC - a modeling platform supporting multiple design abstractions,” in International Symposium on System Synthesis (IEEE Cat. No.01EX526), Sep. 2001, pp. 75-80, doi: 10.1145/500001.500018.
[18] G. Arnout, “SystemC standard,” in Proceedings 2000. Design Automation Conference (IEEE Cat. No.00CH37106), Jan. 2000, pp. 573-577, doi: 10.1109/ASPDAC.2000.835166.
[19] “Computational Software for Intelligent System Design™”. Cadence. [Online]. Available: https://www.cadence.com/en_US/home.html
[20] “ModelSim HDL simulator”. Siemens. [Online]. Available: https://eda.sw.siemens.com/en-US/ic/modelsim/
[21] R. Müller et al., “Hardware/Software Co-Design of an Automatically Generated Analog NN,” in International Conference on Embedded Computer Systems, Springer, Cham, 2022, pp. 385-400.

Loreto Mateu obtained her B.S. in Industrial Engineering in 1999, her M.S. in Electronic Engineering in 2002, and her PhD degree (with highest honors) in June 2009 with a thesis titled Energy Harvesting from Human Passive Power at the Universitat Politècnica de Catalunya. In June 2007, she joined Fraunhofer IIS as a research engineer and became chief scientist in 2012. Since 2018 she has been group manager of the Advanced Analog Circuits group. Her research interests include ultra-low power design, AC-DC and DC-DC converters, neuromorphic hardware and energy harvesting.

Johannes Leugering is a researcher in the area of neuromorphic computing at the Fraunhofer Institute for Integrated Circuits (IIS). He received a PhD in computational neuroscience (Dr. rer. nat., with highest honors) from Osnabrück University on the topic of “Neural mechanisms of information processing and transmission” in 2021. Since 2019, he has been working at Fraunhofer IIS as an expert for neuromorphic computing concepts and architectures, particularly in the context of spiking neural networks. Since 2022, he has been chief scientist in the Broadband and Broadcast department.

Roland Müller was born on 19 January 1994. He obtained his B.Eng. at the OTH Regensburg in 2017 and his M.Sc. in 2019 at the FAU Erlangen, both in electrical engineering. In May 2019, he joined the department of Integrated Circuits and Systems
at Fraunhofer IIS, Erlangen (Germany), where he is working in the field of analog mixed-signal design of neural network accelerators and design automation for such circuits. Currently, he is pursuing his PhD.

Yogesh Patil completed his M.E. in Information Technology in 2021 with a thesis titled “Design Space Exploration of Neural Network Hardware Architectures” from SRH University Heidelberg, Germany, and his B.E. in Electronics and Telecommunications from Savitribai Phule Pune University (formerly University of Pune), India, in 2018. Since January 2021, he has been contributing to the research in neural network accelerators with in-memory computing architectures at the Fraunhofer Institute for Integrated Circuits (Fraunhofer IIS, Germany).

Maen Mallah is a researcher in the area of embedded AI at the Fraunhofer Institute for Integrated Circuits (IIS). He obtained his B.Sc. in Telecommunication Engineering in 2014 from An-Najah National University, Palestine, and his M.Sc. in Electrical Engineering in 2018 from Bilkent University, Turkey, with a thesis titled “Multiplication Free Neural Networks”. In March 2018, he joined Fraunhofer IIS. His main work and interest focus on implementing and optimizing NNs for edge applications and designing the special SW tools required for such a task, with a special focus on quantization- and fault-aware training.

Marco Breiling conducted studies at the Universität Karlsruhe/Germany (now Karlsruhe Institute of Technology – KIT), the Norges Tekniske Høgskole (NTH) (now Norges Teknisk-Naturvitenskapelige Universitet – NTNU), the École Supérieure d'Ingénieurs en Électronique et Électrotechnique (ESIEE) and the University of Southampton, and graduated with a Dipl.-Ing. (equivalent to master's) degree from KIT in 1997. He earned his PhD degree (with highest honors) in digital communications from Universität Erlangen/Germany in 2002. Since 2001, he has been working at Fraunhofer IIS in the field of signal processing, digital communications and digital design. He held the chief scientist position of the Broadband & Broadcast department from 2013 until 2021. Moreover, he is a Distinguished Lecturer of the IEEE Broadcast Technology Society.

Ferdinand Pscheidl obtained his Bachelor of Science in electrical engineering at Technische Universität München in 2018. In 2021 he received his Master of Science in electrical engineering at Technische Universität München with a focus on circuit design and machine learning. In 2021 he joined Fraunhofer EMFT as a research engineer. His research interests include ultra-low power design, neuromorphic hardware, software and development tools.
Low-Power Vertically Stacked One-Time Programmable Multi-bit IGZO-Based BEOL-Compatible Ferroelectric TFT Memory Devices with Lifelong Retention for Monolithic 3D-Inference Engine Applications

Sourav De, Sunanda Thunder, David Lehninger, Michael P.M. Jank, Maximilian Lederer, Yannick Raffel, Konrad Seidel, and Thomas Kämpfe
Abstract—This article demonstrates indium gallium zinc oxide-based one-time programmable ferroelectric memory devices with multilevel coding and lifelong retention capability. The entire integration process was conducted in the back-end-of-line with a maximum process temperature of 350 °C. The fabricated devices demonstrate data retention up to 10⁴ seconds, which was used to estimate the retention property up to 10⁸ seconds. We observed only a marginal drop in channel current after 10⁸ seconds, which makes them suitable for inference engine applications. The compatibility with the back-end-of-line process enables monolithic 3D integration of the devices with standard technology. We have evaluated the performance of this indium gallium zinc oxide-based one-time programmable ferroelectric thin-film transistor for inference engine applications. A system-level simulation was performed to gauge the performance of the devices as synapses in multi-layer perceptron-based neural networks. The synaptic devices could achieve 97% accuracy for inference-only applications with MNIST data. The accuracy degradation was also limited to 1.5% over 10 years without retraining. The proposed inference engine also showed superior energy efficiency and cell area of 95.33 TOPS/W (binary) and 16F², respectively.

Index Terms—Hafnium Oxide, IGZO, FeTFT, non-volatile memory, variation, neural networks.

Sourav De, Sunanda Thunder, David Lehninger, Maximilian Lederer, Yannick Raffel, Konrad Seidel, and Thomas Kämpfe are associated with Fraunhofer-Institut für Photonische Mikrosysteme IPMS, Center Nanoelectronic Technologies, Dresden, Germany. Michael P.M. Jank is associated with Fraunhofer-Institut für Integrierte Systeme und Bauelementetechnologie, Erlangen, Germany.
This research was partly funded by the ECSEL Joint Undertaking project ANDANTE in collaboration with the European Union's Horizon 2020 Framework Program for Research and Innovation (H2020/2014-2020) and partly by the European Union's ECSEL Joint Undertaking under grant agreement no. 826655, project TEMPO. (Corresponding author: Sourav De. Email: [email protected])

I. INTRODUCTION

Recent developments in the research of hafnium oxide (HfO2)-based ferroelectric memories in
the form of ferroelectric (Fe) field-effect transistors (FeFETs) [1–7], Fe-finFETs [8–12] and ferroelectric thin-film transistors (FeTFTs) [13–17] have paved the way for further scaling, 3D integration, and system-level application of Fe memory devices. So far, most applications of FeTFTs have focused on improving the endurance characteristics, which is crucial for online training. However, the online training of a neural network requires endurance above 10⁸ cycles [8, 9], which limits the retraining of the neural network if necessary. Therefore, in this work, we primarily focus on the offline training of the neural network with inference-only operation in hardware. While the endurance characteristics are essential for the online training of a neural network, retention is vital for conducting the inference-only operation. The depolarization field across the ferroelectric layer plays a vital role in the retention characteristics.
The degradation in data retention over time necessitates retraining the neural network after a certain amount of time and renders multi-bit per cell operation futile for long-term applications.
In this work, we demonstrate indium gallium zinc oxide-based (IGZO) one-time programmable (OTP) memory devices with lifelong retention for inference engine applications. The IGZO-based FeTFT device was fabricated using a gate-first process with different ratios of ferroelectric layer and channel layer thickness. The data retention capability is related to the depolarization field across the ferroelectric layer. The primary motive of this work was to reduce the depolarization field without affecting the memory window. It is well known that the carrier mobility and the relative thickness of the dielectric and semiconductor layers regulate the voltage drop across the ferroelectric layer and channel. It is worth noting that the mobility of electrons in IGZO is much higher than the mobility of the holes. We observed that when t_IGZO > ½ t_HZO, most of the electric field (ξ) is dropped across the IGZO film during the erase operation. Therefore, the impact of the depolarization field across the HZO film is also lowered, simultaneously facilitating the fabricated devices' one-time programming (OTP) and lifelong retention capability. The IGZO-based OTP synaptic devices occupy a cell area of 16F², where F is the lithography feature size. The devices also demonstrate lifelong retention, which is the most critical characteristic for inference applications. The IGZO-based TFTs demonstrate 2 bits/cell operation, with extrapolated retention for each state above ten years. We have further evaluated the IGZO-TFT's performance as a synaptic inference engine device. The performance of these synaptic devices in terms of area and energy efficiency demonstrates the suitability of this device for in-memory computing (IMC) applications. The synaptic devices maintained an inference accuracy above 95% for ten years with a multi-layer perceptron (MLP)-based neural network (NN) for MNIST dataset recognition without retraining.

II. EXPERIMENTS

We began our experiment by fabricating metal-ferroelectric-metal (MFM) and metal-semiconductor-ferroelectric-metal (MSFM) capacitors on highly doped (boron) 300 mm silicon wafers using industry-standard production tools. The titanium nitride (TiN) bottom electrode was deposited via atomic layer deposition (ALD). Titanium tetrachloride (TiCl4) and ammonia (NH3) were the precursors for the ALD process. The ferroelectric layer (HfxZr1−xO2) was deposited at
300 °C by ALD, with HfCl4 and ZrCl4 as precursors, H2O as an oxidizing agent, and Ar as a purging gas. For the MSFM capacitors, the IGZO was deposited by RF magnetron sputtering. The thickness of the IGZO varied between 5 nm and 30 nm. A 2 nm thick layer of Al2O3, deposited by ALD, was used as an interfacial layer between HfxZr1−xO2 and IGZO. The TiN top electrode was deposited by magnetron sputtering, where the deposition temperature is below 100 °C. The annealing for crystallization was carried out at 350 °C. The FE-TFTs were fabricated on standard silicon wafers. 100 nm of SiO2 was used to insulate the devices from the substrate. 50 nm of TiN was deposited by ALD and patterned via e-beam lithography and reactive ion etching to form the bottom gate electrodes. ALD-deposited HZO of 10 nm thickness followed, together with 2 nm of ALD Al2O3. Finally, the devices were annealed in air at a temperature of 350 °C for 1 h.
The polarization versus field (P–E) measurements were performed with a triangular waveform at a frequency of 1 kHz (Fig. 1).

Fig. 1. The P-V response of HZO-based ferroelectric capacitors with IGZO as the bottom electrode shows an asymmetric swing with negligible negative switching. The absence of negative polarization plays an essential role in facilitating the OTP feature in Fe-TFTs.

Fig. 2. (a) WRITE operation in IGZO-based Fe-TFTs with 500 ns wide pulses of amplitude 3 V and 7 V. (b) 2 bits/cell operation for IGZO-based Fe-TFT OTP devices.

The formation of both the accumulation and inversion layers is a basic requirement for conducting the program and erase operations of the FE-TFTs. The mobility of the electrons (μn ≈ 10 cm² V⁻¹ s⁻¹) in IGZO-based semiconductors is high. Therefore, the accumulation layer in n-type IGZO devices is formed within a short time, 100 ns. Contradictorily, the hole mobility (μp) is deficient, and the inversion layer formation is complex when most of the electric field is dropped across the IGZO. Proper tuning of the thicknesses of IGZO and HZO resulted in the omission of the erase capability in the fabricated TFT devices. We have observed that an inversion channel is not formed even after applying long pulses (2 s) of amplitude up to -6 V. The dielectric breakdown happens before the formation of
the inversion layer. This invokes the OTP scheme in IGZO-TFTs, which is also responsible for the lifelong data retention capability. The binary and 2 bits/cell READ/WRITE operations are demonstrated in Fig. 2(a,b).

Fig. 3. (a) The measured retention characteristics show stable retention of 4 states for ten years without any loss. (b) Benchmarking the retention performance (relative Vth shift w.r.t. the MW) shows that IGZO-based OTP devices have maximum long-term data retention capability.

Fig. 4. (a) The modus operandi of the MLP NN. (b) The reported inference engine shows life-long lossless inference operation.

The retention characteristic (Fig. 3(a)) shows only slight conductance degradation after 10 years, which is superior to other state-of-the-art Fe-FETs. In Fig. 3(b), we show the lowest Vth shift w.r.t. the memory window (MW) compared to other state-of-the-art Fe-FET devices.
Finally, the system-level validation for the inference-only application was performed using the CIMulator platform [7]. The modus operandi of the IGZO-FeTFT inference engine is described in Fig. 4(a). The multi-layer perceptron-based neural network (MLP-NN) has three layers, including 400 input-layer nodes, 100 hidden-layer nodes, and 10 nodes in the output layer. After completing the training, the synaptic weights were written to the FeTFT-based synaptic core. The inference task is performed once the synaptic weights are updated in the FeFET devices. The measured 2 bits/cell operation with experimentally calibrated variations was used while conducting the neuromorphic simulations.
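The structure of such an inference-only simulation can be sketched as follows: quantize trained weights to the four conductance states, perturb them with variation plus a retention-drift model, and re-evaluate the accuracy. All numbers (levels, sigma, drift) and the random stand-in weights and data are illustrative assumptions, not the experimentally calibrated values used with CIMulator.

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize_2bit(w):
    """Map trained weights to four conductance states (2 bits/cell)."""
    levels = np.array([-1.0, -0.33, 0.33, 1.0]) * np.abs(w).max()
    idx = np.abs(w[..., None] - levels).argmin(-1)
    return levels[idx]

def degrade(w, years, sigma=0.02, drift=0.002):
    """Device-to-device variation plus a small conductance drift per
    decade of time, mimicking retention degradation."""
    decades = np.log10(max(years * 3.15e7, 1.0))  # years -> seconds
    w = w * (1.0 - drift * decades)
    return w + rng.normal(0.0, sigma * np.abs(w).max(), w.shape)

def mlp_accuracy(w1, w2, x, y):
    h = np.maximum(x @ w1, 0.0)   # 400 -> 100, ReLU
    pred = (h @ w2).argmax(1)     # 100 -> 10
    return (pred == y).mean()

# Random stand-ins for trained weights and test data (shapes as in the text).
w1, w2 = rng.normal(0, 0.1, (400, 100)), rng.normal(0, 0.1, (100, 10))
x, y = rng.random((256, 400)), rng.integers(0, 10, 256)
for years in (0, 1, 10):
    acc = mlp_accuracy(degrade(quantize_2bit(w1), years),
                       degrade(quantize_2bit(w2), years), x, y)
    print(f"after {years:>2} years: accuracy {acc:.3f}")
```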
Due to the retention degradation, the accuracy drop was limited to only 1.5% over 10 years, while other state-of-the-art ferroelectric memory devices could only retain 11% inference accuracy after 10 years. We have further simulated the impact of the ADC precision in the inference engine, which shows that using area-efficient 1-bit ADCs can boost the system
performance in terms of area and energy efficiency, with an accuracy drop of only 1.87%, and maintains the inference accuracy above 94% for 10 years. Table I summarizes the performance of this device w.r.t. other state-of-the-art devices.

TABLE I
BENCHMARKING

Device Type                          Fe-FinFET [9]   IWO-FeFET [14]   This Work
M3D Integration                      No              Yes              Yes
Cell Area (F²)                       15F²            15F²             16F²
Ron (Ω)                              100K            4M               100M
MW @ 10 years                        1               0.2              1.2
Inference Accuracy Drop @ 10 years   ∼85%            85%              1.5%
Energy Efficiency (TOPS/W)           N/A             71.04            95.33 (binary)

IV. CONCLUSIONS

Ultra-low power multi-bit IGZO-based OTP Fe-TFTs with lifelong retention have been fabricated with a maximum process temperature of 350 °C. The devices demonstrate 2 bits/cell operation. Their long-term, non-disturbing retention property makes these devices suitable for inference engine applications.

ACKNOWLEDGEMENT

This research was funded by the ECSEL Joint Undertaking project ANDANTE in collaboration with the European Union's Horizon 2020 Framework Program for Research and Innovation (H2020/2014-2020) and National Authorities under Grant No. 876925. We thank Dr. Hoang-Hiep Le and Prof. Darsen Lu from National Cheng Kung University for helping us with the neuromorphic simulations.

REFERENCES

[1] T. Ali et al., “A Multilevel FeFET Memory Device based on Laminated HSO and HZO Ferroelectric Layers for High-Density Storage,” 2019 IEEE International Electron Devices Meeting (IEDM), 2019, pp. 28.7.1-28.7.4, doi: 10.1109/IEDM19573.2019.8993642.
[2] M.-K. Kim and J.-S. Lee, “Ferroelectric Analog Synaptic Transistors,” Nano Letters, vol. 19, no. 3, pp. 2044-2050, 2019, doi: 10.1021/acs.nanolett.9b00180.
[3] D. Kim, Y.-R. Jeon, B. Ku, C. Chung, T. H. Kim, S. Yang, U. Won, T. Jeong, and C. Choi, “Analog Synaptic Transistor with Al-Doped HfO2 Ferroelectric Thin Film,” ACS Applied Materials & Interfaces, vol. 13, no. 44, pp. 52743-52753, 2021, doi: 10.1021/acsami.1c12735.
[4] T. Soliman, N. Laleni, et al., “FELIX: A Ferroelectric FET Based Low Power Mixed-Signal In-Memory Architecture for DNN Acceleration,” ACM Transactions on Embedded Computing Systems, 2022, doi: 10.1145/3529760.
[5] M. Trentzsch, S. Flachowsky, R. Richter, J. Paul, B. Reimer, D. Utess, S. Jansen, H. Mulaosmanovic, S. Müller, S. Slesazeck, J. Ocker, M. Noack, J. Müller, P. Polakowski, J. Schreiter, S. Beyer, T. Mikolajick, and B. Rice, “A 28nm HKMG super low power embedded NVM technology based on ferroelectric FETs,” Technical Digest - International Electron Devices Meeting (IEDM), 2017, doi: 10.1109/IEDM.2016.7838397.
[6] P. Wang and S. Yu, “Ferroelectric devices and circuits for neuro-inspired computing,” MRS Communications, vol. 10, no. 4, 2020, doi: 10.1557/mrc.2020.71.
[7] S. De et al., “READ-Optimized 28nm HKMG Multibit FeFET Synapses for Inference-Engine Applications,” IEEE Journal of the Electron Devices Society,
vol. 10, pp. 637-641, 2022, doi: 10.1109/JEDS.2022.3195119.
[8] S. De, M. A. Baig, B.-H. Qiu, F. Müller, H.-H. Le, M. Lederer, T. Kämpfe, T. Ali, P.-J. Sung, C.-J. Su, Y.-J. Lee, and D. D. Lu, “Random and Systematic Variation in Nanoscale Hf0.5Zr0.5O2 Ferroelectric FinFETs: Physical Origin and Neuromorphic Circuit Implications,” Front. Nanotechnol., 3:826232, 2022, doi: 10.3389/fnano.2021.826232.
[9] S. De, D. D. Lu, H.-H. Le, S. Mazumder, Y.-J. Lee, W.-C. Tseng, B.-H. Qiu, Md. A. Baig, P.-J. Sung, C.-J. Su, C.-T. Wu, W.-F. Wu, W.-K. Yeh, and Y.-H. Wang, “Ultra-low power robust 3bit/cell Hf0.5Zr0.5O2 ferroelectric finFET with high endurance for advanced computing-in-memory technology,” in Proc. Symp. VLSI Technology, 2021.
[10] S. De, H.-H. Le, B.-H. Qiu, M. A. Baig, P.-J. Sung, C.-J. Su, Y.-J. Lee, and D. D. Lu, “Robust Binary Neural Network Operation From 233 K to 398 K via Gate Stack and Bias Optimization of Ferroelectric FinFET Synapses,” IEEE Electron Device Letters, vol. 42, no. 8, pp. 1144-1147, Aug. 2021, doi: 10.1109/LED.2021.3089621.
[11] S. De, M. A. Baig, B.-H. Qiu, H.-H. Le, Y.-J. Lee, and D. Lu, “Neuromorphic Computing with Fe-FinFETs in the Presence of Variation,” 2022 International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA), 2022, pp. 1-2, doi: 10.1109/VLSI-TSA54299.2022.9771015.
[12] S. De, B.-H. Qiu, W.-X. Bu, Md. A. Baig, C.-J. Su, Y.-J. Lee, and D. D. Lu, “Neuromorphic Computing with Deeply Scaled Ferroelectric FinFET in Presence of Process Variation, Device Aging and Flicker Noise,” CoRR, abs/2103.13302, 2021. [Online]. Available: https://arxiv.org/abs/2103.13302
[13] C. Matsui, K. Toprasertpong, S. Takagi, and K. Takeuchi, “Energy-Efficient Reliable HZO FeFET Computation-in-Memory with Local Multiply & Global Accumulate Array for Source-Follower & Charge-Sharing Voltage Sensing,” 2021 Symposium on VLSI Circuits, 2021, pp. 1-2, doi: 10.23919/VLSICircuits52068.2021.9492448.
[14] S. Dutta et al., “Monolithic 3D Integration of High Endurance Multi-Bit Ferroelectric FET for Accelerating Compute-In-Memory,” 2020 IEEE International Electron Devices Meeting (IEDM), 2020, pp. 36.4.1-36.4.4, doi: 10.1109/IEDM13553.2020.9371974.
[15] D. Lehninger, M. Ellinger, T. Ali, S. Li, K. Mertens, M. Lederer, R. Olivio, T. Kämpfe, N. Hanisch, K. Biedermann, M. Rudolph, V. Brackmann, S. Sanctis, M. P. M. Jank, and K. Seidel, “A Fully Integrated Ferroelectric Thin-Film-Transistor – Influence of Device Scaling on Threshold Voltage Compensation in Displays,” Adv. Electron. Mater., vol. 7, 2100082, 2021, doi: 10.1002/aelm.202100082.
[16] F. Mo et al., “Low-Voltage Operating Ferroelectric FET with Ultrathin IGZO Channel for High-Density Memory Application,” IEEE Journal of the Electron Devices Society, vol. 8, pp. 717-723, 2020, doi: 10.1109/JEDS.2020.3008789.
[17] C. Sun et al., “First Demonstration of BEOL-Compatible Ferroelectric TCAM Featuring a-IGZO Fe-TFTs with Large Memory Window of 2.9 V, Scaled Channel Length of 40 nm, and High Endurance of 10⁸ Cycles,” 2021 Symposium on VLSI Technology, 2021, pp. 1-2.
[18] D. D. Lu, S. De, M. A. Baig, B.-H. Qiu, and Y.-J. Lee, “Computationally Efficient Compact Model for Ferroelectric Field-Effect Transistors to Simulate the Online Training of Neural Networks,” Semicond. Sci. Technol., vol. 35, no. 9, 95007, 2020, doi: 10.1088/1361-6641/ab9bed.

Sourav De (Member, IEEE) works as a scientist at Fraunhofer IPMS, Center Nanoelectronic Technologies. He received his Ph.D. degree in Electrical Engineering in 2021 from National Cheng Kung University, and his B.Tech degree in Electronics and Communication Engineering in 2013 from STCET, Kolkata. During his doctoral studies, Sourav De worked at the Taiwan Semiconductor Research Institute as a graduate research student, working towards the integration of ferroelectric memory with advanced technology nodes. Sourav De joined the Fraunhofer Society in 2021. His main research interests are CMOS-compatible emerging non-volatile memories for neuromorphic computing, advanced transistor and thin-film transistor design for logic and memory applications, and analog in-memory computing devices & circuits in CMOS and SOI technologies.

Sunanda Thunder works as a digital design engineer at TSMC, Taiwan. She received her Master's degree from the International College of Semiconductor Technology at National Yang Ming Chiao Tung University in 2021. She worked at Fraunhofer IPMS as a research intern before joining TSMC. Her primary research interests are neuromorphic circuit design with emerging non-volatile memories.

David Lehninger works as a project manager at Fraunhofer IPMS - Center Nanoelectronic Technologies. His main scientific interest is the optimization and integration of ferroelectric HfO2 films into the back end of the line of established CMOS technologies, as well as the structural and electrical characterization of test structures and device concepts in the field of emerging non-volatile memories. Before joining Fraunhofer in 2018, he did a Ph.D. in Nanoscience at the TU Bergakademie Freiberg and a Master of Science in electrical engineering at the Dresden University of Technology.

Michael P.M. Jank received the diploma degree in electrical engineering from the Friedrich-Alexander-University of Erlangen-Nuremberg (FAU) in 1996, and the Dr.-Ing. degree with a thesis on extremely simplified routes to silicon CMOS devices in 2006. He started his career as a teaching assistant at the Chair of Electron Devices at FAU. Following the Dr.-Ing. degree in 2006, he joined the Fraunhofer Institute for Integrated Systems and Device Technology IISB, Erlangen, where he is currently heading the thin-film systems group, a joint undertaking with FAU. The group focuses on large-area and printable thin-film electronics and develops materials, processing techniques, and thin-film devices based on conventional PVD/CVD techniques as well as novel solution-based approaches. He holds lectureships for Nanoelectronics and Printed Electronics from the FAU. He is a reviewer for renowned international journals and contributes to scientific and industrial working groups on semiconductor memory devices and nanomaterials.
Maximilian Lederer is currently working as a project manager at the Fraunhofer IPMS Center Nanoelectronic Technologies. Prior to this, he finished his Ph.D. degree in Physics, performing research in the field of ferroelectric hafnium oxide together with TU Dresden and Fraunhofer IPMS. He received his master degree in material science and engineering in 2018 at the Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany, and conducted a research semester at the Nagoya Institute of Technology, Japan, in 2017. His current research topics include non-volatile memories, neuromorphic devices, materials for quantum computing, structural and electrical analysis techniques as well as ferroelectrics.

Yannick Raffel was born in Dormagen, Germany, in 1993. He received the B.S. degree from the Department of Physics, Ruhr-Universität Bochum, Germany, in 2016, and the M.S. degree from the Department of Physics (AFP), Ruhr-Universität Bochum, Germany, in 2018. Since 2020 he has been working toward the Ph.D. degree at the Fraunhofer-Institut CNT, Dresden, Germany. His current research interest is the investigation and description of low-frequency noise and defect influences in nanotechnology.

Konrad Seidel received the Diploma degree in electrical engineering from the Dresden University of Technology, Dresden, Germany, in 2003. From 2004 to 2008, he was with the Reliability and Qualification Team of Flash Product Engineering, Infineon Technologies AG, Dresden, and Qimonda, Munich, Germany. Since 2008, he has been a Research Associate with Fraunhofer Center Nanoelectronic Technologies, Dresden, which is currently a Business Unit of the Fraunhofer Institute for Photonic Microsystems, Fraunhofer IPMS, Dresden. His current research interests include electrical characterization and reliability of integrated circuits as well as the integration and design of integrated high-density capacitors.

Thomas Kämpfe works as a senior scientist at Fraunhofer IPMS, Center Nanoelectronic Technologies. He received his Ph.D. degree in Physics in 2016 and his Diplom degree in Physics in 2011, both from TU Dresden. After research visiting scholar positions with the University of Colorado at Boulder and Stanford University, he joined the Fraunhofer Society in 2017. To date, Dr. Kämpfe has authored and co-authored more than 150 peer-reviewed journal papers and conference proceedings. His main research interests are CMOS-compatible ferroelectrics for advanced emerging memories, analog in-memory computing paradigms/architectures, high-frequency electronics, pyro- & piezo-electronics, as well as RF/mmWave devices & circuits in CMOS and SOI technologies.
Generating Trust in Hardware through Physical Inspection

Bernhard Lippmann, Matthias Ludwig, and Horst Gieser
Abstract—A globally distributed semiconductor supply chain, if not properly secured, may lead to counterfeiting, malicious modification, and IP piracy. Countermeasures like a secured design flow, including development tools, and physical and functional verification methods are paramount for building a trusted supply chain. In this context, we present selected image processing methods for physical inspection that aim to provide trust in the hardware produced. We consider aspects of the manufacturing process as well as the level of the physical layout. In addition, from the perspective of trust, we discuss the potential and risks of artificial intelligence compared to traditional image processing methods. We showcase the presented methods for a 28 nm process, and propose a quantitative trust evaluation scheme on the basis of feature similarities. A process for physical analysis and inspection that is free of anomalies or failures, of course, cannot be reached. However, from the trust point of view, it is crucial to be able to identify the sources of anomalies and clearly distinguish variations of the manufacturing process and artefacts of the analysis process from the signatures of potentially malicious activities. Finally, we propose a novel quantitative trust evaluation scheme which is partially based on the physical inspection methods outlined in this work.

Index Terms—hardware trust, reverse engineering, physical layout, manufacturing technology, physical verification, artificial intelligence, image processing

This work was partly funded by the projects AI4DI, VE-FIDES, and the platform project Velektronik. AI4DI receives funding within the Electronic Components and Systems for European Leadership Joint Undertaking (ECSEL JU) in collaboration with the European Union's Horizon 2020 Framework Programme and National Authorities, under grant agreement no. 826060. VE-FIDES and Velektronik receive funding from the German Federal Ministry of Education and Research (grant no. 16ME0257 and grant no. 16ME0217). (Corresponding author: B. Lippmann.) B. Lippmann is with Infineon Technologies AG, Munich, Germany (e-mail: [email protected]). M. Ludwig is with Infineon Technologies AG, Munich, Germany (e-mail: [email protected]). H. Gieser is with Fraunhofer EMFT, Munich, Germany (e-mail: [email protected]).

I. INTRODUCTION

Society depends increasingly on the availability, reliability and trustability of modern integrated circuits manufactured in nanoscale technologies. These cover applications ranging from consumer, smart home, and internet of things (IoT) applications to autonomous driving and critical infrastructure. While microelectronics is a key enabler for global megatrends like digitalization and decarbonization, trust in microelectronics is no longer granted by default. Globally distributed supply chains, outsourced semiconductor manufacturing, and the increased design and technology complexity extend the threat surface. These threats include the design of hardware Trojans, undocumented or optional functionality, access paths for programming and testing, counterfeiting, but also bugs or weak design solutions. Consequently,
these unwanted properties limit the trust in hardware solutions, particularly for critical applications. In addition, as most products are a combination of hardware and software, any hardware vulnerability must be treated as a complete system vulnerability, independent of the use of a secured software solution, and post-production patches are generally not available.

Fig. 1. Different stages of the IC development flow, from the initial design phase until the final product (requirements, IC design, mask shop, foundry production, test & packaging).

As hardware trust is no precise technical term with established measurement metrics, in a first approach, hardware trust is generated through security and functional testing as specified in various schemes like Common Criteria (CC) [1], FIPS 140 including the Cryptographic Module Validation Program (CMVP) [2], Trusted Computing Group Certification (TCG Certification) [3], the Security Evaluation Standard for IoT Platforms (SESIP) [4] and the Platform Security Architecture (PSA) [5]. They primarily cover on-device threats, including manipulative, side-channel, fault, or logical attacks. Both the SAE Aerospace Standard (AS 6171B) and the IDEA 1010B Standard define inspection and test procedures for the identification of suspect and counterfeit devices, and aim to verify a specific lot of devices from a non-trusted source [6], [7].
As previously elaborated, the complete microelectronic supply chain requires trust schemes. While the above-mentioned methods provide generic, theoretical measures, they might not be comprehensive. The assessment of pre-silicon threats remains only coarsely resolved, with a first approach shown in [8], where a metric considering the functional and structural test coverage has been introduced.
Various attack models and potential countermeasures have been discussed in scientific publications. These include split manufacturing [9] to tackle IP infringement or overproduction at untrusted manufacturing sites, layout integrity checking [10] via hardware reverse engineering to detect layout-bound hardware Trojans, cell camouflaging [11] to hamper reverse engineering for counterfeiting or the planning of subsequent hardware attacks, or logic locking [12] and finite state machine obfuscation [13] to protect against reverse engineering. Each of these methods can be assigned to one of the steps in the semiconductor supply chain shown in Fig. 1. While pre-production methods are mostly the preferred way, without further measures, post-production analysis techniques are the only viable option. The second aspect to be motivated originates from the aforementioned system trust, which is illustrated in Fig. 2.

Fig. 2. Abstraction levels of computing systems. This work focuses on the physical layers of the abstraction stack with an emphasis on the physical layout and manufacturing technology.

In Fig. 2, the different abstraction levels of computing or microelectronic systems are shown. Weaknesses or vulnerabilities in lower abstraction layers often
lead to issues at higher abstraction levels, albeit these are deemed trustworthy or sufficiently secure. In these systems, the hardware acts as the root of trust. Bottom line, securing the lower system levels is of utmost importance. This work discusses two concrete ways of validating two abstraction levels. First (1), a method for the validation of the manufacturing technology is elaborated. Second (2), methods for physical layout verification through hardware reverse engineering are discussed. These methods are distinguished into metal track segmentation, VIA detection, and standard cell analysis, which are further discussed from an algorithmic point of view. Finally, example analyses will be shown, and the viability of artificial intelligence (AI)-based methods will be discussed.
The contribution is organised as follows: Related work and the background for the verification processes of the physical layers, as well as the major technical challenges, are presented in Section II. Section III contains a selection of innovative methods for physical inspection. This includes rule- and AI-based image processing for layout and manufacturing technology assurance. The evaluation of these methods on a 28 nm test chip sample is demonstrated in Section IV. In Section V, the trust evaluation scheme is elaborated.

II. BACKGROUND

Fig. 3 shows the major stages of a reverse engineering (RE) process as used in academia, research institutes, commercial service providers, and integrated device manufacturers (e.g. [14], [15], [16], [17], [18], or [19]). The reverse engineering flow is constituted of a physical phase and a functional or electrical recovery phase. The flow is illustrated in Fig. 3 and explained in the following.

Fig. 3. Reverse engineering process overview: delayering & scanning, image processing (P: physical layout recovery), and functional recovery (E: electrical function recovery).

The initial stage in the physical analysis lab is the preparation of samples. Each deposited physical layer of an integrated circuit is deprocessed, and subsequently high-resolution images of these layers are acquired using a scanning electron microscope (SEM). The complete scanned image mosaic is built from several thousand individual tiles, depending on the chip size. In the following step, we need to seamlessly stitch the individual tiles while analysing the shared overlapping area between these tiles. Geometrically undistorted mosaics of each layer are used for a correct 3D alignment of the complete scanned layer set [20]. During the layout recovery process, the images with wires and interconnecting structures are converted into a vector format. For the identification and read-back of digital or analogue devices from the raw layout images, custom methods incorporating domain experts' knowledge are required. As a result, we obtain a reconstructed device library (std. cell library) and the extracted connectivity between the devices. Via a back-annotation of individual devices, we generate a flat netlist of the analysed devices. Netlist interpretation algorithms are used to create an understanding of the extracted design, which can finally be verified through simulations and further electrical analysis.
individual devices, we generate a flat
II. BACKGROUND
netlist of the analysed devices. Netlist
Fig. 3 shows the major stages of a interpretation algorithms are used to cre
reverse engineering (RE) process used in ate an understanding of the extracted
academia, research institutes, commercial design, which can finally be verified
service providers, and integrated device through simulations and further electrical
manufacturers (e.g. [14], [15], [16], [17], analysis.
[18], or [19]). The reverse engineering Increased design complexity and
flow is constituted of a physical phase shrinking technology nodes require
and a functional or electrical recovery reliable and innovative methods for the
phase. The flow is illustrated in Fig. 3 physical verification process. A compi
and explained in the following. lation of these challenges is summarised
TABLE I
CHALLENGES FOR IC REVERSE ENGINEERING [21].

Task: 1. Physical layout recovery (P)
Challenges:
• Delayering (P1): maximal uniform layer removal over the complete chip size
• Technology (P2): enable delayering of advanced nodes with ultra-thin and fragile inter-oxide layers; support Al & Cu technology, FinFET
• Chip Scanning (P3): homogeneous, fast and accurate high-resolution imaging over the complete chip area for all layers with minimal placement error
• Image Processing (P4): precise layout recovery including preparation errors, indication of the error rate

Task: 2. Electrical function recovery (E)
Challenges:
• Digital Circuits (E1): recovery and sense-making of large digital circuits based on std. cell designs
• Analogue Circuits (E2): recovery of circuit functions based on analogue devices with unclear electrical behaviour
• Robustness (E3): robust to remaining errors coming from the physical layout extraction process

Task: 3. Analysis and scoring of security protection mechanisms (S)
Challenges:
• Chip Individual Features (S1): hardware security may include chip-individual features like physical unclonable functions (PUFs), dedicated protection layers and protection circuits configured with run-time keys, logic locking, etc.
• Design for Physical Analysis Protection (S2): camouflaged cell libraries, timing camouflage
• Error and Effort Estimation (S3): reliable indications must be shown of how strong these measures are under the current analysis options
A trustworthy verification process needs to address and solve these challenges with methods which are themselves trustworthy and reliable, while their limitations and potential failure modes must be explainable.

III. METHODOLOGY

Physical verification is defined by comparing physically measured data against a reference. The decision of authenticity (see Section V) is based on these results. The first aspect of physical inspection described is the assurance provided by the manufacturing technology; the second aspect involves the inspection of the physical layout.

A. Manufacturing technology assurance

The technology can be subdivided into an assessment of the process design kit (PDK) on the one hand and the manufacturing process on the other hand. PDK aspects include physical aspects like
Fig. 4. Examples of labelled data showcasing the different ROIs: green – VIA; yellow – metal; teal – local oxidation of silicon; red – poly; blue – deep trench isolation [22].

standard cell dimensions, static random-access memory cell dimensions, digital, analogue, and passive primitives, and design rules. The underlying manufacturing process covers, e.g., physical aspects of the different manufactured layers, critical dimensions, and utilised process materials. In this paper, we focus on the manufacturing process, although the two cannot be sharply separated. A semiconductor device manufacturing process is a complex process constituted of several hundred up to more than a thousand individual sub-processes. The repeated sub-processes include, e.g., lithography, ion implantation, chemical-mechanical polishing, wet and dry etching, deposition, and cleaning. Three examples of how the process manifests itself in silicon are illustrated in Fig. 4. The images have been acquired after manual cross-sectioning of the respective semiconductor devices – post-production. The relevant objects are shown colour-coded, and they provide the following functionality (taken from [22]):

• Metal: Low-resistance metallic connections between devices. Several metallisation layers can be stacked over each other to route inter-device connections.
• Vertical interconnect access (VIA) / contact: Low-ohmic interconnections between different metallisation layers (VIA) or between devices and the lowest metallisation layer.
• Lateral, shallow isolation (e.g. shallow trench isolation (STI) or local oxidation of silicon (LOCOS)): Electrical lateral isolation between devices through a shallow silicon dioxide trough.
• Deep trench isolation: Trenches for lateral isolation with a high depth-width ratio. Mostly found in analogue integrated circuits.
• Polysilicon: Poly-crystalline silicon which is used as gate electrode.

Besides a classification of the different classes, their respective features are of importance. These include geometrical ones (width, height, pitch) or material-related ones. An example of how to measure properties of VIA1 and Metal3 is shown in Fig. 5.

Fig. 5. Example cross-section image with annotated metal and contact/VIA features (width, height, and pitch of Metal1–Metal3) [22].

The measurements may depend on the actual position of the cross section, the perspective and the SEM parameters, and they require careful calibration and awareness of uncertainty and process variations. Based on these features, the following hypothesis can be formulated:
the entirety of a process can be interpreted as an individual manufacturing technology fingerprint. This hypothesis leads to the assertion that devices can be distinguished through their technological parameters. Consequently, device authenticity can be validated via a testing of the aforementioned technological parameters against the expected parameters. These expected parameters are either provided by the manufacturer or extracted from known non-counterfeit devices, so-called exemplars (SAE AS6171). This post-production technique allows the identification of counterfeit processes. The method covers all types of electronic devices (digital, analogue, systems-on-chip, FPGAs, etc.) and especially cloned, remarked, and repackaged types of counterfeits. For methodological details, refer to [22].

B. Physical layout verification

The verification of layout integrity can be executed through a comparison of reference layout data against the physical layout extracted in the recovery process. Layout recovery includes sample preparation and fast and accurate chip imaging, which needs to be addressed by a combination of the scanning electron microscope (SEM) hardware, including the detectors, and the subsequent image processing algorithms. In this work, imaging is done using the RAITH 150Two chip scanning tool, which was also used in previous research [15], [18]. This work focuses on image processing as the last stage of the physical part of reverse engineering, i.e., the physical layout verification. All previous errors are manifested in this stage, while error correction remains possible – if the entire process is well understood. Besides being the final stage of the physical part, the imaging outputs are used for ensuing operations based on netlist interpretation and simulation. Consequently, the imaging output is a suitable common thread for trust generation by physical inspection on the layout level. In this section, methods for different parts of the layout extraction process are discussed and made transparent.

Fig. 6 compares images generated using an Everhart-Thornley SE2 (ET-SE2) detector and an InLens detector available in the RAITH 150Two. As the material contrast is increased with the ET-SE2 detector, large and small layout structures become bright and the background remains dark, enabling a threshold-based segmentation approach. Using the InLens detector, the brightness of a metal line depends on its structural size. Large structures in the left area of the reference image appear darker in their centre area. Solving these challenges is vital for a successful extraction.

Fig. 6. Typical SEM images using two different detectors. (a) shows the scan with an InLens detector with a field of view of 30 μm and a working distance of 10 mm. (b) shows the scan with an ET-SE2 detector with a field of view of 20 μm and a working distance of 10 mm.

1) Metal track segmentation: A threshold-based extraction of metal lines using algorithms like Otsu or Li ([18], [15]) is performed and confirmed within our analysis projects (Fig. 7) on images with sufficient separation between fore- and background colour levels.
The challenge arises when the colour of thick metal patterns drops close to the background colour value, as shown in Fig. 8. This is solved with a customised segmentation algorithm.

The presented algorithm, called SEMSeg (see Figs. 8 and 9), is based on
the identification of fore- or background area depending on the gradient of the edge shape rather than relying on an absolute colour value. In the initial stage, we inspect the edge points and identify the factual background on one side of each edge. In stage 2 of the algorithm, we sequentially convert the identified background area into a red-coloured area using a flood-fill algorithm, until only white edges remain or black, not-yet-flipped foreground inside larger structures remains. Finally, we need to convert the colour values to the segmentation target values: the foreground becomes white, and the background is changed to black. Fig. 9 shows interim stages of the segmentation flow, leading to an error-free segmentation where threshold-based algorithms would be infeasible.

Fig. 7. Image segmentation using threshold algorithm.

Fig. 8. Image segmentation using custom algorithm SEMSeg.

Fig. 9. SEMSeg algorithm, segmentation of SEM images with overlapping foreground and background colour level.
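The SEMSeg implementation itself is not published; the following is only a much-simplified sketch of the idea under our own assumptions (gradient-selected background seeds, grown by flood fill, then mapped to the target values):

import cv2
import numpy as np

img = cv2.imread("thick_metal_tile.png", cv2.IMREAD_GRAYSCALE)  # placeholder

# Stage 1: find strong edges and take one pixel on the darker side of
# each edge as a background seed (gradient sign, not absolute colour).
gy, gx = np.gradient(img.astype(np.float32))
mag = np.hypot(gx, gy)
ys, xs = np.nonzero(mag > mag.mean() + 2.0 * mag.std())
by = np.clip(ys - np.sign(gy[ys, xs]).astype(int), 0, img.shape[0] - 1)
bx = np.clip(xs - np.sign(gx[ys, xs]).astype(int), 0, img.shape[1] - 1)

# Stage 2: sequentially grow the identified background by flood fill.
mask = np.zeros((img.shape[0] + 2, img.shape[1] + 2), np.uint8)
work = img.copy()
for y, x in list(zip(by, bx))[::50]:          # subsampled seed points
    cv2.floodFill(work, mask, (int(x), int(y)), 0, loDiff=8, upDiff=8)

# Stage 3: map colours to the segmentation target values:
# foreground white, background black.
segmentation = np.where(mask[1:-1, 1:-1] > 0, 0, 255).astype(np.uint8)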
2) VIA segmentation with customised Hough algorithm: The identification of contacts located on the top side of a depassivated metal layer can be achieved using the Hough circle transformation (HCT) [23], with the risk of false contact detection when using non-optimised parameters or due to the limitations of the HCT. As shown in the right image of Fig. 10, one wrong contact has been found on a particle.

Fig. 10. Contact detection using Hough algorithm.

In the original HCT, each pixel is evaluated concerning its possibility of being located on a circle with a radius r. This result is stored in the accumulator matrix. In the final step, selecting local maxima from the accumulator matrix returns the possible circles. We introduce a rule-based modified accumulator calculation and evaluation considering, e.g., a colour change over the contact area or an evaluation of the outer circle area,
aiming to remove false contact maxima from the accumulator. The accumulator result in Fig. 11 shows the dominant peak for the particle, which is eliminated and finally yields an error-free contact identification.

Fig. 11. Contact detection using modified Hough algorithm.
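For reference, the baseline flow can be sketched with OpenCV's stock Hough circle transform plus one illustrative contrast rule; the parameter values and the acceptance rule below are our assumptions, not the modified accumulator itself:

import cv2
import numpy as np

img = cv2.imread("depassivated_layer.png", cv2.IMREAD_GRAYSCALE)  # placeholder
blur = cv2.medianBlur(img, 5)

circles = cv2.HoughCircles(
    blur, cv2.HOUGH_GRADIENT, dp=1, minDist=12,
    param1=120,          # Canny high threshold
    param2=18,           # accumulator threshold: too low -> false contacts
    minRadius=4, maxRadius=8)

# Rule-based validation: a true contact is bright inside and darker in
# the surrounding ring (cf. the colour-change criterion above).
vias = []
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        inner = img[max(y - r, 0):y + r, max(x - r, 0):x + r]
        outer = img[max(y - 2 * r, 0):y + 2 * r, max(x - 2 * r, 0):x + 2 * r]
        if inner.size and inner.mean() > outer.mean() + 20:
            vias.append((x, y, r))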
3) Standard cell analysis:

a) Automatic standard cell identification: The automatic identification of logic gates in a given design is based on domain knowledge about design principles and customised image processing algorithms. The complete process contains three major steps, as shown in Fig. 12.

Fig. 12. Standard cell identification process. (a) SEM image displaying poly-silicon, active area, and contacts. (b) Detection of power lines (VDD, VSS). (c) Segmented standard cells using custom image processing algorithms. (d) Classification of different standard cells.

The major innovation uses the fact that our standard cell images display polysilicon, contacts and p/n-doped area at the same time (left side of Fig. 13). As these structures are brighter than the background (isolation), a flood-fill mechanism on a black-white image of the complete cell array, stopping only at the isolation area, allows the identification of the correct cell dimension perpendicular to the power line direction. By this procedure, the complete image is automatically segmented into the individual std. cells (right side of Fig. 13).

Fig. 13. Segmentation of polysilicon, active area delayering SEM images into standard cells using a custom algorithm with domain knowledge.
b) Standard cell structure analysis: We continue the analysis of std. cell images with a customised image processing algorithm converting the identified cells into a flat transistor-level netlist. First, we segment the contact, polysilicon and active area from a std. cell image. The active area is not completely visible in the input image, so it is completed by connecting the paths between two active area segments which are covered by polysilicon; this step (B) effectively constructs the channel area of the transistor. The logical combination between the completed active area and the polysilicon shapes on the pixel base defines the active transistor area. Segmenting this image, we obtain the active transistor areas as polygons (C, D). The transistor definition algorithm then extends the active transistor area up to half of the contact distance; finally, we build the transistors and obtain the net information by adding the M1 layer. The extracted geometrical shapes are stored
in a vector format, and we use a layout-versus-schematic-like back-annotation tool for the generation of an electrical transistor-level netlist.

Fig. 14. Extracting transistor-level netlist from std. cell SEM images displaying polysilicon, contact, and active area layouts.
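The pixel-level core of the transistor-area rule above is a logical AND of two masks; a sketch, assuming the active-area and polysilicon masks have already been segmented and completed:

import cv2
import numpy as np

def transistor_channels(active_mask: np.ndarray, poly_mask: np.ndarray):
    """Inputs: uint8 {0, 255} masks of equal shape. Returns the active
    transistor (channel) areas as polygon contours."""
    channel = cv2.bitwise_and(active_mask, poly_mask)   # pixel-level AND
    contours, _ = cv2.findContours(channel, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return contours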
4) Discussion on the applicability of deep learning approaches: For a more robust and better-yielding identification of VIAs, a convolutional neural network (CNN) based architecture is used. A training set is obtained from images generated through classical processing and subsequent manual cleaning. The left image in Fig. 15 shows the detection of VIAs using the customised Hough transformation. Results obtained from a pure deep learning approach are shown in the centre image. A hybrid approach using the classical Hough transformation for detection and deep learning (DL) based contact validation for improved computing performance is shown in the right image.

Fig. 15. Three different VIA extraction methods [17].

Fig. 16. Metal layer extraction with deep learning [17].
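The hybrid variant can be sketched as a Hough stage producing candidate patches and a small CNN accepting or rejecting each one; the network shape and the 32x32 patch size are illustrative assumptions only, not the architecture used in the project:

import torch
import torch.nn as nn

class ViaValidator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Linear(16 * 8 * 8, 2)     # classes: VIA / not-VIA

    def forward(self, x):                        # x: (B, 1, 32, 32) patches
        return self.head(self.features(x).flatten(1))

model = ViaValidator().eval()
patches = torch.randn(64, 1, 32, 32)             # crops around Hough hits
with torch.no_grad():
    keep = model(patches).argmax(dim=1) == 1     # retain validated contacts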
IV. EVALUATION OF PHYSICAL LAYOUT EXTRACTION

In our latest work, the developed physical and functional reverse engineering workflow has been applied to test chips manufactured at the 28 nm technology node.

A. Sample preparation

The delayering process has been continuously improved, providing homogeneous sample preparation almost over the entire chip area at the 40 nm technology node [21]. Due to the ever-shrinking inter-layer oxide thicknesses, a homogeneous full-chip-area delayering of the lower metal levels and the polysilicon level could not be achieved for the 28 nm node, as shown in Fig. 17. However, dedicated larger circuit blocks are available in adequate delayering quality.

Fig. 17. 28 nm technology, delayering study on test samples. Optical microscope and SEM images are used for quality assurance.

The delayering quality assessment is based on a first inspection of the colour uniformity of the whole chip module with optical microscopy techniques. Colour changes inside one layer indicate different modules and different remaining layer stacks. Finally, SEM inspection
allows a precise and detailed quality assessment.

B. Layout recovery

Various image processing methods, including the solutions discussed, have been successfully applied to layout extraction studies using 28 nm test chip samples. Fig. 18 shows high-resolution sample images and the corresponding extracted layout patterns.

Fig. 18. Overview of different metal segmentation and VIA detection tasks of the 28 nm test chip sample.

Successful VIA recovery is demonstrated in Figs. 18a and 18b. However, new challenges arise from large VIA shapes covering the underlying metal pattern. We implemented dedicated image preprocessing algorithms using the design rules (DR) for recovering the hidden or over-blended metal pattern. Fig. 18d displays a recovered part of a 28 nm layout. Fig. 18c shows successful polygon recovery even though the scanning resolution had been reduced to contain the scan time.

C. Comparison of layout recovery using classical CV against DL methods

A comparative study between layout recovery using classical image processing algorithms and DL-based solutions has been executed. The obtained results show a dependence on image quality and on the applied training efforts and training data. An overall simplified rule of thumb is given in Tab. II. We observed a performance increase with the drawbacks of training data dependency, sample-specific training, and labelling efforts.

TABLE II
RULE-OF-THUMB PERFORMANCE FOR CLASSICAL VS. DL-BASED VIA DETECTION.

                       Classical   Deep Learning   Remark
Good image quality     99.5%       99.6%...100%    VIAs perfectly visible
Poor image quality     ~50%        ~90%            VIAs hard to recognise

Images obtained from smaller technology nodes with a lower image quality achieved better conversion results using AI methods; classical image processing methods show a much higher yield drop. During the quality assessment of image sets using DL-based image processing, we observed unexpected failures, as shown in Fig. 19, where only a minor brightness
change over the entire image area led to significant conversion failures.

Fig. 19. Polygon extraction example with deep learning.

Fig. 20. Reconstruction capabilities of DL.

V. TRUST ASSESSMENT THROUGH PHYSICAL INSPECTION

A. Introduction of similarity metrics

The complete physical verification process is composed of individual comparisons of measurable product features against the original data, covering the discussed aspects of the manufacturing technology and the reconstruction of the geometrical layout. Furthermore, albeit not elaborated in this work, the respective physical design kit (PDK) and the physical die package information are two more pillars of a trust validation flow. This flow is described mathematically in the following:

S(M, G) = \frac{1}{N}\Big(\mathrm{PackCom}(M, G) + \mathrm{DFCom}(M, G) + \mathrm{TFCom}(M, G) + \mathrm{PolyCom}(M, G)\Big)
        = \frac{1}{N}\Big(\sum_{\mathrm{Package}}^{N_1}\mathrm{PackCom}_i(M_i, G_i) + \sum_{\mathrm{PDK}}^{N_2}\mathrm{DFCom}_i(M_i, G_i) + \sum_{\mathrm{Tech.}}^{N_3}\mathrm{TFCom}_i(M_i, G_i) + \sum_{\mathrm{Layout}}^{N_4}\mathrm{PolyCom}_i(M_i, G_i)\Big)    (1)

In (1), S(M, G) describes the similarity between a measured product M and the reference data G (i.e. the golden model). Each comparator consists of a number N_i of single feature comparisons. A single-stage comparator function should measure the similarity and return 1 for a perfect match and 0 if no correlation to the reference data has been found. For thickness and dimension measurements, the technology comparator can be applied directly; polygon similarity measurements based on a polygon-based F1 score or a pixel-based XOR score have been presented in [17]. For normalising the complete similarity function S(M, G), we need to add the factor \frac{1}{N} with N = \sum_i N_i.
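As one concrete single-stage comparator, a pixel-based XOR score fits in a few lines (a sketch; the alignment of the two masks is assumed to be handled beforehand):

import numpy as np

def xor_score(recovered: np.ndarray, golden: np.ndarray) -> float:
    """Boolean layout masks of equal shape; 1.0 means a perfect match,
    0.0 means complete disagreement."""
    return 1.0 - np.logical_xor(recovered, golden).mean()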
The proposed comparator definitions include:

• Technology Feature Comparator (TFCom): Compares wafer fabrication features like deposited layers, including minimum structure width, thickness and material, against the original technology.
  – Efficiency: very high, only SEM cross-section images needed.
  – Use case: manufacturing technology fingerprinting.
• Package Feature Comparator (PackCom): Analogous to the TFCom, but using die package manufacturing steps, including
labelling processes with font size and marking details.
  – Efficiency: typically used tools like optical microscopes, height gauges and X-ray systems offer high throughput at low analysis costs.
  – Use case: inspection of bond wires, bond scheme, and device marking, including acetone wipe and scrub tests; die marking can detect remarked and recycled devices. Confocal scanning acoustic microscopy (C-SAM) may be used to detect delamination of reused devices [6].
• Polygon Comparator (PolyCom): After complete recovery of the physical design, the comparison of the recovered polygons against the original layout can be executed. The challenges from Tab. I (P, S) apply, as large-area chip delayering, scanning and image analysis are required. Unstable and imperfect delayering processes may in addition require scanning dedicated to special features, e.g. VIAs, or the use of larger sample sets, both increasing costs and time.
  – Efficiency: dedicated, highly customised tools are necessary.
  – Use case: the most sensitive method presented here for the detection of circuit modifications.
• Design Features Comparator (DFCom): Besides the manufacturing technology, a set of logic standard cells, analogue primitive devices, SRAM, ROM, NVM cells and modules are included in the PDK. By checking the physical layout in dedicated local areas, the recovery of many PDK features can be performed. Using these data, an extended manufacturing technology footprint is generated without the need for a complete layout recovery. Whereas the polygon comparator is based purely on the similarity of different geometrical layout descriptions and is product-specific, the DFCom provides an electrical interpretation of the recovered patterns. Furthermore, in combination with the technology comparator, it defines a complete manufacturing technology fingerprinting.
  – Efficiency: does not require the complete layout recovery; significantly higher compared to a full polygon comparison.
  – Use case: technology fingerprinting.
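Numerically, (1) then reduces to a normalised sum over all executed single-feature comparisons; a toy computation with invented scores:

# Each comparator stage returns a similarity in [0, 1] per feature.
comparators = {
    "PackCom": [1.0, 0.9],          # e.g. marking font, bond scheme
    "DFCom":   [1.0, 1.0, 0.8],     # e.g. std. cell, SRAM primitives
    "TFCom":   [0.95, 1.0],         # e.g. layer thickness, pitch
    "PolyCom": [0.99],              # e.g. pixel-based XOR score
}
N = sum(len(s) for s in comparators.values())
S = sum(sum(s) for s in comparators.values()) / N
print(f"S(M, G) = {S:.3f} over N = {N} feature comparisons")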
B. Discussion on trust metrics

The trust generated by a comprehensive physical verification can therefore be modelled by assigning individual weights to each comparison stage, as shown by (2). This extends the previously defined S(M, G) to T(M, G). In this adapted equation, individual weights (w) allow an adaptive rating of different features, also within different comparators.

T(M, G) = \frac{1}{N}\Big(\sum_{\mathrm{Package}}^{N_1} w_{\mathrm{Package}}(i) \cdot \mathrm{PackCom}_i(M_i, G_i) + \sum_{\mathrm{PDK}}^{N_2} w_{\mathrm{PDK}}(i) \cdot \mathrm{DFCom}_i(M_i, G_i) + \sum_{\mathrm{Tech.}}^{N_3} w_{\mathrm{Tech.}}(i) \cdot \mathrm{TFCom}_i(M_i, G_i) + \sum_{\mathrm{Layout}}^{N_4} w_{\mathrm{Layout}}(i) \cdot \mathrm{PolyCom}_i(M_i, G_i)\Big)    (2)
As given in (3), in practice not all features are measured by physical inspection, and since many comparator stages (\mathrm{Comparator}_{(i,j)}) are costly and time-consuming, normally only a limited number of features is selected.

T(M, G) = \frac{1}{N} \underbrace{\sum_{j=0}^{C_T} \sum_{i=0}^{N_j}}_{\text{test coverage}} \underbrace{w_j(i)}_{W} \cdot \underbrace{\mathrm{Comparator}_{(i,j)}(M_i, G_i)}_{C}    (3)

Under these constraints, a realistic trust evaluation with skipped comparator stages must still provide a trustworthy test coverage. C_T in (3) is the number of used comparator types. As the assigned weights W to individual stages express their impact on the overall hardware trust, stages with an assumed low weight might be skipped.
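A toy numeric sketch of this weighted, partially executed evaluation, with all scores and weights invented; a skipped stage simply drops out of the sum, reducing the effective test coverage:

weighted = {
    # comparator: (scores, weights); PolyCom skipped as too costly
    "PackCom": ([1.0, 0.9],  [0.5, 1.0]),
    "TFCom":   ([0.95, 1.0], [1.0, 1.0]),
}
N = sum(len(scores) for scores, _ in weighted.values())
T = sum(w * s for scores, weights in weighted.values()
        for s, w in zip(scores, weights)) / N
print(f"T(M, G) = {T:.3f} with reduced test coverage ({N} features)")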
Still, this quite intuitive approach yields two major challenges. First, as many input data for C are the result of quite challenging lab processes, we have to consider a concise handling of the variations and tolerances of each manufacturing and analysis process, as well as of the metrology. Furthermore, there may be published, documented, and undocumented changes in the process of the original device manufacturer. The rule-based image processing algorithms and results are presented above, but no systematic approach for the analysis of method failures exists yet. This holds specifically true for ML/DL-based comparators, where training and analysis data play an important role.

Secondly, the computation of expressive values for the weights is far from trivial. A quantification of trust and security is a big challenge, as concluded in the common reverse engineering scoring system (CRESS) [24]. Nonetheless, a full practical realisation exists for software in the form of the common vulnerability scoring system (CVSS) [25]. In the end, a comprehensive method for trust evaluation, as proposed in this work, is a dynamic one. Ultimately, the decision how to scale the weights and which methods to use differs from use case to use case, and a myriad of factors (e.g. end application, economic factors, potential safety effects, etc.) must be taken into account.

VI. CONCLUSION

Besides a secured design flow, covering electronic design automation tools and methodology, and a secured manufacturing process, physical inspection is one pillar to create hardware trust. The implementation techniques to generate trust through physical inspection must be profoundly understood when applied. Furthermore, it must be acknowledged that physical constraints in sample preparation, imaging and scanning may lead to imperfect comparator inputs and must be carefully investigated. Continuously executed analysis projects will demonstrate the achieved trust level while aiming to reduce the test coverage.

Addressing the requirements for an advanced physical verification process, several innovative process solutions have been developed and evaluated. Novel image processing algorithms using dedicated expert semiconductor analysis knowledge allow the effective and precise extraction of layout and technological information. With the introduction of advanced technology nodes, AI-based image processing can provide additional options. Creative solutions are needed as the number of challenging imaging problems increases.
VII. FUTURE WORK

Considering trust through testing against well-defined criteria with accepted frameworks as a reference, physical verification lacks such generally accepted methods and verification flows. As AI-based image processing methods need to prove their trustworthiness and performance increase over conventional image processing, a continuation of benchmarking is mandatory. These will be the two major challenges for the future of effective and efficient physical inspection flows. Still, manufacturing technology, physical layout, and functional verification are major building blocks of a complete future product verification process. For future work, practical analysis projects will provide concrete numerical results and contribute to the overall trust level.

ACKNOWLEDGMENT

The authors would like to thank Anja Dübotzky and Peter Egger of Infineon Technologies AG for physical sample preparation, and Tobias Zweifel and Nicola Kovač of Fraunhofer EMFT for e-beam scanning.

REFERENCES

[1] The Common Criteria, https://www.commoncriteriaportal.org/, Accessed: 2022-08-06.
[2] A. Vassilev, L. Feldman, and G. Witte, Cryptographic Module Validation Program (CMVP), 2021-12-01. https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=917620.
[3] TCG Certification Programs. https://trustedcomputinggroup.org/membership/certification/.
[4] SESIP: An optimized security evaluation methodology, designed for IoT devices. https://globalplatform.org/sesip/.
[5] Platform Security Model version 1.1, 2019-12-01. https://www.psacertified.org/app/uploads/2021/12/JSADEN014_PSA_Certified_SM_V1.1_BET0.pdf.
[6] U. Guin, D. DiMase, and M. M. Tehranipoor, "A Comprehensive Framework for Counterfeit Defect Coverage Analysis and Detection Assessment," Journal of Electronic Testing, vol. 30, 2014.
[7] Test Methods Standard; General Requirements, Suspect/Counterfeit, Electrical, Electronic, and Electromechanical Parts AS6171, Oct. 2016. https://www.sae.org/standards/content/as6171/.
[8] J. Cruz, P. Mishra, and S. Bhunia, "INVITED: The Metric Matters: The Art of Measuring Trust in Electronics," in 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019.
[9] M. Jagasivamani et al., "Split fabrication obfuscation: Metrics and techniques," in 2014 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), IEEE, 2014.
[10] M. Ludwig, A.-C. Bette, and B. Lippmann, "ViTaL: Verifying Trojan-Free Physical Layouts through Hardware Reverse Engineering," in 2021 IEEE Physical Assurance and Inspection of Electronics (PAINE), Washington, DC, USA: IEEE, Dec. 2021. Preprint: http://dx.doi.org/10.36227/techrxiv.16967275.
[11] A. Vijayakumar et al., "Physical Design Obfuscation of Hardware: A Comprehensive Investigation of Device and Logic-Level Techniques," Trans. Info. For. Sec., vol. 12, no. 1, Jan. 2017.
[12] S. M. Plaza and I. L. Markov, "Solving the Third-Shift Problem
in IC Piracy With Test-Aware Logic Locking," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 6, Jun. 2015.
[13] M. Fyrbiak et al., "On the Difficulty of FSM-based Hardware Obfuscation," IACR Transactions on Cryptographic Hardware and Embedded Systems, Aug. 2018.
[14] R. Torrance and D. James, "The state-of-the-art in semiconductor reverse engineering," in 48th ACM/EDAC/IEEE Des. Automat. Conf. (DAC), Jun. 2011.
[15] The Role of Cloud Computing in a Modern Reverse Engineering Workflow at the 5nm Node and Beyond, in ISTFA 2021: Conf. Proc. 47th Int. Symp. for Testing and Failure Anal., Oct. 2021.
[16] A. Kimura et al., "A Decomposition Workflow for Integrated Circuit Verification and Validation," J. Hardw. Syst. Secur., vol. 4, Mar. 2020.
[17] B. Lippmann et al., "Verification of physical designs using an integrated reverse engineering flow for nanoscale technologies," Integration, vol. 71, Nov. 2019.
[18] R. Quijada et al., "Large-Area Automated Layout Extraction Methodology for Full-IC Reverse Engineering," Journal of Hardware and Systems Security, vol. 2, 2018.
[19] H. P. Yao et al., "Circuitry analyses by using high quality image acquisition and multi-layer image merge technique," in Proceedings of the 12th International Symposium on the Physical and Failure Analysis of Integrated Circuits (IPFA 2005), Jun. 2005.
[20] A. Singla, B. Lippmann, and H. Graeb, "Recovery of 2D and 3D Layout Information through an Advanced Image Stitching Algorithm using Scanning Electron Microscope Images," Jan. 2021.
[21] B. Lippmann et al., "Physical and Functional Reverse Engineering Challenges for Advanced Semiconductor Solutions," in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, Mar. 2022.
[22] D. Purice, M. Ludwig, and C. Lenz, "An End-to-End AI-based Automated Process for Semiconductor Device Parameter Extraction," in Industrial Artificial Intelligence Technologies and Applications, Vienna, Austria: River Publishers, 2022, ch. 4, pp. 53–72.
[23] U. Botero et al., Hardware Trust and Assurance through Reverse Engineering: A Tutorial and Outlook from Image Analysis and Machine Learning Perspectives, Oct. 2020.
[24] M. Ludwig, A. Hepp, M. Brunner, and J. Baehr, "CRESS: Framework for Vulnerability Assessment of Attack Scenarios in Hardware Reverse Engineering," in 2021 IEEE Physical Assurance and Inspection of Electronics (PAINE), Washington, DC, USA: IEEE, Dec. 2021. Preprint: https://dx.doi.org/10.36227/techrxiv.16964857.
[25] FIRST.Org, Inc. (Aug. 21, 2022). "Common Vulnerability Scoring System version 3.1: Specification Document," https://www.first.org/cvss/specification-document.
Bernhard Lippmann received a diploma degree in Physics from the Technical University Munich (TUM), Germany, in 1992. He started his career at Hitachi Semiconductor Europe in Landshut in the type engineering group. From 1993 until 1998 he was responsible for the physical and electrical failure analysis and yield enhancement programs for several generations of DRAM and smart card products. In 1999 he joined the former Chipcard and Security Division of Infineon Technologies AG (Siemens) in Munich. He is responsible for competitor analysis, reverse engineering and benchmarking projects. Currently, he is responsible for the project coordination of the publicly funded project RESEC (https://www.forschung-it-sicherheit-kommunikationssysteme.de/projekte/resec) at the Connected Secure Systems (CSS) division of Infineon. He holds several patents on smart card and security topics; his publications cover Java Card benchmarking and circuit reverse engineering.

Horst A. Gieser is head of the AT (Analysis and Test) team at the Fraunhofer Institution for Microsystems and Solid State Technologies EMFT (www.emft.fraunhofer.de). He received his diploma in Electrical Engineering and his Ph.D. from the Technical University in Munich, where he started his first laboratory and research team for analysis and test in 1989 and transferred it to Fraunhofer in 1994. Starting and growing with Electrostatic Discharge (ESD), he has extended his research and application interest into the field of analysis for Trusted Electronics down to the nanoscale and the cryo-characterization of quantum devices. His lab is CC-EAL6 certified for the physical analysis of security chips. Mainly in the field of ESD he has authored and contributed to more than 120 publications, including several invited talks at international conferences in the US, Taiwan and Japan. He is the author of several publications in peer-reviewed journals. Four publications won awards. Today he is leading activities in several publicly funded projects on Trusted Electronics.

Matthias Ludwig received the B.Eng. from Regensburg University of Applied Sciences, Germany, in 2017, and the M.S. from Munich University of Applied Sciences, Germany, in 2019, both in electrical and computer engineering. He is currently pursuing his Ph.D. with the department of electrical and computer engineering at Technical University of Munich, Germany, and is with the Connected Secure Systems (CSS) division of Infineon Technologies AG, Neubiberg, Germany. His research interests include hardware security with a focus on anti-counterfeiting, hardware trust and physical security.
Meeting the Latency and Energy
Constraints on Timing-critical
Edge-AI Systems
Ivan Miro-Panades, Inna Kucher, Vincent Lorrain, and Alexandre Valentian
Abstract—Smart devices with AI capabilities at the edge have demonstrated impressive application results. The current trend in video/image analysis is to increase the resolution and the classification accuracy. Moreover, computing object detection and classification tasks at the edge requires both low latency and high energy efficiency from these new devices. In this paper, we explore a novel architectural approach to overcome such limitations by using the attention mechanism of the human brain, which allows humans to selectively analyze a scene, thereby limiting the energy spent.

Index Terms—Edge AI accelerator, high-energy efficiency, low-latency, object detection.

This work was in part funded by the ECSEL Joint Undertaking (JU) under grant agreement No 876925. The JU receives support from the European Union's Horizon 2020 research and innovation program and France, Belgium, Germany, Netherlands, Portugal, Spain, Switzerland.

Authors Ivan Miro-Panades and Alexandre Valentian are with Univ. Grenoble Alpes, CEA, List, F-38000 Grenoble, France (e-mail: [email protected] and [email protected]). Authors Inna Kucher and Vincent Lorrain are with Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France (e-mail: [email protected] and [email protected]).

I. INTRODUCTION

THE observed trend in visual processing tasks is to increase the complexity of neural network (NN) topologies to improve the classification accuracy. This results in NN models being deeper and larger, leading to several issues when used in edge applications. Even though mobile versions of some network topologies have been introduced over time, it remains difficult to integrate them on-chip in an energy-efficient manner. The main issue is the large number of parameters, requiring the use of an external memory, which leads to a large power dissipation due to the data movement. It should be noted that the energy necessary for moving data is three orders of magnitude larger than that spent for doing computations on the same data [1]. This is the primary issue ("Issue No. 1") that must be addressed.

Moreover, when processing a video input stream, the whole image is processed, frame by frame, even though there is enormous spatial redundancy between consecutive frames. If the target application requires a low reaction time to events (e.g., to an object or person moving), the frame rate needs to be high, leading to high instantaneous power values. On the other hand, if the frame rate can be kept low, such as for security surveillance applications, the system's overall energy efficiency would still be poor because of the high inter-frame redundancy (especially during nights and weekends). The second issue ("Issue No. 2") that must be addressed is to reconcile low power dissipation and short reaction times to events.
In those respects, bio-inspired approaches can lead to innovative solutions. For instance, neuroscientists have found anatomical evidence that there are two separate cortical pathways, or 'streams', in the visual cortex of monkeys [2]: the ventral pathway and the dorsal pathway, as shown in Fig. 1. On the one hand, the dorsal stream is relatively fast, sensitive to high temporal frequencies (i.e. motion), viewer-centered and relatively unconscious. It has been called the "Where" path, since it is used for quickly retrieving the location of objects, especially of moving ones. On the other hand, the ventral stream is relatively slow (∼4x longer reaction time), sensitive to high spatial frequencies (i.e., details), object-centered and relatively conscious. It is known as the "What" path, involved in the recognition of objects. If millions of years of evolution have led to such a 2-path solution, this is because it brought a competitive advantage to our ancestors, allowing them to quickly evade threats, even before their brain knew the nature of the threat.

Even though this 2-stream hypothesis has been disputed over the years, and we now know that those pathways are not strictly independent but do interact with each other (for instance for skillfully grasping objects [3]), it is still relevant to the problem at hand, as it provides a good fit to many motor and perceptual findings.

Fig. 1. Illustration of the two visual pathways or streams in the visual cortex, used for extracting different information.

In this work, we focus on the "Where" subsystem, as the "What" one is already well addressed with existing accelerators [10]. The main objectives are therefore to obtain the lowest possible latency and power values. First, we started by selecting an adequate neural network topology, i.e., one with a small number of parameters but only slightly degraded performance: we chose the MobileNet-V1 topology [4]. Since all the parameters need to be stored on-chip for solving 'Issue No. 1', the synaptic weights and activation values must be heavily quantized without a significant loss in accuracy; to this end, our in-house learning framework was complemented with a state-of-the-art quantization-aware training (QAT) algorithm. This tool is now available in open source and is presented in Section II. An innovative architecture was considered for the hardware, once again taking inspiration from biology: layers V1 to V3 of the visual cortex, which are sensitive to orientations of edges and to movement, are fixed early on during life. For instance, V1 undergoes synaptic and dendritic refinement to reach adult appearance at around 2 years of age [5]. Even though these synaptic weights will not be learnt again during adulthood, that does not prevent our visual cortex from learning how to recognize new objects. We have thus chosen to fix the feature extraction layers of the MobileNet once and for all (while ensuring they remain sufficiently generic) and then to apply a transfer learning technique to target several applications. Fixing synaptic weights actually leads to tremendous energy and latency savings, e.g. by getting rid of memory accesses. Such an architecture can be used in an attention mechanism, solving 'Issue No. 2'. The architecture analysis is described in Section III. Finally, Section IV concludes this work.
Fig. 2. N2D2 framework.
Fig. 3. Forward and backward quantization passes.
Fig. 4. Backward propagation with QAT.
II. QUANTIZATION-AWARE TRAINING

A scalable and efficient QAT tool has been developed and integrated into the N2D2 framework [6] (see Fig. 2). N2D2 provides a complete design environment for a wide range of quantization modes to achieve the best performance, including the SAT [7] and LSQ [8] methods. The overall framework and the addition of the quantization-aware training modules are shown in Figs. 3 and 4 above.

The advantages of this dedicated framework include:
• Integration of common quantization methods (DoReFa, PACT, CG-PACT).
• Straightforward support of mixed-precision operators (2-bit to 8-bit on weights and/or activations).
• Automatic support of non-quantized layers (e.g., batch normalization).
• A training phase based on optimized computing kernels, resulting in fast evaluation of the quantization performance.

There are two separate quantization modules, one dedicated to weights and another one to activations. This is illustrated in Figs. 3 and 4 above, where the example Layer N consists of a convolutional layer followed by the batch normalization layer with activation (typically ReLU). The weights of this convolutional layer are quantized to a desired precision using the quantize_wts function. Batch normalization stays in full precision and goes through the activation function. This output is then quantized to the required precision using the quantize_acts function. It must be noted that the two quantization precisions, i.e., of the weights and of the activations, might not necessarily be the same.

During the neural network training, the parameters are adjusted using backpropagation of errors on these parameters, repeating this process for a certain number of epochs until the figure of merit of the training is satisfactory.

The forward pass with QAT follows the logic shown in Fig. 3:
• Inputs arriving at the convolutional layer are passed through the convolution operation, where the weights have been quantized beforehand using the Weight Quantizer module;
• The output is propagated to the batch normalization layer, which operates in full precision;
• The output from batch normalization is transformed to its quantized values using the Activation Quantizer;
• At the end, the quantized output is passed as an input to the next layer.

The backward propagation with QAT, shown in Fig. 4, includes the following steps:
• Starting from the errors on quantized activations, the errors on full-precision activations are computed, using the derivatives of the transformations applied in the forward pass;
• Then these errors on full-precision activations are propagated through the batch normalization and convolutional layers;
• In a similar way, the errors on full-precision weights are computed using the quantized weight errors.

During the learning procedure, both full-precision and quantized quantities are kept. One has to keep in mind that, during the training, the applied procedure is called "fake" quantization, since even quantized values are kept in a floating-point type.

Once the network is trained, the weights and inputs are transformed into true integer values before execution on a hardware target.
III. ARCHITECTURE EXPLORATION

The architecture exploration started with the choice of the NN topology, with a low energy and a low latency per inference in mind: the target is an energy below 4 mJ per image (HD: 1280x720 pixels) and a latency compatible with a 30 FPS frame rate (i.e. below 30 ms). A tradeoff must thus be made between network complexity and operations per inference. A lower number of operations obviously leads to a lower number of Multiplication-Accumulation (MAC) operations to be performed per image. Fig. 5 illustrates the various topologies that can be found in the literature [9].

Fig. 5. Comparison of several NN topologies, as a function of number of operations (X-axis), classification accuracy (Y-axis) and number of parameters (size of the circle) [9].

The MobileNet-V1 topology has been chosen, as it uses depth-wise and point-wise convolutions to reduce the computing complexity (their difference with a standard convolution is shown in Fig. 6).

Fig. 6. (a) Standard convolution; (b) Depth-wise + point-wise convolution.
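The parameter savings of this factorisation are easy to verify; a short PyTorch sketch with arbitrary channel counts:

import torch.nn as nn

cin, cout, k = 32, 64, 3
standard = nn.Conv2d(cin, cout, k, padding=1)               # one 3x3 conv
depthwise = nn.Conv2d(cin, cin, k, padding=1, groups=cin)   # per-channel 3x3
pointwise = nn.Conv2d(cin, cout, 1)                         # 1x1 mixing

print(standard.weight.numel())                              # 32*64*3*3 = 18432
print(depthwise.weight.numel() + pointwise.weight.numel())  # 288 + 2048 = 2336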
Usually, NN accelerators use a layer-wise architecture. This makes it possible to support different topologies, since networks are computed layer by layer. It also makes it possible to compute multiple images per layer, i.e., inputs with a batch size higher than one: the synaptic weights are read once and can be reused for the different images, reducing the power dissipation. However, in our case, we have conflicting constraints: the topology is fixed, and the batch size is equal to one to limit the processing latency. A streaming architecture is thus considered, since the latency is minimized and the fixed topology allows optimizing the buffering and the inter-layer communication throughput, limiting the area overhead.

Our NN accelerator, called NeuroCorgi, thus takes the form of a pipelined computational architecture, in which each layer of the network is instantiated as a specialized, parameterizable sub-architecture. These sub-architectures are then connected according to the network topology and parameterized to perform the inference calculations (conv, FC, ...) and minimize the latency.

To simplify the architectural tradeoff analysis and the RTL generation, a back-end tool has been added to the N2D2 learning framework. This tool takes as input an algorithmic configuration file (representing the computations that need to be performed per layer) and the hardware parameters for each layer sub-architecture. It then generates files following a 3-step procedure: first, the generation of the topological and hardware configuration; second, the generation of the RTL code; and finally, the test and validation files.

This tool suite is very useful for architecture exploration, by varying sev
eral architectural parameters: the level of parallelism of each sub-architecture; the size of the buffers between layers, to balance the data flow and minimize the congestion in the pipeline; etc. An exploration of the design space was done by manually varying those parameters: their impact can be readily assessed at accelerator level. The pipelined architecture allows ultra-low-latency image detection (11 ms). The result of initial floorplanning experiments is shown in Fig. 7.

Fig. 7. NeuroCorgi initial floorplan, illustrating the placement of the different NN layers.

IV. CONCLUSIONS

We aim at solving the paradox of handling ever larger image resolutions (HD) and frame rates (>30 FPS) with more complex neural networks, while at the same time exhibiting low latency and power values. In this work, we explored a clever, bio-inspired solution for providing an attention mechanism to vision solutions at the edge. We focus on the dorsal stream, or "Where" path, since the "What" path is already well covered by a number of accelerators.

For pushing the energy efficiency to its maximum, several design decisions were made: a small NN topology was chosen, i.e., MobileNet-V1, to be completely integrable on-chip; weights and activations were heavily quantized (4b); bio-inspiration was again considered, by fixing the feature extraction layers (embedded memory limited to 600 kB).

Our in-house learning framework N2D2 has been completed with the necessary functionalities: state-of-the-art quantization algorithms, transfer learning, hardware generation and configuration.

REFERENCES

[1] B. Dally, "CPU Computing To ExaScale and Beyond", The International Conference for High Performance Computing, Networking, Storage, and Analysis (Super Computing), 2010.
[2] M. Mishkin, L.G. Ungerleider and K.A. Macko, "Object vision and spatial vision: two cortical pathways," Trends in Neurosciences, Vol. 6, pp. 414–417, 1983.
[3] V. Van Polanen and M. Davare, "Interaction between dorsal and ventral streams for controlling skilled grasp," Neuropsychologia, 79(Pt B), pp. 186–191, 2015.
[4] A. G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," https://doi.org/10.48550/arXiv.1704.04861.
[5] C. R. Siu and K. M. Murphy, "The development of human visual cortex and clinical implications," Eye and Brain, 2018:10, pp. 25–36, doi: 10.2147/EB.S130893.
[6] https://github.com/CEA-LIST/N2D2.
[7] Q. Jin, L. Yang, A. Liao, "Towards Efficient Training for Neural Network Quantization," arXiv:1912.10207 [cs.CV], 2019.
[8] S. K. Esser et al., "Learned Step Size Quantization," arXiv:1902.08153 [cs.LG], 2019.
[9] S. Bianco, R. Cadene, L. Celona and P. Napoletano, "Benchmark Analysis of Representative Deep Neural Network Architectures," IEEE Access, vol. 6, pp. 64270–64277, 2018, doi: 10.1109/ACCESS.2018.2877890.
[10] A. Reuther et al., "Survey and Benchmarking of Machine Learning Accelerators," IEEE Conference on High Performance Extreme Computing (HPEC), 2019, https://doi.org/10.48550/arXiv.1908.11348.

Ivan Miro-Panades received the M.S. degree in telecommunication engineering from the Technical University of Catalonia (UPC), Barcelona, Spain, in 2002, and the M.S. and Ph.D. degrees in computer science from Pierre and Marie Curie University (UPMC), Paris, France, in 2004 and 2008, respectively. He worked at Philips Research, Suresnes, France, and STMicroelectronics, Crolles, France, before joining CEA, Grenoble, France, in 2008, where he is currently an Expert Research Engineer in digital integrated circuits. His main research interests are artificial intelligence, the Internet of Things, low-power architectures, energy-efficient systems, and Fmax/Vmin tracking methodologies.

Inna Kucher received the M.S. degree in high-energy physics from École Polytechnique, Palaiseau, France, in 2013, and the Ph.D. degree in high-energy particle physics from Université Paris-Saclay, Orsay, France, in 2017. She worked at École Polytechnique and CERN before joining CEA LIST in 2020, where she is currently a Research Engineer in neural network development, optimization and deployment on embedded platforms.

Vincent Lorrain received his engineering degree from ESEO, Angers, France, in 2014, a M.S. degree in microelectronics from INSA, Rennes, France, in 2014, and his PhD degree in physics from Université Paris-Saclay, Orsay, France, in 2018. He has been working at CEA LIST since 2018, where he is currently a research engineer in the development of optimized neural network architectures.

Alexandre Valentian joined CEA LETI in 2005, after an MSc and a PhD in microelectronics. His past research activities included design technology co-optimization, promoting the FDSOI technology (notably through his participation in the SOI Academy), 2.5D/3D integration technologies and non-volatile memory technology. He is currently pursuing the development of bio-inspired circuits for AI, combining memory technology, information encoding and dedicated learning methods. Since 2020, he heads the Systems-on-Chip and Advanced Technologies (LSTA) laboratory at CEA LIST. Dr Valentian has authored or co-authored 80 conference and journal papers.
Sub-mW Neuromorphic SNN
Audio Processing Applications
with Rockpool and Xylo
Hannah Bos and Dylan Muir
Abstract—Spiking Neural Networks (SNNs) provide an efficient computational mechanism for temporal signal processing, especially when coupled with low-power SNN inference ASICs. SNNs have been historically difficult to configure, lacking a general method for finding solutions for arbitrary tasks. In recent years, gradient-descent optimization methods have been applied to SNNs with increasing ease. SNNs and SNN inference processors therefore offer a good platform for commercial low-power signal processing in energy-constrained environments without cloud dependencies. Historically, these methods have not been accessible to Machine Learning (ML) engineers in industry, requiring graduate-level training to successfully configure a single SNN application. Here we demonstrate a convenient high-level pipeline to design, train and deploy arbitrary temporal signal processing applications to sub-mW SNN inference hardware. We apply a new straightforward SNN architecture designed for temporal signal processing, using a pyramid of synaptic time constants to extract signal features at a range of temporal scales. We demonstrate this architecture on an ambient audio classification task, deployed to the Xylo SNN inference processor in streaming mode. Our application achieves high accuracy (98%) and low latency (100 ms) at low power (<100 μW dynamic inference power). Our approach makes training and deploying SNN applications available to ML engineers with general NN backgrounds, without requiring specific prior experience with spiking NNs. It makes neuromorphic hardware and SNNs an attractive choice for commercial low-power and edge signal processing applications.

Index Terms—Audio processing, Spiking Neural Networks, Deep Learning, Neuromorphic Hardware, Python.

This work was partially funded by the ECSEL Joint Undertaking (JU) under grant agreements number 876925, "ANDANTE" and number 826655, "TEMPO". The JU receives support from the European Union's Horizon 2020 research and innovation program and France, Belgium, Germany, Netherlands, Portugal, Spain, Switzerland.

Hannah Bos and Dylan Muir are with SynSense, Zürich, Switzerland (Email: [email protected]).

I. INTRODUCTION

EXISTING Deep Neural Network (DNN) approaches to temporal signal classification generally remove the time dimension from the data by buffering input windows over e.g. 40 ms and processing the entire window as a single frame [1], [2], or else apply models with complex recurrent dynamics such as Long Short-Term Memories (LSTMs) [3]. In contrast to Artificial Neural Networks (ANNs), Spiking Neural Networks (SNNs) include multiple temporally-evolving states with dynamics over a range of configurable time-scales. Applying these dynamics in recurrent networks forms a complex temporal basis for extracting information from temporal signals. This is achieved either through random projection [2], [4] or constructed
with carefully chosen temporal properties [5]. Random recurrent architectures have historically been used for SNNs because they simplify the configuration problem — when only the readout layer is trained, configuration is performed by simply applying linear regression [4].

An alternative approach is to build feedforward networks with individual spiking units tuned to a range of various frequencies, by selecting synaptic and membrane time constants [6]. Recent advances in the optimization of SNNs using surrogate gradient descent [7], [8] have provided a feasible solution for configuring deep feedforward SNNs. However, most available libraries for simulating SNNs do not support gradient calculations, and are designed to simulate biological architectures rather than modern DNNs. At the same time, modern ML libraries for training DNNs do not support building or training SNNs.

We here demonstrate a modern ML library for SNNs, "Rockpool" [9], and its application to a new SNN inference processor, "Xylo", by training and deploying a temporal signal classification task. Recently several alternative libraries for SNN-based training with PyTorch have emerged [10], [11]. However, these libraries do not support multiple computational backends for training, and do not support deployment to neuromorphic hardware.

II. AN AMBIENT AUDIO SCENE CLASSIFICATION TASK

Audio headsets, phones, hearing aids and other portable audio devices often use noise reduction or sound shaping to improve listening performance for the user. The parameters used for noise reduction may depend on the noise level and characteristics surrounding the device and user. For example, optimal noise filtering may differ depending on whether the user is in a quiet office environment, on a street with passing traffic, or in a busy cafe with surrounding conversation.

To choose from and steer pre-configured noise reduction approaches, we propose a low-power solution to automatically and continuously classify the noise environment surrounding the user. We train and deploy an SNN on a low-power neuromorphic inference processor to perform a continuous temporal signal monitoring application, with weak low-latency requirements (environments change on the scale of minutes), but hard low-energy requirements (portable audio devices are almost uniformly battery-powered).

We use the QUT-NOISE [12] background noise corpus to train and evaluate the application. QUT-NOISE consists of multiple sequential hours of ambient audio scene recordings, from which we used the CAFE, HOME, CAR and STREET classes.

III. A TEMPORAL SIGNAL PROCESSING ARCHITECTURE FOR SNNS

We make use of slow synaptic and membrane states provided by leaky integrate-and-fire (LIF) spiking neurons to integrate information within an SNN. The dynamics of an LIF neuron are given by

$$\dot{I}_{syn} \cdot \tau_{syn} = -I_{syn} + x(t)$$

$$\dot{V}_{mem} \cdot \tau_{mem} = -V_{mem} + I_{syn} + b$$

$$V_{mem} > \theta \;\rightarrow\; \begin{cases} z(t) = z(t) + \delta(t - t_k) \\ V_{mem} = V_{mem} - \theta \end{cases}$$

Here x(t) are weighted input events, Isyn and Vmem are synaptic and membrane state variables, and z(t) is the train of output spikes emitted when Vmem crosses the threshold θ at event times tk. The synaptic and membrane time constants τsyn and τmem provide a way to sensitise the LIF neuron to a particular time-scale of information.
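To make the role of the two time constants concrete, a minimal forward-Euler simulation of these dynamics could be written as follows. This is an illustrative sketch with assumed array shapes and names, not Rockpool's implementation:

import numpy as np

def lif_step(i_syn, v_mem, x, tau_syn, tau_mem, b, theta, dt=1e-3):
    # Forward-Euler update of the synaptic current and membrane potential
    i_syn = i_syn + dt / tau_syn * (-i_syn + x)
    v_mem = v_mem + dt / tau_mem * (-v_mem + i_syn + b)
    # Threshold crossing: emit an event and apply a subtractive reset
    z = (v_mem > theta).astype(float)
    v_mem = v_mem - z * theta
    return i_syn, v_mem, z

Large τsyn values let the synaptic current integrate inputs over long windows, while small values make the neuron respond only to rapid input fluctuations.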
We use a range of pre-defined synaptic time constants in a deep SNN to extract and integrate temporal information over a range of scales, which can then be classified by a spiking readout layer. The proposed network architecture is shown in Figure 1.

Single-channel input audio is pre-processed through a filter bank, which extracts the power in each frequency band, spanning 50 Hz to 8000 Hz over 16 logarithmically-spaced channels. Instantaneous power is temporally quantized to 1 ms bins, with the amplitude encoded by up to 15 events per bin per channel.

Input signals are then processed by three spiking layers of 24 LIF neurons each, interposed with dense weight matrices. Each layer contains a fixed common τmem of 2 ms, and a range of τsyn from 2 ms to 256 ms. The maximum synaptic time constant increases with each layer, such that early layers have only short τsyn while final layers contain the full range of τsyn values. The first layer contains neurons with two time constants of τsyn = 2 and 4 ms. The final layer contains neurons with τsyn = 2, 4, 8, 16, 32, 64, 128 and 256 ms.

The readout layer consists of four spiking neurons corresponding to the four ambient audio classes. This network uses no bias parameters.

IV. ROCKPOOL: AN OPEN-SOURCE PYTHON LIBRARY FOR TRAINING AND DEPLOYING DEEP SNNS

Rockpool [9] is a high-level machine-learning library for spiking NNs, designed with a familiar API similar to other industry-standard Python-based NN libraries. The API is similar to PyTorch [13], and in fact PyTorch classes can be used seamlessly within Rockpool. Rockpool has the goal of making supervised training of SNNs as convenient and simple as training ANNs. The library interfaces with multiple back-ends for accelerated training and inference of SNNs, currently supporting PyTorch [13], Jax [14], NumPy, Brian 2 [15] and NEST [16], and is easily extensible. Rockpool enables hardware-aware training for neuromorphic processors, and provides a convenient interface for mapping, deployment and inference on SNN hardware from a high-level Python API. Rockpool is installed with "pip" and "conda", and documentation is available from https://rockpool.ai. Rockpool is an open-source package, with public development based at https://github.com/synsense/rockpool.

V. DEFINING THE NETWORK ARCHITECTURE

The network architecture shown in Figure 1 is defined in a few lines of Python code, shown in Listing 1.

VI. TRAINING APPROACH

We trained the SNN on segments of 1 s duration using BPTT and surrogate gradient descent [7], [8]. We applied a mean-squared-error loss to the membrane potential of the readout neurons, with a high value for the target neuron Vmem and a low value for non-target neuron Vmem. After training, we set the threshold of the readout neurons such that the target neurons emit events for their target class and remain silent for non-target classes. PyTorch Lightning [17] was used to optimize the model against the training set using default optimization parameters.
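For illustration, the readout loss described above can be sketched compactly. This assumes batched membrane potentials from the PyTorch back-end; the function name and the target levels high/low are our placeholders, not the authors' values:

import torch

def readout_mse_loss(v_mem, target_class, high=1.0, low=0.0):
    # v_mem: (batch, time, n_classes) readout membrane potentials
    # target_class: (batch,) integer class indices
    targets = torch.full_like(v_mem, low)
    batch_idx = torch.arange(v_mem.shape[0])
    # High target for the correct class at every time step
    targets[batch_idx, :, target_class] = high
    return torch.nn.functional.mse_loss(v_mem, targets)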
Fig. 1. Spiking network architecture for temporal signal processing. A filter bank splits single-channel audio into sixteen channels, spanning 50 Hz to 8000 Hz. The power in each frequency band is quantized to 4 bits, then injected into the SNN. The spiking network consists of three hidden layers, with a pyramid of time constants from slow to fast distributed over 24 neurons in each layer. Each layer contains several time constants, with the first hidden layer containing only short time constants (τ1, τ2), and the final hidden layer containing short to long time constants (τ1 to τ8). Finally, the readout layer outputs a continuous one-hot event-coded prediction of the current ambient audio class.

from rockpool.nn.combinators import Sequential
from rockpool.nn.modules import LinearTorch, LIFTorch
from rockpool.parameters import Constant

Nh = 24  # - Hidden layer size

# - Define pyramid of time constants over SNN layers
taus = [2**n * 1e-3 for n in range(1, 9)]
tau_layer1 = [taus[i] for i in range(2) for _ in range(Nh // 2)]
tau_layer2 = [taus[i] for i in range(4) for _ in range(Nh // 4)]
tau_layer3 = [taus[i] for i in range(8) for _ in range(Nh // 8)]

# - Define the network as a sequential list of modules
net = Sequential(
    LinearTorch((16, Nh)),  # - Linear weights, hidden layer 1
    LIFTorch(Nh, tau_syn=Constant(tau_layer1)),  # - LIF layer

    LinearTorch((Nh, Nh)),  # - Hidden layer 2
    LIFTorch(Nh, tau_syn=Constant(tau_layer2)),

    LinearTorch((Nh, Nh)),  # - Hidden layer 3
    LIFTorch(Nh, tau_syn=Constant(tau_layer3)),

    LinearTorch((Nh, 4)),  # - Readout layer
    LIFTorch(4),
)

Listing 1. Define an SNN architecture in Rockpool. The network here corresponds to Fig. 1.

VII. XYLO DIGITAL SNN ARCHITECTURE

We deployed the trained model to a new digital SNN inference ASIC, "Xylo". Xylo is an all-digital spiking neural network ASIC, for efficient simulation of spiking leaky integrate-and-fire neurons with exponential input synapses. Xylo is highly configurable and supports individual synaptic and membrane time-constants, thresholds and biases for each neuron. Xylo supports arbitrary network architectures, including recurrent networks, for up to 1000 neurons. More information about Xylo can be found at https://rockpool.ai/devices/xylo-overview.html.

Figure 2 shows the logical architecture of the network within Xylo. Xylo contains 1000 LIF neurons in a hidden population and 8 LIF neurons in a readout population. Xylo provides dense input and output weights, and sparse recurrent weights with a fan-out of up to 32 targets per hidden neuron. Inputs (16 channels) and outputs (8 channels) are asynchronous firing events. The Xylo ASIC permits a range of clock frequencies, with a free choice of network time step dt.

Figure 3 shows the design of the digital LIF neurons on Xylo. Each neuron maintains independent 16-bit synaptic and membrane states. Up to 31 spike events can be generated by each neuron on each time-step if the threshold is exceeded multiple times. Each hidden layer neuron supports up to two synaptic input states. Each neuron has independently configurable synaptic and membrane time constants, thresholds, and biases. Synaptic and membrane state decay is simulated using a bit-shift approximation to an exponential decay (Listing 2). Time constants τ are converted to decay parameters dash, with dash = log2(τ/dt).
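The bit-shift approximation works because a right-shift by dash subtracts a fraction 2^(-dash) of the state, which is the first-order approximation of one time-step of exponential decay:

$$v \leftarrow v - (v \gg \mathrm{dash}) = v\,(1 - 2^{-\mathrm{dash}}) = v\left(1 - \frac{dt}{\tau}\right) \approx v\,e^{-dt/\tau}$$

since dash = log2(τ/dt) implies 2^(-dash) = dt/τ. The extra decrement in Listing 2 guarantees that small positive states still decay when the shift would otherwise leave the value unchanged.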
Fig. 3. Digital LIF neurons on Xylo. Each neuron maintains an integer synaptic and membrane state, with independent parameters per neuron. Exponential state decay is simulated with a bit-shift decay approach, shown in Listing 2. (Per-neuron precisions: 8-bit synaptic weights; 16-bit Isyn and Vmem states; 4-bit bit-shift decay parameters; 16-bit bias and threshold.)

def bitshift(value: int, dash: int) -> int:
    new_value = value - (value >> dash)
    if new_value == value:
        new_value -= 1
    return new_value

Listing 2. Python code demonstrating the bit-shift decay algorithm. The decay parameter is given by dash = log2(τ/dt).

Rockpool includes a bit-accurate simulation of the Xylo architecture, "XyloSim", fully integrated with the high-level Rockpool API.

VIII. MAPPING AND DEPLOYMENT TO XYLO

a) Mapping: Rockpool provides full integration with Xylo-family hardware development kits (HDKs), supporting deployment of arbitrary network architectures to Xylo. The ability of Xylo to implement recurrent connectivity within the hidden population permits arbitrary network architectures to be deployed. Feedforward, recurrent and residual SNN architectures are all equally supported for deployment. This is accomplished by embedding feedforward network weights as sub-matrices within the recurrent weights of Xylo. Figure 4 illustrates this mapping for the network architecture of Figure 1.

Fig. 2. Architecture of the digital spiking neural network inference processor "Xylo". Xylo supports 1000 digital LIF neurons, 16 input and 8 output channels. Recurrent weights with restricted fan-out of up to 32 targets per neuron can be used to map deep feed-forward networks to the Xylo architecture.

b) Quantization: Floating-point parameter values must be converted to the integer representations on Xylo. For weights and thresholds, this is accomplished by considering all input weights to a neuron, then computing a scaling factor such that the maximum absolute weight is mapped to ±128, with the threshold scaled by the same factor, then rounding parameters to the nearest integer.

The deployment process is shown in Listing 3.

# - Extract the computational graph from a trained network
graph = net.as_graph()

# - Map the computational graph to the Xylo architecture
#   (performs DRC, assignment of HW resources, linearising all parameters)
from rockpool.devices import xylo
spec = xylo.mapper(graph)

# - Quantize the specification using per-channel quantization
from rockpool.transform import quantize_methods as Q
spec_Q = Q.channel_quantize(**spec)

# - Deploy to a Xylo HDK
config = xylo.config_from_specification(**spec_Q)
mod_xylo = xylo.XyloSamna(hdk, config)

# - Perform inference on the HDK
output, _, _ = mod_xylo(inputs)

Listing 3. Mapping, quantizing and deploying a trained network to the Xylo HDK.
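The channel_quantize step in Listing 3 applies the scaling scheme described in paragraph b); a rough sketch of that idea for a single neuron (our own illustrative code, not Rockpool's implementation):

import numpy as np

def quantize_neuron(w_in, threshold):
    # Scale so the largest input-weight magnitude maps to +/-128,
    # scale the threshold identically, then round to integers
    w_in = np.asarray(w_in, dtype=float)
    scale = 128.0 / np.abs(w_in).max()
    w_q = np.round(w_in * scale).astype(int)
    thr_q = int(np.round(threshold * scale))
    return w_q, thr_q

Scaling the threshold by the same factor as the weights preserves the spiking behaviour of the neuron under quantization.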
Fig. 4. Feedforward weights mapped to Xylo architecture. The result of mapping the network in Figure 1 to Xylo is indicated, with dimensions and locations of sub-matrices within the Xylo architecture weights (sub-matrices W1–W4 are placed within the 16 × 72 input weights Win, the 72 × 72 recurrent weights Wrec and the 72 × 4 output weights Wout). Weight sub-matrices are labeled corresponding to the weight blocks in Figure 1.

IX. RESULTS

The accuracy of the trained model is given in Table I. The quantized model was deployed to a Xylo HDK, and tested on audio segments of 60 s duration. We observed a drop in accuracy of 0.8 % from the training accuracy, and a drop of 0.7 % due to model quantization.

TABLE I
AMBIENT AUDIO SCENE CLASSIFICATION ACCURACY

Four-class accuracy (training set)                  98.8 %
Validation accuracy (simulation; quantized)         98.7 %
Test accuracy (Xylo HW; 60 s samples; quantized)    98.0 %

We measured the real-time power consumption of the Xylo ASIC running at 6.25 MHz while processing test samples (Table II). Audio pre-processing ("Filter bank" in Figure 1) was performed in simulation, while SNN inference was performed on the Xylo device. We observed an average total power consumption of 542 μW while performing the audio classification task. The idle power of the SNN inference core was 219 μW, with a dynamic inference cost of 93 μW. The IO power consumption used to transfer pre-processed audio to the SNN was 230 μW. Note that in a deployed application, audio pre-processing would be performed on the device, with a concomitant reduction of IO power requirements.

Our model performs streaming classification of ambient audio with a median latency of 100 ms. Figure 5 shows the response latency distribution, from the onset of an audio sample until the first spike from the correct class output neuron.

Fig. 5. Distribution of correct classification latency. Triangle: median latency of 100 ms.

Figure 6 shows several examples of audio samples classified by the trained network.

Fig. 6. Audio classification results on audio samples for each class (columns). From top to bottom: raw audio waveforms for each class (class indicated at the top); filter bank outputs from low to high frequencies (indicated at left); hidden layer responses, grouped by synaptic time constant (indicated at left); membrane potentials Vmem for each of the readout neurons; and spiking events of the readout neurons (classes indicated at left).
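The per-inference energies reported in the next section follow directly from these continuous power measurements, since energy = power × integration time; with the median latency of 100 ms and the network time-step of 1 ms:

$$93\ \mu\mathrm{W} \times 100\ \mathrm{ms} = 9.3\ \mu\mathrm{J}, \qquad 93\ \mu\mathrm{W} \times 1\ \mathrm{ms} = 93\ \mathrm{nJ}$$

$$542\ \mu\mathrm{W} \times 100\ \mathrm{ms} = 54.2\ \mu\mathrm{J}, \qquad 542\ \mu\mathrm{W} \times 1\ \mathrm{ms} = 542\ \mathrm{nJ}$$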
TABLE II
CONTINUOUS POWER MEASUREMENTS

SNN core idle power (Xylo HDK)                               219 μW
SNN core dynamic inference power (Xylo HDK; 60 s samples)     93 μW
SNN core total inference power (Xylo HDK; 60 s samples)      312 μW
Total IO power (Xylo HDK; 60 s samples)                      230 μW
Total inference power (Xylo HDK; 60 s samples)               542 μW

TABLE III
PER-INFERENCE ENERGY MEASUREMENTS

Inference rate (med. latency)                  10 Hz
Total energy per inference (med. latency)      54.2 μJ
Total energy per network time-step             542 nJ
Dynamic energy per inference (med. latency)    9.3 μJ
Dynamic energy per network time-step           93 nJ

a) Inference energy benchmarks: Our network performs continuous non-frame-based inference, making a precise definition of "an inference" complicated. We considered two possible definitions of inference time: one based on the median latency (100 ms; Figure 5); and one based on the time taken to perform a full evaluation of the network (network time-step of 1 ms). Based on the continuous power measurements in Table II, our system exhibits per-inference dynamic energy consumption of 9.3 μJ (med. latency) and 93 nJ (network time-step). Per-inference total energy consumption was 54.2 μJ (med. latency) and 542 nJ (network time-step). These results are summarised in Table III.

Recent work deploying a keyword-spotting application to low-power CNN inference hardware achieved total energy consumption of 251 μJ per inference on the optimized Maxim MAX78000 accelerator, and 11 200 μJ per inference on a low-power microprocessor (Arm Cortex M4F) [18]. This corresponded to a continuous power consumption of 71.7 mW (MAX78000) and 12.4 mW (Cortex M4F), respectively.

Previous work benchmarking audio processing applications with regard to power consumption compared desktop-scale inference with spiking neuromorphic hardware [2]. In a keyword spotting task, dynamic energy costs ranged from 0.27 mJ to 29.8 mJ per inference, covering spike-based neuromorphic hardware (Intel Loihi) to a GPU device (Quadro K4000). This corresponded to a range of continuous power consumption from 0.081 W to 22.86 W.

Published numbers for the mixed-signal low-power audio processor Syntiant NDP120 place the device at 35 μJ to 50 μJ per inference on a keyword spotting task, with continuous power consumption of 8.2 mW to 28 mW [19].

Dynamic inference energy scales roughly linearly with the number of neurons in a network [2]. In Table IV we report a comparison between auditory processing tasks on various architectures, normalized by network size. At 0.12 μJ dynamic energy per inference per neuron, our implementation on Xylo matches the low energy of the MAX78000 CNN inference processor. However, the MAX78000 CNN core requires 792 μW to 2370 μW in inactive mode [20], compared with 219 μW for Xylo, making Xylo more energy efficient in real terms.

TABLE IV
PER-NEURON PER-INFERENCE ENERGY COMPARISON

Citation     Device          N      ETot/N     EDyn/N
[2]          Quadro K4000    512    95.9 μJ    58.0 μJ
[2]          Xeon E5-2630    512    30.7 μJ    12.4 μJ
[2]          Jetson TX1      512    23.2 μJ    10.9 μJ
[18]         Cortex M4F      2176   —          5.15 μJ
[2]          Movidius NCS    512    4.21 μJ    2.85 μJ
[2]          Loihi           512    0.73 μJ    0.53 μJ
[18]         MAX78000        2176   —          0.12 μJ
This work    Xylo-A2†        76     0.71 μJ    0.12 μJ

N: Number of neurons. ETot: Total energy per inference. EDyn: Dynamic energy per inference. † Based on med. latency of 100 ms.
X. CONCLUSION

We demonstrated a general approach for implementing audio processing applications using spiking neural networks, deployed to a low-power neuromorphic SNN inference processor, "Xylo". Our solution reaches high accuracy (98 %) with <100 spiking neurons, operating in streaming mode with low latency (med. 100 ms) and at low power (<100 μW dynamic inference power). Xylo exhibits lower idle power, lower dynamic inference power and lower energy per inference than other low-power audio processing implementations.

Our software pipeline "Rockpool" (rockpool.ai) provides a modern machine learning approach to building applications, with a convenient high-level API for defining neural network architectures. Rockpool supports the definition and training of SNNs via several automatic differentiation back-ends. Rockpool also supports quantization, mapping, and deployment to SNN inference hardware with a few lines of Python code.

Our approach supports commercial design and deployment of SNN applications, by making the configuration process of SNNs accessible to ML engineers without graduate-level training in SNNs. Here we have not demonstrated the full capabilities of Rockpool, which also supports residual spiking architectures, quantization- and hardware-aware training, training of time constants and other neuron parameters, and high extensibility for additional computational back-ends.

We anticipate that SNNs and low-power neuromorphic inference processors will contribute significantly to the current push for low-power machine learning at the edge.

REFERENCES

[1] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.
[2] P. Blouw, X. Choo, E. Hunsberger, and C. Eliasmith, "Benchmarking keyword spotting efficiency on neuromorphic hardware," in Proceedings of the 7th Annual Neuro-Inspired Computational Elements Workshop, ser. NICE '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3320288.3320304
[3] J. Deng, B. Schuller, F. Eyben, D. Schuller, Z. Zhang, H. Francois, and E. Oh, "Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration," Neural Computing and Applications, vol. 32, no. 4, pp. 1095–1107, Feb. 2020. [Online]. Available: https://doi.org/10.1007/s00521-019-04158-0
[4] W. Maass and H. Markram, "On the computational power of circuits of spiking neurons," Journal of Computer and System Sciences, vol. 69, no. 4, pp. 593–616, 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0022000004000406
[5] A. Voelker, I. Kajić, and C. Eliasmith, "Legendre memory units: Continuous-time representation in recurrent neural networks," in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/952285b9b7e7a1be5aa7849f32ffff05-Paper.pdf
[6] P. Weidel and S. Sheik, "WaveSense: Efficient temporal convolutions with spiking neural networks for keyword spotting," 2021. [Online]. Available: https://arxiv.org/abs/2111.01456
[7] J. H. Lee, T. Delbruck, and M. Pfeiffer, "Training deep spiking neural networks using backpropagation," Frontiers in Neuroscience, vol. 10, 2016. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fnins.2016.00508
[8] E. O. Neftci, H. Mostafa, and F. Zenke, "Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks," IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, 2019.
[9] D. Muir, F. Bauer, and P. Weidel, "Rockpool documentation," Mar. 2019.
[10] C.-G. Pehle and J. E. Pedersen, "Norse — A deep learning library for spiking neural networks," Jan. 2021, documentation: https://norse.ai/docs/. [Online]. Available: https://doi.org/10.5281/zenodo.4422025
[11] J. K. Eshraghian, M. Ward, E. Neftci, X. Wang, G. Lenz, G. Dwivedi, M. Bennamoun, D. S. Jeong, and W. D. Lu, "Training spiking neural networks using lessons from deep learning," 2021. [Online]. Available: https://arxiv.org/abs/2109.12894
[12] D. Dean, S. Sridharan, R. Vogt, and M. Mason, "The QUT-NOISE-TIMIT corpus for evaluation of voice activity detection algorithms," in Proceedings of the 11th Annual Conference of the International Speech Communication Association. International Speech Communication Association, 2010, pp. 3110–3113.
[13] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[14] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, "JAX: composable transformations of Python+NumPy programs," 2018. [Online]. Available: http://github.com/google/jax
[15] M. Stimberg, R. Brette, and D. F. Goodman, "Brian 2, an intuitive and efficient neural simulator," eLife, vol. 8, p. e47314, Aug. 2019.
[16] M.-O. Gewaltig and M. Diesmann, "NEST (NEural Simulation Tool)," Scholarpedia, vol. 2, no. 4, p. 1430, 2007.
[17] W. Falcon et al., "PyTorch Lightning," GitHub, 2019. [Online]. Available: https://github.com/PyTorchLightning/pytorch-lightning
[18] M. G. Ulkar and O. E. Okman, "Ultra-low power keyword spotting at the edge," 2021. [Online]. Available: https://arxiv.org/abs/2111.04988
[19] "MLPerf™ v0.7 Inference: Tiny Keyword Spotting, entries 0.7-2012 and 0.7-2013. Result verified by MLCommons Association. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information," 2021. [Online]. Available: https://mlcommons.org/en/inference-tiny-07/
[20] "MAX78000 artificial intelligence microcontroller with ultra-low power convolutional neural network accelerator," May 2021. [Online]. Available: https://www.maximintegrated.com/en/products/microcontrollers/MAX78000.html
Hannah Bos
Dr. Bos is a Senior Algorithms and Applications ML Engineer at SynSense, with a background in computational neuroscience and theoretical physics. At SynSense she designs algorithms for neuromorphic chips and supports the development of new hardware. Dr. Bos holds a Ph.D. in physics and theoretical neuroscience from RWTH Aachen, and a Masters in physics from the University of Oslo. She researched computational neuroscience at the University of Pittsburgh.

Dylan Muir
Dr. Muir is the Vice President for Global Research Operations; Director for Algorithms and Applications; and Director for Global Business Development at SynSense. Dr. Muir is a specialist in architectures for neural computation. He has published extensively in computational and experimental neuroscience. At SynSense he is responsible for the company research vision, and directs the development of neural architectures for signal processing. Dr. Muir holds a Doctor of Science (Ph.D.) from ETH Zürich, and undergraduate degrees (Masters) in Electronic Engineering and Computer Science from QUT, Australia.
An Embedding Workflow for Tiny
Neural Networks on Arm
Cortex-M0(+) Cores
Jianyu Zhao, Cecilia Carbonelli, and Wolfgang Furtner, Infineon
Technologies AG
Abstract—Neural networks are becoming increasingly widely used in always-on IoT edge devices for more precise and secure data analysis with less latency. However, due to the strict cost and power constraints for such applications, only the smallest microcontrollers, typically equipped with Arm Cortex-M0(+) cores, could be used for algorithm implementation. For a memory of only a few tens of kilobytes, the available open-source embedding tools are either too large or require too much hand-crafting effort. In this paper, we propose an end-to-end embedding workflow focused on tiny neural network deployment on Arm Cortex-M0(+) cores. The method covers all the steps, including network quantisation, C code generation and performance verification. A Python and C library was developed following the proposed method and validated on a low-cost environmental sensing application. As a result, up to 73.9% of the memory footprint could be reduced with the quantised network with only a small sacrifice in performance. While reducing the manual effort of network embedding to the minimum, the workflow remains flexible enough to allow for customisable bit shifts and different layer combinations.

Index Terms—Arm Cortex-M0(+), environmental sensing, network quantisation

Jianyu Zhao, Cecilia Carbonelli, and Wolfgang Furtner are with Infineon Technologies AG, 85579 Neubiberg, Germany. The corresponding author is J. Zhao (jianyu.zhao@infineon.com). The research leading to these results has received funding from the European Union's ECSEL Joint Undertaking under grant agreement no. 826655 - project TEMPO.

I. INTRODUCTION

IN the past decade, the availability of large amounts of data and the ever-increasing computing power have enabled the explosive development of machine learning algorithms, especially neural networks. Meanwhile, growing concerns about data privacy have drawn people's attention from centralised cloud computing to more distributed edge implementation. Consequently, there is an increasing interest in the use of neural networks for more secure and precise on-sensor data analysis and the joint optimisation of the algorithm and the dedicated edge hardware, such as microcontrollers.

Due to the cost and power constraints for such applications, the hardware platform is often so small, with a memory of only a few tens of kilobytes, that the implementation has to run on bare metal, without the support of an operating system. In this work, we propose an end-to-end embedding workflow for the quantisation of small neural networks, the generation of C source files and the accuracy evaluation. Focusing on Arm Cortex-M0(+), the workflow provides unparalleled simplicity while keeping fine-tuning possibilities.

II. BACKGROUND
Artificial neural networks have existed for decades [1], but it was not until 2012, when AlexNet won the ImageNet Large Scale Visual Recognition Challenge by using graphic processing units (GPUs) during training [2], that the algorithm gained momentum and attracted increased attention from academia and industry. Inspired by the biological neural networks in animal brains, artificial neural networks consist of multiple layers of inter-connected computing units, called "neurons", which are essentially matrix multiplications and non-linear activation functions. Such networks can model complicated non-linear mappings without humans manually choosing the parameters.

Besides image processing, with the growth of the Internet of Things (IoT) technologies, it is increasingly popular to use neural networks also for the analysis of data from other types of sensors, such as microphones [3, 4], radars [5] and even gas sensors [6]. Compared with most handcrafted algorithmic models, neural networks are capable of modelling more complex relations and thus delivering much closer fitting for many use cases. Although the training process with back-propagation is computationally expensive, the inference process can boil down to loops of simple multiplications and accumulations, making it possible to run such models on restricted platforms.

Different from conventional neural network applications, such as image and natural language processing, sensor applications are often much more constrained w.r.t. power consumption, material cost and footprint. Such devices are often always-on and most likely battery-powered, so the power limit is typically under 1 mW [7], while a mobile or desktop CPU needs Watts. Cost also plays an important role in the selection of processors. As the smallest member of the Arm Cortex-M series, Cortex-M0(+) costs less than a euro each and enables developers to achieve 32-bit performance at an 8-bit price point [8, 9], and thus has become the state of the art for IoT edge devices. As many applications already have an M0 core for simple sensor control, implementing a machine learning algorithm on the same core would create additional value without increasing the material cost. To further reduce the footprint and power consumption of the whole system, a system in a package (SiP) [10] could be used to stack the microcontroller and the sensors in one single package.

However, with opportunities, there also come challenges. Small platforms such as Arm Cortex-M0(+) do not support any operating system or file system, so the code needs to run on bare metal and the network parameters need to be saved as C source files. The memory resource is also minimal, typically 16 or 32 kB Flash and 4 or 8 kB SRAM, making most available machine learning frameworks too big for the platform. In addition, the lack of a floating-point unit (FPU) also makes floating-point arithmetic extremely expensive. However, most neural networks are trained with single-precision floating-point inputs and parameters.

Fortunately, more deep learning frameworks are being developed for different hardware platforms. Arm, for example, has extended its Common Microcontroller Software Interface Standard (CMSIS) with a dedicated neural network library called CMSIS-NN [11], targeting Cortex-M processors with specific networks. However, the overall process from quantisation to embedding still remains manual, and the developers need to spend a lot of time coming up with the quantisation solution, writing C source code with the library, and then evaluating the performance of the chosen quantisation settings. TensorFlow Lite for Microcontrollers (TF Lite Micro) [12] is another promising solution for
the Cortex-M series, as it is seamlessly integrated into the popular TensorFlow framework for model training, and one can convert the original model to Flatbuffer format and then generate embedded source code with just a few lines of Python code. As an extension of TensorFlow Lite, which targets larger platforms such as the Cortex-A series, the TF Lite Micro libraries are also written in C++11. However, the most popular language for embedded programming is C, and C++11 is still not supported by the toolchains available on many platforms. Also, the generated models are serialised as arrays of bytes, which makes it hard to interpret the model and customise implementation details. Note that although the network embedding could be to some extent automated, the other parts of the embedded firmware, e.g., pre-processing and feature extraction, still need to be implemented manually in C. Therefore, it is crucial that the network implementation is as transparent as possible.

Fig. 1. An overview of the embedding workflow proposed in this work.

III. METHODS

The training of neural network models is often carried out on computers or computer clusters with GPUs. With various machine learning frameworks available on these platforms, algorithm developers can focus on the architecture design without worrying about the underlying implementation. The trained models are often Python objects with properties describing the network architecture and the trained parameters as 32-bit floating-point values.

For the implementation of a model on an Arm Cortex-M0(+) platform, a workflow is suggested in Fig. 1, with a library pre-written in Python and C.

A. Preparation

When it comes to network quantisation, there are generally two approaches: quantisation-aware training [13] and post-training quantisation [14]. In this work, we use the latter because it is more flexible, doesn't require retraining and can achieve reasonably good performance in most cases [14].

Besides trained models, some test data with the corresponding target values are also needed for performance verification.

B. Quantisation

As the platform does not have an FPU, floating-point arithmetic can be very expensive. By converting the model parameters from 32-bit floating-point values to 16- or 8-bit integers, it is possible to reduce the memory footprint and computational cost simultaneously without sacrificing much accuracy.
Fig. 2. A signed 8-bit fixed-point representation of a fractional number, with the binary point positioned after the 4th bit.

Quantisation is generally defined as the process of constraining a large set of values (such as real numbers) to a discrete set (such as the integers) [15]. In the context of deep learning, network quantisation is essentially the process of representing model parameters with fewer bits, e.g., 8 bits, as mentioned above.

An 8-bit fixed-point representation of fractional numbers is shown in Fig. 2. In this example, 4 bits are used for the integer part, while the other 4 bits are for the fractional part. The first bit of the integer part is used as a sign bit so that it is possible to represent both positive and negative values.

The choice of the binary point position is crucial when we quantise decimals. In most cases, two factors need to be considered:
• the dynamic range of the variable, depending on the lowest and highest value that one needs to represent within the given algorithm;
• the highest tolerable quantisation error.

Ideally, after calculating the least number of bits needed to represent the full range of the model parameters, the developer could use the remaining bits for the fractional part. However, during the multiplication and accumulation operations of the neural networks, overflow or underflow can still happen when the dynamic range exceeds the static one calculated from the parameters. With the workflow described in Fig. 1, we suggest first quantising the parameters with as many fractional bits as possible and verifying whether overflow or underflow happens with the output of the C implementation before exporting the source code to the target platform. When it does happen, we can go back to quantisation and shift the binary point one bit to the right until the whole dynamic range with the test data is covered.

After quantisation, the network architecture is saved in a configuration text file, and the quantised parameters are saved in another file, one layer after another. These files are parsed in the next step to generate C source files. Examples of the configuration file and the parameter file are given in Fig. 3 and Fig. 4.

Fig. 3. Example of the configuration txt file for a classifier trained for the iris dataset. The model has four inputs, three outputs and two dense layers ("Dense 4 8 hard_sigmoid" and "Dense 8 3 softmax"), each with a different activation function.

Fig. 4. Example of the txt file for the quantised 8-bit parameters from the model mentioned in Fig. 3. The 6 and 5 at the beginning represent the numbers of bits used for the fractional part of the numbers for the first and the second dense layers, respectively. They are followed by weights and biases quantised accordingly.
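Concretely, for the format of Fig. 2 (one sign bit, three further integer bits and four fractional bits), a stored integer q encodes:

$$x = \frac{q}{2^{4}},\quad q \in \{-128, \dots, 127\} \;\Rightarrow\; x \in [-8,\ 7.9375],\ \text{with resolution } 2^{-4} = 0.0625$$

Shifting the binary point one bit to the right doubles the representable range at the cost of doubling the quantisation step.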
C. C library and code generation

Before generating any C source code for a specific model, a general-purpose library needs to be prepared to perform the basic layer tasks; its files are shown in Fig. 5. It is recommended to start with dense and Gated Recurrent Unit (GRU) layers, as they are often used in time series analyses for sensor data.

Name
----
activation_functions.c
activation_functions.h
fixed_point_operations.c
fixed_point_operations.h
nn_helpers.c
nn_helpers.h
nn_layers.c
nn_layers.h
softmax_functions.c
softmax_functions.h

Fig. 5. Header and C files from the C library.

Pre-defining the data structures for different types of layers and different variable lengths, the library also provides standardised interfaces for all the layers supported. Following the mathematical definition of the specific layers, the detailed implementation boils down to a few loops with multiply and accumulate operations. An example of the 8-bit dense layer interface is provided in Fig. 6. The other layers (16-bit dense, 8-bit GRU and 16-bit GRU) have similar designs.

typedef struct {
    const int8_t **weights;
    const int8_t *biases;
    const uint8_t activation;
    const uint8_t dim_out;
    const uint8_t shift;
} dense_param_8bit_t;

(a)

void dense_8bit(
    int16_t *input,
    const dense_param_8bit_t *params,
    uint8_t dim_in,
    const uint8_t input_shift,
    int16_t *output);

(b)

Fig. 6. (a) The C structure to save 8-bit parameters for a dense layer, including a double pointer for the 2D weight array, a pointer to the 1D bias array, an unsigned 8-bit integer indicating the type of activation function, an unsigned 8-bit integer for the size of the layer output and another for the bit shift. (b) The corresponding implementation of an 8-bit dense layer, which besides the layer parameters also takes a pointer to a 1D array of input values and modifies the array pointed to by the output pointer.

As no MAC (multiply-and-accumulate) instruction is available on Cortex-M0(+) cores, multiplications and accumulations are calculated with basic instructions and bit shifts.

With all the implementations available for each layer of the model, it is already possible for the developer to handcraft a C solution. However, copying all the parameters from txt files and reformatting them into complicated C structures with pointers and double pointers can be time-consuming and error-prone. In this paper, we propose using a standardised Python script, which reads the model parameters in the txt file following the network architecture given in the configuration file and saves them in the desired format in a C header file. The implementation is written in the corresponding C file. An example of the generated code for an 8-bit 3-hidden-layer classifier is given in Fig. 7.

Note that the model parameters are declared as constant variables to save SRAM usage.

Meanwhile, the main function is also generated to take the test input data from a txt file, feed the samples one by one to the C model function and export the output in a txt file for verification.
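Verifying the generated C code requires a bit-accurate Python mirror of the integer arithmetic above. A hypothetical sketch of such a reference for the 8-bit dense layer follows; the shift conventions shown are illustrative assumptions, not the exact library behaviour:

def dense_8bit_ref(x, weights, biases, in_frac, w_frac, out_frac, activation):
    # x: integer inputs with in_frac fractional bits
    # weights[j][i], biases[j]: int8 parameters with w_frac fractional bits
    out = []
    for w_row, b in zip(weights, biases):
        # Each product carries in_frac + w_frac fractional bits
        acc = sum(w * xi for w, xi in zip(w_row, x))
        acc += b << in_frac                    # align the bias with the products
        acc >>= in_frac + w_frac - out_frac    # rescale to out_frac fractional bits
        out.append(activation(acc))
    return out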
D. Performance evaluation in Python

The exported C output can be loaded and compared against the output provided by the original Python model. If the results are drastically different, for example, the C output saturates at a certain level, there could be an overflow or underflow. The developer may go back and shift the binary point one bit to the right, generate the code again and evaluate the model performance until no overflow or underflow happens.

typedef struct {
    const dense_param_8bit_t *layer1;
    const dense_param_8bit_t *layer2;
    const uint8_t n_features;
    const uint8_t n_out;
} model_param_t;

(a)

static const model_param_t model_params = {
    &(const dense_param_8bit_t)
    {
        (const int8_t *[])
        {
            (const int8_t []) {16, 6, -1, 7, -19, -1, 6, 13},
            (const int8_t []) {-46, -49, 41, -45, -24, 6, 73, 53},
            (const int8_t []) {68, 79, -78, 69, -65, -87, -84, -91},
            (const int8_t []) {74, 95, -80, 68, -99, -124, -71, -105}
        },
        (const int8_t []) {8, -26, 4, 23, 44, 67, -32, 43},
        ACTIVATION_SIGMOID,
        // dim_out, shift
        8, 6
    },
    …
    4, 3
};

(b)

int16_t z1[model_params->layer1->dim_out];
dense_8bit(features, model_params->layer1,
           model_params->n_features, input_shift, z1);
dense_8bit(z1, model_params->layer2,
           model_params->layer1->dim_out,
           model_params->layer1->shift, pred);

(c)

Fig. 7. Example of the generated C source code. (a) Definition of a comprehensive parameter structure, which consists of a pointer to the parameter structure of each of the hidden layers and two integers for the dimensions of the network input and output. (b) Declaration of the model parameters (only the first layer is shown due to space limitations), automatically filled with the values read from the parameter text file illustrated in Fig. 4. (c) Simple network implementation with the available layer implementations.

IV. RESULTS AND DISCUSSION

The proposed workflow has been implemented as a Python and C library and tested extensively with classification and regression networks trained with various public data sets, such as Boston housing prices, Iris and MNIST. At Infineon Technologies AG, we also applied the workflow to our in-house low-cost environmental sensors and managed to enable smart gas sensing on a PSoC® analogue coprocessor, which comes with a Cortex-M0+ core. In this section, we will introduce the application and then discuss the results of the embedded algorithm.

With the growing concerns about air quality, there is also a higher demand for fine-granular sensing for many health applications [16]. Conventionally, air quality is monitored at the city level, with large monitoring stations logging the concentrations of target gases and particles every hour. Typically using laser-based spectroscopy analysis [17], such as cavity ring-down spectroscopy, the stations are accurate but take a lot of space and are expensive to set up and maintain [18, 19]. The recorded 1- or even 8-hour average concentrations of the gas pollutants also do not necessarily reflect the air quality in a local environment. With more accurate, low-cost electrochemical sensors enabled by new sensing materials and embedded neural networks, it is possible to measure the concentration of the target gas(es) with finer granularity, e.g., in a room or outdoors on a battery-powered device, offering opportunities to develop various new applications. The hardware components of such a gas-measuring platform are illustrated in Fig. 8.

For the specific gas-sensing application, which we address in this section, a small GRU network is trained to estimate gas concentrations with historical data saved in a buffer. The architecture of the example network is provided in Fig. 9. It can exploit the time properties of the sensor signals while keeping the memory footprint within budget.
Fig. 8. A low-cost environmental sensing platform with a neural network embedded on an Infineon PSoC® analogue coprocessor in the same package.

The algorithmic model was first designed and trained on a computer cluster and then deployed on an Infineon PSoC® analogue coprocessor, which is used not only for real-time concentration estimation but also for signal measurement, heater control and communication.

The analogue coprocessor platform is equipped with an Arm Cortex-M0+ processor with 32 kB Flash and 4 kB SRAM. Following the workflow proposed in Section III, it is possible to embed the model, visualise the simulated output, flexibly adjust the position of the binary point, and thus find the best trade-off between algorithm performance and memory footprint.

Fig. 9. Network architecture. The example network has 14 extracted features, 15 timesteps, 20 hidden units, and 2 output values.

Fig. 10. Comparison of the estimated gas concentrations from the floating-point, 16-bit fixed-point and 8-bit fixed-point algorithm implementations.

The test results with a given data set are shown in Fig. 10. In the test experiment, the target gas was increased to specific concentrations and then decreased to 0 ppb. The concentration estimates of the original floating-point model are very close to the ground truth, with slight over- or underestimation for some of the samples. The results from the 16-bit fixed-point model are similar to the ones from the original, with a mean absolute error (MAE) of 3.9 ppb. A more significant quantisation error is seen with the 8-bit model. It tends to overestimate a lot, especially during the first pulse at 10 ppb, resulting in an MAE of 6.8 ppb.

The accuracies and memory footprints of the neural networks of different data types are summarised in Table I.

TABLE I
PERFORMANCE COMPARISON OF DIFFERENT NETWORK IMPLEMENTATIONS

Network Implementation    MAE* (ppb)    Flash (kB)    SRAM (kB)
Floating-point            4.5           33.5          1.8
16-bit fixed-point        3.9           16.7          1.2
8-bit fixed-point         6.8           8.4           0.8

* Mean absolute error
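The percentage savings discussed below follow from the combined Flash and SRAM totals in Table I:

$$1 - \frac{16.7 + 1.2}{33.5 + 1.8} \approx 49.3\,\%,\qquad \frac{17.9 - 9.2}{35.3} \approx 24.6\,\%,\qquad 1 - \frac{9.2}{35.3} \approx 73.9\,\%$$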
By quantising the network to 16-bit, we saved 49.3% of the memory with little change in performance; when the network is further reduced to 8-bit, another 24.6% of the memory is saved, at the price of noisier and less accurate regression results. Given the sizes of the Flash and SRAM, quantising the network to 16-bit or even 8-bit makes it possible to embed a GRU that would otherwise not fit on the platform.

V. CONCLUSION

In this paper, we propose an embedding workflow optimised for tiny neural networks and Arm Cortex-M0(+) cores. Without the overhead of developing a single universal solution for all embedded platforms, it is possible to have an end-to-end solution that covers every step, from parameter quantisation and code generation to performance evaluation, with one library written in Python and C. While reducing the manual effort of network embedding to the minimum, the workflow is flexible enough for different layer combinations and customisable bit shifts. As Cortex-M0(+) is already part of many low-power sensor systems but only used for basic sensor control, value can easily be added to such products without additional material costs.

REFERENCES

[1] R. Lippmann, "An introduction to computing with neural nets," in IEEE ASSP Magazine, vol. 4, no. 2, pp. 4-22, Apr. 1987, doi: 10.1109/MASSP.1987.1165576.
[2] A. Krizhevsky, I. Sutskever and G. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017, doi: 10.1145/3065386.
[3] K. Kumatani, J. McDonough and B. Raj, "Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors," in IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 127-140, Nov. 2012, doi: 10.1109/MSP.2012.2205285.
[4] M. Wu et al., "Monophone-Based Background Modeling for Two-Stage On-Device Wake Word Detection," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5494-5498, doi: 10.1109/ICASSP.2018.8462227.
[5] M. Scherer, M. Magno, J. Erb, P. Mayer, M. Eggimann and L. Benini, "TinyRadarNN: Combining Spatial and Temporal Convolutional Neural Networks for Embedded Gesture Recognition With Short Range Radars," in IEEE Internet of Things Journal, vol. 8, no. 13, pp. 10336-10346, 1 July 2021, doi: 10.1109/JIOT.2021.3067382.
[6] X. Zhai, A. A. S. Ali, A. Amira and F. Bensaali, "MLP Neural Network Based Gas Classification System on Zynq SoC," in IEEE Access, vol. 4, pp. 8138-8146, 2016, doi: 10.1109/ACCESS.2016.2619181.
[7] P. Warden and D. Situnayake, "Chapter 1. Introduction," in TinyML, 1st ed., Sebastopol, CA, USA: O'Reilly Media, Inc, 2019, pp. 1-3.
[8] Arm. (2022, Aug. 12). Arm CPU Cortex-M0: Tiny, 32-bit Processor. [Online]. Available: https://www.arm.com/products/silicon-ip-cpu/cortex-m/cortex-m0.
[9] Arm. (2022, Aug. 12). Arm CPU Cortex-M0+: 32-bit, Low-Power Processor at an 8-bit Cost. [Online]. Available: https://www.arm.com/products/silicon-ip-cpu/cortex-m/cortex-m0-plus.
[10] A. Fontanelli, "System-in-Package Technology: Opportunities and Challenges," 9th International Symposium on Quality Electronic Design (ISQED 2008), 2008, pp. 589-593, doi: 10.1109/ISQED.2008.4479803.
[11] L. Lai, N. Suda, V. Chandra, "CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs," arXiv preprint, 2018. Available: arXiv:1801.06601.
[12] TensorFlow. (2022, Aug. 12). TensorFlow Lite for Microcontrollers. [Online]. Available: https://www.tensorflow.org/lite/microcontrollers
[13] B. Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704-2713, doi: 10.1109/CVPR.2018.00286.
[14] R. Krishnamoorthi, "Quantizing deep convolutional networks for efficient inference: A whitepaper," arXiv preprint, 2018. Available: arXiv:1806.08342.
[15] R. M. Gray and D. L. Neuhoff, "Quantization," in IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325-2383, Oct. 1998, doi: 10.1109/18.720541.
[16] P. Kumar et al., "The rise of low-cost sensing for managing air pollution in cities," Environment International, vol. 75, 2015, pp. 199-205.
[17] W. X. Peng, K. W. D. Ledingham, A. Marshall and R. P. Singhal, "Urban air pollution monitoring: Laser-based procedure for the detection of NOx gases," Analyst, vol. 120, no. 10, 1995, pp. 2537-2542.
[18] CEN Ambient Air, "Standard Method for the Measurement of the Concentration of Nitrogen Dioxide and Nitrogen Monoxide by Chemiluminescence (EN 14211:2012)," European Committee for Standardization, 2012.
[19] CEN Ambient Air, "Standard Method for the Measurement of the Concentration of Ozone by Ultraviolet Photometry (EN 14625:2012)," European Committee for Standardization, 2012.

Jianyu Zhao was born in Sichuan, China, in 1993. She received a B.Eng. (2015) from Tianjin University, China, and her M.Sc. (2018) from the Technical University of Munich, Germany, both in electrical and information engineering.
In 2017, she joined Infineon Technologies AG, Neubiberg, Germany, to write her master thesis on analysing gas sensor data using machine learning approaches. In 2018, she joined Infineon as an algorithm and modelling engineer and continued to work on algorithm development and implementation for smart sensing technologies. She holds several international patents, and her current interest is deploying neural networks on edge devices.

Cecilia Carbonelli is a Senior Principal - System and Algorithm Architect at Infineon Technologies AG. She studied telecommunications engineering and earned a PhD in information engineering from the University of Pisa in 2005. She was a post-doc at the University of Southern California for a couple of years and then moved into industry and to Germany, joining Infineon in December 2006. She has been a system engineer over the last 15 years, working on cellular standards and modem platforms, physical layer algorithms, machine learning and AI applied to sensor products. She is the author of 35 scientific publications and numerous patents.
88 JIANYU ZHAO: AN EMBEDDING WORKFLOW FOR TINY NEURAL NETWORKS
Wolfgang Furtner is a Distinguished Engineer for SoC Architectures at Infineon Technologies AG. He received his degree in electrical engineering from the University of Applied Sciences Munich, Germany. He started his career working 4 years in a startup developing Graphics Processors (GPUs), followed by 11 years architecting graphics and video processing ICs at Philips Semiconductors. Since 2006 he has been with Infineon, heading System Concept Engineering for power and sensors. His interests are embedded architectures for artificial intelligence and machine learning, smart sensors and system architectures for quantum computing.
Edge AI Platforms for Predictive
Maintenance in Industrial
Applications
Ovidiu Vermesan and Marcello Coppola
Abstract—The use of intelligent edge-sensing devices for measuring various parameters (vibration, temperature, etc.) for industrial equipment/motors using artificial intelligence (AI), machine learning (ML), and neural networks (NNs) is being increasingly adopted in industrial predictive maintenance (PdM) applications. Developing and deploying ML algorithms and NNs on edge devices using sensors and microcontroller processing units based on Arm® Cortex®-M cores (e.g., M0, M0+, M3, M4, M7) requires robust AI-based platforms and workflows. This paper highlights the importance of adequately architecting the AI workflow for PdM in industrial applications at the edge. New platforms have recently emerged with various degrees of automation and customisation for end-to-end development and deployment of edge AI-based algorithms. An important aspect in understanding the differences between the various platforms used for edge-based AI algorithm development and deployment is diving into their architecture. For this purpose, several existing edge AI platforms and workflows that allow integration with Arm® Cortex®-M4F-based MCUs have been benchmarked. While the best predictive accuracy can often be used to select the best-performing platform, comparing platforms for AI-based industrial applications can be a difficult task involving many architectural aspects. This paper provides an assessment and comparative analysis of some of the most essential architectural elements of differentiation (AEDs) in edge AI-based industrial applications, such as analytic capabilities in the time and frequency domains, features visualisation and exploration, microcontroller emulator and live tests, support for deep learning (DL), and using the ML core of the sensor. The use case selected for the benchmarking is a classification task based on the vibration of generic rotating equipment (e.g., motors), common to many industrial manufacturing applications. The benchmarking findings indicate that no single edge-AI-based platform can outperform all other platforms across all AEDs. The platforms have different implementation approaches and exhibit different capabilities and weaknesses. Nevertheless, they all produce independently relevant results, and together they provide an overall insight into their architecture and internal workings that can benefit the PdM solution. As they evolve and interact with each other, they will also overcome their weaknesses and gain strengths. Future work is intended to enlarge the comparison by considering additional edge AI platforms and AEDs.

Index Terms—Artificial intelligence, automated machine learning, edge computing, industrial automation, predictive maintenance, validation.

This work was conducted under the framework of the ECSEL AI4DI “Artificial Intelligence for Digitising Industry” project. The project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 826060. The JU receives support from the European Union’s Horizon 2020 research. (Corresponding author: O. Vermesan.) O. Vermesan is with SINTEF Digital, Oslo, 0373 Norway (e-mail: [email protected]). M. Coppola is with STMicroelectronics, Grenoble, 38019 France ([email protected]).
I. INTRODUCTION

MAINTENANCE represents a significant portion of the expenses associated with industrial manufacturing operations. Unexpected failures in production equipment/motors can cost significantly more than scheduled maintenance for an industrial manufacturing facility, and an unspecified amount of time can be required to fix problems. By using predictive maintenance (PdM) to continuously monitor equipment states to predict potential problems that may lead to costly failures, actions can be taken to prevent failures ahead of time. In PdM, the industrial manufacturing system is under constant monitoring, and analysis is made based on data collected from various sensors. As a result, the functioning of the equipment/motors can be optimised, and the costs of repair can be reduced.

The amount of data produced by industrial production processes has increased exponentially due to the rapid development of sensing technologies. When processed and analysed, sensor data can provide valuable information and knowledge about manufacturing processes, production systems and equipment. Across different industries, equipment maintenance plays an important role in a system’s functioning and affects equipment operation time and efficiency. Hence, equipment faults need to be identified and solved to avoid shutdowns in production processes.

PdM involves leveraging sensor data to predict mechanical failures before they occur. It is a proactive maintenance technique that can predict the amount of time required to schedule maintenance activities for any system or any piece of equipment before it enters the failure mode of operation.

Industrial PdM is driven by physical models and processed data. The former employs physical laws to assess the degradation of the equipment/motors. The latter monitors various health indicators and employs methods such as ML and statistical approaches to find patterns in the data and determine operating conditions over time. There also exist combinations, such as rule-based methods, where ML or statistics are used to extract domain knowledge in the form of rules that govern model dynamics.

AI-based PdM refers to the ability of a PdM system to use knowledge and sensor data to identify, foresee and address potential issues before they lead to breakdowns in services, operations, processes, or systems. Different AI-based techniques can be explored for the implementation of industrial PdM systems: data-, ontology-, rule-, model-, sensor-, signal-, knowledge-, ML- and DL-based approaches. A survey of different AI approaches for PdM can be found in [12].

This article focuses on sensor-driven PdM using ML approaches with sensor data to predict failures over time, to minimise costs and to extend the useful life of components.

Sensor-driven PdM involves leveraging sensor data to predict failures before they occur. Rotating machine failures can be diagnosed and predicted by analysing the vibration signals derived from accelerometers connected to industrial equipment. However, sensor-driven PdM is associated with many challenges. Transforming raw sensor data into actionable insights is a complex, time-consuming, and costly process requiring a systematic engineering approach to building, deploying, and monitoring ML solutions. Many aspects and questions need to be considered, such as the following:
1) How can distinguishable classes be defined?
2) What types of data will reveal the differences between classes?
3) What signal length will reveal the differences between classes?
4) What range of sensor values will fully reflect the range of input information?

The approach used in this paper includes elements of signal processing for analysing the equipment/motor’s measured operating condition, identifying and extracting the relevant PdM features, and performing a diagnostic assessment based on prior knowledge of healthy systems.

The signal processing methods used include the time domain (standard deviation, trends, slope, and magnitudes), the frequency domain (motor current signature analysis) and the time-frequency domain (Fourier transform, wavelet transforms, instantaneous power FFT, high-resolution spectral analysis, wavelet analysis, bi-spectrum, adaptive statistical time-frequency method, etc.).
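To make these methods concrete, the following minimal sketch (in Python with NumPy, in line with the off-platform framework used later in this paper) computes a few of the listed time-domain and frequency-domain quantities for a single window of accelerometer samples. The window length, sampling rate, and function names are illustrative assumptions, not code from any of the platforms discussed.

    import numpy as np

    FS = 1667   # sampling frequency in Hz (the rate used later in this paper)
    N = 512     # samples per window (the buffer size used later in this paper)

    def window_features(x, fs=FS):
        """Illustrative time- and frequency-domain features for one window."""
        std = np.std(x)                                      # time domain: spread
        slope = np.polyfit(np.arange(len(x)) / fs, x, 1)[0]  # time domain: trend
        peak = np.max(np.abs(x))                             # time domain: magnitude
        spectrum = np.abs(np.fft.rfft(x)) / len(x)           # frequency domain (FFT)
        freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
        dominant = freqs[np.argmax(spectrum[1:]) + 1]        # skip the DC bin
        return {"std": std, "slope": slope, "peak": peak, "dominant_hz": dominant}

    # Example: a noisy 50 Hz vibration standing in for one accelerometer axis.
    t = np.arange(N) / FS
    print(window_features(np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.randn(N)))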
ML algorithms either classify the health state of the machinery or detect abnormal behaviour (e.g., any significant deviation from the normal operating condition). Bearings account for a large percentage of rotating machine failures, which can be predicted by analysing the vibration signals derived from accelerometers. The use case selected for benchmarking is a classification task based on accelerometer time-series raw data.

The paper is organised as follows. Section II provides background on edge AI processing and introduces the concepts of micro-, deep-, and meta-edge. The edge AI system architecture and the micro-edge acquisition approaches are introduced in Sections III and IV. Section V gives an overview of the integration with edge AI platforms. In Sections VI–VII, the use case experiment, evaluation results, and further discussions are investigated, followed by conclusions in Section VIII.

II. EDGE AI PROCESSING

Recent advances in edge computing, edge AI and IIoT have contributed substantially to the deployment of lightweight PdM solutions at the edge to extend the lifetime of industrial equipment.

Edge computing is a paradigm where computation is executed on the edge of networks rather than on cloud servers, thus reducing the response time, transmission bandwidth, required storage, computation resources and network connectivity dependency. Sensor data can be processed in real time, thus having the potential for PdM applications where fault diagnosis and dynamic control are time-critical.

Edge AI increases the potential of PdM applications even further by merging AI/ML and edge computing, resulting in new algorithms for specific tasks that are less computationally expensive without compromising their effectiveness.

AI increases the value of IIoT by transforming data into useful information, while IIoT increases the value of AI through connectivity and data exchange. Developments in intelligent applications and edge AI processing for industrial applications are reflected in advancements in different edge layers (micro-, deep-, meta-edge). The edge processing continuum includes the sensing, processing, and communication devices (micro-edge) close to the sensing/actuating elements, gateways and intelligent controller processing devices (deep-edge), and on-premises multi-use computing devices (meta-edge). This continuum creates a multi-level structure that can advance processing, intelligence, and connectivity capabilities [11].
Fig. 1. Industrial edge AI system architecture.

III. EDGE AI SYSTEM ARCHITECTURE

The overall PdM architecture proposed considers that the edge AI components are integrated into different edge layers. The system implements an architecture integrated at the micro-, deep-, and meta-edge levels, allowing heterogeneous wireless sensor networks to communicate with the various gateways while integrating information from heterogeneous nodes in a shared on-premises edge server application and a shared database. The network architecture allows for interfacing with the existing SCADA system and providing a secure link to external cloud applications.

The micro-edge implementation increases information acquisition from the intelligent edge sensors placed on equipment/motors and allows end users to build predictive maintenance solutions based on advanced anomaly detection algorithms.

The heterogeneous architecture provides the ability to retrieve data from Bluetooth Low Energy (BLE) and Wi-Fi wireless sensor nodes using, for example, the MQTT protocol. This architecture has several advantages related to integrating data from heterogeneous sensor nodes, providing a mechanism for their transmission to an on-premises edge computing server, and creating geographically distributed wireless sensor nodes over the production facility.
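As an illustration of this transmission path, the minimal sketch below publishes one acquired vibration buffer from a sensor node to the on-premises broker over MQTT. It assumes the Eclipse paho-mqtt Python client (1.x API); the broker address, topic layout and JSON payload structure are illustrative choices, not prescribed by the architecture above.

    import json
    import paho.mqtt.client as mqtt  # assumes the paho-mqtt 1.x client API

    BROKER = "edge-server.local"            # illustrative on-premises server name
    TOPIC = "plant/line1/motor1/vibration"  # illustrative topic layout

    client = mqtt.Client()
    client.connect(BROKER, port=1883)

    # One buffer of 3-axis accelerometer samples read from a wireless node.
    payload = {"node": "stwin-01", "fs_hz": 1667,
               "xyz_g": [[0.01, -0.02, 0.98], [0.02, -0.01, 1.01]]}
    client.publish(TOPIC, json.dumps(payload), qos=1)
    client.disconnect()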
For this paper’s use case, the edge AI components are deployed and evaluated at the micro-edge layer on the microcontroller and sensing units, while the edge AI platforms run at the meta-edge layer. The inference runs at the micro-edge on an Arm® Cortex®-M4F STM32L4R9ZIJ6 microcontroller and an ISM330DHCX MEMS sensor module containing ML capabilities.

IV. MICRO-EDGE DATA ACQUISITION APPROACH

The development of the industrial PdM is based on the design of sensors in combination with the Arm® Cortex®-M4F microcontroller. It incorporates all the required modules for data acquisition from the industrial equipment/motors, the pre-processing of the data, an interface for user interaction and a wireless network module for data transmission.

The micro-edge IIoT device (STWIN SensorTile Wireless Industrial Node) [3] used for the experiments comprises a 3-axis acceleration sensor (ISM330DHCX) and an Arm® Cortex®-M4F (STM32L4R9ZIJ6) 32-bit RISC core microcontroller. The device includes several other sensors and interfaces, but for the use case presented in this paper, the focus is mainly on these two components and the BLE, Wi-Fi, and serial interfaces as part of the communications used for implementing the use case.

The processing capabilities ensured by the microcontroller are 120 MHz, 640 KB SRAM, and 2 MB flash. The device’s architecture is presented in Fig. 2.

Fig. 2. STM32 Arm® Cortex®-M4F microcontroller architecture.

The Cortex-M4 core implements a single-precision floating-point unit (FPU) that supports all the Arm® single-precision data-processing instructions and all data types. The core integrates a memory protection unit (MPU), which enhances the application’s security, and a set of digital signal processing (DSP) instructions. The microcontroller offers a fast 12-bit ADC (5 MSa/s), two operational amplifiers, two comparators, an internal voltage reference buffer, two DAC channels, a low-power RTC, two general-purpose 32-bit timers, 16-bit low-power timers, seven general-purpose 16-bit timers and two 16-bit PWM timers dedicated to motor control. Four digital filters for external sigma-delta modulators (DFSDM) are supported by the device, and up to 24 capacitive sensing channels are available. The device features standard and advanced communication interfaces such as one DMA2D controller, three USARTs, two UARTs including one low-power UART, four I2Cs, three SPIs, two SAIs, one SDMMC, one CAN, one USB OTG full-speed, and one camera interface [9].

The interface between the micro-edge IIoT device and the different AI platforms is illustrated in Fig. 1. The ISM330DHCX is a system-in-package (SiP) that works as a combo accelerometer-gyroscope sensor, generating acceleration and angular rate output data using a high-performance 3D digital accelerometer and 3D digital gyroscope. The ISM330DHCX features a full-scale acceleration range (FSR) of ±2/±4/±8/±16 g and an angular rate range of ±125/±250/±500/±1000/±2000/±4000 dps. The accelerometer’s frequency range is from 1.6 to 6667 Hz and is selectable using a specific output data rate (ODR).

The optimal ODR and FSR configuration depends on the use case. For example, using an ODR of over 1000 Hz will give practical detail when detecting and analysing motor vibrations. Using a larger FSR will allow for more considerable variations in sensor values, but the resolution will be compromised and have less detail.
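This trade-off can be quantified with a short calculation. The sketch below derives the nominal resolution per FSR setting and the usable (Nyquist) bandwidth per ODR; the 16-bit output word width is assumed here for illustration and should be checked against the device datasheet.

    # Resolution versus full-scale range, assuming a 16-bit output word.
    for fsr_g in (2, 4, 8, 16):
        lsb_mg = 2 * fsr_g / 2**16 * 1000   # mg per least-significant bit
        print(f"FSR ±{fsr_g} g -> {lsb_mg:.3f} mg/LSB")

    # Usable analysis bandwidth for a chosen output data rate (ODR).
    odr_hz = 1667
    print(f"ODR {odr_hz} Hz -> Nyquist bandwidth {odr_hz / 2:.0f} Hz")

The numbers make the point explicit: each doubling of the FSR doubles the value of one LSB, so small vibration details are represented more coarsely.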
The ISM330DHCX MEMS sensor module contains a Machine Learning Core (MLC), a Finite State Machine (FSM), and advanced digital functions that run custom algorithms; it shares the workload with the main processor to enable system functionality while saving considerable power and memory.

The dedicated MLC provides system flexibility, allowing some algorithms usually running in the dedicated microcontroller to be transferred to the MEMS sensor memory, with the advantage of a consistent reduction in power consumption. MLC logic allows for the determination of whether a data pattern (for example, motion, pressure, temperature, magnetic data, etc.) matches a user-defined set of classes.
The ISM330DHCX MLC works on data patterns coming from accelerometers and gyro sensors, but it also allows for connecting to and processing data from external sensors (like a magnetometer) using the Sensor Hub feature. The input data can be filtered using a dedicated configurable computation block containing filters and features computed in a fixed time window defined by the user. ML processing is based on logical processing containing a series of configurable nodes characterised by if-then-else conditions, where the feature values are assessed against specified thresholds.

The ISM330DHCX’s ML-processing capability originates from its decision tree logic. A decision tree is a mathematical tool comprised of a series of configurable nodes. Each node is described by an if-then-else condition, where an input signal (represented by statistical parameters computed from the sensor data) is assessed against a threshold.

The ISM330DHCX can be programmed to run up to 8 decision trees at the same time independently. The decision trees are kept in the device and provide results in the dedicated output registers. The decision tree’s results can be read from the microcontroller at any time. Also, there is the option of providing an interrupt for every change in the results. The sensor data come from the 3-axis accelerometer.

The MLC inputs defined in the first block are used in the “Computation Block”, where filters and features can be applied. The features are statistical parameters computed from the input data (or from the filtered data) in a defined time window, which the user may select. The features computed in the computation block are used as input for the MLC’s third block. This block, called “Decision Tree”, includes the binary tree, which evaluates the statistical parameters computed from the input data. In the binary tree, the statistical parameters are compared against certain thresholds to generate the results. The decision tree’s results might also be filtered by an optional filter called “meta-classifier”. The MLC’s results are the decision tree’s results, which include the optional meta-classifier (i.e., a filter that uses internal counters to evaluate the decision tree’s outputs) [10]. Decision tree generation for the ISM330DHCX in IIoT edge devices can be achieved using a dedicated tool available as an interface extension (Unico GUI) or using external tools, such as MATLAB, Python, RapidMiner, or Weka.
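The flow through the three MLC blocks can be mimicked in a few lines of Python, as sketched below: features are computed over a fixed window, an if-then-else tree compares them against thresholds, and a counter-based meta-classifier confirms a class only after several consecutive identical outputs. The window length, features, thresholds and counter depth are illustrative; an actual configuration would be generated with a tool such as the Unico GUI.

    import numpy as np

    def features(window):
        # Computation block: statistics over a fixed, user-defined time window.
        return {"variance": np.var(window), "peak_to_peak": np.ptp(window)}

    def decision_tree(f):
        # Decision-tree block: if-then-else nodes with illustrative thresholds.
        if f["peak_to_peak"] > 0.5:
            return "high_vibration" if f["variance"] > 0.02 else "imbalance"
        return "normal"

    class MetaClassifier:
        # Optional meta-classifier: an internal counter filters the tree output,
        # confirming a label only after n consecutive identical results.
        def __init__(self, n=4):
            self.n, self.last, self.count, self.confirmed = n, None, 0, "unknown"

        def update(self, label):
            self.count = self.count + 1 if label == self.last else 1
            self.last = label
            if self.count >= self.n:
                self.confirmed = label
            return self.confirmed

    meta = MetaClassifier(n=4)
    for window in 0.1 * np.random.randn(10, 512):   # ten dummy 512-sample windows
        print(meta.update(decision_tree(features(window))))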
V. INTEGRATION WITH EDGE AI PLATFORMS

Deploying AI on the edge in the view of PdM is gaining increasing interest in various industries, but this deployment is prone to challenges in determining the optimal solution architecture for a given application. To a certain extent, these challenges are due to a lack of robust frameworks that can integrate multivariate data from heterogeneous sensors, partly because validation of the design is difficult, which means the predictions derived from these sensors’ data may not be trustworthy.

The PdM solution architecture is the foundation for industrial applications because it helps in adapting IT implementations to specific business needs and describing their functional and non-functional requirements and implementation stages. It comprises many components and processes that draw guidance from various architectural viewpoints.

Recently, new edge intelligence platforms have emerged that may be good candidates for a PdM solution architecture. These AI platforms must be able to leverage sensor data, build AI solutions for constrained environments and deploy them on edge devices. However, they use various components – including data gathering, model training and inference implementation – to implement core tasks, and they include various degrees of automation, control, and transparency in their processes.

Thus, investigating the architecture of edge AI platforms is an important step in understanding the differences among them. The approach adopted in this work for benchmarking was to define a set of architectural elements that make the differences among the essential platforms explicit. These elements affect these platforms’ performance (accuracy and loss) and other relevant qualities.

In this paper, three existing edge AI platforms for integrating AI mechanisms within MCUs have been employed:
1) Qeexo AutoML (Qeexo) - an automated ML platform for Arm® Cortex®-M0-to-M4-class processors,
2) NanoEdge™ AI (NEAI) Studio,
3) Edge Impulse (EI).

The edge AI platforms are used to deploy ML algorithms and run pre-trained Artificial Neural Networks (ANNs) with on-chip self-training on an ultra-low power Arm® Cortex®-M4F STM32L4R9ZIJ6 microcontroller running at 120 MHz and the MLC of the ISM330DHCX MEMS sensor module.

VI. MAINTENANCE CLASSIFICATION CASE STUDY

In this paper, classification has been used as a case study. The following steps were used for implementation:
1) Data acquisition includes the acquisition sensor setup for each specific edge AI platform, data retrieval over wired or wireless connectivity, data labelling and storage of vibration data in a specific format.
2) Condition monitoring includes data cleaning/denoising, data visualisation, pre-processing, feature extraction and feature engineering.
3) Anomaly detection and classification include ML of the system’s behaviour, unsupervised learning at the edge for anomaly detection and supervised learning to classify states.
4) Model emulation and model deployment at the inference on the target device incorporate the remaining life prediction models, overall efficiency optimisation and operational system integration.

A. Use Case Design

The classes were defined based on conditions (motor speeds) and sub-conditions (malfunctions). The motor was operating at fixed speeds: minimum, medium and maximum. A malfunction of the motor (motor fan trepidations) was added on all classes to obtain new ones. The classes defined are:
1) Class A: the motor is running at minimum speed
2) Class B: the motor is running at half of the speed
3) Class C: the motor is running at maximum speed
4) Class D: the motor is running at minimum speed with an excess load producing a centrifugal force
5) Class E: the motor is running at half of the speed with an excess load producing a centrifugal force
6) Class F: the motor is running at maximum speed with an excess load producing a centrifugal force
The classes were defined with the following goals in mind:
1) The motor behaviour and the classification problem being solved with ML/DL were studied in depth.
2) Classes should be distinguishable for easier classification.
3) Data sets should be class-balanced.
4) Data sets should be properly split (training, validation, test).
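Goals 3 and 4 can be enforced mechanically when the datasets are prepared. The sketch below uses scikit-learn's stratified splitting so that all six classes keep equal proportions across the training, validation and test sets; the 70/15/15 ratio and the dummy feature matrix are illustrative assumptions.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # X: one row of extracted features per signal; y: class labels A-F.
    X = np.random.randn(600, 12)            # dummy feature matrix
    y = np.repeat(list("ABCDEF"), 100)      # class-balanced labels

    # First carve out 70% for training, stratified per class ...
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.70, stratify=y, random_state=0)

    # ... then split the remainder evenly into validation and test sets.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=0.50, stratify=y_rest, random_state=0)

    for name, labels in (("train", y_train), ("val", y_val), ("test", y_test)):
        print(name, dict(zip(*np.unique(labels, return_counts=True))))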
B. Experimental Setup

The design and implementation steps and the experimental setup of the end-to-end (E2E) classification application consist of three primary workflows, including the NEAI Studio, EI and Qeexo platforms.

All three workflows are E2E development platforms for embedded AI/ML, supporting developers through all phases of their projects, including collecting, pre-processing, and leveraging sensor data; generating and training the models; and deploying to highly constrained environments.

The experimental process occurred as follows: vibration signals for each class were collected live from the micro-edge IIoT device mounted on the motor, using NEAI and Qeexo, as EI currently does not support direct connection to the STWIN SensorTile. The recorded signals were then analysed in both the time and frequency domains, and then filtered, exported, and converted to prepare the cross-platform datasets. Several AI models were then built with each of the three flows, using the accelerometer spectral features, and optimised for performance and resource constraints. In the end, the three models were deployed for MCUs, and classification inferences were run to note real-time performance. During all stages, comparative analysis has been performed, both in terms of the processes involved and the results produced.

All components and processes across the E2E workflow can benefit from automation, parameter control, transparency, integration, and other facilities. For instance, automation removes tedious, iterative, and time-consuming work, thus making AI/ML more accessible to applications. Nevertheless, when embedding AI/ML at the edge, some of these components and processes present particularities and, therefore, have been selected as elements of differentiation in the comparative analysis of the three platforms. These are:
1) Data acquisition methodology
2) Pre-analysis in the time and frequency domain
3) Visualisation and exploration of features
4) Support for various ML algorithms and ANNs
5) Emulation/live test capabilities
6) Sensor MLC capabilities
7) Deployment automation

C. Data Acquisition Methodology

Data acquisition parameters, such as buffer size, signal length and sampling frequency, vary depending on project constraints. In general, the higher the sampling frequency, the higher the chances of capturing important features in the signal snapshot; however, higher sampling affects the constraints, such as memory and power consumption.

The motor’s vibration patterns were analysed at different frequencies, and 1667 Hz was identified as the most suitable sampling frequency to capture the patterns.

As the task was to compare the platforms, the goals were to apply configurations that were as similar as possible and to ensure that the same signals were used for processing, thereby yielding comparable results. This proved to be a difficult task, because the sampling frequency, buffer size and signal length are interrelated, and their interrelationships also vary, depending on what each platform defines as fixed or configurable. Thus, the sampling method is a differentiating element.
In the case of NEAI, the signal length was determined based on buffer size and sampling frequency, namely approximately 300 milliseconds (= 512/1667) for a buffer size of 512 samples on each axis, in total 1536 values per signal.

In the case of Qeexo, the length of the exported signal is fixed at 50 milliseconds, and the buffer size is approximately 83 samples: 50/(1000/1667). The buffer size can also vary, due to sample rate tolerance.

Another differentiating element is the duration of a live collection session, which, for practical reasons, was defined as being about 30 seconds. With Qeexo, the collection duration was easily set to the exact number of seconds, while with NEAI, the duration was determined by the number of signals per collection (in this case, set at 100 to obtain sessions that were 30 seconds long).
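The relationships between sampling frequency, buffer size and signal length reduce to the simple arithmetic below, reproduced for the two configurations just described; the helper names are illustrative.

    FS = 1667  # sampling frequency in Hz

    def signal_length_ms(buffer_size, fs=FS):
        """Signal length implied by a buffer size at sampling rate fs."""
        return 1000 * buffer_size / fs

    def buffer_size(signal_ms, fs=FS):
        """Samples per axis needed to cover signal_ms milliseconds at fs."""
        return signal_ms / (1000 / fs)

    print(signal_length_ms(512))   # NEAI: 512 samples -> ~307 ms (about 300 ms)
    print(buffer_size(50))         # Qeexo: 50 ms -> ~83 samples per axis
    print(3 * 512)                 # NEAI: 1536 values per 3-axis signal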
All three platforms enabled the acquisition of data both by directly collecting them from sensors and by uploading them from files. The acquisition methodology consisted of three steps:
1) Collect sensor data using platforms that enable direct connection to the microcontroller (e.g., NEAI and Qeexo).
2) Export the recorded data and convert it to a cross-platform format.
3) Generate cross-platform data sets (split for training, validation, and test) and then upload.

The acquisition methodology is depicted in Fig. 3. The data recorded and exported using NEAI had to be converted into formats accepted by EI and Qeexo. Similarly, the data recorded and exported using Qeexo had to be converted into formats accepted by NEAI and EI. EI acted as a neutral platform, processing no data collected on its own, only data from the other two platforms.

Fig. 3. Cross-platform conversion of sensor data (collected vs uploaded).

Regarding the collection process, measurements could be collected directly from the sensor IoT devices within their GUIs, but the processes differed. In the case of NEAI, acquiring signals was straightforward when using the datalogger functionality, requiring only an SD card. For the experimental use case, a simple logger application was built in C to read the raw accelerometer sensor data and log it directly onto the serial port. In the case of Qeexo, a data collection application was deployed to the MCU with the press of a button, after the ODR (Hz) and FSR (g) parameters have been configured.
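On the host side, the counterpart of such a logger can be a few lines of Python. The sketch below reads comma-separated x,y,z samples from the serial port and appends them to a CSV file; it assumes the pySerial package and a logger that prints one sample per line, and the port name, baud rate and file name are illustrative.

    import serial  # pySerial

    # Illustrative port and baud rate; adjust to the actual board setup.
    with serial.Serial("/dev/ttyUSB0", baudrate=115200, timeout=1) as port, \
            open("vibration_log.csv", "w") as log:
        log.write("ax,ay,az\n")
        for _ in range(100 * 512):          # e.g. 100 signals of 512 samples
            line = port.readline().decode(errors="ignore").strip()
            if line.count(",") == 2:        # expect "x,y,z" per sample
                log.write(line + "\n")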
D. Pre-analysis in Time and Frequency Domain

All three platforms have pre-analysis and pre-processing capabilities in both the time domain and the frequency domain. This is shown in Fig. 4, Fig. 5 and Fig. 6, which visualise the frequency plots for the F class along the accelerometer z-axis.

In the case of NEAI, the graph’s x-axis in the temporal plots corresponds to the number of samples (512), while the y-values represent the mean value of each sample across all 100 signals, their min-max values, and their standard deviation.
In the frequency domain, the x/y axes show frequency/amplitude. With Qeexo, the frequency domain is a frequency spectrogram.

In the frequency domain, it is often easier to differentiate the individual classes, thus ensuring a high accuracy score. In NEAI, switching to the frequency domain is done by activating the filter settings in the signal pre-processing step. This allows for setting the filter frequency parameters, so that only the frequencies that represent the characteristics of the motor vibration are kept and the rest are attenuated. Filtering techniques also help eliminate both the high-frequency noise that interferes with the vibration signal and the frequencies for transitions between states, which would normally yield an unknown class.

In EI, the Spectral Analysis processing block extracts the frequency and power characteristics of a signal over time. This component must be included explicitly in the design; otherwise, the raw accelerometer data will be used without pre-processing.

Fig. 4. F class along the 3-axis (NEAI).

Fig. 5. F class along the 3-axis over 50 ms window (EI).

Fig. 6. F class along the y-axis (Qeexo) in time and spectrogram.

E. Visualisation and Exploration of Features

Standard classification algorithms are not well suited to work on time series; therefore, raw time series data is sampled using various windowing techniques and aggregated to generate new features, such as mean, standard deviation, RMS, median, number of peaks, skewness, kurtosis, and energy.

If the data exhibit clustering, retaining only the features with the highest variance can make the clusters more visible, with clear distinctions between them (i.e., the clusters do not overlap, and some space exists between them). Making the clusters more visible can be achieved by applying dimensionality reduction and visualisation techniques, such as principal component analysis (PCA), t-distributed stochastic neighbourhood embedding (t-SNE) and uniform manifold approximation and projection (UMAP). While PCA preserves the data’s global structure at the expense of the local structure (which might get lost), t-SNE aims to embed the points from higher dimensions to lower dimensions by fitting the data into a distribution such that the neighbourhood of the points is preserved. UMAP provides a balance between local and global structures. Both t-SNE and UMAP first build a graph that represents the data in high-dimensional space, then reconstruct the graph in a lower-dimensional space while retaining the structure.
PCA is employed by Qeexo, while UMAP is employed by both Qeexo and EI. EI also generates a list of features sorted by importance, indicating how vital each feature is for each class compared to all other classes. RMS, power density and peak values of vibration along the 3-axis proved to be the most important features for determining the class. Based on this information, the less critical features can be removed from the training set to simplify the model and make it more manageable while maintaining its relevance and performance.

Most AI-based platforms have embedded visualisation and feature-exploring functionality in the form of tables or graphs, where dimension reduction algorithms are employed to generate projections from the higher-dimensional feature space onto two dimensions. The features are coloured based on labels; thus, it is easy to observe if distinct labels are separated based on the available features.

The generated two-dimensional plots proved to be a good indicator for how well the motor classifier will perform. The fact that the features were visually clustered was a good indication that the model could be trained to perform the classification with high accuracy.

When the features overlapped to a large degree, as was the case in the early development stages, it was difficult for the trained models to distinguish the classes. The problem was addressed by collecting more signals and increasing the size of the sampling signal to better capture the signal patterns; sometimes, it was necessary to redefine the classes.

Notably, the lack of separability of the classes in the two-dimensional plots generated by the platforms does not necessarily imply a lack of separability between classes in the higher-dimensional space. This matter was further explored off-platform (Python framework), with various features and dimension reduction techniques along both 2D and 3D plots.
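This off-platform exploration can be reproduced with a few lines of scikit-learn. The sketch below first reduces the feature matrix with PCA and then embeds the retained components with t-SNE, mirroring the PCA+t-SNE combination shown in Fig. 7 and Fig. 8; the component counts, perplexity and dummy data are illustrative settings.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # X: one row of aggregated window features per signal; y: class labels A-F.
    X = np.random.randn(600, 12)
    y = np.repeat(list("ABCDEF"), 100)

    # An initial PCA reduction keeps most of the global structure ...
    X_pca = PCA(n_components=10).fit_transform(X)

    # ... then t-SNE captures non-linear neighbourhood structure in 2D (or 3D).
    X_2d = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(X_pca)

    print(X_2d.shape)   # (600, 2): one point per signal, ready to plot by label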
As shown in the scatter plot in Fig. 7, PCA with two components did not provide sufficient insights into the different classes. On the other hand, t-SNE is known for its ability to capture non-linear dependencies, thus combining the two provides better results. In general, applying PCA in conjunction with either t-SNE or UMAP provides an initial reduction in the dataset’s dimensionality, while still preserving most of the important data structure.

Fig. 7. Dimension reduction with two components (PC1 and PC2) with PCA (left) and PCA+t-SNE (right).

As shown in Fig. 8, applying dimensionality reduction with three components provided even better results.

Fig. 8. Dimension reduction with three components (PC1, PC2 and PC3) with PCA (left) and PCA+t-SNE (right).
F. Support for Various ML/DL Algorithms and Automation

All three platforms offer an automated mechanism for generating the AI model architecture and training, but its implementation differs considerably, mainly due to the type of algorithms employed.

NEAI employs several ML model implementations, such as K-Means, Random Forest (RF) and Support Vector Machine (SVM), optimised for embedded systems, each having its own hyper-parameters. The benchmarking of the classification task involved searching through these algorithms and producing combinations of three elements: pre-processing, ML algorithm, and hyper-parameters. These combinations, called libraries, are displayed as a ranked list that is evaluated based on accuracy, confidence, and memory usage, with the highest accuracy being on top. Accuracy denotes the library’s ability to correctly classify each signal into the right class, whereas confidence reflects the library’s ability to distinguish between classes. Learning is fixed at library generation based on the data provided for each class. Selecting the suitable model occurs after training.

In contrast to NEAI, EI and Qeexo allow explicit selection and configuration of the algorithms of choice before training. EI supports ML algorithms as well as NN architectures following the full pipeline that is commonly found in traditional deep learning frameworks, such as Keras and TensorFlow, thereby making it more flexible. EI employs K-means algorithms for anomaly detection, and Keras for classification and regression tasks.

Qeexo also supports ML algorithms, such as Gradient Boosting Machine (GBM), XGBoost (XGB), RF, Logistic Regression (LR), Decision Trees (DT) and Gaussian Naive Bayes (GNB), as well as Artificial Neural Network (ANN), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures, but is fully automated, and the selected and trained model can be deployed to target embedded hardware with just one click, without any coding required. Solutions are optimised to have ultra-low latency, power consumption and a small memory footprint.

Fig. 9. Benchmarking with NEAI. All correctly classified (green dots).

Fig. 10. Benchmarking with EI. Confusion matrix and data explorer. Correctly classified (green dots) and misclassified (red dots).

The benchmarking allowed for comparison of the performance of different model architectures generated by the same platform, but also of the performance of the same model across platforms. The confusion matrix has been a useful tool. Snapshots from the benchmarking with NEAI, EI and Qeexo are presented in Fig. 9, Fig. 10 and Fig. 11, respectively, showing comparable results on the validation data.
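Such a comparison can also be reproduced off-platform. The sketch below trains two of the classifier families named above on the same feature matrix and prints a confusion matrix for each, using scikit-learn; the dummy data and default hyper-parameters are illustrative simplifications of the platforms' tuned searches.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix, accuracy_score
    from sklearn.model_selection import train_test_split

    X = np.random.randn(600, 12)            # stand-in feature matrix
    y = np.repeat(list("ABCDEF"), 100)      # the six motor classes
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for model in (SVC(), RandomForestClassifier(random_state=0)):
        pred = model.fit(X_tr, y_tr).predict(X_te)
        print(type(model).__name__,
              f"accuracy={accuracy_score(y_te, pred):.2f}")
        print(confusion_matrix(y_te, pred, labels=list("ABCDEF")))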
Fig. 11. Benchmarking with Qeexo. Confusion matrix for SVM model (up). Overview of trained models (down).

G. Emulation, Live Test Capabilities

Testing the trained model performance using the testing dataset was done to analyse how well the model performs against unseen data prior to deployment on the target MCU. To ensure unbiased evaluation of the model’s effectiveness, the test samples were not used directly or indirectly during training.

Testing was also performed live, while changing motor speeds and triggering shaft disturbances, to switch between classes and cover all six classes. Live testing ensured unbiased evaluation of the model’s effectiveness with completely new signals, not seen before.

The results of the tests are depicted in Figures 12, 13, and 14, showing that all three classifier systems properly reproduced and detected all classes with reasonable certainty percentages, and these are comparable. This was the case both when testing on test datasets and when testing live, although the latter had some particularities. When testing live with NEAI, the trained model runs on a microcontroller emulator. In the case of Qeexo, live means both new data and the trained model running in the microcontroller.

Fig. 12. Live streaming evaluation of SVM trained model using NEAI Emulator (left) and trained ANN model with Qeexo (right).

Fig. 13. EI ANN model testing with test datasets (Arm® Cortex®-M4 MCU STM32L4R9 not yet supported).

Fig. 14. Qeexo testing with collected test datasets (SVM model).

H. Sensor MLC Capabilities

The MLC capabilities were only explored with Qeexo. Deploying in the MLC implies limited availability of algorithms and features. Experiments revealed that the only available algorithm (a decision tree) performed well on the collected data, demonstrating sufficient predictive value to distinguish the classes with good accuracy.

I. Deployment Automation

Model deployment was the final stage in the E2E workflow, consisting of flashing the compiled binary into the MCU, running on-device inferences, and displaying the expected output.
All three workflows were successfully deployed, and the models were able to accurately recognize motor states in real time. However, the deployment steps exhibited some particularities with the three platforms. Because deployment occurred in the context of micro-edge embedded systems, the steps depended on the hardware and software used.

In the case of NEAI, the trained model was deployed in the form of a static library (libneai.a), an AI header file (NanoEdgeAI.h) containing function and variable definitions, and a knowledge header file (knowledge.h) containing the model’s knowledge.

For the EI deployment, the CMSIS-PACK for STM32 packaged all signal processing blocks, configuration and learning blocks into a single library (.pack file), which was then added to the STM32 project using the CubeMX package manager.

In the case of Qeexo, the trained model was deployed to the MCU with the press of a button, without any coding required. The model was compiled, built, and flashed onto the MCU target automatically.

All three platforms generated optimised code for inclusion in the microcontroller application. However, while the integration was accomplished within the integrated development environment (IDE) for NEAI and EI, for Qeexo, it was done automatically with the press of a button in the GUI.

This led to the conclusion that the three platforms could co-exist in the development portfolio of a PdM solution provider and be employed depending on the application or stage of development. NEAI could be used for rapid prototyping, as it enables automatic exploration using various traditional ML algorithms and emulation to test performance without actual deployment to the target MCU. Qeexo could be used to automate repetitive deployments to the target MCU. Finally, EI could be used for specialised NN architectures in which more control over the model’s parameters and hyperparameters is desired (using Keras).

VII. RESULTS AND DISCUSSION

Although automation compensates for many of the drawbacks of manual processes, it is important to verify that the E2E workflow is easily repeatable and reproducible, i.e., to validate the design. One of the main findings during experimentation is that, in addition to the primary design flow with the AI platform of choice, at least one complementary design flow with another AI platform is necessary for validation. This complementary flow will have a similar purpose during development as the parallel flow has in operation; namely, to avoid a single point of failure.

For instance, employing the same ML algorithm in both the primary and complementary flows would hardly produce the same results, as the implementations are different. However, if they produce similar results, it will increase the level of confidence in the design. Extremely different results will provide valuable insight into what validation actions are needed. In this way, the complementary flow may compensate for eventual flaws in the primary flow.
VIII. CONCLUSION

This paper presented three design workflows with different edge AI platforms and embedded inference engines used for the same classification use case, highlighting the different aspects of model design, development, and deployment of AI-based industrial applications approached by edge AI multi-platform solutions as part of a holistic flow framework for industrial PdM applications. The three platforms are Qeexo AutoML, NanoEdge™ AI Studio and Edge Impulse, used for integrating edge AI mechanisms at the micro-edge within MCUs such as the Arm® Cortex®-M4F STM32L4R9ZIJ6 microcontroller and the ISM330DHCX MEMS sensor module containing an MLC, an FSM, and advanced digital functions that are used to run custom algorithms on the inertial measurement unit (IMU) and share the workload from the central processor, enabling system functionality while significantly saving power and memory footprint.

Each platform was benchmarked by assessing some of the most critical AEDs using different ML algorithms and NN implementations for an industrial PdM application to classify the data from motor vibrations measured with a 3-axis accelerometer IIoT device. The results were benchmarked by considering specific edge AI flow frameworks to analyse the results.

Transforming raw sensor data into actionable insights is complex, time-consuming, and costly, requiring a systematic engineering approach to building, deploying, and monitoring ML solutions.

The benchmarking findings indicate that no single edge-AI-based platform can outperform all other platforms across all AEDs. The platforms have different implementation approaches and exhibit different capabilities and weaknesses. Nevertheless, they all produce independently relevant results, and together they provide overall insight into their architecture and internal workings that can benefit the PdM solution. As they evolve and interact with each other, they will also overcome their weaknesses and gain strengths. Future work is intended to enlarge the comparison by considering additional edge AI platforms and AEDs.

REFERENCES

[1] J. Lin, W.-M. Chen, Y. Lin, C. Gan, S. Han et al., “MCUNet: Tiny deep learning on IoT devices,” Advances in Neural Information Processing Systems, vol. 33, pp. 11711-11722, 2020. Available: https://proceedings.neurips.cc/paper/2020/file/86c51678350f656dcc7f490a43946ee5-Paper.pdf
[2] J. Lin, W.-M. Chen, H. Cai, C. Gan, and S. Han, “MCUNetV2: Memory-efficient patch-based inference for tiny deep learning,” in Annual Conference on Neural Information Processing Systems (NeurIPS), 2021. Available: https://arxiv.org/pdf/2110.15352.pdf
[3] STMicroelectronics, 2022, STWIN SensorTile Wireless Industrial Node development kit and reference design for industrial IoT applications. Available: https://www.st.com/en/evaluation-tools/steval-stwinkt1b.html
[4] STMicroelectronics, 2022, X-Cube-AI - AI expansion pack for STM32CubeMX. Available: https://www.st.com/en/embedded-software/x-cube-ai.html
[5] STMicroelectronics, 2022, NanoEdge AI Studio V3. Available: https://stm32ai.st.com/
[6] Edge Impulse, Edge Impulse Development Platform. Available: https://www.edgeimpulse.com/
[7] Qeexo, Qeexo AutoML Platform. Available: https://qeexo.com/
[8] STMicroelectronics, 2022, NanoEdge AI Studio V3. Available: https://stm32ai.st.com/
[9] STMicroelectronics, 2022, STM32L4R9ZI. Available: https://www.st.com/en/microcontrollers-microprocessors/stm32l4r9zi.html
[10] STMicroelectronics, 2022, ISM330DHCX. Available: https://www.st.com/en/mems-and-sensors/ism330dhcx.html
[11] O. Vermesan and M. Coppola, “Embedded Edge Intelligent Processing for End-To-End Predictive Maintenance in Industrial Applications,” in Industrial Artificial Intelligence Technologies and Applications, O. Vermesan, F. Wotawa, M. Diaz Nava and B. Debaillie, eds. Gistrup, Denmark: River Publishers, 2022, Ch. 12, pp. 157-175. Available: https://www.riverpublishers.com/pdf/ebook/chapter/RP_9788770227902C12.pdf
[12] Y. Ran, X. Zhou, P. Lin, Y. Wen, and R. Deng, “A Survey of Predictive Maintenance: Systems, Purposes and Approaches,” IEEE Communications Surveys & Tutorials, vol. 20, pp. 1-36. Available: https://arxiv.org/pdf/1912.07383.pdf

Ovidiu Vermesan is Chief Scientist at SINTEF Digital, Oslo, Norway, where he is involved in applied research on edge AI and future edge autonomous intelligent systems, smart systems integration, wireless sensing devices and networks, microelectronics design of integrated systems (analogue and mixed-signal), and IIoT. He holds a PhD in Microelectronics and a Master of International Business (MIB). His applied research activities focus on wireless and smart sensing technologies, advancing edge AI processing, embedded electronics, and the convergence of these technologies, applying the developments to applications such as autonomous systems, green mobility, energy, buildings, electric connected, autonomous, and shared (ECAS) vehicles, and industrial manufacturing. He is currently the technical co-coordinator of the ECSEL JU “Artificial Intelligence for Digitising Industry” (AI4DI) project.

Marcello Coppola is Technical Director at STMicroelectronics. He has more than 25 years of industry experience with an extended network within the research community and major funding agencies, with the primary focus on the development of breakthrough technologies. He is a technology innovator, with the ability to accurately predict technology trends. He is involved in many European research projects targeting Industrial IoT and IoT, cyber-physical systems, Smart Agriculture, AI, low power, security, 5G, and design technologies for multicore and many-core System-on-Chip, with particular emphasis on architecture and network-on-chip. He has published more than 50 scientific publications and holds over 26 issued patents. He authored chapters in 12 edited print books, and he is one of the main authors of the “Design of Cost-Efficient Interconnect Processing Units: Spidergon STNoC” book. Until 2018, he was part of the IEEE Computing Now Journal technical editorial board. He contributed to the security chapter of the Strategic Research Agenda (SRA) to set the scene on R&D on Embedded Intelligent Systems in Europe. He has served in different roles at numerous top international conferences and workshops. He graduated in Computer Science from the University of Pisa, Italy, in 1992.
Food Ingredients Recognition
Through Multi-label Learning
Rameez Ismail and Zhaorui Yuan
Abstract—The ability to recognize various food-items in a generic food plate is a key determinant for an automated diet assessment system. This study motivates the need for automated diet assessment and proposes a framework to achieve this. Within this framework, we focus on one of the core functionalities: to visually recognize various ingredients. To this end, we employed a deep multi-label learning approach and evaluated several state-of-the-art neural networks for their ability to detect an arbitrary number of ingredients in a dish image. The models evaluated in this work follow a definite meta-structure, consisting of an encoder and a decoder component. Two distinct decoding schemes, one based on global average pooling and the other on an attention mechanism, are evaluated and benchmarked, whereas for encoding, several well-known architectures, including DenseNet, EfficientNet, MobileNet, Inception and Xception, were employed. We present promising preliminary results for deep learning-based ingredients detection, using a challenging dataset, Nutrition5k, and establish a strong baseline for future explorations.

Index Terms—Automated diet assessment, deep learning, visual ingredients recognition, machine learning, multi-label learning.

Rameez Ismail and Zhaorui Yuan are with the team of embedded intelligence and analytics at Philips Research, High Tech Campus, 5656, Eindhoven, The Netherlands. Corresponding author: R. Ismail ([email protected]). The research leading to these results has received funding from the European Union’s ECSEL Joint Undertaking under grant agreement n◦ 826655 - project TEMPO.

I. INTRODUCTION

WHAT we eat and drink has a huge impact on our daily lives and our wellbeing. It is well established by now that a healthy and well-balanced diet is paramount to one’s health. Daily diet varies considerably around the world; however, people in almost all regions of the world could benefit from rebalancing their diets by eating optimal amounts of various nutrients [1]. A suboptimal diet does not only carry risks for physical health but might also reduce cognitive capabilities. Understandably, numerous studies, for example [2][3][4], implicate dietary factors in the cause and prevention of diseases such as cancer, coronary heart disease, diabetes, birth defects, and cataracts. Similarly, findings from nutritional psychiatry indicate a multitude of consequences and implications between what we eat and how we feel and ultimately behave [5]. On a population scale, the eating habits and broader diet pattern of a population are shown to be correlated with its health outcomes and longevity [6][7]. For example, a diet that adheres to traditional Mediterranean diet principles represents a healthy pattern and is positively associated with longevity in the Mediterranean blue zones [8]. Although such guidelines and general principles are quite useful, individual metabolic responses vary substantially even if individuals are eating identical meals [9].
This calls for a personalized nutrition guidance approach that goes beyond general health recommendations. Personalized or precision guidance could additionally enable better management of various nutrient-related health conditions and diseases [10], for example, celiac disease, bowel syndrome, phenylketonuria, food allergies and diabetes. However, to make the proposition viable and to drive impact at scale, the guidance system must be automated.

A major difficulty in realizing automated diet guidance is to capture the eating habits of a user accurately and effortlessly. Besides providing an introspection ability to the consumers, capturing dietary data from several participants is fundamental for understanding the diet and disease relationship. An accurate assessment of dietary intake enables the investigators to make progress in diet-related studies by discovering patterns in the context of nutritional epidemiology [11][12]. Diet assessment is usually performed using one of three basic methods: meal recall, food diaries, or food frequency questionnaires. However, all these methods are based on self-reporting and therefore are time-consuming, tiresome, and prone to misreporting errors. With recent advances in sensor technology and Machine Learning (ML) algorithms, automated food assessment has gained ground. Some of the technologies being explored in this direction include digital biomarkers and ingestible sensors [13][14], imaging sensors coupled with advanced AI/ML algorithms [15][16], as well as smart scales and eating utensils [17]. Digital cameras are readily available and can perform a detailed analysis of the food being consumed and the eating behaviors of the subjects, but this relies heavily on the visual recognition and interpretation technology.

This work focuses on evaluating the performance of various state-of-the-art ML algorithms to detect ingredients or food-items present in a digital food image. Recognition of the actual ingredients is one of the two essential ML functions towards our envisioned food assessment system depicted in Fig. 1; the other is portion estimation. When all the major ingredients and their portion sizes are identified, a descriptive nutrition log can be created by using a simple lookup service that searches across various nutrition fact databases. There are several food databases in the public domain which can be utilized for this purpose, such as the USDA Food and Nutrient Database for Dietary Studies (FNDDS) and the Dutch Food Composition Database (NEVO).
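The lookup step itself can be deliberately simple. The sketch below illustrates the idea with a small in-memory table standing in for a database such as FNDDS or NEVO; the ingredient names, per-100-g values and function name are illustrative.

    # Illustrative per-100 g nutrient table standing in for FNDDS/NEVO lookups.
    NUTRIENTS = {
        "rice":    {"kcal": 130, "protein_g": 2.7, "carbs_g": 28.0, "fat_g": 0.3},
        "chicken": {"kcal": 165, "protein_g": 31.0, "carbs_g": 0.0, "fat_g": 3.6},
    }

    def nutrition_log(detections):
        """detections: list of (ingredient, estimated portion in grams)."""
        log = {}
        for ingredient, grams in detections:
            facts = NUTRIENTS[ingredient]
            log[ingredient] = {k: v * grams / 100 for k, v in facts.items()}
        return log

    # Output of the two ML functions: recognized ingredients + portion estimates.
    print(nutrition_log([("rice", 180), ("chicken", 120)]))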
In contrast to building separate ingredient recognition and portion estimation functions, some studies attempted an end-to-end nutrition estimation scheme [18], which estimates the total amount of macro-nutrients directly from the food images.

Fig. 1. The envisioned approach towards automated nutrition assessment. The assessment block consists of two ML functions, one for detecting the ingredients and the other to estimate the quantity, and a nutrients database look-up service.
Such end-to-end schemes, however, suffer from descriptive inadequacy: the actual food-item composition and the respective portions remain undiscovered, which are important details for dietary research and for determining the eating habits of the users. Additionally, as the two-stage scheme depicted in Fig. 1 is more transparent, it is also expected to score better on the scale of user engagement and trustworthiness.

In summary, automated nutrition assessment has the potential to empower communities, on the one hand, while enabling investigators to make progress in nutritional epidemiology, on the other. The main contributions of this work are: (a) we motivate and investigate the problem of multi-item food recognition and propose a multi-label learning framework to achieve this, and (b) we evaluate and benchmark the proposed framework using various state-of-the-art deep learning modules and a challenging dataset, Nutrition5k [18], comprised of real-world food images with ingredient-level annotations.
II. R ELATED W ORK Both pure convolutional [20][25][18]
Automated food assessment through and attention-based networks [27] have
visual recognition has attracted a decent shown promising results for visual food
research interest in recent years. Existing recognition. Nevertheless, the latter can
efforts include deep learning based single contextualize and dynamically prioritize
dish recognition [19][20], contextualized the information. This suggests that the at
food recognition (for example using GPS tention networks can extract much richer
data which exploits the knowledge about descriptions from the images compared
location and data from the restaurants), to pure convolutional networks. How
multiple-food items detection [21][22] ever, it also implies that the learning
and real-time recognition [23]. This sec process is comparatively difficult as the
tion mainly reviews the previous works attention-based networks have more de
on multiple food-item recognition using grees of freedom. Moreover, since at
dish images as well as multi-label learn tention mechanism is a novel paradigm
ing in general. within deep learning, its potential is not
Some of the early approaches, towards yet fully explored for the task of ingredi
multiple food-item recognition, relied ents and multiple food-item recognition.
on a separate candidate regions genera Dish images in real-life usually
tion, using either simple circles or De- contain multiple food items and
Dish images in real life usually contain multiple food items and ingredients, which makes it worthwhile to detect multiple labels independently for each input image. It also provides the added benefit that any food image can be inferred by the model, even if the actual dish is novel to the model, provided that all its constituents were taught to the network during the training phase. The independent prediction of various labels for a single image is a more general classification problem, commonly known as multi-label classification or extreme classification. In such a regime of classification [28][29], the goal is to predict the existence of a multitude of classes, thus forcing the model and training scheme to be efficient and scalable. The networks usually contain a stem and a head: the stem outputs a spatial embedding, while the head transforms this spatial embedding into prediction logits. The most employed head is a Global Average Pooling (GAP) layer, which computes a scalar global embedding for each spatial embedding, followed by a fully connected read-out layer. In the case of multi-label learning, each neuron in the read-out layer is a binary classifier representing a specific class.

Recently, several works have proposed novel attention-based heads for multi-label classification. For example, [30] proposed an approach that leverages Transformer [31] decoders to query the existence of various class labels. Similarly, [32] introduced a class-specific residual attention (CSRA) module that generates class-specific features using a simple spatial attention score and combines them with global average pooling, achieving state-of-the-art results. ML-Decoder [33] is yet another classification head based on the transformer-decoder architecture, which outperforms on various multi-label classification tasks. One of the critical challenges for multi-label image classification is to learn the inter-label relationships and dependencies. The transformer-based methods employ an attention mechanism with the goal of implicitly modelling the co-occurrence probability of the labels through relative weighing of the latent embeddings. Correspondingly, graph convolutional networks for multi-label learning [34] are another emerging theme, capturing the complex inter-label associations by modelling them as graph nodes.

In this work, we constrained our analysis to networks that follow the definite meta-structure shown in Fig. 2: the encoder block extracts discriminative features from the image, thus projecting the image onto a latent space, while the decoder predicts the presence of the various labels using that latent feature space.

Fig. 2. The meta-structure used for the evaluated models.
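The stem-and-head structure described above can be made concrete with a short sketch. The chapter does not name its software stack; the snippet below assumes TensorFlow/Keras, whose bundled application models happen to match the encoder names used later in this chapter, and the class count is a placeholder.

    import tensorflow as tf

    K_CLASSES = 255  # placeholder label count (Nutrition5k has 250+ ingredients)

    # Stem: a convolutional backbone producing a spatial embedding (H x W x D).
    stem = tf.keras.applications.MobileNetV2(
        include_top=False, weights="imagenet", input_shape=(448, 448, 3))

    inputs = tf.keras.Input(shape=(448, 448, 3))
    spatial = stem(inputs)                                       # (H, W, D)

    # Head: GAP collapses each of the D spatial maps to a scalar, giving a
    # global embedding, which a fully connected read-out layer maps to K logits.
    pooled = tf.keras.layers.GlobalAveragePooling2D()(spatial)   # (D,)
    logits = tf.keras.layers.Dense(K_CLASSES)(pooled)            # (K,)

    # Each read-out neuron is an independent binary classifier: a sigmoid turns
    # its logit into the probability that the corresponding label is present.
    probs = tf.keras.layers.Activation("sigmoid")(logits)
    model = tf.keras.Model(inputs, probs)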
III. METHODOLOGY
In this section, we first describe the various deep learning modules used to build the classification networks benchmarked in this work. We then explore the dataset, Nutrition5k, used to train and evaluate the networks, followed by an explanation of the loss function utilized for training and of the evaluation metrics.

A. Multi-Label Classification Networks
All classification networks benchmarked in this work are composed of an encoder and a decoder module, as depicted in Fig. 2. The encoder performs feature extraction on the input image, while the decoder exploits these features to construct an ingredient breakdown. The decoder can be as simple as a learnable linear projection that maps the latent space onto a layer of read-out neurons, but it can also have a more complex construction: for example, a versatile attention block that exploits inter-label relationships through a querying mechanism, or an autoregressive decoder that contextualizes its mapping based on its previous predictions. Next, we describe the implementations of these blocks benchmarked in this work:

1) Encoder: An encoder block can be modelled by any network that projects the image onto a latent space, $F \in \mathbb{R}^{H \times W \times D}$. For example, both convolutional neural networks, which progressively construct richer representations using convolution operations, and transformer networks, which transform images into a small set of embedding tokens, are good candidates for the encoder block. In this work, however, we limited our analysis to convolutional networks for encoding. The employed encoding networks include MobileNet [35][36], DenseNet [37], the Inception network [38], the Xception network [40], and EfficientNet [39]. These are among the top-performing networks for single-label classification, as evaluated on the large-scale visual recognition challenge dataset ImageNet [41].

2) Decoder: We used two distinct approaches to model the decoder block: the first employs a simple GAP-based decoder, while the second exploits an attention-based group-decoding scheme, ML-Decoder [33]. In a GAP-based decoder, we first project the features, $F \in \mathbb{R}^{H \times W \times D}$, onto a one-dimensional vector, $z \in \mathbb{R}^{D}$, by averaging over the spatial dimensions. Afterwards, a dense layer transforms this vector through a learnable linear projection into $K$ logits, where $K$ is the total number of classes. The ML-Decoder is adapted from the transformer decoder [31] with the goal of meeting the computational demands of multi-label learning, as the computational cost of a transformer decoder grows quadratically with the number of classes. The proposed modifications include the removal of the self-attention block, which reduces the quadratic dependency of the decoder on the number of query tokens to a linear one, and the introduction of a group-decoding scheme, in which a single query token is responsible for decoding multiple ingredients, thus limiting the required number of query tokens.

This strict decoupling of the roles enables the reuse of the latent space across various image recognition tasks. Fig. 3 provides a visual description of the two decoding schemes evaluated for the task of multi-label learning.

Fig. 3. The decoding mechanisms evaluated in this work: (a) the GAP-based decoder averages the latent features before projecting them onto a read-out layer, and (b) the ML-Decoder computes a response, a linear combination of value vectors, for each external query vector; the responses are then projected onto the read-out layer through a group decoding scheme [33].
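A heavily simplified sketch of the ML-Decoder-style alternative is given below, under the same TensorFlow/Keras assumption. It keeps the two ideas highlighted in the text, removal of self-attention and group decoding, but it is not the reference implementation of [33]; in particular, the group read-out here shares one projection across all groups, whereas [33] uses group-specific projections, and all dimensions are illustrative.

    import tensorflow as tf

    class GroupDecoder(tf.keras.layers.Layer):
        """Simplified ML-Decoder-style head: cross-attention only and a
        group-decoding read-out, loosely after [33]. A sketch, not the
        reference implementation."""

        def __init__(self, num_classes, group_size=4, dim=512, heads=8):
            super().__init__()
            self.num_classes = num_classes
            self.num_groups = -(-num_classes // group_size)  # ceil(K / group_size)
            # One learnable query token per *group* of classes, so the number
            # of queries grows linearly, not quadratically, with the labels.
            self.queries = self.add_weight(
                name="group_queries", shape=(1, self.num_groups, dim),
                initializer="random_normal", trainable=True)
            self.cross_attention = tf.keras.layers.MultiHeadAttention(
                num_heads=heads, key_dim=dim // heads)
            self.ffn = tf.keras.layers.Dense(dim, activation="relu")
            # Shared group read-out: each query response yields `group_size`
            # class logits ([33] instead uses group-specific projections).
            self.readout = tf.keras.layers.Dense(group_size)

        def call(self, spatial_tokens):
            # spatial_tokens: (batch, H*W, dim), a flattened encoder feature map.
            batch = tf.shape(spatial_tokens)[0]
            queries = tf.tile(self.queries, [batch, 1, 1])
            # No self-attention: each query attends directly to the image tokens.
            response = self.cross_attention(
                query=queries, value=spatial_tokens, key=spatial_tokens)
            response = self.ffn(response)        # (batch, groups, dim)
            logits = self.readout(response)      # (batch, groups, group_size)
            logits = tf.reshape(logits, [batch, -1])
            return logits[:, :self.num_classes]  # drop padding logits

A GroupDecoder(num_classes) could then replace the GAP head of the earlier sketch, once the encoder's H x W x D feature map is reshaped into (H*W, D) tokens.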
B. Dataset
We employed Nutrition5k [18] to train and evaluate the various ML models explored in this work. Nutrition5k is a relatively diverse dataset of mixed food dishes with ingredient-level annotations. The dataset contains 20k short videos generated from roughly 5000 unique dishes composed of over 250 different ingredients. The dataset also contains the portion estimates of each ingredient, which makes it possible to perform supervised learning for portion estimation on top of ingredient recognition. The original dataset was collected using video cameras mounted on the sides of a custom platform to capture each dish from various angles, with a digital scale embedded under the food plates to weigh the dish contents.

The dataset exploits an incremental scanning approach, where a plate is scanned at various time instances with a growing cardinality of ingredients. This results in a rich and diverse set of images with varying portion sizes, ingredients, and dish complexity. All incremental scans of a single dish were organized into a single split to avoid any potential leak of information between the train and test splits. Besides the side-angle videos, the dataset also contains a smaller subset of images collected from a top-mounted RGB-D camera, which provides depth images from a top-down view.

We constrained our analysis to the RGB images collected from the side-angle video cameras. The data is organized into 'train' and 'test' subsets, following the original train-test splitting of the video files. The final dataset is obtained by extracting a single image from each video file: we simply extracted the first frame from each video and downsized it appropriately for processing by the neural networks. This results in an image dataset comprising around 15K training images and 2.5K test images, all resized to a resolution of 448x448 pixels. Fig. 4 depicts a few examples of dish images from the test split of the dataset.

Fig. 4. Examples of dish images from the test dataset.
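The frame-extraction step above is a one-liner in most video libraries; the sketch below uses OpenCV, an assumption on our part, as the chapter names no tooling.

    import cv2  # assumed tooling; any video library with frame access works

    def first_frame(video_path, size=(448, 448)):
        """Extract the first frame of a side-angle video and resize it to the
        448x448 input resolution used in this work."""
        capture = cv2.VideoCapture(video_path)
        ok, frame = capture.read()
        capture.release()
        if not ok:
            raise IOError("could not read a frame from " + video_path)
        return cv2.resize(frame, size)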
C. Loss Formulation
Within multi-label learning, a common loss formulation is to consider the final read-out neurons as a series of mutually independent binary classifiers and to compute their aggregated binary cross-entropy score. This aggregated score is then used as the minimization objective for multi-label learning.

Given $K$ labels, each neuron in the read-out layer outputs a score $z_k$, where $k \in K$, that exclusively represents a single label. Each neuron is then independently activated by a sigmoid function, $\sigma(z_k)$, which converts the logits into probability scores. Denoting $y_k$ as the ground truth for the $k$th class, the total classification loss $L_{total}$ is obtained by aggregating a binary loss over all labels:

    $L_{total} = \sum_{k=1}^{K} L(\sigma(z_k), y_k)$    (1)

A general form of the binary cross-entropy loss per label, $L$, is given by:

    $L = -y L_{+} - (1 - y) L_{-}$    (2)

where $L_{+}$ and $L_{-}$ are the positive and negative parts of the loss, respectively, normally evaluated as $L_{+} = \log(p)$ and $L_{-} = \log(1 - p)$. These parts can additionally be weighted to focus asymmetrically on the presence or absence of a label. A form of this asymmetric weighing for multi-label learning is proposed in [42], which introduces independent focusing parameters, $\gamma_{+}$ and $\gamma_{-}$, for the positive and negative loss parts. This updates the computations as given by (3):

    $L_{+} = (1 - p)^{\gamma_{+}} \log(p), \qquad L_{-} = p^{\gamma_{-}} \log(1 - p)$    (3)

We used the asymmetric loss, given by (2) and (3), for evaluating all models in this work. The focusing parameters are set to $\gamma_{-} = 5$ and $\gamma_{+} = 0$, which effectively down-weighs the loss contribution from easy negatives, allowing the network to focus on harder samples as the training progresses.
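Equations (1)-(3) translate directly into code. The following TensorFlow sketch is our own rendering of the formulas, not the implementation used by the authors or by [42], with the focusing parameters set as in this work.

    import tensorflow as tf

    def asymmetric_loss(y_true, logits, gamma_pos=0.0, gamma_neg=5.0, eps=1e-8):
        """Asymmetric binary loss of Eqs. (1)-(3).

        y_true: multi-hot ground truth, shape (batch, K).
        logits: raw read-out scores z_k, shape (batch, K).
        """
        p = tf.sigmoid(logits)  # sigma(z_k): logits -> probabilities
        # Eq. (3): gamma+ = 0 leaves L+ = log(p), while gamma- = 5
        # down-weighs the contribution of easy negatives (p close to 0).
        loss_pos = tf.pow(1.0 - p, gamma_pos) * tf.math.log(p + eps)
        loss_neg = tf.pow(p, gamma_neg) * tf.math.log(1.0 - p + eps)
        # Eq. (2) per label, then Eq. (1): aggregate over the K labels.
        per_label = -y_true * loss_pos - (1.0 - y_true) * loss_neg
        return tf.reduce_sum(per_label, axis=-1)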
D. Evaluation Metrics
We employed the mean Average Precision (mAP) metric to evaluate the performance of all ingredient recognition networks, which is common practice in multi-label learning tasks. The average precision of a single prediction is computed using (4); a micro-average over all predictions and samples then yields the final score:

    $AP = \sum_{n} (R_n - R_{n-1}) P_n$    (4)

where $R_n$ and $P_n$ are the recall and precision at the $n$th threshold. We used a total of 500 thresholds, equally distributed on the interval $[0, 1]$, to calculate the individual recall and precision scores.
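The metric is likewise straightforward to reproduce. The NumPy sketch below implements (4) for a flat array of predictions, sweeping 500 equally spaced thresholds as described above; it is a plausible reading of the procedure rather than the authors' evaluation code, and the final mAP would micro-average this quantity over all predictions and samples.

    import numpy as np

    def average_precision(y_true, y_score, num_thresholds=500):
        """AP per Eq. (4): sum of (R_n - R_{n-1}) * P_n over thresholds.

        y_true: binary labels; y_score: sigmoid scores; both 1-D arrays.
        """
        recalls, precisions = [], []
        for t in np.linspace(0.0, 1.0, num_thresholds):
            predicted = y_score >= t
            tp = np.sum(predicted & (y_true == 1))
            fp = np.sum(predicted & (y_true == 0))
            fn = np.sum(~predicted & (y_true == 1))
            recalls.append(tp / max(tp + fn, 1))
            precisions.append(tp / max(tp + fp, 1))
        # Reverse so recall is non-decreasing, then accumulate Eq. (4).
        recalls = np.array(recalls[::-1])
        precisions = np.array(precisions[::-1])
        return float(np.sum(np.diff(recalls, prepend=0.0) * precisions))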
IV. RESULTS
Table I outlines the best mAP scores obtained from the several models evaluated in this work, along with the compute specifications of these models. The compute cost is parameterized by the number of atomic operations and the number of parameters in each model. The reported mean average precision scores are obtained by evaluating the models on the test split of the dataset.

In the first set of experiments, we explored global average pooling (GAP) based decoding, with different classification networks acting as the encoder. In the second set, we selected four encoders to couple with the ML-Decoder, based on their performance in the previous experiments and on their compute specifications. The resulting models were then trained with a configuration identical to that of the previous experiments.
Fig. 5. Prediction results for dish images selected from the test dataset for illustrative purposes. The detection confidence for each ingredient is shown; green represents a true positive detection, while red stands for a false positive.

TABLE I
PERFORMANCE EVALUATION AND BENCHMARK

Encoder          Decoder      mAP (%)   Operations (GFLOPs)   Parameters (MParams)
DenseNet121      GAP          75.6      22.6                   7.6
DenseNet169      GAP          76.5      26.9                  13.6
DenseNet201      GAP          74.7      34.3                  19.4
MobileNetV1      GAP          72.4       4.5                   3.8
MobileNetV2      GAP          74.5       2.4                   2.9
EfficientNetB0   GAP          73.3       3.1                   4.8
EfficientNetB1   GAP          71.9       4.6                   7.3
EfficientNetB2   GAP          72.2       5.3                   8.6
EfficientNetB3   GAP          71.5       7.8                  11.6
EfficientNetB4   GAP          72.7      12.9                  18.7
Xception         GAP          78.4      36.6                  22.0
InceptionV3      GAP          72.8      26.4                  23.0
MobileNetV2      ML-Decoder   68.0       3.4                   9.3
EfficientNetB0   ML-Decoder   73.4       4.2                  11.1
DenseNet169      ML-Decoder   67.9      28.0                  19.9
Xception         ML-Decoder   70.0      38.0                  28.5

Among the models with the GAP decoder, the Xception network achieved the best mAP score of 78.4%, but it is also one of the most computationally expensive networks: it requires 36.6 billion multiply-add operations per image inference. Among the smaller networks, MobileNetV2 is exceptionally performant, with 74.5% mAP while using only 2.4 billion operations per inference. Table I shows that not all computationally sophisticated models perform equally well. For example, the EfficientNetB1-B4 and InceptionV3 models did not show improvements despite their high compute and memory complexity. Upon inspection of the training and validation accuracy curves, we concluded that these larger models were overfitting the training data and that their performance is capped by the availability of data. However, not all larger models suffer equally from the overfitting problem; for instance, DenseNet and Xception are comparatively bigger and yet generalize quite well on the test dataset. Furthermore, we observed that the EfficientNet models require more careful calibration to reach their full potential. However, to create a fair benchmark, we did not attempt model-specific hyper-parameter tuning sweeps: all networks were trained using the standard training configuration described in the appendix. All models with the GAP decoder scored more than 70% mAP, showing the effectiveness of this simple decoding scheme.

The ML-Decoder did not perform well for the task of ingredient detection. As claimed in [33], the decoder is meant to be a drop-in replacement for a GAP-based decoder in a multi-label learning setting. However, our results do not support this claim. Although some networks perform equally well when coupled with
the ML-Decoder, this does not hold in general for all encoder models. For example, when coupled with EfficientNetB0, the ML-Decoder performs slightly better than the GAP decoder, while when coupled with the other encoders the performance deteriorates. The large performance difference between the models with the MobileNetV2 and EfficientNetB0 encoders highlights this weakness of the ML-Decoder, as both encoders are of similar compute sophistication. A root-cause analysis of this issue and an ablation study of the decoder are considered for future work. Fig. 5 shows a few test images annotated with predictions from a trained model composed of a DenseNet121 encoder and a GAP decoder.

V. CONCLUSION
In this work, we presented a framework for automated diet assessment and demonstrated encouraging results for image-based ingredient recognition using deep learning. The Xception encoder, coupled with a global average pooling based decoding scheme, performs best, with a mean average precision score of 78.4%; it therefore creates a strong baseline for future work. The attention-based decoding, contrary to what we conjectured, is unable to reliably extract the inter-label relationships and does not improve the overall performance. An ablation study of the attention-based decoder will be attempted in future work to overcome its current limitations.

APPENDIX
For all training jobs, we used the Adam optimizer with an initial learning rate of 1e-3, following a learning schedule with a linear warmup of 200 iterations and a cosine decay to 1e-6 afterwards. The batch size for all trainings was set to 32, and the models were trained for a maximum of 50 epochs. The encoder of each model was initialized with ImageNet weights. For the ML-Decoder, we report outcomes from its default multi-label configuration [33]. Furthermore, we applied a simple data augmentation pipeline composed of random horizontal and vertical flips, random image translations, and random crops with padding.
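For reproducibility, the appendix settings can be sketched as follows, again assuming TensorFlow/Keras as in the earlier sketches. The schedule class, the total step count, the translation factors, and the padding amount are our own placeholder choices; the appendix fixes only the learning rates, warmup length, batch size, and epoch budget.

    import math
    import tensorflow as tf

    class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
        """Linear warmup to 1e-3 over 200 iterations, then cosine decay to 1e-6."""

        def __init__(self, peak=1e-3, floor=1e-6, warmup=200, total_steps=23000):
            super().__init__()
            # total_steps is a placeholder (~15K images / batch 32 * 50 epochs).
            self.peak, self.floor = peak, floor
            self.warmup, self.total_steps = warmup, total_steps

        def __call__(self, step):
            step = tf.cast(step, tf.float32)
            warm = self.peak * step / self.warmup
            frac = (step - self.warmup) / (self.total_steps - self.warmup)
            cosine = self.floor + 0.5 * (self.peak - self.floor) * (
                1.0 + tf.cos(math.pi * tf.minimum(frac, 1.0)))
            return tf.where(step < self.warmup, warm, cosine)

    optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupCosine())

    # Augmentation pipeline: random flips, translations, and padded random
    # crops (the 16-pixel padding and 0.1 translation factors are assumptions).
    augment = tf.keras.Sequential([
        tf.keras.layers.RandomFlip("horizontal_and_vertical"),
        tf.keras.layers.RandomTranslation(0.1, 0.1),
        tf.keras.layers.ZeroPadding2D(16),
        tf.keras.layers.RandomCrop(448, 448),
    ])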
REFERENCES
[1] Afshin, Ashkan, et al. "Health effects of dietary risks in 195 countries, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017." The Lancet, vol. 393, no. 10184, pp. 1958–1972, 2019.
[2] Grosso, Giuseppe, et al. "Possible role of diet in cancer: systematic review and multiple meta-analyses of dietary patterns, lifestyle factors, and cancer risk." Nutrition Reviews, vol. 75, no. 6, pp. 405–419, 2017.
[3] Feskens, Edith J., et al. "Dietary factors determining diabetes and impaired glucose tolerance: a 20-year follow-up of the Finnish and Dutch cohorts of the Seven Countries Study." Diabetes Care, vol. 18, no. 8, pp. 1104–1112, 1995.
[4] Mente, Andrew, et al. "A systematic review of the evidence supporting a causal link between dietary factors and coronary heart disease." Archives of Internal Medicine, vol. 169, no. 7, pp. 659–669, 2009.
[5] Selhub, Eva. "Nutritional psychiatry: Your brain on food." Harvard Health Blog, vol. 16, no. 11, 2015.
[6] Keys, Ancel, et al. "The diet and 15-year death rate in the seven countries study." American Journal of Epidemiology, vol. 124, no. 6, pp. 903–915, 1986.
[7] Sofi, Francesco, et al. "Adherence to Mediterranean diet and health status: meta-analysis." BMJ, vol. 337, Sep. 2008.
[8] Pes, Giovanni Mario, et al. "Evolution of the dietary patterns across nutrition transition in the Sardinian longevity blue zone and association with health indicators in the oldest old." Nutrients, vol. 13, no. 5, pp. 1495, 2021.
[9] Berry, Sarah E., et al. "Human postprandial responses to food and potential for precision nutrition." Nature Medicine, vol. 26, no. 6, pp. 964–973, 2020.
[10] de Toro-Martín, Juan, et al. "Precision nutrition: a review of personalized nutritional approaches for the prevention and management of metabolic syndrome." Nutrients, vol. 9, no. 8, pp. 913, 2017.
[11] Boeing, H. "Nutritional epidemiology: New perspectives for understanding the diet-disease relationship." European Journal of Clinical Nutrition, vol. 67, no. 5, 2013.
[12] Thompson, Frances E., et al. "Need for technological innovation in dietary assessment." Journal of the American Dietetic Association, vol. 110, no. 1, pp. 48, 2010.
[13] Naska, Androniki, Areti Lagiou, and Pagona Lagiou. "Dietary assessment methods in epidemiological research: current state of the art and future prospects." F1000Research, vol. 6, June 2017.
[14] Mimee, Mark, et al. "An ingestible bacterial-electronic system to monitor gastrointestinal health." Science, vol. 360, no. 6391, pp. 915–918, 2018.
[15] Sahoo, Doyen, et al. "FoodAI: Food image recognition via deep learning for smart food logging." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2260–2268, 2019.
[16] Zuppinger, Claire, et al. "Performance of the Digital Dietary Assessment Tool MyFoodRepo." Nutrients, vol. 14, no. 3, pp. 635, 2022.
[17] Zhang, Zuoyi, et al. "A smart utensil for detecting food pick-up gesture and amount while eating." Proceedings of the 11th Augmented Human International Conference, 2020.
[18] Thames, Quin, et al. "Nutrition5k: Towards automatic nutritional understanding of generic food." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[19] Yanai, Keiji, and Yoshiyuki Kawano. "Food image recognition using deep convolutional network with pre-training and fine-tuning." International Conference on Multimedia & Expo Workshops (ICMEW), IEEE, 2015.
[20] Mezgec, Simon, and Barbara Koroušić Seljak. "NutriNet: a deep learning food and drink image recognition system for dietary assessment." Nutrients, vol. 9, no. 7, pp. 657, 2017.
[21] Shroff, Geeta, Asim Smailagic, and Daniel P. Siewiorek. "Wearable context-aware food recognition for calorie monitoring." International Symposium on Wearable Computers, IEEE, 2008.
[22] Herranz, Luis, Shuqiang Jiang, and Ruihan Xu. "Modeling restaurant context for food recognition." IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 430–440, 2016.
[23] Ravì, Daniele, Benny Lo, and Guang-Zhong Yang. "Real-time food intake classification and energy expenditure estimation on a mobile device." IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks, pp. 1–6, 2015.
[24] Matsuda, Yuji, Hajime Hoashi, and Keiji Yanai. "Recognition of multiple-food images by detecting candidate regions." International Conference on Multimedia and Expo, IEEE, pp. 25–30, 2012.
[25] Meyers, Austin, et al. "Im2Calories: towards an automated mobile vision food diary." Proceedings of the IEEE International Conference on Computer Vision, pp. 1233–1241, 2015.
[26] Chen, Liang-Chieh, et al. "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[27] Min, Weiqing, et al. "Large scale visual food recognition." arXiv preprint arXiv:2103.16107 [cs], March 2021.
[28] Ridnik, Tal, et al. "TResNet: High performance GPU-dedicated architecture." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1400–1409, 2021.
[29] Cheng, Xing, et al. "MlTr: Multi-label Classification with Transformer." arXiv preprint arXiv:2106.06195, 2021.
[30] Liu, Shilong, et al. "Query2Label: A simple transformer way to multi-label classification." arXiv preprint arXiv:2107.10834, 2021.
[31] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems, vol. 30, 2017.
[32] Zhu, Ke, and Jianxin Wu. "Residual attention: A simple but effective method for multi-label recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[33] Ridnik, Tal, et al. "ML-Decoder: Scalable and versatile classification head." arXiv preprint arXiv:2111.12933, 2021.
[34] Chen, Zhao-Min, et al. "Multi-label image recognition with graph convolutional networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[35] Howard, Andrew G., et al. "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861, 2017.
[36] Sandler, Mark, et al. "MobileNetV2: Inverted residuals and linear bottlenecks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
[37] Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
[38] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
[39] Tan, Mingxing, and Quoc Le. "EfficientNet: Rethinking model scaling for convolutional neural networks." International Conference on Machine Learning, PMLR, pp. 6105–6114, 2019.
[40] Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258, 2017.
[41] Russakovsky, Olga, et al. "ImageNet large scale visual recognition challenge." International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[42] Ridnik, Tal, et al. "Asymmetric loss for multi-label classification." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

Rameez Ismail received a master in robotics (2015) from Technical Univ. Dortmund, Germany, and an Engineering Doctorate in systems design (2017) from Eindhoven Univ. of Technology, Netherlands. He started his career with a brief stint (2016–2018) at NXP Semiconductors and thereafter joined Philips Research as a scientist with the ambition to accelerate the digital transformation of healthcare. His research interests include Artificial Intelligence (AI), biomimetic systems, and High-Performance Computing (HPC). Over the years, he has successfully applied advanced AI research to various healthcare use cases within the scope of personal health and medical imaging.
Index

A
AI accelerators 15, 25
Analog processing circuits 15, 25
Application specific integrated circuits 15, 25
Arm Cortex-M0(+) 79
Artificial intelligence 45, 89
Audio processing 69
Automated diet assessment 105
Automated machine learning 89

C
Convolutional Neural Network Optimization 1

D
Deep Learning 1, 69, 105

E
Edge AI accelerator 61
Edge computing 89
Environmental sensing 79

F
FeTFT 37

H
Hafnium Oxide 37
Hardware trust 45
High-energy efficiency 61
Hyper-parameter Tuning 1

I
IGZO 37
Image processing 45
Industrial automation 89

L
Low-Energy Consumption 1
Low-latency 61

M
Machine learning 105
Manufacturing technology 45
Mixed-signal simulation 25
Multi-label learning 105

N
Network quantisation 79
Neural network hardware 15, 25
Neural networks 15, 25, 37
Neuromorphic computing 25
Neuromorphic Hardware 69
Non-volatile memory 37

O
Object detection 61

P
Physical layout 45
Physical verification 45
Power Reduction 1
Predictive maintenance 89
Process Control 1
Python 69

R
Reverse engineering 45

S
Spiking Neural Networks 69
SRAM cells 15
System on chip 25

V
Validation 89
Variation 37
Visual ingredients recognition 105

W
Wafer-map Classification 1