
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2024.3355495

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

AIfES: A Next-Generation Edge AI Framework


Lars Wulfert, Johannes Kühnel, Lukas Krupp, Justus Viga,
Christian Wiede, Pierre Gembaczka, Anton Grabmaier

• Lars Wulfert, Johannes Kühnel, Lukas Krupp, Christian Wiede, Pierre Gembaczka and Anton Grabmaier are with the Fraunhofer Institute for Microelectronic Circuits and Systems, 47057 Duisburg, North Rhine-Westphalia, Germany. E-mail: {firstname}.{lastname}@ims.fraunhofer.de
• Justus Viga is with RWTH Aachen University, 52076 Aachen, North Rhine-Westphalia, Germany. E-mail: [email protected]

Abstract—Edge Artificial Intelligence (AI) relies on the integration of Machine Learning (ML) into even the smallest embedded devices, thus enabling local intelligence in real-world applications, e.g. for image or speech processing. Traditional edge AI frameworks lack important aspects required to keep up with recent and upcoming ML innovations, such as flexibility concerning the target hardware and support for the integration of custom hardware accelerators. The Artificial Intelligence for Embedded Systems Framework (AIfES) has the goal of overcoming these challenges faced by traditional edge AI frameworks. In this paper, we give a detailed overview of the architecture of AIfES and the applied design principles. Finally, we compare AIfES with TensorFlow Lite for Microcontrollers (TFLM) on an ARM Cortex-M4-based System-on-Chip (SoC) using fully connected neural networks (FCNNs) and convolutional neural networks (CNNs). AIfES outperforms TFLM in both execution time and memory consumption for the FCNNs. Additionally, using AIfES reduces memory consumption by up to 54 % when using CNNs. Furthermore, we show the performance of AIfES during the training of FCNNs as well as CNNs and demonstrate the feasibility of training a CNN on a resource-constrained device with a memory usage of slightly more than 100 kB of RAM.

Index Terms—Machine Learning Framework, Edge AI Framework, On-Device Training, Embedded Systems, Resource-Constrained
Devices, TinyML

1 INTRODUCTION

Over recent years, Machine Learning (ML) has become one of the main drivers of innovation in engineering and scientific applications [1], [2]. With an estimated more than 250 billion embedded devices currently in use [3], this trend also extends to embedded systems, bringing ML-enabled computing capabilities closer to the data sources. Often referred to as edge Artificial Intelligence (AI) or TinyML, these methods have recently found their way into the microcontroller units (MCUs) that make up the Internet of Things (IoT). According to [4], using TinyML offers numerous improvements compared to cloud AI solutions in terms of data protection, low processing latency, energy saving, and minimal connectivity dependency. An extensively researched application for TinyML based on artificial neural networks (ANNs) is condition monitoring utilizing intelligent sensor systems (e.g. [5], [6], [7], [8]). The primary objective is to detect anomalous machine behavior directly within the sensors themselves. This approach enables the transmission of only anomalous data, thereby reducing energy consumption, as transmitting data is more energy-intensive than local processing [9], [10]. Furthermore, the sensor system can be directly connected to the machine's control system, allowing for immediate intervention in the event of a defect. Consequently, there is no need for data to be sent to a cloud server for defect detection, highlighting the data protection aspect of TinyML and emphasizing the reduced processing latency. Additionally, such a system can be deployed in remote areas with limited or no connectivity.

To develop efficient and effective ML methods, numerous frameworks are available that utilize high-performance computing hardware, like graphics processing units (GPUs). PyTorch [11] and TensorFlow [12] are two of the most popular and widely used frameworks in this context. However, since the hardware resources of embedded systems are often very restricted, these frameworks cannot be used there. Therefore, specialized tools and frameworks have been developed that allow migrating ML models, trained on high-performance hardware using large datasets, to resource-constrained devices. Such traditional edge AI frameworks, like TensorFlow Lite for Microcontrollers (TFLM) [13] or Apache TVM [14], focus mainly on the inference and optimization of a wide variety of ANNs, enabling the deployment of deep neural networks (DNNs) on embedded platforms like MCUs. However, converting the ML models between frameworks and even programming languages is unavoidable, since the training was done on a PC. Furthermore, the frameworks are often bound to specific hardware platforms with limited possibilities of integrating specialized hardware acceleration. As a result, developing or training an ANN directly on a resource-constrained system is impossible.

Therefore, several developments in the domain of on-device training of ML models have been carried out in the last two years [15], [16], [17], [18], [19], [20]. On-device training of ML models such as support vector machines (SVMs) [21], k-nearest neighbors (K-NN) [19] and decision trees (DTs) [16] was made feasible first. However, SVM and K-NN have the drawback that the training data must be retained in order to make robust predictions, resulting in high memory requirements [22]. In addition, while SVM, K-NN, as well


as DT can learn linear relationships very effectively, they reach their limits when nonlinear correlations have to be learned. In contrast, ANNs are able to learn nonlinear relationships in data and are therefore suited for complex problems [23]. As the number of embedded systems continues to rise, there is also increasing interest in having more complex tasks, such as condition monitoring, predictive maintenance or object detection in images, performed on MCUs [20]. Due to the restricted resources of MCUs, on-device training of fully connected neural networks (FCNNs) and convolutional neural networks (CNNs) was considered unfeasible until recently [24]. However, preliminary work has shown that on-device training of FCNNs [15] and CNNs [20] on MCUs is feasible. Although these frameworks and methods are compatible with different MCUs such as RISC-V, Cortex-M, or ESP32, many lack open access and a modular software structure that would allow users to integrate customized functions, e.g., activation functions. Even though training FCNNs and CNNs on MCUs is now feasible, complex arithmetic operations (e.g., matrix multiplication) still have to be performed, which is challenging for MCUs. With customized hardware accelerators, complex calculations are performed faster, which saves resources and energy [25], [26], [27]. However, none of the existing developments provide a modular structure for inserting custom hardware accelerators into their framework.

We introduce Artificial Intelligence for Embedded Systems (AIfES)¹, a hardware-independent edge AI framework that bridges the gap between resource-constrained embedded systems and sophisticated machine learning models. As depicted in Figure 1, the modular structure of AIfES is designed to follow the well-known structure of ML frameworks such as Keras or PyTorch. This structure includes four steps: (1) building the model, (2) selecting the loss, (3) choosing the optimizer and (4) training the model. This approach enables users to easily transfer their experience to AIfES. Furthermore, the ailayer module of AIfES offers a selection of different function types, such as a dense layer. Users can assign a data type, such as 8-bit, to each function. This feature can help to save memory or enable faster training or inference on devices without a floating-point unit (FPU). Moreover, AIfES is the first framework that provides the ability to use hardware accelerators in a modular fashion within an ML framework. The software's modular design allows for the addition of new user-specific hardware accelerators, function types, data types, or modules. This allows customization of the framework according to the user's needs and preferences. Leveraging C and optimized modules, AIfES enables on-device training and inference of ML models without requiring an operating system. To perform inference, AIfES requires only the structure and weights of a pre-trained ANN model. Optimized modules and custom hardware accelerators can be easily integrated to enhance the inference or training performance of the model. ANNs can be loaded and fine-tuned, or the network structure can be changed at runtime. Even training from scratch is possible without pre-training, avoiding the need to send training data to a more powerful and energy-consuming device. Energy is saved, the raw training data does not have to be shared, and privacy increases. Furthermore, on-device training enables a new generation of self-learning systems and sensors that adapt to new data and can even be combined on-demand using Federated Learning (FL) [28] to increase performance further. The main contributions of this paper are:

• AIfES, an open-source edge AI framework that provides inference and on-device training for resource-constrained devices and that is hardware-agnostic and software-modular.
• The framework supports the modular addition of user-specific hardware accelerators to enhance the performance of inference and on-device training. It also includes software optimization modules for activation functions.
• The modular framework allows network architectures to be adapted or changed at run time.
• We conducted an extensive evaluation to demonstrate the framework's effectiveness in a variety of settings, starting with FCNNs for inference and comparing our framework with TFLM using hardware accelerators, quantization, and on-device training. Furthermore, we evaluated on-device training of CNNs on well-known datasets.

The remainder of this paper is organized as follows. In Section 2, we provide background on on-device learning frameworks and an overview of related work. Subsequently, we present our proposed framework, AIfES, in Section 3. Furthermore, detailed insight is provided into the design principles, such as the modular architecture, memory usage, and hardware and software optimizations for reduced runtime. In Section 4, the inference is evaluated for different datasets and network structures for FCNNs and CNNs. Also, an analysis of on-device training with AIfES is presented. Finally, Section 5 summarizes the paper and provides an outlook on future work.

¹ https://github.com/Fraunhofer-IMS/AIfES_for_Arduino

2 BACKGROUND AND RELATED WORK

Due to the enormous potential of and interest in edge AI and TinyML, the number of frameworks, libraries and tools is constantly growing [29], [30]. The most commonly used conversion approach relies on deploying pre-trained ML models on embedded platforms, like MCUs. Consequently, well-known ML libraries such as TensorFlow [12], Scikit-Learn [31] or PyTorch [11] are used to create and train the model. Subsequently, the pre-trained ML model can be used on the resource-constrained system. Frameworks such as TFLM have been developed to apply the models to MCUs. TFLM optimizes TensorFlow models to run efficiently on mobile and embedded devices. Utility functions are provided to reduce the size and complexity of an ML model. Several ANN architectures are supported, which can be used for inference on different platforms after conversion. These include smartphones, embedded Linux systems and 32-bit MCUs [13]. Edge Impulse [32] is a service that uses a completely different approach to pre-train ML models and deploy them on the edge device. First, the data must be uploaded to the cloud, where the training is performed.


Fig. 1: The structure of AIfES to build a model and perform on-device training includes four steps: (1) building the model,
(2) selecting the loss, (3) choosing the optimizer, and (4) training the model. The AIfES model is built from a customizable
structure composed of the modules ailayer, ailoss, and aiopti. Each module consists of the hierarchy layers Type (functions
of the module), Implementation (use of hardware accelerators), and Data type (data type with which the ML models are
executed).

Afterward, the ML model can be converted to a library for the required hardware, such as a C++, Arduino, or Cube.AI library. Even conversion to WebAssembly and binary files is supported. Subsequently, the library can be deployed on smartphones, CPUs/GPUs, or a variety of supported MCUs, e.g., the Nordic Semiconductor nRF52840 DK. TFLM is used to run the Edge Impulse ML models on the resource-constrained devices [32].

There are also manufacturer-specific solutions that operate according to a similar principle. A first example is the STM32Cube.AI [33] toolkit for STM32 ARM Cortex-M-based MCUs and its X-Cube-AI [34] extension for optimizations. The toolkit can convert pre-trained ANNs from TensorFlow, Keras [35] or models in ONNX [36] format into C code. With NanoEdge AI Studio [37], it is also feasible to incrementally train ML models on STM32 MCUs. Another tool is Microsoft's Embedded Learning Library (ELL) [38], which enables the development and deployment of pre-trained ML models on resource-constrained platforms, such as Arm Cortex-A and Cortex-M-based architectures. ELL is an optimized cross-compiler that runs on a regular desktop computer and outputs MCU-compatible C++ code [29].

Several techniques have been developed to address TinyML's low-resource challenges, including pruning [39], [40], [41], [42], [43], [44], [45], quantization [46], [47], [48], [49], [50], [39], [51], [52], [53], [54], [55] and neural architecture search (NAS) [53], [56], [57], [58], [59], [60], [61], [62], [63], [64]. These methods reduce model parameters while maintaining model accuracy, allowing the models to be applied to MCUs. Although quantization is supported in our framework, other optimizations such as pruning or NAS are not yet available. Instead, the framework emphasizes a modular software architecture and the modular addition of custom hardware accelerators. This modular approach allows adding pruning or NAS methods individually, as described in Section 3.1.

In addition to the frameworks capable of performing inference only, there have been developments on on-device training of ML models using MCUs. Table 1 depicts a detailed list of publications. In the review, 19 publications were identified in which on-device training was conducted. The majority support the ARM Cortex family, and [19], [24], [15], [21], [65], [66], [67] even support multiple MCU families such as ESP32 or AVR MCUs. The publications used the programming languages C or C++ to implement the ML algorithms, since these languages are suitable for hardware-near implementations and offer fast execution times with low memory requirements [68].

There have been many successful attempts to train ML algorithms on MCUs: besides ANNs, a variety of algorithms such as SVM, DT, RF or K-NN have been applied and shown to be trainable on MCUs. These publications include Edge2Train [21] and Train++ [24], which use SVMs for on-device training, whereas [16] only supports training of K-NN or DT. Other papers, such as [19], [22], [67], support further algorithms such as SVM in addition to FCNNs or CNNs. For instance, [22] compares the memory requirements and inference time of different algorithms for TinyML, including FCNN, SVM, RF, LR, GNB, and DT. Lee et al. [19] proposed an intermittent learning framework for energy-harvesting computing platforms supporting unsupervised and semi-supervised learning algorithms. Although they publish their framework and support a modular software architecture, they neglected to support hardware accelerators, which can save energy [25]; this is particularly relevant in energy-saving systems powered by energy harvesters [19].


TABLE 1: Comparison between AIfES and various papers with on-device training using MCUs.
✓ = true, ✕ = false, - = not identified

Paper | Compatible MCU | Language | ML-Model | Open Source | Modularity: Software | Modularity: Hardware
[3] | nRF52840 | C/C++ | FCNN, CNN | ✕ | - | -
[17] | Cortex-M | C/C++ | FCNN | ✕ | - | -
[18] | Cortex-M4 | C++ | FCNN | ✕ | - | -
[69] | Cortex-M | C | FCNN | ✕ | - | -
[70] | Cortex-M7 | C | FCNN | ✕ | - | -
[22] | nRF52840 | C | DT, Random Forest (RF), Logistic Regression (LR), Gaussian Naive Bayes (GNB), FCNN, SVM | ✕ | - | -
[71] | Cortex-M3 | C | CNN | ✕ | - | -
[67] | STM32 | C | FCNN, CNN, K-NN, SVM | ✕ | - | -
[65] | STM32 | C | CNN | ✕ | - | -
[66] | Adafruit Feather, STM32, ESP32, Adafruit METRO | C | - | ✓ | ✕ | ✕
[24] | Xtensa, ESP32, Cortex-M | C++ | SVM | ✓ | ✕ | ✕
[72] | Arduino Uno | C/C++ | FCNN | ✓ | ✕ | ✕
[21] | ESP32, Cortex-M | C++ | SVM | ✓ | ✕ | ✕
[73] | Cortex-M4 | C++ | FCNN | ✓ | ✕ | ✕
[74] | Arduino Portenta H7 | C/C++ | FCNN | ✓ | ✕ | ✕
[16] | Cortex-M | C | K-NN, DT | ✓ | ✓ | ✕
[19] | AVR, PIC, MSP430 | C | K-NN, K-Means, FCNN | ✓ | ✓ | ✕
[20] | Cortex-M7 | C | CNN | ✓ | ✓ | ✕
[15] | RISC-V, STM32 | C | FCNN, CNN | ✓ | ✓ | ✕
AIfES | All GCC-compatible MCUs (e.g., Cortex-M, Arduino, STM32, Atmel AVR, etc.) | C | FCNN, CNN | ✓ | ✓ | ✓

Almost 80 % of the publications in the field of microcontrollers and on-device training address FCNNs or CNNs, since these methods can train complex nonlinear correlations, allowing more complex applications of ML methods on MCUs [23]. There are seven recent publications [17], [18], [69], [72], [73], [70], [74] that have successfully performed on-device training using only FCNNs. In [17], on-device training is investigated and compared using an Arduino Nano 33 BLE Sense and an Arduino Portenta H7, with the Portenta H7 training a FCNN 4.2 times faster. Incremental learning is performed for supervised and unsupervised learning using an autoencoder in [18]. To use a self-adaptive control algorithm for a DC motor controller, [69] trains a FCNN from scratch. An autoencoder was initially trained on data without anomalies in [70] to monitor the condition of rotating machines on an MCU; the model was then used to detect anomalies in the machines afterward. With the increasing interest in FL, where different devices collaborate to train ML models, several methods have been proposed for MCUs [73], [74], [75], [3]. Ren et al. [3] proposed a federated meta-learning approach for resource-constrained devices. However, they did not specify whether they used an existing library or developed the on-device training from scratch. In [74] and [75], algorithms from [72] are included. In contrast, the training in [73] was developed from scratch, since no available frameworks supported it yet [73]. Similarly, due to a lack of support in available frameworks, [71] proposed a method for training CNNs on MCUs. Additionally, Lin et al. [20] proposed a method for training CNN models with less than 256 KB of random access memory (RAM), including two key innovations for on-device training: Quantization-Aware Scaling (QAS) and the Tiny Training Engine (TTE). First, QAS stabilizes training by automatically scaling the gradients of tensors with different bit-precisions. Second, TTE optimizes the runtime by performing auto-differentiation and sparse updates at compile-time. Although Lin et al. [20] demonstrated the feasibility of on-device training with less than 256 KB of RAM, they pre-trained the models and performed post-training quantization in their experiments before running fine-tuning on the MCUs. Therefore, it is uncertain


whether training under these resource constraints would be possible without pre-training and post-quantization (i.e., training from scratch). An optimization was also proposed with Parallel-Ultra-Low-Power (PULP) [15], using a RISC-V-based parallel software approach. They also propose a strategy to automatically select the fastest kernel depending on the tensor shapes in each DNN layer. In addition to FCNNs, PULP is also capable of training CNNs in parallel on different RISC-V cores. Opt-SGD and Opt-OVO are presented as optimization methods for binary and multiclass ML classifier training in [66]. In [67] and [65], the X-Cube-AI [34] extension of STM is used for on-device training; in [67], the Forward-Forward [76] training method, which requires no backpropagation, is used for the first time on MCUs. De Vita et al. [65] use the extension to train an echo state network [77], which is a form of recurrent neural network (RNN), on an STM32 board.

Although a total of 19 publications have been published on the topic of on-device training on MCUs, the source code has been released for only about half of the papers, so validation or further development of the work is not possible, which slows down progress in the research area and the optimization of the algorithms. Furthermore, slightly more than half are available with open access, whereas only [15], [20], [19] use a modular software structure. A modular software structure is characterized by implementations that are systematically divided into logical sub-blocks.

We noticed that there are some promising solutions with [20], [67], [19] and [15] for dealing with scarce resources while training not only FCNNs but also CNNs and RNNs. However, none of the current work includes a modular hardware structure allowing users to include their own hardware accelerators in the framework. A modular hardware structure implies that arbitrary hardware accelerators can be added to the framework as long as they are callable from the programming language used. In addition, training is developed from scratch in several publications, such as [71], [73], [17], [69], [70]. Thus, we conclude that there is a need for a modular TinyML framework for on-device training on resource-constrained devices. To address this gap, we present the open-source framework AIfES. It targets all types of MCUs, ranging from small 8-bit MCUs up to powerful, e.g., ARM Cortex-M-based, MCUs and supports both inference and on-device training of FCNNs and CNNs. However, AIfES can just as well be used on a PC for evaluation or visualization purposes.

3 DESIGN PRINCIPLES

AIfES is specifically designed to run on embedded, low-resource devices like MCUs. Therefore, the requirements differ compared to usual machine learning frameworks. A major goal of AIfES is to make the usage of the library as simple and intuitive as possible while being efficient enough to run even on the smallest MCUs and flexible enough to support most of the use cases in ML. Therefore, we propose the modular cuboid structure of AIfES depicted in Figure 2, which is designed to provide a flexible and customizable structure in which users can individually select the available functions from the modules ailayer, ailoss and aiopti.

Fig. 2: Modular and customizable cuboid structure of AIfES composed of the modules ailayer, ailoss and aiopti. Each module consists of the hierarchy layers Type (functions of the module), Implementation (use of hardware accelerators) and Data type (data type with which the ML models are executed).

Each module in itself has the different hierarchy layers Type (functions of the module), Implementation (use of hardware accelerators) and Data type (data type with which the ML models are executed), which allows different settings and configurations. For instance, in the ailayer module, users can select different layers, such as a Dense or Conv2D layer, to be included in their ML model. The cuboid structure allows the user to extend the existing modules and hierarchy levels or even add new ones as required.

3.1 Modular Architecture

The AIfES framework has a modular architecture. An ANN model can be built out of processing blocks called layers that are connected to form the whole model, a structure that is also employed by commonly used modern deep learning frameworks like Keras [35] and PyTorch [11]. This allows experienced users of Keras or PyTorch to use AIfES more easily. For training, loss functions are assigned to the model, and it can be trained with different optimizers to perform the gradient steps of the backpropagation algorithm. Unlike other frameworks, AIfES also puts the data type and the particularities of the underlying system in the foreground, which are essential factors on resource-constrained devices. Moreover, it provides all components required for training an ANN right on the device, like backward implementations of all layers, several loss functions and optimizers, and weight initialization functions. An overview of the supported components is given in the appendix. These components follow the same modular concept and are flexible and adaptable to any system and use case.

Figure 3 shows the hierarchical structure of the modules in AIfES. Every category (e.g. layer, loss, optimizer) contains specific modules (e.g. Dense layer, ReLU layer) that define the functionality of the module. Each module type can work on data of several data types (e.g. float32, int8). The final implementation can then be system-specific (e.g. Arm Cortex-M, AVR ATmega) to get optimal performance on any hardware.


[Figure 3: structural concept of the module hierarchy, with examples. Category (fixed): a structure with common attributes and DII function pointers for each category, e.g., ailayer (pointers to the previous and following layer, DII function pointers for the forward and backward pass), ailoss (DII function pointer for the loss function) and aiopti (DII function pointer for the update of weights). Type (customizable): a structure including the previous structure and type-specific attributes, e.g., ailayer_dense (storage for the number of neurons; forward and backward pass set to DII dense implementations referencing DII math function pointers) or ailayer_relu. Data Type (customizable): initialization of data type-specific attributes in the structure of the module type, e.g., ailayer_dense_f32 or ailayer_dense_q7. Implementation (customizable): initialization of type-specific functions in a struct with hardware-specialized and data type-specific functions, e.g., ailayer_dense_f32_default (native C implementations, such as a natively implemented matrix multiplication for the linear forward pass) or ailayer_dense_f32_cmsis (CMSIS-based implementations using the matrix multiplication of CMSIS).]

Fig. 3: Structural concept of the modules in AIfES. Arrows indicate sub-types, similar to inheritance in object-oriented programming. The hierarchy is given on the left side, while the right side gives some examples of an implementation of the hierarchy. The examples describe which parameters are set at each level of the hierarchy to allow software and hardware flexibility.

Thereby, the higher modules pass on general properties to the modules below them. This is done by using structures which are part of the structures of the lower-level modules. In the case of a layer, the category ailayer contains common attributes like pointers to the previous and following layer. Furthermore, some function pointers are provided. These function pointers are called during a forward or backward pass of the ANN and represent an abstract (data type- and implementation-independent (DII)) call location during the passes. In contrast, the ailoss and aiopti categories contain abstract loss- and optimization-specific attributes, respectively (e.g., a function pointer for the loss calculation for ailoss or a function pointer for the parameter update for aiopti).

The module type describes common attributes which are DII but specific to their operation. This allows for different functions, e.g., a dense layer in contrast to an activation function layer for ailayer, or different loss functions for ailoss. For the example of a dense layer, the module type provides arguments like the number of neurons and tensor pointers for weights and bias, and initializes the function pointers from the category for the forward and backward pass with DII function implementations for dense layers. In the case of an activation layer, the function pointers for the forward and backward pass are initialized with DII versions of the activation layer. Those DII functions use the underlying mathematical functions to implement the desired functionality (e.g., tensor add and multiply operations or matrix multiplication). Hereby, the mathematical functions are also DII, as they are referenced as function pointers from the final implementation module.

Consequently, the data type-specific representation initializes the data type of the layer. Combined with the final implementation module, the DII function pointers are initialized with data type- and implementation-dependent versions. All the needed mathematical functions are provided by a separate mathematics module, where the needed functions are referenced for the initialization of the DII function pointers. The mathematics module contains implementations for each data type and implementation, e.g., the matrix multiplication of the forward pass in the f32 and q7 data types. AIfES currently offers two different types of implementations. The default implementation is purely software-based and is tested on various systems to provide the best performance in most cases. In contrast, the Common Microcontroller Software Interface Standard (CMSIS) implementation uses the CMSIS digital signal processor (DSP) functions for an efficient implementation on Arm Cortex-M MCUs by optimizing the implementation of the mathematical functions.
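The following stripped-down sketch illustrates this design pattern: the category struct carries the DII function pointers, the type level supplies the operation-specific forward logic, and the final implementation binds the data type- and hardware-specific math kernel. All struct and function names here are simplified for illustration and do not reproduce the exact AIfES definitions.

```c
#include <stdint.h>

typedef struct aitensor aitensor_t;                 /* simplified tensor handle   */

/* Category level (ailayer): common, DII attributes. */
typedef struct ailayer {
    struct ailayer *prev, *next;                    /* links between layers       */
    aitensor_t *input, *output;                     /* tensors of the pass        */
    void (*forward)(struct ailayer *self);          /* DII forward entry point    */
    void (*backward)(struct ailayer *self);         /* DII backward entry point   */
} ailayer_t;

/* Type level (dense layer): operation-specific, still DII. */
typedef struct ailayer_dense {
    ailayer_t base;                                 /* embeds the category struct */
    uint32_t neurons;
    aitensor_t *weights, *bias;
    /* math kernel, bound later by the final implementation: */
    void (*mat_mul)(const aitensor_t *a, const aitensor_t *b, aitensor_t *out);
} ailayer_dense_t;

/* DII forward pass of the dense type: calls whatever kernel was bound. */
static void dense_forward(ailayer_t *self)
{
    ailayer_dense_t *dense = (ailayer_dense_t *)self;
    dense->mat_mul(dense->base.input, dense->weights, dense->base.output);
    /* bias addition omitted for brevity */
}

/* Kernels live in the mathematics module (one per data type and impl.). */
extern void mat_mul_f32_default(const aitensor_t *, const aitensor_t *, aitensor_t *);
extern void mat_mul_f32_cmsis  (const aitensor_t *, const aitensor_t *, aitensor_t *);

/* Final implementation level: only the bindings differ. */
void ailayer_dense_f32_default_init(ailayer_dense_t *d)
{
    d->base.forward = dense_forward;                /* shared DII logic           */
    d->mat_mul      = mat_mul_f32_default;          /* native C kernel            */
}

void ailayer_dense_f32_cmsis_init(ailayer_dense_t *d)
{
    d->base.forward = dense_forward;
    d->mat_mul      = mat_mul_f32_cmsis;            /* CMSIS-DSP-backed kernel    */
}
```

Integrating a custom accelerator, as in the SIMD RISC-V example discussed in the following, then amounts to one additional init function that binds mat_mul to the accelerated kernel.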


In order to use a different hardware accelerator (in the shape of existing or custom-designed hardware units), the hardware-optimized mathematical functions (like tensor multiplication) need to be added to the mathematics module. Additionally, a final implementation must be added to the desired layer. In the final implementation, the specialized mathematical functions need to be referenced. No further adjustments are necessary, as the DII implementations of the modules type and category automatically use the mathematical functions given in the implementation. With this design concept, a hardware developer does not need to know about neural networks to develop hardware accelerators. Instead, a machine learning expert can use accelerated building blocks provided by the hardware expert. Moreover, this allows for a hardware/software co-design workflow where the developer starts with the default implementations and gradually replaces them with custom accelerated functions. An example of customized hardware accelerators can be found in [25]. In this example, custom single-instruction-multiple-data (SIMD) instructions were integrated into a RISC-V MCU to improve the calculation of dense layers. Here, only the implementation of the dense layer needed to be updated: the function pointer for the default implementation was replaced with the SIMD-specific implementation. No further changes were necessary, highlighting the modular structure of AIfES. With this concept, a hardware developer only needs to develop a mathematical function for matrix multiplication, which is then referenced by the function pointer inside the implementation. Furthermore, the activation functions were optimized with custom hardware accelerators, where also only the function pointers needed to be updated to automatically use the optimized hardware accelerators.

Moreover, easy porting of the framework to other hardware architectures is possible. As the framework is entirely written in C and is compatible with the GNU Compiler Collection (GCC), the default implementation is executable on any hardware that is supported by the GCC, and customized hardware accelerators can be included as described above. With this modular concept, adding new components and adapting to new use cases is straightforward without diving deep into the framework code, as seen by integrating new hardware accelerators. Additional types can be added, e.g., to support new activation functions, where the additional mathematical functions need to be added and referenced in the new final implementation. The clear design choice of structuring AIfES into the different modules data type and implementation leads to a more efficient system, as unnecessary functions can be excluded during implementation, compilation and, therefore, during deployment.

3.2 Memory

A main constraining factor on embedded devices is limited memory. On the one hand, the RAM for storing variables and mutable data is often only a few kilobytes (or even a few bytes) in size and must therefore be used very sparingly. On the other hand, the read-only memory (ROM) for program code storage and constant variables is also essential. This contrasts with non-edge devices, where the code size is often not considered. With its modular design, AIfES makes it easy for C compilers to remove unused code and thus shrink the code size. The memory for the parameters (e.g. weights) and intermediate results of an ANN can be individually assigned depending on the application. Constant parameters, like non-trainable parameters, can be stored in ROM (e.g. flash memory or EEPROM) or external storage components, while mutable parameters need to consume space in RAM. For instance, gradients, errors, intermediate results, and quantization parameters will be placed in RAM.

Unlike the dynamic memory allocation of Keras [35] or the custom caching memory allocator of PyTorch [11], AIfES uses a different approach. As a primary design concept, all the memory is assigned before running the network. For this, AIfES provides scheduler functions that calculate the required memory size beforehand, based on the network structure of the FCNN, and distribute a block of memory to the model.

First, the memory size for the inference of each layer is calculated. Thereby, the number of mutable parameters is multiplied by the size of the selected data type. The memory size for the intermediate results of the quantization parameters is also determined if quantization is used; for this, the size of the used data type is added to the memory block. Subsequently, the address sections of the memory block are assigned to the individual layers by the scheduler, depending on the number of variable parameters. Afterward, the memory size for the training is determined if the FCNN is to be trained by AIfES. Hereby, the memory size for the intermediate results of the different layers for the forward and backward pass is computed. The memory size for the forward and backward pass is determined by the data type and the largest number of weights and biases in the FCNN. Furthermore, the memory size of the gradients and the optimization memory (e.g. first or second momentum) is ascertained. The size of the gradients is determined by the size of the tensor and the used data type. The memory size for the optimizer depends on the chosen one, as Adam, in contrast to SGD, needs additional storage for the moments. In addition, if applied, the memory size of the quantization parameters is calculated from the utilized data type. After calculating the size of the memory block, the scheduler allocates the address ranges of each layer based on the size of the mutable parameters, the optimization size, and the memory size for the intermediate results.

Thus, AIfES has no internal dynamic memory allocation (apart from local variables on the call stack). This ensures that the system cannot run out of memory during inference or training of an ANN, which is particularly important in safety-critical applications like autonomous driving. Furthermore, no memory fragmentation can occur, because the memory scheduler knows when and how much memory is needed during runtime and can optimize the assignment.
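As a sketch of this workflow, assuming the scheduler API resembles the aialgo_* functions of AIfES for Arduino (names paraphrased, not guaranteed verbatim):

```c
#include "aifes.h"

/* Static memory workflow for inference: the scheduler computes the size
 * of one memory block and distributes address ranges inside it to the
 * layers; no malloc is involved. Function names are paraphrased from
 * AIfES for Arduino and may differ between versions. */
void run_inference(aimodel_t *model, aitensor_t *input, aitensor_t *output)
{
    /* 1. Ask the scheduler how much RAM the model needs for inference. */
    uint32_t required = aialgo_sizeof_inference_memory(model);

    /* 2. Provide a block whose size is fixed at compile/link time. */
    static uint8_t memory_block[4 * 1024];   /* must satisfy required <= size */

    /* 3. Let the scheduler assign address ranges to the individual layers. */
    aialgo_schedule_inference_memory(model, memory_block, required);

    /* 4. Run the forward pass; no allocation can fail from here on. */
    aialgo_inference_model(model, input, output);
}
```

Because the buffer is a static array, the worst-case memory demand is visible at link time, which is exactly the property needed for the safety-critical use cases mentioned above.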


3.3 Hardware and Software Optimizations for Reduced Runtime

Keeping the runtime of ML inference and training low is a key objective of AIfES. A major portion of the execution time of state-of-the-art neural networks is devoted to matrix operations, e.g. in fully connected or convolutional layers. However, comparably small neural networks are frequently used on embedded systems due to prevailing resource constraints. With decreasing network size, the activation layers become increasingly relevant concerning their contribution to the execution time. Therefore, AIfES includes runtime-optimized activation functions and layers in addition to the matrix multiplication-based layers. As several activation functions require the calculation of the exponential function, which can be costly in its default implementation, AIfES includes an optimized variant [78]. Furthermore, AIfES employs piecewise linear approximation (PLA) of activation functions, which introduces minor calculation errors but speeds up their execution. Hence, AIfES allows adapting the degree of approximation depending on the precision and runtime constraints of the application. AIfES does not use look-up table-based activation functions, to prevent a further increase of the library's memory requirements.
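To illustrate the idea, a logistic sigmoid can be approximated with a handful of line segments. The breakpoints and slopes below are chosen for this example and are not the coefficients used inside AIfES:

```c
/* Minimal sketch of a piecewise linear approximation (PLA) of the logistic
 * sigmoid with five segments. The segments are continuous and match the true
 * sigmoid at x = -4, -1.5, 1.5 and 4 (values 0, 0.182, 0.818, 1, rounded).
 * The point of the technique: a few compares and one multiply-add replace
 * the costly expf() call of the exact formula 1 / (1 + expf(-x)). */
float sigmoid_pla(float x)
{
    if (x <= -4.0f) return 0.0f;                    /* saturated regions  */
    if (x >=  4.0f) return 1.0f;
    if (x <  -1.5f) return 0.0728f * x + 0.2912f;   /* outer segments     */
    if (x >   1.5f) return 0.0728f * x + 0.7088f;
    return 0.212f * x + 0.5f;                       /* near-linear center */
}
```

Adding or removing segments trades accuracy against runtime, which is the adjustable degree of approximation mentioned above.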


For the backward pass of the model, no automatic differentiation is executed. Consequently, a separate implementation of the backward pass is provided for every layer. Thus, no additional bookkeeping of the executed functions on a tensor is necessary, resulting in reduced storage and computing requirements.

AIfES provides two complementary backpropagation workflows to achieve lower memory consumption during training. Both commence with a forward pass that retains the results. The traditional approach progresses by iterating over all layers in reverse, computing and storing the gradients at each step. All parameters are only updated at the very end of this process. The lightweight stochastic gradient descent (L-SGD) algorithm [79] operates differently, retaining only the partial derivatives necessary for the subsequent layer and directly updating the parameters using the calculated gradients. Thus, it only keeps two layers in memory at any given moment.

We have expanded this algorithm to include other optimizers such as Adam. Additionally, we have made it possible to use the lightweight backpropagation workflow with batch learning, enhancing its practical utility. In this context, we accumulate the gradients over a complete batch in each iteration and update before advancing to the next layer. The lightweight procedure becomes increasingly efficient as the depth of the model increases. For larger batch sizes, the algorithm is quicker due to memory access, though this comes at the expense of higher peak memory usage. AIfES provides users with the flexibility to select the workflow that best suits their network architecture and performance requirements.
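The difference between the two workflows can be summarized in pseudo-C. All types and helper functions here are hypothetical placeholders used for illustration; they are not the AIfES API:

```c
/* Contrast of the two backpropagation workflows described above. */
typedef struct layer layer_t;
typedef struct optimizer optimizer_t;

extern void compute_and_store_gradients(layer_t *l); /* keeps gradients alive  */
extern void compute_gradients(layer_t *l);           /* uses layer l-1 outputs */
extern void update_parameters(layer_t *l, optimizer_t *o);
extern void release_gradients(layer_t *l);

/* Traditional workflow: gradients of all n layers stay in memory until the
 * single update step at the end. */
void backprop_traditional(layer_t *layers[], int n, optimizer_t *opt)
{
    for (int i = n - 1; i >= 0; i--)
        compute_and_store_gradients(layers[i]);
    for (int i = 0; i < n; i++)
        update_parameters(layers[i], opt);
}

/* Lightweight (L-SGD-style) workflow: parameters are updated immediately,
 * so only the gradients of the current and the preceding layer need to be
 * held at any moment. */
void backprop_lightweight(layer_t *layers[], int n, optimizer_t *opt)
{
    for (int i = n - 1; i >= 0; i--) {
        compute_gradients(layers[i]);
        update_parameters(layers[i], opt);
        release_gradients(layers[i]);   /* freed before the next layer */
    }
}
```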
Another factor to be aware of when developing a library for embedded purposes is the huge range of underlying hardware configurations. On systems with only 8-bit memory bandwidth and no FPU, the optimal implementation of an algorithm is different than on 32-bit systems with SIMD instructions or DSP accelerators. The modular concept of AIfES allows the development of individual components that fit perfectly to the given hardware platform and instruction set (cf. Section 3.1 and Fig. 3). For example, the CMSIS for ARM-based MCUs allows accelerating the calculations by utilizing, among other optimizations, SIMD instructions. The CMSIS can be used in AIfES via the additional CMSIS implementation.

AIfES also allows quantizing the model. Quantization enables an adaptation to different hardware architectures, e.g. to the aforementioned 8-bit MCUs without an FPU. An ANN quantized to Q7 can speed up the calculations. Two quantizations (Q7 and Q31) are offered; this symmetric 8-bit/32-bit integer quantization facilitates integer-only calculations on real values by following the techniques proposed in [51].
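As a sketch of symmetric quantization in the spirit of [51] (AIfES's internal Q7 format, e.g. shift-based scaling, may differ in detail), a real value r is mapped to an 8-bit integer q with a single scale s, so that r ≈ s * q:

```c
#include <math.h>
#include <stdint.h>

/* Symmetric Q7 quantization: one scale, no zero-point offset. */
int8_t quantize_q7(float r, float s)
{
    float q = roundf(r / s);
    if (q >  127.0f) q =  127.0f;   /* clamp to the int8 range */
    if (q < -128.0f) q = -128.0f;
    return (int8_t)q;
}

/* The scale is typically derived from the largest magnitude in a tensor,
 * so that this maximum maps to 127. */
float q7_scale(const float *data, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(data[i]);
        if (a > max_abs) max_abs = a;
    }
    return max_abs / 127.0f;
}
```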

4 EVALUATION

In order to evaluate the performance of AIfES, a benchmark including multiple ANN architectures (FCNNs and CNNs) and datasets was developed. The performance of AIfES in terms of execution time and memory consumption is compared to TFLM for the inference of the models, and AIfES is evaluated in a training scenario.

4.1 Benchmark Setup

The experiments were conducted on the nRF52840 DK (ARM Cortex-M4-based) by Nordic Semiconductor [80]. The nRF52840 System-on-Chip (SoC) runs at a clock rate of 64 MHz.

For the software development and programming of the SoC, the PlatformIO IDE [81] was used. For the compilation, the GCC included in the GNU ARM Embedded Toolchain [82] was used with maximum optimization (-O3). The execution time of inference and training was measured with a logic analyzer (Digital Discovery by Digilent), and the results were evaluated statistically. In the following, only the mean execution time is reported, as the deviations were insignificant. The given values for the memory consumption in terms of RAM and flash memory were taken from the compilation report of PlatformIO. For the inference setting, the parameters of the ANNs were declared with the const qualifier to place them in the flash memory of the SoC during compile time. The same procedure was used in the inference and training experiments to place the input data in the flash memory for AIfES and TFLM, respectively. However, the input data size was subtracted from the reported flash memory consumption, as only the storage requirements of the two frameworks should be compared and the size of the input data varies with the different ANNs. The benchmarks were conducted with the two data types F32 and Q7 for the FCNNs and only F32 for the CNNs. For the Q7-based versions, the pre-trained model from Keras was quantized. To be able to control the behavior of TFLM, the official repository from GitHub [83] was downloaded, and the provided converter tool was used to create the library for ARM Cortex architectures with and without optimized CMSIS kernels. The library is then included in the PlatformIO IDE. For TFLM, pre-trained ANNs from TensorFlow need to be converted to a TensorFlow Lite model. The converted models are then exported and included in the benchmarking environment. The size of the kTensorArenaSize was estimated empirically for each ANN, as it contains all necessary parameters for the ANN and therefore changes size with each tested ANN. A conversion of the pre-trained models from Keras [35] to AIfES is executed; only the weights and biases are transferred to AIfES to convert a pre-trained model. For both AIfES and TFLM, version 5.8.0 of CMSIS is used [84].

4.2 Inference Benchmark

4.2.1 FCNNs

For the evaluation of the FCNNs, the model architectures from an existing TinyML benchmark [85] were adopted. Three representative datasets with several numbers of input features (4 - 64 features) were selected. Additionally, two larger fully connected deep neural network (FCDNN) architectures were evaluated based on the MNIST dataset [86] using the complete and flattened images (784 features). The experiments with the corresponding evaluated datasets and models are summarized in Table 2. Experiments 1 to 3 were conducted first. The results are shown in Figures 4, 5 and 6.

TABLE 2: Summary of inference experiments with FCNNs

Experiment | Dataset | # of inputs & outputs | Hidden layers
1 | Iris [87] | 4 features, 3 classes | 1 hidden layer with 10 neurons (1 x 10); 1st hidden layer with 10 neurons, 2nd with 50 neurons (10 + 50); 10 hidden layers with 10 neurons each (10 x 10)
2 | Breast Cancer [88] | 30 features, 2 classes | 1 x 10; 10 + 50; 10 x 10
3 | MNIST [86] | 64 features, 10 classes | 1 x 10; 10 + 50; 10 x 10
4 | MNIST [86] | 784 features, 10 classes | 32 + 32 + 16 (FCDNN 1); 128 + 64 + 32 + 16 (FCDNN 2)

[Figure 4: bar charts of the inference execution times (in µs) of AIfES and TFLM in the F32 and Q7 versions for the FCNNs of experiments 1 to 3 (1 x 10, 10 + 50 and 10 x 10 architectures on Iris, Cancer and MNIST). (a) Without CMSIS. (b) With CMSIS.]

Fig. 4: A comparison of the execution times between AIfES and TFLM for different FCNNs (experiments 1 to 3) is shown, with comparisons between the F32 and Q7 versions illustrated in both subfigures. Subfigure 4a represents the inference time with the standard implementation, and subfigure 4b with the CMSIS implementation.

Figure 4a shows that the execution time of the AIfES models is lower than that of the TFLM models in most of the cases. Without CMSIS, a speed-up by factors of up to 2.1 for F32 and 2.2 for Q7 was measured. The execution times of the slower F32 AIfES models (MNIST 1 x 10 and MNIST 10 + 50) lie within 17 % of those of the TFLM models. At the same time, the Q7-based version of MNIST 1 x 10 is slightly faster (speed-up by a factor of 1.2), whereas the Cancer 10 + 50 model is slightly slower (by 2 %). An explanation for the lower performance of the MNIST F32 ANNs might be an optimized matrix multiplication implementation of TFLM, taking effect for fully connected layers with a higher number of parameters. This fits with the results for the Q7 Cancer 10 + 50, as for the Cancer 10 + 50 the execution time of AIfES is slightly longer than that of TFLM. At the same time, the Q7 optimizations allow AIfES to be slightly faster than TFLM in the Q7 MNIST 1 x 10 setting. Figure 4b shows that the AIfES models with CMSIS are faster than the TFLM models in all cases, by factors of up to 2.4 and 2.3 for F32 and Q7, respectively. The slow performance of the F32-based versions of TFLM can be attributed to the fact that TFLM uses the default implementation for F32 without using any function from the CMSIS. This leads to almost the same results as for F32 without CMSIS. For the Q7 setting, TFLM uses CMSIS-based implementations, showing a performance increase, also compared to the Q7 implementations without CMSIS, by factors of up to 1.6 (mean 1.4). At the same time, TFLM can reduce the execution time of Q7 ANNs further with CMSIS, compared to without it, by factors of up to 2.7 (mean 2.1). These results demonstrate the effectiveness of the modular and open AIfES architecture, enabling the integration of arbitrary accelerated or optimized implementations of ANN functionalities.

Figures 5 and 6 show that the AIfES models require overall less memory than the TFLM models, by factors of up to 3.9 (starting at factors of 2.1, with a mean of 2.7). The RAM requirements are similar for both frameworks in most cases, while the significant difference is due to the flash memory consumption. We attribute this result to the memory-efficient implementation of AIfES concerning program code and constant variables, typically placed inside the flash memory. Furthermore, it is to be noted that the flash memory consumption increases for the TFLM models with CMSIS enabled, while it slightly decreases for the F32 AIfES models. The reason for this is that AIfES integrates only a subset of the CMSIS modules. This can also be seen in the Q7-based implementation, where the flash memory consumption increases for both frameworks but the amount


of increase is larger for TFLM. TFLM either uses more or other CMSIS modules with an increased code size or more constant variables. An explanation for the decrease in memory consumption for the F32 AIfES models is the higher efficiency of the CMSIS functions in terms of code size compared to the native AIfES implementations.

Table 3 shows the results of the FCDNN architectures (experiment 4 in Table 2). The execution time of the TFLM models without CMSIS is 17 % and 16 % lower than that of the AIfES models. This result supports our hypothesis concerning the optimized native matrix multiplication implementation of TFLM for fully connected layers with higher numbers of parameters. With CMSIS, the AIfES models are faster than the TFLM models by factors of 1.4 and 1.3, respectively. Nevertheless, the RAM requirements of the TFLM models are slightly lower (12 % for FCDNN 1 and 5 % for FCDNN 2). The flash memory consumption of the AIfES models falls below that of the TFLM models by the same absolute differences as in the previous experiments 1 to 3 (31 kB and 41 kB on average without and with CMSIS, respectively). Overall, these results prove the effectiveness of the AIfES architecture again for the integration of external optimized ANN modules and its memory efficiency with respect to flash memory storage.

4.2.2 CNNs

Subsequently, 2D-CNN architectures using the MNIST, CIFAR-10 [89] and Visual Wake Words (VWW) [90] datasets were evaluated. The network architecture changes with every dataset, since the datasets have a different number of input channels and also a different number of outputs. For CIFAR-10 and the VWW, an input of 3 × 32 × 32 was used, since the datasets contain RGB images. The images of the VWW were previously resized to fit the input shape of the CNN. An input of 1 × 28 × 28 was used for the MNIST dataset. However, the basic architecture is the same for all of them. Each network has two convolutional layers, the first layer using four kernels and the second layer eight kernels with a size of 3 × 3, using no padding and a stride of one. ReLU is used in both layers, and max-pooling with a kernel of 2 × 2 is performed after each convolutional layer.
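With no padding and a stride of one, each 3 × 3 convolution shrinks a feature map by two pixels per dimension, and each 2 × 2 max-pooling halves it (rounding down). A quick check of the resulting shapes for the MNIST variant of this architecture:

```c
#include <stdio.h>

/* Feature-map size bookkeeping for the benchmark CNN (MNIST variant):
 * conv: out = in - kernel + 1 (no padding, stride 1); pool: out = in / 2. */
int main(void)
{
    int size = 28;           /* 1 x 28 x 28 input                         */
    size = size - 3 + 1;     /* conv1, 4 kernels 3x3  -> 4 x 26 x 26      */
    size = size / 2;         /* maxpool 2x2           -> 4 x 13 x 13      */
    size = size - 3 + 1;     /* conv2, 8 kernels 3x3  -> 8 x 11 x 11      */
    size = size / 2;         /* maxpool 2x2           -> 8 x 5 x 5        */
    printf("final feature maps: 8 x %d x %d\n", size, size);
    return 0;
}
```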

[Figure 5: bar charts of the RAM and flash memory consumption (in bytes, logarithmic scale) of AIfES and TFLM in the F32 version for the FCNNs of experiments 1 to 3. (a) f32, without CMSIS. (b) f32, with CMSIS.]

Fig. 5: The memory comparison for the frameworks AIfES and TFLM in the F32 version using different FCNNs (experiments 1 to 3) is shown. Furthermore, the RAM and flash consumption is depicted for the standard and CMSIS implementation. Subfigure 5a shows the standard implementation, and subfigure 5b illustrates the implementation with CMSIS.

[Figure 6: bar charts of the RAM and flash memory consumption (in bytes, logarithmic scale) of AIfES and TFLM in the Q7 version for the FCNNs of experiments 1 to 3. (a) Q7, without CMSIS. (b) Q7, with CMSIS.]

Fig. 6: The memory comparison for the frameworks AIfES and TFLM in the Q7 version using different FCNNs (experiments 1 to 3) is shown. Furthermore, the RAM and flash consumption is depicted for the standard and CMSIS implementation. Subfigure 6a shows the standard implementation, and subfigure 6b illustrates the implementation with CMSIS.


TABLE 3: Evaluation of execution time and memory consumption of the FCDNNs (experiment 4)

NN Arch.   Framework   Exec. Time, No CMSIS (ms)   Exec. Time, CMSIS (ms)   RAM, No CMSIS (kB)   Flash, No CMSIS (kB)   RAM, CMSIS (kB)   Flash, CMSIS (kB)
FCDNN 1    AIfES       14.47                        8.67                    10.99                122.28                 10.98             121.11
           TFLM        11.97                       12.40                     9.69                153.03                  9.69             162.84
FCDNN 2    AIfES       58.20                       40.71                    11.22                461.09                 11.21             459.92
           TFLM        49.13                       50.95                    10.71                491.84                 10.71             501.66

4.2.2 CNNs
Subsequently, 2D-CNN architectures using the MNIST, CIFAR-10 [89] and Visual Wake Words (VWW) [90] datasets were evaluated. The network architecture changes with every dataset, since the datasets have a different number of input channels and also a different number of outputs. For CIFAR-10 and the VWW, an input of 3 × 32 × 32 was used, since the datasets contain RGB images. The images of the VWW were previously resized to fit the input shape of the CNN. An input of 1 × 28 × 28 was used for the MNIST dataset. However, the basic architecture is the same for all of them. Each network has two convolutional layers, the first layer using four kernels and the second layer eight kernels, each with a size of 3 × 3, no padding, and a stride of one. ReLU is used in both layers, and maxpooling with a kernel of 2 × 2 is performed after each convolutional layer. As the CMSIS support for CNNs is not yet included in AIfES, only the native implementations were compared.
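For the MNIST variant, the resulting feature-map sizes follow from the standard valid-padding output formula. The helper below is an illustrative sketch of this arithmetic (it is not part of the AIfES API, and it assumes the usual convention that pooling uses a stride equal to its kernel size):

    /* Output edge length of a valid-padding convolution or pooling
     * stage. Illustrative helper, not an AIfES API function. */
    static int out_size(int in, int kernel, int stride)
    {
        return (in - kernel) / stride + 1;
    }

    /* MNIST variant (input 1 x 28 x 28), assuming pool stride 2:
     * conv1 3x3, stride 1: out_size(28, 3, 1) = 26  ->  4 x 26 x 26
     * pool1 2x2, stride 2: out_size(26, 2, 2) = 13  ->  4 x 13 x 13
     * conv2 3x3, stride 1: out_size(13, 3, 1) = 11  ->  8 x 11 x 11
     * pool2 2x2, stride 2: out_size(11, 2, 2) = 5   ->  8 x 5 x 5   */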

The results in Table 4 show that the TFLM CNN outperforms the AIfES model in terms of execution time and RAM requirements, by factors of up to 1.74 (VWW) and 1.63 (MNIST), respectively, in the worst-case evaluations. An explanation for the difference in execution time is the optimized implementation of matrix multiplications in TFLM. AIfES currently uses simple direct convolutions; sophisticated methodologies such as general matrix multiply (GEMM)- or fast Fourier transform (FFT)-based implementations are not yet included. Similar to the previous experiments, the flash memory consumption of the AIfES CNN is lower than that of the TFLM model, with a maximum absolute difference of 32 kB across all datasets. As a result, using AIfES reduces the flash memory occupancy by up to 53 %, observed for the VWW dataset. The significant differences are related to the flat buffer used to store the weights, the network structure, and the activation functions so that the neural network can be built at run time by TFLM.

TABLE 4: Evaluation of execution time and memory consumption of the 2D-CNNs

Dataset    Framework   Execution Time (ms)   RAM (kB)   Flash (kB)
MNIST      AIfES        51.16                27.01      34.35
           TFLM         43.53                16.58      66.63
CIFAR-10   AIfES       111.03                43.33      38.53
           TFLM         70.85                32.96      70.44
VWW        AIfES       122.42                43.30      28.48
           TFLM         70.40                32.96      61.19
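The "simple direct convolution" referred to above corresponds to the loop nest sketched below for a single input and output channel (valid padding, stride one; the identifiers are illustrative, and the sketch is not code taken from AIfES). GEMM-based kernels restructure exactly this computation into one large matrix multiplication, for example via im2col, which vectorizes considerably better.

    /* Direct 2D convolution, valid padding, stride 1, one input and
     * one output channel. Illustrative of a direct-convolution loop
     * nest; GEMM-based kernels recast it as a matrix multiplication. */
    void conv2d_direct_f32(const float *in, int ih, int iw,
                           const float *k, int kh, int kw, float *out)
    {
        const int oh = ih - kh + 1;   /* output height (valid padding) */
        const int ow = iw - kw + 1;   /* output width */
        for (int y = 0; y < oh; ++y) {
            for (int x = 0; x < ow; ++x) {
                float acc = 0.0f;
                for (int ky = 0; ky < kh; ++ky)
                    for (int kx = 0; kx < kw; ++kx)
                        acc += in[(y + ky) * iw + (x + kx)] * k[ky * kw + kx];
                out[y * ow + x] = acc;
            }
        }
    }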
4.3 Training Benchmark
Subsequently, we investigated the on-device training of the FCNNs and CNNs with AIfES. For the evaluation of the FCNNs, the same architectures were used as in experiments 1 to 3 of the inference benchmark shown in Table 2. For the CNNs, the same architectures were chosen as for the inference benchmarks in Section 4.2.2. All models were trained using a cross-entropy loss and the Adam optimizer with η = 0.01, β1 = 0.9, β2 = 0.999, using ϵ = 1e−7 for the FCNNs and ϵ = 1e−5 for the CNNs. Only the training with the default implementation (i.e., without CMSIS) is shown. We measured the execution time per epoch and normalized it by dividing it by the number of batches per epoch, as the batch size is the same (5) for each model.

4.3.1 FCNNs

TABLE 5: Evaluation of the on-device training with AIfES of the FCNN on the nRF52840 DK (batch size: 5)

Dataset         Architecture   Training Time/Batch (ms)   RAM (kB)   Flash (kB)
Breast Cancer   1 x 10         15.04 (±0.01)               8.34      20.96
                10 + 50        40.34 (±0.02)              21.27      21.22
                10 x 10        55.74 (±0.02)              29.85      23.33
Iris            1 x 10          3.62 (±0.00)               3.54      20.96
                10 + 50         5.41 (±0.01)              17.45      21.22
                10 x 10         7.76 (±0.01)              24.58      23.21
MNIST           1 x 10         33.43 (±0.02)              18.25      21.36
                10 + 50        74.57 (±0.04)              34.80      21.63
                10 x 10        76.67 (±0.04)              39.06      23.74
(±0.04)

Table 5 shows that the training of FCNNs on resource-constrained embedded devices is possible with AIfES while keeping the execution time and memory consumption at an acceptable level. This means that the platform is not fully utilized, and enough resources remain: on the experimental platform based on the nRF52840 SoC (256 kB of RAM and 1 MB of flash), 15 % of the RAM and 2 % of the flash memory are utilized by AIfES in the worst evaluated case (MNIST, 10 x 10). Hence, other tasks such as communication, sensor sampling, or signal pre-processing can also run on the embedded system. The overall training execution time of the evaluated models on the experimental platform lies in the range of milliseconds to seconds.


Despite these results, it has to be considered that the model architecture, and especially the memory requirements of the training process, can be the main limitations of on-device training with AIfES. The training of a more complex deep autoencoder is additionally included in Appendix B.

4.3.2 CNNs
The results of the on-device training benchmark of the CNNs are shown in Table 6, where we used the same datasets as for the inference evaluations. Similar architectures as in the inference analyses were also used. The architectures were extended by adding a batch normalization layer after both convolutional layers with momentum = 0.9 and ϵ = 1e−6, which accelerates the training according to [91].

TABLE 6: Evaluation of the on-device training with AIfES of the 2D-CNN on the nRF52840 DK (batch size: 5)

Dataset    Training Time/Batch (s)   RAM (kB)   Flash (kB)
MNIST      1.19                      104.14     39.60
CIFAR-10   2.61                      149.90     38.80
VWW        2.56                      103.64     38.48
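The batch normalization layers keep running estimates of the per-channel mean and variance across batches. With momentum = 0.9 as above, the update has the following generic form (a sketch assuming the common convention that the momentum weights the previous running estimate; this is not AIfES-internal code):

    /* Exponential moving average of batch statistics used by batch
     * normalization, with momentum = 0.9 as in the benchmark setup.
     * Convention assumed: momentum weights the previous estimate. */
    void bn_update_running_stats(float *running_mean, float *running_var,
                                 const float *batch_mean,
                                 const float *batch_var,
                                 int channels, float momentum)
    {
        for (int c = 0; c < channels; ++c) {
            running_mean[c] = momentum * running_mean[c]
                            + (1.0f - momentum) * batch_mean[c];
            running_var[c]  = momentum * running_var[c]
                            + (1.0f - momentum) * batch_var[c];
        }
    }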
Compared to the evaluations of the FCNNs, the training time per batch and the RAM consumption have increased significantly. The CNN trained on the CIFAR-10 dataset takes the most time to train, at 2.61 seconds per batch, and also has the highest RAM consumption, at almost 150 kB. The training time and the RAM consumption have increased because the number of trainable parameters also increased in the CNNs. For instance, the CNN used in the analysis with the CIFAR-10 dataset uses four times as many parameters as the FCNN with the MNIST dataset using the 10 x 10 architecture; hence the memory requirement is about 3.8 times larger.
The benchmark shows that the training of CNNs is feasible with AIfES, but the training time as well as the memory consumption increase compared to the FCNN training. Nevertheless, the training time remains within an acceptable time frame. However, training a deep CNN would be challenging. In addition, the amount of data required to train a deep CNN cannot be stored directly on an MCU.
4.4 Approximation Benchmark
Since AIfES uses approximations, an important aspect to examine is the relative error of these approximations. For this purpose, the deep autoencoder included in the MLPerf Tiny benchmark [92] was used. The ReLU activation functions were replaced with sigmoid activation functions to enable an integrated approximation of the activation function in AIfES. First, the autoencoder was trained in TensorFlow. Then, the model was exported to TensorFlow Lite and AIfES. These two models were then executed on a PC, and the mean squared error with respect to the input values (the reconstruction error) was calculated. Based on the 2459 test datasets, the reference values were determined using TensorFlow. Afterwards, the same test data was used to measure the mean squared error of the autoencoder using TensorFlow Lite and AIfES. Subsequently, the mean relative error of TensorFlow Lite and AIfES with respect to the reference values from TensorFlow was determined. TensorFlow Lite has a mean relative error of 1200.95588 ‰, while AIfES has an error of 0.00027 ‰. Thus, the mean relative deviation of AIfES is negligible. In an application example, the applied approximations lead to no difference in the pAUC value compared to TensorFlow and only to a minimal difference in the mean value of the AUC of 2.7 ∗ 10−6.

5 CONCLUSION & FUTURE DIRECTIONS
In this paper, we presented the next-generation edge AI framework AIfES. It is specifically designed to leverage the full potential of ML on resource-constrained embedded devices. Compared to other traditional edge AI frameworks, AIfES supports not only inference on embedded systems but also on-device training. This allows the use of FL and online learning (OL) techniques in real-world applications. Furthermore, due to its modular architecture, AIfES enables the easy integration of arbitrary optimized and hardware-accelerated ANN functionalities. We performed benchmarks comparing AIfES to TFLM in multiple inference scenarios on an ARM Cortex-M4-based SoC. Especially for FCNN architectures, we showed that AIfES is capable of outperforming TFLM in terms of execution time and memory consumption. Furthermore, we demonstrated the feasibility of training ANNs and CNNs on embedded devices with AIfES. The current main limitation of AIfES is the implementation of the native matrix multiplication, which leads to a lower performance of ANNs compared to TFLM. In the future, we will enhance AIfES with more advanced matrix multiplication methods for ANNs and optimize the overall on-device training for ANNs with, e.g., pruning. Furthermore, new ANN architectures, such as transformers, will be added, and we will focus on the further development of FL and OL techniques with AIfES.

6 ACKNOWLEDGMENT
Lars Wulfert, Johannes Kühnel and Lukas Krupp contributed equally to this paper.

REFERENCES
[1] M. Frank, D. Drikakis, and V. Charissis, “Machine-learning methods for computational science and engineering,” Computation, vol. 8, no. 1, p. 15, 2020.
[2] R. Cioffi, M. Travaglioni, G. Piscitelli, A. Petrillo, and F. de Felice, “Artificial intelligence and machine learning applications in smart production: Progress, trends, and directions,” Sustainability, vol. 12, no. 2, p. 492, 2020.
[3] H. Ren, D. Anicic, and T. A. Runkler, “Tinyreptile: Tinyml with federated meta-learning,” in 2023 International Joint Conference on Neural Networks (IJCNN), 2023, pp. 1–9.
[4] P. P. Ray, “A review on tinyml: State-of-the-art and prospects,” Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 4, pp. 1595–1623, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1319157821003335
[5] A. Mostafavi and A. Sadighi, “A Novel Online Machine Learning Approach for Real-Time Condition Monitoring of Rotating Machines,” in 2021 9th RSI International Conference on Robotics and Mechatronics (ICRoM), Nov. 2021, pp. 267–273.

[6] D. Pau, A. Khiari, and D. Denaro, “Online learning on tiny micro-controllers for anomaly detection in water distribution systems,” in 2021 IEEE 11th International Conference on Consumer Electronics (ICCE-Berlin), Nov. 2021, pp. 1–6.
[7] K. Sai Charan, “An Auto-Encoder Based TinyML Approach for Real-Time Anomaly Detection,” in 10th SAE India International Mobility Conference, Oct. 2022, pp. 2022–28–0406.
[8] T. Kohlheb, M. Sinapius, C. Pommer, and A. Boschmann, “Embedded autoencoder-based condition monitoring of rotating machinery,” in 2021 26th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Sep. 2021, pp. 1–4.
[9] H. Bosman, A. Liotta, G. Iacca, and H. Wörtche, “Anomaly detection in sensor systems using lightweight machine learning,” in 2013 IEEE International Conference on Systems, Man, and Cybernetics, 2013, pp. 7–13.
[10] C. Antonopoulos, A. Prayati, T. Stoyanova, C. Koulamas, and G. Papadopoulos, “Experimental evaluation of a wsn platform power consumption,” in 2009 IEEE International Symposium on Parallel Distributed Processing, 2009, pp. 1–8.
[11] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035.
[12] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[13] R. David, J. Duke, A. Jain, V. J. Reddi, N. Jeffries, J. Li, N. Kreeger, I. Nappier, M. Natraj, S. Regev, R. Rhodes, T. Wang, and P. Warden, “Tensorflow lite micro: Embedded machine learning on tinyml systems,” CoRR, vol. abs/2010.08678, 2020. [Online]. Available: https://arxiv.org/abs/2010.08678
[14] T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Q. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: end-to-end optimization stack for deep learning,” arXiv preprint arXiv:1802.04799, vol. 11, p. 20, 2018.
[15] D. Nadalini, M. Rusci, G. Tagliavini, L. Ravaglia, L. Benini, and F. Conti, “Pulp-trainlib: Enabling on-device training for risc-v multi-core mcus through performance-driven autotuning,” in Embedded Computer Systems: Architectures, Modeling, and Simulation, A. Orailoglu, M. Reichenbach, and M. Jung, Eds. Cham: Springer International Publishing, 2022, pp. 200–216.
[16] F. Sakr, R. Berta, J. Doyle, A. De Gloria, and F. Bellotti, “Self-learning pipeline for low-energy resource-constrained devices,” Energies, vol. 14, no. 20, 2021. [Online]. Available: https://www.mdpi.com/1996-1073/14/20/6636
[17] N. L. Giménez, F. Freitag, J. Lee, and H. Vandierendonck, “Comparison of two microcontroller boards for on-device model training in a keyword spotting task,” in 2022 11th Mediterranean Conference on Embedded Computing (MECO), 2022, pp. 1–4.
[18] H. Ren, D. Anicic, and T. A. Runkler, “Tinyol: Tinyml with online-learning on microcontrollers,” in 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8.
[19] S. Lee, B. Islam, Y. Luo, and S. Nirjon, “Intermittent learning: On-device machine learning on intermittently powered system,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 3, no. 4, Dec. 2019.
[20] J. Lin, L. Zhu, W.-M. Chen, W.-C. Wang, C. Gan, and S. Han, “On-device training under 256kb memory,” in Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
[21] B. Sudharsan, J. G. Breslin, and M. I. Ali, “Edge2train: A framework to train machine learning models (svms) on resource-constrained iot edge devices,” in Proceedings of the 10th International Conference on the Internet of Things, ser. IoT '20. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3410992.3411014
[22] G. Delnevo, S. Mirri, C. Prandi, and P. Manzoni, “An evaluation methodology to determine the actual limitations of a tinyml-based solution,” Internet of Things, vol. 22, p. 100729, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2542660523000525
[23] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015. [Online]. Available: https://doi.org/10.1038/nature14539
[24] B. Sudharsan, P. Yadav, J. G. Breslin, and M. Intizar Ali, “Train++: An incremental ml model training algorithm to create self-learning iot devices,” in 2021 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI), 2021, pp. 97–106.
[25] I. Hoyer, A. Utz, A. Lüdecke, M. Rohr, C. H. Antink, and K. Seidl, “Inference runtime of a neural network to detect atrial fibrillation on customized risc-v-based hardware,” Current Directions in Biomedical Engineering, vol. 8, no. 2, pp. 703–706, 2022. [Online]. Available: https://doi.org/10.1515/cdbme-2022-1179
[26] A. Moss, H. Lee, L. Xun, C. Min, F. Kawsar, and A. Montanari, “Ultra-low power dnn accelerators for iot: Resource characterization of the max78000,” in Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, ser. SenSys '22. New York, NY, USA: Association for Computing Machinery, 2023, pp. 934–940. [Online]. Available: https://doi.org/10.1145/3560905.3568300
[27] D.-M. Ngo, D. Lightbody, A. Temko, C. Pham-Quoc, N.-T. Tran, C. C. Murphy, and E. Popovici, “Hh-nids: Heterogeneous hardware-based network intrusion detection framework for iot security,” Future Internet, vol. 15, no. 1, 2023. [Online]. Available: https://www.mdpi.com/1999-5903/15/1/9
[28] L. Wulfert, C. Wiede, and A. Grabmaier, “Tinyfl: On-device training, communication and aggregation on a microcontroller for federated learning,” in 2023 21st IEEE Interregional NEWCAS Conference (NEWCAS), 2023, pp. 1–5.
[29] A. Osman, U. Abid, L. Gemma, M. Perotto, and D. Brunelli, “TinyML Platforms Benchmarking,” CoRR, 2021.
[30] P. P. Ray, “A review on tinyml: State-of-the-art and prospects,” Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 4, pp. 1595–1623, 2022.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[32] EdgeImpuls, “Edgeimpuls,” https://www.edgeimpulse.com/, 2022, accessed: 2022-10-17.
[33] STMicroelectronics, “Stm32cube.ai,” https://www.st.com/content/st_com/en/ecosystems/artificial-intelligence-ecosystem-stm32.html, 2022, accessed: 2022-05-17.
[34] ——, “X-cube-ai,” https://www.st.com/en/embedded-software/x-cube-ai.html, 2023, accessed: 2023-10-11.
[35] F. Chollet et al., “Keras,” https://keras.io, 2015.
[36] O. R. developers, “Onnx runtime,” https://onnxruntime.ai/, 2021.
[37] Cartesiam, “Nanoedge ai studio,” https://cartesiam-neai-docs.readthedocs-hosted.com/index.html, 2022, accessed: 2022-10-10.
[38] Microsoft, “Embedded learning library,” https://github.com/microsoft/ELL, 6 2020, accessed: 2022-11-15.
[39] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: http://arxiv.org/abs/1510.00149
[40] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, pp. 815–832.
[41] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1398–1406.
[42] E. Liberis and N. D. Lane, “Differentiable neural network pruning to enable smart applications on microcontrollers,” vol. 6, no. 4, Jan. 2023. [Online]. Available: https://doi.org/10.1145/3569468


[43] J. Lin, Y. Rao, J. Lu, and J. Zhou, “Runtime neural pruning,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/a51fb975227d6640e4fe47854476d133-Paper.pdf
[44] J. Liu, B. Zhuang, Z. Zhuang, Y. Guo, J. Huang, J. Zhu, and M. Tan, “Discrimination-aware network pruning for deep model compression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4035–4051, 2022.
[45] X. Ding, T. Hao, J. Tan, J. Liu, J. Han, Y. Guo, and G. Ding, “Resrep: Lossless cnn pruning via decoupling remembering and forgetting,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 4490–4500.
[46] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, “Pact: Parameterized clipping activation for quantized neural networks,” ArXiv, vol. abs/1805.06085, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:21721698
[47] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 525–542.
[48] F. Daghero, A. Burrello, C. Xie, M. Castellano, L. Gandolfi, A. Calimera, E. Macii, M. Poncino, and D. J. Pagliari, “Human activity recognition on microcontrollers with quantized and adaptive deep neural networks,” ACM Trans. Embed. Comput. Syst., vol. 21, no. 4, Aug. 2022. [Online]. Available: https://doi.org/10.1145/3542819
[49] M. Rusci, A. Capotondi, and L. Benini, “Memory-driven mixed low precision quantization for enabling deep network inference on microcontrollers,” CoRR, vol. abs/1905.13082, 2019. [Online]. Available: http://arxiv.org/abs/1905.13082
[50] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, “Haq: Hardware-aware automated quantization with mixed precision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[51] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” 2017.
[52] W. Chen, H. Qiu, J. Zhuang, C. Zhang, Y. Hu, Q. Lu, T. Wang, Y. Shi, M. Huang, and X. Xu, “Quantization of deep neural networks for accurate edge computing,” J. Emerg. Technol. Comput. Syst., vol. 17, no. 4, Jun. 2021. [Online]. Available: https://doi.org/10.1145/3451211
[53] J. Lin, W.-M. Chen, Y. Lin, J. Cohn, C. Gan, and S. Han, “Mcunet: Tiny deep learning on iot devices,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 11711–11722. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/86c51678350f656dcc7f490a43946ee5-Paper.pdf
[54] S. Xu, H. Li, B. Zhuang, J. Liu, J. Cao, C. Liang, and M. Tan, “Generative low-bitwidth data free quantization,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, 2020, pp. 1–17.
[55] Z. Xie, Z. Wen, J. Liu, Z. Liu, X. Wu, and M. Tan, “Deep transferring quantization,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, 2020, pp. 625–642.
[56] J. Lin, W.-M. Chen, H. Cai, C. Gan, and S. Han, “Mcunetv2: Memory-efficient patch-based inference for tiny deep learning,” in Annual Conference on Neural Information Processing Systems (NeurIPS), 2021.
[57] Z. Sun, C. Ge, J. Wang, M. Lin, H. Chen, H. Li, and X. Sun, “Entropy-driven mixed-precision quantization for deep network design,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 21508–21520. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/86e7ebb16d33d59e62d1b0a079ea058d-Paper-Conference.pdf
[58] B. Zoph, V. Vasudevan, J. Shlens, and Q. Le, “Learning transferable architectures for scalable image recognition,” Jun. 2018, pp. 8697–8710.
[59] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=r1Ue8Hcxg
[60] Y. Geifman and R. El-Yaniv, “Deep active learning with a neural architecture search,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/b59307fdacf7b2db12ec4bd5ca1caba8-Paper.pdf
[61] Z. Sun, C. Ge, J. Wang, M. Lin, H. Chen, H. Li, and X. Sun, “Entropy-driven mixed-precision quantization for deep network design,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 21508–21520. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/86e7ebb16d33d59e62d1b0a079ea058d-Paper-Conference.pdf
[62] Y. Li, C. Hao, P. Li, J. Xiong, and D. Chen, “Generic neural architecture search via regression,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 20476–20490. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2021/file/aba53da2f6340a8b89dc96d09d0d0430-Paper.pdf
[63] H. Benmeziane, K. E. Maghraoui, H. Ouarnoughi, S. Niar, M. Wistuba, and N. Wang, “A comprehensive survey on hardware-aware neural architecture search,” https://arxiv.org/abs/2101.09336, 2021.
[64] Y. Guo, Y. Zheng, M. Tan, Q. Chen, Z. Li, J. Chen, P. Zhao, and J. Huang, “Towards accurate and compact architectures via neural architecture transformer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6501–6516, 2022.
[65] F. De Vita, G. Nocera, D. Bruneo, V. Tomaselli, and M. Falchetto, “On-device training of deep learning models on edge microcontrollers,” in 2022 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing Communications (GreenCom) and IEEE Cyber, Physical Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), 2022, pp. 62–69.
[66] B. Sudharsan, J. G. Breslin, and M. I. Ali, “Ml-mcu: A framework to train ml classifiers on mcu-based iot edge devices,” IEEE Internet of Things Journal, vol. 9, no. 16, pp. 15007–15017, 2022.
[67] F. De Vita, R. M. A. Nawaiseh, D. Bruneo, V. Tomaselli, M. Lattuada, and M. Falchetto, “µ-ff: On-device forward-forward training algorithm for microcontrollers,” in 2023 IEEE International Conference on Smart Computing (SMARTCOMP), 2023, pp. 49–56.
[68] N. R. K P, G. .Y, S. .D, and M. Rajesh, “Comparison of programming languages: Review,” vol. 9, pp. 113–122, Jul. 2018.
[69] F. Funk, T. Bucksch, and D. Mueller-Gritschneder, “Ml training on a tiny microcontroller for a self-adaptive neural network-based dc motor speed controller,” in IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning, J. Gama, S. Pashami, A. Bifet, M. Sayed-Mouchawe, H. Fröning, F. Pernkopf, G. Schiele, and M. Blott, Eds. Cham: Springer International Publishing, 2020, pp. 268–279.
[70] A. Mostafavi and A. Sadighi, “A novel online machine learning approach for real-time condition monitoring of rotating machines,” in 2021 9th RSI International Conference on Robotics and Mechatronics (ICRoM), 2021, pp. 267–273.
[71] J. Guan and G. Liang, “A research of convolutional neural network model deployment in low- to medium-performance microcontrollers,” in Proceedings of the 2023 10th International Conference on Wireless Communication and Sensor Networks, ser. icWCSN '23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 44–50. [Online]. Available: https://doi.org/10.1145/3585967.3585975
[72] R. Heymsfeld. (2022) Arduinoann. http://robotics.hobbizine.com/arduinoann.html. accessed: 2022-10-10.
[73] K. Kopparapu, E. Lin, J. G. Breslin, and B. Sudharsan, “Tinyfedtl: Federated transfer learning on ubiquitous tiny iot devices,” in 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), 2022, pp. 79–81.
[74] N. Llisterri Giménez, J. Miquel Solé, and F. Freitag, “Embedded federated learning over a lora mesh network,” Pervasive and Mobile Computing, vol. 93, p. 101819, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1574119223000779


[75] N. Llisterri Giménez, M. Monfort Grau, R. Pueyo Centelles, and F. Freitag, “On-device training of machine learning models on microcontrollers with federated learning,” Electronics, vol. 11, no. 4, p. 573, 2022.
[76] G. Hinton, “The forward-forward algorithm: Some preliminary investigations,” 2022.
[77] H. Jaeger, “The ‘echo state’ approach to analysing and training recurrent neural networks, with an erratum note,” Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, vol. 148, 01 2001.
[78] N. N. Schraudolph, “A fast, compact approximation of the exponential function,” Neural Computation, vol. 11, no. 4, pp. 853–862, 1999.
[79] D. Costa, M. Costa, and S. Pinto, “Train me if you can: Decentralized learning on the deep edge,” Applied Sciences, vol. 12, no. 9, 2022. [Online]. Available: https://www.mdpi.com/2076-3417/12/9/4653
[80] Nordic Semiconductor, “nRF52840 DK,” https://www.nordicsemi.com/Products/Development-hardware/nRF52840-DK, 05 2022.
[81] PlatformIO, “PlatformIO Core,” https://platformio.org/, 2 2022, Version: 5.2.5.
[82] ARM. (2021) Gnu arm embedded toolchain. https://developer.arm.com/downloads/-/gnu-rm. accessed: 2022-11-15.
[83] TensorFlow, “TensorFlow Lite for Microcontrollers,” https://github.com/tensorflow/tflite-micro, 4 2022, Version: 20220407.
[84] L. Lai, N. Suda, and V. Chandra, “CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs,” CoRR, vol. abs/1801.06601, 2018.
[85] B. Sudharsan, S. Salerno, D.-D. Nguyen, M. Yahya, A. Wahid, P. Yadav, J. G. Breslin, and M. I. Ali, “TinyML Benchmark: Executing Fully Connected Neural Networks on Commodity Microcontrollers,” in 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), 2021, pp. 883–884.
[86] Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST Database of handwritten digits,” http://yann.lecun.com/exdb/mnist/.
[87] R. Fisher, “Iris,” https://archive.ics.uci.edu/ml/datasets/iris, 1988.
[88] W. Wolberg, W. Street, and O. Mangasarian, “Breast Cancer Wisconsin (Diagnostic),” https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29, 1995.
[89] A. Krizhevsky. (2009) Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. accessed: 2022-11-17.
[90] A. Chowdhery, P. Warden, J. Shlens, A. Howard, and R. Rhodes, “Visual wake words dataset,” CoRR, vol. abs/1906.05721, 2019. [Online]. Available: http://arxiv.org/abs/1906.05721
[91] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ser. ICML'15. JMLR.org, 2015, pp. 448–456.
[92] C. R. Banbury, V. J. Reddi, P. Torelli, J. Holleman, N. Jeffries, C. Király, P. Montino, D. Kanter, S. Ahmed, D. Pau, U. Thakker, A. Torrini, P. Warden, J. Cordaro, G. D. Guglielmo, J. M. Duarte, S. Gibellini, V. Parekh, H. Tran, N. Tran, W. Niu, and X. Xu, “Mlperf tiny benchmark,” CoRR, vol. abs/2106.07597, 2021. [Online]. Available: https://arxiv.org/abs/2106.07597

L. Wulfert received his B.Sc. degree in physical engineering from the University of Applied Sciences Gelsenkirchen, Gelsenkirchen, Germany, in 2018 and his M.Sc. degree in microsystem technology from the University of Applied Sciences Gelsenkirchen, Gelsenkirchen, Germany, in 2020. He is currently a Ph.D. student at the Fraunhofer Institute for Microelectronic Circuits and Systems. His research interests include federated learning and embedded systems.

J. Kühnel received his B.Eng. degree in electrical engineering from the University of Applied Sciences Bielefeld, Bielefeld, Germany, in 2019 and his M.Sc. degree in embedded systems engineering from the University of Duisburg-Essen, Duisburg, Germany, in 2021. He is currently a Ph.D. student at the Fraunhofer Institute for Microelectronic Circuits and Systems. His research interests include machine learning, embedded systems, and signal analysis.

L. Krupp received his B.Sc. degree in electrical and computer engineering from the Technical University of Kaiserslautern, Kaiserslautern, Germany, in 2018 and his M.Sc. degree in electrical and computer engineering from the Technical University of Kaiserslautern, Kaiserslautern, Germany, in 2019. He is currently a Ph.D. student at the Fraunhofer Institute for Microelectronic Circuits and Systems. His research interests include machine learning, predictive maintenance and embedded systems.

J. Viga received the B.Sc. degree in electrical engineering, information technology and computer engineering from RWTH Aachen University, Aachen, Germany, in 2019 and is currently completing his M.Sc. degree in computer engineering at RWTH Aachen University, Aachen, Germany. From 2019 to 2022, he worked on tiny-ML projects at the Fraunhofer Institute for Microelectronic Circuits and Systems as a student assistant. His research interests are mainly focused on the field of AI in embedded applications.

C. Wiede received his B.Sc. and M.Sc. degrees in biomedical engineering from Ilmenau University of Technology, Ilmenau, Germany, in 2011 and 2013, and his Ph.D. degree in electrical engineering and information technology from Chemnitz University of Technology, Chemnitz, Germany, in 2018. He is specialized in the field of computer vision, machine learning, artificial intelligence and their applications in industry and medicine. He is currently working at Fraunhofer IMS as head of Embedded AI.

P. Gembaczka received the M.Sc. degree in microtechnology and medical engineering from the University of Applied Sciences Gelsenkirchen, Gelsenkirchen, Germany, in 2010, and the Ph.D. degree in electrical engineering from the University of Duisburg-Essen, Duisburg, Germany, in 2014. Since 2010, he has been with the Fraunhofer Institute for Microelectronic Circuits and Systems, where he completed his Ph.D. and is currently program manager for the “Industrial AI” topic and product manager for the AI software framework AIfES.


A. Grabmaier received his M.S. and Ph.D. degrees in physics from the University of Stuttgart, Stuttgart, Germany, in 1989 and 1993, respectively. He was a Post-Doctoral researcher with the Deutsche Telekom Research Center, where he was engaged in modeling, technology, and characterization of DFB lasers. From 1994 to 1999, he was with Valeo Switches and Sensors GmbH, where he worked on the development of advanced electrical systems for automobiles. From 1999 to 2005, he was a Director at Siemens VDO (Sensor Innovation Management), responsible for the worldwide production of new sensors. In 2006, he joined the University of Duisburg-Essen, Duisburg, Germany, as a Full University Professor with the Electrical Engineering Faculty, where he is currently Chair of the Electronic Components and Circuits Department. Since 2006, he has been the Head of the Fraunhofer Institute IMS, Duisburg. He has published over 50 technical papers and articles. He is currently a member of the German Chamber of Industry and Commerce for Research and Technology, the Eduard Rhein Foundation, the Program Committee of the Microsystems Technology Congress of the GMM VDE/VDI Society, and the University Council of Hochschule Ruhr West. In 2009, he received Honorary Membership in the German Research Hall of Fame.

