Livro TinyML
Author: Marc Monfort Grau
Director: Dr. Felix Freitag
Co-director: Roger Pueyo Centelles
Abstract
TinyML aims to implement machine learning (ML) applications on small, low-powered devices such as microcontrollers. Typically, edge devices need to be connected to data centers in order to run ML applications. However, this approach is not possible in many scenarios, for instance when there is no connectivity. This project investigates the tools and techniques used in TinyML, the constraints of using low-powered devices, and the feasibility of implementing advanced machine learning applications on microcontrollers.
Contents
2 Project Planning
2.1 Duration
2.2 Task Definition
2.3 Resources
2.4 Risk Management: Alternative Plans
3 Budget
3.1 Personnel Costs per Task (PCT)
3.2 Generic Costs (GC)
3.3 Contingency
3.4 Incidental Costs
3.5 Final Budget
5 Embedded Systems
5.1 Microcontroller Boards
5.2 Sensors
5.3 Development Environments
9 Sustainability Analysis
9.1 Matrix of Sustainability
9.2 Project put into Production
9.2.1 Environmental
9.2.2 Economic
9.2.3 Social
9.3 Exploitation
9.3.1 Environmental
9.3.2 Economic
9.3.3 Social
9.4 Risks
9.4.1 Environmental
9.4.2 Economic
9.4.3 Social
9.5 Weighted Matrix
10 Conclusion
Bibliography
List of Figures
7.1 Microcontroller board setup with four external buttons for on-device training
7.2 Workflow diagram of the on-device training application
7.3 MFCC from the Montserrat keyword
7.4 Neural network diagram for the on-device training application
7.5 Loss vs. epochs during the training of the three keywords (Montserrat, Pedraforca and silence). Number of observations 130, learning rate 0.1, momentum 0.9.
7.6 Loss vs. epochs during the training of the three keywords (Montserrat, Pedraforca and silence). Number of observations 70, learning rate 0.3, momentum 0.9.
7.7 Loss vs. epochs during the training of the two keywords (Montserrat and Pedraforca, no silence). Number of observations 70, learning rate 0.3, momentum 0.9.
7.8 Loss vs. epochs during the training of the three keywords (Montserrat, Pedraforca and silence). Number of observations 200, learning rate 0.1, no momentum.
7.9 Loss vs. epochs during the training of the three keywords (vermell, verd and blau). Number of observations 60, learning rate 0.3, momentum 0.9.
1.1 Introduction
Machine learning applications often rely on cloud services offered by external companies. Devices (e.g., smartphones) running these applications have to transmit the data captured by their sensors (e.g., cameras, microphones) to data centers. This data is then processed by GPUs or TPUs^1, which offer high computing power. The result of the machine learning algorithm is sent back to the device to continue the application workflow. Although this approach allows running computationally demanding applications on low-powered devices, it also has disadvantages and requirements that cannot be met in all scenarios. For instance, having to send data between two separate locations can lead to excessive latency or, worse, can compromise data privacy (data leakage). In addition, the devices must be connected to the internet for the application to work. With these disadvantages it is no wonder that, of the 5 petabytes of data produced each day by IoT devices, less than 1% is ever analyzed or used at all [1].
TinyML is a new field that aims to implement machine learning applications on microcontrollers capable of performing data analytics at extremely low power. Therefore, TinyML applications can run continuously for long periods of time using only battery power or energy harvesting. The devices running a TinyML application do not need to be connected to the Internet, and there is no need to worry about privacy, as the data is analysed on the device itself. Microcontrollers are the only option when the power supply is restricted, size is a constraint or the budget is limited. However, using microcontrollers for machine learning applications brings with it many challenges to overcome.
^1 A TPU (Tensor Processing Unit) is an AI accelerator integrated circuit developed by Google for training neural networks.
1.1.1 Context
1.1.2 Concepts
TinyML lies at the intersection of machine learning and microcontrollers. The reader should therefore be familiar with the following concepts in order to properly understand the thesis.
Microcontroller
Machine Learning
Machine learning (ML) is a subfield of artificial intelligence (AI) that, with the help of statistics and large amounts of data, generates a model capable of identifying interesting patterns. The model can then be used to identify patterns in unseen data (in a process called inference) and make decisions based on them. There are many machine learning techniques (e.g., linear regression, SVM, decision trees), but for TinyML we are especially interested in artificial neural networks.
There are two main approaches to machine learning algorithms: supervised and unsupervised learning. In supervised learning, the data is labeled, so the machine knows what patterns to look for. In unsupervised learning, the data is not labeled and the machine is therefore responsible for identifying the patterns on its own. Supervised learning has become more popular and tends to produce better results. However, in many problems the data cannot be labeled, and the only option is to use unsupervised learning algorithms.
Artificial neural networks are algorithms intended to mimic the functioning of the brain. Figure 1.1 shows a representation of a neural network. A neural network is composed of several layers, and each layer contains several nodes representing neurons. The connections between neurons are represented by edges. Each neuron applies a mathematical function (e.g., a weighted sum, as in linear regression) to its input values and returns an output value that is propagated to the connected neurons. Before sending the output value, the neuron applies a function called the activation function (e.g., the sigmoid). The activation function is used to make the system non-linear. With a non-linear system, the neural network can identify much more complex patterns in the data.
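As an illustration (not code from the thesis), the following minimal Python sketch shows how a fully connected layer of sigmoid neurons computes its outputs from an input vector; the layer sizes and values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1), making the layer non-linear.
    return 1.0 / (1.0 + np.exp(-x))

def dense_layer(inputs, weights, biases):
    # Each neuron computes a weighted sum of its inputs plus a bias,
    # then applies the activation function before passing the value on.
    return sigmoid(weights @ inputs + biases)

# Toy example: 4 inputs feeding a layer of 3 neurons.
rng = np.random.default_rng(0)
x = rng.normal(size=4)           # input vector
W = rng.normal(size=(3, 4))      # one row of weights per neuron
b = np.zeros(3)                  # one bias per neuron
print(dense_layer(x, W, b))      # 3 output values in (0, 1)
```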
Microcontroller hardware
Component   PC                 Microcontroller
Compute     1 GHz - 4 GHz      1 MHz - 400 MHz
Memory      512 MB - 64 GB     2 KB - 512 KB
Storage     64 GB - 4 TB       32 KB - 2 MB
Power       30 W - 100 W       150 µW - 23.5 mW
Microcontroller software
In general-purpose computers there are three levels of abstraction: the high-level application, the libraries that provide support for those applications, and the operating system that provides support for the libraries and the applications. This architecture allows a lot of flexibility. However, microcontrollers are not general-purpose systems. They are typically designed to perform one task and therefore usually do not run any operating system. Many common libraries may not work in their standard version and may have to be adapted. This entails a loss of portability: we cannot be sure that the same code will work on different microcontroller devices, since they may not have the same components. Therefore, the second challenge we face is to enable TinyML applications across different microcontrollers.
The following two figures show the evolution of machine learning. Figure 1.2 shows the increase in the size of machine learning models over the last few years. State-of-the-art (SOTA) models have many times more parameters than older models. This trend makes it increasingly difficult to fit the latest models into memory-restricted devices. Besides the size, the computing power needed to train the models is also increasing. Therefore, from the machine learning perspective, the challenge is to shrink the size of the model and, at the same time, improve the training efficiency.
1.1.4 Stakeholders
• Research team: With the development of this thesis, the research team will become experts in the field of TinyML, which is expected to be decisive for the future of machine learning.
• Companies: The bright future of TinyML will have a very positive impact on companies that invest in and research the field. An early investment can help secure a dominant position. This thesis will show some of the most successful TinyML applications on the market.
• Scientific community: We hope that this thesis will inspire new lines of research on TinyML, or the development of innovative applications.
1.2 Justification
This situation calls for new strategies that allow the deployment of machine learning applications in these restricted scenarios.
1.3 Scope
This section defines the global objectives and sub-objectives, the functional requirements, and the potential obstacles and risks that the development of the thesis entails.
The thesis has two objectives. The first objective is to develop a basic TinyML application. This objective has been broken down into smaller sub-objectives.
The second objective is to identify advanced research directions in TinyML and try to develop some advanced applications. The scope of this objective will be highly influenced by the obstacles encountered and the lack of documentation, since we may go beyond the state of the art. Again, we have broken down the objectives into more detailed sub-objectives to be completed sequentially.
Every project involves a series of potential obstacles and risks that must be taken into account. In this thesis, we consider the greatest obstacle to be the time available. The field of TinyML combines two very different disciplines, machine learning and microcontrollers, and therefore requires not only software expertise but also embedded-hardware expertise. The lack of time can compromise the quality and the scope of the work, and we may fail to complete all the sub-objectives defined above.
A second major obstacle that we may face is the incompatibility between the software (libraries and frameworks) used for machine learning and the hardware (microcontroller boards and sensors) available for this project. We must carefully analyze all the components that are going to be used.
Lastly, we should be very careful when developing the TinyML application. Even though modern Integrated Development Environments (IDEs) have very sophisticated debugging functionalities, developing for microcontrollers is completely different: microcontroller IDEs are rarely able to debug the code due to the nature of the embedded platform.
This section defines the working methodology, the monitoring tools and the valida-
tion methods. It is very important for any project to correctly define these aspects in
order to optimize the work and avoid the obstacles.
The methodology is based on the Agile Manifesto. We will work in short iterations
(around two weeks). Each iteration has six phases to be completed sequentially:
planning, design, development, verification, review and deployment. At the end of
each iteration, the result will be evaluated among the project team. Then, we will
start the planning phase for the next iteration.
To monitor and revise the progress of the report, we will use cloud services that allow remote collaboration. Specifically, we will use Google Drive for sharing files and Overleaf for writing the final version of the report. To maintain the application code we will use the Git version control system, and to store the application we will use a public GitHub repository. These tools have been chosen because they are very popular, reliable and easy to use.
To organize the project, it has been agreed to hold a meeting every two weeks between the author, the director and the co-director of the thesis. These meetings will be held using the Jitsi platform.
2 Project Planning
2.1 Duration
The thesis starts on February 26, 2021, and ends on June 30, 2021, with a formal presentation. This period is equivalent to 125 days (including holidays). Considering four hours of daily dedication to the project (on average), this amounts to 500 hours. This is an approximation and can be affected by external factors.
The project tasks have been grouped into sections from A to F. Each section has several tasks, and each task may be split into subtasks. Requirements are inherited from section to task and from task to subtask. A section's estimated time is the sum of the times of its tasks, and a task's estimated time is the sum of the times of its subtasks. Table 2.1 shows a summary of the tasks with their dependencies and the required resources.
A - Project Management
This section includes the tasks of defining and organising the project. Estimated time
of 76 hours.
A1 - Context and scope: Definition of the context and scope of the project. It includes the justification of the project and the methodology to be followed. Estimated time of 14 hours.
B - Basic Applications
This section includes the first objective tasks. Estimated time of 146 hours.
B1 - Research (state of the art): Research on the TinyML state of the art. Esti-
mated time of 50 hours.
B2.2 - Visual Wake Word: Analysis of visual wake word technique. Esti-
mated time of 6 hours.
B3 - Learn Basic Tools: Study of the basic software tools used for developing
TinyML applications. Estimated time of 12 hours. It requires the B1 task to be
finished.
B3.2 - Google Colab: Study of Google Colab (taking into account the prior
knowledge). Estimated time of 2 hours. It requires access to the Google
Colab environment.
B6.3 - Design and Train Model: Designing and training the neural net-
work model. Estimated time of 10 hours. It requires the B6.2 task to be
finished.
C - Advanced Applications
This section includes the second objective tasks. Estimated time of 160 hours. It
requires the B tasks to be finished successfully.
D - Project Documentation
This section includes the project documentation tasks. Estimated time of 60 hours.
D2 - Report Revision: Revision of the chapters drafted in task D1. Estimated time of 32 hours. It requires task D1 to be finished.
E - Project Defense
This section includes the tasks to be completed before the project defense. Estimated time of 20 hours. It requires the D tasks to be finished.
F - Post-Mortem Analysis
This section includes a retrospective analysis of the project, intended to draw lessons for later projects, and should be done after the jury's evaluation. Estimated time of 2 hours (outside the deadline). It requires the presence of the author, the director and the co-director of this project.
2.3 Resources
Human resources
Material resources
Hardware
Software
2.4 Risk Management: Alternative Plans
As mentioned in Section 1.3.2, the development of the thesis entails several obstacles and risks. Below we define some possible solutions and alternative plans to be followed in case any of these obstacles materializes.
The deadline is the most important risk we face. To avoid underestimating the time of each task, we have added some extra time for unforeseen events. However, if this extra time is not enough, we have the possibility, as a last resort, of extending the deadline until October 2021 (4 extra months).
The scope of the project is likely to be affected by the results of section C. Since its tasks are strongly dependent, a negative result in one of them would make it impossible to perform any of the subsequent tasks. This risk is high, since the second objective is beyond the state of the art and may be unfeasible. This issue has already been foreseen by the team, and it has been agreed that section C will have the scope that can be reached within the deadline.
3 Budget
This section analyzes the economic costs involved in carrying out this thesis. It in-
cludes the personnel costs per activity, the generic costs, the contingency and the
incidental costs.
3.1 Personnel Costs per Task (PCT)
First, we will analyze the hourly cost of the human resources required by the project. We distinguish between the roles already mentioned in Section 2.3. Table 3.1 shows the gross hourly salary of each role, the social security cost (calculated as 30% of the gross salary), and the final salary as the sum of the two. Table 3.2 shows the cost of each task based on the number of hours and the salary of the personnel in charge.
3.2 Generic Costs (GC)

Amortization
Table 3.3 shows the amortization of the hardware devices used in the project. To
calculate the amortized cost, we have used the following formula:
cost_per_hour = resource_price × (1 / life_expectancy) × (1 / working_days) × (1 / working_hours)    (3.1)
The life expectancy of hardware devices is set to 10 years [8]. There are 252 working days in Catalonia in 2021, and each day has 8 working hours. The hours of use of each device are obtained from the sum of the hours of each task requiring the device. The software used is free and does not entail any cost.
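As a worked example (the device price and usage hours below are hypothetical, not values from the budget tables), a 1,000 € device used for 300 project hours would be amortized as:

cost_per_hour = 1000 € × (1/10) × (1/252) × (1/8) ≈ 0.0496 €/hour
amortized cost ≈ 0.0496 €/hour × 300 hours ≈ 14.88 €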
Indirect costs
The list below shows the indirect resources of the project and the parameters used
to calculate the cost.
Work space: The work will be done remotely. Monthly rent of 400€. Total of
494 project hours.
Work space cost = 400 € × (1 / 30 days) × (1 / 24 hours) × 494 hours
Internet: Monthly cost of 50 €. Average of 8 working hours per day. Total of 494 project hours.
Internet cost = 50 € × (1 / 30 days) × (1 / 8 hours) × 494 hours
3.3 Contingency
This project may face unforeseen events that result in cost overruns. The contingency cost is a percentage of the budget that is added to it in order to cover the cost of any unforeseen events that do appear. We have decided to set the contingency cost at 15% of the sum of the PCT and GC. This gives us 2071.224 €.
3.4 Incidental Costs
The obstacles that we may encounter while working on the project have already been described. To mitigate their cost, we will add to the budget, for each obstacle, its estimated cost multiplied by the probability of encountering it. Table 3.5 shows the incidental cost added to the budget for each obstacle.
3.5 Final Budget
Table 3.6 shows the final budget as the sum of the personnel costs per task, the generic costs, the contingency costs and the incidental costs.
It is likely that during the course of the project the actual costs will deviate from those estimated in the previous sections. It is important to monitor these deviations and act accordingly.
The personnel costs deviation is calculated as follows: a delay in any task from the Gantt chart will increase the personnel cost. In the same way, if any task is finished before the estimated time, the cost will be reduced, and the savings can cover the cost of the delayed tasks.
The generic costs deviation is calculated analogously: if there is a delay in any task that requires a resource, the cost of using that resource will rise.
The total deviation is the sum of the partial deviations mentioned above.
4 Tools and Techniques

In this section we describe the three most promising TinyML techniques. These techniques have been successfully applied in many applications that are commercialized in different products, which can be found in domestic, office or industrial scenarios.
Keyword spotting (KWS) is a speech recognition technique that deals with the identification of specific words in a short voice recording. Usually, applications based on this technique constantly capture sounds from the environment in order to identify predefined keywords with a machine learning model. When the model recognizes a keyword, the application triggers a signal to wake up a more powerful system, which is responsible for performing a more complex task (e.g., recognizing a full sentence). In simpler applications, the keyword only triggers an actuator (e.g., turning on an LED, opening a door).
There are several commercial products that use a keyword spotting approach. For example, many smartphones come with Google's "Ok Google" or Apple's "Hey Siri". Both applications allow sending a query after saying a specific keyword, without touching the device. Another application is Amazon's "Alexa", a home virtual assistant with the same functionality. It is also common to find keyword spotting products in car voice assistants, which allow the driver to interact with the multimedia system without distraction. It is expected that keyword spotting applications will be very important for the future of human-machine interfaces and IoT devices.
The visual wake word technique is the extension of keyword spotting to images. The application pipeline is very similar: a camera constantly captures images from the environment in order to identify (with a machine learning model) whether a specific object or person appears in the image. Again, if the model recognizes the object or the person, it triggers a response (e.g., activating an alarm, turning on the lights) or a more complex system (e.g., face recognition).
There are several applications on the market using the visual wake word technique. For example, Bird Buddy has designed a smart bird feeder with a built-in camera that is able to recognize the presence of birds and then send a notification to the user. Other popular applications are camera sensors that detect the presence of people in a room; the application could turn off the lights if no person is detected, or it could be used for security control. Again, the growth of IoT will only increase the use of visual wake word applications for automation and human-machine interaction.
Anomaly detection applications are mainly deployed in industry. They can be used for detecting malfunctions in factory machines. Tiny devices are attached to the machines and constantly analyze their sounds and vibrations in order to train a machine learning model (an autoencoder). After the model is trained, if the device detects an anomaly in the sound, it can send a signal to warn about a possible malfunction. The operator can then repair the machine and prevent a shutdown of the entire factory production. Anomaly detection applications can also be used in other scenarios, such as detecting vehicle engine failures or pipeline water leakage.
In this section we will analyze the tools and libraries used to develop a typical
TinyML application.
Python
Jupyter Notebooks
Jupyter Notebook is an interactive web application that combines software code, computational output, explanatory text and multimedia resources, all in a single document. This approach facilitates collaboration between different members of a project with a common development environment. Python with Jupyter Notebooks has gained a lot of popularity among machine learning researchers.
Google Colaboratory
Google Colaboratory (or Google Colab) is an online web application based on Jupyter Notebook that provides high computing power using cloud computing. It runs in a web browser without any prior configuration. Moreover, it makes sharing even easier and allows simultaneous editing of a Jupyter Notebook file.
TensorFlow
TensorFlow is an open-source machine learning framework developed by Google, designed to facilitate the building of machine learning models. It has many functionalities for data pre-processing, data ingestion, model evaluation, visualization and serving.
TensorFlow Lite. In TinyML we work with very constrained devices with limited computing power, memory and storage. However, basic TinyML applications usually do not have to perform the whole machine learning pipeline, nor do they need all the advanced techniques used by researchers. Therefore, many functionalities of TensorFlow are not required; only a small subset is used. This is exactly what TensorFlow Lite provides.
TensorFlow Lite Micro. TensorFlow Lite Micro is the state-of-the-art inference framework from Google. It is designed to run machine learning on microcontrollers and other devices with only a few kilobytes of memory. It takes the compression of the standard framework to the extreme, removing all but the essential functionalities. The core runtime fits in 16 KB. It does not require operating system support (it can run on bare metal), the standard C or C++ libraries, or dynamic memory allocation. On the downside, it is harder to troubleshoot issues, since the framework does not support any plotting or debugging tools. TensorFlow Lite Micro runs on a wide variety of embedded microcontrollers; the exact list of supported devices can be found on the official page.
We decided to use the TensorFlow Lite Micro framework for our TinyML applications because it is free, open source and very well documented. However, it is important to note that there are several alternatives:
• Apache TVM: an open-source machine learning compiler framework for CPUs, GPUs, and machine learning accelerators.
• uTensor: a free and open-source embedded machine learning infrastructure designed for rapid prototyping and deployment.
• Glow: a machine learning compiler and execution engine for hardware accelerators. It is designed to be used as a back-end for high-level machine learning frameworks.
5 Embedded Systems
5.1 Microcontroller Boards
We have available an Arduino Nano 33 BLE Sense and an Espressif ESP32 board. Both boards have enough computing power and memory capacity for basic TinyML applications, and both are supported by the TensorFlow Lite Micro framework. Table 5.1 compares the hardware characteristics of each board.
The table shows that the Espressif board is cheaper and has a higher clock speed, more memory and more storage than the Arduino board. However, the Arduino PCB offers many integrated sensors that can be very useful for TinyML applications.
5.2 Sensors
Sensors are basic components of any TinyML application. In TinyML we try to bring the computing power closer to the edge devices where the data is collected. Sensors are required to obtain the data that will later be used by the machine learning pipeline. Therefore, the performance and feasibility of any TinyML application strongly depend on the kind of sensor and the quality of the data captured.
The Arduino Nano 33 BLE Sense board already comes with several sensors integrated on the board: a nine-axis inertial sensor, a humidity and temperature sensor, a barometric sensor, a microphone, and a light sensor. We also have available an external image sensor that could be plugged into the board. This variety of sensors allows us to develop a wide range of TinyML applications. However, it is hard to find existing datasets for some of the listed sensors, and therefore we will focus on the most common one (the microphone).
5.3 Development Environments

Arduino IDE
The Arduino IDE is the official development environment for programming Arduino boards and deploying the application. Besides the standard version, there is a cloud-based version (Create Web Editor) and a professional version (Arduino Pro IDE). For this project we will use the standard Arduino Desktop IDE, only for prototyping and fast deployment.
PlatformIO
6 Basic Keyword Spotting Application

In this chapter, we will develop a basic keyword spotting application. The application will be able to recognize a custom set of keywords using the Arduino Nano 33 BLE Sense board. The application will be developed following two different approaches: first, using the TensorFlow framework, both for training and for running the neural network model; second, using the Edge Impulse web application. Finally, we will evaluate the benefits of using one approach over the other.
The first approach will be based on the Micro Speech example from the TensorFlow Lite Micro framework. This example provides a pre-trained keyword spotting model that can recognize two keywords, "yes" and "no", from audio samples. The application listens to the environment with an embedded microphone. If any of the keywords is recognized, the application turns on an LED. In the example, a green LED means that the application has recognized the "yes" keyword, and a red LED means that it has recognized the "no" keyword.
In the following sections we will go through all the steps necessary for developing a TinyML keyword spotting application. Some of these steps are common to other techniques (e.g., visual wake word, anomaly detection). We want the application to be able to recognize the words vermell, verd and blau (red, green and blue in Catalan). However, for creating our custom application, we will not develop all the parts explained below from scratch, but rather modify the necessary parts in order to adapt the example to our set of keywords.
The first step is to collect all the data that will be used to train the machine learning model. The collection stage is often the most challenging and time-consuming, unless an existing dataset is available. The dataset must be large and specific, as the feasibility of our application is highly determined by the quality and size of the dataset. A bad dataset would result in poor performance of the machine learning model, which is the core component of the application.
Although there are some well-known datasets for keyword spotting, we want the application to recognize our custom set of keywords. Therefore, we will not reuse any dataset, but create our own. To do so, we will record several samples of each of the three keywords (vermell, verd and blau) using a voice recorder application. The process of creating the dataset is cumbersome, since we have to save every recorded word in a separate file, which can require a lot of audio-editing time. Our final dataset has around 100 one-second samples for each keyword. Although typical datasets often have thousands (or even millions) of samples, we have neither the time nor the resources to obtain that many. Nevertheless, a keyword spotting model is usually very small compared to typical machine learning models, so with a small dataset we can still obtain good performance.
After obtaining enough samples for the dataset, we have to process the data in order to extract more relevant features. This process is known as feature engineering. Another possibility is to simply leave the input data unprocessed. However, pre-processing the data usually gives better performance and reduces the size of the model.
The data obtained for our keyword spotting dataset is in the form of digital audio signals. To obtain this digital representation, the microphone captures the vibrations of the sound waves and converts them into electrical signals in the form of voltage variations. Finally, the electrical signals are converted into digital signals using an analog-to-digital converter (ADC). The number of data points captured each second depends on the sample rate used by the microphone. For example, a typical 16 kHz sample rate captures 16,000 data points each second, which may be too much data for a single sample. Figure 6.1 shows the digital representation of the words vermell, verd and blau.
To obtain the feature vector from an audio signal, we have to perform several steps. The first step is to align the critical part of the signal that contains the spoken word. As can be seen in Figure 6.1, the start and the end of the audio contain no sound and should therefore be removed. The simplest way is to extract the loudest second from the audio sample [11].
The second step is to extract the unique features of each sound in the word. Audio signals can be decomposed into primitive signals of fundamental frequencies. Using the Fourier transform, we can decompose the audio and obtain the frequency-domain representation of each word. Figure 6.2 shows the frequencies present in the three keywords (vermell, verd and blau). By extracting the critical frequencies, we obtain the fingerprint of each word. However, to generate the frequency spectrogram of a word, we have to apply the Fourier transform along the whole word using short sliding windows (from 20 to 30 milliseconds). This way, we obtain a spectrogram showing how the frequencies of each word evolve over time, as seen in Figure 6.3.
The last step of the data processing stage is to obtain the Mel Frequency Cepstral Coefficients (MFCC) from the spectrogram (Figure 6.3). This technique is based on the fact that the human ear distinguishes low frequencies more easily than high frequencies, as described by the Mel scale. Applying the Mel filter bank (Figure 6.4), the spectrogram is re-weighted so that the extracted features better match human hearing. The result of the MFCC is shown in Figure 6.5. The difference between the MFCC and the previous spectrogram is that the MFCC takes human audio perception into account.
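As an illustration of this processing step (not the exact pipeline used in the thesis, which relies on the TensorFlow speech-commands frontend), the following Python sketch computes MFCC features from a one-second 16 kHz recording with the librosa library; the file name and parameter values are assumptions.

```python
import librosa

# Load a one-second recording at 16 kHz (file name is hypothetical).
signal, sample_rate = librosa.load("vermell_001.wav", sr=16000, duration=1.0)

# Compute MFCCs over ~30 ms windows with a ~20 ms hop, similar in spirit
# to the sliding-window Fourier transform described above.
mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sample_rate,
    n_mfcc=13,        # number of cepstral coefficients per frame
    n_mels=40,        # Mel filter bank size
    n_fft=480,        # 30 ms window at 16 kHz
    hop_length=320,   # 20 ms hop at 16 kHz
)

# The resulting 2D array (coefficients x frames) is the image-like
# feature vector that is fed to the neural network.
print(mfcc.shape)
```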
After applying all the processing steps, we have converted the audio signal from the time domain into a frequency-domain representation in the form of an image (Figure 6.5). This image is the feature vector that will be used to feed the neural network model. We could further process the audio signal (e.g., signal cleaning, amplitude normalization), but with this feature vector we can already get good performance from the application.
Once our dataset is processed, we can start designing the machine learning model. Usually, models are selected based only on their performance; however, other metrics may also be important, such as simplicity. In this application, we will use a neural network model. This network needs to be very small in order to be deployed on our microcontroller.
Convolutional neural networks are very popular for images, and since the feature vector is in the form of an image (Figure 6.5), a convolutional neural network is a natural fit. The model used in the TensorFlow example, and the one we are going to use in our application, is called TinyConv. This convolutional model is composed of a 2D convolutional layer followed by a single dense layer with a softmax activation function. The model architecture can be seen in Figure 6.6. The model has 16,652 parameters (calculated below). Therefore, the model fits comfortably into the Arduino Nano 33, which has 1 MB of flash memory and 256 kB of RAM. Although the model is very small, it is possible to get good performance because TinyML applications (like keyword spotting) are domain-specific. Therefore, we only need a simple model that performs very well on a very specific task.
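A sketch of the TinyConv architecture in Keras is shown below; the exact layer dimensions (8 filters of 10×8 with stride 2 over a 49×40 input, and 4 output classes) follow the public micro_speech example and should be treated as assumptions rather than the thesis's exact configuration.

```python
import tensorflow as tf

# Sketch of the TinyConv architecture.
# Input: 49 time frames x 40 MFCC-like features, one channel.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),
    tf.keras.layers.Conv2D(8, kernel_size=(10, 8), strides=(2, 2),
                           padding="same", activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation="softmax"),  # silence/unknown + keywords
])

model.summary()
# Parameter count:
#   Conv2D: 10*8*1*8 weights + 8 biases       =   648
#   Dense : (25*20*8)*4 weights + 4 biases    = 16004
#   Total                                     = 16652
```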
With the model already defined, the next step is to train it using our dataset. The training process consists of tuning the model's parameters (weights and biases) so that it learns to recognize the patterns in a particular dataset. The Python script we used to train our custom keyword spotting model is based on the Google Colab template from HarvardX's Tiny Machine Learning course. In turn, this script uses the TensorFlow train.py script to perform the model training. With this script, we are essentially training the TinyConv model using the gradient descent algorithm for 12000 training steps with a learning rate of 0.001, followed by 3000 final steps with a learning rate of 0.0001. This way, in the last iterations, the model slowly approaches the local minimum without overshooting it. The batch size is 100, which is the number of data samples processed in each training step. The time it takes to train the model using Google Colab's GPUs is around two hours. The accuracy obtained is 0.957. This result could be improved by fine-tuning the training hyperparameters.
The last step before deploying the model on the microcontroller is to convert it to an appropriate format. As we already mentioned, the TinyConv model has 16,652 parameters. If these parameters are represented as floats (4 bytes each), the model occupies 66.6 kB. But if we represent the model using unsigned integers (1 byte each), the size is reduced to 16.6 kB. To obtain such a small size, we have to quantize the model. This technique allows our model to run using only 8-bit arithmetic, which is available in most embedded systems.
Neural networks are able to find patterns in data despite noise (e.g., a blurry picture, lighting changes). When the precision of the model is reduced from four bytes to one byte, we are just adding a bit more noise that needs to be filtered out, and the model can still produce good results. With quantization we can reduce the model size and improve the inference latency, with a small loss in accuracy. Moreover, the power consumption is also reduced, which is very important for edge devices that run on batteries.
In quantization-aware training, the model is quantized during the training phase in order to reduce the errors introduced by the quantization: the weights are adjusted taking the quantization noise into account. However, we will only perform post-training quantization, since it adds less complexity to the training phase.
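For illustration, a minimal sketch of post-training int8 quantization with the TensorFlow Lite converter is shown below; the representative-dataset generator, the file name and the reuse of the `model` object from the earlier sketch are assumptions, not code from the thesis.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield a few feature vectors so the converter can calibrate the
    # ranges used to map float values onto 8-bit integers.
    for _ in range(100):
        sample = np.random.rand(1, 49, 40, 1).astype(np.float32)  # stand-in for real MFCCs
        yield [sample]

converter = tf.lite.TFLiteConverter.from_keras_model(model)  # 'model' from the sketch above
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()                 # serialized FlatBuffer bytes
open("micro_speech_int8.tflite", "wb").write(tflite_model)
```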
Pruning is another technique to reduce the size of the model. It consists of removing the weights with the lowest values, which do not contribute much to the final model performance. The result after pruning is a model with the same architecture but fewer effective parameters (a sparser model) [12]. Figure 6.7 shows a representation of a model after being pruned. This technique can be done automatically by TensorFlow.
Figure 6.7: Original model vs. pruned model (Towards Data Science)
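A hedged sketch of how such magnitude-based pruning can be applied with the TensorFlow Model Optimization toolkit is shown below; the sparsity schedule values are illustrative assumptions.

```python
import tensorflow_model_optimization as tfmot
import tensorflow as tf

# Wrap the model so that, during training, the lowest-magnitude weights
# are progressively forced to zero until 50% of them are pruned.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000,
)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
# Training would proceed as usual, with the pruning callback attached:
# pruned_model.fit(x, y, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before converting the model for deployment.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```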
After pruning and quantizing the model, we have to convert it into the FlatBuffer format (the format used by TensorFlow Lite, see Figure 6.8). FlatBuffers is a cross-platform serialization library compatible with many programming languages (C++, C#, Python, JavaScript, etc.). Some of its advantages for tiny devices are: access to serialized data without parsing/unpacking, memory efficiency and speed, and a tiny code footprint [13]. The FlatBuffer model is serialized into a byte array in order to be deployed on the microcontroller.
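This conversion is commonly done with the xxd command-line tool; the small Python function below is an equivalent sketch (file and array names are assumptions) that writes the .tflite bytes as a C array ready to be compiled into the firmware.

```python
def tflite_to_c_array(tflite_path: str, header_path: str, array_name: str = "g_model") -> None:
    # Read the serialized FlatBuffer and emit it as a C byte array plus its length,
    # which is how TensorFlow Lite Micro expects the model to be embedded.
    with open(tflite_path, "rb") as f:
        data = f.read()
    lines = []
    for i in range(0, len(data), 12):
        chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        lines.append("  " + chunk + ",")
    with open(header_path, "w") as f:
        f.write(f"const unsigned char {array_name}[] = {{\n")
        f.write("\n".join(lines))
        f.write(f"\n}};\nconst unsigned int {array_name}_len = {len(data)};\n")

tflite_to_c_array("micro_speech_int8.tflite", "model_data.h")
```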
Finally, we can deploy the trained and serialized model on the microcontroller to run the keyword spotting application. This basic application is based on the already mentioned TensorFlow micro speech example. Therefore, we will only modify the parts needed to adapt it to our set of keywords.
The deployed keyword spotting application is able to recognize the words vermell, verd and blau. When the device recognizes any of the keywords, it turns on the RGB LED with the corresponding color. This simple application could easily be upgraded to perform more complex tasks using the already trained model. Although we have only implemented a simple keyword spotting application based on previous examples, the study of the TinyML pipeline will help us to develop more complex applications in the following chapters.
The second approach uses the Edge Impulse platform to develop an application that recognizes the same set of keywords (vermell, verd and blau). The development process will be much faster than in the previous approach, since the Edge Impulse platform is very fast and simple to use.
Again, the first step is to create a dataset to train the keyword spotting model. Edge Impulse provides a firmware that allows connecting a microcontroller board to the platform. Therefore, it is possible to create our dataset using the same board sensors that will be used by the application when deployed. For creating our keyword spotting dataset, we have connected the Arduino Nano 33 BLE Sense board to the platform and used the embedded microphone to record samples of the keywords, which are sent directly to the Edge Impulse platform. The maximum length of a recorded audio is about 10 seconds. However, it is possible to split the audio into one-second samples, each containing a single word. Figure 6.9 shows a long audio sample that is split into six segments using the Edge Impulse platform. The segments can be randomly shifted to make the application more robust to unaligned audio. After obtaining enough samples for each keyword, we have to rebalance our dataset between the training and testing sets.
Figure 6.9: Edge Impulse interface to split an audio recording into several shifted sections.
The next step is to configure the machine learning pipeline. The raw data has to be processed before being sent to the neural network. We will use the Mel Frequency Cepstral Coefficients (with 13 coefficients) to generate the feature vector of each audio sample, as we did in the previous approach. To configure the neural network model, we can use the Keras library. However, the platform also allows tweaking the parameters with a visual interface. We will use the recommended neural network, which can be seen in Figure 6.10.
Our implementation with Edge Impulse is almost complete. The last step is to train the model. The model is trained with a learning rate of 0.005 for 100 training cycles. We have obtained an accuracy of 0.948 and a loss of 0.17. A very useful feature of the Edge Impulse platform is the feature explorer, which allows visualizing the classification of the data in a 3D plot and seeing the cluster created by each keyword. After training the model, we can generate the firmware that contains the application code and all the necessary libraries for the microcontroller board (Arduino Nano 33). In this final step, the neural network can be quantized in order to optimize the latency and reduce the RAM and ROM usage, with a very low drop in accuracy^2.
To deploy the application, we have to flash the downloaded firmware onto the microcontroller board. By default, the application driver is programmed to constantly record one-second audio samples, process the audio (MFCC) to obtain the feature vector, and finally call the trained model to classify the audio among the keywords. The result is sent over the serial output and can be seen using a serial monitor. This driver can be easily adapted in order to build more sophisticated keyword spotting applications.

^2 The project can be found at https://studio.edgeimpulse.com/public/25158/latest
6.3 Results
In both approaches, we have obtained a very good model accuracy (above 0.94) on the testing set. However, the performance of the first-approach application, when deployed on the microcontroller, drops considerably. On average, the application correctly recognizes a keyword one out of four times (around 25% accuracy). This low performance is due to the difference in quality between the training dataset and the audio recorded by the embedded microphone: the training dataset was created using a laptop microphone, which captures audio with much higher quality than the Arduino's microphone.
In the second approach, we created the training dataset using the same low-quality microphone of the Arduino board. The performance obtained by the deployed application is higher than in the first approach, with the application being able to correctly recognize a keyword two out of three times (around 66% accuracy). The remaining drop in accuracy is produced by keywords that are recorded with a strange alignment: the application is constantly recording the environment, so it is not always possible to capture each keyword from the start. This could be solved by creating a bigger dataset containing audio with a wide variety of alignments.
Using the Edge Impulse platform, we have obtained better performance. Moreover, the development process was much easier and faster than in the first approach. However, developing the first-approach application has been very useful, since it allowed us to understand the TinyML pipeline and the challenges of the keyword spotting technique in more depth. The results have also shown the importance of having a good training dataset. In the next chapters we will develop more advanced applications based on the techniques learned here.
7 On-Device Model Training
Training a neural network model is all about finding the best parameters (weights and biases) in order to minimize the loss function. The loss function measures the ability of the model to correctly classify data (classification problems) or predict a value (regression problems). In a supervised learning approach, we need a dataset for training the model. It is very important to have a good dataset, since the model will be trained to recognize the patterns in it. To train a neural network model we use the gradient descent algorithm, an iterative optimization algorithm that finds a local minimum of the model's loss function. In each iteration of the algorithm, the model tries to classify a batch of data. From the error obtained, the algorithm adjusts the parameters in order to reduce it; this last step is called backpropagation. Usually, the training algorithm has to iterate many times over the dataset.
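In each iteration, every weight w and bias b is moved a small step against the gradient of the loss L, where η is the learning rate (standard gradient descent notation, not reproduced from the thesis):

w ← w − η · ∂L/∂w        b ← b − η · ∂L/∂b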
Typical TinyML applications perform the training phase outside the microcontroller board. In the previous keyword spotting application, we used a pre-trained model: the model was trained in Google Colab with the standard TensorFlow framework, then pruned and quantized to reduce its size, and finally compiled into a specific format to be flashed onto the microcontroller board. However, it was not possible to improve the model after deployment; the application only used the model to classify new data. In this approach, the user cannot improve the model with their own local data (their own voice) in order to improve the performance of the application.
Training a model is a computationally intensive task that some devices cannot perform. Some of the biggest artificial neural networks can take up to months and many watts of power to train. This amount of computing power is not feasible on tiny devices like microcontrollers. However, our intention is not to train a general-purpose model (e.g., GPT-3), but a very specific model that performs well on a single task. The keyword spotting model we used in the previous application had a size of 16 kB and only 16,652 parameters. Running a gradient descent iteration on this model with the Arduino Nano 33 takes less than a second. Therefore, it is computationally possible to train a specific model on a tiny device.
Transfer learning is a technique that can significantly reduce the time and the computing power required to train a model. It takes advantage of a previously trained model to train a new one. For example, we could train a model that recognizes Greyhounds in images by making use of a pre-existing model that recognizes dogs. Instead of training a new model from scratch, with transfer learning we reuse some of the parameters (weights and biases) already trained in a previous model. Usually, we only take the parameters from the first layers, since their job is to recognize the most basic patterns in the data. The new model only has to train the layers that are closer to the output, which are responsible for recognizing the most specific patterns. Therefore, with transfer learning we only have to train a small subset of the layers of the neural network. However, transfer learning has some limitations: if the purpose of the previously trained model is not related to the purpose of the new model, the result will be very poor. The problems need to be similar enough, which is quite hard to discern [14].
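As an illustration of the idea (a generic Keras sketch with assumed model and file names, not code used in this thesis), one would freeze the early layers of an existing model and train only a new output head:

```python
import tensorflow as tf

# Hypothetical pre-trained "dog recognizer" saved earlier.
base_model = tf.keras.models.load_model("dog_model.h5")

# Freeze the early layers: they already detect generic, low-level patterns.
for layer in base_model.layers[:-1]:
    layer.trainable = False

# Replace the output layer with a new head for the new, related task.
features = base_model.layers[-2].output
new_output = tf.keras.layers.Dense(1, activation="sigmoid", name="greyhound")(features)
new_model = tf.keras.Model(inputs=base_model.input, outputs=new_output)

# Only the new head (and any unfrozen layers) is updated during training.
new_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# new_model.fit(greyhound_images, greyhound_labels, epochs=5)
```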
This section shows the development of a TinyML application^1 that demonstrates the feasibility of training a model on a microcontroller. We will develop a program that is able to train a keyword spotting model. The application will be able to recognize three different keywords, which are decided by the user when starting to train the model. When the application is restarted, the model has to be trained again, using the same or a new set of keywords. The user can also test the accuracy of the application at recognizing the keywords, to see the training progress.
To interact with the application, we need to set up the board. The application will be deployed on an Arduino Nano 33 BLE Sense board, which already comes with several components. The board has an integrated microphone, which will be used to record the keywords. It also has an integrated white LED and an RGB LED. The white LED will be used to show the application state (e.g., idle, busy), and the RGB LED will be used to show the output class of the keyword spotting model. To train the model, we need to connect three buttons to the board; each button will allow training one of the keywords. A fourth button will be added for testing the model without training it. Figure 7.1 shows a way to connect the buttons to the Arduino board.
7.2.2 Workflow
The application starts when the program is flashed to the Arduino board, or when the board is restarted with the program already flashed. The board should have a setup similar to the one shown in Figure 7.1. Every time the application starts, a new model is created, with its weights and biases initialized to random numbers. Once the model is initialized, the user can start to train it using the three training buttons. Each button allows training one of the three keywords. When a training button is pressed, the RGB LED turns on with a color identifying the button (red, green or blue). When the button is released, the Arduino built-in microphone starts recording one second of audio; the keyword should be said within this second. The recorded audio is processed to obtain the feature vector. Finally, the model is trained with the feature vector and the expected keyword (known from the button pressed). The result of each training iteration is sent through the serial port. The fourth button has the same workflow as the three training buttons, but it does not train the model. Instead, it turns on the RGB LED with the color associated with the keyword that the model has recognized. Figure 7.2 shows a diagram of the application workflow.
Before training the model, we need to obtain the feature vector. As mentioned above, just after releasing any of the buttons, the microphone starts to record for one second. The microphone records with a sample rate of 16 kHz, and the result is then processed to obtain the MFCC feature vector.
The neural network model has to be small and fast to work on the Arduino board. The keyword spotting problem is very simple compared to general speech recognition problems; therefore, our model does not need millions of parameters to perform well. The model used in our application is a feed-forward neural network with an input layer of 650 nodes, a single hidden layer of 16 nodes, and an output layer of 3 nodes (each node representing a keyword). The activation function used is the sigmoid. Although the ReLU function is more popular for deep learning models, for smaller models the sigmoid can be a better choice. This architecture gives a total of 10,467 parameters (10,448 weights and 19 biases). We will use 4-byte floats to represent the parameters (the maximum precision allowed by the Arduino framework). Therefore, the model has a total size of 41,868 bytes. These values are not stored in the slow Arduino flash memory (1 MB), because they are constantly modified during the training phase; they need to be stored in RAM (preferably on the heap). This is not an issue on the Arduino Nano 33, since it has 256 kB of SRAM. Figure 7.4 shows a diagram of the neural network.
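The parameter and memory counts follow directly from the layer sizes:

weights = 650 × 16 + 16 × 3 = 10,448
biases = 16 + 3 = 19
parameters = 10,448 + 19 = 10,467
model size = 10,467 parameters × 4 bytes = 41,868 bytes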
The model is trained following an online learning approach [16]. As already mentioned, the model is created every time the application is restarted, with the weights and biases initialized to random floats in the range between -0.5 and 0.5. When a training button is pressed (and released), the board records a keyword (said by the user) and generates the feature vector (MFCC). Then, the feature vector is fed to the model to perform a forward propagation and obtain the three output values. With these output values and the expected values (known from the button pressed), we can calculate the mean squared error. The final step is to calculate the delta^2 of each neuron in order to perform a gradient descent iteration. To sum up, we are following an online learning approach, where the model is trained using the most recently recorded audio.

^2 The delta value reflects the magnitude of the error of each node.
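The sketch below illustrates one such online training step for the 650-16-3 sigmoid network in Python; the thesis implements this logic in C++ on the Arduino, so the layer sizes and default hyperparameters follow the text, while everything else is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyKwsNet:
    """650-16-3 feed-forward network trained one sample at a time (online learning)."""

    def __init__(self, rng):
        # Weights and biases initialized to random values in [-0.5, 0.5].
        self.W1 = rng.uniform(-0.5, 0.5, (16, 650))
        self.b1 = rng.uniform(-0.5, 0.5, 16)
        self.W2 = rng.uniform(-0.5, 0.5, (3, 16))
        self.b2 = rng.uniform(-0.5, 0.5, 3)
        # Momentum buffers (keeping previous updates costs extra RAM, as noted in the text).
        self.vW1 = np.zeros_like(self.W1); self.vb1 = np.zeros_like(self.b1)
        self.vW2 = np.zeros_like(self.W2); self.vb2 = np.zeros_like(self.b2)

    def train_step(self, x, target, lr=0.3, momentum=0.9):
        # Forward propagation.
        h = sigmoid(self.W1 @ x + self.b1)
        y = sigmoid(self.W2 @ h + self.b2)
        loss = np.mean((y - target) ** 2)            # mean squared error
        # Backpropagation: delta reflects the error magnitude at each node.
        delta_out = (y - target) * y * (1.0 - y)     # sigmoid derivative is y * (1 - y)
        delta_hid = (self.W2.T @ delta_out) * h * (1.0 - h)
        # Momentum update: combine the previous direction with the new gradient.
        self.vW2 = momentum * self.vW2 - lr * np.outer(delta_out, h)
        self.vb2 = momentum * self.vb2 - lr * delta_out
        self.vW1 = momentum * self.vW1 - lr * np.outer(delta_hid, x)
        self.vb1 = momentum * self.vb1 - lr * delta_hid
        self.W2 += self.vW2; self.b2 += self.vb2
        self.W1 += self.vW1; self.b1 += self.vb1
        return loss

# Example: one training step on a recording labeled as the first keyword.
rng = np.random.default_rng(0)
net = TinyKwsNet(rng)
features = rng.random(650)                           # stand-in for a real MFCC feature vector
print(net.train_step(features, np.array([1.0, 0.0, 0.0])))
```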
To optimize the model, we can fine-tune the training hyperparameters. The most important hyperparameter is the learning rate, which controls how much the model is updated in response to the estimated error of each training epoch. Choosing the correct learning rate is quite challenging: a value that is too small may result in a long training phase, while a value that is too high may result in an unstable training process [17]. The default learning rate is 0.3. This value is quite high compared with the values used for training more complex models. However, since the application is fully trained by the user, it is necessary not to extend the training phase for too long.
The second hyperparameter is the momentum. The momentum tries to maintain a consistent direction in the gradient descent algorithm. It works by combining the previous update vector with the newly computed gradient vector. The default value used in our setup is 0.9, which adds 90% of the previous direction to the new direction. We note that using momentum consumes additional RAM, since it is necessary to store the previous gradients.
7.2.5 Results
In order to evaluate the performance of on-device training, the model is trained with two different sets of keywords. The first set contains two spoken words, namely Montserrat and Pedraforca (two iconic mountains of Catalonia), and a third class for silence. The keywords are spoken to the device in alternating order. Figure 7.5 shows the result of the training process. The erratic short-scale behaviour is produced by the online learning approach, where each epoch uses a single audio recording, and therefore the accuracy obtained can vary significantly between one word and the next. However, the important observation is that the loss progressively decays over the epochs.
Figure 7.5: Loss vs. epochs during the training of the three keywords (Montserrat, Pedraforca and silence). Number of observations is 130, learning rate 0.1, momentum 0.9.
In the previous experiment, we set a learning rate of 0.1. With this learning rate, around 80 epochs were needed for the loss to converge. However, 80 training epochs may take too long for the user, since for each epoch the user has to say a keyword. Therefore, in this second experiment, we increased the learning rate to 0.3. Figure 7.6 shows the result with the updated value. Again, we see that the loss progressively decays over the epochs, but this time only about 30 epochs are needed to converge.
Figure 7.6: Loss vs. epochs during the training of the three keywords (Montserrat, Pedraforca and silence). Number of observations is 70, learning rate 0.3, momentum 0.9.
So far, we have trained the model using three keywords: two words (Montserrat, Pedraforca) and a silence class. In this third experiment, we trained the model using only the words Montserrat and Pedraforca. The result is similar to the previous experiment, but we obtained a much more consistent accuracy in the last epochs (Figure 7.7). Although we are not able to classify silence, when no word is spoken the model shows a low probability for both keywords. Therefore, we could set a threshold to also classify silence or unknown words.
Figure 7.7: Loss vs. epochs during the training of the two keywords (Montserrat and Pedraforca; no silence). Number of observations is 70, learning rate 0.3, momentum 0.9.
In this fourth experiment, the model is trained with the three keywords (Montserrat, Pedraforca and silence) and a learning rate of 0.1. However, we removed the momentum parameter. The result is shown in Figure 7.8. With this setup, it takes more
than 200 training epochs for the loss to converge. After analyzing the previous ex-
periments, we have decided to set a default learning rate of 0.3 and a momentum of
0.9.
Figure 7.8: Loss vs. epochs during the training of the three keywords (Montserrat, Pedraforca and silence). Number of observations is 200, learning rate 0.1, no momentum.
To have a more reliable evaluation, we performed a final experiment with a different set of keywords. This set contains the words vermell, verd and blau (red, green and blue in Catalan). Figure 7.9 shows the results of the training with the default hyperparameters (learning rate 0.3, momentum 0.9). Again, it can be seen how the loss decays over the epochs, showing that the application works with different sets of keywords.
Figure 7.9: Loss vs. epochs during the training of the three keywords (vermell, verd and blau). Number of observations is 60, learning rate 0.3, momentum 0.9.
The results are very positive. They prove the feasibility of training a neural network model on a microcontroller, at least for a problem as complex as keyword spotting. It is important to note that the application is not fully optimized; in fact, several improvements could be made to increase the performance. For example, we could add a convolutional layer to the model to process the input image (Figure 7.3), which would reduce the short-scale behaviour. We could also tweak the number of hidden layers and neurons, or use a more complex Fourier transform or more coefficients in the MFCC algorithm for processing the audio. However, we are satisfied with the results obtained, and we leave these improvements for future applications.
8 Federated Learning with Microcontrollers
Over the last few years, federated learning has raised the interest of the research
community, as it provides a means to train machine learning models on distributed
devices without sharing the local training data [19]. In federated learning, instead
of training a single model with a centralized dataset, local models are trained with
local datasets and then merged into a global model. The inconvenience of having
less data on each device can be compensated by the capacity of the global model
built upon the local ones. Federated learning is also seen as a solution to train with
data which, for privacy reasons, cannot be sent to the cloud, such as medical records
[20]. Therefore, the main advantages of federated learning over traditional machine learning are the protection it provides against data leakage and the personalized experience achieved by training a model with local data.
Furthermore, some federated learning schemes allow the training of heterogeneous local models while still producing a single global model [21]. In this approach, devices with different capabilities can be used together, even if they can only train models of different architectures.
The two most popular aggregation methods are Federated Stochastic Gradient De-
scent (FedSGD) and Federated Averaging (FedAVG). In FedSGD, the clients calcu-
late the gradient vector using their local data, but instead of backpropagating the error through the model, they send this gradient back to the server. The server then averages all the gradient vectors and updates the global model. In contrast, with FedAVG, the clients train their local model: the error is backpropagated, and the model's parameters (the weights and biases) are updated. Then, the clients send all the parameters back to the server, where they are averaged (and weighted) to create the global model. The advantage of FedAVG over FedSGD is that with FedAVG the local model can be trained multiple times before sending the updated parameters to the server. With FedSGD, by contrast, the gradient vector has to be sent in each training iteration, which results in a much higher communication cost. However, FedSGD can guarantee convergence, while FedAVG cannot [22].
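The difference between the two aggregation rules can be summarized in a short NumPy sketch (a minimal illustration; the flattening of each model into a single parameter vector and the variable names are assumptions, not the notation used later in the project):

import numpy as np

def fedsgd_step(global_params, client_grads, lr=0.3):
    # FedSGD: clients send raw gradients; the server averages them and applies one update.
    return global_params - lr * np.mean(client_grads, axis=0)

def fedavg_step(client_params, local_epochs):
    # FedAVG: clients send fully trained parameters; the server takes a weighted average.
    return np.average(client_params, axis=0, weights=local_epochs)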
Some commercial federated learning applications are Google’s Smart Keyboard and
Apple’s Siri. Google’s Smart Keyboard is used on smartphones in order to help the
user by suggesting the next word to type. Depending on the user's response (accepting the suggestion or not), the model is updated to improve the accuracy of future suggestions [24]. Apple's Siri is a voice assistant that runs on iPhone devices. When a user says "Hey Siri", the assistant wakes up to respond to the user's query. In order to personalize the application, Apple has implemented a federated learning approach so that Siri only responds to the voice of the iPhone's owner [25]. The expectations for new federated learning applications are very high.
Although federated learning offers many possibilities, it also has several limitations.
During the learning process, it requires frequent communication between the server
and the clients. Depending on the aggregation methods, it may be necessary to
exchange several times the model’s parameters. Therefore, the communication net-
work can get saturated and easily become the bottleneck of the training phase. More-
over, the clients involved in federated learning may be unreliable (and drop out from
the training phase), since they commonly rely on less powerful communication me-
dia (i.e. Wi-Fi, Bluetooth) and usually depend on batteries. Lastly, the local data
probably will not be independent and identically distributed (IID), and the amount
of local data may span several orders of magnitude between the edge devices. This
Chapter 8. Federated Learning with Microcontrollers 61
This section shows the development of a TinyML application (TinyML-FederatedLearning) that demonstrates the feasibility of (centralized) federated learning with microcontrollers. This application is based on the keyword spotting application implemented in On-Device Model Training. The training phase is identical from the device's point of view: the user can train a keyword spotting model (using three buttons) to recognize three different keywords and test its performance (with a fourth button). However, instead of training a model on a single device, this new application will use several devices following a federated learning approach. Therefore, we will also implement a server that coordinates all the clients (devices) and updates a global model.
The application will be deployed on several Arduino Nano 33 BLE Sense boards.
Each of these boards should have a setup similar to the one shown in Figure 7.1.
The three leftmost buttons are used to train the model, and the fourth button is used
to test it. Another requirement is to connect all the boards to the computer that
will be running the server script. This connection should be through the serial port,
since we are using the pySerial library from Python. Figure 8.3 shows the application
diagram with two boards (clients).
8.2.2 Workflow
The application workflow differs from the previous one. To start the application,
we have to flash the firmware to all the Arduino boards used as clients. Then, we
should run and configure the server (a Python script). The server will ask for the number of clients and the ports where they are connected (e.g., COM5, tty5). After the configuration, the server will create a neural network model and initialize it with random parameters (see Artificial Neural Network). This model will be sent to all the clients. As soon as the clients receive the model, they can start training it with their local data. The local data will be generated by the user saying the keywords. Of course, all the clients must use the same three keywords. Meanwhile, the server will set a ten-second countdown to start the first federated learning iteration. When the
countdown ends, the server will try to connect with all the clients and ask them to
send their improved model. The clients will have five seconds to accept the request,
or they will be discarded for this iteration. The models that are received from the
connected clients will be aggregated to create a global model. This global model will
be sent back to the connected devices to be further trained. Finally, the server will
restart the countdown for the next iteration. Figure 8.4 shows the sequence diagram
of the application workflow.
The training phase requires exchanging the models between the server and the clients multiple times. First, the initialized model is sent from the server to all the clients. Then, iteratively, the clients send their local models to the server, and the server sends the global model back to the clients. As calculated in Artificial Neural Network, the neural network model has a size of 41868 bytes. This size does not pose any issue for the server when receiving the models, since it only has to read the parameters from the computer's buffer. However, sending the model from the server to the Arduino can be problematic due to the small size of the Arduino's input buffer (64 bytes). To avoid a buffer overflow, the server only sends four bytes at a time, checking that the board has received them before continuing. Therefore, the sending time (from server to client) is significantly longer (around 3 seconds) than the receiving time (around 1 second).
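The following pySerial sketch illustrates this chunked transfer from the server side (the port name, baud rate and one-byte acknowledgement are assumptions; the project's actual protocol may differ in its details):

import struct
import serial  # pySerial

def send_model(port, parameters, baudrate=9600):
    # Send the parameters as 32-bit floats, 4 bytes at a time, waiting for a confirmation
    # byte after each float so the Arduino's 64-byte input buffer never overflows.
    with serial.Serial(port, baudrate, timeout=5) as ser:
        for p in parameters:
            ser.write(struct.pack('<f', p))
            ser.read(1)  # block until the board confirms reception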
Probably the most important part of federated learning is the aggregation of the models. The technique used in this application is FedAvg, which stands for Federated Averaging. In contrast to FedSGD (stochastic gradient descent), where the aggrega-
tion is done on the gradients, with FedAvg, the aggregation is done on the model’s
parameters (weights and biases). After the server receives all the models from the
clients, the parameters are weighted by the number of training epochs done on the
local model since the last aggregation. Then, the weighted parameters are simply
averaged in order to produce the global model. Although the aggregation of the local models is the heart of federated learning, the implementation is straightforward: in the Python script, the aggregation comprises a single line of code (using NumPy), along the lines of the sketch below.
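A minimal sketch of that line, assuming the clients' flattened parameter vectors are stacked in local_params and the number of local training epochs since the last aggregation is stored in local_epochs (both names are illustrative, not necessarily those of the project script):

import numpy as np

# Example with two clients: each row holds one client's flattened parameters.
local_params = np.array([[0.10, 0.20, 0.30],
                         [0.30, 0.00, 0.10]])
local_epochs = np.array([10, 10])  # weights for the average

# Weighted FedAvg aggregation in a single NumPy call.
global_params = np.average(local_params, axis=0, weights=local_epochs)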
8.2.5 Results
To evaluate the application, we deployed the scenario shown in Figure 8.3. We used
two Arduino Nano 33 BLE Sense boards with the same hardware setup as shown in Figure 7.1. The boards are connected via the serial port to a PC, where the federated learning server runs. We ran three experiments with different strategies to train a global keyword spotting model.
Federated learning with IID data: In the first experiment, we performed 10 training
epochs on each client before sending the local model to the server. Then, the up-
dated global model is sent back to both clients for the next local training round.
Two keywords are used for training, Montserrat and Pedraforca. The two keywords
are spoken in alternating order to both clients (nodes), to represent the scenario of
training with independent and identically distributed (IID) data. Figure 8.5 shows
the obtained results for the training loss. It can be seen that, in both nodes, the loss
decreases over the training epochs.
Figure 8.5: Loss vs. epochs during training with federated learning in the IID data scenario (10 training epochs per aggregation).
Federated learning with non-IID data: In the second experiment, the two keywords
are split among the clients (nodes) (i.e., each keyword is spoken to only one of the
nodes). This setup aims at presenting the scenario of training with non-IID data.
It can be seen in Figure 8.6 how, after averaging the model every 10 epochs, loss
increases. This can be explained by the fact that the model averaging merges the
characteristics of the two models, which are trained with different keywords. How-
ever, the long-term trend is that the loss of the global model is decreasing over the
training epochs.
Figure 8.6: Loss vs. epochs during training with federated learning in the non-IID data scenario (10 training epochs per aggregation).
Federated learning with non-IID data (1 training epoch per aggregation): In the third and last experiment, we kept the two keywords split among the clients, as in the previous experiment, but instead of performing 10 training epochs, we only performed one training epoch on each client before sending the local model to the server. Figure 8.7 shows once again that the loss decreases over the training epochs, which shows that averaging the local models helps to produce an improved global model, even with non-IID local data.
Figure 8.7: Loss vs. epochs during training with federated learning in the non-IID data scenario (1 training epoch per aggregation).
9 Sustainability Analysis
This block includes the evaluation of the planning, development and implementa-
tion of the project.
9.2 Project put into Production
9.2.1 Environmental
The environmental impact of undertaking the project has been quantified based on
the energy consumption (kWh). This project has been carried out by the author and
has been supervised by the director and co-director. Therefore, we required three
computers. Considering that a computer typically uses about 50 watts of electricity
and that the project has required 474 hours of use of a single computer and 20 hours
of use of three computers, the amount of energy is:
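A worked computation under the stated assumptions (50 W per computer, 474 hours on one computer and 20 hours on each of three computers):

E = 474 h × 50 W + 20 h × 3 × 50 W = 23700 Wh + 3000 Wh = 26700 Wh ≈ 26.7 kWh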
Training a machine learning model can involve large amounts of energy. However,
TinyML models are relatively small, and therefore, the energy used is negligible. We
will consider this energy as part of the energy used by the computers.
In order to reduce the energy consumption, we have reused all three computers and the microcontroller boards from previous projects. For this reason, we have not taken into
account the energy cost in the manufacturing process. To reduce the model training
cost, we have considered using the transfer learning technique, which consists in
reusing pre-trained models. With this technique we could save a lot of energy when
training very big models. However, as stated above, the models used in TinyML are
very small, and transfer learning would not be noticed in the total energy consump-
tion. Therefore, from the previous calculation, the total energy consumed is approximately 26.7 kWh.
9.2.2 Economic
The estimated cost of undertaking the project is analyzed on Chapter 3. The final es-
timated cost was about 16158.88€. However, this cost has undergone some changes
during the project. The task "Decentralized Federated Learning" was cancelled (as
already foreseen in Section 2.4), and therefore, the final cost has been reduced to
16002.88€.
9.2.3 Social
The completion of this project has allowed me to demonstrate the knowledge I have
acquired during my degree. It has also helped me to improve my writing skills. And
finally, I have gained a very deep understanding of machine learning and embedded
systems. However, this project has taken me many hours to complete. At times
I have felt frustrated and unmotivated due to the heavy workload. But overall, I
believe that having done this work will help me in my career and future projects.
9.3 Exploitation
9.3.1 Environmental
9.3.2 Economic
As mentioned above, TinyML applications do not have to send data over the internet (unlike previous solutions) and are therefore less expensive to operate. In
addition, microcontrollers (the devices used in TinyML) are very cheap and do not
require a large investment in them. Therefore, a rapid growth of TinyML applica-
tions is feasible.
9.3.3 Social
The development of TinyML will enable smart devices everywhere. New applica-
tions could be developed to improve the quality of life, such as medical applications
or home automation devices.
9.4 Risks
9.4.1 Environmental
Microcontrollers are very small and cheap, so from a selfish perspective, it may be
preferable to throw them away rather than reuse or recycle them.
9.4.2 Economic
It could happen that the production of microcontrollers is reduced and their cost increases, making TinyML applications less economically viable. However, this
is highly unlikely due to the worldwide dependence on microcontrollers and the
low cost of production.
9.4.3 Social
TinyML applications are likely to be part of our lives in the near future. However,
it will take a long time for poorer countries to see the benefits of these applications.
In addition, the training of machine learning models may be biased against discrim-
inated or marginalized groups. For example, a keyword spotting application might
have more difficulty recognizing an unusual accent. Another social problem of the massive use of TinyML applications is the dependency it will create on this technology, which will be difficult to replace.
9.5 Weighted Matrix
Based on the above analysis, we have assigned a value to each cell of the sustain-
ability matrix in order to quantify its impact. Figure 9.2 shows the matrix with the
assigned values, where 1 means a very negative impact (or very high risk) and 10 a
very positive impact (or very low risk).
10 Conclusion
This project had two objectives. The first objective was to develop a basic TinyML application. However, before starting the development process, we had to identify
the common techniques used in TinyML (Section 4.1). We noticed that the most
popular TinyML applications were based on either keyword spotting, visual wake
words or anomaly detection. We found several commercial applications that are us-
ing these techniques in very different scenarios (e.g., domestic, office and industrial).
After identifying these techniques, we looked into the tools and frameworks that
are currently being used for developing TinyML applications (Section 4.2). It was
no surprise to discover that TensorFlow is the most popular approach for develop-
ing machine learning solutions. However, for TinyML we needed something more
compact, since the memory is very limited. Thankfully, TensorFlow Lite Micro is a very small version (16 kB) specifically designed to be deployed on microcontrollers.
For this project we had two microcontroller boards available: the Arduino Nano 33
BLE Sense, and the Espressif ESP32 (with LoRa). Both boards are supported by Ten-
sorFlow Lite Micro. At first, we had the intention to use both boards, depending on
the requirements of each application to be developed. However, in the whole project
we have only used the Arduino Nano 33 BLE Sense. Although the ESP32 board has a better microcontroller, the Arduino board comes with several sensors (e.g., microphone, IMU, temperature) that made it easier to set up the device (Chapter 5). In any case, the Arduino board has had enough power to run all the TinyML applications we developed.
After identifying the techniques, the tools and the microcontroller specifications, we
started to develop a basic keyword spotting application able to recognize the words
vermell, verd and blau (red, green and blue in Catalan). To develop the application,
we followed two approaches (Chapter 6). In the first approach we analyzed the ma-
chine learning pipeline, and then we implemented the application based on the micro
speech example from TensorFlow. In the second approach, we used the Edge Impulse
platform. In both approaches, we obtained a very high accuracy when training the model (above 94%); however, the performance after deploying the application was significantly lower. This low performance was due to the quality of the dataset and the different alignments of the recorded keywords. We conclude that for creating a typical TinyML application, it is much easier and faster to use the Edge Impulse platform (or any similar one), as it does not require deep knowledge of machine learning or embedded systems, and most of the work is automated.
The results of the Independent and Identically Distributed (IID) scenario were very positive, ob-
taining a loss value similar to the previous application. For the non-IID scenario, the
loss values of the local models increase greatly with each aggregation. However, in
the long term, the loss value of the global model decreases.
Bibliography
[1] Tim Stack. Internet of Things (IoT) Data Continues to Explode Exponentially. Who Is Using That Data and How? - Cisco Blogs. 2018. URL: https://news-blogs.cisco.com/datacenter/internet-of-things-iot-data-continues-to-explode-exponentially-who-is-using-that-data-and-how.
[2] Ben Lutkevich. Microcontroller (MCU). 2019. URL: https://internetofthingsagenda.techtarget.com/definition/microcontroller.
[3] Vijay Janapa Reddi, Laurence Moroney, and Pete Warden. Tiny Machine Learning (TinyML). 2021. URL: https://www.edx.org/professional-certificate/harvardx-tiny-machine-learning.
[4] Karen Hao. What is machine learning? 2018. URL: https://www.technologyreview.com/2018/11/17/103781/what-is-machine-learning-we-drew-you-another-flowchart/.
[5] PayScale. Average Project Manager, Information Technology (IT) Salary in Spain. URL: https://www.payscale.com/research/ES/Job=Project_Manager%2C_Information_Technology_(IT)/Salary.
[6] PayScale. Average Research Scientist Salary in Spain. URL: https://www.payscale.com/research/ES/Job=Research_Scientist/Salary.
[7] PayScale. Average Software Developer Salary in Spain. URL: https://www.payscale.com/research/ES/Job=Software_Developer/Salary.
[8] Agencia Tributaria. Tabla de coeficientes de amortización lineal. URL: https://www.agenciatributaria.es/AEAT.internet/Inicio/_Segmentos_/Empresas_y_profesionales/Empresas/Impuesto_sobre_Sociedades/Periodos_impositivos_a_partir_de_1_1_2015/Base_imponible/Amortizacion/Tabla_de_coeficientes_de_amortizacion_lineal_.shtml.