Embedded Software Development The Open-Source Approach
“… valuable to students and professionals who need a single, coherent source of
information.”
—Kristian Sandström, ABB Corporate Research, Västerås, Sweden
Embedded Software Development: The Open-Source Approach delivers a practi-
cal introduction to embedded software development, with a focus on open-source
components. This programmer-centric book is written in a way that enables even novice
practitioners to grasp the development process as a whole.
• Defines the role and purpose of embedded systems, describing their internal
structure and interfacing with software development tools
• Examines the inner workings of the GNU compiler collection (GCC)-based
software development system or, in other words, toolchain
• Presents software execution models that can be adopted profitably to
model and express concurrency
• Addresses the basic nomenclature, models, and concepts related to task-based
scheduling algorithms
• Shows how an open-source protocol stack can be integrated in an embedded
system and interfaced with other software components
• Analyzes the main components of the FreeRTOS Application Programming Interface
(API), detailing the implementation of key operating system concepts
• Discusses advanced topics such as formal verification, model checking, runtime
checks, memory corruption, security, and dependability
Tingting Hu
National Research Council of Italy
Politecnico di Torino
Turin, Italy
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-
vides licenses and registration for a variety of users. For organizations that have been granted a photo-
copy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Ché nessun quaggiù lasciamo,
né timore, né desir
(For we leave no one behind us,
nor dreads, nor desires)
— Ivan
Foreword
Embedded software touches many aspects in our daily life by defining the behavior
of the things that surround us, be it a phone, a thermostat, or a car. Over time,
these things have been made more capable by a combination of advances in hardware
and software. The features and capabilities of these things are further augmented by
connecting to other things or to external software services in the cloud, e.g., by a
training watch connecting wirelessly to a heartrate monitor that provides additional
sensor data and to a cloud service providing additional analysis and presentation
capabilities. This path of evolution can be seen in most domains, including industrial
embedded systems, and has over time added new layers of knowledge that are needed
for development of embedded systems.
Embedded systems cover a large group of different systems. For instance, a phone
is a battery-powered device with relatively high processing power and ample options
for wireless connectivity, while a thermostat is likely to have scarce resources and
limited connectivity options. A car on the other hand represents another type of em-
bedded system, comprising a complex distributed embedded system that operates in
harsh environments. Although the requirements and constraints on these systems are
quite different, there are still strong commonalities among most types of embedded
systems, considering the software layers that sit closest to the hardware, typically
running on embedded or real-time operating systems that offer similar services.
Every new generation of a software-based system typically is more elaborate and
operates in a more complex and dynamic environment than the previous one. Over
the last four decades many things have changed with respect to software-based sys-
tems in general; starting in the 1980s with the early desktop PC, like the IBM XT that
I used at the university, which was similar to industrial and embedded systems with
respect to operating system support and development environments. In the 1990s,
real-time and embedded operating systems services remained mostly the same, while
the evolution of general purpose operating systems witnessed many changes includ-
ing graphical user interfaces. During the first decade of the new millennium we saw
the introduction of smart phones, server virtualization, and data centers, and cur-
rently we see the emergence of large scale open-source software platforms for cloud
computing and the Internet of Things (IoT).
Some embedded systems such as consumer electronics have been an integral part
of these latest developments while industrial system adoption is slower and more
careful. Although the system context today is much more complex than thirty-five
years ago and many software layers have been added, the foundation, the software
that sits close to the hardware, is still much the same with the same requirements for
timeliness and predictability.
As the software stack in embedded systems increases and as the functionality
reaches beyond the embedded device through connectivity and collaboration with
What is lacking is a continuous and coherent description that holds all the fragments
together, presenting a holistic view of the fundamentals of embedded systems, the
components involved, how they operate in general, and how they interact and depend
on each other. As such, this book fills a gap in the literature and will prove valuable
to students and professionals who need a single, coherent source of information.
Kristian Sandström
ABB Corporate Research
Västerås, Sweden
The Authors
Ivan Cibrario Bertolotti earned the Laurea degree (summa cum laude) in com-
puter science from the University of Torino, Turin, Italy, in 1996. Since then, he has
been a researcher with the National Research Council of Italy (CNR). Currently, he
is with the Institute of Electronics, Computer, and Telecommunication Engineering
(IEIIT) of CNR, Turin, Italy.
His research interests include real-time operating system design and implementa-
tion, industrial communication systems and protocols, and formal methods for vul-
nerability and dependability analysis of distributed systems. His contributions in this
area comprise both theoretical work and practical applications, carried out in coop-
eration with leading Italian and international companies.
Dr. Cibrario Bertolotti taught several courses on real-time operating systems at
Politecnico di Torino, Turin, Italy, from 2003 until 2013, as well as a PhD degree
course at the University of Padova in 2009. He regularly serves as a technical referee
for the main international conferences and journals on industrial informatics, factory
automation, and communication. He has been an IEEE member since 2006.
List of Figures
6.1 Notation for real-time scheduling algorithms and analysis. ....................... 141
6.2 U-based schedulability tests for Rate Monotonic....................................... 145
6.3 Scheduling diagram for the task set of Table 6.2........................................ 147
6.4 Scheduling diagram for the task set of Table 6.3........................................ 148
6.5 Unbounded priority inversion. .................................................................... 154
6.6 Priority inheritance protocol. ...................................................................... 158
List of Tables
10.1 Object-Like Macros to Be Provided by the FreeRTOS Porting Layer .... 276
10.2 Function-Like Macros to Be Provided by the FreeRTOS
Porting Layer .............................................................................................. 277
10.3 Data Types to Be Provided by the FreeRTOS Porting Layer................... 278
10.4 Functions to Be Provided by the FreeRTOS Porting Layer..................... 278
14.1 Original (N) and Additional (A) Transitions of the Election Protocol’s
Timed FSM ................................................................................................. 412
but are indeed of utmost importance to guarantee that embedded systems work reli-
ably and efficiently.
Furthermore, virtually all chapters in the first part of the book make explicit ref-
erence to real fragments of code and, when necessary, to a specific operating system,
that is, FreeRTOS. This also addresses a shortcoming of other textbooks, that is, the
lack of practical programming examples presented and commented on in the main
text (some provide specific examples in appendices).
From this point of view FreeRTOS is an extremely useful case study because it
has a very limited memory footprint and execution-time overhead, but still provides
a quite comprehensive range of primitives.
In fact, it supports a multithreaded programming model with synchronization and
communication primitives that are not far away from those available on much bigger,
and more complex systems. Moreover, thanks to its features, it has successfully been
used in a variety of real-world embedded applications.
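To make the flavor of this programming model concrete, the following minimal sketch shows two FreeRTOS tasks that communicate through a message queue. It assumes the standard FreeRTOS API headers and a valid FreeRTOSConfig.h are available for the target; task names, priorities, and the 100 ms period are arbitrary choices made for illustration only.

#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

static QueueHandle_t xDataQueue;    /* Queue shared by the two tasks */

/* Producer task: periodically sends an integer value to the queue. */
static void vProducerTask(void *pvParameters)
{
    int value = 0;

    for (;;) {
        xQueueSend(xDataQueue, &value, portMAX_DELAY);
        value++;
        vTaskDelay(pdMS_TO_TICKS(100));     /* Run again after 100 ms */
    }
}

/* Consumer task: blocks until a value arrives, then processes it. */
static void vConsumerTask(void *pvParameters)
{
    int received;

    for (;;) {
        if (xQueueReceive(xDataQueue, &received, portMAX_DELAY) == pdPASS) {
            /* ... use the received value ... */
        }
    }
}

int main(void)
{
    xDataQueue = xQueueCreate(8, sizeof(int));

    xTaskCreate(vProducerTask, "Prod", configMINIMAL_STACK_SIZE, NULL, 2, NULL);
    xTaskCreate(vConsumerTask, "Cons", configMINIMAL_STACK_SIZE, NULL, 1, NULL);

    vTaskStartScheduler();      /* Does not return while the scheduler runs */
    for (;;);
}

The primitives used here, and the mechanisms behind them, are discussed at length in later chapters.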
At the same time, it is still simple enough to be discussed in a relatively detailed
way within a limited number of pages and without overwhelming the reader with
information. This also applies to its hardware abstraction layer—that is, the code
module that layers the operating system on top of a specific hardware architecture—
which is often considered an “off-limits” topic in other operating systems.
As a final remark, the book title explicitly draws attention to the open-source
approach to software development. In fact, even though the usage of open-source
components, at least in some scenarios (like industrial or automotive applications)
is still limited nowadays, it is the authors’ opinion that open-source solutions will
enjoy an ever-increasing popularity in the future.
For this reason, all the software components mentioned and taken as examples
in this book—from the software development environment to the real-time operat-
ing system, passing through the compiler and the software analysis and verification
tools—are themselves open-source.
The first part of the book guides the reader through the key aspects of embed-
ded software development, spanning from the peculiar requirements of embedded
systems in contrast to their general-purpose counterparts, without forgetting network
communication, implemented by means of open-source protocol stacks.
In fact, even though the focus of this book is mainly on software development
techniques, most embedded systems are nowadays networked or distributed, that is,
they consist of a multitude of nodes that cooperate by communicating through a
network. For this reason, embedded programmers must definitely be aware of the
opportunities and advantages that adding network connectivity to the systems they
design and develop may bring.
The chapters of the first part of the book are:
• Chapter 2, Embedded Applications and Their Requirements. This chapter
introduces readers to the role and purpose of embedded systems, to
their internal structure, and to the most common way they are interfaced
to software development tools. The chapter also gives a first overview of
the embedded software development process, to be further expanded in the
following.
• Chapter 3, GCC-Based Software Development Tools. The main topic
of this chapter is a thorough description of the GNU compiler collec-
tion (GCC)-based software development system, or toolchain, which is ar-
guably the most popular open-source product of this kind in use nowadays.
The discussion goes through the main toolchain components and provides
insights on their inner workings, focusing in particular on the aspects that
may affect programmers’ productivity.
• Chapter 4, Execution Models for Embedded Systems. This chapter and the
next provide readers with the necessary foundations to design and imple-
ment embedded system software. Namely, this chapter presents in detail
two different software execution models that can be profitably applied to
model and express the key concept of concurrency, that is, the parallel ex-
ecution of multiple activities, or tasks, within the same software system.
• Chapter 5, Concurrent Programming Techniques. In this chapter, the con-
cept of execution model is further expanded to discuss in detail how con-
currency must be managed to achieve correct and timely results, by means
of appropriate concurrent programming techniques.
• Chapter 6, Scheduling Algorithms and Analysis. After introducing some
basic nomenclature, models, and concepts related to task-based scheduling
algorithms, this chapter describes the most widespread ones, that is, rate
monotonic (RM) and earliest deadline first (EDF). The second part of the
chapter briefly discusses scheduling analysis, a technique that allows pro-
grammers to predict the worst-case timing behavior of their systems.
• Chapter 7, Configuration and Usage of Open-Source Protocol Stacks.
In recent years, many embedded systems rapidly evolved from central-
ized to networked or distributed architectures, due to the clear advan-
tages this approach brings. In this chapter, we illustrate how an open-
source protocol stack—which provides the necessary support for inter-node
communication—can easily be integrated in an embedded software system
and how it interfaces with other software components.
• Chapter 8, Device Driver Development. Here, the discourse goes from the
higher-level topics addressed in the previous three chapters to a greater level
of detail, concerning how software manages and drives hardware devices.
This is an aspect often neglected in general-purpose software development,
but of utmost importance when embedded systems are considered, because
virtually all of them are strongly tied to at least some dedicated hardware.
• Chapter 9, Portable Software. While the previous chapters set the stage
for effective embedded software development, this chapter outlines the all-
important trade-off between code execution efficiency and portability, that
is, the ease of migrating software from one project to another. This aspect
In the second part, the book presents a few advanced topics, focusing mainly on
improving software quality and dependability. The importance of these goals is ever
increasing nowadays, as embedded systems are becoming commonplace in critical
application areas, like anti-lock braking system (ABS) and motion control.
In order to reach the goal, it is necessary to adopt a range of different techniques,
spanning from the software design phase to runtime execution, passing through its
implementation. In particular, this book first presents the basic principles of formal
verification through model checking, which can profitably be used since the very
early stages of algorithm and software design.
Then, a selection of runtime techniques to prevent or, at least, detect memory
corruption are discussed. Those techniques are useful to catch any software error
that escaped verification and testing, and keep within bounds the damage it can do to
the system and its surroundings.
Somewhat in between these two extremes, static code analysis techniques are use-
ful, too, to spot latent software defects that may escape manual code inspection. With
respect to formal verification, static code analysis techniques have the advantage of
working directly on the actual source code, rather than on a more abstract model.
Moreover, their practical application became easier in recent years because sev-
eral analysis tools moved away from being research prototypes and became stable
enough for production use. The book focuses on one of them as an example.
The second part of the book consists of the following chapters:
The bibliography at the end of the book has been kept rather short because it has
been compiled with software practitioners in mind. Hence, instead of providing an
exhaustive and detailed list of references that would have been of interest mainly to
people willing to dive deep into the theoretical aspects of embedded real-time sys-
tems, we decided to highlight a smaller number of additional sources of information.
In this way, readers can more effectively use the bibliography as a starting point
to seek further knowledge on this rather vast field, without getting lost. Within the
works we cite, readers will also find further, more specific pointers to pursue their
quest.
Part I
2 Embedded Applications and Their Requirements

CONTENTS
2.1 Role and Purpose of Embedded Systems .......................................................... 9
2.2 Microcontrollers and Their Internal Structure................................................. 13
2.2.1 Flash Memory..................................................................................... 15
2.2.2 Static Random Access Memory (SRAM)........................................... 17
2.2.3 External Memory ................................................................................ 19
2.2.4 On-Chip Interconnection Architecture ............................................... 23
2.3 General-Purpose Processors versus Microcontrollers ..................................... 29
2.4 Embedded Software Development Process ..................................................... 32
2.5 Summary.......................................................................................................... 37
This chapter outlines the central role played by embedded software in a variety of
contemporary appliances. At the same time, it compares embedded and general-
purpose computing systems from the hardware architecture and software develop-
ment points of view. This is useful to clarify and highlight why embedded software
development differs from application software development for personal computers
most readers are already familiar with.
What’s more, embedded systems are deeply involved in the transportation indus-
try, especially automotive. It is quite common to find that even inexpensive cars are
equipped with more than 10 embedded nodes, including those used for anti-lock
braking system (ABS). For what concerns industry automation, embedded systems
are deployed in production lines to carry out all sorts of activities, ranging from mo-
tion control to packaging, data collection, and so on.
The main concerns of embedded systems design and development are different
from those of general-purpose systems. Embedded systems are generally equipped with
limited resources, for instance, a small amount of memory and a low clock frequency,
leading to the need for better code optimization strategies. Moreover, fast consumer-grade
CPUs often cannot be adopted in industrial environments, because they are designed to
work within a much narrower temperature range, for instance [0, 40] degrees Celsius,
whereas industrial-grade microprocessors are generally expected to keep operating
correctly at temperatures up to 85 degrees Celsius. This leads to one main concern
of embedded systems, that is reliability, which encompasses hardware, software, and
communication protocol design. In addition, heat dissipation in such environments
becomes a further issue when processors run at high clock frequencies.
All these differences bring unavoidable changes in the way embedded software
is developed, in contrast with the ordinary, general-purpose software development
process, which is already well known to readers. The main purpose of this chapter is
to outline those differences and explain how they affect programmers’ activities and
way of working.
In turn, this puts the focus on how to make the best possible use of the limited
resources available in an embedded system by means of code optimization. As out-
lined above, this topic is of more importance in embedded software development
with respect to general-purpose development, because resource constraints are usu-
ally stronger in the first case.
Moreover, most embedded systems have to deal with real-world events, for in-
stance, continuously changing environmental parameters, user commands, and oth-
ers. Since in the real world events inherently take place independently from each
other and in parallel, it is natural that embedded software has to support some form
of parallel, or concurrent execution. From the software development point of view,
this is generally done by organizing the code as a set of activities, or tasks, which are
carried out concurrently by a real-time operating system (RTOS).
The concept of task—often also called process—was first introduced in the semi-
nal work of Dijkstra [48]. In this model, any concurrent application, regardless of its
nature or complexity, is represented by, and organized as, a set of tasks that, concep-
tually, execute in parallel.
Each task is autonomous and holds all the information needed to represent the
evolving execution state of a sequential program. This necessarily includes not only
the program instructions but also the state of the processor (program counter, regis-
ters) and memory (variables).
Informally speaking, each task can be regarded as the execution of a sequential
program by “its own” conceptual processor even though, in a single-processor sys-
Embedded Applications and Their Requirements 11
tem the RTOS will actually implement concurrent execution by switching the physi-
cal processor from one task to another when circumstances warrant.
Therefore, thoroughly understanding the details of how tasks are executed, or
scheduled, by the operating system and being aware of the most common concurrent
programming techniques is of great importance for successful embedded software
development. This is the topic of Chapters 4 through 6.
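Purely as an illustration of the information a task must hold, the following hypothetical C structure sketches the kind of per-task record, often called a task control block, that an operating system might maintain; field names and sizes are invented and differ among real kernels, FreeRTOS included.

#include <stdint.h>

#define N_REGS 16   /* Number of general-purpose registers (architecture-dependent) */

/* Hypothetical task control block: one instance per task. */
struct task_control_block {
    uint32_t  pc;               /* Saved program counter                    */
    uint32_t  regs[N_REGS];     /* Saved general-purpose registers          */
    uint32_t *stack_pointer;    /* Current top of the task's private stack  */
    uint32_t *stack_base;       /* Base address of the stack area           */
    int       priority;         /* Scheduling priority                      */
    int       state;            /* Ready, running, blocked, and so on       */
};

When the RTOS switches the physical processor from one task to another, it saves the processor state into the outgoing task's record and restores it from the incoming one.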
In the past, the development of embedded systems has witnessed the evolution
from centralized to distributed architectures. This is because, first of all, the easiest
way to cope with the increasing need for more and more computing power is to
use a larger number of processors to share the computing load. Secondly, as their
complexity grows, centralized systems cannot scale up as well as distributed systems.
A simple example is that, in a centralized system, one more input point may re-
quire one more pair of wires to bring data to the CPU for processing. Instead, in a
distributed system, many different input/output values can be transmitted to other
nodes for processing through the same shared communication link. Last but not
least, with time, it becomes more and more important to integrate different subsys-
tems, not only horizontally but also vertically.
For example, the use of buses and networks at the factory level makes it much easier
to integrate the production floor into the factory management hierarchy and support better business
decisions. Moreover, it is becoming more and more common to connect embedded
systems to the Internet for a variety of purposes, for instance, to provide a web-based
user interface, support firmware updates, be able to exchange data with other equip-
ment, and so on.
For this reason protocol stacks, that is, software components which implement
a set of related communication protocols—for instance, the ubiquitous TCP/IP
protocols—play an ever-increasing role in all kinds of embedded systems. Chap-
ter 7 presents in detail how a popular open-source TCP/IP protocol stack can be
interfaced, integrated, and configured for use within an embedded system.
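To give an idea of what application code looks like once such a protocol stack is in place, the following sketch sends a short message to a remote host through a BSD-style sockets interface, which several open-source embedded stacks optionally provide. The header names, the helper function, the server address 192.0.2.10, and port 5000 are all placeholders for illustration; the actual interface offered by the stack discussed in Chapter 7 is described there.

#include <string.h>
#include <sys/socket.h>     /* Header names may differ on an embedded stack */
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Hypothetical helper: send a short report to a remote data collector. */
int send_report(const char *msg)
{
    struct sockaddr_in server;
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    if (sock < 0)
        return -1;

    memset(&server, 0, sizeof(server));
    server.sin_family      = AF_INET;
    server.sin_port        = htons(5000);                /* Placeholder port    */
    server.sin_addr.s_addr = inet_addr("192.0.2.10");    /* Placeholder address */

    if (connect(sock, (struct sockaddr *)&server, sizeof(server)) < 0) {
        close(sock);
        return -1;
    }

    send(sock, msg, strlen(msg), 0);
    close(sock);
    return 0;
}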
For the same reason Chapter 8, besides illustrating in generic terms how software
and hardware shall be interfaced in an embedded system—by means of suitable de-
vice drivers—also shows an example of how protocol stacks can be interfaced with
the network hardware they work with.
Another major consideration in embedded systems design and development is
cost, including both hardware and software since, nowadays, software cost is growing
and becoming as important as hardware cost. For what concerns software, if existing
applications and software modules can be largely reused, this could significantly
save time and effort, and hence, reduce cost when integrating different subsystems
or upgrading the current system with more advanced techniques/technologies.
An important milestone toward this goal, besides the adoption of appropriate soft-
ware engineering methods (which are outside the scope of this book), consists of
writing portable code. As discussed in Chapters 9 and 10, an important property of
portable code is that it can be easily compiled, or ported, to diverse hardware archi-
tectures with a minimum amount of change.
This property, besides reducing software development time and effort—as out-
lined previously—also brings the additional benefit of improving software reliability
because less code must be written anew and debugged when an application is moved
from one architecture to another. In turn, this topic is closely related to code opti-
mization techniques, which are the topic of Chapter 11.
Generally, embedded systems also enforce requirements on real-time perfor-
mance. It could be soft real-time in the case of consumer appliances and building au-
tomation, instead of hard real-time for critical systems like ABS and motion control.
More specifically, two main aspects directly related to real-time are delay and
jitter. Delay is the amount of time taken to complete a certain job, for example,
how long it takes a command message to reach the target and be executed. Delay
variability gives rise to jitter, that is, jobs are completed sometimes sooner,
sometimes later.
Hard real-time systems tolerate much less jitter than their soft real-time counterparts and they
require the system to behave in a more deterministic way. This is because, in a hard
real-time system, deadlines must always be met and any jitter that results in missing
the deadline is unacceptable. Instead, this is allowed in soft real-time systems, as
long as the probability is sufficiently small. This also explains why jitter is generally
more of a concern in hard real-time systems. Several of these important aspects of software development will be
considered in more detail in Chapter 12, within the context of a real-world example.
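As a small example of how an operating system primitive helps keep jitter under control, the following FreeRTOS sketch releases a task at a fixed 10 ms period measured from absolute time references, rather than from the variable completion time of the previous job; the period and the task body are arbitrary.

#include "FreeRTOS.h"
#include "task.h"

/* Periodic control task, released every 10 ms. */
static void vControlTask(void *pvParameters)
{
    const TickType_t xPeriod = pdMS_TO_TICKS(10);
    TickType_t xLastWakeTime = xTaskGetTickCount();

    for (;;) {
        /* Sample inputs, compute, and drive outputs here. */

        /* Sleep until exactly one period after the previous release, so
           that the variable execution time of the loop body does not
           accumulate into release jitter. */
        vTaskDelayUntil(&xLastWakeTime, xPeriod);
    }
}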
As we can see, in several circumstances, embedded systems also have rather tight
dependability requirements that, if not met, could easily lead to safety issues, which
is another big concern in some kinds of embedded systems. Safety is not only about
the functional correctness (including the timing aspect) of a system, but is also related
to its security, especially since embedded systems are nowadays often networked
and an insecure system is often bound to be unsafe.
Therefore, embedded software must often be tested more accurately than general-
purpose software. In some cases, embedded software correctness must be ensured
not only through careful testing, but also by means of formal verification. This topic
is described in Chapter 13 from the theoretical point of view and further developed
in Chapter 14 by means of a practical example. Further information about security
and dependability, and how they can be improved by means of automatic software
analysis tools, is contained in Chapter 16.
A further consequence, related to both code reliability and security, of the limited
hardware resources available in embedded architectures with respect to their general-
purpose counterparts, is that the hardware itself may provide very limited support to
detect software issues as early as possible and before damage is done to the system.
From this point of view, the software issue of most interest is memory corruption
that takes place when part of a memory-resident data structure is overwritten with
inconsistent data, often due to a wayward task that has no direct relationship with
the data structure itself. This issue is also often quite difficult to detect, because the
effects of memory corruption may be subtle and become manifest a long time after
the corruption actually took place.
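A contrived but representative sketch of such an error is shown below: a write through an unchecked array index silently overwrites an unrelated, memory-resident variable, and the damage surfaces only when that variable is used next. The variable names are, of course, invented.

#include <stdint.h>

static int32_t samples[4];          /* Buffer filled by one part of the code     */
static int32_t setpoint = 1000;     /* Unrelated variable, possibly placed right
                                       after samples[] by the linker             */

void store_sample(unsigned int i, int32_t value)
{
    /* Missing bounds check: if i >= 4 the write falls outside samples[]
       and may corrupt setpoint.  Nothing fails here and now; the effect
       becomes manifest, possibly much later, when setpoint is read. */
    samples[i] = value;
}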
• One or more processor cores, which are the functional units responsible for
program execution.
• Multiple internal memory banks, with different characteristics regarding
capacity, speed, and volatility.
• Optionally, one or more memory controllers to interface the microcontroller
with additional, external memory.
• A variety of input–output controllers and devices, ranging from very
simple, low-speed devices like asynchronous serial receivers/transmitters
to very fast and complex ones, like Ethernet and USB controllers.
It is therefore evident that, even though most of this book will focus on the pro-
cessing capabilities of microcontrollers, in terms of program execution, it is ex-
tremely important to consider the microcontroller architecture as a whole during
component selection, as well as software design and development.
For this reason, this chapter contains a brief overview of the major components
outlined above, with special emphasis on the role they play from the programmer’s
perspective. Interested readers are referred to more specific literature [50, 154] for
detailed information. After a specific microcontroller has been selected for use, its
hardware data sheet of course becomes the most authoritative reference on this topic.
Arguably the most important component to be understood and taken into account
when designing and implementing embedded software is memory. In fact, if it is true
that processor cores are responsible for instruction execution, those instructions are
stored in memory and must be continuously retrieved, or fetched, from memory to
be executed.
Moreover, program instructions heavily refer to, and work on, data that are stored
in memory, too. In a similar way, processor cores almost invariably make use of
one or more memory-resident stacks to hold arguments, local variables, and return
addresses upon function calls.
This all-important data structure is therefore referenced implicitly upon each func-
tion call (to store input arguments and the return address into it), during the call itself
(to retrieve input arguments, access local variables, and store function results), and
after the call (to retrieve results).
As a consequence, the performance and determinism of a certain program is
deeply affected, albeit indirectly, by the exact location of its instructions, data, and
stacks in the microcontroller’s memory banks. This dependency is easily forgotten
by just looking at the program source code because, more often than not, it simply
does not contain this kind of information.
This is because most programming languages (including the C language this book
focuses on) do allow the programmer to specify the abstract storage class of a vari-
able (for instance, whether a variable is local or global, read-only or read-write,
and so on) but they do not support any standard mechanisms that allow program-
mers to indicate more precisely where (in which memory bank) that variable will be
allocated.
As will be better described in Chapters 3 and 9, this important goal must there-
fore be pursued in a different way, by means of other components of the software
development toolchain. In this case, as shown in Figure 2.2, the linker plays a central
role because it is the component responsible for the final allocation of all memory-
resident objects defined by the program (encompassing both code and data), to fit the
available memory banks.
Namely, compiler-dependent extensions of the programming language are first
used to tag specific functions and data structures in the source code, to specify where
they should be allocated in a symbolic way. Then, the linker is instructed to pick up
all the objects tagged in a certain way and allocate them in a specific memory bank,
rather than using the default allocation method.
An example of this strategy for a GCC-based toolchain will be given in Chapters 8
and 11, where it will be used to distribute the data structures needed by the example
programs among the memory banks available on the microcontroller under study. A
higher-level and more thorough description of the technique, in the broader context
of software portability, will be given in Chapter 9 instead.
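By way of preview, a minimal sketch of the two halves of the mechanism follows, assuming a GCC-based toolchain; the section name .ahb_sram_data and the memory region name ahb_sram are hypothetical and must match whatever the actual linker script defines. In the C source, a compiler-dependent attribute tags the data structure:

#include <stdint.h>

/* Tag the buffer so that it is placed in a specific memory bank instead
   of the default .data/.bss sections. */
static uint8_t frame_buffer[2048]
    __attribute__((section(".ahb_sram_data")));

In the linker script, an output section directive then collects everything tagged in this way and assigns it to the intended bank:

/* Fragment of a GNU ld linker script (all names are hypothetical). */
.ahb_sram_data :
{
    *(.ahb_sram_data)
} > ahb_sram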
lose its contents when power is removed from the microcontroller. On the other hand,
it works as a read-only memory during normal use.
Write operations into flash memory are indeed possible, but they require a special
procedure (normal store operations performed by the processor are usually not ade-
quate to this purpose), are relatively slow (several orders of magnitude slower than
read operations) and, in some cases, they can only be performed when the microcon-
troller is put into a special operating mode by means of dedicated tools.
Flash memory is usually used to store the program code (that, on recent archi-
tectures, is read-only by definition), constant data, and the initial value of global
variables.
In many cases, flash memory is not as fast as the processor. Therefore, it may be
unable to sustain the peak transfer rate the processor may require for instruction and
data access, and it may also introduce undue delays or stalls in processing activities
if those accesses are performed on demand, that is, when the processor asks for them.
For this reason, units of various complexity are used to try and predict the next
flash memory accesses that will be requested by the processor and execute them in
advance, while the processor is busy with other activities.
For instance, both the NXP LPC24xx and LPC17xx microcontroller fami-
lies [126, 128] embed a Memory accelerator module (MAM) that works in com-
bination with the flash memory controller to accelerate both code and data accesses
to flash memory.
In this way, if the prediction is successful, the processor is never stalled waiting for
flash memory access because processor activities and flash memory accesses proceed
concurrently.
On the other hand, if the prediction is unsuccessful, a processor stall will definitely
occur. Due to the fact that prediction techniques inherently work on a statistical basis,
it is often very hard or impossible to predict exactly when a stall will occur during
execution, and also for how long it will last. In turn, this introduces a certain degree
of non-determinism in program execution timings, which can hinder its real-time
properties in critical applications.
In those cases, it is important that programmers are aware of the existence of
the flash acceleration units just described and are able to turn them off (partially or
completely) in order to change the trade-off point between average performance and
execution determinism.
1. Since SRAM is volatile, it cannot be used to permanently retain the code, which
would otherwise be lost as soon as power is removed from the system. Hence, it
is necessary to store the code elsewhere, in a non-volatile memory (usually flash
memory) and copy it into SRAM at system startup, bringing additional complexity
to system initialization.
Furthermore, after the code has been copied, its execution address (that is, the
address at which it is executed by the processor) will no longer be the same as its
load address (the address of the memory area that the linker originally assigned
to it).
This clearly becomes an issue if the code contains absolute memory addresses to
refer to instructions within the code itself, or other forms of position-dependent
code, because these references will no longer be correct after the copy and, if
followed, they will lead code execution back to flash memory.
This scenario can be handled in two different ways, at either the compiler or
linker level, but both require additional care and configuration instructions to
those toolchain components, as better described in Chapter 3. Namely:
• It is possible to configure the compiler—for instance, by means of appropriate
command-line options and often at the expense of performance—to gener-
ate position-independent code (PIC), that is, code that works correctly even
though it is moved at will within memory.
• Another option is to configure the linker so that it uses one base address (often
called the load memory address (LMA)) as its target to store the code, but uses
a different base address (the virtual memory address (VMA)) to calculate and
generate absolute addresses; a minimal sketch of this arrangement is given right
after this list.
2. When the code is spread across different memory banks—for instance, part of it
resides in flash memory while other parts are in SRAM—it becomes necessary to
jump from one bank to another during program execution—for instance, when a
flash-resident function calls a SRAM-resident function. However, memory banks
are often mapped far away from each other within the microcontroller’s address
range and the relative displacement, or offset, between addresses belonging to
different banks is large.
Modern microcontrollers often encode this offset in jump and call instructions
to locate the target address and, in order to reduce code size and improve per-
formance, implement several different instruction variants that support different
(narrower or wider) offset ranges. Compilers are unaware of the target address
when they generate jump and call instructions—as better explained in Chapter 3,
this is the linker’s responsibility instead. Hence they, by default, choose the in-
struction variant that represents the best trade-off between addressing capability
and instruction size.
While this instruction variant is perfectly adequate for jumps and calls among
functions stored in the same memory bank, the required address offset may not fit
into it when functions reside in different memory banks. For the reasons recalled
above, the compiler cannot detect this issue by itself. Instead, it leads to (generally
rather obscure) link-time errors and it may be hard for programmers to track these
errors back to their original cause.
Although virtually all compilers provide directives to force the use of an instruc-
tion variant that supports larger offsets for function calls, this feature has not
been envisaged in most programming language standards, including the C lan-
guage [89]. Therefore, as discussed in Chapter 9, compiler-dependent language
extensions have to be used to this purpose, severely impairing code portability.
3. When SRAM or, more generally, the same bank of memory is used by the pro-
cessor for more than one kind of access, for instance, to access both instructions
and data, memory contention may occur.
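As anticipated in the first item of the list above, when the LMA/VMA approach is chosen the startup code must copy the affected sections into SRAM by itself, before they are used. A minimal sketch follows, assuming the linker script exports the hypothetical symbols __sram_code_lma, __sram_code_start, and __sram_code_end that delimit the section.

#include <stdint.h>

/* Symbols defined by the linker script (names are hypothetical):
   __sram_code_lma   : load address of the section, in flash
   __sram_code_start : execution (virtual) address of the section, in SRAM
   __sram_code_end   : first address past the end of the section, in SRAM */
extern uint32_t __sram_code_lma[], __sram_code_start[], __sram_code_end[];

void copy_code_to_sram(void)
{
    const uint32_t *src = __sram_code_lma;
    uint32_t       *dst = __sram_code_start;

    /* Copy the section, word by word, from its load address in flash to
       its execution address in SRAM. */
    while (dst < __sram_code_end)
        *dst++ = *src++;
}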
• Flash memory
• SRAM
• Dynamic Random Access Memory (DRAM)
Among these, external flash memory and SRAM share the same general characteristics
as their on-chip counterparts. DRAM, instead, is rarely found as on-chip
microcontroller memory, due to the difficulty of making the two chip
production processes coexist. It is worth mentioning the main DRAM properties here
because they have some important side effects on program execution, especially in an
embedded real-time system. Interested readers are referred to [155] for more detailed
information about this topic.
DRAM, like SRAM, is a random access memory, that is, the processor can freely
and directly read from and write into it, without using any special instructions or
procedures. However, there are two important differences concerning access delay
and jitter:
1. Due to its internal architecture, DRAM usually has a much higher capacity than
SRAM but read and write operations are slower. While both on-chip and external
SRAM are able to perform read and write operations at the same speed as the
processor, DRAM access times are one or two orders of magnitude higher in
most cases.
To avoid intolerable performance penalties, especially as processor speed grows,
DRAM access requires and relies on the interposition of another component,
called cache, which speeds up read and write operations. A cache is basically
a fast memory of limited capacity (much smaller than the total DRAM capacity),
which is as fast as SRAM and holds a copy of DRAM data recently accessed by
the processor.
The caching mechanism is based on a widespread property of programs, that is,
their memory access locality. Informally speaking, the term locality means that,
if a processor just made a memory access at a certain address, its next accesses
have a high probability to fall in the immediate vicinity of that address.
To persuade ourselves of this fact, by intuition, let us consider the two main kinds
of memory access that take place during program execution:
• Instruction fetch. This kind of memory access is inherently sequential in most
cases, the exception being jump and call instructions. However, they represent
a relatively small fraction of program instructions, and many contemporary pro-
cessors provide a way to avoid them completely, at least for very short-range
conditional jumps, by means of conditionally executed instructions. A thor-
ough description of how conditionally executed instructions work is beyond
the scope of this book. For instance, Reference [8] discusses in detail how they
have been implemented on the ARM Cortex family of processor cores.
• Data load and store. In typical embedded system applications, the most com-
monly used memory-resident data structure is the array, and array elements
are quite often (albeit not always) accessed within loops by means of some
sort of sequential indexing; the sketch given right after this list illustrates
the effect of such access patterns on the cache.
Cache memory is organized in fixed-size blocks, often called cache lines, which
are managed as an indivisible unit. Depending on the device, the block size usu-
ally is a power of 2 between 16 and 256 bytes. When the processor initiates a
transaction to read or write data at a certain address, the cache controller checks
whether or not a line containing those data is currently present in the cache.
• If the required data are found in the cache, a fast read or write transaction, in-
volving only the cache itself and not memory, takes place. The fast transaction
is performed at the processor’s usual speed and does not introduce any extra
delay. This possibility is called cache hit.
• Otherwise, the processor request gives origin to a cache miss. In this case, the
cache controller performs two distinct actions:
a. It selects an empty cache line. If the cache is completely full, this may
entail storing the contents of a full cache line back into memory, an oper-
ation known as eviction.
b. The cache line is filled with the data block that surrounds the address
targeted by the processor in the current transaction.
In the second case, the transaction requested by the processor finishes only after
the cache controller has completed both actions outlined previously. Therefore, a
cache miss entails a significant performance penalty from the processor’s point of
view, stemming from the extra time needed to perform memory operations.
On the other hand, further memory access transactions issued by the processor in
the future will likely hit the cache, due to memory access locality, with a signifi-
cant reduction in data access time.
From this summary description, it is evident that cache performance heavily de-
pends on memory access locality, which may vary from one program to another
and even within the same program, depending on its current activities.
An even more important observation, from the point of view of real-time embed-
ded systems design, is that although it is quite possible to satisfactorily assess the
average performance of a cache from a statistical point of view, the exact cache
behavior with respect to a specific data access is often hard to predict.
For instance, it is impossible to exclude scenarios in which a sequence of cache
misses occurs during the execution of a certain section of code, thus giving rise to
a worst-case execution time that is much larger than the average one.
To make the problem even more complex, cache behavior also depends in part on
events external to the task under analysis, such as the allocation of cache lines—
and, consequently, the eviction of other lines from the cache—due to memory
accesses performed by other tasks (possibly executed by other processor cores) or
interrupt handlers.
2. Unlike SRAM, DRAM is unable to retain its contents indefinitely, even though
power is continuously applied to it, unless a periodic refresh operation is per-
formed. Even though a detailed explanation of the (hardware-related) reasons for
this and of the exact procedure to be followed to perform a refresh cycle are
beyond the scope of this book, it is useful anyway to briefly recall its main conse-
quences on real-time code execution.
Firstly, it is necessary to highlight that during a refresh cycle DRAM is unable
to perform regular read and write transactions, which must be postponed, unless
advanced techniques such as hidden refresh cycles are adopted [155]. Therefore,
if the processor initiates a transaction while a refresh cycle is in progress, it will
incur additional delay—beyond the one needed for the memory access itself—
unless the transaction results in a cache hit.
Secondly, the exact time at which a refresh cycle starts is determined by the mem-
ory controller (depending on the requirements of the DRAM connected to it) and
not by the processor.
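The sketch below illustrates in plain C why access patterns matter to the cache behavior just described: both functions compute the same sum over a two-dimensional array, but the first walks memory sequentially, with good locality and mostly cache hits, while the second jumps by a whole row at every access and, on arrays larger than the cache, incurs many more misses. The array dimensions are arbitrary.

#include <stdint.h>

#define ROWS 256
#define COLS 256

static int32_t m[ROWS][COLS];

/* Row-major traversal: consecutive accesses touch adjacent addresses, so
   most of them hit a cache line filled by a previous miss. */
int64_t sum_row_major(void)
{
    int64_t sum = 0;

    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

/* Column-major traversal of the same data: consecutive accesses are
   COLS * sizeof(int32_t) bytes apart, so they keep landing in different
   cache lines and the miss rate grows accordingly. */
int64_t sum_col_major(void)
{
    int64_t sum = 0;

    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}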
• For what concerns internal signal routing, we already mentioned that typ-
ical microcontrollers have a limited number of external pins, which is not
big enough to route all internal input–output signals to the printed circuit
board (PCB).
The use of an external (parallel) bus is likely to consume a significant num-
ber of pins and make them unavailable for other purposes. A widespread
workaround for this issue is to artificially limit the external bus width to
save pins.
For instance, even though 32-bit microcontrollers support an external 32-bit
bus, hardware designers may limit its width to 16 or even 8 bits.
Besides the obvious advantage in terms of how many pins are needed to
implement the external bus, a negative side-effect of this approach is that
more bus cycles become necessary to transfer the same amount of data.
Assuming that the bus speed is kept constant, this entails that more time is
needed, too.
• To simplify external signal routing hardware designers may also keep the
external bus speed slower than the maximum theoretically supported by the
microcontroller, besides reducing its width. This is beneficial for a variety
of reasons, of which only the two main ones are briefly presented here.
• It makes the system more tolerant to signal propagation time skews and
gives the designer more freedom to route the external bus, by means of
longer traces or traces of different lengths.
• Since a reduced number of PCB traces is required to connect the mi-
crocontroller to the external components, fewer routing conflicts arise
against other parts of the layout.
maximum of 6 wires, plus ground, to be implemented. These figures are much lower
than what a parallel bus, even if its width is kept at 8 bits, requires.
As it is easy to imagine, the price to be paid is that this kind of interface is likely
unable to sustain the peak data transfer rate a recent processor requires, and the
interposition of a cache (which brings all the side effects mentioned previously) is
mandatory to bring average performance to a satisfactory level.
Regardless of the underlying reasons, those design choices are often bound to cap
the performance of external memory below its maximum. From the software point of
view, in order to assess program execution performance from external memory, it is
therefore important to evaluate not only the theoretical characteristics of the memory
components adopted in the system, but also the way they have been connected to the
microcontroller.
The components connected to a bus can be categorized into two different classes:
1. Bus masters, which are able to initiate a read or write transaction on the bus,
targeting a certain slave.
2. Bus slaves, which can respond to transactions initiated by masters, but cannot
initiate a transaction on their own.
A typical example of bus master is, of course, the processor. However, peripheral
devices may act as bus masters, too, when they are capable of autonomous direct
memory access (DMA) to directly retrieve data from, and store them to, memory
without processor intervention.
A different approach to DMA, which does not require devices to be bus masters,
consists of relying on a general-purpose DMA controller, external to the devices. The
DMA controller itself is a bus master and performs DMA by issuing two distinct bus
transactions, one targeting the device and the other one targeting memory, on behalf
of the slaves.
For instance, to transfer a data item from a device into memory, the DMA con-
troller will first wait for a trigger from the device, issue a read transaction targeting
the device (to read the data item from its registers), and then issue a write trans-
action targeting memory (to store the data item at the appropriate address). This
approach simplifies device implementation and allows multiple devices to share the
same DMA hardware, provided they will not need to use it at the same time.
In order to identify which slave is targeted by the master in a certain transaction,
masters provide the target’s address for each transaction and slaves respond to a
unique range of addresses within the system’s address space. The same technique is
also used by bridges to recognize when they should forward a transaction from one
bus to the other.
Each bus supports only one ongoing transaction at a time. Therefore all bus mas-
ters connected to the same bus must compete for bus access by means of an arbitra-
tion mechanism. The bus arbiter chooses which master can proceed and forces the
others to wait when multiple masters are willing to initiate a bus transaction at the
same time. On the contrary, several transactions can proceed in parallel on differ-
ent buses, provided those transactions are local to their bus, that is, no bridges are
involved.
As shown in Figure 2.3, buses are interconnected by means of a number of
bridges, depicted as gray blocks.
• Two bridges connect the local bus to the AHB buses. They let the proces-
sor access the controllers connected there, as well as the additional SRAM
banks.
• One additional bridge connects one of the AHB buses to the APB bus. All
transactions directed to the lower-performance peripherals go through this
bridge.
• The last bridge connects the two AHB buses together, to let the Ethernet
controller (on the left) access the SRAM bank residing on the other bus, as
well as external memory through the external memory controller.
The role of a bridge between two buses A and B is to allow a bus master M
residing, for instance, on bus A to access a bus slave S connected to bus B. In order
to do this, the bridge plays two different roles on the two buses at the same time:
• On bus A, it works as a bus slave and responds on behalf of S to the trans-
action initiated by M.
• On bus B, it works as a bus master, performing on behalf of M the transac-
tion directed to S.
The kind of bridge described so far is the simplest one and works in an asymmetric
way, that is, it is able to forward transactions initiated on bus A (where its master port
is) toward bus B (where the slave port is), but not vice versa. For instance, referring
back to Figure 2.3, the AHB/AHB bridge can forward transactions initiated by a
master on the left-side AHB toward a slave connected to the right-side AHB, but not
the opposite.
Other, more complex bridges are symmetric instead and can assume both master
and slave roles on both buses, albeit not at the same time. Those bridges can also be
seen as a pair of asymmetric bridges, one from A to B and the other from B to A,
and they are often implemented in this way.
As outlined above, bus arbitration mechanisms and bridges play a very important
role to determine the overall performance and determinism of the on-chip intercon-
nection architecture. It is therefore very important to consider them at design time
and, especially for what concerns bridges, make the best use of them in software.
When the processor (or another bus master) crosses a bridge to access memory
or a peripheral device controller, both the processor itself and the system as a whole
may incur a performance and determinism degradation. Namely:
especially when designing and implementing software drivers for DMA-capable de-
vices, it is very important to ensure that the data structures to be used by the device
are indeed accessible to it. In turn, as will be better explained in Chapters 3 and 9, this
requires additional instructions to the linker in order to force those data structures to
be allocated in the intended memory bank.
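For instance, a receive buffer that a DMA-capable Ethernet controller must reach could be forced into the SRAM bank attached to the controller's bus with the same section-tagging technique sketched earlier in this chapter; as before, the section name is hypothetical and must match an output section that the linker script maps onto that bank.

#include <stdint.h>

/* Place the buffer in the SRAM bank the Ethernet controller can actually
   access; ".ahb_sram_data" must be mapped onto that bank by the linker
   script. */
static uint8_t eth_rx_buffer[1536]
    __attribute__((section(".ahb_sram_data"), aligned(4)));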
In any case, as the number of on-chip peripherals and their data bandwidth re-
quirements grow, bridge-based interconnection shows its limits in terms of achiev-
able parallelism and flexibility. For this reason, as shown in Figure 2.4, recent mi-
crocontrollers are shifting from the bus- and bridge-based interconnection discussed
previously to a crossbar-based interconnection, even in low-cost implementations.
As shown in the figure, which depicts in a simplified way the internal intercon-
nections of the LPC1768 microcontroller [127, 128], a crossbar resembles a matrix
All these buses are shown as solid vertical lines in the figure. It should also be
remarked that the Ethernet and USB controllers, as well as the general-purpose DMA
controller, also work as slaves on a separate bus and allow the processor to read and
write their internal registers through it. These buses are labeled as Regs in the figure
and shown as dashed vertical lines. They will be discussed separately in the following
because in a way they represent a deviation with respect to the general structure of
the crossbar.
Referring back to Figure 2.4, slaves are shown on the right and at the bottom.
Besides the usual ones—namely, flash memory, high-speed GPIO, and a couple of
SRAM banks—two bridges also work as slaves on the crossbar side, in order to grant
access to low-speed peripherals. The bus segments they are connected to are shown
as horizontal lines in the figure.
At this level, since those peripherals are not performance-critical and do not have
important data bandwidth requirements, the interconnection infrastructure can still
be bridge-based without adverse side-effects on performance.
Informally speaking, the crossbar itself works like a matrix of switches that can
selectively connect one vertical bus segment (leading to a master) to one horizon-
tal segment (leading to a slave). Since multiple switches can be active simultaneously,
multiple transactions can traverse the crossbar at the same time without blocking
each other, provided that they all use distinct bus segments.
As said previously, the bus segments shown as dashed vertical lines somewhat
deviate from the regular structure of the crossbar and are used by components (for
instance, the Ethernet controller) that can be both bus masters and slaves. This kind
of behavior is needed in some cases because, continuing the example concerning the
Ethernet controller,
• It must be able to autonomously read from memory and write into memory
the frames to be transmitted and being received, respectively, and also
• It must allow the processor to read and write its internal registers, to access
status information and control its behavior, respectively.
Accordingly, these devices are connected to a vertical bus segment, on which they
operate as masters, as well as a horizontal bus segment, on which they work as slaves
and are targeted by other masters for register access.
Although a crossbar-based interconnection increases the degree of bus transaction
parallelism that can be achieved in the system, it does not completely remove all
bus-related timing dependencies, because blocking still occurs when two masters
intend to access the same slave at the same time.
For instance, referring again to Figure 2.4, regardless of the presence of a
crossbar-based interconnection, the processor core and the Ethernet controller will
still block each other if they try to access the main SRAM bank together. Moreover—
as also shown in the figure—in order to save complexity and cost, the switch ma-
trix within the crossbar may not be complete, thus introducing additional constraints
about which bus segments can, or cannot, be connected.
Another extremely important on-chip network, not discussed so far, is responsible
for conveying interrupt requests from most other functional
units to the processor. The performance of this network is critically important for
embedded systems. As for the data-transfer interconnection just discussed, several
architectural variants are possible and are in use nowadays. They will be described
in more detail in Chapter 8, along with the software development methods and tech-
niques used to properly handle interrupts.
The first difference worth mentioning is about the number and variety of on-
chip peripheral components that are included in a microcontroller with respect to
a general-purpose processor.
Since their main role is to provide computing power to a more complex system,
made of several distinct components, general-purpose processors usually embed only
a very limited number of on-chip peripherals.
Two notable exceptions to this general trend are extremely high-speed periph-
erals, like peripheral component interconnect (PCI) express and direct media inter-
face (DMI) bus controllers, with their associated DMA controller blocks, as well as
integrated graphics processing units (GPUs). The implementation of all lower-speed
peripherals (for instance, Ethernet and USB controllers) is left to other chips, histor-
ically known as “south bridges” and connected to the processor chip by means of a
high-speed interface, like DMI.
On the contrary, microcontrollers are invariably equipped with a large variety
of on-chip peripherals, which brings two contrasting consequences on embedded
system design and software development. Namely:
A similar consideration applies to on-chip memory. General-purpose proces-
sors usually do not provide any on-chip main memory and rely on an external memory
bus to connect the processor to a bank of DRAM. To achieve an adequate transfer
speed, this bus often operates at an extremely high frequency. Hence, it becomes hard
to lay out and route, due to severe constraints on the maximum length of the PCB
traces between the processor and the memory chips or cards, as well as the maximum
length difference, or skew, between them.
1. A development system, shown on the left of the figure and often consisting of
a personal computer, hosts and executes the programs used to develop the new
software and compile, deploy, and debug it.
2. A target system, shown on the right, executes the new software under develop-
ment, after it has been uploaded into its memory banks.
Within the development system, several software components still closely resem-
ble their counterpart used for general-purpose software development. Namely:
• An editor helps programmers create, modify and, more generally, manage
source code modules. Depending on its level of complexity and sophistica-
tion, the editor’s role may span from a mere aid to “put text” into source
code modules to a full-fledged integrated development environment (IDE)
able to perform the following functions, among others:
• Syntax coloring and, in some cases, other operations that require knowl-
edge of the language syntax and part of its semantics. Among these,
probably the most useful ones concern the capability of automatically
retrieving the definition of, and references to, a variable or function
given its name.
Other useful features provided by the most sophisticated IDEs also in-
clude automatic code refactoring. By means of code refactoring it be-
comes possible, for instance, to extract part of the code of a function
and create a new, standalone function from it.
• Automatic integration with software versioning and revision control
systems, for instance, concurrent version system (CVS) [28] or Apache
subversion (SVN) [117]. In short, a software versioning and revision
control system keeps track of all changes made by a group of program-
mers to a set of files, and allows them to collaborate even though they
are physically separated in space and time.
• Interaction with the toolchain, which allows programmers to rebuild
their code directly from the IDE, without using a separate command
shell or other interfaces. Most IDEs are also able to parse the toolchain
output and, for instance, automatically open a source file and draw the
programmer’s attention to the line at which the compiler detected an
error.
• Ability to interact with the debugger and use its facilities directly from
the IDE interface. In this way, it becomes possible to follow program
execution from the source code editing window, explore variable values,
and visualize the function call stack directly.
Although none of the previously mentioned IDE features are strictly nec-
essary for successful software development, their favorable effect on pro-
grammers’ productivity increases with the size and complexity of the
project. For big and complex projects, where several program-
mers are involved concurrently, it may therefore be wise to adopt a suitable
IDE right from the beginning, rather than retrofit it at a later time.
A thorough discussion of editors and IDEs is beyond the scope of this
book. Staying within the open-source software domain, probably the most
widespread products are Emacs [156] and Eclipse [54]. Interested readers
may refer to their documentation for further information.
• A toolchain is responsible for transforming a set of source code modules
into an executable image, containing the program code (in machine lan-
guage form) and data (suitably allocated in the target system memory). The
only difference with respect to general-purpose software development is
that, instead of producing executable code for the development system it-
self, the toolchain produces code for the target system, which may use a
different processor with a different instruction set.
Since, by intuition, the toolchain plays a central role in the software devel-
opment process, detailed information about the toolchain will be given in
Chapter 3. Moreover, Chapters 9 and 11 contain further information on how
to profitably use the toolchain to attain specific goals, such as portability or
code optimization from different points of view.
On the other hand, other components are either unique to the embedded software
development process or they behave differently than their general-purpose counter-
parts. In particular:
• The debugger must interact with the processor on the target board in or-
der to perform most of its functions, for instance, single-step execution or
display the current value of a variable stored in the target system memory.
To this purpose, it must implement an appropriate debugging protocol, also
understood by the target system, and communicate with the target through
a suitable debugging interface, for instance, JTAG [80].
• The toolchain stores the executable image of the program on the develop-
ment system, as a disk file. To execute the program, its image must first of
all be moved into the memory banks of the target system by means of an up-
load tool. Communication between the development and the target system
for image upload may require the conversion of the executable image into a
format understood by the target system and is controlled by a communica-
tion protocol that both sides must implement. For instance, many members
of the NXP LPC microcontroller family, like the LPC1768 [127, 128], sup-
port the in-system programming (ISP) protocol to upload the executable
image into the on-chip flash memory.
• The target system may be unable to autonomously manage console input–
output, which is often used to print out information and debugging mes-
sages, as well as support simple interaction with the embedded software
during development. This is especially true during the early stages of soft-
ware development, when software to drive many of the on-board peripheral
devices of the target system may not be available yet.
For this reason, most target systems provide a minimal console input–
output capability through a very simple hardware interface, often as simple
as an asynchronous serial port made visible to the development system as a
USB device. In this case, the development system must be able to connect
to this interface and make it available to the programmer, for instance, by
means of a terminal emulator.
• Last, but not least, since the target system is totally separate from the de-
velopment system, it becomes necessary to supply power to it.
Figure 2.6 Possible ways of connecting the development system with the target board.
1. If the target system offers debugging capabilities, the debugger must be connected
to the corresponding debugging interface.
2. A suitable connection is needed to upload the executable image into the target
system.
3. Console input–output data reach the development system through their own
connection.
4. Finally, the target system must be connected to a power source.
As illustrated by the examples shown in Figure 2.6, depending on the target board
configuration and complexity, the previously mentioned logical connections may be
realized by means of different combinations of physical connections.
Namely:
in Figure 2.6 (a). In this case, a universal serial bus (USB) connection is
often used to power the target system and make one of the microcontroller’s
asynchronous serial ports available to the development system through the
USB itself.
In turn, the serial port is used to upload the code using a bootloader that
has been preloaded into the development board by the manufacturer and
supports a serial connection, for instance, U-Boot [46]. After the program
under development has been started, the same serial port is also used to
interact with the program, by means of a terminal emulator on the develop-
ment system side.
This is the case, for instance, of the Embedded Artist EA LPC2468 OEM
board [56].
• Custom-made boards may need a separate physical connection for the
power supply in order to work around the power limitations of USB.
Moreover, it may be inconvenient to fit a full-fledged bootloader like U-
Boot in a production system. In this case, the USB connection is still used
to interact with the program under development as before, but executable
image upload is done in a different way.
For instance, as shown in Figure 2.6 (b), many recent microcontrollers have
an internal read-only memory (ROM) with basic bootloading capabilities.
This is the case, for instance, of the NXP LPC1768 microcontroller [127,
128], whose ROM implements an in-system programming (ISP) protocol
to upload an executable image into the on-chip flash memory.
Of course, besides being useful by itself, the ISP protocol can also be used
for the initial upload of a more sophisticated bootloader, like the previously
mentioned U-Boot, into a portion of the microcontroller flash memory re-
served to this purpose. Then, subsequent uploads can be done as in the
previous case.
• A possible issue with the previous two ways of connecting the development
system with the target board is the relatively limited support for debugging.
It is indeed possible to support remote debugging through a serial port,
for instance, by means of the remote debugging mode of the GNU de-
bugger GDB [158]. However, some software (called a remote stub in the
GDB nomenclature) needs to be executed on the target board for this
purpose.
As a consequence, this debugging method works only as long as the target
board is still “healthy enough” to execute some code and, depending on the
kind of issues the software being developed encounters, this may or may
not be true. It is therefore possible to lose debugging capabilities exactly
when they are most needed.
In order to address this inconvenience, some microcontrollers offer a lower-
level code uploading and debugging interface, most often based on a
JTAG [80] interface. The main difference is that this kind of debugging
interface is hardware-based, and hence, it works in any case.
Table 2.2
Embedded versus General-Purpose Software Development Environments
As shown in Figure 2.6 (c), a third connection allows the development sys-
tem to interact with the target board for debugging purposes using JTAG.
Since PCs usually don’t support this kind of interface natively, the connec-
tion is usually made by means of a JTAG–USB converter, often called pod.
From the software point of view, many pods support the open on-chip de-
bugger (OpenOCD) [134]. It is an open-source project aiming at providing
a free and open platform for on-chip debugging, in-system programming,
and boundary-scan testing. As such, the JTAG connection also supports
code uploading to the on-chip and external memory banks. In this case, the
serial port is used exclusively for program-controlled console input–output.
As the few examples just discussed show, there is no single, established standard
for this kind of connection. It is therefore very important that programmers thoroughly
understand which kinds of connection their development system, toolchain, and
development board support, in order to choose and use the right tools, which, as
explained, differ case by case. Further information about the topic can be retrieved
from the reference material about the tools involved, cited in the text.
2.5 SUMMARY
Embedded systems are nowadays used in all areas of industry and permeate our daily
life even though, due to their nature, people may not even be aware of their existence.
As a consequence, an ever-increasing amount of application and system software has
to be developed for them, often satisfying quality and dependability requirements
that exceed those of general-purpose software.
As shown in Table 2.2, significant differences exist between embedded soft-
ware development environments and those typically used for general-purpose
computing. These differences become more significant as we move away
from the “surface” of the software development environment—that is, the editor or
IDE that programmers use to write their code—and get closer to the point where
code is executed.
At the same time, the microcontrollers most commonly used in embedded systems
are also significantly different from general-purpose processors, from the hardware
point of view. At least some of these differences have important repercussions on
software development, and hence, programmers cannot be unaware of them.
Both aspects have been discussed in this chapter, with the twofold goal of setting
the foundation for the more specific topics to be discussed in this book and providing
readers with at least a few ideas and starting points for further learning.
3 GCC-Based Software Development Tools
CONTENTS
3.1 Overview ......................................................................................................... 40
3.2 Compiler Driver Workflow .............................................................................. 42
3.3 C Preprocessor Workflow ................................................................................ 44
3.4 The Linker ....................................................................................................... 48
3.4.1 Symbol Resolution and Relocation .................................................... 49
3.4.2 Input and Output Sequences ............................................................... 51
3.4.3 Memory Layout .................................................................................. 55
3.4.4 Linker Script Symbols ........................................................................ 58
3.4.5 Section and Memory Mapping ........................................................... 59
3.5 The C Runtime Library ................................................................................... 62
3.6 Configuring and Building Open-Source Software .......................................... 65
3.7 Build Process Management: GNU Make ........................................................ 69
3.7.1 Explicit Rules...................................................................................... 70
3.7.2 Variables ............................................................................................. 72
3.7.3 Pattern Rules ....................................................................................... 74
3.7.4 Directives and Functions .................................................................... 77
3.8 Summary.......................................................................................................... 78
3.1 OVERVIEW
Generally speaking, a toolchain is a complex set of software components. Collectively,
they translate source code into executable machine code. Figure 3.1 outlines
the general, theoretical toolchain workflow. As shown in the figure, its main compo-
nents are:
1. The GCC compiler translates a C source file (that may include other headers and
source files) and produces an object module. The translation process may require
several intermediate code generation steps, involving the assembly language. In
that case, the compiler implicitly invokes the assembler as [55], too.
The gcc program is actually a compiler driver. It is a programmable compo-
nent, able to perform different actions by appropriately invoking other toolchain
components, depending on the input file type (usually derived from its filename
extension). The actions to be performed by gcc, as well as other options, are con-
figured by means of a specs string or file. Both the compiler driver itself and its
specs string are discussed in more detail in Section 3.2.
2. The ar librarian collects multiple object modules into a library. It is worth re-
calling that, especially in the past, the librarian was often called archiver, and the
name of the toolchain component still derives from this term. The same tool also
[Figure 3.1 (toolchain workflow): the compiler driver gcc, guided by the specs string/file, runs the C preprocessor cpp (based on cpplib) and the C compiler cc1 on C sources and headers (*.c, *.h), and the assembler as on assembly sources (*.s), producing object files (*.o).]
The availability of a native toolchain is usually not an issue because most open-
source operating system distributions already provide one ready-to-use.
First of all, as shown in the figure, the preprocessor is integrated in the C compiler
and both are implemented by cc1. A standalone preprocessor cpp does exist, but it
is not used during normal compilation.
In any case, the behavior of the standalone preprocessor and the one implemented
in the compiler is consistent because both make use of the same preprocessing li-
brary, cpplib, which can also be used directly by application programs as a general-
purpose macro expansion tool.
On the contrary, the assembler is implemented as a separate program, as, that
is not part of the GCC distribution. Instead, it is distributed as part of the binary
utilities package BINUTILS [138]. It will not be further discussed here due to space
constraints.
One aspect peculiar to the GCC-based toolchain is that the compiler driver is
programmable. Namely, it is driven by a set of rules, contained in a “specs” string or
file. The specs string can be used to customize the behavior of the compiler driver. It
ensures that the compiler driver is as flexible as possible, within its design envelope.
In the following, due to space constraints, we will just provide an overview of the
expressive power that specs strings have, and illustrate what can be accomplished
with their help, by means of a couple of examples. A thorough documentation of
specs string syntax and usage can be found in [159].
First of all, the rules contained in a specs string specify which sequence of pro-
grams the compiler driver should run, and their arguments, depending on the kind of
file provided as input. A default specs string is built in the compiler driver itself and
is used when no custom specs string is provided elsewhere.
The sequence of steps to be taken in order to compile a file can be specified
depending on the suffix of the file itself.
Other rules, associated with some command-line options may change the argu-
ments passed by the driver to the programs it invokes.
*link: %{mbig-endian:-EB}
For example, the specs string fragment listed above specifies that if the command-
line option -mbig-endian is given to the compiler driver, then the linker must be
invoked with the -EB option.
Let us now consider a different specs string fragment:
*startfile: crti%O%s crtbegin%O%s new_crt0%O%s
In this case, the specs string specifies which object files should be uncondition-
ally included at the start of the link. The list of object files is held in the startfile
variable, mentioned in the left-hand part of the string, while the list itself is in the
right-hand part, after the colon (:). It is often useful to modify the default set of ob-
jects in order to add language-dependent or operating system-dependent files without
forcing programmers to mention them explicitly whenever they link an executable
image.
More specifically:
• The *startfile: specification overrides the internal specs variable
startfile and gives it a new value.
It is useful to remark that, in this case, neither the preprocessor keyword nor the
following tokens (macro name and body in this example) are forwarded to the
compiler. The macro body will become visible to the compiler only if the macro
will be expanded later.
The name → body table is initialized when preprocessing starts and discarded
when it ends. As a consequence, macro definitions are not kept across multiple
compilation units. Initially, the table is not empty because the preprocessor pro-
vides a number of predefined macros.
2. When a macro is invoked, the preprocessor performs macro expansion. In the
simplest case—that is, for object-like macros—macro expansion is triggered by
encountering a macro name in the source code.
The macro is expanded by replacing its name with its body. Then, the result of the
expansion is examined again to check whether or not further macro expansions
can be done. When no further macro expansions can be done, the sequence of to-
kens obtained by the preprocessor as a result is forwarded to the compiler instead
of the tokens that triggered macro expansion.
3. Tokens unknown to the preprocessor are simply passed to the compiler without
modification. Since the preprocessor’s and compiler’s grammars are very different
at the syntax level, many kinds of token known to the compiler have no meaning to
the preprocessor, even though the latter is perfectly able to build the token itself.
For instance, it is obvious that type definitions are extremely important to the
compiler, but they are completely transparent to the preprocessor.
The syntax of preprocessor keywords is fairly simple. In fact, they always start
with a sharp character (#) in column one. Spaces are allowed between # and the rest
of the keyword. The main categories of keyword are:
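Turning to macro expansion, the object-like macro definitions that the following example relies on (with B deliberately defined before A, in this order) are of this form:

#define B A+3
#define A 12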
Namely, the definition of macro B does not produce any error although A has not
been defined yet. When B is encountered in the source file, after the previously listed
definitions, it is expanded as: B → A+3 → 12+3.
Since no other macro names are present in the intermediate result, macro ex-
pansion ends at this point and the three tokens 12, +, and 3 are forwarded to the
compiler.
Due to the way the preprocessor and the compiler communicate,
explained previously and outlined in Figure 3.3, the compiler does not know how
tokens are obtained. For instance, it cannot distinguish between tokens coming from
macro expansion and tokens taken directly from the source file.
As a consequence, if B is used within a more complex expression, the compiler
might get confused and interpret the expression in a counter-intuitive way. Contin-
uing the previous example, the expression B*5 is expanded by the preprocessor as
B*5 → A+3*5 → 12+3*5.
When the compiler parses the result, the evaluation of 3*5 is performed before +,
due to the well-known precedence rules of arithmetic operators, although this may
not be the behavior the programmer expects.
To solve this problem, it is often useful to put additional parentheses around macro
bodies, as is shown in the following fragment of code.
#define B (A+3)
#define A 12
To illustrate how function-like macro expansion takes place, let us consider the
following fragment of code as an example.
#define F(x, y) x*y*K
#define K 7
#define Z 3
When the function-like macro F is invoked as F(Z, 6), its arguments are examined
first and any macro names they contain are expanded, so Z is replaced by its body 3.
Then, each occurrence of the formal parameters x and y in the macro body is replaced
by the corresponding (expanded) argument. After the replacement, the body of the
macro becomes 3*6*K. At this point, the modified body replaces the function-like
macro invocation. Therefore, F(3, 6) → 3*6*K.
The final step in function-like macro expansion consists of re-examining the result
and check whether or not other macros (either object-like or function-like macros)
can be expanded. In our example, the result of macro expansion obtained so far still
contains the object-like macro name K and the preprocessor expands it according to
its definition: 3*6*K → 3*6*7.
To summarize, the complete process of macro expansion when the function-like
macro F(Z, 6) is invoked is F(Z, 6) → F(3, 6) → 3*6*K → 3*6*7, as intended.
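As an aside, the parenthesization advice given earlier for object-like macros applies to function-like macros as well. A more defensive variant of F, not part of the original example, wraps the body and each parameter in parentheses:

#define F(x, y) ((x) * (y) * K)

With this definition, an invocation like F(Z + 1, 6) still groups the first argument as intended, whereas the unparenthesized version would expand it to Z + 1*6*K.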
• The option -Map=<file> writes a link map to <file>. When the ad-
ditional --cref option is given, the map also includes a cross reference
table. Even though no further information about it will be given here, due
to lack of space, the link map contains a significant amount of information
about the outcome of the linking process. Interested readers may refer to
the linker documentation [34] for more details about it.
• The option --oformat=<format> sets the format of the output file,
among those recognized by ld. Being able to precisely control the output
format helps to upload the executable image into the target platform suc-
cessfully. Reference [34] contains the full list of supported output formats,
depending on the target architecture and linker configuration.
• The options --strip-all and --strip-debug remove symbolic informa-
tion (all of it and debugging information only, respectively) from the output
file, leaving the executable code and data in place. This step is sometimes
required for executable image upload tools to work correctly, because they
might not handle any extra information present in the image properly.
When ld is invoked through the compiler driver, linker options must be pre-
ceded by the escape sequence -Wl to distinguish them from options directed to the
compiler driver itself. A comma is used to separate the escape sequence from the
string to be forwarded to the linker and no intervening spaces are allowed. For in-
stance, gcc -Wl,-Map=f.map -o f f.c compiles and links f.c, and gives the
-Map=f.map option to the linker.
As a last introductory step, it is also important to informally recall the main differ-
ences between object modules, libraries, and executable images as far as the linker
is concerned. These differences, outlined below, will be further explained and high-
lighted in the following sections.
• Object files are always included in the final executable image; the object
modules found in libraries (also called library modules), instead, are used
only on demand.
• More specifically, library modules are included by the linker only if they are
needed to resolve pending symbol references, as will be better described in
the following section.
• A library is simply a collection of unmodified object modules put together
into a single file by the archiver or librarian ar.
• An executable image is formed by binding together object modules, either
standalone or from libraries, by the linker. However, it is not simply a col-
lection, like a library is, because the linker performs a significant amount
of work in the process.
1. If the processor supports relative jumps—that is, a jump in which the target ad-
dress is calculated as the sum of the current program counter plus an offset stored
in the jump instruction—the compiler may be able to generate the code com-
pletely and automatically by itself, because it knows the “distance” between the
jump instruction and its target. The linker is not involved in this case.
2. If the processor only supports absolute jumps—that is, a jump in which the target
address is directly specified in the jump instruction—the compiler must leave a
“blank” in the generated code, because it does not know where the code will even-
tually end up in memory. As will be better explained in the following, this blank
will be filled by the linker when it performs symbol resolution and relocation.
Another intuitive example, regarding data instead of code addresses, is repre-
sented by global variables accessed by means of an extern declaration. In this
case, the compiler needs to refer to the variable by name when it generates code,
without knowing its memory address at all. Here, too, the code that the com-
piler generates will be incomplete because it will include “blanks,” in which symbols
are referenced instead of actual memory addresses.
When the linker collects object files, in order to produce the executable image, it
becomes possible to associate symbol definitions and the corresponding references,
by means of a name-matching process known as symbol resolution or (according to
an older nomenclature) snapping.
On the other hand, symbol values (to continue our examples, addresses of vari-
ables, and the exact address of machine instructions) become known when the linker
relocates object contents in order to lay them out into memory. At this point, the
linker can “fill the blanks” left by the compiler.
As an example of how symbol resolution takes place for data, let us consider the
following two, extremely simple source files.
f.c:

extern int i;

void f(void) {
    i = 7;
}

g.c:

int i;
• When the compiler generates code for f() it does not know where (and if)
variable i is defined. Therefore, in f.o the address of i is left blank, to be
filled by the linker.
• This is because the compiler works on exactly one source file at a time.
When it is compiling f.c it does not consider g.c in any way, even though
both files appear together on the command line.
• During symbol resolution, the linker observes that i is defined in g.o and
associates the definition with the reference made in f.o.
• After the linker relocates the contents of g.o, the address of i becomes
known and can eventually be used to complete the code in f.o.
It is also useful to remark that initialized data need special treatment when the
initial values must be in non-volatile memory. In this case, the linker must cooperate
with the startup code (by providing memory layout information) so that the initial-
ization can be performed correctly. Further information on this point will be given in
Section 3.4.3.
1. The input and output part picks the input files (object files and libraries) that the
linker must consider and directs the linker output where desired.
2. The memory layout part describes the position and size of all memory banks avail-
able on the target system, that is, the space the linker can use to lay out the exe-
cutable image.
3. The section and memory mapping part specifies how the input files contents must
be mapped and relocated into memory banks.
If necessary, the linker script can be split into multiple files that are then bound
together by means of the INCLUDE <filename> directive. The directive takes a
file name as argument and directs the linker to include that file “as if” its contents
appeared in place of the directive itself. The linker supports nested inclusion, and
hence, INCLUDE directives can appear both in the main linker script and in an in-
cluded script.
This is especially useful when the linker script becomes complex or it is conve-
nient to divide it into parts for other reasons, for instance, to distinguish between
architecture or language-dependent parts and general parts.
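For instance, a main linker script could pull in a board-specific memory description with a directive like the following, where the file name is purely illustrative:

INCLUDE board_memory.ld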
Input and output linker script commands specify:
• Which input files the linker will operate on, either object files or libraries.
This is done by means of one or more INPUT() commands, which take the
names of the files to be considered as arguments.
• The sequence in which they will be scanned by the linker, to perform sym-
bol resolution and relocation. The sequence is implicitly established by the
order in which input commands appear in the script.
• The special way a specific file or group of files will be handled. For in-
stance, the STARTUP() command labels a file as being a startup file rather
than a normal object file.
• Where to look for libraries, when just the library name is given. This is
accomplished by specifying one or more search paths by means of the
SEARCH_DIR() command.
• Where the output—namely, the file that contains the executable image—
goes, through the OUTPUT() command.
Several of these commands overlap with linker command-line options: for instance,
the -o option is equivalent to OUTPUT(), and mentioning an object file name on the
linker command line has the same effect as putting it in an INPUT() linker script
command.
The entry point of the executable image—that is, the instruction that shall be
executed first—can be set by means of the ENTRY(<symbol>) command in the
linker script, where <symbol> is a symbol.
However, it is important to remark that the only effect of ENTRY is to keep a
record of the desired entry point and store it into the executable image itself. Then,
it becomes the responsibility of the bootloader mentioned in Chapter 2—often sim-
ply called loader in linker’s terminology—to obey what has been requested in the
executable image.
When no loader is used—that is, the executable image is uploaded by means of
an upload tool residing on the development host, and then runs on the target’s “bare
metal”—the entry point is defined by hardware. For example, most processors start
execution from a location indicated by their reset vector upon powerup. Any entry
point set in the executable image is ignored in this case.
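As an example, a linker script may request that execution start from the reset handler with a command like the following; the symbol name is illustrative and must match one defined in the startup code.

ENTRY(Reset_Handler)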
Taken together, the input linker script commands eventually determine the linker in-
put sequence. Let us now focus on a short fragment of a linker sequence that contains
several input commands and describe how the input sequence is built from them.
INPUT(a.o, b.o, c.o)
INPUT(d.o, e.o)
INPUT(libf.a)
Normally, the linker scans the input files once and in the order established by the
input sequence, which is defined by:
• The order in which files appear within the INPUT() command. In this case,
b.o follows a.o in the input sequence and e.o follows d.o.
• If there are multiple INPUT() commands in the linker script, they are con-
sidered in the same sequence as they appear in the script.
Therefore, in our example the linker scans the files in the order: a.o, b.o, c.o,
d.o, e.o, and libf.a.
As mentioned previously, object files can also be specified on the linker command
line and become part of the input sequence, too. In this case:
• The command line may also include an option (-T) to refer to the linker
script.
• The input files specified on the command line are combined with those
mentioned in the linker script depending on where the linker script has been
referenced.
For instance, if the command line is gcc ... a.o -Tscript b.o and the
linker script script contains the command INPUT(c.o, d.o), then the input
sequence is: a.o, c.o, d.o, and b.o.
As mentioned previously the startup file is a special object file because it contains
low-level hardware initialization code. For example, it may set the CPU clock source
Figure 3.4 Linker’s handling of object files and libraries in the input sequence.
and frequency. Moreover, it sets up the execution environment for application code.
For instance, it is responsible for preparing initialized data for use. As a consequence,
its position in memory may be constrained by the hardware startup procedure.
The STARTUP(<file>) command forces <file> to be the very first object file
in the input sequence, regardless of where the command is. For example, the linker
script fragment
INPUT(a.o, b.o)
STARTUP(s.o)
leads to the input sequence s.o, a.o, and b.o, even though s.o is mentioned last.
Let us now mention how the linker transforms the input sequence into the out-
put sequence of object modules that will eventually be used to build the executable
image. We will do this by means of an example, with the help of Figure 3.4. In our
example, the input sequence is composed of an object file g.o followed by two li-
braries, liba.a and libb.a, in this order. They are listed at the top of the figure,
from left to right. For clarity, libraries are depicted as lighter gray rectangles, while
object files correspond to darker gray rectangles. In turn, object files contain function
definitions and references, as is also shown in the figure.
The construction of the output sequence proceeds as follows.
• Object module g.o, being a standalone object file, is placed in the output
unconditionally; its reference to symbol a makes a undefined.
• When the linker scans liba.a, since only a is undefined at the moment, only
module a.o is put in the output. More specifically, module f.o is not, because
the linker is not aware of any undefined symbols related to it.
• When the linker scans libb.a, it finds a definition of b and places module
b.o in the output. In turn, c becomes undefined. Since c is defined in c.o,
that is, another module within the same library, the linker places this object
module in the output, too.
• Module c.o contains a reference to f, and hence, f becomes undefined.
Since the linker scans the input sequence only once, it is unable to refer
back to liba.a at this point. Even though liba.a defines f, that defini-
tion is not considered. At point ⃝
3 f is still undefined.
According to the example it is evident that the linker implicitly handles libraries
as sets. Namely, the linker picks up object modules from a set on demand and places
them into the output. If this action introduces additional undefined symbols, the
linker looks into the set again, until no more references can be resolved. At this
time, the linker moves to the next object file or library.
As also shown in the example, this default way of scanning the input sequence is
problematic when libraries contain circular cross references. More specifically, we
say that a certain library A contains a circular cross-reference to library B when one
of A’s object modules contains a reference to one of B’s modules and, symmetrically,
one of B’s modules contains a reference back to one module of library A.
When this occurs, regardless of the order in which libraries A and B appear in the
input sequence, it is always possible that the linker is unable to resolve a reference
to a symbol, even though one of the libraries indeed contains a definition for it. This
is what happens in the example for symbol f.
In order to solve the problem, it is possible to group libraries together. This is done
by means of the command GROUP(), which takes a list of libraries as argument.
For example, the command GROUP(liba.a, libb.a) groups together libraries
liba.a and libb.a and instructs the linker to handle both of them as a single set.
Going back to the example, the effect of GROUP(liba.a, libb.a) is that it
directs the linker to look back into the set, find the definition of f, and place module
f.o in the output.
It is possible to mix GROUP() and INPUT() within the input sequence to trans-
form just part of it into a set. For example, given the input sequence listed below:
INPUT(a.o, b.o)
GROUP(liba.a, libb.a)
INPUT(libc.a)
the linker will first examine a.o, and then b.o. Afterwards, it will handle liba.a
and libb.a as a single set. Last, it will handle libc.a on its own.
Moreover, as will become clearer in the following, the use of GROUP() makes
sense only for libraries, because object files are handled in a different way in the first
place. Figure 3.5 further illustrates the differences. In particular, the input sequence
shown in Figure 3.5 is identical to the one previously considered in Figure 3.4, with
the only exception that libraries have been replaced by object modules.
The input sequence of Figure 3.5 is processed as follows:
• Object module g.o is placed in the output and symbol a becomes unde-
fined at point 1.
• When the linker scans s.o it finds a definition for a and places the whole
object module in the output.
• This provides a definition of f, even though it was not called for at the
moment, and makes b undefined at point 2.
• When the linker scans t.o it finds a definition of b and it places the whole
module in the output. This also provides a definition of c.
• The reference to f made by c can be resolved successfully because the
output already contains a definition of f.
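Moving on to the description of the target memory layout, the MEMORY command declares the memory banks available to the linker. In general terms, and following the GNU linker documentation, it takes the form sketched below:

MEMORY
{
    <name> [(<attr>)] : ORIGIN = <origin>, LENGTH = <length>
    ...
}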
where:
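• <name> is a label used to refer to the memory region from the rest of the linker script;
• the optional <attr> string lists the attributes (for instance, r, w, and x) of the sections that may be placed in the region by default;
• <origin> and <length> specify the start address and the size of the region, respectively.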
For example, the following MEMORY command describes the Flash memory bank
and the main RAM bank of the LPC1768, as defined in its user manual [128].
MEMORY
{
rom (rx) : ORIGIN = 0x00000000, LENGTH = 512K
ram (rwx) : ORIGIN = 0x10000000, LENGTH = 32K
}
To further examine a rather common issue that may come even from the seem-
ingly simple topic of memory layout, let us now consider a simple definition of an
initialized, global variable in the C language, and draw some comments on it. For
example,
int a = 3;
Often, the VMA and LMA of an object are the same. For example, the address
where a function is stored in memory is the same address used by the CPU to call it.
When they are not, a copy is necessary, as illustrated previously.
This kind of copy can sometimes be avoided by using the const keyword of the
C language, so that read-only data are allocated only in ROM. However, this is not
strictly guaranteed by the language specification, because const only determines
the data property at the language level, but does not necessarily affect their allocation
at the memory layout level.
In other words, data properties express how they can be manipulated in the pro-
gram, which is not directly related to where they are in memory. As a consequence,
the relationship between these two concepts may or may not be kept by the toolchain
during object code generation.
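For instance, with most embedded toolchains a const-qualified table such as the one below is emitted into a read-only input section (typically .rodata), which the linker script can then map to flash so that no RAM copy is needed; as just noted, though, this is a toolchain convention rather than a guarantee of the C language.

const int coefficients[4] = { 2, 3, 5, 7 };  /* candidate for ROM-only allocation */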
From the practical point of view, it is important to remark that the linker follows
the same order when it allocates memory for initialized variables in RAM and when
it stores their initial value in ROM. Moreover, the linker does not interleave any
additional memory object in either case. As a consequence, the layout of the ROM
area that stores initial values and of the corresponding RAM area is the same. Only
their starting addresses are different.
In turn, this implies that the relative position of variables and their corresponding
initialization values within their areas is the same. Hence, instead of copying variable
by variable, the startup code just copies the whole area in one single sweep.
The base addresses and size of the RAM and ROM areas used for initialized
variables are provided to the startup code, by means of symbols defined in the linker
script as described in Section 3.4.4.
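A minimal sketch of the corresponding startup code fragment is shown below. The symbol names are hypothetical, because each linker script is free to choose its own, but the structure, a single sweep from the ROM copy of the initial values to the RAM area, reflects what has just been described.

extern char _data_load[];   /* LMA: start of the initial values stored in ROM  */
extern char _data_start[];  /* VMA: start of the initialized data area in RAM  */
extern char _data_end[];    /* VMA: end of the initialized data area in RAM    */

void copy_initialized_data(void)
{
    const char *src = _data_load;
    char *dst = _data_start;

    while (dst < _data_end)   /* copy the whole area in one single sweep */
        *dst++ = *src++;
}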
As a final note for this section, it is worth remarking that there is an unfortunate
clash of terminology between virtual memory addresses as they are defined in the
linker’s nomenclature and virtual memory addresses in the context of virtual memory
systems, outlined in Chapter 15.
For instance, the assignment __stack = .; sets the symbol __stack to the value of the location counter.
Assigning a value to . moves the location counter. For example, the following as-
signment:
. += 0x4000
allocates 0x4000 bytes starting from where the location counter currently points and
moves the location counter itself after the reserved area.
An assignment may appear in three different positions in a linker script and its
position partly affects how the linker interprets it.
1. By itself. In this case, the assigned value is absolute and, contrary to the general
rule outlined previously, the location counter . cannot be used.
2. As a statement within a SECTIONS command. The assigned value is absolute
but, unlike in the previous case, the use of . is allowed. It represents an absolute
location counter.
3. Within an output section description, nested in a SECTIONS command. The as-
signed value is relative and . represents the relative value of the location counter
with respect to the beginning of the output section.
As an example, let us consider the following linker script fragment. More thor-
ough and formal information about output sections is given in Section 3.4.5.
SECTIONS
{
. = ALIGN(0x4000);
. += 0x4000;
__stack = .;
}
In this example, the location counter is first aligned to a 0x4000-byte boundary, then
0x4000 bytes are reserved starting from there, and finally the symbol __stack is set
to the address just past the reserved area, so that the startup code can use it, for
instance, as the initial stack pointer.
When the compiler generates an object file, it sorts its contents into three main
categories:
• code (.text),
• initialized data (.data),
• uninitialized data (.bss).
Each category corresponds to its own input section of the object file, whose name
has also been listed above. For example, the object code generated by the C compiler
is placed in the .text section of the input object files. Libraries follow the same
rules because they are just collections of object files.
The part of linker script devoted to section mapping tells the linker how to fill
the memory image with output sections, which are generated by collecting input
sections. It has the following syntax:
SECTIONS
{
<sub-command>
...
}
where:
• The SECTIONS command encloses a sequence of sub-commands, delim-
ited by braces.
• A sub-command may be:
• an ENTRY command, used to set the initial entry point of the executable
image as described in Section 3.4.2,
• a symbol assignment,
• an overlay specification (seldom used in modern programs),
• a section mapping command.
The order in which input section descriptions appear is important because it sets
the order in which input sections are placed in the output sections.
For example, the following input section description:
* ( .text .rodata )
places the .text and .rodata sections of all files in the input sequence in the
output section. The sections appear in the output in the same order as they appear in
the input.
Instead, these slightly different descriptions:
* ( .text )
* ( .rodata )
first places all the .text sections, and then all the .rodata sections.
Let us now examine the other main components of the section mapping command
one by one. The very first part of a section mapping command specifies the output
section name, address, and type. In particular:
• The AT attribute sets the LMA of the output section to address <lma>.
• The ALIGN attribute specifies the alignment of the output section.
• The SUBALIGN attribute specifies the alignment of the input sections
placed in the output section. It overrides the “natural” alignment specified
in the input sections themselves.
The memory block mapping specification is the very last part of a section mapping
command and comes after the list of output section commands. It specifies in which
memory block (also called region) the output section must be placed. Its syntax is:
[> <region>] [AT> <lma_region>]
[: <phdr> ...] [= <fillexp>]
where:
• > <region> specifies the memory block for the output section VMA, that
is, where it will be referred to by the processor.
• AT> <lma_region> specifies the memory block for the output section
LMA, that is, where its contents are loaded.
• <phdr> and <fillexp> are used to assign the output section to an out-
put segment and to set the fill pattern to be used in the output section,
respectively.
Segments are a concept introduced by some executable image formats, for exam-
ple the executable and linkable format (ELF) [33] format. In a nutshell, they can be
seen as groups of sections that are considered as a single unit and handled all together
by the loader.
The fill pattern is used to fill the parts of the output section whose contents are not
explicitly specified by the linker script. This happens, for instance, when the location
counter is moved or the linker introduces a gap in the section to satisfy an alignment
constraint.
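Putting the various parts together, a section mapping command for initialized data might look like the following sketch, to be placed within the SECTIONS command. The region names match the earlier MEMORY example, while the symbols that delimit the area are arbitrary and merely mirror those used by the startup code sketch given previously.

.data : ALIGN(4)
{
    _data_start = .;     /* VMA of the first initialized variable             */
    *(.data)             /* .data input sections of all input files, in order */
    . = ALIGN(4);
    _data_end = .;       /* VMA just past the last initialized variable       */
} > ram AT> rom          /* VMA in RAM, LMA (initial values) in ROM           */
_data_load = LOADADDR(.data);   /* LMA of the area, for use by the startup code */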
The most important aspect to be taken into account when porting this
library to a new architecture or hardware platform is probably to configure and pro-
vide adequate multitasking support for it. Clearly, this is of utmost importance to
ensure that the library operates correctly when it is used in a multitasking environ-
ment, like the one provided by any real-time operating system.
In principle, NEWLIB is a reentrant C library. That is, it supports multiple tasks
making use of and calling it concurrently. However, in order to do this, two main
constraints must be satisfied:
1. The library must be able to maintain some per-task information. This information
is used, for instance, to hold the per-task errno variable, I/O buffers, and so on.
2. When multiple tasks share information through the library, it is necessary to im-
plement critical regions within the library itself, by means of appropriate syn-
chronization points. This feature is used, for instance, for dynamic memory al-
location (like the malloc and free functions) and I/O functions (like open,
read, write, and others).
It is well known that, in most cases, when a C library function fails, it just returns
a Boolean error indication to the caller or, in other cases, a NULL pointer. Additional
information about the reason for the failure is stored in the errno variable.
In a single-task environment, there can be at most one pending library call
at any given time. In this case, errno can be implemented as an ordinary global
variable. On the contrary, in a multitasking environment, multiple tasks may make
a library call concurrently. If errno were still implemented as a global variable, it
would be impossible to know which task the error information is for. In addition, an
error caused by a task would overwrite the error information used by all the others.
The issue cannot be easily solved because, as will be better explained in Chap-
ters 4 and 5, in a multitasking environment the task execution order is nondetermin-
istic and may change from time to time.
As a consequence, the errno variable must hold per-task information and must
not be shared among tasks. In other words, one instance of the errno variable must
exist for each thread, but it must still be “globally accessible” in the same way as
a global variable. The last aspect is important mainly for backward compatibility.
In fact, many code modules were written with single-thread execution in mind and
preserving backward compatibility with them is important.
By “globally accessible,” we mean that a given thread must be allowed to refer to
errno from anywhere in its code, and the expected result must be provided.
The method adopted by NEWLIB to address this problem, on single-core systems,
is depicted in Figure 3.7 and is based on the concept of impure pointer.
Namely, a global pointer (impure_ptr) allows the library to refer to the per-
task data structure (struct _reent) of the running task. In order to do this, the
library relies on appropriate support functions (to be provided in the operating system
support layer), which:
1. Create and destroy per-task data structures at an appropriate time, that is, task
creation and deletion. This is possible because—even though the internal structure
of the struct _reent is opaque to the operating system—its size is known.
2. Update impure_ptr upon every context switch, so that it always points to the
per-task data structure corresponding to the running task.
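The following sketch shows the kind of operating system support involved. The task structure and scheduler hook are hypothetical, while _impure_ptr and struct _reent come from NEWLIB's <reent.h>.

#include <reent.h>

struct task {
    struct _reent reent;    /* per-task NEWLIB state, set up at task creation */
    /* ... other task control block fields ... */
};

/* invoked by the scheduler on every context switch */
void newlib_context_switch_hook(struct task *next)
{
    /* make library references such as errno resolve to the new running task */
    _impure_ptr = &next->reent;
}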
In addition, some library modules need to maintain data structures that are shared
among all tasks. For instance, the dynamic memory allocator maintains a single
memory pool that is used by the library itself (on behalf of tasks) and is also
made available to tasks for direct use by means of malloc(), free(), and other
functions.
The formal details on how to manage concurrent access to shared resources, data
structures in this case, will be given in Chapter 5. For the time being, we can con-
sider that, by intuition, it is appropriate to let only one task at a time use those data
structures, to avoid corrupting them. This is accomplished by means of appropri-
ate synchronization points that force tasks to wait until the data structures can be
accessed safely.
As shown in the left part of Figure 3.8, in this case, the synchronization point is
internal to the library. The library implements synchronization by calling an operat-
ing system support function at critical region boundaries. For instance, the dynamic
memory allocation calls __malloc_lock() before working on the shared memory
pool, and __malloc_unlock() after it is done.
Therefore, a correct synchronization is a shared responsibility between the library
and the operating system, because:
• the library “knows” where critical regions are and what they are for, and
• the operating system “knows” how to synchronize tasks correctly.
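For instance, an operating system support layer might provide the two memory allocator hooks along these lines; the mutex primitives and variable names are placeholders for whatever the underlying RTOS actually offers.

#include <reent.h>

/* hypothetical RTOS synchronization primitives */
extern void rtos_mutex_lock(void *mutex);
extern void rtos_mutex_unlock(void *mutex);
extern void *malloc_mutex;

void __malloc_lock(struct _reent *r)
{
    (void)r;                         /* per-task state is not needed here      */
    rtos_mutex_lock(malloc_mutex);   /* enter the memory pool critical region  */
}

void __malloc_unlock(struct _reent *r)
{
    (void)r;
    rtos_mutex_unlock(malloc_mutex); /* leave the critical region              */
}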
1. The configuration phase sets up all the parameters that will be used in the build
phase and automatically generates a Makefile. As will be explained in Sec-
tion 3.7, this file contains key information to automate the build process.
• The source directory tree is the place where the source code of the compo-
nent is stored, when its distribution archive is unpacked.
• The build directory tree is used in the build phase. The build phase is started
by running GNU make, and hence, the corresponding make command
must be launched from there. Since make uses the Makefile generated
during the configuration phase, the configuration must also be run from
the same place.
• The installation directory tree is used to install the toolchain component. It
can be specified during the configuration phase, by means of the --prefix
argument. Installation is started explicitly by means of the make install
command.
The first preliminary step required to build a toolchain component is to create the
source tree in the current directory, starting from the source package of the compo-
nent, as shown in the following.
tar xf <source_package>
Then, we can create the build directory, change into it and configure the compo-
nent, specifying where the installation tree is, as follows.
mkdir <build_dir>
cd <build_dir>
<path_to_source_tree>/configure \
--prefix=<installation_tree> \
...<other_options>...
Finally, we can build and (if the build was successful) install the component.
make
make install
Each toolchain component must know its own installation prefix, that is, the root
of the directory tree into which it will be installed. As shown previously, this
information must be provided to the component at configuration time.
Although it is possible, in principle, to install each toolchain component in a dif-
ferent location, it is recommended to use a single installation prefix for all of them.
A convenient way to do it is to set the environment variable PREFIX to a suitable
path, and use it when configuring the toolchain components.
Moreover, most toolchain components need to locate and run executable modules
belonging to other components, not only when they are used normally, but also while
they are being built. For example, the build process of GCC also builds the C runtime
library. This requires the use of the assembler, which is part of BINUTILS.
As toolchain components are built, their executable files are stored into the bin
subdirectory under the installation prefix. The value of environment variable PATH
must be extended to include this new path. This is because, during the build process,
executable files are located (as usual) by consulting the PATH environment variable.
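For instance, assuming the whole toolchain is to be installed under /opt/toolchain, a path chosen here purely for illustration, the two environment variables can be set as follows in a POSIX shell.

export PREFIX=/opt/toolchain
export PATH=$PREFIX/bin:$PATH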
In the following, we will give more detailed information on how to build a work-
ing toolchain, component by component. It is assumed that all the required source code
packages have already been downloaded into the current working directory. Moreover,
if any toolchain component requires architecture or operating system-dependent
patches, it is assumed that they are applied according to the instructions provided
with them.
The first component to be built is BINUTILS, which contains the GNU binary
utilities, by means of the following sequence of commands.
tar xjf binutils-<version>.tar.bz2
mkdir binutils-<version>-b
cd binutils-<version>-b
../binutils-<version>/configure --target=<target> \
--prefix=$PREFIX --enable-multilib --disable-nls
make
make install

The next components to be built are GCC and the NEWLIB C library, which can be built together in a single pass. After their source packages have been expanded into two separate directories, the following sequence of commands brings the auxiliary libraries and the NEWLIB source tree into the GCC source tree, then configures, builds, and installs both components.
cd gcc-<version>
tar xzf ../gmp-<version>.tar.gz
mv gmp-<version> gmp
tar xjf ../mpfr-<version>.tar.bz2
mv mpfr-<version> mpfr
ln -s ../newlib-<version>/newlib .
cd ..
mkdir gcc-<version>-b
cd gcc-<version>-b
../gcc-<version>/configure \
--target=<target> --prefix=$PREFIX \
--with-gnu-as --with-gnu-ld \
--with-newlib \
--enable-languages="c,c++" \
--enable-multilib \
--disable-shared --disable-nls
make
make install
• Like before, the GCC and NEWLIB source packages are expanded into two
separate directories.
• In order to build GCC, two extra libraries are required: gmp (for multiple-
precision integer arithmetic) and mpfr (for multiple-precision floating-
point arithmetic). They are used at compile-time—that is, the executable
image is not linked against them—to perform all arithmetic operations in-
volving constant values present in the source code. Those operations must
be carried out in extended precision to avoid round-off errors. The required
version of these libraries is specified in the GCC installation documenta-
tion, in the INSTALL/index.html file within the GCC source package.
The documentation also contains information about where the libraries can
be downloaded from.
• The GCC configuration script is also able to configure gmp and mpfr auto-
matically. Hence, the resulting Makefile builds all three components to-
gether and statically links the compiler with the two libraries. This is done
if their source distributions are found in two subdirectories of the GCC
source tree called gmp and mpfr, respectively, as shown in our example.
• Similarly, both GCC and NEWLIB can be built in a single step, too.
To do this, the source tree of NEWLIB must be linked to the GCC
source tree. Then, configuration and build proceed in the usual way. The
--with-newlib option informs the GCC configuration script about the
presence of NEWLIB.
• The GCC package supports various programming languages. The
--enable-languages option can be used to restrict the build to a subset
of them, only C and C++ in our example.
• The --disable-shared option disables the generation of shared li-
braries. Shared libraries are nowadays very popular in general-purpose
operating systems, as they allow multiple applications to share the same
memory-resident copy of a library and save memory. However, they are
not yet commonly used in embedded systems, especially small ones, due to
the significant amount of runtime support they require to dynamically link
applications against shared libraries when they are launched.
1. decide which parts of a component shall be rebuilt after some source modules
have been updated, based on their dependencies, and then
2. automatically execute the appropriate sequence of commands to carry out the
rebuild.
In an explicit rule:
• The target is usually a file that will be (re)generated when the rule is
applied.
• The prerequisites are the files on which the target depends and that, when modified, trigger the regeneration of the target.
• The sequence of command lines specifies the actions that GNU make must perform in order to regenerate the target, expressed in shell syntax.
Table 3.1  GNU make Command Line Execution Options

Option   Description
@        Suppress the automatic echo of the command line that GNU make
         normally performs immediately before execution.
-        When this option is present, GNU make ignores any error that occurs
         during the execution of the command line and continues.
• Every command line must be preceded by a tab and is executed in its own
shell.
It is extremely important to pay attention to the last aspect of command line exe-
cution, which is often neglected, because it may have very important consequences
on the effects commands have.
For instance, the following rule does not list the contents of directory
somewhere.
all:
	cd somewhere
	ls
This is because, even though the cd command indeed changes current directory
to somewhere within the shell it is executed by, the ls command execution takes
place in a new shell, and the previous notion of current directory is lost when the new
shell is created.
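A common way around this issue is to put both commands on the same command line, so that they are executed by the same shell, for instance:

all:
	cd somewhere && ls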
As mentioned previously, the prerequisites list specifies the dependencies of the
target. GNU make looks at the prerequisites list to deduce whether or not a target
must be regenerated by applying the rule. More specifically, GNU make applies the
rule when one or more prerequisites are more recent than the target. For example,
the rule:
kbd.o : kbd.c defs.h command.h
	cc -c kbd.c
specifies that the object file kbd.o (the target) must be regenerated when at least one
file among kbd.c, defs.h, and command.h (the prerequisites) has been modified. In order
to regenerate kbd.o, GNU make invokes cc -c kbd.c (command line) within a
shell.
The shell, that is, the command line interpreter used to execute command lines, is
/bin/sh by default on Unix-like systems, unless the Makefile specifies otherwise
by setting the SHELL variable. Notably, the shell in use does not depend on the user's
login shell; this makes it easier to port a Makefile from one user environment to another.
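For instance, a Makefile that relies on features of a specific shell can state so explicitly, assuming that shell is installed at the usual path on the build host:

SHELL = /bin/bash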
Unless otherwise specified, by means of one of the command line execution op-
tions listed in Table 3.1, commands are echoed before execution. Moreover, when an
error occurs in a command line, GNU make abandons the execution of the current
rule and (depending on other command-line options) may stop completely. Com-
mand line execution options must appear at the very beginning of the command
line, before the text of the command to be executed. In order to obtain the same effects
in a systematic way, GNU make supports command-line options like --silent and --ignore,
which apply to all command lines or, in other words, change the default behavior of
GNU make.
3.7.2 VARIABLES
A variable is a name defined in a Makefile, which represents a text string. The
string is the value of the variable. The value of a certain variable VAR—by conven-
tion, GNU make variables are often written in all capitals—is usually retrieved and
used (that is, expanded) by means of the construct $(VAR) or ${VAR}.
In a Makefile, variables are expanded “on the fly,” while the file is being read,
except when they appear within a command line or on the right-hand part of vari-
able assignments made by means of the assignment operator “=”. The last aspect of
variable expansion is important and we will further elaborate on it in the following,
because the behavior of GNU make departs significantly from what is done by most
other language processors, for instance, the C compiler.
In order to introduce a dollar character somewhere in a Makefile without calling
for variable expansion, it is possible to use the escape sequence $$, which represents
one dollar character, $.
Another difference with respect to other programming languages is that the
$() operators can be nested. For instance, it is legal, and often useful, to state
$($(VAR)). In this way, the value of a variable (like VAR) can be used as a variable
name. For example, let us consider the following fragment of a Makefile.
MFLAGS = $(MFLAGS_$(ARCH))
MFLAGS_Linux = -Wall -Wno-attributes -Wno-address
MFLAGS_Darwin = -Wall
In this fragment, the value of ARCH selects which set of compiler flags MFLAGS expands to: when ARCH is Linux, for instance, $(MFLAGS) yields the value of MFLAGS_Linux.
A variable can get a value in several different, and rather complex, ways, listed
here in order of decreasing priority.
Table 3.2  GNU make Assignment Operators
In order to better grasp the effect of delayed variable expansion, let us consider
the following two examples.
X = 3
Y = $(X)
X = 8
In this first example, the value of Y is 8, because the right-hand side of its as-
signment is expanded only when Y is used. Let us now consider a simply expanded
variable.
X = 3
Y := $(X)
X = 8
In this case, the value of Y is 3 because the right-hand side of its assignment is
expanded immediately, when the assignment is performed.
As can be seen from the previous examples, delayed expansion of recursively
expanded variables has unusual, but often useful, side effects. Let us just briefly
consider the two main benefits of delayed expansion:
In particular:
Table 3.3  GNU make Automatic Variables
• If a target pattern does not contain any slash—which is the character that
separates directory names in a file path specification—all directory names
are removed from target file names before comparing them with the pattern.
• Upon a successful match, directory names are restored at the beginning of
the stem. This operation is carried out before generating prerequisites.
• Prerequisites are generated by substituting the stem of the rule in the right-
hand part of the rule, that is, the part that follows the colon (:).
• For example, file src/p.o satisfies the pattern rule %.o : %.c. In
this case, the prefix is empty, the stem is src/p and the prerequisite is
src/p.c because the src/ directory is removed from the file name be-
fore comparing it with the pattern and then restored.
When a pattern rule is applied, GNU make automatically defines several auto-
matic variables, which become available in the corresponding command lines. Ta-
ble 3.3 contains a short list of these variables and describes their contents.
As an example, the rightmost column of the table also shows the value that au-
tomatic variables would get if the rules above were applied to regenerate kbd.o,
mentioned before, because defs.h has been modified.
To continue the example, let us assume that the Makefile we are considering
contains the following additional rule. The rule updates library lib.a, by means of
the ar tool, whenever any of the object files it contains (main.o kbd.o disk.o)
is updated.
lib.a : main.o kbd.o disk.o
	ar rs $@ $?
After applying the previous rule, kbd.o becomes more recent than lib.a, be-
cause it has just been updated. In turn, this triggers the application of the second rule
shown above. While the second rule is being applied, the automatic variable corre-
sponding to the target of the rule ($@) is set to lib.a and the list of prerequisites
more recent than the target ($?) is set to kbd.o.
To further illustrate the use of automatic variables, we can also remark that we
could use $^ instead of $? in order to completely rebuild the library rather than
update it. This is because, as mentioned in Table 3.3, $^ contains the list of all
prerequisites of the rule.
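That is, a variant of the previous rule that always rebuilds the whole library whenever one of its members changes could be written as:

lib.a : main.o kbd.o disk.o
	ar rs $@ $^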
It is also useful to remark that GNU make comes with a large set of predefined,
built-in rules. Most of them are pattern rules, and hence, they generally apply to a
wide range of targets and it is important to be aware of their existence. They can be
printed by means of the command-line option --print-data-base, which can
also be abbreviated as -p.
For instance, there is a built-in pattern rule to generate an object file given the
corresponding C source file:
%.o: %.c
	$(COMPILE.c) $(OUTPUT_OPTION) $<
The variables cited in the command line have a built-in definition as well,
that is:
COMPILE.c = $(CC) $(CFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c
OUTPUT_OPTION = -o $@
CC = cc
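In a cross-compilation setting like the one discussed in this chapter, these variables are typically overridden near the top of the Makefile. The target prefix below is only an example and depends on the toolchain actually installed:

CC = arm-none-eabi-gcc
CFLAGS = -O2 -g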
Besides variables, GNU make also provides a number of built-in functions. A function call has the general syntax

$(<function> <arguments>)

where <function> is the name of the function being invoked and <arguments> is the list of arguments it operates on, separated by commas.
As was done for directives, in the following we will informally discuss only a few
GNU make functions that are commonly found in Makefiles. Interested readers
should refer to the full documentation of GNU make, available online [65], for in-
depth information.
all: $(ELF)

%.elf: %.c
	$(CC) -o $@ $<
• The function $(shell command) executes a shell command and cap-
tures its output as return value. For example, when executed on a Linux
system:
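A minimal sketch of such a use, assuming the uname command is available on the build host, is:

ARCH := $(shell uname -s)      # ARCH gets the value "Linux" on a Linux system

all:
	@echo "Building for $(ARCH)"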
3.8 SUMMARY
This chapter provided an overview of the most peculiar aspects of a GCC-based
cross-compilation toolchain, probably the most commonly used toolchain for em-
bedded software development nowadays.
After starting with an overview of the workflow of the whole toolchain, which
was the subject of Section 3.1, the discussion went on by focusing on specific compo-
nents, like the compiler driver (presented in Section 3.2), the preprocessor (described
in Section 3.3), and the runtime libraries (Section 3.5).
Due to its complexity, the discussion of the inner workings of the compiler has
been postponed to Chapter 11, where it will be more thoroughly analyzed in the
context of performance and memory footprint optimization.
Instead, this chapter went deeper into describing two very important, but often
neglected, toolchain components, namely the linker (which was the subject of Sec-
tion 3.4) and GNU make (discussed in Section 3.7).
Last, but not least, this chapter also provided some practical information on how
to configure and build the toolchain components, in Section 3.6. More specific in-
formation on how to choose the exact version numbers of those components will be
provided in Chapter 12, in the context of a full-fledged case study.
4 Execution Models for
Embedded Systems
CONTENTS
4.1 The Cyclic Executive....................................................................................... 81
4.2 Major and Minor Cycles.................................................................................. 83
4.3 Task Splitting and Secondary Schedules ......................................................... 87
4.4 Task-Based Scheduling.................................................................................... 91
4.5 Task State Diagram.......................................................................................... 95
4.6 Race Conditions in Task-Based Scheduling.................................................... 98
4.7 Summary........................................................................................................ 102
A key design point of any embedded system is the selection of an appropriate execu-
tion model that, generally speaking, can be defined as the set of rules and constraints
that organize the execution of the embedded system’s activities.
Traditionally, many embedded systems—especially the smallest and simplest
ones—were designed around the cyclic executive execution model that is simple,
efficient, and easy to understand. However, it lacks modularity and its application
becomes more and more difficult as the size and complexity of the embedded system
increase.
For this reason, task-based scheduling is nowadays becoming a strong competitor,
even if it is more complex for what concerns both software development and system
requirements. This chapter discusses and compares the two approaches to help the
reader choose the best one for a given design problem, and to introduce the more general
topic of selecting the right execution model for the application at hand.
ponents of an embedded, real-time system. For this reason, this book will mainly
focus on this kind of task. Interested readers should refer to References [26, 111] for
a more thorough and formal description of task scheduling theory, which includes
more sophisticated and flexible task models. Reference [152] provides a comprehen-
sive overview about the historical evolution of task scheduling theory and practice.
To stay within the scope of the book, discussion will be limited to single-processor
systems.
In its most basic form, a cyclic executive is very close to the typical intuitive
structure of a simple embedded control system, whose time diagram is depicted in
Figure 4.1. As shown in the figure, the activities carried out by the control system
can be seen as three tasks, all with the same period T .
1. An input task interacts with input devices (e.g., sensors) to retrieve the input vari-
ables of the control algorithm.
2. A processing task implements the control algorithm and computes the output vari-
ables based on its inputs.
3. An output task delivers the values of the output variables to the appropriate output
devices (e.g., actuators).
To guarantee that the execution period is fixed, regardless of unavoidable varia-
tions in task execution time, the cyclic executive makes use of a timing reference
signal and synchronizes with it at the beginning of each cycle. Between the end of
the output task belonging to the current cycle and the beginning of the input task
belonging to the next cycle—that is, while the synchronization with the timing refer-
ence is being performed—the system is idle. Those areas are highlighted in gray in
Figure 4.1.
In most embedded systems, hardware timers are available to provide accurate
timing references. They can deliver timing signals by means of periodic interrupt
requests or, in simple cases, the software can perform a polling loop on a counter
register driven by the timer itself.
For what concerns the practical implementation, the cyclic executive correspond-
ing to the time diagram shown in Figure 4.1 can be coded as an infinite loop contain-
ing several function calls.
while(1)
{
    WaitForReference();
    Input();
    Processing();
    Output();
}
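As a minimal sketch, WaitForReference() could be implemented by polling a free-running counter incremented by a hardware timer; the counter name, its width, and the period expressed in ticks are all assumptions made here for illustration.

/* Hypothetical free-running counter, incremented by a periodic timer
 * interrupt or mapped onto a hardware counter register.
 */
extern volatile unsigned int ticks;

#define PERIOD_TICKS 100u        /* assumed: T expressed in timer ticks */

void WaitForReference(void)
{
    static unsigned int next = 0;

    next += PERIOD_TICKS;                /* absolute time of the next cycle */
    while ((int)(ticks - next) < 0)
        ;                                /* busy-wait until that instant */
}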
As shown in Figure 4.2, which depicts a cyclic executive in which the major cycle
is composed of 3 minor cycles, minor cycle boundaries are synchronization points.
At these points, the cyclic executive waits for a timing reference signal. As a con-
sequence, the task activated at the very beginning of a minor cycle is synchronized
with real time as accurately as possible. On the other hand, within a minor cycle,
tasks are activated in sequence and suffer from any execution time jitter introduced
by other tasks that precede them.
Minor cycle synchronization is also very useful to detect the most critical error
that a cyclic executive may experience, that is, a cycle overrun. An overrun occurs
when—contrary to what was assumed at design time—some task functions have an
execution time that is longer than expected and the accuracy of the overall cyclic
executive timing can no longer be guaranteed.
As an example, Figure 4.2 shows what happens if the task functions invoked in
minor cycle 2 exceed the minor cycle length. If the cyclic executive does not handle
this condition and simply keeps executing task functions according to the designed
sequence, the whole minor cycle 3 is shifted to the right and its beginning is no longer
properly aligned with the timing reference.
Overrun detection at a synchronization point occurs in two different ways depend-
ing on how synchronization is implemented:
1. If the timing reference consists of an interrupt request from a hardware timer, the
cyclic executive will detect the overrun as soon as the interrupt occurs, that is, at
time t1 in Figure 4.2.
2. If the cyclic executive synchronizes with the timing reference by means of a
polling loop, it will be able to detect the overrun only when minor cycle 2 even-
tually ends, that is, when the last task function belonging to that cycle returns.
At that point, the cyclic executive should wait for the beginning of the next minor
cycle, but it will notice that the instant it should wait until is already in the past.
In the figure, this occurs at time t2 .
Generally speaking, the first method provides a more timely detection of overruns,
as well as being an effective way to deal with indefinite non-termination of a task
function. However, a proper handling of an overrun is often more challenging than
detection itself. Referring back to Figure 4.2, it would be quite possible for the cyclic
Table 4.1  A Simple Task Set for a Cyclic Executive (Tm = 20 ms, TM = 120 ms)

Task τi   Period Ti (ms)   Execution time Ci (ms)   ki
τ1        20               6                        1
τ2        40               4                        2
τ3        60               2                        3
τ4        120              6                        6
executive to abort minor cycle 2 at t1 and immediately start minor cycle 3, but this
may lead to significant issues at a later time. For instance, if a task function was
manipulating a shared data structure when it was aborted, the data structure was
likely left in an inconsistent state.
In addition, an overrun is most often indicative of a design, rather than an execu-
tion issue, at least when it occurs systematically. For this reason, overrun handling
techniques are often limited to error reporting to the supervisory infrastructure (when
it exists), followed by a system reset or shutdown.
Referring back to Figure 4.2, if we denote the minor cycle period as Tm and there
are N minor cycles in a major cycle (N = 3 in the figure), the major cycle period
is equal to TM = N Tm . Periodic tasks with different periods can be placed within
this framework by invoking their task function from one or more minor cycles. For
instance, invoking a task function from all minor cycles leads to a task period T = Tm ,
while invoking a task function just in the first minor cycle gives a task period T = TM .
Given a set of s periodic tasks τ1 , . . . , τs , with their own periods T1 , . . . , Ts , it is
interesting to understand how to choose Tm and TM appropriately, in order to accom-
modate them all. Additional requirements are to keep Tm as big as possible (to reduce
synchronization overheads), and keep TM as small as possible (to decrease the size
of the scheduling table).
It is easy to show that an easy, albeit not optimal, way to satisfy these constraints
is to set the minor cycle length to the greatest common divisor (GCD) of the task
periods, and set the major cycle length to their least common multiple (LCM). Ref-
erence [16] contains information on more sophisticated methods. In formula:
Tm = gcd(T1 , . . . , Ts ) (4.1)
TM = lcm(T1 , . . . , Ts ) (4.2)
When Tm and TM have been chosen in this way, the task function corresponding
to τi must be invoked every ki = Ti /Tm minor cycles. Due to Equation (4.1), it is
guaranteed that every ki is an integer and a sub-multiple of N, and hence, the schedule
can actually be built in this way.
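The following sketch shows one possible way to code such a schedule, using a table with one row per minor cycle. All names, as well as the contents of the table, are placeholders chosen for illustration and are not part of any specific operating system interface.

#include <stddef.h>

#define N_MINOR 6                          /* N = TM / Tm minor cycles */
#define MAX_PER_CYCLE 4

typedef void (*task_fn_t)(void);

void Task1(void), Task2(void), Task3(void), Task4(void);
void WaitForMinorCycle(void);              /* synchronize with the timing reference */

/* The scheduling table: each row lists the task functions activated in the
 * corresponding minor cycle, terminated by NULL.
 */
static task_fn_t schedule[N_MINOR][MAX_PER_CYCLE + 1] = {
    { Task1, Task2, Task3, NULL },         /* minor cycle 0 */
    { Task1, Task4, NULL },                /* minor cycle 1 */
    { Task1, Task2, NULL },                /* minor cycle 2 */
    { Task1, Task3, NULL },                /* minor cycle 3 */
    { Task1, Task2, NULL },                /* minor cycle 4 */
    { Task1, NULL },                       /* minor cycle 5 */
};

void cyclic_executive(void)
{
    unsigned int c = 0;                    /* index of the current minor cycle */

    while (1) {
        WaitForMinorCycle();
        for (int i = 0; schedule[c][i] != NULL; i++)
            schedule[c][i]();              /* run the tasks of this minor cycle */
        c = (c + 1) % N_MINOR;
    }
}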
As an example, let us consider the task set listed in Table 4.1. Besides the task
periods, the table also lists their execution time and the values ki , computed as dis-
Figure 4.3 The cyclic executive schedule for the task set of Table 4.1.
cussed above. The execution time—usually denoted as Ci —is defined as the amount
of CPU time it takes to execute the task function corresponding to τi . It is especially
important to evaluate or, at least, estimate these values at design time, in order to
allocate task functions in the right minor cycles and avoid overruns.
In this example, from Equations (4.1) and (4.2), it is Tm = gcd(20, 40, 60, 120) = 20 ms and TM = lcm(20, 40, 60, 120) = 120 ms, and hence, a major cycle consists of N = 6 minor cycles. The resulting schedule, shown in Figure 4.3, also highlights two general properties of the cyclic executive approach:
1. It is generally impossible to ensure that all tasks will be executed with their exact
period for every instance, although this is true on average. In the schedule shown
in Figure 4.3 this happens to τ3 , which is executed alternately at time intervals
of T3′ = 56 ms and T3′′ = 64 ms. In other words, on average the task period is still
T3 = 60 ms, but every instance will suffer from an activation jitter of ±4 ms around
the average period.
In this particular example, the jitter is due to the presence of τ2 in minor cycle 0,
whereas it is absent in minor cycle 3. In general, it may be possible to choose
which tasks should suffer from jitter, up to a certain extent, but it cannot be removed
completely. In this case, swapping the activation of τ2 and τ3 in minor cycle 0 would
remove the jitter from τ3 , but would introduce some jitter on τ2 . Moving τ2 so that
Table 4.2  A Task Set That Requires Task Splitting to Be Scheduled (Tm = 20 ms, TM = 120 ms)

Task τi   Period Ti (ms)   Execution time Ci (ms)   ki
τ1        20               6                        1
τ2        40               4                        2
τ3        60               2                        3
τ4        120              16                       6
it is activated in the odd minor cycles instead of the even ones does not completely
solve the problem, either.
2. There may be some freedom in where to place a certain task within the schedule.
In the example, τ4 may be placed in any minor cycle. In this case, the minor
cycle is chosen according to secondary goals. For instance, choosing one of the
“emptiest” minor cycles—as has been done in the example—is useful to reduce
the likelihood of overrun, in case some of the Ci have been underestimated.
Figure 4.4 A possible task split for the task set of Table 4.2.
1. The criteria used to decide where a task must be split depend on task timing and
not on its behavior. For this reason, when a task is split, it may be necessary to
cut through its code in a way that has little to do with the internal, logic structure
of the task itself. As a result, the code will likely be harder to understand and
maintain.
2. Even more importantly, splitting a task, as has been done with τ4 in the example,
introduces a race condition zone, or window, depicted as a gray area in Figure 4.4.
Generally speaking, a race condition occurs whenever two tasks are allowed to
access a shared data structure in an uncontrolled way. A race condition window is
a time frame in which, due to the way tasks are scheduled, a race condition may
occur.
In this case, if τ4 shares a data structure with τ1 or τ2 —the tasks executed within
the race condition zone—special care must be taken.
• If τ4 modifies the data structure in any way, the first part of τ4 must be imple-
mented so that it leaves the data structure in a consistent state, otherwise τ1
and τ2 may have trouble using it.
• If τ1 or τ2 update the shared data structure, the first and second part of τ4 must
be prepared to find the data structure in two different states and handle this
scenario appropriately.
In both cases, it is clear that splitting a task does not merely require taking its code,
dividing it into pieces and putting each piece into a separate function. Instead,
some non-trivial modifications to the code are needed to deal with race conditions.
These modifications add to code complexity and, in a certain sense, compromise
one important feature of cyclic executives recalled in Section 4.1, that is, their
ability to handle shared data structures in a straightforward, intuitive way.
Table 4.3  A Task Set Including a Task with a Large Period (Tm = 20 ms, TM = 1200 ms)

Task τi   Period Ti (ms)   Execution time Ci (ms)   ki
τ1        20               6                        1
τ2        40               4                        2
τ3        60               2                        3
τ4        1200             6                        60
It is also useful to remark that what has been discussed so far is merely a simple
introduction to the data sharing issues to be confronted when a more sophisticated
and powerful execution model is adopted. This is the case of task-based scheduling,
which will be presented in Section 4.4. To successfully solve those issues, a solid
grasp of concurrent programming techniques is needed. This will be the topic of
Chapter 5.
Besides tasks with a significant execution time, tasks with a large period, with
respect to the others, may be problematic in cyclic executive design, too. Let us
consider the task set listed in Table 4.3. The task set is very close to the one used in
the first example (Table 4.1) but T4 , the period of τ4 , has been increased to 1200 ms,
that is, ten times as before.
It should be remarked that tasks with a large period, like τ4 , are not at all un-
common in real-time embedded systems. For instance, such a task can be used for
periodic status and data logging, in order to summarize system activities over time.
For these tasks, a period on the order of 1 s is quite common, even though the other
tasks in the same system, dealing with data acquisition and control algorithm, require
much shorter periods.
According to the general rules to determine Tm and TM given in Equations (4.1)
and (4.2), it should be Tm = gcd(20, 40, 60, 1200) = 20 ms and TM = lcm(20, 40, 60, 1200) = 1200 ms.
In other words, the minor cycle length Tm would still be the same as before, but
the major cycle length TM would increase tenfold, exactly like T4 . As a consequence,
a major cycle would now consist of N = TM /Tm = 60 minor cycles instead of 6. What
is more, as shown in Figure 4.5, 59 minor cycles out of 60 would still be exactly the
same as before, and would take care of scheduling τ1 , τ2 , and τ3 , whereas just one
(minor cycle 1) would contain the activation of τ4 .
The most important consequence of N being large is that the scheduling table
becomes large, too. This not only increases the memory footprint of the system, but
it also makes the schedule more difficult to visualize and understand. In many cases,
Figure 4.5 Schedule of the task set of Table 4.3, without secondary schedules.
Figure 4.6 Schedule of the task set of Table 4.3, using a secondary schedule.
• The operating system itself (or, more precisely, one of its main compo-
nents known as scheduler) is responsible for switching from one task to
another. The switching points from one task to another are no longer hard-
coded within the code and are chosen autonomously by the scheduler. As
a consequence, a task switch may occur anywhere in a task's code, with very few
exceptions that will be better discussed in Chapter 5.
• In order to determine the scheduling sequence, the scheduler must follow
some criteria, formally specified by means of a scheduling algorithm. In
general terms, any scheduling algorithm bases its work on certain task char-
acteristics or attributes. For instance, quite intuitively, a scheduling algo-
rithm may base its decision of whether or not to switch from one task to an-
other on their relative importance, or priority. For this reason, the concept
of task at runtime can no longer be reduced to a mere function containing
the task code, as is done in a cyclic executive, but it must include these
attributes, too.
The concept of task (also called sequential process in more theoretical descrip-
tions) was first introduced in Reference [48] and plays a central role in a task-based
system. It provides both an abstraction and a conceptual model of a code module
that is being executed. In order to represent a task at runtime, operating systems
store some relevant information about it in a data structure, known as task control
block (TCB). It must contain all the information needed to represent the execution of
a sequential program as it evolves over time.
As depicted in Figure 4.7, there are four main components directly or indirectly
linked to a TCB:
1. The TCB contains a full copy of the processor state. The operating system makes
use of this piece of information to switch the processor from one task to another.
This is accomplished by saving the processor state of the previous task into its
TCB and then restoring the processor state of the next task, an operation known
as context switch.
At the same time, the processor state relates the TCB to two other very important
elements of the overall task state. Namely, the program counter points to the next
instruction that the processor will execute, within the task’s program code. The
stack pointer locates the boundary between full and empty elements in the task
stack.
As can be inferred from the above description, the processor state is an essential
part of the TCB and is always present, regardless of which kind of operating
system is in use. Operating systems may instead differ on the details of where the
processor state is stored.
Conceptually, as shown in Figure 4.7, the processor state is directly stored in the
TCB. Some operating systems follow this approach literally, whereas others store
part or all of the processor state elsewhere, for instance in the task stack, and then
make it accessible from the TCB through a pointer.
The second choice is especially convenient when the underlying processor ar-
chitecture provides hardware assistance to save and restore the processor state
Figure 4.8 Task state diagram in the FreeRTOS operating system.
As a practical example, Figure 4.8 depicts the task state diagram (TSD) defined by the FreeRTOS
operating system [17, 18], used as a case study throughout this book. Referring to
the TSD, at any instant a process may be in one of the following states:
1. A task is in the ready state when it is eligible for execution but no processors are
currently available to execute it, because all of them are busy with other activities.
This is a common occurrence because the number of ready tasks usually exceeds
the total number of processors (often just one) available in the system. A task does
not make any progress when it is ready.
• A voluntary transition is performed under the control of the task that un-
dergoes it, as a consequence of one explicit action it took.
• An involuntary transition is not under the control of the task affected by
it. Instead, it is the consequence of an action taken by another task, the
operating system, or the occurrence of an external event.
a. The task creation transition instantiates a new TCB, which describes the task be-
ing created. After creation, the new task is not necessarily executed immediately.
However, it is eligible for execution and resides in the ready state.
b. The operating system is responsible for picking up tasks in the ready state for
execution and moving them into the running state, according to the outcome of
its scheduling algorithm, whenever a processor is available for use. This action is
usually called task scheduling.
c. A running task may voluntarily signal its willingness to relinquish the processor it
is being executed on by asking the operating system to reconsider the scheduling
decision it previously made. This is done by means of an operating system request
known as yield, which corresponds to transition c’ in the figure and moves the
invoking task from the running state to the ready state.
The transition makes the processor previously assigned to the task available for
use. This leads the operating system to run its scheduling algorithm and choose
a task to run among the ones in the ready state. Depending on the scheduling
algorithm and the characteristics of the other tasks in the ready state, the choice
may or may not fall on the task that just yielded.
Another possibility is that the operating system itself decides to run the scheduling
algorithm. Depending on the operating system, this may occur periodically or
whenever a task transitions into the ready state from some other states for any
reason. The second kind of behavior is more common with real-time operating
systems because, by intuition, when a task becomes ready for execution, it may
be “more important” than one of the running tasks from the point of view of the
scheduling algorithm.
When this is the case, the operating system forcibly moves one of the tasks in
the running state back into the ready state, with an action called preemption and
depicted as transition c” in Figure 4.8. Then, it will choose one of the tasks in the
ready state and move it into the running state by means of transition b.
One main difference between transitions c’ and c” is therefore that the first one is
voluntary, whereas the second one is involuntary.
d. The transition from the running to the blocked state is always under the control
of the affected task. In particular, it is performed when the task invokes one of
the operating system synchronization primitives to be discussed in Chapter 5, in
order to wait for an event e.
It must be remarked that this kind of wait is very different from what can be
obtained by using a polling loop because, in this case, no processor cycles are
wasted during the wait.
e. When event e eventually occurs, the waiting task is returned to the ready state
and starts competing again for execution against the other tasks. The task is not
returned directly to the running state because it may or may not be the most
important activity to be performed at the current time. As discussed for yield and
preemption, this is a responsibility of the scheduling algorithm and not of the
synchronization mechanism.
Depending on the nature of e, the component responsible for waking up the wait-
ing task may be another task (when the wait is due to inter-task synchronization),
the operating system timing facility (when the task is waiting for a time-related
event), or an interrupt handler (when the task is waiting for an external event, such
as an input–output operation).
f. The suspension and self-suspension transitions, denoted as f’ and f” in the figure,
respectively, bring a task into the suspended state. The only difference between
the two transitions is that the first one is involuntary, whereas the second one is
voluntary.
Table 4.4  A Task Set to Be Scheduled by the Rate Monotonic Algorithm
g. When a task is resumed, it unconditionally goes from the suspended state into
the ready state. This happens regardless of which state it was in before being
suspended.
An interesting side effect of this behavior is that, if the task was waiting for
an event when the suspend/resume sequence took place, it may resume execu-
tion even though the event did not actually take place. In this case, the task be-
ing resumed will receive an error indication from the blocking operating system
primitive.
h. Tasks permanently cease execution by means of a deletion or self-deletion tran-
sition. These two transitions have identical effects on the affected task. The only
difference is that, in the first case, deletion is initiated by another task, whereas in
the second case the task voluntarily deletes itself. As discussed above, the TCB
of a deleted task is not immediately removed from the system. A side effect of
a self-deletion transition is that the operating system’s scheduling algorithm will
choose a new task to be executed.
i. The TCB cleanup transition removes tasks from the waiting termination state.
After this transition is completed, the affected tasks completely disappear from
the system and their TCB may be reused for a new task.
Figure 4.9 Rate Monotonic scheduling of the task set specified in Table 4.4.
which is very commonly used in real-time systems, tasks are assigned a fixed priority
that is inversely proportional to their periods. As a consequence, Table 4.4 lists them
in decreasing priority order. At any instant, the scheduler grants the processor to the
highest-priority task ready for execution.
As can be seen in the figure, even though the task set being considered is ex-
tremely simple, the lowest-priority task τ3 is not only preempted in different places
from one instance to another, but by different tasks, too. If, for instance, tasks τ2 and
τ3 share some variables, no race condition occurs while the first instance of τ3 is
being executed, because the first instance of τ3 is preempted only by τ1 . On the other
hand, the second instance of τ3 is preempted by both τ1 and τ2 , and a race condition
may occur in this case.
As the schedule becomes more complex due to additional tasks, it rapidly be-
comes infeasible to foresee all possible scenarios. For this reason, a more thorough
discussion about race conditions and how to address them in a way that is indepen-
dent from the particular scheduling algorithm in use is given here. Indeed, dealing
with race conditions is perhaps the most important application of the task synchro-
nization techniques to be outlined in Chapter 5.
For the sake of simplicity, in this book race conditions will be discussed in rather
informal terms, looking mainly at their implication from the concurrent program-
ming point of view. Interested readers should refer, for instance, to the works of
Lamport [105, 106] for a more formal description. From the simple example pre-
sented above, it is possible to identify two general, necessary conditions for a race
condition to occur:
1. Two or more tasks must be executed concurrently, leaving open the possibility of
a context switch occurring among them. In other words, they must be within a
race condition zone, as defined above.
2. These tasks must be actively working on the same shared object when the context
switch occurs.
It should also be noted that the conditions outlined above are not yet sufficient to
cause a race condition because, in any case, the occurrence of a race condition is also
a time-dependent issue. In fact, even though both necessary conditions are satisfied,
the context switch must typically occur at very specific locations in the code to cause
trouble. In turn, this usually makes the race condition probability very low and makes
it hard to reproduce, analyze, and fix.
The second condition leads to the definition of critical region related to a given
shared object. This definition is of great importance not only from the theoretical
point of view, but also for the practical design of concurrent programs. The fact that,
for a race condition to occur, two or more tasks must be actively working on the same
shared object leads to classifying the code belonging to a task into two categories.
1. A usually large part of a task’s code implements operations that are internal to the
task itself, and hence, do not make access to any shared data. By definition, all
these operations cannot lead to any race condition because the second necessary
condition described above is not satisfied.
From the software development point of view, an important consequence of this
observation is that these pieces of code can be safely disregarded when the code
is analyzed to reason about and avoid race conditions.
2. Other parts of the task’s code indeed make access to shared data. Therefore, those
regions of code must be looked at more carefully because they may be responsible
for a race condition if the other necessary conditions are met, too.
For this reason, they are called critical regions or critical sections with respect to
the shared object(s) they are associated with.
Keeping in mind the definition of critical region just given, we can imagine that
race conditions on a certain shared object can be avoided by allowing only one task
to be within a critical region, pertaining to that object, within a race condition zone.
Since, especially in a task-based system, race condition zones may be large and it
may be hard to determine their locations in advance, a more general solution consists
of enforcing the mutual exclusion among all critical regions pertaining to the same
shared object at any time, without considering race condition zones at all.
Traditionally, the implementation of mutual exclusion among critical regions is
based on a lock-based synchronization protocol, in which a task that wants to access a
shared object, by means of a certain critical region, must first of all acquire some sort
of lock associated with the shared object and possibly wait, if it is not immediately
available.
Afterwards, the task is allowed to use the shared object freely. Even though a
context switch occurs at this time, it will not cause a race condition because any
other task trying to enter a critical region pertaining to the same shared object will
be blocked.
When the task has completed its operation on the shared object and the shared
object is again in a consistent state, it must release the lock. In this way, any other
task can acquire it and be able to access the shared object in the future.
In other words, in order to use a lock-based synchronization protocol, critical re-
gions must be “surrounded” by two auxiliary pieces of code, usually called the crit-
ical region entry and exit code, which take care of acquiring and releasing the lock,
respectively. For some kinds of task synchronization techniques, better described in
Chapter 5, the entry and exit code must be invoked explicitly by the task itself, and
hence, the overall structure of the code strongly resembles the one outlined above. In
other cases, for instance when using message passing primitives among tasks, crit-
ical regions as well as their entry/exit code may be “hidden” within the inter-task
communication primitives and be invisible to the programmer, but the concept is still
the same.
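As a sketch, the resulting code structure is shown below. Here lock_t, acquire_lock(), and release_lock() are placeholders for the actual synchronization primitives, which will be introduced in Chapter 5.

typedef struct lock lock_t;               /* placeholder lock type */
void acquire_lock(lock_t *lock);          /* placeholder entry primitive: may block */
void release_lock(lock_t *lock);          /* placeholder exit primitive */

extern lock_t shared_object_lock;         /* one lock per shared object */

void update_shared_object(void)
{
    acquire_lock(&shared_object_lock);    /* critical region entry code */

    /* ... critical region: work on the shared object; even if a context
     *     switch occurs here, no race condition can arise ...
     */

    release_lock(&shared_object_lock);    /* critical region exit code */
}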
For the sake of completeness, it must also be noted that mutual exclusion imple-
mented by means of lock-based synchronization is by far the most common, but it is
not the only way to solve the race condition problem. It is indeed possible to avoid
race conditions without any lock, by using lock-free or wait-free inter-task commu-
nication techniques.
Albeit a complete description of lock-free and wait-free communication is out-
side the scope of this book—interested readers may refer to References [2, 3, 4,
70, 72, 104] for a more detailed introduction to this subject and its application to
real-time embedded systems—some of their advantages against lock-based synchro-
nization are worth mentioning anyway. Moreover, a simple, hands-on example of
how wait-free communication works—in the context of communication between a
device driver and a hardware controller—will be given in Section 8.5.
It has already been mentioned that, during the lock acquisition phase, a task τa
blocks if another task τb is currently within a critical region associated with the same
lock. The block takes place regardless of the relative priorities of the tasks. It lasts at
least until τb leaves the critical region and possibly more, if other tasks are waiting
to enter their critical region, too.
As shown in Figure 4.10 if, for any reason, τb is delayed (or, even worse, halted
due to a malfunction) while it is within its critical region, τa and any other tasks
willing to enter a critical region associated with the same lock will be blocked and
possibly be unable to make any further progress.
Even though τb proceeds normally, if the priority of τa is higher than the priority
of τb , the way mutual exclusion is implemented goes against the concept of task
priority, because a higher-priority task is forced to wait until a lower-priority task
has completed part of its activities. Even though, as will be shown in Chapter 6, it is
possible to calculate the worst-case blocking time suffered by higher-priority tasks
for this reason and place an upper bound on it, certain classes of applications may
not tolerate it in any case.
4.7 SUMMARY
This chapter provided an overview of two main execution models generally adopted
in embedded system design and development. The first one, described in Sections 4.1
and 4.2, has the clear advantages of being intuitive for programmers and efficient for
what concerns execution.
On the other hand, it also has several shortcomings, outlined in Section 4.3, which
hinder its applicability as system size and complexity grow. Historically, this consid-
eration led to the introduction of more sophisticated execution models, in which the
concept of task is preserved and plays a central role at runtime.
5 Concurrent Programming Techniques
CONTENTS
5.1 Task Management.......................................................................................... 105
5.2 Time and Delays ............................................................................................ 111
5.3 Semaphores.................................................................................................... 114
5.4 Message Passing ............................................................................................ 125
5.5 Summary........................................................................................................ 136
A side effect of the adoption of task-based scheduling is the need for task communi-
cation and synchronization. All operating systems invariably provide a rich selection
of communication and synchronization primitives but, without a solid theoretical
background, it is easy to misuse them and introduce subtle, time-dependent errors in
the software.
This chapter provides the necessary background and bridges it into software
development practice by presenting the main communication and synchronization
primitives provided by the FreeRTOS open-source operating system [18]. On the
one hand, this RTOS is small/simple enough to be discussed in a small amount of
space. On the other hand, it is comprehensive enough to give readers a thorough
introduction to the topic.
Interested readers could refer to the FreeRTOS reference manual [19], for more
detailed information. The next chapter contains an introduction to real-time task
scheduling algorithms and the related scheduling analysis theory for a few simple
cases of practical relevance.
Table 5.1  Task Management Primitives of FreeRTOS

Function                  Purpose                                  Optional
vTaskStartScheduler       Start the scheduler                      -
vTaskEndScheduler         Stop the scheduler                       -
xTaskCreate               Create a new task                        -
vTaskDelete               Delete a task given its handle           ∗
uxTaskPriorityGet         Get the priority of a task               ∗
vTaskPrioritySet          Set the priority of a task               ∗
vTaskSuspend              Suspend a specific task                  ∗
vTaskResume               Resume a specific task                   ∗
xTaskResumeFromISR        Resume a specific task from an ISR       ∗
xTaskIsTaskSuspended      Check whether a task is suspended        ∗
vTaskSuspendAll           Suspend all tasks but the running one    -
xTaskResumeAll            Resume all tasks                         -
uxTaskGetNumberOfTasks    Return current number of tasks           -
For reasons more thoroughly discussed in Chapter 2, the executable image is usu-
ally stored in a nonvolatile memory within the target system (flash memories are in
widespread use nowadays). When the system is turned on, the image entry point is
invoked either directly, through the processor reset vector, or indirectly, by means of
a minimal boot loader. When no boot loader is present, the executable image must
also include an appropriate startup code, which takes care of initializing the target
hardware before executing the main() C-language application entry point.
Another important aspect to remark on is that, when main() gets executed, the
operating system scheduler is not yet active. It can be explicitly started and stopped
by means of the following function calls:
void vTaskStartScheduler(void);
void vTaskEndScheduler(void);
The only difference between the two cases is that, if a task is created when the
scheduler is already running, it is eligible for execution immediately and may even
preempt its creator. On the other hand, tasks created before starting the scheduler are
not eligible for execution until the scheduler is started.
The arguments of xTaskCreate are defined as follows:
The return value of xTaskCreate is a status code. If its value is pdPASS, the
function was successful in creating the new task and filling the memory area pointed
by pvCreatedTask with a valid handle, whereas any other value means that an
error occurred and may convey more information about the nature of the error.
The FreeRTOS header projdefs.h, automatically included by FreeRTOS.h,
contains the full list of error codes that may be returned by any operating system
function.
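As a sketch of typical usage, the following fragment creates one task before starting the scheduler; the task body, name, stack depth, and priority are arbitrary choices made here for illustration.

#include "FreeRTOS.h"
#include "task.h"

static void vBlinkTask(void *pvParameters)
{
    (void)pvParameters;
    for (;;) {
        /* ... periodic task body, for instance toggling a LED ... */
        vTaskDelay(pdMS_TO_TICKS(500));
    }
}

int main(void)
{
    TaskHandle_t xHandle;

    if (xTaskCreate(vBlinkTask,               /* task entry point */
                    "blink",                  /* human-readable task name */
                    configMINIMAL_STACK_SIZE, /* stack depth */
                    NULL,                     /* argument passed to the task */
                    tskIDLE_PRIORITY + 1,     /* task priority */
                    &xHandle) != pdPASS) {
        /* creation failed: handle the error */
    }

    vTaskStartScheduler();                    /* does not return if successful */
    for (;;) ;
}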
For what concerns task execution, FreeRTOS implements a fixed-priority sched-
uler in which, informally speaking, the highest-priority task ready for execution gets
executed. As described in Sections 6.1 and 6.2, this scheduler is not only very in-
tuitive and easy to understand, but also has several very convenient properties for
real-time applications.
The total number of priority levels available for use is set in the operating sys-
tem configuration. It must be determined as a trade-off between application require-
ments and overhead, because the size of several operating system data structures,
and hence, the operating system memory requirements, depend on it. The currently
configured value is available in the symbolic constant configMAX_PRIORITIES.
Hence, the legal range of priorities in the system goes from tskIDLE_PRIORITY
to tskIDLE_PRIORITY+configMAX_PRIORITIES−1, extremes included.
After creation, a task can be deleted by means of the function
void vTaskDelete(TaskHandle_t xTaskToDelete);
Both uxTaskPriorityGet and vTaskPrioritySet take a task handle, xTask, as their
first argument. The special value NULL can be used as a shortcut to refer to the calling task.
The function vTaskPrioritySet modifies the priority of a task after it has
been created, and uxTaskPriorityGet returns the current priority of the task. It
should, however, be noted that both the priority given at task creation and the priority
set by vTaskPrioritySet represent the baseline priority of the task.
The functions vTaskSuspend and vTaskResume are used to suspend and resume the execution of the task identified by their
argument, respectively. In other words, these functions are used to move a task into
and from state 4 of the task state diagram shown in Figure 4.8. For vTaskSuspend,
the special value NULL can be used to suspend the invoking task, whereas, obviously,
it makes no sense for a task to attempt to resume itself.
It should be noted that vTaskSuspend may suspend the execution of a task at
an arbitrary point. Like vTaskDelete, it must therefore be used with care because
any resources that the task is holding are not implicitly released while the task is
suspended. As a consequence, any other tasks willing to access the same resources
may have to wait until the suspended task is eventually resumed.
FreeRTOS, like most other monolithic operating systems, does not hold a full
task control block for interrupt handlers, and hence, they are not full-fledged tasks.
One of the consequences of this design choice is that interrupt handlers cannot invoke
many operating system primitives, in particular the ones that may block the caller. In
other cases a specific variant of some primitives is provided, to be used specifically
by interrupt handlers, instead of the normal one.
For this reason, for instance, calling vTaskSuspend(NULL) from an interrupt
handler is forbidden. For related operating system design constraints, interrupt han-
dlers are also not allowed to suspend regular tasks by invoking vTaskSuspend with
a non-NULL xTaskHandle as argument. The function
UBaseType_t xTaskResumeFromISR(TaskHandle_t xTaskToResume);
is the variant of vTaskResume that must be used to resume a task from an in-
terrupt handler, also known as interrupt service routine (ISR) in the FreeRTOS
nomenclature.
Since, as said above, interrupt handlers do not have a full-fledged, dedicated task context in FreeRTOS, xTaskResumeFromISR cannot immediately and automatically perform a full context switch to a new task when needed, as its regular counterpart would do.
counterpart would do. An explicit context switch would be necessary, for example,
when a low-priority task is interrupted and the interrupt handler resumes a higher-
priority task.
On the contrary, xTaskResumeFromISR merely returns a nonzero value in this
case, in order to make the invoking interrupt handler aware of the situation. In response to this indication, the interrupt handler must invoke the FreeRTOS scheduling algorithm, as better discussed in Chapter 8, so that the higher-priority task just resumed will be considered for execution immediately after interrupt handling ends.
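The following sketch illustrates this pattern under a few assumptions: the interrupt handler name and the task handle xWorker are hypothetical, and the yield macro portYIELD_FROM_ISR is the one provided by most, but not all, FreeRTOS ports.

#include "FreeRTOS.h"
#include "task.h"

extern TaskHandle_t xWorker;   /* hypothetical task created elsewhere */

void vDeviceISR(void)
{
    /* Resume the worker task; a nonzero return value means it has a
       higher priority than the task that was interrupted. */
    if (xTaskResumeFromISR(xWorker) != pdFALSE) {
        /* Request a context switch as soon as interrupt handling ends. */
        portYIELD_FROM_ISR(pdTRUE);
    }
}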
A related query function can be used to tell whether or not a certain task, identified by its handle xTask, is currently suspended. Its return value is nonzero if the task is suspended, and zero otherwise.
The function
void vTaskSuspendAll(void);
suspends all tasks but the calling one. Symmetrically, the function
UBaseType_t xTaskResumeAll(void);
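resumes them. In addition, the function
UBaseType_t uxTaskGetNumberOfTasks(void);
(a standard FreeRTOS primitive) returns the total number of tasks the operating system is currently managing.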
The count includes the calling task itself, as well as blocked and suspended tasks.
Moreover, it may also include some tasks that have been deleted by vTaskDelete.
This is a side effect of the delayed dismissal of the operating system’s data structures
associated with a task when the task is deleted.
Table 5.2
Time-Related Primitives of FreeRTOS
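Among them, the function
TickType_t xTaskGetTickCount(void);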
returns the current time, expressed as the integral number of ticks elapsed since the
operating system scheduler was started. Barring low-level implementation details, FreeRTOS maintains its notion of current time by incrementing a tick counter at the frequency specified by configTICK_RATE_HZ. This function simply returns
the current value of the tick counter to the caller.
In most microcontrollers, the tick counter data type is either a 16- or 32-bit un-
signed integer, depending on the configuration. Therefore, it is important to consider
that the FreeRTOS time counter will sooner or later wrap around and restart counting from zero. For instance, an unsigned, 32-bit counter incremented 1000 times per second—a common configuration choice for FreeRTOS—will wrap around after
about 1193 hours, that is, slightly more than 49 days.
It is therefore crucial that any application that must keep functioning correctly
for a longer time without rebooting—as many real-time applications must do—is
aware of the wraparound and handles it appropriately if it manipulates time values
directly. If this is not the case, the application will be confronted with time values
that suddenly “jump into the past” when a wraparound occurs, with imaginable con-
sequences. The delay functions, to be discussed next, already handle time counter
wraparound automatically, and hence, no special care is needed to use them.
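For code that does manipulate time values directly, a common way to stay wraparound-safe is to always work with tick differences computed in unsigned arithmetic. A minimal sketch, assuming the standard xTaskGetTickCount primitive, is:

#include "FreeRTOS.h"
#include "task.h"

/* Number of ticks elapsed since xStart.  Doing the subtraction on the
   unsigned TickType_t yields the correct result even if the tick
   counter wrapped around once in the meantime. */
TickType_t xTicksSince(TickType_t xStart)
{
    return (TickType_t)(xTaskGetTickCount() - xStart);
}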
Two distinct delay functions are available, depending on whether the delay should
be relative, that is, measured with respect to the instant at which the delay function
is invoked, or absolute, that is, until a certain instant in the future.
void vTaskDelay(const TickType_t xTicksToDelay);
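For instance, a task could use the relative delay function roughly as follows (the work function is hypothetical; pdMS_TO_TICKS is the millisecond-to-tick conversion macro of current FreeRTOS releases, whose absolute-delay counterpart, when a drift-free period is required, is vTaskDelayUntil):

#include "FreeRTOS.h"
#include "task.h"

extern void vDoPeriodicWork(void);   /* hypothetical application code */

void vPollingTask(void *pvParameters)
{
    for (;;) {
        vDoPeriodicWork();
        /* Relative delay: about 100 ms measured from this call. */
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}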
5.3 SEMAPHORES
The first definition of a semaphore as a general intertask synchronization mechanism
is due to Dijkstra [48]. The original proposal was based on busy wait, but most imple-
mentations found within contemporary operating systems use passive wait instead,
without changing its semantics.
Over the years, semaphores have successfully been used to address many prob-
lems of practical significance in diverse concurrent programming domains even
though, strictly speaking, they are not powerful enough to solve every concurrent
programming problem that can be conceived [102].
Another reason for their popularity is that they are easy to implement efficiently. For this reason, virtually all operating systems offer semaphores as an intertask synchronization method.
According to its abstract definition, a semaphore is an object that contains two items of information, as shown in the right part of Figure 5.2:
• a non-negative integer value, and
• a queue of tasks that are currently blocked on the semaphore.
The initial value of the semaphore is chosen (explicitly or implicitly) by the pro-
grammer upon semaphore creation, while the queue of waiting tasks is always ini-
tially empty.
Neither the value nor the queue associated with a semaphore can be accessed or manipulated directly after initialization, although some implementations may make the current value of a semaphore available to programmers.
The only way to interact with a semaphore, and possibly modify its value and queue as a consequence, is to invoke the following abstract primitives, whose behavior is also depicted in Figure 5.2. An important assumption about those primitives is that their implementation ensures that they are executed atomically, that is, as indivisible units in which no context switches occur.
• P(s): if the value of semaphore s is greater than zero, P(s) decrements it by one and the calling task is allowed to proceed immediately. Otherwise, the calling task is blocked and a reference to it is appended to the queue associated with s.
• V(s): if the queue associated with s is empty, that is, no tasks are currently blocked on s, V(s) increments the value of s by one.
Figure 5.3 Task state diagram states and transitions involved in semaphore operations.
• Otherwise, the value of s is certainly zero. In this case, V(s) picks one of the
tasks referenced by the queue associated with s, removes the reference from
the queue, and unblocks it. In other words, referring to the task state diagram,
the selected task is moved back to the ready state, so that it is eligible for
execution again.
It must be noted that semaphore primitives are tied to the task state diagram
because their execution may induce the transition of a task from one state to an-
other. Figure 5.3 is a simplified version of the full F REE RTOS task state diagram
(Figure 4.8) that highlights the nodes and arcs involved in semaphore operations.
As shown in the figure, the transition of a certain task τ from the running to the
blocked state caused by P(s) is voluntary because it depends on, and is caused by the
invocation of the semaphore primitive by τ itself. On the contrary, the transition of τ
back into the ready state is involuntary because it depends on an action performed by
another task, namely, the task that invokes the primitive V(s). It cannot be otherwise
because, as long as τ is blocked, it cannot proceed with execution and cannot perform
any action on its own.
Another aspect worth mentioning is that, when a task (τ in this case) is unblocked
as a consequence of a V(s), it goes into the ready state rather than running. Recall-
ing how these states have been defined in Section 4.5, this means that the task is
now eligible for execution again, but it does not imply that it shall resume execution
immediately. This is useful to keep a proper separation of duties between the task synchronization primitives being described here and the task scheduling algorithms that will be presented in depth in Section 6.1.
Figure 5.4 Usage of a semaphore and its primitives for mutual exclusion.
For similar reasons, the way semaphore queues are managed and, most impor-
tantly, which task is unblocked by V(s) among those found in the semaphore queue,
is left unspecified in the abstract definition of semaphore. This is because the queue
management strategy must often be chosen in concert with the scheduling algorithm,
in order to ensure that task synchronization and scheduling work well together. More
information about this topic will be given in Section 6.2. The same considerations
are also true for the message passing primitives, which embed a synchronization
mechanism and will be the topic of Section 5.4.
A widespread application of semaphores is to ensure mutual exclusion among a number of critical regions that access the same set of shared variables, a key technique to avoid race conditions as described in Section 4.6. Namely, the critical
regions are associated with a semaphore s, whose initial value is 1. The critical re-
gion’s entry and exit code are the primitives P(s) and V(s). In other words, as shown
in Figure 5.4, those primitives are placed like “brackets” around the critical regions
they must protect.
A full formal proof of correctness of the mutual exclusion technique just described
is beyond the scope of this book. However, its workings will be described in a simple
case, in order to give readers at least a reasonable confidence that it is indeed correct.
As shown in Figure 5.4, the example involves two concurrent tasks, τ and τ ′ , both
willing to enter their critical regions. Both critical regions are associated with the
same set of shared variables and are protected by the same semaphore s.
• For the sake of the example, let us imagine that τ executes its critical region
entry code (that is, the primitive P(s)) first. It will find that the value of s is
1 (the initial value of the semaphore), it will decrement the value to 0, and
it will be allowed to proceed into its critical region immediately, without
blocking.
• If any other tasks, like τ ′ in the figure, execute the primitive P(s) they will
be blocked because the current value of semaphore s is now 0. This is cor-
rect and necessary to enforce mutual exclusion, because task τ has been
allowed to enter its critical region and it has not exited from it yet.
• It must also be noted that two or more executions of P(s) cannot “over-
lap” in any way because they are executed atomically, one after another.
Therefore, even though multiple tasks may invoke P(s) concurrently, the
operating system enforces an ordering among them to ensure that P(s) in-
deed works as intended.
• Task τ ′ will be blocked at least until task τ exits from the critical region and
executes the critical region exit code, that is, V(s). At this point one of the
tasks blocked on the semaphore and referenced by the semaphore queue, if
any, will be unblocked.
• Assuming the choice fell on τ′, it will proceed when the operating system scheduling algorithm picks it up for execution, and it will then enter its critical region. Mutual exclusion is still ensured because any tasks previously blocked on the semaphore are still blocked at the present time. Moreover, any additional tasks trying to enter their critical region and invoking a P(s) will be blocked, too, because the semaphore value is still 0.
• When τ ′ exits from the critical region and executes V(s) two different out-
comes are possible:
1. If some other tasks are still blocked on the semaphore, the process
described above repeats, allowing another task into its critical region.
In this case, the semaphore value stays at 0 so that any “new” tasks
trying to enter their critical region will be blocked.
2. If the queue associated with s is empty, that is, no tasks are blocked on
the semaphore, V(s) increments the value of s by one. As a result, the
value of s becomes 1 and this brings the semaphore back to its initial
state. Any task trying to enter its critical region in the future will be
allowed to do so without blocking, exactly as happened to task τ in
this example.
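Anticipating the concrete FreeRTOS primitives introduced later in this chapter, a minimal sketch of the bracketing pattern just described might look as follows (the shared counter and function names are hypothetical):

#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t xMutex;   /* plays the role of semaphore s */
static int iSharedCounter;         /* shared variable to be protected */

void vInitSharedCounter(void)
{
    /* Mutual exclusion semaphore, created unlocked (value 1). */
    xMutex = xSemaphoreCreateMutex();
}

void vIncrementSharedCounter(void)
{
    /* Critical region entry code, P(s). */
    if (xSemaphoreTake(xMutex, portMAX_DELAY) == pdTRUE) {
        iSharedCounter++;            /* critical region */
        xSemaphoreGive(xMutex);      /* exit code, V(s) */
    }
}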
Figure 5.5 Usage of a semaphore and its primitives for condition synchronization.
A second widespread application of semaphores, depicted in Figure 5.5, is condition synchronization, in which a task τ1 produces some information that another task τ2 must then use. One simple way to implement the transfer is to allocate a set of shared variables large enough to hold the information to be transferred. Then, τ1 writes the information into the shared
variables after preparing it and τ2 reads from the same variables, in order to create
its local copy of the same information and use it.
Of course, τ2 must synchronize with τ1 so that it does not start reading from the
shared variables before τ1 has completely written into them the information to be
transferred. If this is not the case, τ2 may retrieve and use corrupted or inconsistent
information. Namely, the goal is to block τ2 until a certain condition is fulfilled. In
this case, the condition is that the shared variables have been completely filled by τ1 .
We neglect for the time being that τ1 may be “too fast” for τ2 , that is, we assume
that τ1 will never provide new information to be shared before τ2 is done with the
old one. Under this assumption the problem can be solved by using a synchronization
semaphore s initialized to 0. Referring again to Figure 5.5, the semaphore is used by
τ1 and τ2 in the following way:
• Before reading from the shared variables, τ2 performs a P(s). Since the
initial value of the semaphore is 0 this primitive blocks τ2 , unless τ1 already
filled the shared variables beforehand, as will be described in the following.
• After writing into the shared variables, τ1 performs a V(s). This primitive
has two distinct outcomes depending on whether or not τ2 already executed
its P(s) in the past:
1. If τ2 already executed its P(s) and is currently blocked, it will be
unblocked. Then, τ2 is free to proceed and read from the shared
variables.
2. If, informally speaking, τ2 “is late” and did not execute its P(s) yet,
the value of s becomes 1. This signifies that the shared variables have
been updated by τ1 and are ready to be read by τ2 in the future without
further delay. In fact, when τ2 will eventually execute P(s), it will
simply bring the value of s back to 0 without blocking.
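In FreeRTOS terms, again anticipating the primitives discussed below, a sketch of this synchronization pattern could be as follows (the shared variable and function names are hypothetical; a binary semaphore is created with an initial value of 0):

#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t xDataReady;   /* synchronization semaphore s */
static int iSharedData;                /* shared variable written by tau1 */

void vInitSync(void)
{
    xDataReady = xSemaphoreCreateBinary();   /* initial value 0 */
}

void vProducer(int iValue)                   /* executed by tau1 */
{
    iSharedData = iValue;         /* fill the shared variables */
    xSemaphoreGive(xDataReady);   /* V(s): data is now available */
}

int iConsumer(void)                          /* executed by tau2 */
{
    /* P(s): wait until the data has been completely written. */
    xSemaphoreTake(xDataReady, portMAX_DELAY);
    return iSharedData;
}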
Table 5.3
Semaphore Creation/Deletion Primitives of FreeRTOS
Table 5.4
Semaphore Manipulation Primitives of FreeRTOS
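The semaphore creation primitives, listed in Table 5.3, differ depending on the kind of semaphore being created. In current FreeRTOS releases they are declared as follows:
SemaphoreHandle_t xSemaphoreCreateBinary(void);
SemaphoreHandle_t xSemaphoreCreateCounting(UBaseType_t uxMaxCount, UBaseType_t uxInitialCount);
SemaphoreHandle_t xSemaphoreCreateMutex(void);
SemaphoreHandle_t xSemaphoreCreateRecursiveMutex(void);
The first two create a binary semaphore (with an initial value of zero) and a counting semaphore (with the maximum and initial values given as arguments), respectively, whereas the last two create normal and recursive mutual exclusion semaphores.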
Like the others, these creation functions also return either a valid semaphore han-
dle upon successful completion, or a NULL pointer upon failure. All mutual exclu-
sion semaphores are unlocked when they are first created—that is, their initial value
is 1—and priority inheritance is always enabled for them.
A semaphore can be deleted by invoking the function:
void vSemaphoreDelete(SemaphoreHandle_t xSemaphore);
Its only argument is the handle of the semaphore to be deleted. This function must
be used with care because the semaphore is destroyed immediately, even if there are tasks waiting on it. Moreover, it is the programmer’s responsibility to
ensure that a semaphore handle will never be used by any task after the corresponding
semaphore has been destroyed, a constraint that may not be trivial to enforce in a
concurrent programming environment.
After being created, semaphores are acted upon by means of the functions listed in Table 5.4. All kinds of semaphores, except recursive mutual exclusion semaphores, are acted upon by means of the functions xSemaphoreTake and xSemaphoreGive, the FreeRTOS counterpart of P() and V(), respectively. As shown in the following prototypes, both take a semaphore handle xSemaphore as their first argument:
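BaseType_t xSemaphoreTake(SemaphoreHandle_t xSemaphore,
TickType_t xBlockTime);
BaseType_t xSemaphoreGive(SemaphoreHandle_t xSemaphore);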
Comparing the prototypes with the abstract definition of the same primitives, two
differences are evident:
1. The abstract primitive P() may block the caller for an unlimited amount of time.
Since, for obvious reasons, this may be inconvenient in a real-time system, the
function xSemaphoreTake has a second argument, xBlockTime, that specifies
the maximum blocking time. In particular:
• If the value is portMAX_DELAY (a symbolic constant defined when the main FreeRTOS header file is included), the function blocks the caller until the semaphore operation is complete. In other words, when this value of
xBlockTime is used, the concrete function behaves in the same way as its
abstract counterpart.
For this option to be available, the operating system must be configured to
support task suspend and resume, as described in Section 5.1.
• If the value is 0 (zero), the function returns an error indication to the caller
when the operation cannot be performed immediately.
• Any other value is interpreted as the maximum amount of time the function
will possibly block the caller, expressed as an integral number of clock ticks.
See Section 5.2 for more information about time measurement and representation in FreeRTOS.
2. According to their abstract definition, neither P() nor V() can ever fail. On the
contrary, their real-world implementation may encounter an error for a vari-
ety of reasons. For example, as just described, xSemaphoreTake fails when
a finite timeout is specified and the operation cannot be completed before the
timeout expires. For this reason, the return value of xSemaphoreTake and
xSemaphoreGive is a status code, which is pdTRUE if the operation was suc-
cessful. Otherwise, they return pdFALSE.
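As a small sketch of the timeout-based behavior just described (the semaphore handle, function name, and timeout value are hypothetical):

#include "FreeRTOS.h"
#include "semphr.h"

extern SemaphoreHandle_t xResourceSem;   /* hypothetical, created elsewhere */

BaseType_t xTryUseResource(void)
{
    /* Wait at most 10 ticks for the semaphore, then give up. */
    if (xSemaphoreTake(xResourceSem, (TickType_t)10) == pdTRUE) {
        /* ... access the protected resource here ... */
        xSemaphoreGive(xResourceSem);
        return pdTRUE;
    }

    return pdFALSE;   /* timeout: the semaphore was not acquired */
}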
Another important difference between the abstract definition of P() and V() with
respect to their actual implementation is that, for reasons that will be better described
in Chapter 8, neither xSemaphoreTake nor xSemaphoreGive can be invoked
from an interrupt handler. In their place, the following two functions must be used:
BaseType_t xSemaphoreTakeFromISR(SemaphoreHandle_t xSemaphore,
BaseType_t *pxHigherPriorityTaskWoken);
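BaseType_t xSemaphoreGiveFromISR(SemaphoreHandle_t xSemaphore,
BaseType_t *pxHigherPriorityTaskWoken);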
Like all other FreeRTOS primitives that can be invoked from an interrupt handler, these functions never block the caller. In addition, they return to the caller—in the variable pointed to by pxHigherPriorityTaskWoken—an indication of whether or not they unblocked a task with a priority higher than the interrupted task.
Both their arguments and return values are the same as xSemaphoreTake and
xSemaphoreGive, respectively.
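A sketch of their typical use from an interrupt handler follows (the handler and semaphore names are hypothetical; portYIELD_FROM_ISR is the yield macro provided by most FreeRTOS ports):

#include "FreeRTOS.h"
#include "semphr.h"

extern SemaphoreHandle_t xDataReady;   /* hypothetical, created elsewhere */

void vSensorISR(void)
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;

    /* Signal the task waiting for new data. */
    xSemaphoreGiveFromISR(xDataReady, &xHigherPriorityTaskWoken);

    /* Request a context switch on exit if a higher-priority task
       was unblocked by the call above. */
    portYIELD_FROM_ISR(xHigherPriorityTaskWoken);
}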
One last aspect worth mentioning is the interaction between semaphore use and task deletion which, as discussed in Section 5.1, is immediate and unconditional. The high-level effect of deleting a task while it is within a critical region is therefore the same as if the terminated task never exited from the critical region.
Namely, no other tasks will ever be allowed to enter a critical region controlled by
the same semaphore in the future.
Since this usually corresponds to a complete breakdown of any concurrent pro-
gram, the direct invocation of vTaskDelete should usually be avoided, and it
should be replaced by a more sophisticated deletion mechanism. One simple so-
lution is to send a deletion request to the target task by some other means—for in-
stance, one of the intertask communication mechanisms described in this section and
the next one. The target task must be designed so that it responds to the request by
terminating itself at a well-known location in its code after any required cleanup op-
eration has been carried out and any mutual exclusion semaphore it held has been
released.
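A minimal sketch of such a deletion mechanism, in which a quit request is conveyed by a binary semaphore polled by the target task at a safe point, could be (all names are hypothetical):

#include "FreeRTOS.h"
#include "semphr.h"
#include "task.h"

extern SemaphoreHandle_t xQuitRequest;   /* given by the task requesting deletion */
extern SemaphoreHandle_t xMutex;         /* resource the worker may hold          */
extern void vDoOneUnitOfWork(void);      /* hypothetical application code         */

void vWorkerTask(void *pvParameters)
{
    for (;;) {
        xSemaphoreTake(xMutex, portMAX_DELAY);
        vDoOneUnitOfWork();
        xSemaphoreGive(xMutex);          /* nothing is held past this point */

        /* Poll for a pending deletion request without blocking. */
        if (xSemaphoreTake(xQuitRequest, 0) == pdTRUE) {
            /* Any remaining cleanup goes here, then the task deletes
               itself at a well-known point, holding no resources. */
            vTaskDelete(NULL);
        }
    }
}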
Besides using semaphores, an alternative way to implement mutual exclusion is
by means of the task suspend and resume primitives discussed in Section 5.1. In fact,
in a single-processor system, vTaskSuspendAll opens a mutual exclusion region
because the first task that successfully executes it will effectively prevent all other
tasks from being executed until it invokes xTaskResumeAll.
As a side effect, any task executing between vTaskSuspendAll and
xTaskResumeAll implicitly gets the highest possible priority in the system, ex-
cept interrupt handlers. This consideration indicates that the method just described
has two main shortcomings:
1. Any FreeRTOS primitive that might block the caller for any reason, even temporarily, or might require a context switch, must not be used within this kind of
critical region. This is because blocking the only task allowed to run would com-
pletely lock up the system, and it is impossible to perform a context switch with
the scheduler disabled.
2. Protecting critical regions with a sizable execution time in this way would prob-
ably be unacceptable in many applications because it leads to a large amount of
unnecessary blocking. This is especially true for high-priority tasks, because if
one of them becomes ready for execution while a low-priority task is engaged in
a critical region of this kind, it will not run immediately, but only at the end of the
critical region itself.
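For a very short critical region, the pattern looks like the following sketch (the shared counter is hypothetical):

#include "FreeRTOS.h"
#include "task.h"

static unsigned int uiEventCount;   /* shared with other tasks */

void vBumpEventCount(void)
{
    vTaskSuspendAll();     /* from here on, no other task can run   */
    uiEventCount++;        /* keep this region as short as possible */
    xTaskResumeAll();      /* normal scheduling resumes here        */
}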
Seen in a different way, the role of semaphores is to coordinate and enforce mutual
exclusion and precedence constraints on task actions, in order to ensure that their
access to shared variables takes place in the right sequence and at the appropriate
time. On the other hand, semaphores are not actively involved in data transfer.
5.4 MESSAGE PASSING
Message passing takes a radically different approach to task synchronization and
communication by providing a single mechanism that is able, by itself, to provide
both synchronization and data transfer at the same time and with the same set of
primitives.
In this way, the mechanism not only works at a higher level of abstraction and becomes easier to use, but it can also be adopted with minimal updates when shared memory is not necessarily available. This happens, for example, in distributed systems where the communicating tasks may be executed by distinct computers linked by a communication network.
In its abstract form, a message passing mechanism is based upon two basic prim-
itives, defined as follows:
• A send primitive, which transfers a certain amount of information, called
a message, from one task to another. The invocation of the primitive may
imply a synchronization action that blocks the caller if the data transfer
cannot take place immediately.
• A receive primitive, which allows the calling task to retrieve the contents of
a message sent to it by another task. Also in this case, the calling task blocks
if the message it is seeking is not immediately available, thus synchronizing
the receiver with the sender.
Even if this definition still lacks many important lower-level details that will be
discussed later, it is already clear that the most apparent effect of message passing
primitives is to transfer a certain amount of information from the sending task to
the receiving one. At the same time, message passing primitives also incorporate
synchronization because they may block the caller when needed.
It should also be noted that, with message passing, mutual exclusion is not a
concern because any given message is never shared among tasks. In fact, the message
passing mechanism works as if the message were instantaneously copied from the
sender to the receiver, so that message ownership is implicitly passed from the sender
to the receiver when the message is transferred.
In this way, even if the sender modifies its local copy of a message after sending it,
this will not affect the message already sent in the past. Symmetrically, the receiver
is allowed to modify a message it received. This action is local to the receiver and
does not affect the sender’s local copy of the message in any way.
A key design aspect of a message passing scheme is how the sender identifies
the intended recipient of a message and, symmetrically, how a receiver indicates
which senders it is willing to receive messages from. In FreeRTOS, the design is based upon an indirect naming scheme, in which the send and receive primitives are associated by means of a third, intermediate entity, known as message queue. Other
popular names for this intermediate entity, used by other operating systems and in
other contexts, are mailbox or channel.
As shown in Figure 5.6, even though adopting an indirect naming scheme looks more complex than, for instance, directly naming the recipient task when sending, it is indeed advantageous for software modularity and integration.
Let us assume that two software modules, A and B, each composed of multiple
tasks and with a possibly elaborate internal structure, have to synchronize and com-
municate by means of message passing. In particular, a task τ1 within module A has
to send a message to a recipient within module B.
If we chose to directly name the recipient task when sending a message, then task
τ1 needs to know the internal structure of module B accurately enough to determine
that the intended recipient is, for instance, τ2 as is shown in the figure.
If the internal architecture of module B is later changed, so that the intended re-
cipient becomes τ2′ instead of τ2 , module A must be updated accordingly. Otherwise,
communication will no longer be possible or, even worse, messages may reach the
wrong task.
On the contrary, if communication is carried out with an indirect naming scheme,
module A and its task τ1 must only know the name of the message queue that module
B is using for incoming messages. Since the name of the mailbox is part of the
interface of module B to the external world, it will likely stay the same even if the
implementation or the internal design of the module itself changes over time.
An additional effect of communicating through a message queue rather than directly between tasks is that the relationship among communicating tasks becomes very flexible, albeit more complex. In fact, four scenarios are possible:
1. The simplest one is a one-to-one relationship, in which one task sends messages
to another through the message queue.
2. A many-to-one relationship is also possible, in which multiple tasks send mes-
sages to a message queue, and a single task receives and handles them.
3. In a one-to-many relationship, a single task feeds messages into a message queue
and multiple tasks receive from the queue. Unlike the previous two scenarios, this
one and the following one cannot easily be implemented with direct inter-task
communication.
4. The many-to-many relationship is the most complex one, comprising multiple
sending and receiving tasks all operating on the same message queue.
Establishing a one-to-many or a many-to-many relationship does not allow the
sending task(s) to determine exactly which task—among the receiving ones—will
actually receive and handle its message. However, this may still be useful, for in-
stance, to conveniently handle concurrent processing in software modules acting as
servers.
In this case, those software modules will contain a number of equivalent “worker”
tasks, all able to handle a single request at a time. All of them will be waiting for
requests using the same message queue located at the module’s boundary. When a
request arrives, one of the workers will be allowed to proceed and get it. Then, the
worker will process the request and provide an appropriate reply to the requesting
task. Meanwhile, the other workers will still be waiting for additional requests and
may start working on them concurrently.
As mentioned earlier, message passing primitives incorporate both data transfer
and synchronization aspects. Albeit the data transfer mechanism by itself is straight-
forward, in order to design a working concurrent application, it is important to look
deeper into how message queue synchronization works.
As illustrated in Figure 5.7, message queues enforce two synchronization con-
straints, one for each side of the communication:
1. The receive primitive blocks the caller when invoked on an empty message queue.
The invoking task will be unblocked when a message is sent to the queue. If
multiple tasks are blocked trying to receive from a message queue when a message
arrives, the operating system selects and unblocks exactly one of them. That task
receives the incoming message, whereas the others wait until further messages are
sent to the queue.
2. The send primitive blocks the caller when the message queue it operates on is
full, that is, the number of messages it currently contains is equal to its maximum
capacity, declared upon creation. When a task receives a message from the queue,
the operating system selects one of the tasks waiting to send, unblocks it, and puts
its message into the message queue. The other tasks keep waiting until more space
becomes available in the message queue.
It should also be noted that many operating systems (including FreeRTOS) offer a nonblocking variant of the send and receive primitives. Even though these variants
may sometimes be useful from the software development point of view, they will
not be further discussed here because, in that case, synchronization simply does not
occur.
Another popular variant is a timed version of send and receive, in which it is
possible to specify the maximum amount of time the primitives are allowed to block
the caller. If the operation cannot be completed within the allotted time, the caller is
unblocked anyway and the primitives return an error indication.
Under this synchronization model, considering the case in which the message queue is neither empty nor full, messages flow asynchronously from the sender to the receiver. No synchronization between those tasks actually occurs, because
neither of them is blocked by the message passing primitives it invokes.
When necessary, stricter forms of synchronization can be implemented starting from these basic primitives, as shown in Figure 5.8.
Table 5.5
Message Passing Primitives of FreeRTOS
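A message queue is created by means of the following function, shown here in its standard FreeRTOS form:
QueueHandle_t xQueueCreate(UBaseType_t uxQueueLength,
UBaseType_t uxItemSize);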
Its two arguments specify the maximum number of elements the newly cre-
ated message queue can contain, uxQueueLength, and the size of each element,
uxItemSize, expressed in bytes. Upon successful completion, the function returns
a valid message queue handle to the caller, which represents the message queue just
created and must be used for any subsequent operation on it. If an error occurs, the
function returns a NULL pointer instead.
When a message queue is no longer in use, it is useful to delete it, in order to
reclaim the memory allocated to it for future use. This is done by means of the
function:
void vQueueDelete(QueueHandle_t xQueue);
It should be noted that the deletion of a FreeRTOS message queue takes place immediately and is never delayed, even if some tasks are currently blocked because they are engaged in a send or receive primitive on the message queue itself. The
effect of the deletion on the waiting tasks depends on whether or not they specified a
time limit for the execution of the primitive:
• if they did so, they will receive an error indication when a timeout occurs;
• otherwise, they will be blocked forever.
For this reason, it is advisable to ensure that message queues are deleted only when no tasks are blocked on them, when those tasks will no longer be needed in the future, or when they have other ways to recover (for instance, by means of a timeout mechanism).
After a message queue has been successfully created and its xQueue handle is
available for use, it is possible to send a message to it by means of the functions
BaseType_t xQueueSendToBack(
QueueHandle_t xQueue,
const void *pvItemToQueue,
TickType_t xTicksToWait);
BaseType_t xQueueSendToFront(
QueueHandle_t xQueue,
const void *pvItemToQueue,
TickType_t xTicksToWait);
The only difference between these two functions is that xQueueSendToFront sends the message to the front of the message queue, so that it passes over the other messages already stored in the queue and will be received before them.
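A minimal sketch of message queue creation and sending follows (the message layout, queue depth, and timeout are hypothetical choices):

#include "FreeRTOS.h"
#include "queue.h"

typedef struct {             /* hypothetical message layout */
    int iSensorId;
    int iReading;
} Sample_t;

static QueueHandle_t xSampleQueue;

void vInitSampleQueue(void)
{
    /* Room for up to 8 messages of sizeof(Sample_t) bytes each. */
    xSampleQueue = xQueueCreate(8, sizeof(Sample_t));
}

BaseType_t xPostSample(int iSensorId, int iReading)
{
    Sample_t xMsg = { iSensorId, iReading };

    /* Wait at most 20 ticks if the queue is currently full. */
    return xQueueSendToBack(xSampleQueue, &xMsg, (TickType_t)20);
}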
Another possibility is to forcefully send a message to a message queue even
though it is full by overwriting one message already stored into it in the past. This
is generally done on message queues with a capacity of one, with the help of the
following function:
BaseType_t xQueueOverwrite(
QueueHandle_t xQueue,
const void * pvItemToQueue);
Unlike the previous functions, this one has no xTicksToWait argument because it always completes its work immediately. As before, the return value indicates whether the function was successful or not.
Neither xQueueSendToBack nor xQueueSendToFront nor
xQueueOverwrite can be invoked from an interrupt handler. Instead, ei-
ther xQueueSendToBackFromISR or xQueueSendToFrontFromISR or
xQueueOverwriteFromISR must be called in this case, according to the
following prototypes:
BaseType_t xQueueSendToBackFromISR(
QueueHandle_t xQueue,
const void *pvItemToQueue,
BaseType_t *pxHigherPriorityTaskWoken);
BaseType_t xQueueSendToFrontFromISR(
QueueHandle_t xQueue,
const void *pvItemToQueue,
BaseType_t *pxHigherPriorityTaskWoken);
BaseType_t xQueueOverwriteFromISR(
QueueHandle_t xQueue,
const void *pvItemToQueue,
BaseType_t *pxHigherPriorityTaskWoken);
• They never block the caller, and hence, they do not have a xTicksToWait
argument. In other words, they always behave as if the timeout were 0, so
that they return an error indication to the caller if the operation cannot be
concluded immediately.
• The argument pxHigherPriorityTaskWoken points to a BaseType_t
variable. The function will set the referenced variable to either pdTRUE or
pdFALSE, depending on whether or not it awakened a task with a priority
higher than the task which was running when the interrupt handler started.
Messages are always received from the front of a message queue by means of the
following functions:
BaseType_t xQueueReceive(
QueueHandle_t xQueue,
void *pvBuffer,
TickType_t xTicksToWait);
BaseType_t xQueuePeek(
QueueHandle_t xQueue,
void *pvBuffer,
TickType_t xTicksToWait);
The first argument of these functions is a message queue handle xQueue, of type
QueueHandle_t, which indicates the message queue they will work upon.
The second argument, pvBuffer, is a pointer to a memory buffer into which
the function will store the message just received. The memory buffer must be large
enough to hold the message, that is, at least as large as a message queue item.
The last argument, xTicksToWait, specifies how much time the function should
wait for a message to become available if the message queue was completely empty
when the function was invoked. The valid values of xTicksToWait are the same
as already mentioned when discussing xQueueSendToBack.
The function xQueueReceive, when successful, removes the message it just received from the message queue, so that each message sent to the queue is received exactly once. On the contrary, the function xQueuePeek simply copies the message into the memory buffer indicated by the caller without removing it from the queue.
The return value of xQueueReceive and xQueuePeek is pdPASS if the func-
tion was successful, whereas any other value means that an error occurred. In particu-
lar, the error code errQUEUE_EMPTY means that the function was unable to receive
a message within the maximum amount of time specified by xTicksToWait be-
cause the queue was empty. In this case, the buffer pointed to by pvBuffer will not contain any valid message after these functions return to the caller, and its contents shall not be used.
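A corresponding receiving task might be sketched as follows (the message layout must match the sender's; all names are hypothetical):

#include "FreeRTOS.h"
#include "queue.h"

typedef struct {             /* same layout used by the sender */
    int iSensorId;
    int iReading;
} Sample_t;

extern QueueHandle_t xSampleQueue;   /* hypothetical, created elsewhere */

void vSampleConsumerTask(void *pvParameters)
{
    Sample_t xMsg;

    for (;;) {
        /* Block until a message arrives, then remove it from the queue. */
        if (xQueueReceive(xSampleQueue, &xMsg, portMAX_DELAY) == pdPASS) {
            /* ... process xMsg here ... */
        }
    }
}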
The functions xQueueReceiveFromISR and xQueuePeekFromISR are the variants of xQueueReceive and xQueuePeek, respectively, which must be used within an interrupt handler. Neither of them blocks the caller. Moreover,
xQueueReceiveFromISR returns to the caller—in the variable pointed by
pxHigherPriorityTaskWoken—an indication on whether or not it awakened
a task with a higher priority than the interrupted one.
There is no need to do the same for xQueuePeekFromISR because it never frees any space in the message queue, and hence, it never wakes up any tasks. The two functions have the following prototypes:
BaseType_t xQueueReceiveFromISR(
QueueHandle_t xQueue,
void *pvBuffer,
BaseType_t *pxHigherPriorityTaskWoken);
BaseType_t xQueuePeekFromISR(
QueueHandle_t xQueue,
void *pvBuffer);
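A few additional query functions provide information about the current status of a message queue. For instance, the function
UBaseType_t uxQueueMessagesWaiting(const QueueHandle_t xQueue);
returns the number of messages currently stored in the queue, while the following variants, in their standard FreeRTOS form, can be invoked from an interrupt handler: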
UBaseType_t
uxQueueMessagesWaitingFromISR(const QueueHandle_t xQueue);
BaseType_t
xQueueIsQueueEmptyFromISR(const QueueHandle_t xQueue);
BaseType_t
xQueueIsQueueFullFromISR(const QueueHandle_t xQueue);
These functions should be used with caution because, although the information
they return is certainly accurate at the time of the call, the scope of its validity is
somewhat limited. It is worth mentioning, for example, that the information may no
longer be valid and should not be relied upon when any subsequent message queue
operation is attempted because other tasks may have changed the queue status in the
meantime.
For example, the preliminary execution of uxQueueMessagesWaiting by a task, with a result greater than zero, is not enough to guarantee that the same task will be able to immediately conclude an xQueueReceive with a non-zero xTicksToWait in the immediate future.
This is because other tasks, or interrupt handlers, may have received messages from the queue and emptied it completely in the meantime. On the contrary, the xQueueReceive primitive, with xTicksToWait equal to 0, has been specifically designed to work as intended in these cases.
5.5 SUMMARY
This chapter contains an introduction to concurrent programming from the practical
point of view, within the context of a real-world real-time operating system for em-
bedded applications. Starting with the all-important concept of task, the basic unit of scheduling and execution discussed in Section 5.1, it was then
possible to introduce readers to the main inter-task communication and synchroniza-
tion mechanisms. This was the subject of Sections 5.3 and 5.4, which described the
basic concept as well as the practical aspects of semaphore-based synchronization
and message-passing, respectively.
Because this book focuses on real-time embedded systems, rather than general-
purpose computing, Section 5.2 also presented in detail how real-time operating sys-
tems manage time and timed delays and which primitives they make available to
users for this purpose. The discussion precedes inter-process communication and
synchronization because, in a real-time operating environment, the latter are obvi-
ously subject to timing constraints, too. These constraints are most often expressed
by means of a timeout mechanism, which has many analogies with a timed delay.
6 Scheduling Algorithms
and Analysis
CONTENTS
6.1 Scheduling Algorithms for Real-Time Execution
6.2 Scheduling Analysis
6.3 Summary
For example, a mutual exclusion semaphore (Section 5.3) ensures that only one
task at a time is allowed to operate on shared data. Similarly, a message sent from
one task to another (Section 5.4) can force the receiving task to wait until the sending
task has completed a computation and, at the same time, transfers the results from
one task to the other.
Despite these correctness-related constraints, the application will still exhibit a
significant amount of nondeterminism because, in many cases, the execution of its
tasks may interleave in different ways without violating any of those constraints. For
example, as discussed in Section 5.4, a message queue puts in effect a synchroniza-
tion between senders and receivers only when it is either empty or full.
When this is not the case, senders and receivers are free to proceed and inter-
leave their execution arbitrarily. The application results will of course be the same
in all cases and they will be correct, provided the application was designed properly.
However, its timings may vary considerably from one execution to another due to
different interleavings.
Therefore, if some tasks have a constraint on how much time it takes to complete
them, a constraint also known as response time deadline—as is common in a real-
time system—only some of the interleavings that are acceptable from the point of
view of correctness will also be adequate to satisfy those additional constraints.
As a consequence, in a real-time system it is necessary to further restrict the
nondeterminism, beyond what is done by the communication and synchronization
primitives, to ensure that the task execution sequence will not only produce correct
results in all cases, but will also lead tasks to meet their deadlines. This is exactly
what is done by real-time scheduling algorithms.
Under these conditions, scheduling analysis techniques—briefly presented in Sec-
tion 6.2—are able to establish whether or not all tasks in the system will be able to
meet their deadlines and, using more complex techniques, calculate the worst-case
response time of each task, too. For the time being, interrupts will be left out of the
discussion for simplicity. Interrupt handling and its effects on real-time performance
will be further discussed in Chapter 8.
Even neglecting interrupts, it turns out that assessing the worst-case timing behav-
ior of an arbitrarily complex concurrent application is very difficult. For this reason,
it is necessary to introduce a simplified task model, which imposes some restrictions
on the structure of the application to be considered for analysis and its tasks.
The simplest model, also known as the basic task model, will be the starting point
for the discussion. It has the following characteristics:
1. The application consists of a fixed number of tasks, and that number is known in
advance. All tasks are created when the application starts executing.
2. Tasks are periodic, with fixed and known periods, so that each task can be seen
as an infinite sequence of instances or jobs. Each task instance becomes ready for
execution at regular time intervals, that is, at the beginning of each task period.
3. Tasks are completely independent of each other. They neither synchronize nor
communicate in any way.
Real applications, however, often deviate from the basic model in one or more ways:
• The deadline of a task is not always the same as its period. For instance, a
deadline shorter than the period is of particular interest to model tasks that
are executed infrequently but, when they are, must be completed with tight
timing constraints.
• Some tasks are sporadic rather than periodic. This happens, for instance,
when the execution of a task is triggered by an event external to the system.
• In a modern hardware architecture, it may be difficult to determine an up-
per bound on a task execution time which is at the same time accurate
and tight. This is because those architectures include hardware compo-
nents (like caches, for example), in which the average time needed to com-
plete an operation may differ from the worst-case time by several orders of
magnitude.
Table 6.1 summarizes the notation that will be used throughout this chapter to
discuss scheduling algorithms and their analysis. Even though it is not completely
standardized, the notation proposed in the table is the one adopted by most textbooks
and publications on the subject. In particular:
Table 6.1
Notation for Real-Time Scheduling Algorithms and Analysis
Symbol Meaning
τi The i-th task
τi, j The j-th instance of the i-th task
Ti The period of task τi
Di The relative deadline of task τi
Ci The worst-case execution time of task τi
Ri The worst-case response time of task τi
ri, j The release time of τi, j
di, j The absolute deadline of τi, j
fi, j The response time of τi, j
• Individual tasks are denoted by the symbol τi, and their instances by the notation τi,j, which indicates the j-th instance of τi. Instances are enumerated according to their temporal order, so that τi,j precedes τi,k in time if and only if j < k, and the first instance of τi is usually written as τi,0.
• In a periodic task τi individual instances are released, that is, they become
ready for execution, at regular time intervals. The distance between two
adjacent releases is the period of the task, denoted by Ti .
• The symbol Di represents the deadline of τi expressed in relative terms, that
is, with respect to the release time of each instance. In the model, the rela-
tive deadline is therefore the same for all instances of a given task. More-
over, in the following it will be assumed that Di = Ti ∀i for simplicity.
• The worst-case execution time of τi is denoted as Ci . As outlined above, the
worst-case execution time of a task is the maximum amount of processor
time needed to complete any of its instances when the task is executed in
isolation, that is, without the presence of any other tasks in the system. It is
important to note that a task execution time shall not be confused with its
response time, to be described next.
• The worst-case response time of τi , denoted as Ri , represents the maxi-
mum amount of time needed to complete any of its instances when the
task is executed together with all the other tasks in the system. It is there-
fore Ri ≥ Ci ∀i because, by intuition, the presence of other tasks can only
worsen the completion time of τi . For instance, the presence of a higher-
priority task τ j may lead the scheduler to temporarily stop executing τi in
favor of τ j when the latter becomes ready for execution.
• Besides considering timing parameters pertaining to a task τi as a whole—
like its period Ti , deadline Di , and response time Ri —it is sometimes im-
portant to do the same at the instance level, too. In this case, ri, j is used to
denote the release time of the j-th instance of τi, that is:
ri,j = ϕi + j Ti , (6.1)
where ϕi represents the initial phase of τi , that is, the absolute time at which
its first instance τi,0 is released.
• Similarly, di, j represents the absolute deadline of the j-th instance of τi ,
which is:
di, j = ri, j + Di . (6.2)
An important difference between Di and di, j is that the former is a relative
quantity, which is measured with respect to the release time of each in-
stance of τi and is the same for all instances. On the contrary, the latter is an
absolute quantity that represents the instant in time at which task instance
τi, j must necessarily already be completed in order to satisfy its timing con-
straints. As a consequence, di, j is different for each instance of τi .
• Last, fi, j denotes the response time of task instance τi, j . It is again a relative
quantity, measured with respect to the release time of the corresponding in-
stance, that is, ri, j . The difference between fi, j and Ri is that the former
represents the actual response time of τi, j , a specific instance of τi , whereas
the latter is the worst-case response time among all (infinite) instances of
τi. Hence, Ri ≥ maxj ( fi,j ). The inequality, rather than an equality, takes into account the fact that, due to the way Ri is calculated, it may be a conservative bound.
One of the useful consequences of introducing a formal task model is that it is now
possible to be more precise in defining what, so far, has been described as “satisfying
timing constraints,” in a generic way. According to the above definitions, all tasks in
a set meet their deadline—and hence, they all satisfy their timing constraints—if and
only if Ri ≤ Di ∀i.
Figure 6.1 depicts the notation just introduced in graphical form and further high-
lights the difference between Ci and Ri when other tasks are executed concurrently
with τi . The left part of the figure, related to instance τi,0 , shows how that instance
is executed when there are no other tasks in the system. Namely, when the instance
is released at ri,0 and becomes ready for execution, it immediately transitions to the
running state of the TSD and stays in that state until completion. As a consequence,
its response time fi,0 will be the same as its execution time Ci .
On the contrary, the right part of Figure 6.1 shows what may happen if instance τi,1 is executed concurrently with other higher-priority tasks. In this case:
• the execution of τi,1 may not start as soon as it is released, because a higher-priority task instance is already being executed at that time;
• even after it has started, its execution may be preempted one or more times in favor of higher-priority instances that become ready in the meantime.
For both these reasons, the response time of τi,1, denoted as fi,1 in the figure, may become significantly longer than its execution time Ci because the task instance endures a certain amount of interference from higher-priority tasks. A very important
goal of defining a satisfactory real-time scheduling algorithm, along with an appro-
priate way of analyzing its behavior, is to ensure that fi, j is bounded for any instance
j of task i. Moreover, it must also be guaranteed that, for all tasks, the resulting
worst-case response time Ri is acceptable with respect to the task deadline, that is,
Ri ≤ Di ∀i.
The Rate Monotonic (RM) scheduling algorithm for single-processor systems, introduced by Liu and Layland [110], assigns to each task in the system a fixed priority,
which is inversely proportional to its period Ti . Tasks are then selected for execu-
tion according to their priority, that is, at each instant the operating system sched-
uler chooses for execution the ready task with the highest priority. Preemption of
lower-priority tasks in favor of higher-priority ones is performed, too, as soon as a
higher-priority task becomes ready for execution. An example of how Rate Mono-
tonic schedules a simple set of tasks, listed in Table 4.4, was shown in Figure 4.9.
It should be noted that the Rate Monotonic priority assignment takes into account
only the task period Ti , and not its execution time Ci . In this way, tasks with a shorter
period are expected to be executed before the others. Intuitively, this makes sense
because we are assuming Di = Ti , and hence, tasks with a shorter period have less
time available to complete their work. On the contrary, tasks with a longer period
can afford giving precedence to more urgent tasks and still be able to finish their
execution in time.
This informal reasoning can be confirmed with a mathematical proof of optimal-
ity, that is, it has been proved that Rate Monotonic is the best scheduling policy
among all the fixed priority scheduling policies when the basic task model is consid-
ered. In particular, under the assumptions of the basic task model, it has been proved [110] that, if a given set of periodic tasks with fixed priorities can
be scheduled so that all tasks meet their deadlines by means of a certain scheduling
algorithm A, then the Rate Monotonic algorithm is able to do the same, too.
Another interesting mathematical proof about the Rate Monotonic answers a
question of significant, practical relevance. From the previous discussion, it is al-
ready clear that the response time of a task instance depends on when that instance is
released with respect to the other tasks in the system, most notably the higher-priority
ones. This is because the relative position of task instance release times affects the
amount of interference the task instance being considered endures and, as a conse-
quence, the difference between its response time fi, j with respect to the execution
time Ci .
It is therefore interesting to know what is the relative position of task instance re-
lease times that leads to the worst possible response time Ri . The critical instant the-
orem [110] provides a simple answer to this question for Rate Monotonic. Namely,
it states that a critical instant for a task instance occurs when it is released together
with an instance of all higher-priority tasks. Moreover, releasing a task instance at a
critical instant leads that instance to have the worst possible response time Ri among
all instances of the same task.
Therefore, in order to determine Ri for the Rate Monotonic algorithm, it is un-
necessary to analyze, simulate, or experimentally evaluate the system behavior for
any possible relationship among task instance release times, which may be infeasible
or very demanding. Instead, it is enough to look at the system behavior in a single
scenario, in which task instances are released at a critical instant.
Given that the Rate Monotonic algorithm has been proved to be optimal among all fixed-priority scheduling algorithms, it is still interesting to know whether it is possible to “do better” than Rate Monotonic, by relaxing some constraints on the structure of the scheduler and adding some complexity to it. In particular, it is interesting to investigate
the scenario in which task priorities are no longer constrained to be fixed, but may
change over time instead. The answer to this question was given by Liu and Layland
in [110], by defining a dynamic-priority scheduling algorithm called earliest deadline
first (EDF) and proving it is optimal among all possible scheduling algorithms, under
some constraints.
The EDF algorithm selects tasks according to their absolute deadlines. That is, at
each instant, tasks with earlier deadlines receive higher priorities. According to (6.1)
and (6.2), the absolute deadline di,j of the j-th instance (job) of task τi is given by
di,j = ϕi + j Ti + Di . (6.3)
From this equation, it is clear that the priority of a given task τi as a whole changes
dynamically, because it depends on the current deadline of its active instance. On the
other hand, the priority of a given task instance τi, j is still fixed, because its deadline
is computed once and for all by means of (6.3) and it does not change afterward.
This property also gives a significant clue on how to simplify the practical im-
plementation of EDF. In fact, EDF implementation does not require that the sched-
uler continuously monitors the current situation and rearranges task priorities when
needed. Instead, task priorities shall be updated only when a new task instance is re-
leased. Afterwards, when time passes, the priority order among active task instances
does not change, because their absolute deadlines do not move.
As happened for RM, the EDF algorithm works well according to intuition, be-
cause it makes sense to increase the priority of more “urgent” task instances, that is,
instances that are getting closer to their deadlines without being completed yet. The
same reasoning has also been confirmed in [110] by a mathematical proof, under the basic task model complemented by a few additional assumptions.
One of the key quantities used in scheduling analysis is the processor utilization factor U of a task set, defined as
U = ∑i=1..N Ci /Ti , (6.4)
where, according to the notation presented in Table 6.1, the fraction Ci /Ti represents
the fraction of processor time spent executing task τi . The processor utilization factor
is therefore a measure of the computational load imposed on the processor by a given
task set. Accordingly, the computational load associated with a task increases when
its execution time Ci increases and/or its period Ti decreases.
Although U can be calculated in a very simple way, it does provide useful insights
about the schedulability of the task set it refers to. First of all, an important theoretical
result identifies task sets that are certainly not schedulable. Namely, if U > 1 for
a given task set, then the task set is not schedulable, regardless of the scheduling
algorithm.
Besides the formal proof—which can be found in [110]—this result is quite in-
tuitive. Basically, it states that it is impossible to allocate to the tasks a fraction of
processor time U that exceeds the total processor time available, that is, 1. It should
also be noted that this result merely represents a necessary schedulability condition
and, by itself, it does not provide any information when U ≤ 1.
Further information is instead provided by a sufficient schedulability test for Rate
Monotonic. Informally speaking, it is possible to determine a threshold value for U
so that, if U is below that threshold, the task set can certainly be scheduled by Rate
Monotonic, independently of all the other characteristics of the task set itself. More
formally, it has been proved that if
U = ∑i=1..N Ci /Ti ≤ N(2^{1/N} − 1) , (6.5)
where N is the number of tasks in the task set, then the task set is certainly schedula-
ble by Rate Monotonic. Interested readers will find the complete proof in [110] and,
in a more refined form, in [47].
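To give a feeling for the bound, evaluating the right-hand side of (6.5) for small task sets gives approximately 0.83 for N = 2 and 0.78 for N = 3, and the bound decreases monotonically toward its limit ln 2 ≈ 0.69 as N grows.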
Combined together, the two conditions just discussed can be summarized as
shown in Figure 6.2. Given that, in a single-processor system, valid values of U
vary between 0 and 1, the two conditions together identify three ranges of values of
U with different schedulability properties:
Table 6.2
Task Set with U ≃ 0.9 Schedulable by Rate Monotonic
1. if 0 ≤ U ≤ N(2^{1/N} − 1), the task set is certainly schedulable, because it passes the
sufficient test;
2. if U > 1, the task set is certainly not schedulable, because it fails the necessary
test;
3. if N(2^{1/N} − 1) < U ≤ 1, the tests give no information about schedulability and the
task set may or may not be schedulable.
When the processor utilization factor U of a task set falls in the third range, fur-
ther analysis—to be performed with more complex and sophisticated techniques—is
necessary to determine whether or not the task set is schedulable. The following two
examples highlight that, within this “uncertainty area,” two task sets with the same
value of U may behave very differently for what concerns schedulability. According
to the critical instant theorem, the system will be analyzed at a critical instant for all
tasks—that is, when the first instance of each task is released at t = 0—because we
are looking for the worst possible behavior in terms of timings.
The characteristics of the first task set, called A in the following, are listed in
Table 6.2. The task set has a processor utilization factor UA = 0.9. Since, for N = 3
tasks, it is:
N(2^(1/N) − 1) ≃ 0.78 ,       (6.6)
neither the necessary nor the sufficient schedulability test provides any information.
The corresponding scheduling diagram, drawn starting at a critical instant according
to the Rate Monotonic algorithm, is shown in Figure 6.3. As is commonly done
in this kind of diagram, it consists of three main parts:
1. The top of the diagram summarizes the execution time, period, and deadline of
the tasks in the task set. The execution time of each task is represented by a gray
block, while the period (that coincides with the deadline in the task model we
are using) is depicted with a dashed horizontal line ending with a vertical bar.
Different shades of gray are used to distinguish one task from another.
2. The bottom of the diagram shows when task instances are released. There is a
horizontal time line for each task, with arrows highlighting the release time of
that task’s instances along it.
3. The mid part of the diagram is a representation of how the Rate Monotonic sched-
uler divides the processor time among tasks. Along this time line, a gray block
Figure 6.3 Scheduling diagram for the task set of Table 6.2.
signifies that the processor is executing a certain task instance, written in the
block itself, and empty spaces mean that the processor is idle because no tasks
are ready for execution at that time. The numbers near the bottom right corner of
task instances represent the absolute time when the instance ends.
For this particular task set, the Rate Monotonic scheduler took the following
scheduling decisions:
• At t = 0, the first instance of all three tasks has just been released and all
of them are ready for execution. The scheduler assigns the processor to the
highest-priority instance, that is, τ1,0 .
• At t = 30 ms, instance τ1,0 has completed its execution. As a consequence,
the scheduler assigns the processor to the highest-priority task instance still
ready for execution, that is, τ2,0 .
• At t = 50 ms, instance τ2,0 completes its execution, too, just in time for
the release of the next instance of τ1 . At this point, both τ3,0 and τ1,1 are
ready for execution. As always, the scheduler picks the highest-priority task
instance for execution, τ1,1 in this case.
• At t = 80 ms, when instance τ1,1 is completed, the only remaining task in-
stance ready for execution is τ3,0 and its execution eventually begins.
• The execution of τ3,0 concludes at t = 100 ms. At the same time, new in-
stances of both τ1 and τ2 are released. The processor is assigned to τ1,2
immediately, as before.
Table 6.3
Task Set with U ≃ 0.9 Not Schedulable by Rate Monotonic
Figure 6.4 Scheduling diagram for the task set of Table 6.3.
By examining the diagram, it is evident that all task instances released at t = 0 met
their deadlines. For example, the response time of τ3,0 is f3,0 = 100 ms. Since τ3,0 was
released at a critical instant, we can also conclude that the worst-case response time
of τ3 is R3 = f3,0 = 100 ms. Being D3 = 100 ms, the condition R3 ≤ D3 is therefore
satisfied.
Overall, it turns out that Rate Monotonic was able to successfully schedule this
particular task set, including the lowest-priority task τ3 , even though no further idle
time remained between the release of its first instance, τ3,0 and the corresponding
deadline d3,0 = 100 ms.
By contrast, the outcome is very different for another task set, called B in the
following, even though it is very similar to the previous one and has the same pro-
cessor utilization factor UB ≃ 0.9 (actually, the value of U is even slightly lower in
the second case than in the first). Table 6.3 lists the task parameters of task
set B and Figure 6.4 shows its scheduling diagram, drawn in the same way as the
previous one.
In this case, Rate Monotonic was unable to schedule the task set because instance
τ3,0 , released at a critical instant, concluded its execution at f3,0 = 133 ms. Due to the
critical instant theorem, this corresponds to a worst-case response time R3 = 133 ms
well beyond the deadline D3 = 100 ms.
It should be noted that the failure of Rate Monotonic is not due to a lack of pro-
cessor time. In fact, saying U ≃ 0.9 means that, on average, the task set only re-
quires 90% of the available processor time to be executed. This is also clear from
the scheduling diagram, which shows that a significant amount of idle time—where
the processor is not executing any task—remains between t = 133 ms (where τ3,0
concludes its execution) and t = 150 ms (where a new instance of both τ1 and τ2 is
released).
Instead, the different outcome of Rate Monotonic scheduling when it
is applied to task sets A and B depends on the relationship among the periods and
execution times of the tasks in the task sets, which is less favorable in task set B,
even though U remains the same in both cases.
At the same time, the scheduling diagram also shows that no other task can be
executed at all in the interval [0, 130] ms when τ1,0 and τ2,0 are released together at
t = 0. As a consequence, as long as the deadline of task τ3 is D3 ≤ 130 ms, τ3 will
miss its deadline regardless of how small C3 is.
Overall, these simple examples lead us to observe that, at least in some cases,
the value of U does not provide enough information about the schedulability of a
set of tasks. Hence, researchers developed more sophisticated tests, which are more
complex than the U-based tests discussed so far, but are able to provide a definite
answer about the schedulability of a set of tasks, without any uncertainty areas.
Among them, we will focus on a method known as response time analy-
sis (RTA) [13, 14]. With respect to the U-based tests, it is slightly more complex,
but it is an exact (both necessary and sufficient) schedulability test that can be ap-
plied to any fixed-priority assignment scheme on single-processor systems.
Moreover, it does not just give a “yes or no” answer to the schedulability ques-
tion, but calculates the worst-case response times Ri individually for each task. It is
therefore possible to compare them with the corresponding deadlines Di to assess
whether all tasks meet their deadlines or not and judge how far (or how close) they
are from missing their deadlines as well.
According to RTA, the response time Ri of task τi can be calculated by considering
the following recurrence relationship:
wi^(k+1) = Ci + ∑_{j∈hp(i)} ⌈ wi^(k) / Tj ⌉ Cj ,       (6.7)
in which:
• wi^(k+1) and wi^(k) are the (k + 1)-th and the k-th estimate of Ri , respectively.
Informally speaking, Equation (6.7) provides a way to calculate the next
estimate of Ri starting from the previous one.
• The first approximation wi^(0) of Ri is chosen by letting wi^(0) = Ci , which is
the smallest possible value of Ri .
• hp(i) denotes the set of indices of the tasks with a priority higher than τi .
For Rate Monotonic, the set contains the indices j of all tasks τ j with a
period T j < Ti .
It has been proved that the succession wi^(0) , wi^(1) , . . . , wi^(k) , . . . defined by (6.7) is
monotonic and nondecreasing. Two cases are then possible:
1. If the succession does not converge, there exists at least one scheduling scenario
in which τi does not meet its deadline Di , regardless of the specific value of Di .
2. If the succession converges, it converges to Ri , and hence, it will be wi^(k+1) = wi^(k) = Ri
for some k. In this case, τi meets its deadline in every possible scheduling scenario
if and only if the worst-case response time provided by RTA is Ri ≤ Di .
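As a concrete illustration of the procedure just described, the following C sketch implements the recurrence (6.7) for a task set stored in decreasing priority order, so that hp(i) is simply {0, . . . , i−1}. Integer time units are assumed, and the iteration is stopped as soon as the estimate exceeds Di , because in that case τi cannot be guaranteed to meet its deadline anyway. The sketch is illustrative and not taken from any particular operating system.

```c
#include <stdint.h>

typedef struct { uint32_t C; uint32_t T; uint32_t D; } rt_task_t;

/* Ceiling of the integer division a/b, with b > 0. */
static uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

/* Worst-case response time Ri of task i according to (6.7).
 * Tasks are sorted by decreasing priority, hence hp(i) = {0, ..., i-1}.
 * Returns 0 when the estimate grows beyond Di. */
uint32_t rta_response_time(const rt_task_t *tasks, int i)
{
    uint32_t w = tasks[i].C;                      /* w_i^(0) = Ci          */

    for (;;) {
        uint32_t next = tasks[i].C;

        for (int j = 0; j < i; j++)               /* interference of hp(i) */
            next += ceil_div(w, tasks[j].T) * tasks[j].C;

        if (next == w)                            /* converged: w = Ri     */
            return w;
        if (next > tasks[i].D)                    /* deadline exceeded     */
            return 0;
        w = next;                                 /* w_i^(k+1)             */
    }
}
```

Fed with the same task parameters used in the worked example that follows, this function returns R1 = 30 ms, R2 = 50 ms, and R3 = 100 ms for the first task set.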
As an example, let us apply RTA to the task sets listed in Tables 6.2 and 6.3.
Concerning the first task set, and considering task τ1 first, we can write:
w1^(0) = C1 = 30 ms       (6.8)

w1^(1) = C1 + ∑_{j∈hp(1)} ⌈ w1^(0) / Tj ⌉ Cj = C1 = 30 ms .       (6.9)

Since hp(1) is empty (no task has a priority higher than τ1 ), the succession converges
immediately and

R1 = 30 ms .       (6.10)

For τ2 , the calculation proceeds in the same way:

w2^(0) = C2 = 20 ms       (6.11)

w2^(1) = C2 + ⌈ w2^(0) / T1 ⌉ C1 = 20 + ⌈20/50⌉ 30 = 20 + 30 = 50 ms       (6.12)

w2^(2) = C2 + ⌈ w2^(1) / T1 ⌉ C1 = 20 + ⌈50/50⌉ 30 = 20 + 30 = 50 ms .       (6.13)
In this case, hp(2) = {1} because τ1 has a higher priority than τ2 . The succession
converges, and hence, it is:
R2 = 50 ms . (6.14)
The analysis of the lowest-priority task τ3 , for which hp(3) = {1, 2} proceeds in
the same way:
w3^(0) = C3 = 20 ms       (6.15)

w3^(1) = C3 + ⌈ w3^(0) / T1 ⌉ C1 + ⌈ w3^(0) / T2 ⌉ C2
       = 20 + ⌈20/50⌉ 30 + ⌈20/100⌉ 20 = 20 + 30 + 20 = 70 ms       (6.16)

w3^(2) = 20 + ⌈70/50⌉ 30 + ⌈70/100⌉ 20 = 20 + 60 + 20 = 100 ms       (6.17)

w3^(3) = 20 + ⌈100/50⌉ 30 + ⌈100/100⌉ 20 = 20 + 60 + 20 = 100 ms       (6.18)
Also in this case, the succession converges (albeit convergence requires more it-
erations than before) and we can conclude that R3 = 100 ms. Quite unsurprisingly,
the worst-case response times just obtained from RTA coincide perfectly with the
ones determined from the scheduling diagram in Figure 6.3, with the help of the
critical instant theorem. The main advantage of RTA with respect to the scheduling
diagram is that the former can be automated more easily and is less error-prone when
performed by hand.
Let us consider now the task set listed in Table 6.3. For τ1 and τ2 , the RTA results
are the same as before, that is:
w1^(0) = C1 = 30 ms       (6.19)

w1^(1) = C1 = 30 ms       (6.20)

and

w2^(0) = C2 = 20 ms       (6.21)

w2^(1) = 20 + ⌈20/50⌉ 30 = 20 + 30 = 50 ms       (6.22)

w2^(2) = 20 + ⌈50/50⌉ 30 = 20 + 30 = 50 ms ,       (6.23)
from which we conclude that R1 = 30 ms and R2 = 50 ms. For what concerns τ3 , the
1. Mutual exclusion, which occurs when tasks access shared data or other shared
resources by means of critical regions.
2. Task self-suspension, which takes place when a task waits for any kind of external
event.
The second kind of interaction includes, for instance, the case in which a task
interacts with a hardware device by invoking an Input–Output operation and then
waits for the results. Other examples, involving only tasks, include semaphore-based
task synchronization, outlined in Section 5.3 and message passing, discussed in
Section 5.4.
In a real-time system any kind of task interaction, mutual exclusion in particu-
lar, must be designed with care, above all when the tasks involved have different
priorities. In fact, a high-priority task may be blocked when it attempts to enter its
critical region if a lower-priority task is currently within a critical region controlled
by the same semaphore. From this point of view, the mutual exclusion mechanism is
hampering the task priority scheme. This is because, if the mutual exclusion mech-
anism were not in effect, the high-priority task would always be preferred over the
lower-priority one for execution.
This phenomenon is called priority inversion and, if not adequately addressed,
can adversely affect the schedulability of the system, to the point of making the
response time of some tasks completely unpredictable, because the priority inversion
region may last for an unbounded amount of time and lead to an unbounded priority
inversion.
Even though proper software design techniques may alleviate the issue—for in-
stance, by avoiding useless or redundant critical regions—it is also clear that the
problem cannot be completely solved in this way unless all forms of mutual exclu-
sion, as well as all critical regions, are banned from the system. This is indeed possi-
ble, by means of lock-free and wait-free communication, but those techniques may
imply a significant increase in software design and implementation complexity.
On the other hand, it is possible to improve the mutual exclusion mechanism in
order to guarantee that the worst-case blocking time endured by each individual task
in the system is bounded. The worst-case blocking time can then be calculated and
used to refine the response time analysis (RTA) method discussed previously, in order
to determine the worst-case response time of each task.
The example shown in Figure 6.5 illustrates how an unbounded priority inversion
condition may arise, even in very simple cases. In the example, the task set is com-
posed of three tasks, τ1 , τ2 , and τ3 , listed in decreasing priority order, scheduled by
a preemptive, fixed-priority scheduler like the one specified by Rate Monotonic. As
shown in the figure, τ1 and τ3 share some data and protect their data access with two
critical regions controlled by the same semaphore s. The third task τ2 does not share
any data with the others, and hence, does not contain any critical region.
• The example starts when neither τ1 nor τ2 is ready for execution and the
lowest-priority task τ3 is running.
• After a while τ3 executes a P(s) to enter its critical region. It does not
block, because the value of s is currently 1. Instead, τ3 proceeds beyond the
critical region boundary and keeps running. At the same time, the value of
s becomes 0.
• While τ3 is running, τ1 —the highest-priority task in the system—becomes
ready for execution. Since the scheduler is preemptive, this event causes
an immediate preemption of τ3 (which is still within its critical region) in
favor of τ1 .
• While τ1 is running, τ2 becomes ready for execution, too. Unlike in the
previous case, this event does not have any immediate effect on scheduling,
because the priority of τ2 is lower than the one of the running task τ1 .
• After some time, τ1 tries to enter its critical region by executing the entry
code P(s), like τ3 did earlier. However, the semaphore primitive blocks τ1
because the value of s is currently 0.
This point in time is the beginning of the priority inversion region because τ1 , the
highest-priority task in the system, is blocked due to a lower-priority task, τ3 , and the
system is executing the mid-priority task τ2 .
It is important to remark that, so far, nothing “went wrong” in the system. In fact,
τ1 has been blocked for a sensible reason (it must be prevented from accessing shared
data while τ3 is working on them) and the execution of τ2 is in perfect adherence to
how the scheduling algorithm has been specified. However, a crucial question is
how long the priority inversion region will last.
Referring back to Figure 6.5 it is easy to observe that:
• The amount of time τ1 will be forced to wait does not depend on τ1 itself.
Since it is blocked, there is no way it can directly affect its own future com-
putation, and it will stay in this state until τ3 exits from its critical region.
• The blocking time of τ1 does not completely depend on τ3 , either. In fact, τ3
is ready for execution, but it will not proceed (and will not leave its critical
region) as long as there is any higher-priority task ready for execution, like
τ2 in the example.
As a consequence, the duration of the priority inversion region does not depend
completely on the tasks that are actually sharing data, τ1 and τ3 in the example. In-
stead, it also depends on the behavior of other tasks, like τ2 , which have nothing to
do with τ1 and τ3 . Indeed, in a complex software system built by integrating multi-
ple components, the programmers who wrote τ1 and τ3 may even be unaware that
τ2 exists.
The presence of multiple, mid-priority tasks makes the scenario even worse. In
fact, it is possible that they take turns entering the ready state so that at least one of
them is in the ready state at any given time. In this case, even though none of them
monopolizes the processor by executing for an excessive amount of time, when they
are taken as a whole they may prevent τ3 (and hence, also τ1 ) from being executed
at all.
In other words, it is possible that a group of mid-priority tasks like τ2 , in combi-
nation with a low-priority task τ3 , prevents the execution of the high-priority task τ1
for an unbounded amount of time, which is against the priority assignment principle
and obviously puts schedulability at risk. It is also useful to remark that, as happens
for many other concurrent programming issues, this is not a systematic error. Rather,
it is a time-dependent issue that may go undetected when the system is bench tested.
Considering again the example shown in Figure 6.5, it is easy to notice that the
underlying reason for the unbounded priority inversion is the preemption of τ3 by τ1
while it was within its critical region. If the preemption were delayed until after
τ3 exited from the critical region, the issue would not occur, because there would be
no way for mid-priority tasks like τ2 to make the priority inversion region unbounded.
This informal reasoning can indeed be formally proved and forms the basis of a
family of methods—called priority ceiling protocols and fully described in [153]—
to avoid unbounded priority inversion. In a single-processor system, a crude imple-
mentation of those methods consists of completely forbidding preemption during the
execution of critical regions.
As described in Sections 5.1 and 5.3, this also implements mutual exclusion and
it may be obtained by disabling the operating system scheduler or, even more dras-
tically, turning interrupts off. In this way, any task that successfully enters a critical
region also gains the highest possible priority in the system so that no other task
can preempt it. The task goes back to its regular priority as soon as it exits from the
critical region.
Even though this method has the clear advantage of being extremely simple to
implement, it also introduces by itself a significant amount of a different kind of
blocking. Namely, any higher-priority task like τ2 that becomes ready while a low-
priority task τ3 is within a critical region will not get executed—and we therefore
consider it to be blocked by τ3 —until τ3 exits from the critical region and returns to
its regular priority.
The problem has been solved anyway because the amount of blocking endured
by tasks like τ2 is indeed bounded by the maximum amount of time τ3 may spend
within its critical region, which is finite if τ3 has been designed in a proper way.
Nevertheless, we are now potentially blocking many tasks which were not blocked
before, and it turns out that most of this extra blocking is actually unessential to solve
the unbounded priority inversion problem. For this reason, this way of proceeding is
only appropriate for very short critical regions. On the other hand, a more sophisti-
cated approach—which does not introduce as much extra blocking—is needed in the
general case.
Nevertheless, the underlying idea is useful, that is, a better cooperation between
the synchronization mechanism used for mutual exclusion and the processor sched-
uler can indeed solve the unbounded priority inversion problem. Namely, the pri-
ority inheritance protocol, proposed by Sha, Rajkumar, and Lehoczky [153] and
available in most real-time operating systems—including F REE RTOS—works by
enabling the mutual exclusion mechanism to temporarily boost task priorities, and
hence, affect scheduling decisions.
Informally speaking, the general idea behind the priority inheritance protocol is
that, if a task τ is blocking a set of n higher-priority tasks τ1 , . . . , τn at a given
instant, it will temporarily inherit the highest priority among them. This temporary
priority boost lasts as long as the blocking is in effect and prevents any mid-priority task
from preempting τ and unduly making the blocking experienced by τ1 , . . . , τn longer
than necessary, or even unbounded.
More formally, the priority inheritance protocol relies on the following as-
sumptions:
• The tasks are under the control of a fixed-priority scheduler and are exe-
cuted by a single processor.
• If there are two or more highest-priority tasks ready for execution, the
scheduler picks them in first-come first-served (FCFS) order, that is, they
are executed in the same order as they became ready.
• Semaphore wait queues are ordered by priority so that, when a task executes
a V(s) on a semaphore s and there is at least one task waiting on s, the
highest-priority waiting task is unblocked and becomes ready for execution.
The priority inheritance protocol itself consists of a small set of rules, whose complete statement can be found in [153].
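Although the rigorous formulation of the rules is given in [153], the following C-like sketch conveys the basic mechanism. All types and kernel hooks (task_t, pend_on_sem, wakeup_highest_waiter, and so on) are hypothetical and do not belong to F REE RTOS or to any other specific kernel; numerically larger values are assumed to denote higher priorities, and neither transitive inheritance nor nested critical regions are handled, to keep the fragment short.

```c
/* Illustrative sketch of priority inheritance for a single, non-nested
 * binary semaphore. The wait queue is assumed to be managed, in
 * priority order, by the hypothetical kernel hooks declared below. */

typedef struct task {
    int base_prio;      /* priority assigned by the scheduling policy  */
    int active_prio;    /* current priority, possibly inherited        */
} task_t;

typedef struct {
    int     value;      /* 1 = free, 0 = taken                         */
    task_t *holder;     /* task currently inside the critical region   */
} pi_sem_t;

/* Hypothetical kernel hooks. */
extern task_t *current_task(void);
extern void    pend_on_sem(pi_sem_t *s);            /* block the caller */
extern task_t *wakeup_highest_waiter(pi_sem_t *s);  /* NULL if none     */

void pi_sem_P(pi_sem_t *s)
{
    task_t *self = current_task();

    if (s->value == 1) {
        s->value = 0;
        s->holder = self;                 /* enter the critical region   */
    } else {
        /* The holder inherits the priority of the blocked task, if
         * higher, so that mid-priority tasks can no longer preempt it. */
        if (self->active_prio > s->holder->active_prio)
            s->holder->active_prio = self->active_prio;
        pend_on_sem(s);                   /* wait until the region frees */
    }
}

void pi_sem_V(pi_sem_t *s)
{
    task_t *self = current_task();
    task_t *next;

    self->active_prio = self->base_prio;  /* drop any inherited priority */

    next = wakeup_highest_waiter(s);
    if (next == NULL) {
        s->value = 1;                     /* nobody waiting: region free */
        s->holder = NULL;
    } else {
        s->holder = next;                 /* ownership passes directly   */
    }
}
```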
It should be noted that the bound Bi given by (6.30) is often “pessimistic” when
applied to real-world scenarios, because
• It assumes that if a certain semaphore can possibly block a task, it will
indeed block it.
• For each semaphore, the blocking time suffered by τi is always assumed
to be equal to the worst-case execution time of the longest critical region
guarded by that semaphore, even though that critical region is never entered
by τi itself.
The example shows that, by means of the priority inheritance protocol, the length
of the priority inversion region is now bounded and limited to the region highlighted
in the figure. Namely, the worst-case length of the priority inversion region is now
equal to the maximum amount of time that τ3 can possibly spend within its critical
region, corresponding to the light gray boxes drawn on τ3 ’s time line.
However, it is also useful to remark that, within the priority inversion region, τ3
now blocks both τ1 and τ2 , whereas only τ1 was blocked in the previous example.
This fact leads us to observe that the priority inheritance protocol implies a trade-off
between enforcing an upper bound on the length of priority inversion regions and
introducing additional blocking in the system, as any other algorithm dealing with
unbounded priority inversion does.
Hence, for the priority inheritance protocol, we identify two distinct kinds of
blocking:
1. Direct blocking occurs when a high-priority task tries to acquire a shared
resource—for instance, get access to some shared data by taking a mutual exclu-
sion semaphore—while the resource is held by a lower-priority task. This is the
kind of blocking affecting τ1 in this case. Direct blocking is an unavoidable con-
sequence of mutual exclusion and ensures the consistency of the shared resources.
2. Push-through blocking is a consequence of the priority inheritance protocol and
is the kind of blocking experienced by τ2 in the example. It occurs when an
intermediate-priority task (like τ2 ) is not executed even though it is ready be-
cause a lower-priority task (like τ3 ) has inherited a higher priority. This kind of
blocking may affect a task even if it does not actually use any shared resource, but
it is necessary to avoid unbounded priority inversion.
To conclude the example, let us apply (6.30) to the simple scenario being consid-
ered and calculate the worst-case blocking time for each task. Since there is only one
semaphore in the system, that is, K = 1, the formula becomes:
Bi = usage(1, i)C(1) , (6.31)
where usage(1, i) is a function that returns 1 if semaphore s is used by (at least) one
task with a priority less than the priority of τi , and also by (at least) one task with
a priority higher than or equal to the priority of τi . Otherwise, usage(1, i) returns 0.
Therefore:
usage(1, 1) = 1 (6.32)
usage(1, 2) = 1 (6.33)
usage(1, 3) = 0 . (6.34)
Similarly, C(1) is the worst-case execution time among all critical regions asso-
ciated with, or guarded by, semaphore s. Referring back to Figure 6.6, C(1) is the
maximum between the length of the light gray boxes (critical region of τ3 ) and the
dark gray box (critical region of τ1 ). As a result, we can write:
B1 = C(1) (6.35)
B2 = C(1) (6.36)
B3 = 0 . (6.37)
These results confirm that both τ1 and τ2 can be blocked (by τ3 )—in fact, B1 ≠ 0
and B2 ≠ 0—while the only task that does not suffer any blocking is the lowest-
priority one, τ3 , because B3 = 0. On the other hand, they also confirm the “pes-
simism” of (6.30) because the worst-case amount of blocking is calculated to be the
maximum between the two critical regions present in the system. Instead, it is clear
from the diagram that the length of the dark-gray critical region, the critical region
of τ1 , cannot affect the amount of blocking endured by τ1 itself.
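The same calculation generalizes readily to a system with K semaphores. In the following C sketch, the usage relation and the worst-case critical region lengths are assumed to have been determined beforehand and stored in two arrays; all names and numeric values are illustrative, and task indices 0, 1, 2 correspond to τ1 , τ2 , and τ3 of the example.

```c
#define K_SEM   1    /* number of semaphores, K = 1 in the example        */
#define N_TASK  3

/* usage[k][i] = 1 if semaphore k can block task i, as defined above.     */
static const int usage[K_SEM][N_TASK] = { { 1, 1, 0 } };

/* Ck[k]: worst-case execution time of the longest critical region
 * guarded by semaphore k (hypothetical value, same unit as the Ci).      */
static const unsigned Ck[K_SEM] = { 10 };

/* Worst-case blocking time Bi of task i, as in (6.31).                   */
unsigned blocking_bound(int i)
{
    unsigned B = 0;

    for (int k = 0; k < K_SEM; k++)
        B += (unsigned)usage[k][i] * Ck[k];
    return B;
}
```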
Considering mutual exclusion as the only source of blocking in the system is still
not completely representative of what happens in the real world, in which tasks also
invoke external operations and wait for their completion. For instance, it is common
for tasks to start an input–output (I/O) operation and wait until it completes or a
timeout expires. Another example would be to send a message to another task and
then wait for an answer.
In general, all scenarios in which a task voluntarily suspends itself for a vari-
able amount of time, provided this time has a known and finite upper bound, are
called self-suspension or self-blocking. The analysis presented here is based on
Reference [144], which addresses schedulability analysis in the broader context of
real-time synchronization for multiprocessor systems. Interested readers are referred
to [144] for further information and the formal proof of the statements discussed in
this book.
Contrary to what intuition may suggest, the effects of self-suspension are
not necessarily local to the task that is experiencing it. Instead, the self-
suspension of a high-priority task may hinder the schedulability of lower-priority
tasks and, possibly, make them no longer schedulable. This is because, after self-
suspension ends, the high-priority task may become ready for execution (and hence,
preempt the lower-priority tasks) at the “wrong time” and have a greater impact on
their worst-case response time than in the case in which the high-priority task
runs continuously until completion.
Moreover, when considering self-suspension, even the critical instant theorem
that, as shown previously, plays a central role in the schedulability analysis theory
we leveraged so far, is no longer directly applicable to compute the worst-case inter-
ference that a task may be subject to.
In any case, the worst-case extra blocking endured by task τi due to its own self-
suspension, as well as the self-suspension of higher-priority tasks, denoted Bi^SS , can
still be calculated efficiently and can be written as
Bi^SS = Si + ∑_{j∈hp(i)} min(Cj , Sj ) .       (6.38)
Informally speaking, according to (6.38) the worst-case blocking time Bi^SS due to
self-suspension endured by task τi is given by the sum of its own worst-case self-
suspension time Si plus a contribution from each of the higher-priority tasks. The in-
dividual contribution of task τj to Bi^SS is given by its own worst-case self-suspension
time Sj , but it never exceeds its execution time Cj .
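A direct transcription of (6.38) into C is equally simple. As in the previous sketches, tasks are assumed to be stored in decreasing priority order; S[i] and C[i] denote the worst-case self-suspension and execution times of task i, respectively.

```c
/* Worst-case blocking due to self-suspension, Equation (6.38). */
unsigned self_suspension_blocking(const unsigned *C, const unsigned *S, int i)
{
    unsigned B = S[i];                      /* own self-suspension, Si  */

    for (int j = 0; j < i; j++)             /* higher-priority tasks    */
        B += (C[j] < S[j]) ? C[j] : S[j];   /* min(Cj, Sj)              */
    return B;
}
```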
Bi = Bi^SS + Bi^TI
   = Si + ∑_{j∈hp(i)} min(Cj , Sj ) + (Qi + 1) ∑_{k=1}^{K} usage(k, i) C(k)       (6.40)
It is useful to remark that, in the above formula, Cj and C(k) have two different
meanings that should not be confused despite the likeness in notation, namely:
• Cj is the worst-case execution time of task τj as a whole, according to the
notation of Table 6.1;
• C(k) is the worst-case execution time of the longest critical region guarded by
the k-th semaphore.
The value of Bi calculated by means of (6.40) can then be used to extend RTA
and consider the blocking time in worst-case response time calculations. Namely,
the basic recurrence relationship (6.7) can be rewritten as:
wi^(k+1) = Ci + Bi + ∑_{j∈hp(i)} ⌈ wi^(k) / Tj ⌉ Cj ,       (6.41)
Like (6.7), this succession is monotonic and nondecreasing. If it converges, it
converges to Ri ; on the other hand, if it does not converge, τi must be considered
not schedulable. As before, setting wi^(0) = Ci provides a sensible initial value for the
succession.
The main difference is that the new formulation is pessimistic, instead of neces-
sary and sufficient, because the bound Bi on the worst-case blocking time is not tight.
Therefore it may be practically impossible for a task to ever incur a blocking time
equal to Bi , and hence, experience the worst-case response time calculated by (6.41).
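In terms of code, extending the RTA sketch shown earlier only requires adding the constant term Bi to each step of the recurrence, as prescribed by (6.41). The fragment below, which reuses the rt_task_t type and the ceil_div helper of that sketch, shows the modified iteration step; Bi is the bound obtained, for instance, by means of (6.40).

```c
/* One step of the extended recurrence (6.41): identical to (6.7) except
 * for the constant blocking term Bi. */
uint32_t rta_step_with_blocking(const rt_task_t *tasks, int i,
                                uint32_t Bi, uint32_t w)
{
    uint32_t next = tasks[i].C + Bi;

    for (int j = 0; j < i; j++)
        next += ceil_div(w, tasks[j].T) * tasks[j].C;
    return next;
}
```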
A clear advantage of the approach just described is that it is very simple and
requires very little knowledge about the internal structure of the tasks. For instance,
it is unnecessary to know exactly where the self-suspension operations are. Instead,
it is enough to know how many of them there are, information that is much simpler
to collect and maintain as the software evolves over time. However, the disadvantage of using such a
limited amount of information is that it makes the method extremely conservative.
Thus, the Bi calculated in this way is definitely not a tight upper bound for the worst-
case blocking time and may widely overestimate it in some cases.
More sophisticated and precise methods do exist, such as that described in Refer-
ence [103]. However, as we have seen in several other cases, the price to be paid for a
tighter upper bound for the worst-case blocking time is that much more information
is needed. For instance, in the case of [103], we need to know not only how many
self suspensions each task has got, but also their exact location within the task. In
other words, we need to know the execution time of each individual task segment,
instead of the task execution time as a whole.
6.3 SUMMARY
This chapter contains the basics of task-based scheduling algorithms. After in-
troducing some basic nomenclature, models, and concepts related to this kind of
scheduling, it went on to describe what are probably the most widespread real-
time scheduling algorithms, known as rate monotonic (RM) and earliest deadline
first (EDF).
The second part of the chapter presented a few basic scheduling analysis tech-
niques for the RM algorithm, able to mathematically prove whether or not a set of
periodic tasks meets its deadlines, by considering some of their basic character-
istics, like their period and execution time.
The analysis started with a very simple task model and was later extended to
include some additional aspects of task behavior of practical interest, for instance,
their interaction when accessing shared data in mutual exclusion.
The analysis methods mentioned in this chapter can also be applied to the EDF
algorithm, but they become considerably more complex than for RM, both from the
theoretical point of view and in their practical implementation.
For this reason, they have not been considered in this book.
7 Configuration and Usage
of Open-Source Protocol
Stacks
CONTENTS
7.1 Introduction to the LW IP Protocol Stack....................................................... 163
7.2 Operating System Adaptation Layer ............................................................. 167
7.3 Configuration Options ................................................................................... 169
7.4 Netconn Interface .......................................................................................... 175
7.5 Network Buffer Management ........................................................................ 181
7.6 POSIX Networking Interface ........................................................................ 187
7.7 Summary........................................................................................................ 197
Many microcontrollers are nowadays equipped with at least one integrated Ethernet
controller. It is therefore relatively easy—by just adding an external Ethernet physi-
cal transceiver (PHY)—to add network connectivity to an embedded system.
In turn, this extra feature can be quite useful for a variety of purposes, for in-
stance remote configuration, diagnostics, and system monitoring, which are becom-
ing widespread requirements, even on relatively low-end equipment.
This chapter shows how to configure and use an open-source protocol stack,
namely LW IP [51], which is capable of handling the ubiquitous TCP, UDP, and IP
protocols, and how to interface it with the underlying real-time operating system.
Another topic addressed by this chapter is how to use the protocol stack effec-
tively from the application tasks, by choosing the most suitable LW IP application
programming interface depending on the applications at hand.
On the other hand, due to its greater complexity, the way of writing a device driver
in general and, more specifically, interfacing the protocol stack with an Ethernet
controller by means of its device driver, will be the subject of a chapter by itself,
Chapter 8.
The main design goals of LW IP are portability, as well as small code and data size,
making it especially suited for small-scale embedded systems.
It interacts with other system components by means of three main interfaces, de-
picted in Figure 7.1.
Portable code modules, which indeed constitute the vast majority of LW IP code, are
shown as one single white rectangle.
As can be seen from the figure, the main non-portable modules, that is, the mod-
ules that should possibly be modified or rewritten when LW IP is ported to a new
processor architecture, toolchain, or software project, are:
Going into even more detail, Figure 7.3 shows how LW IP is structured internally
and how its different layers and components communicate and synchronize with each
other along the network transmit and receive paths.
In its default configuration, LW IP carries out all protocol stack operations within
a single, sequential task, shown in the middle of the figure. Although it is called
tcpip task (or thread) in the LW IP documentation, this task actually takes care of
all protocols supported by LW IP (not only TCP and IP) and processes all frames
received from the network.
The main interactions of this thread with the other LW IP components take place
by means of several inter-task synchronization devices, as described in the following.
• The mailbox mbox is the main way other components interact with the tcpip
thread. It is used by network interface drivers to push received frames into
the protocol stack, and by application threads to send network requests, for
instance to send a UDP datagram. In Figure 7.3, a solid arrow represents
either a data transfer or a message passing operation, comprising both data
transfer and synchronization.
• At the application level, there are zero or more active network communica-
tion endpoints at any instant. A per-endpoint semaphore, op-completed,
is used to block an application task when it requests a synchronous opera-
tion on that endpoint, and hence, it must wait until it is completed.
In the figure, a synchronization semaphore is represented by a light gray
circle, and a dashed arrow stands for a synchronization primitive, either
Take or Give. The same kind of arrow is also used to denote the activation
of an IRQ handler by hardware.
• Any application data received from the network is made available to appli-
cation tasks by means of a per-endpoint mailbox, recvmbox. The mailbox
interaction also blocks the application task when no data are available to be
received.
• Along the transmit path, the main LW IP thread calls the transmit functions
of the network interface driver directly, whenever it wants to transmit a
frame. The functions are asynchronous, that is, they simply enqueue the
frame for transmission and return to the caller.
Table 7.1
LW IP Operating System Interface and Corresponding F REE RTOS Facilities
Task creation
sys_thread_t xTaskHandle Task data type
sys_thread_new xTaskCreate Create a task
Binary semaphores
sys_sem_t xSemaphoreHandle Semaphore data type
sys_sem_new vSemaphoreCreateBinary Create a semaphore
sys_sem_free vSemaphoreDelete Delete a semaphore
sys_arch_sem_wait xSemaphoreTake Semaphore P()
sys_sem_signal xSemaphoreGive Semaphore V()
Mailboxes
sys_mbox_t xQueueHandle Mailbox data type
sys_mbox_new xQueueCreate Create a mailbox
sys_mbox_free vQueueDelete Delete a mailbox
sys_mbox_post xQueueSend Post a message
sys_mbox_trypost xQueueSend Nonblocking post
sys_arch_mbox_fetch xQueueReceive Fetch a message
sys_arch_mbox_tryfetch xQueueReceive Nonblocking fetch
The mapping summarized in Table 7.1 takes into account that the main design goal
was to make the porting layer as efficient as possible. Since it is used by LW IP as a
foundation for data transfer and synchronization,
any issue at this level would heavily degrade the overall performance of the protocol
stack. The following, more detailed aspects of the mapping are worth mentioning.
For this reason, the LW IP interface has been mapped directly on the
most basic and efficient mutual exclusion mechanism of F REE RTOS. This
is adequate on single-processor systems and, as an added benefit, the
F REE RTOS mechanism already supports critical region nesting, a key fea-
ture required by LW IP. Readers interested in how the F REE RTOS mecha-
nism works will find more details about it in Chapter 10.
• The porting layer must provide a system-independent way to create a new
task, called thread in the LW IP documentation, by means of the function
sys_thread_new. This interface is used to instantiate the main LW IP
processing thread discussed previously, but it can be useful to enhance
portability at the network application level, too.
When using F REE RTOS, an LW IP thread can be mapped directly onto a
task, as the two concepts are quite similar. The only additional function
realized by the porting layer—besides calling the F REE RTOS task cre-
ation function—is to associate a unique instance of a LW IP-defined data
structure, struct sys_timeouts, to each thread. Each thread can re-
trieve its own structure by means of the function sys_arch_timeouts.
The thread-specific structure is used to store information about any pending
timeout request the thread may have.
• Internally, LW IP makes use of binary semaphores, with an initial value
of either 1 or 0, for mutual exclusion and synchronization. Besides
sys_sem_new and sys_sem_free—to create and delete a semaphore,
respectively—the other two interfaces to be provided correspond to the
classic timed P() and V() concurrent programming primitives.
Even though nomenclature is quite different—the wait primitive of LW IP
corresponds to the Take primitive of F REE RTOS, and signal corre-
sponds to Give—both primitives are provided either directly, or with min-
imal adaptation, by F REE RTOS.
• A mailbox is used by LW IP when it is necessary to perform a data transfer
among threads—namely to pass a pointer along—in addition to synchro-
nization. The flavor of message passing foreseen by LW IP specifies an indi-
rect, symmetric naming scheme, fixed-size messages and limited buffering,
so it can be directly mapped onto the F REE RTOS message queue facil-
ity. Non-blocking variants of the send and receive interfaces are also needed
(known as trypost and tryfetch, respectively), but their implementa-
tion is not an issue, because the underlying F REE RTOS primitives already
support this feature.
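As a concrete illustration, the fragment below sketches the semaphore part of a porting layer built on top of F REE RTOS. It follows the older-style LW IP porting interface described here, in which semaphores are passed by value and sys_arch_sem_wait returns either the time spent waiting or SYS_ARCH_TIMEOUT; exact function signatures differ in more recent LW IP versions, so the code should be read as an outline rather than a drop-in implementation.

```c
/* sys_arch.c (excerpt) -- semaphore portion of a hypothetical lwIP
 * porting layer for FreeRTOS, older lwIP 1.x style interface.
 * The sys_sem_t typedef normally lives in the port's sys_arch.h and
 * maps onto xSemaphoreHandle. */
#include "lwip/sys.h"
#include "FreeRTOS.h"
#include "semphr.h"
#include "task.h"

sys_sem_t sys_sem_new(u8_t count)
{
    xSemaphoreHandle sem;

    vSemaphoreCreateBinary(sem);        /* created with initial value 1  */
    if (sem != NULL && count == 0)
        xSemaphoreTake(sem, 0);         /* bring the initial value to 0  */
    return sem;
}

void sys_sem_free(sys_sem_t sem)
{
    vSemaphoreDelete(sem);
}

void sys_sem_signal(sys_sem_t sem)      /* V() */
{
    xSemaphoreGive(sem);
}

u32_t sys_arch_sem_wait(sys_sem_t sem, u32_t timeout_ms)   /* timed P() */
{
    portTickType start = xTaskGetTickCount();
    portTickType wait  = (timeout_ms == 0) ? portMAX_DELAY
                         : timeout_ms / portTICK_RATE_MS;

    if (xSemaphoreTake(sem, wait) == pdTRUE)
        return (xTaskGetTickCount() - start) * portTICK_RATE_MS;
    else
        return SYS_ARCH_TIMEOUT;        /* lwIP's "timed out" indication */
}
```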
Table 7.2
Main LW IP Protocol Enable/Disable Options
The header file opt.h lists the default values of all the configuration options that
LW IP supports. The default value automatically comes into effect if a certain option
is not defined in lwipopts.h.
Interested readers should refer to the documentation found in the source code,
mainly in opt.h, for detailed information about individual configuration options. In
this section, we will provide only a short summary and outline the main categories
of options that may be of interest.
Table 7.2 summarizes a first group of options that are very helpful to optimize
LW IP memory requirements, concerning both code and data size. In fact, each indi-
vidual option enables or disables a specific communication protocol, among the ones
supported by LW IP. Roughly speaking, when a protocol is disabled, its implementa-
tion is not built into LW IP, and hence, its memory footprint shrinks.
Each row of the table lists a configuration option, on the left, and gives a short
description of the corresponding protocol, on the right. The middle column provides
a reference to the request for comments (RFC) document that defines the protocol, if
applicable. Even when a certain protocol has been extended or amended in one or
more subsequent RFCs, the reference given is still to the original one.
It is also worth mentioning that, even though the ICMP protocol [141] can in
principle be disabled, it is mandatory to support it, in order to satisfy the minimum
requirements for Internet hosts set forth by RFC 1122 [24].
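By way of example, a typical lwipopts.h excerpt enabling only the protocols actually needed by an application could look as follows. The option names shown are the ones defined by opt.h in current LW IP releases and are listed here only as an illustration; the authoritative list is the one in opt.h itself.

```c
/* lwipopts.h (excerpt) -- per-protocol enable/disable switches.
 * 1 = the protocol is built into lwIP, 0 = its code is left out. */
#define LWIP_ARP      1   /* Address Resolution Protocol                 */
#define LWIP_ICMP     1   /* mandatory to satisfy RFC 1122               */
#define LWIP_UDP      1
#define LWIP_TCP      1
#define LWIP_DHCP     0   /* this example uses a static IP configuration */
#define LWIP_AUTOIP   0
#define LWIP_IGMP     0
#define LWIP_SNMP     0
#define LWIP_DNS      0
```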
A second group of options configures each protocol individually. The main con-
figuration options for the IP, UDP, and TCP protocols are listed in Table 7.3 along
with a terse description. Readers are referred to specialized textbooks, like [147], for
more information about their meanings and side effects on protocol performance and
conformance to the relevant standards.
The only aspect that will be discussed here—because it is more related to the
way protocols are implemented with respect to their abstract definition—is the role
Table 7.3
Main LW IP Protocol--Speci c Con guration Options
Option Description
IP protocol
IP_DEFAULT_TTL Default time-to-live (TTL) of outgoing IP datagrams, un-
less specified otherwise by the transport layer
IP_FORWARD Enable IP forwarding across network interfaces
IP_FRAG Enable IP datagram fragmentation upon transmission
IP_REASSEMBLY Enable IP datagram reassembly upon reception
IP_REASS_MAX_PBUFS Maximum number of struct pbuf used to hold IP frag-
ments waiting for reassembly
CHECKSUM_GEN_IP Generate checksum for outgoing IP datagrams in software
CHECKSUM_CHECK_IP Check checksum of incoming IP datagrams in software
UDP protocol
UDP_TTL Time-to-live (TTL) of IP datagrams carrying UDP traffic
CHECKSUM_GEN_UDP Generate checksum for outgoing UDP datagrams in soft-
ware
CHECKSUM_CHECK_UDP Check checksum of incoming UDP datagrams in software
TCP protocol
TCP_TTL Time-to-live (TTL) of IP datagrams carrying TCP traffic
TCP_MSS TCP maximum segment size
TCP_CALCULATE_EFF_SEND_MSS Trim the maximum TCP segment size based on the MTU
of the outgoing interface
TCP_WND TCP window size
TCP_WND_UPDATE_THRESHOLD Minimum window variation that triggers an explicit win-
dow update
TCP_MAXRTX Maximum number of retransmissions for TCP data seg-
ments
TCP_SYNMAXRTX Maximum number of retransmissions of TCP SYN seg-
ments
TCP_QUEUE_OOSEQ Queue out-of-order TCP segments
TCP_SND_BUF TCP sender buffer space in bytes
TCP_SND_QUEUELEN Number of struct pbuf used for the TCP sender buffer
space
TCP_SNDLOWAT Number of bytes that must be available in the TCP sender
buffer to declare the endpoint writable
TCP_LISTEN_BACKLOG Enable the backlog for listening TCP endpoints
TCP_DEFAULT_LISTEN_BACKLOG Default backlog value to use, unless otherwise specified
CHECKSUM_GEN_TCP Generate checksum for outgoing TCP segments in soft-
ware
CHECKSUM_CHECK_TCP Check checksum of incoming TCP segments in software
Table 7.4
Main LW IP Memory Management Con guration Options
Option Description
Network buffers
MEM_SIZE Size, in bytes, of the native heap (when enabled), mainly used
for struct pbuf of type PBUF_RAM
MEMP_NUM_NETBUF Number of struct netbuf
MEMP_NUM_PBUF Number of struct pbuf of type PBUF_ROM and PBUF_REF
PBUF_POOL_SIZE Number of struct pbuf of type PBUF_POOL
PBUF_POOL_BUFSIZE Size, in bytes, of the data buffer associated to each struct
pbuf of type PBUF_POOL
Both strategies are useful in principle because they represent different trade-offs
between flexibility, robustness, and efficiency. For instance, the allocation of a fixed-
size data structure from a dedicated pool holding memory blocks of exactly the right
size can be done in constant time, which is much more efficient than leveraging a
general-purpose (but complex) memory allocator, able to manage blocks of any size.
Moreover, dividing the available memory into distinct pools guarantees that, if a
certain memory pool is exhausted for any reason, memory is still available for other
kinds of data structure, drawn from other pools. In turn, this makes the protocol stack
more robust because a memory shortage in one area does not prevent other parts of
it from still obtaining dynamic memory and continue to work.
On the other hand, a generalized use of the heap for all kinds of memory allocation
is more flexible because no fixed and predefined upper limits on how much memory
can be spent on each specific kind of data structure must be set in advance.
Therefore, the three main options in this subcategory choose which of the sup-
ported strategies shall be used by LW IP depending on user preferences. Moreover,
they also determine if the memory LW IP needs should come from statically de-
fined arrays permanently reserved for LW IP, or from the C library memory allocator.
Namely:
The second group of options determines the overall size of the heap, if the option
MEM_LIBC_MALLOC has not been set, and specifies how many network buffers—
that is, buffers used to temporarily store data being transmitted or received by the
protocol stack—should be made available for use. Obviously, these options have
effect only if MEMP_MEM_MALLOC has not been set. More detailed information about
the internal structure of the different kinds of network buffer mentioned in the table
and their purpose will be given in Section 7.5.
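For instance, a small system that can devote a few tens of kilobytes of RAM to networking could adopt settings along the following lines; the figures are purely illustrative and must be tuned to the expected traffic profile.

```c
/* lwipopts.h (excerpt) -- memory management, illustrative values only. */
#define MEM_LIBC_MALLOC    0            /* use lwIP's own heap...          */
#define MEM_SIZE           (16 * 1024)  /* ...16 KB wide, for PBUF_RAM     */

#define MEMP_NUM_PBUF      16           /* PBUF_ROM / PBUF_REF headers     */
#define MEMP_NUM_NETBUF    8            /* struct netbuf                   */
#define PBUF_POOL_SIZE     24           /* PBUF_POOL pbufs (receive path)  */
#define PBUF_POOL_BUFSIZE  512          /* data buffer size of each pbuf   */
```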
The next group of options, shown in the middle part of Table 7.4, determines
the size of other memory pools, related to communication endpoints and to protocol
control blocks, specifically for each protocol supported by LW IP.
For instance, the MEMP_NUM_NETCONN option determines how big the pool for
struct netconn data structures will be. These data structures are in one-to-one
correspondence with communication endpoints created by any of the LW IP APIs.
Moreover, one protocol control block is needed for every active communication end-
point that makes use of that protocol.
The fourth and last group of configuration options related to memory management,
to be discussed here, determines the size of the memory pools that hold the data struc-
tures exchanged between the other parts of the protocol stack and the tcpip thread.
Since, as outlined in Section 7.1, this thread is responsible for carrying out most
protocol stack activities, the availability of an adequate number of data structures
to communicate with it is critical to make sure that application-level requests and
incoming packets can flow smoothly through the protocol stack.
Table 7.5
Other Important LW IP Con guration Options
Option Description
Last but not least, we briefly present a final group of important LW IP config-
uration options, listed in Table 7.5.
The two options LWIP_NETCONN and LWIP_SOCKET enable the two main user-
level LW IP APIs, that is, the native netconn interface and POSIX sockets. They
will be discussed in Sections 7.4 and 7.6, respectively. In normal use, at least one of
the two APIs must be enabled in order to make use of the protocol stack.
They can be enabled together, though, to accommodate complex code in which
components make use of distinct APIs. Moreover, the POSIX sockets API re-
quires the netconn API in any case because it is layered on it. In addition, the
options LWIP_COMPAT_SOCKETS and LWIP_POSIX_SOCKETS_IO_NAMES con-
trol the names with which POSIX sockets API functions are made available to the
programmer. Their effect will be better described in Section 7.6.
The two configuration options LWIP_SO_RCVTIMEO and LWIP_SO_RCVBUF
determine the availability of two socket options (also discussed in Section 7.6) used
to specify a timeout for blocking receive primitives and to indicate the amount of
receive buffer space to be assigned to a socket, respectively.
The last option listed in Table 7.5, when set, configures LW IP to collect a variety
of statistics about traffic and protocol performance.
Table 7.6
Main Functions of the netconn Networking API
Function Description
Connection establishment
netconn_connect Initiate a connection (TCP) or register address/port (UDP)
netconn_peer Return address/port of remote peer
netconn_disconnect Disassociate endpoint from a remote address/port (UDP only)
netconn_listen Make a TCP socket available to accept connection requests
netconn_listen_with_backlog Same as netconn_listen, with backlog
netconn_accept Accept a connection request
Data transfer
netconn_send Send data to registered address/port (UDP only)
netconn_sendto Send data to explicit address/port (UDP only)
netconn_write Send data to connected peer (TCP only)
netconn_recv Receive data (both TCP and UDP)
Table 7.7
Main LW IP netconn Types
netconn Configuration
type option Description
NETCONN_TCP LWIP_TCP TCP protocol
NETCONN_UDP LWIP_UDP UDP protocol
NETCONN_UDPNOCHKSUM LWIP_UDP Standard UDP, without checksum
NETCONN_UDPLITE LWIP_UDP, LWIP_UDPLITE Lightweight UDP, non-standard
NETCONN_RAW LWIP_RAW Other IP-based protocol
The pointer to the struct ip_addr can be NULL. In this case, LW IP deter-
mines an appropriate local IP address automatically. The function returns an err_t
value that indicates whether or not the function was successful.
Connection establishment
The function netconn_connect, when invoked on a communication endpoint, be-
haves in two radically different ways, depending on whether the underlying protocol
is TCP or UDP.
1. When the underlying protocol is UDP, it does not generate any network traffic
and simply associates the remote IP address and port number—passed as the sec-
ond and third argument, respectively—to the endpoint. As a result, afterward the
endpoint will only receive traffic coming from the given IP address and port, and
will by default send traffic to that IP address and port, when no remote address is
specified explicitly.
2. When the underlying protocol is TCP, it opens a connection with the remote host
by starting the three-way TCP handshake sequence. The return value, of type
err_t indicates whether or not the connection attempt was successful. TCP data
transfer can take place only after a connection has been successfully established.
After a successful netconn_connect the remote address and port number as-
sociated to an endpoint can be retrieved by calling the function netconn_peer. It
takes the endpoint itself as its only argument.
The function netconn_disconnect can be invoked only on a UDP endpoint
and undoes what netconn_connect did, that is, it disassociates the endpoint from
the remote IP address and port given previously without generating any network
traffic. For TCP endpoints, it is necessary to use netconn_close, as described
previously, to close the connection properly.
In order to accept incoming connection requests, a TCP endpoint must be put
into the listening state first. In its simplest form, this is done by invoking the
netconn_listen function on it.
If the LW IP configuration option TCP_LISTEN_BACKLOG has been set, the
(more powerful) function netconn_listen_with_backlog performs the same
action and it also allows the caller to specify the size of the connection backlog as-
sociated to the endpoint. If the option has not been set, and hence, the backlog is not
available, netconn_listen_with_backlog silently reverts to the same behav-
ior as netconn_listen.
Roughly speaking, the backlog size indicates the maximum number of outstand-
ing connection requests that may be waiting acceptance by the endpoint. Any further
connection request received when the backlog is full may be refused or ignored by
the protocol stack. More information about how the backlog mechanism works can
be found in [147].
The next function to be discussed is netconn_accept, which is the counterpart
of netconn_connect. It takes as argument a TCP communication endpoint that
has been previously marked as listening for incoming connection requests and blocks
the caller until a connection request targeting that endpoint arrives and is accepted.
When this happens, it returns to the caller a newly allocated communication
endpoint—represented, as usual, by a pointer to a struct netconn—that is asso-
ciated to the originator of the request and can be used for data transfer. At the same
time, the original endpoint is still available to listen for and accept further connec-
tion requests.
Data transfer
The main LW IP functions for data transfer are listed at the bottom of Table 7.6.
Concerning data transmission, they can be divided into two groups, depending on
the kind of communication endpoint at hand.
On the other hand, the function netconn_recv is able to receive data from both
UDP and TCP endpoints. It takes an endpoint as argument and blocks the caller
until data arrive on that endpoint. Then, it returns a (non NULL) pointer to a struct
netbuf that holds the incoming data. Upon error, or if the TCP connection has been
closed by the remote peer, it returns a NULL pointer instead.
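Putting the functions of Table 7.6 together, the following sketch outlines a task that accepts TCP connections on an arbitrary port and echoes back whatever it receives. It follows the API variant described in this chapter, in which netconn_accept and netconn_recv return pointers (NULL on error); in more recent LW IP versions their signatures are slightly different, so the fragment is an outline rather than portable code, and error handling is reduced to a minimum for brevity.

```c
#include "lwip/api.h"

/* Minimal TCP echo server based on the netconn API; the port number
 * 1234 is arbitrary. Intended to run as an ordinary application task. */
static void echo_task(void *arg)
{
    struct netconn *listener, *conn;
    struct netbuf  *buf;
    void  *data;
    u16_t  len;

    (void)arg;

    listener = netconn_new(NETCONN_TCP);      /* create the endpoint        */
    netconn_bind(listener, NULL, 1234);       /* any local IP address       */
    netconn_listen(listener);                 /* accept connection requests */

    for (;;) {
        conn = netconn_accept(listener);      /* wait for a connection      */
        if (conn == NULL)
            continue;

        /* Receive until the peer closes the connection or an error occurs. */
        while ((buf = netconn_recv(conn)) != NULL) {
            do {                              /* scan all data fragments    */
                netbuf_data(buf, &data, &len);
                netconn_write(conn, data, len, NETCONN_COPY);
            } while (netbuf_next(buf) >= 0);
            netbuf_delete(buf);               /* release the netbuf         */
        }

        netconn_close(conn);                  /* orderly TCP shutdown       */
        netconn_delete(conn);
    }
}
```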
As can be seen from the description, on both the transmit and the receive paths
data buffers are shared between the calling task and the protocol stack in different
ways.
Before going into deeper detail about how network buffers are managed, as will
be done in Section 7.5, it is therefore extremely important to appreciate how they
should be managed, in order to avoid memory leaks (if a buffer is not deallocated
when it is no longer in use) or memory corruption (that may likely happen if a buffer
is released, and possibly reused, “too early”). In particular:
• Moreover, the additional pointer ptr of the struct netbuf points to the
current pbuf in the chain and is used to scan data belonging to a struct
netbuf sequentially. In the figure, as an example, this pointer refers to the
third, and last pbuf in the chain.
• In turn, a struct pbuf does not contain any data, either. Instead, as can
be seen in the figure, the data buffer associated with a certain struct
pbuf can be accessed by means of the payload field of the struc-
ture. For this reason, the LW IP documentation sometimes refers to the
struct pbuf as pbuf header.
Table 7.8
LW IP Network Buffer Management Functions
Function Description
Data access
netbuf_len Return the total data length of a struct netbuf
netbuf_first Reset the current data buffer pointer to the first buffer
netbuf_data Get access to the current data buffer contents
netbuf_next Move to the next data buffer linked to a struct netbuf
netbuf_copy Copy data from a struct netbuf
netbuf_copy_partial Similar to netbuf_copy, with offset
netbuf_take Copy data into a struct netbuf
Address information
netbuf_fromaddr Return IP address of the source host
netbuf_fromport Return port number of the source host
protocol headers at a later time without copying data around and without
allocating any further struct pbuf for them.
Moreover, if any struct pbuf and their corresponding data buffers were
already allocated to the struct netbuf in the past, they are released
before allocating the new one. Upon successful completion, the function
returns a pointer to the newly allocated data buffer. Else, it returns a NULL
pointer.
• Conversely, the chain of struct pbuf currently linked to a struct
netbuf can be released by calling the function netbuf_free. It is legal
to call this function on a struct netbuf that has no memory allocated
to it, the function simply has no effect in this case.
• An alternative way for an application to associate a data buffer to a struct
netbuf is to allocate the data buffer by itself and then add a reference to
it to the struct netbuf. This is done by means of the netbuf_ref
function, which takes three arguments: the struct netbuf it must work
on, a void * pointer to the data buffer and an integer holding the size,
in bytes, of the buffer. The function returns a value of type err_t that
conveys information about its outcome.
The function allocates a single struct pbuf that holds a reference to the
application-provided data buffer. As in the previous case, if any struct
pbuf and their corresponding data buffers were already allocated to the
struct netbuf in the past, they are released before allocating the new
one.
However, in this case, no extra space can be reserved at the very beginning
of the buffer for protocol headers, and hence, subsequent LW IP processing
will likely be slower.
Moreover, unlike for netbuf_alloc, the correct management of the data
buffer becomes the responsibility of the application. In particular, it is the
application’s responsibility to ensure that the data buffer remains available
as long as LW IP needs it. Since it is not easy to satisfy this requirement in
the general case, netbuf_ref is most often used to reference immutable,
read-only data buffers, which are always available by definition.
• The function netconn_chain can be used to concatenate the contents of
two struct netbuf, passed as arguments. All data buffers are associated
with the first struct netbuf while the second one is released and can
no longer be used afterward.
• When a struct netbuf is no longer needed, it is important to deallocate
it by means of the netbuf_delete function. As a side effect, this function
also releases any data buffer associated with the struct netbuf being
deallocated, exactly like netbuf_free does.
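As a minimal illustration of the functions just described, the following sketch sends
an immutable message by means of netbuf_ref and then releases the buffer descriptor.
It is only a sketch: it assumes an already-connected UDP netconn conn, the function
name and payload are made up for the example, and error handling is reduced to the
essential.

#include "lwip/api.h"

/* Immutable, read-only payload: a good candidate for netbuf_ref. */
static const char msg[] = "hello";

err_t send_static_msg(struct netconn *conn)
{
    err_t err;
    struct netbuf *buf = netbuf_new();      /* allocate the netbuf descriptor only */

    if (buf == NULL)
        return ERR_MEM;

    /* Reference the application-provided buffer instead of copying it. */
    err = netbuf_ref(buf, (void *)msg, sizeof(msg));
    if (err == ERR_OK)
        err = netconn_send(conn, buf);       /* send through a connected UDP netconn */

    netbuf_delete(buf);                      /* also drops the pbuf that references msg */
    return err;
}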
As shown in Figure 7.4, in the general case, the data associated with a single
struct netbuf can be scattered among multiple, non-contiguous data buffers,
often called fragments. Therefore, LW IP provides a set of data access functions to
“walk through” those data buffers and access them sequentially.
In particular, the netbuf_next function moves the current data buffer pointer to the
next element of the chain, if possible. The return value indicates the outcome
of the function and where the pointer is within the chain, that is:
• a negative return value means that it was impossible to move the pointer
because it was already pointing to the last element of the chain;
• a positive value indicates that the current pointer was moved and now
points to the last element of the chain;
• zero means that the pointer was moved but it is not yet pointing to the
last element of the chain, that is, there are more elements in the chain
still to be reached.
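As an example of how these data access functions fit together, the following sketch
(an illustration only, in which process_fragment is a hypothetical application routine)
walks through all fragments of a netbuf, for instance one returned by netconn_recv,
and finally deletes it.

#include "lwip/api.h"

/* Hypothetical application-level handler for one contiguous fragment. */
void process_fragment(void *data, u16_t len);

void consume_netbuf(struct netbuf *buf)
{
    void *data;
    u16_t len;

    netbuf_first(buf);                 /* start from the first fragment */
    do {
        netbuf_data(buf, &data, &len); /* pointer and length of the current fragment */
        process_fragment(data, len);
    } while (netbuf_next(buf) >= 0);   /* -1 means we were already at the last one */

    netbuf_delete(buf);                /* release the netbuf and all its pbufs */
}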
• PBUF_RAM pbufs are typically used by LW IP along the transmit path, for
instance when netbuf_alloc is invoked. For this kind of pbuf, the memory
for both the struct pbuf and the data
buffer is allocated as a single chunk of contiguous memory from the LW IP
heap memory. Therefore, the size of the data buffer is exactly tailored to the
data size declared when the allocation takes place, plus some space reserved
for LW IP to store protocol headers at a later time, as described previously.
The overall size of the LW IP heap is controlled by a configuration option
discussed in Section 7.3.
Minor padding may be necessary between the struct pbuf and the data
buffer, and at the end of the data buffer, in order to satisfy the alignment
constraints of the target architecture. It is not shown in Figure 7.5 for
simplicity.
• On the other hand, PBUF_POOL pbuf chains are used by LW IP along the
receive path and eventually passed to applications, for instance, as the re-
turn value of netconn_recv.
Unlike in the previous case, pbufs of this kind are allocated from the LW IP
memory pool PBUF_POOL and have a fixed-size data buffer associated with
them. Section 7.3 provides more details on how to configure the number of
pbuf that the pool must contain and their data buffer size.
There are two direct consequences of the different memory allocation strategy
used in PBUF_POOL pbufs with respect to PBUF_RAM pbufs.
• len holds the length of the data buffer linked to the struct pbuf, and
• tot_len holds the total length of all data buffers linked to the struct
pbuf and to the ones that follow it in the chain.
Besides the two kinds of pbuf just mentioned, LW IP supports two others,
PBUF_ROM and PBUF_REF, not shown in the figure.
Both of them differ from the ones discussed previously because they hold a ref-
erence to a data buffer not managed by LW IP itself. The main difference between
PBUF_ROM and PBUF_REF is that, in the first case, LW IP considers the contents of
the data buffer to be immutable and always available. In the second case, instead,
LW IP will copy buffer contents elsewhere if it needs to use them at a later time.
It is also worth mentioning that, as a result of protocol stack processing, all kinds
of pbuf can eventually be mixed and linked to a struct netbuf. Therefore, as-
suming that the chain of struct pbuf has a specific structure (for instance, that it
contains one single pbuf) or only holds a specific kind of pbuf is a wrong design
choice and makes application programs prone to errors.
Table 7.9
Main Functions of the POSIX sockets API
Function Description
Connection establishment
connect Initiate a connection (TCP) or register address/port (UDP)
listen Mark a socket as available to accept connections (only for TCP)
accept Accept connection on a socket, create a new socket (only for TCP)
getpeername Retrieve address of connected peer
Data transfer
send Send data through a socket
sendto Like send, but specifying the target address
sendmsg Not implemented
recv Receive data from a socket
recvfrom Like recv, also retrieves the source address
recvmsg Not implemented
read Receive data from a socket **
write Send data through a socket **
Synchronous multiplexing
select Synchronous input–output multiplexing *
pselect Not implemented
poll Not implemented
FD_ZERO Initialize a socket descriptor set to the empty set
FD_CLR Remove a socket descriptor from a socket descriptor set
FD_SET Add a socket descriptor to a socket descriptor set
FD_ISSET Check whether or not a socket descriptor belongs to a set
2. With respect to the netconn interface, the sockets interface is much simpler
for what concerns memory management. As was shown in Section 7.4, data trans-
fer through the netconn interface takes place by means of network buffers that
are represented by a netbuf data structure. The internal structure of these buffers
is rather complex and they are in a way “shared” between the application and the
protocol stack. They must therefore be carefully managed in order to avoid cor-
rupting them or introducing memory leaks. On the contrary, all data transferred
through the sockets interface are stored in contiguous user-level buffers, which
are conceptually very simple.
As often happens, there is a price to be paid for these advantages. In this case, it
is mainly related to execution efficiency, which is lower for POSIX sockets. This
is not only due to the additional level of abstraction introduced by this interface, but
also to the fact that, in LW IP, it is actually implemented as an additional layer of
software on top of the netconn interface.
In addition, as will become clearer in the following, POSIX sockets require
more memory-to-memory copy operations than netconn for commonly
used operations, like data transfer.
Speaking more in general, the main advantage of sockets is that they support in
a uniform way any kind of communication network, protocol, naming conventions,
hardware, and so on, even if it is not based on the IP protocol. Semantics of commu-
nication and naming are captured by communication domains and socket types, both
specified upon socket creation.
For example, communication domains are used to distinguish between IP-based
network environments with respect to other kinds of network, while the socket type
determines whether communication will be stream-based or datagram-based and also
implicitly selects which network protocol a socket will use. Additional socket char-
acteristics can be set up after creation through abstract socket options. For exam-
ple, a socket option provides a uniform, implementation-independent way to set the
amount of receive buffer space associated with a socket, without requiring any prior
knowledge about how buffers are managed by the underlying communication layers.
On the other hand, introducing a tailored API for network communication is also
not a new concept in the embedded system domain. For instance, the OSEK VDX
operating system specification [92, 136], focused on automotive applications, speci-
fies a communication environment (OSEK/VDX COM) less general than sockets
and oriented to real-time message-passing networks, such as the controller area net-
work (CAN) [91].
The API that this environment provides is more flexible and efficient because it
allows applications to easily set message filters and perform out-of-order receives,
thus enhancing their timing behavior. Neither of these functions is straightforward
to implement with sockets, because they do not fit well within the general socket
paradigm.
Table 7.9 summarizes the main functions made available by the POSIX sockets
API, divided into functional groups that will be reviewed in more detail in the follow-
ing. Before proceeding further it is worth noting that, as was outlined in Section 7.3,
and obey the communication model the socket type indicates. Then, the protocol
identifier is used to narrow the choice down to a specific protocol within this set.
The special identifier 0 (zero) specifies that a default protocol, selected by the
underlying socket implementation, shall be used. It should also be noted that,
in most cases, this is not a source of ambiguity, because most protocol families
support exactly one protocol for each socket type.
When it completes successfully, socket returns to the caller a small non-negative
integer, known as socket descriptor, which represents the socket just created and shall
be passed to all other socket-related functions, in order to refer to the socket itself.
Instead, the negative value -1 indicates that the function failed, and no socket has
been created. In this case, like for most other POSIX functions, the errno variable
conveys to the caller additional information about the reason for the failure.
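As a brief sketch, and assuming the usual POSIX symbolic constants are made
available through the LW IP sockets header, a connectionless UDP socket could be
created and checked as follows.

#include "lwip/sockets.h"

int make_udp_socket(void)
{
    /* AF_INET selects the IP communication domain, SOCK_DGRAM the datagram
       socket type; protocol 0 lets the stack pick the default, that is, UDP. */
    int sd = socket(AF_INET, SOCK_DGRAM, 0);

    if (sd < 0) {
        /* Creation failed: the descriptor is -1 and errno holds the reason. */
        return -1;
    }
    return sd;
}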
At the time of this writing, LW IP supports, in principle, three different socket
types. Which ones are actually available for use depends on how LW IP has been con-
figured, as explained in Section 7.3.
• The IPPROTO_UDP protocol identifier calls for the well-known UDP [140]
protocol.
• The IPPROTO_UDPLITE identifier denotes a lighter, non-standard variant
of UDP, which is LW IP-specific and does not interoperate with other proto-
col stacks.
The two names close and closesocket correspond to the same function,
which closes and destroys a socket, given its descriptor. It must be used to re-
claim system resources—mainly memory buffers—assigned to a socket when it is
no longer in use.
By means of the shutdown function, the standard also specifies a way to shut
down a socket only partially, by disabling further send and/or receive operations.
However, the version of LW IP considered in this book only provides a partial imple-
mentation of shutdown, which always closes the socket completely.
Socket options can be retrieved and set by means of a pair of generic functions,
getsockopt and setsockopt. The way of specifying options to these func-
tions is modeled after the typical layered structure of the underlying communica-
tion protocols and software. In particular, each option is uniquely specified by a
(level, name) pair, in which:
• level indicates the protocol level at which the option is defined. In ad-
dition, a separate level identifier (SOL_SOCKET) is reserved for the upper
layer, that is, the socket level itself, which does not have a direct correspon-
dence with any protocol.
• name determines the option to be set or retrieved within the level and, im-
plicitly, the additional arguments of the functions.
It should be noted that LW IP does not implement all the options specified by the
standard, and hence, the availability of a certain option must be assessed before use
on a case-by-case basis. Two important socket options supported by LW IP at the
SOL_SOCKET level are:
In order to reduce memory footprint and execution overhead, both options are sup-
ported by LW IP only if it has been explicitly configured for this purpose, as described
in Section 7.3.
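For instance, assuming that the receive timeout option SO_RCVTIMEO has been enabled
in the LW IP configuration, it could be set with a sketch like the following. This is only
an illustration: depending on the LW IP version, the option value may be expected as a
struct timeval, as in standard POSIX, or as a plain integer number of milliseconds.

#include "lwip/sockets.h"

/* Sketch: set a 2-second receive timeout on socket sd. */
int set_rx_timeout(int sd)
{
    struct timeval tv;

    tv.tv_sec  = 2;
    tv.tv_usec = 0;

    /* (level, name) = (SOL_SOCKET, SO_RCVTIMEO); the last two arguments
       give the position and size of the option value. */
    return setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}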
After a socket has been created, it is possible to set and retrieve some of its char-
acteristics, or attributes, by means of the ioctl function. Even though the POSIX
standard defines a rather large set of commands that this function should accept and
obey, LW IP implements only two of them. Namely:
• The FIONREAD command lets the caller know how many bytes of data
are waiting to be received from the socket at the moment, without actually
retrieving or destroying those data.
• The FIONBIO command allows the caller to set or reset the O_NONBLOCK
socket flag. When this flag is set, all operations subsequently invoked on
the socket are guaranteed to be nonblocking. In other words, they will re-
turn an error indication to the caller, instead of waiting, when the requested
operation cannot be completed immediately.
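A minimal usage sketch follows; whether the function is reachable as ioctl,
ioctlsocket, or lwip_ioctl depends on how the LW IP sockets layer has been
configured, so the name used here is an assumption.

#include "lwip/sockets.h"

/* Sketch: put socket sd into nonblocking mode by means of FIONBIO. */
int set_nonblocking(int sd)
{
    int on = 1;               /* nonzero sets O_NONBLOCK, zero clears it */

    return ioctl(sd, FIONBIO, &on);
}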
Regardless of the way the local socket address has been assigned, it can be re-
trieved by means of the getsockname function. The function may return a failure
indication when the socket has no local address, for instance, after it has been shut
down.
Connection establishment
The connect function has two arguments:
because they simply result in a local operation, that is, the system recording the
remote address for later use.
If connect has not been used, the only way to send data through a connectionless
socket is by means of a function that allows the caller to specify the destination
address on a message-by-message basis, such as sendto.
Data transfer
The functions send, sendto, and sendmsg send data through a socket, with dif-
ferent trade-offs between expressive power and interface complexity. Here, we will
only discuss the first two, because the version of LW IP we are considering, that is,
version 1.3, does not implement sendmsg.
• The send function is the simplest one and assumes that the destination
address is already known to the system, as is the case when the function
is invoked on a connection-oriented socket that has been successfully con-
nected to a remote peer in the past. On the other hand, it cannot be used,
for example, on connectionless sockets on which no former connect has
been performed.
Instead, its four arguments specify the socket to be used, the position and
size of a memory buffer containing the data to be sent, and a set of flags
that may alter the semantics of the function.
• With respect to the previous one, the sendto function is more powerful be-
cause, by means of two additional arguments, it allows the caller to explic-
itly specify a destination address, making it also useful for connectionless
sockets.
The only flag currently supported by LW IP is MSG_MORE. This flag has effect
only on TCP sockets and indicates that the caller intends to send more data through
the same socket in a short time. Hence, it is unnecessary to push the present data to
the receiving socket immediately. In turn, the protocol stack does not set the TCP
PSH flag in outgoing TCP segments until some data are sent without the MSG_MORE
flag set.
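As a small illustration of the difference between the two functions, the sketch below
sends a datagram through a connectionless socket with sendto; the destination address
and port are arbitrary example values, not taken from the text.

#include "lwip/sockets.h"
#include "lwip/inet.h"
#include <string.h>

int send_datagram(int sd, const void *data, size_t len)
{
    struct sockaddr_in dest;

    memset(&dest, 0, sizeof(dest));
    dest.sin_family      = AF_INET;
    dest.sin_port        = htons(7);                  /* example destination port */
    dest.sin_addr.s_addr = inet_addr("192.168.0.2");  /* example destination host */

    /* The last two arguments are what sendto adds with respect to send. */
    return sendto(sd, data, len, 0, (struct sockaddr *)&dest, sizeof(dest));
}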
Symmetrically, the recv, recvfrom, and recvmsg functions allow a process to
wait for and retrieve incoming data from a socket. Also in this case, LW IP does not
implement the most complex one, that is, recvmsg.
Like their counterparts, these functions have different levels of expressive power
and complexity:
• The recv function possibly waits for data to be available from a socket.
When data are available, the function stores them into a data buffer in memory,
and returns to the caller the length of the data just received. It also accepts
as argument a set of flags that may alter the semantics of the function.
• In addition, the recvfrom function allows the caller to retrieve the address
of the sending socket, making it useful for connectionless sockets, in which
the communication endpoints may not be permanently paired.
The LW IP implementation supports two flags to modify the behavior of recv and
recvfrom.
MSG_DONTWAIT: When this flag is set, recv and recvfrom immediately return
an error indication to the caller, instead of waiting, if no data are available to be
received. It has the same effect as setting the O_NONBLOCK flag on the socket, but
on a call-by-call basis.
MSG_PEEK: When this flag is set, incoming data are returned to the application,
but without removing them from socket buffers. In other words, they make the
receive operation nondestructive, so that the same data can be retrieved again by
a subsequent receive.
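Putting these flags to work, the following sketch polls a socket for a datagram without
blocking and also retrieves the sender address; it is an illustration only and does not
distinguish between “no data available” and other error conditions.

#include "lwip/sockets.h"

int poll_datagram(int sd, void *buf, size_t size)
{
    struct sockaddr_in from;
    socklen_t fromlen = sizeof(from);

    /* MSG_DONTWAIT makes this single call nonblocking, regardless of
       the O_NONBLOCK setting of the socket itself. */
    return recvfrom(sd, buf, size, MSG_DONTWAIT,
                    (struct sockaddr *)&from, &fromlen);
}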
Besides the specialized functions described previously, applications can also use
the read and write functions to receive and send data through a connection-
oriented socket, respectively. These functions are simpler to use, but not as powerful
as the others, because no flags can be specified.
Synchronous multiplexing
The socket functions described so far can behave in two different ways for what
concerns blocking, depending on how the O_NONBLOCK flag (as well as the
MSG_DONTWAIT flag for data transfer operations) has been set. Namely:
• By default, that is, when O_NONBLOCK is not set, they block the caller until
the requested operation is complete.
• When O_NONBLOCK (or MSG_DONTWAIT) is set, they return immediately
with an error indication if the requested operation cannot be completed at once.
The default behavior in which socket functions block the caller until completion
is quite useful in many cases, because it allows the software to be written in a simple
and intuitive way. However, it may become a disadvantage in other, more complex
situations.
Let us consider, for instance, a network server that is simultaneously connected to
a number of clients and does not know in advance from which socket the next request
message will arrive. In this case the server must not perform a blocking recv on a
specific socket, because it would run into the risk of ignoring incoming messages
from all the other sockets for an unpredictable amount of time.
On the other hand, the polling-based approach may or may not be acceptable, de-
pending on the kind of application, because its overhead and latency grow linearly
with the number of sockets to be handled. For this reason, the POSIX standard spec-
ifies a third way of managing a whole set of sockets at once, called synchronous
multiplexing.
The basic principle of this approach is that a task—instead of blocking on an
individual socket until a certain operation on it is complete—blocks until certain
operations become possible on any socket in a set.
The standard specifies three main specialized functions for synchronous multi-
plexing. In order of complexity, they are select, pselect, and poll. Among
them, we will discuss only the simplest one, that is, select because it is the only
one implemented by LW IP at the time of this writing.
The function select takes as arguments three, possibly overlapping, sets of
socket descriptors and a timeout value. It examines the descriptors belonging to each
set in order to check whether at least one of them is ready for reading, ready for
writing, or has an exceptional condition pending, respectively.
More specifically, it blocks the caller until the timeout expires or at least one of
the conditions being watched becomes true. In the second case, the function updates
its arguments to inform the caller about which socket descriptors became ready for
the corresponding kind of operation.
It should also be noted that LW IP currently does not support the third set of
socket descriptors mentioned previously. Any exceptional condition involving a cer-
tain socket is notified by informing the caller that the socket is ready for reading.
The subsequent read operation will then fail, and convey more information about the
nature of the error.
A set of socket descriptors is represented by the abstract data type fd_set and
can be manipulated by means of the following function-like macros.
• FD_ZERO initializes a socket descriptor set to the empty set. All socket
descriptor sets must be initialized in this way before use.
• FD_CLR and FD_SET remove and add a socket descriptor to a socket de-
scriptor set.
• Finally, FD_ISSET checks whether or not a certain socket descriptor be-
longs to a socket descriptor set.
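The following sketch shows the typical usage pattern of these macros together with
select, waiting up to one second for either of two sockets to become ready for
reading; the socket descriptors are assumed to have been created elsewhere.

#include "lwip/sockets.h"

void wait_for_input(int sd1, int sd2)
{
    fd_set readset;
    struct timeval tv;
    int maxfd = (sd1 > sd2) ? sd1 : sd2;

    FD_ZERO(&readset);            /* always initialize the set first */
    FD_SET(sd1, &readset);
    FD_SET(sd2, &readset);

    tv.tv_sec  = 1;
    tv.tv_usec = 0;

    /* Only the read set is used here; as noted above, LW IP folds
       exceptional conditions into readiness for reading anyway. */
    if (select(maxfd + 1, &readset, NULL, NULL, &tv) > 0) {
        if (FD_ISSET(sd1, &readset)) { /* sd1 is ready for reading */ }
        if (FD_ISSET(sd2, &readset)) { /* sd2 is ready for reading */ }
    }
}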
7.7 SUMMARY
This chapter provided an overview of the LW IP open-source TCP/IP protocol stack.
The two main aspects described here were how to embed the protocol stack within
a more complex software project, and how to make use of it from an application
program.
After a short introduction to the general characteristics and internal structure of
LW IP, given in Section 7.1, Section 7.2 provides information about the interface
between the protocol stack and the underlying operating system, while Section 7.3
contains a short overview of the most common LW IP configuration options, which
are useful to tailor LW IP to the requirements of a specific application.
The second part of the chapter, comprising Sections 7.4 through 7.6, is focused on
how to use the protocol stack. Hence, it contains a thorough description of the two
APIs that LW IP provides to this purpose. Moreover, Section 7.5 presents in detail the
network buffer management scheme foreseen by LW IP.
Although a detailed knowledge of this topic is unnecessary when using the higher-
level sockets API, it becomes very important for the lower-level netconn API.
This is because, in that case, memory management responsibilities are in part shared
between the protocol stack and the application program itself.
Careful application-level network buffer management is therefore of utmost
importance in order to avoid corrupting buffers or introducing memory leaks.
8 Device Driver
Development
CONTENTS
8.1 General Structure of a Device Driver ............................................................ 199
8.2 Interrupt Handling ......................................................................................... 201
8.2.1 Interrupt Handling at the Device Level ............................................ 203
8.2.2 Interrupt Handling at the Interrupt Controller Level ........................ 204
8.2.3 Interrupt Handling at the CPU Level ................................................ 206
8.2.4 Some Remarks .................................................................................. 211
8.3 Device Driver Interfaces................................................................................ 211
8.4 Synchronization Issues .................................................................................. 214
8.5 Example: Ethernet Device Driver.................................................................. 218
8.5.1 Ethernet Block Architecture ............................................................. 218
8.5.2 Initialization...................................................................................... 221
8.5.3 Interrupt Handler .............................................................................. 230
8.5.4 Receive Process ................................................................................ 232
8.5.5 Transmit Process............................................................................... 237
8.6 Summary........................................................................................................ 238
The device driver is essential to enable the functionality of devices and permit com-
munication with the external world. As a consequence, it is of prominent importance
in embedded system design.
Figure 8.1 demonstrates the general structure of a device driver. Whenever the
application or a protocol stack (either standalone or coming as a component of an
operating system), such as the TCP/IP protocol stack, USB protocol stacks, and so on,
would like to access a device, or vice versa, when a device possesses some information
worth the attention of the upper layers, the interaction is managed by the device driver.
The two components that the device driver interfaces with are shown in dark
gray. Downward, the device driver communicates with a device through the device
register interface and optionally some shared memory, whereas upward another in-
terface should be provided between the device driver and upper layers.
The device register interface allows the device driver to access registers, which
can be used for two purposes:
to packet drop for further incoming packets. Events external to the processor can be
handled by interrupts, for instance incoming packets, whereas events originating within
the processor itself are treated as exceptions, such as data abort, instruction fetch abort,
memory error, and other exceptional conditions. In this chapter, the discussion will be
focused on interrupts (actually, most processors handle exceptions in a similar way),
as it is essential for an embedded system to interact with the external world. Interrupt
handling is also quite an important part of device driver development. Chapter 15
studies exception handling for what concerns a specific type of exception, namely,
memory error, and presents different memory protection techniques to address the
issue. More general information about exceptions can be found in Reference [161].
There are different ways to classify interrupts. One popular classification is to
divide interrupts into vectored interrupts and non-vectored interrupts. The main dif-
ference between them is about how interrupts are handled. More specifically, for
non-vectored interrupts, the CPU always branches to the same interrupt service rou-
tine (ISR) and from there it polls each device to see which may have caused the
interrupt. And if there is more than one, they will be handled one by one within the
same interrupt service routine. Instead, with vectored interrupts, the CPU is notified
when there is an interrupt and where the interrupt comes from. What’s more, different
interrupt service routines are provided to handle different interrupts. Non-vectored
interrupts are not as reactive as vectored interrupts, which makes them less suitable
for some embedded system applications. As a consequence, unless mentioned
otherwise, we will focus on vectored interrupts in the following.
The basic concept about interrupt handling is that when interrupts arrive, the nor-
mal program flow of the CPU needs to be altered and context switch is required
before running the interrupt handlers. Actually, the overall process of interrupt han-
dling, especially when taking into account operations done by hardware, is much
more complex. Three main hardware components involved in interrupt handling are
devices, interrupt controller(s) and the CPU, as shown in Figure 8.2. In the following,
they will be addressed one by one. For simplicity, we will focus on the case in which
just one interrupt controller is responsible for handling different sources of interrupts
from different devices.
Figure 8.3 Chain of events and actions along the interrupt handling path.
Figure 8.3 illustrates the whole process of how an interrupt is handled, starting
from when it arrives at the device. Sometimes software developers are more familiar
with how it should be handled in software, namely, interrupt handlers. This chapter
will spend more time on how hardware addresses interrupts. By the way, interrupt
handler is an alternative term for ISR, and the two are used interchangeably in
the following.
A device can typically raise interrupt requests for two kinds of events: those due to
external events, such as the arrival of a new frame on the network interface, and
those due to internal events, like Ethernet transmit underrun.1 Each de-
vice can be configured to handle a certain event or not by setting appropriate device-
specific registers in software. In this way, it is possible to selectively enable, disable,
mask, or merge individual interrupt requests. This is called interrupt masking. As a
result, when an event occurs, depending on the setting of the interrupt masking, it
may or may not trigger an interrupt.
As we will see in the following, interrupt masking can be applied also in other
places along the interrupt handling path, including the interrupt controller(s) and
the CPU, in order to be flexible at selectively serving interrupt requests in different
scopes. More specifically, interrupt masking set in the interrupt controller enables
or disables interrupt requests on a device-by-device basis, while its setting at the
CPU level has a more global effect. Generally, interrupt masking can be done by
configuring proper registers in software and then it will take effect when hardware
runs.
If an interrupt is generated, the corresponding interrupt signal will be propagated
to the next module, namely the interrupt controller(s), through either the system bus
or dedicated interrupt lines. It is quite common that each device just has a single inter-
rupt request output to the interrupt controller through one interrupt line or the system
bus. This indicates that interrupts corresponding to the same device are aggregated
on a single interrupt line. This implementation choice helps simplify the hardware
design. In order to determine the origin of the interrupt, the interrupt service routine,
when started, needs to query the registers of the device that activated its interrupt
line. Besides, it is also possible that more than one device shares the same
interrupt line.
1 This happens when the Ethernet transmitter does not produce transmit data, for example, the next
The above code declares an interrupt handler to serve interrupt requests coming
from the Ethernet interface. As explained in Section 9.4, when a symbol is declared
as a weak symbol and if a (non-weak) declaration and definition of the same symbol
is provided elsewhere, like in the device driver, the latter takes precedence and will be
used. The default handlers are generally quite simple and are normally implemented
as a forever loop. On the other hand, when a real implementation of an interrupt
handler is not available, the default handler will be executed and the system will not
fall into any undefined state.
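For reference, a default handler of this kind could be declared with the GCC weak
attribute roughly as follows; the symbol name matches the one used later in this
chapter for the LPC1768, but the exact name and body depend on the startup file
in use, so this is only a sketch.

/* Default Ethernet interrupt handler, declared as a weak symbol.  If the
   device driver provides its own (non-weak) ETH_IRQHandler, the linker
   picks that one instead of this placeholder. */
void ETH_IRQHandler(void) __attribute__((weak));

void ETH_IRQHandler(void)
{
    for (;;)
        ;   /* trap unexpected interrupts instead of entering an undefined state */
}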
The following listing shows an example of how to set the priority level and ISR
when programmable priority is adopted, on the NXP LPC24xx microcontroller [126]
which is built around an ARM7 CPU core [8]:
#define VICVectAddr(x) (*(ADDR_REG32 (0xFFFFF100 + (x)*4)))
#define VICVectPriority(x) (*(ADDR_REG32 (0xFFFFF200 + (x)*4)))
#define VICIntEnable (*(ADDR_REG32 (0xFFFFF010)))
On LPC24xx, the interrupt controller supports 32 vectored IRQ slots and pro-
vides a set of registers for each individual slot. The VICVectAddr(x) register
holds the address of the interrupt service routine corresponding to slot x, while
VICVectPriority(x) register allows the software to configure the priority level
for interrupt source identified by slot entry x. In addition, by writing to the corre-
sponding bit in the 32-bit VICIntEnable register, it is possible to enable a certain
interrupt.
As shown in the example, the macro ENET_VIC_VECTOR specifies the entry for
the Ethernet interrupt in the 32 slots. Symbols ENET_IRQHandler_Wrapper and
ENET_VIC_PRIORITY indicate the address of ISR and the priority level set for the
Ethernet interrupt, respectively.
Interrupt handling at the CPU level involves three main activities:
• Context switch
• Mode change
• Interrupt vectorization
The context of a task is essentially an image of the state of the task when it is
suspended, so that the OS can resume its
execution from the same place afterward. This process is referred to as context switch
in Chapter 4, which includes first saving the context before switching to the ISR and
then restoring the context, correspondingly.
Moreover, the context or, in other words, the processor state, to be saved and
restored depends on several different things, including the CPU in use, the OS, and
any optimization enabled for a program. For instance, it can vary from one CPU to
another even if they are within the same family, because some additional registers
may be available on one CPU but not the other. At a minimum, the context includes
the general purpose registers, in particular the program counter register. When an
OS is present, context switch also involves updating OS lists and other internal data
structure. Last but not least, for systems with memory management unit (MMU), the
page table will be updated as well.
With respect to context switch between tasks, which is discussed in Chapter 4,
interrupt handling also requires changing the operating mode of the CPU when
switching to an ISR or returning from it. The number of modes in which a micro-
controller can work may vary from architecture to architecture. For example, the
ARM7 CPU supports seven different operating modes as shown in Figure 8.4. The
application code is running in user mode. And the system mode is a privileged mode
for privileged tasks, which are often related to the operating system, or part of the
operating system.
It is worth mentioning that the ARM7 CPU accepts two kinds of hardware inter-
rupts, namely, the general purpose interrupt (IRQ) and fast interrupt (FIQ). And they
are corresponding to two different operating modes, namely the IRQ mode and FIQ
mode. Strictly speaking, among existing interrupt sources, only one of them can be
selected as FIQ source so that the processor can enter the FIQ mode and start pro-
cessing the interrupt as fast as possible. In this book, we will mainly focus on IRQ,
rather than FIQ.
When there are software interrupts or when the system is reset, the CPU should
work in supervisor mode, while the abort mode is for both the data access and in-
struction fetch memory abort. The CPU enters into the undefined mode when unde-
fined instructions are executed. The availability of different modes may permit the
CPU to service interrupts or exceptions more efficiently.
Each operating mode, except the system mode, has some registers private to its
own use, in addition to the general purpose registers common among different modes.
Generally, the application code runs in the user mode, where it has access to the
register bank R0-R15 as well as the current program status register (CPSR), as shown
in Figure 8.4.
As shown in the same figure, R0-R12 and R15 remain accessible to the IRQ mode,
while R13 (stack pointer) and R14 (link register) are replaced by a pair of registers
unique to the IRQ mode. This indicates that, in IRQ mode, the CPU has its own
link register and stack. Besides, each mode has its own saved program status regis-
ter (SPSR), except the user mode and the system mode.
In Figure 8.4, the registers available for the system and user modes are used as
a reference and are shown in light gray. Registers private to a certain mode, with
respect to those available for the system and user modes, are highlighted in dark
gray, whereas those registers in common are omitted for clarity.
Figure 8.5 depicts the CPSR register provided on ARM7. As we can see, it con-
tains several flag bits such as negative, zero, carry, and overflow which report the
result status of a data processing operation, as well as multiple control bits which
can be set to change the way the CPU behaves, including the operating mode, en-
abling/disabling IRQ and so on.
When saving the context and switching from one mode to another, the general
purpose registers common to both modes, especially those that will be used in the
second mode, are saved onto the stack of the first mode, pointed to by its R13 register.
In addition, the R15 (program counter) is saved to the link register of the second
mode, and the value of the CPSR register is saved in the SPSR register of the latter
mode. Then the mode can be changed by writing to the CPSR register. The opposite
will be done when the CPU needs to go back to the previous mode. The content of
SPSR and the link register will be copied back to the CPSR and the program counter
of the previous mode, respectively. And the register values stored in the stack will be
written back to the corresponding registers.
By the way, as can be seen from the figure, the FIQ mode also has its own R8-R12
registers. This means that, when entering the FIQ mode, there is no need to save the
value of those registers of the previous mode, because the FIQ handler works on its
own banked copies and does not overwrite them. This leads to more efficient handling of FIQs.
The last thing to do for interrupt handling is to perform interrupt vectorization,
which is to jump to the appropriate interrupt handler, depending on the interrupt
source.
More specifically, taking again the ARM7 architecture as an example, each op-
erating mode has its own vector table at a predefined address, except the system
and user modes. Instructions which can be used to reach and access each individual
vector table are kept in an upper level vector table, namely the exception vector ta-
ble (EVT). More precisely, only the starting address of where the instructions for a
certain mode are stored is recorded in the exception vector table.
For example, as shown in Figure 8.6, the instructions, which can be used to access
the interrupt vector table (IVT), are stored at a certain address, for instance, Add_B.
As a consequence, the entry in the EVT corresponding to IRQ mode will be filled
with Add_B. It is worth mentioning that, by convention, those instructions are often
referred to as first level IRQ handler.
When the CPU changes from one mode to another, for example to the IRQ mode,
the program counter (PC) will be filled with Add_B. By following the pointer, the
CPU will execute instructions stored there, namely the first level IRQ handler. De-
pending on the interrupt number (or interrupt source), the first level IRQ handler is
able to locate the correct entry in the interrupt vector table which stores the starting
address of the associated ISR. For example, the Ethernet interrupt has an entry in the
interrupt vector table and its ISR is stored at Add_C.
Then, the first level IRQ handler will alter the PC to point to Add_C. Afterward,
the Ethernet interrupt handler will be executed.
For the sake of efficiency, it is not necessary that these three activities, namely
context switch, mode change and interrupt vectorization, are carried out exactly one
after the other. It may vary from one architecture to another. For example, for ARM7
architecture, the operations performed at the CPU level to service interrupt are sum-
marized in the following:
• Save the content of the program counter into the link register of the IRQ
mode;
• Save the content of CPSR into the SPSR of the IRQ mode;
• The program counter is pointed to the entry corresponding to the IRQ mode
in the upper level vector table;
• Write to the CPSR register to change the operating mode to the IRQ mode;
• After entering the IRQ mode, follow the PC to jump to the right interrupt
service routine;
• At the beginning of the ISR, before processing the interrupt, the content of
those registers that are to be reused in ISR will be saved on the stack of the
user mode (since we are considering the case in which a running task is interrupted);
• Then proceed with interrupt processing.
For example, the LW IP protocol stack uses the netif data structure to repre-
sent all network interfaces. This data structure contains several fields, for instance,
netif->input and netif->linkoutput, which point to functions that should
be implemented in the device driver and can be used to pass a received packet up the
protocol stack and send a packet on the interface, respectively.
Moreover, the device driver should also follow the requirements for data structures
adopted in higher layers. For instance, internally, LW IP makes use of the pbuf data
structure to hold incoming and outgoing packets, as discussed in Chapter 7. As a
consequence, data should be moved back and forth between the shared memory (or
registers) and pbuf in the device driver.
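A minimal sketch of how a driver typically interacts with these fields is shown below;
the function names are illustrative and follow common LW IP driver conventions rather
than a mandated interface.

#include "lwip/netif.h"
#include "lwip/pbuf.h"

/* Transmit side: LW IP calls the function installed in netif->linkoutput
   to send the pbuf chain p on the wire. */
err_t low_level_output(struct netif *netif, struct pbuf *p);

/* Receive side: once the driver has copied an incoming frame into a pbuf,
   it hands it to whatever function has been installed in netif->input. */
void pass_frame_to_stack(struct netif *netif, struct pbuf *p)
{
    if (netif->input(p, netif) != ERR_OK) {
        /* The stack did not accept the packet: the driver still owns the
           pbuf and must free it to avoid a memory leak. */
        pbuf_free(p);
    }
}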
For what concerns embedded systems, the primary goal of the design and imple-
mentation of a device driver and its interface to other components is performance
rather than standardization, as is the case for general-purpose systems.
More specifically, general purpose operating systems tend to provide a standard
device driver interface. In particular, all devices are categorized into just a few classes
and devices belonging to the same class share the same interface to the upper layers.
For example, Linux only differentiates three classes of devices, namely, block de-
vices, character devices, and network devices. And the main interface functions for
character devices include only a handful of high-level, abstract functions like open,
close, read, write, ioctl.
Being generic, on the one hand, has the advantage of abstracting the develop-
ment of application software away from hardware details and makes them as well
as the OS more portable. On the other hand, sometimes it is not easy at all to map
existing device features/capabilities directly onto a limited set of interface functions.
For example, graphics cards are classified as character devices. However, the read
and write functions are not so meaningful for this category of devices because they
communicate with devices character-by-character in a sequential manner. At the end,
specific ioctls are defined in order to exploit the functionality of graphics cards.
What’s more, in order to offer a standard device driver interface to the upper lay-
ers, some sacrifices are unavoidable. For instance, a mapping layer is needed to map
the generic interface functions like those listed above to specific functions imple-
mented within each individual device driver. The introduction of an extra layer not
only adds extra software complexity, but may also affect performance.
Instead, for what concerns embedded systems, there is not any device that must
be supported by all platforms; in other words, there is no standard device. As a con-
sequence, unlike general purpose operating systems, device drivers are not deeply
embedded into the RTOS, except a few commonly used ones like the UART used for
early debugging. On the contrary, device drivers are generally implemented on top
of the RTOS. Afterward, applications and middleware such as protocol stacks can be
implemented directly on top of appropriate drivers, rather than using a standardized
API like in the case of the general purpose OS.
For the same reason, the design and implementation of different protocol stacks
by different people do not assume the existence of a unified device driver interface.
It is possible that different protocol stacks may abstract various types of devices in
sibly lose some data from the device because it was impossible to store them into the
shared data structure) or retrying it.
The second option does not lead to loss of data because, at least by intuition,
the semaphore operation will eventually succeed, but it has nonetheless two im-
portant drawbacks that hinder its applicability, especially in a real-time execution
environment:
1. Retrying the critical region access multiple times represents an overhead, because
each invocation of the nonblocking P() requires a certain amount of processing
to be performed.
This overhead may severely impair the schedulability scenario of the system as a
whole because, as already mentioned in Section 8.2, in many operating systems
interrupt handlers implicitly (and unavoidably) have a priority greater than any
regular task in the system.
2. In principle, there is no upper bound on the number of retries needed to gain
access to shared data. Actually, if retries are implemented in a careless way, access
may never be granted anyway.
Regarding the second drawback let us consider, for example, a monolithic op-
erating system running on a single-core processor. Due to the operating system ar-
chitecture, as long as the interrupt handler keeps using the processor to retry the
nonblocking P(), no other tasks will ever be executed. As a consequence, the device
driver component that is currently engaged in its critical region—and is preventing
the interrupt handler from entering its own critical region—will never be able to
proceed and exit.
It must also be noted that introducing some delay between P() retries is not as
easy as it looks at first sight, because:
Concerning the second side effect, when interrupts are only disabled locally, noth-
ing prevents an interrupt handler from executing on a certain core, while a regular
task or another instance of the same interrupt handler executes within a critical re-
gion on another core. In this way, mutual exclusion is no longer guaranteed unless
interrupt disabling is complemented by a different locking technique, for instance,
the usage of spin locks, like it is done in modern multi-processor (MP) Linux ker-
nels [112]. More information about this and other issues associated with multi-core
processing can be found in [122, 71], and they will not be further discussed here.
Another way to address this issue, outlined in Figure 8.8 and often more appro-
priate for a real-time system, consists of delegating most of the activities that the
interrupt handler must perform to a dedicated helper task. As shown in the figure, the
helper task consists of an infinite loop containing a blocking synchronization point
with the interrupt handler. Synchronization is implemented by means of a semaphore
s, initialized to zero.
Within the infinite loop, the helper task blocks when it invokes P(s), until the
interrupt handler wakes it up by means of the corresponding V(s). The last primitive
is nonblocking, and hence, it can be safely invoked from the interrupt handler. At this
point, the helper task interacts with the device and then makes access to the shared
data structure in the usual way. At the end of the cycle, the helper task executes
P(s) again, to wait for the next interrupt. On the other hand, if the interrupt already
occurred before P(s), the helper task proceeds immediately.
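A minimal sketch of this pattern, written with FreeRTOS primitives and hypothetical
names (s, DEV_IRQHandler, helper_task), is shown below; device-specific interrupt
acknowledgment and the port-specific yield request from the ISR are only hinted at
in the comments.

#include "FreeRTOS.h"
#include "semphr.h"
#include "task.h"

/* Synchronization semaphore between the interrupt handler and the helper
   task, to be created empty during driver initialization. */
static xSemaphoreHandle s;

/* Interrupt handler: acknowledge the device, then wake up the helper task. */
void DEV_IRQHandler(void)
{
    portBASE_TYPE woken = pdFALSE;

    /* ...acknowledge/clear the interrupt source at the device level... */
    xSemaphoreGiveFromISR(s, &woken);      /* V(s), never blocks */
    /* ...request a context switch here if woken is pdTRUE, using the
       port-specific yield-from-ISR primitive... */
}

/* Helper task: an infinite loop blocking on P(s) until the next interrupt. */
void helper_task(void *arg)
{
    (void)arg;
    for (;;) {
        xSemaphoreTake(s, portMAX_DELAY);  /* P(s) */
        /* ...interact with the device and access the shared data structure
           using ordinary task-level mutual exclusion primitives... */
    }
}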
With respect to the previous approach, the introduction of helper task brings the
following main advantages:
• The interrupt handler becomes extremely short and often contains only a
single V() primitive, because most of the interrupt-related processing is
delegated to the helper task. As a consequence, the amount of time spent
executing in the interrupt context—with its priority assignment issues high-
lighted in Section 8.2—is greatly reduced.
• The priority of the helper task can be chosen at will, also depending on the
requirements of the other system components. All the schedulability anal-
ysis techniques described in Chapter 6 can be applied to it, by considering
it a sporadic task.
• Critical regions are implemented without disabling interrupts for their en-
tire length. Besides improving interrupt-handling latency, especially when
critical regions have a sizable execution time, this also makes the code eas-
ier to port toward a multi-core execution environment.
• Two DMA engines. They permit the transfer of frames directly to and from
memory with little support from the microprocessor, while at the same time
they off-load CPU processing/computation significantly.
This information can be exploited by the receive filter to carry out filtering.
The temporary receive buffer between Ethernet MAC and the receive DMA
engine implements a delay for the received packets so that the receive filter
can work upon them and filter out certain frames before storing them back
to memory.
• The transmit retry buffer also works as a temporary storage for an outgoing
packet. It can be exploited to handle the Ethernet retry and abort situations,
for example, due to collision.
• The Ethernet block also includes an interrupt logic block. Interrupts can be
masked, enabled, cleared, and set by the software device driver as afore-
mentioned. The interrupt block keeps track of the causes of interrupts and
sends an interrupt request signal to the microprocessor through either the
VIC (LPC2468) or NVIC (LPC1768) when events of interest occur.
• Host registers. The host registers module provides a set of registers accessi-
ble by software. The host registers are connected to the transmit and receive
path as well as the MAC. As a result, they can be used to manipulate and
retrieve information about network communication.
The receive path consists of the receive DMA engine, the temporary receive
buffer, receive filter as well as the Ethernet MAC. Similarly, the transmit path is
made up of the transmit DMA engine, transmit retry module, transmit flow control
as well as the Ethernet MAC.
Referring back to Chapter 7 and Figure 7.3, the Ethernet device driver could inter-
face with the LW IP protocol stack, which in turn is layered on top of the FreeRTOS
real-time operating system, to provide network communications.
Figure 8.10 demonstrates the general structure of the device driver. As shown in
the figure, the main components of the Ethernet device driver include the interrupt
handler, an Ethernet receive thread, as well as Ethernet transmit functions, shown in
the dark gray rectangles. For simplicity, the initialization function is not shown in the
figure. Moreover, two ring buffers are used to store incoming and outgoing data and
they are shown in the middle of the figure.
As previously mentioned, the interrupt handler needs to synchronize with both the
hardware and other components of the device driver. In the figure, a synchronization
semaphore is represented by a light gray circle, and a dashed arrow stands for a
synchronization primitive, either Take or Give. The same kind of arrow is also
used to denote the activation of an IRQ handler by hardware. Instead, a solid arrow
represents data flow among the Ethernet controller, device driver, and LW IP.
It is worth mentioning that, for the sake of demonstration, two interrupt handlers,
which are responsible for the receive and transmit path separately, are shown in the
figure. In practice, interrupt handling is implemented in a single interrupt handler.
In the following, we will go deeper into each individual component and provide
more detailed information.
8.5.2 INITIALIZATION
After reset, the Ethernet software driver needs to initialize the Ethernet block just
introduced, configure proper data structures that are referred to by the LW IP protocol
stack, create suitable variables for synchronization, and so on. It is worth noting that
interrupt handler installation is also performed during initialization. In this exam-
ple, initialization is implemented in the ethernetif_init function shown in the
following listing.
/* Network interface name. */
#define IFNAME0 ’e’
#define IFNAME1 ’n’
/**
* Initialize the network interface. Internally, it calls the function
* low_level_init() to do the actual setup of the hardware.
*/
err_t ethernetif_init(struct netif *netif)
{
LWIP_ASSERT("netif != NULL", (netif != NULL));
netif->name[0] = IFNAME0;
netif->name[1] = IFNAME1;
netif->output = etharp_output;
netif->linkoutput = low_level_output;
return ERR_OK;
}
In the implementation of the netif_add function, the input field of the net-
work interface structure will be set to the function indicated by the input argument.
Moreover, the init argument points to the user-specified initialization function for
the network interface. That’s to say, the ethernet_init function should be given
to this parameter when we call netif_add in the application.
The following listing demonstrates the implementation of the low_level_init
function.
#define ETHARP_HWADDR_LEN 6
/** if set, the interface has an active link (set by the network
* interface driver) */
#define NETIF_FLAG_LINK_UP 0x10U
/* MAC address, from the least significant to the most significant octet */
#define MYMAC_1 0x4D
#define MYMAC_2 0x02
#define MYMAC_3 0x01
#define MYMAC_4 0xF1
#define MYMAC_5 0x1A
#define MYMAC_6 0x00
/**
* In this function, the hardware should be initialized.
* Called from ethernetif_init().
*/
void low_level_init(struct netif *netif)
{
portBASE_TYPE result;
xTaskHandle input_handle;
/* Device capabilities */
netif->flags = NETIF_FLAG_BROADCAST | NETIF_FLAG_ETHARP | NETIF_FLAG_LINK_UP;
if(semEthTx == SYS_SEM_NULL) {
LWIP_DEBUGF(NETIF_DEBUG,("Creation of EMAC transmit semaphore failed\n"));
}
if(semEthRx == SYS_SEM_NULL) {
LWIP_DEBUGF(NETIF_DEBUG,("Creation of EMAC receive semaphore failed\n"));
}
if(result == pdFAIL) {
LWIP_DEBUGF(NETIF_DEBUG,("Creation of EMAC receive task failed\n"));
}
}
used to manage resources, namely the ring buffers in this example. The two
arguments indicate the maximum count value that a semaphore of this type
can reach and its initial value, respectively.
The semEthTx semaphore is used for the transmission path. However, as
shown in the listing above, the maximum count value of the semEthTx
semaphore is set to NUM_TX_FRAG-1. It is configured in this way, because
it is important for the hardware, namely the Ethernet controller, to be able
to distinguish between an empty buffer and a full buffer. When the buffer
is full, the software should stop producing new frames until hardware has
transmitted some frames. Otherwise, it can keep going.
On the LPC2468, the check of buffer status is implemented in the hardware
and carried out with the help of two indexes associated with the buffer,
namely TxProducerIndex and TxConsumerIndex. Concerning the
transmission path, the software is responsible for inserting new frames into
the buffer for transmission, whereas the hardware will remove frames from
the buffer and send them onto the link. As a consequence, they are consid-
ered as the producer and the consumer of the shared resource, respectively.
Moreover, they are responsible for updating the corresponding index after
working on the buffer.
TxProducerIndex always points to the next free element in the ring
buffer to be filled by the device driver, whereas TxConsumerIndex in-
dicates the next buffer element which contains the frame to be transmitted.
Since this is a ring buffer, the two indexes are wrapped to 0 when they reach
the maximum value.
Internally, the Ethernet controller considers the buffer as empty when
TxProducerIndex equals TxConsumerIndex, and as full when
TxProducerIndex equals TxConsumerIndex-1, as shown in Fig-
ure 8.12. As we can see, in this way, the buffer can accommodate up to
NUM_TX_FRAG-1 elements at the same time. It is also worth remarking
that all elements of the buffer will still be used by the controller any-
way, just not all at the same time. For clarity, the check for a full buffer when
TxConsumerIndex is larger than TxProducerIndex is shown here. The
check needs only a slight update, to account for index wrap-around, when it is the
other way around.
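The empty and full conditions just described can be summarized by the following
sketch; NUM_TX_FRAG and the index names mirror those used in the text, and the
modulo operation covers the wrap-around case mentioned above.

/* Number of elements in the transmit ring buffer (configuration constant). */
#define NUM_TX_FRAG 4

/* Buffer empty: producer and consumer indexes point to the same element. */
int tx_buffer_empty(unsigned prod, unsigned cons)
{
    return prod == cons;
}

/* Buffer full: the producer is immediately behind the consumer, taking
   wrap-around into account.  At most NUM_TX_FRAG-1 elements can be in use
   at the same time, which is also why semEthTx is created with that
   maximum count. */
int tx_buffer_full(unsigned prod, unsigned cons)
{
    return ((prod + 1) % NUM_TX_FRAG) == cons;
}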
It is now time to have a deeper look at the specific operations carried out during
hardware initialization, namely, within the Init_EMAC.
• One thing typically done during the hardware initialization is pin routing.
It brings the signals back and forth between the Ethernet controller and
the right I/O pins of the CPU. In this way, data and control information
can be properly exchanged between the controller and the PHY. It is worth
mentioning that each I/O pin of the CPU can be configured for different
functions, which are predefined and can be found in the related manual or
data sheet [126, 128].
• As shown in Figure 8.9, the Ethernet block transmits and receives Ether-
net frames through an external off-chip PHY, namely the Ethernet physical
layer. It should also be initialized before being used. One main thing is to set
up the link and decide the link speed as well as the duplex mode, which also
depend on the node at the other end of the connection. A convenient way to
perform this is through autonegotiation, which can be enabled by configur-
ing, for example for the DP83843 PHY device supported by LPC2468, the
basic mode control register of the PHY as follows:
write_PHY (PHY_REG_BMCR, PHY_AUTO_NEG);
MAC_Command = CR_RMII;
• As mentioned at the very beginning of this section, the Ethernet block
contains DMA engines for both transmission and reception. Besides, the
way software, namely the device driver, and hardware interact with the ring
buffers, which is also part of the DMA engines, has already been explained
above. Instead, during initialization, it is necessary to configure the DMA
engines, including setting up the ring buffers for use. For instance, the size
of the ring buffers as well as the initial value of the indexes, which are used
to access the ring buffers, should be specified here.
• Another thing that is done during the initialization phase is to enable inter-
rupts at the device level, whose general concept has been discussed in Sec-
tion 8.2.1. As mentioned there, each device can support multiple interrupt
sources and it can choose to enable them in a selective way by interrupt
masking, which in turn can be done by configuring proper registers. The
following listing shows how it is done in the Ethernet driver:
/* Enable interrupts. */
MAC_IntEnable = INT_RX_DONE | INT_TX_DONE;
As shown above, interrupt masking can be set through the interrupt enable
register. Among the interrupts that could be generated by the Ethernet mod-
ule, we are interested in the interrupts corresponding to successful reception
and transmission of a frame.
After that, we also reset interrupts by writing to the interrupt clear regis-
ter. It will clear all the bits in the interrupt status register. Individual bits in
the interrupt status register are set by the hardware when an interrupt cor-
responding to that bit occurs. During initialization, they should be reset to
avoid spurious interrupt requests.
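For instance, the reset just described might be performed as follows; the all-ones mask value is indicative only and may differ in the actual driver.

/* Clear any pending interrupt indication before enabling interrupts
   (the mask value shown here is indicative). */
MAC_IntClear = 0xFFFF;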
• The receive and transmit mode of the Ethernet module can be enabled
through the command register, as shown in the following listing. What’s
more, by setting the receive enable bit of the MAC configuration register 1,
we allow incoming frames to be received. This is necessary, because inter-
nally the MAC synchronizes the incoming stream to this control bit.
/* Command register. */
#define MAC_Command (*(ADDR_REG32 (0xFFE00100)))
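A sketch of how the receive and transmit data paths might then be enabled is shown below. The bit mask names CR_RX_EN, CR_TX_EN, and MAC1_REC_EN, as well as the MAC_MAC1 macro standing for the MAC configuration register 1, are assumptions used for illustration and may not match the names defined in the driver headers.

/* Enable the receive and transmit paths in the command register and
   allow incoming frames to be received by the MAC (names assumed). */
MAC_Command |= CR_RX_EN | CR_TX_EN;
MAC_MAC1    |= MAC1_REC_EN;  /* receive enable bit of MAC configuration register 1 */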
#ifdef GCC_ARM7_EA_LPC2468
/* The interrupt entry point is naked so we can control the context
saving. */
void ENET_IRQHandler_Wrapper( void ) __attribute__ ((naked));
#endif
#ifdef GCC_ARM7_EA_LPC2468
/**
* Interrupt handler wrapper function.
*/
void ENET_IRQHandler_Wrapper( void )
{
/* ARM7: Save the context of the interrupted task. */
portSAVE_CONTEXT();

/* ... (invocation of the handler proper, context restore, and the
   matching #endif are omitted in this excerpt) ... */

#ifdef GCC_ARM7_EA_LPC2468
VICVectAddr(ENET_VIC_VECTOR) = ( portLONG ) ENET_IRQHandler_Wrapper;
VICVectPriority(ENET_VIC_VECTOR) = ENET_VIC_PRIORITY;
VICIntEnable = (1 << ENET_VIC_VECTOR);
#endif
#ifdef GCC_CM3_UN_LPC1768
NVIC_SetPriority(ENET_IRQn, ENET_INTERRUPT_LEVEL);
NVIC_EnableIRQ(ENET_IRQn);
#endif
First of all, as we can see, the interrupt handler is not specified here for
LPC1768. This is because a default vector table is provided by the sys-
tem startup file used on LPC1768. It includes an entry where the interrupt
handler to be used for Ethernet interrupts is indicated. The ISR is named
ETH_IRQHandler and it is defined as a weak symbol. What’s more, a
default implementation of the interrupt handler is provided as well. In-
stead, this is not the case for LPC2468. As we will see in the following,
for LPC1768, an interrupt handler with the same symbol name is defined
and implemented. It will take precedence over the default one and we do
not need to install the interrupt handler again. For what concerns LPC2468,
the interrupt handler is specified explicitly.
Secondly, another difference is that context saving needs to be done explic-
itly in software on LPC2468, whereas LPC1768 provides hardware support
for it, so that the general-purpose and status registers are saved automatically
on the stack when an interrupt arrives.
In the above listing, portSAVE_CONTEXT and portRESTORE_CONTEXT
are two function-like macros, which implement context switch by saving
and restoring the entire execution context. More detailed implementation
information of them can be found in Chapter 10. Since context is saved
and restored explicitly with these two macros in the function body, the
ENET_IRQHandler_Wrapper function should be declared or defined
with the naked attribute. Otherwise, some processor registers would be
saved in the prologue and restored in the epilogue sequence of this func-
tion when the compiler generates code for it. This will result in redundant
operations. With the naked attribute, the compiler will no longer generate
any prologue or epilogue for it. For more information about the naked at-
tribute, readers can refer to Chapter 9.
The processing concerning the interrupt itself is implemented in the
ETH_IRQHandler function. It is recommended to implement this func-
tion as a separate function with respect to the wrapper so that the stack
frame can be set up correctly.
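Putting these elements together, a possible shape of the LPC2468 wrapper, based on the description just given, is sketched below. The way ETH_IRQHandler is invoked is an assumption consistent with the text, and the actual wrapper may differ in its details.

/* Sketch of the LPC2468 interrupt handler wrapper, assuming the
   handler proper is invoked as an ordinary function call. */
void ENET_IRQHandler_Wrapper( void ) __attribute__ ((naked));

void ENET_IRQHandler_Wrapper( void )
{
    /* Save the execution context of the interrupted task. */
    portSAVE_CONTEXT();

    /* Perform the device-specific interrupt processing. */
    ETH_IRQHandler();

    /* Restore the context; the macro also returns from the interrupt. */
    portRESTORE_CONTEXT();
}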
Thirdly, on LPC1768, library functions are used to set the priority and en-
able the interrupt. Different macros are used to specify the Ethernet inter-
rupt number and the priority level assigned to it. This is simply because
they are different from the hardware point of view, namely they correspond
to different interrupt lines on the two platforms, and also due to the use of
the library function.
We can observe that the Ethernet device drivers for LPC2468 and LPC1768
have a major part in common, except for, for instance, the register map, the
external PHY in use, and the interrupt controller. As a consequence, when
porting the device driver from one platform to the other, there is no need
to rewrite it from scratch. It is more convenient to just compile the
platform-specific parts of the code conditionally,
as shown in the above listing.
#ifdef GCC_ARM7_EA_LPC2468
signed portBASE_TYPE TaskWokenByRx = pdFALSE;
signed portBASE_TYPE TaskWokenByTx = pdFALSE;
#endif
#ifdef GCC_CM3_UN_LPC1768
/* Added static to avoid
internal compiler error: in expand_expr_addr_expr_1, at expr.c:6925
with CodeSourcery toolchain arm-2010.09-51-arm-none-eabi.bin
*/
static signed portBASE_TYPE TaskWokenByRx;
static signed portBASE_TYPE TaskWokenByTx;
TaskWokenByRx = pdFALSE;
TaskWokenByTx = pdFALSE;
#endif
/* RxDoneInt */
if(status & INT_RX_DONE) {
xSemaphoreGiveFromISR(semEthRx, &TaskWokenByRx);
}
/* TxDoneInt */
if(status & INT_TX_DONE) {
xSemaphoreGiveFromISR(semEthTx, &TaskWokenByTx);
}
#ifdef GCC_ARM7_EA_LPC2468
/* On LPC2468, this register must be written with any value at the
very end of an ISR, to update the VIC priority hardware.
*/
VICAddress = 0;
#endif
}
The above listing shows how the interrupt handler is implemented. As mentioned
in Section 8.2, it is recommended that the interrupt handler just does the minimum to
permit the hardware to keep working correctly. In this way, leaving the main part of
the interrupt processing to a task could help to improve the real-time performance.
The interrupt status register, that is MAC_IntStatus, keeps a record of which
interrupts corresponding to the Ethernet device are triggered. During initialization,
it is specified that two kinds of interrupt are of interest. The first one corresponds to
the event that an Ethernet frame has been received correctly in the Ethernet interface,
while the second one indicates that a frame has been transmitted to the Ethernet link
and more space is available in the transmit ring buffer.
If either interrupt occurs, the same kind of operation is performed. The interrupt
handler releases the corresponding semaphore with a call to the FreeRTOS
xSemaphoreGiveFromISR primitive. In this way, if some tasks were previously
blocked on one of the semaphores, they will be woken up.
As discussed in Chapter 5, unlike the xSemaphoreGive primitive,
xSemaphoreGiveFromISR can be invoked from an interrupt handler because it never
blocks the caller. Moreover, it returns to the caller an indication—stored by the driver
in the variable TaskWokenByRx or TaskWokenByTx—of whether or not a task
with a higher priority than the interrupted task has been unblocked. If so, the un-
blocked task should be executed after the ISR completes, instead of returning to the
interrupted task. This requires running the FreeRTOS task scheduling algorithm be-
fore exiting the interrupt handler. This is achieved by the
portYIELD_FROM_ISR function on LPC2468 and the vPortYieldFromISR func-
tion on LPC1768.
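For instance, on LPC2468 the end of the ISR might include a fragment like the following sketch; the way the two "woken" flags are combined and passed to the macro is only illustrative and may differ in the actual driver.

/* If releasing either semaphore unblocked a task with a priority higher
   than the interrupted one, request a context switch before leaving
   the ISR (sketch). */
portYIELD_FROM_ISR( TaskWokenByRx || TaskWokenByTx );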
Apart from dealing with architectural differences, the two functions have a lot in
common. A possible implementation of portYIELD_FROM_ISR is provided in
Chapter 9.
Writing a “1” to bits in the interrupt clear register, that is MAC_IntClear in the
listing, clears the corresponding status bit in the interrupt status register. This is to
mark the related interrupt as processed. Otherwise, it will be served again when we
enter the ISR next time. The way to clear an interrupt status bit may vary from one
device to another, for instance, some devices support clear on read.
According to References [126, 125], on LPC2468, the vector address register (in-
dicated by VICAddress in the above listing) contains the address of the ISR for the
interrupt that is to be serviced. What’s more, this register must be written, with any
value, at the very end of an ISR in order to update the vectored interrupt controller
(VIC) priority hardware. Writing to the register at any other time can cause incorrect
operation. Instead, this is not needed for LPC1768.
By the way, it is also worth mentioning that registers are referred to by name in
this example. They are mapped to different physical memory addresses for LPC2468
and LPC1768 in separate header files.
/**
* This function should be called when a packet is ready to be read
* from the interface. It uses the function low_level_input() to
* handle the actual reception of bytes from the network interface.
*/
static void
ethernetif_input(void *parameter)
{
struct eth_hdr *ethhdr;
struct pbuf *p;
struct netif *netif;
netif = (struct netif *)parameter;
  for(;;)
  {
    /* Wait until a packet has arrived at the interface. */
    if(xSemaphoreTake(semEthRx, portMAX_DELAY) == pdTRUE)
    {
      /* When another interrupt request is generated within the
         EMAC while another request is still pending, the
         interrupt status register is updated, but no additional
         requests are sent to the CPU. */

      /* ... (retrieval of the frame into a pbuf by means of
         low_level_input(), and assignment of ethhdr, are omitted
         in this excerpt) ... */

      switch (htons(ethhdr->type))
      {
        /* IP or ARP packet? */
        case ETHTYPE_IP:
        case ETHTYPE_ARP:
#if PPPOE_SUPPORT
        /* PPPoE packet? */
        case ETHTYPE_PPPOEDISC:
        case ETHTYPE_PPPOE:
#endif /* PPPOE_SUPPORT */
          /* ... (remainder of the listing omitted) ... */
As mentioned in Section 8.5.2, this function is the entry point of the task created
during initialization. When a frame has been received in the Ethernet interface, this
function is responsible for retrieving it from the hardware ring buffer and delivering
it to the higher layer protocol for processing.
As we can see, all operations are performed within an infinite loop. This is be-
cause, first of all, this task should be prepared to handle any number of frames, when-
ever they arrive at the interface. Secondly, as required by FreeRTOS, tasks should be
implemented so that they never return.
It blocks on the semEthRx semaphore indefinitely, unless this semaphore is sig-
naled elsewhere, for instance, by the interrupt handler. In other words, it waits until a
frame arrives at the interface. Then it moves the received frame, which resides in
the hardware buffer, to a software pbuf using the low_level_input function.
In this way, more space in the hardware ring buffer becomes available for future incoming
frames. Detailed information about the implementation of low_level_input will be shown
later. By the way, as discussed in Section 7.5, the pbuf data structure is the internal
representation of packets in lwIP.
The low_level_input function returns to the caller a pointer to the pbuf
structure, whose payload field points to the received Ethernet frame (including
the Ethernet header) stored in the pbuf pool.
In theory, it is possible to check the type of the received frame, whether it cor-
responds to an IP packet, an ARP packet, or other types of frame, and then react
accordingly. Instead here, the whole packet is simply passed to the main tcpip thread
of the lwIP protocol stack for input processing by means of the function indicated
by the input field of the netif data structure. As mentioned in Section 8.5.2, this
field is initialized in the application when a new network interface is added, with
a call to the netif_add function. In lwIP, this is handled by the tcpip_input
function. Consequently, the input field should point to this function.
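In code, this step might look like the following sketch. The error handling shown, which frees the pbuf if the input function refuses it, is a common pattern in lwIP network interface drivers rather than a verbatim excerpt of this driver.

/* Hand the packet over to the protocol stack; netif->input normally
   points to tcpip_input (sketch). */
if (netif->input(p, netif) != ERR_OK) {
    /* The stack did not take ownership of the pbuf, release it. */
    pbuf_free(p);
    p = NULL;
}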
The CheckFrameReceived function is used to check whether there is any
frame remaining in the receive ring buffer. It is performed by simply comparing the
two indexes used to access the ring buffer and checking whether the buffer is empty
or not. This is not just a redundant test, because in some extreme cases it might hap-
pen that more than one frame has been received while just one interrupt request is
handled by the processor.
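A minimal sketch of such a check is shown below. RxProducerIndex is the receive-side counterpart of the transmit indexes discussed earlier and its name is assumed here.

/* Sketch: at least one frame is waiting in the receive ring buffer
   whenever the producer and consumer indexes differ. */
int CheckFrameReceived(void)
{
    return RxProducerIndex != RxConsumerIndex;
}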
For example, this is possible when two frames are received in rapid succession.
In this case, when the processor is handling the interrupt request corresponding to
the frame received first, that is when the interrupt service routine is running, another
interrupt request is generated by the Ethernet device. Concerning the second interrupt
request, the interrupt status register is updated. However, no additional request is sent
to the CPU.
This is because, for what concerns the ARM7 architecture, interrupts are disabled
at the CPU level while it is serving an interrupt request. They will be enabled again
when the interrupt handler returns. What is more, as shown in Section 8.5.3, the
content of the interrupt status register will be cleared in the interrupt handler. As
a consequence, the second, later interrupt request will never be propagated to the
processor. Another side effect is that the semEthRx semaphore is
signaled only once, even if there is more than one frame to process. To work around
this issue, the receive task needs to check whether the aforementioned situation has
occurred and, if so, keep receiving.
This also somehow explains why the interrupt handler should be kept as short as
possible. Otherwise, due to the way hardware works, it is possible that some inter-
rupts are missed when interrupt rate is too high.
Instead, in the normal case, when two incoming frames are sufficiently far apart in time,
the receive interrupt is enabled again before the next frame arrives.
The situation is slightly different on the Cortex-M3 architecture, as it supports
nested interrupts. However, an interrupt at a lower or equal priority is not allowed to
interrupt the one that is currently being served. As we know, interrupts corresponding
to the same device share the same priority level.
As can be seen from the above listing, the low_level_input function may
return to the caller a NULL pointer, which indicates that no packet could be read.
The caller just silently ignores this error indication. This may happen due to memory
allocation errors. It is recommended to attempt reception again at a later time because
it is possible that, in the meantime, other tasks free some memory space.
The following code shows the implementation of the low_level_input
function.
/**
 * Allocate a pbuf and transfer the bytes of the incoming packet from
 * the interface to the pbuf.
 */
struct pbuf *
low_level_input(struct netif *netif)
{
  struct pbuf *p, *q;
  s16_t len;

  /* ... (call to StartReadFrame(), which stores the length of the
     next frame in the receive ring buffer into len, is omitted in
     this excerpt) ... */

  if(len < 0)
  {
    /* StartReadFrame detected an error, discard the frame */
    EndReadFrame();

    /* Update statistics */
    switch(len)
    {
      case SRF_CHKERR: /* CRC Error */
        LINK_STATS_INC(link.chkerr);
        break;

      /* ... (other error cases omitted in this excerpt) ... */
    }
  }

#if ETH_PAD_SIZE
  /* Allow room for Ethernet padding */
  len += ETH_PAD_SIZE;
#endif

  /* ... (allocation of the pbuf chain by means of pbuf_alloc() is
     omitted in this excerpt) ... */

  if (p != NULL) {
#if ETH_PAD_SIZE
    /* Drop the padding word. Find the starting point to store
     * the packet. */
    pbuf_header(p, -ETH_PAD_SIZE);
#endif

    /* ... (copy of the frame from the hardware ring buffer into the
       pbuf chain and call to EndReadFrame() are omitted here) ... */

#if ETH_PAD_SIZE
    /* Reclaim the padding word. The upper layer of the protocol
     * stack needs to read the data in the pbuf from a correct
     * location. */
    pbuf_header(p, ETH_PAD_SIZE);
#endif
    LINK_STATS_INC(link.recv);
  }
  else
  {
    /* Notify the hardware but actually nothing has been done
     * for the incoming packet. This packet will just be ignored
     * and dropped. */
    EndReadFrame();
    LINK_STATS_INC(link.memerr);
    LINK_STATS_INC(link.drop);
  }

  return p;
}
More specifically, the StartReadFrame function is used to locate the next Eth-
ernet frame to be read from the hardware ring buffer as well as its size. Actually,
the next frame to be read can be accessed through the RxConsumerIndex index.
Normally, the return value of this function simply indicates the length of the frame
to be read, represented with positive values. Regarding erroneous frames, a negative
length is used to distinguish different types of error.
If the frame corresponds to an incorrectly received frame, the EndReadFrame
function should be called. It just updates the RxConsumerIndex to point to the
next element in the receive ring buffer. In this way, the erroneous frame is simply
discarded.
Instead, if the next frame to be retrieved from the ring buffer represents a correctly
received frame, the following actions will be performed:
First of all, room will be made for Ethernet padding, depending on the config-
uration. This is because, on some architectures, word accesses have to be aligned to a
4-byte boundary. If this is the case, the Ethernet source address of the Ethernet header
structure would start at a location which is not 4-byte aligned. ETH_PAD_SIZE is used to pad
this data structure to 4-byte alignment.
After that, a chain of pbufs is allocated from the pbuf pool to accommodate the
Ethernet frame, by means of the pbuf_alloc function. Since each pbuf is of fixed
size, as discussed in Chapter 7, one or more pbufs may be needed to store the whole
frame. If there is not enough memory and the required amount of pbufs cannot be
allocated, the frame is simply dropped and some statistic information is updated,
namely, link.memerr and link.drop.
On the other hand, upon successful memory allocation, the pbuf chain allocated is
pointed to by the p pointer. It is worth noting that the padding field is prepended be-
fore the Ethernet header. In other words, the Ethernet header starts after the padding
field. When storing the received packet, the pointer should be moved forward along
the pbuf chain to find the right starting point. This is achieved by means of the
pbuf_header function implemented in the lwIP protocol stack. The second ar-
gument of this function specifies which direction to move and how much. If it is a
negative value, the pointer will move forward, otherwise backward.
Then the Ethernet frame can be copied from the hardware ring buffer to the
pbuf chain by iterating through the pbuf chain and copying enough bytes to
fill a pbuf at a time until the entire frame is read into the pbuf by means of
the CopyFromFrame_EMAC function. After that, the EndReadFrame function is
called to update the RxConsumerIndex. In this case, the hardware is notified that
a packet has been processed by the driver and more space is available for further
incoming packets.
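The copy loop just described, which is omitted from the excerpt above, could be sketched as follows; the exact signature of CopyFromFrame_EMAC is an assumption.

/* Sketch: fill one pbuf of the chain at a time with bytes taken from
   the hardware receive buffer, then release the ring buffer element. */
for (q = p; q != NULL; q = q->next) {
    CopyFromFrame_EMAC(q->payload, q->len);
}
EndReadFrame();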
At the end, before returning the pointer to the pbuf chain to the caller, it should
be adjusted to include the padding field so that the upper layer of the protocol stack
can operate on it correctly.
/* Move the data from the pbuf to the ring buffer, one pbuf at a
time. The size of the data in each pbuf is kept in the ->len
field.
*/
for(q = p; q != NULL; q = q->next) {
CopyToFrame_EMAC_Start(q->payload, q->len);
}
/* Reclaim the padding word. The upper layer of the protocol stack needs
the pointer to free the memory allocated for pbuf *p from a correct
position.
*/
#if ETH_PAD_SIZE
pbuf_header(p, ETH_PAD_SIZE);
#endif
LINK_STATS_INC(link.xmit);
return ERR_OK;
}
First of all, the padding word should be dropped before starting a transmission
because it is just used within the lwIP protocol stack. There is no need to send it
onto the link. The real Ethernet frame starts after it.
Moreover, transmission cannot be started if there is no space available in the
hardware transmit ring buffer. If this is the case, the caller will be blocked indefinitely
on the semEthTx semaphore, created and initialized in low_level_init.
Otherwise, the next free element in the buffer is indicated by TxProducerIndex.
The RequestSend function will set up that buffer for the outgoing frame. Since the
size of each transmit buffer is the same as the maximum Ethernet frame size, any
outgoing Ethernet frame will just require one single transmit buffer.
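Based on this description, the first part of low_level_output might look like the following sketch. TxRingBufferFull is a hypothetical helper corresponding to the full-buffer check discussed earlier in this chapter, and the argument passed to RequestSend is an assumption; the actual driver may structure these steps differently.

/* Sketch of the beginning of low_level_output. */
#if ETH_PAD_SIZE
pbuf_header(p, -ETH_PAD_SIZE);     /* drop the padding word */
#endif

/* Wait, if necessary, until the hardware transmit ring buffer has a
   free element. */
while (TxRingBufferFull())
    xSemaphoreTake(semEthTx, portMAX_DELAY);

/* Set up the next free buffer element for the outgoing frame. */
RequestSend(p->tot_len);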
Then the data can be copied from the pbuf (chain) to the ring buffer, one pbuf
at a time, by means of the CopyToFrame_EMAC_Start function, until the whole
Ethernet frame is stored in it. When it is done, the TxProducerIndex is updated
in the CopyToFrame_EMAC_End function to point to the next free element in the
ring buffer. In this way, the Ethernet hardware will also find out that a packet is ready
to be sent out.
Last but not least, the pointer to the pbuf (chain) should be adjusted to reclaim
the space allocated for the padding field so that when the upper layer of the protocol
stack is about to free the memory of the pbuf, the right amount of memory will be
freed.
8.6 SUMMARY
This chapter discussed the design and implementation of device drivers for embedded
systems, which are quite different from those found on general-purpose operating systems.
The most significant differences are mainly related to interrupt handling,
the application task interface, as well as synchronization.
The chapter starts with a general description of the internal structure of typical
device drivers, presented in Section 8.1. A device driver does little more than
move data back and forth between peripheral devices and other software compo-
nents. The main processing is generally done within an interrupt handler and one or
more helper tasks, in order to be reactive to external events while at the same time
permitting better real-time performance.
Section 8.2 illustrates the whole process of how an interrupt is handled. In partic-
ular, it focuses on interrupt handling at the hardware level, from when an interrupt
arrives at the device until it is accepted by the processor, since this part may be less
familiar to embedded software developers.
Moreover, in order to interact with both the hardware and other software compo-
nents, interfaces to them should be provided by the device driver, which is discussed
in Section 8.3. For what concerns the hardware interface, it can be either register-based
or DMA-based. With respect to general-purpose operating systems, the application
task interface is more performance-oriented rather than standardized.
Last but not least, synchronization is another major topic in concurrent program-
ming. It becomes even trickier when interrupt handler(s) are added into the picture.
As a simple example, any potentially blocking primitive is forbidden in an inter-
rupt handler. To this purpose, synchronization issues between interrupt handlers and
upper layers have been analyzed and addressed in Section 8.4.
At the end, a practical example about the Ethernet device driver is presented in
Section 8.5, which demonstrates how the main issues related to the three topics just
mentioned are addressed in a real implementation.
9 Portable Software
CONTENTS
9.1 Portability in Embedded Software Development .......................................... 241
9.2 Portability Issues in C-Language Development ............................................ 247
9.3 Application Programming Interfaces ............................................................ 255
9.4 GCC Extensions to the C Language ............................................................. 258
9.4.1 Object Attributes............................................................................... 259
9.4.2 Assembly Language Inserts.............................................................. 268
9.5 Summary........................................................................................................ 274
The previous chapters set the stage for embedded software development, starting
from the general requirements of embedded applications and proceeding by present-
ing the main software development tools and runtime components. The practical
aspects of the discussion were complemented by theoretical information relating to
real-time execution models and concurrent programming.
In this chapter, the focus is now on how to make the best profit from the software
development activity, by producing software that can be easily migrated, or ported,
from one processor architecture to another and among different projects, without
sacrificing its efficiency.
This aspect is becoming more and more important nowadays because new, im-
proved processor architectures are proposed at a faster rate than in
the past, and it becomes exceedingly important to adopt them in new projects as
quickly and effectively as possible.
projects is constantly decreasing, putting even more focus on quick prototyping and
development.
Further portability issues may arise if some important software components—
most notably, the operating system and the protocol stacks—or the software devel-
opment toolchain itself may also change from one project to another and from one
target system to another. In this case, an appropriate use of abstract application pro-
gram interfaces and porting layers becomes of primary importance.
• Last, but not least, the fast pace of contemporary hardware evolution be-
comes easier to follow. This aspect is especially important in some areas—
for instance, consumer appliances—in which being able to quickly incor-
porate the latest advances in microcontroller technology into a new product
often plays a key role in its commercial success.
1. The component itself must have been designed and implemented in a portable
way. In turn, this encompasses two different aspects. The first one concerns how
the code belonging to the component itself has been written at the programming
language level.
Namely, an improper use of the C programming language and its compiler may
easily lead to portability pitfalls, which will be summarized in Section 9.2. On
the other hand, at least in some cases, the use of some compiler-specific language
extensions, such as the ones presented in Section 9.4, becomes unavoidable.
Sometimes, the standard C language may simply be inadequate to express the op-
eration to be performed—for instance, accessing the contents of some processor
registers directly—or implementing the operation using only standard language
constructs would have unacceptable, adverse effects on performance.
In this case, to limit the impact on portability, it becomes essential to organize
the code in a proper way and enclose non-standard language constructs in well-
defined code modules, often called porting layers, to be kept as small as possible.
The second aspect concerns how the component interfaces with other components
and, symmetrically, how the application programming interface it makes available
to others is architected.
As will be better described in Section 9.3, choosing the right interface when more
than one is available (when using another component) and providing a proper
interface (when designing a component) are key factors to reach a proper trade-
off between execution efficiency and portability.
2. Some software components interface heavily with hardware. For instance, as de-
scribed in Chapter 4, the operating system needs to leverage an extremely deep
knowledge of the processor architecture in order to perform some of its critical
functions, for instance, context switch.
Therefore, when the operating system is ported from one processor architecture
to another, it becomes necessary to extend this knowledge. As for the usage of
non-standard language constructs, the goal is conveniently reached by means of
a porting layer. A detailed example of how a porting layer is architected and
implemented will be given in Chapter 10.
Sometimes, hardware dependencies are limited to the interface toward a few spe-
cific devices. For instance, the TCP/IP protocol stack obviously needs to have
access to at least one network device, like an Ethernet controller, in order to per-
form most of its functions.
In this case, the problem is often solved by confining device-specific code to spe-
cial software modules called device drivers, which have been presented in Chap-
ter 8 and must be at least partially rewritten in order to support a new device.
For this reason, when starting a new project possibly involving new hardware (for
instance, a new microcontroller) or new software (for instance, a different TCP/IP
protocol stack) it’s important to pay attention to code portability issues.
The same is true also in perspective, when designing a system that is likely to be
ported to other hardware architectures or has to support different software compo-
nents in the foreseeable future.
In order to make a more specific and practical example, Figure 9.1 depicts the
typical structure of a simple, networked embedded system. The main software com-
ponents that build up the system, represented as white rectangles in the picture, are:
• The language runtime libraries, which are shared among all the other com-
ponents. In an open-source embedded system using the C language those
libraries are typically provided by NEWLIB.
• The FreeRTOS real-time operating system, providing multitasking, tim-
ing, and inter-process communication services to all the other components.
• A number of system components, each providing additional services and
interfacing with the specific devices they control. Most notably, network
services are provided by the lwIP TCP/IP protocol stack interfaced with
an Ethernet controller.
• Several high-level components that build up the application. They make use
of all the other components mentioned so far and, in addition, interface with
each other.
According to the general discussion presented above, portability issues may occur
in various areas of the system and for two different reasons, mainly due to the pres-
ence of interfaces to other software or hardware. In the figure, these interfaces are
depicted as lighter and darker gray blocks, respectively. These areas are highlighted
with black dots in the figure. In particular:
minimum, when printouts are used as a debugging aid, it shall be linked to the
console device (usually a serial port), by means of a simple device driver.
3. The operating system contains some code to initialize the system when it is turned
on or after a reset, usually called startup code. Some typical operations performed
by the startup code are, for instance:
• Set up the processor clock appropriately, switching to a higher-quality oscil-
lator and programming the clock generation phase-locked loops (PLLs) to
achieve a faster speed with respect to the default configuration used by the
microcontroller upon system startup, which is often extremely conservative.
• Make the memory external to the chip accessible by configuring the external
memory controller with parameters appropriate to the hardware and memory
characteristics, such as external bus width, speed, and timings.
• Prepare the system for executing code written in C, by setting up the initial
processor stack as well as the interrupt-handling stack when required, and
initializing the data segment.
As can be seen, in order to implement any of those operations, it is strictly nec-
essary to have an intimate knowledge of the hardware characteristics of the mi-
crocontroller in use, as well as how it has been configured and integrated into
the embedded board. For obvious reasons, all these aspects are hardly portable
from one project to another and some of them cannot even be expressed in a high-
level language without using non-standard extensions, which will be outlined in
Section 9.4.
It must also be noted that, strictly speaking, the responsibility of carrying out these
operations is somewhat “shared” between the language runtime libraries and the
operating system. Therefore, even though the startup code is shown to be part of
the operating system in this example, it is not uncommon to find part or all of the
startup code within those libraries as well.
4. Even after the system has been initialized, the operating system still needs to per-
form operations that are inherently architecture-dependent, most notably context
switch and interrupt masking. They are implemented in a dedicated code mod-
ule, that constitutes the operating system porting layer. Due to its importance, the
operating system porting layer will be discussed in more detail in Chapter 10.
5. One important function of the operating system is to provide timing references
to all tasks in the system. This feature is implemented with the help of a hard-
ware timer that raises an interrupt request at predetermined intervals. Even though
timers are a relatively simple component, their way of working highly depends on
the microcontroller in use. For this reason, the interface toward the timer is im-
plemented as part of the operating system porting layer, too.
6. Additional system components are often the most critical for what concerns porta-
bility because, as shown in Figure 9.1, they must deal with multiple interfaces
toward other software and hardware components.
For instance, the lwIP protocol stack, besides making use of the C language
support library, is concerned with three other important interfaces.
• At the bottom, lwIP interfaces with an Ethernet controller, by means of a
device driver presented in Section 8.5.
• For timing and synchronization—both internal and with respect to the appli-
cation tasks—lwIP leverages the operating system services through the oper-
ating system programming interface.
• At the top, lwIP offers two different programming interfaces to applica-
tion tasks. Both these interfaces and the previous one have been discussed
in Chapter 7.
7. When a program component interacts with another it does so by means of a well-
defined application programming interface (API). In some cases a component im-
plements and offers multiple APIs, which represent different trade-offs between
execution efficiency and portability of the code that makes use of them.
As was mentioned earlier, both aspects are of great importance in an industrial
applications context. Therefore, choosing the most appropriate API to be used
in a given project is an important design decision.
Table 9.1
Minimum Range of Basic Integer Data Types Specified by the C Standard

                         Minimum                    Maximum
Type              Macro        Value         Macro        Value
signed char       SCHAR_MIN    -127          SCHAR_MAX    127
unsigned char     —            0             UCHAR_MAX    255
char              CHAR_MIN     (a)           CHAR_MAX     (b)
short             SHRT_MIN     -32767        SHRT_MAX     32767
unsigned short    —            0             USHRT_MAX    65535
int               INT_MIN      -32767        INT_MAX      32767
unsigned int      —            0             UINT_MAX     65535
long              LONG_MIN     -2147483647   LONG_MAX     2147483647
unsigned long     —            0             ULONG_MAX    4294967295
About the integer representation it must be remarked that, even though most re-
cent architectures adopt the two’s complement representation, this is not mandated
in any way by the standard, which also supports the sign and magnitude, as well as
the one’s complement representation.
As a side note for curious readers, this is the reason why the minimum int range
does not include the value -32768.
In some cases, knowing those details is indeed unimportant. For instance, when
using an int variable to control a for loop, the main goal of the programmer usually
is to obtain code that executes as fast as possible.
This goal is fulfilled by the standard because choosing a “natural size” for int,
and making the variable fit in a machine register, is beneficial for machine code
efficiency. The only check that careful programmers should perform is to compare
the maximum value that the control variable can assume by design with INT_MAX.
Table 9.2
Extended Integer Data Types Specified by the C99 Standard

Signed type      Unsigned type     Description
intn_t           uintn_t           Exact width of n bits
int_leastn_t     uint_leastn_t     Width of at least n bits
int_fastn_t      uint_fastn_t      Fastest types with a width of at least n bits
intmax_t         uintmax_t         Maximum width supported by the architecture
intptr_t         uintptr_t         Able to hold the value of a pointer
Table 9.3
Minimum and Maximum Values of C99 Integer Data Types
Macro Description
INTn_MIN Minimum value of intn_t, exactly −2n−1
INTn_MAX Maximum value of intn_t, exactly 2n−1 − 1
UINTn_MAX Maximum value of uintn_t, exactly 2n − 1
INT_LEASTn_MIN Minimum value of int_leastn_t
INT_LEASTn_MAX Maximum value of int_leastn_t
UINT_LEASTn_MAX Maximum value of uint_leastn_t
INT_FASTn_MIN Minimum value of int_fastn_t
INT_FASTn_MAX Maximum value of int_fastn_t
UINT_FASTn_MAX Maximum value of uint_fastn_t
INTMAX_MIN Minimum value of intmax_t
INTMAX_MAX Maximum value of intmax_t
UINTMAX_MAX Maximum value of uintmax_t
INTPTR_MIN Minimum value of intptr_t
INTPTR_MAX Maximum value of intptr_t
UINTPTR_MAX Maximum value of uintptr_t
As summarized in Table 9.2, the stdint.h header introduced by the C99 standard
defines several groups of integer data types:
1. Integer data types having an exact and known in advance width, in bits.
2. Integer data types having at least a certain width.
3. Fast integer data types having at least a certain width.
4. Integer data types having the maximum width supported by the architecture.
5. Integer data types able to hold pointers.
In addition, the same header also defines the object-like macros listed in Table 9.3.
They specify the minimum and maximum values that the data types listed in Ta-
ble 9.2 can assume. The minimum values of unsigned data types are not specified as
macros because they are invariably zero.
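For instance, a small module relying on these types (a minimal, self-contained sketch, not taken from the book's examples) could look as follows.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* Exactly 32 bits wide, for data shared with hardware or other systems. */
static uint32_t device_status = 0;

/* At least 16 bits wide, chosen by the compiler to be fast to operate on. */
static int_fast16_t samples_left = 1000;

int main(void)
{
    /* An integer type guaranteed to be able to hold a pointer value. */
    uintptr_t addr = (uintptr_t)&device_status;

    /* The PRI* macros from <inttypes.h> select the right printf conversion
       specifier for the fixed-width types on any conforming platform. */
    printf("status=%" PRIu32 " addr=0x%" PRIxPTR " left=%" PRIdFAST16 "\n",
           device_status, addr, samples_left);

    return 0;
}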
Another common source of portability issues is related to which version of the C
language standard a certain module of code has been written for. In fact, as summa-
rized in the right part of Table 9.4, there are three editions of the standard, ratified in
the past 25 years, plus one intermediate version introduced by means of an amend-
ment to the standard.
For the sake of completeness, we should mention that full support for the most
recent edition of the standard [90], commonly called C11 because it was ratified in
2011, is still being added to the most recent versions of GCC and is incomplete at
the time of this writing.
In each edition, the language evolved and new features were introduced. Although
every effort has been put into preserving backward compatibility, this has not always
been possible. The problem is further compounded by the presence of several lan-
guage dialects, typical of a specific compiler.
Table 9.4
Main Compiler Flags Related to Language Standards and Dialects
Flag Description
-std=c89 Support the first edition of the C standard, that is, ISO/IEC
9899:1990 [87], commonly called C89 or, less frequently, C90.
Support includes the technical corrigenda published after ratifi-
cation. The alternative flag -ansi has the same meaning.
-std=iso9899:199409 Support the ISO/IEC 9899:1990 standard plus “Amend-
ment 1” [88], an amendment to it published in 1995; the
amended standard is sometimes called C94 or C95.
-std=c99 Support the second edition of the C standard, that is, ISO/IEC
9899:1999 [89], commonly called C99. Support includes the
technical corrigenda published after ratification.
-std=c11 Support the third edition of the C standard, that is, ISO/IEC
9899:2011 [90], commonly called C11.
-std=gnu89 Support the C89 language plus GCC extensions, even if they
conflict with the standards.
-std=gnu99 Support the C99 language plus GCC extensions, even if they
conflict with the standard.
In many cases, these dialects were historically introduced to work around short-
comings of earlier editions of the standard, which were addressed by later editions.
However, they often became popular and programmers kept using them even after,
strictly speaking, they were made obsolete by further standardization activities.
For this reason, GCC offers a set of command-line options, listed in Table 9.4,
which specify the language standard the compiler shall use. By default, these options
do not completely disable the GCC language dialect.
Namely, when instructed to use a certain standard, the compiler accepts all lan-
guage constructs conforming to that standard plus all GCC dialect elements that do
not conflict with the standard itself. When strict checks against the standard are re-
quired, and all GCC-specific language constructs shall be disabled, it is necessary to
specify the -pedantic option, too, as discussed in the following.
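For example, the two hypothetical invocations below illustrate the difference; the compiler name and the source file are arbitrary.

# Accept C99 plus the GCC dialect elements that do not conflict with it.
arm-none-eabi-gcc -std=c99 -c module.c

# Strict C99 checks: non-conforming constructs for which the standard
# requires a diagnostic are reported; -Wall and -Wextra add further warnings.
arm-none-eabi-gcc -std=c99 -pedantic -Wall -Wextra -c module.c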
Another group of flags is related to code portability in two different ways. We list
the main ones in Table 9.5. Namely:
1. Some flags direct the compiler to perform additional checks on the source code.
Those checks are especially useful in new code, in order to spot potential issues
that may arise when it is ported to another architecture in the future.
2. Other flags “tune” the language by affecting the default behavior of the compiler
or relaxing some checks, with the goal of making it easier to compile existing
code on a new architecture successfully.
As shown in the table, the main flag to disable GCC-specific language extensions
and enable additional source code checks is -pedantic. The two additional flags
-Wall and -Wextra enable additional warnings.
Table 9.5
Main Compiler Flags Related to Code Portability
Flag Description
Language tuning
-fno-builtin Instruct the compiler to neither recognize nor handle specially
some built-in C library functions. Instead, they will be handled
like ordinary function calls.
-fcond-mismatch Allow conditional expressions, in the form a ? b : c with
mismatched types in the second (b) and third (c) arguments.
The expression can be evaluated for side effects, but its value
becomes void.
-flax-vector-conversions Relax vector conversion rules to allow implicit conversions be-
tween vectors with differing number of elements and/or incom-
patible element types.
-funsigned-char Sets the char data type to be unsigned, like unsigned char.
-fsigned-char Sets the char data type to be signed, like signed char.
-funsigned-bitfields Sets bit-fields to be unsigned, when their declaration does not
explicitly use signed or unsigned. The purpose of this op-
tion and of the following one is similar to -funsigned-char
and -fsigned-char above.
-fsigned-bitfields Sets bit-fields to be signed, when their declaration does not ex-
plicitly use signed or unsigned.
-fshort-enums Allocate the minimum amount of space to enumerated (enum)
data types, more specifically, only as many bytes as required for
the range of values they can assume.
-fshort-double Use the same size for both float and double floating-point
data types.
-fpack-struct When this flag is specified without arguments, it directs the
compiler to pack structure members tightly, without leaving
any padding in between and ignoring the default member align-
ment. When the flag is accompanied by an argument in the form
=n, n represents the maximum default alignment requirement
that will be honored by the compiler.
However, it should be remarked that this flag does not enable a thorough set
of conformance checks against the C language standard. Rather, only the non-
conforming practices for which the standard requires a diagnostic message, plus a
few selected others, are reported.
Other tools, like the static code analysis tool mentioned in Chapter 16, are often
able to perform even more accurate checks concerning code portability, although this
is not their primary goal.
For what concerns the language tuning flags:
• The flags -funsigned-char and -fsigned-char control whether the
plain char data type is unsigned or signed by default. It is unwise to use dif-
ferent settings for individual code modules because this would result in the
same data type, char, having two different meanings depending on the
code module it’s encountered in.
• The flags -funsigned-bitfields and -fsigned-bitfields have
a similar meaning, but they apply to bit-fields that are not explicitly quali-
fied as either unsigned or signed.
• The two flags -fshort-enums and -fshort-double determine the
amount of storage the compiler reserves for enumerated and floating-point
data types. As a consequence, they make it possible to port legacy code that
contains hidden assumptions about these aspects without extensive modifi-
cations. However, it makes the resulting code incompatible with code gen-
erated without these options. Hence, it must be used with care, especially
when mixing legacy and newly developed code.
• Similarly, the flag -fpack-struct controls how structure members are
packed. When used alone, that is, without arguments, the flag directs the
compiler to pack structure members as tightly as possible, that is, without
leaving any padding between them, regardless of the default or “natural”
alignment members may have.
Thus, for instance, an int member that follows a char member will be al-
located immediately after the char, even though its default, natural align-
ment (on a 32-bit architecture) would be to an address which is a multiple
of 4 bytes.
The flag also accepts an optional integer argument, n, which is specified
as -fpack-struct=n. In this case, the compiler will introduce some
padding, in order to honor default alignments up to a multiple of n bytes,
but no more than that.
For example, when the flag -fpack-struct=2 is in effect, the compiler
will use padding to align members up to a multiple of 2 bytes if their de-
fault alignment says so. However, continuing the previous example, int
members will still be aligned only to a multiple of 2 bytes, instead of 4.
This flag may sometimes be useful to define a data structure whose layout
is exactly as desired, for instance, when the data structure is shared with a
hardware device. However, it makes the generated code incompatible with
other code modules that are not compiled with exactly the same flag setting.
Moreover, it also makes the generated code less efficient than it could be.
This is because, on most recent architectures, accessing a memory-resident
variable that is not naturally aligned requires extra instructions and, possi-
bly, additional working registers.
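As a concrete example of the effect described above, consider the following data structure; the sizes given in the comments assume a typical 32-bit architecture and the default GCC layout rules.

#include <stdint.h>

struct sample {
    uint8_t  c;   /* 1 byte                                          */
    uint32_t i;   /* 4 bytes, naturally aligned to a 4-byte boundary */
};

/* With the default layout, 3 bytes of padding follow c, so
   sizeof(struct sample) is typically 8.
   With -fpack-struct, no padding is inserted and the size becomes 5,
   but i is no longer naturally aligned.
   With -fpack-struct=2, i is aligned to a multiple of 2 bytes and the
   size is typically 6. */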
Since the criteria just mentioned are not easy to express formally, and it is even
harder to evaluate quantitatively how much a software component adheres to a cer-
tain criterion—for instance, how can we measure how portable a piece of code is,
especially when it is complex—in the following we will discuss a small set of real-
world examples to give readers an informal feeling of which trade-offs they may
have to confront in practice, and which main aspects they should consider.
• At the C library level, the most intuitive way of producing some output, for
instance, to roughly trace program execution during software development,
is to call the standard library function printf.
Assuming that the C library is properly configured, as described in Chap-
ter 3, the resulting string of characters will eventually be transferred to a
suitable console I/O device on the target board and, from there, reach the
programmer.
On the other hand, many runtime libraries offer alternate ways of producing
output. For instance, the LPCOpen software development platform offered
by NXP for use on their range of LPC microcontrollers [129] provides sev-
eral macros, whose names start with _DB, with the same purpose.
When considering which functions shall be used in a certain project, it is
important to fully realize in which aspects they differ from each other. In
our specific case—but this scenario is rather typical—it is important to con-
sider that the two approaches are rather far away from each other for what
concerns the abstraction level.
In fact, the printf function is very complex and implements a wide range
of data conversion and formatting facilities, analogous to what is available
on a general-purpose computer. On the contrary, the _DB functions offer
just a few formatting options.
Namely, they can output only single characters, character strings, as well as
16- and 32-bit integer variables. For integer variables, the only conversion
options are to the decimal and hexadecimal representations. In turn, this
brings some important differences from the practical point of view.
1. On the one hand, printf and other closely related functions, such
as fprintf, fputs, and so on, provide a general and very flexible
input–output framework. For instance, by just calling fprintf with
one stream or another as an argument, it is easy to direct the output to
another device rather than to the console.
If the C runtime library is appropriately configured, it is also possible
to direct the output to an entity belonging to a quite different level
of abstraction with respect to a physical device, for instance, a file
residing on a storage medium.
On the other hand, the _DB functions are able to operate only on a
single device, that is, the console, which is chosen at configuration
time and cannot be changed afterward.
2. The generality and flexibility of printf comes at the expense of
memory footprint, because printf necessarily depends on many
other lower-level library functions that implement various parts of its
functionality.
Due to the way most embedded applications are linked, those depen-
dencies imply that calling printf in an application brings many ad-
ditional library modules into the executable image besides printf
itself, even though most of them are not actually used at runtime.
For instance, printf is able to convert and format floating-point
variables. In most cases, the (rather large) portion of code that im-
plements this capability will be present in the executable image even
though the application does not print any floating-point variables
at all.
On the other hand if, for instance, the program never calls the _DBD32
macro (that is responsible for printing a 32-bit integer in decimal), the
corresponding conversion and formatting code will not be present in
the executable image.
3. For the same reasons, printf also has a bigger execution time over-
head with respect to other functions that are less general and flexible.
Continuing the previous example, scanning and obeying even a rela-
tively simple format specification like "%-+5d" (which specifies that
printf should take an int argument and print it in decimal, left-
aligned, within a 5-character wide output field, and always print its
sign even though it’s positive) is much more complex than knowing
in advance that the argument is a 32-bit integer and print it in full
precision with a fixed, predefined format, like _DBD32 does.
4. Due to its internal complexity, printf also requires a relatively large
amount of stack space (for example, at least 512 bytes on a Cortex-
M3 using the NEWLIB C library). In turn, this may make the function
hard to use when stack space is scarce, for instance, within an interrupt
handler.
5. Last, but not least, printf works only if the C library input–output
subsystem has been properly configured. This may or may not be true
on any other operating system. It is the most efficient one, too, because the
other two are layered on top of it, at the cost of some performance.
At the other extreme, the POSIX API, being backed by an international
standard, makes software development easier and quicker, also because
many programmers are already familiar with it and are able to use it ef-
fectively from day zero.
It also offers the best guarantee against software obsolescence because,
even though the standard will continue to evolve, backward compatibility
will always be taken into consideration in any future development.
The ITRON API, instead, is an example of an interface provided for yet an-
other reason. In this case, the emphasis is mainly on encouraging the adop-
tion of a certain operating system in a very specific application area, where
existing software plays a very important role. This can be obtained in two
different, symmetric ways:
• Port existing applications to the new operating system by modifying
their porting layer.
• Informally speaking, do the opposite, that is, port the operating system
to those applications by providing the API they expect.
Of course, the second alternative becomes appealing when the relative
weight of existing software and operating system within applications is
strongly unbalanced toward the former.
In this case, it was considered more profitable to write an API adaptation
layer once and for all, and then make applications pass through two layers
of code to reach the operating system (their own porting layer plus the API
adaptation layer just described), rather than updating existing code modules
one by one.
1. Within a porting layer, it is possible to exploit any non-standard features the
compiler can offer to the maximum extent possible, mainly to improve execution
efficiency.
This opportunity becomes even more appealing because, in this particular case,
there are no adverse side effects. In fact, informally speaking, there is no reason
to have a “portable porting layer” and we can accept that the code that belongs to
the porting layer is non-portable “by definition.”
2. Within porting layers, the use of some extensions to the C language is sometimes
strictly necessary. In other cases, even though in principle they could be
avoided, their use is extremely convenient because it avoids developing standalone
assembly language modules in the project.
Two examples, which we will further elaborate upon in the following, are
assembly-language inserts within C code—to express concepts that are not sup-
ported by high-level programming languages—as well as object attributes to di-
rect the toolchain to place some special code and data structures into specific areas
of memory.
• They drive the compiler’s code optimizer in a more accurate way and im-
prove optimization quality.
• When applied to a function, they modify the code generation process in
order to meet special requirements when the function is invoked from par-
ticular contexts, like interrupt handling.
• Some attributes are propagated through the toolchain and may affect other
toolchain components as well, for instance, how the linker resolves cross-
module symbol references.
Table 9.6
Main Object Attributes, Cortex-M3 GCC Toolchain
As such, the main goal of the discussion will be to provide a general idea of which
concepts object attributes are able to convey, and how they can profitably be used to
develop a porting layer, like the one presented in Chapter 10, in the most effective
way. Interested readers should refer to the compiler documentation [159, 160] for
up-to-date, architecture-specific, and thorough information.
The general syntax of an object attribute is
__attribute__ ((<attribute-list>))
1. An attribute can be made up of a single word. For instance, referring to Table 9.6,
which lists all the attributes that will be discussed in the following, cold is a valid
attribute.
2. Some attributes are more complex and take one or more parameters. In these
cases, parameters are listed within parentheses after the attribute name and are
separated by commas. For instance, section("xyz") is a valid section at-
tribute with one parameter, the string "xyz".
In some cases, not discussed in detail in this book, attribute parameters have a
more complex syntax and can be expressions instead of single words. In addition, it
must be noted that it is also possible to add __ (two underscores) before and after an
attribute name, without embedded spaces, without changing its meaning. Therefore,
the meaning of cold and __cold__ is exactly the same.
Attributes may appear at several different syntactic positions in the code, depend-
ing on the object they refer to. In the following, we will discuss only attributes at-
tached to a variable, function, or data type definition, even though other cases are
possible, too.
A first group of attributes affects the layout of data and code in memory, namely,
for what concerns alignment. The most important attributes in this group are:
• The aligned(n) attribute, which specifies the minimum alignment of the
corresponding object in memory. In particular, the object address will be at
least an integer multiple of n bytes. The alignment specification is consid-
ered to be in addition to any existing alignment constraints already in effect
for the object.
It is also possible to specify just aligned, without giving any arguments.
In this case, the compiler will automatically choose the maximum align-
ment factor that is ever used for any data type on the target architecture,
which is often useful to improve execution efficiency.
• Informally speaking, the packed attribute has the opposite effect, because
it specifies that the corresponding object should have the smallest possible
alignment—that is, one byte for a variable, and one bit for a bit field—
disregarding the “natural” alignment of the object.
Besides the obvious benefits in terms of storage space, it is necessary to
consider that the use of packed often has negative side effects on per-
formance and code size. For instance, on many architectures, multiple load
operations and several mask and shift instructions may be needed to retrieve
a variable that is not aligned naturally in memory.
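The following fragment sketches how the two attributes just described might be used; the buffer size and structure layout are arbitrary examples.

#include <stdint.h>

/* Force a DMA buffer to start on an 8-byte boundary, in addition to
   any alignment constraint already in effect for uint8_t arrays. */
static uint8_t dma_buffer[256] __attribute__ ((aligned(8)));

/* Lay out a structure without any padding between members, matching,
   for instance, a fixed on-the-wire or hardware-defined format. */
struct frame_header {
    uint8_t  type;
    uint32_t length;
} __attribute__ ((packed));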
The second group of attributes can be attached only to function definitions. These
attributes tune code optimization toward improving execution speed or reducing code
size on a function-by-function basis. Namely:
• The hot attribute specifies that a certain function is frequently used and/or
it is critical from the performance point of view. The compiler will optimize
the function to improve execution speed, possibly at the expense of code
size, which may increase.
Depending on the target, this attribute may also place all “hot” functions
in a reserved subsection of the text section, so that all of them are close
to each other in memory, thus improving locality of reference.
Another group of attributes does not affect code generation within the function
they are attached to, but changes the way this function is called from other parts of
the application. As before, the two most important attributes in this category for the
Cortex-M3 processor select different trade-off points between execution efficiency
and flexibility.
It is therefore possible (and even likely) that a function placed in a certain bank of
memory calls a function residing in a different bank, resulting in an extremely large
offset that cannot be encoded in the BL instruction when using the short_call
calling sequence.
Due to the way the toolchain works, as described in Chapter 3, the compiler is
unable to detect this kind of error because it lacks information about where functions
will be placed in memory. Instead, the error is caught much later by the linker and
triggers a rather obscure message similar to the following one:
object.o: In function ‘f’:
(.text+0x20) relocation truncated to fit:
R_ARM_PC24 against symbol ‘g’ ...
Actually, bearing in mind how the short_call calling sequence works and its
restriction, it becomes easy to discern that the error message simply means that func-
tion f, defined in object module object.o, contains a call to function g.
The compiler generated a BL instruction to implement the call but the linker was
unable to properly fit the required offset in the 24-bit signed field (extended with a
zero in the least significant bit position) destined to it. As a consequence, the linker
truncated the offset and the BL will therefore not work as intended, because it will
jump to the wrong address. The very last part of the message, replaced by ... in the
previous listing for concision, may contain some additional information about where
g has been defined.
The next two attributes listed in Table 9.6 affect the way the compiler generates
two important (albeit hidden) pieces of code called function prologue and epilogue.
The prologue is a short fragment of code that the compiler automatically places at
the beginning of each function, unless instructed otherwise. Its goal is to set up the
processor stack and registers to execute the body of the function.
For instance, if the function body makes use of a register that may also be used
by the caller for another purpose, the prologue must save its contents onto the stack
to avoid clobbering it. Similarly, if the function makes use of stack-resident local
variables, the prologue is responsible for reserving stack space for them.
Symmetrically, the epilogue is another fragment of code that the compiler auto-
matically places at the very end of each function, so that it is executed regardless
of the execution path taken within the function body. Its aim is to undo the actions
performed by the prologue immediately before the function returns to the caller.
The attributes interrupt and naked control prologue and epilogue generation
for the function they are attached to, as outlined in the following.
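As a minimal sketch, assuming two illustrative handler functions, the two attributes are simply attached to the function declarations; interrupt asks the compiler to generate an entry and exit sequence suitable for an exception handler, whereas naked suppresses prologue and epilogue generation altogether, so that the function body must provide them itself, typically in assembly language.

void timer_isr(void) __attribute__ ((interrupt));
void context_switch_handler(void) __attribute__ ((naked));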
The last two attributes to be discussed here are peculiar because they affect only
marginally the compiler’s behavior, but they are propagated to, and take effect during,
the linking phase.
The section attribute allows programmers to control exactly where a certain
object, consisting of either code or data, will be placed in memory. On many ar-
chitectures this is important, for instance, to achieve maximum performance in the
execution of critical sections of code.
In other cases, this may even be necessary for correctness. For example, as de-
scribed in Chapter 8, when developing a device driver, the shared buffers between
software and hardware must be allocated in a region of memory that is accessible
to both.
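For instance, a buffer shared between a device driver and its hardware could be placed into a dedicated input section as shown in the following sketch; the buffer name is illustrative, while the section name sram is the one used in the rest of this example.

unsigned char shared_buf[512] __attribute__ ((section("sram")));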
Figure 9.2 Propagation and handling of the section attribute through the toolchain.
In this case, the output section has the same name as input sections, a rather com-
mon choice.
4. Afterward, the linker itself allocates the output section into a specific area, or
bank, of memory—again, following the directives contained in the linker script.
In particular:
• The SECTIONS part of the script establishes into which area of memory the
sram output section will go. Referring back to the previous listing, this is done
by means of the > staticram specification. In our example, as the bank
name suggests, we imagine that the sram output section must be allocated
within staticram, which is a bank of static RAM.
• The MEMORY part of the script specifies where the bank of static RAM is,
within the processor address space, and how big it is. The linker makes use of
the last item of information to detect and report any attempt to overflow the
memory bank capacity. For instance, the following specification:
MEMORY {
staticram: ORIGIN=0x1000, LENGTH=0x200
}
states that the staticram memory bank starts at address 0x1000 and can
hold up to 0x200 bytes.
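For reference, a SECTIONS fragment consistent with the specification just mentioned could be sketched as follows, assuming that the input and output sections are both called sram:

SECTIONS {
    sram : {
        *(sram)
    } > staticram
}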
The weak attribute requires the same kind of information propagation through the
toolchain, but it has a simpler meaning. As shown in Figure 9.3, when this attribute
is attached to an object within a source file, the compiler will mark it in the same
way in the corresponding object file.
In the link phase, if the linker has to choose between referring to a “normal”
object or a “weak” one, it will invariably prefer the first one and disregard the second,
without emitting any error or warning message about duplicate symbols.
The main use of the mechanism just described is to provide a default version of a
function or a data structure—possibly within a library—but still provide the ability
to override it and provide a different implementation elsewhere—for instance, in the
application code. An example of use of this technique, in the context of interrupt
handling, has been given in Chapter 8.
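A minimal sketch of the technique, using an illustrative handler name, is shown below. The weak definition acts as the default; any ordinary, non-weak definition of a function with the same name, given elsewhere, overrides it at link time.

/* Default implementation, typically provided by a library. */
void __attribute__ ((weak)) UART_IRQHandler(void)
{
    for(;;) ;   /* trap unexpected interrupts */
}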
From the syntax point of view, assembly language inserts are enclosed in an “en-
velope,” introduced by the keyword __asm__, delimited by a pair of parentheses,
and closed by a semicolon, as shown below.
__asm__(
<assembly_language_insert>
);
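The discussion that follows refers to an extended insert along the lines of the following sketch, reconstructed from the operand names used in the text; result and angle are assumed to be floating-point C variables.

__asm__ ("fsinx %[angle],%[output]"
    : [output] "=f" (result)
    : [angle] "f" (angle));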
• The insert contains one assembly instruction, fsinx, which calculates the
sine of an angle and has two operands. Namely, the first operand is the angle
and the second one is the result.
In the example, these operands are mapped to two C-language expressions,
with symbolic names [angle] and [output], respectively.
• After the first colon, there is a single output operand description that refers
to operand [output]. The description specifies, according to the con-
straint specification "=f", that it is a write-only (=) floating-point (f)
operand. The operand corresponds to the C variable result.
• After the second colon, there is the description of the input operand
[angle]. It is a read-only floating-point operand and corresponds to the
C variable angle.
Table 9.7 lists a subset of the characters that can appear in a constraint specifica-
tion. It should be noted that many of them are architecture-dependent, because their
goal is to specify the key properties of interest of an operand from the point of view
of the assembly instruction that makes use of it.
For this reason, in the table we only list the most commonly used ones, which are
available on most architectures supported by GCC. Interested readers should refer to
the compiler documentation [159, 160] to obtain a comprehensive list of constraint
letters and modifiers pertaining to the specific architecture they are working on.
The compiler makes use of the constraints specifiers to generate the code that
surrounds the assembly language insert and ensure that all constraints are met or,
when this is impossible, emit an error message. However, it is worth noting that
the checks the compiler can perform are strictly limited to what is specified in the
operand descriptions, because it parses the first part of the assembly language insert
(that is, the list of assembly instructions) in a very limited way.
For instance, the compiler is unaware of the meaning of assembly instructions
themselves and cannot ensure they are given the right number of arguments.
Table 9.7
Main Assembly Operand Constraints Specifiers
Name Purpose
Simple constraints
m The operand can be any memory-resident operand, reachable
by means of any addressing mode the processor supports.
r No constraints are posed upon the type of operand, but the
operand must be in a general-purpose register.
i The operand must be an immediate integer, that is, an operand
with a constant integer value. It should be noted that an operand
is still considered immediate, even though its value is unknown
at compile time, provided the value becomes known at assem-
bly or link time.
g The operand can be either in memory or in a general-purpose
register; immediate integer operands are allowed, too, but the
operand cannot be in a register other than a general-purpose
register.
p The operand must be a valid memory address, which corre-
sponds to a C-language pointer.
f The operand must be a floating-point value, which is stored in
a floating-point register.
X No constraints are posed on the operand, which can be any
operand supported by the processor at the assembly language
level.
Constraint modifiers
= The operand is a write-only operand for the instruction, that is,
its previous value is not used and is overwritten by the instruc-
tion with output data.
+ The operand is both an input and an output operand for the
instruction, that is, the instruction uses the previous operand
value and then overwrites it with output data; operands that are
not marked with either = or + are assumed to be input-only
operands.
& Indicates that the instruction modifies the operand before it is
finished using the input operands; as a consequence, the com-
piler must ensure that the operand does not overlap with any
input operands to avoid corrupting them.
• The most immediate way to indicate side effects is through output operand
specifications. Each specification explicitly indicates that the assembly lan-
guage insert may modify the operand itself, thus destroying its previous
contents.
• Some assembly instructions make implicit use of some registers and destroy
their previous contents even though, strictly speaking, they are not used as
operands. These registers must be specified in the clobber list, to prevent
the compiler from storing useful information into them before the assembly
language insert and then using their contents afterward, which would obviously
lead to incorrect results.
• In many architectures, some assembly language instructions may modify
memory contents in a way that cannot be described by means of output
operands. This fact must be indicated by adding the keyword memory to
the clobber list. When this keyword is present, the compiler will not cache
memory values in registers and will not optimize memory load and store
operations across the assembly language insert.
• Virtually all processor architectures have a status register that holds various
pieces of information about the current processor state, which can affect
future instruction execution.
For instance, this register is called program status register (xPSR) on the
Cortex-M3 architecture [8, 9] and is divided into 3 subregisters. Among
those subregisters, the application program status register (APSR) holds a
set of flags that arithmetic and test instructions may set to indicate their
outcome.
The processor uses these flags at a later time to decide whether or not con-
ditionally executed instructions must be executed and whether or not con-
ditional branch instructions will be taken.
In GCC terminology, these pieces of information are called condition
codes. If the assembly language insert can modify these condition codes,
it is important to make the compiler aware of this fact by adding the key-
word cc to the clobber list.
• Last, but not least, it is important to remark that the compiler applies its
ordinary optimization rules to the assembly language insert, considering it
as an indivisible unit of code. In particular, it may opt for removing the
insert completely if (according to its knowledge) it can prove that none of
its outputs are used in the code that follows it.
As discussed previously, a common use of output operands is to describe
side effects of the assembly language inserts. In this case, it is natural that
these output operands are otherwise unused in the code. If, as it happens
in some cases, the assembly language insert must be retained in any case,
because it has other important side effects besides producing output, it must
be marked with the volatile keyword, as shown below:
__asm__ volatile(
<assembly_language_insert>
);
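For instance, a short insert that globally disables interrupts on the Cortex-M3 has no output operands at all and must therefore be marked volatile; listing memory in the clobber list also prevents the compiler from moving memory accesses across it. This is a sketch, not code taken from an actual porting layer.

__asm__ volatile("cpsid i" : : : "memory");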
To conclude this section, Figure 9.4 summarizes in a graphical form how the
compiler handles assembly language inserts.
9.5 SUMMARY
The focus of this chapter was on portable software. As shown in Section 9.1, this
is a complex topic that involves virtually all software components of an embedded
system, from the interface between software and hardware, up to the application
programming interfaces adopted to connect software modules together.
In order to reach the goal, it is first of all important to realize where the main
sources of portability issues in C-language software development may be and how
the compiler and other tools can help programmers to detect them as early as possi-
ble. This was the topic of Section 9.2, whereas the all-important aspect of choosing
the right application programming interfaces was briefly considered in Section 9.3.
However, especially when working on low-level software components, such as
device drivers, resorting to non-portable constructs of the C language is sometimes
highly convenient, or even unavoidable. Section 9.4 provided an overview of the
features provided, in this respect, by the GCC-based open-source toolchain.
To conclude the discussion, a real-world example of a porting layer that achieves
architecture independence, and hence portability, at the operating system level will
be presented in the next chapter.
10 The FreeRTOS
Porting Layer
CONTENTS
10.1 General Information ...................................................................................... 275
10.2 Basic Data Types ........................................................................................... 280
10.3 Time Representation and Architectural Details............................................. 281
10.4 Context Switch .............................................................................................. 282
10.5 Interrupt Handling and Critical Regions ....................................................... 288
10.6 Task Stack Initialization ................................................................................ 290
10.7 Tick Timer ..................................................................................................... 294
10.8 Architecture-Dependent Scheduler Startup................................................... 296
10.9 Summary........................................................................................................ 297
In this chapter we will have a glance at how a typical porting layer is designed and
built. The example taken as a reference is the FreeRTOS porting layer, which con-
tains all the architecture-dependent code of the operating system.
For what concerns device driver development, another important method to
achieve portability and hardware independence, a comprehensive example has been
presented and discussed as part of Chapter 8.
Table 10.1
Object-Like Macros to Be Provided by the FreeRTOS Porting Layer
Group/Name Purpose
Time representation
portMAX_DELAY Highest value that can be represented by a TickType_t, special
value used to indicate an infinite delay.
portTICK_PERIOD_MS Tick period, approximated as an integral number of milliseconds.
Architectural details
portSTACK_GROWTH Direction of stack growth, -1 is downward (toward lower memory
addresses), +1 is upward (toward higher addresses).
portBYTE_ALIGNMENT Alignment required by critical data structures, namely, task stacks
and dynamically allocated memory.
concrete examples and code excerpts are needed. More information about this archi-
tecture can be found in References [8, 9].
The FreeRTOS version taken as a reference is V8.0.1. Since this operating
system is still evolving at a fast pace, there may be minor differences in the porting
layer when referring to other versions of the source code.
When another architecture is needed as a comparison and to show how the port-
ing layer shall be adapted to a different processor architecture, we will also make
reference to the port of the same operating system for the Coldfire-V2 family of
microcontrollers [66].
Both families are typical representatives of contemporary, low-cost components
for embedded applications and, at the same time, they are simple enough so that
the reader can gain a general understanding of how the porting layer works without
studying them in detail beforehand. The example has a twofold goal:
Table 10.2
Function-Like Macros to Be Provided by the FreeRTOS Porting Layer
Group/Name Purpose
2. It illustrates the typical structure and contents of a porting layer and gives a sum-
mary idea of the amount of effort it takes to write one.
The bulk of the port to a new architecture is done by defining a set of C preproces-
sor macros (listed in Tables 10.1 and 10.2), C data types (listed in Table 10.3), and
functions (listed in Table 10.4), pertaining to different categories. Figure 10.1 sum-
marizes them graphically. Besides public functions, which are mandatory and must
be provided by all porting layers, Table 10.4 also lists several private porting layer
Table 10.3
Data Types to Be Provided by the FreeRTOS Porting Layer
Group/Name Purpose
Time representation
TickType_t Relative or absolute time value
Table 10.4
Functions to Be Provided by the FreeRTOS Porting Layer
Group/Name Purpose
Public functions
pxPortInitialiseStack Initialize a task stack and prepare it for a context switch.
xPortStartScheduler Perform all architecture-dependent activities needed to start the
scheduler.
functions described in the following. They are used as “helpers” in the Cortex-M3
porting layer and may or may not be present in other ports.
Macros and data types are defined in the header file portmacro.h. When the
operating system is compiled, the contents of this file are incorporated by means of
an #include directive contained in the FreeRTOS header file portable.h and,
in turn, made available for use to the operating system source code.
side effect when determining how big task stacks should be, in order to call
xTaskCreate with a sensible value for the usStackDepth argument, as
discussed in Chapter 5.
As a side note, it is useful to remark that this value is an integer, and hence, it may
not accurately represent the actual tick length when configTICK_RATE_HZ is not
an integral sub-multiple of 1000.
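For reference, in the Cortex-M3 port the macro is defined along the following lines, which is where the integer division mentioned above comes from:

#define portTICK_PERIOD_MS ( ( TickType_t ) 1000 / configTICK_RATE_HZ )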
The next two macros listed in Table 10.1 provide additional details about the
underlying architecture. Namely:
For what concerns the second macro, it is worth remarking on the first important
difference between the two porting layers being considered in this example because,
according to the corresponding data sheets [9, 66], the two architectures have dif-
ferent alignment requirements. More specifically, the Cortex-M3 port defines this
macro as 8, whereas the Coldfire-V2 port defines it as 4.
The portmacro.h header only contains data type and macro definitions. We
have just seen that, in some cases, they map macro names used by FreeRTOS, like
portYIELD, into architecture-dependent function names, like vPortYield.
The implementation of those functions—along with other functions required by
FreeRTOS and to be discussed later—is done in at least one source module belong-
ing to the porting layer and usually called port.c.
Namely, the implementation of vPortYield found in that file is:
void vPortYield( void )
{
/* Set a PendSV to request a context switch. */
portNVIC_INT_CTRL_REG = portNVIC_PENDSVSET_BIT;
In particular:
1. The context switch is requested from an application task and, in this case, the
macro portYIELD just described is used.
2. The context switch is requested from within a FreeRTOS function, using the
macro portYIELD_WITHIN_API.
3. The context switch is requested from an interrupt service routine, us-
ing the macro portYIELD_FROM_ISR or, in some ports, the macro
portEND_SWITCHING_ISR, which has the same semantics.
When the processor eventually honors the software interrupt request, it auto-
matically saves part of the execution context onto the task stack, namely, the pro-
gram status register (xPSR), the program counter and the link register (PC and
LR), as well as several other registers (R0 to R3 and R12). Then it switches to a
dedicated operating system stack and starts executing the exception handling code,
xPortPendSVHandler.
Figure 10.2 Simplified stack diagrams after a FreeRTOS context save operation.
The handler first retrieves the task stack pointer PSP and stores it in the R0 register
(line 9). This does not clobber the task context because R0 has already been saved
onto the stack by hardware. Then, it puts into R2 a pointer to the current TCB taken
from the global variable pxCurrentTCB (lines 12–13).
The handler is now ready to finish the context save initiated by hardware by push-
ing onto the task stack registers R4 through R11 (line 15). At last, the task stack
pointer in R0 is stored into the TopOfStack field of the task control block (TCB),
which is dedicated to this purpose (line 16). At this point, the stack layout is as shown
on the left of Figure 10.2.
In particular,
• the stack pointer currently used by the processor, SP, points to the operating
system stack;
• the PSP register points to where the top of the task stack was after excep-
tion entry, that is, below the part of task context saved automatically by
hardware;
• the TopOfStack field of the current task TCB points to the top of the task
stack after the context save has been concluded.
21 portSAVE_CONTEXT
22 jsr vPortYieldHandler
23 portRESTORE_CONTEXT
• It loads the address of the TCB of the task to be restored and retrieves the
stack pointer saved into its first word (lines 12–13).
• It restores processor registers from the stack by means of a move multiple
instruction (line 14).
• It adjusts the stack pointer to release the memory area reserved when the
context was saved (line 15).
As can be seen, the similarities between the two context switch procedures on
the Cortex-M3 and ColdFire-V2 are still remarkably strong, despite the architectural
differences.
In the Cortex-M3 porting layer, the two macros just described are used directly
to implement portDISABLE_INTERRUPTS and portENABLE_INTERRUPTS, which
are invoked by FreeRTOS to disable and enable interrupts, respectively, from a
task context. In the general specification of porting layers, the two sets of macros are
independent from each other, as this distinction is needed on some architectures.
The last two functions related to interrupt handling, to be defined by the port-
ing layer, are portENTER_CRITICAL and portEXIT_CRITICAL. They are used
within FreeRTOS to delimit very short critical regions of code that are executed in
a task context, and must be protected by disabling interrupts.
Since these critical regions can be nested into each other, it is not enough to map
them directly into portDISABLE_INTERRUPTS and portENABLE_INTERRUPTS.
If this were the case, interrupts would be incorrectly reenabled at the end of the in-
nermost nested critical region instead of the outermost one. Hence, a slightly more
complex approach is in order.
For the Cortex-M3, as shown in previous listings, the actual implementation is
delegated to the functions vPortEnterCritical and vPortExitCritical.
The global variable uxCriticalNesting contains the critical region nesting
level of the current task. Its initial value 0xaaaaaaaa is invalid, to catch errors
during startup. It is set to zero, its proper value, when the operating system is about
to begin the execution of the first task.
The two functions are rather simple: vPortEnterCritical disables interrupts
by means of the portDISABLE_INTERRUPTS macro discussed before. Then, it in-
crements the critical region nesting counter because one more critical region has just
been entered. The function vPortExitCritical, called at the end of a critical
region, first decrements the nesting counter and then reenables interrupts by call-
ing portENABLE_INTERRUPTS only if the count is zero, that is, the calling task
is about to exit from the outermost critical region. Incrementing and decrementing
uxCriticalNesting does not pose any concurrency issue on a single-processor
system because these operations are always performed with interrupts disabled.
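A minimal sketch of the scheme just described is shown below; it closely mirrors, but does not reproduce verbatim, the actual porting layer source code.

void vPortEnterCritical( void )
{
    portDISABLE_INTERRUPTS();
    uxCriticalNesting++;
}

void vPortExitCritical( void )
{
    uxCriticalNesting--;
    if( uxCriticalNesting == 0 )
    {
        portENABLE_INTERRUPTS();
    }
}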
It should also be noted that, although, in principle, uxCriticalNesting should
be part of each task context—because it holds per-task information—it is not neces-
sary to save it during a context switch. In fact, due to the way the Cortex-M3 port has
been designed, a context switch never occurs unless the critical region nesting level
of the current task is zero. This property implies that the nesting level of the task
targeted by the context switch must be zero, too, because its context has been saved
exactly in the same way. Then it is assured that any context switch always saves and
restores a critical nesting level of zero, making this action redundant.
It takes as arguments the task stack pointer pxTopOfStack, the address from
which task execution should begin pxCode, and a pointer to the task parameter block
pvParameters. The return value of the function is the new value of the task stack
pointer after the initial context has been saved. The listing that follows shows the
Cortex-M3 implementation.
#define portINITIAL_XPSR ( 0x01000000UL )
#ifdef configTASK_RETURN_ADDRESS
#define portTASK_RETURN_ADDRESS configTASK_RETURN_ADDRESS
#else
#define portTASK_RETURN_ADDRESS prvTaskExitError
#endif
StackType_t *pxPortInitialiseStack(
StackType_t *pxTopOfStack, TaskFunction_t pxCode, void *pvParameters )
{
pxTopOfStack--;
*pxTopOfStack = portINITIAL_XPSR; /* xPSR */
pxTopOfStack--;
*pxTopOfStack = ( StackType_t ) pxCode; /* PC */
pxTopOfStack--;
*pxTopOfStack = ( StackType_t ) portTASK_RETURN_ADDRESS; /* LR */
pxTopOfStack -= 5; /* R12, R3, R2 and R1. */
*pxTopOfStack = ( StackType_t ) pvParameters; /* R0 */
pxTopOfStack -= 8; /* R11, R10, R9, R8, R7, R6, R5 and R4. */
return pxTopOfStack;
}
By comparing the listing with Figure 10.2, it can be seen that the initial context is
set up as follows:
• The initial processor status register xPSR is the value of the macro
portINITIAL_XPSR.
• The program counter PC comes from the pxCode argument.
• The link register LR must initially contain the return address of the main
task function, that is, the address from which execution must resume when
the main task function returns.
It is set to the value of the portTASK_RETURN_ADDRESS macro. In turn,
the macro may take two different values, depending on how FreeRTOS
has been configured.
1. If the port layer-dependent configuration macro configTASK_RETURN_ADDRESS
is set, the value of that macro is taken as the main
task return address, thus allowing programmers to override the porting
layer default, to be discussed next.
2. Otherwise, the return address is set to the address of function
prvTaskExitError, which is defined by the porting layer itself,
although it is not listed here for conciseness. That function basically
contains an endless loop that locks the task in an active wait and can
be caught, for instance, by debuggers.
• Register R0, which holds the first (and only) argument of the main task
function, points to the task parameter block pvParameters.
• The other registers are not initialized.
The only aspect worth mentioning is the peculiar constant put on the stack at line
10. This is the stack position where the processor expects the return address to be
when the main task function returns to the caller. Since this should never happen, ac-
cording to the FreeRTOS specifications, the stack initialization code shown above
stores an invalid address there. In this way, any attempt to return from the main task
function will trigger an illegal memory access and can be detected easily.
We have already examined the architecture-dependent functions that switch the
processor from one task to another. Starting the very first task is somewhat an ex-
ception to this general behavior and, in the Cortex-M3, is performed by the private
port-layer function prvPortStartFirstTask. The implementation, listed below,
is based on generating and handling a software interrupt request.
1 static void prvPortStartFirstTask( void ) __attribute__ (( naked ));
2 void vPortSVCHandler( void ) __attribute__ (( naked ));
3
4 static void prvPortStartFirstTask( void )
5 {
6 __asm volatile(
7 " ldr r0, =0xE000ED08 \n"
8 " ldr r0, [r0] \n"
9 " ldr r0, [r0] \n"
10 " msr msp, r0 \n"
11 " cpsie i \n"
12 " dsb \n"
13 " isb \n"
14 " svc 0 \n"
15 " nop \n"
16 );
17 }
18
to thread mode. A similar, automatic processor mode switch for exception handling
is supported by most other modern processors, too, although the exact names given
to the various execution modes may be different. The ColdFire-V2 implementation
is quite similar and is not shown here for conciseness.
10.9 SUMMARY
This chapter went through the main components of the FreeRTOS porting layer.
Looking at how the porting layer of a real-time operating system is built is worth-
while for at least two reasons. First of all, it helps to refine concepts, like context
switch (Section 10.4) and the implementation of critical regions by disabling inter-
rupts (Section 10.5), because it fills the gap between their abstract definition and their
concrete implementation.
Secondly, it better differentiates the general behavior of operating system prim-
itives from the peculiarities and limitations of their implementation on a specific
processor architecture. Due to lack of space, the presentation is far from being ex-
haustive but can be used as a starting point for readers willing to adapt an operating
system to an architecture of their interest.
11 Performance and
Footprint at the
Toolchain Level
CONTENTS
11.1 Overview of the GCC Workflow and Optimizations .................................... 300
11.1.1 Language-Dependent Workflow ....................................................... 301
11.1.2 Language-Independent Workflow..................................................... 303
11.1.3 Target-Dependent Workflow............................................................. 314
11.2 Optimization-Related Compiler Options....................................................... 316
11.3 Architecture-Dependent Compiler Options................................................... 320
11.4 Source-Level Optimization: A Case Study ................................................... 323
11.4.1 Summary of the Algorithm............................................................... 324
11.4.2 Base Source Code ............................................................................. 327
11.4.3 Optimizations.................................................................................... 331
11.5 Summary........................................................................................................ 342
In fact, provided that the inner workings of the code generation algorithms are
known, the algorithms and the corresponding source code can be reworked to drive
the compiler to generate better code. The second part of the chapter briefly dis-
cusses a few basic techniques to achieve this goal, by means of a running real-world
case study.
Figure 11.1 contains a summary of the compiler workflow, which will be used as
a reference and starting point for the discussion. As shown in the three rows of the
figure, the overall workflow can be divided into three parts, performed in sequence:
of a library, called cpplib. This library is common to both the standalone pre-
processor, which can be invoked on its own, and the compiler.
2. Parsing, which transforms the source code into an internal, language-dependent
representation that faithfully represents it, but is easier to manipulate than the
source code text.
The parsing process is driven by the language syntax and, even more importantly,
by its grammar rules [1]. As a consequence, the general shape of the parsed out-
put resembles the shape of the grammar rules themselves. This is the main reason
why, even after parsing, language dependencies remain in the internal representa-
tion of the source code.
In most cases, regardless of the language, the internal language-dependent rep-
resentations used by the compiler take the form of a tree, whose nodes are heavily
decorated with information derived from the source code.
For instance, the internal representation of a data structure definition in the C
language may take the form of a tree, in which:
• The root node represents the data structure definition as a whole and may
be decorated with the name of the structure.
• Its children represent individual fields in the data structure. Each of them is
decorated with the field name and its data type. In turn, the data type can
be either a primitive data type (like int) that can be represented as a single
node, or a more complex data type (like a pointer to another user-defined
data structure). In the second case, the data type is described by its own
subtree.
In the case of the C language, GCC makes use of the so-called GENERIC trees.
Those trees are language independent in principle but, if deemed necessary, a front
end for a certain language can extend the GENERIC trees it generates by introducing
some custom, language-dependent node types.
The only requisite is that, in doing so, the front end must also provide an appro-
priate set of functions, often called hooks, associated to each of these custom node
types. The functions are able to convert custom nodes into a different representation,
called GIMPLE, to be used by the target-independent optimizer. For historical rea-
sons, the intermediate representation used by the C and C++ front ends makes use of
those extensions.
The third and last phase, depicted on the far right of Figure 11.2, is at the boundary
between the language-dependent and the language-independent parts of the work-
flow. It is called gimplification in the official GCC documentation [159] because its
purpose is to convert a GENERIC tree into a GIMPLE tree. For the C language, the
translation takes place in a single pass and on a function-by-function basis.
The GIMPLE tree grammar is a simplified and more restrictive subset of the
GENERIC tree grammar, which is more suitable for use in optimization. For instance,
expressions can contain no more than three operands (except function calls, which
can still have an arbitrary number of arguments as required by the language), and
hence, more complex expressions are broken down into a 3-operand form by intro-
ducing temporary variables to hold intermediate results.
Moreover, side effects of expression evaluation are confined to be on the right-
hand side of assignments, and cannot appear on the left-hand side. As before, tem-
porary variables are introduced as needed to secure this property.
Last, GIMPLE trees keep no track of high-level control flow structures, such as
loops. All those control structures are transformed and simplified—or lowered as
is commonly said in compiler theory—into conditional statements (also called if
statements in the following and in the compiler documentation) and jumps (also
called goto).
For the history-oriented readers it is also interesting to remark that the features of
GIMPLE trees were derived from the SIMPLE representation adopted by the McCAT
compiler project [69], with several updates to make the representation, designed in a
mainly research-oriented environment, more suitable for a production compiler.
As recalled previously, the conversion from GENERIC to GIMPLE trees is rela-
tively simple and takes place in a single pass. However, for other languages, the in-
ternal representation produced by the parser in the first place may be quite far away
from GIMPLE.
This is because, as also mentioned previously, the shape of the parser output
heavily depends on the language grammar. As a consequence, the structure of the
language-dependent part of the workflow becomes more involved, as additional con-
version passes are needed.
In many cases, in order to keep compiler development effort to a minimum,
these passes involve GENERIC trees, too, because a semi-automatic converter from
GENERIC to GIMPLE trees is available.
• It helps to better understand how the source file is “seen” by the com-
piler and how code optimization options, to be described in Sections 11.2
and 11.3, work.
• A better grasp of how the compiler eventually transforms source code into
machine code is also useful when it is necessary to modify the structure
of the source code and help the compiler to produce better machine code.
Source-level optimization will be the topic of Section 11.4.
Table 11.1
Compiler Options to Dump the Internal Code Representation
Option Purpose
Language-dependent workflow
-dD Dump all macro definitions at the end of the preprocessing step.
-dI Dump all #include directives at the end of the preprocessing step.
Language-independent workflow
-fdump-tree-ssa Dump the internal SSA representation. The output file name is
formed by appending .ssa to the source file name.
-fdump-tree-copyrename Dump the internal representation after applying the copy rename
optimization. The output file name is generated by appending
.copyrename to the source file name.
-fdump-tree-dse Dump the internal representation after applying dead store elimina-
tion. The output file name is made by appending .dse to the source
file name.
-fdump-tree-optimized Dump the internal representation after all tree-based optimizations
have been performed.
Target-dependent workflow
-dp Annotate the assembler output, enabled by using the -S compiler op-
tion, with a comment indicating which RTL code generation pattern
was used.
-dP Extended version of -dp, it also annotates the assembled output with
the corresponding RTL.
-fdump-rtl-all Dumps the RTL representation at various stages of the target-
dependent workflow. Specific dumps can also be enabled selectively
by replacing all with other options.
Some command-line options ask the compiler to dump into some output files
a human-readable transcript of its internal representation at various stages of the
compilation workflow. The real output produced by the compiler when it is invoked
on some short and simple code excerpts will be used to guide the description that
follows. The main options are summarized in Table 11.1.
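For instance, a dump of the SSA form and of the copy rename pass for a single source module could be requested as follows; the source file name is illustrative and the output files are named as described in Table 11.1.

gcc -O1 -c -fdump-tree-ssa -fdump-tree-copyrename f.c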
Before going deeper into the description of the language-independent part of the
workflow, let us refer back to Figure 11.2 and present some options of the same kind,
but still related to the language-dependent part of the workflow pertaining to the C
language. Namely:
• The option -dD (not to be confused with -dd, which is also valid but en-
ables a totally different kind of dump) asks the compiler to dump all macro
definitions (performed by means of #define directives) at the end of the
preprocessing step.
• The option -dI has a similar meaning, but for #include directives. Also
in this case, the output is produced at the end of the preprocessing step.
Beyond code optimization, both features are in some cases very useful to better re-
alize what preprocessing actually did and how it affected the results of the compiler.
This is especially true when dealing with complex code modules or—as sometimes
happens with open-source software—when compiling a code module without hav-
ing a thorough understanding of its internal structure, in particular for what concerns
header files. For instance, at first glance:
• It may not be totally clear which macro definitions are in effect when com-
piling a certain source module, and in which header files those definitions
are, in particular when conditional compilation is also in effect.
• Moreover, it is common that in a cross-compilation environment there are
multiple versions of the same header file. For instance, there may be a
stdio.h for the native compiler, plus one more stdio.h for each target
architecture supported by the cross compilers. All these headers have the
same name but reside at different places in the file system, and hence, it is
hard to distinguish them from one another by just browsing the #include
directives in the source code.
On the other hand, the option -fdump-tree asks the compiler to dump one or
more tree-based internal representations. It applies to both the language-dependent
and the language-independent parts of the workflow.
Its general syntax is -fdump-tree-<switch>-<options>, where
The control flow graph is a data structure built on top of an internal code
representation—a tree-based representation in this case—that abstracts the control
flow behavior of a unit of code. In GCC, control flow graph generation takes indi-
vidual functions as units and works on them one by one. In the control flow graph
nodes represent basic blocks of code, and directed edges link one node to another to
represent possible transfers of control flow from one basic block to another.
A basic block is a sequence of statements with exactly one entry point at the
beginning and one exit point at the end. To all purposes, it can therefore be considered
as an atomic, indivisible unit by many optimization techniques. Accordingly, control
flow graph generation scans the tree-based representation of a function, splits it into
basic blocks and identifies all the directed edges that connect them.
At this point, the reason behind some of the preliminary optimizations discussed
previously should become clearer. For instance, the two goto statements found in
if statements after control flow simplification can be mapped into directed edges of
the control flow graph in a straightforward way.
At a later time, the directed edges in the control flow graph are used as a base to
establish dependencies among basic blocks that read and/or modify a given variable.
In turn, this information is the starting point to build the data flow graph of a function,
which drives further optimization techniques.
Referring back to Figure 11.3, the central step of the language-independent work-
flow is to translate, or rewrite, as it is called in the compiler documentation, the
tree-based representation into static single assignment (SSA) form [43]. This form
has been chosen because plenty of theoretical work has been performed to develop
efficient, language-independent optimizations based on it.
One basic assumption the SSA representation is based upon is that program vari-
ables are assigned in exactly one place. Of course, this is an unrealistic assumption
because in any real-world program variables are normally assigned multiple times.
In order to address the issue the compiler attaches a version number to all program
variables and manages it appropriately. In particular:
• Upon encountering the very first assignment to a, shown at the top of the
figure, the compiler qualifies the variable with version number 1.
• When a is referenced on the right-hand side of the next statement, the com-
piler qualifies the reference with version number 1, according to the control
flow graph (that, as a side note, corresponds to the arrows of the flowchart).
• The next statement is a new assignment to variable a, and hence, the com-
piler assigns a new version number to it. In the example, we assume that
version numbers are consecutive integers, although this is not strictly re-
quired, as long as they are all unique when they are generated. Following
this assumption, the assignment refers to version 2 of variable a.
• In the right (true) branch of the conditional statement, a is referenced again.
This time, the compiler qualifies the reference with version number 2 and
not 1, because a has been reassigned and there is no way to “reach” version
1 any more.
• In both branches of the conditional statement, a is reassigned, too. The
corresponding flowchart blocks have been highlighted in a darker shade of
gray in Figure 11.4. Accordingly, the compiler generates two fresh version
numbers (3 and 4) and qualifies the assignments appropriately.
The last statement of the flowchart, shown at the bottom of the figure and also
highlighted by means of a darker shade of gray, is the most problematic from the
point of view of version numbers. This is because the compiler cannot determine
with certainty at compile-time whether the execution flow will follow the left or the
right branch of the conditional statement at run-time.
In fact, this depends on the value of b and, in turn, it depends on the value of a
and c through the assignment marked with ⋆ in the figure. The value of a is known
at compile-time, but the value of c is not. For this reason, the version of a used in the
last statement can be either 3 (if execution follows the left path) or 4 (if execution
follows the right path).
Accordingly, the compiler creates version 5 of a and qualifies the reference to it
accordingly. The definition of this new version of a is performed, as outlined previ-
ously, by inserting an artificial definition. It is shown as a rounded rectangle at the
bottom left of Figure 11.4.
The artificial definition is carried out by means of a fictitious function ϕ that
combines versions 3 and 4 of a in an unspecified way to produce version 5, thus
remarking the dependency of version 5 from those other versions. It should also be
noted that neither a_1 nor a_2 nor a_3 is mentioned in the ϕ node because none of
these versions can possibly be used at this point.
Let us now consider a more specific and concrete example, by examining and
commenting on the actual compiler output, corresponding to the C function defini-
tion listed below.
int f(int k) {
int m;
m = (k > 0 ? 3 : 5);
return m;
}
The human-readable dump of the internal SSA representation, shown below, has
been obtained by means of the -fdump-tree-ssa option, listed in Table 11.1.
Concerning this output, the first thing worth noting is that, even though the internal
representation is a tree, the dump takes the form of C-like code, to make it easier to
read and understand.
1 ;; Function f (f)
2
3 f (k)
4 {
5 int m;
6 int D.1177;
7 int iftmp.0;
8
9 <bb 2>:
10 if (k_2(D) > 0)
11 goto <bb 3>;
12 else
13 goto <bb 4>;
14
15 <bb 3>:
16 iftmp.0_3 = 3;
17 goto <bb 5>;
18
19 <bb 4>:
20 iftmp.0_4 = 5;
21
22 <bb 5>:
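;; (remaining lines of the dump, sketched here from the discussion that follows)
23 # iftmp.0_1 = PHI <iftmp.0_3(3), iftmp.0_4(4)>
24 m_5 = iftmp.0_1;
25 D.1177_6 = m_5;
26 return D.1177_6;
27 }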
By looking further into the output we can also notice that, even though the input
function was extremely simple, the compiler performed all the activities mentioned
so far. In particular:
• Besides m, two additional local variables have been added, that is, D.1177
to hold the function return value and iftmp.0 to store the result of the
conditional expression before assigning it to m.
• Control flow has been simplified by converting the conditional expression
into an if statement in which both branches consist of a single goto.
• The code has been split into 4 basic blocks, labeled from <bb 2> to
<bb 5>. Basic blocks are connected by means of goto statements, which
represent the directed edges of the control flow graph.
• Concerning goto statements, the one connecting <bb 4> to <bb 5> has
been removed because it has been recognized as provably useless (the two
blocks are in strict sequence).
Moreover, version numbers have been assigned to all variable assignments and
references, as better detailed in the following.
• Argument k has a single version number, that is, 2. This version number is
assigned to the variable at the beginning and does not change through the
code because there are no additional assignments to it.
• Similarly, variables m and D.1177 have version number 5 and 6, respec-
tively, all the way through the code. This is because they are assigned only
once (at lines 24–25).
• On the contrary, temporary variable iftmp.0 gets two different version
numbers, depending on the side of the conditional statement it is assigned
to. Namely, version number 3 corresponds to the true branch of the state-
ment (line 16), while version 4 corresponds to the false branch (line 20).
• When iftmp.0 is eventually assigned to m (line 24) a new version number
becomes necessary because the compiler cannot determine which version
of iftmp.0 will be used at runtime. Therefore:
• The compiler introduces a new version of iftmp.0, version 1, to be
used on the right-hand side of the assignment.
• It also inserts a ϕ node (line 23) that highlights how version 1 of
iftmp.0 depends on versions 3 and 4 of the same variable.
As a final remark, attentive readers have probably noticed that the compiler output
contains additional annotations—for instance, the (D) annotation that follows the
reference to k_2 at line 10. Unfortunately, they cannot be commented on in any
detail here, and hence, readers are referred to the compiler documentation [159] for
further information.
Most language-independent optimizations are performed on the SSA form.
Among the main ones, we briefly recall:
1. Dead code elimination. Statements without side effects that may affect future
computation and whose results are unused are removed. Since other optimizations
may change the code and make more statements useless, this pass is repeated
multiple times during the optimization process.
2. Forward propagation of single-use variables. This optimization identifies vari-
ables that are used exactly once. Then, it moves forward their definition to the
point of use, in order to check whether or not the result leads to simpler code.
3. Copy renaming. In this pass, the compiler tries to rename variables involved
in copy operations so that copies can be coalesced. This includes temporary
compiler-generated variables that are copies of user variables. In this case, the
compiler may rename compiler-generated variables to the corresponding user
variables.
4. Replace local aggregates with scalars. Local aggregates without aliases are re-
placed with sets of scalar variables. By intuition, this makes subsequent optimiza-
tions easier and more effective because, after this pass, each component of the
aggregate can be handled as an independent unit by the optimizer.
5. Dead store elimination. This optimization removes memory store operations that
are useless because the same memory location is overwritten before being used.
6. Forward movement of store operations. Store operations are moved along the con-
trol flow graph, to bring them as close as possible to their point of use.
7. Loop optimizations. A complex group of related optimizations works on loops.
Among them, the main ones are:
• Loop-invariant statements—that is, statements that always produce the same
result and no side effects at every iteration of the loop—are moved out of the
loop in order to evaluate them only once.
• Another optimization pass identifies induction variables, that is, variables that
are incremented (or decremented) by a fixed amount upon each loop iteration,
or depend linearly on other induction variables.
Then, it tries to optimize them by reducing the strength of their computation,
merging induction variables that consistently assume the same values through
loop iterations, and eliminating induction variables that are used rarely and
whose value can easily be derived from others.
• When the loop contains conditional jumps whose result is loop-invariant, it
is possible to evaluate them only once, out of the loop, and create a simpli-
fied version of the loop for each possible outcome of the conditional jumps
themselves. This optimization is sometimes called loop unswitching.
• Loops with a relatively small upper bound on the number of iterations can
be unrolled, partially or completely. Informally speaking, complete loop un-
rolling replaces the loop by an equivalent sequence of instructions, thus delet-
ing loop control instructions and possibly simplifying the code produced for
5 <bb 2>:
6 if (k_2(D) > 0)
7 goto <bb 3>;
8 else
9 goto <bb 4>;
10
11 <bb 3>:
12 m_3 = 3;
13 goto <bb 5>;
14
15 <bb 4>:
16 m_4 = 5;
17
18 <bb 5>:
19 # m_1 = PHI <m_3(3), m_4(4)>
20 m_5 = m_1;
21 m_6 = m_5;
22 return m_6;
23 }
As can be seen, both temporary variables D.1177 and iftmp.0 have been re-
moved because it was possible to rename them into local variable m, introducing
several additional version numbers for it. Namely:
• iftmp.0_3 was renamed to m_3,
The following listing shows instead the internal representation of function f after
several other optimizations, namely forward propagation of single-use variables and
dead store elimination, have been carried out.
f (k)
{
int m;
<bb 2>:
if (k_2(D) > 0)
goto <bb 4>;
else
goto <bb 3>;
<bb 3>:
<bb 4>:
# m_1 = PHI <3(2), 5(3)>
return m_1;
In this case, both m_3 and m_4 are single-use variables. In fact, they are used
only in the ϕ node that defines m_1. For this reason, they have been deleted and their
definitions, that is, the constants 3 and 5, have been forward-propagated to their point
of use. Moreover, the compiler determined that the sequence of stores into m, found
at lines 20–21 of the listing on page 312, was useless and has removed it.
The very last step performed on the internal SSA representation before entering
the target-dependent part of the compiler workflow is to remove version numbers
attached to variables, leading to the final tree-based representation in normal form. It
is shown in the following listing for our example function f.
f (k)
{
int m;
<bb 2>:
if (k > 0)
goto <bb 4>;
else
goto <bb 3>;
<bb 3>:
m = 5;
goto <bb 5>;
<bb 4>:
m = 3;
<bb 5>:
return m;
Even though, for the sake of simplicity, a detailed description of RTL syntax is
outside the scope of this book, it is nevertheless useful to note the strong relation-
ship between the RTL representation and the final machine code produced by the
compiler, listed in the following.
00000000 <f>:
0: e3500000 cmp r0, #0
4: c3a00003 movgt r0, #3
8: d3a00005 movle r0, #5
c: e12fff1e bx lr
• The first one is a comparison statement that sets the processor condition
codes (compare:CC). It compares register r0 with the constant integer
zero (const_int 0).
• The second and third statements are similar. Both represent condition-
ally executed instructions (cond_exec) depending on condition codes
(reg:CC). They move different constants (const_int 3 in the first case
and const_int 5 in the second) depending on whether register r0 was
greater than (gt), or less than or equal to (le) zero, respectively, when the
condition codes were set.
• The fourth statement indicates that register r0 is in use to hold the func-
tion’s return value.
• The last statement denotes a return point of the function.
As can be seen from the machine code listing, all RTL statements except the
fourth one indeed have a one-to-one correspondence with machine instructions.
Table 11.2
Compiler Options That Control Optimization
Option Purpose
The -O option turns on a set of basic optimizations. As shown in the table, the option can also be
followed by an integer number between 1 and 3 to set various optimization levels
and turn on more and more optimization options.
In general, the higher the number, the “better” optimization becomes, at the expense of compilation time and memory requirements. However, especially in embedded software development, it is important to note the difference between -O2 and lower optimization levels on the one hand, and -O3 on the other.
In fact, up to optimization level -O2, the various optimization techniques that are
enabled do not involve a tradeoff between execution speed and size of the generated
code, because they generally improve both.
On the other hand, optimizations enabled by -O3 attempt to further improve ex-
ecution speed even at the expense of code size, which may increase noticeably. Fur-
thermore some of them, like loop unrolling, may affect execution time predictability
in unexpected ways, as will be shown in Section 11.4.
Table 11.3
Additional Optimizations Enabled by Optimization Level 3
Option Purpose
In particular, with respect to -O2, optimization level -O3 turns on the additional
optimizations listed in Table 11.3. Each of them will be briefly discussed in the
following.
In addition, depending on the compiler version, optimization level -O3 may also
enable loop unrolling. This represents a typical case of “controversial” optimization
that, on a case by case basis, may either improve or worsen execution speed and, even
more, execution time predictability. As an example, a thorough description of the
unforeseen side effects of loop unrolling on the execution time of a simple algorithm
on the Cortex-M3 architecture will be given in Section 11.4.
Last, but not least, the optimization option -Os directs the compiler to reduce
code size. Even though the selection of this optimization level may adversely affect
execution speed, it is indeed quite useful in embedded software development when the amount of memory available for code storage is limited.
Table 11.4
Main Architecture-Dependent Compiler Options, ARM Architecture
Option Purpose
In the ARM family, the matter is even more complex due to the fact that some
members of the family simultaneously support two distinct instruction sets, namely:
• the ARM instruction set, made of fixed-length, 32-bit instructions, and
• the Thumb instruction set, originally made of more compact, 16-bit instructions.
The designs of the two instruction sets differ significantly because the first one is
mainly aimed at obtaining top execution speed, whereas the second one favors code
compactness. For the sake of completeness, it should also be mentioned that the
Thumb instruction set design was later extended to include some 32-bit instructions,
and hence, it became a variable-length instruction set. In these cases, the -mcpu or
-march options may not be sufficient to uniquely indicate which instruction set the
compiler must use. This is done by means of an additional option, that is, -mthumb.
A related problem is whether or not it should be possible to mix functions com-
piled using different instruction sets in the same executable program, and hence,
whether or not these functions should be able to call each other. In the ARM fam-
ily this possibility comes at a code size and performance cost, because the in-
struction sequence required for a function call becomes more complex. Hence, it
is not enabled automatically and must be turned on manually, by means of the
-mthumb-interwork option.
A second group of options, still related to the instruction set of the target archi-
tecture, concerns floating point calculations and is more closely related to embed-
ded systems development. In fact, in many processors designed for embedded ap-
plications (and unlike most general-purpose processors) the hardware floating-point
unit (FPU) is an optional component, which is often not included on the chip due to
cost and power consumption considerations.
On systems without hardware support for floating-point instructions, floating-
point calculations can still be performed in software, by means of two distinct
approaches:
The results are even more interesting because a satisfactory level of optimization
was indeed achieved across different processor architectures with the same updates
to the source code and without leveraging any detailed knowledge of the processor
machine language. Namely, neither hand-written nor hand-optimized assembly code
has been developed.
The two processor architectures being considered are the same ones already
used as examples throughout the book. Namely, the first one is represented by the
NXP LPC2468 microcontroller [126], which is a low-end component based on an
ARM7TDMI-S processor core and running at 72 MHz. The second one is repre-
sented by NXP LPC1768 microcontroller [128], based on the contemporary ARM
Cortex-M3 processor core running at 100 MHz.
• The exact duration of frame transmission depends not only on the size of the
payload, as is natural, but also on its content. This is because, according to
the definition given previously, the CAN controller may or may not insert a
stuff bit at a certain position depending on the specific sequence of payload
bits found there.
For both reasons, avoiding the insertion of stuff bits when transmitting the payload
assumes practical relevance, especially when dealing with demanding applications
characterized by tight timing constraints and/or heavy dependability requirements.
As was shown in other research work, it is also possible to avoid the insertion of stuff
bits in all the other parts of the frame besides the payload, thus reaching completely
jitterless CAN communication [32].
As shown in Figure 11.6, the basic principle of 8B9B encoding is fairly simple
and revolves around the following steps.
• The encoder retrieves the data it operates on from an input buffer (shown
on the left of the figure) and stores its result into an output buffer (on the
right of the figure). Both buffers are defined as an array of bytes.
• The input buffer contains k payload bytes, with 0 ≤ k ≤ 7, to be encoded.
As shown in the figure, they are naturally aligned on byte boundaries.
• The encoder translates, by means of a suitable forward lookup table, each individual payload byte into a 9-bit codeword. In the abstract, the table consists of 2^8 = 256 entries, and each entry is 9 bits wide.
• Codewords are packed into the output buffer one after another and are pre-
ceded by a break bit. Because of the presence of the break bit itself and due
to their size, codewords are generally not aligned on byte boundaries.
• After the last codeword (codeword k − 1 in the figure) has been stored into
the output buffer, it is also necessary to pad the buffer with a suitable bit
pattern to completely fill the last byte.
• It can be proved that, except when k = 0, the number of bytes needed in the
output buffer is always equal to k + 1, when encoding k payload bytes. The
case k = 0 is handled trivially, since no actual encoding is done in this case.
A thorough explanation of the design of the 8B9B algorithm and its lookup ta-
ble in order to prevent bit stuffing from occurring is beyond the scope of this book.
Interested readers are referred to [31], where all the technical details can be found.
Informally speaking, 8B9B encoding relies on the following properties:
1. No codeword contains 5 or more consecutive bits at the same value, and hence, codewords do not trigger the bit stuffing mechanism by themselves.
2. No codeword starts or ends with 3 or more consecutive bits at the same value. For this reason, the bit pattern appearing across two consecutive codewords cannot trigger the bit stuffing mechanism either.
3. The break bit prevents 5 or more consecutive bits from appearing across the
boundary between the encoded payload and the field that precedes it in the CAN
frame, namely, the data length code (DLC) that is equal to the size of the en-
coded payload, expressed in bytes. For this reason, the break bit is chosen as the
complement of the least significant bit of the DLC.
4. The pad is made of an alternating bit pattern of suitable length. Besides making
the encoded payload size an integral number of bytes—and hence, making it suit-
able for transmission—it also has the same purpose as the break bit. Namely, it
prevents 5 or more consecutive bits at the same level from appearing across the
boundary between the encoded payload and the next field of the CAN frame.
In this book, we are instead more concerned about two general characteristics of
the 8B9B encoding process, because they are common to many other algorithms. In
particular:
1. Individual payload bytes are translated into items of a different size (9 bits in this
case) by means of a lookup table. Since the table resides in main memory, it must
be implemented and consulted in the most efficient way, as far as both storage space and access time are concerned.
2. Codewords are not an integral number of bytes in size and must be tightly packed
in the output buffer. Therefore, codewords cross byte boundaries in the output
buffer and each output byte, in general, consists of two adjacent sections. In turn,
these sections consist of parts of two adjacent codewords.
For instance, as shown in Figure 11.6, the fourth output byte consists of the trail-
ing part of codeword 2 (light gray) as well as the leading part of codeword 3
(darker gray). As a consequence, output bytes must be built by shifting and masking two adjacent codewords into place.
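As a purely illustrative example of this shift-and-mask technique, not part of the encoder presented in this chapter, a generic helper that packs a 9-bit codeword into a byte-oriented buffer at an arbitrary bit offset could look as follows; it assumes the buffer has been zero-filled beforehand and that the caller guarantees the second byte it touches is within the buffer.

#include <stdint.h>

/* Hypothetical helper: pack the 9-bit codeword w into buffer out, starting
   at bit offset p (bit 0 is the most significant bit of out[0]). */
static void pack9(uint8_t *out, unsigned p, unsigned w)
{
    unsigned byte = p / 8;        /* first byte touched by the codeword     */
    unsigned bit  = p % 8;        /* bits already in use in that byte       */

    out[byte]     |= (uint8_t)(w >> (1 + bit));           /* leading part   */
    out[byte + 1] |= (uint8_t)((w << (7 - bit)) & 0xFF);  /* trailing part  */
}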
For the sake of completeness, it must be mentioned that the 8B9B decoder pro-
ceeds symmetrically with respect to the encoder and, from the algorithmic point of
view, it is quite similar to it. For this reason, only the encoder will be further analyzed
in the following.
17 if(in->dlc == 0)
18 out->dlc = 0;
19
20 else
21 {
22 /* If in->dlc is even <-> out->dlc is odd, except when in->dlc
23 is zero, but this case has already been handled.
24 */
25 out->dlc = in->dlc + 1;
26 out->data[0] = (out->dlc & 1) ? 0 : 128; /* Break Bit */
27
28 os = 7; /* Available bits in output byte i */
29 for(i=0; i<in->dlc; i++)
30 {
31 w = flt[in->data[i]]; /* 9 lsb */
32 l = w & lm[os]; /* os msb, msb @8 */
33 r = w & rm[os]; /* os lsb, lsb @0 */
34
35 /* Due to the 9/8 relationship, the presence of the break
36 bit, and the limited maximum length of the payload, the
37 9 output bits are always split between two output
38 bytes. In other words, os cannot be zero and the code
39 becomes simpler.
40 */
41 out->data[i] |= (l >> (9-os));
42 out->data[i+1] = (r << (os-1));
43 os--;
44 }
45
46 /* Pad */
47 out->data[i] |= (PAD >> (9-os));
48 }
49 }
where the ... within braces is a placeholder for the array contents, which
are not shown for conciseness.
• Then, the codeword must be split into two parts:
1. the most significant part of the codeword, consisting of os bits, which
must be stored in the least significant part of the current output byte
out->data[i];
2. the least significant part of the codeword, consisting of 9-os bits,
which must be stored in the most significant bits of the next output
byte out->data[i+1].
In order to isolate and extract, or mask, the two parts, the code makes use of two auxiliary lookup tables, lm[] and rm[], respectively, defined as:
const uint16_t lm[8] =
{0x000, 0x100, 0x180, 0x1C0, 0x1E0, 0x1F0, 0x1F8, 0x1FC};
const uint16_t rm[8] =
{0x000, 0x0FF, 0x07F, 0x03F, 0x01F, 0x00F, 0x007, 0x003};
As can be seen from the listing above, element lm[os] contains a 9-bit
mask with the os most significant bits set to 1 and the remaining ones
set to 0. Symmetrically, rm[os] contains a 9-bit mask with 9-os least
significant bits set to 1 and the others to 0. In both arrays, the element with
index 0 is unused.
In this way, as is done at lines 32–33 of the listing, the two parts of the
codeword can be masked by means of a bit-by-bit and operation. The results
are stored in local variables l and r for the left and right part, respectively.
• As discussed in the algorithm description, before storing them in the output
bytes, l and r must be appropriately shifted, too. This is done directly in
the assignments to the output bytes at lines 41 and 42 of the listing. De-
termining that the direction and amount of shift specified in the listing are
correct is left as an exercise to readers.
An additional point worth remarking on is that the assignment to the current output byte out->data[i] is done through a bit-by-bit or operation. This is because part of the current output byte has already been filled in the previous iteration of the loop (or before entering the loop, for the first iteration), and hence, its contents must not be overwritten.
Instead, the next output byte out->data[i+1] has not been used so far, and hence, a direct assignment is appropriate. This approach brings the additional benefit of setting the rest of out->data[i+1] to zero.
• The overall data flow pertaining to one iteration of the 8B9B encoding loop
is summarized in Figure 11.7. In the figure, rectangular boxes represent data
items and circles denote operations. When a data item is stored in a named
variable, rather than being just the value of an expression, the variable name
is shown beside the box, too.
• The last instruction within the loop body, at line 43, updates the value of os to prepare for the next iteration. It is in fact easy to prove that, if os bits were available for use in the current output byte, os-1 bits are left for use in the next output byte after storing a 9-bit codeword across them: the next byte receives the remaining 9-os codeword bits, so 8-(9-os) = os-1 of its bits remain free.
• The very last operation performed by the encode function is to pad the
last output byte with a portion of the alternating bit pattern assigned to the
PAD macro. This is done at line 47 that, quite unsurprisingly, bears a strong
similarity with line 41.
As can be seen from the listing, in the initial implementation most effort has been
spent to ensure that the code is correct—that is, it matches the definition of the 8B9B
algorithm outlined in Section 11.4.1—rather than efficient.
This is a rather widespread approach in software design and development, in
which an initial version of the code is produced first, in order to verify algorithm
correctness. Then, possibly based on code inspection, benchmarking, and profiling (when supported by the underlying target architecture), the critical parts of the code are identified and optimized.
This also helps to avoid the so-called “premature optimization” issues, in which
effort is spent optimizing parts of code that turn out to be unimportant for overall
performance at a later stage.
11.4.3 OPTIMIZATIONS
In order to assess the encoder performance and gather information regarding possible
optimizations, the software module discussed in the previous section was evaluated
by encoding a large set of uniformly distributed, non-empty pseudo-random mes-
sages of varying lengths.
Namely, the software was embedded in a software test framework and cross-
compiled for the NXP LPC2468 microcontroller [126]. Then, the amount of time
spent in the encoding function, as a function of the payload size, was measured by
means of a free-running, 32-bit counter clocked at the same clock frequency as the
CPU. In this way, it was possible to collect a cycle-accurate encoding delay mea-
surement for each message.
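In practice, such a measurement can be taken by sampling the free-running counter before and after the call, as in the following sketch; the counter register name and the header providing payload_t and encode() are placeholders, not the actual LPC2468 definitions.

#include <stdint.h>
#include "encode.h"   /* hypothetical header providing payload_t and encode() */

extern volatile uint32_t TIMER_TC;   /* placeholder for the free-running,
                                        CPU-clocked 32-bit counter register   */

/* Number of CPU clock cycles spent encoding one message */
static uint32_t measure_encode(payload_t *in, payload_t *out)
{
    uint32_t t0 = TIMER_TC;          /* timestamp taken before encoding       */
    encode(in, out);                 /* function under measurement            */
    return TIMER_TC - t0;            /* unsigned arithmetic gives the right
                                        result even across one counter wrap   */
}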
Working on a large number of pseudo-random messages made it easier to single out any data dependency in the delay itself. External sources of measurement noise
were avoided by running the experiments in a very controlled environment, with in-
terrupts disabled and unused peripheral devices powered down, to avoid unexpected
activities on the internal microcontroller buses.
As can be expected, the overall delay of the encoding function has two distinct
components.
1. A constant part b, which can be attributed to the prologue and epilogue code of
the function, as well as local variables initialization.
2. A part linearly dependent on the payload length, caused by the encoding and
decoding loops. Denoting with k the payload length, in bytes, this part can be
expressed as q · k, where q is a constant.
Table 11.5
Interpolated Encoder Delay, ARM7 Architecture
On the other hand, jitter is due to DRAM refresh cycles, which stall any instruc-
tion or data access operation issued by the CPU while they are in progress. For this
architecture, both issues were solved by directing the toolchain to move code, data,
and stack into the on-chip SRAM. The goal was achieved by means of two different
techniques:
• As can readily be seen by looking at the second row of Table 11.5, coeffi-
cient q, which represents the slope of the linear part of the encoding delay,
was reduced by a factor close to 5. The constant part of the encoding delay
b improved similarly, too.
• Furthermore, all of the subsequent measurements revealed that jitter was
completely removed from encoding delay.
It must also be remarked that this kind of behavior is not at all peculiar to the
ARM7 processor architecture or the use of DRAM. On the contrary, it is just a special
case of a more widespread phenomenon that should be taken into account in any case
during embedded software development.
For instance, the processor exhibited a very similar behavior when the lookup
table was stored in flash memory. In this case, the jitter was due to a component
whose purpose is to mitigate the relatively large access time of flash memory (up
to several CPU clock cycles in the worst case), by predicting future flash memory
accesses and performing them in advance.
In this way, if the prediction is successful, the processor is never stalled waiting
for flash memory access, since processor activities and flash memory accesses pro-
ceed concurrently. As was described in Chapter 2, this component is present in both
the NXP LPC24xx and LPC17xx microcontroller families and is called memory ac-
celerator module (MAM).
Unfortunately, although the prediction algorithm is quite effective with instruc-
tion fetch, because it proceeds sequentially in most cases, it does not work equally
well for accesses to the lookup table, which are inherently random because they depend on the values of the payload bytes being encoded.
Computed masks
In the base implementation shown and commented on in Section 11.4.2, two auxil-
iary lookup tables, lm[] and rm[], hold the bit masks that, as shown in Figure 11.7,
are used on every iteration of the encoding loop to split the 9-bit value w obtained
from the forward lookup table into two parts l and r, respectively.
In the optimization step to be discussed next, the bit masks have been computed
directly, and hence, lm[] and rm[] have been deleted. The decision has been driven
by a couple of observations, detailed in the following.
With respect to the base version of the same function, shown previously, the fol-
lowing changes to the code are worth noting.
• The bit mask lm to be used in the first iteration of the loop is set before
entering the loop, at line 22. Then, it is updated as specified in (11.2) at the
end of each iteration, that is, at line 36.
• Then, lm and its complement are used at lines 26–27, instead of the lookup
table entries lm[os] and rm[os].
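The listing of this intermediate version is not reproduced in this excerpt; as a rough sketch, modeled on the final version of encode shown later in this section (so the line numbers mentioned above do not apply to it), the computed-mask loop has the following shape:

lm = 0xFFFFFFFC;                /* mask for the first iteration: all bits set
                                   except the two least significant ones     */
for (i = 0; i < in->dlc; i++)
{
    w = flt[in->data[i]];       /* 9-bit codeword from the forward table     */
    l = w & lm;                 /* leading (most significant) part           */
    r = w & ~lm;                /* trailing (least significant) part         */

    /* ... shift l and r into place and store them, as in the base version ... */

    lm <<= 1;                   /* at each iteration, one more bit moves from
                                   the leading to the trailing part          */
}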
Referring back to the listing, the following parts of the encode function have
been modified:
• The newly-defined local variable od (line 8) is the local buffer used to carry
forward the common output byte from one iteration to the next.
• The first output byte out->data[0] is in common between the break bit
store operation and the first iteration of the loop. For this reason, the break
bit store operation at line 21 has been modified to store the break bit into
od instead of directly into memory.
• Upon each iteration of the loop, the operation at line 35 operates on an out-
put byte partially filled by a previous iteration of the loop (or the break bit
store operation, upon the first iteration) and fills it completely. Accordingly,
it operates on od and stores the result into memory, out->data[i].
• On the contrary, the operation at line 36 fills the next output byte only par-
tially. Therefore, the result must not be stored into out->data[i+1] yet,
because it must be completely filled in the next iteration of the loop (or
in the padding operation) beforehand. For this reason, the partial result is
stored into od and carried forward.
• Finally, the padding operation at line 41 fills the last output byte completely.
Therefore, it takes od as input and stores the result into memory.
Attentive readers may have noticed that both this optimization and the previous
one mimic some automatic compiler optimizations discussed in Section 11.1. This
is indeed true and further highlights the value of knowing how compiler optimizations work, at least to some level of detail.
This knowledge is, in fact, useful to “assist” the compiler by manual intervention
on the source code if, for any reason, the compiler is unable to automatically apply a
certain optimization technique by itself.
In this case, the starting point of the optimization is a basic symmetry property of the lookup table used by the encoder. In fact, it can be proved that, due to the way the table has been generated, the codeword associated with the bitwise complement of an input byte is the 9-bit bitwise complement of the codeword associated with the byte itself. Moreover, the codewords associated with input bytes whose most significant bit is zero all have their own most significant bit at zero. These two properties have been exploited as follows.
1. Namely, using property (11.4), it is possible to store only half of the table and
reduce its size from 256 to 128 9-bit entries. For efficient access, the size of each
entry must nevertheless be an integral number of bytes, and hence, each entry
actually occupies 16 bits in memory.
2. However, due to the second property stated previously, it is not necessary to store
the most-significant bit of table entries explicitly and the lookup table can be
further shrunk down to 128 8-bit entries.
The following listing shows the source code of the encode function after
optimization.
1 const uint8_t flt[] = { ... };
2
3 void encode(payload_t *in, payload_t *out)
4 {
5 int i;
6
7 /* int os: suppressed */
8 int w, l, r;
9 int lm;
10 int od;
11 int x, m;
12
13 assert(in->dlc <= 7);
14
15 if(in->dlc == 0)
16 out->dlc = 0;
17
18 else
19 {
20 /* If in->dlc is even <-> out->dlc is odd, except when in->dlc
21 is zero, but this cannot happen.
22 */
23 out->dlc = in->dlc + 1;
24 od = (out->dlc & 1) ? 0 : 128; /* Break Bit */
25
26 lm = 0xFFFFFFFC;
27 for(i=0; i<in->dlc; i++)
28 {
29
30 x = in->data[i];
31 m = (x >= 0x80) ? 0x1FF : 0x000;
32 w = flt[(x ^ m) & 0x7F] ^ m; /* 9 bits, lsb @0 */
33
34 l = w & lm; /* 7-i bits, msb @8 */
35 r = w & ~lm; /* 2+i bits, lsb @0 */
36
37 /* Due to the 9/8 relationship, the presence of the break
38 bit, and the limited maximum length of the payload, the
39 9 output bits are always split between two output
40 bytes.
41 */
42 out->data[i] = od | (l >> (2+i));
43 od = (r << (6-i));
44 lm <<= 1;
45 }
46
47 /* Pad */
48 out->data[i] = od | (PAD >> (2+i));
49 }
50 }
With respect to the previous version of the code, the following updates concerning
lookup table definition and access have been performed.
• According to the folding process, the lookup table definition (line 1) de-
fines lookup table elements to be of type uint8_t (8-bit unsigned integer)
instead of uint16_t (16-bit unsigned integer). The initializer part of the
definition (shown as ... in the listing for conciseness) now contains 128
entries instead of 256.
• Two additional local variables x and m (defined at line 11) hold, upon each
iteration of the encoding loop, the current input byte as retrieved from the
input buffer and an appropriate bit mask to complement it, if necessary, by
means of an on-the-fly exclusive or operation upon lookup table access.
• Namely, at line 31, m is set to either 0x1FF (9 bits at 1) or 0 depending on whether the most significant bit of the current input byte x is set, that is, whether x ≥ 0x80.
• The mask m is used twice at line 32:
1. It is used to conditionally complement the input byte x before using it
as an index in the lookup table, by means of the exclusive or operation
x^m. Since the mask is set to all ones if and only if the most significant bit of x is set, the overall result of the exclusive or is to complement x in that case, and to leave it unchanged otherwise.
In any case, since the mask is 9 bits wide whereas the table has 2^7 = 128 entries, the result of the exclusive or must be truncated to 7 bits before use, by means of the bitwise and operation & 0x7F.
2. After lookup table access, the same mask is used again to condition-
ally complement the lookup table entry before storing it into w.
It is useful to remark why this computation has been carried out by means
of conditional expressions and exclusive or operations instead of using a
more straightforward conditional statement.
This was done in order to make execution time independent of whether the predicate is true or false. By contrast, the execution time of an equivalent conditional (if-else) statement would, in general, depend on which branch is taken, because the two branches may be translated into instruction sequences of different lengths.
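The following self-contained sketch contrasts the two forms, using the folded 128-entry lookup table flt[] from the listing above; the branch-based variant is hypothetical and is shown for comparison only.

#include <stdint.h>

/* Branch-based form (hypothetical): the instructions executed, and hence
   the execution time, depend on whether x >= 0x80 is true. */
static unsigned lookup_branch(const uint8_t flt[128], unsigned x)
{
    if (x >= 0x80)
        return flt[(x ^ 0xFF) & 0x7F] ^ 0x1FF;
    else
        return flt[x];
}

/* Branchless form, as used in the listing: the same instruction sequence is
   executed regardless of the value of x. */
static unsigned lookup_branchless(const uint8_t flt[128], unsigned x)
{
    unsigned m = (x >= 0x80) ? 0x1FF : 0x000;
    return flt[(x ^ m) & 0x7F] ^ m;
}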
[Figure: encoder delay (µs) versus payload size (byte), with one curve per optimization step: 1. code/data placement, 2. computed masks, 3. load/store reduction.]
The performance of this version of the code, determined in the usual way, is shown
in the bottom row of Table 11.5. As can be seen from the table, unlike the previous
optimizations, table folding is not “one-way,” because it entails a trade-off between
footprint reduction and loss of performance.
In fact, the price to be paid for the previously mentioned fourfold reduction in
lookup table size is a performance penalty that, basically, brings the execution time of
the encoder back to almost the same values measured before load and store reduction.
The linear part of the delay is the same (q = 0.43 µs/byte in both cases) and the difference in the constant part of the delay is minimal (1.09 versus 1.08 µs).
[Figure: encoder delay (µs) versus payload size (byte), with one curve per optimization step: 1. code/data placement, 2. computed masks, 3. load/store reduction, 4. folded tables.]
On the Cortex-M3, even though the sample variance was still zero in all cases—meaning that the architecture nevertheless behaved in a fully deterministic way—the clean, linear relationship between
the payload size k and the encoding time found with the ARM7 architecture was lost
after code optimization.
After some further analysis of the optimizations automatically put in place by
the compiler, the reason for the peculiar behavior was identified in a specific opti-
mization, that is, loop unrolling. According to the general description given in Sec-
tion 11.2, in this particular scenario the compiler partially unrolled the encoding
loop, by a factor of two.
In other words, in an effort to achieve a better execution efficiency, the compiler
duplicated the loop body so that a single loop iteration handles up to two input bytes
and this gave rise to the up/down pattern in the delays, depending on whether k is
odd or even.
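Conceptually, partial unrolling by a factor of two reshapes the loop along the following lines; this is a sketch, not the compiler's actual output, and encode_one_byte() is a hypothetical stand-in for the original loop body.

/* Hypothetical stand-in for the body of the original encoding loop */
extern void encode_one_byte(int i);

/* Conceptual shape of the encoding loop after partial unrolling by two */
static void encode_unrolled(int dlc)
{
    int i;

    for (i = 0; i + 1 < dlc; i += 2)
    {
        encode_one_byte(i);        /* first copy of the loop body          */
        encode_one_byte(i + 1);    /* second copy of the loop body         */
    }
    if (i < dlc)
        encode_one_byte(i);        /* leftover iteration, executed only
                                      when dlc is odd                      */
}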
When this optimization was turned off, by means of the -fno-unroll-loops
option, linearity was almost completely restored, as shown in Figure 11.9, at a minor
performance cost. The comparison between Figures 11.8 and 11.9 confirms that the
compiler made the right choice when applying the optimization, because the average
performance of the unrolled encoding loop, across the whole range of values of k, is
indeed better.
However, in the case being considered here, the optimization also leads to other
detrimental effects that the compiler’s optimization algorithms do not consider at
all, namely, the loss of encoding time linearity. This is one of the reasons why the compiler of-
fers the possibility of turning off optimizations on an individual basis. Interestingly
enough, loop unrolling did not take place on the ARM7 architecture. This difference
is most probably due to the different compiler versions in use and to the dissimilari-
ties between the ARM7 and Cortex-M3 architectures.
To conclude this section, we observe that, as shown in Figure 11.9, the very same
optimizations that were effective on the ARM7 architecture brought similar perfor-
mance improvements on the Cortex-M3 architecture, too.
11.5 SUMMARY
The main topic of this chapter was the improvement of code performance and foot-
print by working at the toolchain and source code levels. In order to do this, the first
section of the chapter, Section 11.1, contains an overview of the compiler workflow
and optimizations.
The next two sections build on this knowledge to describe how some compiler
options affect code generation and allow programmers to achieve different trade-offs
between performance and footprint. In particular, Section 11.2 is about architecture-
independent options, while Section 11.3 discusses architecture-dependent options.
On the other hand, Section 11.4 describes source-level code optimization. To en-
hance the practical relevance and usefulness of the discussion, the whole section
makes use of a real-world embedded algorithm as a case study, rather than focusing
exclusively on theoretical concepts.
12 Example: A MODBUS TCP Device
CONTENTS
12.1 Toolchain and Operating System................................................................... 343
12.2 General Firmware Structure .......................................................................... 345
12.3 MODBUS Slave Protocol Stack ..................................................................... 348
12.3.1 TCP/IP Protocol Stack Interface....................................................... 349
12.3.2 Operating System Adaptation Layer ................................................ 352
12.3.3 MODBUS TCP Layer ........................................................................ 356
12.3.4 MODBUS Application Layer and Event Loop .................................. 358
12.4 USB-Based Filesystem.................................................................................. 360
12.4.1 FAT Filesystem.................................................................................. 360
12.4.2 USB Protocol Stack and Driver ........................................................ 364
12.5 Application Code........................................................................................... 367
12.6 Performance and Footprint ............................................................................ 368
12.6.1 Memory Footprint............................................................................. 369
12.6.2 System Performance ......................................................................... 371
12.7 Summary........................................................................................................ 373
This chapter concludes the first part of the book. It complements the smaller exam-
ples contained in the previous chapter with a more comprehensive case study. In it,
readers can follow the whole development process of an embedded system from the
beginning to the end.
At the same time, they will get more practical information on how to apply the
techniques learned in the previous chapters and how to leverage and bind together ex-
isting open-source software modules to effectively build a working embedded system
of industrial interest. Namely, the case study involves the design and development of
a simple MODBUS TCP data logger using a low-cost ARM-based microcontroller,
the LPC2468 [126].
Table 12.1
Software Development Toolchain Components
As a reference, Table 12.1 lists the exact version numbers of those components. It
should be noted that, although it is quite possible to build all toolchain components
starting from their source code, as in the case study, it is often more convenient
to acquire them directly in binary form, especially if the target architecture is in
widespread use.
As an example, at the time of this writing, Mentor Graphics offers a free, lite edi-
tion of their GCC-based Sourcery CodeBench toolchain [116], previously known as
CodeSourcery, which is able to generate code for a variety of contemporary proces-
sor architectures, including ARM.
Another aspect already noted in Chapter 3 is that, more often than not, compo-
nent version numbers to be used shall not be chosen “at random.” Similarly, always
choosing the latest version of a component is often not a good idea, either.
This is due to several main factors, which must all be taken into account in the
decision.
• Toolchain components work together in strict concert. However, they are
maintained by different people and the coordination between the respective
software development groups is loose.
Even though some effort is made to ensure backward compatibility when-
ever a component is updated, there is absolutely no guarantee that, for in-
stance, any version of the compiler works with any other version of the
assembler.
• Merely downloading the latest version of all components roughly at the same time does not provide additional guarantees either, for a similar reason: component release dates are not tightly related to each other.
• There may be adverse interactions between the version numbers of the na-
tive toolchain used to build the cross-compiling toolchain and the version
numbers of the cross-compiling toolchain to be built.
For instance, most real-time operating system developers give some suggestions
about which version numbers are the best ones. They are usually the version numbers
the developers themselves used when they compiled the operating system on the
target architecture in order to test it.
Since the operating system—and, in particular, its porting layer, as discussed in
Chapter 10—is a very complex component, it is likely to leverage even obscure
or little-used parts of the toolchain. It is therefore sensible to assume that, if the
toolchain was able to successfully build it, it will likely also be able to build other,
simpler software modules.
In addition, operating system developers may also offer several useful patches—
that is, corrections to the source code, to be applied to a specific version of a toolchain
component—that fix toolchain problems they have encountered during development
and testing.
Last, but not least, the release notes of each individual toolchain component give
more information about known component (in)compatibilities and additional depen-
dencies among components. For instance, release notes often include constraints to
specify that, in order for a component to work correctly, another component should
be more recent than a certain version number.
As far as real-time operating systems for embedded applications are concerned, a relatively wide choice of open-source products is nowadays available, for instance [17, 131, 146]. They represent different trade-offs between the number of functions they make available through their application programming interface (API) on the one hand, and their memory requirements and processing power overhead on the other.
For this case study, the choice fell on the FreeRTOS real-time operating system [18, 19], which was thoroughly described in Chapter 5. The main reason for this
choice was to minimize memory occupation, because memory is often a scarce re-
source in small embedded systems. At the same time, as will be better detailed in the
following, this operating system still provides all the functions needed to effectively
support all the other firmware components.
The port of a small real-time operating system like FreeRTOS to a new embed-
ded system platform of choice is usually not an issue. This is because, on one hand,
these operating systems are designed to be extremely portable—also due to the lim-
ited size and complexity of their porting layer, an example of which was presented
in Chapter 10.
On the other hand, their source code package is likely to already support the se-
lected architecture with no modifications required. In this case, operating system
deployment becomes even easier because a working C compiler is all that is needed
to build and use it.
blocks represent the hardware interface of blocks that interact with some hardware
device. Hierarchical relationships among components that make use of each other
are represented by arrows.
Furthermore, the C language runtime libraries have been omitted from the figure
for clarity, because all the other modules make use of, and depend on, it.
As can be seen, some firmware components have already been thoroughly dis-
cussed in the previous chapter and will only be briefly recalled here. Namely:
• The FreeRTOS operating system, presented in Chapter 5 and shown at the bottom of the figure, implements scheduling, basic timing, and inter-task communication and synchronization services, and makes them available to all the other components through its native API.
• The TCP/IP protocol stack used for the case study is LWIP [51], depicted
on the center left of Figure 12.1 and fully discussed in Chapter 7. With
Other components are more typical of the application considered as a case study
and will be better described in the following. Namely, they are:
Figure 12.2 Inner structure and interfaces of the MODBUS slave protocol stack.
• Finally, the application code, shown at the top of Figure 12.1, glues all
components together and performs data logging. Namely, it stores on a
USB flash-disk resident file information about the incoming M ODBUS TCP
traffic, which can be useful to debug M ODBUS master protocol stacks and
gather performance data about them.
• White blocks indicate portable layers, which can be moved from one archi-
tecture/project to another by simply recompiling them.
Table 12.2
FreeMODBUS TCP/IP Protocol Stack Interface Functions
Function Description
Connection management
xMBPortTCPPoll Accept incoming client connections, receive requests
Data transfer
xMBTCPPortGetRequest Retrieve a request previously received
xMBTCPPortSendResponse Send a response to the last request retrieved
• Light gray blocks indicate APIs that a certain module offers, or makes use
of. More specifically, an arrow going from a layer toward a standalone
API indicates that the layer makes use of that API, belonging to another
firmware component. On the other hand, an API positioned above a layer
indicates that the layer makes the services it implements available to other
modules through that API.
• Dark gray blocks correspond to code modules that are partially or com-
pletely non-portable. A non-portable code module may require updates
when it is moved from one architecture/project to another, and hence, soft-
ware development effort is likely to be more focused on them.
The same convention has been followed in the next figures of this chapter, too.
slave before talking to the next one, when there are multiple masters wanting to
communicate with the same group of slaves. This not only affects the design and
implementation of the M ODBUS master firmware, but may also have significant side
effects at the TCP/IP protocol stack level, which must be carefully scrutinized.
For instance, the number of TCP connections that are simultaneously open has
important effects on LW IP memory consumption. In turn, this affects how LW IP
memory pools have to be configured and, as a whole, the memory management strat-
egy of the embedded system.
Before proceeding further, it is important to informally introduce some
connection-related terminology that will be used in the following and may seem
confusing at first. When considering M ODBUS TCP connection, a distinction must
be made between the nomenclature used at the application layer, and the one used at
the TCP layer. Namely:
• At the application layer, connections are initiated by the M ODBUS master
by issuing a connection request to the target M ODBUS slave.
• At the TCP layer, slave nodes listen for incoming connection requests—and
are therefore called servers—while master nodes connect to a server, and
hence, they become clients of that slave.
Connection management
The function xMBPortTCPPoll is extremely important and belongs to a group by
itself because it is concerned with connection management. In particular, it must
perform the following activities when called:
• Monitor the listening socket for incoming connection requests, and monitor
open client sockets for incoming M ODBUS requests. Monitoring includes
passive wait until an event of interest occurs within the TCP/IP protocol
stack and is implemented by invoking the select function on the set of
sockets to be monitored.
• Accept incoming connection requests and establish new connections with
M ODBUS masters, by means of the function accept.
• Retrieve and gather chunks of incoming data from client sockets—by
means of the LW IP recv function—in order to build full M ODBUS requests
and pass them to the upper protocol layers. This goal is achieved by parsing
the M ODBUS TCP application protocol (MBAP) header [120], found at the
very beginning of each M ODBUS request and that contains, among other
information, the overall length of the request itself. Once the expected, to-
tal length of the request is known, xMBPortTCPPoll keeps gathering and
buffering incoming data until a whole request becomes available and can
be passed to the upper layers of the protocol stack.
• Close client connections, by means of the LW IP close function, upon nor-
mal shutdown or when an error is detected. Normal shutdown occurs when
the master closes the connection. In this case, the recv function returns
zero as the number of bytes transferred when called on the corresponding
socket on the slave side. Errors are detected when either recv itself or any
other LW IP function returns an error indication when invoked on a socket.
In order to properly synchronize with the upper protocol layer, besides carrying
out all the previously mentioned activities, xMBPortTCPPoll must also:
• Return a Boolean flag to the caller, true denoting successful completion and
false denoting an error.
• Post event EV_FRAME_RECEIVED through the event system to be dis-
cussed in Section 12.3.2 when a M ODBUS request has been received
successfully.
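Putting these requirements together, a much-simplified sketch of how such a polling function could be structured on top of the BSD-compatible socket names provided by LWIP is shown below; the socket setup code is omitted, BOOL, TRUE, and FALSE are assumed to come from the FreeMODBUS port header, and the helper prvProcessIncomingData() is hypothetical.

#include "lwip/sockets.h"
#include "port.h"      /* FreeMODBUS port header, assumed to provide BOOL/TRUE/FALSE */

/* Hypothetical helper: append len bytes to the request buffer, parse the MBAP
   header, post EV_FRAME_RECEIVED, and return TRUE once a request is complete. */
BOOL prvProcessIncomingData(const char *buf, int len);

static int listen_fd = -1;     /* listening socket, created at init time */
static int client_fd = -1;     /* currently connected master, if any     */

BOOL xMBPortTCPPoll(void)
{
    fd_set rfds;
    int maxfd, n;
    char buf[256];

    for (;;)
    {
        FD_ZERO(&rfds);
        FD_SET(listen_fd, &rfds);
        maxfd = listen_fd;
        if (client_fd >= 0)
        {
            FD_SET(client_fd, &rfds);
            if (client_fd > maxfd) maxfd = client_fd;
        }

        /* Passive wait until something happens on one of the sockets */
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            return FALSE;

        /* Incoming connection request from a MODBUS master */
        if (FD_ISSET(listen_fd, &rfds))
            client_fd = accept(listen_fd, NULL, NULL);

        /* Incoming data on an established connection */
        if (client_fd >= 0 && FD_ISSET(client_fd, &rfds))
        {
            n = recv(client_fd, buf, sizeof(buf), 0);
            if (n <= 0)                    /* 0: master closed, <0: error  */
            {
                close(client_fd);
                client_fd = -1;
            }
            else if (prvProcessIncomingData(buf, n))
                return TRUE;               /* a whole request is available */
        }
    }
}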
Data transfer
The third and last group of functions allows the upper protocol layers to perform data
transfer, handling one M ODBUS TCP request and response transaction at a time. The
two functions in this group are:
• xMBTCPPortGetRequest, which retrieves a M ODBUS TCP request pre-
viously gathered by xMBPortTCPPoll and returns a pointer to it, along
with its length, to the caller.
Along with the event system, these two functions also implicitly serialize request
and response handling by upper protocol layers. This is because requests are retrieved
and handed to the upper layers one at a time, and the upper layer must provide a
response to the current request before retrieving a new one.
Event system
The two portable layers of the protocol stack, to be discussed in Sections 12.3.3
and 12.3.4 are synchronized by means of a simple event system, implemented in the
module portevent.c. The main purpose of the event system, in its most general
form, is to allow the upper layer to wait for an event, which is generated by the
lower layer.
The event system specification supports either an active or a passive wait. When
appropriate operating system support is available, like in this case, passive wait is
preferred because it avoids trapping the processor in a waiting loop—with a detri-
mental effect on overhead and processor power consumption—until an event occurs.
It should also be noted that the event system defines a boundary between two
kinds of code within the protocol stack:
1. Higher-level code that consumes events, possibly waiting until one becomes available.
2. Lower-level code that produces events and never waits.
This boundary must be considered, for instance, when the goal is not just to make
use of the protocol stack but also to extend it, for instance, to add a new protocol
module. In this case, it is important to remember that lower-level code is not al-
lowed to block internally unless a separate task, with its own execution context, is
introduced to this purpose.
The event system must implement a single event stream and make available four
functions, summarized at the top of Table 12.3, to interact with it. In particular:
Table 12.3
FreeMODBUS Operating System Adaptation Functions
Function Description
Event System
xMBPortEventInit Initialize the event system
vMBPortEventClose Close the event system
xMBPortEventPost Post an event
xMBPortEventGet Get an event
Timer
xMBPortTimersInit Initialize the timer module
vMBPortTimerClose Close timer
vMBPortTimersEnable Start timer
vMBPortTimersDisable Stop timer
pxMBPortCBTimerExpired Pointer to the timer expiration callback
Critical Regions
ENTER_CRITICAL_SECTION Macro to open a critical region
EXIT_CRITICAL_SECTION Macro to close a critical region
attached to it, which distinguishes one class of event from another. Accord-
ingly, this function accepts one argument of that type.
• The function xMBPortEventGet looks for and possibly consumes an
event. When successful, it returns the corresponding event identifier, of type
eMBEventType, to the caller.
1. If passive wait is not available, xMBPortEventGet simply polls the event stream
and checks whether or not there is a pending event. If there is one, it consumes
it and returns the event identifier to the caller. Otherwise, it notifies the caller
that the event stream is empty, without delay. In the last case, it is the caller’s
responsibility to retry the operation, if appropriate, at a later time.
2. Otherwise, the function must block the caller, by means of a suitable operat-
ing system-provided task synchronization mechanism—for example a message
queue—until a message is available, and then return the event identifier to the
caller. The function shall also return immediately if an event is already available
at the time of the call.
The event system must be capable of storing just one posted event, before it is
consumed. However, trying to post a new event when another one is still pending
must not be flagged as an error.
The M ODBUS slave protocol stack has been designed to be quite permissive and
flexible with respect to the event system, and hence, it is permissible to “lose” events
in this case. More specifically, when a new event is posted by the lower layer while
there is already a pending event in the event system, it can be handled in two different
ways:
• keep the pending event and ignore the new one, or
• replace the pending event with the new one.
Due to the rather large degree of freedom in how the event system may be imple-
mented, the upper layer considers at least the following two possible event handling
scenarios and handles them properly.
As a side note, this provides evidence of how designing and implementing
portable code, which can be easily layered on top of other software components
without posing overly restrictive requirements on them, may introduce additional
complexity into it.
1. The xMBPortEventGet function may return to the caller (with or without wait-
ing) and signal that no events are available. This must not be considered an error
condition and the operation shall be retried.
2. It is also possible that the same function returns fewer events than were generated.
In fact, this may happen when events appear close to each other, and the event
system drops some of them. In this case, it is not specified which events are kept
and which are lost with respect to their generation time. Then, the upper layer
needs to rely on extra information, usually provided by the lower layer through a
side channel, to keep track of the events and handle them properly.
In this case study, F REE RTOS synchronization support is available, and hence,
the event system can be implemented in a straightforward way by means of a message
queue with a buffer capacity of one message of type eMBEventType. The queue
provides both the required synchronization and the ability to transfer event identifiers
(as message contents) between xMBPortEventPost and xMBPortEventGet.
Within xMBPortEventPost, the xQueueSend function that operates on the
message queue (see Section 5.4) is invoked with a timeout of zero, thus asking the
operating system to return immediately to the caller if the message queue is full. As a
consequence, the event system always drops the most recent events when necessary.
In other cases—namely, when it is invoked from an interrupt handling rather
than from a task context—xMBPortEventPost has to use xQueueSendFromISR
to operate on the queue. However, its behavior is still consistent with respect
to the previous scenario, because xQueueSendFromISR implicitly operates as
xQueueSend with zero timeout does.
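Along these lines, the core of the event system could be implemented as sketched below; the function signatures, the BOOL and eMBEventType types, and the header names follow the FreeMODBUS port conventions and are assumptions, as is the use of the newer QueueHandle_t type name.

#include "FreeRTOS.h"
#include "queue.h"
#include "port.h"            /* assumed FreeMODBUS port header: BOOL, TRUE, FALSE */
#include "mbport.h"          /* assumed to declare eMBEventType and the prototypes */

static QueueHandle_t xEventQueue;

BOOL xMBPortEventInit(void)
{
    /* Room for exactly one pending event, as required by the specification */
    xEventQueue = xQueueCreate(1, sizeof(eMBEventType));
    return xEventQueue != NULL;
}

BOOL xMBPortEventPost(eMBEventType eEvent)
{
    /* Zero timeout: when an event is already pending, the new one is dropped,
       which the upper layer is prepared to handle.  When called from an
       interrupt handler, xQueueSendFromISR must be used instead, as
       discussed in the text. */
    (void) xQueueSend(xEventQueue, &eEvent, 0);
    return TRUE;
}

BOOL xMBPortEventGet(eMBEventType *peEvent)
{
    /* Passive wait until an event is posted; a finite timeout could be used
       instead, returning FALSE when it expires. */
    return (xQueueReceive(xEventQueue, peEvent, portMAX_DELAY) == pdTRUE)
           ? TRUE : FALSE;
}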
Timer
The purpose of porttimer.c is to implement a single timer, exporting the four
interfaces listed in the middle part of Table 12.3 and described in the following.
Namely:
Upon expiration, the timer module invokes the callback function specified by the
function pointer pxMBPortCBTimerExpired, possibly from an interrupt context.
The timer is one-shot and stops after expiration, until it is possibly started again.
It should be noted that, as in most other timeout notification mechanisms, there
is an inherent race condition between the invocation of the premature stop func-
tion and timer expiration, which must be resolved externally from the timer mod-
ule. In fact, the timer module cannot guarantee by itself that the timer will not
expire (and the callback function will not be invoked) during the execution of
vMBPortTimersDisable.
Within the F REE MODBUS protocol stack, the timer is used for a very specific
purpose, that is, to detect the end of the M ODBUS frame being received and cor-
responds to the t3.5 timer defined in the M ODBUS RTU specification [121]. If the
protocol stack is used exclusively for M ODBUS TCP communication, as happens in
this case, the timer module is not required and its implementation can indeed be
omitted.
In any case, it is worth noting—as an example of how operating system support,
albeit present, may not be fully adequate to support all possible upper-layer soft-
ware requirements—that the timer, if required, could not be implemented using the
F REE RTOS timer facility when it operates in tick-based mode (see Section 10.7 for
further details).
This is because, in order to support the required resolution, the tick frequency
would need to be increased from the default of 1 kHz to 20 kHz and tick interrupt
handling overheads would thus become inordinate. In this case, a direct implementa-
tion on top of one of the hardware timers embedded in the LPC2468 would be more
appropriate.
Critical regions
The third and last aspect to be discussed in this section concerns the implemen-
tation of critical region entry and exit code. Namely, in the F REE MODBUS
code, critical regions are surrounded by the ENTER_CRITICAL_SECTION and
EXIT_CRITICAL_SECTION macros, which are responsible for mutual exclusion.
The operating system adaptation layer shall provide a definition for these macros
in the header file port.h. Depending on its complexity, the actual code of critical
region entry and exit—invoked by the above-mentioned macros—can be provided
either directly in port.h if it is simple and short enough, or in the source module
portother.c.
Only a single mutual exclusion domain must be provided for the whole protocol stack, because FreeMODBUS critical regions have been kept very short, and hence, unnecessary blocking has been reduced to a minimum. In the FreeRTOS port, the two macros have been mapped to the portENTER_CRITICAL and portEXIT_CRITICAL system services, respectively, as has been done for the fast critical regions of LWIP, as described in Chapter 7.
A notable special case that could not happen in LWIP is that, according to its specification, FreeMODBUS may invoke the mutual exclusion macros not only from a task context, but also from an interrupt handling context. The two previously mentioned system services are not supported in the last case, so they must not be invoked.
The issue can be solved by checking whether or not ENTER_CRITICAL_SECTION
and EXIT_CRITICAL_SECTION have been invoked from an interrupt handling
context, by querying the CPU status register through a short assembly language in-
sert, according to the general technique presented in Section 9.4. If this is the case,
the macros simply do nothing.
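In the form of a sketch, the macros could then be defined in port.h along the following lines; the xPortInIsrContext() helper, which would query the CPSR mode bits through a short assembly language insert, is hypothetical.

/* Hypothetical helper, implemented elsewhere with a short assembly insert
   that inspects the ARM7 CPSR mode bits. */
extern int xPortInIsrContext(void);

/* Do nothing when called from an interrupt handling context, where the
   FreeRTOS critical section services must not be used. */
#define ENTER_CRITICAL_SECTION() \
    do { if (!xPortInIsrContext()) portENTER_CRITICAL(); } while (0)

#define EXIT_CRITICAL_SECTION() \
    do { if (!xPortInIsrContext()) portEXIT_CRITICAL(); } while (0)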
Since mutual exclusion is implemented by disabling interrupts, mutual exclusion
with respect to both regular tasks and interrupt handlers (as well as any callback
function invoked by them) is guaranteed anyway, on a single-processor system. Mu-
tual exclusion between callback functions invoked from an interrupt handler is also
implicitly enforced in the same way, as long as interrupt handlers are not themselves
interruptible.
It is worth mentioning that, like most other low-level mechanisms, this way of protecting critical regions may have some unforeseen interactions with other parts of the protocol stack that must be carefully considered on a case-by-case basis.
In particular, passive wait (for instance, within an event system function) within
a critical region may not be supported in some operating systems. This is the case,
for example, of F REE RTOS on Cortex-M3 processors, due to the way rescheduling
is implemented.
Table 12.4
FreeMODBUS MODBUS TCP Layer Functions
Function Description
eMBTCPDoInit Prepares this layer and the ones below it for use
eMBTCPStart Enable protocol stack layer
eMBTCPStop Disable protocol stack layer
eMBTCPSend Send a MODBUS response
eMBTCPReceive Retrieve a MODBUS request
protocol stack interface, presented in Section 12.3.1. To this purpose, it exports the
functions listed in Table 12.4 and described in the following.
Figure 12.3 MBAP header processing in the M ODBUS TCP layer and below.
layer to another—is to check that the protocol identifier field of the MBAP
header really indicates that the request belongs to the M ODBUS protocol.
In summary, Figure 12.3 depicts how the MBAP header is managed and trans-
formed by the M ODBUS TCP layer when going from a request to the corresponding
reply.
1. Initialization code, to configure the M ODBUS protocol stack for a certain combi-
nation of data link and physical layers, and prepare it for use. Symmetrically, the
shutdown code turns off the protocol stack when it is no longer needed.
2. The main event handling function, a function that—when it is invoked by
the task responsible for handling M ODBUS transactions after protocol stack
initialization—looks for incoming requests and handles them.
The initialization function eMBTCPInit initializes the protocol stack as a whole,
when it is configured to work on top of TCP/IP connections, that is, for M ODBUS
TCP. As part of the process, it directly or indirectly invokes the appropriate initializa-
tion functions of the lower protocol layers. Namely, it calls eMBTCPDoInit within
the M ODBUS TCP layer and xMBPortEventInit to initialize the event system.
A different function also defined in the same layer, eMBInit, initializes the proto-
col stack when it is configured for a TIA/EIA–485 interface bus at the physical level.
Table 12.5
FreeMODBUS Application Layer Functions
Function Description
Event handling
eMBPoll Main event handling function
Application-layer callbacks
eMBRegisterCB Register an application-layer callback
Table 12.6
Links between the Application Layer and the MODBUS TCP Layer
In this case, the communication protocol can be either M ODBUS ASCII or M ODBUS
RTU [121]. However, this operating mode is beyond the scope of this discussion.
The initialization function also sets up the interface between the application layer
and the M ODBUS TCP layer. Table 12.6 lists the names of the function pointers used
at the application layer and gives a terse description of the function they correspond
to in the M ODBUS TCP layer.
Using function pointers instead of hardwired function names for most of the in-
terface helps to accommodate different communication protocols, implemented by
distinct lower layers, transparently and efficiently. As a side note, referring back to
the table, the vMBTCPPortClose function technically belongs to the TCP/IP inter-
face layer, and not to the M ODBUS TCP layer.
Conversely, the function eMBClose shuts down the protocol stack, by simply in-
voking the corresponding M ODBUS TCP layer function through its function pointer.
Similarly, the eMBEnable and eMBDisable functions enable and disable the protocol stack, respectively. As in the previous case, these functions do very little by themselves, other than invoking the corresponding MODBUS TCP layer functions through their function pointers.
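The following self-contained sketch illustrates this function-pointer based layering. The names eMBTCPStart, eMBTCPStop, eMBEnable, and eMBDisable come from the table and text above, whereas the pointer names and the binding function are invented for the example and do not necessarily match the actual FreeMODBUS sources.

#include <stdio.h>

typedef enum { MB_ENOERR, MB_EIO } eMBErrorCode;   /* simplified error codes */

/* Lower-layer (MODBUS TCP) functions, normally defined in the TCP layer. */
static eMBErrorCode eMBTCPStart(void) { puts("TCP layer enabled");  return MB_ENOERR; }
static eMBErrorCode eMBTCPStop(void)  { puts("TCP layer disabled"); return MB_ENOERR; }

/* Function pointers held by the application layer (illustrative names). */
static eMBErrorCode (*peMBFrameStartCur)(void);
static eMBErrorCode (*peMBFrameStopCur)(void);

/* Set up during protocol stack initialization (eMBTCPInit in the text). */
static void vMBBindTCPLayer(void)
{
    peMBFrameStartCur = eMBTCPStart;
    peMBFrameStopCur  = eMBTCPStop;
}

/* Thin application-layer wrappers, dispatching through the pointers. */
static eMBErrorCode eMBEnable(void)  { return peMBFrameStartCur(); }
static eMBErrorCode eMBDisable(void) { return peMBFrameStopCur(); }

int main(void)
{
    vMBBindTCPLayer();
    eMBEnable();
    eMBDisable();
    return 0;
}

Replacing vMBBindTCPLayer() with a different binding function is all it would take to plug in another lower layer, without touching the application-layer wrappers.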
The eMBPoll function is responsible for event handling and contains the applica-
tion layer protocol state machine. Accordingly, the protocol state machine is driven
by events taken from the event system. Events are generated either by the state ma-
chine itself or by the lower protocol layers.
It should be noted that eMBPoll does not necessarily wait for events (whether
or not that happens depends on the event system implementation), and hence, it can
return to the caller without doing anything. This is not considered an error condition,
and the function will return MB_ENOERR in this case. It is expected that calls to
eMBPoll will be enclosed within a polling loop in an even higher layer of code,
above the protocol stack.
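For instance, the MODBUS handling task could be structured as in the following sketch. The function names appear in the text above, but their exact prototypes, as well as the use of the standard TCP port number 502, are assumptions made for the example.

#include "FreeRTOS.h"
#include "task.h"
#include "mb.h"                       /* FreeMODBUS application layer API */

void vModbusServerTask(void *pvParameters)
{
    (void)pvParameters;

    /* Initialize the stack for MODBUS TCP on the standard port 502. */
    if (eMBTCPInit(502) != MB_ENOERR) {
        vTaskDelete(NULL);            /* nothing more can be done here */
    }
    eMBEnable();

    for (;;) {
        /* eMBPoll() may return MB_ENOERR without doing anything at all:
           this is not an error, it simply means no event was pending. */
        (void)eMBPoll();
    }
}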
Since part of MODBUS request handling is application-dependent, the event han-
dling function just mentioned works in concert with appropriate callback functions
that must be implemented at the application layer. As will be discussed in Sec-
tion 12.5, the function eMBRegisterCB can be invoked from the application layer
to register a callback function for a certain category of M ODBUS requests.
Table 12.7
FAT FS Main Configuration Options
Option Description
Device characteristics
_VOLUMES Maximum number of logical drives per physical device
_MAX_SS Maximum sector size, in bytes
_MULTI_PARTITION Support multiple partitions on a logical drive
Table 12.8
Filesystem Requirements Concerning Disk I/O
2. The option group concerned with device characteristics determines how many
logical drives at a time the filesystem module can handle, the maximum sector
size to be supported, and whether or not it is possible to access multiple partitions
on the same device.
3. The operating system interface option group controls whether or not the filesys-
tem modules shall be reentrant and optionally enables file locking. Moreover, on
processor architectures that support them, it is possible to enable unaligned word
accesses in the filesystem code, thus achieving a higher speed.
The last option group deserves special attention because it affects not only the
footprint of FAT FS itself, but also the complexity of its adaptation layer—the operat-
ing system interface in particular—and the operating system memory requirements.
This is because, when reentrancy support is enabled, FAT FS relies on the un-
derlying operating system for proper synchronization among concurrent filesystem
requests. Accordingly, several additional operating system interfaces are required to
declare, create, delete, and use mutual exclusion locks—called synchronization ob-
jects by FAT FS.
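Concretely, the sketch below shows how these synchronization objects could be mapped onto FreeRTOS mutexes. The hook names and prototypes (ff_cre_syncobj and friends) follow the FAT FS operating system interface convention, but they may vary between FAT FS releases, so they should be checked against the version actually in use.

#include "FreeRTOS.h"
#include "semphr.h"
#include "ff.h"     /* FAT FS types; _SYNC_t is assumed to be configured as a
                       FreeRTOS semaphore handle in ffconf.h                 */

int ff_cre_syncobj(BYTE vol, _SYNC_t *sobj)     /* one object per volume */
{
    (void)vol;
    *sobj = xSemaphoreCreateMutex();
    return (*sobj != NULL);                     /* nonzero means success */
}

int ff_req_grant(_SYNC_t sobj)                  /* lock before a filesystem request */
{
    return (xSemaphoreTake(sobj, _FS_TIMEOUT) == pdTRUE);
}

void ff_rel_grant(_SYNC_t sobj)                 /* unlock when the request completes */
{
    xSemaphoreGive(sobj);
}

int ff_del_syncobj(_SYNC_t sobj)
{
    vSemaphoreDelete(sobj);
    return 1;
}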
For this specific case study, the main purpose of the mass storage subsystem is
data logging, performed by a single application task after collecting information from
other components of the real-time system. For this reason, FAT FS reentrancy has
been disabled to speed up the development of the adaptation layer. On the other hand,
the full set of FAT FS features/capabilities has been enabled, at a small footprint cost,
to make data logger development more convenient.
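In terms of configuration options, the choices made for the case study could look like the following ffconf.h excerpt. The option names follow the FAT FS convention (Table 12.7 lists the first three); the values shown for _VOLUMES and _MULTI_PARTITION are illustrative assumptions, whereas the 512 B sector size and the disabled reentrancy are stated in the text.

#define _VOLUMES           1    /* a single USB mass-storage logical drive     */
#define _MAX_SS          512    /* sector size used by the device, in bytes    */
#define _MULTI_PARTITION   0    /* only the first partition is accessed        */
#define _FS_REENTRANT      0    /* single data-logging task: no reentrancy and
                                   no synchronization objects are needed       */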
Table 12.9
Filesystem Requirements upon the Operating System and Runtime Library
one-to-one relationship between the function required by the filesystem and what is
offered by the underlying layer.
The main duty of the adaptation layer is, therefore, limited to performing argument and return code conversion.
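As an example, the sketch below shows how the FAT FS disk_read() entry point could be mapped onto the MS_BulkRecv() function of the USB mass storage interface module described later in this chapter. The prototype of MS_BulkRecv() is an assumption, and the disk_read() prototype should be aligned with the diskio.h header of the FAT FS release in use.

#include "diskio.h"          /* FAT FS disk I/O interface: DRESULT, BYTE, DWORD */

/* Assumed prototype of the USB mass storage read function (negative on error). */
extern int MS_BulkRecv(DWORD block, DWORD num_blocks, BYTE *buffer);

DRESULT disk_read(BYTE drv, BYTE *buff, DWORD sector, BYTE count)
{
    (void)drv;                           /* only one physical drive is supported */

    /* Argument conversion is trivial: FAT FS sectors map one-to-one onto the
       512-byte blocks of the mass storage device.                              */
    if (MS_BulkRecv(sector, count, buff) < 0)
        return RES_ERROR;                /* return code conversion               */

    return RES_OK;
}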
Figure 12.5 Inner structure and interfaces of the USB protocol stack.
• the exact location of the bank of host controller operational registers in the
microcontroller address space;
• the way USB controller signals are routed to the microcontroller’s I/O ports,
to the USB transceiver and, ultimately, to the physical USB port;
• the clock source to be used by the USB controller and its frequency;
• how USB controller interrupts are routed to the processor and how they are
associated with the corresponding interrupt handler.
consuming because it requires a very accurate and detailed knowledge of the target
microcontroller’s inner workings.
In order to assist software developers, many microcontroller vendors provide the
source code of an exemplar driver that works with their products and can be further
extended by system developers according to their specific needs.
In our case, the LPC2468 microcontroller [125] includes an OHCI-compliant [41]
USB 2.0 [40] host controller. The vendor itself provides an extremely simplified,
polling-based driver as part of a larger software package fully described in [124].
In this kind of controller, all interactions between the device driver and the con-
troller itself, after initialization, take place by reading from and writing into the con-
troller’s operational registers and by exchanging data through a shared memory area
called host controller communication area (HCCA). Both the register and memory
layouts are fixed and completely defined in [41].
With respect to a full-fledged host controller driver, the main limitation of the
prototype provided in [124] is related to the device enumeration algorithm, which
lacks the capability of crossing USB hubs. As a consequence, as it is, the driver is
capable of handling only one device, directly connected to the host controller port.
On the other hand, dynamic device insertion and removal are supported, so that the driver is, for instance, more than adequate to support a USB mass-storage device for data-logging purposes.
The main improvement introduced to make the driver work efficiently in an operating system-based environment was to replace the polling-based, busy waiting loops of the original driver with passive, interrupt-based synchronization. As a consequence, two synchronization semaphores, managed by means of appropriate FreeRTOS primitives, were introduced for the two sources of events of the controller, outlined in the following (a minimal sketch of this mechanism is given right after the list).
1. The Root Hub Status Change (RHSC) event, which is generated when a USB de-
vice is connected to, or disconnected from, the USB controller. When a device
connection is detected, the device driver reacts by issuing, according to the USB
specification, a port reset command. After that, device initialization is completed
by assigning an address to it and retrieving its configuration descriptor. After con-
firming that the device is indeed a mass storage device—an operation accom-
plished by the USB mass storage interface module—it is configured for use.
2. The Writeback Done Head (WDH) event. This event is triggered when the USB controller has finished processing one or more transfer descriptors (TDs), has
linked them to the done queue data structure, and has successfully updated the
pointer to the head of the done queue accessible to the device driver. This event
therefore provides a convenient way for the device driver to be notified of the
completion of the transfer described by a certain TD and reuse the TD for further
transfers.
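A minimal sketch of this synchronization mechanism, focused on the WDH event, is shown below. The FreeRTOS primitives are real (xSemaphoreGiveFromISR(), xSemaphoreTake(), and companions), whereas the interrupt handler name and the register accessors are invented placeholders for the controller-specific details.

#include "FreeRTOS.h"
#include "semphr.h"

/* Hypothetical accessors hiding the OHCI interrupt status register details. */
extern int  usb_int_status_is_wdh(void);
extern void usb_int_clear_wdh(void);

static SemaphoreHandle_t xWdhSemaphore;      /* created during Host_Init() */

void USB_IRQHandler(void)                    /* host controller interrupt handler */
{
    BaseType_t xWoken = pdFALSE;

    if (usb_int_status_is_wdh()) {
        xSemaphoreGiveFromISR(xWdhSemaphore, &xWoken);
        usb_int_clear_wdh();
    }
    portYIELD_FROM_ISR(xWoken);              /* port-specific yield macro */
}

/* Task-level driver code, for instance within Host_ProcessTD(), then waits
   passively instead of polling the controller registers:                   */
static void wait_for_wdh(void)
{
    xSemaphoreTake(xWdhSemaphore, portMAX_DELAY);
}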
The main interfaces exported by the device driver are summarized in Table 12.10.
The two initialization functions listed at the top of the table prepare the host
Table 12.10
USB Host Controller Driver Interfaces
Function Description
Initialization
Host_Init Host controller initialization
Host_EnumDev Device enumeration
controller for use and wait for the insertion of a mass storage device, respectively.
After device initialization, the other interfaces allow the caller to:
• send and receive control information through the device’s control endpoint
descriptor;
• send data to the device’s bulk output endpoint descriptor;
• receive data from the device’s bulk input endpoint descriptor.
This minimal set of capabilities corresponds to what is required by the USB mass
storage interface module, to be discussed next.
Table 12.11
Main USB Mass Storage Interface Entry Points
Function Description
check whether or not the device is supported. The main checks performed
at this stage are aimed at ensuring that the device is indeed a mass stor-
age device that understands the transport mechanism and command set
just discussed.
• The function MS_Init makes sure that the device is ready for use by
issuing the SCSI command test unit ready and waiting if necessary. The
same capability is also made available to the upper software layers for later
use, by means of the function MS_TestUnitReady. In this way, the same
check can be performed at that level, too, for instance, before issuing ad-
ditional commands to the device. During initialization, the block size and
the total storage capacity of the device are also retrieved and returned to the
caller, since this information is required by the filesystem module.
After device initialization, the main purpose of the USB mass storage inter-
face is to encapsulate the SCSI commands Read 10 and Write 10 into bulk-
only transport messages and perform data transfer to the device by means of the
Host_ProcessTD function of the host controller driver.
As their names imply, these two commands are used to read and write, respec-
tively, a number of contiguous storage blocks. These crucial functions are made
available to the upper software layer by means of the last two functions listed in
Table 12.11, that is, MS_BulkRecv and MS_BulkSend.
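Putting the pieces together, a data-logging application could bring the mass storage device up roughly as in the sketch below. Only the function names come from Tables 12.10 and 12.11; every prototype shown here is an assumption made for illustration.

#include <stdint.h>

/* Assumed prototypes (the real ones must be taken from the driver sources). */
extern int32_t Host_Init(void);
extern int32_t Host_EnumDev(void);
extern int32_t MS_Init(uint32_t *block_size, uint32_t *num_blocks);
extern int32_t MS_TestUnitReady(void);
extern int32_t MS_BulkRecv(uint32_t block, uint32_t num_blocks, uint8_t *buffer);

int usb_storage_bring_up(void)
{
    uint32_t block_size, num_blocks;
    static uint8_t buffer[512];

    Host_Init();                             /* prepare the OHCI host controller  */
    Host_EnumDev();                          /* wait for and enumerate the device */

    if (MS_Init(&block_size, &num_blocks) < 0)
        return -1;                           /* not a usable mass storage device  */

    if (MS_TestUnitReady() != 0)
        return -1;                           /* device present but not ready      */

    return (int)MS_BulkRecv(0, 1, buffer);   /* e.g., read block #0               */
}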
The application-layer functions that handle MODBUS requests are called callbacks because they are invoked in a peculiar way,
that is, they are called back from the M ODBUS application layer, discussed in Sec-
tion 12.3.4, when an incoming M ODBUS request is received. They must carry out the
transaction on the slave and then return to the caller with its result.
All application-layer callbacks have a uniform interface. For better efficiency, the arguments are used both as inputs to the function and outputs from it:
1. a pointer to a buffer where the application-layer function must store the values of the registers requested by the master, pucFrameCur;
2. the initial register number requested by the master, usRegAddress;
3. the number of registers to be stored into the buffer, usRegCount.
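As an illustration, a callback handling a read holding registers request could look like the sketch below. The argument names are the ones introduced above; the exact prototype expected by eMBRegisterCB is an assumption, as is the application-side register array.

#include <stdint.h>
#include "mb.h"                      /* eMBErrorCode, MB_ENOERR, MB_ENOREG */

#define N_HOLDING_REGS 128
static uint16_t holding_regs[N_HOLDING_REGS];       /* application data */

eMBErrorCode xAppReadHoldingCB(uint8_t *pucFrameCur,
                               uint16_t usRegAddress,
                               uint16_t usRegCount)
{
    if (usRegAddress + usRegCount > N_HOLDING_REGS)
        return MB_ENOREG;                            /* illegal data address */

    while (usRegCount-- > 0) {
        uint16_t value = holding_regs[usRegAddress++];
        *pucFrameCur++ = (uint8_t)(value >> 8);      /* MODBUS is big-endian: */
        *pucFrameCur++ = (uint8_t)(value & 0xFFU);   /* high byte goes first  */
    }
    return MB_ENOERR;
}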
evaluated the overall memory footprint of the system and assessed the performance
it can achieve in M ODBUS TCP communication.
Table 12.12
Firmware Memory Footprint

Component                          Code (B)    Data (B)
LWIP protocol stack
  Portable modules                    87024       11737
  Operating system adaptation          4200           4
  Ethernet device driver               4928          12
USB-based filesystem
  FAT FS filesystem                   27728         520
  Disk I/O adaptation                   288           0
  Operating system adaptation           100           0
  USB protocol stack and driver        7548           8
tasks, because the largest object allocated there is the per-task stack. The
configuration used in the case study supports a minimum of 6 tasks.
When necessary, it is often possible to further reduce the footprint of these compo-
nents by means of an appropriate configuration. For instance, the LW IP configuration
used in the case study includes support for both TCP and UDP for generality. If one
of these protocols is not used by the application, it can be removed from the con-
figuration and the memory required by it can be put to other uses, as discussed in
Chapter 7.
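For instance, if UDP were not needed, the lwipopts.h configuration file could be trimmed along the following lines; the option names are standard LWIP ones, while the actual savings depend on the LWIP version and on the rest of the configuration.

#define LWIP_UDP          0    /* compile out the UDP protocol entirely      */
#define MEMP_NUM_UDP_PCB  0    /* and the memory pool of UDP control blocks  */
#define LWIP_DHCP         0    /* DHCP relies on UDP, so it must be disabled */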
It should also be noted that the amount of memory required by the application
code, except the M ODBUS application-layer callbacks, and by the C language run-
time libraries has not been quantified, because it strongly depends on the applica-
tion code. In fact, the library modules are included in the executable image only
on-demand and the system components by themselves require only a few of them,
whereas the library usage by the application code is usually much heavier.
Table 12.13
MODBUS Communication Performance, Write Multiple Registers

Registers written    Round-trip time (ms)
        2                   6.6
       32                   6.7
       62                   6.8
       92                   7.0
      123                   7.1
As a consequence, just listing the memory footprint of the library modules needed
by the system components would have been overly optimistic, whereas listing the
memory footprint of all library modules included in the firmware under study would
have been misleading, because these figures would not be a reliable representative of
any other application.
For similar reasons, the footprint data listed in Table 12.12 do not include the
amount of RAM used as buffer by the Ethernet and USB controllers, which has been
drawn from their own dedicated banks of RAM.
Table 12.14
Filesystem Performance, Read and Append Rates
Data chunk size (B) Read rate (B/s) Append rate (B/s)
16 7200 7100
64 24870 23690
512 88600 75340
521 87600 74630
only the M ODBUS TCP protocol stacks themselves (both the master and the slave
sides), but also another complex component, LW IP.
Since, for each write multiple registers command issued by the MODBUS master, a reply from the slave is expected, the RTT comprises two receive-path delays and two transmit-path delays through the whole protocol stacks.
Filesystem
Since the primary intended use of the filesystem is data logging, its performance
was evaluated by considering sequential read and append operations with different
data chunk sizes. Table 12.14 shows the read and append rates (expressed in bytes per
second) as a function of the data chunk size (in bytes). The rates have been calculated
by averaging the time needed to transfer 100 MB of data.
The first three data chunk sizes are sub-multiples of the sector size used by the
filesystem, that is, 512 B. This ensures that read and append operations never over-
lap with a sector boundary and makes the measurement adequate to determine the
relative contribution of software overheads, which are inversely proportional to the
data chunk size, with respect to the total number of USB I/O operations, which are
constant across the experiments.
The experimental results in this group exhibit a strong dependency on the data
chunk size for what concerns both read and append rates. Namely, the ratio be-
tween whole-sector operations and operations typical of fine-grained data logging
(and hence, operating with much smaller data chunk sizes) is about one order of
magnitude and is likely to affect application-level software design.
The lower performance of append with respect to read (from −2% to −15% de-
pending on the data chunk size) can easily be justified by considering that appending
to a file incurs extra overhead because it is necessary to update the file allocation ta-
ble information on the filesystem as new blocks are linked to the file being appended
to. In addition, write operations on a flash drive may be slower than reads due to the
significant amount of time needed to erase flash memory blocks before writing.
The fourth data chunk size value used in the experiments, 521 bytes, is the small-
est prime number greater than the sector size. The value has been chosen in this
way to keep software overheads related to I/O requests from the application program
as close as possible to the whole-sector read/append case and, at the same time,
maximize the probability that read and append operations cross a sector boundary.
In this way, it was possible to evaluate the extra overhead introduced in this sce-
nario. The experimental results show that the additional overhead does not exceed
2% and confirm that the filesystem module behaves satisfactorily.
12.7 SUMMARY
The main goal of this chapter was to provide a comprehensive example of how a complete and fully working embedded system can be assembled exclusively from open-
source components, including the C-language software development toolchain used
to build it.
Indeed, the discussion began by providing, in Section 12.1, some summary in-
formation about how the main toolchain components (binary utilities, compiler, C
runtime library, and debugger) should be chosen so that they work correctly together.
The same section also provides information about the rationale behind the choice of
using the F REE RTOS operating system instead of other open-source products, and
about the consequences of this choice on the software development process.
Then, Section 12.2 provided an overall picture of the firmware structure, out-
lining its main components and how they are connected to each other through their
APIs—by means of appropriate adaptation or porting layers. More information about
individual components and their internal structure, also useful to conveniently port
them to new architectures and projects, was given in Sections 12.3 through 12.5.
Last, but not least, Section 12.6 provided some insights into the memory footprint
and performance of the main firmware components.
Part II
Advanced Topics
13 Model Checking of
Distributed and
Concurrent Systems
CONTENTS
13.1 Introduction ................................................................................................... 377
13.2 The S PIN Model Checker .............................................................................. 379
13.3 The P ROMELA Modeling Language.............................................................. 380
13.3.1 Processes........................................................................................... 381
13.3.2 Data Types ........................................................................................ 383
13.3.3 Expressions ....................................................................................... 386
13.3.4 Control Statements............................................................................ 388
13.3.5 Atomicity .......................................................................................... 393
13.3.6 Channels ........................................................................................... 396
13.4 Property Specification ................................................................................... 400
13.5 Performance Hints ......................................................................................... 401
13.6 Summary........................................................................................................ 404
This chapter discusses the first advanced topic of the book. It briefly presents how
the well-known limits of software testing techniques can be addressed by means of
formal verification. As embedded software complexity grows, the correct use of this technique is becoming more and more important in order to ensure that the software still works in a reliable way. From the practical point of view, software verification is very often carried out through model checking. This chapter is devoted to this topic, using the well-known model checker SPIN and its input language PROMELA to draw some introductory examples. The discussion of a full-fledged case study is left to Chapter 14.
13.1 INTRODUCTION
When a distributed concurrent system has been designed and implemented, software
testing techniques can be applied to check whether it works as expected or not. How-
ever, software testing can only show the presence of bugs; it cannot prove their absence. This goal can instead be achieved by formal verification.
Model checking [15] is one popular technique of formal verification. It has been
widely used in the hardware industry. More recently, its adoption in the software industry has kept growing as well, especially for safety-critical applications. This book will focus on its application in the verification of distributed con-
current software systems.
More specifically, model checking is an automated technique that, given a (finite-
state) model M of a system and a logical property ϕ, systematically checks whether or not this property holds on the model, that is, M |= ϕ. The logical property should also
be specified with some formal notation, for instance, linear temporal logic (LTL),
which will be discussed in Section 13.4.
Model checking aims at proving that the property of interest holds by exhaustively
exploring every possible state the system under analysis can reach during its execu-
tion, starting from its initial state. Collectively, these states are called the state space
of the system.
The system transitions from one state to another by means of elementary com-
putation steps. The possible computation steps originating from a certain state are
specified in the model itself. Therefore, a path from the initial state up to one pos-
sible final state (in which no more transitions are possible) represents one of the
possible computations the system can perform.
Based on the (initial) design or implementation of a system, it is possible to
build the corresponding model, upon which verification can be carried out by suit-
able model checkers. Depending on the verification results, refinement techniques can then be applied to achieve a better implementation.
P ROMELA, which is a shorthand for PROtocol MEta LAnguage, is a language
which can be used to write models for concurrent systems. With respect to ordinary
programming languages like C, P ROMELA offers formal semantics. Moreover, it is
also more powerful in expressing concepts like nondeterminism, passive wait for an
event, concurrency and so on, as will be better discussed in Section 13.3. On the
other hand, PROMELA suffers from some weak points. For example, it supports only a limited set of data types; for instance, pointers are not available.
Since the model is written in a language different than the one used for the imple-
mentation of the original system, it is essential to ensure that the model is a faithful
representation of the real system. In other words, the discrepancy between the model
and the corresponding system should be kept to a minimum. Otherwise, the results
of model checking are just meaningless.
Recently, it has become possible to abstract a P ROMELA model automatically
from the system implementation. However, at the current time, this is supported only for systems developed with the C programming language [97]. Moreover, automatic trans-
lation may include more details than necessary for what concerns the verification of
a property, and in turn this makes the verification inefficient. As a consequence, this
topic will not be further discussed in the book.
SPIN is one of the most powerful model checkers. Its name stands for Simple Promela INterpreter; that is, SPIN performs model checking based on a formal model written in PROMELA. SPIN can work in two modes, simulation and verification. Simulation examines just one possible computation at a time, whereas verification explores them all. More information about this can be found in Section 13.2.
In order to help readers to build a more solid foundation, some simple examples
are provided while introducing the basic concepts. A much more complex example,
which shows how model checking can help to detect low probability issues of an
industry level distributed multi-master election protocol, is presented in Chapter 14.
It is well known that model checking, or verification in general, can be quite time- and memory-consuming, not only because it requires examining the whole state space, but also because a single state can become quite complex as system com-
plexity grows. Some hints to improve the verification performance are discussed in
Section 13.5.
As mentioned, SPIN requires writing a separate model of the system under analysis, and it is also important to make sure that the model is consistent with the real system, which is not a trivial job at all. Unlike SPIN, SPLINT, a tool that can detect code mistakes and vulnerabilities (especially memory access errors) by means of static code analysis, works directly on the C source code. It will be discussed in Chapter 16.
when it is invoked with either the -f or -F command-line options. LTL will be discussed in more detail in Section 13.4.
When both the model and the property to be verified (shown in the light gray
box of Figure 13.1) are available, S PIN is able to generate a verifier corresponding
to them when invoked with the command-line option -a. Moreover, the verifier is
written in the C programming language, hence it is possible to compile it with the
native C compiler to get the executable file, which then can be run in order to carry
out the verification. By convention, the executable file of the verifier is called pan,
which is a shorthand for Protocol ANalyzer.
After the verification process is completed, S PIN produces two kinds of outputs:
a verification report and optionally a trail if the verification result turns out to be
negative. Both outputs are shown as dark gray boxes in Figure 13.1. The verification
report summarizes the verification results, state space exploration information, mem-
ory usage, possible values of variables defined in the model, and so on. On the other
hand, the trail provides a counterexample which leads to the violation of the target
property. After that, users can follow a guided simulation driven by the trail to find
out how it takes place exactly with all necessary details. This could be quite helpful
to locate the issue and fix it.
In addition, several graphical front ends, which permit more convenient interac-
tion with S PIN rather than through the command line, are available for a variety of
operating systems. More information about this can be found in [73]. Besides gener-
ating the verification report and optionally the trail, these tools (for example xspin),
are powerful also in the sense that they are able to perform post-processing of the
S PIN textual output and present it in a more convenient way. For example, xspin
can derive a message sequence chart (MSC) [94] from the trail which demonstrates
inter-process communication of a model.
It is worth mentioning that examples shown in the following make use of S PIN
version 5. Moreover, technical details internal to S PIN are out of the scope of this
book, except that some hints to improve the verification performance are provided in
Section 13.5. Interested readers can refer to References [74, 76].
Last but not least, from version 5 on, S PIN also supports model checking on multi-
core systems [73]. Reference [77] presents the extensions to S PIN in order to make it
support model checking on multi-core machines, as well as the related performance
measurement.
13.3.1 PROCESSES
Processes are the main “actors” in a P ROMELA model. They contain all the exe-
cutable statements of the model. A process can be declared by means of the proctype
keyword as follows:
proctype P(<param>) {
<body>
}
in which, P indicates the process name, <param> is a list of parameters, and <body>
represents the body of process P. Executable statements should be put in <body>.
Moreover, statements are separated but not terminated by a semicolon (;). As a con-
sequence, in the following sequence of statements belonging to process P,
proctype P(byte incr) {
byte temp;
temp = n + incr;
n = temp;
printf("Process P: n = %d\n", n)
}
it is not necessary to have a semicolon at the end of the last statement. By the way,
as can be seen from the above listing, the PROMELA syntax shares a lot in common with the C programming language, including comments. This makes models written in it intuitive to understand.
Processes declared in the above way must be instantiated explicitly one or more
times, by means of the run operator. If a process has parameters in its declaration,
then the same process can be instantiated multiple times, with different parameters.
For instance,
run P(3);
run P(7)
As we can see, the run operator takes as arguments a previously defined proctype
and a (possibly empty) list of parameters which match the formal declaration
of the proctype. It is worth mentioning that the processes created with run do not
necessarily start their execution immediately after run completes.
Process declarations can also be active:
active [<n>] proctype P () {
<body>
}
1 byte n;
2
3 proctype P(byte incr) {
4 byte temp;
5
6 temp = n + incr;
7 n = temp;
8 printf("Process P: n = %d\n", n)
9 }
10 init {
11 n = 0;
12 atomic {
13 run P(3);
14 run P(7)
15 }
16 }
By the way, when illustrating the P ROMELA syntax, optional terms are specified
with (non-quoted) square brackets ([...]), whereas a list of terms is enclosed in
angle brackets (<...>). If [<n>] is omitted, the default value 1 will be used. This
means that there is just one instance of process P.
In P ROMELA, there is a special process, namely the init process,
init {
<body>
}
The init process, if declared, is always the first process to be activated. It is
typically used for:
• Instantiation of other processes.
• Complex initialization of some global variables. For instance, nondeter-
ministic choice of a value. This can be useful to model different configura-
tions/settings of a system, external inputs to a system, and so on.
Figure 13.2 shows a simple example of the init process. It initializes a global
variable n and creates two copies of process P with different parameters. In the exam-
ple, the global variable is initialized in quite a straightforward way. Better examples
will be shown at a later time when control statements which permit nondeterministic
execution are introduced in Section 13.3.4.
It is worth remarking that, by convention, instantiation of a group of processes is
performed within an atomic sequence, as shown in the above example (line 12). It
ensures that executions of those processes start at the same time. One advantage of
doing it this way is that all possible interleavings among processes can be examined.
More information about this is provided in Section 13.3.5.
Table 13.1
Basic P ROMELA Data Types on a Typical 32-Bit Architecture
The above listing declares three variables, a bit variable x, a Boolean variable
done, as well as a byte variable a. Moreover, the Boolean variable and the byte
variable are initialized to false and 78, respectively. In P ROMELA, if no initial
value is provided, a variable is initialized to 0 by default.
The syntax for variable assignment is similar to C. For instance,
byte x;
x = 4
declares a variable of type byte and then assigns the value 4 to it. However, it should be noted that, in PROMELA, this is not exactly the same as byte x = 4, because the two-statement form may introduce additional, unnecessary, or even unexpected system states. For instance, in the above listing, there are some states in which the value of x is 0 instead of 4.
This is especially important when modeling a concurrent system, where other
processes could observe the value of x, and may perform actions correspondingly.
It is worth noting that data types, like characters, strings, pointers, and floating-
point numbers, which are commonly used in most programming languages, are not
available in P ROMELA.
Other data types supported in P ROMELA include array, mtype, and chan. More
specifically, only one-dimensional arrays are directly supported by P ROMELA. Ar-
rays can be declared in a way similar to the C language. For instance,
byte a[10] = 8;
the above listing declares an array called a with 10 elements, each of which contains
a byte value initialized to 8. If no initial value is provided, all elements get a default
value of 0. Arrays are accessed by index, and the first element of an array is at index zero. It is also possible to specify multi-dimensional arrays in PROMELA, by
means of the structure definition, which will be shown later.
An mtype keyword is available in P ROMELA, which can be used to define sym-
bolic names for numeric constant values. For instance,
mtype = {mon, tue, wed, thu, fri, sat, sun};
declares a list of symbolic names. Then, it can be used to specify variables of mtype
type and assign values to the variable, as follows,
mtype x = fri;
Channels, used to exchange messages among processes, are declared by means of the chan keyword:
chan ch = [capacity] of {type, ..., type};
where ch represents the name of a channel variable that can be used by processes to
send or receive messages from a channel. type, ..., type defines the structure
of messages held by a channel. capacity indicates the number of messages that
can be stored in a channel. It can be either zero or a positive value, which represents
a rendezvous unbuffered channel or a buffered channel, respectively. The difference
between the two types of channel together with possible operations over channels
will be discussed in Section 13.3.6.
chan c = [4] of {byte, mtype}
The above listing declares a buffered channel c that can hold 4 messages, which are
made up of a byte value and a field of mtype type.
Channels are commonly declared as global variables so that they can be accessed
by different processes. Actually, any process can send/receive messages to/from any
channel it has access to. As a result, contrary to what their name indicates, channels
are not constrained to be point-to-point. Local channels are also possible [20], but
their usage is beyond the scope of this book.
If the aforementioned data types are not expressive enough, P ROMELA also sup-
ports user-defined structured data types by means of the typedef keyword, followed
by the name uname chosen for the new type and a list of fields that make up the
structure, as follows,
typedef uname {
type f_name1;
...
type f_nameN
}
Individual fields can be referred to in the same manner as in the C language, for instance
uname.f_name1. It is worth mentioning that a typedef definition must always be
global, whereas it can be used to declare both global and local variables of the newly
defined type. The following listing defines a two-dimensional array with each of its
elements of byte type. Moreover, one element of it is assigned with value 10.
typedef arr{
byte b[5];
}
arr a[3];
a[1].b[4] = 10
For what concerns variable declaration, there are exactly two kinds of scope1 in
P ROMELA, namely:
• One single global scope. Global variables are visible to all processes in the
system and can be manipulated by all of them
• One local scope for each process. The scope of a local variable is the whole
process, regardless of where it is declared. And there are no nested scopes.
All local variable declarations are silently moved to the very beginning of
the process where they are declared. This may have unexpected side effects,
for instance, in the following listing
1 active proctype P() {
2 byte a = 1;
3 a = 2;
4 byte b = a
5 }
1 It is worth noting that an additional block scope has been added to S PIN version 6 [73].
Table 13.2
Summary of P ROMELA Operators in Decreasing Precedence Order
Last but not least, it is recommended that programmers pick the most appropriate data types for variables in a model. This is because the data type size di-
rectly affects the state vector size during verification. A system state of a P ROMELA
model consists of a set of values of its variables, both global and local, as well as
location counters, which indicate where each process currently is in its execution.
Generally speaking, during verification, the model checker needs to store the state
in memory and examine every possible value of a variable. As a result, if variables
are defined with data types which are unnecessarily large, it could easily lead to state
space explosion and make the verification lengthy or even unfeasible. For instance,
the capacity of a channel should be selected carefully. Otherwise, due to the rather
complex structure of channels with respect to other data types, the state vector size
may increase in a significant way with inappropriate channel capacity values, which
in turn impairs verification efficiency. Detailed information about this topic is pro-
vided in Section 13.5.
13.3.3 EXPRESSIONS
The way a P ROMELA expression is constructed is, for the most part, the same as in
the C language and also most C-language operators are supported. They are listed
in Table 13.2 in decreasing precedence order; operators with the same precedence
are listed together. The most important difference of expressions in P ROMELA with
respect to C relates to the fact that it must be possible to repeatedly evaluate a
Model Checking of Distributed and Concurrent Systems 387
it becomes true. This rule applies when an expression is used either stan-
dalone or at a guard position of a statement, for instance, in a selection state-
ment or a repetition statement, which will be introduced in Section 13.3.4.
For instance, a process that encounters a standalone expression i==3
blocks until i is set to 3 by other processes. As we can see, expressions
can profitably be used to synchronize multiple processes.
• The executability rules for more complex types of statement, like those
aforementioned, will be given along with their description.
It is worth noting that, besides being used in the init process to instantiate other
processes, the run operator is the only operator allowed inside expressions that can
have a side-effect. This is because run P() returns an identifier for the process
just created. Upon successful creation, a positive process identifier is returned. Oth-
erwise, the return value is 0. Most importantly, process identifiers are also part of the system state. As a consequence, the run operator should be used carefully inside
expressions. For interested readers, more detailed information can be found in [76].
Selection statement
The selection statement models the choice among a number of execution alternatives.
And its general syntax is as follows:
if
:: guard_1 -> sequence_1
...
:: guard_n -> sequence_n
fi
It starts with the if keyword and ends with the fi keyword. It includes one or
more branches, each starting with a double colon (::). A branch is made up of a guard and the sequence of statements that follows it; the sequence is executed only if the guard evaluates to true. A guard is usually an expression which results in a Boolean value, either true or false.
By the way, the arrow separator is an alternative for the semicolon (;) separator.
But semantically, they are the same. The arrow separator is more commonly used
in selection statements and repetition statements, which will be introduced later, be-
tween a guard and the following sequence of statements to better visually identify
those points in a model where execution could block. It is mainly for improvement
of readability.
The selection statement is executable if at least one guard is true. Otherwise,
the process is blocked. Instead, if more than one guard evaluates to true, a nondeterministic choice exists among them, and verification can proceed along any of the corresponding branches. As a result, this corresponds to multiple execution paths during verification. There is a special guard else, which is true if and only if all the other
guards are false.
A concurrent program follows one possible execution path in a single run, and different runs may lead to different results. In order to verify whether a certain property holds or not, it should be checked on every possible execution path. As we can see, PROMELA provides exactly what is needed to express this.
The above two listings show the conditional statement in C (left) and the selection
statement in P ROMELA (right). The two listings may not make any sense from the
logic point of view. However, syntactically, they are not wrong. They are used here
just to demonstrate the difference between the two languages. At first glance, the two
if statements may look quite similar to each other. However, they are not equivalent.
This is because, in C, the if predicates are evaluated in a sequential manner. In
other words, if predicate A comes before predicate B, then A is evaluated first. If A
is true, the statements associated to it are executed and B will not be evaluated. After
the execution of the statements associated to A, the if statement terminates. Only if
A is false, will B be examined.
Since the predicate i<4 is true in this example, the value of j will always be 1.
Instead, for what concerns the P ROMELA code, since the first and second guards
are both true, there is a nondeterministic choice between their associated statements
and both execution paths are possible. In this case, j gets the values 1 and 2 in differ-
ent execution paths. This way of reasoning is also confirmed by the S PIN verification
output.
unreached in proctype P
line 4, "pan_in", state 6, "j = 3"
(1 of 9 states)
The first part reports that the statement j=3 is unreachable during verification,
whereas the second part indicates that j is assigned the values 1 and 2. By the way,
pan_in is a conventional way of naming a model when it is supplied to S PIN. It is
worth noting that some verification output is omitted for clarity.
Repetition statement
do
:: guard_1 -> sequence_1
...
:: guard_n -> sequence_n
od
The repetition statement is embraced with the pair of keywords do and od. It also
contains one or more branches starting with the guard and followed by a sequence of
statements, like in the selection statement.
It behaves as follows: after the execution of statements associated to a true guard,
it goes back to the beginning of do and the guards are evaluated again. If any of the
guards are true, the execution will continue with the statements following it. This
process is repeated until the break keyword is encountered, which terminates the loop; control then passes to the statement following od.
It is worth remarking that the while loop implemented in the C code (shown in
the left listing) cannot be translated to the P ROMELA code shown on the right.
This is because the body of the C while loop, namely i++, is executed three times
and then the assignment j=i is performed. As a result, both i and j get the value 3. Instead, the PROMELA process never goes beyond the do loop. The reason is that, after being incremented three times, i becomes 3. After that, execution goes back to the beginning of the do loop and, when the guard is evaluated again, it is no longer true. The repetition statement is therefore no longer executable and process P will block at this statement indefinitely, if there are no other processes in the system operating on i.
The S PIN verification output is shown in the following,
pan: invalid end state (at depth 5)
pan: wrote pan_in.trail
unreached in proctype P
line 7, "pan_in", state 6, "j = i"
line 7, "pan_in", state 7, "-end-"
(2 of 7 states)
The above results show that statement j=i in the P ROMELA model is unreachable
because execution is blocked within the repetition statement. Moreover, i takes any value between 0 and 3.
In order to obtain the same results as the C code, a second branch should be added
to the do loop of the P ROMELA code, as follows,
int i=0, j=0;
active proctype P() {
do
:: (i<3) -> i++
:: else -> break
od;
j=i
}
As discussed before, the else guard is true if and only if all other guards are
false. Consequently, when i becomes 3, the first guard is false. Instead the else
guard becomes true and the statement that comes after it, namely break, is executed,
which terminates the loop. The execution proceeds to the assignment statement and
j gets the value of 3 as well.
If the else guard in the above loop is replaced by the true keyword, as follows
int i=0, j=0;
active proctype P() {
do
:: (i<3) -> i++
:: true -> break
od;
j=i
}
it will behave in yet another way. More specifically, according to the S PIN verifi-
cation output shown in the following, variable j gets all possible values between 0
and 3. Besides, it also shows the state space exploration information, for instance, a
system state is stored in a 16-byte state vector.
State-vector 16 byte, depth reached 9, errors: 0
19 states, stored
0 states, matched
19 transitions (= stored+matched)
0 atomic steps
The aforementioned result can be easily justified by the fact that, unlike else,
the true guard is always true. As a consequence, when i is less than 3, both guards
within the repetition statement are true. This represents a nondeterministic choice.
In other words, at each iteration, it is possible to either increment the value of i
or terminate the loop. Verification examines all possible computation paths, from terminating the loop immediately upon entering it, to performing three increments and then terminating.
This variant of the do loop can be used to nondeterministically choose a value
within a certain range, in order to use it in the subsequent code. It can be profitably
used in the init process to perform complex initialization of global variables.
There are two other types of statement, namely the jump (goto) statement and the unless clause.
The jump statement breaks the normal computation flow and causes the control to
jump to a label, which is a user-defined identifier followed by a colon. For instance,
label_1:
...
if
:: (i > 6) -> goto label_1
:: else -> ...
fi
in the above listing, if the guard i>6 is true, it will jump to where label_1 is and
execute the code that follows that label. As we can see, if there is more than one
jump statement, it makes the code less structured and may impair the readability of
a model. Consequently, it is recommended to limit the use of the jump statement.
An exception to this rule is when the original system can be naturally represented as a finite state automaton, in which the system is made up of multiple states, takes different actions in different states, and moves from one state to another through transitions. In this case, a state can be modeled as a label
followed by a group of statements, whereas the transition can be modeled by a goto
statement.
The unless clause is used to model exception catching. An unless block can be associated with either a single statement or a sequence of statements, and it preempts them as soon as its own first statement becomes executable. For instance,
active proctype P() {
byte i=4, j=2, k;
do
:: {
k = i / j;
i--;
j--;
} unless {
j == 0 -> break
}
od
}
in the above example, i is divided by j and the result is stored in variable k. After
that, both i and j are decremented. This process is repeated until j becomes 0,
which is an invalid divisor, an exception caught by the unless clause.
The unless clause is commonly used to model situations in which an event may occur at an arbitrary point in a computation.
For more information about the jump statement and the unless clause, interested
readers can refer to [20].
13.3.5 ATOMICITY
In P ROMELA, only elementary statements (namely expressions and assignments) are
atomic. During verification, an atomic statement is always executed as an indivisible
unit. Consequently, variables will not change their value in between. If there is any
side effect associated with an atomic statement (for instance the assignment state-
ment), the effect is visible to other processes either entirely or not at all. For what
concerns the following P ROMELA code, where a variable a of the int type is first
declared and initialized to 0 and then assigned with value 1,
int a = 0;
a = 1
in P ROMELA, any process observing this variable will see either 0 or 1, but nothing
else.
On the contrary, a sequence of statements is not executed atomically. This is because processes' location counters are part of the system state, too, and they advance at the statement level.
It is possible to bring the execution of a sequence of statements closer to atomicity with two distinct specifiers, namely atomic and d_step, as follows.
atomic {
sequence
}
d_step {
sequence
}
• The atomic sequence is executable if and only if its first statement is ex-
ecutable. As a consequence, the first statement works as a guard for the
atomic sequence. Once the atomic sequence starts executing, it is executed without interleaving with other processes, unless one of its statements blocks.
From the verification point of view, a d_step sequence is more efficient than an atomic sequence, because there is only one possible execution path through a d_step sequence, whereas an atomic sequence can still interleave with other processes at certain points.
As a result, d_step should be reserved for fragments of sequential code which can be executed as a whole, whereas atomic is preferred for implementing synchronization primitives.
The following example shows a case where atomicity is needed and how an
atomic sequence can help to address the issue. The following fragment of code (un-
fortunately not completely correct, as we will see later) is an attempt to force multiple
processes to enter a critical region in a mutually exclusive way.
1 bool lock = false;
2 byte crit = 0;
3
4 active [2] proctype P() {
5 !lock;
6 lock = true;
7
8 crit++;
9 assert(crit == 1);
10 crit--;
11
12 lock=false
13 }
Within the critical region (lines 8–10), both processes will access the shared
variable crit, by first increasing its value by 1 (line 8) and then bringing it back
(line 10). If mutual exclusion is implemented successfully, there should be only one
process in the critical region at a time. As a result, when the value of crit is evalu-
ated at line 9 by any process, it should always be 1.
By the way, the above code also includes an assertion (line 9). An assertion can
be specified by means of the assert keyword, followed by an expression. It can be
placed anywhere between two statements. Assertions are mainly used to help SPIN trap states in which the asserted expression does not hold.
Indeed, when the verifier is run on this model, SPIN complains that the assertion is violated. In particular, the crit variable can take the value 2. SPIN writes the counterexample found during verification into a trail file, namely pan_in.trail.
As will be shown in the following guided simulation output driven by the trail
file, the issue is related to the atomicity of execution. More specifically, the sequence
of statements (line 5–6) of the P ROMELA code is not executed atomically and its
execution by a process can interleave with other processes’ activities.
1 Starting P with pid 0
2 Starting P with pid 1
3 1: proc 1 (P) line 5 [(!(lock))]
4 2: proc 0 (P) line 5 [(!(lock))]
5 3: proc 1 (P) line 6 [lock = 1]
6 4: proc 1 (P) line 8 [crit = (crit+1)]
7 5: proc 1 (P) line 9 [assert((crit==1))]
8 6: proc 0 (P) line 6 [lock = 1]
9 7: proc 0 (P) line 8 [crit = (crit+1)]
10 spin: line 9 "pan_in", Error: assertion violated
11 spin: text of failed assertion: assert((crit==1))
As shown in the guided simulation output, two instances of process P are created,
with different process identifiers, namely pid 0 and pid 1 (lines 1–2). The guided
simulation output also shows how the two processes interleave with each other and
give rise to the error step by step. More specifically, process 1 first evaluates !lock
(line 3) and proceeds beyond it, because lock is false at the moment. Then, before
process 1 sets lock=true according to its next statement, process 0 also evaluates
the expression !lock and it is still true (line 4). As a result, both processes are
allowed to enter the critical region. After that, process 1 first keeps going for three
steps within the critical region by setting the lock, incrementing the shared variable,
and performing assertion (lines 5–7). And then process 0 tries to do the same and the
assertion fails (lines 8–9).
A straightforward solution to this issue is to execute both the test expression and
the subsequent assignment atomically, which can be achieved by means of the fol-
lowing construct:
atomic {
!lock;
lock=true
}
According to the rules of atomic sequence, since the assignment statement is al-
ways executable, when a process evaluates the guard and it is true, it will proceed to
the end of atomic sequence without interleaving with other processes. This indicates
that if other processes attempt to enter the critical region at the same time, the atomic
sequence within them is not executable, as the guard is not executable.
As we can see, this models how a synchronization semaphore (which has been
discussed in Section 5.3) behaves. Actually, d_step could also be used in this simple example, and it would make no difference.
It is important to remark that introducing an atomic or d_step sequence in a model
is reasonable if and only if it is guaranteed that the modeled system works in exactly
the same way. Otherwise, verification results may be incorrect as some possible in-
terleavings among processes are not considered.
Referring back to Figure 13.2, since the instantiation of two processes is embed-
ded in an atomic sequence within the init process, interleaving of the init process with
other processes is not allowed as long as the atomic sequence does not encounter any
blocking statement. In this way, instantiation of the two processes will be completed
before any of them can be executed. The execution of the two processes will start
afterward so that all possible ways of interleaving between them can be modeled.
13.3.6 CHANNELS
As mentioned in Section 13.3.2, channels are used to exchange messages among
processes. There are two types of operations that can be performed upon a chan-
nel, namely send and receive. The basic form of send and receive operations are as
follows,
chan ch = [capacity] of {type, ..., type};
ch ! exp, ..., exp;
ch ? var, ..., var
The ! represents the basic form of the send operation, whereas ? represents the
receive counterpart. exp, ..., exp is a sequence of expressions whose values
are sent through the channel. Instead, var, ..., var is a sequence of variables
that are used to store the values received from the channel. If one of them is specified
with _ (underscore), it indicates that the corresponding field of the message is not
interesting to the receiver and its value will simply be discarded.
For both the send and the receive operations, the number and types of the expres-
sions and variables must match the channel declaration.
If the capacity of a channel is 0, it models a rendezvous channel. Since there is no
room in this kind of channel to store the message, it is required that both the sender
and receiver should be ready when the message is about to be exchanged. Moreover,
the exchange of message will be performed in one atomic step.
More specifically, when two processes A and B would like to exchange messages
through a rendezvous channel, as shown in the two parts of Figure 13.3,
1. If process B encounters the receive statement first, while at the same time process
A is not yet at the point to send the message, process B will be blocked at the receive statement until process A reaches the corresponding send statement.
A buffered channel is declared in the same way, but with a positive capacity, for instance
chan ch = [k] of {type, ..., type}
where k is used to indicate the capacity of a buffered channel and k>0. The basic
form of send and receive statements can also be used for it. Regarding a buffered
channel, the send statement is executable as long as the channel is not full. In other
words, the number of messages held in the channel is less than k. Instead, the receive
statement is executable as long as the channel is not empty, namely the number of
messages stored in it is greater than 0.
It should be noted that ordinary send and receive statements operate on a channel
in a first in first out (FIFO) order. Moreover, channel contents are also part of the
system state.
There are four predefined Boolean functions available in P ROMELA which can be
used to check the state of a channel, whether it is empty or full. They are,
empty(ch)
nempty(ch)
full(ch)
nfull(ch)
As we can see, they all take the name of a channel as argument. There are four
functions instead of two because the negation operator (!) cannot be applied to them
for a reason related to how S PIN works internally. Interested readers could refer to
Reference [20] for more information. Instead, nempty (not empty) and nfull (not
full) are used to specify the counterparts of empty and full, respectively. These
functions can be used as a guard to determine whether a certain operation, either
send or receive, can be carried out or not.
For instance, the following code specifies one sender and two receivers, which
access the same channel that can store only a single one-byte message. The two
receivers intend to receive a message from the channel if it is available, otherwise
they simply terminate without doing anything. The skip keyword simply concludes
a selection statement and moves the execution to the next statement. It is always
executable.
1 chan c = [1] of { byte };
2
3 active proctype S()
4 { c ! 1
5 }
6
7 active [2] proctype R()
8 { byte v;
9 if
10 :: nempty(c) -> c ? v
11 :: empty(c) -> skip
12 fi
13 }
However, it does not work as expected. Instead, the SPIN verification output re-
ports an invalid end state.
pan: invalid end state (at depth 4)
pan: wrote pan_in.trail
By following the trail, the guided simulation output shown in the following indi-
cates that the issue is related to the lack of atomicity between the guard and the
receive operation. More specifically,
1 Starting S with pid 0
2 Starting R with pid 1
3 Starting R with pid 2
4 1: proc 0 (S) line 4 [values: 1!1]
5 1: proc 0 (S) line 4 [c!1]
6 2: proc 2 (R) line 10 [(nempty(c))]
7 3: proc 1 (R) line 10 [(nempty(c))]
8 4: proc 2 (R) line 10 [values: 1?1]
9 4: proc 2 (R) line 10 [c?v]
10 5: proc 2 terminates
11 spin: trail ends after 5 steps
the sender sends a message to the channel (lines 4–5). This leads to both receivers
passing their guard since the channel is no longer empty (lines 6–7). One receiver
receives the message (lines 8–9) and then terminates (line 10). Instead the other
receiver is stuck in its receive operation and never terminates. S PIN reports this situ-
ation as an invalid end state.
The issue can be addressed by enclosing the check and the receive operation in
an atomic sequence by means of the atomic keyword, as sketched below. By contrast,
d_step cannot be adopted here because, besides the guard, c?v could also block.
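A minimal sketch of the corrected receiver, under the assumption that the channel c and the local variable v are declared as in the previous listing, is:

active [2] proctype R()
{
    byte v;
    atomic {
        if
        :: nempty(c) -> c ? v  /* the guard and the receive can no longer */
                               /* be separated by the other receiver      */
        :: empty(c) -> skip
        fi
    }
}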
Besides the ordinary receive statement, several other types of receive statements
are also available, including random receive, nondestructive receive, as well as
polling receive.
Random receive can be specified with the double question mark (??) in the fol-
lowing way,
ch ?? exp_or_var, ..., exp_or_var
Unlike the basic form of the receive statement, the argument list may contain
expressions as well as variables. The statement receives the first message in the
channel that matches the given expressions. This message could be at any position
in the target channel. As a result, it
does not follow the FIFO order. Variables are handled as usual. If there is no match,
the random receive is not executable.
The next type of receive can be built on top of both ordinary receive and random
receive by means of the angle brackets (<>),
ch ? <var, ..., var>
ch ?? <exp_or_var, ..., exp_or_var>
The meaning is the same as that of the ordinary and random receive. The most
important difference is that the (matched) message is not removed from the channel.
Variables are still assigned the values of the corresponding message fields. Hence,
this type of receive still has side effects and thus it cannot be used as an expression.
On the contrary, polling receive statements written as follows
ch ? [var, ..., var]
ch ?? [exp_or_var, ..., exp_or_var]
are expressions. They are true if and only if the corresponding receive statements are
executable. Variables are not assigned. As a result, there is no side effect and they
can be used as a guard.
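The following fragment, a purely illustrative sketch (channel, field values, and process name are assumptions), contrasts the receive variants just described:

chan q = [4] of { byte, byte };

active proctype demo()
{
    byte a, b;

    q ! 3, 30;  q ! 7, 70;      /* two buffered messages                   */

    q ?? 7, b;                  /* random receive: removes (7,70), the     */
                                /* first message whose first field is 7    */
    q ? <a, b>;                 /* nondestructive receive: copies (3,30)   */
                                /* but leaves it in the channel            */
    if
    :: q ? [3, 30] -> q ? a, b  /* polling receive used as a guard,        */
                                /* followed by the ordinary receive        */
    :: else -> skip
    fi
}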
The following LTL formula specifies that, for every possible computation, it will
always be true that, whenever F is true, then sooner or later G will become true.
[]( F -> <> G )
The above example highlights an important difference between the temporal op-
erators and ordinary Boolean operators: the truth of a temporal formula does not
depend only on the current state, but also on the states that follow it.
In fact, the above example can be read in an alternative way, that is, whenever there is
a state within which F is true, it is necessary to examine the following states to see
whether there is a state within which G is true.
Properties specified as LTL formulae can be automatically translated by S PIN into
P ROMELA code before verification, and become part of the model being verified.
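The time and memory needed for verification depend critically on the number of reachable states of the model. As a concrete illustration, consider a small model in which a process nondeterministically fills a channel; the listing discussed below is assumed to have the following form (a minimal sketch, with the value of N chosen purely for illustration):

#define N 8

chan c = [N] of { byte };

active proctype fill()
{
    byte v;
    do
    :: if              /* choose the next value nondeterministically */
       :: v = 0
       :: v = 1
       fi;
       c ! v           /* blocks forever once the channel is full    */
    od
}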
In the above listing, the fill process fills channel c with N byte-sized values,
chosen in a nondeterministic way, and then blocks because the channel buffer is full.
During an exhaustive state space exploration, the number of states x grows expo-
nentially with respect to N, that is, x ≃ 2^(N+1). This is because, just considering the
channel variable c, the content of its buffer goes from being completely empty (at
the beginning of verification) to being completely full (at the end) and, at any step,
it can contain any sequence of values, from all 0s to all 1s. In order to perform an
exhaustive search, it is necessary to consider and check the property of interest in all
these states.
It is not the case that the whole state space is built all at once, and then verification
is carried out on it. On the contrary, state vectors are calculated on the fly and stored
into a hash table. More specifically, when storing a state vector, a hash function
is applied to calculate an index which indicates the position where the state vector
should be stored within the table. If there is already a state vector stored in that
position, and it is not the same as the current one, a hash conflict occurs. In this case,
the new state vector is stored elsewhere and linked to the existing state vector through
a linked list. By intuition, as the number of hash conflicts increases, storing a new
state becomes less and less efficient because the linked lists grow. As a consequence,
more time is required to linearly scan the lists, just to check whether or not a certain
state vector is already in the table.
Verification is the process of checking whether a given property holds in each
state. As a consequence, if a state has already been checked, there is no need to check
it again. Let us assume that the verification is currently considering a certain state,
and it is about to execute a statement. After execution, the values of some variables
in the model may change. This leads to another state, namely, a state transition takes
place. What S PIN does next is to look into the hash table:
• If the new state has not been stored yet, then it checks whether the property
holds on that state. If it does, S PIN stores the state vector into the hash table,
otherwise S PIN just found a counterexample.
• If the state has already been stored in the table, it is a proof that the property
holds on that state. In this case, there is no need to store it again and the
program can move to the next step.
Overall, during verification, S PIN keeps storing state vectors into the hash table
and looking up newly created state vectors to see whether or not the correspond-
ing states have already been visited before. This process may be highly time and
memory consuming, depending on many different factors. For instance, if the hash
algorithm in use is not effective, it is possible to have long linked lists somewhere
in the hash table, whereas other parts of the table still remain empty.
By the way, the verification process ends when either a counterexample is found
(with the default verification option of S PIN), or all processes reach their ends, or no
more states can be generated (for example, all processes are blocked).
When trying to improve the performance of S PIN, the goal is to achieve the best
trade-off between speed and memory requirements. More specifically, speed depends
on many factors, such as how large the hash table is, how effective its management
algorithms are, and how efficient S PIN is in updating and searching it. However,
speed improvements in most cases have some impact on memory requirements.
This topic can be addressed in different ways and at different levels, starting from
the model itself down to configuring the way S PIN performs verification bearing in
mind how some S PIN algorithms work internally. Some of the most commonly used
optimization methods are summarized in the following:
• Atomic and deterministic sequences. Statements that need not be inter-
leaved with the execution of other processes can be grouped into atomic or
d_step sequences directly in the model.
In this case, instead of considering all possible execution orders, and gener-
ating the corresponding intermediate states, it is enough to follow just one
execution order. In some cases, this is a quite effective method since it can
reduce the size of the state space sharply.
• Bitstate hashing and hashing compact. Instead of allocating more mem-
ory to accommodate a large hash table, it is also possible to reduce its mem-
ory requirements.
For what concerns bitstate hashing, which is also known as supertrace, a
state is identified by its index in the hash table. As a result, a single bit is
sufficient to indicate whether a state has already been visited or not. How-
ever, it is possible that two different states may correspond to the same
index, due to hash conflict. If the “visited” bit is already set for a certain
state, when coming to verify whether the property of interest holds on a
different but conflicting state, the result is positive, even if this may not be
the case in reality. In other words, some parts of the state space may not
be searched and it is possible to miss some counterexamples. However, if a
counterexample is found, it does represent a true error.
For what concerns hashing compact, instead of storing a large hash table,
the indexes of visited states in that table are stored in another, much smaller,
hash table. It has the same issue as bitstate hashing, because two different
states may have the same index in the large hash table, and thus collide.
Both methods are quite effective in reducing the memory requirements of
state storage. However, they are lossy and entail a certain probability of
false positives in the verification results, that is, a property may be
considered true although some counterexamples do exist. The false positive
probability can be estimated and often brought down to an acceptable level
by tuning some of the algorithm parameters [75].
Except for the first one, the other optimization methods can be enabled in S PIN
by configuration, through either command-line options or graphical front ends. For
instance, state vector compression can be specified with the -DCOLLAPSE option
when compiling the verifier to obtain the executable file.
13.6 SUMMARY
This chapter provides an overview of formal verification of distributed concurrent
systems through model checking, which evaluates whether or not a property holds
on the model corresponding to the practical system by exhaustively exploring the
state space.
S PIN is one of the most widespread model checkers. The general verification flow
of S PIN is introduced in Section 13.2. Besides being able to carry out verification in
an efficient way, it can also provide a trail if the property to be verified is violated and
a counterexample is found. In this case, users can follow the trail to better understand
the issue and then fix the problem by other means. In this respect, model checking is
most effective when it is used together with other debugging techniques.
P ROMELA, which is the input language for S PIN, can be used to write formal
models for concurrent programs. With respect to ordinary programming languages,
P ROMELA is powerful in specifying concepts like nondeterministic execution, con-
currency, passive wait as well as atomicity. These concepts are illustrated in Sec-
tion 13.3 by means of some intuitive examples.
When the model is ready, the property to be verified on it should also be
specified in a formal way. Simple properties like assertions can be indicated directly
with P ROMELA, whereas more complex properties can be defined by means of linear
temporal logic, which is explained in Section 13.4.
Last but not least, some hints to improve the verification performance are provided
in Section 13.5, from writing an efficient model down to tuning the way S PIN works,
taking into account internal technical details of S PIN.
In order to help readers better understand the concept of model checking, a more
complex example can be found in the next chapter. It considers a multi-master elec-
tion protocol which has already been used in practice and uses model checking tech-
niques introduced in this chapter to identify low-probability issues and optimize the
implementation.
14 Model Checking:
An Example
CONTENTS
14.1 Introduction ................................................................................................... 407
14.2 Distributed Master Election Protocol ............................................................ 408
14.3 Formal P ROMELA Protocol Model................................................................ 412
14.3.1 Simplified Time Model..................................................................... 412
14.3.2 Broadcast Channel Model................................................................. 415
14.3.3 Masters Model .................................................................................. 417
14.4 Formal Verification Results ........................................................................... 419
14.5 Summary........................................................................................................ 425
The general concept of model checking is introduced in the previous chapter. Topics
covered there include the P ROMELA modeling language which can be used to write a
formal model for a practical concurrent system, the linear temporal logic formalism
that can be adopted to specify properties to be verified over the model, as well as the
S PIN model checker together with possible techniques to improve the verification
performance.
This chapter will show a case study based on a distributed multi-master election
protocol that has been deployed in industry for real-time communication. It demon-
strates how model checking can be used to identify low-probability issues which may
never occur in practice and how counterexamples found by the S PIN model checker
can be helpful to correct them.
14.1 INTRODUCTION
Fieldbus networks are still popular nowadays, especially for low-cost distributed em-
bedded systems, because their underlying technology is mature and well-known to
most designers. M ODBUS [119, 121] is one of the most popular fieldbuses. It is
based on the master-slave communication paradigm and can be deployed over the
TIA/EIA–485 [163] (formerly called RS485) physical level channel.
The basic M ODBUS protocol was originally conceived as a low-cost and low-
complexity solution for single-master systems only. It is possible to extend it to sup-
port multiple masters on the same fieldbus segment, as shown in Reference [29],
while maintaining backward compatibility with existing M ODBUS slaves. In this ex-
tension, the master election protocol—a protocol executed periodically to decide
which is the highest-priority master connected to the fieldbus segment and give it
permission to operate on the bus until the next election—plays a key role.
The design of this type of protocol is seemingly straightforward and the designer
may believe that its correctness can be assessed satisfactorily by intuition and testing.
However, as will be shown in the following, formal verification can help to identify
and fix subtle and low-probability issues, which seldom occur in practice, and there-
fore, may be extremely difficult to detect during pre-production testing.
At the same time, the case study is also useful as an example of how a typical real-
time protocol—not necessarily M ODBUS-related—can be modeled and analyzed in
an effective way by means of a contemporary model checker, S PIN [74]. In particular,
it shows how to handle the concept of time and how to model a multi-drop, inherently
broadcast physical channel, like TIA/EIA–485, in an accurate but efficient way, that
is, without introducing any undue complexity in the analysis, which is often one of
the most important issues of practical model checking.
The design of the protocol to be analyzed is briefly recalled in Section 14.2, then
Section 14.3 describes how the corresponding formal model has been built. The re-
sults of the analysis and the ensuing corrective actions are presented in Section 14.4.
Each master is characterized by two main attributes:
• A state, which can be either active, potential, or quiescent. The active
master has full control of the MODBUS segment. A
potential master takes part in the elections proclaimed by the active master,
and is willing to become the active master. A quiescent master takes part
in the election as a potential master does, but does not want to become the
active master.
• An 8-bit, integer priority, which must be unique for, and preassigned to,
each master. To simplify the protocol, the priority is also used to iden-
tify each master. Moreover, it is assumed that priorities lie in the range
[0, np − 1], and np is known in advance.
Unlike the states, which may change over time as a result of an election, the
masters’ priorities are constant. As shown in Figure 14.1, the election comprises
three distinct phases:
1. After ann_t ms (about 500 ms in a typical implementation) since the end of the
previous election, the active master proclaims a new election, by sending an an-
nouncement message containing its own priority (function code Ann).
2. The end of the announcement message acts as a time reference point, shared
among all masters, which marks the beginning of a series of np fixed-length reply
slots, each sl_t ms long (8 ms in a typical implementation). Each master, except
the active one, replies to the announcement in the slot corresponding to its pri-
ority (function code Pot or Qui, depending on whether the master is potential or
quiescent); as before, the reply contains the priority of the sender.
3. After the reply slots, the active master announces the result of the election by
means of an acknowledgment message, bearing the priority of the winning master
(function code Ack).
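Putting the three phases together with the typical timing values mentioned above, and assuming for instance np = 4 masters, an election round consists of the announcement message, followed by 4 reply slots of 8 ms each (32 ms overall), and is concluded by the acknowledgment message; about 500 ms after the end of the election, the active master proclaims the next one.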
All messages exchanged during the election have the same structure and are
broadcast to all masters. Besides the M ODBUS function code, which determines the
message type, their payload contains an 8-bit priority, with the meaning discussed
above.
During initialization, as well as when the notion of which is the active master is
lost (e.g., due to a communication error), a timeout mechanism is triggered to force
the election of a new master. In particular, if a potential master at priority cp detects
that no elections have been proclaimed for ato_t · (cp + 1) ms, it starts an election
on its own. Since the timeout value is directly proportional to each master’s priority
value (lower values correspond to higher priorities), the unfortunate event of two or
more masters starting an election at roughly the same time should be avoided. The
typical value of ato_t used in practice is 1000 ms.
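For instance, with ato_t = 1000 ms, the potential master at priority 0 starts an election on its own after 1000 ms of silence, the one at priority 1 after 2000 ms, and so on. The highest-priority master present on the segment therefore intervenes first, and lower-priority masters observe its announcement before their own timeout expires.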
As an additional protection against the collision between two or more masters—
that is, the presence of two or more masters which simultaneously believe they are the
active master on the M ODBUS segment—the active master monitors the segment and
looks for unexpected election-related traffic generated by a higher-priority master. If
such traffic is detected, the active master immediately demotes itself to potential
master.
In the real protocol implementation, slot timings are slightly more complex than
shown in Figure 14.1, because it is necessary to take into account the limited preci-
sion of the real-time operating system embedded in the masters to synchronize with
the Ann message, in the order of 1 ms. The most obvious consequence is that the
real slots are longer than the messages they are meant to contain, and masters try to
send their messages in the center of the slots. The basic operating requirements the
protocol should satisfy by design are:
1. It shall be correct in absence of communication errors, that is, if the whole proto-
col is performed correctly, the highest-priority master shall win.
2. It shall be resilient to a single communication error in a protocol round and the
subsequent recovery. The worst effect of such an error shall be the loss of the
notion of which is the active master, a transient condition that shall be recovered
in the next election round.
3. The outcome of an election shall not be affected by the silent failure of one or
more masters. By silent failure, we mean that a master simply ceases functioning
without generating any spurious traffic on M ODBUS.
The occurrence of multiple communication errors, as well as the presence of bab-
bling masters, has not been considered in the design, because their probability is
deemed to be negligible with respect to other sources of failure present in the system
as a whole.
The protocol and the masters’ behavior have been specified by a timed finite
state machine (FSM), as shown in Figure 14.2. What’s more, the corresponding C-
language implementation of the protocol has been directly derived from it. Although
the full details of the FSM will not be discussed here, the FSM transitions have been
reported in the upper part of Table 14.1, except those related to error counter man-
agement for the sake of conciseness.
Those transitions are essential because, as will be described in Section 14.3, the
bulk of the formal protocol model is heavily based on them, to the extent that they can
be used to generate the model in a fully automatic way. In turn, this reduces the possi-
bility of any hidden discrepancy between the protocol model and its implementation.
The most important FSM states are the active and standby states. They character-
ize the active master, and the potential/quiescent masters, respectively.
For simplicity, the following discussion just focuses on the table. In Table 14.1,
a transition from an old to a new state is composed of two parts: a predicate and an
action. The predicate enables the transition, that is, the transition can take place only
if the predicate is true. Predicates are written with a C-like syntax in which the clause
rx m(p) is true if a message of type m has been received. If this is the case, variable
p is bound to the priority found in the message itself. The symbol _ (underscore)
matches any message type defined by the election protocol.
Actions are carried out when a certain transition is followed, before bringing the
FSM into a new state. The clause tx m(p) denotes the transmission of an election
message of type m, carrying priority p in the payload.
In both predicates and actions, the special variable t represents a free-running,
resettable counter holding the local master time, in ms. The Boolean variable men
distinguishes between potential (true) and quiescent (false) masters; mrl, when it
is set to true in an active master, forces that master to relinquish its role as soon as
another potential master is detected, regardless of its priority. The array pr is local
to each master; it collects the state of each master (A-bsent, Q-uiescent, or P-otential)
derived from the reply messages they sent during the election. Finally, the function
win is used by the active master to determine which master won the election from
cp, pr, and mrl.
Table 14.1
Original (N) and Additional (A) Transitions of the Election Protocol's Timed
FSM
For example, transition N9 is enabled for a potential master (men), when ato_t ·
(cp + 1) ms have elapsed since the last election (t==(cp+1)*ato_t). In this case,
that master’s FSM goes from the Standby to the Elec_a state after transmitting an
announcement message (tx Ann(cp)), initializing all elements of the array pr to
A (pr[]=A), and resetting the local timer to zero (t=0).
SPIN has no built-in support for the notion of time and, hence, its passage must be
represented explicitly in the model. A few general SPIN extensions for discrete time have been
developed in the past, most notably DTS PIN [22].
Although these extensions are quite powerful—they even support the automatic
translation of a specification and description language (SDL) model [93] into ex-
tended P ROMELA and its verification [23]—for simple protocols, like the one ana-
lyzed here, an ad hoc approach is simpler to deploy and, most importantly, incurs
less verification overhead. Moreover, as will be described later, a proper time model
is also helpful to model a broadcast channel with collisions in an efficient way.
Other tools, U PPAAL [21] in particular, directly support Timed Automata, and
hence, would be good candidates to perform the analysis, too. The relative mer-
its of SPIN and UPPAAL have been discussed at length in the literature. A detailed
comparison—involving a protocol similar to the one considered here—is given, for
instance, in [96]. Generally speaking, the price to be paid to have a richer model is
often a loss of performance in the verification.
In analogy with the protocol FSM specification presented in Section 14.2, the
timer of each master has been represented as an element of the global array t, en-
compassing NP elements, one for each master (Figure 14.3, lines 1–2). To reduce
the timer range, time is expressed in the model as an integral multiple of the slot
length sl_t, rather than in milliseconds.
This represents a compromise between the accuracy and the complexity of the
analysis because, even if the slot length is the minimum unit of time that masters
are able to resolve, our approach entails the assumption that the underlying real-time
operating system is nonetheless able to synchronize each master with respect to slot
boundaries in a proper way.
It must also be remarked that doing otherwise would require the inclusion of an
operating system and the M ODBUS protocol stack model in the P ROMELA spec-
ification to be verified. This would be a daunting task because both components
lack a formal specification. On the other hand, checking masters for proper slot
synchronization—a very simple timing property—can be and has been done in an
immediate and exhaustive way through bench testing.
During initialization, all the timers are initialized nondeterministically with neg-
ative values in the range [−XS, 0], representing the power-up skew between masters,
by means of the choose_skews inline function (Figure 14.3, lines 5–19). Namely, an
initial timer value of −k conveys in a simple way the fact that a certain master started
operating on the bus k slots after the system was powered up.
Similarly, XS represents the maximum time difference between the power-up
times of different masters. It is useful to remark that, in this function, the break
statement at line 12 does not prevent the statement at line 13 from being executed.
Rather, it gives rise to the nondeterministic choice between abandoning the inner do
loop and further decrementing t[i].
By the way, the body of an inline function is directly inserted into the body of a
process wherever the inline is invoked. In PROMELA, an inline call can appear anywhere a
standalone statement can appear.
1 #define NP 4
2 short t[NP];
3 bool tf = true;
4
5 inline choose_skews()
6 {
7 i=0;
8 do
9 :: (i < NP) ->
10 {
11 do
12 :: break
13 :: (t[i] > -XS) -> t[i]--
14 od;
15 i++
16 }
17 :: else -> break
18 od
19 }
20
21 /* The Tick process updates global time when there is nothing else to
22 do in the system, i.e. all other processes are blocked, by using
23 the timeout special variable.
24
25 In this way, it is guaranteed that time will not elapse until no
26 processes are in the condition of carrying out any other action.
27 Implicitly, this also means that processes will always honor their
28 timing constraints.
29 */
30 active proctype Tick()
31 {
32 byte i;
33
34 do
35 :: timeout ->
36 d_step {
37 if
38 :: (tf) -> do_bcast(); tf = false
39 :: (!tf) ->
40 i=0;
41 do
42 :: (i < NP) -> t[i]++; i++
43 :: else -> break
44 od;
45 tf = true
46 fi
47 }
48 od
49 }
In the time model, each slot has been further divided into two sequential phases,
a transmit and a receive phase. The global variable tf distinguishes between them.
This subdivision has no counterpart in the real system. It has been included in the
specification as part of the broadcast channel model, which will be described later.
The last noteworthy point to be discussed is how and when the local timers are
incremented. This is done by a separate process, Tick (Figure 14.3, lines 30–49),
in a d_step sequence enabled by the timeout guard (line 35). According to the
P ROMELA language specification, this guard is true if and only if no other statements
in the whole model are executable.
Therefore, informally speaking, timers are incremented when no other actions are
possible in the whole system. More specifically, masters go from one slot to another
when they have completely carried out all the actions they are supposed to perform in
the current slot. It is important to remark that, in its simplicity, this approach contains
two important assumptions that must be understood and checked for validity in the
real system:
1. Within an election round, masters are perfectly synchronized at the slot level,
because their timers are always updated together, albeit the time value each master
uses can still differ from the others by an integral number of slots. In the system
being considered here, this hypothesis has been satisfied in the design phase, by
calculating the slot length and message location so that, even under the worst-
case relationship among oscillator tolerances and synchronization errors among
masters, they are still able to send and receive their messages in the right slot.
2. All timing constraints concerning the actions that must be accomplished by the
masters within a slot are satisfied, because the model lets all masters perform
all actions they have to perform in a certain slot before switching to the next
one. In the real-world system, this hypothesis has been checked through real-time
schedulability analysis and subsequent testing.
As remarked before, a more precise assessment of these properties through formal
verification would require a formal model of the real-time operating system and the
M ODBUS protocol stack. Due to the complexity of both components, it would likely
have been infeasible.
Since each master is allowed to send only a single message in any given slot, the last statement of bcast
(line 25) is a wait that lasts until the end of the transmit phase.
The actual message broadcast is performed by the do_bcast inline function (Fig-
ure 14.4, lines 37–55). This function is invoked by the Tick process at the boundary
between the transmit and receive phase of each slot (Figure 14.3, line 38). If a mes-
sage to be sent has been collected in mb t and mb p (Figure 14.4, line 39), it is sent
to all masters, except the senders themselves (line 44). Each master has its own pri-
vate input channel for incoming messages, namely, mc[i] for the i-th master (line 2).
The message buffers are then brought back to their initial state (lines 45 and 54).
Excluding the senders from message relay is necessary to model the channel ac-
curately, because the TIA/EIA–485 transceivers used in practice are half-duplex, that
is, they are unable to receive and transmit a message at the same time. Message re-
ception is therefore inhibited during transmission.
If a collision occurs, a special message (with message type Col) is broadcast,
only in the model. This message makes all receivers aware of the collision and
has no counterpart in the actual protocol. However, this is consistent with the real
behavior of the system because all the universal asynchronous receiver transmit-
ter (UART)/transceiver combinations of the four different microcontroller fami-
lies picked for the practical implementation of the protocol [30] (namely, the At-
mel ATmega 1284P [12], Atmel AT 91 [11], NXP LPC2468 [126], and NXP
LPC1768 [128]) have the capability of generating an interrupt when they detect
M ODBUS traffic. This happens even if the message cannot be decoded due to fram-
ing, parity, or checksum errors. This quite reliable way to detect collisions, which
is generally available at no additional hardware cost, is mimicked in the model by
means of the Col message.
1 proctype Master(byte cp; mtype ist;
2                 bool men; bool mrl)
3 {
4 mtype st;
5 byte p;
6
7 st = ist;
8 do
9 :: ...translation of FSM transition...
10 :: ...
11 od
12 }
13
14 inline run_agents()
15 {
16 i=0;
17 do
18 :: (i < NP) ->
19 run Master(i, Standby,
20 true, false); i++
21 :: else -> break
22 od
23 }
In this example, the agents are given a unique priority in the range [0, np − 1].
Moreover, none of them is active at startup (they all start in the Standby state), they
are potential masters (men is true), and they are not willing to relinquish their role of
active master after they acquire it (mrl is false).
The body of each master contains declarations for the local FSM variables, most
importantly the FSM state st (line 4). The initial value of st is taken from the initial
state of the master, held in the ist parameter (line 7). The rest of the body is an
infinite loop containing one nondeterministic choice for each FSM transition listed
in Table 14.1. Each nondeterministic choice is the direct translation of one FSM
transition, obeying the following rules:
• The predicate of the FSM transition is encoded in the guard of the nonde-
terministic choice. There, the channel can be inquired for the presence of a
message by means of a channel polling expression (see Section 13.3). For
example, the expression mc[cp]?[Ann(p)] is true if master cp received an
announcement.
Since the polling expression does not imply the actual reception of the mes-
sage from the channel, the message must be then explicitly received with an
ordinary receive clause, mc[cp]?Ann(p), in a subsequent statement of the
nondeterministic choice.
• The actions embedded in the FSM transition are encoded in the subsequent
statements of the nondeterministic choice, after the guard. To handle the
TIA/EIA–485 channel in a proper way, message transmission must be per-
formed by means of the bcast inline function previously discussed.
• The next_state(ns) macro brings the FSM to the new state ns. Its imple-
mentation, not shown here for conciseness, is very simple and basically
changes the value of st at the end of the receive phase of the current slot.
In the translation, the set_pr inline function is a shortcut to set all elements of the
pr array to A. Similarly, transition N9 has been translated into:
:: (st == Standby && men
&& t[cp] == (cp+1)*ato_t) ->
bcast(cp, Ann, cp);
set_pr(A);
next_state(Elec_a);
t[cp] = 0
For the sake of completeness, the masters model also contains several additional
lines of code. Their purpose is to make a few internal variables globally visible, so
that they can be used in the LTL formulae presented in Section 14.4. They are neither
shown nor further discussed due to their simplicity. Overall, the complete protocol
specification consists of about 400 lines of P ROMELA code.
Figure 14.6 First bug found by SPIN in the original protocol. Collided messages shown
in gray.
Verification remains feasible beyond the limits stated above, up to NP=5, but only by
setting the supertrace/bitstate verification option, albeit the verification is
approximate in this case.
The properties of interest are:
• Liveness, which makes use of an array of Booleans prv, set by the masters
when they reach the end of an election. Informally speaking, the property
states that, in any computation, the end of an election is reached infinitely
often by all masters.
• Agreement, written as:
#define q (prres[0] == prres[1] \
&& prres[1] == prres[2] \
&& prres[2] == prres[3])
[] ( p -> q )
where prres is an array that collects the election winners, calculated by each
master. Informally speaking, the property states that, when the masters have
got a result (p is true), then they agree on the result (q is true, too).
• Weak Agreement, a property stating that there is at most one active master
at any given time. It is a weaker form of the agreement property, because
it contemplates the possibility of losing the notion of which is the active
master. It is written as:
#define p ((
((gst[0]==St_Active) -> 1 : 0) + \
((gst[1]==St_Active) -> 1 : 0) + \
((gst[2]==St_Active) -> 1 : 0) + \
((gst[3]==St_Active) -> 1 : 0)) <= 1)
[] ( p )
The original version of the protocol, presented in Figure 14.1, satisfied the liveness
property when checked by S PIN. On the contrary, S PIN showed that the agreement
Figure 14.7 Fix for the first bug (A), and second bug found by SPIN (B), involving three
masters.
property was not satisfied and provided a counterexample. The corresponding time
diagram, derived from the message sequence chart (MSC) produced by S PIN as part
of the counterexample, is depicted in Figure 14.6.
The time diagram shows that, when two masters start operating with a very
unfortunate (and very unlikely to happen) time relationship—namely, a power-up
time skewed by exactly 2 slots—both their announcement and their acknowledge
messages collide on the network. This happens because the highest-priority master
(cp = 0) waits for ato_t = 2 slots before proclaiming an election, whereas the other
master (cp = 1) waits for 2 · ato_t = 4 slots. As a result, neither master receives the
messages sent by the other, and both become active because they believe they are
alone on the M ODBUS segment.
The bug can apparently be fixed by observing that in the original version of the
protocol, when a master proclaims an election, it sends an announcement and then
stays silent during all the reply slots, waiting for replies from other masters. Hence,
if the announcement is lost due to a collision, the other masters will have no clue of
its presence.
Since, by design, the k-th reply slot is reserved for the master at priority k—and
there is a single master at that priority—the protocol can be enhanced by making the
k-th master send a Pot message in its own slot without fear of colliding with others.
The additional FSM transition is marked A1 in Table 14.1. As shown in part (A) of
Figure 14.7, this is enough to fix the bug in the scenario being discussed, because
both masters now agree that master 0 is the winner.
However, this enhancement is not sufficient to solve the issue in the general case.
When fed with the enhanced protocol model, S PIN was in fact able to provide a new,
more complex counterexample, shown in part (B) of Figure 14.7. This second coun-
terexample is particularly interesting because it involves the concurrent activity of
three distinct masters and, as in the previous case, it also requires a very precise time
relationship among them. The combination of those factors makes the counterexam-
ple extremely difficult to find by intuition.
Looking again at part (B) of Figure 14.7, we notice that a third master at priority 2
may send an announcement message that collides with the Pot message sent by
master 0. This happens because of a combination of two factors:
• Master 1 was powered up 2 slots before master 0, and hence, its announce-
ment message collided with the announcement sent by master 0 itself. As a
result, no masters actually received any valid announcement.
• Master 2 was powered up 3 slots before master 0 and, by design, it waited
for 3 · ato_t = 6 slots before sending its announcement. It then sent its an-
nouncement because, during the wait, it did not receive any other message.
In fact, the only messages sent during this period are the announcements,
which were lost due to the collision.
As a consequence, master 1 did not receive the Pot message sent by master 0 and
considered itself to be the winner. On the other hand, master 0 did receive the Pot
message sent by master 1, but this was irrelevant for the outcome of the election,
as determined by master 0 itself. The final result is that both master 0 and master 1
become active.
This second bug can be fixed by leveraging the hardware-based collision detection
capabilities of TIA/EIA–485 UARTs and transceivers. Namely, all masters shall go
back to the Standby state and reset their timer as soon as they detect a collision. The
corresponding FSM transition is marked A2 in Table 14.1.
With this further enhancement, the master election protocol satisfied all three
properties discussed above in the absence of communication errors. In the scenario
shown in Figure 14.7, part (B), master 2 will no longer send its announcement, be-
cause it will detect the collision between the announcement messages sent by the
other two masters. The election will therefore proceed correctly, as shown in part (A)
of the same figure.
Finally, the model was further extended to introduce single communication errors
and silent master failures because, according to its design, the protocol shall be re-
silient to them. Indeed, S PIN proved that the liveness property is still satisfied, and
hence, all functioning masters still get election results within a finite number of exe-
cution steps in all cases. The agreement property is no longer satisfied—because the
notion of which is the active master may be lost in the presence of communication
errors—but the weak agreement property still is. The most dangerous condition the
protocol may encounter, that is, the presence of more than one active master at the
same time on the same M ODBUS segment, is therefore still avoided in all failure
modes foreseen in the design.
For the sake of conciseness, just the models corresponding to them are shown
in the following. For what concerns single communication errors, it is assumed that
there is at most one error (either a transmit error or a receive error) in a single election
round and no additional errors during the subsequent recovery. Moreover, a transmit
error has a global effect, that is to say, it affects all receivers and none of them can
receive the correct message. Instead, a receive error is local and it just affects a single
receiver.
1 active proctype Tick()
2 {
3 byte i;
4
5 do
6 :: timeout ->
7 /* Channel error: use atomic instead of d_step */
8 atomic {
9 if
10 :: (tf) -> do_bcast(); tf = false
11 :: (!tf) ->
12 i=0;
13 do
14 :: (i < NP) -> t[i]++; i++
15 :: else -> break
16 od;
17 tf = true
18 fi
19 }
20 od
21 }
A transmit error is modeled by extending the inline function bcast with one
more nondeterministic choice in the if statement (lines 11–15), which is imple-
mented as an atomic sequence. If no communication error has been generated, a Col
message could be prepared here and then transmitted by the do_bcast function.
Within the same atomic sequence, rxtxe is set to true so that no further errors can
be introduced.
On the other hand, it is more convenient to model the receive error (line 40) in the
do_bcast function because it should affect only one receiver. If neither a transmit
error nor a receive error has been generated, one master could be affected by a receive
error, which is indicated by receiving a Col message. Instead, if a transmit error has
already been generated, only the alternative nondeterministic choice can be followed,
that is to broadcast the transmit error message to all masters (except the sender).
A side effect of introducing the receive error in the model (Figure 14.8, line 40) is
that execution in do_bcast can no longer be performed in a deterministic way. As
a consequence, d_step in the time model is not suitable to enclose the do_bcast
function any more, even though d_step leads to more efficient verification. It can
be replaced with the atomic keyword, as shown in Figure 14.9.
Last but not least, in order to model silent master failures, a Boolean array
silent[] with NP elements is introduced, with each element indicating whether
the corresponding master encounters silent failure or not. Besides, the bcast func-
tion is extended, as shown in Figure 14.10.
As mentioned in Section 14.2, silent master failures mean that one or more mas-
ters cease functioning without generating any spurious traffic. As a consequence, a
master simply sets its flag in the array when it becomes silent (line 9) and, from then
on, skips message collection and collision detection (line 24). In this way, it will
not affect the traffic sent by other masters within the same slot.
14.5 SUMMARY
This chapter shows how formal verification through model checking can be used to
verify industrial real-time communication protocols, like the one presented in this
chapter. Even though SPIN does not have built-in support for concepts such as time
and broadcast channels, a limitation common to most popular model checkers, it is pos-
sible to write models that include these concepts explicitly, without suffering huge
verification overheads.
The case study discussed in this chapter also demonstrates that model checking
can be profitably used to identify low-probability issues which involve complex
interleavings among multiple nodes in a concurrent system. Moreover, the counterex-
amples provided by the model checker when a property being verified is violated
can be extremely helpful to gain further information about the issues and fix them
accordingly.
15 Memory Protection
Techniques
CONTENTS
15.1 Memory Management Units (MMUs)........................................................... 427
15.2 Memory Protection Units (MPUs) ................................................................ 432
15.3 MPUs versus MMUs ..................................................................................... 438
15.4 Memory Checksumming ............................................................................... 439
15.5 CRC Calculation............................................................................................ 444
15.6 Data Structure Marking ................................................................................. 455
15.7 Stack Management and Overflow Detection ................................................. 460
15.8 Summary........................................................................................................ 465
Figure 15.1 Memory access without and with a memory management unit (MMU). In part (A) the address V issued by the processor core coincides with the address P given, through the on-chip interconnection, to the internal memory and the external memory controller (V ≡ P). In part (B) an MMU is interposed between the processor core and memory: it translates V into P (V ≠ P) and raises an address translation fault when the translation fails.
In order to decide whether a given memory transaction shall be allowed to proceed,
the MMU considers two kinds of information:
• the address issued by the processor and the associated access mode, which
specifies the kind of memory transaction the processor is requesting (for
instance, instruction fetch, data read, or data write), and
• some memory protection information known to the MMU and that the exe-
cuting task cannot modify on its own initiative.
If the check just mentioned fails, or if the translation fails for other reasons, the
MMU incurs an address translation fault. In this case, it reports the occurrence to
the processor by raising an exception and blocks the memory transaction before it
takes place.
The reaction of the processor to an exception signal is similar to the reaction to an
interrupt. As better described in Chapter 8, informally speaking an exception stops
the execution of the current task—in this case, the task that attempted the illegal
memory access—and diverts execution to the corresponding exception handler.
Due to its criticality from the point of view of memory consistency and system
integrity, the duty of handling address translation faults is most often left to the op-
erating system. In turn, the typical action carried out by the operating system is to
terminate the offending task under the assumption that it is malfunctioning.
The details of the translation may vary, depending on the computer architecture,
but the basic mechanism always relies on a data structure called page table. As shown
in the simplified diagram of Figure 15.2, the virtual memory address V (shown at the
top of the diagram) is translated into a physical address P (bottom of the diagram)
according to the following steps.
For the sake of completeness, it is also worth mentioning that virtual addresses
are sometimes called logical addresses, especially in the context of general-purpose
operating systems.
1. The virtual address V, of total width m + n bits, is divided into two parts. The
first one consists of the n least significant bits, while the second one is composed
of the m most significant bits. These two parts are usually called page offset and
page number, respectively, for reasons that will become clearer in the following.
2. The offset is not translated in any way. Instead, as shown in the right part of
Figure 15.2, it goes directly into the least significant portion of the physical
address P.
3. The page number is used as an index in the page table. Since the page number is
an m-bit value, the page table must have 2^m entries, at least conceptually, although
in practice several techniques are adopted to reduce the overall page table size by
not storing unused elements. This is especially important when the virtual address
space (that is, the set of legal values V can assume) is large, for instance when
using a 64-bit processor. In these cases m may be quite a large number, and the
page table size grows exponentially with m.
4. The MMU retrieves from the page table entry corresponding to the most signifi-
cant portion of V two kinds of information:
a. Address translation information, consisting of a k-bit value called page
frame number and shown as a white box in Figure 15.2.
b. Memory protection information, often implemented as a set of page access
permission bits and shown as a smaller, light gray box in the figure.
5. For what concerns address translation, the page frame number becomes the most
significant part of the physical address P. It must also be noted that the size of the
virtual address space may not coincide with the size of the physical address space
(that is, the range of legal values P can assume). This happens when m ≠ k.
6. Regarding memory protection, the MMU compares the access mode requested by
the processor with the page access permission bits, in order to determine whether
or not the access shall be allowed to take place. As described previously, if the
result of the comparison is negative, the MMU aborts the memory transaction and
raises an address translation fault.
Overall, the address translation process put in place by the MMU divides the vir-
tual address space into 2^m pages. Each page is uniquely identified by its page number
and is 2^n memory locations wide. The exact location within the page corresponding
to a given virtual address V is given by its page offset. Similarly, the physical address
space is also divided into 2^k page frames of 2^n locations, that is, page frames have
the same size as pages.
Then, the page table can be considered as a page number translation mechanism,
from the virtual page number into the physical page frame number. On the other
hand, the relative position of a location within a page and within the corresponding
page frame is the same, because this information is conveyed by the least-significant
n bits of V and P, and they are the same.
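As a concrete illustration (all numbers are purely illustrative), assume n = 12 and m = 20, corresponding to 4 KiB pages in a 32-bit virtual address space. The virtual address V = 0x00403ABC is then split into page number 0x00403 and page offset 0xABC. If the page table maps virtual page 0x00403 onto page frame 0x012F4, the resulting physical address is P = 0x012F4ABC, in which the offset has been carried over unchanged.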
The page table used by the MMU is associated with the running task and a pointer
to it, often called page table base, is part of the task context managed by the operating
system. Therefore, when the operating system switches from one task to another, it
also switches the MMU from using one page table to another.
As a consequence, different tasks can (and usually do) have different “views” on
physical memory, because the translations enforced by their page tables are different.
This view comprises the notions of which physical page frames are accessible to the
task and for which kind of access (address translation mechanism), as well as which
page frame corresponds to, or supports, each valid virtual page.
It must also be noted that, although page tables are stored in memory, like any
other data structure, the MMU itself is able to enforce the validity and consistency
of its address translation and memory protection work. This is because it is possible
to protect the page table of a task against accidental (or deliberate) modifications by
the task itself or another task. Write access to page table contents is usually granted
only to the operating system, which is responsible for, and trusted to, store correct
information into them.
A full description of how a MMU works and how to use it effectively is beyond
the scope of this book. Nevertheless, two simple examples shall persuade readers of
its usefulness and power.
1. Let us consider two tasks that must run the same code. This is common in
general-purpose computing—for instance, it is perfectly normal that two instances
of a Web browser or editor run together on the same PC—but it also happens in
embedded applications.
In the last case, a common way to design an embedded software component re-
sponsible for making a service available to other parts of the system, and increase
its performance, is to organize the component as a set of “worker tasks,” all iden-
tical for what concerns the code. Each worker task runs concurrently with the
others and is responsible for satisfying one single service request at a time.
If no virtual memory were supported, two instances of the same code executed
concurrently by two worker tasks would interfere with each other since they
would access the same data memory locations. In fact, they are running the same
code and the memory addresses it contains are the same.
This inconvenience is elegantly solved using a MMU and providing two different
page tables—and hence, two different address translations—to the two tasks, so
that the same virtual page referenced by the two tasks is mapped onto two different
physical page frames.
2. In order to prevent memory corruption, it becomes possible to allocate all local
data to be used by a task—like, for instance, its local variables and its stack—in a
set of pages and map them onto a set of physical page frames.
As long as the set of page frames accessible to the different tasks in the system are
disjoint, no memory corruption can take place because any attempt made by a task
to access or modify the page frames assigned to another task would be prevented
and flagged by the MMU.
Shared data, which are often needed in order to achieve effective inter-task com-
munication, as described in Chapter 5, can still be implemented by storing them in
a page frame and granting write access only to the set of tasks which are supposed,
by design, to share those data.
A convenient starting point interested readers can use to gain more information
about how a MMU aimed at high-end embedded systems works is Reference [6],
which describes the MMU architecture optionally available in several recent ARM
processor versions.
The problem is further compounded by the fact that address translation must be
performed in an extremely critical and very frequently used part of the execution
path, that is, whenever the processor needs to access memory to fetch an instruction,
read, or write data. Hence, its performance has a deep impact on the overall speed of
the processor.
On the other hand, attentive readers have surely noticed that the abstract MMU
address translation process, as depicted in Figure 15.2, is extremely inefficient if
implemented directly. In fact, due to their size, page tables must necessarily be stored
in main memory.
In this case, the interposition of a MMU between the processor and memory im-
plies that at least two memory accesses are needed whenever the processor requests
one. For instance, when the processor performs an instruction fetch:
• The MMU first consults the current page table to determine if the virtual
address V provided by the processor has got a mapping in the table and
whether or not that mapping permits instruction fetches depending on the
current processor mode. Since the page table resides in main memory, con-
sulting the table implies one memory access even though—as depicted in
the figure—the table is implemented as a flat array of entries.
However, as also discussed in Section 15.1, more sophisticated data
structures—such as multilevel, tree-structured tables—are often needed in
order to keep page table memory requirements down to an acceptable level.
The price to be paid is that those data structures need more than one mem-
ory access to be consulted.
• After the MMU has confirmed that the memory access request is permissi-
ble and determined the physical address P at which it must be performed,
yet another memory access is needed to retrieve the instruction that the
processor wants to fetch. Only after all these memory accesses have been
completed, is the instruction available to the processor.
Fortunately, processor memory accesses exhibit a strong locality of reference, and
hence, the result of a recent address translation is very likely to be reused shortly
afterward. This is because small variations in the virtual address will likely not
modify the page number that, as shown in Figure 15.2, is the only part of the virtual
address used to determine which page table entry must be used to translate it.
Therefore, it is convenient to keep in a small and fast associative memory—the
TLB—the page translation information for a subset of the memory pages most re-
cently accessed, with the goal of decreasing the overall memory access time by re-
ducing the number of extra memory accesses needed to consult the page table. In
this way:
• Before consulting the page table to translate a certain virtual address V, the
MMU checks if the result of the translation is already available in the TLB
because the translation has already been performed in the recent past. This
check is performed without introducing any extra delay because, as men-
tioned previously, the TLB is based on a fast, on-chip associative memory.
In this case (TLB hit), there is no need to read the page table entry from
memory to get the corresponding physical page number and access permis-
sion information. The MMU simply uses this information and proceeds.
• Otherwise, the MMU consults the page table by issuing one or more addi-
tional memory address requests. Afterward, it updates the TLB and stores
the result of the translation into it. Since the TLB capacity is limited—on
the order of one hundred entries—in order to do this it likely becomes nec-
essary to delete a TLB entry stored in the past.
As happens for caches, the algorithm responsible for choosing the entry to
be replaced is critical for performance. It is the result of a trade-off between
the goal of deleting the entry with the minimum probability of being reused
in the future versus the need for taking the decision quickly and with limited
information about future task behavior.
Simpler systems often rely on a component that performs no address translation at all
and implements only the second function of an MMU, that is, memory protection. To
better highlight its function, this component is often called memory protection unit (MPU).
As shown in Figure 15.3, which depicts the operating principle of a typical
MPU [8] in a simplified way, in systems equipped with a MPU there is always full
correspondence between virtual and physical addresses (V and P in the figure) be-
cause no address translation takes place.
Memory protection information is kept in an on-chip table, often called a region
table. Each entry of the table contains protection information pertaining to a contigu-
ous memory region, identified by its base address and size.
The MPU architecture often imposes strict constraints on admissible values of re-
gion base addresses and sizes, in order to simplify the hardware. A typical constraint
is that the size must be an integer power of two and the base address must be an
integer multiple of the size.
In this way, the comparison between the address issued by the processor for a
certain memory transaction and individual region table entries, to determine which
region the address falls in, can be performed in a very efficient way. This is because
it can be based solely on a bit-by-bit comparison between several most significant
bits of the address and the most significant bits of region base addresses stored in
the table.
Table 15.1
Fields of a Typical Memory Protection Unit (MPU) Region Table Entry
Table 15.2
Access Permission Field Encoding in the ARM MPU

AP value   Privileged mode    Unprivileged mode
000        None               None
001        Read, Write        None
010        Read, Write        Read
011        Read, Write        Read, Write
100        —                  —
101        Read               None
110        Read               Read
111        Read               Read
Some further fields of a region table entry control memory attributes, such as
cacheability, and affect only memory subsystem operations and behavior rather than
protection. For this reason, they will not be discussed further.
Similarly, a thorough description of memory access ordering strategies is
beyond the scope of this book.
Another interesting aspect of the region table access algorithm adopted by this
particular MPU is that table entries have different priorities. The priority ordering is
fixed and is determined by the entry number, so that higher-numbered entries have
higher priority.
Priorities become important when an address issued by the processor matches
multiple region table entries. This may happen when the regions described by the
entries overlap in part, which is allowed. Hence, for instance, it is possible to define
a large lower-priority region in memory, with a certain access permission, and a
smaller higher-priority region, with a different access permission, within it.
In this case, when the address lies within the large region but not within the
smaller region, the access permission defined for the large region is used. On the
contrary, when the address lies within the smaller region, and hence, matches both
region table entries, the address permission set for the smaller region prevails.
This feature can be very useful to set the general access permission of a large
memory area and then “dig a hole” in it with a more specific access permission. In turn,
this helps to circumvent the address alignment restrictions on the region start address
recalled previously, which become more and more severe as the region size grows, and
it also allows the same access permissions to be put in effect with a smaller number
of table entries.
As an example, even neglecting address alignment restrictions, configuring the
MPU as discussed previously would require three entries instead of two if region
table entries were not allowed to overlap. This is an important difference, given that
the total number of entries is usually very small, eight being a common value [8].
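As an illustration, the following fragment sketches how the two overlapping regions just
described could be programmed on an ARMv7-M (Cortex-M) MPU through the standard CMSIS
register definitions. The function name, region numbers, sizes, and base addresses are
arbitrary choices made here, and the access permission encodings are those listed in
Table 15.2.

#include <stdint.h>
#include "stm32f407xx.h"   /* placeholder: CMSIS device header of the MCU in use */

/* Configure region 0 as a 64 KiB read-only area (AP = 110 in Table 15.2) and
   region 1 as a 1 KiB read/write "hole" (AP = 011) inside it; being the
   higher-numbered entry, region 1 prevails where the two regions overlap.
   Memory attribute bits (TEX, C, B, S) are omitted for clarity.
*/
void mpu_dig_a_hole(uint32_t region_base, uint32_t hole_base)
{
    /* Region 0: SIZE = 15 selects 2^(15+1) = 64 KiB; region_base must be
       aligned to the region size, as discussed previously.
    */
    MPU->RNR  = 0;
    MPU->RBAR = region_base;
    MPU->RASR = (0x6u << MPU_RASR_AP_Pos)
              | (15u  << MPU_RASR_SIZE_Pos)
              | MPU_RASR_ENABLE_Msk;

    /* Region 1: SIZE = 9 selects 2^(9+1) = 1 KiB; hole_base must be 1 KiB
       aligned and fall within region 0.
    */
    MPU->RNR  = 1;
    MPU->RBAR = hole_base;
    MPU->RASR = (0x3u << MPU_RASR_AP_Pos)
              | (9u   << MPU_RASR_SIZE_Pos)
              | MPU_RASR_ENABLE_Msk;

    /* Enable the MPU, keep the default memory map for privileged accesses
       outside the defined regions, and make sure the new configuration is in
       effect before returning.
    */
    MPU->CTRL = MPU_CTRL_ENABLE_Msk | MPU_CTRL_PRIVDEFENA_Msk;
    __DSB();
    __ISB();
}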
One last point about MPUs, which is worth mentioning and should emerge from
the description given previously, is that they operate in a much simpler way than
MMUs. Moreover, a MPU region table is much smaller than a typical MMU page
table. For these reasons, the MPU region table can be kept in an on-chip memory.
Moreover, all MPU operations are performed completely in hardware and do not
require any acceleration mechanism that—as happens, for instance, with the TLB—
may introduce non-deterministic timings into the system.
Table 15.3
Comparison between MMU and MPU Features and Their Implementation
Hardware-based mechanisms like MMUs and MPUs cannot, however, detect memory corruption
caused by a task within its own data areas, for instance, due to a stack overflow or
a wayward pointer, because, in both cases, tasks corrupt a memory area they must
necessarily be granted access to.
In these cases, it is useful to resort to software-based techniques. The exact details
of these techniques vary and, by necessity, they will only be summarized here, but
they are all based on the same underlying principles. Namely,
• They stipulate that software must read from and write into a certain memory
area M, possibly shared among multiple tasks, only by obeying a specific
access algorithm, or protocol.
• Any write operation performed on M without following the access protocol
shall be detected and taken as an indication of memory corruption.
• Detection is based on computing a short “summary” of the contents of M,
generically called checksum C. The value of C is a mathematical function
of the contents of M.
• When memory is used according to the proper protocol, the checksum is
kept consistent with the contents of the memory area.
• Instead, if memory is modified without following the protocol—as a way-
ward task would most likely do—checksum C will likely become inconsis-
tent with the corrupted contents of M.
• In this way, it will afterward be possible to detect memory corruption with
high probability.
Before going into the details of how memory checksumming works, it is important
to note one important difference with respect to what MMUs and MPUs do. Namely,
MMUs and MPUs are able to prevent memory corruption because they are able to
detect that a task is attempting to illegally access or modify a memory location before
the memory operation is actually performed so that, in the case of an attempted write
operation, memory contents are not corrupted.
Informally speaking, this makes recovery easier because the system state, which
is normally held in memory, is not affected. On the other hand, software-based tech-
niques are able to detect memory corruption with high probability, but not in all
cases, for reasons that will be better described in the following.
Moreover, detection takes place only upon execution of a software procedure,
which typically takes place when a task makes use of the shared memory area. As
a consequence, memory corruption is detected only after it occurs. At this time,
the system state has been irreparably lost and must be reconstructed as part of the
recovery process.
Another point worth mentioning is that software-based techniques are unable to
detect illegal read operations. In other words, they can neither detect nor prevent that
a task gets access to information it has no permission for.
Figure 15.4 summarizes how memory checksumming works. In the middle of the
figure, the shared memory area M to be protected is surrounded by two auxiliary data
items, associated with it to this purpose:
• As described in Chapter 5, concurrent access to a shared memory area
must take place in mutual exclusion, with the help of a semaphore S.
• The checksum C, calculated as a function of the current contents of M and
kept consistent with them as long as the access protocol is followed.
The two flowcharts on the sides of the figure outline the protocol to be followed
for read (on the left) and write (on the right) accesses to M. In order to perform a
read operation:
1. Task T must first of all acquire the mutual exclusion semaphore S and it may need
to wait if the semaphore is currently held by another task. After acquiring S, T is
granted exclusive access to both M and C until it releases the semaphore at a later
time.
2. Within the mutual exclusion region just opened, T can transfer data items
D1 , . . . , Dn held in M into its own local memory.
3. Afterward, it also copies checksum C into a local variable C′′ .
4. Then, task T recalculates the checksum of M and stores the result into another
local variable, called C′ in the figure.
5. At this point, T releases semaphore S because it will no longer use any shared
data structure.
6. The final operation performed by T is to compare the two checksums C′ and C′′ .
Any mismatch between the two values compared in the last protocol step indicates
that either the memory area M or its checksum C or both have been corrupted. This
is because, if the access protocol had been followed by all tasks, the stored checksum
copied into C′′ would still match the checksum C′ recalculated from the current
contents of M.
Instead, in order to perform a write operation, a task T must update both M and C
to keep them consistent, by executing the following steps.
1. As for a read operation, task T must first of all acquire the mutual exclusion
semaphore S.
2. At this time, T can update data items D1 , . . . , Dn held in M by means of a sequence
of write operations from local task memory into M. Partial updates, involving only
a subset of {D1 , . . . , Dn } are possible as well.
3. After completing the update, task T calculates the new checksum of M, denoted
as C′ in the figure, and stores it into C.
4. Eventually, T releases S in order to allow other tasks to access M.
It should be noted that since the beginning of step 2 and until C has been updated,
at the very end of step 3, M and C are indeed inconsistent even though no memory
corruption occurred. However, this does not lead to any false alarm because any tasks
wishing to perform a read operation—and hence, check for memory corruption—
would be forced to wait on the mutual exclusion semaphore S during those steps.
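By way of illustration, the following sketch shows how the two protocols could be coded
with the help of a F REE RTOS mutual exclusion semaphore. The shared area size, the
function and variable names, and the use of a simple XOR-based checksum in place of the
CRC discussed later are placeholders chosen here; the semaphore S is assumed to have
been created beforehand with xSemaphoreCreateMutex.

#include <stdint.h>
#include <string.h>
#include "FreeRTOS.h"
#include "semphr.h"

#define M_WORDS 64

static SemaphoreHandle_t S;       /* created elsewhere with xSemaphoreCreateMutex() */
static uint32_t M[M_WORDS];       /* shared memory area to be protected             */
static uint32_t C;                /* checksum, kept consistent with M               */

/* Placeholder checksum; the CRC discussed in the following is a better choice. */
static uint32_t checksum(const uint32_t *m, size_t n)
{
    uint32_t c = 0;
    size_t i;

    for(i = 0; i < n; i++)
        c ^= m[i];
    return c;
}

/* Read protocol: returns 0 on success, -1 if corruption is detected. */
int protected_read(uint32_t dst[M_WORDS])
{
    uint32_t c1, c2;

    xSemaphoreTake(S, portMAX_DELAY);   /* step 1: acquire S                    */
    memcpy(dst, M, sizeof(M));          /* step 2: copy data into local memory  */
    c2 = C;                             /* step 3: copy the stored checksum     */
    c1 = checksum(M, M_WORDS);          /* step 4: recalculate the checksum     */
    xSemaphoreGive(S);                  /* step 5: release S                    */
    return (c1 == c2) ? 0 : -1;         /* step 6: compare the two checksums    */
}

/* Write protocol: update M and bring C back in sync before releasing S. */
void protected_write(const uint32_t src[M_WORDS])
{
    xSemaphoreTake(S, portMAX_DELAY);   /* step 1: acquire S                    */
    memcpy(M, src, sizeof(M));          /* step 2: update the data items        */
    C = checksum(M, M_WORDS);           /* step 3: recalculate and store C      */
    xSemaphoreGive(S);                  /* step 4: release S                    */
}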
Let us now consider what happens if a task T′ writes into either M or C in an
uncontrolled way, that is, without following the above-specified protocol. In order to
do this in a realistic way, we must also assume that T′ will not acquire semaphore S
beforehand. As a consequence, T′ can perform its write operations at any time, even
when the mutual exclusion rule enforced by S would forbid it. Several distinct
scenarios are possible:
1. If memory corruption occurs when neither a read nor a write operation are in
progress, it will likely make M inconsistent with C. The next time a task T per-
forms a read operation, it will detect the inconsistency during the last step of the
read protocol.
2. If memory corruption occurs while a read operation is in progress, it may or may
not be detected immediately, depending on when it takes place during the progress
of the read protocol.
Namely, any corruption that takes place after step 4 of the read protocol will not
be detected in step 6. However, it will be detected upon the next read operation.
3. The most difficult scenario occurs when memory corruption occurs while a write
operation is in progress because, in this case, there is a time window during which
any corruption is undetectable.
This happens if memory corruption due to T′ occurs during write protocol step
2, that is, when T is updating M, too. In this case, task T will later recalculate
the correct checksum of M, during protocol step 3, and store it into C “on behalf
of” T′.
At the end of the day, C will indeed be consistent with the corrupted contents of
M and it will pass the check performed during read operations.
Another reason why memory corruption may not lead to an inconsistent checksum
is that, for practical reasons, C is always much smaller than M in size. In fact, the
typical size of a checksum does not exceed 32 bits whereas the size of M can easily
be on the order of several hundred bytes or more.
As a consequence, different contents of M may still lead to the same value of C,
and hence, there is a certain probability that the expected value of C after memory
corruption will still be the same as before.
In general, this probability is called residual error probability and this is the reason
why, in the first scenario described above, we said that memory corruption will just
“likely” make M inconsistent with C, rather than certainly.
It is worth remarking that, although Figure 15.4 shows how checksumming can be
used to protect a shared memory area, the same method can also be applied to sys-
tems based on message passing, in order to make the system more robust against
corruption occurring locally, within the communicating tasks, or during message
passing itself.
For instance, as shown in Figure 15.5, if we assume that there is a unidirectional
data flow between a sending task (on the left) and a receiving task (on the right),
checksumming can be organized as follows:
• The sending task keeps its local memory area M and the corresponding checksum C
consistent by following the write protocol described previously.
• When a message must be sent, the contents of M are transmitted together with C.
• The receiving task stores the received contents into its own local copies, M′ and C′,
and then follows the read protocol to verify that they are still consistent with
each other.
Attentive readers would certainly have noticed the analogy between checksum-
ming used in this way—that is, for message passing between tasks residing on the
same microcontroller—and what is normally done in network communication, in
which checksums are commonly used to ensure data integrity when messages travel
across the network and network equipment, as mentioned in Chapter 7. This is in-
deed the case; moreover, the checksumming algorithms used in the two contexts overlap
significantly.
In the following, the memory area M to be protected will be regarded as a sequence
of nM bits,

m_{n_M-1}, \ldots, m_0 .    (15.1)

For what concerns CRC computation, the same area can also be seen as a polynomial
M(x), defined as

M(x) = \sum_{i=0}^{n_M-1} m_i \cdot x^i = m_0 + m_1 x + \ldots + m_{n_M-1} x^{n_M-1} ,    (15.2)
Those two representations (sequence of bits and polynomial) are totally inter-
changeable and we will freely switch from one to the other in the following.
As shown in (15.2), if the memory area M is nM bits wide, the degree of M(x) is
nM − 1 and its coefficients mi (0 ≤ i ≤ nM − 1) are the same as the bits in M.
The CRC of M, denoted as C in this chapter, is defined as the remainder of the di-
vision between M(x)xnG and a generator polynomial G(x) of degree nG . Even though
other choices are possible in theory, in most cases of practical interest related to com-
puter engineering, coefficient arithmetic is carried out within a Galois field of two
elements, denoted GF(2).
In this way, the two elements of the Galois field correspond to the two values
a bit can assume, 0 and 1, as outlined above. Even more importantly, coefficient
arithmetic can be performed by means of efficient bitwise operations on the strings
of bits corresponding to the polynomials.
The choice of G(x) heavily affects the “quality,” that is, the error detection capa-
bility of the resulting CRC [109].
In formula, if it is
M(x) \, x^{n_G} = Q(x) \, G(x) + R(x) ,    (15.3)
then Q(x) is the quotient of the division and R(x), the remainder, is the CRC of M.
By definition, if the degree of G(x) is nG , then the degree of the remainder R(x) is
nG − 1. Therefore, the polynomial R(x) corresponds to a sequence of nG bits
r_{n_G-1}, \ldots, r_0 ,    (15.4)
which is the sequence of bits of C.
Obviously, actually using polynomial arithmetic to calculate the CRC would be
overly expensive from the computational point of view in most cases. Fortunately,
however, by means of the properties of polynomial division in GF(2), it is possible
to calculate the CRC of a memory area M incrementally, considering one bit at a
time, with a simple algorithm mainly based on the logical shift and the exclusive-or
bitwise operation (XOR).
In particular, the CRC C of M can be calculated in the following way.
1. Initialize C by setting C ← 0.
2. For each bit of M, denoted as mi , perform the three steps that follow, then go to
step 6.
3. Calculate the XOR of the most significant bit of C (that is, bit nG − 1) and mi , and
call it X.
4. Shift C one position to the left, dropping its most significant bit.
5. If X is 1, XOR C with the sequence of bits G corresponding to G(x) and store the
result into C itself.
6. At the end, C holds the result of the calculation.
The listing that follows shows, as an example, a possible C-language implemen-
tation of the bit-by-bit algorithm just presented considering a specific CRC generator
polynomial used, among many other applications, for network error detection in the
controller area network (CAN) [91].
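A possible shape of such a function is sketched below. The function name, the loop
structure, and the assumption that the most significant of the nbits input bits is
processed first are choices made here for illustration; the masks and the generator
polynomial bit sequence 0x4599, corresponding to the CAN polynomial with its most
significant coefficient omitted, follow the description given afterward.

#include <stdint.h>

/* Bit sequence corresponding to the CAN generator polynomial
   G(x) = x^15 + x^14 + x^10 + x^8 + x^7 + x^4 + x^3 + 1,
   with the most significant coefficient omitted.
*/
#define CRC_POLY 0x4599

/* Update the partial CRC crc by processing, one at a time, the nbits least
   significant bits of nxtbits (1 <= nbits <= 16), and return the new value.
*/
uint16_t crc_nxtbits(uint16_t crc, uint16_t nxtbits, unsigned int nbits)
{
    unsigned int i;

    for(i = 0; i < nbits; i++)
    {
        /* Next input bit, most significant first. */
        int nxtbit = (nxtbits >> (nbits - 1 - i)) & 1;

        /* XOR of the most significant bit of crc (bit 14) with nxtbit. */
        int crcnxt = ((crc & 0x4000) >> 14) ^ nxtbit;

        /* Shift crc one position to the left, dropping its most significant
           bit and shifting a 0 into the least significant bit position.
        */
        crc = (uint16_t)((crc << 1) & 0x7FFF);

        /* Conditionally XOR crc with the generator polynomial bit sequence. */
        if(crcnxt)
            crc ^= CRC_POLY;
    }

    return crc;
}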
The most significant coefficient of G(x) is not considered in the bit sequence
because it is implicitly considered to be 1 by the algorithm. A value of 0 would not
make sense because it would imply that G(x) is not of degree nG as assumed.
Given that nxtbits is a 16-bit unsigned integer (as specified by its type,
uint16_t) the maximum number of bits nbits it can process is limited
to 16, but a larger number of bits can be handled in a similar way.
Also in this case, the function is designed to be used for incremental CRC
calculations. Hence, it takes the current value of C as argument crc and
returns the updated value.
• Extracts the most significant bit of crc (that is, bit nG − 1 = 14) by masking
and shifting. In this way, the value of ((crc & 0x4000) >> 14) is either
1 or 0 according to the value of the most significant bit of crc.
• It performs the XOR (denoted by the ^ operator in the C language) of this
bit with the next bit to be considered, nxtbit, and stores the result into
local variable crcnxt.
As described in Chapter 9, this variable is declared as an int so that its
size matches the natural size of an architectural register of the processor in
use, and the compiler has more optimization options available.
• It shifts crc left by one bit. Bit masking with 0x7FFF is needed to discard
the most significant bit of crc, which fell into bit position 15 after the shift.
On the other hand, the left shift operator of the C language automatically
shifts a 0 into the least significant bit position.
• Depending on the value of crcnxt, it conditionally performs the XOR
of crc with the generator polynomial bit sequence (15.6). As for other
arithmetic operators of the C language, it is possible to combine the XOR
operation with an assignment and write it as ^=.
A much faster approach to CRC calculation works on chunks of several bits at a time
with the help of a lookup table, which is conveniently computed in advance on the
development system rather than on the target. This is because the development system
usually has much higher computational power than the target system. Moreover, the
computation is performed only once, when the executable image is built, which is a
time when there are no tight execution timing constraints.
Figure 15.6 illustrates the general principle of lookup table-based CRC calcula-
tion, which is designed according to the following guidelines.
• The bit-by-bit CRC calculation algorithm is used as the basis to build the
lookup table used by the fast algorithm.
• To this purpose, the native toolchain is used to build an executable version
of the bit-by-bit algorithm, which runs on the development system.
• The bit-by-bit algorithm is executed to generate the lookup table, which is
then written in source code form, according to the language used to program
the target system.
• The cross-compilation toolchain is used to build the executable image for
the target system. The image contains, besides other modules, the compiled
version of the fast CRC calculation algorithm and the lookup table.
The procedure to be followed to compute the lookup table itself is fairly simple:
• If the fast CRC calculation shall work on chunks of N bits at a time, the
lookup table must have 2N entries.
• Each entry must be nG bits wide and the i-th entry must contain the CRC
of i, which can be calculated using the bit-by-bit algorithm.
It also turns out that, due to other well-known properties of CRC calculation, a
table dimensioned to work on chunks of N bits can also be used to calculate the CRC
using chunks smaller than N bits, still using exactly the same algorithm.
/* --- Lookup table for the word-by-word CRC implementation.
       For an n-bit word, the first 2^n entries of the table
       are needed.  The table is large enough for 1 <= n <= N.
*/
#include <stdint.h>

#define N        8
#define POW2(x)  (1u << (x))   /* 2 to the power of x */

uint16_t crc_lookup[POW2(N)];
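The helper that fills the table and prints it out, to be compiled with the native
toolchain and run on the development system, could be sketched as follows. It relies
on the definitions above and on the bit-by-bit function crc_nxtbits sketched earlier,
and its output format matches the description given in the following.

#include <stdio.h>

/* Bit-by-bit CRC function sketched previously. */
uint16_t crc_nxtbits(uint16_t crc, uint16_t nxtbits, unsigned int nbits);

/* Fill crc_lookup and print it as a C array definition, ready to be saved
   into a header file and cross-compiled into the target executable image.
*/
void wcrc_dump_lookup(void)
{
    unsigned int i;

    printf("uint16_t crc_lookup[POW2(N)] = {\n");

    for(i = 0; i < POW2(N); i++)
    {
        /* The i-th entry holds the CRC of i, calculated bit by bit. */
        crc_lookup[i] = crc_nxtbits(0, (uint16_t)i, N);

        printf("0x%04x%s", (unsigned int)crc_lookup[i],
               ((i < POW2(N)-1) ? "," : " "));

        if((i % 8) == 7)
            printf("\n");
    }

    printf("};\n");
}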
• The printout consists of the source code to define an array of 16-bit un-
signed integer elements, according to the uint16_t data type. The table
is called crc_lookup and comprises POW2(N), that is, 2^N, elements.
• After an opening brace, the initializer part of the definition contains the
lookup table entries printed as hexadecimal integers according to the
printf format specification "0x%04x"
• Entries are separated by commas. In order to avoid printing an extra comma
after the last entry, either a comma or a blank space is printed according to
the conditional expression ((i<POW2(N)-1) ? "," : " ").
• A closing brace followed by a semicolon concludes the printout, following
the C language syntax for array definitions.
The code to use the lookup table to speed up CRC calculation is equally simple
and is summarized below. Although a formal proof of correctness is beyond the scope
of this book, it is useful to note its similarity with the bit-by-bit algorithm discussed
previously to get at least an intuitive feeling of its correctness.
In particular, to calculate the CRC C of a memory block M working on chunks of
N bits at a time, the following algorithm can be used.
1. Initialize C by setting C ← 0
2. For each chunk of N bits, denoted as m′i , perform the three steps that follow, then
go to step 6
3. Calculate the XOR of the N most significant bits of C and m′i , calling the result X.
4. Shift C by N positions to the left, dropping the N most significant bits of the result.
5. Calculate the XOR of C with the contents of the X-th lookup table entry and store
the result into C itself.
6. At the end, C holds the result of the calculation.
The following fragment of code implements the algorithm just presented.
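A condensed sketch of such a function is given below. Only steps 3 to 5 of the
algorithm are implemented, in agreement with the explanation that follows, and the
local variable names are assumptions made here.

#include <stdint.h>

#define N        8             /* chunk size the lookup table was built for */
#define POW2(x)  (1u << (x))   /* 2 to the power of x                       */

extern uint16_t crc_lookup[POW2(N)];

/* Update the partial CRC crc by processing the nbits least significant bits
   of nxtbits (1 <= nbits <= N) with the help of the lookup table.
*/
uint16_t wcrc_nxtbits(uint16_t crc, uint16_t nxtbits, unsigned int nbits)
{
    /* Step 3: XOR the nbits most significant bits of the 15-bit CRC with
       the next chunk of input bits, obtaining the lookup table index x.
    */
    unsigned int x = ((unsigned int)(crc >> (15 - nbits)) ^ nxtbits)
                     & (POW2(nbits) - 1u);

    /* Step 4: shift the CRC left by nbits positions, keeping 15 bits. */
    crc = (uint16_t)((crc << nbits) & 0x7FFF);

    /* Step 5: XOR the CRC with the x-th lookup table entry. */
    crc ^= crc_lookup[x];

    return crc;
}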
At the very beginning of the code, there are a couple of macro definitions. Those
macros return several values of interest, as described in the following.
Then, the function wcrc_nxtbits implements the lookup table-based CRC cal-
culation algorithm proper, starting from the partial CRC crc and considering the
nbits least significant bits of nxtbits. The return value of the function is the
result.
In other words, this function implements steps 3 to 5 of the CRC calculation
algorithm. The implementation of steps 1, 2, and 6 is not shown for conciseness.
More specifically:
The function just described is parametric with respect to the chunk size nbits. In
other words, the chunk size is one of the input parameters of the function. This makes
the function more flexible and the code more compact—because the same code can
be used for all legal values of nbits—but, as better explained in Chapter 11, it gives
the compiler fewer optimization opportunities.
As shown in the following listing, another approach, which privileges execution
time with respect to code size, is possible.
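A simplified sketch of this approach is shown below. The MASKM0 and MASKMN macros
mentioned in the description that follows are not needed in this reduced form and are
therefore omitted; the remaining names match those used in the text.

/* --- Lookup-table-based word-by-word CRC implementation, specialized for a
       fixed chunk size at build time.
*/
#include <stdint.h>

#define POW2(x)    (1u << (x))
#define POW2M1(x)  (POW2(x) - 1u)

/* Lookup table generated on the development system. */
#include "crc_lookup_def.h"

/* Generate a CRC update function specialized for the fixed chunk size n, so
   that shift amounts and masks become compile-time constants.
*/
#define instantiate_wcrc_nxtbits(name, n)                              \
uint16_t name(uint16_t crc, uint16_t nxtbits)                          \
{                                                                      \
    unsigned int x = ((unsigned int)(crc >> (15 - (n))) ^ nxtbits)     \
                     & POW2M1(n);                                      \
    crc = (uint16_t)((crc << (n)) & 0x7FFF);                           \
    crc ^= crc_lookup[x];                                              \
    return crc;                                                        \
}

/* Instantiate the function that works on 8-bit chunks. */
instantiate_wcrc_nxtbits(wcrc_nxtbits_8, 8)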
• The first part of the code defines the macros POW2(x), POW2M1(x),
MASKM0(m), and MASKMN(m, n), because they are still needed to com-
pile the target code.
• Then, the definition of crc_lookup, that is, the lookup table, must be
included. In the code fragment above, it is assumed that the output of
the wcrc_dump_lookup function, executed on the development system,
was saved in a file called crc_lookup_def.h, but the actual file name
is irrelevant.
• At this point, the macro instantiate_wcrc_nxtbits previously de-
scribed is invoked, at build time. The compiler (or more precisely, as de-
scribed in Chapter 3, the C language preprocessor) expands it into a func-
tion called wcrc_nxtbits_8, as specified in the first argument of the
macro invocation. This function has the following prototype
uint16_t wcrc_nxtbits_8(uint16_t crc, uint16_t nxtbits);
Type is a unique identifier of a certain data structure type, assigned when the data
type is first defined and introduced into the system. All data structures of the same
type contain the same value in this field.
Size represents the size of the data structure, expressed by means of a uniform unit
of measurement. Variable-length data structures, if supported by the programming
language, shall contain the actual size of the data structure.
Version is a version number that shall be changed whenever, during software de-
velopment or maintenance, the definition of the data structure is changed without
changing the data type name.
Owner holds a unique identifier of the software component that “owns” the data
structure, for instance, the task that created it.
The figure also shows the expected position of the additional items of information
with respect to the original data structure. For reasons that will be better detailed in
the following, Type, Size, and Version are expected to be at the very beginning of the
extended data structure, whereas Owner shall be at the very end of it.
For variable-length data structures, Size (which can be located immediately given
a pointer to the data structure because it is at a fixed offset from the beginning of it)
can be used to determine where Owner should be and retrieve it.
In the C programming language, the possible values of Type and, for each data
type, Version numbers can, for instance, be defined in source headers as macros. In
each data structure, these fields are filled in by the function that creates it, according
to its intended, initial layout.
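To make the discussion more concrete, the following fragment sketches how a marked
data structure and the function that creates it could look in C. The structure name
msgbuf, the payload layout, and the numeric identifier values are purely illustrative.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative type and version identifiers; as suggested later in the text,
   common values like zero or one are best avoided.
*/
#define MSGBUF_TYPE     0x4D534742u
#define MSGBUF_VERSION  0x0003u

/* Marked data structure: Type, Size, and Version at the very beginning,
   Owner at the very end, the original contents in between.
*/
struct msgbuf {
    uint32_t type;
    uint32_t size;
    uint32_t version;
    uint8_t  payload[32];    /* original data structure contents */
    uint32_t owner;
};

/* The creating function fills in all four additional items of information. */
struct msgbuf *msgbuf_create(uint32_t owner)
{
    struct msgbuf *p = malloc(sizeof(*p));

    if(p != NULL) {
        p->type    = MSGBUF_TYPE;
        p->size    = sizeof(*p);
        p->version = MSGBUF_VERSION;
        memset(p->payload, 0, sizeof(p->payload));
        p->owner   = owner;
    }
    return p;
}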
When backward compatibility is important, the function responsible for creating a
certain data structure may have an additional argument that indicates which specific
version of the data structure it should create and return.
In this case, it is convenient to surround the data type definitions corresponding
to all supported versions of a data structure with a union. In this way, a pointer
to that union represents a pointer to a data structure of that kind, regardless of its
version. Within the code, the Version field—which is present in all versions at the
same offset—can be used to distinguish one version from another and make access
to the right member of the union depending on it.
Similarly, Size and Owner are also set when the data structure is created. The
Owner field is especially useful for local data structures belonging to “generic” data
types, such as lists. In this case, it is important to ensure that the data structures used
by a certain task are not only of the correct type, but also that they were created by
the task itself and not by others.
When the owner of a data structure is a task, it is possible (and convenient) to store
the unique task identifier provided by the operating system in the Owner field. How-
ever, this approach is inadequate when the concept of owner must be implemented at
a finer level of granularity—for instance, at the software component or source code
module level.
If the operating system itself does not provide any facilities to generate unique
identifiers, a viable choice is often to initialize a free-running hardware counter
clocked at a high frequency and take a snapshot of its value as a reasonable approxi-
mation of a unique identifier (neglecting counter wrap-arounds) when the component
or source code module is first initialized.
As a general rule, every function that accepts a data structure as input shall first
check all four additional items of information before proceeding to use the other
parts of the data structure itself. If any check fails, then the data structure has likely
been damaged and cannot be used safely. In particular:
• The Type field must be compared against the expected data structure type.
Any mismatch prevents the function from working on the data structure,
because it implies that its internal structure is unknown.
• The Size field must match the expected data structure size. In this case,
a mismatch indicates that this field—and, likely, other information around
it—has been corrupted, and hence, the function shall not use the data struc-
ture.
• The Owner field must match the task, component, or source code mod-
ule identifier. A mismatch indicates that, even though the data structure
received as input is known to the receiving function for what concerns its
internal structure, it must not be used anyway because its contents belong
to a different software component.
Of course, this check must be implemented in a different way if the data
structure at hand is expected to be passed from one software component to
another as part of normal system operation.
Those checks are easy to implement, even when they must be added to existing
code, and do not imply a significant overhead. If all items of information can be
encoded in a machine word—as is usually the case—all basic checks can be per-
formed by means of four word comparisons against constant values plus simple ad-
dress arithmetic in most cases. In any case, this is much simpler than checking the
data structure content as a whole for consistency.
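Continuing the illustrative msgbuf sketch given earlier, the four basic checks could
be coded as follows; a zero return value tells the caller that the data structure
must not be used.

/* Check the marking fields of a (presumed) struct msgbuf before use. */
int msgbuf_check(const struct msgbuf *p, uint32_t expected_owner)
{
    return p != NULL
        && p->type    == MSGBUF_TYPE
        && p->size    == sizeof(struct msgbuf)
        && p->version == MSGBUF_VERSION
        && p->owner   == expected_owner;
}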
Structure marking also helps ensure that when a data structure is passed by refer-
ence, that is, by means of a pointer, the invoking function is indeed receiving a valid
pointer. In fact, as is shown in the right part of Figure 15.7, if the function receives an
invalid pointer B instead of a valid pointer P, it will basically compare the contents
of several values retrieved from arbitrary memory locations with the expected values
of Type, Size, Version, and Owner.
If, as is advisable, developers took care to not use common values like zero or one
as expected values, there is a high probability of having a mismatch, and hence, a
high probability of successfully detecting that the pointer is invalid.
Reasoning at a higher level of abstraction, this is because, in general, structure
marking works according to the principle of adding redundant information to data
structures. With the help of this information it is possible to define additional in-
variants concerning the data structures themselves, that is, properties that they must
always satisfy.
The code that accesses a data structure can then check the invariants and ensure
that they hold. By intuition, the more invariants are defined and checked, the higher
the probability of successfully detecting data structure corruption is.
In other words—as remarked in [168]—when using structure marking, instead of
assuming that a data structure is suitable for use just because we have a pointer to it,
we assume it is good because we have a pointer to it and some of its fields have the
legal, consistent values we expect. The following kinds of issues can all be detected
by structure marking.
• Write overflows from other adjacent data structures, such as arrays. Fields
Type and Owner are especially useful to this purpose, because they are at
the extremities of the data structure, and hence, they are overwritten first in
case of overflow coming from lower or higher addresses, respectively.
Overflow detection occurs with high probability, that is, unless the garbage
written into the fields happens to match the expected values.
On the contrary, structure marking is not as effective against other kinds of errors,
like the ones listed in the following.
• Spurious or “random” memory writes that damage the contents of a mem-
ory word without regard to the data structure it belongs to. In fact, the
probability that the spurious write ruins the original contents of the data
structure (gray rectangles in Figure 15.7) without affecting the additional
items of information (white rectangles) is high because their size is usually
a small fraction of the total data structure size.
• Design errors. For instance, an ill-designed algorithm can easily produce
incorrect results, systematically or on occasion, but still store them into
perfectly formed data structures.
• Errors in the toolchain, namely, in the compiler, because in this case a bug
may introduce corresponding errors everywhere in the code, including the
portion of code that implements structure marking and related checks.
When structure marking checks detect that a data structure has been damaged,
the program actions to take depend on how the software has been designed and, in
particular, on the degree of fault tolerance that has been built into the system. Some
possibilities are described in the following.
• Print an error message or, more generally, report the error. Then, abort the
application as a whole or halt the task that detected the damage. This kind
of behavior is generally acceptable only during development and testing,
especially for embedded real-time applications.
• If there is enough redundant information available, attempt to repair the
damaged structure. This approach, when successful, has the advantage that
data structure damage is handled transparently and the caller only notices
an extra delay in function execution.
Experience with the M ULTICS operating system, in which structure marking was
extensively used in most of the disk- and memory-resident file system data structures,
shows that it improved system reliability noticeably [168].
At the same time, standard benchmarks showed no measurable performance cost
due to the additional checks. Space cost, related to the introduction of the additional
fields in every data structure, was a few percent.
Interestingly enough, the option of keeping a checksum for each structure, as de-
scribed in Section 15.4, was also considered during the design phase. However, the
cases in which a data structure was deemed good by structure marking even though its
content was damaged turned out to be too few to justify the additional cost of
computing a checksum at this very fine level of granularity.
Table 15.4
FreeRTOS Configuration Variables Related to Stack Overflow Checks
Storage for local variables, as well as other function call information, is allocated
from the stack of the invoking task upon function entry and released upon exit, in a
last-in, first-out (LIFO) fashion.
In many operating systems, including F REE RTOS [18], the task stack is also
used to hold the task context when it does not reside in the processor, that is, when
the task is not running. As for local variables, a certain amount of space is allocated
on-demand from the task stack to this purpose.
Task stacks in an embedded operating system usually have a fixed maximum size
that is specified upon task creation. In the case of F REE RTOS, as explained in Chap-
ter 5, the usStackDepth argument of xTaskCreate indicates how much memory
must be allocated for the task stack.
Since, as discussed above, space is dynamically allocated from and released to task
stacks during task execution, it is extremely important to prevent, or at least
detect, any overflow, which may corrupt stack contents themselves, as well as
surrounding memory areas, and cause all kinds of hard-to-predict issues in
task behavior.
For this reason, many embedded operating systems implement mechanisms to as-
sist in the detection of such an occurrence. In the case of F REE RTOS, three distinct
mechanisms are provided to this purpose:
1. Stack high water marking, performed on demand by application code by means of
the uxTaskGetStackHighWaterMark function described below.
2. A run-time check, performed automatically by the operating system upon context
switch when the configuration variable configCHECK_FOR_STACK_OVERFLOW is set to 1,
that the task stack pointer still lies within the stack area assigned to the task.
3. A more thorough run-time check, enabled by setting configCHECK_FOR_STACK_OVERFLOW
to 2, that a small guard area at the end of the stack, filled with a known value,
has not been overwritten.
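For instance, the following settings could be placed in FreeRTOSConfig.h during
development, to enable the most thorough automatic check and make stack high water
marking available to application code.

/* Enable the guard area-based automatic stack overflow check;
   set back to 0 for production builds once stack sizes are well understood.
*/
#define configCHECK_FOR_STACK_OVERFLOW         2

/* Make uxTaskGetStackHighWaterMark available to application code. */
#define INCLUDE_uxTaskGetStackHighWaterMark    1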
It must also be noted that some processors may perform hardware-based stack
overflow checks, and hence, they may generate an exception in response to a stack
corruption before any software-based overflow check can occur. Moreover, severe
stack overflows may lead the processor to address a nonexistent memory location, an
occurrence that some processors are able to detect and report, usually by means of a
bus fault exception.
Stack high water marking is the simplest kind of stack overflow
detection. It is also the least invasive from the point of view of execution time over-
head, because the check is performed only on demand, when a certain operating
system primitive is invoked. For this reason, the application has full control over the
trade-off between overhead and accuracy of the check.
As shown on the left part of Figure 15.8, when a task is first created its stack is
filled with a known value, depicted in light gray, and is mostly unused. Only a small
portion of it, at the top, is initialized by the operating system and holds the initial
task context, to be used upon the first context switch toward the task.
In this and the following figures, the used portion of the stack is colored in dark
gray. Moreover, it is assumed that stacks “grow downward,” that is, stack space is
allocated in a LIFO fashion starting from higher addresses and going toward lower
addresses. As is customary, higher addresses are depicted above lower addresses in
the figures.
During task execution, as shown in the middle of Figure 15.8, the amount of stack
used by the task grows and shrinks. At any time, the moving boundary between the
currently used and unused portions of the stack is indicated by the current processor
stack pointer.
As a consequence, when the stack grows, the known value with which the stack
was filled upon task creation is overwritten by other values, for instance, the content
of task local variables. On the other hand, the known value is not restored when the
used portion of the stack shrinks.
Therefore, it is possible to get an approximate idea of the worst-case (mini-
mum) amount of unused stack space reached by the task in its past history by
scanning the task memory area starting from the bottom and proceeding as far
as the known value is still found. This is exactly what the F REE RTOS function
uxTaskGetStackHighWaterMark does.
It is useful to remark that the check is approximate because it is possible that, by
coincidence, the task overwrites the known value, with which the stack was initially
filled, with the same value. As a consequence, there may be a certain amount of
uncertainty about the precise location of the boundary between the light gray (still
filled with the known value) and dark gray (overwritten by task) areas shown in the
figure. More specifically, the function
UBaseType_t uxTaskGetStackHighWaterMark(
TaskHandle_t xTask);
calculates and returns the worst-case amount of stack space left unused by task
xTask since it started executing. This corresponds to the size of the light gray stack
area highlighted at the extreme right side of Figure 15.8. As discussed in Chapter 5,
xTask is a task handle. The special value NULL can be used as a shortcut to refer to
the calling task.
By analogy with the usStackDepth argument of xTaskCreate, the return
value of the uxTaskGetStackHighWaterMark function is not expressed in
bytes, but in stack words, whose size is architecture-dependent. Refer to Chapter 5
for more information on how to determine the stack word size on the architecture
at hand.
As task xTask approaches stack overflow, the return value becomes closer to
zero. A return value of zero means that the available stack space has been completely
used by the task in the past, and hence, a stack overflow likely happened.
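As an example of its use, a task could periodically query its own high water mark
during development, along the lines of the sketch below. The task name, the 16-word
threshold, and the reporting action are arbitrary placeholders, and
INCLUDE_uxTaskGetStackHighWaterMark must be set to 1 in the FreeRTOS configuration.

#include "FreeRTOS.h"
#include "task.h"

void vStackAwareTask(void *pvParameters)
{
    (void)pvParameters;

    for(;;) {
        /* Worst-case amount of stack, in words, never used by this task. */
        UBaseType_t uxUnused = uxTaskGetStackHighWaterMark(NULL);

        if(uxUnused < 16) {
            /* The stack allocation is dangerously tight; report it in an
               application-specific way, for instance by lighting an LED or
               writing to a diagnostic log.
            */
        }

        vTaskDelay(pdMS_TO_TICKS(1000));  /* repeat the check once per second */
    }
}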
One shortcoming of high water marking is that, by intuition, the longer the time
that elapses between a stack overflow and its detection, the more likely it becomes
that the system misbehaves.
From this point of view, the other two stack overflow detection mechanisms are
more aggressive because the check is performed automatically, and quite frequently,
by the operating system itself instead of relying on application-level code.
In order to be used, these mechanisms need, first of all, a way to inform the applica-
tion that a stack overflow occurred. To this purpose, the application code must define
a stack overflow hook function whenever configCHECK_FOR_STACK_OVERFLOW
is set to a value greater than zero. The hook function must bear a special name and
adhere to the following prototype.
void vApplicationStackOverflowHook(
TaskHandle_t xTask,
signed char *pcTaskName );
The operating system invokes the hook function whenever it detects a task stack
overflow. The arguments passed to the function both indicate the offending task, in
two different ways:
• xTask is the handle of the task whose stack overflow was detected;
• pcTaskName is the human-readable name of the same task, as specified upon
task creation.
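A minimal, development-oriented implementation of the hook is sketched below; it
simply stops the system, leaving its state available for inspection with a debugger,
whereas a production system would normally perform some kind of recovery instead.

#include "FreeRTOS.h"
#include "task.h"

void vApplicationStackOverflowHook(TaskHandle_t xTask,
                                   signed char *pcTaskName)
{
    /* The parameters identify the offending task; they are not used here
       but can be examined with a debugger after the system halts.
    */
    (void)xTask;
    (void)pcTaskName;

    taskDISABLE_INTERRUPTS();
    for(;;) {
        /* Halt forever. */
    }
}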
Moreover, unlike the check based only on the stack pointer value, the guard area
check can detect an overflow even if the stack pointer went back to a legal value
after the overflow occurred. As in the previous case, any change in the guard area
content leads the operating system to invoke the vApplicationStackOverflowHook hook.
Since this method requires the operating system to compare the whole content
of the guard area against the known value, it is more expensive, from the execution
time overhead point of view, with respect to the previous one, which only requires
two address comparisons. On the other hand, it has a higher probability of detecting
stack overflows successfully.
As a concluding note, it is useful to remark that automatic stack overflow check-
ing, performed at the operating system level, may introduce a significant overhead
especially on low-end microcontrollers. Therefore, it should be used only during
software development and testing, whereas it can be disabled after the actual stack re-
quirements of the various tasks present in the system become well known and stable.
15.8 SUMMARY
Internal consistency checks, especially those related to memory corruption issues,
play a central role in software development, especially for embedded systems in
which reliability is often an important design requirement. In addition, the same
checks also speed up software development and debugging because they help
programmers to quickly identify functions and code modules with questionable
behavior.
CONTENTS
16.1 Introduction to S PLINT 468
16.2 Basic Checks 470
16.3 Memory Management 478
16.4 Buffer Overflows 492
16.5 Function Interface Annotations 499
16.6 Summary 506
This chapter shows how to detect code mistakes and vulnerabilities, mostly related
to bogus memory accesses, by means of static code analysis techniques. Namely, it
analyzes in detail a mature open-source tool called S PLINT [60], which implements
those techniques.
The goal is to enhance code reliability and, in some cases, its security. The last
aspect is becoming more and more important as embedded systems are nowadays
often connected to public communication networks, and hence, they are becoming
more vulnerable than in the past to security attacks.
Informally speaking, static code analysis is able to infer some properties of a pro-
gram by working exclusively at the source code level—possibly with some additional
hints from the programmer, given in the form of source code annotations.
In any case, it is never necessary to actually execute the code, and hence, static
code analysis does not require any kind of runtime support from the processor and
the operating system.
This is especially welcome in embedded software development because, as illus-
trated in Chapter 2, static code analysis can be performed on the development system
and does not impact the performance of the target system in any way [37].
Even more powerful methods and tools exist, based on dynamic analysis, of which
VALGRIND [151] is a typical example in the open-source arena. However, they re-
quire various kinds of support from the target processor, operating system, and run-
time libraries, which may not be available in many embedded systems. For instance,
at the time of this writing VALGRIND is available only for a limited set of processor
architectures and a couple of operating systems.
Moreover, even when appropriate support is indeed available, the runtime over-
head that those tools introduce by permeating the target application with their instru-
mentation code may be unacceptable.
For instance, a control comment like /*@+charint@*/ can be used to inform S PLINT
that the code contained in a certain source file, where the annotation appears, uses
data types char and int interchangeably, as was typical of legacy programs.
Clearly, assumptions like this weaken the checks that S PLINT can perform, and
hence, it is better to keep their scope as narrow as possible, so that they apply only
where they are strictly needed.
From the practical point of view, the tool is invoked mostly like a normal C com-
piler. For instance,
splint <flags> a.c b.c
Table 16.1
S PLINT Mode Selector Flags
standard This is the default operating mode of S PLINT, the one used in
all examples given in this chapter, unless otherwise specified.
As described in the main text, a limited amount of annotation is
needed to avoid false warnings.
checks This flag makes checks even stricter. The main difference with
respect to standard is that it enables the memory manage-
ment checks presented in Section 16.3.
strict When this operating mode is selected, S PLINT performs all the
checks it is able to. It can profitably be used to thoroughly scru-
tinize short, critical sections of code. However, as S PLINT au-
thors themselves say [60] it is hard to produce a real program
that triggers no warnings when this operating mode is in effect.
invokes S PLINT with a certain set of <flags> on source files a.c and b.c, which
the tool considers to be part of the same program, and hence, analyzes together.
The tool supports a relatively high number of flags, which control different aspects
of the checks it performs. For this reason several special flags, called mode selector
flags and listed in Table 16.1, set the other flags to predefined values, in order to
make the checks weaker or stricter. In other words, they provide a convenient, coarse-
grained way of generally controlling which classes of errors the tool reports.
After setting a mode selector flag, it is still possible to set or reset individual flags
to further configure the tool in greater detail. Instead, doing the opposite triggers
a warning.
Another peculiar aspect of S PLINT is how flags are set and reset. In most “unix-
style” tools, a flag is set by putting its name on the command line, preceded by either
- or -- to distinguish it from other kinds of command-line arguments. For instance,
in order to ask a tool to print out some help information, it is pretty common to
invoke the tool with the --help command-line option. Flags are usually reset by
putting no- before their name.
Instead, S PLINT flags are preceded by a single character that determines the action
to be performed on that flag. Namely:
• The + character turns the flag on.
• The - character turns the flag off.
• The = character can be used in a control comment to reset a flag to the value
specified on the command line or to the default value.
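For example, the command line splint +checks -exportlocal a.c b.c selects the
checks mode and then turns off the individual exportlocal flag, which controls
warnings about objects exported by a source file but used only within it. The same
flag can also be manipulated locally through control comments, as in the hypothetical
fragment below, where the = form restores the value given on the command line.

/*@-exportlocal@*/
int debug_hook_count;          /* exported for debugging, used only in this file */
void debug_dump_state(void);   /* same; identifiers are made up for illustration */
/*@=exportlocal@*/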
Due to space limitations, only an overview of the main checks that S PLINT can
perform and a short introduction on how to take advantage of it, by means of ex-
amples, will be given here. Readers interested in the fine details or aiming at adopt-
ing the tool for production use should refer to the S PLINT manual [60] for further
information.
It must also be remarked that, with the steady progress of compiler technology,
many of these checks can nowadays be performed by the compiler itself. Chapter 9
lists some flags that can profitably be used for this purpose. However, in many cases,
S PLINT checks still have greater precision also thanks to programmers’ annotations,
which give it additional valuable information that is otherwise unavailable to the
compiler.
#include <stdio.h>
#include <stdlib.h>

int *alloc_int(void)
{
    int *p = (int *)malloc(sizeof(int));
    return p;
}

void use_intp(int *p)
{
    printf("%d\n", *p);
}

void alloc_and_use(void)
{
    int *p = alloc_int();
    use_intp(p);
}
Of these, the last two pertain to other kinds of checks, unrelated to pointer han-
dling, which will be better detailed later. Concerning the first three:
1. The first one informs the programmer that function alloc_int may return a
NULL pointer. If this is deemed to be acceptable because, for instance, the caller
or some other functions are responsible for ensuring that the pointer they receive
is indeed not NULL before using it, then the programmer should annotate the re-
turn value in an appropriate way, else the code should be modified to address the
potential issue.
2. The second warning highlights that, in the same function, the storage pointed
by p—that becomes accessible to the caller after the function return—may have
undefined content. This is because the library function malloc does not initialize
the content of memory areas it allocates. As before, it is possible to disable the
warning if this is the intended behavior of the function.
3. The third and last warning indicates that there is a memory leak in function
alloc_and_use. This is because the function dynamically allocates some stor-
age, by means of alloc_int. Then, it returns without releasing the storage and
without making a valid reference to it available in some other ways, for instance,
by returning a pointer to the storage to the caller or storing the pointer into a global
variable. As a consequence, all references to the storage are lost and it becomes
impossible to release it appropriately.
As shown in the following modified listing, all three warnings can be addressed
in a straightforward way, either by improving the quality of the code or by providing
further information about the intended behavior of the code itself.
 1  #include <stdio.h>
 2  #include <stdlib.h>
 3
 4  /*@null@*/ int *alloc_int(void)
 5  {
 6      int *p = (int *)malloc(sizeof(int));
 7      if(p != (int *)NULL) *p = 0;
 8      return p;
 9  }
10
11  void use_intp(/*@null@*/ int *p)
12  {
13      if(p != NULL) printf("%d\n", *p);
14  }
15
16  void alloc_and_use(void)
17  {
18      int *p = alloc_int();
19      use_intp(p);
20      free(p);
21  }
In particular:
• For the sake of this example, we are willing to accept that alloc_int
may return a NULL pointer because our intention is to check the pointer
value before using it. As a consequence, we annotate the return value with
/*@null@*/ to indicate that it may be NULL.
• The input argument of use_intp has been annotated in the same way to
remark that we accept it to be NULL, because the function itself will check
it before use.
• In order to address the second warning, in alloc_int we explicitly initial-
ize to zero the storage pointed by p after allocation. In order to avoid further
warnings related to dereferencing a possible NULL pointer in alloc_int,
initialization is performed only after checking that allocation was success-
ful, that is, p != (int *)NULL.
• In function use_intp we ensure that p is not NULL before passing it to
printf. This avoids further warnings about NULL pointer dereferencing
in this function.
• Last, we free the storage pointed by p before returning from
alloc_and_use, thus avoiding the memory leak spotted in the third
warning discussed previously.
Going back to the last two original warnings, their meaning is that both
alloc_int and use_intp are globally visible functions but—as far as the tool
can tell—they are not used elsewhere in the program, that is, outside the source code
module where they are defined. Those warnings can be addressed in two different
ways:
1. If this is indeed the case, then it is advisable to use the static qualifier in the
function definition, to avoid cluttering the global name space without reason.
2. If the warning depends on the fact that the tool has insufficient knowledge about
how functions are used—for instance, because it is running only on some modules
of a bigger program—it can be suppressed by means of the +partial flag.
In addition, S PLINT performs a variety of sophisticated checks aimed at detect-
ing type mismatches, which are stricter than what average compilers do. Of these,
probably the most interesting ones concern enumerated and Boolean data types.
The standard C language considers enumerated data types, defined by means of
the enum keyword, to be equivalent to integers in many respects. As a consequence,
it is possible to assign an arbitrary integral value to an enum variable, even though
that value was not mentioned as an enumerator member. Even assigning a member
defined for a certain enumerated data type to another enumerated data type merely
triggers a compiler warning, and not in all cases.
Let us consider the following fragment of code as an example.
1 /*@ -enumint @*/
2
3 enum a {
4 A_ONE = 1,
5 A_TWO,
6 A_THREE
7 };
9 enum b {
10 B_ONE = 1,
11 B_TWO
12 };
13
14 void work_with_enum(void)
15 {
16 enum b x = A_THREE;
17 enum b y = 8;
18 }
The code defines two enumerated data types (enum a and enum b), along
with their members (A_ONE to A_THREE for enum a, and B_ONE to B_TWO for
enum b). According to the C-language specification, the numeric values of these
members correspond to the numbers spelled out in their names.
Then, function work_with_enum defines two local variables (x and y) and as-
signs a value to them. Both assignments are questionable, for two different reasons.
• The first one, at line 16, assigns to x (which is an enum b) a valid enumer-
ation member A_THREE. However, that member has been defined for the
enum a data type, not enum b, and its numeric value does not correspond
to any member of enum b.
• The second one, at line 17, assigns to y (which is again an enum b) the
integer value 8, which does not correspond to the numeric value of any
enum b members.
When run on this fragment, S PLINT reports four warnings, not shown here for
conciseness.
• The first two warnings highlight that the two assignments are questionable,
for the reasons explained previously.
• The last two warnings are of little interest because they merely remark that
local variables x and y are never used. This is a condition that most com-
pilers can indeed detect and report to the user.
Before the adoption of C99, the informal name of the ISO/IEC 9899:1999 in-
ternational standard [89], the C programming language did not foresee an explicit
Boolean data type, and used integers in its place. In particular:
Therefore, it was possible to make a confusion between Boolean and integer val-
ues, and hence, introduce errors in the code, without receiving any warning from the
compiler. C99 introduced a Boolean data type, revolving around the stdbool.h
header, but did not specify any stronger type checking.
The following fragment of code declares several bool variables, the Boolean data
type in C99, and performs some checks on them by means of the if statement. The
control comment at the very beginning of the listing conveys two pieces of informa-
tion to SPLINT.
1. -booltype bool specifies that the name of the Boolean data type is indeed
bool. The option of using a different name for this data type is useful when the
program defines its own Boolean data type, as happens, for instance, when the
program was written before C99 came into effect.
2. +predboolptr enables one additional check that forbids a test expression from
being a pointer.
3 #include <stdbool.h>
4
10 if(px) return;
11
12 if(a = b) return;
13
14 if(a + b) return;
15 }
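Only part of the listing is reproduced above. A possible completion is shown below as a sketch; the function name and the declarations of px, a, and b are assumptions, not the original code.

/*@ -booltype bool +predboolptr @*/

#include <stdbool.h>

void bool_checks(int *px)
{
    bool a = true;
    bool b = false;

    /* a pointer used directly as a test expression (checked by +predboolptr) */
    if(px) return;

    /* an assignment, rather than a comparison, used as a test expression */
    if(a = b) return;

    /* an arithmetic (integer-valued) expression used as a test expression */
    if(a + b) return;
}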
When run on the code fragment, the tool produces the following output.
splint basic_checks_4.c
Splint 3.1.2 --- 19 Apr 2015
As far as standard compilers are concerned, at the time of this writing GCC warns only about the if(a = b) statement. This is not surprising because, as remarked previously, all those statements are perfectly legal as far as the language is concerned.
The last group of basic checks performed by SPLINT is about questionable control flow. Examples of checks belonging to this group are the detection of statements that have no effect, suspicious fall-through between the branches of a switch statement, enumeration members missing from a switch, and function return values that are silently ignored by the caller.
Even though these mistakes seem trivial and easy to spot by code inspection, they
are indeed responsible for a fairly large share of programming errors that, as pointed
out for instance in [167], may be hard and time-consuming to find.
Regarding automatic checks, it is important to remark that, since the C-language
syntax, by itself, does not provide detailed flow control information, many checks are
effective and do not produce false warnings only if the program contains a sufficient
number of annotations. To better illustrate this point, let us consider the following
excerpt of code, which addresses in a different way the issues pointed out by SPLINT
about the code shown on page 470.
1 #include <stdio.h>
2 #include <stdlib.h>
3
4 void report_error(void);
5
6 int *alloc_int(void)
7 {
8 int *p = (int *)malloc(sizeof(int));
9 if(p == NULL) report_error();
10 *p = 0;
11 return p;
12 }
13
19 void alloc_and_use(void)
20 {
21 int *p = alloc_int();
22 use_intp(p);
23 free(p);
24 }
In this case, instead of checking if pointer p is NULL before using it, we would
like to check it immediately after attempting to allocate the dynamic memory it must
point to.
To this purpose, at line 9 of the listing—within the alloc_int function that
is responsible for allocating memory for p—we added a statement to perform the
above-mentioned check and call the function report_error if memory allocation
was unsuccessful.
When called, the function report_error reports the error in some way, and
then aborts the program without ever returning to the caller. In this way, the program
behaves correctly because, if memory allocation fails, it never makes use of p.
Nonetheless, as shown in the listing that follows, SPLINT warns that it is indeed
possible to dereference a possibly NULL pointer.
splint +partial basic_checks_5.c
Splint 3.1.2 --- 19 Apr 2015
This false warning is due to the fact that—unless it can prove this is not the case—
SPLINT assumes that all functions eventually return to the caller, and hence, program
execution always continues after a function call.
Therefore, it is strongly recommended that functions that never return are anno-
tated to improve the quality of the analysis, by adding /*@ noreturn @*/ before
the function prototype.
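For instance, with reference to the listing above, the prototype of report_error at line 4 could be annotated as in the following sketch; the body shown here is just one plausible implementation, not the original one.

#include <stdio.h>
#include <stdlib.h>

/*@ noreturn @*/ void report_error(void);

void report_error(void)
{
    /* Report the error and abort, so that the function never returns. */
    fputs("dynamic memory allocation failed\n", stderr);
    exit(EXIT_FAILURE);
}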
To conclude this section, Tables 16.2 and 16.3 summarize the main flags and
annotations that configure SPLINT basic checks, respectively. Due to lack of space,
not all flags listed in the tables have been mentioned and thoroughly described in the
text. However, the short description given in the tables ought to provide readers with
a starting point for further research on the subject.
Table 16.2
SPLINT Flags Related to Basic Checks
usedef When set, it enables checks concerning the use of a location before it has been
initialized.
mustdefine When this flag is turned on, the tool emits a warning if a function parameter
annotated as out (see Table 16.3) has not been defined before the function
returns.
impouts When this flag is set, unannotated function parameters are implicitly assumed
to be annotated as out.
charint When set, this flag makes the char data type indistinguishable from integers,
to avoid false warnings in legacy programs.
charindex When set, the tool allows array indexes to be of type char without warnings.
enumint Like charint, but for enumerated (enum) data types.
enumindex Like charindex, but for enumerated data types.
booltype Informs the tool about the name used for the Boolean data type. Flags
booltrue and boolfalse can be used to specify the (symbolic) names of
true and false.
predboolptr When set, enables a warning when a pointer is used as a test expression.
predboolint When set, enables a warning when an integer, rather than a Boolean, is used as
a test expression.
predboolothers When set, enables a warning when any data type (other than Boolean, pointer,
or integer) is used as a test expression. The flag predbool can be used to set
the three flags just described all together.
eval-order When set, the tool checks and warns the programmer when it finds an expres-
sion whose result is unspecified, and may be implementation-dependent, be-
cause it depends on its sub-expression evaluation order, which is not defined by
the standard.
casebreak When set, the tool warns about cases in which control flow falls through a
switch statement in a doubtful way.
misscase When set, the tool emits a warning if not all members of an enumerated data
type appear in a switch statement concerning that data type.
noeffect When set, statements which have no effect are flagged with a warning.
retvalint When set, ignoring an integer function return value triggers a warning.
retvalbool When set, ignoring a Boolean function return value triggers a warning.
retvalother When set, ignoring a function return value whose type is neither integer nor
Boolean triggers a warning. The flag retval can be used to set the three
flags just described all together.
Table 16.3
SPLINT Annotations Related to Basic Checks
nullwhentrue This annotation is used for functions that check whether or not a
pointer is NULL. This annotation indicates that when the return
value of the function is true, then its first argument was a NULL
pointer.
out This annotation applies to a pointer and indicates that the stor-
age reachable from it need not be defined.
partial This annotation indicates that the storage reachable from the pointer
it applies to, typically a struct, may be partially defined, that
is, it may have undefined fields.
noreturnwhentrue This annotation denotes that a function never returns if its first
argument is true.
fallthrough Within a switch, it indicates that flow control was left to fall
through a case statement on purpose, and hence, the occur-
rence shall not be flagged by the tool.
The effects of memory management errors typically depend in a complex and hard to predict way not only on the offending task, but also on the memory-related activities of all other tasks in the system.
At the beginning of Section 16.2, we already saw an informal example of how
SPLINT can effectively track the value of a pointer and generate a warning if the
pointer is NULL or the memory location it points to may be used while it has unde-
fined contents.
In order to proceed further, it is first of all necessary to give some more precise
definitions about the storage model adopted by SPLINT [58]. In this model, an object
is a typed region of storage that holds a C-language variable, for instance, an array.
The assignment of storage to objects is a crucial part of memory management. In the
C language, it can be performed in two different ways.
1. The storage assigned to, and used by, some objects is implicitly managed by
the compiler without the programmer’s intervention. For instance, storage for lo-
cal variables is automatically allocated from the task stack when the function is
called, and released when it returns.
2. In other cases, storage must be explicitly managed by means of appropriate pro-
gram statements. In other words, the programmer becomes responsible for allo-
cating and releasing storage at the right time. In the C language, this must be done
for dynamic memory allocated by malloc or similar functions.
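The following short fragment, given purely as an illustration, contrasts the two cases.

#include <stdlib.h>

void storage_example(void)
{
    int local = 3;                          /* case 1: storage managed by the
                                               compiler, on the task stack    */

    int *dyn = (int *)malloc(sizeof(int)); /* case 2: storage managed by the
                                               programmer, on the heap        */
    if(dyn != NULL)
    {
        *dyn = local;
        free(dyn);                         /* explicit release is mandatory   */
    }
}                                          /* storage for local is released
                                              automatically at this point     */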
A certain region of storage and, by analogy, the object it holds, can be in sev-
eral different states. Its state evolves according to the operations that the program
performs on it.
As shown in Figure 16.1, the matter becomes more complex when we also con-
sider object pointers. Even though they are depicted as separate entries in the figure,
for the sake of clarity, pointers are themselves objects and are stored in memory like
any other object is. It is therefore possible, for instance, to have a pointer Q to another pointer P to an object O. In this case, pointer P assumes a double role, because it is at the same time an object (it is referenced by Q) and a pointer (it references O). With this in mind, pointers can be classified as follows.
• A first, rough distinction is between live pointers and dead pointers. Live
pointers are often also called valid or legal pointers, whereas dead pointers
are called dangling, invalid, or illegal.
• A null pointer is a pointer that has a special reserved value, corresponding
to the macro NULL in the C language. This reserved value indicates that
the pointer does not point to any object in memory, and hence, it cannot be
dereferenced.
• A live pointer is either a null pointer or a pointer to an allocated area of
storage. As shown in the figure, in the second case, the pointer value cannot
be NULL by definition.
• Live pointers belong to two different categories, depending on the state of
the storage they point to. A pointer to storage that has been allocated to
an object and is completely defined is called an object pointer. Object pointers can be
dereferenced and the result is to get access to a valid object. On the contrary,
a pointer to storage that has been allocated but not (completely) defined yet is
called an allocated pointer; it can be assigned to, but the value of the object it
references must not be used.
• A special type of object pointer is a pointer that points within an object
and is called an offset pointer. For example, if the object is an array of 5
elements, the address of the third element is a valid offset pointer.
Offset pointers have the same basic properties as object pointers. However,
as better detailed in Section 16.4, extra checks are required to make sure
they stay within the underlying object boundaries, especially when they are
generated by means of pointer arithmetic.
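As an informal illustration of this nomenclature, consider the following fragment; the variable names are arbitrary.

#include <stdlib.h>

void pointer_kinds(void)
{
    int a[5] = { 0, 1, 2, 3, 4 };

    int *obj = a;                          /* object pointer: references a
                                              defined object                  */
    int *off = &a[2];                      /* offset pointer: points within
                                              the same object                 */
    int *nul = NULL;                       /* null pointer: must not be
                                              dereferenced                    */
    int *q   = (int *)malloc(sizeof(int));

    free(q);                               /* from now on, q is a dead pointer
                                              and must no longer be used      */
    (void)obj; (void)off; (void)nul;
}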
Table 16.4
Kinds of Pointers and Their Usage
Table 16.4 summarizes the kinds of pointers just discussed and their properties.
Namely, the second and third columns of the table indicate which operations can be
performed on the object they reference, and the rightmost column says whether or
not that kind of pointer is allowed at all in the program.
From another point of view, informally speaking, the table summarizes the checks
SPLINT is able to perform on pointer usage. In particular, any usage not marked with
“yes” in the table triggers a warning.
At the same time, the table highlights a crucial distinction that must be made clear
between null and dead pointers. As can be seen, neither of them can be dereferenced
because they do not point to a valid object.
However, the null pointer has a fixed, well-defined value (NULL) and it is both
possible and easy to determine whether a pointer is null or not. For this reason, null pointers
are allowed in programs and are quite frequently used in complex data structures—for
instance, to mark the end of a linked list.
Eventually, the obligation to release storage is satisfied by calling the library func-
tion free, which is implicitly annotated as
void free (/*@only@*/ /*@out@*/ /*@null@*/ void *p);
Let us now illustrate, by taking the code fragment that follows as an example, how
SPLINT flags memory management mistakes according to Table 16.4.
1 #include <stdlib.h>
2 #include <assert.h>
3
10 return &v;
11 }
12
17 assert(p != NULL);
18 *p = 0;
19 return p;
20 }
21
27 void use_heap(void)
28 {
29 int *p = create_heap();
30
31 *p = 1;
32
33 p = create_heap();
34 *p = 2;
35
36 destroy_heap(p);
37 *p = 3;
38 }
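Several lines are omitted from the listing above; in particular, the beginning of create_stack and create_heap and the whole body of destroy_heap are not shown. The following sketch, which relies on the headers included at lines 1 and 2, completes them in a way consistent with the warnings discussed next; it is an assumption, not the original code. Function use_heap is reproduced in full at lines 27 through 38.

int *create_stack(void)
{
    int v;
    return &v;                 /* line 10: address of a local, uninitialized
                                  variable                                    */
}

int *create_heap(void)
{
    int *p = (int *)malloc(sizeof(int));
    assert(p != NULL);         /* lines 17 to 19 of the listing               */
    *p = 0;
    return p;
}

void destroy_heap(/*@only@*/ int *p)
{
    free(p);
}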
It should already be clear from the description that this code fragment contains
several serious memory management errors, which SPLINT is able to detect, as high-
lighted by the tool output that is listed in the following.
splint +checks mem_checks_1.c
Splint 3.1.2 --- 19 Apr 2015
1. The first warning states that create_stack returns a dead pointer to the caller.
In fact, it is a pointer to local variable v and the storage allocated to it is automat-
ically released when the function itself returns.
2. Furthermore, in the second warning the tool remarks that the storage made acces-
sible to the caller by create_stack is not completely defined. In fact, v has
not been assigned any value since it was defined. According to the nomenclature
introduced previously, this makes a pointer to it an allocated pointer.
As shown in Table 16.4, allocated pointers are allowed in programs, but are sub-
ject to usage restrictions. In particular, the value of the object they reference must
not be used.
For pointers of this kind, the tool requires the annotation /*@out@*/, as men-
tioned in the warning itself, in order to check them appropriately. This annotation
has already been introduced informally in Section 16.2 and Table 16.3.
3. Since there is no explicit annotation regarding it, the return value of
create_stack is implicitly assumed to be marked with only. This is inconsis-
tent with returning an immediate address of a local variable as the responsibility
for managing the allocation and release of storage for local variables falls onto
the compiler and should never be performed explicitly in the program.
4. In the fourth warning, SPLINT warns that a new value was assigned to p at line 33,
without first releasing the storage it was pointing to. Since p contained the only
reference to that storage, according to the only annotations, this indicates a mem-
ory leak.
In fact, the storage previously pointed by p is still allocated but it will no longer
be possible to access it, and not even release it, in the future. As a debugging aid,
the tool also points out that the leaked storage was allocated at line 29.
5. The last warning remarks that the program dereferences pointer p after the storage
it points to has been released and p became a dead pointer. The line numbers
cited in the warning are the line at which storage was released (line 36) and then
dereferenced (line 37).
As was already noted in the previous examples, the checks performed by a stan-
dard compiler like GCC are much more limited, also due to the inability to provide
annotations. In this case, it is able to detect only the very first issue in the previous
list, and not the others.
In real programs, even when working with relatively simple data structures, it is
often necessary to have more than one pointer to the same object. Those additional
pointers are called aliases in the S PLINT documentation.
A typical example is a circular, linked list in which the first element of the list must
be accessible by means of a pointer to the head of the list, but it is also referenced by
the last element of the list, in order to make the list circular.
In this case, the only annotation must be replaced with other, weaker annotations,
to be described in the following.
• A pointer passed as an argument corresponding to a parameter annotated with
keep transfers to the called function the obligation to release the associated
storage, like only does. At the same time, the caller can keep using the original
pointer for as long as it wants.
Therefore, after the call, the storage may be released at any time by the
called function, without informing the caller, whereas the caller can keep
using a (possibly dead) pointer to it. It is up to the programmer to en-
sure that this does not happen, because the condition cannot be checked
automatically.
• A pointer annotated with shared gives even more freedom to the program-
mer and further relaxes the checks the tool performs on its usage. Multiple
shared pointers can point to the same object and there are no limits on
how pointers can be aliased. The only constraint enforced by the tool is that
storage should never be released explicitly. This annotation is adequate, for
instance, to model a program that makes use of garbage-collected memory
management.
• The temp annotation can be attached to a function parameter to indicate
that the function uses the corresponding argument, which must be a pointer,
only temporarily. When the function is called, the obligation to release stor-
age is not transferred to the function. Therefore, SPLINT outputs a warn-
ing if the function releases the storage referenced by the pointer or creates
aliases of the pointer that are still visible after the function returns.
The difference between only, keep, and temp, concerning the transfer of the
obligation to release storage, is best illustrated by means of an example.
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <assert.h>
4
19 void m(void)
20 {
21 int *pf = a();
22 int *pg = a();
23 int *ph = a();
24
When run on this code, the tool reports the following issues.
1. The first warning highlights that pointer pf has been dereferenced after the storage
it points to has been released. In fact, pf was passed as an argument to f and the
corresponding parameter is annotated as only.
According to the annotation, the called function took the obligation to release
storage and the caller lost the right of using the pointer after the call.
2. The second warning reveals that the storage pointed by ph has not been released
correctly before m returns to the caller. This indicates a memory leak because ph
is lost upon return and it is the only available reference to that storage.
In this case, even though function h was invoked with ph as argument, it did
not take the obligation to release storage because its parameter is annotated
with temp.
Thus, the caller m retained the right of using ph after the call, but also the obliga-
tion to release storage, which it did not fulfill.
Instead, no warnings are raised about pointer pg. This is because the parameter
of function g is annotated with keep. As a consequence, g took the obligation to
release storage (and no warnings are raised about this aspect) but m kept the right of
using the pointer after the call (and hence, the use of pg after the call is considered
correct).
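To make the example concrete, the part of the listing that is not reproduced above could look like the following sketch, which relies on the headers already included at the top of the listing; the allocation helper a and the bodies of f, g, and h are assumptions consistent with the annotations just discussed.

static int *a(void)
{
    int *p = (int *)malloc(sizeof(int));
    assert(p != NULL);
    *p = 0;
    return p;
}

static void f(/*@only@*/ int *p)   /* takes the obligation to release storage */
{
    free(p);
}

static void g(/*@keep@*/ int *p)   /* takes the obligation to release storage,
                                      but the caller may keep using the pointer */
{
    free(p);
}

static void h(/*@temp@*/ int *p)   /* uses the pointer only temporarily        */
{
    *p += 1;
}

/* Within m, after the three allocations at lines 21 to 23, the calls could
   be: f(pf); *pf = 1;  g(pg); *pg = 2;  h(ph);  with no free(ph).           */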
Besides being often associated with memory management issues, aliasing may
also trigger problems within some functions, when their implementation implicitly
assumes that arguments do not alias each other, but they actually do.
A well-known example is represented by the strcpy library function, which
copies a ’\0’-terminated source string, pointed by its second argument, into a des-
tination string pointed by the first argument. The behavior of this function when
source and destination overlap even partially is undefined, because the source—and,
even more importantly, its termination character—may be overwritten while the copy
is in progress.
In addition to the ones discussed previously, SPLINT supports two annotations,
unique and returned, to provide information about, and constrain, the aliasing of
function parameters and return values. In particular, unique states that a function
parameter shall not be aliased by any other storage reachable from the function,
whereas returned states that the parameter may be aliased by the function return value.
As an example of how these two annotations work, let us consider the following
fragment of code.
1 /*@ -exportheader @*/
2
8 void m(void)
9 {
10 int a;
11 int *b, *c;
12
13 a = 0;
14 b = &a;
15 c = f(b);
16
17 g(b, c);
18 }
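Lines 3 through 7 of the fragment, which contain the declarations of f and g, are not shown. A sketch consistent with the warnings discussed below is the following; the function bodies are assumptions.

/* f may return an alias of its argument */
int *f(/*@returned@*/ int *p)
{
    return p;
}

/* g assumes that its arguments do not alias each other */
void g(/*@unique@*/ int *p, /*@unique@*/ int *q)
{
    *p = *q + 1;
}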
Taking into account the annotations of f and g, the tool emits the following two
warnings:
splint +checks +partial mem_checks_3.c
Splint 3.1.2 --- 19 Apr 2015
As can be inferred from the warning messages above, these two warnings are
actually symmetric. They remark that pointers b and c, passed to function g, may be
aliases of each other.
This is because, due to the presence of the returned annotation for parameter
p of f, the return value of f may be an alias of p. Therefore, when f is called by m
using b as argument, c may become an alias of b, and both may therefore point to
variable a.
Then, m passes pointers b and c to function g upon calling it. However, the corre-
sponding parameters have a unique annotation attached to them, and this triggers
the warning.
Tables 16.5 and 16.6 summarize the most important SPLINT flags and annotations
related to memory management. It should be noted that not all of them have been
discussed in this book for brevity.
Table 16.5
SPLINT Flags Related to Memory Management
stackref When this flag is turned off, warnings about returning the address of variables
allocated on the stack are suppressed.
compdef When this flag is off, the tool does not warn the programmer when a function
returns a pointer to storage that is not completely defined.
immediatetrans When this flag is off, no warnings are given when an immediate object address,
obtained with the & operator, is used in an inconsistent way.
mustfree When off, the tool does not produce any warning about memory leaks.
usereleased When off, using storage after it has been released does not trigger a warning.
In the C language, buffers are typically implemented as arrays, whose size is constant and is determined when they are defined. Then, buffer contents are accessed
by referring to array elements by means of an integer index.
As any other variable, buffers are surrounded by other objects in memory. Hence,
accessing an array element with an invalid index (either lower than zero or higher
than the number of array elements minus one) is illegal because it references storage
that is outside the region allocated to the array.
In particular, read operations may cause part of the program to use seemingly ran-
dom information and malfunction. Write operations usually result in memory cor-
ruption that, in some cases, can be exploited to make the program execute arbitrary,
malicious code. In fact, it has been estimated that buffer overflows are responsible
for about 50% of all security attacks [107].
As happens for memory corruption in general, buffer overflows are often diffi-
cult to detect because their effects, especially in a concurrent system, may be dif-
ferent from one program execution to another. Moreover, they are inherently data-
dependent and may not show up during testing.
The general techniques used by static analysis tools—and SPLINT in particular—
to detect buffer overflows are quite complex and thoroughly discussing them is be-
yond the scope of this book. Here, they will mainly be described in an intuitive,
rather than formal way, and the results they can achieve will be shown by means of
simple examples. Interested readers should refer to more specialized literature, for
instance [59, 107], for further information about this topic.
Informally speaking, in order to perform buffer overflow analysis, SPLINT tags
buffers with two properties. If b is a buffer, then:
• The property maxSet(b) represents the highest index of b that can legally
be set, by using it as the target of an assignment.
Table 16.6
SPLINT Annotations Related to Memory Management
only This annotation indicates that a pointer is the only reference to the object it
points to. Therefore, the pointer has attached to it the obligation of releasing
the storage associated to the object.
owned This annotation indicates that a pointer has attached to it the obligation of re-
leasing the storage pointed by it. Unlike only, it is however possible to have
other pointers to the same storage, provided they are annotated as dependent
(see below).
dependent This annotation indicates that a pointer references storage to which other point-
ers refer, too. One of those other pointers, annotated with owned, has the obli-
gation to release the storage.
keep This annotation applies to a function parameter that is a pointer. Like only, it
indicates that the function takes the obligation of releasing the storage associ-
ated to the referenced object. Unlike only, the caller can however keep using
the original pointer.
shared This annotation indicates that a pointer points to storage that has one or more
pointers to it and it is never explicitly released, as happens in garbage-collected
memory management systems.
temp This annotation applies to a function parameter that is a pointer, like keep does.
It indicates that the function uses the pointer only temporarily, and hence, it
does not take the obligation of releasing the storage associated to the referenced
object.
unique This annotation is attached to a function parameter and denotes that the param-
eter shall not be aliased by any storage reachable from within the function body.
Unlike only, this annotation does not imply any obligation to release storage.
returned This annotation, when attached to a function parameter, indicates that the pa-
rameter may be aliased by the function return value, and hence, the call should
be checked for correctness accordingly.
• The property maxRead(b) denotes the highest index of b that can legally
be read. Namely, it is considered illegal to read a buffer element that is
beyond the highest-index element that has been initialized or, for character
strings, any element beyond the ’\0’ terminating character.
Starting from these properties, the tool associates preconditions and postconditions with the program statements that access buffers. In other words, preconditions can be seen as constraints that must be satisfied for
a statement to be legal. Then, postconditions made true by the execution of a certain
legal statement can be leveraged to prove that the preconditions of subsequent state-
ments are also satisfied, and so on. Informally speaking, static verification proceeds
in this way, trying to prove that all statements belonging to a block of code—for
instance, the body of a function—are indeed legal.
The tool is able to establish preconditions and generate postconditions for a vari-
ety of C statements. In addition, functions can be annotated to specify their precon-
ditions and postconditions, by means of the requires and ensures annotations,
respectively. C library functions are implicitly annotated in this way, too, and no
programmer’s intervention is needed for them.
For instance, when analyzing the statement
static int b[10];
the tool establishes that there are no preconditions for it to be legal, and it generates
the postcondition
maxSet(b) == 9 (16.1)
because the last allocated element of the array is at index 9.
As a further example, let us consider the implicit requires and ensures
annotations attached automatically by the tool to the well-known strcpy library
function:
void strcpy(char *s1, char *s2)
/*@requires maxSet(s1) >= maxRead(s2) @*/
/*@ensures maxRead(s1) == maxRead (s2) @*/;
• The requires annotation specifies the precondition that must hold when
strcpy is called. In this example, the precondition
maxSet(s1) >= maxRead(s2) (16.2)
indicates that the index of the highest element that can legally be written
into the argument passed to strcpy as parameter s1 (the destination of
the string copy) must be at least as high as the highest element that can
legally be read from the argument passed as parameter s2 (the source of
the string copy).
• The ensures annotation specifies the postcondition made true by the exe-
cution of strcpy after the function returns to the caller.
In this example, the postcondition
maxRead(s1) == maxRead(s2) (16.3)
indicates that after string s2 has been copied into s1 the highest element
that can legally be read from s1 is the same as s2, because the contents of
both strings are now identical.
Let us now see how the tool applies these definitions to the following fragment of code.
10 strcpy(b, "abc");
11
12 x = b[2];
13
14 i = 4;
15 x = b[i];
16
17 strcpy(b, "abcde");
18 x = b[i];
19
20 b[2*i] = ’x’;
21 b[3*i] = ’x’;
22 }
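Lines 1 through 9 of the fragment are not shown. The following sketch reproduces the whole function for clarity, without the original line numbers; the names and types of the local variables are assumptions consistent with the analysis that follows.

#include <string.h>

void m(void)
{
    char b[10];          /* maxSet(b) == 9: the last valid index is 9         */
    char x;
    int  i;

    strcpy(b, "abc");    /* maxRead(b) becomes 3                              */

    x = b[2];            /* legal, because 3 >= 2                             */

    i = 4;
    x = b[i];            /* flagged, because 3 >= 4 does not hold             */

    strcpy(b, "abcde");  /* maxRead(b) becomes 5                              */
    x = b[i];            /* legal this time, because 5 >= 4                   */

    b[2*i] = 'x';        /* legal, because 9 >= 8                             */
    b[3*i] = 'x';        /* flagged, because 9 >= 12 does not hold            */

    (void)x;             /* keeps this sketch free from unused-value warnings */
}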
• After copying the 3-character literal string "abc" (plus its terminating '\0')
into b at line 10, the function reads two elements of b.
1. First, it reads element 2, using an integer constant as the index (line 12).
2. Then, it reads element 4, but this time it uses variable i as the index, after
setting it to 4 (lines 14–15).
• At line 17, the function changes the contents of b by copying another literal
string into it. This time, the string is 5 characters long (plus the terminating
’\0’ character).
• At this point, the function reads again element 4 of b, as before (line 18).
• The last two statements of m (lines 20–21) store a character into b at two
different positions determined by performing a simple calculation on i.
1. Position 2*i, that is, 8.
2. Position 3*i, that is, 12.
When invoked on the fragment of code just illustrated, the tool produces the fol-
lowing output.
splint +checks +bounds +partial buffer_overflow_1.c
Splint 3.1.2 --- 19 Apr 2015
Let us now informally trace the procedure followed by SPLINT to emit the two
warnings that appear in the output, and justify why they indeed indicate issues in the
code itself.
• In order to prove that the call to strcpy at line 10 is legal, the tool must
satisfy its precondition (16.2). After substituting argument names for
parameter names, and considering that maxRead of the literal string "abc"
is 3, the precondition becomes maxSet(b) >= 3. Thanks to (16.1), this
reduces to
9 >= 3 (16.5)
that is obviously true. From this, the tool concludes that the call to strcpy
is legal and adds
maxRead(b) == 3 (16.6)
to the list of postconditions. This postcondition is derived from (16.3)
by back-substituting parameter and argument names, and replacing
maxRead(s2) with its known value.
• In order to verify that the assignment at line 12 is valid, the tool consults the
current postconditions concerning maxRead(b) to prove the precondition
of the assignment, that is,
maxRead(b) >= 2 (16.7)
Due to (16.6) this is true because 3 >= 2 and hence, the assignment is
considered to be legal. The assignment statement does not generate any
further postcondition.
• The tool follows the same procedure to analyze the next operation on b, at
line 15. In this case, the precondition to be satisfied is
maxRead(b) >= i (16.8)
By inspecting the code, the tool concludes that the current value of i is
i == 4 due to the assignment at line 14. Moreover, (16.6) is still true. By
substitution, the precondition becomes
3 >= 4 (16.9)
and it is therefore not satisfied. For this reason, SPLINT emits the first warn-
ing shown in the above listing.
• When analyzing line 17, the tool proceeds exactly like it did for the previous
call to strcpy. The only difference is that, due to the different length of the
literal passed to strcpy as s2 (5 characters instead of 3), the preconditions
and postconditions are modified accordingly.
The conclusion is that the function call is legal and establishes the postcon-
dition
maxRead(b) == 5 . (16.10)
• The statement at line 18 is syntactically the same as the one at line 15,
which was flagged as illegal. However, the statement at line 18 shall be
analyzed according to the current set of postconditions, which leads to a
different result.
In particular, the most recent postcondition concerning maxRead(b) is
now (16.10) instead of (16.6) and precondition (16.8) is satisfied in this
case.
• The two assignments to elements of b at lines 20 and 21 are analyzed
in a similar way. In this case, the preconditions to be satisfied concern
maxSet(b) rather than maxRead(b), namely
maxSet(b) >= 2*i (16.11)
maxSet(b) >= 3*i (16.12)
respectively.
The value of maxSet(b) can readily be assessed from the postcondition
established by the definition of b (16.1). On the other hand, it is possible to
calculate the values of 2*i and 3*i from the current value of i, obtaining
the values 8 and 12, respectively. By substituting these values back into
(16.11) and (16.12), the result is
9 >= 8 (16.13)
9 >= 12 (16.14)
Of these preconditions, the second one is clearly not satisfied and triggers
a warning from the tool.
As may already be clear from the example just illustrated, the checks SPLINT
performs are extremely strict and may lead to a high number of false warnings unless
all the code is thoroughly and properly annotated as required by the tool.
For this reason, the tool offers the ability to classify warnings into two different
categories—according to a heuristic that tries to determine their likelihood of indi-
cating a real flaw in the program—and selectively suppress them. The two categories
correspond to different conclusions that can be drawn about constraints.
Table 16.7
SPLINT Flags Related to Buffer Overflow Checks
boundsread When set, the tool produces a warning for any attempt to read
from a buffer at a position that may lie beyond the bounds of
allocated storage, because the buffer access constraints deter-
mined by the tool cannot be proven true.
boundswrite When set, the tool produces a warning for any attempt to write
into a buffer at a position that may lie beyond the bounds of
allocated storage, because the buffer access constraints deter-
mined by the tool cannot be proven true.
likelyboundsread When set, the tool produces a warning for any attempt to read
from a buffer at a position that is likely to lie beyond the bounds
of allocated storage, because it induces a numerical inconsis-
tency in the buffer access constraints determined by the tool.
likelyboundswrite When set, the tool produces a warning for any attempt to write
into a buffer at a position that is likely to lie beyond the bounds
of allocated storage, because it induces a numerical inconsis-
tency in the buffer access constraints determined by the tool.
Besides explicit arguments and return values, the interaction between a caller and a called
function may take place through global variables and, more in general, any storage
reachable from both parties.
In the C language, the interface to a function is defined by its function prototype.
In early versions of the language the prototype was extremely simple and only de-
scribed the type of the function return value, but it was later extended to also define
the number and type of its arguments.
By means of annotations, SPLINT provides programmers with a way to express further
information about the following aspects of a function interface, which are not covered
by its prototype but are still very important.
• Which parts of the storage accessible through its arguments a function may
use or modify, and how.
• Whether or not the function has a persistent internal state and modifies it.
• Which global variables the function modifies.
• Whether or not the function indirectly modifies the system state as a whole.
• Any assumptions made by the function about the state of its arguments and
global variables when it is called.
• Which predicates—concerning arguments, return value, and global
variables—the function makes true at the caller site when it returns.
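As an example, consider a function f that receives a pointer p to a two-field structure and carries a modifies clause naming only one of the fields. The following fragment is a sketch in the spirit of the code analyzed below; the declarations and the body are assumptions.

struct a {
    int x;
    int y;
};

/* The clause documents that f may change only field y of the structure
   pointed by p; the assignment to p->x contradicts it.                  */
void f(struct a *p) /*@modifies p->y@*/
{
    p->y = 1;
    p->x = 2;
}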
Such an annotation means that f may modify field y of the structure passed as argument
and pointed by p, but it cannot modify any other field, for instance x. In fact, when run on the code
fragment above, SPLINT produces the following warning:
splint +checks -exportheader int_checks_1.c
Splint 3.1.2 --- 19 Apr 2015
Table 16.9
SPLINT Annotations for Function Interfaces
Global variables
globals This annotation is attached to a function and contains the list of
global variables that the function may use.
checkedstrict This one and the following three annotations are attached to
a global variable, in order to control how strictly SPLINT shall
check accesses to it. This annotation selects the strictest checks,
so that all undocumented accesses to the global variable are
flagged.
checked Any undocumented accesses to the variable are reported only
for functions annotated with a globals or modifies list.
checkmod Undocumented modifications of the global variable trigger a
warning, but undocumented uses do not.
unchecked This annotation disables all checks on the global variable it
refers to.
It should be noted that this mechanism is more powerful than the const
qualifier foreseen by the C language. In fact, it is possible to declare
void f(const struct a *p) to indicate that f does not modify the structure
pointed by p at all, but not on a field-by-field basis.
In the same way, it is possible to attach the const qualifier to the struct type
definition as a whole, or to individual fields, but then it applies to all func-
tions using the structure. Hence, it is not possible to override it on a function-by-
function basis.
More specific annotations, listed in Table 16.9 but not discussed here due to lack
of space, allow programmers to specify how the function handles its arguments and
the return value at an even finer level of detail, also in relation to memory man-
agement, discussed in Section 16.3. They are especially useful when the function
receives references to variables by means of pointers.
As outlined previously, an important aspect of a function interface, which is not
mentioned at all in its standard C-language prototype, is whether or not the function
modifies other parts of the program state besides those reachable through its argu-
ments. To this purpose, SPLINT supports three special names, to be used with the
modifies annotation.
• The name internalState specifies that a function has got a hidden in-
ternal state and uses/modifies it. For instance, it defines some static vari-
ables, whose value persists from one function call to another and may affect
its future computations.
In other words, since the result of the function depends not only on its
arguments, but also on its internal state, the results of two calls to the same
function may differ even though the function was given exactly the same
arguments.
This item of information is therefore important when SPLINT checks
whether or not the result of an expression may depend on the order of eval-
uation of its sub-expressions, and also when it tries to prove that a certain
statement is side-effect free.
• The name fileSystem indicates that a function modifies the file system
state or, more in general, changes the system state. For instance, a function
may create a new file, write new contents into a file, or delete a file. The
consequences on the analysis performed by SPLINT are the same as for
internalState.
• A function annotated with modifies nothing is completely side-effect
free, and hence, it can affect the caller’s computation only through its
return value.
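A sketch of how the three special names might be used in practice follows; the function names, bodies, and the exact placement of the clauses are illustrative assumptions.

#include <stdio.h>

/* Depends on, and updates, a hidden persistent counter. */
int next_id(void) /*@modifies internalState@*/
{
    static int counter = 0;
    return ++counter;
}

/* Writes a message to the standard output, hence changes the system state. */
void log_event(const char *msg) /*@modifies fileSystem@*/
{
    (void)printf("%s\n", msg);
}

/* Completely side-effect free. */
int square(int v) /*@modifies nothing@*/
{
    return v * v;
}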
Last, but not least, a sometimes unintended interaction between a function and
the outside world is through global variables. Checking global variables for correct,
intended use involves adding two kinds of annotations to the code: annotations
attached to the definition of each global variable, which specify how strictly the tool
shall check accesses to it, and annotations attached to function definitions, namely
globals and modifies lists, which document which global variables the function is
allowed to use and modify.
In order to show how global variable checks work, let us consider the following
fragment of code.
1 /*@ checkedstrict @*/ int a;
2 /*@ checkedstrict @*/ int b;
3 /*@ unchecked @*/ int c;
4
For the sake of the example, the code fragment defines three global variables
(a, b, and c). According to their annotations, a and b shall be checked in the strictest
way, whereas c shall not be checked at all. Moreover, it also defines three functions.
• bad_use makes use of all three global variables, because it calculates their
sum and returns it to the caller. On the other hand, its annotation states that
it only uses global variable a.
• bad_mod copies the value of b into a and c. The annotation states that it
uses variables a and b.
• good_mod is identical to bad_mod except for the annotation. In this case,
the annotation states that the function uses b and modifies a.
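A sketch of the three functions, consistent with the description above, is the following; the bodies and the exact placement of the annotations are assumptions. The variables a, b, and c are the global variables defined at lines 1 through 3 of the fragment.

int bad_use(void) /*@globals a@*/
{
    return a + b + c;          /* also uses b, which is not listed: flagged     */
}

void bad_mod(void) /*@globals a, b@*/
{
    a = b;                     /* modifies a without a modifies clause: flagged */
    c = b;                     /* c is unchecked, so no warning                 */
}

void good_mod(void) /*@globals a, b@*/ /*@modifies a@*/
{
    a = b;
    c = b;
}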
As can be seen from the listing that follows, SPLINT identifies and reports two issues
with the code just discussed. The tool was invoked with the -exportheader
and -exportlocal command-line options to avoid spurious warnings related to the
fact that the fragment of code, being just an example, defines globally visible func-
tions without declaring them in an appropriate header and defines global variables
that are not used elsewhere.
splint +checks -exportheader -exportlocal int_checks_2.c
Splint 3.1.2 --- 19 Apr 2015
1. The first message remarks that, contrary to what has been stated in its annotation,
function bad_use makes use of global variable b, which is subject to strict access
checks according to the annotation checkedstrict.
2. The second message highlights that function bad_mod modifies the value of
global variable a, whereas the function annotation states that it should only use
it. In fact, a is mentioned in the globals annotation, but not in the modifies
annotation. Actually, there is no modifies annotation at all for bad_mod.
On the other hand, no warnings are raised for the good_mod function, because
its annotation correctly states that it modifies a.
Furthermore, no warnings are given concerning variable c, even though all func-
tions use or modify it, because it has been annotated as unchecked.
16.6 SUMMARY
Static code analysis tools are nowadays able to perform a wide variety of checks,
ranging from basic ones (like the ones discussed in Section 16.2) to very complex
ones (for instance, buffer overflow checks described in Section 16.4).
In addition, as shown in Section 16.5, when the source code is properly anno-
tated, they can also verify that different source code modules (possibly written by
different groups of programmers at different times) interface in a consistent way.
This aspect is becoming more and more important every day, as it becomes more
and more common to build new applications (especially when they are based on
open-source software) starting from existing components that are migrated from one
project to another.
Static code analysis is therefore a useful technique to improve source code quality
and reliability, and to alleviate security concerns. Another important feature of this kind
of analysis is that it does not require any kind of runtime support and—perhaps
even more importantly for embedded software developers—it does not introduce any
overhead on program execution on the target system.
Last, but not least, the availability of free, open-source tools, like the one dis-
cussed in this chapter, further encourages the adoption of static code analysis even
when developing very low-cost applications, in which acquiring a commercial tool
may adversely affect project budget.
References
1. Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Prin-
ciples, Techniques, and Tools. Pearson Education Ltd., Harlow, England, 2nd edition,
September 2006.
2. James H. Anderson and Mark Moir. Universal constructions for large objects. IEEE
Transactions on Parallel and Distributed Systems, 10(12):1317–1332, 1999.
3. James H. Anderson and Srikanth Ramamurthy. A framework for implementing objects
and scheduling tasks in lock-free real-time systems. In Proc. 17th IEEE Real-Time
Systems Symposium, pages 94–105, December 1996.
4. James H. Anderson, Srikanth Ramamurthy, and Kevin Jeffay. Real-time computing with
lock-free shared objects. In Proc. 16th IEEE Real-Time Systems Symposium, pages 28–
37, December 1995.
5. ANSI/INCITS. ANSI/INCITS 408-2005 – Information Technology – SCSI Primary
Commands – 3 (SPC–3), 2005.
6. ARM Ltd. ARM System Memory Management Unit Architecture Specification — SMMU
architecture version 2.0, January 2015. IHI 0062D.a.
7. ARM Ltd. ARM PrimeCellTM Vectored Interrupt Controller (PL192) — Technical Ref-
erence Manual, December 2002. DDI 0273A.
8. ARM Ltd. ARMv7-M Architecture Reference Manual, February 2010. DDI 0403D.
9. ARM Ltd. Cortex™-M3 Technical Reference Manual, rev. r2p0, February 2010. DDI
0337H.
10. ARM Ltd. Cortex™-M4 Devices, Generic User Guide, December 2010. DUI 0553A.
11. Atmel Corp. AT 91 ARM Thumb-based Microcontrollers, 2008.
12. Atmel Corp. 8-bit AVR Microcontroller with 128K bytes In-System Programmable Flash
– ATmega 1284P, 2009.
13. Neil C. Audsley, Alan Burns, Mike Richardson, and Andy J. Wellings. Hard real-time
scheduling: The deadline monotonic approach. In Proc. 8th IEEE Workshop on Real-
Time Operating Systems and Software, pages 127–132, 1991.
14. Neil C. Audsley, Alan Burns, and Andy J. Wellings. Deadline monotonic scheduling
theory and application. Control Engineering Practice, 1(1):71–78, 1993.
15. Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. The MIT Press,
Cambridge, MA, 2008.
16. Theodore P. Baker and Alan Shaw. The cyclic executive model and Ada. In Proc. IEEE
Real-Time Systems Symposium, pages 120–129, December 1988.
17. Richard Barry. The FreeRTOS.org project. Available online, at http://www.
freertos.org/.
18. Richard Barry. Using the FreeRTOS Real Time Kernel – Standard Edition. Lulu Press,
Raleigh, North Carolina, 1st edition, 2010.
19. Richard Barry. The FreeRTOS™ Reference Manual. Real Time Engineers Ltd., 2011.
20. Mordechai Ben-Ari. Principles of the Spin Model Checker. Springer-Verlag, London,
2008.
21. Johan Bengtsson, Kim Larsen, Fredrik Larsson, Paul Pettersson, and Wang Yi. Uppaal –
a tool suite for automatic verification of real-time systems. In Hybrid Systems III, LNCS
1066, pages 232–243. Springer-Verlag, 1995.
22. Dragan Bosnacki and Dennis Dams. Integrating real time into Spin: A prototype imple-
mentation. In Proc. IFIP TC6 WG6.1 Joint International Conference on Formal Descrip-
tion Techniques for Distributed Systems and Communication Protocols and Protocol
Specification, Testing and Verification, pages 423–438, 1998.
23. Dragan Bosnacki, Dennis Dams, Leszek Holenderski, and Natalia Sidorova. Model
checking SDL with Spin. In Proc. 6th International Conference on Tools and Algorithms
for Construction and Analysis of Systems, pages 363–377, 2000.
24. Robert Braden, editor. Requirements for Internet Hosts — Communication Layers, RFC
1122. Internet Engineering Task Force, October 1989.
25. Alan Burns and Andy Wellings. Real-Time Systems and Programming Languages. Pear-
son Education, Harlow, England, 3rd edition, 2001.
26. Giorgio C. Buttazzo. Hard Real-Time Computing Systems. Predictable Scheduling Al-
gorithms and Applications. Springer-Verlag, Santa Clara, CA, 2nd edition, 2005.
27. William Cary Huffman and Vera Pless. Fundamentals of Error-Correcting Codes. Cam-
bridge University Press, February 2010.
28. Per Cederqvist et al. Version Management with CVS, for CVS 1.12.13. Free Software
Foundation, Inc., 2005.
29. Gianluca Cena, Marco Cereia, Ivan Cibrario Bertolotti, and Stefano Scanzio. A Modbus
extension for inexpensive distributed embedded systems. In Proc. 8th IEEE Interna-
tional Workshop on Factory Communication Systems, pages 251–260, May 2010.
30. Gianluca Cena, Ranieri Cesarato, and Ivan Cibrario Bertolotti. An RTOS-based design
for inexpensive distributed embedded system. In Proc. IEEE International Symposium
on Industrial Electronics, pages 1716–1721, July 2010.
31. Gianluca Cena, Ivan Cibrario Bertolotti, Tingting Hu, and Adriano Valenzano. Fixed-
length payload encoding for low-jitter Controller Area Network communication. IEEE
Transactions on Industrial Informatics, 9(4):2155–2164, 2013.
32. Gianluca Cena, Ivan Cibrario Bertolotti, Tingting Hu, and Adriano Valenzano. A mech-
anism to prevent stuff bits in CAN for achieving jitterless communication. IEEE Trans-
actions on Industrial Informatics, 11(1):83–93, February 2015.
33. Steve Chamberlain and Cygnus Support. Libbfd — The Binary File Descriptor Library.
Free Software Foundation, Inc., 2008.
34. Steve Chamberlain and Ian Lance Taylor. The GNU linker ld (GNU binutils) Version
2.20. Free Software Foundation, Inc., 2009.
35. ChaN. FatFs Generic FAT File System Module, 2012. Available online, at http:
//elm-chan.org/fsw/ff/00index_e.html.
36. Joachim Charzinski. Performance of the error detection mechanisms in CAN. In Proc.
1st International CAN Conference, pages 20–29, September 1994.
37. Ben Chelf and Christof Ebert. Ensuring the integrity of embedded software with static
code analysis. IEEE Software, 26(3):96–99, 2009.
38. Ivan Cibrario Bertolotti and Tingting Hu. Real-time performance of an open-source pro-
tocol stack for low-cost, embedded systems. In Proc. 16th IEEE International Confer-
ence on Emerging Technologies and Factory Automation, pages 1–8, September 2011.
39. Ivan Cibrario Bertolotti and Gabriele Manduchi. Real-Time Embedded Systems: Open-
Source Operating Systems Perspective. CRC Press, Taylor & Francis Group, Boca Ra-
ton, FL, 1st edition, January 2012.
40. Compaq Computer Corp., Hewlett-Packard Company, Intel Corp., Lucent Technologies
Inc., Microsoft Corp., NEC Corp., Koninklijke Philips Electronics N.V. Universal Serial
Bus Specification, 2000. Revision 2.0.
41. Compaq Computer Corp., Microsoft Corp., National Semiconductor Corp. OpenHCI
Open Host Controller Interface Specification for USB, 1999. Release 1.0a.
42. John L. Connell and Linda Isabell Shafer. Object-Oriented Rapid Prototyping. Prentice
Hall, Englewood Cliffs, NJ, October 1994.
43. Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth
Zadeck. Efficiently computing static single assignment form and the control dependence
graph. ACM Transactions on Programming Languages and Systems, 13(4):451–490,
October 1991.
44. Alan M. Davis. Software Requirements: Analysis and Specification. Prentice Hall,
Englewood Cliffs, NJ, December 1989.
45. Stephen Deering and Robert Hinden. Internet Protocol, Version 6 (IPv6) Specification,
RFC 2460. The Internet Society, December 1998.
46. DENX Software Engineering GmbH. The DENX U-Boot and Linux Guide (DULG)
for canyonlands, March 2015. The documentation is also available online, at http:
//www.denx.de/.
47. Raymond Devillers and Joël Goossens. Liu and Layland’s schedulability test revisited.
Information Processing Letters, 73(5-6):157–161, 2000.
48. Edsger W. Dijkstra. Cooperating sequential processes. Technical Report EWD-123,
Eindhoven University of Technology, 1965. Published as [49].
49. Edsger W. Dijkstra. Cooperating sequential processes. In F. Genuys, editor, Program-
ming Languages: NATO Advanced Study Institute, pages 43–112. Academic Press, Vil-
lard de Lans, France, 1968.
50. Joseph D. Dumas II. Computer Architecture: Fundamentals and Principles of Computer
Design. Taylor & Francis Group, Boca Raton, FL, November 2005.
51. Adam Dunkels. lwIP—a lightweight TCP/IP stack. Available online, at http:
//savannah.nongnu.org/projects/lwip/.
52. Adam Dunkels. Design and implementation of the lwIP TCP/IP stack. Available online,
at http://www.sics.se/˜adam/lwip/doc/lwip.pdf, 2001.
53. Adam Dunkels. Full TCP/IP for 8-bit architectures. In Proc. 1st International Confer-
ence on Mobile Applications, Systems and Services, pages 1–14, 2003.
54. Eclipse Foundation, Inc. Eclipse Luna (4.4) Documentation, 2015. Full documentation
available in HTML format at http://www.eclipse.org/.
55. Dean Elsner, Jay Fenlason, and friends. Using as — The GNU Assembler (GNU binutils)
Version 2.20. Free Software Foundation, Inc., 2009.
56. Embedded Artists AB. LPC2468 OEM Board User’s Guide, EA2-USG-0702 v1.2
Rev. C, 2008. Available online, at http://www.embeddedartists.com/.
57. Embedded Solutions. Modbus master. Available online, at http://www.
embedded-solutions.at/.
58. David Evans. Static detection of dynamic memory errors. In Proc. ACM SIGPLAN
Conference on Programming Language Design and Implementation, pages 44–53, New
York, NY, USA, 1996. ACM.
59. David Evans and David Larochelle. Improving security using extensible lightweight
static analysis. IEEE Software, 19(1):42–51, January 2002.
60. David Evans and David Larochelle. Splint Manual, Version 3.1.1-1. Secure Program-
ming Group, University of Virginia, Department of Computer Science, June 2003.
61. Max Felser. Real time Ethernet: standardization and implementations. In Proc. IEEE
International Symposium on Industrial Electronics, pages 3766–3771, 2010.
62. Free Software Foundation, Inc. GCC, the GNU compiler collection, 2012. Available
online, at http://gcc.gnu.org/.
63. Free Software Foundation, Inc. GDB, the GNU project debugger, 2012. Available
online, at http://www.gnu.org/software/gdb/.
64. Free Software Foundation, Inc. GNU binutils, 2012. Available online, at http://
www.gnu.org/software/binutils/.
65. Free Software Foundation, Inc. GNU Make, 2014. Available online, at http://www.
gnu.org/software/make/.
66. Freescale Semiconductor, Inc. ColdFire® Family Programmer's Reference Manual, March 2005.
67. Philippe Gerum. Xenomai—Implementing a RTOS emulation framework on GNU/Linux,
2004. Available online, at http://www.xenomai.org/.
68. Ian Graham. Requirements Engineering and Rapid Development: An object-oriented
approach. Addison-Wesley Professional, June 1999.
69. Laurie J. Hendren, Chris Donawa, Maryam Emami, Guang R. Gao, Justiani, and Bhama
Sridharan. Designing the McCAT compiler based on a family of structured intermediate
representations. In Proc. 5th International Workshop on Languages and Compilers for
Parallel Computing, number 757 in LNCS, pages 406–420. Springer-Verlag, 1992.
70. Maurice P. Herlihy. A methodology for implementing highly concurrent data objects.
ACM Trans. on Programming Languages and Systems, 15(5):745–770, November 1993.
71. Maurice P. Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan
Kaufmann, June 2012.
72. Maurice P. Herlihy and Jeannette M. Wing. Axioms for concurrent objects. In Proc.
14th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages,
pages 13–26, New York, 1987.
73. Gerard J. Holzmann. Verifying Multi-threaded Software with Spin. Available online, at
http://spinroot.com/.
74. Gerard J. Holzmann. The model checker SPIN. IEEE Transactions on Software Engi-
neering, 23:279–295, 1997.
75. Gerard J. Holzmann. An analysis of bitstate hashing. Formal Methods in System Design,
13(3):289–307, November 1998.
76. Gerard J. Holzmann. The Spin Model Checker: Primer and Reference Manual. Pearson
Education, Boston, MA, 2003.
77. Gerard J. Holzmann and Dragan Bošnački. The design of a multicore extension of the
SPIN model checker. IEEE Transactions on Software Engineering, 33(10):659–674,
2007.
78. Gerard J. Holzmann and Doron Peled. An improvement in formal verification. In Proc.
7th IFIP WG6.1 International Conference on Formal Description Techniques, pages
197–211, Berne, Switzerland, 1994.
79. IEEE Std 1003.13™-2003, IEEE Standard for Information Technology—Standardized Application Environment Profile (AEP)—POSIX® Realtime and Embedded Application Support. IEEE, 2003.
80. IEEE Std 1149.1-2013 — IEEE Standard for Test Access Port and Boundary-Scan Ar-
chitecture. IEEE Computer Society, May 2013.
81. Intel Corp. Universal Host Controller Interface (UHCI) Design Guide, 1996. Revision
1.1.
82. Intel Corp. Enhanced Host Controller Interface Specification for Universal Serial Bus,
2002. Revision 1.0.
83. Intel Corp. Intel® 64 and IA-32 Architectures Software Developer's Manual, 2007.
84. Intel Corp. eXtensible Host Controller Interface for Universal Serial Bus, 2010. Revi-
sion 1.0.
107. David Larochelle and David Evans. Statically detecting likely buffer overflow vulnera-
bilities. In Proc. 10th USENIX Security Symposium, pages 1–13, Berkeley, CA, USA,
2001. USENIX Association.
108. John A. N. Lee. Howard Aiken’s third machine: the Howard Mark III calculator
or Aiken-Dahlgren electronic calculator. IEEE Annals of the History of Computing,
22(1):62–81, January 2000.
109. Shu Lin and Daniel J. Costello. Error Control Coding. Prentice Hall, 2nd edition, June
2004.
110. Chung L. Liu and James W. Layland. Scheduling algorithms for multiprogramming in
a hard-real-time environment. Journal of the ACM, 20(1):46–61, 1973.
111. Jane W. S. Liu. Real-Time Systems. Prentice Hall, Upper Saddle River, NJ, 2000.
112. Robert Love. Kernel korner: Kernel locking techniques. Linux Journal, August 2002.
113. Zohar Manna and Amir Pnueli. The Temporal Logic of Reactive and Concurrent Systems
– Specification. Springer-Verlag, New York, NY, 1992.
114. Theresa C. Maxino and Philip J. Koopman. The effectiveness of checksums for em-
bedded control networks. IEEE Transactions on Dependable and Secure Computing,
6(1):59–72, January 2009.
115. Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, and John S. Quarterman. The
Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, Reading,
MA, 1996.
116. Mentor Graphics Corp. Sourcery CodeBench, 2013. Available online, at http://www.
mentor.com/embedded-software/codesourcery/.
117. C. Michael Pilato, Ben Collins-Sussman, and Brian W. Fitzpatrick. Version Control with
Subversion. O’Reilly Media, Sebastopol, CA, 2nd edition, September 2008.
118. Microsoft Corp. Microsoft Extensible Firmware Initiative, FAT32 File System Specifi-
cation – FAT: General Overview of On-Disk Format, 2000. Version 1.03.
119. Modbus-IDA. MODBUS Application Protocol Specification V1.1b. Modbus Organiza-
tion, Inc., 2006. Available online, at http://www.modbus-ida.org/.
120. Modbus-IDA. MODBUS Messaging on TCP/IP Implementation Guide V1.0b. Modbus
Organization, Inc., 2006. Available online, at http://www.modbus-ida.org/.
121. Modbus-IDA. MODBUS over Serial Line Specification and Implementation Guide
V1.02. Modbus Organization, Inc., 2006. Available online, at http://www.
modbus-ida.org/.
122. Bryon Moyer. Real World Multicore Embedded Systems. Newnes, May 2013.
123. Netlabs. FAT32, 2012. Available online, at http://svn.netlabs.org/fat32.
124. NXP B.V. AN10703 — NXP USB host lite, 2008. Rev. 01.
125. NXP B.V. LPC2468 Product data sheet, rev. 4, October 2008. Available online, at
http://www.nxp.com/.
126. NXP B.V. LPC24xx User manual, UM10237 rev. 2, December 2008. Available online,
at http://www.nxp.com/.
127. NXP B.V. LPC1769/68/67/66/65/64/63 Product data sheet, rev. 6, 2010. Available
online, at http://www.nxp.com/.
128. NXP B.V. LPC17xx User manual, UM10360 rev. 2, August 2010. Available online, at
http://www.nxp.com/.
129. NXP B.V. LPCOpen Software Development Platform for NXP LPC Microcontrollers,
June 2012. Available online, at http://www.lpcware.com/lpcopen.
130. On-line Applications Research Corp. RTEMS C User’s Guide, December 2011. Avail-
able online, at http://www.rtems.com/.
References 513
131. On-line Applications Research Corp. RTEMS Documentation, December 2011. Avail-
able online, at http://www.rtems.com/.
132. On-line Applications Research Corp. RTEMS ITRON 3.0 User’s Guide, December 2011.
Available online, at http://www.rtems.com/.
133. On-line Applications Research Corp. RTEMS POSIX API User’s Guide, December
2011. Available online, at http://www.rtems.com/.
134. The OpenOCD Project. Open On-Chip Debugger: OpenOCD User’s Guide, March
2015.
135. Elliott I. Organick. The Multics System: An Examination of Its Structure. MIT Press,
Cambridge, MA, USA, 1972.
136. OSEK/VDX. OSEK/VDX Operating System Specification. Available online, at http:
//www.osek-vdx.org/.
137. Sergio Pérez, Joan Vila, Jose A. Alegre, and Josep V. Sala. A CORBA based architecture
for distributed embedded systems using the RTLinux-GPL platform. In Proc. IEEE
International Symposium on Object Oriented Real-Time Distributed Computing, pages
285–288, 2004.
138. Roland H. Pesch, Jeffrey M. Osier, and Cygnus Support. The GNU Binary Utilities
(GNU binutils) Version 2.20. Free Software Foundation, Inc., October 2009.
139. Amir Pnueli. The temporal logic of programs. In Proc. 18th Annual Symposium on
Foundations of Computer Science, pages 46–57, November 1977.
140. Jon Postel. User Datagram Protocol, RFC 768. Information Sciences Institute (ISI),
August 1980.
141. Jon Postel. Internet Control Message Protocol—DARPA Internet Program Protocol
Specification, RFC 792. Information Sciences Institute (ISI), September 1981.
142. Jon Postel, editor. Internet Protocol—DARPA Internet Program Protocol Specification,
RFC 791. USC/Information Sciences Institute (ISI), September 1981.
143. Jon Postel, editor. Transmission Control Protocol—DARPA Internet Program Protocol
Specification, RFC 793. USC/Information Sciences Institute (ISI), September 1981.
144. Ragunathan Rajkumar, L. Sha, and John P. Lehoczky. Real-time synchronization pro-
tocols for multiprocessors. In Proc. 9th IEEE Real-Time Systems Symposium, pages
259–269, December 1988.
145. Red Hat, Inc. The Red Hat Newlib C Library, 2012. Available online, at http://
sourceware.org/newlib/.
146. Red Hat Inc. eCos User Guide, 2013. Available online, at http://ecos.
sourceware.org/.
147. W. Richard Stevens and Gary R. Wright. TCP/IP Illustrated (3 Volume Set). Addison-
Wesley Professional, Boston, MA, USA, November 2001.
148. Ken Sakamura. ITRON3.0: An Open and Portable Real-Time Operating System for
Embedded Systems. IEEE Computer Society Press, Los Alamitos, CA, April 1998.
149. Dilip V. Sarwate. Computation of cyclic redundancy checks via table look-up. Commu-
nications of the ACM, 31(8):1008–1013, August 1988.
150. Sharad C. Seth and R. Muralidhar. Analysis and design of robust data structures. In Proc.
15th Annual International Symposium on Fault-Tolerant Computing (FTCS), pages 14–
19, June 1985.
151. Julian Seward, Nicholas Nethercote, Josef Weidendorfer, and The Valgrind Develop-
ment Team. Valgrind 3.3 — Advanced Debugging and Profiling for Gnu/Linux Applica-
tions. Network Theory Ltd., Bristol, United Kingdom, March 2008.
514 References
152. Lui Sha, Tarek Abdelzaher, Karl-Erik Årzén, Anton Cervin, Theodore P. Baker, Alan
Burns, Giorgio C. Buttazzo, Marco Caccamo, John Lehoczky, and Aloysius K. Mok.
Real time scheduling theory: A historical perspective. Real-Time Systems, 28(2):101–
155, 2004.
153. Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Priority inheritance protocols: an
approach to real-time synchronization. IEEE Transactions on Computers, 39(9):1175–
1185, September 1990.
154. Sajjan G. Shiva. Computer Organization, Design, and Architecture. Taylor & Francis
Group, Boca Raton, FL, 5th edition, December 2013.
155. Muzaffer A. Siddiqi. Dynamic RAM: Technology Advancements. Taylor & Francis
Group, Boca Raton, FL, December 2012.
156. Richard M. Stallman et al. GNU Emacs Manual. Free Software Foundation, Inc., 17th
edition, 2014.
157. Richard M. Stallman, Roland McGrath, and Paul D. Smith. GNU Make — A Program
for Directing Recompilation, for GNU make Version 4.0. Free Software Foundation,
Inc., October 2013.
158. Richard M. Stallman, Roland Pesch, Stan Shebs, et al. Debugging with GDB — The
GNU Source-Level Debugger. Free Software Foundation, Inc., 10th edition, 2015.
159. Richard M. Stallman and the GCC Developer Community. GNU Compiler Collection
Internals, for GCC Version 4.3.4. Free Software Foundation, Inc., 2007.
160. Richard M. Stallman and the GCC Developer Community. Using the GNU Compiler
Collection, for GCC Version 4.3.4. Free Software Foundation, Inc., 2008.
161. Andrew S. Tanenbaum and Todd Austin. Structured Computer Organization. Pearson
Education Ltd., 6th edition, August 2012.
162. David J. Taylor, David E. Morgan, and James P. Black. Redundancy in data structures:
Improving software fault tolerance. IEEE Transactions on Software Engineering, SE-
6(6):585–594, November 1980.
163. TIA. Electrical Characteristics of Generators and Receivers for Use in Balanced Digital
Multipoint Systems (ANSI/TIA/EIA-485-A-98) (R2003). Telecommunications Industry
Association, 1998.
164. USB Implementers Forum. Universal Serial Bus Mass Storage Class UFI Command
Specification, 1998. Revision 1.0.
165. USB Implementers Forum. Universal Serial Bus Mass Storage Class Bulk-Only Trans-
port, 1999. Revision 1.0.
166. USB Implementers Forum. Universal Serial Bus Mass Storage Class Specification
Overview, 2008. Revision 1.3.
167. Peter van der Linden. Expert C Programming: Deep C Secrets. SunSoft Press, a Prentice
Hall Title, Mountain View, CA, June 1994.
168. Tom Van Vleck. Structure marking. http://www.multicians.org/thvv/
marking.html, 2015 (retrieved on April 14, 2015).
169. William Von Hagen. The definitive guide to GCC. Apress, Berkeley, CA, 2006.
170. Christian Walter. FreeMODBUS - a Modbus ASCII/RTU and TCP implementation,
2007. Available online, at http://freemodbus.berlios.de/.
171. Joseph Yiu. The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, Third
Edition. Newnes, Newton, MA, USA, 3rd edition, 2013.
172. Lennart Ysboodt and Michael De Nil. EFSL – Embedded Filesystems Library, 2012.
Available online, at http://sourceforge.net/projects/efsl/.