
Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm


Alessandro Biondi and Marco Di Natale
Scuola Superiore Sant’Anna, Pisa, Italy
E-mail: {alessandro.biondi, marco.dinatale}@sssup.it

Abstract—Next generation automotive applications require support for safe, predictable, and deterministic execution. The Logical Execution Time (LET) model has been introduced to improve the predictability and correctness of time-critical applications. The advent of multicore architectures, together with the need to ensure time predictability despite the complex memory hierarchy and the hardware resources shared by the cores, is an additional motivation for the use of the LET paradigm in conjunction with a suitable scheduling and memory access model. In this paper, we show how an implementation of the LET model on actual multicore platforms for automotive systems brings the potential to improve time determinism at the price of a modicum of run-time overhead. Multiple implementation options are discussed using the automotive AUTOSAR model and operating system standard, and a realistic application defined by Bosch for the 2017 WATERS challenge. Experimental data of executions on the Infineon Aurix platform show the feasibility of the proposed approach. The paper also provides a discussion of further implementation optimizations and other issues related to the general problem of memory-aware analysis of automotive applications on multicores.

I. INTRODUCTION

The introduction of safety-critical functions in automotive systems, together with the advent of multicore platforms, brings the need to rethink the development and execution paradigms for embedded functionality. Developers need high levels of predictability, testability, and ultimately determinism in the execution of their code. The LET model was introduced as part of the GIOTTO framework [1] to eliminate output jitter and provide time determinism in the code implementation of controls. Recently, there has been a renewed interest in the LET execution paradigm by automotive electronics vendors, as witnessed by the recent WATERS challenges [2].

In essence, the LET delays the program output of a task (or any function executed inside the task) to the end of the task period, trading delay for output jitter. The LET model is also characterized by an execution model of functional units with execution order (causality) constraints. The adoption of this model brings to the foreground not only the concept of timeliness, but also of causality, which is typical of synchronous languages and their implementations.

A key observation is that the LET execution model not only avoids output jitter but has the additional benefit of scheduling precisely in time the accesses to the communication variables. This can be extremely valuable in the multicore execution of tasks communicating remotely. Several techniques have been proposed to analyze the time performance of real-time tasks on multicores in the face of the sharing of memory and other hardware resources, including interconnects, arbiters, and I/O devices. Unfortunately, COTS multicore platforms are not designed with the aim of providing predictability, with the consequence that conventional analysis techniques can be at best pessimistic. The LET execution model can improve and restore predictability by controlling the time when memory resources are accessed.

For modern automotive systems, the AUTOSAR standard [3] provides a reference model for the development of applications, including a model of the functions and the tasks, a standard API for communication and execution, and a standard platform architecture. In AUTOSAR, the application consists of a set of communicating runnables grouped into tasks and statically allocated and scheduled on the system cores. The AUTOSAR model is based on the concept that the task model and the communication implementation are automatically generated by dedicated tools based on configuration information, the model of the application, and platform constraints. Such aspects are of paramount importance when designing a LET implementation for automotive applications.

This paper. In this paper, we draw analogies from all these concepts and propose an integrated approach to face the problem of implementing and scheduling task communications in multicores. We first provide a characterization of possible variants of the LET paradigm. Next, we discuss the implementation of the LET paradigm in agreement with the AUTOSAR model and API on a multicore platform that is very common in the automotive domain and representative of typical HW configurations: the Infineon Aurix microcontroller. Then, we provide an analysis of possible actual implementation options based on the ERIKA RTOS (compliant with the OSEK automotive standard and a de-facto representative of the typical behavior of AUTOSAR OS kernels). Finally, we provide our results on the evaluation of a code implementation of the application proposed by Bosch in the context of the WATERS 2017 challenge [2], executed with our LET implementation on the Aurix. Other related issues will be shortly discussed but are not the main concern of this work, including the schedulability analysis with explicit consideration of memory contention.

II. MODELING AND BACKGROUND

This paper considers applications composed of a set of n periodic tasks Γ = {τ1, . . . , τn}, each characterized by a worst-case execution time (WCET) Ci, a period Ti, and a relative deadline Di ≤ Ti. A bound on the response time of τi is denoted by Ri. The tasks execute upon a platform that comprises m processors P1, . . . , Pm, with local memories M1, . . . , Mm (one for each core), and a global memory Mm+1. The platform disposes of a crossbar switch that enables point-to-point communication between each core and each memory. Concurrent accesses to memories are arbitrated with a FIFO policy. Blocking memory access is assumed, i.e., no write or read buffers. Tasks are scheduled according to partitioned fixed-priority scheduling, and hp(i) denotes the set of tasks with higher priority than τi. Each task is statically allocated to a
given processor P(τi). The symbol Γx denotes the set of tasks allocated to the processor Px, while Γ(τi) denotes the set of tasks allocated to the same processor to which τi is allocated.

As a representative model for automotive AUTOSAR applications, each task τi is composed of an ordered sequence of ni runnables ρi,1, . . . , ρi,ni, each of which has WCET Ci,j. The WCET of a task τi is simply computed (as a first-order approximation) as the sum of the WCETs of its runnables.

Runnables communicate by means of labels: variables that can be read and written in an atomic manner. Each runnable ρi,j may read or write labels from a set L = {ℓ1, ℓ2, . . . , ℓq, . . .}. Each label ℓq is characterized by a size (an integer number of bytes no larger than the processor word) and an access cost λq. Li denotes the set of labels accessed by task τi, which can be constructed by looking at all the labels accessed by the runnables in τi. Each label is written by at most one task, while it can be read by multiple tasks. Labels that are written and read by tasks on different cores are mapped to the global memory, while all the other labels (including constant data) are mapped to the local memories (including their duplicates). The set of labels mapped in global memory and accessed by τi is denoted by LG_i ⊆ Li. Task τi accesses label ℓv at most Ni,v times. For a given pair of communicating tasks, a producer τP and a consumer τC, LW(τP, τC) denotes the set of labels that are written by τP and read by τC. LR(τC, τP) denotes the set of labels that are read by τC and written by τP. In order to compare the effects of different memory access policies, the WCETs do not include the execution cost to read and write the memory labels.

A. Logical Execution Time

The LET model we assume is inspired by the original proposal in [1]. However, in Section III we discuss other semantics and implementation options that are still inspired by the need for predictable and deterministic execution. In addition, we include a model for the implementation of the LET execution paradigm in the context of the AUTOSAR standard. For this option, we adopt from AUTOSAR the definitions and most of the semantics for the activation and communication of functions (runnables in AUTOSAR).

Functional and runnable model. In the original LET proposal, the execution of functions is characterized by a predictable and deterministic execution that preserves the order of execution of the functions and provides for deterministic communication and actuation times. In the LET model, the system is a network of functional blocks B = {b1, b2, . . . , bn}. Communicating blocks may be related by execution order constraints (expressions of causality). Each block is characterized by a periodic activation and execution. Each block can perform multiple reads and multiple writes. Communication may occur between blocks with different periods, and each writer can have multiple readers for the same piece of information. In the LET execution model, blocks are executed by tasks (or threads) and their input and output operations are grouped together at the task level.

Fig. 1. The LET model of execution. The short arrows upon the dots denote the input/output operations performed by the tasks.

The LET execution model can be summarized as depicted in Figure 1. In the figure, the output of task τ2 (denoted by the upward arrow at the end of the box representing the task execution) has a significant jitter. Because of variable interference from τ1, it occurs late in the first task instance and much earlier in the second. The LET solution is shown in the bottom timeline for task τ3 (taken as an example). The input of the task data is performed at the task activation, and the output is performed at the end of the task period. All task inputs are stored in local variables at the task activation. Similarly, all outputs need to be stored in local variables and are actually output only by the LET code at the end of the cycle. This requires allocating memory for local variables mirroring all input and output variables.

Several mechanisms can be used to enforce the LET synchronization of input and output operations. In essence, LET is a sample-and-hold mechanism with synchronized execution of the input and output parts.
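As a concrete illustration, this sample-and-hold behavior can be emulated with mirrored task-local copies that are latched at activation and published only at the end of the period. The sketch below is ours, not the paper's implementation, and all names (`LetTask`, `let_input`, `let_output`) are hypothetical:

```python
# Minimal sketch of LET sample-and-hold communication (hypothetical names,
# not the AUTOSAR API): inputs are latched at task activation, outputs are
# published only at the end of the period.

class LetTask:
    def __init__(self, reads, writes, body):
        self.reads = reads        # labels read by the task
        self.writes = writes      # labels written by the task
        self.body = body          # function: dict of inputs -> dict of outputs
        self.local_in = {}
        self.local_out = {}

    def let_input(self, globals_):    # executed at activation (LET input)
        self.local_in = {l: globals_[l] for l in self.reads}

    def run(self):                    # task body works on local copies only
        self.local_out = self.body(self.local_in)

    def let_output(self, globals_):   # executed at the end of the period
        for l in self.writes:
            globals_[l] = self.local_out[l]

# Example: the producer writes label 'ell'; a consumer activated in the same
# period still sees the previous value, because the output is delayed.
g = {'ell': 0}
prod = LetTask(reads=[], writes=['ell'], body=lambda ins: {'ell': 42})
cons = LetTask(reads=['ell'], writes=[], body=lambda ins: dict(ins))

prod.let_input(g); cons.let_input(g)   # inputs latched at activation
prod.run(); cons.run()                 # bodies execute under any interference
assert cons.local_in['ell'] == 0       # consumer saw the held value
prod.let_output(g)                     # publication at the period end
assert g['ell'] == 42
```

Because the consumer latches its input before the producer publishes, the value it observes is independent of when the two task bodies actually execute, which is exactly the jitter-free property described above.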
III. LET SEMANTIC OPTIONS

The following sections present and discuss three different LET semantics characterized by different timing properties and implementation concerns, using a simple running example.

Running example. Consider a producer task τP communicating with a consumer task τC by means of a shared label ℓ. Task τP acquires input data from a sensor, then it elaborates the data producing an update for ℓ. In a dual manner, τC reads data from ℓ, performs further elaboration on such data, and then performs a control output operation.

A. The GIOTTO LET semantic

In abstract terms, the LET paradigm assumes that the input/output operations happen in zero time. However, in a real implementation, the actual input/output operations must be scheduled for execution. The order with which they are executed influences the timing properties of the system, especially when flow preservation along communication chains is required. To ensure time determinism, the GIOTTO programming paradigm [4] specifies an order of execution for the writes and reads of blocks communicating using LET to enforce causality (see GIOTTO micro steps in [4]).

Without delving too much into details, the order of execution in GIOTTO can be recapped as follows: (i) first, data write and control output (i.e., actuation) operations are performed, then (ii) input (i.e., sensing) and data read operations are undertaken. This order is applied at every periodic instance of the tasks in the system and considers the input/output operations of all the tasks in a holistic manner, i.e., if the period instances of two tasks begin at the same time, then the communication is collapsed within a pair of phases (i) and (ii), each comprising the communication operations for both tasks.

Figure 2 illustrates an example schedule of LET communication with the GIOTTO semantic. The communication phases are scheduled at the beginning of each periodic instance, which is compatible with the case in which they are performed by
a high-priority task. As shown in the figure (dashed arrow), write operations have precedence over read operations, and the third periodic instance of τC reads the data written by the first instance of τP.

Fig. 2. Example schedule of LET communication with GIOTTO semantics. The producer task τP has a period of TP = 4 ms while the consumer task τC has a period of TC = 2 ms. Legend: W = write, A = actuation, S = sensing, R = read. The operations that do not contribute to the end-to-end latency indicated in the figure are colored with light grey.

Fig. 3. Example schedule of LET communication with interleaved communication phases. The producer task τP has a period of TP = 4 ms while the consumer task τC has a period of TC = 2 ms. The same legend of Figure 2 applies.

As long as the tasks complete their execution before the release of their next instance (i.e., according to the implicit-deadline model), and ignoring the time needed to perform the actual input/output operations, the end-to-end latency with which the system reacts to the control input is deterministic, i.e., it is independent of the tasks' response times and equal to TP + TC.

The same semantic can be realized by scheduling the input/output operations at different times than the ones in Figure 2: implementation issues related to the scheduling of LET communication are addressed in Section V.
Fig. 4. Two examples of LET communication for a task chain. When it completes its execution, the producer task τP activates the consumer task τC (dotted arrow). Both the tasks have the same period, but τC incurs in release jitter. The same legend of Figure 2 applies. The marker with a large dot indicates the completion of a job of τP. Inset (a) depicts the case where the GIOTTO semantic is applied, while inset (b) depicts an alternative case where data write and read operations are scheduled when τC is activated.

B. Interleaved LET communications

By altering the order with which the input/output operations are performed, it is possible to obtain different end-to-end latencies. For instance, consider the case where the LET communication phases are grouped by tasks, i.e., input and output operations are interleaved. This case is compatible with a LET implementation where each task delegates the LET communication for its input/output operations to a dedicated high-priority task.

Figure 3 illustrates an example schedule of LET communication where the input/output operations of task τC have precedence over those of τP, i.e., they follow the rate-monotonic order (note the periods of the tasks in the figure caption). As can be observed from the figure, differently from the case discussed in the previous section, the third periodic instance of τC is not able to read the data produced by the first instance of τP. This happens because the read operations of τC are scheduled before the write operations of τP. As a consequence, the data produced by the first instance of τP are only available to the fourth instance of τC, which determines an increase of the end-to-end latency with which the system reacts to the control input. Specifically, the latter becomes TP + 2TC.
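The two latencies derived above for the running example (TP = 4 ms, TC = 2 ms) can be reproduced with a toy timeline check. This is our sketch, not code from the paper; it only models the ordering of coincident write and read phases, assuming the producer senses at t = 0 and publishes at the end of its period, while the consumer actuates at the end of the period in which it reads:

```python
# Toy check of the logical end-to-end latency for the running example:
# the producer senses at t = 0 and publishes its output at t = TP; the
# consumer reads at every multiple of TC and actuates one TC later.
TP, TC = 4, 2

def latency(write_before_read):
    t_write = TP                 # instant at which the producer publishes
    t = 0
    while True:
        if t % TC == 0:          # a read instant of the consumer
            # At a coincident instant the new value is visible only if
            # writes are scheduled before reads (GIOTTO order).
            if t > t_write or (t == t_write and write_before_read):
                return t + TC    # actuation at the end of the consumer period
        t += 1

assert latency(True) == TP + TC        # GIOTTO order: 6 ms
assert latency(False) == TP + 2 * TC   # interleaved order (reads first): 8 ms
```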
C. LET for task chains

In the particular case in which a producer task τP only communicates with a consumer task τC, the LET model can be dropped for the internal communication of the chain and restored only at its boundaries, by enforcing an order of execution with an explicit activation signal. Under this scenario, the tasks have the same period TP = TC = T, but the consumer task τC incurs in release jitter, which depends on the response time of τP.

Figure 4(a) illustrates an example schedule for the considered task chain where LET communication follows the GIOTTO semantic introduced in Section III-A. Since the communication phases are performed at the beginning of the periodic instances, the data produced by the first instance of τP is available to the second instance of τC. As a result, the (logical) end-to-end latency is equal to TP + TC = 2T. However, this latency may be reduced while still preserving the flows of data values between two consecutive instances of such tasks, i.e., the data produced by a job of task τP must be available to the successor job of τC explicitly activated at the completion of τP. As illustrated in Figure 4(b), data writes and reads are performed at a time instant (e.g., at the response time of τP as in the figure) within the period of the tasks. Nevertheless, the LET paradigm can be retained for external inputs and outputs, thus maintaining predictability for the control timing. In other words, this scheme can be seen as a LET paradigm applied in a holistic manner to the task chain, rather than to each individual task. As depicted in Figure 4(b), the resulting end-to-end latency with which the system reacts to the control input is equal to T.

IV. REALIZING LET WITH GIOTTO SEMANTICS ON MULTICORES

This section presents a method for realizing the LET communication with GIOTTO semantics on a multicore platform. To generalize the proposed method, the following sections consider
the abstract platform model in Section II. The method is later instantiated for a real platform in Section V. The local copies of the labels required by the LET are allocated to local memories, i.e., a task τi running upon processor Pk and accessing a label ℓq disposes of a local copy for ℓq, named ℓi,q, allocated to Mk. Since application tasks work only on local copies, their execution is not affected by memory contention. The global communication labels are allocated to the global memory Mm+1. Contention in the access to such labels is avoided by the LET communication mechanism at the price of a (predictable) synchronization delay. For the sake of simplicity, only data read and write operations are considered: possible improvements and optimizations are discussed at the end of this section.

A. LET as an opportunity to avoid memory contention

A major issue in executing real-time applications upon multicore platforms is the contention of architectural shared resources in the memory hierarchy (e.g., levels of caches and global memories). Works in the literature [5], [6] addressed such a problem by proposing clever solutions to improve the predictability of memory traffic.

As discussed in Section III-A, LET communication can be realized by scheduling data write and read operations at various time instants provided that the order of their execution preserves the specified causality. However, scheduling the communication phases at the beginning of the periodic instances of tasks, as illustrated in Figure 2, carries considerable benefits in controlling the memory traffic. In fact, this approach allows localizing the memory accesses within precise time windows that are determined by the task periods, and allows arbitrating the access to the global memory Mm+1.

Consider a label ℓ that is read by two tasks executing upon two different cores. The tasks dispose of local copies that must be updated by the LET communication mechanism by reading the global copy of ℓ, which is mapped to the global memory Mm+1. Although the read operations can be performed in parallel on the two cores, each benefiting of a low latency in accessing the corresponding local memory where the private copies of ℓ are allocated, their timing is mutually coupled due to the potential contention in accessing the global memory. Without a proper synchronization mechanism, in the worst case the memory accesses issued by one core can interfere with the other, and vice versa, leaving room for pathological scenarios that inevitably affect the tasks' response times. Conversely, by localizing the accesses to the global memory in precise time windows, the interference generated by memory contention can be avoided by design.

As a drawback, the execution of the communication phases at the release of periodic task instances requires specific jobs with priority higher than all the tasks. This determines a priority inversion, as the LET communication for a low-priority task delays the execution of a high-priority task. The impact of this drawback is discussed in the experimental results in Section VIII.

B. Timing of LET communications

First, it is necessary to identify the subset of memory accesses that are required to safely realize the LET paradigm. Depending on the task periods, a producer does not need to always update the shared copies of the accessed labels at every periodic instance. For example, consider a producer task τP executed with a rate of TP = 2 ms that is communicating with a consumer task τC running with a rate of TC = 10 ms. Suppose also that both the tasks are synchronously released at the system startup. As a function of the ratio of their periods, for each job of τC there are TC/TP = 5 jobs of τP that overlap over time. For a given job JC of τC, the data produced by the first four overlapping jobs of τP are never used, as they are overwritten by the data write operations performed by the last overlapping job, i.e., the last job of τP that completes no later than the release of the next job of τC (following JC). In a dual manner, a consumer does not always need to read the shared copies of the labels. By leveraging these observations, it is possible to derive an analytical characterization of the timing of LET communications.

Fig. 5. Illustration of the timing of LET communications. Inset (a) depicts the case of write operations, while inset (b) depicts the case of read operations. To preserve the LET semantic, it is sufficient that only the jobs in red perform the update of (resp., the read from) the global copies. Dashed arrows indicate the communications that involve the kth job of the consumer task (inset (a)) and the producer task (inset (b)).

Timing of write operations. Consider a producer task τP communicating with a consumer task τC, both synchronously released at time t = 0. If the period of τP is larger than (or equal to) the period of τC, i.e., TP ≥ TC, then a job of τC will always be released between two jobs of τP. As a consequence, the producer task must update the global copies of each label ℓ ∈ LW(τP, τC) at every periodic instance. Otherwise, if TP < TC, then multiple jobs of τP can overlap with one job of τC. It is then sufficient that the global copies are updated by the last overlapping job that completes before the release of a job of τC. For the generic kth job of τC, which is released at time kTC, there are ⌊kTC/TP⌋ periodic instances of τP that are fully contained within the time window [0, kTC]. Hence, the job of interest is the one that completes at time ⌊kTC/TP⌋TP. This scenario is illustrated in Figure 5(a).

Generalizing such results, the jobs of τP that must update the global copies are those whose periodic instances complete at times η^W_{C,P}(k) · TP, for k ∈ N≥0, where

    η^W_{C,P}(k) = ⌊kTC/TP⌋  if TP < TC;   η^W_{C,P}(k) = k  otherwise.   (1)

Timing of read operations. Consider the same two tasks τP and τC. If the period of the consumer task is larger than
(or equal to) the one of the producer task, i.e., TC ≥ TP, then a job of τP will surely be released between two jobs of τC. Consequently, the consumer task must read the global copies of each label ℓ ∈ LR(τC, τP) at every periodic instance. Otherwise, if TC < TP, it is sufficient that the global copies are read by the first job of τC that is released after (or at) the release of a job of τP. Considering the generic kth job of τP, released at time kTP, there are ⌈kTP/TC⌉ periodic instances of τC that overlap with the time window [0, kTP]. Hence, the first job of τC released after (or at) time kTP is the one released at time ⌈kTP/TC⌉TC, as shown in Figure 5(b).

Generalizing such results, the jobs of τC that must actually read the global copies are those activated at times η^R_{C,P}(k) · TC, for k ∈ N≥0, where

    η^R_{C,P}(k) = ⌈kTP/TC⌉  if TP > TC;   η^R_{C,P}(k) = k  otherwise.   (2)

When applying the properties identified above to every pair of communicating tasks in the system, it is clear that LET communication requires a workload with multiple periodic patterns (i.e., one for each pair of communicating tasks), which can be realized with a multiframe task.
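For a quick sanity check, Eqs. (1) and (2) can be transcribed directly into code (a sketch of ours; the names `eta_w` and `eta_r` are hypothetical), using the TP = 2 ms, TC = 10 ms example discussed earlier:

```python
from math import floor, ceil

def eta_w(k, TP, TC):
    """Eq. (1): index of the producer job whose completion, at eta_w*TP,
    must update the global copies (synchronous release at t = 0)."""
    return floor(k * TC / TP) if TP < TC else k

def eta_r(k, TP, TC):
    """Eq. (2): index of the consumer job whose release, at eta_r*TC,
    must refresh the local copies."""
    return ceil(k * TP / TC) if TP > TC else k

# Example from the text: TP = 2 ms, TC = 10 ms. Of the five producer jobs
# overlapping each consumer period, only the one completing exactly at the
# next consumer release writes the global copies.
TP, TC = 2, 10
assert [eta_w(k, TP, TC) * TP for k in range(4)] == [0, 10, 20, 30]
# With TP < TC the consumer must instead read at every one of its releases.
assert [eta_r(k, TP, TC) * TC for k in range(3)] == [0, 10, 20]
```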
C. Deriving a multiframe task with inter-core synchronization

The generalized multiframe (GMF) task model [7] has been proposed to cope with computational activities that exhibit a variable behavior across multiple instances. Specifically, a GMF task is characterized by an ordered sequence of frames, each defined by a WCET, an inter-arrival time to the next job, and a relative deadline. A GMF task releases jobs by following the cyclically repeating order of the frames.

The proposed approach to realize LET communication is based on the following design principles:
(i) Synchronous activation of all the tasks in the system (i.e., all the tasks on all the cores are synchronously released at startup time t = 0).
(ii) Definition of a GMF task τx^LET for each processor Px that performs the copies of labels from the corresponding local memory to the global memory (write operations), and viceversa (read operations). Such tasks run at the highest priority.
(iii) Adoption of an inter-core synchronization protocol to arbitrate the accesses to the global memory performed by each frame of the GMF tasks.
The results derived in the previous section can be leveraged to match principle (ii); that is, as a function of the timing of LET communications, it is possible to identify the time instants at which data write and read operations must be performed. Then, the operations that must be scheduled at the same time instant are merged into a frame of a GMF task. This strategy is summarized in the algorithm in Figure 6.
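In the spirit of this strategy, the frame-generation step can be sketched as follows (our illustration with hypothetical task parameters and helper names; a real generator would iterate over the AUTOSAR configuration of all cores):

```python
from math import floor, ceil
from collections import defaultdict

def eta_w(k, TP, TC):                      # Eq. (1)
    return floor(k * TC / TP) if TP < TC else k

def eta_r(k, TP, TC):                      # Eq. (2)
    return ceil(k * TP / TC) if TP > TC else k

def build_frames(pairs_w, pairs_r, hyper):
    """pairs_w: (TP, TC, label) pairs with the producer on this core;
    pairs_r: (TP, TC, label) pairs with the consumer on this core.
    Returns one frame per instant: (t, (writes, reads)), writes first."""
    ops = defaultdict(lambda: ([], []))    # t -> (write labels, read labels)
    for TP, TC, lab in pairs_w:
        k = 0
        while (t := eta_w(k, TP, TC) * TP) < hyper:
            ops[t][0].append(lab)
            k += 1
    for TP, TC, lab in pairs_r:
        k = 0
        while (t := eta_r(k, TP, TC) * TC) < hyper:
            ops[t][1].append(lab)
            k += 1
    return sorted(ops.items())

# A producer (TP = 2) writing label 'a' read on another core (TC = 4), and
# a local consumer (TC = 4) reading label 'b' from a remote producer (TP = 2).
frames = build_frames([(2, 4, 'a')], [(2, 4, 'b')], hyper=8)
assert frames == [(0, (['a'], ['b'])), (4, (['a'], ['b']))]
```

Merging all operations scheduled at the same instant tk into one frame, with the writes placed before the reads, matches the write-before-read layout required by the GIOTTO order.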
1: procedure GENERATE GMF BEHAVIOR(Px)
2:    for each (τC, τP) ∈ Γ × Γx do
3:       for each tk = η^W_{C,P}(k) · TP, k ∈ N≥0 do
4:          for each ℓq ∈ LW(τP, τC) do
5:             schedule write(tk, ℓq = ℓP,q)
6:          end for
7:       end for
8:    end for
9:
10:   for each (τC, τP) ∈ Γx × Γ do
11:      for each tk = η^R_{C,P}(k) · TC, k ∈ N≥0 do
12:         for each ℓq ∈ LR(τC, τP) do
13:            schedule read(tk, ℓC,q = ℓq)
14:         end for
15:      end for
16:   end for
17:   build frames()
18: end procedure

Fig. 6. Algorithm to generate the behavior of a GMF task that implements LET communications on processor Px.

Given a processor Px, the algorithm identifies all the time instants in which a producer task τP ∈ Γx must update the global copies of the accessed labels, hence writing in the global memory (lines 2-8). In a dual manner, the algorithm proceeds by identifying all the time instants in which a consumer task τC ∈ Γx must update its local copies of the accessed labels by reading from the global memory (lines 10-16). Finally, by means of the function called at line 17, the algorithm constructs the frames of the GMF task by (i) looking at all time instants tk for which there is at least one operation scheduled, and (ii) for each of such time instants, defining a frame that has as workload all the corresponding write operations followed by the read operations. The times tk to be considered in the algorithm can be limited to the hyperperiod of all the tasks in the system. A practical implementation of the GMF tasks will be discussed in Section V.

To avoid contention in global memory, the executions of the GMF communication tasks on each core are strictly synchronized. Flow preservation requires that all tasks complete before the end of their period and that all writes are performed before the corresponding reads. To ensure that writes are performed before reads, the GMF communication tasks execute all their writes in a strict order (following principle (iii)). When all writes are completed, the GMF tasks execute the read operations in order. The order of execution may be different for each GMF execution instance, given that some GMF task on some core may not need to read or write for a given periodic instance. The resulting protocol to regulate the access to global memory complies with the following rules:
R1 For each execution instance of a GMF task τx^LET released at time tk, two sets of processors are defined: Wx(tk) and Rx(tk).
R2 Before performing the write operations scheduled in a frame, τx^LET must wait until all the write operations scheduled at time tk for the GMF tasks of the processors in Wx(tk) are completed.
R3 Before performing the read operations scheduled in a frame, τx^LET must wait (i) that all the write operations scheduled at time tk are completed, and (ii) that the read operations scheduled at time tk in the GMF tasks of the processors in Rx(tk) are completed.
R4 The GMF tasks busy-wait to guarantee rules R2 and R3.
The corresponding pseudocode for the frames (instances) of the GMF tasks is illustrated in Figure 7.

The sets Wx(tk) and Rx(tk) determine the order with which the communication operations are performed. A simple definition for such sets can be devised by enforcing a fixed global order of processors, where some processors can be skipped when their corresponding GMF task does not have
1: procedure FRAME(Px, tk)
2:   wait(Wx(tk))
3:   do_write_operations(tk)
4:   wait_all_writes()
5:   wait(Rx(tk))
6:   do_read_operations(tk)
7: end procedure

Fig. 7. Pseudocode for the frame released at time tk of the GMF task running on processor Px.

[Figure 8: a timeline from 0 to 8 showing the write (W) and read (R) phases of the frames of τ1LET, τ2LET, and τ3LET.]

Fig. 8. Example schedule of three GMF tasks with inter-core synchronization. The GMF task on processor P1 is fully periodic. The GMF task on processor P2 has three frames released at times 0, 4, and 6, while the GMF task on processor P3 has two frames released at times 0 and 2. For simplicity, the hyperperiod is 8 and all the frames execute just one write and one read operation.

to perform communication. The resulting scheme is a token passing with busy-waiting. A practical implementation of the proposed protocol is presented in Section V. An example schedule of the GMF tasks is illustrated in Figure 8, where the sets referred to in rules R2 and R3 for the frames at time 0 are defined as follows: W1(0) = R1(0) = ∅, W2(0) = R2(0) = {P1}, and W3(0) = R3(0) = {P1, P2}.

The proposed approach requires that the frames of the GMF tasks are sufficiently spaced, i.e., the minimum inter-arrival time between two frames released on any processor is significantly larger than the longest time a frame takes to complete all the communication operations.

D. Improvements and Optimizations

The proposed approach can be improved and optimized in several directions. First, it is possible to schedule tasks not affected by global reads and writes during the busy-waiting, hence improving their response times and the processor utilization. The same can be done to cope with control input and output operations as long as the adopted I/O devices are not shared by multiple processors. Second, the algorithm in Figure 6 can be improved to reduce the number of reads from global memory, e.g., when the same label is read by multiple tasks running on the same processor. Third, further parallelism in performing the copies of the labels can be achieved by adopting a DMA and a different label allocation. Lastly, different scheduling schemes can be devised to reduce the interference introduced by the GMF tasks, e.g., by deferring some communication operations.

V. IMPLEMENTING LET ON AURIX TRICORE

This section presents an implementation of the approach proposed in the previous section on the popular Aurix Tricore platform by Infineon. The implementation has been performed upon the ERIKA open-source real-time operating system [8], which is certified OSEK/VDX and implements most of the AUTOSAR OS requirements. The code is publicly available on-line [9].

A. The Aurix Tricore platform

The Aurix Tricore is an automotive-grade multicore platform widely adopted as a main processing unit in several types of electronic control units (ECUs), such as for engine control. The Aurix Tricore includes three cores, each associated with a program memory interface (PMI) and a data memory interface (DMI) (Figure 9). The DMI includes a scratchpad memory (i.e., a local memory under the control of the programmer) and a set-associative data cache. The PMI includes a program scratchpad memory and a program cache (i.e., to store instructions). The caches can be disabled. The microcontroller also includes a local memory unit (LMU) and a program memory unit (PMU). Despite the name, the LMU is a 32KB memory that is external to the core subsystems, and can be considered as a global memory. The PMU includes a 384KB data flash memory, and two 2MB program flash memories.

In the Aurix platform, the scratchpads are used as the core local memories in the abstract model of Section II, and the LMU is the global memory. Despite their names, the local scratchpads are accessible from any core. In the Aurix, the memory map of the microcontroller allows any core to access any of the above-mentioned memories by means of a cross-bar interconnect. The memory map is the same for all cores. As an example, Table I reports an excerpt from the memory map of the Aurix Tricore TC275 (taken from the corresponding datasheet [10]), which shows the addresses at which the scratchpads of the CPUs are accessible from any core.

TABLE I
EXCERPT FROM THE TC275 MEMORY MAP

Address Range          | Size      | Description           | Read   | Write
5000 0000 - 5001 DFFF  | 120 KByte | CPU2 Data Scratch-Pad | access | access
6000 0000 - 6001 DFFF  | 120 KByte | CPU1 Data Scratch-Pad | access | access
7000 0000 - 7001 BFFF  | 112 KByte | CPU0 Data Scratch-Pad | access | access

The access to a scratchpad memory of a remote core does not involve the global memory. The core subsystems are asymmetric. One of the three cores has a different CPU architecture than the others. The cores also differ in the sizes of the local memories. For instance, the first core has a 112KB data scratchpad, while the other two cores have a 120KB data scratchpad. At a high level, the abstract platform model introduced in Section II matches the architectural characteristics of the Aurix Tricore.

B. Implementation

The implementation of the approach proposed in Section IV-C required facing three major issues: (i) the synchronization of the task activations, (ii) the efficient realization of GMF tasks, and (iii) the implementation of the inter-core synchronization protocol to explicitly regulate the access to the global memory.

Synchronizing the task activations. The first issue has been solved by exploiting the remote procedure call (RPC) features that are available in ERIKA. In accordance with the OSEK/VDX
standard, alarms are provided to periodically activate tasks. In our implementation, all the alarms are driven by a single OSEK counter, which is realized with a timer that periodically sends interrupts to the first core (with a rate of one millisecond). Using ERIKA RPC features, such alarms can be used to activate tasks on any processor. Inter-core interrupts are leveraged to synchronously activate tasks on remote processors. Hence, in the resulting design, the first core is in charge of activating all the tasks in the system; synchronization of the tasks' periods is ensured because all task activations are generated by the same time reference, modulo some negligible synchronization delay introduced by the RPC mechanism.

Realizing GMF tasks. The realization of the GMF tasks required facing a memory vs. time trade-off. A straightforward implementation of the method proposed in Section IV-C would require the definition of a table that stores the set of labels to be read and written (or a pointer to a function performing the set of reads and writes) for each frame instance of the core GMF tasks up to the hyperperiod of all the tasks in the system. Each frame instance would be characterized by the release time (or the inter-arrival time to the next frame) and a code section with all the communication operations to be executed within the frame. While this choice would have a limited impact in terms of runtime overhead, it is memory eager for realistic applications. First, the required table may be very large for realistic values of the hyperperiod. Second, this method would require a lot of duplicated code among the code sections of the frames, i.e., there may be several frames that perform mostly (if not exactly) the same data write and read operations.

To contain the memory footprint when realizing the GMF tasks, the solution adopted in our implementation is based on providing two counters for each pair of communicating tasks: one for write operations, and one for read operations. Such counters can be used to identify the time instants in which the LET communications for a pair of tasks must be performed. For instance, consider a producer task τP communicating with a consumer task τC with TP < TC. By following the timing of LET communications derived in Section IV-B, it is possible to observe that the number of jobs of τP between each communication phase is known a-priori and given by Equation (1). Furthermore, note that Equation (1) produces values with a periodic pattern that is repeated every hyper-period of the two tasks: therefore, it is sufficient to consider only the values of $\eta^{W}_{C,P}(k)$ up to $k = \mathrm{lcm}(T_P, T_C)/T_C$, where lcm(a, b) denotes the least common multiple of a and b. Similar observations can also be made for read operations by considering Equation (2).

The key idea of our proposal is to use the counters to count the number of jobs that separate the communication phases. For each processor Px, the corresponding GMF task has been implemented as a periodic task running with period TxLET equal to the GCD of the periods of all the tasks executing upon Px. Each instance of such a task is in charge of decrementing all the above-mentioned counters. Each counter is associated with a code section that implements the corresponding communication phase. Such code sections are executed when the corresponding counter reaches zero, where the latter is re-initialized to the next value. This strategy can be realized with a code generator, as done for the case-study presented in Section VIII.

Fig. 9. Architecture of the Aurix Tricore microcontroller (from infineon.com).

Inter-core synchronization. Finally, by exploiting the characteristics of the Aurix Tricore, it is possible to devise a lightweight implementation of inter-core synchronization. For each processor Px, two atomic spin variables allocated to the corresponding local memory Mx are provided: one to wait for write operations, and another to wait for read operations. Such variables are initialized to zero. Each frame of a GMF task that has to wait before executing a communication phase (see the algorithm in Figure 7) performs the busy-wait by spinning in a loop executed as long as the spin variable is zero. Leveraging the feature of the Aurix Tricore that allows a core to write in the scratchpad of another core, it is possible to notify a GMF task that is spinning by simply updating one of its spin variables. Note that such notifications do not involve accesses to the global memory. Furthermore, the number of notifications issued in a given time window can be computed off-line as a function of the configuration of the GMF tasks. Since the platform includes a write buffer, the DSYNC instruction can be provided after the write on a remote spin variable to flush the write buffer, thus enforcing the consistency of the notification. The GMF tasks perform the busy-waiting by continuously accessing their corresponding local memory: hence, they do not generate memory traffic that compromises the arbitration of the accesses to the global memory. The actual spin variables to be used, and the GMF tasks to be notified, can change frame by frame depending on the desired order with which the cores must access the global memory.

C. Pseudocode of the GMF tasks

Figure 10 reports the pseudocode for the GMF task τxLET running on processor Px. First, the function do_write_tick() decrements all the counters associated with write operations. Then, the function invokes the write operations for the counters that are down to zero (implemented using a bitmask for each task). Subsequently, the task busy-waits on the spin variable spin_Px_write (line 3) until another core signals its completion, i.e., passing to Px the token to access the global memory. Once the token has been acquired, depending on the value of the bitmask of each task running on Px, the set of scheduled write communications is executed by updating the global copies of the corresponding labels (line 5). Once the write phase is completed, another core is notified to proceed with its write operations (line 6). A similar scheme is provided for read
operations (line 6). A similar scheme is provided for read where Wi = Ci + `v ∈Li Ni,v · λv (i.e., the worst-case
operations (lines 8-12). execution time of the task plus the cost for accessing the labels)
To avoid memory interference due to the implementation of and M Ci (Ri ) represents the delay due to memory contention
the multiframe mechanism, the data (i.e., the counters and the incurred by τi and all the high-priority tasks, which transitively
bitmasks) managed by the do write tick() and do read tick() affect the response time of the task under analysis.
functions must be allocated to the local memory Mx . Note the Memory contention arises when tasks access to communi-
execution of such functions (by all the GMF tasks in system) cation labels mapped to the global memory. Since memory
is performed in parallel, thus reclaiming part of (or possibly contention is resolved according to the FIFO policy, a safe
even all) the time that a task has to busy-wait. bound on the term M Ci (Ri ) can be obtained by simply
An example of the do write tick() function is reported in inflating the terms Wi to account for m−1 contentions for each
Figure 11, where the counter associated to a pair of commu- memory access. However, this approach may lead to excessive
nicating tasks is managed. At line 7, the function modifies the pessimism, thus resulting in very coarse upper-bounds on the
bitmask of a producer task τ6 to notify that the communication response times. Rather, in this work an inflation-free analysis
labels read by a consumer task τ8 must be updated within the strategy [11], [12] is adopted.
current frame of the GMF task. Such operations will then be An inflation-free analysis explicitly accounts for each mem-
accomplished by the do write() function of Figure 10. ory access that may originate a contention while task τi (under
analysis) is pending. To this end, a bound is derived for the
1: procedure LET TASK P X( ) maximum number of accesses N RAx (t) to the global memory
2: do write tick()
3: busy wait( spin P x write == 0 ) issued by tasks executing on remote processors Px 6= P (τi ) in
4: spin P x write = 0 an arbitrary time window of length t, that is
5: do write() X X  t + Rj 
6: notify next processor write() N RAx (t) = Nj,v . (4)
7: Tj
8: do read tick() τj ∈Γx `v ∈Lj
G

9: busy wait( spin P x read == 0 )


10: spin P x read = 0 Note that the above equation considers the sum over all the
11: do read() tasks allocated to Pk as they can produce memory contention
12: notify next processor read() independently of their priority (FIFO arbitration). The term
13: end procedure d(t + Rj )/Tj e is a safe bound on the maximum number of
Fig. 10. Pseudocode for the GMF task τxLET running on processor Px .
pending jobs of τj ∈ Γx in any time window of length t [11],
[12].
Similarly, a bound is derived for the number of accesses
1: procedure DO WRITE TICK( ) N LAi (t) to the global memory issued by the local processor
2: h...i P (τi ) in a busy-period of length t where τi is pending, that is
3: cnt write T6 T8 = cnt write T6 T8 -1
4: if (cnt write T6 T8 == 0) then X X  t 
max
5: k6,8 = (k6,8 + 1) mod k6,8 N LAi (t) = Ni,v + Nj,v . (5)
6: cnt write T6 T8 = jobs T6 T8[k6,8 ] ·T6 /T1LET G
Tj
τj ∈hp(τi ) `v ∈Lj
7: write flags T6 | = TURN ON FLAG T6 T8 τj ∈Γ(τi )
8: end if
9: h...i Due to the FIFO arbitration and the fact that the memory
10: end procedure accesses are blocking and non-interruptible, it follows that each
memory access issued by a remote processor can delay at most
Fig. 11. Example of function do write tick() showing the management of the one access issued by the local processor. Hence, the following
counter for the pair of tasks τ6 (producer) and τ8 (consumer) with TP < TC .
W (k + 1) − η W (k),
Variable k6,8 is initialized to zero, jobs T6 T8[k] = η6,8 6,8
bound for the contention delay holds:
max = lcm(T , T )/T .
and k6,8 6 8 8 X
M Ci (t) = min {N RAx (t), N LAi (t)} · λR , (6)
Px 6=P (τi )
VI. W ORST- CASE ANALYSES WITH AND WITHOUT LET
This section provides a comparison of possible approaches where λR is the cost to access a label in global memory.
for bounding the worst-case cost of memory accesses in mul- Equation (6) can be used in Equation (3) to bound the re-
ticore platforms, including the blockings for contention in the sponse times of the tasks. The term N RAi,x (t) depends on the
case of any-time (in the context of the task execution) memory response time of the tasks allocated to the remote processors:
accesses and in the case of our proposed LET implementation. this additional recursive dependency can be addressed with
A. Memory-aware response-time analysis for any-time ac- an iterative loop in which Equation (3) is solved for all the
cesses tasks until all the response-time bounds Ri converge. Such an
iterative loop starts with Ri = Ci for all tasks τi .
Following standard response-time analysis, under the as-
sumption of constrained deadlines, the worst-case response time B. Analyzing the proposed LET implementation
of a task τi is bounded by the least positive fixed-point of the
By construction, the proposed approach guarantees that all
following recurrent equation:
the application tasks execute without incurring in memory
X  Ri  contention, as they only access local copies of the labels
Ri = Wi + Wj + M Ci (Ri ) (3)
Tj (allocated to the local scratchpad memories). However, they
τj ∈hp(τi )
τj ∈Γ(τi ) incur in temporal interference caused by the GMF tasks, which
execute at the highest priority. The interference generated by such tasks can be bounded with established analysis techniques for GMF tasks or more general task models: please refer to [13] for a detailed survey. Bounds on the execution times of the frames can be derived by accounting for the cost of accessing global and local memories, and the time necessary to manage the multiframe behavior (mainly related to functions do_write_tick() and do_read_tick()). Note that the time a frame has to wait before performing the communication actions is determined by the maximum between (i) the spinning time, which depends on the communication actions performed in the other cores, and (ii) the execution times of functions do_write_tick() and do_read_tick().

This approach allows one to precisely account for the contention delay incurred by tasks, resulting in a definitively more predictable design compared with the case where the application tasks can access the global memory at any time during their execution, as addressed by the analysis presented in the previous section. A fine-grained analysis of the GMF tasks is out of the scope of this paper, and is left as future work.

VII. IMPLEMENTING LET IN AUTOSAR

This section shortly addresses solutions for the implementation of the LET model in AUTOSAR.

In AUTOSAR, runnables communicate by using an API offered by an architecture layer called RTE (Run-time environment). For data-oriented communication, the API offers simple functions for writing to and reading from data objects. The API functions can be explicit or implicit. In the explicit model, the (shared) communication variable is accessed at the time it is needed (the API function is called) within the execution of the runnable, as illustrated at the top of Figure 12. In the implicit model, when a read or write operation is invoked by the runnable in the middle of its execution, the values are read from and written into local copies of the variables. The actual code implementing the read from and write into the shared global variables is automatically generated as part of the RTE code at the beginning and at the end of the runnable code. The result of the read operation is sampled at the beginning of the runnable execution and then stored in a local variable for the duration of the runnable execution. Similarly, the write value is locally stored in a variable and then copied by the RTE code in the actual global variable after the runnable execution (shown in the middle of Figure 12, the darker rectangles before and after the runnable execution represent the RTE code).

Fig. 12. Illustration of the implementation of the LET model of execution with explicit and implicit communication. Down- and up-arrows denote the input and output operations, respectively.

The simplest way to implement the LET communication paradigm in an AUTOSAR flow is to modify the RTE generation process for the implicit communication model. The RTE generator would add the code performing the global variable inputs and outputs to the multiframe LET tasks instead of placing it at the runnable boundaries (bottom of Figure 12). The RTE generator could generate the LET input and output tasks together with the other RTE-generated code (according to the algorithms outlined in this paper).

VIII. EXPERIMENTAL EVALUATION: A CASE-STUDY

This section reports on an experimental evaluation that has been conducted to assess the feasibility of the proposed approach and its impact in terms of timing performance. The LET implementation discussed in Section V has been adopted for a synthetic application that has been automatically generated from a model provided by Bosch for the WATERS 2017 challenge [2], which is representative of a realistic engine control application.

A. The WATERS 2017 challenge model

The WATERS 2017 challenge came with a model of an engine control application consisting of 1250 runnables grouped into 21 tasks/ISRs that access 10000 labels. About 5000 labels are constant, while the others are actual communication variables. The model specifies the labels accessed by each runnable, the type of access (read or write), and the number of accesses. Furthermore, it provides the execution times of the runnables net of memory access and memory contention times. The task periods and the minimum inter-arrival times of the ISRs are also provided. The model comprises a quad-core platform, where tasks are statically allocated.

B. Experimental setup

The tests have been performed on an Infineon TriBoard v2 equipped with an Aurix TC275 microcontroller running at 200MHz and connected to a Lauterbach PowerTrace to perform debugging and tracing. The HIGHTECH Aurix compiler v4.6.3.1 and the ERIKA real-time operating system v2.7 are used, with the default compiler configurations provided for the ERIKA kernel. Data caches have been disabled and the application code is fetched from the PMU (flash memories).

C. Assumptions and Code generator

Some additional assumptions were necessary to generate executable code from the WATERS challenge model. First, while the challenge model is conceived for a quad-core platform, only three cores are available in the Aurix platform. Consequently, one core and the corresponding tasks have been discarded. Second, since our proposals focus on fully-periodic tasks, ISRs have been considered as periodic tasks with rate obtained by rounding their minimum inter-arrival time to the closest multiple of one millisecond. Third, since the challenge model does not specify the memory access patterns (i.e., no runnable code structure is provided), two strategies have been tested: (i) uniformly-distributed memory accesses within each runnable with random order, (ii) grouping of all memory read operations at the beginning of the runnable, and all write operations at the end of the runnable. No conditional statements within the runnable code have been considered (this information was lacking in the challenge model).
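To make strategy (ii) concrete, the following sketch shows the shape of code the generator could emit for one runnable; the runnable name, label identifiers, initial values, and loop bound are illustrative assumptions, not taken from the challenge model:

```c
#include <stdint.h>

/* Hypothetical generated runnable following the read-execute-write
   pattern of strategy (ii): all label reads grouped at the beginning,
   all writes at the end. In the real build, the global copies would be
   allocated to the LMU and the local copies to the core scratchpad. */

static volatile uint32_t label_42_global = 7u; /* global copy (LMU) */
static volatile uint32_t label_77_global = 0u; /* global copy (LMU) */

static uint32_t label_42_local; /* local copy (scratchpad) */
static uint32_t label_77_local; /* local copy (scratchpad) */

void runnable_10ms_3(void)
{
    /* Read phase: sample all input labels into local copies. */
    label_42_local = label_42_global;

    /* Execution segment: calibrated busy loop with a nop body,
       reproducing the profiled net execution time of the runnable. */
    for (volatile int i = 0; i < 1000; i++) {
        __asm__ volatile ("nop");
    }

    /* Computation operates on local copies only. */
    label_77_local = label_42_local * 2u;

    /* Write phase: publish all output labels. */
    label_77_global = label_77_local;
}
```

With the illustrative values above, one invocation publishes label_77_global = 14.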
TABLE II
NET EXECUTION TIME (WITHOUT KERNEL OVERHEAD) OF THE FIRST JOB, AND EXECUTION TIMES OF THE FIRST EIGHT JOBS OF THE GMF TASKS.

core | net execution time [µs]
  1  |   3.8
  2  | 108.76
  3  | 148.2

core | execution times of the first eight jobs [µs]
  1  |   4.25,  1.188,  1.438,  1.438,  1.188,  1.938,  1.813,  1.125
  2  | 136,     7.438,  7.313,  7.313,  6.813,  8.813,  7.688,  6.625
  3  | 163.8,  57.19,  86.13,  58.06,  85.38,  61.12,  85.56,  56.81

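As a quick sanity check on the numbers in Table II, the utilization a GMF task imposes on its core can be bounded by the ratio between the longest measured frame time and the GMF period; the sketch below assumes the GMF periods used in this case study (1 ms for the first two cores, 10 ms for the third):

```c
/* Coarse utilization bound of a GMF task: worst frame time / period. */
static double gmf_utilization(double worst_frame_us, double period_us)
{
    return worst_frame_us / period_us;
}

/* Worst measured frame times from Table II (first job, with kernel
   overhead) and the assumed GMF periods, indexed by core. */
static const double kWorstFrameUs[3] = {4.25, 136.0, 163.8};
static const double kPeriodUs[3]     = {1000.0, 1000.0, 10000.0};
```

For the third core, gmf_utilization(163.8, 10000.0) is about 0.016, i.e., under 2% of the core, consistent with the small utilization of the GMF tasks discussed in the experimental results.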
Based on such assumptions, a code generator has been developed. The generator inputs the XML file that encodes the system model and generates C code for each runnable, where execution segments are realized with for loops including a nop operation in the body. Concerning the application, the generator also generates (i) the definition of all the labels (both the local and the global copies), (ii) the corresponding accesses within the runnable code, (iii) the tasks' code (to call a sequence of runnables), (iv) the OIL configuration for the operating system, and (v) the code to setup the OSEK alarms to periodically activate the tasks.

Furthermore, the generator is in charge of generating the code of the GMF tasks as discussed in Section V, starting from the information available in the challenge model (i.e., communication relationships between tasks and task periods). The periods of the GMF tasks (implementing the LET communication) have been configured to the GCD of the periods of the tasks running in the corresponding processors, which resulted in T1LET = 1 ms, T2LET = 1 ms, and T3LET = 10 ms. The inter-core synchronization protocol has been configured with a fixed order between the cores: first P2, then P3, and lastly P1. A core is skipped if it does not release a frame, i.e., P1 waits for P3 only once every 10 jobs (note that the period ratio of the corresponding GMF tasks is actually 10). This order has been chosen with the following rationale. As discussed in Section V, the first core is responsible for activating all the tasks, hence it is subject to the highest runtime overhead related to the OSEK alarms. If the first core were the first one in accessing the global memory, then it would delay all the GMF tasks in the other cores by the time it takes to manage the activation of all the tasks (note that the kernel functionalities are executed with higher priority than the tasks). Letting the first core be the last one in accessing the global memory also determines the benefit that, when managing the task activations, it can reclaim some of the time it would have to busy-wait.

D. Experimental results

Experiments have been performed to measure the execution time of the GMF tasks implementing the LET communication. The results are reported in Table II. The table on the left reports the net execution times of the first frames without the kernel overhead. Note that the first frame is analogous to the one executed at the tasks' hyperperiod, where all the LET communications are performed, and is the heaviest in terms of execution time. Collecting the net execution times for all the frames was beyond the capability of our tracing hardware due to the limited trace buffer of the microprocessor. The execution times, including the kernel overhead, for the first eight frames are reported in the table on the right. The GMF tasks require a relatively small processor utilization (the GMF task of the third core runs at 10 milliseconds). However, as it can be observed from the measurements, the interference generated by the GMF tasks may be harmful for latency-sensitive tasks in the second core (with the first frame) and in the third core. On the other hand, it is important to recall that LET communication introduces the benefit of controlling the accesses to the shared memory.

To better evaluate the overall impact on the tasks' timing performance, the response times of twelve representative tasks of the challenge model have been measured. Both the cases with LET communication and with direct access to the global memory (AUTOSAR explicit communication) have been tested. The two memory access patterns discussed in Section VIII-C have been tested, but no significant difference has been observed. The longest observed response times, normalized to the corresponding task period, for the case of read-execute-write patterns are reported in Figure 13. As it can be observed from the figure, the response times differ by very small amounts. These results demonstrate that LET communication—with all the benefits that it brings in terms of predictability of the timing of control outputs and end-to-end latencies—can be realized without harming the timing of the application with respect to the case of direct accesses to the global memory, which by definition lacks the benefit provided by LET. We believe that evident benefits in terms of reduced memory contention have not been observed because the tested application is not sufficiently memory-intensive, and the limited number of runs with variable execution times (because of the problems in tracing the execution and the limited available time) was not sufficient to explore cases with multiple memory contentions.

The major impact of the realization of LET has been found in terms of memory footprint, which increased by 7.5% (about 40KB) with respect to the case of AUTOSAR explicit communication. See Table III for the detailed results.

TABLE III
APPLICATION FOOTPRINT WITH AND WITHOUT LET (IN BYTES)

         | text   | data | bss
LET      | 393064 | 4904 | 88328
Explicit | 359872 | 4784 | 80752

[Figure 13: bar plot of the longest observed normalized response times (0 to 1) for tasks ISR9, Task 1s, Task 10ms, ISR4, Task 20ms, ISR3, ISR8, Task 100ms, ISR1, ISR7, Task 50ms, and Task 5ms, under LET and Explicit communication.]

Fig. 13. Longest observed normalized response times under both LET and AUTOSAR explicit communication.
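The fixed token order used in the case study (first P2, then P3, and lastly P1, where a core is skipped if it releases no frame) can be sketched as follows; the function name and the millisecond-based encoding are illustrative assumptions:

```c
/* Fixed global token order used in the case study: P2, P3, P1. */
static const int kTokenOrder[3] = {2, 3, 1};

/* GMF periods of the case study, in ms, indexed by core 1..3. */
static const unsigned kGmfPeriodMs[4] = {0, 1, 1, 10};

/* Returns the processor to be notified after `self` (1..3) releases
   the token at time `t_ms`, skipping processors whose GMF task
   releases no frame at `t_ms`; 0 means the token round is complete. */
static int next_processor(int self, unsigned t_ms)
{
    /* Locate `self` in the token order. */
    int pos = 0;
    while (kTokenOrder[pos] != self)
        pos++;

    /* Scan the remaining cores in order; skip cores with no frame. */
    for (int i = pos + 1; i < 3; i++) {
        int p = kTokenOrder[i];
        if (t_ms % kGmfPeriodMs[p] == 0) /* frame released at t_ms? */
            return p;
    }
    return 0; /* no further core in this round */
}
```

At t = 1 ms, for instance, P3 releases no frame (its GMF period is 10 ms), so the token passes from P2 directly to P1.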
IX. RELATED WORK

The benefits of the LET paradigm for automotive applications have been outlined by Hamann et al. [14], together with an analysis of the end-to-end latencies of communicating tasks that make use of the LET paradigm. However, the authors considered a different implementation model with respect to the one adopted in the present paper, nor did they take advantage of the possibility of explicitly controlling the accesses to global memory. Rather, they propose communication mechanisms to guarantee the LET communication flows that are similar to those proposed to guarantee flow preservation in synchronous systems [15]–[17].

Several efforts have been spent in developing techniques to improve the predictability of memory accesses in multicore platforms, but none of them took into consideration the LET paradigm nor adopted the inter-core synchronization scheme proposed in this work. Closest to the present paper, Tabish et al. [18] presented an OS-level technique to preload scratchpad memories (data and instruction) to enable a contention-free non-preemptive execution of tasks. In 2011, Pellizzoni et al. [19] proposed the PREM execution model, where tasks access memory only at the beginning and at the end of their jobs. Yao et al. [20] presented a scheduling technique to arbitrate with time-division multiplexing the memory accesses performed by PREM tasks.

Alternative approaches have been proposed to regulate the access to shared DRAM memories. Yun et al. [5] proposed a memory bandwidth reservation mechanism that exploits hardware performance counters, while Yun et al. [6] and Kim et al. [21] presented bank-aware memory allocation schemes. Techniques have also been proposed to improve the predictability of cache memories: please refer to the excellent survey by Gracioli et al. [22].

Finally, other authors proposed schedulability analysis techniques that explicitly take into account the memory contention. Most relevant to us are the works of Mancuso et al. [23], which proposed a WCET bound in the presence of a collection of resource management techniques developed within the single-core equivalence project at UIUC, and Davis et al. [24], which adopted a trace-based task model and proposed to account for contention delays at the stage of response-time analysis.

… in setting up the experimental setup, Pasquale Buonocunto, Paolo Pazzaglia, and Alessio Balsini from the ReTiS lab for their work on the parser for the WATERS challenge model, and Infineon for having provided the microcontroller platform.

REFERENCES

[1] T. A. Henzinger, C. M. Kirsch, M. A. A. Sanvido, and W. Pree, "From control models to real-time code using Giotto," IEEE Control Systems Magazine, 2003.
[2] A. Hamann, D. Dasari, S. Kramer, M. Pressler, F. Wurst, and D. Ziegenbein. WATERS Industrial Challenge 2017. [Online]. Available: https://waters2017.inria.fr/challenge/#Challenge17
[3] The AUTOSAR standard, version 4.3. [Online]. Available: http://www.autosar.org
[4] T. A. Henzinger, B. Horowitz, and C. M. Kirsch, "Giotto: a time-triggered language for embedded programming," Proceedings of the IEEE, vol. 91, no. 1, pp. 84–99, Jan 2003.
[5] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, "MemGuard: Memory bandwidth reservation system for efficient performance isolation in multi-core platforms," in 19th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2013, pp. 55–64.
[6] H. Yun, R. Mancuso, Z. P. Wu, and R. Pellizzoni, "PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms," in IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), April 2014.
[7] S. Baruah, D. Chen, S. Gorinsky, and A. Mok, "Generalized multiframe tasks," Real-Time Systems, vol. 17, no. 1, pp. 5–22, Jul 1999.
[8] ERIKA Enterprise: Open-source RTOS OSEK/VDX kernel. [Online]. Available: http://erika.tuxfamily.org
[9] [Online]. Available: http://retis.sssup.it/~a.biondi/LET/
[10] AURIX TC27x D-Step - User's Manual, V2.2 2014-12.
[11] A. Wieder and B. Brandenburg, "On spin locks in AUTOSAR: blocking analysis of FIFO, unordered, and priority-ordered spin locks," in RTSS'13.
[12] A. Biondi and B. Brandenburg, "Lightweight real-time synchronization under P-EDF on symmetric and asymmetric multiprocessors," in ECRTS'16.
[13] M. Stigge and W. Yi, "Graph-based models for real-time workload: a survey," Real-Time Systems, vol. 51, no. 5, pp. 602–636, Sep 2015.
[14] A. Hamann, D. Dasari, S. Kramer, M. Pressler, and F. Wurst, "Communication Centric Design in Complex Automotive Embedded Systems," in 29th Euromicro Conference on Real-Time Systems (ECRTS 2017), vol. 76, 2017.
[15] C. Sofronis, S. Tripakis, and P. Caspi, "A memory-optimal buffering protocol for preservation of synchronous semantics under preemptive scheduling," in EMSOFT Conference, Seoul, Korea, October 22–25, 2006.
[16] G. Wang, M. Di Natale, and A. Sangiovanni-Vincentelli, "Improving the size of communication buffers in synchronous models with time constraints," IEEE Transactions on Industrial Informatics, vol. 5, no. 3, pp. 229–240, 2009.
[17] H. Zeng and M. Di Natale, "Mechanisms for guaranteeing data consistency and flow preservation in AUTOSAR software on multi-core platforms," in 6th IEEE International Symposium on Industrial Embedded Systems (SIES), Vasteras, Sweden, June 2011.
[18] R. Tabish, R. Mancuso, S. Wasly, A. Alhammad, S. S. Phatak, R. Pelliz-
X. C ONCLUSIONS AND F UTURE W ORK zoni, and M. Caccamo, “A real-time scratchpad-centric OS for multi-core
embedded systems,” in IEEE Real-Time and Embedded Technology and
We presented a scheme for the practical implementation Applications Symposium (RTAS), April 2016.
of the LET execution model in multicores. We discussed the [19] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, and
R. Kegley, “A predictable execution model for COTS-based embedded
benefits arising from the use of LET not only in terms of systems,” in 17th IEEE Real-Time and Embedded Technology and Appli-
a predictable model of computation with deterministic output cations Symposium, April 2011.
times, but also the potential for scheduling memory accesses [20] G. Yao, R. Pellizzoni, S. Bak, E. Betti, and M. Caccamo, “Memory-
centric scheduling for multicore hard real-time systems,” Real-Time
avoiding excessive contention. An actual implementation on Systems, vol. 48, no. 6, pp. 681–715, Nov 2012.
an automotive platform has been presented and its implemen- [21] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar,
tation issues have been discussed. Overall, it emerged that “Bounding and reducing memory interference in COTS-based multi-core
systems,” Real-Time Systems, vol. 52, no. 3, pp. 356–395, May 2016.
the realization of LET communication requires facing with [22] G. Gracioli, A. Alhammad, R. Mancuso, A. A. Fröhlich, and R. Pel-
several challenging design problems, which should possibly be lizzoni, “A survey on cache management mechanisms for real-time
integrated in a holistic synthesis methodology that optimizes embedded systems,” ACM Comput. Surv., vol. 48, no. 2, Nov. 2015.
[23] R. Mancuso, R. Pellizzoni, M. Caccamo, L. Sha, and H. Yun, “WCET(m)
the communication infrastructure for a given application. This estimation in multi-core systems using Single Core Equivalence,” in
observation lays the foundations for very interesting future ECRTS, 2015.
works. [24] R. I. Davis, S. Altmeyer, L. S. Indrusiak, C. Maiza, V. Nelis, and
J. Reineke, “An extensible framework for multicore response time anal-
ACKNOWLEDGMENTS ysis,” Real-Time Systems, Jul 2017.

The authors like to thank Giuseppe Serano, Errico Guidieri, and


Paolo Gai of Evidence S.R.L. for the valuable support provided