
Module 2 Part B (Mces 21cs43)

Uploaded by

gokul.shreeraj

Microcontroller and Embedded Systems 21CS43

C COMPILERS AND OPTIMIZATION


● Optimizing code takes time and reduces source code readability. Usually, it’s only worth
optimizing functions that are frequently executed and important for performance.
● We recommend you use a performance profiling tool, found in most ARM simulators, to
find these frequently executed functions.
● Document nonobvious optimizations with source code comments to aid maintainability.
● C compilers have to translate your C function literally into assembler so that it works for
all possible inputs.
● In practice, many of the input combinations are not possible or won’t occur. Let’s start by
looking at an example of the problems the compiler faces.
● The memclr function clears N bytes of memory at address data.
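
A minimal sketch of such a memclr function, assuming a char pointer and a signed int count (the exact signature is an assumption, not taken from these notes):

```c
/* Clear N bytes of memory starting at address data.
 * The N > 0 test runs before the first iteration, so a zero (or
 * negative) N clears nothing. This is the explicit entry check the
 * compiler has to generate. */
void memclr(char *data, int N)
{
    for (; N > 0; N--) {
        *data = 0;   /* one byte store per iteration */
        data++;
    }
}
```

Calling memclr(buf + 1, 4) on a six-byte buffer clears the middle four bytes and leaves the first and last untouched.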

● No matter how advanced the compiler, it does not know whether N can be 0 on input or
not. Therefore the compiler needs to test for this case explicitly before the first iteration
of the loop.
● The compiler doesn’t know whether the data array pointer is four-byte aligned or not. If it
is four-byte aligned, then the compiler can clear four bytes at a time using an int store
rather than a char store.
● Nor does it know whether N is a multiple of four or not. If N is a multiple of four, then
the compiler can repeat the loop body four times or store four bytes at a time using an int
store.
To keep our examples concrete, we have tested them using the following specific C compilers:
❖ armcc from ARM Developer Suite version 1.1 (ADS1.1). You can license this compiler,
or a later version, directly from ARM.
❖ arm-elf-gcc version 2.95.2. This is the ARM target for the GNU C compiler, gcc, and is
freely available.
We have used armcc from ADS1.1 to generate the example assembler output in this book. The
following short script shows you how to invoke armcc on a C file test.c. You can use this to
reproduce our examples.

By default armcc has full optimizations turned on (the -O2 command line switch). The -Otime
switch optimizes for execution efficiency rather than space and mainly affects the layout of for
and while loops. If you are using the gcc compiler, then the following short script generates a
similar assembler output listing:

Basic C Data Types


ARM supports operations on different data types.
The data types we can load (or store) are signed and unsigned words, halfwords, and bytes. The
load instructions take the suffix H or SH for unsigned or signed halfwords and B or SB for
unsigned or signed bytes; words take no suffix. The difference between signed and unsigned data types is:
Signed data types can hold both positive and negative values, so their largest positive value is
about half that of the unsigned type of the same width.
Unsigned data types can hold only non-negative values (including zero), so they reach a larger
maximum value.
● ARM processors have 32-bit registers and 32-bit data processing operations. The ARM
architecture is a RISC load/store architecture.
● In other words you must load values from memory into registers before acting on them.
There are no arithmetic or logical instructions that manipulate values in memory directly.

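● The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores
directly, through new instructions.
● ARMv5 adds instruction support for 64-bit load and stores. This is available in ARM9E
and later cores.
● ARM processors before ARMv4 had no direct support for loading signed 8-bit values or
any 16-bit values. Therefore ARM C compilers define char to be an unsigned 8-bit value,
rather than a signed 8-bit value as is typical in many other compilers.
● Compilers armcc and gcc use the following datatype mappings for an ARM target: char is
an unsigned 8-bit byte, short is a signed 16-bit halfword, int and long are signed 32-bit
words, and long long is a signed 64-bit double word.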
● A common pitfall when porting code to ARM is using a char type variable i as a loop
counter, with loop continuation condition i ≥ 0.
● As i is unsigned for the ARM compilers, the loop will never terminate. Fortunately armcc
produces a warning in this situation: unsigned comparison with 0.
● Compilers also provide an override switch to make char signed. For example, the
command line option -fsigned-char will make char signed on gcc.
● The command line option -zc will have the same effect with armcc.
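
The loop-counter pitfall above can be sketched as follows. The function names and the limit parameter are hypothetical; the counter is declared unsigned char explicitly so that it behaves like ARM's default char on any host, and the limit exists only so the demonstration terminates.

```c
/* Count down from start to 0 inclusive with an int counter.
 * The condition i >= 0 eventually fails, so the loop terminates. */
int count_down_int(int start)
{
    int iterations = 0;
    int i;
    for (i = start; i >= 0; i--) {
        iterations++;
    }
    return iterations;
}

/* Same loop with an unsigned 8-bit counter. i >= 0 is always true:
 * after i reaches 0 the decrement wraps to 255. Only the hypothetical
 * safety limit stops the loop. */
int count_down_uchar(unsigned char start, int limit)
{
    int iterations = 0;
    unsigned char i;
    for (i = start; i >= 0 && iterations < limit; i--) {
        iterations++;
    }
    return iterations;
}
```

With start = 3 the int version runs four iterations, while the unsigned char version runs until it hits the limit.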

Local Variable Types


● ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. However,
most ARM data processing operations are 32-bit only.
● For this reason, you should use a 32-bit datatype, int or long, for local variables wherever
possible.
● Avoid using char and short as local variable types, even if you are manipulating an 8- or
16-bit value.
● The one exception is when you want wrap-around to occur. If you require modulo
arithmetic of the form 255 + 1 = 0, then use the char type.
● The following code checksums a data packet containing 64 words. It shows why you
should avoid using char for local variables.
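
A sketch of the two variants being compared, assuming the packet is an array of 64 int words. The name checksum_v2 appears later in these notes; checksum_v1 and both function bodies are assumptions made for illustration.

```c
/* checksum_v1: 8-bit loop counter. On ARM the compiler typically has
 * to re-truncate i to the range 0..255 after each increment, because
 * the register holding it is 32 bits wide. */
int checksum_v1(int *data)
{
    char i;
    int sum = 0;
    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}

/* checksum_v2: 32-bit loop counter. The counter matches the register
 * width, so no truncation is needed. */
int checksum_v2(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
```

Both functions return the same result; the difference lies only in the instructions the compiler needs for the loop counter.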

When the packet holds 16-bit values and the running total is itself a short, the loop is three
instructions longer than the loop for example checksum_v2 earlier! There are two reasons for
the extra instructions:
● The LDRH instruction does not allow for a shifted address offset as the LDR instruction
did in checksum_v2. Therefore the first ADD in the loop calculates the address of item i
in the array. The LDRH loads from an address with no offset. LDRH has fewer
addressing modes than LDR as it was a later addition to the ARM instruction set.
● The cast reducing total + array[i] to a short requires two MOV instructions. The compiler
shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is
a sign-extending shift so it replicates the sign bit to fill the upper 16 bits.
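
The 16-bit case can be sketched as below. The names checksum_v3 and checksum_v4 and both bodies are assumptions; sum and data[i] play the roles of total and array[i] in the text. The second version shows one way to avoid both costs: keep the total in an int and step a pointer through the array.

```c
/* checksum_v3: short total. Each iteration casts the 32-bit sum back
 * to 16 bits, which costs the shift-left/shift-right pair on ARM. */
short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;
    for (i = 0; i < 64; i++) {
        sum = (short)(sum + data[i]);
    }
    return sum;
}

/* checksum_v4: int total, cast once on return. Incrementing the
 * pointer avoids computing data + 2*i for every halfword load. */
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return (short)sum;
}
```

For totals that fit in 16 bits the two functions agree.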

FUNCTION ARGUMENT TYPES


Consider the following simple function, which adds two 16-bit values, halving the second, and
returns a 16-bit sum:
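
A minimal sketch of such a function; the name add_v1 is an assumption:

```c
/* Add a to half of b. The arithmetic is done in 32-bit int and the
 * result is narrowed to short on return, so the range reduction
 * discussed in the text has to happen somewhere around this call. */
short add_v1(short a, short b)
{
    return a + (b >> 1);
}
```

For example, add_v1(10, 4) yields 12.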

● The input values a, b, and the return value will be passed in 32-bit ARM registers. Should
the compiler assume that these 32-bit values are in the range of a short type, that is,
−32,768 to +32,767?
● Or should the compiler force values to be in this range by sign-extending the lowest 16
bits to fill the 32-bit register?
● The compiler must make compatible decisions for the function caller and callee. Either
the caller or callee must perform the cast to a short type.
● If the compiler passes arguments wide, then the callee must reduce function arguments to
the correct range. If the compiler passes arguments narrow, then the caller must reduce
the range.
● If the compiler returns values wide, then the caller must reduce the return value to the
correct range. If the compiler returns values narrow, then the callee must reduce the range
before returning the value.

SIGNED VERSUS UNSIGNED TYPES

It is more efficient to use unsigned types for divisions. The compiler converts unsigned divisions
by a power of two directly to right shifts. For general divisions, the divide routine in the C
library is faster for unsigned types.
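
A small sketch of the difference, using two hypothetical averaging functions. Signed division in C rounds toward zero, so for a negative sum the compiler cannot use a plain arithmetic shift (which rounds toward minus infinity) and has to add a correction first. The unsigned version needs only a single logical shift.

```c
/* Signed average: (a + b) / 2 rounds toward zero, so (-3) / 2 is -1.
 * The compiler must correct negative sums before shifting. */
int average_v1(int a, int b)
{
    return (a + b) / 2;
}

/* Unsigned average: the division by 2 maps to one right shift. */
unsigned int average_v2(unsigned int a, unsigned int b)
{
    return (a + b) / 2;
}
```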

C LOOPING STRUCTURES

LOOPS WITH A FIXED NUMBER OF ITERATIONS


The code below shows how the compiler treats a loop with incrementing count i++.
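
A sketch of a fixed 64-iteration checksum loop in both styles. The names checksum_v5 and checksum_v6 are assumptions. An incrementing loop typically needs an add, a compare against 64, and a branch per iteration; a loop that counts down to zero can fold the compare into the subtract.

```c
/* Incrementing counter: the end test compares i with 64. */
int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return sum;
}

/* Decrementing counter: the end test is a comparison with zero,
 * which the subtract instruction can provide for free on ARM. */
int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 64; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}
```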

● For an unsigned loop counter i we can use either of the loop continuation conditions
i!=0 or i>0.
● As i can’t be negative, they are the same condition. For a signed loop counter, it is
tempting to use the condition i>0 to continue the loop.
● You might expect the compiler to generate the following two instructions to implement
the loop:

LOOPS USING A VARIABLE NUMBER OF ITERATIONS


Now suppose we want our checksum routine to handle packets of arbitrary size. We pass in a
variable N giving the number of words in the data packet. Using the lessons from the last section
we count down until N = 0 and don’t require an extra loop counter i.

The checksum_v7 example shows how the compiler handles a for loop with a variable
number of iterations N.
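
A minimal sketch of checksum_v7 as described, assuming an int packet and an unsigned count. Because N may be zero, the compiler has to test it once before entering the loop.

```c
/* Sum N words, counting N down to zero instead of using a separate
 * loop counter. N == 0 is allowed and yields a sum of 0. */
int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;
    for (; N != 0; N--) {
        sum += *(data++);
    }
    return sum;
}
```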

● On ARM7 or ARM9 processors the subtract takes one cycle and the branch three cycles,
giving an overhead of four cycles per loop.
● You can save some of these cycles by unrolling a loop—repeating the loop body several
times, and reducing the number of loop iterations by the same proportion.
● For example, let’s unroll our packet checksum example four times.
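
A sketch of the checksum loop unrolled four times. The name checksum_v9 is an assumption, and the function assumes N is a nonzero multiple of four; it is not safe for other values.

```c
/* Four loads and adds per iteration, so the subtract-and-branch
 * overhead is paid once per four words instead of once per word.
 * Precondition: N is a nonzero multiple of 4. */
int checksum_v9(int *data, unsigned int N)
{
    int sum = 0;
    do {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);
    return sum;
}
```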

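Two questions arise when unrolling a loop: how many times should you unroll it, and what happens
if the number of iterations is not a multiple of the unroll amount? To start with the first
question, only unroll loops that are important for the overall performance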
of the application. Otherwise unrolling will increase the code size with little performance benefit.
Unrolling may even reduce performance by evicting more important code from the cache.

For the second question, try to arrange it so that array sizes are multiples of your unroll amount.
If this isn’t possible, then you must add extra code to take care of the leftover cases. This
increases the code size a little but keeps the performance high.
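
One way to handle the leftover cases, sketched under the assumption of a four-times unroll. The name checksum_v10 is hypothetical. The first loop processes N/4 groups of four words; the second mops up the remaining N mod 4 words.

```c
/* Unrolled checksum that accepts any N, including 0. */
int checksum_v10(int *data, unsigned int N)
{
    unsigned int i;
    int sum = 0;

    /* Main loop: groups of four words. */
    for (i = N / 4; i != 0; i--) {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    /* Leftover loop: the last N & 3 words, one at a time. */
    for (i = N & 3; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}
```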

SUMMARY: Writing Loops Efficiently

REGISTER ALLOCATION
● The compiler attempts to allocate a processor register to each local variable you use in a
C function.
● It will try to use the same register for different local variables if the uses of the variables
do not overlap.
● When there are more local variables than available registers, the compiler stores the
excess variables on the processor stack.
● These variables are called spilled or swapped out variables since they are written out to
memory (in a similar way virtual memory is swapped out to disk).
● Spilled variables are slow to access compared to variables allocated to registers.

First let’s look at the number of processor registers the ARM C compilers have available for
allocating variables. The table below shows the standard register names and usage when following
the ARM-Thumb procedure call standard (ATPCS), which is used in code generated by C
compilers.

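● Registers r13 (stack pointer) and r15 (program counter) are reserved, so the C compiler can
use r0 to r12 and r14 and can assign up to 14 variables to registers without spillage.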


● In practice, some compilers use a fixed register such as r12 for intermediate scratch
working and do not assign variables to this register.
● Also, complex expressions require intermediate working registers to evaluate. Therefore,
to ensure good assignment to registers, you should try to limit the internal loop of
functions to using at most 12 local variables.
● If the compiler does need to swap out variables, then it chooses which variables to swap
out based on frequency of use.
● A variable used inside a loop counts multiple times. You can guide the compiler as to
which variables are important by ensuring these variables are used within the innermost
loop.
● The register keyword in C hints that a compiler should allocate the given variable to
a register.
● However, different compilers treat this keyword in different ways, and different
architectures have a different number of available registers (for example, Thumb and
ARM).
● Therefore we recommend that you avoid using register and rely on the compiler’s
normal register allocation routine.

FUNCTION CALLS
● The ARM Procedure Call Standard (APCS) defines how to pass function arguments and
return values in ARM registers.
● The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and
Thumb interworking as well.
● The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and
r3. Subsequent integer arguments are placed on the full descending stack, at ascending
memory addresses, as shown in the figure. Function return integer values are passed in r0.
● This description covers only integer or pointer arguments. Two-word arguments such as
long long or double are passed in a pair of consecutive argument registers and
returned in r0, r1.
● The compiler may pass structures in registers or by reference according to command line
compiler options.
● The first point to note about the procedure call standard is the four-register rule.
● Functions with four or fewer arguments are far more efficient to call than functions with
five or more arguments.
● For functions with four or fewer arguments, the compiler can pass all the arguments in
registers.
● For functions with more arguments, both the caller and callee must access the stack for
some arguments.
● Note that for C++ the first argument to an object method is the this pointer. This
argument is implicit and additional to the explicit arguments.
● If your C function needs more than four arguments, or your C++ method more than three
explicit arguments, then it is almost always more efficient to use structures.
● Group related arguments into structures, and pass a structure pointer rather than multiple
arguments. Which arguments are related will depend on the structure of your software.

The next example illustrates the benefits of using a structure pointer. First we show a typical
routine to insert N bytes from array data into a queue. We implement the queue using a cyclic
buffer with start address Q_start (inclusive) and end address Q_end (exclusive).
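
A sketch of such a five-argument routine. The name queue_bytes_v1 appears later in these notes; the parameter Q_ptr (the current insert position) and the function body are assumptions. N must be at least 1.

```c
/* Insert N bytes from data into a cyclic buffer and return the new
 * insert position. With five arguments, the fifth (N) is passed on
 * the stack under the four-register rule. */
char *queue_bytes_v1(
    char *Q_start,      /* queue buffer start address (inclusive) */
    char *Q_end,        /* queue buffer end address (exclusive)   */
    char *Q_ptr,        /* current insert position                */
    char *data,         /* bytes to insert                        */
    unsigned int N)     /* number of bytes to insert, N >= 1      */
{
    do {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end) {
            Q_ptr = Q_start;   /* wrap around to the start */
        }
    } while (--N);
    return Q_ptr;
}
```

Inserting three bytes at offset 2 of a four-byte buffer writes offsets 2, 3, and 0, and returns a pointer to offset 1.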

Example
The following code creates a Queue structure and passes this to the function to reduce the
number of function arguments.
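
A sketch of the structure-based version. The field names and the function body are assumptions chosen to match the queue description above. N must be at least 1.

```c
/* Queue state grouped into one structure. */
typedef struct {
    char *Q_start;   /* queue buffer start address (inclusive) */
    char *Q_end;     /* queue buffer end address (exclusive)   */
    char *Q_ptr;     /* current insert position                */
} Queue;

/* Insert N bytes from data into the queue. Only three arguments,
 * so all of them travel in registers. */
void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;   /* work on local copies */
    char *Q_end = queue->Q_end;

    do {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end) {
            Q_ptr = queue->Q_start;   /* wrap around to the start */
        }
    } while (--N);
    queue->Q_ptr = Q_ptr;             /* write the position back */
}
```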

● The queue_bytes_v2 is one instruction longer than queue_bytes_v1, but it is in
fact more efficient overall.
● The second version has only three function arguments rather than five. Each call to the
function requires only three register setups.
● This compares with four register setups, a stack push, and a stack pull for the first
version. There is a net saving of two instructions in function call overhead.
● There are likely further savings in the callee function, as it only needs to assign a single
register to the Queue structure pointer, rather than three registers in the nonstructured
case.

Example
The function uint_to_hex converts a 32-bit unsigned integer into an array of eight
hexadecimal digits. It uses a helper function nybble_to_hex, which converts a digit d in the
range 0 to 15 to a hexadecimal digit.
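
A sketch of the two functions, assuming uppercase digits and most-significant digit first. uint32_t is used here so the 32-bit width holds on any host; the bodies are assumptions. Defining the small helper before its caller in the same file gives the compiler the chance to inline it.

```c
#include <stdint.h>

/* Convert a value d in the range 0..15 to its hexadecimal digit. */
static unsigned int nybble_to_hex(unsigned int d)
{
    if (d < 10) {
        return d + '0';
    }
    return d - 10 + 'A';
}

/* Write the eight hexadecimal digits of in to out (no terminator). */
void uint_to_hex(char *out, uint32_t in)
{
    unsigned int i;
    for (i = 8; i != 0; i--) {
        in = (in << 4) | (in >> 28);   /* rotate left by 4 bits */
        *(out++) = (char)nybble_to_hex(in & 15);
    }
}
```

Each rotation brings the next most significant nybble into the low four bits, so 0x1234ABCD is written as the characters 1234ABCD.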

The compiler will only inline small functions. You can ask the compiler to inline a function using
the inline keyword, although this keyword is only a hint and the compiler may ignore it.
Inlining large functions can lead to big increases in code size without much performance
improvement.

POINTER ALIASING
● Two pointers are said to alias when they point to the same address.
● If you write to one pointer, it will affect the value you read from the other pointer. In a
function, the compiler often doesn’t know which pointers can alias and which pointers can’t.
● The compiler must be very pessimistic and assume that any write to a pointer may affect
the value read from any other pointer, which can significantly reduce code efficiency.
● Note that the compiler loads from step twice. Usually a compiler optimization called
common subexpression elimination would kick in so that *step was only evaluated
once, and the value reused for the second occurrence.
● However, the compiler can’t use this optimization here. The pointers timer1 and step
might alias one another.
● In other words, the compiler cannot be sure that the write to timer1 doesn’t affect the
read from step.
● If the two pointers do alias, the second value of *step differs from the first and equals the
updated value of *timer1. This possibility forces the compiler to insert an extra load instruction.
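
The situation described above can be sketched as follows. The function names, the second timer argument timer2, and both bodies are assumptions. The second version reads *step once into a local variable, which cannot alias anything, so the reload disappears.

```c
/* *step is read twice. The compiler must reload it after the write
 * to *timer1, because timer1 and step might point to the same int. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* *step is read once into a local. Note that this changes the
 * result in the aliased case, so it is only valid when the caller
 * intends both timers to advance by the original step. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int s = *step;
    *timer1 += s;
    *timer2 += s;
}
```

When step aliases timer1 with an initial value of 3, timers_v1 adds 6 to the second timer while timers_v2 adds 3.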

Example

Consider the following example, which reads and then checksums a data packet:
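
A sketch of such a routine. The name checksum_next_packet, the static test packet, and the body of get_next_packet are hypothetical stand-ins so the example is self-contained. Because the address of N is passed to another function, the compiler has to keep N in memory and cannot assume it stays unchanged across writes through other pointers.

```c
/* Hypothetical packet source: a fixed four-word packet. */
static int packet_words[4] = {1, 2, 3, 4};

/* Return the address of the next packet and store its size in *N. */
int *get_next_packet(int *N)
{
    *N = 4;
    return packet_words;
}

/* Read the next packet and return the sum of its words. */
int checksum_next_packet(void)
{
    int *data;
    int N, sum = 0;

    data = get_next_packet(&N);   /* taking &N exposes N to aliasing */
    do {
        sum += *(data++);
    } while (--N);
    return sum;
}
```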

Here get_next_packet is a function returning the address and size of the next data
packet. The previous code compiles to