Module 2 Part B (Mces 21cs43)
● No matter how advanced the compiler, it does not know whether N can be 0 on input or
not. Therefore the compiler needs to test for this case explicitly before the first iteration
of the loop.
● The compiler doesn’t know whether the data array pointer is four-byte aligned or not. If it
is four-byte aligned, then the compiler can clear four bytes at a time using an int store
rather than a char store.
● Nor does it know whether N is a multiple of four or not. If N is a multiple of four, then
the compiler can repeat the loop body four times or store four bytes at a time using an int
store.
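The three points above can be made concrete with a small sketch. The first function is the kind of byte-clearing loop under discussion. The second is a hypothetical hand-written variant (the name memclr_words and its contract are assumptions made for illustration), showing what becomes possible once the programmer guarantees what the compiler cannot know.

```c
#include <stdint.h>

/* Byte-at-a-time clear. The compiler must allow for N == 0,
   for any alignment of data, and for any value of N. */
void memclr(char *data, int N)
{
    for (; N > 0; N--) {
        *data = 0;
        data++;
    }
}

/* Hypothetical variant: the caller promises that data is four-byte
   aligned and that N is a nonzero multiple of four, so one int-sized
   store clears four bytes and no test is needed before the first pass. */
void memclr_words(uint32_t *data, unsigned int N)
{
    unsigned int words = N / 4;

    do {
        *data++ = 0;
    } while (--words != 0);
}
```

The second version is only correct under the stated promises; calling it with N equal to 0 would clear far more memory than intended.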
To keep our examples concrete, we have tested them using the following specific C compilers:
❖ armcc from ARM Developer Suite version 1.1 (ADS1.1). You can license this compiler,
or a later version, directly from ARM.
❖ arm-elf-gcc version 2.95.2. This is the ARM target for the GNU C compiler, gcc, and is
freely available.
We have used armcc from ADS1.1 to generate the example assembler output in this book. The
following short script shows you how to invoke armcc on a C file test.c. You can use this to
reproduce our examples.
By default armcc has full optimizations turned on (the -O2 command line switch). The -Otime
switch optimizes for execution efficiency rather than space and mainly affects the layout of for
and while loops. If you are using the gcc compiler, then the following short script generates a
similar assembler output listing:
● In other words you must load values from memory into registers before acting on them.
There are no arithmetic or logical instructions that manipulate values in memory directly.
● The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores
directly, through new instructions.
● ARMv5 adds instruction support for 64-bit loads and stores. This is available in ARM9E
and later cores.
● Therefore ARM C compilers define char to be an unsigned 8-bit value, rather than a
signed 8-bit value as is typical in many other compilers.
● Compilers armcc and gcc use the datatype mappings
● A common example is using a char type variable i as a loop counter, with loop
continuation condition i ≥ 0.
● As i is unsigned for the ARM compilers, the loop will never terminate. Fortunately armcc
produces a warning in this situation: unsigned comparison with 0.
● Compilers also provide an override switch to make char signed. For example, the
command line option -fsigned-char will make char signed on gcc.
● The command line option -zc will have the same effect with armcc.
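The loop-counter pitfall described above can be reproduced on any host with a short sketch. The counter is declared unsigned char explicitly to model how plain char behaves on the ARM compilers. The function names and the cap parameter are assumptions added for illustration; the cap exists only so that the otherwise endless loop returns.

```c
/* Counter modelled as unsigned 8-bit, like plain char on ARM compilers.
   The condition i >= 0 is always true: after 0 the counter wraps to 255.
   cap is a safety stop added for this sketch. */
unsigned int count_down_unsigned(unsigned int cap)
{
    unsigned char i;
    unsigned int iterations = 0;

    for (i = 3; i >= 0; i--) {
        if (++iterations == cap) {
            break;              /* without this the loop never ends */
        }
    }
    return iterations;
}

/* Same loop with a signed counter: runs for i = 3, 2, 1, 0 and stops. */
unsigned int count_down_signed(void)
{
    signed char i;
    unsigned int iterations = 0;

    for (i = 3; i >= 0; i--) {
        iterations++;
    }
    return iterations;
}
```

The signed version performs four iterations, while the unsigned version stops only because of the cap.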
The loop is now three instructions longer than the loop for example checksum_v2 earlier! There
are two reasons for the extra instructions:
● The LDRH instruction does not allow for a shifted address offset as the LDR instruction
did in checksum_v2. Therefore the first ADD in the loop calculates the address of item i
in the array. The LDRH loads from an address with no offset. LDRH has fewer
addressing modes than LDR as it was a later addition to the ARM instruction set.
● The cast reducing total + array[i] to a short requires two MOV instructions. The compiler
shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is
a sign-extending shift so it replicates the sign bit to fill the upper 16 bits.
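A minimal sketch of the two styles being compared, assuming an array of short values and an element count n (the function names and the parameter n are illustrative, not taken from the original listings). The first version keeps a short running total, so every addition needs a narrowing cast. The second accumulates in an int and narrows once at the end.

```c
/* Running total held in a short: the result of total + array[i] is an
   int, so each iteration has to narrow it back to 16 bits. */
short checksum_short(const short *array, unsigned int n)
{
    unsigned int i;
    short total = 0;

    for (i = 0; i < n; i++) {
        total = (short)(total + array[i]);
    }
    return total;
}

/* Running total held in an int: no per-iteration narrowing, and the
   pointer is stepped directly instead of being indexed. */
short checksum_int(const short *array, unsigned int n)
{
    int total = 0;

    for (; n != 0; n--) {
        total += *array++;
    }
    return (short)total;
}
```

Both functions return the same value whenever the true sum fits in 16 bits; the second form simply gives the compiler less casting work inside the loop.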
● The input values a, b, and the return value will be passed in 32-bit ARM registers. Should
the compiler assume that these 32-bit values are in the range of a short type, that is,
−32,768 to +32,767?
● Or should the compiler force values to be in this range by sign-extending the lowest 16
bits to fill the 32-bit register?
● The compiler must make compatible decisions for the function caller and callee. Either
the caller or callee must perform the cast to a short type.
● If the compiler passes arguments wide, then the callee must reduce function arguments to
the correct range. If the compiler passes arguments narrow, then the caller must reduce
the range.
● If the compiler returns values wide, then the caller must reduce the return value to the
correct range. If the compiler returns values narrow, then the callee must reduce the range
before returning the value.
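The questions above arise for any function whose arguments and return value are narrower than a register. As a hedged sketch (both function names are assumptions), compare a short-typed function with an int-typed one. In the first, some combination of caller and callee has to reduce values to the short range; in the second, nothing needs reducing.

```c
/* Narrow types: the sum is computed as an int and must be reduced to
   the range of a short before it is returned. Whether the arguments
   also need reducing depends on the calling convention in use. */
short add_narrow(short a, short b)
{
    return (short)(a + (b >> 1));
}

/* Register-width types: arguments and result already fill a 32-bit
   register, so no range reduction is required on either side. */
int add_wide(int a, int b)
{
    return a + (b >> 1);
}
```

For values that fit in a short the two functions agree; the difference lies only in the extra range-reduction instructions the narrow version may require.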
It is more efficient to use unsigned types for divisions. The compiler converts unsigned divisions
by a power of two directly to right shifts. For general divisions, the divide routine in the C
library is faster for unsigned types.
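A short sketch of why the signed case costs more, using two hypothetical averaging functions. Signed division in C rounds toward zero, so dividing a negative sum by two is not the same as a plain arithmetic shift, and the compiler has to add a correction step. The unsigned division is exactly a right shift.

```c
/* Signed: (a + b) / 2 rounds toward zero, for example -3 / 2 == -1,
   so a single shift is not enough when the sum is negative. */
int average_signed(int a, int b)
{
    return (a + b) / 2;
}

/* Unsigned: dividing by two is exactly a one-bit right shift. */
unsigned int average_unsigned(unsigned int a, unsigned int b)
{
    return (a + b) / 2;
}
```

If the values are known to be non-negative, declaring them unsigned lets the compiler pick the cheaper form.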
C LOOPING STRUCTURES
● For an unsigned loop counter i we can use either of the loop continuation conditions
i!=0 or i>0.
● As i can’t be negative, they are the same condition. For a signed loop counter, it is
tempting to use the condition i>0 to continue the loop.
● You might expect the compiler to generate the following two instructions to implement
the loop:
The checksum_v7 example shows how the compiler handles a for loop with a variable
number of iterations N.
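As a hedged illustration of these loop forms (the function names are assumptions, not the original checksum_v7 listing), the sketch below counts an unsigned N down to zero with the condition N != 0. The do-while variant skips the test before the first iteration and is therefore valid only when the caller guarantees N is at least 1.

```c
/* Count-down for loop: safe for any N, including 0. */
int checksum_for(const int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--) {
        sum += *data++;
    }
    return sum;
}

/* Count-down do-while loop: assumes N >= 1, so no initial test of N
   is needed before entering the loop body. */
int checksum_do(const int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *data++;
    } while (--N != 0);
    return sum;
}
```

Because the counter is unsigned and ends at zero, the decrement and the termination test can be combined by the compiler.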
● On ARM7 or ARM9 processors the subtract takes one cycle and the branch three cycles,
giving an overhead of four cycles per loop.
● You can save some of these cycles by unrolling a loop—repeating the loop body several
times, and reducing the number of loop iterations by the same proportion.
● For example, let’s unroll our packet checksum example four times.
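A minimal sketch of a checksum loop unrolled four times, assuming the packet holds int values and that N is a nonzero multiple of four (the function name is illustrative). The subtract-and-branch overhead is now paid once per four elements instead of once per element.

```c
/* Unrolled by four. Assumes N is a nonzero multiple of four. */
int checksum_unrolled(const int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *data++;
        sum += *data++;
        sum += *data++;
        sum += *data++;
        N -= 4;
    } while (N != 0);
    return sum;
}
```

Calling this with an N that is not a multiple of four would run past the end of the array, so the assumption has to be enforced by the caller.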
To start with the first question, only unroll loops that are important for the overall performance
of the application. Otherwise unrolling will increase the code size with little performance benefit.
Unrolling may even reduce performance by evicting more important code from the cache.
For the second question, try to arrange it so that array sizes are multiples of your unroll amount.
If this isn’t possible, then you must add extra code to take care of the leftover cases. This
increases the code size a little but keeps the performance high.
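One way to handle the leftover cases, sketched under the assumption of an int array of arbitrary length N (the function name is illustrative): run the unrolled loop N / 4 times, then a short loop for the remaining N & 3 elements.

```c
/* Unrolled by four with a clean-up loop, so N may be any value. */
int checksum_unrolled_any(const int *data, unsigned int N)
{
    unsigned int i;
    int sum = 0;

    for (i = N / 4; i != 0; i--) {      /* whole groups of four */
        sum += *data++;
        sum += *data++;
        sum += *data++;
        sum += *data++;
    }
    for (i = N & 3; i != 0; i--) {      /* zero to three leftovers */
        sum += *data++;
    }
    return sum;
}
```

The second loop adds a few instructions of code but executes at most three times, so most of the work still runs in the unrolled body.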
REGISTER ALLOCATION
● The compiler attempts to allocate a processor register to each local variable you use in a
C function.
● It will try to use the same register for different local variables if the uses of the variables
do not overlap.
● When there are more local variables than available registers, the compiler stores the
excess variables on the processor stack.
● These variables are called spilled or swapped out variables since they are written out to
memory (in a similar way to how virtual memory is swapped out to disk).
● Spilled variables are slow to access compared to variables allocated to registers.
First let’s look at the number of processor registers the ARM C compilers have available for
allocating variables. The table below shows the standard register names and usage when following
the ARM-Thumb procedure call standard (ATPCS), which is used in code generated by C
compilers.
FUNCTION CALLS
● The ARM Procedure Call Standard (APCS) defines how to pass function arguments and
return values in ARM registers.
● The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and
Thumb interworking as well.
● The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and
r3. Subsequent integer arguments are placed on the full descending stack, ascending in
memory as shown in the figure. Function return integer values are passed in r0.
● This description covers only integer or pointer arguments. Two-word arguments such as
long long or double are passed in a pair of consecutive argument registers and
returned in r0, r1.
● The compiler may pass structures in registers or by reference according to command line
compiler options.
● The first point to note about the procedure call standard is the four-register rule.
● Functions with four or fewer arguments are far more efficient to call than functions with
five or more arguments.
● For functions with four or fewer arguments, the compiler can pass all the arguments in
registers.
● For functions with more arguments, both the caller and callee must access the stack for
some arguments.
● Note that for C++ the first argument to an object method is the this pointer. This
argument is implicit and additional to the explicit arguments.
● If your C function needs more than four arguments, or your C++ method more than three
explicit arguments, then it is almost always more efficient to use structures.
● Group related arguments into structures, and pass a structure pointer rather than multiple
arguments. Which arguments are related will depend on the structure of your software.
The next example illustrates the benefits of using a structure pointer. First we show a typical
routine to insert N bytes from array data into a queue. We implement the queue using a cyclic
buffer with start address Q_start (inclusive) and end address Q_end (exclusive).
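A sketch of such a routine, assuming N is at least 1 and using an illustrative function name. It takes five arguments, so under the four-register rule the last one has to be passed on the stack.

```c
/* Inserts N bytes (N >= 1) from data into the cyclic buffer
   [Q_start, Q_end), starting at Q_ptr. Returns the updated Q_ptr.
   Five arguments: the fifth cannot travel in r0 to r3. */
char *queue_bytes_v1(char *Q_start, char *Q_end, char *Q_ptr,
                     const char *data, unsigned int N)
{
    do {
        *Q_ptr++ = *data++;
        if (Q_ptr == Q_end) {
            Q_ptr = Q_start;        /* wrap around to the start */
        }
    } while (--N != 0);
    return Q_ptr;
}
```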
Example
The following code creates a Queue structure and passes this to the function to reduce the
number of function arguments.
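A sketch of the structure-based form, assuming N is at least 1; the Queue field names follow the Q_start, Q_end, and Q_ptr names used in the text, and the function name is illustrative. Only three arguments remain, so all of them fit in registers.

```c
typedef struct {
    char *Q_start;   /* first byte of the buffer (inclusive) */
    char *Q_end;     /* one past the last byte (exclusive) */
    char *Q_ptr;     /* current insertion point */
} Queue;

/* Inserts N bytes (N >= 1) from data into the queue and updates the
   insertion point stored in the structure. */
void queue_bytes_v2(Queue *queue, const char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;     /* work on local copies */
    char *Q_end = queue->Q_end;

    do {
        *Q_ptr++ = *data++;
        if (Q_ptr == Q_end) {
            Q_ptr = queue->Q_start;
        }
    } while (--N != 0);
    queue->Q_ptr = Q_ptr;
}
```

The caller builds the Queue once and then passes a single pointer on every call, instead of setting up three separate buffer arguments each time.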
Example
The function uint_to_hex converts a 32-bit unsigned integer into an array of eight
hexadecimal digits. It uses a helper function nybble_to_hex, which converts a digit d in the
range 0 to 15 to a hexadecimal digit.
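A minimal sketch of the two functions, assuming uppercase digits, most significant digit first, and no terminating null character written to the output. The helper is small, which makes it a natural candidate for inlining into the loop.

```c
#include <stdint.h>

/* Converts a value d in the range 0 to 15 to a hexadecimal digit. */
static inline unsigned int nybble_to_hex(unsigned int d)
{
    if (d < 10) {
        return d + '0';
    }
    return d - 10 + 'A';
}

/* Writes eight hexadecimal digits for in to out[0..7]. */
void uint_to_hex(char *out, uint32_t in)
{
    unsigned int i;

    for (i = 8; i != 0; i--) {
        in = (in << 4) | (in >> 28);    /* rotate left by four bits */
        *out++ = (char)nybble_to_hex(in & 15);
    }
}
```

Each rotation brings the next most significant nybble into the low four bits, so the digits come out in reading order.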
The compiler will only inline small functions. You can ask the compiler to inline a function using
the inline keyword, although this keyword is only a hint and the compiler may ignore it.
Inlining large functions can lead to big increases in code size without much performance
improvement.
POINTER ALIASING
● Two pointers are said to alias when they point to the same address.
● If you write to one pointer, it will affect the value you read from the other pointer. In a
function, the compiler often doesn’t know which pointers can alias and which pointers can’t.
● The compiler must be very pessimistic and assume that any write to a pointer may affect
the value read from any other pointer, which can significantly reduce code efficiency.
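A small sketch of a function where aliasing matters, using the pointer names timer1, timer2, and step (the function name and body are an illustrative assumption). Both timers are advanced by *step, and the source reads *step twice.

```c
/* Adds *step to both timers. If timer1 and step point to the same
   int, the first statement changes the value the second one reads. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}
```

With distinct pointers both timers advance by the same amount. If step is the same address as timer1, the second addition uses the already updated value, which is why the compiler cannot assume the two reads of *step return the same result.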
● Note that the compiler loads from step twice. Usually a compiler optimization called
common subexpression elimination would kick in so that *step was only evaluated
once, and the value reused for the second occurrence.
● However, the compiler can’t use this optimization here. The pointers timer1 and step
might alias one another.
● In other words, the compiler cannot be sure that the write to timer1 doesn’t affect the
read from step.
● In this case the second value of *step is different from the first and has the value
*timer1. This forces the compiler to insert an extra load instruction.
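One common way to avoid the extra load, sketched here with an illustrative function name, is to read *step once into a local variable. A local variable whose address is never taken cannot be aliased, so the compiler is free to keep it in a register.

```c
/* Reads *step once. The local copy cannot alias timer1 or timer2. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int step_value = *step;

    *timer1 += step_value;
    *timer2 += step_value;
}
```

This changes the result in the case where step and timer1 really do point to the same address, so it is appropriate only when the programmer knows that case either cannot occur or should use the original step value.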
Example
Consider the following example, which reads and then checksums a data packet:
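A minimal sketch of such a routine. The get_next_packet stub and its packet contents are hypothetical stand-ins so that the sketch is complete; only the shape of checksum_next_packet matters, and it assumes a packet size of at least 1.

```c
static int packet[4] = {1, 2, 3, 4};    /* hypothetical packet data */

/* Hypothetical stub: returns the packet address and stores its size
   through the pointer N. */
int *get_next_packet(int *N)
{
    *N = 4;
    return packet;
}

int checksum_next_packet(void)
{
    int *data;
    int N, sum = 0;

    /* The address of N is passed out of this function, so the compiler
       may have to keep N in memory rather than in a register. */
    data = get_next_packet(&N);
    do {
        sum += *data++;
    } while (--N);
    return sum;
}
```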
Here get_next_packet is a function returning the address and size of the next data
packet. The previous code compiles to