Code Optimization Word-Wide Optimization Mixing C and Assembly
Word-Wide Optimization
Mixing C and Assembly
To mix C and assembly, it is necessary to know the register convention used by the compiler to pass arguments. This convention is illustrated in Figure. DP, the base (data page) pointer, points to the beginning of the .bss section, which contains all global and static variables. SP, the stack pointer, points to local variables. The stack grows from higher memory to lower memory, as indicated in Figure. The odd registers that lie between the even argument registers are used when passing 40-bit or 64-bit values.
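
As a minimal sketch of the C side of such a mixed call, the fragment below declares a routine and calls it from C. The name dotp, its signature, and the helper call_dotp_example are assumptions made for illustration. On the target the body of dotp would be an assembly routine; the C body given here is only a portable stand-in so that the fragment compiles on any host. The comments note where the arguments and the result would be found under the C6x convention, in which the leading arguments travel in A4, B4, and A6 and the return value comes back in A4.

```c
/* Prototype seen by the C caller. On the target, dotp would be an
 * assembly routine (hypothetical name and signature). */
int dotp(const short *p_m, const short *p_n, int count);

/* Portable stand-in for the assembly body. An assembly version would
 * find p_m in A4, p_n in B4, and count in A6, and would leave the
 * result in A4 before returning. */
int dotp(const short *p_m, const short *p_n, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += p_m[i] * p_n[i];
    return sum;
}

/* C caller: the compiler places the three arguments in the argument
 * registers and reads the result back from the return register. */
int call_dotp_example(void)
{
    static const short m[4] = {1, 2, 3, 4};
    static const short n[4] = {5, 6, 7, 8};
    return dotp(m, n, 4);
}
```
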
Software Pipelining
Software pipelining is a technique for writing highly efficient assembly loop code on the C6x processor. Using this technique, all functional units on the processor are fully utilized within one cycle. However, writing hand-coded software-pipelined assembly code requires a fair amount of coding effort, due to the complexity and number of steps involved. In particular, for the complex algorithms encountered in many communications and signal/image processing applications, hand-coded software pipelining considerably increases coding time. The C compiler performs software pipelining to some degree at optimization levels 2 and 3 (-o2 and -o3). Compared with linear assembly, the increase in code efficiency gained by hand-coded software pipelining is relatively slight.
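
The idea of overlapping successive loop iterations can be sketched in plain C. The function below is an illustration only, not the compiler's or the assembly optimizer's output, and the name dotp_pipelined is an assumption. It restructures a dot-product loop into a prolog, a kernel, and an epilog, so that each pass of the kernel multiplies the pair of samples loaded in the previous pass while it loads the next pair. On the C6x, these overlapped steps would be issued to different functional units in the same cycle.

```c
#include <stddef.h>

/* Dot product with the loop hand-pipelined in C to show the
 * prolog / kernel / epilog structure of a software-pipelined loop. */
int dotp_pipelined(const short *a, const short *b, size_t n)
{
    if (n == 0)
        return 0;

    int sum = 0;

    /* Prolog: start the first iteration by loading its operands. */
    short x = a[0];
    short y = b[0];

    /* Kernel: finish the multiply of the previous iteration while
     * the loads of the next iteration are started. */
    for (size_t i = 1; i < n; i++) {
        int prod = x * y;   /* stage 2 of iteration i-1 */
        x = a[i];           /* stage 1 of iteration i   */
        y = b[i];
        sum += prod;        /* stage 3 of iteration i-1 */
    }

    /* Epilog: drain the last iteration still in flight. */
    sum += x * y;
    return sum;
}
```
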
Linear Assembly
Linear assembly is a coding scheme that allows one to write efficient code (compared with C) with less coding effort (compared with hand-coded software-pipelined assembly). The assembly optimizer is the software tool that parallelizes linear assembly code across the eight functional units. It attempts to achieve a good compromise between code efficiency and coding effort. In linear assembly code, it is not required to specify functional units, registers, or NOPs.
The directives .proc and .endproc define the beginning and end, respectively, of the linear assembly procedure. The symbolic names p_m, p_n, m, n, count, prod, and sum are defined by the .reg directive. The names p_m, p_n, and count are associated with the registers A4, B4, and A6 by using the assignment MV instruction.
For the loop part of the dot-product code, we start by drawing nodes for the instructions and symbolic variable names.
C64x Improvements
This section shows how the additional features of the C64x DSP can be used to further optimize the dot-product example. Figure (b) shows the C64x version of the dot-product loop kernel for multiplying two 16-bit values. The equivalent C code appears in Figure (a).
Because the C64x can bring in 64-bit data values by using the double-word load instruction LDDW, the foregoing code can be further improved by performing four 16 × 16 multiplications via two DOTP2 instructions within a single-cycle loop, as shown in Figure (b). This way the number of operations is reduced fourfold, since four 16 × 16 multiplications are done per cycle. To do this in C, we need to cast the short data as doubles and to specify which 32 bits of the 64-bit data a DOTP2 is supposed to operate on. This is done by using the _lo() and _hi() intrinsics to specify the lower and the upper 32 bits of the 64-bit data, respectively. Figure (a) shows the equivalent C code.
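
Since the figures are not reproduced here, the following portable C sketch imitates the data flow just described so that it can be tried on any host. It is not the code from the figures. The helper dotp2_emul is a hypothetical stand-in for what a DOTP2 instruction computes on one pair of 32-bit words, memcpy into a 64-bit variable stands in for the LDDW double-word load, and taking the lower and upper halves of that variable plays the role of _lo() and _hi(). The sketch assumes that the element count is a multiple of four.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for DOTP2: each 32-bit word holds two signed
 * 16-bit lanes; the result is the sum of the two lane products.
 * The narrowing casts rely on the usual two's-complement behavior. */
static int32_t dotp2_emul(uint32_t x, uint32_t y)
{
    int16_t x_lo = (int16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(x >> 16);
    int16_t y_lo = (int16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(y >> 16);
    return (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}

/* Dot product that consumes four 16-bit samples per pass.
 * n is assumed to be a multiple of 4. */
int32_t dotp_wide(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum_lo = 0;
    int32_t sum_hi = 0;

    for (size_t i = 0; i + 4 <= n; i += 4) {
        uint64_t wa, wb;
        memcpy(&wa, &a[i], sizeof wa);   /* stands in for LDDW */
        memcpy(&wb, &b[i], sizeof wb);

        /* Lower 32 bits (role of _lo) and upper 32 bits (role of _hi). */
        sum_lo += dotp2_emul((uint32_t)wa, (uint32_t)wb);
        sum_hi += dotp2_emul((uint32_t)(wa >> 32), (uint32_t)(wb >> 32));
    }
    return sum_lo + sum_hi;
}
```

Keeping two partial sums mirrors the two DOTP2 results produced in each pass; they are combined only once, after the loop.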
Circular Buffering
In many DSP algorithms, such as filtering, adaptive filtering, or spectral analysis, we need to shift data or update samples (i.e., we need to deal with a moving window). The direct method of shifting data is inefficient and uses many cycles. Circular buffering is an addressing mode by which a moving-window effect can be created without the overhead associated with data shifting. In a circular buffer, if a pointer pointing to the last element of the buffer is incremented, it automatically wraps around and points back to the first element of the buffer. This provides an easy mechanism to exclude the oldest sample while including the newest sample, creating a moving-window effect as illustrated in Figure.
Some DSPs have dedicated hardware for doing this type of addressing. On the C6x processor, the arithmetic logic unit has the circular addressing mode capability built into it. To use circular buffering, the circular buffer sizes first need to be written into the BK0 and BK1 block size fields of the Address Mode Register (AMR), as shown in Figure. The C6x allows two independent circular buffers whose sizes are powers of 2. Buffer size is specified as $2^{N+1}$ bytes, where N indicates the value written to the BK0 and BK1 block size fields.
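
The wrap-around behavior and the block size relation can be imitated in portable C. The sketch below is a software emulation, not the AMR programming sequence itself, and all names in it (circ_buf, circ_push, circ_get, bk_field_for_bytes, BUF_LEN) are assumptions. Because the buffer length is a power of 2, the index wraps with a simple mask, which is the effect the hardware addressing mode provides without extra instructions. The last function computes the value N for which a buffer of a given size in bytes equals 2^(N+1).

```c
#define BUF_LEN 8u   /* number of samples; must be a power of 2 */

typedef struct {
    short    data[BUF_LEN];
    unsigned head;          /* index where the next sample is written */
} circ_buf;

/* Write the newest sample over the oldest one and advance the index,
 * wrapping from the last element back to the first. */
void circ_push(circ_buf *cb, short sample)
{
    cb->data[cb->head] = sample;
    cb->head = (cb->head + 1u) & (BUF_LEN - 1u);
}

/* Read the sample written k pushes ago (k = 0 is the newest). */
short circ_get(const circ_buf *cb, unsigned k)
{
    return cb->data[(cb->head - 1u - k) & (BUF_LEN - 1u)];
}

/* Return N such that bytes == 2^(N+1), or -1 if the size is not a
 * power of 2 of at least 2 bytes. */
int bk_field_for_bytes(unsigned bytes)
{
    if (bytes < 2u || (bytes & (bytes - 1u)) != 0u)
        return -1;
    int n = 0;
    while ((2u << n) != bytes)
        n++;
    return n;
}
```

For instance, a buffer of eight 16-bit samples occupies 16 bytes, so under this relation the block size field would hold N = 3.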
Adaptive Filtering
Adaptive filtering is used in many applications ranging from noise cancellation to system identification. In most cases, the coefficients of an FIR filter are modified according to an error signal in order to adapt to a desired signal. In this lab, a system identification example is implemented wherein an adaptive FIR filter is used to adapt to the output of a seventh-order IIR bandpass filter. The IIR filter is designed in MATLAB and implemented in C. The adaptive FIR is first implemented in C and later in assembly using circular buffering.
In system identification, the behavior of an unknown system is modeled by accessing its input and output. An adaptive FIR filter can be used to adapt to the output of the system based on the same input. The difference between the output of the system, d[n], and the output of the adaptive filter, y[n], constitutes the error term e[n] = d[n] - y[n], which is used to update the coefficients of the FIR filter.
The error term calculated from the difference of the outputs of the two systems is used to update each coefficient of the FIR filter according to the formula (least mean square (LMS) algorithm [1]):
$$h_n[k] = h_{n-1}[k] + \delta \, e[n] \, x[n-k]$$
where the h's denote the unit sample response or FIR filter coefficients, and x[n-k] is the input sample delayed by k. The output y[n] is required to approach d[n]. The term $\delta$ indicates step size. A small step size will ensure convergence, but results in a slow adaptation rate. A large step size, though faster, may lead to skipping over the solution.
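
One iteration of such an adaptive FIR filter can be sketched in C as follows. This is a minimal illustration of the usual LMS update, in which each coefficient is incremented by the step size times the error times the corresponding delayed input sample. It is not the lab's actual code: the names lms_filter, lms_step, and NUM_TAPS, the use of float, and the filter length are all assumptions. The delay line is shifted directly for clarity, which is the costly operation that circular buffering avoids.

```c
#include <stddef.h>

#define NUM_TAPS 4   /* hypothetical filter length */

typedef struct {
    float h[NUM_TAPS];   /* adaptive FIR coefficients            */
    float x[NUM_TAPS];   /* delay line, x[0] is the newest input */
} lms_filter;

/* Process one input sample x_new with desired output d.
 * Returns the error e = d - y after updating the coefficients. */
float lms_step(lms_filter *f, float x_new, float d, float step)
{
    /* Shift the delay line (direct method) and insert the new sample. */
    for (size_t k = NUM_TAPS - 1; k > 0; k--)
        f->x[k] = f->x[k - 1];
    f->x[0] = x_new;

    /* Filter output y[n] with the current coefficients. */
    float y = 0.0f;
    for (size_t k = 0; k < NUM_TAPS; k++)
        y += f->h[k] * f->x[k];

    /* Error between the desired output and the filter output. */
    float e = d - y;

    /* LMS update: h[k] += step * e[n] * x[n-k]. */
    for (size_t k = 0; k < NUM_TAPS; k++)
        f->h[k] += step * e * f->x[k];

    return e;
}
```

In a system identification setup, d would be the output of the unknown system for the same input sample, and lms_step would be called once per sample with a zero-initialized lms_filter.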