Module 2-1
Processing Devices
Basic Architectural Features
A programmable DSP device should provide instructions similar to
those of a conventional microprocessor
The instruction set of a typical DSP device should include the
following:
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY etc
b. Logical operations such as AND, OR, NOT, XOR etc
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
In addition to the above provisions, the architecture should also
include:
a. On chip registers to store intermediate results
b. On chip memories to store signal samples (RAM)
c. On chip memories to store filter coefficients (ROM)
Problem: Investigate the basic features that should be provided in
the DSP architecture to be used to implement the following Nth
order FIR filter
the instruction and to store the result and are also called array
multipliers
The key features to be considered for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
The number of bits used to represent the operands decides the
execution will be very high but the circuit complexity will also
increase considerably
Thus there should be a trade-off between the speed of execution and
application
Parallel Multipliers
Consider the multiplication of two unsigned numbers A and B
In the Braun multiplier the signs of the numbers are not taken
into account
In order to implement a multiplier for signed numbers, additional
Product P = P_{m+n-1} ... P_1 P_0
Step 3: Multiply the sign bit A3 with B2 B1 B0 to get A3B2 A3B1 A3 B0 and add
Step 4: Multiply the sign bit B3 with A2 A1 A0 to get B3A2 B3A1 B3 A0 and add
Multiplication of
A2A1A0xB2B1B0 →101x110
000
101
101
p5p4p3p2p1p0 → 011110
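The unsigned example above can be reproduced with a short sketch. It forms one shifted partial product per bit of B and sums them, which is the arithmetic an array multiplier performs in parallel hardware. The function names `partial_products` and `array_multiply` are illustrative and do not come from the notes.

```python
def partial_products(a: int, b: int, n_bits: int) -> list[int]:
    """Return one shifted partial product of unsigned a * b per bit of b."""
    rows = []
    for i in range(n_bits):
        bit = (b >> i) & 1            # i-th bit of the multiplier B
        rows.append((a * bit) << i)   # A ANDed with that bit, shifted into place
    return rows


def array_multiply(a: int, b: int, n_bits: int) -> int:
    """Sum the partial products (hypothetical helper, unsigned operands only)."""
    return sum(partial_products(a, b, n_bits))


product = array_multiply(0b101, 0b110, 3)
print(format(product, "06b"))  # prints 011110
```

In hardware all partial products are generated at once and added by an array of adders; the loop here only models the arithmetic, not the timing.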
p5p4p3→ 1110
B=1 1101110
Borrow = 1 implies that p7, p8 etc are all 1 (sign extension)
Hence
of implementation as it is expensive
Many alternatives for the above method have been proposed
One such method is to use the program bus itself to fetch one of the
operands after fetching the instruction, thus requiring only one bus to
fetch the operands
And the result Z can be stored back to the memory using the same
operand bus
But the problem with this is the result Z is 2n bits long whereas the
required
Another alternative can be used for the applications where speed is
single bus to fetch the operands and to store the result (shown
below)
results
The following scenarios give the necessity of a shifter
a. While performing the addition of N numbers, each n bits long, the
sum can grow up to n + log2 N bits long
If the accumulator is only n bits long, the sum can overflow it by up
to log2 N bits
b. Similarly while calculating the product of two n bit numbers, the
product can grow up to 2n bits long
Generally the lower n bits get neglected and the sign bit is shifted to
architecture of a DSP
Problem 1: It is required to find the sum of 64 numbers, each 16
bits long. How many bits should the accumulator have so that the
sum can be computed without the occurrence of overflow error or
loss of accuracy?
Solution: The sum of 64 numbers of 16 bits each can grow up to
(16 + log2 64) = 22 bits long. Hence the accumulator should be 22 bits
long in order to avoid overflow error
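The n + log2 N rule used in this solution can be checked numerically. The sketch below assumes unsigned operands and uses a hypothetical helper name, `accumulator_bits`; it also confirms the result against the largest possible sum.

```python
def accumulator_bits(count: int, word_bits: int) -> int:
    """Bits needed to sum `count` unsigned numbers of `word_bits` bits each.

    (count - 1).bit_length() equals ceil(log2(count)) for count >= 1,
    which avoids floating-point rounding.
    """
    return word_bits + (count - 1).bit_length()


bits = accumulator_bits(64, 16)
print(bits)  # prints 22

# Cross-check: the worst-case sum of 64 maximal 16-bit values fits in 22 bits.
worst_case_sum = 64 * (2**16 - 1)
print(worst_case_sum.bit_length())  # prints 22
```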
Barrel Shifters
In conventional microprocessors, normal shift registers are used for
shift operation
As it requires one clock cycle for each shift, it is not desirable for
the shift
The block diagram of a typical barrel shifter is shown in the figure
below
A Barrel Shifter
Implementation of a 4 bit Shift Right Barrel Shifter
Figure above depicts the implementation of a 4 bit shift right barrel
shifter
Shift to right by 0, 1, 2 or 3 bit positions can be controlled by setting
Solution: As the number of bits used to represent the input is 16,
log2 16 = 4 control inputs are required
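The relation between word length and control inputs can be illustrated with a small model. This is a sketch of one common realization (logarithmic stages, where control bit i selects a shift by 2^i positions), assuming a logical, zero-filling right shift and a power-of-two word length; the figure in these notes may use a different multiplexer arrangement. The names `control_inputs` and `barrel_shift_right` are illustrative.

```python
def control_inputs(width: int) -> int:
    """Number of control lines: log2(width) for a power-of-two width."""
    return (width - 1).bit_length()


def barrel_shift_right(value: int, shift: int, width: int) -> int:
    """Logical right shift modeled as log2(width) conditional stages."""
    value &= (1 << width) - 1              # keep only `width` bits
    for stage in range(control_inputs(width)):
        if (shift >> stage) & 1:           # control bit for this stage
            value >>= 1 << stage           # shift by 2**stage positions
    return value


print(control_inputs(16))                       # prints 4
print(hex(barrel_shift_right(0xF0F0, 4, 16)))   # prints 0xf0f
```

Every stage is combinational, so the whole shift completes in a single pass regardless of the shift amount. Bits of `shift` above the control width are ignored in this model.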
Multiply and Accumulate Unit
Most of the DSP applications require the computation of the sum of
Accumulator
MACs are used to implement the functions of the type A+BC
products
Problem 5: If a sum of 256 products is to be computed using a
pipelined MAC unit, and if the MAC execution time of the unit is
100nsec, what will be the total time required to complete the
operation?
Solution: As N = 256 in this case, the MAC unit requires
N + 1 = 257 execution cycles
As the single MAC execution time is 100 nsec, the total time required
is 257 x 100 nsec = 25.7 microseconds
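The timing in Problem 5 follows directly from the N + 1 cycle count stated in the solution. A minimal sketch, with the hypothetical helper name `pipelined_mac_time_ns`:

```python
def pipelined_mac_time_ns(num_products: int, mac_time_ns: int) -> int:
    """Total time for a pipelined MAC: N + 1 execution cycles for N products."""
    cycles = num_products + 1
    return cycles * mac_time_ns


total_ns = pipelined_mac_time_ns(256, 100)
print(total_ns)         # prints 25700
print(total_ns / 1000)  # prints 25.7 (microseconds)
```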
sizes encountered at the input of the multiplier and the sizes of the
add/subtract unit and the accumulator, as there is a possibility of
overflow and underflow
Overflow/underflow can be avoided by using any of the following
methods viz
a. Using shifters at the input and the output of the MAC
b. Providing guard bits in the accumulator
c. Using saturation logic
a. Shifters
Shifters can be provided at the input of the MAC to normalize the
bits called guard bits in the accumulator so that there will not be
any overflow error
Here the add/subtract unit also has to be modified appropriately to
+5 0101
+4 0100
-5 1011
+2 0010
-5 1011
-4 1100
instructions like ADD, SUB, INC, DEC etc and logical operations
like AND, OR, NOT, XOR etc
The block diagram of a typical ALU for a DSP is shown in the
figure below
It consists of status flag register, register file and multiplexers
arithmetic and logic operations - flags include sign, zero, carry and
overflow
Arithmetic Logic Unit of a DSP
sign flags, the saturation logic can be used to limit the accumulator
content
Register File: Instead of moving data in and out of the memory
during the operation, for better speed, a large set of general purpose
registers is provided to store the intermediate results
used to store program and data and a separate set of data and address
buses is provided for each memory; this architecture is called the
Harvard Architecture
It is shown in the figure below
Harvard Architecture
Although the usage of separate memories for data and the instruction
of a single data memory leads to fetching the operands one after the
other, thus increasing the delay of processing
This problem can be overcome by using two separate data memories
for storing operands separately, thus in a single clock cycle both the
operands can be fetched together
Although the architecture improves the speed of operation, it
Therefore there should be a trade-off between the cost and speed
are faster
Speed and size are the two key parameters to be considered with
chip, reducing the scope for implementing any functional block on-
chip, which in turn reduces the speed of execution and hence some
other alternatives are required
The following are some other ways in which the on-chip memory
can be organized
a. As many DSP algorithms require instructions to be executed
repeatedly, the instructions can be stored in the external memory and,
once fetched, can reside in the instruction cache
b. The access times for on-chip memories should be sufficiently small
so that they can be accessed more than once in every execution cycle
c. On-chip memories can be configured dynamically so that they can
serve different purposes at different times
Data Addressing Capabilities
Data accessing capability of a programmable DSP device is
applications
1. Circular Addressing Mode
While processing the data samples coming continuously in a
They are
a. SAR < EAR & updated PNTR > EAR
b. SAR < EAR & updated PNTR < SAR
c. SAR > EAR & updated PNTR > SAR
d. SAR > EAR & updated PNTR < EAR
The buffer length in the first two cases will be (EAR-SAR+1)
shown below
The four cases explained earlier are shown in the figure below:
Case iii) EAR<SAR and updated PNTR>SAR
Case iv) EAR<SAR and updated PNTR<EAR
Problem 10: A DSP has a circular buffer with the start and the end
addresses as 0200h and 020Fh respectively. What would be the
new values of the address pointer of the buffer if, in the course of
address computation, it gets updated to
a. 0212h
b. 01FCh
Solution: Buffer Length = (EAR-SAR+1) = 020Fh-0200h+1 = 10h
a. New Address Pointer = Updated Pointer - buffer length
= 0212h-10h = 0202h
b. New Address Pointer = Updated Pointer + buffer length
= 01FCh+10h = 020Ch
Problem 11: Repeat the previous problem for SAR = 0210h and
EAR = 0201h
Solution: Buffer Length = (SAR-EAR+1) = 0210h-0201h+1 = 10h
c. New Address Pointer = Updated Pointer - buffer length
= 0212h-10h = 0202h
d. New Address Pointer = Updated Pointer + buffer length
= 01FCh+10h = 020Ch
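The pointer update used in Problems 10 and 11 can be sketched as a small function. It assumes the updated pointer overshoots the buffer by less than one buffer length, so a single correction is enough; `wrap_pointer` is an illustrative name, not taken from any DSP instruction set.

```python
def wrap_pointer(sar: int, ear: int, updated: int) -> int:
    """Bring an updated pointer back inside the circular buffer.

    Handles both SAR < EAR and SAR > EAR by working with the lower and
    upper bounds of the buffer.
    """
    low, high = min(sar, ear), max(sar, ear)
    length = high - low + 1
    if updated > high:
        return updated - length   # ran past the top: subtract buffer length
    if updated < low:
        return updated + length   # fell below the bottom: add buffer length
    return updated                # already inside the buffer


# Problem 10 (SAR = 0200h, EAR = 020Fh) and Problem 11 (SAR = 0210h, EAR = 0201h)
for sar, ear in ((0x0200, 0x020F), (0x0210, 0x0201)):
    for updated in (0x0212, 0x01FC):
        print(f"{wrap_pointer(sar, ear, updated):04X}h")
# prints 0202h, 020Ch, 0202h, 020Ch (one per line)
```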
2. Bit Reversed Addressing Mode
To implement FFT algorithms we need to access the data in a bit
reversed manner
Hence a special addressing mode called bit reversed addressing
Next index can be calculated by adding half the FFT length, in this
operations
a. Getting value from immediate operand, register or a memory
location
b. Incrementing/ decrementing the current address
c. Adding/subtracting the offset from the current address
d. Adding/subtracting the offset from the current address and
generating new address according to circular addressing mode
e. Generating new address using bit reversed addressing mode
The block diagram of a typical address generation unit is shown in
the figure below
thus requiring memory access for storing and retrieving the return
address, which in turn reduces the speed of operation
Hence a LIFO memory can be directly interfaced with the program
counter
1. Program Control
Like microprocessors, a DSP also requires a control unit to provide
necessary control and timing signals for the proper execution of the
instructions
In microprocessors, the control is microcode based, where each
The processor is designed such that only control unit can access them
applications
Hence in a DSP the control is hardwired
control unit and hence resulting control and timing signals are
generated to complete the instruction decode and execution
Advantage: Instruction execution time is much faster as compared to
microcoded design
Disadvantage: High hardware complexity; not easy to change the
design of control unit to incorporate additional features
2. Program Sequencer
It is a part of the control unit used to generate instruction addresses
a. Program Counter
b. Instruction register in case of branching, looping and subroutine calls
c. Interrupt Vector table
d. Stack which holds the return address
The block diagram of a program sequencer is shown in the figure
below
Program Sequencer
Program sequencer should have the following circuitry
2. Parallelism
3. Pipelining
Hardware Architecture
Harvard architecture which separates the program and data
Parallelism
High speed of operation can be achieved in DSP by the provision of
Parallelism
Parallelism means several things done at a time
pipelining
In a pipelined architecture, an instruction to be executed is broken
When the first of these units performs the first step on the current
instruction
The second unit will be performing the second step on the previous
instruction
The third unit will be performing the third step on the instruction
Limitations
1. Each instruction must be divided into steps that take equal amounts
of time to perform, and the architecture must be designed accordingly
2. Extra time required at the start of algorithm execution results in
pipeline latency
For example, Let us assume that the execution of an instruction can
For simplicity, we will assume that all the steps take equal amounts
of time
From the figure, we observe that the output corresponding to the
8 times to compute the eight product terms and find their sum
Each input sample is delayed from the previous sample by 8T
through multiplexers
As each product term is generated, it is added to the previously
compute one product term and add it to the sum passed on from
the previous stage of the pipeline
This implementation can take in a new input sample once every T
This implementation uses two MAC units and an adder at the output
Input samples and the filter coefficients are fed to the MACs using
described
It is possible to achieve higher speed implementation by the use of
outside world
The outside world provides the signal to be processed and receives