Module 2-1

The document outlines the essential architectural features and computational building blocks necessary for programmable digital signal processing (DSP) devices, emphasizing the importance of instruction sets, memory types, and specialized units like Multiply and Accumulate (MAC). It discusses the design considerations for multipliers, shifters, and the handling of overflow and underflow in computations. Additionally, it highlights the need for efficient data management and speed optimization in DSP applications.


Architectures for Programmable Digital Signal

Processing Devices
Basic Architectural Features
 A programmable DSP device should provide instructions similar to those of a conventional microprocessor
 The instruction set of a typical DSP device should include the following:
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY, etc.
b. Logical operations such as AND, OR, NOT, XOR, etc.
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
 In addition to the above provisions, the architecture should also include:
a. On-chip registers to store intermediate results
b. On-chip memories to store signal samples (RAM)
c. On-chip memories to store filter coefficients (ROM)
Problem: Investigate the basic features that should be provided in the DSP architecture to be used to implement the following Nth order FIR filter

y(n) = Σ h(i) x(n − i), summed over the N filter coefficients h(i)

 In order to implement the above operation in a DSP, the architecture requires the following features


i. A RAM to store the signal samples x(n)
ii. A ROM to store the filter coefficients h(n)
iii. A MAC unit to perform the Multiply and Accumulate operation
iv. An accumulator to store the intermediate results
v. A signal pointer to point to the signal sample in the memory
vi. A coefficient pointer to point to the filter coefficient in the memory
vii. A counter to keep track of the count
viii. A shifter to shift the input samples appropriately
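
The way these blocks cooperate can be sketched in C. This is only an illustrative model (the array names, sizes and the 32-bit accumulator are assumptions, not any particular device's architecture):

```c
#include <stdint.h>

#define N 8                                    /* number of filter taps                    */

int16_t x[N];                                  /* RAM: x[0] = x(n), x[1] = x(n-1), ...     */
const int16_t h[N] = {1, 2, 3, 4, 4, 3, 2, 1}; /* ROM: filter coefficients (placeholders)  */

/* One output sample: y(n) = sum over i of h(i) * x(n - i)                                 */
int32_t fir_output(void)
{
    int32_t acc = 0;                           /* accumulator                              */
    for (int i = 0; i < N; i++) {              /* counter keeps track of the taps          */
        /* signal pointer x + i, coefficient pointer h + i, one MAC per tap                */
        acc += (int32_t)h[i] * (int32_t)x[i];
    }
    return acc;                                /* may be scaled (shifted) before storing   */
}
```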
DSP Computational Building Blocks
 Each computational block of the DSP should be optimized for functionality and speed

 The design should be sufficiently general so that it can be easily integrated with other blocks to implement overall DSP systems


Multipliers
 The advent of single chip multipliers paved the way for implementing DSP functions on a VLSI chip

 Parallel multipliers replaced the traditional shift and add multipliers

 Parallel multipliers take a single processor cycle to fetch and execute the instruction and to store the result, and are also called Array multipliers

The key features to be considered for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
 The number of bits used to represent the operands decides the accuracy and the dynamic range of the multiplier, whereas the speed is decided by the architecture employed

 If the multipliers are implemented in hardware, the speed of execution will be very high but the circuit complexity will also increase considerably

 Thus there should be a trade-off between the speed of execution and the circuit complexity

 Hence the choice of the architecture normally depends on the application
Parallel Multipliers
 Consider the multiplication of two unsigned numbers A and B

 Let A be represented using m bits as (Am-1 Am-2 …… A1 A0) and B be represented using n bits as (Bn-1 Bn-2 …… B1 B0)

 Then the product of these two numbers is given by,

P = Σi Σj Ai Bj 2^(i+j), with i running from 0 to m-1 and j from 0 to n-1

 This operation can be implemented in parallel using a Braun multiplier, whose hardware structure is as shown in the figure below:


Braun Multiplier for a 4x4 Multiplication
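
In software, the same product can be formed by accumulating the partial products Ai·Bj·2^(i+j), which is what the adder array of the Braun multiplier computes in hardware. A minimal C sketch for unsigned operands (illustrative only, not a gate-level model):

```c
#include <stdint.h>

/* Unsigned product as the sum of partial products Ai * Bj * 2^(i+j),
 * with A using m bits and B using n bits (m + n <= 32 assumed).            */
uint32_t array_multiply(uint16_t a, uint16_t b, int m, int n)
{
    uint32_t p = 0;
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            uint32_t ai = (a >> i) & 1u;          /* bit Ai                 */
            uint32_t bj = (b >> j) & 1u;          /* bit Bj                 */
            p += (ai & bj) << (i + j);            /* one partial product    */
        }
    }
    return p;                                     /* e.g. 1011 x 0110 = 66  */
}
```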

Multipliers for Signed Numbers

 Generally signed numbers are represented in 2's complement form

 In the Braun multiplier the signs of the numbers are not taken into account

 In order to implement a multiplier for signed numbers, additional hardware is required to modify the Braun multiplier

 The modified multiplier is called the Baugh-Wooley multiplier

 Consider two signed numbers A and B,

Product P = Pm+n-1 …… P1 P0

 If we need to multiply two 4-bit numbers A3A2A1A0 x B3B2B1B0 with A3 and B3 as the sign bits


Steps are as follows:

Step 1: Multiply A2 A1 A0 x B2 B1 B0 to give the product p5-p0

Step 2: Multiply the sign bits A3 B3 and add the result at the p6 position

Step 3: Multiply the sign bit A3 with B2 B1 B0 to get A3B2 A3B1 A3B0 and add them at positions p5, p4 and p3

 The addition is performed after obtaining the 2's complement of A3B2 A3B1 A3B0, which is equivalent to subtracting A3B2 A3B1 A3B0

Step 4: Multiply the sign bit B3 with A2 A1 A0 to get B3A2 B3A1 B3A0 and add them at positions p5, p4 and p3

 Steps 1 to 4 give the total product of the two signed numbers

 Using 2's complement, the subtractions can be expressed as additions and hence the resulting multiplier consists of only adders

Example: Multiply the two numbers -3 and 6

Step 1: -3 = A3A2A1A0 = 1101 and 6 = B3B2B1B0 = 0110

 Multiplication of A2A1A0 x B2B1B0 → 101 x 110

      000
     101
    101
   -------
   011110

p5p4p3p2p1p0 → 011110

Step 2: Sign bits: A3B3 = 1 x 0 = 0 → add this at p6 (subscripts of A3B3 = 3 + 3 = 6) to get p6p5p4p3p2p1p0 = 0011110

Step 3:
 Multiplication of A3 with B2 B1 B0 → 1 x 110 = 110

 Subtracting this at positions p5p4p3:

   0011110
 - 0110000
 ---------
   1101110   (borrow = 1)

 Borrow = 1 implies p7, p8, etc. are all 1 (sign extension)

Step 4: Multiply B3 with A2A1A0 → 0 x 101 = 000

 Subtracting this at p5p4p3 changes nothing (since it is zero, p6-p0 remain the same)

 Hence

p7p6p5p4p3p2p1p0 = 1110 1110 = EEh, i.e. -18 in decimal
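
The result can be cross-checked with ordinary 2's-complement arithmetic. A small C check (the 4-bit operands are simply sign-extended to int8_t and only the low 8 bits of the product are kept):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int8_t a = -3;                       /* 1101 interpreted as a 4-bit signed value */
    int8_t b =  6;                       /* 0110                                     */
    int16_t p = (int16_t)a * b;          /* exact product: -18                       */
    printf("product = %d = 0x%02X\n", p, (uint8_t)p);   /* product = -18 = 0xEE      */
    return 0;
}
```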


The Baugh Wooley multiplier is as shown below:
Speed
 Conventional Shift and Add technique of multiplication requires n

cycles to perform the multiplication of two n bit numbers


 Whereas in parallel multipliers the time required will be the longest

path delay in the combinational circuit used


 As DSP applications generally require very high speed, it is desirable

to have multipliers operating at the highest possible speed by having


parallel implementation
Bus Widths
 Consider the multiplication of two n bit numbers X and Y

 The product Z can be at most 2n bits long

 In order to perform the whole operation in a single execution cycle, we require two buses of width n bits each to fetch the operands X and Y, and a bus of width 2n bits to store the result Z to the memory
 Although this performs the operation faster, it is not an efficient way of implementation as it is expensive
 Many alternatives to the above method have been proposed

 One such method is to use the program bus itself to fetch one of the operands after fetching the instruction, thus requiring only one bus to fetch the operands
 And the result Z can be stored back to the memory using the same operand bus
 But the problem with this is that the result Z is 2n bits long whereas the operand bus is just n bits long


We have two alternatives to solve this problem,
a. Use the n-bit operand bus and save Z at two successive memory locations
 Although this stores the exact value of Z in the memory, it takes two cycles to store the result

b. Discard the lower n bits of the result Z and store only the higher order n bits into the memory
 This is not applicable for applications where an accurate result is required
 Another alternative can be used for applications where speed is not a major concern

 In this alternative, latches are used for the inputs and outputs, thus requiring only a single bus to fetch the operands and to store the result (shown below)

A Multiplier with Input and Output Latches


Shifters
 Shifters are used to either scale down or scale up operands or results
The following scenarios give the necessity of a shifter
a. While performing the addition of N numbers, each n bits long, the sum can grow up to n + log2 N bits long
 If the accumulator is n bits long, an overflow error will occur
 This can be overcome by using a shifter to scale down the operands by an amount of log2 N bits
b. Similarly, while calculating the product of two n bit numbers, the product can grow up to 2n bits long
 Generally the lower n bits get neglected and the sign bit is shifted to save the sign of the product

c. Finally, in case of the addition of two floating-point numbers, one of the operands has to be shifted appropriately to make the exponents of the two numbers equal
 From the above cases it is clear that a shifter is required in the architecture of a DSP
Problem 1: It is required to find the sum of 64, 16 bit numbers.
How many bits should the accumulator have so that the sum can
be computed without the occurrence of overflow error or loss of
accuracy?
Solution: The sum of 64, 16 bit numbers can grow up to (16 + log2 64) = 22 bits long. Hence the accumulator should be 22 bits long in order to prevent an overflow error from occurring

Problem 2: In the previous problem, it is decided to have an accumulator with only 16 bits but shift the numbers before the addition to prevent overflow; by how many bits should each number be shifted?
Solution: As the length of the accumulator is fixed, the operands have to be shifted by an amount of log2 64 = 6 bits prior to the addition operation, in order to avoid the condition of overflow
Problem 3: If all the numbers in the previous problem are fixed point integers, what is the actual sum of the numbers?
Solution: The actual sum can be obtained by shifting the result 6 bits to the left after the sum has been computed. Therefore
 Actual Sum = Accumulator contents x 2^6
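
The arithmetic used in Problems 1-3 can be written out as a short C sketch (it assumes the number of operands N is a power of two, so log2 N is an integer):

```c
#include <stdio.h>

int main(void)
{
    int n = 16, N = 64;                       /* operand width and operand count      */
    int shift = 0;
    for (int k = N; k > 1; k >>= 1) shift++;  /* log2(N) = 6                          */

    printf("accumulator width needed: %d bits\n", n + shift);          /* 22 bits     */
    printf("pre-shift for a %d-bit accumulator: %d bits\n", n, shift); /* 6 bits      */
    /* Problem 3: actual sum = accumulator contents * 2^shift                         */
    return 0;
}
```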

Barrel Shifters
 In conventional microprocessors, normal shift registers are used for shift operations
 As a shift register requires one clock cycle for each shift, it is not desirable for DSP applications, which generally involve many shifts

 In other words, as speed is the crucial issue for DSP applications, several shifts have to be accomplished in a single execution cycle

 This can be accomplished using a barrel shifter, which connects the input lines representing a word to a group of output lines, with the required shift determined by its control inputs
 For an input of length n, log2 n control lines are required
 And an additional control line is required to indicate the direction of the shift
 The block diagram of a typical barrel shifter is as shown in figure

below

A Barrel Shifter
Implementation of a 4 bit Shift Right Barrel Shifter
 Figure above depicts the implementation of a 4 bit shift right barrel

shifter
 Shift to right by 0, 1, 2 or 3 bit positions can be controlled by setting

the control inputs appropriately


Problem 4: A Barrel Shifter is to be designed with 16 inputs for left
shifts from 0 to 15 bits. How many control lines are required to
implement the shifter?

Solution: As the number of bits used to represent the input is 16, log2 16 = 4 control inputs are required
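
Functionally, a shift-right barrel shifter is a cascade of mux stages, one per control line; each stage either passes the word through or shifts it by a power of two. A behavioural C model for a 16-bit word (a sketch, not a gate-level description):

```c
#include <stdint.h>

/* 16-bit shift-right barrel shifter: 'ctrl' (4 bits) selects a shift of 0..15
 * positions, completed in one pass through four mux stages.                   */
uint16_t barrel_shift_right(uint16_t in, unsigned ctrl)
{
    uint16_t s = in;
    s = (ctrl & 1u) ? (uint16_t)(s >> 1) : s;   /* stage 0: shift by 1          */
    s = (ctrl & 2u) ? (uint16_t)(s >> 2) : s;   /* stage 1: shift by 2          */
    s = (ctrl & 4u) ? (uint16_t)(s >> 4) : s;   /* stage 2: shift by 4          */
    s = (ctrl & 8u) ? (uint16_t)(s >> 8) : s;   /* stage 3: shift by 8          */
    return s;                                   /* total shift = ctrl           */
}
```

An extra control input for direction would simply select between this network and a mirrored left-shifting one.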
Multiply and Accumulate Unit
 Most of the DSP applications require the computation of the sum of

the products of a series of successive multiplications


 In order to implement such functions a special unit called a Multiply

and Accumulate (MAC) unit is required


 A MAC consists of a multiplier and a special register called

Accumulator
 MACs are used to implement the functions of the type A+BC

 A typical MAC unit is as shown in the figure below


A MAC Unit
 Although addition and multiplication are two different operations, they can be performed in parallel

 While the multiplier is computing the current product, the accumulator can accumulate the product of the previous multiplication

 Thus if N products are to be accumulated, N-1 multiplications can overlap with N-1 additions

 During the very first multiplication the accumulator will be idle, and during the last accumulation the multiplier will be idle

 Thus N+1 clock cycles are required to compute the sum of N products
Problem 5: If a sum of 256 products is to be computed using a
pipelined MAC unit, and if the MAC execution time of the unit is
100nsec, what will be the total time required to complete the
operation?
Solution: As N = 256 in this case, the MAC unit requires N + 1 = 257 execution cycles
 As the single MAC execution time is 100 nsec, the total time required will be 257 x 100 nsec = 25.7 μsec


Overflow and Underflow
 While designing a MAC unit, attention has to be paid to the word sizes encountered at the input of the multiplier and the sizes of the add/subtract unit and the accumulator, as there is a possibility of overflow and underflow
 Overflow/underflow can be avoided by using any of the following methods, viz.
a. Using shifters at the input and the output of the MAC
b. Providing guard bits in the accumulator
c. Using saturation logic
a. Shifters
 Shifters can be provided at the input of the MAC to normalize the

data and at the output to denormalize the same


b. Guard bits
 As the normalization process does not yield accurate results, it is not desirable for some applications

 In such cases we have another alternative: providing additional bits, called guard bits, in the accumulator so that there will not be any overflow error
 Here the add/subtract unit also has to be modified appropriately to manage the additional bits of the accumulator


Problem 6: Consider a MAC unit whose inputs are 16 bit numbers.
If 256 products are to be summed up in this MAC, how many
guard bits should be provided for the accumulator to prevent
overflow condition from occurring?
Solution: As it is required to calculate the sum of 256, 16 bit numbers, the sum can be as long as (16 + log2 256) = 24 bits
 Hence the accumulator should be capable of handling these 24 bits
Thus the guard bits required will be (24 - 16) = 8 bits


 The block diagram of the modified MAC after considering the guard

or extension bits is as shown in the figure below


MAC Unit with Guard Bits

Problem 7: What should be the minimum width of the accumulator


in a DSP device that receives 10 bit A/D samples and is required
to add 64 of them without causing an overflow?
Solution: As it is required to calculate the sum of 64, 10 bit numbers,
the sum can be as long as (10+ log2 64)=16 bits
 Hence the accumulator should be capable of handling these 16 bits

Thus the guard bits required will be (16-10)= 6 bits


c. Saturation Logic
 Overflow/underflow will occur if the result goes beyond the most positive number or below the least negative number the accumulator can handle
 It can be resolved by loading the accumulator with the most positive number it can handle at the time of overflow, or with the least negative number it can handle at the time of underflow
 This method is called saturation logic
 A schematic diagram of the saturation logic is as shown in the figure below

 In saturation logic, as soon as an overflow or underflow condition is detected, the accumulator is loaded with the most positive or least negative number, overriding the result computed by the MAC unit

A Schematic Diagram of the Saturation Logic


Examples of overflow and underflow are illustrated below

Overflow
  +3   0011
  +2   0010
  --   ----
  +5   0101      Here Cin = 0 = Cout, hence no overflow

  +5   0101
  +4   0100
  --   ----
  +9   1001      Here Cin = 1, Cout = 0, Cin ≠ Cout, hence overflow

Underflow
  -3   1101
  -2   1110
  --   ----
  -5   1 1011    Here Cin = 1, Cout = 1, Cin = Cout, hence no underflow

  -5   1011
  +2   0010
  --   ----
  -3   1101      Here Cin = 0 = Cout, hence no underflow

  -5   1011
  -4   1100
  --   ----
  -9   1 0111    Here Cin = 0, Cout = 1, Cin ≠ Cout, hence underflow
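
The Cin/Cout test above is equivalent to checking whether adding two numbers of the same sign produced a result of the opposite sign. A C sketch of saturation logic for a 16-bit accumulator (the widths are assumptions chosen for illustration):

```c
#include <stdint.h>

#define ACC_MAX INT16_MAX      /* most positive number the accumulator can hold */
#define ACC_MIN INT16_MIN      /* least negative number                          */

/* Add with saturation: on overflow load the most positive number, on underflow
 * load the least negative number, otherwise keep the computed sum.             */
int16_t sat_add(int16_t acc, int16_t x)
{
    int32_t sum = (int32_t)acc + (int32_t)x;    /* exact sum in a wider type     */
    if (sum > ACC_MAX) return ACC_MAX;          /* overflow                      */
    if (sum < ACC_MIN) return ACC_MIN;          /* underflow                     */
    return (int16_t)sum;
}
```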


Arithmetic and Logic Unit
 A typical DSP device should be capable of handling arithmetic instructions like ADD, SUB, INC, DEC, etc. and logical operations like AND, OR, NOT, XOR, etc.
 The block diagram of a typical ALU for a DSP is as shown in the

figure below
 It consists of status flag register, register file and multiplexers

 Status Flags: ALU includes circuitry to generate status flags after

arithmetic and logic operations - flags include sign, zero, carry and
overflow
Arithmetic Logic Unit of a DSP

 Overflow Management: Depending on the status of overflow and

sign flags, the saturation logic can be used to limit the accumulator
content
 Register File: Instead of moving data in and out of the memory

during the operation, for better speed, a large set of general purpose
registers are provided to store the intermediate results

Bus Architecture and Memory


 Conventional microprocessors use Von Neumann architecture for

memory management wherein the same memory is used to store both


the program and data (Fig below)

Von Neumann Architecture


 Although this architecture is simple, it takes a larger number of processor cycles for the execution of a single instruction, as the same bus is used for both data and program
 In order to increase the speed of operation, separate memories are used to store the program and the data, and a separate set of data and address buses is provided for each memory; this architecture is called the Harvard Architecture
 It is as shown in figure below

Harvard Architecture
 Although the usage of separate memories for data and the instruction

speeds up the processing, it will not completely solve the problem


 As many of the DSP instructions require more than one operand, the use of a single data memory leads to fetching the operands one after the other, thus increasing the processing delay
 This problem can be overcome by using two separate data memories for storing the operands separately; thus in a single clock cycle both operands can be fetched together
 Although the architecture improves the speed of operation, it

requires more hardware and interconnections, thus increasing the


cost and complexity of the system
Harvard Architecture with Dual Data Memory

 Therefore there should be a trade off between the cost and speed

while selecting memory architecture for a DSP


On-chip Memories
 In order to have a faster execution of the DSP functions, it is

desirable to have some memory located on chip


 As dedicated buses are used to access the memory, on chip memories

are faster
 Speed and size are the two key parameters to be considered with

respect to the on-chip memories


Speed
 On-chip memories should match the speeds of the ALU operations in

order to maintain the single cycle instruction execution of the DSP


Size
 In a given area of the DSP chip, it is desirable to implement as many DSP functions as possible

 Thus the area occupied by the on-chip memory should be minimal, so that there will be scope for implementing a greater number of DSP functions on-chip

Organization of On-chip Memories


 Ideally the whole memory required for the implementation of any DSP algorithm has to reside on-chip so that the whole processing can be completed in a single execution cycle
 Although this looks like a better solution, it consumes more space on the chip, reducing the scope for implementing other functional blocks on-chip, which in turn reduces the speed of execution; hence some other alternatives are required
 The following are some other ways in which the on-chip memory can be organized
a. As many DSP algorithms require instructions to be executed repeatedly, the instructions can be stored in the external memory and, once fetched, can reside in the instruction cache
b. The access times for on-chip memories should be sufficiently small so that they can be accessed more than once in every execution cycle
c. On-chip memories can be configured dynamically so that they can serve different purposes at different times
Data Addressing Capabilities
 Data accessing capability of a programmable DSP device is

configured by means of its addressing modes


 The summary of the addressing modes used in DSP is as shown in

the table below


Table 2.1 DSP Addressing Modes

Addressing Mode     Operand                                    Sample Format   Operation
Immediate           Immediate value                            ADD #imm        #imm + A → A
Register            Register contents                          ADD reg         reg + A → A
Direct              Memory address                             ADD mem         mem + A → A
Register Indirect   Memory contents, with address in register  ADD *addreg     *addreg + A → A
1. Immediate Addressing Mode: In this addressing mode, data is
included in the instruction itself
2. Register Addressing Mode: In this mode, one of the registers will be
holding the data and the register has to be specified in the instruction
3. Direct Addressing Mode: In this addressing mode, instruction holds
the memory location of the operand
4. Indirect Addressing Mode: In this addressing mode, the operand is accessed using a pointer. A pointer is generally a register which holds the address of the location where the operand resides
 Indirect addressing mode can be extended to include automatic increment or decrement capabilities, which has led to the following addressing modes
Table: Indirect Addressing Modes

Addressing Mode     Sample Format              Operation
Post Increment      ADD *addreg+               A ← A + *addreg;  addreg ← addreg + 1
Post Decrement      ADD *addreg-               A ← A + *addreg;  addreg ← addreg - 1
Pre Increment       ADD +*addreg               addreg ← addreg + 1;  A ← A + *addreg
Pre Decrement       ADD -*addreg               addreg ← addreg - 1;  A ← A + *addreg
Post_Add_Offset     ADD *addreg, offsetreg+    A ← A + *addreg;  addreg ← addreg + offsetreg
Post_Sub_Offset     ADD *addreg, offsetreg-    A ← A + *addreg;  addreg ← addreg - offsetreg
Pre_Add_Offset      ADD offsetreg+, *addreg    addreg ← addreg + offsetreg;  A ← A + *addreg
Pre_Sub_Offset      ADD offsetreg-, *addreg    addreg ← addreg - offsetreg;  A ← A + *addreg
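
The update rules in the table map naturally onto C pointer arithmetic. A hedged sketch, with addreg modelled as a pointer and offsetreg as an integer (the names are illustrative, not a real device's syntax):

```c
#include <stdint.h>

/* Post Increment:  A <- A + *addreg;  addreg <- addreg + 1                     */
int32_t add_post_increment(int32_t a, const int16_t **addreg)
{
    a += *(*addreg)++;           /* use the operand, then bump the pointer      */
    return a;
}

/* Pre_Add_Offset:  addreg <- addreg + offsetreg;  A <- A + *addreg             */
int32_t add_pre_add_offset(int32_t a, const int16_t **addreg, int offsetreg)
{
    *addreg += offsetreg;        /* adjust the pointer first                    */
    a += **addreg;               /* then use the operand it now points to       */
    return a;
}
```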
Problem 8: What are the memory addresses of the operands in
each of the following cases of indirect addressing modes? In each
case, what will be the content of the addreg after the memory
access? Assume that the initial contents of the addreg and the
offsetreg are 0200h and 0010h, respectively
a. ADD *addreg-
b. ADD +*addreg
c. ADD offsetreg+,*addreg
d. ADD *addreg,offsetreg-
Solution:
Instruction               Addressing Mode    Operand Address          addreg Content after Access
ADD *addreg-              Post Decrement     0200h                    0200h - 01h = 01FFh
ADD +*addreg              Pre Increment      0200h + 01h = 0201h      0201h
ADD offsetreg+,*addreg    Pre_Add_Offset     0200h + 0010h = 0210h    0210h
ADD *addreg,offsetreg-    Post_Sub_Offset    0200h                    0200h - 0010h = 01F0h

Problem 9: Identify the addressing modes of the operands in each of


the following instructions
a. ADD #1234h
b. ADD 1234h
c. ADD *AR+
d. ADD offsetreg-,*AR
Solution:
Instruction Addressing Mode
ADD #1234h Immediate Addressing Mode
ADD 1234h Direct Addressing Mode
ADD *AR+ Post Increment Indirect Addressing Mode
ADD offsetreg-,*AR Pre-Sub_Offset Indirect Addressing Mode
Special Addressing Modes
 For the implementation of some real time applications in DSP,

normal addressing modes will not completely serve the purpose


 Thus some special addressing modes are required for such

applications
1. Circular Addressing Mode
 While processing the data samples coming continuously in a

sequential manner, circular buffers are used


 In a circular buffer the data samples are stored sequentially from the

initial location till the buffer gets filled up


 Once the buffer gets filled up, the next data samples will get stored

once again from the initial location


 This process can go on forever as long as the data samples are processed at a rate faster than the incoming data rate

 Circular Addressing mode requires three registers, viz.

a. Pointer register to hold the current location (PNTR)


b. Start Address Register to hold the starting address of the buffer
(SAR)
c. End Address Register to hold the ending address of the buffer (EAR)
 There are four special cases in this addressing mode

They are
a. SAR < EAR & updated PNTR > EAR
b. SAR < EAR & updated PNTR < SAR
c. SAR >EAR & updated PNTR > SAR
d. SAR > EAR & updated PNTR < EAR
 The buffer length in the first two cases will be (EAR - SAR + 1), whereas for the next two cases it is (SAR - EAR + 1)

 The pointer updating algorithm for the circular addressing mode is as shown below
 The four cases explained earlier are as shown in the figure below:

Case iii) EAR < SAR and updated PNTR > SAR; Case iv) EAR < SAR and updated PNTR < EAR
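
A C version of this pointer-updating rule, covering all four cases (SAR, EAR and the updated pointer are treated as plain 16-bit addresses, and the buffer length follows the definition given above):

```c
#include <stdint.h>

/* Wrap an updated circular-buffer pointer back into the buffer [SAR, EAR]
 * (or [EAR, SAR] when SAR > EAR).                                            */
uint16_t circular_update(uint16_t pntr, uint16_t sar, uint16_t ear)
{
    uint16_t lo  = (sar < ear) ? sar : ear;
    uint16_t hi  = (sar < ear) ? ear : sar;
    uint16_t len = (uint16_t)(hi - lo + 1);         /* buffer length           */

    if (pntr > hi) return (uint16_t)(pntr - len);   /* wrapped past the end    */
    if (pntr < lo) return (uint16_t)(pntr + len);   /* wrapped past the start  */
    return pntr;                                    /* still inside the buffer */
}
```

For Problem 10 below, circular_update(0x0212, 0x0200, 0x020F) returns 0x0202 and circular_update(0x01FC, 0x0200, 0x020F) returns 0x020C.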
Problem 10: A DSP has a circular buffer with the start and the end
addresses as 0200h and 020Fh respectively. What would be the
new values of the address pointer of the buffer if, in the course of
address computation, it gets updated to
a. 0212h
b. 01FCh
Solution: Buffer Length= (EAR-SAR+1)= 020F-0200+1=10h
a. New Address Pointer= Updated Pointer-buffer length = 0212-
10=0202h
b. New Address Pointer= Updated Pointer+buffer length =
01FC+10=020Ch
Problem 11: Repeat the previous problem for SAR = 0210h and EAR = 0201h
Solution: Buffer Length = (SAR-EAR+1) = 0210 - 0201 + 1 = 10h
a. New Address Pointer = Updated Pointer - buffer length = 0212 - 10 = 0202h
b. New Address Pointer = Updated Pointer + buffer length = 01FC + 10 = 020Ch
2. Bit Reversed Addressing Mode
 To implement FFT algorithms we need to access the data in a bit

reversed manner
 Hence a special addressing mode called bit reversed addressing

mode is used to calculate the index of the next data to be fetched


 It works as follows. Start with index 0. The present index can be

calculated by adding half the FFT length to the previous index in a


bit reversed manner, carry being propagated from MSB to LSB
Current index = Previous index +B (1/2 FFT size), where +B denotes bit-reversed (reverse-carry) addition
Compute the indices for an 8-point FFT using Bit reversed
Addressing Mode
 Start with index 0. Therefore the first index would be (000)

 Next index can be calculated by adding half the FFT length, in this

case it is (100) to the previous index


i.e. Present Index= (000)+B (100)= (100)
 Similarly the next index can be calculated as

Present Index= (100)+B (100)= (010)


 The process continues till all the indices are calculated

 The following table summarizes the calculation


Index in Binary    Decimal Value    Bit Reversed Index    Decimal Value
000                0                000                   0
001                1                100                   4
010                2                010                   2
011                3                110                   6
100                4                001                   1
101                5                101                   5
110                6                011                   3
111                7                111                   7
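
Reverse-carry addition can be modelled by bit-reversing the index, adding 1 normally, and reversing back. A C sketch that reproduces the table above for an 8-point FFT:

```c
#include <stdio.h>

/* Reverse the lowest 'bits' bits of v, e.g. 001 -> 100 for bits = 3.            */
static unsigned bit_reverse(unsigned v, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++)
        r = (r << 1) | ((v >> i) & 1u);
    return r;
}

int main(void)
{
    const unsigned N = 8, bits = 3;          /* 8-point FFT, 3-bit indices        */
    unsigned index = 0;
    for (unsigned k = 0; k < N; k++) {
        printf("%u ", index);                /* prints: 0 4 2 6 1 5 3 7           */
        /* adding B(N/2) with reverse carry == adding 1 in the reversed domain    */
        index = bit_reverse(bit_reverse(index, bits) + 1u, bits);
    }
    printf("\n");
    return 0;
}
```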

Address Generation Unit

 The Address Generation Unit generates the addresses of the operands required to carry out the operation

 It has to work fast in order to satisfy the timing constraints

 As the address generation unit has to perform some mathematical operations in order to calculate the operand addresses, it is provided with a separate ALU


 Address generation typically involves one of the following

operations
a. Getting value from immediate operand, register or a memory
location
b. Incrementing/ decrementing the current address
c. Adding/subtracting the offset from the current address
d. Adding/subtracting the offset from the current address and
generating new address according to circular addressing mode
e. Generating new address using bit reversed addressing mode
The block diagram of a typical address generation unit is as shown in
figure below

Address Generation Unit
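
A compact behavioural model of the update choices listed above (illustrative only; the circular and bit-reversed updates would reuse the helpers sketched earlier):

```c
typedef enum {
    AGU_IMMEDIATE,      /* take the address from an immediate value or register  */
    AGU_INCREMENT,      /* current address + 1                                    */
    AGU_DECREMENT,      /* current address - 1                                    */
    AGU_ADD_OFFSET,     /* current address + offset                               */
    AGU_SUB_OFFSET      /* current address - offset                               */
} agu_op_t;

/* Separate address ALU: computes the next operand address in parallel with
 * the data ALU, so address arithmetic does not steal data-path cycles.          */
unsigned agu_next(agu_op_t op, unsigned addr, unsigned offset, unsigned value)
{
    switch (op) {
    case AGU_IMMEDIATE:  return value;
    case AGU_INCREMENT:  return addr + 1;
    case AGU_DECREMENT:  return addr - 1;
    case AGU_ADD_OFFSET: return addr + offset;
    case AGU_SUB_OFFSET: return addr - offset;
    }
    return addr;
}
```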


Programmability and Program Control
 A programmable DSP device should provide the programming

capability involving branching, looping and subroutines


 The implementation of repeat capability should be hardware based so

that it can be programmed with minimal or zero overhead


 A dedicated register can be used as a counter

 In a normal subroutine call, the return address has to be stored in a stack, thus requiring memory accesses for storing and retrieving the return address, which in turn reduces the speed of operation
 Hence a LIFO memory can be directly interfaced with the program counter
1. Program Control
 Like microprocessors, DSP also requires a control unit to provide

necessary control and timing signals for the proper execution of the
instructions
 In microprocessors, the control is microcode based, where each instruction is divided into microinstructions stored as microcode in a micro memory called the microstore
 This microstore and microcode are not accessible to the programmer
 The processor is designed such that only the control unit can access them during the execution of instructions


Advantage: Easy to design and implement as less hardware is required
Disadvantage: Low speed of execution of instruction, as execution of
each instruction implies fetching the microcode from microstore and
decoding them
 As this mechanism is slower, it is not suitable for DSP applications
 Hence in a DSP the control is hardwired
 Each instruction of the DSP acts as the control lines to the hardwired control unit, and the resulting control and timing signals are generated to complete the instruction decode and execution
Advantage: Instruction execution time is much faster as compared to
microcoded design
Disadvantage: High hardware complexity; not easy to change the
design of control unit to incorporate additional features
2. Program Sequencer
 It is a part of the control unit, used to generate in sequence the addresses of the instructions to be accessed


 It calculates the address of the next instruction to be fetched

 The next address can be from one of the following sources

a. Program Counter
b. Instruction register in case of branching, looping and subroutine calls
c. Interrupt Vector table
d. Stack which holds the return address
 The block diagram of a program sequencer is as shown in figure

below
Program Sequencer
 The program sequencer should have the following circuitry
a. Logic to update the PC after every fetch
b. A counter to hold the count in case of looping
c. A logic block to check conditions for conditional jump instructions
d. Condition logic - status flags
Speed Issues
 Fast execution of algorithms is an essential requirement of a digital

signal processing architecture


 DSP architecture must include features that facilitate high speed of

operation and large throughputs


Some of the features that can increase the execution speed of the DSP architecture are
1. Hardware Architecture

2. Parallelism

3. Pipelining
Hardware Architecture
 Harvard architecture which separates the program and data

memories with separate buses for each, increases the speed of


execution of programs considerably
 Dual data memories with individual buses for each help in accessing dual operands simultaneously

 Multiple external memories require multiple buses external to the DSP; in addition to being expensive, external buses are slow for program access and execution
 By providing on-chip memories and an instruction cache, program

execution is speeded up considerably


 There are many techniques used in DSP architectures to increase

their speed of operation


 Two of these techniques are – Parallelism and Pipelining

Parallelism
 High speed of operation can be achieved in DSP by the provision of

Parallelism
 Parallelism means several things done at a time

 Functional units may operate in parallel and increase the throughput

 For example, instead of the same arithmetic unit being used to do

computations on data and address, a separate address arithmetic unit


can be provided to take care of address computations
 This frees up the main arithmetic unit to concentrate on data

computations alone and thereby increases the throughput


 Another example may be the provision of multiple memories and

multiple buses to fetch an instruction and operands simultaneously


 Algorithms can perform more than one operation at the same time,

such as adding while carrying out a multiply, shifting while reading


data from memory etc
 The architecture should be such that instructions and data required

for a computation are fetched from the memory simultaneously


 By parallelism, the following operations have to be performed in a single clock cycle


1. Fetch instructions and multiple data required for the computation
2. Shift data as they are fetched in order to accomplish scaling
3. Carryout a multiplication operation on the fetched data
4. Add the product to the previously computed result in the
accumulator
5. Save the accumulator contents in the memory storage, if required,
and
6. Compute new addresses for the instruction and data required for
the next operation
Pipelining
 An architectural feature to increase the speed of the DSP algorithm is

pipelining
 In a pipelined architecture, an instruction to be executed is broken

into a number of steps


 A separate unit of the architecture performs each of these steps

 While the first of these units performs the first step on the current instruction, the second unit will be performing the second step on the previous instruction, and the third unit will be performing the third step on the instruction prior to that, and so on


 If p steps were required to complete the execution of each

instruction, it would take p units of time for the complete execution


of each instruction
 Since all the units will work all the time, one output will flow out of

the architecture at the end of each time unit


 The throughput is maintained as one instruction per unit time

Limitations
1. Each instruction has to be divided into steps that take equal amounts of time to perform, which complicates the design of the architecture
2. Extra time is required at the start of algorithm execution, which results in pipeline latency
 For example, Let us assume that the execution of an instruction can

be broken into 5 steps:


1. Instruction fetch
2. Instruction decode
3. Operand fetch
4. Execute
5. Save the result
 Figure shows how a pipelined processor will handle this

 For simplicity, we will assume that all the steps take equal amounts

of time
 From the figure, we observe that the output corresponding to the first instruction is available after 5 units of time
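
With p equal-time stages, the first result appears after p time units and one more result every unit thereafter, so n instructions finish in p + (n - 1) units. A tiny C check for the 5-stage example (n = 100 is an arbitrary illustration):

```c
#include <stdio.h>

int main(void)
{
    const int p = 5;               /* fetch, decode, operand fetch, execute, save */
    const int n = 100;             /* number of instructions                      */
    printf("latency = %d units, total = %d units for %d instructions\n",
           p, p + (n - 1), n);     /* latency = 5, total = 104                    */
    return 0;
}
```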


System level parallelism and pipelining
 The parallelism and pipelining concepts can be extended to the

implementation of DSP algorithms


 Consider the example of an 8-tap (8 coefficients) FIR filter given by:

y(n) = Σ h(i) x(n − i),  i = 0, 1, …, 7

 It can be realized using different methods

1. Implementation using a single MAC unit


2. Pipelined Implementation using eight multipliers and eight
accumulators
3. Parallel Implementation using two MAC units
Implementation using a single MAC unit
 If only one multiplier and accumulator is available, it must be used

8 times to compute the eight product terms and find their sum
 Each input sample is delayed from the previous sample by 8T

 T is the time taken by the multiplier and accumulator to compute

one product term and add it to the previously accumulated sum in


the accumulator
 Input samples and the filter coefficients are fed to the multiplier

through multiplexers
 As each product term is generated, it is added to the previously

accumulated sum in the MAC unit


Pipelined Implementation using eight multipliers and
eight accumulators
 The implementation of the FIR filter of Eqn can be speeded up if

more multipliers and accumulators are available


 Let us assume that there are eight multipliers and eight

accumulators connected in a pipelined structure, as shown in


Figure
 Each multiplier computes one product term and passes it on to the

corresponding accumulator, which in turn adds it to the summation


passed on from the previous accumulator
 Since all the multipliers and accumulators work all the time, a new

output sample is generated once every T units of time


 This is the time required by the multiplier and accumulator to

compute one product term and add it to the sum passed on from
the previous stage of the pipeline
 This implementation can take in a new input sample once every T

units of time and generate an output sample at the same rate


 In other words, this filter implementation works 8 times faster than

the simple one MAC implementation


Parallel Implementation using two MAC units
 A third implementation of the FIR filter of Eqn is shown in Figure

 This implementation uses two MAC units and an adder at the output

 Each MAC computes four of the eight product terms

 Input samples and the filter coefficients are fed to the MACs using

multiplexers that are controlled such that correct combinations of


samples and the corresponding filter coefficients are fed to the two
MACs at any given time
 If T time units are required to compute one pair of products and add

them to the previously accumulated sum in the MAC units, it will


require 4T units of time to generate the final output by adding the
outputs of the two MACs
 At this time, a new input sample can be applied to the filter for

computation of the next output sample


 The speed of this implementation is 2 times that of the one-MAC implementation and one fourth that of the pipelined eight-multiplier, eight-accumulator implementation
 The maximum rate at which input samples can be applied to this

filter implementation is 2 times that of the first implementation and


one fourth that of the second
 The table below summarizes the performance of the three implementations described (T is the time for one multiply-accumulate)

Implementation                                        Time per output sample
Single MAC unit                                       8T
Pipelined, eight multipliers and eight accumulators   T
Parallel, two MAC units with an output adder          4T
 It is possible to achieve a higher speed implementation by the use of parallelism and/or pipelining

 However, this increases the circuit complexity
Features for external interfacing
 It is important for a DSP device to be able to communicate with the

outside world
 The outside world provides the signal to be processed and receives

the processed signal


 Most of the peripherals used in conventional microprocessors are

also needed in a DSP system


 Peripherals include interfaces for interrupts, direct memory access,

serial I/O, and parallel I/O
