Digital signal processors
Department of ECE
Bannari Amman Institute of Technology,
Sathyamangalam
Technology
Why DSP - Why Not Analog
ASP (analog signal processing)
• Inflexible to changes
• Sensitive to electrical noise
• Limited repeatability due to temperature variations
• Limited accuracy
• Difficult and costly to store
• Difficult to implement
DSP (digital signal processing)
• More flexible
• Highly immune to noise
• Repeatable and reproducible
• Accuracy controlled by word length
• Storage is easy and cheap
• Easy to implement
Digital Signal Processing
Disadvantages
• Introduces quantization and round-off errors
• ADC and DAC circuits are complex to design
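As a rough illustration of how word length bounds accuracy and where quantization error comes from, the sketch below rounds a value to a signed fixed-point fraction. It is written in Python purely as a model; the function name `quantize` and the fractional scaling are assumptions made for this example, not part of the slides.

```python
def quantize(x, bits):
    """Round x in [-1, 1) to a signed fixed-point fraction using `bits` total bits."""
    scale = 1 << (bits - 1)                    # steps per unit (assumed fractional format)
    code = round(x * scale)                    # nearest integer code
    code = max(-scale, min(scale - 1, code))   # saturate to the representable range
    return code / scale


for bits in (8, 16):
    error = abs(quantize(0.3, bits) - 0.3)
    print(bits, "bits -> error", error)
```

With these assumptions the error never exceeds half a quantization step, so a longer word gives a smaller error.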
Block diagram of communication system
Digital Signal Processing (DSP)
Processing: a series of operations performed according to programmed instructions.
Signal: a parameter (electrical quantity or effect) that can be varied in such a way as to
convey information.
Digital: operating by the use of discrete signals to represent data in the form of
numbers.
Digital Signal Processors (DSPs)
Digital signal processor (DSP): an electronic system that processes digital signals.
Real-Time DSP Applications
• Cell phones.
• Fax machines.
• DVD players and other home audio equipment.
• Computer disk drives.
• High-resolution printers.
• Digital cameras.
DSP ARCHITECTURES
VON NEUMANN ARCHITECTURE
Developed by John von Neumann in 1946.
Three buses:
a) Data bus
b) Address bus
c) Control bus
• Both instructions and data are stored in the same memory.
HARVARD ARCHITECTURE
Named after the Harvard Mark I computer.
• Instructions and data are stored in separate memories, so they can be transferred simultaneously.
• Performance improves because instructions and data can be fetched at the same time.
• Also has ALUs and input/output units.
MODIFIED HARVARD ARCHITECTURE
Two or more memory buses.
Two independent memory banks.
Not interchangeable.
Less flexible.
VLIW – Very Long Instruction Word / Multiple ALUs
Increases the number of instructions processed per second.
Executes multiple instructions per cycle.
Transfers each instruction to an appropriate functional unit.
Advantages of VLIW Architecture
Increased Performance.
Easier to program.
More execution units can be added, allowing more instructions to be packed
into one VLIW instruction.
Disadvantages of VLIW Architecture
Increase in memory use.
High Power Consumption.
DSP CLASSIFICATION
• By arithmetic format
– Fixed-point
– Floating-point
– Block floating-point
• By data width
– Typical fixed-point DSPs: 16-bit
– Typical floating-point DSPs: 32-bit
• By memory organization
• By multiprocessor support
Contd.,
• By speed
– Millions of instructions per second (MIPS)
– A basic operation (e.g. MAC)
– A basic algorithm (e.g. FFT, FIR or IIR filter)
– Benchmark programs
• By power consumption
– Operating voltage
– Sleep or idle mode
– Programmable clock dividers
– Peripheral control
• First generation (TI TMS32010)
• Second generation (Motorola DSP56001, AT&T
DSP16A, Analog Dev. ADSP-2100, TI TMS320C50)
• Third generation (Motorola DSP56301, TI
TMS320C541, TI TMS320C80, Motorola
MC68356)
• Fourth generation (TI TMS320C6201, Intel
Pentium MMX)
• 16-bit fixed-point.
• Harvard architecture.
• Accumulator.
• Specialized instruction set.
• 390 ns MAC time (228 ns today).
• 24-bit data, instructions
• 3 memory spaces (X, Y, P)
• Parallel moves
• Single- and multi-instruction hardware loops
• Modulo addressing
• 75 ns MAC (21 ns today)
• Enhanced conventional DSP architectures
• 3.0 or 3.3 volts.
• More on-chip memory.
• Application-specific function units in data path or
as co-processors.
• More sophisticated debugging and application
development tools.
• DSP cores (Pine & Oak from DSP Group, cDSP from TI)
• 20 ns MAC (10 ns today).
• Very high clock speeds and superscalar
architectures.
• VLIW-like architectures, achieve top
performance via high parallelism and increased
clock speeds.
• 3 ns MAC throughput.
• Expensive, power-hungry.
TMS320 Family & Its Applications
NOMENCLATURE
TMS320 DSP Family Overview
• Consists of fixed-point, floating-point, and multiprocessor digital signal processors.
• Designed specifically for real-time signal processing.
Characteristics
• Very flexible instruction set
• Inherent operational flexibility
• High-speed performance
• Innovative parallel architecture
• Cost-effectiveness
• C-friendly architecture
C5x Architecture
TMS320C5x Overview
• Fixed-point, 16-bit processor operating at 40 MHz.
• Consists of the ’C50, ’C51, ’C52, ’C53, ’C53S, ’C56, ’C57, and ’C57S DSPs.
• Fabricated in CMOS integrated-circuit technology.
• Single-cycle instruction execution time is 50 ns.
• Executes up to 50 million instructions per second (MIPS).
• Operational flexibility and speed due to advanced Harvard architecture
• CPU with application-specific hardware logic, on-chip peripherals, on-chip
memory, and a highly specialized instruction set.
Advantages
• Increased performance and versatility due to enhanced architectural design
• Low power consumption due to advanced integrated-circuit processing technology
• Source code compatibility with ’C1x, ’C2x, and ’C2xx DSPs for fast and easy
performance upgrades
• Enhanced instruction set for faster algorithms and for optimized high-level
language operation
General DSP System Block Diagram
Architecture of TMS320C5X
Four sub-blocks:
• Bus structure
• Central processing unit (CPU)
• On-chip memory
• On-chip peripherals
Bus Structure
Separate program and data buses allow simultaneous access to program
instructions and data, providing a high degree of parallelism.
Four major buses:
• Program bus (PB)
Carries the instruction code and immediate operands
from program memory to CPU.
• Program address bus (PAB)
Provides address to program memory space for both
read and write.
• Data read bus (DB)
Interconnects various elements of the CPU to data
memory space
• Data read address bus (DAB)
Provides address to access the data memory space
Central Processing Unit (CPU)
Elements of the CPU
• Central arithmetic logic unit (CALU)
Consists of:
16 × 16-bit parallel multiplier
32-bit accumulator (ACC)
32-bit accumulator buffer (ACCB)
Product register (PREG)
Additional shifters at the outputs of ACC and PREG
Used to perform 2’s-complement arithmetic.
• Parallel logic unit (PLU)
A second logic unit.
Executes logic operations on data without affecting the contents of the accumulator or PREG.
Can set, clear, test, or toggle bits in a status register, control register, or any data memory location.
Results are written back to the original data memory location.
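The PLU operations listed above (set, clear, test, toggle) can be pictured as read-modify-write operations on a 16-bit memory word that never touch the accumulator. The following is only a Python model of that idea; the helper names and the list standing in for data memory are hypothetical and are not ’C5x instructions.

```python
MASK16 = 0xFFFF  # data memory words are 16 bits wide


def plu_set(mem, addr, mask):
    """Set the masked bits and write the result back to the same location."""
    mem[addr] = (mem[addr] | mask) & MASK16


def plu_clear(mem, addr, mask):
    """Clear the masked bits and write the result back to the same location."""
    mem[addr] = mem[addr] & ~mask & MASK16


def plu_toggle(mem, addr, mask):
    """Invert the masked bits and write the result back to the same location."""
    mem[addr] = (mem[addr] ^ mask) & MASK16


def plu_test(mem, addr, mask):
    """Report whether any masked bit is set; memory is left unchanged."""
    return (mem[addr] & mask) != 0


memory = [0, 0, 0, 0]           # stand-in for a few data memory words
plu_set(memory, 2, 0x0005)      # word 2 becomes 0x0005
plu_toggle(memory, 2, 0x0001)   # word 2 becomes 0x0004
```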
Contd.,
• Auxiliary register arithmetic unit (ARAU)
A register file of eight 16-bit auxiliary registers (AR0–AR7) is connected to the ARAU.
3-bit auxiliary register pointer (ARP).
Unsigned 16-bit ALU.
The ARs are used for indirect addressing of data memory or for temporary data storage.
• Memory-mapped registers
96 registers are memory-mapped into page 0 of the data memory space.
They are used for indirect data address pointers, temporary storage,
CPU status and control, or integer arithmetic processing through the ARAU.
Contd.,
• Program controller
Decodes the operational instructions, manages the CPU
pipeline, stores the status of CPU operations, and decodes
the conditional operations.
It consists of:
Program counter: contains the program memory address used to fetch the instruction.
Control and status registers: 16-bit registers that contain control and status bits for the CPU.
Hardware stack: used for PUSH and POP operations.
Address generation unit: holds table information.
Instruction register: holds the instruction code being executed.
DSP Requires Multiply and Accumulate
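The multiply-accumulate (MAC) pattern behind this statement is a running sum of products, which is the core of an FIR filter output. The sketch below is a plain Python illustration; `mac_sum` is a hypothetical name and the sample values are made up.

```python
def mac_sum(coeffs, samples):
    """Return sum(c * x) using one multiply-accumulate step per tap."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += c * x    # multiply, then accumulate
    return acc


# One FIR output sample for three taps (illustrative values only).
print(mac_sum([1, 2, 3], [4, 5, 6]))
```

Each loop iteration corresponds to one MAC operation, which is why MAC time is used as a speed measure for DSPs.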
On-Chip Memory
• The ’C5x has a total address range of 224K words × 16 bits.
• The memory space is divided into four individually selectable memory segments:
64K-word program memory space
64K-word local data memory space
64K-word input/output ports
32K-word global data memory space
Large on-chip memories include:
• Data/program dual-access RAM (DARAM)
The ’C5x carries 1056 words × 16 bits of on-chip dual-access RAM (DARAM).
DARAM is divided into three individually selectable memory blocks:
512-word data or program DARAM block B0
512-word data DARAM block B1
32-word data DARAM block B2
DARAM can be configured in one of two ways:
All 1056 words × 16 bits configured as data memory
544 words × 16 bits configured as data memory and
512 words × 16 bits configured as program memory
Contd.,
• Data/program single-access RAM (SARAM)
The ’C5x carries 16-bit on-chip single-access RAM (SARAM) of various sizes.
SARAM is divided into 1K- and/or 2K-word blocks that are contiguous in program or data memory
space.
The CPU supports parallel accesses to these SARAM blocks, but
any one SARAM block can be accessed only once per machine cycle.
SARAM can be configured by software in one of three ways:
All SARAM configured as data memory
All SARAM configured as program memory
SARAM configured as both data memory and program memory
On-Chip Peripherals
On-chip peripherals are:
• Clock generator
• Hardware timer
• Software-programmable wait-state generators
• Parallel I/O ports
• Host port interface (HPI)
• Serial port
• Buffered serial port (BSP)
• Time-division multiplexed (TDM) serial port
• User-maskable interrupts
Contd.,
• Clock Generator
Consists of an internal oscillator and a phase-locked loop (PLL) circuit.
Can be driven internally by a crystal resonator circuit or externally by a
clock source.
The PLL circuit can generate the internal CPU clock by multiplying the clock source
by a specific factor, so a clock source with a lower frequency than
that of the CPU can be used.
• Hardware Timer
A 16-bit hardware timer with a 4-bit prescaler is available.
It clocks at a rate between 1/2 and 1/32 of the machine cycle rate
(CLKOUT1), depending on the timer’s divide-down ratio.
It can be stopped, restarted, reset, or disabled by specific status bits.
Contd.,
• Host Port Interface (HPI)
Available on the ’C57S and ’LC57, the HPI is an 8-bit parallel I/O port that provides
an interface to a host processor.
Information is exchanged between the DSP and the host processor through on-chip
memory that is accessible to both the host processor and the ’C57.
• Serial Port
Three different kinds of serial ports are available:
a general-purpose serial port
a time-division multiplexed (TDM) serial port
a buffered serial port (BSP)
Each ’C5x contains at least one general-purpose, high-speed, synchronous, full-duplex
serial port interface that provides direct communication with serial devices such as codecs,
serial analog-to-digital (A/D) converters, and other serial systems.
It can operate at up to one-fourth of the machine cycle rate.
The serial port transmitter and receiver are double-buffered and individually controlled by
maskable external interrupt signals.
Data is framed either as bytes or as words.
Contd.,
• TDM Serial Port
Available on the ’C50, ’C51, and ’C53 devices, the TDM serial port is a full-duplex serial port that can
be configured by software either for synchronous operation or for time-division
multiplexed operation.
Commonly used in multiprocessor applications.
• Test/Emulation
On the ’C50, ’LC50, ’C51, ’LC51, ’C53, ’LC53, ’C57S and ’LC57S, an IEEE standard 1149.1
(JTAG) interface with boundary-scan capability is used for emulation and test.
It provides boundary scan to and from the interfacing devices.
It can be used to test pin-to-pin continuity and to perform operational tests on devices that
are peripheral to the ’C5x.
On the ’C52, ’LC52, ’C53S, ’LC53S, ’LC56, and ’LC57, an IEEE standard 1149.1 (JTAG)
interface without boundary-scan capability is used for emulation purposes only.
On-board emulation can be performed by means of the IEEE standard 1149.1 serial scan pins
and the emulation-dedicated pins.
Pipelining
Definition
In the operation of the pipeline, the instruction fetch, decode, operand read,
and execute operations are independent, which allows overall instruction
executions to overlap.
(Or)
The process of fetching a new instruction while another instruction is being executed.
Advantages:
Reduce the critical path.
Increase the clock speed or sample speed.
Reduce power consumption.
Improve the System Performance.
Increase the efficiency.
Pipeline Structure
The FOUR PHASES of pipeline structure and their functions are as
follows:
Fetch (F)
This phase fetches the instruction words from memory and
updates the program counter (PC).
Decode (D)
This phase decodes the instruction word and performs address
generation and ARAU updates of auxiliary registers.
Read (R)
This phase reads operands from memory, if required.
If the instruction uses indirect addressing mode, it reads the
memory location pointed to by the auxiliary register that the ARP selects,
before the update of the previous decode phase.
Execute (E)
This phase performs the specified operation and, if required,
writes the results of a previous operation to memory.
Six-stage pipeline (P F D A R E):
• Prefetch (P): Calculate address of instruction
• Fetch (F): Collect instruction
• Decode (D): Interpret instruction
• Access (A): Collect address of operand
• Read (R): Collect operand
• Execute (E): Perform operation
Four Level Pipeline Operation
Example: Pipeline Operation of 1-Word Instruction
ADD *+
SAMM TREG0
MPY *+
SQRA *+, AR2
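The overlap of these four one-word instructions can be tabulated with a small model. This is a sketch in Python, not ’C5x code, and it assumes an ideal four-phase pipeline (F, D, R, E) with no stalls or conflicts; `pipeline_table` is a hypothetical helper introduced for illustration.

```python
STAGES = ["F", "D", "R", "E"]   # fetch, decode, read, execute


def pipeline_table(instructions):
    """Return (instruction, row) pairs; row[c] is the phase active in cycle c, or '-'."""
    n_cycles = len(instructions) + len(STAGES) - 1
    table = []
    for i, name in enumerate(instructions):
        row = ["-"] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage          # instruction i enters the pipeline in cycle i
        table.append((name, row))
    return table


program = ["ADD *+", "SAMM TREG0", "MPY *+", "SQRA *+, AR2"]
for name, row in pipeline_table(program):
    print(f"{name:<14}" + " ".join(row))
```

Under these assumptions each instruction starts one cycle after the previous one, so four instructions complete in seven cycles instead of sixteen.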
Addressing Modes of the ’C5x
Direct addressing
Indirect addressing
Immediate addressing
Memory-mapped register addressing
Circular addressing
Direct Addressing
The instruction contains the lower 7 bits of the data memory address (dma).
The 7-bit dma is concatenated with the 9 bits of the data memory page pointer (DP) in
status register 0 to form the full 16-bit data memory address.
This 16-bit data memory address is placed on the internal direct data memory address
bus (DAB).
The DP points to one of 512 possible data memory pages, and the 7-bit address in the
instruction points to one of 128 words within that data memory page.
Load the DP bits by using the LDP or the LST #0 instruction.
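The address formation described above can be written as a shift and an OR. The following Python sketch models only the arithmetic; `direct_address` is a hypothetical name used for illustration.

```python
def direct_address(dp, dma):
    """Concatenate the 9-bit page pointer and the 7-bit offset into a 16-bit address."""
    if not 0 <= dp < 512:
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= dma < 128:
        raise ValueError("dma must fit in 7 bits")
    return (dp << 7) | dma


# Page 100H, word 1 within the page.
print(hex(direct_address(0x100, 0x01)))
```

For instance, after LDP #100H, an operand written as 0001 refers to data memory address 8001h.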
Instruction word format:
• Bits 15 through 8 contain the opcode.
• Bit 7, with a value of 0, selects the direct addressing mode.
• Bits 6 through 0 contain the dma.
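The three fields can be pulled out of a 16-bit instruction word with shifts and masks. This is a Python sketch of the bit layout only; `decode_fields` is a hypothetical helper, and the word 0xAB05 is an arbitrary bit pattern, not a real ’C5x opcode.

```python
def decode_fields(word):
    """Split a 16-bit instruction word into (opcode, mode_bit, dma)."""
    opcode = (word >> 8) & 0xFF    # bits 15 through 8
    mode_bit = (word >> 7) & 0x1   # bit 7: 0 selects direct addressing
    dma = word & 0x7F              # bits 6 through 0
    return opcode, mode_bit, dma


print(decode_fields(0xAB05))   # arbitrary example word
```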
Immediate Addressing mode
Used to handle constant data.
The constant can be 16 bits long, or 8, 9, or 13 bits long.
Depending on the length of the constant, the addressing mode is referred to as long
immediate or short immediate.
The # prefix specifies immediate addressing.
Contd.,
The instruction word(s) contain the value of the immediate operand.
The ’C5x has both 1-word (8-bit, 9-bit, and 13-bit constant) short immediate
instructions and 2-word (16-bit constant) long immediate instructions.
Short Immediate Addressing
Long Immediate Addressing
Memory-Mapped Register Addressing
Used to access the CPU and on-chip peripheral registers efficiently.
It operates like direct addressing, except that the upper 9 bits of the address are
assumed to be 0s.
The following instructions use memory-mapped register addressing:
LAMM — Load accumulator with memory-mapped register
LMMR — Load memory-mapped register
SAMM — Store accumulator in memory-mapped register
SMMR — Store memory-mapped register
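Because the upper 9 bits are taken as zero, the effective address always falls in data page 0 regardless of the current DP value. The Python sketch below models just that rule; `mmr_address` is a hypothetical name.

```python
def mmr_address(dp, dma):
    """Effective address in memory-mapped register addressing: DP is ignored."""
    del dp                 # the page pointer does not contribute to the address
    return dma & 0x7F      # upper 9 bits forced to 0, so the access lands in page 0


print(hex(mmr_address(0x100, 0x0B)))
```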
Indirect Addressing
An auxiliary register holds the address of the operand in memory.
Eight 16-bit auxiliary registers (AR0–AR7) provide flexible and powerful indirect
addressing. In indirect addressing, any location in the 64K-word data memory
space can be accessed using a 16-bit address contained in an AR. Figure 5–3
shows the hardware for indirect addressing.
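A simplified model of this mechanism: the ARP selects one of the eight ARs, the selected AR supplies the address, and the AR (and optionally the ARP) can be updated afterwards, as in operands written `*+` or `*+, AR2`. The class below is a Python sketch under those assumptions; `ARFile` and its method are hypothetical names, and only post-increment is modeled.

```python
class ARFile:
    """Eight 16-bit auxiliary registers plus a 3-bit pointer (simplified model)."""

    def __init__(self):
        self.ar = [0] * 8   # AR0 to AR7
        self.arp = 0        # selects the current AR

    def access(self, post_increment=False, next_arp=None):
        """Return the address held in the current AR, then apply the optional updates."""
        addr = self.ar[self.arp]
        if post_increment:
            self.ar[self.arp] = (addr + 1) & 0xFFFF   # wrap at 16 bits
        if next_arp is not None:
            self.arp = next_arp & 0x7
        return addr


regs = ARFile()
regs.arp = 1
regs.ar[1] = 0x8000
first = regs.access(post_increment=True)                # models an operand "*+"
second = regs.access(post_increment=True, next_arp=2)   # models "*+, AR2"
```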
Circular Addressing
Many algorithms such as convolution, correlation, and finite impulse response (FIR)
filters can use circular buffers in memory to implement a sliding window, which
contains the most recent data to be processed.
’C5x supports two concurrent circular buffers operating via the ARs.
The following five memory-mapped registers control the circular buffer operation:
CBSR1 — Circular buffer 1 start register
CBSR2 — Circular buffer 2 start register
CBER1 — Circular buffer 1 end register
CBER2 — Circular buffer 2 end register
CBCR — Circular buffer control register
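The wrap-around behavior of a circular buffer can be sketched as follows: a pointer steps through the buffer and returns to the start address after it has reached the end address. This Python class is only a model of the idea; it does not reproduce the actual register-level behavior of CBSR, CBER, and CBCR, and all names in it are hypothetical.

```python
class CircularBuffer:
    """Pointer that walks from start to end (inclusive) and then wraps to start."""

    def __init__(self, start, end):
        self.start = start    # plays the role of a start register
        self.end = end        # plays the role of an end register
        self.ptr = start

    def next_address(self):
        """Return the current address, then advance with wrap-around."""
        addr = self.ptr
        self.ptr = self.start if addr == self.end else addr + 1
        return addr


buf = CircularBuffer(0x200, 0x203)
addresses = [buf.next_address() for _ in range(6)]
```

A sliding window for an FIR filter can then overwrite the oldest sample at the wrapped address instead of shifting every sample in memory.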
Instruction Set
•The ’C5x instruction set supports numerically intensive signal-processing
operations as well as general-purpose applications, such as
multiprocessing and high-speed control.
• The instruction set is a superset of the ’C1x and ’C2x instruction sets and is
source-code upward compatible with both devices.
• Classifications:
Accumulator memory reference instructions
Auxiliary registers and data memory page pointer instructions
Parallel Logic Unit (PLU) instructions
Multiply instructions
Branch and call instructions
I/O and data memory operation instructions
Control instructions
Accumulator memory reference instructions
Auxiliary registers and data memory page pointer instructions
Parallel Logic Unit (PLU) instructions
PREG & Multiply instructions
Branch instructions
Control instructions
SIMPLE PROGRAMMING EXAMPLE
64 bit Addition and Subtraction
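ADDITION
        .MMREGS
        .TEXT
START:  LDP  #100H       ; select the data page that holds the operands
        LACC 0001,16     ; ACC high half = word 1 of first operand (shift left by 16)
        ADDS 0000        ; add word 0 of first operand without sign extension
        ADDS 0004        ; add word 0 of second operand
        ADD  0005,16     ; add word 1 of second operand to the high half
        SACL 0008        ; store result word 0
        SACH 0009        ; store result word 1
        LACC 0003,16     ; ACC high half = word 3 of first operand
        ADDC 0002        ; add word 2 of first operand plus the carry
        ADDS 0006        ; add word 2 of second operand
        ADD  0007,16     ; add word 3 of second operand to the high half
        SACL 0010        ; store result word 2
        SACH 0011        ; store result word 3
H:      B    H           ; loop here forever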
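SUBTRACTION
        .MMREGS
        .TEXT
START:  LDP  #100H       ; select the data page that holds the operands
        LACC 0001,16     ; ACC high half = word 1 of first operand (shift left by 16)
        ADDS 0000        ; add word 0 of first operand without sign extension
        SUBS 0004        ; subtract word 0 of second operand
        SUB  0005,16     ; subtract word 1 of second operand from the high half
        SACL 0008        ; store result word 0
        SACH 0009        ; store result word 1
        LACC 0002,0      ; ACC = word 2 of first operand
        SUBB 0006        ; subtract word 2 of second operand with borrow
        ADD  0003,16     ; add word 3 of first operand to the high half
        SUB  0007,16     ; subtract word 3 of second operand from the high half
        SACL 000AH       ; store result word 2
        SACH 000BH       ; store result word 3
H:      B    H           ; loop here forever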
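The idea behind both listings is multi-word arithmetic: each 64-bit operand is held as four 16-bit words, least significant first, and a carry or borrow is passed from the lower part to the upper part. The Python sketch below models this one 16-bit word at a time, whereas the assembly works on 32 bits per pass through the accumulator; the function names are hypothetical.

```python
def add64_words(a_words, b_words):
    """Add two values given as four 16-bit words each (least significant first)."""
    result, carry = [], 0
    for a, b in zip(a_words, b_words):
        total = a + b + carry
        result.append(total & 0xFFFF)   # keep 16 bits
        carry = total >> 16             # propagate the carry upward
    return result, carry


def sub64_words(a_words, b_words):
    """Subtract b from a, both given as four 16-bit words (least significant first)."""
    result, borrow = [], 0
    for a, b in zip(a_words, b_words):
        diff = a - b - borrow
        result.append(diff & 0xFFFF)    # keep 16 bits (two's complement wrap)
        borrow = 1 if diff < 0 else 0   # propagate the borrow upward
    return result, borrow


print(add64_words([0xFFFF, 0xFFFF, 0, 0], [1, 0, 0, 0]))
print(sub64_words([0, 0, 1, 0], [1, 0, 0, 0]))
```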
MULTIPLICATION AND EXPRESSION EVALUATION
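16-Bit Multiplication:
        .MMREGS
        .TEXT
        LDP  #100H       ; select the data page that holds the operands
        LACL #0          ; clear the accumulator
        LT   0000        ; TREG0 = first operand
        MPY  0001        ; PREG = TREG0 × second operand
        PAC              ; ACC = PREG
        SACL 0002,0      ; store the low word of the product
        SACH 0003,0      ; store the high word of the product
H:      B    H           ; loop here forever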
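The result of a 16 × 16 multiplication needs 32 bits, which is why the listing stores a low word and a high word. The sketch below models that in Python, assuming signed (two's-complement) operands; the helper names are hypothetical.

```python
def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement integer."""
    return word - 0x10000 if word & 0x8000 else word


def multiply16(a_word, b_word):
    """Return (low_word, high_word) of the 32-bit signed product."""
    product = (to_signed16(a_word) * to_signed16(b_word)) & 0xFFFFFFFF
    return product & 0xFFFF, product >> 16


print(multiply16(0x0003, 0x0004))
print(multiply16(0x4000, 0x4000))
```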
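Y = A*X1 + B*X2 + C*X3
        .MMREGS
        .TEXT
        LDP  #100H       ; select the data page that holds the operands
        LACL #0          ; clear the accumulator
        LT   0000        ; TREG0 = first coefficient
        MPY  0003        ; PREG = first product
        LTA  0001        ; TREG0 = second coefficient, ACC += PREG
        MPY  0004        ; PREG = second product
        LTA  0002        ; TREG0 = third coefficient, ACC += PREG
        MPY  0005        ; PREG = third product
        APAC             ; ACC += PREG
        SACL 0006,0      ; store the low word of Y
H:      B    H           ; loop here forever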
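The data flow of this listing can be checked with a small memory model. The Python sketch below assumes that words 0 to 2 hold A, B, C, that words 3 to 5 hold X1, X2, X3, and that the values are small non-negative integers; these placements are inferred from the operand addresses and are not stated in the slides. `evaluate_expression` is a hypothetical name.

```python
def evaluate_expression(mem):
    """Compute mem[0]*mem[3] + mem[1]*mem[4] + mem[2]*mem[5]; store the low word in mem[6]."""
    acc = 0
    for coeff_addr, x_addr in ((0, 3), (1, 4), (2, 5)):
        acc += mem[coeff_addr] * mem[x_addr]   # one product accumulated per step
    mem[6] = acc & 0xFFFF                      # only the low 16 bits are stored
    return mem[6]


memory = [2, 3, 4, 5, 6, 7, 0]   # A, B, C, X1, X2, X3, Y (illustrative values)
print(evaluate_expression(memory))
```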
GENERATION OF WAVEFORMS
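SQUARE WAVEFORM
        .MMREGS
        .TEXT
START:  LDP  #120H       ; select the data page used for the output value
        LACC #0H         ; start with the low level
LOOP:   SACL 0           ; save the current level in data memory
        RPT  #0FFH       ; repeat the next instruction 256 times
        OUT  0,04        ; write the level to I/O port 4
        CMPL             ; complement the accumulator to switch level
        B    LOOP
        .END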
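SAWTOOTH WAVEFORM
        .MMREGS
        .TEXT
START:  LDP  #120H       ; select the data page used for the output value
        LACC #0H         ; restart the ramp at zero
        SACL 0
        OUT  0,04H       ; write the value to I/O port 4
LOOP:   LACC 0           ; reload the current value
        OUT  0,04H
        ADD  #05h        ; step the ramp up by 5
        SACL 0
        SUB  #0FFFh      ; compare with the peak value 0FFFh
        BCND LOOP,LEQ    ; keep ramping while the value is at most 0FFFh
        B    START       ; otherwise restart the ramp
        .END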
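The output sequences these two programs would send to the port can be previewed with a short model. The Python sketch below follows the loop structure of the listings (hold each square-wave level for 256 writes, ramp the sawtooth in steps of 5 up to 0FFFh); it ignores instruction timing, and the function names are hypothetical.

```python
def square_wave(half_periods, hold=256):
    """Alternate between 0x0000 and 0xFFFF, writing each level `hold` times."""
    samples, level = [], 0x0000
    for _ in range(half_periods):
        samples.extend([level] * hold)
        level = ~level & 0xFFFF           # complement, kept to 16 bits
    return samples


def sawtooth(n_samples, step=0x05, limit=0x0FFF):
    """Ramp from 0 in increments of `step`; restart once the value exceeds `limit`."""
    samples = []
    while len(samples) < n_samples:
        value = 0
        samples.append(value)             # write performed right after START
        while len(samples) < n_samples:
            samples.append(value)         # write performed at the top of LOOP
            value += step
            if value > limit:             # branch condition fails, so restart
                break
    return samples


print(square_wave(2, hold=2))
print(sawtooth(5))
```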