DSP Processor Fundamentals
Editorial Board
John B. Anderson, Editor in Chief
VLSI DIGITAL SIGNAL PROCESSORS: An Introduction to Rapid Prototyping and Design Synthesis
Copublished with A K Peters, Ltd.
Vijay K. Madisetti
1995 Hardcover 544 pp ISBN 0-7806-9406-8
DSP Processor Fundamentals: Architectures and Features
Phil Lapsley
Jeff Bier
Amit Shoham
Berkeley Design Technology, Inc.
Edward A. Lee
University of California at Berkeley
IEEE
The Institute of Electrical and Electronics Engineers, Inc., New York
WILEY-INTERSCIENCE
A JOHN WILEY & SONS, INC., PUBLICATION
New York • Chichester • Weinheim • Brisbane • Singapore • Toronto
No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic,
mechanical, photocopying, recording, scanning or otherwise, except
as permitted under Sections 107 and 108 of the 1976 United States
Copyright Act, without either the prior written permission of the
Publisher, or authorization through payment of the appropriate per-
copy fee to the Copyright Clearance Center, 222 Rosewood Drive,
Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests
to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 605 Third
Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212)
850-6008. E-mail: [email protected].
This is the IEEE reprinting of a book previously published by Berkeley Design Technology, Inc.,
39355 California Street, Suite 206, Fremont, CA 94538.
(510) 791-9100
ISBN 0-7803-3405-1
Preface
This book is an introduction to the technical fundamentals of the architectures and features
of programmable digital signal processors. Programmable digital signal processors (often called
DSPs, PDSPs, or DSP processors) are microprocessors that are specialized to perform well in dig-
ital signal processing-intensive applications. Since the introduction of the first commercially suc-
cessful DSP processors in the early 1980s, dozens of new processors have been developed,
offering system designers a vast array of choices. According to the market research firm Forward
Concepts, sales of user-programmable DSPs will total roughly US $1.8 billion in 1996, with a
projected annual growth rate of 35 percent [Str95]. With semiconductor manufacturers vying for
bigger shares of this booming market, designers' choices will broaden even further in the next few
years.
Organization
This book is organized as follows:
Digital Signal Processing and DSP Systems
Chapter 1 provides a high-level overview of digital signal processing, including DSP sys-
tem features and applications.
DSP Processor Embodiments and Alternatives
Chapter 2 provides a brief introduction to DSP processors and then discusses the different
forms that DSP processors take, including chips, multichip modules, and cores. In this
Acknowledgments
This book would not have been possible without the help of many people.
First, we would like to thank the members and former members of our staff at Berkeley
Design Technology. Those who contributed through their efforts in benchmark coding, processor
analysis, document production, or work on previous industry reports include Franz Weller,
Mohammed Kabir, Rosemary Brock, Dave Wilson, Cynthia Keller, Stephen Slater, and Michael
Kiernan. Ivan Heling of TechScribe provided valuable editing and document production services.
Second, we would like to acknowledge the assistance of the many individuals too numer-
ous to list here who contributed helpful insights and advice. In particular, we wish to thank Will
Strauss of Forward Concepts for sharing with us his encyclopedic knowledge of the DSP industry.
Kennard White provided valuable insights over the course of many wide-ranging discussions
about DSP processors, communications, and the DSP industry as a whole.
Finally, we thank product marketing managers, applications engineers, and designers at
Analog Devices, AT&T Microelectronics, Clarkspur Design, DSP Group, IBM Microelectronics,
Motorola, NEC, SGS-Thomson, TCSI, Texas Instruments, 3Soft, Zilog, and Zoran. All gave gen-
erously of their time and expertise.
Chapter 1
Digital Signal Processing and DSP Systems
For the purposes of this book, we define a digital signal processing (DSP) system to be any
electronic system making use of digital signal processing. Our informal definition of digital signal
processing is the application of mathematical operations to digitally represent signals. Signals are
represented digitally as sequences of samples. Often, these samples are obtained from physical
signals (for example, audio signals) through the use of transducers (such as microphones) and
analog-to-digital converters. After mathematical processing, digital signals may be converted
back to physical signals via digital-to-analog converters.
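The signal chain just described, from transducer through A/D conversion, mathematical processing, and D/A conversion, can be sketched in miniature. The Python fragment below is our own illustration, not from the book; the function names and the crude eight-bit conversion are assumptions made for clarity:

```python
def dsp_chain(analog_samples, process):
    """Digitize a signal, apply a DSP algorithm, and reconstruct it.
    'Analog' values are modeled as floats in [0.0, 1.0]; a real system
    would use an A/D converter producing quantized integer samples."""
    digital = [round(v * 127) for v in analog_samples]  # crude 8-bit A/D
    processed = process(digital)                        # the DSP algorithm
    return [v / 127 for v in processed]                 # crude D/A
```

With an identity algorithm (process=lambda d: d), the output matches the input to within one quantization step, illustrating that the digital stages, not the analog endpoints, determine the system's behavior.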
In some systems, the use of DSP is central to the operation of the system. For example,
modems and digital cellular telephones rely very heavily on DSP technology. In other products,
the use of DSP is less central, but often offers important competitive advantages in terms of fea-
tures, performance, and cost. For example, manufacturers of primarily analog consumer electron-
ics devices like audio amplifiers are beginning to employ DSP technology to provide new
features.
This chapter presents a high-level overview of digital signal processing. We first discuss
the advantages of DSP over analog systems. We then describe some salient features and character-
istics of DSP systems in general. We conclude with a brief look at some important classes of DSP
applications.
This chapter is not intended to be a tutorial on DSP theory. For a general introduction to
DSP theory, we recommend one of the many textbooks now available on DSP, such as Discrete-
Time Signal Processing by Oppenheim and Schafer [Opp89].
DSP systems also enjoy two additional advantages over analog systems:
Insensitivity to environment. Digital systems, by their very nature, are considerably less
sensitive to environmental conditions than analog systems. For example, an analog cir-
cuit's behavior depends on its temperature. In contrast, barring catastrophic failures, a
DSP system's operation does not depend on its environment-whether in the snow or in
the desert, a DSP system delivers the same response.
Insensitivity to component tolerances. Analog components are manufactured to particu-
lar tolerances-a resistor, for example, might be guaranteed to have a resistance within 1
percent of its nominal value. The overall response of an analog system depends on the
actual values of all of the analog components used. As a result, two analog systems of
exactly the same design will have slightly different responses due to slight variations in
their components. In contrast, correctly functioning digital components always produce
the same outputs given the same inputs.
These two advantages combine synergistically to give DSP systems an additional advantage over
analog systems:
Predictable, repeatable behavior. Because a DSP system's output does not vary due to
environmental factors or component variations, it is possible to design systems having
exact, known responses that do not vary.
Finally, some DSP systems may also have two other advantages over analog systems:
Reprogrammability. If a DSP system is based on programmable processors, it can be
reprogrammed-even in the field-to perform other tasks. In contrast, analog systems
require physically different components to perform different tasks.
Size. The size of analog components varies with their values; for example, a 100 µF
capacitor used in an analog filter is physically larger than a 10 pF capacitor used in a dif-
ferent analog filter. In contrast, DSP implementations of both filters might well be the
same size-indeed, might even use the same hardware, differing only in their filter coeffi-
cients-and might be smaller than either of the two analog implementations.
These advantages, coupled with the fact that DSP can take advantage of the rapidly increasing
density of digital integrated circuit (IC) manufacturing processes, increasingly make DSP the
solution of choice for signal processing.
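The reprogrammability and size advantages above can be made concrete with a sketch: two different digital filters can share identical code (and, on a DSP, identical hardware), differing only in their coefficient tables. The following Python fragment is our own illustration of a direct-form FIR filter; the coefficient values are arbitrary examples, not taken from the book:

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter: each output sample is the dot product
    of the most recent input samples with a fixed coefficient table."""
    n = len(coeffs)
    history = [0.0] * n  # delay line, most recent sample first
    output = []
    for x in samples:
        history = [x] + history[:-1]
        output.append(sum(c * h for c, h in zip(coeffs, history)))
    return output

# Identical code, different coefficients: a crude smoothing (low-pass)
# filter versus a crude differencing (high-pass) filter.
smoothed = fir_filter([1.0, 1.0, 1.0, 1.0], [0.25, 0.5, 0.25])
edges = fir_filter([1.0, 1.0, 1.0, 1.0], [-0.25, 0.5, -0.25])
```

Changing the filter means changing only the coefficient list; the code, and in a hardware implementation the circuitry, stays the same size.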
Algorithms
DSP systems are often characterized by the algorithms used. The algorithm specifies the
arithmetic operations to be performed but does not specify how that arithmetic is to be imple-
mented. It might be implemented in software on an ordinary microprocessor or programmable
[Table: common DSP algorithms and their typical applications]

Speech encryption and decryption: Digital cellular telephones, personal communications
systems, digital cordless telephones, secure communications

Speech recognition: Advanced user interfaces, multimedia workstations, robotics, automotive
applications, digital cellular telephones, personal communications systems, digital cordless
telephones

Speech synthesis: Multimedia PCs, advanced user interfaces, robotics

Speaker identification: Security, multimedia workstations, advanced user interfaces

Hi-fi audio encoding and decoding: Consumer audio, consumer video, digital audio broadcast,
professional audio, multimedia computers

Modem algorithms: Digital cellular telephones, personal communications systems, digital
cordless telephones, digital audio broadcast, digital signaling on cable TV, multimedia
computers, wireless computing, navigation, data/facsimile modems, secure communications

Audio equalization: Consumer audio, professional audio, advanced vehicular audio, music

Ambient acoustics emulation: Consumer audio, professional audio, advanced vehicular audio,
music

Audio mixing and editing: Professional audio, music, multimedia computers

Sound synthesis: Professional audio, music, multimedia computers, advanced user interfaces

Image compositing: Multimedia computers, consumer video, advanced user interfaces, navigation
Sample Rates
A key characteristic of a DSP system is its sample rate: the rate at which samples are con-
sumed, processed, or produced. Combined with the complexity of the algorithms, the sample rate
determines the required speed of the implementation technology. A familiar example is the digital
audio compact disc (CD) player, which produces samples at a rate of 44.1 kHz on two channels.
Of course, a DSP system may use more than one sample rate; such systems are said to be
multirate DSP systems. An example is a converter from the CD sample rate of 44.1 kHz to the dig-
ital audio tape (DAT) rate of 48 kHz. Because of the awkward ratio between these sample rates,
the conversion is usually done in stages, typically with at least two intermediate sample rates.
Another example of a multirate algorithm is a filter bank, used in applications such as speech,
audio, and video encoding and some signal analysis algorithms. Filter banks typically consist of
stages that divide the signal into high- and low-frequency portions. These new signals are then
downsampled (i.e., their sample rate is lowered by periodically discarding samples) and divided
again. In multirate applications, the ratio between the highest and the lowest sample rates in the
system can become quite large, sometimes exceeding 100,000.
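The downsampling step described above, and a single filter-bank stage, can be sketched as follows. This is our own toy illustration in Python; a real filter bank would use proper low-pass and high-pass filters rather than the pairwise averages and differences used here:

```python
def downsample(samples, factor):
    """Lower the sample rate by keeping every factor-th sample."""
    return samples[::factor]

def filter_bank_stage(samples):
    """One toy filter-bank stage: split a signal into crude low- and
    high-frequency parts (pairwise average and difference), which
    implicitly downsamples each part by a factor of two."""
    lo = [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]
    hi = [(samples[i] - samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]
    return lo, hi
```

Applying such a stage repeatedly to the low-frequency output is what drives the large ratios between the highest and lowest sample rates mentioned above.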
The range of sample rates encountered in signal processing systems is huge. In Figure 1-1
we show the rough positioning of a few classes of applications with respect to algorithm complex-
ity and sample rate. Sample rates for applications range over 12 orders of magnitude! Only at the
very top of that range is digital implementation rare. This is because the cost and difficulty of
implementing a given algorithm digitally increases with the sample rate. DSP algorithms used at
higher sample rates tend to be simpler than those used at lower sample rates.
Many DSP systems must meet extremely rigorous speed goals, since they operate on
lengthy segments of real-world signals in real-time. Where other kinds of systems (like data-
bases) may be required to meet performance goals on average, real-time DSP systems often must
meet such goals in every instance. In such systems, failure to maintain the necessary processing
rates is considered a serious malfunction. Such systems are often said to be subject to hard real-
time constraints. For example, let's suppose that the compact disc-to-digital audio tape sample
rate converter discussed above is to be implemented as a real-time system, accepting digital sig-
nals at the CD sample rate of 44.1 kHz and producing digital signals at the DAT sample rate of 48
kHz. The converter must be ready to accept a new sample from the CD source every 22.6 us (i.e.,
1/44100 s), and must produce a new output sample for the DAT device every 20.8 us (1/48000 s).
If the system ever fails to accept or produce a sample on this schedule, data are lost and the
resulting output signal is corrupted. The need to meet such hard real-time constraints creates spe-
cial challenges in the design and debugging of real-time DSP systems.
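The per-sample deadlines quoted above follow directly from the sample rates. A quick check, using a throwaway helper of our own rather than anything from the book:

```python
def sample_period_us(sample_rate_hz):
    """Time budget per sample, in microseconds."""
    return 1e6 / sample_rate_hz

cd_period = sample_period_us(44100)   # about 22.68 µs per CD input sample
dat_period = sample_period_us(48000)  # about 20.83 µs per DAT output sample
```

Every stage of the converter's processing for one sample must fit within these budgets, every time; there is no averaging over slow and fast cases as there is in a non-real-time system.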
Clock Rates
Digital electronic systems are often characterized by their clock rates. The clock rate usu-
ally refers to the rate at which the system performs its most basic unit of work. In mass-produced,
commercial products, clock rates of up to 100 MHz are common, with faster rates found in some
high-performance products. For DSP systems, the ratio of system clock rate to sample rate is one
of the most important characteristics used to determine how the system will be implemented. The
relationship between the clock rate and the sample rate partially determines the amount of
hardware needed to implement an algorithm with a given complexity in real-time. As the ratio of
sample rate to clock rate increases, so does the amount and complexity of hardware required to
implement the algorithm.

[FIGURE 1-1: rough positioning of several application classes by algorithm complexity
(horizontal axis, less complex to more complex) and sample rate (vertical axis, roughly
1/1000 Hz to 10 GHz). Radio signaling and radar appear near 1 to 10 GHz, instrumentation
near 1 Hz, financial modeling around 1/10 to 1/100 Hz, and weather modeling near 1/1000 Hz.]
Numeric Representations
Arithmetic operations such as addition and multiplication are at the heart of DSP algo-
rithms and systems. As a result, the numeric representations and type of arithmetic used can have
a profound influence on the behavior and performance of a DSP system. The most important
choice for the designer is between fixed-point and floating-point arithmetic. Fixed-point arith-
metic represents numbers in a fixed range (e.g., -1.0 to +1.0) with a finite number of bits of preci-
sion (called the word width). For example, an eight-bit fixed-point number provides a resolution
of 1/256 of the range over which the number is allowed to vary. Numbers outside of the specified
range cannot be represented; arithmetic operations that would result in a number outside this
range either saturate (that is, are limited to the largest positive or negative representable value) or
wrap around (that is, the extra bits resulting from the arithmetic operation are discarded).
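The saturate-versus-wraparound distinction can be illustrated with eight-bit two's-complement integers (range -128 to +127). The Python sketch below is our own; it uses the integer range rather than the fractional -1.0 to +1.0 range mentioned above, but the overflow behavior it demonstrates is the same:

```python
def saturate8(x):
    """Clamp a result to the 8-bit two's-complement range [-128, 127]."""
    return max(-128, min(127, x))

def wrap8(x):
    """Keep only the low eight bits of a result, as wraparound
    arithmetic does when the extra bits are discarded."""
    return ((x + 128) % 256) - 128

# 100 + 100 = 200 overflows an 8-bit signed result: saturation limits
# it to 127, while wraparound yields -56.
```

Saturation produces the representable value closest to the true result, which is why many DSP processors support it in hardware; wraparound produces a result that can be wildly wrong in sign and magnitude.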
are often given less weight, even though these applications usually involve the development of
custom software to run on the DSP and custom hardware that interfaces with the DSP.
High-Performance Applications
Another important class of applications involves processing large volumes of data with
complex algorithms for specialized needs. This includes uses like sonar and seismic exploration,
where production volumes are lower, algorithms are more demanding, and product designs are
larger and more complex. As a result, designers favor processors with maximum performance,
ease of use, and support for multiprocessor configurations. In some cases, rather than designing
their own hardware and software from scratch, designers of these systems assemble systems using
standard development boards and ease their software development tasks by using existing soft-
ware libraries.
Chapter 2
DSP Processors, Embodiments, and Alternatives
The previous chapter described digital signal processing in general terms, focusing on
DSP fundamentals, systems, and application areas. In this chapter we narrow our focus to DSP
processors. We begin with a high-level description of the features common to virtually all DSP
processors. We then describe typical embodiments of DSP processors and briefly discuss alterna-
tives to DSP processors, such as general-purpose microprocessors. The next several chapters pro-
vide a detailed treatment of DSP processor architectures and features.
Fast Multiply-Accumulate
The most often cited feature of DSP processors is the ability to perform a multiply-accu-
mulate operation (often called a MAC) in a single instruction cycle. The multiply-accumulate
operation is useful in algorithms that involve computing a vector product, such as digital filters,
correlation, and Fourier transforms. To achieve this functionality, DSP processors include a multi-
plier and accumulator integrated into the main arithmetic processing unit (called the data path) of
the processor. In addition, to allow a series of multiply-accumulate operations to proceed without
the possibility of arithmetic overflow, DSP processors generally provide extra bits in their accu-
mulator registers to accommodate growth of the accumulated result. DSP processor data paths are
discussed in detail in Chapter 4, "Data Path."
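The multiply-accumulate pattern is simply the inner loop of a vector (dot) product. The sketch below is our own Python rendering; Python integers cannot overflow, so the extended-precision accumulator that a DSP provides in hardware appears here only in the comments:

```python
def dot_product(x, h):
    """Vector product computed by repeated multiply-accumulate.
    On a fixed-point DSP, acc would live in an accumulator register
    with extra guard bits, so a long series of MACs can grow beyond
    the data word width without overflowing."""
    acc = 0
    for xi, hi in zip(x, h):
        acc += xi * hi  # one MAC, a single instruction cycle on a DSP
    return acc
```

For example, dot_product([1, 2, 3], [4, 5, 6]) evaluates to 32. An FIR filter, a correlation, and each butterfly stage of a Fourier transform all reduce to loops of exactly this shape, which is why single-cycle MAC hardware pays off so broadly.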
[Table: DSP processor features and their uses]

Fast multiply-accumulate: Most DSP algorithms, including filtering, transforms, etc., are
multiplication-intensive.

Multiple-access memory architecture: Many data-intensive DSP operations require reading a
program instruction and multiple data items during each instruction cycle for best performance.

Specialized addressing modes: Efficient handling of data arrays and first-in, first-out
buffers in memory.

Specialized program control: Efficient control of loops for many iterative DSP algorithms.
Fast interrupt handling for frequent I/O operations.

On-chip peripherals and input/output interfaces: On-chip peripherals like analog-to-digital
converters allow for small, low-cost system designs. Similarly, I/O interfaces tailored for
common peripherals allow clean interfaces to off-chip I/O devices.
Multichip Modules
Multichip modules (MCMs) are a kind of superchip. Rather than packaging a single inte-
grated circuit (IC) die in a ceramic or plastic package as is done with conventional ICs, MCMs
combine multiple, bare (i.e., unpackaged) dies into a single package. One advantage of this
approach is achieving higher packaging density-more circuits per square inch of printed circuit
board. This, in turn, results in increased operating speed and reduced power dissipation. As multi-
chip module packaging technology has advanced in the past few years, vendors have begun to
offer multichip modules containing DSP processors. For example, Texas Instruments sells an
MCM that includes two TMS320C40 processors and 128 Kwords of 32-bit static random-access
memory (SRAM).
Chip Sets
While some manufacturers combine multiple processors on a single chip, and others use
multichip modules to combine multiple chips into one package, another variation on DSP proces-
sor packaging is to divide the DSP into two or more separate packages. This is the approach that
Butterfly DSP has taken with their DSP chip set, which consists of the LH9320 address generator
and the LH9124 processor (formerly sold by Sharp Microelectronics). Dividing the processor into
two or more packages may make sense if the processor is very complex and if the number of
input/output pins is very large. Splitting functionality into multiple integrated circuits may allow
the use of much less expensive IC packages, and thereby provide cost savings. This approach also
provides added flexibility, allowing the system designer to combine the individual ICs in the con-
figuration best suited for the application. For example, with the Butterfly chip set, multiple
address generator chips can be used in conjunction with one processor chip. Finally, chip sets
have the potential of providing more I/O pins than individual chips. In the case of the Butterfly
chip set, the use of separate address generator and processor chips allows the processor to have
eight 24-bit external data buses, many more than provided by more common single-chip proces-
sors.
DSP Cores
An interesting approach for high-volume designs is the coupling of a programmable DSP
with user-defined circuitry on a single chip. This approach combines the benefits of a DSP proces-
sor (such as programmability, development tools, and software libraries) with the benefits of cus-
tom circuits (e.g., low production costs, small size, and low power consumption). In this section,
we briefly describe two variants of this design style: DSP core-based application-specific inte-
grated circuits (ASICs) and customizable DSP processors.
A DSP core is a DSP processor intended for use as a building block in creating a chip, as
opposed to being packaged by itself as an off-the-shelf chip. A DSP core-based ASIC is an ASIC
that incorporates a DSP core as one element of the overall chip. The DSP core-based ASIC
approach allows the system designer to integrate a programmable DSP, interface logic, peripher-
als, memory, and other custom elements onto a single integrated circuit. Figures 2-1 and 2-2 illus-
trate the DSP core-based ASIC concept.
Many vendors of DSP processors use the core-based approach to create versions of their
standard processors targeted at entire application areas, like telecommunications. In our
discussion of DSP core-based ASICs, we focus on the case where the user of a DSP processor
wants to create a DSP core-based ASIC for a specific application.

FIGURE 2-1. DSP core-based ASICs allow the integration of multiple processor
types and analog circuitry, in addition to memories, peripheral interfaces, and
custom digital circuitry. [The figure shows a single chip combining a DSP core, a
microcontroller running control software, custom digital and custom analog
circuitry, A-to-D and D-to-A converters, and a serial port.]
There are presently several companies established as providers of DSP cores for use in
ASIC designs: AT&T, Texas Instruments, SGS-Thomson Microelectronics, DSP Group, Clark-
spur Design, 3Soft, and Tensleep Design. Two additional firms (TCSI and Infinite Solutions) have
announced plans to offer DSP cores as well. Currently available DSP cores are summarized in
Table 2-3.
Note that vendors differ in their definitions of exactly what is included in a "DSP core."
For example, Texas Instruments' definition of a DSP core includes not only the processor, but
memory and peripherals as well. Clarkspur Design's and DSP Group's cores include memory but
not peripherals. SGS-Thomson's core includes only the processor and no peripherals or memory.
The services that these companies provide can be broadly divided into two categories. In
the first category, the vendor of the core is also the provider of foundry services used to fabricate
the ASIC containing the core; we refer to this category as "foundry-captive." In the second cate-
gory, the core vendor licenses the core design to the customer, who is then responsible for select-
ing an appropriate foundry. We call this category "licensable."
[FIGURE 2-2: a single DSP core serving as the basis for several ASICs, such as a speech
coding ASIC in a digital cellular telephone, an audio ASIC in hi-fidelity audio equipment,
and a modem ASIC in a fax machine.]

[TABLE 2-3: currently available DSP cores; columns are Vendor, Core Family, Arithmetic
Type, Data Width, and Speed (MIPS).]
1. Our use of the term "customizable DSP processor" should not be confused with Texas
Instruments' "customizable DSP" (cDSP), which is their term for DSP core-based ASICs using
Texas Instruments DSP cores.
As mentioned earlier, different vendors use the term "core" in different ways, so a "cus-
tomizable DSP core" may have a range of meanings, from simply allowing the addition of periph-
erals and memory to support for addition to or modification of the processor's execution units.
Although customizing a DSP core has a potentially large payoff, it also poses serious
drawbacks. First, in the case of a foundry-captive core, the customer may not have access to the
internal design of the core. As a result, the desired modifications must be made by the chip ven-
dor-a potentially expensive proposition. In the case of a licensable core, the design is accessible
to the customer and can be modified by the customer's engineers. However, these engineers may
not be sufficiently expert in the core's internal design to efficiently make the desired modifica-
tions. Finally, changes to the core processor architecture require that corresponding changes be
made to the core's software development tools (assembler, linker, simulator, and so on).
At present, the companies that have best addressed the challenges of customizable DSP
processors are AT&T Microelectronics and Philips-neither of which offers their cores in a broad
fashion. In particular, AT&T's DSP1600 processor core was designed to permit easy attachment
of new execution units to its data path, and its software development tools were designed to facil-
itate the addition of support for new execution units into the tools. Similarly, Philips' EPICS core
was designed with configurability in mind; Philips has demonstrated versions of the core with dif-
ferent word widths.
Multiprocessors
No matter how fast and powerful DSP processors become, the needs of a large class of
important applications cannot be met by a single processor. Some of these applications may be
well suited to custom integrated circuits. If programmability is important, a multiprocessor based
on commercial DSPs may be an effective solution. Although any DSP processor can be used in a
multiprocessor design, some manufacturers have made special efforts to create DSPs that are
especially well suited to multiprocessor systems. Among these are Texas Instruments'
TMS320C4x, Motorola's DSP96002, and Analog Devices' ADSP-2106x. These processors
include features such as multiple external buses, bus-sharing logic, and (on the Texas Instruments
and Analog Devices processors) multiple, dedicated parallel ports designed for interprocessor
communication that simplify hardware design and improve performance.
use a general-purpose processor for a given application. First, many electronics products-from
telephones to automotive engine controllers-are currently designed using general-purpose
microprocessors for control, user interface, and communications functions. If a DSP application is
being added to an existing product that already contains a general-purpose microprocessor, then it
may be possible to add the new application without needing an additional processor. This
approach has obvious cost advantages, though it is only suited to relatively simple DSP applica-
tions. Second, software development tools for general-purpose processors are generally much
more sophisticated and powerful than those for DSP processors. For applications that are rela-
tively undemanding in terms of performance, but for which ease of development is a critical con-
sideration, this can be an important factor.
As digital signal processing-intensive applications increasingly move into the mainstream
of computing and electronics products, general-purpose processors have begun to adopt some of
the features of DSP processors to make them more suitable for DSP-intensive applications. For
example, the Motorola/IBM PowerPC 601, the MIPS RIOOOO, the Sun UltraSPARC, and the
Hewlett-Packard PA-7100LC general-purpose microprocessors are all able to perform a floating-
point multiply-accumulate in one instruction cycle under certain circumstances. Additionally,
some of these processors have special instructions aimed at multimedia signal processing applica-
tions. Similarly, Intel has announced a version of the Pentium with features designed to better
support DSP.
It isn't yet clear, though, whether a single, hybrid processor or separate general-purpose
processor and DSP processor (possibly on the same chip) will become the more popular approach.
We expect that for the next few years, at least, applications with a significant real-time DSP con-
tent will be better served by a separate, specialized DSP processor.
There is a trend among both PC and workstation manufacturers to add more DSP capabil-
ities to their products, both through on-board DSP processors and through the addition of periph-
erals like analog-to-digital and digital-to-analog converters and telephone line interfaces that
support DSP applications like modems, speech recognition, and music synthesis. As these kinds
of resources become increasingly prevalent in PCs and workstations, more opportunities will open
up for software-only DSP products.
Custom Hardware
There are two important reasons why custom-developed hardware is sometimes a better
choice than a DSP processor-based implementation: performance and production cost. In virtually
any application, custom hardware can be designed which provides better performance than a pro-
grammable processor. Just as DSP processors are more cost-effective for DSP applications than
general-purpose processors because of their specialization, custom hardware has the potential to
be even more cost-effective due to its more specialized nature. In applications with high sampling
rates (for example, higher than 1/100th of the system clock rate), custom hardware may be the
only reasonable approach.
For high-volume products, custom hardware may also be less expensive than a DSP pro-
cessor. This is because a custom implementation places in hardware only those functions needed
by the application, whereas a DSP processor requires every application to pay for the full func-
tionality of the processor, even if it uses only a small subset of its capabilities. Of course, develop-
ing custom hardware has some serious drawbacks in addition to these advantages. Most notable
among these drawbacks are the effort and expense associated with custom hardware development,
especially for custom chip design.
Custom hardware can take many forms. It can be a simple, small printed circuit board
using off-the-shelf components, or it can be a complex, multiboard system, incorporating custom
integrated circuits. The aggressiveness of the design approach depends on the needs of the appli-
cation. For an in-depth exploration of DSP system design alternatives and tools, see DSP Design
Tools and Methodologies [BDT95]. In the remainder of this section we very briefly mention some
of the more popular approaches.
One of the most common approaches for custom hardware for DSP applications is to
design custom printed circuit boards that incorporate a variety of off-the-shelf components. These
components may include standard logic devices, fixed-function or configurable arithmetic units,
field-programmable gate arrays (FPGAs), and function- or application-specific integrated circuits
(FASICs). As their name implies, FASICs are chips that are designed to perform a specific func-
tion, perhaps for a single application. Examples of FASICs include configurable digital filter
chips, which can be configured to work in a range of applications, and facsimile modem chips,
which are designed specifically to provide the signal processing functions for a fax modem and
aren't useful for anything else.
Many off-the-shelf application-specific ICs sold by semiconductor vendors for DSP appli-
cations are really standard DSP processors containing custom, mask-programmed software in
ROM. Some of these chips are based on commercial DSP processors. For example, Phylon's
modem chips are based on Analog Devices and Texas Instruments processors. Others are based on
proprietary processor architectures; the most prominent examples of this approach are Rockwell's
data and fax modem chips.
As tools for creating custom chips improve and more engineers become familiar with chip
design techniques, more companies are developing custom chips for their applications. Designing
a custom chip provides the ultimate flexibility, since the chip can be tailored to the needs of the
application, down to the level of a single logic gate.
Of course, the benefits of custom chips and other hardware-based implementation
approaches come with important trade-offs. Perhaps most importantly, the complexity and cost of
developing custom hardware can be high, and the time required can be long. In addition, if the
hardware includes a custom programmable processor, new software development tools will be
required.
It is important to point out that the implementation options discussed here are not mutually
exclusive. In fact, it is quite common to combine many of these design approaches in a single sys-
tem, choosing different techniques for different parts of the system. One such hybrid approach,
DSP core-based ASICs, was mentioned above. Others, such as the combination of an off-the-shelf
DSP processor with custom ICs, FPGAs, and a general-purpose processor, are very common.
Chapter 3
Numeric Representations and Arithmetic
One of the most important characteristics determining the suitability of a DSP processor
for a given application is the type of binary numeric representation(s) used by the processor. The
data representation variations common in commercial DSP processors can be illustrated hierarchi-
cally, as in Figure 3-1.
[Figure 3-1: DSP processors divide first into fixed-point (16-bit, 20-bit, 24-bit, or 32-bit)
and floating-point (IEEE 754 or other) data representations.]
arithmetic) or as fractions between -1.0 and +1.0 (fractional arithmetic). The algorithms and
hardware used to implement fractional arithmetic are virtually identical to those used for integer
arithmetic. The main difference between integer and fractional arithmetic has to do with how the
results of multiplication operations are handled. In practice, most fixed-point DSP processors
support fractional arithmetic and integer arithmetic. The former is most useful for signal process-
ing algorithms, while the latter is useful for control operations, address calculations, and other
operations that do not involve signals. Figures 3-2 and 3-3 illustrate simple integer and fractional
representations. Note that performing integer arithmetic on fixed-point DSPs sometimes is more
time-consuming or more constrained than performing fractional arithmetic.
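As a concrete illustration (our sketch, not the book's), the same 16-bit word yields different values under the two conventions; the fractional reading below assumes the common Q15 convention, with the radix point after the sign bit:

```python
def as_integer(word):
    """Interpret a 16-bit word as a two's complement integer."""
    return word - 0x10000 if word & 0x8000 else word

def as_fraction(word):
    """Interpret the same 16-bit word as a fraction in [-1.0, +1.0),
    with an assumed radix point after the sign bit (Q15 format)."""
    return as_integer(word) / 32768.0   # divide by 2^15

w = 0x4000                  # bit pattern 0100 0000 0000 0000
print(as_integer(w))        # 16384 under integer arithmetic
print(as_fraction(w))       # 0.5 under fractional arithmetic
```

Note that the bits themselves never change; only the assumed position of the radix point differs.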
Another class of DSP processors primarily uses floating-point arithmetic, where numbers
are represented by the combination of a mantissa and an exponent. This is illustrated in Figure 3-
4. The mantissa is usually a signed fractional value with a single implied integer bit. (The
implied integer bit is not actually stored as part of the data value; rather, it is assumed to always
be set to "1.") This means that the mantissa can take on a value in the ranges of +1.0 to +2.0 and
-1.0 to -2.0. The exponent is an integer that represents the number of places that the binary point
of the mantissa (analogous to the decimal point in an ordinary base 10 number) must be shifted
left or right to obtain the original number represented. The value represented is computed via an
expression of the form:
value = mantissa × 2^exponent
Figure 3-4 illustrates a simple floating-point data representation. In general, floating-point proces-
sors also support fixed-point (usually integer) data formats. This is necessary to facilitate opera-
tions that are inherently integer in nature, such as memory address computations.
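As a rough illustration (ours, not the book's), Python's math.frexp decomposes a value into this mantissa-and-exponent form, though it normalizes the mantissa to [0.5, 1.0) rather than the [1.0, 2.0) implied-integer-bit convention described above:

```python
import math

x = 6.5
mantissa, exponent = math.frexp(x)      # x == mantissa * 2**exponent
print(mantissa, exponent)               # 0.8125 3

# Rescaled to the [1.0, 2.0) implied-integer-bit convention in the text:
m, e = mantissa * 2.0, exponent - 1     # 1.625 * 2**2 == 6.5
print(m * 2 ** e)                       # 6.5
```

Either normalization represents the same value; the choice only fixes where the implied bit sits.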
Floating-point arithmetic is a more flexible and general mechanism than fixed-point. With
floating-point, system designers have access to wider dynamic range (the ratio between the largest
and smallest numbers that can be represented) and in many cases better precision.
Our definition of precision is based on the idea of quantization error. Quantization error is
the numerical error introduced when a longer numeric format is converted to a shorter one. For
example, when we round the value 1.325 to 1.33, we have introduced a quantization error of
0.005. The greater the possible quantization error relative to the size of the value represented, the
less precision is available.
For a fixed-point format, we define the maximum available precision to be equal to the
number of bits in the format. For example, a 16-bit fractional format provides a maximum 16 bits
of precision. This definition is based on computing the ratio of the size of the value represented to
the size of the maximum quantization error that could be suffered when converting from a more
precise representation via rounding. Formally stated,
maximum precision (in bits) = log2(|maximum value| / |maximum quantization error|)
For a 16-bit fractional representation, the largest-magnitude value that can be represented
is -1.0. When converting to a 16-bit fractional format from a more precise format via rounding,
the maximum quantization error is 2^-16. Using the relation above, we can compute that this for-
mat has a maximum precision of log2(1/2^-16), or 16 bits, the same as the format's overall width.
Note that if the value being represented has a smaller magnitude than the maximum, the precision
obtained is less than the maximum available precision. This underscores the importance of careful
signal scaling when using fixed-point arithmetic. Scaling is used to maintain precision by modify-
ing the range of signal values to be near the maximum range of the numeric representation used.
Scaling is discussed in detail in Chapter 4, "Data Path."
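The precision relation can be evaluated directly; this sketch (ours, in Python) reproduces the 16-bit example and shows how a smaller-magnitude signal loses precision:

```python
import math

def max_precision_bits(max_value, max_quantization_error):
    """Precision relation from the text: log2(|max value| / |max error|)."""
    return math.log2(abs(max_value) / abs(max_quantization_error))

# 16-bit fractional format: largest magnitude 1.0, worst rounding error 2**-16
print(max_precision_bits(1.0, 2 ** -16))   # 16.0 bits

# A signal at only 1/4 of full scale gets two bits less precision,
# which is why careful scaling matters on fixed-point processors:
print(max_precision_bits(0.25, 2 ** -16))  # 14.0 bits
```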
Using this same definition for a floating-point format, the maximum available precision is
the number of bits in the mantissa, including the implied integer bit. Because floating-point pro-
cessors automatically scale all values so that the implied integer bit is equal to 1, the magnitude of
[Figure 3-2 shows an 8-bit two's complement word: a sign bit followed by bit weights 2^6 down
to 2^0. To determine the equivalent decimal value, add up the bit weights for each bit that is a
"1"; in Example 1, 2^6 + 2^4 + 2^1 + 2^0 = 64 + 16 + 2 + 1 = 83.]
FIGURE 3-2. Simple binary integer representation. This form is called two's
complement representation. Integer formats used in DSP processors are usually
longer than eight bits, but follow the same structure as shown here.
[Figure 3-3 shows an 8-bit fractional word: a sign bit, an assumed radix point, and bit weights
2^-1 down to 2^-7.]
FIGURE 3-3. Simple binary fractional representation. This format is identical to the
integer format, except that a radix point is assumed to exist immediately after the
sign bit. Fractional formats used in DSP processors are usually longer than eight
bits, but follow the same structure as shown here.
[Figure 3-4 shows a simple floating-point representation: a mantissa word (sign bit, radix
point, fraction bits) paired with an exponent word (sign bit and integer bits). In the example,
the exponent is 2^2 + 2^0 = 4 + 1 = 5.]
the mantissa is restricted to be at least 1.0. This guarantees that the precision of any floating-point
value is no less than half of the maximum available precision. Thus, floating-point processors
maintain very good precision with no extra effort on the part of the programmer.
In practice, floating-point DSPs generally use a 32-bit format with a 24-bit mantissa and
one implied integer bit, providing 25 bits of precision. Most fixed-point DSPs use a 16-bit format,
providing 16 bits of precision. So, while in theory the choice between fixed- and floating-point
arithmetic could be independent of the choice of precision, in practice floating-point processors
usually provide higher precision.
As mentioned above, dynamic range is defined as the ratio between the largest and small-
est number representable in a given data format. It is in this regard that floating-point formats pro-
vide their key advantage. For example, consider a 32-bit fixed-point fractional representation. The
minimum value that can be represented by this format is 2^-31; the maximum value that can be rep-
resented is 1.0 - 2^-31. The ratio between these, which is the dynamic range of this data format, is
approximately 2.15 × 10^9, or about 187 decibels (dB). A 32-bit floating-point format of the same
overall size (with a 24-bit mantissa and an 8-bit exponent) can represent numbers from approxi-
mately 5.88 × 10^-39 to 3.40 × 10^38, yielding a dynamic range of approximately 5.79 × 10^76, or
over 1535 dB. So, while using the same number of bits as the fixed-point format, the floating-
point format provides dramatically higher dynamic range.
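These dynamic range figures can be reproduced in a few lines (our illustration; the endpoint values are the text's approximations):

```python
import math

def dynamic_range_db(largest, smallest):
    """Dynamic range: ratio of largest to smallest representable magnitude, in dB."""
    return 20.0 * math.log10(largest / smallest)

# 32-bit fractional fixed-point: magnitudes from 2**-31 up to 1.0 - 2**-31
fixed_db = dynamic_range_db(1.0 - 2 ** -31, 2 ** -31)
# 32-bit floating-point (24-bit mantissa, 8-bit exponent), per the text:
float_db = dynamic_range_db(3.40e38, 5.88e-39)
print(round(fixed_db), round(float_db))    # 187 1535
```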
In applications, dynamic range translates into a range of signal magnitudes that can be
processed while maintaining sufficient fidelity. Different applications have different dynamic
range needs. For telecommunications applications, dynamic range in the neighborhood of 50 dB
is usually sufficient. For high-fidelity audio applications, 90 dB is a common benchmark. It's
often helpful, though, if the processor's numeric representation and arithmetic hardware have
somewhat more dynamic range than the application demands, as this frees the programmer from
some of the painstaking scaling that may otherwise be needed to preserve adequate dynamic
range. Scaling is discussed in more detail in Chapter 4, "Data Path."
Floating-point DSP processors are generally costlier than their fixed-point cousins, but
easier to program. The increased cost results from the more complex circuitry required within the
floating-point processor, which implies a larger chip. In addition, the larger word sizes of floating-
point processors often means that off-chip buses and memories are wider, raising overall system
costs.
The ease-of-use advantage of floating-point processors is due to the fact that in many cases
the programmer does not have to be concerned about dynamic range and precision. On a fixed-
point processor, in contrast, programmers often must carefully scale signals at various stages of
their programs to ensure adequate numeric performance with the limited dynamic range and pre-
cision of the fixed-point processor.
Most high-volume, embedded applications use fixed-point processors because the priority
is low cost. Programmers and algorithm designers determine the dynamic range and precision
needs of their application, either analytically or through simulation, and then add scaling opera-
tions into the code if necessary. For applications that are less cost-sensitive, or that have extremely
demanding dynamic range and precision requirements, or where ease of programming is para-
mount, floating-point processors have the advantage.
metic may make sense. If most of the application requires higher precision, then a processor with
a larger native data word size may be a better choice, if one is available.
Block 1:                            Block 2:
16-bit intermediate results:
0.110111011001010                   0.000111011001010
0.101011000100100                   0.001011000100100
0.111010100100011                   0.000010100100011
0.000101101010011                   0.001011010100110

Block floating-point representation (8-bit mantissa, 3-bit signed exponent):
Exponent: 000 (shift left 0 places)  Exponent: 110 (shift left 2 places)
Mantissas:                          Mantissas:
0.1101110                           0.0111011
0.1010110                           0.1011000
0.1110101                           0.0010100
0.0001011                           0.1011010

FIGURE 3-5. Block floating-point representation. In this example, the block size is
chosen as four for simplicity. Typically, the block size would be larger. In this
example, we have produced an intermediate signal in our application that requires
16 bits for its full representation, but we have only 8 bits available to store the
samples. We use block floating-point to maintain the best precision for each block
of four samples. For each block of four samples, we determine the exponent by
finding the value in the block with the largest magnitude. The exponent for the block
is equal to the negation of the number of left-shifts (or doublings) we can apply to
this largest-magnitude value without causing overflow.
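The conversion shown in Figure 3-5 can be sketched as follows (a simplified illustration of ours, assuming nonnegative samples held as raw 16-bit fractional integers; a full implementation would also handle the sign bit):

```python
def shared_shift(samples, frac_bits=15):
    """Count the left shifts the largest-magnitude sample in the block
    can absorb without overflowing the fractional format."""
    largest = max(samples)
    shifts = 0
    while largest and (largest << (shifts + 1)) < (1 << frac_bits):
        shifts += 1
    return shifts

def to_block_float(samples, frac_bits=15, kept_bits=7):
    """Shift every sample left by the shared amount, keep the top
    kept_bits fraction bits, and report the negated shift count
    as the block exponent."""
    n = shared_shift(samples, frac_bits)
    mantissas = [(s << n) >> (frac_bits - kept_bits) for s in samples]
    return -n, mantissas

# Block 2 of Figure 3-5, as raw integers scaled by 2**15:
block = [0b000111011001010, 0b001011000100100,
         0b000010100100011, 0b001011010100110]
exp, mants = to_block_float(block)
print(exp, [format(m, "07b") for m in mants])
# -2 ['0111011', '1011000', '0010100', '1011010']
```

The shared exponent of -2 and the four 7-bit mantissas match the right-hand column of the figure.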
sion intermediate result (for example, a value in an accumulator) to block floating-point format.
Specialized instructions that support block floating-point are discussed further in Chapter 7.
3.6 Relationship between Data Word Size and Instruction Word Size
While most DSP processors use an instruction word size equal to their data word size, not
all do. For example, the Analog Devices ADSP-21xx family and the IBM MDSP2780 use a 16-bit
data word and a 24-bit instruction word. Similarly, Zoran's 20-bit ZR3800x uses a 32-bit instruc-
tion word. Processors with dissimilar word sizes generally have provisions to allow data to be
stored in program memory, for example, using the low-order 16 bits of a 24-bit program memory
location. While this arrangement works, it obviously is not the most efficient use of memory since
a significant portion of each program memory word used to store data is wasted, and this can
impact overall system cost.
Chapter 4
Data Path
The data path of a DSP processor is where the vital arithmetic manipulations of signals
take place. DSP processor data paths, not surprisingly, are highly specialized to achieve high per-
formance on the types of computation most common in DSP applications, such as multiply-accu-
mulate operations. The capabilities of the data path, along with the memory architecture
(discussed in Chapter 5), are the features that most clearly differentiate DSP processors from
other kinds of processors. Data paths for floating-point DSP processors are significantly different
from those for fixed-point DSPs because of the differing requirements of the two kinds of arith-
metic. We will first discuss fixed-point data paths and then floating-point data paths.
Multiplier
The presence of a single-cycle multiplier is central to the definition of a programmable
digital signal processor. Multiplication is an essential operation in virtually all DSP applications;
in many applications, half or more of the instructions executed by the processor involve multipli-
cation. Thus, virtually all DSP processors contain a multiplier that can multiply two native-sized
[Figure 4-1 shows operand registers X0, X1, Y0, and Y1 feeding the multiplier and ALU, with
56-bit accumulators holding the results.]
FIGURE 4-1. A representative fixed-point DSP processor data path (from the
Motorola DSP5600x, a 24-bit, fixed-point processor).
operands in a single instruction cycle. Despite this commonality, there are important differences
among DSP processors in terms of multiplier capabilities.
While all DSP processors are equipped with a multiplier that can produce one new result
per instruction cycle, the internal pipelining of the multiplier can result in a delay of more than
one cycle from the time inputs are presented to the multiplier until the time the result is available.
(Pipelining is discussed in detail in Chapter 9.) This delay from input availability to result is
called latency. While pipelined multipliers can produce one result every clock cycle, they achieve
this performance only when long series of multiply operations are used. If a single multiply oper-
ation is preceded and followed by other kinds of operations, one or more instruction cycles must
be spent waiting for the multiplier result. DSP processors with pipelined multipliers include the
Clarkspur Design CD2450 DSP core.
In some cases (e.g., the Motorola DSP5600x) the multiplier is integrated with an adder to
form a multiplier-accumulator unit. In other cases (e.g., the AT&T DSP16xx) the multiplier is
separate; its output is deposited into a product register, and from there can be sent to an adder for
accumulation. This distinction manifests itself in the latency suffered by a multiply-accumulate
operation. If the multiplier and adder are separate, then the result of a multiply-accumulate opera-
tion is typically delayed by one instruction cycle before it can be used by the next instruction.
Another distinction among fixed-point DSP processor multipliers is the size of the product
relative to the size of the operands. In general, when multiplying two n-bit fixed-point numbers,
2 x n bits are required to represent the resulting product without introducing any error. This is
sometimes referred to as the law of conservation of bits.
To understand why this is so, recall the 8-bit integer representation introduced in Figure 3-2
of Chapter 3. This format is capable of representing numbers between -128 and +127. If we multi-
ply two large-magnitude 8-bit numbers, the result becomes too large to represent with the 8-bit
format. For example, -128 × -128 = 16,384. To represent this result, we need a format providing 16
bits.
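A quick numeric check of this example (illustrative Python, not from the text):

```python
a = b = -128                 # the most negative 8-bit two's complement value
product = a * b
print(product)               # 16384

# The product overflows an 8-bit format (range -128 to +127)...
print(-2 ** 7 <= product <= 2 ** 7 - 1)     # False
# ...and even a 15-bit format (maximum +16383), but fits in 16 bits:
print(-2 ** 15 <= product <= 2 ** 15 - 1)   # True
```

This is the law of conservation of bits at work: an n × n-bit multiply can require a full 2n result bits.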
For this reason, most fixed-point DSP processor multipliers produce a result that is twice
the width of the input operands. This means that the multiplier itself does not introduce any errors
into the computation. However, some fixed-point processors, for the sake of speed and economy,
use multipliers that produce narrower results and thus introduce errors into some computations.
For example, in the Zilog Z893xx (and the Clarkspur CD2400 core, on which the Z893xx is
based), the multiplier accepts 16-bit operands, but produces a 24-bit result rather than the 32-bit
result required for full precision.
Although it is possible to pass the full-width multiplication result to the next step of com-
putation, this is usually impractical, since subsequent multiplication or additional operations
would produce ever-wider results. Fortunately, in most cases it is not necessary to retain the full-
width multiplication result, because there is often a point of diminishing returns beyond which
additional dynamic range and precision are not useful. Therefore, for practicality's sake, the pro-
grammer usually selects a subset of the multiplier output bits to be passed on to the next computa-
tion. Or, if a series of multiplier results are to be accumulated, the accumulation may be done with
the full-width results, and then the width of the final result reduced before proceeding to the next
stage of computation.
[Figure 4-2 shows two 8-bit integer operands (sign bit plus 7 integer bits) multiplied to form a
16-bit result, from which a subset is selected as the output word.]
FIGURE 4-2. When multiplying integer values, the result is also an integer, with twice
the overall width of the input operands. Normally, the programmer selects the least
significant half of the result word for use in the next step of processing. In this case,
the programmer must ensure that the full result value is contained within the lower
half of the result word to prevent overflow. This can be done by constraining the size
of the operands or by scaling the result. When the operands are constrained or the
results have been scaled properly, the most significant N/2+1 bits of the N-bit result
are all equal to the value of the sign bit, so no information is lost when the upper N/2
bits are discarded (leaving one sign bit in the final output word).
[Figure 4-3 shows two 8-bit fractional operands (sign bit, radix point, 7 fractional bits)
multiplied to form a result with 14 fractional bits, from which a subset is selected as the
output word.]
FIGURE 4-3. When multiplying fractional values, the result is also a fractional value.
The result has a number of bits to the right of the radix point which is equal to the
sum of the number of bits to the right of the radix point in the two operands. If the
operands are constrained so that at least one operand is not the largest negative
number representable, then the bit to the left of the radix point is the sign bit. This
means that the programmer can select the subset of the result shown above,
yielding an output value with the same fractional format as the input operands.
Without this optimization, there would be a single sign bit plus a single integer bit
to the left of the radix point in the full-width result.
sion bits). Thus, no information is lost when the upper N/2 bits are discarded (leaving one sign bit
in the final output word).
If fractional arithmetic is used, then the full-width multiplier result has twice as many bits
to the right of the radix point as the multiplier operands, and the programmer typically discards
the least significant half of these bits, perhaps after rounding.
The difference in which part of the multiplier result is selected for use in the next step of
computation is the primary difference between how integer arithmetic and fractional arithmetic
are handled in fixed-point DSPs.
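In Q15 terms (a common 16-bit fractional convention; this sketch is ours, not the book's), keeping the most significant half of the fractional product amounts to a single right shift of 15:

```python
def q15_mul(a, b):
    """Multiply two Q15 fractional values held as raw 16-bit integers.
    The full product is Q30; shifting right by 15 discards the least
    significant half and returns a Q15 result (truncation, no rounding)."""
    return (a * b) >> 15

half = 1 << 14                      # 0.5 in Q15 (raw value 16384)
quarter = q15_mul(half, half)
print(quarter, quarter / 2 ** 15)   # 8192 0.25
```

An integer multiply on the same hardware would instead keep the least significant half, as described above.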
Accumulator Registers
Accumulator registers hold intermediate and final results of multiply-accumulate and
other arithmetic operations. Most DSP processors provide two or more accumulators. A few pro-
cessors provide only a single accumulator, which can be a drawback for many applications. When
only one accumulator is available, the accumulator often becomes a bottleneck in the architecture:
since the accumulator is usually used as one of the source operands and as the destination operand
for ALU operations, its contents must be loaded or stored frequently as the ALU is used for vari-
ous tasks. These loads and stores limit the rate at which data can flow through the ALU.
Ideally, the size of the accumulator registers should be larger than the size of multiplier
output word by several bits. The extra bits, called guard bits, allow the programmer to accumulate
a number of values without the risk of overflowing the accumulator and without the need for scal-
ing intermediate results to avoid overflow. An accumulator with n guard bits provides the capacity
for up to 2^n values to be accumulated without the possibility of overflow. Most processors provide
either four or eight guard bits. For example, the AT&T DSP16xx provides four guard bits (36-bit
accumulators with a 32-bit multiplier product), while the Analog Devices ADSP-21xx provides
eight guard bits (40-bit accumulators with a 32-bit multiplier product).
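The capacity claim can be checked numerically; this sketch (ours) assumes the eight-guard-bit, 40-bit-accumulator configuration of the ADSP-21xx example:

```python
PRODUCT_BITS, GUARD_BITS = 32, 8
ACC_BITS = PRODUCT_BITS + GUARD_BITS            # a 40-bit accumulator
acc_min, acc_max = -2 ** (ACC_BITS - 1), 2 ** (ACC_BITS - 1) - 1

worst = 2 ** (PRODUCT_BITS - 1) - 1             # largest positive 32-bit product
total = worst * 2 ** GUARD_BITS                 # 2**8 = 256 worst-case sums
print(acc_min <= total <= acc_max)              # True: cannot overflow
print(acc_min <= total + worst <= acc_max)      # False: a 257th could overflow
```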
On a processor that lacks guard bits, input signals or intermediate results often must be
scaled before being added to the accumulator if the possibility of overflow is to be eliminated.
Usually this involves scaling the multiplier result by shifting it right by a few bits. Some proces-
sors that do not provide guard bits in the accumulator are capable of shifting the product register
value before adding it to the accumulator without requiring additional instruction cycles. For
example, the Texas Instruments TMS320C2x and TMS320C5x allow the product register to be
automatically shifted right by six bits. As described in the discussion of shifters below, such scal-
ing results in a loss of precision. However, unless the amount of scaling used is extreme or the
number of products being accumulated is very large, the loss of precision introduced by scaling
the product is small.
A further argument in favor of shifting the product right before accumulation is that when
fractional arithmetic is used, often only the most significant half of the accumulator is retained
after a series of multiply-accumulates, as discussed earlier. In this case, the loss of precision of
intermediate values due to scaling usually does not affect the final result. This is because the quan-
tization error present in the result due to scaling is usually entirely contained in the discarded
least-significant half of the accumulator.
Guard bits provide greater flexibility than scaling the multiplier product because they
allow the maximum precision to be retained in intermediate steps of computation. However, sup-
port for scaling the multiplier result in lieu of guard bits is sufficient for many applications. A few
processors, such as the Texas Instruments TMS320Clx, lack both guard bits and the ability to effi-
ciently scale the product register. This requires the multiplier input to be scaled to avoid overflow,
which can result in significantly reduced precision. The lack of both accumulator guard bits and
support for scaling the product register is a serious limitation in many situations.
ALU
DSP processor arithmetic logic units implement basic arithmetic and logical operations in
a single instruction cycle. Common operations include add, subtract, increment, negate, and logi-
cal and, or, and not. ALUs differ in the word size they use for logical operations. Some proces-
sors perform logical operations on operands that are the full width of the accumulator, while
others can perform logical operations only on native-width data words. For example, the AT&T
DSP16xx performs logical operations on 36-bit accumulator values, while the Motorola
DSP5600x, which has a 56-bit accumulator, performs logical operations on 24-bit data words. If
the ALU cannot perform logical operations on accumulator-width data, then programmers need-
ing this capability must resort to performing logical operations in multiple steps, which compli-
cates programming and consumes instruction cycles.
As mentioned above, in some processors the ALU is used to perform addition for multiply-
accumulate operations. In other processors, a separate adder is provided for this purpose.
Shifter
Multiplication and accumulation tend to result in growth in the bit width of arithmetic
results. In most cases, the programmer will want to choose a particular subset of the result bits to
pass along to the next stage of processing. A shifter in the data path eases this selection by scaling
(multiplying) its input by a power of two (2^n).
Scaling is an important operation in many fixed-point DSP applications. This is because
many basic DSP functions have the effect of expanding or contracting the range of values of the
signals they process. Consider the simple example of a filter, as illustrated in Figure 4-4. The
example filter has a gain of 100. This means that the range of values at the output of the filter can
be as much as 100 times larger than the range of values at the input to the filter. If the input signal
is limited to the range -1.0 to + 1.0, the output values are limited to the range -100 to + 100. The
difficulty arises because the numeric representation has a limited range, normally -1.0 to + 1.0 for
fractional representations. If signals exceed these values, overflow occurs, and incorrect results
are produced. To avoid this situation, the programmer must be aware of the range of signal values
at each point in the program and scale signals at various points to either eliminate the possibility
of overflow or reduce the probability of overflow to an acceptably low level. As shown in
Figure 4-4(b), the programmer could eliminate the possibility of overflow by scaling the signal x n
by a factor of 0.0078 (1/128, the nearest power of 2 to 1/100) before filtering it.
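A back-of-the-envelope check of this example (our sketch):

```python
FILTER_GAIN = 100
SCALE = 1.0 / 128          # nearest power of two below 1/100 (about 0.0078)

def worst_case_output(x_max, prescale=False):
    """Worst-case output magnitude of the gain-100 filter for inputs
    bounded by x_max, with or without the input prescaling stage."""
    gain = FILTER_GAIN * (SCALE if prescale else 1.0)
    return gain * x_max

print(worst_case_output(1.0))                 # 100.0: overflows [-1.0, +1.0)
print(worst_case_output(1.0, prescale=True))  # 0.78125: safely in range
```

A power of two is chosen so that the scaling can be implemented as a simple right shift.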
[Figure 4-4: (a) a filter with gain 100 produces output y_n from input x_n; (b) the same filter
with the input x_n first scaled by 0.0078 (1/128) to eliminate the possibility of overflow.]
The trade-off that comes with scaling signals in this way is the loss of precision and
dynamic range when a signal is scaled to a smaller range. Recall that precision is the ratio between
the magnitude of the value being represented and the maximum possible magnitude of the quanti-
zation error of the representation. When a signal is scaled to a smaller range, some of the lower-
order bits of the original value are lost through truncation or rounding. This means that the magni-
tudes of the values represented are reduced, while the maximum possible magnitude of the quanti-
zation error is unchanged. Therefore, precision is reduced. Similarly, since the magnitude of the
largest representable value is reduced while the magnitude of the smallest representable value is
unchanged, dynamic range is reduced.
Thus, scaling must be done with great care, balancing the need to reduce or eliminate the
possibility of overflow with the need to maintain adequate dynamic range and precision. Proper
scaling of signals can be a significant challenge in implementing an application on a fixed-point
processor. Typically, application developers use detailed simulations and analysis to determine
where and by how much signals should be scaled and then insert scaling operations at the appro-
priate locations.
Note that scaling is related to accumulator guard bits in that both are used to eliminate or
reduce the possibility of overflow. While scaling limits the range of intermediate or final results to
the range of representable values, guard bits provide an intermediate representation that has a
larger dynamic range. As described above, scaling of intermediate results can sometimes be used
to remove the need for guard bits. However, guard bits do not remove the need for scaling. When
guard bits are in use it is necessary to scale the final result in order to convert from the intermedi-
ate representation to the final one. For example, in a 16-bit processor with four guard bits in the
accumulator, it may be necessary to scale accumulator values by 2^-4 before writing them to mem-
ory as 16-bit values.
A shifter is often found immediately following the multiplier and ALU. Some processors
provide a shifter between the multiplier and ALU to allow scaling the product as discussed above.
Some shifters have limited capabilities, for example, offering a left shift by one bit (scale by 2^1), a
right shift by one bit (scale by 2^-1), or no shift. Such shifters can perform multibit shifts one bit at
a time, but this can be time consuming. Another kind of shifter, called a barrel shifter, offers more
flexibility by supporting shifts by any number of bits in a single instruction cycle. In some proces-
sors, a limited-capability shifter is located immediately after the multiplier, so that multiplication
results can be scaled before being passed to the accumulator or ALU for further processing.
Some DSP processors provide multiple shifters with different capabilities in different
places in the data path. This allows greater flexibility to perform shifting in the most appropriate
places for each algorithm being implemented. For example, the DSP5600x has two independent,
limited-capability shifters: one is used to scale multiply-accumulate results as they are written
from the accumulator to memory, and the other is used to shift values within the accumulator,
including logical (no sign extension) and rotate-type shifts.
coefficients and the resulting products are summed together. When a series of numbers is accumu-
lated, the magnitude of the sum may grow. Eventually, the magnitude of the sum may exceed the
maximum value that can be represented by the accumulator register. In this situation, called over-
flow, an incorrect value is stored. When overflow occurs, the actual output of the accumulator can
be very far from the correct value.
To understand the effects of overflow, consider adding base-10 numbers in a system where
numbers cannot be larger than two digits in size. If we add the numbers 50, 45, and 20, the result
is 15, because two digits are not sufficient for representing the correct result of 115.
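The same wraparound behavior is easy to model (our sketch):

```python
def accumulate_two_digit(values):
    """Accumulate base-10 values in a register holding only two digits;
    on overflow the high digits are simply lost (wraparound)."""
    total = 0
    for v in values:
        total = (total + v) % 100
    return total

print(accumulate_two_digit([50, 45, 20]))   # 15, far from the correct 115
```

Binary accumulators without saturation behave the same way, with modulo 2^n in place of modulo 100.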
Even if the accumulator register does not overflow, overflow can still occur when the accu-
mulated value is transferred to memory if the accumulator provides guard bits. In this case, the
accumulator can represent larger numbers than can be stored in a single word in memory. This is
the case in many DSP processors, including DSP Group's PineDSPCore and AT&T's DSP16xx.
Overflow can also occur if a shifter is used to scale up the accumulator value as it is stored to
memory.
There are two common ways of dealing with overflow. The first technique is to carefully
scale all computations to eliminate the possibility of overflow, regardless of the input data. This
can be effective, but it may require that signals be scaled to such small values that adequate signal
fidelity cannot be maintained. An alternative is to use saturation arithmetic. In saturation arith-
metic, a special circuit detects when overflow has occurred and replaces the erroneous output
value with the largest positive number that can be represented (in the case of overflow in the posi-
tive direction) or with the largest negative number that can be represented (in the case of overflow
in the negative direction). The result, of course, is still incorrect, but the error is smaller than it
would be without saturation. Referring again to our 2-digit decimal calculator, the result of adding
50, 45, and 20 without saturation is 15, which is 100 away from the correct result of 115. With sat-
uration, the result is 99 (the largest positive number that we can represent with two digits), which
is only 16 away from the correct result.
Because it often is not practical or desirable to scale signals to eliminate the possibility of
overflow, saturation arithmetic is very useful. Fixed-point DSP processors generally provide spe-
cial hardware for saturation arithmetic, so that it occurs automatically (perhaps under the control
of a mode register) or with the execution of a special instruction. The hardware unit that imple-
ments saturation arithmetic is called a limiter by some manufacturers.
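The behavior of such a limiter can be sketched as follows. This Python model is illustrative only; the 16-bit two's complement range is an assumption chosen for the example, not taken from any particular processor.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def saturating_add(a, b):
    """Add two values; on overflow, substitute the largest positive or
    largest negative representable value instead of wrapping around."""
    s = a + b
    if s > INT16_MAX:
        return INT16_MAX
    if s < INT16_MIN:
        return INT16_MIN
    return s
```

The saturated result is still wrong, but close to the true value; a wrapped result could be wrong by nearly the full range of the representation.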
Rounding
As discussed above, multiplication, accumulation, and other arithmetic operations tend to
increase the number of bits needed to represent arithmetic results without loss of precision. At
some point, it is usually necessary to reduce the precision of these results (for example, when
transferring the contents of a 36-bit accumulator into a 16-bit memory location). The simplest
way to do this is to discard the least significant bits of the representation. This operation is called
truncation. For example, to truncate a 36-bit value to 16 bits, the least significant 20 bits can be
discarded, leaving only the most significant 16 bits. Since the information contained in the dis-
carded bits is lost, an error is introduced to the signal. Note that the truncated value is always
smaller than or equal to the original. This means that truncation adds a bias or offset to signals.
Rounding techniques reduce the arithmetic error as well as the bias introduced by this reduction in
precision. As with saturation, some processors perform rounding automatically when results are
transferred between certain registers and memory (this may be enabled by a mode register). Other
processors provide a special instruction for rounding.
The simplest kind of rounding is the so-called round-to-nearest technique. This is the con-
ventional type of rounding that we use in everyday arithmetic, and it is the type of rounding pro-
vided by most fixed-point DSP processors. With this approach, numbers are rounded to the
nearest value representable in the output (reduced-precision) format; numbers that lie exactly at
the midpoint between the two nearest output values are always rounded up to the higher (more
positive) output value.
The standard way of implementing this scheme is to add a constant equal to one half the
value of the least significant bit of the output word to the value to be rounded, and then truncate
the result to the desired width. Some processors provide a special instruction to facilitate rounding
by preloading the accumulator with the appropriate constant value. Without such an instruction,
round-to-nearest rounding can be performed by using a normal move instruction to preload the
appropriate constant into the accumulator, or by adding the constant to the accumulator prior to
truncation.
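The add-a-constant-and-truncate scheme can be sketched in Python. The bit widths here (a value reduced by 20 low-order bits, as when moving a 36-bit accumulator to a 16-bit word) follow the example above; the function itself is our illustration, not any processor's instruction.

```python
def round_to_nearest(value, drop_bits=20):
    """Round to nearest by adding half the weight of the output word's LSB
    (one half LSB = 1 << (drop_bits - 1)) and then truncating."""
    half_lsb = 1 << (drop_bits - 1)
    return (value + half_lsb) >> drop_bits
```

Note that an exact midpoint always rounds up: round_to_nearest((3 << 20) | (1 << 19)) gives 4, not 3.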
The fact that round-to-nearest rounding always rounds numbers that lie exactly at the mid-
point between the two nearest output values to the higher output value introduces an asymmetry.
If we round a typical set of signal values in this way, on average more values are rounded up than
are rounded down. This means that the round-to-nearest operation adds a bias to signals. This bias
is typically many orders of magnitude smaller than the bias introduced by truncation. However,
for some applications, such as IIR (infinite impulse response) filters and adaptive filters, this small
bias can still be troublesome. Round-to-nearest rounding has the advantage of being simple to
implement.
An improved rounding scheme that addresses the bias problem of the round-to-nearest
approach is convergent rounding. Convergent rounding is slightly more sophisticated than the
familiar round-to-nearest technique. In convergent rounding, when a number to be rounded lies
exactly at the midpoint between the two nearest output values, it may be rounded higher or lower.
The rounding direction depends on the value of the bit of the number that is in the position that
will become the least significant bit (LSB) of the output word. If this bit is a zero, then the number
is rounded down (in the negative direction); if this bit is a one, the number is rounded up. Conver-
gent rounding is compared to the round-to-nearest technique in Figure 4-5.
For most signals, the bit that is used to decide whether to round midpoint cases up or down
is assumed to be equally likely to be zero or one. Therefore, the convergent mechanism rounds up
in half of the midpoint cases and down in half. If this assumption holds, then convergent rounding
effectively avoids the bias caused by the round-to-nearest approach. Even though it is relatively
simple to implement, convergent rounding is not supported in hardware by most fixed-point
DSPs. The Analog Devices ADSP-21xx and Motorola DSP5600x families are two that do provide
convergent rounding. Processors that support convergent rounding can also perform conventional
rounding using the techniques described above.
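The midpoint rule can be made concrete with a short Python sketch (the bit widths are illustrative; drop_bits is the number of low-order bits being discarded):

```python
def convergent_round(value, drop_bits):
    """Round to nearest, except that an exact midpoint is rounded so the LSB
    of the result is zero: down if the bit that becomes the output LSB is 0,
    up if it is 1 (round-half-to-even)."""
    half = 1 << (drop_bits - 1)
    if (value & ((1 << drop_bits) - 1)) == half:   # exactly at a midpoint
        result = value >> drop_bits
        return result + (result & 1)               # force an even result
    return (value + half) >> drop_bits             # ordinary round-to-nearest
```

For example, with drop_bits=4, both 0b00101000 and 0b00111000 lie exactly at midpoints; the first rounds down to 2 and the second up to 4, so midpoint cases round up only half the time.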
FIGURE 4-5. Convergent rounding of an 8-bit operand A (bits a7:a0, where a7 is the sign bit) to a 4-bit result B (bits b3:b0). In the midpoint case, where a3:a0 = 1000, the result is computed as b3:b0 = a7:a4 + a4; that is, the value is rounded up only when the bit that becomes the LSB of the result (a4) is one.
The error introduced by rounding and truncation in DSP algorithms is often modeled as
random noise or quantization noise. Note that the quantization noise due to truncation, round-to-
nearest rounding, and convergent rounding all have the same noise power. The key difference
between these three techniques for reducing precision is the bias they add to the signal.
In some applications (especially in telecommunications) the rounding technique to be
used is specified by a published technical standard.
the data path. A processor that processes data only after loading it into operand registers is often
referred to as having a load-store architecture. Registers are accessed using register-direct
addressing. (Refer to Chapter 6 for a detailed discussion of addressing modes.)
On some processors (for example, the Texas Instruments TMS320C2x/C5x and the
PineDSPCore from DSP Group), operands can be supplied directly from memory to the data path,
using memory-direct or register-indirect addressing.
Multiplier
Floating-point DSP multipliers accept two native-size (usually 32-bit) floating-point oper-
ands. Unlike fixed-point DSP multipliers, floating-point multipliers generally do not produce an
output word large enough to avoid loss of precision. For example, when multiplying two IEEE-754
single-precision floating-point values, each input value has an effective mantissa width of 24 bits.
To maintain full precision, the output value requires an effective mantissa width of 48 bits. Most
floating-point DSPs do not accommodate this full width. Instead, the output format supported by
floating-point DSP multipliers is commonly somewhat larger than the input format, usually pro-
viding an extra eight to twelve bits of mantissa precision.
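The need for the wider output can be seen with a quick computation. In Python (using arbitrary-precision integers merely to stand in for mantissas):

```python
# The largest 24-bit mantissa pattern, as in IEEE-754 single precision.
a = (1 << 24) - 1
b = (1 << 24) - 1

# The full-precision product of two 24-bit values needs up to 48 bits.
product = a * b
```

Here product.bit_length() is 48, which is why a multiplier that keeps only 32 to 36 mantissa bits must discard low-order bits of the product.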
ALU
Floating-point DSP processor ALUs typically provide addition, subtraction, and other
arithmetic operations such as absolute value, negate, minimum, and maximum. Some ALUs pro-
vide other specialized operations, examples of which are shown in Table 4-1.
Floating-point processors use their ALUs to perform addition for multiply-accumulate
operations. In addition to multiply-accumulate operations, some processors (for example, the
AT&T DSP32C) provide a multiply-add operation. The multiply-add operation is distinguished
from the multiply-accumulate in that the result is written into a different accumulator than the one
that provides the addend value for addition with the product.
FIGURE 4-6. A typical floating-point DSP processor data path (from the AT&T DSP3210).
TABLE 4-1. Examples of Specialized ALU Operations Found in Some Floating-Point DSPs (columns: Operation, Description, Example Processors)
Bit-wise logical operations such as and, or, and not are generally not meaningful when
applied to floating-point data and are usually not provided by floating-point-only ALUs.
Rounding
As with fixed-point arithmetic, floating-point multiplication, accumulation, and other
arithmetic operations tend to increase the bit width of arithmetic results. Most floating-point DSP
processors deal with this growth in precision by automatically rounding arithmetic results to a 40-
to 44-bit intermediate format. This intermediate format is used for computations that take place
within the data path. When results are written to memory, they can be written either as extended-
precision values (which requires multiple memory locations and multiple register-to-memory
move operations per data value) or they can be rounded to the native single-precision format, usu-
ally 32 bits.
As with fixed-point processors, most floating-point processors provide the simplest kind
of rounding, round-to-nearest. Some processors provide two or more options for rounding float-
ing-point results. The Motorola DSP96002 is unique in that it provides all of the rounding modes
specified by IEEE standard 754 for single-precision values: convergent rounding, round toward
positive infinity, round toward zero, and round toward negative infinity.
Accumulator Registers
In general, floating-point processors have more and larger registers than their fixed-point
counterparts. In some floating-point processors (for example, the AT&T DSP3210) a small num-
ber of registers are specifically designed for use as accumulators. Other processors provide a bank
of general-purpose registers, some subset of which can receive the results of multiply-accumulate
or other arithmetic operations. This is the case, for example, with the Texas Instruments
TMS320C3x, which has eight 40-bit and eight 32-bit registers, subsets of which can receive the
results of arithmetic operations.
Shifter
As with fixed-point arithmetic, a floating-point multiply-accumulate operation tends to
result in growth in the bit width of arithmetic results. However, with floating-point arithmetic the
hardware automatically scales the results to preserve the maximum precision possible. This is the
key advantage of floating-point arithmetic, as described in detail in Chapter 3. Floating-point data
paths incorporate a shifter to perform this scaling, but the shifter is generally not visible to or
explicitly controllable by the programmer for floating-point operations. In processors where a sin-
gle data path performs both fixed- and floating-point arithmetic, the shifter can be explicitly con-
trolled by the programmer for shifting fixed-point data.
Chapter 5
Memory Architecture
As we explored in the previous chapter, DSP processor data paths are optimized to provide
extremely high performance on certain kinds of arithmetic-intensive algorithms. However, a pow-
erful data path is, at best, only part of a high-performance processor. To keep the data path fed
with data and to store the results of data path operations, DSP processors require the ability to
move large amounts of data to and from memory quickly. Thus, the organization of memory and
its interconnection with the processor's data path are critical factors in determining processor per-
formance. We call these characteristics the memory architecture of a processor, and the kinds of
memory architectures found in DSP processors are the subject of this chapter. Chapter 6 covers
addressing modes, which are the means by which the programmer specifies accesses to memory.
To understand the need for large memory bandwidth in DSP applications, consider the
example of a finite impulse response (FIR) filter, shown in Figure 5-1. Although this example has
become somewhat overused in DSP processor circles, it is perhaps the simplest example that
clearly illustrates the need for several special features of DSP processors.
The mechanics of the basic FIR filter algorithm are straightforward. The blocks labeled D
in Figure 5-1 are unit delay operators; their output is a copy of the input sample delayed by one
sample period. A series of storage elements (usually memory locations) are used to simulate a
series of these delay elements (called a delay line). The FIR filter is constructed from a series of
taps. Each tap includes a multiplication and an accumulation operation. At any given time, n - 1
of the most recent input samples reside in the delay line, where n is the number of taps in the filter.
Input samples are designated x_k; the first input sample is x_1, the next is x_2, and so on. Each time a
new input sample arrives, the previously stored samples are shifted one place to the right along the
delay line, and a new output sample is computed by multiplying the newly arrived sample and
each of the previously stored input samples by the corresponding coefficient. In the figure, coeffi-
cients are represented as c_n, where n is the coefficient number. The results of each multiplication
are summed together to form the new output sample, y_k.
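The algorithm just described can be sketched in a few lines of Python. This is a functional model only; on a DSP processor, each tap's multiply and add would map to a single multiply-accumulate instruction.

```python
def fir_filter(coeffs, inputs):
    """Direct-form FIR filter. For each new input sample, shift the delay
    line, then form the output as the sum of coefficient-sample products,
    one multiply-accumulate per tap."""
    delay_line = [0] * len(coeffs)
    outputs = []
    for x in inputs:
        delay_line = [x] + delay_line[:-1]     # shift samples along the line
        y = 0
        for c, d in zip(coeffs, delay_line):   # one MAC per tap
            y += c * d
        outputs.append(y)
    return outputs
```

For example, a three-tap moving-sum filter, fir_filter([1, 1, 1], [1, 2, 3, 4]), produces [1, 3, 6, 9].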
As we discussed in Chapter 4, DSP processor data paths are designed to perform a multiply-
accumulate operation in one instruction cycle. This means that the arithmetic operations required for
one tap can be computed in one instruction cycle. Therefore, a new output sample can be produced
every n instruction cycles for an n-tap FIR filter. However, to achieve this performance, the proces-
sor must be able to make several accesses to memory within one instruction cycle. Specifically, the
processor must:
Fetch the multiply-accumulate instruction
Read the appropriate data value from the delay line
Read the appropriate coefficient value
Write the data value to the next location in the delay line to shift data through the delay
line
Thus, the processor must make four accesses to memory in one instruction cycle if the
multiply-accumulate operation is to execute in a single instruction cycle. In practice, some proces-
sors use other techniques (discussed later) to reduce the actual number of memory accesses
needed to three or even two. Nevertheless, all processors require multiple memory accesses within
one instruction cycle to compute an FIR filter at a rate of one tap per instruction cycle. This level
of memory bandwidth is also needed for other important DSP algorithms besides the FIR filter.
Note that for clarity of explanation, in this section we ignore issues of pipelining. Pipelining is
explored in detail in Chapter 9.
FIGURE 5-1. An FIR filter. Each tap consists of a multiplication and an accumulation.
Harvard Architectures
The name Harvard architecture refers to a memory structure wherein the processor is con-
nected to two independent memory banks via two independent sets of buses.

FIGURE 5-2. Simple memory structure. This is the so-called von Neumann architecture, common among many kinds of non-DSP processors.

In the original Harvard architecture, one memory bank holds program instructions and the other holds data.
Commonly, this concept is extended slightly to allow one bank to hold program instructions and
data, while the other bank holds data only. This "modified" Harvard architecture is shown in
Figure 5-3.
The key advantage of the Harvard architecture is that two memory accesses can be made
during anyone instruction cycle. Thus, the four memory accesses required for our example FIR
filter can be completed in two instruction cycles.
This type of memory architecture is used in many DSP processor families, including the
Analog Devices ADSP-21xx and the AT&T DSP16xx, although on the DSP16xx, writes to
memory always take two instruction cycles, so the full potential of the dual-bank structure is not
realized.
If two memory banks are better than one, then one might suspect that three memory banks
would be better still. Indeed, this is the approach adopted by several DSP processor manufactur-
ers. The modified Harvard architectures of the PineDSPCore and OakDSPCore from DSP Group
provide three memory banks, each with its own set of buses: a program memory bank and two
data memory banks, designated X and Y. These three memories allow the processor to make three
independent memory accesses per instruction cycle: one program instruction fetch, one X mem-
ory data access, and one Y memory data read. Other processors based on a three-bank modified Harvard architecture include the Zilog Z893xx, the SGS-Thomson D950-CORE, and the Motorola DSP5600x, DSP563xx, and DSP96002.

FIGURE 5-3. A Harvard architecture. The processor core can simultaneously access the two memory banks using two independent sets of buses.
For our FIR filter example, recall that we nominally need four memory accesses per
instruction cycle in order to compute one filter tap per instruction cycle. Many processors that
support only three memory accesses per instruction cycle dispense with the need for a fourth
memory access to update the filter delay line by using a technique called modulo addressing,
which is discussed in Section 5.2.
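The idea can be sketched in Python (an illustrative model, not any particular processor's addressing hardware): rather than shifting every stored sample, only a pointer moves, wrapping modulo the buffer length.

```python
class CircularDelayLine:
    """Delay line via modulo addressing: inserting a new sample is a single
    write plus a pointer update, instead of n data moves."""
    def __init__(self, n):
        self.buf = [0] * n
        self.head = 0

    def insert(self, sample):
        self.head = (self.head - 1) % len(self.buf)  # modulo pointer update
        self.buf[self.head] = sample

    def tap(self, k):
        """Return the k-th most recent sample (k = 0 is the newest)."""
        return self.buf[(self.head + k) % len(self.buf)]
```

After inserting 1, 2, 3 into a three-element line, tap(0) through tap(2) return 3, 2, 1; inserting 4 then overwrites the oldest sample in place, with no data movement.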
Because extending multiple memory buses outside the chip is costly, DSP processors gen-
erally provide only a single off-chip bus set (i.e., one address and one data bus). Processors with
multiple memory banks usually provide a small amount of memory on-chip for each bank.
Although the memory banks can usually be extended off-chip, multiple off-chip memory accesses
cannot proceed in parallel (due to the lack of a second set of external memory buses). Therefore, if
multiple accesses to off-chip memory are requested by an instruction, the instruction execution is
extended to allow time for the multiple external accesses to proceed sequentially. Issues relating
to external memory are discussed later in this section.
Multiple-Access Memories
As we've just discussed, Harvard architectures achieve multiple memory accesses per
instruction cycle by using multiple, independent memory banks connected to the processor data
path via independent buses. While a number of DSP processors use this approach, there are also
other ways to achieve multiple memory accesses per instruction cycle. These include using fast
memories that support multiple, sequential accesses per instruction cycle over a single set of
buses, and using multiported memories that allow multiple concurrent memory accesses over two
or more independent sets of buses.
Some processors use on-chip memories that can complete an access in one half of an
instruction cycle. This means that two independent accesses to a single memory can be completed
in sequence. Fast memories can be combined with a Harvard architecture, yielding better perfor-
mance than could be obtained from either technique alone. For example, consider a modified Har-
vard architecture with two banks of fast memory. Each bank can complete two sequential memory
accesses per instruction cycle. The two banks together can complete four memory accesses per
instruction cycle, assuming the memory accesses are arranged so that each memory bank handles
two accesses. In general, if the memory accesses cannot be divided in this way so that, for exam-
ple, three accesses are made to one bank, the processor automatically lengthens the execution of
the instruction to allow time for three sequential memory accesses to complete. Thus, there is no
risk that a suboptimal arrangement of memory accesses will cause erroneous results; it simply
causes the program to run more slowly.
Zoran's ZR3800x combines a modified Harvard architecture with multiple-access mem-
ory. This processor combines a single-access program memory bank with a dual-access data
memory bank. Thus, one program fetch and two data accesses to on-chip memory can be com-
pleted per instruction cycle. The AT&T DSP32xx combines a Von Neumann architecture with
multiple access memories. This processor can complete four sequential accesses to its on-chip
memory in a single instruction cycle.
Another technique for increasing memory access capacity is the use of multiported memo-
ries. A multiported mernory has multiple independent sets of address and data connections, allow-
ing multiple independent memory accesses to proceed in parallel. The most common type of
multiported memory is the dual-ported variety, which provides two simultaneous accesses. How-
ever, triple- and even quadruple-ported varieties are sometimes used. Multiported memories dis-
pense with the need to arrange data among multiple, independent memory banks to achieve
maximum performance. The key disadvantage of multiported memories is that they are much
more costly (in terms of chip area) to implement than standard, single-ported memories.
Some DSP processors combine a modified Harvard architecture with the use of multi-
ported memories. The memory architecture shown in Figure 5-4, for example, includes a single-
ported program memory with a dual-ported data memory. This arrangement provides one pro-
gram memory access and two data memory accesses per instruction word and is used in the
Motorola DSP561xx processors.
For the most part, the use of fast memories with multiple sequential accesses within an
instruction cycle and multiported memories with multiple parallel accesses is limited to what can
be squeezed onto a single integrated circuit with the processor core because of limitations on chip
input/output performance and capacity. In the case of fast memories, moving the memory (or part
of it) off-chip means that significant additional delays are introduced between the processor core and the memory.

FIGURE 5-4. A Harvard architecture with a dual-ported data memory (A) and a single-ported program memory (B). The processor core can simultaneously perform two accesses to memory bank A and one access to memory bank B using three independent sets of buses.

Unless the processor instruction rate is relatively slow, these delays may make it
impractical to obtain two or more sequential memory accesses per instruction cycle. In the case of
multiported memories, moving all or part of the memory off-chip means that multiple address and
data buses must be brought outside the chip. This implies that the chip will need many more I/O
pins, which often means that a larger, more expensive package and possibly also a larger die size
must be used.
Program Caches
Some DSP processors incorporate a program cache, which is a small memory within the
processor core that is used for storing program instructions to eliminate the need to access pro-
gram memory when fetching certain instructions. Avoiding a program instruction fetch can free a
memory access to be used for a data read or write, or it can speed operation by avoiding delays
associated with slow external (off-chip) program memory.
DSP processor caches vary significantly in their operation and capacity. They are gener-
ally much smaller and simpler than the caches associated with general-purpose microprocessors.
We briefly discuss each of the major types of DSP processor caches below.
The simplest type of DSP processor cache is a single-instruction repeat buffer. This is a
one-word instruction cache that is used with a special repeat instruction. A single instruction that
is to be executed multiple times is loaded into the buffer upon its first execution; immediately
subsequent executions of the same instruction fetch the instruction from the cache, freeing the
program memory to be used for a data read or write access. For example, the Texas Instruments
TMS320C2x and TMS320C5x families provide one program memory access and one data mem-
ory access per instruction cycle. However, when an instruction is placed in the repeat buffer for
repeated execution, the second and subsequent executions of the instruction can perform two
memory accesses (one to program memory to fetch one data value and one to data memory to
fetch another data value). Thus, when the repeat instruction is used, the processor can achieve per-
formance comparable to a processor that provides three memory accesses per instruction cycle.
The obvious disadvantage to the repeat buffer approach is that it works on only one instruction at
a time, and that instruction must be executed repeatedly. While this is very useful for some algo-
rithms (e.g., dot-product computation), it does not help for algorithms in which a block of multi-
ple instructions must be executed repeatedly as a group.
The repeat buffer concept can be extended to accommodate more than one program
instruction. For example, the AT&T DSP16xx provides a 16-entry repeat buffer. The DSP16xx
buffer is loaded when the programmer specifies a block of code of 16 or fewer words to be
repeated using the repeat instruction. The first time through, the block of instructions are read
from program memory and copied to the buffer as they are executed. During each repetition, the
instructions are read from the buffer, freeing one additional memory access for a data read or
write. As with the TMS320C2x and TMS320C5x, the DSP16xx can achieve two data transfers per
instruction cycle when the repeat buffer is used. Multiword repeat buffers work well for algo-
rithms that contain loops consisting of a modest number of instructions. This type of loop is quite
common in DSP algorithms, since many (if not most) DSP algorithms contain groups of several
instructions that are executed repeatedly. Such loops are often used in filtering, transforms, and
block data moves.
A generalization of the multi-instruction repeat buffer is a simple single-sector instruction
cache. This is a cache that stores some number of the most recent instructions that have been exe-
cuted. If the program flow of control jumps back to an instruction that is in cache (a cache hit), the
instruction is executed from the cache instead of being loaded from program memory. This frees
an additional memory access for a data transfer, and avoids a speed penalty that may be associated
with accessing slow off-chip program memory. The limitation on this type of cache is that it can
be used to access only a single, contiguous region of program memory. When a program control
flow change (for example, a branch instruction or an interrupt service routine) accesses a program
memory location that is not already contained in the cache, the previous contents of the cache are
invalidated and cannot be used.
The difference between the single-sector instruction cache and the multiword repeat buffer
is that the cache is loaded with each instruction as it is executed and tracks the addresses of the
instructions in the cache. If the program flow of control jumps to a program address that is con-
tained in the cache, the processor detects this and accesses the instructions out of the cache. This
means that the cache can be accessed by a variety of instructions, such as jump, return, and so on.
With the repeat buffer, only the repeat instruction can be used to access instructions in the cache.
This means that a repeat buffer cannot be used to hold branch instructions. An example of a pro-
cessor using a single-sector cache is the Zoran ZR3800x. As with multiword repeat buffers, sin-
gle-sector caches are useful for a wide range of DSP processor operations that involve repetitively
executing small groups of instructions.
A more flexible structure is a cache with multiple independent sectors. This type of cache
functions like the simple single-sector instruction cache, except that two or more independent seg-
ments of program memory can be stored. For example, the cache in the Texas Instruments
TMS320C3x contains two sectors of 32 words each. Each sector can be used to store instructions
from an independent 32-word region of program memory. If the processor attempts to fetch an
instruction from an external memory location that is stored in the cache (a cache hit), the external
access is not made, and the word is taken from the cache. If the memory location is not in the
cache (a cache miss), then the instruction is fetched from external memory, and the cache is
updated in one of two ways. If the external address was from one of the two 32-word sectors cur-
rently associated with the cache, then the word is stored in the cache at the appropriate location
within that sector. If the external address does not fall within the two 32-word sectors currently
being monitored by the cache, then a sector miss occurs. In this case, the entire contents of one of
the sectors is discarded and that sector becomes associated with the 32-word region of memory
containing the accessed address. In the case of Texas Instruments processors, the algorithm used
to determine which cache sector should be discarded when a sector miss occurs is the least-
recently-used (or LRU) algorithm. This algorithm keeps track of when each cache sector has been
accessed. When a cache sector is needed to load new program memory locations, the algorithm
selects the cache sector that has not been read from for the longest time.
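The sector-miss behavior can be modeled with a short Python sketch. This is a simplified two-sector cache with 32-word sectors; the class name and the returned strings are ours, chosen for illustration.

```python
class TwoSectorCache:
    """Instruction cache with two sectors, each shadowing one aligned 32-word
    region of program memory, replaced on a sector miss via LRU."""
    SECTOR_WORDS = 32

    def __init__(self):
        self.order = []     # sector tags, least recently used first
        self.words = {}     # tag -> set of word offsets present in the cache

    def fetch(self, address):
        tag, offset = divmod(address, self.SECTOR_WORDS)
        if tag in self.order:
            self.order.remove(tag)
            self.order.append(tag)           # mark sector most recently used
            if offset in self.words[tag]:
                return "hit"                 # word already cached
            self.words[tag].add(offset)
            return "miss"                    # word miss within a resident sector
        if len(self.order) == 2:             # sector miss: evict the LRU sector
            del self.words[self.order.pop(0)]
        self.order.append(tag)
        self.words[tag] = {offset}
        return "sector miss"
```

A fetch returns "hit" when the word is cached, "miss" when the word must be fetched but its sector is resident, and "sector miss" when an entire sector's contents are discarded and replaced.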
Some DSP processors with instruction caches provide special instructions or configuration
bits that allow the programmer to lock the contents of the cache at some point during program
execution or to disable the cache altogether. These features provide a measure of manual control
over cache mechanisms, which may allow the programmer to obtain better performance than
would be achieved with the built-in cache management logic of the processor. In addition, impos-
ing manual control over cache loading may help software developers to ensure that their code will
meet critical real-time constraints.
An interesting approach to caches was introduced by Motorola with the updated
DSP96002. This processor allows the internal 1 Kword by 32-bit program memory to be config-
ured either as an instruction cache or as program memory. When the cache is enabled, it is orga-
nized into eight 128-word sectors. Each sector can be individually locked and unlocked.
Motorola's more recent DSP563xx family includes a similar dual cache/memory construct.
A variation on the multisector caches just discussed is the Analog Devices ADSP-210xx
cache. The ADSP-210xx uses a two-bank Harvard architecture; instructions that access data from
program memory require two accesses and therefore cause contention for program memory.
Because the ADSP-210xx cache is loaded only with instructions whose execution causes conten-
tion for program memory access, the cache is more efficient than a traditional cache, which stores
every instruction fetched.
Although DSP processor caches are in some cases beginning to approach the sophistica-
tion of caches found in high-performance general-purpose processors, there are still some impor-
tant differences. In particular, DSP processor caches are used only for program instructions, not
for data. A cache that accommodates data as well as instructions must include a mechanism for
updating both the cache and external memory when a data value held in the cache is modified by
the program. This adds significantly to the complexity of the cache hardware.
DSP Processor Fundamentals: Architectures and Features
Modulo Addressing
As we've just discussed, cache memories reduce the number of accesses to a processor's
main memory banks required to accomplish certain operations. They do this by acting as an addi-
tional, specialized memory bank. In special circumstances, it is possible to use other techniques to
reduce the number of total memory accesses (including use of a cache, if one exists) required to
accomplish certain operations. One such technique is modulo addressing, which is discussed in
detail in Chapter 6. Modulo addressing enables a processor to implement a delay line, such as the
one used in our FIR filter example, without actually having to move the data values in memory.
Instead, data values are written to one memory location and remain there until they are no longer
needed. The effect of data shifting along a delay line is simulated by manipulating memory point-
ers using modulo arithmetic. This technique reduces the number of simultaneous memory
accesses required to implement the FIR filter example from four per instruction cycle to three per
instruction cycle.
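The pointer-based delay line can be sketched in Python. The class name is illustrative, and a real DSP would perform the modulo pointer update in its address generation unit rather than in software; the point is that samples stay where they were written while only an index moves:

```python
class DelayLine:
    def __init__(self, length):
        self.buf = [0.0] * length
        self.newest = 0          # index of the most recent sample

    def push(self, x):
        # Advance the write pointer with modulo arithmetic; no data moves.
        self.newest = (self.newest + 1) % len(self.buf)
        self.buf[self.newest] = x

    def tap(self, k):
        """Return the sample delayed by k (0 = newest)."""
        return self.buf[(self.newest - k) % len(self.buf)]
```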
Algorithmic Approaches
Although not a DSP processor feature, another technique for reducing memory access
requirements is to use algorithms that exploit data locality to reduce the number of memory
accesses needed. DSP algorithms that operate on blocks of input data often fetch the same data
from memory multiple times during execution. A clever programmer can reuse previously fetched
data to reduce the number of memory accesses required by an algorithm. For example, Figure 5-5
illustrates an FIR filter operating on a block of two input samples. Instead of computing output
samples one at a time, the filter instead computes two output samples at a time, allowing it to reuse
previously fetched data. This reduces the memory bandwidth required from one instruction fetch
and two data fetches per instruction cycle to one instruction fetch and one data fetch per instruction
cycle. At the expense of slightly larger code size, this technique allows (for example) FIR filter
outputs to be computed at one instruction cycle per tap while requiring less memory bandwidth
than a more straightforward approach. This technique is heavily used on IBM's Mwave family of
DSP processors, which have limited memory bandwidth. Within IBM the technique is known as
the "Zurich Zip," in honor of the researcher at IBM Zurich Laboratories who popularized it.
Chapter 5 Memory Architecture
FIGURE 5-5. Pseudo-code to compute two FIR filter outputs (y1, y0) using only one data memory access per instruction cycle:

LD R0, x1
LD R1, c1
R2 = R0*R1,    LD R0, x0
R3 = R0*R1,    LD R1, c2
R2 = R2+R0*R1, LD R0, x-1
R3 = R3+R0*R1, LD R1, c3
R2 = R2+R0*R1, LD R0, x-2
R3 = R3+R0*R1, LD R1, c4
R2 = R2+R0*R1, LD R0, x-3
R3 = R3+R0*R1

y1 is in R2; y0 is in R3.
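As an illustration only, the block-processing technique can be modeled in Python. The sketch below computes two FIR outputs per pass so that each fetched sample feeds both accumulators, mirroring the register usage shown in Figure 5-5; the function name is an assumption, and the input block must hold at least one sample more than the number of taps:

```python
def fir_block2(x, c):
    """x: input samples, newest last; c: filter taps.
    Returns (y0, y1): y1 is the output at the newest sample,
    y0 the output one sample earlier."""
    n = len(c)
    r2 = 0.0   # accumulates y1 (like R2 in the figure)
    r3 = 0.0   # accumulates y0 (like R3 in the figure)
    # Walk from the newest sample backward; each fetched sample is used
    # twice, once per accumulator, with coefficients offset by one tap.
    for k in range(n):
        s = x[-1 - k]                 # one data fetch...
        r2 += c[k] * s                # ...used for y1 with tap k
        if k >= 1:
            r3 += c[k - 1] * s        # ...and for y0 with tap k-1
    r3 += c[n - 1] * x[-1 - n]        # oldest sample needed only by y0
    return r3, r2
```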
Almost all processors recognize the need for conflict wait states and automatically insert
the minimum number of conflict wait states needed. Exceptions to this are a few members of the
AT&T DSP16xx family (the DSP1604, DSP1605, and DSP1616). On these processors, attempt-
ing to fetch words from both external program and data memory in one instruction cycle results in
a correct program word fetch, but the fetched data word is invalid.
Most DSP processors include one or more small banks of fast on-chip RAM and/or ROM
that provide one or more accesses per instruction cycle. In many situations, it is necessary or
desirable to expand this memory using off-chip memory that is too slow to support a complete
memory access within one processor instruction cycle. Typically this is done to save cost, since
slower memory chips are cheaper than faster ones. In these cases, the processor is configured to
insert programmed wait states during external memory accesses. These wait states are configured
by the programmer to deliberately slow down the processor's memory accesses to match the
speed of slow memories. Some processors can be programmed to use different numbers of pro-
grammed wait states when accessing different regions of off-chip memory, so cost-effective com-
binations of slower and faster memory can be used.
In some systems, it may not be possible to predict in advance precisely how many wait
states will be required to access external memory. For example, when the processor shares an
external memory bus with one or more other processors, the processor may have to wait for
another processor to relinquish the bus before it can proceed with its own access. Similarly, if
dynamic memory (DRAM) is used, the processor may have to wait while the DRAM controller
refreshes the DRAM. In these cases, the processor must have the ability to dynamically insert
externally requested wait states until it receives a signal from an external bus or memory control-
ler that the external memory is ready to complete the access. For example, the Texas Instruments
TMS320C5x provides a special READY pin that can be used by external hardware to signal the
processor that it must wait before continuing with an external memory access.
The length of a wait state relative to the length of a processor instruction cycle varies from
processor to processor. Wait state lengths typically range from one quarter of an instruction cycle
(as on the AT&T DSP32C) to a full instruction cycle (as on most processors). Shorter wait states
allow more efficient operation, since the delay from the time when the external memory is ready
for an access to the time when the wait state ends and the processor begins the access will likely
be shorter.
5.4 ROM
DSP processors that are intended for low-cost, embedded applications like consumer elec-
tronics and telecommunications equipment provide on-chip read-only memory (ROM) to store
the application program and constant data. Some manufacturers offer multiple versions of their
processors: a version with internal RAM for prototyping and for low-volume production, and a
version with factory-programmed ROM for large-volume production. On-chip ROM sizes typi-
cally range from 256 words to 36 Kwords.
Texas Instruments offers versions of some of its processors (e.g., the TMS320P17 and
TMS320P25) with one-time-programmable ROM on-chip. These devices can be programmed by
the system manufacturer using inexpensive PROM programmers, either for prototyping or for
low- or medium-volume production.
For applications requiring more ROM than is provided on-chip by the chosen processor,
external ROM can be connected to the processor through its external memory interface. Typically,
multiple ROM chips are used to create a bank of memory whose width matches the width of the
program word of the processor. However, some processors have the ability to read their initial
(boot) program from an inexpensive byte-wide external ROM. These processors construct instruc-
tion words of the appropriate width by concatenating bytes from the ROM.
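The byte-concatenation step can be sketched as follows. The 24-bit word width and most-significant-byte-first order are assumptions for illustration; real processors differ in both:

```python
def words_from_boot_rom(rom_bytes, bytes_per_word=3):
    """Assemble wide instruction words from a byte-wide boot ROM image."""
    words = []
    for i in range(0, len(rom_bytes), bytes_per_word):
        w = 0
        for b in rom_bytes[i:i + bytes_per_word]:
            w = (w << 8) | b          # shift in one byte at a time
        words.append(w)
    return words
```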
Most DSP processors provide a single external memory port consisting of an address bus,
a data bus, and a set of control signals, even though most DSP processors have multiple indepen-
dent memory banks on-chip. This is because extending buses off-chip requires large numbers of
package pins, which increase the cost of the processor. Most processors with multiple on-chip
memory banks provide the flexibility to use the external memory port to extend any of the internal
memory banks off-chip. However, the lack of multiple external memory ports usually means that
multiple accesses cannot be made to external memory locations within a single instruction cycle,
and programs attempting to do so will incur a performance penalty. Figure 5-6 illustrates a typical
DSP processor external memory interface, with three independent sets of on-chip memory buses sharing one external memory interface.

[Figure 5-6 diagram: the X, Y, and program address buses feed an external address bus switch that drives the single external address bus; the X, Y, and program data buses feed an external data bus switch that drives the single external data bus; external bus control logic provides read/write, program, and data strobes and a ready input.]

FIGURE 5-6. Example DSP processor external memory interface. The processor has three sets of on-chip memory buses, but only one set of off-chip memory buses. The on-chip buses are multiplexed such that any one of the on-chip bus sets can be connected to the off-chip bus set.
Some DSP processors do provide multiple off-chip memory ports. The Analog Devices
ADSP-21020 provides an external program memory port (24-bit address, 48-bit data) and an
external data memory port (32-bit address, 32-bit data). The Texas Instruments TMS320C30 pro-
vides one 24-bit address, 32-bit data external memory port, and one 13-bit address, 32-bit data
external memory port, while the TMS320C40 has two identical 31-bit address, 32-bit data exter-
nal memory ports. Similarly, the Motorola DSP96002 provides two identical 32-bit address and
data bus sets. The cost of these devices is correspondingly higher than that of comparable proces-
sors with only one external memory port.
DSP processor external memory interfaces vary quite a bit in flexibility and sophistication.
Some are relatively simple and straightforward, with only a handful of control pins. Others are
much more complex, providing the flexibility to interface with a wider range of external memory
devices and buses without special interfacing hardware. Some of the features distinguishing exter-
nal memory interfaces are the flexibility and granularity of programmable wait states, the inclu-
sion of a wait pin to signal the availability of external memory, bus request and bus grant pins
(discussed below), and support for page-mode DRAM (discussed below).
High-performance applications must often use fast static RAM devices for off-chip mem-
ory. In such situations, it is important for system hardware designers to scrutinize the timing spec-
ifications for DSP processors' external memory ports. Because timing specifications can vary
significantly among processors, it is common to find two processors that have the same instruction
cycle time but have very different timing specifications for off-chip memory. These differences
can have a serious impact on system cost, because faster memories are significantly more expen-
sive than slower memories. Hardware design flexibility is also affected, since more stringent tim-
ing specifications may constrain the hardware designer in terms of how the interface circuitry is
designed and physically laid out.
Manual Caching
Whether or not a processor contains a cache, it is often possible for software developers to
improve performance by explicitly copying sections of program code from slower or more con-
gested (in terms of accesses) memory to faster or less congested memory. For example, if a sec-
tion of often-used program code is stored in a slow, off-chip ROM, then it may make sense to
copy that code to faster on-chip RAM, either at system start-up or when that particular program
section is needed.
(bus arbitration) and to prevent the processors that do not have control of the bus from trying to
assert values onto the bus. Several DSP processors provide features to facilitate this kind of
arrangement, though there are significant differences in the sophistication and flexibility of the
features provided. In some cases, a shared bus multiprocessor can be created simply by connect-
ing together the appropriate pins of the processors without the need for any special software or
hardware to manage bus arbitration. In other cases, extra software on one or more of the DSP pro-
cessors and/or external bus arbitration hardware may be required.
An example of basic support for shared bus systems is provided by the Motorola
DSP5600x. Two of the DSP processor's pins can be configured to act as bus request and bus grant
signals. When an external bus arbitrator (either another processor or dedicated hardware) wants a
particular DSP processor to relinquish the shared bus, it asserts that processor's bus request input.
The processor then completes any external memory access in progress and relinquishes the bus,
acknowledging with the bus grant signal that it has done so. The DSP processor can continue to
execute its program as long as no access to the shared bus is required. If an access to the shared
bus is required, the processor waits until the bus request signal has been deasserted, indicating that
it can again use the shared bus.
The Texas Instruments TMS320C5x provides several features that support multiprocess-
ing. In addition to providing the equivalent of bus request and bus grant signals (called HOLD and
HOLDA on the TMS320C5x), the processor also allows an external device to access its on-chip
memory. To accomplish this, the external device first asserts the TMS320C5x's HOLD input.
When the processor responds by asserting HOLDA, the external device asserts BR, indicating that
it wishes to access the TMS320C5x's on-chip memory. The TMS320C5x responds by asserting
IAQ. The external device can then read and write the TMS320C5x's on-chip memory by driving
TMS320C5x's address, data, and read/write lines. When finished, the external device deasserts
HOLD and BR. This allows the creation of multiprocessor systems that do not require shared
memory for interprocessor communications.
A processor feature that simplifies the use of shared variables in shared memory is bus
locking, which allows a processor to read the value of a variable from memory, modify it, and
write the new value back to memory, while ensuring that this sequence of operations is not inter-
rupted by another processor attempting to update the variable's value. This is sometimes referred
to as an atomic test-and-set operation. The Texas Instruments TMS320C3x and TMS320C4x pro-
cessors provide special instructions and hardware support for bus locking. Texas Instruments
refers to these operations as "interlocked operations."
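The lost-update hazard that bus locking prevents can be shown deterministically in a few lines of Python. The two functions are illustrative: the bad interleaving is written out by hand rather than produced by real concurrency, but it is exactly the read-modify-write race an unlocked bus permits:

```python
def unlocked_interleaving(shared):
    """Two 'processors' interleave their read-modify-write steps."""
    a = shared["x"]          # processor A reads
    b = shared["x"]          # processor B reads the same (stale) value
    shared["x"] = a + 1      # A writes back
    shared["x"] = b + 1      # B overwrites A's update: one increment lost
    return shared["x"]

def locked_updates(shared):
    """With bus locking, each read-modify-write completes atomically."""
    for _ in range(2):
        shared["x"] = shared["x"] + 1
    return shared["x"]
```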
The Analog Devices ADSP-2106x offers a sophisticated shared bus interface. The pro-
cessor provides on-chip bus arbitration logic that allows direct interconnection of up to six
ADSP-2106x devices with no special software or external hardware required for bus arbitration.
In addition, the processor allows one DSP processor in a shared-bus configuration to access
another processor's on-chip memory, much like on the Texas Instruments TMS320C5x family.
This means that interprocessor data moves will not necessarily have to transit through an exter-
nal shared memory.
In addition to special external memory interface features, the Analog Devices ADSP-2106x
and the Texas Instruments TMS320C4x families provide special communications ports to facilitate
connections within multiprocessor systems. Features of this type are discussed in detail in
Chapter 10.
Dynamic Memory
All of the writable memory found on DSP processors and most of the memory found in
systems based on DSP processors is static memory, also called SRAM (for static random-access
memory; a better name would have been static read and write memory). Static memory is simpler
to use and faster than dynamic memory (DRAM), but it also requires more silicon area and is
more costly for a given number of bits of memory. The key operational attribute distinguishing
static from dynamic memories is that static memories retain their data as long as power is avail-
able. Dynamic memories must be refreshed periodically; that is, a special sequence of signals
must be applied to reinforce the stored data, or it eventually (typically in a few tens of millisec-
onds) is lost. In addition, interfacing to static memories is usually simpler than interfacing to
dynamic memories; the use of dynamic memories usually requires a separate, external DRAM
controller to generate the necessary control signals.
Because of the increasing proliferation of DSP processors into low-cost, high-volume
products like answering machines and personal computer add-in cards, there has been increased
interest in using dynamic memory in DSP systems. DRAM can also be attractive for systems that
require large quantities of memory, such as large-scale multiprocessor systems.
One way to get faster, static RAM-like performance from slower, dynamic RAM is the use
of paged or static column DRAM. These are special types of DRAM chips that allow faster than
normal access when a group of memory accesses occurs within the same region (or page) of
memory. Some DSP processors, including the Motorola DSP96002, the Analog Devices ADSP-
210xx, and the Texas Instruments TMS320C3x and TMS320C4x provide memory page boundary
detection capabilities. These capabilities generally consist of a set of programmable registers,
which the programmer uses to specify the locations of page boundaries in external memory, and
circuitry to detect when external memory accesses cross page boundaries. In most cases, when the
processor detects that a memory access has crossed a page boundary, it asserts a special output
pin. It is then up to the external DRAM controller to use a processor input pin to signal back to the
processor that it must delay its access by inserting wait states while the controller readies the
DRAM for access to a new page.
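The boundary-detection circuitry amounts to comparing the high-order address bits of successive accesses. A minimal sketch, assuming a programmable power-of-two page size (the 256-word page is illustrative):

```python
PAGE_SIZE = 256   # illustrative; real parts make this programmable

def crosses_page(prev_addr, addr, page_size=PAGE_SIZE):
    """True when an access falls in a different DRAM page than the last one,
    i.e., when the high-order address bits change."""
    return (prev_addr // page_size) != (addr // page_size)
```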
As mentioned above, the use of DRAM as external memory for a DSP processor usually
requires the use of an external DRAM controller chip. This additional chip may increase the man-
ufacturing cost of the design, which partly defeats the reason for using DRAM in the first place.
To address this problem, some DSP processors now incorporate a DRAM controller on-chip. The
Motorola DSP56004 and DSP56007, for example, provide on-chip DRAM interfaces that include
support for page-mode DRAM.
from an I/O device and copy the data into memory or vice versa, a separate DMA controller can
handle such transfers more efficiently. This DMA controller may be a peripheral on the DSP chip
itself or it may be implemented using external hardware.
Any processor that has the simple bus request/bus grant mechanism described above can
be used with an external DMA controller that accesses external memory. Typically, the processor
loads the DMA controller with control information including the starting memory address for the
transfer, the number of data words to be transferred, the direction of the transfer, and the source or
destination peripheral. The DMA controller uses the bus request pin to notify the DSP processor
that it is ready to make a transfer to or from external memory. The DSP processor completes its
current instruction, relinquishes control of external memory, and signals the DMA controller via
the bus grant pin that the DMA transfer can proceed. The DMA controller then transfers the spec-
ified number of data words and optionally signals completion to the processor through an inter-
rupt.
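As a sketch only, with hypothetical field names, the control information loaded into a DMA controller and the resulting transfer might be modeled like this:

```python
def run_dma(memory, peripheral_fifo, start, count, direction):
    """Model one programmed DMA transfer.
    direction: 'mem_to_periph' or 'periph_to_mem'. Returns words moved."""
    for i in range(count):
        if direction == "mem_to_periph":
            peripheral_fifo.append(memory[start + i])
        else:
            memory[start + i] = peripheral_fifo.pop(0)
    # A real controller would optionally raise a completion interrupt here.
    return count
```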
Some more sophisticated DSP processors include a DMA controller on-chip that can
access internal and external memory. These DMA controllers vary in their performance and flexi-
bility. In some cases, the processor's available memory bandwidth may be large enough to allow
DMA transfers to occur in parallel with normal program instruction and data transfers without
any impact on performance. For example, the Texas Instruments TMS320C4x contains a DMA
controller that, combined with the TMS320C4x's on-chip memory and on-chip DMA address and
data buses, can complete one memory access per instruction cycle independent of the processor.
The Motorola DSP96002, the Texas Instruments TMS320C3x family, and the Analog Devices
ADSP-2106x family all include on-chip DMA controllers with similar capabilities.
Some DMA controllers can manage multiple DMA transfers in parallel. Such a DMA con-
troller is said to have multiple channels, each of which can manage one transfer, and each of
which has its own set of control registers. The TMS320C4x DMA controller supports six chan-
nels, the Analog Devices ADSP-2106x supports ten channels, and the Motorola DSP96002 can
handle two channels. Each channel can be used for memory-memory or memory-peripheral trans-
fers.
In contrast, the AT&T DSP3210 includes a more limited, two-channel DMA controller that
can only be used for transfers to and from the processor's internal serial port. Since the DSP3210
does not have extra memory bandwidth, the currently executing instruction is forced to wait one
cycle when the DMA controller accesses memory. This arrangement (where the processor is sus-
pended during DMA bus accesses) is called cycle stealing. The Analog Devices ADSP-21xx pro-
vides a similar capability through a mechanism that Analog Devices calls autobuffering.
5.6 Customization
We've already mentioned that many DSP processor vendors offer versions of their proces-
sors that are customized by placing user-specified programs and/or data into the on-chip ROM. In
addition, several vendors can produce DSP core-based ASICs or customizable DSPs (see Chapter
4), which provide the user with more flexibility. These approaches may allow the user to specify
memory sizes and configurations (for example, the mix of ROM and RAM) that are best suited to
the application at hand. DSP processor vendors offering customizable DSPs or DSP core-based ASICs include AT&T, Clarkspur Design, DSP Group, SGS-Thomson, Tensleep Design, and Texas Instruments, among others.
Chapter 6
Addressing
As with immediate data, small addresses can be encoded in the instruction word specify-
ing the operation, resulting in a single-word instruction. Larger addresses have to be placed in a
memory word separate from the instruction. This requires the processor to read two words from
program memory before the instruction can be executed, slowing program execution.
[Figure: register-indirect addressing. Eight address registers R0 through R7 are shown beside memory locations 0x00FB through 0x0102. Register R2 holds the address 0x0100, and memory location 0x0100 contains the value 7, so the instruction MOVE (R2), A loads 7 into A.]
addresses; other processors have general-purpose registers that can store addresses or data. Regis-
ter-indirect addressing is, for two reasons, one of the most important addressing modes found on
most DSP processors. First, register-indirect addressing lends itself naturally to working with
arrays of data, which are common in DSP applications. This point is expanded upon below in the
section on register-indirect addressing with pre- and post-increment. Second, register-indirect
addressing is efficient from an instruction-set point of view: it allows powerful, flexible address-
ing while requiring relatively few bits in an instruction word.
As an example of register-indirect addressing, consider the AT&T DSP32xx instruction

A0 = A0 + *R5

which causes the value stored in the memory location pointed to by the contents of register R5 to be added to the value in the accumulator register A0. The result is stored in the accumulator register A0. In this instruction, the input operand "*R5" is specified using register-indirect addressing: R5 is used as an address register; it holds the address of the memory location that contains the actual operand.
In AT&T's assembly language, the "*" symbol, when used in this manner, is interpreted to mean "the contents of the memory location pointed to by ...". This meaning is familiar to programmers who use the C programming language. Other processors have different syntaxes for
register-indirect addressing; many use "(Rn)" to indicate the memory location pointed to by
address register Rn.
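Register-indirect addressing maps naturally onto pointer dereference. A Python sketch of the register-indirect addition example above, with memory modeled as a dictionary and all names illustrative:

```python
# Simple models of memory and the register file.
memory = {0x0100: 7, 0x0101: 9}
registers = {"R5": 0x0100, "A0": 0.0}

def add_indirect(acc, addr_reg):
    """acc = acc + *addr_reg: the operand is fetched from the memory
    location whose address is held in the address register."""
    registers[acc] = registers[acc] + memory[registers[addr_reg]]
    return registers[acc]
```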
There are many variations on register-indirect addressing; these variations are described
next.
causes the address register R5 to be decremented to point to the previous memory location after
the addition operation is performed.
Some processors also provide the ability to add or subtract values other than one to or
from an address register. For example, the DSP32xx instruction
A0 = A0 + *R5++R17

adds the value in the memory location pointed to by register R5 to the accumulator and then increases the value in the address register R5 by the value stored in register R17. Since the value in register R17 can be negative, this same instruction format can be used to decrease the value stored in an address register. In some processors, a special group of registers is provided to hold pre- or post-increment values. In other processors, the increment value can be stored in a general-purpose register. The register that holds the increment value (R17 in this example) is called an offset register. (Note that some vendors use different terminology. For example, Analog Devices calls these registers modifier registers.)
An example of a pre-decrement addressing mode is provided by the Motorola DSP5600x:
MOVE X:-(R0),A1

which decrements address register R0 before it is used to point to the memory location containing the data value to be moved into the accumulator register A1. Pre-incrementing or pre-decrementing usually requires an
extra instruction cycle to update the address register contents before they are used to address
memory.
Indexed addressing is useful when the same program code is used on multiple sets of data.
The index register can be used to point to the beginning of a data set in memory, and the regular
address registers can be pre- or postmodified to step through the data in the desired fashion. When
the program is ready to operate on the next data set, only the value in the index register needs to be
changed.
Indexed addressing is also useful in compilers for communicating arguments to subrou-
tines by passing data on the stack. Compilers commonly dedicate an address register to be used as
a stack frame pointer. Each time a subroutine is called, the stack frame pointer is set up so that the
address pointed to by the stack frame pointer contains the address of the previous stack frame.
The next address contains the number of arguments being passed to the subroutine, and subse-
quent addresses contain the arguments themselves. This way, subroutines can access their argu-
ments in a simple, consistent way. For example, if address register AR1 is designated for use as the stack frame pointer, then in the TMS320C3x, the instruction

LDI *+AR1(2), R0

could be used by any subroutine to copy its first argument into register R0. The subroutine itself is
insulated from knowledge of the exact memory locations of its arguments; all it needs to know is
where its arguments are located relative to the current stack frame pointer.
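The frame layout just described can be sketched in Python. The layout follows the text's description (previous frame pointer, then argument count, then the arguments); the helper names are otherwise illustrative:

```python
def build_frame(stack, fp, prev_fp, args):
    """Lay out a stack frame: [prev_fp, arg count, arg1, arg2, ...]."""
    stack[fp] = prev_fp
    stack[fp + 1] = len(args)
    for i, a in enumerate(args):
        stack[fp + 2 + i] = a

def get_arg(stack, fp, n):
    """Fetch argument n (1-based) relative to the frame pointer,
    analogous to LDI *+AR1(n+1), R0 on the TMS320C3x."""
    return stack[fp + 1 + n]
```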
FIGURE 6-2. (a) A FIFO buffer with linear addressing. Five items of data xn have arrived in the order x0, x1, x2, x3, x4 and have been written into the buffer. Only the first data item, x0, has been read out of the buffer. After each read or write operation, the corresponding pointer moves to the right. Once either pointer advances to the end of the buffer, it must be reset to point to the beginning of the buffer. (b) The same data in a FIFO buffer with circular addressing. After the read pointer or the write pointer reaches the end of the buffer, it automatically advances to the start of the buffer, making the buffer appear circular to the programmer.
The term modulo refers to modulo arithmetic, wherein numbers are limited to a specific range. This is similar to the arithmetic used on a clock, which is based on a 12-hour cycle. When the result of a calculation exceeds the maximum value of the range, the modulus (the size of the range) is repeatedly subtracted from it until the result lies within the range. For example, 4 hours after 10 o'clock is 2 o'clock (14 modulo 12).
When modulo address arithmetic is in effect, read and write pointers (address registers)
are updated using pre- and/or post-increment register-indirect addressing. The processor's address
generation unit performs modulo arithmetic when new address values are computed, creating the
appearance of a circular memory layout, as illustrated in Figure 6-2(b). Modulo address arith-
metic eliminates the need for the programmer to check the read and write pointers to see whether
they have reached the end of the buffer and to reset the pointers to the beginning of the buffer once
they have reached the end. This results in much faster buffer operations and makes modulo
addressing a valuable capability for many applications.
Most recently designed DSP processors provide some support for modulo address arith-
metic. However, the depth of this support and the mechanisms used to control it vary from proces-
sor to processor. Modulo addressing approaches are discussed in the paragraphs below.
The programmer typically controls modulo addressing in one of two ways. In the first
method, the length of the circular buffer is loaded into a special register, often called a modifier
or modulo register. A processor may have only one modifier register, or it may have several.
Each modifier register is associated with one or more address registers; whenever a modifier reg-
ister is loaded with a circular buffer length, its associated address registers automatically use
modulo address arithmetic. Because the modifier register contains only the length of the buffer
and not its starting address, the modulo addressing circuitry must make some assumptions about
the starting address of circular buffers. Typically, circular buffers must start on k-word bound-
aries, where k is the smallest power of 2 that is equal to or greater than the size of the circular
buffer. For example, a 48-word circular buffer would typically have to reside on a 64-word
boundary, since 64 is the smallest power of 2 that is equal to or greater than 48. If we imagine
dividing a processor's address space into k-word blocks, starting at address 0, then a k-word
boundary occurs between every pair of k-word blocks. For example, under this scheme a circular
buffer of length 256 could start at address 0, 256, 512, or any other address that is a multiple of
256. Processors implementing this form of modulo addressing (or one like it) include the Texas
Instruments TMS320C3x and TMS320C4x, and processors from Motorola, Analog Devices,
NEC, and DSP Group.
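The alignment rule just described is easy to state in code. The following sketch (function names assumed) computes the smallest power-of-two boundary for a given buffer length and checks candidate start addresses against it:

```python
def required_boundary(length):
    """Smallest power of two >= length: the boundary a circular buffer
    of this length must start on under the scheme described above."""
    k = 1
    while k < length:
        k <<= 1
    return k

def valid_start(addr, length):
    """True if addr is a legal start address for a buffer of this length."""
    return addr % required_boundary(length) == 0
```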
An alternative approach to modulo addressing uses start and end registers to hold the start
and end addresses of each circular buffer. On some processors, modulo address arithmetic is then
used on any address registers that point into the region of memory bounded by the addresses in
the start and end registers. On other processors, only one address register can be associated with a
given circular buffer. Processors using the start/end register approach include the AT&T DSP16xx
and the Texas Instruments TMS320C5x.
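The start/end-register style can be sketched the same way; here no alignment is required, only the two boundary registers. We treat the end register as holding the last valid address of the buffer, inclusive, which is one plausible convention; actual processors differ in this detail:

```python
def wrap_start_end(addr, step, start, end):
    """Post-increment with wraparound defined by start and end registers.

    'end' is taken as the last valid address of the buffer (inclusive)."""
    length = end - start + 1
    return start + (addr - start + step) % length

# A 48-word buffer from 0x100 through 0x12F needs no special alignment.
addr = wrap_start_end(0x12F, 1, 0x100, 0x12F)   # wraps back to 0x100
```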
As suggested above, different processors support different numbers of simultaneously
active circular buffers with modulo addressing. For example, the AT&T DSP16xx supports mod-
ulo address arithmetic on only one circular buffer at a time, since it has only one start and one end
register. The Texas Instruments TMS320C5x supports two, the Motorola DSP561xx supports
four, and the Motorola DSP5600x and all Analog Devices processors support eight buffers.
[Figure 6-3: (a) FFT output in scrambled order: x0, x4, x2, x6, x1, x5, x3, x7;
(b) binary counter: 000 = 0, 001 = 1, 010 = 2, 011 = 3, 100 = 4, 101 = 5, 110 = 6, 111 = 7;
(c) bit-reversed counter: 000 = 0, 100 = 4, 010 = 2, 110 = 6, 001 = 1, 101 = 5, 011 = 3, 111 = 7;
(d) natural order: x0, x1, x2, x3, x4, x5, x6, x7.]
FIGURE 6-3. The output of an FFT algorithm (in this case a radix-2, eight-point FFT)
is produced in scrambled order (a). If the output of a binary counter (b) is bit
reversed, the resulting sequence (c) can be used to transform the FFT output into
natural order (d) for further processing.
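The bit-reversed sequence in the figure is easy to generate in software; a processor's bit-reversed addressing mode computes the same permutation in hardware. A small illustrative sketch:

```python
def bit_reverse(x, bits):
    """Reverse the low 'bits' bits of x (3 bits for an 8-point FFT)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)           # move the lowest bit of x ...
        x >>= 1                          # ... onto the bottom of r
    return r

# Bit-reversing a 3-bit counter yields the order in which a radix-2,
# eight-point FFT produces its outputs.
order = [bit_reverse(i, 3) for i in range(8)]   # [0, 4, 2, 6, 1, 5, 3, 7]
```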
word. In Motorola's DSP5600x assembly language syntax, for example, immediate addressing
can be used to load a constant into a register, as in
MOVE #1234, A
which loads the constant value 1234 into register A. In the general case, this instruction requires
two words of program memory: one to store the move instruction and one to store the immediate
data themselves. However, if the immediate data are small, it may be possible to fit the instruction
and the data into a single program memory word. In the case of the Motorola DSP5600x, immedi-
ate data instructions of the form above can be encoded in one program memory word if the data
are 12 bits long or smaller.
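The trade-off reduces to a simple rule. The sketch below assumes a signed short immediate; the 12-bit width is the DSP5600x figure cited above, but the signedness and encoding details are our illustration, not the processor's documented format:

```python
def immediate_words(value, short_bits=12):
    """Program memory words consumed by a move-immediate instruction,
    assuming the opcode word can absorb a signed 'short_bits'-bit
    immediate and otherwise needs one extra word for the data."""
    if -(1 << (short_bits - 1)) <= value < (1 << (short_bits - 1)):
        return 1         # immediate fits inside the instruction word
    return 2             # separate program memory word for the data

# MOVE #1234,A encodes in one word; a full 16-bit constant needs two.
```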
Chapter 7
Instruction Set
A processor's instruction set is a key factor in determining not only what operations are
possible on that processor, but also what operations are natural and efficient. Instructions control
how data are sequenced through the processor's data path, how values are read and written to
memory, and so forth. As a result, a processor's instruction set can have a profound influence on a
processor's suitability for different tasks.
In this chapter we investigate a number of aspects of instruction sets. We look at the types
of instructions (and, closely related, the types of registers) commonly found on DSPs and discuss
instruction set support for movement of data in parallel with arithmetic and multiply operations.
We also explore orthogonality and ease of programming and conclude by examining conditional
execution and special function instructions.
instructions typically use indirectly addressed memory as operands. Other processors that use reg-
isters for their multiply operands may not require a special instruction to implement squaring. For
example, the Motorola DSP5600x implements squaring with an "MPY X0,X0,A" instruction.
A number of processors provide instructions to support extended-precision arithmetic. For
example, signed/signed, signed/unsigned, and unsigned/unsigned multiplication instructions are
available on the SGS-Thomson D950-CORE, the Texas Instruments TMS320C54x, and the Zoran
ZR3800x. Many processors also support add-with-carry and subtract-with-borrow instructions,
both particularly important for extended-precision arithmetic.
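Add-with-carry is what makes multiword arithmetic practical. The sketch below shows a 32-bit addition built from 16-bit halves, the way an add / add-with-carry instruction pair would compute it (Python models the idea only; real instructions operate on accumulator and register pairs):

```python
def add_extended(a_hi, a_lo, b_hi, b_lo, width=16):
    """32-bit add from 16-bit halves: add the low words, then add the
    high words together with the carry out of the low-word add."""
    mask = (1 << width) - 1
    lo = a_lo + b_lo
    carry = lo >> width                  # 1 if the low-word add overflowed
    lo &= mask
    hi = (a_hi + b_hi + carry) & mask    # the add-with-carry step
    return hi, lo

# 0x0001FFFF + 0x00000001 = 0x00020000
hi, lo = add_extended(0x0001, 0xFFFF, 0x0000, 0x0001)
```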
Logic Operations
Most DSP processors provide instructions for logic operations, including logical and, or,
exclusive-or, and not. These find use in error correction coding and decision-processing applica-
tions that DSPs are increasingly being called upon to perform. Note that processors may also have
bit (or bit-field) manipulation instructions, discussed below.
Shifting
Shifting operations can be divided into two categories: arithmetic and logical. A logical
left shift by one bit inserts a zero bit in the least significant bit, while a logical right shift by one
bit inserts a zero bit in the most significant bit. In contrast, an arithmetic right shift duplicates the
sign bit (either a one or zero, depending on whether the number is negative or not) into the most
significant bit. Although people use the term "arithmetic left shift," arithmetic and logical left
shifts are really identical: they both shift the word left and insert a zero in the least significant bit.
Arithmetic shifting provides a way of scaling data without using the processor's multiplier.
Scaling is especially important on fixed-point processors where proper scaling is required to obtain
accurate results from mathematical operations.
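The distinction between the two right shifts is easy to see on 16-bit two's-complement words; the helper names below are ours:

```python
def logical_shr(x, n, width=16):
    """Logical right shift: zeros enter at the most significant bit."""
    return (x & ((1 << width) - 1)) >> n

def arith_shr(x, n, width=16):
    """Arithmetic right shift: the sign bit is duplicated into the MSB."""
    mask = (1 << width) - 1
    x &= mask
    if x & (1 << (width - 1)):           # negative in two's complement
        x -= (1 << width)                # reinterpret as a signed value
    return (x >> n) & mask               # Python's >> sign-extends ints

# -4 is 0xFFFC in 16 bits. Arithmetically shifted right by one it stays
# negative (-2, i.e., 0xFFFE); logically shifted it becomes 0x7FFE.
```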
Virtually all DSPs provide shift instructions of one form or another. Some processors pro-
vide the minimum, i.e., instructions to do arithmetic left or right shifting by one bit. Some proces-
sors may additionally provide instructions for two- or four-bit shifts. These can be combined with
single-bit shifts to synthesize n-bit shifts, although at a cost of several instruction cycles.
Increasingly, many DSP processors feature a barrel shifter and instructions that use the
barrel shifter to perform arithmetic or logical left or right shifts by any number of bits. Examples
include the AT&T DSP16xx, the Analog Devices ADSP-21xx and ADSP-210xx, the DSP Group
OakDSPCore, the Motorola DSP563xx, the SGS-Thomson D950-CORE, and the Texas Instru-
ments TMS320C5x and TMS320C54x.
Rotation
Rotation can be thought of as circular shifting: shifting where the bits that are shifted off
one end of the word "loop around" and are shifted in on the other end. For example, in a left rotate
by one bit, the most significant bit is rotated into the least significant bit, and all other bits shift left
one position. Rotation finds use in a number of areas, including error correction coding (e.g., for
bit interleaving). It can also be used to (slowly) generate bit-reversed addresses on processors that
do not have bit-reversal built in to their address generation units.
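A one-bit left rotate on a 16-bit word can be sketched as follows (illustrative; many processors also rotate through a carry bit, which we omit):

```python
def rotate_left(x, n, width=16):
    """Rotate x left by n bits: bits shifted off the top re-enter at
    the bottom of the word."""
    mask = (1 << width) - 1
    x &= mask
    n %= width
    return ((x << n) | (x >> (width - n))) & mask

# Rotating 0x8001 left one bit carries the MSB around to the LSB.
result = rotate_left(0x8001, 1)          # 0x0003
```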
Most processors provide instructions to rotate a word left or right by one bit. Exceptions
include the AT&T DSP16xx, the NEC µPD7701x, the Zilog Z893xx, and the Zoran ZR3800x.
Comparison
Most processors provide a set of status bits that provide information about the results of
arithmetic operations. For example, status bits commonly include a zero bit (set if the result of the
last arithmetic operation resulted in a zero value), a minus bit (set if the result of the last operation
was negative), an overflow bit, and so on. Status bits are set as a result of an arithmetic operation
and can then be used in conditional branches or conditional execution instructions (discussed
below).
In decision-intensive code, a processor may need to rapidly compare a series of values to a
known value stored in a register. One way to do this is to subtract the value to be compared from
the reference value to set the status bits accordingly. On some processors, this changes the refer-
ence value; on others, the result of the subtraction can be placed in another register. On processors
with few registers (including most fixed-point DSPs), neither of these is an attractive solution. As
an alternative, some processors provide compare instructions that effectively perform this subtrac-
tion without modifying the reference value or using another register to store the result. An inter-
esting enhancement to the normal compare instruction is the compare absolute value instruction,
which enables quick determination of whether a number's magnitude falls within a specified
range.
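The effect of a compare instruction can be modeled as a subtraction whose result is discarded, only the status flags being kept; the minimal flag set below is our illustration:

```python
def compare(ref, value, width=16):
    """Set zero and minus flags from ref - value without storing the
    difference anywhere, as a compare instruction does."""
    mask = (1 << width) - 1
    diff = (ref - value) & mask
    zero = (diff == 0)
    minus = bool(diff & (1 << (width - 1)))   # sign bit of the result
    return zero, minus

def compare_abs(ref, value, width=16):
    """Compare-absolute-value: flags reflect ref - |value|, a one-step
    check of whether a number's magnitude exceeds a threshold."""
    return compare(ref, abs(value), width)

# |-42| is below the threshold 100: the result is nonzero and positive.
flags = compare_abs(100, -42)            # (False, False)
```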
Looping
The majority of DSP applications require repeated execution of a small number of arith-
metic or multiplication instructions. Because the number of instructions in the inner loop is usu-
ally small, the overhead imposed by instructions used to decrement and test a counter and branch
to the start of the loop may be relatively large. As a result, virtually all DSP processors provide
hardware looping instructions. These instructions allow a single instruction or block of instruc-
tions to be repeated a number of times without the overhead that would normally come from the
decrement-test-branch sequence at the end of the loop.
Hardware looping features are discussed in detail in Chapter 8.
sor to execute a number of instructions located immediately following the branch before
starting execution at the branch destination address. This reduces the number of instruc-
tion cycles used for the branch itself. Delayed branches are discussed in detail in
Chapter 9.
Delayed branch with nullify. Texas Instruments' TMS320C4x family offers a conditional
delayed branch wherein the instructions in the delay slot can be conditionally executed
based on whether the branch is taken or not.
PC-relative. A PC-relative branch jumps to a location determined by an offset from the
current instruction location. This is important in applications that require position-inde-
pendent code, that is, programs that can run when loaded at any memory address. PC-rela-
tive branches are discussed in more detail in Chapter 8.
of leading ones or zeros) in a data value. Normalization refers to exponent detection com-
bined with a left shift that scales the data value so that it contains no redundant sign bits.
Exponent detection is sometimes used separately from normalization to determine the
maximum exponent in a block of numbers. Once the maximum exponent has been found,
a shift of that many bits can be performed on each number in the block to scale them all to
the same exponent. Please refer to Chapter 3 for details on block floating-point arithmetic.
DSP processor instructions that support exponent detection and normalization fall into
three basic categories:
Block exponent detection. Block exponent detection is used (typically within a hard-
ware loop) to determine the maximum exponent in a block of data. This is a very use-
ful instruction for manipulating arrays of block floating-point data. The Analog
Devices ADSP-21xx and Zoran ZR3800x are the only DSP processors that feature
block exponent detect instructions.
Exponent detection. Many processors provide instructions that can compute the
exponent of a single data value in one instruction cycle. Processors with such instruc-
tions include the Analog Devices ADSP-21xx, the DSP Group OakDSPCore, the NEC
µPD7701x, the SGS-Thomson D950-CORE, the Texas Instruments TMS320C54x,
and the Zoran ZR3800x.
Normalization. Many processors also support normalization (that is, combined expo-
nent detection and shifting), but this support varies widely between processors. AT&T
DSP16xx family members (other than the DSP1602 and DSP1605) are the only DSP
processors that provide a single-cycle normalize instruction. Most of the processors
with single-cycle exponent detection listed in the preceding paragraph can follow the
exponent detect instruction with a shift instruction to perform a two-cycle normaliza-
tion. Finally, a number of other processors, such as the DSP Group PineDSPCore, the
Motorola DSP5600x and DSP561xx, and the Texas Instruments TMS320C2x and
TMS320C5x, provide iterative normalization instructions. These instructions normal-
ize a data value one bit at a time, meaning that normalization of an n-bit number
requires n instruction cycles. These instructions usually also compute the exponent of
the data value and store this in a processor register.
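For a 16-bit two's-complement word, the "exponent" in this sense is the number of redundant sign bits, and normalizing means shifting them out. A sketch (counting redundant bits as a positive left-shift count is our convention; processors vary):

```python
def exponent(x, width=16):
    """Count redundant sign bits: how far x can be shifted left without
    losing information, i.e., the normalizing shift count."""
    mask = (1 << width) - 1
    x &= mask
    sign = (x >> (width - 1)) & 1
    count = 0
    for i in range(width - 2, -1, -1):   # walk down from bit width-2
        if (x >> i) & 1 == sign:
            count += 1
        else:
            break
    return count

def normalize(x, width=16):
    """Exponent detect combined with the left shift, as a normalize
    instruction (or an exponent-detect/shift pair) would compute it."""
    e = exponent(x, width)
    return (x << e) & ((1 << width) - 1), e

# 0x0001 carries 14 redundant sign bits; normalized it becomes 0x4000.
```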
Bit manipulation instructions
Bit manipulation instructions are useful for both decision processing and error correction
coding. They can be broken down into two categories: single-bit manipulation instructions
and multibit (or bit-field) manipulation instructions.
Single-bit manipulation instructions include bit set, bit clear, bit toggle, and bit test. The
Motorola DSP5600x family was one of the first to feature instructions for these opera-
tions, although the instructions execute in two instruction cycles. Motorola processors also
offer branch-if-bit-set and branch-if-bit-clear instructions that are useful in control appli-
cations.
Bit-field instructions operate on several bits at once. For example, Texas Instruments'
TMS320C5x features instructions that use its parallel logic unit to perform a logical operation
(and, or, not, exclusive-or) on a specified memory location. Similarly, the Motorola
DSP561xx family features bit-field test, set, and clear instructions.
AT&T's DSP16xx processors have a special bit manipulation unit that provides the pro-
cessor with bit-field extract and replace (insert in AT&T parlance) instructions. The
former extracts a number of contiguous bits at a specified location within a word and
right-justifies them in a destination register. Similarly, the latter takes a set of right-justi-
fied bits and uses them to replace the bits at a given position in the destination word. This
can be very useful for "protocol processing" in communications applications, for example.
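Extract and replace reduce to shifts and masks. The sketch below models the behavior described above, with field positions given as a bit offset and size (which may not match the DSP16xx's actual operand encoding):

```python
def extract(word, pos, size):
    """Pull 'size' contiguous bits starting at bit 'pos' out of 'word'
    and right-justify them."""
    return (word >> pos) & ((1 << size) - 1)

def replace(word, field, pos, size, width=16):
    """Insert the right-justified 'field' into 'word' at bit 'pos',
    leaving the other bits untouched."""
    mask = ((1 << size) - 1) << pos
    return (word & (~mask & ((1 << width) - 1))) | ((field << pos) & mask)

# Pull a 4-bit field out of bits 8-11 of a word, then deposit it at
# bits 4-7 of another: a typical "protocol processing" manipulation.
field = extract(0xABCD, 8, 4)            # 0xB
packed = replace(0x0000, field, 4, 4)    # 0x00B0
```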
Other special function instructions
There are a wide variety of other special function instructions, some of which are summa-
rized here.
A number of processors provide an iterative division instruction that can be used to per-
form division one bit at a time.
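One common scheme is restoring division, which produces one quotient bit per iteration; executed n times (typically under a repeat instruction), it yields an n-bit quotient. This sketch models the idea, not any particular processor's divide step:

```python
def divide_iterative(num, den, bits=16):
    """Unsigned restoring division, one quotient bit per loop pass."""
    rem, quo = 0, 0
    for i in range(bits - 1, -1, -1):
        rem = (rem << 1) | ((num >> i) & 1)   # bring down the next bit
        quo <<= 1
        if rem >= den:                        # trial subtraction succeeds
            rem -= den
            quo |= 1
    return quo, rem

quo, rem = divide_iterative(50000, 7)    # 7142 remainder 6
```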
The Analog Devices ADSP-210xx and Texas Instruments TMS320C4x provide a square
root seed instruction, which can be used as a basis for finding the square root of a number.
Many processors provide specialized stack operations, push and pop being the most com-
mon.
Interrupt enable and disable instructions are used by some processors for control over
interrupts.
7.2 Registers
The main registers in a processor are closely coupled to its instruction set, since instruc-
tions typically specify registers as source or destination operands, or use them to generate
addresses for their source and destination operands. In the paragraphs below, we briefly review the
types and functions of registers found in programmable DSPs.
As with many processor features, an abundance of registers usually makes the processor
easier to program, but also increases the instruction width and die size of the processor, resulting
in a more expensive chip.
Accumulators
Every DSP processor on the market today has at least one accumulator register, although it
may be called by a different name on some processors. An accumulator is a register that is at least
wide enough to hold the largest ALU or multiplier result produced by the data path and that can be
used as a source or destination for arithmetic operations. Some processors, such as the Texas
Instruments TMS320C1x, TMS320C2x, and TMS320C5x families, are based on a single primary
accumulator. Others, like the AT&T DSP16xx, DSP Group PineDSPCore and OakDSPCore,
Motorola DSP5600x and DSP561xx, and Texas Instruments TMS320C54x provide two or more
accumulators, which simplifies coding of algorithms that use complex (as opposed to real) num-
bers. Some floating-point processors provide a large number of extended-precision registers,
some of which can be used as accumulators.
Address Registers
As discussed in Chapter 6 on addressing, address registers (and their associated offset and
modifier registers) are used to generate addresses for register-indirect addressing. The number of
address registers found on processors ranges from 2 (on the Texas Instruments TMS320C1x) to 22
(on the AT&T DSP32xx).
Many DSP programmers complain that their processors have too few address registers;
this is one area where more is usually better.
Other Registers
Other registers found on DSP processors include:
Stack pointer. Typically found on processors that support software stacks, a stack pointer
is a register dedicated to indicating the current top of stack location.
Program counter. The program counter holds the address of the next instruction to be
fetched.
Loop registers. These hold information regarding hardware loops, such as start and end
address and repetition count.
In most cases the programmer's interaction with these other registers is far more limited
than with the processor's main registers.
7.4 Orthogonality
Orthogonality refers to the extent to which a processor's instruction set is consistent. In
general, the more orthogonal an instruction set is, the easier the processor is to program. This is
because there are fewer inconsistencies and special cases that the programmer must remember.
Orthogonality is a subjective topic, and it does not lend itself easily to quantification. This should
be kept in mind when reviewing a critique of a processor's orthogonality.
The two major areas that most influence the perception of a processor's orthogonality are
the consistency and completeness of its instruction set and the degree to which operands and
addressing modes are uniformly available with different operations. As an example of the former,
most programmers would describe a processor with an add instruction but not a subtract instruc-
tion as nonorthogonal. As an example of the latter, a processor that allows register-indirect
addressing with its add instruction but not with its subtract instruction would also be considered
nonorthogonal.
Processors with larger instruction word widths tend to be more orthogonal than processors
with smaller instruction words. Fundamentally, this is because orthogonality results from inde-
pendent encodings of operations and operands within an instruction word. The more bits available
in the instruction word, the easier it is to design an orthogonal processor.
As an example, consider a hypothetical processor with 32 instructions that support three
operands (for example, "ADD R0,R1,R2", which might add R0 and R1 and place the result in
R2). Each operand can be one of eight registers (R0-R7). In addition, suppose the processor sup-
ports two data moves in parallel with all instructions, and that the processor has eight address reg-
isters that can be used to generate register-indirect addresses for these moves. The processor
might also support three address register update modes (no update, post-increment by one, and
post-decrement by one).
Table 7-1 shows the breakdown of bits in the instruction encoding. An instruction set
encoded in this manner would be quite orthogonal, since every aspect of the instruction set is
independently encoded. However, it also requires 30-bit instruction words. Although wider
instruction words can provide greater orthogonality and allow the programmer to control more
operations per instruction, they also increase required bus and memory widths, thus increasing
system cost. In addition, few applications will be able to take full advantage of the instruction
set's power and flexibility on every instruction. Every instruction executed that does not make use
of all the instruction set's power is an argument for a smaller, less orthogonal, harder to program,
but more memory-efficient instruction set.
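The 30-bit figure can be reproduced by summing the bits each independently encoded field needs. The particular split of the parallel-move fields below (address register, update mode, and data register per move) is our assumption; it is one plausible encoding that matches the total the text cites:

```python
from math import ceil, log2

fields = {
    "operation (32 choices)":            ceil(log2(32)),     # 5 bits
    "three operands, 8 registers each":  3 * ceil(log2(8)),  # 9 bits
    "2 moves: address register (of 8)":  2 * ceil(log2(8)),  # 6 bits
    "2 moves: update mode (of 3)":       2 * ceil(log2(3)),  # 4 bits
    "2 moves: data register (of 8)":     2 * ceil(log2(8)),  # 6 bits
}
total = sum(fields.values())             # 30-bit instruction word
```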
There are several approaches that are widely used for squeezing more functionality into a
smaller instruction word width on DSPs with 16-bit-wide instruction words:
Reduced number of operations. Supporting fewer operations frees instruction word bits
for other uses. For example, the AT&T DSP16xx does not have a rotation instruction.
TABLE 7-1. The Number of Bits Required for a Fully Independent Instruction Encoding
on a Hypothetical Processor
Reduced number of addressing modes. Processors with wider instruction word lengths
typically feature a rich variety of addressing modes that are available to all instructions.
Processors using smaller instruction words typically provide only a few addressing modes,
and the number of update modes in register-indirect addressing may be limited. Similarly,
the processor may also limit the allowable combinations of operations and addressing
modes.
Restrictions on source/destination operands. Motorola's DSP561xx family typifies this
solution. For example, the second parallel move in an instruction can only use a specific
address register (R3) for address generation. Similarly, the choice of operands in a multi-
ply instruction determines how many parallel reads or writes are allowed.
Use of mode bits. Texas Instruments uses this approach with its TMS320C1x and
TMS320C2x processors, and especially on the TMS320C5x. These processors use mode
bits or data stored in a variety of registers to partly determine what an instruction does. For
example, the TMS320C5x does not have separate arithmetic and logical shift operations.
Rather, a shift mode bit in a control register determines whether the single shift instruction
is arithmetic or logical. Similarly, the accumulator shift instruction takes its shift count
from a special register instead of from a shift count encoded in the instruction word.
Most of these options increase programming difficulty, but the narrower instruction word
width usually reduces overall processor and system cost.
Chapter 8
Execution Control
Execution control refers to the rules or mechanisms used in a processor for determining
the next instruction to execute. In this chapter, we focus on several features of execution control
that are important to DSP processors. Among these are hardware looping, interrupt handling,
stacks, and relative branch support.
Execution control is closely related to pipelining, which is discussed in detail in Chapter 9.
loop. This can result in considerable savings. For example, Figures 8-1(a) and 8-1(b) show an FIR
filter implemented in assembly language on two DSP processors, one with software looping and
one with hardware looping. The software loop takes roughly three times as long to execute,
assuming that all instructions execute in one instruction cycle. In fact, branch instructions usually
take several cycles to execute, so the hardware looping advantage is usually even larger.
Except for the Clarkspur Design CD24xx and CD245x cores, the IBM MDSP2780, the
Texas Instruments TMS320C1x, and the Zilog Z893xx, all DSPs provide some hardware looping
support. However, the exact form of this support may vary widely from one DSP to another. The
paragraphs below discuss a number of the more important hardware loop features.
FIGURE 8-1. An FIR filter kernel implemented using a software loop (a) and a
hardware loop (b). In the hardware loop, the RPT ("repeat") instruction is
executed only once, but automatically repeats the next instruction 16 times. If
each instruction takes one instruction cycle (a conservative estimate), the
software loop version takes roughly three times longer to execute than the
hardware loop version.
In contrast, a multi-instruction hardware loop must refetch the instructions in the block of
code being repeated each time the processor proceeds through the loop. Because of this, the pro-
cessor's program bus is not available to access other data. (One exception to this is the AT&T
DSP16xx, which provides a special 15-word buffer that is used to hold instructions being
repeated. On the first iteration of the loop, instructions are copied into this buffer. On subsequent
passes through the loop, the processor fetches instructions from the 15-word buffer, freeing the
program bus for data accesses.)
Most processors allow an arbitrary number of instructions to be repeated in a multi-
instruction hardware loop. Others, however, only allow a small number of instructions in a multi-
instruction loop. For example, the AT&T DSP16xx hardware loop supports a maximum of 15
instructions, and the AT&T DSP32C supports a maximum of 32 instructions. As we noted earlier,
most inner loops tend to be small, so such limitations are not unreasonable for many applications.
Nesting Depth
A nested loop is one loop placed within another. Most applications that benefit from hard-
ware looping need only a single hardware loop. In some applications, it can be both convenient
and efficient to nest one hardware loop inside another. For example, the fast Fourier transform
requires three nested loops.
The most common approaches to hardware loop nesting are:
Directly nestable. Some processors, including all Motorola DSPs, all Analog Devices
DSPs, and the NEC µPD7701x, allow hardware loops to be nested simply by using the
hardware loop instruction within the outer loop. Maximum nesting depths range from
three (the NEC µPD7701x) to seven (Motorola processors). Processors that feature
directly nestable hardware loops usually provide a separate hardware loop stack that holds
the loop start and end addresses and repetition count.
Partially nestable. Processors with both single- and multi-instruction hardware loops fre-
quently allow a single-instruction loop to be nested within a multi-instruction loop, even if
multi-instruction loops are not nestable themselves. Examples include PineDSPCore from
DSP Group and the Texas Instruments TMS320C3x, TMS320C4x, and TMS320C5x pro-
cessors.
Software nestable. On the Texas Instruments TMS320C3x and TMS320C5x, multi-
instruction hardware loops can be nested by saving the state of various looping registers
(loop start, loop end, and loop count) and then executing a new loop instruction. Depend-
ing on the repetition counts of the inner and outer loops, this may be a better approach
than using a branch instruction for the outer loop. (Recall that a branch instruction incurs
overhead every iteration through the loop, while saving and restoring loop registers needs
to be done only once for each pass through the outer loop.)
Nonnestable. A number of DSPs, including the Texas Instruments TMS320C2x and the
AT&T DSP16xx and DSP32xx, do not permit nested hardware loops at all.
8.2 Interrupts
An interrupt is an external event that causes the processor to stop executing its current pro-
gram and branch to a special block of code called an interrupt service routine. Typically, this code
deals with the source of the interrupt in some way (for example, reading the data that have just
arrived on a serial port and storing them in a buffer) and then returns from the interrupt, allowing the
processor to resume execution where it left off. All DSP processors support interrupts and most
use interrupts as their primary means of communicating with peripherals.
The sections below discuss aspects of interrupt processing and features found on DSPs to
support it.
Interrupt Sources
Interrupts can come from a variety of sources, including:
On-chip peripherals. Most processors provide a variety of on-chip peripherals (e.g.,
serial ports, parallel ports, timers, and so on) that generate interrupts when certain condi-
tions are met.
External interrupt lines. Most processors also provide one or more external interrupt
lines that can be asserted by external circuitry to interrupt the processor.
Software interrupts. Also called exceptions or traps, these are interrupts that are gener-
ated either under software control or due to a software-initiated operation. Examples
include illegal instruction traps and floating-point exceptions (division by zero, overflow,
underflow, and so on).
Both on-chip peripherals and external interrupt lines are discussed in more detail in Chap-
ter 10, "Peripherals."
Interrupt Vectors
As mentioned above, an interrupt causes the processor to begin execution from a certain
location in memory. Virtually all processors associate a different memory address with each inter-
rupt. These locations are called interrupt vectors, and processors that provide different locations
for each interrupt are said to support vectored interrupts. Vectored interrupts simplify program-
ming because the programmer does not have to be concerned with which interrupt source gener-
ated the interrupt; this information is implicitly contained in the interrupt vector being executed.
Without vectored interrupts, the programmer must check each possible interrupt source to see if it
caused the interrupt, increasing interrupt response time.
Typical interrupt vectors are one or two words long and are located in low memory. The
interrupt vector usually contains a branch or subroutine call instruction that causes the processor
to begin execution of the interrupt service routine located elsewhere in memory. On some proces-
sors, interrupt vector locations are spaced apart by several words. This allows brief interrupt ser-
vice routines to be located directly at the interrupt vector location, eliminating the overhead of
branch and return instructions. On other processors, the interrupt vector location does not actually
contain an instruction; rather, it contains only the address of the actual interrupt service routine.
On these processors, the branch to a different location is mandatory, increasing interrupt service
time.
Interrupt Enables
All processors provide mechanisms to globally disable interrupts. On some processors,
this may be via special interrupt enable and interrupt disable instructions, while on others it may
involve writing to a special control register. Most processors also provide individual interrupt
enables for each interrupt source. This allows the processor to selectively decide which interrupts
are allowed at any given time.
Interrupt Latency
Interrupt latency is the amount of time between an interrupt occurring and the processor
doing something in response to it. Interrupt latencies can vary significantly from one processor
family to another. Especially for real-time control applications, low interrupt latency may be very
important.
Unfortunately, it is often difficult to meaningfully compare interrupt latencies for different
processors when working from data books or data sheets. This is because processor vendors often
use vague, ill-defined, or contradictory definitions of interrupt latency (or the conditions under
which it is measured). Our formal definition of interrupt latency is the minimum time from the
assertion of an external interrupt line to the execution of the first word of the interrupt vector that
can be guaranteed under certain assumptions. This time is measured in instruction cycles. The
details of and assumptions used in this definition are as follows:
Most processors sample the status of external interrupt lines every instruction cycle. For
an interrupt to be recognized as occurring in a given instruction cycle, the interrupt line
must be asserted some amount of time prior to the start of the instruction cycle; this time
is referred to as the set-up time. Because the interrupting device has no way of guarantee-
ing that it will meet the processor's set-up time requirements, we assume that these set-
up time requirements are missed. This lengthens interrupt latency by one instruction
cycle.
Once an interrupt line has been sampled by the processor's clock, it is typically passed
through several stages of flip-flops to avoid metastability problems. This step is often
referred to as synchronization. Depending on the processor this can add from one to three
instruction cycles to the processor's interrupt latency.
At this stage the interrupt is typically recognized as valid and pending by the processor. If
the processor is not in an interruptible state, interrupt processing is delayed until the pro-
cessor enters an interruptible state. Examples of noninterruptible states include interrupts
being disabled under software control, the processor already servicing another interrupt,
the processor executing a single-instruction hardware loop that disables interrupts, the
processor executing a multi-instruction-cycle instruction, or the processor being delayed
due to wait states when accessing external memory. In our definition of interrupt latency,
we assume the processor is in an interruptible state, which typically means that it is exe-
cuting the shortest interruptible instruction possible.
The processor must next finish executing instructions that are already being processed in
the pipeline. We assume that these instructions are all instructions that execute in a single
instruction cycle. In parallel with this, the processor can begin to fetch and process the first
word of the interrupt vector. Our definition of interrupt latency includes the time for the
first word of the interrupt vector to percolate through to the execute stage of the pipeline.
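Under these assumptions, the worst-case latency is simply the sum of the component delays. The following sketch illustrates the arithmetic; the function and its parameters are illustrative, not taken from any data sheet:

```python
def interrupt_latency(sync_cycles, pipeline_fill_cycles,
                      vector_holds_address=False, branch_cycles=0):
    """Sum the latency components described above, in instruction cycles.

    sync_cycles: synchronization flip-flop stages (typically 1 to 3).
    pipeline_fill_cycles: cycles for the first interrupt-vector word to
        percolate through to the execute stage.
    vector_holds_address: True if the vector holds only an address, so the
        processor must also branch to the interrupt service routine.
    """
    missed_setup = 1  # assume the interrupting device misses the set-up time
    latency = missed_setup + sync_cycles + pipeline_fill_cycles
    if vector_holds_address:
        latency += branch_cycles  # implicit branch to the service routine
    return latency
```

For example, two synchronization stages and a three-stage pipeline fill give a six-cycle minimum latency under these assumptions.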
As described above, some processors do not actually store an instruction word in the interrupt
vector location. Rather, on these processors, the interrupt vector location holds only the address of
the interrupt service routine, and the processor automatically branches to this address. On such
processors, our definition of interrupt latency includes the time to branch to the interrupt service
routine. This time is not included for processors that contain instructions at the interrupt vector;
this is because the interrupt service routine can reside directly at the interrupt vector in some situ-
ations. In other situations, this is not possible, and the time required for the processor to execute a
branch instruction must be factored into the interrupt latency calculation as well.
Some processors, notably all Motorola processors and the AT&T DSP32C and DSP32xx,
provide fast or quick interrupts. On Motorola processors, an interrupt is automatically classified as
"fast" if the first two instructions at the interrupt vector location are not branch instructions. These
two words are inserted directly into the pipeline and executed without a branch, reducing interrupt
latency. Fast interrupts are usually used to move data from a peripheral to memory or vice versa.
Interrupt latency is closely tied to the depth and management of the processor's pipeline.
Chapter 9 discusses the interactions between interrupts and pipelines in detail.
8.3 Stacks
Processor stack support is closely tied to execution control. For example, subroutine calls
typically place their return address on the stack, while interrupts typically use the stack to save
both return address and status information (so that changes made by the interrupt service routine
to the processor status word do not affect the interrupted code).
DSP processors typically provide one of three kinds of stack support:
Shadow registers. Shadow registers are dedicated backup registers that hold the contents
of key processor registers during interrupt processing. Shadow registers can be thought of
as a one-deep stack for the shadowed registers.
Hardware stack. A hardware stack is a special block of on-chip memory, typically only a
few words long, that holds selected registers during interrupt processing or subroutine
calls.
Software stack. A software stack is a conventional stack using the processor's main mem-
ory to store values during interrupt processing or subroutine calls.
Of the three, shadow registers and software stacks are generally the most useful. Shadow
registers can be used to provide very low overhead interrupt handling, since they automatically
save key processor registers in a single cycle. The key advantage of a software stack over a hard-
ware stack is that its depth can be configured by the programmer simply by reserving an appropri-
ately sized section of memory. In contrast, hardware stacks are usually fairly shallow, and the
programmer must carefully guard against stack overflow.
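The software-stack behavior described above can be sketched with a toy model (names and structure are illustrative; real processors perform the save and restore in hardware during interrupt entry and return):

```python
class SoftwareStack:
    """Toy software stack: its depth is simply the size of the memory
    region the programmer reserves for it."""
    def __init__(self, depth):
        self.mem = [0] * depth   # reserved region of main memory
        self.sp = 0              # stack pointer

    def push(self, word):
        if self.sp >= len(self.mem):
            raise OverflowError("software stack overflow")
        self.mem[self.sp] = word
        self.sp += 1

    def pop(self):
        self.sp -= 1
        return self.mem[self.sp]

def enter_interrupt(stack, return_addr, status):
    """Save what an interrupt typically saves: return address and status."""
    stack.push(return_addr)
    stack.push(status)

def return_from_interrupt(stack):
    status = stack.pop()
    return_addr = stack.pop()
    return return_addr, status
```

Because the status word is saved along with the return address, changes the service routine makes to processor status do not affect the interrupted code; nested interrupts simply push additional frames until the reserved region is exhausted.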
Chapter 11
On-Chip Debugging Facilities
Debugging is one of the most difficult stages of the design process. This is especially true
for real-time embedded systems, where access to the components being debugged may be quite
limited, and where it may be difficult to run the system other than in real-time. As a result, DSP
processor features that facilitate debugging have become more important as systems have
increased in complexity.
One of the most important innovations in this area is scan-based in-circuit emulation
(ICE), which combines debugging circuitry on the processor with dedicated test/debug pins to
allow debugging of the DSP's operation while it is installed in the target system. Scan-based
debugging has become quite popular on DSP processors in recent years, although several other
debugging approaches (notably pod-based in-circuit emulation and monitor-based debugging) are
still in use.
This chapter focuses on scan-based on-chip debugging facilities. Other debugging tech-
niques, such as pod-based emulation, are discussed in Chapter 16, "Development Tools."
Chapter 9
Pipelining
Pipelining is a technique for increasing the performance of a processor (or other electronic
system) by breaking a sequence of operations into smaller pieces and executing these pieces in
parallel when possible, thereby decreasing the overall time required to complete the set of opera-
tions. Almost all DSP processors on the market today use pipelining to some extent.
Unfortunately, in the process of improving performance, pipelining frequently compli-
cates programming. For example, on some processors pipelining causes certain instruction
sequences to execute more slowly in some cases than in others. On other processors, certain
instruction sequences must be avoided for correct program operation. Thus, pipelining represents
a trade-off between efficiency and ease of use.
This chapter introduces pipeline concepts, provides examples of how they are used in DSP
processors, and touches on a number of issues related to pipelining.
[Figure 9-1: a three-stage pipeline (instruction fetch, decode, execute) processing instructions I1 and I2 over successive clock cycles.]

[Figure 9-2: perfect overlap. Instructions I1 through I8 flow through the fetch, decode, and execute stages, with a new instruction entering the pipeline every clock cycle.]
9.3 Interlocking
The execution sequence shown in Figure 9-2 is referred to as a perfect overlap, because
the pipeline phases mesh together perfectly and provide 100 percent utilization of the processor's
execution stages. In reality, processors may not perform as well as we have shown in our hypo-
thetical example. The most common reason for this is resource contention. For example, suppose
our hypothetical processor takes two instruction cycles (i.e., 40 ns) to write to memory. (This is
the case, for example, on AT&T DSP16xx processors.) If instruction I2 attempts to write to memory and instruction I3 needs to read from memory, the second instruction cycle in I2's data write
phase conflicts with instruction I3's data read. This is illustrated in Figure 9-3, where the extended
write cycle is indicated with a dark border, and the I2 write/I3 read conflict is shaded.
One solution to this problem is interlocking. An interlocking pipeline delays the progres-
sion of the latter of the conflicting instructions through the pipeline. In the example below,
instruction I3 would be held at the decode stage until the read/write stage was finished with
instruction I2's write. This is illustrated in Figure 9-4. Of course, holding instruction I3 at decode
implies that instruction I4 must be held at the fetch stage. (This may mean that I4 is actually
refetched from memory, or it may simply be held in a register.) A side effect of interlocking the
[Figure 9-3: pipeline diagram of the resource conflict. Instruction I2's data write occupies the memory port for two instruction cycles, so its second write cycle collides with instruction I3's data read in the data read/write stage.]
Instruction Cycle     1    2    3    4    5    6    7    8
Instruction Fetch     I1   I2   I3   I4   I4   I5   I6   I7
Decode                     I1   I2   I3   I3   I4   I5   I6
Data Read/Write                 I1   I2   I2   I3   I4   I5
Execute                              I1   I2   NOP  I3   I4
FIGURE 9-4. Use of interlocking to resolve the resource conflict in the previous
example. As before, instruction I2 requires two cycles for its read/write stage. The
pipeline sequencer holds instruction I3 at the decode stage and I4 at the fetch
stage, allowing I2's write to complete. This forces the execute stage to execute an
NOP at instruction cycle 6.
pipeline in this way is that there is no instruction to execute at instruction cycle 6, and an NOP is
executed instead. This results in a one-instruction cycle penalty when an interlock occurs.
A key observation from this is that an instruction's execution time on a processor with an
interlocking pipeline may vary depending on the instructions that surround it. For example, if
instruction I3 in our example did not need to read from data memory, then there would be no
conflict, and an interlock would not be necessary. This complicates code optimization on an
interlocking processor, because it may not be easy to spot interlocks by reading the program code.
It is not hard to envision other resource conflict scenarios. For example, some processors
support instructions with long immediate data, that is, data that are part of the instruction but that
do not fit in the instruction word itself. Such instructions require an additional program memory
fetch to get the immediate data. On many such processors, only one program memory access per
instruction cycle is possible. Thus, the long immediate data fetch conflicts with the fetch of the
next instruction, resulting in an interlock.
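The stall behavior described above can be modeled with a small sketch. This is a simplified four-stage pipeline (fetch, decode, data read/write, execute) with a single memory port, where a memory write occupies the port for two cycles, matching the hypothetical processor in the text; the function and its representation of instructions are illustrative:

```python
def execute_schedule(instrs):
    """Return the instruction cycle in which each instruction reaches the
    execute stage. Each instruction is a dict; 'read' or 'write' means it
    needs the single memory port in its read/write stage, and a write holds
    the port for two cycles, forcing an interlock on a following access."""
    mem_free = 0   # first cycle in which the memory port is free again
    rw = 2         # the first instruction's read/write stage is cycle 3
    cycles = []
    for ins in instrs:
        rw += 1
        if ins.get('read') or ins.get('write'):
            rw = max(rw, mem_free)                    # interlock: stall here
            mem_free = rw + (2 if ins.get('write') else 1)
        cycles.append(rw + 1)                         # execute follows r/w
    return cycles
```

For the sequence of Figure 9-4 (I2 writes, I3 reads), the model produces execute cycles 4, 5, 7, 8, 9: the gap at cycle 6 is the NOP slot. If I3 does not access memory, the schedule collapses back to one instruction per cycle, illustrating why an instruction's execution time depends on its neighbors.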
The above examples illustrate pipeline conflicts where interlocking is the only feasible
solution; failing to interlock in these examples would produce erroneous results. In other cases,
the lack of interlocking can bring about results that may not be erroneous, but may not be what the
programmer intended. We'll illustrate this with an example from the Motorola DSP5600x, a pro-
cessor that makes little use of interlocking.
Like most DSP processors, the DSP5600x provides address registers that are used for reg-
ister-indirect addressing. An interesting pipeline effect occurs in the following sequence of
instructions:
MOVE    #$1234,R0
MOVE    X:(R0),X0
The first instruction loads the hexadecimal address 1234 into address register R0. The second
instruction moves the contents of the X memory location pointed to by address register R0 into
register X0. A reasonable expectation would then be that the above instructions move the value
stored at X memory address 1234 into register X0. However, due to pipeline effects, the above
instructions do not execute in the way one might expect. Understanding why requires a brief
examination of the DSP5600x pipeline.
The Motorola DSP5600x uses a three-stage pipeline, made up of fetch, decode, and exe-
cute stages. ALU operations (e.g., add, multiply, etc.), data accesses (data memory reads and
writes, as in the above example), and register loads are carried out during the execute stage. How-
ever, addresses used in data accesses are formed during the decode stage. This creates a pipeline
hazard, a problem due to pipelining that the programmer must be aware of to obtain correct
results. In this example, the first MOVE instruction modifies an address register used in address
generation, but the modification is not reflected in the address used during the data access for the
second MOVE instruction.
Figure 9-5 illustrates this hazard in detail. Note that this figure is slightly different from
previous figures as it shows how processor resources (including memory and execution units) are
used over time. Such a figure is commonly called a reservation table.
Let's assume that at the start of the example R0 contains the hexadecimal value 5678. During instruction cycles 1 and 2, the first move instruction is fetched and decoded, and the second
[Figure 9-5: reservation table for the two MOVE instructions. In instruction cycle 3, while the first MOVE is still in the execute stage, the address generation unit forms the data address for the X:(R0) access using the old contents of R0 (i.e., 5678); R0 does not receive the value 1234 until the first MOVE completes execution.]
move instruction is fetched. During instruction cycle 3, the second move instruction is decoded.
At this time, the address generation unit generates an address for use in the data read of the second
move instruction, i.e., address 5678. The first move instruction is being executed, but its results
are not yet available to the address generation unit. As a result, during instruction cycle 4, the
processor reads from X memory address 5678, and register R0 now contains 1234. This is probably
not what the programmer intended. To avoid this problem, some other instruction (or even an
NOP) can be inserted between the two MOVE instructions to give the pipeline time to get the
correct value into R0.
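This hazard can be mimicked with a tiny software model. In the sketch below (purely illustrative; 'load' stands for the immediate MOVE and 'read' for the register-indirect MOVE), a register load does not become visible to address generation until one instruction later:

```python
def run(program, regs, memory):
    """Toy model of the DSP5600x hazard: the address for a 'read' is formed
    at decode, before the preceding 'load' has updated the register file.
    Instructions: ('load', reg, value), ('read', reg), or ('nop',)."""
    pending = None                # register write still in the pipeline
    out = []
    for op in program:
        if op[0] == 'read':
            # Address generation uses the OLD register value if the
            # preceding load has not yet reached the execute stage.
            out.append(memory[regs[op[1]]])
        if pending is not None:
            regs[pending[0]] = pending[1]   # previous load now completes
            pending = None
        if op[0] == 'load':
            pending = (op[1], op[2])
    if pending is not None:
        regs[pending[0]] = pending[1]
    return out
```

Running the two MOVE instructions back to back reads from the stale address 5678, while inserting a NOP between them yields the intended read from 1234.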
The Texas Instruments TMS320C3x, TMS320C4x, and TMS320C5x processors also face
this problem, but use interlocking to protect the programmer from the hazard. For example, the
TMS320C3x detects writes to any of its address registers and holds the progression through the
pipeline of other instructions that use any address register until the write has completed. This is
illustrated in Figure 9-6. In this example, an LDI (load immediate) instruction loads a value into
an address register, like the first move instruction in the DSP5600x example. The second instruc-
tion is an MPYF (floating-point multiply) that uses register-indirect addressing to fetch one of its
operands. Because the MPYF instruction uses one of the address registers, interlocking holds it at
the decode stage until the LDI instruction completes. This results in the processor executing two
NOP instructions.
Processors using interlocking in such situations save the programmer from worrying about
whether certain instruction sequences will produce correct output. On the other hand, interlocking
also allows the programmer to write slower-than-optimal code, perhaps without even realizing it.
This is a fundamental trade-off made by heavily interlocked processors.
Instruction Cycle     1     2     3    4    5    6    7    8
Instruction Fetch     LDI   MPYF  I3   I3   I3   I4   I5   I6
FIGURE 9-6. The Texas Instruments TMS320C3x approach to solving the pipeline
hazard in the previous example. The processor detects writes to address registers
and delays the execution of subsequent instructions that use address registers
until the write completes. The interlock can be seen during instruction cycles 4
and 5.
begin executing at a new address, the next sequential instruction word has already been fetched
and is in the pipeline. One possibility is to discard, or flush, the unwanted instruction and to cease
fetching new instructions until the branch takes effect. This is illustrated in Figure 9-7, where the
processor executes a branch (BR) to a new location. The disadvantage of this approach is that the
processor must then execute NOPs for the invalidated pipeline slots, causing branch instructions
to execute in multiple instruction cycles. Typically, a multicycle branch executes in a number of
instruction cycles equal to the depth of the pipeline, although some processors use tricks to exe-
cute the branch late in the decode phase, saving an instruction cycle.
Instruction Cycle     1    2    3    4    5    6    7    8
Instruction Fetch     BR   I2   -    -    N1   N2   N3   N4
Decode                     BR   -    -    -    N1   N2   N3
Data Read/Write                 BR   -    -    -    N1   N2
Execute                              BR   -    -    -    N1
FIGURE 9-7. The effect on the pipeline of a multicycle branch (BR) instruction.
When the processor realizes a branch is pending during the decode stage of
instruction cycle 2, it flushes the remainder of the pipeline and stops fetching
instructions. The flushed pipeline causes the branch instruction to execute in
four instruction cycles. The processor begins fetching instructions (N1-N4) at
the branch destination at cycle 5.
An alternative to the multicycle branch is the delayed branch, which does not flush the
pipeline. Instead, several instructions following the branch are executed normally, as shown in
Figure 9-8. A side effect of a delayed branch is that instructions that will be executed before the
branch instruction must be located in memory after the branch instruction itself, e.g.:
BRD  NEW_ADDR    ; Branch to new address.
INST2            ; These three instructions
INST3            ; are executed before
INST4            ; the branch occurs.
Delayed branches are so named because, to the programmer, the branch appears to be
delayed in its effect by several instruction cycles.
Almost all DSP processors use multicycle branches. Many also provide delayed branches,
including the Texas Instruments TMS320C3x, TMS320C4x, and TMS320C5x. The Analog
Devices ADSP-210xx, the AT&T DSP32C and DSP32xx, and the Zoran ZR3800x provide only
delayed branches.
As with interlocking, multicycle and delayed branches represent trade-offs between ease
of programming and efficiency. In the worst case, a programmer can always place NOP instruc-
105
DSP Processor Fundamentals: Architectures and Features
Instruction Cycle
1 2 3 4 5 6 7 8
Instruction Fetch SR 12 13 14 N1 N2 N3 N4
Decode SR 12 13 14 N1 N2 N3
Data Read/Write SR 12 13 14 N1 N2
Execute SR 12 13 14 N1
FIGURE 9-8. A delayed branch. Delayed branches do not flush the pipeline.
Instead, the instructions immediately following the branch are executed
normally. This increases efficiency, but results in code that is confusing upon
casual inspection.
tions after a delayed branch and achieve the same effect as a multicycle branch, but this requires
more attention on the programmer's part.
Branching effects occur whenever there is a change in program flow, and not just for
branch instructions. For example, subroutine call instructions, subroutine return instructions, and
return from interrupt instructions are all candidates for the pipeline effects described above. Pro-
cessors offering delayed branches frequently also offer delayed returns.
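As a back-of-the-envelope model of this trade-off (illustrative only, not taken from any vendor's cycle tables):

```python
def change_of_flow_cost(pipeline_depth, delayed=False, slots_filled=0):
    """Approximate cycle cost of a branch, call, or return.

    A multicycle branch costs roughly one cycle per pipeline stage, since
    the flushed slots execute as NOPs. A delayed branch costs one cycle
    plus one for each delay slot the programmer could not fill with
    useful work.
    """
    if not delayed:
        return pipeline_depth
    delay_slots = pipeline_depth - 1
    return 1 + max(delay_slots - slots_filled, 0)
```

With a four-stage pipeline, a multicycle branch costs four cycles; a delayed branch with all three slots filled costs one cycle; and a delayed branch padded entirely with NOPs costs the same four cycles as a multicycle branch, as noted above.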
Instruction Cycle     1    2    3    4     5    6    7    8    9    10
Instruction Fetch     I4   I5   I6   -     -    -    V1   V2   V3   V4
Decode                I3   I4   I5   INTR  -    -    -    V1   V2   V3
                                     ^ interrupt processing begins
FIGURE 9-9. TMS320C5x instruction pipeline while handling an interrupt. The
processor inserts the INTR instruction into the pipeline after the instructions
that were in the pipeline. INTR is a specialized branch that flushes the pipeline
and jumps to the appropriate interrupt vector location. The first word of the
interrupt vector is fetched in cycle 7 and executed in cycle 10.
First, the processor takes two instruction cycles after recognizing the interrupt to begin interrupt
processing. Second, the DSP5600x does not use an INTR instruction to cause the processor to
execute at the vector address. Instead, the processor simply begins fetching from the vector loca-
tion after the interrupt is recognized. However, at most two words are fetched starting at this
address. If one of the two words is a subroutine call (Figure 9-10), the processor flushes the previ-
ously fetched instruction and then branches to the long interrupt vector. If neither word is a sub-
routine call, the processor executes the two words and continues executing from the original
program, as shown in Figure 9-11. This second form of interrupt routine is called a fast interrupt.
Interrupts are discussed in more detail in Chapter 8.
Instruction Cycle     1    2    3    4    5    6    7
Decode                I2   I3   I4   JSR  -    V3   V4
Execute               I1   I2   I3   I4   JSR  -    V3
FIGURE 9-10. Motorola DSP5600x interrupt processing. The first word of the
interrupt vector (V1) is a subroutine call (JSR) instruction, and the second
word (V2) is not fetched. This causes a flush of the pipeline, resulting in the
loss of one instruction cycle. The interrupt service routine at the destination
JSR address is fetched in cycle 5 and executes in cycle 7.
Instruction Cycle     1    2    3    4    5    6    7
Instruction Fetch     I3   I4   V1   V2   I5   I6   I7
Decode                I2   I3   I4   V1   V2   I5   I6
Execute               I1   I2   I3   I4   V1   V2   I5

FIGURE 9-11. A Motorola DSP5600x fast interrupt. The two words at the
interrupt vector (V1 and V2) are executed in line, and the processor then
continues executing the original program.
Unlike the DSP16xx instruction, the data-stationary approach uses operands that refer to
memory directly. Instead of multiplying two registers, the instruction multiplies the contents of
two memory locations. The values of these memory locations are fetched and brought to the mul-
tiplier, but the programmer does not specify this operation directly as in the time-stationary case.
The processor schedules the sequence of memory accesses and carries them out without explicit
direction.
The data-stationary and time-stationary approaches each have their adherents. In general,
the data-stationary model is easier to read but not as flexible as the time-stationary approach.
Chapter 10
Peripherals
Most DSP processors provide a good selection of on-chip peripherals and peripheral inter-
faces. This allows the DSP to be used in an embedded system with a minimum amount of external
hardware to support its operation and interface it to the outside world. In this chapter we discuss
peripherals and peripheral interfaces commonly found on DSP processors.
Frame Synchronization
Synchronous serial interfaces use a third signal in addition to data and clock called the
frame synchronization, or frame sync, signal. This signal indicates to the receiver the position of
the first bit of a data word on the serial data line. Frame sync is known by a variety of names;
some manufacturers call it a "word sync" or "start of word" signal.
Two common formats for frame sync signals are bit length and word length. These names
refer to the duration of the frame sync relative to the duration of one data bit. A bit-length frame
FIGURE 10-1. Clock, data, and frame sync signals. In this example, data change on
the rising edge of the clock and are stable on the falling edge. The frame sync signal
indicates the start of a new word.
sync signal lasts one bit time and typically occurs one bit before the first transmitted bit of the data
word, although it can also occur simultaneously with the first transmitted bit of the data word. The
frame sync signal shown in Figure 10-1 is a bit-length frame sync. In contrast, a word-length
frame sync lasts the entire length of the data word. An example word-length frame sync is shown
in Figure 10-2. Most DSPs support bit-length frame sync signals, and a good number also support
word-length frame syncs.
An added complication is that some external devices may expect an inverted frame sync,
meaning a frame sync signal that is normally high but pulses low at the start of a data word. Rela-
tively few DSPs allow choice of frame sync polarity.
FIGURE 10-2. Word-length frame sync. In this example, no data are transmitted while
the frame sync is inactive; the receiver ignores the data signal during that time.
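The two formats can be sketched as sequences of sync-line samples, one per bit time (a simplified model: the sync is shown coincident with the first data bit, and idle_bits represents a gap between words as in Figure 10-2):

```python
def frame_sync_samples(word_bits, kind="bit", idle_bits=0):
    """Frame sync level for each bit time of one data word plus any idle gap.

    kind="bit": high for one bit time at the start of the word.
    kind="word": high for the entire length of the data word.
    """
    if kind == "bit":
        sync = [1] + [0] * (word_bits - 1)
    else:
        sync = [1] * word_bits
    return sync + [0] * idle_bits
```

An inverted frame sync, as mentioned below, would simply swap the 1s and 0s in these sequences.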
FIGURE 10-3. Two data words per frame sync. This example uses a bit-length frame
sync and 8-bit words.
may have independent clock and frame sync pins), while on others, the two may be ganged
together and use separate data pins but common clock and frame sync lines. The former arrange-
ment provides much greater flexibility, but this flexibility is not needed in all situations.
FIGURE 10-4. (a) Multiple DSP processors using TDM to share a serial line.
(b) Serial signals used in such an arrangement.
Figure 10-4(b) shows a typical set of serial signals that might be used in a TDM network.
One processor (or external circuitry) is responsible for generating the bit clock and frame sync
signals. The frame sync signal is used to indicate the start of a new set of time slots. After the
frame sync, each processor must keep track of the current time slot number and transmit only dur-
ing its assigned slot. A transmitted data word (16 bits in the figure) might contain some number of
bits to indicate the destination DSP (e.g., two bits for four processors) with the remainder contain-
ing data. Another approach uses a secondary data line to transmit the source and destination
address in parallel with the data word.
At a minimum, TDM support requires that the processor be able to place its serial port
transmit data pin in a high-impedance state when the processor is not transmitting. This allows
other DSPs to transmit data during their time slots without interference. Some DSP serial ports
have additional support, such as a register that causes the serial port to transmit during a particular
time slot. Without this support, the processor must receive all data sent to all other processors and
throw away the data not addressed to it in order to know when it is its turn to transmit.
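The per-slot transmit decision can be sketched as follows. This is a hypothetical model of the logic, not any vendor's API; None stands for a high-impedance data pin:

```python
class TdmPort:
    """Toy model of a DSP's transmit logic on a shared TDM serial line."""
    def __init__(self, my_slot, num_slots):
        self.my_slot = my_slot     # time slot assigned to this processor
        self.num_slots = num_slots
        self.slot = 0

    def on_frame_sync(self):
        # The frame sync marks the start of a new set of time slots.
        self.slot = 0

    def next_slot(self, word):
        """Return the word to drive during the current slot, or None to
        leave the data pin in a high-impedance state."""
        out = word if self.slot == self.my_slot else None
        self.slot = (self.slot + 1) % self.num_slots
        return out
```

A processor without hardware slot support must implement exactly this bookkeeping in software, counting slots from each frame sync.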
Companding Support
Synchronous serial ports are frequently used to interface with off-chip codecs, which are
special-purpose A/D and D/A converters used in telephony and low-fidelity voiceband applica-
tions. Codecs usually use eight-bit data words representing compressed 14- or 16-bit samples of
audio data. The compression scheme is typically based on one of two standards, µ-law (in the
United States) or A-law (in Europe). The process of compressing and expanding the data is called
companding and is usually done in software via lookup tables. Some DSPs, notably the Analog
Devices ADSP-21xx family and the Motorola DSP56156, provide companding support built into
their serial interfaces. This feature can save both execution time and memory by relieving the pro-
cessor of the table-lookup functions. Interestingly, the AT&T DSP32C and DSP32xx have these
format conversions built into their data paths.
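As a sketch of what such software companding computes, here is a simplified µ-law encoder and decoder for 14-bit samples, based on the common G.711-style bias-and-segment scheme (details such as the exact clipping limit and bias vary between implementations, so treat this as illustrative):

```python
BIAS = 33  # common bias value for this style of mu-law companding

def mulaw_encode(sample):
    """Compress a 14-bit signed sample into an 8-bit mu-law code."""
    sign = 0x80 if sample < 0 else 0x00
    mag = min(abs(sample) + BIAS, 0x1FFF)      # bias and clip to top segment
    exponent, mask = 7, 0x1000
    while not (mag & mask) and exponent > 0:   # locate the segment
        exponent -= 1
        mask >>= 1
    mantissa = (mag >> (exponent + 1)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # codes are inverted

def mulaw_decode(code):
    """Expand an 8-bit mu-law code back to a (quantized) 14-bit sample."""
    code = ~code & 0xFF
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    mag = (((mantissa << 1) + BIAS) << exponent) - BIAS
    return -mag if sign else mag
```

The decoder recovers only a quantized approximation of the original sample; the quantization step grows with the segment number, which is what gives µ-law its logarithmic character.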
10.2 Timers
Almost all DSPs provide programmable timers. These are often used as a source of peri-
odic interrupts (e.g., as a "clock tick" for a real-time operating system), but other timer applica-
tions are possible as well. For example, some DSPs include an output pin that provides a square
wave at the frequency generated by the timer. Because this frequency is under software control,
this output can be thought of as a software-controlled oscillator, useful in implementing phase-
locked loops.
Fundamentally, a timer is much like a serial port bit clock generator: it consists of a clock
source, a prescaler, and a counter, as shown in Figure 10-5. The clock source is usually the DSP's
master clock, but some DSPs allow an external signal to be used as the clock source as well.
The purpose of the prescaler is to reduce the frequency of the clock source so that the
counter can count longer periods of time. It does this by dividing the source clock frequency by
one of several selectable values. For example, in the Analog Devices ADSP-21xx family, the pre-
scaler uses an 8-bit value, dividing the input clock by a value from 1 to 256. In the AT&T
DSP16xx family, the prescaler uses a 4-bit value N to divide the input clock by 2(N+1), i.e., by
powers of 2 from 2 through 65,536.
The counter then uses this prescaled signal as a clock source, typically counting down
from a preloaded value on every rising clock edge. The counter usually interrupts the processor
upon reaching zero. At this point, the counter may reload with the previously loaded value, or it
may stop, depending on the configuration of the timer.
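The resulting interrupt rate is simple arithmetic. The sketch below uses a generic formula; exact reload semantics and off-by-one details differ between processors:

```python
def timer_interrupt_hz(source_hz, prescale, count):
    """Interrupt rate of the generic timer described above: the prescaler
    divides the clock source by 'prescale', and the counter counts down
    'count' prescaled ticks before interrupting."""
    return source_hz / (prescale * count)

def dsp16xx_prescale(n):
    """AT&T DSP16xx-style prescaler: a 4-bit value N selects division by
    2**(N+1), i.e., powers of 2 from 2 through 65,536."""
    assert 0 <= n <= 15
    return 2 ** (n + 1)
```

For example, a 40 MHz clock source with a prescale of 100 and a counter preload of 400 yields a 1 kHz periodic interrupt.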
[Figure 10-5: timer block diagram. A clock source drives a prescaler, which in turn drives a counter; each has a preload value, and the counter interrupts the processor upon reaching zero.]
Most counter registers on DSP chips are 16 bits wide, which provides a fairly large range
of interrupt frequencies that can be generated. On most chips, the user can read the value stored in
the counter register. This allows the DSP to read the counter register at the start and end of some
event and thus calculate the event's duration.
In the Motorola DSP56000 and DSP56001, the timer is shared with the asynchronous
serial clock generator. To use both the timer and the serial clock generator on these DSPs, the user
must select a common frequency that is acceptable for both applications.
As mentioned above, some DSPs optionally make the clock signal generated by the timer
available on an output pin. This can be useful either for a software-controlled phase-locked loop
or simply as a variable frequency synthesizer. Most DSPs, however, have timers that are only
capable of generating interrupts and do not have output pins.
Chapter 12
Power Consumption and Management
The number of portable DSP applications has increased dramatically in recent years. DSPs
are now commonly found in portable devices such as cellular telephones, pagers, personal digital
assistants, laptop computers, and consumer audio gear. All of these applications are battery pow-
ered, and battery life is a key product differentiator in all of them. As a result, designers are con-
stantly looking for ways to reduce power consumption.
DSP vendors have responded to this challenge in multiple ways. First, almost all DSP
manufacturers have introduced low-voltage DSPs, capable of operation at nominal supply volt-
ages of 3.0 or 3.3 V. Second, many vendors have added power management features that reduce
power consumption under software or hardware control. Both of these approaches are discussed
below.
"3.0 V," does it mean that its nominal supply voltage is 3.0 V, or 3.3 V? For consistency, through-
out this book we use nominal voltages in all cases.
In some cases, low-voltage DSPs are actually 5 V parts that are able to run at a lower volt-
age. In such cases the system clock must be reduced to permit operation at a lower voltage. DSP
vendors have also begun to introduce "true" 3 V versions of their DSPs that are able to run at full
speed at 3.0 or 3.3 V.
DSPs available at low voltages include the AT&T DSP16xx and DSP32xx families, the
Analog Devices ADSP-2103, ADSP-2173, ADSP-2183, and ADSP-2106x; the DSP Group
PineDSPCore and OakDSPCore; the IBM MDSP2780; the Motorola DSP56L002 and
DSP563xx; the NEC µPD77015, µPD77017, and µPD77018; the SGS-Thomson D950-CORE;
and the Texas Instruments TMS320LC3x, TMS320LC5x, TMS320VC54x, and TMS320C80.
tor in combination with an external crystal to generate a clock signal. In this case, if the sleep mode
disables the internal oscillator then a wake-up latency of thousands (or even tens of thousands) of
instruction cycles may be necessary while the oscillator stabilizes. A similar problem occurs with
phase-locked loop (PLL) on-chip clock generators: turning off the PLL reduces power consump-
tion, but the PLL may require a significant amount of time to lock when it is reenabled.
In a further effort to reduce power consumption, some processors (for example, the Ana-
log Devices ADSP-2171 and the Texas Instruments TMS320LC31 and TMS320C32) provide
slow-clock idle modes. These modes reduce power by slowing down the processor's master clock
before going to sleep, but they usually also increase wake-up latency.
Chapter 13
Clocking
DSP processors, like all synchronous electronic systems, use clock signals to sequence
their operations. A clock signal is a square wave at some known frequency; the rising and falling
edges of the waveform are used to synchronize operations within the processor.
The highest frequency clock signal found on a processor is called the master clock. Typi-
cal DSP processor master clock frequencies range from 10 to 100 MHz. Usually, all other lower-
frequency clock signals used in the processor are derived from the master clock. In some proces-
sors, the frequency of the master clock may be the same as the instruction execution rate of the
processor; such processors are said to have a 1X clock. On other processors, the master clock fre-
quency may be two or four times higher than the instruction execution rate. That is, multiple clock
cycles are required for each instruction cycle. These processors are said to have a 2X or 4X clock.
This illustrates the point that clock rates are not equal to instruction execution (MIPS) rates. To
avoid confusion, we generally mention clock rates in this book only when discussing input clock
requirements and use MIPS in all other circumstances.
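The distinction can be made concrete with a small calculation (a sketch; the 40 MHz figure is illustrative, not taken from any particular chip):

```python
def mips(master_clock_hz, clocks_per_instruction):
    """Instruction execution rate (in MIPS) implied by a master clock.

    clocks_per_instruction is 1 for a 1X-clock processor, 2 for a 2X
    clock, and 4 for a 4X clock.
    """
    return master_clock_hz / clocks_per_instruction / 1e6

# The same 40 MHz master clock yields very different instruction rates:
print(mips(40e6, 1))  # 1X clock: 40.0 MIPS
print(mips(40e6, 2))  # 2X clock: 20.0 MIPS
print(mips(40e6, 4))  # 4X clock: 10.0 MIPS
```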
A processor's master clock typically comes from either an externally supplied clock signal
or from an external crystal. If an external crystal is used, the DSP processor must supply an on-
chip oscillator to make the crystal oscillate. DSP processors that do not have on-chip oscillators
must use an externally-generated clock signal. If an external clock at the right frequency is avail-
able, this is not a problem. If an appropriate clock signal is not available, however, the designer
must choose between generating a clock signal externally and using an external crystal with the
on-chip oscillator; the crystal/oscillator approach is usually cheaper and saves both board space
and power.
A number of DSP processors now have on-chip frequency synthesizers (also called phase-
locked loops or PLLs) that produce a full-speed master clock from a lower-frequency input clock
signal. On some processors, such as the Analog Devices ADSP-2171 and some members of the
Texas Instruments TMS320C5x family, the input frequency must be one-half of the desired mas-
ter clock frequency; these chips are said to have on-chip clock doublers. Other processors are
slightly more flexible in the input frequencies they can handle. For example, the Texas Instru-
ments TMS320C541 PLL can generate a full-speed clock from an input clock that is 1/3, 1/2, 2/3,
1, or 2 times the desired instruction execution rate.
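To see what that flexibility buys, the usable input clock frequencies for a given instruction rate can be enumerated directly from the ratios quoted above (a sketch; the 40 MIPS target is only an example):

```python
from fractions import Fraction

# Input-clock-to-instruction-rate ratios quoted in the text for the
# TMS320C541-style PLL.
RATIOS = (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3),
          Fraction(1, 1), Fraction(2, 1))

def usable_input_clocks_mhz(instruction_rate_mhz):
    """Input clock frequencies (MHz) from which the PLL can synthesize
    a full-speed clock for the desired instruction execution rate."""
    return [round(float(r) * instruction_rate_mhz, 2) for r in RATIOS]

# For a 40 MIPS target:
print(usable_input_clocks_mhz(40))  # [13.33, 20.0, 26.67, 40.0, 80.0]
```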
A few processors provide extremely flexible clock generators. For example, on some
members of the Motorola DSP5600x family and on the Motorola DSP561xx, the frequency syn-
thesizer can generate a full-speed master clock from a very-low-frequency input clock. As an
example, the Motorola DSP56002's frequency synthesizer can generate a 40 MHz master clock
from a roughly 10 kHz input clock.
On-chip frequency synthesizers not only simplify designs and potentially reduce costs, but
they can also reduce electromagnetic interference generated by high-speed external clock signals.
Chapter 14
Price and Packaging
Depending on the application, the importance of a DSP processor's cost may range from
moderately important to critically important. In some applications, the DSP may represent only a
tiny fraction of the overall system cost; in others, the DSP may comprise a large portion of the
system cost. Some applications, such as those in consumer electronics and the personal computer
industry, bring intense market pressure to cut costs. Product shipment volumes in such applica-
tions may be high enough that even minor differences in processor price make a significant differ-
ence in product profitability.
In this chapter, we examine pricing for several processors and discuss the types of IC
packages that are frequently used for DSP processors.
TABLE 14-1. Representative Unit Prices of DSPs (as of June 1995) Purchased in
Quantities of 1,000. Package Nomenclature Is Explained in the Text. [Table not
reproduced here.]
Many processors are also available in versions other than those listed in the table; the
prices of these other versions may be considerably different.
14.2 Packaging
Integrated circuits can be mounted in a variety of packages. Package options for some
DSPs are shown in Table 14-2. The package option chosen can have a strong impact on an inte-
grated circuit's price. As an example, a Texas Instruments TMS320C30 processor in a PGA pack-
age is almost twice as expensive as the same processor in a PQFP package.
In general, PQFP packaging is the cheapest for high pin-count devices, while PGA is the
most expensive. CQFP devices are more expensive than their plastic counterparts, but are able to
dissipate heat more effectively. PLCC and DIP packages are inexpensive, but are only suitable for
devices with low pin counts.
Chapter 15
Fabrication Details
While it is often a secondary consideration, information about the fabrication process used
to manufacture a DSP processor can provide useful insights for designers evaluating DSPs. The
two most basic metrics that characterize fabrication processes for digital integrated circuits are
feature size and operating voltage. In this chapter, we explore these issues and others relating to
the fabrication process.
For example, if a given processor is only moderately fast, but is fabricated in an older, less aggressive
process, then there is a good chance that substantial improvements in the processor's performance
can be made by moving it to a smaller fabrication process if one is available to the manufacturer.
Chapter 16
Development Tools
No matter how powerful a processor may be, a system designer requires good develop-
ment tools to efficiently tap the processor's capabilities. Because the quality and sophistication of
development tools for DSP processors vary significantly among processors, development tool
considerations are an important part of choosing a DSP processor.
A detailed evaluation of the capabilities of DSP development tools is beyond the scope of
this book. A separate report, DSP Design Tools and Methodologies [BDT95], provides in-depth
analyses of these tools, related high-level design environments, and various design approaches.
Here, we limit our discussion to reviewing the types of development tools critical to DSP proces-
sor-based product development and the key capabilities required of each type of tool. The material
in this chapter is adapted from [BDT95].
[Figure: the DSP development tool flow, from Assembly Code through Object Code to a
Binary Executable running in the Final Product, with an In-Circuit Emulator and
Debugger attached. Figure not reproduced here.]
Assembly language tools are usually available from the processor
vendor; third parties also provide assembly language development tools. In fact, in some cases,
DSP processor vendors rely on third-party firms to develop their assembly language tools rather
than developing them in-house.
All DSP processor vendors provide a set of basic assembly language tools for each of their
processors, although the quality and completeness of the tools vary. Texas Instruments' DSPs
have historically had the most extensive selection of assembly language tools, including numer-
ous third-party products. However, other processor vendors have recently begun to offer more
sophisticated basic tools in some cases. For example, Analog Devices, AT&T, and Motorola have
all demonstrated or are beta-testing Microsoft Windows-based DSP development tools with
impressive features. One vendor, NEC, even offers an integrated development environment (IDE)
that integrates a language-sensitive text editor and "makefile" utility along with the standard com-
plement of assembler, linker, simulator, and debugger. The integration of these tools allows some
very useful features, such as automatic highlighting of source code statements that cause assem-
bler errors. Though IDEs have been offered for years for general-purpose microprocessors, they
have only very recently begun to emerge for DSPs. This illustrates one of the stubborn paradoxes
of DSP processors: good software tools for DSP processors are essential because of the need to
optimize software, but DSP processor software tools generally lag behind their general-purpose
processor counterparts in sophistication and quality.
It should be noted that assembly language tools are often also an integral part of higher-
level tools. For example, assemblers, linkers, and assembly language libraries are often used by
high-level language tools such as C compilers or block-diagram-based code generation systems.
Assemblers
An assembler translates processor-specific assembly language source code files (ASCII
text) into binary object code files for a particular target processor. Usually this object code
requires the additional steps of linking and relocation to transform it into a binary executable file
that can be run on the target processor.
Assemblers are among the most basic DSP software development tools; they are also
among the most important, since a vast amount of software for DSP processors is developed in
assembly language to optimize code speed and size. Below, we describe the most important fea-
tures of DSP processor assemblers.
Most assemblers for programmable DSP chips are macro assemblers, meaning that the
programmer can define (or use predefined) parameterized blocks of code that will be expanded in-
line in the final program. Macros are an important feature, since they allow the programmer to
reduce the amount of source code that must be maintained, while avoiding the overhead of subrou-
tine calls. Figure 16-2 illustrates the definition and use of a typical assembler macro.
Another useful assembler feature is conditional assembly: the ability to conditionally
control whether certain blocks of code are assembled, depending on the value of an assembler
meta-variable. This can be useful for maintaining multiple, slightly different versions of a single
program without having to create multiple copies of the source code. For example, an audio
encoding application might use conditional assembly based on the value of a variable to decide
whether to assemble code for monaural or stereo processing.
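The behavior of conditional assembly can be sketched as a tiny source-filtering pass; the `.if`/`.endif` directive names and the instruction mnemonics below are illustrative, not any vendor's actual syntax:

```python
def conditional_assemble(lines, symbols):
    """Keep source lines inside '.if NAME' ... '.endif' blocks only when
    the assembler meta-variable NAME is nonzero; other lines pass through."""
    out, keep = [], [True]
    for line in lines:
        tok = line.split()
        if tok and tok[0] == ".if":
            keep.append(keep[-1] and bool(symbols.get(tok[1], 0)))
        elif tok and tok[0] == ".endif":
            keep.pop()
        elif keep[-1]:
            out.append(line)
    return out

source = [
    "    mac_left",      # always assembled
    ".if STEREO",
    "    mac_right",     # assembled only for the stereo build
    ".endif",
]
print(conditional_assemble(source, {"STEREO": 0}))  # mono: ['    mac_left']
print(conditional_assemble(source, {"STEREO": 1}))  # stereo: both lines kept
```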
FIGURE 16-2. (a) Macro definition for an FIR filter for the Motorola DSP5600x.
(b) Example usage of the macro. [Code listings not reproduced here.]
COFF (common object file format) is a standard format for object code files and is sup-
ported by many assemblers. COFF allows the annotation of object files with debugging informa-
tion, such as pointers to the lines of source code that correspond to each machine instruction. The
use of COFF can also simplify the integration of third-party tools (such as debuggers) with a pro-
cessor vendor's assembly language development tools.
For more details on assembler features and evaluation criteria, please consult DSP Design
Tools and Methodologies [BDT95].
Linkers
Linkers are used to combine multiple object files and libraries into an executable program.
To do this, linkers must relocate code; that is, fix the addresses at which various code fragments in
object files and libraries are to reside when they are executed. With some processors, a linker may
not be needed for small applications, but for most serious applications it is a necessity.
Because linkers are used to bring together many parts of an application program, they
must be flexible enough to accommodate the requirements of different object files and memory
configurations.
Most linkers for digital signal processors are based on the concept of a memory map,
which tells the linker which segments of memory (and which memory spaces) to use for each sec-
tion of program code and data. Because different system hardware designs and applications
require different memory maps, the user must be able to specify these maps. Linkers provide var-
ious ways of describing the desired memory maps. This process is more straightforward with
some linkers than with others.
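The placement work that a memory map drives can be sketched as a simple cursor-per-region allocator; the region and section names below are invented, and real linker command files express the same information declaratively in vendor-specific syntax:

```python
def place_sections(sections, regions):
    """Assign each section a start address in its requested memory region,
    advancing a per-region cursor and checking that nothing overflows.
    regions: name -> (origin, length); sections: (name, region, size)."""
    cursor = {name: origin for name, (origin, length) in regions.items()}
    placed = {}
    for sec, region, size in sections:
        origin, length = regions[region]
        addr = cursor[region]
        if addr + size > origin + length:
            raise ValueError(f"section {sec} overflows region {region}")
        placed[sec] = addr
        cursor[region] = addr + size
    return placed

regions = {"PROG": (0x0000, 0x1000), "DATA": (0x4000, 0x0800)}
sections = [("filter_code", "PROG", 0x300),
            ("coefficients", "DATA", 0x100),
            ("delay_line", "DATA", 0x100)]
print({s: hex(a) for s, a in place_sections(sections, regions).items()})
# {'filter_code': '0x0', 'coefficients': '0x4000', 'delay_line': '0x4100'}
```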
When a vendor's assembler generates COFF object files, the linker must support this for-
mat as well, since the linker processes object files produced by the assembler.
When debugging or analyzing a program, users frequently need to display the addresses of
symbols used in both object and executable files. This information is stored in a structure called a
symbol table. Some linkers are able to produce an ASCII representation of the symbol table at the
end of the linking process, while others rely on separate utility programs to do this.
In linker parlance, a library is a group of object code files bundled into a single file.
Libraries are sometimes called "archives." Libraries are a convenient way to manage large groups
of commonly used subroutines or functions while avoiding the need for a large number of sepa-
rate files. Virtually all linkers allow programs to be linked with code extracted from libraries.
Users frequently want to produce their own libraries. This is typically done with a separate
program, called a librarian or archiver, which bundles object files into a single library. Surpris-
ingly, not all DSP processor vendors give their users the ability to create libraries.
with the host processor. Nevertheless, the lack of an instruction set simulator can be a hindrance
when chip samples are not available.
There are wide differences in the capabilities of instruction set simulators for different
DSPs. Because there is usually only one instruction set simulator available for a particular proces-
sor, the choice of a processor equates to the choice of an instruction set simulator. Users should
therefore carefully consider their simulation needs before choosing a processor.
All instruction set simulators provide the user with the ability to single-step through a pro-
gram and to view and edit the contents of registers and memory. The main factors differentiating
instruction set simulators from one another are their accuracy, speed, completeness, and support
for debugging and optimization. Each of these is discussed below.
• Accuracy. When evaluating an instruction set simulator, two types of accuracy are of
interest: functional accuracy and timing accuracy.
Functional accuracy refers to the fidelity with which a simulator models the functionality
of the processor. That is, the likelihood that a certain sequence of instructions operating on
a certain set of data will yield the same results on the simulator as on the actual chip. The
functional accuracy of instruction set simulators is sometimes less than perfect. Therefore,
after relying on an instruction set simulator for software verification, developers are well
advised to recheck the functionality and timing of their application using an emulator or
development board (discussed below). Unfortunately, the most common means of learning
about inaccuracies in instruction set simulators is by discovering them oneself. Some ven-
dors do make an effort to track these inaccuracies and inform users of them.
Timing accuracy refers to the fidelity with which the simulator models the time relation-
ships of certain operations within the chip. On many processors some instructions take
more than one instruction cycle to execute, and instructions may take varying amounts of
time to execute depending on factors such as pipeline interlocks or internal memory bus
contention. Some instruction set simulators ignore these effects, while others track timing
to the level of one instruction cycle or even a fraction of an instruction cycle. Generally
speaking, a simulator with better timing resolution is preferable, but there is a trade-off
between timing accuracy and simulator execution speed. Even when a simulator attempts
to model time accurately, it may not always succeed.
• Speed. Naturally, instruction set simulators are much slower than the processors they sim-
ulate. A typical DSP processor instruction set simulator executes on the order of thousands
of instructions per second on typical PCs and workstations. We have, however, found
speed variations of up to a factor of 5 among different instruction set simulators.
• Completeness. Some instruction set simulators model only the core processor itself and
ignore supporting hardware units such as on-chip timers, I/O interfaces, and peripherals,
or simulate these functions very crudely. Others model the behavior of these functions
quite accurately. If an instruction set simulator does not model these kinds of on-chip
functions, it can be difficult to meaningfully simulate a complete application.
However, some simulators provide interfaces to allow the user to add functionality to the
simulator, and this can offset the lack of accurate modeling of some on-chip functions.
Extensibility is discussed below.
Those simulators that do provide some modeling of I/O interfaces generally use files on
the host system as the sources and destinations for data arriving at the processor's inputs
and being sent from its outputs.
• Debugging and optimization support. The user generally interacts with an instruction
set simulator through a front-end program called a debugger. Often, the same debugger
can be used with the instruction set simulator and with the emulator for a given processor.
Sometimes the same debugger also works with development boards. The debugger pro-
vides much of the debugging and optimization functionality for a simulator or emulator.
Thus, the quality of the debugger has a strong impact on the usefulness of an instruction
set simulator. Debuggers are discussed below. Note that when a processor vendor provides
a high-level language compiler for a DSP, the simulator often provides support for debug-
ging high-level language programs as well as assembly programs.
Key debugging and optimization features provided by instruction set simulators include
data display and editing, breakpoints, symbolic and source-level debugging, single-step-
ping, profiling, and cycle counting. Of these, data display and editing, single-stepping, and
symbolic and source-level debugging are primarily features of the debugger rather than the
simulator. As such, these features are discussed below in the section on debuggers. Break-
points, profiling, and cycle counting are discussed briefly here.
All instruction set simulators provide breakpoints: user-specified conditions which cause
the simulation to be halted. Some simulators can handle only simple breakpoints, which
cause the simulator to halt when a particular program location is reached. Others provide
very sophisticated breakpoints, which can be conditioned on a complex arithmetic expres-
sion involving memory and register contents. This type of capability can be extremely use-
ful for understanding obscure program bugs.
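Mechanically, such a breakpoint amounts to evaluating a user-supplied predicate over the simulated registers and memory after every instruction; a minimal sketch (the register name `a0` and the toy step function are invented):

```python
def run_until_break(step, state, condition, max_steps=100000):
    """Single-step the simulated processor, evaluating a complex
    breakpoint expression after each instruction; halt when it is true."""
    steps = 0
    while not condition(state) and steps < max_steps:
        step(state)
        steps += 1
    return steps

# Toy machine: each step increments register a0 and accumulates into mem[0].
state = {"a0": 0, "mem": [0]}
def step(s):
    s["a0"] += 1
    s["mem"][0] += s["a0"]

# Break when an arithmetic expression over register and memory contents holds:
halted_after = run_until_break(step, state, lambda s: s["mem"][0] > 20)
print(halted_after, state["a0"], state["mem"][0])  # 6 6 21
```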
Profiling support helps the developer ascertain the amount of time a program spends in
different sections of code. Profiling data are usually collected either on a per-instruction
basis or on a per-region basis. In per-instruction profiling, the debugger keeps tabs on the
number of executions of each instruction in the program. In per-region profiling, the
debugger tracks the amount of time spent in various regions of the program. These regions
may be defined to correspond to functions or subroutines in the user's program. As most
DSP applications are extremely performance-sensitive, profiling is an important capabil-
ity. Perhaps surprisingly, good profiling support is found in only a few instruction set sim-
ulators, such as those from Texas Instruments and AT&T.
Cycle counting allows the user to obtain a count of the number of instruction cycles exe-
cuted between any two points in a program. This can be quite useful for determining the
performance of an algorithm, as well as for determining average and/or worst-case execu-
tion time of a subroutine. Cycle counting is especially handy for understanding program
behavior when complex pipeline dependencies cause some instructions to require extra
cycles for execution.
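Per-instruction profiling and cycle counting can be sketched together in a miniature simulator loop; the opcodes, the `djnz` loop instruction, and the per-opcode cycle costs below are invented for illustration:

```python
from collections import Counter

CYCLES = {"mac": 1, "move": 1, "djnz": 3}  # invented per-opcode costs

def profile(program, max_steps=1000):
    """Run a toy program, counting executions of each address and
    totalling instruction cycles, as a per-instruction profiler might.
    Instructions are (opcode, operand) pairs; 'djnz' decrements a loop
    counter and jumps to its operand address while the counter is > 0."""
    counts, total_cycles, pc, loop_counter = Counter(), 0, 0, 4
    while pc < len(program) and sum(counts.values()) < max_steps:
        op, operand = program[pc]
        counts[pc] += 1
        total_cycles += CYCLES[op]
        if op == "djnz":
            loop_counter -= 1
            pc = operand if loop_counter > 0 else pc + 1
        else:
            pc += 1
    return counts, total_cycles

program = [("move", None), ("mac", None), ("djnz", 1), ("move", None)]
counts, cycles = profile(program)
print(dict(counts))  # the 'mac' at address 1 executes 4 times
print(cycles)        # 18 total instruction cycles
```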
• Integration/extensibility. Instruction set simulators are generally designed to simulate the
DSP processor alone and do not provide capabilities for simulating the behavior of periph-
erals or other processors. (There are some exceptions: those from DSP Group and Motorola
provide the user the ability to link in custom C code to simulate external devices;
Motorola also provides built-in support for simulating multiprocessor configurations.)
Most instruction set simulators do provide memory-mapped file I/O, meaning that data
words written to or read from a particular simulated memory location can be mapped to a
file on the host computer. Some simulators provide port- or pin-level I/O, meaning that
various processor I/O interfaces can be mapped to a file, and pin values (0, 1, or "unde-
fined") can be read from or written to the file. Some simulators provide scripting lan-
guages that can be used to specify input data sequences and interrupt arrival times for the
processor's I/O interfaces. For example, NEC's simulator for the µPD7701x has a particu-
larly powerful scripting language for simulating I/O activity.
In-Circuit Emulation
In-circuit emulation (ICE) assists the user in debugging and optimizing applications run-
ning on the DSP in the target system. External hardware and software on a personal computer or
workstation provide the user with the ability to monitor and control the processor in the target sys-
tem as it executes application programs.
All emulators provide the user with the ability to single-step through a program, and to
view and edit the contents of registers and memory. The main factors differentiating emulators
from one another are the maximum processor speed supported, the ability to trace program flow
and pin activity in real-time, the sophistication of real-time and non-real-time breakpoint condi-
tions supported, and profiling capabilities. Each of these is discussed briefly next.
Processor speed. Key considerations for evaluating an emulator include whether or not it
supports full-speed operation of the target processor and, if so, what set of debugging fea-
tures are supported with real-time operation. These factors depend mostly on the architec-
ture of the emulator. Emulator architectures are discussed in detail later in this subsection.
Program flow and pin tracing. Some emulators provide high-speed buffers which can be
used to trace the flow of program execution and in some cases to trace the activity on pro-
cessor I/O pins as the processor executes in real-time. Real-time program trace is imple-
mented by capturing the state of the processor's program memory address bus during each
instruction cycle. Real-time program trace capability can be extremely useful, since pro-
gram bugs that appear when the processor executes in real-time may disappear if the pro-
cessor is halted or placed into a slow, single-stepping mode. Similarly, having the ability to
capture traces of processor I/O pins in real-time can be quite helpful for debugging.
Some emulators that do not provide true program flow tracing still provide limited infor-
mation on program flow through the use of a discontinuity buffer. The discontinuity buffer
typically records the source and destination addresses of the last few jump, call, or return
instructions, or other instructions that cause the processor's program counter to change
in a way other than being incremented.
Breakpoints. Real-time breakpoints allow the processor to run at full speed until a speci-
fied program address is reached or another specified trigger condition is met. All emula-
tors support real-time breakpoints that stop program execution at a specified program
address. Other kinds of real-time breakpoint conditions supported by some emulators
include accesses to a data memory location (possibly distinguishing between read and
write accesses), accesses within a range of program memory locations or within a range of
data memory locations, and execution of certain kinds of instructions such as program
branches.
Some emulators include one or more counters in their breakpoint logic. A counter allows
the user to program the emulator to halt the processor after a particular condition has
occurred a certain number of times. The Texas Instruments TMS320C4x and TMS320C5x
emulators provide the most sophisticated real-time breakpoint capabilities among com-
mercially available DSP processors.
Non-real-time breakpoints are used if the needed breakpoint trigger condition is more
complex than the emulator's breakpoint logic can support. Examples include halting exe-
cution upon the change of the value in a register or memory location, or when an arith-
metic or logical expression involving two or more memory locations becomes true.
Because such complex breakpoint expressions are beyond the capabilities of the emula-
tor's breakpoint logic, they must be evaluated by the debugger after the execution of each
instruction. Therefore, the processor is placed in a single-stepping mode and the emulator
forces the DSP to execute debug instructions after each instruction. Some emulators do
not provide any support for non-real-time breakpoints, while others provide very sophisti-
cated support.
Profiling. Profiling is used to determine where a program spends most of its time during
execution. Basic profiling support provides the user with a statement of instruction execution
counts on a particular block of code. From this information, the user can decide how
to best optimize the program. Profiling is commonly supported in instruction set simula-
tors, but among emulators for commercially available DSPs, only the Texas Instruments
TMS320C4x and TMS320C5x support profiling. The emulator sets breakpoints at the start
and end addresses of loops, subroutines, and functions and measures processor cycle
counts between the breakpoints. Since the emulator must intervene for each profiling event
encountered, execution speed is slowed, but is still faster than the instruction set simulator.
The features and capabilities of an in-circuit emulator are largely determined by the basic
architecture of the emulator. Emulator architectures can be divided into three categories:
Pod-based emulation. With pod-based emulation, the DSP is removed from the target
system. Hardware attached to the host computer (called an ICE adapter or pod) contains a
sample (sometimes a special version) of the processor to be emulated, with additional
hardware for controlling it. A cable with a connector whose pin-out is identical to the
DSP's connects the pod to the target system, replacing the target DSP. This is illustrated in
Figure 16-3. Compared to the alternatives, pod-based emulators have some strong advan-
tages, such as the ability to provide real-time traces of program execution and processor
pin activity. However, pod-based emulators are expensive and can be troublesome, since
replacing the processor in the target system with the emulator pod changes the electrical
drive and loading characteristics of the circuit, and may cause electrical timing problems.
Pod-based emulators were once the only type of in-circuit emulators available. Over the
past few years, scan-based emulators (discussed below) have become prevalent. Only a
few DSP processor vendors and third-party tool providers still offer pod-based emulators
(for example, Analog Devices still offers pod-based emulators for the older members of
the ADSP-21xx family). Because of their hardware complexity and the electrical loading
effects mentioned above, pod-based emulators do not always support full-speed processor
operation.
Scan-based emulation. Over the past five years or so, almost all DSP processor vendors
have begun adding special debugging logic to their DSPs, along with a special serial port to
access this logic. Some processors use a "JTAG" (IEEE standard 1149.1) compatible serial
port to access these features. On other processors, a special dedicated interface is provided
(e.g., Motorola's "OnCE" port). By connecting an IBM PC or workstation to the proces-
sor's serial debugging port (via a bus adaptor of some sort), the debugging features of the
processor can be accessed. We call this approach scan-based emulation. With scan-based
emulation, the on-chip debugging logic is responsible for monitoring the chip's real-time
operation and halting the processor when a breakpoint is reached. After the processor is
halted, the software communicates with the debugging logic over the serial debugging port.
Scan-based emulation has several advantages over the traditional pod-based approach.
First, the processor does not have to be removed from the target system and replaced with
an emulator pod, a key consideration when processors cannot be mounted in sockets due
to space or electrical constraints. Second, the number of signal lines that must be con-
nected to the target hardware is minimized (typically the debug port has only five signals),
and the debugging port signals do not have to operate at the same speed as the rest of the
chip's signals. This reduces the overall complexity and cost of the emulator system and
virtually eliminates the possibility of emulator attachment causing serious changes in the
target system's behavior. Because the debugging logic is on-chip, scan-based emulators
always support full-speed processor operation. However, scan-based emulators must
revert to a very slow single-stepping mode to implement certain features (such as program
flow tracing) which can be implemented in real-time with pod-based emulators.
Because the debugging logic required to support scan-based emulation must be present on
every copy of the DSP processor manufactured, its cost must be carefully constrained. For
this reason, scan-based emulators generally have more limited capabilities than pod-based
emulators. The capabilities of scan-based emulation are further limited by the serial con-
nection between the target processor and the host, which has extremely limited bandwidth.
One key feature found in pod-based emulators but not in scan-based emulators is real-time
program flow and pin tracing.
Scan-based emulation is presently supported by most DSP processor vendors, especially
for processors introduced within the past few years. In addition, many third-party suppli-
ers offer scan-based emulators for DSPs from Texas Instruments and Motorola.
Monitor-based emulation. For processors which lack pod-based or scan-based emula-
tors, or for applications where the use of these tools is inconvenient or prohibitively expen-
sive, it is sometimes possible to obtain some of the functionality of an emulator by running
a special supervisory program (called a monitor) on the DSP processor. One of the proces-
sor's conventional I/O interfaces (such as a host port) is used for communication with the
debugger program running on the host. For example, monitor-based emulation is used
with IBM's MDSP2780 Mwave DSP, which is intended for use in personal computer
applications. In these applications, the monitor program communicates with the host pro-
cessor through the same interfaces used by application programs. The key advantages of
monitor-based emulation are that no special emulation hardware is required, either on the
DSP processor or externally, and that the processor does not have to be removed from the
target system and replaced with an emulator pod. However, the debugging capabilities
provided by monitor-based emulation are typically more limited than those found in pod-
or scan-based emulation. For example, with monitor-based debugging it is usually not
possible to set real-time breakpoints which are triggered by accesses to data memory loca-
tions, since there is no means provided for detecting such accesses. With monitor-based
emulation, real-time breakpoints are typically limited to program memory locations.
These breakpoints are implemented by substituting a call to the monitor for the instruction
at the breakpoint address. After execution of the user's program is halted, the original
instruction is replaced. Obviously, this approach cannot be used if the program code is
executed out of read-only memory. Monitor-based emulation generally supports full-speed
processor operation, but monitor-based emulators generally must revert to a slow single-
stepping mode to implement some features (like program flow tracing) which pod-based
emulators can implement in real-time.
A further disadvantage of monitor-based debugging is that when the monitor program is
called (e.g., at a program breakpoint or when single-stepping), the state of the processor's
pipeline is changed before it can be examined by the user. Despite its disadvantages, mon-
itor-based debugging can be attractive because it is often very inexpensive and may be
quite convenient to implement, especially in applications where the DSP is already cou-
pled with a host processor.
Monitor-based debugging is primarily used in host-based applications such as personal
computer multimedia and with low-cost development boards or evaluation boards. For
example, GO DSP sells an $89 monitor-based debugger that works in conjunction with
Texas Instruments' $99 DSP Starter Kit.
As with instruction set simulators, in-circuit emulators are used with front-end programs
called debuggers. The debugger provides the user interface and much of the functionality of the
emulator. Debuggers are discussed in more detail in the next subsection.
Debuggers
Debugger is the term we use to describe the front-end program that provides the user
interface and much of the functionality of an emulator or instruction set simulator. A powerful
debugger is a critical tool for software development. Emulators furnished by the processor vendor
generally share the same debugger as the vendor's instruction set simulator. This is convenient
because users do not need to learn different interfaces for the two tools. Emulators from third par-
ties sometimes provide their own debugger or sometimes license the interface from the processor
vendor (as is the case with many third-party emulators for Texas Instruments DSPs).
Debugger user interfaces vary significantly in their power and ease of use. The main types
are:
Character-based, command-line oriented. Command-line-oriented debuggers provide
text-based interaction within a single window, much like MS-DOS commands and their
output. While simple and straightforward, such user interfaces can make it difficult to debug
complex applications. AT&T's debuggers for their floating-point DSPs use this approach.
FIGURE 16-5. NEC's debugger for the µPD7701x provides a flexible, graphical
user interface under Microsoft Windows.
Source-level debugging. Refers to the ability to manipulate objects in the application pro-
gram by referring directly to the source code that produced them. Typically, a window dis-
plays the source code (either assembly language or a high-level language, such as C) for a
program and highlights each line as the simulator single-steps through the program. Some
debuggers are capable of source-level debugging for both C and assembly code, while oth-
ers can handle only one or neither.
Data entry and display mechanisms. These are key to efficient debugging. It is of the
utmost importance that debuggers allow the user to focus on the important aspects of a
problem, and flexible data display is one of the best ways to achieve this. For example,
some debuggers visually highlight data that have changed from one point to another in a
program. Some debuggers provide the ability to display data using a selection of formats
such as hexadecimal, decimal integer, and decimal floating-point; some can display data
only in hex, which can be a serious inconvenience. Debuggers should also provide good
facilities for displaying profiling data collected by the simulator or emulator, but very few
do. Zilog's Z893xx debugger is one that provides good capabilities in this area.
Signal plotting. This capability, found in some debuggers such as those for IBM's
MDSP2780 Mwave DSP, allows the user to graphically display signals. This can be a very
convenient feature, although good file export capabilities coupled with third-party
graphing tools can also be used to good effect.
Watch variables. Sometimes called watch windows, these allow the user to specify a set
of registers or memory locations (or in some cases expressions based on register or mem-
ory contents) to be displayed in a special window. The variables displayed in a watch win-
dow are typically updated each time the simulator or emulator stops execution, e.g., after a
breakpoint or single-step command.
Disassembly. Recreates an assembly language program from a machine language
executable. Note that this is different from source-level debugging: source-level debugging uses
the original source file, while disassembly recreates an assembly language display from
object code. In addition, an in-line assembly mechanism in some emulators and simulators
allows the user to assemble instructions and store them in memory without having to reas-
semble and reload the entire program. This can be useful for quickly patching an errant
program.
Command logging. Provides the ability to create a log file of all commands executed dur-
ing a particular debugging sequence. Session logging records not only the commands
entered, but the debugger's output as well. Scripting or macro capability refers to the abil-
ity to execute commands from a file (which may have been generated from a previously
recorded command log file). Because debugging sessions are frequently repetitive and
require customization for maximum efficiency, scripting and logging facilities are impor-
tant debugger capabilities.
Note that when a processor vendor provides a high-level language compiler for a DSP,
often the debugger supports debugging high-level language programs as well as assembly pro-
grams. If the debugger does not provide this support, then debugging high-level language pro-
grams is extremely awkward. In some cases, third-party vendors of high-level language compilers
provide separate source-level debuggers to accompany their compilers. For example, Intermetrics
offers their XDB debugger in conjunction with their C compiler for Motorola's DSP96002.
DSPs, though they are in a disorganized state on Texas Instruments' electronic bulletin board sys-
tem. More extensive, optimized libraries are typically provided by independent software vendors.
Application libraries contain much larger blocks of code that implement complete applica-
tions or parts of applications. Examples of this type of library include complete speech coders or
modems. These libraries are usually developed and licensed by independent third-party software
developers. For example, Analogical Systems provides speech coding and telecommunications
libraries for the Motorola DSP5600x and Analog Devices ADSP-21xx DSPs. DSP Software Engi-
neering provides a wide range of telecommunications applications for Texas Instruments DSP
processors. In addition, some DSP processor vendors provide application libraries either free of
charge or under license. A large number of assembly code library vendors and their products are
described in DSP Design Tools and Methodologies [BDT95].
Note that libraries written in assembly code are often usable with users' C code as well as
assembly code.
guage, and partially due to the free availability of a good general-purpose C compiler (the GNU C
compiler [GCC] from the Free Software Foundation), all currently available DSP processors and
cores that have high-level language support have a C compiler. In fact, the only currently available
DSP processor and core families that do not have C compilers available are the AT&T DSP16xx
and the Zoran ZR3800x.
Despite its popularity, C lacks a number of essential language features that simplify cod-
ing of DSP algorithms. For example, C lacks both a fixed-point data type (a feature essential for
efficient coding on fixed-point processors) and a complex data type (a feature that simplifies cod-
ing of DSP algorithms using complex arithmetic).
To address these issues, many compiler vendors have added their own extensions to the C
language to improve both its suitability for DSP applications and its efficiency on DSP processors.
For example, most C compilers support language extensions that allow the user to insert assembly
language statements directly into the assembly code generated from the C program. Some also
provide memory space qualifiers (i.e., keywords that inform the C compiler that certain variables
are to be located in certain banks of memory).
In addition to ad hoc extensions used by compiler vendors, there are the efforts of the
ANSI Numerical C Extension Group (NCEG), an ANSI committee working to extend the C lan-
guage to better support numerical computations. At present, there is no official Numerical C stan-
dard; however, Analog Devices now offers a Numerical C compiler for their ADSP-210xx
floating-point DSP processor. This compiler implements a subset of the enhancements discussed
by the committee.
Tartan is presently the only vendor offering an Ada compiler for DSP processors. The
compiler is targeted at the Texas Instruments TMS320C3x and TMS320C4x DSPs. Tartan
has also announced an Ada compiler for the Analog Devices ADSP-2106x family.
instruction set. Third, such processors have hardware support for floating-point arithmetic,
making them a good match for C's float data type.
Debuggers
An important adjunct to high-level language compilers are source-level debuggers. It is
unreasonable (not to mention unproductive) to provide a user with a high-level language compiler
and then force him or her to use an assembly-level debugger to debug the resulting code.
In most cases, compiler vendors provide reasonable debugging support with their compil-
ers. For example, Intermetrics offers an excellent source-level debugger for its Mwave C com-
piler. Similarly, Tartan provides the AdaScope debugger for its DSP Ada compiler.
fixed-point analysis, and hardware synthesis capabilities, as well as software generation, making
them useful for many aspects of DSP system design.
The four most popular block-diagram-based programming tools are Alta Group's SPW,
Mentor Graphics' DSP Station, Synopsys' COSSAP, and Hyperception's Hypersignal for Win-
dows Block Diagram. Of these tools, the Hyperception software is the only one that runs on IBM
PCs; the other three run only on UNIX workstations. All can generate C programs from block-dia-
gram specifications. This C code can then be compiled, downloaded, and run on a DSP processor,
although the generated C code is mainly intended for use on floating-point DSP processors. Addi-
tionally, SPW and DSP Station can directly generate assembly code for some processors (SPW
supports the Motorola DSP5600x family, while DSP Station supports both the Motorola
DSP5600x and the Texas Instruments TMS320C3x family).
For thorough analyses of these and other block-diagram-based programming tools, please
refer to DSP Design Tools and Methodologies [BDT95].
For detailed evaluations of real-time operating systems for fixed- and floating-point DSPs,
please refer to DSP Design Tools and Methodologies [BDT95].
Chapter 17
Applications Support
DSP processors are complex devices and can be tricky to design with, both from a
hardware and a software standpoint. When choosing a processor, it is important to consider what kind
of support the manufacturer provides to help ease the design process and solve problems as they
arise.
Not only is there wide variability in the level of support provided by processor manufac-
turers, but there is also wide variability within a single processor manufacturer. Not surprisingly,
high-volume customers (and potential customers) generally get better support than low-volume
customers. However, some processor vendors (such as Analog Devices and Texas Instruments)
have a strategy of providing broad support to all kinds of users, while other vendors (such as
AT&T) concentrate on supporting a few major customers.
In the following sections, we outline the kinds of support typically provided by processor
vendors and highlight the vendors who, in our assessment, provide the best support in each cate-
gory.
17.1 Documentation
Documentation for DSP processors generally includes user's guides, data sheets, applica-
tion notes, and application handbooks. User's guides provide an introduction to the architecture
and the capabilities of the processor as well as a detailed discussion of the instruction set, memory
architecture, peripherals, I/O interfaces, execution control, and instruction set encoding. Some
user's guides include software benchmarks, example programs, and hardware designs using the
processor. The average quality of DSP processor documentation is fairly poor, though there is
quite a lot of variability among manufacturers. While errors of fact do find their way into these
manuals, completeness and organization are usually more serious concerns. High-quality user's
guides are a tremendous asset to system developers. Conversely, poorly organized or incomplete
guides can be a major hindrance to efficient development. Analog Devices deserves special praise
for their user's guides. They are clearly written, well organized, and provide several features that
are surprisingly rare in DSP processor documentation, including a thorough index and a list of
changes made in the guide since the previous version. AT&T's and Zoran's user's guides are
relatively thin on details. Zilog's manuals use terminology that is at odds with accepted industry
usage (e.g., referring to modulo addressing as "hardware looping") and contain large numbers of
typographical errors. Those from Texas Instruments have tended to be massive but poorly orga-
nized and sometimes incomplete; this seems to be changing for the better.
Data sheets are usually used to give precise timing and pinout information for processor
variations based on speed and packaging options. Some manufacturers include this information in
their user's guides, rather than providing separate data sheets. As with user's guides, there is sig-
nificant variation in the quality of data sheets among manufacturers. Accuracy is usually fairly
good, but understandability is sometimes poor.
Application notes can be extremely useful in providing complete, practical hardware and
software design examples. These examples can be a powerful tool for learning how to use a pro-
cessor, for writing initial programs, and for understanding complex techniques for optimizing
hardware and software. Motorola provides a good selection of application notes for the
DSP5600x family. Application handbooks are compilations of application examples, emphasizing
mostly software but including some hardware designs as well. Some handbooks include a diskette
containing electronic versions of the examples described in the book. This is a great convenience.
Again, Analog Devices is a standout here. The two-volume set Digital Signal Processing Applica-
tions with the ADSP-2100 Family [Ana90], [Ana95] documents a well-rounded collection of
ADSP-21xx applications. The books include a tutorial discussion of each application as well as
the source code and a discussion of the implementation used. A diskette containing the source
code files is also provided. Texas Instruments also offers a four-volume set of application hand-
books for their DSPs. These volumes consist mostly of reprints of technical articles originally
published elsewhere. While they provide discussions of a good variety of applications, the Texas
Instruments volumes lack the coherence and tutorial treatment of the Analog Devices books.
Many of the larger DSP processor manufacturers have literature centers that are responsi-
ble for distributing documentation. We include telephone numbers for vendors' literature centers
in the Appendix, "Vendor Contact Information."
After a company has chosen a processor, the applications engineers can help designers
understand the complexities of designing hardware and software around the processor. As we
have observed, DSP processors in general are very tricky to program efficiently, have very irregu-
lar instruction sets, and in some cases lack good development tools and thorough documentation.
Because of this, a local, knowledgeable applications engineer who can provide assistance during
product development is an extremely valuable asset.
Although they are generally well qualified and very helpful, most manufacturers'
applications engineers are in short supply. Not surprisingly, the largest customers (or potential customers)
generally get the lion's share of the AEs' attention, though this varies from vendor to vendor.
In response to the perennial shortage of applications engineers' time, some DSP processor
manufacturers are increasingly trying to make use of applications engineers employed by their
distributors. While we have found distributors' applications engineers to be helpful sources of
basic information, so far we have not found them to have the depth of technical knowledge of
their counterparts who are employed by the processor manufacturers.
sionally for locating useful pieces of software. However, many are poorly organized, making
locating any particular item of interest difficult. In addition, many of these systems are undersized
relative to the demand for them; thus, users may have to battle busy signals to get through. Finally,
each bulletin board system has its own (usually cumbersome) user interface that users must be
prepared to deal with.
Many DSP processor vendors now provide access to their bulletin board systems through
the Internet. Texas Instruments deserves praise for an exceptionally useful World Wide Web site,
containing not only DSP applications code but also well-organized, searchable tables of product
information and press releases. The Texas Instruments Web site even contains PostScript versions
of data sheets for new processors.
In some cases, independent institutions have (sometimes with the manufacturer's blessing,
sometimes without their knowledge) copied portions of the contents of a manufacturer's bulletin
board onto a computer that is connected to the Internet and available for public access. Users who
have access to the Internet can retrieve files using the Internet File Transfer Protocol (FTP).
17.5 Training
Most DSP processor manufacturers provide formal training classes for users of their
products. These are usually one to three days in length and focus on one processor family (for
example, a vendor's fixed-point or floating-point processor family). Costs are typically between
$500 and $1,500 per person. In the United States, vendors rotate training courses through differ-
ent locations around the country.
developers of DSP systems. Second, the tools themselves may encapsulate significant expertise
about a given processor. For example, software function libraries that have been carefully opti-
mized by expert programmers are not only useful as building blocks for creating application soft-
ware, but also as tools for learning how to efficiently program a given processor. Of course, this
assumes that the library code is clearly written, well documented, and available in source code
form.
For a comprehensive listing of DSP development tool vendors and their products, includ-
ing detailed evaluations of many products, see DSP Design Tools and Methodologies [BDT95].
Board Vendors
Dozens of companies manufacture off-the-shelf printed circuit boards for use in develop-
ing DSP processor-based systems. In some cases, these boards are intended to be used as a
development aid, for example, to allow real-time evaluation and debugging of software while custom
hardware is under development. In other cases, boards are designed to be incorporated into an end
product. In either case, many boards provide interfaces to allow users to connect their own custom
hardware to the board. The manufacturers of these boards can be useful sources of information on
hardware design issues for DSP processor-based systems. In addition, some board vendors are
also suppliers of software libraries and development tools such as debuggers and emulators.
Textbooks
Over the past few years, several textbooks have appeared that combine an introduction
to digital signal processing with an introduction to one of the popular DSP processors. We rec-
ommend contacting DSP processor vendors for a current list of related books. Examples
include Real Time Digital Processing Applications with Motorola's DSP56000 Family by
Mohamed El-Sharkawy [Els90], and Digital Signal Processing with C and the TMS320C30 by
Rulph Chassaing [Cha92]. For complete citations of these texts and others, see the References
and Bibliography.
Consultants
The rapid growth in the number of applications using DSP processors has spawned a cot-
tage industry of independent consultants specializing in software and hardware for DSP proces-
sor-based systems. The larger DSP processor manufacturers maintain consultant directories that
can be useful in locating someone with the right expertise for a given project. Be aware that the
DSP processor manufacturers generally do not rigorously screen the consultants listed in their
directory; on the contrary, they may be motivated to collect as large a list as possible for appear-
ances' sake. As always, caveat emptor.
Design Houses
A number of independent companies specialize in complete development of DSP
processor-based products. These so-called design houses generally specialize in one or more
applications areas, such as telecommunications or motion control. DSP processor manufacturers can
provide names of design houses that have experience designing with their processors.
Training
Independent companies and educational institutions from time to time provide short
courses covering some of the more popular commercial DSP processors, sometimes in conjunc-
tion with an introduction to digital signal processing. These courses are usually advertised in
major trade magazines or through direct mailings. DSP processor manufacturers may also be able
to provide information on third-party training courses.
Chapter 18
Conclusions
In this chapter, we briefly review the broad themes of this book, summarize our findings,
and provide a perspective on the history, the state of the art, and the future of DSP processors and
applications.
that different processors can accomplish with a single instruction or floating-point operation.
While it is our firm belief that benchmarks, if carefully chosen and fairly implemented, provide
invaluable insights into processor performance, benchmark results should not be accepted at face
value. To draw meaningful conclusions from benchmarks, one must look beneath the surface to
understand what the benchmarks really measure and apply the relevant results to the application at
hand. (This approach is discussed in detail in the industry report Buyers Guide to DSP Processors
from Berkeley Design Technology, Inc.)
Additionally, designers should keep in mind that many features important to processor
selection cannot readily be quantified via benchmarking. Applications support, quality of develop-
ment tools, documentation, and I/O performance are all examples of processor selection criteria
that are not captured in benchmark results. The careful designer must weigh both qualitative and
quantitative considerations when choosing a processor.
Market Challenges
The main market for DSPs has always been embedded systems. Voiceband data modems,
for example, were an early application of DSPs and continue to be one of the most important.
Other applications with a long history of DSP use include music, speech synthesis, and servo
controllers. Speech compression applications have only recently begun using DSPs in high volume.
This is primarily because recent wireless communications standards are better suited to
implementation on DSP processors than were older and less aggressive compression standards, which
were better suited to custom ASIC implementations.
To serve the embedded systems market well, DSP vendors must resist the temptation to
excessively expand the feature sets provided by their devices. Most customers will be unwilling to
pay the price for these features, in terms of overall system cost and power consumption.
A countervailing pressure, however, comes from the proliferation of DSPs into an ever-
widening array of diverse applications, each with its own specialized needs. Several DSP vendors
are successfully addressing this problem by producing a variety of processor variants based on the
same DSP core. Some have introduced application-specific processors with special-purpose
execution units (e.g., Viterbi decoders for wireless communications). Other vendors are offering DSP
cores, which can become the basis of a customer's ASIC. This allows the system designer to
customize the IC to very closely match the needs of an application. Reuse of a DSP core permits the
core vendor and the customer to leverage their investments in development tools, software, and
engineering know-how.
An additional market pressure comes from the increasing use of DSP processors in porta-
ble, battery-powered applications. Already most processor vendors offer 3.3 or 3.0 V versions of
their DSPs to reduce power consumption. We expect to see supply voltages continue to drop and
increasingly aggressive power management features being added to DSP processors in the next
few years.
Competitive Challenges
The main competition for DSPs has always been custom analog, digital, or mixed cir-
cuitry. To meet this competition, DSPs have concentrated on speed and low cost, with memory
capacity and I/O as secondary considerations. Given that the competition is custom circuitry, the
assumption is usually that the DSP processor will be programmed in assembly language in an
effort to obtain maximum efficiency.
Given the predominance of embedded systems applications, a possible new competitive
threat to DSPs is high-level synthesis of ASICs. A design tool like Mentor Graphics' DSP Station
could, in concept, replace DSPs with fully customized devices, while providing the designer with
a straightforward programming interface. This threat, however, has thus far failed to materialize
for several reasons. First, the technical problems are daunting, and current synthesis systems have
demonstrated success only for narrow application areas. Second, the cost of fully customized
devices may still be higher than that of programmable DSP solutions due to the higher volume
production of DSPs.
Many manufacturers of high-performance general-purpose microprocessors believe that
they, too, are a threat to DSPs. There is currently little evidence that this is true for the embedded
systems market, except possibly for floating-point DSPs. While many microprocessors have
acquired DSP-like features such as fast multipliers, their system cost is much too high for most
embedded systems. Moreover, the high system cost is largely due to features that are irrelevant in
embedded systems. These include elaborate floating-point hardware with sophisticated handling
of exceptions, cache management (which yields a nondeterministic speed gain not useful for most
hard-real-time applications), and hardware support for virtual memory.
It can be argued, however, that conventional microprocessors might eventually displace
DSPs in certain types of systems, especially where a hardware platform containing a powerful
general-purpose microprocessor is already established in a market. Intel's native signal processing
initiative envisions PC multimedia functions such as music synthesis and speech compression
implemented in software running on the main CPU of a personal computer, doing away with the
need for a separate DSP processor. Currently, host processors are capable only of low-perfor-
mance signal processing tasks, and manage such functions only at the expense of leaving few pro-
cessor cycles available for other uses. However, one can anticipate sufficient improvement in the
performance of microprocessors that this computational burden will become tolerable. At the
same time, one can anticipate that, even as general-purpose microprocessors improve, communi-
cations technology will also advance, making ever-greater computational demands on hardware.
For example, voiceband data modems might be replaced by wireless links to base stations with
direct digital connections into a high-speed network.
When assessing alternatives to DSPs such as general-purpose microprocessors, it is criti-
cal to evaluate comparable technology. It would be very misleading, for example, to compare the
projected capabilities of a microprocessor that will become available in two years to a DSP that is
in volume production today. It is also inappropriate to compare an $800 microprocessor to a $50
DSP without considering cost. For many applications, it is equally inappropriate to compare a
5 W device to a 100 mW device without considering power consumption. Today, when all of these
factors are considered, DSPs are still the overwhelming winners for most embedded signal pro-
cessing applications.
Technical Challenges
Overall, software is the clear weak spot in DSP processor technology. This includes devel-
opment tools supporting DSP-based software and hardware design, as well as libraries of software
functions and applications programs suitable for use in end products.
Most tools supporting DSP processor software development are created by the DSP pro-
cessor manufacturers, rather than by third-party software houses. Most vendors offer the tradi-
tional trio of assembler, linker, and simulator, and nothing more. This set of tools is oriented
toward painstaking, cycle-by-cycle optimization of programs, and the tools are often little differ-
ent from the technology of the 1970s.
Given that most DSP processor software is developed in assembly language, the low qual-
ity of most existing assembly language development and debugging environments is surprising. A
key technical challenge facing DSP processor vendors is development of sophisticated tools that
enable extremely detailed, highly productive software optimization and debugging. Users need
symbolic debugging, the ability to graphically display data, and tools supporting structured, modular
program development and software reuse. Other critical tool needs include good profiling
tools (including tools to help the programmer visualize profiling results), tools to detect possible
undesired side effects of instruction sequences in highly pipelined processors, and tools to assist
system designers in accurately estimating power consumption for an application. Application and
function libraries are also key to increasing developer productivity and speeding the expansion of
DSPs into new applications. We believe that these tools offer a promising path to faster develop-
ment of more complex application programs.
Although most DSP vendors have developed C compilers for their processors, these compilers
(especially those for fixed-point processors) often seem to be most useful as a marketing
tool for the vendor rather than as a development tool for the programmer. Although today's compilers
are better than those of even a few years ago, the fact remains that most developers of applications
software for DSP processors do not use C. This reflects the extreme cost sensitivity of
their applications and their focus on embedded systems, which often justify painstaking software
optimization. We expect that compilers will continue to improve and their use will become more
widespread, but this will take time and significant effort on the part of the DSP tool vendors.
DSP core technology is already being used by DSP vendors to offer their customers more
customized configurations of their processors. Today, the customization is usually done by the
DSP processor manufacturer. But the true potential of DSP core technology will be realized when
system designers have the ability to create customized DSP processors for their own applications.
Currently, however, the design tools needed to support user customization are sorely lacking. DSP
core vendors will need to develop much closer cooperation with CAD tool vendors to create
design approaches and tools that will allow customized DSPs to achieve their potential. Several
vendors of DSP cores (such as Texas Instruments and DSP Group) have taken tentative steps
down this path by working with CAD tool vendors to make available fully functional processor
models of their cores. Much remains to be done in this area, however.
Conclusion
We believe that DSP processors will become increasingly important in an expanding range
of electronic products over the next several years, much as microprocessors and microcontrollers
have over the past two decades. The diversity of these applications and the stringent performance,
cost, and power consumption demands they make will spur an increased pace of processor archi-
tectural innovation and specialization. This specialization is both a strength and a weakness of
DSP processors, positioning them between custom ASICs and general-purpose microprocessors
in terms of performance and system hardware and software development complexity. While DSP
processors will be threatened by these alternative technologies in some applications, we believe
that DSPs will be the implementation approach of choice in some of the next decade's most
important applications, including telecommunications systems of many kinds, advanced multime-
dia user interfaces, and flexible information appliances.
Appendix
The following table provides contact information for vendors of DSP processors men-
tioned in this book.
Butterfly DSP, Inc. 2401 S.E. 161st Ct., Suite A (360) 892-5597 LH9320
Vancouver, WA 98684 (360) 892-0402 - Fax LH9124
Infinite Solutions, Inc. 3333 Bowers Avenue (408) 986-1686 Green Core
Suite 280 (408) 986-1687 - Fax
Santa Clara, CA 95054 Email: [email protected]
http://www.infinitesolutions.com
Tensleep Design, Inc. 3809 South 2nd Street (512) 447-5558 AlDSCx21 cores
Suite 0100 (512) 447-5565 - Fax
Austin, TX 78704 Email: [email protected]
Texas Instruments, Inc. 13510 N. Central Exwy. (713) 274-2320 - Tech. Support TMS320C1x
Dallas, TX 75265 (713) 274-2324 - Fax TMS320C2x
(713) 274-2323 - BBS TMS320C2xx
Email: [email protected] TMS320C3x
http://www.ti.com TMS320C4x
TMS320C5x
TMS320C54x
TMS320C8x
3Soft Corporation 1001 Ridder Park Drive (408) 451-5670 M320C25 core
San Jose, CA 95131-2314 (408) 451-5690 - Fax M320C50 core
Email: [email protected]
Glossary
AlDSCx21 cores A family of 16-bit fixed-point DSP cores from Tensleep Design, Inc.
BBS (Bulletin board system.) Many DSP processor vendors provide BBSs
accessible by modem that hold source code and application notes for
their processors.
Big-endian A term used to describe the ordering of bytes within a multibyte data
word. In big-endian ordering, bytes within a multibyte word are
arranged most-significant byte first. See also Little-endian.
Bit field Logical or bit operations applied to a group of bits at a time. Bit field
manipulation manipulation is a key operation in error control coding and decoding.
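The extract and replace operations that some DSPs provide as single instructions can be written out in C as a sketch (the helper names here are hypothetical, not any vendor's intrinsics):

```c
/* Extract 'width' bits starting at bit position 'pos' (bit 0 = LSB). */
unsigned bf_extract(unsigned word, unsigned pos, unsigned width)
{
    return (word >> pos) & ((1u << width) - 1u);
}

/* Replace the same field with the low 'width' bits of 'value',
 * leaving the rest of the word untouched. */
unsigned bf_replace(unsigned word, unsigned pos, unsigned width,
                    unsigned value)
{
    unsigned mask = ((1u << width) - 1u) << pos;
    return (word & ~mask) | ((value << pos) & mask);
}
```

For example, extracting the 8-bit field at bit 4 of 0xABCD yields 0xBC.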
Bit I/O port An I/O port in which each bit is individually configurable to be an input
or an output and in which each bit can be independently read or written.
Bit-reversed An addressing mode in which the order of the bits used to form a
addressing memory address is reversed. This simplifies reading the output from
radix-2 fast Fourier transform algorithms, which produce their results
in a scrambled order.
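In software, the same unscrambling can be sketched as a loop that mirrors the low bits of an index (bit_reverse is a hypothetical helper, not a vendor intrinsic; a DSP's address generator does this in hardware):

```c
/* Reverse the low 'bits' bits of index i -- a software model of
 * bit-reversed addressing. */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    while (bits-- > 0) {
        r = (r << 1) | (i & 1);  /* shift the next low bit of i into r */
        i >>= 1;
    }
    return r;
}
/* For an 8-point radix-2 FFT (3 address bits), output index 1 (binary
 * 001) is stored at index 4 (binary 100), and so on. */
```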
Boundary scan A facility provided by some integrated circuits that allows the values on
the IC's pins to be interrogated or driven to specified logic levels
through the use of a special serial test port on the device. This is useful
for testing the interconnections between ICs on a printed circuit board.
Bounding box The boundary of a circuit component, such as a DSP processor, ASIC,
standard cell, or DSP core. The bounding-box definition includes a list
of all signals and associated timing and electrical properties.
Cascade of biquads An implementation for IIR filters where the transfer function is
factored into second-order terms which are then implemented as a
chain of biquad filters.
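A minimal C sketch of such a cascade (direct form II transposed; the type and field names are illustrative, not from any particular library): each second-order section filters the previous section's output.

```c
/* One second-order IIR section (biquad), direct form II transposed.
 * Denominator coefficient a0 is assumed normalized to 1. */
typedef struct {
    double b0, b1, b2;   /* numerator (feedforward) coefficients */
    double a1, a2;       /* denominator (feedback) coefficients */
    double z1, z2;       /* filter state */
} Biquad;

static double biquad_step(Biquad *s, double x)
{
    double y = s->b0 * x + s->z1;
    s->z1 = s->b1 * x - s->a1 * y + s->z2;
    s->z2 = s->b2 * x - s->a2 * y;
    return y;
}

/* The cascade: feed each section with the previous section's output. */
double cascade_step(Biquad *sec, int nsec, double x)
{
    for (int i = 0; i < nsec; i++)
        x = biquad_step(&sec[i], x);
    return x;
}
```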
CD2450 A fixed-point DSP core with configurable data word widths from
Clarkspur Design, Inc.
Circular buffer A region of memory used as a buffer that appears to wrap around.
Circular buffers are typically implemented in software on conventional
processors and via modulo addressing on DSPs.
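As a sketch (names hypothetical), a software circular buffer advances its write index with modulo arithmetic, which is exactly what a DSP's modulo addressing does in hardware:

```c
#define CB_LEN 4

typedef struct {
    float data[CB_LEN];
    int   head;               /* next position to write */
} CircBuf;

/* Store a new sample; the index wraps instead of running off the end. */
void cb_push(CircBuf *cb, float x)
{
    cb->data[cb->head] = x;
    cb->head = (cb->head + 1) % CB_LEN;
}

/* Read the sample written n steps ago (n = 0 is the most recent). */
float cb_past(const CircBuf *cb, int n)
{
    return cb->data[(cb->head - 1 - n + CB_LEN) % CB_LEN];
}
```

Once the buffer is full, each new sample silently overwrites the oldest one, as the wrap-around implies.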
Clock cycle The time required for one cycle of the processor's master clock. See
also Instruction cycle.
Clock divider A circuit that reduces the frequency of a processor's master clock.
Programmable clock dividers allow the programmer to slow down the
processor's clock during times when full-speed operation is not needed,
thus reducing power consumption.
Clock doubler A frequency synthesizer circuit that allows an input clock with
frequency of one-half of the processor's desired master clock frequency
to be used to generate the master clock. Both the Texas Instruments
TMS320C5x family and the Analog Devices ADSP-2171 feature on-chip
clock doublers. See also Phase-locked loop.
Conflict wait state A wait state (defined below) inserted due to contention for a resource.
Similar to a pipeline interlock, except that a pipeline interlock is
usually attributable to contention for a resource in the processor's core,
whereas a conflict wait state may be attributable to contention for a
resource outside of the core, such as an external memory interface.
Convolutional An error control coding technique used to encode bits before their
encoding transmission over a noisy channel. Used in modems and digital cellular
telephony. Convolutional encoding is usually decoded via the Viterbi
algorithm (see below).
architecture. Some vendors provide DSP cores that their customers can
use to create their own customized application-specific ICs.
Data path A collection of execution units (adder, multiplier, shifter, and so on)
that process data. A processor's data path determines the mathematical
operations possible on that processor.
Debugger A front-end program that provides the user interface and much of the
functionality of an emulator or instruction set simulator.
Delay line A buffer used to store a fixed number of past samples. Delay lines are
used to implement both FIR and IIR filters.
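A three-tap FIR filter sketched in C: the delay line holds the most recent input samples, which are combined with the coefficients on every step. (Names are hypothetical; real DSP code would typically use a circular buffer and modulo addressing instead of shifting the samples.)

```c
#define TAPS 3

typedef struct {
    float delay[TAPS];   /* delay[0] is the newest sample */
} Fir;

float fir_step(Fir *f, const float coef[TAPS], float x)
{
    /* Age the delay line by one sample. */
    for (int i = TAPS - 1; i > 0; i--)
        f->delay[i] = f->delay[i - 1];
    f->delay[0] = x;

    /* Sum of coefficient-sample products (a chain of MACs). */
    float acc = 0.0f;
    for (int i = 0; i < TAPS; i++)
        acc += coef[i] * f->delay[i];
    return acc;
}
```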
Delayed branch A branch instruction where the branch actually occurs later than the
lexical appearance of the instruction. In other words, one or more
instructions appearing after the branch in the program are executed
before the branch is executed.
Event counter In the context of in-circuit emulators, an event counter counts the
number of times a user-specified event occurs while the processor is
executing. An event may consist of, for example, access to specified
program or data memory addresses, a branch taken by a program, or an
external interrupt. Not all in-circuit emulators provide event counters.
Externally- Wait states that are requested by an external device. See also Wait state.
requested wait state
Fast interrupt An interrupt where the service routine can execute only one or two
instructions but that offers reduced interrupt latency. Fast interrupts are
typically used to quickly move data from a peripheral to a memory
location or vice versa.
G.728 The ITU-T standard for low-delay CELP (see above), a speech
compression technique. G.728 compresses a 4 kHz audio bandwidth
speech signal into a 16 kbit/s bit stream.
GDB (GNU debugger.) A C-language source-level debugger for use with
GCC. See also GCC.
GNU (Gnu's not UNIX.) The name given by the Free Software Foundation to
UNIX-like programs developed independently of AT&T or U.C.
Berkeley and protected by a copyright agreement requiring free
distribution of source and object code for original GNU software and
derivative works.
Green Core A 16-bit fixed-point DSP core from Infinite Solutions, Inc.
Host interface (or A specialized parallel port on a DSP intended to interface easily to a
port) host processor. In addition to data transfer, some host interfaces allow
the host processor to force the DSP to execute interrupt service
routines, which can be useful for control.
IEEE standard 754 An IEEE standard for floating-point arithmetic. A number of DSP
processors, including the Analog Devices ADSP-210xx and Motorola
DSP96002, support IEEE-754 arithmetic.
Instruction cycle The time required to execute the fastest instruction on a processor. See
also Clock cycle.
Interlocking pipeline A pipeline architecture in which instructions that cause contention for
resources are delayed by some number of instruction cycles. The Texas
Instruments TMS320C3x, TMS320C4x, and TMS320C5x make heavy
use of interlocking in their pipelines.
Interrupt An event that causes the processor to suspend execution of its current
program and begin execution elsewhere in memory.
Interrupt latency The maximum amount of time from the assertion of an interrupt line to
the execution of the first word of the interrupt's service routine,
assuming that the processor is in an interruptible state.
I/O Input/output.
IS-95 A standard for U.S. digital cellular telephony. IS-95 uses CDMA.
IS-136 A standard for digital cellular telephony. Also known as IS-54 revision
C.
JEDEC Joint Electron Device Engineering Council.
Joule A unit of energy. One joule is the amount of energy used by a device
consuming one watt of power in one second.
JTAG The informal name for IEEE standard P1149.1 (see above). JTAG
stands for "Joint Test Action Group," the group that defined the
standard.
Linker A program that combines separate object code modules into a single
object code module and resolves cross references between modules.
Little-endian A term used to describe the ordering of bytes within a multibyte data
word. In little-endian ordering, bytes within a multibyte word are
arranged least-significant byte first. See also Big-endian.
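A common C sketch for discovering a machine's byte order at run time is to store a known multibyte value and inspect its first byte:

```c
#include <stdint.h>

/* Returns 1 on a little-endian machine: the least-significant byte
 * (0x34 of 0x1234) is stored at the lowest address. */
int is_little_endian(void)
{
    uint16_t word = 0x1234;
    return *(const uint8_t *)&word == 0x34;
}
```

On a big-endian machine the first byte would instead be 0x12, and the function returns 0.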
Low voltage Pertaining to the use of less than the standard 5 V for digital logic. This
is usually done to conserve power, but can also better match a given
battery technology.
Master clock The highest frequency clock signal used within a processor. The master
clock is typically between one and four times the instruction execution
rate of the processor.
MDSP2780 A 16-bit fixed-point DSP with a 24-bit instruction word from IBM
Microelectronics.
Micron A unit of length equal to 10⁻⁶ m. Integrated circuit feature sizes are
usually specified in microns, and typical sizes range from 0.5 to 2.0
µm. See also Feature size. Abbreviated "µm."
Modifier register A register used in the computation of addresses. Some vendors use this
term to refer to a register that contains a value to be added to an address
Modulo addressing An addressing mode where post-increments are done using modulo
arithmetic. This is used to implement circular buffers.
µPD7701x A family of 16-bit fixed-point DSPs with 32-bit instructions from NEC
Electronics, Inc.
Multiply-accumulate The dominant operation in many DSP applications, where operands are
multiplied and added to the contents of an accumulator register.
Frequently abbreviated to "MAC."
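The inner loop of an FIR filter or dot product is nothing but repeated MACs; a C sketch (function name hypothetical):

```c
/* A dot product is a chain of MAC operations. On a DSP, each iteration
 * (multiply, accumulate, two operand fetches, pointer updates)
 * typically completes in a single instruction cycle. */
long mac_dot(const short *a, const short *b, int n)
{
    long acc = 0;                    /* accumulator wider than operands */
    for (int i = 0; i < n; i++)
        acc += (long)a[i] * b[i];    /* the multiply-accumulate */
    return acc;
}
```

Note the accumulator is wider than the 16-bit operands, mirroring the extended-precision accumulators found in DSP data paths.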
NaN (Not a number.) The IEEE standard 754 specifies that floating-point
processors should reserve a special representation in their numeric
formats to indicate that a register or memory location does not contain
a valid number. This representation is referred to as NaN.
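The defining property carries over directly to C: NaN compares unequal to everything, including itself, which gives a portable test (the standard isnan() macro does the same job):

```c
#include <math.h>

/* IEEE 754 requires NaN != NaN, so a value that fails to equal
 * itself is not a number. */
int not_a_number(double x)
{
    return x != x;
}
```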
Numerical C A GCC-based C compiler from Analog Devices, Inc. for its ADSP-
210xx DSPs that implements a number of the extensions discussed by
the Numerical C Extensions Group, in particular, complex arithmetic
and iterators.
Object code Binary instructions and data for use on a programmable processor.
Object code is usually produced by an assembler and is often
"relocatable," meaning that it does not contain absolute references to
particular memory locations.
Parallel move A movement of data that is carried out in parallel with the execution of
an instruction. DSP processors typically provide the ability to move
two data values in parallel with executing an instruction, although the
number of instructions that support parallel moves may be limited.
PGA (Pin grid array.) A type of integrated circuit package. The external
connections are made available on metal pins arranged in a grid.
Phase-locked loop A feedback system in which an oscillator tracks a periodic input signal.
There are many uses for phase-locked loops, including timing recovery
in modems and generation of an on-chip master clock at a higher
frequency than an off-chip system clock.
Programmed wait A wait state that is automatically generated by the processor when
state accessing certain ranges of external memory. Most processors allow the
number of wait states to be configured by the programmer.
QFP (Quad flat pack.) A type of integrated circuit package. ICs packaged in
QFP packages are typically less expensive than the same IC in PGA
packages.
Real-time operating An operating system that allows the developer to place an upper bound
system (RTOS) on the amount of time a process must wait to execute after a critical
event occurs. Examples of real-time operating systems for DSPs
include SPOX and Mwave O/S; UNIX is an example of a non-real-time
operating system, in that programs may wait an indefinite amount of
time before executing.
Register A circuit that holds a set of contiguous bits that are treated as a group.
An accumulator is an example of a register.
Register-direct An addressing mode where operands come from registers, and the
addressing registers are identified by constants in the instruction. For example, the
instruction "ADD X0, Y0, A," which adds the contents of the X0 and Y0
registers and places their sum in the A register, uses register-direct
addressing.
Relocatable code Object code that does not contain absolute memory addresses, but
instead has symbolic references that a loader can resolve when it loads
the program. This allows the program to be loaded into memory at any
starting address. See also Object code.
Shadow register A register that mirrors the contents of another register. This is useful,
for example, for storing processor state during interrupt servicing. A
shadow register can be thought of as a one-deep hardware stack that
only supports a specific register.
Static When used to describe a processor, static means that the processor will
run with an arbitrarily low frequency input clock and still function
correctly, although more slowly. This is in contrast to a dynamic
processor, which requires a minimum frequency input clock to function
correctly. Because power consumption is proportional to clock
frequency in CMOS circuitry, a static processor allows one to reduce
power consumption by slowing or stopping the input clock.
Subroutine A unit of software that can be invoked from multiple locations in one or
more other units of software to perform a specific operation or set of
operations. Subroutines allow a programmer to avoid the need for
repeatedly specifying often-used sequences of instructions in a
program.
Target system The end system or product in which a processor will be used.
TMS320C5x A family of 16-bit fixed-point DSPs from Texas Instruments, Inc. The
TMS320C5x is the successor to the TMS320C2x family.
TQFP (Thin quad flat pack.) A type of integrated circuit package similar to,
but thinner than, a plastic quad flat pack (PQFP; see above). TQFP
packages are typically used in small, portable electronic systems, such
as cellular telephones and pagers.
Two's complement The binary representation of numbers most commonly used in DSPs
for fixed-point numbers.
USFS 1015 (United States Federal Standard 1015.) The standard specifying the
LPC-10E speech coder. See also LPC.
USFS 1016 (United States Federal Standard 1016.) The standard specifying the
CELP speech coder. See also CELP.
V.27 An ITU-T standard for 4800 and 2400 bit/s facsimile modems.
V.29 An ITU-T standard for 9600 and 7200 bit/s facsimile modems.
V.32terbo A protocol for 19,200 bit/s modems promulgated by AT&T before the
V.34 standard was available.
Viterbi decoding (or A computationally efficient (but still relatively complex) mechanism
Viterbi algorithm) for decoding a convolutionally encoded bit stream.
VSELP (Vector sum excited linear prediction.) A speech coding technique used
in the U.S. IS-54 digital cellular telephone system.
Wait state A delay inserted in an external memory access to give a slow peripheral
or memory time to access its data.
Watt A unit of power. One watt is the power consumed by a device that uses
one joule of energy in one second.
ZR3800x A family of 20-bit fixed-point DSPs with 32-bit instruction words from
Zoran Corporation.
Index
dual-ported memory 54 I
dynamic RAM 64 I/O 12
dynamic range 22,26,39 IBM
MDSP2780
E lack of instruction set simulator 141
edge-triggered interrupt 120 signal plotting in debugger 150
electronic mail idle mode 126
vendor contact information 171 IEEE 754 30,45
embedded systems 6 IEEE standard 1149.1 122, 147
exceptions 45, 95 immediate addressing 68
execution control 12, 91-98 implied addressing 68
exponent 22, 25, 28 in-circuit emulation 121, 137, 144
exponent detection 82 pod-based 146
extended precision 27 scan-based 147
extended-precision registers 84 index register 71
external crystal 129 indexed addressing 71
external interrupt 95 Infinite Solutions
external interrupt lines 120 Green nsp core 14
external memory interfaces 60-65 inner-loops 91
extract and replace (bit field) 84 instruction cache 55
instruction encoding 87
F instruction set 79-89
fabrication details 135
fast Fourier transform 75 instruction set models 144
fast interrupt 97, 107 instruction set simulators 137, 141
feature size 135 instruction word widths 87
field applications engineers 160 integer arithmetic 35
FIR filter 49 integrated development environment (IDE), for software
fixed-point arithmetic 6, 21 development 139
data path 31 interlocking pipeline 101
lack of support in C language 154 Intermetrics
vs. floating-point arithmetic 5 Motorola DSP96002 C compiler 151
floating-point arithmetic 6, 22, 45 Mwave debugger 155
data path 43 NEC fJPD7701x C compiler 154
IEEE 754 standard 30 XDB debugger 151
vs. fixed-point arithmetic 5
foundry-captive DSP cores 16
fractional arithmetic 22,24,35 enabling and disabling 96
lack of support in C language 154 external request lines 120
frame synchronization 112 latency 96
function libraries 151 and hardware looping 93
functional accuracy 142 pipeline effects 106
service routine 94
G vectors 95, 106
general-purpose microprocessors 17, 118 iterative division 84
as challengers to DSP processors 167 iterative normalization 83
GODSP
monitor-based debugger 148 K
guard bits 36 kernels 91, 156
H L
hard real-time constraints 4 language-sensitive text editor 139
hardware looping 91-94 latency 33
instructions 81 law of conservation of bits 33
Harvard architecture 51 level-triggered interrupts 120
high-level languages 152 librarian 141
compilers 152 libraries
libraries 155 application 152
high-performance applications 7 C language 155
host ports 118 in block-diagram-based tools 155
Hyperception licensable DSP cores 14, 16
Hypersignal for Windows Block Diagram 156 limiter 40
About the Authors
This book was prepared by the staff of Berkeley Design Technology, Inc. (BDTI), a firm
founded in 1991 to make DSP technology more accessible to a wide range of product developers
and to facilitate the commercialization of promising research technology. BDTI specializes in DSP
technology evaluation and has produced a number of industry reports, including Buyer's Guide to
DSP Processors and DSP Design Tools and Methodologies. (In fact, much of the material in this
book is adapted from the introductory material in Buyer's Guide to DSP Processors.) BDTI also
offers consulting services in the DSP field, ranging from technical evaluations of DSP processors
and tools to customization and integration of EDA tools.
BDTI can be reached by telephone at +1 510 665-1600, by fax at +1 510 665-1680, by
electronic mail at [email protected], or via the World Wide Web at http://www.bdti.com.
Philip D. Lapsley is a founder of Berkeley Design Technology, Inc., where he is responsi-
ble for special projects. His technical interests include real-time DSP, DSP processor code gener-
ation and debugging, management of large software systems, network protocols, and computer
security. He has worked at several research groups at the University of California at Berkeley, the
NASA Ames Research Center, Teknekron Communication Systems, and the U. C. Berkeley Space
Sciences Lab. Lapsley has also worked as an independent consultant in the field of real-time DSP,
with emphasis on the interaction between real-time and non-real-time systems. At U. C. Berkeley,
he cofounded the Experimental Computing Facility and served as its Director from 1986 to 1988.
While a researcher in the DSP Design Group at U. C. Berkeley, Lapsley focused on DSP code
generation and debugging, concentrating on the interface between programmable DSPs and host
processors. This work culminated in the development of a debugger/monitor that allows users to
monitor, control, and debug automatically generated DSP assembly code at the block-diagram
level. He received both his B.S. degree with high honors and his M.S. degree in electrical engi-
neering and computer sciences from the University of California at Berkeley.
Jeffrey C. Bier is a founder of Berkeley Design Technology, Inc., where he is responsible
for general and technical management, research, and product development. His experience spans
software, hardware, and design tool development for signal processing and control applications in
commercial and research environments. Bier has held positions with Acuson Corporation,
Hewlett-Packard Laboratories, Quinn & Feiner, the University of California at Berkeley, and else-
where. He has implemented real-time signal processing systems using DSP processors from
AT&T, Motorola, and Texas Instruments. While a researcher at U. C. Berkeley, Bier made signifi-
cant contributions to the Gabriel DSP design project. He developed code generation and simula-
tion software for multiprocessor DSP systems and was a key contributor to the development of a
new class of high-efficiency multiprocessor architectures for DSP. In addition, he has developed
several DSP ASICs. Bier has written numerous technical articles on topics including design tools,
multiprocessor architectures, and simulation techniques. He earned his bachelor's degree with
high honors from Princeton University. His master's degree is from the University of California at
Berkeley.
Amit Shoham is a Senior DSP Engineer with Berkeley Design Technology, Inc., where he
focuses primarily on benchmarking DSP processor performance and evaluating DSP design tools.
His technical interests include digital audio and music synthesis. Prior to joining BDTI, Mr. Shoham
was at Silicon Graphics, where he developed diagnostics for digital audio hardware. He
holds a bachelor's degree in computer systems engineering and a master's degree in electrical
engineering, both from Stanford University.
Edward A. Lee is a professor in the Electrical Engineering and Computer Science Depart-
ment at the University of California at Berkeley and a founder of Berkeley Design Technology,
Inc. He has been codirector of the Ptolemy project (a system-level design and simulation project)
at U. C. Berkeley since its inception in 1990 and he directed the Gabriel project before that. His
research areas include parallel computation, architecture and software techniques for programma-
ble DSPs, design environments for development of real-time software, and digital communica-
tion. He is a fellow of the IEEE and he was a recipient of a 1987 NSF Presidential Young
Investigator award, an IBM faculty development award, the 1986 Sakrison prize at U. C. Berkeley
for the best thesis in electrical engineering, and a paper award from the IEEE Signal Processing
Society. He is coauthor of Digital Communication, with D. G. Messerschmitt, Digital Signal Pro-
cessing Experiments with Alan Kamas, and numerous technical papers. His B.S. degree is from
Yale University, his master's from MIT, and his Ph.D. from U. C. Berkeley. From 1979 to 1982 he
was a member of the technical staff at Bell Telephone Laboratories, where he did extensive work
with early programmable DSPs and exploratory work in voiceband data modem techniques and
simultaneous voice and data transmission.