Digital Signal Processing is carried out by mathematical operations. In comparison, word processing and similar programs merely rearrange stored data. This means that computers designed for business and other general applications are not optimized for algorithms such as digital filtering and Fourier analysis. Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing tasks. These devices have seen tremendous growth in the last decade, finding use in everything from cellular telephones to advanced scientific instruments. In fact, hardware engineers use "DSP" to mean Digital Signal Processor, just as algorithm developers use "DSP" to mean Digital Signal Processing.
The last forty years have shown that computers are extremely capable in two broad areas: Data manipulation, such as word processing and database management, Mathematical calculation, used in science, engineering, and Digital Signal Processing. All microprocessors can perform both tasks; however, it is difficult (expensive) to make a device that is optimized for both. There are technical tradeoffs in the hardware design, such as the size of the instruction set and how interrupts are handled. Even more important, there are: Marketing issues involved: Development and manufacturing cost, Competitive position, Product lifetime,. As a broad generalization, these factors have made traditional microprocessors, such as the Pentium, primarily directed at data manipulation. Similarly, DSPs are designed to perform the mathematical calculations needed in Digital Signal Processing.
Data manipulation versus mathematical calculation.
In addition to performing mathematical calculations very rapidly, DSPs must also have a predictable execution time. Most DSPs are used in applications where the processing is continuous, not having a defined start or end. For instance, consider an engineer designing a DSP system for an audio signal, such as a hearing aid. If the digital signal is being received at 20,000 samples per second, the DSP must be able to maintain a sustained throughput of 20,000 samples per second. However, there are important reasons not to make it any faster than necessary. As the speed increases, so does the cost, the power consumption, the design difficulty, and so on. This makes an accurate knowledge of the execution time critical for selecting the proper device, as well as the algorithms that can be applied.
Circular Buffering
In real-time processing, the output signal is produced at the same time that the input signal is being acquired. For example, this is needed in telephone communication, hearing aids, and radar. These applications must have the information immediately available, although it can be delayed by a short amount. For instance, a 10 millisecond delay in a telephone call cannot be detected by the speaker or listener. Likewise, it makes no difference if a radar signal is delayed by a few seconds before being displayed to the operator. Real-time applications input a sample, perform the algorithm, and output a 2 sample, over-and-over. Alternatively, they may input a group of samples, perform the algorithm, and output a group of samples. This is the world of Digital Signal Processors.
Circular buffer operation.
FIR digital filter. In FIR filtering, each sample in the output signal, y[n], is found by multiplying samples from the input signal, x[n], x[n-1], x[n-2], ..., by the filter kernel coefficients, a0, a1, a2, a3 ..., and summing the products.
Figure above shows an FIR filter being implemented in real-time. To calculate the output sample, we must have access to a certain number of the most recent samples from the input. For example, suppose we use eight coefficients in this filter, a 0 ,
a 1 ,..
a 7 . This means we must know the value of the eight most recent samples from the input signal, x[n], x[n-1], . x[n-7]. These eight samples must be stored in memory and continually updated as new samples are acquired. What is the best way to manage these stored samples? The answer is circular buffering.
3 Figure above illustrates an eight sample circular buffer. We have placed this circular buffer in eight consecutive memory locations, 20041 to 20048. Figure (a) shows how the eight samples from the input might be stored at one particular instant in time, while (b) shows the changes after the next sample is acquired. The idea of circular buffering is that the end of this linear array is connected to its beginning; memory location 20041 is viewed as being next to 20048, just as 20044 is next to 20045. You keep track of the array by a pointer (a variable whose value is an address) that indicates where the most recent sample resides. For instance, in (a) the pointer contains the address 20044, while in (b) it contains 20045. When a new sample is acquired, it replaces the oldest sample in the array, and the pointer is moved one address ahead. Circular buffers are efficient because only one value needs to be changed when a new sample is acquired.
Four parameters are needed to manage a circular buffer. First, there must be a pointer that indicates the start of the circular buffer in memory (in this example, 20041). Second, there must be a pointer indicating the end of the array (e.g., 20048), or a variable that holds its length (e.g., 8). Third, the step size of the memory addressing must be specified. In Fig. 28-3 the step size is one, for example: address 20043 contains one sample, address 20044 contains the next sample, and so on. This is frequently not the case. For instance, the addressing may refer to bytes, and each sample may require two or four bytes to hold its value. In these cases, the step size would need to be two or four, respectively.
These three values define the size and configuration of the circular buffer, and will not change during the program operation. The fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired. In other words, there must be program logic that controls how this fourth value is updated based on the value of the first three values. While this logic is quite simple, it must be very fast. This is the whole point of this discussion; DSPs should be optimized at managing circular buffers to achieve the highest possible execution speed.
Below are the steps needed to implement an FIR filter using circular buffers for both the input signal and the coefficients. The efficient handling of these individual tasks is what separates a DSP from a traditional microprocessor. For each new sample, all the following steps need to be taken:
1. Obtain a sample with the ADC; generate an interrupt 2. Detect and manage the interrupt 3. Move the sample into the input signal's circular buffer 4. Update the pointer for the input signal's circular buffer 5. Zero the accumulator 6. Control the loop through each of the coefficients 7. Fetch the coefficient from the coefficient's circular buffer 8. Update the pointer for the coefficient's circular buffer 9. Fetch the sample from the input signal's circular buffer 10. Update the pointer for the input signal's circular buffer 11. Multiply the coefficient by the sample 12. Add the product to the accumulator 13. Move the output sample (accumulator) to a holding buffer 14. Move the output sample from the holding buffer to the DAC
The goal is to make these steps execute quickly. Since steps 6-12 will be repeated many times (once for each coefficient in the filter), special attention must be given to these operations. Traditional microprocessors must generally carry out these 14 steps in serial (one after another), while DSPs are designed to perform them in parallel. In some cases, all of the operations within the loop (steps 6-12) can be completed in a single clock cycle. Let's look at the internal architecture that allows this magnificent performance.
Architecture of the Digital Signal Processor
One of the biggest bottlenecks in executing DSP algorithms is transferring information to and from memory. This includes data, such as samples from the input signal and the filter coefficients, as well as 4 program instructions, the binary codes that go into the program sequencer. Typical DSP operations require simple many additions and multiplications. Additions and multiplications require us to: Fetch two operands Perform the addition or multiplication (usually both) Store the result or hold it for a repetition To fetch the two operands in a single instruction cycle, we need to be able to make two memory accesses simultaneously. Actually, a little thought will show that since we also need to store the result - and to read the instruction itself - we really need more than two memory accesses per instruction cycle. For this reason DSP processors usually support multiple memory accesses in the same instruction cycle. It is not possible to access two different memory addresses simultaneously over a single memory bus. There are three common methods to achieve multiple memory accesses per instruction cycle: Harvard architecture Modified von Neuman architecture Super Harvard Architecture The Harvard architecture has two separate physical memory buses. This allows two simultaneous memory accesses:
The Harvard architecture The Harvard architecture requires two memory buses. This makes it expensive to bring off the chip - for example a DSP using 32 bit words and with a 32 bit address space requires at least 64 pins for each memory bus - a total of 128 pins if the Harvard architecture is brought off the chip. This results in very large chips, which are difficult to design into a circuit. Even the simplest DSP operation - an addition involving two operands and a store of the result to memory - requires four memory accesses (three to fetch the two operands and the instruction, plus a fourth to write the result). This exceeds the capabilities of a Harvard architecture. Some processors get around this by using a modified von Neuman architecture.
Figure below shows how it is implemented in a traditional microprocessor. This is often called Von Neumann architecture, after the brilliant American mathematician John Von Neumann (1903-1957). The von Neuman architecture uses only a single memory bus. The Von Neumann design is quite satisfactory when you are content to execute all of the required tasks in serial. In fact, most computers today are of the Von Neumann design. We only need other architectures when very fast processing is required, and we are willing to pay the price of increased complexity.
The Von Neuman Architecture
This is cheap, require less pins than the Harvard architecture, and simple to use because the programmer can place instructions or data anywhere throughout the available memory. But it does not permit multiple memory accesses. The modified von Neuman architecture allows multiple memory accesses per instruction cycle by the simple trick of running the memory clock faster than the instruction cycle. For example the Lucent DSP32C runs with an 80 MHz clock: this is divided by four to give 20 million instructions per second (MIPS), but the memory clock runs at the full 80 MHz - each instruction cycle is divided into four 'machine states' and a memory access can be made in each machine state, permitting a total of four memory accesses per instruction cycle: 5
In this case the modified von Neuman architecture permits all the memory accesses needed to support addition or multiplication: fetch of the instruction fetch of the two operands Storage of the result. Both Harvard and von Neuman architectures require the programmer to be careful of where in memory data is placed, for example, with the Harvard architecture, if both needed operands are in the same memory bank then they cannot be accessed simultaneously.
The true Harvard architecture dedicates one bus for fetching instructions, with the other available to fetch operands. This is inadequate for DSP operations, which usually involve at least two operands. So DSP Harvard architectures usually permit the 'program' bus to be used also for access of operands. Note that it is often necessary to fetch three things - the instruction plus two operands - and the Harvard architecture is inadequate to support this: so DSP Harvard architectures often also include a cache memory which can be used to store instructions which will be reused, leaving both Harvard buses free for fetching operands. This extension - Harvard architecture plus cache - is sometimes called an extended Harvard architecture or Super Harvard ARChitecture (SHARC) e.g. ADSP-2106x and new ADSP-211xx.
The super Harvard architecture
The idea is to build upon the Harvard architecture by adding features to improve the throughput. While the SHARC DSPs are optimized in dozens of ways, two areas are important enough to be included in Fig. above: An instruction cache, and An I/O controller. A handicap of the basic Harvard design is that the data memory bus is busier than the program memory bus. When two numbers are multiplied, two binary values (the numbers) must be passed over the data memory bus, while only one binary value (the program instruction) is passed over the program memory bus. To improve upon this situation, we start by relocating part of the "data" to program memory. For instance, we might place the filter coefficients in program memory, while keeping the input signal in data memory. (This relocated data is called "secondary data" in the illustration). At first glance, this doesn't seem to help the situation; now we must transfer one value over the data memory bus (the input signal sample), but two values over the program memory bus (the program instruction and the coefficient). In fact, if we were executing random instructions, this situation would be no better at all.
However, DSP algorithms generally spend most of their execution time in loops, such as instructions 6-12 above. This means that the same set of program instructions will continually pass from program memory to the CPU. The Super Harvard architecture takes advantage of this situation by including an instruction cache in the CPU. This is a small memory that contains about 32 of the most recent program instructions. 6 The first time through a loop, the program instructions must be passed over the program memory bus. This results in slower operation because of the conflict with the coefficients that must also be fetched along this path. However, on additional executions of the loop, the program instructions can be pulled from the instruction cache. This means that all of the memory to CPU information transfers can be accomplished in a single cycle: The sample from the input signal comes over the data memory bus, The coefficient comes over the program memory bus, and The program instruction comes from the instruction cache. In the jargon of the field, this efficient transfer of data is called a high memory access bandwidth.
The super Harvard architecture detailed view
Figure below presents a more detailed view of the SHARC architecture, showing the I/O controller connected to data memory. The SHARC DSPs provides both serial and parallel communications ports. These are extremely high speed connections. For example, at a 40 MHz clock speed, there are two serial ports that operate at 40 Mbits/second each, while six parallel ports each provide a 40 Mbytes/second data transfer. When all six parallel ports are used together, the data transfer rate is an incredible 240 Mbytes/second. Just as important, dedicated hardware allows these data streams to be transferred directly into memory (Direct Memory Access, or DMA), without having to pass through the CPU's registers. In other words, tasks 1 & 14 on our list happen independently and simultaneously with the other tasks; no cycles are stolen from the CPU. The main buses (program memory bus and data memory bus) are also accessible from outside the chip, providing an additional interface to off-chip memory and peripherals. This allows the SHARC DSPs to use a four Gigaword (16 Gbyte) memory, accessible at 40 Mwords/second (160 Mbytes/second), for 32 bit data. This type of high speed I/O is a key characteristic of DSPs. The overriding goal is to move the data in, perform the math, and move the data out before the next sample is available. Everything else is secondary. Some DSPs have onboard analog-to-digital and digital-to-analog converters, a feature called mixed signal. However, all DSPs can interface with external converters through serial or parallel ports.
At the top of the diagram are two blocks labeled Data Address Generator (DAG), one for each of the two memories. These control the addresses sent to the program and data memories, specifying where the information is to be read from or written to. In simpler microprocessors this task is handled as an inherent part of the program sequencer, and is quite transparent to the programmer. However, DSPs are designed to operate with circular buffers, and benefit from the extra hardware to manage them efficiently. This avoids needing to use precious CPU clock cycles to keep track of how the data are stored. For instance, in the 7 SHARC DSPs, each of the two DAGs can control eight circular buffers. This means that each DAG holds 32 variables (4 per buffer), plus the required logic.
Why so many circular buffers? Some DSP algorithms are best carried out in stages. For instance, IIR filters are more stable if implemented as a cascade of biquads (a stage containing two poles and up to two zeros). Multiple stages require multiple circular buffers for the fastest operation. The DAGs in the SHARC DSPs are also designed to efficiently carry out the Fast Fourier transform. In this mode, the DAGs are configured to generate bit-reversed addresses into the circular buffers, a necessary part of the FFT algorithm. In addition, an abundance of circular buffers greatly simplifies DSP code generation- both for the human programmer as well as high-level language compilers, such as C. The data register section of the CPU is used in the same way as in traditional microprocessors. In the ADSP-2106x SHARC DSPs, there are 16 general purpose registers of 40 bits each. These can hold intermediate calculations, prepare data for the math processor, serve as a buffer for data transfer, hold flags for program control, and so on. If needed, these registers can also be used to control loops and counters; however, the SHARC DSPs have extra hardware registers to carry out many of these functions.
The math processing is broken into three sections, a multiplier, an arithmetic logic unit (ALU), and a barrel shifter. The multiplier takes the values from two registers, multiplies them, and places the result into another register. The ALU performs addition, subtraction, absolute value, logical operations (AND, OR, XOR, NOT), conversion between fixed and floating point formats, and similar functions. Elementary binary operations are carried out by the barrel shifter, such as shifting, rotating, extracting and depositing segments, and so on. A powerful feature of the SHARC family is that the multiplier and the ALU can be accessed in parallel. In a single clock cycle, data from registers 0-7 can be passed to the multiplier, data from registers 8-15 can be passed to the ALU, and the two results returned to any of the 16 registers. There are also many important features of the SHARC family architecture that aren't shown in this simplified illustration. For instance, an 80 bit accumulator is built into the multiplier to reduce the round-off error associated with multiple fixed-point math operations.
Another interesting feature is the use of shadow registers for all the CPU's key registers. These are duplicate registers that can be switched with their counterparts in a single clock cycle. They are used for fast context switching, the ability to handle interrupts quickly. When an interrupt occurs in traditional microprocessors, all the internal data must be saved before the interrupt can be handled. This usually involves pushing all of the occupied registers onto the stack, one at a time. In comparison, an interrupt in the SHARC family is handled by moving the internal data into the shadow registers in a single clock cycle. When the interrupt routine is completed, the registers are just as quickly restored. This feature allows step 4 on our list (managing the sample-ready interrupt) to be handled very quickly and efficiently.
Now we come to the critical performance of the architecture, how many of the operations within the loop (steps 6-12 above) can be carried out at the same time. Because of its highly parallel nature, the SHARC DSP can simultaneously carry out all of these tasks. Specifically, within a single clock cycle, it can perform a multiply (step 11), an addition (step 12), two data moves (steps 7 and 9), update two circular buffer pointers (steps 8 and 10), and control the loop (step 6). There will be extra clock cycles associated with beginning and ending the loop (steps 3, 4, 5 and 13, plus moving initial values into place); however, these tasks are also handled very efficiently. If the loop is executed more than a few times, this overhead will be negligible. As an example, suppose you write an efficient FIR filter program using 100 coefficients. You can expect it to require about 105 to 110 clock cycles per sample to execute (i.e., 100 coefficient loops plus overhead). This is very impressive; a traditional microprocessor requires many thousands of clock cycles for this algorithm.
Although there are many DSP processors, they are mostly designed with the same few basic operations in mind, so they share the same set of basic characteristics. This enables us to draw the processor diagrams in a similar way, to bring out the similarities and allow us to concentrate on the differences: 8
The diagram above shows a generalised DSP processor, with the basic features that are common. The Analog Devices ADSP21060 shows how similar are the basic architectures:
The ADSP21060 has a Harvard architecture - shown by the two memory buses. This is extended by a cache, making it a Super Harvard ARChitecture (SHARC). Note, however, that the Harvard architecture is not fully brought off chip - there is a special bus switch arrangement which is not shown on the diagram. The 21060 has two serial ports in place of the Lucent DSP32C's one. Its host port implements a PCI bus rather than the older ISA bus. Apart from this, the 21060 introduces four features not found on the Lucent DSP32C: There are two sets of address generation registers. DSP processors commonly have to react to interrupts quickly - the two sets of address generation registers allow for swapping between 9 register sets when an interrupt occurs, instead of having to save and restore the complete set of registers. There are six link ports, used to connect with up to six other 21060 processors: showing that this processor is intended for use in multiprocessor designs. There is a timer - useful to implement DSP multitasking operating system features using time slicing. There is a debug port - allowing direct non-intrusive debugging of the processor internals.
To suit these fundamental operations DSP processors often have: Parallel multiply and add Multiple memory accesses (to fetch two operands and store the result) Lots of registers to hold data temporarily Efficient address generation for array handling Special features such as delays or circular addressing To perform the simple arithmetic required, DSP processors need special high speed arithmetic units. Most DSP operations require additions and multiplications together. So DSP processors usually have hardware adders and multipliers which can be used in parallel within a single instruction:
The diagram shows the data path for the Lucent DSP32C processor. The hardware multiply and add work in parallel so that in the space of a single instruction, both an add and a multiply can be completed. Delays require that intermediate values be held for later use. This may also be a requirement, for example, when keeping a running total - the total can be kept within the processor to avoid wasting repeated reads from and writes to memory. For this reason DSP processors have lots of registers which can be used to hold intermediate values:
Registers may be fixed point or floating point format. Array handling requires that data can be fetched efficiently from consecutive memory locations. This involves generating the next required memory address. For this reason DSP processors have address registers which are used to hold addresses and can be used to generate the next needed address efficiently:
10 The ability to generate new addresses efficiently is a characteristic feature of DSP processors. Usually, the next needed address can be generated during the data fetch or store operation, and with no overhead. DSP processors have rich sets of address generation operations: *rP register indirect read the data pointed to by the address in register rP *rP++ postincrement having read the data, postincrement the address pointer to point to the next value in the array *rP-- postdecrement having read the data, postdecrement the address pointer to point to the previous value in the array *rP++rI register postincrement having read the data, postincrement the address pointer by the amount held in register rI to point to rI values further down the array *rP++rIr bit reversed having read the data, postincrement the address pointer to point to the next value in the array, as if the address bits were in bit reversed order The table shows some addressing modes for the Lucent DSP32C processor. The assembler syntax is very similar to C language. Whenever an operand is fetched from memory using register indirect addressing, the address register can be incremented to point to the next needed value in the array. This address increment is free - there is no overhead involved in the address calculation - and in the case of the Lucent DSP32C processor up to three such addresses may be generated in each single instruction. Address generation is an important factor in the speed of DSP processors at their specialised operations. The last addressing mode - bit reversed - shows how specialised DSP processors can be. Bit reversed addressing arises when a table of values has to be reordered by reversing the order of the address bits: - Reverse the order of the bits in each address - Shuffle the data so that the new, bit reversed, addresses are in ascending order This operation is required in the Fast Fourier Transform - and just about nowhere else. So one can see that DSP processors are designed specifically to calculate the Fast Fourier Transform efficiently.
DSP processors: input and output interfaces In addition to the mathematics, in practice DSP is mostly dealing with the real world. Although this aspect is often forgotten, it is of great importance and marks some of the greatest distinctions between DSP processors and general purpose microprocessors:
In a typical DSP application, the processor will have to deal with multiple sources of data from the real world. In each case, the processor may have to be able to receive and transmit data in real time, without interrupting its internal mathematical operations. There are three sources of data from the real world: Signals coming in and going out Communication with an overall system controller of a different type Communication with other DSP processors of the same type 11 These multiple communications routes mark the most important distinctions between DSP processors and general purpose processors.
When DSP processors first came out, they were rather fast processors: for example the first floating point DSP - the AT&T DSP32 - ran at 16 MHz at a time when PC computer clocks were 5 MHz. This meant that we had very fast floating point processors: a fashionable demonstration at the time was to plug a DSP board into a PC and run a fractal (Mandelbrot) calculation on the DSP and on a PC side by side. The DSP fractal was of course faster. Today, however, the fastest DSP processor is the Texas TMS320C6201 which runs at 200 MHz. This is no longer very fast compared with an entry level PC. And the same fractal today will actually run faster on the PC than on the DSP. But DSP processors are still used - why? The answer lies only partly in that the DSP can run several operations in parallel: a far more basic answer is that the DSP can handle signals very much better than a Pentium. Try feeding eight channels of high quality audio data in and out of a Pentium simultaneously in real time, without impacting on the processor performance, if you want to see a real difference. The need to deal with these different sources of data efficiently leads to special communication features on DSP processors:
In a typical DSP application, the processor will have to deal with multiple sources of data from the real world. In each case, the processor may have to be able to receive and transmit data in real time, without interrupting its internal mathematical operations. There are three sources of data from the real world: Signals coming in and going out Communication with an overall system controller of a different type Communication with other DSP processors of the same type The need to deal with different sources of data efficiently and in real time leads to special communication features on DSP processors:
Signals tend to be fairly continuous, but at audio rates or not much higher. They are usually handled by high speed synchronous serial ports. Serial ports are inexpensive - having only two or three wires - and are well suited to audio or telecommunications data rates up to 10 Mbit/s. Most modern speech and audio analogue to digital converters interface to DSP serial ports with no intervening logic. A synchronous serial port requires only three wires: clock, data, and word sync. The addition of a fourth wire (frame sync) and a high impedance state when not transmitting makes the port capable of Time Division Multiplex (TDM) data handling, which is ideal for telecommunications:
DSP processors usually have synchronous serial ports - transmitting clock and data separately - although some, such as the Motorola DSP56000 family, have asynchronous serial ports as well (where the clock is recovered from the data). Timing is versatile, with options to generate the serial clock from the DSP chip clock or from an external source. The serial ports may also be able to support separate clocks for receive and transmit - a useful feature, for example, in satellite modems where the clocks are affected by Doppler shifts. Most DSP processors also support companding to A-law or mu-law in serial port hardware with no overhead - the Analog Devices ADSP2181 and the Motorola DSP56000 family do this in the serial port, whereas the Lucent DSP32C has a hardware compander in its data path instead. The serial port will usually operate under DMA - data presented at the port is automatically written into DSP memory without stopping the DSP - with or without interrupts. It is usually possible to receive and trasmit data simultaneously. The 12 serial port has dedicated instructions which make it simple to handle. Because it is standard to the chip, this means that many types of actual I/O hardware can be supported with little or no change to code - the DSP program simply deals with the serial port, no matter to what I/O hardware this is attached.
Host communications is an element of many, though not all, DSP systems. Many systems will have another, general purpose, processor to supervise the DSP, for example, the DSP might be on a PC plug-in card or a VME card - simpler systems might have a microcontroller to perform a 'watchdog' function or to initialise the DSP on power up. Whereas signals tend to be continuous, host communication tends to require data transfer in batches - for instance to download a new program or to update filter coefficients. Some DSP processors have dedicated host ports which are designed to communicate with another processor of a different type, or with a standard bus. For instance the Lucent DSP32C has a host port which is effectively an 8 bit or 16 bit ISA bus: the Motorola DSP56301 and the Analog Devices ADSP21060 have host ports which implement the PCI bus.
The host port will usually operate under DMA - data presented at the port is automatically written into DSP memory without stopping the DSP - with or without interrupts. It is usually possible to receive and trasmit data simultaneously. The host port has dedicated instructions which make it simple to handle. The host port imposes a welcome element of standardisation to plug-in DSP boards - because it is standard to the chip, it is relatively difficult for individual designers to make the bus interface different. For example, of the 22 main different manufacturers of PC plug-in cards using the Lucent DSP32C, 21 are supported by the same PC interface code: this means it is possible to swap between different cards for different purposes, or to change to a cheaper manufacturer, without changing the PC side of the code. Of course this is not foolproof - some engineers will always 'improve upon' a standard by making something incompatible if they can - but at least it limits unwanted creativity.
Interprocessor communications is needed when a DSP application is too much for a single processor - or where many processors are needed to handle multiple but connected data streams. Link ports provide a simple means to connect several DSP processors of the same type. The Texas TMS320C40 and the Analog Devices ADSP21060 both have six link ports (called 'comm ports' for the 'C40). These would ideally be parallel ports at the word length of the processor, but this would use up too many pins (six ports each 32 bits wide=192, which is a lot of pins even if we neglect grounds). So a hybrid called serial/parallel is used: in the 'C40, comm ports are 8 bits wide and it takes four transfers to move one 32 bit word - in the 21060, link ports are 4 bits wide and it takes 8 transfers to move one 32 bit word.
The link port will usually operate under DMA - data presented at the port is automatically written into DSP memory without stopping the DSP - with or without interrupts. It is usually possible to receive and trasmit data simultaneously. This is a lot of data movement - for example the Texas TMS320C40 could in principle use all its six comm ports at their full rate of 20 Mbyte/s to achieve data transfer rates of 120 Mbyte/s. In practice, of course, such rates exist only in the dreams of marketing men since other factors such as internal bus bandwidth come into play. The link ports have dedicated instructions which make them simple to handle. Although they are sometimes used for signal I/O, this is not always a good idea since it involves very high speed signals over many pins and it can be hard for external hardware to exactly meet the timing requirements.
13 DSP processors: data formats
DSP processors store data in fixed or floating point formats. It is worth noting that fixed point format is not quite the same as integer:
The integer format is straightforward: representing whole numbers from 0 up to the largest whole number that can be represented with the available number of bits. Fixed point format is used to represent numbers that lie between 0 and 1: with a 'binary point' assumed to lie just after the most significant bit. The most significant bit in both cases carries the sign of the number. The size of the fraction represented by the smallest bit is the precision of the fixed point format. The size of the largest number that can be represented in the available word length is the dynamic range of the fixed point format To make the best use of the full available word length in the fixed point format, the programmer has to make some decisions: If a fixed point number becomes too large for the available word length, the programmer has to scale the number down, by shifting it to the right: in the process lower bits may drop off the end and be lost If a fixed point number is small, the number of bits actually used to represent it is small. The programmer may decide to scale the number up, in order to use more of the available word length In both cases the programmer has to keep a track of by how much the binary point has been shifted, in order to restore all numbers to the same scale at some later stage. Floating point format has the remarkable property of automatically scaling all numbers by moving, and keeping track of, the binary point so that all numbers use the full word length available but never overflow:
Floating point numbers have two parts: the mantissa, which is similar to the fixed point part of the number, and an exponent which is used to keep track of how the binary point is shifted. Every number is scaled by the floating point hardware: If a number becomes too large for the available word length, the hardware automatically scales it down, by shifting it to the right If a number is small, the hardware automatically scale it up, in order to use the full available word length of the mantissa 14 In both cases the exponent is used to count how many times the number has been shifted. In floating point numbers the binary point comes after the second most significant bit in the mantissa. The block floating point format provides some of the benefits of floating point, but by scaling blocks of numbers rather than each individual number:
Block floating point numbers are actually represented by the full word length of a fixed point format. If any one of a block of numbers becomes too large for the available word length, the programmer scales down all the numbers in the block, by shifting them to the right If the largest of a block of numbers is small, the programmer scales up all numbers in the block, in order to use the full available word length of the mantissa In both cases the exponent is used to count how many times the numbers in the block have been shifted. Some specialised processors, such as those from Zilog, have special features to support the use of block floating point format: more usually, it is up to the programmer to test each block of numbers and carry out the necessary scaling. The floating point format has one further advantage over fixed point: it is faster. Because of quantisation error, a basic direct form 1 IIR filter second order section requires an extra multiplier, to scale numbers and avoid overflow. But the floating point hardware automatically scales every number to avoid overflow, so this extra multiplier is not required:
DSP processors: precision and dynamic range
The precision with which numbers can be represented is determined by the word length in the fixed point format, and by the number of bits in the mantissa in the floating point format. In a 32 bit DSP processor the mantissa is usually 24 bits: so the precision of a floating point DSP is the same as that of a 24 bit fixed point processor. But floating point has one further advantage over fixed point: because the hardware automatically scales each number to use the full word length of the mantissa, the full precision is maintained even for small numbers: 15
There is a potential disadvantage to the way floating point works. Because the hardware automatically scales and normalises every number, the errors due to truncation and rounding depend on the size of the number. If we regard these errors as a source of quantisation noise, then the noise floor is modulated by the size of the signal. Although the modulation can be shown to be always downwards (that is, a 32 bit floating point format always has noise which is less than that of a 24 bit fixed point format), the signal dependent modulation of the noise may be undesirable: notably, the audio industry prefers to use 24 bit fixed point DSP processors over floating point because it is thought by some that the floating point noise floor modulation is audible. The precision directly affects quantisation error. The largest number which can be represented determines the dynamic range of the data format. In fixed point format this is straightforward: the dynamic range is the range of numbers that can be represented in the available word length. For floating point format, though, the binary point is moved automatically to accommodate larger numbers: so the dynamic range is determined by the size of the exponent. For an 8 bit exponent, the dynamic range is close to 1,500 dB:
So the dynamic range of a floating point format is enormously larger than for a fixed point format:
While the dynamic range of a 32 bit floating point format is large, it is not infinite: so it is possible to suffer overflow and underflow even with a 32 bit floating point format. A classic example of this can be seen by running fractal (Mandelbrot) calculations on a 32 bit DSP processor: after quite a long time, the fractal pattern ceases to change because the increment size has become too small for a 32 bit floating point format to represent. Most DSP processors have extended precision registers within the processor: 16
The diagram shows the data path of the Lucent DSP32C processor. Although this is a 32 bit floating point processor, it uses 40 and 45 bit registers internally: so results can be held to a wider dynamic range internally than when written to memory.
DSP processors: ADSP2181 DSP often involves a need to switch rapidly between one task and another: for example, on the occurrence of an interrupt. This would usually require all registers currently in use to be saved, and then restored after servicing the interrupt. The DSP16A and the Analogue Devices ADSP2181 use two sets of address generation registers:
The two sets of address generation registerscan be swapped as a fast alternative to saving and restoring registers when switching between tasks. The ADSP2181 also has a timer: useful for implementing 'time sliced' task switching, such as in a real time operating system - and another indication that this processor was designed with task switching in mind. 17 It is interesting to see how far a manufacturer carries the same basic processor model into their different designs. Texas Instruments, Analog Devices and Motorola all started with fixed point devices, and have carried forward those designs into their floating point processors. AT&T (now Lucent) started with floating point, then brought out fixed point devices later. The Analog Devices ADSP21060 looks like a floating point version of the integer ADSP2181:
The 21060 also has six high speed link ports which allow it to connect with up to six other processors of the same type. One way to support multiprocessing is to have many fast inter-processor communications ports: another is to have shared memory. The ADSP21060 supports both methods. Shared memory is supported in a very clever way: each processor can directly access a small area of the internal memory of up to four other processors. As with the ADSP2181, the 21060 has lots of internal memory: the idea being, that most applications can work without added external memory: note, though, that the full Harvard architecture is not brought off chip, which means they really need the on-chip memory to be big enough for most applications. The problem of fixed point processors is quantisation error, caused by the limited fixed point precision. Motorola reduce this problem in the DSP56002 by using a 24 bit integer word length: 18
They also use three internal buses - one for program, two for data (two operands). This is an extension of the standard Harvard architecture which goes beyond the usual trick of simply adding a cache, to allow access to two operands and the instruction at the same time. Of course, the problem of 24 bit fixed point is its expense: which probably explains why Motorola later produced the cheap, 16 bit DSP56156 - although this looks like a 16 bit variant of the DSP56002:
And of course there has to be a floating point variant - the DSP96002 looks like a floating point version of the DSP56002: 19
The DSP96002 supports multiprocessing with an additional 'global bus' which can connect to other DSP96002 processors: it also has a new DMA controller with its own bus The Texas TMS320C25 is quite an early design. It does not have a parallel multiply/add: the multiply is done in one cycle, the add in the next and the DSP has to address the data for both operations. It has a modified Harvard bus with only one data bus, which sometimes restricts data memory accesses to one per cycle, but it does have a special 'repeat' instruction to repeat an instruction without writing code loops
The Texas TMS320C50 is the C25 brought up to date: the multiply/add can now achieve single cycle execution if it is done in a hardware repeat loop. It also uses shadow registers as a fast way to preserve registers when context switching. It has automatic saturation or rounding (but it needs it, since the 20 accumulator has no guard bits to prevent overflow), and it has parallel bit manipulation which is useful in control applications
The Texas TMS320C30 carries on some of the features of the integer C25, but introduces some new ideas. It has a von Neuman architecture with multiple memory accesses in one cycle, but there are still separate internal buses which are multiplexed onto the CPU. It also has a dedicated DMA controller.
The Texas TMS320C40 is similar to the C30, but with high speed communications ports for multiprocessing. It has six high speed parallel comm ports which connect with other C40 processors: these are 8 bits wide, but carry 32 bit data in four successive cycles. 21
The Texas TMS320C60 is radically different from other DSP processors, in using a Very Long Instruction Word (VLIW) format. It issues a 256 bit instruction, containing up to 8 separate 'mini instructions' to each of 8 functional units. Because of the radically different concept, it an be hard to compare this processor with other, more traditional, DSP processors. But despite this, the 'C60 can still be viewed in a similar way to other DSP processors, which makes apparent the fact that it basically has two data paths each capable of a multiply/accumulate.
Note that this diagram is very different from the way Texas Instruments draw it. This is for several reasons: Texas Instruments tend to draw their processors as a set of subsystems, each with a separate block diagram 22 my diagram does not show some of the more esoteric features such as the bus switching arrangement to address the multiple external memory accesses that are required to load the eight 'mini-instructions' An important point is raised by the placing of the address generation unit on the diagram. Texas Instruments draw the C60 block diagram as having four arithmetic units in each data path - whereas my diagram shows only three. The fourth unit is in fact the address generation calculation. Following my practice for all other DSP processors, I show address generation as separate from the arithmetic units - address calculation being assumed by the presence of address registers as is the case in all DSP processors. The C60 can in fact choose to use the address generation unit for general purpose calculations if it is not calculating addresses - this is similar, for example, to the Lucent DSP32C: so Texas Instruments' approach is also valid - but for most classic DSP operations address generation would be required and so the unit would not be available for general purpose use.
There is an interesting side effect of this. Texas Instruments rate the C60 as a 1600 MIPS device - on the basis that it runs at 200 MHz, and has two data paths each with four execution units: 200 MHz x 2 x 4=1600 MIPS. But from my diagram, treating the address generation separately, we see only three execution units per data path: 200 MHz x 2 x 3=1200 MIPS. The latter figure is that actually achieved in quoted benchmarks for an FIR filter, and reflects the device's ability to perform arithmetic.
This illustrates a problem in evaluating DSP processors. It is very hard to compare like with like - not least, because all manufacturers present their designs in such a way that they show their best performance. The lesson to draw is that one cannot rely on MIPS, MOPS or Mflops ratings but must carefully try to understand the features of each candidate processor and how they differ from each other - then make a choice based on the best match to the particular application. It is very important to note that a DSP processor's specialised design means it will achieve any quoted MIPS, MOPS or Mflops rating only if programmed to take advantage of all the parallel features it offers.
(Ebook) Advanced iOS app architecture: real-world app architecture in Swift by René Cacheaux, Josh Berlin ISBN 9781942878940, 194287894X 2024 scribd download
(Ebook) Advanced iOS app architecture: real-world app architecture in Swift by René Cacheaux, Josh Berlin ISBN 9781942878940, 194287894X 2024 scribd download