Beginner's Guide
Programmer’s Guide
Preliminary Draft
SPRU376A
August 2001
IMPORTANT NOTICE
Texas Instruments and its subsidiaries (TI) reserve the right to make changes to their products
or to discontinue any product or service without notice, and advise customers to obtain the latest
version of relevant information to verify, before placing orders, that information being relied on
is current and complete. All products are sold subject to the terms and conditions of sale supplied
at the time of order acknowledgment, including those pertaining to warranty, patent infringement,
and limitation of liability.
TI warrants performance of its products to the specifications applicable at the time of sale in
accordance with TI’s standard warranty. Testing and other quality control techniques are utilized
to the extent TI deems necessary to support this warranty. Specific testing of all parameters of
each device is not necessarily performed, except those mandated by government requirements.
Customers are responsible for their applications using TI components.
In order to minimize risks associated with the customer’s applications, adequate design and
operating safeguards must be provided by the customer to minimize inherent or procedural
hazards.
TI assumes no liability for applications assistance or customer product design. TI does not
warrant or represent that any license, either express or implied, is granted under any patent right,
copyright, mask work right, or other intellectual property right of TI covering or relating to any
combination, machine, or process in which such products or services might be or are used. TI’s
publication of information regarding any third party’s products or services does not constitute TI’s
approval, license, warranty or endorsement thereof.
Reproduction of information in TI data books or data sheets is permissible only if reproduction
is without alteration and is accompanied by all associated warranties, conditions, limitations and
notices. Representation or reproduction of this information with alteration voids all warranties
provided for an associated TI product or service, is an unfair and deceptive business practice,
and TI is not responsible nor liable for any such use.
Resale of TI's products or services with statements different from or beyond the parameters stated
by TI for that product or service voids all express and any implied warranties for the associated
TI product or service, is an unfair and deceptive business practice, and TI is not responsible nor
liable for any such use.
Also see: Standard Terms and Conditions of Sale for Semiconductor Products.
www.ti.com/sc/docs/stdterms.htm
Mailing Address:
Texas Instruments
Post Office Box 655303
Dallas, Texas 75265
Notational Conventions
This document uses the following conventions.
- The device number TMS320C55x is often abbreviated as C55x.
- In most cases, hexadecimal numbers are shown with the suffix h. For example, the following number is a hexadecimal 40 (decimal 64):
40h
Similarly, binary numbers usually are shown with the suffix b. For example,
the following number is the decimal number 4 shown in binary form:
0100b
Related Documentation From Texas Instruments
The following books describe the TMS320C55x devices and related support
tools. To obtain a copy of any of these TI documents, call the Texas
Instruments Literature Response Center at (800) 477-8924. When ordering,
please identify the book by its title and literature number.
The CPU, the registers, and the instruction sets are also described in online
documentation contained in Code Composer Studio.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
Lists some key features of the TMS320C55x DSP architecture and recommends a process for
code development.
1.1 TMS320C55x Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
1.2 Code Development Flow for Best Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
2 Tutorial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
Uses example code to walk you through the code development flow for the TMS320C55x DSP.
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
2.2 Writing Assembly Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
2.2.1 Allocate Sections for Code, Constants, and Variables . . . . . . . . . . . . . . . . . . . . 2-5
2.2.2 Processor Mode Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7
2.2.3 Setting up Addressing Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
2.3 Understanding the Linking Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
2.4 Building Your Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13
2.4.1 Creating a Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13
2.4.2 Adding Files to the Workspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14
2.4.3 Modifying Build Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
2.4.4 Building the Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18
2.5 Testing Your Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
2.6 Benchmarking Your Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21
3 Optimizing C Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
Describes how you can maximize the performance of your C code by using certain compiler
options, C code transformations, and compiler intrinsics.
3.1 Introduction to Writing C/C++ Code for a C55x DSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.1.1 Tips on Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.1.2 How to Write Multiplication Expressions Correctly in C Code . . . . . . . . . . . . . . 3-3
3.1.3 Memory Dependences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3.1.4 Analyzing C Code Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
3.2 Compiling the C/C++ Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
3.2.1 Compiler Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
3.2.2 Performing Program-Level Optimization (−pm Option) . . . . . . . . . . . . . . . . . . . . 3-9
3.2.3 Using Function Inlining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
3.3 Profiling Your Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
3.3.1 Using the clock() Function to Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
3.3.2 Using CCS 2.0 to Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Figures
1−1 Code Development Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
2−1 Section Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2−2 Extended Auxiliary Registers Structure (XARn) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9
2−3 Project Creation Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14
2−4 Add tutor.asm to Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
2−5 Add tutor.cmd to Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16
2−6 Build Options Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
2−7 Rebuild Complete Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18
3−1 Dependence Graph for Vector Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
4−1 Data Bus Usage During a Dual-MAC Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4−2 Computation Groupings for a Block FIR (4-Tap Filter Shown) . . . . . . . . . . . . . . . . . . . . . 4-7
4−3 Computation Groupings for a Single-Sample FIR With an
Even Number of Taps (4-Tap Filter Shown) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-12
4−4 Computation Groupings for a Single-Sample FIR With an
Odd Number of Taps (5-Tap Filter Shown) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
4−5 Matrix to Find Operators That Can Be Used in Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
4−6 CPU Operators and Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-20
4−7 Process for Applying User-Defined Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23
4−8 First Segment of the Pipeline (Fetch Pipeline) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-49
4−9 Second Segment of the Pipeline (Execution Pipeline) . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-50
5−1 4-Bit 2s-Complement Integer Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
5−2 8-Bit 2s-Complement Integer Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
5−3 4-Bit 2s-Complement Fractional Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
5−4 8-Bit 2s-Complement Fractional Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
5−5 Effect on CARRY of Addition Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
5−6 Effect on CARRY of Subtraction Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
5−7 32-Bit Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-14
6−1 FFT Flow Graph Showing Bit-Reversed Input and In-Order Output . . . . . . . . . . . . . . . . . . 6-4
7−1 Symmetric and Antisymmetric FIR Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7−2 Adaptive FIR Filter Implemented With the Least-Mean-Squares (LMS) Algorithm . . . . . . 7-6
7−3 Example of a Convolutional Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
7−4 Generation of an Output Stream G0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7−5 Bit Stream Multiplexing Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7−6 Butterfly Structure for K = 5, Rate 1/2 GSM Convolutional Encoder . . . . . . . . . . . . . . . . . 7-17
Tables
3−1 Compiler Options to Avoid on Performance-Critical Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
3−2 Compiler Options for Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
3−3 Compiler Options That May Degrade Performance and Improve Code Size . . . . . . . . . . . 3-8
3−4 Compiler Options for Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
3−5 Summary of C/C++ Code Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
3−6 TMS320C55x C Compiler Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
3−7 C Coding Methods for Generating Efficient C55x Assembly Code . . . . . . . . . . . . . . . . . . 3-40
3−8 Section Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
3−9 Possible Operand Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-46
4−1 CPU Data Buses and Constant Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-19
4−2 Basic Parallelism Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
4−3 Advanced Parallelism Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
4−4 Steps in Process for Applying User-Defined Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23
4−5 Pipeline Activity Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-51
4−6 Recommendations for Preventing Pipeline Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-54
4−7 Bit Groups for STx Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-63
4−8 Pipeline Register Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-64
4−9 Memory Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-72
4−10 C55x Data and Program Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-73
4−11 Half-Cycle Accesses to Dual-Access Memory (DARAM) and the Pipeline . . . . . . . . . . . . 4-73
4−12 Memory Accesses and the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-74
4−13 Cross-Reference Table Documented By Software Developers to Help
Software Integrators Generate an Optional Application Mapping . . . . . . . . . . . . . . . . . . . 4-78
6−1 Syntaxes for Bit-Reverse Addressing Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6−2 Bit-Reversed Addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6−3 Typical Bit-Reverse Initialization Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
7−1 Operands to the firs or firsn Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
Examples
2−1 Final Assembly Code of tutor.asm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
2−2 Partial Assembly Code of tutor.asm (Step 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2−3 Partial Assembly Code of tutor.asm (Step 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7
2−4 Partial Assembly Code of tutor.asm (Part 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9
2−5 Linker command file (tutor.cmd) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
2−6 Linker map file (test.map) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-11
2−7 X Memory Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
3−1 Generating a 16x16−>32 Multiply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
3−2 C Code for Vector Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3−3 Main Function File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
3−4 Sum Function File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
3−5 Assembly Code Generated With −o3 and −pm Options . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
3−6 Assembly Generated Using −o3, −pm, and −oi50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
3−7 Using the clock() Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
3−8 Simple Loop That Allows Use of localrepeat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-16
3−9 Assembly Code for localrepeat Generated by the Compiler . . . . . . . . . . . . . . . . . . . . . . . . 3-17
3−10 Inefficient Loop Code for Loop Variable and Constraints (C) . . . . . . . . . . . . . . . . . . . . . . . 3-18
3−11 Inefficient Loop Code for Variable and Constraints (Assembly) . . . . . . . . . . . . . . . . . . . . . 3-19
3−12 Using the MUST_ITERATE Pragma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
3−13 Assembly Code Generated With the MUST_ITERATE Pragma . . . . . . . . . . . . . . . . . . . . . 3-20
3−14 Use Local Rather Than Global Summation Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
3−15 Returning Q15 Result for Multiply Accumulate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
3−16 C Code for an FIR Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
3−17 FIR C Code After Unroll-and-Jam Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
3−18 FIR Filter With MUST_ITERATE Pragma and restrict Qualifier . . . . . . . . . . . . . . . . . . . . . 3-27
3−19 Generated Assembly for FIR Filter Showing Dual-MAC . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
3−20 Implementing Saturated Addition in C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29
3−21 Inefficient Assembly Code Generated by C Version of Saturated Addition . . . . . . . . . . . . 3-30
3−22 Single Call to _sadd Intrinsic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
3−23 Assembly Code Generated When Using Compiler Intrinsic for Saturated Add . . . . . . . . 3-31
3−24 Using ETSI Functions to Implement sadd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
3−25 Block Copy Using Long Data Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-34
3−26 Simulating Circular Addressing in C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
3−27 Assembly Output for Circular Addressing C Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-36
3−28 Circular Addressing Using Modulus Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
3−29 Assembly Output for Circular Addressing Using Modulus Operator . . . . . . . . . . . . . . . . . 3-38
Chapter 1
Introduction
This chapter lists some of the key features of the TMS320C55x (C55x) DSP
architecture and shows a recommended process for creating code that runs
efficiently.
TMS320C55x Architecture
- Software stacks that support 16-bit and 32-bit push and pop operations. You can use these stacks for data storage and retrieval. The CPU uses these stacks for automatic context saving (in response to a call or interrupt) and restoring (when returning to the calling or interrupted code sequence).
- A large number of data and address buses, to provide a high level of parallelism. One 32-bit data bus and one 24-bit address bus support instruction fetching. Three 16-bit data buses and three 24-bit address buses are used to transport data to the CPU. Two 16-bit data buses and two 24-bit address buses are used to transport data from the CPU.
- The following computation blocks: one 40-bit arithmetic logic unit (ALU), one 16-bit ALU, one 40-bit shifter, and two multiply-and-accumulate units (MACs). In a single cycle, each MAC can perform a 17-bit by 17-bit multiplication (fractional or integer) and a 40-bit addition or subtraction with optional 32-/40-bit saturation.
- Data address generation units that support linear, circular, and bit-reverse
addressing.
- Interrupt-control logic that can block (or mask) certain interrupts known as
the maskable interrupts.
Code Development Flow for Best Performance
[Figure 1−1. Code Development Flow (two-page flow chart). Each step follows the same pattern: compile, profile, and ask whether the code is efficient enough; if so, development is done. In Step 2 (Optimize C Code), if the code is still not efficient enough, you decide whether more C optimization is worthwhile before moving on. In Step 4 (Optimize Assembly Code), you optimize the assembly code and profile until the code is efficient enough.]
Step Goal
1 Write C Code: You can develop your code in C using the ANSI-compliant C55x C compiler without any knowledge of the C55x DSP. Use Code Composer Studio to identify any inefficient areas that you might have in your C code. After making your code functional, you can improve its performance by selecting higher-level optimization compiler options. If your code is still not as efficient as you would like it to be, proceed to step 2.
Chapter 2
Tutorial
This tutorial walks you through the code development flow introduced in Chapter 1, and introduces you to basic concepts of TMS320C55x (C55x) DSP programming. It uses step-by-step instructions and code examples to show you how to use the software development tools integrated under Code Composer Studio (CCS).
Installing CCS before beginning the tutorial allows you to edit, build, and debug DSP target programs. For more information about CCS features, see the CCS Tutorial. You can access the CCS Tutorial within CCS by choosing Help→Tutorial.
The examples in this tutorial use instructions from the mnemonic instruction set, but the concepts apply equally to the algebraic instruction set.
2.1 Introduction
This tutorial presents a simple assembly code example that adds four numbers together (y = x0 + x1 + x2 + x3). This example helps you become familiar with the basics of C55x programming, including:
- The four common C55x addressing modes and when to use them.
- The basic C55x tools required to develop and test your software.
This tutorial does not replace the information presented in other C55x documentation and is not intended to cover all the topics required to program the C55x efficiently.
Refer to the related documentation listed in the preface of this book for more
information about programming the C55x DSP. Much of this information has
been consolidated as part of the C55x Code Composer Studio online help.
For your convenience, all the files required to run this example can be downloaded with the TMS320C55x Programmer’s Guide (SPRU376) from http://www.ti.com/sc/docs/schome.htm. The examples in this chapter can be found in the 55xprgug_srccode\tutor directory.
Writing Assembly Code
The following rules should be considered when writing C55x assembly code:
- Labels
- Comments
The final assembly code product of this tutorial is displayed in Example 2−1, Final Assembly Code of tutor.asm. This code performs the addition of the elements in vector x. Sections of this code are highlighted in the three steps used to create this example.
For more information about assembly syntax, see the TMS320C55x Assembly
Language Tools User’s Guide (SPRU280).
end
NOP
B end
- .int value reserves a 16-bit word in memory and defines the initialization
value
- .def symbol makes a symbol global, known to external files, and indicates
that the symbol is defined in the current file. External files can access the
symbol by using the .ref directive. A symbol can be a label or a variable.
As shown in Example 2−2 and Figure 2−1, the example file tutor.asm contains three sections:
- vars, which reserves five uninitialized memory locations:
  - The first four are reserved for vector x (the input vector to add).
  - The last location, y, will be used to store the result of the addition.
- table, to hold the initialization values for x. The init label points to the beginning of section table.
- .text, to hold the program code.
Example 2−2 shows the partial assembly code used for allocating sections.
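As a hypothetical illustration only (the manual's actual listing appears in Example 2−2), directives like the following could produce this section layout; the .usect and .sect directives and the exact arrangement shown here are assumptions, not the tutorial's code:

```asm
        .def   start            ; make the entry point known to external files
        .def   x, y, init       ; export data labels so the debugger can see them

x       .usect "vars", 4        ; reserve 4 uninitialized 16-bit words for vector x
y       .usect "vars", 1        ; reserve 1 word for the result

        .sect  "table"          ; initialized data section
init    .int   1, 2, 3, 4       ; one 16-bit word per initialization value

        .text                   ; program code section
start:                          ; execution begins here
```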
Note: The algebraic instructions code example for Partial Assembly Code of tutor.asm (Step 1) is shown in Example B−1 on
page B-2.
[Figure 2−1. Section Allocation: the table section holds the four initialization values (1, 2, 3, 4) at label init, and the .text section holds the program code beginning at label start.]
- The AR0 and AR6 registers are set to linear addressing (instead of circular
addressing) using bit addressing mode to modify the status register bits.
- The processor has been set to C55x native mode instead of C54x-compatible mode.
Note: The algebraic instructions code example for Partial Assembly Code of tutor.asm (Step 2) is shown in Example B−2 on
page B-2.
- ARn indirect addressing (identified by *), in which you use auxiliary registers (ARx) as pointers.
- k23 absolute addressing (identified by #), which allows you to specify the entire 23-bit data address with a label.
- Bit addressing (identified by the bit instruction), which allows you to modify a single bit of a memory location or memory-mapped register (MMR).
For further details on these addressing modes, refer to the TMS320C55x DSP
CPU Reference Guide (SPRU371). Example 2−4 demonstrates the use of the
addressing modes discussed in this section.
In Step 3a, initialization values from the table section are copied to vector x (the vector to perform the addition) using indirect addressing. Figure 2−2 illustrates the structure of the extended auxiliary registers (XARn). The XARn register is used only during register initialization. Subsequent operations use ARn because only the lower 16 bits are affected (ARn operations are restricted to a 64K main data page). AR6 is used to hold the address of table, and AR0 is used to hold the address of x.
In Step 3b, direct addressing is used to add the four values. Notice that the
XDP register was initialized to point to variable x. The .dp assembler directive
is used to define the value of XDP, so the correct offset can be computed by
the assembler at compile time.
Finally, in Step 3c, the result is stored in y using absolute addressing. Absolute addressing provides an easy way to access a memory location without having to make XDP changes, but at the expense of increased code size.
end
NOP
B end
Note: The algebraic instructions code example for Partial Assembly Code of tutor.asm (Part3) is shown in Example B−3 on
page B-3.
Note: ARnH (upper 7 bits) specifies the 7-bit main data page. ARn (16-bit register) specifies a
16-bit offset to the 7-bit main data page to form a 23-bit address.
Understanding the Linking Process
The linker (lnk55.exe) assigns the final addresses to your code and data sections. This is necessary for your code to execute.
The file that instructs the linker to assign the addresses is called the linker command file (tutor.cmd) and is shown in Example 2−5. The linker command file syntax is covered in detail in the TMS320C55x Assembly Language Tools User’s Guide (SPRU280).
- All addresses and lengths given in the linker command file are byte addresses and byte lengths. This is in contrast to a TMS320C54x linker command file, which uses 16-bit word addresses and word lengths.
- The MEMORY linker directive declares all the physical memory available in your system (for example, a DARAM memory block at location 0x100 with a length of 0x8000 bytes). Memory blocks cannot overlap.
- The SECTIONS linker directive lists all the sections contained in your input
files and where you want the linker to allocate them.
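As a sketch only (the actual tutor.cmd appears in Example 2−5), a minimal command file combining the pieces described above might look like the following; the exact section placement shown is an assumption:

```
-e start                       /* entry point used when loading tutor.out */

MEMORY
{
    DARAM: origin = 0x100, length = 0x8000   /* byte address and byte length */
}

SECTIONS
{
    vars  : > DARAM    /* uninitialized variables (x and y)  */
    table : > DARAM    /* initialization values (init label) */
    .text : > DARAM    /* program code (start label)         */
}
```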
When you build your project in Section 2.4, this code produces two files, tutor.out and tutor.map. Review the test.map file, Example 2−6, to verify the addresses for x, y, and table. Notice that the linker reports byte addresses for program labels such as start and .text, and 16-bit word addresses for data labels like x, y, and table. The C55x DSP uses byte addressing to access variable-length instructions. Instructions can be 1 to 6 bytes long.
******************************************************************************
TMS320C55xx COFF Linker
******************************************************************************
>> Linked Mon Feb 14 14:52:21 2000
MEMORY CONFIGURATION
output attributes/
section page orgn(bytes) orgn(words) len(bytes) len(words) input sections
−−−−−−−− −−−− −−−−−−−−−−− −−−−−−−−−−− −−−−−−−−−− −−−−−−−−−− −−−−−−−−−−−−−−
vars 0 00000080 00000005 UNINITIALIZED
00000080 00000005 test.obj (vars)
abs. value/
byte addr word addr name
−−−−−−−−− −−−−−−−−− −−−−
00000000 .bss
00000000 .data
00010008 .text
00000000 ___bss__
00000000 ___data__
00000000 ___edata__
00000000 ___end__
00010040 ___etext__
00010008 ___text__
00000000 edata
00000000 end
00010040 etext
00008000 init
00010008 start
00000080 x
00000084 y
abs. value/
byte addr word addr name
−−−−−−−−− −−−−−−−−− −−−−
00000000 ___end__
00000000 ___edata__
00000000 end
00000000 edata
00000000 ___data__
00000000 .data
00000000 .bss
00000000 ___bss__
00000080 x
00000084 y
00008000 init
00010008 start
00010008 .text
00010008 ___text__
00010040 ___etext__
00010040 etext
[16 symbols]
Building Your Program
Before building your program, you must set up your work environment and
create a .pjt file. Setting up your work environment involves the following tasks:
- Creating a project
1) From the Project menu, choose New and enter the values shown in
Figure 2−3.
2) Select Finish.
You have now created a project named tutor.pjt and saved it in the new
c:\ti\myprojects\tutor folder.
1) Navigate to the directory where the tutorial files are located (the 55xprgug_srccode\tutor directory) and copy them into the c:\ti\myprojects\tutor directory. As an alternative, you can create your own source files by choosing File→New→Source File and typing the source code from the examples in this book.
2) Add the two files to the tutor.pjt project. Highlight tutor.pjt, right-click the
mouse, select Add Files, browse for the tutor.asm file, select it, and click
Open, as shown in Figure 2−4. Do the same for tutor.cmd, as shown in
Figure 2−5.
2) Select the Linker tab and enter fields as shown in Figure 2−6.
When you build your project, CCS compiles, assembles, and links your code
in one step. The assembler reads the assembly source file and converts C55x
instructions to their corresponding binary encoding. The result of the assembly
process is an object file, tutor.obj, in industry-standard COFF binary format.
The object file contains all of your code and variables, but the addresses for
the different sections of code are not assigned. This assignment takes place
during the linking process.
Testing Your Code
Load tutor.out
2) Navigate to and select tutor.out (in the \debug directory), then choose
Open.
CCS now displays the tutor.asm source code beginning at the start label,
because of the entry point defined in the linker command file (−e start).
Otherwise, it would have shown the location pointed to by the reset vector.
The labels x, y, and init are visible to the simulator (using View→ Memory) be-
cause they were exported as symbols (using the .def directive in tutor.asm).
The -g option was used to enable assembly source debugging.
Now, single-step through the code to the end label by selecting Debug→Step
Into. Examine the X Memory window to verify that the table values populate
x and that y gets the value 0xa (1 + 2 + 3 + 4 = 10 = 0xa), as shown in
Example 2−7.
Benchmarking Your Code
Set breakpoints
2) Set one breakpoint at the beginning of the code you want to benchmark
(first instruction after start): Right-click on the instruction next to the copy
label and choose Toggle Breakpoint.
3) Set one breakpoint marking the end: Right-click on the instruction next to
the end label and choose Toggle Breakpoint.
4) The Clock Window displays the number of cycles the code took to execute
between the breakpoints, which was approximately 17.
Chapter 3
You can maximize the performance of your C code by using certain compiler
options, C code transformations, and compiler intrinsics. This chapter dis-
cusses features of the C language relevant to compilation on the
TMS320C55x (C55x) DSP, performance-enhancing options for the compiler,
and C55x-specific code transformations that improve C code performance. All
assembly language examples were generated for the large memory model via
the −ml compiler option.
Introduction to Writing C/C++ Code for a C55x DSP
Give careful consideration to the data type size when writing your code. The
C55x compiler defines a size for each C data type (signed and unsigned):
char 16 bits
short 16 bits
int 16 bits
long 32 bits
long long 40 bits
float 32 bits
double 64 bits
Floating point values are in the IEEE format. Based on the size of each data
type, follow these guidelines when writing your code:
- Avoid code that assumes that int and long types are the same size.
- Use the int data type for fixed-point arithmetic (especially multiplication)
whenever possible. Using type long for multiplication operands will result
in calls to a run-time library routine.
- Use int or unsigned int types rather than long for loop counters.
The C55x has mechanisms for efficient hardware loops, but hardware
loop counters are only 16 bits wide.
When writing code to be used on multiple DSP targets, it may be wise to define
“generic” types for the standard C types. For example, one could use the types
Int16 and Int32 for a 16 bit integer type and 32 bit integer type respectively.
When compiling for the C55x DSP, these types would be type defined to int
and long, respectively.
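As a sketch of this convention (the Int16/Int32 names follow the text; the host-side mapping to the exact-width <stdint.h> types, and the __TMS320C55X__ guard, are assumptions for portability):

```c
#include <stdint.h>

/* On the C55x these would be typedef'd to int and long; on a typical host,
   the exact-width <stdint.h> types give the same 16- and 32-bit sizes. */
#ifdef __TMS320C55X__
typedef int  Int16;
typedef long Int32;
#else
typedef int16_t Int16;
typedef int32_t Int32;
#endif
```

Code written against Int16/Int32 then keeps its arithmetic width when moved between the C55x and other targets.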
In general, it is best to use the type int for loop index variables and other
integer variables where the number of bits is unimportant, as int typically
represents the most efficient integer type for the target to manipulate,
regardless of architecture.
Writing multiplication expressions in C code so that they are both correct and
efficient can be confusing, especially when technically illegal expressions can,
in some circumstances, generate the code you wanted in the first place. This
section will help you choose the correct expression for your algorithm.
A 16-bit multiplication with a 32-bit result is an operation that does not
directly exist in the C language but does exist on C55x hardware, and it is vital
to the performance of multiply-and-accumulate (MAC)-like algorithms.
Example 3−1 shows two incorrect ways and a correct way to write such a multi-
plication in C code.
Note that the same rules also apply to other C arithmetic operators. For
example, if you want to add two 16-bit numbers and get a full 32-bit result, the
correct syntax is:
long res = (long)(int)src1 + (long)(int)src2;
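The same casting rule governs the 16×16→32 multiply; it can be checked host-portably as follows (int16_t/int32_t stand in for the C55x int/long, and the function name is illustrative):

```c
#include <stdint.h>

/* Correct: widen both 16-bit operands BEFORE the multiply so the full
   32-bit product is kept. Casting only the finished product would widen
   an already-truncated 16-bit result. */
int32_t mul16x16(int16_t src1, int16_t src2)
{
    return (int32_t)src1 * (int32_t)src2;
}
```

On the C55x, this pattern is what lets the compiler map the operation onto a single hardware multiply rather than a run-time library call.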
To maximize the efficiency of your code, the C55x compiler reorders instructions
to minimize pipeline stalls, puts certain assembly instructions in parallel,
and generates dual multiply-and-accumulate (dual-MAC) instructions. These
transformations require the compiler to determine the relationships, or depen-
dences, between instructions. Dependence means that one instruction must
occur before another. For example, a variable may need to be loaded from
memory before it can be used. Because only independent instructions can be
scheduled in parallel or reordered, dependences inhibit parallelism and code
movement. If the compiler cannot prove that two instructions are independent,
it must assume they are dependent, keep them in the order in which they
originally appeared, and not schedule them in parallel.
Often it is difficult for the compiler to determine whether instructions that ac-
cess memory are independent. The following techniques help the compiler de-
termine which instructions are independent:
- Use the restrict keyword to indicate that a pointer is the only pointer
that can point to a particular object in the scope in which the pointer is
declared.
- Use the –pm option which gives the compiler global access to the whole
program and allows it to be more aggressive in ruling out dependences.
[Figure: dependence graph for vecsum — the loads of in1[i] and in2[i] feed an Add node, whose result is stored to sum[i]]
- The paths from the store of sum[i] back to the loads of in1[i] and
in2[i] indicate that writing to sum may have an effect on the memory
pointed to by either in1 or in2.
- A read from in1 or in2 cannot begin until the write to sum finishes, which
creates an aliasing problem. Aliasing occurs when two pointers can point
to the same memory location. For example, if vecsum() is called in a pro-
gram with the following statements, in1 and sum alias each other be-
cause they both point to the same memory location:
short a[10], b[10];
vecsum(a, a, b, 10);
To help the compiler resolve memory dependences, you can qualify a pointer
or array with the restrict keyword. Its use represents a guarantee by the
programmer that within the scope of the pointer declaration, the object pointed
to can be accessed only by that pointer. Any violation of this guarantee renders
the behavior of the program undefined. This practice helps the compiler opti-
mize certain sections of code because aliasing information can be more easily
determined.
In the declaration of the vector sum function you can use the restrict key-
word to tell the compiler that sum is the only pointer that points to that object:
void vecsum(int * restrict sum, int *in1, int *in2, int N)
(Likewise, you could add restrict to in1 and in2 as well.) The next piece of
code shows how to use restrict with an array function parameter instead of
a pointer:
void vecsum(int sum[restrict], int *in1, int *in2, int N)
Caution must be exercised when using restrict. Consider this call of vecsum()
(with the sum parameter qualified by restrict):
vecsum(a, a, b, 10);
Undefined behavior would result because sum and in1 would point to the
same object, which violates sum’s declaration as restrict.
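Putting the pieces together, a host-compilable sketch of the restrict-qualified vecsum (the element-wise body is an assumption, since the full example is not reproduced in this excerpt):

```c
/* sum is restrict-qualified: the caller promises that, in this scope, no
   other pointer accesses the memory sum points to. */
void vecsum(int * restrict sum, int *in1, int *in2, int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
```

A call such as vecsum(out, a, b, 10) with a distinct out array honors the guarantee; vecsum(a, a, b, 10) breaks it and yields undefined behavior, exactly as described above.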
Use the following techniques to analyze the performance of specific code re-
gions:
- Use the clock() and printf() functions in C/C++ code to time and dis-
play the performance of specific code regions. You can use the stand-
alone simulator (load55) for this purpose. Remember to subtract out the
overhead time of calling the clock() function.
- Enable the clock and use profile points and the RUN command in the Code
Composer Studio debugger to track the number of CPU clock cycles con-
sumed by a particular section of code.
- Put each loop into a separate file that can be rewritten, recompiled, and
run with the stand-alone simulator (load55). The critical performance
areas in your code are most often loops.
As you use the techniques described in this chapter to optimize your C/C++
code, you can then evaluate the performance results by running the code and
looking at the instructions generated by the compiler. More detail on perfor-
mance analysis can be found in section 3.3.
Compiling the C/C++ Code
For a complete description of the C/C++ compiler and the options discussed in
this section, see the TMS320C55x Optimizing C Compiler User’s Guide
(SPRU281).
Options control the operation of the compiler. This section introduces you to
the recommended options for performance, information gathering, and code
size.
First, make note of the options to avoid using on performance-critical code.
The options described in Table 3−1 are intended for debugging and could
potentially decrease performance and increase code size.
Option Description
−g, −s, −ss, −gp These options are intended for debugging and can limit the
amount of optimization across C statements leading to larger
code size and slower execution.
−o1, −o0 Always use −o2/−o3 to maximize compiler analysis and optimization.
The options in Table 3−2 can be used to improve performance. The options
−o3, −pm, −mb, −oi50, and −op2 are recommended for maximum perfor-
mance.
Option Description
−o3 Represents the highest level of optimization available. Various loop
optimizations are performed, and various file-level characteristics
are also used to improve performance.
−mb Asserts to the compiler that all data is on-chip. This option is used
to enable the compiler to generate dual-MAC. See section 3.4.2.2
for more details.
−op2 When used with −pm, this option allows the compiler to assume
that the program being compiled does not contain any functions or
variables called or modified from outside the current file. The com-
piler is free to remove any functions or variables that are unused in
the current file.
The options described in Table 3−3 can be used to improve code size with a
possible degradation in performance.
Table 3−3. Compiler Options That May Degrade Performance and Improve Code Size
Option Description
−ms Encourages the compiler to optimize for code space. (Default is to
optimize for performance.)
Example 3−3 and Example 3−4 show the content of two files. One file contains
the source for the main function and the second file contains source for a small
function called sum.
When this code is compiled with −o3 and −pm options, the optimizer has
enough information about the calls to sum to determine that the same loop
count is used for both calls. It therefore eliminates the argument n from the call
to the function and explicitly uses the count in the repeat single instruction as
shown in Example 3−5.
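Examples 3−3 and 3−4 are not reproduced in this excerpt. A minimal reconstruction consistent with the assembly in Example 3−5 (the names a, b, sum1, and sum2 are suggested by the _a/_b/_sum1/_sum2 labels; the exact bodies are assumptions) might be:

```c
/* file2.c -- the small function (cf. Example 3-4) */
int sum(const int *p, int n)
{
    int i, s = 0;

    for (i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* file1.c -- the caller (cf. Example 3-3). Both calls pass the same
   count, 10, which is what lets -pm/-o3 delete the n argument and
   hard-code the repeat count (RPT #9) in the generated assembly. */
int a[10], b[10];
int sum1, sum2;

void caller(void)
{
    sum1 = sum(a, 10);
    sum2 = sum(b, 10);
}
```
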
Example 3−5. Assembly Code Generated With −o3 and −pm Options
_sum:
;** Parameter deleted n == 9u
MOV #0, T0 ; |3|
RPT #9
ADD *AR0+, T0, T0
return ; |11|
_main:
AADD #−1, SP
AMOV #_a, XAR0 ; |9|
call #_sum ; |9|
; call occurs [#_sum] ; |9|
MOV T0, *(#_sum1) ; |9|
AMOV #_b, XAR0 ; |10|
call #_sum ; |10|
; call occurs [#_sum] ; |10|
MOV T0, *(#_sum2) ; |10|
AADD #1, SP
return
; return occurs
Note: The algebraic instructions code example for Assembly Code Generated With −o3 and −pm Options is shown in
Example B−4 on page B-4.
Example 3−6 shows the resulting assembly instructions when the code in
Example 3−3 and Example 3−4 is compiled with −o3, −pm, and −oi50 op-
tions.
In main, the function calls to sum have been inlined. However, code for the
body of function sum has still been generated. The compiler must generate this
code because it does not have enough information to eliminate the possibility
that the function sum may be called by some other externally defined function.
If no external function calls sum, it can be declared as static inline. The
compiler will then be able to eliminate the code for sum after inlining.
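Under that assumption, the definition would simply gain the two qualifiers (a sketch; the body mirrors the sum function discussed above):

```c
/* With static inline and no external callers, the compiler can inline
   every call and then discard the out-of-line body entirely. */
static inline int sum(const int *p, int n)
{
    int i, s = 0;

    for (i = 0; i < n; i++)
        s += p[i];
    return s;
}
```
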
_sum:
MOV #0, T0 ; |3|
RPT #9
ADD *AR0+, T0, T0
return ; |11|
_main:
AMOV #_a, XAR3 ; |9|
RPT #9
|| MOV #0, AR1 ; |3|
ADD *AR3+, AR1, AR1
Note: The algebraic instructions code example for Assembly Generated Using −o3, −pm, and −oi50 is shown in Example B−5
on page B-5.
Profiling Your Code
To get cycle count information for a function or region of code with the stand-
alone simulator, embed the clock() function in your C code. Example 3−7
demonstrates this technique.
#include <stdio.h>
#include <time.h> /* Need time.h in order to call clock() */
int main()
{
clock_t start, stop, overhead;
start = clock(); /* Calculate the overhead of calling clock */
stop = clock(); /* and subtract this amount from the results. */
overhead = stop - start;
start = clock();
/* Function or Code Region to time goes here */
stop = clock();
printf("cycles: %ld\n", (long)(stop - start - overhead));
return(0);
}
Caution: Using clock() to time a region of code could increase the cycle
count of that region due to the extra variables needed to hold the timing infor-
mation (the stop, start, and overhead variables above). Wrapping
clock() around a function call should not affect the cycle count of that func-
tion.
Code Composer Studio (CCS) 2.0 has extensive profiling options that can be
used to profile your C code. First you must enable the clock by selecting En-
able Clock from the Profiler menu. Selecting Start New Session from the Profil-
er menu starts a new profiling session. To profile all functions, click on the Pro-
file All Functions button in the profiler session window. To profile certain func-
tions or regions of code, click the Create Profile Area and enter the starting and
ending line numbers of the code you wish to profile. (Note that you must build
your code for debugging (−g option) to enable this feature.) Then, run your pro-
gram and the profile information will be updated in the profiler session window.
More information on profiling with CCS 2.0 can be found in the online docu-
mentation.
Refining the C/C++ Code
- Create loops that efficiently use C55x hardware loops, MAC hardware,
and dual-MAC hardware.
void vecsum(const short *a, const short *b, short *c, unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
Note: The algebraic instructions code example for Assembly Code for localrepeat Generated by the Compiler is shown in
Example B−6 on page B-5.
A trip count is the number of times that a loop executes; the trip counter is
the variable used to count each iteration. When the trip counter reaches the
trip count, the loop terminates. Maximum performance for loop code is gained
when the compiler can determine the exact minimum and maximum trip
count. To this end, use the following techniques to convey trip
count information to the compiler:
- Use the int (or unsigned int) type for the trip counter variable,
whenever possible.
- Be sure to use the −o3 and −pm compiler options to allow the optimizer
access to the whole program or large parts of it and to characterize the be-
havior of loop trip counts.
Using int Type. Using the type int for the trip counter is important to allow
the compiler to generate hardware looping constructs.
If, for example, i and n were declared to be of type long, no hardware loop
could be generated. This is because the C55x internal loop iteration count reg-
ister is only 16 bits wide. If i and n are declared as type int, then the compiler
will generate a hardware loop.
Example 3−10 shows code to compute the sum of a vector. The corresponding
assembly code is shown in Example 3−11. Notice the conditional branch that
jumps around the loop body in the generated assembly code. The compiler
must insert this additional code if there is any possibility that the loop could ex-
ecute zero times. In this particular case the loop upper bound n is an integer.
Thus, n could be zero or negative in which case C semantics would dictate that
the for loop body would not execute. A hardware loop must execute at least
once, so the jump-around code ensures correct execution in cases where
n <= 0.
Example 3−10. Inefficient Loop Code for Loop Variable and Constraints (C)
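The C source for Example 3−10 is not reproduced in this excerpt; a reconstruction consistent with the surrounding text and with the assembly in Example 3−11 (names are assumptions) would be:

```c
/* n is a signed int, so the compiler must allow for n <= 0 and emit a
   conditional branch that jumps around the hardware loop. */
int sum(const int *a, int n)
{
    int i, sum = 0;

    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```
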
Example 3−11. Inefficient Loop Code for Variable and Constraints (Assembly)
_sum:
MOV #0, AR1 ; |3|
BCC L2,T0 <= #0 ; |6|
; branch occurs ; |6|
SUB #1, T0, AR2
MOV AR2, CSR
RPT CSR
ADD *AR0+, AR1, AR1
Note: The algebraic instructions code example for Inefficient Loop Code for Variable and Constraints (Assembly) is shown in
Example B−7 on page B-6.
If it is known that the loop always executes at least once, this fact can be com-
municated to the compiler via the MUST_ITERATE pragma. Example 3−12
shows how to use the pragma for this piece of code. Example 3−13 shows the
more efficient assembly code that can now be generated because of the prag-
ma.
int sum(const int *a, int n)
{
    int i, sum = 0;

    #pragma MUST_ITERATE(1)
    for(i=0; i<n; i++)
    {
        sum += a[i];
    }
    return sum;
}
(Note that the same effect could be achieved by using an _nassert to
assert to the compiler that n is greater than zero: _nassert(n>0).)
_sum:
SUB #1, T0, AR2
MOV AR2, CSR
MOV #0, AR1 ; |3|
RPT CSR
ADD *AR0+, AR1, AR1
Note: The algebraic instructions code example for Assembly Code Generated With the MUST_ITERATE Pragma is shown in
Example B−8 on page B-6.
All fields of #pragma MUST_ITERATE(min, max, mult) are optional. min is the minimum number of iterations of the loop, max
is the maximum number of iterations of the loop, and mult tells the compiler
that the loop always executes a multiple of mult times. If some of these values
are not known until run time, do not include them in the pragma. Incorrect infor-
mation communicated via the pragma could result in undefined program beha-
vior. The MUST_ITERATE pragma must appear immediately before the loop
that it is meant to describe in the C code. MUST_ITERATE can be used in the
following ways:
- It can convey that the trip count will be greater than some minimum value.
/* This loop will always execute at least 30 times */
#pragma MUST_ITERATE(30)
for(j=0; j<x; j++)
- It can convey the maximum trip count.
/* The loop will execute no more than 100 times */
#pragma MUST_ITERATE(,100)
for (j=0; j<x; j++)
- It can convey that the trip count is always divisible by a value.
/* The loop will execute some multiple of 4 times */
#pragma MUST_ITERATE(,,4)
for (j=0; j<x; j++)
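The three fields can also be combined. The pragma is meaningful to cl55 and ignored (with at most a warning) by host compilers, so the sketch below is host-compilable; the surrounding function is illustrative:

```c
/* The pragma promises the loop runs at least 8 times, at most 48 times,
   and always a multiple of 8 times. */
int sum_block(const int *a, int x)
{
    int j, s = 0;

    #pragma MUST_ITERATE(8, 48, 8)
    for (j = 0; j < x; j++)
        s += a[j];
    return s;
}
```
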
Consider the following loop header (from the ETSI gsmefr benchmark):
for(i=a[0]; i < 40; i +=5)
To generate a hardware loop, the compiler would need to emit code that would
determine the number of loop iterations at run time. This code would require an
integer division. Since this is computationally expensive, the compiler will not
generate such code and will not generate a hardware loop. However, if the pro-
grammer knows that, for example, a[0] is always less than or equal to 4, then
the loop always executes exactly eight times. This can be communicated via a
MUST_ITERATE pragma enabling the compiler to generate an efficient hard-
ware loop:
#pragma MUST_ITERATE(8,8)
The compiler can generate a very efficient single repeat MAC construct (that
is, a repeat (RPT) loop with a MAC as its only instruction). To facilitate the
generation of single repeat MAC constructs, use local rather than global vari-
ables for the summation, as shown in Example 3−14. If a global variable is
used, the compiler is obligated to perform an intervening storage to the global
object. This prevents it from generating a single repeat.
In the case where Q15 arithmetic is being simulated, the result of the MAC op-
eration may be accumulated into a long object. The result may then be shifted
and truncated before the return, as shown in Example 3−15.
/* Not recommended */
int gsum=0;
void dotp1(const int *x, const int *y, unsigned int n)
{
unsigned int i;
for(i=0; i<=n-1; i++)
gsum += x[i] * y[i];
}
/* Recommended */
int dotp2(const int *x, const int *y, unsigned int n)
{
unsigned int i;
int lsum=0;
for(i=0; i<=n-1; i++)
lsum += x[i] * y[i];
return lsum;
}
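Example 3−15 (accumulating the MAC results into a long and shifting before the return, for simulated Q15 arithmetic) is not reproduced in this excerpt; a hedged sketch of that idea follows, with the scaling shift as an assumption:

```c
/* Q15 dot product: each product of two Q15 values is Q30; accumulate in
   a long, then shift and truncate back to a 16-bit-style result. */
int dotp_q15(const int *x, const int *y, unsigned int n)
{
    unsigned int i;
    long lsum = 0;

    for (i = 0; i < n; i++)
        lsum += (long)x[i] * y[i];
    return (int)(lsum >> 15);   /* shift amount assumed for illustration */
}
```

As in Example 3−14, the local accumulator (rather than a global) is what allows the compiler to keep the sum in an accumulator register and emit a single-repeat MAC.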
In order for the compiler to generate a dual-MAC operation, the code must
have two consecutive MAC (or MAS/multiply) instructions that get all their mul-
tiplicands from memory and share one multiplicand. The two operations must
not write their results to the same variable or location. The compiler can easily
turn this example into a dual-MAC:
int *a,*b, onchip *c;
long s1,s2;
[...]
s1 = s1 + (*a++ * *c);
s2 = s2 + (*b++ * *c++);
This is a sequence of two MAC instructions that share the *c memory
reference. Intrinsics can also be transformed into dual-MACs:
s1 = _smac(s1,*a++,*c);
s2 = _smac(s2,*b++,*c++);
You must inform the compiler that the memory pointed to by the shared dual-
MAC operand is on chip (a requirement for the addressing mode used for the
shared operand). There are two ways to do this. The first (and preferred) way
involves the use of the onchip type qualifier. It is used like this:
void foo(int onchip *a)
{
int onchip b[10];
...
}
This keyword can be applied to any pointer or array and indicates that the
memory pointed to by that pointer or array is always on chip.
The second technique is to compile with the −mb switch (passed to cl55). This
asserts to the compiler that all data pointed to by the shared dual-MAC pointer
will be on chip. This switch is a shortcut. Instead of putting many onchip quali-
fiers into the code, −mb can be used instead. You must ensure that all required
data will be on chip. If −mb is used and some data pointed to by a shared dual-
MAC pointer is not on chip, undefined behavior may result. Remember, this is a
shortcut. The onchip keyword should be used to enable dual-MAC opera-
tions in most circumstances. Using −mb could result in dual-MACs being gen-
erated in unexpected or undesirable places.
Unfortunately, a lot of C code that could benefit from using dual-MACs is not
written in such a way as to enable the compiler to generate them. However, the
compiler can sometimes transform the code in such a way to generate a dual-
MAC. For example, look at Example 3−16 which shows a C version of a simple
FIR filter. (Notice the onchip keyword used for the pointer parameter h.) In
order to generate a dual-MAC in this case, the compiler must somehow gener-
ate two consecutive MAC operations from the single MAC operation in the
code. This is done via a loop transformation called unroll-and-jam. This trans-
formation replicates the outer loop and then fuses the two resulting inner loops
back together. Example 3−17 shows what the code in Example 3−16 would
look like if unroll-and-jam were applied manually.
void fir(short onchip *h, short *x, short *y, short m, short n)
{
    short i,j;
    long y0,y1;
    for (j = 0; j < m; j+=2)
    {
        y0 = 0;
        y1 = 0;
        for (i = 0; i < n; i++) {
            y0 += x[i+j] * h[i];    /* remainder reconstructed;   */
            y1 += x[i+j+1] * h[i];  /* store scaling is assumed   */
        }
        y[j] = (short)(y0 >> 16);
        y[j+1] = (short)(y1 >> 16);
    }
}
Notice that now we are computing two separate sums (y0 and y1) for each
iteration of the outer loop. If this C code were fed to the compiler, it would gen-
erate a dual-MAC in the inner loop. The compiler can perform the unroll-and-
jam transformation automatically, but the programmer must provide additional
information to ensure that the transformation is safe.
- The compiler must determine that the outer loop repeats an even number
of times. If the loop bounds are provably constant, the compiler can deter-
mine this automatically. Otherwise, if the user knows that the loop always
repeats an even number of times, a MUST_ITERATE pragma can be
used immediately preceding the outer loop:
#pragma MUST_ITERATE(1,,2)
(Note that the first parameter (1) indicates that the outer loop always exe-
cutes at least once. This is to eliminate loop jump-around code as de-
scribed in section 3.4.1.4 on page 3-18.)
- The compiler must also know that the inner loop executes at least once.
This can be specified by inserting the following MUST_ITERATE pragma
just before the for statement of the inner loop:
#pragma MUST_ITERATE(1)
- The compiler must also know that there are no memory conflicts in the loop
nest. In our example, that means the compiler must know that all the writes
to array y cannot affect the values in array x or h. Consider the code in
Example 3−17 on page 3-25. We have changed the order of memory ac-
cesses by performing unroll-and-jam. In the transformed code, twice as
many reads from x (and h) occur before any writes to y. If writes to y could
affect the data pointed to by x (or h), the transformed code could produce
different results. If these three arrays were locally declared arrays, the
compiler would not have a problem. In this case we pass the arrays into
the function via pointer parameters. If the programmer is sure that writes
to y will not affect the arrays x and h within the function, the restrict
keyword can be used in the function declaration:
void fir(short onchip *h, short *x, short * restrict
y, short m, short n)
The restrict keyword tells the compiler that no other variable will point
at the memory that y points to. (See section 3.1.3 for more information on
memory dependences and restrict.) The final C code is shown in
Example 3−18, and the corresponding assembly code in Example 3−19.
Even using the MUST_ITERATE pragma and restrict qualifiers, some
loops may still be too complicated for the compiler to generate as dual-
MACs. If there is a piece of code you feel could benefit from dual-MAC op-
erations, it may be necessary to transform the code by hand. This process
is similar to the transformations described for writing dual-MAC operations
in assembly code as described in section 4.1.
Example 3−18. FIR Filter With MUST_ITERATE Pragma and restrict Qualifier
_fir:
ADD #1, T0, AR3
SFTS AR3, #−1
SUB #1, AR3
MOV AR3, BRC0
PSH T3, T2
MOV #0, T3 ; |6|
|| MOV XAR0, XCDP
AADD #−1, SP
RPTBLOCAL L4−1
AADD #1, SP
POP T3,T2
return
Note: The algebraic instructions code example for Generated Assembly for FIR Filter Showing Dual-MAC is shown in
Example B−9 on page B-7.
Example 3−21 shows the resultant inefficient assembly language code generated by the compiler.
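The C source itself (Example 3−20) is not reproduced in this excerpt; a reconstruction consistent with the bit-15 tests visible in the assembly below (the exact saturation constants are assumptions) is:

```c
/* Simulated 16-bit saturated add: if both operands share a sign (bit 15)
   and the sum's sign differs, clamp to the corresponding 16-bit extreme. */
int sadd(int a, int b)
{
    int result = a + b;

    if (((a ^ b) & 0x8000) == 0 &&      /* operands share a sign...  */
        ((result ^ a) & 0x8000) != 0)   /* ...and the sum flipped it */
        result = (a < 0) ? -32768 : 32767;
    return result;
}
```

The two XOR/bit-test pairs here correspond directly to the XOR and BTST @#15 instructions the compiler emits.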
_sadd:
MOV T1, AR1 ; |5|
XOR T0, T1 ; |9|
BTST @#15, T1, TC1 ; |9|
ADD T0, AR1
BCC L2,TC1 ; |9|
; branch occurs ; |9|
MOV T0, AR2 ; |9|
XOR AR1, AR2 ; |9|
BTST @#15, AR2, TC1 ; |9|
BCC L2,!TC1 ; |9|
; branch occurs ; |9|
BCC L1,T0 < #0 ; |22|
; branch occurs ; |22|
MOV #32767, T0 ; |22|
B L3 ; |22|
; branch occurs ; |22|
L1:
MOV #−32768, AR1 ; |22|
L2:
MOV AR1, T0 ; |25|
L3:
return ; |25|
; return occurs ; |25|
Note: The algebraic instructions code example for Inefficient Assembly Code Generated by C Version of Saturated Addition is
shown in Example B−10 on page B-8.
The code for the C simulated saturated addition can be replaced by a single
call to the _sadd intrinsic as is shown in Example 3−22. The assembly code
generated for this C source is shown in Example 3−23.
Note that using compiler intrinsics reduces the portability of your code. You
may consider using ETSI functions instead of intrinsics. These functions can
be mapped to intrinsics for various targets. For C55x code, the file gsm.h de-
fines the ETSI functions using compiler intrinsics. (The actual C code ETSI
functions can be used when compiling on the host or other target without intrin-
sics.) For example, the code in Example 3−22 could be rewritten to use the
ETSI add function as shown in Example 3−24. The ETSI add function is
mapped to the _sadd compiler intrinsic in the header file gsm.h. (Of course,
you probably want to replace calls to the sadd function with calls to the ETSI
add function.)
Table 3−6 lists the intrinsics supported by the C55x compiler. For more infor-
mation on using intrinsics, please refer to the TMS320C55x Optimizing C
Compiler User’s Guide (SPRU281).
Example 3−23. Assembly Code Generated When Using Compiler Intrinsic for
Saturated Add
_sadd:
BSET ST3_SATA
ADD T1, T0 ; |3|
BCLR ST3_SATA
return ; |3|
; return occurs ; |3|
Note: The algebraic instructions code example for Assembly Code Generated When Using Compiler Intrinsic for Saturated
Add is shown in Example B−11 on page B-9.
#include <gsm.h>
int sadd(int a, int b)
{
return add(a,b);
}
int _sadd(int src1, int src2); Adds two 16-bit integers, producing a saturated 16-bit re-
sult (SATA bit set)
long _lsadd(long src1, long src2); Adds two 32-bit integers, producing a saturated 32-bit re-
sult (SATD bit set)
long long _llsadd(long long src1, long long src2); Adds two 40-bit integers, producing a saturated 40-bit re-
sult (SATD bit set)
int _ssub(int src1, int src2); Subtracts src2 from src1, producing a saturated 16-bit
result (SATA bit set)
long _lssub(long src1, long src2); Subtracts src2 from src1, producing a saturated 32-bit
result (SATD bit set)
long long _llssub(long long src1, long long src2); Subtracts src2 from src1, producing a saturated 40-bit
result (SATD bit set)
int _smpy(int src1, int src2); Multiplies src1 and src2, and shifts the result left by 1. Pro-
duces a saturated 16-bit result. (SATD and FRCT bits set)
long _lsmpy(int src1, int src2); Multiplies src1 and src2, and shifts the result left by 1. Pro-
duces a saturated 32-bit result. (SATD and FRCT bits set)
long _smac(long src, int op1, int op2); Multiplies op1 and op2, shifts the result left by 1, and adds
it to src. Produces a saturated 32-bit result. (SATD, SMUL,
and FRCT bits set)
long _smas(long src, int op1, int op2); Multiplies op1 and op2, shifts the result left by 1, and sub-
tracts it from src. Produces a 32-bit result. (SATD, SMUL
and FRCT bits set)
long long _llabss(long long src); Creates a saturated 40-bit absolute value.
_llabss(8000000000h) results in 7FFFFFFFFFh (SATD bit
set)
long long _llsneg(long long src); Negates the 40-bit value with saturation.
_llsneg(8000000000h) results in 7FFFFFFFFFh
long _smpyr(int src1, int src2); Multiplies src1 and src2, shifts the result left by 1, and
rounds by adding 2^15 to the result and zeroing out the
lower 16 bits. (SATD and FRCT bits set)
long _smacr(long src, int op1, int op2); Multiplies op1 and op2, shifts the result left by 1, adds the
result to src, and then rounds the result by adding 2^15 and
zeroing out the lower 16 bits. (SATD, SMUL, and FRCT
bits set)
long _smasr(long src, int op1, int op2); Multiplies op1 and op2, shifts the result left by 1, subtracts
the result from src, and then rounds the result by adding
2^15 and zeroing out the lower 16 bits. (SATD, SMUL, and
FRCT bits set)
int _norm(int src); Produces the number of left shifts needed to normalize
16-bit value.
int _lnorm(long src); Produces the number of left shifts needed to normalize
32-bit value.
long _rnd(long src); Rounds src by adding 2^15. Produces a 32-bit saturated
result with the lower 16 bits zeroed out. (SATD bit set)
int _sshl(int src1, int src2); Shifts src1 left by src2 and produces a 16-bit result. The
result is saturated if src2 is less than or equal to 8. (SATD
bit set)
long _lsshl(long src1, int src2); Shifts src1 left by src2 and produces a 32-bit result. The
result is saturated if src2 is less than or equal to 8. (SATD
bit set)
int _shrs(int src1, int src2); Shifts src1 right by src2 and produces a 16-bit result. Pro-
duces a saturated 16-bit result. (SATD bit set)
long _lshrs(long src1, int src2); Shifts src1 right by src2 and produces a 32-bit result. Pro-
duces a saturated 32-bit result. (SATD bit set)
The primary use of treating 16-bit data as long is to transfer data quickly from
one memory location to another. Since 32-bit accesses also can occur in a
single cycle, this could reduce the data-movement time by half. The only limita-
tion is that the data must be aligned on a double word boundary (that is, an
even word boundary). The code is even simpler if the number of items transferred
is a multiple of 2. To align the data, place the DATA_ALIGN pragma before the
definition:
#pragma DATA_ALIGN(x,2)
short x[10];
Example 3−25 shows a memory copy function that copies 16-bit data via
32-bit pointers.
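Example 3−25 itself is not reproduced in this excerpt; the sketch below illustrates the same idea in portable C, using int16_t and int32_t as host stand-ins for the C55x int and long types. The function name and the evenness and alignment assumptions are ours.

```c
#include <stdint.h>

/* Copy n 16-bit values by moving them two at a time through 32-bit
 * pointers. Assumes n is even and both arrays start on an even word
 * (32-bit) boundary, as the DATA_ALIGN pragma guarantees on the C55x. */
void copy16_via32(const int16_t *src, int16_t *dst, int n)
{
    const int32_t *s = (const int32_t *)src;
    int32_t *d = (int32_t *)dst;
    int i;
    for (i = 0; i < n / 2; i++)
        d[i] = s[i];            /* one 32-bit access moves two values */
}
```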
Notice that CIRC_REF simply expands to (var). In the future, using modulus
will be the more efficient way to implement circular addressing in C. The com-
piler will be able to transform certain uses of modulus into efficient C55x circu-
lar addressing code. At that time, the CIRC_UPDATE and CIRC_REF macros
can be updated to use modulus. Use of these macros will improve current per-
formance and minimize future changes needed to take advantage of improved
compiler functionality with regards to circular addressing.
For comparison, the much less efficient assembly code generated when the modulus operator is used directly is shown in Example 3−29.
#define CIRC_UPDATE(var,inc,size)\
(var) +=(inc); if ((var)>=(size)) (var)−=(size);
#define CIRC_REF(var,size) (var)
long circ(const int *a, const int *b, int nb, int na)
{
int i,x=0;
long sum=0;
for(i=0; i<na; i++)
{
sum += (long)a[i] * b[CIRC_REF(x,nb)];
CIRC_UPDATE(x,1,nb)
}
return sum;
}
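Because CIRC_REF expands to (var) and CIRC_UPDATE performs the wrap explicitly, the macro version is plain C and can be checked on any host. Below is a quick sketch; the macro bodies are brace-wrapped for safety, and the test values are ours.

```c
/* Circular-addressing macros as described above: the index wraps back
 * toward zero once it reaches size. */
#define CIRC_UPDATE(var,inc,size) \
    { (var) += (inc); if ((var) >= (size)) (var) -= (size); }
#define CIRC_REF(var,size) (var)

/* Same convolution loop shape as the circ() example. */
long circ_macro(const int *a, const int *b, int nb, int na)
{
    int i, x = 0;
    long sum = 0;
    for (i = 0; i < na; i++) {
        sum += (long)a[i] * b[CIRC_REF(x, nb)];
        CIRC_UPDATE(x, 1, nb)   /* x cycles 0, 1, ..., nb-1, 0, ... */
    }
    return sum;
}
```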
_circ:
MOV #0, AC0 ; |7|
BCC L2,T1 <= #0 ; |9|
; branch occurs ; |9|
SUB #1, T1, AR3
MOV AR3, BRC0
MOV #0, AR2 ; |6|
RPTBLOCAL L2−1
; loop starts
L1:
MACM *AR1+, *AR0+, AC0, AC0 ; |11|
|| ADD #1, AR2
CMP AR2 < T0, TC1 ; |12|
XCCPART !TC1 ||
SUB T0, AR1
XCCPART !TC1 ||
SUB T0, AR2
; loop ends ; |13|
L2:
return ; |14|
; return occurs ; |14|
Note: The algebraic instructions code example for Assembly Output for Circular Addressing C Code is shown in Example B−12
on page B-9.
long circ(const int *a, const int *b, int nb, int na)
{
int i,x=0;
long sum=0;
for(i=0; i<na; i++)
{
sum += (long)a[i] * b[x % nb];
x++;
}
return sum;
}
Example 3−29. Assembly Output for Circular Addressing Using Modulus Operator
_circ:
PSH T3, T2
AADD #−7, SP
MOV XAR1, dbl(*SP(#0))
|| MOV #0, AC0 ; |4|
MOV AC0, dbl(*SP(#2)) ; |4|
|| MOV T0, T2 ; |2|
BCC L2,T1 <= #0 ; |6|
; branch occurs ; |6|
MOV #0, T0 ; |3|
MOV T1, T3
|| MOV XAR0, dbl(*SP(#4))
L1:
MOV dbl(*SP(#0)), XAR3
MOV *AR3(T0), T1 ; |8|
MOV dbl(*SP(#4)), XAR3
MOV dbl(*SP(#2)), AC0 ; |8|
ADD #1, T0, T0
MACM *AR3+, T1, AC0, AC0 ; |8|
MOV AC0, dbl(*SP(#2)) ; |8|
MOV XAR3, dbl(*SP(#4))
call #I$$MOD ; |9|
|| MOV T2, T1 ; |9|
; call occurs [#I$$MOD] ; |9|
SUB #1, T3
BCC L1,T3 != #0 ; |10|
; branch occurs ; |10|
L2:
MOV dbl(*SP(#2)), AC0
AADD #7, SP ; |11|
POP T3,T2
return ; |11|
; return occurs ; |11|
Note: The algebraic instructions code example for Assembly Output for Circular Addressing Using Modulo is shown in
Example B−13 on page B-10.
In the case of single conditionals, it is best to test against zero. For example,
consider the following piece of C code:
if (a!=1) /* Test against 1 */
<inst1>
else
<inst2>
Where the algorithm permits, restructure such conditions as comparisons with
zero. In most cases a test against zero will result in more efficient compiled code.
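As a sketch of the restructuring (the function and values are hypothetical), fold the constant into a temporary once so that the branch compares against zero:

```c
/* Hypothetical example: the test (a != 1) is rewritten as a test of
 * (a - 1) against zero, which maps onto the DSP's branch-on-zero
 * conditions. */
int classify(int a)
{
    int t = a - 1;      /* fold the constant in once */
    if (t != 0)         /* test against zero */
        return 1;       /* <inst1> path */
    else
        return 2;       /* <inst2> path */
}
```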
Table 3−7 shows the C coding methods that should be used for some basic
DSP operations to generate the most efficient assembly code for the C55x.
Table 3−7. C Coding Methods for Generating Efficient C55x Assembly Code
Operation | Recommended C Code Idiom
16-bit +/− 16-bit => 16-bit (addition or subtraction) | int a,b,c; c = a + b; /* or */ c = a − b;
32-bit +/− 32-bit => 32-bit (addition or subtraction) | long a,b,c; c = a + b; /* or */ c = a − b;
40-bit +/− 40-bit => 40-bit (addition or subtraction) | long long a,b,c; c = a + b; /* or */ c = a − b;
Table 3−7. C Coding Methods for Generating Efficient C55x Assembly Code (Continued)
Operation Recommended C Code Idiom
16-bit − 16-bit => 16-bit (subtraction) with saturation | int a,b,c; c = _ssub(a,b);
The compiler requires that all values of type long be stored on an even word
boundary. When declaring data objects (such as structures) that may contain a
mixture of multi-word and single-word elements, place variables of type long
in the structure definition first to avoid holes in memory. The compiler automati-
cally aligns structure objects on an even word boundary. Placing these items
first takes advantage of this alignment.
/* Not recommended */
typedef struct abc{
int a;
long b;
int c;
} ABC;
/* Recommended */
typedef struct abc{
long a;
int b,c;
} ABC;
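The effect can be seen on any host compiler with analogous types; below, int16_t and int32_t stand in for the C55x int and long, and the type names are ours. The long-first layout removes the padding hole before b.

```c
#include <stdint.h>
#include <stddef.h>

/* Host stand-ins for the two C55x layouts above. In NotRecommended, the
 * compiler typically inserts a hole before b so that b is aligned; in
 * Recommended, the widest member comes first and no hole is needed. */
typedef struct { int16_t a; int32_t b; int16_t c; } NotRecommended;
typedef struct { int32_t a; int16_t b, c; } Recommended;
```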
Memory Management Issues
{
int lflag = Xflag;
.
x = lflag ? lflag & 0xfffe : lflag;
.
.
return x;
}
The C55x has two software stacks: the data stack (referenced by the pointer
SP) and the system stack (referenced by the pointer SSP). These stacks can
be indexed independently or simultaneously, depending on the chosen operating
mode. There are three possible operating modes for the stack: dual 16-bit stack
with fast return, dual 16-bit stack with slow return, and 32-bit stack with slow
return.
Additionally, the selection of fast return mode enables use of the RETA and
CFCT registers to effect return from functions. This potentially increases exe-
cution speed because it reduces the number of cycles required to return from a
function.
It is recommended to use dual 16-bit fast return mode to reduce memory space
requirements and increase execution speed. The stack operating mode is selected
by setting bits 28 and 29 of the reset vector address to the appropriate values.
Dual 16-bit fast return mode may be selected by using the .ivec assembler
directive when creating the address for the reset vector. (This is the default mode
for the compiler, as set up by the supplied runtime support library.) The
assembler will automatically set the correct value for bits 28
and 29 when encoding the reset vector address. For more information on stack
modes, see the TMS320C55x DSP CPU Reference Guide (SPRU371).
The compiler groups generated code and data into logical units called sec-
tions. Sections are the building blocks of the object files created by the assem-
bler. They are the logical units operated on by the linker when allocating space
for code and data in the C55x memory map.
Section Description
.cinit Initialization record table for global and static C variables
.stack Data stack (local variables, lower 16 bits of return address, etc.)
These sections are encoded in the object file produced by the assembler.
When linking the objects, it is important to pay attention to where these sec-
tions are linked in memory to avoid as many memory conflicts as possible. Fol-
lowing are some recommendations:
- The start addresses of the .stack and .sysstack sections are used to initialize
the data stack pointer (SP) and the system stack pointer (SSP), respec-
tively. Because these two registers share a common data page pointer
register (SPH) these sections must be allocated on the same 64K-word
memory page.
- Allocate the .bss and .stack sections in a single DARAM or separate SA-
RAM memory spaces. Local variable space is allocated on the stack. It is
possible that there may be conflicts when global variables, whose alloca-
tion is in .bss section, are accessed within the same instruction as a locally
declared variable.
First operand | Second operand | Conflict condition
Local var (stack) | Const symbol (.const) | If .const is located in separate SARAM or same DARAM, no conflict will occur
Global var (.bss) | Global var (.bss) | If .bss is allocated in DARAM, then no conflict will occur
Global var (.bss) | Const symbol (.const) | If .const and .bss are located in separate SARAM or same DARAM block, then no conflict will occur
When compiling with the small memory model (compiler default) allocate all
data sections, .data, .bss, .stack, .sysmem, .sysstack, .cio, and .const, on the
first 64K word page of memory (Page 0).
Example 3−32 contains a sample linker command file for the small memory
model. For extensive documentation on the linker and linker command files,
see the TMS320C55x Assembly Language Tools User’s Guide (SPRU280).
/*********************************************************
LINKER command file for LEAD3 memory map.
Small memory model
**********************************************************/
−stack 0x2000 /* Primary stack size */
−sysstack 0x1000 /* Secondary stack size */
−heap 0x2000 /* Heap area size */
−c /* Use C linking conventions: auto−init vars at runtime */
−u _Reset /* Force load of reset interrupt handler */
MEMORY
{
PAGE 0: /* −−−− Unified Program/Data Address Space −−−− */
RAM (RWIX) : origin = 0x000100, length = 0x01ff00 /* 128Kb page of RAM */
ROM (RIX) : origin = 0x020100, length = 0x01ff00 /* 128Kb page of ROM */
VECS (RIX) : origin = 0xffff00, length = 0x000100 /*256−byte int vector*/
PAGE 1: /* −−−−−−−− 64K−word I/O Address Space −−−−−−−− */
IOPORT (RWI) : origin = 0x000000, length = 0x020000
}
SECTIONS
{
.text > ROM PAGE 0 /* Code */
/* These sections must be on same physical memory page */
/* when small memory model is used */
.data > RAM PAGE 0 /* Initialized vars */
.bss > RAM PAGE 0 /* Global & static vars */
.const > RAM PAGE 0 /* Constant data */
.sysmem > RAM PAGE 0 /* Dynamic memory (malloc) */
.stack > RAM PAGE 0 /* Primary system stack */
.sysstack > RAM PAGE 0 /* Secondary system stack */
.cio > RAM PAGE 0 /* C I/O buffers */
/* These sections may be on any physical memory page */
/* when small memory model is used */
.switch > RAM PAGE 0 /* Switch statement tables */
.cinit > RAM PAGE 0 /* Auto−initialization tables */
.pinit > RAM PAGE 0 /* Initialization fn tables */
vectors > VECS PAGE 0 /* Interrupt vectors */
.ioport > IOPORT PAGE 1 /* Global & static IO vars */
The pragma, in Example 3−33, defines a new section called .myfunc. The
code for the function myfunction() will be placed by the compiler into this
newly defined section. The section name can then be used within the SEC-
TIONS directive of a linker command file to explicitly allocate memory for this
function. For details on how to use the SECTIONS directive, see the
TMS320C55x Assembly Language Tools User’s Guide (SPRU280).
Chapter 4
- Makes good use of special architectural features, like the dual multiply-
and-accumulate (MAC) hardware, parallelism, and looping hardware.
This chapter shows ways you can optimize TMS320C55x assembly code, so
that you have highly-efficient code in time-critical portions of your programs.
Efficient Use of the Dual-MAC Hardware
that performs
AC0 = AC0 + (xmem × cmem)    and    AC1 = AC1 + (ymem × cmem)
where xmem, ymem, and cmem are operands in memory pointed to by registers
dual-MAC instructions:
The two MAC units on the C55x DSP are economically fed data via three
independent data buses: BB (the B bus), CB (the C bus), and DB (the D
bus). During a dual-MAC operation, each MAC unit requires two data op-
erands from memory (four operands total). However, the three data buses
are capable of providing at most three independent operands. To obtain
the required fourth operand, the data value on the B bus is used by both
MAC units. This is illustrated in Figure 4−1. With this structure, the fourth
data operand is not independent, but rather is dependent on one of the
other three operands.
In the most general case of two multiplications, one would expect a re-
quirement of four fully independent data operands. While this is true on the
surface, in most cases one can get by with only three independent oper-
ands and avoid degrading performance by specially structuring the DSP
code at either the algorithm or application level. The special structuring,
covered in sections 4.1.1 through 4.1.4, can be categorized as follows:
- Implicit algorithm symmetry (e.g., symmetric FIR, complex vector
multiply)
- Loop unrolling (e.g., block FIR, single-sample FIR, matrix multiply)
- Multi-channel applications
- Multi-algorithm applications
y(k) = Σ(j = 0 … N/2 − 1) a_j · [x(k − j) + x(k + j − N + 1)]

where N is the number of filter taps and a_j are the filter coefficients.

Similar in form to the symmetrical FIR filter is the anti-symmetrical FIR filter:

y(k) = Σ(j = 0 … N/2 − 1) a_j · [x(k − j) − x(k + j − N + 1)]
c_i = a_i · b_i
    = (a_i^RE + j·a_i^IM) · (b_i^RE + j·b_i^IM)
    = (a_i^RE·b_i^RE − a_i^IM·b_i^IM) + j·(a_i^RE·b_i^IM + a_i^IM·b_i^RE)    for 1 ≤ i ≤ N
- 2nd multiplication group: a_i^IM·b_i^IM and a_i^RE·b_i^IM (sharing the common operand b_i^IM)
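The multiplication groupings can be checked against a short host-side C model of the complex multiply (plain C, no dual-MAC hardware; interleaved RE/IM storage assumed; the function name is ours):

```c
/* Reference model: c[i] = a[i] * b[i] for complex vectors stored
 * interleaved as RE, IM, RE, IM, ... Results go to separate real and
 * imaginary arrays. */
void cplx_vec_mul(const int *a, const int *b, long *c_re, long *c_im, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        /* 1st group shares b_re; 2nd group shares b_im (the CDP operand) */
        c_re[i] = (long)a[2*i] * b[2*i]   - (long)a[2*i+1] * b[2*i+1];
        c_im[i] = (long)a[2*i] * b[2*i+1] + (long)a[2*i+1] * b[2*i];
    }
}
```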
(Figure: the complex input data is stored interleaved in memory, with x1^RE at the lowest memory address, followed by x1^IM, x2^RE, x2^IM, and so on.)
In addition, the code stores both portions of the complex result to memory at
the same time. This requires that the results vector be long-word aligned in me-
mory. One way to achieve this is through use of the alignment flag option with
the .bss directive, as was done with this code example. Alternatively, one could
place the results array in a separate uninitialized named section using a .usect
directive, and then use the linker command file to force long-word alignment
of that section.
.data
A .int 1,2,3,4,5,6 ; Complex input vector #1
B .int 7,8,9,10,11,12 ; Complex input vector #2
.text
BCLR ARMS ; Clear ARMS bit (select DSP mode)
.arms_off ; Tell assembler ARMS = 0
cplxmul:
AMOV #A, XAR0 ; Pointer to A vector
AMOV #B, XCDP ; Pointer to B vector
AMOV #C, XAR1 ; Pointer to C vector
MOV #(N−1), BRC0 ; Load loop counter
MOV #1, T0 ; Pointer offset
MOV #2, T1 ; Pointer increment
endloop:
MOV pair(LO(AC0)), dbl(*AR1+) ; Store complex result
; End of loop
Note: The algebraic instructions code example for Complex Vector Multiplication is shown in Example B−14 on page B-11.
In filtering, input and/or output data is commonly stored in a delay chain buffer.
Each time the filter is invoked on a new data point, the oldest value in the delay
chain is discarded from the bottom of the chain, while the new data value is
added to the top of the chain. A value in the chain will get reused (for example,
multiplied by a coefficient) in the computations over and over again as successive
time-step outputs are computed. The reuse will continue until such a time
that the data value becomes the oldest value in the chain and is discarded.
Dual-MAC implementation of filtering should therefore employ a time-based
loop unrolling approach to exploit the reuse of the data. This scenario is pre-
sented in sections 4.1.2.1 and 4.1.2.2.
To efficiently implement a block FIR filter with the two MAC units, loop unrolling
must be applied so that two time-based iterations of the algorithm are com-
puted in parallel. This allows reuse of the coefficients.
Figure 4−2 illustrates the coefficient reuse for a 4-tap block FIR filter with
constant, real-value coefficients. The implementation computes two sequen-
tial filter outputs in parallel so that only a single coefficient, ai, is used by both
MAC units. Consider, for example, the computation of outputs y(k) and
y(k − 1). For the first term in each of these two rows, one MAC unit computes
a0x(k), while the second MAC unit computes a0x(k − 1). These two computa-
tions combined require only three different values from memory: a0, x(k), and
x(k − 1). Proceeding to the second term in each row, a1x(k − 1) and a1x(k − 2)
are computed similarly, and so on with the remaining terms. After fully comput-
ing the outputs y(k) and y(k − 1), the next two outputs, y(k − 2) and y(k − 3),
are computed in parallel. Again, the computation begins with the first two terms
in each of these rows. In this way, DSP performance is maintained at two MAC
operations per clock cycle.
Figure 4−2. Computation Groupings for a Block FIR (4-Tap Filter Shown)
Note that filters with either an even or odd number of taps are handled equally
well by this method. However, this approach does require one to compute an
even number of outputs y(). In cases where an odd number of outputs is de-
sired, one can always zero-pad the input vector x() with one additional zero
element, and then discard the corresponding additional output.
Note also that not all of the input data must be available in advance. Rather,
only two new input samples are required for each iteration through the algo-
rithm, thereby producing two new output values.
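The grouping above can be expressed as a plain-C sketch: two outputs per outer iteration, with one shared coefficient per inner iteration. No dual-MAC hardware is involved, and the names and array conventions are ours.

```c
/* Block FIR sketch: in[] holds (taps - 1) history samples followed by
 * the new input block; out[] receives n_out outputs (n_out assumed
 * even). Each inner iteration fetches one coefficient c and uses it in
 * both accumulations - the dual-MAC pattern. */
void block_fir(const int *in, const int *h, long *out, int taps, int n_out)
{
    int k, j;
    for (k = 0; k < n_out; k += 2) {
        long acc0 = 0, acc1 = 0;
        for (j = 0; j < taps; j++) {
            int c = h[j];                           /* shared coefficient */
            acc0 += (long)c * in[k + taps - 1 - j]; /* output k   */
            acc1 += (long)c * in[k + taps - j];     /* output k+1 */
        }
        out[k]     = acc0;
        out[k + 1] = acc1;
    }
}
```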
A non-optimized assembly code example for the block FIR filter is shown in
Example 4−2 (showing mnemonic instructions). An optimized version of the
same code is found in Example 4−3 (showing mnemonic instructions). The fol-
lowing optimizations have been made in Example 4−3:
- The first filter tap was peeled out of the inner loop and implemented using
a dual-multiply instruction (as opposed to a dual-multiply-and-accumulate
instruction). This eliminated the need to clear AC0 and AC1 prior to enter-
ing the inner loop each time.
- The last filter tap was peeled out of the inner loop. This allows for the use
of different pointer adjustments than in the inner loop, and eliminates the
need to explicitly rewind the CDP, AR0, and AR1 pointers.
The combination of these first two optimizations results in a requirement
that N_TAPS be a minimum of 3.
- Both results are now written to memory at the same time using a double
store instruction. Note that this requires the results array (OUT_DATA) to
be long-word aligned. One way to achieve this is through use of the align-
ment flag option with the .bss directive, as was done in this code example.
As an alternative, you could place the results array in a separate uninitial-
ized named section using a .usect directive, and then use the linker com-
mand file to force long-word alignment of that section.
- The outer loop start instruction, RPTBLOCAL, has been put in parallel with
the instruction that preceded it.
.data
COEFFS .int 1,2,3,4 ; Coefficients
IN_DATA .int 1,2,3,4,5,6,7,8,9,10,11 ; Input vector
.text
BCLR ARMS ; Clear ARMS bit (select DSP mode)
.arms_off ; Tell assembler ARMS = 0
bfir:
AMOV #COEFFS, XCDP ; Pointer to coefficient array
AMOV #(IN_DATA + N_TAPS − 1), XAR0 ; Pointer to input vector
AMOV #(IN_DATA + N_TAPS), XAR1 ; 2nd pointer to input vector
AMOV #OUT_DATA, XAR2 ; Pointer to output vector
MOV #((N_DATA − N_TAPS + 1)/2 − 1), BRC0
; Load outer loop counter
MOV #(N_TAPS − 1), CSR ; Load inner loop counter
Note: The algebraic instructions code example for Block FIR Filter Code (Not Optimized) is shown in Example B−15 on
page B-12.
.data
COEFFS .int 1,2,3,4 ; Coefficients
IN_DATA .int 1,2,3,4,5,6,7,8,9,10,11 ; Input vector
.text
BCLR ARMS ; Clear ARMS bit (select DSP mode)
.arms_off ; Tell assembler ARMS = 0
bfir:
AMOV #COEFFS, XCDP ; Pointer to coefficient array
AMOV #(IN_DATA + N_TAPS − 1), XAR0 ; Pointer to input vector
AMOV #(IN_DATA + N_TAPS), XAR1 ; 2nd pointer to input vector
AMOV #OUT_DATA, XAR2 ; Pointer to output vector
MOV #((N_DATA − N_TAPS + 1)/2 − 1), BRC0
; Load outer loop counter
MOV #(N_TAPS − 3), CSR ; Load inner loop counter
MOV #(−(N_TAPS − 1)), T0 ; CDP rewind increment
endloop:
MOV pair(LO(AC0)), dbl(*AR2+) ; Store both results
; End of outer loop
Note: The algebraic instructions code example for Block FIR Filter Code (Optimized) is shown in Example B−16 on page B-13.
The temporally unrolled block FIR filter described in section 4.1.2.1 maintains
dual-MAC throughput by sharing a common coefficient between the two MAC
units. In some algorithms, the loop unrolling needs to be performed so that a
common data variable is shared instead. The single-sample FIR filter is an ex-
ample of such an algorithm. In the single-sample FIR filter, the calculations for
the current sample period are interlaced with those of the next sample period
in order to achieve a net performance of two MAC operations per cycle.
Figure 4−3 shows the needed computation groupings for a 4-tap FIR filter. At
any given time step, one multiplies and accumulates every other partial prod-
uct in the corresponding row, beginning with the first partial product in the row.
In addition, one also multiplies and accumulates every other term in the next
row (that is, the row above the current row) in advance of that time step, begin-
ning with the second partial product in the next row. In this way, each row is
fully computed over the course of two sample periods.
For example, at time step k, it is desired to compute y(k). The first term in the
y(k) row is a0x(k), which is computed using one of the two MAC units. In addi-
tion, the second MAC unit is used to compute the second term in the y(k+1)
row, a1x(k), in advance of time step k + 1. These two computations combined
require only three different values from memory: a0, a1, and x(k). Note that the
term x(k) is not available until time k. This is why calculations at each time step
must begin with the first term in the corresponding row.
The second term in the y(k) row is a1x(k − 1). However, this would have been
already computed during the first computation at time step k − 1 (similar to how
a1x(k) was just pre-computed for time step k+1) , so it can be skipped here.
The third term in the y(k) row, a2x(k − 2), is computed next, and at the same
time, the term a3x(k − 2) is computed in the y(k + 1) row in advance of time
step k+1.
Notice that two separate running sums are maintained, one with partial prod-
ucts for the current time step, the other with pre-computed terms for the next
time step. At the next time step, the pre-computed running sum becomes the
current running sum, and a new pre-computed running sum is started from
zero. At the end of each sample period, the current running sum contains the
current filter output, which can be dispatched as required by the application.
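The two-running-sums scheme can be sketched in plain C, one call per sample period. The struct layout, the 16-tap bound, and the function name are ours; taps is assumed even, as the text requires.

```c
/* Single-sample FIR sketch: one input and one output per call. cur
 * finishes the current output using terms pre-computed last step; nxt
 * pre-computes every other term of the next output. */
typedef struct {
    int taps;          /* number of taps, assumed even and <= 16 */
    const int *h;      /* coefficients a0..a(taps-1) */
    int d[16];         /* delay line, d[0] = newest sample */
    long next_sum;     /* partial sum carried to the next time step */
} SsFir;

long ssfir_step(SsFir *f, int x)
{
    long cur = f->next_sum;              /* terms pre-computed last step */
    long nxt = 0;
    int j;
    for (j = f->taps - 1; j > 0; j--)    /* shift in the new sample */
        f->d[j] = f->d[j - 1];
    f->d[0] = x;
    for (j = 0; j < f->taps; j += 2) {
        cur += (long)f->h[j]     * f->d[j];  /* finish current output */
        nxt += (long)f->h[j + 1] * f->d[j];  /* pre-compute next one  */
    }
    f->next_sum = nxt;
    return cur;
}
```

Feeding an impulse through the filter reproduces the coefficient sequence, confirming each output is fully assembled over two sample periods.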
The above approach is not limited to the 4-tap filter illustrated in Figure 4−3.
Any other filter with an even number of taps is a straightforward extension. For
filters with an odd number of taps, the computation groupings become prob-
lematic, in that the last grouping in each row is missing the pre-calculation term
in the row above it.
Figure 4−4 depicts this problem for a 5-tap filter. To overcome this problem,
one should pad the filter to the next higher even number of taps by using a zero
coefficient for the additional term. For example, the five tap filter is augmented
to
In this way, any filter with an odd number of taps can be implemented as a filter
with an even number of taps but retain the frequency response of the original
odd-number-tap filter.
[C] = [A] [B]
where
[A] = m × n matrix
[B] = n × p matrix
[C] = m × p matrix
m ≥ 1, n ≥ 1, p ≥ 1
The expression for each element in matrix C is given by:
c_i,j = Σ(k = 1 … n) a_i,k · b_k,j        (1 ≤ i ≤ m, 1 ≤ j ≤ p)
where the conventional notation xi,j is being used to represent the element of
matrix X in the ith row and jth column. There are basically two different options
for efficient dual-MAC implementation. First, one could compute ci,j and ci,j + 1
in parallel. The computations made are:
c_i,j = Σ(k = 1 … n) a_i,k · b_k,j        c_i,j+1 = Σ(k = 1 … n) a_i,k · b_k,j+1
The element ai,k is common to both expressions. The computations can there-
fore be made in parallel, with the common data ai,k delivered to the dual-MAC
units using the B bus and using XCDP as the pointer. The C bus and the D bus
are used along with two XARx registers to access the independent elements
bk,j and bk,j+1.
Alternatively, one could compute ci,j and ci+1,j in parallel. The computations
made are then:
c_i,j = Σ(k = 1 … n) a_i,k · b_k,j        c_i+1,j = Σ(k = 1 … n) a_i+1,k · b_k,j
In this case, the element bk,j is common to both expressions. They can there-
fore be made in parallel, with the common data bk,j delivered to the dual-MAC
units using the B bus and using XCDP as the pointer. The C bus and the D bus
are used along with two XARx registers to access the independent elements
ai,k and ai+1,k.
The values of m and p determine which approach one should take. Because
the inner loop will compute two elements in matrix C each iteration, clearly it
is most efficient if an even number of elements can be computed. Therefore,
if p is even, one should implement the first approach: compute ci,j and ci,j+1 in
parallel. Alternatively, if m is even, the second approach is more efficient: com-
pute ci,j and ci+1,j in parallel. If both m and p are even, either approach is ap-
propriate. Finally, if neither m nor p is even, there will be an extra element c that
will need to be computed individually each time through the inner loop. One
could add additional single-MAC code to handle the final element in the inner
loop. Alternatively, one could pad either matrix A or matrix B with a row or column
of zeros (as appropriate) to make either m or p even. The elements in ma-
trix C computed using the pad row or column should then be discarded after
computation.
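The first grouping (p even, computing c_i,j and c_i,j+1 together) can be sketched in plain C, with the shared element a_i,k read once per inner iteration. Flat row-major arrays and the function name are our conventions.

```c
/* [C] = [A][B] with A (m x n), B (n x p), C (m x p); p assumed even.
 * Each inner iteration reads a[i][k] once and applies it to two output
 * columns - the shared-operand dual-MAC pattern. */
void matmul_pair(const int *A, const int *B, long *C, int m, int n, int p)
{
    int i, j, k;
    for (i = 0; i < m; i++)
        for (j = 0; j < p; j += 2) {
            long acc0 = 0, acc1 = 0;
            for (k = 0; k < n; k++) {
                int a = A[i * n + k];            /* shared operand */
                acc0 += (long)a * B[k * p + j];
                acc1 += (long)a * B[k * p + j + 1];
            }
            C[i * p + j]     = acc0;
            C[i * p + j + 1] = acc1;
        }
}
```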
y1(k) = Σ(j = 0 … N − 1) a_j · x1(k − j)        y2(k) = Σ(j = 0 … N − 1) a_j · x2(k − j)
where a_j are the shared filter coefficients and N is the filter length.
The value aj is common to both calculations. The two calculations can there-
fore be performed in parallel, with the common aj delivered to the dual-MAC
units using the B bus and using XCDP as the pointer. The C bus and the D bus
are used along with two XARx registers to access the independent input ele-
ments x1(k − j) and x2(k − j).
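A plain-C sketch of the two-channel case, with one coefficient read serving both channels (names are ours):

```c
/* Two channels filtered with one shared coefficient set: each a[j]
 * (the CDP/B-bus operand on the C55x) feeds both accumulators. */
void fir_2ch(const int *a, const int *x1, const int *x2,
             long *y1, long *y2, int n)
{
    long s1 = 0, s2 = 0;
    int j;
    for (j = 0; j < n; j++) {
        s1 += (long)a[j] * x1[j];   /* channel 1 */
        s2 += (long)a[j] * x2[j];   /* channel 2 */
    }
    *y1 = s1;
    *y2 = s2;
}
```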
The element y(k) is common to both calculations. The two calculations can
therefore be performed in parallel, with the common data y(k) delivered to the
dual-MAC units via the B bus with XCDP as the address pointer. The C bus
and the D bus are used along with two XARx registers to access the indepen-
dent elements x1(k + j) and x2(k + j).
The element x(k + j) is common to both calculations. The two calculations can
therefore be made in parallel, with the common data x(k + j) delivered to the
dual-MAC units using the B bus and using XCDP as the pointer. The C bus and
the D bus are used along with two XARx registers to access the independent
elements x(k) and y(k).
In the mnemonic syntax, they can be identified by a double colon (::) that sepa-
rates the two operations. The preceding example in the mnemonic syntax is:
MPY *AR0, *CDP, AC0 ; The data referenced by AR0 is multiplied by
:: MPY *AR1, *CDP, AC1 ; a coefficient referenced by CDP. At the same time
; the data referenced by AR1 is multiplied by the
; same coefficient.
Using Parallel Execution Features
The C55x instructions make use of dedicated operative resources (or opera-
tors) within each of the units. In total, there are 14 operators available across
the three computation units, and the parallelism rules enable the use of two
independent operators in parallel within the same cycle. If all other rules are
observed, two instructions that independently use any two of the independent
operators may be placed in parallel.
Figure 4−5 shows a matrix that reflects the 14 operators mentioned and the
possible operator combinations that may be used in placing instructions in par-
allel. The operators are ordered from rows 1 through 14 as well as columns
1 through 14. A blank cell in any given position (row I, column J) in the matrix
indicates that operator I may be placed in parallel with operator J, and an X in
any given position indicates that the two operators cannot be placed in parallel.
For example, a D-Unit MAC operation (row 7) may be placed in parallel with
a P-Unit Load operation (column 13) but cannot be placed in parallel with a
D-Unit ALU operation (column 5).
(The column headings, numbered 1 through 14, carry the same operator labels, in the same order, as the numbered rows below.)
1 2 3 4 5 6 7 8 9 10 11 12 13 14
A-unit ALU 1 X
A-unit Swap 2 X
A-unit Load 3
A-unit Store 4
D-unit ALU 5 X X X
D-unit Shifter 6 X X X X
D-unit MAC 7 X X X
D-unit Load 8
D-unit Store 9
D-unit Swap 11 X
P-unit Control 12 X
P-unit Load 13
P-unit Store 14
Note: X in a table cell indicates that the operator in that row and the operator in that
column cannot be used in parallel with each other. A blank table cell indicates
that the operators can be used in parallel.
Bus resources also play an important part in determining whether two instruc-
tions may be placed in parallel. Typically, programmers should be concerned
with the data buses and the constant buses. Table 4−1 lists and describes the
main CPU buses of interest and gives examples of instructions that use the
different buses. These may also be seen pictorially in Figure 4−6. Figure 4−6
also shows all CPU buses and the registers/operators in each of the three
functional units.
Bus Type | Bus(es) | Description of Bus(es) | Example: Instruction That Uses The Bus(es)
Maximum instruction length The combined length of the instruction pair cannot exceed 6
bytes.
Parallel enable bit OR soft dual encoding: If either of the following cases is true, the instructions can be
placed in parallel:
- Parallel enable bit is present: At least one of two instruc-
tions in parallel must have a parallel enable bit in its instruc-
tion code. The instruction set reference guides (see Re-
lated Documentation from Texas Instruments in the pref-
ace) indicate whether a given instruction has a parallel en-
able bit.
*ARn
*ARn+
*ARn−
*(ARn + T0) (Available if C54CM bit = 0)
*(ARn + AR0) (Available if C54CM bit = 1)
*(ARn − T0) (Available if C54CM bit = 0)
*(ARn − AR0) (Available if C54CM bit = 1)
*ARn(T0) (Available if C54CM bit = 0)
*ARn(AR0) (Available if C54CM bit = 1)
*(ARn + T1)
*(ARn − T1)
mmap() and port() qualifiers An instruction that uses the mmap() qualifier to indicate an
access to a memory-mapped register or registers cannot be
placed in parallel with another instruction. The use of the
mmap() qualifier is a form of parallelism already.
Likewise an instruction that uses a port() qualifier to indicate
an access to I/O space cannot be placed in parallel with
another instruction. The use of a port() qualifier is a form of
parallelism already.
Parallelism among A unit, D unit, and P unit Parallelism among the three computational units is allowed
without restriction (see Figure 4−5).
An operation executed within a single computational unit can
be placed in parallel with a second operation executed in one
of the other two computational units.
Parallelism within the P unit Two program-control instructions cannot be placed in parallel.
However, other parallelism among the operators of the P unit
is allowed.
Parallelism within the D unit Certain restrictions apply to using operators of the D unit in
parallel (see Figure 4−5).
Parallelism within the A unit Two A-unit ALU operations or two A-unit swap operations
cannot be performed in parallel. However, other parallelism
among the operators of the A unit is allowed.
2 Identify potential user-defined parallel instruction pairs in your code, and, using the basic rules outlined
in Table 4−2 as guidelines, place instructions in parallel. Start by focusing on heavily used kernels of
the code.
3 Run the optimized code through the assembler to see if the parallel instruction pairs are valid. The as-
sembler will indicate any invalid parallel instruction pairs. If you have invalid pairs, go to step 4; otherwise
go to step 5.
4 Refer to the set of parallelism rules in section 4.2.4 to determine why failing parallel pairs may be invalid.
Make necessary changes and return to step 3.
5 Once all your parallel pairs are valid, make sure your code still functions correctly.
As mentioned, Example 4−5 shows the optimized code for Example 4−4. In
Example 4−5, the parallel instruction pairs are highlighted. Notice the follow-
ing points:
- The first four instructions (ARn loads) are immediate loads and cannot be
placed in parallel due to constant bus conflicts and total instruction sizes.
- The first parallel pair shows an immediate load of CSR through the bus
called KDB. This load is executed in parallel with the setting of the SXMD
mode bit, which is handled by the A-unit ALU.
- The third parallel pair stores AR4 to memory via the D bus (DB), and stores
a constant (BUSY) to memory via the bus called KDB.
- The fourth parallel pair loads AC1 with a constant that is carried on the bus
called KDB and, in parallel, switches program control to a single-repeat
loop.
- The last parallel pair stores AC1 to AR4 via a cross-unit bus and, in paral-
lel, returns from the COMPUTE function.
; Variables
.data
.global start_a1
.text
start_a1:
MOV #HST_FLAG, AR0 ; AR0 points to Host Flag
MOV #HST_DATA, AR2 ; AR2 points to Host Data
MOV #COEFF1, AR1 ; AR1 points to COEFF1 buffer initially
MOV #COEFF2, AR3 ; AR3 points to COEFF2 buffer initially
MOV #4, CSR ; Set CSR = 4 for repeat in COMPUTE
BSET FRCT ; Set fractional mode bit
BSET SXMD ; Set sign−extension mode bit
Note: The algebraic instructions code example for A-Unit Code With No User-Defined Parallelism is shown in Example B−17
on page B-14.
LOOP:
MOV *AR0, T0 ; T0 = Host Flag
BCC PROCESS, T0 == #READY ; If Host Flag is “READY”, continue
B LOOP ; process − else poll Host Flag again
PROCESS:
MOV *AR2, T0 ; T0 = Host Data
Check:
CALL COMPUTE ; Compute subroutine
MOV AR4, *AR2 ; Write result to Host Data
MOV #BUSY, *AR0 ; Set Host Flag to Busy
B LOOP ; Infinite loop continues
END
COMPUTE:
MOV #0, AC1 ; Initialize AC1 to 0
RPT CSR ; CSR has a value of 4
MACM *AR2, *AR3+, AC1 ; This MAC operation is performed
; 5 times
MOV AC1, AR4 ; Result is in AR4
RET
HALT:
B HALT
Example 4−5. A-Unit Code in Example 4−4 Modified to Take Advantage of Parallelism
; Variables
.data
.global start_a2
.text
start_a2:
MOV #HST_FLAG, AR0 ; AR0 points to Host Flag
MOV #HST_DATA, AR2 ; AR2 points to Host Data
MOV #COEFF1, AR1 ; AR1 points to COEFF1 buffer initially
MOV #COEFF2, AR3 ; AR3 points to COEFF2 buffer initially
MOV #4, CSR ; Set CSR = 4 for repeat in COMPUTE
||BSET FRCT ; Set fractional mode bit
BSET SXMD ; Set sign−extension mode bit
LOOP:
MOV *AR0, T0 ; T0 = Host Flag
BCC PROCESS, T0 == #READY ; If Host Flag is “READY”, continue
B LOOP ; process − else poll Host Flag again
Note: The algebraic instructions code example for A-Unit Code in Example 4−4 Modified to Take Advantage of Parallelism is
shown in Example B−18 on page B-16.
PROCESS:
MOV *AR2, T0 ; T0 = Host Data
Check:
CALL COMPUTE ; Compute subroutine
MOV AR4, *AR2 ; Write result to Host Data
||MOV #BUSY, *AR0 ; Set Host Flag to Busy
B LOOP ; Infinite loop continues
END
COMPUTE:
MOV #0, AC1 ; Initialize AC1 to 0
||RPT CSR ; CSR has a value of 4
MACM *AR2, *AR3+, AC1 ; This MAC operation is performed
; 5 times
MOV AC1, AR4 ; Result is in AR4
||RET
HALT:
B HALT
Example 4−6 demonstrates a very simple nested loop and some simple control operations with the use of P-unit registers. This example shows the unoptimized code, and Example 4−7 shows the code optimized through the use of the P-unit parallel instruction pairs (parallel instruction pairs are highlighted).
- The first three register loads are immediate loads using the same constant bus and, therefore, cannot be placed in parallel.
- The fourth register load (loading BRC0) can be placed in parallel with the next load (loading BRC1), because the latter performs the load using the data bus DB rather than the constant bus.
Example 4−6. P-Unit Code With No User-Defined Parallelism
; Variables
.data
.global start_p1
.text
start_p1:
AMOV #var1, XDP
MOV #var2, AR3
RPTB Loop1−1
MOV AC2, AC1
MOV #8000h, AR1
RPTBLOCAL Loop2−1
SUB #1, AC1
MOV AC1, *AR1+
Loop2:
ADD #1, AC2
Loop1:
Note: The algebraic instructions code example for P-Unit Code With No User-Defined Parallelism is shown in Example B−19
on page B-18.
Example 4−7. P-Unit Code in Example 4−6 Modified to Take Advantage of Parallelism
; Variables
.data
.global start_p2
.text
start_p2:
AMOV #var1, XDP
MOV #var2, AR3
Note: The algebraic instructions code example for P-Unit Code in Example 4−6 Modified to Take Advantage of Parallelism is
shown in Example B−20 on page B-19.
Example 4−8 demonstrates a very simple load, multiply, and store function with the use of D-unit registers. Example 4−9 shows this code modified to take advantage of user-defined parallelism (parallel instruction pairs are highlighted).
- The instructions that load AC0 and AC2 have been placed in parallel because they are not both immediate loads and, as such, cause no constant-bus conflict.
- The two 16-bit store operations at the end of the code are placed in parallel because two 16-bit write buses (EB and FB) are available.
Example 4−8. D-Unit Code With No User-Defined Parallelism
; Variables
.data
.global start_d1
.text
start_d1:
MOV #var1, AR3
MOV #var2, AR4
Note: The algebraic instructions code example for D-Unit Code With No User-Defined Parallelism is shown in Example B−21
on page B-20.
Example 4−9. D-Unit Code in Example 4−8 Modified to Take Advantage of Parallelism
; Variables
.data
.global start_d2
.text
start_d2:
MOV #var1, AR3
MOV #var2, AR4
Note: The algebraic instructions code example for D-Unit Code in Example 4−8 Modified to Take Advantage of Parallelism is
shown in Example B−22 on page B-21.
4.2.8 Example of Parallel Optimization Across the A-Unit, P-Unit, and D-Unit
Example 4−10 shows unoptimized code for an FIR (finite impulse response) filter. Example 4−11 is the result of applying user-defined parallelism to the same code. It is important to notice that the order of instructions has been altered in a number of cases to allow certain instruction pairs to be placed in parallel. The use of parallelism in this case has saved about 50% of the cycles outside the inner loop.
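For reference, the computation this FIR kernel carries out can be sketched in plain C. This is a generic direct-form FIR in the correlation form with integer arithmetic and hypothetical names, not the TI implementation; the assembly versions below additionally use fractional (Q15) MACs and circular addressing.

```c
#include <assert.h>

/* Generic FIR: y[n] = sum over k of h[k] * x[n + k]
   (correlation form, matching a pointer-advancing MAC loop).
   x must hold nx + nh - 1 input samples. */
void fir(const int *x, const int *h, int *y, int nx, int nh)
{
    for (int n = 0; n < nx; n++) {
        int acc = 0;                  /* accumulator (plays the role of AC0) */
        for (int k = 0; k < nh; k++)  /* inner MAC loop (like RPT CSR) */
            acc += h[k] * x[n + k];
        y[n] = acc;
    }
}
```

The inner loop is the part that maps onto the repeated MACM instruction; everything outside it is the setup and store code that the parallel pairs in Example 4−11 shorten.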
Example 4−10. Code That Uses Multiple CPU Units But No User-Defined Parallelism
; Register usage
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
FRAME_SZ .set 2
.global _fir
.text
; *****************************************************************
_fir
AADD #−FRAME_SZ, SP
Note: The algebraic instructions code example for Code That Uses Multiple CPU Units But No User-Defined Parallelism is
shown in Example B−23 on page B-22.
RPTB Loop1−1
RPT CSR
MACM *H_ptr+, *DB_ptr+, AC0
; Store result
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
END_FUNCTION:
RET
; **********************************************************************
In Example 4−11, parallel pairs that were successful are shown in bold type; potential parallel pairs that failed are shown in italic type. The first pair failed due to a constant-bus conflict, and the second failed because the combined size is greater than 6 bytes. The third pair failed for the same reason, as well as being an invalid soft-dual encoding. The last pair in italics failed because neither instruction has a parallel enable bit. Some of the load/store operations that are not in parallel were made parallel in the first pass of the optimization process; however, the parallelism failed due to bus conflicts and had to be removed.
Note:
Example 4−11 shows optimization only with the use of the parallelism features. Further optimization of this FIR function is possible by employing other optimizations.
Example 4−11. Code in Example 4−10 Modified to Take Advantage of Parallelism
; Register usage
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
FRAME_SZ .set 2
.global _fir
.text
; *****************************************************************
_fir
Note: The algebraic instructions code example for Code in Example 4−10 Modified to Take Advantage of Parallelism is shown
in Example B−24 on page B-25.
||RPTB Loop1−1
; Store result
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
END_FUNCTION:
||RET
; **********************************************************************
The selection of the looping method to use depends mainly on the number of instructions that need to be repeated and on the way you need to control the loop counter. The first three methods in the preceding list offer zero-overhead looping; the fourth one incurs a 5-cycle loop overhead.
Overall, the most efficient looping mechanisms are the repeat() and localrepeat{} mechanisms. The repeat() mechanism provides a way to repeat a single instruction or a parallel pair of instructions in an interruptible way. repeat(CSR), in particular, allows you to compute the loop counter at run time. Refer to section 4.3.2, Efficient Use of repeat(CSR) Looping.
Note:
If you are migrating code from a TMS320C54x DSP, be aware that a single-
repeat instruction is interruptible on a TMS320C55x DSP. On a
TMS320C54x DSP, a single-repeat instruction cannot be interrupted.
The localrepeat{} mechanism provides a way to repeat a block from the instruction buffer queue. Reusing code that has already been fetched and placed in the queue brings the following advantages:
Implementing Efficient Loops
Example 4−12 shows one block-repeat loop nested inside another block-repeat loop. If you need more levels of multiple-instruction loops, use branch-on-auxiliary-register-not-zero constructs to create the remaining outer loops. In Example 4−13 (page 4-44), a branch-on-auxiliary-register-not-zero construct (see the last instruction in the example) forms the outermost loop of a fast Fourier transform algorithm. Inside that loop are two localrepeat{} loops. Notice that if you want the outermost loop to execute n times, you must initialize AR0 to (n − 1) outside the loop.
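The (n − 1) initialization follows from the shape of a decrement-and-branch loop: the body always runs once before the counter is first tested. A C sketch of the control flow (the function name is hypothetical; this is not TI code):

```c
#include <assert.h>

/* Models a branch-on-auxiliary-register-not-zero outer loop.
   The counter starts at n - 1 outside the loop; the body runs,
   then the branch is taken while the counter is nonzero. */
int run_outer_loop(int n)
{
    int body_count = 0;
    int ar = n - 1;          /* e.g., MOV #(n-1), AR0 outside the loop */
    do {
        body_count++;        /* loop body (one pass of the algorithm) */
    } while (ar-- != 0);     /* e.g., BCC outer_loop, AR0 != #0 */
    return body_count;
}
```

With the counter initialized to n − 1, the body executes exactly n times; initializing it to n would give n + 1 iterations.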
Note: The algebraic instructions code example for Nested Loops is shown in Example B−25 on page B-28.
Example 4−13. Branch-On-Auxiliary-Register-Not-Zero Construct (Shown in Complex FFT Loop Code)
_cfft:
radix_2_stages:
; ...
outer_loop:
; ...
MOV T1, BRC0
; ...
MOV T1, BRC1
; ...
SFTS AR4, #1 ; outer loop counter
||BCC no_scale, AR5 == #0 ; determine if scaling required
; ...
no_scale:
RPTBLOCAL Loop1−1
RPTBLOCAL Loop2−1
Note: The algebraic instructions code example for Branch-On-Auxiliary-Register-Not-Zero Construct (Shown in Complex FFT
Loop Code) is shown in Example B−26 on page B-28.
Loop2:
ADD dual(*AR2), AC0, AC1 ; ar’ = ar + br
; ai’ = ai + bi
||MOV AC2, dbl(*AR6) ; Store tr, ti
SFTS AR3, #1
||MOV #0, CDP ; rewind coefficient pointer
SFTS T3, #1
||BCC outer_loop, AR4 != #0
- Use a single-repeat instruction for the innermost loop if the loop contains only a single instruction (or a pair of instructions that have been placed in parallel).
  - Reduce the number of bytes in the loop. For example, you can reduce the number of instructions that use embedded constants.
- When you nest a block-repeat loop inside another block-repeat loop, initialize the block-repeat counters (BRC0 and BRC1) in the code outside of both loops. This technique is shown in Example 4−12 (page 4-43). Neither counter needs to be re-initialized inside its loop; placing such initializations inside the loops only adds extra cycles to the loops. The CPU uses BRC0 for the outer (level 0) loop and BRC1 for the inner (level 1) loop. BRC1 has a shadow register, BRS1, that preserves the initial value of BRC1. Each time the level 1 loop must begin again, the CPU automatically re-initializes BRC1 from BRS1.
- The repeat count can be dynamically computed at run time and stored to CSR. For example, CSR can be used when the number of times an instruction must be repeated depends on the iteration number of a higher loop structure.
- Using CSR saves outer-loop cycles when the single-repeat loop is an inner loop.
Example 4−14 (page 4-47) uses CSR for a single-repeat loop that is nested
inside a block-repeat loop. In the example, CSR is assigned the name
inner_cnt.
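The case described above, where the inner repeat count depends on the outer iteration number, corresponds to this C sketch (names are hypothetical; the computed count plays the role of CSR):

```c
#include <assert.h>

/* Inner trip count recomputed on every outer pass, as when a
   value is stored to CSR before an RPT CSR inner loop. */
int triangular_work(int outer_n)
{
    int total = 0;
    for (int j = 0; j < outer_n; j++) {      /* block-repeat loop (BRC0) */
        int inner_cnt = j + 1;               /* computed count -> CSR */
        for (int i = 0; i < inner_cnt; i++)  /* single-repeat loop */
            total++;                         /* the repeated instruction */
    }
    return total;
}
```

Because CSR is loaded before the RPT instruction, computing inner_cnt in the outer loop costs no cycles inside the single-repeat itself.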
Example 4−14. Use of CSR (Shown in Real Block FIR Loop Code)
; ...
.asg CSR, inner_cnt ; inner loop count
.asg BRC0, outer_cnt ; outer loop count
; ...
.asg AR0, x_ptr ; linear pointer
.asg AR1, db_ptr1 ; circular pointer
.asg AR2, r_ptr ; linear pointer
.asg AR3, db_ptr2 ; circular pointer
.asg CDP, h_ptr ; circular pointer
; ...
_fir2:
; ...
AMAR *db_ptr2− ; index of 2nd oldest db entry
;
; Setup loop counts
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
||SFTS T0, #−1 ; T0 = nx/2
; 1st iteration
MPY *db_ptr1+, *h_ptr+, AC0 ; part 1 of dual−MPY
::MPY *db_ptr2+, *h_ptr+, AC1 ; part 2 of dual−MPY
; inner loop
||RPT inner_cnt
MAC *db_ptr1+, *h_ptr+, AC0 ; part 1 of dual−MAC
::MAC *db_ptr2+, *h_ptr+, AC1 ; part 2 of dual−MAC
Note: This example shows portions of the file fir2.asm in the TI C55x DSPLIB (introduced in Chapter 8).
; ...
Minimizing Pipeline and IBQ Delays
- The first segment, referred to as the fetch pipeline, fetches 32-bit instruction packets from memory into the instruction buffer queue (IBQ), and then feeds the second pipeline segment with 48-bit instruction packets. The fetch pipeline is illustrated in Figure 4−8.
Pipeline Phase  Description
PF1             Present the program fetch address to memory.
Figure 4−9. Second Segment of the Pipeline (Execution Pipeline)
Time: Decode (D), Address (AD), Access 1 (AC1), Access 2 (AC2), Read (R), Execute (X), Write (W), Write+ (W+)
Note: The W+ phase applies only to memory write operations.
Pipeline Phase  Description
Decode (D) - Read six bytes from the instruction buffer queue.
- Decode an instruction pair or a single instruction.
- Dispatch instructions to the appropriate CPU functional units.
- Read STx_55 bits associated with data address generation:
ST1_55(CPL) ST2_55(ARnLC)
ST2_55(ARMS) ST2_55(CDPLC)
- Read STx bits related to address generation (ARxLC, CPL,
ARMS) for address generation in the AD phase.
Address (AD) - Read/modify registers when involved in data address generation. For example:
ARx and Tx in *ARx+(T0)
BK03 if AR2LC=1
SP during pushes and pops
SSP, same as for SP if in 32-bit stack mode
- Read an A-unit register in the case of a R/W (in AD-phase) conflict.
- Perform operations that use the A-unit ALU. For example:
Arithmetic using AADD instruction
Swapping A-unit registers with a SWAP instruction
Writing constants to A-unit registers (BKxx, BSAxx, BRCx, CSR,
etc.)
- Decrement ARx for the conditional branch instruction that
branches on ARx not zero.
- (Exception) Evaluate the condition of the XCC instruction
(execute(AD-unit) attribute in the algebraic syntax).
Access 1 (AC1) Refer to section 4.4.3.1 for a description.
Access 2 (AC2) Refer to section 4.4.3.1 for a description.
AADD #k, ARx: With this special instruction, ARx is initialized with a constant in the AD phase.
MOV #k, *ARx+: The memory write happens in the W+ phase. See the special write-pending and memory-bypass cases in section 4.4.3.
ADD #k, ARx: ARx is read at the beginning of the X phase and is modified at the end of the X phase.
ADD ACy, ACx: ACx and ACy read and write activity occurs in the X phase.
PUSH, POP, RET, or AADD #K8, SP: SP is read and modified in the AD phase. SSP is also affected if the 32-bit stack mode is selected.
The instruction set reference guides (see Related Documentation From Texas
Instruments in the preface) show how many cycles an instruction takes to exe-
cute when the pipeline is full and experiencing no delays. Pipeline-protection
cycles add to that best-case execution time. As will be shown, most cases of pipeline conflict can be resolved by rescheduling instructions.
This section provides examples to help you better understand the impact of the pipeline structure on the way your code performs. It also provides recommendations for coding style and instruction usage to minimize conflicts and pipeline stalls. This section does not cover all potential pipeline conflicts, but it does cover some of the most common pipeline delays and IBQ delays found when writing C55x code.
General (Section 4.4.2.1, page 4-56):
- Reschedule instructions.
- In the case of a conflict, the front runner wins.
Register access:
- Avoid consecutive accesses to the same register. (Section 4.4.2.2, page 4-56)
- Use MAR-type instructions, when possible, to modify ARx and Tx registers, but avoid read/write register sequences, and pay attention to instruction size. (Section 4.4.2.3, page 4-60)
Loop control:
- Understand when the loop-control registers are accessed in the pipeline. (Section 4.4.2.6, page 4-64)
- Avoid writing the BRC register in the last few instructions of a block-repeat or local block-repeat structure, to prevent an unprotected pipeline situation. (Section 4.4.2.7, page 4-65)
- Initialize the BRCx or CSR register at least 4 cycles before the repeat instruction, or initialize the register with an immediate value. (Section 4.4.2.8, page 4-66)
Condition evaluation:
- Try to set conditions well in advance of the time that the condition is tested. (Section 4.4.2.9, page 4-67)
Memory usage:
- When working with dual-MAC and FIR instructions, put the Cmem operand in a different memory bank. (Section 4.4.3.2, page 4-78)
- For 32-bit accesses (using an Lmem operand), no performance hit is incurred if you use SARAM (there is no need to use DARAM). (Section 4.4.3.4, page 4-79)
IBQ usage (Section 4.4.4, page 4-79):
- Align PC discontinuities on 32-bit memory boundaries.
- Use short instructions as the first instructions after a PC discontinuity.
- Use LOCALREPEAT when possible.
A pipeline conflict arises when two instructions in different phases of the pipeline compete for the use of the same resource. The resource is granted to the instruction that is ahead in terms of pipeline execution, to increase overall instruction throughput.
As shown in Figure 4−9, registers are not accessed in the same pipeline
phase. Therefore, pipeline conflicts can occur, especially in write/read or read/
write sequences to the same register. Following are three common register
pipeline conflict cases and how to resolve them.
Accumulators and registers not associated with address generation are read
and written in the X phase, if MMR addressing is not used. Otherwise, the read
and write happen in the R and W phases, respectively.
Notice that the sequence
ADD #1, AC0
MOV AC0, AC2
will not cause pipeline conflicts, because AC0 is read by MOV AC0, AC2 (algebraic syntax: AC2 = AC0) in the X phase. When AC0_L is accessed via the memory map (@AC0_L || mmap()), it is treated as a memory access and is read in the R phase.
4.4.2.3 Use MAR type of instructions, when possible, to modify ARx and Tx registers, but
avoid read/write register sequences, and pay attention to instruction size.
The MAR type of instructions (AMOV, AMAR, AADD, ASUB) use independent
hardware in the data-address generation unit (DAGEN) to update ARx and Tx
registers in the AD phase of the pipeline. You can take advantage of this fact
to avoid pipeline conflicts, as shown in Example 4−18. Because AR1 is updated by the MAR instruction prior to being used by I2 for address generation, no cycle penalty is incurred.
However, using a MAR instruction could increase instruction size. For example, AADD T1, AR1 requires 3 bytes, while ADD T1, AR1 requires 2 bytes. You must consider the tradeoff between code size and speed.
Example 4−19 shows that sometimes, using a MAR instruction can cause
pipeline conflicts. The MAR instruction (I2) attempts to write to AR1 in the AD
phase, but due to pipeline protection, I2 must wait for AR1 to be read in the
R phase by I1. This causes a 2-cycle latency. Notice that AR1 is read by I1
and is updated by I2 in the same cycle (cycle 5). This is made possible by the
A-unit register prefetch mechanism activated in the R phase of the C55x DSP
(see page 4-58). I1 is one of the D-unit instructions listed in Appendix A.
One way to avoid the latency in Example 4−19 is to use the code in Example 4−20.
Example 4−20. Solution for Bad Use of MAR Instruction (Read/Write Sequence)
I1: ADD mmap(AR1), T2, AC1 ; AR1 read in R phase
; (A-unit register prefetch)
I2: ADD T1, AR1 ; AR1 updated in X phase
; (No cycle penalty)
A pipeline delay can occur if two different registers belonging to the same
group are accessed at the same time.
4.4.2.6 Understand when the loop-control registers are accessed in the pipeline.
As with any register, the hardware loop-control registers (CSR, RPTC, RSA0, REA0, RSA1, REA1, BRC0, BRC1, and BRS1) can be read and written in:
Loop-control registers can also be modified by the repeat instruction. For ex-
ample:
4.4.2.7 Avoid writing the BRC register in the last few instructions of a block-repeat
or local block-repeat structure to prevent an unprotected pipeline situation.
Writing to BRC in one of the last few instructions (the exact number depends on the specific instruction) of a block-repeat structure could cause an unprotected pipeline situation. Reading BRC, on the other hand, is pipeline protected and does not insert an extra pipeline stall cycle.
BRC write accesses may not be protected in the last cycles of a block-repeat structure; do not write to BRC0 or BRC1 within those cycles. This can be seen in Example 4−22: BRC0 is to be written by I1 in the W phase (cycle 7), while BRC0 is decremented in the D phase of I2. The pipeline-protection unit cannot guarantee the proper sequence of these operations (write to BRC0, then decrement BRC0), so BRC0 is decremented by I2 before BRC0 is changed by I1. Certain instructions, however, are protected (for example, BRC0 = #k is pipeline protected).
4.4.2.8 Initialize the BRCx or CSR register at least 4 cycles before the repeat instruction,
or initialize the register with an immediate value.
4.4.2.9 Try to set conditions well in advance of the time that the condition is tested.
Conditions are typically evaluated in the R phase of the pipeline, with the following exceptions:
- Use of XCC: Example 4−26 shows a case where a register load operation
(MOV *AR3+, AC1) is made conditional with the XCC instruction. When
XCC is used, the condition is evaluated in the AD phase, and if the
condition is true, the conditional instruction performs its operations
in the AD through W phases. In Example 4−26, the AR3+ update in I3
depends on whether AC0 > 0; therefore, the update is postponed until I1
updates AC0. As a result, I3 is delayed by 4 cycles.
Typically, XCCPART causes the CPU to evaluate the condition in the execute (X) phase. The exception is when you make a memory write operation dependent on a condition, in which case the condition is evaluated in the read (R) phase.
To prevent the 2-cycle delay in Example 4−28, you can insert two other, non-conflicting instructions between I2 and I3.
Bus Description
B BB. This data-read data bus carries a 16-bit coefficient data value (Cmem) from data space to the CPU.
C, D CB, DB. Each of these data-read data buses carries a 16-bit data value from data space to the CPU. DB carries a value from data space or from I/O space. In the case of an Lmem (32-bit) read, a single address bus (DAB) presents the data address, and the two halves of the 32-bit data come in on DB and CB.
E, F EB, FB. Each of these data-write data buses carries a 16-bit data value to data space from the CPU. EB carries a value to data space or to I/O space. In the case of an Lmem (32-bit) write, a single address bus (EAB) presents the data address, and the two halves of the 32-bit data go out on EB and FB.
P PB. This program-read bus carries 32-bit instruction packets from program space to the IBQ.
Table 4−11. Half-Cycle Accesses to Dual-Access Memory (DARAM) and the Pipeline (Note 1)
Access type: Buses used
Smem read: D
Smem write: E
Lmem read: C, D
Lmem write: E, F
Xmem read || Ymem read: C, D
Xmem write || Ymem write: E, F
Xmem read || Ymem read || Cmem read: C, D, B (Note 2)
Xmem read || Ymem write: D, E
Lmem read || Lmem write: C, D, E, F
- For a read operation: In the first cycle (request cycle), the request and
address are placed on the bus. In the second cycle (memory read cycle),
the read access is done to the memory. In the third cycle (data read cycle),
the data is delivered to the buses.
- For a write operation: In the first cycle (request cycle), the request and
address are placed on the bus. In the second cycle (data write cycle), the
data is written to the buses. In the third cycle (memory write), the write ac-
cesses are done to the memory.
The memory access happens in the AC2 phase for reads and in the W+ phase for writes.
As seen in Table 4−9, two simultaneous accesses can occur to the same DARAM block, but only one access can occur to a SARAM block.
Ideally, we would allocate all data into DARAM due to its higher memory bandwidth (2 accesses/cycle). However, DARAM is a limited resource and should be used only when it is advantageous. Following are recommendations to guide your memory-mapping decisions.
- Reschedule instructions.
The actual execution time of these instructions does not increase, because the delayed (pending) write memory access (I1) is performed while the read instruction (I5) is in the R phase.
If, in Example 4−30, any read access (via the D or C buses) is from the same memory location where the write access should occur, the CPU bypasses reading the actual memory location and instead reads the data directly from the internal buses. This allows the pipeline to perform a memory write access in a later pipeline phase than that in which the next instruction reads from the same memory location. Without this memory bypass feature, the delay would have been 3 cycles.
4.4.3.2 When working with dual-MAC and FIR instructions, put the Cmem operand
in a different memory bank.
Provided code is not executed from the same memory block in which you have the data being accessed by that code, the only memory access type that can generate a conflict in a DARAM is the execution of instructions requiring three data operands in 1 cycle: Xmem, Ymem, and Cmem (coefficient operand). Examples of two commonly used instructions that use three data operands are:
- Dual multiply-and-accumulate (MAC) instruction:
MAC Xmem, Cmem, ACx
:: MAC Ymem, Cmem, ACy
- Finite impulse response filter instructions:
FIRSADD Xmem, Ymem, Cmem, ACx, ACy
FIRSSUB Xmem, Ymem, Cmem, ACx, ACy
This memory conflict can be solved by keeping the Ymem and Xmem operands in the same DARAM memory bank but putting the Cmem operand into a different memory bank (SARAM or DARAM).
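When the kernel is linked from C, one way to steer the Cmem array into its own bank is the TI compiler's DATA_SECTION pragma together with a matching linker command file. The sketch below is illustrative only; the section names and memory-range names (DARAM_0, SARAM_1) are placeholders for your device's actual memory map, not a real configuration.

```c
/* C source: give the coefficient array its own named section. */
#pragma DATA_SECTION(coeffs, ".coeffs")   /* Cmem operand */
int coeffs[64];

#pragma DATA_SECTION(x_buf, ".xydata")    /* Xmem operand */
int x_buf[256];
#pragma DATA_SECTION(y_buf, ".xydata")    /* Ymem operand */
int y_buf[256];

/* Linker command file fragment: map the sections to different banks.
   SECTIONS {
       .xydata > DARAM_0   ; Xmem/Ymem share one DARAM bank
       .coeffs > SARAM_1   ; Cmem goes to a different bank
   }
*/
```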
When cycle-intensive DSP kernels are developed, it is extremely important to identify and document software integration recommendations stating which variables/arrays must not be mapped in the same dual-access memory block. The software developer should also document the associated cycle cost when the proposed optimized mapping is not performed. That information will provide the software integrator enough insight to make trade-offs. Table 4−13 provides an example of such a table: if the three arrays named "input," "output," and "coefficient" are in the same DARAM block, the subroutine named "filter" will have a 200-cycle overhead.
4.4.3.3 Map program code to a dedicated SARAM memory block to avoid conflicts with data accesses.
If a DARAM block maps both program and data spaces of the same routine, a program-code fetch will conflict, for example, with a dual (or triple) data-operand read (or write) access if they are performed in the same memory block. The C55x DSP resolves the conflict by delaying the program-code fetch by one cycle. It is therefore recommended to map the program code in a dedicated program memory bank; generally, a SARAM memory bank is preferred. This avoids conflicts with data variables mapped in the high-bandwidth DARAM banks.
Another way to avoid memory conflicts is to use the 56-byte IBQ to execute blocks of instructions without refetching code after the first iteration (see the localrepeat{} instruction). A conflict will occur only in the first loop iteration.
When a 32-bit memory access is performed with an Lmem operand, only one address bus (DAB or EAB) is used to specify the most and least significant words of the 32-bit value. Therefore, reading from or writing to a 32-bit memory location in a SARAM bank occurs in 1 cycle.
To avoid IBQ delay cycles and to increase the chance that the first 32-bit packet contains the instruction at the target address and the first byte of the next instruction, follow these recommendations:
To better understand the effect of the IBQ on code cycles, we present three typical PC-discontinuity cases. Examples 1a, 1b, and 1c show branch examples where the IBQ behavior is illustrated.
IBQ delays can also occur when the PC discontinuity is caused by a block-repeat loop. During block-repeat processing, the IBQ fetches instructions sequentially until it reaches the last instruction. At this point, since the IBQ does not know the size of the last instruction (it knows only the address of the last instruction), it fetches 7 more bytes. Thus, the IBQ may fetch more program bytes than it actually needs. This overhead can cause some delays, as the following example shows:
Example 2
.align 4
4000 nop
4001 rptb label3
4004 amov #0x8000 , xar0
400a ;;nop_16 ;← commented nop_16 instruction
400a nop_16 || nop_16 ;#1
400e nop_16 || nop_16 ;#2
4012 nop_16 || nop_16 ;#3
4016 nop_16 || nop_16 ;#4
401a nop_16 || nop_16 ;#5
401e nop_16 || nop_16 ;#6
4022 nop_16 || nop_16 ;#7
4026 nop_16 || nop_16 ;#8
402a nop_16 || nop_16 ;#9
label3:
402e nop_16 || nop_16 ;#10
4032
The block-repeat loop has 11 instructions. Since the size of the first instruction is 6 bytes, the total cycle count for this loop should be 12 cycles, because two 32-bit fetch packets are required to hold the entire first instruction (AMOV) and the first byte of the next instruction. However, this code actually takes 13 cycles, as illustrated below.
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Program
Address 04 08 0c 10 14 18 1c 20 24 28 2c 30 34 04 08 0c 10 14 18
Bus
Decode 04 0a 0e 12 16 1a 1e 22 26 2a 2e X X 04
PC
Note: - The Cycle row corresponds to the cycle timeline.
- The Program Address Bus row corresponds to the address of the 32-bit block requested from memory (fetch packet). (To simplify the notation, 0x4004 is written as 04.)
- The Decode PC row corresponds to the address of the instruction that is being decoded.
There are two X slots, which are used to fetch packets that are never decoded.
Therefore, we have 11 instructions in 13 cycles. One of the X slots is unavoid-
able because of the first 6-byte instruction (AMOV). The second, however,
could be used by another instruction. For example, if we uncomment the
nop_16 instruction, the cycle count remains the same (13 cycles), indicating
an opportunity to rearrange your code and insert a useful instruction in place
of the nop_16. This kind of delay can occur in block-repeat loops, but it does
not occur in local repeat loops. For this reason, the use of RPTBLOCAL is
recommended.
IBQ delays can also occur when there are many 5- or 6-byte instructions con-
secutively. This will reduce the fetch advance and eventually produce delays.
To avoid this kind of delay, use the following recommendation:
- Avoid using too many 5- or 6-byte instructions (or 5- or 6-byte parallel in-
structions) consecutively. Doing so will eventually starve the IBQ. Instead,
mix in short instructions to balance the longer instructions, keeping in mind
that the average sustainable fetch, without incurring a stall, is 4 bytes.
The RPTBLOCAL loop differs from a RPTB loop in that the contents of the
loop fit completely in the IBQ. The maximum size of a RPTBLOCAL loop is
55 bytes between the address of the first instruction and the address of the
last instruction of the loop. If the last instruction of the loop is a 6-byte
instruction, the size of the loop is 61 bytes. During the processing of the loop,
the IBQ keeps fetching instructions until it reaches the maximum fetch
advance, which is 24 bytes.
Because all the instructions of the local loop are in the IBQ, there are no IBQ
delays inside a RPTBLOCAL loop.
Special Case: IBQ Delays When the Size of the Local Loop Is Close to 61
Bytes
When the loop is this large, there is not much space left in the IBQ to fetch
additional instructions. In this case, after the loop completes, IBQ delays may
occur in the first instruction outside the loop. This delay occurs when the IBQ
did not have enough space to fetch the first instruction following the loop plus
the first byte of the second instruction (see Example 3). This type of IBQ delay
can range from 1 to 6 cycles, depending on the size of the local loop and the
length of the instruction following the local loop.
; loop is 61 bytes
The size of the RPTBLOCAL loop in the previous example is 61 bytes. Since
the size of the IBQ is 64 bytes, and since, in this case, the contents of the loop
are aligned on a 32-bit boundary, the IBQ also fetches the first three bytes of
the first instruction after the loop. However, that is not sufficient to avoid
delays, because the first instruction outside the loop (at address 0x403d) is a
4-byte instruction. Therefore, in this case, five IBQ delays occur after the
completion of the loop.
The C55x offers a speculative pre-fetch feature that can save several execu-
tion cycles of conditional control flow instructions when the detected condition
is true. For example, in the case of a conditional branch (BCC), the branch
target address is known in the decode (D) phase (for an immediate value) or
in the address (AD) phase (for a relative offset) of the pipeline, but the
condition is evaluated later, in the read (R) phase. To avoid the extra delay
cycles that waiting for the condition to be known before fetching the target
address would imply, the C55x fetches the branch target address
speculatively, and the fetch packet is stored in the IBQ. If the condition
evaluates as true, the instruction decoder can get the branch target instruction
from the IBQ with minimum latency.
Chapter 5
Fixed-Point Arithmetic − a Tutorial
Fixed-point DSPs, like the TMS320C55x DSP, typically use 16-bit words. They
use less silicon area than their floating-point counterparts, which translates
into cheaper prices and less power consumption. Due to the limited dynamic
range and the rules of fixed-point arithmetic, a designer must play a more ac-
tive role in the development of a fixed-point DSP system. The designer has to
decide whether the 16-bit words will be interpreted as integers or fractions, ap-
ply scale factors if required, and protect against possible register overflows.
This representation is not used in a DSP architecture because the addition al-
gorithm would be different for numbers that have the same signs and for num-
bers that have different signs. The DSP uses the 2s-complement format, in
which a positive number is represented as a simple binary value and a nega-
tive value is represented by inverting all the bits of the corresponding positive
value and then adding 1.
Example 5−1 shows the decimal number 353 as a 16-bit signed binary num-
ber. Each bit position represents a power of 2, with 2^0 at the position of the
least significant bit and 2^15 at the position of the most significant bit. The 0s
and 1s of the binary number determine how these powers of 2 are weighted
(times 0 or times 1) when summed to form 353. Because the number is
signed, 2^15 is given a negative sign. Example 5−2 shows how to compute the
negative of a 2s-complement number.
Begin with a positive binary number (353 decimal): 0000 0001 0110 0001
Invert all bits to get the 1s complement: 1111 1110 1001 1110
Add 1 to get the 2s complement: + 1
Result: negative binary number (−353 decimal): 1111 1110 1001 1111
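The invert-and-add-1 procedure above can be checked with a short sketch. This Python fragment is only an illustration of the arithmetic (the DSP does this in hardware); the 16-bit width is kept by masking:

```python
def negate_2s_complement(value, bits=16):
    """Negate a number in 2s-complement form: invert all bits, then add 1."""
    mask = (1 << bits) - 1
    ones_complement = value ^ mask          # invert all bits (1s complement)
    return (ones_complement + 1) & mask     # add 1, keep only 'bits' bits

def to_signed(raw, bits=16):
    """Interpret a raw bit pattern as a signed 2s-complement integer."""
    return raw - (1 << bits) if raw & (1 << (bits - 1)) else raw

pos = 0b0000000101100001                    # +353 decimal
neg = negate_2s_complement(pos)
print(format(neg, "016b"), to_signed(neg))  # → 1111111010011111 -353
```

Negating the result a second time recovers +353, as expected of an exact 2s-complement negation.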
The most common formats used in DSP programming are integers and frac-
tions. In signal processing, fractional representation is more common. A frac-
tion is defined as a ratio of two integers such that the absolute value of the ratio
is less than or equal to 1. When two fractions are multiplied together, the result
is also a fraction. Multiplicative overflow, therefore, never occurs. Note, how-
ever, that additive overflow can occur when fractions are added. Overflows are
discussed in section 5.5, beginning on page 5-24.
Figure 5−1 shows how you can interpret 2s-complement numbers as integers.
The most significant bit (MSB) is given a negative weight, and the integer is
the sum of all the applicable bit weights. If a bit is 1, its weight is included in
the sum; if the bit is 0, its weight is not applicable (the effective weight is 0). For
simplicity, the figure shows 4-bit binary values; however, the concept is easily
extended for larger binary values. Compare the 4-bit format in Figure 5−1 with
the 8-bit format in Figure 5−2. The LSB of a binary integer always has a bit
weight of 1, and the absolute values of the bit weights increase toward the
MSB. Adding bits to the left of a binary integer does not change the absolute
bit weights of the original bits.
Figure 5−1 (4-bit signed integer, MSB to LSB):
1 1 0 1 = −8 + 4 + 0 + 1 = −3
Figure 5−2 (8-bit signed integer bit weights, MSB to LSB):
−2^7 = −128, 2^6 = 64, 2^5 = 32, 2^4 = 16, 2^3 = 8, 2^2 = 4, 2^1 = 2, 2^0 = 1
Figure 5−3 (4-bit binary fraction, MSB to LSB):
Most negative value:  1 0 0 0 = −1 + 0 + 0 + 0 = −1
Other examples:       0 1 0 1 = 0 + 1/2 + 0 + 1/8 = 5/8
                      1 1 0 1 = −1 + 1/2 + 0 + 1/8 = −3/8
Figure 5−4 (8-bit binary fraction bit weights, MSB to LSB):
−2^0 = −1, 2^−1 = 1/2, 2^−2 = 1/4, 2^−3 = 1/8, 2^−4 = 1/16, 2^−5 = 1/32, 2^−6 = 1/64, 2^−7 = 1/128
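The integer and fractional readings of the same bit pattern differ only by a scale factor of 2^(bits−1). This Python sketch (an illustration, not DSP code) applies the bit weights described above:

```python
def as_integer(raw, bits):
    """Signed integer value: the MSB carries weight -2^(bits-1)."""
    msb = 1 << (bits - 1)
    return (raw & (msb - 1)) - (raw & msb)

def as_fraction(raw, bits):
    """Fractional value: MSB weighs -1, remaining bits 1/2, 1/4, 1/8, ..."""
    return as_integer(raw, bits) / (1 << (bits - 1))

# 4-bit values from the figures above
print(as_fraction(0b1000, 4))   # most negative value: -1.0
print(as_fraction(0b0101, 4))   # 0 + 1/2 + 0 + 1/8 = 0.625 (5/8)
print(as_fraction(0b1101, 4))   # -1 + 1/2 + 0 + 1/8 = -0.375 (-3/8)
print(as_integer(0b1101, 4))    # -8 + 4 + 0 + 1 = -3
```

The same bit pattern 1101 reads as −3 when treated as an integer and as −3/8 when treated as a fraction, which is exactly the −3 / 2^3 scaling.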
Example 5−3 shows addition. An integer interpretation and a fractional inter-
pretation are shown for each addition. For simplicity, the examples use 8-bit
binary values; however, the concept is easily extended for larger binary
values. For a better understanding of how the integer and fractional
interpretations were derived for the 8-bit binary numbers, see
Figure 5−2 (page 5-4) and Figure 5−4 (page 5-5), respectively.
1 (carry)
0000 0101 5 5/128
+ 0000 0100 + 4 + 4/128
−−−−−−−−−−−−−− −−−− −−−−−−−
0000 1001 9 9/128
1 1 1 (carries)
0000 0101 5 5/128
+ 0000 1101 + 13 + 13/128
−−−−−−−−−−−−−− −−−−− −−−−−−−−
0001 0010 18 18/128
Example 5−4 shows subtraction. As with the additions in Example 5−3, an in-
teger interpretation and a fractional interpretation are shown for each com-
putation. It is important to notice that 2s-complement subtraction is the same
as the addition of a positive number and a negative number. The first step is
to find the 2s-complement of the number to be subtracted. The second step
is to perform an addition using this negative number.
Original form:
0000 0101 5 5/128
− 0000 0100 − 4 − 4/128
−−−−−−−−−−−−−− −−−− −−−−−−−
2s complement of subtracted term:
1 1 (carries)
1111 1011
+ 1
−−−−−−−−−−−−−−
1111 1100
Addition form:
11111 1 (carries)
0000 0101 5 5/128
+ 1111 1100 + (−4) + (−4/128)
−−−−−−−−−−−−−− −−−−−− −−−−−−−−−
0000 0001 1 1/128
(final carry ignored)
Original form:
0000 0101 5 5/128
− 0000 1101 − 13 − 13/128
−−−−−−−−−−−−−− −−−−− −−−−−−−−
2s complement of subtracted term:
1111 0010
+ 1
−−−−−−−−−−−−−−
1111 0011
Addition form:
   1  1   1                     (carries)
  0000 0101          5             5/128
+ 1111 0011     + (−13)       + (−13/128)
−−−−−−−−−−−−−− −−−−−−− −−−−−−−−−−
  1111 1000         −8            −8/128
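The two subtraction examples above follow the same mechanical recipe: form the 2s complement of the subtrahend, add, and ignore any carry out of the top bit. A small Python model of the 8-bit case (illustrative only; the fractional reading simply divides by 128):

```python
BITS = 8
MASK = (1 << BITS) - 1

def to_signed(raw):
    """Read an 8-bit pattern as a signed 2s-complement value."""
    return raw - (1 << BITS) if raw & (1 << (BITS - 1)) else raw

def subtract(a, b):
    """a - b done as 2s-complement addition; the final carry is ignored."""
    b_neg = ((b ^ MASK) + 1) & MASK   # invert all bits of b, then add 1
    return (a + b_neg) & MASK         # masking drops the carry out of bit 7

r = subtract(0b00000101, 0b00001101)             # 5 - 13
print(format(r, "08b"), to_signed(r), to_signed(r) / 128)
# → 11111000 -8 -0.0625  (that is, -8/128)
```

This reproduces the second worked example: 5 − 13 = −8 as an integer, or 5/128 − 13/128 = −8/128 as a fraction.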
Extended-Precision Addition and Subtraction
The C55x DSP has several features that help make extended-precision cal-
culations more efficient.
- CARRY bit: One of the features is the CARRY status bit, which is affected
by most arithmetic D-unit ALU instructions, as well as by the rotate and
shift operations. CARRY depends on the M40 status bit. When M40 = 0,
the carry/borrow is detected at bit position 31. When M40 = 1, the
carry/borrow reflected in CARRY is detected at bit position 39. Your code
can also explicitly modify CARRY by loading ST0_55 or by using a status
bit clear/set instruction. For proper extended-precision arithmetic, the
saturation mode bit should be cleared (SATD = 0) to prevent the
accumulator from saturating during the computations.
- 32-bit addition, subtraction, and loads: Two C55x data buses, CB and
DB, allow some instructions to handle 32-bit operands in a single cycle.
The long-word load and double-precision add/subtract instructions use
32-bit operands and can efficiently implement extended-precision arith-
metic.
- The partial sum of the 64-bit addition is efficiently performed by the follow-
ing instructions, which handle 32-bit operands in a single cycle.
Mnemonic instructions: MOV40 dbl(Lmem), ACx
ADD dbl(Lmem), ACx
Algebraic instructions: ACx = dbl(Lmem)
ACx = ACx + dbl(Lmem)
- For the upper half of a partial sum, the instruction that follows this para-
graph uses the carry bit generated in the lower 32-bit partial sum. Each
partial sum is stored in two memory locations using MOV ACx, dbl(Lmem)
or dbl(Lmem) = ACx.
Mnemonic instruction: ADD uns(Smem), CARRY, ACx
Algebraic instruction: ACx = ACx + uns(Smem) + CARRY
;*********************************************************************
; 64−Bit Addition Pointer assignments:
;
; X3 X2 X1 X0 AR1 −> X3 (even address)
; + Y3 Y2 Y1 Y0 X2
; −−−−−−−−−−−−−− X1
; W3 W2 W1 W0 X0
; AR2 −> Y3 (even address)
; Y2
; Y1
; Y0
; AR3 −> W3 (even address)
; W2
; W1
; W0
;
;*********************************************************************
Note: The algebraic instructions code example for 64-Bit Addition is shown in Example B−27 on page B-30.
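The partial-sum scheme above (low 32-bit add generates CARRY; the high 32-bit add consumes it) can be modeled in ordinary integer code. This Python sketch is an illustration of the technique, not the DSP instructions themselves; because Python integers are unbounded, the 32-bit halves are masked explicitly:

```python
M32 = 0xFFFFFFFF

def add64(x, y):
    """64-bit add built from two 32-bit partial sums linked by a carry."""
    lo = (x & M32) + (y & M32)                   # lower partial sum (X0+Y0 pairs)
    carry = lo >> 32                             # CARRY generated by the lower sum
    lo &= M32
    hi = ((x >> 32) + (y >> 32) + carry) & M32   # upper partial sum + CARRY
    return (hi << 32) | lo

x, y = 0x00000001FFFFFFFF, 0x0000000000000001
print(hex(add64(x, y)))   # → 0x200000000 (the carry propagates into the upper half)
```

The test value is chosen so the lower 32-bit sum overflows, demonstrating that the carry handoff is what makes the 64-bit result correct.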
- For the upper half of the partial remainder, the instruction that follows this
paragraph uses the borrow generated in the lower 32-bit partial remainder.
The borrow is not a physical bit in a status register; it is the logical inverse
of CARRY. Each partial sum is stored in two memory locations using
MOV ACx, dbl(Lmem) or dbl(Lmem) = ACx.
As shown in Figure 5−6, the SUB instruction with a 16-bit shift (shown fol-
lowing this paragraph) is an exception because it only resets the carry bit.
This allows the D-unit ALU to generate the appropriate carry when a
subtraction from the lower or upper half of the accumulator causes a borrow.
Mnemonic instruction: SUB Smem << #16, ACx, ACy
Algebraic instruction: ACy = ACx − (Smem << #16)
Figure 5−6 shows the effect of subtractions on the CARRY bit.
;**********************************************************************
; 64−Bit Subtraction Pointer assignments:
;
; X3 X2 X1 X0 AR1 −> X3 (even address)
; − Y3 Y2 Y1 Y0 X2
; −−−−−−−−−−−−−− X1
; W3 W2 W1 W0 X0
; AR2 −> Y3 (even address)
; Y2
; Y1
; Y0
; AR3 −> W3 (even address)
; W2
; W1
; W0
;
;**********************************************************************
Note: The algebraic instructions code example for 64-Bit Subtraction is shown in Example B−28 on page B-31.
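The subtraction case mirrors the addition: the borrow is the logical inverse of CARRY, generated by the lower 32-bit partial remainder and consumed by the upper one. A hedged Python model of the idea (illustrative only, with masked 32-bit halves):

```python
M32 = 0xFFFFFFFF

def sub64(x, y):
    """64-bit subtract from two 32-bit partial remainders linked by a borrow."""
    lo = (x & M32) - (y & M32)
    borrow = 1 if lo < 0 else 0          # a negative partial result means a borrow
    lo &= M32
    hi = ((x >> 32) - (y >> 32) - borrow) & M32
    return (hi << 32) | lo

x, y = 0x0000000200000000, 0x0000000000000001
print(hex(sub64(x, y)))   # → 0x1ffffffff (the borrow is taken from the upper half)
```

Again the operands are chosen so the lower half underflows, exercising the borrow path.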
32-bit × 32-bit multiplication, 16-bit partial products:
        X1 X0
      × Y1 Y0
−−−−−−−−−−−−−−
Unsigned multiplication          X0 × Y0
Signed/unsigned multiplication   X1 × Y0
Signed/unsigned multiplication   X0 × Y1
Signed multiplication            X1 × Y1
Example 5−8 shows that a multiplication of two 32-bit integer numbers re-
quires one multiplication, two multiply/accumulate/shift operations, and one
multiply/accumulate operation. The product is a 64-bit integer number.
Example 5−9 shows a fractional multiplication. The operands and the product
are in Q31 format.
Extended-Precision Multiplication
;****************************************************************
; This routine multiplies two 32−bit signed integers, giving a
; 64−bit result. The operands are fetched from data memory and the
; result is written back to data memory.
;
; Data Storage: Pointer Assignments:
; X1 X0 32−bit operand AR0 −> X1
; Y1 Y0 32−bit operand X0
; W3 W2 W1 W0 64−bit product AR1 −> Y1
; Y0
; Entry Conditions: AR2 −> W0
; SXMD = 1 (sign extension on) W1
; SATD = 0 (no saturation) W2
; FRCT = 0 (fractional mode off) W3
;
; RESTRICTION: The delay chain and input array must be
; long-word aligned.
;***************************************************************
Note: The algebraic instructions code example for 32-Bit Integer Multiplication is shown in Example B−29 on page B-32.
;**************************************************************************
; This routine multiplies two Q31 signed integers, resulting in a
; Q31 result. The operands are fetched from data memory and the
; result is written back to data memory.
;
; Data Storage: Pointer Assignments:
; X1 X0 Q31 operand AR0 −> X1
; Y1 Y0 Q31 operand X0
; W1 W0 Q31 product AR1 −> Y1
; Y0
; Entry Conditions: AR2 −> W1 (even address)
; SXMD = 1 (sign extension on) W0
; SATD = 0 (no saturation)
; FRCT = 1 (shift result left by 1 bit)
;
; RESTRICTION: W1 W0 is aligned such that W1 is at an even address.
;***************************************************************************
AMAR *AR0+ ; AR0 points to X0
MPYM uns(*AR0−), *AR1+, AC0 ; AC0 = X0*Y1
MACM *AR0, uns(*AR1−), AC0 ; AC0 = X0*Y1 + X1*Y0
MACM *AR0, *AR1, AC0 >> #16, AC0 ; AC0 = AC0>>16 + X1*Y1
MOV AC0, dbl(*AR2) ; Save W1 W0
Note: The algebraic instructions code example for 32-Bit Fractional Multiplication is shown in Example B−30 on page B-33.
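The four-partial-product decomposition above can be verified with an integer model. This Python sketch shows the 32×32 → 64-bit integer case of Example 5−8 (an illustration of the decomposition, not of the MPYM/MACM instructions; Python's arithmetic right shift conveniently gives the signed upper half):

```python
def mul32(x, y):
    """32x32 -> 64-bit signed multiply from four 16-bit partial products."""
    x0, x1 = x & 0xFFFF, x >> 16        # x0 unsigned low half, x1 signed high half
    y0, y1 = y & 0xFFFF, y >> 16
    p = x0 * y0                         # unsigned        X0 * Y0
    p += (x1 * y0) << 16                # signed/unsigned X1 * Y0
    p += (x0 * y1) << 16                # signed/unsigned X0 * Y1
    p += (x1 * y1) << 32                # signed          X1 * Y1
    return p

for a, b in [(123456789, -987654321), (-2**31, 2**31 - 1)]:
    assert mul32(a, b) == a * b         # exact 64-bit product
print("ok")
```

The Q31 fractional routine above is the same decomposition with the X0*Y0 term discarded (it only affects bits below the Q31 result) and the FRCT left shift applied.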
5.4 Division
The conditional subtract instruction (SUBC) performs one step of binary
division:
1) The 16-bit divisor is shifted left by 15 bits and is subtracted from the value
in the accumulator.
2) If the result of the subtraction is greater than or equal to 0, the result is
shifted left by 1 bit, 1 is added, and the sum is stored in the accumulator.
Otherwise, the accumulator value is simply shifted left by 1 bit.
Repeating SUBC 16 times leaves the quotient in the low 16 bits of the
accumulator and the remainder in the high 16 bits.
Example 5−10 shows how to use the SUBC instruction to implement unsigned
division with a 16-bit dividend and a 16-bit divisor.
;***************************************************************************
; Pointer assignments: ___________
; AR0 −> Dividend Divisor ) Dividend
; AR1 −> Divisor
; AR2 −> Quotient
; AR3 −> Remainder
;
; Algorithm notes:
; − Unsigned division, 16−bit dividend, 16−bit divisor
; − Sign extension turned off. Dividend & divisor are positive numbers.
; − After division, quotient in AC0(15−0), remainder in AC0(31−16)
;***************************************************************************
Note: The algebraic instructions code example for Unsigned, 16-Bit Integer Division is shown in Example B−31 on page B-33.
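The 16-step conditional-subtract loop can be sketched directly. This Python model illustrates the SUBC-style iteration (it is a model of the algorithm, not the instruction; it assumes a positive dividend and divisor, with the quotient fitting in 16 bits):

```python
def divide_u16(dividend, divisor):
    """Unsigned 16/16 division via 16 conditional-subtract (SUBC-style) steps."""
    acc = dividend
    for _ in range(16):
        trial = acc - (divisor << 15)   # divisor shifted left by 15, subtracted
        if trial >= 0:
            acc = (trial << 1) + 1      # success: shift in a 1 quotient bit
        else:
            acc = acc << 1              # failure: just shift the accumulator
    return acc & 0xFFFF, acc >> 16      # quotient in acc(15-0), remainder above

q, r = divide_u16(33333, 7)
print(q, r)   # → 4761 6
```

After the 16th step the quotient bits have been shifted into the low half and the remainder sits in the high half, matching the AC0(15−0)/AC0(31−16) layout described in the routine's header.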
Example 5−11 shows how to implement unsigned division with a 32-bit divi-
dend and a 16-bit divisor. The code uses two phases of 16-bit by 16-bit integer
division. The first phase takes as inputs the high 16 bits of the 32-bit dividend
and the 16-bit divisor. The result in the low half of the accumulator is the high
16 bits of the quotient. The result in the high half of the accumulator (the
partial remainder) is shifted left by 16 bits and added to the lower 16 bits of
the dividend. This sum and the 16-bit divisor are the inputs to the second
phase of the division. The quotient of the second phase gives the lower 16 bits
of the final quotient, and the remainder of the second phase is the final
remainder.
;***************************************************************************
; Pointer assignments: ___________
; AR0 −> Dividend high half Divisor ) Dividend
; Dividend low half
; ...
; AR1 −> Divisor
; ...
; AR2 −> Quotient high half
; Quotient low half
; ...
; AR3 −> Remainder
;
; Algorithm notes:
; − Unsigned division, 32−bit dividend, 16−bit divisor
; − Sign extension turned off. Dividend & divisor are positive numbers.
; − Before 1st division: Put high half of dividend in AC0
; − After 1st division: High half of quotient in AC0(15−0)
; − Before 2nd division: Put low part of dividend in AC0
; − After 2nd division: Low half of quotient in AC0(15−0) and
; Remainder in AC0(31−16)
;***************************************************************************
Note: The algebraic instructions code example for Unsigned, 32-Bit by 16-Bit Integer Division is shown in Example B−32 on
page B-34.
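The two-phase scheme can be sketched by reusing the 16-step conditional-subtract loop. This Python model is illustrative (positive operands assumed; the phase-2 quotient always fits in 16 bits because the phase-1 remainder is smaller than the divisor):

```python
def subc16(acc, divisor):
    """Run 16 conditional-subtract steps; return (quotient bits, remainder)."""
    for _ in range(16):
        trial = acc - (divisor << 15)
        acc = (trial << 1) + 1 if trial >= 0 else acc << 1
    return acc & 0xFFFF, acc >> 16

def divide_u32_by_u16(dividend, divisor):
    """32-bit / 16-bit unsigned division done in two 16-bit phases."""
    num_h, num_l = dividend >> 16, dividend & 0xFFFF
    quot_h, rem = subc16(num_h, divisor)                 # phase 1: high half
    quot_l, rem = subc16((rem << 16) | num_l, divisor)   # phase 2: rem:low half
    return (quot_h << 16) | quot_l, rem

q, r = divide_u32_by_u16(0x12345678, 1000)
print(q, r)   # → 305419 896
```

Phase 1 produces the high quotient half and a partial remainder; concatenating that remainder with the low dividend half and dividing again yields the low quotient half and the final remainder.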
Some applications might require division with signed numbers instead of
unsigned numbers. The conditional subtract instruction works only with
positive integers. The signed integer division algorithm computes the
quotient as follows:
1) The expected sign of the quotient is determined and saved.
2) The quotient of the absolute values of the dividend and the divisor is deter-
mined using repeated conditional subtract instructions.
3) If the saved sign is negative, the quotient is negated.
Example 5−12 shows the implementation of division with a signed 16-bit divi-
dend and a 16-bit signed divisor, and Example 5−13 extends this algorithm to
handle a 32-bit dividend.
;***************************************************************************
; Pointer assignments: ___________
; AR0 −> Dividend Divisor ) Dividend
; AR1 −> Divisor
; AR2 −> Quotient
; AR3 −> Remainder
;
; Algorithm notes:
; − Signed division, 16−bit dividend, 16−bit divisor
; − Sign extension turned on. Dividend and divisor can be negative.
; − Expected quotient sign saved in AC0 before division
; − After division, quotient in AC1(15−0), remainder in AC1(31−16)
;***************************************************************************
Note: The algebraic instructions code example for Signed, 16-Bit by 16-Bit Integer Division is shown in Example B−33 on
page B-35.
;***************************************************************************
; Pointer assignments: (Dividend and Quotient are long−word aligned)
; AR0 −> Dividend high half (NumH) (even address)
; Dividend low half (NumL)
; AR1 −> Divisor (Den)
; AR2 −> Quotient high half (QuotH) (even address)
; Quotient low half (QuotL)
; AR3 −> Remainder (Rem)
;
; Algorithm notes:
; − Signed division, 32−bit dividend, 16−bit divisor
; − Sign extension turned on. Dividend and divisor can be negative.
; − Expected quotient sign saved in AC0 before division
; − Before 1st division: Put high half of dividend in AC1
; − After 1st division: High half of quotient in AC1(15−0)
; − Before 2nd division: Put low part of dividend in AC1
; − After 2nd division: Low half of quotient in AC1(15−0) and
; Remainder in AC1(31−16)
;***************************************************************************
BCC skip, AC0 >= #0 ; If actual result should be positive, goto skip.
MOV40 dbl(*AR2), AC1 ; Otherwise, negate Quotient.
NEG AC1, AC1
MOV AC1, dbl(*AR2)
skip:
RET
Note: The algebraic instructions code example for Signed, 32-Bit by 16-Bit Integer Division is shown in Example B−34 on
page B-36.
If we start with an initial estimate of Ym, the equation converges to a solution
very rapidly (typically in three iterations for 16-bit resolution). The initial
estimate can be obtained from a look-up table, by choosing a midpoint, or
simply by linear interpolation. The ldiv16 algorithm uses linear interpolation,
which is accomplished by taking the complement of the least significant bits
of the Xnorm value.
- Guard bits:
Each of the C55x accumulators (AC0, AC1, AC2, and AC3) has eight
guard bits (bits 39−32), which allow up to 256 consecutive multiply-and-
accumulate operations before an accumulator overflow.
- Overflow flags:
Each C55x accumulator has an associated overflow flag (see the following
table). When an operation on an accumulator results in an overflow, the
corresponding overflow flag is set.
The DSP has two saturation mode bits: SATD for operations in the D unit of
the CPU and SATA for operations in the A unit of the CPU. When the SATD
bit is set and an overflow occurs in the D unit, the CPU saturates the result.
Regardless of the value of SATD, the appropriate accumulator overflow
flag is set. Although no flags track overflows in the A unit, overflowing re-
sults in the A unit are saturated when the SATA bit is set.
Saturation replaces the overflowing result with the nearest range bound-
ary. Consider a 16-bit register which has range boundaries of 8000h (larg-
est negative number) and 7FFFh (largest positive number). If an operation
generates a result greater than 7FFFh, saturation can replace the result
with 7FFFh. If a result is less than 8000h, saturation can replace the result
with 8000h.
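The clamping behavior described above is easy to state in code. A minimal sketch of 16-bit saturation (illustrative; on the DSP this happens in hardware when SATA/SATD is set):

```python
def saturate16(value):
    """Clamp a result to the signed 16-bit range [-0x8000, 0x7FFF]."""
    if value > 0x7FFF:        # beyond the largest positive number
        return 0x7FFF
    if value < -0x8000:       # beyond the largest negative number
        return -0x8000
    return value

print(saturate16(0x7FFF + 10))   # → 32767
print(saturate16(-0x8000 - 1))   # → -32768
print(saturate16(1234))          # → 1234 (in range, unchanged)
```

Unlike wraparound overflow, saturation keeps the error bounded to the distance from the range boundary, which is why it is a common choice for audio signals.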
Methods of Handling Overflows
- Saturation:
- Input scaling:
You can analyze the system that you want to implement and scale the in-
put signal, assuming worst conditions, to avoid overflow. However, this ap-
proach can greatly reduce the precision of the output.
- Fixed scaling:
You can scale the intermediate results, assuming worst-case conditions.
This method prevents overflow but also reduces the system's signal-to-
noise ratio.
- Dynamic scaling:
The intermediate results can be scaled only when needed. You can ac-
complish this by monitoring the range of the intermediate results. This
method prevents overflow but increases the computational requirements.
The next sections demonstrate these methodologies applied to FIR (finite im-
pulse response) filters, IIR (infinite impulse response) filters and FFTs (fast
Fourier transforms).
The best way to handle overflow problems in FIR (finite impulse response) fil-
ters is to design the filters with a gain less than 1 to avoid having to scale the
input data. This method, combined with the guard bits available in each of the
accumulators, provides a robust way to handle overflows in the filters.
Fixed scaling and input scaling are not used due to their negative impact on
signal resolution (basically one bit per multiply-and-accumulate operation).
Dynamic scaling can be used for an FIR filter if the resulting increase in cycles
is not a concern. Saturation is also a common option for certain types of audio
signals.
or

Option 2:  Gk = [ Σn |f(n)|^2 ]^(1/2)
Chapter 6
This chapter introduces the concept and the syntax of bit-reverse addressing.
It then explains how bit-reverse addressing can help to speed up a Fast Fourier
Transform (FFT) algorithm. To find code that performs complex and real FFTs
(forward and reverse) and bit-reversing of FFT vectors, see Chapter 8, TI C55x
DSPLIB.
Introduction to Bit-Reverse Addressing
Table 6−1 shows the syntax for each of the two bit-reversed addressing
modes supported by the TMS320C55x (C55x) DSP.

Operand Syntax   Function             Description
*(ARx − T0B)     address = ARx        After access, T0 is subtracted from ARx with
                 ARx = (ARx − T0)     reverse carry (rc) propagation.
*(ARx + T0B)     address = ARx        After access, T0 is added to ARx with reverse
                 ARx = (ARx + T0)     carry (rc) propagation.
Assume that the auxiliary registers are 8 bits long, that AR2 represents the
base address of the data in memory (0110 0000b), and that T0 contains the
value 0000 1000b (decimal 8). Example 6−1 shows a sequence of modifica-
tions of AR2 and the resulting values of AR2.
Table 6−2 shows the relationship between a standard bit pattern that is
repeatedly incremented by 1 and a bit-reversed pattern that is repeatedly
incremented by 1000b with reverse carry propagation. Compare the
bit-reversed pattern to the 4 LSBs of AR2 in Example 6−1.
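Reverse-carry addition is equivalent to bit-reversing both operands over the affected field, adding normally, and bit-reversing the result. This Python sketch models the *(ARx+T0B) update on a 4-bit field (illustrative only; the hardware propagates the carry right-to-left directly):

```python
def reverse_bits(value, nbits):
    """Reverse the low nbits of value."""
    return int(format(value, f"0{nbits}b")[::-1], 2)

def bit_reversed_add(ar, t0, nbits=4):
    """*(ARx+T0B): add T0 to ARx with the carry propagating right-to-left."""
    s = (reverse_bits(ar, nbits) + reverse_bits(t0, nbits)) & ((1 << nbits) - 1)
    return reverse_bits(s, nbits)

# T0 = 8 (1000b) on a 4-bit field walks the 8-point bit-reversed order:
ar, seq = 0, []
for _ in range(8):
    seq.append(ar)
    ar = bit_reversed_add(ar, 0b1000)
print(seq)   # → [0, 8, 4, 12, 2, 10, 6, 14]
```

The sequence 0, 8, 4, 12, 2, 10, 6, 14 is exactly the bit-reversed ordering an 8-point FFT expects, which is what makes this addressing mode useful.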
Figure 6−1. FFT Flow Graph Showing Bit-Reversed Input and In-Order Output
[Flow graph: 8-point radix-2 FFT. Inputs in bit-reversed order: x(0), x(4),
x(2), x(6), x(1), x(5), x(3), x(7); outputs in order: X(0) through X(7). The
butterflies use twiddle factors W_N^0 through W_N^3 and −1 multipliers.]
Consider a complex FFT of size N (that is, an FFT with an input vector that con-
tains N complex numbers). You can bit-reverse either the input or the output
vectors by executing the following steps:
1) Write 0 to the ARMS bit of status register 2 to select the DSP mode for AR
indirect addressing. (Bit-reverse addressing is not available in the control
mode of AR indirect addressing.) Then use the .arms_off directive to notify
the assembler of this selection.
2) Use Table 6−3 to determine how the base pointer of the input array must
be aligned to match the given vector format. Then load an auxiliary register
with the proper base address.
4) Ensure that the entire array fits within a 64K boundary (the largest possible
array addressable by the 16-bit auxiliary register).
6-4
Using Bit-Reverse Addressing In FFT Algorithms
Therefore, AR0 must be loaded with a base address of this form (Xs are don’t
cares):
T0 = 2^n = 2^6 = 64
;...
BCLR ARMS ; reset ARMS bit to allow bit−reverse addressing
.arms_off ; notify the assembler of ARMS bit = 0
;...
off_place:
RPTBLOCAL LOOP
MOV dbl(*AR0+), AC0 ; AR0 points to input array
LOOP: MOV AC0, dbl(*(AR1+T0B)) ; AR1 points to output array
; T0 = NX = number of complex elements in
; array pointed to by AR0
Note: The algebraic instructions code example for Off-Place Bit Reversing of a Vector Array (in Assembly) is shown in
Example B−35 on page B-37.
Using the C55x DSPLIB for FFTs and Bit-Reversing
Example 6−3 shows how you can invoke the cbrev() DSPLIB function from C
to do in-place bit-reversing. The code bit-reverses the positions of the
in-order elements of a complex vector x in place and then computes a
complex FFT of the bit-reversed vector. See Chapter 8 for an introduction to
the C55x DSPLIB.
Example 6−3. Using DSPLIB cbrev() Routine to Bit Reverse a Vector Array (in C)
#define NX 64

short x[2*NX];
short scale = 1;

void main(void)
{
    /* ... */
    cbrev(x, x, NX);    /* in-place bit-reversing on input data (Re-Im format) */
    cfft(x, NX, scale); /* 64-point complex FFT on bit-reversed input data with
                           scaling by 2 at each phase enabled */
    /* ... */
}
Note: This example shows portions of the file cfft_t.c in the TI C55x DSPLIB (introduced in Chapter 8).
Symmetric and Asymmetric FIR Filtering (FIRS, FIRSN)
Another class of FIR filter is the antisymmetric FIR: h(n) = −h(N−n). A com-
mon example is the Hilbert transformer, which shifts positive frequencies by
+90 degrees and negative frequencies by −90 degrees. Hilbert transformers
may be used in applications, such as modems, in which it is desired to cancel
lower sidebands of modulated signals.
Figure 7−1 gives examples of symmetric and antisymmetric filters, each with
eight coefficients (a0 through a7). Both symmetric and antisymmetric filters
may be of even or odd length. However, even-length symmetric filters lend
themselves to computational shortcuts which will be described in this section.
It is sometimes possible to reformulate an odd-length filter as a filter with one
more tap, to take advantage of these constructs.
Because (anti)symmetric filters have only N/2 distinct coefficients, they may
be folded and performed with N/2 additions (subtractions) and N/2 multiply-
and-accumulate operations. Folding means that pairs of elements in the delay
buffer which correspond to the same coefficient are pre-added (or pre-
subtracted) prior to multiplying and accumulating.
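The folding idea can be checked numerically. This Python sketch (illustrative; a real implementation streams samples through a delay buffer rather than slicing a window) compares a direct 8-tap symmetric FIR against its folded form with N/2 multiplies:

```python
def fir_direct(window, coeffs):
    """Direct-form FIR over one window: y = sum of a[k] * x[k], all N taps."""
    return sum(c * xv for c, xv in zip(coeffs, window))

def fir_folded(window, half_coeffs):
    """Folded symmetric FIR: pre-add the sample pairs that share a
    coefficient, then do only N/2 multiply-and-accumulate operations."""
    n = 2 * len(half_coeffs)
    return sum(a * (window[k] + window[n - 1 - k])
               for k, a in enumerate(half_coeffs))

a = [0.1, 0.2, 0.3, 0.4]                 # a0..a3; full symmetric filter is a + reversed(a)
window = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(fir_folded(window, a), fir_direct(window, a + a[::-1]))   # → 9.0 9.0
```

An antisymmetric filter uses the same structure with the pre-addition replaced by a pre-subtraction, which is exactly the FIRS/FIRSN distinction.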
The C55x DSP offers two different ways to implement symmetric and asym-
metric filters. This section shows how to implement these filters using specific
instructions, FIRS and FIRSN. To see how to implement symmetric and asym-
metric filters using the dual-MAC hardware, see section 4.1.1, Implicit Algo-
rithm Symmetry, which begins on page 4-4. The firs/firsn implementation and
the dual-MAC implementation are equivalent from a throughput standpoint.
[Figure 7−1: Symmetric and antisymmetric filter impulse responses, each
with eight coefficients a0 through a7.]

Definitions:
a = Filter coefficient
n = Sample index
N = Number of filter taps
x = Filter input data value
Y = Filter output
firs(Xmem,Ymem,Cmem,ACx,ACy)
Operand(s) Description
Xmem and Ymem One of these operands points to the newest value in the
delay buffer. The other points to the oldest value in the
delay buffer.
The antisymmetric FIR is the same as the symmetric FIR except that the pre-
addition of sample pairs is replaced with a pre-subtraction. The C55x instruc-
tion for antisymmetric FIR filtering is:
firsn(Xmem,Ymem,Cmem,ACx,ACy)
;
; Start of outer loop
;−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
localrepeat { ; Start the outer loop
; Clear AC0 and pre−load AC1 with the sum of the 1st and last inputs
||AC0 = #0;
Note: This example shows portions of the file firs.asm in the TI C55x DSPLIB (introduced in Chapter 8).
Some applications for adaptive FIR (finite impulse response) and IIR (infinite
impulse response) filtering include echo and acoustic noise cancellation. In
these applications, an adaptive filter tracks changing conditions in the environ-
ment. Although, in theory, both FIR and IIR structures can be used as adaptive
filters, stability problems and the local optimum points of IIR filters make them
less attractive for this use. Therefore, FIR filters are typically used for practical
adaptive filter applications. The least mean square (LMS), local block-repeat,
and parallel instructions on the C55x DSP can be used to efficiently implement
adaptive filters. The block diagram of an adaptive FIR filter is shown in
Figure 7−2.
[Figure 7−2: Adaptive FIR filter. The input x(n) feeds a tapped delay line
(z−1 stages) with coefficients b0, b1, ..., bN−1; the tap products are summed
to form the output y(n). The error, formed by subtracting y(n) from the desired
response d(n), drives the LMS coefficient update.]
Two common algorithms employed for least mean squares adaptation are the
non-delayed LMS and the delayed LMS algorithm. When compared to non-
delayed LMS, the more widely used delayed LMS algorithm has the advan-
tage of greater computational efficiency at the expense of slightly relaxed con-
vergence properties. Therefore, section 7.2.1 describes only the delayed LMS
algorithm.
Adaptive Filtering (LMS)
        N−1
y(n) =   Σ   bk · x(n − k)
        k=0
where
y = Filter output
n = Sample index
k = Delay index
N = Number of filter taps
bk = Adaptive coefficient
x = Filter input data value
The value of the error is computed and stored to be used in the next invocation:

e(n) = d(n) − y(n)

where
e = Error
d = Desired response
y = Actual response (filter output)
The coefficients are updated based on an error value computed in the previous
invocation of the algorithm (β is the convergence constant):
The delayed LMS algorithm can be implemented with the LMS instruction,
lms(Xmem, Ymem, ACx, ACy), which performs a multiply-and-accumulate
(MAC) operation and, in parallel, an addition with rounding.
The input operands of the multiplier are the content of data memory operand
Xmem, sign extended to 17 bits, and the content of data memory operand
Ymem, sign extended to 17 bits. One possible implementation would assign
the following roles to the operands of the LMS instruction:
Operand(s) Description
ACy ACy is one of the four accumulators (AC0−AC3) but is not the
same accumulator as ACx. ACy holds the output of the FIR filter.
StartSample:
Note: The algebraic instructions code example for Delayed LMS Implementation of an Adaptive Filter is shown in
Example B−36 on page B-38.
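The delayed LMS algorithm described above can be sketched in plain code. This Python model is an illustration only (the tap count, β value, and test signal are arbitrary choices, not taken from the manual); the defining feature is that the coefficient update uses the error and samples from the previous invocation:

```python
import math

def delayed_lms(x, d, ntaps=4, beta=0.02):
    """Delayed LMS adaptive FIR: coefficients are updated with the error
    from the previous sample, not the current one."""
    b = [0.0] * ntaps
    prev_err, prev_x = 0.0, [0.0] * ntaps
    errs = []
    for n in range(len(x)):
        xs = [x[n - k] if n - k >= 0 else 0.0 for k in range(ntaps)]
        y = sum(bk * xk for bk, xk in zip(b, xs))        # FIR output y(n)
        err = d[n] - y                                    # e(n) = d(n) - y(n)
        b = [bk + beta * prev_err * xk                    # update uses e(n-1)
             for bk, xk in zip(b, prev_x)]
        prev_err, prev_x = err, xs                        # one-sample delay
        errs.append(err)
    return b, errs

# Identify an unknown 2-tap system from a deterministic two-tone input
x = [math.sin(0.1 * n) + 0.5 * math.cos(0.37 * n) for n in range(4000)]
d = [0.5 * x[n] - 0.3 * (x[n - 1] if n else 0.0) for n in range(4000)]
b, errs = delayed_lms(x, d)
print([round(bk, 3) for bk in b])   # coefficients approach 0.5, -0.3, 0, 0
```

The one-sample delay in the update is what lets the hardware pipeline the MAC and the coefficient update, at the cost of the slightly relaxed convergence mentioned above.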
The rate of the convolutional encoder is defined as R = 1/n. Figure 7−3 gives
an example of a convolutional encoder with K=5 and R = 1/2.
Figure 7−3. Convolutional Encoder With K = 5 and R = 1/2
(The input bits shift through four z−1 delay elements, giving taps k0 through k4. Two XOR networks combine the taps to produce the output bit streams G0 and G1.)
The C55x DSP creates the output streams (G0 and G1) by XORing the shifted
input stream (see Figure 7−4).
Convolutional Encoding (BFXPA, BFXTR)
Figure 7−4. Generation of 12 Bits of G0
(The 16-bit input stream is XORed with copies of itself shifted right by 3 and by 4 bits. The low 12 bits of the result form G0, from bit x(0) + x(3) + x(4) up to bit x(11) + x(14) + x(15).)
Example 7−3 shows an implementation of the output bit streams for the con-
volutional encoder of Figure 7−3.
; Generate G0
XOR AC1 << #−3, AC0 ; A = A XOR B>>3
XOR AC1 << #−4, AC0 ; A = A XOR B>>4
MOV AC0, T0 ; Save G0
; Generate G1
XOR AC1 << #−1, AC0 ; A = A XOR B>>1
XOR AC1 << #−3, AC0 ; A = A XOR B>>3
XOR AC1 << #−4, AC0 ; A = A XOR B>>4 −−> AC0_L = G1
Note: The algebraic instructions code example for Generation of Output Streams G0 and G1 is shown in Example B−37 on
page B-39.
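For reference, the same two output streams can be modeled in plain C, with each 16-bit word holding the input stream LSB-first. The function names are illustrative; the shift amounts follow the encoder polynomials G0(x) = 1 + x^3 + x^4 and G1(x) = 1 + x + x^3 + x^4:

```c
#include <stdint.h>

/* G0(x) = 1 + x^3 + x^4  ->  x XOR (x >> 3) XOR (x >> 4) */
static uint16_t encode_g0(uint16_t x)
{
    return (uint16_t)(x ^ (x >> 3) ^ (x >> 4));
}

/* G1(x) = 1 + x + x^3 + x^4  ->  x XOR (x >> 1) XOR (x >> 3) XOR (x >> 4) */
static uint16_t encode_g1(uint16_t x)
{
    return (uint16_t)(x ^ (x >> 1) ^ (x >> 3) ^ (x >> 4));
}
```

Each output bit n is the XOR of the taps named in the polynomial, e.g. G0 bit 0 = x(0) + x(3) + x(4) as in Figure 7−4.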
Figure 7−5. Multiplexing of the Output Streams
(A multiplexer interleaves the G0 and G1 bit streams into a single stream, G0G1.)
The C55x DSP has a dedicated instruction to perform the multiplexing of the
bit streams:
dst = field_expand(ACx,k16)
1) Clear the destination register (dst).
2) Initialize index_in_ACx to 0.
3) Initialize index_in_dst to 0.
4) Scan the bit field mask k16 from bit 0 to bit 15, testing each bit. For each
tested mask bit:
If the tested bit is 1:
a) Copy the bit pointed to by index_in_ACx to the bit pointed to by
index_in_dst.
b) Increment index_in_ACx.
c) Increment index_in_dst, and test the next mask bit.
If the tested bit is 0:
Increment index_in_dst, and test the next mask bit.
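The scan procedure above can be modeled bit-for-bit in C. field_expand16 is a hypothetical helper name, not a TI API; the test values come from Example 7−4:

```c
#include <stdint.h>

/* Model of dst = field_expand(ACx, k16): scan the mask from bit 0 to
 * bit 15; for every 1 in the mask, copy the next source bit into the
 * destination at the mask-bit position; 0 mask bits leave a 0 behind. */
static uint16_t field_expand16(uint16_t src, uint16_t mask)
{
    uint16_t dst = 0;
    int index_in_src = 0;   /* index_in_ACx in the instruction description */
    int index_in_dst;

    for (index_in_dst = 0; index_in_dst < 16; index_in_dst++) {
        if ((mask >> index_in_dst) & 1) {
            if ((src >> index_in_src) & 1)
                dst |= (uint16_t)(1u << index_in_dst);
            index_in_src++;     /* consume one source bit */
        }
        /* for a 0 mask bit, only the destination index advances */
    }
    return dst;
}
```

With mask 5555h the low 8 source bits land on the even destination bits, which is exactly how G0 is spread before being ORed with the expanded G1.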
Example 7−4. Multiplexing Two Bit Streams With the Field Expand Instruction
G0G1 = field_expand(G0,#5555h)

  5555h      0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
  G0(15−0)   X X X X X X X X 1 1 0 0 1 0 1 1
  Result     0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1

G0G1 = field_expand(G1,#AAAAh)

  AAAAh      1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
  G1(15−0)   X X X X X X X X 0 1 1 1 0 0 1 0
  Result     0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0

G0G1 = G0G1 | Temp

  Result     0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 1
dst = field_extract(ACx,k16)
1) Clear the destination register (dst).
2) Initialize index_in_ACx to 0.
3) Initialize index_in_dst to 0.
4) Scan the bit field mask k16 from bit 0 to bit 15, testing each bit. For each
tested mask bit:
If the tested bit is 1:
a) Copy the bit pointed to by index_in_ACx to the bit pointed to by
index_in_dst.
b) Increment index_in_dst.
c) Increment index_in_ACx, and test the next mask bit.
If the tested bit is 0:
Increment index_in_ACx, and test the next mask bit.
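The extract scan is the mirror image of the expand scan and can be modeled the same way. field_extract16 is a hypothetical helper name; the test values undo the multiplexing of Example 7−4:

```c
#include <stdint.h>

/* Model of dst = field_extract(ACx, k16): scan the mask from bit 0 to
 * bit 15; for every 1 in the mask, copy the source bit at that position
 * into the next destination bit; 0 mask bits skip a source bit. */
static uint16_t field_extract16(uint16_t src, uint16_t mask)
{
    uint16_t dst = 0;
    int index_in_dst = 0;
    int index_in_src;       /* index_in_ACx in the instruction description */

    for (index_in_src = 0; index_in_src < 16; index_in_src++) {
        if ((mask >> index_in_src) & 1) {
            if ((src >> index_in_src) & 1)
                dst |= (uint16_t)(1u << index_in_dst);
            index_in_dst++;     /* fill the next destination bit */
        }
        /* for a 0 mask bit, only the source index advances */
    }
    return dst;
}
```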
Example 7−5 demonstrates the use of the field_extract() instruction. The ex-
ample shows how to de-multiplex the signal created in Example 7−4.
Example 7−5. Demultiplexing a Bit Stream With the Field Extract Instruction
G0 = field_extract(G0G1,#5555h)
5555h 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
G0G1(15−0) 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 1
G0 X X X X X X X X 1 1 0 0 1 0 1 1
G1 = field_extract(G0G1,#AAAAh)
AAAAh 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
G0G1(15−0) 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 1
G1 X X X X X X X X 0 1 1 1 0 0 1 0
The convolutional encoder depicted in Figure 7−3 (page 7-10) is used in the
global system for mobile communications (GSM) and is described by the fol-
lowing polynomials (K=5):
G0(x) = 1 + x^3 + x^4        G1(x) = 1 + x + x^3 + x^4
In the case of the GSM encoder, there are 16 possible states for every symbol
time interval. For rate 1/n systems, there is some inherent symmetry in the trel-
lis structure, which simplifies the calculations. The path states leading to a
delay state are complementary. That is, if one path has G0G1 = 00, the other
path has G0G1 = 11. This symmetry is based on the encoder polynomials and
is true for most systems. Two starting and ending complementary states can
be paired together, including all the paths between them, to form a butterfly
structure (see Figure 7−6). Hence, only one local distance is needed for each
butterfly; it is added and subtracted for each new state. Additionally, the old
metric values are the same for both updates, so address manipulation is mini-
mized.
Viterbi Algorithm for Channel Decoding (ADDSUB, SUBADD, MAXDIFF)
Figure 7−6. Butterfly Structure for K = 5, Rate 1/2 GSM Convolutional Encoder
(At symbol time 0, the complementary old states 0000 (Old_Metric(0)) and 0001 (Old_Metric(1)) each branch to the new states 0000 (New_Metric(0)) and 1000 (New_Metric(8)) at symbol time 1. The two paths into each new state carry the complementary symbols G0G1 = 00 and G0G1 = 11.)
The following equation defines a local distance for the rate 1/2 GSM system:

LD(j) = SD0 · G0(j) + SD1 · G1(j)

where

SD0, SD1 = Soft-decision values of the received symbol pair
G0(j), G1(j) = Expected encoder output symbols for path j
Usually, the Gx(j)s are coded as signed antipodal numbers, meaning that “0”
corresponds to +1 and “1” corresponds to −1. This coding reduces the local
distance calculation to simple addition and subtraction.
As shown in Example 7−6, the DSP can calculate a butterfly quickly by using
its accumulators in a dual 16-bit computation mode. To determine the new path
metric j, two possible path metrics, 2j and 2j+1, are calculated in parallel with
local distances (LD and −LD) using the add-subtract (ADDSUB) instruction
and an accumulator. To determine the new path metric j+2^(K−2), the subtract-
add (SUBADD) instruction is used with the same old path metrics and the local
distances in a separate accumulator. The MAXDIFF instruction is then
used on both accumulators to determine the new path metrics. The MAXDIFF
instruction compares the upper and lower 16-bit values for two given accumu-
lators, and stores the larger values in a third accumulator.
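One butterfly's add-compare-select step can be sketched in C, mirroring the roles of ADDSUB, SUBADD, and MAXDIFF described above. The function name is invented, and modeling TRN0/TRN1 as software shift registers is a simplification of the hardware behavior:

```c
#include <stdint.h>

/* One Viterbi butterfly for a rate 1/2 code: old states 2j and 2j+1
 * merge into new states j and j + 2^(K-2). LD is the single local
 * distance shared by the butterfly (added on one branch, subtracted
 * on the complementary branch). */
static void viterbi_butterfly(const int16_t *old_met, int j, int K,
                              int16_t ld, int16_t *new_met,
                              unsigned *trn0, unsigned *trn1)
{
    int half = 1 << (K - 2);

    /* ADDSUB-style pair: candidates for new state j */
    int16_t m0 = (int16_t)(old_met[2 * j]     + ld);
    int16_t m1 = (int16_t)(old_met[2 * j + 1] - ld);

    /* SUBADD-style pair: candidates for new state j + 2^(K-2) */
    int16_t m2 = (int16_t)(old_met[2 * j]     - ld);
    int16_t m3 = (int16_t)(old_met[2 * j + 1] + ld);

    /* MAXDIFF-style compare-select: keep the larger metric and record
     * which path survived, as TRN0/TRN1 do on the C55x. */
    new_met[j] = (m0 >= m1) ? m0 : m1;
    *trn0 = (*trn0 << 1) | (unsigned)(m1 > m0);

    new_met[j + half] = (m2 >= m3) ? m2 : m3;
    *trn1 = (*trn1 << 1) | (unsigned)(m3 > m2);
}
```

The recorded selection bits are what a traceback pass later walks to recover the uncoded data.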
Example 7−6 shows two macros for Viterbi butterfly calculations. The add-
subtract (ADDSUB) and subtract-add (SUBADD) computations in the two
macros are performed in alternating order, which is based on the expected en-
coder state. A local distance is stored in the register T3 beforehand. The
MAXDIFF instruction performs the add-compare-select function in 1 cycle.
The updated path metrics are saved to memory by the next two lines of code.
Two 16-bit transition registers (TRN0 and TRN1) are updated with every com-
parison done by the MAXDIFF instruction, so that the selected path metric can
be tracked. TRN0 tracks the results from the high part data path, and TRN1
tracks the low part data path. These bits are later used in traceback, to deter-
mine the original uncoded data. Using separate transition registers allows for
storing the selection bits linearly, which simplifies traceback. In contrast, the
TMS320C54x (C54x) DSP has only one transition register, storing the
selection bits as 0, 8, 1, 9, etc. As a result, on the C54x DSP, additional lines
of code are needed to process these bits during traceback.
You can make the Viterbi butterfly calculations faster by implementing user-
defined instruction parallelism (see section 4.2, page 4-16) and software pipe-
lining. Example 7−7 (page 7-20) shows the inner loop of a Viterbi butterfly al-
gorithm. The algorithm places some instructions in parallel (||) in the CPU, and
the algorithm implements software pipelining by saving previous results at the
same time it performs new calculations. Other operations, such as loading the
appropriate local distances, are coded with the butterfly algorithm.
BFLY_DIR_MNEM .MACRO
;new_metric(j)&(j+2^(K−2))
;
ADDSUB T3, *AR5+, AC0 ; AC0(39−16) = Old_Met(2*j)+LD
; AC0(15−0) = Old_met(2*j+1)−LD
BFLY_REV_MNEM .MACRO
;new_metric(j)&(j+2^(K−2))
SUBADD T3, *AR5+, AC0 ; AC0(39−16) = Old_Met(2*j)−LD
; AC0(15−0) = Old_met(2*j+1)+LD
Note: The algebraic instructions code example for Viterbi Butterflies for Channel Coding is shown in Example B−38 on
page B-39.
RPTBLOCAL end
butterfly:
ADDSUB T3, *AR0+, AC0 ; AC0(39−16) = Old_Met(2*j)+LD
; AC0(15−0) = Old_met(2*j+1)−LD
|| MOV *AR5+, AR7
end MOV AC2, *AR2(T0), *AR2+ ; Store lower and upper maxima
; from previous MAXDIFF operation
|| MAXDIFF AC0, AC1, AC2, AC3 ; Compare AC0 and AC1
Note: The algebraic instructions code example for Viterbi Butterflies Using Instruction Parallelism is shown in Example B−39
on page B-40.
Chapter 8
TI C55x DSPLIB
The TI DSPLIB includes commonly used DSP routines. Source code is pro-
vided to allow you to modify the functions to match your specific needs
and is shipped as part of the C55x Code Composer Studio product under
the c:\ti\C5500\dsplib\55x_src directory.
Features and Benefits / DSPLIB Data Types / DSPLIB Arguments
- Q.15 (DATA): A Q.15 operand is represented by a short data type (16-bit)
that is predefined as DATA in the dsplib.h header file.
- Q.31 (LDATA): A Q.31 operand is represented by a long data type (32-bit)
that is predefined as LDATA in the dsplib.h header file.
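As a rough C model of the Q.15 convention (the typedef names follow dsplib.h, but the conversion helper is invented for illustration):

```c
#include <stdint.h>

typedef int16_t DATA;    /* Q.15 operand, as typedef'd in dsplib.h */
typedef int32_t LDATA;   /* Q.31 operand */

/* A Q.15 value q represents the fraction q / 2^15, so the
 * representable range is [-1.0, +1.0 - 2^-15]. */
static DATA float_to_q15(double v)
{
    if (v >= 32767.0 / 32768.0) return (DATA)0x7FFF;  /* clamp to max */
    if (v < -1.0)               return (DATA)0x8000;  /* clamp to min */
    return (DATA)(v * 32768.0);
}
```

For example, 0.5 maps to 0x4000 and −1.0 maps to 0x8000; Q.31 works the same way with a 2^31 scale factor.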
Calling a DSPLIB Function from C
- Link your code with the DSPLIB object code library, 55xdsp.lib.
For example, the following code contains a call to the recip16 and q15tofl rou-
tines in DSPLIB:
#include "dsplib.h"

#define NX 40 /* vector length; value assumed for this example */

DATA x[NX]; /* Q15 input values (initialization not shown here) */
DATA r[NX];
DATA rexp[NX];
float rf1[NX];
float rf2[NX];

void main()
{
    short i;
    for (i = 0; i < NX; i++)
    {
        r[i] = 0;
        rexp[i] = 0;
    }

    /* The calls below were lost from this listing and are reconstructed;
       see SPRU422 for the exact argument lists. */
    recip16(x, r, rexp, NX); /* compute reciprocals of x */
    q15tofl(r, rf1, NX); /* convert Q15 results to floating point */

    return;
}
In this example, the q15tofl DSPLIB function is used to convert Q15 fractional
values to floating-point fractional values. However, in many applications, your
data is always maintained in Q15 format so that the conversion between float-
ing point and Q15 is not required.
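A minimal sketch of the conversion q15tofl performs, assuming each Q15 word represents q/32768 (the helper name is illustrative, not the DSPLIB entry point):

```c
#include <stdint.h>

/* Convert an array of Q15 fractions to floating-point fractions:
 * each 16-bit value q represents q / 2^15. */
static void q15_to_float(const int16_t *q, float *r, int nx)
{
    int i;
    for (i = 0; i < nx; i++)
        r[i] = (float)q[i] / 32768.0f;
}
```

Keeping data in Q15 end-to-end avoids this conversion cost entirely, which is the point made above.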
Realize that the DSPLIB is not an optimal solution for assembly-only program-
mers. Even though DSPLIB functions can be invoked from an assembly pro-
gram, the resulting execution times and code size may not be optimal due to
unnecessary C-calling overhead.
- araw_t.c: main driver for testing the DSPLIB acorr (raw) function.
- test.h: contains input data (a) and expected output data (yraw) for the
acorr (raw) function. This test.h file is generated using Matlab scripts.
- test.c: contains a function used to compare the output of the araw function
with the expected output data.
- ftest.c: contains a function used to compare two arrays of float data types.
- ltest.c: contains a function used to compare two arrays of long data types.
- 55x.cmd: an example of a linker command file you can use for this function.
DSPLIB Functions
For specific DSPLIB function API descriptions, refer to the TMS320C55x DSP
Library Programmer’s Reference (SPRU422).
This appendix is a list of D-unit instructions where A-unit registers are read
(“pre-fetched”) in the Read phase of the execution pipeline.
Instruction Description
ABS_RR dst = |src|
DRSUB_RLM_D HI(ACx) = HI(Lmem) − Tx , LO(ACx) = LO(Lmem) − Tx
ROR_RR_1 dst = {TC2,Carry} // src // {TC2,Carry}
ST_ADD ACy = ACx + (Xmem << #16) ,Ymem = HI(ACy << T2)
ST_SUB ACy = (Xmem << #16) − ACx ,Ymem = HI(ACy << T2)
This appendix shows the algebraic instructions code examples that corre-
spond to the mnemonic instructions code examples shown in Chapters 2
through 7.
Example B−1. Partial Assembly Code of test.asm (Step 1)
Example B−3. Partial Assembly Code of test.asm (Part 3)
AC0 = @x
AC0 += @(x+3)
AC0 += @(x+1)
AC0 += @(x+2)
end
nop
goto end
_sum:
;** Parameter deleted n == 9u
T0 = #0 ; |3|
repeat(#9)
T0 = T0 + *AR0+
return ; |11|
_main:
SP = SP + #−1
XAR0 = #_a ; |9|
call #_sum ; |9|
; call occurs [#_sum] ; |9|
*(#_sum1) = T0 ; |9|
XAR0 = #_b ; |10|
call #_sum ; |10|
; call occurs [#_sum] ; |10|
*(#_sum2) = T0 ; |10|
SP = SP + #1
return
; return occurs
Example B−5. Assembly Generated Using −o3, −pm, and −oi50
_sum:
T0 = #0 ; |3|
repeat(#9)
T0 = T0 + *AR0+
return ; |11|
_main:
XAR3 = #_a ; |9|
repeat(#9)
|| AR1 = #0 ; |3|
AR1 = AR1 + *AR3+
_vecsum:
AR3 = T0 − #1
BRC0 = AR3
localrepeat {
AC0 = (*AR0+ << #16) + (*AR1+ << #16) ; |7|
*AR2+ = HI(AC0) ; |7|
} ; loop ends ; |8|
L2:
return
_sum:
AR1 = #0 ; |3|
if (T0 <= #0) goto L2 ; |6|
; branch occurs ; |6|
AR2 = T0 − #1
CSR = AR2
repeat(CSR)
AR1 = AR1 + *AR0+
T0 = AR1 ; |11|
return ; |11|
_sum:
AR2 = T0 − #1
CSR = AR2
AR1 = #0 ; |3|
repeat(CSR)
AR1 = AR1 + *AR0+
T0 = AR1 ; |12|
return ; |12|
Example B−9. Generated Assembly for FIR Filter Showing Dual-MAC
_fir:
AR3 = T0 + #1
AR3 = AR3 >> #1
AR3 = AR3 − #1
BRC0 = AR3
push(T3,T2)
T3 = #0 ; |6|
|| XCDP = XAR0
SP = SP + #−1
localrepeat {
T2 = T1 − #1
XAR3 = XAR1
CSR = T2
AR3 = AR3 + T3
XAR4 = XAR3
AR4 = AR4 + #1
AC0 = #0 ; |8|
repeat(CSR)
|| AC1 = AC0 ; |8|
XAR0 = XCDP
T3 = T3 + #2
AR0 = AR0 − T1
|| *AR2(short(#1)) = HI(AC0)
AR2 = AR2 + #2
|| *AR2 = HI(AC1)
XCDP = XAR0
}
SP = SP + #1
T3,T2 = pop()
return
_sadd:
AR1 = T1 ; |5|
T1 = T1 ^ T0 ; |9|
TC1 = bit(T1,@#15) ; |9|
AR1 = AR1 + T0
if (TC1) goto L2 ; |9|
; branch occurs ; |9|
AR2 = T0 ; |9|
AR2 = AR2 ^ AR1 ; |9|
TC1 = bit(AR2,@#15) ; |9|
if (!TC1) goto L2 ; |9|
; branch occurs ; |9|
if (T0 < #0) goto L1 ; |22|
; branch occurs ; |22|
T0 = #32767 ; |22|
goto L3 ; |22|
; branch occurs ; |22|
L1:
AR1 = #−32768 ; |22|
L2:
T0 = AR1 ; |25|
L3:
return ; |25|
; return occurs ; |25|
Example B−11. Assembly Code Generated When Using Compiler Intrinsic for
Saturated Add
_sadd:
bit(ST3, #ST3_SATA) = #1
T0 = T0 + T1 ; |3|
bit(ST3, #ST3_SATA) = #0
return ; |3|
; return occurs ; |3|
_circ:
AC0 = #0 ; |7|
if (T1 <= #0) goto L2 ; |9|
; branch occurs ; |9|
AR3 = T1 − #1
BRC0 = AR3
AR2 = #0 ; |6|
localrepeat {
; loop starts
L1:
AC0 = AC0 + (*AR1+ * *AR0+) ; |11|
|| AR2 = AR2 + #1
TC1 = (AR2 < T0) ; |12|
if (!TC1) execute (D_Unit) ||
AR1 = AR1 − T0
if (!TC1) execute (D_Unit) ||
AR2 = AR2 − T0
} ; loop ends ; |13|
L2:
return ; |14|
; return occurs ; |14|
_circ:
push(T3,T2)
SP = SP + #−7
dbl(*SP(#0)) = XAR1
|| AC0 = #0 ; |4|
dbl(*SP(#2)) = AC0 ; |4|
|| T2 = T0 ; |2|
if (T1 <= #0) goto L2 ; |6|
; branch occurs ; |6|
T0 = #0 ; |3|
T3 = T1
|| dbl(*SP(#4)) = XAR0
L1:
XAR3 = dbl(*SP(#0))
T1 = *AR3(T0) ; |8|
XAR3 = dbl(*SP(#4))
AC0 = dbl(*SP(#2)) ; |8|
T0 = T0 + #1
AC0 = AC0 + (T1 * *AR3+) ; |8|
dbl(*SP(#2)) = AC0 ; |8|
dbl(*SP(#4)) = XAR3
call #I$$MOD ; |9|
|| T1 = T2 ; |9|
; call occurs [#I$$MOD] ; |9|
T3 = T3 − #1
if (T3 != #0) goto L1 ; |10|
; branch occurs ; |10|
L2:
AC0 = dbl(*SP(#2))
SP = SP + #7 ; |11|
T3,T2 = pop()
return ; |11|
; return occurs ; |11|
Algebraic Instructions Code Examples
.data
A .int 1,2,3,4,5,6 ; Complex input vector #1
B .int 7,8,9,10,11,12 ; Complex input vector #2
.text
bit(ST2,#ST2_ARMS) = #0 ; Clear ARMS bit (select DSP mode)
.arms_off ; Tell assembler ARMS = 0
cplxmul:
XAR0 = #A ; Pointer to A vector
XCDP = #B ; Pointer to B vector
XAR1 = #C ; Pointer to C vector
BRC0 = #(N−1) ; Load loop counter
T0 = #1 ; Pointer offset
T1 = #2 ; Pointer increment
.data
COEFFS .int 1,2,3,4 ; Coefficients
IN_DATA .int 1,2,3,4,5,6,7,8,9,10,11 ; Input vector
.text
bit(ST2,#ST2_ARMS) = #0 ; Clear ARMS bit (select DSP mode)
.arms_off ; Tell assembler ARMS = 0
bfir:
XCDP = #COEFFS ; Pointer to coefficient array
XAR0 = #(IN_DATA + N_TAPS − 1) ; Pointer to input vector
XAR1 = #(IN_DATA + N_TAPS) ; 2nd pointer to input vector
XAR2 = #OUT_DATA ; Pointer to output vector
BRC0 = #((N_DATA − N_TAPS + 1)/2 − 1) ; Load outer loop counter
CSR = #(N_TAPS − 1) ; Load inner loop counter
.data
COEFFS .int 1,2,3,4 ; Coefficients
IN_DATA .int 1,2,3,4,5,6,7,8,9,10,11 ; Input vector
.text
bit(ST2,#ST2_ARMS) = #0 ; Clear ARMS bit (select DSP mode)
.arms_off ; Tell assembler ARMS = 0
bfir:
XCDP =
#COEFFS ; Pointer to coefficient array
XAR0 =
#(IN_DATA + N_TAPS − 1) ; Pointer to input vector
XAR1 =
#(IN_DATA + N_TAPS) ; 2nd pointer to input vector
XAR2 =
#OUT_DATA ; Pointer to output vector
BRC0 =
#((N_DATA − N_TAPS + 1)/2 − 1)
; Load outer loop counter
CSR = #(N_TAPS − 3) ; Load inner loop counter
T0 = #(−(N_TAPS − 1)) ; CDP rewind increment
; Variables
.data
.global start_a1
.text
start_a1:
AR0 = #HST_FLAG ; AR0 points to Host Flag
AR2 = #HST_DATA ; AR2 points to Host Data
AR1 = #COEFF1 ; AR1 points to COEFF1 buffer initially
AR3 = #COEFF2 ; AR3 points to COEFF2 buffer initially
CSR = #4 ; Set CSR = 4 for repeat in COMPUTE
BIT(ST1, #ST1_FRCT) = #1 ; Set fractional mode bit
BIT(ST1, #ST1_SXMD) = #1 ; Set sign−extension mode bit
LOOP:
T0 = *AR0 ; T0 = Host Flag
if (T0 == #READY) GOTO PROCESS ; If Host Flag is ”READY”, continue
GOTO LOOP ; process − else poll Host Flag again
PROCESS:
T0 = *AR2 ; T0 = Host Data
COMPUTE:
AC1 = #0 ; Initialize AC1 to 0
REPEAT(CSR) ; CSR has a value of 4
AC1 = AC1 + (*AR2 * *AR3+) ; This MAC operation is performed
; 5 times
AR4 = AC1 ; Result is in AR4
RETURN
HALT:
GOTO HALT
Example B−18. A-Unit Code in Example B−17 Modified to Take Advantage of Parallelism
; Variables
.data
.global start_a2
.text
start_a2:
LOOP:
T0 = *AR0 ; T0 = Host Flag
if (T0 == #READY) GOTO PROCESS ; If Host Flag is “READY”, continue
GOTO LOOP ; process − else poll Host Flag again
PROCESS:
T0 = *AR2 ; T0 = Host Data
END
COMPUTE:
AC1 = #0 ; Initialize AC1 to 0
|| REPEAT(CSR) ; CSR has a value of 4
AC1 = AC1 + (*AR2 * *AR3+) ; This MAC operation is performed
; 5 times
AR4 = AC1 ; Result is in AR4
|| RETURN
HALT:
GOTO HALT
; Variables
.data
.global start_p1
.text
start_p1:
XDP = #var1
AR3 = #var2
AC2 = #0006h
BLOCKREPEAT {
AC1 = AC2
AR1 = #8000h
LOCALREPEAT {
AC1 = AC1 − #1
*AR1+ = AC1
}
AC2 = AC2 + #1
}
@(AC0_L) = BRC0 ; AC0_L loaded using EB
@(AC1_L) = BRC1 ; AC1_L loaded using EB
Example B−20. P-Unit Code in Example B−19 Modified to Take Advantage of Parallelism
; Variables
.data
.global start_p2
.text
start_p2:
XDP = #var1
AR3 = #var2
AC2 = #0006h
|| BLOCKREPEAT {
AC1 = AC2
AR1 = #8000h
|| LOCALREPEAT {
AC1 = AC1 − #1
*AR1+ = AC1
}
AC2 = AC2 + #1
}
@(AC0_L) = BRC0 ; AC0_L loaded using EB
@(AC1_L) = AR1 ; AC1_L loaded using EB
; Variables
.data
.global start_d1
.text
start_d1:
AR3 = #var1
AR4 = #var2
Example B−22. D-Unit Code in Example B−21 Modified to Take Advantage of Parallelism
; Variables
.data
.global start_d2
.text
start_d2:
AR3 = #var1
AR4 = #var2
Example B−23. Code That Uses Multiple CPU Units But No User-Defined Parallelism
; Register usage
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
FRAME_SZ .set 2
.global _fir
.text
;**********************************************************************
_fir
SP = SP − #FRAME_SZ
AC1 = T0 − #3 ; AC1 = nh − 3
*SP(0) = AC1
CSR = *SP(0) ; Set inner loop count to nh − 3
BLOCKREPEAT {
*DB_ptr = *X_ptr+
REPEAT (CSR)
AC0 = AC0 + (*H_ptr+ * *DB_ptr+)
; Store result
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
*R_ptr+ = HI(AC0)
}
END_FUNCTION:
RETURN
;********************************************************************
; Register usage
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
FRAME_SZ .set 2
.global _fir
.text
;**************************************************************************
_fir
AC1 = T0 − #3 ; AC1 = nh − 3
*SP(0) = AC1
||BLOCKREPEAT {
*DB_ptr = *X_ptr+
; Store result
; −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
*R_ptr+ = HI(AC0)
}
END_FUNCTION:
|| RETURN
;********************************************************************
BRC0 = #(n0−1)
BRC1 = #(n1−1)
; ...
localrepeat{ ; Level 0 looping (could instead be blockrepeat):
; Loops n0 times
; ...
repeat(#(n2−1))
; ...
localrepeat{ ; Level 1 looping (could instead be blockrepeat):
; Loops n1 times
; ...
repeat(#(n3−1))
; ...
}
; ...
}
_cfft:
radix_2_stages:
; ...
outer_loop:
; ...
BRC0 = T1
; ...
BRC1 = T1
; ...
AR4 = AR4 >> #1 ; outer loop counter
|| if (AR5 == #0) goto no_scale ; determine if scaling required
; ...
no_scale:
localrepeat{
localrepeat {
Note: This example shows portions of the file cfft.asm in the TI C55x DSPLIB (introduced in Chapter 8).
;*********************************************************************
; 64−Bit Addition Pointer assignments:
;
; X3 X2 X1 X0 AR1 −> X3 (even address)
; + Y3 Y2 Y1 Y0 X2
; −−−−−−−−−−−−−− X1
; W3 W2 W1 W0 X0
; AR2 −> Y3 (even address)
; Y2
; Y1
; Y0
; AR3 −> W3 (even address)
; W2
; W1
; W0
;
;*********************************************************************
;**********************************************************************
; 64−Bit Subtraction Pointer assignments:
;
; X3 X2 X1 X0 AR1 −> X3 (even address)
; − Y3 Y2 Y1 Y0 X2
; −−−−−−−−−−−−−− X1
; W3 W2 W1 W0 X0
; AR2 −> Y3 (even address)
; Y2
; Y1
; Y0
; AR3 −> W3 (even address)
; W2
; W1
; W0
;
;**********************************************************************
;****************************************************************
; This routine multiplies two 32−bit signed integers, giving a
; 64−bit result. The operands are fetched from data memory and the
; result is written back to data memory.
;
; Data Storage: Pointer Assignments:
; X1 X0 32−bit operand AR0 −> X1
; Y1 Y0 32−bit operand X0
; W3 W2 W1 W0 64−bit product AR1 −> Y1
; Y0
; Entry Conditions: AR2 −> W0
; SXMD = 1 (sign extension on) W1
; SATD = 0 (no saturation) W2
; FRCT = 0 (fractional mode off) W3
;
; RESTRICTION: The delay chain and input array must be
; long-word aligned.
;***************************************************************
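The partial-product decomposition the comment block describes can be sketched in portable C, assuming an arithmetic right shift for the signed high halves (mul32x32 is an illustrative name, not the routine's):

```c
#include <stdint.h>

/* 32x32 -> 64 signed multiply built from 16-bit partial products,
 * the same split the assembly routine uses:
 *   X = X1:X0 (X1 signed high half, X0 unsigned low half), Y likewise
 *   X*Y = X0*Y0 + ((X1*Y0 + X0*Y1) << 16) + (X1*Y1 << 32)          */
static int64_t mul32x32(int32_t x, int32_t y)
{
    uint32_t x0 = (uint16_t)x;   /* unsigned low halves */
    uint32_t y0 = (uint16_t)y;
    int32_t  x1 = x >> 16;       /* signed high halves */
    int32_t  y1 = y >> 16;

    int64_t p_lo    = (int64_t)((uint64_t)x0 * y0);   /* X0*Y0 */
    int64_t p_cross = (int64_t)x1 * (int64_t)y0
                    + (int64_t)y1 * (int64_t)x0;      /* X1*Y0 + X0*Y1 */
    int64_t p_hi    = (int64_t)x1 * (int64_t)y1;      /* X1*Y1 */

    return p_lo + p_cross * 65536 + p_hi * (65536LL * 65536LL);
}
```

Only the cross terms mix signedness, which is why the assembly marks one operand of each cross multiply as uns().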
;**************************************************************************
; This routine multiplies two Q31 signed integers, resulting in a
; Q31 result. The operands are fetched from data memory and the
; result is written back to data memory.
;
; Data Storage: Pointer Assignments:
; X1 X0 Q31 operand AR0 −> X1
; Y1 Y0 Q31 operand X0
; W1 W0 Q31 product AR1 −> Y1
; Y0
; Entry Conditions: AR2 −> W1 (even address)
; SXMD = 1 (sign extension on) W0
; SATD = 0 (no saturation)
; FRCT = 1 (shift result left by 1 bit)
;
; RESTRICTION: W1 W0 is aligned such that W1 is at an even address.
;***************************************************************************
mar(*AR0+) ; AR0 points to X0
AC0 = uns(*AR0−)*(*AR1+) ; AC0 = X0*Y1
AC0 = AC0 + ((*AR0)* uns(*AR1−)) ; AC0 =X0*Y1 + X1*Y0
AC0 = (AC0 >> #16) + ((*AR0)*(*AR1)) ; AC0 = AC0>>16 + X1*Y1
dbl(*AR2) = AC0 ; Save W1 W0
;***************************************************************************
; Pointer assignments: ___________
; AR0 −> Dividend Divisor ) Dividend
; AR1 −> Divisor
; AR2 −> Quotient
; AR3 −> Remainder
;
; Algorithm notes:
; − Unsigned division, 16−bit dividend, 16−bit divisor
; − Sign extension turned off. Dividend & divisor are positive numbers.
; − After division, quotient in AC0(15−0), remainder in AC0(31−16)
;***************************************************************************
;***************************************************************************
; Pointer assignments: ___________
; AR0 −> Dividend high half Divisor ) Dividend
; Dividend low half
; ...
; AR1 −> Divisor
; ...
; AR2 −> Quotient high half
; Quotient low half
; ...
; AR3 −> Remainder
;
; Algorithm notes:
; − Unsigned division, 32−bit dividend, 16−bit divisor
; − Sign extension turned off. Dividend & divisor are positive numbers.
; − Before 1st division: Put high half of dividend in AC0
; − After 1st division: High half of quotient in AC0(15−0)
; − Before 2nd division: Put low part of dividend in AC0
; − After 2nd division: Low half of quotient in AC0(15−0) and
; Remainder in AC0(31−16)
;***************************************************************************
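The two-stage scheme in the notes above can be modeled in C (udiv32by16 is an illustrative name; the hardware routine reaches the same result with repeated conditional-subtract steps rather than the C / operator):

```c
#include <stdint.h>

/* Two-stage 32/16 unsigned division: divide the high half of the
 * dividend first, then carry the partial remainder into the low-half
 * division, exactly as the before/after notes describe. */
static void udiv32by16(uint32_t num, uint16_t den,
                       uint16_t *quot_hi, uint16_t *quot_lo,
                       uint16_t *rem)
{
    uint16_t num_hi = (uint16_t)(num >> 16);
    uint16_t num_lo = (uint16_t)num;

    /* 1st division: high half of dividend */
    uint16_t qh = (uint16_t)(num_hi / den);
    uint16_t r  = (uint16_t)(num_hi % den);

    /* 2nd division: remainder prepended to the low half; since r < den,
     * the quotient of this stage always fits in 16 bits. */
    uint32_t lo = ((uint32_t)r << 16) | num_lo;
    *quot_hi = qh;
    *quot_lo = (uint16_t)(lo / den);
    *rem     = (uint16_t)(lo % den);
}
```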
;***************************************************************************
; Pointer assignments: ___________
; AR0 −> Dividend Divisor ) Dividend
; AR1 −> Divisor
; AR2 −> Quotient
; AR3 −> Remainder
;
; Algorithm notes:
; − Signed division, 16−bit dividend, 16−bit divisor
; − Sign extension turned on. Dividend and divisor can be negative.
; − Expected quotient sign saved in AC0 before division
; − After division, quotient in AC1(15−0), remainder in AC1(31−16)
;***************************************************************************
;***************************************************************************
; Pointer assignments: (Dividend and Quotient are long−word aligned)
; AR0 −> Dividend high half (NumH) (even address)
; Dividend low half (NumL)
; AR1 −> Divisor (Den)
; AR2 −> Quotient high half (QuotH) (even address)
; Quotient low half (QuotL)
; AR3 −> Remainder (Rem)
;
; Algorithm notes:
; − Signed division, 32−bit dividend, 16−bit divisor
; − Sign extension turned on. Dividend and divisor can be negative.
; − Expected quotient sign saved in AC0 before division
; − Before 1st division: Put high half of dividend in AC1
; − After 1st division: High half of quotient in AC1(15−0)
; − Before 2nd division: Put low part of dividend in AC1
; − After 2nd division: Low half of quotient in AC1(15−0) and
; Remainder in AC1(31−16)
;***************************************************************************
if (AC0 >= #0) goto skip ; If actual result should be positive, goto skip.
AC1 = dbl(*AR2) ; Otherwise, negate Quotient.
AC1 = − AC1
dbl(*AR2) = AC1
skip:
return
;...
bit (ST2, #ST2_ARMS) = #0 ; reset ARMS bit to allow bit−reverse addressing
.arms_off ; notify the assembler of ARMS bit = 0
;...
off_place:
localrepeat{
AC0 = dbl(*AR0+) ; AR0 points to input array
dbl(*(AR1+T0B)) = AC0 ; AR1 points to output array
; T0 = NX = number of complex elements in
; array pointed to by AR0
}
Note: This example shows portions of the file cbrev.asm in the TI C55x DSPLIB (introduced in Chapter 8).
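The effect of the bit-reverse (T0B) increment in the loop above can be modeled in C: complex element i of the input lands at the bit-reversed index in the output. The function names are illustrative:

```c
#include <stdint.h>

/* Reverse the low 'nbits' bits of index i (e.g. for nbits = 3,
 * index 1 = 001b maps to 100b = 4). */
static unsigned bitrev(unsigned i, unsigned nbits)
{
    unsigned r = 0, b;
    for (b = 0; b < nbits; b++)
        r = (r << 1) | ((i >> b) & 1);
    return r;
}

/* Off-place bit-reversed copy of nx complex (re, im) pairs,
 * nx = 2^nbits; this is what the localrepeat loop achieves with
 * the dbl() load/store and the T0B bit-reverse address update. */
static void cbrev_offplace(const int16_t *in, int16_t *out,
                           unsigned nx, unsigned nbits)
{
    unsigned i;
    for (i = 0; i < nx; i++) {
        unsigned j = bitrev(i, nbits);
        out[2 * j]     = in[2 * i];      /* real part */
        out[2 * j + 1] = in[2 * i + 1];  /* imaginary part */
    }
}
```

The hardware does this with zero indexing arithmetic in the loop body, which is why the assembly version is a two-instruction loop.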
StartSample:
Note: This example shows portions of the file dlms.asm in the TI C55x DSPLIB (introduced in Chapter 8).
; Generate G0
AC0 = AC0 ^ (AC1 << #−3) ; A = A XOR B>>3
AC0 = AC0 ^ (AC1 << #−4) ; A = A XOR B>>4
T0 = AC0 ; Save G0
; Generate G1
AC0 = AC0 ^ (AC1 << #−1) ; A = A XOR B>>1
AC0 = AC0 ^ (AC1 << #−3) ; A = A XOR B>>3
AC0 = AC0 ^ (AC1 << #−4) ; A = A XOR B>>4 −−> AC0_L = G1
BFLY_DIR_ALG .MACRO
;new_metric(j)&(j+2^(K−2))
hi(AC0) = *AR5+ + T3, ; AC0(39−16) = Old_Met(2*j)+LD
lo(AC0) = *AR5+ − T3 ; AC0(15−0) = Old_met(2*j+1)−LD
BFLY_REV_ALG .MACRO
;new_metric(j)&(j+2^(K−2))
hi(AC0) = *AR5+ − T3, ; AC0(39−16) = Old_Met(2*j)−LD
lo(AC0) = *AR5+ + T3 ; AC0(15−0) = Old_met(2*j+1)+LD
localrepeat {
butterfly:
hi(AC0) = *AR0+ + T3, ; AC0(39−16) = Old_Met(2*j)+LD
lo(AC0) = *AR0+ − T3 ; AC0(15−0) = Old_met(2*j+1)−LD
|| AR7 = *AR5+
Index
DSP function library for TMS320C55x DSPs 8-1
  calling a function from assembly source 8-4
  calling a function from C 8-3
  data types used 8-2
  function arguments 8-2
  list of functions 8-5
  where to find sample code 8-4
dual-MAC operations, helping the C compiler generate them 3-22
dual-access RAM (DARAM) accesses (buses and pipeline phases) 4-73
dual-MAC hardware/operations
  efficient use of 4-2
  matrix multiplication 4-13
  pointer usage in 4-3
  taking advantage of implicit algorithm symmetry 4-4
dynamic scaling for overflow handling 5-25

E
encoding, convolutional 7-10
examples
  adaptive FIR filter (delayed LMS) 7-9
  block FIR filter (dual-MAC implementation) 4-9, 4-10
  branch-on-auxiliary-register-not-zero loop construct 4-44
  complex vector multiplication (dual-MAC implementation) 4-6
  demultiplexing bit streams (convolutional encoder) 7-15
  multiplexing bit streams (convolutional encoder) 7-13
  nested loops 4-43
  output streams for convolutional encoder 7-11
  parallel optimization across CPU functional units 4-35
  parallel optimization within A unit 4-25
  parallel optimization within CPU functional units 4-25
  parallel optimization within D unit 4-33
  parallel optimization within P unit 4-30
  symmetric FIR filter 7-5
  use of CSR for single-repeat loop 4-47
  Viterbi butterfly 7-19
execute (X) phase of pipeline 4-51
extended auxiliary register usage in dual-MAC operations 4-3
extended coefficient data pointer usage in dual-MAC operations 4-3
extended-precision 2s-complement arithmetic, details 5-9, 5-14

F
FFT flow graph with bit-reversed input 6-4
field expand instruction
  concept 7-12
  example 7-13
field extract instruction
  concept 7-14
  example 7-15
filtering
  adaptive FIR (concept) 7-6
  adaptive FIR (example) 7-9
  symmetric/asymmetric FIR 7-2
  with dual-MAC hardware 4-4, 4-6, 4-14
FIR filtering
  adaptive (concept) 7-6
  adaptive (example) 7-9
  symmetric/asymmetric 7-2
FIRS instruction 7-3, 7-4
FIRSN instruction 7-4
fixed scaling for overflow handling 5-25
fixed-point arithmetic 5-1
float size in C55x compiler 3-2
fractional representation (2s-complement) 5-5
fractions versus integers 5-3
function inlining 3-11
function library for TMS320C55x DSPs 8-1
  calling a function from assembly source 8-4
  calling a function from C 8-3
  data types used 8-2
  function arguments 8-2
  list of functions 8-5
  where to find sample code 8-4

G
global symbols 3-43
guard bits used for overflow handling 5-24
H
hardware resource conflicts (parallelism rule) 4-21
holes caused by data alignment 3-42

I
I/O space available 1-2
IBQ, minimizing delays 4-49
idioms in C for generating efficient assembly 3-40
if-then-else and switch/case constructs 3-39
implicit algorithm symmetry 4-4
inline keyword 3-11
input scaling for overflow handling 5-25
input/output space available 1-2
instruction buffer 1-2
instruction length limitation (parallelism rule) 4-21
instruction pipeline, introduced 1-2
instructions executed in parallel 4-16
instructions for specific applications 7-1
int _norm 3-33
int size in C55x compiler 3-2
integer representation (2s-complement) 5-4
integers versus fractions 5-3
interrupt-control logic 1-2
intrinsics 3-29 to 3-33
  int_abss 3-32
  int_lnorm 3-33
  int_lshrs 3-33
  int_md 3-33
  int_norm 3-33
  int_sadd 3-32
  int_shrs 3-33
  int_smacr 3-33
  int_smasr 3-33
  int_smpy 3-32, 3-33
  int_sneg 3-32
  int_ssubb 3-32
  long_labss 3-32
  long_lsadd 3-32
  long_lsmpy 3-32
  long_lsneg 3-32
  long_lssh 3-33
  long_lsshl 3-33
  long_lssub 3-32
  long_smac 3-32
  long_smas 3-32

K
keyword
  inline 3-11
  onchip 3-23
  restrict 3-4, 3-5, 3-26

L
labels 2-3
least mean square (LMS) calculation for adaptive filtering 7-6
least mean square (LMS) instruction 7-6
linker command file sample 3-47
linking 2-10
local block-repeat instruction, when to use 4-45
local symbols 3-43
localrepeat/RPTBLOCAL instruction 3-16
long data accesses for 16-bit data 3-33
long long size in C55x compiler 3-2
long size in C55x compiler 3-2
loop code tips 3-16
loop unrolling 4-6
loop-control registers
  avoiding pipeline delays when accessing 4-48
  when they are accessed in the pipeline 4-64
loops
  implementing efficient 4-42
  nesting of 4-42

M
MAC hardware tips 3-21
MAC units 1-2
map file, example 2-11
matrix mathematics using dual-MAC hardware 4-13
MAXDIFF instruction used in Viterbi code 7-19
maximum instruction length (parallelism rule) 4-21
memory accesses and the pipeline 4-72
memory available 1-2
memory dependences 3-4
restrict keyword 3-4, 3-5, 3-26
RPTBLOCAL/localrepeat instruction 3-16
rules for user-defined parallelism 4-20

S
saturation for overflow handling 5-25
saturation mode bits used for overflow handling 5-24
scaling
  dynamic 5-25
  fixed 5-25
  input 5-25
  methods for FFTs 5-26
  methods for FIR filters 5-25
  methods for IIR filters 5-26
scaling methods (overflow handling)
  FFTs 5-26
  FIR filters 5-25
  IIR filters 5-26
section allocation 2-5
  example 2-6
sections 3-45
  .bss 3-45
  .cinit 3-45
  .const 3-45
  .do 3-45
  .ioport 3-45
  .stack 3-45
  .switch 3-45
  .sysmem 3-45
  .sysstack 3-45
  .text 3-45
short size in C55x compiler 3-2
single-repeat instruction, when to use 4-45
size of C data types 3-2
soft dual encoding 4-21
stack configuration 3-43
stacks available 1-2
standard block-repeat instruction, when to use 4-45
SUBADD instruction used in Viterbi code 7-19
subtraction
  2s-complement 5-5
  extended-precision 2s-complement 5-9
switch/case and if-then-else constructs 3-39
symbol declarations
  global vs. local 3-43
  local vs. global 3-43
symmetric FIR filtering 7-2
  with FIRS instruction (concept) 7-3
  with FIRS instruction (example) 7-4

T
table to help generate optional application mapping 4-78
TI C55x DSPLIB 8-1
  calling a function in assembly source 8-4
  calling a function in C 8-3
  data types used 8-2
  function arguments 8-2
  list of functions 8-5
  where to find sample code 8-4
tips
  applying parallelism 4-24
  code and data allocation 3-44, 3-48
  control code 3-39
  data and code allocation 3-44, 3-48
  data types 3-2
  loop code 3-16
  MAC hardware 3-21
  nesting loops 4-45
  performance for C code 3-15
  preventing pipeline delays 4-54
  producing efficient code 1-3
  resolving pipeline conflicts 4-53
TMS320C54x-compatible mode 1-2
TMS320C55x DSP function library 8-1
  calling a function from assembly source 8-4
  calling a function from C 8-3
  data types used 8-2
  function arguments 8-2
  list of functions 8-5
  where to find sample code 8-4
transition registers (TRN0, TRN1) used in Viterbi algorithm 7-18
trip count 3-17
  unsigned integer types 3-17
tutorial 2-1
U
unroll-and-jam transformation 3-25
user-defined parallelism 4-16
  process 4-23
  rules 4-20

V
variables 2-5
vector multiplication using dual-MAC hardware 4-4
Viterbi algorithm for channel decoding 7-16
Viterbi butterfly
  examples 7-19
  figure 7-17

W
W (write) phase of pipeline 4-51
W+ (write+) phase of pipeline 4-51
write (W) phase of pipeline 4-51
write+ (W+) phase of pipeline 4-51
writing assembly code 2-3

X
X (execute) phase of pipeline 4-51
XARn, example 2-9
XARn usage in dual-MAC operations 4-3
XCDP usage in dual-MAC operations 4-3