TMS320C6713 - Digital Signal Processor
Any reference to the C67x DSP or C67x CPU also applies, unless otherwise
noted, to the C67x+ DSP and C67x+ CPU, respectively.
Speech recognition
Audio
Radar
Atmospheric modeling
Telecommunications
    1200- to 56 600-bps modems
    Adaptive equalizers
    ADPCM transcoders
    Base stations
    Cellular telephones
    Channel multiplexing
    Data encryption
    Digital PBXs
    Digital speech interpolation (DSI)
    DTMF encoding/decoding
    Echo cancellation
    Faxing
    Future terminals
    Line repeaters
    Personal communications systems (PCS)
    Personal digital assistants (PDA)
    Speaker phones
    Spread spectrum communications
    Digital subscriber loop (xDSL)
    Video conferencing
    X.25 packet switching
Voice/Speech
    Speaker verification
    Speech enhancement
    Speech recognition
    Speech synthesis
    Speech vocoding
    Text-to-speech
    Voice mail
Two multipliers
Six ALUs
Advanced VLIW CPU with eight functional units, including two multipliers
and six arithmetic units
Executes up to eight instructions per cycle for up to ten times the
performance of typical DSPs
Allows designers to develop highly effective RISC-like code for fast
development time
Instruction packing
40-bit arithmetic options add extra precision for vocoders and other
computationally intensive applications
Field manipulation (extract, set, clear) and bit-counting instructions
support common operations found in control and data manipulation
applications.
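
As a rough illustration of what field extract, set, clear, and bit counting mean, the following Python sketch operates on 32-bit values. The function names and the (lsb, width) parameters are hypothetical and chosen for this example; they do not reproduce the operand encoding of the actual C67x instructions.

```python
MASK32 = 0xFFFFFFFF

def extract_field(value, lsb, width):
    """Return the unsigned field of `width` bits starting at bit `lsb`."""
    return (value >> lsb) & ((1 << width) - 1)

def set_field(value, lsb, width):
    """Set every bit of the field to 1."""
    return (value | (((1 << width) - 1) << lsb)) & MASK32

def clear_field(value, lsb, width):
    """Clear every bit of the field to 0."""
    return value & ~(((1 << width) - 1) << lsb) & MASK32

def count_ones(value):
    """Count the bits set to 1 in a 32-bit value."""
    return bin(value & MASK32).count("1")
```

For instance, extract_field(0xABCD1234, 8, 8) isolates the byte 0x12.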
In addition to the features of the C67x device, the C67x+ device is enhanced
for code size improvement and floating-point performance. These additional
features include:
Unified memory controller features support for flat on-chip data RAM and
ROM organizations for zero wait-state accesses from both load store units
of the CPU. The memory controller supports different banking organiza-
tions for RAM and ROM arrays. The memory controller also supports
VBUSP interfaces (two master and one slave) for transfer of data from the
system peripherals to and from the CPU and internal memory. A VBUSP-
based DMA controller can interface to the CPU for programmable bulk
transfers through the VBUSP slave port.
The VelociTI architecture of the C6000 platform of devices makes them the first
off-the-shelf DSPs to use advanced VLIW to achieve high performance
through increased instruction-level parallelism. A traditional VLIW architecture
consists of multiple execution units running in parallel, performing multiple
instructions during a single clock cycle. Parallelism is the key to extremely high
performance, taking these DSPs well beyond the performance capabilities of
traditional superscalar designs. VelociTI is a highly deterministic architecture,
having few restrictions on how or when instructions are fetched, executed, or
stored. It is this architectural flexibility that is key to the breakthrough efficiency
levels of the TMS320C6000 Optimizing C compiler. VelociTI’s advanced
features include:
[Figure: block diagram of the C6000 CPU and surrounding blocks. Labels: 256-bit data; program fetch; instruction dispatch (see Note); instruction decode; data path A with register file A and units .L1, .S1, .M1, .D1; data path B with register file B and units .D2, .M2, .S2, .L2; control registers; control logic; interrupts; power down; DMA, EMIF; data cache/data memory (32-bit address; 8-, 16-, 32-bit data); additional peripherals: timers, serial ports, etc.]
Program cache
2-level caches
DMA Controller (C6701 DSP only) transfers data between address ranges
in the memory map without intervention by the CPU. The DMA controller
has four programmable channels and a fifth auxiliary channel.
EDMA Controller performs the same functions as the DMA controller. The
EDMA has 16 programmable channels, as well as a RAM space to hold
multiple configurations for future transfers.
HPI is a parallel port through which a host processor can directly access
the CPU’s memory space. The host device has ease of access because
it is the master of the interface. The host and the CPU can exchange infor-
mation via internal or external memory. In addition, the host has direct
access to memory-mapped peripherals.
Timers in the C6000 devices are two 32-bit general-purpose timers used
for these functions:
Time events
Count events
Generate pulses
Interrupt the CPU
Send synchronization events to the DMA/EDMA controller.
For an overview of the peripherals available on the C6000 DSP, refer to the
TMS320C6000 DSP Peripherals Overview Reference Guide (SPRU190).
This chapter focuses on the CPU, providing information about the data paths and
control registers. The two register files and the data cross paths are described.
2.1 Introduction
The components of the data path for the TMS320C67x CPU are shown in
Figure 2−1. These components consist of:
The C67x DSP general-purpose register files support data ranging in size from
packed 16-bit data through 40-bit fixed-point and 64-bit floating point data.
Values larger than 32 bits, such as 40-bit long and 64-bit float quantities, are
stored in register pairs. In these the 32 LSBs of data are placed in an even-
numbered register and the remaining 8 or 32 MSBs in the next upper register
(that is always an odd-numbered register). Packed data types store either four
8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values
in a 64-bit register pair.
There are 16 valid register pairs for 40-bit and 64-bit data in the C67x DSP
cores. In assembly language syntax, a colon between the register names
denotes the register pairs, and the odd-numbered register is specified first.
The additional registers are addressed by using the previously unused fifth
(msb) bit of the source and register specifiers. All 64-bit register writes and
reads are performed over 2 cycles as per the current C67x devices.
Figure 2−2 shows the register storage scheme for 40-bit long data. Operations
requiring a long input ignore the 24 MSBs of the odd-numbered register.
Operations producing a long result zero-fill the 24 MSBs of the odd-numbered
register. The even-numbered register is encoded in the opcode.
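
The register-pair storage of 40-bit data described above can be sketched in Python as follows. This is a minimal model, not TI code: the function names are hypothetical, and the pair naming simply follows the stated convention that the odd-numbered register is written first, before the colon.

```python
MASK32 = 0xFFFFFFFF
MASK40 = (1 << 40) - 1

def write_long(value):
    """Split a 40-bit value into (odd, even) register contents.

    The 32 LSBs go to the even register; the 8 MSBs go to the odd
    register, whose upper 24 bits are zero-filled.
    """
    value &= MASK40
    return (value >> 32) & 0xFF, value & MASK32

def read_long(odd, even):
    """Rebuild the 40-bit value, ignoring the 24 MSBs of the odd register."""
    return ((odd & 0xFF) << 32) | (even & MASK32)

def pair_name(even_index, side="A"):
    """Assembly-style name of a register pair, odd register first."""
    if even_index % 2 != 0 or not 0 <= even_index <= 14:
        raise ValueError("pair must start at an even register 0..14")
    return f"{side}{even_index + 1}:{side}{even_index}"
```

With eight pairs per register file (A1:A0 through A15:A14, and likewise for B), this gives the 16 valid pairs mentioned above.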
Figure 2−1. TMS320C67x CPU Data Paths
[Figure: data path A contains units .L1, .S1, .M1, and .D1 with register file A (A0−A15); data path B contains units .D2, .M2, .S2, and .L2 with register file B (B0−B15). Each unit is drawn with src1, src2, and dst ports; the .L and .S units also have 8-bit long src and long dst ports. The load paths (LD1 and LD2, each with 32 MSB and 32 LSB), the store paths (ST1 and ST2), the data address and cross paths, and the control register file are also shown.]
[Figure 2−2: storage scheme for 40-bit data in a register pair, for reads from and writes to registers: bits 39−32 are held in the odd register and bits 31−0 in the even register.]
.D unit (.D1, .D2):
    32-bit add, subtract, linear and circular address calculation
    Loads and stores with 5-bit constant offset
    Loads and stores with 15-bit constant offset (.D2 only)
    Load doubleword with 5-bit constant offset
On the C67x DSP, six of the eight functional units have access to the register
file on the opposite side, via a cross path. The src2 inputs of the .M1, .M2, .S1,
and .S2 units are selectable between the cross path and the same-side register file. In
the case of the .L1 and .L2, both src1 and src2 inputs are also selectable
between the cross path and the same-side register file.
Only two cross paths, 1X and 2X, exist in the C6000 architecture. Thus, the
limit is one source read from each data path’s opposite register file per cycle,
or a total of two cross path source reads per cycle. In the C67x DSP, only one
functional unit per data path, per execute packet, can get an operand from the
opposite register file.
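
The cross-path limit can be expressed as a small check over one execute packet. The sketch below is illustrative only: the representation of an instruction as a (unit, uses_cross_path) tuple is an assumption made for this example, and the set of cross-path-capable units is taken from the description above (the .L, .S, and .M units on each side).

```python
# Units that can read an operand over a cross path, per the text above.
CROSS_CAPABLE = {".L1", ".S1", ".M1", ".L2", ".S2", ".M2"}

def cross_path_ok(execute_packet):
    """Check the one-cross-path-read-per-side limit.

    execute_packet is a list of (unit, uses_cross_path) tuples,
    for example (".L1", True).
    """
    used = {"1": 0, "2": 0}
    for unit, uses_cross in execute_packet:
        if not uses_cross:
            continue
        if unit not in CROSS_CAPABLE:
            return False          # .D units have no cross-path access
        used[unit[-1]] += 1       # last character is the side: 1 or 2
    return all(count <= 1 for count in used.values())
```

A packet in which .L1 and .S2 each use a cross path passes the check, while one in which .L1 and .M1 both do is rejected.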
On the C6000 architecture, some of the ports for long and doubleword oper-
ands are shared between functional units. This places a constraint on which
long or doubleword operations can be scheduled on a data path in the same
execute packet. See section 3.7.5.
The DA1 and DA2 resources and their associated data paths are specified as
T1 and T2, respectively. T1 consists of the DA1 address path and the LD1 and
ST1 data paths. For the C67x DSP, LD1 is comprised of LD1a and LD1b to
support 64-bit loads. Similarly, T2 consists of the DA2 address path and the
LD2 and ST2 data paths. For the C67x DSP, LD2 is comprised of LD2a and
LD2b to support 64-bit loads.
The T1 and T2 designations appear in the functional unit fields for load and
store instructions. For example, the following load instruction uses the .D1 unit
to generate the address but is using the LD2 path resource from DA2 to place
the data in the B register file. The use of the DA2 resource is indicated with the
T2 designation.
LDW .D1T2 *A0[3],B1
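
To make the unit field of this example concrete, the Python sketch below decodes a .DxTy specifier into the address-generating unit, the load path, and the destination register file. The function names are hypothetical. The address helper assumes the usual C6000 assembly convention that a bracketed offset is scaled by the access size (4 bytes for LDW); that convention is not stated in the text above.

```python
import re

UNIT_RE = re.compile(r"^\.D([12])T([12])$")

def decode_load_unit(unit_field):
    """Decode a unit field such as '.D1T2' for a load instruction."""
    match = UNIT_RE.match(unit_field)
    if match is None:
        raise ValueError(f"not a .DxTy unit field: {unit_field!r}")
    addr_side, data_side = match.groups()
    return {
        "address_unit": ".D" + addr_side,          # generates the address
        "load_path": "LD" + data_side,             # carries the loaded data
        "dest_file": "A" if data_side == "1" else "B",
    }

def ldw_address(base, index):
    """Word address for *base[index], assuming the index is scaled by 4."""
    return (base + 4 * index) & 0xFFFFFFFF
```

Under these assumptions, the example instruction reads the word at A0 + 12 and delivers it to register file B over LD2.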
Additionally, some of the control register bits are specially accessed in other
ways. For example, arrival of a maskable interrupt on an external interrupt pin,
INTm, triggers the setting of flag bit IFRm. Subsequently, when that interrupt
is processed, this triggers the clearing of IFRm and the clearing of the global
interrupt enable bit, GIE. Finally, when that interrupt processing is complete,
the B IRP instruction in the interrupt service routine restores the pre-interrupt
value of the GIE. Similarly, saturating instructions like SADD set the SAT
(saturation) bit in the control status register (CSR).
Pipeline Stage E1
Read src2
Written dst
Even though MVC modifies the particular target control register in a single
cycle, it can take extra clocks to complete modification of the non-explicitly
named register. For example, the MVC cannot modify bits in the IFR directly.
Instead, MVC can only write 1’s into the ISR or the ICR to specify setting or
clearing, respectively, of the IFR bits. MVC completes this ISR/ICR write in a
single (E1) cycle but the modification of the IFR bits occurs one clock later. For
more information on the manipulation of ISR, ICR, and IFR, see section 2.7.9,
section 2.7.5, and section 2.7.7.
Saturating instructions, such as SADD, set the saturation flag bit (SAT) in CSR
indirectly. As a result, several of these instructions update the SAT bit one full
clock cycle after their primary results are written to the register file. For exam-
ple, the SMPY instruction writes its result at the end of pipeline stage E2; its
primary result is available after one delay slot. In contrast, the SAT bit in CSR
is updated one cycle later than the result is written; this update occurs after two
delay slots. (For the specific behavior of an instruction, refer to the description
of that individual instruction).
The B IRP and B NRP instructions directly update the GIE and NMIE,
respectively. Because these branches directly modify CSR and IER,
respectively, there are no delay slots between when the branch is issued and
when the control register updates take effect.
Bits   15−14    13−12    11−10    9−8      7−6      5−4      3−2      1−0
Field  B7 MODE  B6 MODE  B5 MODE  B4 MODE  A7 MODE  A6 MODE  A5 MODE  A4 MODE
Reset  R/W-0    R/W-0    R/W-0    R/W-0    R/W-0    R/W-0    R/W-0    R/W-0
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
25−21 BK1 0−1Fh Block size field 1. A 5-bit value used in calculating block sizes for circular
addressing. Table 2−6 shows block size calculations for all 32 possibilities.
Block size (in bytes) = 2^(N+1), where N is the 5-bit value in BK1
20−16 BK0 0−1Fh Block size field 0. A 5-bit value used in calculating block sizes for circular
addressing. Table 2−6 shows block size calculations for all 32 possibilities.
Block size (in bytes) = 2^(N+1), where N is the 5-bit value in BK0
15−14 B7 MODE 0−3h Address mode selection for register file B7.
3h Reserved
13−12 B6 MODE 0−3h Address mode selection for register file B6.
3h Reserved
11−10 B5 MODE 0−3h Address mode selection for register file B5.
3h Reserved
9−8 B4 MODE 0−3h Address mode selection for register file B4.
3h Reserved
7−6 A7 MODE 0−3h Address mode selection for register file A7.
3h Reserved
5−4 A6 MODE 0−3h Address mode selection for register file A6.
3h Reserved
3−2 A5 MODE 0−3h Address mode selection for register file A5.
3h Reserved
1−0 A4 MODE 0−3h Address mode selection for register file A4.
3h Reserved
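
The field layout and block size formula above can be sketched in Python as follows. This is an illustrative decoder, not TI code; the function names are hypothetical, and the meaning of the individual mode values is deliberately left uninterpreted.

```python
# Mode fields are 2 bits wide, starting at bit 0 with A4 MODE.
AMR_MODE_FIELDS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]

def decode_amr(amr):
    """Return (modes, bk0, bk1) extracted from a 32-bit AMR value."""
    modes = {reg: (amr >> (2 * i)) & 0x3
             for i, reg in enumerate(AMR_MODE_FIELDS)}
    bk0 = (amr >> 16) & 0x1F   # bits 20-16
    bk1 = (amr >> 21) & 0x1F   # bits 25-21
    return modes, bk0, bk1

def block_size_bytes(n):
    """Block size in bytes for a 5-bit BK0/BK1 value N: 2^(N+1)."""
    if not 0 <= n <= 0x1F:
        raise ValueError("N must be a 5-bit value")
    return 2 ** (n + 1)
```

For example, a BK0 value of 2 corresponds to a circular block of 8 bytes.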
The power-down modes and their wake-up methods are programmed by the
PWRD field (bits 15−10) of CSR. The PWRD field of CSR is shown in
Figure 2−5. When writing to CSR, all bits of the PWRD field should be
configured at the same time. A logic 0 should be used when writing to the
reserved bit (bit 15) of the PWRD field.
Bits   15−10  9       8    7−5    4−2    1      0
Field  PWRD   SAT     EN   PCC    DCC    PGIE   GIE
Reset  R/W-0  R/WC-0  R-x  R/W-0  R/W-0  R/W-0  R/W-0
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; WC = Bit is cleared on write; -n = value
after reset; -x = value is indeterminate after reset
† See the device-specific data manual for the default value of this field.
0−1h Reserved
2h C67x CPU
3h C67x+ CPU
4h−FFh Reserved
23−16 REVISION ID 0−FFh Identifies silicon revision of the CPU. For the most current silicon
revision information, see the device-specific data manual. Not writable
by the MVC instruction.
15−10 PWRD 0−3Fh Power-down mode field. See Figure 2−5. Writable by the MVC instruction.
0 No power-down.
1h−8h Reserved
Ah−10h Reserved
12h−19h Reserved
1Bh Reserved
1Dh−3Fh Reserved
9 SAT Saturate bit. Can be cleared only by the MVC instruction and can be set
only by a functional unit. The set by a functional unit has priority over a
clear (by the MVC instruction), if they occur on the same cycle. The SAT
bit is set one full cycle (one delay slot) after a saturate occurs. The SAT
bit will not be modified by a conditional instruction whose condition is false.
0 Big endian
1 Little endian
1h Reserved
3h−7h Reserved
4−2 DCC 0−7h Data cache control mode. Writable by the MVC instruction. See the
TMS320C621x/C671x DSP Two-Level Internal Memory Reference Guide
(SPRU609).
1h Reserved
3h−7h Reserved
1 PGIE Previous GIE (global interrupt enable). Copy of GIE bit at point when
interrupt is taken. Physically the same bit as SGIE bit in the interrupt task
state register (ITSR). Writeable by the MVC instruction.
0 GIE Global interrupt enable. Physically the same bit as GIE bit in the task state
register (TSR). Writable by the MVC instruction.
0 Disables all interrupts, except the reset interrupt and NMI (nonmaskable
interrupt).
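
As a sketch of how the CSR fields fit together, the following Python decoder splits a 32-bit CSR value into its fields. The bit positions of REVISION ID, PWRD, SAT, DCC, PGIE, and GIE are taken from the descriptions above; placing the CPU identification field in bits 31−24 and the names CPU_ID and decode_csr are assumptions made for this example.

```python
# (least significant bit, width) of each field
CSR_FIELDS = {
    "CPU_ID": (24, 8),       # assumed position, directly above REVISION ID
    "REVISION_ID": (16, 8),
    "PWRD": (10, 6),
    "SAT": (9, 1),
    "EN": (8, 1),            # 0 = big endian, 1 = little endian
    "PCC": (5, 3),
    "DCC": (2, 3),
    "PGIE": (1, 1),
    "GIE": (0, 1),
}

def decode_csr(csr):
    """Return a dict of field values for a 32-bit CSR value."""
    return {name: (csr >> lsb) & ((1 << width) - 1)
            for name, (lsb, width) in CSR_FIELDS.items()}

def cpu_name(cpu_id):
    """Map the CPU identification value to the names listed above."""
    return {0x2: "C67x CPU", 0x3: "C67x+ CPU"}.get(cpu_id, "Reserved")
```

A value such as 0x02010103 would decode, under these assumptions, as a C67x CPU in little-endian mode with GIE and PGIE set.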
Note:
Any write to ICR (by the MVC instruction) effectively has one delay slot
because the results cannot be read (by the MVC instruction) in IFR until two
cycles after the write to ICR.
Any write to ICR is ignored by a simultaneous write to the same bit in the
interrupt set register (ISR).
15 14 13 12 11 10 9 8 7 6 5 4 3 0
IC15 IC14 IC13 IC12 IC11 IC10 IC9 IC8 IC7 IC6 IC5 IC4 Reserved
W-0 R-0
Legend: R = Read only; W = Writeable by the MVC instruction; -n = value after reset
3−0 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this
field has no effect.
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
IE15 IE14 IE13 IE12 IE11 IE10 IE9 IE8 IE7 IE6 IE5 IE4 Reserved NMIE 1
R/W-0 R-0 R/W-0 R-1
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
15−4 IEn Interrupt enable. An interrupt triggers interrupt processing only if the
corresponding bit is set to 1.
0 Interrupt is disabled.
1 Interrupt is enabled.
3−2 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this
field has no effect.
1 All nonreset interrupts are enabled. The NMIE bit is set only by completing a
B NRP instruction or by a write of 1 to the NMIE bit.
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
IF15 IF14 IF13 IF12 IF11 IF10 IF9 IF8 IF7 IF6 IF5 IF4 Reserved NMIF 0
R-0 R-0 R-0 R-0
Legend: R = Readable by the MVC instruction; -n = value after reset
15−4 IFn Interrupt flag. Indicates the status of the corresponding maskable interrupt. An
interrupt flag may be manually set by setting the corresponding bit (ISn) in the
interrupt set register (ISR) or manually cleared by setting the corresponding bit
(ICn) in the interrupt clear register (ICR).
3−2 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this
field has no effect.
The IRP contains the 32-bit address of the first execute packet in the program
flow that was not executed because of a maskable interrupt. Although you can
write a value to IRP, any subsequent interrupt processing may overwrite that
value.
The interrupt set register (ISR) allows you to manually set the maskable inter-
rupts (INT15−INT4) in the interrupt flag register (IFR). Writing a 1 to any of the
bits in ISR causes the corresponding interrupt flag (IFn) to be set in IFR. Writ-
ing a 0 to any bit in ISR has no effect. You cannot set any bit in ISR to affect
NMI or reset. The ISR is shown in Figure 2−10 and described in Table 2−11.
Note:
Any write to ISR (by the MVC instruction) effectively has one delay slot
because the results cannot be read (by the MVC instruction) in IFR until two
cycles after the write to ISR.
Any write to the interrupt clear register (ICR) is ignored by a simultaneous
write to the same bit in ISR.
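
The interaction of ISR, ICR, and IFR can be modeled in a few lines of Python. This is a simplified sketch: the function name is hypothetical, and the one-delay-slot timing described in the note is not modeled, only the resulting value of the flag register.

```python
MASKABLE = 0xFFF0   # bits 15-4: INT15-INT4

def apply_isr_icr(ifr, isr_write=0, icr_write=0):
    """Return the IFR value after simultaneous writes to ISR and ICR.

    Writing 1 to an ISR bit sets the flag; writing 1 to an ICR bit
    clears it. Writing 0 has no effect. When both target the same bit,
    the ISR write wins. Bits outside INT15-INT4 cannot be changed.
    """
    set_bits = isr_write & MASKABLE
    clear_bits = icr_write & MASKABLE & ~set_bits
    return ((ifr | set_bits) & ~clear_bits) & 0xFFFF
```

For example, writing bit 7 to both registers in the same cycle leaves IF7 set.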
31 16
Reserved
R-0
15 14 13 12 11 10 9 8 7 6 5 4 3 0
IS15 IS14 IS13 IS12 IS11 IS10 IS9 IS8 IS7 IS6 IS5 IS4 Reserved
W-0 R-0
Legend: R = Read only; W = Writeable by the MVC instruction; -n = value after reset
3−0 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this
field has no effect.
15 10 9 5 4 3 2 1 0
ISTB HPEINT 0 0 0 0 0
R/W-0 R-0 R-0
Legend: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2−12. Interrupt Service Table Pointer Register (ISTP) Field Descriptions
9−5 HPEINT 0−1Fh Highest priority enabled interrupt that is currently pending. This field indicates
the number (related bit position in the IFR) of the highest priority interrupt (as
defined in Table 5−1 on page 5-3) that is enabled by its bit in the IER. Thus,
the ISTP can be used for manual branches to the highest priority enabled in-
terrupt. If no interrupt is pending and enabled, HPEINT contains the value 0.
The corresponding interrupt need not be enabled by NMIE (unless it is NMI)
or by GIE.
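
A possible way to picture HPEINT is as the lowest-numbered bit that is set in both IFR and IER. The sketch below rests on two assumptions that are not spelled out in the text above: that a lower interrupt number means higher priority among INT4−INT15, and that ISTB occupies bits 31−10 of ISTP. The function names are hypothetical.

```python
def hpeint(ifr, ier):
    """Number of the highest-priority pending and enabled interrupt.

    Considers NMI (bit 1) and INT4-INT15 (bits 4-15). Returns 0 when
    nothing is both pending and enabled.
    """
    pending = ifr & ier & 0xFFF2
    if pending == 0:
        return 0
    lowest_set = pending & -pending       # isolate the lowest set bit
    return lowest_set.bit_length() - 1

def istp_value(istb, ifr, ier):
    """Assemble ISTP from ISTB (assumed bits 31-10) and HPEINT (bits 9-5)."""
    return ((istb & 0x3FFFFF) << 10) | (hpeint(ifr, ier) << 5)
```

Because bits 4−0 are always 0, each HPEINT value selects a 32-byte slot, which is the size of one eight-instruction fetch packet.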
The NRP contains the 32-bit address of the first execute packet in the program
flow that was not executed because of a nonmaskable interrupt. Although you
can write a value to NRP, any subsequent interrupt processing may overwrite
that value.
4.1 Pipeline Operation Overview
4.2 Pipeline Execution of Instruction Types
4.3 Functional Unit Constraints
4.4 Performance Considerations
Fetch
Decode
Execute
All instructions in the C67x DSP instruction set flow through the fetch, decode,
and execute stages of the pipeline. The fetch stage of the pipeline has four
phases for all instructions, and the decode stage has two phases for all instruc-
tions. The execute stage of the pipeline requires a varying number of phases,
depending on the type of instruction. The stages of the C67x DSP pipeline are
shown in Figure 4−1.
4.1.1 Fetch
The C67x DSP uses a fetch packet (FP) of eight instructions. All eight of the
instructions proceed through fetch processing together, through the PG, PS,
PW, and PR phases. Figure 4−2(a) shows the fetch phases in sequential order
from left to right. Figure 4−2(b) is a functional diagram of the flow of instructions
through the fetch phases. During the PG phase, the program address is gener-
ated in the CPU. In the PS phase, the program address is sent to memory. In
the PW phase, a memory read occurs. Finally, in the PR phase, the fetch pack-
et is received at the CPU. Figure 4−2(c) shows fetch packets flowing through
the phases of the fetch stage of the pipeline. In Figure 4−2(c), the first fetch
packet (in PR) is made up of four execute packets, and the second and third
fetch packets (in PW and PS) contain two execute packets each. The last fetch
packet (in PG) contains a single execute packet of eight single-cycle instruc-
tions.
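[Figure 4−2: fetch phases of the pipeline. (a) Phase sequence PG, PS, PW, PR. (b) Functional diagram of the CPU (functional units, registers) and memory, annotated with the PG, PS, PW, and PR phases. (c) Fetch packets, 256 bits wide, flowing from fetch to decode.]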
4.1.2 Decode
In the DP phase of the pipeline, the fetch packets are split into execute pack-
ets. Execute packets consist of one instruction or from two to eight parallel
instructions. During the DP phase, the instructions in an execute packet are
assigned to the appropriate functional units. In the DC phase, the source
registers, destination registers, and associated paths are decoded for the
execution of the instructions in the functional units.
Figure 4−3(a) shows the decode phases in sequential order from left to right.
Figure 4−3(b) shows a fetch packet that contains two execute packets as they
are processed through the decode stage of the pipeline. The last six instruc-
tions of the fetch packet (FP) are parallel and form an execute packet (EP).
This EP is in the dispatch phase (DP) of the decode stage. The arrows indicate
each instruction’s assigned functional unit for execution during the same cycle.
The NOP instruction in the eighth slot of the FP is not dispatched to a functional
unit because there is no execution associated with it.
The first two slots of the fetch packet (shaded below) represent an execute
packet of two parallel instructions that were dispatched on the previous cycle.
This execute packet contains two MPY instructions that are now in decode
(DC) one cycle before execution. There are no instructions decoded for the .L,
.S, and .D functional units for the situation illustrated.
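[Figure 4−3: decode phases of the pipeline. (b) A fetch packet of eight 32-bit instructions in decode: ADD, ADD, STW, STW, ADDK, and NOP in the DP phase and two MPYH instructions in the DC phase, shown above functional units .L1, .S1, .M1, .D1, .D2, .M2, .S2, and .L2.]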
4.1.3 Execute
The execute portion of the pipeline is subdivided into ten phases (E1−E10),
as compared to the five phases in a fixed-point pipeline. Different types of
instructions require different numbers of these phases to complete their
execution. These phases of the pipeline play an important role in
understanding the device state at CPU cycle boundaries. The execution of different
types of instructions in the pipeline is described in section 4.2, Pipeline
Execution of Instruction Types. Figure 4−4(a) shows the execute phases of
the pipeline in sequential order from left to right. Figure 4−4(b) shows the
portion of the functional block diagram in which execution occurs.
[Figure 4−4: execute phases of the pipeline. (a) Phase sequence E1 through E10. (b) Functional block diagram for the E1 phase: instructions SADD, B, SMPY, STH, STH, SMPYH, SUB, and SADD on units .L1, .S1, .M1, .D1, .D2, .M2, .S2, and .L2, with register files A and B (registers 0−15), 32-bit data paths Data 1 and Data 2, and Data address 1 and Data address 2.]
Figure 4−5 shows all the phases in each stage of the C67x DSP pipeline in
sequential order, from left to right.
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Figure 4−6 shows an example of the pipeline flow of consecutive fetch packets
that contain eight parallel instructions. In this case, where the pipeline is full,
all instructions in a fetch packet are in parallel and split into one execute packet
per fetch packet. The fetch packets flow in lockstep fashion through each
phase of the pipeline.
For example, examine cycle 7 in Figure 4−6. When the instructions from FP n
reach E1, the instructions in the execute packet from FP n+1 are being
decoded. FP n+2 is in dispatch while FPs n+3, n+4, n+5, and n+6 are
each in one of four phases of program fetch. See section 4.4, page 4-56, for
additional detail on code flowing through the pipeline. Table 4−1 summarizes
the pipeline phases and what happens in each phase.
Figure 4−6. Pipeline Operation: One Execute Packet per Fetch Packet
Fetch   Clock cycle
packet  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
n       PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7  E8  E9  E10
n+1         PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7  E8  E9  E10
n+2             PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7  E8  E9
n+3                 PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7  E8
n+4                     PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6  E7
n+5                         PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6
n+6                             PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
n+7                                 PG  PS  PW  PR  DP  DC  E1  E2  E3  E4
n+8                                     PG  PS  PW  PR  DP  DC  E1  E2  E3
n+9                                         PG  PS  PW  PR  DP  DC  E1  E2
n+10                                            PG  PS  PW  PR  DP  DC  E1
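
The lockstep flow in Figure 4−6 follows a simple rule: fetch packet n+k enters PG in cycle k+1 and then advances one phase per cycle. A minimal Python sketch of that rule is given below; the function name is hypothetical, and the model assumes a full pipeline with no stalls and one execute packet per fetch packet, as in the figure.

```python
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC"] + [f"E{i}" for i in range(1, 11)]

def phase_at(k, cycle):
    """Phase occupied by fetch packet n+k in the given clock cycle.

    Cycles are numbered from 1, and fetch packet n enters PG in cycle 1.
    Returns None when the packet is not in the pipeline in that cycle.
    """
    index = cycle - 1 - k
    if 0 <= index < len(PHASES):
        return PHASES[index]
    return None
```

Evaluating phase_at(k, 7) for k = 0 to 6 reproduces the cycle 7 column discussed above: E1, DC, DP, PR, PW, PS, PG.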
This chapter describes CPU interrupts, including reset and the nonmaskable
interrupt (NMI). It details the related CPU control registers and their functions
in controlling interrupts. It also describes interrupt processing, the method the
CPU uses to detect automatically the presence of interrupts and divert
program execution flow to your interrupt service code. Finally, the chapter
describes the programming implications of interrupts.
5.1 Overview
Typically, DSPs work in an environment that contains multiple external
asynchronous events. These events require tasks to be performed by the DSP
when they occur. An interrupt is an event that stops the current process in the
CPU so that the CPU can attend to the task needing completion because of
the event. These interrupt sources can be on chip or off chip, such as timers,
analog-to-digital converters, or other peripherals.
Servicing an interrupt involves saving the context of the current process, com-
pleting the interrupt task, restoring the registers and the process context, and
resuming the original process. There are eight registers that control servicing
interrupts.
Reset
Maskable
Nonmaskable
These three types are differentiated by their priorities, as shown in Table 5−1.
The reset interrupt has the highest priority and corresponds to the RESET signal.
The nonmaskable interrupt has the second highest priority and corresponds
to the NMI signal. The lowest priority interrupts are interrupts 4−15
corresponding to the INT4−INT15 signals. RESET, NMI, and some of the
INT4−INT15 signals are mapped to pins on C6000 devices. Some of the
INT4−INT15 interrupt signals are used by internal peripherals and some may
be unavailable or can be used under software control. Check your device-
specific data manual to see your interrupt specifications.
NMI Nonmaskable
INT4 Maskable
INT5 Maskable
INT6 Maskable
INT7 Maskable
INT8 Maskable
INT9 Maskable
INT10 Maskable
INT11 Maskable
INT12 Maskable
INT13 Maskable
INT14 Maskable
Reset is the highest priority interrupt and is used to halt the CPU and return
it to a known state. The reset interrupt is unique in a number of ways:
RESET must be held low for 10 clock cycles before it goes high again to
reinitialize the CPU properly.
NMI is the second-highest priority interrupt and is generally used to alert the
CPU of a serious hardware problem such as imminent power failure.
For NMI processing to occur, the nonmaskable interrupt enable (NMIE) bit in
the interrupt enable register must be set to 1. If NMIE is set to 1, the only condi-
tion that can prevent NMI processing is if the NMI occurs during the delay slots
of a branch (whether the branch is taken or not).
The CPUs of the C6000 DSPs have 12 interrupts that are maskable. These
have lower priority than the NMI and reset interrupts. These interrupts can be
associated with external devices, on-chip peripherals, or software control, or
may be unavailable.
Assuming that a maskable interrupt does not occur during the delay slots of
a branch (this includes conditional branches that do not complete execution
due to a false condition), the following conditions must be met to process a
maskable interrupt:
The global interrupt enable (GIE) bit in the control status register (CSR) is
set to 1.
The NMIE bit in the interrupt enable register (IER) is set to 1.
The corresponding interrupt enable (IE) bit in the IER is set to 1.
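
The enable conditions listed above can be checked together as in the following Python sketch. The function name is hypothetical; the bit positions (GIE in CSR bit 0, NMIE in IER bit 1, IEn in IER bit n) follow the register descriptions earlier in this document. The branch delay-slot restriction is not modeled.

```python
def maskable_interrupt_enabled(csr, ier, n):
    """True if maskable interrupt INTn (4..15) passes all enable checks."""
    if not 4 <= n <= 15:
        raise ValueError("maskable interrupts are INT4-INT15")
    gie = csr & 1            # global interrupt enable, CSR bit 0
    nmie = (ier >> 1) & 1    # nonmaskable interrupt enable, IER bit 1
    ie_n = (ier >> n) & 1    # individual enable bit for INTn
    return bool(gie and nmie and ie_n)
```

For example, with GIE = 1 and IER = 0x82 (NMIE and IE7 set), INT7 can be processed; clearing any one of the three bits blocks it.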
The IACK and INUMn signals alert hardware external to the C6000 that an
interrupt has occurred and is being processed. The IACK signal indicates that
the CPU has begun processing an interrupt. The INUMn signal (INUM3−
INUM0) indicates the number of the interrupt (bit position in the IFR) that is
being processed. For example:
INUM3 = 0 (MSB)
INUM2 = 1
INUM1 = 1
INUM0 = 1 (LSB)
Together, these signals provide the 4-bit value 0111, indicating INT7 is being
processed.
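
The same decoding can be written out as a short Python sketch, in which the four INUM signal levels are combined into an interrupt number. The function name is hypothetical.

```python
def inum_to_interrupt(inum3, inum2, inum1, inum0):
    """Combine the INUM3-INUM0 signal levels (MSB first) into a number."""
    return (inum3 << 3) | (inum2 << 2) | (inum1 << 1) | inum0
```

Calling inum_to_interrupt(0, 1, 1, 1) gives 7, matching the INT7 example above.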
Architecture and
Peripherals of TMS320C67x
3.1 Introduction
In the previous chapter we had a glimpse of the general features of different
generations of processors and their architectures. The TMS320C6x devices are the first
processors to use the VelociTI architecture, which implements VLIW. The
TMS320C62x is a 16-bit fixed-point processor and the ‘67x is a floating-point processor
with 32-bit integer support. The discussion in this chapter focuses on the
TMS320C67x processor and the peripherals associated with it.
In general, the TMS320C6x devices execute up to eight 32-bit instructions per cycle.
The core of the ‘67x devices is the ‘C6x CPU, which has the following features.
3.2 Architecture of TMS320C67x
The simplified architecture of the TMS320C6713 is shown in Figure 3.1 below. The
processor consists of three main parts: the CPU, peripherals, and memory.
The CPU contains a program fetch unit, an instruction dispatch unit, and an instruction
decode unit. The CPU fetches advanced very-long instruction words (VLIW) (256 bits
wide) to supply up to eight 32-bit instructions to the eight functional units during every
clock cycle. The VLIW architecture features controls by which all eight units do not have
to be supplied with instructions if they are not ready to execute. The first bit of every
32-bit instruction determines if the next instruction belongs to the same execute packet as the
previous instruction, or whether it should be executed in the following clock as a part of
the next execute packet. Fetch packets are always 256 bits wide; however, the execute
packets can vary in size. The variable-length execute packets are a key memory-saving
feature, distinguishing the C67x CPU from other VLIW architectures. The CPU also
contains two data paths (containing register files A and B, respectively) in which the
processing takes place. Each data path has four functional units (.L, .M, .S, and .D). The
functional units execute logic, multiply, shift, and data address operations. Figure 3.2
shows the simplified block diagram of the two data paths.
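
The grouping of a fetch packet into execute packets can be sketched in Python as follows. The sketch assumes that the bit in question (commonly called the p-bit) is the least significant bit of each 32-bit instruction word, and that a value of 1 means the next instruction executes in parallel with the current one. The function name is hypothetical.

```python
def split_execute_packets(fetch_packet):
    """Group the eight instruction words of a fetch packet into execute packets.

    Bit 0 of each word is treated as the p-bit (an assumption of this
    sketch): 1 chains the next word into the same execute packet,
    0 ends the current execute packet.
    """
    if len(fetch_packet) != 8:
        raise ValueError("a fetch packet holds eight 32-bit instructions")
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:
            packets.append(current)
            current = []
    if current:
        # Simplification: a trailing p-bit of 1 just closes the last packet.
        packets.append(current)
    return packets
```

With p-bits 1, 0, 1, 1, 1, 1, 1, 0 the fetch packet splits into an execute packet of two instructions followed by one of six, whereas eight p-bits of 0 give eight serial execute packets.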
All instructions except loads and stores operate on the registers. All data transfers
between the register files and memory take place only through the two data-addressing units
(.D1 and .D2). The CPU also has various control registers, control logic, and test and
emulation logic. Access to control registers is provided from data path B.
The CPU contains two general-purpose register files, A and B. These can be used
for data or as data address pointers. Each file contains sixteen 32-bit registers (A0-A15
for file A and B0-B15 for file B). The registers A1, A2, B0, B1, and B2 can also be used as
condition registers. The registers A4-A7 and B4-B7 can be used for circular addressing.
These registers support 32-bit and 40-bit fixed-point data. The 32-bit data can be stored
in any register. For 40-bit data, the processor stores the least significant 32 bits in an even
register and the remaining 8 bits in the next upper (odd) register.
The CPU features two sets of functional units. Each set contains four units and a
register file. One set contains functional units .L1, .S1, .M1, and .D1; the other set
contains units .D2, .M2, .S2, and .L2. The two register files each contain sixteen 32-bit
registers for a total of 32 general-purpose registers. The two sets of functional units,
along with the two register files, compose sides A and B of the CPU. Each functional unit
has two 32-bit read ports for source operands and one 32-bit write port into a
general-purpose register file. The functional units .L1, .S1, .M1, and .D1 write to register file A
and the functional units .L2, .S2, .M2, and .D2 write to register file B. As each unit has its
own 32-bit write port, all eight ports can be used in parallel in every cycle. The .L, .S, and
.M functional units are ALUs. They perform 32-bit/40-bit arithmetic and logical
operations. The .S units also perform branching operations, and the .D units perform linear and
circular address calculations. Only the .S2 unit accesses the control register file.
Table 3.1 lists the functional units and the operations each performs.
Functional Unit       Description
.L unit (.L1, .L2)    32/40-bit arithmetic and compare operations
                      Leftmost 1 or 0 bit counting for 32 bits
                      Normalization count for 32 and 40 bits
                      32-bit logical operations
                      32/64-bit IEEE floating-point arithmetic
                      Floating-point/fixed-point conversions
.S unit (.S1, .S2)    32-bit arithmetic operations
                      32/40-bit shifts and 32-bit bit-field operations
                      32-bit logical operations
                      Branching
                      Constant generation
                      Register transfers to/from the control register file
                      32/64-bit IEEE floating-point compare operations
                      32/64-bit IEEE floating-point reciprocal and square root
                      reciprocal approximation
.M unit (.M1, .M2)    16 x 16-bit multiplies
                      32 x 32-bit multiplies
                      Single-precision (32-bit) floating-point IEEE multiplies
                      Double-precision (64-bit) floating-point IEEE multiplies
.D unit (.D1, .D2)    32-bit add, subtract, linear and circular address calculation
3.3 Peripherals of TMS320C6713
The TMS320C67x devices contain peripherals for communication with off-chip
memory, co-processors, host processors and serial devices. The following subsections
discuss the peripherals of the ’C6713 processor.
3.3.1 Enhanced Direct Memory Access (EDMA)
The enhanced direct memory access (EDMA) controller transfers data between
regions in the memory map without intervention by the CPU. The EDMA provides
transfers of data to and from internal memory, internal peripherals, or external devices in
the background of CPU operation. The EDMA has sixteen independently programmable
channels, allowing sixteen different contexts for operation.
The EDMA can read a data element from a source location and write it to a destination
location in memory. The EDMA also provides combined transfers of data elements, such
as frame transfers and block transfers. Each EDMA channel has an independently
programmable number of data elements per frame and number of frames per block.
• Sixteen channels: The EDMA can keep track of the contexts of sixteen
independent transfers.
• Each channel’s source and destination address registers can have configurable
indexes for each read and write transfer. The address may remain constant,
increment, decrement, or be adjusted by a programmable value.
• Event synchronization: Each channel is initiated by a specific event. Transfers
may be either synchronized by element or by frame.
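The element/frame/block structure and the configurable address updates described above can be modelled with a short C sketch. It is a software simulation only: the structure and field names are hypothetical and do not correspond to the actual EDMA parameter registers, and event synchronization is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical channel context; field names are illustrative only. */
typedef struct {
    size_t elements_per_frame;
    size_t frames_per_block;
    ptrdiff_t src_step; /* 0 = constant, +1 = increment, -1 = decrement,
                           any other value = programmable index */
    ptrdiff_t dst_step;
} ChannelContext;

/* Copies one block (all frames) and returns the number of elements moved. */
size_t block_transfer(const ChannelContext *ch, const int32_t *src, int32_t *dst)
{
    ptrdiff_t s = 0; /* current source offset, in elements */
    ptrdiff_t d = 0; /* current destination offset, in elements */
    size_t moved = 0;

    assert(ch != NULL && src != NULL && dst != NULL);
    for (size_t frame = 0; frame < ch->frames_per_block; frame++) {
        for (size_t element = 0; element < ch->elements_per_frame; element++) {
            dst[d] = src[s];
            s += ch->src_step;
            d += ch->dst_step;
            moved++;
        }
    }
    return moved;
}
```

For example, an incrementing source combined with a decrementing destination reverses a buffer, while a constant source address repeatedly copies the same element, which is the pattern used when reading from a single peripheral register.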
3.3.2 Host-Port Interface (HPI)
The Host-Port Interface (HPI) is a 16-bit-wide parallel port through which a host
processor can directly access the CPU's memory space. The host device functions as a
master to the interface, which increases ease of access. The host and CPU can exchange
information via internal or external memory. The host also has direct access to
memory-mapped peripherals.
The HPI is connected to the internal memory via a set of registers. Either the host
or the CPU may use the HPI control register (HPIC) to configure the interface. The host
uses the host address register (HPIA) and the host data register (HPID) to access
the internal memory space of the device. The host accesses these registers using external
data and interface control signals. The HPIC is also a memory-mapped register, which
allows the CPU to access it.
The data transactions are performed within the EDMA, and are invisible to the
user.
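The address-then-data access pattern through HPIA and HPID can be illustrated with a toy C model. Everything here is an assumption made for illustration: the structure, the function names and the 16-word memory are hypothetical, and the real 16-bit bus protocol, HPIC configuration and EDMA involvement are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_WORDS 16u /* size of the toy memory, chosen arbitrarily */

/* Toy model of the host's view of the HPI (hypothetical). */
typedef struct {
    uint32_t hpia;                /* host address register (byte address) */
    uint32_t memory[MODEL_WORDS]; /* stands in for the DSP memory space */
} HpiModel;

/* The host first selects a word-aligned location by writing HPIA. */
void host_write_hpia(HpiModel *hpi, uint32_t byte_address)
{
    assert(hpi != NULL);
    assert(byte_address % 4u == 0u && byte_address / 4u < MODEL_WORDS);
    hpi->hpia = byte_address;
}

/* A write to HPID stores data at the location selected by HPIA. */
void host_write_hpid(HpiModel *hpi, uint32_t value)
{
    assert(hpi != NULL);
    hpi->memory[hpi->hpia / 4u] = value;
}

/* A read of HPID returns the data at the location selected by HPIA. */
uint32_t host_read_hpid(const HpiModel *hpi)
{
    assert(hpi != NULL);
    return hpi->memory[hpi->hpia / 4u];
}
```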
3.3.3 External Memory Interface (EMIF)
The external memory interface (EMIF) supports two byte orderings:
- Little-endian ordering, in which bytes are ordered from right to left, the most
significant byte having the highest address.
- Big-endian ordering, in which bytes are ordered from left to right, the most
significant byte having the lowest address.
The EMIF reads and writes both big- and little-endian devices. There is no distinction
between ROM and asynchronous interface. For all memory types, the address is
internally shifted to compensate for memory widths of less than 32 bits.
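The two byte orderings can be made concrete with a minimal C sketch that lays out one 32-bit word in byte-addressed memory both ways. The function names are hypothetical and the code runs on any host; it does not touch EMIF registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian: the least significant byte goes to the lowest address. */
void store_little_endian(uint32_t value, uint8_t bytes[4])
{
    assert(bytes != NULL);
    for (int i = 0; i < 4; i++) {
        bytes[i] = (uint8_t)(value >> (8 * i));
    }
}

/* Big-endian: the most significant byte goes to the lowest address. */
void store_big_endian(uint32_t value, uint8_t bytes[4])
{
    assert(bytes != NULL);
    for (int i = 0; i < 4; i++) {
        bytes[i] = (uint8_t)(value >> (8 * (3 - i)));
    }
}
```

Storing 0x11223344 gives the byte sequence 44 33 22 11 in little-endian order and 11 22 33 44 in big-endian order, reading from the lowest address upward.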
3.3.4 Multichannel Buffered Serial Port (McBSP)
The C62x/C67x multichannel buffered serial port (McBSP) is based on the standard
serial port interface found on the TMS320C2000 and C5000 platforms. The standard
serial port interface provides:
• Full-duplex communication
Figure 3.4 shows the basic block diagram of the McBSP unit.
Data communication between the McBSP and the devices interfaced to it takes place via
two different pins for transmission and reception: data transmit (DX) and data receive
(DR), respectively. Control information in the form of clocking and frame
synchronization is communicated via CLKX, CLKR, FSX, and FSR. The CPU communicates
with the McBSP through 32-bit-wide control registers that are accessible via the internal
peripheral bus. The CPU or the DMA controller writes the data to be transmitted to the
data transmit register (DXR), from which it is shifted out to DX via the transmit shift
register (XSR). Similarly, receive data on the DR pin is shifted into the receive shift
register (RSR) and copied into the receive buffer register (RBR). RBR is then copied to the
data receive register (DRR), which can be read by the CPU or the DMA controller. This
allows internal data movement and external data communications simultaneously.
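The receive path (DR pin to RSR to RBR to DRR) can be sketched as a small software model in C. It is a simplification under stated assumptions: bits arrive most significant bit first, clocks and frame synchronization are ignored, overrun handling is not modelled, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the McBSP receive registers (hypothetical). */
typedef struct {
    uint32_t rsr;      /* receive shift register */
    unsigned rsr_bits; /* bits shifted in so far */
    uint32_t rbr;      /* receive buffer register */
    bool rbr_full;
    uint32_t drr;      /* data receive register */
    bool drr_full;
} McbspRxModel;

/* RBR is copied to DRR as soon as DRR is free. */
static void copy_rbr_to_drr(McbspRxModel *rx)
{
    if (rx->rbr_full && !rx->drr_full) {
        rx->drr = rx->rbr;
        rx->drr_full = true;
        rx->rbr_full = false;
    }
}

/* Shifts one bit from the DR pin into RSR (MSB first, by assumption). */
void rx_shift_bit(McbspRxModel *rx, unsigned bit, unsigned word_len)
{
    assert(rx != NULL && word_len >= 1u && word_len <= 32u);
    rx->rsr = (rx->rsr << 1) | (bit & 1u);
    rx->rsr_bits++;
    if (rx->rsr_bits == word_len) {
        rx->rbr = rx->rsr; /* a complete word moves from RSR to RBR */
        rx->rbr_full = true;
        rx->rsr = 0u;
        rx->rsr_bits = 0u;
    }
    copy_rbr_to_drr(rx);
}

/* CPU or DMA read of DRR; returns false if no word is available. */
bool rx_read_drr(McbspRxModel *rx, uint32_t *word)
{
    assert(rx != NULL && word != NULL);
    if (!rx->drr_full) {
        return false;
    }
    *word = rx->drr;
    rx->drr_full = false;
    copy_rbr_to_drr(rx);
    return true;
}
```

Because a second word can wait in RBR while the first is still unread in DRR, the model shows why the port can keep receiving while the CPU or DMA controller is busy with the previous word.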
Figure 3.4: Multichannel Serial Port unit
3.3.5 Timers
The ’C62x/C67x has two 32-bit general-purpose timers that can be used to:
• Time events
• Count events
• Generate pulses
Each timer works in one of two signaling modes, depending on whether it is clocked by
an internal or an external source. The timer has an input pin (TINP) and an output pin
(TOUT). The TINP pin can be used as a general-purpose input, and the TOUT pin can be
used as a general-purpose output.
When an internal clock is provided, the timer generates timing sequences to trigger
peripherals or external devices, such as the DMA controller or an A/D converter, respectively.
When an external clock is provided, the timer can count external events and interrupt the
CPU after a specified number of events.
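The event-counting behaviour can be sketched with a minimal C model in which an interrupt is recorded each time a programmed number of events has been counted. The structure and function names are hypothetical, and the reset-to-zero behaviour on a match is an assumption of this sketch rather than a statement about the timer registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a timer used as an event counter (hypothetical). */
typedef struct {
    uint32_t count;      /* events counted since the last interrupt */
    uint32_t period;     /* number of events per interrupt */
    uint32_t interrupts; /* interrupts raised so far */
} TimerModel;

/* Called once per clock edge or external event. */
void timer_event(TimerModel *timer)
{
    assert(timer != NULL && timer->period > 0u);
    timer->count++;
    if (timer->count == timer->period) {
        timer->count = 0u; /* assumed: counter restarts after a match */
        timer->interrupts++;
    }
}
```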
3.3.6 Multichannel Audio Serial Ports (McASP)
The ’C6713 processor includes two Multichannel Audio Serial Ports (McASP).
Each McASP interface module supports one transmit and one receive clock zone.
Each McASP has eight serial data pins, which can be individually allocated to either
of the two zones. The serial port supports time-division multiplexing on each pin from 2
to 32 time slots. The C6713B has sufficient bandwidth to support all 16 serial data pins
transmitting a 192 kHz stereo signal. Serial data in each zone may be transmitted and
received on multiple serial data pins simultaneously and formatted in a multitude of
variations on the Philips Inter-IC Sound (I2S) format [10].
In addition, the McASP transmitter may be programmed to output multiple
S/PDIF, IEC60958, AES-3, and CP-430 encoded data channels simultaneously, with a single
RAM containing the full implementation of user data and channel status fields.
The McASP also provides extensive error-checking and recovery features, such as the
bad clock detection circuit for each high-frequency master clock, which verifies that the
master clock is within a programmed frequency range.
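To get a feel for the bandwidth figure quoted above (16 serial data pins, each carrying a 192 kHz stereo signal), the aggregate bit rate can be estimated with a one-line calculation. The slot width is not given in the text; the usage note below assumes 32-bit slots purely for illustration, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Aggregate serial bit rate in bits per second (illustrative estimate). */
uint64_t aggregate_bit_rate(unsigned pins, unsigned sample_rate_hz,
                            unsigned channels_per_pin, unsigned bits_per_slot)
{
    assert(pins > 0u && bits_per_slot > 0u);
    return (uint64_t)pins * sample_rate_hz * channels_per_pin * bits_per_slot;
}
```

Under the assumed 32-bit slot width, aggregate_bit_rate(16, 192000, 2, 32) evaluates to 196,608,000 bits per second; a narrower slot scales the result down proportionally.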
Addressing modes
• Determine how memory is accessed
• Addressing refers to the means of specifying the location of operands for
instructions
- types of addressing are called addressing modes
- operands may be input operands for the operation as well as
results of the operation
• Supported modes are register-indirect, indexed register-indirect,
and modulo addressing (circular addressing).
Immediate data is also supported.
• The TMS320C67x does not support modulo addressing for 64-bit
data.
• Immediate
– The operand is part of the instruction
  Example: ADD .L1 -13,A1,A6
• Register
– The operand is specified in a (implied) register
  Example: ADD .L1 A7,A6,A7
• Direct
– The address of the operand is part of the instruction (added
to imply memory page)
  Not supported
• Indirect
– The address of the operand is stored in a register
  Example: LDW .D1 *A5++[8],A1
Register-Indirect Addressing
• Operand is located at the memory address stored in a register
• A special group of registers can be used to store addresses
(address registers)
• Most important addressing mode in DSPs
• Efficient from the instruction-set point of view
• Few bits are needed to indicate the address of the operand
• 32 registers (A0-A15, B0-B15) are used as pointers
• Indirect addressing uses ‘*’ in conjunction with one of the 32
registers
1. *R: register R contains the address of a memory location
where a data value is stored
2. *R++(d): register R contains the memory address
- after the memory address is used, R is
postincremented so that the new address is R+d (R+1 if d=1)
- a double minus (--) instead updates the address to R-d
3. *++R(d): the address is preincremented (offset by d) before it is used
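The difference between the post-increment and pre-increment forms can be mimicked in C with a pointer standing in for register R. This is an analogy rather than generated DSP code: the function names are hypothetical, and because C pointer arithmetic steps in whole elements, the displacement d here counts 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models *R++(d): use the address in R, then advance R by d elements. */
int32_t load_post_increment(const int32_t **r, ptrdiff_t d)
{
    int32_t value;

    assert(r != NULL && *r != NULL);
    value = **r;
    *r += d;
    return value;
}

/* Models *++R(d): advance R by d elements, then use the new address. */
int32_t load_pre_increment(const int32_t **r, ptrdiff_t d)
{
    assert(r != NULL && *r != NULL);
    *r += d;
    return **r;
}
```

Passing a negative d corresponds to the double-minus forms, provided the resulting address stays inside the array being walked.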
• For each of the eight registers (A4–A7, B4–B7) that can perform linear
or circular addressing, the addressing mode register (AMR) specifies
the addressing mode.
• A 2-bit field for each register selects the address modification mode:
linear (the default) or circular mode.
• With circular addressing, the field also specifies which BK (block size)
field to use for a circular buffer.
• In addition, the buffer must be aligned on a byte boundary equal to
the block size.
AMR mode and description
Mode   Description
00     Linear addressing (the default)
01     Circular addressing using BK0
10     Circular addressing using BK1
11     Reserved
Block size = 2^(N+1) bytes, where N is the value in the selected BK field
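The wrap-around implied by the block size formula can be sketched in C. The sketch assumes the buffer is aligned to its block size, so that only the low N+1 address bits change while the upper bits stay fixed; the function name is hypothetical and the AMR register itself is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Adds a byte offset to an address inside a circular buffer of
 * 2^(n+1) bytes (hypothetical helper). Only the low n+1 bits wrap.
 */
uint32_t circular_add(uint32_t address, int32_t offset, unsigned n)
{
    uint32_t mask;
    uint32_t base;
    uint32_t low;

    assert(n <= 31u);
    mask = (n == 31u) ? 0xFFFFFFFFu : ((1u << (n + 1u)) - 1u);
    base = address & ~mask;                      /* fixed upper bits */
    low = (address + (uint32_t)offset) & mask;   /* wrapped lower bits */
    return base | low;
}
```

With n = 3 the block is 16 bytes, so stepping 8 bytes forward from 0x8000000C wraps to 0x80000004, and stepping 4 bytes back from 0x80000000 wraps to 0x8000000C.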