lx5280 1 9
Lexra, Inc.
Release 1.9
MIPS, MIPS16, MIPS ABI, MIPSII, MIPSIV, MIPSV, MIPS32, R3000, R4000, and other MIPS
common law marks are trademarks and/or registered trademarks of MIPS Technologies, Inc. Lexra,
Inc. is not associated with MIPS Technologies, Inc. in any way.
Table of Contents
1. LX5280 Product Overview
1.1. Introduction
1.2. LX5280 Processor Overview
1.3. System Level Building Blocks
1.3.1. SMMU
1.3.2. Local Memory Interface
1.3.3. Coprocessor Interface
1.3.4. Custom Engine Interface
1.3.5. Lexra Bus Controller
1.3.6. Building Block Integration
1.4. RTL Core & SmoothCore
1.5. EDA Tool Support
2. LX5280 Architecture
2.1. Motivation
2.2. Hardware Architecture
2.2.1. Module Partitioning
2.2.2. Six Stage Pipeline
2.3. Dual Issue
2.3.1. Instruction Fetch
2.3.2. Instruction Analysis and Select Logic
2.3.3. MIPS16
2.4. RALU Data Path
2.4.1. Overview
2.4.2. Assignment of Instructions to Pipe A, Pipe B.
2.5. System Control Coprocessor (CP0)
2.6. Dual Multiply-Accumulate (MAC)
2.6.1. Dual MAC Operations
2.6.2. MAC MODE (MMD) register
2.6.3. Architecture
2.7. Data Addressing
2.7.1. Twinword Data Movement
2.7.2. Vector Addressing
2.7.3. Circular Buffers
2.8. Radiax ALU Operations
2.8.1. Extensions to MIPS ALU Operations
2.8.2. New ALU Instructions
2.8.3. Conditional Move Operations
2.9. Zero Overhead Loop Facility
2.10. Low-Overhead Prioritized Interrupts
3. LX5280 RISC Programming Model
3.1. Summary of MIPS-I Instructions
3.1.1. ALU Instructions
3.1.2. Load and Store Instructions
3.1.3. Conditional Move Instructions
3.1.4. Branch and Jump Instructions
3.1.5. Control Instructions
3.1.6. Coprocessor Instructions
3.2. Opcode Extension Using the Custom Engine Interface (CEI)
3.2.1. CEI Operations
3.2.2. Interface Signals
3.3. Memory Management
3.4. Exception Processing
3.4.1. Exception Processing Registers
3.4.2. Exception Processing: Entry and Exit
3.5. The Coprocessor Interface (CI)
3.6. Power Savings Mode
4. MIPS16
4.1. MIPS16 Instructions
4.2. Mode switching
4.3. Exceptions
4.4. No Delay Slots
5. LX5280 DSP Programming Model
5.1. Radiax Instructions
5.1.1. Radiax Dual-MAC Instructions
5.1.2. Cycle-by-Cycle Usage for Dual MAC Instructions
5.1.3. Vector Addressing Instructions
5.1.4. Radiax ALU Operations
5.1.5. Conditional Operations
5.2. Instruction Encoding
5.2.1. Lexra Formats
5.3. Load/Store Formats
5.3.1. Arithmetic Format
5.3.2. MAC Format A
5.3.3. MAC Format B
5.3.4. MAC Format C
5.3.5. RADIAX MOVE Format and Lexra-Cop0 MTLXC0/MFLXC0
5.3.6. CMOVE Format
5.3.7. Lexra SUBOP Bit Encodings
6. Integer Multiply-Divide-Accumulate
6.1. Summary of Instructions
6.2. MAC-DIV Instruction Overview
6.3. Op-codes for Standard Mode (32-Bit) MAC Instructions
6.4. Op-codes for MIPS-16 (16-Bit) Mode MAC Instructions
6.5. Non-Standard Instruction Descriptions
6.6. Accessing HI and LO after multiply instructions
6.7. Divider Overview and Register Usage
7. LX5280 Local Memory
7.1. Local Memory Overview
7.2. Cache Control Register: CCTL
7.3. Instruction Cache (ICACHE) LMI
7.4. Instruction Memory (IMEM) LMI
7.5. Instruction ROM (IROM) LMI
7.6. Direct Mapped Write Through Data Cache (DCACHE) LMI
7.7. Scratch Pad Data Memory (DMEM) LMI
8. LX5280 System Bus
8.1. Connecting the LX5280 to internal devices
List of Tables
Table 1: EDA Tool Support
Table 2: Assignment of Instructions of Pipe A, Pipe B
Table 3: CP0 Registers
Table 4: MMD Fields (Radiax User Register 24)
Table 5: Prioritized Interrupt Exception Vectors
Table 6: ALU Instructions
Table 7: Load and Store Instructions
Table 8: Conditional Move Instructions
Table 9: Branch and Jump Instructions
Table 10: Control Instructions
Table 11: Coprocessor Instructions
Table 12: Custom Engine Interface Operations
Table 13: Custom Engine Interface Signals
Table 14: SMMU Address Mapping
Table 15: List of Exceptions
Table 16: MIPS I Instructions Not Supported by MIPS16
Table 17: MIPS16 Instructions that Support MIPS I
Table 18: New MIPS16 Instructions
Table 19: PC-Relative Addressing
Table 20: Radiax Dual-MAC Instructions
Table 21: Vector Addressing Instructions
Table 22: Radiax ALU Operations
Table 23: Conditional Operations
Table 24: Summary of MAC-DIV Instructions.
Table 25: 16-bit Multiply and Multiply-Accumulate Instructions
Table 26: 32-Bit Multiply-Accumulate Instructions
Table 27: Local Memory Interface Modules
Table 28: ICACHE Configurations
Table 29: ICACHE RAM Interfaces
Table 30: IMEM Configurations
Table 31: IMEM RAM Interfaces
Table 32: IROM Configurations
Table 33: IROM ROM Interfaces
Table 34: DCACHE Configurations
Table 35: DCACHE RAM Interfaces
Table 36: DMEM Configurations
Table 37: DMEM RAM Interfaces
Table 38: Line Read Interleave Order
Table 39: LBus Signal Description
Table 40: LBus Byte Lane Assignment
Table 41: LBus Commands Issued by the LBC
Table 42: LBC Interface Signals
Table 43: Coprocessor Interface Signals
Table 44: EJTAG Pinout
Table 45: EJTAG AC Characteristics
Table 46: EJTAG Synthesis Constraints
Table 47: Single-Processor PC Trace Pinout
Table 48: Single-Processor PC Trace AC Characteristics
Table 49: LX5280 Processor Port Summary
Table 50: Instruction Groupings For Stall Definition
Table 51: Load/Store Ops Stall Matrix
Table 52: Cycles Required Between Dual MAC Instructions
List of Figures
Figure 1: LX5280 Processor Overview
Figure 2: Superscalar Processor Core Module Partitioning
Figure 3: Superscalar Instruction Issue
Figure 4: Dual MAC Data Path
Figure 5: Post-modified Pointers with Circular Buffer Support
Figure 6: Lexra System Bus Diagram
1.1. Introduction
This data sheet describes the LX5280, a high-performance RISC-DSP developed for Intellectual Property
(IP) licensing. The LX5280 is not just a highly specialized DSP architecture; it is also a carefully
engineered extension to the MIPS ISA. As a result, system functions and computationally intensive DSP
algorithms can be integrated on a single, low-cost subsystem. Key applications include data communication
products such as network protocol processors, cable and xDSL modems, Voice over Packetized Network
gateways, and set-top boxes, as well as disk controllers.
As DSP-intensive applications have gained commercial importance, there has been an increasing recognition
of the benefit of implementing DSP functions on the CPU. A CPU is usually required for memory
management, user interface and control software; CPUs also have excellent third-party software tool support.
However, software implementations of DSP algorithms such as the FIR filter or Discrete Cosine Transform
(DCT) typically run an order of magnitude slower on a CPU than on a specialized DSP. The problem
is compounded by the difficulty of guaranteeing deterministic real-time behavior on sophisticated CPUs. Vendors have
addressed these problems by offering DSP Coprocessors which have separate instruction sets, separate
instruction stores and execution units; and DSP Accelerators, which share the same I-stream with the CPU
but have separate DSP execution units. Each of these approaches imposes a substantial burden on the CPU in
managing the DSP functions. The LX5280, on the other hand, tightly integrates its DSP extensions into the
MIPS ISA. As a result, a wide variety of third-party tools are available and the LX5280 programmer can
switch seamlessly from RISC code to DSP code.
The LX5280 adds to the MIPS-I instruction set a collection of DSP-oriented instructions called the Radiax™
instruction set. The Radiax instruction set adds dual 16-bit multiply and multiply-accumulate operations
including DSP modes such as saturation, rounding and fractional arithmetic. It includes DSP addressing
modes such as post-modified address pointers, circular buffers and zero-overhead loops. It also includes dual
16-bit SIMD ALU operations and data alignment operations for applications where 16 bits of data is
sufficient.
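As an illustration of the arithmetic modes named above, the following Python sketch models a dual 16-bit fractional multiply-accumulate with saturation. It is a behavioral model under stated assumptions, not the Radiax definition: the Q15 operand format, the 32-bit accumulator width, the pairing of high half with high half and low half with low half, and the names sat32, split16, frac_mul, and dual_mac are all illustrative choices. Rounding is not modeled.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)

def sat32(x):
    """Clamp x to the signed 32-bit range instead of letting it wrap."""
    return max(INT32_MIN, min(INT32_MAX, x))

def split16(word):
    """Split a 32-bit word into two signed 16-bit halves (high, low)."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16((word >> 16) & 0xFFFF), s16(word & 0xFFFF)

def frac_mul(a, b):
    """Q15 x Q15 -> Q31: the integer product is shifted left by one bit.
    Only (-1.0) * (-1.0) overflows, and it saturates to the largest value."""
    return sat32((a * b) << 1)

def dual_mac(acc_hi, acc_lo, x, y):
    """ASSUMED pairing: acc_hi += x.hi * y.hi and acc_lo += x.lo * y.lo,
    each accumulation saturating at 32 bits."""
    xh, xl = split16(x)
    yh, yl = split16(y)
    return sat32(acc_hi + frac_mul(xh, yh)), sat32(acc_lo + frac_mul(xl, yl))
```

For example, with x holding (0.5, 0.5) and y holding (0.5, 0.25) in Q15, the two accumulators receive 0.25 and 0.125 in Q31.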
The LX5280 pipeline is a dual-issue, six-stage architecture. Pipe A is the Load/Store Pipe and includes data
memory access and all MIPS instructions except multiply and divide operations, while Pipe B is the
Multiply-Accumulate Pipe and includes a MAC and ALU, each with dual 16-bit operations. DSP algorithms
will typically use Pipe A to load a pair of operands into a general register while executing Dual MAC
operations in Pipe B on earlier data. Decoupling register loads from the MAC allows loop unrolling and takes
effective advantage of the 32 general registers for temporary storage. As compared to memory-based
operands which are common to specialized DSP instruction sets, dual-issue allows the LX5280 to achieve the
memory bandwidth required by DSP within a RISC architecture. In addition, the LX5280 introduces
twinword load and store functions, allowing 64 bits of data to be moved between the local cache and register
file in a single cycle. This provides sufficient data movement bandwidth for data-hungry DSP inner loops.
Features introduced in Lexra’s RISC product line to support System-on-Chip (SoC) design, including
customer-defined Coprocessors and customer extensions to the MIPS ISA, are standard in the LX5280.
Configuration options include Extended-JTAG (EJTAG) support for debug and In-Circuit Emulation (ICE).
Lexra’s products include the same memory management stub (SMMU) as the LX4189.
Because the LX5280 executes the MIPS instruction set, a wide variety of third-party software tools are
available including compilers, operating systems, debuggers and in-circuit emulators. The assembler
extensions and a cycle accurate Instruction Set Simulator (ISS) are developed by Lexra. Programmers can
use “off-the-shelf” C Compilers for initial coding; then replace performance-critical loops with optimized
assembler code.
Third parties provide C compiler support for the new DSP instructions and will supply DSP macro libraries
and application packages. Compiler support is provided by the Green Hills MULTI IDE package. A DSP
library of functions such as filters and transforms will be available from Lexra.
Memory sizes and peripherals can be tailored by the licensee to system requirements, avoiding excess silicon
cost and power dissipation. It is expected that the LX5280 will deliver exceptional price/performance for
numerous consumer products and that multiple LX5280 subsystems on a single die can cost-effectively
implement high-performance next generation telecommunications systems.
Key Features
• SIMD operations.
• Zero-overhead loop.
• Multiply-accumulate instructions.
• Vector and circular buffer addressing modes.
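The circular buffer addressing mode listed above can be pictured with a small Python model of a post-modified pointer that wraps at the buffer boundaries. This is a sketch of the general technique only: the function names, the inclusive start/end convention, and the restriction that the step is no larger than the buffer are assumptions made for illustration, and the LX5280 register-level behavior may differ.

```python
def post_modify(ptr, step, start, end):
    """Return (address used by this access, updated pointer).

    The buffer occupies byte addresses start..end inclusive.
    ASSUMPTION: abs(step) does not exceed the buffer length, so a single
    correction is enough to bring the pointer back into range.
    """
    length = end - start + 1
    nxt = ptr + step
    if nxt > end:
        nxt -= length      # ran off the top: wrap to the bottom
    elif nxt < start:
        nxt += length      # ran off the bottom: wrap to the top
    return ptr, nxt

def walk(ptr, step, start, end, count):
    """Collect the addresses touched by count successive accesses."""
    addrs = []
    for _ in range(count):
        addr, ptr = post_modify(ptr, step, start, end)
        addrs.append(addr)
    return addrs
```

Because the wrap is folded into the pointer update, the inner loop needs no separate compare-and-branch to keep the pointer inside the buffer.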
1.2. LX5280 Processor Overview
The LX5280 is a RISC-DSP processor that executes the MIPS-I instruction set (see footnote 1) along with Lexra’s
Radiax™ DSP extensions. However, the clocking, pipeline structure, pin-out, and memory interfaces have
all been designed by Lexra to reflect system-on-silicon design needs, deep submicron process technology, and
advances in design methodology.
[Figure 1: LX5280 Processor Overview]
MIPS ISA Execution. The LX5280 supports the MIPS I programming model. Two source operands can be
supplied and one destination update performed per cycle. The second operand is either a register or 16-bit
immediate. The instruction set includes a wide selection of ALU operations executed by the RALU, Lexra’s
proprietary register based ALU. The RALU also generates memory addresses for 8-bit, 16-bit, and 32-bit
register loads from (stores to) memory by adding a register base to an immediate offset. An extension to the
MIPS ISA allows a pair of 32-bit registers to be loaded from (stored to) memory. Branches are based on
comparisons between registers, rather than flags, and are therefore easy to relocate. Optional links following
jump or branch instructions assist with subroutine programming.
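The address generation described above, a register base plus an immediate offset, can be sketched in Python as follows. The 16-bit sign-extended offset and 32-bit wraparound follow the standard MIPS I load/store encoding. The function names are illustrative, and the alignment helper reflects ordinary MIPS I natural alignment rather than anything specific to the LX5280.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def effective_address(base, imm16):
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF

def is_naturally_aligned(addr, size):
    """True when addr is a multiple of the access size in bytes."""
    return addr % size == 0
```

A negative offset such as 0xFFFC (that is, -4) therefore reaches the word just below the base address.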
The MIPS unaligned load and store instructions are not supported, because they represent a poor
price/performance trade-off for embedded applications.
Pipeline. LX5280 instructions are executed by a six-stage pipeline that has been designed so that all
transactions internal to the LX5280, as well as at the interfaces, occur on the positive edge of the processor
clock. Two-phase clocks are not used.
Exception Handling. The MIPS R3000 exception handling model is supported. Exceptions include both
instruction-synchronous traps as well as hardware and software interrupts. The STATUS register controls the
interrupt mask and operating mode. Exceptions are prioritized. When an exception is taken, control is
transferred to the exception vector, the current instruction address is saved in the EPC register, and the
exception source is identified in the CAUSE register. A user program located at the exception vector identifies
the cause of the exception, and transfers control to the application-specific handler. In the event of an address
error exception, the BADVADDR register holds the failing address.
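The dispatch step described above can be sketched as a small Python model. It assumes the standard MIPS R3000 CAUSE layout, in which the exception code occupies bits 6:2; the LX5280 exception list is given elsewhere in the data sheet and should be treated as authoritative. The handler table, the function names, and the returned tuples are hypothetical and exist only to make the control flow concrete.

```python
# Exception codes from the standard R3000 CAUSE register definition.
EXC_ADEL = 4   # address error on load or instruction fetch
EXC_ADES = 5   # address error on store
EXC_SYS = 8    # syscall

def exc_code(cause):
    """Extract the ExcCode field (bits 6:2) from a CAUSE value."""
    return (cause >> 2) & 0x1F

def dispatch(cause, epc, badvaddr, handlers, default):
    """Model of the code at the exception vector: decode the cause,
    then hand control to the application-specific handler."""
    code = exc_code(cause)
    handler = handlers.get(code, default)
    return handler(code, epc, badvaddr)

def address_error(code, epc, badvaddr):
    # BADVADDR is only meaningful for address error exceptions.
    return ("address error", epc, badvaddr)

def unhandled(code, epc, badvaddr):
    return ("unhandled", code, epc)
```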
Coprocessor Operations. The LX5280 supports 32-bit Coprocessor operations. These include moves to and
from the Coprocessor general registers and control registers (MTCz, MFCz, CTCz, CFCz), Coprocessor
loads and stores (LWCz, SWCz) and branches based on Coprocessor condition flags (BCzT, BCzF). The
Lexra-supplied Coprocessor Interface can support Coprocessor operations in a single cycle, without pipeline
stalls.
Footnote 1: The MIPS unaligned load and store instructions (LWL, LWR, SWL, SWR) are not supported.
The LX5280 provides excellent price/performance and time-to-market. Lexra has taken two main approaches
to achieve this:
• Deliver simple building blocks outside the processor core to enable system level
customizations such as coprocessors, application specific instructions, memories, and
busses.
• Deliver either a fully synthesizable Verilog source model or fully implemented hardcore
(called SmoothCore) for popular pure-play foundries.
Section 1.3 describes the building blocks, and Section 1.4 describes the deliverable models.
1.3. System Level Building Blocks
The LX5280 processor is designed to easily fit into different target applications. It provides the following
building blocks.
• A flexible Local Memory Interface (LMI) that supports instruction cache, instruction
RAM, instruction ROM, data cache and data RAM.
• A Lexra Bus Controller (LBC) to connect peripheral devices and secondary memories to
the processor’s own local buses.
The following sections discuss each of these system building block interfaces.
1.3.1. SMMU
The LX5280 SMMU is designed for embedded applications using a single address space. Its primary
function is to provide memory protection between user space and kernel space. The SMMU is consistent
with the MIPS address space scheme for User/Kernel modes, mapping, and cached/uncached regions.
1.3.2. Local Memory Interface
The LX5280’s Harvard Architecture provides Local Memory Interfaces (LMIs) that support instruction
memory and data memory. Synchronous memory interfaces are employed for all memory blocks. The LMI
block is designed to easily interface with standard memory blocks provided by ASIC vendors or by third-
party library vendors.
The LMIs provide a two-way set associative instruction cache interface, and a direct-mapped write-through
data cache interface. The tag compare logic as well as a cache replacement algorithm are provided as part of
the LMI. One of the instruction cache sets may be locked down as un-swappable local memory. Local
instruction and data memories can also be mapped to fixed regions of the physical address space, and include
non-volatile memory (such as ROM, flash, or EPROM).
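To make the tag compare of a direct-mapped write-through data cache concrete, here is a minimal Python model. The line count, line size, and class and method names are hypothetical. The model also assumes that a store miss does not allocate a line, which the text above does not specify.

```python
class DirectMappedCache:
    def __init__(self, num_lines=256, line_bytes=16):
        # Sizes are illustrative defaults, not LX5280 configurations.
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines   # None marks an invalid line

    def _index_tag(self, addr):
        line_addr = addr // self.line_bytes
        return line_addr % self.num_lines, line_addr // self.num_lines

    def read(self, addr):
        """Tag compare for a load. A miss fills the single candidate line."""
        index, tag = self._index_tag(addr)
        hit = self.tags[index] == tag
        if not hit:
            self.tags[index] = tag
        return hit

    def write(self, addr, memory_writes):
        """Write-through: every store is forwarded to memory, hit or miss.
        ASSUMPTION: a store miss does not allocate a cache line."""
        index, tag = self._index_tag(addr)
        memory_writes.append(addr)
        return self.tags[index] == tag
```

Two addresses that share an index evict each other in this organization, a conflict that a two-way set-associative cache, as used for instructions, can absorb.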
1.3.3. Coprocessor Interface
Lexra supplies an optional Coprocessor Interface (CI) for applications requiring this functionality. Up to three
CIs may be implemented in one design. The Coprocessor Interface “eavesdrops” on the Instruction bus. If a
Coprocessor load (LWCz) or “move to” (MTCz, CTCz) is decoded, data is passed over the Data Bus into a
CI register, then supplied to the designer-defined Coprocessor. Similarly, if a Coprocessor store (SWCz) or
“move from” (MFCz, CFCz) is decoded, data is obtained from the Coprocessor and loaded into a CI register,
then transferred onto the Data Bus in the following cycle. The design interface includes a data bus, five-bit
address, and independent read and write selects for Coprocessor registers and control registers. The LX5280
pipeline and Harvard Architecture permit single cycle Coprocessor access and transfer. An application-
defined Coprocessor condition flag is synchronized by the CI then passed to the Sequencer for testing in
branch instructions.
1.3.4. Custom Engine Interface
The LX5280 includes a Custom Engine Interface (CEI) that the application may use to extend the MIPS I
ALU opcodes with application-specific or proprietary operations. Similar to the standard ALU, the CEI
supplies the Custom Engine two input 32-bit operands, SRC1 and SRC2. One operand is selected from the
Register File. Depending on the most significant 6 bits of the opcode, the second operand is either selected
from the Register File or is a 16-bit sign-extended immediate. The opcode is locally decoded by the custom
engine, and following execution by the custom engine, the result is returned on the 32-bit result bus to the
LX5280. To support multi-cycle operations, a stall input is included in the interface.
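The operand selection described above can be sketched in Python. The field positions (opcode in bits 31:26, rs in 25:21, rt in 20:16, immediate in 15:0) follow the standard MIPS instruction formats. Which opcode values select the register form and which select the immediate form is an assumption here: the sketch treats a major opcode of zero as register-register and everything else as immediate, purely for illustration, and the opcode 0x18 in the test is hypothetical.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def cei_operands(instr, regs):
    """Return (SRC1, SRC2) as 32-bit values for a custom engine operation."""
    op = (instr >> 26) & 0x3F        # most significant 6 bits of the opcode
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    src1 = regs[rs] & 0xFFFFFFFF
    if op == 0:
        # ASSUMED register-register form: second operand from the register file
        src2 = regs[rt] & 0xFFFFFFFF
    else:
        # ASSUMED immediate form: 16-bit immediate, sign-extended to 32 bits
        src2 = sign_extend16(instr) & 0xFFFFFFFF
    return src1, src2
```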
1.3.5. Lexra Bus Controller
The Lexra Bus Controller (LBC) is the interface between the LX5280 and the outside world, which includes
DRAM and various peripherals. Its bus is non-multiplexed, non-pipelined, and non-parity-checked, which
keeps the bus protocol simple for design integration. On the processor side, the LBC provides a write buffer
of configurable depth to support the write-through cache, as well as the control for byte and half-word
transfers. On the peripheral side, the LBC is designed to easily interface to industry standard bus protocols,
such as PCI, USB, and FireWire.
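The role of a write buffer behind a write-through cache can be illustrated with a short Python model. It shows the general technique only: the class name, the FIFO ordering, and the rule that a full buffer forces the processor to wait are assumptions, not documented LBC behavior.

```python
from collections import deque

class WriteBuffer:
    def __init__(self, depth):
        self.depth = depth       # configurable number of pending stores
        self.fifo = deque()

    def push(self, addr, data):
        """Queue a store from the processor side.
        Returns False when the buffer is full (ASSUMED to stall the store)."""
        if len(self.fifo) >= self.depth:
            return False
        self.fifo.append((addr, data))
        return True

    def drain_one(self):
        """Retire the oldest pending store to the bus, or None if empty."""
        return self.fifo.popleft() if self.fifo else None
```

A deeper buffer lets the processor run further ahead of a slow bus before a burst of stores has to wait.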
The LBC can run at any speed from 33 MHz up to the speed of the LX5280 processor core in both the RTL
core and SmoothCore.
1.3.6. Building Block Integration
The LX5280 configuration script, lconfig, provides a menu of selections for designers to specify building
blocks needed, number of different memory blocks, target speed, and target standard cell library. Next, the
configuration software automatically generates a top level Verilog model, makefiles, and scripts for all steps
of the design flow.
For testability purposes, all building blocks contain scan control signals. The Lexra synthesis scripts include
scan insertion, which allows ATPG testing of the entire LX5280 core.
1.4. RTL Core & SmoothCore
RTL Core: For full ASIC designs, the RTL is fully synthesizable and scan-testable Verilog source code, and
may be targeted to any ASIC vendor’s standard cell libraries. In this case, the designer may simply follow the
ASIC vendor’s design flow to ensure proper sign-off. In addition to the Verilog source code and system level
test bench, Lexra provides synthesis scripts as well as floor plan guidelines to maximize the performance of
the LX5280.
SmoothCore: For COT designs that are manufactured at popular foundries such as IBM, TSMC, and UMC,
a SmoothCore port is the quickest, lowest cost, and best performance choice. In this case, the LX5280 has
been fully implemented and verified as a hard macro. All data path, register file, and interface optimizations
have been performed to ensure the smallest die size and fastest performance possible. Furthermore, there is a
scan based test pattern that provides excellent fault coverage during manufacturing tests.
Lexra supports mainstream EDA software, so designers do not have to alter their design methodology. The
following is a snapshot of EDA tools currently supported:
2. LX5280 Architecture
2.1. Motivation
The LX5280 issues dual 32-bit instructions to two distinct 6-stage execution pipelines. Superscalar
architectures have been widely deployed in RISC CPUs where increased performance is obtained at the cost
of significantly increased area. Although superscalar issue significantly increases the area of the LX5280
processor Core, performance analysis at Lexra demonstrates benefits on key DSP algorithms well beyond
that which is obtained in typical CPU benchmarks.
Sustaining peak computational performance in DSP algorithms typically requires at least one operand from
memory per instruction cycle. DSPs have traditionally implemented specialized instruction sets that support
memory-based operands. Single-issue RISC architectures operate on register-based operands and thus
degrade performance by a factor of two in order to pre-load the operand into the register file. Grafting
memory-based operands onto a RISC architecture is inconsistent with both the RISC pipeline and ISA. Dual-
issue superscalar design, on the other hand, allows operands to be loaded from memory by one instruction
while the Multiply-Accumulate (MAC) data path operates on register-based data loaded earlier from memory.
The RISC data path is 32 bits wide, but few DSP algorithms require more than 16 bits of precision. Thus two
16-bit values can be fetched simultaneously from memory. Simultaneous dual 16-bit ALU and MAC operations
further improve the LX5280 DSP performance.
Compared to specialized 32-bit (or even 16-bit) DSP instructions which allow memory reference, the
superscalar approach will have lower code density, but only within the DSP loop, or kernel. These kernels are
typically small sections of code which are executed many times. Thus the overall degradation of code density
is minimal and can be offset by use of MIPS16 code compression in “outer loop” code which is not
performance critical.
The LX5280 processor core includes two major blocks: the RALU (register file and ALU) and the CP0
(Control Processor). The RALU performs ALU operations and generates data addresses while CP0 includes
instruction address sequencing, exception processing, and product specific mode control. The RALU and
CP0 are loosely-coupled and include their own independent instruction decoders.
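[Figure: LX5280 processor core block diagram, showing the Instructions input, the RALU with its 8-port (4r/4w) Register File (signals addr_rA2, addr_rB2, addr_wA1, addr_wA2, addr_wB1, addr_wB2, wA1, wA2, wB1, wB2), and the PC and Sequencer.]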
The LX5280 I-Cache and IRAM can fetch two 32-bit instructions I0_I, I1_I simultaneously. Following the
superscalar instruction buffer and issue logic, described below, the instructions are issued to Pipe B and Pipe
A as appropriate. To avoid degrading operating frequency, the superscalar issue logic operates during the
Decode stage (D-stage) of the pipeline. Support for fully synchronous memories in the LX5280 has the
added benefit of isolating the processor logic from the customer-supplied memories in the instruction cache,
thus facilitating integration of the LX5280 into SoC designs.
As a result of the D-Stage, a two cycle penalty is incurred on branch prediction failure vs. the one-cycle
penalty in the LX4180 five stage pipeline. However, the LX5280’s zero-overhead loop hardware and
conditional move instructions can be used to avoid any wasted cycles in the control of real-time critical loops.
Two instructions are fetched during each instruction cache access. In the event of a cache miss, the processor
will be stalled until the cache line containing the requested instructions is retrieved. In the event that only one
instruction of a fetched pair can issue, the fetch will be stalled until the second instruction is issued to the
pipeline.
Instruction fetches always occur on an aligned 64-bit address boundary. In the event of a branch to an odd 32-
bit address in the 64-bit boundary, both instructions in the 64-bit window will be fetched, but only the second
(odd) instruction will issue to the pipeline. The first, or even, instruction will be ignored.
The Instruction Analysis and Select Logic is located in the D-stage of the pipeline. During this stage, the
processor analyzes both instructions in a fetched pair and determines which pipeline can execute the
instructions. For example, if the first instruction in the pair, I0, is an ADD, and the second instruction, I1, is a
MAC, the processor will determine that I0 can be executed by either Pipe A or Pipe B while I1 can be
executed only by Pipe B. The Instruction Select Logic will then issue I0 to Pipe A and I1 to Pipe B, since only Pipe
B can execute the MAC instruction.
If both instructions of the fetched pair can only be issued to one pipeline (for example, a pair of MAC
instructions, which can only issue to Pipe B), the two instructions will be issued serially. The instruction fetch
will be stalled by one cycle until the second instruction has been issued to the pipeline.
If the result of the first instruction, I0, is used by the second instruction, I1, only one of the two instructions
will issue. The second instruction, I1, will issue in the next cycle, and the instruction fetch will be stalled for
one cycle until I1 has been issued.
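The issue rules above can be summarized in a small behavioural sketch. This is a simplified model, not the hardware logic: the function name, the use of sets to describe which pipes can execute an instruction, and the tie-breaking order are all assumptions made for illustration.

```python
# Hypothetical model of the dual-issue decision described above.
# Pipes are named "A" and "B"; each instruction is described by the set
# of pipes that can execute it (an illustrative encoding, not a Lexra one).

def issue_schedule(i0_pipes, i1_pipes, i1_depends_on_i0=False):
    """Return a list of issue cycles; each cycle maps a pipe to 'I0' or 'I1'."""
    if not i1_depends_on_i0:
        # Try to place both instructions in distinct pipes in the same cycle.
        for p0 in sorted(i0_pipes):
            for p1 in sorted(i1_pipes):
                if p0 != p1:
                    return [{p0: "I0", p1: "I1"}]
    # Otherwise issue serially: I0 first, I1 one cycle later
    # (the instruction fetch is stalled for that cycle).
    return [{min(i0_pipes): "I0"}, {min(i1_pipes): "I1"}]


# ADD (either pipe) paired with MAC (Pipe B only): dual issue.
print(issue_schedule({"A", "B"}, {"B"}))
# Two MACs: both need Pipe B, so they issue in consecutive cycles.
print(issue_schedule({"B"}, {"B"}))
```

A data dependency between I0 and I1 is modelled by the `i1_depends_on_i0` flag, which forces the serial path regardless of pipe availability.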
2.3.3. MIPS16
The MIPS16_N signal indicates whether or not MIPS16 code compression has been enabled. If so, each 32-
bit fetch is interpreted as a pair of 16-bit instructions encoded according to the MIPS16 Specification.
MIPS16 instructions are not dual-issued, but always issued to Pipe A. It is expected that MIPS16 code
compression is enabled for “outer loop” code where code density is more important than performance. The
critical Register File read addresses for MIPS16 are resolved during the D-stage so that register file access for
MIPS16 instructions, as for 32-bit MIPS instructions, can begin on the rising edge of the S-Stage clock.
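[Figure: Superscalar instruction issue, showing the I$ and IRAM outputs V_I (1 bit), I0_I (32 bits), and I1_I (32 bits), the Instruction Analysis Logic, and the Inst Issue State.]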
2.4.1. Overview
The Superscalar RALU Datapath is illustrated in the Figure. Operations are divided between Pipe A and Pipe
B in such a way that the RALU is the only major section of the processor which requires both Pipe A and B
instructions. Coprocessor 0, as well as the optional customer-defined Coprocessors 1-3, only require the Pipe
A instruction.
To a first approximation, the superscalar RALU is a “doubling” of the LX4180 RALU: it includes an 8-port
(4r/4w) general register file with 4-ports (2r/2w) assigned to Pipe A, and 4-ports (2r/2w) assigned to Pipe B.
In each Pipe, one write port is dedicated to register file updates from the Data Bus (Loads, MFCz, CFCz -
moves from Coprocessor). The remaining three ports (2r/1w) are available for the other operations assigned
to that Pipe. As a result, loads, including “twinword” loads of register pairs, can dual-issue with any MAC or
ALU instruction without register port access restriction.
Each Pipe has an ALU and a nearly-independent control section. Differences occur in the assignment of
operations to Pipe A and Pipe B, and in the pipeline features to support superscalar. The pipeline differences
in the RALU to support superscalar issue are:
• Data must be forwarded from Pipe A (Pipe B) to Pipe B (Pipe A) when the input to a Pipe
B (Pipe A) execution unit requires a result computed earlier in Pipe A (Pipe B). The
• If both Pipe A and Pipe B operations write the same register, the RALU control examines
the instruction order and suppresses the write for the earlier instruction based on program
order.
In addition, the RALU interfaces with the dual-MAC as a custom engine; this interface can supply two 32-bit
operands per cycle, and return a single 32-bit operand per cycle. In the case of the dual-MAC, each of the
32-bit operands can be interpreted as two independent 16-bit operands.
Table 2 lists the detailed assignment of instructions to Pipe A and Pipe B. Pipe B is called the “MAC Pipe”
because it uniquely supports multiply-accumulate, as well as multiply and divide operations. The Dual MAC
unit, which is attached to Pipe B as Custom Engine 0 (CE0), includes the accumulator registers (including HI
and LO) and therefore also supports the move to and move from operations which transfer data between these
registers and the general register file.
Pipe A is called the “Load/Store Pipe” because it uniquely supports the Load and Store operations. DSP
extensions to memory addressing are therefore also unique to Pipe A. These extensions include pointer post-
modification and circular buffer addressing. The Figure illustrates the circular buffer start registers (cbs0-
cbs2) and circular buffer end registers (cbe0-cbe2) located in ALU A.
The Coprocessor operations, and all “sequencing control instructions” (branches, jumps) are unique to Pipe
A. As a result, Pipe B instructions are not routed to Coprocessors.
The opcodes reserved for a customer defined Custom Engine 1 (CE1) are routed to Pipe B, since CE1 is
attached to Pipe B.
All ALU operations are available in both Pipe A and Pipe B. As a result, performance is improved,
particularly in computation-intensive programs, and the design is simplified because major sub-blocks in
ALU A and ALU B are replicated.
The Custom Engine Interface (CEI) is available for customer proprietary operations in Pipe B. This allows
the customer extensions to maintain high throughput since they can dual-issue with Load and Store
instructions which issue to Pipe A.
MIPS 32-bit General Instructions
    Pipe A: MIPS 32-bit General Instructions except: CE1 Custom Engine Opcodes, MULT(U), DIV(U), MFHI, MFLO, MTHI, MTLO, MAD(U), MSUB(U)
    Pipe B: MULT(U), DIV(U), MFHI, MFLO, MTHI, MTLO, MAD(U), MSUB(U), CE1 Custom Engine Opcodes, MIPS 32-bit ALU Instructions. Note: No Load or Store Instructions

MIPS16 Instructions (No Doubleword Instructions)
    Pipe A: All MIPS16 Instructions except: MULT(U), DIV(U), MFHI, MFLO
    Pipe B: MULT(U), DIV(U), MFHI, MFLO

New Lexra ALU Operations
    Pipe A: MIN, MIN2, MAX, MAX2, ABSR, ABS2, CLS, MUX2, BITREV, CMVEQZ, CMVNEZ
    Pipe B: MIN, MIN2, MAX, MAX2, ABSR, ABS2, CLS, MUX2, BITREV, CMVEQZ, CMVNEZ
The System Control Coprocessor (CP0) is responsible for instruction address sequencing and exception
processing.
For normal execution, the next instruction address has several potential sources: the increment of the previous
address, a branch address computed using a pc-relative offset, or a jump target address. For jump addresses,
the absolute target can be included in the instruction, or it can be the contents of a general-purpose register
transferred from the RALU.
Branches are assumed (or predicted) to be taken. In the event of prediction failure, two stall cycles are
incurred and the correct address is selected from a special “backup” register. Statistics from several large
programs suggest that these stalls will degrade average LX5280 throughput by several percent. However, the
net effect of the LX5280’s branch prediction on performance is positive because this technique eliminates
certain critical paths and therefore permits a higher speed system clock.
If an exception occurs, CP0 selects one of several hardwired vectors for the next instruction address. The
exception vector depends on the mode and specific trap which occurred. This is described further in
Section 3.4, Exception Processing.
The following registers, which are visible to the programming model, are located in CP0:
EPC, STATUS, CAUSE, and BADVADDR are described further in Section 3.4. PRID is a read-only
register that allows the customer’s software to identify the specific version of the LX5280 that has been
implemented in their product. The CCTL register is a Lexra defined CP0 register used to control the
instruction and data memories, as described in Section 7.2, Cache Control Register: CCTL.
The contents of the above registers can be transferred to and from the RALU’s general-purpose register file
using CP0 operations. (Unlike registers located in Coprocessors 1-3, they cannot be loaded or stored directly
to data memory.)
The Dual MAC data path is illustrated in Figure 4 on page 16. The major subsystems are:
• 32-bit x 32-bit multiply executes on a single Multiply-Accumulate data path with five
cycle latency. By using both data paths, a 32-bit x 32-bit multiply-accumulate can be
initiated every other cycle.
• Complex Multiply (16-bit Real, 16-bit Imaginary per-product) uses both Multiply-
Accumulate data paths with three cycle latency. A new Complex Multiply can be initiated
every two cycles.
Several new DSP features are controlled using the MMD (“MAC Mode”) register. MMD is a new Radiax
User register (24) which is accessed using Radiax User Move instructions MTRU and MFRU. If MMD is
updated between a MAC instruction and the MFA instruction that retrieves the result of that instruction, the
resulting operation is undefined.
The fields in MMD are as follows. Note that MMD is reset to all zeroes.

    Bits            31:5    4:3    2     1     0
    Field           0       RND    MF    MS    MT
    Width (Bits)    27      2      1     1     1

Field descriptions:

    MS
        0: saturate at 40 bits
        1: saturate at 32 bits
    RND
        01: Round-to-Nearest
            Round to nearest number; when the number to be rounded is midway
            between two numbers representable in the smaller format, round to the
            more positive number. (This rounding mode is common because it is
            easily implemented by always adding 0...0^10...0 to the number to be
            rounded. Digits to the right of “^” are dropped after rounding.)
        1x: reserved
        • Reset to 0
2.6.3. Architecture
The Multiply-Accumulate data paths can operate on 16-bit input data, either individually or in parallel. The
same Assembler mnemonic is used for individual or parallel operation. The output register specified
determines whether MAC0 or MAC1 or both, operate. For example,
Each Multiplier can initiate a new 16-bit x 16-bit product every cycle (single cycle throughput). Each 16-bit x
16-bit multiply-accumulate completes in three cycles. (Figure 4 illustrates the intermediate pipeline registers
TEMP0, TEMP1, Product 0, Product 1 to help the reader remember that the Multipliers require two cycles
but have single cycle throughput. TEMP0, TEMP1, Product 0, Product 1 are not accessible by the
programmer.) Thus, there are two delay slots for Multiplication or Multiply-Accumulate. For example,
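[Figure 4: Dual MAC data path. The 32-bit operands x and y feed Multiplier 1 and Multiplier 0 (each 16b x 16b) and the Divider (DIV, DIVU). The 32-bit pipeline registers TEMP1 and TEMP0 feed the 40-bit Accumulator 1 and Accumulator 0, which support the operations:
    mh ← p                       ml ← p
    mh ← [sat]mh + {p,ax}        ml ← [sat]ml + {p,ax}
    mh ← [sat]mh - {p,ax}        ml ← [sat]ml - {p,ax}
Each 40-bit accumulator output passes through a Scaler (>> [0-8]) that produces a 16- or 32-bit result.]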
The accumulator m1h can be referenced by MFA in Inst2 (Inst3); however, two (one) stall cycles will be
incurred. It is expected that the number of stall cycles in DSP algorithms will be minimal because, typically,
many products are accumulated before the accumulator must be stored. In a 64-tap FIR, for example, 64
terms are accumulated before the filter sample is updated in memory. Also, the four accumulator pairs allow
loops to be “unrolled” so that up to three additional independent MAC operations can be initiated before the
result of the first is available.
Compared to a typical RISC multiply-accumulate unit, the LX5280 MAC includes a number of features
critical to high-fidelity DSP arithmetic. These features are optionally selected by opcodes and/or mode bits in
the MMD register, and are compatible with conventional integer arithmetic, also supported by the LX5280:
• Fractional arithmetic,
• Saturation,
• Rounding,
• Output Scaling.
Accumulation is performed at 40-bit precision, using eight guard bits for overflow protection. The alternative
is to require the programmer to right-shift (scale) products prior to accumulation, which complicates
programming and causes loss of precision. Prior to accumulation, the product is sign-extended to 40-bits.
With guard bits, typically the only loss of precision will occur at the end of a lengthy calculation when the 40-
bit result must be stored to the general register file or to memory in 32-bit or 16-bit format.
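The benefit of the guard bits can be checked with a short numerical sketch. The helper names below are hypothetical, and the model simply treats the accumulator as a 40-bit two's-complement value; it is not a description of the actual data path.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def accumulate40(acc, product):
    """Add a sign-extended 32-bit product to a 40-bit accumulator (wraparound)."""
    return sign_extend(acc + sign_extend(product, 32), 40)


# Accumulate 200 worst-case 16x16 products: (-32768) * (-32768) = 2**30 each.
acc = 0
for _ in range(200):
    acc = accumulate40(acc, (-32768) * (-32768))

print(acc > 2**31 - 1)   # the sum no longer fits in 32 bits
print(acc == 200 * 2**30)  # but is still exact in the 40-bit accumulator
```

Without guard bits, the same sum would have overflowed after the second product unless the products were scaled down first.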
Fractional arithmetic is implemented by the program’s interpretation of the 16-, 32- or 40-bit quantities and
is controlled by a bit in the MMD register. When fractional mode is selected, the Dual MAC shifts the results
of any Radiax multiply operation left by one bit to maintain the alignment of the implied radix point.
Furthermore, since -1 can be represented in fractional format but +1 cannot be represented in fractional
mode, the Dual MAC detects when both operands of a multiply are equal to -1. If so, it generates the
approximate product consisting of 0 for the sign bit (representing a positive result) and all ones for the
remaining bits. This is true for both 16x16 bit and 32x32 bit Radiax multiplications. The least significant bit
of a product is always zero in fractional mode (due to the left shift).
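A minimal sketch of the 16x16 fractional multiply described above, assuming the operands are Q15 values and the result is a Q31 value. The function names are illustrative and not part of the LX5280 instruction set.

```python
def sign16(value):
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def frac_mul16(x, y):
    """Fractional 16x16 multiply: Q15 x Q15 -> Q31, returned as a signed integer."""
    x, y = sign16(x), sign16(y)
    if x == -0x8000 and y == -0x8000:
        # -1 * -1 = +1 is not representable: sign bit 0, all other bits 1.
        return 0x7FFFFFFF
    # The left shift by one keeps the implied radix point aligned,
    # so the least significant bit of the result is always zero.
    return (x * y) << 1


print(hex(frac_mul16(0x4000, 0x4000)))  # 0.5 * 0.5 = 0.25 -> 0x20000000
print(hex(frac_mul16(0x8000, 0x8000)))  # -1 * -1 -> 0x7fffffff
```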
The Accumulation Units can add the product to, or subtract it from, one of the four accumulator registers.
This operation can be performed with optional saturation; that is, if a result overflows (underflows), the
accumulator is updated with the largest positive (most negative) number rather than the “wraparound”
result with incorrect sign. The LX5280 instructions include a Multiply-add and Multiply-sub, each with and
without saturation. There are also instructions for adding or subtracting any pair of 40-bit accumulator
registers together, with and without saturation. A bit in the MMD register determines whether the saturation
is performed on the full 40 bits or whether saturation is performed at 32 bits. The latter capability is useful for
emulating the results of other architectures that do not have guard bits. In 32-bit saturation mode, a full 40-bit
compare is used to determine if the result is greater (less) than the maximum (minimum) value which can be
stored in a 32-bit quantity; this provides the most robust solution.
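The two saturation modes can be illustrated as follows. This is a sketch under the assumption that saturation clamps the exact sum to the signed range of the selected width; the `sat32` parameter is a hypothetical stand-in for the MMD saturation-mode bit.

```python
def saturate(value, bits):
    """Clamp a signed integer to the range of a `bits`-wide two's-complement value."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def sat_accumulate(acc, product, sat32=False):
    """Saturating accumulate; sat32 selects 32-bit rather than 40-bit limits."""
    total = acc + product  # exact sum, compared against the selected limits
    return saturate(total, 32 if sat32 else 40)


# With 40-bit saturation the guard bits absorb the overflow of a 32-bit value.
print(sat_accumulate(2**31 - 1, 1))              # 2147483648
# With 32-bit saturation the result clamps at the 32-bit maximum.
print(sat_accumulate(2**31 - 1, 1, sat32=True))  # 2147483647
```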
In the case that the instruction requires multiplication, but no accumulation, the product is passed through the
accumulation unit unchanged. (Thus, both 16-bit multiplication and multiply-accumulate require three MAC
cycles.)
A Round instruction can also be executed on one (or a pair) of the accumulator registers to reduce precision
prior to storage. The rounding mode is selectable in the MMD register.
The output Scaler is used to right-shift (scale) the accumulator register when it is transferred to the general
register file.
The Dual MAC is also used to execute the 32-bit MULT(U) and DIV(U) instructions specified in the MIPS
ISA. In the case of MULT(U), one of the 16-bit Multiply-Accumulate data paths works iteratively to produce
the 64-bit product in five cycles. (The least significant 32 bits are available one cycle earlier than the most
significant 32 bits.)
Note: The MMD mode bits have no effect on the operation of the standard MIPS ISA instructions. By
contrast, the LX5280 MULTA instruction is subject to the MMD mode bits for fractional arithmetic and
truncated 32x32 multiplication.
32-bit x 32-bit Multiply-Accumulate instructions (MADDA, MSUBA) are implemented using one of the 16-
bit Multiply-Accumulate data paths, and the Add-Round unit. It provides a 64-bit Multiply result which is
sign-extended and accumulated at 72-bits. The result is available in six cycles. (The least significant 32 bits
are available one cycle earlier than the most significant 40 bits.) The MAD(U) and MSUB(U) instructions of
the MIPS32 ISA are also supported.
For the LX5280 MULTA, an accumulator pair M0h[39:0]/M0l[31:0], M1h[39:0]/M1l[31:0] etc. is the target.
M0h[39:0] is aliased to HI; M0l[31:0] is aliased to LO. The most significant 8-bits of the 40-bit HI
accumulator are used as the guard bits, while the LO accumulator is simply zero-extended to 40 bits. Unlike
the (dual) 16-bit operations, single-cycle throughput is not available for 32-bit data. However since there are
two available data paths, two 32-bit x 32-bit multiply operations can be initiated every four cycles. The Dual
MAC hardware automatically allocates the second operation to the available data path. If a third 32-bit
multiplication is programmed too soon, stall cycles are inserted until one of the data paths is free.
The Dual MAC also supports a complex multiply instruction, CMULTA. For this instruction, each of the 32-
bit general register operands is considered to represent a 16-bit real part (in bits 31:16) and a 16-bit imaginary
part (in bits 15:00). One of the multiply-accumulate engines calculates the real part (33 bits) of the complex
product (namely XrYr - XiYi) and stores it in the “h” half of the target accumulator pair. The other MAC
engine calculates the imaginary part (32 bits) of the complex product (namely XrYi + XiYr) and stores it in
the “l” half of the target accumulator pair. This instruction can be initiated every two cycles (2-cycle
throughput) and takes four cycles to complete. As in the other Dual MAC operations, programming
CMULTA instructions too close together causes stall cycles but the correct results are always obtained.
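The arithmetic performed by the complex multiply can be sketched as below. The function name and the returned tuple are assumptions for illustration; the sketch covers only the integer products, not fractional mode, accumulation, or timing.

```python
def sign16(value):
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def complex_multiply(x, y):
    """Complex product of two packed operands (real in bits 31:16, imaginary in 15:0).

    Returns (real, imag): the values destined for the "h" and "l" halves
    of the target accumulator pair.
    """
    xr, xi = sign16(x >> 16), sign16(x)
    yr, yi = sign16(y >> 16), sign16(y)
    real = xr * yr - xi * yi
    imag = xr * yi + xi * yr
    return real, imag


# (3 + 4j) * (1 + 2j) = -5 + 10j
print(complex_multiply(0x00030004, 0x00010002))
```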
The Dual MAC includes a separate Divide Unit for executing the 32-bit DIV(U) operations specified by the
MIPS ISA. The Divide requires 19 cycles to complete. The quotient is loaded into M0l[31:0], M1l[31:0],
M2l[31:0] or M3l[31:0] and the remainder is loaded into the lower 32-bits of the other accumulator in the
target pair. There is no special support for fractional arithmetic for the divide operations.
Since the Dual MAC is capable of consuming four 16-bit operands every cycle (in Pipe B) by performing two
16x16 multiply-accumulates, it is desirable to be able to fetch four 16-bit operands from memory every cycle
(in Pipe A). Therefore, the LX5280 extends the MIPS load and store instructions to include twinword
accesses and implements a 64-bit data path from memory. A twinword memory operation accesses an (even-
odd) pair of 32-bit general registers with a single instruction and executes in a single pipeline cycle. The
nomenclature “twinword” is used to distinguish these operations from “doubleword” operations which (in
other extensions to the MIPS ISA) access a single 64-bit general register.
Like the standard byte, halfword, and word load/store instructions, the twinword load/store instructions use a
register and an immediate field to specify the memory address. However, in order to obtain the maximum
range from the LEXOP instruction format, the available signed 11-bit immediate field (called the
displacement) is considered a twinword quantity, so is left-shifted by 3 bits before being added to the base
register. This is equivalent to a 14-bit byte offset, in comparison to the full 16-bit immediate byte offset used
in the byte, halfword and word instructions. Also, the target register pair for the twinword load/store must be
an even-odd pair, so that only 4 bits are used to specify it.
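The twinword address calculation can be expressed as a short sketch. The function names are hypothetical; the code only models the arithmetic stated above (sign-extend the 11-bit displacement, shift left by 3, add to the base).

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def twinword_address(base, disp11):
    """Effective byte address for a twinword load/store.

    base:   32-bit base register value
    disp11: raw 11-bit displacement field, counted in twinwords
    """
    offset = sign_extend(disp11, 11) << 3  # twinword units -> byte offset
    return (base + offset) & 0xFFFFFFFF


print(hex(twinword_address(0x1000, 1)))      # one twinword forward
print(hex(twinword_address(0x1000, 0x7FF)))  # displacement -1: one twinword back
```

The reachable byte offsets run from -8192 to +8184 in steps of 8, which is the 14-bit byte range mentioned above.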
DSP algorithms usually operate on vectors or matrices of data; for example Discrete Cosine Transforms
operate on 8x8 pixel blocks. As a result data memory pointers are incremented from one operand to the next.
The extra instruction cycle required to increment RISC memory pointers is eliminated in DSPs with auto-
increment. This capability is provided in the LX5280. Memory pointers are used unmodified to create the
address, then updated in the general register file before the next use:
address ← pointer
pointer ← pointer + stride
In the LX5280 the 8-bit immediate field containing the stride is sign-extended to 32 bits before being added
to the pointer for the latter’s update. The nomenclature “pointer” is used to distinguish the update performed
after memory addressing from the standard instructions, in which the “base” register (in the MIPS ISA) is
augmented by the “offset” before addressing memory. The nomenclature “stride”,
which is dependent on the granularity of the access, is used to distinguish it from the invariant byte offset
used in the standard load and store instructions. For twinword/word/halfword addressing the 8-bit field is first
left-shifted by three/two/one places and zero-filled, before sign extension to 32 bits. This use of left shifts for
the twinword, word, and halfword strides is similar to MIPS16 and is used to extend the
effective address range. Thus, increments of between -128 and +127 twinwords (see footnote 1), words, halfwords or bytes
are available for each data type.
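Pointer post-modification can be modelled in a few lines. The names below, including the size keywords, are illustrative assumptions; sign-extending first and then shifting gives the same value as the shift-then-extend order described above.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


# Left-shift amount applied to the stride field for each access size.
STRIDE_SHIFT = {"byte": 0, "halfword": 1, "word": 2, "twinword": 3}


def post_modify(pointer, stride8, size):
    """Return (address, updated_pointer) for a post-modified access.

    The unmodified pointer is used as the address; the pointer is then
    advanced by the scaled, sign-extended 8-bit stride field.
    """
    stride = sign_extend(stride8, 8) << STRIDE_SHIFT[size]
    address = pointer
    return address, (pointer + stride) & 0xFFFFFFFF


print(post_modify(0x2000, 1, "twinword"))  # address 0x2000, pointer advances by 8
print(post_modify(0x2000, 0xFF, "word"))   # stride -1 word: pointer moves back by 4
```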
In the case of Loads (but not Stores) pointer update requires a second general register file write port. The
LX5280 includes an 8p(4r/4w) register file with two of the four write ports dedicated to register Loads. As a
result, twinword loads can execute in parallel with any Pipe B operation.
For some DSP algorithms - notably Filters - DSP data is organized into “circular buffers”. In this case, at the
end of the buffer the next reference is to the beginning of the buffer. Implementing this structure in RISC
requires:
Note that the above example is written so that a branch prediction failure will only be incurred at the end of
the buffer. Nevertheless, the combination of post-modified pointers together with hardware support for
circular buffers in the LX5280 allows this typical DSP addressing operation to be reduced from four cycles to
one.
1. Twinwords are supported only on the LX5280, and not the LX5180
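[Figure: Circular buffer hardware in ALU A, showing TEMP, DBUS_M, the A register (A REG), the A + BI adder, DADDR_E, the select MUX, the compare logic, and the circular buffer start (cbs0-cbs2) and end (cbe0-cbe2) registers enabled by CB[2:0].]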
The LX5280 supports three circular buffers. To initialize the circular buffers, the MTRU instructions are used
to set the twinword start addresses CBS0-CBS2[31:3] and twinword end addresses CBE0-CBE2[31:3].
Circular buffers are only used when memory pointers are post-modified, and consist of an integral number of
twinwords.
When a circular buffer pointer is used in a post-modified address calculation, the pointer is compared to the
associated CBE address; if they match (and the stride is non-negative), the CBS address (rather than the post-
modified address) is restored to the register file. Similarly, to allow for traversing the circular buffer in the
reverse direction, the pointer is compared to the CBS address; if they match (and the stride is negative) the
CBE address (rather than the post-modified address) is restored to the register file.
It is worth noting that circular buffers can also be accessed with byte, halfword, or word Load/Store with
Pointer Increment instructions. In those cases, the several least significant bits of the pointer register are
examined to determine if the start or end of the buffer has been reached, taking into account the granularity of
the access, before replacing the pointer with the CBS or CBE as appropriate.
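The wraparound rule can be sketched for the twinword case. This is a simplified model with hypothetical names: it compares only the twinword address bits [31:3] and does not model the extra low-order bit checks used for byte, halfword, or word accesses.

```python
def circular_post_modify(pointer, stride_bytes, cbs, cbe):
    """Twinword access through a circular buffer; returns (address, updated_pointer).

    cbs / cbe: byte addresses of the first and last twinword of the buffer.
    """
    address = pointer
    if stride_bytes >= 0 and (pointer >> 3) == (cbe >> 3):
        new_pointer = cbs            # forward traversal wraps to the start
    elif stride_bytes < 0 and (pointer >> 3) == (cbs >> 3):
        new_pointer = cbe            # reverse traversal wraps to the end
    else:
        new_pointer = (pointer + stride_bytes) & 0xFFFFFFFF
    return address, new_pointer


# Walk a four-twinword buffer (0x100 to 0x118) forward, five accesses.
ptr = 0x100
for _ in range(5):
    addr, ptr = circular_post_modify(ptr, 8, cbs=0x100, cbe=0x118)
    print(hex(addr))
```

The fifth access lands back on the start address, with no compare or branch instructions in the loop body.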
Any general register memory pointer can be used with circular buffers using the “.Cn” option. To use general
register rP as a circular buffer pointer. For example, the instruction
associates the r4 memory pointer with circular buffer C2 which is defined by the start address CBS2 and end
address CBE2.
The LX5280 introduces extensions to the MIPS instructions to support dual 16-bit operations. The LX5280
also introduces a number of new ALU instructions which improve performance on DSP algorithms. These
instructions will also be described in this section.
To support high-performance dual 16-bit operations in the RISC-DSP it is necessary to support not only Dual
MAC instructions but also dual 16-bit versions of other arithmetic operations that the programmer may
require. To maintain a simple, orthogonal instruction set, the following criteria were used to determine the
MIPS ALU extensions:
• Dual 16-bit versions of all MIPS ALU operations without immediate data,
• Optional saturation for every ALU instruction (without immediate data) that can produce
signed overflow or underflow.
It is expected that the above organizing principles will simplify the LX5280 ISA for both programmers and
tool developers. Obviously, dual 16-bit versions of logical operations such as AND are not required.
However, dual 16-bit versions have been provided for all 3-register operand shifts and add/subtracts included
in the MIPS R-Format. The character “2” in the assembler mnemonic indicates an operation on dual 16-bit
data.
DSP algorithms are often somewhat tolerant of data errors. For example, a bad audio sample may cause a
brief distortion, but no lasting effect as new audio samples arrive and the bad sample is cleared out of the
buffer. Accordingly, the saturated result of signed arithmetic is a closer, more desirable, approximation than
the wraparound result. Therefore, all LX5280 arithmetic operations which may, potentially, produce
arithmetic overflow or underflow, and do not have immediate operands, support optional saturation. For
example, not only the dual 16-bit add (ADDR2) but also the 32-bit add (ADDR) has optional saturation in the
LX5280. Saturation options are not provided for MIPS I-Format 32-bit instructions, for example ADDIU.
However, in this case the programmer selects the immediate operand and, as a result, saturation is less likely,
or at least more predictable.
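The difference between a saturating 32-bit add and a saturating dual 16-bit add can be sketched as follows. The function names are hypothetical and the code models only the arithmetic, assuming the two 16-bit halves of each operand are added independently.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sat_add(a, b, bits):
    """Saturating signed add of two `bits`-wide values; returns the raw bit pattern."""
    total = to_signed(a, bits) + to_signed(b, bits)
    hi, lo = (1 << (bits - 1)) - 1, -(1 << (bits - 1))
    return max(lo, min(hi, total)) & ((1 << bits) - 1)


def dual_sat_add16(a, b):
    """Saturating add applied independently to the upper and lower 16-bit halves."""
    upper = sat_add(a >> 16, b >> 16, 16)
    lower = sat_add(a, b, 16)
    return (upper << 16) | lower


print(hex(sat_add(0x7FFFFFFF, 1, 32)))              # clamps at 0x7fffffff
print(hex(dual_sat_add16(0x7FFF8000, 0x0001FFFF)))  # each half clamps: 0x7fff8000
```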
Neither the dual 16-bit instructions nor the new 32-bit saturating adds and subtracts cause exceptions.
The LX5280 adds several new ALU instructions which have proven useful in DSP performance analysis.
Consistent with the approach described above, each new instruction has both a 32-bit and a dual 16-bit
version. If signed overflow/underflow is possible, a saturation option is provided.
The LX5280 includes new instructions (MOVZ and MOVN) to support conditional operations. These
instructions are described in this section.
A number of DSPs and RISC processors have deployed extensive “conditional execution.” In these
processors the branch prediction penalty is three cycles or more. Conditional execution can mitigate the effect
of the branch prediction penalty by allowing the branch to be avoided in some cases. However, conditional
execution is a costly alternative: it uses instruction opcode bits and consequently limits the size of immediate
data and/or limits the number of general purpose registers visible to the program. The LX5280 branch
prediction penalty is only two cycles; therefore the need for conditional execution is minimized and only a
restricted set of “conditional move” instructions is needed. It is notable, however, that the effect of any
conditional execution can be “emulated” in the LX5280 with a sequence of two instructions by using the
conditional move. For example:
LX5280:
If rB satisfies the COND, rD is updated with rA; i.e. the 2nd ALU operation is executed to “completion”.
Note that this sequence is interruptible.
Another use of the conditional move instructions is to code “if-then-else” constructs as follows:
if (rB COND)
rD = rA
else
rD = rC
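The if-then-else construct above can be modelled with an unconditional move followed by one conditional move. The sketch assumes a "move if zero" operation that copies rA to rD only when rB is zero; the register dictionary and function names are illustrative and do not reproduce the Lexra assembler syntax.

```python
def move_if_zero(regs, rd, ra, rb):
    """Conditional move: rd receives ra only when rb is zero (assumed semantics)."""
    if regs[rb] == 0:
        regs[rd] = regs[ra]


def select_if_zero(regs, rd, ra, rc, rb):
    """if (rb == 0) rd = ra else rd = rc, without a branch."""
    regs[rd] = regs[rc]               # unconditional move of the "else" value
    move_if_zero(regs, rd, ra, rb)    # overwritten only if the condition holds


regs = {"rA": 7, "rB": 0, "rC": 9, "rD": 0}
select_if_zero(regs, "rD", "rA", "rC", "rB")
print(regs["rD"])  # rB is zero, so rD receives rA
```

In this sketch rD must be a different register from rB, since the first move would otherwise overwrite the condition before it is tested.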
One reason Lexra has provided conditional move is to facilitate initial porting of Assembler code from
processors with conditional execution to the LX5280.
Because DSP algorithms spend much of their time in short real-time critical code loops, DSPs often include
hardware support for “zero-overhead looping.” The goal of zero-overhead looping is that branching from the
end-to-beginning of the loop can be accomplished without explicit program overhead if the loop is to be
executed a fixed number of times, known at compile time.
The LX5280 supplies such a facility but allows the loop count to be determined at run time as well. The
facility consists of three new Radiax registers, which are accessible by a program running in User mode using
the Radiax instructions MFRU and MTRU. The operating system should consider these registers as part of
the context of the executing process and must save and restore them in the case of an interrupt.
LPS0[28:2] —low order bits of the virtual address of the “starting” instruction of the loop.
Although the facility is intended for use in loops, the algorithm executed by the hardware can be described
more simply. In particular it should be noted that there is no “knowledge” of being “inside” the loop. All that
matters is the contents of the three registers when an attempt is made to execute the instruction at the address
specified by LPE0:
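The following Python sketch is one reading of that behavior, inferred from the restrictions and notes in this section rather than copied from a hardware description. It assumes that when the instruction at LPE executes with LPC non-zero, LPC is decremented and execution continues at LPS, and that otherwise execution falls through. Addresses are modeled as word indices and the function names are hypothetical.

```python
def next_pc(pc, lps, lpe, lpc):
    """Return (next_pc, new_lpc) after executing the instruction at pc.

    Assumed behavior: a match on LPE with a non-zero LPC acts as a
    pseudo-branch to LPS and decrements LPC; otherwise fall through.
    """
    if pc == lpe and lpc != 0:
        return lps, lpc - 1
    return pc + 1, lpc


def run_loop(lps, lpe, lpc, end):
    """Trace the executed addresses (word indices) from lps until 'end'."""
    pc, trace = lps, []
    while pc != end:
        trace.append(pc)
        pc, lpc = next_pc(pc, lps, lpe, lpc)
    return trace
```

Under this assumption, a loop body entered at LPS with LPC = n runs n + 1 times before falling through past LPE.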
The following restrictions apply to the usage of the Zero Overhead Loop Facility:
• LPS may not be exactly equal to LPE if LPC is non-zero. Therefore, the loop must contain
at least two instructions. Otherwise, operation is undefined.
• LPE may not be in the delay slot of a branch, nor may it be a branch or jump instruction
itself if LPC is non-zero. Otherwise, operation is undefined.
• For correct operation, the order of loading the registers must be: first LPS, then LPE, then
LPC with a non-zero value.
• For correct operation, there must be at least two (2) instructions between the instruction
which loads LPC with a non-zero value, and the instruction at the LPE address. To
guarantee that no stall cycles are incurred, there must be at least three (3) cycles between
the instruction which loads LPC with a non-zero value, and the instruction at the LPE
address.1
• If the instruction at LPE is a load type instruction, then the immediately executed
instruction at LPS is considered to be in the load delay slot and cannot rely on seeing the
result of the load.
The following items are not restrictions that apply to the usage of the Zero Overhead Loop Facility but are
features to be aware of:
• The loop count LPC may be reloaded multiple times after LPS and LPE are loaded.
Typically this would be done in an outer loop.
• The instruction at LPE may be the target of a jump or branch, including a change in mode
from 16-bit to 32-bit ISA.
• Any of the instructions before or at LPE may be subject to exceptions or interrupts and
processing will conform to the normal exception handling rules. Note that the BD bit will
always be off since LPE must not be in the delay slot of a branch. The return from the
exception handler to LPE will also be handled normally, since it is just a special case of
LPE being the target of a jump.
• If the instruction at LPE causes a Reserved Instruction Trap, it is necessary for the
Exception Handler to decrement LPC prior to return, after emulating the instruction at
LPE and before returning to the instruction at LPS. Similar restrictions apply if the
instruction at LPE is not to be re-executed for any other reason, such as BREAK or
SYSCALL execution.
1. The following discussion is only relevant if LPC will be updated in an instruction that is “close” to LPE. That case can have a
performance impact although correct operation will still be obtained. The programming guideline is: keep the LPC update in an outer
loop as far as possible from the (end of) the inner loop.
The updates of LPS, LPE, and LPC use the MTRU instruction. Therefore the new LPS, LPE, and LPC values are only known after
the E-stage of the pipeline. But in order to perform the pseudo-branch they must be used in the I-stage of the pipeline. Because of
the restriction on the order of setting these registers, the hardware introduces a minimum number of stalls after setting LPS and LPE
to test for an LPE match against the current instruction address. However, if the LPC update is still in the pipeline when the LPE
match is detected, the hardware stalls to check and update the new value of LPC. To avoid these stalls, LPC should not be updated
within 3 cycles (which could be as many as 6 instruction issue slots) of an expected LPE-matching instruction. As noted above, for
correct operation there must be at least 2 instructions between the LPC update and any expected LPE-matching instruction.
The LX5280 includes eight new low-overhead hardware interrupt signals. These signals are compatible with
the R3000 Exception Processing model and are useful for real-time applications.
These interrupts are supported with three new Lexra CP0 registers, ESTATUS, ECAUSE, and INTVEC,
accessed with the new MTLXC0 and MFLXC0 variants of the MTC0 and MFC0 instructions. As with any
COP0 instruction, a Coprocessor Unusable Exception is taken if these instructions are executed while in User
Mode and the Cu0 bit is 0 in the CP0 STATUS register.
The three new Lexra CP0 registers are ESTATUS (0), ECAUSE (1), and INTVEC (2), and are defined as
follows:
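ESTATUS (Lexra CP0 register 0)
Bits     31-24    23-16       15-0
Field    0        IM[15:8]    0

ECAUSE (Lexra CP0 register 1)
Bits     31-24    23-16       15-0
Field    0        IP[15:8]    0

INTVEC (Lexra CP0 register 2)
Bits     31-6     5-0
Field    BASE     0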
ESTATUS contains the new interrupt mask bits IM[15:8], which are reset to 0 so that none of the new
interrupts will be activated, regardless of the global interrupt signal IEc. IP[15:8] for the new interrupt signals
is located in ECAUSE and is read-only. These fields are similar to the IM and IP fields defined in the R3000
Exception Processing Model, except that the new interrupts are prioritized in hardware, and each have a
dedicated exception vector.
IP[15] has the highest priority and IP[8] has the lowest; all of the new interrupts have higher
priority than IP[7:0]. The processor concatenates the program-defined BASE address for the exception
vectors with the interrupt number to form the interrupt vector, as shown in the table below. Two instructions
can be executed in each vector; typically these will consist of a jump instruction and its delay slot, with the
target of the jump being either a shared interrupt handler or one that is unique to that particular interrupt.
Interrupt Number    Exception Vector
15                  { BASE, 6'b111000 }
14                  { BASE, 6'b110000 }
13                  { BASE, 6'b101000 }
12                  { BASE, 6'b100000 }
11                  { BASE, 6'b011000 }
10                  { BASE, 6'b010000 }
9                   { BASE, 6'b001000 }
8                   { BASE, 6'b000000 }
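The vector table can be expressed compactly: the low six bits of the vector are the interrupt number minus 8, shifted left by 3, which leaves room for two 32-bit instructions per vector. The Python sketch below models this, together with the hardware prioritization of IP[15:8] gated by IM[15:8]. The function names are illustrative, and the global interrupt enable is not modeled.

```python
def interrupt_vector(intvec, n):
    """Vector address for vectored interrupt n (8..15).

    intvec: 32-bit INTVEC value; BASE is bits 31:6 and bits 5:0 read as 0.
    The low six bits are (n - 8) << 3, i.e. 8 bytes per vector.
    """
    if not 8 <= n <= 15:
        raise ValueError("vectored interrupts are numbered 8 to 15")
    base = intvec & 0xFFFFFFC0
    return base | ((n - 8) << 3)


def highest_pending(estatus, ecause):
    """Highest-priority interrupt that is both pending and unmasked, or None."""
    im = (estatus >> 16) & 0xFF   # IM[15:8] lives in bits 23:16 of ESTATUS
    ip = (ecause >> 16) & 0xFF    # IP[15:8] lives in bits 23:16 of ECAUSE
    active = im & ip
    for n in range(15, 7, -1):    # IP[15] is the highest priority
        if active & (1 << (n - 8)):
            return n
    return None
```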
When a vectored interrupt causes an exception, all of the standard actions for an exception occur. These
include updating the EPC register and certain subfields of the standard STATUS and CAUSE registers. In
particular, the Exception Code of the CAUSE register indicates “Interrupt”, and the “current” and “previous”
mode bits of the STATUS register are updated in the usual manner.
This section describes the LX5280 Programming Model. Section 3.1, Summary of MIPS-I Instructions,
contains a list summarizing all MIPS-I operations supported by the LX5280. These opcodes may be extended
by the customer using Lexra’s Custom Engine Interface (CEI). This capability is described in Section 3.2,
Opcode Extension Using the Custom Engine Interface (CEI).
Section 3.3, Memory Management, describes the Simplified Memory Management Unit (SMMU) which is
physically incorporated in the LX5280 LMI. The SMMU provides sufficient memory management
capabilities for most embedded applications while ensuring execution of third-party MIPS software
development tools.
The LX5280 supports the MIPS R3000 Exception Processing model, as described in Section 3.4, Exception
Processing.
The LX5280 supports all MIPS-I Coprocessor operations. The customer can include one to three application-
specific Coprocessors. Lexra provides a functional block called the Coprocessor Interface (CI) which allows
the customer a simplified connection between their Coprocessor and the internal signals of the LX5280. The
CI is described in Section 3.5, The Coprocessor Interface (CI).
The LX5280 executes MIPS-I instructions as detailed in the tables below. To summarize, the LX5280
executes MIPS-I instructions with the following exclusions: the unaligned loads and stores (LWL, SWL,
LWR, SWR) are not supported because they add significant silicon area for little benefit in most applications.
Instruction               Description
BLTZAL rA, destination    Similar to the BLTZ and BGEZ except that the address of the
BGEZAL rA, destination    instruction following the delay slot is saved in r31 (regardless
                          of whether the branch is taken).
JAL target                Same as above except that the address of the instruction
                          following the delay slot is saved in r31.
JR rA                     pc <- (rA)
                          The instruction following JR (delay slot) is always executed.
JALR rA, rD               Same as above except that the address of the instruction
                          following the delay slot is saved in rD.
Customers may add proprietary or application-specific opcodes to their LX5280-based products using the
Custom Engine Interface (CEI). The new instructions take one of the forms illustrated below and
use reserved opcodes.
Lexra permits customer operations to be added using the four (4) I-Format opcodes and six (6) R-Format
opcodes listed in the table above. Other opcode extensions in future Lexra products will not utilize the
opcodes reserved above.
When the CEI decodes NEWOPI or NEWOPR, it must signal the Core that a custom operation has been
executed so that the Reserved Instruction trap will not be taken. Multi-cycle custom operations may be
executed by asserting CESEL.
Note: The custom operation may choose to ignore the SRC1 and SRC2 operands supplied by the CEI and
reference customer registers instead. Results can also be written to an implicit customer register; however,
unless D = 0 is coded, a register in the Core will also be written.
The LX5280 includes a Simplified Memory Management Unit (SMMU) for the instruction memory address
and the data memory address. These units are physically located in the Local Memory Interface (LMI)
modules. The hardwired virtual-to-physical address mapping performed by the SMMU is sufficient to ensure
execution of third-party software development tools.
The LX5280 implements the MIPS R3000 exception processing model as described below. Features specific
to on-chip TLB support are not included. In the discussion below, the term exception refers to both traps,
which are non-maskable program synchronous events, and interrupts, which result from unmasked
asynchronous events.
The list below is numbered from highest to lowest priority. ExcCode is stored in CAUSE when an exception
is taken. Note that Sys, Bp, RI, CpU can share the same priority level because only one can occur in a
particular time slot.
BEV Bootstrap Exception Vector. Selects between two trap vectors (see below).
IM Interrupt masks for the six hardware interrupts and two software interrupts.
KU/IE KU = 0(1) indicates kernel (user) mode. In the LX5280, user mode virtual addresses must have
msb = 0. In kernel mode, the full address space is addressable. IE = 1(0) indicates that
interrupts are enabled (disabled).
The KUo, IEo, KUp, IEp, KUc and IEc fields form a three-level hardware stack of KU/IE
signals. The current values are KUc/IEc, the previous values are KUp/IEp, and the old values
(those before previous) are KUo/IEo. (See Section 3.4.2.)
STATUS is read or written using MFC0 and MTC0 operations. On reset, BEV = 1, KUc = IEc = 0. The other
bits in STATUS are undefined. The 0 fields are ignored on write and are 0 on read. It is recommended that the
user explicitly write them to 0 to ensure compatibility with future versions of the LX5280.
BD Branch Delay. Indicates that the exception was taken in a branch or jump delay slot.
CE Coprocessor Exception. In the case of a Coprocessor Usability exception, indicates the number
of the responsible Coprocessor.
IP Interrupt Pending. Each bit in IP(7:0) indicates an associated unmasked interrupt request.
ExcCode The ExcCode listed above for the different exceptions is stored here when an exception
occurs.
CAUSE is read or written using MFC0 and MTC0 operations. The only program writable bits in CAUSE are
IP(1:0), which are called software interrupts. CAUSE is undefined at reset. The 0 fields are ignored on write
and are 0 on read.
EPC is a 32-bit read-only register which contains the virtual address of the next instruction to be executed
following return from the exception handler. If the exception occurs in the delay slot of a branch, EPC will
hold the address of the branch instruction and BD will be set in CAUSE. The branch will typically be re-
executed following the exception handler.
BADVADDR is a 32-bit read-only register containing the virtual address (instruction or data) which
When an exception occurs, the instruction address changes to one of the following locations:
RESET 0xbfc0_0000
which disables interrupts and puts the program in kernel mode. The code (ExcCode) for the exception source
is loaded into CAUSE so that the application-specific exception handler can determine the appropriate action.
The exception handler should not re-enable Interrupts until necessary context has been saved.
To return from the exception, the exception handler first moves EPC to a general register using MFC0,
followed by a JR operation. RFE only pops the KU/IE stack:
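The push performed on exception entry and the pop performed by RFE can be modeled as below. This is a sketch that assumes the conventional R3000 layout, in which STATUS[5:0] holds KUo, IEo, KUp, IEp, KUc, IEc from bit 5 down to bit 0. The function names are illustrative.

```python
KUIE_MASK = 0x3F  # STATUS[5:0] = KUo IEo KUp IEp KUc IEc (assumed R3000 layout)


def enter_exception(status):
    """Push the KU/IE stack: old <- previous <- current, current <- 0 0."""
    stack = status & KUIE_MASK
    pushed = (stack << 2) & KUIE_MASK  # KUc = IEc = 0: kernel mode, interrupts off
    return (status & ~KUIE_MASK) | pushed


def rfe(status):
    """Pop the KU/IE stack: current <- previous <- old; old is left unchanged."""
    stack = status & KUIE_MASK
    popped = (stack & 0x30) | (stack >> 2)
    return (status & ~KUIE_MASK) | popped
```

For example, a program running in user mode with interrupts enabled (KUc = IEc = 1) enters the handler with KUc = IEc = 0, and RFE restores KUc = IEc = 1.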
(This example assumes that KU/IE were not modified by the exception handler). Therefore, a typical
sequence of operations to return from the exception handler would be:
Designers may implement up to three Coprocessors to interface with the LX5280. The contents of these
Coprocessors may include up to thirty-two (32) 32-bit general registers and up to thirty-two (32) 32-bit
control registers. The general registers may be moved to and from the RALU’s registers using MTCz, MFCz
operations, or be loaded and stored from data memory using LWCz, SWCz operations. The control registers
may only be moved to and from the RALU’s registers using CTCz, CFCz operations.
Lexra supplies a simple Coprocessor Interface (CI) model allowing the customer to easily interface a
Coprocessor to the LX5280. The CI supplies a set of control, address, and data busses that may be tied
directly to the Coprocessor general and special registers.
The operating system kernel can initiate a power savings standby mode using the Lexra-specific SLEEP
instruction. This holds the LX5280's internal clocks in the high state until an external hardware interrupt is
received.
Before executing the SLEEP instruction, the kernel must ensure that the interrupt condition that will
ultimately terminate standby mode has been enabled via the IM field of the coprocessor 0 Status register.
When the SLEEP instruction enters the W stage, the standby logic stalls the processor and waits for the LBC
to complete any outstanding processor initiated system bus operations. After these are completed, the
standby logic holds the system and bus clocks high. These are held high until an enabled interrupt is
received.
When standby mode is terminated by an interrupt, the standby logic allows the clocks to toggle. The
processor honors the interrupt by branching to the exception handler as is normally done for interrupt
servicing. Because several instructions are held in the pipeline while the clocks are frozen prior to the
interrupt, the exception PC will not point to the SLEEP instruction, but rather some later instruction.
Typically, a kernel would enter an idle loop just after executing the SLEEP instruction, so the interrupt will be
serviced from the kernel's normal idle interrupt service level.
The LX5280 takes a minimum of 6 cycles after the SLEEP instruction enters the W stage to safely
synchronize the initiation of standby mode, i.e. hold the clocks in the high state. Two cycles are required
to terminate standby mode. The processor is stalled during these periods.
The standby logic receives the free running system and bus clocks, and generates gated clocks for distribution
to the LX5280. The standby logic must use flip-flops tied to free running clocks, which results in about a
dozen loads on the free running clocks.
Two pins, SL_SLEEPING_R and SL_SLEEPING_BR, are available from the standby logic and are asserted
high when the processor is in standby mode. The _R pin is for use in the system clock domain, and the _BR
pin is for use in the bus clock domain.
4. MIPS16
MIPS16 is an extension to the MIPS Instruction Set Architecture (ISA) that was developed to improve code
density, especially for System-on-Chip (SoC) designs. In these designs, on-chip instruction storage is often a
significant, even dominant, portion of the silicon component cost. This is especially true for real-time
applications because, in order to meet real-time requirements, instruction cache miss penalties cannot be
tolerated and thus a large portion of the instruction storage must be resident on-chip.
MIPS16 provides a set of 16-bit instruction formats to encode the most common operations. The key
compromises required to achieve 16-bit encoding include: (i) some MIPS I instructions are not available, (ii)
immediate widths are reduced, (iii) only 8 of the 32 general registers may be directly addressed. As a result
some operations cannot be executed in MIPS16 or require multiple MIPS16 instructions. Thus realistic
programs need to include both MIPS16 and MIPS I instructions, using MIPS16 where possible to save
storage, at some cost to performance.1 Mode switching between MIPS16 and MIPS I is discussed below. To
permit occasional access to all 32 general registers without the overhead of mode switching, MIPS16
provides MOVE instructions to move data between the MIPS16-visible registers and the full general register
set. Also, to permit occasional use of 16-bit immediates without mode switching, MIPS16 provides the
EXTEND instruction to allow a full width immediate in two MIPS16 instruction cycles. (Programs requiring
a large register set or frequent full-width immediates should be compiled in MIPS I.)
MIPS16 is difficult to program effectively at the assembler level. This is because of the limited register set
and the restricted size immediates. In fact, according to Sweetman2, "MIPS16 is not a suitable language for
assembly coding". Rather, MIPS16 is viewed as a compiler option which can be effectively applied to
achieve significant code size reduction where performance is not critical.
This section describes the MIPS16 instructions, with emphasis on the differences between MIPS16 and the
32-bit MIPS ISA. The first table lists MIPS I Instructions that are not supported in MIPS16.
The second table lists MIPS I instructions which are supported in MIPS16. In most cases these are
specialized versions of the MIPS I instruction. MIPS16 is compatible with MIPS I, II, III, IV, or V. The
LX5280 implements all MIPS16 for 32-bit data operations.3 The table lists all MIPS16 instructions together
with the corresponding MIPS I instruction and the specialization required to produce the MIPS16 instruction
(other than smaller register set and smaller immediates).
The third table lists the several new instructions introduced by MIPS16.
It is notable that MULT(U), DIV(U) are supported in MIPS16. MFHI and MFLO are also supported and are
necessary to access the result of MULT(U) or DIV(U). However, MTHI and MTLO are not supported. These
are used primarily to restore the state after exception handling and are used within the kernel, typically in
MIPS I.
1. The MIPS16 performance penalty results from occasionally using two instructions where one MIPS I instruction would suffice.
Some of this penalty is recovered in applications where a larger number of instructions per cache line reduces cache miss rate.
2. “See MIPS Run”, Dominic Sweetman, Appendix D, p. 425.
3. MIPS16 includes 16-bit formats for a number of MIPS III 64-bit doubleword operations which are not supported in the MIPS I
ISA. They are also not supported in Radiax.
Jump J
SLT(U) rx, ry (r24 dest. implied)            SLT(U) rd, rs, rt; rd=r24
SLTI(U) rx, immediate (2-op., r24 dest)      SLTI(U) rt, rs, immediate; rt=r24
CMPI rx, immediate (r24 dest. implied)       XORI rt, rs, immediate; rt=r24
CMP rx, ry (r24 dest. implied)               XOR rd, rs, rt; rd=r24
MULT(U) rx, ry
DIV(U) rx, ry
MFHI rx
MFLO rx
JAL target
JR rx
JR ra                                        JR rs; rs=r31
JALR ra, rx (2-operand; link = r31)          JALR rs, rd; rs=r31
BREAK
a. If no 32-bit MIPS instruction is listed, no specialization beyond limited size register set and
limited size immediates is required.
As noted earlier, MIPS16 restricts the MIPS I directly addressable register set and immediate field. Another
common MIPS16 restriction is that two, rather than three, register operands are permitted. MIPS16 provides
a number of instructions that are not found in MIPS I, as shown in Table 18.
The pc-relative load LW is important to overcoming the drawback of smaller immediates in MIPS16. It
allows full 32-bit immediates to be embedded in the program and loaded into registers in a single instruction.
The ADDIU with pc operand is useful to support immediates embedded in the program. The pc value
referenced in LW or ADDIU depends on the context of the pc-relative instruction as shown in Table 19.
EXTEND is used to supply an extra 11 bits of immediate. It is used together with the restricted size
immediate field of the next instruction to supply a full width immediate. EXTEND cannot occur in the delay
slot of a Jump. It is not necessary for the assembly programmer to code EXTEND instructions; MIPS16
assemblers insert them automatically wherever the immediate is too large to be encoded in a
single MIPS16 instruction.
Another new instruction, JALX, is available both in MIPS16 and in MIPS I on machines implementing
MIPS16, and is discussed below. [In MIPS I machines not implementing MIPS16, the JALX opcode 000111
causes an RI trap.]
1. The instruction,
JALX target
JR rx
JALR rs, rx (in MIPS16 rs=ra)
causes the mode to be set to MIPS16 if rx[0] = 1, or to MIPS I if rx[0] = 0. However, the lsb of the instruction
memory address from JR/JALR is forced to 0. As a consequence, machines that implement MIPS16 never
take AdEL exceptions on the lsb of the instruction address (this is true regardless of whether the machine is
operating in MIPS16 or MIPS I mode).
The mode bit is saved in the lsb of the link register in JAL, JALX, JALR.
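The handling of bit 0 can be summarized in a short Python sketch. It models only the two rules stated above: JR/JALR take the ISA mode from bit 0 of the register and force that bit to 0 in the fetch address, and the linking jumps save the current mode in bit 0 of the link register. The names are illustrative.

```python
MIPS_I, MIPS16 = 0, 1


def jump_register(rx):
    """JR/JALR target handling: bit 0 selects the ISA mode and is then forced to 0."""
    mode = rx & 1
    pc = rx & 0xFFFFFFFE
    return pc, mode


def link_value(return_address, mode):
    """Value written to the link register by JAL, JALX or JALR."""
    return (return_address & 0xFFFFFFFE) | mode
```

A subroutine return with JR on the saved link value therefore restores the caller's ISA mode without any explicit mode-switch instruction.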
4.3. Exceptions
Upon Exception, the mode is automatically switched to MIPS I. The mode is saved in the lsb of the
Exception PC (EPC). EPC[0] = 0 indicates that the Exception occurred while executing code in MIPS I
mode; EPC[0] = 1 indicates that the Exception occurred in MIPS16 mode. The typical program will save the
EPC to a general register and later return to the main program with a JR instruction, causing the proper ISA
mode to be restored.
Consistent with the MIPS16 emphasis on code density, there are no load delay or branch delay slots. In other
words, the instruction following the branch is executed only if the branch is not taken. [MIPS16 jumps (JAL,
JALX, JR, JALR) have a single delay slot, the same as in MIPS I. For jumps, the target address is always
taken. Thus, there is no risk that the delay slot cannot be used to do useful work: the instruction from the
target can be moved to the delay slot, if necessary.]
For MIPS16 loads, the instruction following the load can reference the loaded register (as in MIPS II). This
feature is present because the MIPS I compiler is not always successful in scheduling a useful instruction in
the delay slot and must occasionally resort to a NOP, reducing code density. This possibility is eliminated in
MIPS16.
The LX5280 supports Lexra’s Radiax DSP extensions to the MIPS-I instruction set. This chapter describes
the Radiax extensions in detail. Section 5.1 describes each of the Radiax instructions. Section 5.2 describes
the instruction encoding.
The Radiax instruction extensions include MAC operations, vector-addressing, and enhanced extensions to
the MIPS-I ALU instructions.
Nomenclature:
rS, rT = r0 - r31
mD = mDh || mDl; also for mT
mDh = m0h - m3h; also for mSh, mTh
mDl = m0l - m3l; also for mSl, mTl
HI = m0h[31:00]
LO = m0l[31:00]
The Dual MAC eliminates all programming hazards for its instructions by stalling the pipeline when
necessary. It does this both to avoid resource conflicts and to wait for results of a first instruction to be ready
before attempting to use those results in a second instruction. This means that there are no programming
restrictions in order to obtain correct results from a sequence of Dual MAC instructions.
However, the most efficient use of the Dual MAC hardware is obtained when the program avoids these stalls.
This can be done by scheduling the instructions properly. Table 52 on page 123 indicates the number of
cycles that must be present between MAC instructions to avoid stalls. In addition several instruction
sequences are presented that represent the most efficient use of the Dual MAC for the “inner loop” of some
common DSP algorithms. Typically, these make use of the multiple accumulators in the Dual MAC.
The following code sequences indicate the most efficient use of the Dual MAC for coding the inner loop of
some common DSP algorithms. The algorithms are presented for 16-bit operands with 16-bit results, as well
as 32-bit operands with 32-bit results. The algorithms assume that fractional arithmetic is used. Therefore, for
the 32-bit results of a 32x32 multiply, only the HI half of the target accumulator pair is retrieved or used.
In these examples, only the Dual MAC instructions are shown. The other pipe is used to fetch and store
operands and take care of loop housekeeping functions. The loops may need to be unrolled to take full
advantage of the multiple Dual MAC accumulators.
MADDA2 m0,r1,r2
MADDA2 m0,r3,r4
MADDA2 m0,r5,r6
MADDA2 m0,r7,r8
...
Assuming packed fractional operands, two multiplies per two cycles using two accumulator pairs.
MULTA2 m0,r1,r2
MFA2 m1,r8
MULTA2 m1,r3,r4
MFA2 m0,r7
...
Assuming fractional operands packed as 16-bit real, 16-bit imaginary. One complex multiply every two
cycles using two accumulator pairs.
CMULTA m0,r1,r2
MFA2 m1,r8
CMULTA m1,r3,r4
MFA2 m0,r7
...
Assuming fractional 32-bit operands so that the MFA waits for the HI result of the MULTA. Achieves one
multiply per two cycles using all the accumulators.
Assuming fractional 32-bit operands so that the ADDMA/SUBMA waits for the HI result of the second
MULTA. Achieves one complex multiply per ten cycles using all the accumulators, with two inserted
instructions. This is a good example of the cycles needed from MULTA to SUBMA/ADDMA (5 cycles for
HI) and from SUBMA/ADDMA to MFA (2 cycles).
Nomenclature:
Notes:
1. For LTP[.Cn], LWP[.Cn], LHP(U)[.Cn], LBP(U)[.Cn], rT = pointer is unsupported.
2. When a circular buffer is selected, the update of the pointer register is performed according to the
following algorithm, which depends on the sign of the stride and the granularity of the access. A stride
exactly equal to 0 is not supported:
The Radiax ALU operations include both dual 16-bit and saturating versions of the MIPS-I ALU operations
and several new ALU operations which are useful for common DSP algorithms.
The LX5280 provides conditional move instructions that reduce the need for program branches, resulting in
greater program efficiency.
Usage Note:
When combined with the SLT or SLTR2 instructions, the conditional move instructions can be used to
construct a complete set of conditional move macro-operations. For example:
The Lexra Formats are introduced into the MIPS instruction set by designating a single I-Format as
“LEXOP”, then using the INST[5:0] “subop” field to permit up to 64 new Lexra opcodes. Thus the new DSP
opcodes model the MIPS “special” opcodes encoded in R-Format. The diagrams below illustrate the LEXOP
codes using I-Format 011_111 which is unused in the MIPS I-IV ISA.
The following principles are used to resolve potential ambiguity of encoding between the new LX5280 DSP
extensions and MIPS instructions:
a. LX5280 instructions with similar operations to existing MIPS instructions, but with additional operands
permitted, are programmed with new Assembler mnemonics and encoded as a LEXOP. For instance:
b. If a MIPS instruction is “extended” with new functionality, it is programmed with new Assembler
mnemonics and encoded as a LEXOP. Lexra mnemonics which end in “r” indicate general register file
targets; mnemonics which end in “a” indicate accumulator register targets. This convention removes
ambiguity between the Lexra op and a similar MIPS op. For example,
The MIPS add and the LEXOP addr are both signed 32-bit additions. However, on overflow the MIPS
instruction triggers the Overflow Exception, while the LEXOP does not. Alternatively the result of the
LEXOP will saturate if the “.s” option is selected (addr.s).
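The three overflow behaviors can be contrasted in a short Python sketch. It assumes that the non-saturating addr wraps modulo 2^32 on overflow, which the text implies but does not state. The function and exception names are illustrative.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


class OverflowTrap(Exception):
    """Stands in for the MIPS Overflow Exception."""


def wrap32(x):
    """Reduce an integer to a signed 32-bit two's complement value."""
    return ((x + (1 << 31)) % (1 << 32)) - (1 << 31)


def mips_add(a, b):
    """MIPS add: signed 32-bit addition that traps on overflow."""
    total = a + b
    if not INT32_MIN <= total <= INT32_MAX:
        raise OverflowTrap()
    return total


def lex_addr(a, b, saturate=False):
    """LEXOP addr: no trap; wraps (assumed), or clamps when saturate is set (addr.s)."""
    total = a + b
    if saturate:
        return max(INT32_MIN, min(INT32_MAX, total))
    return wrap32(total)
```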
31 26 25 21 20 16 15 6 5 0
31 26 25 21 20 16 15 8 7 6 5 0
31 26 25 21 20 16 15 11 10 9 8 7 6 5 0
6 5 5 5 2 1 1 1 6
31 26 25 21 20 16 15 11 10 9 8 7 6 5 0
6 5 5 5 1 1 1 1 1 6
31 26 25 21 20 16 15 11 10 7 6 5 0
31 26 25 21 20 16 15 11 10 8 7 6 5 0
HL 00 = reserved
01 = mNl
10 = mNh
11 = reserved
s Selects saturation of result. s=1 indicates that saturation is
performed.
31 26 25 21 20 16 15 11 10 8 7 6 5 0
1001 reserved
101xx reserved
11000 mmd
11001 reserved
111xx reserved
31 26 25 21 20 16 15 11 10 0
Assembler COP0
Mnemonic 010 000 Copz rs rt rd 000 0000 0000
These are not LEXOP instructions. They are variants of the standard MTC0 and MFC0 instructions
that allow access to the Lexra Coprocessor0 Registers listed below. As with any COP0 instruction, a
Coprocessor Unusable Exception is taken in User mode if the Cu0 bit is 0 in the CP0 Status register
when these instructions are executed.
31 26 25 21 20 16 15 11 10 9 8 6 5 0
Inst[2:0]
Inst[5:3] 0 1 2 3 4 5 6 7
* Indicates instructions which are implemented only in the LX5280, and not the LX5180 product.
6. Integer Multiply-Divide-Accumulate
The integer Multiply-Divide-Accumulate instructions, which are optional on other Lexra processors, are a
standard feature of the LX5280 processor.
Instruction   Operation                               Description
MAD           {HI,LO} <- {HI,LO} + (Rs * Rt)          32x32 signed multiply, with 64-bit signed add to accum
MADU          {HI,LO} <- {HI,LO} + (Rs * Rt)          32x32 unsigned multiply, with 64-bit unsigned add to accum
MSUB          {HI,LO} <- {HI,LO} - (Rs * Rt)          32x32 signed multiply, with 64-bit signed sub from accum
MSUBU         {HI,LO} <- {HI,LO} - (Rs * Rt)          32x32 unsigned multiply, with 64-bit unsigned sub from accum
MADH          HI <- HI + (Rs[15:0] * Rt[15:0])        16x16 signed multiply, with 32-bit signed add to accum
MADL          LO <- LO + (Rs[15:0] * Rt[15:0])        16x16 signed multiply, with 32-bit signed add to accum
MSBH          HI <- HI - (Rs[15:0] * Rt[15:0])        16x16 signed multiply, with 32-bit signed sub from accum
MSBL          LO <- LO - (Rs[15:0] * Rt[15:0])        16x16 signed multiply, with 32-bit signed sub from accum
MSZH          HI <- 0 - (Rs[15:0] * Rt[15:0])         16x16 signed multiply, sub from pre-zeroed 32-bit accum
MSZL          LO <- 0 - (Rs[15:0] * Rt[15:0])         16x16 signed multiply, sub from pre-zeroed 32-bit accum
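The register-transfer descriptions in the table can be exercised with a small Python model. This is a sketch only: it assumes the accumulators wrap modulo 2^64 (or 2^32 for the 16-bit forms) on overflow, and it covers a representative subset of the instructions. The class and helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def s16(x):
    """Low 16 bits of x interpreted as a signed value (upper bits ignored)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def s32(x):
    """Low 32 bits of x interpreted as a signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


class Accum:
    """HI/LO accumulator pair, each held as an unsigned 32-bit value."""

    def __init__(self):
        self.hi = 0
        self.lo = 0

    def _acc64(self, delta):
        # Add a signed delta to {HI,LO}, wrapping modulo 2**64 (assumed).
        total = (((self.hi << 32) | self.lo) + delta) & MASK64
        self.hi, self.lo = total >> 32, total & MASK32

    def mad(self, rs, rt):
        self._acc64(s32(rs) * s32(rt))

    def madu(self, rs, rt):
        self._acc64((rs & MASK32) * (rt & MASK32))

    def msub(self, rs, rt):
        self._acc64(-(s32(rs) * s32(rt)))

    def madh(self, rs, rt):
        self.hi = (self.hi + s16(rs) * s16(rt)) & MASK32

    def msbl(self, rs, rt):
        self.lo = (self.lo - s16(rs) * s16(rt)) & MASK32

    def mszh(self, rs, rt):
        self.hi = (-(s16(rs) * s16(rt))) & MASK32
```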
The processor may stall if a new MAC instruction is executed while a prior MAC operation is pending.
Table 52 on page 123 indicates the number of cycles that must be present between MAC instructions to avoid
stalls.
• In case of resource conflicts, hardware manages all hazards simplifying software debug.
31 26 25 21 20 16 15 6 5 0
15 11 10 8 7 5 4 0
Notes:
The 32-bit op-codes are unchanged (from the MIPS-I standard) for the existing MULT, DIV, MF, and MT
instructions. The MAD, MADU, MSUB, and MSUBU are new Special2 opcodes, also standard to several
processors. In M32 mode, the new instructions are all R-format with bits 31:26 = 6'b111100. Bits 5:0
determine the specific operation, as shown. In M16 mode, the new instructions are all RR-format with bits
15:11 = 5'b11111. Bits 4:0 determine the specific operation, as shown in Section 6.4.
The upper 16 bits of both operand registers are ignored by 16-bit instructions.
The MxxH and MxxL instructions can be freely interleaved. That is, adds and subtracts from either
accumulator can be combined in a sequence with the two accumulators functioning "in parallel."
The MxZx instructions can be used as a stand-alone 16-bit signed multiply. This removes the need for a
"MTHI, zero" instruction at the beginning of a multiply-accumulate sequence, for example:
MAZH r1,r2
MADH r3,r4
MADH r5,r6
MADH r7,r8
any op that doesn't write HI
any op that doesn't write HI
MFHI r9
In the above sequence, the two non-HI ops are not necessary for correct operation but the pipeline will stall if
they are not used, so it is more efficient to perform useful work in those slots.
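As an illustration only, the arithmetic performed by the MAZH/MADH sequence above can be modeled in Python.
The helper names (to_s16, wrap_s32, mazh, madh) are hypothetical and not part of any Lexra tool. The sketch
assumes that MAZH computes HI <- 0 + (Rs[15:0] * Rt[15:0]), by analogy with MSZH, and that the 32-bit
accumulator wraps on overflow. Neither assumption is confirmed by this section.

```python
def to_s16(x):
    """Interpret the low 16 bits of x as a signed value; upper bits are ignored."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def wrap_s32(x):
    """Wrap an integer to signed 32 bits (assumed wraparound, no saturation)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def mazh(rs, rt):
    """Assumed MAZH behavior: HI <- 0 + Rs[15:0] * Rt[15:0]."""
    return wrap_s32(to_s16(rs) * to_s16(rt))

def madh(hi, rs, rt):
    """MADH: HI <- HI + Rs[15:0] * Rt[15:0]."""
    return wrap_s32(hi + to_s16(rs) * to_s16(rt))

# Four-term dot product, mirroring MAZH r1,r2 / MADH r3,r4 / MADH r5,r6 / MADH r7,r8.
# 0xFFFF is -1 as a 16-bit value; 0x1FFFE has its upper bits ignored and reads as -2.
pairs = [(3, 4), (0xFFFF, 2), (5, 0x1FFFE), (7, 8)]
hi = mazh(*pairs[0])
for rs, rt in pairs[1:]:
    hi = madh(hi, rs, rt)
print(hi)  # 12 - 2 - 10 + 56 = 56
```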
For the MULTx, MADx or MSUBx instructions, the most efficient use is:
MULTx r1,r2
MADx r3,r4
MSUBx r5,r6
any op that doesn't write HI or LO
any op that doesn't write HI or LO
any op that doesn't write HI or LO
MFLO r7 /* LO or HI is available this cycle*/
MFHI r8
The MFLO (MFHI) instruction reads the contents of the LO (HI) register during the E cycle of the pipeline.
The following descriptions indicate how the latency of the multiply instructions affects the usage of the MF
instructions. The most efficient sequence is shown. If the MF instruction is coded earlier, the correct result
will still be obtained because the hardware will stall the MF instruction in the E-cycle until the result is valid.
During the E cycle of any multiply operation, the initial operands are re-coded and loaded into the
MANDHW and MIERHW (MBOOTH) registers. For the MULTx operations, the multiply cycles can be
labeled M1 through M3. Then the following timing diagram is valid:
MULTx I S E M1 M2 M3
LO/HI valid X
any op I S E M W
any op I S E M W
MFLO I S E M W
or MFHI I S E M W
For the MADx operations, the pipeline cycles after E can be labeled as C (carry save), and A (accumulate).
Then the following timing diagram is valid:
MAZH0 I S E C A
MADH1 I S E C A
MADH2 I S E C A
MADH3 I S E C A
any op I S E M W
any op I S E M W
MFHI I S E M W
HI contains A0 A1 A2 A3
Given a dividend DEND, and divisor DVSR, the divider generates a quotient QUOT and remainder REM
that satisfy the following conditions, regardless of the signs of DEND and DVSR:
It is worth noting that the requirement that REM and DEND have the same sign is not universally accepted if
DEND and DVSR are not both positive. (For example the Modula-3 language expects: -5 DIV 3 = -2,
-5 MOD 3 = +1, whereas the divider generates QUOT=-1, REM=-2 in agreement with FORTRAN and others.)
These examples show the possible combinations of signs:
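The sign behavior can be checked with a short Python sketch. It assumes the usual truncating-division
conditions, inferred from the surrounding text rather than quoted from it: DEND = QUOT * DVSR + REM,
|REM| < |DVSR|, and REM has the same sign as DEND. The function name lx_divide is hypothetical. Python's own
// and % operators round toward negative infinity, so they reproduce the Modula-3 style result for comparison.

```python
def lx_divide(dend, dvsr):
    """Truncating division: quotient rounds toward zero, remainder follows the dividend's sign."""
    quot = abs(dend) // abs(dvsr)
    if (dend < 0) != (dvsr < 0):
        quot = -quot
    rem = dend - quot * dvsr
    return quot, rem

for dend, dvsr in [(5, 3), (-5, 3), (5, -3), (-5, -3)]:
    quot, rem = lx_divide(dend, dvsr)
    # The last two columns show Python's floor division for contrast.
    print(dend, dvsr, quot, rem, "| floor:", dend // dvsr, dend % dvsr)
```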
Thus the pipeline flow of a division instruction and the most efficient subsequent read of the quotient (using
MFLO) is as shown in the following diagram, assuming that all the intervening instructions complete in one
cycle. If the MFLO is issued earlier it will stall until the divide completes. Fewer than 19 instructions may be
issued if some of them take more than one cycle to complete (due to cache misses or data dependent stalls, for
example).
This chapter describes how memories are configured and connected to the LX5280 using the Local Memory
Interfaces (LMIs). This section provides a brief summary of the conventions and supported memories.
Section 7.2 describes the control register that allows software control over certain aspects of the LMIs. The
subsequent sections cover each of the LMIs in detail.
This chapter also discusses configuration options and the ports that customers must access to connect
application specific RAM and ROM devices that are used by the LX5280 LMIs. All of the signals between
the processor core, the LMIs, RAMs and the system bus controller are automatically configured by lconfig,
the LX5280 configuration tool. Lconfig also produces documentation of the exact RAMs required for the
chosen configuration settings, and writes RAM models used for RTL simulation.
The LMIs connect to RAMs that service the LX5280 processor’s local instruction and data busses. The LMIs
also provide the pathways from the processor to the system bus. The LX5280 includes an LMI for each of the
local memory types. The sizes of the RAMs and ROMs are customer selectable. The LX5280 LMIs directly
support synchronous RAMs that register the address, write data, and control signals at the RAM inputs. The
LMIs also supply redundant read enable and chip select lines for each RAM, which may be required for some
RAM types. ROMs may also be connected, but may require a customer supplied address register at the
address inputs.
Lexra supplies an integration layer for the LMIs and the memory devices connected to them. In this layer,
memory devices are instanced as generic modules satisfying the depth and width requirements for each
specific memory instance. The lconfig utility supplies a summary of the memory devices required for the
chosen configuration. In most cases, customers simply need to write a wrapper that connects the generic
module port list to a technology specific RAM instance inside the RAM wrapper.
The LX5280 is configurable for a 16, 32, 64, or 128-byte cache line size. The tag store RAM sizes shown in
the tables of this chapter assume a 16-byte line size. The documentation produced by lconfig indicates the
required tag RAMs for the selected configuration options, including the line size. As a general rule, a
doubling of the line size results in halving the tag store depth.
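The halving rule follows from keeping one tag entry per cache line. A minimal sketch, assuming a hypothetical
8 KB direct-mapped cache (the cache size and the function name tag_store_depth are illustrative and not taken
from the lconfig output):

```python
def tag_store_depth(cache_bytes, line_bytes):
    """Tag entries needed for one way of a cache: one tag per cache line."""
    if line_bytes not in (16, 32, 64, 128):
        raise ValueError("line size must be 16, 32, 64, or 128 bytes")
    return cache_bytes // line_bytes

# Hypothetical 8 KB cache at each supported line size.
for line_bytes in (16, 32, 64, 128):
    print(line_bytes, tag_store_depth(8 * 1024, line_bytes))
```

Each doubling of the line size halves the printed depth (512, 256, 128, 64).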
The valid bits within tag stores are automatically cleared by the LMIs upon reset. The data cache implements
a write-through protocol. Caches do not snoop the system bus. The LX5280 is configurable to work with
RAMs with a write granularity of 8 bits (byte) or 32 bits (word). Byte write granularity results in more
efficient operation of store byte and store half-word instructions.
Table 27 summarizes the LMIs that can be integrated on the local busses.
Name Description
[CCTL register layout diagram: bit fields 31-8, 7, 6, 5, 4, 3-2, 1, 0]
When reading this register, the contents of the Reserved bits are undefined. When writing this register, the
contents of the Reserved bits should be preserved.
Changes in the contents of the CCTL register are observed in the W stage. However, these changes affect
instruction fetches currently in progress in the I stage, and data load or store operations in progress in the M
stage.
The IROMOn and IROMOff bits of the CCTL register control the use of the optional local IROM
memory configured into the LX5280. When IROM is present and the LX5280 is reset, the LMI enables
access to the IROM. A transition from 0 to 1 on IROMOff disables the IROM, allowing instruction references
to be serviced by IMEM, ICACHE or the system bus. A transition from 0 to 1 on IROMOn enables the IROM.
The IMEMFill and IMEMOff bits of the CCTL register control the contents and use of any local IMEM
memory configured into the LX5280. When the LX5280 is reset, the LMI clears an internal register to
indicate that the entire IMEM LMI contents are invalid. When IMEM is invalid, all cacheable fetches from
the IMEM region will be serviced by the instruction cache, if an instruction cache is present.
A transition from 0 to 1 on IMEMFill causes the LMI to initiate a series of line read operations to fill the
IMEM contents. The addresses used for these reads are defined by the configured BASE and TOP addresses
of the IMEM, described in Section 7.4. The processor stalls while the entire IMEM contents are filled by the
LMI. Thereafter, the LMI sets its internal IMEM valid bit and will service any access to the IMEM range
from the local IMEM memory. The time that an IMEM fill takes to complete is the number of line reads
needed to fill the IMEM range, multiplied by the latency of one line read, assuming there is no other system
bus traffic.
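The fill time estimate can be written out as a small calculation. This is a sketch only: the IMEM size, line
size, and line read latency below are hypothetical values, and the function name imem_fill_cycles is not part
of any Lexra tool.

```python
def imem_fill_cycles(imem_bytes, line_bytes, line_read_latency):
    """Estimated fill time: number of line reads times the latency of one line read."""
    line_reads = imem_bytes // line_bytes
    return line_reads * line_read_latency

# Hypothetical example: 4 KB IMEM, 16-byte lines, 10 cycles per line read, no other bus traffic.
print(imem_fill_cycles(4096, 16, 10))  # 256 line reads * 10 cycles = 2560
```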
A transition from 0 to 1 on IMEMOff causes the LMI to clear its internal IMEM valid bit. Subsequent
cacheable fetches from the IMEM region will be serviced by the instruction cache. To use the IMEM again,
an application must re-initialize the IMEM contents through the IMEMFill bit of the CCTL register.
The ILock field controls set locking in the two-way set associative instruction cache. When ILock is 00 or 01, the
instruction cache operates normally. When ILock is 10, all cached instruction references are forced to occupy
set 1. The hardware will invalidate lines in set 0 if necessary to accomplish this. When ILock is 11, lines in set
1 are never displaced – i.e. they are locked in the cache. Set 0 is used to hold other lines as needed.
To utilize the cache locking feature, software should execute at least one pass of critical subroutines or loops
with ILock set to 10. After this has been done, ILock should be set to 11 to lock the critical code into set 1,
and use set 0 for other code.
The IInval and DInval fields control hardware invalidation of the instruction cache and data cache. A
transition from 0 to 1 on IInval will initiate a hardware invalidation sequence of the entire instruction cache.
Likewise, a 0 to 1 transition on DInval will initiate a hardware invalidation sequence of the entire data cache.
The DMEM, if present, is unaffected by this operation.
The hardware invalidation sequence for the instruction and data caches requires one cycle per cache line to
complete.
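Because several CCTL fields act on a 0 to 1 transition rather than on the level of the bit, writing 1 to a bit
that is already 1 does not start a new operation. The sketch below models that behavior under stated
assumptions: the bit position is a placeholder and is NOT taken from the CCTL layout, and the helper name
triggered_bits is hypothetical.

```python
def triggered_bits(old_value, new_value):
    """Bits that changed from 0 to 1 between two successive register values."""
    return ~old_value & new_value & 0xFFFFFFFF

PLACEHOLDER_BIT = 1 << 0  # placeholder position, NOT taken from the CCTL layout

cctl = 0
writes = [PLACEHOLDER_BIT, PLACEHOLDER_BIT, 0, PLACEHOLDER_BIT]
fired = []
for value in writes:
    # An operation starts only when the bit goes from 0 to 1.
    fired.append(bool(triggered_bits(cctl, value) & PLACEHOLDER_BIT))
    cctl = value
print(fired)  # [True, False, False, True]
```

In this model, software that wants to repeat an operation first writes the bit back to 0 and then writes 1.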
Depending on the circumstances, software may be able to employ an alternative to a full invalidation of the
data cache. If a small number of lines must be invalidated, software may perform cached reads from aliases of
the memory locations of concern. This displaces data in the addressed locations of the data cache, even if they
do not encache the affected memory location.
Another alternative, if the affected memory location has an alias in uncacheable (KSEG1) space, is to simply
perform an uncached read of the affected memory locations. If the location is resident in the data cache it will
be invalidated. This method has the advantage of not displacing data in the cache unless it is absolutely
necessary to maintain coherency. Note that a write to a KSEG1 address has no effect on the contents of the
data cache.
With either of these two alternatives, it is only necessary to reference one word of each affected cache line.
The ICACHE LMI supplies the interface for a direct mapped or two-way set associative instruction cache
attached to the LX5280 local bus. The degree of associativity is specified through lconfig. The ICACHE LMI
participates in cacheable instruction fetches, but only if the address is not claimed by the IMEM module. The
configurations supported by ICACHE, and the synchronous RAMs required for each, are summarized in
Table 28.
The instruction store for the two-way ICACHE consists of two 64-bit wide banks, with separate write-enable
controls. The tag store consists of one RAM bank with tag and valid bits for set 0, and a second RAM for set
1 that holds the tag, valid, LRU (Least Recently Used), and lock bits. When a miss occurs in the two-way
ICACHE, the LRU bit is examined to determine which element of the set to replace, with element 0 being
replaced if LRU is 0, and element 1 being replaced if LRU is 1. The state of the LRU bit is then inverted. To
optimize the timing of cache reads, the two-way ICACHE uses the state of the LRU bit to determine which
element should be returned to the CPU. In the following cycle, the ICACHE determines if the correct element
was returned. If not, the ICACHE takes an extra cycle to return the correct element to the CPU and inverts the
LRU bit.
Table 29 lists the ICACHE signals that are connected to application specific modules. The IC_ prefix
indicates signals that are driven by the ICACHE LMI module and received by the RAMs. The ICR_ prefix
indicates signals that are driven by the ICACHE RAMs and received by the ICACHE LMI. Lexra supplies
the Verilog module that makes all required connections to these wires. The width of the index and data lines
depends upon the RAM connected to the LMI, and can be inferred from Table 28.
Signal Description
The IMEM LMI supplies the interface for an optional local instruction store. The IMEM serves a fixed range
of the physical address space, determined by configuration settings in lconfig. The IMEM contents are filled
and invalidated under the control of the CP0 CCTL register, described in Section 7.2, Cache Control
Register: CCTL. The IMEM module services instruction fetches that fall within its configured range. The
IMEM is a convenient, low-cost alternative to a cache that makes instruction memory available to the core for
high-speed access.
The configurations supported by IMEM, and the synchronous RAMs required for each, are summarized in
Table 30.
Table 31 lists the IMEM signals that are connected to application specific modules. The IW_ prefix indicates
signals that are driven by the IMEM LMI module and received by RAMs. The IWR_ prefix indicates signals
that are driven by RAMs and received by the IMEM LMI. The CFG_ prefix identifies configuration ports on
the IMEM LMI that are typically wired to constant values. The width of the index and data lines depends
upon the RAM connected to the LMI, and can be inferred from Table 30.
The CFG_ wires define where the IMEM is mapped into the physical address space. This configuration
information defines the local bus address region of the IMEM. It also determines the address of the external
resources which are accessed when an IMEM miss occurs. The lconfig utility supplied by Lexra will verify
that the configured address range does not interfere with other regions defined for LX5280. The size of the
memory region must be a power of two, and must be naturally aligned.
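The power-of-two and natural-alignment requirement can be checked with two bit tests. This is a minimal
sketch, not the check performed by lconfig; the function name is_valid_region and the example addresses are
hypothetical.

```python
def is_valid_region(base, size):
    """True if size is a power of two and base is a multiple of size (naturally aligned)."""
    power_of_two = size > 0 and (size & (size - 1)) == 0
    return power_of_two and (base & (size - 1)) == 0

print(is_valid_region(0x10000, 0x4000))  # 16 KB region on a 16 KB boundary
print(is_valid_region(0x11000, 0x4000))  # base is not a multiple of the size
print(is_valid_region(0x10000, 0x3000))  # size is not a power of two
```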
Signal Description
Signal Description
CFG_IWTOP[17:10] Configured top address (bits that may differ from base).
The IROM LMI supplies the interface for an optional read-only local instruction store. The IROM serves a
fixed range of the physical address space, determined by configuration settings in lconfig. IROM may be
disabled via a hardware configuration pin, CFG_IROFF. IROM may also be enabled and disabled under
software control as described in Section 7.2, Cache Control Register: CCTL. The IROM is a convenient,
low-cost alternative to a cache that makes read-only instruction memory available to the core for high-speed
access.
The configurations supported by IROM, and the synchronous ROMs required for each, are summarized in
Table 32.
Configuration IROM_DATA
Table 33 lists the IROM signals that are connected to application specific modules. The IR_ prefix indicates
signals that are driven by the IROM LMI module and received by the ROM. The IRR_ prefix indicates
signals that are driven by ROM and received by the IROM LMI. The CFG_ prefix identifies configuration
ports on the IROM LMI that are typically wired to constant values. Lexra supplies the Verilog module that
makes all required connections to these wires. The width of the index and data lines depends upon the ROM
connected to the LMI, and can be inferred from Table 32.
The CFG_ wires define where IROM is mapped into the physical address space. This configuration
information defines the local bus address region of the IROM. It also determines the address of the external
resources which are accessed when an IROM miss occurs. The lconfig utility supplied by Lexra will verify
that the configured address range does not interfere with other regions defined by the LX5280. Note that the
size of the memory region must be a power of two, and must be naturally aligned.
Signal Description
CFG_IRTOP[17:10] Configured top address (bits that may differ from base).
The DCACHE LMI supplies the interface for a direct mapped, write through data cache attached to the
LX5280 local bus. The DCACHE LMI participates in cacheable data reads and writes, but only if the address
is not claimed by the DMEM LMI. The configurations supported by DCACHE, and the synchronous RAMs
required for each, are summarized in Table 34.
The direct mapped DCACHE module services word or twin-word read requests from the core in one cycle
when the request hits the cache. Byte or half-word reads that hit the data cache require an extra cycle for
alignment. The data cache can stream word and twin-word reads or writes that hit the cache at the rate of one
per cycle. If the LX5280 is configured to work with RAMs that have word write granularity, byte or half-
word writes that follow any write by one cycle and hit the cache require an extra cycle to merge the data with
the current cache contents. Alternatively, the LX5280 can be configured to work with RAMs that support byte
write granularity, which eliminates the extra cycle. See Appendix C, LX5280 Pipeline Stalls, for detailed
descriptions of these and other pipeline stall conditions.
Writes that are serviced by the data cache may require extra time to be serviced by the LBC if its write buffer
is full. Also, when a cache write operation is immediately followed by a cache read, the cache must delay the
read for one cycle while the write completes.
When a miss occurs, the cache obtains a cache line (4, 8, 16, or 32 words) of data from the Lexra Bus
Controller (LBC). Write operations that hit the data cache are simultaneously written into the cache and
forwarded to the write buffer of the LBC. Thus, if the core subsequently reads the data, it will likely be
available from the cache. For main memory systems that support byte writes, all data writes that miss the
cache are forwarded to the write buffer of the LBC, without disturbing any data currently in the cache. For
main memory systems that can only write with word granularity, a byte or half-word write that misses the
cache causes the cache to perform a line fill from main memory. The cache then merges the partial write data
with the full word data obtained from memory, and writes the word to the system bus.
Table 35 lists the DCACHE signals that are connected to application specific modules. The DC_ prefix
indicates signals that are driven by the DCACHE LMI module and received by the RAMs. The DCR_ prefix
indicates signals that are driven by the DCACHE RAMs and received by the DCACHE LMI. Lexra supplies
the Verilog module that makes all required connections to these wires. The width of the index and data lines
depends upon the RAM connected to the LMI, and can be inferred from Table 34.
Signal Description
The DMEM LMI supplies the interface for a scratch pad data RAM attached to the LX5280 local bus. The
DMEM module services any cacheable or uncacheable data read or write operation that falls within its
configured range.
Byte or half-word reads that hit the DMEM require an extra cycle for alignment. DMEM can stream word
and twin-word reads or writes that hit DMEM at the rate of one per cycle. If the LX5280 is configured to
work with RAMs that have word write granularity, byte or half-word writes that follow any write by one
cycle and hit DMEM require an extra cycle to merge the data with the current DMEM contents. Alternatively,
the LX5280 can be configured to work with RAMs that support byte write granularity, which eliminates the extra
cycle. See Appendix C, LX5280 Pipeline Stalls, for detailed descriptions of these and other pipeline stall
conditions. Also, because a write operation to the DMEM is never sent to the LBC, writes to DMEM will not
cause the LBC to stall the processor due to a full write buffer condition.
The DMEM configurations and the synchronous RAMs required for each are summarized in Table 36.
Table 37 lists the DMEM signals that are connected to application specific modules. The DW_ prefix
indicates signals that are driven by the DMEM LMI module and received by RAMs. The DWR_ prefix
indicates signals that are driven by RAMs and received by the DMEM LMI. The CFG_ prefix identifies
configuration ports on the DMEM LMI that are typically wired to constant values. The width of the index and
data lines depends upon the RAM connected to the LMI, and can be inferred from Table 36.
The CFG_ wires define where DMEM is mapped into the physical address space. It is not possible for any
DMEM reference to result in an operation on the system bus. The lconfig utility supplied by Lexra will verify
that the configured address range does not interfere with other regions defined for LX5280. The size of the
memory region must be a power of two, and must be naturally aligned.
The DMEM LMI can also be used as a ROM controller simply by tying off the write enable and data input
lines in the RAM wrapper, and instancing a ROM in the RAM wrapper.
Signal Description
Signal Description
CFG_DWTOP[17:10] Configured top address (bits that may differ from base).
The Lexra System Bus (LBus) is the connection between the LX5280 and other internal devices, such as
system memory, USB, IEEE-1394 (Firewire), and an external bus interface. The LBC uses a protocol similar
to that of the Peripheral Component Interconnect (PCI) bus. This is a well-known and proven architecture.
Adding new devices to the Lexra Bus is straightforward and the performance approaches the highest that can
be achieved without adding a great deal of complexity to the protocol.
The Lexra bus supports multiple masters. This allows bus-mastering I/O controllers with DMA engines to be
connected to the bus. The bus has a pended architecture, in which a master holds the bus until all the data is
transferred. This simplifies the design of user-supplied bus agents and reduces latency for cache miss
servicing.
The Lexra bus is a synchronous bus. Signals are registered and sampled at the positive edge of the bus clock.
Certain logical operations may be made to the sampled signals and then new signals can be driven
immediately, such as for address decoding. This allows for same-cycle turn-around. The LBC provides an
optional asynchronous interface between the CPU and the Lexra bus, allowing the Lexra bus speed to be set
to any speed equal to or less than the CPU clock frequency.
The Lexra bus data path for the LX5280 is 32 bits wide. Therefore, the bus can transfer one word, halfword,
or byte in one bus clock. The bus supports line and burst transfers in which several words of data are
transferred. The Lexra bus accomplishes this by transferring words of data from incremental addresses on
successive clock cycles.
The LBC contains a write buffer. When the CPU issues a write request to a Lexra Bus device, the address and
data are saved in the buffer and sent to the device sometime later. The CPU can continue processing, having
safely assumed that the write will eventually happen. This is described more thoroughly in Section 8.7.2.
The LBC drives enabling signals to control muxes or tristate buffers. This allows the Lexra bus to have either
a bi-directional or point-to-point topology.
8.2. Terminology
The Lexra bus borrows terminology from the PCI bus specification, on which the Lexra bus is partially based.
Bus transactions take place between two bus agents. One bus agent requests the bus and initiates a transfer.
The second responds to the transfer.
The agent initiating a transfer is called the bus initiator. It is also referred to as the bus master. Both terms are
used interchangeably in this document.
The responding agent is known as the bus target. It samples the address when it is valid, and determines if the
address is within the domain of the device. If so, it indicates this to the initiator and becomes the target.
A read transfer is a bus operation whereby the master requests data from the target.
A write transfer is a bus operation whereby the master requests to send data to the target.
A single-cycle bus operation is used to transfer one word, halfword, or byte of data. This amount of data can
be transferred in one bus cycle, not including the address cycle and device latencies.
A line transfer is a read or write operation where an entire cache line of data is transferred in successive
cycles as fast as the initiator and target can send/receive the data.
A burst transfer is a read or write operation where a large amount of data needs to be sent. The initiator
presents a starting address and data is transferred starting at that address in successive cycles; for each word
transferred, the address is incremented by the devices internally.
Some signals on the Lexra bus are active low. That is, they are considered logically true when they are
electrically low and logically false when electrically high. A device asserts a signal when it drives it to its
logical true electrical state.
The purpose of the Lexra bus is to connect together the various components of the system, including the
LX5280 CPU, main system memory, I/O devices, and external bus bridges. Different devices have different
transfer requirements. For example, the LX5280 CPU will request the bus to fetch a cache line of data from
memory. I/O devices will request large blocks of data to be sent to and from memory. The Lexra bus supports
the various types of transfers needed by both I/O and the processor.
The six types of bus operations are single-cycle read, line read, burst read, single-cycle write, line write
(though this won’t be used by the LX5280 core) and burst write.
The single-cycle read operation reads a single word, halfword, or byte from the target device. This operation
is usually used by the CPU to read data from uncacheable address space. (If the read address were in cacheable
address space, either a hit would occur resulting in no bus activity, or a miss would occur resulting in a read
line transaction.)
The read line operation reads a sequence of data from memory corresponding to the size of a cache line. The
cache line size affects how many cycles are required to transfer the full line. The LX5280 and the Lexra bus
support a configurable line size, specified through lconfig. The default line size of four words (16 bytes) is
assumed here.
There are two ways that the target could transfer the data back to the initiator. The conventional way is to
transfer four words of data in sequence, starting at the nearest 16-byte-aligned address smaller or equal to the
address that the initiator drives. In other words, the target starts the transfer at the beginning of the line
containing the requested address.
Some memory devices may implement a performance optimization called desired-word-first. If the address is
not aligned to a 16-byte boundary, then the first data returned by the target is the word corresponding to the
address instead of the first word of the line. The second word is the next sequential word of data and so on. At
the end of the line, the target wraps around and returns the first word of the line.
The LX5280 supports two ways of incrementing the address of a line refill. One is by linear wrap, where the
address is simply incremented by one. The other is by interleaved wrap, where the next address is determined
by the logical xor of the cycle count and the first word address. The interleave sequence is shown in the table
below. The low-order address bits 3:2 for the first data beat are obtained from the address of the line read
request. The low order address bits for the subsequent data indicate the corresponding interleave order.
Interleaved Address[3:2]
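The two refill orders can be generated with a short Python sketch for the default four-word line. It assumes
that linear wrap increments the word index modulo the line length, and it computes interleaved wrap as the XOR
of the beat count and the first word index, as described above. The function name refill_order is
hypothetical.

```python
def refill_order(first_word, wrap="linear", words_per_line=4):
    """Word indices (address bits 3:2 for a four-word line) in the order they are transferred."""
    if wrap == "linear":
        # Assumed: increment by one and wrap at the end of the line.
        return [(first_word + beat) % words_per_line for beat in range(words_per_line)]
    if wrap == "interleaved":
        # XOR of the beat count and the first word address.
        return [first_word ^ beat for beat in range(words_per_line)]
    raise ValueError("wrap must be 'linear' or 'interleaved'")

for first_word in range(4):
    print(first_word, refill_order(first_word, "linear"), refill_order(first_word, "interleaved"))
```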
The burst read operation transfers an arbitrary amount of data from the target to the initiator. The initiator first
presents a starting address to the target. The target responds by providing multiple cycles of data words in
sequence, starting at the initial address. The initiator indicates to the target when to stop providing data.
Burst read operations are used by I/O devices for block DMA transfers. The LX5280 will never issue a burst
read operation.
Note that there is a difference between a 4-cycle burst and a line read. A line read may use a desired-word-
first increment and wrap. A burst will always increment and will never wrap.
The single-cycle write operation writes a single word, a halfword, or a byte to the target.
The LX5280 uses a cache with a write-through policy. All CPU instructions that write to memory generate a
single-cycle write operation, unless the address is in the local scratchpad memory, in which case the write
operation does not reach the Lexra bus.
The line write operation is not used by the LX5280. This operation could be used by a processor that has a
data cache that implements a write-back policy.
A burst write is an operation where the initiator sends an address and then an indefinite sequence of data to
the target. The initiator will inform the target when it has finished sending data. This operation is used by I/O
devices for DMA transfers. It is not used by the processor.
Signal Name   Source (Initiator/Target/Ctrl)      Description
BIRDY         Initiator                           For writes, indicates that initiator is driving valid data;
                                                  on reads, indicates that initiator is ready to accept data.
BDATA[31:0]   Initiator on write/Target on read   Data; if driven by initiator, BIRDY indicates valid data
                                                  on bus; if driven by target, BTRDY indicates valid data
                                                  on bus.
The initiator drives BCMD during the cycle that BFRAME is asserted.
BCMD[5:4]   Transfer type
00          burst, fixed length (note 1)
01          burst, unlimited number of words
10          line, interleaved wrap (note 2)
11          line, linear wrap

BCMD[3:0]   Transfer size
1000        1 byte
1001        2 bytes
1010        3 bytes
1011        1 word
1100        2 words
1101        reserved
111x        reserved
0000        4 words
0001        8 words
0010        16 words
0011        32 words
01xx        reserved
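The two encodings above can be combined in a small decoder. This sketch covers only BCMD[5:0] as listed here;
any higher command bits are outside its scope. The function name decode_bcmd_low is hypothetical.

```python
TRANSFER_TYPES = {
    0b00: "burst, fixed length",
    0b01: "burst, unlimited number of words",
    0b10: "line, interleaved wrap",
    0b11: "line, linear wrap",
}

TRANSFER_SIZES = {
    0b1000: "1 byte", 0b1001: "2 bytes", 0b1010: "3 bytes",
    0b1011: "1 word", 0b1100: "2 words",
    0b0000: "4 words", 0b0001: "8 words", 0b0010: "16 words", 0b0011: "32 words",
}

def decode_bcmd_low(bcmd):
    """Decode BCMD[5:4] (transfer type) and BCMD[3:0] (transfer size)."""
    kind = TRANSFER_TYPES[(bcmd >> 4) & 0b11]
    size = TRANSFER_SIZES.get(bcmd & 0b1111, "reserved")
    return kind, size

print(decode_bcmd_low(0b110000))  # line, linear wrap; 4 words
print(decode_bcmd_low(0b001011))  # burst, fixed length; 1 word
```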
The Lexra Bus is a big endian bus. Transactions must have their data driven to the appropriate bus rails. The
bus mapping is as shown in Table 40.
00 00 X
00 01 X
00 10 X
00 11 X
01 00 X X
01 10 X X
10 00 X X X
10 01 X X X
11 00 X X X X
The Lexra Bus does not define unaligned data transfers, such as a halfword transfer that starts at
ADDR[1:0]=01, or transfers that would need to wrap to the next word.
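A sketch of the byte-lane selection follows. It reads the rows of the mapping table as (transfer size,
ADDR[1:0]) pairs and assumes the conventional big endian lane assignment, in which the byte at ADDR[1:0]=00
travels on BDATA[31:24]. That lane assignment is an assumption here and should be checked against Table 40.
The function name byte_lanes is hypothetical.

```python
# (number of bytes, ADDR[1:0]) combinations listed in the mapping table.
LEGAL_TRANSFERS = {(1, 0), (1, 1), (1, 2), (1, 3), (2, 0), (2, 2), (3, 0), (3, 1), (4, 0)}

def byte_lanes(num_bytes, addr_low):
    """Return the (msb, lsb) BDATA ranges that carry data, assuming byte 0 is on BDATA[31:24]."""
    if (num_bytes, addr_low) not in LEGAL_TRANSFERS:
        raise ValueError("transfer is unaligned or would wrap to the next word")
    return [(31 - 8 * offset, 24 - 8 * offset)
            for offset in range(addr_low, addr_low + num_bytes)]

print(byte_lanes(1, 0))  # [(31, 24)]
print(byte_lanes(2, 2))  # [(15, 8), (7, 0)]
```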
The Lexra Bus Controller (LBC) is the element of the LX5280 that connects to the Lexra Bus. It forwards all
transaction requests from the LX5280 CPU to the Lexra Bus. It is an initiator and will never respond to
requests from other Lexra Bus initiators.
The LBC issues only the LBus commands listed in the table below.
The LBC contains a write buffer with a depth that is configurable with lconfig. All write requests from the
CPU are posted in the write buffer. The CPU will not wait for the write to complete. Write operations
complete in the order they are entered into the queue. If the queue fills, then the CPU must wait until an entry
becomes available.
When the CPU issues a read operation, the LBC will attempt to forward that request to the Lexra Bus ahead
of any pending write operations. This significantly improves performance since the CPU needs to wait for the
read operation to complete and would waste time if it had to also wait for unnecessary or irrelevant writes to
complete.
There are a few cases when the LBC will not allow the read operation to pass pending writes:
1. The address of a pending write is within the same cache line as the read. The LBC will hold the
read operation until the matching write operation, and all write operations ahead of it, complete.
If the read is for an instruction fetch, it can still pass a pending write that is inside the
same cache line.
2. The read is to uncacheable address space. All writes will complete before the read is issued.
This avoids any problems with I/O devices and their associated control/status registers.
3. A pending write is to uncacheable address space. The LBC will hold the read operation until all
writes up to and including the write to uncacheable address space complete. This further avoids
I/O device problems.
The write buffer bypass feature can be disabled so that reads will never pass writes.
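The bypass rules above can be summarized as a function that reports how many of the oldest pending writes must
complete before a read is issued. This is a behavioral sketch under stated assumptions, not the LBC
implementation: the write buffer is modeled as a list of (address, cacheable) pairs ordered oldest first, the
line size defaults to 16 bytes, and the name writes_to_drain is hypothetical.

```python
def writes_to_drain(read_addr, read_cacheable, is_ifetch, pending_writes,
                    line_bytes=16, bypass_enabled=True):
    """Number of oldest pending writes that must complete before the read is issued."""
    # Bypass disabled, or rule 2: an uncacheable read waits for every pending write.
    if not bypass_enabled or not read_cacheable:
        return len(pending_writes)
    drain = 0
    for position, (addr, cacheable) in enumerate(pending_writes, start=1):
        same_line = addr // line_bytes == read_addr // line_bytes
        # Rule 3: a write to uncacheable space blocks the read.
        # Rule 1: a write to the same cache line blocks a data read, but not an instruction fetch.
        if not cacheable or (same_line and not is_ifetch):
            drain = position
    return drain

writes = [(0x100, True), (0x204, True), (0x300, True)]
print(writes_to_drain(0x208, True, False, writes))   # data read in the line of the second write: 2
print(writes_to_drain(0x208, True, True, writes))    # instruction fetch passes all writes: 0
print(writes_to_drain(0x208, False, False, writes))  # uncacheable read waits for all writes: 3
```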
The LBC contains a read buffer with a depth that is configurable with lconfig. All incoming read data from
the system bus passes through the read buffer. This allows the LBC to accept incoming data as a result of a
cache line fill operation without having to hold the bus.
When the LBC is configured with an asynchronous interface, a larger read buffer improves system and
processor performance in the event of a cache miss. When the LBC is configured with a synchronous interface,
the cache can accept the data as fast as the LBC can read it. Therefore, there is no need for a large read buffer.
Customers may reduce the size of the read buffer to a minimum size of two 32-bit entries.
In some cases, there is a need to minimize the number of gates. The read buffer size may be reduced to two or
four entries for the asynchronous case. This causes a penalty in terms of Lbus utilization since now the LBC
may have to de-assert IRDY if it cannot hold part of the line of data. When the read buffer is the size of a
cache line, this will be relatively rare since simultaneous instruction cache and data cache misses are
relatively rare. For a smaller read buffer, IRDY deassertion is almost a certainty.
This section describes the various types of read and write transfers in detail. These operations follow certain
patterns and rules. The rules for driving and sampling the bus are as follows:
1. Agents that drive the bus do so as early as possible after the rising edge of the bus clock. There
is some time to perform some combinational logic after the bus clock goes high, but the
amount of time is determined by the speed of the bus clock and the number of devices on the
bus.
2. Agents sample signals on the bus at the rising edge of the bus clock.
3. All bus signals must be driven at all times. If the bus is not owned, an external device must
drive the bus to a legal level.
4. A change in signal ownership requires one dead cycle. If an initiator gives up the bus, another
initiator needs to wait for one dead cycle before it can drive the bus. If the same initiator issues
a read operation and then needs to issue a write operation, it also must wait one extra cycle for
the data bus to turn around.
5. Agents that own signals must drive the signals to a logical true or logical false; all other agents
must disable (tristate) their output buffers.
The Lexra Bus protocol is based on the PCI Bus protocol. The Lexra Bus is not PCI compatible; it merely
borrows concepts from the PCI Bus specification. The Lexra Bus signals BFRAME, BTRDY, BIRDY,
and BSEL have a similar function to the PCI signals FRAME#, TRDY#, IRDY#, and DEVSEL#,
respectively. In general, the protocol for the Lexra bus is as follows:
1. The initiator gains control of the bus through arbitration (described later in this chapter).
2. During the first bus cycle of its ownership (before the first rising clock edge), the initiator
drives the address for the bus transaction onto BADDR. At the same time, it asserts BFRAME
to indicate that the bus is in use. It will de-assert BFRAME before it sends or accepts the last
word of data. In most cases, the initiator will assert BIRDY to indicate that it is ready to
receive data (for read operations) or is driving valid data (for write operations). If the operation
is a write, the initiator will drive valid data onto BDATA.
3. At the rising edge of the first clock, all agents sample BADDR and decode it to determine
which agent will be the target.
4. The agent that determines that the address is within its address space asserts BSEL sometime
after the first rising edge of the bus clock. BSEL stays asserted until the transaction is complete.
5. The initiator and the target transfer data either in one cycle or in successive cycles. The agent
driving data (the initiator for a write, the target for a read) indicates valid data by asserting its
ready signal (BIRDY or BTRDY for writes and reads, respectively). The agent receiving data (target
for a write, initiator for a read) indicates its ability to receive the data by asserting its ready
signal. Either agent may de-assert its ready signal to indicate that it cannot source or accept
data on this particular clock edge.
6. When the initiator is ready to send or receive the last word of data, that is, when it asserts
BIRDY for the last time, it also de-asserts BFRAME. It will de-assert BIRDY when the last
word of data is transferred.
7. The arbiter grants the bus to the next initiator, and may do so during a bus transfer by a different
initiator. The new initiator must sample BFRAME and BIRDY. When both BIRDY and
BFRAME are sampled de-asserted and the new initiator has been given grant, it can assert
BFRAME in the next cycle to start a new transaction.
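The ready-signal handshake in step 5 can be illustrated with a short cycle-level sketch. It is a simplified model under stated assumptions: the function name `transfer` is hypothetical, each list element represents the value of a ready signal sampled at one rising clock edge, and address phase, BFRAME and BSEL are not modelled.

```python
def transfer(words, sender_ready, receiver_ready):
    """Return (cycle, word) pairs for a data phase.

    A word moves on each clock edge where both ready signals are sampled
    asserted; otherwise the sender holds the same word for another cycle.
    """
    moved = []
    pending = list(words)
    for cycle, (s, r) in enumerate(zip(sender_ready, receiver_ready)):
        if not pending:
            break
        if s and r:
            moved.append((cycle, pending.pop(0)))
    return moved
```

For a read, `sender_ready` corresponds to BTRDY and `receiver_ready` to BIRDY; for a write the roles are swapped. A cycle where either signal is de-asserted acts as a wait state.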
NOTE: in the examples below, the signals BADDR and BDATA are often shown to be in a high-impedance
state. In reality, internal bus signals should always be driven, even if they are not being sampled. The Hi-Z
states are shown for conceptual purposes only.
This operation is used to read a word, halfword or byte from memory, usually in uncacheable address space.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
This is a simple read operation where the target responds immediately with data. This is unlikely, since most
bus memory will require one or more cycles to fetch data. This example illustrates the most basic read
operation without waits.
2. Target asserts BSEL to indicate to initiator that a target is responding. In this example, there is
an immediate fetch of data, so Target drives data and asserts BTRDY to indicate to the initiator that it
is driving data. The Initiator de-asserts BFRAME and asserts BIRDY to indicate that the next
piece of data received will be the last.
3. Initiator de-asserts BIRDY and the target de-asserts BSEL and BTRDY to indicate the end of
the transaction. The Initiator that has been given grant owns the bus this cycle.
This is the same as the single-cycle read, except that the target needs time to fetch the data from memory.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
2. Target asserts BSEL to indicate that it has decoded the address and is acknowledging that it is
the target device. However, it is not ready to send data, so it does not assert BTRDY. Initiator
de-asserts BFRAME and asserts BIRDY to indicate that the next piece of data will be the last it
wants.
4. After a second wait cycle, target drives data and asserts BTRDY to indicate that data is on the
bus.
5. Target de-asserts BSEL and BTRDY. Initiator de-asserts BIRDY. Another initiator may drive
the bus this cycle.
This operation is used to service a cache miss. Four words of data are transferred in sequence. In this
example, the target is supplying four words of data without any waits.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
2. Target asserts BSEL to indicate that it has decoded the address and will send data when it is
ready. Initiator asserts BIRDY to indicate that it is ready to receive data.
6. Target drives last word of data. Initiator de-asserts BFRAME to indicate that the next word of
data it receives will be the last it needs.
7. Target de-asserts BTRDY and BSEL; initiator de-asserts BIRDY. Another master may gain
ownership of the bus this cycle.
This illustrates what happens when a target needs extra time to fetch data it needs to service a cache miss.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
2. Target asserts BSEL to indicate that it is acknowledging the operation. Initiator asserts BIRDY
to indicate that it is ready to receive data.
This occurs when a line of data is requested from the target and the initiator cannot accept all of the data in
successive cycles.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
2. Target asserts BSEL. It doesn’t have data, so it does not assert BTRDY. Initiator asserts BIRDY
to indicate that it can accept data.
3. Target now has data, so it drives the data and asserts BTRDY.
4. Target drives second word of data; initiator cannot accept it, so it de-asserts BIRDY.
5. Target holds second word of data; initiator can accept it and asserts BIRDY.
7. Target drives fourth word of data; initiator cannot accept it and de-asserts BIRDY. Initiator holds
BFRAME until it can assert BIRDY.
8. Initiator asserts BIRDY to accept fourth word of data. It de-asserts BFRAME to indicate this is
the last word of data.
A single-cycle write operation occurs almost every time the LX5280 processor executes a store instruction.
This is because the cache used in the processor uses a write-through policy. Of course, writes to uncacheable
address space and to an I/O device will also generate a single-word write. Single-word write operations are
used to write words, halfwords and bytes.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
2. Target samples address and asserts BSEL. Initiator drives data and asserts BIRDY. In this case,
target is also able to accept data, so it asserts BTRDY. Initiator also de-asserts BFRAME to
indicate that it is ready to send the last (and only) word of data.
3. Target accepts data, de-asserts BTRDY and BSEL. Initiator de-asserts BIRDY.
This is an example of a single-cycle write operation where the target cannot immediately accept data and
must insert wait states.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
The description is the same as for the example above, except that the target inserts two wait states before it asserts
BTRDY to indicate acceptance of data.
A burst write operation is generally used to transfer large amounts of data from an I/O device to memory via
a DMA transfer. The following illustrates a best-case scenario with no wait states.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
2. Target asserts BSEL and BTRDY to indicate it will accept data. Initiator drives data and asserts
BIRDY.
3. Initiator drives next word of data; target continues to accept data and indicates as such by
continuing to assert BTRDY.
5. Initiator drives fourth word of data and de-asserts BFRAME to indicate that this will be its last
word sent; target accepts data.
6. Target de-asserts BTRDY and BSEL; initiator gives up control of the bus by de-asserting
BIRDY.
This example is similar to the above example, except that during the third and fourth data word transfer, the
target cannot accept the data quickly enough, so it de-asserts BTRDY which indicates to the initiator that it
should hold the data for an additional cycle.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
The example illustrates what happens when the initiator cannot supply data fast enough and has to insert
waits.
[Timing diagram: CLOCK, BFRAME, BADDR, BDATA, BIRDY, BTRDY, BSEL]
The table below summarizes the LX5280 LBC ports. The "LBC Port" column indicates the name of the port
supplied by the LBC. The "Bus Signal" column indicates the corresponding Lexra bus signal. The LBC ports
are strictly uni-directional, while the bus signals (at least conceptually) include multiple sources and sinks.
The manner in which LBC ports are connected to bus signals is technology dependent, and may employ tri-
state drivers or logic gating in conjunction with the LBC’s LCoe, LDoe and LToe outputs.
8.9. Arbitration
8.9.1. Rules
1. Master asserts REQ at the beginning of a cycle and may start sampling for asserted GNT in the
same cycle (in case GNT is already asserted, as in the case of a “park”).
2. If the bus is idle or it is the last data phase of the previous transaction when the master samples
asserted GNT, the master may assert FRAME on the next cycle.
3. If the bus is busy when the master samples GNT, it must also snoop FRAME, IRDY and TRDY.
One cycle after FRAME is not asserted and both IRDY and TRDY are asserted (indicating the
last data phase), if GNT is still asserted, the master may now drive FRAME (i.e. GNT &
~Frame_R & (Irdy_R & Trdy_R)).
The LBC, when it needs access to the bus, asserts REQ and in the same cycle samples GNT, ~FRAME, and
either ~IRDY or (IRDY & TRDY). If these are true, then the LBC will on the next cycle take ownership of
the bus. REQ is deasserted on the cycle after LBC asserts FRAME. If the bus is busy, LBC continues to
snoop these four signals for this condition. All other Lbus arbitration rules can be based on this behavior of
the LBC.
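The condition the LBC samples can be written as a single boolean expression. The sketch below is illustrative only; the function name `may_take_bus` is hypothetical, and the arguments are the signal values sampled in one cycle (true meaning asserted).

```python
def may_take_bus(gnt, frame, irdy, trdy):
    """True when a requesting master may assert FRAME on the next cycle.

    Implements GNT & ~FRAME & (~IRDY | (IRDY & TRDY)):
    the bus is either idle or in the last data phase of a transaction.
    """
    idle = not frame and not irdy
    last_data_phase = not frame and irdy and trdy
    return bool(gnt and (idle or last_data_phase))
```

With FRAME de-asserted but IRDY asserted and TRDY de-asserted, the last data word has not yet transferred, so the master keeps snooping.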
There are three sets of output enables: TOE (valid for the length of the transaction), COE (valid for only the
first cycle of a transaction), and DOE (valid for data transfers, asserted by the master for writes and by the
slave for reads).
FRAME
IRDY
CMD
ADDR
DATA
There is no output enable to qualify TRDY and SEL. These are defined by customer logic for slave devices.
Instead of using TOE, it may be desirable to OR all of the FRAME signals, either centrally or with one OR
gate for each target and master. The same holds true for IRDY, TRDY, and SEL. This simplifies the
connections when relatively few devices are used and there are no off-chip devices connected
directly to the Lexra Bus.
Therefore, masters and slaves not taking part in a transaction must always keep FRAME, IRDY,
TRDY, and SEL driven and de-asserted.
The LX5280 processor provides customer access points for the Coprocessor Interfaces. This section provides
a description of these access points. Attachment of memory devices to the LMIs, the System Bus, and the
EJTAG interface are described in separate chapters.
A coprocessor may contain up to 32 general registers and up to 32 control registers. Each of these registers is
up to 32 bits wide. Typically, programs use the general registers for loading and storing data on which the
coprocessor operates. Data is moved to the coprocessor’s general registers from the core’s general registers
with the MTCz instruction. Data is moved from the coprocessor’s general registers to the core’s general
registers with the MFCz instruction. Main memory data is loaded into or stored from the coprocessor’s
general registers with the LWCz and SWCz instructions.
Programs may load and store the coprocessor’s control registers from the core’s general registers with the
CTCz and CFCz instructions respectively. Programs may not load or store the control registers directly from
main memory.
The coprocessor may also provide a condition flag to the core. The condition flag can be a bit of a control
register or a logical function of several control register values. The condition flag is tested with the BCzT and
BCzF instructions. These instructions indicate that the program should branch if the condition is true (BCzT)
or false (BCzF).
The CI provides the mechanism to attach the custom coprocessor to the core. The CI snoops the instruction
bus for coprocessor instructions and then gives the coprocessor the signals necessary for reading or writing
the general and control registers.
The addresses, output data, and control signals are supplied to the user’s Coprocessor on the rising edge of the
system clock. In the case of a read cycle, the coprocessor must supply the data from either the control or
general register on C<z>rd_data by the end of the same cycle. Similarly, the write of data from C<z>wr_data
to the addressed control or general register must be complete by the end of the cycle.
The CI incorporates a forwarding path so that data which is written in instruction (N) can be read in
instruction (N + 2). The Coprocessor registers should be implemented as positive-edge flip-flops using the
LX5280 system clock.
During a coprocessor write, the CI sends C<z>wr_addr and C<z>wr_data, and asserts either C<z>wr_gen or
C<z>wr_con. The coprocessor must complete the write to the appropriate
register on the subsequent rising edge of the clock. The target register is a decoding of C<z>wr_addr,
C<z>wr_gen and C<z>wr_con. The instructions causing a coprocessor write are LWCz, MTCz, and
CTCz.
During a coprocessor read, the CI sends C<z>rd_addr and asserts either C<z>rd_gen or C<z>rd_con. The
coprocessor must return valid data through C<z>rd_data in the following clock cycle. If the core asserts
C<z>rhold, indicating that it is not ready to accept the coprocessor data, the coprocessor must hold the
previous value of C<z>rd_data. The target register for the read is a decoding of C<z>rd_addr, C<z>rd_gen,
and C<z>rd_con. The instructions causing a coprocessor read are SWCz, MFCz, and CFCz.
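The read and write decoding described above can be sketched as a behavioural model. This is a simplified illustration, not RTL: the class name `CoprocessorRegs` and its `clock` method are hypothetical, each call represents one rising clock edge, and the read data is latched at the edge rather than derived from a registered address.

```python
class CoprocessorRegs:
    """Hypothetical model of a coprocessor with 32 general and 32 control registers."""

    def __init__(self):
        self.general = [0] * 32
        self.control = [0] * 32
        self.rd_data = 0  # value presented on the read data output

    def clock(self, wr_addr=0, wr_data=0, wr_gen=False, wr_con=False,
              rd_addr=0, rd_gen=False, rd_con=False, rhold=False):
        """Advance one rising clock edge."""
        # Write: the target register is a decoding of wr_addr, wr_gen and wr_con.
        if wr_gen:
            self.general[wr_addr] = wr_data & 0xFFFFFFFF
        elif wr_con:
            self.control[wr_addr] = wr_data & 0xFFFFFFFF
        # Read: while rhold is asserted, the previous rd_data value is held.
        if not rhold:
            if rd_gen:
                self.rd_data = self.general[rd_addr]
            elif rd_con:
                self.rd_data = self.control[rd_addr]
```

The model keeps the general and control banks separate, so the same address selects different registers depending on which control signal is asserted.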
The CPU stalls the pipeline so that the program can access data read by a coprocessor instruction in the
immediately following instruction. For example, if an MFCz instruction reads data from the coprocessor and
stores it in the core’s general register $4, the program can get access to that data in the following instruction:
When the core initiates a coprocessor read, the coprocessor must return valid data in the following clock
cycle. The coprocessor cannot stall the CPU. Applications must ensure that the source code does not access
invalid coprocessor data if the coprocessor operations take several clock cycles to complete. This is done in
one of three ways:
• Ensure that code does not access data from the coprocessor until N instructions after the
coprocessor operation has started. This is the least desirable method as it depends on the
relative execution of the core and coprocessor. It can also complicate software debug.
• Have the coprocessor send an interrupt to the core, and the service routine for that
interrupt accesses the appropriate coprocessor registers.
• Have the coprocessor set the C<z>condin flag when its operation is complete. The source
Coprocessor writes occur in the W stage of the instruction pipeline. For coprocessor reads, the core generates
address, rd_gen, and rd_con signals during the S stage, and the coprocessor returns data during the E stage
which is passed by the CI to the core in the M stage. The core introduces a pipeline bubble after coprocessor
instructions to ensure that the result of an MTCz instruction can be used by the immediately following
instruction.
In particular, if there are back-to-back MTCz and MFCz instructions that access the same coprocessor
register, the pipeline bubble still does not allow a cycle between the W stage write and E stage read as
required. In this case a special forwarding path within the CI is used. That is, the “true” data from the
coprocessor is ignored. Instead the exact data from the MTCz is used.
mtc2        I D S E M W
bubble        I D . . . .
mfc2            I D S E M W    # data forwarded by CI from mtc2
wr_gen (W)            X
rd_gen (S)          X
rd_data(E)            X
The forwarding path can cause side effects if the coprocessor does not implement all of the bits of a register,
contains read-only bits, or updates the register value upon reading the register. In such cases, the mfc2
instruction returns different data from what it would if the core did not activate the forwarding path. To avoid
the forwarding path, another instruction must be inserted between the mtc2 and mfc2:
mtc2        I D S E M W
bubble        I D . . . .
foo             I D S E M W
mfc2              I D S E M W  # read data from coprocessor
wr_gen (W)            X
rd_data(E)              X
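The timing relationship in the two pipeline diagrams can be checked with a small calculation. This sketch assumes a six-stage pipe in the order I, D, S, E, M, W, that one instruction slot enters the pipe per cycle, and that the bubble occupies one slot; the function names are hypothetical.

```python
STAGES = "IDSEMW"  # assumed stage order, one cycle per stage


def stage_cycle(issue_cycle, stage):
    """Cycle in which an instruction entering I at issue_cycle reaches the given stage."""
    return issue_cycle + STAGES.index(stage)


def needs_forwarding(mtc_issue, mfc_issue):
    """True when the mfc E-stage read does not fall after the mtc W-stage write."""
    return stage_cycle(mfc_issue, "E") <= stage_cycle(mtc_issue, "W")
```

With mtc2 in slot 0, the bubble in slot 1 and mfc2 in slot 2, the E-stage read coincides with the W-stage write, so the CI forwarding path is used. Inserting one more instruction moves mfc2 to slot 3, and the read then sees the value stored in the coprocessor register.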
The coprocessor must register the read address and the control signals rd_gen and rd_con. It must hold the (E
stage) registered values of these signals when C<z>_rhold is active high, and should make the read data
output a function of the (E stage) registered read address and control signals.
The wr_addr, wr_data, wr_gen and wr_con signals need not be registered. The coprocessor may decode these
(W stage) signals directly to the appropriate register.
Under certain circumstances the instruction pipeline can contain an instruction that must be discarded. This
can be due to mispredicted branches, cache misses, exceptions, inserted pipeline bubbles etc. In such cases,
For the coprocessor write-type instructions, the CI will only issue the W stage control signals wr_gen and
wr_con for valid instructions. The coprocessor does not need to qualify these controls.
For the coprocessor read-type instructions, the CI may issue the S stage control signals rd_gen and rd_con for
instructions that must be discarded. If the coprocessor can tolerate speculative reads then it need not qualify
those signals. However, if the coprocessor performs “destructive” reads, such as updating a FIFO pointer
upon read, then it must use the qualifying signals C<z>_xcpn_m and C<z>_invld_m as follows:
The C<z>_xcpn_m signal is used to discard any S stage (from CI) rd_gen or rd_con signal and any E
stage (registered in the coprocessor) rd_gen or rd_con signal. It indicates that a preceding instruction in the
pipe has taken an exception and that subsequent instructions in the pipe must be discarded.
The C<z>_invld_m signal is used to invalidate the operation of the current instruction in the M stage.
This can be for various reasons not limited to an exception on a preceding instruction. If the coprocessor
cannot tolerate speculative reads, it must register an M stage version of rd_gen and rd_con. The coprocessor
must use the C<z>_rhold signal to hold this M stage version (as well as the E stage version). If
C<z>_invld_m is asserted, then any such M stage signals must be discarded. To summarize, a rd_gen or
rd_con instruction can “retire” only if it reaches the M stage and neither C<z>_rhold nor C<z>_invld_m is
asserted.
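The retire condition summarized above can be sketched for a coprocessor with a destructive read, such as a FIFO. The class `FifoReadPort` and its method are hypothetical, and the sketch models only the M-stage decision; discarding S and E stage reads on C<z>_xcpn_m is not modelled.

```python
class FifoReadPort:
    """Hypothetical FIFO whose read pointer advances only when a read retires."""

    def __init__(self, items):
        self.items = list(items)
        self.head = 0  # read pointer

    def m_stage(self, rd_m, rhold, invld_m):
        """Evaluate one cycle; rd_m is the M-stage registered rd_gen or rd_con.

        Returns True if the read retires in this cycle.
        """
        retire = bool(rd_m and not rhold and not invld_m)
        if retire:
            self.head += 1  # the destructive side effect happens only here
        return retire
```

Deferring the pointer update to the M stage means a speculative read that is later invalidated leaves the FIFO state unchanged.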
10.1. Introduction
Given the increasing complexity of SoC designs, the nature of embedded processor-design debug, hardware
and software, and the time-to-market requirements of embedded systems, a debug solution is needed which
allows on-chip processor visibility in a cost-effective, I/O constrained manner.
Lexra’s EJTAG solution meets all such requirements. It uses existing IEEE JTAG pins and allows fast bring-up
on new designs. It provides a way of debugging all devices accessible to the processor in the same way the
processor would access those devices itself. Using EJTAG, a debug probe can access all the processor
internal registers and caches. It can also access devices connected to the Lexra Bus, bypassing internal caches
and memories.
Software debug is enhanced by EJTAG features that allow single-stepping through code and halting on
breakpoints (hardware and software, address and data with masking). For debugging problems that are
artifacts of real-time interactions, EJTAG gives real-time Program Counter trace capabilities from which an
accurate program execution history is derived. From the code-system perspective, PC profiling provides
statistical analysis of code usage to guide code optimization.
10.2. Overview
A debug host computer communicates to the EJTAG probe through either a serial or parallel port or Ethernet
connection. The probe, in turn, communicates to the LX5280 EJTAG hardware via the included IEEE 1149.1
JTAG interface. Through the use of the JTAG TAP controller, probe data is shifted into the EJTAG data and
control registers in the LX5280 to respond to processor requests, DMA into system memory, configure the
EJTAG control logic, enable single-step mode, or configure the EJTAG breakpoint registers. Through the use
of the EJTAG control registers, the user can set hardware breakpoints on the instruction cache address, data
cache address or data cache data values.
Physical address range 0xFF20_0000 to 0xFF3F_FFFF is reserved for EJTAG use only and should not be
mapped to any other device.
Currently, Embedded Performance Inc. (EPI) and Green Hills Inc. provide EJTAG debuggers and probes for
the LX5280. Information on these products is available at the following web sites.
LX5280 EJTAG implements all required features of version 2.0.0 of the EJTAG specification, and includes
support for the following features:
• Host probe can DMA directly into system memory or I/O devices.
• Debug exception and two new debug instructions: one for raising a debug exception via
software, and one for returning from a debug exception.
IEEE JTAG pins used by EJTAG are shown below. These are required for all EJTAG implementations.
JTAG_TRST_N is an optional pin.
JTAG_TMS      Input   Test Mode Select. Connected to each EJTAG TAP controller.
JTAG_TRST_N   Input   TAP controller reset. Connected to each EJTAG TAP controller.
Signal Name Probe Budget Core Budget Slack remaining for other logic
The LX5280 EJTAG includes support for real-time Program Counter Trace (PC Trace). When in PC Trace
mode, the LX5280 will serially output a new value of the program counter whenever a change in program
control occurs (i.e. branch or jump instruction, or an exception).
When the PC Trace option is set to EXPORT in lconfig, the following signals will be output from the
LX5280: DCLK, PCST, and TPC. These are described in more detail in the following subsections.
The DCLK output is used to synchronize the probe with the LX5280’s SYSCLK.
The PCST (PC Trace Status) signals are used to indicate the status of program execution. Example status
indications are sequential instruction, pipeline stall, branch, or exception.
The TPC pins output the value of the PC every time there is a change of program control.
The maximum speed allowed for the Debug Clock (DCLK) output is 100MHz (as an EPI probe
requirement). As cores typically run in excess of this speed, DCLK can be set to a divided-down value of
SYSCLK. This is set by the DCLK N parameter in lconfig, which indicates the ratio of SYSCLK frequency
to DCLK: 1, 2, 3 or 4.
The Program Counter Status (PCST) output comprises N sets of 3-bit PCST values, where N is configurable
as 1, 2, 3 or 4 via lconfig. A PCST value is generated every SYSCLK cycle. When DCLK is slower than the
LX5280’s SYSCLK, up to N PCST values are output simultaneously.
The bus width of the Target Program Counter (TPC) output is user configured in lconfig via the “M”
parameter to be one of 1, 2, 4 or 8 bits. When a change in program flow occurs, the current PC value is sent out
of TPC. As the PC is 32 bits wide, the number of TPC pins affects how quickly the PC is sent. For example,
if the TPC is 4 bits wide the PC will take 8 DCLK cycles to be sent. If another change in flow occurs while
the PC of the previous change is being transmitted, the new PC will be sent and the remainder of the previous
PC will be lost.
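The relationship between the TPC width and the transmission time is a simple division. The helper name `tpc_cycles` below is hypothetical; it only restates the arithmetic from the paragraph above.

```python
def tpc_cycles(m_bits, pc_bits=32):
    """Number of DCLK cycles needed to send one PC value over an M-bit TPC bus."""
    if m_bits not in (1, 2, 4, 8):
        raise ValueError("M must be 1, 2, 4 or 8")
    return pc_bits // m_bits
```

A narrower TPC bus saves pins but lengthens the window in which a second change of flow can truncate the PC still being sent.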
The TPC bus also outputs the exception type when an exception occurs. The exception type field-width is
either 3- or 4-bits depending on whether or not vectored interrupts are present. This is covered in more detail
below.
To reduce pinout, the TDO output is used for the least significant bit of TPC (or the only bit if “M” is set to 1).
The EJTAG PC Trace facility specifies that a PCST (PC Trace Status) code is issued if the instruction pipeline
has stalled, sequentially completed an instruction, or taken a branch or jump. In order to accommodate the
two pipelines in the LX5280, the capability of emitting more than one PCST code per cycle is employed.
Specifically, to the external EJTAG probe, the LX5280 appears to be a single pipe machine running at twice
the speed that it actually does.
Since there must be an even number of PCST codes made available at every DCLK rising edge (in the
EJTAG nomenclature), the DCLK parameter “N” must be set to 2 or 4. Setting the DCLK N parameter to 2
results in DCLK running at the same frequency as SYSCLK; setting the parameter to 4 results in DCLK
running at one-half the frequency of SYSCLK.
The maximum value of the N parameter is 4, and the maximum DCLK frequency is 100MHz. Therefore,
until the EJTAG specification is extended beyond N=4 or a maximum DCLK of 100MHz, the maximum
SYSCLK frequency for which dual-pipe PC Trace can be used is 200 MHz.
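The 200 MHz limit follows from the two PCST codes produced per SYSCLK cycle by the dual pipelines. The sketch below restates that arithmetic; the function names are hypothetical, and the 100 MHz default is the DCLK limit quoted above.

```python
def dclk_mhz(sysclk_mhz, n):
    """DCLK frequency for dual-pipe PC Trace: N codes per DCLK, 2 codes per SYSCLK."""
    if n not in (2, 4):
        raise ValueError("dual-pipe PC Trace requires N = 2 or N = 4")
    return sysclk_mhz * 2 / n


def max_sysclk_mhz(n, dclk_max_mhz=100):
    """Highest SYSCLK for which DCLK stays within its maximum frequency."""
    if n not in (2, 4):
        raise ValueError("dual-pipe PC Trace requires N = 2 or N = 4")
    return dclk_max_mhz * n / 2
```

For example, N = 4 at a 200 MHz SYSCLK gives a 100 MHz DCLK, which is exactly the probe limit.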
JPT_TPC_DR (M bits)      O/P   The PC value is output on these pins when a PC-discontinuity occurs.
JPT_PCST_DR (N*3 bits)   O/P   PC Trace Status: Outputs current instruction type every DCLK.
JPT_DCLK                 O/P   PCST and TPC clock. Frequency determined as a fraction of SYSCLK
                               via the N parameter. Maximum frequency of DCLK is 100MHz.
Low Time 4 ns
The EJTAG PC Trace facility specifies a 3-bit code be output on the TPC output when an exception occurs
(the PCST pins give the EXP code). In order to distinguish the eight vectored interrupts in the LX5280 from
all other exceptions, a 4-bit code is used instead.
For all exceptions other than vectored interrupts, the most significant bit of the 4-bit code is zero and the
remaining 3 bits are the standard 3-bit code. Note that this includes the standard software and hardware
interrupts numbered 0 through 7.
For vectored interrupts, the most significant bit is always 1. The 4-bit code is simply the number of the
vectored interrupt (from 8 through 15) being taken.
Since the target of the vectored interrupt is determined by the contents of the INTVEC register, the debug
software which monitors the EJTAG PC Trace codes must be aware of the contents of this register in order to
trace the code after the vectored interrupt is taken.
For probes that do not support a 4-bit exception code, the LX5280 can be configured via the
EJTAG_XV_BITS lconfig option to use only the 3-bit standard codes. In that case, if a vectored interrupt is
taken, the 3-bit code for RESET will be presented.
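The exception code selection described above can be sketched as follows. The function `tpc_exception_code` is hypothetical, and the numeric value of the standard RESET code is not given in this section, so it is passed in by the caller rather than assumed.

```python
def tpc_exception_code(std_code=None, vectored_irq=None, four_bit=True, reset_code=None):
    """Return the exception code presented on TPC.

    std_code:     standard 3-bit EJTAG code for a non-vectored exception (0..7)
    vectored_irq: vectored interrupt number (8..15), if one is being taken
    four_bit:     True for 4-bit codes, False when only 3-bit codes are configured
    reset_code:   standard 3-bit RESET code (caller supplied, value not assumed here)
    """
    if vectored_irq is not None:
        if not 8 <= vectored_irq <= 15:
            raise ValueError("vectored interrupts are numbered 8 through 15")
        if four_bit:
            return vectored_irq  # most significant bit is 1 for 8..15
        if reset_code is None:
            raise ValueError("reset_code is required in 3-bit mode")
        return reset_code & 0b111  # 3-bit mode presents the RESET code
    if std_code is None or not 0 <= std_code <= 7:
        raise ValueError("std_code must be in 0..7")
    return std_code  # most significant bit is 0 in 4-bit mode
```

Because 3-bit mode reports vectored interrupts with the RESET code, trace software in that mode cannot tell them apart from the code alone.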
In normal EJTAG PC Trace mode, TDI and TDO are multiplexed with the debug interrupt (DINT) and the lsb of
the TPC (TPC[0]). This reduces the number of pins required by PC Trace, but has
the unfortunate side effect of preventing any access to EJTAG registers during PC Trace.
In order to allow access to EJTAG registers during PC Trace, and to facilitate PC Trace in multiprocessor
environments, the lconfig option JTAG_TRST_IS_TPC=YES causes TDI and TDO to be demultiplexed
such that TRST is used as TPC[0] and DINT is generated via EJTAG registers. Note: setting this option may
require changes in EJTAG probe hardware. Check with the probe manufacturer for details.
This section provides a summary of the configuration options available with lconfig. Refer to lconfig forms
for a detailed description of these form options.
The timing information indicates the point within a cycle when the signal is stable, in terms of percent. The
timing information also includes parenthetical references to these notes:
2. Clocked in the BUSCLK domain if crossbar or LBC are asynchronous. Otherwise, clocked in
the SYSCLK domain.
4. A constant that is treated as false path for timing analysis. These inputs must not change after
the processor is taken out of reset.
6. A test-related input or output that is treated as false path for timing analysis. Such inputs must
not change during normal at-speed operation.
7. An asynchronous input.
The table below shows the possible port connections for the top level module of the LX5280 processor,
known as lx2. The actual ports that are present depend upon lconfig settings. The timing information and
notes have the same meaning as for the previous table.
Names that include _N indicate active low signals. All other signals are active high unless otherwise
indicated.
For single bit signals, the signal name and signal description indicate the action or function when the signal is
in the active state.
CResetN input 10% Cold reset (or power on), active low.
Configuration
Coprocessor Interface
Issue stall: an invalid instruction enters the pipe, while any other valid instructions in the pipe advance.
Pipeline stall: All instructions in either pipe stay in the same stage, and do not advance.
Dual-issue interlock: Only one of the potential pair of instructions enters a pipe, the other instruction of the
pair waits for the next cycle to enter.
These instruction groupings are used to describe stall conditions that are based on the type of instructions in
the pipeline.
M-I-Mac         MULT(U), DIV(U), MFHI, MFLO, MTHI, MTLO
M16-LoadStore   LB, LH, LWSP, LW, LBU, LHU, LWPC, SB, SH, SWSP, SW, SWRASP
RAD-LoadStore   LT, LTP, LWP, LHP, LBP, LHPU, LBPU, ST, STP, SWP, SHP, SBP
In order for a pair of instructions to dual-issue, they must be found in the same aligned doubleword.
An UnlinkedBranch can dual issue with the preceding instruction, if no other rules are violated. The
delay slot instruction of an M-I-UnlinkedBranch single issues in the cycle following the
UnlinkedBranch.
A pair of instructions will NOT dual issue if the second instruction uses a register updated by the
first instruction. This does not apply to register 0, which never causes an interlock.
A pair of instructions will NOT dual issue if the second instruction updates a register updated by the
first instruction. Unless the common target register is also a source register of the second instruction
(in which case the RAW interlock applies), no useful program is expected to include such a pair of
instructions, since the results of the first update are lost. This does not apply to register 0, which
never causes an interlock.
Examples:
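The RAW and WAW dual-issue interlocks above can be sketched as a register comparison. This is an illustrative check only: the function name `dual_issue_interlock` is hypothetical, registers are given as integer numbers, and all other dual-issue rules (alignment, branches, Store Twinword) are ignored.

```python
def dual_issue_interlock(first_dst, second_srcs, second_dst):
    """Return 'RAW', 'WAW' or None for a candidate instruction pair.

    first_dst:   register updated by the first instruction (None if none)
    second_srcs: registers read by the second instruction
    second_dst:  register updated by the second instruction (None if none)
    """
    if first_dst is None or first_dst == 0:
        return None  # register 0 never causes an interlock
    if first_dst in second_srcs:
        return "RAW"  # second instruction uses the first instruction's result
    if first_dst == second_dst:
        return "WAW"  # both instructions update the same register
    return None
```

When the common target register is also a source of the second instruction, the function reports RAW, matching the precedence stated in the rule.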
The branch rules are a consequence of the fact that all branches are assumed to be taken.
The Store Twinword instructions (ST, STP) always single issue, because they use 3 of the 4 register
file read ports, leaving only one for the other instruction, which usually needs two read ports.
There is one unconditional issue stall after any M16 Load instruction (there is no M16 target
register analysis).
After a Load instruction to a target register, an instruction which follows the load by one CYCLE
and uses the target register of the load will stall issue for one cycle.
Note: The architectural load-delay slot has been eliminated. This issue stall applies even to the
This does NOT apply to M16 Loads, since they are always followed by a single cycle issue stall.
Examples:
For Twinword Loads (LT, LTP) this rule applies to both of the target registers in the register-pair
operand.
For Radiax Pointer-Update Load instructions (LBP, LHP, LTP, LWP, LBPU), this rule does NOT
apply to the updated pointer register, which is covered by the RAW and WAW hazard dual-issue
interlocks.
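The load-use rule and its Twinword and Pointer-Update qualifications can be sketched as a single predicate. This is a minimal model under stated assumptions: the function name and parameters are hypothetical, registers are plain integers, and only the load-use issue stall is modeled (not the Byte/Halfword, Store-Load or M16 stalls).

```python
def load_use_issue_stall(load_targets, next_sources, cycles_after_load=1,
                         is_m16_load=False):
    """Return the issue-stall cycles (0 or 1) for the load-use rule only.

    load_targets: registers loaded from memory. For LT/LTP pass both
                  registers of the pair. For pointer-update loads leave out
                  the updated pointer register, which the RAW/WAW dual-issue
                  interlocks cover instead.
    next_sources: registers read by the following instruction.
    """
    if is_m16_load:
        return 0  # M16 loads take their own unconditional issue stall
    if cycles_after_load != 1:
        return 0  # the rule covers only the instruction one CYCLE later
    return 1 if set(load_targets) & set(next_sources) else 0
```

With `lw s0,0(a1)` (target register 16) followed one cycle later by an instruction that reads register 16, the function returns 1. For a pointer-update load whose pointer is register 5, a follower that reads only register 5 returns 0 here.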
Load instructions which have Byte or Halfword operands always cause a one-cycle stall.
Store-Load stall:
A Load instruction which follows a Store instruction by one CYCLE always causes a one-cycle
stall.
Note: This stall only applies if the Store instruction hits in the Dcache or has a Byte or Halfword
operand.
Examples:
Any Store instruction which follows a Store Twinword instruction (ST, STP) by one CYCLE always
causes a single-cycle stall.
Examples:
A Store instruction which has a Byte or Halfword operand, and which follows any Store instruction
by one CYCLE, always causes a one-cycle stall. This cycle includes any potential StoreTwin-
StoreAny stall.
Examples:
The following table summarizes the stall rules related to Load and Store instructions described above. This
table does NOT include the RAW and WAW dual-issue interlocks. In this table, the "2nd OP" refers to an
instruction which issues in the CYCLE after the "1st OP".
1st op
non 1U
1U 2U 1S - - -
load-store 2S
LT, LTP
LW, LWP
LB(U)
1U 2U 1S 2S 1U 1U 1U
LH(U)
LBP(U)
LHP(U)
SW, SWP
1U 2U 1S 2S - - 1U
ST, STP
Notes:
- means no stalls
xU indicates unconditional stall for the indicated number of cycles
xS indicates stall only if 2ndOp Source = 1stOp Load-target
xW indicates stall if data RAMs have word-write granularity
The Mac in the 5280 eliminates all programming hazards between Mac instructions by stalling the pipeline
as necessary. This is done both to avoid resource conflicts and to wait for results of a first instruction
that are needed by a second instruction.
The following table indicates the number of cycles that must be inserted between the first indicated
instruction and the second. A zero (or dash) indicates that the instructions can issue back-to-back to the Mac
pipe with no stalls. A non-zero number indicates the number of stall cycles that will occur if the instructions
are issued in consecutive cycles. These stall cycles are available for any other non-Mac instructions, but
should NOT be filled with NOPs since that would only increase the code footprint without improving
performance.
MADDA2[.S]
MSUBA2[.S]
ADDMA[.S]
SUBMA[.S]
MULTA2
1st Op MULNA2
MADDA(U) RNDA2 MFA
MSUBA(U) MTA2 MFA2
MULTA(U) MAD(U) DIVA(U) MTHI MFHI
2nd Op MULT(U) MSUB(U) CMULTA DIV(U) MTLO MFLO
MULTA(U)
MADDA(U)
MSUBA(U) 1U 1U 1U (19T) - -
MAD(U)
MSUB(U)
DIVA(U)
(3T) (4T) (1T) 19U - -
DIV(U)
CMULTA
MADDA2[.S]
MSUBA2[.S]
MULTA2
3U 4U 1U (19T) - -
MULNA2
MTA2
MTHI
MTLO
ADDMA[.S] LO HI
1S (19S)
SUBMA[.S] 2S 3S 4U - -
1T (19T)
RNDA2 2T 3T
MFA
MFA2 LO HI LO HI
MFHI 4S 5S 5S 6S 3S 19S 2S -
MFLO
Nomenclature:
Delay of “x” cycles means that if the 1st Op issues in cycle N, then the 2nd Op may
issue in cycle N+x+1.
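The nomenclature can be restated as a one-line calculation. The function name is hypothetical, and the delay value must be taken from the table for the specific instruction pair.

```python
def earliest_second_issue(first_issue_cycle, delay):
    """Earliest cycle in which the 2nd Op may issue: N + x + 1."""
    return first_issue_cycle + delay + 1


# A delay of 0 means back-to-back issue: 1st Op in cycle 10, 2nd Op in cycle 11.
print(earliest_second_issue(10, 0))   # 11
# A delay of 19 (a value that appears in the table) pushes the 2nd Op to cycle 20.
print(earliest_second_issue(0, 19))   # 20
```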
Examples:
The coprocessor move instructions (M-I: LWCz, MTCz, MFCz, and Radiax: MTLXC0, MFLXC0, MTRU,
MFRU, MTRK, MFRK) always single issue and are always followed by a single cycle issue stall.
There are no special ZovLoop rules, but the execution of a ZovLoop must follow all of the other rules as it
wraps from the loop end back to the loop start. Unless one of these other rules requires it, there are NO stall
cycles between loop end and loop start.
Examples:
# executes in 4 cycles per loop (block copy, one word per loop)
00: addi a0,4 ; 04: lw s0,0(a1) ; dual issue
08: addi a1,4 ; 0c: sw s0,0(a0) ; 2x single issue (s0 Load-use)
00: addi a0,4 ; 04: lw s0,0(a1) ; 2x single issue (sw-lw stall)
08: addi a1,4 ; 0c: sw s0,0(a0) ; 2x single issue (s0 Load-use)
# executes in 5 cycles per loop (block copy, two words per loop)
(lps = 00, lpe = 14)
00: lw s0,0(a1) ; 04: lw s1,4(a1) ; 2x PipeA single issue
08: sw s0,0(a0) ; 0c: sw s1,4(a0) ; 2x PipeA single issue
10: addi a0,8 ; 14: addi a1,8 ; dual issue
00: lw s0,0(a1) ; 04: lw s1,4(a1) ; 2x PipeA single issue
08: sw s0,0(a0) ; 0c: sw s1,4(a0) ; 2x PipeA single issue
10: addi a0,8 ; 14: addi a1,8 ; dual issue
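The cycle counts quoted in the two block-copy examples can be checked by adding up the issue annotations. This is a small arithmetic sketch: it assumes a "dual issue" pair costs one cycle and a "2x single issue" pair costs two, and it uses the steady-state (second) pass of each loop. The names are illustrative.

```python
# Cycles consumed by each annotated pair of instructions (assumed costs).
ISSUE_CYCLES = {"dual": 1, "2x single": 2}


def cycles_per_loop(groups):
    """Sum the cycles of the annotated instruction pairs in one loop pass."""
    return sum(ISSUE_CYCLES[g] for g in groups)


# One word per loop, steady-state pass: sw-lw stall pair, then load-use pair.
one_word = cycles_per_loop(["2x single", "2x single"])
# Two words per loop: two PipeA single-issue pairs, then one dual-issue pair.
two_word = cycles_per_loop(["2x single", "2x single", "dual"])

print(one_word, one_word / 1)   # 4 4.0  (cycles per loop, cycles per word)
print(two_word, two_word / 2)   # 5 2.5
```

Under these assumptions, unrolling to two words per loop lowers the cost from 4 to 2.5 cycles per word copied.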
IMMU stall:
When the program jumps, branches, or increments between the two most recently used pages, a
single cycle stall is incurred.
When the program jumps, branches or increments to a third page a two-cycle stall is incurred.
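The two stall cases above can be modeled with a two-entry record of recently used pages. This is a sketch only: the class, the page size parameter, the replacement of the older entry, and the treatment of a cold start as a "third page" are all assumptions, and the additional issue stall described next is not modeled.

```python
class ImmuModel:
    """Tracks the two most recently used instruction pages (illustrative)."""

    def __init__(self, page_size):
        self.page_size = page_size   # bytes per page; system-specific assumption
        self.mru = []                # most recent page first, at most two entries

    def fetch(self, addr):
        """Return the IMMU stall cycles charged for fetching from addr."""
        page = addr // self.page_size
        if self.mru and self.mru[0] == page:
            return 0                          # still in the current page
        if page in self.mru:
            self.mru.reverse()                # move to the other recent page
            return 1
        self.mru = [page] + self.mru[:1]      # new page replaces the older entry
        return 2
```

Usage, with a hypothetical 4096-byte page: after fetching from pages 0 and 1, alternating between them costs one cycle per switch, while a fetch from page 2 costs two cycles and displaces page 0.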
When an IMMU stall occurs due to incrementing across a page boundary, AND there is any of the
following instructions found anywhere in the last doubleword of the page, then there is one issue
stall in addition to the IMMU stalls:
When an instruction cache miss occurs, the processor is stalled for the duration of the cache line fill
operation.
The number of cycles required to complete the line fill is system dependent.
When a 2-way set associative instruction cache is in use, a soft-miss is defined as a hit in the
unpredicted set, with set prediction defined as follows:
If not running in Lock mode, or if the current cache index has no Locked line, set prediction is based
on the LRU bit (predict the non-least recently used set at the current cache index).
If running in Lock mode, and the current cache index has a Locked line, set prediction is based on
the previous Icache access (predict the Locked set if the previous Icache access hit a Locked line
and vice versa).
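The set prediction rules and the soft-miss definition can be written out as two small functions. This is a sketch under assumptions: sets are numbered 0 and 1, the parameter names are hypothetical, and "vice versa" is read as predicting the non-Locked set when the previous Icache access did not hit a Locked line.

```python
def predict_set(lru_set, lock_mode=False, locked_set=None,
                prev_hit_locked=False):
    """Return the predicted set (0 or 1) for the current cache index.

    lru_set:         least recently used set at this index (0 or 1)
    lock_mode:       True when running in Lock mode
    locked_set:      set holding a Locked line at this index, or None
    prev_hit_locked: True if the previous Icache access hit a Locked line
    """
    if not lock_mode or locked_set is None:
        return 1 - lru_set                 # predict the non-LRU set
    return locked_set if prev_hit_locked else 1 - locked_set


def is_soft_miss(hit_set, predicted_set):
    """A soft miss is a hit in the set that was not predicted.

    hit_set is None when the access misses in both sets (a true miss).
    """
    return hit_set is not None and hit_set != predicted_set
```

For example, outside Lock mode with set 0 marked least recently used, the prediction is set 1, so a hit in set 0 counts as a soft miss.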
When a data cache miss occurs as the result of a load instruction, the processor stalls while it waits
for the data. The data cache releases the stall condition after the required word is supplied to the
processor, even if additional words must still be filled into the data cache. However, if the processor
issues another load or store operation to the data cache while the remainder of the line fill is in
progress, the cache will again stall the processor until the line fill operation is completed.
When a data cache miss occurs as a result of a load byte or load halfword, the processor stalls for
the duration of the cache line fill operation.
The number of cycles required to complete the line fill is system dependent.
JR I D S E M W
delayslot I D S E M W
notvld I . . .
notvld I . .
target I D S E
I D S E
J I D S E M W
delayslot I D S E M W
target I D S E M
I D S E M
B I D S E M W
notvld I . . . . .
target I D S E M
B-ntkn I D S E M W
delayslot I D S E M W
notvld I . . .
notvld I . . .
delay+4 I D S
I D S
B-ntkn I D S E M W
notvld I . . . . .
notvld I . . . .
notvld I . . .
delay+4 I D S
Load I D S E M W
notvld I . . . . .
Load+2 I D S E M
00: lw s0,0(a0) I D S E M W
04: addi a0,4 I D S E M W
08: add s1,s0 I d D S E M W
0c: add t1,t2 I d D S E M W
00: lw s0,0(a0) I D S E M W
04: addi a0,4 I D S E M W
08: add t1,t2 I D S E M W
0c: add s1,s0 I d D S E M W
00: lw s0,0(a0) I D S E M W
04: lw s2,4(a0) I d D S E M W
08: add s1,s0 I D S E M W
0c: addi a0,8 I D S E M W
00: lb I D S E M M W
04: foo1 I D S E M M W
08: foo2 I D S E E M W
0c: foo3 I D S E E M W
10: foo4 I D S S E M W
14: foo5 I D S S E M W
RHOLD X
Store-Load stall:
00: sw s0,4(a0) I D S E M W
04: addi a0,8 I D S E M W
08: add s0,s1 I D S E M M W
0c: lw s2,0(a0) I D S E M M W
10: foo2 I D S E E M W
14: foo3 I D S E E M W
RHOLD X
00: sw s0,4(a0) I D S E M W
04: lw s2,8(a0) I d D S E M M W
08: addi a0,8 I D S E E M W
0c: add s0,s2 I D S E E M W
10: foo2 I D S S E M W
14: foo3 I D S S E M W
RHOLD X
00: nop I D S E M W
04: st s0,8(a0) I d D S E M W1 W2
08: add s0,s1 I D S E M M W
0c: sw s2,0(a0) I D S E M M W
10: foo2 I D S E E M W
14: foo3 I D S E E M W
RHOLD X
00: st s2,0(a0) I D S E M W1 W2
04: st s0,8(a0) I d D S E M M W1 W2
08: foo2 I D S E E M W
0c: foo3 I D S E E M W
10: foo4 I D S S E M W
14: foo5 I D S S E M W
RHOLD X
00: sw s0,4(a0) I D S E M W
04: addi a0,8 I D S E M W
08: add s0,s1 I D S E M M W
0c: sb s2,0(a0) I D S E M M W
10: foo2 I D S E E M W
14: foo3 I D S E E M W
RHOLD X
00: sh s0,4(a0) I D S E M M W
04: addi a0,8 I D S E M M W
08: sb s2,0(a0) I D S E E M M W
0c: add s0,s1 I D S E E M M W
10: foo2 I D S S E E M W
14: foo3 I D S S E E M W
RHOLD X X
00: st s2,0(a0) I D S E M W1 W2
04: sb s0,3(a0) I d D S E M M W
08: foo2 I D S E E M W
0c: foo3 I D S E E M W
10: foo4 I D S S E M W
14: foo5 I D S S E M W
RHOLD X
00: nop I D S E M W
04: st s0,8(a0) I d D S E M W1 W2
08: add s0,s1 I D S E M M W
0c: sb s2,0(a0) I D S E M M W
10: foo2 I D S E E M W
14: foo3 I D S E E M W
RHOLD X
multcount(4S) 0 1 2 3 4
RHOLD X X
00: mtc0 I D S E M W
notvld I . . . . . .
04: foo I d d D S E M W
08: foo1 I D S E M W
0c: foo2 I D S E M W
00: nop I D S E M W
04: mtc0 I d D S E M W
notvld I . . . . .
08: foo1 I d D S E M W
0c: foo2 I d D S E M W
10: foo3 I D S E M W
14: foo4 I D S E M W
RHOLD X
00: lw s0,0(a1) I D S E M W
04: lw s1,4(a1) I d D S E M W
08: sw s0,0(a0) I D S E M W
0c: sw s1,4(a0) I d D S E M W
10: addi a0,8 I D S E M W
14: addi a1,8 I D S E M W
00: foo0 I D S E M M M M M M W
04: foo1 I D S E M M M M M M W
08: foo2 I D S E E E E E E M W
0c: foo3 I D S E E E E E E M W
10: foo4 I ~d . . . I D S E M W
14: foo5 I ~d . . . I D S E M W
RHOLD X X X X X
00: foo0 I D S E M M M W
04: foo1 I D S E M M M W
08: foo2 I D S E E E M W
0c: foo3 I D S E E E M W
10: foo4 I ~d I D S E M W
14: foo5 I ~d I D S E M W
18: foo6 I D S E M W
1c: foo7 I D S E M W
RHOLD X X
00: foo I D S E M W
04: lw I D S E M . . . . W
08: foo1 I D S E M M M M M W
0c: foo2 I D S E M M M M M W
10: foo3 I D S E E E E E M W
14: foo4 I D S E E E E E M W
RHOLD X X X X