TMS34010 -- An Embedded Microprocessor

The TMS34010 is a high-performance 32-bit microprocessor designed for graphics systems, featuring specialized instructions for bit-field data and address manipulations. It integrates control for DRAM, reducing system costs while maintaining high processing speeds and efficiency with internal caches and registers. This microprocessor is positioned between low-cost 8/16-bit processors and high-cost general-purpose 32-bit processors, making it suitable for a variety of embedded applications beyond graphics.

The TMS34010:

An Embedded Microprocessor

Aimed at graphics systems, this 32-bit processor's large address reach, bit-field processing, and DRAM interface make it suitable for many other embedded processing applications.

Karl M. Guttag, Thomas M. Albers, Michael D. Asal, and Kevin G. Rose
Texas Instruments

0272-1732/88/0600-0039$01.00 © 1988 IEEE June 1988 39

Authorized licensed use limited to: HOSEO UNIVERSITY. Downloaded on October 6, 2009 at 04:30 from IEEE Xplore. Restrictions apply.

The TMS34010 is a high-performance 32-bit microprocessor with special instructions and hardware for handling the bit-field data and address manipulations often associated with computer graphics. With integrated control and addressing for dynamic random access memory (DRAM), it supports a lower system cost than would normally be associated with a 32-bit microprocessor. Internal features such as an instruction cache, thirty-one 32-bit registers, and an independent memory control unit maintain a high degree of parallelism while efficiently utilizing lower cost DRAM.

Embedded processing for graphics was the target application for the 34010 from its inception. This led to a set of cost and performance decisions that are applicable to a wide variety of systems in addition to graphics systems. The chip contains over 180,000 transistors fabricated in 1.8-µm CMOS, consuming approximately one-half watt while executing in excess of six million instructions per second. In addition to a full 32-bit microprocessor core, the 34010 contains on-chip video random access memory (VRAM) display support, DRAM control, and a host interface.

History

Embedded microprocessors grew out of the need in the late 1960's for more advanced calculators [1]. Calculator designers recognized that as the variety and complexity of applications for calculators grew, programmable rather than fixed-function processors would be advantageous. Furthermore, engineers at Texas Instruments and Intel both recognized that these processors, once designed, could be used for much more than just calculators. The same chips could be used as general-purpose controllers, replacing gears, tubes, and relays with solid-state control. Texas Instruments initially pursued the single-chip microcomputer with its TMS1000 line, while Intel focused on the multichip microprocessor market with the 4004 and then the 8008.

Similarly, in the late 1970's and early 1980's designers began studying the processing needs of more advanced applications (such as digital signal processing, local area networks, and graphics). The comprehensive nature of these applications led designers at our company to the conclusion that even application-specific processors should be generally programmable. The design decision to support a completely general-purpose instruction set on an application-specific device means that the device will be well suited for many new application areas as well.

Photomicrograph of the TMS34010 embedded microprocessor.

In display graphics, for example, the demands on graphics subsystems are growing rapidly. As desktop systems advance, a wide range of both graphic and nongraphic functions are being required of the graphics subsystem. The advent of graphical user interfaces for bitmapped graphics systems and the growing complexity of the interfaces between the system processor and its graphics subsystem dictate that the embedded graphics device should be programmable.

With advancements in processor technology, embedded control has expanded considerably. As many applications needed and could afford more processing power, 8-bit and, later, 16-bit CPUs came into use for embedded control applications. Today, many embedded applications are migrating to 32-bit microprocessors for their speed and advanced feature sets.

A 32-bit microprocessor offers significant advantages over older 8-bit and 16-bit chips, both in speed and ease of use. It has a large linear address reach for larger program and data requirements. The linear approach simplifies address management and tool requirements, and the larger address reach extends the application limits of the device. Additionally, 32-bit processors generally have much better bit-field processing capabilities than the older processors. Many 32-bit machines have internal instruction and/or data caches for faster program execution.

With the growing size and complexity of applications, the need for high-level-language (HLL) support has increased. However, for speed-critical portions of an application, the need for strong assembly language support for embedded control systems still exists. Thus, a processor that can be programmed easily at both levels is beneficial.

Embedded processors

The term embedded processor covers a wide range of processors and their applications. The term embedded implies that the user is not aware of the presence of the processor. That is, the user does not directly interact with or program the processor, but rather the processor provides a convenient method of implementing the control function. Microwave controllers, electronic games, and computer peripherals are typical examples of systems using embedded processors. With the falling cost and increasing power of microprocessor systems, the range and capabilities of embedded processors are expanding.

As 32-bit devices with their improved software support and speed move into the embedded control arena, it is natural to see them applied to the same applications currently supported by 8- and 16-bit devices. This trend tends to "raise the intelligence" of control systems while maintaining their embedded nature.

The evolution of printer systems clearly shows how the nature of embedded processors can change. The simple impact-head printers were implemented with 8-bit microcomputers as embedded processors. Today, high-resolution laser printers are emulating the impact printers they replaced and are providing the added capability of interpreting high-level page description languages. Consequently, the processing system of the laser printer is often more powerful than the host system it is connected to. Although the printer and its software interface have become more advanced, the processor is still being used in an embedded fashion.

In general, embedded systems are more cost sensitive than host systems, and large-volume embedded-system applications require relatively few chips. In embedded applications the processing is more a means to an end than an end in itself as in host applications. The embedded-system designer's goal is to provide the level of processing power necessary for the application with the highest system reliability at the lowest cost.

Applying host processors to embedded-processor systems. Most 32-bit microprocessor design to date has focused on host systems. These designs assume large systems such as multitiered memory systems with virtual demand paging memory management. Most of the new RISC (reduced instruction set computer) machines also require specialized memory systems such as fast memory subsystems (in some cases special external caches), very wide buses, and sometimes multiple buses. Another important factor is the bus speeds that many of these new processors require; these speeds are beyond most available application-specific ICs (ASICs) and can mean requiring very fast external logic with relatively low levels of integration.

These features are desirable for larger host applications, but for most embedded applications they are either unnecessary or too expensive. The new processors, for example, typically require ceramic packages of 100 pins or more, resulting in higher component and system costs.

Thus, a large practical gap in system complexity and cost lies between most 32-bit microprocessors and their 16- and 8-bit predecessors. The needs of embedded controller applications for faster and easier-to-use microprocessors without the unnecessary impediments associated with host systems are not being addressed by most 32-bit processors.

Figure 1 diagrams the trade-offs in the current microprocessor market. On one axis is the relative system cost of the processor and its intended memory system, and on the other axis is relative performance. The general-purpose 32-bit microprocessors lie in the high-performance and high-system-cost region of the graph. The processors alone for these systems cost over $300 and require fast memory in the form of external static RAM (SRAM) caches to achieve optimum performance. The RISC chips on today's commercial market likewise have been designed for high-performance and high-cost systems. At the lower cost and lower performance end are the 8-bit and 16-bit microprocessors and microcomputers. In the case of microcomputers the memory is built into the processor chip. The 8- and 16-bit microprocessors typically have SRAM interfaces and require external logic to connect to lower cost DRAM. Consequently, high-cost SRAM has been used for lower chip count memory systems.

Figure 1. Embedded controller system trade-offs. (Axes: relative system cost versus relative performance; host processors occupy the high-cost, high-performance region.)

The 34010 is positioned between the two extremes in the cost-performance trade-off. It offers substantially more performance than the 16-bit microprocessors and lower system cost than the 32-bit general-purpose processors. The two positions of the 34010 on the graph in Figure 1 depict its relative performance in general-purpose processing and its advantage in graphics due to its special graphics processing hardware.

Application microprocessors. An application microprocessor is a general-purpose microprocessor designed with special hardware and instruction support for a specific system application. These microprocessors provide significant performance and cost advantages for applications needing their special capabilities. Application microprocessors grew out of the recognition that certain functions occur more frequently in embedded applications than in general-purpose processing.

The TMS320 digital signal processor (DSP) is a good example of an application processor family aimed at a specific application area. Designed for DSP applications, it optimizes the performance of high-speed multiplication operations typical in these applications. Many other applications have made use of the unique capabilities of DSP chips, including such varied applications as voice recognition, adaptive suspension, and image compression for graphics.

The TMS340 graphics system processor (GSP) family, of which the 34010 is the first processor, is aimed at the bit-field processing and large memory spaces associated with graphics rendering. Its large address reach, bit-field processing capability, on-chip timers, and DRAM interface make the processor well suited to many embedded-processing applications.

Focus on embedded processing. Because it was intended to be an embedded processor or second processor in many systems, the 34010 has a set of features focused on reducing system complexity. It assumes a small system model with a single external memory hierarchy. The silicon budget was put into such features as DRAM control, timers, bit-field processing, pixel processing, and simple connection to a host processor or communication channel.

Figure 2. TMS34010 pin diagram. (Pin groups: host interface; local memory interface; video timing; power, ground, and reset; hold and emulator interfaces.)

Table 1. TMS34010 pin description.

Name            Pin             I/O   Description

Host Interface Bus Pins
HCS             66              I     Host chip select
HD0-HD15        44-51, 53-60    I/O   Host bidirectional data bus
HFS0, HFS1      67, 68          I     Host function select
HINT            42              O     Host interrupt request
HLDS            63              I     Host lower data select
HUDS            62              I     Host upper data select
HRDY            43              O     Host ready
HREAD           64              I     Host read strobe
HWRITE          65              I     Host write strobe

Local Memory Interface
RAS             38              O     Local row-address strobe
CAS             39              O     Local column-address strobe
DDOUT           36              O     Local data direction out
DEN             37              O     Local data enable
LAD0-LAD15      10-17, 19-26    I/O   Local address/data bus
LAL             34              O     Local address latched
LCLK1, LCLK2    28, 29          O     Local output clocks
LINT1, LINT2    6, 7            I     Local interrupt request pins
LRDY            9               I     Local ready
TR/QE           41              O     Local shift-register transfer or output enable
W               40              O     Local write strobe
INCLK           5               I     Input clock

Hold and Emulator Interfaces
HOLD            8               I     Hold request
RUN/EMU         2               I     Run/Emulate
HLDA/EMUA       33              O     Hold acknowledge or emulate acknowledge

Video Timing Signals
BLANK           32              O     Blanking
HSYNC           30              I/O   Horizontal sync
VCLK            4               I     Video clock
VSYNC           31              I/O   Vertical sync

Miscellaneous
RESET           3               I     Device reset
VCC             27, 61          I     Nominal 5-volt power supply
VSS             1, 18, 35, 52   I     Ground

Figure 2 shows the distribution of the processor's 68 pins. Table 1 lists the functions of these pins. Note the emphasis both on adequate local bus control and on a host interface for attached processing. More than one third of the pins on the device are dedicated to this embedded function support. Almost half the pins support the local address/data bus designed to make DRAM interfacing straightforward. The device performs the row/column address multiplexing necessary to interface efficiently to DRAM.
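The row/column multiplexing mentioned above can be illustrated with a small model. This is only a sketch of the general technique: the function names and the 8-bit row and column widths are assumptions chosen for illustration, and the article does not give the 34010's actual address mapping.

```python
# Hypothetical widths: a DRAM with 8 row bits and 8 column bits (assumed, not
# taken from the article). Real parts and the 34010's mapping may differ.
ROW_BITS = 8
COLUMN_BITS = 8


def split_row_column(cell_address: int) -> tuple[int, int]:
    """Split a linear cell address into the (row, column) halves a DRAM expects."""
    column = cell_address & ((1 << COLUMN_BITS) - 1)
    row = (cell_address >> COLUMN_BITS) & ((1 << ROW_BITS) - 1)
    return row, column


def multiplexed_phases(cell_address: int) -> list[tuple[str, int]]:
    """List what is driven on the shared address pins, in order.

    The row half is presented first and latched by RAS; the column half
    follows on the same pins and is latched by CAS.
    """
    row, column = split_row_column(cell_address)
    return [("RAS", row), ("CAS", column)]


if __name__ == "__main__":
    for strobe, value in multiplexed_phases(0x1234):
        print(f"{strobe}: 0x{value:02X}")
```

Because both halves share the same pins, the processor needs half as many address pins as a non-multiplexed interface would, which is consistent with the pin-budget emphasis described above.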

Overview of the internal architecture

As shown in Figure 3, the internal architecture of the TMS34010 consists of six major blocks: main CPU, instruction cache, memory controller, host interface, display controller, and internal clocks [2-4].

Figure 3. TMS34010 internal architecture.

Main CPU. The CPU is controlled by an 808-microstate control ROM (CROM). It can execute simple instructions in a single cycle when in cache, and it can perform complex microsequences such as the pixel block transfer instructions (PIXBLTs). Each CROM word has 166 bits, which support highly parallel operations within the CPU.

The main internal data paths are 32 bits wide and contain the key elements for efficient execution of both graphics and general-purpose instructions. The thirty-one 32-bit registers are organized as two register files, each with 15 registers sharing a common stack pointer register. The ALU, adder/subtracter, and barrel shifter have separate inputs and control and can all operate in parallel.

Instruction cache. The instruction cache, transparent to the programmer, was designed to support fast execution while using DRAM for the system memory. During the execution of time-critical loops, the cache helps in two ways: It supports fast instruction fetches, and it frees the memory bus for reading and writing. The programming model is a single memory space for instructions and data, and the cache is used to separate them for parallel access.

The 256-byte cache uses a four-way set-associative organization with four segments (sets), eight subsegments per segment, and four words per subsegment. Each subsegment has a "present" bit, and direct replacement of "misses" within a subsegment is made for the four words. The four segments use a four-location CAM (content-addressable memory) to determine whether a segment is present and a four-location LRU (least recently used) stack to determine which segment is replaced on a miss.

Memory controller. The memory controller is actually a separate microcoded processor with its own control ROM that coordinates all accesses to local memory. CPU, host, and video requests are prioritized and scheduled by the memory controller along with DRAM refresh cycles. The memory controller is responsible for generating the control signals for the local memory bus.

All the necessary DRAM and VRAM control signals are generated by the memory controller. The row and column addresses along with data are triple-multiplexed on the 16-bit local address/data (LAD) bus under control of the memory controller. Only one external buffer/latch is required to connect the processor to DRAM.

Figure 4 shows the local bus timing. The signals LCLK1 and LCLK2 are the local bus clocks generated by the processor for timing on the bus. The RAS (row address strobe) and CAS (column address strobe) signals directly generate the timing required by DRAM. The LAL (local address latch) signal, which falls after the column address is valid, is used to latch the column address. For static memory interfacing, the RAS signal latches the upper address bits, and the LAL signal latches the lower address bits. W (write enable) is designed to give the proper write signal for DRAM. The DEN (data enable) and DDOUT (data direction out) signals enable and control the direction of bus transceivers if necessary in the system. LRDY (local ready) is a processor input used to lengthen memory cycles for slower memories and peripherals.

The TR/QE (shift register transfer/output enable) signal, controlling a corresponding input on the VRAM, serves two dissimilar functions. It initiates the shift register transfer cycle on the VRAM, and it is timed to control the output enables on the VRAM and the 4-bit-wide DRAM devices.

The mixing of the dissimilar functions of the VRAM's shift register transfer signal and its output enable on TR/QE (alternatively named TR/G) came about because the 34010 was defined in conjunction with the original VRAM, the TMS4161 [5]. The purpose of the output enable function was to make it unnecessary to add extra buffers between the VRAM and the 34010, but there were no pins left on the first VRAM. The TR signal already existed, so the designers decided to time-multiplex the signal. This invention of necessity has become standard on all VRAMs designed since.

The memory controller also plays an important role in off-loading the main CPU from the burden of bit-field processing. On bit-field operations (including any move instructions), the CPU simply passes the starting bit address, the field size, and the data to the memory controller; then the CPU is free to execute the next instruction out of cache. Before sending the data, the CPU uses its barrel shifter to get the data in the proper bit alignment, but the memory controller actually performs the masking and merging operations to insert the field. On the basis of field size and address alignment, it computes and schedules as many read and write cycles as necessary, requiring no further interaction with the main CPU.
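The mask-and-merge step described above can be sketched in software. This is a minimal model, not the memory controller's microcode: it assumes 16-bit memory words (matching the 16-bit local bus), assumes that bit address 0 is the least significant bit of word 0, and uses hypothetical function names.

```python
WORD_BITS = 16          # width of one local-memory word (assumed from the 16-bit bus)
WORD_MASK = (1 << WORD_BITS) - 1


def insert_field(memory: list[int], bit_address: int, field_size: int, value: int) -> int:
    """Insert a field_size-bit value at bit_address; return the words touched.

    Each touched word is read, the affected bits are masked out, the new
    bits are merged in, and the word is written back.
    """
    if not 1 <= field_size <= 32:
        raise ValueError("field size must be 1 to 32 bits")
    value &= (1 << field_size) - 1
    word_index, offset = divmod(bit_address, WORD_BITS)
    remaining = field_size
    words_touched = 0
    while remaining > 0:
        take = min(WORD_BITS - offset, remaining)      # bits that fit in this word
        mask = ((1 << take) - 1) << offset             # bits to be replaced
        chunk = (value & ((1 << take) - 1)) << offset  # new bits, aligned
        memory[word_index] = (memory[word_index] & ~mask & WORD_MASK) | chunk
        value >>= take
        remaining -= take
        word_index += 1
        offset = 0
        words_touched += 1
    return words_touched


def extract_field(memory: list[int], bit_address: int, field_size: int) -> int:
    """Read a field back, gathering its bits from as many words as needed."""
    word_index, offset = divmod(bit_address, WORD_BITS)
    result = 0
    gathered = 0
    while gathered < field_size:
        take = min(WORD_BITS - offset, field_size - gathered)
        bits = (memory[word_index] >> offset) & ((1 << take) - 1)
        result |= bits << gathered
        gathered += take
        word_index += 1
        offset = 0
    return result


if __name__ == "__main__":
    memory = [0xFFFF, 0x0000]
    touched = insert_field(memory, bit_address=12, field_size=8, value=0xA5)
    print(touched, [hex(word) for word in memory])
```

An 8-bit field starting at bit 12 straddles two words, so the model touches two words, which mirrors the text's point that the number of memory cycles depends on field size and address alignment.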
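The instruction cache organization described earlier in this overview (four segments, eight subsegments per segment, four words per subsegment, a present bit per subsegment, and LRU replacement of segments) can be modeled as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, words are taken to be 16 bits, and each segment is assumed to cover 32 consecutive words, which the article does not state explicitly.

```python
SEGMENTS = 4
SUBSEGMENTS = 8
WORDS_PER_SUBSEGMENT = 4
WORDS_PER_SEGMENT = SUBSEGMENTS * WORDS_PER_SUBSEGMENT  # 32 words (assumed contiguous)


class InstructionCacheModel:
    """Behavioral model of segment tags, present bits, and an LRU stack."""

    def __init__(self) -> None:
        self.tags = [None] * SEGMENTS                  # plays the role of the CAM
        self.present = [[False] * SUBSEGMENTS for _ in range(SEGMENTS)]
        self.lru = list(range(SEGMENTS))               # index 0 = least recently used

    def access(self, word_address: int) -> str:
        tag, within_segment = divmod(word_address, WORDS_PER_SEGMENT)
        subsegment = within_segment // WORDS_PER_SUBSEGMENT
        if tag in self.tags:
            segment = self.tags.index(tag)
            if self.present[segment][subsegment]:
                outcome = "hit"
            else:
                # Segment is cached but this subsegment is not: load its four words.
                self.present[segment][subsegment] = True
                outcome = "subsegment miss"
        else:
            # Segment is not cached: replace the least recently used segment.
            segment = self.lru[0]
            self.tags[segment] = tag
            self.present[segment] = [False] * SUBSEGMENTS
            self.present[segment][subsegment] = True
            outcome = "segment miss"
        # Move the segment just used to the most-recently-used end of the stack.
        self.lru.remove(segment)
        self.lru.append(segment)
        return outcome


if __name__ == "__main__":
    cache = InstructionCacheModel()
    for address in (0, 1, 4, 4, 32):
        print(address, cache.access(address))
```

With 16-bit words, 4 x 8 x 4 = 128 words gives the 256 bytes quoted in the text. A tight loop that fits in a few subsegments takes its misses once and then runs entirely from the cache.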

Host interface. Unlike other mi-


croprocessors, the 34010 has a
dedicated interface port to allow
another processor to gain access to
its memory. The host interface is
actually a communication channel
into the 34010's memory space
and can be used for other pur-
poses. Accesses requested via this
8- or 16-bit data interface are
scheduled at higher priority than
the 34010's CPU by the memory
controller. The local memory can
also be directly accessed by a con-
ventional hold interface.
Four internal memory-mapped
registers are dedicated to the host.
These registers are loaded from the
LRDY 8/16-bit host bus under the control
of two function select pins (HFSO,
( b) HFS1). Two of the 16-bit registers
combine to form a 32-bit address
into local memory. Another regis-
Figure 4. bcal bus cycle timing (a) write cycle; (b) read cycle. ter holds the data written to and

44 IEEEMICRO

Authorized licensed use limited to: HOSEO UNIVERSITY. Downloaded on October 6, 2009 at 04:30 from IEEE Xplore. Restrictions apply.
TMS34010

Interface

Figure 5. TMS7042 as serial port host.

read from memory, and the fourth contains control in- Using the host interface f o r other functions.
formation. Although intended to be a port for a host processor’s
In addition to the data transfer registers, the host commands and data, the host interface provides an
processor has access to a 16-bit control word. Using economical way to access the local memory for any
this register, the host has complete control of the purpose. One system, for example, uses an 8-bit
34010. The control register can be set up for automatic microcomputer as a front-end serial controller. The
incrementing of the address register on reads and/or microcomputer, operating as a serial port or network
writes for block access throughput of five megabytes interface, attaches very easily to the 34010. Figure 5
per second. shows the connection of the 8-bit TMS7042 microcom-
The management of these resources and the memory puter with serial port to the processor host port.
pointer registers can be controlled by the 34010 or the The host has access to the interrupt vector table via
host. In a 34010-controlled interface scheme, the host the host interface. Thus, the host or other local pro-
does not need to have any knowledge of the local cessor can dynamically install interrupt vectors and
memory organization. The 34010 is capable of loading their associated interrupt routines. The external pro-
the address registers locally, thus decoupling the host cessor can then initiate interrupts via either the 34010’s
from the local memory implementation. The host in- interrupt pins or the host control register. For external-
terface’s control register provides the host access to ly generated interrupts, there are both maskable and
dedicated message-passing bits, a register-controlled nonmaskable interrupts in addition to Reset. Two ex-
interrupt in and out, and a nonmaskable interrupt. The ternal pins, LINT1 and LINT2, are dedicated external
control register also supports halting of the 34010’s interrupt inputs. Two bits in the host control register
CPU and flushing of the contents cache, particularly control the operation of the nonmaskable interrupt.
useful when downloading code. The halt control can One bit invokes the interrupt itself while the other
also be very useful in capturing the full bandwidth of enables or disables the stacking of the program counter
the local bus for time-critical data transfers. and status during the interrupt routine. Preventing the

June 1988 45

Authorized licensed use limited to: HOSEO UNIVERSITY. Downloaded on October 6, 2009 at 04:30 from IEEE Xplore. Restrictions apply.
-TMS34010processor

stacking process in the interrupt routine is sometimes designed to bring the high performance and ease of
necessary to gain immediate control of a system in programming associated with 32-bit microprocessors
which the stack pointer value may have been corrupted. to low-cost systems. Contrary to common belief, the
The host interface can also be used as an indepen- difficult part of product definition is not identifying
dent debug/test channel into the local memory space. good features to add (which are infinite) but making
Debugging programs can use this port to access state the tough choices between what can be included and
variables supplied by a local monitor program giving what must be left out for cost reasons.
information about the processor’s internal machine To achieve the best system cost-performance ratio,
state. Similarly, the port can be used for communi- the feature set and functions of a processor must be
cating system state variables at the end of prescribed balanced. Balanced means that the features comple-
system tests at bootup. These can be either controlled ment each other in a practical way. For example, the
locally by the 34010 or command-driven over the host 34010 was targeted at low-cost memory systems. Fea-
port. tures such as the on-chip instruction cache and large
register file provide faster execution by reducing the
Display controller. Although designed for CRT con- need for access to the DRAM, while direct DRAM con-
trol, the display controller is a very flexible counter that trol and multiplexed addressing reduce system cost and
can be used for a wide range of timing or event- complexity.
counting functions. Functionally, it has two cascaded
16-bit counters (for a total of 32 bits of dynamic range) Large linear address space with bit-field processing.
with four programmable comparators on each counter. Graphics display systems need large amounts of
The comparators are used to generate three output memory, leading to several basic design decisions. A
signals (nominally used as vertical sync, horizontal clean architecture to support a large linear address
sync, and blanking). A programmable interrupt can be reach is needed, and this in turn requires a 32-bit inter-
set on the basis of any vertical count. nal data path to manipulate the large addresses quickly.
The input clock for the counter can run from 0 (stop- Unique among microprocessors, the 32-bit address
ped) to 7.5 MHz and can be totally asynchronous with of the 34010 points to the exact bit location in memory
the processor clock. The counters support an external rather than to the byte, word, or long-word of other
video mode, which allows external events to reset the processors. The whole of memory is viewed as a series
vertical and horizontal counters independently. of bits ordered from 0 to 232-1. All the memory ad-
dressing modes directly support bit-field processing.
Internal clocks. The clock timing logic converts the Field lengths from 1 to 32 bits are directly supported in
input clock frequency into the various internal timing all general-purpose move operations. The autodecre-
clocks needed to operate the processor. In addition, it ment and autoincrement addressing modes also use the
generates the local bus clock signals used by external field size to adjust address registers. In addition, any
devices to operate synchronously with the processor’s size array of fields can be moved with the PIXBLT in-
local bus. Current devices operate with a divide-by- struction (pixel block transfer).
eight from the input clock to generate 130-11s cycle The byte, word, and long-word are artifacts of older
times for a 60-MHz input clock. processor architectures in which there was only one
basic data size-the byte. In processors with limited
Model of operation. The large register file, instruc- address bits, byte addressing served as a good com-
tion cache, and independent memory controller are promise between data granularity and address reach.
designed to work together for high performance. The Also, earlier architectures did not have the hardware,
processor’s model of operation is that instructions con- such as barrel shifters and masklmerge multiplexers,
trolling the algorithm are automatically loaded into the to quickly handle problems bit-aligned addressing can
cache, data is stored in the large register file, and the create. But the 34010’s architecture started with a
memory controller and its bandwidth into the off-chip 32-bit address reach, and its hardware manipulates bit
memory are used only in manipulating external data. fields with equal ease as bytes or words, so there was no
Other system operations such as host accesses are re- need to impose a distinction.
quested and scheduled by the memory controller as a Bit-field processing is directly supported by special
background task, and interrupts are used for commu- hardware on the device. This extends the bit control
nications, handshaking, and error conditions. and manipulation facilities found in controller sys-
tems, adding memory bit manipulations to those avail-
able in registers. Processors without this facility must
Feature set and definition decisions perform many operations to achieve the same effect.
To transfer a nonaligned byte field from one memory
In designing a processor, one must evaluate package location to another, the typical processor must read the
size, pin count, target memory type, and a wide variety source word into a register, mask out uninvolved bits,
of features that can be incorporated. The 34010 was align the source word with the destination location,

46 IEEEMICRO

Authorized licensed use limited to: HOSEO UNIVERSITY. Downloaded on October 6, 2009 at 04:30 from IEEE Xplore. Restrictions apply.
read in the destination word, mask out the affected field (byte), logically merge the source and destination words, and then write the result to the destination. The 34010 CPU can direct this operation within a single instruction and then continue to execute operations within its register file while the destination memory accesses are being performed.

Contrary to common belief, the difficult part of product definition is not identifying good features to add.

Large register file. The decision to go to thirty-one 32-bit registers rather than the 16 or fewer found on most machines was driven by the desire to make time-critical functions run faster and to ease assembly-level programming. Register-to-register operations occur in a single cycle when running out of cache and can occur in parallel with the memory controller’s completion of previously started write cycles. This parallelism naturally occurs in routines in which the CPU is computing functions written to a series of memory locations. The example used as a model during the 34010’s definition was an ellipse-drawing routine, in which the address computations and data values are held in the register file and the pixels to be written are sent to the memory controller. A large register file means that all the parameters for most time-critical functions can be kept inside the processor, thus preventing the thrashing of parameters between the register file and memory. By preventing thrashing, the register file frees the memory bandwidth for other functions such as memory write cycles, host accesses, and DRAM refresh. In many programs the large register file can be the single most important feature for improving performance.

There are several ways to take advantage of the larger register file. It is helpful to the compiler to have several working registers for storing intermediate results of computations such as evaluating expression trees. Intelligent compilers (made popular by RISC architectures) can take even greater advantage of more registers by tracking the contents of registers and performing efficient constant generation. This also means that the programmer trying to optimize his or her own code in a high-level language will not compete with the compiler for access to a small number of register variables to hold critical values.

Important, time-critical routines are usually written in assembly code to optimize performance. It is in writing such routines that the benefit of the large register file is most obvious. If there are too few registers, most of a programmer’s algorithmic effort can be expended on register management. With 31 registers, a large number of critical values can be kept in the register file during execution. This is sufficient for most control algorithms and has a direct impact on the ease of design of the algorithm implementation. In addition, the processor’s efficient register-stacking and -unstacking instructions make register allocation and management trivial.

The 34010’s initial target application area clearly indicated that a large number of 32-bit registers were very desirable. Many graphics algorithms require a large number of data variables and address pointers. Without the target application to measure against, it would have been much easier to define an architecture with fewer registers.

The on-board registers are organized as two 15-register files named the A and B files. The distinction between the files is that instructions requiring two registers must have both registers in the same file. The register-to-register move instruction is an exception to this rule so that data can be moved between register files. The other exception is the stack pointer, which is accessible as the 16th register of either file.

The memory move, arithmetic, Boolean, shift, and other register-based instructions and addressing modes can use any of the 31 registers. This is particularly important in optimizing languages such as C, where any operation that can be performed on data can be done on address pointers. Therefore, data/address placement is not constrained, giving the compiler great latitude in organizing register usage.

An important part of the processing task for embedded-control applications is the processor’s interrupt support. An added benefit of the large register file is the ability to dedicate registers to time-critical functions such as interrupt routines. Embedded applications often have parameters that must be dealt with quickly when an interrupt occurs. Dedicating part of the register file to these values can save considerable time swapping data in and out.

Instruction cache. The four-way set-associative approach was designed to support two loops, each straddling set boundaries without thrashing. During the 34010’s early definition, relatively little time was spent defining the cache compared with time spent justifying its hypothetical performance on pathological cases, cache architecture being part philosophy and part science.

Since graphics was the focus of the 34010, circle, line, and ellipse algorithms were used as model test cases for the cache. The nature of these algorithms is that they have a single loop, but a decision based on incremental calculations is made between two sets of code on each loop. The goal was to have these algorithms fit in the cache without thrashing and without requiring the programmer to worry about the position of code in memory. The four-way set-associative method worked well on the graphics test cases.
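As a rough illustration of the loop structure just described (a single loop that, on each pass, picks one of two short code paths from an incrementally updated decision value), here is a minimal sketch of the standard midpoint circle algorithm in Python. It is not the 34010 test code; the function name `midpoint_circle` and the use of a Python set to hold plotted pixels are assumptions made for illustration.

```python
def midpoint_circle(radius):
    """Return the set of integer (x, y) pixels on a circle centred at the origin.

    Standard midpoint circle algorithm: one loop, and on each pass an
    incrementally updated decision value selects one of two code paths.
    """
    pixels = set()
    x, y = radius, 0
    d = 1 - radius  # decision value, updated with additions only
    while x >= y:
        # Plot the current point in all eight octants by symmetry.
        for px, py in ((x, y), (y, x), (-x, y), (-y, x),
                       (x, -y), (y, -x), (-x, -y), (-y, -x)):
            pixels.add((px, py))
        y += 1
        if d <= 0:
            # Path 1: stay on the same x, update the decision value.
            d += 2 * y + 1
        else:
            # Path 2: step x inward as well, then update the decision value.
            x -= 1
            d += 2 * (y - x) + 1
    return pixels


print(sorted(midpoint_circle(3)))
```

Both branches are only a few operations long, which is why the whole loop, including both paths, is a natural candidate for staying resident in a small instruction cache.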

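The nonaligned field transfer described earlier in this section (read the source word, mask, align, read the destination word, mask, merge, write) can be sketched in software as follows. This only illustrates the step-by-step work a conventional processor performs; the 16-bit word size, the LSB-first bit numbering, the function name `move_field`, and the restriction to fields that do not cross a word boundary are all assumptions, not a description of the 34010 instruction itself.

```python
WORD_BITS = 16  # assumed memory word width for this sketch


def move_field(mem, src_bit, dst_bit, size):
    """Copy a `size`-bit field between bit addresses in a list of 16-bit words.

    Assumes bit 0 is the least significant bit of word 0 and that neither
    field crosses a word boundary (a simplification for illustration).
    """
    src_word, src_off = divmod(src_bit, WORD_BITS)
    dst_word, dst_off = divmod(dst_bit, WORD_BITS)
    assert src_off + size <= WORD_BITS and dst_off + size <= WORD_BITS
    mask = (1 << size) - 1

    field = (mem[src_word] >> src_off) & mask    # read source, mask out uninvolved bits
    aligned = field << dst_off                   # align with the destination location
    dest = mem[dst_word] & ~(mask << dst_off)    # read destination, clear the affected field
    mem[dst_word] = (dest | aligned) & 0xFFFF    # merge, then write the result back
    return mem


mem = [0xABCD, 0xFFFF]
move_field(mem, src_bit=4, dst_bit=19, size=8)
print([hex(w) for w in mem])  # ['0xabcd', '0xfde7']
```

Each commented line corresponds to one of the separate operations listed in the text; a processor with memory bit-field instructions folds them into a single instruction.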
Unlike general data caches, relatively small instruction caches can have very good hit rates. This is due to the locality of programs and the fact that many time-critical functions are done in small loops. Additionally, instruction caches are less expensive than general data caches since they do not have to deal with writing data back to main memory (unless self-modifying code is allowed).

Making DRAM accessible to low-chip-count systems. A key difference between the 34010 and other microprocessors is its direct interface to DRAM. This interface changes the point at which DRAM becomes cost effective in a system.

Figure 6 shows the minimum number of devices required to implement an embedded system in a low-cost PC platform. A PAL (programmable array logic) and an octal transceiver provide the host processor access to the 34010’s five-chip, 128K-byte DRAM system. The host processor provides nonvolatile program store and bootstrap capability for the system. This provides two benefits. First, the host has random access into the processor’s memory space for communications, eliminating the need for a separate I/O channel. Second, the operating software of the embedded processor can be updated or entirely changed remotely from the host system. This eliminates the need for exchanging parts or otherwise “touching” the embedded system hardware in order to perform field upgrades.

Figure 6. TMS34010 DRAM system. [Diagram labels: host processor, 74ALS245, PAL20L8A, external data bus, external address bus, HD15-HD8, CAS, HRDY, RESET, control.]

Although DRAMs are the most cost-effective memory device per bit, they have not generally been used in low-cost, low-chip-count systems. DRAM devices require complex timing and address multiplexing, which, if not built into the microprocessor, require external control. Because of this overhead, applications needing relatively few memory chips have not previously used DRAM.

The capacity of DRAM chips gives the designer fairly large amounts of memory in very few chips. With 256K-bit DRAMs organized as 64K deep by 4 bits wide per chip, only four chips give a 16-bit data width and 128K bytes of memory. With 256K by 4 (one megabit) DRAM devices, four chips result in 512K bytes. Thus, a relatively small number of DRAM devices can provide all the RAM required for most embedded applications.

Having the DRAM interface built into the processor can provide performance benefits in systems using these lower cost memories. Typically, considerable time is lost in interfacing a microprocessor to the DRAM because of the inherent losses in going through another controller chip and because of timing mismatches between the microprocessor’s signals and the DRAM. The processor uses a high-frequency (40 MHz to 60 MHz) input clock so that it can precisely place the large number of timing edges required by a DRAM.

Since DRAM is generally slower than other memory types, considerable attention was given to improving performance in a DRAM-based system. While “big system” host models may use multitiered memory hierarchies with external instruction and data caches and very fast buses to achieve very high performance, these features come at a cost. The instruction cache and the large register file were designed to reduce accesses to the DRAM.
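The chip-count arithmetic in the DRAM discussion above can be checked with a few lines of Python. The helper name `dram_bank` is hypothetical; the chip organizations are the ones quoted in the text, and K is taken as 1024.

```python
def dram_bank(depth_words, bits_per_chip, chips):
    """Return (data width in bits, capacity in bytes) for chips wired side by side."""
    width_bits = bits_per_chip * chips
    capacity_bytes = depth_words * width_bits // 8
    return width_bits, capacity_bytes


K = 1024
print(dram_bank(64 * K, 4, 4))    # 64K x 4 parts: (16, 131072), i.e. 128K bytes
print(dram_bank(256 * K, 4, 4))   # 256K x 4 parts: (16, 524288), i.e. 512K bytes
```

In both cases four chips fill the 16-bit external data path exactly, so the chip count stays the same as the density grows.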

Packaging for low cost. The processor requires only 68 pins and power dissipation of about one-half watt, allowing it to use a very low cost plastic package. Packaging dominates the cost of making most high-volume components, so fitting a low-cost package was an important design consideration. One key to keeping the pin count small lay in the DRAM timing. Time multiplexing the row address, column address, and data on the same pins fit the DRAM timing requirements well and saved a number of pins.

While the 34010 has a 32-bit internal data path, the external data path is 16 bits to reduce overall system cost. This design decision kept the pin count down and eliminated the overhead associated with a wider data bus. As part of the original strategy, wider data bus versions of the device are under development.

Opcode formats and design simplicity. The 34010 follows the RISC philosophy of having very few fixed opcode formats and has a fixed-length 16-bit instruction. As other researchers have indicated, a 32-bit opcode, as in most RISC machines, is not efficient in terms of the number of bits required by a function.6 The same philosophy of architectural efficiency has resulted many times in inefficient instruction encoding. Opcodes (not including immediate data) on the 34010 were kept to 16 bits in length to reduce instruction bandwidth requirements and to obtain better utilization of the instruction cache.

Another objective was to make the instruction formats easy to decode. There are only four opcode formats: one-register, two-register, short-constant, and jump opcodes, as shown in Figure 7. These four formats are organized into two groups for decoding purposes: single-register and everything else. For the one-register format a special nonbinary address decoder was used to avoid an extra level of decoding. Thus, one less level of instruction pipelining was required, resulting in simpler logic and faster jump and branch execution. For the other formats, the upper 8 bits (bits 8 to 15) of the opcode specify the instruction. Redundant states in the microcode preclude the need for detailed decoding beyond the upper 8 bits.

Figure 7. TMS34010 opcode formats. [Four 16-bit formats, bits 15 to 0: one and no register format (opcode, file, register number); two-register format (opcode, source register, file, destination register); short-constant format (opcode, short constant, file, destination register); conditional jump format.]

The two-register-file organization was devised to provide a large register file and at the same time to maintain a compact, easy-to-decode, 16-bit instruction word. This organization requires only 9 bits (since the register file select bit is shared) as opposed to 10 bits to specify two of 31 different registers, but it also limits operations between files. Other approaches were considered, such as split address/data files as was done on the 68000.7 But that organization might have restricted the operations that could be done on addresses, limited the flexibility of register usage for address or data, and made instruction decoding more complicated.

RISC. What discussion of microprocessor architecture decisions would be complete today without commenting on RISCs, a much overused and abused term?8 The 34010 design, influenced by the Berkeley RISC philosophy, has a RISC base instruction set to which were added special graphics instructions. Independent of the RISC influence, our experience designing the 9900 family of microprocessors had demonstrated that decoding complex instructions wastes hardware and performance.

Statistical data on general applications running on our 990 minicomputers agreed with other studies indicating that move, jump, increment, decrement, and add operations dominate the mix of instructions executed.9,10 In contrast, embedded applications can stress a specific function and thus may have instruction mixes that look very dissimilar to the general-purpose cases.

The 34010 supports variable bit-field sizes, not found in many RISCs, but it does adopt the Berkeley RISC concept of sign or zero extending to 32 bits for smaller data types when moving them into a register.11 Like the Berkeley RISC, it performs 32-bit register-to-register arithmetic.

Although it does not use the Berkeley register window concept, the 34010 does offer a larger register file than most 32-bit microprocessors. Some consideration was given to the register window concept since it is similar to the older workspace pointer concept of the TMS9900 family, which used a pointer to its register file in external RAM. The 9995 microprocessor had 256 bytes of internal memory, typically used to hold a large number of workspace “registers,” but this was still in slower RAM rather than faster, dual-ported register file storage. The 34010’s designers decided that the very large windowed file of the Berkeley RISC would add too much to the chip’s size and would create speed paths that might limit fast operation.

In keeping with the RISC philosophy, the 34010 executes most basic instructions in a single cycle. By using a single-level, very wide (]&-bit) control word, the processor can execute single-cycle instructions without extra pipelining. The opcodes were kept very simple and were constructed to act directly as addresses into the microcoded ROM. This eliminated the extra decode logic normally associated with microcoded microprocessors. In effect, whereas a RISC machine uses a PLA for instruction decoding, a single-level ROM lookup is sufficient on the 34010.

Departing from the strict RISC philosophy, the designers recognized that applications would benefit greatly from more sophisticated instructions that do not fit the single-cycle model of pure RISC machines. The complex PIXBLT instructions can be pipelined much more efficiently as single instructions than if broken into multiple single-cycle instructions. With internal microcoding, the address manipulations, field extractions, merging, and multiple memory cycles can be more efficiently coordinated. The wide control word supports many parallel operations for faster execution.

Real-time software development. The importance of assembly and HLL support was the basis for TI’s decision to support the 34010 family with source and object management tools for assembly and C. In addition, both real-time and software-only emulation tools provide debugging capability for hardware and algorithm prototyping. Application libraries provide source code algorithms for specific design tasks such as CCITT (International Consultative Committee for Telegraphy and Telephony) Group 3/Group 4 compression and decompression, as well as for various graphics standards from the Massachusetts Institute of Technology, the American National Standards Institute, Graphic Software Systems, Microsoft, and others.12 Embedded-processor applications also usually require real-time operating systems. These have been developed specifically for embedded processing for the 34010 and are available from external parties.12,13

Applications in embedded control

The general-purpose processing power and low system cost of the 34010 make it useful for a wide range of applications. Because of its graphics hardware and instructions, it naturally found wider initial use in graphics and graphics-related applications, but over time more designers are finding it more broadly applicable. The bit-field processing is an important feature that makes it attractive in control and data compression applications.

Graphics terminal and display systems. The 34010 has been widely used in both graphics add-in boards and display terminals.12 The hardware support for video display terminals, DRAM, VRAM, and host interface greatly reduces the cost of these systems. The support for graphics drawing, X-Y addressing, pixel block transfers, and general-purpose processing provides high performance and easier programming.

The 34010 can also be used to offload nongraphic processing from the host to improve overall system performance. As a result, entire applications have been written to run directly on the embedded processor, using the host only for keyboard interface, disk drive, and other I/O functions.

Consumer electronics. Because it is both a powerful microprocessor and a graphics chip integrated in low-cost packaging, the 34010 has potential applications in new consumer graphics designs. To date, most video game systems have used a combination of an 8-bit microprocessor and a video controller chip to support graphics functions such as sprites (two-dimensional X-Y positional characters). The microprocessor and the controller chip can be replaced by a single 34010. The processing power of the resultant system opens up whole new areas of education as well as entertainment. Electronic building blocks, home 3-D CAD, flight simulators, driving trainers, and 3-D adventure games are just a few of the possibilities.

Image compression for facsimile and CD-ROM. Graphics and images require large amounts of data to be stored and/or transmitted, making data compression essential both for performance and cost. The need to globally transmit images has resulted in the CCITT facsimile standards for data compression.14,15 Inexpensive stand-alone systems can be constructed incorporating CCITT Group 3, a 9600-bps modem, and paper-handling control. The acceptance of CCITT Group 3 compression has led to its wider use in other applications. A laser printer with the addition of a 9600-bps modem can double as a facsimile printer. The CCITT standard is also being used as a compression method in applications such as CD-ROM (compact disks) and scanners.

Fax-modem PC add-in boards provide facsimile transmission directly from a PC. They have a considerable quality advantage over printing out the image and then having it scanned by a stand-alone fax machine because they eliminate the distortion introduced during the scanning process. Since CCITT encoding is a generally accepted standard, it is also a convenient way to communicate images between machines without an intermediate step to paper.

The CCITT Group 3 and Group 4 standards are based on bit manipulations and variable field length encoding. This bit-field processing is applied to edge detection, run length encoding, and data movement. A graphics processor has the added benefit of being able
to generate, display, and manipulate the resulting images. In the fax-modem application, although most PC displays cannot display the high-resolution fax image, the image can be translated into a gray-scale image that can be displayed for preview.

Data compression techniques are being applied to increase the storage capabilities of such high-density media as CD-ROMs and other optical laser disks. As the storage requirements for these media increase, a higher degree of processing capability is required of the embedded controller to compress and decompress the data.

The CCITT standard is primarily focused on black-and-white document transmission and is not ideal for every application. There are denser compression methods, particularly for handling color and gray-scale images. Because of the 34010’s programmability, it can be adapted to handle many of these unique compression methods.

Page printers. Laser and other printing technology, such as thermal dye transfer (used in many color printers), is an obvious application for an embedded microprocessor with special graphics capabilities. Figure 8 shows a 34010-based laser printer system. In addition to generating the print image, the processor can perform other functions such as controlling the print engine and the page feeder.

Figure 8. Laser printer system. [Diagram labels: 34010, interface, control, address, program ROM/RAM, buffer DRAM, laser printer engine.]

Page printers require the ability to move and manipulate (to the bit level) large bitmaps of data, calculate outline fonts, and emulate dot matrix printers. Many printers also support page description languages such as PostScript or compatible products. For economy the printers typically have used 16-bit microprocessors, but these processors can often be the limiting factor in print speed, particularly when a page description language is used. Even for black-and-white laser printers, the amount of memory required is large due to the relatively high-resolution (typically over 2500 by 3300 dots, or about one megabyte) images they generate. This has led to the almost exclusive use of dynamic memory for the image buffer.

For 6- to 10-page-per-minute laser printers, the 34010 works alone. The on-chip DRAM controller reduces the external logic requirement, and the host interface is used as a general communication port. For performance of 15 to 60 ppm with page description requirements, multiprocessor configurations can be applied to improve throughput. The embedded system’s master processor can be either a 16- or 32-bit microprocessor. The master processor handles the page description language interpretation and then passes the graphics processing to the slave processors. The graphics processing performance of the 34010 and the resulting low system cost make this multiprocessor configuration feasible.

Additional embedded application areas for the 34010 include laboratory and factory data acquisition, optical character recognition, dashboard and cockpit instrumentation, cardiac monitors, radar control systems, process control, robotics, and medical imaging systems.
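A quick calculation shows why the page image buffer discussed in the page-printer section is large. The sketch below assumes one bit per dot for a black-and-white page (an assumption; the text gives only the dot dimensions), and the helper name `bitmap_bytes` is hypothetical.

```python
def bitmap_bytes(width_dots, height_dots, bits_per_dot=1):
    """Bytes needed for a full-page bitmap, rounded up to a whole byte."""
    total_bits = width_dots * height_dots * bits_per_dot
    return (total_bits + 7) // 8


size = bitmap_bytes(2500, 3300)
print(size)                        # 1031250 bytes
print(round(size / 2**20, 2))      # 0.98, i.e. roughly one megabyte
```

At 8.25 million dots, a single uncompressed page already needs about a megabyte, which is consistent with the use of DRAM for the image buffer.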

The 34010 is the first of a family of 32-bit embedded processors blending performance and system integration to support the growth of advanced applications. New family members currently in design will extend the design philosophy of this processor and improve the performance of the next generation of embedded systems.

References

1. R. Noyce and M. Hoff, “A History of Microprocessor Development at Intel,” IEEE Micro, Feb. 1981, pp. 8-21.
2. C.R. Killebrew Jr., “The TMS34010 Graphics System Processor,” Byte, Dec. 1986, pp. 193-204.
3. M. Asal et al., “The Texas Instruments 34010 Graphics System Processor,” IEEE Computer Graphics and Applications, Oct. 1986, pp. 24-39.
4. TMS34010 User’s Guide, Graphics Products, Texas Instruments Inc., Jan. 1987.
5. R. Pinkham, M. Novak, and K. Guttag, “Video RAM Excels at Fast Graphics,” Electronic Design, Aug. 18, 1983, pp. 161-168.
6. M. Johnson, “System Considerations in the Design of the Am29000,” IEEE Micro, Aug. 1987, pp. 28-41.
7. E. Stritter and T. Gunter, “A Microprocessor Architecture for a Changing World: The Motorola 68000,” Computer, Feb. 1979, pp. 43-52.
8. R.P. Colwell et al., “Computers, Complexity, and Controversy,” Computer, Sept. 1985, pp. 8-19.
9. R.P. Blake, “Exploring a Stack Architecture,” Computer, May 1977, pp. 30-38.
10. D. Fairclough, “A Unique Microprocessor Instruction Set,” IEEE Micro, May 1982, pp. 8-18.
11. D.A. Patterson and C.H. Sequin, “A VLSI RISC,” Computer, Sept. 1982, pp. 8-20.
12. TMS34010 3rd Party Guide, 2nd ed., Texas Instruments Inc., Oct. 1987.
13. C Executive User’s Manual, JMI Software Consultants, 1986.
14. “Standardization of Group 3 Facsimile Apparatus for Document Transmission,” Recommendation T.4, CCITT, 1984.
15. “Facsimile Coding Schemes and Coding Control Functions for Group 4 Facsimile Apparatus,” Recommendation T.6, CCITT, 1984.

Karl M. Guttag works in the Microprocessor and Microcontroller Division of Texas Instruments. Since 1982 he has been responsible for graphics product definition, including graphics processor architecture and the multiport video RAM. Previously, he was the IC architect of two 16-bit microprocessors and a design engineer of the TMS9918 video display processor. Guttag received his BSEE from Bradley University and his MSEE from the University of Michigan. He was elected a Texas Instruments fellow in 1988. He holds 25 patents in microprocessors and computer graphics hardware and is a member of the IEEE and the ACM.

Thomas M. Albers is responsible for software support of the 340 family of processors. He joined TI in 1980 and has worked with 4-, 8-, and 32-bit microprocessors. He has contributed to the microcode tools, software simulator, and operating environment for the TMS34010. Currently, he is working on strategic software development and TI’s next generation of graphics devices. Albers received a BS in electrical engineering from Texas A&M University. He is a member of Tau Beta Pi and the ACM.

Michael D. Asal is a systems engineer in the Microprocessor and Microcomputer Products Division of Texas Instruments. He is currently working on the design of TI’s next-generation graphics processor in Bedford, England. Since joining TI in 1982, he has been involved with the specification, architectural development, and design of the TMS34010 and higher performance follow-on devices. He was elected a member of the Semiconductor Group technical staff in 1987. Asal received the BSEE and MSEE from Bradley University.

Kevin G. Rose is a strategic marketing engineer with Texas Instruments. He provides application and marketing support to laser printer, terminal, and facsimile designers. He also manages business development for these markets. Rose received his BS from Brigham Young University and his MBA from the University of Utah.

Questions about this article can be addressed to Tom Albers, Texas Instruments Inc., PO Box 1443, M/S 712, Houston, TX 77001.
