Computer Architecture
PART I
INTRODUCTION
From the earliest technology discussions that led to the creation of the SuperH®
RISC engine architecture nearly a decade ago, to the development efforts now under way
or planned, there has been one basic engineering and marketing goal for the product line.
The essential parts of that common goal are:
• to provide an extended series of upward-compatible microcontroller (MCU) and
microprocessor (MPU) devices
• to offer optimized balances of performance, power consumption, integration and die
size
• to allow customers to take full advantage of windows of
market opportunity
• to deliver economical devices that customers can use to build systems that offer the
price/performance levels needed to achieve high sales volumes.
The four generations of SuperH Cool Engine™ RISC processors currently in
production conform to an aggressive, periodically updated technology roadmap.
Enthusiastic customer response has earned the architecture a leadership
position worldwide in the 32-bit embedded RISC market.
To supply customers with advanced processors for the products and systems of
the next decade, the SuperH roadmap specifies a fifth-generation architecture
(and, beyond that, a sixth). Development of the fifth-generation architecture was guided
by the overall SuperH series engineering and marketing goal described previously. To
fulfill that goal, given today’s evolving, escalating market requirements, the development
team had to overcome many design challenges. Specifically, they had to create a
microprocessor core that enables next-generation system-on-a-chip (SOC) consumer
products, provides enhanced performance for multimedia applications, and reduces
customers’ time to market. The Hitachi and STMicroelectronics (ST™) design team
accomplished this and more.
Windows CE supported the Hitachi SuperH-3 and SuperH-4 processors. These
were commonly abbreviated SH-3 and SH-4, or just SH3 and SH4, and the architecture
series was known as SHx. I’ll cover the SH-3 processor in this series, with some nods to
the SH-4 as they arise. But the only binaries I have available for reverse-engineering are
SH-3 binaries, so that’s where my focus will be. The SH-3 is the next step in the processor
series that started with the SH-1 and SH-2. It was succeeded by the SH-4 as well as the
offshoots SH-3e and SH-3-DSP. The SH-4 is probably most famous for being the processor
behind the Sega Dreamcast. As with all the processor retrospective series, I’m going to
focus on how Windows CE used the processor in user mode, with particular focus on the
instructions you will see in compiled code.
The SH-3 can operate in either big-endian or little-endian mode. Windows CE uses
it in little-endian mode.
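To make the byte-order distinction concrete, here is a minimal sketch using Python's standard `struct` module. It shows how the same 32-bit value is laid out in memory under each mode. The helper name `store_u32` and the sample value are illustrative assumptions, not anything taken from Windows CE or the SH-3 documentation.

```python
import struct

def store_u32(value, little_endian=True):
    """Return the four bytes of a 32-bit value in memory order.

    Hypothetical helper for illustration only.
    """
    fmt = "<I" if little_endian else ">I"
    return struct.pack(fmt, value)

value = 0x12345678
print(store_u32(value, little_endian=True).hex(" "))   # 78 56 34 12
print(store_u32(value, little_endian=False).hex(" "))  # 12 34 56 78
```

In little-endian mode, the mode Windows CE uses, the least significant byte sits at the lowest address.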
The SH-3 has sixteen general-purpose integer registers, each 32 bits wide, and
formally named r0 through r15. They are conventionally used as follows:
Register  Conventional use  Preserved?
r0        return value      No
r1                          No
r2                          No
r3                          No
r4        argument 1        No
r5        argument 2        No
r6        argument 3        No
r7        argument 4        No
r8                          Yes
r9                          Yes
r10                         Yes
r11                         Yes
r12                         Yes
r13                         Yes
We’ll learn more about the conventions when we study calling conventions.
There are actually two sets (banks) of the first eight registers (r0 through r7). User-
mode code uses only bank 0, but kernel mode can choose whether it uses bank 0 or bank
1. (And when it’s using one bank, kernel mode has special instructions available to access
the registers from the other bank.)
The SH-3 does not support floating point operations, but the SH-4 does. There are
sixteen single-precision floating point registers which are architecturally
named fpr0 through fpr15, but which the Microsoft assembler calls fr0 through fr15. They
can be paired up to produce eight double-precision floating point registers:
Double-precision register Single-precision register pair
If you try to perform a floating point operation on an SH-3, it will trap, and the kernel
will emulate the instruction. As a result, floating point on an SH-3 is very slow.
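The pairing of single-precision registers into double-precision ones can be sketched as a small mapping. This assumes the usual SH-4 naming, in which the double-precision registers carry even numbers (dr0, dr2, ... dr14) and each one overlays the two single-precision registers starting at the same number. The function name is hypothetical, and the sketch makes no claim about which half of the pair holds the high-order bits.

```python
def double_register_pair(n):
    """Return the two single-precision register names that make up dr<n>.

    Assumes the even-numbered dr0..dr14 naming convention (an assumption
    of this sketch, not stated in the text above).
    """
    if n % 2 != 0 or not 0 <= n <= 14:
        raise ValueError("double-precision registers are dr0, dr2, ..., dr14")
    return ("fr%d" % n, "fr%d" % (n + 1))

for n in range(0, 16, 2):
    first, second = double_register_pair(n)
    print("dr%d = %s:%s" % (n, first, second))
```

Running the loop lists eight pairs, which matches the count of eight double-precision registers given above.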
Windows CE requires that the stack be kept on a 4-byte boundary. I did not observe any
red zone.
Some calling conventions for the SH-3 say that mach and macl are preserved, or that gbr is
reserved, but in Windows CE, they are all scratch.
The architectural names for data sizes are as follows:
The SH-3 has branch delay slots. Ugh, branch delay slots. What’s worse is that some
branch instructions have branch delay slots and some don’t.
Instructions on the SH-3 are generally written with source on the left and destination on
the right. For example,
After an instruction that modifies flags, the new flags are not available for a cycle,
and after a load instruction, the result is not available for two cycles. There are other
pipeline hazards, but those are the ones you are likely to encounter. If you try to use the
results of a prior instruction too soon, the processor will stall. (Don’t forget that the SH-3
is dual-issue, so two cycles can mean up to four instructions.)
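The two hazards described above can be illustrated with a toy stall counter. This is a deliberately simplified sketch, not a model of the real SH-3 pipeline: it assumes single-issue, in-order execution (ignoring dual issue), and it interprets "not available for N cycles" as N extra cycles beyond the normal next-cycle availability. The instruction tuples and names are hypothetical.

```python
# Extra cycles before a result can be consumed (assumed from the text above).
RESULT_DELAY = {"load": 2, "flags": 1, "alu": 0}

def count_stalls(program):
    """Count stall cycles for a list of (kind, dest, sources) tuples.

    Toy model: one instruction issues per cycle, and an instruction waits
    until every source it reads is ready.
    """
    ready_at = {}  # resource name -> first cycle at which it may be read
    cycle = 0
    stalls = 0
    for kind, dest, sources in program:
        earliest = max([ready_at.get(s, 0) for s in sources], default=0)
        if earliest > cycle:
            stalls += earliest - cycle
            cycle = earliest
        if dest is not None:
            ready_at[dest] = cycle + 1 + RESULT_DELAY[kind]
        cycle += 1
    return stalls

# Using a loaded value immediately stalls for two cycles in this model.
back_to_back = [("load", "r1", ["r4"]), ("alu", "r2", ["r1"])]
print(count_stalls(back_to_back))  # 2

# Two independent instructions in between hide the load latency.
scheduled = [
    ("load", "r1", ["r4"]),
    ("alu", "r5", ["r6"]),
    ("alu", "r7", ["r6"]),
    ("alu", "r2", ["r1"]),
]
print(count_stalls(scheduled))  # 0
```

The second program shows why compilers tend to place unrelated instructions between a load and its first use.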
HISTORY
The SuperH processor core family was first developed by Hitachi in the early
1990s. Hitachi has developed a complete group of CPU cores with upward-compatible
instruction sets. The SH-1 and the SH-2 were used in the Sega Saturn, Sega
32X and Capcom CPS-3. These cores have 16-bit instructions for better code density than
32-bit instructions, which was a great benefit at the time, due to the high cost of main
memory.
A few years later, the SH-3 core was added to the SH CPU family. New features
included a different interrupt concept, a memory management unit (MMU) and a modified
cache concept. The SH-3 core also received a DSP extension, then called SH-3-DSP. With
extended data paths for efficient DSP processing, special accumulators and a
dedicated MAC-type DSP engine, this core unified the DSP and RISC processor
worlds. A derivative was also used with the original SH-2 core.
Between 1994 and 1996, 35.1 million SuperH devices were shipped worldwide.
For the Dreamcast, Hitachi developed the SH-4 architecture. Superscalar (2-way)
instruction execution and a vector floating point unit (particularly suited to 3D graphics)
were the highlights of this architecture. SH-4 based standard chips were introduced
around 1998.
The SH-3 and SH-4 architectures support both big-endian and little-endian byte
ordering (they are bi-endian).
The evolution of the SuperH architecture continues. The latest evolutionary
step took place around 2003, when the cores from SH-2 up to SH-4 were unified
into a superscalar SH-X core, which forms a kind of instruction-set superset of the
previous architectures.
Today, the SuperH CPU cores, architecture and products are with Renesas
Electronics, a merger of the Hitachi and Mitsubishi semiconductor groups, and the
architecture is consolidated around the SH-2, SH-2A, SH-3, SH-4 and SH-4A platforms,
giving a scalable family.
PART II
SH-2
The SH-2 is a 32-bit RISC architecture with a fixed 16-bit instruction length for high
code density. It features a hardware multiply–accumulate (MAC) block for DSP
algorithms and has a five-stage pipeline.
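To show what a fixed 16-bit instruction word looks like in practice, the following sketch decodes two register-to-register instructions. The bit patterns used here (0110nnnnmmmm0011 for MOV Rm,Rn and 0011nnnnmmmm1100 for ADD Rm,Rn) are quoted from memory of the SuperH programming manuals and should be checked against the manual before being relied on. The `decode` function is a hypothetical illustration, not a complete disassembler.

```python
def decode(word):
    """Decode a 16-bit SuperH instruction word (two opcodes only).

    Encodings are an assumption to verify against the vendor manual.
    Output uses source-on-the-left, destination-on-the-right order.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction words are 16 bits wide")
    op_hi = (word >> 12) & 0xF   # major opcode
    n = (word >> 8) & 0xF        # destination register number
    m = (word >> 4) & 0xF        # source register number
    op_lo = word & 0xF           # minor opcode
    if op_hi == 0x6 and op_lo == 0x3:
        return "mov r%d, r%d" % (m, n)
    if op_hi == 0x3 and op_lo == 0xC:
        return "add r%d, r%d" % (m, n)
    return ".word 0x%04x" % word  # anything this sketch does not know

print(decode(0x6213))  # mov r1, r2
print(decode(0x6EF3))  # mov r15, r14
print(decode(0x321C))  # add r1, r2
```

Because every instruction is two bytes, two 4-bit register fields plus the opcode bits fill the entire word, which is part of why the instruction set is mostly two-operand.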
Today the SH-2 family stretches from 32 KB of on-board flash up to ROM-less devices. It
is used in a variety of different devices with differing peripherals such as CAN, Ethernet,
motor-control timer unit, fast ADC and others.
SH-2A
The SH-2A is an upgrade to the SH-2 core. It was announced in early 2006.
The SH-2A family today spans a wide memory range starting at 16 KB and includes many
ROM-less variations. The devices feature standard peripherals such
as CAN, Ethernet, USB and more as well as more application specific peripherals such
as motor control timers, TFT controllers and peripherals dedicated to automotive
powertrain applications.
SH-4
The SH-4 is a 32-bit RISC CPU developed primarily for use in multimedia
applications, such as Sega's Dreamcast and NAOMI game systems. It includes a much
more powerful floating point unit and additional built-in functions, along with the
standard 32-bit integer processing and 16-bit instruction size.
• FPU with four floating point multipliers, supporting 32-bit single precision and
64-bit double precision floats
• 4D floating point dot-product operation
• 128-bit floating point bus allowing 3.2 GB/sec transfer rate from the data cache
• 64-bit external data bus with 32-bit memory addressing, allowing a maximum of 4 GB
addressable memory with a transfer rate of 800 MB/sec
• Built-in interrupt, DMA, and power management controllers
There is no FPU in the custom SH4 made for Casio, the SH7305.
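The bandwidth and address-space figures in the SH-4 feature list can be cross-checked with simple arithmetic. The sketch below assumes one transfer per clock and decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes). The clock rates it prints are derived from the quoted figures under those assumptions and are not stated in the list itself. The function names are hypothetical.

```python
def implied_clock_hz(bus_bits, bytes_per_second):
    """Clock rate implied by a bus width and peak rate, assuming one transfer per clock."""
    return bytes_per_second // (bus_bits // 8)

def addressable_bytes(address_bits):
    """Size of the address space reachable with the given number of address bits."""
    return 2 ** address_bits

# 128-bit floating point bus at 3.2 GB/sec
print(implied_clock_hz(128, 3_200_000_000))  # 200000000

# 64-bit external data bus at 800 MB/sec
print(implied_clock_hz(64, 800_000_000))     # 100000000

# 32-bit addressing
print(addressable_bytes(32))                 # 4294967296
```

Under these assumptions the internal bus figure corresponds to a 200 MHz clock, the external bus figure to 100 MHz, and the 4 GB limit is 2^32 bytes.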
SH-5
Almost no non-simulated SH-5 hardware was ever released, and, unlike the still-live
SH-4, support for SH-5 was dropped from GCC.
PART III
PART IV
PART V
REFERENCES
https://en.wikipedia.org/wiki/SuperH
http://segatech.com/technical/cpu/tech_sh4.html