5-stage Pipeline CPU hardware
Pipeline CPU hardware
Distribution of control signals in pipeline CPU
Data hazards
Control hazards
A control hazard occurs whenever the normal sequential flow of the program
changes (caused by a branch/jump, a subroutine call, an interrupt, a return
from interrupt, etc.)
Structural hazards
[1] A multiply instruction holds the Ex stage for two or more clock cycles.
[2] Two or more instructions in the pipeline try to read/write the register
file => since there is only one read/write port, only one instruction at a
time is allowed to read/write the register file.
ARM Architecture
ARM core:
Pipelined RISC CPU with a reduced number of fixed-size instructions
Offers high code density, small size, low power
Applications: cell phones, handheld PDAs, cameras
But differs from pure RISC (to gain some advantages):
Variable-cycle execution for certain instructions, to support multiple
load and store
Inline barrel shifter, leading to a few complex instructions:
preprocessing one operand enhances computational power
Thumb state (16-bit instruction set) to improve code density
Conditional execution of instructions for smooth pipeline operation
DSP instructions to support signal processing
Performance: speed => MIPS @ clock freq., DMIPS @ clock freq.
power => mW @ (voltage, clock freq., technology)
DMIPS
Dhrystone is a synthetic benchmark program for system programming. DMIPS
therefore measures not just instructions per second but gives an idea of how
long one processor will take overall to perform a task compared with another,
taking into account the different number and kinds of instructions.
The industry has adopted the VAX 11/780 as the reference 1 MIPS machine.
The VAX 11/780 achieves 1757 Dhrystones per second.
The DMIPS figure of a given computing system is calculated by measuring
the number of Dhrystones executed per second and dividing that by 1757.
So if a computing system is able to execute 140560 Dhrystones per second,
then its DMIPS rating is 140560/1757 = 80 DMIPS
To compare two computing systems that run at different clock frequencies,
DMIPS is normalized to the clock frequency.
e.g. 60 DMIPS @ 40 MHz = 1.5 DMIPS/MHz
New Benchmarking => CoreMark MIPS
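The DMIPS arithmetic above can be sketched in a few lines of Python. This is only an illustration of the two formulas on the slide; the function names dmips and dmips_per_mhz are hypothetical and not part of any benchmark tool.

```python
# Reference score from the slide: the VAX 11/780 (the 1 MIPS machine)
# achieves 1757 Dhrystones per second.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips(dhrystones_per_second):
    """Dhrystone MIPS: measured Dhrystones/s divided by the VAX reference."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dmips_value, clock_mhz):
    """Normalize a DMIPS rating to the clock frequency in MHz."""
    return dmips_value / clock_mhz


print(dmips(140560))          # 80.0
print(dmips_per_mhz(60, 40))  # 1.5
```

Normalizing to DMIPS/MHz is what allows two cores running at different clock frequencies to be compared.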
Sign extend -> converts a signed 8/16-bit value to a 32-bit value and places it in a register
Two source registers (Rn and Rm) and one result register (Rd)
Barrel shifter => preprocesses Rm before it enters the ALU
MAC unit => for multiply-and-accumulate operations
On Chip Debug Hardware
ARM Architecture
ARM Core under study is ARM7TDMI
ARM state => Instructions are 32-bit wide and address is word aligned
Thumb state => Instructions are 16-bit and address is half-word aligned
ARM Modes:
The ARM processor defines different modes, each for a specific purpose
User mode => most application software runs in this mode
ARM Architecture
Exception modes => Supervisor, IRQ, FIQ, abort, undefined
Non exception modes=> User, System
‘supervisor’ mode => runs embedded operating system routines
‘User’ mode => runs Application programs
IRQ & FIQ modes => handle hardware interrupts
Abort mode => handles memory access violations
Undefined mode => handles undefined instructions
ARM Architecture
CPSR:
32-bit register holding the condition flags, control bits, and status and extension fields
Only privileged modes have full write access to the CPSR
Every processor mode except User mode can change mode by writing
directly to the mode bits of the CPSR.
N = 1 if the MSB of the ALU result is 1
Z = 1 if the ALU result is zero
C = 1 if the ALU operation produces a carry (for subtraction, C is cleared when a borrow occurs, i.e. the result would be negative)
V = 1 if the ALU operation oVerflowed (meaningful for signed numbers only)
Flags are updated only if the suffix ‘S’ is added to the instruction
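The flag rules above can be illustrated with a small Python model of a 32-bit add and subtract that return N, Z, C and V. This is a sketch only: the helper names adds and subs are hypothetical, chosen to echo instructions with the ‘S’ suffix, and it models just the flag arithmetic, not the processor.

```python
MASK32 = 0xFFFFFFFF  # keep every value inside 32 bits


def adds(a, b):
    """Model a 32-bit add with flag update: return (result, N, Z, C, V)."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = result >> 31                  # N: MSB of the result
    z = int(result == 0)              # Z: result is zero
    c = int(full > MASK32)            # C: carry out of bit 31
    # V: operands have the same sign but the result's sign differs
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, n, z, c, v


def subs(a, b):
    """Model a 32-bit subtract a - b with flag update: (result, N, Z, C, V)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                   # C is cleared when a borrow occurs
    # V: operands have different signs and the result's sign differs from a
    v = (((a ^ b) & (a ^ result)) >> 31) & 1
    return result, n, z, c, v


print(adds(0x7FFFFFFF, 1))  # (2147483648, 1, 0, 0, 1): signed overflow
print(adds(0xFFFFFFFF, 1))  # (0, 0, 1, 1, 0): carry out, zero result
print(subs(5, 7))           # (4294967294, 1, 0, 0, 0): borrow, so C = 0
```

The last line shows the subtraction case from the slide: 5 - 7 is negative, so C is reset while N is set.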
ARM Architecture
When the processor is executing in ARM state:
All instructions are 32 bits wide
All instructions must be word aligned
Therefore the PC value is stored in bits [31:2], with bits [1:0]
undefined (as instructions cannot be halfword or byte aligned).
When the processor is executing in Thumb state:
All instructions are 16 bits wide
All instructions must be halfword aligned
Therefore the PC value is stored in bits [31:1], with bit [0] undefined
(as instructions cannot be byte aligned).
When the processor is executing in Jazelle state:
All instructions are 8 bits wide
Executes Java bytecodes
Banked Registers:
ARM Architecture
Total 37 registers = 30 general purpose + 6 status + 1 PC
A different set of registers is visible in each mode of operation
User and System modes use the same set of registers
Shaded registers (banked registers) are hidden from User/System mode and
available only in exception modes.
R13 = stack pointer (SP)
R14 = link register (LR) -> holds the return address of a subroutine when it
is called with the BL instruction.
Each exception mode has its own SP and LR
BL{<cc>} subroutine_label (LR automatically stores the return address)
The return can be done in two ways:
MOV PC, LR or
BX LR
ARM Family and Cores
Family  Core          Features                                         ARM ISA    Thumb
                                                                       version    version
ARM7    ARM7TDMI      3-stage pipeline, Thumb state                    ARMv4T     v1
        ARM720T       as ARM7TDMI, cache
        ARM740T       as ARM7TDMI, cache
ARM9    ARM920T       5-stage pipeline, Thumb, data and inst. cache,   ARMv4T
                      MMU
        ARM922T       5-stage pipeline, Thumb, data and inst. cache,
                      MMU
        ARM946E       5-stage pipeline, Thumb, enhanced DSP            ARMv5TE
                      instructions, caches, MPU
        ARM926EJ      5-stage pipeline, Thumb, Jazelle DBX, enhanced   ARMv5TEJ
                      DSP instructions, caches, MMU
ARM11   ARM1156T2(F)  8-stage pipeline, SIMD, Thumb-2, VFP, enhanced   ARMv6T2    v2
                      DSP instructions
ARM Cortex Series: Profile A, Profile R, Profile M
ARM Data Processing
Syntax: <opcode>{<cc>}{S} Rd, Rn, op2
‘op2’ normally comes from the barrel shifter and can be the following:
Rm and Rs must not be the PC (r15) when ‘op2’ uses the shift/rotate-by-register form
Shifts and rotates affect the N, Z and C flags
The immediate (#) shift/rotate amount is a 5-bit unsigned integer
ARM - The Barrel Shifter
LSL: Logical Shift Left
CF <- Destination <- 0
Multiplication by a power of 2

LSR: Logical Shift Right
0 -> Destination -> CF
Division by a power of 2

ASR: Arithmetic Shift Right
Destination -> CF
Division by a power of 2, preserving the sign bit

ROR: Rotate Right
Destination -> CF
Bit rotate with wrap around from LSB to MSB

RRX: Rotate Right Extended
Destination -> CF
Single-bit rotate with wrap around from CF to MSB
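The five barrel-shifter operations can be modelled on 32-bit values as below. This is a minimal sketch, assuming shift amounts of 1 to 31 (the special cases for 0 and 32 are left out); the function names simply reuse the mnemonics, and each one returns the shifted value together with the carry-out bit (CF).

```python
MASK32 = 0xFFFFFFFF  # keep every value inside 32 bits


def lsl(value, n):
    """Logical shift left by n (1..31): zeros enter at the LSB."""
    value &= MASK32
    carry = (value >> (32 - n)) & 1      # last bit shifted out of the MSB end
    return (value << n) & MASK32, carry


def lsr(value, n):
    """Logical shift right by n (1..31): zeros enter at the MSB."""
    value &= MASK32
    carry = (value >> (n - 1)) & 1       # last bit shifted out of the LSB end
    return value >> n, carry


def asr(value, n):
    """Arithmetic shift right by n (1..31): the sign bit is replicated."""
    value &= MASK32
    carry = (value >> (n - 1)) & 1
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> n) & MASK32, carry


def ror(value, n):
    """Rotate right by n (1..31): bits leaving the LSB re-enter at the MSB."""
    value &= MASK32
    result = ((value >> n) | (value << (32 - n))) & MASK32
    return result, result >> 31          # carry = new MSB


def rrx(value, carry_in):
    """Rotate right by one bit through the carry flag (33-bit rotate)."""
    value &= MASK32
    return (carry_in << 31) | (value >> 1), value & 1


print(lsl(5, 3))                  # (40, 0): 5 * 2**3
print(asr(0xFFFFFFF8, 2))         # (4294967294, 0): -8 / 4 = -2
print(ror(1, 1))                  # (2147483648, 1)
print(rrx(1, 0))                  # (0, 1): old LSB moves into the carry
```

Note how asr keeps the top bit set for a negative input while lsr would clear it, which is the "preserving the sign bit" distinction on the slide.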
ARM Data Processing Instructions
CMP, CMN, TST and TEQ always update the flags (even without the ‘S’
suffix) and do not alter any register. They use only Rn and ‘op2’.
MOV and MVN use only two operands, i.e. Rd and ‘op2’
Data processing:
ADD R9, R5, R5, LSL #3   ; R9 = R5 + (R5*8) = 9*R5
RSB R9, R5, R5, LSR #3   ; R9 = (R5/8) - R5
MOV R12, R4, ROR R3      ; R12 = R4 rotated right by the value of R3
CMP R7, R5               ; update flags after (R7 - R5)
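The first two examples above can be checked with a short Python model of how ‘op2’ is shifted before it reaches the ALU. This is a sketch under assumptions: the helper names add_lsl and rsb_lsr are hypothetical, values are treated as unsigned 32-bit words, and flags are ignored.

```python
MASK32 = 0xFFFFFFFF  # keep every value inside 32 bits


def add_lsl(rn, rm, shift):
    """ADD Rd, Rn, Rm, LSL #shift  ->  Rd = Rn + (Rm << shift)."""
    return (rn + ((rm << shift) & MASK32)) & MASK32


def rsb_lsr(rn, rm, shift):
    """RSB Rd, Rn, Rm, LSR #shift  ->  Rd = (Rm >> shift) - Rn."""
    return (((rm & MASK32) >> shift) - rn) & MASK32


r5 = 16
print(add_lsl(r5, r5, 3))       # 144, i.e. 9 * 16
print(hex(rsb_lsr(r5, r5, 3)))  # 0xfffffff2, i.e. 2 - 16 = -14 in two's complement
```

RSB reverses the operand order, which is why the shifted ‘op2’ ends up on the left of the subtraction.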
Conditional Execution:
ARM instructions can be made to execute conditionally by suffixing them
with the appropriate condition code field (e.g. MOVEQ R0, R1).
The condition checks the status of the appropriate flags.
If the condition is true, the instruction executes normally; otherwise it is
not executed.
Advantage => better pipeline performance and higher code density, leading to
higher instruction throughput
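As an illustration of how a condition code is checked against the flags, here is a small Python sketch. The table lists the standard ARM condition mnemonics as predicates on (N, Z, C, V); the helper names condition_passed and mov_cond, and the dictionary used as a register file, are hypothetical.

```python
# Standard ARM condition codes expressed as tests on the N, Z, C, V flags.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,             # equal
    "NE": lambda n, z, c, v: z == 0,             # not equal
    "CS": lambda n, z, c, v: c == 1,             # carry set
    "CC": lambda n, z, c, v: c == 0,             # carry clear
    "MI": lambda n, z, c, v: n == 1,             # negative
    "PL": lambda n, z, c, v: n == 0,             # positive or zero
    "VS": lambda n, z, c, v: v == 1,             # overflow
    "VC": lambda n, z, c, v: v == 0,             # no overflow
    "HI": lambda n, z, c, v: c == 1 and z == 0,  # unsigned higher
    "LS": lambda n, z, c, v: c == 0 or z == 1,   # unsigned lower or same
    "GE": lambda n, z, c, v: n == v,             # signed >=
    "LT": lambda n, z, c, v: n != v,             # signed <
    "GT": lambda n, z, c, v: z == 0 and n == v,  # signed >
    "LE": lambda n, z, c, v: z == 1 or n != v,   # signed <=
    "AL": lambda n, z, c, v: True,               # always
}


def condition_passed(cond, n, z, c, v):
    """Return True if the instruction with this condition would execute."""
    return CONDITIONS[cond](n, z, c, v)


def mov_cond(cond, regs, rd, rm, flags):
    """Model MOV<cond> rd, rm: copy only when the condition holds."""
    if condition_passed(cond, *flags):
        regs[rd] = regs[rm]
    return regs


regs = {"R0": 1, "R1": 9}
print(mov_cond("EQ", dict(regs), "R0", "R1", (0, 0, 0, 0)))  # {'R0': 1, 'R1': 9}
print(mov_cond("EQ", dict(regs), "R0", "R1", (0, 1, 0, 0)))  # {'R0': 9, 'R1': 9}
```

With Z clear, MOVEQ leaves R0 untouched; with Z set, it performs the move.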
ARM Conditional Execution
ARM Conditional Execution
Set the flags, and then use various condition codes (here r0 = a, r1 = x)
CMP   r0, #0
MOVEQ r1, #0      ; if (a == 0) x = 0;
MOVGT r1, #1      ; if (a > 0)  x = 1;
Set of conditional compare instructions
CMP   r0, #4      ; if (a == 4 or a == 10)
CMPNE r0, #10     ;     x = 0;
MOVEQ r1, #0
Reduces the number of instructions
while (a != b) {
    if (a > b) a = a - b; else b = b - a; }      (here r1 = a, r2 = b)

Without conditional execution:
loop:     CMP   r1, r2
          BEQ   finish
          BLT   lessthan
          SUB   r1, r1, r2
          B     loop
lessthan: SUB   r2, r2, r1
          B     loop
finish

With conditional execution:
loop1:    CMP   r1, r2
          SUBGT r1, r1, r2
          SUBLT r2, r2, r1
          BNE   loop1
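The conditional-execution version of the loop can be traced with a small Python model. This is a sketch, assuming positive integer inputs; the function name gcd_conditional is hypothetical. The point it illustrates is that CMP sets the flags once per iteration and the conditional SUBs do not change them.

```python
def gcd_conditional(a, b):
    """Mirror loop1: CMP / SUBGT / SUBLT / BNE, for positive integers."""
    while True:
        # CMP r1, r2: the flags are captured once, before either SUB runs
        gt, lt, ne = a > b, a < b, a != b
        if gt:
            a -= b          # SUBGT r1, r1, r2
        if lt:
            b -= a          # SUBLT r2, r2, r1
        if not ne:
            return a        # BNE loop1 falls through when a == b


print(gcd_conditional(12, 18))  # 6
print(gcd_conditional(9, 4))    # 1
```

Each pass through the loop costs four instructions and one branch, whereas the branching version spends up to three branches per pass.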