l18 Arm

Uploaded by

ThủyBình
ARM

Introduction &
Instruction Set Architecture
Outline
 ARM Architecture
 ARM Organization and Implementation
 ARM Instruction Set
 Thumb Instruction Set
 Architectural Support for System Development
 ARM Processor Cores
 Memory Hierarchy
 Architectural Support for Operating Systems
 ARM CPU Cores
 Embedded ARM Applications
2
ARM History
 ARM – Acorn RISC Machine (1983 – 1985)
 Acorn Computers Limited, Cambridge, England
 ARM – Advanced RISC Machine 1990
 ARM Limited, 1990
 ARM has been licensed to many semiconductor manufacturers

3
ARM’s visible registers
 User level: 15 GPRs (r0–r14), PC (r15), CPSR (current program status register)
 Remaining registers are used for system-level programming and for handling exceptions
[Figure: register organization by mode]
Mode        Registers
user        r0–r14, r15 (PC), CPSR (usable in user mode)
fiq         r8_fiq–r14_fiq, SPSR_fiq
svc         r13_svc, r14_svc, SPSR_svc
abort       r13_abt, r14_abt, SPSR_abt
irq         r13_irq, r14_irq, SPSR_irq
undefined   r13_und, r14_und, SPSR_und
Mode-suffixed registers and SPSRs: system modes only
ARM CPSR format
 N (Negative), Z (Zero), C (Carry), V (oVerflow)
 mode – control processor mode
 T – control instruction set
   T = 1 – instruction stream is 16-bit Thumb instructions
   T = 0 – instruction stream is 32-bit ARM instructions
 I, F – interrupt control (setting I disables IRQ, setting F disables FIQ)
Bit layout of the CPSR:
Bits     31–28     27–8     7   6   5   4–0
Field    N Z C V   unused   I   F   T   mode
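The field positions above can be checked with a small decoder. This is a minimal sketch, not part of the original slides: the function name decode_cpsr is hypothetical, and the mode numbers in MODES are an assumption taken from the classic ARM architecture (the slide only says that bits 4–0 hold the mode).

```python
# Mode encodings are an assumption (classic ARM values); the slide does not list them.
MODES = {
    0b10000: "user",
    0b10001: "fiq",
    0b10010: "irq",
    0b10011: "svc",
    0b10111: "abort",
    0b11011: "undefined",
    0b11111: "system",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into the fields shown on the slide."""
    return {
        "N": (cpsr >> 31) & 1,   # negative
        "Z": (cpsr >> 30) & 1,   # zero
        "C": (cpsr >> 29) & 1,   # carry
        "V": (cpsr >> 28) & 1,   # overflow
        "I": (cpsr >> 7) & 1,    # IRQ disable
        "F": (cpsr >> 6) & 1,    # FIQ disable
        "T": (cpsr >> 5) & 1,    # 1 = Thumb, 0 = ARM
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }

fields = decode_cpsr(0x600000D3)
print(fields)
```

The sample value 0x600000D3 has Z and C set, both interrupts disabled, ARM state, and supervisor mode.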
ARM memory organization
 Linear array of bytes numbered from 0 to 2^32 – 1
 Data items
   bytes (8 bits)
   half-words (16 bits) – always aligned to 2-byte boundaries (start at an even byte address)
   words (32 bits) – always aligned to 4-byte boundaries (start at a byte address which is a multiple of 4)
[Figure: memory drawn as 32-bit rows (bit 31 to bit 0) with byte addresses 0–23: byte0–byte3 at addresses 0–3, half-word4 and byte6 in the row at address 4, word8, half-word12 and half-word14, word16]
ARM instruction set
 Load-store architecture
 operands are in GPRs
 load/store – only instructions that operate with memory
 Instructions
 Data Processing – use and change only register values
 Data Transfer – copy memory values into registers (load) or copy register values into memory (store)
 Control Flow
o branch
o branch-and-link – save return address to resume the original sequence
o trapping into system code – supervisor calls

7
ARM instruction set (cont’d)
 Three-address data processing instructions
 Conditional execution of every instruction
 Powerful load/store multiple register instructions
 Ability to perform a general shift operation and a
general ALU operation in a single instruction that
executes in a single clock cycle
 Open instruction set extension through coprocessor
instruction set, including adding new registers and
data types to the programmer’s model
 Very dense 16-bit compressed representation of the
instruction set in the Thumb architecture

8
I/O system
 I/O is memory mapped
 internal registers of peripherals (disk controllers,
network interfaces, etc) are addressable locations
within the ARM’s memory map and may be read and
written using the load-store instructions
 Peripherals may use either the normal interrupt
(IRQ) or fast interrupt (FIQ) input
 normally most interrupt sources share the IRQ input,
while just one or two time-critical sources are
connected to the FIQ input
 Some systems may include external DMA hardware
to handle high-bandwidth I/O traffic

9
ARM exceptions
 ARM supports a range of interrupts, traps, and supervisor calls –
all are grouped under the general heading of exceptions
 Handling exceptions
 current state is saved by copying the PC into r14_exc and CPSR
into SPSR_exc (exc stands for exception type)
 processor operating mode is changed to the appropriate
exception mode
 PC is forced to a value between 00₁₆ and 1C₁₆, the particular
value depending on the type of exception
 instruction at the location PC is forced to (the vector address)
usually contains a branch to the exception handler; the
exception handler will use r13_exc, which is normally initialized
to point to a dedicated stack in memory, to save some user
registers
 return: restore the user registers and then restore PC and CPSR
atomically
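The entry and return sequence above can be sketched as a small state model. This is illustrative only: the function names are hypothetical, the individual vector addresses and mode numbers are assumptions taken from the classic ARM architecture (the slide gives only the range 00₁₆ to 1C₁₆), and the pipeline-dependent adjustment of the saved PC is ignored.

```python
# Assumed vector table and mode numbers (classic ARM); not listed on the slide.
VECTORS = {
    "undefined": (0x04, "und"),
    "swi": (0x08, "svc"),
    "prefetch_abort": (0x0C, "abt"),
    "data_abort": (0x10, "abt"),
    "irq": (0x18, "irq"),
    "fiq": (0x1C, "fiq"),
}
MODE_BITS = {"fiq": 0x11, "irq": 0x12, "svc": 0x13, "abt": 0x17, "und": 0x1B}

def enter_exception(state, kind):
    """Save PC and CPSR into r14_exc / SPSR_exc, switch mode, jump to the vector."""
    vector, mode = VECTORS[kind]
    state["r14_" + mode] = state["pc"]
    state["SPSR_" + mode] = state["cpsr"]
    state["cpsr"] = (state["cpsr"] & ~0x1F) | MODE_BITS[mode]
    state["pc"] = vector
    return mode

def return_from_exception(state, mode):
    """Restore PC and CPSR together, as the slide's 'atomic' return requires."""
    state["pc"], state["cpsr"] = state["r14_" + mode], state["SPSR_" + mode]

state = {"pc": 0x8004, "cpsr": 0x10}      # user mode, arbitrary PC
mode = enter_exception(state, "irq")
print(hex(state["pc"]), hex(state["cpsr"]))
return_from_exception(state, mode)
print(hex(state["pc"]), hex(state["cpsr"]))
```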
10
ARM cross-development toolkit
 Software development
   tools developed by ARM Limited
   public domain tools (ARM back end for gcc C compiler)
 Cross-development
   tools run on different architecture from one for which they produce code
[Figure: toolkit structure. C source and C libraries go through the C compiler, asm source goes through the assembler; both produce .aof object files; the linker combines them with object libraries into an .axf image (with debug information); ARMsd runs the image on a system model (ARMulator) or on a development board]
Outline
 ARM Architecture
 ARM Assembly Language Programming
 ARM Organization and Implementation
 ARM Instruction Set
 Architectural Support for High-level Languages
 Thumb Instruction Set
 Architectural Support for System Development
 ARM Processor Cores
 Memory Hierarchy
 Architectural Support for Operating Systems
 ARM CPU Cores
 Embedded ARM Applications

12
ARM Instruction Set
 Data Processing Instructions
 Data Transfer Instructions
 Control flow Instructions

13
Data Processing Instructions
 Classes of data processing instructions
 Arithmetic operations
 Bit-wise logical operations
 Register-movement operations
 Comparison operations
 Operands: 32-bits wide;
there are 3 ways to specify operands
 come from registers
 the second operand may be a constant (immediate)
 shifted register operand
 Result: 32-bits wide, placed in a register
 long multiply produces a 64-bit result
14
Data Processing Instructions (cont’d)
Arithmetic Operations
ADD r0, r1, r2    r0 := r1 + r2
ADC r0, r1, r2    r0 := r1 + r2 + C
SUB r0, r1, r2    r0 := r1 - r2
SBC r0, r1, r2    r0 := r1 - r2 + C - 1
RSB r0, r1, r2    r0 := r2 - r1
RSC r0, r1, r2    r0 := r2 - r1 + C - 1

Bit-wise Logical Operations
AND r0, r1, r2    r0 := r1 and r2
ORR r0, r1, r2    r0 := r1 or r2
EOR r0, r1, r2    r0 := r1 xor r2
BIC r0, r1, r2    r0 := r1 and (not r2)

Register Movement
MOV r0, r2        r0 := r2
MVN r0, r2        r0 := not r2

Comparison Operations
CMP r1, r2        set cc on r1 - r2
CMN r1, r2        set cc on r1 + r2
TST r1, r2        set cc on r1 and r2
TEQ r1, r2        set cc on r1 xor r2
Data Processing Instructions (cont’d)
 Immediate operands:
immediate = (0 → 255) x 2^(2n), 0 <= n <= 12
ADD r3, r3, #3        r3 := r3 + 3
AND r8, r7, #&ff      r8 := r7[7:0], & for hex
 Shifted register operands
   the second operand is subject to a shift operation before it is combined with the first operand
ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3
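The immediate rule can be turned into a quick check of which constants fit. This is a minimal sketch that follows the slide's formula literally (an 8-bit value times an even power of two); the function name encode_immediate is hypothetical, and the hardware's rotate-based encoding is not modelled.

```python
def encode_immediate(value):
    """Return (imm8, n) with value == imm8 * 2**(2*n), 0 <= n <= 12, or None."""
    for n in range(13):
        shift = 2 * n
        if value % (1 << shift) == 0 and (value >> shift) <= 255:
            return value >> shift, n
    return None

for constant in (3, 0xFF, 0x3FC, 0x101, 0xFF000000):
    print(hex(constant), encode_immediate(constant))
```

A constant such as &101 has set bits more than eight positions apart, so it cannot be written as a single immediate and would have to be built in more than one instruction or loaded from memory.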
ARM shift operations
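 LSL – Logical Shift Left
 LSR – Logical Shift Right
 ASR – Arithmetic Shift Right
 ROR – Rotate Right
 RRX – Rotate Right Extended by 1 place
[Figure: bit 31 to bit 0 diagrams. LSL #5 fills the five vacated low bits with 0; LSR #5 fills the five vacated high bits with 0; ASR #5 fills the vacated high bits with copies of bit 31 (0 for a positive operand, 1 for a negative operand); ROR #5 feeds the bits leaving the low end back in at the high end; RRX rotates right by one place through the carry flag C]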
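The five shift types can be modelled on 32-bit values as follows. This is a sketch for illustration: the function names mirror the mnemonics, shift amounts are assumed to be in the range 0 to 31, and the carry-out of the ordinary shifts is not modelled.

```python
MASK = 0xFFFFFFFF

def lsl(x, n):
    return (x << n) & MASK

def lsr(x, n):
    return (x & MASK) >> n

def asr(x, n):
    x &= MASK
    if x & 0x80000000:                       # negative: fill with ones
        return ((x >> n) | (MASK << (32 - n))) & MASK
    return x >> n                            # positive: fill with zeros

def ror(x, n):
    x &= MASK
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK

def rrx(x, carry):
    """Rotate right by one place through the carry; returns (result, new carry)."""
    x &= MASK
    return ((carry << 31) | (x >> 1)) & MASK, x & 1

print(hex(lsl(1, 5)), hex(asr(0x80000000, 5)), hex(ror(0x1F, 5)))
```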
Setting the condition codes
 Any DPI can set the condition codes (N, Z, V, and C)
 for all DPIs except the comparison operations
a specific request must be made
 at the assembly language level this request is indicated
by adding an `S` to the opcode
 Example: 64-bit addition, r3:r2 := r1:r0 + r3:r2
ADDS r2, r2, r0 ; carry out to C
ADC r3, r3, r1 ; ... add into high word
 Arithmetic operations set all the flags (N, Z, C, and V)
 Logical and move operations set N and Z
   preserve V and either preserve C when there is no shift operation, or set C according to the shift operation (the bit that falls off the end)
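The 64-bit addition example can be traced with a small model of ADDS and ADC. This is a sketch: the helper names are hypothetical and only the C flag is modelled.

```python
MASK = 0xFFFFFFFF

def adds(a, b):
    """32-bit add that also returns the carry-out (the C flag)."""
    total = (a & MASK) + (b & MASK)
    return total & MASK, total >> 32

def adc(a, b, carry):
    """32-bit add with carry-in; the carry-out is discarded here."""
    return ((a & MASK) + (b & MASK) + carry) & MASK

def add64(r0, r1, r2, r3):
    """r3:r2 := r1:r0 + r3:r2, mirroring the ADDS/ADC pair."""
    r2, c = adds(r2, r0)      # ADDS r2, r2, r0
    r3 = adc(r3, r1, c)       # ADC  r3, r3, r1
    return r2, r3

low, high = add64(0x00000001, 0x00000000, 0xFFFFFFFF, 0x00000001)
print(hex(high), hex(low))
```

Here the low words overflow, so the carry from ADDS is what makes the high word come out as 2 rather than 1.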
Multiplies
 Example (Multiply, Multiply-Accumulate)
MUL r4, r3, r2 r4 := [r3 x r2]<31:0>
MLA r4, r3, r2, r1 r4 := [r3 x r2 + r1] <31:0>
 Note
 least significant 32-bits are placed in the result register,
the rest are ignored
 immediate second operand is not supported
 result register must not be the same
as the first source register
 if `S` bit is set the V is preserved and
the C is rendered meaningless
 Example (r0 = r0 x 35)
ADD r0, r0, r0, LSL #2 ; r0’ = r0 x 5
RSB r0, r0, r0, LSL #3 ; r0’’ = 7 x r0’
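The multiply-by-35 trick is easy to verify numerically: adding r0 shifted left by 2 gives 5 x r0, and a reverse subtract from the value shifted left by 3 gives 7 times that. A minimal sketch, with times35 as a hypothetical helper name:

```python
MASK = 0xFFFFFFFF

def times35(r0):
    r0 = (r0 + (r0 << 2)) & MASK      # ADD r0, r0, r0, LSL #2   (x 5)
    r0 = ((r0 << 3) - r0) & MASK      # RSB r0, r0, r0, LSL #3   (x 7)
    return r0

print(times35(3))
```

As with MUL, only the least significant 32 bits of the product are kept.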
Data transfer instructions
 Single register load and store instructions
 transfer of a data item (byte, half-word, word)
between ARM registers and memory
 Multiple register load and store instructions
 enable transfer of large quantities of data
 used for procedure entry and exit, to save/restore
workspace registers, to copy blocks of data around
memory
 Single register swap instructions
 allow exchange between a register and memory
in one instruction
 used to implement semaphores to ensure mutual
exclusion on accesses to shared data in multiprocessor systems
20
Data Transfer Instructions (cont’d)
Single register load and store
Register-indirect addressing
LDR r0, [r1]         r0 := mem32[r1]
STR r0, [r1]         mem32[r1] := r0
Note: r1 keeps a word address (2 LSBs are 0)
LDRB r0, [r1]        r0 := mem8[r1]
Note: no restrictions for r1
Base+offset addressing (offset of up to 4 Kbytes)
LDR r0, [r1, #4]     r0 := mem32[r1 + 4]
Auto-indexing addressing
LDR r0, [r1, #4]!    r0 := mem32[r1 + 4]
                     r1 := r1 + 4
Post-indexed addressing
LDR r0, [r1], #4     r0 := mem32[r1]
                     r1 := r1 + 4
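The difference between the three word-load forms is only in when the base register changes. The sketch below models memory as a dictionary keyed by word address; the function names and the sample addresses are hypothetical.

```python
def ldr_offset(mem, regs, rd, rn, offset=0):
    """LDR rd, [rn, #offset]: base register is left unchanged."""
    regs[rd] = mem[regs[rn] + offset]

def ldr_auto_indexed(mem, regs, rd, rn, offset):
    """LDR rd, [rn, #offset]!: base is updated, then used as the address."""
    regs[rn] += offset
    regs[rd] = mem[regs[rn]]

def ldr_post_indexed(mem, regs, rd, rn, offset):
    """LDR rd, [rn], #offset: base is used as the address, then updated."""
    regs[rd] = mem[regs[rn]]
    regs[rn] += offset

mem = {0x100: 11, 0x104: 22}
regs = {"r0": 0, "r1": 0x100}
ldr_post_indexed(mem, regs, "r0", "r1", 4)
print(regs)
```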
Data Transfer Instructions (cont’d)
Copy loop with explicit pointer updates:
COPY:   ADR r1, TABLE1 ; r1 points to TABLE1
        ADR r2, TABLE2 ; r2 points to TABLE2
LOOP:   LDR r0, [r1]
        STR r0, [r2]
        ADD r1, r1, #4
        ADD r2, r2, #4

Copy loop with post-indexed addressing:
COPY:   ADR r1, TABLE1 ; r1 points to TABLE1
        ADR r2, TABLE2 ; r2 points to TABLE2
LOOP:   LDR r0, [r1], #4
        STR r0, [r2], #4
        ...
TABLE1: ...
TABLE2: ...
Data Transfer Instructions
Multiple register data transfers
LDMIA r1, {r0, r2, r5}    r0 := mem32[r1]
                          r2 := mem32[r1 + 4]
                          r5 := mem32[r1 + 8]
Note: any subset (or all) of the registers may be transferred with a single instruction
Note: the order of registers within the list is insignificant
Note: including r15 in the list will cause a change in the control flow
 Stack organizations
   FA – full ascending
   EA – empty ascending
   FD – full descending
   ED – empty descending
 Block copy view
   data is to be stored above or below the address held in the base register
   address incrementing or decrementing begins before or after storing the first value
Multiple register transfer addressing
modes

[Figure: memory from 1000₁₆ to 1018₁₆; r9 = 100c₁₆ before each instruction, r9’ is the base after write-back]
Instruction               r0 stored at   r1 stored at   r5 stored at   r9’
STMIA r9!, {r0,r1,r5}     100c₁₆         1010₁₆         1014₁₆         1018₁₆
STMIB r9!, {r0,r1,r5}     1010₁₆         1014₁₆         1018₁₆         1018₁₆
STMDA r9!, {r0,r1,r5}     1004₁₆         1008₁₆         100c₁₆         1000₁₆
STMDB r9!, {r0,r1,r5}     1000₁₆         1004₁₆         1008₁₆         1000₁₆
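The four addressing modes differ only in where the first word goes and where the base ends up. The sketch below computes both for a given base and register count; stm_addresses is a hypothetical helper, and it relies on the lowest-numbered register always going to the lowest address.

```python
def stm_addresses(mode, base, count):
    """Return (addresses in ascending register order, base after write-back)."""
    size = 4 * count
    if mode == "IA":            # increment after
        start, new_base = base, base + size
    elif mode == "IB":          # increment before
        start, new_base = base + 4, base + size
    elif mode == "DA":          # decrement after
        start, new_base = base - size + 4, base - size
    elif mode == "DB":          # decrement before
        start, new_base = base - size, base - size
    else:
        raise ValueError("unknown mode: " + mode)
    return [start + 4 * i for i in range(count)], new_base

for mode in ("IA", "IB", "DA", "DB"):
    addresses, new_base = stm_addresses(mode, 0x100C, 3)
    print(mode, [hex(a) for a in addresses], hex(new_base))
```

With base 100c₁₆ and three registers, this reproduces the four cases of STM r9!, {r0,r1,r5} in the figure.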
The mapping between the stack and
block copy views

                      Ascending            Descending
                      Full      Empty      Full      Empty
Increment   Before    STMIB                          LDMIB
                      STMFA                          LDMED
            After               STMIA      LDMIA
                                STMEA      LDMFD
Decrement   Before              LDMDB      STMDB
                                LDMEA      STMFD
            After     LDMDA                          STMDA
                      LDMFA                          STMED
Control flow instructions
Branch   Interpretation     Normal uses
B        Unconditional      Always take this branch
BAL      Always             Always take this branch
BEQ      Equal              Comparison equal or zero result
BNE      Not equal          Comparison not equal or non-zero result
BPL      Plus               Result positive or zero
BMI      Minus              Result minus or negative
BCC      Carry clear        Arithmetic operation did not give carry-out
BLO      Lower              Unsigned comparison gave lower
BCS      Carry set          Arithmetic operation gave carry-out
BHS      Higher or same     Unsigned comparison gave higher or same
BVC      Overflow clear     Signed integer operation; no overflow occurred
BVS      Overflow set       Signed integer operation; overflow occurred
BGT      Greater than       Signed integer comparison gave greater than
BGE      Greater or equal   Signed integer comparison gave greater or equal
BLT      Less than          Signed integer comparison gave less than
BLE      Less or equal      Signed integer comparison gave less than or equal
BHI      Higher             Unsigned comparison gave higher
BLS      Lower or same      Unsigned comparison gave lower or same
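Each condition in the table is a test on the N, Z, C, and V flags. The sketch below computes the flags that CMP would set and evaluates a condition against them; the flag expressions are the standard ARM definitions rather than something stated on the slide, and the function names are hypothetical.

```python
MASK = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags set by CMP a, b (that is, by computing a - b on 32-bit values)."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(a >= b),                            # no borrow
        "V": ((a ^ b) & (a ^ result)) >> 31,         # signed overflow
    }

CONDITIONS = {
    "AL": lambda f: True,
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "PL": lambda f: f["N"] == 0,
    "MI": lambda f: f["N"] == 1,
    "CC": lambda f: f["C"] == 0,
    "LO": lambda f: f["C"] == 0,
    "CS": lambda f: f["C"] == 1,
    "HS": lambda f: f["C"] == 1,
    "VC": lambda f: f["V"] == 0,
    "VS": lambda f: f["V"] == 1,
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
    "LE": lambda f: f["Z"] == 1 or f["N"] != f["V"],
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,
    "LS": lambda f: f["C"] == 0 or f["Z"] == 1,
}

def branch_taken(cond, flags):
    return CONDITIONS[cond](flags)

flags = cmp_flags(0xFFFFFFFF, 1)
print(branch_taken("HI", flags), branch_taken("LT", flags))
```

The same bit pattern, &FFFFFFFF, is higher than 1 as an unsigned value and less than 1 as a signed value, which is why the table has separate unsigned and signed conditions.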
Conditional execution
 Conditional execution to avoid branch instructions used to skip a small number of non-branch instructions
 Example
        CMP r0, #5          ;
        BEQ BYPASS          ; if (r0!=5) {
        ADD r1, r1, r0      ;   r1:=r1+r0-r2
        SUB r1, r1, r2      ; }
BYPASS: ...
With conditional execution
        CMP r0, #5          ;
        ADDNE r1, r1, r0    ;
        SUBNE r1, r1, r2    ;
        ...
Note: add 2-letter condition after the 3-letter opcode
        ; if ((a==b) && (c==d)) e++;
        CMP r0, r1
        CMPEQ r2, r3
        ADDEQ r4, r4, #1    ; e++ (assuming e is held in r4)
Branch and link instructions
 Branch to subroutine (r14 serves as a link register)
        BL SUBR         ; branch to SUBR
        ..              ; return here
SUBR:   ..              ; SUBR entry point
        MOV pc, r14     ; return
 Nested subroutines
        BL SUB1
        ..
SUB1:   STMFD r13!, {r0-r2,r14}   ; save work and link register
        BL SUB2
        ..
Supervisor calls
 Supervisor is a program which operates at a
privileged level – it can do things that a user-level
program cannot do directly
 Example: send text to the display
 ARM ISA includes SWI (SoftWare Interrupt)
; output r0[7:0]

SWI SWI_WriteC

; return from a user program back to monitor

SWI SWI_Exit

29
Jump tables
 Call one of a set of subroutines depending on a
value computed by the program
Compare-and-branch chain:
        BL JTAB
        ...
JTAB:   CMP r0, #0
        BEQ SUB0
        CMP r0, #1
        BEQ SUB1
        CMP r0, #2
        BEQ SUB2
Note: slow when the list is long, and all subroutines are equally frequent

Table of subroutine addresses:
        BL JTAB
        ...
JTAB:   ADR r1, SUBTAB
        CMP r0, #SUBMAX ; overrun?
        LDRLS pc, [r1, r0, LSL #2]
        B ERROR
SUBTAB: DCD SUB0
        DCD SUB1
        DCD SUB2
Hello ARM World!
AREA HelloW, CODE, READONLY ; declare code area

SWI_WriteC EQU &0 ; output character in r0

SWI_Exit EQU &11 ; finish program

ENTRY ; code entry point

START: ADR r1, TEXT ; r1 <- Hello ARM World!

LOOP: LDRB r0, [r1], #1 ; get the next byte

CMP r0, #0 ; check for text end

SWINE SWI_WriteC ; if not end of string, print

BNE LOOP

SWI SWI_Exit ; end of execution

TEXT = "Hello ARM World!", &0a, &0d, 0


31
ARM
Organization and Implementation

Aleksandar Milenkovic
Web: http://www.ece.uah.edu/~milenka
Outline
 ARM Architecture
 ARM Organization and Implementation
 ARM Instruction Set
 Architectural Support for High-level Languages
 Thumb Instruction Set
 Architectural Support for System Development
 ARM Processor Cores
 Memory Hierarchy
 Architectural Support for Operating Systems
 ARM CPU Cores
 Embedded ARM Applications

33
ARM organization
 Register file – 2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
 Barrel shifter – shift or rotate one operand for any number of bits
 ALU – performs the arithmetic and logic functions required
 Memory address register + incrementer
 Memory data registers
 Instruction decoder and associated control logic
[Figure: 3-stage ARM organization. Address register (A[31:0]) and incrementer, register bank with PC, multiply register, A bus, B bus and ALU bus, barrel shifter, ALU, data out register and data in register (D[31:0]), instruction decode & control]
Three-stage pipeline
 Fetch
 the instruction is fetched from memory and placed in
the instruction pipeline
 Decode
 the instruction is decoded and the datapath control
signals prepared for the next cycle; in this stage the
instruction owns the decode logic but not the
datapath
 Execute
 the instruction owns the datapath; the register bank
is read, an operand shifted, the ALU result
generated and written back into a destination register

35
ARM single-cycle instruction pipeline

[Figure: instructions against time]
1   fetch   decode   execute
2           fetch    decode   execute
3                    fetch    decode   execute

36
ARM single-cycle instruction pipeline

[Figure: instructions against time, cycles 1 2 3 ...]
add r0,r1,#5    fetch   decode   execute add
sub r2,r3,r6            fetch    decode        execute sub
cmp r2,#3                        fetch         decode        execute cmp

37
ARM multi-cycle instruction pipeline

Decode logic is always generating the control signals for the datapath to use in the next cycle
[Figure: instructions against time; stages listed per instruction]
1   ADD: fetch, decode, execute
2   STR: fetch, decode, calc. addr., data xfer
3   ADD: fetch, decode, execute
4   ADD: fetch, decode, execute
5   ADD: fetch, decode, execute

38
ARM multi-cycle LDMIA (load
multiple) instruction

[Figure: instructions against time; stages listed per instruction]
ldmia r0,{r2,r3}: fetch, decode, ex ld r2, ex ld r3
sub r2,r3,r6:     fetch, decode, ex sub
cmp r2,#3:        fetch, decode, ex cmp
Decode stage occupied since ldmia must continue to remember decoded instruction
Instruction delayed: sub fetched at normal time but not decoded until LDMIA is finishing

39
Control stalls: due to branches
 Branches often introduce stalls (branch penalty)
 Stall time may depend on whether branch is taken
 May have to squash instructions
that already started executing
 Don’t know what to fetch until condition is
evaluated

40
ARM pipelined branch

Decision not made until the third clock cycle
[Figure: instructions against time; stages listed per instruction]
bne foo:            fetch, decode, ex bne, ex bne, ex bne
sub r2,r3,r6:       fetch, decode (two cycles of work thrown away if bne is taken)
foo add r0,r1,r2:   fetch, decode, ex add

41
Pipeline: how it works
 All instructions occupy the datapath for one or more adjacent cycles
 For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
 During the first datapath cycle each instruction issues a fetch for the next instruction but one
 Branch instructions flush and refill the instruction pipeline

42
ARM9TDMI 5-stage pipeline
 Fetch
 Decode
   instruction is decoded
   register operands read (3 read ports)
 Execute
   an operand is shifted and the ALU result generated, or address is computed
 Buffer/data
   data memory is accessed (load, store)
 Write-back
   write to register file
[Figure: ARM9TDMI pipeline organization. Fetch: I-cache, next pc, pc + 4. Decode: instruction decode, register read, immediate fields, r15 = pc + 8. Execute: shift, ALU, mul, forwarding paths, pre-index and post-index, +4 for LDM/STM. Buffer/data: D-cache, load/store address, byte repl., rot/sgn ex. Write-back: register write. PC update paths: B, BL, MOV pc, SUBS pc, LDR pc]
ARM9TDMI Data Forwarding
Data Forwarding
ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3

ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
ADD r8, r9, r10           r8 := r9 + r10
ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3
Stall?
LDR r3, [r2]              r3 := mem[r2]
ADD r1, r2, r3            r1 := r2 + r3
[Figure: ARM9TDMI pipeline organization, as on the previous slide, highlighting the forwarding paths into the execute stage]
ARM9TDMI PC generation
 3-stage pipeline
   PC behavior: operands are read in execution stage, r15 = PC + 8
 5-stage pipeline
   operands are read in decode stage and r15 = PC + 4?
   incompatibilities between 3-stage and 5-stage implementations => unacceptable
   to avoid this 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs
[Figure: ARM9TDMI pipeline organization, as on the previous slides, highlighting the pc + 4 and pc + 8 paths]
Data processing instruction datapath activity (Ex)
 Reg-Reg
   Rd = Rn op Rm
   r15 = AR + 4, AR = AR + 4
 Reg-Imm
   Rd = Rn op Imm
   r15 = AR + 4, AR = AR + 4
[Figure: datapath activity for (a) register – register operations and (b) register – immediate operations; in (b) the immediate comes from instruction bits [7:0]]
STR (store register) datapath activity (Ex1, Ex2)
 Compute address (Ex1)
   AR = Rn op Disp
   r15 = AR + 4
 Store data (Ex2)
   AR = PC
   mem[AR] = Rd<x:y>
   If autoindexing => Rn = Rn +/- 4
[Figure: (a) 1st cycle – compute address: displacement from instruction bits [11:0], shifter lsl #0, ALU = A + B / A - B; (b) 2nd cycle – store data & auto-index: Rd to data out (byte?), ALU = A / A + B / A - B]
The first two (of three) cycles of a branch instruction
 Compute target address
   AR = PC + Disp, lsl #2
 Save return address (if required)
   r14 = PC
   AR = AR + 4
 Third cycle: do a small correction to the value stored in the link register in order that it points directly at the instruction which follows the branch?
[Figure: (a) 1st cycle – compute branch target: displacement from instruction bits [23:0], shifter lsl #2, ALU = A + B; (b) 2nd cycle – save return address: PC to R14, ALU = A]
ARM Implementation
 Datapath
 RTL (Register Transfer Level)
 Control unit
 FSM (Finite State Machine)

49
2-phase non-overlapping clock
scheme
 Most ARMs do not operate on edge-sensitive
registers
 Instead the design is based around
2-phase non-overlapping clocks which are
generated internally from a single clock signal
 Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2
[Figure: phase 1 and phase 2 clock waveforms over 1 clock cycle]

50
ARM datapath timing
 Register read
 Register read buses – dynamic, precharged during phase 2
 During phase 1 selected registers discharge the read buses
which become valid early in phase 1
 Shift operation
 second operand passes through barrel shifter
 ALU operation
 ALU has input latches which are open in phase 1,
allowing the operands to begin combining in ALU
as soon as they are valid, but they close at the end of phase 1
so that the phase 2 precharge does not get through to the ALU
 ALU processes the operands during the phase 2, producing the
valid output towards the end of the phase
 the result is latched in the destination register
at the end of phase 2

51
ARM datapath timing (cont’d)
[Figure: timing against phase 1 and phase 2. Register read time ends with read bus valid; shift time ends with shift out valid; ALU operands latched at the end of phase 1; precharge invalidates buses in phase 2; ALU time ends with ALU out; register write time at the end of phase 2]
Minimum Datapath Delay =
Register read time +
Shifter Delay + ALU Delay +
Register write set-up time + Phase 2 to phase 1 non-overlap time
52
The original ARM1 ripple-carry adder
 Carry logic: use CMOS AOI (And-Or-Invert) gate
 Even bits use the circuit shown below
 Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
 Worst case path: 32 gates long
[Figure: one adder bit with inputs A, B, Cin and outputs sum, Cout]

53
ARM2 4-bit carry look-ahead scheme
 Carry Generate (G), Carry Propagate (P)
 Cout[3] = Cin[0].P + G
 Use AOI and alternate AND/OR gates
 Worst case: 8 gates long
[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], G, P; Cout[3] formed from Cin[0], P and G]
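The group carry equation Cout[3] = Cin[0].P + G can be checked exhaustively for a 4-bit group. This is a sketch at the logic level: the function names are hypothetical and the per-bit propagate is taken as A or B, which is one common choice (the slide does not say which definition the ARM2 used).

```python
def group_generate_propagate(a, b):
    """Group G and P signals for a 4-bit slice of the adder."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # bit generates a carry
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]    # bit propagates a carry
    G = (g[3]
         | (p[3] & g[2])
         | (p[3] & p[2] & g[1])
         | (p[3] & p[2] & p[1] & g[0]))
    P = p[3] & p[2] & p[1] & p[0]
    return G, P

def carry_out(a, b, cin):
    """Cout[3] = Cin[0].P + G"""
    G, P = group_generate_propagate(a, b)
    return (cin & P) | G

mismatches = [(a, b, c)
              for a in range(16) for b in range(16) for c in (0, 1)
              if carry_out(a, b, c) != (a + b + c) >> 4]
print(len(mismatches))
```

Because G and P depend only on the operand bits, they are ready before the incoming carry arrives, which is where the shorter worst-case path comes from.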
The ARM2 ALU logic for one result bit
 ALU functions
   data operations (add, sub, ...)
   address computations for memory accesses
   branch target computations
   bit-wise logical operations
   ...
[Figure: ALU logic for one result bit; inputs NA bus and NB bus, function select lines fs5 to fs0, G and P signals into the carry logic, result driven onto the ALU bus]

55
ARM2 ALU function codes

fs5 fs4 fs3 fs2 fs1 fs0 ALU output


0 0 0 1 0 0 A and B
0 0 1 0 0 0 A and not B
0 0 1 0 0 1 A xor B
0 1 1 0 0 1 A plus not B plus carry
0 1 0 1 1 0 A plus B plus carry
1 1 0 1 1 0 not A plus B plus carry
0 0 0 0 0 0 A
0 0 0 0 0 1 A or B
0 0 0 1 0 1 B
0 0 1 0 1 0 not B

56
The ARM6 carry-select adder scheme
 Compute sums of various fields of the word for carry-in of zero and carry-in of one
 Final result is selected by using the correct carry-in value to control a multiplexor
Worst case: O(log2[word width]) gates long
Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.
[Figure: adders on a,b[3:0] through a,b[31:28] produce s and s+1 (+, +1) with carry c; mux stages select sum[3:0], sum[7:4], sum[15:8], sum[31:16]]
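The select step can be sketched in a few lines: each field computes both candidate sums up front, and the carry from the field below picks one. This is a behavioural model only; the field boundaries follow the sum[3:0], sum[7:4], sum[15:8], sum[31:16] outputs in the figure, and carry_select_add is a hypothetical name.

```python
MASK = 0xFFFFFFFF
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]    # (low bit, width) of each field

def carry_select_add(a, b, cin=0):
    """Return (32-bit sum, carry-out) using per-field s and s+1 candidates."""
    result, carry = 0, cin
    for low, width in FIELDS:
        field_mask = (1 << width) - 1
        fa = (a >> low) & field_mask
        fb = (b >> low) & field_mask
        s0 = fa + fb            # candidate for carry-in of zero
        s1 = fa + fb + 1        # candidate for carry-in of one
        chosen = s1 if carry else s0          # the multiplexor
        result |= (chosen & field_mask) << low
        carry = chosen >> width
    return result, carry

print(carry_select_add(0x0000FFFF, 0x00000001))
```

In hardware the candidate sums for all fields are computed in parallel, so only the multiplexor chain depends on the carries.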
The ARM6 ALU organization
 Not easy to merge the arithmetic and logic functions =>
a separate logic unit runs in parallel with the adder, and a multiplexor selects the output
[Figure: A operand latch and B operand latch feed XOR gates (invert A, invert B); logic functions unit (function input) and adder (C in, C and V out) work in parallel; logic/arithmetic result mux produces the result, with N taken from the result and zero detect producing Z]
ARM9 carry arbitration encoding
 Carry arbitration adder
One-bit encoding:
ai    bi    C    vi, wi
0     0     0    0, 0
1     1     1    1, 1
1     0     u    1, 0
0     1     u    1, 0

Two-bit arbitration:
ai      bi      ai-1    bi-1    C    vi, wi
0       0       -       -       0    0, 0
1       1       -       -       1    1, 1
0(1)    1(0)    0       0       0    0, 0
0(1)    1(0)    1       1       1    1, 1
0(1)    1(0)    0(1)    1(0)    u    1, 0

vi = ai + bi
wi = ai . bi

59
The cross-bar switch barrel shifter
 Shifter delay is critical since it contributes directly
to the datapath cycle time
 Cross-bar switch matrix (32 x 32)
 Principle for 4x4 matrix
[Figure: 4x4 switch matrix with inputs in[3] to in[0] as rows and outputs out[0] to out[3] as columns; diagonals of switches labelled no shift, right 1, right 2, right 3 and left 1, left 2, left 3]

60
The cross-bar switch barrel shifter
(cont’d)
 Precharged logic is used =>
each switch is a single NMOS transistor
 Precharging sets all outputs to logic 0, so those
which are not connected to any input during
switching remain at 0 giving the zero filling required
by the shift semantics
 For rotate right, the right shift diagonal is enabled +
complementary shift left diagonal (e. g., ‘right 1’ +
‘left 3’)
 Arithmetic shift right:
use sign-extension => separate logic is used to
decode the shift amount and discharge those
outputs appropriately
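The diagonal-enable idea can be modelled directly: enabling a diagonal connects in[i] to out[i + d], and any output that is not driven stays at the precharged 0. The sketch below is a behavioural model with hypothetical names; positive d stands for a 'left d' diagonal and negative d for a 'right' diagonal.

```python
def crossbar(x, diagonals, width=32):
    """Drive outputs through the enabled diagonals of a width x width matrix."""
    out = 0                                   # precharged: undriven outputs read 0
    for d in diagonals:
        for i in range(width):
            j = i + d                         # switch connects in[i] to out[j]
            if 0 <= j < width and (x >> i) & 1:
                out |= 1 << j
    return out

def rotate_right(x, n, width=32):
    """Rotate = 'right n' diagonal plus the complementary 'left width - n' diagonal."""
    return crossbar(x, [-n, width - n], width)

print(bin(crossbar(0b1011, [1], 4)), bin(rotate_right(0b1011, 1, 4)))
```

In the 4-bit case, rotate_right(x, 1, 4) enables exactly the 'right 1' and 'left 3' diagonals mentioned above.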
61
Multiplier design
 All ARMs apart from the first prototype have included
support for integer multiplication
 older ARM cores include low-cost multiplication hardware
that supports only the 32-bit result multiply and
multiply-accumulate
 recent ARM cores have high-performance multiplication
hardware and support 64-bit result multiply and
multiply-accumulate
 Low cost implementation
 Use the datapath iteratively, employing the barrel shifter
and ALU to generate 2-bit product in each clock cycle
 use early termination to stop the iterations when there
are no more ones in the multiply register

62
The 2-bit multiplication algorithm,
Nth cycle
 Control settings for the Nth cycle of the multiplication
 Use existing shifter and ALU + additional hardware
   dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth’s algorithm control logic (overhead is a few per cent on the area of ARM core)

Carry-in   Multiplier   Shift            ALU
0          x0           LSL #2N          A + 0
           x1           LSL #2N          A + B
           x2           LSL #(2N + 1)    A – B
           x3           LSL #2N          A – B
1          x0           LSL #2N          A + B
           x1           LSL #(2N + 1)    A + B
           x2           LSL #2N          A – B
           x3           LSL #2N          A + 0
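The control table can be exercised in software to confirm that it yields the right product. This is a sketch under stated assumptions: the carry-out values in TABLE are not shown on the slide and are derived here from the arithmetic (a multiplier digit of 2 or 3 is handled as -2 or -1 plus a carry of 4 into the next cycle), early termination is not modelled, and the names are hypothetical.

```python
MASK = 0xFFFFFFFF

# (carry-in, multiplier bits) -> (extra shift, ALU operation sign, carry-out)
# Shift and ALU columns follow the slide; the carry-out column is an assumption.
TABLE = {
    (0, 0): (0, 0, 0),
    (0, 1): (0, +1, 0),
    (0, 2): (1, -1, 1),
    (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0),
    (1, 1): (1, +1, 0),
    (1, 2): (0, -1, 1),
    (1, 3): (0, 0, 1),
}

def multiply_2bit(b, multiplier, accumulate=0):
    """32-bit result of accumulate + b * multiplier, two multiplier bits per cycle."""
    a, carry = accumulate & MASK, 0
    for n in range(16):
        bits = (multiplier >> (2 * n)) & 3
        extra, sign, carry = TABLE[(carry, bits)]
        a = (a + sign * (b << (2 * n + extra))) & MASK    # A +/- (B LSL #shift)
    return a

print(multiply_2bit(3, 7), multiply_2bit(12345, 67890, accumulate=5))
```

A carry left over after the sixteenth cycle would only affect bits above bit 31, so it can be dropped for a 32-bit result.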
High speed multiplication
 Where multiplication performance is very
important,
more hardware resources must be dedicated
 in some embedded systems the ARM core is used to
perform real-time digital signal processing (DSP) –
DSP programs are typically multiplication intensive
 Use intermediate results which include
partial sums and partial carries
 Carry-save adders are used for this
 These two binary results are added together at the
end of multiplication
 The main ALU is used for this

64
Carry-propagate (a) and carry-save
(b) adder structures
 Carry propagate adder takes two conventional (irredundant)
binary numbers as inputs and produces a binary sum
 Carry save adder takes one binary and one redundant (partial
sum and partial carry) input and produces a sum in redundant
binary representation (sum and carry)

[Figure: (a) carry-propagate adder: four full adders (inputs A, B, Cin; outputs S, Cout) with each Cout feeding the next Cin; (b) carry-save adder: four independent full adders whose S and Cout outputs form the partial sum and partial carry]
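A carry-save step reduces three numbers to two without letting carries ripple; only the final addition needs a carry-propagate adder. A minimal sketch on 32-bit values, with hypothetical function names:

```python
MASK = 0xFFFFFFFF

def carry_save_add(a, b, c):
    """Return (partial sum, partial carry) with a + b + c == sum + carry (mod 2**32)."""
    partial_sum = a ^ b ^ c                                   # per-bit sum, no ripple
    partial_carry = ((a & b) | (a & c) | (b & c)) << 1        # per-bit carry, moved up one place
    return partial_sum & MASK, partial_carry & MASK

def add_many(values):
    """Accumulate in redundant form, then do one carry-propagate add at the end."""
    partial_sum, partial_carry = 0, 0
    for v in values:
        partial_sum, partial_carry = carry_save_add(partial_sum, partial_carry, v & MASK)
    return (partial_sum + partial_carry) & MASK               # final carry-propagate add

print(hex(add_many([0x11111111, 0x22222222, 0xF0F0F0F0, 7])))
```

The loop in add_many mirrors the slide's description: one binary input and one redundant (sum, carry) input go in, a redundant result comes out.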
ARM high-speed multiplier
organization
 CSA has 4 layers of adders each handling 2
multiplier bits
=> multiply 8-bits per clock cycle
 Partial sum and carry are cleared at the beginning
or initialized to accumulate a value
 Multiplier is shifted right 8-bits
per cycle in the ‘Rs’ register
 Partial sum and carry
are rotated right 8 bits per cycle
 Performance: up to 4 clock cycles
(early termination is possible)
 Complexity: 160 bits in shift registers,
128 bits of carry-save adder logic
(up to 10% of simpler cores)
66
ARM high-speed multiplier
organization

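[Figure: high-speed multiplier organization. Registers supply Rm and Rs (Rs shifted right 8 bits/cycle) and the initialization for MLA; carry-save adders produce the partial sum and partial carry, which are rotated 8 bits/cycle; the ALU adds the partials]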

67
ARM2 register cell circuit

[Figure: register cell with a write line from the ALU bus and read A / read B lines driving the A bus and B bus]

68
ARM register bank floorplan

[Figure: register bank floorplan. A bus read decoders, B bus read decoders and write decoders sit above the register cells; Vdd and Vss rails; ALU bus, PC bus, INC bus, A bus and B bus run through the cells]

69
ARM core datapath buses

[Figure: core datapath buses. Address register and incrementer (Ad, PC, inc); register bank with A and B buses; multiplier; shifter (shift out); ALU with W bus back to the register bank; data in (Din), instruction pipe, data out]

70
ARM control logic structure

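[Figure: control logic structure. The instruction (and coprocessor interface) feed the decode PLA, with cycle count, multiply control and load/store multiple control; outputs are address control, register control, ALU control and shifter control]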
