
Unit-11 ARM Processor Fundamentals_Technical

The document provides an overview of ARM processor fundamentals, including its architecture, features, and operational modes. It details the ARM core's data flow model, the types of registers, and the significance of the Current Program Status Register (CPSR). Additionally, it discusses the various processor modes, highlighting their specific functions and the organization of general and special-purpose registers.

14 ARM Processor Fundamentals

Syllabus
Registers, Current Program Status Register, Pipeline, Exceptions, Interrupts and Vector Table,
Core Extensions, Architecture Revisions, Arm Processor Families

Contents
14.1 Introduction
Summer-16, 19 (CSE), Winter-17, 19 (CSE), Marks 7
14.2 Registers
14.3 Current Program Status Register
14.4 Pipeline
14.5 Exceptions, Interrupts and Vector Table
14.6 Core Extensions
14.7 Architecture Revisions
14.8 ARM Processor Families

Microprocessors and Microcontrollers 14-2 ARM Processor Fundamentals

14.1 Introduction GTU : Summer-16, 19 (CSE), Winter-17, 19 (CSE)

14.1.1 Features of ARM Processor


The features of the ARM processor are as follows :
Consists of a large uniform register file.
Supports load-store architecture.
Uses simple addressing modes.
Contains a reduced number of instructions.
Instructions perform simple but powerful operations which can be executed in a single cycle. If complicated operations, such as division, are to be performed, the compiler or programmer synthesizes them by combining several simple instructions.
It has uniform and fixed-length (32-bit) instruction fields.
Each instruction has a fixed length. This allows the pipeline to fetch future instructions before decoding the current instruction. In contrast, a CISC processor has instructions which are variable in size, more complex and take more cycles to execute. So in CISC the complexity is in the processor hardware, whereas in RISC the complexity is in the compiler. The uniform and fixed-length instruction fields simplify instruction decoding.
Most instructions have a three-address instruction format.
Does not use delayed branches, since they make exception handling more complex.
Does not use register windows, to keep the chip area small and hence the cost low.
It has control over both the Arithmetic Logic Unit (ALU) and shifter in every data-processing instruction to maximize the use of the ALU and the shifter.
Supports auto-increment and auto-decrement addressing modes to optimize program loops.
Supports Load and Store Multiple instructions to maximize data throughput.
Supports conditional execution of all instructions to maximize execution throughput.
These enhancements to a basic RISC architecture allow ARM processors to achieve a good balance of high performance, low code size, low power consumption and low silicon area.
14.1.2 ARM Architecture
An ARM core is an engine within a system
that fetches ARM instructions from
memory and executes them.
TECHNICAL PUBLICATIONS - An up thrust for knowledge

ARM cores are very small. Typically they occupy a few square millimeters of chip area.
With advances in modern VLSI technology, it became possible to build additional system components such as cache memory, a memory management unit or application specific hardware on the same chip. Application specific hardware may include signal processing hardware or further ARM processor cores.
While designing a new system, selecting the correct processor core is one of the most critical decisions.
14.1.2.1 ARM Core Dataflow Model
Fig. 14.1.1 shows the basic structure of ARM core and how data moves between
its different parts.
The ARM core dataflow model shown in Fig. 14.1.1 is a Von Neumann implementation of the ARM.

[Figure: data bus, instruction decoder, sign extend, register file (r0 - r15, with r15 as PC) with read and write ports, A bus, B bus, barrel shifter, MAC, ALU, result bus, address register, incrementer and address bus]
Fig. 14.1.1 ARM core dataflow model


Since the ARM processor is basically a RISC processor, it uses a load-store architecture. This means it has two instruction types, load and store, for transferring data in and out of the processor respectively.
LOAD : This instruction copies data from memory to registers in the processor core.
STORE : This instruction copies data from registers in the processor core to memory.
The ARM processor instruction set does not include instructions that directly manipulate data in memory. Data processing is carried out only in registers.
Data bus
Data enters the ARM core through the data bus. The data is either an instruction opcode or a data item.
Since the Von Neumann architecture is used, data items and instructions share the same bus. This is in contrast with the Harvard architecture, which uses two different buses.
Instruction decoder
This unit decodes the instruction opcode read from the memory and then the
instruction is executed.
Register file
This is a bank of 32-bit registers used for storing data items.
Sign extend
The ARM core is a 32-bit processor. So most instructions of the ARM processor treat registers as holding signed or unsigned 32-bit values.
When the processor reads signed 8-bit or 16-bit numbers from memory, the sign extend hardware converts these numbers to 32-bit values and then places them in the register file.
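The effect of the sign extend hardware can be sketched in a few lines of Python. This is only an illustrative model: the function names `sign_extend` and `zero_extend` are assumptions made for this example, and registers are modelled as plain 32-bit integers.

```python
MASK32 = 0xFFFFFFFF  # a register is modelled as an unsigned 32-bit pattern


def sign_extend(value: int, bits: int) -> int:
    """Widen the low `bits` bits of `value` to 32 bits, copying the sign bit upward."""
    low_mask = (1 << bits) - 1
    value &= low_mask
    if value & (1 << (bits - 1)):        # sign bit of the narrow value is set
        value |= MASK32 & ~low_mask      # fill the upper bits with ones
    return value


def zero_extend(value: int, bits: int) -> int:
    """Widen the low `bits` bits of `value` to 32 bits, filling the upper bits with zeros."""
    return value & ((1 << bits) - 1)


# A signed byte 0x80 (-128) becomes 0xFFFFFF80 in a 32-bit register,
# while the same byte read as unsigned becomes 0x00000080.
print(hex(sign_extend(0x80, 8)), hex(zero_extend(0x80, 8)))
```

The same function covers signed halfwords by passing `bits=16`.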
ALU (Arithmetic Logic Unit) and MAC (Multiply-Accumulate Unit)
Most of the ARM instructions are two-operand instructions. The two source registers Rn and Rm hold these operands.
The ALU or MAC reads the operand values from the Rn and Rm registers via the internal buses A and B respectively, performs the operation and writes the computed result via the internal result bus to the destination register Rd in the register file.
The load and store instructions generate an address using the ALU and store it in the address register.


Address register
This holds the address generated by the load and store instructions and places it
on the address bus.
Barrel shifter
The contents of the Rm register can optionally be preprocessed in the barrel shifter before being applied as an input to the ALU.
A wide range of expressions and addresses can be calculated using the barrel shifter and ALU.
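As a rough illustration of how the barrel shifter and ALU combine, the Python sketch below models a data-processing operation whose second operand is shifted before the addition, as in `ADD r0, r1, r2, LSL #2`. The helper names (`lsl`, `lsr`, `asr`, `ror`, `add_shifted`) are assumptions for this example, and shift amounts are assumed to lie in the range 0 to 31.

```python
MASK32 = 0xFFFFFFFF


def lsl(x: int, n: int) -> int:
    """Logical shift left within 32 bits."""
    return (x << n) & MASK32


def lsr(x: int, n: int) -> int:
    """Logical shift right: zeros enter from the left."""
    return (x & MASK32) >> n


def asr(x: int, n: int) -> int:
    """Arithmetic shift right: the sign bit is replicated from the left."""
    x &= MASK32
    if x & 0x80000000:
        return ((x >> n) | (MASK32 << (32 - n))) & MASK32
    return x >> n


def ror(x: int, n: int) -> int:
    """Rotate right: bits shifted out on the right re-enter on the left."""
    n %= 32
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32


def add_shifted(rn: int, rm: int, shift, amount: int) -> int:
    """Model of Rd = Rn + shift(Rm, amount): Rm passes through the shifter, then the ALU adds."""
    return (rn + shift(rm, amount)) & MASK32


# Address of element 3 in a word array at 0x1000: base + (index << 2)
print(hex(add_shifted(0x1000, 3, lsl, 2)))
```

Scaling an index by the element size in the same instruction as the addition is a typical use of this arrangement.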
Incrementer
For load and store instructions, the incrementer updates the contents of the address register before the processor core reads or writes the next register value from or to the consecutive memory location.
The processor core continues executing instructions. The normal execution flow changes only when an exception or interrupt occurs.
14.1.2.2 Data Types
ARM processors support the following data types :
Byte : 8 bits.
Halfword : 16 bits (halfwords must be aligned to two-byte boundaries).
Word : 32 bits (words must be aligned to four-byte boundaries).
Note
All three types are supported in ARM architecture version 4 and above. Only bytes and words were supported prior to ARM architecture version 4.
When any of these types is described as unsigned, the N-bit data value represents a non-negative integer in the range 0 to $+2^{N} - 1$, using normal binary format.
When any of these types is described as signed, the N-bit data value represents an integer in the range $-2^{N-1}$ to $+2^{N-1} - 1$, using two's complement format.
All data operations, for example ADD, are performed on word quantities.
Load and store operations can transfer bytes, halfwords and words to and from memory, automatically zero-extending or sign-extending bytes or halfwords as they are loaded.
ARM instructions are exactly one word (and are aligned on a four-byte boundary). Thumb instructions are exactly one halfword (and are aligned on a two-byte boundary).
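The value ranges and alignment rules above can be checked with a short Python sketch. The names `DATA_TYPES`, `unsigned_range`, `signed_range` and `is_aligned` are assumptions introduced only for this illustration.

```python
# Size in bits of each data type listed above
DATA_TYPES = {"byte": 8, "halfword": 16, "word": 32}


def unsigned_range(n_bits: int) -> tuple:
    """Range of an unsigned N-bit value: 0 to 2^N - 1."""
    return (0, 2 ** n_bits - 1)


def signed_range(n_bits: int) -> tuple:
    """Range of a two's complement N-bit value: -2^(N-1) to 2^(N-1) - 1."""
    return (-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)


def is_aligned(address: int, type_name: str) -> bool:
    """True if `address` is a multiple of the size in bytes of the data type."""
    size_bytes = DATA_TYPES[type_name] // 8
    return address % size_bytes == 0


for name, bits in DATA_TYPES.items():
    print(name, unsigned_range(bits), signed_range(bits))
```

For example, address 0x1002 is a valid halfword address but not a valid word address.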

Review Questions

1. Describe the features of ARM processor. GTU : Summer-16, 19 (CSE), Marks 7
2. Explain ARM core data flow model with a neat diagram. GTU : Winter-17, 19 (CSE), Marks 7
3. Explain the data types supported by ARM processor.
14.2 Registers
The register file in the ARM core contains all the registers available to a programmer. The current mode of the processor decides which of these registers the programmer can access.
14.2.1 Processor Modes
The processor mode determines which registers are active and the access rights to the cpsr itself.
In the ARM7, there are seven operating modes. Most of these modes are protected or exception modes, which have associated interrupt sources and their own register set.
1. Supervisor mode (Default) : This is a protected mode for running system level code to access hardware or run OS calls. The ARM7 enters this mode after reset.
2. FIQ (Fast Interrupt reQuest) : This mode supports high speed interrupt handling.
3. IRQ (Interrupt ReQuest) : This mode supports all other interrupt sources in a system.
4. Abort : If an instruction or data is fetched from an invalid memory location, an abort exception will be generated.
5. Undefined : If a fetched opcode is not an ARM instruction, an undefined instruction exception will be generated.
6. User : This mode is used to run the application code. In user mode we cannot change the control bits of the CPSR (Current Program Status Register), and the mode can only be changed when an exception is generated. This mode is also known as unprivileged mode.
7. System : This mode is used for running operating system tasks. It uses the same registers as user mode.
All the above modes, except user mode, are privileged modes.
For all operating modes, user registers r0 - r7 are common. However, FIQ mode replaces the r8 - r14 registers by its own registers r8_fiq to r14_fiq. Similarly, each of the other exception modes has its own r13 and r14 registers, so that each of these modes has its own unique stack pointer and link register.
14.2.2 Programming Model
Fig. 14.2.1 shows the programming model of ARM processor.
[Figure: user and system mode registers r0 - r12, r13 sp, r14 lr, r15 pc and cpsr; fast interrupt request mode banked registers r8_fiq - r14_fiq and spsr_fiq; banked r13, r14 and spsr for the interrupt request (irq), supervisor (svc), undefined (undef) and abort (abt) modes]
Fig. 14.2.1 Programming model of ARM processor


The ARM processor has a total of 37 registers. All registers are 32 bits wide. They can be classified into two groups :
General purpose registers
Special purpose registers

14.2.3 General Purpose Registers


Registers r0 to r12 are used as general purpose registers. Depending upon the context, registers r13 to r15 can also be used as general purpose registers.
The general purpose registers hold either data or an address.

14.2.4 Special Purpose Registers


Registers r13 to r15, CPSR (Current Program Status Register) and SPSR (Saved Program Status Register) are the special purpose registers.
Registers r13 to r15
In user mode, registers r13 to r15 are labeled as r13 sp, r14 lr and r15 pc respectively to differentiate them from other registers. The functions of these registers are given below.
Stack pointer (r13 sp) : Register r13 is the stack pointer. It stores the address of the top of the stack in the current processor mode.
Link register (r14 lr) : Register r14 is the link register. The processor stores the return address in this register when a subroutine is called.
Program counter (r15 pc) : Register r15 is the program counter and stores the address of the next instruction to be fetched from the memory by the processor.
When it is read by an instruction, it usually points to the instruction which is two instructions after the instruction being executed.
All ARM instructions are four bytes long (one 32-bit word) and are always aligned on a word boundary. This means that the bottom two bits of the PC are always zero, and therefore the PC contains only 30 non-constant bits.
It can often be used in place of one of the general-purpose registers r0 to r14, and is therefore considered one of the general-purpose registers. However, there are also many instruction-specific restrictions or special cases about its use. Usually, the instruction is unpredictable if r15 is used in a manner that breaks these restrictions.
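The "two instructions ahead" behaviour of r15 can be illustrated with a small Python sketch for ARM state. The names `pc_value_seen_by` and `is_word_aligned` are assumptions for this example; the sketch only restates the arithmetic implied by four-byte, word-aligned instructions.

```python
ARM_INSTRUCTION_BYTES = 4  # every ARM instruction is one 32-bit word


def pc_value_seen_by(instruction_address: int) -> int:
    """Value an executing ARM-state instruction reads from r15: two instructions ahead."""
    return (instruction_address + 2 * ARM_INSTRUCTION_BYTES) & 0xFFFFFFFF


def is_word_aligned(address: int) -> bool:
    """ARM instruction addresses have their bottom two bits equal to zero."""
    return (address & 0b11) == 0


# An instruction at 0x8000 that reads the pc sees 0x8008
print(hex(pc_value_seen_by(0x8000)))
```

This offset follows from the pipeline: while one instruction executes, the instruction two places later is already being fetched.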

The Unbanked Registers r0-r7


Registers r0 to r7 are unbanked registers. This means that each of them refers to
the same 32-bit physical register in all processor modes.
They are completely general-purpose registers, with no special uses implied by the
architecture, and can be used wherever an instruction allows a general-purpose
register to be specified.
The Banked Registers r8-r14
Registers r8 to r14 are banked registers.
The physical register referred to by each of them depends on the current processor
mode. Where a particular physical register is intended, without depending on the
current processor mode, a more specific name (as described below) is used.
Almost all instructions allow the banked registers to be used wherever a
general-purpose register is allowed.
Out of 37 registers, the 20 registers shown shaded in Fig. 14.2.1 are the banked registers. Fig. 14.2.1 also shows which banked registers are used in which mode. Banked registers of a particular mode are denoted by r<number>_<mode>.
For example, supervisor mode has banked registers r13_svc, r14_svc and spsr_svc. On the other hand, abort mode has banked registers r13_abt, r14_abt and spsr_abt.
Registers r8 to r12 have two banked physical registers each. The first group of physical registers are referred to as r8_usr to r12_usr and the second group as r8_fiq to r12_fiq. The r8_usr to r12_usr group is used in all processor modes other than FIQ mode, and the other is used in FIQ mode.
Registers r13 and r14 have six banked physical registers each. One is used in User and System modes, while each of the remaining five is used in one of the five exception modes.
The registers r0 to r13 are orthogonal. This means any instruction which you can apply to r0, you can equally well apply to any of the r1 to r13 registers. This is not the case with the r14 and r15 registers.
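The banking rules above can be cross-checked with a small Python model that maps a (mode, register number) pair to a physical register name and then counts the distinct names. The mode suffixes follow Fig. 14.2.1; the function names and the use of a set to count registers are assumptions of this sketch, not part of the ARM architecture definition.

```python
MODES = ("usr", "sys", "svc", "abt", "undef", "irq", "fiq")


def physical_register(mode: str, n: int) -> str:
    """Name of the physical register that rN refers to in the given mode."""
    if n <= 7 or n == 15:
        return f"r{n}"                              # unbanked: same in all modes
    if n <= 12:
        return f"r{n}_fiq" if mode == "fiq" else f"r{n}_usr"
    bank = "usr" if mode in ("usr", "sys") else mode  # r13 and r14
    return f"r{n}_{bank}"


def status_registers(mode: str) -> tuple:
    """cpsr is shared; only the five exception modes have an spsr."""
    if mode in ("usr", "sys"):
        return ("cpsr",)
    return ("cpsr", f"spsr_{mode}")


def all_physical_registers() -> set:
    names = set()
    for mode in MODES:
        names.update(physical_register(mode, n) for n in range(16))
        names.update(status_registers(mode))
    return names


def banked_registers() -> set:
    """Registers that are not part of the user mode view."""
    user_view = {physical_register("usr", n) for n in range(16)} | {"cpsr"}
    return all_physical_registers() - user_view


print(len(all_physical_registers()), len(banked_registers()))
```

Under these rules the model yields 37 physical registers in total, of which 20 are outside the user mode view, matching the counts quoted in the text.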

Review Questions

1. Draw and explain the ARM programmer's model.


2. List the special purpose registers of ARM processor.
3. Explain different processor modes of ARM processor.
4. Explain the programmer's model of ARM processor with complete register sets available.
5. Explain registers used under various modes.

14.3 Current Program Status Register


The current program status register (cpsr) is accessible in all processor modes. It
contains condition code flags, interrupt disable bits, the current processor mode,
and other status and control information.
Each exception mode also has a saved program status register (spsr), that is used
to preserve the value of the cpsr when the associated exception occurs.

Note : User mode and System mode do not have an SPSR, because they are not exception modes. All instructions which read or write the SPSR are UNPREDICTABLE when executed in User mode or System mode.



Fig. 14.3.1 shows the format of the cpsr and spsr.


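[Figure: fields Flags, Status, Extension and Control; bits 31 - 28 hold the condition flags N, Z, C, V; bits 27 - 8 are shown as undefined; bit 7 (I) and bit 6 (F) are the interrupt masks; bit 5 (T) is the Thumb state bit; bits 4 - 0 hold the processor mode]
Fig. 14.3.1 Format of cpsr and spsr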
Control flags (Bits 0-7)
The control bits change when an exception arises and can be altered by software
only when the processor is in a privileged mode.
Bits 0-4 (Mode Select Bits) : Processor modes
These bits determine the processor mode as shown in Table 14.3.1.

Processor mode            Mode select bits [4:0]
Abort                     10111
Fast interrupt request    10001
Interrupt request         10010
Supervisor                10011
System                    11111
Undefined                 11011
User                      10000
Table 14.3.1 Processor mode
Bit 5 (Thumb State Bit) :
This bit gives the state of the core. The state of the core determines which instruction set is being executed.
There are three instruction sets : ARM, Thumb and Jazelle. Each one is active when the processor is in ARM state, Thumb state or Jazelle state respectively.
Thumb
Thumb instructions are 16 bits wide (instead of the usual 32 bits). This allows for greater code density in places where memory is restricted.


The Thumb set can only address the first eight registers, and there are no conditional execution instructions. Also, Thumb cannot do a number of things required for low-level processor exceptions, so the Thumb instruction set always comes alongside the full ARM instruction set.
Exceptions and the like can be handled in ARM code, with Thumb used for the
more regular code.
Table 14.3.2 gives the comparison of ARM and Thumb instruction set features.
Feature                        ARM (cpsr T = 0)                     Thumb (cpsr T = 1)
Instruction size               32-bit                               16-bit
Core instructions              58                                   30
Conditional execution          most                                 only branch instructions
Data processing instructions   access to barrel shifter and ALU     separate barrel shifter and ALU instructions
Program status register        read-write in privileged mode        no direct access
Register usage                 15 general-purpose registers + pc    8 general-purpose registers + 7 high registers + pc
Table 14.3.2 Comparison of ARM and Thumb instruction set features
Jazelle
The third instruction set introduced by ARM designers is Jazelle. The J bit is an additional flag bit in the flags field, available only on Jazelle-enabled processors.
The Jazelle J and Thumb T bits in the cpsr decide the state of the processor. When both J and T bits are 0, the processor is in ARM state and executes ARM instructions. When the T bit is 1, the processor is in Thumb state and executes Thumb instructions. When the T bit is 0 and the J bit is 1, the processor is in Jazelle state and executes Jazelle instructions.
Jazelle executes 8-bit instructions. It is a hybrid mix of software and hardware, designed to increase the speed of execution of Java bytecodes. The Jazelle technology and a specially modified version of the Java virtual machine are needed to execute Java bytecodes.
The Jazelle instruction set features are given below.
Jazelle instructions are 8-bit in size.
Over 60 % of the Java bytecodes are implemented in hardware and the remaining bytecodes are implemented in software.


The Jazelle instruction set is a closed instruction set and is not openly available. Additional software licensed from both ARM Limited and Sun Microsystems is required to use Jazelle.
Bits 6 and 7 (Interrupt Masks) :
There are two interrupts available on the ARM processor core :
Interrupt Request (IRQ) and
Fast Interrupt Request (FIQ).
These are maskable interrupts and their masking is controlled by bits 6 and 7 of the cpsr. Bit 6 (F) controls FIQ and bit 7 (I) controls IRQ.
When the bit is set to binary 1, the corresponding interrupt request is masked, and when the bit is 0, the interrupt is enabled.
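The control bits described so far can be decoded with a short Python sketch. The mode encodings come from Table 14.3.1 and the T, F and I positions from the text; the position of the J bit (bit 24) is stated here as an assumption, since the text only says that it lies in the flags field. The function name `decode_cpsr_control` is likewise hypothetical.

```python
MODE_NAMES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}

T_BIT = 1 << 5    # Thumb state
F_BIT = 1 << 6    # FIQ mask
I_BIT = 1 << 7    # IRQ mask
J_BIT = 1 << 24   # Jazelle state (bit position assumed, see lead-in)


def decode_cpsr_control(cpsr: int) -> dict:
    """Extract the mode, instruction set state and interrupt masks from a cpsr value."""
    if cpsr & T_BIT:
        state = "Thumb"
    elif cpsr & J_BIT:
        state = "Jazelle"
    else:
        state = "ARM"
    return {
        "mode": MODE_NAMES.get(cpsr & 0x1F, "invalid"),
        "state": state,
        "irq_masked": bool(cpsr & I_BIT),
        "fiq_masked": bool(cpsr & F_BIT),
    }


# 0xD3 = 1101 0011b : I = 1, F = 1, T = 0, mode = 10011 (Supervisor)
print(decode_cpsr_control(0xD3))
```

The value 0xD3 describes a core in Supervisor mode and ARM state with both interrupts masked.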
Condition code flags
These flags in the cpsr can be tested by most instructions to determine whether the instruction is to be executed.
The condition code flags are usually modified by :
Execution of a comparison instruction (CMN, CMP, TEQ or TST).
Execution of some other arithmetic, logical or move instruction, where the destination register of the instruction is not r15. Most of these instructions have both a flag-preserving and a flag-setting variant, with the latter being selected by adding an S qualifier to the instruction mnemonic. Some of these instructions only have a flag-preserving version.
Bit 27 (Saturation flag, Q)
This flag is available for the ARM processor cores which include the DSP extensions. If an overflow and/or saturation occurs in an enhanced DSP instruction, the Q bit is set to 1. The flag is 'sticky', which means the hardware only sets this flag and never clears it. We need to write to the cpsr directly to clear the flag bit.
Similarly, bit [27] of each spsr is a Q flag and is used to preserve and restore the cpsr Q flag if an exception occurs.
Bit 28 (Overflow flag, V)
It is set in one of two ways :
For an addition or subtraction, V is set to 1 if signed overflow occurred, regarding the operands and result as two's complement signed integers.
For non-addition/subtractions, V is normally left unchanged.

Bit 29 (Carry flag, C)
It is set in one of four ways :
For an addition, including the comparison instruction CMN, C is set to 1 if the addition produced a carry (that is, an unsigned overflow), and to 0 otherwise.
For a subtraction, including the comparison instruction CMP, C is set to 0 if the subtraction produced a borrow (that is, an unsigned underflow), and to 1 otherwise.
For non-addition/subtractions that incorporate a shift operation, C is set to the last bit shifted out of the value by the shifter.
For other non-addition/subtractions, C is normally left unchanged.
Bit 30 (Zero flag, Z)
It is set to 1 if the result of the instruction is zero (which often indicates an equal result from a comparison), and to 0 otherwise.
Bit 31 (Negative flag, N)
It is set to bit 31 of the result of the instruction. If this result is regarded as a two's complement signed integer, then N = 1 if the result is negative and N = 0 if it is positive or zero.
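The rules for N, Z, C and V in additions and subtractions can be made concrete with a Python sketch. The function names `add_flags` and `sub_flags` are assumptions for this illustration; they model the flag-setting behaviour described above on 32-bit values and return the flags as a tuple (N, Z, C, V).

```python
MASK32 = 0xFFFFFFFF


def add_flags(a: int, b: int) -> tuple:
    """Return (result, (N, Z, C, V)) for a 32-bit flag-setting addition."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(full > MASK32)                        # carry out = unsigned overflow
    v = ((a ^ result) & (b ^ result)) >> 31       # operands share a sign the result lacks
    return result, (n, z, c, v)


def sub_flags(a: int, b: int) -> tuple:
    """Return (result, (N, Z, C, V)) for a 32-bit flag-setting subtraction a - b."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                               # C = 0 means a borrow occurred
    v = ((a ^ b) & (a ^ result)) >> 31            # signs differ and result sign flipped
    return result, (n, z, c, v)


# 0x7FFFFFFF + 1 overflows the signed range: N = 1, V = 1, no carry
print(add_flags(0x7FFFFFFF, 1))
# Comparing equal values (as CMP does): Z = 1 and C = 1 (no borrow)
print(sub_flags(5, 5))
```

Note that C after a subtraction is the inverse of a borrow, which is why comparing equal values leaves C set.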
The N, Z, C and V flags can be modified in these additional ways :
Execution of an MSR instruction, as part of its function of writing a new value to the cpsr or spsr.
Execution of MRC instructions with destination register r15. The purpose of such instructions is to transfer coprocessor-generated condition code flag values to the ARM processor.
Execution of some variants of the LDM instruction. These variants copy the spsr to the cpsr, and their main intended use is for returning from exceptions.
Execution of flag-setting variants of arithmetic and logical instructions whose destination register is r15. These also copy the spsr to the cpsr and are mainly intended for returning from exceptions.
Other Bits
Other bits in the program status registers are reserved for future expansion.

Review Questions

1. Explain the various fields in Current Program Status Register (CPSR).


2. Explain the condition flags of ARM processor.
3. Compare ARM and Thumb instruction set features.


14.4 Pipeline
Fetching the next instruction while another instruction is being executed is called pipelining. It speeds up program execution. The ARM processor uses this pipeline mechanism.
ARM7 uses a simple three-stage pipeline as shown in the Fig. 14.4.1.

Stage 1   Fetch     Instruction fetched from memory
Stage 2   Decode    Decoding of registers used in instruction
Stage 3   Execute   Register(s) read from register bank, shift and ALU operation, register(s) written back to register bank
Fig. 14.4.1 3-stages in the ARM7 pipeline


The three stages used in the pipeline are :
Fetch : In this stage the processor fetches the instruction from the memory.
Decode : In this stage the processor identifies the instruction which is to be executed.
Execute : In this stage the processor processes the instruction and stores (writes) the result in a register.
By overlapping the above stages of execution of different instructions, the speed of execution is increased. After the pipeline is filled, one instruction completes its execution in every cycle. This increases the throughput.
Fig. 14.4.2 shows the 3-stage pipelined instruction execution.
          Pipeline stages
Time      Fetch           Decode          Execute
Cycle 1   Instruction 1
Cycle 2   Instruction 2   Instruction 1
Cycle 3   Instruction 3   Instruction 2   Instruction 1
Fig. 14.4.2 Three-stage pipelined instruction execution



As shown in Fig. 14.4.2, in cycle 1, the processor fetches instruction 1 from memory. In cycle 2, it fetches instruction 2 from memory and decodes instruction 1. In the third cycle, it fetches instruction 3 from memory, decodes instruction 2 and executes instruction 1. Thus the pipeline is filled with three sequential instructions. It delivers a throughput approaching one instruction per cycle.
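The filling of the pipeline can be reproduced with a small Python sketch. It is an idealized model that assumes every instruction spends exactly one cycle in each stage with no stalls; the function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(n_instructions: int,
                      stages=("Fetch", "Decode", "Execute")) -> list:
    """Return one dict per clock cycle mapping stage name to instruction number."""
    total_cycles = n_instructions + len(stages) - 1
    schedule = []
    for cycle in range(1, total_cycles + 1):
        row = {}
        for depth, stage in enumerate(stages):
            instruction = cycle - depth          # instruction occupying this stage
            if 1 <= instruction <= n_instructions:
                row[stage] = instruction
        schedule.append(row)
    return schedule


for cycle, row in enumerate(pipeline_schedule(3), start=1):
    print(f"Cycle {cycle}:", row)
```

With n instructions and k stages the model needs n + k - 1 cycles, so for long instruction sequences the throughput approaches one instruction per cycle.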
5-Stage ARM Pipeline
The pipeline provided by ARM7 is very cost-effective. However, for higher performance, we require processor organizations which support a larger number of pipeline stages.
The time required to execute a program is given by
$$T_{prog} = \frac{N_{inst} \times CPI}{f_{clk}}$$
where
$T_{prog}$ : Time required to execute a given program.
$N_{inst}$ : Number of ARM instructions executed in the program.
$CPI$ : Average number of clock cycles per instruction.
$f_{clk}$ : Processor's clock frequency.
There are some ways to increase performance :
Increase the clock rate, $f_{clk}$ : To achieve this it is necessary to simplify the logic in each pipeline stage and, therefore, to increase the number of pipeline stages.
Reduce the average number of clock cycles per instruction ($CPI$) : To achieve this it is necessary either to re-implement the instructions which occupy more than one pipeline slot in a 3-stage pipeline ARM so that they occupy fewer slots, or to reduce dependencies between instructions, or a combination of both.
Increase the data width during memory access : Furthermore, to get a better CPI, the memory system must deliver more than one value in each clock cycle, either by delivering more than 32 bits per cycle from a single memory or by having separate memories for instruction and data accesses.
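A short worked example in Python shows how clock rate and cycles per instruction enter the execution time, computed as the instruction count multiplied by the cycles per instruction and divided by the clock frequency. The instruction count, CPI values and clock frequencies below are hypothetical figures chosen only for illustration, not measured values for any ARM core, and the function name `program_time` is an assumption.

```python
def program_time(n_inst: int, cpi: float, f_clk_hz: float) -> float:
    """Execution time in seconds: instruction count times CPI divided by clock frequency."""
    return n_inst * cpi / f_clk_hz


# Hypothetical program of one million instructions
baseline = program_time(1_000_000, cpi=2.0, f_clk_hz=50e6)    # slower clock, higher CPI
improved = program_time(1_000_000, cpi=1.5, f_clk_hz=100e6)   # faster clock, lower CPI

print(f"baseline: {baseline * 1e3:.1f} ms")
print(f"improved: {improved * 1e3:.1f} ms")
print(f"speedup : {baseline / improved:.2f}x")
```

Because the two factors multiply, doubling the clock rate and reducing the CPI together give a larger gain than either change alone.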
Thus, to give higher performance, the ARM9 core employs a 5-stage pipeline as shown in Fig. 14.4.3.


FETCH     Instruction fetch
DECODE    Thumb / ARM instruction decoder, register decoder, register read
EXECUTE   Shift, ALU
MEMORY    Memory access
WRITE     Register write
Fig. 14.4.3 Five stage pipeline


Fig. 14.4.4 shows the organization of the ARM9TDMI, which supports a 5-stage pipeline. It has separate instruction and data memories to support the 5-stage pipeline.
It provides forwarding paths to resolve data dependencies without stalling the 5-stage pipeline. (A data dependency is a pipeline hazard which arises when an instruction needs to use the result of one of its predecessors before that result has returned to the register file.) This technique is known as data forwarding.
There are some cases in which forwarding paths cannot avoid a pipeline stall due to data dependencies.
For example,
LDR R0, [R7]
ADD R4, R0, R2
This instruction sequence suffers a single cycle penalty due to the load-use interlock on register R0. In such cases, compilers are encouraged not to put a dependent instruction immediately after a load instruction.
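The load-use interlock can be illustrated with a simplified Python sketch that counts one stall cycle whenever an instruction reads a register loaded by the instruction immediately before it. The tuple representation of instructions and the function name `count_load_use_stalls` are assumptions of this example; the second program and its SUB instruction are hypothetical, and real pipelines have further interlock cases that are not modelled here.

```python
def count_load_use_stalls(program: list) -> int:
    """Count single-cycle stalls caused by using a loaded register in the next instruction.

    Each instruction is a tuple: (mnemonic, destination register, source registers).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        mnemonic, destination, _ = previous
        if mnemonic == "LDR" and destination in current[2]:
            stalls += 1
    return stalls


# The sequence from the text: ADD needs R0 straight after the load
dependent = [
    ("LDR", "R0", ("R7",)),
    ("ADD", "R4", ("R0", "R2")),
]

# Hypothetical reordering: an independent instruction fills the slot after the load
rescheduled = [
    ("LDR", "R0", ("R7",)),
    ("SUB", "R5", ("R6", "R1")),
    ("ADD", "R4", ("R0", "R2")),
]

print(count_load_use_stalls(dependent), count_load_use_stalls(rescheduled))
```

Placing an independent instruction after the load is the kind of reordering a compiler can use to hide the penalty.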

[Figure: Fetch stage (I-cache, next pc, pc + 4); Decode stage (instruction decode, register bank read, immediate fields); Execute stage (shift, ALU, forwarding paths, pre-index and post-index addressing); Memory stage (D-cache, load/store address, byte rotate / sign extension); Write stage (register write)]
Fig. 14.4.4 ARM9TDMI 5-stage pipeline organization
The stages of the 5-stage pipeline are :
Fetch : In this stage the processor fetches the instruction from memory and places it in the instruction pipeline.
Decode : In this stage
1. The instruction is decoded and
2. The register operands are read from the register file.
Execute : In this stage
1. An operand is shifted.
2. The ALU result is generated.
3. If the instruction is a load or store, the memory address is computed in the ALU.
Memory : In this stage, data memory is accessed if required (that is, for LOAD or STORE instructions). Otherwise, the ALU result is simply buffered for one clock cycle to give the same pipeline flow for all instructions.
Write : In this stage, the results generated by the instruction are written back to the register file, including any data loaded from memory.
6-Stage ARM pipeline
The ARM10 supports a six-stage pipeline. It adds an issue stage to the pipeline as shown in Fig. 14.4.5.

Fetch -> Issue -> Decode -> Execute -> Memory -> Write

Fig. 14.4.5 ARM-10 six-stage pipeline


Review Questions

1. What is pipelining ?
2. Explain the concept of pipeline used in ARM processor.
3. Explain the ARM 5-stage pipelining.
4. Draw the ARM-10 six-stage pipeline.
5. With diagram explain the various blocks in a 3-stage pipeline of ARM processor organization.
6. Explain the pipeline mechanism in (Advanced RISC Machine) ARM processor.

14.5 Exceptions, Interrupts and Vector Table


When an exception or interrupt occurs, the PC is loaded with a specific address corresponding to the interrupt or exception. This specific address is known as the vector address.
The vector table holds the vector addresses for all the interrupts or exceptions. The memory map address 00000000H is reserved for the vector table.
In some processors the vector table is located at a higher address in memory, starting at the offset FFFF0000H.
When an exception or interrupt occurs, the processor suspends normal instruction execution and loads the instruction from the vector table at the specific vector address. Each vector table entry contains a form of branch instruction pointing to the start of a specific interrupt service routine. Table 14.5.1 shows the vector table.

Exception / Interrupt             Address      High Address
Reset (RESET)                     00000000H    FFFF0000H
Undefined instruction (UNDEF)     00000004H    FFFF0004H
Software interrupt (SWI)          00000008H    FFFF0008H
Prefetch abort (PABT)             0000000CH    FFFF000CH
Data abort (DABT)                 00000010H    FFFF0010H
Reserved                          00000014H    FFFF0014H
Interrupt Request (IRQ)           00000018H    FFFF0018H
Fast Interrupt Request (FIQ)      0000001CH    FFFF001CH
Table 14.5.1 Vector table
Exceptions and Interrupts
Reset : It occurs when power is applied. In response to RESET, the processor executes
the branch instruction located at address 00000000H to transfer program control to
the initialization code.
Undefined Instruction : It occurs when the processor cannot decode an instruction. In
response, the processor executes the branch instruction located at address 00000004H.
Software Interrupt : The software interrupt vector is used when an SWI
instruction is executed. It is frequently used to invoke an operating system routine.
Prefetch Abort : The prefetch abort vector is used when the processor attempts to
fetch an instruction from an address without the correct access permissions.
Data Abort : The data abort vector is used when an instruction attempts to access
data memory without the correct access permissions.
Interrupt Request : The interrupt request vector is used by external hardware
to interrupt the normal instruction execution flow of the processor. The external
hardware can use IRQ only when it is not masked in the cpsr.
Fast Interrupt Request : The fast interrupt vector is used by external hardware
which requires a faster response time. The external hardware can use FIQ only when
it is not masked in the cpsr.
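The masking condition for IRQ and FIQ can be sketched as a bit test on the cpsr. The sketch assumes the standard cpsr layout, in which bit 7 (I) masks IRQ and bit 6 (F) masks FIQ when set; the function names are illustrative only.

```python
I_BIT = 1 << 7  # cpsr bit 7: when set, IRQ is masked (disabled)
F_BIT = 1 << 6  # cpsr bit 6: when set, FIQ is masked (disabled)


def irq_enabled(cpsr):
    """True when external hardware can raise an IRQ."""
    return (cpsr & I_BIT) == 0


def fiq_enabled(cpsr):
    """True when external hardware can raise an FIQ."""
    return (cpsr & F_BIT) == 0


# Example cpsr values (low byte = I, F, T flags and 5-bit mode field).
for cpsr in (0xD3, 0x53, 0x10):
    print(f"cpsr={cpsr:02X}H IRQ enabled={irq_enabled(cpsr)} "
          f"FIQ enabled={fiq_enabled(cpsr)}")
```

With 0xD3 both interrupts are masked, with 0x53 only FIQ is masked, and with 0x10 both are enabled.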

Review Questions

1. Define vector address and vector table.


2. List the vector addresses of various interrupts / exceptions in ARM processor.
3. Explain the various exceptions / interrupts supported by the ARM processor.

14.6 Core Extensions


The hardware extensions to the ARM core improve performance, manage resources,
and provide extra functionality; they are designed to provide flexibility in handling
particular applications.
There are three hardware extensions : cache and tightly coupled memory,
memory management, and the coprocessor interface.
14.6.1 Cache and Tightly Coupled Memory
14.6.1.1 Cache Memory
Cache memory is a small, fast, volatile memory placed between main
memory and the core. It provides high-speed data access to the processor core and
stores frequently used programs and data. With a cache, the processor core can run
for the majority of the time without having to wait for data from slow external
memory.
ARM has two forms of cache : a single unified cache and separate caches for data
and instruction.
Single Unified Cache : It is found attached to the Von Neumann-style cores. It
combines both data and instruction into a single cache, as shown in Fig. 14.6.1.

[Figure : (a) ARM core with a unified cache; (b) ARM core with separate data (D) and
instruction (I) caches. In both, logic and control and an AMBA bus interface unit
connect the core to main memory over the on-chip AMBA bus.]
Fig. 14.6.1 Processor core with memory : (a) Unified cache (b) Separate cache
Separate Caches for Data and Instruction : This form is attached to the
Harvard-style cores and has separate caches for data and instruction.
A cache provides an overall increase in performance; however, its performance is
not predictable. If the data to be processed is available in the cache,
execution time is shorter (performance is good); otherwise, execution time is
longer (performance is poor).
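The trade-off can be put into numbers with the usual average access time estimate, hit ratio times cache cost plus miss ratio times memory cost. The cycle counts below (1 cycle for a hit, 10 for a miss) are assumed purely for illustration and do not describe any particular ARM core.

```python
def average_access_cycles(hit_ratio, cache_cycles=1, memory_cycles=10):
    """Average cycles per access for a given cache hit ratio.

    cache_cycles and memory_cycles are hypothetical costs.
    """
    return hit_ratio * cache_cycles + (1.0 - hit_ratio) * memory_cycles


for hit_ratio in (1.0, 0.9, 0.5, 0.0):
    cycles = average_access_cycles(hit_ratio)
    print(f"hit ratio {hit_ratio:.1f}: {cycles:.2f} cycles on average")
```

The average improves as the hit ratio rises, but any single access still costs either the hit time or the miss time, which is why the timing is not predictable.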

14.6.1.2 Tightly Coupled Memory


For real-time systems, the time taken for loading and storing instructions or data
must be predictable.
This predictable performance is achieved using a memory called tightly coupled
memory (TCM). TCM is fast SRAM located close to the processor core. It requires a
fixed number of clock cycles to fetch instructions or data, and instruction execution
is fast enough to satisfy the needs of real-time algorithms.
Fig. 14.6.2 shows a processor core with tightly coupled memories.

[Figure : ARM core with logic and control, a data TCM (D) and an instruction TCM (I);
an AMBA bus interface unit connects the core to main memory (D + I) over the
on-chip AMBA bus.]
Fig. 14.6.2 Processor core with tightly coupled memories
14.6.1.3 Combining Cache and Tightly Coupled Memory
By combining both cache and tightly coupled memory, ARM processors can have
both improved performance and predictable real-time response.
Fig. 14.6.3 shows an example core with a combination of caches and TCMs.

[Figure : ARM core with logic and control, data and instruction TCMs, and data and
instruction caches; an AMBA bus interface unit connects the core to main memory
(D + I) over the on-chip AMBA bus.]
Fig. 14.6.3 Processor core with a combination of caches and TCMs
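A toy model shows how the combination behaves: accesses that fall inside the TCM window always take the same time, while all other accesses go through the cache and may hit or miss. Everything in this sketch is hypothetical, including the TCM window, the line size and the cycle counts.

```python
TCM_BASE = 0x00000000      # hypothetical start of the TCM window
TCM_SIZE = 16 * 1024       # hypothetical TCM size in bytes
LINE_SIZE = 32             # hypothetical cache line size in bytes
TCM_CYCLES = 1             # fixed cost of every TCM access
HIT_CYCLES = 1             # cost of a cache hit
MISS_CYCLES = 10           # cost of a cache miss (main memory access)


def access_cycles(addr, cached_lines):
    """Return the cost of one access and update the set of cached lines."""
    if TCM_BASE <= addr < TCM_BASE + TCM_SIZE:
        return TCM_CYCLES              # deterministic: never depends on history
    line = addr // LINE_SIZE
    if line in cached_lines:
        return HIT_CYCLES              # data already in the cache
    cached_lines.add(line)             # fill the line on a miss
    return MISS_CYCLES


cached = set()
for addr in (0x0100, 0x8000, 0x8004, 0x0100):
    print(f"{addr:08X}H -> {access_cycles(addr, cached)} cycles")
```

Time-critical routines and data would be placed in the TCM window, and the rest of the program relies on the caches for average-case speed.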



14.6.2 Memory Management


Embedded systems mostly use multiple memory devices.
ARM processors use memory management hardware to organize these memory
devices and protect the system from applications trying to make inappropriate
accesses to hardware.

ARM cores have three different types of memory management hardware :
No extensions - Do not provide any protection
Memory Protection Unit (MPU) - Provides limited protection
Memory Management Unit (MMU) - Provides full protection
No extension (No protection)
Memory with no protection is fixed and provides very little flexibility.
It is used for small, simple embedded systems that require no protection.
MPU (Limited protection)
The memory protection unit is used in simple systems that use a limited number of
memory regions.
Special processor registers are used to control these memory regions.
Each memory region is defined with specific access permissions.
The MPU is used in systems where memory protection is required but which do not have a
complex memory map.
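The region-based check performed by an MPU can be sketched as a lookup over a small list of regions. The region addresses, sizes and permissions below are invented for illustration, and the model ignores details of real hardware such as overlapping regions and privilege levels.

```python
from collections import namedtuple

# One MPU region: base address, size in bytes and access permissions.
Region = namedtuple("Region", "base size readable writable")

# Hypothetical memory map with two regions.
REGIONS = [
    Region(0x00000000, 0x8000, readable=True, writable=False),  # code
    Region(0x40000000, 0x4000, readable=True, writable=True),   # RAM
]


def mpu_allows(addr, write):
    """Return True if the access is permitted by some region."""
    for region in REGIONS:
        if region.base <= addr < region.base + region.size:
            return region.writable if write else region.readable
    return False  # in this sketch, addresses outside all regions are refused


print(mpu_allows(0x00000100, write=False))  # read from code region
print(mpu_allows(0x00000100, write=True))   # write to read-only code
print(mpu_allows(0x40000010, write=True))   # write to RAM
```

A refused access corresponds to the abort exceptions described in section 14.5, which are taken when memory is accessed without the correct permissions.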
MMU (Full protection)
It uses a set of translation tables to provide extensive control over memory.
The translation tables are stored in main memory and they provide virtual to
physical address translation as well as access permissions.
It is the most comprehensive memory management hardware and provides a sophisticated
platform to support multitasking.
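The idea of a translation table can be illustrated with a single-level lookup. This is a deliberately simplified sketch: the 4 KB page size, the table contents and the AccessFault name are assumptions, and a real ARM MMU walks multi-level tables held in main memory.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical table: virtual page number -> (physical page number, writable)
TRANSLATION_TABLE = {
    0x00000: (0x80000, False),  # read-only page
    0x00001: (0x80005, True),   # read/write page
}


class AccessFault(Exception):
    """Raised when no translation exists or permissions are violated."""


def translate(vaddr, write=False):
    """Translate a virtual address to a physical address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    entry = TRANSLATION_TABLE.get(vpn)
    if entry is None:
        raise AccessFault(f"no translation for {vaddr:08X}H")
    ppn, writable = entry
    if write and not writable:
        raise AccessFault(f"write to read-only page at {vaddr:08X}H")
    return ppn * PAGE_SIZE + offset


print(f"{translate(0x00000010):08X}H")
print(f"{translate(0x00001234, write=True):08X}H")
```

Because each task can be given its own table, the same virtual address can map to different physical memory in different tasks, which is what makes the MMU suitable for multitasking.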
14.6.3 Coprocessors
The processing features of a core can be enhanced by attaching coprocessors to the
ARM core. The coprocessors extend the instruction set and provide configuration
registers to add processing features. It is possible to add more than one
coprocessor to the ARM core via the coprocessor interface.
If the coprocessor is not present or doesn't recognize the instruction, then the
ARM takes an undefined instruction exception, which allows you to emulate the
behavior of the coprocessor in software.

The ARM processor core supports several different types of closely coupled
coprocessors, including floating point, SIMD, and system control and cache
maintenance.
Each coprocessor present in an ARM system has a unique 4-bit ID code.
Coprocessor instructions contain a field for the ID code of the coprocessor on which
they will execute.
One of the primary goals of the ARM coprocessor interface is not to slow down
the CPU core. Beyond checking to see if a coprocessor instruction is coded for an
existing coprocessor, the core does not spend time sorting out coprocessor
instructions within its own pipeline. The core sends all the instructions it fetches
from memory directly to all coprocessors. Each coprocessor decodes all incoming
instructions, which include both ordinary ARM instructions and coprocessor
instructions. During the decoding stage, the coprocessor rejects any instructions
that are not recognised as its own. This includes both ARM instructions and
instructions coded for other coprocessors. The coprocessor recognises its own
instructions and adds only those to its internal execution pipeline. The coprocessor
then sends a signal back to the core indicating that it has accepted an instruction.
The coprocessor can be accessed through a group of dedicated ARM instructions
that provide a load-store type interface.
The coprocessor can also extend the instruction set by providing a specialized
group of new instructions. For example, floating-point instructions can be added to
the standard ARM instruction set by attaching a vector floating-point (VFP)
coprocessor.
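The accept or reject handshake described above can be sketched as follows. The sketch assumes the instruction word is already known to be a coprocessor instruction and that the 4-bit ID sits in bits 11 to 8, as it does in the ARM coprocessor register-transfer encodings; the class and function names are illustrative.

```python
def coprocessor_id(instruction):
    """Extract the 4-bit coprocessor ID field (bits 11 to 8)."""
    return (instruction >> 8) & 0xF


class Coprocessor:
    def __init__(self, cp_id):
        self.cp_id = cp_id
        self.pipeline = []  # instructions this coprocessor has accepted

    def offer(self, instruction):
        """Accept the instruction only if its ID field matches."""
        if coprocessor_id(instruction) != self.cp_id:
            return False
        self.pipeline.append(instruction)
        return True


def dispatch(instruction, coprocessors):
    """Offer the instruction to every coprocessor, as the core does."""
    accepted = [cp.offer(instruction) for cp in coprocessors]
    if any(accepted):
        return "accepted"
    return "undefined instruction exception"


cp15 = Coprocessor(15)
MRC_P15 = 0xEE110F10  # MRC p15, 0, r0, c1, c0, 0
MRC_P7 = 0xEE110710   # same encoding with ID 7, which is not attached here
print(dispatch(MRC_P15, [cp15]))
print(dispatch(MRC_P7, [cp15]))
```

The second call models the case where no coprocessor claims the instruction, so the core takes the undefined instruction exception and software may emulate the missing coprocessor.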

Review Questions

1. Write a note on core extensions.
2. Explain the cache memory extensions to the processor core with the help of neat diagrams.
3. Explain the tightly coupled memory extensions to the processor core with the help of neat diagrams.
4. Write a short note on memory management.
5. Explain the concept of core extensions in the ARM processor.
6. Discuss briefly how coprocessors can be attached to the ARM processor.
7. Discuss the following with diagrams :
i) Von Neumann architecture with cache
ii) Harvard architecture with TCM.


14.7 Architecture Revisions


ARM has several processors that are grouped into a number of families based on
the processor core they are implemented with.
The architecture of ARM processors has continued to evolve with every family.
Some of the well-known ARM processor families are ARM7, ARM9, ARM10 and
ARM11. Every ARM processor implementation executes a specific Instruction Set
Architecture (ISA).
The ISA has evolved while maintaining compatibility, so that code written to execute on
an earlier architecture revision will also execute on a later revision of the
architecture.

14.7.1 ARM Nomenclature


ARM nomenclature identifies individual processors and provides basic
information about the feature set.
The letters or words after "ARM" are used to indicate the features of a processor.
ARMxyzTDMIEJF-S
x - Family or series
y - Memory management / protection unit
z - Cache
T - 16-bit Thumb decoder
D - JTAG debugger
M - Fast multiplier
I - Embedded In-Circuit Emulator (ICE) macrocell
E - Enhanced instructions for DSP (assumes TDMI)
J - Jazelle (for accelerated Java execution)
F - Vector floating-point unit
S - Synthesizable version
T - Thumb instruction set : ARM processors support both the 32-bit ARM
instruction set and the 16-bit Thumb instruction set. The original ARM
instructions consist of 32-bit opcodes, that is, a 4-byte binary
pattern. The Thumb instructions consist of 16-bit opcodes, a 2-byte binary
pattern, to improve the code density.
D - JTAG debug : JTAG is a serial protocol used by ARM to transfer the debug
information between the processor and the test equipment.
M - Fast multiplier : Older ARM processors used a small and simple multiplier
unit. This multiplier unit required more clock cycles to complete a single
multiplication. With the introduction of the fast multiplier unit, the clock cycles
required for multiplication are significantly reduced, and modern ARM processors
are capable of calculating a 32-bit product in a single cycle.
I - Embedded ICE macrocell : ARM processors have on-chip debug hardware that
allows breakpoints and watchpoints to be set on the processor.
E - Enhanced instructions : ARM processors with this extension support the
extended DSP instruction set for high performance DSP applications. With these
extended DSP instructions, the DSP performance of the ARM processors can be
increased without high clock frequencies.
J - Jazelle : ARM processors with Jazelle technology can be used for accelerated
execution of Java bytecodes. Jazelle DBX or Direct Bytecode eXecution is used in
mobile phones and other consumer devices for high performance Java execution
without affecting memory or battery.
F - Vector floating-point unit : The floating-point architecture in ARM processors
provides execution of floating-point arithmetic operations. The dynamic range and
precision offered by the floating-point architecture in ARM processors are used in
many real-time applications in the industrial and automotive areas.
S - Synthesizable : The ARM processor core is available as source code. This
software core can be compiled into a format that can be easily understood by the
EDA tools. Using the processor source code, it is possible to modify the
architecture of the ARM processor.
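The naming scheme can be exercised with a small decoder. This is a sketch under stated assumptions: it handles only names of the form ARM, digits, feature letters and an optional -S suffix, it returns the digits unsplit, and it expands E to include TDMI as the list above states. The function name decode_name is hypothetical.

```python
import re

FEATURES = {
    "T": "16-bit Thumb decoder",
    "D": "JTAG debug",
    "M": "Fast multiplier",
    "I": "EmbeddedICE macrocell",
    "E": "Enhanced DSP instructions",
    "J": "Jazelle",
    "F": "Vector floating-point unit",
    "S": "Synthesizable version",
}


def decode_name(name):
    """Split a processor name into its number and a list of features."""
    match = re.fullmatch(r"ARM(\d+)([TDMIEJF]*)(-S)?", name)
    if match is None:
        raise ValueError(f"unsupported processor name: {name}")
    digits, letters, synthesizable = match.groups()
    codes = list(letters)
    if "E" in codes:
        # E assumes TDMI, so add whichever of those letters are missing.
        codes = [c for c in "TDMI" if c not in codes] + codes
    if synthesizable:
        codes.append("S")
    return digits, [FEATURES[c] for c in codes]


for name in ("ARM7TDMI", "ARM926EJ-S", "ARM1136JF-S"):
    number, features = decode_name(name)
    print(name, "->", number, features)
```

Names that use other suffixes, for example a digit after the T, are outside this simple pattern and raise ValueError.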

14.7.2 Architecture Evolution


The architecture has continued to evolve since the first ARM processor
implementation was introduced in 1985. Table 14.7.1 shows the significant
architecture enhancements from the original architecture version 1 to the current
version 6 architecture.

Revision   Example Core Implementation   ISA Enhancement
ARMv1      ARM1                          First ARM processor
                                         26-bit addressing
ARMv2      ARM2                          32-bit multiplier
                                         32-bit coprocessor support
ARMv2a     ARM3                          On-chip cache
                                         Atomic swap instruction
                                         Coprocessor 15 for cache management
ARMv3      ARM6 and ARM7DI               32-bit addressing
                                         Separate cpsr and spsr
                                         New modes - undefined instruction and abort
                                         MMU support - virtual memory
ARMv3M     ARM7M                         Signed and unsigned long multiply instructions
ARMv4      StrongARM                     Load-store instructions for signed and
                                         unsigned halfwords/bytes
                                         New mode - system
                                         Reserve SWI space for architecturally
                                         defined operations
                                         26-bit addressing mode no longer supported
ARMv4T     ARM7TDMI and ARM9T            Thumb
ARMv5TE    ARM9E and ARM10E              Superset of the ARMv4T
                                         Extra instructions added for changing
                                         state between ARM and Thumb
                                         Enhanced multiply instructions
                                         Extra DSP-type instructions
                                         Faster multiply accumulate
ARMv5TEJ   ARM7EJ and ARM926EJ           Java acceleration
ARMv6      ARM11                         Improved multiprocessor instructions
                                         Unaligned and mixed endian data handling
                                         New multimedia instructions
Table 14.7.1 ARM architecture enhancements

14.8 ARM Processor Families


ARM has several processors that are grouped into a number of families based on
the processor core they are implemented with.


The families are based on the ARM7, ARM9, ARM10 and ARM11 cores. The postfix
numbers 7, 9, 10 and 11 indicate different core designs.
14.8.1 ARM7
The ARM7 family was introduced in 1994 (ARM7TDMI, ARM7EJ-S, ARM720T).
This family has been immensely successful and has established ARM as the
architecture of choice in the digital world.
Over the years, more than 10 billion ARM7 processor family based devices have
powered a variety of cost and power sensitive applications.
Due to the availability of more advanced ARM processors, the ARM7 processor
family (ARM7TDMI) is not recommended for new designs.
14.8.1.1 Features of ARM7
1. Pipeline depth : Three stage (Fetch, Decode, Execute)
2. Operating frequency : 80 MHz
3. Power consumption : 0.06 mW/MHz
4. MIPS/MHz : 0.97
5. Architecture used : Von Neumann
6. MMU/MPU : Not present
7. Cache memory : Not present
8. Jazelle instruction : Not present
9. Thumb instruction : Yes (16-bit instruction set)
10. ARM instruction set : Yes (32-bit)
11. ISA (Instruction Set Architecture) : v4T (version 4T)
12. Interrupt controller : Not present
13. ISR entry : Non-deterministic ISR entry
14. Power management : No in-built power management
15. Instruction set performance v/s code size : Optimal balance of performance and code
size requires interworking between ARM and Thumb code
16. Ease of application porting from one device to another : Lack of standardization
inhibits application porting.


14.8.2 ARM9 Processor Family


The ARM9 family was announced in 1997.
The ARM9 family enables a single processor solution for microcontroller, DSP and Java
applications, offering savings in chip area and complexity, power consumption
and time to market.
The ARM9 family has enhanced processors that are well suited for
applications requiring a mix of DSP and microcontroller performance.
The ARM9 family includes the ARM920T, ARM922T, ARM940T, ARM946E-S,
ARM966E-S, and ARM926EJ-S processors.
14.8.2.1 Features of ARM9
Pipeline depth : 5 stage (Fetch, Decode, Execute, Memory, Write)
Operating frequency : 150 MHz
Power consumption : 0.19 mW/MHz
MIPS/MHz : 1.1
Architecture used : Harvard. It separates the data (D) and instruction (I) buses.
MMU/MPU : Present
Cache memory : Present (separate 16 K/8 K)
ARM/Thumb instruction : Supports both
ARM920T
The first processor in the ARM9 family.
It includes separate D + I caches and an MMU.
Provides virtual memory support to operating systems.
It executes the architecture v4T instructions.
ARM922T is a variation on the ARM920T with half the D + I cache size.
ARM940T
It includes a smaller D + I cache and an MPU.
It is designed for applications that do not require a platform operating system.
It executes the architecture v4T instructions.
ARM946E-S and ARM966E-S
Both execute architecture v5TE instructions.
They support the optional embedded trace macrocell (ETM), which allows a
developer to trace instruction and data execution in real time on the processor.


The ARM946E-S includes TCM (Tightly Coupled Memory), cache, and an MPU.
The sizes of the TCM and caches are configurable. It is designed for use in
embedded control applications that require deterministic real-time response.
On the other hand, the ARM966E-S does not have the MPU and cache extensions
but does have configurable TCMs.
ARM926EJ-S
It is a synthesizable processor core, announced in 2000.
It is designed for use in small portable Java-enabled devices such as 3G phones
and Personal Digital Assistants (PDAs).
Supports the Jazelle technology, which accelerates Java bytecode execution.
It features an MMU, configurable TCMs, and D + I caches with zero or nonzero
wait state memories.

14.8.3 ARM10 Processors Family


The ARM10 family was announced in 1999.
The purpose of ARM10 was to double the performance of its predecessor on the same
fabrication process, while allowing for further improvements with smaller processes.
Pipeline depth : 6 stage
Operating frequency : 260 MHz
Power consumption : 0.5 mW/MHz
MIPS/MHz : 1.3
Architecture used : Harvard. It separates the data (D) and instruction (I) buses.
MMU/MPU : Present
Cache memory : Present (separate 32 K)
ARM/Thumb instruction : Supports both
It also supports an optional Vector Floating-Point (VFP) unit, which adds a seventh
stage to the ARM10 pipeline. The VFP significantly increases floating-point
performance and is compliant with the IEEE 754-1985 floating-point standard.
ARM10 is the first ARM core to support architecture version 5T. This is a
superset of version 4T, adding BLX (branch-with-link and toggle Thumb/ARM
mode), CLZ (Count Leading Zeroes, useful for DSP operations), and BKPT
(software breakpoint).
Production ARM10 processors actually support v5TE, which adds signal
processing (saturate-on-overflow) instructions.


ARM1020E
The ARM1020E is the first processor to use an ARM10E core.
Like the ARM9E, it includes the enhanced E instructions.
It has separate 32 K D + I caches, an optional vector floating-point unit, and an MMU.
The ARM1020E also has a dual 64-bit bus interface for increased performance.
ARM1026EJ-S
The ARM1026EJ-S is very similar to the ARM926EJ-S but with both an MPU and an MMU.
This processor has the performance of the ARM10 with the flexibility of an
ARM926EJ-S.

14.8.4 ARM11 Processors Family


This family provides the engine that powers many smartphones and is also widely used
in consumer, home and embedded applications.
It delivers low power and a range of performance from 350 MHz to 1 GHz.
ARM11 processor software is compatible with all previous generations of ARM
processors.
The ARM11 family includes the ARM1176JZ(F)-S, ARM11 MPCore, ARM1136J(F)-S and
ARM1156T2-S processors.
14.8.4.1 Features of ARM11
Pipeline depth : 8 stage pipeline with separate load-store and arithmetic pipelines
Operating frequency : 335 MHz
Power consumption : 0.4 mW/MHz
MIPS/MHz : 1.2
Architecture used : Harvard
MMU/MPU : Present
Multiplier unit : 16 x 32 (16 bits of 32-bit size register)
Cache memory : Present (4-64 K size)
ARM1136J-S
It was announced in 2003.
Designed for high performance and power efficient applications.
It executes architecture ARMv6 instructions.
Supports Single Instruction Multiple Data (SIMD) extensions for media
processing, specifically designed to increase video processing performance.

The ARM1136JF-S is an ARM1136J-S with the addition of the vector
floating-point unit for fast floating-point operations.
Supports the Thumb instruction set; memory bandwidth and size requirements are reduced by
up to 35 %.
Supports Jazelle technology for efficient embedded Java execution.
Supports the DSP extensions.
Supports ARM TrustZone technology for on-chip security.
Physically tagged caches to improve OS context switch performance.
Tightly coupled memories for real-time applications.
14.8.5 Comparison between ARM7, ARM9, ARM10 and ARM11 Cores
Table 14.8.1 shows a comparison between the ARM7, ARM9, ARM10 and ARM11 cores.
Processor attribute    ARM7           ARM9              ARM10            ARM11
Pipeline depth         3-stage        5-stage           6-stage          8-stage
Typical MHz            80             150               260              335
mW/MHz                 0.06           0.19 (+ cache)    0.5 (+ cache)    0.4 (+ cache)
MIPS/MHz               0.97           1.1               1.3              1.2
Architecture           Von Neumann    Harvard           Harvard          Harvard
Multiplier             8 x 32         8 x 32            16 x 32          16 x 32
MMU/MPU                Absent         Present           Present          Present
Cache memory           Absent         Present           Present          Present
Configurable TCM       Absent         Present           Present          Present
Jazelle technology     Absent         Present           Present          Present
Table 14.8.1 Comparison between ARM7, ARM9, ARM10 and ARM11 cores
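The per-MHz figures in the table become easier to compare once they are multiplied by the typical clock frequency. The sketch below only does that arithmetic on the table's values; the results are rough estimates, and the names CORES and derived_figures are illustrative.

```python
# core: (typical MHz, mW per MHz, MIPS per MHz), taken from Table 14.8.1
CORES = {
    "ARM7": (80, 0.06, 0.97),
    "ARM9": (150, 0.19, 1.1),
    "ARM10": (260, 0.5, 1.3),
    "ARM11": (335, 0.4, 1.2),
}


def derived_figures(core):
    """Return (estimated MIPS, estimated power in mW) at the typical clock."""
    mhz, mw_per_mhz, mips_per_mhz = CORES[core]
    return mhz * mips_per_mhz, mhz * mw_per_mhz


for core in CORES:
    mips, milliwatts = derived_figures(core)
    print(f"{core:6s} about {mips:6.1f} MIPS at about {milliwatts:6.1f} mW")
```

On these figures the ARM11 delivers roughly five times the throughput of the ARM7 at its typical clock, at a correspondingly higher power draw.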

Review Questions

1. List the features of ARM7 family processors.


2. List the features of ARM9 family processors.
3. List the features of ARM10 family processors.
4. List the features of ARM11 family processors.
5. Give the comparison between ARM7, ARM9, ARM10 and ARM11 family processors.
