ARM Processors and Architectures - Uni Program
Architectures
A Comprehensive Overview
ARM University Program
September 2012
More information about ARM and our offices on our web site:
http://www.arm.com/aboutarm/
Optional features
VFPv3 Vector Floating-Point
NEON media processing engine
Dual-issue, super-scalar 13-stage pipeline
Branch Prediction & Return Stack
NEON and VFP implemented at end of pipeline
Mode          Description
Supervisor    Entered on reset and when a Supervisor Call (SVC) instruction is executed
[Figure: exception modes; a single cpsr, and one spsr for each of five exception modes]
Syntax:
<Operation>{<cond>}{S} {Rd,} Rn, Operand2
Examples:
ADD r0, r1, r2; r0 = r1 + r2
TEQ r0, r1 ; if r0 = r1, Z flag will be set
University Program Material
Copyright ARM Ltd 2012 23
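The two data-processing examples above can be modelled roughly in C. This is only a sketch: the function names `add_regs` and `teq_sets_z` are made up for illustration and belong to no ARM toolchain. ADD is treated as a wrapping 32-bit addition, and TEQ as an exclusive-OR whose result is discarded after the flags are updated.

```c
#include <stdbool.h>
#include <stdint.h>

/* ADD r0, r1, r2 : 32-bit addition that wraps on overflow. */
static uint32_t add_regs(uint32_t r1, uint32_t r2)
{
    return r1 + r2;
}

/* TEQ r0, r1 : computes r0 EOR r1 and keeps only the flags.
 * Returns the resulting Z flag, which is set when the operands are equal. */
static bool teq_sets_z(uint32_t r0, uint32_t r1)
{
    return (r0 ^ r1) == 0u;
}
```

Calling `teq_sets_z(r0, r1)` mirrors the comment on the TEQ line: it is true exactly when r0 = r1.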
Single Access Data Transfer
Use to move data between one or two registers and memory
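Load     Store    Size
LDR      STR      Word
LDRB     STRB     Byte
LDRH     STRH     Halfword
LDRD     STRD     Doubleword
LDRSB             Signed byte load
LDRSH             Signed halfword load
[Figure: a byte or halfword from Memory is placed in the low bits of Rd (bits 31 to 0); the upper bits are zero filled or sign extended on Load]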
Syntax:
LDR{<size>}{<cond>} Rd, <address>
STR{<size>}{<cond>} Rd, <address>
Example:
LDRB r0, [r1] ; load bottom byte of r0 from the
; byte of memory at address in r1
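The difference between zero filling (LDRB) and sign extension (LDRSB) on a load can be sketched in C as follows. The helper names `ldrb` and `ldrsb` are hypothetical and simply mirror the mnemonics; they operate on an ordinary C pointer rather than on a real register file.

```c
#include <stdint.h>

/* LDRB: load one byte and zero-fill bits 31..8 of the destination. */
static uint32_t ldrb(const uint8_t *addr)
{
    return (uint32_t)*addr;
}

/* LDRSB: load one byte and copy its bit 7 into bits 31..8. */
static uint32_t ldrsb(const uint8_t *addr)
{
    uint32_t value = *addr;
    return (value & 0x80u) ? (value | 0xFFFFFF00u) : value;
}
```

For a byte with bit 7 clear the two functions agree; they differ only when the byte would be negative as a signed value.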
Multiple Register Data Transfer
These instructions move data between multiple registers and memory
Syntax
<LDM|STM>{<addressing_mode>}{<cond>} Rb{!}, <register list>
4 addressing modes: IA (default), IB, DA, DB
Increment after/before
Decrement after/before
[Figure: where r0, r1 and r4 are placed in memory relative to the Base Register (Rb) r10 for each addressing mode, drawn with increasing address]
Also
PUSH/POP, equivalent to STMDB/LDMIA with SP! as base register
Example
LDM r10, {r0,r1,r4} ; load registers, using r10 base
PUSH {r4-r6,pc} ; store registers, using SP base
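A small C model may help with the ordering rules of the multiple-register transfers above. It is a sketch under stated assumptions: memory is an array of words indexed by word number rather than byte address, the register list is a bit mask in which bit n selects rn, and the names `ldm_ia` and `stm_db` are illustrative only.

```c
#include <stdint.h>

/* LDMIA Rb{!}: the lowest-numbered register is loaded from the lowest
 * address, and the address increments after each transfer.
 * Returns the base value that write-back (!) would leave in Rb. */
static uint32_t ldm_ia(const uint32_t *mem, uint32_t base,
                       uint32_t regs[16], uint16_t list)
{
    for (int n = 0; n < 16; n++) {
        if (list & (1u << n)) {
            regs[n] = mem[base];
            base++;
        }
    }
    return base;
}

/* STMDB Rb!: the address decrements before each transfer, so the
 * highest-numbered register ends up at the highest address. */
static uint32_t stm_db(uint32_t *mem, uint32_t base,
                       const uint32_t regs[16], uint16_t list)
{
    for (int n = 15; n >= 0; n--) {
        if (list & (1u << n)) {
            base--;
            mem[base] = regs[n];
        }
    }
    return base;
}
```

With this model, `LDM r10, {r0,r1,r4}` corresponds to `ldm_ia` with bits 0, 1 and 4 set in the list, and a PUSH corresponds to `stm_db` with the stack pointer as base.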
func1                             func2
void func1 (void)                   :
{                                   :
    :                               :
    func2();   ; BL func2           BX lr
    :
}
The SVC handler can examine the SVC number to decide what operation
has been requested
But the core ignores the SVC number
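Because the core ignores the SVC number, a handler that wants it usually reads the SVC instruction back from memory and masks out the immediate field. The sketch below shows only the masking step; the function names are hypothetical, and fetching the instruction (from the return address minus 4 in ARM state, or minus 2 in Thumb state) is left to the caller.

```c
#include <stdint.h>

/* ARM state: SVC is a 32-bit instruction carrying a 24-bit immediate. */
static uint32_t svc_number_arm(uint32_t instruction)
{
    return instruction & 0x00FFFFFFu;
}

/* Thumb state: SVC is a 16-bit instruction carrying an 8-bit immediate. */
static uint32_t svc_number_thumb(uint16_t instruction)
{
    return instruction & 0x00FFu;
}
```

The handler can then switch on the returned value to select the requested operation.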
Adjusts LR based on exception type
[Figure: NEON registers; destination register Dd and its lanes; Q15 occupies D30 and D31]
[Figure: memory hierarchy; ARM core with on-chip D-Cache RAM and SRAM, connected through the BIU to off-chip memory; levels labelled L1, L2, L3]
Cache Lockdown
Prevents line Eviction from a specified Cache Way (discussed later)
Streaming, Critical-Word-First
Cache data is forwarded to the core as soon as the requested word is received in
the Linefill buffer
Any word in the cache line can be requested first using a WRAP burst on the bus
[Figure: cache organization; address fields of 19, 8 and 3 bits; each cache line holds Tag, v, Data (words 7 to 0) and d; a Victim Counter selects the line to replace]
Cache has 8 words of data in each line
Each cache line contains Dirty bit(s)
  Indicates whether a particular cache line was modified by the ARM core
Each cache line can be Valid or invalid
  An invalid line is not considered when performing a Cache Lookup
v - valid bit   d - dirty bit(s)
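The cache-line structure above can be sketched in C. The field split is an assumption read off the "19 8 3" label in the figure: a 19-bit tag, an 8-bit line index and a 3-bit word select, with the remaining two low bits taken to be the byte offset within a word. All type and function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* One cache line: tag, valid bit, dirty bit and 8 words of data. */
typedef struct {
    uint32_t tag;
    bool     valid;
    bool     dirty;
    uint32_t data[8];
} cache_line_t;

typedef struct {
    uint32_t tag;    /* address bits 31..13 (assumed) */
    uint32_t index;  /* address bits 12..5  (assumed) */
    uint32_t word;   /* address bits 4..2   (assumed) */
} cache_addr_t;

/* Split a 32-bit address into tag, line index and word select. */
static cache_addr_t split_address(uint32_t addr)
{
    cache_addr_t f;
    f.word  = (addr >> 2) & 0x7u;
    f.index = (addr >> 5) & 0xFFu;
    f.tag   = addr >> 13;
    return f;
}

/* An invalid line never hits, whatever its tag holds. */
static bool is_hit(const cache_line_t *line, uint32_t tag)
{
    return line->valid && line->tag == tag;
}
```

A lookup would use `index` to pick the line in each way and then call `is_hit` with `tag`.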
[Figure: four pairs of data and instruction caches (D$, I$)]
Cortex M3 Total
60k* Gates
Cortex-M0
ARMv6-M Architecture
16-bit Thumb-2 with system control instructions
Fully programmable in C
3-stage pipeline
von Neumann architecture
AHB-Lite bus interface
Fixed memory map
1-32 interrupts
Configurable priority levels
Non-Maskable Interrupt support
Low power support
Core configured with or without debug
Variable number of watchpoints and breakpoints
Agenda
Introduction
ARM Architecture Overview
ARMv7-AR Architecture
Programmers Model
Memory Systems
ARMv7-M Architecture
Programmers Model
Memory Systems
Floating Point Extensions
ARM System Design
Software Development Tools
Registers R0-R12
  General-purpose registers
R13 is the stack pointer (SP) - 2 banked versions
R14 is the link register (LR)
R15 is the program counter (PC)
PSR (Program Status Register)
  Not explicitly accessible
  Saved to the stack on an exception
  Subsets available as APSR, IPSR, and EPSR
[Figure: register file showing R0 to R12, R13 (SP), R14 (LR), R15 (PC) and PSR]
[Figure: processor modes; after Reset the ARM processor runs Application Code in Thread Mode; Exception Entry moves to Handler Mode, which runs Exception Code; Exception Return moves back]
Memory Access:
STRB r2, [r10, r1] ; store lower byte in r2 at address {r10 + r1}
LDR r0, [r1, r2, LSL #2] ; load r0 with data at address {r1 + r2 * 4}
Program Flow:
BL <label> ; PC relative branch to <label> location, and return address stored in LR (r14)
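The address arithmetic in the two memory-access examples can be written out in C. This is a sketch only: `scaled_address` and `strb` are hypothetical helpers, and memory is modelled as a plain byte array.

```c
#include <stdint.h>

/* [Rn, Rm, LSL #shift] : effective address = Rn + (Rm << shift). */
static uint32_t scaled_address(uint32_t rn, uint32_t rm, unsigned shift)
{
    return rn + (rm << shift);
}

/* STRB: only the lowest byte of the source register is stored. */
static void strb(uint8_t *mem, uint32_t address, uint32_t rt)
{
    mem[address] = (uint8_t)(rt & 0xFFu);
}
```

With a shift of 2 the index register counts words, which is why `LDR r0, [r1, r2, LSL #2]` is the natural form for indexing an array of 32-bit values.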
Interrupt handling
Interrupts are a sub-class of exception
Automatic save and restore of processor registers (xPSR, PC, LR, R12, R3-R0)
Allows handler to be written entirely in C
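The registers saved automatically on exception entry can be pictured as a C structure laid over the stack. The layout below follows the usual ARMv7-M basic frame (R0 at the lowest address, xPSR at the highest); the type name `exception_frame_t` and the helper are illustrative and not taken from any vendor header.

```c
#include <stddef.h>
#include <stdint.h>

/* Basic frame pushed on exception entry, lowest address first. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* R14 at the time of the exception */
    uint32_t pc;    /* address to return to */
    uint32_t xpsr;
} exception_frame_t;

/* Hypothetical helper: return address recorded in a stacked frame. */
static uint32_t stacked_return_address(const exception_frame_t *frame)
{
    return frame->pc;
}
```

These are the registers a C function is allowed to overwrite under the ARM procedure call standard, so an ordinary C function can serve as the handler.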
[Figure: the NVIC receives INTNMI and INTISR[0] to INTISR[N] and drives the Cortex-Mx Processor Core]
[Figure: nested interrupt timing on the base CPU; IRQ1, IRQ2 and IRQ3 plotted against Time; Core Execution runs Foreground, ISR2, ISR1, ISR2 (ISR 2 resumes), ISR3, Foreground]
[Figure: control flow between the Reset Handler, Main, the Exception Vector and the Exception Handler, labelled with step numbers 1 to 5]
1. Exception occurs
Current instruction stream stops
Processor accesses vector table
2. Vector address for the exception loaded from the vector table
3. Exception handler executes in Handler Mode
4. Exception handler returns to main
During (or after) state saving the address of the ISR is read from the Vector Table
ExecFuncPtr exception_table[] = {
    (ExecFuncPtr)&Image$$ARM_LIB_STACK$$ZI$$Limit, /* Initial SP */
    (ExecFuncPtr)__main, /* Initial PC */
    NMIException,
    HardFaultException,
    MemManageException,
    BusFaultException,
    UsageFaultException,
    0, 0, 0, 0, /* Reserved */
    SVCHandler,
    DebugMonitor,
    0, /* Reserved */
    PendSVC,
    SysTickHandler
    /* Configurable interrupts start here...*/
};
#pragma arm section
The vector table at address 0x0 is minimally required to have 4 values: stack top, reset routine location, NMI ISR location, HardFault ISR location
The SVCall ISR location must be populated if the SVC instruction will be used
Once interrupts are enabled, the vector table (whether at 0 or in SRAM) must then have pointers to all enabled (by mask) exceptions
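Since every entry in the vector table is one 32-bit word, the offset of a vector follows directly from its exception number. The sketch below assumes the standard ARMv7-M numbering, in which external interrupt n is exception 16 + n; the function names are made up for illustration.

```c
#include <stdint.h>

/* Each vector table entry is one 32-bit word. */
static uint32_t vector_offset(uint32_t exception_number)
{
    return 4u * exception_number;
}

/* External interrupt n (IRQn) is exception number 16 + n. */
static uint32_t irq_vector_offset(uint32_t irq)
{
    return vector_offset(16u + irq);
}
```

In the table above, entry 0 holds the initial SP, entry 1 the reset vector, entry 11 the SVCall handler and entry 15 the SysTick handler, so the configurable interrupts start at offset 0x40.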
Vector Table in Assembly
PRESERVE8
THUMB
IMPORT ||Image$$ARM_LIB_STACK$$ZI$$Limit||
AREA RESET, DATA, READONLY
EXPORT __Vectors
Memory map (base address, region)
E000_0000   512MB System (XN), up to FFFF_FFFF
A000_0000   1GB External Peripheral
6000_0000   1GB External SRAM
4000_0000   512MB Peripheral
2000_0000   512MB SRAM
0000_0000   512MB Code

Internal Private Peripheral Bus (base address, block)
E004_2000   UNUSED, up to FFFF_FFFF
E004_1000   ETM
E004_0000   TPIU
E000_F000   RESERVED, up to E003_FFFF
E000_E000   NVIC
E000_3000   RESERVED
E000_2000   FPB
E000_1000   DWT
E000_0000   ITM
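Because every region in the memory map above starts on a 512MB boundary, the top three address bits are enough to classify an address. The following is a sketch under that observation; the enum and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    REGION_CODE,
    REGION_SRAM,
    REGION_PERIPHERAL,
    REGION_EXTERNAL_SRAM,
    REGION_EXTERNAL_PERIPHERAL,
    REGION_SYSTEM
} region_t;

/* Classify an address by its top three bits (one value per 512MB block). */
static region_t region_of(uint32_t address)
{
    switch (address >> 29) {
    case 0:         return REGION_CODE;                /* 0000_0000 */
    case 1:         return REGION_SRAM;                /* 2000_0000 */
    case 2:         return REGION_PERIPHERAL;          /* 4000_0000 */
    case 3: case 4: return REGION_EXTERNAL_SRAM;       /* 6000_0000, 1GB */
    case 5: case 6: return REGION_EXTERNAL_PERIPHERAL; /* A000_0000, 1GB */
    default:        return REGION_SYSTEM;              /* E000_0000 */
    }
}
```

For example, the NVIC base E000_E000 falls in the System region, while 2000_0000 is the first address of on-chip SRAM.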
Cortex-M4F Floating Point Registers
FPU provides a further 32 single-precision registers
Can be viewed as either
  32 x 32-bit registers
  16 x 64-bit doubleword registers
  Any combination of the above
[Figure: register overlap; D0 = S0, S1; D1 = S2, S3; D2 = S4, S5; D3 = S6, S7; ... ; D14 = S28, S29; D15 = S30, S31]
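The overlap between the single-precision and doubleword views follows a simple rule, sketched here in C with hypothetical helper names: register Sn shares storage with D(n/2), with an even n occupying the low word and an odd n the high word.

```c
/* Index of the doubleword register that contains Sn. */
static unsigned d_register_of(unsigned s)
{
    return s / 2u;
}

/* 1 if Sn is the upper 32 bits of its doubleword register, else 0. */
static unsigned is_high_word(unsigned s)
{
    return s & 1u;
}
```

So S28 and S29 both map to D14, and S30 and S31 to D15.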
[Figure: ARMv7-M Architecture and ARMv6-M Architecture]
[Figure: example AMBA-based system; a high performance ARM processor and a high bandwidth external memory interface on the AHB, with a bridge to APB peripherals (UART, Timer, Keypad); labels include AMBA AXI, AMBA APB, core interface and other CoreLink peripherals]
Varying width, speed and size
Other peripherals and interfaces
Can include on-chip memory from
[Figure: AHB structure; Masters #1 to #3 drive HADDR and HWDATA through an Arbiter to Slaves #1 to #4 and receive HRDATA; a Decoder selects the slave; the buses shown are Address/Control, Write Data and Read Data]
[Figure: inter-connection architecture joining master interfaces (ARM, Master 2) to slave interfaces]
Linux Support
Pre-built Linux images are available for ARM hardware platforms
DS-5 accepts kernel images built with the GNU toolchain
Can also debug applications or loadable kernel modules
RVCT can be used to build Linux applications or libraries
Giving performance benefits
ARM does not provide technical support for the GNU toolchain, or Linux
kernel/driver development
August 2012
ARMv7-A is designed for applications requiring high performance; it includes a memory management unit (MMU) and supports multi-tasking operating systems, as in the Cortex-A5, Cortex-A8, and Cortex-A9 processors. ARMv7-R targets real-time applications with a focus on low latency and predictability, using a memory protection unit (MPU) instead of an MMU, as in the Cortex-R4 and Cortex-R5. ARMv7-M is optimized for microcontroller applications, with the lowest gate count and a strong emphasis on deterministic and predictable behavior, as seen in the Cortex-M3. Each profile is tailored to a different market: application processing, real-time control, and microcontrollers.
Thumb-2 technology improves both performance and code efficiency by extending the original 16-bit Thumb instruction set with additional 32-bit instructions, giving a near-complete instruction set in a compact format. This improves code density, so more operations fit in the same memory space, which matters in memory-constrained environments. Because Thumb-2 code can perform close to ARM code, there is much less need to switch between ARM and Thumb states, which avoids the switching overhead. As a result, Thumb-2 balances computational power against memory usage.
The ARM instruction set uses a fixed 32-bit instruction length, giving the full range of operations in a consistent format. In contrast, the Thumb instruction set (with Thumb-2) uses a mix of 16- and 32-bit instructions to improve code density. Switching between ARM and Thumb states requires executing a specific instruction, such as BX, to change the processor state; the T bit in the program status register indicates the current state (ARM or Thumb). This gives a choice between the full ARM instruction set and the better code density of Thumb in memory-constrained environments.
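As a small illustration of the state indication described above, the sketch below tests the T bit in a CPSR value and the state selected by a BX target. It assumes the ARMv7-A/R encoding, where T is bit 5 of the CPSR and bit 0 of a BX target address selects the state; the macro and function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPSR_T_BIT (1u << 5)   /* T flag position in the CPSR */

/* True when the given CPSR value indicates Thumb state. */
static bool cpsr_is_thumb(uint32_t cpsr)
{
    return (cpsr & CPSR_T_BIT) != 0u;
}

/* BX Rm: bit 0 of the target selects the state (1 = Thumb, 0 = ARM). */
static bool bx_enters_thumb(uint32_t target)
{
    return (target & 1u) != 0u;
}
```

Bit 0 of the target is not part of the branch address itself; it only carries the state selection.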
The Nested Vectored Interrupt Controller (NVIC) in the ARMv7-M architecture provides efficient interrupt handling by tightly integrating with the processor core to manage up to 496 interrupts. The NVIC supports vectored interrupt handling: each interrupt has a dedicated handler address, which cuts latency by removing the need to determine the interrupt source in software. The NVIC also prioritizes interrupts and allows nesting, so a higher-priority interrupt can preempt a lower-priority one. This gives the fast, deterministic response that real-time applications need.
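The prioritization and nesting rules can be sketched in C. This is a simplified model, not the NVIC register interface: it assumes that a lower numeric value means higher priority, that ties go to the lowest interrupt number, and it ignores priority grouping. All names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* A pending interrupt preempts the running handler only if its
 * priority value is strictly lower (lower value = higher priority). */
static bool preempts(uint8_t pending_priority, uint8_t active_priority)
{
    return pending_priority < active_priority;
}

/* Index of the pending interrupt to take next, or -1 if none is
 * pending. The strict comparison keeps the lowest index on a tie. */
static int highest_pending(const uint8_t priority[], const bool pending[],
                           int count)
{
    int best = -1;
    for (int i = 0; i < count; i++) {
        if (pending[i] && (best < 0 || priority[i] < priority[best])) {
            best = i;
        }
    }
    return best;
}
```

In the nested case, a handler running at priority 2 is interrupted by a request at priority 1 and resumes once that handler returns.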
ARM's load/store architecture shapes the processor design for speed and efficiency. Data-processing operations work only on CPU registers; memory is accessed only by instructions that load into or store from those registers. This keeps memory accesses, which are often slower than register operations, to a minimum. The architecture favors a design with many registers available for computation, so that the instruction pipeline is less often stalled by memory delays. The simplicity of the load/store instructions also keeps the instruction set streamlined, which helps instructions execute quickly.
The DSP extensions in the ARM Cortex-M4 processor add specialized instructions for common digital signal processing operations such as saturating arithmetic and SIMD execution. Examples are QADD (saturating 32-bit addition) and SMULTB (signed 16-bit by 16-bit multiply), which do in one instruction what would otherwise take several. This makes the Cortex-M4 well suited to audio processing, motor control, and other tasks requiring intensive real-time data manipulation.
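To show what saturating arithmetic means in practice, here is a portable C sketch of a signed 32-bit saturating addition in the style of QADD. It is a behavioural model only; on a Cortex-M4 the compiler or an intrinsic would emit the single instruction, and the function name `qadd` is simply borrowed from the mnemonic.

```c
#include <stdint.h>

/* Signed 32-bit addition that clamps to the representable range
 * instead of wrapping around on overflow. */
static int32_t qadd(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) {
        return INT32_MAX;
    }
    if (sum < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)sum;
}
```

Clamping rather than wrapping is what keeps an overdriven audio sample at full scale instead of flipping its sign.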
ARM system design achieves both high performance and power efficiency through a modular architecture and bus protocols such as AMBA, which carry data between components. Combining the low-power operation of ARM processors with efficient memory and peripheral designs allows performance to scale across market segments. Features such as dynamic voltage and frequency scaling (DVFS) and power gating help balance performance against energy consumption, making ARM-based SoCs suitable for applications from mobile devices to cloud computing.
TrustZone technology in ARM Cortex processors improves system security by separating execution into a 'secure world' and a 'normal world'. This hardware-enforced division lets secure applications run isolated from general applications, preventing unauthorized access and reducing the attack surface. TrustZone is particularly useful for protecting sensitive data and operations such as cryptographic processing, which makes it valuable for high-security applications like mobile payments and enterprise solutions.
The ARM Cortex-A8 processor uses a dual-issue, 13-stage superscalar pipeline that can process more than one instruction at a time, which increases throughput. Key performance features include branch prediction and a return stack to reduce branch penalties, the NEON media processing engine for SIMD operations, and VFPv3 for vector floating-point operations. The architecture also supports the Thumb-2 and TrustZone extensions, which improve code density and security, respectively. Together these features improve the processor's computational efficiency and speed across a range of applications.
Jazelle technology in ARM architectures accelerates Java applications by executing Java bytecode directly or with minimal translation, avoiding the performance cost of interpreting Java VM instructions in software. This direct execution model reduces the overhead of a software-based Java Virtual Machine, giving faster start-up and better efficiency, especially in mobile and embedded environments where system resources are limited. Jazelle extends ARM's support for high-level language execution and matters most for devices focused on running Java applications.