ARM Cortex-A9
MPCore Processor
Presented by M. Jawwad Rafiq
FA15-R01-017
Cell no: +923336335584
Cortex-A Series
Efficient application processors for
every level of performance.
Application processors for OS and
user applications.
Processors in smartphones, tablets,
notebooks, eBook readers etc.
Cortex-A Series
High performance in a low-power
processor family.
Available as a single-core Cortex-A9
processor or as a scalable multicore
processor, the Cortex-A9 MPCore.
Where is it used?
Examples:
- Apple A5 (iPhone 4S, iPad 2, iPad mini)
Where is it used? (2)
Examples:
- NVIDIA Tegra 2 (Motorola Xoom, Droid X2)
Where is it used? (3)
Examples:
- PlayStation Vita
What are its specs?
Cortex-A9 processors: one to four
L1 cache size per A9 processor: 16, 32, or 64 KB
L2 cache: 128 KB to 8 MB
Performance: 2.5 DMIPS/MHz/core (Dhrystone MIPS)
Clock: generally between 800 MHz and 2 GHz
Out-of-order superscalar
Pipeline stages: 11
NEON & FPU
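As a rough worked example, the 2.5 DMIPS/MHz/core figure can be turned into an aggregate number for a given clock and core count. This is a minimal sketch: the helper name `peak_dmips` is hypothetical, and it assumes throughput scales linearly with the number of cores, which is an upper bound rather than a measured result.

```python
# Figure taken from the spec list above.
DMIPS_PER_MHZ_PER_CORE = 2.5


def peak_dmips(clock_mhz: float, cores: int) -> float:
    """Upper-bound Dhrystone MIPS for a Cortex-A9 MPCore configuration.

    Assumes perfect linear scaling across cores (an illustrative assumption).
    """
    if not 1 <= cores <= 4:
        raise ValueError("Cortex-A9 MPCore has one to four processors")
    return DMIPS_PER_MHZ_PER_CORE * clock_mhz * cores


# A few points inside the 800 MHz to 2 GHz range quoted above.
for clock_mhz, cores in [(800, 1), (1000, 2), (2000, 4)]:
    print(f"{cores} core(s) at {clock_mhz} MHz: "
          f"{peak_dmips(clock_mhz, cores):.0f} DMIPS")
```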
What are its specs? (2)
Branch predictor
Technology node: 40 nm to 65 nm
Supply voltage: 1.05 V
Transistor count: 2,600,000
Power consumption: 0.5 W to 1.9 W
Jazelle support
37 traditional ARM registers, 80 NEON
registers
Thumb-2 instruction set
Registers
ARM has 37 registers in total, all of
which are 32 bits long.
1 dedicated program counter
1 dedicated current program status
register
5 dedicated saved program status
registers
30 general purpose registers
16 registers only visible in ARM.
A particular SPSR (saved program
Registers
Pipeline Stages in A9
Presentation Overview
Micro-architecture
Memory System
Microarchitecture Overview
Variable length, out of order, superscalar
pipeline
Two instructions are fetched in one cycle
Issue up to 4 instructions per cycle into:
Primary data processing pipeline
Secondary data processing pipeline
Load-store pipeline
Compute engine (FPU/NEON) pipeline
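The issue step above can be sketched as a small selection loop. This is only an illustration: the `Instr` type, the pipeline names, and the rule of at most one instruction per pipeline per cycle are assumptions made for the sketch, not details taken from the slides.

```python
from dataclasses import dataclass

# The four pipelines listed above (identifier names are hypothetical).
PIPELINES = ("primary_alu", "secondary_alu", "load_store", "compute_engine")


@dataclass
class Instr:
    name: str
    pipeline: str  # pipeline that executes this instruction
    ready: bool    # True once all source operands are available


def issue_one_cycle(queue):
    """Select up to one ready instruction per pipeline, oldest first.

    Returns (issued, remaining). Skipping instructions that are not ready
    is what makes the selection out of order. With four pipelines, at most
    four instructions are issued per call.
    """
    issued, remaining, busy = [], [], set()
    for instr in queue:
        if instr.ready and instr.pipeline not in busy:
            busy.add(instr.pipeline)
            issued.append(instr)
        else:
            remaining.append(instr)
    return issued, remaining


queue = [
    Instr("ldr", "load_store", ready=False),   # waiting on its address
    Instr("add", "primary_alu", ready=True),
    Instr("sub", "primary_alu", ready=True),   # same pipeline as add
    Instr("vadd", "compute_engine", ready=True),
    Instr("mul", "secondary_alu", ready=True),
]
issued, remaining = issue_one_cycle(queue)
print([i.name for i in issued])     # ['add', 'vadd', 'mul']
print([i.name for i in remaining])  # ['ldr', 'sub']
```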
Cortex-A9 Microarchitecture
[Figure: pipeline stages: Instruction Fetch, Decode, Rename, Issue, Execute, Memory, Write back]
Instruction Fetch
Instruction cache size: 16KB, 32KB, or 64KB
Superscalar pipeline: fetching two instructions at once
Branch Prediction:
Global History Buffer: 1K ~ 16K entries
Branch-Target Address Cache: 512 ~ 4K entries
Return stack of 4 x 32 bits
Fast-loop mode: instruction loops smaller than 64
bytes often complete without additional instruction cache
accesses
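To make the branch prediction structures more concrete, here is a minimal sketch of a global-history predictor and a four-entry return stack. The slides give only the table sizes, so the indexing scheme (branch address XOR global history), the 2-bit saturating counters, and the overflow behaviour of the return stack are all assumptions for illustration; the class names are hypothetical.

```python
class GlobalHistoryPredictor:
    """Table of 2-bit saturating counters indexed by PC XOR global history."""

    def __init__(self, entries=1024):  # 1K entries, the smallest size quoted
        if entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.counters = [1] * entries  # start at "weakly not taken"
        self.history = 0               # recent branch outcomes, one bit each

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


class ReturnStack:
    """Four 32-bit return addresses; the oldest is dropped on overflow."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = []

    def push(self, return_address):
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address & 0xFFFFFFFF)

    def pop(self):
        return self.entries.pop() if self.entries else None


predictor = GlobalHistoryPredictor()
loop_branch = 0x8000
for _ in range(16):                 # train on a branch that is always taken
    predictor.update(loop_branch, taken=True)
print(predictor.predict(loop_branch))
```

After enough taken outcomes the history register settles to all ones, so the same counter is trained repeatedly and the prediction flips to taken.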
Instruction Decode
Superscalar decoder
- Capable of decoding two full instructions per
cycle
Rename
Register Renaming
- Resolves data dependencies and
unrolls small loops in hardware
- Virtual renaming of registers
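The renaming idea can be sketched with a small mapping table. This is a simplified illustration, not the A9's actual mechanism: the class name `RenameTable`, the physical register count of 32, and the absence of register reclamation at retirement are all assumptions.

```python
class RenameTable:
    """Maps architectural registers (r0..r15) to physical registers."""

    def __init__(self, num_arch=16, num_phys=32):  # 32 is a hypothetical size
        self.mapping = {f"r{i}": i for i in range(num_arch)}
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest, sources):
        """Rename one instruction; returns (physical dest, physical sources)."""
        # Sources are looked up first so they see the previous producer.
        phys_sources = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register; a pipeline would stall")
        phys_dest = self.free.pop(0)
        self.mapping[dest] = phys_dest  # later readers of dest use the new copy
        return phys_dest, phys_sources


table = RenameTable()
print(table.rename("r0", ["r1", "r2"]))  # add r0, r1, r2 -> (16, [1, 2])
print(table.rename("r3", ["r0", "r4"]))  # sub r3, r0, r4 -> (17, [16, 4])
print(table.rename("r0", ["r5"]))        # mov r0, r5     -> (18, [5])
```

The third instruction writes r0 again but receives a fresh physical register, so it no longer has to wait for the first two. The same effect lets successive iterations of a small loop use different physical registers.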
Issue
Issue can dispatch up to 4 instructions per
cycle
Out of order selection of instructions from
Execute
Variable-length execute stage (1 ~ 3 cycles)
- Most instructions finish within 1 cycle
- Instructions that fold shifts and rotates can take 3
cycles
- ADD r0, r1, r2, LSL r3 (3 cycles): r2 is logically shifted
left (LSL) by the amount in r3 before the add
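For reference, the value computed by the folded shift-and-add example can be modelled in a few lines. This sketch covers only the 32-bit result, not the condition flags or the cycle count, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def lsl_by_register(value, shift_reg):
    """Logical shift left by a register-specified amount (32-bit result)."""
    amount = shift_reg & 0xFF  # only the bottom byte of the register is used
    if amount >= 32:
        return 0               # everything is shifted out
    return (value << amount) & MASK32


def add_with_lsl(r1, r2, r3):
    """Value written to r0 by: ADD r0, r1, r2, LSL r3."""
    return (r1 + lsl_by_register(r2, r3)) & MASK32


print(add_with_lsl(10, 3, 4))  # 10 + (3 << 4) = 58
```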
Memory Hierarchy
[Figure: Cortex-A9 MPCore block diagram. Components shown: four CPUs, each with an instruction cache and a data cache; Snoop Control Unit (SCU); Accelerator Coherence Port; L2 cache; main memory]
L1 caches
[Figure: Cortex-A9 MPCore block diagram. Components shown: four CPUs, each with I$ and D$; SCU; ACP; 64-bit AXI read/write buses; L2 cache; main memory]
I cache and D cache: 16, 32 or 64 KB each
Support for Security Extensions
L2 cache
[Figure: Cortex-A9 MPCore block diagram. Components shown: four CPUs, each with I$ and D$; SCU; ACP; 64-bit AXI read/write buses; L2 cache; main memory]
Shared
128 KB to 8 MB
Snoop Control Unit
[Figure: Cortex-A9 MPCore block diagram. Components shown: four CPUs, each with I$ and D$; SCU; ACP; 64-bit AXI read/write buses; L2 cache; main memory]
Integral part of the cache memory system
Connects the processors to the
memory system through
AXI (Advanced eXtensible
Interface) interfaces
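The role of a snoop unit sitting between per-CPU caches and memory can be illustrated with a toy write-invalidate model. The slides do not describe the coherence protocol, so everything here is an assumption chosen for brevity: caches are plain dictionaries, writes go straight through to memory, and a write by one CPU simply invalidates the copies held by the others. The class below is a hypothetical model, not the SCU's actual behaviour.

```python
class ToySnoopControlUnit:
    """Toy model: per-CPU data caches kept coherent by invalidation."""

    def __init__(self, num_cpus=4):
        self.caches = [dict() for _ in range(num_cpus)]  # addr -> value
        self.memory = {}

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:              # miss: fill from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        for other, cache in enumerate(self.caches):
            if other != cpu:
                cache.pop(addr, None)      # snoop: drop stale copies
        self.caches[cpu][addr] = value
        self.memory[addr] = value          # write-through, for simplicity


scu = ToySnoopControlUnit()
scu.write(0, 0x100, 7)
print(scu.read(1, 0x100))        # CPU 1 sees the value written by CPU 0
scu.write(1, 0x100, 9)
print(0x100 in scu.caches[0])    # CPU 0's copy was invalidated
print(scu.read(0, 0x100))        # CPU 0 refetches the new value
```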