Notes
ARM Features:
❖ ARM stands for Advanced RISC Machine. ARM cores are among the most widely
licensed and widely used processor cores in the world.
❖ They are especially common in portable devices such as digital cameras, mobile
phones, home network modules, and wireless communication equipment.
❖ Compared with x86, the ARM architecture generally offers better power efficiency.
❖ ARM processors are not limited to mobile phones; they are also used in
Fugaku, which ranked as the world’s fastest supercomputer in 2020.
❖ ARM cores are used in both microprocessors and microcontrollers.
RISC vs CISC (surviving rows of the comparison table):
No.  RISC                                   CISC
4.   These chips are relatively simple      These chips are complex to design.
     to design.
12.  Registers are used for procedure       The stack is used for procedure
     arguments and return addresses.        arguments and return addresses.
ASB – Advanced System Bus: used for internal communication in embedded
systems designed with ARM.
AHB – Advanced High-performance Bus: offers larger bandwidths (64/128 bit) and is
used to connect to external memory. Multi-layer AHB and AHB-Lite are also provided
to support variable-speed accesses to external components. Newer AHB interconnects
support multiple processors operating in parallel.
APB – Advanced Peripheral Bus: a lower-bandwidth bus for slower devices such as
peripherals.
Types of Memory
Read-only memory (ROM) is the least flexible of all memory types because it
contains an image that is permanently set at production time and cannot be
reprogrammed. Many devices also use a ROM to hold boot code.
Flash ROM can be written to as well as read, but it is slow to write. Its main use
is for holding the device firmware. The erasing and writing of flash ROM are
completely software controlled.
Dynamic random-access memory (DRAM) is the most commonly used
RAM for devices. It has the lowest cost per megabyte. DRAM is dynamic: its
storage cells must be refreshed and given a new electronic charge
every few milliseconds, so you need to set up a DRAM controller before using
the memory.
Static random-access memory (SRAM) is faster than DRAM but costs more. It is
static: it does not require refreshing, so its access time is shorter.
Synchronous dynamic random-access memory (SDRAM) runs at much
higher clock speeds than conventional DRAM because it synchronizes itself with
the processor bus.
Peripherals
• Peripherals let an embedded system interact with the outside world.
• A peripheral device performs input and output functions for the chip by
connecting to other devices that are off-chip.
• Each peripheral device usually performs a single function.
• Peripherals range from a simple serial communication device to a complex 802.11
wireless device.
• All ARM peripherals are memory mapped: the programming interface is a
set of memory-addressed registers.
• Controllers are specialized peripherals that implement higher levels of
functionality. There are two important types:
1. Memory controllers
2. Interrupt controllers.
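Because peripherals are memory mapped, C code usually reaches them through volatile accesses. The sketch below is only illustrative: the register layout (uart_regs_t), its field names and its bit meanings are hypothetical and do not describe any real ARM peripheral. On real hardware the structure would be placed at an address taken from the chip's memory map; here an ordinary variable can stand in for the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block of a simple serial peripheral. */
typedef struct {
    volatile uint32_t data;    /* write: byte to transmit            */
    volatile uint32_t status;  /* bit 0: transmitter ready (assumed) */
    volatile uint32_t control; /* bit 0: peripheral enable (assumed) */
} uart_regs_t;

#define UART_STATUS_TX_READY (1u << 0)
#define UART_CONTROL_ENABLE  (1u << 0)

/* Enable the peripheral by setting one bit in its control register. */
void uart_enable(uart_regs_t *uart)
{
    assert(uart != NULL);
    uart->control |= UART_CONTROL_ENABLE;
}

/* Send one byte if the transmitter is ready; returns 1 on success, 0 otherwise. */
int uart_try_send(uart_regs_t *uart, uint8_t byte)
{
    assert(uart != NULL);
    if ((uart->status & UART_STATUS_TX_READY) == 0u) {
        return 0;
    }
    uart->data = byte;
    return 1;
}
```

The volatile qualifier tells the compiler that every read and write of a register must really happen, since the hardware can change the value behind the program's back.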
Memory Controllers
• Connect different types of memory to the processor bus.
• On power-up a memory controller is configured in hardware to allow
certain memory devices to be active.
• Some memory devices must be set up by software.
Interrupt Controllers
Interrupts are raised to gain attention. Here, peripherals raise interrupts to
get the attention of the ARM processor, and the interrupts pass through an
interrupt controller. Two types of interrupt controller are available for the ARM
processor:
1. Standard interrupt controller (SIC).
2. Vector interrupt controller (VIC).
The SIC can be programmed to either ignore (mask) or allow the interrupts raised by
peripherals.
In the VIC, interrupts are masked or serviced based on priorities, and a vector
table is looked up to find the handler for each interrupt.
An embedded system starts by executing boot code (initialization
code). It sets up memory devices, caches, and peripheral devices, and deals with
administrative work before the actual OS image is loaded.
Phase-2: Diagnostics
In this phase, diagnostic code checks whether the hardware is working. This helps
in isolating faults.
Operating Systems
An OS acts as a manager of resources such as memory, the processor,
and peripherals. More than 50 operating systems support ARM; they fall into two
categories: RTOS and PSOS.
Real Time Operating Systems (RTOS):
They have deadlines and guaranteed response times.
A hard RTOS guarantees its response time.
A soft RTOS provides a good response time, but without a guarantee.
RTOS systems generally do not have secondary storage.
Platform Specific Operating Systems (PSOS):
No deadlines.
Need a memory management unit to manage large, non-real-time applications.
Tend to have secondary storage.
Applications
An application implements a processing task; the operating system controls the
environment. An embedded system can have one active application or several
applications running simultaneously. Traditionally, ARM processors have not been
aimed at applications that require leading-edge high performance. Because these applications
tend to be low volume and high cost, ARM has decided not to focus designs on these
types of applications.
ARM Registers
Basically, there are two types of ARM registers: general-purpose registers and
special-purpose registers. General-purpose registers hold either data or an address.
The letter r is prefixed to the register number to identify them. For example, the label
r4 is assigned to register 4. The registers are 32 bits in size. Up to 18 active registers
are available: 16 data registers and 2 processor status registers (cpsr and spsr). The
data registers are labeled r0 through r15 by the programmer. Three of them, r13, r14,
and r15, are each assigned a special function:
Register r13 is traditionally used as the stack pointer (SP) and stores the head
of the stack in the current processor mode.
Register r14 is called the link register (LR) and is where the core puts the
return address whenever it calls a subroutine.
Register r15 is the program counter (PC) and contains the address of the next
instruction to be fetched by the processor.
• The exception modes also have a saved processor status register (SPSR), which is
used to preserve the value of the CPSR when the associated exception occurs.
• Because the User and System modes are not exception modes, they have no
SPSR.
Banked Registers
There are 37 registers in the register file. Of those, 20 registers are hidden from a
program at different times. These registers are called banked registers and are
identified by the shading in the diagram. They are available only when the processor
is in a particular mode; for example, abort mode has banked registers r13_abt,
r14_abt and spsr_abt. Banked registers of a particular mode are denoted by an
underscore followed by the mode mnemonic (_mode) appended to the register name.
ARM vs THUMB
Interrupt Mask
An interrupt mask is an internal switch setting that controls whether an interrupt can be
processed or not. The mask is a bit that is turned on and off by the program. If the
interrupt is allowed, it is said to be serviced.
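On ARM the mask bits live in the cpsr: bit 7 (I) masks IRQ and bit 6 (F) masks FIQ, and a set bit means the interrupt is disabled. The sketch below models this on a plain 32-bit value; the function names are hypothetical, and real code would read and write the cpsr with MRS and MSR rather than a C variable.

```c
#include <assert.h>
#include <stdint.h>

#define CPSR_IRQ_MASK (1u << 7) /* I bit: 1 = IRQ disabled */
#define CPSR_FIQ_MASK (1u << 6) /* F bit: 1 = FIQ disabled */

/* Return a copy of cpsr with IRQ masked (disabled). */
uint32_t cpsr_disable_irq(uint32_t cpsr) { return cpsr | CPSR_IRQ_MASK; }

/* Return a copy of cpsr with IRQ unmasked (enabled). */
uint32_t cpsr_enable_irq(uint32_t cpsr) { return cpsr & ~CPSR_IRQ_MASK; }

/* Model of the core's decision: a pending IRQ is serviced only if the I bit is clear. */
int irq_is_serviced(uint32_t cpsr, int irq_pending)
{
    assert(irq_pending == 0 || irq_pending == 1);
    return irq_pending && (cpsr & CPSR_IRQ_MASK) == 0u;
}
```

For example, starting from 0x10 (user mode, interrupts enabled), disabling IRQ gives 0x90, and a pending IRQ is then no longer serviced.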
Condition Flags
Several condition flags are affected, directly or indirectly, by
instructions. Most of these flags are set by ALU operations. Conditional
execution uses these flags to determine whether the ARM processor executes a
specific instruction or not.
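As an illustration, the sketch below computes the N, Z, C and V flags the way a flag-setting subtraction (SUBS or CMP) does, and then evaluates a few condition codes against them. The type and function names are hypothetical, and only six of the ARM condition codes are modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned n, z, c, v;   /* negative, zero, carry, overflow */
} flags_t;

/* Flags as SUBS would set them for a - b; on ARM, C = 1 means "no borrow". */
flags_t flags_after_subs(uint32_t a, uint32_t b)
{
    uint32_t result = a - b;
    flags_t f;
    f.n = result >> 31;
    f.z = (result == 0u);
    f.c = (a >= b);
    f.v = ((a ^ b) & (a ^ result)) >> 31;  /* signed overflow */
    return f;
}

typedef enum { COND_EQ, COND_NE, COND_CS, COND_CC, COND_GE, COND_LT } cond_t;

/* Returns 1 if an instruction carrying this condition would execute. */
int condition_passed(flags_t f, cond_t cond)
{
    switch (cond) {
    case COND_EQ: return f.z == 1u;
    case COND_NE: return f.z == 0u;
    case COND_CS: return f.c == 1u;
    case COND_CC: return f.c == 0u;
    case COND_GE: return f.n == f.v;
    case COND_LT: return f.n != f.v;
    }
    assert(0 && "unknown condition");
    return 0;
}
```

After comparing 1 with 1, Z and C are set, so an EQ instruction executes and an NE instruction is skipped.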
The 3-stage pipeline follows the Fetch-Decode-Execute model as shown. Fetch
loads an instruction from memory. Decode identifies the instruction. Execute
processes the instruction and writes the result back to a register.
Example pipeline sequence:
5 stage pipeline:
7 stage pipeline:
It can be observed that MSR, an instruction that moves the contents of a general-purpose
register to the status register, is used here to enable interrupts.
The interrupt is taken only after MSR has completed the Execute
stage.
It can be observed that the program counter points ahead of the executing instruction;
in ARM state it holds the address of the executing instruction plus 8 bytes (two
instructions). Execution of a branch instruction causes the ARM core
to flush its pipeline. Cores with branch prediction reduce this cost by loading the
new branch address prior to the execution of the instruction. An instruction in the
execute stage will complete even though an interrupt has been raised.
Exceptions and Interrupts
Exceptions and interrupts are unexpected events that disrupt the normal flow of
instruction execution. An exception is an unexpected event from within the
processor. An interrupt is an unexpected event from outside the processor. When an
exception or interrupt occurs, the hardware begins executing code that performs an
action in response to the exception. This action may involve killing a process,
outputting an error message, communicating with an external device, or horribly
crashing the entire computer system by initiating a "Blue Screen of Death" and
halting the CPU. The instructions responsible for this action reside in the operating
system kernel, and the code that performs this action is called the interrupt handler
code. You can think of handler code as an operating system subroutine. After the
handler code is executed, it may be possible to continue execution after the
instruction where the exception or interrupt occurred.
When an exception or interrupt occurs, the processor sets the pc to a specific address
in the vector table.
ARM Extensions
ARM processors are given extensions to improve performance, add functionality,
and provide flexibility. Three ARM extensions exist: (i) cache and tightly
coupled memory, (ii) memory management, and (iii) coprocessors.
Unified cache management using Von Neumann architecture:
In this cache management technique, a single cache is shared by data and instructions.
Hence it is said to follow the Von Neumann architecture, and a single bus is provided to
access the cache for data or instructions.
In the Harvard-style arrangement, tightly coupled memory (TCM) is additionally
provided: fast on-chip memory located close to the core, giving predictable access
times. This tightly coupled memory is provided separately for data and instructions.
There are also separate data and instruction caches for high speed access.
Coprocessors
• A coprocessor is often referred to as a math processor. Because the coprocessor
performs routine mathematical tasks, the core processor is freed from this
computation and its time is saved. By taking specialized processing tasks from the
core CPU, the coprocessor reduces the load on the main microprocessor, so that it
can run at a greater speed.
• A coprocessor can perform special tasks like complex mathematical calculations
or graphical display processing. It performs such jobs faster than the core CPU. As
a result, the overall speed of the system increases.
• Coprocessors can be attached to an ARM processor. A coprocessor extends the
processing power of the core by expanding the instruction set or by adding configurable
registers. The coprocessor interface permits more than one
coprocessor to be connected to the ARM CPU.
Microcontrollers and Embedded Systems – Module 2
ARM Instructions:
All ARM instructions are 32 bits long. Different ARM architecture revisions support
different instruction sets, which express the capabilities of the ARM processor.
Instruction examples are written in the following format:
PRE <preconditions>
INSTRUCTION(s)
POST <postconditions>
ARM is a load-store architecture: data-processing instructions operate only on
registers, and only load and store instructions access memory. In register operands,
the destination register is usually written first.
MOV Instruction
It is the simplest ARM instruction, used to set initial values and transfer data
between registers.
Syntax:
<instruction>{<cond>}{S} Rd, N
In this syntax, the condition and S are optional. Rd is the destination register, and N can
be a register or an immediate value, optionally preprocessed by the barrel shifter.
In addition to MOV, there is also MVN, which moves the bitwise NOT (complement) of the
operand into the destination register.
MOV example:
Role of Barrel Shifter:
The barrel shifter preprocesses the data in a register before it is used. As
shown in the figure, data in the register can be passed through the barrel shifter and
then sent to the ALU. This unique and powerful feature of ARM shifts data left or right
by a specified number of positions before it is used in an instruction. However, the
instructions MUL, CLZ, and QADD do not use the barrel shifter.
For right shifts, either LSR or ASR can be used. LSR is used when sign
extension is not required, whereas ASR extends the sign bit after shifting. Further, it is
noteworthy that LSL by one position is equivalent to multiplying by 2, and LSR by one
position is equivalent to dividing an unsigned value by 2. Rotate right (ROR) preserves
the bits shifted out at the right by reinserting them at the left.
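The four basic shift operations can be modelled in C as follows. This is a sketch on 32-bit unsigned values with shift amounts restricted to 0..31 (the hardware also accepts other amounts, which are not modelled); the function names simply mirror the ARM mnemonics.

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift left: zeros enter at the right. */
uint32_t lsl(uint32_t x, unsigned n) { assert(n < 32u); return x << n; }

/* Logical shift right: zeros enter at the left. */
uint32_t lsr(uint32_t x, unsigned n) { assert(n < 32u); return x >> n; }

/* Arithmetic shift right: copies of the sign bit fill the vacated positions. */
uint32_t asr(uint32_t x, unsigned n)
{
    assert(n < 32u);
    uint32_t shifted = x >> n;
    if ((x & 0x80000000u) != 0u && n > 0u) {
        shifted |= ~(0xFFFFFFFFu >> n);   /* set the top n bits */
    }
    return shifted;
}

/* Rotate right: bits shifted out at the right re-enter at the left. */
uint32_t ror(uint32_t x, unsigned n)
{
    assert(n < 32u);
    return (n == 0u) ? x : ((x >> n) | (x << (32u - n)));
}
```

For the negative pattern 0x80000000, LSR by 4 gives 0x08000000 while ASR by 4 gives 0xF8000000, which is the difference sign extension makes.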
Working of RRX:
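RRX (rotate right extended) shifts the register right by one bit. The carry flag is
shifted into the msb (bit 31), whether it is 0 or 1, and the old lsb (b0) is shifted
out; it becomes the new carry flag if the instruction has the S suffix.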
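A minimal C model of RRX makes the 33-bit rotation through the carry flag explicit. The function name and the explicit carry parameters are illustrative; on the processor the carry lives in the cpsr and is only written back when the S suffix is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RRX: shift right by one; the old carry enters bit 31, the old bit 0 is the carry out. */
uint32_t rrx(uint32_t x, unsigned carry_in, unsigned *carry_out)
{
    assert(carry_in <= 1u);
    assert(carry_out != NULL);
    *carry_out = x & 1u;
    return (x >> 1) | ((uint32_t)carry_in << 31);
}
```

With x = 0x00000003 and the carry set, the result is 0x80000001 and the carry out is 1.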
Eg:
Impact of Barrel Shifter on Condition flags:
It can be observed that the carry flag is affected when a left shift is done. The
condition flags are updated only when the MOV instruction carries the S suffix.
Arithmetic Instructions:
A condition can be specified on arithmetic instructions such as ADD and SUB, and the
S suffix can be given to update the cpsr. An arithmetic instruction computes:
Rd = Rn operator N
where N can be an immediate value, a register, or a register value scaled by the
barrel shifter.
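Eg:
(i) SUB r0, r1, r2 with r1=2, r2=1
Tracing:
r0=r1-r2=2-1=1
(ii) RSB r0, r1, #0 with r1=0x00000077
Tracing:
r0=0-r1=0-0x00000077 = 0xffffff89 (two's complement negation)
(iii) SUBS r1, r1, #1 with r1=1
Tracing:
r1=r1-1=1-1=0 (the Z and C flags are set)
(iv) ADD r0, r1, r1, LSL #1 with r1=5
Tracing:
r0=r1+(r1<<1)=5+(5<<1)=5+10=15=0xf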
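The traces above can be checked with ordinary 32-bit unsigned arithmetic in C, which wraps around exactly like the ARM registers do. The sketch assumes the traced instructions are a subtract, a reverse subtract and an add with a shifted second operand; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rd = Rn - N */
uint32_t arm_sub(uint32_t rn, uint32_t n) { return rn - n; }

/* Rd = N - Rn (reverse subtract) */
uint32_t arm_rsb(uint32_t rn, uint32_t n) { return n - rn; }

/* Rd = Rn + (Rm << shift): add with a barrel-shifted operand */
uint32_t arm_add_lsl(uint32_t rn, uint32_t rm, unsigned shift)
{
    assert(shift < 32u);
    return rn + (rm << shift);
}
```

The last function also shows a common idiom: adding a register to itself shifted left by one multiplies it by 3 in a single instruction.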
Logical Instructions:
ARM logical operations include AND, ORR (OR), EOR (XOR), and BIC (bit clear).
These each operate bitwise on two sources and write the result to a destination
register. The first source is always a register and the second source is either an
immediate or another register.
(ii) BIC r0, r1, r2 with r1=0b1111, r2=0b0101
Tracing: r0=r1 & ~r2= 1111 & ~(0101)= 1111 & 1010 = 1010
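The four logical operations map directly onto C's bitwise operators. The sketch below is a small model with hypothetical names; it is not how the instructions are encoded, only what they compute.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { OP_AND, OP_ORR, OP_EOR, OP_BIC } logic_op_t;

/* Rd = Rn <op> N for the four ARM logical instructions. */
uint32_t arm_logic(logic_op_t op, uint32_t rn, uint32_t n)
{
    switch (op) {
    case OP_AND: return rn & n;
    case OP_ORR: return rn | n;
    case OP_EOR: return rn ^ n;
    case OP_BIC: return rn & ~n;   /* clear the bits that are set in n */
    }
    assert(0 && "unknown operation");
    return 0u;
}
```

BIC is convenient for clearing status or mask bits: the second operand names the bits to clear, so no inverted constant has to be built first.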
Comparison Instructions:
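These instructions (CMP, CMN, TEQ and TST) compare or test the contents of a register
against a 32-bit value. They always update the condition flags in the CPSR, so no S
suffix is needed, and they do not write a result to a register.
Syntax:
<instruction>{<cond>} Rn, N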
Multiplication Instructions:
These instructions compute the product of two registers. A multiply can be
combined with the addition of an accumulator (MLA). Further, the long multiply
instructions come in separate signed and unsigned forms. Multiplying two 32-bit
numbers can produce a 64-bit product, which is stored in a low and a high register.
Syntax:
MLA{<cond>}{S} Rd, Rm, Rs, Rn
MUL{<cond>}{S} Rd, Rm, Rs
Eg:
(i) MUL r0, r1, r2 with r1=2, r2=2
Tracing: r0=r1*r2=2*2=4
(ii) UMULL r0, r1, r2, r3 with r2=0xf0000002, r3=0x00000002
Tracing:
[r0,r1]=r2*r3=0xf0000002 * 0x00000002 = 0x00000001e0000004
r0=low 32 bits=0xe0000004
r1=high 32 bits=0x00000001
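The long multiply can be reproduced in C with a 64-bit intermediate. The sketch below models an unsigned long multiply whose result is split into a low and a high word; the function name and parameter order are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned 32 x 32 -> 64-bit product, split into low and high words. */
void arm_umull(uint32_t rm, uint32_t rs, uint32_t *rd_lo, uint32_t *rd_hi)
{
    assert(rd_lo != NULL && rd_hi != NULL);
    uint64_t product = (uint64_t)rm * (uint64_t)rs;
    *rd_lo = (uint32_t)(product & 0xFFFFFFFFu);
    *rd_hi = (uint32_t)(product >> 32);
}
```

Multiplying 0xf0000002 by 2 yields 0xe0000004 in the low word and 1 in the high word, matching the trace.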
Branch Instructions:
Branch instructions are used in ARM to branch to labels, which alters the sequential
execution of a program. Branches can be unconditional or conditional (based on
the value of the condition flags in the CPSR). Further, one can branch to instructions
ahead (forward branch) or to instructions before (backward branch).
Single-register loads are LDRB (load byte), LDRSB (load signed byte), LDRH
(load halfword, 16 bits), LDRSH (load signed halfword), and LDR (load word, 32 bits).
Single-register stores are STRB (store byte), STRH (store halfword, 16 bits), and
STR (store word, 32 bits). There is no signed store, because sign extension only
matters when a value is widened on loading.
Eg:
(i) Preindex with write back:
(ii) Preindex only
Tracing:
Before LDMIA
Similarly, LDMDB and LDMDA work by decrementing addresses and move towards the
lower addresses of memory.
Block memory copy:
The load-store multiple instructions can be used to copy one block of
memory to another. An ALP for block memory copy is:
loop:
LDMIA r9!, {r0-r7} ; load 32 bytes from the source and update r9
STMIA r10!, {r0-r7} ; store 32 bytes to the destination and update r10
CMP r9, r11 ; has the end of the source been reached?
BNE loop
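The same loop can be written in C to make the pointer updates explicit. This is a sketch, not compiler output: src, dst and src_end play the roles of r9, r10 and r11, and the block length is assumed to be a nonzero multiple of eight words, as the ALP also requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 8u   /* r0-r7: eight registers per load/store multiple */

/* Copy the words in [src, src_end) to dst, eight words per iteration. */
void block_copy(const uint32_t *src, const uint32_t *src_end, uint32_t *dst)
{
    assert(src != NULL && dst != NULL && src_end > src);
    assert(((size_t)(src_end - src) % WORDS_PER_BLOCK) == 0u);
    do {
        for (unsigned i = 0u; i < WORDS_PER_BLOCK; i++) {
            dst[i] = src[i];
        }
        src += WORDS_PER_BLOCK;   /* source pointer advances (write-back on the load)   */
        dst += WORDS_PER_BLOCK;   /* destination must advance too (write-back on store) */
    } while (src != src_end);     /* CMP r9, r11 ; BNE loop */
}
```

Both pointers have to advance on every iteration; if the destination stayed fixed, each block would overwrite the previous one.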
The corresponding memory map can be depicted as:
Stacks:
Stacks are data structures in memory that are operated in Last-In-First-Out
(LIFO) order. Stacks support PUSH (writing to the stack, a store, using
STM) and POP (reading from the stack, a load, using LDM) operations. Stacks are operated
through the stack pointer (sp). The stack base is a pointer to the starting address
of the stack. The stack pointer points to the top of the stack. The stack limit is the
maximum size of the stack, beyond which overflow occurs.
A stack is called a full stack if sp points to the last inserted location and
an empty stack if sp points to the next available location. A stack is called a
descending stack if it grows from higher memory addresses to lower memory addresses
and an ascending stack if it grows from lower memory addresses to higher memory
addresses. Hence there are four combinations of stacks: full descending (FD), full
ascending (FA), empty descending (ED), and empty ascending (EA).
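A full descending stack, the most commonly used of these combinations on ARM, can be sketched in C as below. The structure, its size and the function names are hypothetical; the point is the order of operations: push decrements sp and then stores, pop loads and then increments sp.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16u

typedef struct {
    uint32_t mem[STACK_WORDS];
    uint32_t *sp;   /* full stack: points at the last pushed word */
} fd_stack_t;

/* Start with sp at the stack base, one past the highest word. */
void fd_init(fd_stack_t *s)
{
    s->sp = s->mem + STACK_WORDS;
}

/* Push: decrement first, then store (descending, full). */
void fd_push(fd_stack_t *s, uint32_t value)
{
    assert(s->sp > s->mem);                 /* stack limit: refuse to overflow */
    s->sp--;
    *s->sp = value;
}

/* Pop: load first, then increment. */
uint32_t fd_pop(fd_stack_t *s)
{
    assert(s->sp < s->mem + STACK_WORDS);   /* refuse to pop an empty stack */
    return *s->sp++;
}
```

An empty descending stack would swap the two steps in each function: store then decrement on push, increment then load on pop.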
Addressing modes for Stack operations:
Eg:
(i) Descending Full Stacks
Eg:
Software Interrupts
They are calls to the operating system for specific tasks. Each task is identified by
an SWI number.
Format:
SWI{<cond>} SWI_number
Eg:
MSR copies the contents of a general-purpose register to the status register. MRS
operates in the reverse direction. Further, the specific fields of the status register
to be written can be named in the MSR instruction. In user mode only the flag field
can be modified; changing the control field (mode and interrupt mask bits) requires a
privileged mode.
Eg:
There is a trade-off between optimizing code and keeping it readable. Hence one
should optimize the functions that are frequently executed and the functions that are
performance critical. It is necessary to document non-obvious optimizations.
C compilers try to generate efficient code for the ARM architecture, yet they must be
conservative in what they assume about the program.
Eg:
void memclr(char *data, int N)
{
    for (; N > 0; N--)
    {
        *data = 0;
        data++;
    }
}
This program clears N bytes of data. However, the compiler does not know:
(i) whether N can be 0 on entry; if N were known to be nonzero, the first loop test
could be skipped
(ii) whether the data pointer is 4-byte aligned; if it were, four bytes could be cleared
at a time with a word store
(iii) whether N is a multiple of 4, so that whole words could be cleared throughout
Prior to ARMv4, only byte and word loads and stores were supported.
ARMv4 added signed and unsigned byte and halfword loads and stores. ARMv5
also supports doubleword (64-bit) loads and stores.
It is recommended to use int and long (32-bit words) in C programs, as they
match the ARM architecture, which has 32-bit registers and uses 32-bit memory accesses.
Eg:
Consider a C program that computes a checksum, defined here as the sum of 64 data
elements.
int checksum_v1(int *data)
{
    char i;
    int sum = 0;
    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
Because i is declared as char, the compiler adds an AND instruction after each
increment to make sure i stays within the range of the char type. This extra AND in
every iteration can be avoided by changing the type of i to unsigned int.
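A minimal sketch of the corrected function is shown below. The name checksum_v2 and the const qualifier are assumptions made for illustration; the only change that matters is the type of the loop counter.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of 64 data elements; the counter is a full-width unsigned int,
 * so no masking of i is needed after the increment. */
int checksum_v2(const int *data)
{
    unsigned int i;
    int sum = 0;

    assert(data != NULL);
    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
```

Since i now has the same width as an ARM register, the increment needs no extra instruction to keep it in range.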
In this program, a cast to short is required, as the addition is performed in 32 bits
while sum and the data elements are 16 bits. Since the loop iterates 64 times, the cast
is performed 64 times, which is costly.
This can be seen in the following converted ALP:
checksum_v3
MOV r2, r0 ; r2 = data
MOV r0, #0 ; sum = 0
MOV r1, #0 ; i = 0
checksum_v3_loop
ADD r3, r2, r1, LSL #1 ; r3 = &data[i]
LDRH r3, [r3, #0] ; r3 = data[i]
ADD r1, r1, #1 ; i++
CMP r1, #0x40 ; compare i with 64
ADD r0, r3, r0 ; r0 = sum + data[i]
MOV r0, r0, LSL #16
MOV r0, r0, ASR #16 ; sum = (short)r0
BCC checksum_v3_loop ; loop while i < 64
MOV pc, r14 ; return sum
It can be observed that in the ALP, the load address is first computed and then the
load is done with LDRH. Separate steps are necessary because LDRH does not support
a barrel-shifted register offset. Further, in each iteration of the loop, sum is cast
to short by shifting the lower halfword left into the upper halfword and then shifting
it right again with sign extension.
This costly address computation and per-iteration casting can be avoided by accessing
the data through an incrementing pointer rather than an array index, and by keeping
sum in an int. The next version of checksum illustrates this:
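A minimal sketch of such a pointer-based version follows. It is modelled on the other checksum functions in these notes; the name checksum_v4 and the const qualifier are assumptions, since the original listing is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of 64 halfword elements, accumulated in a 32-bit int.
 * The data is walked with an incrementing pointer, and the result is
 * narrowed to short only once, on return. */
short checksum_v4(const short *data)
{
    unsigned int i;
    int sum = 0;

    assert(data != NULL);
    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return (short)sum;
}
```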
Hence two improvements can be pointed out. The separate steps of address computation
and loading are combined into a single load with post-indexing. Further, the cast is
done only on the return value and not in each iteration. Thus, the left and right
shifts are performed only once.
Functions:
Functions are written in C to modularize code. A function is written once
and can be called many times. Every function has a name, a return type, and arguments.
A function returns a value of its declared return type. The function being called is
known as the callee, and the function that invokes it is known as the caller.
Eg:
Function arguments are said to be passed wide if they are not reduced to the range of
their type; otherwise they are passed narrow. The following rules apply to widening and
narrowing:
(i) If the compiler passes arguments wide, the callee must reduce them
(ii) If the compiler passes arguments narrow, the caller must reduce them
(iii) If the compiler returns values wide, the caller must reduce them
(iv) If the compiler returns values narrow, the callee must reduce them
The generated code has shifting instructions (LSL and ASR) because of these range
reductions.
Here, division by 2 can be implemented with ASR. However, for negative numbers 1
must be added before shifting, because C division rounds towards zero. The compiler
effectively evaluates (x<0) ? ((x+1)>>1) : (x>>1), as the following ALP for signed
numbers shows:
average_v1
ADD r0,r0,r1
ADD r0,r0,r0, LSR #31
MOV r0, r0, ASR #1
MOV pc, r14
Thus the sign bit is brought to the lsb by a logical shift right of 31 positions and
added to the number before the division is performed. If unsigned numbers were used,
no such correction would be required.
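The effect of the two instructions can be checked with a small C model. The register is held as an unsigned 32-bit two's complement pattern so that the shifts are fully defined in C; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: ADD r0, r0, r0, LSR #31  followed by  MOV r0, r0, ASR #1. */
uint32_t div2_register(uint32_t r0)
{
    r0 = r0 + (r0 >> 31);                   /* add the sign bit (0 or 1)    */
    return (r0 >> 1) | (r0 & 0x80000000u);  /* ASR #1: shift, keep the sign */
}

/* Self-check: the result matches C's x / 2, which rounds towards zero. */
void div2_register_check(void)
{
    assert(div2_register((uint32_t)-3) == (uint32_t)(-3 / 2));
    assert(div2_register((uint32_t)-4) == (uint32_t)(-4 / 2));
    assert(div2_register(7u) == 3u);
}
```

Without the added sign bit, an arithmetic shift alone would round -3 down to -2 instead of giving the -1 that C requires.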
Loops:
Loops are the mechanism used in C to repeat a sequence of statements.
Several looping constructs exist: for, while, and do-while. However, loops must be
coded efficiently in C to suit the ARM architecture.
Two cases shall be considered:
(i) Loops with a fixed number of iterations
(ii) Loops with a variable number of iterations
The generated ALP needs 3 instructions for looping: ADD, CMP and BCC:
checksum_v5
MOV r2,r0; r2->data
MOV r0,#0; r0->sum
MOV r1, #0; r1->i
checksum_v5_loop
LDR r3,[r2],#4
ADD r1,r1,#1 ; i++
CMP r1, #0x40; i<64
ADD r0,r3,r0; sum+=data[i]
BCC checksum_v5_loop
MOV pc, r14 ;return sum
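If the loop instead counts down to zero, only 2 instructions are needed for looping,
SUBS and BNE, because SUBS sets the condition flags itself:
checksum_v6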
MOV r2,r0; r2->data
MOV r0,#0; r0->sum
MOV r1, #0x40;r1->i
checksum_v6_loop
LDR r3,[r2],#4
SUBS r1,r1,#1 ; i--
ADD r0,r3,r0; sum+=data[i]
BNE checksum_v6_loop
MOV pc, r14 ;return sum
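For a loop with a variable number of iterations N, a for loop makes the compiler test
whether N is zero before the first iteration. If N is known to be nonzero, this
unnecessary check can be avoided by using a do-while loop instead of a for loop:
int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;
    do
    {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}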
Loop Unrolling:
Every loop incurs an overhead. In the previous example, even with a down-counting
loop, every iteration uses 2 loop instructions: SUBS and BNE. SUBS takes
1 cycle and BNE takes 3 cycles. Hence each iteration has an
overhead of 4 cycles, and a loop with N iterations has 4N
cycles of overhead. This can be reduced if the loop body is written out several times
and the counter is decremented by the same factor in each iteration.
Eg:
int checksum_v9(int *data, unsigned int N)
{
    int sum = 0;
    do
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);
    return sum;
}
In this example, the loop has been unrolled 4 times by writing the loop body
4 times, and N is consequently reduced by 4 in each iteration rather than by one. Each
loop iteration still costs 4 cycles of overhead (SUBS-1, BNE-3), but there are only
N/4 iterations. Hence the loop overhead is reduced to 4*N/4 = N cycles, a fourfold
reduction. Note that this version assumes N is a nonzero multiple of 4.
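When N is not guaranteed to be a multiple of 4, the unrolled loop can be followed by a short clean-up loop for the remaining elements. The sketch below is one way to do this; the function name is hypothetical and the listing is not taken from the notes.

```c
#include <assert.h>
#include <stddef.h>

/* Sum N ints: four elements per iteration, then the N % 4 leftovers. */
int checksum_unrolled(const int *data, unsigned int N)
{
    int sum = 0;
    unsigned int i;

    assert(data != NULL || N == 0u);
    /* Unrolled part: N / 4 iterations, loop overhead paid once per four elements. */
    for (i = N / 4u; i != 0u; i--) {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    /* Remainder: at most three elements, one at a time. */
    for (i = N & 3u; i != 0u; i--) {
        sum += *(data++);
    }
    return sum;
}
```

The price of handling arbitrary N is larger code and a second loop test, so the plain unrolled form remains preferable when the caller can guarantee a multiple of 4.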
Register Allocation:
If more variables exist than available registers, some variables are stored on the
process stack and are called spilled variables. For an efficient implementation, the
number of spilled variables should be minimized. Compilers try to keep the most
important and most frequently accessed variables in registers. Programmers are advised
to limit the variables used in inner loops to 12 (even though in theory 14 registers
are available). The compiler decides which variables to spill on the basis of frequency
of use. C has the register keyword, which hints to the compiler that a variable should
be kept in a register rather than on the stack. However, whether the hint is honored
is up to the compiler.
Function calls:
❑ APCS – ARM Procedure Call Standard – a standard for passing arguments and return
values
❑ When Thumb code is included, the APCS is extended as the ATPCS – ARM-Thumb
Procedure Call Standard
❑ As per the APCS, the first 4 function arguments are passed in registers (r0 to r3)
and the rest of the arguments are placed on the stack. Hence, for C functions with
more than 4 arguments, the stack is accessed by both caller and callee
❑ To optimize, pack related arguments into a C structure and pass a pointer to it,
so that only registers are used.
Consider the following example, which motivates grouping function arguments into a
structure:
char *queue_bytes_v1(char *Qstart, char *Qend, char *Qptr, char *data, unsigned int N)
{
    do {
        *(Qptr++) = *(data++);
        if (Qptr == Qend)
            Qptr = Qstart;
    } while (--N);
    return Qptr;
}
In this C program, N bytes are copied from the location data to a destination queue,
which is managed by the start pointer Qstart, the end pointer Qend, and the current
pointer Qptr. The queue is modelled as a circular queue: when the end is reached,
insertion continues from the beginning.
However, this function takes 5 arguments and hence needs the stack. This can be
avoided if the queue pointers are packed inside the following structure and a pointer
to the structure is passed as the function argument:
typedef struct
{
    char *qptr;
    char *qstart;
    char *qend;
} Queue;
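With the structure in place, the function needs only three arguments, all of which fit in registers. The sketch below shows one possible rewrite; the name queue_bytes_v2 and the decision to update the queue in place instead of returning the pointer are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char *qptr;     /* current insertion point        */
    char *qstart;   /* first byte of the queue buffer */
    char *qend;     /* one past the last byte         */
} Queue;

/* Copy N (> 0) bytes from data into the circular queue. */
void queue_bytes_v2(Queue *queue, const char *data, unsigned int N)
{
    char *qptr = queue->qptr;   /* keep hot values in locals (registers) */
    char *qend = queue->qend;

    assert(N > 0u);             /* the do-while copies at least one byte */
    do {
        *(qptr++) = *(data++);
        if (qptr == qend) {
            qptr = queue->qstart;   /* wrap around to the beginning */
        }
    } while (--N);
    queue->qptr = qptr;         /* write the updated pointer back once */
}
```

Copying qptr and qend into local variables before the loop also keeps the compiler from reloading them from the structure on every iteration.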
Pointer Aliasing:
Two pointers are said to be aliases if they point to the same address. Writing through
one affects the value read through the other. The problem with pointer aliasing in C is
that the compiler must take a pessimistic view: it assumes that any two pointers may be
aliased, which results in repeated loads of the same memory locations in the resulting
ALP.
Eg: Consider the following C code that uses pointers as function arguments:
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}
Here, the compiler must assume that timer1 and step may be aliases. Since the write
to *timer1 could change *step, *step is loaded again before it is added to *timer2.
To avoid the reload, the value of step is first copied into a local variable.
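A minimal sketch of the version with a local copy is shown below; the name timers_v2 and the const qualifier are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Add *step to both timers, loading *step from memory only once. */
void timers_v2(int *timer1, int *timer2, const int *step)
{
    int local_step;

    assert(timer1 != NULL && timer2 != NULL && step != NULL);
    local_step = *step;      /* single load; a local cannot be aliased */
    *timer1 += local_step;
    *timer2 += local_step;
}
```

Note that the two versions only behave identically when the pointers really are distinct: if timer1 and step point to the same int, this version adds the original step to timer2, whereas the reloading version would add the updated value.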
Here, the compiler assumes that the pointers data and &N may be aliases and reloads N
again and again. Hence it is recommended never to pass the address of a local variable
to a function; instead, make a copy and pass the address of the copy.