Computer Architecture: Edited by Galatro Giovanni
Index
Recapitulation
    Monocycle datapath
Pipeline
    Datapath
Out-of-Order Execution
    Strategies
    Differences
    Branches
    Cache
    MMU
ARM
    RISC
    Instructions
    I/O
Recapitulation
Monocycle datapath:
Multicycle datapath:
Fig 2. Complete finite state machine for the multicycle datapath
Pipeline
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The processor starts the execution of one operation before the previous one has finished, so multiple instructions are in execution during the same clock cycle.
The pipelining paradox is that the time to execute a single operation does not decrease (it is probably even longer, due to the larger inter-stage buffers compared to the multicycle processor). What pipelining increases is throughput. It is important to notice that in a pipeline, in every clock cycle one instruction starts and another ends.
Pipelining violates the principle of the Von Neumann model (one execution starts only when the previous one ends), because an operation starts before the effects produced by the previous one are visible.
Keep in mind that a register can be written and read in the same clock period, e.g. read on the falling edge and written on the rising one (master-slave).
Datapath
With pipelining, all instructions have the same length. This "restriction" makes it much easier to fetch and decode instructions.
Between two phases there is an inter-stage buffer, in which the results of the previous phase are stored so that the next one can use them. The value of the PC is also passed through the phases and stored in the buffers in order to manage branches (if there is a branch, the offset is given relative to the current PC, so the value of the PC must be kept until the branch operation is over).
Control signals are also stored in the inter-stage buffers and passed through the phases. In fact, the Control Unit is like the monocycle CU (a simple combinational machine). The key point is that the computation of the control signals is separated from their use: in one cycle (during Decode), the CU generates all the signals for that instruction.
When using a pipeline, a RISC machine works better than a CISC one, because RISC operations are similar to each other and little information has to be passed through the buffers.
The clock period is determined by the longest phase, usually the Execute. In order to reduce the clock period, we can use a pipeline with more stages, but this requires more hardware.
Branch Hazard
A branch followed by other operations can be a problem, because those instructions start before the branch has been computed. One possible solution is to prevent the following instructions from producing effects by forcing the control signals of the inter-stage buffers to 0 (so the buffers are not updated). This situation is also called a Stall.
A variant uses a Hazard Detection Unit: in the Decode phase, extra hardware evaluates the condition before the Execute phase, so the PC is updated during the second stage of the pipeline.
If we cannot resolve a branch in the second stage (e.g. in longer pipelines), stalling on branches causes a larger slowdown (wasted clock cycles). The cost of this option is too high for most computers, so we can use two other solutions: Pre-Fetch or Prediction.
The Pre-Fetch is a replication of the Fetch phase. If there is a branch, it is evaluated and the PC refreshed. This solution is costly in terms of hardware but allows the inter-stage buffers to shrink (there is no need to pass the PC through the phases).
Prediction is one of the most widely used solutions. It consists of predicting branches based on the principle of locality (if a branch is taken once, it will probably be taken several more times, e.g. in a for loop). This solution reduces the penalty of branches and is based on the assumption: if a branch was taken the previous time, it will be taken again.
This is not a deterministic technique; it is a bet. There are two ways of predicting: the Branch Prediction Buffer and the Branch History Table.
The first time a branch is found, it is executed. If it turns out not to be taken, CPU cycles have been wasted. If the branch is taken, the result is stored in the buffer. The next time the same branch is found (identified by its PC), a 1 is read in the buffer and the branch is taken again. If the branch indeed has to be taken, CPU cycles are saved; if not, CPU cycles are wasted and the normal flow is restored. With this solution, time is wasted only for the first and the last occurrence of the branch.
The two-bit scheme is used for nested loops, where the inner loop is executed n times (branch taken n times), then the same branch is not taken once (exit from the inner loop), and then it is taken again for another n times. So the previous two outcomes are stored: if at least one of the two is "taken", the branch is predicted taken. In this way, time is saved at each "first iteration" of the inner loop.
Fig 3: Two-bit scheme, finite state machine
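To make the scheme concrete, here is a minimal C sketch of a prediction buffer using two-bit saturating counters, one common form of the two-bit scheme shown in the figure; the table size, the indexing by the low PC bits and the names are my assumptions, not part of the notes.

#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counters: 0,1 predict "not taken", 2,3 predict "taken".
   The buffer is indexed with the low-order bits of the branch PC. */
#define BPB_ENTRIES 1024
static uint8_t bpb[BPB_ENTRIES];

static bool predict_taken(uint32_t pc) {
    return bpb[(pc >> 2) % BPB_ENTRIES] >= 2;
}

/* After the branch is resolved, move the counter one step towards the
   real outcome: it takes two consecutive mispredictions to change the
   prediction, which is why nested loops behave well. */
static void train(uint32_t pc, bool taken) {
    uint8_t *c = &bpb[(pc >> 2) % BPB_ENTRIES];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}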
In order to decide which solution is the best, we should evaluate the performance of each case of interest. Moreover, we should find a trade-off between performance and cost.
Stall implementation
To flush instructions in the IF stage, we add a control line that zeros the instruction field of the IF/ID pipeline register. Clearing the register transforms the fetched instruction into a nop (no action, the program remains in the same state). This action is performed by the Hazard Detection Unit.
With this operation, the "bubble" passes through all the phases.
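As a sketch of this mechanism, the following C fragment models how clearing the IF/ID register on a flush injects a bubble (the struct and signal names are assumed):

#include <stdint.h>

typedef struct {
    uint32_t instruction;
    uint32_t pc;
} if_id_t;

/* Clock the IF/ID register: when the flush signal is asserted, the
   instruction field is zeroed, i.e. a nop (bubble) enters the pipeline. */
void clock_if_id(if_id_t *if_id, uint32_t fetched, uint32_t next_pc, int if_flush) {
    if_id->instruction = if_flush ? 0u : fetched;
    if_id->pc = next_pc;
}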
Data Hazard
If an operation needs a value from a previous operation that has not yet stored it, where can it find it? Is the datum stored somewhere in the datapath?
Without intervention, a data hazard can severely stall the pipeline (in this case, 3 bubbles).
We have to send the datum back into the pipeline. To do this, we need a Forwarding Unit and 3 MUXes.
The forwarding unit checks whether the destination register of an instruction is used as a source register by a subsequent instruction. In that case, the datum has to be forwarded, because it has not yet been written to the register file. When we move something from the right to the left of the pipeline, we are effectively jumping ahead in time.
The only cost of this solution is additional hardware (MUXes, larger inter-stage buffers and the FU).
Forwarding cannot prevent all pipeline stalls: if a load instruction is immediately followed by an instruction that uses its result, a stall of one clock cycle cannot be avoided. The Hazard Detection Unit manages this stall.
Fig 7: Forwarding unit and Hazard Detection Unit
The ALU inputs can now come from three sources: ID/EX, EX/MEM and MEM/WB.
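A hedged C sketch of the decision made by the forwarding unit for one ALU input follows (the register-field and signal names follow the usual textbook description and are assumptions):

/* Which source should feed the ALU input that reads register rs?
   0 = value read from the register file (ID/EX),
   1 = ALU result of the previous instruction (EX/MEM),
   2 = result of the instruction two steps back (MEM/WB).
   The most recent producer wins; register 0 is never forwarded. */
int forward_select(int rs,
                   int exmem_regwrite, int exmem_rd,
                   int memwb_regwrite, int memwb_rd) {
    if (exmem_regwrite && exmem_rd != 0 && exmem_rd == rs)
        return 1;                    /* EX hazard */
    if (memwb_regwrite && memwb_rd != 0 && memwb_rd == rs)
        return 2;                    /* MEM hazard */
    return 0;                        /* no forwarding needed */
}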
Final Datapath
Because different operations take different execution times (EX phase), we can think of using a pipeline inside the Execution Unit. A pipelined EX phase is useful because it lets us reduce the duration of the clock cycle.
The Reservation Shift Register (RSR) lets us hold the result of an execution until it can be written into the EX/MEM buffer, so that instructions do not overlap. With this solution, different operations take a different number of clock cycles to complete their execution, as we can see in figure 9.
There can be situations in which more than one operation is in the Exec phase, and it is not guaranteed that the output of the system is correct. If there are no dependencies, there is no problem. But what happens if an operation refers to the destination register of a previous one? We can execute other operations without data dependencies before that operation and save clock cycles; if waiting is mandatory, nop operations can be inserted (all these adjustments are made by the compiler).
Ex.:
divf
mulf
addf
add
If addf depends on mulf, we can execute add before it (if it has no data dependencies), otherwise we insert a nop.
Strategies
Now we have to manage bus contention and the order of operations. There are three strategies:
Ordered Execution
Reordering Buffer
History Buffer
Ordered execution
If an operation needs j CPI, it is inserted at position j of the RSR, provided the validity bit there is 0. When an operation is inserted at position j, the RSR sets the validity bit of all the positions before j, so that no other operation can be inserted ahead of it; this guarantees the execution order. In this way some clock cycles are wasted (if we consider the CPI in absolute terms), but if we want to respect the order of operations, we still save clock cycles with respect to a non-pipelined EXEC.
If an operation has to be inserted in the j-th row but the validity bit is already 1, it has to wait until RSR[j].V = 0. The waiting queue is managed in the Issue Register.
This solution is easy to implement but requires extra hardware for the Issue Register. Moreover, with this solution some operations have to wait for their turn.
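A rough C sketch of ordered issue into the RSR follows (the depth of the register and the entry layout are assumptions):

#include <stdbool.h>

#define RSR_LEN 8                     /* assumed: maximum latency in cycles */

typedef struct { bool valid; int dest_reg; } rsr_entry_t;
static rsr_entry_t rsr[RSR_LEN];

/* Issue an operation that needs 'latency' cycles: it must land in slot
   'latency'. Reserving every earlier slot prevents later, shorter
   operations from overtaking it, so results leave EX in program order.
   Returns false if the operation must wait in the Issue Register. */
bool issue_ordered(int latency, int dest_reg) {
    if (rsr[latency].valid)
        return false;
    for (int j = 0; j < latency; j++)
        rsr[j].valid = true;          /* block the earlier positions */
    rsr[latency].valid = true;
    rsr[latency].dest_reg = dest_reg;
    return true;
}

/* Every clock cycle the RSR shifts by one: slot 0 leaves the EX stage. */
void rsr_shift(void) {
    for (int j = 0; j < RSR_LEN - 1; j++)
        rsr[j] = rsr[j + 1];
    rsr[RSR_LEN - 1].valid = false;
}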
Reordering buffer
The ROB is a buffer used to store the results of operations. It is used in combination with the RSR.
It allows out-of-order (OOO) execution but guarantees ordered output. It is managed as a circular queue, with a pointer to the front and another to the back.
Now the RSR contains only the validity bit, the functional unit used and the position in the ROB. Differently from the previous solution, when an element is inserted at position j, the previous j-1 positions are not set to 1, so other operations can still be inserted. If RSR[j].V = 1, the operation is placed in the Issue Register.
When an instruction completes, its result is stored in the ROB and has to wait until it becomes the front of the queue before it can be passed on. When an element becomes the front of the queue and its C bit is 0, we have to wait for its completion.
If a value stored in the ROB is needed (e.g. because of data dependencies), it can be taken from the ROB and used by the Forwarding Unit.
In summary, the ROB allows OOO execution, but output order is guaranteed and the program state is updated sequentially. It only needs some extra hardware for the ROB itself.
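The following C sketch models the ROB as a circular queue with out-of-order write-back and in-order commit (sizes and field names are my assumptions):

#include <stdbool.h>
#include <stdint.h>

#define ROB_LEN 16

typedef struct {
    bool     busy;        /* slot allocated */
    bool     complete;    /* C bit: result available */
    int      dest_reg;    /* architectural destination register */
    uint32_t value;       /* result, once the FU has produced it */
} rob_entry_t;

static rob_entry_t rob[ROB_LEN];
static int rob_head = 0;   /* front: oldest instruction, next to commit */
static int rob_tail = 0;   /* back: where the next instruction is allocated */

/* Allocate a slot at issue time; returns the ROB index, or -1 if full. */
int rob_alloc(int dest_reg) {
    int next = (rob_tail + 1) % ROB_LEN;
    if (next == rob_head) return -1;          /* ROB full: stall issue */
    rob[rob_tail] = (rob_entry_t){ .busy = true, .complete = false,
                                   .dest_reg = dest_reg };
    int idx = rob_tail;
    rob_tail = next;
    return idx;
}

/* Functional units write their results out of order. */
void rob_writeback(int idx, uint32_t value) {
    rob[idx].value = value;
    rob[idx].complete = true;
}

/* Commit only from the front, and only if the C bit is set, so the
   register file (program state) is updated in program order. */
bool rob_commit(uint32_t regfile[32]) {
    if (rob_head == rob_tail || !rob[rob_head].complete)
        return false;                          /* front not ready: wait */
    regfile[rob[rob_head].dest_reg] = rob[rob_head].value;
    rob[rob_head].busy = false;
    rob_head = (rob_head + 1) % ROB_LEN;
    return true;
}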
History buffer
The History Buffer keeps track of the evolution of the machine state, in order to improve on the ROB's performance. The key point is to let operations update the registers out of order, while keeping enough information about the changes to restore the previous state if necessary. The HB is managed as a circular queue. Operations that need values not yet written (data dependencies) are inserted in the Issue Register.
The RSR changes again: now it holds a reference to the position in the HB and, again, the destination register.
Each HB entry contains:
- Rd: the destination register
- Old: the old value of the register
- C: completion bit
- Program Counter: unique identifier of the instruction
When an instruction completes, the result is immediately written into the register. When the entry arrives at the front of the HB with C = 1, it is deleted from the HB.
Since the output order guarantees coherence with respect to data dependencies, the only reasons for restoring a previous state are interrupts and wrong branch predictions.
This approach allows speculative execution and updates the program state out of order, but guarantees that the final result is the same. The disadvantage is that extra time is needed to manage interrupts and branches.
Differences
Branches
When some operations terminate before the branch condition is checked, there can be a problem. The common solution is the use of an Error Prediction bit (EPR) both in the ROB and in the HB.
In the ROB, if a wrong prediction was made, the operations after the branch are simply not written to the registers.
In the HB, a restore is done: the HB is blocked, the active operations are allowed to complete (we wait for this), then the HB elements are deleted from back to front and the previous values of the registers are restored until the first wrong instruction is reached. Then the program resumes execution from the right path.
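A rough C sketch of the restore performed by the History Buffer on a wrong prediction (the entry layout, the sizes and the use of the PC to identify the branch are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define HB_LEN 16

typedef struct {
    int      rd;          /* destination register */
    uint32_t old_value;   /* value the register held before the write */
    bool     complete;    /* C bit */
    uint32_t pc;          /* identifies the instruction */
} hb_entry_t;

static hb_entry_t hb[HB_LEN];
static int hb_front = 0, hb_back = 0;         /* circular queue pointers */

/* Called after the active operations have completed. Entries are removed
   from the back towards the front, restoring the old register values,
   until the mispredicted branch is reached: the architectural state is
   then exactly the one of the correct path. */
void hb_restore(uint32_t regfile[32], uint32_t branch_pc) {
    while (hb_back != hb_front) {
        int i = (hb_back - 1 + HB_LEN) % HB_LEN;   /* newest entry */
        if (hb[i].pc == branch_pc)
            break;                                 /* correct path reached */
        regfile[hb[i].rd] = hb[i].old_value;       /* undo the wrong-path write */
        hb_back = i;                               /* delete the entry */
    }
}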
One of the hardest parts of control is implementing exceptions and interrupts. But what are they?
An exception, also called an interrupt, is an unscheduled event that disrupts program execution. It prevents the completion of the operation and comes from an internal cause (e.g. overflow).
An interrupt, on the other hand, is an exception that comes from outside the processor (e.g. communication with external devices).
Detecting exceptional conditions and taking the appropriate actions is often on the critical timing path of a
machine, which determines the clock cycle time and thus performance.
Two types of exception that can be generated are the execution of an undefined instruction and an arithmetic overflow. The basic action that a machine must perform when an exception occurs is to save the PC and then transfer control to a handler that takes the appropriate actions and either terminates the program or lets it continue its execution, using the stored PC.
To handle the exception, the ISR (Interrupt Service Routine) must know the reason for it. There are two main methods:
- Status register (Cause Register), which holds a field that indicates the reason for the exception.
- Vectored interrupts: the address to which control is transferred is determined by the cause of the exception.
When the exception is not vectored, a single entry point for all exceptions can be used, and the operating system then decodes the status register to find the cause.
Moreover, we need a MUX to decide which address to write into the PC (exception handler, branch target and so on). What happens if multiple exceptions occur simultaneously in a single clock cycle?
The normal solution is to prioritize them. An interrupt mask is used to decide which interrupts to consider.
Fig 14: Interrupt ISR scheme
The IM (interrupt mask) is set by the processor to decide whether an interrupt should be handled or not. The Mem block is the array containing the offset of the selected ISR. Another solution is to use the order in memory to decide the priority of the interrupts.
There is a comparison between the priority of the current task (CPL) and the priority of the interrupt.
What happens if a program is near the end of its execution? Will the ISR get the processor? Usually, we do not want to stop the task if the time remaining to complete it is smaller than the time needed for the context switch to the ISR, so we can implement a mechanism of increasing priority for the task in execution.
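As a sketch of how the mask and the priorities interact, the following C fragment chooses which pending interrupt, if any, should be served; the register widths, the ISR offset table ("Mem") and all names are assumptions based on Fig 14:

#include <stdint.h>

#define NUM_IRQ 8

uint32_t isr_offset[NUM_IRQ];   /* "Mem": entry point of each ISR */

/* Pick the interrupt to serve: it must be enabled by the IM and its
   priority must exceed the priority of the running task (CPL).
   Returns the ISR address, or 0 if execution simply continues. */
uint32_t dispatch(uint8_t pending, uint8_t mask,
                  const uint8_t priority[NUM_IRQ], uint8_t cpl) {
    int best = -1;
    for (int i = 0; i < NUM_IRQ; i++) {
        if (!(pending & (1u << i))) continue;   /* not raised */
        if (!(mask & (1u << i)))    continue;   /* disabled by the IM */
        if (priority[i] <= cpl)     continue;   /* current task keeps the CPU */
        if (best < 0 || priority[i] > priority[best])
            best = i;
    }
    return (best >= 0) ? isr_offset[best] : 0;
}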
In a pipeline, exceptions are another form of control hazard. If an exception occurs, we need to flush (prevent the effects of) the instructions that follow the affected instruction, by forcing the control signals to logic 0, and begin fetching instructions from the new address.
Another solution is to execute the following instructions anyway: when an interrupt is found, it is enqueued, the following instructions are executed and then the interrupt is handled. This is called an Imprecise Interrupt, and it saves time.
With five instructions active in any clock cycle, the challenge is to associate an exception with the appropriate instruction. Moreover, multiple exceptions can occur simultaneously in a single clock cycle. In pipelines, the check for interrupts comes after the write-back phase.
Memory hierarchy
The principle of locality is a heuristic principle stating that programs access a relatively small portion of their address space at any instant of time. There are two different types of locality: temporal locality (an item that has been referenced recently is likely to be referenced again soon) and spatial locality (items whose addresses are close to a recently referenced item are likely to be referenced soon).
We take advantage of the principle of locality by implementing the memory of a computer as a memory hierarchy. Faster memories are more expensive per bit than slower memories and are therefore smaller. Moving away from the processor, the cost per bit and the speed decrease, but the size grows.
A memory hierarchy can consist of multiple levels, but data is copied between only two adjacent levels at a time. The minimum unit of information that can be either present or not in the two-level hierarchy is called a block. If the data requested by the processor appears in some block in the upper level, we have a hit, otherwise a miss. The miss penalty is the time needed to replace a block in the upper level with the corresponding block from the lower level.
Cache
The smallest kind of memory is the register, which is inside the processor (register pool). The next level is the cache. The simplest cache is made of blocks of a single word. How can we find a datum in the cache?
In the structure called direct mapped, each word can go in exactly one place in the cache. The typical mapping between addresses and cache locations for this kind of cache is simple: (block address) modulo (number of blocks in the cache).
Because each location can contain the contents of several different memory locations, how do we know whether the data in the cache corresponds to the requested word? We add a tag field (an identifier of the block, e.g. the first two bits of the address in the example) to the data. There is also a field named valid bit, which indicates whether an entry contains a valid address.
The referenced address is divided into a cache index, used to select the block, and a tag field, compared with the value of the tag field of the cache.
Fig 18: Accessing a cache
Assume a 32-bit byte address and a direct-mapped cache with 2^n blocks (n bits of index), each of 2^m words. The tag size is 32 - (n + m + 2) bits (the 2 accounts for the byte offset within a word). So the size of the cache is: 2^n x (block_dimension + tag_dimension + validity_bit).
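A small C sketch of how such an address is split for a direct-mapped cache (the values of n and m are arbitrary, chosen only for the example):

#include <stdint.h>

#define N_BITS 10            /* 2^10 = 1024 blocks (assumed) */
#define M_BITS 2             /* 2^2  = 4 words per block (assumed) */

typedef struct {
    uint32_t byte_offset;    /* 2 bits: byte within the word */
    uint32_t word_offset;    /* m bits: word within the block */
    uint32_t index;          /* n bits: selects the cache block */
    uint32_t tag;            /* remaining 32 - (n + m + 2) bits */
} cache_addr_t;

cache_addr_t split_address(uint32_t addr) {
    cache_addr_t a;
    a.byte_offset = addr & 0x3;
    a.word_offset = (addr >> 2) & ((1u << M_BITS) - 1);
    a.index       = (addr >> (2 + M_BITS)) & ((1u << N_BITS) - 1);
    a.tag         = addr >> (2 + M_BITS + N_BITS);
    return a;
}
/* Hit condition (sketch): cache[a.index].valid && cache[a.index].tag == a.tag */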
How do we choose the block dimension? Larger blocks exploit spatial locality to lower the miss rate. However, the miss rate may eventually go up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the cache becomes small and there is a great deal of competition. Moreover, as the block size increases, the cost of a miss grows, because transferring a block from memory takes more time. On the other hand, larger blocks reduce decoding time.
What happens when a cache miss occurs? The processor does not notice it (caching is transparent to the processor), but it has to wait longer. In order to still get the data after one clock cycle (a requirement of the pipeline), the MMU (Memory Management Unit) generates an asynchronous signal that disables the clock until the data is available (so one clock period can be longer than the others).
On an instruction cache miss (a data cache miss is similar), the missing block can be served in two ways:
- access main memory, send the requested word to the processor as soon as it is available, and then copy the whole block into the cache;
- copy the whole block into the cache first, and then send the requested word.
Writes work somewhat differently, because data coherence between cache and memory must be maintained. There are two policies:
- Write-through: write both in the cache and in memory. This wastes time (a write buffer, of the same size as a block, can be used: when the bus is free, the data is sent from the buffer to memory). PRO: only one word is copied at a time.
- Write-back (or copy-back): a block is copied back to memory only when it has to be replaced. Time is not wasted on every store, but the miss penalty increases. Write-back is harder to implement but has better performance.
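The two policies can be sketched in C as follows (one-word blocks and the field names are my simplifying assumptions):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;        /* used only by write-back */
    uint32_t tag;
    uint32_t data;         /* one-word blocks for simplicity */
} line_t;

/* Write-through: on a write hit, update both cache and memory
   (a write buffer would hide the memory latency). */
void store_write_through(line_t *line, uint32_t value, uint32_t *mem_word) {
    line->data = value;
    *mem_word  = value;                  /* one word copied on every store */
}

/* Write-back: on a write hit, update only the cache and mark it dirty;
   memory is updated only when the block is evicted (higher miss penalty). */
void store_write_back(line_t *line, uint32_t value) {
    line->data  = value;
    line->dirty = true;
}

void evict_write_back(line_t *line, uint32_t *mem_word) {
    if (line->valid && line->dirty)
        *mem_word = line->data;          /* copy back only if modified */
    line->valid = false;
    line->dirty = false;
}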
Cache misses are satisfied from main memory. Although it is difficult to reduce the latency to fetch the first word from memory, we can reduce the miss penalty if we increase the bandwidth. There are three different schemes:
1) Memory is one word wide and all accesses are made sequentially.
2) Bandwidth is increased by widening the memory and the bus (allowing parallel access).
3) Bandwidth is increased by widening only the memory. In this case, we still pay a cost to transmit each word, but we avoid paying the access latency more than once. The memory chips are organized in banks (each the size of a word), so that multiple words can be read or written in one access time, rather than a single word each time. This scheme is called interleaving.
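As a worked example of the effect of interleaving (the latencies below are assumed, only to show the order of magnitude): suppose sending the address takes 1 clock cycle, each memory access takes 15 cycles and transferring one word on the bus takes 1 cycle, with 4-word blocks.
- Scheme 1 (one-word-wide memory): miss penalty = 1 + 4 x (15 + 1) = 65 cycles.
- Scheme 3 (4 interleaved banks, one-word bus): the four accesses overlap, so miss penalty = 1 + 15 + 4 x 1 = 20 cycles.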
Fig 20: Memory organization
Up to now we have only considered one type of cache: direct mapped. It is very fast, but it has a problem: what if we need two blocks that map to the same index? In this cache they would have to occupy the same location, which is not what we want. One solution could be to increase the cache size, but this increases the access time.
Another solution could be a cache in which each block can be placed in any position (fully associative). To find a given block, all cache entries must be searched, because the block can be placed anywhere. To make the search practical, it is done in parallel, but this increases the hardware cost. Another way would be to use a table recording where each block is, but where would it be stored? It cannot be kept in memory (too slow) and it is too big to be kept in the cache, so using such a table is absurd.
The solution that can be used is to consider sets. A set is a group of blocks that are mapped to the same position in the cache (as in a hash table). This is a middle ground between the two solutions above, and it is called set associative. In a set-associative cache, there is a fixed number of locations where each block can be placed (if the places are n, the cache is called n-way set-associative). In a set-associative cache, the set containing a memory block is given by (block number) modulo (number of sets in the cache).
Since the block may be placed in any element of the set, the tags of all the elements of the set must be searched. Because speed is of the essence, all the tags in the selected set are searched in parallel. It is important to notice that each factor-of-two increase in associativity decreases the size of the index by one bit and increases the size of the tag by one bit.
In an n-way set-associative cache, n comparators are needed, together with an n-to-1 MUX to choose among the n potential members of the selected set. The cache access consists of indexing the appropriate set and then searching the tags of the set.
The most commonly used replacement scheme is least recently used (LRU): the block replaced is the one that has been unused for the longest time. In a 2-way set-associative cache this can be implemented with a single bit, which identifies which of the two blocks has been used more recently. As associativity increases, implementing LRU gets harder.
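A C sketch of a 2-way set-associative lookup with the single LRU bit (the number of sets, the one-word blocks and the names are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256        /* assumed: 2-way, 256 sets, one-word blocks */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data;
} way_t;

typedef struct {
    way_t way[2];
    int   lru;              /* single bit: index of the least recently used way */
} set_t;

static set_t cache[NUM_SETS];

/* Set index = (block number) modulo (number of sets); in hardware both
   ways of the selected set are compared in parallel (here, sequentially). */
bool lookup(uint32_t addr, uint32_t *out) {
    uint32_t block = addr >> 2;             /* word-sized blocks */
    uint32_t set   = block % NUM_SETS;
    uint32_t tag   = block / NUM_SETS;
    for (int w = 0; w < 2; w++) {
        if (cache[set].way[w].valid && cache[set].way[w].tag == tag) {
            *out = cache[set].way[w].data;
            cache[set].lru = 1 - w;         /* the other way is now the LRU one */
            return true;
        }
    }
    return false;                           /* miss: the victim is cache[set].lru */
}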
Another way of reducing the miss penalty is to add a further level to the hierarchy, a technique called multilevel caching. The second-level cache handles the primary cache misses and is 10 or more times larger than the primary cache.
Virtual Memory
The main memory can act as a "cache" for the secondary storage, using a technique called virtual memory. There are two main reasons to use virtual memory: to allow efficient and safe sharing of memory among multiple programs, and to remove the programming burden of a small, limited amount of main memory.
We cannot know, when we compile a program, which other programs will share memory with it, so we compile each program into its own address space. Virtual memory implements the translation of a program's address space to physical addresses.
A virtual memory block is called a page and a virtual memory miss is called a page fault. A page fault stops the execution of the program, because it takes millions of clock cycles to be resolved. With virtual memory, the processor produces a virtual address, which is translated by hardware and software into a physical address, which can then be used to access main memory (address mapping).
Fig 25: Virtual memory
In a virtual memory, the address is broken into a virtual page number and a page offset. The number of bits in the page-offset field determines the page size. The page size is the result of a tradeoff: bigger pages are better for locality but worse for copy time.
It is important to notice that the write-through technique does not work for virtual memory, because of the very long time needed to access the lower level; write-back is used instead.
In a virtual memory system, pages are located by using a table that indexes the memory: the page table. It resides in memory and is indexed with the page number from the virtual address to find the corresponding physical page number. Each program has its own page table. Because the page table contains a mapping for every possible virtual page, no tags are required (it is fully associative).
Each row of the page table contains a validity bit: if it is 0, the page is not present in physical memory. We also need to know whether the page has been modified, because when it is replaced the changes must be saved (if the dirty bit is 0, there is no need to copy it back). Another field required is the Use bit, needed to manage page replacement (as in the cache). If the valid bit of the page table entry is 0, a page fault occurs, an exception is raised and the operating system is given control.
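A minimal C sketch of a page table entry with these bits and of the translation step (the page size and the field names are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS 12                 /* assumed 4 KB pages */

typedef struct {
    bool     valid;                  /* page present in physical memory */
    bool     dirty;                  /* page modified since it was loaded */
    bool     used;                   /* reference bit for page replacement */
    uint32_t phys_page;              /* physical page number */
} pte_t;

/* One page table per process, indexed directly by the virtual page
   number, so no tag is needed (it behaves as a fully associative map). */
uint32_t translate(pte_t page_table[], uint32_t vaddr, bool *page_fault) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (!page_table[vpn].valid) {
        *page_fault = true;          /* exception: the OS takes control */
        return 0;
    }
    page_table[vpn].used = true;
    *page_fault = false;
    return (page_table[vpn].phys_page << PAGE_BITS) | offset;
}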
Fig 27: Page table
Since the page tables are stored in main memory, every memory access by a program can take at least twice as long: one memory access to obtain the physical address and a second to get the data. The key to improving access performance is to rely on locality of reference to the page table. Modern processors include a special cache that keeps track of recently used translations: the translation-lookaside buffer (TLB), a kind of translation cache.
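A simple C sketch of the idea: the TLB is checked first, and the in-memory page table is walked only on a TLB miss (the TLB size and the replacement choice are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16

typedef struct {
    bool     valid;
    uint32_t vpn;           /* virtual page number (acts as the tag) */
    uint32_t phys_page;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Try the TLB first; only on a TLB miss is the page table in memory
   walked (represented here by a callback), and the translation is
   cached for the next access to the same page. */
uint32_t lookup_phys_page(uint32_t vpn, uint32_t (*walk_page_table)(uint32_t)) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].phys_page;               /* TLB hit: no extra memory access */

    uint32_t ppn = walk_page_table(vpn);           /* TLB miss: extra memory access */
    int victim = vpn % TLB_ENTRIES;                /* simple replacement (assumed) */
    tlb[victim] = (tlb_entry_t){ true, vpn, ppn };
    return ppn;
}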
On a page fault, the operating system must:
- Look up the page table entry using the virtual address and find the location of the referenced page on disk.
- Choose a physical page to replace (usually with LRU), paying attention to the dirty bit (if it is 1, the page must be copied back).
- Start a read to bring the referenced page from the disk into the chosen physical page.
Writing a page from disk to memory would waste a huge number of CPU clock cycles, so DMA (Direct Memory Access) is used to avoid it. The DMA manages the relation between the virtual space and the physical space (it manages paging to and from disk). It is an interface between disk and memory and is transparent to the CPU. In order to transfer pages, the DMA takes possession of the address bus, which can delay the CPU.
MMU
A memory management unit (MMU) is a hardware unit through which all memory references pass; its primary job is translating virtual memory addresses into physical addresses. An MMU effectively performs virtual memory management, handling at the same time memory protection, cache control and bus arbitration. It is independent of the CPU and manages page faults, taking care that updates to the state do not get lost. The MMU is the architectural support for the OS, providing information about interrupts (e.g. the address affected by a page fault).
Every process has a unique identifier, and every page of the process has a mapping to a physical page in memory. The MMU holds a pointer to the virtual address of the page that generated the page fault and passes this address to the OS. Who knows whether there is room in main memory to load a page? The descriptor of the main memory, which is part of the OS. It resides in a dedicated part of memory (together with the other parts of the OS), called the resident OS, which processes cannot access.
The MMU also manages the interaction between processes in memory. If a process tries to access the wrong page table, or addresses that are not in its part of memory, the MMU intervenes. For each process there is an upper and a lower bound of addresses in physical memory; the MMU generates the address and checks whether it is inside the bounds, otherwise it raises an exception.
The MMU also checks whether the address is in the cache; if not, it asks memory. The CPU has to wait until the data is present in the cache, so the MMU disables the clock.
ARM
RISC
RISC was introduced at Berkeley in 1980, and its design was much simpler than the commercial CISC processors of the time. The RISC I had the following key features:
- A fixed instruction size (32 bits); CISC processors typically had variable-length instructions.
- Operands of data processing instructions must be in registers; on CISC they could be in memory.
- 32 general-purpose registers.
These differences simplified the design and allowed the implementation of the architecture using
organizational features that contributed to the performance of the prototype devices:
- Hard-wired instruction decode logic (instead of ROM, as in CISC);
- Pipelined execution;
- Single-cycle execution (in CISC an instruction can take many clock cycles to complete).
So, RISC advantages are:
- A simpler processor requires fewer transistors;
- A shorter development time;
- A higher performance (pipelining and high clock rate).
RISC also has drawbacks, the main one being poorer code density than CISC. To address it, ARM has incorporated a novel mechanism, called the Thumb architecture, into some versions of ARM processors. The Thumb instruction set is a 16-bit compressed form of the original 32-bit ARM instruction set and employs dynamic decompression hardware in the instruction pipeline.
ARM Architecture
The ARM architecture incorporated a number of features from the Berkeley RISC design, such as:
- A load-store architecture;
- Fixed-length 32-bit instructions;
- 3-address instruction format.
The key word of ARM design is simplicity. The combination of the simple hardware with an instruction set
that is grounded in RISC ideas but retains a few key CISC features, and thereby achieves a significantly better
code density than a pure RISC, has given the ARM its power-efficiency and its small core size.
Memory system
When writing user-level programs, only the 15 general-purpose 32-bit registers (r0 to r14), the program counter
(r15) and the current program status register (CPSR) need be considered. The remaining registers are used only
for system-level programming and for handling exceptions.
Instructions
ARM employs a load-store architecture, so the instruction set only processes values which are in registers. In fact, ARM does not support memory-to-memory operations. Therefore, all ARM instructions fall into one of the following three categories: data processing instructions, data transfer instructions and control flow instructions.
The most notable features of the ARM instruction set are:
- The load-store architecture;
- 3-address data processing instruction;
- Conditional execution of every instruction;
- Multiple register load and store instructions;
- General shift operations.
Supervisor mode
The ARM processor supports a protected supervisor mode. The protection mechanism ensures that user code
cannot gain supervisor privileges without appropriate checks. The upshot of this for the user-level programmer
is that system-level functions can only be accessed through specified supervisor calls.
I/O
The ARM handles I/O peripherals as memory-mapped devices with interrupt support. Two kinds of interrupt
are present: IRQ (normal interrupt) and FIQ (fast interrupt).
All operands are 32 bits wide and come from registers or are specified as literals in the instruction itself. The result is 32 bits wide (except for long multiplies) and is placed in a register.
Instead of using two source registers, we can add a constant by replacing the second operand with an immediate value (not allowed for mul). A third way to specify a data operation is similar to the first, but allows the second register operand to be subjected to a shift operation before being combined.
Any data processing instruction can set the condition code (NZCV).
Three kinds of operation are allowed: single register load and store instructions, multiple register load/store instructions and the single register swap instruction (a value in a register is exchanged with a value in memory).
The addressing modes for single register transfers are:
- Base register
- Base + offset (max 4 Kbytes)
Pre-index:  LDR r0, [r1, #4]   ; r0 = mem[r1+4]
Auto-index: LDR r0, [r1, #4]!  ; r0 = mem[r1+4], r1 = r1+4
Post-index: LDR r0, [r1], #4   ; r0 = mem[r1],   r1 = r1+4
When a multiple load/store instruction is used, memory is treated as a stack. A stack is usually implemented as a linear data structure which grows up through memory (an ascending stack) or down through memory (a descending stack) as data is added to it, and shrinks back as data is removed. The ARM multiple register transfer instructions support all four forms of stack:
- Full ascending: the stack grows up through increasing memory addresses and the base register
points to the highest address containing a valid item.
- Empty ascending: the stack grows up through increasing memory addresses and the base
register points to the first empty location above the stack.
- Full descending: the stack grows down through decreasing memory addresses and the base
register points to the lowest address containing a valid item.
- Empty descending: the stack grows down through decreasing memory addresses and the base
register points to the first empty location below the stack.
Figure 30: Stack modes
The load and store multiple are an efficient way to save and restore processor state. But they are not pure RISC.
The most common way to switch program execution from one place to another is to use the branch instruction. Sometimes we want the processor to decide whether or not to branch; this is called a conditional branch.
An unusual feature of the ARM instruction set is that conditional execution applies not only to branches but to all ARM instructions. A branch that is used to skip a small number of following instructions may be omitted by giving those instructions the opposite condition. If a programmer wants to call one of a set of subroutines, a jump table is used.
An important branch instruction is branch and link. It is used to branch to a subroutine in a way that makes it possible to resume the original code sequence when the subroutine has completed, by saving the return address in r14. Note that, since the return address is held in a register, the subroutine should not call a further, nested, subroutine without first saving r14; otherwise the new return address would overwrite the old one and it would not be possible to find the way back to the original caller. To return to the caller, a move instruction copies r14 into r15 (the PC).
Another important kind of branch is the supervisor call. The supervisor is a program which operates at a privileged level; a typical example is when a program requires input or output.
Multicore