Computer Architecture: Edited by Galatro Giovanni
Index
Recapitulation
    Monocycle datapath
Pipeline
    Datapath
Out-of-Order Execution
    Strategies
    Differences
    Branches
    Cache
    MMU
ARM
    RISC
    Instructions
    I/O
Recapitulation
Monocycle datapath:
Multicycle datapath:
Fig 2. Complete finite state machine for the multicycle datapath
Pipeline
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The processor starts the execution of one operation before the previous one has finished, so multiple instructions are in execution during the same clock cycle.
The pipelining paradox is that the time to execute a single operation does not decrease (it is probably even longer, due to the larger inter-stage buffers compared to the multicycle processor). What pipelining increases is throughput. It is important to notice that in a pipeline, in every clock cycle one instruction starts and another ends.
Pipelining violates the principle of the Von Neumann model (one execution starts only when the previous one ends), because an operation starts before the effects produced by the previous one are visible.
Keep in mind that a register can be written and read in the same clock period, e.g. read on the falling edge and written on the rising one (master-slave).
Datapath
With pipelining, all instructions have the same length. This "restriction" makes it much easier to fetch and decode instructions.
Between two phases there is an inter-stage buffer, in which the results of the previous phase are stored so that the next one can use them. The value of the PC is also passed through the phases and stored in the buffers in order to manage branches (if there is a branch, the offset is given relative to the current PC, so the value of the PC must be kept until the branch operation is over).
Control signals are also stored in the inter-stage buffers and passed through the phases. In fact, the Control Unit is like the monocycle CU (a simple combinational machine). The key point is that the computation of the control signals is separated from their use: in one cycle (during Decode), the CU generates all the signals for that instruction.
When using a pipeline, a RISC machine works better than a CISC one, because RISC operations are similar to each other and little information has to be passed through the buffers.
The clock period is determined by the longest phase, usually the Execute. In order to reduce the clock period, we can use a pipeline with more stages, but this requires more hardware.
Branch Hazard
A branch followed by other operations can be a problem, because those instructions start before the branch has been computed. One possible solution is to prevent the following instructions from producing effects by forcing the control signals of the inter-stage buffers to 0 (so the buffers are not updated). This situation is also called a Stall.
A variant uses a Hazard Detection Unit: in the Decode phase, extra hardware evaluates the condition before the Execute phase, so the PC is updated during the second stage of the pipeline.
If we cannot resolve a branch in the second stage (e.g. in longer pipelines), stalling on branches causes a larger slowdown (wasted clock cycles). The cost of this option is too high for most computers, so we can use two other solutions: Pre-Fetch or Prediction.
The Pre-Fetch is a replication of the Fetch phase. If there is a branch, it is evaluated and the PC refreshed. This solution is costly in terms of hardware but allows the inter-stage buffers to shrink (there is no need to pass the PC through the phases).
Prediction is one of the most widely used solutions. It consists of predicting branches based on the principle of locality (if a branch is taken once, it will probably be taken several more times, e.g. in a for loop). This solution reduces the penalty of branches and is based on the assumption: if a branch was taken the previous time, it will be taken again.
This is not a deterministic technique; it is a bet. There are two ways of predicting: the Branch Prediction Buffer and the Branch History Table.
The first time a branch is found, it is executed. If it turns out not to be taken, CPU cycles have been wasted. If the branch is taken, the result is stored in the buffer. The next time the same branch is found (identified by its PC), a 1 is read in the buffer and the branch is taken again. If the branch indeed has to be taken, CPU cycles are saved; if not, CPU cycles are wasted and the normal flow is restored. With this solution, time is wasted only for the first and the last occurrence of the branch.
The two-bit scheme is used for nested loops, where the inner loop is executed n times (branch taken n times), then the same branch is not taken once (exit from the inner loop), and then it is taken again for another n times. So the previous two outcomes are stored: if at least one of the two is "taken", the branch is predicted taken. In this way, time is saved at each "first iteration" of the inner loop.
Fig 3: Two-bit scheme, finite state machine
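To make the scheme concrete, here is a minimal C sketch of a prediction buffer using two-bit saturating counters, one common form of the two-bit scheme shown in the figure; the table size, the indexing by the low PC bits and the names are my assumptions, not part of the notes.

#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counters: 0,1 predict "not taken", 2,3 predict "taken".
   The buffer is indexed with the low-order bits of the branch PC. */
#define BPB_ENTRIES 1024
static uint8_t bpb[BPB_ENTRIES];

static bool predict_taken(uint32_t pc) {
    return bpb[(pc >> 2) % BPB_ENTRIES] >= 2;
}

/* After the branch is resolved, move the counter one step towards the
   real outcome: it takes two consecutive mispredictions to change the
   prediction, which is why nested loops behave well. */
static void train(uint32_t pc, bool taken) {
    uint8_t *c = &bpb[(pc >> 2) % BPB_ENTRIES];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}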
In order to decide which solution is the best, we should evaluate the performance of each case of interest. Moreover, we should find a trade-off between performance and cost.
Stall implementation
To flush instructions in the IF stage, we add a control line that zeros the instruction field of the IF/ID pipeline register. Clearing the register transforms the fetched instruction into a nop (no action, the program remains in the same state). This action is performed by the Hazard Detection Unit.
With this operation, the "bubble" passes through all the phases.
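As a sketch of this mechanism, the following C fragment models how clearing the IF/ID register on a flush injects a bubble (the struct and signal names are assumed):

#include <stdint.h>

typedef struct {
    uint32_t instruction;
    uint32_t pc;
} if_id_t;

/* Clock the IF/ID register: when the flush signal is asserted, the
   instruction field is zeroed, i.e. a nop (bubble) enters the pipeline. */
void clock_if_id(if_id_t *if_id, uint32_t fetched, uint32_t next_pc, int if_flush) {
    if_id->instruction = if_flush ? 0u : fetched;
    if_id->pc = next_pc;
}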
Data Hazard
If an operation needs a value from a previous operation that has not yet stored it, where can it find it? Is the datum stored somewhere in the datapath?
Without intervention, a data hazard can severely stall the pipeline (in this case, 3 bubbles).
We have to send the datum back into the pipeline. To do this, we need a Forwarding Unit and 3 MUXes.
The forwarding unit checks whether the destination register of an instruction is used as a source register by a subsequent instruction. In that case, the datum has to be forwarded, because it has not yet been written to the register file. When we move something from the right to the left of the pipeline, we are effectively jumping ahead in time.
The only cost of this solution is additional hardware (MUXes, larger inter-stage buffers and the FU).
Forwarding cannot prevent all pipeline stalls: if a load instruction is immediately followed by an instruction that uses its result, a stall of one clock cycle cannot be avoided. The Hazard Detection Unit manages this stall.
Fig 7: Forwarding unit and Hazard Detection Unit
The ALU inputs can now come from three sources: ID/EX, EX/MEM and MEM/WB.
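A hedged C sketch of the decision made by the forwarding unit for one ALU input follows (the register-field and signal names follow the usual textbook description and are assumptions):

/* Which source should feed the ALU input that reads register rs?
   0 = value read from the register file (ID/EX),
   1 = ALU result of the previous instruction (EX/MEM),
   2 = result of the instruction two steps back (MEM/WB).
   The most recent producer wins; register 0 is never forwarded. */
int forward_select(int rs,
                   int exmem_regwrite, int exmem_rd,
                   int memwb_regwrite, int memwb_rd) {
    if (exmem_regwrite && exmem_rd != 0 && exmem_rd == rs)
        return 1;                    /* EX hazard */
    if (memwb_regwrite && memwb_rd != 0 && memwb_rd == rs)
        return 2;                    /* MEM hazard */
    return 0;                        /* no forwarding needed */
}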
Final Datapath
Because different operations take different execution times (EX phase), we can think of using a pipeline inside the Execution Unit. A pipelined EX phase is useful because it lets us reduce the duration of the clock cycle.
The Reservation Shift Register (RSR) lets us hold the result of an execution until it can be written into the EX/MEM buffer, so that instructions do not overlap. With this solution, different operations take a different number of clock cycles to complete their execution, as we can see in figure 9.
There can be situations in which more than one operation is in the Exec phase, and it is not guaranteed that the output of the system is correct. If there are no dependencies, there is no problem. But what happens if an operation refers to the destination register of a previous one? We can execute other operations without data dependencies before that operation and save clock cycles; if waiting is mandatory, nop operations can be inserted (all these adjustments are made by the compiler).
Ex.:
divf
mulf
addf
add
If addf depends on mulf, we can execute add before it (if it has no data dependencies), otherwise we insert a nop.
Strategies
Now we have to manage bus contention and the order of operations. There are three strategies:
Ordered Execution
Reordering Buffer
History Buffer
Ordered execution
If an operation needs j CPI, it is inserted at position j of the RSR, provided the validity bit there is 0. When an operation is inserted at position j, the RSR sets the validity bit of all the positions before j, so that no other operation can be inserted ahead of it; this guarantees the execution order. In this way some clock cycles are wasted (if we consider the CPI in absolute terms), but if we want to respect the order of operations, we still save clock cycles with respect to a non-pipelined EXEC.
If an operation has to be inserted in the j-th row but the validity bit is already 1, it has to wait until RSR[j].V = 0. The waiting queue is managed in the Issue Register.
This solution is easy to implement but requires extra hardware for the Issue Register. Moreover, with this solution some operations have to wait for their turn.
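A rough C sketch of ordered issue into the RSR follows (the depth of the register and the entry layout are assumptions):

#include <stdbool.h>

#define RSR_LEN 8                     /* assumed: maximum latency in cycles */

typedef struct { bool valid; int dest_reg; } rsr_entry_t;
static rsr_entry_t rsr[RSR_LEN];

/* Issue an operation that needs 'latency' cycles: it must land in slot
   'latency'. Reserving every earlier slot prevents later, shorter
   operations from overtaking it, so results leave EX in program order.
   Returns false if the operation must wait in the Issue Register. */
bool issue_ordered(int latency, int dest_reg) {
    if (rsr[latency].valid)
        return false;
    for (int j = 0; j < latency; j++)
        rsr[j].valid = true;          /* block the earlier positions */
    rsr[latency].valid = true;
    rsr[latency].dest_reg = dest_reg;
    return true;
}

/* Every clock cycle the RSR shifts by one: slot 0 leaves the EX stage. */
void rsr_shift(void) {
    for (int j = 0; j < RSR_LEN - 1; j++)
        rsr[j] = rsr[j + 1];
    rsr[RSR_LEN - 1].valid = false;
}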
Reordering buffer
The ROB is a buffer used to store the results of operations. It is used in combination with the RSR.
It allows out-of-order (OOO) execution but guarantees ordered output. It is managed as a circular queue, with a pointer to the front and another to the back.
Now the RSR contains only the validity bit, the functional unit used and the position in the ROB. Differently from the previous solution, when an element is inserted at position j, the previous j-1 positions are not set to 1, so other operations can still be inserted. If RSR[j].V = 1, the operation is placed in the Issue Register.
When an instruction completes, its result is stored in the ROB and has to wait until it becomes the front of the queue before it can be passed on. When an element becomes the front of the queue and its C bit is 0, we have to wait for its completion.
If a value stored in the ROB is needed (e.g. because of data dependencies), it can be taken from the ROB and used by the Forwarding Unit.
In summary, the ROB allows OOO execution, but output order is guaranteed and the program state is updated sequentially. It only needs some extra hardware for the ROB itself.
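The following C sketch models the ROB as a circular queue with out-of-order write-back and in-order commit (sizes and field names are my assumptions):

#include <stdbool.h>
#include <stdint.h>

#define ROB_LEN 16

typedef struct {
    bool     busy;        /* slot allocated */
    bool     complete;    /* C bit: result available */
    int      dest_reg;    /* architectural destination register */
    uint32_t value;       /* result, once the FU has produced it */
} rob_entry_t;

static rob_entry_t rob[ROB_LEN];
static int rob_head = 0;   /* front: oldest instruction, next to commit */
static int rob_tail = 0;   /* back: where the next instruction is allocated */

/* Allocate a slot at issue time; returns the ROB index, or -1 if full. */
int rob_alloc(int dest_reg) {
    int next = (rob_tail + 1) % ROB_LEN;
    if (next == rob_head) return -1;          /* ROB full: stall issue */
    rob[rob_tail] = (rob_entry_t){ .busy = true, .complete = false,
                                   .dest_reg = dest_reg };
    int idx = rob_tail;
    rob_tail = next;
    return idx;
}

/* Functional units write their results out of order. */
void rob_writeback(int idx, uint32_t value) {
    rob[idx].value = value;
    rob[idx].complete = true;
}

/* Commit only from the front, and only if the C bit is set, so the
   register file (program state) is updated in program order. */
bool rob_commit(uint32_t regfile[32]) {
    if (rob_head == rob_tail || !rob[rob_head].complete)
        return false;                          /* front not ready: wait */
    regfile[rob[rob_head].dest_reg] = rob[rob_head].value;
    rob[rob_head].busy = false;
    rob_head = (rob_head + 1) % ROB_LEN;
    return true;
}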
History buffer
The History Buffer keeps track of the evolution of the machine state, in order to improve on the ROB's performance. The key point is to let operations update the registers out of order, while keeping enough information about the changes to restore the previous state if necessary. The HB is managed as a circular queue. Operations that need values not yet written (data dependencies) are inserted in the Issue Register.
The RSR changes again: now it holds a reference to the position in the HB and, again, the destination register.
Each HB entry contains:
- Rd: the destination register
- Old: the old value of the register
- C: completion bit
- Program Counter: unique identifier of the instruction
When an instruction completes, the result is immediately written into the register. When the entry arrives at the front of the HB with C = 1, it is deleted from the HB.
Since the output order guarantees coherence with respect to data dependencies, the only reasons for restoring a previous state are interrupts and wrong branch predictions.
This approach allows speculative execution and updates the program state out of order, but guarantees that the final result is the same. The disadvantage is that extra time is needed to manage interrupts and branches.
Differences
Branches
When some operations terminate before the branch condition is checked, there can be a problem. The common solution is the use of an Error Prediction bit (EPR) both in the ROB and in the HB.
In the ROB, if a wrong prediction was made, the operations after the branch are simply not written to the registers.
In the HB, a restore is done: the HB is blocked, the active operations are allowed to complete (we wait for this), then the HB elements are deleted from back to front and the previous values of the registers are restored until the first wrong instruction is reached. Then the program resumes execution from the right path.
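A rough C sketch of the restore performed by the History Buffer on a wrong prediction (the entry layout, the sizes and the use of the PC to identify the branch are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define HB_LEN 16

typedef struct {
    int      rd;          /* destination register */
    uint32_t old_value;   /* value the register held before the write */
    bool     complete;    /* C bit */
    uint32_t pc;          /* identifies the instruction */
} hb_entry_t;

static hb_entry_t hb[HB_LEN];
static int hb_front = 0, hb_back = 0;         /* circular queue pointers */

/* Called after the active operations have completed. Entries are removed
   from the back towards the front, restoring the old register values,
   until the mispredicted branch is reached: the architectural state is
   then exactly the one of the correct path. */
void hb_restore(uint32_t regfile[32], uint32_t branch_pc) {
    while (hb_back != hb_front) {
        int i = (hb_back - 1 + HB_LEN) % HB_LEN;   /* newest entry */
        if (hb[i].pc == branch_pc)
            break;                                 /* correct path reached */
        regfile[hb[i].rd] = hb[i].old_value;       /* undo the wrong-path write */
        hb_back = i;                               /* delete the entry */
    }
}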
One of the hardest parts of control is implementing exceptions and interrupts. But what are they?
An exception, also called an interrupt, is an unscheduled event that disrupts program execution. It prevents the completion of the operation and comes from an internal cause (e.g. overflow).
An interrupt, on the other hand, is an exception that comes from outside the processor (e.g. communication with external devices).
Detecting exceptional conditions and taking the appropriate actions is often on the critical timing path of a
machine, which determines the clock cycle time and thus performance.
Two types of exception that can be generated are the execution of an undefined instruction and an arithmetic overflow. The basic action that a machine must perform when an exception occurs is to save the PC and then transfer control to a handler that takes the appropriate actions and either terminates the program or lets it continue its execution, using the stored PC.
To handle the exception, the ISR (Interrupt Service Routine) must know the reason for it. There are two main methods:
- Status register (Cause Register), which holds a field that indicates the reason for the exception.
- Vectored interrupts: the address to which control is transferred is determined by the cause of the exception.
When the exception is not vectored, a single entry point for all exceptions can be used, and the operating system then decodes the status register to find the cause.
Moreover, we need a MUX to decide which address to write into the PC (exception handler, branch target and so on). What happens if multiple exceptions occur simultaneously in a single clock cycle?
The normal solution is to prioritize them. An interrupt mask is used to decide which interrupts to consider.
Fig 14: Interrupt ISR scheme
The IM (interrupt mask) is set by the processor to decide whether an interrupt should be handled or not. The Mem block is the array containing the offset of the selected ISR. Another solution is to use the order in memory to decide the priority of the interrupts.
There is a comparison between the priority of the current task (CPL) and the priority of the interrupt.
What happens if a program is near the end of its execution? Will the ISR get the processor? Usually, we do not want to stop the task if the time remaining to complete it is smaller than the time needed for the context switch to the ISR, so we can implement a mechanism of increasing priority for the task in execution.
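As a sketch of how the mask and the priorities interact, the following C fragment chooses which pending interrupt, if any, should be served; the register widths, the ISR offset table ("Mem") and all names are assumptions based on Fig 14:

#include <stdint.h>

#define NUM_IRQ 8

uint32_t isr_offset[NUM_IRQ];   /* "Mem": entry point of each ISR */

/* Pick the interrupt to serve: it must be enabled by the IM and its
   priority must exceed the priority of the running task (CPL).
   Returns the ISR address, or 0 if execution simply continues. */
uint32_t dispatch(uint8_t pending, uint8_t mask,
                  const uint8_t priority[NUM_IRQ], uint8_t cpl) {
    int best = -1;
    for (int i = 0; i < NUM_IRQ; i++) {
        if (!(pending & (1u << i))) continue;   /* not raised */
        if (!(mask & (1u << i)))    continue;   /* disabled by the IM */
        if (priority[i] <= cpl)     continue;   /* current task keeps the CPU */
        if (best < 0 || priority[i] > priority[best])
            best = i;
    }
    return (best >= 0) ? isr_offset[best] : 0;
}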
In a pipeline, exceptions are another form of control hazard. If an exception occurs, we need to flush (prevent the effects of) the instructions that follow the affected instruction, by forcing the control signals to logic 0, and begin fetching instructions from the new address.
Another solution is to execute the following instructions anyway: when an interrupt is found, it is enqueued, the following instructions are executed and then the interrupt is handled. This is called an Imprecise Interrupt, and it saves time.
With five instructions active in any clock cycle, the challenge is to associate an exception with the appropriate instruction. Moreover, multiple exceptions can occur simultaneously in a single clock cycle. In pipelines, the check for interrupts comes after the write-back phase.
Memory hierarchy
The principle of locality is a heuristic principle stating that programs access a relatively small portion of their address space at any instant of time. There are two different types of locality: temporal locality (an item that has been referenced recently is likely to be referenced again soon) and spatial locality (items whose addresses are close to a recently referenced item are likely to be referenced soon).
We take advantage of the principle of locality by implementing the memory of a computer as a memory hierarchy. Faster memories are more expensive per bit than slower memories and are therefore smaller. Moving away from the processor, the cost per bit and the speed decrease, but the size grows.
A memory hierarchy can consist of multiple levels, but data is copied between only two adjacent levels at a time. The minimum unit of information that can be either present or not in the two-level hierarchy is called a block. If the data requested by the processor appears in some block in the upper level, we have a hit, otherwise a miss. The miss penalty is the time needed to replace a block in the upper level with the corresponding block from the lower level.
Cache
The smallest kind of memory is the register, which is inside the processor (register pool). The next level is the cache. The simplest cache is made of blocks of a single word. How can we find a datum in the cache?
In the structure called direct mapped, each word can go in exactly one place in the cache. The typical mapping between addresses and cache locations for this kind of cache is simple: (block address) modulo (number of blocks in the cache).
Because each location can contain the contents of several different memory locations, how do we know whether the data in the cache corresponds to the requested word? We add a tag field (an identifier of the block, e.g. the first two bits of the address in the example) to the data. There is also a field named valid bit, which indicates whether an entry contains a valid address.
The referenced address is divided into a cache index, used to select the block, and a tag field, compared with the value of the tag field of the cache.
Fig 18: Accessing a cache
Assume a 32-bit byte address and a direct-mapped cache with 2^n blocks (n bits of index), each of 2^m words. The tag size is 32 - (n + m + 2) bits (the 2 accounts for the byte offset within a word). So the size of the cache is: 2^n x (block_dimension + tag_dimension + validity_bit).
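A small C sketch of how such an address is split for a direct-mapped cache (the values of n and m are arbitrary, chosen only for the example):

#include <stdint.h>

#define N_BITS 10            /* 2^10 = 1024 blocks (assumed) */
#define M_BITS 2             /* 2^2  = 4 words per block (assumed) */

typedef struct {
    uint32_t byte_offset;    /* 2 bits: byte within the word */
    uint32_t word_offset;    /* m bits: word within the block */
    uint32_t index;          /* n bits: selects the cache block */
    uint32_t tag;            /* remaining 32 - (n + m + 2) bits */
} cache_addr_t;

cache_addr_t split_address(uint32_t addr) {
    cache_addr_t a;
    a.byte_offset = addr & 0x3;
    a.word_offset = (addr >> 2) & ((1u << M_BITS) - 1);
    a.index       = (addr >> (2 + M_BITS)) & ((1u << N_BITS) - 1);
    a.tag         = addr >> (2 + M_BITS + N_BITS);
    return a;
}
/* Hit condition (sketch): cache[a.index].valid && cache[a.index].tag == a.tag */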
How do we choose the block dimension? Larger blocks exploit spatial locality to lower the miss rate. However, the miss rate may eventually go up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the cache becomes small and there is a great deal of competition. Moreover, as the block size increases, the cost of a miss grows, because transferring a block from memory takes more time. On the other hand, larger blocks reduce decoding time.
What happens when a cache miss occurs? The processor does not notice it (caching is transparent to the processor), but it has to wait longer. In order to still get the data after one clock cycle (a requirement of the pipeline), the MMU (Memory Management Unit) generates an asynchronous signal that disables the clock until the data is available (so one clock period can be longer than the others).
On an instruction cache miss (a data cache miss is similar), the missing block can be served in two ways:
- access main memory, send the requested word to the processor as soon as it is available, and then copy the whole block into the cache;
- copy the whole block into the cache first, and then send the requested word.
Writes work somewhat differently, because data coherence between cache and memory must be maintained. There are two policies:
- Write-through: write both in the cache and in memory. This wastes time (a write buffer, of the same size as a block, can be used: when the bus is free, the data is sent from the buffer to memory). PRO: only one word is copied at a time.
- Write-back (or copy-back): a block is copied back to memory only when it has to be replaced. Time is not wasted on every store, but the miss penalty increases. Write-back is harder to implement but has better performance.
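The two policies can be sketched in C as follows (one-word blocks and the field names are my simplifying assumptions):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;        /* used only by write-back */
    uint32_t tag;
    uint32_t data;         /* one-word blocks for simplicity */
} line_t;

/* Write-through: on a write hit, update both cache and memory
   (a write buffer would hide the memory latency). */
void store_write_through(line_t *line, uint32_t value, uint32_t *mem_word) {
    line->data = value;
    *mem_word  = value;                  /* one word copied on every store */
}

/* Write-back: on a write hit, update only the cache and mark it dirty;
   memory is updated only when the block is evicted (higher miss penalty). */
void store_write_back(line_t *line, uint32_t value) {
    line->data  = value;
    line->dirty = true;
}

void evict_write_back(line_t *line, uint32_t *mem_word) {
    if (line->valid && line->dirty)
        *mem_word = line->data;          /* copy back only if modified */
    line->valid = false;
    line->dirty = false;
}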
Cache misses are satisfied from main memory. Although it is difficult to reduce the latency to fetch the first word from memory, we can reduce the miss penalty if we increase the bandwidth. There are three different schemes:
1) Memory is one word wide and all accesses are made sequentially.
2) Bandwidth is increased by widening the memory and the bus (allowing parallel access).
3) Bandwidth is increased by widening only the memory. In this case, we still pay a cost to transmit each word, but we avoid paying the access latency more than once. The memory chips are organized in banks (each the size of a word), so that multiple words can be read or written in one access time, rather than a single word each time. This scheme is called interleaving.
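As a worked example of the effect of interleaving (the latencies below are assumed, only to show the order of magnitude): suppose sending the address takes 1 clock cycle, each memory access takes 15 cycles and transferring one word on the bus takes 1 cycle, with 4-word blocks.
- Scheme 1 (one-word-wide memory): miss penalty = 1 + 4 x (15 + 1) = 65 cycles.
- Scheme 3 (4 interleaved banks, one-word bus): the four accesses overlap, so miss penalty = 1 + 15 + 4 x 1 = 20 cycles.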
Fig 20: Memory organization
Up to now we have only considered one type of cache: direct mapped. It is very fast, but it has a problem: what if we need two blocks that map to the same index? In this cache they would have to occupy the same location, which is not what we want. One solution could be to increase the cache size, but this increases the access time.
Another solution could be a cache in which each block can be placed in any position (fully associative). To find a given block, all cache entries must be searched, because the block can be placed anywhere. To make the search practical, it is done in parallel, but this increases the hardware cost. Another way would be to use a table recording where each block is, but where would it be stored? It cannot be kept in memory (too slow) and it is too big to be kept in the cache, so using such a table is absurd.
The solution that can be used is to consider sets. A set is a group of blocks that are mapped to the same position in the cache (as in a hash table). This is a middle ground between the two solutions above, and it is called set associative. In a set-associative cache, there is a fixed number of locations where each block can be placed (if the places are n, the cache is called n-way set-associative). In a set-associative cache, the set containing a memory block is given by (block number) modulo (number of sets in the cache).
Since the block may be placed in any element of the set, the tags of all the elements of the set must be searched. Because speed is of the essence, all the tags in the selected set are searched in parallel. It is important to notice that each factor-of-two increase in associativity decreases the size of the index by one bit and increases the size of the tag by one bit.
In an n-way set-associative cache, n comparators are needed, together with an n-to-1 MUX to choose among the n potential members of the selected set. The cache access consists of indexing the appropriate set and then searching the tags of the set.
The most commonly used replacement scheme is least recently used (LRU): the block replaced is the one that has been unused for the longest time. In a 2-way set-associative cache this can be implemented with a single bit, which identifies which of the two blocks has been used more recently. As associativity increases, implementing LRU gets harder.
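A C sketch of a 2-way set-associative lookup with the single LRU bit (the number of sets, the one-word blocks and the names are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256        /* assumed: 2-way, 256 sets, one-word blocks */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data;
} way_t;

typedef struct {
    way_t way[2];
    int   lru;              /* single bit: index of the least recently used way */
} set_t;

static set_t cache[NUM_SETS];

/* Set index = (block number) modulo (number of sets); in hardware both
   ways of the selected set are compared in parallel (here, sequentially). */
bool lookup(uint32_t addr, uint32_t *out) {
    uint32_t block = addr >> 2;             /* word-sized blocks */
    uint32_t set   = block % NUM_SETS;
    uint32_t tag   = block / NUM_SETS;
    for (int w = 0; w < 2; w++) {
        if (cache[set].way[w].valid && cache[set].way[w].tag == tag) {
            *out = cache[set].way[w].data;
            cache[set].lru = 1 - w;         /* the other way is now the LRU one */
            return true;
        }
    }
    return false;                           /* miss: the victim is cache[set].lru */
}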
Another way of reducing the miss penalty is to add a further level to the hierarchy, a technique called multilevel caching. The second-level cache handles the primary cache misses and is 10 or more times larger than the primary cache.
Virtual Memory
The main memory can act as a "cache" for the secondary storage, using a technique called virtual memory. There are two main reasons to use virtual memory: to allow efficient and safe sharing of memory among multiple programs, and to remove the programming burden of a small, limited amount of main memory.
We cannot know, when we compile a program, which other programs will share memory with it, so we compile each program into its own address space. Virtual memory implements the translation of a program's address space to physical addresses.
A virtual memory block is called a page and a virtual memory miss is called a page fault. A page fault stops the execution of the program, because it takes millions of clock cycles to be resolved. With virtual memory, the processor produces a virtual address, which is translated by hardware and software into a physical address, which can then be used to access main memory (address mapping).
Fig 25: Virtual memory
In a virtual memory, the address is broken into a virtual page number and a page offset. The number of bits in the page-offset field determines the page size. The page size is the result of a tradeoff: bigger pages are better for locality but worse for copy time.
It is important to notice that the write-through technique does not work for virtual memory, because of the very long time needed to access the lower level; write-back is used instead.
In a virtual memory system, pages are located by using a table that indexes the memory: the page table. It resides in memory and is indexed with the page number from the virtual address to find the corresponding physical page number. Each program has its own page table. Because the page table contains a mapping for every possible virtual page, no tags are required (it is fully associative).
Each row of the page table contains a validity bit: if it is 0, the page is not present in physical memory. We also need to know whether the page has been modified, because when it is replaced the changes must be saved (if the dirty bit is 0, there is no need to copy it back). Another field required is the Use bit, needed to manage page replacement (as in the cache). If the valid bit of the page table entry is 0, a page fault occurs, an exception is raised and the operating system is given control.
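A minimal C sketch of a page table entry with these bits and of the translation step (the page size and the field names are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS 12                 /* assumed 4 KB pages */

typedef struct {
    bool     valid;                  /* page present in physical memory */
    bool     dirty;                  /* page modified since it was loaded */
    bool     used;                   /* reference bit for page replacement */
    uint32_t phys_page;              /* physical page number */
} pte_t;

/* One page table per process, indexed directly by the virtual page
   number, so no tag is needed (it behaves as a fully associative map). */
uint32_t translate(pte_t page_table[], uint32_t vaddr, bool *page_fault) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (!page_table[vpn].valid) {
        *page_fault = true;          /* exception: the OS takes control */
        return 0;
    }
    page_table[vpn].used = true;
    *page_fault = false;
    return (page_table[vpn].phys_page << PAGE_BITS) | offset;
}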
Fig 27: Page table
Since the page tables are stored in main memory, every memory access by a program can take at least twice as long: one memory access to obtain the physical address and a second to get the data. The key to improving access performance is to rely on locality of reference to the page table. Modern processors include a special cache that keeps track of recently used translations: the translation-lookaside buffer (TLB), a kind of translation cache.
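A simple C sketch of the idea: the TLB is checked first, and the in-memory page table is walked only on a TLB miss (the TLB size and the replacement choice are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16

typedef struct {
    bool     valid;
    uint32_t vpn;           /* virtual page number (acts as the tag) */
    uint32_t phys_page;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Try the TLB first; only on a TLB miss is the page table in memory
   walked (represented here by a callback), and the translation is
   cached for the next access to the same page. */
uint32_t lookup_phys_page(uint32_t vpn, uint32_t (*walk_page_table)(uint32_t)) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].phys_page;               /* TLB hit: no extra memory access */

    uint32_t ppn = walk_page_table(vpn);           /* TLB miss: extra memory access */
    int victim = vpn % TLB_ENTRIES;                /* simple replacement (assumed) */
    tlb[victim] = (tlb_entry_t){ true, vpn, ppn };
    return ppn;
}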
On a page fault, the operating system must:
- Look up the page table entry using the virtual address and find the location of the referenced page on disk.
- Choose a physical page to replace (usually with LRU), paying attention to the dirty bit (if it is 1, the page must be copied back).
- Start a read to bring the referenced page from the disk into the chosen physical page.
Writing a page from disk to memory would waste a huge number of CPU clock cycles, so DMA (Direct Memory Access) is used to avoid it. The DMA manages the relation between the virtual space and the physical space (it manages paging to and from disk). It is an interface between disk and memory and is transparent to the CPU. In order to transfer pages, the DMA takes possession of the address bus, which can delay the CPU.
MMU
A memory management unit (MMU) is a hardware unit through which all memory references pass; its primary job is translating virtual memory addresses into physical addresses. An MMU effectively performs virtual memory management, handling at the same time memory protection, cache control and bus arbitration. It is independent of the CPU and manages page faults, taking care that updates to the state do not get lost. The MMU is the architectural support for the OS, providing information about interrupts (e.g. the address affected by a page fault).
Every process has a unique identifier, and every page of the process has a mapping to a physical page in memory. The MMU holds a pointer to the virtual address of the page that generated the page fault and passes this address to the OS. Who knows whether there is room in main memory to load a page? The descriptor of the main memory, which is part of the OS. It resides in a dedicated part of memory (together with the other parts of the OS), called the resident OS, which processes cannot access.
The MMU also manages the interaction between processes in memory. If a process tries to access the wrong page table, or addresses that are not in its part of memory, the MMU intervenes. For each process there is an upper and a lower bound of addresses in physical memory; the MMU generates the address and checks whether it is inside the bounds, otherwise it raises an exception.
The MMU also checks whether the address is in the cache; if not, it asks memory. The CPU has to wait until the data is present in the cache, so the MMU disables the clock.
ARM
RISC
RISC was introduced at Berkeley in 1980, and its design was much simpler than the commercial CISC processors of the time. The RISC I had the following key features:
- A fixed instruction size (32 bits); CISC processors typically had variable-length instructions.
- Operands of data processing instructions must be in registers; on CISC they could be in memory.
- 32 general-purpose registers.
These differences simplified the design and allowed the implementation of the architecture using
organizational features that contributed to the performance of the prototype devices:
- Hard-wired instruction decode logic (instead of ROM, as in CISC);
- Pipelined execution;
- Single-cycle execution (in CISC an instruction can take many clock cycles to complete).
So, RISC advantages are:
- A simpler processor requires fewer transistors;
- A shorter development time;
- A higher performance (pipelining and high clock rate).
RISC also has drawbacks, the main one being poorer code density than CISC. To address it, ARM has incorporated a novel mechanism, called the Thumb architecture, into some versions of ARM processors. The Thumb instruction set is a 16-bit compressed form of the original 32-bit ARM instruction set and employs dynamic decompression hardware in the instruction pipeline.
ARM Architecture
The ARM architecture incorporated a number of features from the Berkeley RISC design, such as:
- A load-store architecture;
- Fixed-length 32-bit instructions;
- 3-address instruction format.
The key word of ARM design is simplicity. The combination of the simple hardware with an instruction set
that is grounded in RISC ideas but retains a few key CISC features, and thereby achieves a significantly better
code density than a pure RISC, has given the ARM its power-efficiency and its small core size.
Memory system
When writing user-level programs, only the 15 general-purpose 32-bit registers (r0 to r14), the program counter
(r15) and the current program status register (CPSR) need be considered. The remaining registers are used only
for system-level programming and for handling exceptions.
Instructions
ARM employs a load-store architecture, so the instruction set only processes values which are in registers. In fact, ARM does not support memory-to-memory operations. Therefore, all ARM instructions fall into one of the following three categories: data processing instructions, data transfer instructions and control flow instructions.
The most notable features of the ARM instruction set are:
- The load-store architecture;
- 3-address data processing instruction;
- Conditional execution of every instruction;
- Multiple register load and store instructions;
- General shift operations.
Supervisor mode
The ARM processor supports a protected supervisor mode. The protection mechanism ensures that user code
cannot gain supervisor privileges without appropriate checks. The upshot of this for the user-level programmer
is that system-level functions can only be accessed through specified supervisor calls.
I/O
The ARM handles I/O peripherals as memory-mapped devices with interrupt support. Two kinds of interrupt
are present: IRQ (normal interrupt) and FIQ (fast interrupt).
All operands are 32 bits wide and come from registers or are specified as literals in the instruction itself. The result is 32 bits wide (except for long multiplies) and is placed in a register.
Instead of using two source registers, we can add a constant by replacing the second operand with an immediate value (not allowed for mul). A third way to specify a data operation is similar to the first, but allows the second register operand to be subjected to a shift operation before being combined.
Any data processing instruction can set the condition code (NZCV).
Three kinds of operation are allowed: single register load and store instructions, multiple register load/store instructions and the single register swap instruction (a value in a register is exchanged with a value in memory).
The addressing modes for single register transfers are:
- Base register
- Base + offset (max 4 Kbytes)
Pre-index:  LDR r0, [r1, #4]   ; r0 = mem[r1+4]
Auto-index: LDR r0, [r1, #4]!  ; r0 = mem[r1+4], r1 = r1+4
Post-index: LDR r0, [r1], #4   ; r0 = mem[r1],   r1 = r1+4
When a multiple load/store instruction is used, memory is treated as a stack. A stack is usually implemented as a linear data structure which grows up through memory (an ascending stack) or down through memory (a descending stack) as data is added to it, and shrinks back as data is removed. The ARM multiple register transfer instructions support all four forms of stack:
- Full ascending: the stack grows up through increasing memory addresses and the base register
points to the highest address containing a valid item.
- Empty ascending: the stack grows up through increasing memory addresses and the base
register points to the first empty location above the stack.
- Full descending: the stack grows down through decreasing memory addresses and the base
register points to the lowest address containing a valid item.
- Empty descending: the stack grows down through decreasing memory addresses and the base
register points to the first empty location below the stack.
Figure 30: Stack modes
The load and store multiple are an efficient way to save and restore processor state. But they are not pure RISC.
The most common way to switch program execution from one place to another is to use the branch instruction. Sometimes we want the processor to decide whether or not to branch; this is called a conditional branch.
An unusual feature of the ARM instruction set is that conditional execution applies not only to branches but to all ARM instructions. A branch that is used to skip a small number of following instructions may be omitted by giving those instructions the opposite condition. If a programmer wants to call one of a set of subroutines, a jump table is used.
An important branch instruction is branch and link. It is used to branch to a subroutine in a way that makes it possible to resume the original code sequence when the subroutine has completed, by saving the return address in r14. Note that, since the return address is held in a register, the subroutine should not call a further, nested, subroutine without first saving r14; otherwise the new return address would overwrite the old one and it would not be possible to find the way back to the original caller. To return to the caller, a move instruction copies r14 into r15 (the PC).
Another important kind of branch is the supervisor call. The supervisor is a program which operates at a privileged level; a typical example is when a program requires input or output.
Multicore