Embedded Processors: Instruction Set Architecture (ISA)
EMBEDDED PROCESSORS
• Processors are the main functional units of an embedded board, and are primarily responsible for
processing instructions and data.
• An electronic device contains at least one master processor, acting as the central controlling device, and
can have additional slave processors that work with and are controlled by the master processor.
• These slave processors may either extend the instruction set of the master processor or act to manage
memory, buses, and I/O (input/output) devices.
Instruction Set Architecture (ISA)
• The ISA is the interface between hardware and software: it tells the hardware what needs to be done through software instructions.
• It allows the software to fully utilize the capabilities of the hardware.
The Role of the ISA:
What operation is to be performed?
Where are the operands stored?
Where is the result of the operation stored?
How many operands are needed to perform the operation?
What is the type of the operation?
What is the format of the operation?
What is the size of the operands?
What addressing modes are supported? How are the operands addressed?
Sample ISA operations: data movement (e.g., MOV, LOAD, STORE), arithmetic and logic (e.g., ADD, SUB, AND, OR), and control flow (e.g., JMP, CALL).
Operands
Operands are the data that operations manipulate. An ISA defines the types and formats of operands for a particular
architecture.
For example, in the case of the MPC823 (Motorola/Freescale PowerPC), SA-1110 (Intel StrongARM), and many
other architectures, the ISA defines simple operand types of bytes (8 bits), halfwords (16 bits), and words (32 bits).
An ISA also defines the operand formats (how the data looks) that a particular architecture can support, such as
binary, decimal and hexadecimal.
Example:
MOV registerX, 10d ; Move decimal value 10 into register X
Storage:
The ISA specifies the features of the programmable storage used to store the data being operated on.
1. Memory is simply an array of programmable storage that stores data, including operations, operands, and so on.
The indices of this array are locations referred to as memory addresses, where each location is a unit of memory that can be addressed separately.
An ISA defines specific characteristics of the address space, such as whether it is:
Linear. A linear address space is one in which specific memory locations are represented incrementally, typically starting at “0”.
Segmented. In a segmented address space, specific memory locations can only be accessed by specifying a segment identifier (a segment number that can be explicitly defined or implicitly obtained from a register) and an offset within that segment.
Each segment is defined by a base address and a limit, which map to a portion of memory set up as a linear address space. If the offset is less than or equal to the limit, the offset is added to the base address, giving the unsegmented address within the linear address space.
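To make the base-plus-offset check concrete, here is a minimal C sketch of segmented-to-linear address translation; the segment_descriptor type, its field names, and the example base/limit values are illustrative assumptions, not taken from any particular architecture.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative segment descriptor: base and limit of one segment. */
    typedef struct {
        uint32_t base;   /* start of the segment in the linear address space */
        uint32_t limit;  /* largest valid offset within the segment */
    } segment_descriptor;

    /* Returns the linear address, or -1 on a segment violation. */
    int64_t translate(segment_descriptor seg, uint32_t offset)
    {
        if (offset > seg.limit)
            return -1;                     /* offset exceeds the limit */
        return (int64_t)seg.base + offset; /* offset added to the base */
    }

    int main(void)
    {
        segment_descriptor data_seg = { 0x4000, 0x0FFF }; /* hypothetical values */
        printf("linear = 0x%llx\n", (long long)translate(data_seg, 0x0100)); /* 0x4100 */
        return 0;
    }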
An important note regarding ISAs and memory is that different ISAs not only define where data is stored, but also how data is stored in memory, specifically the order in which the bytes that make up the data are stored, or byte ordering.
The two byte-ordering approaches are big-endian, in which the most significant byte is stored first (at the lowest address), and little-endian, in which the least significant byte is stored first.
Example:
68000 and SPARC are big-endian
x86 is little-endian
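One common way to observe byte ordering in software is to reinterpret the first byte of a multi-byte value. A minimal C sketch (the test pattern 0x01020304 is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t word = 0x01020304;
        uint8_t *first_byte = (uint8_t *)&word; /* byte at the lowest address */

        if (*first_byte == 0x01)
            printf("big-endian: most significant byte stored first\n");
        else
            printf("little-endian: least significant byte stored first\n");
        return 0;
    }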
2. Register Set
A register is simply fast programmable memory normally used to store operands that are immediately or
frequently used.
A processor’s set of registers is commonly referred to as the register set or the register file.
Different processors have different register sets, and the number of registers in their sets varies from very few to several hundred (even over a thousand).
3. How Registers Are Used:
An ISA defines which registers can be used only for specific transactions (special-purpose registers, such as floating-point registers) and which can be used by the programmer in a general fashion (general-purpose registers).
Addressing Modes
Addressing modes define how the processor can access operand storage. In fact, the
usage of registers is partly determined by the ISA’s Memory Addressing Modes.
The two most common types of addressing mode models are:
Load-Store Architecture, which only allows operations to process data in registers, not anywhere else in memory.
For example, the PowerPC architecture has only one addressing mode for load and store instructions: register
plus displacement (supporting register indirect with immediate index, register indirect with index, etc.).
Register-Memory Architecture, which allows operations to be processed both within registers and other types of memory. Intel’s i960 Jx processor is an example of an architecture based upon the register-memory model (supporting absolute, register indirect, etc.).
Interrupts and Exception Handling:
Interrupts (also referred to as exceptions or traps depending on the type) are mechanisms that stop the standard
flow of the program in order to execute another set of code in response to some event, such as problems with the
hardware, resets, and so forth.
The ISA defines what type of hardware support, if any, a processor has for interrupts.
ISA models:
1. Application-Specific ISA Models: these define processors that are intended for specific embedded applications, such as processors made only for TVs. They include the following models.
Controller Model: the Controller ISA is implemented in processors that are not required to perform complex data manipulation, such as video and audio processors used as slave processors on a TV board.
Datapath Model: the Datapath ISA is implemented in processors whose purpose is to repeatedly perform fixed computations on different sets of data.
Example: Digital Signal Processors (DSPs)
Finite State Machine with Datapath (FSMD) Model: the FSMD ISA combines the Datapath ISA and the Controller ISA, for processors that are not required to perform complex data manipulation but must repeatedly perform fixed computations on different sets of data.
Examples: Application-Specific Integrated Circuits (ASICs), programmable logic devices (PLDs), and field-programmable gate arrays (FPGAs)
FSMD
An FSMD (finite state machine with data path) combines an FSM and regular sequential circuits.
The FSM, which is sometimes known as a control path, examines the external commands and
status and generates control signals to specify operation of the regular sequential circuits, which are
known collectively as a data path. Algorithms described as RT (register transfer) operations, in which the operations are specified as data manipulation and transfer among a collection of registers, can be converted to an FSMD and realized in hardware.
Java Virtual Machine (JVM) Model:
The JVM ISA is based upon one of the Java Virtual Machine standards; real-world JVMs can be implemented in an embedded system via hardware.
Example : aJile’s aj-80 and aj-100 processors.
2. General-Purpose ISA models
Complex Instruction Set Computing (CISC) Model
Reduced Instruction Set Computing (RISC) Model
The fetch, decode, and execute cycles of instruction processing are implemented through some combination of four major CPU components:
The arithmetic logic unit (ALU) – implements the ISA’s operations
Registers – a type of fast memory
The control unit (CU) – manages the entire fetching and execution cycle
The internal CPU buses – interconnect the ALU, registers, and the CU
The MPC860 CPU – the PowerPC core
Internal CPU Buses
The CPU buses are the mechanisms that interconnect the ALU, the CU, and
registers
Buses are simply wires that interconnect the various other components
within the CPU.
Each bus’s wires are typically divided by logical function, such as:
Data: carries data, bidirectionally, between registers and the ALU
Address: carries the locations of the registers that contain the data to be transferred
Control: carries control signal information, such as timing and control signals, between the registers, the ALU, and the CU
In the PowerPC Core, there is a Control Bus that carries the control
signals between the ALU, CU, and registers.
In the PowerPC, source buses are the data buses that carry data between registers and the ALU.
There is an additional bus, called the write-back bus, dedicated to writing data received from a source bus directly back from the load/store unit to the fixed-point or floating-point registers.
ALU
The arithmetic logic unit (ALU) implements the comparison, mathematical and logical
operations defined by the ISA.
The format and types of operations implemented in the ALU can vary depending on the ISA.
The ALU is responsible for accepting multiple n-bit binary operands and performing any
logical (AND, OR, NOT, etc.), mathematical (+, –, *, etc.), and comparison (=, <, >, etc.)
operations on these operands.
In the PowerPC core, the ALU is part of the “Fixed Point Unit” that implements all fixed-point
instructions other than load/store instructions.
The ALU is responsible for fixed-point logic, add, and subtract instruction implementation.
In the case of the PowerPC, generated results of the ALU are stored in an Accumulator.
The PowerPC has an IMUL/IDIV unit (essentially another ALU) specifically for performing
multiplication and division operations.
Registers
Registers are simply a combination of various flip-flops that can be used to
temporarily store data or to delay signals.
A storage register is a form of fast programmable internal processor memory usually used to temporarily store, copy, and modify operands that are immediately or frequently used by the system.
Shift registers delay signals by passing the signals between the various internal flip-
flops with every clock pulse.
ISA designs do not use all registers in the same way to process data; register usage typically falls into one of two categories, general purpose or special purpose.
General purpose registers can be used to store and manipulate any type of data
determined by the programmer.
Special purpose registers can only be used in a manner specified by the ISA, including:
holding results for specific types of computations
holding predetermined flags (single bits within a register that can act and be controlled independently)
acting as counters (registers that can be programmed to change state, that is, increment, asynchronously or synchronously after a specified length of time)
controlling I/O ports (registers managing the external I/O pins connected to the body of the processor and to board I/O; see the sketch below)
Shift registers are inherently special purpose, because of their limited functionality.
The PowerPC Core has a “Register Unit” which contains all registers visible to a user.
PowerPC processors generally have two types of registers: general-purpose and special-
purpose (control) registers.
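For example, controlling an I/O port usually means reading and writing a special-purpose register mapped into the address space. A minimal C sketch; the register address 0x40001000 and the bit layout are purely hypothetical:

    #include <stdint.h>

    /* Hypothetical memory-mapped port data register. */
    #define PORT_DATA (*(volatile uint8_t *)0x40001000u)

    int main(void)
    {
        PORT_DATA |= (1u << 3);   /* drive external pin 3 high */
        PORT_DATA &= ~(1u << 3);  /* drive external pin 3 low */
        return 0;
    }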
Control Unit (CU)
The control unit (CU) is primarily responsible for generating timing signals, as well as controlling and
coordinating the fetching, decoding, and execution of instructions in the CPU.
After the instruction has been fetched from memory and decoded, the control unit determines what operation will be performed by the ALU, and selects and writes signals appropriate to each functional unit within or outside of the CPU (e.g., memory, registers, the ALU).
The PowerPC core’s CU is called the “sequencer unit” and is the heart of the PowerPC core. The sequencer unit:
manages the continuous cycle of fetching, decoding, and executing instructions;
provides the central control of the data and instruction flow among the other major units within the PowerPC core (CPU), such as the registers, ALU, and buses;
implements the basic instruction pipeline, fetching instructions from memory and issuing them to available execution units;
maintains a state history for handling exceptions.
The CPU and the System (Master) Clock
A processor’s execution is ultimately synchronized by an external system or master clock, located on the board.
The master clock is an oscillator together with a few other components, such as a crystal.
It produces a fixed-frequency sequence of regular on/off pulse signals.
Processor Performance
There are several measures of processor performance, but all are based upon the processor’s behaviour over a given length of time. One of the most common definitions of processor performance is a processor’s throughput, the amount of work the CPU completes in a given period of time.
CPU throughput = 1 / CPU execution time = CPU performance
CPU execution time (in seconds per program) = (total number of instructions per program, or instruction count) * (CPI, the average number of clock cycles per instruction) * (clock period, in seconds per cycle)
= (instruction count * CPI) / (clock rate, in Hz)
CPI can be determined as CPI = Σ (CPI per instruction type * instruction frequency)
Other definitions of performance besides throughput include:
A processor’s responsiveness, or latency, which is the length of elapsed time a processor takes to
respond to some event.
A processor’s availability, which is the amount of time the processor runs normally without failure;
reliability, the average time between failures or MTBF (mean time between failures); and
recoverability, the average time the CPU takes to recover from failure or MTTR (mean time to
recover).
One of the most common performance measures used for processors in the embedded market is
millions of instructions per seconds or MIPS.
MIPS = instruction count / (CPU execution time * 10^6) = clock rate / (CPI * 10^6)
The MIPS performance measure gives the impression that faster processors have higher MIPS values,
since part of the MIPS formula is inversely proportional to the CPU’s execution time.
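As a worked example of these formulas, the following C sketch computes execution time and MIPS from made-up numbers (2 million instructions, an average CPI of 2, a 100 MHz clock); the figures are illustrative only.

    #include <stdio.h>

    int main(void)
    {
        double instruction_count = 2e6;
        double cpi = 2.0;             /* average clock cycles per instruction */
        double clock_rate_hz = 100e6; /* 100 MHz */

        /* execution time = (instruction count * CPI) / clock rate */
        double exec_time = instruction_count * cpi / clock_rate_hz;
        /* MIPS = clock rate / (CPI * 10^6) */
        double mips = clock_rate_hz / (cpi * 1e6);

        printf("execution time = %.3f s\n", exec_time); /* 0.040 s */
        printf("MIPS = %.1f\n", mips);                  /* 50.0 */
        return 0;
    }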
Memory : Basic Concepts
Stores large number of bits
m x n: m words of n bits each
k = log2(m) address input signals
m = 2^k words
e.g., 4,096 x 8 memory:
32,768 bits
12 address input signals
8 input/output data signals
Memory access
r/w: selects read or write
enable: read or write only when asserted
multiport: multiple accesses to different locations simultaneously
Traditional ROM/RAM distinctions
ROM
read only, bits stored without power
RAM
read and write, lose stored bits without power
Other Differences
Advanced ROMs can be written to, e.g., EEPROM
Advanced RAMs can hold bits without power, e.g., NVRAM
Write ability: the manner and speed with which a memory can be written
Storage permanence: the ability of memory to hold stored bits after they are written
Ranges of write ability
High end
processor writes to memory simply and quickly
e.g., RAM
Middle range
processor writes to memory, but slower
e.g., FLASH, EEPROM
Lower range
special equipment, “programmer”, must be used to write to memory
e.g., EPROM, OTP ROM
Low end
bits stored only during fabrication
e.g., Mask-programmed ROM
In-system programmable memory
Can be written to by a processor in the embedded system using the memory
Memories in high end and middle range of write ability
Range of storage permanence
High end
essentially never loses bits
e.g., mask-programmed ROM
Middle range
holds bits days, months, or years after memory’s power source turned off
e.g., NVRAM
Lower range
holds bits as long as power supplied to memory
e.g., SRAM
Low end
begins to lose bits almost immediately after written
e.g., DRAM
Nonvolatile memory
Holds bits after power is no longer supplied
High end and middle range of storage permanence
ROM
Nonvolatile memory
Can be read from but not written to, by a processor in an embedded system
Traditionally written to, “programmed”, before inserting to embedded system
Uses
Store software program for general-purpose processor
program instructions can be one or more ROM words
Store constant data needed by system
Implement combinational circuit
Example: 8x4 ROM
Horizontal lines = words
Vertical lines = data
Lines connected only at circles
Decoder sets word 2’s line to 1 if address input is 010
Data lines Q3 and Q1 are set to 1 because there is a
“programmed” connection with word 2’s line
Word 2 is not connected with data lines Q2 and Q0
Output is 1010
Implementing combinational function
Any combinational circuit of n functions of the same k variables can be implemented with a 2^k x n ROM, as sketched below.
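In software terms, such a ROM behaves like a constant lookup table. A minimal C sketch of the 8x4 ROM example above, with word 2 programmed to 1010 and the remaining words left as arbitrary placeholders:

    #include <stdint.h>
    #include <stdio.h>

    /* 8x4 ROM: 3 address bits select one of 8 four-bit words. */
    static const uint8_t rom[8] = {
        0x0, 0x0, 0xA, 0x0,  /* word 2 = 1010 (binary) */
        0x0, 0x0, 0x0, 0x0
    };

    int main(void)
    {
        uint8_t address = 2;                   /* address input 010 */
        printf("output = %X\n", rom[address]); /* prints A, i.e., 1010 */
        return 0;
    }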
Most common types of on-chip ROM include:
MROM (mask ROM), which is ROM (with data content) that is permanently etched into the
microchip during the manufacturing of the processor, and cannot be modified later.
PROMs (programmable ROM), or OTPs (one-time programmable ROM), can be integrated on-chip and are programmable only once, by a PROM programmer (in other words, they can be programmed outside the manufacturing factory).
EPROM (erasable programmable ROM) can be integrated on a processor, and its content can be erased and reprogrammed more than once (the number of times erasure and re-use can occur depends on the processor). The content of EPROM is written using special separate devices, and is erased, in its entirety, by a device that shines intense ultraviolet light through the processor’s built-in window.
EEPROM (electrically erasable programmable ROM) can be erased and reprogrammed more than once; the number of times erasure and re-use can occur depends on the processor. Unlike EPROMs, the content of EEPROM can be written and erased electrically, without any special devices, while the embedded system is functioning, and erasure can be done selectively (down to the byte level) rather than only in its entirety.
RAM: “Random-access” memory
Typically volatile memory
bits are not held without power supply
Read and written to easily by embedded system during execution
Internal structure more complex than ROM
a word consists of several memory cells, each storing 1 bit
each input and output data line connects to each cell in its column
rd/wr connected to every cell
when row is enabled by decoder, each cell has logic that stores input
data bit when rd/wr indicates write or outputs stored bit when rd/wr
indicates read
RAM (random access memory) is main memory in which any location can be accessed directly (randomly, rather than sequentially from some starting point) and whose content can be changed more than once (the number depending on the hardware).
Unlike ROM, contents of RAM are erased if RAM loses power, meaning RAM is volatile.
Basic types of RAM
SRAM: Static RAM
Memory cell uses flip-flop to store bit
Requires 6 transistors
Holds data as long as power supplied
DRAM: Dynamic RAM
Memory cell uses MOS transistor and capacitor to store bit
More compact than SRAM
“Refresh” required due to capacitor leak
word’s cells refreshed when read
RAM variations
PSRAM: Pseudo-static RAM
DRAM with built-in memory refresh controller
Popular low-cost high-density alternative to SRAM
NVRAM: Nonvolatile RAM
Holds data after external power removed
Battery-backed RAM
SRAM with own permanently connected battery
writes as fast as reads
no limit on number of writes unlike nonvolatile ROM-based memory
SRAM with EEPROM or flash
stores complete RAM contents on EEPROM or flash before power turned off
Composing memory:
Memory size needed often differs from size of readily available memories
When available memory is larger, simply ignore unneeded high-order address bits and
higher data lines
When available memory is smaller, compose several smaller memories into one larger
memory
Connect side-by-side to increase width of words
Connect top to bottom to increase number of words
added high-order address line selects smaller memory containing desired word using
a decoder
Combine the two techniques to increase both the number and the width of words, as sketched below
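The decoder-based selection can be expressed as division and remainder on the full address: the high-order bits pick the chip, the low-order bits address a word within it. A minimal C sketch, assuming hypothetical 2K-word chips:

    #include <stdint.h>
    #include <stdio.h>

    #define CHIP_WORDS 2048u  /* each chip holds 2K words (11 address bits) */

    int main(void)
    {
        uint16_t address = 0x1A07;                   /* full composed address */
        unsigned chip_select = address / CHIP_WORDS; /* decoder output */
        unsigned offset      = address % CHIP_WORDS; /* lines wired to every chip */

        printf("chip %u, offset 0x%03X\n", chip_select, offset); /* chip 3, 0x207 */
        return 0;
    }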
Composing memory:
a. Increase width of words b. Increase number of words
[Figure: (a) four 8x4 RAMs placed side by side, sharing address lines A0-A2 and common RD/WR lines, with their data lines concatenated into a wider word; (b) high-order address lines drive a 2x4 decoder whose outputs Y0-Y3 drive the chip selects (CS), so only one smaller memory responds to a given address.]
Memory Map for previous example
There are three address lines connected to the address-selection circuit, so there can be eight different memory-map configurations. Three possible memory-map configurations were illustrated, one of which is sketched below.
[Figure: four RAM chips share data lines D7-D0, address line A0, and common RD/WR lines; address lines A11 and A12 feed the A and B inputs of a 2x4 decoder whose outputs Y0-Y3 drive the chip selects, with A13-A15 gating the decoder's enable (CS).]
Example: Design 16K words of memory using 4Kbytes of RAM using appropriate control signals
and show the address range of each memory block.
First Step:
Combine your 2K x 8 ROMs into a 2K x 32 ROM (this requires four 2K x 8 ROM ICs per 2K x 32 unit). The address inputs are common and need to be connected in parallel. The data outputs are kept separate to form the 32 lines required. Don't forget there are also control lines, usually a chip enable and a read line (usually active LOW), but check the specs.
Second Step:
This involves combining four 2K x 32 bit ROM units. The input ADDRESS LINES (A0 - A10) are connected together in parallel. The OUTPUT DATA lines are also connected together in parallel. This just leaves the problem of the CONTROL LINES. The READ line is simply common, as you want the ROM to output the data with a single 'read' signal. The CHIP ENABLE lines are used as an extra ADDRESS signal to ensure that only ONE 2K x 32 bit block is addressed at any given time. Input addresses A11 and A12 give the full 8K address range for the ROM; a 2-to-4 line decoder converts these address lines into CHIP ENABLE selections.
On-Chip Memory
Embedded platforms have a memory hierarchy, a collection of different types of memory, each with unique speeds, sizes,
and usages.
Some of this memory can be physically integrated on the processor, such as registers, read-only memory (ROM), certain
types of random access memory (RAM), and level-1 cache.
1. Registers
CPU registers are at the topmost level of this hierarchy; they hold the most frequently used data. They are very limited in number and are the fastest.
They are often used by the CPU and the ALU for performing arithmetic and logical
operations, for temporary storage of data.
2. Cache
The very next level consists of small, fast cache memories near the CPU. They act as
staging areas for a subset of the data and instructions stored in the relatively slow main
memory.
There are often two or more levels of cache as well. The cache at the topmost level after the registers is the primary (level-1) cache; the others are secondary caches.
Often cache is present on-chip with the CPU, along with other levels that are outside the chip.
3. Main Memory:
The next level is the main memory (also called primary memory), which stages data stored on large, slow disks, often called hard disks. These hard disks, the last level in the hierarchy, are also called secondary memory.
The secondary memory in turn often serves as a staging area for data stored on the disks or tapes of other machines connected by networks.
Cache
Cache is a small amount of fast, expensive memory.
The cache goes between the processor and the slower, dynamic main memory.
It keeps a copy of the most frequently used data from the main memory.
Memory access speed increases overall, because we’ve made the common case faster.
Reads and writes to the most frequently used addresses will be serviced by the cache.
We only need to access the slower main memory for less frequently used data.
The principle of locality
It’s usually difficult or impossible to figure out what data will be “most frequently accessed” before a program actually runs, which makes it hard to know what to store in the small, precious cache memory.
But in practice, most programs exhibit locality, which the cache can take
advantage of.
The principle of temporal locality says that if a program accesses one memory
address, there is a good chance that it will access the same address again.
The principle of spatial locality says that if a program accesses one memory address, there is a good chance that it will also access other nearby addresses.
Because locality of reference holds in computer systems, instead of fetching a single instruction, a group of instructions is sent from primary memory to the cache. These groups of instructions are called blocks.
The entire primary memory is divided into equal-sized blocks, and cache memory is divided into blocks of the same size. For placing primary memory blocks into cache memory blocks, 3 different mapping techniques are available.
Cache memory mapping techniques:
Direct Mapping
Associative Mapping
Set – Associative Mapping
Direct Mapping:
The direct mapping concept is: if the ith block of main memory has to be placed at the jth block of cache memory, then the mapping is defined as:
j = i % (number of blocks in cache memory)
Let’s see this with the help of an example.
Suppose, there are 4096 blocks in primary memory and 128 blocks in the cache memory.
Then the situation is like if I want to map the 0th block of main memory into the cache
memory, then I apply the above formula and I get:
0 % 128 = 0
So, the 0th block of main memory is mapped to the 0th block of cache memory. Here, 128 is the
total number of blocks in the cache memory.
1 % 128 = 1
2 % 128 = 2
A small sketch of this computation follows.
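This is a minimal C sketch of the modulo computation, using the 128-block cache from the example:

    #include <stdio.h>

    #define CACHE_BLOCKS 128

    int main(void)
    {
        int main_blocks[] = { 0, 1, 2, 128, 256 };
        for (int k = 0; k < 5; k++) {
            int i = main_blocks[k]; /* main memory block number */
            printf("main block %4d -> cache block %d\n", i, i % CACHE_BLOCKS);
        }
        return 0;
    }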
Associative Mapping:
In the direct cache memory mapping technique, the problem was that every block of main memory mapped to one fixed block of cache memory. The major drawback was a high conflict-miss rate: we had to replace a cache memory block even when other blocks in the cache memory were empty.
Suppose I have already loaded the 0th block of main memory into the 0th block of cache memory using the direct mapping technique, and the next block I need is block 128. Even if blocks 1, 2, 3, ... of cache memory are all empty, I still have to map block 128 of main memory to the 0th block of cache memory, since
128 % 128 = 0
Therefore, I have to evict the previously loaded 0th block of main memory to make room for block 128. That is the reason for the high conflict-miss rate: a cache block must be replaced even when other cache blocks are empty. To overcome this drawback of the direct mapping technique, the associative mapping technique was introduced.
The idea of the associative mapping technique is to avoid the high conflict miss: any block of main memory can be placed anywhere in the cache memory. Associative mapping is the fastest and most flexible mapping technique.
Set Associative mapping:
Set-associative mapping is introduced to overcome the high conflict miss of the direct mapping technique and the large number of tag comparisons of associative mapping. In this cache memory mapping technique, the cache blocks are divided into sets. The set size is always a power of 2: if the cache has 2 blocks per set it is called 2-way set associative; similarly, if it has 4 blocks per set it is called 4-way set associative.
It basically means that instead of referring to a cache block directly, we refer to a particular set in the cache memory. The concept is that we map a particular block of main memory to a particular set of the cache, and within that set the block can be mapped to any of the available cache blocks.
Consider a system with 128 cache memory blocks and 4096 primary memory blocks.
Here we are considering 2 blocks in each set, or simply we are considering a 2-way set
associative process. Since there are 2 blocks in each set, so there will be total 64 sets in
our cache memory.
Here, to determine the set in which a main memory block will be placed, we use the rule: if the ith block of main memory has to be placed in the jth set of cache memory, then
j = i % (number of sets in cache)
After determining the set, the primary memory block may be placed in any block inside that set. A small sketch of this computation follows.
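This is a minimal C sketch of the set-index computation for the 64-set, 2-way example above:

    #include <stdio.h>

    #define NUM_SETS 64  /* 128 blocks / 2 blocks per set */

    int main(void)
    {
        int main_blocks[] = { 0, 64, 128, 4032 };
        for (int k = 0; k < 4; k++) {
            int i = main_blocks[k];
            /* the block may then occupy either of the 2 blocks in this set */
            printf("main block %4d -> set %d\n", i, i % NUM_SETS);
        }
        return 0;
    }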
Mapping of main memory address to cache memory:
Example: Consider a cache consisting of 128 blocks of 16 words each, for a total of 2048 (2K) words, and assume that the main memory is addressable by a 16-bit address. Main memory is 64K words, which will be viewed as 4K blocks of 16 words each.
1)Direct Mapping:
1) The simplest way to determine the cache location in which to store a memory block is the direct mapping technique.
2) Here block j of main memory maps onto block (j modulo 128) of the cache. Thus main memory blocks 0, 128, 256, ... are stored at cache block 0; blocks 1, 129, 257, ... are stored at cache block 1; and so on.
3) The placement of a block in the cache is determined from its memory address, which is divided into 3 fields; the lower 4 bits select one of the 16 words in a block.
4) When a new block enters the cache, the 7-bit cache block field determines the cache position in which this block must be stored.
5) The higher-order 5 bits of the block's memory address are stored in 5 tag bits associated with its location in the cache. They identify which of the 32 blocks that map onto this cache position is currently resident in the cache.
2) Associative Mapping:
1) This is a more flexible mapping method, in which a main memory block can be placed into any cache block position.
2) In this case, 12 tag bits are required to identify a memory block when it is resident in the cache.
3) The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is present. This is known as the associative mapping technique.
4) The cost of an associative-mapped cache is higher than that of a direct-mapped cache because of the need to search all 128 tag patterns to determine whether a block is in the cache. This is known as an associative search.
3) Set-Associative Mapping
1) This is a combination of the direct and associative mapping techniques.
2) Cache blocks are grouped into sets, and the mapping allows a block of main memory to reside in any block of a specific set. Hence the contention problem of direct mapping is eased; at the same time, the hardware cost is reduced by decreasing the size of the associative search.
3) For a cache with two blocks per set, memory blocks 0, 64, 128, ..., 4032 map into cache set 0, and they can occupy either block within this set.
4) Having 64 sets means that the 6-bit set field of the address determines which set of the cache might contain the desired block. The tag bits of the address must be associatively compared to the tags of the two blocks of the set to check if the desired block is present. This is a two-way associative search.
Examples on cache mapping:
Where would the byte from memory address 6146 be stored in a direct-mapped 2^10-block cache with 2^2-byte blocks?
6146 in binary is 00...01 1000 0000 0010. The lowest 2 bits (10) are the byte offset within the block, and the next 10 bits (10 0000 0000, i.e., 512) are the block index, so the byte is stored in cache block 512; the remaining high-order bits form the tag.
Example 1: A computer system uses 16-bit memory addresses. It has a 2K-byte cache organized in a direct-mapped manner with 64 bytes per cache block. Assume that the size of each memory word is 1 byte.
Calculate the number of bits in each of the Tag, Block, and Word fields of the memory address, first for the direct-mapped organization and then assuming the cache is organized as a 2-way set-associative cache. (This is worked out in problem 3 below.)
Average Memory Access Time:
Average Memory Access Time = (Hit Ratio x time to access memory on a hit) + ((1 - Hit Ratio) x time to access memory on a miss)
Miss Ratio = 1 - Hit Ratio
Example: Assume that for a certain processor, a read request takes 50 nanoseconds on a cache miss and 5 nanoseconds on a cache hit. Suppose that while running a program, it was observed that 80% of the processor's read requests result in a cache hit. What is the average read access time in nanoseconds?
Given:
Hit ratio = 0.8
Time taken in case of a hit = 5 ns
Time taken in case of a miss = 50 ns
Average memory access time = 0.8 x 5 + (1 - 0.8) x 50
= 4 + 0.2 x 50 = 4 + 10 = 14 ns
1). Given the following three cache designs, find the one with the best performance by calculating the
average cost of access. Show all calculations.
(a) 4 Kbyte, 8-way set-associative cache with a 6% miss rate; cache hit costs one cycle, cache miss costs
12 cycles.
(b) 8 Kbyte, 4-way set-associative cache with a 4% miss rate; cache hit costs two cycles, cache miss costs
12 cycles.
(c) 16 Kbyte, 2-way set-associative cache with a 2% miss rate; cache hit costs three cycles, cache miss
costs 12 cycles.
a) 4 Kbyte, 8-way set-associative cache with a 6% miss rate; cache hit costs 1 cycle, cache miss costs 12 cycles.
miss rate = .06
hit rate = 1 - miss rate = .94
.94 * 1 cycle (hit) + .06 * 12 cycles (miss) = .94 + .72 = 1.66 cycles avg.
b) 8 Kbyte, 4-way set-associative cache with a 4% miss rate; cache hit costs 2 cycles, cache miss costs 12 cycles.
miss rate = .04
hit rate = 1 - miss rate = .96
.96 * 2 cycles (hit) + .04 * 12 cycles (miss) = 1.92 + .48 = 2.4 cycles avg.
c) 16 Kbyte, 2-way set-associative cache with a 2% miss rate; cache hit costs 3 cycles, cache miss costs 12 cycles.
miss rate = .02
hit rate = 1 - miss rate = .98
.98 * 3 cycles (hit) + .02 * 12 cycles (miss) = 2.94 + .24 = 3.18 cycles avg.
BEST PERFORMANCE: a) 1.66 cycles avg.
2). Given a 2-level cache design where the hit rates are 88% for the smaller cache and 97% for the
larger cache, the access costs for a miss are 12 cycles and 20 cycles, respectively, and the access
cost for a hit is one cycle, calculate the average cost of access.
hit rate = .88
L1 miss/L2 hit rate = .12 * .97
L1miss/L2 miss rate = .12 * .03
Avg. cost = (.88 * 1) + (.12 * .97 * 12) + (.12 * .03 * 20)
= .88 + 1.3968 + .072
= 2.3488 cycles
3) A computer system uses 16-bit memory addresses. It has a 2K-byte cache organized 1) in a direct-mapped manner with 64 bytes per cache block, or 2) as a 2-way set-associative cache. Assume that the size of each memory word is 1 byte. Calculate the number of bits in each of the Tag, Block, and Word fields of the memory address.
1. For direct-mapped:
Block size = 64 bytes = 2^6 bytes = 2^6 words (since 1 word = 1 byte)
Therefore, number of bits in the Word field = 6
Cache size = 2K-byte = 2^11 bytes
Number of cache blocks = Cache size / Block size = 2^11 / 2^6 = 2^5
Therefore, number of bits in the Block field = 5
Total number of address bits = 16
Therefore, number of bits in the Tag field = 16 - 6 - 5 = 5
For a given 16-bit address, the 5 most significant bits represent the Tag, the next 5 bits represent the Block, and the 6 least significant bits represent the Word.
2. For 2-way set-associative:
Block size = 64 bytes = 2^6 bytes = 2^6 words
Therefore, number of bits in the Word field = 6
Cache size = 2K-byte = 2^11 bytes
Number of cache blocks per set = 2
Number of sets = Cache size / (Block size * Number of blocks per set) = 2^11 / (2^6 * 2) = 2^4
Therefore, number of bits in the Set field = 4
Total number of address bits = 16
Therefore, number of bits in the Tag field = 16 - 6 - 4 = 6
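The field widths derived above can be checked with shifts and masks. A minimal C sketch for the direct-mapped case (the test address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t address = 0x1234;               /* arbitrary 16-bit address */
        unsigned word  = address & 0x3F;         /* low 6 bits  */
        unsigned block = (address >> 6) & 0x1F;  /* next 5 bits */
        unsigned tag   = (address >> 11) & 0x1F; /* top 5 bits  */

        printf("tag = %u, block = %u, word = %u\n", tag, block, word);
        return 0;
    }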
Cache replacement policy
The cache-replacement policy is the technique for choosing which cache block to replace when a fully-associative
cache is full, or when a set-associative cache’s line is full. Note that there is no choice in a direct-mapped cache; a
main memory address always maps to the same cache address and thus replaces whatever block is already there.
There are three common replacement policies.
1. A random replacement policy chooses the block to replace randomly. While simple to implement, this policy does nothing to prevent replacing a block that's likely to be used again soon.
2. A least-recently used (LRU) replacement policy replaces the block that has not been accessed for the longest time,
assuming that this means that it is least likely to be accessed in the near future. This policy provides for an
excellent hit/miss ratio but requires expensive hardware to keep track of the times blocks are accessed.
3. A first-in-first-out (FIFO) replacement policy uses a queue of size N, pushing each block address onto the queue
when the address is accessed, and then choosing the block to replace by popping the queue.
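A minimal C sketch of LRU bookkeeping using per-block timestamps; real hardware uses cheaper approximations, and the 4-block, fully-associative cache here is an arbitrary assumption:

    #include <stdio.h>

    #define NUM_BLOCKS 4

    static int last_used[NUM_BLOCKS]; /* time of each block's last access */
    static int now;                   /* global access counter */

    void touch(int block)             /* record an access to a block */
    {
        last_used[block] = ++now;
    }

    int choose_victim(void)           /* block unused for the longest time */
    {
        int victim = 0;
        for (int b = 1; b < NUM_BLOCKS; b++)
            if (last_used[b] < last_used[victim])
                victim = b;
        return victim;
    }

    int main(void)
    {
        touch(0); touch(1); touch(2); touch(3);
        touch(0);                                      /* 0 is now most recent */
        printf("replace block %d\n", choose_victim()); /* prints 1 */
        return 0;
    }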
Cache write techniques
When we write to a cache, we must at some point update the memory. Such update
is only an issue for data cache, since instruction cache is read-only.
There are two common update techniques, write-through and write-back.
In the write-through technique, whenever we write to the cache, we also write to main memory, requiring the
processor to wait until the write to main memory completes. While easy to implement, this technique may
result in several unnecessary writes to main memory.
For example, suppose a program writes to a block in the cache, then reads it, and then writes it again, with the
block staying in the cache during all three accesses. There would have been no need to update the main
memory after the first write, since the second write overwrites this first write.
The write-back technique reduces the number of writes to main memory by writing a block to main
memory only when the block is being replaced, and then only if the block
was written to during its stay in the cache. This technique requires that we associate an
extra bit, called a dirty bit, with each block. We set this bit whenever we write to the
block in the cache, and we then check it when replacing the block to determine if we
should copy the block to main memory.
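A minimal C sketch of the dirty-bit logic; the cache_block structure and the print statements are illustrative stand-ins for hardware behavior:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        int  data;
        bool dirty; /* set when the block is written while in the cache */
    } cache_block;

    void write_block(cache_block *b, int value)
    {
        b->data = value;
        b->dirty = true;  /* mark the block as modified */
    }

    void replace_block(cache_block *b)
    {
        if (b->dirty)
            printf("writing block back to main memory\n");
        else
            printf("no write-back needed\n");
        b->dirty = false; /* the new occupant starts clean */
    }

    int main(void)
    {
        cache_block b = { 0, false };
        write_block(&b, 42);
        replace_block(&b); /* dirty, so it is written back */
        return 0;
    }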
Memory Management of External Memory
There are several different types of memory that can be integrated into a system, and there are also differences between how software running on the CPU views memory addresses (logical/virtual addresses) and the actual physical memory addresses (the two-dimensional, row-and-column array of storage). Memory managers are ICs designed to manage these issues.
The two most common types of memory managers found on an embedded board are memory controllers
(MEMC) and memory management units (MMUs).
A memory controller (MEMC) is used to implement and provide glueless interfaces to the different types of memory in the system, such as SRAM and DRAM, synchronizing access to memory and verifying the integrity of the data being transferred. Memory controllers access memory directly using the memory's own physical two-dimensional addresses. The controller manages requests from the master processor and accesses the appropriate banks, awaiting feedback and returning that feedback to the master processor. In some cases, where the memory controller mainly manages one type of memory, it may be referred to by that memory's name, such as DRAM controller, cache controller, and so forth.
Memory management units (MMUs) mainly allow for the flexibility in a system of having a larger virtual (abstract) memory space within an actual, smaller physical memory. An MMU can exist outside the master processor and is used to translate logical (virtual) addresses into physical addresses (memory mapping), as well as to handle memory security (memory protection), control caching, handle bus arbitration between the CPU and memory, and generate appropriate exceptions.
In the case of translated addresses, the MMU can use level-1 cache, or portions of cache allocated as buffers for caching address translations (commonly referred to as the translation lookaside buffer, or TLB), on the processor to store the mappings of logical addresses to physical addresses. MMUs must also support the various schemes for translating addresses, mainly segmentation, paging, or some combination of both; a paging sketch follows.
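A minimal C sketch of paged address translation such as an MMU might perform; the 4 KB page size and the page-table contents are illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12u               /* 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    /* Toy page table: virtual page number -> physical frame number. */
    static const uint32_t page_table[4] = { 7, 3, 0, 5 };

    uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpage  = vaddr >> PAGE_BITS;      /* high bits: page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1); /* low bits: page offset  */
        return (page_table[vpage] << PAGE_BITS) | offset;
    }

    int main(void)
    {
        uint32_t v = 0x00001234; /* virtual page 1, offset 0x234 */
        printf("physical = 0x%05X\n", translate(v)); /* 0x03234 */
        return 0;
    }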