The Memory Hierarchy (2) - The Cache: 8.1 Some Values
Many modern computers have more than one cache; it is common to find an
instruction cache together with a data cache, and in many systems the
caches are themselves hierarchically structured: most microprocessors on
the market today have an internal cache, with a size of a few KBytes, and
allow an external cache with a much larger capacity, tens to hundreds of
KBytes.
Freedom in placing a block into the cache ranges from absolute, when the
block can be placed anywhere in the cache, to zero, when the block has a
strictly predefined position:
• if the block can be placed anywhere in the cache, the cache is said
to be fully associative;
• if the block can be placed only in a restricted group of positions, the
cache is said to be set associative: the block is first mapped onto a set
and may then be placed anywhere within that set;
• if the block has exactly one possible position, the cache is said to be
direct mapped; the position is usually given by (block address) mod
(number of blocks in the cache).
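As a quick illustration, here is a minimal Python sketch of the three placement rules; the four-block cache and the block number 13 are the ones used in Figures 8.1 to 8.3 later in this chapter:

# Where may a given block of the lower level go in a 4-block cache?
num_blocks = 4          # blocks (lines) in the cache
num_sets   = 2          # sets, for the 2-way set associative case
block      = 13         # block number in the lower level of the hierarchy

print("direct mapped    : position", block % num_blocks)        # -> position 1
print("set associative  : set", block % num_sets)                # -> set 1, any way within it
print("fully associative: any of positions", list(range(num_blocks)))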
Transfers between the lower level of the memory and the cache occur in
blocks; for this reason we can see the memory address as divided into two
fields: the block-frame address, which identifies a block in the lower level
of the hierarchy, and the block offset, which identifies a byte inside the
block.
What is the size of the two fields in an address if the address size is 32 bits
and the block is 16 Byte wide?
Answer:
Assuming that the memory is byte addressable, 4 bits are necessary to
specify the position of the byte in the block. The other 28 bits in the
address identify a block in the lower level of the memory hierarchy.
An address whose block-frame field holds 3 and whose offset field holds 13
(i.e. address 3*16 + 13 = 61) refers to block number 3 in the lower level;
inside that block, byte number 13 will be accessed.
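This split can be written as a short Python sketch; the concrete address value 61 (block 3, byte 13) is chosen only to match the numbers above:

# Split a byte address into block-frame address and block offset,
# for 16-byte blocks (Example 8.1).
OFFSET_BITS = 4                  # log2(16) bytes per block
address = 61                     # block 3, byte 13

offset      = address & ((1 << OFFSET_BITS) - 1)   # low 4 bits
block_frame = address >> OFFSET_BITS               # the remaining (28) bits

print(block_frame, offset)       # -> 3 13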
For a cache that has a power of two number of blocks (say 2^m blocks), finding
the position in a direct mapped cache is trivial: the position (index) is
indicated by the last (the least significant) log2(2^m) = m bits of the
block-frame address.
For a set associative cache that has a power of two number of sets (say 2^k sets),
the set where a given block has to be mapped is indicated by the last (the
least significant) log2(2^k) = k bits of the block-frame address.
The address can be viewed as having three fields: the block-frame address
is split into two fields, the tag and the index, plus the block offset address:
A CPU has a 7 bit address; the cache has 4 blocks of 8 bytes each. The CPU
addresses the byte at address 107. Suppose this is a miss and show where
the corresponding block will be placed.
Answer:
(107)_10 = (1101011)_2
With an 8-byte block, the least significant three bits of the address (011) are
used to indicate the position of a byte within a block.
The most significant four bits ((1101)_2 = (13)_10) represent the block-frame
address, i.e. the number of the block in the lower level of the memory.
Because it is a direct mapped cache, the position of block number 13 in the
cache is given by:
13 mod 4 = 1
Hence block number 13 in the lower level of the memory hierarchy will
be placed in position 1 in the cache. This is precisely the same as using
the last log2(4) = 2 (least significant) bits of the block-frame address,
(01)_2, as the index.
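The same decomposition for Example 8.2, written as a small Python sketch (the field widths follow the example: 3 offset bits, 2 index bits, 2 tag bits):

# Example 8.2: 7-bit address, direct mapped cache with 4 blocks of 8 bytes.
OFFSET_BITS = 3      # log2(8) bytes per block
INDEX_BITS  = 2      # log2(4) blocks in the cache

address = 107                                   # (1101011) in binary
offset  = address & ((1 << OFFSET_BITS) - 1)    # -> 3  (011)
block   = address >> OFFSET_BITS                # -> 13 (block-frame address)
index   = block & ((1 << INDEX_BITS) - 1)       # -> 1  (13 mod 4)
tag     = block >> INDEX_BITS                   # -> 3  (11)

print(block, index, tag, offset)                # -> 13 1 3 3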
Figure 8.2 is a graphical representation of this example. Figures 8.1 and
8.3 are graphical representations of the same problem as in Example 8.2,
but for fully associative and set associative caches respectively.
Because the cache is smaller than the memory level below it, several blocks
will map to the same position in the cache; using Example 8.2 it is easy to
see that blocks number 1, 5, 9, and 13 will all map to the same position.
The question now is: how can we determine whether the block found at that
position in the cache is the one we are looking for?
Each line in the cache is augmented with a tag field that holds the tag field
of the address corresponding to the block stored there. When the CPU issues
an address, there are, possibly, several blocks in the cache that could
contain the desired information; the one chosen is the one whose tag matches
the tag field of the address issued by the CPU.
Figure 8.4 presents the same cache we had in Figures 8.1 to 8.3, augmented
with the tag fields. In the case of a fully associative cache all tags in the
cache must be checked against the address's tag field; this is because in a
fully associative cache blocks may be placed anywhere. Because the cache must
be very fast, the checking process must be done in parallel: all the cache's
tags are compared at the same time with the address tag field. For a set
associative cache there is less work than in a fully associative cache: there
is only one set in which the block can be; therefore only the tags of the
blocks in that set have to be compared against the address tag field.
If the cache is direct mapped, the block can have only one position in the
cache: only the tag of that block is compared with the address tag field.
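The amount of comparison work in the three organizations can be summarized in a few lines of Python; this is only a sketch, and the list tags, the variable ways and the helper functions are illustrative names, not part of the text:

# Which stored tags must be compared with the tag field of the address?
# tags[i] is the tag currently held by cache line i (arbitrary example contents).
tags = [3, 13, 7, 13]
ways = 2                      # lines per set, for the set associative case

def hit_fully_associative(tag):
    # any line may hold the block: all tags are compared (in hardware, in parallel)
    return any(t == tag for t in tags)

def hit_direct_mapped(tag, index):
    # only one possible line: a single comparison
    return tags[index] == tag

def hit_set_associative(tag, set_index):
    # only the lines of the selected set are compared
    start = set_index * ways
    return any(t == tag for t in tags[start:start + ways])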
Figure 8.6 presents the status of a four line, direct mapped cache, similar to
the one we had in Example 8.2, after a sequence of misses; suppose that
after reset (or power-on) the CPU issues the following sequence of reads at
addresses (in decimal notation): 78, 79, 80, 77, 109, 27, 81. Hits don't
change the state of the cache when only reads are performed; therefore
only the state of the cache after misses is presented in Figure 8.6. Below is
the binary representation of the addresses involved in the process
(tag | index | offset):
78  = 10 01 110
79  = 10 01 111
80  = 10 10 000
77  = 10 01 101
109 = 11 01 101
27  = 00 11 011
81  = 10 10 001
FIGURE 8.1 A fully associative, four block (line) cache connected to a 16-block lower level memory: block 13 can go anywhere in the cache.
FIGURE 8.2 A direct mapped, four block (line) cache connected to a 16-block lower level memory: block 13 can go only in position 1 (13 mod 4).
FIGURE 8.3 A set associative (two set) cache of four blocks connected to a 16-block lower level memory: block 13 goes to set 1 (13 mod 2); within set 1 it can occupy any position.
FIGURE 8.4 Finding a block in the cache implies comparing the tag field of the actual address with the content of one or more tags in the cache; for a set associative cache the block can be in only one set, so only the tags of that set must be checked.
• Address 78: miss because the valid bit is 0 (Not Valid); a block is
brought and placed into the cache in position Index = 01
• Address 79: hit; as Figure 8.6.b points out the content of this
memory address is already in the cache
• Address 80: miss because the valid bit at index 10 in the cache is 0
(Not Valid); a block is brought into the cache and placed at this
index.
• Address 77: hit; this byte belongs to the same block as addresses 78
and 79, which is already in the cache at index 01.
• Address 109: miss; the block being transferred from the lower
level of the hierarchy is placed in the cache at index 01, thus
replacing the previous block.
• Address 27: miss; block transferred into the cache at index 11.
• Address 81: hit; the item is found in the cache at index 10.
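The whole walk-through can be replayed with a small Python sketch of the cache in Example 8.2 (four lines, 8-byte blocks, one valid bit and one tag per line):

# Replay the read sequence on a 4-line direct mapped cache with 8-byte blocks.
OFFSET_BITS, INDEX_BITS = 3, 2
lines = [{"valid": False, "tag": None} for _ in range(1 << INDEX_BITS)]

for address in [78, 79, 80, 77, 109, 27, 81]:
    block = address >> OFFSET_BITS
    index = block & ((1 << INDEX_BITS) - 1)
    tag   = block >> INDEX_BITS
    line  = lines[index]
    if line["valid"] and line["tag"] == tag:
        print(address, "hit  at index", index)
    else:
        print(address, "miss at index", index)
        line["valid"], line["tag"] = True, tag   # bring the block into the cache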
It is a common mistake to neglect the tag field when computing the amount
of memory necessary for a cache.
Answer:
The cache will have a number of lines equal to 2^10 = 1024. Hence the number
of bits in the index field of an address is 10. The tag field in an address
is 19 bits wide, so the number of bits needed for each line of the cache is:
1 + 19 + 16 * 8 = 148 bits
that is one valid bit, the tag, and a 16-byte data block.
This figure is roughly 16% larger than the “useful” size of the cache (148
bits per line compared with 128 bits of data), and is hardly negligible.
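The overhead can be checked with a small Python sketch; the per-line figures (one valid bit, a 19-bit tag, 16 bytes of data) are the ones used in the answer above, while the helper function and its name are only illustrative:

# Storage needed per cache line: valid bit + tag + data.
def line_bits(tag_bits, block_bytes, valid_bits=1):
    data_bits = 8 * block_bytes
    total = valid_bits + tag_bits + data_bits
    overhead = total / data_bits - 1
    return total, overhead

total, overhead = line_bits(tag_bits=19, block_bytes=16)
print(total, f"{overhead:.1%}")     # -> 148 15.6%  (148 bits per line, about 16% overhead)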
FIGURE 8.5 Block scheme of a direct mapped cache: the CPU address is split into a tag field and an index field; the index selects a cache line, the stored tag is compared (COMP) with the tag field of the address to generate the Hit signal, and a multiplexor (MUX) delivers the selected data.
When a miss occurs and the cache (or the selected set) is full, one of the
resident blocks must be chosen for replacement. The most common replacement
policies are:
• random: the block to be replaced is chosen at random; this is the
simplest policy to implement in hardware;
• LRU (Least Recently Used): the block that has not been accessed for
the longest time is selected for replacement, on the assumption that
it is the least likely to be needed again soon;
• FIFO (First In First Out): the oldest block in the cache (or in the
set for a set associative cache) is selected for replacement. This
policy does not take into account the addressing pattern in the
past: it may happen that the block has been heavily used in the
previous addressing cycles, and yet it is chosen for replacement.
The FIFO policy is outperformed by the random policy, which has,
as a plus, the advantage of being easier to implement.
Consider a fully associative four block cache, and the following stream of
block-frame addresses: 2, 3, 4, 2, 5, 2, 3, 1, 4, 5, 2, 2, 2, 3. Show the content
of the cache in two cases:
a) using a LRU algorithm for replacing blocks;
b) using a FIFO policy.
Answer:
Address:   2    3    4    2    5    2    3    1    4    5    2    2    2    3

LRU:
line 0:   2_1  2_2  2_3  2_1  2_2  2_1  2_2  2_3  2_4  5_1  5_2  5_3  5_4  5_5
line 1:        3_1  3_2  3_3  3_4  3_5  3_1  3_2  3_3  3_4  2_1  2_1  2_1  2_2
line 2:             4_1  4_2  4_3  4_4  4_5  1_1  1_2  1_3  1_4  1_5  1_6  3_1
line 3:                       5_1  5_2  5_3  5_4  4_1  4_2  4_3  4_4  4_5  4_6
misses:    M    M    M         M              M    M    M    M              M

FIFO:
line 0:   2*   2*   2*   2*   2*   2*   2*   1    1    1    1    1    1    1
line 1:        3    3    3    3    3    3    3*   3*   3*   2    2    2    2
line 2:             4    4    4    4    4    4    4    4    4*   4*   4*   3
line 3:                       5    5    5    5    5    5    5    5    5    5*
misses:    M    M    M         M              M              M              M
For the LRU policy, the subscripts indicate the age of the blocks in the
cache (the number of accesses since the block was last referenced). For the
FIFO policy, a star indicates the next block to be replaced. The Ms under
the table columns indicate the misses.
For the short sequence of block-frame addresses in this example, the FIFO
policy yields a smaller number of misses: 7, as compared with 9 for LRU.
However, in most cases the LRU strategy proves to be better than FIFO.
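A short Python sketch reproduces both miss counts for this block-frame address stream (9 for LRU, 7 for FIFO):

# Fully associative 4-block cache: count misses under LRU and FIFO replacement.
stream = [2, 3, 4, 2, 5, 2, 3, 1, 4, 5, 2, 2, 2, 3]
CAPACITY = 4

def misses(stream, policy):
    cache, count = [], 0                 # front of the list = next victim
    for block in stream:
        if block in cache:
            if policy == "LRU":          # a hit refreshes the block's age
                cache.remove(block)
                cache.append(block)
        else:
            count += 1
            if len(cache) == CAPACITY:
                cache.pop(0)             # evict the oldest / least recently used block
            cache.append(block)
    return count

print(misses(stream, "LRU"), misses(stream, "FIFO"))   # -> 9 7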
So far we have discussed how reads are handled by a cache. Writes
are more difficult and affect performance more than reads do. If we take
a closer look at the block scheme in Figure 8.5 we see that, in the case
of a read, the two basic operations are performed in parallel: the tag and
the data block are read at the same time. The tags are then compared, and
the delay through the comparator (COMP) is slightly higher than the delay
through the multiplexor (MUX): if we have a hit, the data is already stable
at the cache's outputs; if there is a miss, there is no harm in reading
some improper data from the cache, we simply ignore it.
There are two options when writing into the cache, depending upon how
the information in the lower level of the hierarchy is updated:
• write through: the item is written both into the cache and into the
corresponding block in the lower level of the hierarchy; as a result
the lower level always holds an up-to-date copy of the data;
• write back: writes occur only in the cache; the modified block is
written into the lower level of the hierarchy only when it has to be
replaced.
With the write-back policy it is useless to write back a block (i.e. to
write a block into the lower level of the hierarchy) if the block has not been
modified while in the cache. To keep track of whether a block was modified, a
bit, called the dirty bit, is used for every block in the cache; when the
block is brought into the cache this bit is set to Not-dirty (0); the first write
into that block sets the bit to Dirty (1). When the replacement decision is
taken, the control checks whether the block is dirty or clean. If the block is
dirty it has to be written back to the lower level of the memory; otherwise a
new block coming from the lower level of the hierarchy can simply overwrite
that block in the cache.
For fully or set associative caches, where several blocks may be candidates
for replacement, it is common to prefer a clean one (if any), thus
saving the time necessary to transfer a block from the cache to the lower
level of the memory.
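The dirty-bit bookkeeping can be sketched in a few lines of Python; this is a simplified model (one block per line, and lower_level is a hypothetical dictionary standing in for the memory below the cache), not a description of the actual hardware:

# Write-back policy: a block is copied back only if it was modified in the cache.
class Line:
    def __init__(self):
        self.valid = False
        self.dirty = False      # Not-dirty (0) when the block is brought in
        self.tag   = None
        self.data  = None

def write(line, value):
    line.data  = value
    line.dirty = True           # the first write marks the block as Dirty (1)

def replace(line, new_tag, new_data, lower_level):
    if line.valid and line.dirty:
        lower_level[line.tag] = line.data    # copy the modified block back
    # a clean block is simply overwritten
    line.valid, line.dirty = True, False
    line.tag, line.data = new_tag, new_data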
The two cache write policies have their advantages and disadvantages: write
through keeps the lower level of the hierarchy permanently consistent with
the cache, but every write must also go to the slower memory; write back
writes a block at most once, when it is replaced, but the lower level is
temporarily out of date.
The performance of a CPU with a cache can be expressed as:
CPUtime = (CPUexec + Memory_stalls) * Tck
where both the execution time and the stalls are expressed in clock cycles.
Now the natural question we may ask is: do we include the cache access
time in the CPUexec or in Memory_stalls? Both ways are possible: it is
possible to consider the cache access time in Memory_stalls, simply
because the cache is a part of the memory hierarchy. On the other hand,
because the cache is supposed to be very fast, we can include the hit time in
the CPU execution time as the item sought in the cache will be delivered
very quickly, maybe during the same execution cycle. As a matter of fact
this is the widely accepted convention.
Memory_stalls will include the stalls due to misses, for both reads and writes:
Memory_stalls = IC * Mem_accesses_per_instruction * miss_rate * miss_penalty
The above formula can also be written using misses per instruction as:
Memory_stalls = IC * misses_per_instruction * miss_penalty
where misses_per_instruction = Mem_accesses_per_instruction * miss_rate.
Answer:
CPUtime = IC*(CPIexec + Mem_accesses_per_instruction*miss_rate*miss_penalty)*Tck
The IC and Tck are the same in both cases, with and without cache, so the
result of including the cache's behavior is an increase in CPUtime by
8.25/7 - 1 ≈ 17.9%
The following example presents the impact of the cache for a system with a
lower CPI (as is the case with pipelined CPUs):
The CPI for a CPU is 1.5, there are on the average 1.4 memory accesses
per instruction, the miss rate is 5%, and the miss penalty is 10 clock cycles.
What is the performance if the cache is considered?
Answer:
CPUtime = IC*(CPIexec + Mem_accesses_per_instruction*miss_rate*miss_penalty)*Tck
CPUtime = IC*(1.5 + 1.4*0.05*10)*Tck = IC*2.2*Tck
The effective CPI grows from 1.5 to 2.2, so the execution time is stretched by
about 47%. Note that for a machine with a lower CPI the impact of the cache is
more significant than for a machine with a higher CPI.
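The figures of Example 8.6 can be checked with a short Python sketch:

# Example 8.6: effective CPI and relative slowdown due to cache misses.
cpi_exec     = 1.5
mem_accesses = 1.4        # memory accesses per instruction
miss_rate    = 0.05
miss_penalty = 10         # clock cycles

cpi_effective = cpi_exec + mem_accesses * miss_rate * miss_penalty
print(f"{cpi_effective:.2f}")                 # -> 2.20
print(f"{cpi_effective / cpi_exec - 1:.0%}")  # -> 47% longer execution time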
The following example shows the impact of the cache on systems with
different clock rates.
Example 8.7 CPU PERFORMANCE WITH CACHE, CPI AND CLOCK RATES:
Answer:
CPUtime = IC*(CPIexec + Mem_accesses_per_instruction*miss_rate*miss_penalty)*Tck
For the CPU running with a 20 ns clock cycle, the miss penalty is 140/20 =
7 clock cycles; the effect of the cache, for this machine, is to stretch the
execution time by 32%. For the machine running with a 10 ns clock cycle, the
miss penalty is 140/10 = 14 clock cycles; the stall component doubles, and the
execution time is stretched by about 65%. The faster the clock, the more cycles
are lost on each miss, so a fast CPU suffers relatively more from the same
memory than a slow one.
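Assuming the CPU parameters are those of Example 8.6 (an assumption, but one consistent with the 32% figure above), the comparison of the two clock rates can be sketched as:

# Example 8.7 (assumed parameters): the impact of the clock rate on the cache penalty.
cpi_exec, mem_accesses, miss_rate = 1.5, 1.4, 0.05
memory_access_time = 140                 # ns, the miss penalty in absolute time

for tck in (20, 10):                     # clock cycle in ns
    penalty = memory_access_time / tck                  # miss penalty in clock cycles
    cpi = cpi_exec + mem_accesses * miss_rate * penalty
    print(tck, "ns clock:", f"{cpi / cpi_exec - 1:.1%}", "longer")
# -> 20 ns clock: 32.7% longer;  10 ns clock: 65.3% longer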
The formula above suggests two ways of improving the performance of a system
with a cache:
• reducing the miss rate: the easy way is to increase the cache size;
however there is a serious limitation in doing so for on-chip
caches: the available space. Most on-chip caches are only a few
kilobytes in size.
• reducing the miss penalty: in most cases the access time
dominates the miss penalty; the access time is given by the
technology used for memories and, as a result, cannot easily be
lowered, but it is possible to use intermediate levels of cache
between the internal (on-chip) cache and main memory.
Misses in a cache are usually classified in three categories:
• compulsory: the very first access to a block always misses, since the
block has to be brought into the cache at least once;
• capacity: if the cache does not contain all the blocks needed for
the execution of the program, then some blocks will be replaced
and then, later, brought back into the cache;
• conflict: in direct mapped or set associative caches, blocks that
compete for the same position (set) may evict one another even
though the cache as a whole is not full.
As for capacity misses, the solution is larger caches, both internal and
external. If the cache is too small to fit the requirements of some program,
then most of the time will be spent transferring blocks between the cache
and the lower level of the hierarchy; this is called thrashing. A thrashing
memory hierarchy has a performance close to that of the memory in the lower
level, or even poorer because of the miss overhead.
Initial caches were meant to hold both data and instructions. These caches
are called unified or mixed. It is possible however to have separate caches
for instructions and data, as the CPU knows whether it is fetching an instruction
or loading/storing data. Having separate caches allows the CPU to perform
an instruction fetch at the same time as a data read/write, as happens in
pipelined implementations. As the table in section 8.6 shows, most of
today’s architectures have separate caches. Separate caches give the
designer the opportunity to optimize each cache separately: they may have
different sizes, different organizations, and different block sizes. The main
observation is that instruction caches have lower miss rates than data caches,
mainly because instructions expose better spatial locality than data.
Exercises
8.1 Draw a fully associative cache schematic. What hardware resources are
needed besides the ones required by a direct mapped cache? You must
pick some cache capacity and some block size.
8.2 Redo the design in problem 8.1 but for a 4-way set associative cache.
Compare your design with the fully associative cache and the direct
mapped cache.
8.3 Design a 16 KB direct mapped cache for a 32 bit address system. The
block size is 4 bytes (1 word). Compare the result with the result in
Example 8.3.
8.4 Design (gate level) a 4 bit comparator. While most MSI circuits provide
three outputs indicating the relation between the A and B inputs (A > B, A
= B, A < B), your design must have only one output which gets active (1)
when the two inputs are equal.
8.5 Assume you have two machines with the same CPU and same main
memory, but different caches:
cache 1: a 16-set, 2-way set associative cache, 16 bytes per block, write
through;
cache 2: a 32-line direct mapped cache, 16 bytes per block, write
back.
Also assume that a miss takes 10 times longer than a hit, for both machines. A
word write takes 5 times longer than a hit for the write-through cache; the
transfer of a block from the cache to the memory takes 15 times as long as
a hit.
a) write a program that makes machine 1 run faster than machine 2 (by as
much as possible);
b) write a program that makes machine 2 run faster than machine 1 (by as
much as possible).