Intelligent Cache System
The motivation for the design, the architectural model for the SMI cache
system, and its operation in the context of hardware prefetching are as follows.
2.1. Motivation
The direct mapped cache is the main cache, and its organization is similar to
that of a traditional direct mapped cache, but with a smaller block size.
The spatial buffer is designed so that each entry is a collection of several
banks, each of which is the size of a block in the direct mapped cache. The tag
space of the buffer is a content addressable memory (CAM). The small blocks
in each bank include a hit bit (H), to distinguish referenced blocks from
unreferenced blocks. Each large block entry in the spatial buffer further has a
prefetch bit (P), which directs the operation of the prefetch controller. The prefetch
bits are used to dynamically generate prefetch operations.
If at least one entry in the spatial buffer is in the invalid state, a large
block is fetched and stored in the spatial buffer. When a particular small block
is accessed by the CPU, the corresponding hit bit is set to one. Thus, the hit bit
of the small block identifies it as a referenced block.
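The entry layout described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class and field names are assumptions, and only the H/P-bit behavior follows the text (four small banks per large block, a hit bit per small block, one prefetch bit per entry).

```python
# Sketch (assumed names) of one spatial-buffer entry: a large block split
# into four small banks, a hit bit (H) per small block, and one prefetch
# bit (P) per large-block entry, as described in the text.

class SpatialBufferEntry:
    SMALL_BLOCKS = 4  # small blocks (banks) per large block

    def __init__(self, tag):
        self.tag = tag                            # matched in the CAM tag space
        self.valid = True
        self.hit = [False] * self.SMALL_BLOCKS    # H bits: referenced or not
        self.prefetch = False                     # P bit: drives the prefetch controller

    def access(self, small_index):
        """A CPU access to a small block sets its hit bit to one."""
        self.hit[small_index] = True


entry = SpatialBufferEntry(tag=0x1A2B)
entry.access(2)   # CPU touches the third small block
```

After the access, only the referenced small block carries a set hit bit, which is how referenced blocks are later told apart from unreferenced ones.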
Cache write back does not occur from the spatial buffer, because any
modified or referenced small block is always moved to the direct mapped
cache or the victim cache. In a conventional cache whose block size matches
the spatial buffer's large block (e.g., 32 bytes), write back must be
performed for the full 32-byte block even when only one word requires it. In
contrast, the SMI cache executes the write back operation only for the
marked 8-byte small blocks. Therefore, write traffic into memory is
potentially reduced to a significant degree.
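The traffic difference can be made concrete with a small worked example. The scenario below (one modified small block out of four) is hypothetical and chosen only to illustrate the arithmetic; the 32-byte and 8-byte sizes come from the text.

```python
# Hypothetical write-back traffic comparison for one large block in which
# only a single 8-byte small block was modified.
LARGE_BLOCK = 32   # bytes: conventional block / spatial-buffer large block
SMALL_BLOCK = 8    # bytes: SMI small block

dirty_marks = [False, True, False, False]        # one marked small block

conventional_traffic = LARGE_BLOCK               # full 32-byte block written back
smi_traffic = sum(dirty_marks) * SMALL_BLOCK     # only the marked small blocks

print(conventional_traffic, smi_traffic)         # 32 vs 8 bytes
```

In this case the SMI scheme writes back a quarter of the conventional traffic; the saving shrinks as more small blocks in a large block are dirtied.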
The potential exists in any split cache for incoherent copies of
blocks to appear in the different subcaches. To avoid this problem, we chose and
simulated a simple mechanism, which is as follows: When a global miss
occurs, the cache controller searches the tags of the temporal caches to detect
whether any of the four small blocks belonging to the particular large block
being fetched are present in the temporal caches. If a match is detected, then all
the corresponding small blocks in the temporal cache are invalidated. Each of
these small blocks that is also dirty is then used to update its corresponding
entry in the spatial buffer once the large block has been loaded. This search
operation can be accomplished while the cache controller is handling a miss.
Further, the power consumption is negligible because the miss ratio is only
about 1.7% of the total number of the addresses generated by the CPU.
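The coherence mechanism above can be sketched in a few lines. The data structures and function name below are assumptions for illustration; the behavior (search the temporal caches for the four small blocks of the incoming large block, invalidate any matches, and let dirty copies update the spatial-buffer entry) follows the text.

```python
# Sketch of the miss-time coherence search described above.
SMALL_PER_LARGE = 4   # small blocks per large block

def handle_global_miss(large_tag, temporal_cache, fetched_block):
    # temporal_cache: {(large_tag, i): {"dirty": bool, "data": bytes}}
    # fetched_block: SMALL_PER_LARGE data slots of the large block from memory
    for i in range(SMALL_PER_LARGE):
        line = temporal_cache.pop((large_tag, i), None)  # invalidate on match
        if line is not None and line["dirty"]:
            fetched_block[i] = line["data"]              # dirty copy updates the buffer

temporal = {(0x40, 1): {"dirty": True,  "data": b"new8byte"},
            (0x40, 3): {"dirty": False, "data": b"old8byte"}}
fetched = [b"mem0", b"mem1", b"mem2", b"mem3"]
handle_global_miss(0x40, temporal, fetched)
```

Both matching small blocks are invalidated, but only the dirty one overwrites its slot in the freshly fetched large block, so the spatial buffer ends up with the newest data.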
3: Performance Evaluation
The direct mapped cache is chosen for comparison in
terms of performance and cost.
Here, hit time is the time to process a hit in the cache and miss penalty is the
additional time to service the miss. The basic parameters for the simulation are
as follows: The hit time of the direct mapped cache and fully associative buffer
are both assumed to be one cycle. We assumed 15 cycles are needed for a miss.
Therefore, each 8-byte block is transferred from the off-chip memory after a 15
cycle penalty. These parameters are based on values for common 32-bit
embedded processors (e.g., Hitachi SH4 or ARM920T).
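With these parameters, average memory access time follows the standard relation AMAT = hit time + miss ratio x miss penalty. The miss ratio plugged in below is the roughly 1.7% figure quoted earlier; it is used here only to show the arithmetic, not as a result from the tables.

```python
# Average memory access time under the stated simulation parameters.
hit_time = 1        # cycles (direct mapped cache / fully associative buffer)
miss_penalty = 15   # cycles per 8-byte block from off-chip memory
miss_ratio = 0.017  # ~1.7% of CPU-generated addresses, as quoted earlier

amat = hit_time + miss_ratio * miss_penalty
print(round(amat, 3))   # 1.255 cycles
```

The result lands in the same 1.25-1.34 cycle range as the access times reported for the evaluated configurations.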
Two common performance metrics, the miss ratio and the average
memory access time, are used to evaluate and compare an SMI cache system
operating in a “prefetch-4” configuration with other approaches. Here, the
direct mapped cache is compared with the SMI cache in terms of miss ratio and
average memory access time.
In general, the logic to manage the tags for the fully associative
cache is designed as a CAM structure for simultaneous comparison for each
entry. Because each CAM cell is a combination of storage and comparison, the
size of a CAM cell is double that of a RAM cell. For a fair performance/cost
analysis, the performance for various direct mapped cache and buffer sizes is evaluated.
The metric is the rbe (register bit equivalent), and the total area can be
calculated as follows:

Area_total = Area_PLA + Area_RAM + Area_CAM    (1)

Here, the control logic PLA (programmable logic array) is assumed to be 130
rbe, a RAM cell 0.6 rbe, and a CAM cell 1.2 rbe. Equation (2) represents
the RAM area:

Area_RAM = 0.6 x (#entries + Lsense_amp) x (#data_bits + #status_bits + Wdriver)    (2)
where Lsense_amp is the bit length of a bit line sense amplifier, Wdriver
the data width of a driver, #entries the number of rows of the tag array or data
array, #data_bits the tag bit or data bit of one set, and #status_bits the state bits
of one set. Finally, (3) calculates the area of the CAM:

Area_CAM = 1.2 x (#entries + Lsense_amp) x (#tag_bits + #status_bits + Wdriver)    (3)
where #tag_bits is the number of bits for one set in the tag array. Table 1
shows the performance/cost ratio for direct mapped cache and SMI cache.
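The area model lends itself to a short numeric sketch. The 130/0.6/1.2 rbe constants come from the text; the formula shapes follow the standard Mulder-style rbe model, and the sense-amplifier length and driver width defaults (6 each), as well as the example parameter values, are illustrative assumptions rather than figures from the evaluation.

```python
# Sketch of the rbe area model. Constants (130, 0.6, 1.2 rbe) are from the
# text; l_sense / w_driver defaults and example sizes are assumed.

PLA_RBE = 130  # control-logic PLA

def ram_area(entries, data_bits, status_bits, l_sense=6, w_driver=6):
    # 0.6 rbe per RAM cell, with sense-amplifier and driver overheads
    return 0.6 * (entries + l_sense) * (data_bits + status_bits + w_driver)

def cam_area(entries, tag_bits, status_bits, l_sense=6, w_driver=6):
    # a CAM cell both stores and compares, so it costs double a RAM cell
    return 1.2 * (entries + l_sense) * (tag_bits + status_bits + w_driver)

# Total area of one structure: control PLA + data RAM + CAM tag space.
total = PLA_RBE + ram_area(64, 256, 2) + cam_area(64, 27, 2)
```

Because the per-cell constant is the only difference, a CAM array of the same dimensions always costs exactly twice the equivalent RAM array, which is why the fully associative buffer is kept small.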
The SMI cache shows about a 60% area reduction compared with the 64KB-32byte
conventional direct mapped cache while providing higher performance, and it
offers an 80% area reduction compared with the 64KB-32byte configuration
while providing much higher performance. The improvement ratio for the
average memory access time also shows that the 8KB-2KB SMI is the best
configuration.
Avg. memory access time (improvement ratio): 1.34 cycles (1.00), 1.29 cycles (0.96), 1.26 cycles (0.94), 1.25 cycles (0.93)
Table-B: Miss ratio of the two-way set associative cache and SMI cache.
Table-C: Average memory access time of the direct mapped cache and the SMI
cache.
4. Conclusion
Table of Contents:
1. Introduction
2. Selective Mode Intelligent Cache System
2.1. Motivation
3. Performance Evaluation
4. Conclusion
References:
1. IEEE Transactions on Computers, Vol. 52, No. 5, May 2003.
2. John P. Hayes, Computer Architecture and Organization.
3. William Stallings, Computer Organization and Architecture.