Stack and Cache
INTRODUCTION
Prefetch methods [1, 2, 3, 4] are a very effective approach for minimizing memory latency in current computer architectures. One weakness of the existing prefetch methods is that they do not differentiate the prefetched data based on their locality in the virtual memory, and thus they cannot exploit the specific characteristics of each region. For example, references to stack data tend to have clear spatial and temporal locality that can be used effectively to improve the characteristics of a prefetch method. This fact has been observed in the past, and there are a few proposals exploiting these characteristics of stack data.
In [5], a transparent data buffer that closely maps the run-time stack, called a stack cache, is proposed. In this architecture the stack cache acts as a register file for the simulated 'Stack machine'.
In [6], a data 'decoupled architecture' is studied that differentiates data references based on their locality: stack and heap. A mechanism is proposed for identifying the type of load/store instructions at an early stage of the processor pipeline, employing a specialized cache called the Access Region Prediction Table (ARPT). The ARPT stores the type of each load/store instruction, and this information is subsequently used to direct the corresponding load/store instructions to separate Load/Store Queues (LSQs). A separate cache memory is attached to each LSQ, one for stack data and one for global data. The published results showed a very good prediction rate for the type of memory instructions, as well as an overall performance improvement of the studied computer architecture.
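The idea behind such a type-prediction table can be illustrated with a small sketch. The table size, indexing scheme and default prediction below are illustrative assumptions, not details taken from [6]:

```python
# Sketch of an ARPT-like table: predicts, per load/store instruction,
# whether it accesses the stack or the global/heap region.
ENTRIES = 256  # table size (assumed for illustration)

class ARPT:
    def __init__(self):
        self.table = {}  # table index -> last observed region ('stack'/'global')

    def predict(self, pc):
        """Early-pipeline prediction; defaults to 'global' on a cold entry."""
        return self.table.get(pc % ENTRIES, 'global')

    def update(self, pc, region):
        """Once the effective address is known, record the actual region."""
        self.table[pc % ENTRIES] = region

arpt = ARPT()
arpt.update(0x400100, 'stack')
assert arpt.predict(0x400100) == 'stack'   # steered to the stack LSQ/cache
assert arpt.predict(0x400104) == 'global'  # cold entry: default prediction
```

Because most load/store instructions keep accessing the same region, even a simple last-outcome table of this kind can reach the high prediction rates reported in [6].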
problem met in [6] and [7] — i.e., the limited time for prefetching the stack data before their actual use by the application — is overcome. This also adds a selective characteristic to the prefetch method compared with classical prefetch methods, because the depth of the prefetch requests depends on the size of the procedure frames and is not constant.
The studied cache memory, called Stack Cache Memory (SCM), acts as a window into the stack memory containing all or nearly all data from the procedure frames in use. The SCM is filled with data from the current procedure frame. When a new control branch is encountered during the execution of the application, the new slice of the stack (procedure frame) is estimated and all data from this slice are prefetched into the SCM. A prefetch process is also started for the stack slice of the predicted next procedure. By starting the prefetch early enough, these methods achieve a better hit ratio for memories with long latency.
2.2. Algorithm of work
The SCM works in the following way: all stack references are directed to the SCM. The algorithm used to distinguish stack references from global ones is based on a simple comparison of the reference address with the top of stack (TOS). The issued processor requests are served as ordinary operations on a cache memory with classical organization. On a hit, the data is read from or stored into the SCM. On a miss, the missing data is brought from the L2 cache or main memory and kept in the SCM.
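The hit/miss path described above can be sketched as a minimal simulation. The block size, capacity and backing-store interface below are illustrative assumptions, not parameters from the paper:

```python
# Minimal sketch of the SCM hit/miss path for stack references.
BLOCK = 64  # bytes per cache block (assumed for illustration)

class SCM:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = {}  # block address -> data (stand-in for cache lines)
        self.fifo = []    # insertion order, used for FIFO replacement

    def access(self, addr, backing):
        """Serve one stack reference; returns True on a hit.
        On a miss, the block is brought from the backing store (L2/main
        memory in the paper) and kept in the SCM."""
        blk = addr // BLOCK * BLOCK
        if blk in self.blocks:
            return True  # hit: data read from / stored into the SCM
        if len(self.fifo) >= self.capacity:
            victim = self.fifo.pop(0)  # FIFO replacement, as in the paper
            del self.blocks[victim]
        self.blocks[blk] = backing.get(blk, 0)
        self.fifo.append(blk)
        return False

scm = SCM(capacity_blocks=4)
mem = {0: 1, 64: 2}
assert scm.access(8, mem) is False  # cold miss: block fetched and kept
assert scm.access(16, mem) is True  # same block: hit in the SCM
```

In the paper the evicted lines are preserved in the L2 data cache; the sketch simply discards them, since only the hit/miss behavior is of interest here.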
When an instruction for entering a new procedure is encountered at the decode stage of the processor pipeline, the size and borders of the new procedure frame are calculated and a prefetch request is issued for all data within this slice not already cached in the SCM. Lines evicted from the SCM are preserved in the L2 data cache. The algorithm chosen for replacing SCM lines is FIFO. After the necessary blocks have been prefetched, the next probable procedure frame is predicted and prefetching of its stack data is started. The same algorithm is executed when an instruction for returning from a procedure is encountered. If the prefetched procedure frame is larger than the SCM, only a part of the frame equal to the size of the SCM is prefetched.
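The frame-prefetch step can be sketched as follows, assuming the frame bounds computed at decode are available. The block size and helper names are illustrative; the capping of the frame to the SCM size follows the paper:

```python
BLOCK = 64  # bytes per cache block (assumed for illustration)

def frame_blocks(frame_top, frame_size, scm_capacity_bytes):
    """Block addresses covering one procedure frame.
    The stack grows downward, so the frame spans
    [frame_top - frame_size, frame_top). If the frame exceeds the SCM
    capacity, only an SCM-sized part of it is prefetched, as in the paper."""
    size = min(frame_size, scm_capacity_bytes)
    lo = frame_top - size
    first = lo // BLOCK
    last = (frame_top - 1) // BLOCK
    return [b * BLOCK for b in range(first, last + 1)]

def prefetch_on_call(cached_blocks, frame_top, frame_size, scm_capacity_bytes):
    """Issue prefetch requests only for blocks of the new frame
    that are not already cached in the SCM."""
    return [b for b in frame_blocks(frame_top, frame_size, scm_capacity_bytes)
            if b not in cached_blocks]

# A 256-byte frame whose top is at address 1024, with an empty SCM:
assert prefetch_on_call(set(), 1024, 256, 2048) == [768, 832, 896, 960]
```

The same routine would be invoked both on procedure entry and on return, the only difference being which frame's bounds are passed in.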
As the results in the next chapter show, the proposed algorithm for prefetching stack data into the SCM in advance has one drawback. The problem arises with applications that have relatively big procedure frames, which cannot be effectively prefetched and stored in the SCM. Because of this, a modification of the proposed SCM, called SCM-limited, is examined here. In the SCM-limited, only a limited number of data from the procedure frame is actually prefetched and stored into the SCM; the actual number (watermark), after which the prefetching of data stops, is chosen by experiment.
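The SCM-limited modification amounts to a small change in the prefetch loop. The watermark of 8 blocks is the experimentally chosen value reported later in the paper; the rest of the sketch is illustrative:

```python
WATERMARK = 8  # blocks prefetched per frame before stopping (chosen by experiment)

def limited_prefetch(frame_block_addrs, cached, watermark=WATERMARK):
    """Prefetch at most `watermark` not-yet-cached blocks of the frame,
    then stop, as in the SCM-limited variant."""
    issued = []
    for blk in frame_block_addrs:
        if blk in cached:
            continue
        issued.append(blk)
        if len(issued) >= watermark:
            break
    return issued

blocks = [b * 64 for b in range(20)]  # a large frame of 20 blocks
assert len(limited_prefetch(blocks, set())) == 8   # stops at the watermark
assert limited_prefetch(blocks, set(blocks[:5]))[0] == 5 * 64
```

For small frames the loop terminates before reaching the watermark, so the SCM-limited behaves exactly like the full SCM on applications with small procedure frames.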
2.3. Implementation
Figure 2.1 gives one example implementation of the proposed stack cache memory. It consists of a memory for storing the cached data, organized as a direct-mapped cache; a device for tracking procedure transitions (PTT); control logic; and the associated registers SP and BP, which point to the borders of the stack slice currently prefetched into the cache memory.
Figure 2.1
The cache has a FIFO organization, is organized as a circular buffer and is virtually indexed. The stream of memory references is decoupled by comparing the address of each reference with the current stack pointer SP: if the address is greater, the reference is accepted as a stack reference. The stack grows downward.
The control logic synchronizes and manages the rest of the modules.
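The decoupling of the reference stream thus reduces to a single comparison against SP. A minimal sketch, treating addresses at or above SP as stack references (the boundary case is an interpretation; the addresses below are illustrative):

```python
def is_stack_reference(addr, sp):
    """The stack grows downward, so an address at or above the current
    stack pointer is treated as a stack reference and steered to the SCM;
    anything below SP is a global/heap reference."""
    return addr >= sp

SP = 0x7FFF_F000
assert is_stack_reference(0x7FFF_F100, SP)      # local variable: stack
assert not is_stack_reference(0x0040_0000, SP)  # global/heap data: not stack
```

Unlike the prediction-based scheme of [6], this check is exact, but it requires the effective address, so it happens later in the pipeline than an ARPT-style lookup would.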
3. SCM accuracy
Figure 3.1. SCM hit ratio (hits, %) for gcc, vortex, perl, m88ksim, ijpeg, go and li; the vertical axis spans 95% to 100%.
The results show an acceptable level of cache hit ratio, especially for m88ksim, go and ijpeg, where nearly 100% of accesses hit. The other tests — li, gcc, vortex and perl — have a worse hit ratio, with li being the worst. This has a clear explanation: as can be seen in figure 3.5, the prefetched blocks are poorly utilized, and the average procedure frame size for these applications is relatively greater than the capacity of the SCM.
Figure 3.2 shows the variation of the accuracy of the proposed cache memory as the capacity of the SCM is changed. As expected, the bigger the cache, the better the hit ratio.
Figure 3.3 illustrates the results when the prefetch of data from the frame of the N+1 procedure, as predicted by the PTT, is taken into account, versus the case where only data from the current procedure frame are prefetched. There is a substantial improvement for all tests except gcc and li. We can argue that the average procedure frame sizes of gcc and li are relatively big (see figure 3.5), so the impact of prefetching the next procedure's frame is small. The overall usage of the prefetched blocks is also very low for these two tests (see figure 3.4).
Figure 3.2. SCM hit ratio for 1 KB, 2 KB and 4 KB SCM capacities, for gcc, vortex, perl, m88ksim, ijpeg, go and li.
Figure 3.3. Improvement of the hit ratio when the next procedure frame is also prefetched, for gcc, vortex, perl, m88ksim, ijpeg, go and li.
Figure 3.4 gives the overall usage of the prefetched blocks, and figure 3.5 gives the average size of the procedure frames in bytes. The procedure frames of gcc, vortex and li are relatively larger than the tested SCM, which, together with the fact that only a small fraction of the prefetched blocks is actually used, explains the worse hit ratio for these tests (see figure 3.1).
Figure 3.6 gives the results of testing the accuracy of the SCM against a cache memory with classical organization. The tests for this cache are performed in exactly the same way as for the SCM, i.e. the cache holds stack data only, is direct mapped and uses a FIFO replacement policy. As can be seen from the graph, the SCM performs better in all tests except 'li' and 'gcc'. The results clearly show that the SCM outperforms caches without prefetch algorithms for applications with relatively small procedure frames.
Figure 3.4. Overall usage of the prefetched blocks (%) for gcc, vortex, perl, m88ksim, ijpeg, go and li.
Figure 3.5. Average procedure frame size (bytes) for gcc, vortex, perl, m88ksim, ijpeg, go and li.
These results lead to the conclusion that the policy of prefetching the entire procedure frame is not good enough for all types of applications. Because of this, figure 3.7 illustrates the results of testing the accuracy of the proposed modification of the SCM with limited prefetching against the SCM with full prefetching. The amount of data (watermark) after which the prefetch process stops was chosen experimentally to be 8 blocks.
Figure 3.6. Accuracy of the SCM versus a stack cache with classical organization (hit ratio from 95% to 100%) for gcc, vortex, perl, m88ksim, ijpeg, go and li.
Figure 3.7. Accuracy of the SCM-limited versus the SCM with full prefetching for gcc, vortex, perl, m88ksim, ijpeg, go and li.
As shown, there is a notable improvement of the hit ratio for the tests with relatively bigger procedure frames, such as gcc and li. Figure 3.8 shows the relative improvement of the SCM with limited prefetching over the stack cache with classical organization.
Figure 3.8. Accuracy of the SCM-limited versus the stack cache with classical organization for gcc, vortex, perl, m88ksim, ijpeg, go and li.
As can be seen from the graph, the SCM-limited performs better in all tests compared with the stack cache with classical organization. This emphasizes how important the filtering process is in any prefetch method.
4. CONCLUSION
The existing methods of prefetching stack data do not employ algorithms for precise tracking of the procedure frames, and thus they cannot fully exploit the spatial locality of the stack references. In this paper we propose a new algorithm for caching stack data in advance, using a novel approach for starting the prefetch process early.
The proposed SCM acts as a window into the run-time stack memory, containing all or nearly all data from the currently used procedure frames. The SCM prefetches and stores all or a limited number of data from the procedure frame, employing two mechanisms for starting the prefetch process early: it starts prefetching at the moment of execution of any control branch (call/return), and it also prefetches the data from the frame of the procedure that is likely to be executed next.
This approach allows the prefetch process to start early enough relative to the usage of the stack data, and thus improves the cache hit ratio. It also adds a selective characteristic to the prefetch method compared with classical prefetch methods, because the depth of the prefetch requests depends on the size of the procedure frames and is not constant.
The proposed stack cache memory achieves a very high hit ratio, on average 99.35% for an SCM of 2 KB. When the data from the next probable procedure frame are also prefetched, we obtain a 30.31% improvement of the hit ratio. Unfortunately, the proposed SCM has some drawbacks; for example, the usage of the prefetched blocks is relatively low, 52.28% on average. Because of this, a version of the SCM that limits the amount of prefetched data is proposed, which achieves notably better results for applications with relatively bigger procedure frames, 99.85% on average. We can conclude that the proposed SCM performs very well compared to cache memories with classical organization.
REFERENCES
[1] Vander Wiel, S. P., Lilja, D. J.: "When Caches Aren't Enough: Data Prefetching Techniques", IEEE Computer, vol. 30, no. 7, pp. 23-30, July 1997.
[2] Pinter, S. S., Yoaz, A.: "Tango: A Hardware-Based Data Prefetching Technique for Superscalar Processors", Proc. of the 29th Int'l Symp. on Microarchitecture, pp. 214-255, 1996.
[3] Fu, J., Patel, J.: "Stride Directed Prefetching in Scalar Processors", Proc. of the 25th Int'l Symp. on Microarchitecture, pp. 102-110, December 1992.
[4] Joseph, D., Grunwald, D.: "Prefetching Using Markov Predictors", Proc. of the 24th Annual Int'l Symp. on Computer Architecture, Denver, Colorado, pp. 252-263, June 1997.
[5] Ditzel, D., McLellan, R.: "Register Allocation for Free: The C Machine Stack Cache", Proc. of the Symp. on Architectural Support for Programming Languages and Operating Systems, pp. 48-56, March 1982.
[6] Cho, S. et al.: "Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor", Proc. of the 26th Int'l Symp. on Computer Architecture, pp. 100-110, May 1999.
[7] Hemsthad, A.: "Implementing a Stack Cache", Rice University, Advanced Microprocessors Architecture, 1998, http://www.owlnet.rice.edu/~elec525/projects/SCreport.pdf
[8] Romansky, R., Lazarov, Y.: "A Method for Tracking Procedure Invocations", Automatics and Informatics, no. 2, 2004.
[9] Burger, D., Austin, T.: "The SimpleScalar Tool Set, Version 2.0", Computer Sciences Department Technical Report No. 1342, Univ. of Wisconsin, June 1997.