
Journal of Information, Control and Management Systems, Vol. 1, (2003), No. 2

STACK CACHE MEMORY


Radi ROMANSKY, Yordan LAZAROV
Faculty of Computer Systems and Control, Technical University, Sofia
Bulgaria
e-mail: [email protected], [email protected]
Abstract
The gap between processor and memory speed is a serious bottleneck in improving
the performance of current computer architectures, and prefetch methods are among the
most promising ways of bridging it. The known prefetch methods do not distinguish the
prefetched data by their locality in the virtual memory and thus cannot exploit their
specific characteristics. This paper describes a novel approach for effectively
prefetching and caching stack data, which exhibit clear spatial and temporal locality.
The presented results, a 99% hit ratio for a 2KB cache, confirm our belief that
exploiting these characteristics gives good results.
Keywords: prefetching, cache, stack data.

1. INTRODUCTION
Prefetch methods [1,2,3,4] are a very effective approach for minimizing memory
latency in current computer architectures. One weakness of the existing prefetch methods
is that they do not differentiate the prefetched data by their locality in the virtual
memory and thus cannot exploit their specific characteristics. For example, references
to stack data tend to have clear spatial and temporal locality that can be used
effectively to improve the characteristics of prefetch methods. This fact has been
observed in the past, and there are a few proposals that use these characteristics of stack data.
In [5] a transparent data buffer closely mapping the run-time stack, called a
stack cache, is proposed. In that architecture the stack cache acts as a register file for the simulated 'Stack
machine'.
In [6] a data 'decoupled architecture' is studied that differentiates data references
by their locality, stack or heap. A mechanism is proposed for identifying the type of
instructions at an early stage of the processor pipeline using a specialized cache, called an Access
Region Prediction Table (ARPT). The ARPT stores the type of every
load/store instruction, and this information is subsequently used to direct the corresponding
load/store instructions into separate Load/Store Queues (LSQs). A separate cache memory
is attached to each LSQ, one for stack and one for global data. The published results showed a very good
prediction rate for memory-instruction types and also an overall improvement of the
performance of the studied computer architecture.

Another interesting proposal is described in [7], where the proposed stack cache acts as a
window into the run-time stack, containing all data within a certain offset from the current Top of
the Stack (TOS). When the TOS moves upwards or downwards, the new data from the stack are
prefetched and kept in the stack cache. The published results are mixed: for some types of
applications there is an improvement, while for others there is a degradation. The authors concluded
that the reason is the lack of a method for precisely tracking the movement of the top of the stack.
In [8] a method called Procedure Tracking Transitions (PTT) is proposed for effectively
defining and predicting the movement of the used slice of the stack during the execution of an
application. In this method the execution of an application is formalized as a sequence of
procedure executions and transitions between procedures. Markov chains are used to model this
process: the procedures are the nodes of the model and the control branches
(CALL/RETURN) are the transitions between nodes. Using this model the authors achieve very
good accuracy in predicting the usage of stack data, 98.49% on average.
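The Markov-chain prediction used by PTT can be illustrated with a minimal sketch. This is
our own illustrative model, not the implementation from [8]: the class name, the procedure
names and the transition counts are all assumptions made for the example; a first-order
Markov predictor simply picks the most frequently observed successor of the current procedure.

```python
from collections import defaultdict

class ProcedureTransitionPredictor:
    """First-order Markov model of procedure transitions: each observed
    CALL/RETURN edge increments a counter, and the next procedure is
    predicted as the most frequent successor of the current one."""

    def __init__(self):
        # counts[current][successor] = number of observed transitions
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, current, successor):
        self.counts[current][successor] += 1

    def predict(self, current):
        successors = self.counts.get(current)
        if not successors:
            return None  # no history yet for this procedure
        # Most likely next procedure under the Markov model
        return max(successors, key=successors.get)

predictor = ProcedureTransitionPredictor()
for cur, nxt in [("main", "parse"), ("main", "parse"), ("main", "emit")]:
    predictor.observe(cur, nxt)
print(predictor.predict("main"))  # prints "parse"
```

A hardware PTT would hold a bounded table of such counters (256 entries in the
configuration tested later) rather than an unbounded dictionary.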
In this paper a novel approach is proposed for stack data prefetching and caching using a
specialized cache memory, called a Stack Cache Memory (SCM). The SCM acts as a window into
the run-time stack memory, containing all or nearly all data from the frames of the currently used
procedures. The SCM employs two mechanisms for prefetching and storing the stack data. In the
first, the prefetching process starts at the moment of execution of the control
instruction for entering or leaving a procedure. In the second, the prefetching process
starts for the slice of stack memory assigned to the procedure that is likely to be
executed next. These approaches allow the prefetching process to start early enough
relative to the usage of the stack data, and thus improve the cache hit ratio.
2. STACK CACHE MEMORY
2.1. A case of SCM
The main goal of this project is to minimize memory latency by using the well-known spatial
locality of stack data references to drive the prefetching process.
The execution of an application, as formalized in [8], can be viewed as a set of
executed procedures and transitions between them. Using this model, the usage of the stack data
can be formalized and modeled as a sequence of frames/slices assigned to the corresponding
executed procedures. At any point an application uses a strictly defined slice of the stack, namely the
slice assigned to the currently executed procedure. In the rest of this document this slice is
called a procedure frame (PF). In general, we can accept that the size of procedure frames is
constant during application execution, and also that the size of a procedure
frame is easily estimated at the moment of calling/leaving the procedure. This allows the
prefetching of stack data to start at the time of execution of the control
instruction for entering/leaving a procedure.
As pointed out in [8], the path between procedures followed during execution of an
application tends to repeat, allowing the next procedure likely to be executed after the
current one to be predicted with high certainty. Employing this method to predict in advance
the procedure frame that is likely to be used after the current one allows these data to be
prefetched in advance.
These two characteristics of stack data allow prefetch requests for the stack
data to be issued early enough before their actual use by the application. In this way the major


problem met in [6] and [7], i.e. the limited time for prefetching the stack data before their actual
use by the application, is overcome. This also adds a selective characteristic to the prefetch
method compared to the classical prefetch methods, because the depth of the prefetch requests
depends on the size of the procedure frames and is not a constant.
The studied cache memory, called a Stack Cache Memory (SCM), acts as a window into the stack
memory, containing all or nearly all data from the used procedure frames. The SCM is filled
with data from the current procedure frame. When a new control branch is encountered during the
execution of the application, the new slice of the stack (procedure frame) is determined and all
data from this slice are prefetched into the SCM. A prefetching process is also started for the stack
slice of the predicted procedure. By starting the prefetch early enough, these methods
achieve a better hit ratio for memories with long latency.
2.2. Algorithm of work
The SCM works in the following way: all stack references are directed to the SCM. The
algorithm used to differentiate stack references from global ones is based on a
simple comparison of the reference address with the TOS. The issued processor requests are
performed as ordinary operations on a cache memory with classical organization. In case of a
hit the data are stored in or read from the SCM. In case of a miss the missed data are brought from the L2
cache or main memory and kept in the SCM.
When an instruction for entering a new procedure is encountered at the decode stage of the
processor pipeline, the size and borders of the new procedure frame are calculated and a
prefetch request is issued for all data within this slice not already cached in the SCM. Lines evicted
from the SCM are preserved in the L2 data cache. The replacement algorithm for SCM lines
is FIFO. After prefetching the necessary blocks, the next probable procedure
frame is predicted and prefetching of those stack data is started. The same
algorithm is executed when an instruction for returning from a procedure is encountered. If
the size of the prefetched procedure frame is greater than the size of the SCM, only a part of the
frame equal to the size of the SCM is prefetched.
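The steps above can be sketched as a small software model. This is an illustrative
simplification, not the hardware design: the class and method names are ours, and the
capacity and block size are taken from the base configuration used in the tests (2KB, 32-byte
blocks).

```python
class StackCacheMemory:
    """Sketch of the SCM algorithm: on a CALL/RETURN the bounds of the new
    procedure frame are computed and every block in the frame not already
    cached is prefetched; replacement is FIFO, clipped to SCM capacity."""

    def __init__(self, capacity_bytes=2048, block_size=32):
        self.block_size = block_size
        self.max_blocks = capacity_bytes // block_size
        self.blocks = []          # FIFO order: index 0 is the oldest line
        self.hits = self.misses = 0

    def _insert(self, block):
        if block in self.blocks:
            return                # already cached, nothing to prefetch
        if len(self.blocks) == self.max_blocks:
            self.blocks.pop(0)    # FIFO eviction (evicted line goes to L2)
        self.blocks.append(block)

    def access(self, addr):
        block = addr // self.block_size
        if block in self.blocks:
            self.hits += 1
        else:
            self.misses += 1      # miss: line brought from L2/main memory
            self._insert(block)

    def prefetch_frame(self, frame_base, frame_size):
        # Prefetch the whole frame, clipped to the SCM capacity
        nblocks = min(frame_size // self.block_size, self.max_blocks)
        for i in range(nblocks):
            self._insert(frame_base // self.block_size + i)

scm = StackCacheMemory()
scm.prefetch_frame(frame_base=0x1000, frame_size=256)  # CALL: 8-block frame
scm.access(0x1000)   # first access hits thanks to the prefetch
print(scm.hits, scm.misses)  # prints "1 0"
```

In the hardware the prefetch is triggered at the decode stage, overlapping the memory
latency with the execution of the called procedure, which the model above does not capture.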
As the results shown in the next chapter indicate, the proposed algorithm for prefetching
stack data into the SCM in advance has one drawback. The problem arises with applications that
have relatively big procedure frames that cannot be effectively prefetched and stored in the
SCM. Because of this, a modification of the proposed SCM, called SCM-limited, is also
examined. In the SCM-limited only a limited number of data from the procedure frame is actually prefetched
and stored into the SCM; the actual number (watermark), after which the prefetching of the
data stops, is chosen experimentally.
2.3. Implementation
Figure 2.1 gives one example implementation of the proposed stack cache memory.
It consists of a memory for storing the cached data, organized as a direct-mapped cache; a device for
tracking procedure transitions (PTT); control logic; and associated registers, SP and BP, that
point to the borders of the stack slice currently prefetched into the cache memory.

Figure 2.1
The cache has a FIFO organization; it is organized as a circular buffer and is virtually
indexed. The stream of memory references is decoupled: the address of the data is compared
with the current stack pointer SP, and if it is bigger we accept that the reference is to
stack data, since the stack grows downward.
The control logic synchronizes and manages the rest of the modules.
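The decoupling rule can be expressed as a one-line predicate. A sketch under the paper's
assumptions (downward-growing stack, comparison against SP); the addresses in the example
are illustrative:

```python
def is_stack_reference(addr, sp):
    """With a downward-growing stack, any address at or above the current
    stack pointer lies inside the live stack region, so the reference is
    steered to the SCM; anything below SP is treated as global/heap data."""
    return addr >= sp

SP = 0x7FFF_F000
print(is_stack_reference(0x7FFF_F100, SP))  # local variable -> True
print(is_stack_reference(0x0040_2000, SP))  # global data    -> False
```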
3. SIMULATIONS AND RESULTS


The SimpleScalar framework [9] is used for testing the SCM. System calls and the
influence of other applications are not taken into account in any of the conducted tests. The
benchmarks from the SPECint95 suite are used as test applications; the first 50M instructions of
each benchmark are simulated.
In the base configuration the SCM has a capacity of 2K bytes, the block size is 32 bytes,
and a write-back, fetch-on-write policy is used. The chosen PTT has 256 entries.
Figure 3.1 gives the results of simulating the 2KB SCM. In this test only
the data from the current procedure frame are prefetched.


Figure 3.1. SCM accuracy: hit ratio (95-100% scale) for gcc, vortex, perl, m88ksim, ijpeg, go and li.
The presented results show an acceptable level of cache hit ratio, especially
for m88ksim, go and ijpeg, where the success rate is nearly 100%. The other tests, li,
gcc, vortex and perl, have a worse hit ratio, li being the worst. This has a clear explanation:
as can be seen in figure 3.5, there is little utilization of the prefetched blocks, and the
average size of the procedure frames for these applications is relatively larger than the
capacity of the SCM.
Figure 3.2 gives the variation of the accuracy of the proposed cache memory
when the capacity of the SCM is changed. As can be expected, the bigger the cache, the
better the hit ratio.
Figure 3.3 illustrates the results when the prefetch of the data from the
procedure frame of the N+1 procedure, as predicted by the PTT, is taken into account, versus the case
where only the data from the current procedure frame are prefetched.
As we see, there is a substantial improvement for all tests except gcc and li. We can argue
that the average procedure frame sizes of gcc and li are relatively big (see figure 3.5), so
the impact of prefetching the frame of the next procedure is small. The overall percentage of
usage of the prefetched blocks is also very low for these two tests (see figure 3.4).


Figure 3.2. SCM accuracy vs the size of the SCM (1KB, 2KB, 4KB) for gcc, vortex, perl, m88ksim, ijpeg, go and li.

Figure 3.3. SCM N+1 prefetch accuracy: relative improvement (0-100%) for gcc, vortex, perl, m88ksim, ijpeg, go and li.
Figure 3.4 gives the overall usage of the prefetched blocks, and figure 3.5 gives
the average size of procedure frames in bytes. As we see, the procedure frame sizes of
gcc, vortex and li are relatively larger than the size of the tested SCM; together with the
fact that only a small fraction of the prefetched blocks is actually used, this explains the worse hit ratio for
these tests (see figure 3.1).
Figure 3.6 gives the results of testing the accuracy of the SCM versus a cache
memory with classical organization. The tests for this cache are performed in exactly the same
way as for the SCM, i.e. the cache is for stack data only, it has a direct-mapped organization and
it uses a FIFO replacement policy.
As can be seen from the graph, the SCM performs better in all tests but 'li' and
'gcc'. The results clearly show that the SCM outperforms caches without prefetch algorithms
for applications with relatively small procedure frames.

Figure 3.4. Used blocks in frame (%) for gcc, vortex, perl, m88ksim, ijpeg, go and li.
Figure 3.5. The size of frames (bytes) for gcc, vortex, perl, m88ksim, ijpeg, go and li.

The presented results lead to the conclusion that the policy of prefetching the entire procedure
frame is not good enough for all types of applications. Because of this, figure 3.7
illustrates the results of testing the accuracy of the proposed modification of the SCM with
limited prefetching versus the SCM with full prefetching. The amount of data (watermark) after
which the prefetch process stops was chosen experimentally to be 8 blocks.
Figure 3.6. SCM vs classical cache accuracy (95-100%) for gcc, vortex, perl, m88ksim, ijpeg, go and li.


Figure 3.7. SCM (full) vs SCM-limited accuracy (95-100%) for gcc, vortex, perl, m88ksim, ijpeg, go and li.

As shown, there is a notable improvement of the hit ratio for the tests with relatively
bigger procedure frames, such as gcc and li.
The next figure shows the relative improvement of the SCM with limited prefetching
versus the stack cache with classical organization.

Figure 3.8. SCM-limited vs classical cache accuracy (95-100%) for gcc, vortex, perl, m88ksim, ijpeg, go and li.
As can be seen from the graph, the SCM-limited performs better in all tests compared with
the stack cache with classical organization. This emphasizes how important the filtering process
is in any prefetch method.
4. CONCLUSION
The existing methods of prefetching stack data do not employ algorithms for precisely
tracking procedure frames and thus cannot get the maximum from the spatial
locality of the stack references. In this paper we propose a new algorithm for caching stack
data in advance, using a novel approach for starting the prefetching process early.


The proposed SCM acts as a window into the run-time stack memory, containing all or nearly
all data from the currently used procedure frames. The SCM prefetches and stores all, or a limited
number, of the data from a procedure frame, employing two mechanisms for starting the
prefetching process early. The SCM starts prefetching at the moment of execution of
any control branch (call/return), and it also prefetches the data from the frame of the procedure
that is likely to be executed next.
This approach allows the prefetching process to start early enough relative to the
usage of the stack data and thus improves the cache hit ratio. It also adds a selective
characteristic to the prefetch method compared to the classical prefetch methods, because the
depth of the prefetch requests depends on the size of the procedure frames and is not a
constant.
The proposed stack cache memory achieves a very high hit ratio, on average
99.35% for an SCM of 2KB. When the data from the next probable procedure frame are also
prefetched, we get a 30.31% improvement of the hit ratio. Unfortunately the proposed SCM has
some drawbacks; for example, the usage of prefetched blocks is relatively low, on average
52.28%. Because of this a version of the SCM with a limit on the amount of prefetched data
is proposed, which achieves notably better results for applications with relatively bigger procedure
frames, 99.85% on average.
The conclusion we can draw is that the proposed SCM performs very well compared to
cache memories with classical organization and achieves very good results.
REFERENCES
[1] Vander Wiel, S. P., Lilja, D.J., "When Caches Aren't Enough : Data Prefetching
Techniques", IEEE Computer, vol. 30, no.7, pp 23-30, July 1997.
[2] Pinter, S. S., Yoaz, A., Tango: A Hardware-Based Data Prefetching Technique for
Superscalar Processors, Proc. of the 29th symp. on Microarchitecture, pp 214-255, 1996.
[3] Fu, J. - Patel, J.: Stride directed prefetching in scalar processors. In: Proc. of the 25th
Int'l Symp. on Microarchitecture, pp. 102-110, December 1992.
[4] Joseph, D. - Grunwald, D.: "Prefetching Using Markov Predictors", In: Proceedings of the
24th Annual Symposium on Computer Architecture, Denver-Colorado, pp. 252-263, June 2-4
1997.
[5] Ditzel, D. - McLellan, R.: "Register Allocation for Free: The C Machine Stack Cache", In:
Proc. of the Symp. on Architectural Support for Prog. Lang. and Operating Systems, pp. 48-56, March 1982.
[6] Cho, S. et al.: "Decoupling Local Variable Accesses in a Wide-Issue Superscalar
Processor", In: Proc. of the 26th Int'l Symp. on Computer Arch., pp. 100-110, May 1999.
[7] Hemsthad, A.: Implementing a Stack Cache, Rice University, Advanced Microprocessors
Architecture, 1998 http://www.owlnet.rice.edu/~elec525/projects/SCreport.pdf
[8] Romansky, R. - Lazarov, Y.: A method for tracking procedure invocations, Automatics
and Informatics, No. 2, 2004.
[9] Burger, D. - Austin, T.: "The SimpleScalar Tool Set, Version 2.0," Computer Sciences
Department Technical Report, No. 1342, Univ. of Wisconsin, June 1997.
