Compiler Based Optimization Techniques For Scratchpad Memory
Outline
Introduction
Motivation
Static Allocation Approach
  Scratchpad only architecture
  Cache + Scratchpad architecture
Dynamic Allocation Approach
  Scratchpad only architecture
Conclusion & Future Work
Embedded Systems
Embedded systems (ES) = information processing systems embedded into a larger product
Main reason for buying is not information processing
  Transportation (e.g. ABS)
  Telecommunication (e.g. mobile phone)
  Manufacturing (incl. robotics)
  Medical instruments (e.g. artificial eye)
www.dobelle.com
Manish Verma, Computer Science XII, Univ. Dortmund, 2004
Power Issues
Power is considered the most important constraint in embedded systems [in: Eggermont (ed.): Embedded Systems Roadmap 2002, STW]
Power Distribution
Memory subsystem consumes > 50% of the total energy budget [1]
Memory hierarchy: Cache vs. Scratchpad
  Power [2]
  Performance [2]
  Predictability [3]
  Software Support
[1] S. Segars, ISSCC, 2001
[2] S. Steinke, DATE, 2002
[3] P. Marwedel, ASPDAC, 2004
Static Allocation Approach: Scratchpad only architecture
Focus on memory- and energy-aware compilation: scratchpad memories (SPM)
Scratchpad: small; no tag memory
[Figure: processor, scratchpad and main memory]
[Figure: Energy (nJ)]
[Figure: which of the objects char ch(), int wh(), int p[], real a[] and int c[] are placed in the SPM (capacity K), and which stay in "main" memory?]
[Figure: Cycles]
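The allocation question in the figure can be modelled as a 0/1 knapsack over the SPM capacity K. This is an assumption at this point, since the slides name Knapsack only later, for the case without conflict edges. A minimal sketch follows; the object sizes and energy savings are hypothetical values chosen for illustration, and only the object names come from the figure.

```python
def allocate_spm(objects, capacity):
    """0/1 knapsack over SPM capacity.

    objects:  list of (name, size_in_bytes, energy_saving), sizes are integers
    capacity: SPM capacity K in bytes
    Returns (total_saving, list_of_names_placed_in_spm).
    """
    # best[c] holds the best (saving, names) achievable with at most c bytes.
    best = [(0, [])] * (capacity + 1)
    for name, size, saving in objects:
        # Walk capacities downwards so each object is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + saving
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]


# Hypothetical sizes (bytes) and savings (arbitrary energy units).
objects = [
    ("ch", 40, 120),
    ("wh", 60, 200),
    ("p", 64, 150),
    ("a", 128, 300),
    ("c", 32, 90),
]
total_saving, in_spm = allocate_spm(objects, capacity=128)
print(total_saving, in_spm)
```

With these made-up numbers, two medium-sized objects together beat the single largest one, which is the kind of trade-off the knapsack model captures.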
Static Allocation Approach: Cache + Scratchpad architecture
[Figure: architecture with scratchpad, D-cache and I-cache]
Example
[Figure: control flow graph of basic blocks B1 to B8 annotated with execution counts, and the mapping of the blocks from instruction memory (I-Mem) to I-cache lines]
Trace Generation
[Figure: basic blocks B1 to B8, annotated with execution counts, grouped into traces T1 to T5]
Min. # jumps across traces
  NP-complete problem
  Greedy approach
    Coalesce most frequently executed BBs
    Size of trace <= scratchpad size
Append NOPs
  Reduce I-cache misses
  Improve processor cycles
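The greedy coalescing step can be sketched as follows. This is an illustrative reading of the bullet points, not the author's implementation: the function name, the seed order and the tie-breaking are assumptions, the control flow graph is hypothetical, and NOP padding is not modelled.

```python
def build_traces(sizes, counts, edges, spm_size):
    """Greedily coalesce frequently executed basic blocks into traces.

    sizes:    dict block -> size in bytes
    counts:   dict block -> execution count
    edges:    dict (src, dst) -> how often control flows from src to dst
    spm_size: upper bound on the size of a trace
    """
    placed = set()
    traces = []
    # Seed traces with the most frequently executed unplaced blocks first.
    for seed in sorted(counts, key=counts.get, reverse=True):
        if seed in placed:
            continue
        trace, total = [seed], sizes[seed]
        placed.add(seed)
        while True:
            last = trace[-1]
            # Unplaced successors that still fit within the size bound.
            candidates = [
                (freq, dst)
                for (src, dst), freq in edges.items()
                if src == last and dst not in placed
                and total + sizes[dst] <= spm_size
            ]
            if not candidates:
                break
            _, nxt = max(candidates)  # follow the hottest edge
            trace.append(nxt)
            placed.add(nxt)
            total += sizes[nxt]
        traces.append(trace)
    return traces


# Hypothetical control flow graph: a hot loop B2 -> B3 -> B5 -> B2.
sizes = {b: 16 for b in ["B1", "B2", "B3", "B4", "B5", "B6"]}
counts = {"B1": 1, "B2": 100, "B3": 90, "B4": 10, "B5": 100, "B6": 1}
edges = {
    ("B1", "B2"): 1, ("B2", "B3"): 90, ("B2", "B4"): 10,
    ("B3", "B5"): 90, ("B4", "B5"): 10, ("B5", "B2"): 99, ("B5", "B6"): 1,
}
traces = build_traces(sizes, counts, edges, spm_size=48)
print(traces)
```

In this example the hot loop ends up in a single trace, so the jumps between its blocks stay inside the trace.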
Conflict Graph
Weighted directed graph
  Nodes (traces): execution frequency
  Edges (conflict relationship): # conflict misses
[Figure: trace execution sequence T4 ((T1 T2 T1)^9 (T1 T3 T1))^10 T5, the mapping of traces T1 to T5 from I-Mem to I-cache lines 0 to 7, and the resulting conflict graph with nodes T1 (200), T2 (180), T3 (20), T4 (1), T5 (1) and edge weights of 20]
Energy Model
Constant
Problem Formulation
Formal Problem Formulation
  Given: conflict graph (G), scratchpad, I-cache, energy model
  Determine: min. energy mapping
  Assumption: no new edges; copying traces
NP-complete:
  Knapsack (no edges)
  Maximum Independent Set (E_SP_Hit = E_Cache_Hit)
Integer Linear Programming / Greedy Heuristic
[Figure: conflict graph with nodes T1 (200) [200], T2 (180) [360], T3 (20), T4 (1), T5 (1) and edge weights of 20]
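To make the formulation concrete, the sketch below evaluates a mapping on a small conflict graph and finds the minimum-energy one by exhaustive search. Exhaustive search stands in for the ILP here purely for illustration and is only feasible for a handful of traces. The energy function is an assumed form (hits in scratchpad or cache per execution, plus a miss penalty for each conflict edge whose two traces both remain in the cache); the per-access energies, trace sizes and edge endpoints are hypothetical. Only the execution frequencies and the edge weight of 20 are taken from the figure.

```python
from itertools import combinations

# Hypothetical per-access energies in nJ (not taken from the slides).
E_SP_HIT, E_CACHE_HIT, E_CACHE_MISS = 1.0, 2.0, 20.0


def energy(on_spm, freq, conflicts):
    """Assumed energy model for a given set of traces placed on the scratchpad."""
    total = 0.0
    for trace, f in freq.items():
        total += f * (E_SP_HIT if trace in on_spm else E_CACHE_HIT)
    for (a, b), misses in conflicts.items():
        # A conflict edge only costs energy if both traces stay in the cache.
        if a not in on_spm and b not in on_spm:
            total += misses * E_CACHE_MISS
    return total


def best_mapping(freq, size, conflicts, spm_size):
    """Try every subset of traces that fits into the scratchpad."""
    best = (energy(set(), freq, conflicts), set())
    names = list(freq)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(size[t] for t in subset) <= spm_size:
                e = energy(set(subset), freq, conflicts)
                if e < best[0]:
                    best = (e, set(subset))
    return best


freq = {"T1": 200, "T2": 180, "T3": 20, "T4": 1, "T5": 1}
size = {t: 64 for t in freq}                          # hypothetical sizes
conflicts = {("T1", "T2"): 20, ("T1", "T3"): 20}      # hypothetical endpoints
best_energy, best_set = best_mapping(freq, size, conflicts, spm_size=64)
print(best_energy, best_set)
```

Under these assumptions, moving T1 to the scratchpad removes both conflict edges at once, which is why the conflict graph matters beyond plain execution frequency.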
[Figure: MPEG benchmark; axis ticks 0 to 21000 over sizes 1024, 2048, 4096, 8192, 16384]
[Figure: MPEG benchmark; I-Cache + SP (512B), I-Cache + SP (1024B) and 8kB DM I-Cache, 0% to 120%, over I-Cache Configuration: 1kB (DM), 2kB (DM), 4kB (DM), 1kB (2-way), 2kB (2-way), 4kB (2-way), 1kB (4-way), 2kB (4-way)]
Dynamic Allocation Approach (Scratchpad Overlay): Scratchpad only architecture
[Figure: memory objects A and B, main memory and scratchpad memory]
Dynamic Allocation (Scratchpad Overlay)
  Increased scratchpad utilization
  Overhead due to spill routines
  Similar to register allocation
[Figure: register file and scratchpad; RISC and CISC]
Scarce resource (register file / scratchpad)
Life-time of variables (temp. regs. / vars + code)
Similar to register allocation (RA) for CISC, not for RISC processors
Memory objects (vars + code) are of various sizes
Scratchpad Overlay
4. Onchip Address Assignment
5. Code Generation
Memory Objects:
  Global Variables (A)
  Non-Scalar Local Variables
  Traces (T1, T2, T3, T4)
[Figure: basic blocks B3 to B8 grouped into traces T2, T3 and T4]
Liveness Analysis
[Figure: control flow graph of basic blocks B1 to B8, annotated per block with DEF, MOD and USE of the memory objects A, T3 and T4]
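Liveness of memory objects can be computed with the standard backward data-flow iteration over the control flow graph. The sketch below assumes that DEF kills an object while USE (and MOD) keep it live; the graph shape and the per-block annotations are hypothetical and only borrow the object names A, T3 and T4 from the figure.

```python
def liveness(succ, use, defs):
    """Iterative backward liveness analysis.

    succ: dict block -> list of successor blocks
    use:  dict block -> set of memory objects used (or modified) in the block
    defs: dict block -> set of memory objects fully defined in the block
    Returns (live_in, live_out), each a dict block -> set of objects.
    """
    live_in = {b: set() for b in succ}
    live_out = {b: set() for b in succ}
    changed = True
    while changed:
        changed = False
        for b in succ:
            # Live at block exit: anything live at the entry of a successor.
            out = set().union(*(live_in[s] for s in succ[b]))
            # Live at block entry: used here, or live at exit and not defined here.
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out


# Hypothetical control flow graph and annotations.
succ = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": ["B5"], "B4": ["B5"], "B5": []}
use = {"B1": set(), "B2": {"A"}, "B3": {"A", "T3"}, "B4": {"T3"}, "B5": {"T4"}}
defs = {"B1": {"A"}, "B2": set(), "B3": set(), "B4": set(), "B5": set()}
live_in, live_out = liveness(succ, use, defs)
print(live_in["B1"], live_out["B1"])
```

In this example A is not live at the entry of B1 because B1 defines it, so no copy of A would be needed before B1.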
Memory Assignment
Given: MOs, live ranges, scratchpad
Determine: memory assignment of MOs
Assumption: on-chip addresses can be assigned to MOs
Discussion: NP-complete, reduces to register allocation
Solutions:
  Optimal: ILP formulation (16 sec.)
  Near optimal: heuristic
[Figure: processor, scratchpad and main memory]
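A simple greedy heuristic illustrates why overlaying helps: two memory objects may share scratchpad space if their live ranges do not overlap. The sketch below is an illustration only and is not the heuristic from the slides; the benefit values, sizes and live ranges are hypothetical, spill costs are ignored, and address assignment is assumed to succeed whenever the capacity check passes (matching the stated assumption).

```python
def overlay_assign(mos, capacity):
    """Greedy scratchpad overlay assignment.

    mos:      dict name -> (size_in_bytes, benefit, set_of_blocks_where_live)
    capacity: scratchpad capacity in bytes
    Returns the list of memory objects assigned to the scratchpad.
    """
    used = {}     # block -> scratchpad bytes occupied while in that block
    chosen = []
    # Consider objects in order of decreasing benefit per byte.
    by_density = sorted(mos.items(),
                        key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    for name, (size, _benefit, live) in by_density:
        # The object must fit at every point of its live range.
        if all(used.get(b, 0) + size <= capacity for b in live):
            for b in live:
                used[b] = used.get(b, 0) + size
            chosen.append(name)
    return chosen


# Hypothetical memory objects: (size, benefit, live range).
mos = {
    "A": (96, 300, {"B1", "B2", "B3"}),
    "T3": (64, 250, {"B3", "B4"}),
    "T4": (96, 150, {"B5"}),
}
chosen = overlay_assign(mos, capacity=128)
print(chosen)
```

Here T3 and T4 together exceed the 128-byte capacity, yet both are assigned because they are never live at the same time; a static allocation could hold only one of them.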
[Figure: control flow graph of basic blocks B3 to B8 with USE A and USE T3 annotations; a new basic block B10 contains SPILL_LOAD(A);]
Solution: A -> SP & T3 -> SP
[Figure: axis ticks 0 to 8000 over sizes 0, 64, 128, 256, 512, 1024; label "1/8th Scratchpad"]
[Figure: Edge Detection; Memory Energy, Total Energy, Execution Time]
[Figure: Total Energy, Execution Time and Code Size; Static Allocation; benchmarks edge_detection, mpeg, adpcm, histogram, multisort, avg.; annotated values 36% and 34%]