
Machine Learning in Compilers:
Past, Present and Future

Hugh Leather
University of Edinburgh
Edinburgh, Scotland
[email protected]

Chris Cummins
Facebook AI Research
Menlo Park, California
[email protected]

Abstract—Writing optimising compilers is difficult. The range of programs that may be presented to the compiler is huge and the systems on which they run are complex, heterogeneous, non-deterministic, and constantly changing. The space of possible optimisations is also vast, making it very hard for compiler writers to design heuristics that take all of these considerations into account. As a result, many compiler optimisations are out of date or poorly tuned.

Near the turn of the century it was first shown how compilers could be made to automatically search the optimisation space, producing programs far better optimised than previously possible, and without the need for compiler writers to worry about architecture or program specifics. The searches, though, were slow, so in the years that followed, machine learning was developed to learn heuristics from the results of previous searches so that thereafter the search could be avoided and much of the benefit could be gained in a single shot.

In this paper we will give a retrospective of machine learning in compiler optimisation from its earliest inception, through some of the works that set themselves apart, to today's deep learning, finishing with our vision of the field's future.

Index Terms—machine learning, compilers.

I. Introduction

Machine learning in compilers has been around for more than two decades. It is now a burgeoning field. This paper looks back at where the field started, covers some of the stand-out works over the years, and then presents our vision for the future. We add to two earlier surveys, [1], [2], offering additional perspectives on the field. First, we begin with the precursor to machine learning in compilers: directly searching the optimisation space with iterative compilation.

II. Iterative Compilation

Developers have known since they first used optimising compilers that the compiler does not always choose the best options. They would try different compiler command lines, either in an ad hoc fashion or more rigorously by searching. Even simple techniques, like exhaustively searching for the best tiling factors for matrix multiplication, could yield considerable benefits. Eventually, this practice would be named iterative compilation in Europe, or adaptive compilation or auto-tuning in the US.

The idea is straightforward: define a space of optimisation strategies, such as unrolling factors, tilings, or complete command lines, then use a search technique to find the best one. The search evaluates each strategy by compiling the program and running it with representative inputs enough times to estimate the measure of interest (typically performance). This process is shown in Figure 1. Unlike many compiler heuristics, iterative compilation is blind to the platform, easily adapts to changes, and is based on evidence, not belief. The potential gains are huge: [3] found speedups of up to 2.23× over many data sets.

Fig. 1. Iterative Compilation: a search technique explores a space of compilation strategies, continually compiling, executing and profiling to find the best performing strategy.
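To make the loop of Figure 1 concrete, the minimal sketch below drives an iterative compilation search in Python. It assumes a hypothetical benchmark file bench.c, a GCC-style compiler driver, and a tiny hand-picked space of flag combinations; a real search would use a smarter strategy than exhaustive enumeration and far more repetitions per measurement.

```python
import itertools
import subprocess
import time

# A toy optimisation space: every combination of a few flags.
# (Hypothetical example; real spaces contain thousands of options.)
FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer"]
SPACE = [list(c) for n in range(len(FLAGS) + 1)
         for c in itertools.combinations(FLAGS, n)]

def evaluate(flags, source="bench.c", runs=5):
    """Compile with the given flags and return the median runtime."""
    subprocess.run(["gcc", "-O2", *flags, source, "-o", "bench"], check=True)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["./bench"], check=True)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

best = min(SPACE, key=evaluate)   # exhaustive search over the space
print("best flag setting:", best)
```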
The paper that first coined the term iterative compilation was [4], in which they show how the search space for one problem is both non-linear and different across architectures. They used a simple grid search, but what they were trying to demonstrate was: first, that any heuristic was going to be difficult for humans to derive; second, that a new heuristic would be needed for each architecture; and third, that the benefits of selecting the right optimisation include considerable performance gains.

There have been many iterative compilation works. Each targets some different heuristic or applies a different search technique. The use of genetic algorithms is common [5]–[7], but random search [8] and greedy approaches [9] also feature. Optimisation targets include phase ordering [7], [10], code size [6], compiler flags [11], and many others. Libraries, such as ATLAS [12] and SPIRAL [13], auto-tune on installation. PetaBricks [14] and LIFT [15] are among the works that expand the search space to include algorithmic choices. The high cost of iterative compilation has been addressed by statistical techniques [16].
Frameworks to support iterative compilation exist, such as Collective Tuning [17], OpenTuner [18], and CLTune [19].

III. Machine Learning

Although iterative compilation can improve program performance to a remarkable degree, the search, which must be repeated for every new program and might require thousands of compilations and executions, is so time-consuming as to be impractical for all but a few specialised use cases. Early pioneers began to look to supervised machine learning to get the same benefits as iterative compilation but in a single compilation.

The principle is relatively straightforward. Training programs are iteratively compiled to find the best compilation strategy for each. For example, if the optimisation is loop unrolling, then iterative compilation will find the best unroll factor for each loop in a number of training benchmarks. A compiler writer decides what information summarises the programs in a way that may be useful in deciding which compilation strategy to apply to any particular program. For loop unrolling, this might be a vector of values, such as the loop's trip count, the number of instructions in the loop, the dependency depth, and so forth. These pairs of summary vectors and desired strategies found by iterative compilation become training data for a supervised machine learner. The summary data are called feature vectors. The output of the learner is a model that can be used, given the features of a new, unseen program, to predict the best compilation strategy or heuristic value. The model can then be inserted into the compiler, replacing whatever human-built heuristics existed previously. Moreover, should there be any changes to the architecture, operating system, the rest of the compiler, or the target application domain, then the training data can simply be regenerated in the new environment and the machine-learned heuristic retuned on it. These steps are shown in Figures 2, 3 and 4.

Fig. 2. Feature vectors are designed by compiler experts who decide what information will be most helpful when deciding the best value for the target heuristic.

Fig. 3. A supervised machine learning tool creates a predictive model from training examples.

Fig. 4. In production, the predictive model replaces a human-constructed heuristic. Features are calculated for a new program and the model predicts the best heuristic value.
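As a sketch of the training pipeline in Figures 2–4, the Python fragment below fits a decision tree that maps hand-designed loop features to the best unroll factor found by iterative compilation. The feature names and the tiny data set are invented for illustration; in practice the labels come from searching over many training loops.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data gathered by iterative compilation.
# Each feature vector: [trip_count, num_instructions, dependency_depth]
features = [
    [1024, 12, 1],
    [8,    64, 5],
    [256,  20, 2],
    [4,    90, 7],
]
best_unroll_factor = [8, 1, 4, 1]   # labels found by the search

model = DecisionTreeClassifier().fit(features, best_unroll_factor)

# At compile time the model replaces the hand-written heuristic:
new_loop = [[512, 16, 1]]
print("predicted unroll factor:", model.predict(new_loop)[0])
```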
The earliest example we can find is [20], which used a neural network for branch prediction, both in hardware and in compiler optimisations. Since then, many different compiler optimisations have been targeted, each working at a different granularity. Many learn command line options for whole programs or compilation units [8], [21]. Others consider individual heuristics, such as loop unrolling [22], [23], instruction scheduling [24], inlining [25], data partitioning [26], or thread coarsening [27]. Dozens of different heuristics have been examined. A series of works used genetic programming¹ to learn heuristics [28]–[30].

Perhaps one of the more well-known works was MilepostGCC [21]. Milepost brought together the many necessary tools to do complete machine learning in compilers experiments. These tools included the basic iterative compilation, simple static program features, and so forth, to enable a modest range of experiments to be undertaken. The work was featured on the extremely popular Slashdot website, leading to widespread attention for a while.

An interesting work [8] employed a hybrid approach. They used machine learning, not to directly predict the best optimisation, but rather to predict which part of the search space would be profitable for further iterative compilation. Their model determines which points in the space are likely to be within 5% of the optimal. Iterative compilation then searched that reduced space, lowering the cost of iterative compilation and reducing the burden on the machine learning to get the prediction correct.

¹ Conflating genetic programming with machine learning is a good way to get in a row with machine learning experts, as one author has found out to his cost.

A. Fitting into the ML Mould

The examples previously cited learn the heuristic directly. Often this is predicting some best category, such as a loop unroll factor or whether to inline a function. Not all problems fit so neatly, and when it is not clear how to represent the heuristic values as a category then they instead require more careful consideration of how to fit them into a machine learning paradigm.

Instruction scheduling, for example, requires permuting instructions to improve their performance. Rather than learning permutations of instructions, [31] and [30] learn priority functions that compare two instructions and determine which should come first. This neatly sidesteps the permutation issue, as the ordering is generated implicitly.

In [32], it is similarly not obvious how the problem will be mapped to machine learning. The input problems are parallel streaming task graphs where each node represents some computation that is fed along edges to the next computation. The goal is to allocate the tasks to threads to reduce communication overhead while also improving parallelism and throughput. The approach taken in the paper again avoids having to directly learn the heuristic. Instead, they take an input graph, describe it using some features, and then predict what the features of the ideal, mapped version of that graph should be. This gives them something to aim for, and they then do a hill-climbing search, randomly applying merge or split operations to regions of the graph, aiming to approach the predicted ideal. As the features are static, the mapped program need not be run during the search, accelerating the search.

There are several works that predict the performance of a program after applying a transformation [33]–[35]. In these papers, the general idea is that, after learning the speedup of different parts of the optimisation space, the space can be quickly searched for good optimisations without having to re-execute the program.

B. Feature Design

Machine learning tools need features that describe the data to the heuristic. For example, in loop unrolling the number of times the loop will iterate may be a good feature. The most common feature types have been the frequencies of instruction types in the code to be optimised [24], [27], [36], [37]. When learning branch prediction routines, [20] used features about the type of the branch and the successor instructions, whether the branch is a loop, and its direction. [30] learned hyper-block formation policy with features including the maximum dependency height of instructions in the path, the total number of instructions, and whether there are memory hazards. For register allocation, they use the number of calls in the containing basic block, use-def counts, and estimates of spill costs and benefits. [24] also included features about the number of garbage collection points in a method and the possibility of causing a thread switch. [37] summarise loops according to the number of memory accesses, histograms of the different instruction types, and iteration count estimates. [22], [31] use counts and lengths of use-def chains, dependency heights, and latencies.
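A feature extractor of the kind described above can be very simple. The sketch below computes an instruction-type histogram plus two scalar features for a loop represented as a flat list of opcodes; the opcode names and the loop representation are invented for illustration, not taken from any particular compiler.

```python
from collections import Counter

# Hypothetical loop body, represented as a flat list of opcodes.
loop = ["load", "mul", "add", "store", "add", "branch"]
trip_count_estimate = 128

def extract_features(opcodes, trip_count):
    """Build a fixed-length feature vector: opcode histogram + scalars."""
    kinds = ["load", "store", "add", "mul", "branch"]
    histogram = Counter(opcodes)
    return [histogram[k] for k in kinds] + [len(opcodes), trip_count]

print(extract_features(loop, trip_count_estimate))
# -> [1, 1, 2, 1, 1, 6, 128]
```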
Sensible features are essential to learning good heuristics. Conversely, whether the indentation was four spaces or one tab is probably not a useful feature². This example is facetious, but it is quite common for features to be poorly designed. Features that are in essence random relative to the problem at hand can confuse the machine learner. It can find a coincidental correlation in the random data and then attempt to learn that, ruining the generality of the model. A feature may be a function of other features, which is sometimes helpful when the machine learner cannot combine information that way, and sometimes not helpful when the redundant information makes learning slower and less accurate. Features may also be incompatible with the chosen machine learning tool. For example, if features place different classes in nonlinear regions, linear models cannot separate them.

² Although clearly, anyone using tabs is psychotic.

Some solutions to this have been explored. [38] determined the distance between two programs by a graph similarity metric. They begin by comparing the similarity of basic blocks and then expand that outwards to measure the similarity of CFGs. Armed with this distance metric, they use a k-nearest-neighbour model to find which heuristic values to use. The downside of the approach is the computational cost. Not only do many training graphs need to be shipped with the compiler, but the distance calculation itself is expensive.

Genetic programming was used to search for features in [23]. A grammar described a programming language of features. A genetic program evolved individuals from the feature language, searching to improve the accuracy of a machine learning model for the target heuristic.

A similar technique to searching a feature space was taken by [39], where the program was represented by facts in the logic programming language Datalog. From these facts, they used a solver to infer rules matching program facts to the desired heuristics.

Static compiler features have always been an issue. It is impossible to know statically how often code will be executed. Consider a function with two possible control paths. Very different optimisation may be necessary, depending on which path is taken. It may be that at run time one path is rarely or never taken, but the static compiler cannot know this. Many machine learning works have used features like 'the number of instructions in a function' and are oblivious even to static control flow, let alone dynamic control flow. Performance counters have been used as features to solve this [40]. The program is run once, collecting counters for hot code. These counters are used directly as features, which now depend on the run time behaviour rather than purely static analyses.

C. Offline vs Online

Nearly all machine learning works for compilers do the learning offline. Different compilation options are applied to example programs in the lab. The reasons that the learning is not done live on users' machines are twofold. Firstly, it is quite common to find that, while looking for the best optimisation strategy, the search touches on appallingly bad strategies that can trash the program's performance.
Users would be distinctly nonplussed to find their programs running at half speed, even if it was in the cause of eventually good performance. Secondly, in the laboratory, the inputs are always the same, so that for deterministic programs the timings of any two runs are directly comparable and it is easy to see which strategy is superior. In a live system, the inputs are different each time and, as there is no guarantee that two runs will do the same amount of work, the difference in their run times may be due to that, rather than better optimisation. Chen [41], [42] shows how much input data can affect the best optimisation choices.

[43] identified stable phases in a program execution during which performance comparisons could be made. They then used multi-versioning to select different compiled versions to improve performance. Mars [44] proposes using IPC for online adaptation. During the learning phase, competing versions of a hot function are executed, each for the same amount of time. The one with the highest number of retired instructions is selected for use. [45], [46] have used user inputs to perform adaptation in distributed data centers. Each compilation worker receives a subset of the input data set on which to evaluate a small set of optimization settings. The best such setting from each round is used for subsequent executions of the same code. The best-found compilation strategy is then refined over time, by testing new settings and re-evaluating old ones on new data sets. This approach only works well with MapReduce-like workloads, since it relies on the framework for repeating the same computation multiple times without causing side-effects. [47] solves some of the online problems for iterative compilation only. They capture a memory snapshot as a hot function is being executed live. The snapshot is recreated offline and compilation strategies are searched. This enables tuning specifically for each user while ensuring the user does not suffer slow performance during the search.

IV. Deep Learning

The advent of deep learning has begun to pervade every discipline and compilers are no exception. Deep learning means using very large neural networks with many, or 'deep', layers. Previously, these large networks were infeasible to train, but processing power is now up to the task. What these advances have changed is that large, deep neural nets can scale up to much larger training data sizes and produce much more detailed models that can learn functions with an ease that might previously have seemed magical. But the game-changer is that, whereas before the choice of features was so crucial, it is now possible to feed in raw data and the deep learner will make sense of it.

The first such work was [48]. They parsed input programs as source token streams and then used a deep neural network to directly predict from that what the right heuristic value should be for some optimisations in OpenCL. They made use of an existing technology that had had great success in natural language processing called long short-term memory networks (LSTMs). These nets are able to process streams of inputs and remember events from arbitrarily far in the past. This enables them to somewhat understand the structure of the program, such as whether a variable has been declared in the past. The results improved over prior, hand-built features.

The authors of [49] went further. They realised that while a token representation is well suited to ambiguous natural language, a graph-based representation would suit programming languages better. They represent the instructions of a program as edges in a graph describing the relationships between variables. They then learn vectors to represent each instruction given its context in the graph. A program can then be processed by LSTMs as a sequence of these vectors. [50], [51] extend this idea so that the graph structure is used not just to decide the vectors used to represent instructions, but also how the learner processes them. They use message passing neural networks where each node has a state. That state is sent along edges to each neighbour, which merges it into its own state with a learned function. After some rounds of message passing, a learned aggregation function gives the heuristic value.
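The sketch below illustrates this kind of message passing on a tiny program graph using NumPy. The graph, the state size, and the use of plain matrix multiplications as stand-ins for the learned message, merge and readout functions are simplifying assumptions; a real model such as those in [50], [51] learns these functions and trains them end to end.

```python
import numpy as np

# Toy program graph: node 0 -> 1, 1 -> 2 (e.g. def-use edges).
edges = [(0, 1), (1, 2)]
state = np.random.rand(3, 4)          # one 4-dimensional state per node

W_msg = np.random.rand(4, 4)          # stand-ins for learned weights
W_merge = np.random.rand(8, 4)
W_readout = np.random.rand(4, 1)

def message_passing_round(state):
    """Each node receives messages from its neighbours and merges them."""
    incoming = np.zeros_like(state)
    for src, dst in edges:
        incoming[dst] += state[src] @ W_msg     # send state along the edge
    merged = np.concatenate([state, incoming], axis=1) @ W_merge
    return np.tanh(merged)                      # new node states

for _ in range(3):                              # a few propagation rounds
    state = message_passing_round(state)

heuristic_value = state.sum(axis=0) @ W_readout # aggregate, then read out
print(heuristic_value)
```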
Compared with machine learning of the past, deep learners are hungry for data sets far larger than those typically seen in compiler research. While in some cases this can be mitigated by generating data synthetically [52], the pace of innovation will increase considerably with an increase in the availability of large labelled data sets.

V. Reinforcement Learning

Recently, reinforcement learning techniques have begun to make inroads in compiler optimization. Reinforcement learning concerns the process of iterative decision making of an agent within an environment. The environment provides a state, a set of actions, and a reward signal. The goal of an agent is to select the sequence of actions, one at a time, that will maximise cumulative reward.

A recent work [53] casts loop vectorization as a reinforcement learning problem. An environment represents a program containing a single loop of interest, observations are provided by summarizing paths through the program's AST, and reward is calculated using the runtime of the program after applying a given vectorization choice. This approach works well, but frames the problem in such a way that an agent only makes a single decision per problem. One of the key strengths of reinforcement learning is the ability to decompose large problems into a sequence of smaller discrete choices.

Many compiler optimization problems can be broken down into a sequence of smaller decisions to fit the reinforcement learning mould. For example, in [54], the full optimization pipeline of LLVM is presented as an environment in which a partially optimized program provides the state, and an action is to select a single optimization pass to run. The selected pass is then run, producing a new state. Reward is provided by compiling the partially optimized program and estimating the execution cycle count. In this manner, the sequence of transformations that produces the best performing code can be found through incremental improvements.
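A minimal sketch of such a pass-ordering environment is shown below. The pass names, the episode length and the purely random agent are illustrative assumptions, and the reward is an invented stand-in for the cycle-count estimate used in [54]; the point is only to show the state/action/reward loop, not a faithful reimplementation.

```python
import random

PASSES = ["inline", "licm", "gvn", "loop-unroll", "dce"]  # toy action set

class PassOrderingEnv:
    """State: the sequence of passes applied so far. Action: the next pass."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.applied = []

    def reset(self):
        self.applied = []
        return tuple(self.applied)

    def step(self, action):
        self.applied.append(action)
        # Placeholder reward: a real system would run the pass, then
        # compile and estimate cycles; here we just invent a score.
        reward = random.uniform(-1.0, 1.0)
        done = len(self.applied) >= self.episode_length
        return tuple(self.applied), reward, done

env = PassOrderingEnv()
state, done, total = env.reset(), False, 0.0
while not done:                         # a random agent, for illustration
    action = random.choice(PASSES)
    state, reward, done = env.step(action)
    total += reward
print("pass sequence:", state, "cumulative reward:", total)
```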
[55] uses reinforcement learning to tackle a graph partitioning problem, similar to [32]. In [55], the goal is to find the optimal device placement for nodes in large computation graphs so as to minimize runtime by most efficiently exploiting the available hardware and minimizing communication costs. An LSTM model is used to produce a representation of a particular mapping by feeding through a description of each operation's type, shape, and graph adjacencies. A second LSTM model decodes this representation to provide a sequence of device placements. A key challenge here is that these sequence-to-sequence techniques struggle with long sequences, limiting the scalability over large problems. This was addressed in [56] using a hierarchical model to decompose large graphs.

VI. The Future

We have been in this field for the last fifteen years³. In that time we have seen it move from a niche academic discipline, struggling for acceptance, to one which industry is now scrambling to adopt.

³ Well, the one of us with gray hair has, anyway.

So, where is this field going? What do we need to do in the coming years? Optimisation is hard and it is only going to get harder still. We need to remove humans from all the heuristics in the compiler, and this will require a coordinated, inter-disciplinary effort. We need compiler writers to build modular compilers that support iterative compilation and machine learning from the ground up. We need machine learning researchers to invent models that are suited to the recurrent, flow-based nature of programs. And we need efficient learning strategies that can cope with the huge number of choices, distant rewards, and slow evaluations that apply to compiler optimisation.

A. Machine Learning Enabled Compilers

Modern compilers are multi-million-line pieces of software that can take years to master. Exposing the compiler as a playground for experimentation will lower the barrier to entry in compiler research, having a democratizing effect. The first step is to enable every optimization choice to be exposed through discoverable APIs with which iterative search and machine learning can interact. For example, when a loop may be unrolled, a search or machine learning tool should be able to determine the range of acceptable factors, make queries about the code, and force an unroll factor. Notice that this has to be dynamic, rather than determined by some static list of unroll factors per loop, since earlier choices change the loops that will be considered. The compiler becomes a transformation and query engine, capable of making decisions but not needing to do so. There are a lot of choices made in compilers. In compilers with extensible representations, like MLIR, the challenge of enabling these APIs is greater than in more locked-down compilers. The software engineering effort to make a truly machine learning-enabled compiler should not be underestimated.
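What such a discoverable interface might look like is sketched below. This is not an existing compiler API: the class names, the query methods and the unroll example are hypothetical, meant only to convey the idea of a compiler that exposes each choice point to an external search or learning tool.

```python
from dataclasses import dataclass

@dataclass
class ChoicePoint:
    """A single exposed decision, e.g. the unroll factor for one loop."""
    name: str
    legal_values: list        # discovered dynamically, per choice point
    features: dict            # queries about the surrounding code

class CompilerSession:
    """Hypothetical wrapper: the compiler proposes choices, the tool decides."""

    def __init__(self, source):
        self.source = source

    def next_choice(self):
        # In a real system this would come from the compiler as it runs;
        # here we fabricate one choice point for illustration.
        return ChoicePoint(name="unroll:loop@main:3",
                           legal_values=[1, 2, 4, 8],
                           features={"trip_count": 128, "body_size": 14})

    def decide(self, choice, value):
        print(f"forcing {choice.name} = {value}")   # would apply the transform

session = CompilerSession("bench.c")
choice = session.next_choice()
session.decide(choice, max(choice.legal_values))    # a trivial 'policy'
```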
B. Deep Language Modelling at Scale

Before the advent of deep learning in compilers, the features that summarised programs were quite basic. The token-based approach of [48], the embedding of inst2vec [49], and the new graph representation of [50] make great strides. These are not enough, however. The more the machine learning can understand the program, the better. We need better program representations and compiler-specific DNN architectures.

For instance, the most advanced representation [50] does not represent variables, types, operand order, etc. It is, therefore, incapable of replicating the common data-flow analyses that litter any modern compiler; nor can the others. Data-flow is fundamental to practically every optimisation in the compiler. We need models that can reason about programs in at least as complex a fashion. This will require better representations and graph RNNs that match the data-flow pattern. We may never know that we have the best formulation, but if what we have cannot at least learn all the standard data-flow analyses, then we do know it is not yet enough.

C. Reinforcement Learning Everything

With all of the compiler's choices exposed and suitable representations available, we can begin to replace every heuristic in the compiler with a learned one. We see reinforcement learning as the most promising approach. Reinforcement learning seeks to choose actions that move through a state space so that the final state has the greatest associated reward. For compilers, the states are the IR of a program, an action is a transformation of some part of the IR, and the reward is the speedup when the program is fully compiled and run on representative inputs. Such a system would continually apply transformations until no further speedup could be squeezed out.

Fig. 5. Our vision: a reinforcement learning system to control all aspects of the compiler.

Our proposed architecture is shown in Figure 5. The compiler front-end processes the program source code as normal, constructing an intermediate representation. It chooses an initial optimisation context on which to focus (most likely the main function). The context can be later changed by the RL system, to look at optimisations on other functions, or at different granularities, such as loops, basic blocks and individual instructions. For each context the compiler can determine a set of applicable and valid transformations, which it passes to an RL agent to make its choice. The IR is consumed by a language model which compresses the IR into a finite state vector. The agent chooses the next action based on the current state of the program and a history of actions and states it has seen. An action is either a transformation to apply to the current focused context or a change of focus to another context. The RL system will take actions that increase the expected future reward, which will be the speedup found by applying the action sequence to the code. When the predicted future reward is zero, then no further speedup can be gained by additional actions and the process can stop, delivering the final executable to the user.

This problem is larger than those to which reinforcement learning is typically applied. The state space is huge: programs come from a space of unbounded dimension. The action space is also huge: hundreds or thousands of transformations are possible, many can be parametrised, and there are many places in the code to apply them. Evaluating the reward is slow: the program must be compiled to a binary and executed with representative inputs enough times to give statistically sound timings. All these challenges will need careful thought and vast computing power to solve, but in the end, we will have compilers that far exceed the quality of today's.

VII. Conclusion

Machine learning is making a significant impact on compiler optimisation and will continue to in coming years.
References

[1] A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano, "A survey on compiler autotuning using machine learning," ACM Comput. Surv., vol. 51, no. 5, Sep. 2018. [Online]. Available: https://doi.org/10.1145/3197978
[2] Z. Wang and M. O'Boyle, "Machine learning in compiler optimization," Proceedings of the IEEE, vol. 106, no. 11, pp. 1879–1901, 2018.
[3] Y. Chen, Y. Huang, L. Eeckhout, G. Fursin, L. Peng, O. Temam, and C. Wu, "Evaluating iterative optimization across 1000 datasets," in Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 448–459. [Online]. Available: https://doi.org/10.1145/1806596.1806647
[4] F. Bodin, T. Kisuki, P. Knijnenburg, M. Boyle, and E. Rohou, "Iterative compilation in a non-linear optimisation space," Workshop on Profile and Feedback-Directed Compilation, Mar. 2000.
[5] S. J. Beaty, "Genetic algorithms and instruction scheduling," in Proceedings of the 24th Annual International Symposium on Microarchitecture, ser. MICRO 24. New York, NY, USA: Association for Computing Machinery, 1991, pp. 206–211. [Online]. Available: https://doi.org/10.1145/123465.123507
[6] K. D. Cooper, P. J. Schielke, and D. Subramanian, "Optimizing for reduced code space using genetic algorithms," SIGPLAN Not., vol. 34, no. 7, pp. 1–9, May 1999. [Online]. Available: https://doi.org/10.1145/315253.314414
[7] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman, "Finding effective compilation sequences," in LCTES '04: Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. New York, NY, USA: ACM, 2004, pp. 231–239.
[8] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. O'Boyle, J. Thomson, M. Toussaint, and C. Williams, "Using machine learning to focus iterative optimization," in CGO '06: Proceedings of the International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, Mar. 2006, pp. 295–305. [Online]. Available: http://www.anc.ed.ac.uk/machine-learning/colo/cgo06.pdf
[9] Z. Pan and R. Eigenmann, "Fast and effective orchestration of compiler optimizations for automatic performance tuning," in CGO '06: Proceedings of the International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, 2006, pp. 319–332.
[10] P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, and D. Jones, "Fast searches for effective optimization phase sequences," in Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, ser. PLDI '04. New York, NY, USA: Association for Computing Machinery, 2004, pp. 171–182. [Online]. Available: https://doi.org/10.1145/996841.996863
[11] M. Haneda, P. M. W. Knijnenburg, and H. A. G. Wijshoff, "Automatic selection of compiler options using non-parametric inferential statistics," in PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 2005, pp. 123–132.
[12] R. C. Whaley and J. J. Dongarra, "Automatically tuned linear algebra software," in Conference on High Performance Networking and Computing. IEEE Computer Society, 1998, pp. 1–27.
[13] M. Püschel, J. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. Johnson, and N. Rizzolo, "SPIRAL: Code generation for DSP transforms," Proceedings of the IEEE, vol. 93, no. 2, pp. 232–273, Feb. 2005.
[14] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe, "PetaBricks: A language and compiler for algorithmic choice," SIGPLAN Not., vol. 44, no. 6, pp. 38–49, Jun. 2009. [Online]. Available: https://doi.org/10.1145/1543135.1542481
[15] M. Steuwer, T. Remmelg, and C. Dubach, "LIFT: A functional data-parallel IR for high-performance GPU code generation," in 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017, pp. 74–85.
[16] H. Leather, M. O'Boyle, and B. Worton, "Raced profiles: Efficient selection of competing compiler optimizations," in LCTES '09: Proceedings of the ACM SIGPLAN/SIGBED 2009 Conference on Languages, Compilers, and Tools for Embedded Systems, June 2009.
[17] G. Fursin, R. Miceli, A. Lokhmotov, M. Gerndt, M. Baboulin, A. D. Malony, Z. Chamski, D. Novillo, and D. Del Vento, "Collective Mind: Towards practical and collaborative auto-tuning," Sci. Program., vol. 22, no. 4, pp. 309–329, Oct. 2014. [Online]. Available: https://doi.org/10.1155/2014/797348
[18] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, and S. Amarasinghe, "OpenTuner: An extensible framework for program autotuning," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, ser. PACT '14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 303–316. [Online]. Available: https://doi.org/10.1145/2628071.2628092
[19] C. Nugteren and V. Codreanu, "CLTune: A generic auto-tuner for OpenCL kernels," in MCSoC. IEEE Computer Society, 2015, pp. 195–202.
[20] B. Calder, D. Grunwald, M. Jones, D. Lindsay, J. Martin, M. Mozer, and B. Zorn, "Evidence-based static branch prediction using machine learning," ACM Transactions on Programming Languages and Systems, vol. 19, 1996.
[21] G. Fursin, C. Miranda, O. Temam, M. Namolaru, E. Yom-Tov, A. Zaks, B. Mendelson, P. Barnard, E. Ashton, E. Courtois, F. Bodin, E. Bonilla, J. Thomson, H. Leather, C. Williams, and M. O'Boyle, "Milepost GCC: Machine learning based research compiler," in Proceedings of the GCC Developers' Summit, June 2008.
[22] M. Stephenson and S. Amarasinghe, "Predicting unroll factors using supervised classification," in International Symposium on Code Generation and Optimization (CGO). IEEE, 2005.
[23] H. Leather, E. Bonilla, and M. O'Boyle, "Automatic feature generation for machine learning based optimizing compilation," in CGO '09: Proceedings of the International Symposium on Code Generation and Optimization, March 2009.
[24] J. Cavazos and E. Moss, "Inducing heuristics to decide whether to schedule," vol. 39, Jun. 2004, pp. 183–194.
[25] J. Cavazos and M. F. P. O'Boyle, "Method-specific dynamic compilation using logistic regression," SIGPLAN Not., vol. 41, no. 10, pp. 229–240, 2006. [Online]. Available: http://homepages.inf.ed.ac.uk/jcavazos/oopsla-2006.pdf
[26] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer, "A static performance estimator to guide data partitioning decisions," vol. 26, Jul. 1991, pp. 213–223.
[27] A. Magni, C. Dubach, and M. O'Boyle, "Automatic optimization of thread-coarsening for graphics processors," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, ser. PACT '14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 455–466. [Online]. Available: https://doi.org/10.1145/2628071.2628087
[28] N. Paterson, M. Livesey, and K. Ss, "Evolving caching algorithms in C by genetic programming," in Genetic Programming. MIT Press, 1997, pp. 262–267.
[29] M. O'Neill and C. Ryan, "Automatic generation of caching algorithms," in Evolutionary Algorithms in Engineering and Computer Science. John Wiley and Sons, 1999, pp. 127–134.
[30] M. Stephenson, S. Amarasinghe, M. Martin, and U.-M. O'Reilly, "Meta optimization: Improving compiler heuristics with machine learning," in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, ser. PLDI '03. New York, NY, USA: Association for Computing Machinery, 2003, pp. 77–90.
[31] E. Moss, P. Utgoff, J. Cavazos, C. Brodley, D. Scheeff, D. Precup, and D. Stefanović, "Learning to schedule straight-line code," in Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, ser. NIPS '97. Cambridge, MA, USA: MIT Press, 1998, pp. 929–935.
[32] Z. Wang and M. F. O'Boyle, "Partitioning streaming parallelism for multi-cores: A machine learning based approach," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 307–318. [Online]. Available: https://doi.org/10.1145/1854273.1854313
[33] C.-K. Luk, S. Hong, and H. Kim, "Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY, USA: Association for Computing Machinery, 2009, pp. 45–55. [Online]. Available: https://doi.org/10.1145/1669112.1669121
[34] E. Park, L.-N. Pouchet, J. Cavazos, A. Cohen, and P. Sadayappan, "Predictive modeling in a polyhedral optimization space," in Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, ser. CGO '11. USA: IEEE Computer Society, 2011, pp. 119–129.
[35] B. C. Lee and D. M. Brooks, "Accurate and efficient regression modeling for microarchitectural performance and power prediction," SIGARCH Comput. Archit. News, vol. 34, no. 5, pp. 185–194, Oct. 2006. [Online]. Available: https://doi.org/10.1145/1168919.1168881
[36] Z. Wang, D. Grewe, and M. F. P. O'Boyle, "Automatic and portable mapping of data parallel programs to OpenCL for GPU-based heterogeneous systems," ACM Trans. Archit. Code Optim., vol. 11, no. 4, Dec. 2014. [Online]. Available: https://doi.org/10.1145/2677036
[37] A. Monsifrot, F. Bodin, and R. Quiniou, "A machine learning approach to automatic production of compiler heuristics," in Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications, ser. AIMSA '02. Berlin, Heidelberg: Springer-Verlag, 2002, pp. 41–50.
[38] E. Park, J. Cavazos, and M. A. Alvarez, "Using graph-based program characterization for predictive modeling," in Proceedings of the Tenth International Symposium on Code Generation and Optimization, ser. CGO '12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 196–206. [Online]. Available: https://doi.org/10.1145/2259016.2259042
[39] M. Namolaru, A. Cohen, G. Fursin, A. Zaks, and A. Freund, "Practical aggregation of semantical program properties for machine learning based optimization," in Proceedings of the 2010 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES 2010, Scottsdale, AZ, USA, October 24-29, 2010, V. Kathail, R. Tatge, and R. Barua, Eds. ACM, 2010, pp. 197–206. [Online]. Available: https://doi.org/10.1145/1878921.1878951
[40] J. Cavazos, C. Dubach, F. Agakov, E. Bonilla, M. F. P. O'Boyle, G. Fursin, and O. Temam, "Automatic performance model construction for the fast software exploration of new hardware designs," in CASES '06: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. New York, NY, USA: ACM Press, 2006, pp. 24–34.
[41] Y. Chen, Y. Huang, L. Eeckhout, G. Fursin, L. Peng, O. Temam, and C. Wu, "Evaluating iterative optimization across 1000 datasets," in Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '10. New York, NY, USA: ACM, 2010, pp. 448–459.
[42] Y. Chen, S. Fang, Y. Huang, L. Eeckhout, G. Fursin, O. Temam, and C. Wu, "Deconstructing iterative optimization," ACM Trans. Archit. Code Optim., vol. 9, no. 3, pp. 21:1–21:30, Oct. 2012.
[43] G. Fursin, A. Cohen, M. O'Boyle, and O. Temam, "A practical method for quickly evaluating program optimizations," in International Conference on High-Performance Embedded Architectures and Compilers, Nov. 2005, pp. 29–46.
[44] J. Mars and R. Hundt, "Scenario based optimization: A framework for statically enabling online optimizations," in CGO '09.
[45] Y. Chen, S. Fang, L. Eeckhout, O. Temam, and C. Wu, "Iterative optimization for the data center," in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVII. New York, NY, USA: ACM, 2012, pp. 49–60.
[46] S. Fang, W. Xu, Y. Chen, L. Eeckhout, O. Temam, Y. Chen, C. Wu, and X. Feng, "Practical iterative optimization for the data center," ACM Trans. Archit. Code Optim., vol. 12, no. 2, pp. 15:1–15:26, May 2015.
[47] P. Mpeis, P. Petoumenos, and H. Leather, "Iterative compilation on mobile devices," in Proceedings of the 6th International Workshop on Adaptive Self-tuning Computing Systems (ADAPT), January 2016.
[48] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, "End-to-end deep learning of optimization heuristics," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT 2017), September 2017.
[49] T. Ben-Nun, A. S. Jakobovits, and T. Hoefler, "Neural code comprehension: A learnable representation of code semantics," in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 3588–3600. [Online]. Available: http://papers.nips.cc/paper/7617-neural-code-comprehension-a-learnable-representation-of-code-semantics.pdf
[50] A. Brauckmann, S. Ertel, A. Goens, and J. Castrillon, "Compiler-based graph representations for deep learning models of code," in CC, 2020.
[51] C. Cummins, Z. V. Fisches, T. Ben-Nun, T. Hoefler, and H. Leather, "ProGraML: Graph-based deep learning for program optimization and analysis," arXiv preprint arXiv:2003.10536, 2020.
[52] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, "Synthesizing benchmarks for predictive modeling," in 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2017, pp. 86–99.
[53] A. Haj-Ali, N. K. Ahmed, T. Willke, S. Shao, K. Asanovic, and I. Stoica, "NeuroVectorizer: End-to-end vectorization with deep reinforcement learning," in CGO, 2020.
[54] Q. Huang, A. Haj-Ali, W. Moses, J. Xiang, I. Stoica, K. Asanovic, and J. Wawrzynek, "AutoPhase: Compiler phase-ordering for HLS with deep reinforcement learning," in FCCM, 2019.
[55] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean, "Device placement optimization with reinforcement learning," in ICML, 2017.
[56] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean, "A hierarchical model for device placement," in ICLR, 2018.
