A Survey of Machine Learning for Big Code and Naturalness
Research at the intersection of machine learning, programming languages, and software engineering has
recently taken important steps in proposing learnable probabilistic models of source code that exploit the
abundance of patterns of code. In this article, we survey this work. We contrast programming languages
against natural languages and discuss how these similarities and differences drive the design of probabilistic
models. We present a taxonomy based on the underlying design principles of each model and use it to navigate
the literature. Then, we review how researchers have adapted these models to application areas and discuss
cross-cutting and application-specific challenges and opportunities.
CCS Concepts: • Computing methodologies → Machine learning; Natural language processing; • Soft-
ware and its engineering → Software notations and tools; • General and reference → Surveys and
overviews;
Additional Key Words and Phrases: Big code, code naturalness, software engineering tools, machine learning
ACM Reference format:
Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learn-
ing for Big Code and Naturalness. ACM Comput. Surv. 51, 4, Article 81 (July 2018), 37 pages.
https://doi.org/10.1145/3212695
1 INTRODUCTION
Software is ubiquitous in modern society. Almost every aspect of life, including healthcare, energy,
transportation, public safety, and even entertainment, depends on the reliable operation of high-
quality software. Unfortunately, developing software is a costly process: Software engineers need
to tackle the inherent complexity of software while avoiding bugs and still deliver highly functional
software products on time. There is therefore an ongoing demand for innovations in software tools
that help make software more reliable and maintainable. New methods are constantly sought to
reduce the complexity of software and help engineers construct better software.
This work was supported by Microsoft Research Cambridge through its Ph.D. Scholarship Programme. M. Allamanis,
E. T. Barr, and C. Sutton are supported by the Engineering and Physical Sciences Research Council [grant numbers
EP/K024043/1, EP/P005659/1, and EP/P005314/1]. P. Devanbu is supported by the National Science Foundation award
number 1414172.
Authors’ addresses: M. Allamanis, Microsoft Research, 21 Station Road, CB1 2FB, Cambridge, United Kingdom; email:
[email protected]; E. T. Barr, Department of Computer Science, University College London, Gower Street, WC1E
6BT, United Kingdom; email: [email protected]; P. Devanbu, Department of Computer Science, University of California,
Davis, 95616, California, USA; email: [email protected]; C. Sutton, School of Informatics, University of Edinburgh,
Edinburgh, EH8 9AB, United Kingdom; email: [email protected].
Research in this area has been dominated by the formal, or logico-deductive, approach. Practi-
tioners of this approach hold that, since software is constructed in mathematically well-defined
programming languages, software tools can be conceived in purely formal terms. The design of
software tools is to be approached using formal methods of definition, abstraction, and deduction.
Properties of tools thus built should be proven using rigorous proof techniques such as induction
over discrete structures. This logico-deductive approach has tremendous appeal in programming
languages research, as it holds the promise of proving facts and properties of the program. Many
elegant and powerful abstractions, definitions, algorithms, and proof techniques have been de-
veloped, which have led to important practical tools for program verification, bug finding, and
refactoring [24, 42, 45]. It should be emphasized that these are theory-first approaches. Software
constructions are viewed primarily as mathematical objects, and when evaluating software tools
built using this approach, the elegance and rigor of definitions, abstractions, and formal proofs-of-
properties are of dominant concern. The actual varieties of use of software constructs, in practice,
become relevant later, in case studies, that typically accompany presentations in this line of work.
Of late, another valuable resource has arisen: the large and growing body of successful, widely
used, open source software systems. Open source software systems such as Linux, MySQL, Django,
Ant, and OpenEJB have become ubiquitous. These systems publicly expose not just source code
but also metadata concerning authorship, bug-fixes, and review processes. The scale of available
data is massive: billions of tokens of code and millions of instances of metadata, such as changes,
bug-fixes, and code reviews (“big code”). The availability of “big code” suggests a new, data-driven
approach to developing software tools: Why not let the statistical distributional properties, esti-
mated over large and representative software corpora, also influence the design of development
tools? Thus rather than performing well in the worst case, or in case studies, our tools can perform
well in most cases, thus delivering greater advantages in expectation. The appeal of this approach
echoes that of earlier work in computer architecture: Amdahl’s law [15], for example, tells us to
focus on the common case. This motivates a similar hope for development tools, that tools for
software development and program analysis can be improved by focusing on the common cases
using a fine-grained estimate of the statistical distribution of code. Essentially, the hope is that an-
alyzing the text of thousands of well-written software projects can uncover patterns that partially
characterize software that is reliable, easy to read, and easy to maintain.
The promise and power of machine learning rests on its ability to generalize from examples and
handle noise. To date, software engineering (SE) and programming languages (PL) research has
largely focused on using machine-learning (ML) techniques as black boxes to replace heuristics and
find features, sometimes without appreciating the subtleties of the assumptions these techniques
make. A key contribution of this survey is to elucidate these assumptions and their consequences.
Just as natural language processing (NLP) research changed focus from brittle rule-based expert
systems that could not handle the diversity of real-life data to statistical methods [99], SE/PL should
make the same transition, augmenting traditional methods that consider only the formal structure
of programs with information about the statistical properties of code.
Structure. First, in Section 2, we discuss the basis of this area, which we call the “naturalness hy-
pothesis.” We then review recent work on machine-learning methods for analyzing source code,
focusing on probabilistic models, such as n-gram language models and deep-learning methods. (It may be worth pointing out that deep learning and probabilistic modeling are not mutually exclusive; indeed, many of the currently most effective methods for language modeling are based on deep learning.) We also touch on other types of machine-learning-based source code models, aiming to give a broad overview of the area, to explain the core methods and techniques, and to discuss applications in programming languages and software engineering. We focus on work that goes beyond
a “bag of words” representation of code, modeling code using sequences, trees, and continuous
representations. We describe a wide range of emerging applications, ranging from recommender
systems, debugging, program analysis, and program synthesis. The large body of work on se-
mantic parsing [138] is not the focus of this survey, but we include some methods that output
code in general-purpose programming languages (rather than carefully crafted domain-specific
languages). This review is structured as follows. We first discuss the different characteristics of
natural language and source code to motivate the design decisions involved in machine-learning
models of code (Section 3). We then introduce a taxonomy of probabilistic models of source code
(Section 4). Then we describe the software engineering and programming language applications
of probabilistic source code models (Section 5). Finally, we discuss challenges and interesting future directions (Section 6) and mention a few overlapping research areas (Section 7).
Related Reviews and Other Resources. There have been short reviews summarizing the progress
and the vision of the research area, from both software engineering [52] and programming lan-
guages perspectives [28, 195]. However, none of these articles can be considered extensive lit-
erature reviews, which is the purpose of this work. Ernst [57] discusses promising areas of ap-
plying natural language processing to software development, including error messages, variable
names, code comments, and user questions. Some resources, datasets, and code can be found at
http://learnbigcode.github.io/. An online version of the work reviewed here—which we will keep up to date by accepting external contributions—can be found at https://ml4code.github.io.
At a high level, statistical methods allow a system to make hypotheses, along with prob-
abilistic confidence values, of what a developer might want to do next or what formal properties
might be true of a chunk of code. Probabilistic methods also provide natural ways of learning
correspondences between code and other types of documents, such as requirements, blog posts,
comments, and so on—such correspondences will always be uncertain, because natural language
is ambiguous, and so the quantitative measure of confidence provided by probabilities is especially
natural. As we discuss in Section 5, one could go so far as to claim that almost every area of soft-
ware engineering and programming language research has potential opportunities for exploiting
statistical properties.
Although the “naturalness hypothesis” may not seem surprising, one should appreciate the root
cause of “naturalness.” Naturalness of code seems to have a strong connection with the fact that
developers prefer to write [5] and read [85] code that is conventional, idiomatic, and familiar,
because it helps in understanding and maintaining software systems. Code that takes familiar
forms is more transparent, in that its meaning is more readily apparent to an experienced reader.
Thus, the naturalness hypothesis leads seamlessly to a “code predictability” notion, suggesting
that code artifacts—from simple token sequences to formal verification statements—contain useful
recurring and predictable patterns that can be exploited. “Naturalness” and “big code” should be
viewed as instances of a more general concept that there is exploitable regularity across human-
written code that can be “absorbed” and generalized by a learning component that can transfer its
knowledge and probabilistically reason about new code.
This article reviews the emerging area of machine-learning and statistical natural language pro-
cessing methods applied to source code. We focus on probabilistic models of code, that is, methods
that estimate a distribution over all possible source files. Machine learning in probabilistic models
has seen wide application throughout artificial intelligence, including natural language processing,
robotics, and computer vision, because of its ability to handle uncertainty and to learn in the face
of noisy data. One might reasonably ask why it is necessary to handle uncertainty and noise in
software development tools, when in many cases the program to be analyzed is known (there is no
uncertainty about what the programmer has written) and is deterministic. In fact, there are several
interesting motivations for incorporating probabilistic modeling into machine-learning methods
for software development. First, probabilistic methods offer a principled method for handling un-
certainty and fusing multiple, possibly ambiguous, sources of information. Second, probabilistic
models provide a natural framework for connecting prior knowledge to data—providing a natural
framework to design methods based on abstractions of statistical properties of code corpora. In
particular, we often wish to infer relationships between source code and natural language text,
such as comments, bug reports, requirements documents, documentation, search queries, and so
on. Because natural language text is ambiguous, it is useful to quantify uncertainty in the cor-
respondence between code and text. Finally, when predicting program properties, probabilities
provide a way to relax strict requirements on soundness: We can seek unsound methods that pre-
dict program properties based on statistical patterns in the code, using probabilities as a way to
quantify the method’s confidence in its predictions.
Although code and text are similar, code written in general-purpose programming languages is a relatively new problem domain for existing ML and NLP techniques.
Hindle et al. [87] showed not only that exploitable similarity exists between the two via an n-
gram language model but also that code is even less surprising than text. Although it may seem
manifestly obvious that code and text have many differences, it is useful to enumerate these dif-
ferences carefully, as this allows us to gain insight into when techniques from NLP need to be
modified to deal with code. Perhaps the most obvious difference is that code is executable and has
formal syntax and semantics. We close by discussing how source code bimodality manifests itself
as synchronization points between the algorithmic and explanatory channels.
Executability. All code is executable; text often is not. So code is often semantically brittle—
small changes (e.g., swapping function arguments) can drastically change the meaning of code;
whereas natural language is more robust in that readers can often understand text even if it con-
tains mistakes. Despite the bimodal nature of code and its human-oriented modality, the sensitivity
of code semantics to “noise” necessitates the combination of probabilistic and formal methods. For
example, existing work builds probabilistic models then applies strict formal constraints to filter
their output (Section 4.1) or uses them to guide formal methods (Section 5.8). Nevertheless, further
research on bridging formal and probabilistic methods is needed (Section 6.1).
Whether it is possible to translate between natural languages in a way that completely preserves
meaning is a matter of debate. Programming languages, on the other hand, can be translated be-
tween each other exactly, as all mainstream programming languages are Turing-complete. (That
said, porting real-world programs to new languages and platforms remains challenging in prac-
tice [32].) ML techniques have not yet comprehensively tackled such problems and are currently
limited to solely translating among languages with very similar characteristics, e.g., Java and C#
(Section 6.1). Programming languages differ in their expressivity and intelligibility, ranging from
Haskell to Malbolge (https://en.wikipedia.org/wiki/Malbolge), with some especially tailored for certain problem domains; in contrast, nat-
ural languages are typically used to communicate across a wide variety of domains. Executability
of code induces control and data flows within programs, which have only weak analogs in text.
Finally, executability gives rise to additional modalities of code—its static and dynamic views (e.g.,
execution traces), which are not present in text. Learning over traces or flows are promising di-
rections (Section 6.1).
Formality. Programming languages are formal languages, whereas formal languages are only
mathematical models of natural language. As a consequence, programming languages are de-
signed top-down by a few designers for many users. Natural languages, in contrast, emerge, bot-
tom up, “through social dynamics” [46]. Natural languages change gradually, while programming
languages exhibit punctuated change: New releases, like Python 3, sometimes break backward
compatibility. Formatting can also be meaningful in code: Python’s whitespace sensitivity is the
canonical example. Text has a robust environmental dependence, whereas code suffers from bit
rot—the deterioration of software’s functionality through time because of changes in its environ-
ment (e.g., dependencies)—because all its explicit environmental interactions must be specified
upfront and execution environments evolve much more quickly than natural languages.
Source code's formality facilitates reuse. Solving a problem algorithmically is cognitively expensive, so developers actively try to reuse code [95], moving common functionality into libraries to facilitate reuse. As a result, functions are usually semantically unique within a project. Coding
competitions or undergraduate projects are obvious exceptions. In contrast, one can find thousands
of news articles describing an important global event. On the other hand, Gabel and Su [67] have
found that, locally, code is more pattern dense than text (Section 4.1). This has led to important
performance improvements on some applications, such as code completion (Section 5.1).
Because programming languages are automatically translated into machine code, they must be
syntactically, even to a first approximation semantically, unambiguous.3 In contrast to NLP mod-
els, which must always account for textual ambiguity, probabilistic models of code can and do take
advantage of the rich and unambiguous code structure. Although it is less pervasive, ambiguity
remains a problem in the analysis of code, because of issues like polymorphism and aliasing. Sec-
tion 4.2 and Section 5.4 discuss particularly notable approaches to handling them. Co-reference
ambiguities can arise when viewing code statically, especially in dynamically typed languages
(Section 6.1). The undefined behavior that some programming languages permit can cause seman-
tic ambiguity, and, in the field, syntactic problems can arise due to nonstandard compilers [24];
however, the dominance of a handful of compilers/interpreters for most languages ameliorates
both problems.
Cross-Channel Interaction. Code’s two channels, the algorithmic and the explanatory channels,
interact through their semantic units, but mapping code units to textual units remains an open
problem. Natural semantic units in code are identifiers, statements, blocks, and functions. None of
these universally maps to textual semantic units. For example, identifiers, even verbose function
names that seek to describe their function, carry less information than words like “Christmas” or
“set.” In general, statements in code and sentences in text differ in how much background knowl-
edge the reader needs to understand them in isolation; an arbitrary statement is far more likely
to use domain-specific, even project-specific, names or neologisms than an arbitrary sentence is.
Blocks vary greatly in length and semantic richness and often lack clear boundaries to a human
reader. Functions are clearly delimited and semantically rich but long. In text, a sentence is the nat-
ural multiword semantic unit and usually contains fewer than 50 words (tokens). Unfortunately, one cannot easily equate the two. A function differs from a sentence or a sequence of sen-
tences, i.e., a paragraph, in that it is named and called, while, in general settings, sentences or
paragraphs rarely have names or are referred to elsewhere in a text. Further, a single function acts
on the world, so it is like a single action sentence, but is usually much longer, often containing
hundreds of tokens, and usually performs multiple actions, making a function closer to a sequence
of sentences, or a paragraph, but paragraphs are rarely solely composed of action sentences.
Additionally, parse trees of sentences in text tend to be diverse, short, and shallow compared to
abstract syntax trees of functions, which are usually much deeper with repetitive internal struc-
ture. Code bases are multilingual (i.e., contain code in more than one programming language, e.g.,
Java and SQL), with different tasks described in different languages, more frequently than text
corpora; this can drastically change the shape and frequency of its semantic units. Code has a
higher neologism rate than text. Almost 70% of all characters are identifiers, and a developer must
choose a name for each one [50]; when writing text, an author rarely names new things but usu-
ally chooses an existing word to use. Existing work handles code’s neologism rate by introducing
cache mechanisms or decomposing identifiers at a subtoken level (Section 4.1).
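To make the subtoken decomposition concrete, the following Python sketch splits camelCase and snake_case identifiers into lower-cased subtokens; the splitting rules are a simplification for illustration, not a procedure taken from the surveyed work.

    import re

    def split_identifier(identifier):
        """Split an identifier such as getFinalResults or max_file_size
        into lower-cased subtokens."""
        parts = []
        for chunk in identifier.split('_'):
            # Break before an uppercase letter that follows a lowercase letter/digit,
            # and before the last capital of an acronym (e.g., parseHTMLFile).
            chunk = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', chunk)
            chunk = re.sub(r'(?<=[A-Z])(?=[A-Z][a-z])', ' ', chunk)
            parts.extend(p.lower() for p in chunk.split())
        return [p for p in parts if p]

    assert split_identifier("getFinalResults") == ["get", "final", "results"]
    assert split_identifier("max_file_size") == ["max", "file", "size"]
    assert split_identifier("parseHTMLFile") == ["parse", "html", "file"]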
Determining which semantic code unit is most useful for which task is an open question. Con-
sider the problem of automatically generating comments that describe code, which can be formal-
ized as a machine translation problem from code to text. Statistical machine translation approaches
learn from an aligned corpus. Statement-granular alignment yields redundant comments, while
function granular alignment has saliency issues (Section 5.5). As another example, consider code
search, where search engines must map queries into semantic code units. Perhaps the answer will
be in maps from code to text whose units vary by granularity or context (Section 6.1).
4 In the machine-learning literature, representation, applied to code, is roughly equivalent to abstraction in programming language research: a lossy encoding that preserves a semantic property of interest.
Since code-generating models predict the complex structure of code, they make simplifying as-
sumptions about the generative process and iteratively predict elements of the code to generate a
full code unit, e.g., code file or method. Because of code’s structural complexity and the simplifying
assumptions these models make to cope with it, none of the existing models in the literature gen-
erate code that always parses, compiles, and executes. Some of the models do, however, impose
constraints that take code structure into account to remove some inconsistencies; for instance,
Maddison and Tarlow [126] only generate variables declared within each scope.
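As a minimal illustration of imposing such a constraint (a sketch of the general idea only, not the specific mechanism of Maddison and Tarlow [126]), the Python snippet below removes candidate identifiers that are not declared in the current scope from a predicted next-token distribution and renormalizes what remains; the example distribution and variable names are hypothetical.

    def constrain_to_scope(next_token_probs, declared_variables):
        """Zero out variable tokens outside the current scope and renormalize.

        next_token_probs: dict mapping candidate tokens to probabilities.
        declared_variables: set of identifiers declared in the enclosing scopes.
        """
        allowed = {tok: p for tok, p in next_token_probs.items()
                   if not tok.isidentifier() or tok in declared_variables}
        total = sum(allowed.values())
        if total == 0:
            # Nothing survives the filter; fall back to the unconstrained model.
            return next_token_probs
        return {tok: p / total for tok, p in allowed.items()}

    # Hypothetical model output while generating a method body with locals {count, items}.
    probs = {"count": 0.4, "total": 0.3, "items": 0.2, "+": 0.1}
    print(constrain_to_scope(probs, {"count", "items"}))
    # "total" is removed; the remaining probabilities are rescaled to sum to 1.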
We structure the discussion about code-generating models of code as follows. We first discuss
how models in this category generate code, and then we show how the three different types of
models (language, transducer, and multimodal models) differ. Table 1 lists related work on code-
generating models.
4.1.1 Representing Code in Code-generating Models. Probabilistic models for generating structured objects are widely used in machine learning and natural language processing, on structures that range from natural language sentences to chemical structures and images. Within the source code domain,
we can broadly find three categories of models based on the way they generate code’s structure:
token-level models that generate code as a sequence of tokens, syntactic models that generate
code as a tree, and semantic models that generate graph structures. Note that this distinction is
about the generative process and not about the information used within this process. For example,
Nguyen et al. [144] uses syntactic context but is classified as a token-level model that generates
tokens.
Token-level Models (Sequences). Sequence-based models are commonly used because of their
simplicity. They view code as a sequence of elements, usually code tokens or characters, i.e.,
c = t 1 . . . t M . Predicting a large sequence in a single step is infeasible due to the exponential num-
ber of possible sequences; for a token vocabulary V, there are |V|^N sequences of length N. Therefore,
most sequence-based models predict sequences by sequentially generating each element, i.e., they
model the probability distribution P (tm |t 1 . . . tm−1 , C(c)). However, directly modeling this distri-
bution is impractical and all models make different simplifying assumptions.
The n-gram model has been a widely used sequence-based model, most commonly used as a
language model. It is an effective and practical LM for capturing local and simple statistical depen-
dencies in sequences. n-gram models assume that tokens are generated sequentially, left to right,
and that the next token can be predicted using only the previous n − 1 tokens. The consequence of
capturing a short context is that n-gram models cannot handle long-range dependencies, notably
scoping information. Formally, the probability of a token tm is conditioned on the context C(c)
(if any) and the generated sequence so far t 1 . . . tm−1 , which is assumed to depend on only the
previous n − 1 tokens. Under this assumption, we write
P_D(c | C(c)) = P(t_1 . . . t_M | C(c)) = ∏_{m=1}^{M} P(t_m | t_{m−1} . . . t_{m−n+1}, C(c)).    (1)
To use this equation, we need to know the conditional probabilities P (tm |tm−1 . . . tm−n+1 , C(c)) for
each possible n-gram and context. This is a table of |V|^n numbers for each context C(c). These are
the parameters of the model that we learn from the training corpus. The simplest way to estimate
the model parameters is to set P (tm |tm−1 . . . tm−n+1 ) to the proportion of times that tm follows
tm−1 . . . tm−n+1 . In practice, this simple estimator does not work well, because it assigns zero prob-
ability to n-grams that do not occur in the training corpus. Instead, n-gram models use smooth-
ing methods [40] as a principled way for assigning probability to unseen n-grams by extrapolat-
ing information from m-grams (m < n). Furthermore, considering n-gram models with non-empty
contexts C(c) exacerbates sparsity rendering these models impractical. Because of this, n-grams
are predominantly used as language models. The use of n-gram LMs in software engineering orig-
inated with the pioneering work of Hindle et al. [88], who used an n-gram LM with Kneser-Ney
[104] smoothing. Most subsequent research has followed this practice.
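A toy token-level n-gram language model can be sketched in a few lines of Python; the additive smoothing used here is a simple stand-in for the Kneser-Ney smoothing of Hindle et al. [88], and the training corpus is made up for illustration.

    from collections import Counter, defaultdict

    class NGramLM:
        """A toy trigram language model over code tokens with additive smoothing."""

        def __init__(self, n=3, alpha=0.1):
            self.n = n
            self.alpha = alpha                       # additive smoothing constant
            self.context_counts = defaultdict(Counter)
            self.vocab = set()

        def train(self, token_sequences):
            for tokens in token_sequences:
                padded = ["<s>"] * (self.n - 1) + tokens + ["</s>"]
                self.vocab.update(padded)
                for i in range(self.n - 1, len(padded)):
                    context = tuple(padded[i - self.n + 1:i])
                    self.context_counts[context][padded[i]] += 1

        def prob(self, token, context):
            """P(token | context) with Laplace-style smoothing over the vocabulary."""
            counts = self.context_counts[tuple(context[-(self.n - 1):])]
            total = sum(counts.values())
            return (counts[token] + self.alpha) / (total + self.alpha * len(self.vocab))

    # Train on a tiny, hypothetical corpus of tokenized statements.
    lm = NGramLM()
    lm.train([["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"],
              ["for", "(", "int", "j", "=", "0", ";", "j", "<", "m", ";", "j", "++", ")"]])
    print(lm.prob("int", ["for", "("]))   # high probability after "for ("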
In contrast to text, code tends to be more verbose [88], and much information is lost within
the n − 1 tokens of the context. To tackle this problem, Nguyen et al. [144] extended the standard
n-gram model by annotating the code tokens with parse information that can be extracted from
the currently generated sequence. This increases the available context information allowing the
n-gram model to achieve better predictive performance. Following this trend, but using concrete
and abstract semantics of code, Raychev et al. [166] create a token-level model that treats code
generation as a combined synthesis and probabilistic modeling task.
Tu et al. [180] and, later, Hellendoorn and Devanbu [84] noticed that code has a high degree of
localness, where identifiers (e.g., variable names) are repeated often within close distance. In their
work, they adapted work in speech and natural language processing [109] adding a cache mech-
anism that assigns higher probability to tokens that have been observed most recently, achieving
significantly better performance compared to other n-gram models. Modeling identifiers in code is
challenging [6, 11, 29, 126]. The agglutination of multiple subtokens (e.g., in getFinalResults) when creating identifiers is one reason. Following recent NLP work that models subword structure (e.g., morphology) [174], explicitly modeling subtokens in identifiers may improve the performance
of generative models. Existing token-level code-generating models do not produce syntactically
valid code. Raychev et al. [166] added additional context in the form of constraints—derived from
program analysis—to avoid generating some incorrect code.
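The cache mechanism described above can be sketched, roughly, as interpolating a base language model with a unigram distribution over recently seen tokens; the interpolation weight, window size, and base model below are illustrative assumptions, not the choices made by Tu et al. [180].

    from collections import Counter, deque

    class UniformLM:
        """Placeholder base model: assigns equal probability to every known token."""
        def __init__(self, vocab):
            self.vocab = set(vocab)

        def prob(self, token, context):
            return 1.0 / len(self.vocab) if token in self.vocab else 0.0

    def cached_prob(token, context, base_lm, cache, lam=0.8):
        """Interpolate a base language model with a unigram cache of recent tokens."""
        cache_counts = Counter(cache)
        cache_prob = cache_counts[token] / len(cache) if cache else 0.0
        return lam * base_lm.prob(token, context) + (1 - lam) * cache_prob

    tokens = ["int", "totalCount", "=", "0", ";", "totalCount", "+=", "1", ";"]
    base_lm = UniformLM(tokens)
    cache = deque(maxlen=100)
    for tok in tokens:
        context = list(cache)[-2:]
        p = cached_prob(tok, context, base_lm, cache)
        cache.append(tok)
    # The second occurrence of "totalCount" receives a higher probability than the
    # first, because by then it is present in the cache.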
More recently, sequence-based code models have turned to deep recurrent neural network
(RNN) models to outperform n-grams. These models predict each token sequentially but loosen
the fixed-context-size assumption, instead representing the context using a distributed vector rep-
resentation (Section 4.2). Following this trend, Karpathy et al. [103] and Cummins et al. [48] use
character-level LSTMs [91]. Similarly, White et al. [188] and Dam et al. [49] use token-level RNNs.
Recently, Bhoopchand et al. [26] used a token sparse pointer-based neural model of Python that
learns to copy recently declared identifiers to capture very long-range dependencies of identifiers,
outperforming standard LSTM models.5
5 This work differs from the rest, because it anonymizes/normalizes identifiers, creating a less sparse problem. Because of
the anonymization, the results are not directly comparable with other models.
ACM Computing Surveys, Vol. 51, No. 4, Article 81. Publication date: July 2018.
A Survey of Machine Learning for Big Code and Naturalness 81:11
Although neural models usually have superior predictive performance, training them is significantly more costly than training n-gram models, usually requiring orders of magnitude more data.
Intuitively, there are two reasons why deep-learning methods have proven successful for language
models. First, the hidden state in an RNN can encode longer-range dependencies of variable-length
beyond the short context of n-gram models. Second, RNN language models can learn a much richer
notion of similarity across contexts. For example, consider a 13-gram model over code, in which
we are trying to estimate the distribution following the context for(int i=N; i>=0; i--). In a
corpus, few examples of this pattern may exist, because such long contexts occur rarely. A simple n-
gram model cannot exploit the fact that this context is very similar to for(int j=M; j>=0; j--).
But a neural network can exploit it by learning to assign these two sequences similar vectors.
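A minimal token-level recurrent language model can be sketched as follows, assuming the PyTorch library is available; the vocabulary, hyperparameters, and single training step are purely illustrative.

    import torch
    import torch.nn as nn

    class TokenRNNLM(nn.Module):
        """A minimal token-level recurrent (LSTM) language model over code tokens."""

        def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, token_ids):
            # token_ids: (batch, sequence_length) integer tensor
            hidden_states, _ = self.lstm(self.embed(token_ids))
            return self.out(hidden_states)           # (batch, seq_len, vocab) logits

    # Toy usage: train to predict each next token from the previous ones.
    vocab = {"<pad>": 0, "for": 1, "(": 2, "int": 3, "i": 4, "=": 5, "0": 6, ";": 7, ")": 8}
    sequence = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8]])
    model = TokenRNNLM(len(vocab))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    logits = model(sequence[:, :-1])                 # predict tokens 2..N
    loss = loss_fn(logits.reshape(-1, len(vocab)), sequence[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()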
Syntactic Models (Trees). Syntactic (or structural) code-generating models model code at the level
of abstract syntax trees (ASTs). Thus, in contrast to sequence-based models, they describe a sto-
chastic process of generating tree structures. Such models make simplifying assumptions about
how a tree is generated, usually following generative NLP models of syntactic trees: They start
from a root node and then sequentially generate children top to bottom and left to right. Syntactic
models generate a tree node conditioned on context defined as the forest of subtrees generated so
far. In contrast to sequence models, these models—by construction—generate syntactically correct
code. In general, learning models that generate tree structures is harder compared to generat-
ing sequences: It is relatively computationally expensive, especially for neural models, given the
variable shape and size of the trees that inhibit efficient batching. In contrast to their wide appli-
cation in NLP, probabilistic context-free grammars (PCFGs) have been found to be unsuitable as language models of code [126, 164]. This may seem surprising, because most parsers assume that programming languages are context free. But the problem is that PCFGs are not a good model
of statistical dependencies between code tokens, because nearby tokens may be far away in the
AST. So it is not that PCFGs do not capture long-range dependencies (n-gram-based models do not
either), but that they do not even capture close-range dependencies that matter [29]. Further, ASTs
tend to be deeper and wider than text parse trees due to the highly compositional nature of code.
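To make the generative process of a probabilistic grammar concrete, the toy Python sketch below samples token sequences top to bottom, left to right from a hand-written PCFG for a miniature expression language; learned grammars over real ASTs are, of course, far larger.

    import random

    # Each nonterminal maps to a list of (probability, right-hand side) pairs.
    PCFG = {
        "Expr": [(0.5, ["Term"]),
                 (0.5, ["Expr", "+", "Term"])],
        "Term": [(0.6, ["Ident"]),
                 (0.4, ["(", "Expr", ")"])],
        "Ident": [(0.5, ["x"]), (0.5, ["y"])],
    }

    def sample(symbol="Expr", max_depth=10):
        """Sample a derivation top-down, left to right, returning a token list."""
        if symbol not in PCFG:
            return [symbol]                  # terminal symbol
        if max_depth == 0:
            return ["x"]                     # crude guard against unbounded recursion
        r, cumulative = random.random(), 0.0
        for prob, rhs in PCFG[symbol]:
            cumulative += prob
            if r <= cumulative:
                break
        return [tok for child in rhs for tok in sample(child, max_depth - 1)]

    print(" ".join(sample()))   # e.g., "x + ( y + x )"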
Maddison and Tarlow [126] and Allamanis et al. [13] increase the size of the context considered
by creating a non-context-free log-bilinear neural network grammar, using a distributed vector
representation for the context. Additionally, Maddison and Tarlow [126] restrict generation to variables that have been declared. To achieve this, they use the deterministically known information and filter out invalid output. This simple process always produces correct code, even when the network does not learn to produce it. In contrast, Amodio et al. [16] create a significantly more complex model that aims to learn to enforce deterministic constraints of code generation rather than enforcing them directly on the output. We further discuss the issue of embedding
constraints and problem structure in models vs. learning the constraints in Section 6.
Bielik et al. [29] and Raychev et al. [164] increase the context by annotating PCFGs with a learned
program that uses features from the code. Although the programs can, in principle, be arbitrary,
they limit themselves to synthesizing decision tree programs. Similarly, Wang et al. [185] and Yin
and Neubig [196] use an LSTM over AST nodes to achieve the same goal. Allamanis and Sutton
[12] also create a syntactic model learning Bayesian TSGs [43, 156] (see Section 4.3).
Semantic Models (Graphs). Semantic code-generating models view code as a graph. Graphs are a
natural representation of source code that require little abstraction or projection. Therefore, graph models can be thought of as generalizations of sequence and tree models. However, generating complex graphs is hard, since there is no natural “starting” point or generative process, as reflected by the limited number of graph models in the literature. We refer the interested reader to the related
work section of Johnson [98] for a discussion of recent models in the machine-learning literature.
To our knowledge, there are no generative models that directly generate graph representations
of realistic code (e.g., dataflow graphs). Nguyen and Nguyen [139] propose a generative model,
related to graph generative models in NLP, that suggests application programming interface (API)
completions. They train their model over API usages. However, they predict entire graphs as com-
pletions and perform no smoothing, so their model will assign zero probability to unseen graphs.
In this way, their model differs from graph generating models in NLP, which can generate arbitrary
graphs.
4.1.2 Types of Code Generating Models. We use external context C(c) to refine code generating
models into three subcategories.
Language Models. Language models model the language itself, without using any external con-
text, i.e., C(c) = ∅. Although LMs learn the high-level structure and constraints of programming
languages fairly easily, predicting and generating source code identifiers (e.g., variable and method
names), long-range dependencies, and taking into account code semantics makes the language
modeling of code a hard and interesting research area. We discuss these and other differences and
their implications for probabilistic modeling in Section 6.
Code LMs are evaluated like LMs in NLP, using perplexity (or equivalently cross-entropy) and
word error rate. Cross-entropy H is the most common measure. Language models—as most predic-
tive machine-learning models—can be seen as compression algorithms where the model predicts
the full output (i.e., decompresses) using extra information. Cross-entropy measures the average
number of extra bits of information per token of code that a model needs to decompress the correct
output using a perfect code (in the information-theoretic sense),
H(c, P_D) = −(1/M) log_2 P_D(c),    (2)
where M is the number of tokens within c. By convention, the average is reported per token, even
for non-token models. Thus, a “perfect” model, correctly predicting all tokens with probability 1,
would require no additional bits of information because, in a sense, it already “knows” everything.
Cross-entropy allows comparisons across different models. Other, application-specific measures,
are used when the LM was trained for a specific task, such as code completion (Section 5.1).
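Concretely, given the per-token probabilities assigned by any of the models above, the per-token cross-entropy of Equation (2) can be computed as in the short sketch below; the token probabilities are hypothetical.

    import math

    def cross_entropy(token_probs):
        """Average number of bits per token, given the model's probability for each token."""
        return -sum(math.log2(p) for p in token_probs) / len(token_probs)

    # Hypothetical probabilities a model assigned to the five tokens of "i = i + 1".
    probs = [0.20, 0.60, 0.35, 0.50, 0.40]
    print(f"{cross_entropy(probs):.2f} bits per token")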
Code Transducer Models. Inspired by statistical machine translation (SMT), transducer models
translate/transduce code from one format into another (i.e., C(c) is also code), such as translat-
ing code from one source language into another, target language. They have the form P D (c|s),
where c is the generated code in the target language and C(c) = s is the code in the source language. Most
code transducer models use phrase-based machine translation. Intuitively, phrase-based models
assume that small chunks from the source input can directly be mapped to chunks in the output.
Although this assumption is reasonable in NLP and many source code tasks, these models present
challenges in capturing long-range dependencies within the source and target. For example, as
we will mention in the next section, transducing code from an imperative source language to a
functional target is not currently possible, because the source and target are related by a significantly more complicated relation than simply matching “chunks” of the input code to “chunks” in
the output.
These types of models have found application within code migration [2, 102, 141], pseu-
docode generation [146], and code fixing [160]. Traditionally transducer models have followed
a noisy channel model, in which they combine a language model P D (c) of the target lan-
guage with a translation/transduction model P D (s|c) to match elements between the source and
the target. These methods pick the optimal transduction c* such that c* = arg max_c P_D(c|s) = arg max_c P_D(s|c) P_D(c), where the second equality follows from Bayes' rule. Again, these
probabilistic generative models of code do not necessarily produce valid code, due to the sim-
plifying assumptions they make in both P D (c) and P D (s|c). More recently, machine translation
methods based on phrase-based models and the noisy channel model have been outperformed by
neural network-based methods that directly model P D (c|s).
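The noisy channel decision rule amounts to reranking a finite set of candidate translations c by the product P_D(s|c) P_D(c); in the sketch below, both component models are crude placeholder functions that a real system would replace with learned models, and the candidate snippets are made up.

    def noisy_channel_best(source, candidates, translation_model, language_model):
        """Pick argmax_c P(source | c) * P(c) over a finite candidate set."""
        return max(candidates,
                   key=lambda c: translation_model(source, c) * language_model(c))

    # Placeholder scoring functions (assumptions for illustration only).
    def translation_model(source, candidate):
        # e.g., fraction of source "chunks" that can be matched in the candidate
        src, cand = set(source.split()), set(candidate.split())
        return len(src & cand) / max(len(src), 1)

    def language_model(candidate):
        # e.g., prefer shorter, more conventional target code
        return 1.0 / (1 + len(candidate.split()))

    candidates = ["int x = list.Count;", "int x = list.Count ( ) ;"]
    print(noisy_channel_best("int x = list.size();", candidates,
                             translation_model, language_model))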
Transducer models can be evaluated with SMT evaluation measures, such as BLEU [150]—
commonly used in machine translation as an approximate measure of translation quality—or pro-
gramming and logic-related measures (e.g., “Does the translated code parse/compile?” and “Are the
two snippets equivalent?”).
Multimodal Models. Code-generating multimodal models correlate code with one or more non-
code modalities, such as comments, specifications, or search queries. These models have the form
P D (c|m) i.e., C(c) = m is a representation of one or more non-code modalities. Multimodal mod-
els are closely related to representational models (discussed in Section 4.2): Multimodal code-
generating models learn an intermediate representation of the non-code modalities m and use it
to generate code. In contrast, code representational models create an intermediate representation
of the code but are not concerned with code generation.
Multimodal models of code have been used for code synthesis, where the non-code modalities
are leveraged to estimate a conditional generative model of code, e.g., synthesis of code given a
natural language description by Gulwani and Marron [77] and more recently by Yin and Neubig
[196]. The latter model is a syntactic model that accepts natural language. Recently, Beltramelli
[23], Deng et al. [51], and Ellis et al. [55] designed multimodal models that accept visual input (the
non-code modality) and generate code in a DSL describing how the input (hand-drawn image,
GUI screenshot) was constructed. Another use of these models is to score the co-appearance of
the modalities, e.g., in code search, to score the probability of some code given a textual query [13].
This stream of research is related to work in NLP and computer vision where one seeks to gen-
erate a natural language description for an image. These models are closely related to the other
code generating models, since they generate code. These models also assume that the input
modality conditions the generation process. Multimodal models combine an assumption with a
design choice. Like language models, these models assume that probabilistic models can capture
the process by which developers generate code; unlike language models, they additionally bias
code generation using information from the input modality m. The design choice is how to trans-
form the input modality into an intermediate representation. For example, Allamanis et al. [13]
use a bag-of-words assumption averaging the words’ distributed representations. However, this
limits the expressivity of the models, because the input modality has to fit in whole within the dis-
tributed representation. To address this issue, Ling et al. [120] and Yin and Neubig [196] use neural
attention mechanisms to selectively attend to information within the input modality without the
need to “squash” all the information into a single representation. Finally, the text-to-code problem,
in which we take the input modality m to be natural language text and the other modality c to be
code, is closely related to the problem of semantic parsing in NLP; see Section 5.5.
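As a sketch of the bag-of-words design choice, a natural language query can be collapsed into a single conditioning vector by averaging its word vectors; the embeddings below are random stand-ins for learned distributed representations.

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in for learned word embeddings (dimension 8 for readability).
    embeddings = {w: rng.normal(size=8) for w in
                  ["sort", "list", "of", "integers", "reverse", "order"]}

    def encode_query(query):
        """Average the distributed representations of the query words (bag of words)."""
        vectors = [embeddings[w] for w in query.lower().split() if w in embeddings]
        return np.mean(vectors, axis=0) if vectors else np.zeros(8)

    conditioning_vector = encode_query("sort list of integers")
    # A code-generating model would then condition P(c | m) on this single vector,
    # which is why longer, more detailed queries lose information under this scheme.
    print(conditioning_vector.shape)   # (8,)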
code property π as P D (π | f (c)), where f is a function that transforms the code c into a target
representation and π can be an arbitrary set of features or other (variable) structures. These mod-
els use a diverse set of machine-learning methods and are often application specific. Table 2 lists
representational code model research. Below we discuss two types of models. Note that they are
not mutually exclusive and models frequently combine distributed representations and structured
prediction.
4.2.1 Distributed Representations. Distributed representations [89] are widely used in NLP to
encode natural language elements. For example, Mikolov et al. [131] learn distributed represen-
tations of words, showing that such representations can learn useful semantic relationships and
Le and Mikolov [112] extend this idea to sentences and documents. Distributed representations
refer to arithmetic vectors or matrices where the meaning of an element is distributed across mul-
tiple components (e.g., the “meaning” of a vector is distributed in its components). This contrasts
with local representations, where each element is uniquely represented with exactly one compo-
nent. Distributed representations are commonly used in machine learning and NLP, because they
tend to generalize better and have recently become extremely common due to their omnipresence
in deep learning. Models that learn distributed representations assume that the elements being
represented and their relations can be encoded within a multidimensional real-valued space and
that the relation (e.g., similarity) between two representations can be measured within this space.
Probabilistic code models widely use distributed representations. For example, models that use dis-
tributed vector representations learn a function of the form c → ℝ^D that maps code elements to
a D-dimensional vector. Such representations are usually the (learned) inputs or output of (deep)
neural networks.
Allamanis et al. [6] learn distributed vector representations for variable and methods usage
contexts and use them to predict a probability distribution over their names. Such distributed
representations are quite similar to those produced by word2vec [131]; the authors found that the
distributed vector representations of variables and methods learn common semantic properties,
implying that some form of the distributional hypothesis in NLP also holds for code.
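A rough, self-contained way to see this distributional idea at work is to build context vectors for tokens from simple co-occurrence counts and compare them with cosine similarity; this is a toy stand-in for the learned neural representations used in the cited work, computed over a made-up corpus.

    import numpy as np

    corpus = [
        ["int", "fileCount", "=", "0", ";", "fileCount", "+=", "1", ";"],
        ["int", "rowCount", "=", "0", ";", "rowCount", "+=", "1", ";"],
        ["String", "name", "=", "getName", "(", ")", ";"],
    ]
    vocab = sorted({tok for stmt in corpus for tok in stmt})
    index = {tok: i for i, tok in enumerate(vocab)}

    # Count how often each token co-occurs with others in a +/-2 token window.
    cooc = np.zeros((len(vocab), len(vocab)))
    for stmt in corpus:
        for i, tok in enumerate(stmt):
            for j in range(max(0, i - 2), min(len(stmt), i + 3)):
                if i != j:
                    cooc[index[tok], index[stmt[j]]] += 1

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Identifiers used in similar contexts (the two counters) get similar vectors.
    print(cosine(cooc[index["fileCount"]], cooc[index["rowCount"]]))   # close to 1
    print(cosine(cooc[index["fileCount"]], cooc[index["getName"]]))    # much lower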
Gu et al. [76] use a sequence-to-sequence deep neural network [177], originally introduced for
SMT, to learn intermediate distributed vector representations of natural language queries that they
use to predict relevant API sequences. Mou et al. [132] learn distributed vector representations
using custom convolutional neural networks to represent features of snippets of code, then they
assume that student solutions to various coursework problems have been intermixed and seek to
recover the solution-to-problem mapping via classification.
Li et al. [115] learn distributed vector representations for the nodes of a memory heap and
use the learned representations to synthesize candidate formal specifications for the code that
produced the heap. To do so, they exploit heap structure to define graph neural networks, a new
machine-learning model based on gated recurrent units (GRU, a type of RNN [41]) to directly learn
from heap graphs. Piech et al. [155] and Parisotto et al. [151] learn distributed representations of
source code input/output pairs and use them to assess and review student assignments or to guide
program synthesis from examples.
Neural code-generative models of code also use distributed representations to capture context,
a common practice in NLP. For example, the work of Maddison and Tarlow [126] and other neu-
ral language models (e.g., LSTMs in Dam et al. [49]) describe context with distributed representations
while sequentially generating code. Ling et al. [120] and Allamanis et al. [13] combine the code-
context distributed representation with a distributed representations of other modalities (e.g., nat-
ural language) to synthesize code. While all of these representations can, in principle, encode un-
bounded context, handling all code dependencies of arbitrary length is an unsolved problem. Some
neural architectures, such as LSTMs [91], GRUs [41], and their variants, have made progress on
this problem and can handle moderately long-range dependencies.
4.2.2 Structured Prediction. Structured prediction is the problem of predicting a set of interde-
pendent variables, given a vector of input features. Essentially, structured prediction generalizes
standard classification to multiple output variables. A simple example of structured prediction
is to predict a part-of-speech tag for each word in a sentence. Often the practitioner defines a
dependency structure among the outputs, e.g., via a graph, as part of the model definition. Struc-
tured prediction has been widely studied within machine learning and NLP, and structured outputs are omnipresent
in code. Indeed, structured prediction is particularly well suited to code, because it can exploit
the semantic and syntactic structure of code to define the model. Structured prediction is a gen-
eral framework to which deep-learning methods have been applied. For example, the celebrated
sequence-to-sequence (seq2seq) learning models [19, 177] are general methods for tackling the re-
lated structured prediction problem. In short, structured prediction and distributed representations
are not mutually exclusive.
One of the most well-known applications of structured prediction to source code is Raychev et al.
[165], who represent code as a variable dependency network, represent each JavaScript variable as
a single node, and model their pairwise interactions as a conditional random field (CRF). They train
the CRF to jointly predict the types and names of all variables within a snippet of code. Proksch
et al. [159] use a directed graphical model to represent the context of an (incomplete) usage of an
object to suggest a method invocation (viz. constructor) autocompletion in Java.
Structured prediction, such as predicting a sequence of elements, can be combined with dis-
tributed representations. For example, Allamanis et al. [6, 10] use distributed representations to
predict sequences of identifier sub-tokens to build a single token and Gu et al. [76] predict the
sequence of API calls. Li et al. [115] learn distributed representations for the nodes of a fixed heap
graph by considering its structure and the interdependencies among the nodes. Kremenek et al.
[108] use a factor graph to learn and enforce API protocols, like the resource usage specification
of the POSIX file API, as do Livshits et al. [122] for information flow problems. Allamanis et al. [8]
predict the data flow graph of code by learning to paste snippets of code into existing code and
adapting the variables used.
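A drastically simplified sketch of such joint prediction is a brute-force MAP search over unary and pairwise scores; the hand-written scores below are illustrative stand-ins for the potentials a trained CRF, such as that of Raychev et al. [165], would learn.

    import itertools

    variables = ["v1", "v2"]
    candidates = {"v1": ["i", "index", "count"], "v2": ["len", "n", "count"]}

    def unary_score(var, name):
        # Stand-in for learned scores of a name given the variable's usage context.
        table = {("v1", "i"): 2.0, ("v1", "index"): 1.5, ("v1", "count"): 0.5,
                 ("v2", "n"): 2.0, ("v2", "len"): 1.8, ("v2", "count"): 0.5}
        return table.get((var, name), 0.0)

    def pairwise_score(name_a, name_b):
        # Penalize assigning the same name to two distinct variables.
        return -5.0 if name_a == name_b else 0.0

    def best_joint_assignment():
        """Enumerate all joint assignments and return the highest-scoring one."""
        best, best_score = None, float("-inf")
        for names in itertools.product(*(candidates[v] for v in variables)):
            score = sum(unary_score(v, n) for v, n in zip(variables, names))
            score += pairwise_score(names[0], names[1])
            if score > best_score:
                best, best_score = dict(zip(variables, names)), score
        return best

    print(best_joint_assignment())   # {'v1': 'i', 'v2': 'n'}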
where g is a deterministic function that returns a (possibly partial, e.g., API calls only) view of the
code and l represents a set of latent variables that the model introduces and aims to infer. Appli-
cations of such models are common in the mining software repositories community and include
documentation (e.g., API patterns), summarization, and anomaly detection. Table 3 lists this work.
Unsupervised learning is one of the most challenging areas in machine learning. This hardness
stems from the need to automatically distinguish important patterns in the code from spurious
patterns that may appear to be significant because of limited and noisy data. When designing un-
supervised models, the core assumption lies in the objective function being used, and often we
resort to using a principle from statistics, information theory, or a proxy supervised task. Like
all machine-learning models, they require assumptions about how the data are represented. An
important issue with unsupervised methods is the hardness of evaluating the output, since the
quality of the output is rarely quantifiable. A vast literature on non-probabilistic methods exploits
data-mining methods, such as frequent pattern mining and anomaly detection [190]. We do not
discuss these models here, since they are not probabilistic models of code. Classic probabilistic
topic models [30], which usually view code (or other software engineering artifacts) as a bag-of-
words, have also been heavily investigated. Since these models and their strengths and limitations
are well understood, we omit them here.
Allamanis and Sutton [12] learn a tree substitution grammar (TSG) using Bayesian nonparamet-
rics, a technique originally devised for natural language grammars. TSGs learn to group commonly
co-appearing grammar productions (tree fragments). Although TSGs have been used in NLP to im-
prove parsing performance (which is ambiguous in text), Allamanis and Sutton [12] observe that
the inferred fragments represent common code usage conventions and name them idioms. Later,
Allamanis et al. [4] extend this technique to mine semantic code idioms by modifying the input
code representation and adapting the inference method.
In a similar fashion, Fowkes and Sutton [63] learn the latent variables of a graphical model to
infer common API usage patterns. Their method automatically infers the most probable grouping
of API elements. This is in stark contrast to frequency-based methods [192] that suffer from finding
frequent but not necessarily “interesting” patterns. Finally, Movshovitz-Attias and Cohen [134]
infer the latent variables of a graphical model that models a software ontology.
As in NLP and machine learning in general, evaluating pattern-mining models is hard, since the
quality of the discovered latent structure is subjective. Thus, researchers often resort to extrinsic,
application-specific measures. For example, Fowkes et al. [62] run a user study to directly assess
the quality of their summarization method.
5 APPLICATIONS
Probabilistic models of source code have found a wide range of applications in software engineer-
ing and programming language research. These models enable the principled use of probabilistic
reasoning to handle uncertainty. Common sources of uncertainty are underspecified or inherently
ambiguous data (such as natural language text). In some domains, probabilistic source code mod-
els also simplify or accelerate analysis tasks that would otherwise be too computationally costly
to execute. In this section, our goal is to explain the use of probabilistic models in each area, not
review them in detail. We describe each area’s goals and key problems and then explain how
they can benefit from probabilistic, machine-learning-based methods and how the methods are
evaluated.
Murali et al. [136] focus on possible paths (that remove control flow dependencies) over API calls.
Therefore, each model captures a limited family of defects, determined by the model designers’
choice of abstraction to represent. Pu et al. [160] and Gupta et al. [81] create models for detecting
and fixing defects but only for student submissions where data sparsity is not a problem. Other
data-mining-based methods (e.g., Wasylkowski et al. [186]) also exist but are out of the scope of
this review, since they do not employ probabilistic methods.
Also related is the work of Campbell et al. [35] and Bhatia and Singh [25]. These researchers
use source code LMs to identify and correct syntax errors. Detecting syntax errors is an easier and
more well-defined task. The goal of these models is not to detect the existence of such an error
(that can be deterministically found) but to efficiently localize the error and suggest a fix.
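The core idea of language-model-based error localization can be sketched as ranking token positions by how surprising each token is to the model; the model interface below is an assumption (any object exposing a prob(token, context) method, such as the toy NGramLM sketched in Section 4.1, would do), and the usage lines are illustrative.

    def rank_suspicious_tokens(tokens, lm, context_size=2):
        """Rank token positions by how surprising each token is to a language model."""
        scored = []
        for i, token in enumerate(tokens):
            context = tokens[max(0, i - context_size):i]
            scored.append((lm.prob(token, context), i, token))
        return sorted(scored)   # lowest-probability (most surprising) positions first

    # Usage sketch, assuming a trained model `lm` and a tokenized buggy file:
    # for probability, position, token in rank_suspicious_tokens(buggy_tokens, lm)[:5]:
    #     print(position, token, probability)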
The earlier work of Liblit et al. [117] and Zheng et al. [199] use traces for statistical bug isola-
tion. Kremenek et al. [108] learn factor graphs (structured prediction) to model resource-specific
bugs by modeling resource usage specifications. These models use an efficient representation to
capture bugs but can fail on interprocedural code that requires more complex graph representa-
tions. Finally, Patra and Pradel [153] use an LM of source code to generate input for fuzz testing
browsers.
Not all anomalous behavior is a bug (it may simply be rare behavior), but anomalous behavior
in often executed code almost certainly is [56]. Thus, probabilistic models of source code seem a
natural fit for finding defective code. They have not, however, seen much industrial uptake. One
possible cause is their imprecision. The vast diversity of code constructs entails sparsity, from
which all anomaly detection methods suffer. Methods based on probabilistic models are no excep-
tion: They tend to consider rare, but correct, code anomalous.
information about variable use to predict the correct name adaptations without external infor-
mation (e.g., tests). Clones may indicate refactoring opportunities (that allow reusing the cloned
code). White et al. [187] use autoencoders and recurrent neural networks [72] to find clones as
code snippets that share similar distributed representations. Using distributed vector representa-
tions allows them to learn a continuous similarity metric between code locations rather than using
edit distance.
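A minimal sketch of this similarity-based view of clone detection follows; it assumes some learned encoder (the hypothetical `encode` function) has already mapped each fragment to a vector, and simply reports pairs whose cosine similarity exceeds a threshold, in contrast to token edit distance.

```python
import numpy as np

# Sketch of similarity-based clone detection in the spirit of White et al. [187]:
# `encode` is a hypothetical learned encoder that maps a code fragment to a
# dense vector; fragments whose vectors are close under cosine similarity are
# reported as clone candidates.

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def clone_candidates(fragments, encode, threshold=0.95):
    vecs = [encode(f) for f in fragments]
    pairs = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            sim = cosine(vecs[i], vecs[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])
```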
formalizing, reasoning about, and interlinking code to documentation (i.e., the traceability problem of Gotel et al. [74]) are seminal software engineering problems. Mining common API patterns
is a recurring theme, and there is a large literature of non-probabilistic methods (e.g., frequency-
based) for mining and synthesizing API patterns [34, 192], which are out of the scope of this
review. Also out of scope is work that combines natural language information with APIs. For ex-
ample, Treude and Robillard [179] extract phrases from StackOverflow using heuristics (manually
selected regular expressions) and use off-the-shelf classifiers on a set of hand-crafted features. We
refer the reader to Robillard et al. [169] for a broader review of probabilistic and non-probabilistic recommender
systems. Within this domain, there are a few probabilistic code models that mine API sequences.
Gu et al. [76] map natural language text to commonly used API sequences, and Allamanis and
Sutton [12] learn fine-grained source code idioms that may include APIs. Fowkes and Sutton [63]
use a graphical model to mine interesting API sequences.
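As a toy illustration of the probabilistic flavour of such miners (and emphatically not the graphical model of Fowkes and Sutton [63]), the sketch below fits a first-order Markov model over observed API call sequences and uses it to score candidate usage patterns.

```python
from collections import Counter

# Toy probabilistic API-sequence scoring: fit a first-order Markov model over
# observed API call sequences and rank candidate sequences by probability.
# Purely illustrative; no smoothing, no latent structure.

def fit_markov(sequences):
    starts, trans, totals = Counter(), Counter(), Counter()
    for seq in sequences:
        starts[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            trans[(a, b)] += 1
            totals[a] += 1
    n = sum(starts.values())
    def prob(seq):
        p = starts[seq[0]] / n
        for a, b in zip(seq, seq[1:]):
            p *= trans[(a, b)] / totals[a] if totals[a] else 0.0
        return p
    return prob

usages = [["open", "read", "close"]] * 8 + [["open", "write", "close"]] * 2
prob = fit_markov(usages)
print(prob(["open", "read", "close"]))  # common usage pattern, high probability
print(prob(["open", "close"]))          # unseen transition scores 0.0 in this unsmoothed toy model
```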
Documentation is also related to information extraction from (potentially unstructured) docu-
ments. Cerulo et al. [37] use a language model to detect code “islands” in free text. Sharma et al.
[175] use a language model over tweets to identify software-relevant tweets.
so they, in the end, generate only valid programs. In a similar manner, Patra and Pradel [153] syn-
thesize JavaScript programs for fuzz testing JavaScript interpreters.
[17] create neural networks by composing them from “neural modules,” based on the input query
structure. Similarly, we believe that such architectures will be useful within probabilistic models
of source code. An early example is the work of Allamanis et al. [8], who design a neural net-
work based on the output of data flow analysis. Such architectures should not only be effective for bridging representations among communities but should also, as we will discuss next, help combat issues with compositionality, sparsity, and generalization. Nevertheless, issues that arise in static
analyses, such as path explosion, will still need to be addressed.
Data Sparsity, Compositionality, and Strong Generalization. The principle of reusability in soft-
ware engineering creates a form of sparsity in the data, where it is rare to find multiple source code
elements that perform exactly the same tasks. For example, it is rare to find hundreds of database
systems, whereas one can easily find thousands of news articles on a popular piece of news. The
exceptions, like programming competitions and student solutions to programming assignments,
are quite different from industrial code. This suggests that there are many opportunities for re-
searching machine-learning models and inference methods that can handle and generalize from
the highly structured, sparse, and composable nature of source code data. Do we believe in the
unreasonable effectiveness of data [83]? Yes, but we do not have sufficient data.
Although code and text are both intrinsically extensible, code pushes the limit of existing
machine-learning methods in terms of representing composition. This is because natural language text rarely defines novel, previously unseen terms, with the possible exception of legal and scientific texts. In contrast, source code is inherently extensible, with developers constantly creating new terms (e.g., by defining new functions and classes) and combining them into still higher-level concepts. Compositionality refers to the idea that the meaning of some element can be understood by composing the meanings of its constituent parts. Recent work [86] has shown that deep-learning architectures can learn some aspects of compositionality in text. Machine learning for highly compositional objects remains challenging, because it has proven hard to capture relations between objects, especially across abstraction levels. Such challenges arise even when considering simple code-like expressions [9]. If sufficient progress is made, however, the representation of source code artifacts in machine learning will improve significantly, positively affecting
other downstream tasks. For example, learning composable models that can combine meaningful
representations of variables into meaningful representations of expressions and functions will lead
to much stronger generalization performance.
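The sketch below shows what such compositionality could look like in its simplest form: the vector for an expression is built by recursively combining the vectors of its sub-expressions with an operator-specific matrix. All names, dimensions, and parameters here are toy assumptions rather than a method from the surveyed literature.

```python
import numpy as np

# Toy compositional embedding: an expression's vector is computed by recursively
# combining its children's vectors with a (here, random) matrix per operator.
# In a learned model these matrices would be trained end-to-end.

DIM = 8
rng = np.random.default_rng(0)
leaf_vecs = {name: rng.normal(size=DIM) for name in ["x", "y", "1"]}
W = {op: rng.normal(size=(DIM, 2 * DIM)) for op in ["+", "*"]}

def embed(expr):
    """expr is either a leaf name or a tuple (op, left, right)."""
    if isinstance(expr, str):
        return leaf_vecs[expr]
    op, left, right = expr
    children = np.concatenate([embed(left), embed(right)])
    return np.tanh(W[op] @ children)

# The representation of (x + 1) * y is built from the representations of its parts.
vec = embed(("*", ("+", "x", "1"), "y"))
print(vec.shape)  # (8,)
```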
Data sparsity is still an important and unsolved problem. Although finding a reasonably large
amount of source code is relatively easy, it is increasingly hard to retrieve some representations
of source code. Indeed, it is infeasible even to compile all of the projects in a corpus of thousands
of projects, because compiling a project requires understanding how the project handles exter-
nal dependencies, which can sometimes be idiosyncratic. Furthermore, computing or acquiring
semantic properties of existing, real-world code (e.g., purity of a function [60] or pre-/post-condi-
tions [90]) is hard to do, especially at scale. Scalability also hampers harvesting run-time data from
real-world programs: It is challenging to acquire substantial run-time data even for a single project.
Exploring ways to synthesize or transform “natural” programs that perform the same task in dif-
ferent ways is a possible way ahead. Another promising direction to tackle this issue is by learn-
ing to extrapolate from run-time data (e.g., collected via instrumentation of a test-suite) to static
properties of the code [4]. Although this is inherently a noisy process, achieving high accuracy
may be sufficient, thanks to the inherent ability of machine learning to handle small amounts of
noise.
Strong generalization also manifests as a deployability problem. Machine-learning models, especially the most effective ones, are often too large to run on a developer's machine, but using the cloud raises privacy concerns6 and prevents offline coding. While code is under active development, exactly when tooling is needed most, it evolves quickly, subjecting models to constant concept drift and necessitating frequent retraining, which can be extremely slow and costly. Addressing
this deployability concern is an open problem and requires advances in machine-learning areas
such as transfer learning and one-shot learning. For example, say a program P uses libraries A and B, which have been shipped with models M_A and M_B. Could we save time training a model for P by transferring knowledge from M_A and M_B?
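One cheap, hypothetical starting point for this transfer question is sketched below: interpolate the next-token distributions of the per-library models when completing code in P. The `next_token_probs` interface is assumed for illustration only; genuine transfer learning would go further, for example by fine-tuning on P's own code.

```python
# Hypothetical sketch of the transfer question above: if models M_A and M_B ship
# with libraries A and B, a fixed-weight mixture of their next-token
# distributions is one cheap baseline for a program P that uses both.
# `next_token_probs` is an assumed interface, not a real API.

def mixture_next_token(context, m_a, m_b, weight_a=0.5):
    pa = m_a.next_token_probs(context)   # dict: token -> probability
    pb = m_b.next_token_probs(context)
    vocab = set(pa) | set(pb)
    return {t: weight_a * pa.get(t, 0.0) + (1 - weight_a) * pb.get(t, 0.0)
            for t in vocab}
```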
Finally, source code representations are multifaceted. For example, the token-level “view” of source code is quite different from a data flow view of code. Learning to exploit multiple views
simultaneously can help machine-learning models generalize and tackle issues with data sparsity.
Multi-view [193] and multi-modal learning (e.g., Gella et al. [70]), areas actively explored in ma-
chine learning, aim to achieve exactly this. By combining multiple representations of data, they
aim to improve performance on various tasks, learning to generalize using multiple input
signals. We believe that this is a promising future direction that may allow us to combine proba-
bilistic representations of code to achieve better generalization.
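The sketch below shows the basic shape of such a multi-view model: one placeholder encoder summarizes the token view of a fragment, another summarizes data-flow features, and a fusion layer combines them into a single representation. It is an architectural sketch under simplifying assumptions, not a model from the surveyed work.

```python
import torch
import torch.nn as nn

# Sketch of a two-view model: a token "view" (mean-pooled embeddings) and a
# data-flow "view" (a projected feature vector) are fused into one representation.

class TwoViewEncoder(nn.Module):
    def __init__(self, vocab_size, n_dataflow_feats, dim=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.flow_proj = nn.Linear(n_dataflow_feats, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, token_ids, dataflow_feats):
        tok_view = self.tok_emb(token_ids).mean(dim=1)           # (batch, dim)
        flow_view = torch.relu(self.flow_proj(dataflow_feats))   # (batch, dim)
        return torch.relu(self.fuse(torch.cat([tok_view, flow_view], dim=-1)))

model = TwoViewEncoder(vocab_size=1000, n_dataflow_feats=16)
out = model(torch.randint(0, 1000, (4, 20)), torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 64])
```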
Measures. To train and evaluate machine-learning models, we need to easily measure their per-
formance. These measures allow the direct comparison of models and have already led to im-
provements in multiple areas, such as code completion (Section 5.1). Nonetheless, these measures
are imprecise. For instance, probabilistic recommender systems define a probability density over
suggestions whose cross-entropy can be computed against the empirical distribution in test data.
Although cross-entropy is correlated with suggestion accuracy and confidence, small improve-
ments in cross-entropy may not improve accuracy. Sometimes the imprecision is due to unrealistic
use case assumptions. For example, the measures for LM-based code completion tend to assume
that code is written sequentially, from the first token to the last one. However, developers rarely
write code in such a simple and consistent way [158]. Context-based approaches assume that the
available context (e.g., other object usages in the context) is abundant, which is not true in real editing scenarios. Researchers reporting keystrokes saved have usually assumed that code com-
pletion suggestions are continuously presented to the user as he or she is typing. When the top
suggestion is the target token, the user presses a single key (e.g., return) to complete the rest of
the target.
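Under exactly that assumption, the keystrokes-saved measure can be computed as in the sketch below, where `suggest_top` stands in for any completion model's top suggestion.

```python
# Sketch of the keystrokes-saved measure under the assumption stated above:
# suggestions are shown continuously, and whenever the model's top suggestion
# equals the target token, the developer presses one key to accept it instead
# of typing the token out. `suggest_top` is a placeholder for any completion model.

def keystrokes_saved(tokens, suggest_top):
    typed = saved = 0
    for i, target in enumerate(tokens):
        full_cost = len(target)                 # keystrokes to type the token manually
        if suggest_top(tokens[:i]) == target:
            typed += 1                          # one key to accept the suggestion
            saved += full_cost - 1
        else:
            typed += full_cost
    total = typed + saved                       # keystrokes without any completion
    return saved / total if total else 0.0      # fraction of keystrokes saved
```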
Furthermore, some metrics that are widely used in NLP are not suited for source code. For exam-
ple, BLEU score is not suitable for measuring the quality of output source code (e.g., in transducer
models), because it fails to account for the fact that a programming language's syntax is known in advance, so the BLEU score may be artificially “inflated” by correctly predicting deterministic syntax tokens. Second,
the granularity over which BLEU is computed (e.g., per-statement vs. per-token) is controversial.
Finally, syntactically diverse answers may be semantically equivalent, yielding a low BLEU score
while being correct. Finding new widely accepted measures for various tasks will allow the com-
munity to build reliable models with well-understood performance characteristics.
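A toy demonstration of the BLEU concern raised above: two statements that differ precisely in the parts that matter (the identifiers and the called method) still share all of their deterministic syntax tokens, so token-level BLEU remains well above zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Two statements that differ in every identifier and in the called method still
# share the deterministic syntax tokens, inflating token-level BLEU.

reference = "if ( count > 0 ) { list . clear ( ) ; }".split()
candidate = "if ( size > 0 ) { buffer . flush ( ) ; }".split()

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(round(score, 3))  # non-trivial score despite a semantically different statement
```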
6 https://www.theregister.co.uk/2017/07/25/kite_flies_into_a_fork/.
of available information. A multitude of tools exist in this area whose main goal is to visualize a
program’s state during its execution. Probabilistic models of source code could help developers,
such as by filtering highly improbable program states. Statistical debugging models, such as the
work of Zheng et al. [198, 199] and Liblit et al. [117], are indicative of the possibilities within this area. Adding learning within debugging models may allow further advances in statistical debugging. However, progress in this area is impeded by the combination of lack of data at
a large scale and the inherent difficulty of pattern recognition in very high-dimensional spaces.
Defects4J [100]—a curated corpus of bugs—could further prove useful within machine learning for
fault prediction. Furthermore, collecting and filtering execution traces to aid debugging is another
challenge for which machine learning is well suited. Collection requires expensive instrumenta-
tion, which can introduce Heisenbugs, bugs masked by the overhead of the instrumentation added
to localize them. Here the question is “Can machine learning identify probe points or reconstruct
more complete traces from partial traces?” Concerning filtering traces, machine learning may be
able to find interesting locations, like the root cause of bugs. Future methods should also be able
to generalize across different programs, or even different revisions of the same program, a difficult
task for existing machine-learning methods.
Traceability. Traceability is the study of links among software engineering artifacts. Examples
include links that connect code to its specification, the specification to requirements, and fixes
to bug reports. Developers can exploit these links to better understand and maintain their code.
Usually, these links must be recovered. Information retrieval has dominated link recovery. The
work of Guo et al. [79] and Le et al. [113] suggests that learning better (semantic) representations of artifacts can help solve important traceability problems automatically.
Two major obstacles impede progress: lack of data and a focus on generic text. Tracing dis-
cussions in email threads, online chat rooms (e.g., Slack), documents, and source code would be
extremely useful, but no publicly available and annotated data exist. Additionally, to date, NLP
research has mostly focused on modeling generic text (e.g., from newspapers); technical text
in conversational environments (e.g., chatbots) has only begun to be researched. StackOverflow
presents one such interesting target. Although there are hundreds of studies that extract useful
artifacts (e.g., documentation) from StackOverflow, NLP methods that address richer linguistic phenomena (such as dependency parsing and co-reference analysis) have not been explored.
Code Completion and Synthesis. Code completion and synthesis using machine learning are two
heavily researched and interrelated areas. Despite this fact, to our knowledge, there has been no
full-scale comparison between LM-based [87, 144, 166] and structured prediction-based autocom-
pletion models [33, 159]. Although both types of systems target the same task, the lack of a well-
accepted benchmark, evaluation methodology, and metrics has led to the absence of a quantitative
comparison that highlights the strengths and weaknesses of each approach. This highlights the ne-
cessity of widely accepted, high-quality benchmarks, shared tasks, and evaluation metrics that can
lead to comparable and measurable improvements to tasks of interest. NLP and computer vision
follow such a paradigm with great success.7
7 See https://qz.com/1034972/ for a popular account of the effect of large-scale datasets in computer vision.
Omar et al. [149] discuss the challenges that arise from the fact that program editors usually
deal with incomplete, partial programs. Although they discuss how formal semantics can extend
to these cases, inherently any reasoning about partial code requires reasoning about the program-
mer’s intent. Lu et al. [125] used information-retrieval methods for synthesizing code completions,
showing that simply retrieving snippets from “big code” can be useful when reasoning about code
completion, even without a learnable probabilistic component. This suggests a fruitful area for
probabilistic models of code that can assist editing tools when reasoning about incomplete code’s
semantics by modeling how code could be completed.
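A minimal sketch of retrieval-based completion in this spirit (not the system of Lu et al. [125]) is shown below: given the tokens of a partial snippet, return the corpus snippets with the highest token-set overlap. The corpus and tokenization are placeholders; real systems use richer indexing and context.

```python
# Retrieval-based completion, reduced to token-set overlap (Jaccard similarity).

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_completions(partial_tokens, corpus_snippets, top_k=3):
    scored = [(jaccard(partial_tokens, snippet), snippet)
              for snippet in corpus_snippets]
    return [s for _, s in sorted(scored, key=lambda x: -x[0])[:top_k]]

corpus = [
    ["try", ":", "f", "=", "open", "(", "path", ")", "finally", ":", "f", ".", "close", "(", ")"],
    ["for", "i", "in", "range", "(", "n", ")", ":", "print", "(", "i", ")"],
]
print(retrieve_completions(["f", "=", "open", "(", "path", ")"], corpus)[0])
```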
Education. Software engineering education is one area that is already starting to be affected by
this field. The work of Campbell et al. [35] and Bhatia and Singh [25] already provides automated methods for fixing syntax errors in student code, whereas Piech et al. [155] and Wang et al.
[182] suggest advancements towards giving richer feedback to students. Achieving reasonable au-
tomation can help provide high-quality computer science education to many more students than
is feasible today. However, there are important challenges associated with this area. These include the limited availability of highly granular data on which machine-learning systems can be trained, the difficulty of embedding semantic features of code into machine-learning methods, and the hardness of creating models that can generalize to multiple and new tasks. Student coursework submissions are potentially a ripe application area for machine learning, because here many programs, written by different students and meant to perform the same tasks, are available. An especially large amount of such data is available in Massive Open Online Courses. This opens exciting possibilities: granular and detailed feedback, curriculum customization, and other intelligent tutoring systems could significantly change computer science education.
Assistive Tools. Probabilistic models have allowed computer systems to handle noisy inputs such
as speech and handwritten text input. In the future, probabilistic models of source code may enable
novel assistive IDEs, creating tools that improve on conventional methods of developer–computer interaction and provide more inclusive coding experiences.
8 http://science.dodlive.mil/2014/03/21/darpas-muse-mining-big-code/.
8 CONCLUSIONS
Probabilistic models of source code have exciting potential to support new tools in almost every
area of program analysis and software engineering. We reviewed existing work in the area, pre-
senting a taxonomy of probabilistic machine-learning source code models and their applications.
The reader may appreciate that most of the research contained in this review was conducted within
the past few years, indicating a growth of interest in this area among the machine-learning, pro-
gramming languages, and software engineering communities. Probabilistic models of source code
raise the exciting opportunity of learning from existing code, probabilistically reasoning about new source code artifacts, and transferring knowledge between developers and projects.
ELECTRONIC APPENDIX
The electronic appendix for this article can be accessed in the ACM Digital Library.
REFERENCES
[1] Mithun Acharya, Tao Xie, Jian Pei, and Jun Xu. 2007. Mining API patterns as partial orders from source code: From
usage scenarios to specifications. In Proceedings of the Joint Meeting of the European Software Engineering Conference
and the Symposium on the Foundations of Software Engineering (ESEC/FSE’07).
[2] Karan Aggarwal, Mohammad Salameh, and Abram Hindle. 2015. Using Machine Translation for Converting Python
2 to Python 3 Code. Technical Report.
[3] Alex A. Alemi, Francois Chollet, Geoffrey Irving, Christian Szegedy, and Josef Urban. 2016. DeepMath–Deep se-
quence models for premise selection. In Proceedings of the Annual Conference on Neural Information Processing Sys-
tems (NIPS’16).
[4] Miltiadis Allamanis, Earl T. Barr, Christian Bird, Premkumar Devanbu, Mark Marron, and Charles Sutton. 2016.
Mining Semantic Loop Idioms from Big Code. Technical Report. Retrieved from https://www.microsoft.com/en-us/
research/publication/mining-semantic-loop-idioms-big-code/.
[5] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In
Proceedings of the International Symposium on Foundations of Software Engineering (FSE’14).
[6] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class
names. In Proceedings of the Joint Meeting of the European Software Engineering Conference and the Symposium on
the Foundations of Software Engineering (ESEC/FSE’15).
[7] Miltiadis Allamanis and Marc Brockschmidt. 2017. SmartPaste: Learning to adapt source code. arXiv Preprint
arXiv:1705.07867 (2017).
[8] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to represent programs with
graphs. In Proceedings of the International Conference on Learning Representations (ICLR’18).
[9] Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, and Charles Sutton. 2017. Learning continuous
semantic representations of symbolic expressions. In Proceedings of the International Conference on Machine Learning
(ICML’17).
[10] Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summa-
rization of source code. In Proceedings of the International Conference on Machine Learning (ICML’16).
[11] Miltiadis Allamanis and Charles Sutton. 2013. Mining source code repositories at massive scale using language
modeling. In Proceedings of the Working Conference on Mining Software Repositories (MSR’13).
[12] Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code. In Proceedings of the International
Symposium on Foundations of Software Engineering (FSE’14).
[13] Miltiadis Allamanis, Daniel Tarlow, Andrew Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural
language. In Proceedings of the International Conference on Machine Learning (ICML’15).
[14] Sven Amann, Sebastian Proksch, Sarah Nadi, and Mira Mezini. 2016. A study of visual studio usage in practice. In
Proceedings of the International Conference on Software Analysis, Evolution, and Reengineering (SANER’16).
[15] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In
Proceedings of the Spring Joint Computer Conference.
[16] Matthew Amodio, Swarat Chaudhuri, and Thomas Reps. 2017. Neural attribute machines for program generation.
arXiv Preprint arXiv:1705.09231 (2017).
[17] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for
question answering. In Proceedings of the Annual Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies (NAACL-HLT’16).
[18] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, and Konrad Rieck. 2014. DREBIN: Effective and
explainable detection of android malware in your pocket. In Proceedings of the Network and Distributed System
Security Symposium.
[19] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to
align and translate. In Proceedings of the International Conference on Learning Representations (ICLR’15).
[20] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2017. DeepCoder:
Learning to write programs. In Proceedings of the International Conference on Learning Representations (ICLR’17).
[21] Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A parallel corpus of Python functions and documentation
strings for automated code documentation and code generation. In Proceedings of the Eighth International Joint
Conference on Natural Language Processing (Volume 2: Short Papers) 2 (2017), 314–319.
[22] Rohan Bavishi, Michael Pradel, and Koushik Sen. 2017. Context2Name: A deep learning-based approach to infer
natural variable names from usage contexts. TU Darmstadt, Department of Computer Science.
[23] Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In Proceedings of the
ACM SIGCHI Symposium on Engineering Interactive Computing Systems. ACM, 3 pages.
[24] Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott
McPeak, and Dawson Engler. 2010. A few billion lines of code later: Using static analysis to find bugs in the real
world. Communications of the ACM 53, 2 (2010), 66–75.
[25] Sahil Bhatia and Rishabh Singh. 2018. Automated correction for syntax errors in programming assignments using
recurrent neural networks. In Proceedings of the International Conference on Software Engineering (ICSE’18).
[26] Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, and Sebastian Riedel. 2016. Learning Python code suggestion
with a sparse pointer network. arXiv Preprint arXiv:1611.08307 (2016).
[27] Benjamin Bichsel, Veselin Raychev, Petar Tsankov, and Martin Vechev. 2016. Statistical deobfuscation of android
applications. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.
[28] Pavol Bielik, Veselin Raychev, and Martin Vechev. 2015. Programming with “big code”: Lessons, techniques and
applications. In Proceedings of the LIPIcs-Leibniz International Proceedings in Informatics.
[29] Pavol Bielik, Veselin Raychev, and Martin Vechev. 2016. PHOG: Probabilistic model for code. In Proceedings of the
International Conference on Machine Learning (ICML’16).
[30] David M. Blei. 2012. Probabilistic topic models. Communications of the ACM 55, 4 (2012), 77–84.
[31] Marc Brockschmidt, Yuxin Chen, Pushmeet Kohli, Siddharth Krishna, and Daniel Tarlow. 2017. Learning shape
analysis. In Proceedings of the International Static Analysis Symposium. Springer.
[32] Peter John Brown. 1979. Software Portability: An Advanced Course. CUP Archive.
[33] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from examples to improve code completion
systems. In Proceedings of the Joint Meeting of the European Software Engineering Conference and the Symposium on
the Foundations of Software Engineering (ESEC/FSE’09).
[34] Raymond P. L. Buse and Westley Weimer. 2012. Synthesizing API usage examples. In Proceedings of the International
Conference on Software Engineering (ICSE’12).
[35] Joshua Charles Campbell, Abram Hindle, and José Nelson Amaral. 2014. Syntax errors just aren’t natural: Improving
error reporting with language models. In Proceedings of the Working Conference on Mining Software Repositories
(MSR’14).
[36] Lei Cen, Christoher S. Gates, Luo Si, and Ninghui Li. 2015. A probabilistic discriminative model for Android malware
detection with decompiled source code. IEEE Transactions on Dependable and Secure Computing 12, 4 (2015), 400–412.
[37] Luigi Cerulo, Massimiliano Di Penta, Alberto Bacchelli, Michele Ceccarelli, and Gerardo Canfora. 2015. Irish: A
hidden markov model to detect coded information islands in free text. Science of Computer Programming 105 (2015),
26–43.
[38] Kwonsoo Chae, Hakjoo Oh, Kihong Heo, and Hongseok Yang. 2017. Automatically generating features for learning
program analysis heuristics for C-like languages. In Proceedings of the Conference on Object-Oriented Programming,
Systems, Languages & Applications (OOPSLA’17).
[39] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM Computing Surveys
(CSUR) 41, 3 (2009), 15.
[40] Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling.
Computer Speech and Language 13, 4 (1999), 359–394.
[41] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neu-
ral machine translation: Encoder–Decoder approaches. In Syntax, Semantics and Structure in Statistical Translation
(2014).
[42] Edmund Clarke, Daniel Kroening, and Karen Yorav. 2003. Behavioral consistency of C and verilog programs using
bounded model checking. In Proceedings of the 40th Annual Design Automation Conference.
[43] Trevor Cohn, Phil Blunsom, and Sharon Goldwater. 2010. Inducing tree-substitution grammars. Journal of Machine
Learning Research 11, Nov (2010), 3053–3096.
[44] Christopher S. Corley, Kostadin Damevski, and Nicholas A. Kraft. 2015. Exploring the use of deep learning for feature
location. In Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME’15).
[45] Patrick Cousot, Radhia Cousot, Jerôme Feret, Laurent Mauborgne, Antoine Miné, David Monniaux, and Xavier Rival.
2005. The ASTRÉE analyzer. In ESOP. Springer.
[46] William Croft. 2008. Evolutionary linguistics. Ann. Rev. Anthropol. (2008).
[47] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. End-to-end deep learning of optimiza-
tion heuristics. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT’17).
IEEE, 219–232.
[48] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. Synthesizing benchmarks for predictive
modeling. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’17).
IEEE, 86–99.
[49] Hoa Khanh Dam, Truyen Tran, and Trang Pham. 2016. A deep language model for software code. arXiv Preprint
arXiv:1608.02715 (2016).
[50] Florian Deißenböck and Markus Pizka. 2006. Concise and consistent naming. Software Quality Journal 14, 3 (2006),
261–282.
[51] Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M. Rush. 2017. Image-to-markup generation with
coarse-to-fine attention. In Proceedings of the International Conference on Machine Learning (ICML’17). 980–989.
[52] Premkumar Devanbu. 2015. New initiative: The naturalness of software. In Proceedings of the International Confer-
ence on Software Engineering (ICSE’15).
[53] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel rahman Mohamed, and Pushmeet Kohli.
2017. Robustfill: Neural program learning under noisy I/O. In Proceedings of the International Conference on Machine
Learning (ICML’17).
[54] Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen. 2013. Boa: A language and infrastructure
for analyzing ultra-large-scale software repositories. In Proceedings of the International Conference on Software En-
gineering (ICSE’13).
[55] Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Joshua B. Tenenbaum. 2017. Learning to infer graphics
programs from hand-drawn images. arXiv Preprint arXiv:1707.09627 (2017).
[56] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. 2001. Bugs as deviant behavior: A
general approach to inferring errors in systems code. In ACM SIGOPS Operating Systems Review.
[57] Michael D. Ernst. 2017. Natural language is a programming language: Applying natural language processing to
software development. In Proceedings of the LIPIcs-Leibniz International Proceedings in Informatics.
[58] Ethan Fast, Daniel Steffee, Lucy Wang, Joel R. Brandt, and Michael S. Bernstein. 2014. Emergent, crowd-scale pro-
gramming practice in the IDE. In Proceedings of the Annual ACM Conference on Human Factors in Computing Systems.
[59] John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow. 2017. Neural functional programming.
In Proceedings of the International Conference on Learning Representations (ICLR’17).
[60] Matthew Finifter, Adrian Mettler, Naveen Sastry, and David Wagner. 2008. Verifiable functional purity in java. In
Proceedings of the 15th ACM Conference on Computer and Communications Security. ACM, 161–174.
[61] Eclipse Foundation. Code Recommenders. Retrieved June 2017 from www.eclipse.org/recommenders.
[62] Jaroslav Fowkes, Pankajan Chanthirasegaran, Razvan Ranca, Miltos Allamanis, Mirella Lapata, and Charles Sutton.
2017. Autofolding for source code summarization. IEEE Transactions on Software Engineering 43, 12 (2017), 1095–
1109.
[63] Jaroslav Fowkes and Charles Sutton. 2015. Parameter-free probabilistic API mining at GitHub Scale. In Proceedings
of the International Symposium on Foundations of Software Engineering (FSE’15).
[64] Christine Franks, Zhaopeng Tu, Premkumar Devanbu, and Vincent Hellendoorn. 2015. Cacheca: A cache language
model based code suggestion tool. In Proceedings of the International Conference on Software Engineering (ICSE’15).
[65] Wei Fu and Tim Menzies. 2017. Easy over hard: A case study on deep learning. In Proceedings of the International
Symposium on Foundations of Software Engineering (FSE’17).
[66] Mark Gabel and Zhendong Su. 2008. Javert: Fully automatic mining of general temporal properties from dynamic
traces. In Proceedings of the International Symposium on Foundations of Software Engineering (FSE’08).
[67] Mark Gabel and Zhendong Su. 2010. A study of the uniqueness of source code. In Proceedings of the International
Symposium on Foundations of Software Engineering (FSE’10).
[68] Rosalva E. Gallardo-Valencia and Susan Elliott Sim. 2009. Internet-scale code search. In Proceedings of the 2009 ICSE
Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation.
[69] Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and
Daniel Tarlow. 2016. TerpreT: A probabilistic programming language for program induction. arXiv Preprint
arXiv:1608.04428 (2016).
[70] Spandana Gella, Mirella Lapata, and Frank Keller. 2016. Unsupervised visual sense disambiguation for verbs using
multimodal embeddings. In Proceedings of the Annual Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies (NAACL-HLT’16).
[71] Elena L. Glassman, Jeremy Scott, Rishabh Singh, Philip J. Guo, and Robert C. Miller. 2015. OverCode: Visualizing
variation in student solutions to programming problems at scale. ACM Transactions on Computer-Human Interaction
(TOCHI) 22, 2 (2015), 7 pages.
[72] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[73] Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani. 2014. Probabilistic programming.
In Proceedings of the International Conference on Software Engineering (ICSE’14).
[74] Orlena Gotel, Jane Cleland-Huang, Jane Huffman Hayes, Andrea Zisman, Alexander Egyed, Paul Grünbacher, Alex
Dekhtyar, Giuliano Antoniol, Jonathan Maletic, and Patrick Mäder. 2012. Traceability fundamentals. In Software and
Systems Traceability. Springer, 3–22.
[75] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv Preprint arXiv:1410.5401 (2014).
[76] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API learning. In Proceedings of the
International Symposium on Foundations of Software Engineering (FSE’16).
[77] Sumit Gulwani and Mark Marron. 2014. NLyze: Interactive programming by natural language for spreadsheet data
analysis and manipulation. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.
[78] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, and others. 2017. Program synthesis. In Foundations and Trends®
in Programming Languages 4, 1–2 (2017), 1–119.
[79] Jin Guo, Jinghui Cheng, and Jane Cleland-Huang. 2017. Semantically enhanced software traceability using deep
learning techniques. In Proceedings of the International Conference on Software Engineering (ICSE’17).
[80] Rahul Gupta, Aditya Kanade, and Shirish Shevade. 2018. Deep reinforcement learning for programming language
correction. arXiv Preprint arXiv:1801.10467 (2018).
[81] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing common C language errors by
deep learning. In Proceedings of the Conference of Artificial Intelligence (AAAI’17).
[82] Tihomir Gvero and Viktor Kuncak. 2015. Synthesizing java expressions from free-form queries. In Proceedings of the
Conference on Object-Oriented Programming, Systems, Languages & Applications (OOPSLA’15).
[83] Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The unreasonable effectiveness of data. IEEE Intelligent
Systems 24, 2 (2009), 8–12.
[84] Vincent J. Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling
source code? In Proceedings of the International Symposium on Foundations of Software Engineering (FSE’17).
[85] Vincent J. Hellendoorn, Premkumar T. Devanbu, and Alberto Bacchelli. 2015. Will they like this?: Evaluating code
contributions with language models. In Proceedings of the Working Conference on Mining Software Repositories
(MSR’15).
[86] Felix Hill, KyungHyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embed-
ding the dictionary. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’16).
[87] Abram Hindle, Earl T. Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of
software. Communications of the ACM 59, 5 (2016), 122–131.
[88] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of
software. In Proceedings of the International Conference on Software Engineering (ICSE’12).
[89] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. 1986. Distributed representations. In Parallel Distributed Pro-
cessing: Explorations in the Microstructure of Cognition, vol. 1. MIT Press, 77–109.
[90] C. A. R. Hoare. 1969. An axiomatic basis for computer programming. Commun. ACM 12, 10 (Oct. 1969), 576–580.
DOI: http://dx.doi.org/10.1145/363235.363259
[91] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–
1780.
[92] Reid Holmes, Robert J. Walker, and Gail C. Murphy. 2005. Strathcona example recommendation tool. In ACM SIG-
SOFT Software Engineering Notes 30, 5 (2005), 237–240.
[93] Chun-Hung Hsiao, Michael Cafarella, and Satish Narayanasamy. 2014. Using web corpus statistics for program
analysis. In ACM SIGPLAN Notices 49, 10 (2014), 49–65.
[94] Xing Hu, Yuhan Wei, Ge Li, and Zhi Jin. 2017. CodeSum: Translate program language to natural language. arXiv
Preprint arXiv:1708.01837 (2017).
[95] Andrew Hunt and David Thomas. 2000. The Pragmatic Programmer: From Journeyman to Master. Addison-Wesley
Professional.
[96] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using
a neural attention model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics
(ACL’16).
[97] Siyuan Jiang, Ameer Armaly, and Collin McMillan. 2017. Automatically generating commit messages from diffs
using neural machine translation. In Proceedings of the International Conference on Automated Software Engineering
(ASE’17).
[98] Daniel D. Johnson. 2016. Learning graphical state transitions. In Proceedings of the International Conference on Learn-
ing Representations (ICLR’16).
[99] Dan Jurafsky. 2000. Speech & Language Processing (3rd. ed.). Pearson Education.
[100] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A database of existing faults to enable controlled
testing studies for Java programs. In Proceedings of the International Symposium on Software Testing and Analysis
(ISSTA’14).
[101] Neel Kant. 2018. Recent advances in neural program synthesis. arXiv Preprint arXiv:1802.02353 (2018).
[102] Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. 2014. Phrase-based statistical translation of program-
ming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflec-
tions on Programming & Software. ACM, 173–184.
[103] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. arXiv
Preprint arXiv:1506.02078 (2015).
[104] Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of
the 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP’95). 1 (1995), 181–184.
[105] Donald Ervin Knuth. 1984. Literate programming. The Computer Journal 27, 2 (1984), 97–111.
[106] Ugur Koc, Parsa Saadatpanah, Jeffrey S. Foster, and Adam A. Porter. 2017. Learning a classifier for false positive
error reports emitted by static code analysis tools. In Proceedings of the 1st ACM SIGPLAN International Workshop
on Machine Learning and Programming Languages.
[107] Rainer Koschke. 2007. Survey of research on software clones. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-
Leibniz-Zentrum für Informatik.
[108] Ted Kremenek, Andrew Y. Ng, and Dawson R. Engler. 2007. A factor graph model for software bug finding. In
Proceedings of the International Joint Conference on Artifical intelligence (IJCAI’07).
[109] Roland Kuhn and Renato De Mori. 1990. A cache-based natural language model for speech recognition. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 12, 6 (1990), 570–583.
[110] Nate Kushman and Regina Barzilay. 2013. Using semantic unification to generate regular expressions from natural
language. In Proceedings of Annual Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (NAACL-HLT’13).
[111] Tessa Lau. 2001. Programming by Demonstration: A Machine Learning Approach. Ph.D. Dissertation. University of
Washington.
[112] Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the
International Conference on Machine Learning (ICML’14).
[113] Tien-Duy B. Le, Mario Linares-Vásquez, David Lo, and Denys Poshyvanyk. 2015. Rclinker: Automated linking of
issue reports and commits leveraging rich contextual information. In Proceedings of the International Conference on
Program Comprehension (ICPC’15).
[114] Dor Levy and Lior Wolf. 2017. Learning to align the source code to the compiled object code. In Proceedings of the
International Conference on Machine Learning (ICML’17).
[115] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2016. Gated graph sequence neural networks. In
Proceedings of the International Conference on Learning Representations (ICLR’16).
[116] Percy Liang, Michael I. Jordan, and Dan Klein. 2010. Learning programs: A hierarchical bayesian approach. In Pro-
ceedings of the International Conference on Machine Learning (ICML’10).
[117] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. 2005. Scalable statistical bug isolation.
In ACM SIGPLAN Notices 40, 6 (2005), 15–26.
[118] Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, and Michael D. Ernst. 2017. Program
Synthesis from Natural Language using Recurrent Neural Networks. Technical Report UW-CSE-17-03-01. University
of Washington Department of Computer Science and Engineering, Seattle, WA.
[119] Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. 2018. NL2Bash: A corpus and semantic
parser for natural language interface to the linux operating system. In Proceedings of the International Conference on
Language Resources and Evaluation.
[120] Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Andrew Senior, Fumin Wang, and Phil
Blunsom. 2016. Latent predictor networks for code generation. In Proceedings of the Annual Meeting of the Association
for Computational Linguistics (ACL’16).
[121] Han Liu. 2016. Towards better program obfuscation: Optimization via language models. In Proceedings of the 38th
International Conference on Software Engineering Companion.
[122] Benjamin Livshits, Aditya V. Nori, Sriram K. Rajamani, and Anindya Banerjee. 2009. Merlin: Specification infer-
ence for explicit information flow problems. In Proceedings of the Symposium on Programming Language Design and
Implementation (PLDI’09).
[123] Sarah M. Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. 2017. Deep network guided proof search. In
Proceedings of the International Conference on Logic for Programming Artificial Intelligence and Reasoning (LPAR’17).
[124] Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A neural architecture for generating natural lan-
guage descriptions from source code changes. In Proceedings of the 55th Annual Meeting of the Association for Com-
putational Linguistics (Volume 2: Short Papers) 2 (2017), 287–292.
[125] Yanxin Lu, Swarat Chaudhuri, Chris Jermaine, and David Melski. 2017. Data-Driven program completion. arXiv
Preprint arXiv:1705.09042 (2017).
[126] Chris Maddison and Daniel Tarlow. 2014. Structured generative models of natural source code. In Proceedings of the
International Conference on Machine Learning (ICML’14).
[127] Ravi Mangal, Xin Zhang, Aditya V. Nori, and Mayur Naik. 2015. A user-guided approach to program analysis. In
Proceedings of the International Symposium on Foundations of Software Engineering (FSE’15).
[128] Collin Mcmillan, Denys Poshyvanyk, Mark Grechanik, Qing Xie, and Chen Fu. 2013. Portfolio: Searching for relevant
functions and their usages in millions of lines of code. ACM Transactions on Software Engineering and Methodology
(TOSEM) 22, 4 (2013), 37 pages.
[129] Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Kalai. 2013. A machine learning framework
for programming by example. In Proceedings of the International Conference on Machine Learning (ICML’13).
[130] Kim Mens and Angela Lozano. 2014. Source code-based recommendation systems. In Recommendation Systems in
Software Engineering. Springer, 93–130.
[131] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in
vector space. arXiv Preprint arXiv:1301.3781 (2013).
[132] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional neural networks over tree structures for
programming language processing. In Proceedings of the Conference of Artificial Intelligence (AAAI’16).
[133] Dana Movshovitz-Attias and William W. Cohen. 2013. Natural language models for predicting programming com-
ments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’13).
[134] Dana Movshovitz-Attias and William W. Cohen. 2015. KB-LDA: Jointly learning a knowledge base of hierarchy,
relations, and facts. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’15).
[135] Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, and Chris Jermaine. 2018. Neural sketch learning for condi-
tional program generation. In Proceedings of the International Conference on Learning Representations (ICLR).
[136] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. 2017. Bayesian specification learning for finding API
usage errors. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 151–162.
[137] Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2015. Neural programmer: Inducing latent programs with
gradient descent. In Proceedings of the International Conference on Learning Representations (ICLR’15).
[138] Graham Neubig. 2016. Survey of methods to generate natural language from source code. Retrieved from http://www.languageandcode.org/nlse2015/neubig15nlse-survey.pdf.
[139] Anh Tuan Nguyen and Tien N. Nguyen. 2015. Graph-based statistical language model for code. In Proceedings of the
International Conference on Software Engineering (ICSE’15).
[140] Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N. Nguyen. 2013. Lexical statistical machine translation for
language migration. In Proceedings of the International Symposium on Foundations of Software Engineering (FSE’13).
[141] Anh T. Nguyen, Tung Thanh Nguyen, and Tien N. Nguyen. 2015. Divide-and-conquer approach for multi-phase sta-
tistical migration for source code. In Proceedings of the International Conference on Automated Software Engineering
(ASE’15).
[142] Trong Duc Nguyen, Anh Tuan Nguyen, and Tien N. Nguyen. 2016. Mapping API elements for code migration with
vector representations. In Proceedings of the International Conference on Software Engineering (ICSE’16).
[143] Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan, and Tien N. Nguyen. 2017. Exploring API embedding for
API usages and applications. In Proceedings of the International Conference on Software Engineering (ICSE’17).
[144] Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. 2013. A statistical semantic
language model for source code. In Proceedings of the Joint Meeting of the European Software Engineering Conference
and the Symposium on the Foundations of Software Engineering (ESEC/FSE’13).
[145] Haoran Niu, Iman Keivanloo, and Ying Zou. 2017. Learning to rank code examples for code search engines. Empirical
Software Engineering (ESEM’16) 22, 1 (2017), 259–291.
[146] Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura.
2015. Learning to generate pseudo-code from source code using statistical machine translation. In Proceedings of the
International Conference on Automated Software Engineering (ASE’15).
[147] Hakjoo Oh, Hongseok Yang, and Kwangkeun Yi. 2015. Learning a strategy for adapting a program analysis via
bayesian optimisation. In Proceedings of the Conference on Object-Oriented Programming, Systems, Languages & Ap-
plications (OOPSLA’15).
[148] Cyrus Omar. 2013. Structured statistical syntax tree prediction. In Proceedings of the Conference on Systems, Pro-
gramming, Languages and Applications: Software for Humanity (SPLASH’13).
[149] Cyrus Omar, Ian Voysey, Michael Hilton, Joshua Sunshine, Claire Le Goues, Jonathan Aldrich, and Matthew A.
Hammer. 2017. Toward semantic foundations for program editors. arXiv preprint arXiv:1703.08694.
[150] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of
machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’02).
[151] Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. 2017.
Neuro-symbolic program synthesis. In Proceedings of the International Conference on Learning Representations
(ICLR’17).
[152] Terence Parr and Jurgen J. Vinju. 2016. Towards a universal code formatter through machine learning. In Proceedings
of the International Conference on Software Language Engineering (SLE’16).
[153] Jibesh Patra and Michael Pradel. 2016. Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic,
Generative Models of Input Data. TU Darmstadt, Department of Computer Science, TUD-CS-2016-14664.
[154] Hung Viet Pham, Phong Minh Vu, Tung Thanh Nguyen, and others. 2016. Learning API usages from bytecode: A
statistical approach. In Proceedings of the International Conference on Software Engineering (ICSE’16).
[155] Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas J. Guibas. 2015.
Learning program embeddings to propagate feedback on student code. In Proceedings of the International Conference
on Machine Learning (ICML’15).
[156] Matt Post and Daniel Gildea. 2009. Bayesian learning of a tree substitution grammar. In Proceedings of the Annual
Meeting of the Association for Computational Linguistics (ACL’09).
[157] Michael Pradel and Koushik Sen. 2017. Deep learning to find bugs. TU Darmstadt, Department of Computer Science.
[158] Sebastian Proksch, Sven Amann, Sarah Nadi, and Mira Mezini. 2016. Evaluating the evaluations of code recom-
mender systems: A reality check. In Proceedings of the International Conference on Automated Software Engineering
(ASE’16).
[159] Sebastian Proksch, Johannes Lerch, and Mira Mezini. 2015. Intelligent code completion with bayesian networks.
ACM Transactions on Software Engineering and Methodology (TOSEM) 25, 1 (2015), 3.
[160] Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, and Regina Barzilay. 2016. sk_p: A neural program correc-
tor for MOOCs. In Proceedings of the Conference on Systems, Programming, Languages and Applications: Software for
Humanity (SPLASH’16).
[161] Chris Quirk, Raymond Mooney, and Michel Galley. 2015. Language to code: Learning semantic parsers for if-this-
then-that recipes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’15).
[162] Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic
parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’17).
[163] Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, and Premkumar Devanbu.
2016. On the naturalness of buggy code. In Proceedings of the International Conference on Software Engineering
(ICSE’16).
[164] Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In
Proceedings of the Symposium on Principles of Programming Languages (POPL’16).
[165] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting program properties from “big code.” In Pro-
ceedings of the Symposium on Principles of Programming Languages (POPL’15).
[166] Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In Pro-
ceedings of the Symposium on Programming Language Design and Implementation (PLDI’14).
[167] Scott Reed and Nando de Freitas. 2016. Neural programmer-interpreters. In Proceedings of the International Confer-
ence on Learning Representations (ICLR’16).
[168] Sebastian Riedel, Matko Bosnjak, and Tim Rocktäschel. 2017. Programming with a differentiable forth interpreter.
In Proceedings of the International Conference on Machine Learning (ICML’17).
[169] Martin Robillard, Robert Walker, and Thomas Zimmermann. 2010. Recommendation systems for software engineer-
ing. IEEE Software 27, 4 (2010), 80–86.
[170] Martin P. Robillard, Walid Maalej, Robert J. Walker, and Thomas Zimmermann. 2014. Recommendation Systems in
Software Engineering. Springer.
[171] Tim Rocktäschel and Sebastian Riedel. 2017. End-to-end differentiable proving. In Proceedings of the Annual Confer-
ence on Neural Information Processing Systems (NIPS’17).
[172] Caitlin Sadowski, Kathryn T. Stolee, and Sebastian Elbaum. 2015. How developers search for code: A case study. In
Proceedings of the International Symposium on Foundations of Software Engineering (FSE’15).
[173] Juliana Saraiva, Christian Bird, and Thomas Zimmermann. 2015. Products, developers, and milestones: How should
I build my N-gram language model. In Proceedings of the International Symposium on Foundations of Software Engi-
neering (FSE’15).
[174] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword
units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’16).
[175] Abhishek Sharma, Yuan Tian, and David Lo. 2015. NIRMAL: Automatic identification of software relevant tweets
leveraging language model. In Proceedings of the International Conference on Software Analysis, Evolution, and Reengi-
neering (SANER’15).
[176] Rishabh Singh and Sumit Gulwani. 2015. Predicting a correct program in programming by example. In Proceedings
of the International Conference on Computer Aided Verification.
[177] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Pro-
ceedings of the Annual Conference on Neural Information Processing Systems (NIPS’14).
[178] Suresh Thummalapenta and Tao Xie. 2007. Parseweb: A programmer assistant for reusing open source code on the
web. In Proceedings of the International Conference on Automated Software Engineering (ASE’07).
[179] Christoph Treude and Martin P. Robillard. 2016. Augmenting API documentation with insights from stack overflow.
In Proceedings of the International Conference on Software Engineering (ICSE’16).
[180] Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. 2014. On the localness of software. In Proceedings of the
International Symposium on Foundations of Software Engineering (FSE’14).
[181] Bogdan Vasilescu, Casey Casalnuovo, and Premkumar Devanbu. 2017. Recovering clear, natural identifiers from
obfuscated JS names. In Proceedings of the International Symposium on Foundations of Software Engineering (FSE’17).
[182] Lisa Wang, Angela Sy, Larry Liu, and Chris Piech. 2017. Deep knowledge tracing on programming exercises. In
Proceedings of the Conference on Learning @ Scale.
[183] Song Wang, Devin Chollak, Dana Movshovitz-Attias, and Lin Tan. 2016. Bugram: Bug detection with n-gram lan-
guage models. In Proceedings of the International Conference on Automated Software Engineering (ASE’16).
[184] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Pro-
ceedings of the International Conference on Software Engineering (ICSE’16).
[185] Xin Wang, Chang Liu, Richard Shin, Joseph E. Gonzalez, and Dawn Song. 2016. Neural Code Completion. Retrieved
from https://openreview.net/pdf?id=rJbPBt9lg.
[186] Andrzej Wasylkowski, Andreas Zeller, and Christian Lindig. 2007. Detecting object usage anomalies. In Proceed-
ings of the Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of
Software Engineering (ESEC/FSE’07).
[187] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments
for code clone detection. In Proceedings of the International Conference on Automated Software Engineering (ASE’16).
[188] Martin White, Christopher Vendome, Mario Linares-Vásquez, and Denys Poshyvanyk. 2015. Toward deep learning
software repositories. In Proceedings of the Working Conference on Mining Software Repositories (MSR’15).
[189] Chadd C. Williams and Jeffrey K. Hollingsworth. 2005. Automatic mining of source code repositories to improve
bug finding techniques. IEEE Transactions on Software Engineering 31, 6 (2005), 466–480.
[190] Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2016. Data Mining: Practical Machine Learning Tools
and Techniques. Morgan Kaufmann.
[191] W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A survey on software fault localization.
IEEE Transactions on Software Engineering 42, 8 (2016), 707–740.
[192] Tao Xie and Jian Pei. 2006. MAPO: Mining API usages from open source repositories. In Proceedings of the Working
Conference on Mining Software Repositories (MSR’06).
[193] Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv Preprint arXiv:1304.5634 (2013).
[194] Shir Yadid and Eran Yahav. 2016. Extracting code from programming tutorial videos. In Proceedings of the 2016 ACM
International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software.
[195] Eran Yahav. 2015. Programming with “big code.” In Asian Symposium on Programming Languages and Systems.
Springer, 3–8.
[196] Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. Proceed-
ings of the Annual Meeting of the Association for Computational Linguistics (ACL’17).
[197] Wojciech Zaremba and Ilya Sutskever. 2014. Learning to execute. arXiv Preprint arXiv:1410.4615 (2014).
[198] Alice X. Zheng, Michael I. Jordan, Ben Liblit, and Alex Aiken. 2003. Statistical debugging of sampled programs. In
Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS’03).
[199] Alice X. Zheng, Michael I. Jordan, Ben Liblit, Mayur Naik, and Alex Aiken. 2006. Statistical debugging: Simultaneous
identification of multiple bugs. In Proceedings of the International Conference on Machine Learning (ICML’06).
[200] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural
language using reinforcement learning. arXiv Preprint arXiv:1709.00103 (2017).
[201] Thomas Zimmermann, Andreas Zeller, Peter Weissgerber, and Stephan Diehl. 2005. Mining version histories to
guide software changes. IEEE Transactions on Software Engineering 31, 6 (2005), 429–445.