
PYMT5: Multi-mode translation of natural language and Python code with transformers

Colin B. Clement* (Microsoft Cloud and AI)
Dawn Drain (Microsoft Cloud and AI)
Jonathan Timcheck† (Stanford University)
Alexey Svyatkovskiy (Microsoft Cloud and AI)
Neel Sundaresan (Microsoft Cloud and AI)

* Corresponding author
† Work done during a Microsoft internship

Abstract

Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PYMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PYMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieves a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieves a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation.

1 Introduction

Software is a keystone of modern society, touching billions of people through services and devices daily. Writing and documenting the source code of this software are challenging and labor-intensive tasks; software developers need to repeatedly refer to online documentation resources in order to understand existing code bases and make progress. Developer productivity can be improved by the presence of source code documentation and a development environment featuring intelligent, machine-learning-based code completion and analysis tools.

Recent progress in natural language processing (NLP), especially encoder/decoder-based transformer models (Vaswani et al., 2017) and pre-training (Radford et al., 2018; Lewis et al., 2019), has led to state-of-the-art performance on language modeling, classification (Devlin et al., 2018), translation (Raffel et al., 2019), summarization (Liu and Lapata, 2019),
grammar correction (Bryant et al., 2017), entity recognition, dialogue generation (Budzianowski and Vulić, 2019), and more. Along with these quantitative advances has come a deeper understanding of the learned hidden representations which power transformers (Kovaleva et al., 2019; Voita et al., 2019; Clark et al., 2019; Ethayarajh, 2019). While they are arguably not 'natural,' programming languages are increasingly becoming modeling playgrounds for NLP. Since these languages by definition have a grammar, syntax, and known relationships between entities, they offer enticing opportunities for an even deeper probing of NLP models and tasks. Beyond theoretical importance, many NLP tasks have practical utility in software development environments: language modeling or generation can be used for code completion (Raychev et al., 2014; Bruch et al., 2009; Svyatkovskiy et al., 2019, 2020), translation/summarization to generate documentation or natural language summaries (Moreno et al., 2013; Scalabrino et al., 2017; Wan et al., 2018; Alon et al., 2018) or even to summarize a set of code changes (Moreno et al., 2014), translation and grammar error correction to patch and detect bugs (Zhai et al., 2019), and joint embedding of code and natural language for code search (Husain et al., 2019; Gu et al., 2018).

In this work we focus on jointly modeling both source code (Python) and concomitant natural language documentation (docstrings) with transformers, through the study of dual tasks: generating method code bodies from signatures and docstrings, and generating docstrings from signatures and method code bodies. While previous work (Allamanis et al., 2015; Yin and Neubig, 2017) has leveraged the grammar of code to extract features like the Abstract Syntax Tree for modeling (treating code and natural language as separate modalities), we follow examples like Barone and Sennrich (2017) and treat Python and its docstrings as fundamentally no different than other 'natural' languages, representing both source code and natural language docstrings as sequences of tokens sharing the same vocabulary. Here we present a multi-mode translation method resulting in PYMT5, the Python method text-to-text transfer transformer (inspired by the text-to-text transfer transformer T5 (Raffel et al., 2019)). Our single model can both learn code/language generation and understand the relationships between them.

The paper is organized as follows: we begin in sec. 2 by presenting examples of the performance of our novel multi-mode PYMT5 model, which we trained to translate between all pairs of combinations of method signatures, docstrings, and bodies which do not have the same feature in both the source and target. In sec. 2.1 we describe our training data and the pre-processing steps for source code and natural language we followed, and compare it to existing parallel docstring-method corpora like CodeSearchNet (CSN) (Husain et al., 2019) and that presented by Barone and Sennrich (2017). In sec. 2.2 we explain our BART-like (Lewis et al., 2019) pre-training scheme, demonstrating a 25x speed-up in training time for docstring generation. Next, in sec. 2.3 we analyze and classify Python docstrings, enabling style-conditioned docstring generation in PYMT5. In sections 3 and 4, we discuss PYMT5 results on method generation and docstring generation respectively and compare it to two GPT2 models, randomly initialized and pre-trained on English.

2 Multi-mode training
Figure 1 shows examples of inputs and outputs of our model PYMT5 for 3 example tasks: (top, blue) predicting a body from a method signature, (middle, red) predicting a whole method from a natural language docstring, and (bottom, green) predicting a body from a signature and docstring. Note that the comment '# target <specification>' instructs the model to choose a particular form of output. Further note that PYMT5 correctly learns to interpret natural language: it interprets 'even' as being related to '(example % 2) == 0', and 'greater than 1000' as 'number > 1000'. The model also produces syntactically correct code (as we will discuss later, we never show the model syntactically incorrect code), and correctly infers the types of 'lst' and 'numbers' to be iterables containing numbers.

Figure 1: Real examples of PYMT5 performing method generation using combinations of signatures and docstrings. A leading comment in the input sequence instructs the model to output a particular combination of features, e.g. '# target signature and body' instructs PYMT5 to predict both a signature and a body.

PYMT5 can also be prompted with source code to produce a docstring summary in various styles. Figure 2 shows the model prompted with one of the methods generated by PYMT5 in Fig. 1 (top, blue), in both a 'one line' (top, blue) style and a 'Numpydoc' (bottom, red) style. It infers the intent from the signature name and code, and even infers that the type of the argument is a list and the return type is int. It produces the same terse one-sentence summary of the function in both cases.

Figure 2: PYMT5 performing docstring generation on an example method, showing the output when the target prefix indicates a one-line (top, blue) or a Numpydoc (bottom, red) docstring style.

In order to teach PYMT5 to maximally relate the separate method features (signatures, docstrings, bodies), we trained it to translate between all pairs of feature combinations in which the same feature does not appear in both the source and target. This scheme is also advantageous as our corpus is unbalanced, with only 1/5 of methods featuring docstrings, and so the model can learn to leverage all the features whether they are present or not. Additionally, it has been shown that code is more 'predictable' than natural language (Hindle et al., 2012). If the method and argument names are a dominating signal due to their relatively rigid structure, the model may learn to ignore the content of docstrings. This multi-mode method overcomes that by training the model to generate method bodies from docstrings alone. See the appendix for a more detailed description of the multi-mode training scheme.
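To make the multi-mode setup concrete, the sketch below assembles one (source, target) training pair with a target-specification comment. The control-comment format '# target <feature> (<style>)' follows the paper's description; the helper function and variable names are illustrative, not the authors' code.

# Illustrative sketch of building one multi-mode training example.
def make_example(signature, docstring, body, source_features, target_features, style=None):
    """Build one (source, target) text pair for multi-mode training."""
    parts = {"signature": signature, "docstring": docstring, "body": body}
    # The leading comment tells the model which features to emit.
    spec = " and ".join(target_features)
    if style is not None and "docstring" in target_features:
        spec += f" ({style})"  # style imperative only when a docstring is targeted
    source = f"# target {spec}\n" + "\n".join(parts[f] for f in source_features)
    target = "\n".join(parts[f] for f in target_features)
    return source, target

# Example: predict a body from a signature and a docstring.
src, tgt = make_example(
    signature="def count_even(numbers):",
    docstring='"""Count the even numbers greater than 1000 in a list."""',
    body="    return sum(1 for n in numbers if n % 2 == 0 and n > 1000)",
    source_features=["signature", "docstring"],
    target_features=["body"],
)

The same helper covers docstring generation by swapping the feature lists, e.g. source_features=["signature", "body"] and target_features=["docstring"] with style="numpydoc".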
2.1 Dataset

Our data consists of 118k GitHub repositories, which includes all public repositories labelled as containing primarily Python source code, featuring at least 10 stars, and which have had a commit in the past 5 years. We successfully cloned 112k of these repositories, extracting 5.3 million Python files from the default HEAD state of each repository. We then removed literal duplicate files, resulting in 2.3 million unique files, but did not remove finer-grained clones. After removing licenses from the files, the literal contents were used in the pre-training step, comprising about 27GB of raw text.

In order to extract method-level information for fine-tuning, we used the Python 3.7 standard library ast to produce the file-level Abstract Syntax Tree (AST) for each Python file, extracting every individual and class method. For each file which failed to parse, we used 2to3 and autopep8 to overcome the issue of different styles and whitespace or tab conventions, successfully parsing 97.3% of the 2.3 million unique Python files. We used the Python module astunparse to take the AST for each method and unparse it back into source code, so that our fine-tuned model was never trained on syntactically incorrect code. The statistics of our method-docstring corpus are summarized in Table 1. Our parallel method-docstring corpus is twice as large as the next largest irrespective of language and over 15x as large as the next largest Python parallel corpus, both in CSN.

For each method, we ignored comments, as they generally represent trivia and are not part of the normal language syntax. We cleaned the docstrings by removing non-ASCII characters, normalizing Unicode, and replacing commit hashes, file paths, and URLs with placeholder tokens. In all studies here, we randomly split the files at the repository level (to prevent data leakage), with 90% for training, 5% for validation, and 5% for a test set.
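A minimal sketch of the extraction and cleaning steps described above, assuming the third-party astunparse package; it is illustrative rather than the authors' pipeline (which additionally falls back to 2to3 and autopep8 for files that fail to parse), and the placeholder token names and regular expressions are assumptions.

import ast
import re
import unicodedata
import astunparse  # third-party; on Python 3.9+, ast.unparse is an alternative

def extract_methods(source: str):
    """Yield (name, docstring, code) for every function or method in a file."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node) or ""
            # Unparsing the AST back to source normalizes whitespace and style,
            # so the fine-tuned model never sees syntactically invalid code.
            yield node.name, docstring, astunparse.unparse(node)

def clean_docstring(doc: str) -> str:
    """Normalize Unicode, drop non-ASCII, and replace volatile literals."""
    doc = unicodedata.normalize("NFKD", doc).encode("ascii", "ignore").decode()
    doc = re.sub(r"\b[0-9a-f]{7,40}\b", "<HASH>", doc)   # commit hashes
    doc = re.sub(r"https?://\S+", "<URL>", doc)          # URLs
    doc = re.sub(r"(/[\w.\-]+){2,}", "<PATH>", doc)      # file paths
    return doc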
Dataset                        Methods      w/ docstring   Languages
PYMT5                          2.6 x 10^7   7.7 x 10^6     Python
CSN (Husain et al., 2019)      6.4 x 10^6   2.3 x 10^6     Python, et al.
Ciurumelea et al. (2020)       1.6 x 10^5   1.6 x 10^5     Python
Barone and Sennrich (2017)     1.6 x 10^5   1.5 x 10^5     Python

Table 1: Summary statistics of our Python parallel corpus compared to others presented in the literature. CSN contains 500k Python methods with docstrings, among 6 other languages. Our parallel corpus is 3x as large as the next largest, and over 15x the size of the next largest Python parallel corpus.

2.2 Pre-training

The majority of our Python methods (over 20 million) do not possess docstrings. This imbalance is, in fact, an opportunity in light of the recent trend in NLP: unsupervised pre-training of language models on vast amounts of raw text (Devlin et al., 2018). Using these pre-trained models as starting points for downstream tasks (like classification, translation, summarization, and question answering) consistently yields state-of-the-art results (Lewis et al., 2019; Raffel et al., 2019).

Following this trend, we use a span-masking objective similar to that used by the recent text-to-text transfer transformer (T5) (Raffel et al., 2019). As shown in Figure 3, after tokenizing the inputs, we sample a random subset of the token spans up to length 3 to be replaced with, e.g., a [MASK0] token, and then teach the sequence-to-sequence model to replace the missing tokens. The training target is comprised of numbered mask tokens followed by the tokens that each mask represents.

Figure 3: Denoising auto-encoder pre-training for sequence-to-sequence tasks, based on the span-masking objective used by T5 (Raffel et al., 2019). Python files are first tokenized with spaces replaced by the character Ġ, which is 256 above the space character in ordinal value (similarly for newlines, tabs, etc.). Note that indentation is a token of multiple Ġ's. We replace random sub-sequences of tokens with numbered masks, and train the model to return each mask followed by the tokens it replaced.
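A toy version of this span-masking noising step is sketched below. The length-3 span cap follows the text above; the masking rate, sampling details, and mask-token syntax are assumptions for illustration only.

import random

def span_mask(tokens, mask_prob=0.15, max_span=3):
    """Replace random token spans with numbered masks; return (source, target)."""
    source, target, i, mask_id = [], [], 0, 0
    while i < len(tokens):
        if random.random() < mask_prob:
            span = random.randint(1, max_span)
            mask = f"[MASK{mask_id}]"
            source.append(mask)
            # The target lists each mask followed by the tokens it replaced.
            target.append(mask)
            target.extend(tokens[i:i + span])
            i += span
            mask_id += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target

tokens = "def Ġis_even ( Ġn ) : ĠĠĠĠ return Ġn % 2 == 0".split()
src, tgt = span_mask(tokens)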
The architecture of PYMT5 is an encoder-decoder transformer with a vocabulary of 50181 (a byte-pair encoding (BPE) vocabulary trained on raw Python files), 6 self-attention layers in each of the encoder and decoder, and a hidden dimension of 1472, totaling 374 million parameters. All the experiments in this paper, including the GPT2 baselines, were done using this same extended GPT tokenizer. We pre-trained PYMT5 on 27GB of raw source code in total, for 3 weeks on sixteen 32GB Tesla V100 GPUs, or 73 epochs total. When training on docstring generation alone, we observed 25x faster convergence to a lower loss when starting with this pre-trained model as compared to a random initialization. See the appendix for details. In all experiments PYMT5 is trained starting with this pre-trained model.

2.3 Docstring analysis

When examining docstring samples from our corpus, one of the most salient features is the different styles of documentation. The Python community has no prescribed or de facto style for docstrings, but Python enhancement proposal 257 (PEP 257) (Goodger and van Rossum, 2001) does describe one-line and multi-line docstrings, and mandates indentation as well. Most modern large-scale projects utilize docstring styles which are parseable, allowing the automatic creation and synchronization of source code and documentation websites; see, e.g., sphinx. Therefore, a number of standard styles have evolved in the community. The currently dominant parseable docstring
styles (and the ones supported by sphinx) are reStructuredText (reST) (Jones, 2013), the official Google style (Google, 2020), Numpy style (which also technically satisfies reST) (Maintainers, 2020), and Javadoc style (jav, 2011). The difference between each style is mainly in the syntax of denoting sections (if they exist) and the name/type/description annotation of the method arguments and returned/yielded quantities (if they exist). We defined, in addition to these styles, one-line (containing only one line), one-paragraph (containing no empty lines), and 'other' to label any docstring not described so far, which includes informal user docstring styles and a few project-specific styles like the SAGE mathematics toolkit library.

Table 2 shows the breakdown of the fraction of each of these styles in our corpus. The plurality of docstrings (44%) are one-line. The next most common style is one-paragraph at 14%. The next four most-common styles are the machine parseable styles discussed above, comprising 26.2% of the total number of docstrings. The appendix contains detailed distributions of method signature, docstring, and method body character and line lengths.
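Labels like those used above can be assigned with simple surface heuristics. The regular expressions below are an illustrative sketch, not the authors' exact classifier, and the style names returned are the ones defined in this section.

import re

def docstring_style(doc: str) -> str:
    """Heuristically label a docstring with one of the styles discussed above."""
    if re.search(r"^\s*Parameters\s*\n\s*-{4,}", doc, re.MULTILINE):
        return "numpy"
    if re.search(r"^\s*(Args|Returns|Raises):\s*$", doc, re.MULTILINE):
        return "google"
    if re.search(r":param\s|:returns?:|:rtype:", doc):
        return "rest"
    if re.search(r"@param\s|@return\s", doc):
        return "javadoc"
    if "\n" not in doc.strip():
        return "one line"
    if "\n\n" not in doc.strip():
        return "one paragraph"
    return "other"

print(docstring_style("Add two numbers."))                  # one line
print(docstring_style(":param a: first\n:returns: sum"))    # rest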

Style           Fraction of methods
One line        44%
One paragraph   14%
reST            13%
Google          7.3%
Numpy           4.8%
Javadoc         1.6%
Other           15%

Table 2: Docstring style statistics from 7.7 million Python docstrings.

To visualize the space of these styles, we used fastText vector embeddings of the docstrings, obtaining 100-dimensional continuous vector representations of each. We then used PCA to reduce the dimensionality to 50 and applied t-distributed stochastic neighbor embedding (t-SNE) to obtain a two-dimensional visualization. Figure 4 shows 1/10th of our corpus (770k docstrings) embedded, colored by docstring style as defined above. We can see clear clustering of styles, indicating that similar docstrings use the same style (for the parseable styles). There is also a natural dichotomy between parseable and non-parseable styles: the left side is dominated by 'one line,' 'one paragraph,' and 'other' styles, and the four parseable styles are largely on the right side. This observation can be used to generate documentation consistent with the style of a given project, or it could be used to translate methods into more informal descriptions useful for search indices.

Figure 4: Visualization of continuous embeddings of 1/10th of our docstring corpus (770k docstrings), colored by docstring style. Embeddings were obtained using fastText, and the two-dimensional embedding was obtained via PCA (for dimensionality reduction and initialization) and t-SNE.
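The dimensionality-reduction pipeline behind Figure 4 can be sketched as follows. The 100-dimensional fastText sentence vectors are assumed to be precomputed (random data stands in for them here), and scikit-learn is an assumption about tooling, not necessarily what the authors used.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# docstring_vectors: (n_docstrings, 100) array of fastText embeddings, assumed given.
docstring_vectors = np.random.rand(1000, 100)

reduced = PCA(n_components=50).fit_transform(docstring_vectors)          # PCA to 50 dims
embedding_2d = TSNE(n_components=2, init="pca").fit_transform(reduced)   # t-SNE to 2D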
Model                Ppl    BLEU   Syntax   Stat.   R1     R2     RL
GPT2-med (random)    2.36   5.60   85%      Prec.   25.8   12.3   26.8
                                            Rec.    26.7   12.1   25.9
                                            F1      21.8   10.6   22.5
GPT2-med (English)   2.09   5.63   86%      Prec.   25.4   12.1   26.3
                                            Rec.    26.9   12.2   26.1
                                            F1      21.7   10.6   22.5
PYMT5                2.36   10.6   93.6%    Prec.   33.8   21.5   33.6
                                            Rec.    44.1   25.0   43.8
                                            F1      35.1   21.5   32.2
CSN test:
GPT2-med (random)    -      2.8    77.2%    Prec.   32.3   11.8   33.7
                                            Rec.    19.6   7.0    19.3
                                            F1      20.9   7.6    21.9
PYMT5                -      8.59   92.1%    Prec.   25.6   12.5   25.3
                                            Rec.    40.2   18.3   39.6
                                            F1      28.4   13.5   24.8
Barone and Sennrich (2017) test:
PYMT5                -      20.2   91.1%    Prec.   41.3   28.5   40.7
                                            Rec.    52.2   34.7   51.3
                                            F1      43.2   29.8   39.7
Barone et al.        -      10.9   -        -       -      -      -

Table 3: Comparing 3 models (GPT2 with a random weight initialization, GPT2 pre-trained on English, and PYMT5) on the task of method generation from a signature and natural language docstring. The first three rows use our test set consisting of 1,285,794 methods. The fourth and fifth rows compare the performance of PYMT5 and GPT2-medium on the CodeSearchNet Python test set. The final rows compare the performance of PYMT5 on the parallel corpus test set of Barone and Sennrich (2017). Syntax is the fraction of predicted methods which had correct syntax using the Python 3.7 grammar.
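The Syntax, BLEU, and ROUGE columns reported here and in Table 4 can be reproduced in spirit with a short evaluation loop. The sketch below uses ast, sacrebleu, and rouge-score; these packages are assumptions about tooling, since the paper does not name its exact evaluation libraries or settings.

import ast
import sacrebleu
from rouge_score import rouge_scorer

def evaluate(predictions, references):
    """Compute the syntax-validity rate, corpus BLEU, and mean ROUGE-L F1."""
    ok = 0
    for code in predictions:
        try:
            ast.parse(code)   # counts as syntactically correct if it parses
            ok += 1
        except SyntaxError:
            pass
    syntax = ok / len(predictions)

    bleu = sacrebleu.corpus_bleu(predictions, [references]).score

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    rouge_l = sum(scorer.score(r, p)["rougeL"].fmeasure
                  for r, p in zip(references, predictions)) / len(predictions)
    return syntax, bleu, rouge_l

preds = ["def f(x):\n    return x + 1", "def g(:"]
refs = ["def f(x):\n    return x + 1", "def g(y):\n    return y"]
print(evaluate(preds, refs))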

3 Method generation

Now we turn our attention to method generation: predicting a whole method code body from either a method signature, a natural language docstring, or both. We first discuss a benchmark of this task using a GPT2-medium model (345 million parameters, see the appendix for details), trained from scratch and starting from the publicly released OpenAI English pre-trained checkpoint, with weights from HuggingFace (Wolf et al., 2019). In all experiments we used an extended GPT2 tokenizer, including white-space (one tab, two tabs, etc.) tokens, for a total vocabulary size of 50337, and we used beam decoding with a beam width of 5.
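A sketch of this baseline setup using the HuggingFace transformers library is shown below. It is an assumption about tooling (the paper's baselines were trained with fairseq), and the exact set of added whitespace tokens is illustrative.

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
# Extend the vocabulary with whitespace tokens (one tab, two tabs, ...);
# the specific tokens added here are an assumption.
tokenizer.add_tokens(["\t", "\t\t", "\t\t\t", "        "])

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.resize_token_embeddings(len(tokenizer))

prompt = 'def is_even(n):\n    """Return True if n is even."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=64,
                         pad_token_id=tokenizer.eos_token_id)  # beam width 5
print(tokenizer.decode(outputs[0]))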
The third row of tab. 3 shows that PYMT5 has more than double the BLEU score, overall better recall, and significantly better ROUGE-2 and ROUGE-L F-scores than our GPT2 baselines. Further, 93.6% of the methods generated by PYMT5 were syntactically correct Python 3.7, whereas only 86% of GPT2 methods were syntactically correct. PYMT5 was trained on 16 Tesla V100 16GB GPUs for 62 epochs, or 5 weeks of training time (see the appendix for its hyper-parameters), and the GPT2 baselines were trained on the same hardware for 1 week of training time (achieving the same or better validation loss/perplexity as PYMT5).

The English pre-trained initialization of GPT2 only slightly beats the random initialization of GPT2, which could indicate that the learned biases of English are not particularly beneficial for writing Python code; the metrics are almost all within our margin of error. Note that Barone and Sennrich (2017) also modeled methods from docstrings, obtaining a similar BLEU score of 10.9 on their own Python parallel corpus. On the Barone et al. test set, PYMT5 obtains nearly double this score at 20.2; such a large discrepancy could be explained by data leaking from their test set
into our training set. Barone's test set is also 200x smaller than ours and may not be a representative sample of the whole Python code domain.

The third and fourth rows of tab. 3 show the performance of PYMT5 on the publicly available CSN Python test set, from which we find notably worse results than on our own test set. CSN curated their whole set by removing any methods with 'test' in the name and any methods with fewer than 3 lines of code. We calculated the performance of PYMT5 only on a subset of our test set curated the same way as CSN, observing F-scores for R1, R2, and R-L on our test set of 29.7, 17.2, and 26.1, which is lower than our nominal test set performance of 35.1, 21.5, and 32.2 and closer to the CSN performance of 28.4, 13.5, and 24.8. We believe this curating choice explains the difference between our test set and the CSN test set. We also conclude that tests and short methods are 'easier' to complete, which is plausible, and bodes well for automatic code completion applications.

Model                Ppl    BLEU   Stat.   R1     R2     RL
GPT2-med (random)    2.36   19.4   P       32.6   19.3   33.6
                                   R       36.2   19.4   34.7
                                   F1      31.0   18.2   31.6
GPT2-med (English)   2.15   19.6   P       33.1   19.4   33.9
                                   R       36.4   19.5   34.8
                                   F1      31.4   18.3   31.8
PYMT5                3.74   25.2   P       42.1   23.7   41.3
                                   R       50.4   27.0   49.3
                                   F1      43.3   24.4   39.8
CSN test:
GPT2-med (random)    -      9.5    P       30.6   13.3   31.4
                                   R       31.1   12.9   29.8
                                   F1      26.3   11.5   27.2
PYMT5                -      16.3   P       38.0   19.2   36.8
                                   R       52.7   24.5   51.0
                                   F1      41.3   20.4   36.7
Barone test:
PYMT5                -      17.4   P       39.6   26.0   38.7
                                   R       53.6   33.7   52.1
                                   F1      43.1   27.8   39.1
Barone et al.        -      13.84  -       -      -      -

Table 4: Comparing 3 models (GPT2 with a random weight initialization, GPT2 pre-trained on English, and PYMT5) on the task of natural language docstring generation from a signature and method body. The first three rows are evaluated on our test set of 383,695 methods. The fourth and fifth rows show the performance of PYMT5 and GPT2-medium on the CSN Python test set, and the last two rows compare our model to Barone et al. on their test set.

4 Docstring Generation

We now examine results from the docstring generation task, which for evaluation purposes was conditioned on both signatures and method bodies. As in method generation, we set a GPT2 benchmark with random initialization and pre-trained English initialization as well as the same hyperparameters. Table 4 shows that the ROUGE scores of the GPT2 baselines are within the margin of error, a somewhat surprising result given the English domain of docstrings. The third row shows PYMT5 to be superior to GPT2-medium in terms of BLEU and all of the ROUGE metrics.

We again present the results from the publicly available CSN test set. Similar to the method generation task, PYMT5 performs worse on the CSN data than on our own, likely for the same reasons we discussed in sec. 3. We also evaluated PYMT5 on the Barone et al. parallel test set, as shown in the second to last row of tab. 4, and find PYMT5 performs notably worse on Barone's test set than on our own test set, contradicting the hypothesis that our doubling of the method generation BLEU score is due to data leakage. PYMT5 has a much higher BLEU score than that reported by Barone et al., perhaps indicating real progress in the code summarization field.

Docstring generation is similar to code summarization, though the domains are different, as docstrings also contain structured annotations of arguments, return values, raised exceptions, and even in-line unit tests (doctest). TranS3 (Wang et al., 2020) reports a best ROUGE-L of 51.27 on the same test set for code summarization, but does not specify
which statistic they are reporting, so we cannot make strong conclusions about the performance of PYMT5 compared to the state of the art.

5 Conclusion

In this work, we presented a novel multi-mode Python method text-to-text transfer transformer model, PYMT5, as well as the largest parallel corpus of Python source code and docstrings reported in the literature to date. We have trained PYMT5 to translate between all pairs of combinations of method signatures, docstrings, and method bodies which do not have the same feature in both the source and target. Further, we introduced control token prefixes for docstring generation to facilitate docstring generation of various styles. Focusing on two modeling tasks, predicting Python methods from docstrings and summarizing Python source code methods into docstrings of various commonly occurring styles, we have compared this new approach to auto-regressive GPT2 baselines trained on individual docstring or method generation tasks. On the CodeSearchNet test set PYMT5 achieves a BLEU score of 8.59 for method generation and 16.3 for docstring generation, and a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation. We have demonstrated the effectiveness of dynamic masked pre-training, reducing docstring generation training time by 25x. Looking forward, we plan to leverage PYMT5 for various downstream automated software engineering tasks, including code documentation and method generation from natural language statements, and to develop further model evaluation criteria that leverage the unique properties of source code.

Acknowledgements

We would like to thank the Microsoft Cloud and AI SmartML engineering team for help in preparing the data, Shao Kun Deng for the development of compelling user experiences leveraging PYMT5, and Christian Bird for useful discussions.

References

2011. Java doc. Technical report.

Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 2123-2132. JMLR.org.

Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400.

Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275.

Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from examples to improve code completion systems. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 213-222.

Christopher Bryant, Mariano Felice, and Edward Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. Association for Computational Linguistics.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.
Adelina Ciurumelea, Sebastian Proksch, and Harald Gall. 2020. Suggesting comment completions for Python using neural language models. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512.

David Goodger and Guido van Rossum. 2001. Docstring conventions. PEP 257.

Google. 2020. Google Python style guide. Technical report.

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, pages 933-944, New York, NY, USA. Association for Computing Machinery.

Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE), pages 837-847. IEEE.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.

Richard Jones. 2013. A reStructuredText primer. docutils.sourceforge.net, March.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. arXiv preprint arXiv:1908.08593.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.

Numpydoc Maintainers. 2020. Numpydoc docstring guide. Technical report.

Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K. Vijay-Shanker. 2013. Automatic generation of natural language summaries for Java classes. In 2013 21st International Conference on Program Comprehension (ICPC), pages 23-32. IEEE.

Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrian Marcus, and Gerardo Canfora. 2014. Automatic generation of release notes. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 484-495.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 419-428.

Simone Scalabrino, Gabriele Bavota, Christopher Vendome, Mario Linares-Vásquez, Denys
Poshyvanyk, and Rocco Oliveto. 2017. Automatically assessing code understandability: How far are we? In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 417-427. IEEE.

Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code generation using transformer. arXiv preprint arXiv:2005.08025.

Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. 2019. Pythia: AI-assisted code completion system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2727-2735.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380.

Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 397-407.

Wenhua Wang, Yuqun Zhang, Zhengran Zeng, and Guandong Xu. 2020. TranS^3: A transformer-based framework for unifying code summarization and code search. arXiv preprint arXiv:2003.03238.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440-450, Vancouver, Canada. Association for Computational Linguistics.

Juan Zhai, Xiangzhe Xu, Yu Shi, Minxue Pan, Shiqing Ma, Lei Xu, Weifeng Zhang, Lin Tan, and Xiangyu Zhang. 2019. CPC: Automatically classifying and propagating natural language comments via program analysis.

A Appendix

A.1 Docstring statistics

Figure 5 shows the distributions of various features of docstrings in our corpus. The top row is the distribution of the total character-level length of the method signatures (left), docstrings (center), and code bodies (right). The blue lines are for methods possessing a docstring, and we can see that the vast majority of these methods have docstrings with more than 10 characters. The bottom row shows the distribution of line lengths of the concomitant features from the top row. While the most common line length of docstrings is 1 (comprising 41%), the vast majority of docstrings have multiple lines.

A.2 Pre-training details

Figure 7 is the complete training script, using the Facebook AI Research Sequence (fairseq) modeling library, with which we pre-trained PYMT5. The data was pre-noised and processed using the fairseq-preprocess command and placed in the directory indicated by $DIR. The architecture and training hyper-parameters are set in this script. PYMT5 was trained with the same hyperparameters, but with the data described in sec. A.4.

Figure 6 shows learning curves of a single seq2seq model of the same architecture as PYMT5 trained only on docstrings, starting from a random initialization and starting from our pre-trained model. As the figure shows, the
pre-trained initialization converged to a better validation loss 25x faster than the randomly initialized model.

Figure 5: Histogram of the number of characters (top row) in the Python signatures (left), docstrings (middle), and method bodies (right). The blue lines are for methods with docstrings, the yellow lines are for methods without docstrings. The vast majority of docstrings have more than 10 characters. The bottom row shows histograms of the number of lines for the same features described in the top row.

Figure 6: Learning curves for training a sequence-to-sequence transformer, translating from Python method definitions to their docstrings. Blue curves represent the training and validation loss, and show that convergence (validation loss stops decreasing) occurs after 3.97 x 10^5 steps or 183 epochs. The optimization of the pre-trained model with identical hyperparameters reaches and beats the best validation loss at 1.5 x 10^4 steps or 7 epochs.

A.3 GPT2 training details

Our GPT2 experiments also used the fairseq library, with the OpenAI English checkpoint supplied by the HuggingFace library. Figure 8 shows the complete training script, where for the English pre-trained initialization a pre-trained checkpoint was provided. Each model was trained on 4 Tesla V100 GPUs with 16GB of memory each, for 7 days.

A.4 Multi-mode training details

In order to better teach PYMT5 to understand the relationships between all the different features of code (signatures, docstrings, and bodies), we taught it to translate between
all pairs of combinations of these features which do not contain the same feature in both the source and target. In this way, the model can learn to produce method bodies using both signatures and docstrings, or one or the other. Table 5 spells out exactly which combinations were provided to the model as a source and target. For each source example the comment string '# target <feature> (<style>)' was added, instructing the model which feature combination to produce (e.g. signature and body). A style imperative was added only if a docstring was in the target, where the styles are defined and discussed in the main text.

Figure 7: The fairseq-train script used to pre-train PYMT5, setting all the relevant hyper-parameters.

Figure 8: The fairseq-train script we used to train our GPT model baselines.

Figure 9 shows the training curves for PYMT5, where the solid black line is the training loss, and all the other curves are the validation loss for each of the tasks indicated in tab. 5. The dashed lines indicate tasks where docstrings are present in the target, showing that these are generally less predictable than code-only targets (as the validation loss is larger). PYMT5 was trained on 16 Tesla V100 16GB GPUs for 62 epochs, or 5 weeks of training time.

Target \ Source   Signature   Docstring   Body   Sig + doc   Sig + body   Doc + body
Signature         -           ✓           ✓      -           -            ✓
Docstring         ✓           -           ✓      -           ✓            -
Body              ✓           ✓           -      ✓           -            -
Sig + doc         -           -           ✓      -           -            -
Sig + body        -           ✓           -      -           -            -
Doc + body        ✓           -           -      -           -            -

Table 5: All possible translation possibilities between the 3 features of a function: the signature (sig), docstring (doc), and body. We train our model to translate between the sources and targets indicated with a ✓, which were chosen as all pairs of feature combinations which do not contain the same feature in both the source and target. The system is then instructed to target code bodies when performing function completion.
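The rule generating the checked cells of Table 5 (keep every source/target pair of feature combinations that share no feature) can be written down directly; the small sketch below is illustrative and simply enumerates the same 12 translation modes.

from itertools import combinations

features = ("signature", "docstring", "body")
combos = [frozenset(c) for r in (1, 2) for c in combinations(features, r)]

# Keep every (source, target) pair of feature combinations that share no feature.
pairs = [(src, tgt) for src in combos for tgt in combos if not src & tgt]

for src, tgt in pairs:
    print("+".join(sorted(src)), "->", "+".join(sorted(tgt)))
print(len(pairs), "translation modes")  # 12, matching the ✓ cells in Table 5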
Figure 9: Learning curve for the multi-mode training, where the black line is the training loss, and the other lines are the validation loss for each mode of translation. Dashed lines indicate the docstrings are in the target, solid lines have only code in the target.
