PYMT5: Multi-Mode Translation of Natural Language and Python Code with Transformers
Neel Sundaresan
Microsoft Cloud and AI
[email protected]
…ata, 2019), grammar correction (Bryant et al., 2017), entity recognition, dialogue generation (Budzianowski and Vulić, 2019), and more. Along with these quantitative advances has come a deeper understanding of the learned hidden representations which power transformers (Kovaleva et al., 2019; Voita et al., 2019; Clark et al., 2019; Ethayarajh, 2019). While they are arguably not 'natural,' programming languages are increasingly becoming modeling playgrounds for NLP. Since these languages by definition have a grammar, syntax, and known relationships between entities, they offer enticing opportunities for an even deeper probing of NLP models and tasks. Beyond theoretical importance, many NLP tasks have practical utility in software development environments: language modeling or generation can be used for code completion (Raychev et al., 2014; Bruch et al., 2009; Svyatkovskiy et al., 2019, 2020), translation/summarization to generate documentation or natural language summaries (Moreno et al., 2013; Scalabrino et al., 2017; Wan et al., 2018; Alon et al., 2018) or even to summarize a set of code changes (Moreno et al., 2014), translation and grammar error correction to patch and detect bugs (Zhai et al., 2019), and joint embedding of code and natural language for code search (Husain et al., 2019; Gu et al., 2018).

In this work we focus on jointly modeling both source code (Python) and its concomitant natural language documentation (docstrings) with transformers, through the study of two dual tasks: generating method code bodies from signatures and docstrings, and generating docstrings from signatures and method code bodies. While previous work (Allamanis et al., 2015; Yin and Neubig, 2017) has leveraged the grammar of code to extract features like the Abstract Syntax Tree for modeling (treating code and natural language as separate modalities), we follow examples like Barone and Sennrich (2017) and treat Python and its docstrings as fundamentally no different from other 'natural' languages, representing both source code and natural language docstrings as sequences of tokens sharing the same vocabulary. Here we present a multi-mode translation method resulting in PYMT5, the Python method text-to-text transfer transformer (inspired by the text-to-text transfer transformer T5 (Raffel et al., 2019)). Our single model can both learn code/language generation and understand the relationships between them.

The paper is organized as follows: we begin in sec. 2 by presenting examples of the performance of our novel multi-mode PYMT5—the Python method text-to-text transfer transformer model—which we trained to translate between all pairs of combinations of method signatures, docstrings, and bodies which do not have the same feature in both the source and target. In sec. 2.1 we describe our training data and the pre-processing steps for source code and natural language we followed, and compare it to existing parallel docstring-method corpora like CodeSearchNet (CSN) (Husain et al., 2019) and that presented by Barone and Sennrich (2017). In sec. 2.2 we explain our BART-like (Lewis et al., 2019) pre-training scheme, demonstrating a 25× speed-up in training time for docstring generation. Next, in sec. 2.3 we analyze and classify Python docstrings, enabling style-conditioned docstring generation in PYMT5. In sections 3 and 4, we discuss PYMT5 results on method generation and docstring generation respectively, comparing it to two GPT2 models, one randomly initialized and one pre-trained on English.

2 Multi-mode training

Figure 1 shows examples of inputs and outputs of our model PYMT5 for 3 example tasks: (top, blue) predicting a body from a method …
Figure 1: Real examples of PYMT5 performing method generation using combinations of signatures and docstrings. A leading comment in the input sequence instructs the model to output a particular combination of features, e.g. '# target signature and body' instructs PYMT5 to predict both a signature and body.
Figure 2: PYMT5 performing docstring generation on an example method, showing the output when the target prefix indicates the one-line (top, blue) and Numpydoc (bottom, red) docstring styles.
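To make the control mechanism shown in Figures 1 and 2 concrete, the following is a minimal sketch of how a source sequence with such a target-prefix comment could be assembled. The helper name and the exact spelling of the prefix beyond the examples in the figures are our own illustration, not necessarily the format used to train PYMT5.

    def build_source(signature=None, docstring=None, body=None,
                     target_features="body", style=None):
        """Assemble a source sequence with a leading target-prefix comment.

        A comment such as '# target body', or '# target docstring (numpydoc)'
        when a docstring style is requested, tells the model which feature
        combination to generate.
        """
        prefix = "# target " + target_features
        if style is not None:
            prefix += " (" + style + ")"
        # Concatenate whichever features are present in the source example.
        parts = [prefix] + [p for p in (signature, docstring, body) if p]
        return "\n".join(parts)

    # Example: ask for a method body given a signature and a docstring.
    print(build_source(
        signature="def count_even(numbers):",
        docstring='"""Return the number of even items in numbers."""',
        target_features="body",
    ))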
…between all pairs of feature combinations in which the same feature does not appear in both the source and target. This scheme is also advantageous because our corpus is unbalanced, with only 1/5 of methods featuring docstrings, and so the model can learn to leverage all the features whether or not they are present. Additionally, it has been shown that code is more 'predictable' than natural language (Hindle et al., 2012). If the method and argument names are a dominating signal due to their relatively rigid structure, the model may learn to ignore the content of docstrings. This multi-mode method overcomes that by training the model to generate method bodies from docstrings alone. See the appendix for a more detailed description of the multi-mode training scheme.

2.1 Dataset

Our data consists of 118k GitHub repositories, which includes all public repositories labelled as containing primarily Python source code, featuring at least 10 stars, and which have had a commit in the past 5 years. We successfully cloned 112k of these repositories, extracting 5.3 million Python files from the default HEAD state of each repository. We then removed literal duplicate files, resulting in 2.3 million unique files, but did not remove finer-grained clones. After removing licenses from the files, the literal contents were used in the pre-training step, comprising about 27GB of raw text.

In order to extract method-level information for fine-tuning, we used the Python 3.7 standard library ast to produce the file-level Abstract Syntax Tree (AST) for each Python file, extracting every individual and class method. For each file which failed to parse, we used 2to3 and autopep8 to overcome the issue of different styles and whitespace or tab conventions, successfully parsing 97.3% of the 2.3 million unique Python files. We used the Python module astunparse to take the AST for each method and unparse it back into source code, so that our fine-tuned model was never trained on syntactically incorrect code. The statistics of our method-docstring corpus are summarized in Table 1. Our parallel method-docstring corpus is twice as large as the next largest irrespective of language and over 15× as large as the next largest Python parallel corpus, both in CSN.

For each method, we ignored comments, as they generally represent trivia and are not part of the normal language syntax. We cleaned the docstrings by removing non-ASCII characters, normalizing Unicode, and replacing commit hashes, file paths, and URLs with placeholder tokens. In all studies here, we randomly split the files at the repository level (to prevent data leakage) with 90% for training, 5% for validation, and 5% for a test set.

2.2 Pre-training

The majority of our Python methods—over 20 million methods—do not possess docstrings. This imbalance is, in fact, an opportunity in light of the recent trend in NLP: unsupervised pre-training of language models on vast amounts of raw text (Devlin et al., 2018). Using these pre-trained models as starting points for downstream tasks—like classification, translation, summarization, and question answering—consistently yields state-of-the-art results (Lewis et al., 2019; Raffel et al., 2019).

Following this trend, we use a span-masking objective similar to that used by the recent text-to-text transfer transformer (T5) (Raffel et al., 2019). As shown in Figure 3, after tokenizing the inputs, we sample a random subset of token spans of up to length 3 to be replaced with, e.g., a [MASK0] token, and then teach the sequence-to-sequence model to replace the missing tokens. The training target is composed of each numbered mask followed by the tokens it replaced (see Figure 3).
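As a rough, self-contained illustration of this noising scheme (a sketch only: the real pipeline operates on subword-tokenized files via fairseq, and the span-sampling details here are simplified):

    import random

    def span_mask(tokens, mask_prob=0.15, max_span=3, seed=0):
        """Replace random spans of up to `max_span` tokens with numbered
        [MASKi] sentinels; the target lists each sentinel followed by the
        tokens it replaced, as in Figure 3."""
        rng = random.Random(seed)
        source, target, mask_id, pos = [], [], 0, 0
        while pos < len(tokens):
            if rng.random() < mask_prob:
                span = rng.randint(1, max_span)
                sentinel = "[MASK%d]" % mask_id
                source.append(sentinel)
                target.append(sentinel)
                target.extend(tokens[pos:pos + span])
                pos += span
                mask_id += 1
            else:
                source.append(tokens[pos])
                pos += 1
        return source, target

    src, tgt = span_mask("def foo ( x ) : return x + 1".split())
    print(src)
    print(tgt)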
Dataset                      | Methods    | w/ docstring | Languages
PYMT5                        | 2.6 × 10^7 | 7.7 × 10^6   | Python
CSN (Husain et al., 2019)    | 6.4 × 10^6 | 2.3 × 10^6   | Python, et al.
Ciurumelea et al. (2020)     | 1.6 × 10^5 | 1.6 × 10^5   | Python
Barone and Sennrich (2017)   | 1.6 × 10^5 | 1.5 × 10^5   | Python

Table 1: Summary statistics of our Python parallel corpus compared to others presented in the literature. CSN contains 500k Python methods with docstrings, among 6 other languages. Our parallel corpus is 3× as large as the next largest, and over 15× the size of the next largest Python parallel corpus.
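A simplified sketch of the extraction pipeline behind Table 1, using the standard library ast and the astunparse package mentioned above (the 2to3/autopep8 fallback, the signature/docstring/body split, and the docstring cleaning steps are omitted; the function name is ours):

    import ast
    import astunparse  # third-party package; pip install astunparse

    def extract_methods(file_source):
        """Yield (name, docstring, code) for every function or method in a file.

        Each node is unparsed back to source with astunparse, so the corpus
        never contains syntactically incorrect code.
        """
        tree = ast.parse(file_source)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                docstring = ast.get_docstring(node) or ""
                code = astunparse.unparse(node)
                yield node.name, docstring, code

    example = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
    for name, doc, code in extract_methods(example):
        print(name, repr(doc))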
Figure 3: Denoising auto-encoder pre-training for sequence-to-sequence tasks, based on the span-masking objective used by T5 (Raffel et al., 2019). Python files are first tokenized with spaces replaced by the character Ġ, which is 256 above the space character in ordinal value (similarly for newlines, tabs, etc.). Note that indentation is a token of multiple Ġ's. We replace random sub-sequences of tokens with numbered masks, and train the model to return each mask followed by the tokens it replaced.
…styles (and the ones supported by sphinx) are reStructuredText (reST) (Jones, 2013), the official Google style (Google, 2020), Numpy style (which also technically satisfies reST) (Maintainers, 2020), and Javadoc style (jav, 2011). The difference between these styles is mainly in the syntax for denoting sections (if they exist) and in the name/type/description annotations of the method arguments and returned/yielded quantities (if they exist). In addition to these styles, we defined one-line (containing only one line), one-paragraph (containing no empty lines), and 'other' to label any docstring not described so far, which includes informal user docstring styles and a few project-specific styles like that of the SAGE mathematics toolkit library.

Table 2 shows the breakdown of the fraction of each of these styles in our corpus. The plurality of docstrings (44%) are one-line. The next most common style is one-paragraph, at 14%. The next four most common styles are the machine-parseable styles discussed above, together comprising 26.2% of all docstrings. The appendix contains detailed distributions of method signature, docstring, and method body character and line lengths.

…applied the t-distributed stochastic neighbor embedding (t-SNE) to obtain a two-dimensional visualization. Figure 4 shows 1/10th of our corpus (700k docstrings) embedded, colored by docstring style as defined above. We can see clear clustering of styles, indicating that similar docstrings use the same style (for the parseable styles). There is also a natural dichotomy between parseable and non-parseable styles: the left side is dominated by 'one line,' 'one paragraph,' and 'other' styles, and the four parseable styles are largely on the right side. This observation could be used to generate documentation consistent with the style of a given project, or to translate methods into more informal descriptions useful for search indices.
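A rough heuristic for assigning these style labels might look like the sketch below; the exact rules and section keywords used to produce Table 2 are not reproduced here, so the patterns are illustrative only.

    def docstring_style(doc):
        """Heuristically bucket a docstring into one of the style labels."""
        lines = [line.strip() for line in doc.strip().splitlines()]
        text = "\n".join(lines)
        # Machine-parseable section markers, most distinctive first.
        if ":param" in text or ":returns:" in text or ":rtype:" in text:
            return "reST"
        if "@param" in text or "@return" in text:
            return "Javadoc"
        if any(lines[i] in ("Parameters", "Returns", "Yields")
               and i + 1 < len(lines) and set(lines[i + 1]) == {"-"}
               for i in range(len(lines))):
            return "Numpy"
        if any(line in ("Args:", "Returns:", "Raises:") for line in lines):
            return "Google"
        # Fall back to the structural labels.
        if len(lines) == 1:
            return "one line"
        if "" not in lines:
            return "one paragraph"
        return "other"

    print(docstring_style("Return the number of even items in numbers."))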
Model              | Ppl  | BLEU | Syntax | Stat. | R1   | R2   | RL
GPT2-med (random)  | 2.36 | 5.60 | 85%    | Prec. | 25.8 | 12.3 | 26.8
                   |      |      |        | Rec.  | 26.7 | 12.1 | 25.9
                   |      |      |        | F1    | 21.8 | 10.6 | 22.5
GPT2-med (English) | 2.09 | 5.63 | 86%    | Prec. | 25.4 | 12.1 | 26.3
                   |      |      |        | Rec.  | 26.9 | 12.2 | 26.1
                   |      |      |        | F1    | 21.7 | 10.6 | 22.5
PYMT5              | 2.36 | 10.6 | 93.6%  | Prec. | 33.8 | 21.5 | 33.6
                   |      |      |        | Rec.  | 44.1 | 25.0 | 43.8
                   |      |      |        | F1    | 35.1 | 21.5 | 32.2
CSN test:
GPT2-med (random)  | –    | 2.8  | 77.2%  | Prec. | 32.3 | 11.8 | 33.7
                   |      |      |        | Rec.  | 19.6 | 7.0  | 19.3
                   |      |      |        | F1    | 20.9 | 7.6  | 21.9
PYMT5              | –    | 8.59 | 92.1%  | Prec. | 25.6 | 12.5 | 25.3
                   |      |      |        | Rec.  | 40.2 | 18.3 | 39.6
                   |      |      |        | F1    | 28.4 | 13.5 | 24.8
Barone and Sennrich (2017) test:
PYMT5              | –    | 20.2 | 91.1%  | Prec. | 41.3 | 28.5 | 40.7
                   |      |      |        | Rec.  | 52.2 | 34.7 | 51.3
                   |      |      |        | F1    | 43.2 | 29.8 | 39.7
Barone et al.      | –    | 10.9 | –      | –     | –    | –    | –

Table 3: Comparing three models (GPT2 with a random weight initialization, GPT2 pre-trained on English, and PYMT5) on the task of method generation from a signature and natural language docstring. The first three rows use our test set consisting of 1,285,794 methods. The fourth and fifth rows compare the performance of PYMT5 and GPT2-medium on the CodeSearchNet Python test set. The final rows compare the performance of PYMT5 on the parallel corpus test set of Barone and Sennrich (2017). Syntax is the fraction of predicted methods which had correct syntax under the Python 3.7 grammar.
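The Syntax column can be computed with a check along the following lines (a sketch of the metric's definition, not necessarily the evaluation code used for the paper):

    import ast

    def is_valid_python(method_source):
        """Return True if the generated method parses under the Python grammar."""
        try:
            ast.parse(method_source)
            return True
        except SyntaxError:
            return False

    predictions = ["def f(x):\n    return x + 1\n", "def g(:\n    pass\n"]
    syntax_rate = sum(is_valid_python(p) for p in predictions) / len(predictions)
    print("{:.1%}".format(syntax_rate))  # 50.0%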
…into our training set. Barone's test set is also 200× smaller than ours and may not be a representative sample of the whole Python code domain.

The third and fourth rows of tab. 3 show the performance of PYMT5 on the publicly available CSN Python test set, where we find notably worse results than on our own test set. CSN curated their whole set by removing any methods with 'test' in the name and any methods with fewer than 3 lines of code. We calculated the performance of PYMT5 on a subset of our test set curated the same way as CSN, observing F-scores for R1, R2, and R-L on our test set of 29.7, 17.2, and 26.1, which is lower than our nominal test set performance of 35.1, 21.5, and 32.2 and closer to the CSN performance of 28.4, 13.5, and 24.8. We believe this curating choice explains the difference between our test set and the CSN test set. We also conclude that tests and short methods are 'easier' to complete, which is plausible and bodes well for automatic code completion applications.

4 Docstring Generation

We now examine results from the docstring generation task, which for evaluation purposes was conditioned on both signatures and method bodies. As in method generation, we set GPT2 benchmarks with random initialization and with pre-trained English initialization, using the same hyperparameters. Table 4 shows that the ROUGE scores of the two GPT2 baselines are within the margin of error, a somewhat surprising result given the English domain of docstrings. The third row shows PYMT5 to be superior to GPT2-medium in terms of BLEU and all of the ROUGE metrics.

Model              | Ppl  | BLEU  | Stat. | R1   | R2   | RL
GPT2-med (random)  | 2.36 | 19.4  | P     | 32.6 | 19.3 | 33.6
                   |      |       | R     | 36.2 | 19.4 | 34.7
                   |      |       | F1    | 31.0 | 18.2 | 31.6
GPT2-med (English) | 2.15 | 19.6  | P     | 33.1 | 19.4 | 33.9
                   |      |       | R     | 36.4 | 19.5 | 34.8
                   |      |       | F1    | 31.4 | 18.3 | 31.8
PYMT5              | 3.74 | 25.2  | P     | 42.1 | 23.7 | 41.3
                   |      |       | R     | 50.4 | 27.0 | 49.3
                   |      |       | F1    | 43.3 | 24.4 | 39.8
CSN test:
GPT2-med (random)  | –    | 9.5   | P     | 30.6 | 13.3 | 31.4
                   |      |       | R     | 31.1 | 12.9 | 29.8
                   |      |       | F1    | 26.3 | 11.5 | 27.2
PYMT5              | –    | 16.3  | P     | 38.0 | 19.2 | 36.8
                   |      |       | R     | 52.7 | 24.5 | 51.0
                   |      |       | F1    | 41.3 | 20.4 | 36.7
Barone test:
PYMT5              | –    | 17.4  | P     | 39.6 | 26.0 | 38.7
                   |      |       | R     | 53.6 | 33.7 | 52.1
                   |      |       | F1    | 43.1 | 27.8 | 39.1
Barone et al.      | –    | 13.84 | –     | –    | –    | –

Table 4: Comparing three models (GPT2 with a random weight initialization, GPT2 pre-trained on English, and PYMT5) on the task of natural language docstring generation from a signature and method body. The first three rows are evaluated on our test set of 383,695 methods. The fourth and fifth rows show the performance of PYMT5 and GPT2-medium on the CSN Python test set, and the last two rows compare our model to Barone et al. on their test set.

We again present results on the publicly available CSN test set. Similar to the method generation task, PYMT5 performs worse on the CSN data than on our own, likely for the same reasons discussed in sec. 3. We also evaluated PYMT5 on the Barone et al. parallel test set, as shown in the second-to-last row of tab. 4, and find that PYMT5 performs notably worse on Barone's test set than on our own test set, contradicting the hypothesis that our doubling of the method generation BLEU score is due to data leakage. PYMT5 has a much higher BLEU score than that reported by Barone et al., perhaps indicating real progress in the code summarization field.

Docstring generation is similar to code summarization, though the domains are different, as docstrings also contain structured annotations of arguments, return values, raised exceptions, and even in-line unit tests (doctest). TranS^3 by Wang et al. (2020) reports a best ROUGE-L of 51.27 on the same test set for code summarization, but does not specify which statistic they are reporting, so we cannot make strong conclusions about the performance of PYMT5 compared to the state of the art.

5 Conclusion

Acknowledgements

We would like to thank the Microsoft Cloud and AI SmartML engineering team for help in preparing the data, Shao Kun Deng for the development of compelling user experiences leveraging PYMT5, and Christian Bird for useful discussions.
References

Adelina Ciurumelea, Sebastian Proksch, and Harald Gall. 2020. Suggesting comment completions for Python using neural language models. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512.

David Goodger and Guido van Rossum. 2001. Docstring conventions. PEP 257.

Google. 2020. Google Python style guide. Technical report.

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, pages 933–944, New York, NY, USA. Association for Computing Machinery.

Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE), pages 837–847. IEEE.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.

Richard Jones. 2013. A reStructuredText primer. docutils.sourceforge.net, March.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. arXiv preprint arXiv:1908.08593.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.

Numpydoc Maintainers. 2020. Numpydoc docstring guide. Technical report.

Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K. Vijay-Shanker. 2013. Automatic generation of natural language summaries for Java classes. In 2013 21st International Conference on Program Comprehension (ICPC), pages 23–32. IEEE.

Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrian Marcus, and Gerardo Canfora. 2014. Automatic generation of release notes. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 484–495.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 419–428.

Simone Scalabrino, Gabriele Bavota, Christopher Vendome, Mario Linares-Vásquez, Denys Poshyvanyk, and Rocco Oliveto. 2017. Automatically assessing code understandability: How far are we? In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 417–427. IEEE.

Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code generation using transformer. arXiv preprint arXiv:2005.08025.

Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. 2019. Pythia: AI-assisted code completion system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2727–2735.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380.

Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 397–407.

Wenhua Wang, Yuqun Zhang, Zhengran Zeng, and Guandong Xu. 2020. TranS^3: A transformer-based framework for unifying code summarization and code search. arXiv preprint arXiv:2003.03238.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450, Vancouver, Canada. Association for Computational Linguistics.

Juan Zhai, Xiangzhe Xu, Yu Shi, Minxue Pan, Shiqing Ma, Lei Xu, Weifeng Zhang, Lin Tan, and Xiangyu Zhang. 2019. CPC: Automatically classifying and propagating natural language comments via program analysis.

A Appendix

A.1 Docstring statistics

Figure 5 shows the distributions of various features of docstrings in our corpus. The top row is the distribution of total character-level length of the method signatures (left), docstrings (center), and code bodies (right). The blue lines are for methods possessing a docstring, and we can see that the vast majority of these methods have docstrings with more than 10 characters. The bottom row shows the distribution of line lengths of the same features as in the top row. While the most common line length of docstrings is 1 (comprising 41%), the vast majority of docstrings have multiple lines.

A.2 Pre-training details

Figure 7 is the complete training script, using the Facebook AI Research Sequence (fairseq) modeling library, with which we pre-trained PYMT5. The data was pre-noised and processed using the fairseq-preprocess command, and placed in the directory indicated by $DIR. The architecture and training hyper-parameters are set in this script. PYMT5 was trained with the same hyperparameters, but with the data described in sec. A.4.

Figure 8 shows learning curves of a single seq2seq model of the same architecture as PYMT5 trained only on docstrings, starting from random initialization and starting from our pre-trained model. As the figure shows, the …
Figure 5: Histograms of the number of characters (top row) in the Python signatures (left), docstrings (middle), and method bodies (right). The blue lines are for methods with docstrings, the yellow lines for methods without docstrings. The vast majority of docstrings have more than 10 characters. The bottom row shows histograms of the number of lines for the same features described in the top row.
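The distributions in Figure 5 can be tallied from the extracted corpus with a few lines of counting code; a sketch only, assuming an iterable of (signature, docstring, body) strings (the plotting itself is omitted):

    from collections import Counter

    def length_histograms(methods):
        """Count character and line lengths of signatures, docstrings, and bodies."""
        char_counts = {k: Counter() for k in ("signature", "docstring", "body")}
        line_counts = {k: Counter() for k in ("signature", "docstring", "body")}
        for signature, docstring, body in methods:
            for key, text in (("signature", signature),
                              ("docstring", docstring),
                              ("body", body)):
                char_counts[key][len(text)] += 1
                line_counts[key][len(text.splitlines())] += 1
        return char_counts, line_counts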
Figure 7: The fairseq-train script used to pre-train PYMT5, setting all the relevant hyper-parameters.
…ing both signatures and docstrings, or one or the other. Table 5 spells out exactly which combinations were provided to the model as a source and target. For each source example the comment string '# target <feature> (<style>)' was added, instructing the model which feature combination …

Source \ Target | Doc + body | Sig + body | Docstring | Sig + doc | Signature | Body
Signature       | ✓          |            | ✓         |           |           | ✓
Docstring       |            | ✓          |           |           | ✓         | ✓
Body            |            |            | ✓         | ✓         | ✓         |
Sig + doc       |            |            |           |           |           | ✓

Table 5: Source and target feature combinations used in multi-mode training; a target never contains a feature that is already present in the source.
Figure 9: Learning curves for the multi-mode training, where the black line is the training loss and the other lines are the validation losses for each mode of translation. Dashed lines indicate modes with docstrings in the target; solid lines have only code in the target.