[Figure: Sample open cloze task ('Motorbike stunt rider'). I work (1) ........ a motorbike stunt rider — that is, I do tricks on my motorbike at shows. The Le Mans race track in France was (2) ........ I first saw some guys doing motorbike stunts. I'd never seen anyone riding a motorbike using just the back wheel before and I was (3) ........ impressed I went straight home and taught (4) ........ to do the same.]
2 Related Work

While research into automatic cloze test generation is vast (Mostow et al., 2017; Kurdi et al., 2020; Yang et al., 2021), work on open cloze tests for language learning is scarce. Pino et al. (2008) generate open cloze questions using sample sentences from a learners' dictionary based on four linguistic criteria: (grammatical) complexity, well-defined context (collocations), grammaticality and length. A later version of their system adds hints for gapped words (Pino and Eskenazi, 2009). Exercise Maker (Malafeev, 2014) is a rule-based open-source system that attempts to emulate exercises in Cambridge English examinations based on the most frequently tested words. Most of the gaps it proposes were found to be useful and the automated exercises were hard to differentiate from authentic tests.

Chinkina et al. (2017) generate open cloze exercises for phrasal verbs by extracting sentences from news articles and generating a pair of questions and answers where the identified particle verbs are gapped. Similarly, Soonklang et al. (2017) gap words in sentences according to their part of speech in order to practise articles, prepositions, etc. Finally, Marrese-Taylor et al. (2018) use LSTMs to build sequence labelling and classification models that decide where to insert a single gap in a single sentence. Automatic evaluation against gold-standard gaps showed the method was effective.

Other work has focused on creating automated cloze tests by controlling aspects of the proposed gaps so that they correlate with a target proficiency level. Lee et al. (2019), for example, manipulate the difficulty of C-tests (open cloze tests with hints, Grotjahn et al. (2002)) by varying the position and word length of the gaps. A similar concept is presented by Settles et al. (2020) and McCarthy et al. (2021), although difficulty is predicted using a machine-learning model that correlates with CEFR levels. In these cases, tests are dynamically adapted to the examinee's proficiency level during the test session. From a different perspective, Felice and Buttery (2019) show that controlling gap entropy can be useful for designing open cloze tests at different CEFR levels. The work we present in this paper, however, aims to model the more complex task of predicting a full set of gaps at the paragraph level that comply with design and testing principles, and is, to the best of our knowledge, the first to employ and adapt transformer-based models for this task.

System evaluation is also challenging, since there is usually more than one potential word in the text that could constitute a good gap. While previous work often made a choice between automatic (Marrese-Taylor et al., 2018) or human evaluation (Malafeev, 2014; Das and Majumder, 2017) for their experiments, we perform both: automatic evaluation to identify the best models during development and human evaluation to measure test quality in the final output.

3 Model

We define open cloze generation as the task of predicting a set of tokens that should be gapped in the text. Unlike previous approaches that work at the sentence level, our models work at the paragraph level (i.e. take the full text as input), since we believe the interactions between gaps can only be optimally captured when the text is processed as a whole rather than sentence by sentence.

Given a text passage, we aim to predict the words that should be gapped in order to create a cloze test that would reliably assess student ability. The task is modelled as a supervised sequence tagging problem where each token is classified as being a good potential gap or not. We employ ELECTRA (Clark et al., 2020), a state-of-the-art pre-trained transformer-based language representation model (Wolf et al., 2020). ELECTRA is an extension of BERT (Devlin et al., 2019) with a different pre-training task: it is trained as a discriminator (rather than a generator) that detects replaced tokens (rather than generating words for the masks). We believe that this discrimination objective makes it more suitable for our token classification task. Moreover, we also exploit ELECTRA's generation
capabilities as a language model for estimating the answers to the proposed gaps as an auxiliary task. Hence, to make the most of this pre-trained model, we fine-tune it using two training objectives, as depicted in Figure 2:

1. A token classification objective, which aims to minimise the error of classifying each token as a potential gap or not.

2. A language modelling objective, which aims to minimise the negative log-likelihood of regenerating the words that have been gapped.

The first objective is typical of any standard token classification model and constitutes our key task. In particular, we use ELECTRA's discriminator head with softmax to tag each word in the input sequence as a 'good' gap or not. All the gaps in our training data are replaced with the first intended target answer and labelled positive, while the remaining tokens are labelled negative (A).

The second, auxiliary objective attempts to model our preference for gaps with a restricted number of answers while also ensuring that the original word can be guessed from the context. This is to avoid generating gaps that are too 'open' and therefore ineffective, such as a gap that accepts any noun or adjective. Specifically, we mask the words in the positions that are predicted as gaps by the discriminator and use ELECTRA's generative head to generate the expected words in the blanks (B).

While the input layers are shared between the discriminator and the generator model, the two branches of the system leading to the two objectives are fine-tuned in parallel in a multi-task setting.

[Figure 2: Architecture of our multi-objective ELECTRA-based system. The model is simultaneously trained on two objectives: 1) token classification and 2) LM prediction of gapped words.]
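As a concrete illustration, the following is a minimal PyTorch sketch of this two-objective fine-tuning, assuming the HuggingFace transformers library. The class name, the training_step helper and the plain linear LM head are our illustrative simplifications (the paper uses ELECTRA's own generator head, and additionally concatenates the dependency features described in Section 6.1); padding handling is omitted.

```python
import torch
import torch.nn as nn
from transformers import ElectraModel

class GapTagger(nn.Module):
    """Shared ELECTRA encoder with two heads: gap tagging and gap-word recovery."""

    def __init__(self, vocab_size, model_name="google/electra-base-discriminator"):
        super().__init__()
        self.encoder = ElectraModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.gap_head = nn.Linear(hidden, 2)          # objective 1: gap / non-gap
        self.lm_head = nn.Linear(hidden, vocab_size)  # objective 2: recover gapped word

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.gap_head(states), self.lm_head(states)

def training_step(model, batch, mask_token_id):
    # Objective 1: tag each token against the gold labels (gap = 1, non-gap = 0).
    gap_logits, _ = model(batch["input_ids"], batch["attention_mask"])
    tag_loss = nn.functional.cross_entropy(
        gap_logits.transpose(1, 2), batch["gap_labels"])
    # Objective 2: mask the positions the tagger predicts as gaps and try to
    # re-generate the original words there; all other positions are ignored.
    predicted_gaps = gap_logits.argmax(-1).bool()
    masked_ids = batch["input_ids"].masked_fill(predicted_gaps, mask_token_id)
    _, lm_logits = model(masked_ids, batch["attention_mask"])
    lm_targets = batch["input_ids"].masked_fill(~predicted_gaps, -100)
    lm_loss = nn.functional.cross_entropy(
        lm_logits.transpose(1, 2), lm_targets, ignore_index=-100)
    # The whole architecture is updated on the sum of the two losses.
    return tag_loss + lm_loss
```

Note that the second branch only sees the input with the tagger's predicted gaps masked, so its loss is computed exclusively on those positions.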
4 Extensions

Our neural transformer-based sequence tagging model can be very effective at proposing potentially good gaps, but the task becomes more challenging when we expect the output to meet additional requirements such as no repetitions, no gap interdependence, a minimum distance between gaps and a varied selection of lexico-grammatical items. We address these issues using two complementary strategies: a manipulation of the loss function and a post-processing module.

4.1 Loss manipulation

In order to spread gaps evenly throughout the text, we modify the token-level loss function of our tagging model by imposing a higher penalty on tokens that are in close proximity to a gap. Let g be the position of a gap in the sequence; then for each token in position i in the proximity of g, i.e. |g − i| < D, the loss function l′_i for the token in position i is defined as:

    l′_i = l_i · W / |g − i|    (1)

where W represents the penalty and D is the maximum distance scope for penalisation.[3] Equation 1 thus gives more weight to tokens closer to gaps, which results in higher penalisation of their cost functions whenever they are misclassified.

[3] We empirically set the values of constants D and W to 3 and 3.0 respectively.
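A small sketch of how Equation 1 could be applied to an unreduced per-token loss; the function name and tensor layout are illustrative rather than taken from the paper.

```python
import torch

def apply_gap_proximity_penalty(token_losses, gap_positions, W=3.0, D=3):
    """Scale per-token losses by W / |g - i| for tokens within distance D of a
    gold gap position g (Equation 1). W = 3.0 and D = 3 are the values the
    paper reports setting empirically (footnote 3)."""
    weights = torch.ones_like(token_losses)
    for g in gap_positions:
        for i in range(len(token_losses)):
            dist = abs(g - i)
            if 0 < dist < D:
                # Tokens closer to a gap are penalised more when misclassified.
                weights[i] = max(weights[i].item(), W / dist)
    return token_losses * weights
```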
4.2 Post-processing

We also employ a post-processing strategy where we replace the gaps that are repeated in the text with better options. We optimise the choice of these alternative gaps by considering the distance between them and the resulting distribution of gaps with different part-of-speech (PoS) tags.

Our post-processing step can be seen as a re-ranking function. The gap candidates that are originally ranked based on the model's confidence scores change their ranking to match other desirable requirements of a well-structured cloze test. If the selected n-best gaps include repetitions, our post-processing algorithm randomly chooses one of them at a time and attempts to replace it with a better alternative. An alternative gap is deemed better if 1) its answer is not a repetition of another gapped word, 2) its distance to other selected gaps meets the minimum required distance or is higher than the pairwise distances of the originally selected gaps, and 3) it improves the PoS distribution of the gapped words. The PoS distribution of each new selection of gaps is compared to the average gapped PoS distribution of the cloze tests in the training data using Kullback-Leibler (KL) divergence. A combination of gaps that yields lower KL divergence is assumed to be a better solution.

These extensions to the base model bring our final cloze tests closer to those created by human experts by automatically controlling variables that would otherwise need to be adjusted manually. This makes our solution a fully-automated system that can produce ready-to-use cloze tests from an input text passage.
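The KL comparison at the core of this re-ranking might look as follows; the helper names and smoothing constant are illustrative, and the repetition and distance checks described above are assumed to run separately.

```python
import math
from collections import Counter

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over a shared PoS tag set, with smoothing to avoid log(0)."""
    tags = set(p) | set(q)
    return sum(p.get(t, eps) * math.log(p.get(t, eps) / q.get(t, eps)) for t in tags)

def pos_distribution(gap_tokens):
    """Normalised PoS distribution of a candidate gap selection."""
    counts = Counter(tok.pos_ for tok in gap_tokens)  # e.g. spaCy tokens
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def better_selection(candidate, current, reference_dist):
    """Prefer the selection whose PoS distribution is closer (lower KL) to the
    average gapped-PoS distribution of the training cloze tests."""
    return (kl_divergence(pos_distribution(candidate), reference_dist)
            < kl_divergence(pos_distribution(current), reference_dist))
```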
5 Data

To the best of our knowledge, there are no public datasets of full-text open cloze tests that could be used for our task. The CLOTH dataset (Xie et al., 2018), for example, contains gapped passages designed for language learners, but it is primarily focused on reasoning and reading comprehension and uses multiple-choice questions where distractors play a major role, making it substantially different to the task we aim to model.

For this reason, we use a collection of expertly created open cloze tests at the B2 CEFR level that was kindly provided by Cambridge University Press & Assessment (CUP&A) for research purposes. Each task consists of a text passage of no more than 300 tokens, a variable number of gaps (between 8 and 16) and a list of valid answers for each gap (between 1 and 7). During the design process, the tasks undergo extensive quality control and pretesting, so their gaps are guaranteed to be very effective at assessing student ability.

For training, we reconstruct the texts by replacing each gap with its first answer and we split the whole collection into train, dev and test. Details of our dataset are shown in Table 1.

            Train      Dev     Test
  Tasks       356       58       36
  Tokens   79,863   12,797    6,621
  Gaps      4,565      787      360

Table 1: Number of tasks, tokens and gaps in each section of the data.

Given the lack of publicly available data, we make our test set available with this paper so as to provide a common benchmark for the task and to encourage further research in this area. All the texts were tokenised and parsed using spaCy v2.3.[4]

[4] https://spacy.io/
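As an illustration of this training-data reconstruction, a gapped task might be turned into a labelled token sequence roughly as follows; the input format shown here is hypothetical (real tasks come from the CUP&A collection), and only each gap's first answer is used.

```python
def build_training_example(text_parts, gaps):
    """Rebuild a gapped passage and label each token as gap (1) or non-gap (0).

    text_parts: the n+1 text fragments surrounding the n gaps;
    gaps: one list of accepted answers per gap, in order of appearance.
    """
    tokens, labels = [], []
    for i, part in enumerate(text_parts):
        words = part.split()
        tokens += words
        labels += [0] * len(words)
        if i < len(gaps):
            tokens.append(gaps[i][0])  # the first intended answer fills the gap
            labels.append(1)           # this position is a gold gap
    return tokens, labels

# Example: "I work ___ a stunt rider" with accepted answer {as}
tokens, labels = build_training_example(["I work", "a stunt rider"], [["as"]])
# tokens = ['I', 'work', 'as', 'a', 'stunt', 'rider'], labels = [0, 0, 1, 0, 0, 0]
```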
6 Experiments

6.1 Setup

We use the pre-trained ELECTRA base discriminator model[5] with 12 attention heads and 12 hidden layers. Along with all the tokens in the sequences, we also input dependency parsing information to the system. More specifically, we concatenate the ELECTRA representation of each token with the representation of its head in the dependency graph.[6] On top of the encoding layers, we have two branches that are learned simultaneously (Figure 2).

The first branch is a simple linear layer that aims to classify each token as a gap or non-gap. For the second branch, we add ELECTRA's generation layer plus a linear layer which aims to predict the best word from the whole vocabulary as an auxiliary task. We are only interested in predicting the answer words for the gaps. Therefore, we change the input to the second branch by masking the words that are predicted as gaps by the first branch at each step of training. We employ cross-entropy loss on each branch and ignore the loss values for the tokens that are not masked in the second branch. The whole architecture is updated based on the sum of the two losses. Fine-tuning parameters are specified in Appendix B.

[5] https://github.com/huggingface/transformers
[6] If the token is the head, then its representation is repeated.
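A sketch of the dependency-feature concatenation just described, assuming a spaCy parse and one vector per token (alignment between subwords and words is omitted); per footnote 6, a root token is its own head in spaCy, so its representation is simply repeated.

```python
import torch

def add_dependency_features(token_vectors, doc):
    """Concatenate each token's ELECTRA vector with that of its dependency head.

    token_vectors: (seq_len, hidden) tensor, one vector per spaCy token;
    doc: the parsed spaCy Doc for the same passage.
    """
    head_vectors = torch.stack([token_vectors[tok.head.i] for tok in doc])
    return torch.cat([token_vectors, head_vectors], dim=-1)  # (seq_len, 2 * hidden)
```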
6.2 Baselines

We compare our multi-objective ELECTRA model to other systems, namely:

Random baseline: Generates a random set of gaps for each task based on the average probability distribution of gapped PoS in the training data.

Exercise Maker: Generates gaps using rules and a pre-compiled list of commonly gapped words from a variety of Cambridge English main suite exams (Malafeev, 2014). Set to FCE mode for our experiments.
Class   Label                           Description
Good    Good                            The gap is appropriate, i.e. it is expected to be effective during testing.
Bad     Too close to other gaps         The gap is in close proximity to another gap.
Bad     Too many possible answers       The gap allows too many answers (often more than 5).
Bad     Too many gaps of this type      There are many gaps with the same part-of-speech or testing focus in the text.
Bad     Answers can change meaning      The gap can be filled by answers that would change the meaning of the text, e.g. 'and' or 'but'.
Bad     Answers can have different PoS  The gap can be filled by answers that have a different grammatical function, e.g. 'which' or 'and'.
Bad     Gap depends on another          There is some dependency between this gap and another in the text.
Bad     Repeated gap                    There is already another gap testing the same word in the text.
Bad     Phantom gap                     The gap does not require an answer for the text to make sense.
Bad     Unacceptable outlier            The gap does not fit in the text for multiple reasons (e.g. inappropriate difficulty).
Bad     Other (please specify)          Any other reason why the gap is considered unsuitable.
[Figure 5: Sample output of our extended ELECTRA model. Darker shades of red indicate higher confidence in inserting a gap. Predicted gaps are framed in black while gold standard gaps are in yellow font.]
and F1 scores in the test set being negligible (Pearson's r = 0.0108, Spearman's ρ = 0.0915).

Interestingly, while our model was unable to predict gaps not previously seen in the training data (turned, amount, pushed and started), it did predict a (previously unseen) gap for the word fewer, which did not match the gold standard but was unanimously deemed good by our annotators.

7.4 Predictions by PoS

We also classified predictions based on their PoS tags[8] and report performance in Table 8. The most frequently gapped PoS tags in our datasets correspond to closed word classes (such as ADP, DET, SCONJ, AUX, etc.), which is expected given that our open cloze tests are mostly focused on testing grammar rather than vocabulary. The best predicted classes, however, are NUM, SCONJ, NOUN, ADJ and INTJ, which on closer inspection turn out to be very restricted classes: NUM includes only the word one, INTJ only the word like, SCONJ only a few subordinating conjunctions, while NOUN and ADJ, despite being open classes, are limited to words used in common constructions such as order (in order to) or same (the same).

[8] Using the Universal Dependencies tagset: https://universaldependencies.org/u/pos/

PoS     Proportion in TEST      P       R       F1
ADP     20.59%                  50.00   43.24   46.38
ADV     14.17%                  57.69   58.82   58.25
DET     13.89%                  56.41   44.00   49.44
SCONJ   13.89%                  59.09   78.00   67.24
AUX     10.83%                  45.83   28.21   34.92
PRON     9.44%                  47.92   67.65   56.10
ADJ      4.44%                  60.00   75.00   66.67
NOUN     3.33%                  77.78   58.33   66.67
NUM      2.78%                  61.54   80.00   69.57
CCONJ    2.50%                  55.56   55.56   55.56
VERB     2.22%                  50.00   50.00   50.00
PART     1.67%                   0.00    0.00    0.00
INTJ     0.28%                  50.00  100.00   66.67

Table 8: Performance by PoS on the test set based on automatic evaluation.

The two worst performing classes are PART (the particles to and not) and AUX (auxiliary verbs) and, once again, we conjecture that these words are so common in the language and in non-gapped positions that the model is unable to get them right most of the time. The remaining PoS classes vary in performance, but we found only very weak correlation between PoS gap frequency in the test set and F1 scores (Pearson's r = 0.1932, Spearman's ρ = 0.1350).

When we look at human annotations on the test set, however, performance by PoS is consistently higher and more even across the board. If we require that gaps are rated 'good' by at least two annotators, accuracy values range between 75% and 100% for all PoS, with a mean of 85%.

Under these conditions, the best performing classes are NOUN (100%), INTJ (100%) and ADJ (95%), which agree with automatic evaluation. Out of these, only NOUN achieves perfect accuracy across all annotators. The worst performing classes are PRON (77%), NUM (77%) and VERB (75%) as opposed to the previous AUX and PART counterparts (now 79% and 83% respectively). When we require agreement by all annotators, the worst overall class is CCONJ with 44%.

7.5 Qualitative Analysis

Figure 5 shows the output of our model for a sample text passage, where darker red indicates higher confidence in inserting a gap. The final model's
predictions have a black frame (at, in, so, after, etc.) while the gold standard gaps are in yellow font (at, in, so, etc.). There are 8 matched gaps out of 11 in this example, yielding 72.73% accuracy.

As can be seen in the figure, our model is able to identify appropriate gap candidates, even if they do not match the gold standard. In fact, annotators considered all the unmatched gaps in this example (after, for and take) to be good and the second matched gap (in) to be inappropriate. It is also interesting to see how the model prioritises function words and content words that are highly restricted in context (such as take or part), skilfully avoiding general gaps that could accept multiple answers and would be less effective for testing purposes.

8 Conclusion and Future Work

We described the first transformer-based approach to open cloze test generation. Our ELECTRA-based model is trained on two objectives: token classification (gap/non-gap) and language modelling (for predicting the expected answer). The model is further improved by manipulating the loss function and post-processing the results.

System accuracy using automatic evaluation is 53.89%, while human evaluation ranges between 75% and 82%, showing that at least 7 out of 10 predicted gaps are considered useful by experts. A detailed analysis of results reveals a few structural problems, such as gaps in close proximity and inappropriate difficulty, which we plan to address in future work. Our test data and human annotations are released with this paper.

Acknowledgements

The authors are immensely grateful to Louise Gilbert, Sally Moore and Clare Williams from CUP&A for their annotations. This paper reports on research supported by Cambridge University Press & Assessment, University of Cambridge.

References

Maria Chinkina, Simón Ruiz, and Detmar Meurers. 2017. Automatically generating questions to support the acquisition of particle verbs: Evaluating via crowdsourcing. In CALL in a climate of change: adapting to turbulent global conditions – short papers from EUROCALL 2017, pages 73–78.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.

Council of Europe. 2001. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press, Cambridge.

Bidyut Das and Mukta Majumder. 2017. Factual open cloze question generation for assessment of learner's knowledge. International Journal of Educational Technology in Higher Education, 14:1–12.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Mariano Felice and Paula Buttery. 2019. Entropy as a proxy for gap complexity in open cloze tests. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 323–327, Varna, Bulgaria. INCOMA Ltd.

Rüdiger Grotjahn, Christine Klein-Braley, and Ulrich Raatz. 2002. C-tests: an overview. In James A. Coleman, Rüdiger Grotjahn, and Ulrich Raatz, editors, University language learning and the C-Test, pages 93–114. AKS-Verlag, Bochum, Germany.

Ghader Kurdi, Jared Leo, Bijan Parsia, Uli Sattler, and Salam Al-Emari. 2020. A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30(1):121–204.

Ji-Ung Lee, Erik Schwan, and Christian M. Meyer. 2019. Manipulating the difficulty of C-tests. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 360–370, Florence, Italy. Association for Computational Linguistics.

Alexey Malafeev. 2014. Language exercise generation: Emulating Cambridge open cloze. International Journal of Conceptual Structures and Smart Applications, 2(2):20–35.

Edison Marrese-Taylor, Ai Nakajima, Yutaka Matsuo, and Ono Yuichi. 2018. Learning to automatically generate fill-in-the-blank quizzes. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pages 152–156, Melbourne, Australia. Association for Computational Linguistics.

Arya D. McCarthy, Kevin P. Yancey, Geoff T. LaFlair, Jesse Egbert, Manqian Liao, and Burr Settles. 2021. Jump-starting item parameters for adaptive language tests. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 883–899, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Jack Mostow, Yi-Ting Huang, Hyeju Jang, Anders Weinstein, Joe Valeri, and Donna Gates. 2017. Developing, evaluating, and refining an automatic generator of diagnostic multiple choice cloze questions to assess children's comprehension while reading. Natural Language Engineering, 23(2):245–294.

Association of Language Testers in Europe (ALTE). 2005. Materials for the guidance of test item writers. Technical report, Association of Language Testers in Europe.

Association of Language Testers in Europe (ALTE). 2011. Manual for language test development and examining. Technical report, Association of Language Testers in Europe.

Juan Pino and Maxine Eskenazi. 2009. Measuring hint level in open cloze questions. In Proceedings of the Twenty-Second International Florida Artificial Intelligence Research Society Conference (FLAIRS), Sanibel Island, Florida, USA. AAAI Press.

Juan Pino, Michael Heilman, and Maxine Eskenazi. 2008. A selection strategy to improve cloze question quality. In Intelligent Tutoring Systems for Ill-Defined Domains: Assessment and Feedback in Ill-Defined Domains, page 22.

Justus J. Randolph. 2005. Free-marginal multirater kappa (multirater K[free]): An alternative to Fleiss' fixed-marginal multirater kappa. Online submission.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS) 30, pages 5998–6008. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2018. Large-scale cloze test dataset created by teachers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2344–2356, Brussels, Belgium. Association for Computational Linguistics.

Albert C. M. Yang, Irene Y. L. Chen, Brendan Flanagan, and Hiroaki Ogata. 2021. Automatic generation of cloze items for repeated testing to improve reading comprehension. Educational Technology & Society, 24(3):147–158.
A Dataset composition

B Model parameters

[Table C.1: Example of the different reasons given by the annotators for rejecting a gap proposed by our system.]