
Constructing Open Cloze Tests Using Generation and Discrimination

Capabilities of Transformers

Mariano Felice, Shiva Taslimipoor and Paula Buttery


ALTA Institute, Computer Laboratory, University of Cambridge
Cambridge, UK
{mf501,st797,pjb48}@cam.ac.uk

Abstract

This paper presents the first multi-objective transformer model for constructing open cloze tests that exploits generation and discrimination capabilities to improve performance. Our model is further enhanced by tweaking its loss function and applying a post-processing re-ranking algorithm that improves overall test structure. Experiments using automatic and human evaluation show that our approach can achieve up to 82% accuracy according to experts, outperforming previous work and baselines. We also release a collection of high-quality open cloze tests along with sample system output and human annotations that can serve as a future benchmark.

1 Introduction

Open cloze (Taylor, 1953) tests are a common type of exercise where words are removed from a piece of text and must then be filled in by the students without any options to choose from. They are often used in language learning environments as a quick and effective way to test vocabulary, grammar and reading comprehension (Tremblay, 2011; Trace, 2020). However, designing high-quality cloze tests for language learning is a laborious process that involves finding an optimal distribution of gaps based on aspects such as function, distance, number of answers, etc. (ALTE, 2005, 2011).

In this paper, we propose a strategy to construct open cloze exercises using transformer models (Vaswani et al., 2017). Our transformer-based architecture employs two objectives to predict the words that should be gapped in a text passage. Our main objective is standard token classification, where we aim to minimise the error of classifying a token as a gap or not. The second and auxiliary objective is a language-model-based objective whereby we attempt to minimise the language model error when predicting the right answer for each gap. Our solution is based on a pre-trained ELECTRA (Clark et al., 2020) model that is fine-tuned on the two described objectives in a multi-task scenario.

Our output aims to mimic the style of open cloze tests in the First Certificate in English (FCE) exam¹, which is targeted at learners of English at the B2 proficiency level of the Common European Framework of Reference (CEFR) for languages (Council of Europe, 2001). Unlike other tests, the FCE open cloze task aims to simultaneously test many aspects of grammar and vocabulary that students are expected to know at this level. Since the tests are created from a text passage, they must be skilfully designed in order to ensure an optimal distribution of gaps that adheres to guidelines. A shortened example is shown in Figure 1.

Our system is evaluated under two settings: 1) automatic evaluation, where the generated gaps are compared to gold-standard gaps proposed by test experts, and 2) human evaluation, where the quality of the generated gaps is judged by test experts.

The main contributions of our work are as follows: 1) we are the first to employ transformer models for open cloze test generation, 2) unlike previous studies, we work at the paragraph level, which is a much more challenging task, 3) we propose a multi-task learning approach with two objectives: one is to classify tokens into gaps/non-gaps and the other to minimise the error of re-generating the gapped word, 4) we report state-of-the-art results, outperforming previous work and a strong baseline, 5) we propose additional components to control the structure of the final cloze tests as human experts do, 6) we perform both automatic and human evaluation and 7) we make our test data, system output and human annotations available to the research community².

¹ Now known as B2 First: https://www.cambridgeenglish.org/exams-and-tests/first/
² Dataset available at https://github.com/CambridgeALTA/fce-cep-oc
Motorbike stunt rider
I work (1) a motorbike stunt rider — that is, I do tricks on my motorbike at shows. The Le Mans race track in France
was (2) I first saw some guys doing motorbike stunts. I’d never seen anyone riding a motorbike using just the back
wheel before and I was (3) impressed I went straight home and taught (4) to do the same.

Figure 1: Sample FCE open cloze test (shortened).

2 Related Work

While research into automatic cloze test generation is vast (Mostow et al., 2017; Kurdi et al., 2020; Yang et al., 2021), work on open cloze tests for language learning is scarce. Pino et al. (2008) generate open cloze questions using sample sentences from a learners' dictionary based on four linguistic criteria: (grammatical) complexity, well-defined context (collocations), grammaticality and length. A later version of their system adds hints for gapped words (Pino and Eskenazi, 2009). Exercise Maker (Malafeev, 2014) is a rule-based open-source system that attempts to emulate exercises in Cambridge English examinations based on the most frequently tested words. Most of the gaps it proposes were found to be useful and the automated exercises were hard to differentiate from authentic tests.

Chinkina et al. (2017) generate open cloze exercises for phrasal verbs by extracting sentences from news articles and generating a pair of questions and answers where the identified particle verbs are gapped. Similarly, Soonklang et al. (2017) gap words in sentences according to their part of speech in order to practise articles, prepositions, etc. Finally, Marrese-Taylor et al. (2018) use LSTMs to build sequence labelling and classification models that decide where to insert a single gap in a single sentence. Automatic evaluation against gold-standard gaps showed the method was effective.

Other work has focused on creating automated cloze tests by controlling aspects of the proposed gaps so that they correlate with a target proficiency level. Lee et al. (2019), for example, manipulate the difficulty of C-tests (open cloze tests with hints; Grotjahn et al., 2002) by varying the position and word length of the gaps. A similar concept is presented by Settles et al. (2020) and McCarthy et al. (2021), although difficulty is predicted using a machine-learning model that correlates with CEFR levels. In these cases, tests are dynamically adapted to the examinee's proficiency level during the test session. From a different perspective, Felice and Buttery (2019) show that controlling gap entropy can be useful for designing open cloze tests at different CEFR levels. The work we present in this paper, however, aims to model the more complex task of predicting a full set of gaps at the paragraph level that comply with design and testing principles, and is, to the best of our knowledge, the first to employ and adapt transformer-based models for this task.

System evaluation is also challenging, since there is usually more than one potential word in the text that could constitute a good gap. While previous work often made a choice between automatic (Marrese-Taylor et al., 2018) or human evaluation (Malafeev, 2014; Das and Majumder, 2017) for their experiments, we perform both: automatic evaluation to identify the best models during development and human evaluation to measure test quality in the final output.

3 Model

We define open cloze generation as the task of predicting a set of tokens that should be gapped in the text. Unlike previous approaches that work at the sentence level, our models work at the paragraph level (i.e. take the full text as input), since we believe the interactions between gaps can only be optimally captured when the text is processed as a whole rather than sentence by sentence.

Given a text passage, we aim to predict the words that should be gapped in order to create a cloze test that would reliably assess student ability. The task is modelled as a supervised sequence tagging problem where each token is classified as being a good potential gap or not. We employ ELECTRA (Clark et al., 2020), one of the state-of-the-art pre-trained transformer-based language representation models (Wolf et al., 2020). ELECTRA is an extension of BERT (Devlin et al., 2019) with a different, discriminative pre-training task: it aims to detect replaced tokens rather than generate words for the masks. We believe that this discrimination objective makes it more suitable for our token classification task. Moreover, we also exploit ELECTRA's generation
capabilities as a language model for estimating the
answers to the proposed gaps as an auxiliary task.
Hence, to make the most of this pre-trained model,
we fine-tune it using two training objectives, as
depicted in Figure 2:
1. A token classification objective, which aims to minimise the error of classifying each token as a potential gap or not.
2. A language modelling objective, which aims to minimise the negative log-likelihood of re-generating the words that have been gapped.
The first objective is typical of any standard token classification model and constitutes our key task. In particular, we use ELECTRA's discriminator head with softmax to tag each word in the input sequence as a 'good' gap or not. All the gaps in our training data are replaced with the first intended target answer and labelled positive, while the remaining tokens are labelled negative (A).

The second and auxiliary objective attempts to model our preference for gaps with a restricted number of answers while also ensuring that the original word can be guessed from the context. This is to avoid generating gaps that are too 'open' and therefore ineffective, such as a gap that accepts any noun or adjective. Specifically, we mask the words in the positions that are predicted as gaps by the discriminator and use ELECTRA's generative head to generate the expected words in the blanks (B).

While the input layers are shared between the discriminator and the generator model, the two branches of the system leading to the two objectives are fine-tuned in parallel in a multi-task setting.

Figure 2: Architecture of our multi-objective ELECTRA-based system. The model is simultaneously trained on two objectives: 1) token classification and 2) LM prediction of gapped words.
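To make the two-branch set-up concrete, the following is a minimal PyTorch sketch of the multi-objective fine-tuning, under our own naming (MultiObjectiveElectra, answer_ids, etc. are illustrative assumptions, not the authors' released code). For brevity, the LM loss is computed at the gold gap positions rather than at the positions predicted by the discriminator at each step, as in the full method:

```python
# Minimal sketch, not the released implementation.
# Assumes: pip install torch transformers
import torch
import torch.nn as nn
from transformers import ElectraModel

class MultiObjectiveElectra(nn.Module):
    def __init__(self, name="google/electra-base-discriminator"):
        super().__init__()
        self.encoder = ElectraModel.from_pretrained(name)   # shared input layers
        hidden = self.encoder.config.hidden_size
        self.gap_head = nn.Linear(hidden, 2)                 # branch A: gap / non-gap per token
        self.lm_head = nn.Linear(hidden, self.encoder.config.vocab_size)  # branch B: re-generate the gapped word

    def forward(self, input_ids, attention_mask, gap_labels, answer_ids):
        # gap_labels: 1 for gapped tokens, 0 otherwise (-100 for padding).
        # answer_ids: original token id at gap positions, -100 elsewhere, so the
        # LM loss is only counted for the words that have been gapped.
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        ce = nn.CrossEntropyLoss(ignore_index=-100)
        loss_cls = ce(self.gap_head(h).view(-1, 2), gap_labels.view(-1))
        loss_lm = ce(self.lm_head(h).view(-1, self.lm_head.out_features),
                     answer_ids.view(-1))
        return loss_cls + loss_lm  # the whole model is updated on the sum of both losses
```

In the paper's full set-up, the positions fed to the generator branch are those predicted by the discriminator at each training step, with those positions masked in its input.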
4 Extensions

Our neural transformer-based sequence tagging model can be very effective at proposing potentially good gaps, but the task becomes more challenging when we expect the output to meet additional requirements such as no repetitions, no gap interdependence, a minimum distance between gaps and a varied selection of lexico-grammatical items. We address these issues using two complementary strategies: a manipulation of the loss function and a post-processing module.

4.1 Loss manipulation

In order to spread gaps evenly throughout the text, we modify the token-level loss function of our tagging model by imposing a higher penalty on tokens that are in close proximity to a gap. Let g be the position of a gap in the sequence; then for each token in position i in the proximity of g, i.e. |g − i| < D, the loss l′_i for the token in position i is defined as:

\[ l'_i = l_i \cdot \frac{W}{|g - i|} \tag{1} \]

where W represents the penalty and D is the maximum distance scope for penalisation.³ Equation 1 thus gives more weight to tokens closer to gaps, which results in higher penalisation of their cost functions whenever they are misclassified.

³ We empirically set the values of the constants D and W to 3 and 3.0 respectively.
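As an illustration, Equation 1 can be applied to per-token losses as in the following sketch (the function name is our own; the constants follow footnote 3):

```python
import torch

def reweight_token_losses(token_losses: torch.Tensor,
                          gap_positions: list[int],
                          W: float = 3.0, D: int = 3) -> torch.Tensor:
    """Apply Equation 1: scale the loss of each token within distance D of a
    gap by W / |g - i|.  W = 3.0 and D = 3 follow footnote 3.  If a token is
    near several gaps, the closest one (largest weight) is used."""
    weighted = token_losses.clone()
    for i in range(len(token_losses)):
        dists = [abs(g - i) for g in gap_positions if 0 < abs(g - i) < D]
        if dists:
            weighted[i] = token_losses[i] * (W / min(dists))
    return weighted
```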
4.2 Post-processing

We also employ a post-processing strategy where we replace the gaps that are repeated in the text with better options. We optimise the choice of these alternative gaps by considering the distance between them and the resulting distribution of gaps with different part-of-speech (PoS) tags.

Our post-processing step can be seen as a re-ranking function. The gap candidates that are originally ranked based on the model's confidence scores change their ranking to match other desirable requirements of a well-structured cloze test. If the selected n-best gaps include repetitions, our post-processing algorithm randomly chooses one of them at a time and attempts to replace it with a better alternative.
An alternative gap is deemed better if 1) its answer is not a repetition of another gapped word, 2) its distance to other selected gaps meets the minimum required distance or is higher than the pairwise distances of the originally selected gaps, and 3) it improves the PoS distribution of the gapped words. The PoS distribution of each new selection of gaps is compared to the average gapped PoS distribution of the cloze tests in the training data using Kullback-Leibler (KL) divergence. A combination of gaps that yields lower KL divergence is assumed to be a better solution.
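A minimal sketch of the PoS-distribution check (criterion 3 above), assuming smoothed PoS distributions; the helper names and smoothing constant are our own illustrative choices:

```python
import math
from collections import Counter

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q) over PoS tags; assumes both distributions are smoothed (> 0)."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def pos_distribution(pos_tags: list[str], tagset: list[str],
                     eps: float = 1e-6) -> dict[str, float]:
    counts = Counter(pos_tags)
    total = len(pos_tags) + eps * len(tagset)
    return {t: (counts[t] + eps) / total for t in tagset}

def better_pos_profile(candidate_pos, current_pos, train_avg, tagset) -> bool:
    """Criterion 3: prefer the gap selection whose PoS distribution is closer
    (lower KL divergence) to the training-set average distribution."""
    kl_new = kl_divergence(pos_distribution(candidate_pos, tagset), train_avg)
    kl_old = kl_divergence(pos_distribution(current_pos, tagset), train_avg)
    return kl_new < kl_old
```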
These extensions to the base model bring our final cloze tests closer to those created by human experts by automatically controlling variables that would otherwise need to be adjusted manually. This makes our solution a fully-automated system that can produce ready-to-use cloze tests from an input text passage.
5 Data

To the best of our knowledge, there are no public datasets of full-text open cloze tests that could be used for our task. The CLOTH dataset (Xie et al., 2018), for example, contains gapped passages designed for language learners, but it is primarily focused on reasoning and reading comprehension and uses multiple choice questions where distractors play a major role, making it substantially different to the task we aim to model.

For this reason, we use a collection of expertly created open cloze tests at the B2 CEFR level that was kindly provided by Cambridge University Press & Assessment (CUP&A) for research purposes. Each task consists of a text passage of no more than 300 tokens, a variable number of gaps (between 8 and 16) and a list of valid answers for each gap (between 1 and 7). During the design process, the tasks undergo extensive quality control and pretesting, so their gaps are guaranteed to be very effective at assessing student ability.

For training, we reconstruct the texts by replacing each gap with its first answer and we split the whole collection into train, dev and test sets. Details of our dataset are shown in Table 1.

         Train    Dev     Test
Tasks    356      58      36
Tokens   79,863   12,797  6,621
Gaps     4,565    787     360

Table 1: Number of tasks, tokens and gaps in each section of the data.

Given the lack of publicly available data, we make our test set available with this paper so as to provide a common benchmark for the task and to encourage further research in this area. All the texts were tokenised and parsed using spaCy v2.3⁴.

⁴ https://spacy.io/

6 Experiments

6.1 Setup

We use the pre-trained ELECTRA base discriminator model⁵ with 12 attention heads and 12 hidden layers. Along with all the tokens in the sequences, we also input dependency parsing information to the system. More specifically, we concatenate the ELECTRA representation of each token with the representation of its head in the dependency graph.⁶ On top of the encoding layers, we have two branches that are learned simultaneously (Figure 2).

⁵ https://github.com/huggingface/transformers
⁶ If the token is the head, then its representation is repeated.

The first branch is a simple linear layer that aims to classify each token as a gap or non-gap. For the second branch, we add ELECTRA's generation layer plus a linear layer which aims to predict the best word from the whole vocabulary as an auxiliary task. We are only interested in predicting the answer words for the gaps. Therefore, we change the input to the second branch by masking the words that are predicted as gaps by the first branch at each step of training. We employ cross-entropy loss on each branch and ignore the loss values for the tokens that are not masked in the second branch. The whole architecture is updated based on the sum of the two losses. Fine-tuning parameters are specified in Appendix B.
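The dependency enrichment described above can be sketched as follows; this is a simplified illustration under our own naming, assuming one head index per token aligned with the subword sequence:

```python
import torch

def concat_head_representations(hidden: torch.Tensor,
                                head_idx: torch.Tensor) -> torch.Tensor:
    """hidden: (seq_len, dim) ELECTRA outputs; head_idx: (seq_len,) index of
    each token's dependency head.  A token that is itself the head points to
    its own position, so its representation is repeated (footnote 6).
    Returns a (seq_len, 2 * dim) enriched representation."""
    head_repr = hidden[head_idx]  # gather the representation of each token's head
    return torch.cat([hidden, head_repr], dim=-1)
```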
6.2 Baselines
each gap (between 1 and 7). During the design
process, the tasks undergo extensive quality control We compare our multi-objective ELECTRA model
and pretesting, so their gaps are guaranteed to be to other systems, namely:
very effective at assessing student ability. Random baseline Generates a random set of gaps
For training, we reconstruct the texts by replac- for each task based on the average probability
ing each gap with its first answer and we split the distribution of gapped PoS in the training data.
whole collection into train, dev and test. Details of Exercise Maker Generates gaps using rules and a
our dataset are shown in Table 1. pre-compiled list of commonly gapped words
Given the lack of publicly available data, we from a variety of Cambridge English main
make our test set available with this paper so as suite exams (Malafeev, 2014). Set to FCE
to provide a common benchmark for the task and mode for our experiments.
to encourage further research in this area. All the
texts were tokenised and parsed using spaCy v2.34 . 5
https://github.com/huggingface/trans
formers.
4 6
https://spacy.io/ If the token is the head, then its representation is repeated.

Class  Label                           Description
Good   Good                            The gap is appropriate, i.e. it is expected to be effective during testing.
Bad    Too close to other gaps         The gap is in close proximity to another gap.
Bad    Too many possible answers       The gap allows too many answers (often more than 5).
Bad    Too many gaps of this type      There are many gaps with the same part-of-speech or testing focus in the text.
Bad    Answers can change meaning      The gap can be filled by answers that would change the meaning of the text, e.g. 'and' or 'but'.
Bad    Answers can have different PoS  The gap can be filled by answers that have a different grammatical function, e.g. 'which' or 'and'.
Bad    Gap depends on another          There is some dependency between this gap and another in the text.
Bad    Repeated gap                    There is already another gap testing the same word in the text.
Bad    Phantom gap                     The gap does not require an answer for the text to make sense.
Bad    Unacceptable outlier            The gap does not fit in the text for multiple reasons (e.g. inappropriate difficulty).
Bad    Other (please specify)          Any other reason why the gap is considered unsuitable.

Table 2: Labels used in human annotation.

BERT Predicts potentially good gaps using BERT (Devlin et al., 2019) for token classification. We use the pre-trained base model with standard parameters and fine-tune the weights of the whole architecture.

Standard ELECTRA Similar to BERT, it predicts potentially good gaps using a standard pre-trained ELECTRA-base model. This is a single-objective model that is fine-tuned on token classification only.

Both the random baseline and Exercise Maker attempt to generate the same number of gaps per task as defined in the gold standard, although this is not always possible since the required conditions (such as specific words or PoS) are not always met.
6.3 Evaluation

We report precision (P), recall (R) and F1 scores based on a strict matching between the gaps predicted by our models and those in the gold standard. While this evaluation strategy might seem strict, it has the advantage of being fully automatic, thus avoiding the subjectivity and time required by human evaluation, so we adopt it during development.

In addition to letting the models decide the optimal number of gaps, we also evaluate system performance when we fix the number of predicted gaps for each task to the number of gaps they have in the gold standard. The n-best predicted gaps are chosen based on their confidence scores. In this scenario, P, R and F1 become the same.
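For reference, the strict-match scoring can be expressed in a few lines; this is our own illustrative sketch (gap sets are taken to be sets of token positions), not the authors' evaluation script:

```python
def strict_prf(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Precision, recall and F1 under strict matching of gap positions."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

When the number of predictions is fixed to the gold count, |predicted| = |gold| and hence P = R = F1, as described above.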
We also report human evaluation by three test experts from CUP&A who volunteered for the task. The experts were asked to label each proposed gap in each task of our test set (a total of 360 gaps) as either good or bad and provide a reason and optional comments for their choice. The list of labels available to our annotators is shown in Table 2.

Model                     P      R      F1
Random baseline           15.29  14.87  15.08
Exercise Maker            23.33  25.79  24.50
BERT                      51.16  47.65  49.34
Standard ELECTRA          55.61  46.00  50.35
Multi-objective ELECTRA   57.41  46.25  51.23

Table 3: Models' performance on the development set.

7 Results and Discussion

7.1 Automatic evaluation

We carry out automatic evaluation by computing P, R and F1 on our development set. Table 3 reports the results of our multi-objective ELECTRA model (enhanced with dependency information) as well as the random baseline, Exercise Maker, BERT, and the standard single-task ELECTRA. This is our base model, which does not include any loss manipulation or post-processing. In this setting, the number of predicted gaps was decided by each model based on the confidence scores (> 0.5 for the positive class).

Overall, we observe that performance increases with more sophisticated models. Exercise Maker relies on previously seen gaps and so outperforms the random baseline by a large margin. However, it can only create gaps for the 139 words in its predefined FCE word list, missing gaps that are not on that list. Neural transformer-based models are the best, with improvements over Exercise Maker of at least 25 F1 points on our development set. Although the improvement of our multi-objective ELECTRA model over BERT does not seem to be very significant based simply on P, R and F1, a closer look at the results reveals that BERT produces a much higher number of repeated gaps (25 compared to 9 by multi-objective ELECTRA) as well as more cases of gaps in close proximity, as shown in Figure 3.
[Figure 3: bar chart comparing BERT and multi-objective ELECTRA on the frequency of gap pairs at each distance (0-3); chart not reproduced.]

Figure 3: Frequency of pairs of gaps with distance ranging from 0 to 3. Distance is measured by the number of words in between two gapped words. The minimum acceptable number of words between two gaps is 4.

# of predicted gaps   P      R      F1
As-in-gold            54.26  54.26  54.26
10                    56.72  42.80  48.14
15                    51.49  56.93  54.07
20                    44.83  66.07  53.42
30                    35.63  78.78  49.07

Table 4: Results of multi-objective ELECTRA when we predefine the number of predicted gaps.

                          P      R      F1
Multi-objective ELECTRA   57.41  46.25  51.23
+ loss manipulation       47.87  59.85  53.19
+ post-processing         48.42  60.23  53.68

Table 5: Effect of loss manipulation and post-processing on our multi-objective ELECTRA model.

We also perform an ablation study in Table 3 where we compare our multi-objective ELECTRA model to a standard one that does not include our auxiliary language model objective. Results show that the former outperforms the latter on all metrics, confirming that the addition of the LM objective is clearly beneficial.

Table 4 shows the performance of our multi-objective ELECTRA model as we increase the n-best list of gaps according to their confidence score. The first row indicates the results of the system when it is forced to predict the exact same number of gaps per task as in the gold standard.⁷ This causes P and R to be the same. As we expect, the results show that the number of gaps in the gold data is actually the optimal number to achieve the best F1 score.

⁷ The number of gaps can vary per passage (see Appendix A).

Although our multi-objective model shows good performance based on automatic evaluation, a closer look at the output reveals that the structure of the cloze tests is far from ideal, as they often contain repetitions and gaps that are too close to each other, aspects that are carefully controlled in the gold standard. Table 5 shows that system performance effectively improves as we add the extensions proposed in Section 4, indicating that global aspects of the task are not properly captured by our initial model and require further manipulation.

In order to make the structure of our output as similar as possible to our target tasks, we fix the number of predicted gaps for each task to the number of gaps they have in the gold standard. Note that P and R are the same in this setting, so we only report F1. The effect of this decision is shown in Table 6. We can see that adding loss manipulation to our model decreases the number of adjacent gaps from 40 to 23, but increases the number of repeated gapped words from 18 to 33. The decline in the restricted F1 based on automatic evaluation is not favourable, but we accept this sacrifice as the price of a better-structured final test.

After adding post-processing for repeated gaps, we observe that, although overall F1 performance drops slightly, the number of repeated gapped words decreases favourably from 33 to 9 (Table 6). It also creates a better spread of gaps, as shown by a lower KL-divergence between the average PoS distribution of the output and that of the gold standard (0.55 with post-processing as opposed to 0.59 without it). Post-processing also removes two cases in the development set where the gaps do not meet the minimum 4-word distance.

It is worth recalling that these extensions are highly effective when we do not restrict the number of predicted gaps. Table 5 shows that they significantly improve R, which results in higher overall F1.

As a result of these experiments, we keep our post-processing approach for the rest of our experiments and use it to produce the output submitted for human annotation.

7.2 Human evaluation

Following our intuition that test experts could find more value in our system than initially shown by our automatic evaluation, we asked a panel of three test experts to judge the quality of the gaps produced by our extended model on the test set. Inter-annotator agreement on gap classification (good/bad) was found to be moderate (percent agreement is 75.93%, Randolph's free-marginal kappa is 0.52 (Randolph, 2005)).
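For reference, Randolph's free-marginal kappa follows directly from the observed percent agreement \(\bar{P}_o\); with q = 2 rating categories (good/bad):

\[
\kappa_{\text{free}} = \frac{\bar{P}_o - \frac{1}{q}}{1 - \frac{1}{q}} = \frac{0.7593 - 0.5}{1 - 0.5} \approx 0.52,
\]

which matches the reported value.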
                          Restricted F1   Repeated gaps   Adjacent gaps
Multi-objective ELECTRA   54.26           18              40
+ loss manipulation       51.59           33              23
+ post-processing         51.33            9              23

Table 6: Analysis of our model after adding extensions: loss manipulation and post-processing.

Auto    Ann. 1   Ann. 2   Ann. 3
53.89   82.50    75.83    77.50

Table 7: Accuracy of our extended multi-objective ELECTRA model based on automatic and human evaluation on the test set.

Too close to other gaps          42.43%
Unacceptable outlier             32.47%
Too many gaps of this type        6.92%
Other                             4.77%
Gap depends on another            4.32%
Phantom gap                       3.90%
Answers can have different PoS    2.17%
Answers can change meaning        1.29%
Too many possible answers         0.87%
Repeated gap                      0.87%

Figure 4: Average frequency of the reasons given by the annotators for rejecting a gap.

Unlike in automatic evaluation, we only report accuracy for our human experiment. System performance using automatic and human evaluation is compared in Table 7 (reported individually for each annotator). These results show that performance increases dramatically when the output is judged by human experts, confirming our suspicion that performance is underestimated by automatic evaluation and that there are many other words in the texts that could constitute equally useful gaps apart from those in the gold standard. With system accuracy ranging between 75% and 82% for human judgements, we can conclude that at least 7 out of 10 gaps proposed by our system are considered good by our experts.

We observed that differences between annotators' judgements and the gold standard can occur for many reasons, e.g.:

• non-gaps in the gold standard are not necessarily bad gaps,
• gold standard gaps are derived from pilot testing while annotators' gaps are derived from their expertise,
• previous judgements by the annotators can affect the judgement of new gaps (e.g. choosing the best of two close gaps), etc.

Annotator accuracy against the gold standard ranges between 50% and 60%.

Following our classification in Table 2, we analysed the reasons why some gaps were not considered good by the annotators. Figure 4 shows the average frequency of the different reasons given by the annotators for rejecting a gap proposed by our system. Examples are included in Appendix C.

The most frequent reason is the violation of the minimum required distance between two gaps (42.43%). Although our loss-manipulation approach was successful in reducing these cases, we did not attempt to eradicate them completely, since there are many factors at play when choosing more appropriate gaps than just distance. In many cases, gaps in close proximity test different words in the same phrase (e.g. take part in, in addition to, etc.), so we preferred to keep these cases and encourage annotators to comment on their preferences. Repetitions, on the contrary, are much better handled, accounting for only 0.87% of all bad gaps.

The second most frequent reason is 'unacceptable outlier' (32.47%), which normally accounts for cases where the difficulty of the gap is considered inappropriate for the target proficiency level (B2 in this case). This is an interesting phenomenon, since the fact that the text as a whole pertains to a given CEFR level does not guarantee that the gaps created will always be appropriate for that level. The remaining reasons are substantially less frequent than the first two and mostly relate to aspects that were not explicitly controlled in our models, except for the third topmost reason ('Too many gaps of this type'), which we did control by comparing PoS distributions. These results show that our system is able to capture many aspects of the task that were not explicitly modelled.

Finally, we compared system accuracy per task computed from annotators' judgements vs. the gold standard. Average correlation across all annotators was found to be very weak (Pearson's r = 0.0558, Spearman's ρ = 0.1474). This suggests that automatic scores are not a good proxy for human perception, with experts being much more positive about our model's output (as shown in Table 7).

7.3 Predictions by Gapped Word Frequency

We found that our model does not overfit to words that are most frequently gapped in the training data, with correlation between gapped word frequency and F1 scores in the test set being negligible (Pearson's r = 0.0108, Spearman's ρ = 0.0915).
Gardening
It is early summer , the season of abundance , when my garden is at its fullest . Flowers are in bloom and the grass is
growing so fast that half an hour after cutting it , I seem to be back where I started . This year for the first time I am
attempting to grow my own vegetables , an attempt that has so far proved very successful . My vegetable plants have been
yielding an abundance of produce , in fact much more than I can possibly consume myself . I ’m convinced that you cannot
plant even a single tomato without feeling a connection to the earth and to the countless generations who have worked the
land before you . To plant seeds and then to harvest what you have grown gives a deep sense of satisfaction . I believe that
many doctors and mental health organisations all around the world now recognise the value of gardening to the well-being
of those who take part in this activity .

Figure 5: Sample output of our extended ELECTRA model. Darker shades of red indicate higher confidence in
inserting a gap. Predicted gaps are framed in black while gold standard gaps are in yellow font.

PoS     Proportion in TEST   P      R       F1
ADP     20.59%               50.00   43.24  46.38
ADV     14.17%               57.69   58.82  58.25
DET     13.89%               56.41   44.00  49.44
SCONJ   13.89%               59.09   78.00  67.24
AUX     10.83%               45.83   28.21  34.92
PRON     9.44%               47.92   67.65  56.10
ADJ      4.44%               60.00   75.00  66.67
NOUN     3.33%               77.78   58.33  66.67
NUM      2.78%               61.54   80.00  69.57
CCONJ    2.50%               55.56   55.56  55.56
VERB     2.22%               50.00   50.00  50.00
PART     1.67%                0.00    0.00   0.00
INTJ     0.28%               50.00  100.00  66.67

Table 8: Performance by PoS on the test set based on automatic evaluation.

Interestingly, while our model was unable to predict gaps not previously seen in the training data (turned, amount, pushed and started), it did predict a (previously unseen) gap for the word fewer, which did not match the gold standard but was unanimously deemed good by our annotators.

7.4 Predictions by PoS

We also classified predictions based on their PoS tags⁸ and report performance in Table 8. The most frequently gapped PoS tags in our datasets correspond to closed word classes (such as ADP, DET, SCONJ, AUX, etc.), which is expected given that our open cloze tests are mostly focused on testing grammar rather than vocabulary. The best predicted classes, however, are NUM, SCONJ, NOUN, ADJ and INTJ, which on closer inspection turn out to be very restricted classes: NUM includes only the word one, INTJ only the word like, SCONJ only a few subordinating conjunctions, while NOUN and ADJ, despite being open classes, are limited to words used in common constructions such as order (in order to) or same (the same).

⁸ Using the Universal Dependencies tagset: https://universaldependencies.org/u/pos/

The two worst performing classes are PART (the particles to and not) and AUX (auxiliary verbs) and, once again, we conjecture that these words are so common in the language and in non-gapped positions that the model is unable to get them right most of the time. The remaining PoS classes vary in performance, but we found only very weak correlation between PoS gap frequency in the test set and F1 scores (Pearson's r = 0.1932, Spearman's ρ = 0.1350).

When we look at human annotations on the test set, however, performance by PoS is consistently higher and more even across the board. If we require that gaps are rated 'good' by at least two annotators, accuracy values range between 75% and 100% for all PoS, with a mean of 85%.

Under these conditions, the best performing classes are NOUN (100%), INTJ (100%) and ADJ (95%), which agree with automatic evaluation. Out of these, only NOUN achieves perfect accuracy across all annotators. The worst performing classes are PRON (77%), NUM (77%) and VERB (75%), as opposed to the previous AUX and PART counterparts (now 79% and 83% respectively). When we require agreement by all annotators, the worst overall class is CCONJ, with 44%.

7.5 Qualitative Analysis

Figure 5 shows the output of our model for a sample text passage, where darker red indicates higher confidence in inserting a gap. The final model's
predictions have a black frame (at, in, so, after, etc.) while the gold standard gaps are in yellow font (at, in, so, etc.). There are 8 matched gaps out of 11 in this example, yielding 72.73% accuracy.

As can be seen in the figure, our model is able to identify appropriate gap candidates, even if they do not match the gold standard. In fact, annotators considered all the unmatched gaps in this example (after, for and take) to be good and the second matched gap (in) to be inappropriate. It is also interesting to see how the model prioritises function words and content words that are highly restricted in context (such as take or part), skilfully avoiding general gaps that could accept multiple answers and would be less effective for testing purposes.

8 Conclusion and Future Work

We described the first transformer-based approach to open cloze test generation. Our ELECTRA-based model is trained on two objectives: token classification (gap/non-gap) and language modelling (for predicting the expected answer). The model is further improved by manipulating the loss function and post-processing the results.

System accuracy using automatic evaluation is 53.89%, while human evaluation ranges between 75% and 82%, showing that at least 7 out of 10 predicted gaps are considered useful by experts. A detailed analysis of results reveals a few structural problems such as gaps in close proximity and inappropriate difficulty, which we plan to address in future work. Our test data and human annotations are released with this paper.

Acknowledgements

The authors are immensely grateful to Louise Gilbert, Sally Moore and Clare Williams from CUP&A for their annotations. This paper reports on research supported by Cambridge University Press & Assessment, University of Cambridge.

References

Maria Chinkina, Simón Ruiz, and Detmar Meurers. 2017. Automatically generating questions to support the acquisition of particle verbs: Evaluating via crowdsourcing. In CALL in a climate of change: adapting to turbulent global conditions – short papers from EUROCALL 2017, pages 73–78.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.

Council of Europe. 2001. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press, Cambridge.

Bidyut Das and Mukta Majumder. 2017. Factual open cloze question generation for assessment of learner's knowledge. International Journal of Educational Technology in Higher Education, 14:1–12.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Mariano Felice and Paula Buttery. 2019. Entropy as a proxy for gap complexity in open cloze tests. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 323–327, Varna, Bulgaria. INCOMA Ltd.

Rüdiger Grotjahn, Christine Klein-Braley, and Ulrich Raatz. 2002. C-tests: an overview. In James A Coleman, Rüdiger Grotjahn, and Ulrich Raatz, editors, University language learning and the C-Test, pages 93–114. AKS-Verlag, Bochum, Germany.

Ghader Kurdi, Jared Leo, Bijan Parsia, Uli Sattler, and Salam Al-Emari. 2020. A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30(1):121–204.

Ji-Ung Lee, Erik Schwan, and Christian M. Meyer. 2019. Manipulating the difficulty of C-tests. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 360–370, Florence, Italy. Association for Computational Linguistics.

Alexey Malafeev. 2014. Language exercise generation: Emulating Cambridge open cloze. International Journal of Conceptual Structures and Smart Applications, 2(2):20–35.

Edison Marrese-Taylor, Ai Nakajima, Yutaka Matsuo, and Ono Yuichi. 2018. Learning to automatically generate fill-in-the-blank quizzes. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pages 152–156, Melbourne, Australia. Association for Computational Linguistics.

Arya D. McCarthy, Kevin P. Yancey, Geoff T. LaFlair, Jesse Egbert, Manqian Liao, and Burr Settles. 2021. Jump-starting item parameters for adaptive language tests. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 883–899, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jack Mostow, Yi-Ting Huang, Hyeju Jang, Anders Weinstein, Joe Valeri, and Donna Gates. 2017. Developing, evaluating, and refining an automatic generator of diagnostic multiple choice cloze questions to assess children's comprehension while reading. Natural Language Engineering, 23(2):245–294.

Association of Language Testers in Europe (ALTE). 2005. Materials for the guidance of test item writers. Technical report, Association of Language Testers in Europe.

Association of Language Testers in Europe (ALTE). 2011. Manual for language test development and examining. Technical report, Association of Language Testers in Europe.

Juan Pino and Maxine Eskenazi. 2009. Measuring hint level in open cloze questions. In Proceedings of the Twenty-Second International Florida Artificial Intelligence Research Society Conference (FLAIRS), Sanibel Island, Florida, USA. AAAI Press.

Juan Pino, Michael Heilman, and Maxine Eskenazi. 2008. A selection strategy to improve cloze question quality. In Intelligent Tutoring Systems for Ill-Defined Domains: Assessment and Feedback in Ill-Defined Domains, page 22.

Justus J. Randolph. 2005. Free-marginal multirater kappa (multirater κfree): An alternative to Fleiss' fixed-marginal multirater kappa. Online submission.

Burr Settles, Geoffrey T. LaFlair, and Masato Hagiwara. 2020. Machine learning-driven language assessment. Transactions of the Association for Computational Linguistics, 8:247–263.

Tasanawan Soonklang, Sunee Pongpinigpinyo, Weenawadee Muangon, and Sirak Kaewjamnong. 2017. Automatic question generation system for English exercises for secondary students. In Proceedings of the 25th International Conference on Computers in Education (ICCE 2017), pages 890–895, New Zealand. Asia-Pacific Society for Computers in Education.

Wilson L. Taylor. 1953. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433.

Jonathan Trace. 2020. Clozing the gap: How far do cloze items measure? Language Testing, 37(2):235–253.

Annie Tremblay. 2011. Proficiency assessment standards in second language acquisition research: "Clozing" the gap. Studies in Second Language Acquisition, 33(3):339–372.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS) 30, pages 5998–6008. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2018. Large-scale cloze test dataset created by teachers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2344–2356, Brussels, Belgium. Association for Computational Linguistics.

Albert C. M. Yang, Irene Y. L. Chen, Brendan Flanagan, and Hiroaki Ogata. 2021. Automatic generation of cloze items for repeated testing to improve reading comprehension. Educational Technology & Society, 24(3):147–158.
A Dataset composition

# Gaps   Train   Dev   Test
8        4       0     0
9        39      2     9
10       71      2     18
11       38      12    9
12       4       6     0
13       58      9     0
14       10      2     0
15       0       0     0
16       132     25    0
Total    356     58    36

Table A.1: Distribution of the number of gaps per task in each section of the data.

# Answers   Train   Dev   Test
1           3637    639   296
2           689     111   45
3           147     23    16
4           77      11    2
5           11      3     1
6           3       0     0
7           1       0     0
Total       4565    787   360

Table A.2: Distribution of the number of answers per gap in each section of the data.

B Model parameters

Parameters         Multi-objective ELECTRA   BERT
Learning rate      3 × 10⁻⁵                  3 × 10⁻⁵
Batch size         1                         1
Number of epochs   4                         4
Training steps     (n/b) × e                 (n/b) × e

Table B.1: Model parameters used for the experiments. n: the number of training examples; b: batch size; e: number of epochs.

C Human labelling examples

Too close to other gaps
  Example: ... the thousands of questions I asked as a child were met not by impatient answers ...
  Comment: Minimum distance is not met.

Too many possible answers
  Example: ... and does not sound threatening.
  Comment: Many verbs could fit in this gap: does, may, might, should, will, etc.

Too many gaps of this type
  Example: ... the country where the largest number of bamboo varieties grow naturally ...
  Comment: Too many relative pronouns are tested in the task.

Answers can change meaning
  Example: ... the petals were narrower and less clearly separated ...
  Comment: The word more also fits.

Answers can have different PoS
  Example: The Indian bansuri bamboo flute, when played by a master musician, ...
  Comment: Other possible answers are often, usually, normally, etc.

Gap depends on another
  Example: What I love most about being on a horse is that ...
  Comment: The second gap depends on the first.

Repeated gap
  Example: ..., although she later became a biologist.
  Comment: The task has another gap where although is a possible answer.

Phantom gap
  Example: The name actually refers to the statuette which all of the winners receive.
  Comment: Which can be omitted.

Unacceptable outlier
  Example: The school was by no means an overnight success; ...
  Comment: The phrase by no means is at the C1 CEFR level.

Other (please specify)
  Example: It is sometimes said that animals use language.
  Comment: Avoid having a gap for the very first word in the text.

Table C.1: Examples of the different reasons given by the annotators for rejecting a gap proposed by our system.
