Automated Question Generation and Question Answering From Turkish Texts
© TÜBİTAK
1 OBSS AI, Ankara, Turkey
2 Graduate School of Informatics, METU, Ankara, Turkey
3 Computer Engineering, METU, Ankara, Turkey
Abstract: While exam-style questions are a fundamental educational tool serving a variety of purposes, manual
construction of questions is a complex process that requires training, experience and resources. Automatic question
generation (QG) techniques can be utilized to satisfy the need for a continuous supply of new questions by streamlining
their generation. However, compared to automatic question answering (QA), QG is a more challenging task. In this work,
we fine-tune a multilingual T5 (mT5) transformer in a multi-task setting for QA, QG and answer extraction tasks using
Turkish QA datasets. To the best of our knowledge, this is the first academic work that performs automated text-to-text
question generation from Turkish texts. Experimental evaluations show that the proposed multi-task setting achieves
state-of-the-art Turkish question answering and question generation performance on the TQuADv1 and TQuADv2 datasets and the XQuAD Turkish split. The source code and the pre-trained models are available at https://github.com/obss/turkish-question-generation.
Key words: Turkish, question answering, question generation, answer extraction, multi-task, transformer
1. Introduction
Question Generation (QG) is the task of generating questions from a given context and, optionally, some answers.
Research on QG has been growing rapidly as the task gains popularity in education [11], [13], in commercial applications such as chatbots and dialogue systems [12], [24], and in healthcare [30].
Early works in QG were mainly based on sophisticated, human-designed syntactic rules that transform a declarative sentence into the corresponding question. These approaches relied mainly on handcrafted feature
extraction from documents. A method for generating multiple-choice tests from instructional documents (e.g.,
textbooks or encyclopedias) was proposed in [16]. In this work, domain-specific terms were extracted using
the term frequency approach, and the sentences including the retrieved terms were transformed into questions
using the parsed syntactic information of the sentence. In [7], the input text is first simplified with a set of
transformations to produce multiple declarative sentences. Then, a declarative sentence is transformed into
a set of possible questions by syntactic and lexical transformations. However, since they rely on rule-based transformations, these methods are not readily applicable to other languages and question styles. A preliminary
work provides an implementation plan for rule-based question generation from Turkish texts using syntactic
(constituent or dependency) parsing and semantic role labeling systems [22]. In the QG part, manually generated
∗ Correspondence: [email protected]
This work is licensed under a Creative Commons Attribution 4.0 International License.
templates and rules are used. However, the proposed method is not fully automated, given the manual selection of templates and its rule-based nature. Moreover, the paper does not provide sufficient technical detail, and no follow-up paper describing the planned implementation is available.
Recently, many neural network based techniques have been proposed for QG. An encoder-decoder architecture of an LSTM based seq2seq model is adopted in [5]. Both the input sentence and the paragraph containing the sentence are encoded via separate bidirectional LSTMs [8] and then concatenated. This representation is then fed into the decoder, which is a left-to-right LSTM, to generate the question. The decoder
learns to use the information in more relevant parts of the encoded input representation via an attention layer.
Later models included the target answer in the input to avoid questions that are too short and/or too broadly targeted, such as ”What is mentioned?”. Some models achieved this by either treating the answer’s position as an
extra input feature [32], [31] or by encoding the answer using a separate network [6], [9]. Moreover, position
embeddings have been used to give more attention to the context words that are closer to the answer. Some utilized
additional decoders to predict the question words (when, how, why, etc.) before generating the question [25].
LSTM based seq2seq models struggle to capture the paragraph-level context that is needed to generate high-quality questions. The seq2seq model was extended with answer tagging, a maxout pointer mechanism and a gated self-attention encoder in [31]. A multi-stage attention mechanism that links long document context with the targeted answer is used in [26].
Transformer based models have lately been dominating NLP research, including research on QG. These models are capable of capturing longer and more comprehensive contexts than their predecessors, mainly LSTM based seq2seq models. Named Entity Recognition (NER) is used in [10] as a preprocessing task before the application of a transformer based model: a variety of named entities are first extracted from the input text and then replaced with named entity tags for better generalization. Superior performance has been reported by QG approaches built on large transformer based language models (LMs). A pre-trained BERT was adopted for QG in [4], where three models were proposed: BERT for QG, sequential question generation with BERT using the previously decoded results, and finally highlighting the answer in the context, which yielded a performance improvement. A pre-trained GPT-2 for QG is used in [15] in a straightforward way by preparing inputs in a
particular format. The model is evaluated in several ways such as context-copying, failures on constructing
questions, robustness and answer-awareness.
An example question generation project using transformers in a specific framework is available1, but its sentence tokenization pipeline is specific to the English language, it presents results only on an English dataset, it supports a limited set of input types (highlight and prepend), and it does not have a peer-reviewed publication.
Moreover, there is a publicly shared work based on fine-tuning the mT5-small model [29] on a Turkish dataset for the question generation task2; however, its sentence tokenization pipeline is not adapted to the Turkish language, it is not clear whether the validation set is included in the training, it does not present any evaluation results, it supports a limited set of input types (only highlight), and it does not have a peer-reviewed publication.
In this work, in order to fully automate the question generation process from Turkish texts using a single
model, we propose a multi-task fine-tuning of mT5 model [29]. To the best of our knowledge, this is the first
comprehensive academic work that performs automated text-to-text question generation from Turkish texts.
1 Neural question generation using transformers (2020). Website https://github.com/patil-suraj/question_generation [accessed 04.07.2021].
2 Turkish Multitask MT5 (2021). Website https://github.com/ozcangundes/multitask-question-generation [accessed 04.07.2021].
The main contributions can be summarized as follows: the adaptation of a sentence tokenization pipeline for the highlight input format, and the benchmarking of the mT5 model for Turkish question generation and question answering on the TQuADv1, TQuADv2 and XQuAD datasets in multi-task and single-task settings with different input formats (highlight/prepend/both).
The model we explored, mT5, is a variant of T5 [19], which is a flexible Transformer model used in
sequence-to-sequence NLP tasks. T5 is an encoder-decoder style language model whose architecture closely
follows the original Transformer [27]. It is pre-trained with a “span-corruption” objective, a special type of masked language modeling. In this scheme, consecutive input token spans are replaced with a mask token and the model is asked to reconstruct the original tokens in the spans as the training objective. During fine-tuning, various distinct NLP tasks such as classification and generation are formulated in a common text-to-text format in a multi-task learning setting. The main difference of mT5 is that it was trained on the mC4 dataset, comprising natural text in 101 languages collected from the public Common Crawl web scrape. Being trained on multiple tasks and
multiple languages, it can readily be fine-tuned on QA, QG and answer extraction tasks in the Turkish language after converting the datasets to the common text-to-text format. As shown in Figure 1 (top), the QA task uses a context and question pair as input and the answer as target, QG uses the answer-highlighted context as input and the question as target, and answer extraction uses the sentence-highlighted context as input and a separator-joined answer list as target. This approach does not require an external answer extraction model or human effort to label the
answers since the same model is used to extract the answers (corresponding to one of the potential questions)
from the context as shown in Figure 1 (bottom).
Figure 1. Multi-task fine-tuning of the multilingual pre-trained mT5 model (top). The same fine-tuned model is then
used for both the answer extraction and question generation tasks (bottom).
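To make the text-to-text formats in Figure 1 concrete, the snippet below sketches one illustrative input/target pair per task. The exact prefix, highlight and separator strings (e.g. "generate question:", <hl>, <sep>) follow the description above but are assumptions; the released repository may use slightly different spellings.

```python
# Illustrative text-to-text input/target pairs for the three tasks.
# Token spellings (prefixes, <hl>, <sep>) are assumptions made for this sketch.
context = "Mustafa Kemal Atatürk, Türkiye Cumhuriyeti'nin kurucusudur."

# Question answering: context + question -> answer
qa_input = f"question: Türkiye Cumhuriyeti'nin kurucusu kimdir? context: {context}"
qa_target = "Mustafa Kemal Atatürk"

# Question generation: answer-highlighted context -> question
qg_input = ("generate question: <hl> Mustafa Kemal Atatürk <hl>, "
            "Türkiye Cumhuriyeti'nin kurucusudur.")
qg_target = "Türkiye Cumhuriyeti'nin kurucusu kimdir?"

# Answer extraction: sentence-highlighted context -> separator-joined answer list
ae_input = f"extract answer: <hl> {context} <hl>"
ae_target = "Mustafa Kemal Atatürk <sep>"
```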
2. Proposed Approach
To achieve a fully automated question generation pipeline, we assume that the answer may not be given in the generation phase, and thus we also train the model to find an answer a (corresponding to one of the potential questions), which is a span in the given context c. The task is formulated as in Eq. 1, where q denotes the question targeting the answer a and c′ is the context c with highlight tokens marking the sentences containing answers.

P(q, a | c) = P(a | c′) · P(q | a, c)    (1)

The answer extraction task is formulated as P(a | c′), where context and answer pairs are taken from the {context, question, answer} triplets of a SQuAD style dataset. The context c is first preprocessed to highlight the target answers, and the preprocessed context c′ is used as the input while the answers are used as the target in training.
The question generation task is formulated as P(q | a, c), where {answer, context} pairs are used as input and, in training, the question for the given answer is used as the target. If an answer is provided with the context, the answer extraction step is skipped; otherwise, answer extraction is performed before question generation.
When providing the inputs to the text-to-text transformer, different parts of the input c, a, q are separated by a separator. In both the single-task and multi-task QG settings, we apply three different input format styles: prepend, highlight and both. In the prepend format, we prepend the base input text with a task-specific prefix as in T5 [19]. For example, for the QG task we prepend the base input with the “generate question: ” prefix. In the highlight format, we wrap the answer text in the context with a special <hl> (highlight) token similar to [29]. The both input format combines the prepend and highlight input formats.
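A minimal sketch of the three input format styles is given below. The exact prefix and separator strings are assumptions made for this example and may differ from the released implementation.

```python
def format_qg_input(context: str, answer: str, style: str = "both") -> str:
    """Build a QG model input in the prepend, highlight or both style.
    Prefix and <hl> token spellings are illustrative assumptions."""
    highlighted = context.replace(answer, f"<hl> {answer} <hl>", 1)
    if style == "prepend":    # task prefix, answer passed explicitly
        return f"generate question: answer: {answer} context: {context}"
    if style == "highlight":  # answer marked with <hl> tokens only
        return f"context: {highlighted}"
    if style == "both":       # task prefix combined with <hl> highlighting
        return f"generate question: context: {highlighted}"
    raise ValueError(f"unknown input format style: {style}")
```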
In the single-task setting, we modify each sample to train separate models for the QA and QG tasks. In the generation phase, the QA task requires a context and a question, and the QG task requires a context and an answer as input. In the multi-task setting, we train the model to perform the answer extraction, question generation and question answering tasks simultaneously. For the answer extraction task, we put highlight tokens at the start and end of the sentence that contains the answer to be extracted. For question generation, the answer of the question to be generated is highlighted [29]. Moreover, we prepend “question”, “context”, “generate question” and “extract answer” tokens before each sample to help the model distinguish one task from another. In Figure 1, the input and target formats of the model during fine-tuning are presented.
In the single-task setting, for the QG task, the answer always needs to be provided along with the context, whereas in the multi-task setting the answer is not strictly required to be given.
We adopt the answer-aware question generation methodology [25], where the model requires both the
context and answer to generate questions. Use of the same model for automatic answer extraction in the
multi-task setting eliminates the need for manual highlighting of the answer and enables end-to-end question
generation from raw text.
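A minimal inference sketch of this end-to-end use is shown below, assuming a fine-tuned multi-task mT5 checkpoint loaded through the Transformers library; the checkpoint path, token spellings and generation settings are placeholders, not the released model names.

```python
from transformers import MT5ForConditionalGeneration, MT5TokenizerFast

# "path/to/finetuned-multitask-mt5" is a placeholder for a fine-tuned checkpoint.
model = MT5ForConditionalGeneration.from_pretrained("path/to/finetuned-multitask-mt5")
tokenizer = MT5TokenizerFast.from_pretrained("path/to/finetuned-multitask-mt5")

def generate(text: str) -> str:
    # Encode the text-to-text input and decode the model's generation.
    input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    output_ids = model.generate(input_ids, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

context = "Mustafa Kemal Atatürk, Türkiye Cumhuriyeti'nin kurucusudur."

# Step 1, P(a | c'): extract candidate answers from the sentence-highlighted context.
answers = generate(f"extract answer: <hl> {context} <hl>").split("<sep>")
answer = answers[0].strip()

# Step 2, P(q | a, c): generate a question for the first extracted answer.
question = generate(
    "generate question: " + context.replace(answer, f"<hl> {answer} <hl>", 1)
)
print(answer, "->", question)
```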
It should also be noted that adapting the current scheme to another language involves putting highlight tokens between sentences. However, this may not be straightforward due to the language-dependent nature of sentence tokenization: we needed to carefully design a proper sentence tokenization step, manually handling edge cases, mostly caused by abbreviations, to mark the end of a sentence correctly in Turkish text. There is a wide range of sentence tokenization options for English text; however, there is no directly available sentence tokenization tool for Turkish text. We used the open-source TrTokenizer package3 as the base sentence tokenization tool and extended it to handle additional edge cases.
3 TrTokenizer (2020) Sentence and word tokenizers for the Turkish language https://github.com/apdullahyayik/TrTokenizer
[accessed 25.07.2020]
Edge cases such as ”Ar. Gör.”, ”(d. 998 - ö. 1068)” and ”Ömer b. Abdülazīz” are then handled by regular expression based operations. The adapted Turkish sentence tokenization based answer highlighting, together with the extended edge cases, has been open-sourced in the project repository.
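A simplified sketch of this abbreviation-aware sentence splitting is shown below; the abbreviation list is a small illustrative subset, whereas the released pipeline builds on TrTokenizer and covers many more cases.

```python
import re

# A few Turkish abbreviation patterns whose dots must not end a sentence
# (illustrative subset only).
ABBREVIATIONS = [r"Ar\.\s*Gör\.", r"Prof\.", r"Dr\.", r"vb\.", r"\bb\.", r"\bd\.", r"\bö\."]
_PROTECT = re.compile("|".join(ABBREVIATIONS))

def split_sentences(text: str) -> list:
    # Mask the dots of known abbreviations so they are not treated as
    # sentence terminators, split on sentence-final punctuation, restore the dots.
    masked = _PROTECT.sub(lambda m: m.group(0).replace(".", "<dot>"), text)
    parts = re.split(r"(?<=[.!?])\s+", masked)
    return [p.replace("<dot>", ".").strip() for p in parts if p.strip()]

print(split_sentences(
    "Ömer b. Abdülazīz (d. 998 - ö. 1068) önemli bir alimdir. Eserleri günümüze ulaşmıştır."
))
# ['Ömer b. Abdülazīz (d. 998 - ö. 1068) önemli bir alimdir.', 'Eserleri günümüze ulaşmıştır.']
```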
3. Experiments
We first fine-tune BERTurk [21] and mT5 models on the TQuADv2 training split to obtain the base models. Then F1 and EM scores are calculated on the TQuADv2 validation split and the XQuAD Turkish split for experimental evaluation. All the experiments have been performed on an NVIDIA A100 GPU with 80 GB VRAM using the Transformers Trainer [28] on a PyTorch [18] backbone.
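A condensed sketch of this fine-tuning setup with the Hugging Face Seq2SeqTrainer is given below; the data file name, column names and sequence lengths are assumptions for illustration, while the model name and the best hyper-parameters match those reported later in this section.

```python
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, MT5ForConditionalGeneration,
                          MT5TokenizerFast, Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = MT5TokenizerFast.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def preprocess(batch):
    # "source_text"/"target_text" are assumed column names holding the
    # task-formatted inputs and targets described in Section 2.
    enc = tokenizer(batch["source_text"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(batch["target_text"], truncation=True, max_length=64)["input_ids"]
    return enc

# "tquadv2_multitask_train.json" is a placeholder for the preprocessed multi-task data.
train_ds = Dataset.from_json("tquadv2_multitask_train.json").map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-tquadv2-multitask",
    learning_rate=1e-3,          # best setting found for mT5-small (AdamW is the default)
    num_train_epochs=15,
    per_device_train_batch_size=8,
)
Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
).train()
```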
3.1. Datasets
TQuAD 4 (TQuADv1) is a Turkish QA dataset on Turkish & Islamic Science History that was published within
the scope of Teknofest 2018 Artificial Intelligence competition. TQuADv2 dataset [23] extended the number of
question-answer pairs along with the number of subjects by adding additional paragraphs and question-answer
pairs to TQuADv1, i.e. TQuADv1 ⊂ TQuADv2. Both of these datasets have the same structure as SQuAD [20].
XQuAD [2] is a multilingual QA dataset in Arabic, Chinese, German, Greek, Hindi, Russian, Spanish,
Thai, Turkish and Vietnamese languages. It consists of samples professionally translated from the SQuAD 1.1
validation set. The Turkish split of XQuAD, namely XQuAD.tr, is used to evaluate the fine-tuned models; for brevity, we denote it as XQuAD in the remainder of this paper.
The details of these datasets are provided in Table 1. The training sets are used for training the models and for hyper-parameter tuning, while the validation sets are used for performance evaluation only. Some examples are presented
in Appendix A.
We experimentally evaluated mT5 [29] against BERTurk [21] and, to have a fair comparison, we performed hyper-parameter tuning. For both models, we used grid search to select the best optimizer type (Adafactor, AdamW), initial learning rate (1e-3, 1e-4, 1e-5) and number of training epochs (1, 3, 5, 8, 10, 15, 20). The BERTurk-base language model [21] has been fine-tuned for the QA task on the TQuADv2 training split, and F1 and EM scores have been calculated on the TQuADv2 validation split and the XQuAD Turkish split. We selected the set of parameters which attain the overall best scores in all metrics: the AdamW optimizer with a learning rate of 1e-4 and 3 training epochs for BERTurk (shown in Table 2).
4 TQuad (2019) Turkish NLP Q&A Dataset https://github.com/TQuad/turkish-nlp-qa-dataset [accessed 04.07.2021]
Table 2. QA scores of the best performing hyper-parameter combination for BERTurk: AdamW with initial learning rate of 1e-4 and 3 epochs.
Table 3. QA scores of the best performing hyper-parameter combination for mT5-small: AdamW with initial learning rate of 1e-3 and 15 epochs.
Table 4. QG scores of the best performing hyper-parameter combination for mT5-small: AdamW with initial learning rate of 1e-3 and 15 epochs.
Similarly, the mT5-small language model [29] has been fine-tuned in a multi-task setting on the TQuADv2 training split. Then F1 and EM scores for QA samples and BLEU and ROUGE scores for QG samples have been calculated on the TQuADv2 validation split and the XQuAD Turkish split. The QA and QG results of the best performing combination can be seen in Tables 3 and 4, respectively. We selected the set of parameters which attain the overall best scores in all metrics: the AdamW optimizer with a learning rate of 1e-3 and 15 epochs. For the remainder of the experiments, these fine-tuned BERTurk and mT5 models with the determined sets of parameters have been used.
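The grid search described above amounts to a simple loop over the candidate settings; in the sketch below, fine_tune_and_score is a hypothetical placeholder for one fine-tuning plus evaluation run (e.g. the Trainer sketch earlier in this section followed by F1 scoring on the validation split), not a function from the released code.

```python
import itertools

def fine_tune_and_score(optimizer: str, learning_rate: float, epochs: int) -> float:
    # Hypothetical placeholder: fine-tune with the given settings and return
    # the validation F1 score.
    raise NotImplementedError

grid = itertools.product(
    ["adafactor", "adamw"],       # optimizer types
    [1e-3, 1e-4, 1e-5],           # initial learning rates
    [1, 3, 5, 8, 10, 15, 20],     # numbers of training epochs
)
best_score, best_cfg = float("-inf"), None
for optimizer, lr, epochs in grid:
    score = fine_tune_and_score(optimizer, lr, epochs)
    if score > best_score:
        best_score, best_cfg = score, (optimizer, lr, epochs)
print("best configuration:", best_cfg, "F1:", best_score)
```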
For the evaluation of QA task performance, the widely accepted F1 and Exact Match (EM) scores [20] are calculated. Although there is no widely accepted automatic evaluation metric for measuring QG performance [1], most of the previous works used classical metrics such as BLEU [17], METEOR [3] and ROUGE [14]. METEOR applies stemming and synonym matching (in English). Hence, it has been excluded in our
Table 6. mT5-base QG evaluation results for single-task (ST) and multi-task (MT) for TQuADv2 fine-tuning.
Table 7. TQuADv1 and TQuADv2 fine-tuning QA evaluation results for multi-task mT5 variants and BERTurk. MT-Both means that the mT5 model is fine-tuned with the ’Both’ input format in a multi-task setting.
Table 8. TQuADv1 and TQuADv2 fine-tuning QG evaluation results for multi-task mT5 variants. MT-Both means that the mT5 model is fine-tuned with the ’Both’ input format in a multi-task setting.
experiments as these processes are not applicable to Turkish. We reported BLEU-1, BLEU-2 and ROUGE-L
metrics for evaluating the QG task performance. According to the TQuADv2 fine-tuning results in Table 5, the
proposed mT5 settings outperform BERTurk in the QA task, and the multi-task setting further increases the QA performance.
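As a reference point for the QG scores reported in the following tables, BLEU-1 and ROUGE-L can be computed with standard packages; the sketch below assumes the nltk and rouge_score packages and whitespace tokenization, which may differ from the exact evaluation scripts used for the reported numbers.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Türkiye Cumhuriyeti'nin kurucusu kimdir?"
hypothesis = "Türkiye Cumhuriyeti'ni kim kurmuştur?"

# BLEU-1: unigram precision only (all weight on 1-grams).
bleu1 = sentence_bleu([reference.split()], hypothesis.split(),
                      weights=(1.0, 0.0, 0.0, 0.0),
                      smoothing_function=SmoothingFunction().method1)

# ROUGE-L: longest-common-subsequence based F-measure.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)["rougeL"].fmeasure

print(f"BLEU-1: {bleu1:.3f}  ROUGE-L: {rouge_l:.3f}")
```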
To evaluate the question generation performance of the proposed single-task and multi-task settings, we fine-tuned the mT5 model on the TQuADv2 training split in both settings. The three different input formats explained in Section 2 are used, and BLEU-1 and ROUGE-L scores are calculated on the TQuADv2 validation split and the XQuAD Turkish split. According to the TQuADv2 fine-tuning results in Table 6, the highlight format increases the BLEU-1 scores by up to 1.6 points and the ROUGE-L scores by up to 1.7 points compared to the prepend format in the single-task setting. Moreover, the highlight format increases the BLEU-1 scores by up to 1.2 points and the ROUGE-L scores by up to 0.8 points compared to the prepend format in the multi-task setting. Finally, combining both techniques increases the BLEU-1 scores by up to 2.9 points and the ROUGE-L scores by up to 3.8 points compared to the prepend format.
Additional experiments have been conducted to evaluate the overall performance of the larger mT5 variants, mT5-base and mT5-large, in comparison to BERTurk. QA and QG evaluation results for TQuADv1 and TQuADv2 fine-tuning are provided in Tables 7 and 8, respectively. According to the QA results in Table 7, all mT5 variants outperform BERTurk for smaller dataset sizes, while BERTurk may outperform mT5-small for larger dataset sizes. This indicates that mT5 models are always preferable when the data is scarce, whereas regular single-task training may also be used in place of the mT5-small variant when sufficient data is available.
A comparative performance evaluation of mT5 variants shows that increasing the model size improves the
performance significantly for both datasets, especially when switching from mT5-small to mT5-base. While using
an even bigger model, mT5-large, improves the performance, it has a relatively more modest effect. Nevertheless,
this trend of obtaining better scores by increasing the model capacity is consistent with the previous works on
other transformer based models. Comparison of the results for different versions of the TQuAD datasets in Tables 7 and 8 shows that, although the TQuADv1 validation scores are higher than the TQuADv2 validation scores,
the models trained on the TQuADv2 train set are able to generalize better as indicated by the XQuAD Turkish
split results. This can be attributed to the larger size and better quality of the TQuADv2 dataset.
For qualitative evaluation, some model outputs from different paragraphs are provided in Appendix B, in Tables 12 and 13, for consistent and inconsistent generations (lacking coherence, not addressing the input answer, etc.), respectively.
4. Conclusions
By combining the proposed answer extraction and answer-aware QG modules, it is possible to fully automate the QG task without any manual answer extraction labor. Automated evaluation metrics on the TQuAD validation sets show that the model is capable of generating meaningful question-answer pairs from the context after fine-tuning. Moreover, the results show that the proposed multi-task approach performs better on QA, answer extraction and QG compared to the single-task setting. By combining the prepend and highlight input formats, the QG performance of an mT5 model can be boosted by up to 10%.
In the future, the multi-task model will be examined on other QG tasks, such as multiple-choice, true/false and yes/no question generation, and the effect of multilingual knowledge in mT5 will be analysed. In addition, human evaluations could be conducted to provide further insight into the performance of the methods.
References
[1] Amidei, J., P. Piwek, and A. Willis (2018). Evaluation methodologies in automatic question generation 2013-2018.
[2] Artetxe, M., S. Ruder, and D. Yogatama (2019). On the cross-lingual transferability of monolingual representations.
arXiv preprint arXiv:1910.11856 .
[3] Banerjee, S. and A. Lavie (2005). METEOR: An automatic metric for MT evaluation with improved correlation with
human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine
translation and/or summarization, pp. 65–72.
[4] Chan, Y.-H. and Y.-C. Fan (2019). A recurrent BERT-based model for question generation. In Proceedings of the 2nd
Workshop on Machine Reading for Question Answering, pp. 154–162.
[5] Du, X., J. Shao, and C. Cardie (2017). Learning to ask: Neural question generation for reading comprehension. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 1342–1352.
[6] Duan, N., D. Tang, P. Chen, and M. Zhou (2017). Question generation for question answering. In Proceedings of the
2017 Conference on Empirical Methods in Natural Language Processing, pp. 866–874.
[7] Heilman, M. and N. A. Smith (2010). Good question! statistical ranking for question generation. In Human Language
Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational
Linguistics, pp. 609–617.
[8] Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural computation 9 (8), 1735–1780.
[9] Kim, Y., H. Lee, J. Shin, and K. Jung (2019). Improving neural question generation using answer separation. In
Proceedings of the AAAI Conference on Artificial Intelligence, Volume 33, pp. 6602–6609.
[10] Kriangchaivech, K. and A. Wangperawong (2019). Question generation by transformers. arXiv preprint
arXiv:1909.05017 .
[11] Kurdi, G., J. Leo, B. Parsia, U. Sattler, and S. Al-Emari (2020). A systematic review of automatic question
generation for educational purposes. International Journal of Artificial Intelligence in Education 30 (1), 121–204.
[12] Laban, P., J. Canny, and M. A. Hearst (2021). What’s the latest? a question-driven news chatbot. arXiv preprint
arXiv:2105.05392 .
[13] Lee, C.-H., T.-Y. Chen, L.-P. Chen, P.-C. Yang, and R. T.-H. Tsai (2018). Automatic question generation from
children’s stories for companion chatbot. In 2018 IEEE International Conference on Information Reuse and Integration
(IRI), pp. 491–494. IEEE.
[14] Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out,
pp. 74–81.
[15] Lopez, L. E., D. K. Cruz, J. C. B. Cruz, and C. Cheng (2020). Transformer-based end-to-end question generation.
arXiv preprint arXiv:2005.01107 .
[16] Mitkov, R. et al. (2003). Computer-aided generation of multiple-choice tests. In Proceedings of the HLT-NAACL
03 workshop on Building educational applications using natural language processing, pp. 17–22.
[17] Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu (2002). BLEU: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318.
[18] Paszke, A., S. Gross, F. Massa, A. Lerer, J. Bradbury, et al. (2019). PyTorch: An imperative style, high-performance
deep learning library. In NeurIPS.
[19] Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, et al. (2020). Exploring the limits of transfer learning with
a unified text-to-text transformer. Journal of Machine Learning Research 21, 1–67.
[20] Rajpurkar, P., J. Zhang, K. Lopyrev, and P. Liang (2016). SQuAD: 100,000+ questions for machine comprehension
of text. In EMNLP.
[21] Schweter, S. (2020, April). BERTurk - BERT models for Turkish.
[22] Soleymanzadeh, K. (2017). Domain specific automatic question generation from text. In Proceedings of ACL 2017,
Student Research Workshop, pp. 82–88.
[23] Soygazi, F., O. Çiftçi, U. Kök, and S. Cengiz (2021). THQuAD: Turkish historic question answering dataset for
reading comprehension. In 2021 6th International Conference on Computer Science and Engineering (UBMK), pp.
215–220.
[24] Sreelakshmi, A., S. Abhinaya, A. Nair, and S. J. Nirmala (2019). A question answering and quiz generation chatbot
for education. In 2019 Grace Hopper Celebration India (GHCI), pp. 1–6. IEEE.
[25] Sun, X., J. Liu, Y. Lyu, W. He, Y. Ma, et al. (2018, October-November). Answer-focused and position-
aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, Brussels, Belgium, pp. 3930–3939. Association for Computational Linguistics.
[26] Tuan, L. A., D. Shah, and R. Barzilay (2020). Capturing greater context for question generation. In Proceedings of
the AAAI Conference on Artificial Intelligence, Volume 34, pp. 9065–9072.
[27] Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017).
Attention is all you need. Advances in neural information processing systems 30.
[28] Wolf, T., L. Debut, V. Sanh, J. Chaumond, C. Delangue, et al. (2020, October). Transformers: State-of-the-
art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, Online, pp. 38–45. Association for Computational Linguistics.
[29] Xue, L., N. Constant, A. Roberts, M. Kale, R. Al-Rfou, et al. (2020). mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
[30] Yue, X., X. F. Zhang, Z. Yao, S. Lin, and H. Sun (2020). Cliniqg4qa: Generating diverse questions for domain
adaptation of clinical question answering. arXiv preprint arXiv:2010.16021 .
[31] Zhao, Y., X. Ni, Y. Ding, and Q. Ke (2018). Paragraph-level neural question generation with maxout pointer
and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, pp. 3901–3910.
[32] Zhou, Q., N. Yang, F. Wei, C. Tan, H. Bao, et al. (2017). Neural question generation from text: A preliminary
study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671. Springer.
Appendix
A. Samples from Datasets
Here we provide some samples from the datasets for visual inspection. For TQuADv1 (Table 9) and TQuADv2 (Table 10), samples are shown from both the training and validation sets, while for XQuAD (Table 11) samples are tagged as validation only, since XQuAD as a whole is considered a validation set. The answers are highlighted within the context with a green background for ease of reading.
B. Model Outputs
Here we provide some sample question generation results from the TQuADv2 dataset for visual inspection. Results are shown for sample consistent (Table 12) and inconsistent (Table 13) question generations. The answers are highlighted within the context with a green background for ease of reading.