FutureOfLearning_LLMs_Book_Chapter
In recent years, large language models (LLMs) and generative AI have revolutionized natural language
processing (NLP), offering unprecedented capabilities in education. This chapter explores the transforma-
tive potential of LLMs in automated question generation and answer assessment. It begins by examining
the mechanisms behind LLMs, emphasizing their ability to comprehend and generate human-like text.
The chapter then discusses methodologies for creating diverse, contextually relevant questions, enhancing
learning through tailored, adaptive strategies. Key prompting techniques, such as zero-shot and chain-of-
thought prompting, are evaluated for their effectiveness in generating high-quality questions, including
open-ended and multiple-choice formats in various languages. Advanced NLP methods like fine-tuning
and prompt-tuning are explored for their role in generating task-specific questions, despite associated
costs. The chapter also covers the human evaluation of generated questions, highlighting quality varia-
tions across different methods and areas for improvement. Furthermore, it delves into automated answer
assessment, demonstrating how LLMs can accurately evaluate responses, provide constructive feedback,
and identify nuanced understanding or misconceptions. Examples illustrate both successful assessments
and areas needing improvement. The discussion underscores the potential of LLMs to replace costly,
time-consuming human assessments when appropriately guided, showcasing their advanced understand-
ing and reasoning capabilities in streamlining educational processes.
Keywords: Natural Language Processing (NLP), Large Language Models (LLMs), Education, Auto-
mated Question Generation (AQG), Answer Assessment, Prompt Engineering
1. INTRODUCTION
The educational landscape is evolving rapidly, driven by the integration of advanced technolo-
gies that challenge traditional teaching methods. Among these technologies, Large Language
Models (LLMs) have emerged as powerful tools, capable of revolutionizing the way we ap-
proach learning and assessment. These models, epitomized by systems such as GPT-4 (Achiam
et al., 2023) and beyond, have demonstrated an extraordinary ability to understand and gen-
erate human-like text, enabling them to perform tasks that were once the exclusive domain of
human educators (Brown et al., 2020; Floridi and Chiriatti, 2020). In the realm of education,
question generation and assessment are critical components that shape the learning experience.
Traditionally, these tasks require significant human effort, involving educators in the meticulous
design of questions that not only test knowledge but also promote deeper understanding (Mazidi
and Nielsen, 2014). Assessing student responses, particularly in open-ended formats, is an-
other labor-intensive task that demands careful consideration of context, nuance, and individual
student needs (Chappuis et al., 2015). However, as the demand for personalized and adaptive
learning grows, the limitations of human-driven approaches have become more apparent.
This chapter delves into the transformative potential of LLMs in automating these crucial
educational tasks. We explore how LLMs can be leveraged to generate a wide variety of ques-
tions—ranging from simple factual queries to complex, open-ended questions—that are con-
textually relevant and aligned with educational goals (Maity et al., 2023; Maity et al., 2024a;
Maity et al., 2024c). We also examine the capabilities of LLMs in automated answer assessment,
where these models can evaluate student responses, offer feedback, and even identify subtle mis-
conceptions, all at a scale and efficiency that human educators cannot match (Fagbohun et al.,
2024). The introduction of LLMs into the educational process is not without challenges. Issues
such as the quality and relevance of generated questions, the accuracy of automated assessments,
and the ethical implications of relying on AI for education require careful consideration (Floridi
and Cowls, 2022).
This chapter addresses these concerns, offering insights into how LLMs can be guided and
refined to ensure they complement and enhance human-led education rather than replacing it.
In the sections that follow, we will first provide a detailed overview of LLMs, focusing on their
architecture and underlying mechanisms. This will set the stage for a discussion on various
methodologies and prompting techniques used to generate educational questions. We will then
explore the role of advanced NLP methods such as fine-tuning and prompt-tuning in enhancing
the quality and specificity of generated questions. The chapter will also cover human evaluation
metrics for assessing the quality of these questions and the performance of LLMs in automated
answer assessment. Finally, we will discuss the broader implications of integrating LLMs into
education, highlighting both their potential benefits and the challenges that must be addressed
to fully realize their capabilities.
textually relevant and pedagogically sound. The training process of LLMs involves exposure to
diverse datasets that cover a wide range of topics and writing styles (Raiaan et al., 2024). This
extensive training enables the models to develop a broad understanding of language, which they
can then apply to specific tasks such as question generation and assessment. However, while
LLMs excel in generating human-like text, their effectiveness in educational contexts depends
on how well they are guided and fine-tuned for specific tasks.
part of the prompt, this method enhances the model’s understanding of the task, leading
to improved relevance and quality of the generated questions (Brown et al., 2020). This
technique is effective in scenarios where the desired question format or content is more
complex and needs to be clearly defined for the model.
• Chain-of-Thought Prompting: A structured technique that involves guiding the LLM
through a step-by-step reasoning process before it generates the final question. For ex-
ample, the model may first be asked to summarize a passage, identify key concepts, and
then generate a question that tests understanding of these concepts (Wei et al., 2022; Maity
et al., 2024d). This approach is particularly effective for generating higher-order questions
that require critical thinking and analysis, ensuring that the questions align with specific
educational goals. A minimal prompting sketch appears after this list.
• Fine-Tuning: Fine-tuning involves further training the LLM on a specific dataset of ques-
tions and answers relevant to the target domain. By learning the patterns and structures of
effective questions from the training data, fine-tuning allows the model to generate more
accurate and context-specific questions (Raffel et al., 2020). This method is resource-
intensive but results in highly specialized models that can produce high-quality questions
tailored to specific subjects or curricula (Maity et al., 2023). A minimal fine-tuning sketch appears later in this section.
• Prompt-Tuning: A recent and computationally efficient technique, prompt-tuning in-
volves adjusting a small set of parameters (the prompt) while leaving the rest of the model
unchanged. This method has proven effective in generating high-quality questions across
various educational contexts, especially when the goal is to adapt a general-purpose LLM
to a specific task without extensive retraining (Lester et al., 2021). Prompt-tuning allows
for quick adaptation and customization of LLMs to generate questions that are both rele-
vant and aligned with specific educational objectives.
• Multiformat and Multilingual Question Generation: LLMs are capable of generating
both open-ended (Maity et al., 2023) and multiple-choice questions (Maity et al., 2024d),
catering to different assessment needs. Open-ended questions encourage critical thinking
and exploration, while multiple-choice questions are useful for evaluating specific knowl-
edge or skills (Maity et al., 2024d). Additionally, the multilingual capabilities of LLMs
enable the generation of questions in various languages, making them valuable tools for
language learning and cross-cultural education (Radford et al., 2019; Maity et al., 2024d).
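To make the prompting techniques above concrete, the following minimal sketch contrasts a zero-shot prompt with a chain-of-thought prompt for question generation. It assumes a chat-style LLM reached through the OpenAI Python SDK; the model name, passage, and prompt wording are illustrative assumptions rather than the exact setup used in the studies cited above.

# A minimal sketch of zero-shot and chain-of-thought prompting for question
# generation. The OpenAI Python SDK is used purely for illustration; any
# chat-style LLM endpoint could be substituted. The model name, passage, and
# prompt wording are assumptions, not the chapter's own implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

passage = (
    "Purchasing power parity (PPP) compares the relative value of currencies "
    "by looking at the price of a common basket of goods in each country."
)

# Zero-shot prompt: the task is stated directly, with no examples or reasoning steps.
zero_shot_prompt = (
    f"Read the passage below and write one open-ended question suitable for "
    f"a school-level economics class.\n\nPassage: {passage}"
)

# Chain-of-thought prompt: the model is guided through intermediate steps
# (summarize, identify key concepts) before producing the final question.
cot_prompt = (
    f"Passage: {passage}\n\n"
    "Step 1: Summarize the passage in one sentence.\n"
    "Step 2: List the key concepts a student should understand.\n"
    "Step 3: Write one higher-order question that tests those concepts.\n"
    "Return only the question from Step 3 on the last line."
)

def generate(prompt: str) -> str:
    """Send a single prompt to the chat model and return the text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print("Zero-shot question:", generate(zero_shot_prompt))
    print("Chain-of-thought question:", generate(cot_prompt))

In practice, the chain-of-thought variant tends to be most useful for higher-order questions, since asking the model to summarize and extract key concepts first helps keep the final question anchored to the passage.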
These methodologies, when applied effectively, enhance the educational process by gener-
ating diverse, high-quality questions that cater to different learning contexts and objectives. As
LLMs continue to evolve, the integration of these techniques will further improve the relevance,
accuracy, and utility of automated question generation in education.
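Where prompting alone is not sufficient, the fine-tuning approach described above can be sketched as follows. This is a minimal, hypothetical example that adapts a small sequence-to-sequence checkpoint to question generation using the Hugging Face transformers library; the toy training pairs, prompt prefix, model size, and hyperparameters are assumptions for illustration only, not the configuration of any cited study.

# A minimal, hypothetical sketch of fine-tuning a small seq2seq model on
# (passage, question) pairs for question generation. Dataset, model size, and
# hyperparameters are illustrative assumptions.
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "t5-small"  # illustrative; any seq2seq checkpoint could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Toy training pairs: in practice these would come from a curated question bank.
pairs = [
    ("The water cycle describes how water evaporates, condenses, and falls as rain.",
     "What are the main stages of the water cycle?"),
    ("Photosynthesis converts sunlight, water, and carbon dioxide into glucose.",
     "What inputs does a plant need for photosynthesis?"),
]

class QGDataset(torch.utils.data.Dataset):
    """Wraps (passage, question) pairs as tokenized seq2seq training examples."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        passage, question = self.pairs[idx]
        enc = tokenizer("generate question: " + passage,
                        truncation=True, padding="max_length",
                        max_length=128, return_tensors="pt")
        labels = tokenizer(question, truncation=True, padding="max_length",
                           max_length=32, return_tensors="pt").input_ids
        labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {
            "input_ids": enc.input_ids.squeeze(0),
            "attention_mask": enc.attention_mask.squeeze(0),
            "labels": labels.squeeze(0),
        }

training_args = TrainingArguments(
    output_dir="qg-finetuned",        # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_steps=1,
)

trainer = Trainer(model=model, args=training_args, train_dataset=QGDataset(pairs))
trainer.train()

A model adapted in this way can then be applied to new passages to produce candidate questions of the kinds discussed below.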
• Factual Questions: These questions focus on the recall of specific information, such as
dates, definitions, or events. They are typically straightforward and aim to assess the
student’s memory and basic understanding of the subject matter (Mulla and Gharpure,
2023).
Example: "What is the capital of France?"
• Open-Ended Questions: Open-ended questions are designed to encourage deep thinking
and exploration, allowing students to express their thoughts freely and creatively. These
questions do not have a single correct answer, promoting critical thinking and discussion
(Mulla and Gharpure, 2023; Maity et al., 2023).
Example: "What does purchasing power parity do?"
• Multiple-Choice Questions (MCQs): MCQs assess specific knowledge or skills by pro-
viding a set of possible answers from which the student must choose the correct one. They
are widely used for their efficiency in testing and grading (Maity et al., 2024d).
Example: "Which of the following is the largest planet in our solar system?
(a) Earth (b) Jupiter (c) Mars (d) Venus"
LLMs, through their sophisticated language processing capabilities, can generate these var-
ied question types effectively, adapting them to different educational contexts and learning ob-
jectives.
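One practical way to obtain these varied question types from a single passage is to request them together in a structured format. The sketch below asks a chat model for one factual, one open-ended, and one multiple-choice question as a JSON object; the SDK calls, model name, and JSON schema are illustrative assumptions.

# A minimal sketch of multiformat question generation: the model returns one
# factual, one open-ended, and one multiple-choice question for a passage as
# JSON. The SDK usage and schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

passage = "Jupiter is the largest planet in the solar system and has dozens of moons."

prompt = (
    "For the passage below, return a JSON object with three keys:\n"
    '  "factual": a question with a single short answer,\n'
    '  "open_ended": a question inviting discussion,\n'
    '  "mcq": an object with "question", a list "options" of four strings, '
    'and "answer" giving the correct option.\n'
    f"\nPassage: {passage}\nReturn only the JSON object."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # ask for strictly valid JSON
)

questions = json.loads(response.choices[0].message.content)
print(questions["factual"])
print(questions["open_ended"])
print(questions["mcq"]["question"], questions["mcq"]["options"])

Requesting structured output in this way makes it straightforward to route each generated item into the appropriate assessment workflow, although the returned JSON should still be validated before use.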
consistency of the assessments. LLMs, like all AI systems, are not infallible and can sometimes
produce incorrect or biased evaluations (Owan et al., 2023). Ensuring that the assessments are
fair, accurate, and aligned with the learning objectives is crucial for the successful integration of
LLMs into the educational process (Fagbohun et al., 2024).
To illustrate the capabilities of LLMs in automated answer assessment, consider the following
examples:
• Essay Grading: In a history class, students are asked to write essays on the causes and
effects of World War II. The LLM evaluates the essays based on criteria such as under-
standing of key events, analysis of historical factors, and coherence of argument. The
model is able to identify well-reasoned arguments and provide feedback on areas where
the student could improve, such as providing more evidence or considering alternative
perspectives (Mansour et al., 2024; Henkel et al., 2024).
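A minimal sketch of how such rubric-based grading might be set up is shown below. The rubric criteria, score ranges, and SDK calls are illustrative assumptions, not the grading configuration of the studies cited above; in a real deployment the rubric would come from the course instructor and the model's scores would be spot-checked against human grading.

# A minimal sketch of rubric-based answer assessment with an LLM: the model is
# asked to score an essay against explicit criteria and to return actionable
# feedback. The rubric, weights, and SDK usage are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

rubric = {
    "understanding_of_key_events": "0-5: accuracy and coverage of major events",
    "analysis_of_historical_factors": "0-5: depth of causal reasoning",
    "coherence_of_argument": "0-5: structure, evidence, and clarity",
}

def assess_essay(question: str, essay: str) -> dict:
    """Return per-criterion scores and feedback as a parsed JSON dictionary."""
    prompt = (
        f"You are grading a history essay.\nQuestion: {question}\n"
        f"Rubric (score each criterion): {json.dumps(rubric)}\n"
        f"Essay: {essay}\n\n"
        "Return a JSON object with a numeric score per criterion, a 'total', and "
        "a 'feedback' field containing two concrete, actionable suggestions."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

result = assess_essay(
    "Discuss the causes and effects of World War II.",
    "World War II began in 1939 after the invasion of Poland...",
)
print(result["total"], result["feedback"])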
While these examples demonstrate the potential of LLMs in automated assessment, there are
also areas for improvement. One challenge is ensuring that the feedback provided by the LLM
is constructive and actionable (Meyer et al., 2024a). For instance, while the model may correctly
identify an error in a student’s response, it must also provide clear guidance on how to address
the mistake. Additionally, the LLM must be able to adapt its feedback to the individual needs of
each student, taking into account their prior knowledge and learning style.
Another area for improvement is the ability of LLMs to assess more complex and creative
responses, such as those involving critical thinking, problem-solving, or artistic expression.
While LLMs have made significant strides in understanding and generating text, evaluating these
higher-order skills remains a challenge (Hsiao et al., 2023). Future research and development
will be needed to enhance the capabilities of LLMs in these areas, ensuring that they can fully
support the diverse needs of learners.
5. HUMAN EVALUATION AND QUALITY METRICS FOR GENERATED QUESTIONS
to all learners is an important consideration in the evaluation process (Maity et al., 2024a; Maity
et al., 2024b).
6.3. FUTURE DIRECTIONS IN AUTOMATED QUESTION GENERATION AND ASSESSMENT
Looking to the future, the role of LLMs in automated question generation and assessment is
likely to expand and evolve (Fagbohun et al., 2024). Advances in AI and NLP technologies will
enable the development of more sophisticated models that are better equipped to handle complex
and creative educational tasks (Alqahtani et al., 2023). As these models become more integrated
into the educational process, they will play a key role in supporting personalized and adaptive
learning, providing scalable solutions that enhance the quality and accessibility of education.
One promising direction for future research is the development of models that can assess
higher-order thinking skills, such as critical thinking, problem-solving, and creativity. These
skills are essential for success in the 21st century, and the ability to assess them accurately and
efficiently is a major challenge for educators. LLMs, with their advanced language understand-
ing and generation capabilities, have the potential to address this challenge, providing new tools
for assessing and supporting the development of these critical skills (Moore et al., 2023).
Another important direction for future research is the exploration of new methodologies for
fine-tuning and prompt-tuning LLMs for specific educational tasks. As LLMs continue to be
used in a wider range of educational contexts, it will be important to develop techniques that
allow for the efficient and effective adaptation of these models to different subject areas, student
populations, and learning objectives.
7. CONCLUSION
In conclusion, large language models have the potential to revolutionize education through au-
tomated question generation and answer assessment. These models, with their ability to under-
stand and generate human-like text, offer scalable solutions that can enhance personalized and
adaptive learning. By leveraging advanced prompting techniques and fine-tuning methodolo-
gies, educators can create high-quality, contextually relevant questions that challenge students
and support their learning. Furthermore, LLMs’ capabilities in automated assessment can pro-
vide timely and constructive feedback, helping students identify areas for improvement and
guiding their educational journey.
However, the integration of LLMs into education also presents challenges and ethical con-
siderations that must be carefully addressed. Ensuring the fairness, accuracy, and transparency
of AI-driven educational processes is essential for building trust and confidence in these tech-
nologies. As we look to the future, ongoing research and development will be key to realizing
the full potential of LLMs in education, creating a more personalized, adaptive, and accessible
learning experience for all students.
REFERENCES
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Alier, M., Casañ, M. J., and Filvà, D. A. 2023. Smart learning applications: Leveraging LLMs for contextualized and ethical educational technology. In International Conference on Technological Ecosystems for Enhancing Multiculturality. Springer, 190–199.
Alqahtani, T., Badreldin, H. A., Alrashed, M., Alshaya, A. I., Alghamdi, S. S., Bin Saleh, K., Alowais, S. A., Alshaya, O. A., Rahman, I., Al Yami, M. S., et al. 2023. The emergent role of artificial intelligence, natural learning processing, and large language models in higher education and research. Research in Social and Administrative Pharmacy 19, 8, 1236–1242.
Badawi, G., de Beyrouth, G., and Badawi, H. 2018. AI-driven educational paradigms: Opportunities and challenges, and ethical considerations in teaching and learning.
Balfour, S. P. 2013. Assessing writing in MOOCs: Automated essay scoring and calibrated peer review™. Research & Practice in Assessment 8, 40–48.
Broadbent, J., Panadero, E., and Boud, D. 2018. Implementing summative assessment with a formative flavour: A case study in a large class. Assessment & Evaluation in Higher Education 43, 2, 307–322.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds. Vol. 33. Curran Associates, Inc., 1877–1901.
Chappuis, J. et al. 2015. Seven strategies of assessment for learning. Pearson.
Crogman, H. and Trebeau Crogman, M. 2018. Modified generated question learning, and its classroom implementation and assessment. Cogent Education 5, 1, 1459340.
Fagbohun, O., Iduwe, N., Abdullahi, M., Ifaturoti, A., and Nwanna, O. 2024. Beyond traditional assessment: Exploring the impact of large language models on grading practices. Journal of Artificial Intelligence and Machine Learning & Data Science 2, 1, 1–8.
Floridi, L. and Chiriatti, M. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines 30, 681–694.
Floridi, L. and Cowls, J. 2022. A unified framework of five principles for AI in society. Machine Learning and the City: Applications in Architecture and Urban Design, 535–545.
Goslen, A., Kim, Y. J., Rowe, J., and Lester, J. 2024. LLM-based student plan generation for adaptive scaffolding in game-based learning environments. International Journal of Artificial Intelligence in Education, 1–26.
Henkel, O., Hills, L., Boxer, A., Roberts, B., and Levonian, Z. 2024. Can large language models make the grade? An empirical study evaluating LLMs ability to mark short answer questions in K-12 education. In Proceedings of the Eleventh ACM Conference on Learning@Scale. 300–304.
Hsiao, Y.-P., Klijn, N., and Chiu, M.-S. 2023. Developing a framework to re-design writing assignment assessment for the era of large language models. Learning: Research and Practice 9, 2, 148–158.
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274.
Kazi, N. H. 2023. Automated short-answer grading and misconception detection using large language models. University of North Florida.
Kim, J. 2024. Leading teachers' perspective on teacher-AI collaboration in education. Education and Information Technologies 29, 7, 8693–8724.
Kurdi, G., Leo, J., Parsia, B., Sattler, U., and Al-Emari, S. 2020. A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education 30, 121–204.
Lester, B., Al-Rfou, R., and Constant, N. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Li, Q., Fu, L., Zhang, W., Chen, X., Yu, J., Xia, W., Zhang, W., Tang, R., and Yu, Y. 2023. Adapting large language models for education: Foundational capabilities, potentials, and challenges. arXiv preprint arXiv:2401.08664.
Luckin, R. and Holmes, W. 2016. Intelligence unleashed: An argument for AI in education.
Maity, S., Deroy, A., and Sarkar, S. 2023. Harnessing the power of prompt-based techniques for generating school-level questions using large language models. In Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation. 30–39.
Maity, S., Deroy, A., and Sarkar, S. 2024a. Exploring the capabilities of prompted large language models in educational and assessment applications. In Proceedings of the 17th International Conference on Educational Data Mining, B. Paaßen and C. D. Epp, Eds. International Educational Data Mining Society, Atlanta, Georgia, USA, 961–968.
Maity, S., Deroy, A., and Sarkar, S. 2024b. How effective is GPT-4 Turbo in generating school-level questions from textbooks based on Bloom's revised taxonomy?
Maity, S., Deroy, A., and Sarkar, S. 2024c. How ready are generative pre-trained large language models for explaining Bengali grammatical errors? In Proceedings of the 17th International Conference on Educational Data Mining, B. Paaßen and C. D. Epp, Eds. International Educational Data Mining Society, Atlanta, Georgia, USA, 664–671.
Maity, S., Deroy, A., and Sarkar, S. 2024d. A novel multi-stage prompting approach for language agnostic MCQ generation using GPT. In European Conference on Information Retrieval. Springer, 268–277.
Mansour, W., Albatarni, S., Eltanbouly, S., and Elsayed, T. 2024. Can large language models automatically score proficiency of written essays? arXiv preprint arXiv:2403.06149.
Mazidi, K. and Nielsen, R. 2014. Linguistic considerations in automatic question generation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 321–326.
Memarian, B. and Doleck, T. 2023. Fairness, accountability, transparency, and ethics (FATE) in artificial intelligence (AI), and higher education: A systematic review. Computers and Education: Artificial Intelligence, 100152.
Meyer, J., Jansen, T., Schiller, R., Liebenow, L. W., Steinbach, M., Horbach, A., and Fleckenstein, J. 2024a. Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students' text revision, motivation, and positive emotions. Computers and Education: Artificial Intelligence 6, 100199.
Moore, S., Tong, R., Singh, A., Liu, Z., Hu, X., Lu, Y., Liang, J., Cao, C., Khosravi, H., Denny, P., et al. 2023. Empowering education with LLMs - the next-gen interface and content generation. In International Conference on Artificial Intelligence in Education. Springer, 32–37.
Mulla, N. and Gharpure, P. 2023. Automatic question generation: A review of methodologies, datasets, evaluation metrics, and applications. Progress in Artificial Intelligence 12, 1, 1–32.
Nema, P. and Khapra, M. M. 2018. Towards a better metric for evaluating question generation systems. arXiv preprint arXiv:1808.10192.
Owan, V. J., Abang, K. B., Idika, D. O., Etta, E. O., and Bassey, B. A. 2023. Exploring the potential of artificial intelligence tools in educational measurement and assessment. Eurasia Journal of Mathematics, Science and Technology Education 19, 8, em2307.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8, 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140, 1–67.
Rahman, M. A., Alqahtani, L., Albooq, A., and Ainousah, A. 2024. A survey on security and privacy of large multimodal deep learning models: Teaching and learning perspective. In 2024 21st Learning and Technology Conference (L&T). IEEE, 13–18.
Raiaan, M. A. K., Mukta, M. S. H., Fatema, K., Fahad, N. M., Sakib, S., Mim, M. M. J., Ahmad, J., Ali, M. E., and Azam, S. 2024. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access.
Shin, J. and Gierl, M. J. Automated short-response scoring for automated item generation in science assessments. In The Routledge International Handbook of Automated Essay Evaluation. Routledge, 504–534.
Stamper, J., Xiao, R., and Hou, X. 2024. Enhancing LLM-based feedback: Insights from intelligent tutoring systems and the learning sciences. In International Conference on Artificial Intelligence in Education. Springer, 32–43.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Vol. 30. Curran Associates, Inc.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds. Vol. 35. Curran Associates, Inc., 24824–24837.
Yekollu, R. K., Bhimraj Ghuge, T., Sunil Biradar, S., Haldikar, S. V., and Farook Mohideen Abdul Kader, O. 2024. AI-driven personalized learning paths: Enhancing education through adaptive systems. In International Conference on Smart Data Intelligence. Springer, 507–517.