
The Future of Learning in the Age of Generative AI: Automated Question
Generation and Assessment with Large Language Models

Subhankar Maity
Department of Artificial Intelligence
Indian Institute of Technology Kharagpur
[email protected]

Aniket Deroy
Computer Science & Engineering
Indian Institute of Technology Kharagpur
[email protected]

In recent years, large language models (LLMs) and generative AI have revolutionized natural language
processing (NLP), offering unprecedented capabilities in education. This chapter explores the transforma-
tive potential of LLMs in automated question generation and answer assessment. It begins by examining
the mechanisms behind LLMs, emphasizing their ability to comprehend and generate human-like text.
The chapter then discusses methodologies for creating diverse, contextually relevant questions, enhancing
learning through tailored, adaptive strategies. Key prompting techniques, such as zero-shot and chain-of-
thought prompting, are evaluated for their effectiveness in generating high-quality questions, including
open-ended and multiple-choice formats in various languages. Advanced NLP methods like fine-tuning
and prompt-tuning are explored for their role in generating task-specific questions, despite associated
costs. The chapter also covers the human evaluation of generated questions, highlighting quality varia-
tions across different methods and areas for improvement. Furthermore, it delves into automated answer
assessment, demonstrating how LLMs can accurately evaluate responses, provide constructive feedback,
and identify nuanced understanding or misconceptions. Examples illustrate both successful assessments
and areas needing improvement. The discussion underscores the potential of LLMs to replace costly,
time-consuming human assessments when appropriately guided, showcasing their advanced understand-
ing and reasoning capabilities in streamlining educational processes.

Keywords: Natural Language Processing (NLP), Large Language Models (LLMs), Education, Auto-
mated Question Generation (AQG), Answer Assessment, Prompt Engineering

1. INTRODUCTION
The educational landscape is evolving rapidly, driven by the integration of advanced technolo-
gies that challenge traditional teaching methods. Among these technologies, Large Language
Models (LLMs) have emerged as powerful tools, capable of revolutionizing the way we ap-
proach learning and assessment. These models, epitomized by systems such as GPT-4 (Achiam
et al., 2023) and beyond, have demonstrated an extraordinary ability to understand and gen-
erate human-like text, enabling them to perform tasks that were once the exclusive domain of
human educators (Brown et al., 2020; Floridi and Chiriatti, 2020). In the realm of education,
question generation and assessment are critical components that shape the learning experience.
Traditionally, these tasks require significant human effort, involving educators in the meticulous
design of questions that not only test knowledge but also promote deeper understanding (Mazidi
and Nielsen, 2014). Assessing student responses, particularly in open-ended formats, is an-
other labor-intensive task that demands careful consideration of context, nuance, and individual
student needs (Chappuis et al., 2015). However, as the demand for personalized and adaptive
learning grows, the limitations of human-driven approaches have become more apparent.
This chapter delves into the transformative potential of LLMs in automating these crucial
educational tasks. We explore how LLMs can be leveraged to generate a wide variety of ques-
tions—ranging from simple factual queries to complex, open-ended questions—that are con-
textually relevant and aligned with educational goals (Maity et al., 2023; Maity et al., 2024a;
Maity et al., 2024c). We also examine the capabilities of LLMs in automated answer assessment,
where these models can evaluate student responses, offer feedback, and even identify subtle mis-
conceptions, all at a scale and efficiency that human educators cannot match (Fagbohun et al.,
2024). The introduction of LLMs into the educational process is not without challenges. Issues
such as the quality and relevance of generated questions, the accuracy of automated assessments,
and the ethical implications of relying on AI for education require careful consideration (Floridi
and Cowls, 2022).
This chapter addresses these concerns, offering insights into how LLMs can be guided and
refined to ensure they complement and enhance human-led education rather than replacing it.
In the sections that follow, we will first provide a detailed overview of LLMs, focusing on their
architecture and underlying mechanisms. This will set the stage for a discussion on various
methodologies and prompting techniques used to generate educational questions. We will then
explore the role of advanced NLP methods such as fine-tuning and prompt-tuning in enhancing
the quality and specificity of generated questions. The chapter will also cover human evaluation
metrics for assessing the quality of these questions and the performance of LLMs in automated
answer assessment. Finally, we will discuss the broader implications of integrating LLMs into
education, highlighting both their potential benefits and the challenges that must be addressed
to fully realize their capabilities.

2. UNDERSTANDING LARGE LANGUAGE MODELS IN EDUCATION


2.1. THE ARCHITECTURE AND MECHANISMS OF LLMS
Large Language Models (LLMs), built on the foundations of deep learning and transformer
architectures (Vaswani et al., 2017), have brought about a paradigm shift in natural language
processing (NLP). These models, trained on vast corpora of text data, are designed to predict
and generate text based on a given input (Radford et al., 2019). Their ability to understand
context, recognize patterns, and generate coherent, contextually appropriate text makes them
particularly well-suited for educational applications.
At the core of LLMs is the transformer architecture, which uses self-attention mechanisms
to weigh the importance of different words in a sentence relative to each other (Vaswani et al.,
2017). This allows the model to capture long-range dependencies in text, making it capable of
understanding complex sentences and generating nuanced responses. For educational purposes,
this means LLMs can generate questions that are not only grammatically correct but also con-
textually relevant and pedagogically sound. The training process of LLMs involves exposure to
diverse datasets that cover a wide range of topics and writing styles (Raiaan et al., 2024). This
extensive training enables the models to develop a broad understanding of language, which they
can then apply to specific tasks such as question generation and assessment. However, while
LLMs excel in generating human-like text, their effectiveness in educational contexts depends
on how well they are guided and fine-tuned for specific tasks.
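To make the self-attention mechanism concrete, the following minimal Python sketch computes scaled dot-product attention for a toy sequence of token vectors. The random matrices stand in for learned projection weights, and the numbers are purely illustrative rather than taken from any particular model.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each row of the returned weight matrix says how strongly one token attends to the others.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                              # 4 tokens, 8-dimensional embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, attention = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(attention.round(2))

In a full transformer, many such heads run in parallel and their outputs are combined, which is what lets the model capture the long-range dependencies described above.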

2.2. THE ROLE OF FINE-TUNING AND PROMPT-TUNING


To adapt LLMs for educational question generation and assessment, techniques such as fine-
tuning and prompt-tuning are employed. Fine-tuning involves training the LLM on a specialized
dataset that is closely aligned with the target task. This allows the model to learn the nuances
of educational content and generate questions that are more closely aligned with the curriculum
and learning objectives (Li et al., 2023).
Prompt-tuning, on the other hand, involves designing specific prompts that guide the LLM
in generating the desired output (Lester et al., 2021). This technique leverages the model’s
existing knowledge and directs it towards generating contextually relevant and pedagogically
valuable questions. For instance, a prompt might instruct the LLM to generate a question based
on a specific passage of text, encouraging the model to focus on key concepts and ideas that are
essential for learning.
Both fine-tuning and prompt-tuning have their advantages and challenges. Fine-tuning can
produce highly specialized models that excel in specific tasks, but it is resource-intensive and
requires access to large, high-quality datasets (Raffel et al., 2020). Prompt-tuning, while more
flexible and less resource-demanding, relies heavily on the design of effective prompts and may
not always achieve the same level of specificity as fine-tuned models (Lester et al., 2021). De-
spite these challenges, both techniques have shown significant promise in enhancing the perfor-
mance of LLMs in educational settings.
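To illustrate why prompt-tuning is comparatively cheap, the sketch below (in PyTorch) freezes a stand-in backbone and trains only a small matrix of soft prompt embeddings that is prepended to the input embeddings. The backbone, dimensions, and prompt length are placeholder choices for illustration; a real setup would use an actual pretrained LLM.

import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    # Only `soft_prompt` is trainable; the backbone's weights stay frozen.
    def __init__(self, backbone, embed_dim, prompt_length=20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.soft_prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, embed_dim)
        prompt = self.soft_prompt.unsqueeze(0).expand(token_embeddings.size(0), -1, -1)
        return self.backbone(torch.cat([prompt, token_embeddings], dim=1))

# A small frozen encoder standing in for a pretrained model.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
model = SoftPromptModel(backbone, embed_dim=64)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("trainable parameters:", trainable)                # just the 20 x 64 soft prompt

Fine-tuning, by contrast, would update all of the backbone's parameters, which is what makes it more resource-intensive.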

3. AUTOMATED QUESTION GENERATION: METHODOLOGIES AND TECHNIQUES

3.1. GENERATING DIVERSE AND CONTEXTUALLY RELEVANT QUESTIONS


The automated generation of questions using large language models (LLMs) represents a power-
ful tool in education, enabling the creation of diverse and contextually relevant questions tailored
to various learning objectives (Maity et al., 2024a). The methodologies employed in question
generation are varied, each contributing to the quality and applicability of the generated content.
Below are the key methods utilized in this domain:
• Zero-Shot Prompting: Zero-shot learning allows models like GPT-3 (Brown et al., 2020)
to generate questions based on minimal instructions. The model leverages its pre-trained
knowledge to generate relevant questions directly from the provided text, without the need
for additional examples or fine-tuning (Brown et al., 2020). This approach is particularly
useful for generating questions across a wide range of topics, but the quality may vary
depending on the complexity of the input text (Maity et al., 2023; Maity et al., 2024b).
• Few-Shot Prompting: Few-shot prompting provides the model with a few examples of
the task to guide its question generation. By including a few question-answer pairs as
part of the prompt, this method enhances the model’s understanding of the task, leading
to improved relevance and quality of the generated questions (Brown et al., 2020). This
technique is effective in scenarios where the desired question format or content is more
complex and needs to be clearly defined for the model.
• Chain-of-Thought Prompting: This structured technique guides the LLM through a step-by-step
reasoning process before it generates the final question. For example, the model may first be
asked to summarize a passage, identify key concepts, and then generate a question that tests
understanding of these concepts (Wei et al., 2022; Maity et al., 2024d). This approach is
particularly effective for generating higher-order questions that require critical thinking and
analysis, ensuring that the questions align with specific educational goals (a prompt sketch
contrasting the zero-shot, few-shot, and chain-of-thought styles follows this list).
• Fine-Tuning: Fine-tuning involves further training the LLM on a specific dataset of ques-
tions and answers relevant to the target domain. By learning the patterns and structures of
effective questions from the training data, fine-tuning allows the model to generate more
accurate and context-specific questions (Raffel et al., 2020). This method is resource-
intensive but results in highly specialized models that can produce high-quality questions
tailored to specific subjects or curricula (Maity et al., 2023).
• Prompt-Tuning: A recent and computationally efficient technique, prompt-tuning in-
volves adjusting a small set of parameters (the prompt) while leaving the rest of the model
unchanged. This method has proven effective in generating high-quality questions across
various educational contexts, especially when the goal is to adapt a general-purpose LLM
to a specific task without extensive retraining (Lester et al., 2021). Prompt-tuning allows
for quick adaptation and customization of LLMs to generate questions that are both rele-
vant and aligned with specific educational objectives.
• Multiformat and Multilingual Question Generation: LLMs are capable of generating
both open-ended (Maity et al., 2023) and multiple-choice questions (Maity et al., 2024d),
catering to different assessment needs. Open-ended questions encourage critical thinking
and exploration, while multiple-choice questions are useful for evaluating specific knowl-
edge or skills (Maity et al., 2024d). Additionally, the multilingual capabilities of LLMs
enable the generation of questions in various languages, making them valuable tools for
language learning and cross-cultural education (Radford et al., 2019; Maity et al., 2024d).
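
To make the contrast between the prompting styles above concrete, the sketch below builds zero-shot, few-shot, and chain-of-thought prompts for generating a question from a passage. The template wording and the call_llm placeholder are illustrative assumptions; any hosted or local model could sit behind them.

def zero_shot_prompt(passage):
    return ("Generate one exam question based on the following passage.\n\n"
            f"Passage: {passage}\n\nQuestion:")

def few_shot_prompt(passage, examples):
    # `examples` is a list of (passage, question) pairs used as in-context demonstrations.
    demos = "\n\n".join(f"Passage: {p}\nQuestion: {q}" for p, q in examples)
    return f"{demos}\n\nPassage: {passage}\nQuestion:"

def chain_of_thought_prompt(passage):
    return ("Read the passage, then (1) summarize it in one sentence, "
            "(2) list its two key concepts, and (3) write one question that tests "
            f"understanding of those concepts.\n\nPassage: {passage}")

def call_llm(prompt):
    # Placeholder: plug in whichever model API is actually available.
    raise NotImplementedError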

These methodologies, when applied effectively, enhance the educational process by gener-
ating diverse, high-quality questions that cater to different learning contexts and objectives. As
LLMs continue to evolve, the integration of these techniques will further improve the relevance,
accuracy, and utility of automated question generation in education.

3.2. TYPES OF QUESTIONS GENERATED BY LLMS


In the context of education, different types of questions serve varied pedagogical functions, and
LLMs are capable of generating a broad spectrum of question types. Below are the primary
categories:

• Factual Questions: These questions focus on the recall of specific information, such as
dates, definitions, or events. They are typically straightforward and aim to assess the
student’s memory and basic understanding of the subject matter (Mulla and Gharpure,
2023).
Example: "What is the capital of France?"
• Open-Ended Questions: Open-ended questions are designed to encourage deep thinking
and exploration, allowing students to express their thoughts freely and creatively. These
questions do not have a single correct answer, promoting critical thinking and discussion
(Mulla and Gharpure, 2023; Maity et al., 2023).
Example: "What does purchasing power parity do?"
• Multiple-Choice Questions (MCQs): MCQs assess specific knowledge or skills by pro-
viding a set of possible answers from which the student must choose the correct one. They
are widely used for their efficiency in testing and grading (Maity et al., 2024d).
Example: "Which of the following is the largest planet in our solar system?
(a) Earth (b) Jupiter (c) Mars (d) Venus"

LLMs, through their sophisticated language processing capabilities, can generate these var-
ied question types effectively, adapting them to different educational contexts and learning ob-
jectives.
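
Because generated MCQs are often consumed by other software (quiz platforms, item banks), it helps to ask the model for a structured output and validate it before use. The sketch below assumes the model was instructed to return JSON with question, options, and answer_index fields; this schema is an illustrative convention, not a standard.

import json
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: list
    answer_index: int

def parse_mcq(raw_json):
    # Reject malformed items early rather than letting them reach students.
    data = json.loads(raw_json)
    mcq = MCQ(data["question"], list(data["options"]), int(data["answer_index"]))
    if not 0 <= mcq.answer_index < len(mcq.options):
        raise ValueError("answer_index must point at one of the options")
    if len(set(mcq.options)) != len(mcq.options):
        raise ValueError("options must be distinct")
    return mcq

raw = ('{"question": "Which of the following is the largest planet in our solar system?", '
       '"options": ["Earth", "Jupiter", "Mars", "Venus"], "answer_index": 1}')
print(parse_mcq(raw))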

4. AUTOMATED ANSWER ASSESSMENT: EVALUATING STUDENT RESPONSES

4.1. THE CAPABILITIES OF LLMS IN AUTOMATED ANSWER ASSESSMENT


In addition to generating questions, LLMs have demonstrated significant potential in automated
answer assessment (Fagbohun et al., 2024). The ability to accurately evaluate student responses
and provide feedback is a critical component of the educational process (Fagbohun et al., 2024).
Traditionally, this task has been performed by human educators, who must carefully consider
the content, context, and nuance of each response (Balfour, 2013). However, as the demand for
personalized and scalable education grows, the limitations of human-driven assessment become
more apparent (Luckin and Holmes, 2016).
LLMs offer a scalable solution to automated answer assessment, with the ability to evaluate a
wide range of responses, from simple factual answers to complex, open-ended essays (Fagbohun
et al., 2024). By leveraging their deep understanding of language and context, LLMs can identify
key concepts, assess the accuracy of the response, and provide constructive feedback (Stamper
et al., 2024). This capability is particularly valuable in large-scale educational settings, where
the volume of student responses can be overwhelming for human assessors (Broadbent et al.,
2018).
One of the key strengths of LLMs in automated assessment is their ability to identify nuanced
understanding or misconceptions in student responses (Kazi, 2023). For example, an LLM
can evaluate an essay on a historical event, recognizing whether the student has grasped the
underlying causes and implications of the event, rather than simply recounting facts (Kasneci
et al., 2023).
However, while LLMs have shown great promise in automated assessment, there are chal-
lenges to be addressed (Fagbohun et al., 2024). One of the primary concerns is the accuracy and
consistency of the assessments. LLMs, like all AI systems, are not infallible and can sometimes
produce incorrect or biased evaluations (Owan et al., 2023). Ensuring that the assessments are
fair, accurate, and aligned with the learning objectives is crucial for the successful integration of
LLMs into the educational process (Fagbohun et al., 2024).
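One common pattern for LLM-based assessment, sketched below under assumed conventions, is to give the model the question, a reference answer, and the student response, and to ask for a score and feedback in a structured format. The rubric wording, the 0-5 scale, and the call_llm argument are placeholders rather than a prescribed method.

import json

GRADING_PROMPT = (
    "You are grading a student answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Student answer: {student}\n"
    'Return JSON with keys "score" (an integer from 0 to 5), "feedback" '
    '(one or two sentences), and "misconceptions" (a possibly empty list of strings).'
)

def assess_answer(question, reference, student, call_llm):
    # call_llm is supplied by the caller; it sends the prompt to whatever model is in use.
    reply = call_llm(GRADING_PROMPT.format(question=question, reference=reference, student=student))
    result = json.loads(reply)
    result["score"] = max(0, min(5, int(result["score"])))   # keep the score inside the rubric range
    return result

Because the model's reply is free text, the json.loads call can fail; a production setup would add retries, stricter output constraints, and spot checks by human graders.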

4.2. EXAMPLES OF SUCCESSFUL ASSESSMENTS AND AREAS FOR IMPROVEMENT

To illustrate the capabilities of LLMs in automated answer assessment, consider the following
examples:

• Short-Answer Evaluation: An LLM is tasked with evaluating short-answer responses in
a biology exam (Shin and Gierl, ). The model is able to accurately assess whether the
student has correctly identified the function of a specific organelle within a cell, providing
feedback on both correct and incorrect answers. The LLM also identifies common mis-
conceptions, such as confusing the roles of the mitochondria and the nucleus, and provides
corrective feedback to guide the student’s learning.

• Essay Grading: In a history class, students are asked to write essays on the causes and
effects of World War II. The LLM evaluates the essays based on criteria such as under-
standing of key events, analysis of historical factors, and coherence of argument. The
model is able to identify well-reasoned arguments and provide feedback on areas where
the student could improve, such as providing more evidence or considering alternative
perspectives (Mansour et al., 2024; Henkel et al., 2024).

• Multiple-Choice Question Analysis: An LLM is used to analyze student responses to
multiple-choice questions in a mathematics exam (Henkel et al., 2024). In addition to
identifying the correct answers, the model also analyzes the patterns of incorrect re-
sponses, identifying common errors and misconceptions. This information is used to
provide targeted feedback and suggest areas for further study (a small sketch of such an
error-pattern analysis follows this list).
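
A minimal sketch of the error-pattern analysis mentioned in the last example: given the options students selected for one MCQ, it tallies how often each incorrect option (distractor) was chosen so that feedback can target the corresponding misconception. The response data and option labels are invented for illustration.

from collections import Counter

def distractor_report(responses, correct_option):
    # Count how often each incorrect option was chosen, most common first.
    wrong = Counter(choice for choice in responses if choice != correct_option)
    return wrong.most_common()

responses = ["b", "a", "b", "c", "a", "a", "b", "d", "a"]   # hypothetical answers to one item
print(distractor_report(responses, correct_option="b"))     # distractor "a" dominates here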

While these examples demonstrate the potential of LLMs in automated assessment, there are
also areas for improvement. One challenge is ensuring that the feedback provided by the LLM
is constructive and actionable (Meyer et al., 2024a). For instance, while the model may correctly
identify an error in a student’s response, it must also provide clear guidance on how to address
the mistake. Additionally, the LLM must be able to adapt its feedback to the individual needs of
each student, taking into account their prior knowledge and learning style.
Another area for improvement is the ability of LLMs to assess more complex and creative
responses, such as those involving critical thinking, problem-solving, or artistic expression.
While LLMs have made significant strides in understanding and generating text, evaluating these
higher-order skills remains a challenge (Hsiao et al., 2023). Future research and development
will be needed to enhance the capabilities of LLMs in these areas, ensuring that they can fully
support the diverse needs of learners.

5. HUMAN EVALUATION AND QUALITY METRICS FOR GENERATED QUESTIONS

5.1. ASSESSING THE QUALITY OF GENERATED QUESTIONS


The quality of questions generated by LLMs is a critical factor in their effectiveness as edu-
cational tools. High-quality questions should be clear, relevant, and aligned with the learning
objectives, challenging students to think critically and apply their knowledge. To ensure that the
questions generated by LLMs meet these standards, human evaluation and quality metrics play
a crucial role (Kurdi et al., 2020).
Human evaluation involves assessing the generated questions based on a set of predefined
criteria, such as grammaticality, relevance, clarity, complexity, and alignment with the curricu-
lum (Kurdi et al., 2020; Maity et al., 2023). Expert educators or subject matter experts typically
conduct this evaluation, providing feedback on the strengths and weaknesses of the questions.
This feedback is invaluable for refining the prompts and improving the quality of the generated
questions.
In addition to human evaluation, automated quality metrics can be used to assess the gen-
erated questions. These metrics may include measures such as unigram-, bigram-, and n-gram-
based evaluations, which provide quantitative insights into the quality of the questions (Kurdi
et al., 2020). However, these automated metrics have a well-known limitation: they tend to
reward surface-level linguistic similarity (e.g., character-, unigram-, bigram-, or longest-
common-subsequence overlap) rather than deeper contextual understanding (Nema and Khapra, 2018).
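For concreteness, the sketch below computes unigram and bigram precision of a generated question against a single reference question, which is the kind of surface-overlap score such metrics report; it is a simplified illustration rather than an implementation of any specific published metric.

from collections import Counter

def ngram_precision(generated, reference, n):
    # Fraction of n-grams in the generated question that also appear in the reference.
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    gen, ref = ngrams(generated), ngrams(reference)
    total = sum(gen.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in gen.items()) / total

generated = "What is the capital city of France?"
reference = "What is the capital of France?"
print(ngram_precision(generated, reference, 1), ngram_precision(generated, reference, 2))

High overlap here does not guarantee that the generated question probes the same concept as the reference, which is exactly the limitation noted above.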
One of the challenges in evaluating the quality of generated questions is the subjective
nature of some of the criteria. For instance, what one educator considers a challenging and
thought-provoking question, another might view as overly complex or unclear (Crogman and
Trebeau Crogman, 2018). To address this, it is important to establish clear guidelines and crite-
ria for evaluation, ensuring consistency and objectivity in the assessment process.

5.2. VARIATIONS IN QUALITY ACROSS DIFFERENT METHODS


The quality of questions generated by LLMs can vary significantly depending on the meth-
ods and techniques used. For example, questions generated using zero-shot prompting may be
more general and less tailored to the specific content, while those generated using fine-tuning
or prompt-tuning may be more precise and relevant (Maity et al., 2023). Understanding these
variations is essential for selecting the appropriate method for a given educational context.
One common variation in quality is related to the complexity of the generated questions.
LLMs are capable of generating both simple, factual questions and more complex, analytical
questions (Maity et al., 2024b). However, the latter requires a deeper understanding of the con-
tent and context, which may not always be achievable through basic prompting techniques. To
generate higher-order questions, more advanced techniques, such as chain-of-thought prompting
(Wei et al., 2022) or fine-tuning (Raffel et al., 2020), may be necessary.
Another variation in quality is related to the cultural and linguistic diversity of the generated
questions. LLMs trained on diverse datasets are better equipped to generate questions that are
culturally relevant and appropriate for different student populations. However, this diversity can
also introduce challenges, as the model may generate questions that are less familiar or relevant
to certain groups of students. Ensuring that the generated questions are inclusive and accessible
to all learners is an important consideration in the evaluation process (Maity et al., 2024a; Maity
et al., 2024b).

6. BROADER IMPLICATIONS AND FUTURE DIRECTIONS


6.1. THE ROLE OF LLMS IN PERSONALIZED AND ADAPTIVE LEARNING
As LLMs continue to evolve, their role in personalized and adaptive learning is becoming in-
creasingly significant. The ability of LLMs to generate contextually relevant questions and
assess student responses on a large scale opens up new possibilities for personalized education
(Alier et al., 2023). By leveraging LLMs, educators can create tailored learning experiences that
adapt to the individual needs and progress of each student (Goslen et al., 2024).
One of the key benefits of using LLMs in personalized learning is the ability to provide
immediate feedback and guidance (Meyer et al., 2024b). As students interact with the system,
LLMs can generate questions that challenge their understanding, identify areas of difficulty, and
offer targeted feedback to support their learning. This real-time interaction can help students
stay engaged and motivated, while also providing educators with valuable insights into their
progress.
However, the integration of LLMs into personalized learning also raises important questions
about the balance between human and AI-driven education (Yekollu et al., 2024). While LLMs
can offer scalable and efficient solutions, they cannot replace the nuanced understanding and
empathy that human educators bring to the classroom. The challenge lies in finding the right
balance, where LLMs complement and enhance human-led education, rather than supplanting
it.

6.2. ETHICAL CONSIDERATIONS AND CHALLENGES


The use of LLMs in education also raises important ethical considerations (Meyer et al., 2024b).
Issues such as bias, fairness, and transparency are central to the responsible use of AI in edu-
cation (Memarian and Doleck, 2023). LLMs, like all AI systems, are trained on data that may
contain biases, and these biases can be reflected in the questions they generate or the assess-
ments they perform (Memarian and Doleck, 2023). Ensuring that LLMs are fair and unbiased
requires careful attention to the training data, as well as ongoing monitoring and evaluation of
the system’s outputs.
Another ethical consideration is the transparency of the AI-driven educational process (Badawi
et al., 2018). Students and educators need to understand how LLMs generate questions and as-
sess responses, and they should be informed about the potential limitations and biases of the
system (Memarian and Doleck, 2023). Transparency is key to building trust in AI-driven edu-
cation and ensuring that students and educators feel confident in the use of these technologies
(Kim, 2024).
Finally, the use of LLMs in education raises questions about data privacy and security (Rah-
man et al., 2024). As LLMs interact with students and assess their responses, they may collect
and store sensitive information about the student’s performance and learning history. Protecting
this data and ensuring that it is used responsibly is essential for maintaining the integrity and
security of the educational process.

6.3. FUTURE DIRECTIONS IN AUTOMATED QUESTION GENERATION AND ASSESSMENT

Looking to the future, the role of LLMs in automated question generation and assessment is
likely to expand and evolve (Fagbohun et al., 2024). Advances in AI and NLP technologies will
enable the development of more sophisticated models that are better equipped to handle complex
and creative educational tasks (Alqahtani et al., 2023). As these models become more integrated
into the educational process, they will play a key role in supporting personalized and adaptive
learning, providing scalable solutions that enhance the quality and accessibility of education.
One promising direction for future research is the development of models that can assess
higher-order thinking skills, such as critical thinking, problem-solving, and creativity. These
skills are essential for success in the 21st century, and the ability to assess them accurately and
efficiently is a major challenge for educators. LLMs, with their advanced language understand-
ing and generation capabilities, have the potential to address this challenge, providing new tools
for assessing and supporting the development of these critical skills (Moore et al., 2023).
Another important direction for future research is the exploration of new methodologies for
fine-tuning and prompt-tuning LLMs for specific educational tasks. As LLMs continue to be
used in a wider range of educational contexts, it will be important to develop techniques that
allow for the efficient and effective adaptation of these models to different subject areas, student
populations, and learning objectives.

7. CONCLUSION
In conclusion, large language models have the potential to revolutionize education through au-
tomated question generation and answer assessment. These models, with their ability to under-
stand and generate human-like text, offer scalable solutions that can enhance personalized and
adaptive learning. By leveraging advanced prompting techniques and fine-tuning methodolo-
gies, educators can create high-quality, contextually relevant questions that challenge students
and support their learning. Furthermore, LLMs’ capabilities in automated assessment can pro-
vide timely and constructive feedback, helping students identify areas for improvement and
guiding their educational journey.
However, the integration of LLMs into education also presents challenges and ethical con-
siderations that must be carefully addressed. Ensuring the fairness, accuracy, and transparency
of AI-driven educational processes is essential for building trust and confidence in these tech-
nologies. As we look to the future, ongoing research and development will be key to realizing
the full potential of LLMs in education, creating a more personalized, adaptive, and accessible
learning experience for all students.

REFERENCES
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Alier, M., Casañ, M. J., and Filvà, D. A. 2023. Smart learning applications: Leveraging LLMs for contextualized and ethical educational technology. In International Conference on Technological Ecosystems for Enhancing Multiculturality. Springer, 190–199.
Alqahtani, T., Badreldin, H. A., Alrashed, M., Alshaya, A. I., Alghamdi, S. S., Bin Saleh, K., Alowais, S. A., Alshaya, O. A., Rahman, I., Al Yami, M. S., et al. 2023. The emergent role of artificial intelligence, natural learning processing, and large language models in higher education and research. Research in Social and Administrative Pharmacy 19, 8, 1236–1242.
Badawi, G., de Beyrouth, G., and Badawi, H. 2018. AI-driven educational paradigms: Opportunities and challenges, and ethical considerations in teaching and learning.
Balfour, S. P. 2013. Assessing writing in MOOCs: Automated essay scoring and calibrated peer review™. Research & Practice in Assessment 8, 40–48.
Broadbent, J., Panadero, E., and Boud, D. 2018. Implementing summative assessment with a formative flavour: A case study in a large class. Assessment & Evaluation in Higher Education 43, 2, 307–322.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds. Vol. 33. Curran Associates, Inc., 1877–1901.
Chappuis, J. et al. 2015. Seven strategies of assessment for learning. Pearson.
Crogman, H. and Trebeau Crogman, M. 2018. Modified generated question learning, and its classroom implementation and assessment. Cogent Education 5, 1, 1459340.
Fagbohun, O., Iduwe, N., Abdullahi, M., Ifaturoti, A., and Nwanna, O. 2024. Beyond traditional assessment: Exploring the impact of large language models on grading practices. Journal of Artificial Intelligence and Machine Learning & Data Science 2, 1, 1–8.
Floridi, L. and Chiriatti, M. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines 30, 681–694.
Floridi, L. and Cowls, J. 2022. A unified framework of five principles for AI in society. Machine Learning and the City: Applications in Architecture and Urban Design, 535–545.
Goslen, A., Kim, Y. J., Rowe, J., and Lester, J. 2024. LLM-based student plan generation for adaptive scaffolding in game-based learning environments. International Journal of Artificial Intelligence in Education, 1–26.
Henkel, O., Hills, L., Boxer, A., Roberts, B., and Levonian, Z. 2024. Can large language models make the grade? An empirical study evaluating LLMs' ability to mark short answer questions in K-12 education. In Proceedings of the Eleventh ACM Conference on Learning @ Scale. 300–304.
Hsiao, Y.-P., Klijn, N., and Chiu, M.-S. 2023. Developing a framework to re-design writing assignment assessment for the era of large language models. Learning: Research and Practice 9, 2, 148–158.
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274.
Kazi, N. H. 2023. Automated short-answer grading and misconception detection using large language models. University of North Florida.
Kim, J. 2024. Leading teachers' perspective on teacher-AI collaboration in education. Education and Information Technologies 29, 7, 8693–8724.
Kurdi, G., Leo, J., Parsia, B., Sattler, U., and Al-Emari, S. 2020. A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education 30, 121–204.
Lester, B., Al-Rfou, R., and Constant, N. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Li, Q., Fu, L., Zhang, W., Chen, X., Yu, J., Xia, W., Zhang, W., Tang, R., and Yu, Y. 2023. Adapting large language models for education: Foundational capabilities, potentials, and challenges. arXiv preprint arXiv:2401.08664.
Luckin, R. and Holmes, W. 2016. Intelligence unleashed: An argument for AI in education.
Maity, S., Deroy, A., and Sarkar, S. 2023. Harnessing the power of prompt-based techniques for generating school-level questions using large language models. In Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation. 30–39.
Maity, S., Deroy, A., and Sarkar, S. 2024a. Exploring the capabilities of prompted large language models in educational and assessment applications. In Proceedings of the 17th International Conference on Educational Data Mining, B. Paaßen and C. D. Epp, Eds. International Educational Data Mining Society, Atlanta, Georgia, USA, 961–968.
Maity, S., Deroy, A., and Sarkar, S. 2024b. How effective is GPT-4 Turbo in generating school-level questions from textbooks based on Bloom's revised taxonomy?
Maity, S., Deroy, A., and Sarkar, S. 2024c. How ready are generative pre-trained large language models for explaining Bengali grammatical errors? In Proceedings of the 17th International Conference on Educational Data Mining, B. Paaßen and C. D. Epp, Eds. International Educational Data Mining Society, Atlanta, Georgia, USA, 664–671.
Maity, S., Deroy, A., and Sarkar, S. 2024d. A novel multi-stage prompting approach for language agnostic MCQ generation using GPT. In European Conference on Information Retrieval. Springer, 268–277.
Mansour, W., Albatarni, S., Eltanbouly, S., and Elsayed, T. 2024. Can large language models automatically score proficiency of written essays? arXiv preprint arXiv:2403.06149.
Mazidi, K. and Nielsen, R. 2014. Linguistic considerations in automatic question generation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 321–326.
Memarian, B. and Doleck, T. 2023. Fairness, accountability, transparency, and ethics (FATE) in artificial intelligence (AI), and higher education: A systematic review. Computers and Education: Artificial Intelligence, 100152.
Meyer, J., Jansen, T., Schiller, R., Liebenow, L. W., Steinbach, M., Horbach, A., and Fleckenstein, J. 2024a. Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students' text revision, motivation, and positive emotions. Computers and Education: Artificial Intelligence 6, 100199.
Meyer, J., Jansen, T., Schiller, R., Liebenow, L. W., Steinbach, M., Horbach, A., and Fleckenstein, J. 2024b. Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students' text revision, motivation, and positive emotions. Computers and Education: Artificial Intelligence 6, 100199.
Moore, S., Tong, R., Singh, A., Liu, Z., Hu, X., Lu, Y., Liang, J., Cao, C., Khosravi, H., Denny, P., et al. 2023. Empowering education with LLMs - the next-gen interface and content generation. In International Conference on Artificial Intelligence in Education. Springer, 32–37.
Mulla, N. and Gharpure, P. 2023. Automatic question generation: A review of methodologies, datasets, evaluation metrics, and applications. Progress in Artificial Intelligence 12, 1, 1–32.
Nema, P. and Khapra, M. M. 2018. Towards a better metric for evaluating question generation systems. arXiv preprint arXiv:1808.10192.
Owan, V. J., Abang, K. B., Idika, D. O., Etta, E. O., and Bassey, B. A. 2023. Exploring the potential of artificial intelligence tools in educational measurement and assessment. Eurasia Journal of Mathematics, Science and Technology Education 19, 8, em2307.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8, 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140, 1–67.
Rahman, M. A., Alqahtani, L., Albooq, A., and Ainousah, A. 2024. A survey on security and privacy of large multimodal deep learning models: Teaching and learning perspective. In 2024 21st Learning and Technology Conference (L&T). IEEE, 13–18.
Raiaan, M. A. K., Mukta, M. S. H., Fatema, K., Fahad, N. M., Sakib, S., Mim, M. M. J., Ahmad, J., Ali, M. E., and Azam, S. 2024. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access.
Shin, J. and Gierl, M. J. Automated short-response scoring for automated item generation in science assessments. In The Routledge International Handbook of Automated Essay Evaluation. Routledge, 504–534.
Stamper, J., Xiao, R., and Hou, X. 2024. Enhancing LLM-based feedback: Insights from intelligent tutoring systems and the learning sciences. In International Conference on Artificial Intelligence in Education. Springer, 32–43.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Vol. 30. Curran Associates, Inc.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds. Vol. 35. Curran Associates, Inc., 24824–24837.
Yekollu, R. K., Bhimraj Ghuge, T., Sunil Biradar, S., Haldikar, S. V., and Farook Mohideen Abdul Kader, O. 2024. AI-driven personalized learning paths: Enhancing education through adaptive systems. In International Conference on Smart Data Intelligence. Springer, 507–517.
