Evaluating a Learning Design for EFL Writing Using ChatGPT
Postal address: Precious Blood Secondary School, 338 San Ha Street, Chai Wan, Hong Kong.
Funding Acknowledgements
This research received no specific grant from any funding agency in the public, commercial,
or not-for-profit sectors.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Biographical Note
David James Woo is a secondary school teacher. His research interests are in artificial intelligence in education.
Abstract
This study explores the application of ChatGPT in enhancing English as a Foreign Language
(EFL) writing skills in a Hong Kong secondary school setting. The innovation focused on students completing writing tasks and integrating AI-generated text with their own words. Evaluation showed that
while students could utilize ChatGPT's capabilities, heavy reliance on AI output could mask
their writing abilities. The study emphasizes the need for students to exercise more agency in
editing AI output and suggests pedagogical strategies. It provides valuable insights for integrating generative AI tools into secondary education.
1 Introduction
GPT-3 and GPT-4 have captivated educational researchers’ and practitioners’ interest. This is
because they appear to be “stronger” AI (Hockly, 2023) that can generate large chunks of
coherent text indistinguishable from human writing (Brown et al., 2020). Moreover, ChatGPT
has popularized interaction with language models through a chatbot, that is, a conversational
user interface that enables human users to engage in meaningful verbal or text-based
exchanges with a computer program (Kim et al., 2022). Importantly, these language models
enable people to write with a machine-in-the-loop (Clark et al., 2018), that is, to write with
the support of generative AI that is designed to assist people, such as a chatbot, while people
exercise full agency on how to act on generative AI output, if at all. Since AI curricula (Chiu et al., 2022; Education Bureau, 2023) have shown a gap in instruction for writing with a machine-in-the-loop, we regard this kind of writing as an innovation that has not yet been widely adopted (Rogers, 1962) but may enhance students' writing abilities.
However, machine-in-the-loop writing can vary, not least with the type of students and the type of writing. In this
paper, we report a case on the design, implementation and evaluation of this innovation.
Our innovation coincides with the release of the POE app. At the time of study, the
app granted free access to ChatGPT and five other chatbots (e.g. Sage; GPT-4; Claude+; Google-PaLM), state-of-the-art (SOTA) chatbots powered by language models hundreds of billions of parameters in size (see Figure 1). The language
models in SOTA chatbots have capabilities such as understanding abstract task descriptions
and human concepts in natural language (Reynolds & McDonell, 2021) and chain-of-thought reasoning, that is, breaking down a problem into steps before delivering an answer (Kojima et al., 2022). For students to take advantage of these novel capabilities and to get desired output, students must learn prompt engineering, that is, how to craft appropriate instructions for a chatbot. For example, rather than simply asking a chatbot to "write an article," a student might specify the text type, topic, audience and word count in the prompt.
(Figure 1 here)
Hong Kong’s educational conditions also influence our innovation. First, Hong Kong
secondary schools deliver an English as a foreign language (EFL) curriculum, and we aim to
release a curriculum module about writing with a machine-in-the-loop in this subject area.
Second, at the Hong Kong school where the author is an EFL teacher, the principal has tasked the author with developing a generative AI ethical use policy for the school. The principal has also tasked the author with instructing colleagues in the EFL department so that they might acquire the technological pedagogical knowledge and skills (Puentedura, 2015) to effectively integrate ChatGPT into their writing classrooms.
Since we have considerable opportunity to leverage SOTA chatbots and to influence policy
and practice in our local education system, the stakes for the design, implementation and
evaluation of the innovation are high. As such, we are piloting our innovation, which we describe below.
We report our pilot as a learning design, that is, a framework for describing learning
environments and learning activities (Conole & Wills, 2013). Because we had not found an existing learning design for machine-in-the-loop writing, we approached our learning design's development through design-based research (DBR) (Wang & Hannafin, 2005), that is, a flexible and systematic methodology that can improve educational practice. We first designed the learning design's purpose and intended learning outcomes (ILOs), that is, what students
should achieve by the end of the implementation. Then we designed the learning activities,
that is, basic units of interaction with or among learners. Table 1 summarizes our initial
learning design for EFL students’ machine-in-the-loop writing with ChatGPT. The design
comprises (1) a title, (2) a purpose, (3) ILOs, (4) learning activities, and (5) materials and
resources.
(Table 1 here)
In the following sections, we report features of our pilot learning design that we implemented and evaluated.
Students could write either a feature article or a letter to the editor. Figure 2 shows the
prompts we selected, taken from the 2023 Hong Kong university entrance examination for
the EFL subject area (HKDSE), writing paper, which almost all Hong Kong mainstream
school students take in their final year of secondary school. Prior to the implementation, only
two students had reported writing a feature article in English language, while eight students
had written a letter to the editor. Although many students may not have been taught to read and write these text types yet, we wondered if students could use ChatGPT and other chatbots' capabilities to overcome these limitations and, if so, how students would do so.
(Figure 2 here)
As the actual HKDSE writing prompts instruct students to write around 400 words,
we limited students' written work to no more than 500 words on Google Docs, using their own words and words generated by POE chatbots. Students could prompt any POE chatbot
in any way possible, as many times as necessary and use any chatbot output. In this way, we
thought chatbots could adapt to students’ abilities and provide differentiated instruction
(Kohnke, 2022). Since we proposed students could use words from more than one chatbot,
we instructed students to differentiate their own words from AI words by highlighting the words generated by each chatbot in a different color.
Since we were exploring how students would use SOTA chatbots and how these
chatbots would contribute to the quality of students’ written work, we used the actual
HKDSE writing paper marking scheme, which comprises dimensions of content, language and organization. The highest possible score for each dimension is seven and the highest possible total score is 21.
The author and teachers from the school double-scored students’ written work. To do
this, we anonymized the texts so that a scorer would not know who wrote the text and which
words were human words. Then two scorers independently scored each text for dimensions of
content, language and organization according to the marking scheme. By identifying texts
with higher human-rated scores and analyzing these texts’ integration of human words and
chatbot output, we might then have evidence to inform effective practice for AI word use.
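To illustrate the double-scoring procedure, the following is a minimal Python sketch that averages two raters' independent dimension scores for one anonymized text. The data structure, function name and score values are hypothetical illustrations, not the study's actual data.

    # A minimal sketch of the double-scoring procedure described above.
    # The scores below are hypothetical illustrations, not the study's data.

    DIMENSIONS = ("content", "language", "organization")  # HKDSE marking scheme, max 7 each

    def combine_scores(rater_a: dict, rater_b: dict) -> dict:
        """Average two raters' independent scores per dimension and in total."""
        combined = {d: (rater_a[d] + rater_b[d]) / 2 for d in DIMENSIONS}
        combined["total"] = sum(combined[d] for d in DIMENSIONS)  # max 21
        return combined

    # Example: one anonymized text scored independently by two raters
    text_07 = combine_scores(
        {"content": 4, "language": 5, "organization": 4},
        {"content": 3, "language": 5, "organization": 4},
    )
    print(text_07)  # {'content': 3.5, 'language': 5.0, 'organization': 4.0, 'total': 12.5}

Because raters never saw authors' names or which words were human-written, any systematic difference between combined scores can more plausibly be attributed to text quality rather than rater expectations.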
Our instructional materials introduced students to chatbots, prompt engineering, the writing prompt and task rules. We introduced chatbots using an inductive approach: first, showing a chatbot
screenshot and asking students, “What are you looking at?” Second, we asked students to
interact with a chatbot, before asking students what this type of generative AI is and how to
interact with it. We introduced the features of chatbots, including turn-taking and memory; as
for the garbage-in-garbage-out principle for interacting with chatbots, we showed a chatbot
screenshot to students and asked, "What is a problem with this conversation?" Next, we asked students to discuss classmates' actual prompt content and theoretical prompt content. Finally, we introduced the
writing prompt, task rules, materials including POE app and Google Docs and assessment.
4.5 Implementation
The learning design was implemented as a workshop in the STEM lab of the author's school on July 5, 2023 and repeated on July 6. Six students voluntarily participated in the July 5 workshop and 16 on July 6. The students came from various grade levels.
The instructional materials were delivered in English by the author. At the same time, a colleague provided simultaneous spoken interpretation in Cantonese. At the workshop, students were given 45 minutes to complete the writing task.
4.6 Evaluation
To evaluate the learning design, students completed a cognitive load questionnaire developed from measures by Paas (1992) and Sweller, van Merrienboer and Paas (1998). Table 2 summarizes the descriptive statistics of the questionnaire. According to a six-point Likert rating scheme, 1 refers to strongly disagree and 6 to strongly agree. The results show that students commonly agreed that the workshop content was difficult, that it required a lot of mental effort and that they did not have enough time to answer the questions.
(Table 2 here)
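As a minimal sketch of how the descriptive statistics in Table 2 can be computed, the following Python snippet derives a mean and standard deviation per questionnaire item from six-point Likert responses. The response values are hypothetical illustrations, not the study's data.

    # Descriptive statistics for six-point Likert items (1 = strongly disagree,
    # 6 = strongly agree). Response values are hypothetical, not the study's data.

    from statistics import mean, stdev

    responses = {
        "The learning content in this workshop was difficult for me.": [4, 5, 3, 4, 5, 4],
        "I did not have enough time to answer the questions in this workshop.": [5, 4, 4, 5, 3, 5],
    }

    for item, scores in responses.items():
        print(f"M = {mean(scores):.2f}, SD = {stdev(scores):.2f}  {item}")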
Fifteen students submitted valid texts, written according to the task rules, and five students
wrote invalid texts. Figure 3 shows a feature article written with a student’s own words and a
chatbot’s words. Figure 4 shows a letter to the editor written with a student’s own words and
chatbots’ words.
(Figure 3 here)
(Figure 4 here)
We analyzed valid texts for language features, including students’ use of chatbot
words. The average length of a composition was 340 words, and a composition on average
contained 80% chatbot words. In fact, only one student wrote at least 50% of her composition
with her own words. The modes for number of instances of human words, total number of
human words, number of instances of chatbot words and number of chatbots used were two, four, one and one, respectively, showing little integration of human words and editing of chatbot output.
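As an illustration of the kind of composition analysis reported here, the following Python sketch counts human and chatbot words given source-labeled text spans. The span representation is a hypothetical illustration; in the study itself, sources were marked by highlight color in Google Docs.

    # A sketch of the composition analysis described above: computing the share
    # of chatbot-generated words from source-labeled spans. The spans below are
    # hypothetical illustrations, not a student's actual text.

    spans = [
        ("human", "Dear Editor,"),
        ("chatgpt", "I am writing to express my concern about the rise of social media use among teenagers."),
        ("human", "In my school, I also see this problem."),
    ]

    counts = {}
    for source, text in spans:
        counts[source] = counts.get(source, 0) + len(text.split())

    total = sum(counts.values())
    chatbot = total - counts.get("human", 0)
    print(f"chatbot words: {chatbot}/{total} ({chatbot / total:.0%})")  # 16/26 (62%)

Counting instances of each source, as in the mode statistics above, only requires tallying the number of spans per source rather than their word lengths.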
The average human-rated score was 11.8, with scores ranging from 5 to 14.5. Content showed the lowest average score at 3.6 and language the highest at 4.5. For content, students lacked details and creativity; for organization, they lacked clear intra-paragraph structure, such as a topic sentence and coherent referencing; but for language, we found students' sentences were generally accurate.
5 Discussion
The learning design for writing with a machine-in-the-loop has shown utility in our school context. Students across grade levels could answer a challenging HKDSE writing prompt. It appears students were enhancing their knowledge and skills to prompt SOTA chatbots. In addition, most students completed the task according to the task rules.
How students leveraged chatbots, however, highlights learning design weaknesses. First, students relied heavily on large, unedited chunks of chatbot output to complete the task, and typically from just one chatbot. Thus, ChatGPT functioned not as a supplemental language learning tool but as a substitute for students' own language, masking students' writing abilities. Second, although students on average
scored adequately, students may not attain higher human-rated scores without exercising
more agency to edit chatbot output for content and organization. How students leveraged
chatbots and their output appears related to the difficulty of the task (Charters, 2003), which may have exceeded many students' existing abilities.
We recommend the following pedagogical strategies that may reduce task difficulty
and facilitate student agency to edit chatbot output. First, the writing prompt should be
approachable from students’ existing abilities so that students can answer the prompt
independently or with some chatbot assistance (Vygotsky, 1978). For example, writing
prompts can feature topics, text types and word counts with which students are already
familiar. Second, teach task completion and prompt engineering in a multi-step, scaffolded
way. For instance, teach task completion according to a writing approach (e.g. genre-based;
process-based; product-based) and teach how to prompt SOTA chatbots according to each
stage in that writing approach. For process writing students could learn about the roles that
ChatGPT can play in pre-writing, drafting, editing and revising. In addition, for low-literacy EFL
students, emphasize prompts for chatbots to produce outputs in students’ native language
(Kohnke et al., 2023) and to produce English language output in a simpler style.
Implementation of these recommendations will require more time, but by applying these
methods, students could move beyond using ChatGPT to replace their own effort to answer a
writing prompt. Finally, consider an assessment method that penalizes wholesale copying
from ChatGPT but rewards editing of chatbot output and the inclusion of students’ own
words in an answer. For instance, the HKDSE marking scheme did not account for copying
text from other sources. However, the HKDSE marking scheme for integrated skills tasks
accounts for students’ wholesale copying, penalizing them in the dimensions of language and
appropriacy. Adopting a similar marking scheme may motivate students to critically evaluate and edit chatbot output before including it in their own writing.
6 References
Brown TB, Mann B, Ryder N, et al. (2020) Language Models are Few-Shot Learners. In: Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
Charters E (2003) The Use of Think-aloud Methods in Qualitative Research: An Introduction to Think-aloud Methods. Brock Education Journal 12(2): 68–82.
Chiu TKF, Meng H, Chai C-S, et al. (2022) Creation and Evaluation of a Pretertiary Artificial Intelligence (AI) Curriculum. IEEE Transactions on Education 65(1): 30–39.
Clark E, Ross AS, Tan C, et al. (2018) Creative Writing with a Machine in the Loop: Case Studies on Slogans and Stories. In: Proceedings of the 23rd International Conference on Intelligent User Interfaces, New York, NY, USA, 5 March 2018, pp. 329–340.
Conole G and Wills S (2013) Representing learning designs – making design explicit and shareable. Educational Media International 50(1): 24–38. DOI: 10.1080/09523987.2013.777184.
Education Bureau (2023) Module on Artificial Intelligence for Junior Secondary Level. Hong Kong: Education Bureau.
Hockly N (2023) Artificial Intelligence in English Language Teaching: The Good, the Bad and the Ugly. RELC Journal. DOI: 10.1177/00336882231168504.
Kim H, Yang H, Shin D, et al. (2022) Design principles and architecture of a second language learning chatbot. Language Learning & Technology 26(1): 1–18 (accessed August 2023).
Kohnke L (2022) A Pedagogical Chatbot: A Supplemental Language Learning Tool. RELC Journal. DOI: 10.1177/00336882211067054.
Kohnke L, Moorhouse BL and Zou D (2023) ChatGPT for Language Teaching and Learning. RELC Journal 54(2): 537–550. DOI: 10.1177/00336882231162868.
Kojima T, Gu SS, Reid M, et al. (2022) Large Language Models are Zero-Shot Reasoners. In: Advances in Neural Information Processing Systems 35 (NeurIPS 2022).
Paas FGWC (1992) Training strategies for attaining transfer of problem-solving skill in statistics: A cognitive-load approach. Journal of Educational Psychology 84(4): 429–434.
Reynolds L and McDonell K (2021) Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. DOI: 10.48550/arXiv.2102.07350.
Rogers EM (1962) Diffusion of Innovations. New York: Free Press.
Sweller J, van Merrienboer JJG and Paas FGWC (1998) Cognitive Architecture and Instructional Design. Educational Psychology Review 10(3): 251–296. DOI: 10.1023/A:1022193728205.
Vygotsky LS (1978) Mind in Society: The Development of Higher Psychological Processes. Cambridge, MA: Harvard University Press.
Wang F and Hannafin MJ (2005) Design-based research and technology-enhanced learning environments. Educational Technology Research and Development 53(4): 5–23.
Figure 1
Figure 2
Writing Prompts
Figure 3
A Feature Article
Figure 4
A Letter to the Editor
Note. Words from the Sage chatbot are colored purple; ChatGPT in green; GPT-4 in blue; Claude+ in red; and
Google-PaLM in grey.
Table 1
Table 2
Questionnaire Items
1. The learning content in this workshop was difficult for me.
2. I had to put a lot of effort into answering the questions in this workshop.
3. It was troublesome for me to answer the questions in this workshop.
4. I felt frustrated answering the questions in this workshop.
5. I did not have enough time to answer the questions in this workshop.
6. During the workshop, the way of instruction or learning content presentation caused me a lot of mental effort.
7. I need to put lots of effort into completing the learning tasks or achieving the learning objectives in this workshop.
8. The instructional way in the workshop was difficult to follow and understand.