0% found this document useful (0 votes)
34 views

Building A Validity Argument For An Automated Writing Evaluation System (Erevise) As A Formative Assessment

This document discusses building a validity argument for an automated writing evaluation system called eRevise as a formative assessment tool. The authors aim to evaluate eRevise's ability to provide feedback on text-based argument writing. Automated writing evaluation systems are intended to increase writing opportunities for students by reducing teacher workload and providing timely feedback. However, prior systems have not focused on higher-level writing skills like argumentation. The authors argue that validity evidence is needed to show that eRevise's feedback supports the intended uses and interpretation of scores.

Uploaded by

mark moyo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views

Building A Validity Argument For An Automated Writing Evaluation System (Erevise) As A Formative Assessment

This document discusses building a validity argument for an automated writing evaluation system called eRevise as a formative assessment tool. The authors aim to evaluate eRevise's ability to provide feedback on text-based argument writing. Automated writing evaluation systems are intended to increase writing opportunities for students by reducing teacher workload and providing timely feedback. However, prior systems have not focused on higher-level writing skills like argumentation. The authors argue that validity evidence is needed to show that eRevise's feedback supports the intended uses and interpretation of scores.

Uploaded by

mark moyo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/359666393

Building a Validity Argument for an Automated Writing Evaluation System


(eRevise) as a Formative Assessment

Article in Computers and Education Open · April 2022


DOI: 10.1016/j.caeo.2022.100084

CITATIONS READS

6 274

5 authors, including:

Richard Correnti Lindsay Clare Matsumura


University of Pittsburgh University of Pittsburgh
71 PUBLICATIONS 2,300 CITATIONS 71 PUBLICATIONS 1,636 CITATIONS

SEE PROFILE SEE PROFILE

Elaine L. Wang Diane Litman


RAND Corporation University of Pittsburgh
71 PUBLICATIONS 568 CITATIONS 285 PUBLICATIONS 9,239 CITATIONS

SEE PROFILE SEE PROFILE

All content following this page was uploaded by Elaine L. Wang on 05 July 2022.

The user has requested enhancement of the downloaded file.


Computers and Education Open 3 (2022) 100084

Contents lists available at ScienceDirect

Computers and Education Open


journal homepage: www.sciencedirect.com/journal/computers-and-education-open

Building a validity argument for an automated writing evaluation system


(eRevise) as a formative assessment
Richard Correnti a, *, Lindsay Clare Matsumura a, Elaine Lin Wang b, Diane Litman a,
Haoran Zhang a
a
University of Pittsburgh/Learning Research and Development Center, 3420 Forbes Ave, Pittsburgh, PA 15260, United States
b
RAND Corporation, 4570 Fifth Avenue #600, Pittsburgh, PA 15213, United States

1. Introduction pertinent concepts related to effective argumentation, such as the


importance of providing reasons (i.e., warrants) linking evidence to
Beginning in the upper elementary grades, text-based argument claims as suggested by the Toulmin model (see, e.g., [91]). Research
writing has been increasingly emphasized in U.S. learning standards as shows that across the elementary and secondary grades, teachers rarely
critical to college readiness [62,63]. Results of national assessments in assign tasks that require analysis and use of text evidence [41,58,59].
the United States consistently show that the very large majority of stu­ Surveys reveal that a majority of middle school teachers assign argu­
dents do not have proficient writing skills [61], and this is especially the ment writing tasks no more than one or two times per year [27]. In short,
case for text-based writing [45,47,81]. Young writers especially lack classroom supports for text-based argument writing instruction are
familiarity with the discursive features associated with argumentation, clearly needed that make critical features of the construct explicit to
such as identifying evidence and explaining how it connects to the claim teachers and students and increase students’ opportunities to write and
[66,80,85,86]. Indeed, marshaling effective text evidence in argument revise their essays in response to substantive feedback.
writing has proven difficult even for secondary [64], and post-secondary
students [22].
Several explanations account for why so many students struggle with 1.1. Automated writing evaluation systems
text-based argument writing. First, more generally, teachers often do not
implement research-based practices for writing instruction that include Automated writing evaluation (AWE) systems that employ auto­
providing substantive formative feedback on drafts of student essays [8, mated essay scoring (AES) technologies to generate personalized feed­
41,57,69]. Providing timely feedback on drafts of essays is difficult for back to students have been proposed as a way to improve students’
busy teachers, many of whom are required to keep pace with curriculum classroom writing opportunities (see studies reviewed in [34,88]). AWE
guides which require them to address particular content on specified systems are intended to serve as formative assessments, broadly, that are
weeks [5]. The reluctance to assign tasks that require students to write intended to provide information that students can use to improve their
across drafts also is reinforced by state accountability policies which, writing and that teachers can use to increase the quality of their in­
under pressure to ensure that their students meet testing requirements, struction. In other words, they are intended to be learning tools for stu­
can lead teachers to assign writing tasks that resemble the content and dents and teachers alike [82]. AWE systems also are intended to support
format of state tests [56,104]. teachers in their instruction by reducing the burden of grading and
Second, text-based argument writing instruction is rare. Even though providing timely, substantive feedback on students’ written responses.
recent studies show the payoff for undergraduate students’ increased In doing so, AWE systems are expected to increase the frequency of
academic achievement in the sciences [71], there is little accumulated students’ opportunities to revise their essays in response to substantive
knowledge for teaching argumentation even at the college level [33,42]. feedback.
Text-based argument writing instruction in the elementary grade A persistent criticism of AWE systems, however, is that they have not
curricula is also a relatively new addition because curricula have been designed to meet ambitious writing standards, and this is notably
traditionally mostly centered on narrative writing. As a result, many the case with respect to text-based argument writing [9]. The AES
teachers are under-prepared by their undergraduate programs to teach technologies that undergird AWE systems have historically leveraged
linguistic properties of student writing – for example, syntactic

* Corresponding author.
E-mail address: [email protected] (R. Correnti).

https://doi.org/10.1016/j.caeo.2022.100084
Received 17 November 2020; Received in revised form 21 February 2022; Accepted 29 March 2022
Available online 1 April 2022
2666-5573/© 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).
R. Correnti et al. Computers and Education Open 3 (2022) 100084

complexity, cohesion, vocabulary, and length – rather than the content interpretation and use of scores from the assessment, the inter­
of student writing [9,18,65,89]. Unsurprisingly, the positive effects of pretation/use argument is especially important for laying bare theo­
AWE systems on student writing have mostly been observed in accuracy retical assertions and researcher assumptions so the validity of the
or linguistic sophistication of responses (studies reviewed in Deeva evidence can be evaluated in relation to the claim [10]. In essence, the
et al., 2021; [49,50,52]; Ranalli et al., 2016). interpretation/use argument states the claim, while the validity argu­
Development of AES technologies keyed to source texts has mostly ment provides evidence to evaluate the plausibility of the claim [40].
focused on evaluating the quality of students’ summaries (see, e.g., [90]) Despite the recognized need to evaluate assessments relative to their
or understanding of subject matter-content (e.g., a concept taught in a intended uses [39], the evaluation of AWE systems has mostly centered
curricular unit). In recent years, scoring algorithms have become better around the accuracy of scores (Dikli, 2006 as cited in Chappelle, Cotos &
at extracting substantive features of writing quality, such as organiza­ Lee, 2015). Accuracy of AES, for example, the relationship of
tion, clarity, the presence or absence of argument elements, and subject human-machine ratings, is an important part of construct validity,
matter content [25,54,55,67,90,95,97,99,102]. Developing algorithms especially in the case of summative evaluations where scores have
that capture the content of students’ responses (e.g., use of evidence or consequences for users. For AWE systems, however, a validity investi­
warrants) continues to be an ongoing area of investigation. Although gation needs to consider not just accuracy, but how the system is
most of the AES technologies for argument writing have been developed interpreted and used by participants toward a learning purpose. It is to
for prompts that are independent of source texts (e.g., [16,105]), we this end that we focus our work.
have found it possible to understand evidence-use as a construct when
the writing prompt explicitly asks students to use evidence from a source 2.1. Using activity theory to frame our investigation
text [106].
To frame our validity argument and attendant investigation, we
1.2. Present study draw on Pryor and Crossouard’s [70] visualization of an activity system
for sociocultural theorization of formative assessment. The foundations
In the current study, we take up the challenge of automating the for activity theory (also referred to as cultural-historical activity theory)
assessment of standards-aligned text-based argument writing by inves­ are based in the work of Vygotsky and his followers who emphasized the
tigating the quality of an AWE system – termed eRevise – for improving situated and social nature of learning [94]. From this perspective,
young adolescent students’ use of source text evidence in their argument mental functions occur first as social interactions among and between
essays. Specifically, we draw on sociocultural theory [48,93] and ac­ people (i.e., within communities). These interactions, in turn, are sha­
tivity theory (e.g., [70]) to investigate the potential of our system to ped by cultural norms, traditions and institutions and mediated by tools
serve a formative assessment purpose. That is, the degree to which and artifacts (i.e., objects) in the environment (see, e.g., [48,79]). Thus,
formative feedback delivered through eRevise; 1) increases students’ the social context for learning can be decomposed to include an un­
understanding of, and use of evidence in their argument essays and 2) derstanding of structural features of an activity such as the subjects
fosters teacher-student interactions focused on students’ making present (e.g., a teacher with a group of students), objects (e.g., a text,
meaning of the feedback in relation to their own work [1–3]. curricula or standardized assessment), as well as an understanding of the
While the term formative assessment has been used broadly to goals for an activity as perceived by subjects which can influence their
distinguish between one-shot assessments for evaluative purposes and interactions (e.g., with learners)1.
assessments used in the classroom during instruction, researchers Fig. 1 depicts elements of our formative assessment activity system as
studying formative assessments have implicitly or explicitly aligned applied to eRevise. The bottom of the triangle is the disciplinary norms for
themselves with different learning theories [4]. For example, teachers text-based argumentation – the use of evidence and warrants to support
could provide multiple-choice questions during instruction to identify claims [91]. On the left side of the triangle is subjects – including the
and fill gaps in students’ knowledge (see, e.g., [23,24]), a practice that goals and values held by teachers for implementing the system - who
could be seen as aligned with behaviorist theories that characterize shape the behavioral scripts (mediating process, top of the triangle)
learning as the accumulation of small, discrete units of information or teachers employ in their interactions with students around the individ­
skills, often acquired through transmission models of teaching [28]. ualized automated feedback messages (objects, right side of the triangle).
Immediate feedback may be one mechanism for how formative assess­ All of these elements of the activity system around eRevise are expected
ment can influence student performance, but researchers investigating to shape student outcomes – their understanding of feedback messages
sociocultural theories of formative assessment expand learning out­ and their application of feedback messages as they revise.
comes beyond performance to also include students’ self-regulation [32, Rooted in this theoretical framework, Table 1 makes explicit our
68,100] and identity development [17,70]. These outcomes are theo­ interpretation/use argument (warrants, assumptions, research questions
rized to result from dialogic interpersonal interactions. This latter view, and evidence sources) for the use of eRevise as a formative assessment in
built on the ideas of sociocultural learning theorists and focused on a classroom activity system.
dialogic interactions around feedback (e.g., [92]), is where we perceive Disciplinary norms for argumentation (bottom of Fig. 1). In order to
our activity system for eRevise belonging, as student-teacher interactions develop complex knowledge acquisition and skills for text-based
involve complex judgments about whether and how to implement
automated feedback centered on a central tenet of argumentation –
evidence use (see, e.g., [15]). In Section 2.2 we describe the theoretical 1
A good example of how a subject’s perceptions of an activity can shape their
framework that underpins the claims, warrants and sources of evidence interactions (with learners) in an activity system is captured in Wertsch’s [96]
for our validity investigation. study comparing Brazilian mothers’ interactions with children around a puzzle
activity with teachers’ interactions with children around the same activity.
2. A validity argument for the use of eRevise Teachers, perceiving the purpose of the puzzle to be a learning opportunity,
encouraged students to complete the task as independently as possible,
providing hints only as necessary for completion. Mothers, in contrast, saw the
The guiding doctrine of a validity argument is how well evidence
puzzle as a task to be completed and so worked together with their children to
supports a claim (e.g., [38]). Evidence aligned with the claim provides finish it. The teachers’ and mothers’ distinct goals for the activity thus shaped
warrant for a valid inference. While validity arguments have typically their interactions with the children [26]. In our study, we investigate in­
been applied to summative assessments, recent work has begun to teractions around the object (automated feedback) as we attempt to understand
extend the logic chain to formative assessments [31,35,37,73]. Because how subjects interact with the feedback and with each other to engage in
this typically involves additional steps for inferences about the proposed meaning making about evidence use.

2
R. Correnti et al. Computers and Education Open 3 (2022) 100084

Fig. 1. Components of an Interpretation/Use Argument of a Formative Assessment Activity System with a Focus on Developing.
Students’ Evidence Use Text-Based Argumentation.
Note: T = teacher; Ss = students; ‘takeaway’ refers to the student response to the question, “What is one thing you learned about using evidence in your writing that
you could use again?”.

argument writing, we argue that AWE systems should provide infor­ question we rely on student surveys. The interviews and surveys,
mation about the content of student writing and reveal strengths and respectively, describe the extent to which each role group understood
weaknesses of students’ abilities linked to disciplinary norms for argu­ the feedback and perceived the feedback as beneficial to students’
mentation. Formative assessments are expected to assist learning by writing.
making salient the ‘gap’ between performance on a task and ‘next step’ Mediating process (top of Fig. 1). To serve a formative assessment
for improvement, and provide scaffolding (e.g., hints or suggestions) for purpose, we argue as well that teacher-student interactions should focus
improvement [83,84,93]. Thus, measures of students’ performances on on students’ making meaning of the feedback in relation to their own
evidence use are paramount for understanding where students are and work (see e.g., [92]). Indeed, recent work on AWE systems has estab­
what the imagined ‘next step’ might be. We begin our validity investi­ lished that individualized support from teachers is necessary for writing
gation, therefore, examining the reliability of our automated scoring for improvement as students often need support to understand the auto­
the key features for evidence use (see Table 1, RQ1). Next, we gauge the mated feedback messages they receive [43,72,98]. The extent to which
extent to which the automated feature scores facilitated our ability to teachers provide individualized guidance (for example, use automated
make meaning out of students’ improvements in their essays (see feedback messages as a starting point for discussions around writing or
Table 1, RQ2), especially relative to other ways of measuring improve­ clarify and interpret feedback messages with students) significantly
ment (i.e., change in rubric scores). impacts students’ uptake of feedback messages [11,30,51] and moti­
Subjects (top left of Fig. 1). To serve a formative assessment purpose, vation to incorporate AWE feedback in their revision [76,97]. To
we argue that teachers must perceive an AWE system to be an authentic address these questions, we drew on teacher ‘implementation logs’ in
learning opportunity for students that is aligned with their pedagogical which teachers documented the questions students asked them and their
goals. If teachers see systems as undermining their instructional routines responses to student queries as a measure of interactivity around student
and goals (e.g., the learning standards to which they are held account­ questions. Below we describe how we use this measure to explore the
able), they are unlikely to implement the system in a way that supports a relationship between the mediating process and improvements during
learning purpose (see Table 1, RQ3 and RQ4) or use the system at all. revision.
This is a concern because integration of AWE systems in instruction is Object (automated feedback) and outcomes (right hand side of Fig. 1). As
critical to their success [30,87]. alluded to earlier, to serve as tools for learning, formative assessments,
Students also must be able to make sense of the feedback messages in this case in the form of automated feedback, must clearly communi­
they receive and perceive the information as legitimate. Absent an un­ cate the criteria for successful task performance (e.g., [1]), and be
derstanding of the criteria for successful revision (i.e. what the messages tailored to students’ learning needs. In the context of an AWE system
are asking them to do to improve their essays) students are unlikely to such as eRevise, the messages must be appropriate to student essays.
use the information they receive to successfully revise their argument Otherwise, we might not see improvements in students’ essays, nor
essays or ‘take away’ information from the assessment to apply in future would we expect improvements aligned with feedback. For our out­
writing situations (see Table 1, RQ5). To address the former research comes, as is common practice (see, e.g., [102]), we examined changes in
questions we draw on teacher interviews, and to address the latter student performance for evidence of student learning attributed to the

3
R. Correnti et al. Computers and Education Open 3 (2022) 100084

Table 1
Validity Argument Framework for eRevise as a Formative Assessment.
Warrant/ Inference Assumption(s) Research Question Evidence Source

Aligned with AWE system captures meaningful features of effective 1) How reliable are the automated scores generated in Feature scores generated in eRevise
disciplinary norms evidence use. Those features can be measured with eRevise at identifying features of effective evidence use compared to human scores
for argumentation accuracy. The features are sensitive to meaningful aligned with disciplinary norms for argumentation (i.e., Comparison of feature and rubric
improvements in students’ essays. the number of different evidence focal topics students scores in identifying improvements
cited in their essay and the total number of unique and
specific references to text-based evidence in students’
essays)?
2) Are feature scores sensitive to meaningful
improvements in evidence-use?
Subjects (Values and Teachers perceive eRevise as aligned with their 3) Do teachers perceive eRevise as beneficial to their work Teacher interviews
Goals) pedagogical aims, and work. Students perceive the (i.e., feasible to implement and helpful for their work)? Student surveys
feedback as interpretable and beneficial. 4) Do teachers perceive eRevise as aligned with the
standards and assessments to which they are held
accountable (aligned with their pedagogical aims)?
5) Do students understand the feedback messages and
perceive them as beneficial to their writing?
Mediating process Teachers vary in providing active support to students to Independent variable measuring variation in the Teacher implementation logs
interpret feedback messages provided in eRevise classroom implementation of eRevise as it naturally
occurred during implementation [used in RQ9 below].
Object Automated feedback based on feature scores of original Validity evidence supports our interpretation/use Inferences from RQ1-9
essay is accurate and meaningful. argument for eRevise as a formative assessment.
Outcome Essays improve in features of evidence use. Student 6) Do students’ essays improve in evidence use? Change in rubric/feature scores in
essays improve in alignment with feedback messages 7) Is improvement in student essays aligned with the student essays from first to final draft
they received. Students’ articulated ‘takeaway’ or features targeted in the feedback message they received? Student surveys
learning is aligned with feedback messages. 8) What do students believe they learned from using
eRevise?
Mediating process –> Substantive (potentially dialogic) teacher-student 9) Is there a relationship between substantive teacher HLM model examining relationship
Outcome interactions support student revision (and retention of interactions and student improvement on feature scores? between teacher-student interactions
concepts). and revision improvements

Note: Bolded cell is stated as a claim because it represents the generalized inference (from our interpretation/use argument) we’d like to make from the cumulative
evidence gathered in response to research questions 1 through 9.

automated feedback, in general (Table 1, RQ6). We also used the stu­ 3. Methods
dents’ improvements in feature scores (revised minus original) to
investigate student responsiveness to feedback, specifically, whether the 3.1. Context and participants
improvements we observed were aligned with researcher hypotheses of
what we would expect given the feedback messages students were Our validity investigation took place in 8 public parishes (i.e., dis­
provided - i.e., given their original feature scores and the assumed tricts) in Louisiana that are representative of the state demographics. As
strengths and weaknesses of evidence-use on their original essay (see of the 2018–2019 school year, across these parishes, 47% of the students
Table 1, RQ7). identified as White, 42% African American, 4% Latinx, 2% Asian, and
Given our expressed desire to explore socio-culture theories of stu­ 5% other. About 70% of the students were eligible for free-or reduced-
dent learning and the focus of our activity system on dialogic teacher- price lunch.
student interactions we were interested in outcomes such as shifts in Teachers. 16 English language arts teachers participated in the
students’ understanding of performance expectations because such un­ study. They were selected for their comfort with basic technology and
derstanding is likely to contribute to students’ self-regulation going access to a class set of computers to complete the online assessment in
forward [32,68,100]. Therefore, we examined how students’ experience eRevise. All 16 teachers were white females with at least a Bachelors
with our automated formative assessment system influenced their un­ degree. They averaged 10 years (range = 4–18) of teaching experience.
derstanding by asking them to articulate what they learned from the Seven teachers taught fifth grade; eight taught sixth grade; and one
formative assessment experience that they will use again in their future indicated she taught both fifth and sixth grade.
(argument-based) writing (Table 1, RQ8). We then inferred from the Students. The 16 teachers implemented eRevise to all students in one
student responses whether students obtained any generalized under­ of their English language arts classes. The classes averaged 16.6 students
standing(s) from their experience with eRevise. (range = 10 to 34). In the end, 266 fifth and sixth grade students
Mediating process influence on outcome (dotted arrows in Fig. 1). completed all data collection (i.e., submitted both a first draft and a
Finally, we conducted an empirical test for our hypothesized mediating revised draft of the essay and completed the post-eRevise survey items).
process. The dotted arrows in Fig. 1 signify our final research question
(see Table 1, RQ9) about the relationship between the mediating process
(teacher-student interactions) around the object (the automated feed­ 3.2. eRevise and its automated feedback messages
back) and its influence on the outcome (improvements in feature
scores). Although our scores constitute a coarse proxy for dialogic in­ Our AWE system, eRevise, was designed to score responses and pro­
teractions, we see this as a nascent empirical test to provide evidence for vide feedback to students on the Response-to-Text Assessment (RTA).
socio-cultural theorizations (e.g., [1,12,17,70]) of formative feedback Elsewhere, we have described RTA development, administration, and
on student revision quality. scoring [13–15]. In brief, the assessment used in this pilot is based on a
feature article from Time for Kids (“A Brighter Future” by Hannah Sachs)

4
R. Correnti et al. Computers and Education Open 3 (2022) 100084

about the Millennium Villages Project, a United Nations-supported evidence supports their main idea versus just letting the evidence speak
effort to eradicate poverty in a rural village in Kenya.2 The prompt for itself.
asks students, “Based on the article, did the author provide a convincing eRevise uses the first two of these natural language processing fea­
argument that ‘winning the fight against poverty is achievable in our tures generated during automatic scoring of students’ first- draft essays
lifetime’? Explain why or why not with 3–4 examples from the text to to select formative feedback messages on evidence use to guide essay
support your answer.” The RTA rubric for human raters focuses on five revision. Three levels of feedback messages were available (for full
dimensions– evidence use, analysis, organization, academic style, and messages see Table A1, Appendix A): Level 1 feedback messages focused
mechanics. Each is scored on a scale from “1=low” to “4=high”. on completeness (i.e., guided students to provide more evidence) and
eRevise focuses specifically on the dimension of evidence use. Else­ specificity (i.e., guided students to provide more details about the evi­
where, we provided a validity argument for the automated scoring of dence they referenced). Level 2 feedback messages also prompted stu­
this writing construct at the classroom level for research purposes [15]. dents to be more specific, and, in addition, directed students to explain
Aligned with the rubric criteria for this dimension, the automated their evidence. Finally, level 3 feedback messages focused students on
scoring model that underlies eRevise is based on the following four not only explaining the evidence they provided, but also connecting it to
features: the overall argument. Elsewhere, we discuss the assumptions and
(1) Number of pieces of evidence (NPE): To calculate the breadth of methods used to channel students’ essays to each of the three different
focal topics from the source text that the students used in their essay levels of feedback based on the number of topics (NPE) referred to in
(NPE), project researchers first defined a list of main topics in the source their original essay and the number of unique and specific references to
text (i.e., the Time for Kids article) that were then incorporated into the source-text evidence for the four focal topics (SPCfocal) (see [106] for
AES system. These four topics correspond to the ways the Millennium technical details).
Villages project affected the quality of life in a village (i.e., hospital
conditions, access to schools, malaria, agriculture). The AES system uses 3.3. Procedures/Measures
a simple window-based algorithm with fixed window-size to calculate
NPE. A window within the essay contains evidence related to a topic if it Participating teachers implemented eRevise in late fall 2018. The
uses at least two keywords from the list of words for that topic. eRevise system is designed for use over two class periods. Students wrote
(2) Specificity (SPC): For each main topic from the source text, (i.e., typed) their essays on the first day. On a second day (no more than
researchers identified a comprehensive list of associated keywords five school days later), students logged into eRevise to view the auto­
(i.e., specific text evidence/examples). For example, the topic “hospital matically generated formative feedback messages and revise their first
conditions” included as keywords “water,” “electricity,” “hospital beds”, drafts. Fig. 2 shows an example screenshot with the formative feedback
“medicine,” and “doctors” (initially, these aspects were lacking or that students would see on day 2. While eRevise generates an automated
insufficient in the villages but then improved over time). For each score in the background, commensurate with our conception of forma­
student essay, the AES system used this keyword list to identify matches tive assessment, students do not receive the score; they receive only the
– i.e., how many (and which) specific pieces of evidence the essay feedback associated with the scored features.
addressed. The system included accounts for the similarity between a Teachers were instructed to provide at least 30 min of independent
word in the student’s essay and a word in the topic or keywords list, so work time on day 1 for students to draft their essay, and on day 2 for
students will be credited for evidence that uses slightly different words them to revise. Actual revision times varied within and across classes.
(e.g., “power” instead of “electricity”) or words with different stems. According to eRevise’s built-in time log, the average revision time across
Each phrase containing keyword matches is only counted once to avoid classes was approximately 25 min (range = 13–57 min).3 To further
redundancy. To select feedback, we used a measure for unique and understand how eRevise was implemented, we asked teachers to keep a
specific mentions of evidence associated with the four focal topics of detailed record (‘implementation log’) of the questions students asked
hospital conditions, malaria, agricultural conditions and school (here­ during the administration of the formative assessment task (both drafts)
after referred to as SPCfocal). and their responses to student questions.
(3) Concentration (CON): High concentration signals listing of evi­ Students completed brief surveys after submitting their first draft and
dence without explanation or elaboration and, typically, receives a again after the revised draft. Questions on the survey (assessed on a 4-
lower score. Concentration is a binary feature meant to capture a com­ point scale) focused on students’ experience with eRevise. For
mon instance with developing writers – answering a prompt by simply example, students were asked about the helpfulness of the feedback
providing unelaborated evidence directly from the source text. To message they received, whether they understood what the feedback
calculate this feature, the AES system counts the number of phrases that message was asking them to do, and the extent to which they believed
contain keyword matches and compares them to the total number of their revised essay had improved from their first draft. Students also
sentences. If there are several keyword matches but fewer than three responded to an open-ended question, “What is one thing you learned
sentences, the concentration is deemed high. about using evidence in your writing that you could use again?” to gauge
(4) Word count (WOC): This feature is a proxy for elaboration of the potential of the system to build students’ understanding of effective
thinking and for students using their own language to reason how the use of text evidence.

3.4. Data analyses


2
Elsewhere we have described key features of our text and administration in
order to support measurement of students’ analytic thinking and reasoning in RQ1: How reliable are the automated scores generated in eRevise at
response to text (see [14] for an extended discussion). Thus, we chose texts we identifying features of effective evidence use (i.e., the number of different
felt were authentic, complex, and readable, but challenging for the grade level.
evidence focal topics students cited in their essay and the total number of
We did several things to mitigate readability as a potential confound in our
unique and specific references to text-based evidence in students’ essays)? To
measurement strategy. First, we used a lexile analyzer to interrogate the
grade-level appropriateness. Second, we define several vocabulary words in
explore the reliability of our automated feature scores, we had a human
call-out boxes for ease of comprehension. Third, the assessments are brief rater score the two features the eRevise AWE system uses to select
enough that the teacher can read the assessment aloud with students. Fourth,
the teacher asks clarifying questions – with standardized language and potential
3
follow-up prompts - during the reading of the text in order to facilitate a literal The elapsed time is a rough estimate of time spent revising. We cannot be
understanding of the text, from which we expected the students to be able to certain that students began working as soon as they logged into eRevise, nor that
provide an analytic response in writing. they worked without interruption until the time they logged off.

5
R. Correnti et al. Computers and Education Open 3 (2022) 100084

Fig. 2. Screenshot of eRevise essay with associated feedback for a student.


Note: Students had access to the source text if they scrolled down the page.

feedback (NPE and SPCfocal) for more than 20% of the students in the established techniques, including clustering, making contrasts, and
sample (n = 63). Specifically, the human rater scored both the first-draft seeking repeating patterns [6,7,78]. The researcher made transparent
essays and the revised-draft essays for these students to assess inter-rater the coding scheme, definitions, and example coded excerpts for team
(i.e., computer-human) agreement of the feature scores. The first feature discussions and to check for underlying analyst assumptions or biases
(NPE) was the number of topics from the source text, out of a possible [21]. Data analysis involved generating counts and percentages of
four, that the student marshaled evidence from and referenced in their teachers that expressed a given opinion or theme.
essay. The second feature (SPCfocal) was the number of specific and RQ5: Do students understand the feedback messages and perceive them as
unique text-based evidence the student referenced from those four focal beneficial to their writing (i.e., what do students believe they learned from
topics. We examined the intra-class correlation of these continuous using eRevise)? We examined student surveys to understand how stu­
measures using SPSS V.26.0. dents perceived the feedback provided in eRevise and what they self-
RQ2: Are feature scores sensitive to meaningful improvements in evi­ reported learning from using eRevise that they would apply in future
dence-use? To understand the sensitivity of the feature scores for writing situations.
detecting improvement in student essays, we calculated the number of RQ6: Do students’ essays improve in evidence use? To examine outcome
students who, by our feature metrics (NPE, SPCfocal, and WOC as metrics, we first analyzed the data using paired samples t-tests to un­
described above), displayed any evidence of improvement. We derstand improvement across all students from the initial draft to the
compared this to the baseline of 41 percent of students who gained at revised draft. We examined the breadth (number) of different topics
least one point on the AES score based on the evidence-use rubric (see covered among the four focal topics in the source text (NPE). We also
Appendix Table B1 for the human evidence-use rubric). We also exam­ examined students’ use of text-based evidence within each topic sepa­
ined t-tests for improvement scores for two subgroups of students – first, rately (the number of specific and unique uses of evidence identified for
we examined all students, then we examined only those students whose each focal topic) and then aggregated across the four focal topics. Higher
rubric score did not improve. means on certain topics reveal the evidence students were most likely to
RQ3,4: Do teachers perceive eRevise as beneficial to their work (i.e., select from the text to support their argument – both initially on their
feasible to implement, helpful for their work, and aligned with their peda­ first drafts and as they revised.
gogical aims)? To investigate the compatibility of eRevise with teachers’ RQ7: Is improvement in student essays aligned with the features targeted
instructional context (i.e., consonance with the instructional system), in the feedback message they received? We generated analytic hypotheses
we interviewed all 16 teachers by phone in spring 2019, after their based on the feedback messages as to which features we would likely
classes had experienced eRevise. The 45-minute, semi-structured inter­ ‘see’ improvement in4 and conducted a series of between-group com­
view protocol addressed whether students had difficulty understanding parisons. For example, if students were responsive to the feedback
and applying the feedback in their revisions, whether the feedback asking them to provide more complete evidence (level 1 feedback), then
provided was sufficient and aligned with the teachers’ pedagogical aims; we would expect practically and statistically significant increases in:
the pros and the cons of the system; how use of eRevise might impact
teachers’ writing instruction; and how frequently teachers would
employ the eRevise system in the future if it were available to them. 4
These predictions were based on the fact that we used the same features for
Interviews were audio-recorded with teachers’ permission; subse­
understanding improvement that were used to determine the feedback level.
quently, we generated detailed notes or transcripts for coding. One
Using the same features allowed us to generate testable hypotheses to probe the
researcher engaged in multiple readings and performed iterative quali­ alignment of feedback messages with improvement scores on those features. In
tative coding and analysis [60,101] on the transcripts using Dedoose future iterations of eRevise, we plan to design a second round of feedback about
[19]. Specifically, structural codes reflected the interview topics. The­ students’ revisions based both on AES-calculated values for these same features,
matic coding emerged from data [60]. We identified themes following as well as additional features that measure and describe students’ revision(s).

6
R. Correnti et al. Computers and Education Open 3 (2022) 100084

specific mentions of evidence from the source text (SPCfocal) and in the variable at the teacher level. Our dependent variable was an ‘improve­
number of topics mentioned from the source text (NPE). Additionally, ment score’ generated from a factor analysis of three composite items:
we would expect an increase in the density of evidence on focal topics the change in topic breadth (NPE), the change in amount of unique and
(i.e., the ratio of SPCfocal divided by word count) because students specific evidence for the focal topics (SPCfocal), and the change in word
received explicit feedback on adding evidence (see Table 4, column count (WOC). Our main hypothesis was examined in a cross-level
labeled “feature-specific hypothesis” for our tests for alignment in interaction between teachers’ class-level reports of providing substan­
relation to the feedback provided). tive help to students and students’ reports of having asked the teacher
RQ8: What do students believe they learned from using eRevise? Next, we for help, and the influence of the interaction on the change score. Thus,
analyzed the similarity between what the student reported learning that we examined a series of two-level hierarchical linear models [74] - 1) a
they would use next time in their writing and the feedback message fully unconditional model (FUM); 2) a model with group-mean centered
provided. We first used natural language processing to represent the student-level covariates where we also examined a random slope for the
meaning of every student text response, as well as the meaning of eRe­ indicator variable where students said they asked their teacher for help
vise’s feedback messages (Table 4), in terms of a Term Frequency - In­ (Model 1); and 3) our final random-intercept random-slope model where
verse Document Frequency (TF-IDF) vector representation. In this we added the cross-level interaction (Model 2 - see Appendix C for the
representation, words are modeled as vectors with each dimension in the full model description and rationale).
vector corresponding to a word in the vocabulary and each cell value
(co-occurrence count) weighted by multiplying term and inverse docu­ 4. Results
ment frequencies [36]. We then computed the cosine5 between the
student and the feedback vectors to measure their text similarity. For 4.1. How reliable are the automated scores generated in eRevise at
each student, we compared the similarity between the vectors repre­ identifying features of effective evidence use? (Aligned with disciplinary
senting their response and the feedback messages they received, versus norms, RQ1 Table 1)
the vectors representing their response and the feedback messages they
did not receive. We regard better alignment (i.e., higher cosine values / To examine the reliability between our human rater and the auto­
text similarity) of students’ open-ended response with the feedback mated features, we examined the intra-class correlation (ICC). The ICCs
message they received as evidence that they had processed (and perhaps show excellent agreement for NPE (ICC = 0.844) and moderate agree­
acted on and remembered) eRevise’s feedback. ment for SPCfocal (ICC = 0.687) across sixty-three students’ first and
While this hypothesis test can help us understand statistical signifi­ revised drafts (see Koo and Yi’s (2016) guidelines for interpreting ICCs).
cance, it is abstract. Therefore, we further analyzed the data qualita­ Given that the focus of our analyses is to explore the utility of these
tively for alignment. Specifically, blinded to the feedback messages features to represent ‘improvement scores’, we also examined the intra-
students received, we coded for which, if any, of the feedback messages class correlation for the human change score and the automated change
are reflected in the student response to the open-ended survey question. score for each feature. The intra-class correlations show excellent
We attended to key words and ideas in each message. For example, agreement for both NPEchg (ICC = 0.790) and SPCfocal-chg (ICC = 0.812).
student responses that used words such as “more evidence” or “different
evidence” signaled alignment with feedback message 1. “Details” and 4.2. Are feature scores sensitive to meaningful improvements in evidence-
“be specific” aligned with feedback message 2. “Explain evidence” and use? (Aligned with disciplinary norms, RQ2 Table 1)
“why” were key words associated with message 3. Finally, “argument”,
“prove”, “elaborate” aligned with message 4. We allowed for double The paired-samples t-tests shown in Table 2 describe improvements
coding (i.e., student responses could align with more than one message). in the revised essays across all students by features of evidence-use (n =
Analysis involved counting the proportion of student responses coded to 266). Table 2 provides evidence of significant improvement on all of the
each message. For the responses that were codable,6 we report the features examined, including NPE, SPCfocal, and word count (ES range
proportion of responses that aligned with the messages students were from 0.20 - 0.57). For example, the mean number of topics addressed
provided (i.e., that should have guided their revision) and the propor­ (NPE) per essay increased by nearly one-half, meaning about one out of
tion of responses that aligned with messages they had not been provided. every two students added evidence on a new topic.
RQ9: Is there a relationship between substantive teacher interactions and Additionally, the feature scores used to measure improvement were
student improvement on feature scores? To investigate the extent to which more sensitive to revisions in student writing than the 4-point evidence-
teachers implemented eRevise as a formative assessment, we examined use rubric. Only 110 students (41%) would have been identified as
teachers’ documentation of student questions and their responses to having improved by 1 or more points on the 4-point evidence-use rubric.
these questions during classroom implementation. Teachers varied in Table D1 in Appendix D provides paired-samples t-tests for the
how they approached their role during assessment – from treating it like remaining 156 students who, using the rubric score, would not have
practice for a standardized test to facilitating students’ understanding of been identified as having improved their essay. Table D1 shows that
the automated feedback and helping students construct plans for revi­ even this group made significant additions to their revised essays in all
sion based on that understanding. We used this as our main independent but two rows of the table. Thus, students improved their essays in ways
detectable by change in the feature scores, even when the automated
rubric score failed to identify improvement.
5
The cosine similarity metric is based on the dot-product (a linear algebra
operator) of the two vectors, but is modified to normalize for the vector 4.3. Do teachers see eRevise as beneficial to their work (i.e., feasible to
lengths. The normalized dot-product is in fact the same as the cosine of the implement and helpful for their work)? (Subject [teacher] values, RQ3
angle between the two vectors, hence the metric’s name. Table 1)
6
Some student responses could not be coded for alignment to eRevise’s
feedback messages. We generated the following codes to characterize these
Teachers overall responded very positively to eRevise, with nearly all
responses: Students offered no response, leaving the question blank; Students
reporting that they would use eRevise again. Eighty-seven percent of
wrote, “Nothing” or “I don’t know”; Student responses pertained to source text
rather than use of evidence (e.g., “A lot of people have poverty”); Student re­ teachers said they would use it 4–6 times per year or more, while the
sponses pertained generally to writing (e.g., “Always reread what you are remainder said they would not use it that often or that their use would
writing so that it can make sense”); Student responses pertained to grammar or depend on the availability of technology. The two most frequently cited
mechanics; Student responses were unclear, for example, responses used “it” benefits teachers mentioned were: 1) the time saved and the ability for
without a clear antecedent (e.g., “It helps you write better”). students to receive timely feedback (100%); and 2) the opportunity for

7
R. Correnti et al. Computers and Education Open 3 (2022) 100084

Table 2
Paired-samples t-tests examining change in evidence use from first- to revised- draft.
Feature Outcome First Draft Revised Draft Revised-First Draft t ES
Mpre Mpost Mdiff
(SD) (SD)

Count of Focal Topics Breadth of Text Evidence (NPE) 2.474 2.959 .49 8.10 .44
(1.183) (0.997)
Malaria-Related Text Evidence (SPCmal) 2.316 2.767 .45 5.87 .25
(1.779) (1.827)
Hospital-Related Text Evidence (SPChosp) 1.985 2.519 .53 7.80 .34
(1.622) (1.544)
School-Related Text Evidence (SPCschl) 1.876 2.553 .68 8.47 .39
Count of Unique and Specific Evidence-Use
(1.729) (1.712)
Agriculture-Related Text Evidence (SPCAgr) 1.083 1.357 .27 4.88 .20
(1.36) (1.378)
Cumulative Text Evidence for Focal Topics 7.26 9.20 1.94 10.39 .42
(SPCfocal ¼ SPCmalþ SPChosp þ SPCschlþ SPCAgr) (4.59) (4.72)
Word Count Word Count (WOC) 189.823 260.914 71.09 17.13 .57
(106.551) (141.996)

Note: Mpre = sample mean for first draft; Mpost = sample mean for revised draft; Mdiff = sample mean change from first to revised draft.
Bolded items = features of evidence use that were used in data-driven approach to channel student essays to context-sensitive feedback messages.

system, the teacher would still need to interact with it and the students’
Table 3
writing (i.e., they saw their interactions with students as a potential
Student survey responses within the eRevise system.
mediating factor within the activity system). Also, 63% percent of
Question M Not at A little Mostly Completely teachers mentioned that at least a few students in their class had diffi­
(SD) all bit
culty with our feedback. For example, the system asked students to
explain their evidence more when the student thought they already had.
Did you understand the 2.99 12 59 80 80
Teachers suggested the system identify places in the original essay that
feedback you received? (0.90) (5%) (25%) (35%) (35%)
require revision and/or provide a few specific instances where an
Did you understand how 3.07 14 41 91 85 explanation would improve the essay.
you were supposed to (0.89) (6%) (18%) (39%) (37%)
revise your essay based
4.5. Do students understand the feedback messages and perceive them as
on the feedback you
received? beneficial to their writing? (Subject [student] perceptions of automated
None A little A Lot All feedback, RQ5 Table 1)
bit
How much of the feedback 2.86 8 70 100 53 Table 3 shows that students were mostly positive about the feedback
did you use when you (0.81) (4%) (30%) (43%) (23%)
revised your essay?
they received. Roughly seven out of ten students reported “mostly” or
How much did you refer 2.68 24 83 64 60 “completely” understanding the feedback, and a similar number re­
back to the original text (0.97) (10%) (36%) (28%) (26%) ported understanding how they were supposed to revise their essay.
as you were revising Most students felt their revised essay was an improvement from their
your essay?
original version (94% total; 45% felt it was a little bit better; 49% felt it
None A little A Lot
bit was a lot better). In general, students reported understanding and using
How much better do you 2.43 14 104 113 the feedback, and they felt their essays improved.
think your revised essay (0.60) (6%) (45%) (49%)
is compared to your first
4.6. Do students’ essays improve in evidence use? (Student improvement
draft?
The feedback I received 2.13 39 119 73 on evidence use [outcome], RQ6 Table 1)
was different from what (0.68) (17%) (52%) (31%)
I normally receive The paired-samples t-tests in Table 2 demonstrated statistically sig­
Note: Row totals are for 231 students because 35 students were missing on the nificant improvements in students’ revised essays by features of
survey responses within eRevise. evidence-use. Table 2 also provides descriptive information about which
despite turning in both drafts of their writing. evidence from the text students most frequently used. For example, of
the four focal topics (i.e., malaria, hospitals, school, and agriculture),
students to engage in the writing process and receive feedback they may students were most likely to use evidence to compare the number of
not have otherwise benefitted from (75%). Africans suffering from malaria before versus after the Millennium Vil­
lages Project. The number of specific text-based references related to
each focal topic increased from students’ first draft to their revised draft;
4.4. Do teachers see eRevise as beneficial to their work (i.e., aligned with however, evidence related to the topic of schools7 went up the most (see
their pedagogical aims)? (Subject [teacher] pedagogical aims, RQ4 SPCschl row; Mpost = 2.55; Mpre = 1.87; t = 8.47; p < .001; ES = 0.39).
Table 1) Meanwhile, students were least likely to mention evidence related to the
topic of agriculture (see SPCAgr row; Mpost = 1.36; Mpre = 1.08; t = 4.88; p
Teachers reported that the feedback messages were aligned with < .001; ES = 0.20). In general, students did add specific text-based
their instructional goals (72%) and that the system reinforced the
feedback they provided to students in their instruction (31%). However,
most teachers (66%) also suggested the teacher and system were 7
In our level 1 and level 2 feedback, we provided one example of text-
mutually reinforcing. In other words, most recognized the system would evidence use from the topic ‘schools’ to demonstrate to students how to be
not replace the role of the teacher because to get the most out of the more specific when conveying evidence from the text.

8
R. Correnti et al. Computers and Education Open 3 (2022) 100084

Table 4
Changes to Feature Scores Aligned with Feedback Provided to the Student.

Level 1 (80 students). Feedback heading±: “Use more evidence from the article” and “Provide more details for each piece of evidence you use.”
Hypothesis 1: NPE expected large increase. Confirmed: Yes; change in feature score (2nd draft – 1st draft)† = 1.24***; ES‡ = 1.27.
Hypothesis 2: SPCfocal expected large increase. Confirmed: Yes; change = 2.62***; ES = 0.82.
Hypothesis 3: SPCfocal/WOC expected increase. Confirmed: Yes; change = 0.011**; ratio (SPCfocal:WOC)₸ moved from 1:43 to 1:33.

Level 2 (76 students). Feedback heading±: “Provide more details for each piece of evidence you use” and “Explain the evidence.”
Hypothesis 4: NPE expected moderate increase. Confirmed: Yes; change = 0.54**; ES = 0.78.
Hypothesis 5: SPCfocal expected large increase. Confirmed: Yes; change = 1.95***; ES = 1.20.
Hypothesis 6: SPCfocal/WOC expected no change. Confirmed: Yes; change = −0.002 (n.s.); ratio 1:33.

Level 3 (110 students). Feedback heading±: “Explain the evidence” and “Explain how the evidence connects to the main idea & elaborate.”
Hypothesis 7: NPE expected no change. Confirmed: Yes; change = 0.04 (n.s.).
Hypothesis 8: SPCfocal expected no change. Confirmed: No; change = 1.83***; ES = 0.30.
Hypothesis 9: SPCfocal/WOC expected decrease. Confirmed: Yes; change = −0.015***; ratio moved from 1:25 to 1:29.

± See Appendix Table A1 for full feedback messages.
† The mean change in feature score for each row is presented along with results from a paired-samples t-test for significance.
‡ Effect size is calculated using the mean change in feature score from 1st to 2nd draft and dividing by the standard deviation.
₸ The ratio at time 1 is provided 1st, while the ratio at time 2 is provided 2nd for hypothesis tests 3, 6 and 9; because the 1st term of the ratio (antecedent) is 1 in both cases, an increase in evidence provided per word is associated with a decrease in the 2nd term (consequent) and vice-versa.
NPE = number of focal topics the student provided evidence for (0–4); SPCfocal is the cumulative number of pieces of unique and specific evidence provided on the focal topics; SPCfocal/WOC is the ratio of number of pieces of evidence per word in the student’s essay.
n.s. = non-significant. ~ p < .10. * p < .05. ** p < .01. *** p < .001.
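To make the within-subjects comparisons summarized in Table 4 concrete, the minimal sketch below computes a paired-samples t-test and a within-subjects effect size (mean change divided by the standard deviation of the change scores, i.e., Cohen’s dz in the sense of Lakens [44]). The feature-score arrays are hypothetical placeholders, not the study’s data, and the dz convention is one common reading of the table note.

```python
# Illustrative sketch (not the authors' code): paired-samples t-test and
# within-subjects effect size for a change in a feature score such as SPCfocal.
import numpy as np
from scipy import stats

draft1 = np.array([5, 7, 6, 8, 4, 9, 7, 6])    # hypothetical SPCfocal, first draft
draft2 = np.array([8, 9, 7, 10, 6, 11, 8, 9])  # hypothetical SPCfocal, revised draft

diff = draft2 - draft1
t_stat, p_value = stats.ttest_rel(draft2, draft1)   # paired-samples t-test

# Cohen's d_z: mean change divided by the standard deviation of the change scores.
d_z = diff.mean() / diff.std(ddof=1)

print(f"mean change = {diff.mean():.2f}, t = {t_stat:.2f}, p = {p_value:.4f}, d_z = {d_z:.2f}")
```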

In general, students did add specific text-based evidence in their revisions related to the focal topics – with nearly two additional pieces of evidence added (see SPCfocal row; Mpost = 9.20; Mpre = 7.26; t = 10.39; p < .001; ES = 0.42). Students also added general evidence for how conditions improved.8 On average then, students added over three additional pieces of evidence per essay in their revision (Mpost = 17.77; Mpre = 14.02; t = 12.32; p < .001; ES = 0.47).

4.7. Is improvement in student essays aligned with the evidence-use features targeted in the feedback message they received? (Student improvement aligned with automated feedback based on researcher hypotheses [outcome], RQ7 Table 1)

We looked at patterns of improvement for students receiving different levels of feedback to investigate alignment of students’ revision with the feedback messages they received. Table 4 displays the gist of the feedback for each level along with our analytic hypotheses for each feature. Results suggest that 8 of our 9 hypotheses held. Students from all three feedback levels added evidence to their essays, though we did not anticipate that students receiving feedback level 3 (i.e., explain evidence and connect it to the overall argument) would add more evidence.

In general, the pattern for the number of topics students added to their essay corresponds to the strengths/weaknesses of students’ first-draft essays and the feedback they received. A priori hypotheses 1, 4, and 7 in Table 4 suggest a linear decrease from level 1 to level 3 because students receiving level 3 feedback began with essays that addressed a larger number of focal topics. This is precisely what we observed; there was a linear decrease from 1.24 (level 1) to 0.54 (level 2) to 0.04 (level 3) topics added for feedback levels 1 through 3, respectively.

Finally, the ratio of focal topics to the total word count (hypotheses 3, 6, and 9 in Table 4) is a proxy for the concentration of the number of unique and specific references to focal topics. We expected that this ratio would increase for students receiving level 1 feedback; that there would be no change for students receiving level 2 feedback; and that the ratio would decrease for level 3 feedback, which guided students to add explanation, not evidence. Our hypotheses about change in the concentration of evidence were confirmed for all three feedback levels. While there was variation in this ratio in the first draft essays, after the revisions, the ratio became roughly similar. On average, students provided one piece of unique evidence per 29–33 words.

4.8. What do students believe they learned from using eRevise? (Student improvement aligned with automated feedback based on student open-ended response [outcome], RQ8 Table 1)

We analyzed students’ responses about “…one thing [they] learned about using evidence in [their] writing that [they] could use again”. We compared cosine similarity scores between the natural language processing vector representations of the text of students’ responses and the feedback messages they saw versus the messages that they did not see. In general, students’ responses to this open-ended question were more similar to the feedback messages they received (cosine(A,B) = 0.086) than the messages they did not receive (cosine(A,C) = 0.066).9

8 General evidence (not included in Table 2) included evidence that was from the source text but was not part of the four focal topics (i.e., malaria, hospitals, agriculture, and schooling).
9 Where A is the student response, B is the seen feedback message and C is the unseen feedback message. Given cosine(A,B) > cosine(A,C), the student responses, in the aggregate, were more similar to the seen feedback messages.
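The cosine-similarity comparison described in Section 4.8 can be illustrated with a minimal sketch. The article does not specify the vector representation used, so the example below assumes a TF-IDF bag-of-words representation from scikit-learn; the response and feedback strings are hypothetical.

```python
# Illustrative sketch (one plausible implementation, not necessarily the authors'):
# compare a student's open-ended response (A) to the feedback message the student
# saw (B) and to a message the student did not see (C) via cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

student_response = "I learned to explain how my evidence connects to my argument."  # hypothetical A
seen_feedback = "Explain the evidence and show how it connects to the main idea."   # hypothetical B
unseen_feedback = "Use more evidence from the article to support your argument."    # hypothetical C

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([student_response, seen_feedback, unseen_feedback])

cos_ab = cosine_similarity(vectors[0], vectors[1])[0, 0]  # cosine(A, B)
cos_ac = cosine_similarity(vectors[0], vectors[2])[0, 0]  # cosine(A, C)

# The aggregate pattern reported in the text corresponds to cosine(A, B) > cosine(A, C).
print(f"cosine(A,B) = {cos_ab:.3f}, cosine(A,C) = {cos_ac:.3f}")
```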
Thus, students’ self-reported ‘learning’ about use of evidence aligned with the feedback they received.

Table 5 provides results from our qualitative coding for alignment between student response and feedback received. We make three key observations. First, 70% of students responding articulated a ‘takeaway’ related to evidence use (see column 3, “related to evidence use”). Second, in general, the pattern described by the cosine similarity analysis is confirmed. That is, in the aggregate students’ responses were more aligned with the feedback message they were provided (59% of the time). However, the pattern is strongest for students receiving level 1 feedback (83% of the time), followed by level 3 feedback (54% of the time) and finally, level 2 feedback (only 43% of the time). Third, many of the student articulations – e.g., “That you have to put enough detail so the reader can understand what you are doing and saying” – represent generalizations about evidence use aligned with the feedback provided.

Table 5
Qualitative analysis of alignment of student open-ended responses to feedback message received.
Level 1: Total N = 64; Related to evidence use‡ = 42 (66% of total); takeaway relates to messages provided = 35 (83%); takeaway relates to messages NOT provided = 7 (17%)
Level 2: Total N = 69; Related to evidence use‡ = 37 (54% of total); relates to messages provided = 16 (43%); relates to messages NOT provided = 21 (57%)
Level 3: Total N = 98; Related to evidence use‡ = 82 (84% of total); relates to messages provided = 44 (54%); relates to messages NOT provided = 38 (46%)
Total: N = 231; Related to evidence use‡ = 161 (70% of total); relates to messages provided = 95 (59%); relates to messages NOT provided = 66 (41%)
Examples related to the automated feedback messages: “that its very important to look back in the story”; “I learned that you really do need a lot of evidence in a essay.”; “That you have to put enough detail so the reader can understand what you are doing and saying.”; “I learned that if I fully address the prompt I will be able to get full credit”; “To explain how my evidence ties in with my argument.”; “I will prove my point better when i elaborate.”; “do it in your own words”.
Note: Column totals are for 231 students because 35 students were missing on the survey responses within eRevise despite turning in both drafts of their writing.
‡ There were many reasons why we could not be sure the students’ written comments related to feedback messages. We generated the following codes to characterize these responses: Students left the question blank; Students wrote, “Nothing” or “I don’t know”; Student responses pertained to the text of the article rather than use of evidence (e.g., “A lot of people have poverty”); Student responses pertained generally to writing (e.g., “Always reread what you are writing so that it can make sense”); Student responses pertained to grammar or mechanics; Student responses were unclear; many responses used “it” without a clear antecedent (e.g., “It helps you write better”).

4.9. Is there a relationship between substantive teacher interactions and student improvement on feature scores? (Mediating process → outcome, RQ9 Table 1)

Because our goal in designing eRevise was for it to be used for formative assessment purposes, we wondered how teachers interacted with students during its implementation. Our results suggested this varied across teachers. More than a third of the teachers (n = 7) did not interact with students at all. In these cases, it appeared as if eRevise was being used as practice for the state standardized test – an independent one-draft writing assessment. For these teachers, their implementation log represents only procedural questions from students and corresponding teacher responses. For example, students might ask, “How do I copy and paste?” or “How do I submit?” and the teacher would resolve the problem. Other teachers were slightly less procedural during implementation (n = 4). For example, students might ask, “Is this enough evidence?” and instead of answering the question, the teacher would refer students back to the task description. Finally, the last group of teachers (n = 5) interacted substantively with student questions. That is, they appeared to use eRevise as a teaching and learning opportunity. For example, when a student asked, “Why did it say explain more?”, the teacher reported that they “went over [the student’s] writing and discussed that more was needed and that some evidence was the same.” The teacher advised the student to “add more information or details to explain your thinking.”

To understand the potential consequences of variation in classroom implementation (i.e., the extent to which the assessment appeared to be treated as a formative assessment as opposed to a summative assessment), we constructed a series of hierarchical linear models where students’ improvement scores were nested in classrooms. We briefly review findings about the variance components (see bottom panel of Table 6), before we discuss the substantive meaning of associations between covariates and the improvement score.

Findings from the unconditional model reveal significant between-classroom differences in the improvement scores (about 6% of the variance lies between classrooms). As student predictors were added in Model 1, including the first-draft rubric-based score, student-level variance decreased (about 19% of the variance between students was explained) while variance between classrooms increased slightly. Notably, there is a significant reduction in the deviance statistic from the prior fully-nested model, suggesting the explanatory power of these student-level covariates. Model 1 also examined the variance of the slopes for the relationship between being helped by a teacher and the improvement score within classrooms (τβ6 = 0.085; p = .322). Finally, in Model 2 after adjusting for whether the teacher provided substantive comments on the intercept we explained about 21% of the between-classroom variance in means (reduction in τβ0 from Model 1).
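The variance-component arithmetic referenced above can be checked directly from the estimates reported in Table 6 (below). The short sketch that follows reproduces the “about 6%” intraclass correlation from the fully unconditional model and the “about 19%” reduction in student-level variance in Model 1; it is illustrative only, not the authors’ code.

```python
# Illustrative arithmetic using the variance estimates reported in Table 6.
tau_00_fum, sigma2_fum = 0.062, 0.941   # fully unconditional model (FUM)
sigma2_model1 = 0.764                   # after adding student-level predictors (Model 1)

# Share of total variance lying between classrooms (intraclass correlation).
icc = tau_00_fum / (tau_00_fum + sigma2_fum)

# Proportional reduction in within-classroom (student-level) variance from FUM to Model 1.
pct_student_var_explained = (sigma2_fum - sigma2_model1) / sigma2_fum

print(f"ICC = {icc:.3f}")                                          # ~0.06, i.e., about 6%
print(f"student-level variance explained = {pct_student_var_explained:.3f}")  # ~0.19, i.e., about 19%
```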
Table 6
Hierarchical linear model examining effects of teachers’ responses to students’ queries during eRevise implementation. Entries are coefficients (standard errors) for the fully unconditional model (FUM), the student-level model (Model 1), and the student-level plus teacher-level model (Model 2).

Mean Improvement, γ00: FUM −0.00 (.08); Model 1 −0.01 (.09); Model 2 −0.00 (.08)
Substantive Response to Questions, γ01: Model 2 0.20* (.09)
Rubric Score on First Draft, γ10: Model 1 −0.24*** (.05); Model 2 −0.25*** (.05)
Used Feedback, γ20: Model 1 0.02 (.08); Model 2 0.03 (.08)
Re-Read Article, γ30: Model 1 −0.01 (.06); Model 2 −0.01 (.06)
How Much Revision is Better, γ40: Model 1 0.38*** (.10); Model 2 0.39*** (.10)
Feedback was Different, γ50: Model 1 −0.11 (.08); Model 2 −0.10 (.08)
Teacher Helped Me, γ60: Model 1 −0.10 (.16); Model 2 −0.13 (.14)
Substantive Response to Questions, γ61 (cross-level interaction with Teacher Helped Me): Model 2 0.41* (.17)
I like to write, γ70: Model 1 0.00 (.04); Model 2 0.00 (.04)
Approach to Writing, γ80: Model 1 −0.01 (.08); Model 2 −0.01 (.08)

Variance Components
Classroom Variance in means (τβ0): FUM .062; Model 1 .071 (0% explained from prior model); Model 2 .046 (21% explained from prior model)
Classroom Variance in Teacher Helped Me Slope (τβ6): Model 1 .085; Model 2 .004 (95% explained from prior model)
Between Student Variance w/in Classrooms (σ2): FUM .941; Model 1 .764 (19% explained from prior model); Model 2 .761 (0.4% explained from prior model)

Deviance: FUM 895.51; Model 1 747.07 (χ2 = 151.2; df = 2; p = .001 vs. previous model); Model 2 740.94 (χ2 = 6.1; df = 2; p = .045 vs. previous model)

Note: FUM = Fully unconditional model.
~ p < .10. * p < .05. ** p < .01. *** p < .001.
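As an illustration of how a model in the spirit of Model 2 could be specified with open-source tools, the sketch below uses statsmodels’ MixedLM to fit a two-level model with a random slope for a teacher-help indicator and a cross-level interaction with a classroom-level indicator of substantive teacher responses. The simulated data and variable names are hypothetical, only a subset of the student-level covariates in Table 6 is included, and the authors’ analysis used their own HLM specification and variables.

```python
# Illustrative two-level model sketch (hypothetical data, not the authors' analysis).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for c in range(16):                         # hypothetical classrooms
    substantive = int(c < 5)                # classroom-level indicator of substantive responses
    for _ in range(15):                     # hypothetical students per classroom
        helped = int(rng.integers(0, 2))    # student asked for and received teacher help
        rows.append({
            "classroom_id": c,
            "teacher_helped_me": helped,
            "substantive_response": substantive,
            "rubric_first_draft": rng.normal(0, 1),
            "improvement": rng.normal(0.2 * substantive + 0.4 * helped * substantive, 1),
        })
df = pd.DataFrame(rows)

# Random intercept and random slope for "teacher helped me" by classroom,
# plus the cross-level interaction with the classroom-level indicator.
model = smf.mixedlm(
    "improvement ~ rubric_first_draft + teacher_helped_me * substantive_response",
    data=df,
    groups=df["classroom_id"],
    re_formula="~teacher_helped_me",
)
result = model.fit(reml=False)  # ML estimation so deviance comparisons across nested models are meaningful
print(result.summary())
```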

More importantly, the addition of teachers’ substantive comments as a cross-level interaction on the random slope explained about 95% of the variance between classrooms in the relationship between being helped by the teacher leading to a higher improvement score. Once again the deviance statistics revealed significant reduction from the prior fully-nested Model 1 with only the addition of this teacher-level covariate on the random intercept and slope.

Associations between improvement scores and a couple of student-level covariates were observed. For example, students with lower rubric scores on the first draft were predicted to have higher improvement scores (γ10 = −0.25; p < .001). The model also shows that students made fairly accurate predictions about how much their revision improved. For each scale point on the students’ survey response (from 1 “none” to 3 “a lot”), students’ improvement scores increased by about 0.4 standard deviations (γ40 = 0.39; p < .001). Moreover, the degree to which teachers interacted with their students during implementation of eRevise influenced students’ improvement scores. This relationship was statistically significant on the classroom mean for improvement scores (γ01 = 0.22; p = .043). Thus, in classrooms where teachers provided substantive help to any student the classroom mean was higher by about 0.2 standard deviations. Finally, the cross-level interaction revealed that when students received help in classrooms where teacher interactions with students were more substantive, students’ improvement scores were higher (γ61 = 0.41; p = .027).

5. Discussion

As natural language processing technologies grow in their ability to assess substantive dimensions of writing quality, we expect that AWE systems such as eRevise will proliferate. The potential of AWE systems to meet their intended purpose of supporting teaching and learning, however, is dependent on the degree to which they serve an authentic formative assessment purpose. As reviews of AWE systems have shown, there is a need to generate substantive feedback in response to what students have produced. Thus, one challenge is to design systems with these parameters in mind. A second challenge is to design research studies structured to support validity arguments, which, to date, have been rarely applied to the evaluation of AWE systems [34].

Evidence from our current validity argument for eRevise demonstrates promise in both regards. We note that several limitations should be considered in interpreting our results. Most obviously, our study was conducted in a limited number of classrooms, thus attenuating the strength of our inferences regarding students’ and teachers’ response to the system.10 Furthermore, we did not have access to individual students’ demographic and achievement scores and so were not able to control for potentially important confounding variables (e.g., reading skills) that could impact students’ revision independent of the feedback messages. Such variables could also permit investigation into whether eRevise has differential impacts on students from different racial and socioeconomic backgrounds. This, along with ensuring that the corpora of essays used as training sets of natural language processing algorithms are drawn from representative samples, is important for ensuring that AWE systems are not biased against or disadvantage particular groups of students (see e.g., [46]). Finally, we examined improvement in our feature scores using Cohen’s dz representing a standardized effect size for within-subjects designs (see, e.g., [44]). Given we might expect improvement from any attempt at revision (i.e., draft 2 feature scores – draft 1), it will be important in the future to explore designs with a comparison group in order to calculate an effect size (Cohen’s ds) for a between-subjects design and/or a difference-in-difference (DID) estimate. This result would demonstrate improvement for the eRevise condition beyond what we’d expect under normal (comparison) conditions.

10 The sample size for students in this study is considered ‘large’ relative to most other studies of automated feedback systems (see Deeva et al., 2021), but we recommend that these samples be expanded, especially at the classroom level, to further our understanding and to increase power for statistical hypothesis tests.
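The comparison-group analyses proposed above (a between-subjects Cohen’s ds and a difference-in-differences estimate) could be computed along the lines of the following sketch; all values shown are hypothetical.

```python
# Illustrative sketch of the proposed comparison-group analyses (hypothetical values).
import numpy as np

# Hypothetical gain scores (draft 2 minus draft 1) for each condition.
gains_erevise = np.array([3.8, 2.1, 4.0, 1.5, 3.2, 2.7])
gains_comparison = np.array([1.0, 0.4, 1.8, 0.9, 1.2, 0.6])

def cohens_d_s(a, b):
    """Between-subjects d: mean difference over the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
    return (a.mean() - b.mean()) / pooled_sd

# DID: (post - pre) in the eRevise condition minus (post - pre) in the comparison
# condition; with gain scores already computed, this is the difference in mean gains.
did = gains_erevise.mean() - gains_comparison.mean()

print(f"d_s = {cohens_d_s(gains_erevise, gains_comparison):.2f}, DID = {did:.2f}")
```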
These limitations aside, our study contributes to the development of systems that assess substantive dimensions of standards-based writing (i.e., the use of source text evidence) in the early grades, as well as efforts to advance validity arguments for the use of AWE systems as formative assessments more broadly (e.g., [10,73,98]). We discuss our findings relative to the activity system we theorized in Fig. 1, the research questions elaborated in Table 1 and the evidence generated for our claims constituting our interpretation/use argument (IUA).11

11 An interpretation/use argument is the aggregated claim(s) made during a validity investigation that the investigator is attempting to infer. As Kane [39] defines it for assessment scores; “the IUA includes all of the claims based on the test scores (i.e., the network of inferences and assumptions inherent in the proposed interpretation and use)” (p.2).

5.1. Meaning making as a focal inference in IUA for formative assessment

Meaning-making as inferred through model-based hypothesis tests: The object of our activity system was the automated feedback provided by eRevise, and our validity argument examined nine research questions. The cumulative evidence from our investigation supports our bolded claim in Table 1 for the potential of eRevise as a formative assessment. Central to this claim is the evidence we observed of student meaning-making aligned with disciplinary norms for evidence use in text-based argument writing. For example, we observed an interaction effect in our hierarchical linear model, which examined the role of teacher-student interactions as a mediating process on the quality of student revisions. As other researchers of AWE systems have noted, students often need help understanding automated feedback messages [11,77]. Findings from our hierarchical linear models showed that, in general when teachers provided more substantive support – i.e., they took an active role helping students interpret and use the feedback – students’ essays showed larger improvements on the feature scores. For the roughly one-quarter of students who asked their teacher for assistance, there was an additional relationship between teachers’ substantive help and higher improvement scores. Thus, students in classrooms where teachers provided substantive support benefitted overall, but students who asked specific questions and then received substantive support benefitted the most. All else being constant, students had higher estimated improvement scores when teachers treated eRevise as a formative assessment rather than a test of students’ independent writing skill.

Our inference, aligned with socio-cultural learning theories, is that students’ interaction with a knowledgeable other about the automated feedback better supported them to make meaning of the automated feedback and then make relevant revisions. This is the very essence of formative assessment – teachers facilitating students’ interpretation of the feedback in relation to what they wrote to help them ‘see’ the next steps they need to take to improve their essay.

There is much we did not observe that we would also want to see in order to confirm further theory-based interpretations of our evidence. For example, our theory suggests that these teacher-student interactions promote student learning in addition to improved student revisions. To make such an inference, replication studies could seek evidence of transfer by examining students’ first draft of a first text-based argument writing task with a cycle of automated feedback and revision to a first draft of a second text-based argument writing task. For students having substantive teacher-student interactions, evidence of greater improvements on students’ first drafts could further support inferences about student learning (i.e., confirm that students’ generalizations about evidence use from teacher-student interactions during their first writing task were implemented in their next, similar, writing task).

Our findings are consonant with recent calls for more human interaction to be built into automated feedback systems [20,89], and research we described earlier underscoring the importance of interactions around feedback messages (e.g., [92]). We see our empirical results as valuable evidence for the idea that, in certain contexts (i.e., where teachers view the automated feedback as an opportunity to discuss important aspects of evidence use with their students) automated feedback systems could foster the kinds of teacher-student interactions that support successful writing and revision [75]. A question going forward then, is how to design mechanisms for supporting teachers and students to view the automated feedback as an opportunity to scaffold rich conversations at the nexus of targeted constructs in writing such as evidence use as a warrant.

Meaning-making as evidenced in students’ articulated learnings: In the absence of a second text-based argument writing task in this study, we asked students what they believe they learned from using eRevise that they would carry with them to their next argument writing task. This analysis was limited because the data were captured in written form. Thus, researchers were not able to probe students to expand on vague statements they made. Instead, the analysis included only what students were able to articulate in writing in response to our open-ended survey question. Nevertheless, we find it encouraging that 70 percent of the students presented a well-articulated ‘takeaway’ about evidence use. These articulations were evidence that students may have learned something about how evidence is used to support their viewpoint that is generalizable. Moreover, we found a statistically significant effect that students’ written ‘takeaway’ was more aligned with the feedback messages they received than the feedback messages they did not receive. Our qualitative analysis provides greater description of this effect which seemed most prominent for students receiving level 1 feedback.

Finally, our argument about student meaning making in response to the automated feedback is also based on researcher hypotheses about the types of improvements we would expect to see in student essays in response to each feedback level. This was a different way to examine patterns in our data to infer whether improvements in feature scores were aligned to the feedback students received. Because 8 of the 9 hypotheses were confirmed, we infer that students were, in general, responsive to the feedback they received – thus, they likely engaged in making sense of the feedback they did receive and revised their essays according to the feedback.

5.2. Measuring improvement at a grain size that supports the use of eRevise as a formative assessment

Much of the past research on the quality of AWE systems has focused on comparing human and machine-generated rubric scores. We argue, however, that a smaller ‘grain size’ for assessing change in students’ writing (i.e., a feature score) is likely more useful for formative assessment purposes. Our results provide evidence that understanding change in the features (atomistic elements) of evidence use is possible and constructive. Specifically, we found that human raters had moderate to good inter-rater reliability with the AWE system when rating the NPE and SPC features, as well as change in those features. Furthermore, the rubric score was only capable of indicating improvement for 41% of the students, but we established that feature scores also improved for students, as a whole, whose scores did not indicate a rubric score change. Feature scores, then, were more sensitive to incremental differences in student writing than rubric scores and provide more specific information about evidence use. They have greater utility for formative assessment purposes because they provide teachers with more information about students’ revision efforts and progress. This may be especially important in the context of writing produced by younger students, many of whom struggle to revise effectively [53,103]. We note as well that feature scores can support more precise investigation of the alignment between feedback messages and specific changes in writing quality, which is important for making inferences about the impact of automated feedback messages on students’ writing. In all, scores on specific features are a potential avenue for future researchers to explore in evaluating AWE systems intended as formative assessments.

Finally, the inferences above suggest that using feature scores from students’ first drafts to channel them to different feedback messages could be beneficial.
We think the combination of our design features – i.e., the attempt at automating scoring of substantive features and the hybrid mix of expert-driven feedback combined with a data-driven approach to channel the feedback based on feature scores – contributed to a positive student experience with evidence for student learning about the target domain of evidence use. Designers of AWE systems that seek to be responsive to student-generated text could consider a similar approach.

5.3. Contributions to the design of AWE systems

We designed our AWE system with features that represent salient elements of evidence use for developing adolescent writers. We see the benefits accruing beyond simply greater efficiency and speed of grading – the usual benefits cited for justifying automation. In this study, we sought to demonstrate that an AWE system can encourage content revisions in writing and avoid the common criticism that the system only encourages syntactic and grammatical revisions (see, e.g., [29]). Finally, an essential question for formative purposes is how teachers can integrate automated feedback into their writing instruction to achieve broader goals of, say, developing argumentative writing skills within a process approach to writing. Our vision looking forward is that the feature scores can help teachers construct sociocultural learning opportunities with their students (see, e.g., [11,30]) about these essential components of evidence use and build student understanding of a larger writing construct such as argumentation.

We presented evidence for a validity argument for the use of one writing evaluation system (eRevise) and its potential as a formative assessment. While it is impossible to generalize from just one system, we note the congruence of certain design features with critical themes outlined in recent reviews of automated systems. For example, eRevise demonstrates the use of the recommended approach of using data to drive context-sensitive feedback to students [20]. In addition, our system provides feedback based on scoring of a substantive dimension (i.e., evidence use), and the feature scores are interpretable. We argue this aids in the measurement of features contributing to clearer hypothesis tests to support the validity argument, while the interpretability of the features aids in fitting into the formative assessment ecology – i.e., aids in channeling essays to appropriate context-sensitive feedback that students and teachers perceived as relevant; while also providing feedback that served to improve student learning in intended ways. Finally, our statistical models provide evidence that a system designed to support or encourage human interactivity around the feedback is likely to see learning gains. We believe eRevise could serve as an existence proof for the potential import of these system design characteristics.

6. Conclusion

Our findings confirm the perceived needs for automated feedback systems to design better sociocultural learning processes that include teachers in their design [75]. AWE systems are not likely to be effective if they are seen as burdensome, undermining instruction, and/or treated as summative assessments. Overall, the large majority of teachers indicated that they would use eRevise at multiple points in the year if it were available. They appreciated the potential ‘time savings’ afforded by the system, and perceived the feedback messages and writing task in eRevise as aligned with their instruction and their state and district writing standards and so, reinforcing of their instructional messages to students. We interpret these results as evidence of the coherence of eRevise with classroom practice and potential to alleviate, not add to, teachers’ burden. Looking forward, automated systems that design better ways to support a close partnership between teachers and machines may be the most productive way for advancing the potential of AWE formative assessment systems; especially systems designed to build students’ conceptual understanding of argument elements (such as evidence use) through feedback and dialogic teacher-student interactions.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.caeo.2022.100084.

References

[1] Adie L, van der Kleij F, Cumming J. The development and application of coding frameworks to explore dialogic feedback interactions and self-regulated learning. Br Educ Res J 2018;44(4):704–23.
[2] Aguinis H, Gottfredson RK, Culpepper SA. Best-practice recommendations for estimating cross-level interaction effects using multilevel modeling. J Manage 2013;39(6):1490–528.
[3] Aguinis H, Culpepper SA. An expanded decision-making procedure for examining cross-level interaction effects with multilevel modeling. Organ Res Methods 2015;18(2):155–76.
[4] Baird JA, Andrich D, Hopfenbeck TN, Stobart G. Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice 2017;24(3):317–50.
[5] Bauml M. Beginning primary teachers’ experiences with curriculum guides and pacing calendars for math and science instruction. J Res Childhood Educ 2015;29(3):390–409.
[6] Bernard HR, Wutich A, Ryan GW. Analyzing qualitative data: systematic approaches. Thousand Oaks, CA: Sage Publications; 2016.
[7] Bliese PD, Maltarich MA, Hendricks JL. Back to basics with mixed-effects models: nine take-away points. J Bus Psychol 2018;33(1):1–23.
[8] Brindle M, Graham S, Harris KR, Hebert M. Third and fourth grade teachers’ classroom practices in writing: a national survey. Read Writ 2015;29(5):929–54.
[9] Burstein J, Riordan B, McCaffrey D. Expanding automated writing evaluation. In: Handbook of automated scoring. Chapman and Hall/CRC; 2020. p. 329–46.
[10] Chapelle CA, Cotos E, Lee J. Validity arguments for diagnostic assessment using automated writing evaluation. Lang Test 2015;32(3):385–405.
[11] Chen CFE, Cheng WYEC. Beyond the design of automated writing evaluation: pedagogical practices and perceived learning effectiveness in EFL writing classes. Lang Learn Technol 2008;12(2):94–112.
[12] Chong SW. Reconsidering student feedback literacy from an ecological perspective. Assess Eval Higher Educ 2021;46(1):92–104.
[13] Correnti R, Matsumura LC, Hamilton LS, Wang E. Combining multiple measures of students’ opportunities to develop analytic, text-based writing skills. Educational Assessment 2012;17(2–3):132–61.
[14] Correnti R, Matsumura LC, Hamilton L, Wang E. Assessing students’ skills at writing analytically in response to texts. The Elementary School Journal 2013;114(2):142–77.
[15] Correnti R, Matsumura LC, Wang E, Litman D, Rahimi Z, Kisa Z. Automated scoring of students’ use of text evidence in writing. Reading Research Quarterly 2020;55(3):493–520.
[16] Crossley SA, Varner LK, Roscoe RD, McNamara DS. Using automated indices of cohesion to evaluate an intelligent tutoring system and an automated writing evaluation system. In: International Conference on Artificial Intelligence in Education. Berlin, Germany: Springer; 2013. p. 269–78.
[17] Dann R. Feedback as a relational concept in the classroom. Curriculum J 2019;30(4):352–74.
[18] Deane P. On the relation between automated essay scoring and modern views of the writing construct. Assess Writing 2013;18:7–24.
[19] Dedoose, software, version 8.3.17 (2020). As of June 2, 2020: http://www.dedoose.com.
[20] Deeva G, Bogdanova D, Serral E, Snoeck M, De Weerdt J. A review of automated feedback systems for learners: classification framework, challenges, and opportunities. Comput Educ 2020:104094.
[21] Denzin NK, Lincoln YS. Collecting and interpreting qualitative materials. Thousand Oaks, CA: Sage Publications; 2003.
[22] Du H, List A. Evidence use in argument writing based on multiple texts. Read Res Q 2020.
[23] Elmahdi I, Al-Hattami A, Fawzi H. Using technology for formative assessment to improve students’ learning. Turkish Online J Educ Technol TOJET 2018;17(2):182–8.
[24] Enders CK, Tofighi D. Centering predictor variables in cross-sectional multilevel models: a new look at an old issue. Psychol Methods 2007;12(2):121.
[25] Foltz PW, Rosenstein M. Data mining large-scale formative writing. In: Handbook of Learning Analytics; 2017. p. 199.
[26] Gallimore R, Goldenberg CN, Weisner TS. The social construction and subjective reality of activity settings: implications for community psychology. Am J Community Psychol 1993;21(4):537–60.
[27] Graham S, Capizzi A, Harris KR, Hebert M, Morphy P. Teaching writing to middle school students: a national survey. Read Writ 2014;27(6):1015–42.
[28] Greeno JC, Collins A, Resnick LB. Cognition and learning. In: Berliner DC, Calfee RC, editors. Handbook of educational psychology. New York: Macmillan; 1996. p. 15–46.
[29] Grimes D, Warschauer M. Automated essay scoring in the classroom. In: Annual Meeting of the American Educational Research Association, San Francisco, CA; 2006.
[30] Grimes D, Warschauer M. Utility in a fallible tool: a multi-site case study of automated writing evaluation. J Technol Learn Assess 2010;8(6).
[31] Gu PY. An argument-based framework for validating formative assessment in the classroom. Front Educ 2021;6:605999. 10.3389/feduc.
[32] Hawe E, Dixon H. Assessment for learning: a catalyst for student self-regulation. Assess Eval Higher Educ 2017;42(8):1181–92.
[33] Harrell M, Wetzel D. Using argument diagramming to teach critical thinking in a first-year writing course. In: The Palgrave handbook of critical thinking in higher education. New York: Palgrave Macmillan; 2015. p. 213–32.
[34] Hockly N. Automated writing evaluation. ELT J 2019;73(1):82–8. https://doi.org/10.1093/elt/ccy044.
[35] Hopster-den Otter D, Wools S, Eggen TJHM, Veldkamp BP. A general framework for the validation of embedded formative assessment. J Educ Meas 2019;56(4):715–32.
[36] Jurafsky D, Martin JH. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. 3rd ed. Forthcoming; 2022.
[37] Khamboonruang A. Development and validation of a diagnostic rating scale for formative assessment in a Thai EFL university writing classroom: a mixed methods study (Doctoral dissertation); 2020.
[38] Kane MT. Validation. Educational Measurement 2006;4(2):17–64.
[39] Kane MT. Validating the interpretations and uses of test scores. J Educ Meas 2013;50(1):1–73.
[40] Kane MT. Validation as a pragmatic, scientific activity. J Educ Meas 2013;50(1):115–22.
[41] Kiuhara SA, Graham S, Hawken LS. Teaching writing to high school students: a national survey. J Educ Psychol 2009;101(1):136.
[42] Kneupper CW. Teaching argument: an introduction to the Toulmin model. College Composition and Communication 1978;29(3):237–41.
[43] Koh WY. Effective applications of automated writing feedback in process-based writing instruction. English Teaching 2017;72(3).
[44] Lakens D. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front Psychol 2013;4:863.
[45] Lawrence JF, Galloway EP, Yim S, Lin A. Learning to write in middle school? J Adolesc Adult Literacy 2013;57(2):151–61.
[46] Litman D, Zhang H, Correnti R, Matsumura LC, Wang E. A fairness evaluation of automated methods for scoring text evidence usage in writing. In: International Conference on Artificial Intelligence in Education. Cham: Springer; 2021. p. 255–67.
[47] Lee VE. Using hierarchical linear modeling to study social contexts: the case of school effects. Educ Psychol 2000;35(2):125–41.
[48] Leont’ev A. Psychology and the language learning process. London: Pergamon; 1981.
[49] Li J, Link S, Hegelheimer V. Rethinking the role of automated writing evaluation (AWE) feedback in ESL writing instruction. J Second Lang Writing 2015;27:1–18.
[50] Liao HC. Enhancing the grammatical accuracy of EFL writing by using an AWE-assisted process approach. System 2016;62:77–92.
[51] Link S, Mehrzad M, Rahimi M. Impact of automated writing evaluation on teacher feedback, student revision, and writing improvement. Comput Assist Lang Learn 2020:1–30.
[52] Lu X. An empirical study on the artificial intelligence writing evaluation system in China CET. Big Data 2019;7(2):121–9.
[53] MacArthur CA. Evaluation and revision. In: Best practices in writing instruction; 2018. p. 287.
[54] Madnani N, Burstein J, Elliot N, Klebanov BB, Napolitano D, Andreyev S, Schwartz M. Writing Mentor: self-regulated writing feedback for struggling writers. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations; 2018. p. 113–7.
[55] Mao L, Liu OL, Roohr K, Belur V, Mulholland M, Lee HS, Pallant A. Validation of automated scoring for a formative assessment that employs scientific argumentation. Educ Assess 2018;23(2):121–38.
[56] Mathison S, Freeman M. Constraining elementary teachers’ work: dilemmas and paradoxes created by state mandated testing. Educ Policy Anal Arch 2003;11:34.
[57] Matsumura LC, Patthey-Chavez GG, Valdés R, Garnier H. Teacher feedback, writing assignment quality, and third-grade students’ revision in lower- and higher-achieving urban schools. The Elementary School Journal 2002;103(1):3–25.
[58] Matsumura LC, Garnier HE, Slater SC, Boston MD. Toward measuring instructional interactions “at-scale.” Educational Assessment 2008;13(4):267–300.
[59] Matsumura LC, Correnti R, Wang EL. Classroom writing tasks and students’ analytic text-based writing. Reading Research Quarterly 2015;50(4):417–38.
[60] Miles MB, Huberman AM, Saldaña J. Qualitative data analysis: a methods sourcebook. 3rd ed. Thousand Oaks, CA: Sage Publications; 2014.
[61] National Center for Education Statistics (NCES). The nation’s report card: writing 2011 (NCES 2012–470); 2012.
[62] National Council of Teachers of English/International Reading Association. Standards for the English language arts (1996/2012). 2022. https://ncte.org/standards/ncte-ira.
[63] National Governors Association Center for Best Practices, Council of Chief State School Officers (NGAC/CCSSO, 2010). Common Core State Standards English Language Arts Standards. Washington, DC: National Governors Association Center for Best Practices, Council of Chief State School Officers; 2022.
[64] Newell GE, Beach R, Smith J, VanDerHeide J. Teaching and learning argumentative reading and writing: a review of research. Read Res Q 2011;46(3):273–304.
[65] Odendahl N, Deane P. Assessing the writing process: a review of current practice. ETS Res Memorandum Series 2018.
[66] O’Hallaron CL. Supporting fifth-grade ELLs’ argumentative writing development. Written Communication 2014;31:304–31.
[67] Palermo C, Thomson MM. Teacher implementation of self-regulated strategy development with an automated writing evaluation system: effects on the argumentative writing performance of middle school students. Contemp Educ Psychol 2018;54:255–70.
[68] Panadero E, Andrade H, Brookhart S. Fusing self-regulated learning and formative assessment: a roadmap of where we are, how we got here, and where we are going. Australian Educ Res 2018;45(1):13–31.
[69] Patthey-Chavez GG, Matsumura LC, Valdes R. Investigating the process approach to writing instruction in urban middle schools. Journal of Adolescent & Adult Literacy 2004;47(6):462–76.
[70] Pryor J, Crossouard B. A socio-cultural theorisation of formative assessment. Oxford Rev Educ 2008;34(1):1–20.
[71] Quintana R, Schunn C. Who benefits from a foundational logic course? Effects on undergraduate course performance. J Res Educ Eff 2019;12(2):191–214.
[72] Ranalli J. Automated written corrective feedback: how well can students make use of it? Comput Assisted Lang Learn 2018;31(7):653–74.
[73] Ranalli J, Link S, Chukharev-Hudilainen E. Automated writing evaluation for formative assessment of second language writing: investigating the accuracy and usefulness of feedback as part of argument-based validation. Educ Psychol (Lond) 2017;37(1):8–25.
[74] Raudenbush SW, Bryk AS. Hierarchical linear models: applications and data analysis methods (Vol. 1). Thousand Oaks, CA: Sage Publications; 2002.
[75] Roschelle J, Lester J, Fusco J. AI and the future of learning: expert panel report [Report]. Digital Promise 2020. https://circls.org/reports/ai-report.
[76] Roscoe RD, Allen LK, Johnson AC, McNamara DS. Automated writing instruction and feedback: instructional mode, attitudes, and revising. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting. 62. Los Angeles, CA: Sage Publications; 2018. p. 2089–93.
[77] Roscoe RD, McNamara DS. Writing Pal: feasibility of an intelligent writing strategy tutor in the high school classroom. J Educ Psychol 2013;105(4):1010.
[78] Ryan GW, Bernard HR. Techniques to identify themes. Field Methods 2003;15(1):85–109.
[79] Sannino A, Engeström Y. Cultural-historical activity theory: founding insights and new challenges. Cultural-Historical Psychology 2018.
[80] Schleppegrell MJ, Achugar M, Oteíza T. The grammar of history: enhancing content-based instruction through a functional focus on language. TESOL Quarterly 2004;38(1):67–93.
[81] Shanahan T, Shanahan C. Teaching disciplinary literacy to adolescents: rethinking content-area literacy. Harv Educ Rev 2008;78(1):40–59.
[82] Shepard LA. The role of assessment in a learning culture. Educ Res 2000;29(7):4–14.
[83] Shepard LA. Linking formative assessment to scaffolding. Educ Leadership 2005;63(3):66–70.
[84] Shute VJ. Focus on formative feedback. Rev Educ Res 2008;78(1):153–89.
[85] Snow CE, Uccelli P. The challenge of academic language. In: Olson DR, Torrance N, editors. The Cambridge handbook of literacy. Cambridge, New York: Cambridge University Press; 2009. p. 112–33.
[86] Snijders TAB, Bosker RJ. Multilevel analysis: an introduction to basic and advanced multilevel modeling. Sage; 1999.
[87] Stevenson M. A critical interpretative synthesis: the integration of automated writing evaluation into classroom writing instruction. Comput Compos 2016;42:1–16.
[88] Stevenson M, Phakiti A. Automated feedback and second language writing. In: Feedback in second language writing: contexts and issues; 2019. p. 125–42.
[89] Strobl C, Ailhaud E, Benetos K, Devitt A, Kruse O, Proske A, Rapp C. Digital support for academic writing: a review of technologies and pedagogies. Comput Educ 2019;131:33–48.
[90] Sung YT, Liao CN, Chang TH, Chen CL, Chang KE. The effect of online summary assessment and feedback system on the summary writing on 6th graders: the LSA-based technique. Comput Educ 2016;95:1–18.
[91] Toulmin SE. The uses of argument. Cambridge University Press; 2003.
[92] Van der Schaaf M, Baartman L, Prins F, Oosterbaan A, Schaap H. Feedback dialogues that stimulate students’ reflective thinking. Scandinavian J Educ Res 2013;57(3):227–45.
[93] Vygotsky LS. Mind in society: the development of higher psychological processes. Harvard University Press; 1980.
[94] Vygotsky L. Thought and language (A. Kozulin, Trans.). Cambridge, MA; London, England: The MIT Press; 1986.
[95] Wan Q, Crossley S, Allen L, McNamara D. Claim detection and relationship with writing quality. In: Proceedings of the 13th International Conference on Educational Data Mining (EDM 2020); 2020. p. 691–5.
[96] Wertsch JV. The zone of proximal development: some conceptual issues. New Dir Child Adolesc Dev 1984;23:7–18.
[97] Wilson J, Czik A. Automated essay evaluation software in English Language Arts classrooms: effects on teacher feedback, student motivation, and writing quality. Comput Educ 2016;100:94–109.
[98] Wilson J, Roscoe RD. Automated writing evaluation and feedback: multiple metrics of efficacy. J Educ Comput Res 2020;58(1):87–125.
[99] Woods B, Adamson D, Miel S, Mayfield E. Formative essay feedback using predictive scoring models. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2017. p. 2071–80.
[100] Xiao Y, Yang M. Formative assessment and self-regulated learning: how formative assessment supports students’ self-regulation in English language learning. System 2019;81:39–49.
[101] Yin RK. Qualitative research from start to finish. New York, NY: Guilford Publications; 2015.
[102] Zhu M, Liu OL, Lee HS. The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing. Comput Educ 2020;143:103668.
[103] Wang EL, Matsumura LC, Correnti R, Litman D, Zhang H, Howe E, et al. eRevis(ing): students’ revision of text evidence use in an automated writing evaluation system. Assessing Writing 2020;44.
[104] Wang EL, Matsumura LC. Text-based writing in elementary classrooms: teachers’ conceptions and practice. Reading and Writing 2019;32(2):405–38.
[105] Attali Y, Burstein J. Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment 2006;4(3).
[106] Zhang H, Magooda A, Litman D, Correnti R, Wang E, Matsumura LC, Quintana R. eRevise: using natural language processing to provide formative feedback on text evidence usage in student writing. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2019. 33(01):9619–25.