Computer or Human: A Comparative Study of Automated Evaluation Scoring and Instructors' Feedback on Chinese College Students' English Writing

The role of internet technology in higher education, particularly in teaching English as a Foreign Language, is increasingly prominent because of interest in the ways technology can support students. The automated evaluation scoring system is a typical application of network technology to the teaching of English writing. Many writing scoring platforms that provide instant online corrective feedback on students' writing have been developed and used in China. However, the validity of Aim Writing, a product developed by Microsoft Research Asia that claims to be the best tool for facilitating Asian EFL learners, has not been tested in previous studies. In this mixed methods study, the feedback and effect of Aim Writing on college students' writing are investigated and compared with the instructor's feedback. The results indicate that Aim Writing's performance is insufficient to support all students' writing needs and that colleges should encourage a hybrid model combining AES and instructor feedback in writing.


Introduction
English essay writing, which requires integrated linguistic and content knowledge, poses a great challenge for English as a Foreign Language (EFL) learners and teachers. Even though teachers have dedicated effort to instructing writing, EFL learners' performance has not improved, especially in content organization, idea development, and grammar structure (Chen, 2022). EFL teachers not only need to develop students' linguistic and communicative competence but also need to use relevant feedback techniques for responding to students' writing (Alharbi, 2022). Feedback combined with effective instruction has a positive influence on students' writing ability. Evidence supports the positive effect of teacher feedback on students' engagement and revision practices (Zhang & Hyland, 2022; Link et al., 2020). Moreover, feedback from automated evaluation systems also helps language learners improve their writing proficiency, self-regulation, and self-efficacy (Naghdipour, 2022; Nückles et al., 2020; Ekholm et al., 2014).
With the expansion of class sizes in China and the emphasis on grammar drilling in College English courses, instruction in English essay writing is overlooked. Previous studies reported that English teachers in China emphasized English reading much more while neglecting English writing. Furthermore, the College English curriculum and syllabus combine reading and writing, and there are no specially designed English writing courses (Sang, 2017). Students have few channels to discuss their difficulties in writing due to limited class time, and because they usually cannot receive timely feedback and correction, their enthusiasm for English writing is greatly diminished (Yang, 2016). Even when class time is set aside for teaching writing, the content tends to concern how to deal with prompts in the College English Test (CET). This national, high-stakes English test examines the English proficiency of undergraduate students in China and ensures that they reach the levels specified in the National College English Teaching Syllabus (NCETS) (Roach, 2018). Because of its importance, writing instruction in class is limited to checking syntax and grammar (Wang, F & Wang, S, 2012).
Because of globalization, the number of EFL learners in China has boomed, and teaching English has been given prominence accordingly.
However, the increase in student numbers in each class has created a series of problems for current EFL teaching in Chinese universities, such as the high demand for English teachers (Rao & Lei, 2014). Evaluating English writing is a time-consuming, challenging, and burdensome task, shaped by English teachers' own writing proficiency and their personal beliefs and practices in providing feedback (Yu, 2021).
Normally, a student's practice essay is 200 to 250 words, and teachers are required to evaluate it with a rubric similar to the CET's, covering standpoint, content, structure, language use, vocabulary, and grammar. The excessive time needed to evaluate large numbers of essays easily leads to teacher burnout, leaving teachers hardly able to offer instant feedback (Alharbi, 2019). Teachers often give an overall score for each essay with little detailed feedback on vocabulary, grammar, or sentence structure. Upon receiving teacher feedback, students tend to remain passive about improving their writing, as there is usually no requirement to redraft (Lee, 2014).
A solution to the challenges discussed above has been the use of automated feedback and scoring. The automated essay scoring (AES) system is an online writing analysis tool that assesses writing using artificial intelligence according to features such as grammar, usage, mechanics, style, organization, and content. These systems were initially designed for writers in English-speaking countries and have since been adopted in English language education (Liu & Kunnan, 2015). Previous studies have tended to examine the accuracy and validity of these systems; however, few have focused on the effectiveness of automated feedback in improving language learners' writing performance (Geng & Razali, 2020). In China, due to the large population of EFL learners, the number of faculty using AES to provide writing feedback has been increasing. The most extensively used and examined AES systems in China are iWrite and Pigai, and studies found that these systems produced mixed outcomes in evaluating writing (Jiang et al., 2020; Koltovskaia, 2020; Li et al., 2015). A recently developed AES system, Aim Writing, claims to offer proficient feedback based on a new model [1] (Ge et al., 2018) and to provide evaluation approximating a professional English teacher's feedback, though it has not been studied in terms of accuracy, validity, or performance in improving writing outcomes.
In this article, the author investigates the efficacy of feedback from Aim Writing for Chinese college EFL students and compares it with the instructor's feedback, as well as examining students' preferences. The goal of this research is to identify the most effective writing feedback model for Chinese EFL students.
[1] The new model contains Fluency Boost Learning and Inference algorithms. Fluency Boost Learning improves a sentence's fluency without changing its original meaning; thus, any sentence pair that satisfies this condition (the fluency boost condition) can be used as a training instance.
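As a rough illustration of the fluency boost condition described above, the toy sketch below filters candidate sentence pairs so that only pairs whose corrected side scores higher on a fluency measure are kept as training instances. This is the author's hypothetical sketch, not MSRA's implementation: `fluency_score` stands in for a learned language-model score, and `toy_fluency` is a deliberately trivial substitute so the example runs on its own.

```python
def build_boost_pairs(sentence_pairs, fluency_score):
    """Keep (original, corrected) pairs that satisfy the fluency
    boost condition: the corrected sentence scores strictly higher."""
    return [
        (orig, corr)
        for orig, corr in sentence_pairs
        if fluency_score(corr) > fluency_score(orig)
    ]

def toy_fluency(sentence):
    # Hypothetical stand-in for a learned fluency model: here we simply
    # penalize one known-wrong token so the example is self-contained.
    return -sentence.lower().split().count("goed")

pairs = [
    ("She goed to school.", "She went to school."),
    ("He went home.", "He went home."),  # no boost: equal scores, dropped
]
print(build_boost_pairs(pairs, toy_fluency))
```

In a real system the scorer would be a language model and the candidate pairs would come from error-correction outputs; the filtering step itself is what the fluency boost condition amounts to.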

Literature Review
Response plays a critical role in learning (Vygotsky, 1978). EFL students need to know the merits and drawbacks of their writing to improve their skills. Feedback, as a response from other sources, is significant for learners.

Teacher feedback
Feedback plays an important role in improving writing proficiency, and teacher feedback is the most common form of writing instruction (Kamberi, 2013). Previous studies on teacher feedback have mainly explored its focus, forms, and efficiency. Earlier studies indicated that teachers mainly focused on language mistakes in students' writing because they viewed writing as a product and tended to see themselves as language teachers rather than writing instructors (Zamel, 1985). Current studies compare two types of feedback focus: focused feedback, targeting repeated grammar errors, and unfocused feedback, addressing errors in general. Eslami (2014) and Farrokhi and Sattarpour (2011) pointed out that focused teacher feedback was more effective because learners can improve their grammar by concentrating on one type of mistake at a time.
Writing instruction has changed in light of insights from research. EFL teachers no longer provide only their own feedback on students' writing. Teacher written feedback now tends to be combined with peer feedback, writing workshops, oral conferences, video feedback, and computer-delivered feedback (Hyland & Hyland, 2006; Mathisen, 2012). However, despite the various emerging sources of feedback, teacher written response is still dominant in most EFL writing classes (Hyland, 2013).
Even though teachers' written feedback has a positive effect on students' writing ability (Razali & Jupri, 2014), researchers began to evaluate its effectiveness in facilitating EFL students' writing. Early studies suggested that much written feedback was of poor quality and focused too heavily on errors (Yoshida, 2008). Current studies reveal discrepancies between teachers' feedback and students' perceptions of correction (Muliyah & Aminatum, 2020; Agbayahoun, 2016). Learners preferred teachers' extended written comments on content and grammar (Chen, 2014).

AES feedback
AES systems, capable of evaluating writing with computer-generated technology, have been an essential part of large-scale writing assessment since 1999 (Dikli & Bleyle, 2014). In research on AES systems, scholars have emphasized three aspects: the impact of AES systems on students' writing proficiency (El Ebyary & Windeatt, 2010; Parra G. & Calero S., 2019), the attitudes of teachers and students toward AES systems (Chen & Cheng, 2008), and the application of AES systems in teaching English writing (Koh, 2017).
Technology-assisted teacher feedback and technology-assisted peer feedback have come under the spotlight (Chen, 2014; Huang, 2016). AES systems are useful tools for providing formative, diagnostic, and summative feedback so that EFL learners can self-correct and self-revise effectively (Wang, F & Wang, S., 2012). However, the accuracy and reliability of these technology-assisted tools have become a central concern in designing hybrid feedback. Although most researchers found consistency between the scores of AES systems and human raters (Wilson & Roscoe, 2020), dissenting opinions about AES systems are increasing, as it is argued that computer software cannot rate a student's writing as humans do (McCurry, 2010).
Researchers have surveyed teachers' and students' attitudes toward AES using questionnaires. Most surveys found that students' attitudes toward AES were positive because AES systems respond to writing promptly and help students monitor their improvement (Zhang, 2020).
Research on AES with regard to EFL reveals some limitations. First, most AES systems assess data from native English-speaking writers in large-scale writing assessments (Attali & Burstein, 2004); studies assessing non-native speakers' writing are fewer (Vajjala, 2017), and research on the application of AES systems in EFL writing pedagogy is insufficient. Second, many studies are conducted to develop the systems rather than to provide writing instruction (Qian et al., 2020). Third, research indicates that the efficiency of the systems is unsatisfactory: in some situations they are unable to give accurate feedback on the content and logical structure of the writing (Zhang & Cai, 2019), so students' writing ability cannot be improved by using the system alone. From teachers' perspectives, on the one hand, they believe the system can provide timely feedback to ease the pressure on teachers; on the other hand, since the system's feedback is based on a large corpus and cannot provide personalized comments, a combination of different forms of feedback is suggested for future writing teaching (Elola & Oskoz, 2016).
Current studies on AES systems find that most systems are developed by institutions and companies in the United States, and many are exclusive to certain institutions; as a result, many systems are inaccessible in China. In recent decades, AES systems have boomed in China. Pigai and iWrite are the two most extensively used systems locally, and their performance and validity have been studied thoroughly. Studies showed that both systems had shortcomings in scoring and in providing feedback for learners (Qian et al., 2019; Yan, 2019; Wu, 2020). Aim Writing is another new AES system, developed by Microsoft Research Asia (MSRA) and launched in China in late 2019. However, existing studies have paid little attention to this system, especially its validity and efficacy in evaluating EFL students' writing. To fill this research gap, this study investigates the performance of Aim Writing across three writing practices among non-English majors at a college in China.

The comparison between teacher's feedback and AES feedback
Recent EFL writing research on feedback has examined the efficacy of different feedback forms. Scholars have compared the merits and drawbacks of teacher feedback, peer feedback, and AES system feedback (Niu et al., 2021). To determine which feedback is more effective in promoting EFL learners' writing ability, scholars have begun to use hybrid interventions that exploit the merits of each form.
One focus of the comparison, used to test the effectiveness of AES, is the correlation between AES and human raters. Previous studies have shown mixed outcomes. Though some studies found a moderate or high correlation between human raters' scores and AES systems' scores (Almusharraf & Alotaibi, 2022), other researchers found a weak correlation (Huang, 2014).
Another aspect of the comparison is which feedback is more helpful to students' writing. Studies show large discrepancies in the effects of the two feedback types (Dikli & Bleyle, 2014), so instructors' awareness of students' varied needs should be raised.

Methodology

Context
The purpose of this research is to compare essay feedback from a computer-based automated feedback system with the instructor's direct feedback. The research questions that guided this study are the following:
1) What is the relationship between Aim Writing and the course instructor's scores?
2) What are the differences between the feedback from Aim Writing and the instructors?
3) According to students, what is the preferred model for receiving feedback on writing?

This research took place at a small private college in an urban area on the eastern coast of China. The College English course was designed for four academic hours per week over sixteen weeks. According to the syllabus, the course instructor should devote two academic hours to reading and writing and the rest to listening and speaking; there was no separate course for English writing. However, this seemingly reasonable syllabus was encumbered by intensive grammar and text lecturing. Course instructors had to spend more than two academic hours demonstrating complex sentence analysis and translating obscure English sentences into Chinese so that as many students as possible understood the reading materials, leaving limited time to illustrate basic writing skills in class. In most circumstances, the course instructor left a writing prompt for students at the end of a class and graded the essays within weeks; the teacher scarcely spared a proper period of time to discuss the problems in the writing. Only when the CET approached would some teachers give abstract tips on essay structure and content. For writing instruction, teachers usually listed the essay structure for students and had them memorize the positions of key elements such as the thesis, topic sentences, and concluding sentences, without vivid examples. Most students tended to seek sample essays online and memorize them.

About Aim Writing
Aim Writing is a new AES system developed by Microsoft Research Asia (MSRA) and launched in China in late 2019. MSRA is one of the world's leading computer infrastructure and application research institutions, dedicated to advancing computer science in general. Aim Writing claims to offer proficient feedback, approximating English teachers' feedback, based on Fluency Boost Learning and Inference algorithms. Built on a natural language processing system, it adopts the fluency boost learning and inference mechanism, a pre-trained language model, and a partial masking text strategy to improve validity in fluency, accuracy in scoring, and vocabulary diversity. It currently supports eight common types of English tests in China: elementary, secondary, college entrance, College English Test Band-4 (CET-4) and Band-6, postgraduate, TOEFL, and IELTS. In each test mode, the system gives feedback according to the specific scoring criteria and writing requirements of that test.

Participants
From a class of 30, ten students volunteered to take part in the research. Judging from their scores on the English test of the College Entrance Exam and their final scores in the previous two sessions of College English courses, their English levels differed: three had a good mastery of English, five were at an intermediate level, and two struggled with English learning. Apart from the College English course, the students were not enrolled in any extra English courses.

Methods
This study employed a mixed methods approach using quantitative description and qualitative semi-structured interviews. All students in the class received three writing assignments distributed across one semester. The writing prompts were designed according to the reading materials in the course textbook, New Horizon College English, published by the Foreign Language Teaching and Research Press, China's largest university press and largest foreign-language publishing institution. The prompts were a narrative essay, a biographical narrative essay, and an argumentative essay. The instructor's rubric for grading the essays was adopted from the College English Test (CET).
All students were required to complete the three writing practices, and they received the course instructor's feedback under the rubric. In addition, the ten students who participated in the study uploaded their writing to Aim Writing for extra online feedback after submitting it to the instructor. Scores from both the course instructor and Aim Writing were collected. At the end of the semester, the participating students were invited to a semi-structured interview about their opinions on the two forms of feedback.

Data Collection and Analysis
Written approval to conduct the study was obtained from the university Institutional Review Board (IRB).
In order to guarantee fair scoring throughout the semester, the scoring rubric was created before the semester began, based on the scoring scale of the CET-4 writing section. All scores from the course instructor were collected. Participants' Aim Writing data were extracted to capture their writing performance, including the general comments from Aim Writing and the score for each piece of writing. Since Aim Writing uses a percentile system for scoring, the instructor also converted the rubric scores into percentiles for comparison.
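The score conversion mentioned above is a simple rescaling. The snippet below sketches one plausible version, assuming a 15-point CET-style writing rubric; the actual rubric maximum is not stated in the text, so `rubric_max` is an assumption for illustration only.

```python
def to_percent(raw_score, rubric_max=15):
    """Rescale a rubric score to a 0-100 scale so it can be compared
    directly with Aim Writing's percentile-style scores."""
    if not 0 <= raw_score <= rubric_max:
        raise ValueError("score outside rubric range")
    return raw_score / rubric_max * 100

print(to_percent(13))  # 13 points out of an assumed 15-point rubric
```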
Besides the data from Aim Writing, interviews were conducted at the end of the semester and transcribed verbatim. The interview questions fell into three sections: 1) perceptions of the feedback from Aim Writing, 2) perceptions of the instructor's feedback, and 3) preferred feedback content and model. All interviews were conducted in the researcher's office on the college campus. The researcher invited each participant to a one-on-one, half-hour interview; the interviews were recorded and conducted under interview protocols given to participants in advance. Since the original interviews were conducted in Chinese, the researcher used back-translation to ensure the information from the participants was consistent in both languages.
Thirty writing samples were collected. Using the CET rubric, the instructor assessed all writing samples, while Aim Writing provided feedback automatically.
Picture 1. Feedback from Aim Writing for one student

Picture 2. The instructor's feedback for the same student

In this study, Pearson correlation and a paired-samples t-test were used to evaluate the agreement between the automated scoring system and the human scores. The degree of agreement between Aim Writing and the human rater is the standard used to evaluate the reliability of the automated scoring system.
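The two statistics named above, Pearson's r and the paired-samples t statistic, can be computed directly from their textbook definitions. The sketch below uses only the Python standard library; the score lists are invented for illustration and are not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def paired_t(x, y):
    """Paired-samples t statistic (df = n - 1) on the score differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    md = sum(d) / n
    sd = math.sqrt(sum((v - md) ** 2 for v in d) / (n - 1))
    return md / (sd / math.sqrt(n))

# Made-up scores for five essays, purely for illustration.
human = [78, 82, 85, 90, 74]
aes = [80, 85, 88, 91, 79]
print(pearson_r(human, aes))
print(paired_t(aes, human))
```

In practice a statistics package (e.g. SciPy's `pearsonr` and `ttest_rel`) would also return the p-values reported in the study.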
The second and third research questions addressed participants' perceptions of feedback from both sources and their preferred feedback model. The interviews were first transcribed and coded with in vivo codes.
After the first round of coding, four themes were abstracted from the codes. Table 2 shows the themes, categories, and example codes. Each category and its corresponding codes are analyzed by providing original interview extracts. To protect the privacy of the participants, pseudonyms (S01, S02…) are used.

Results And Discussion
Descriptive statistics for the scoring by the automated scoring system Aim Writing and by the human rater are presented in Table 3. The average score from Aim Writing is 86.60, while that of the human rater is 84.87; the two averages are close, with Aim Writing's slightly higher (Table 3). The scores rated by Aim Writing showed only a moderate correlation with those rated by the human rater, r = .58, p < .001 (see Table 5). That is, Aim Writing's rating criteria are not fully consistent with the human rater's.
There was a significant difference between scores rated by Aim Writing and by the human rater, t = -2.26, df = 29, p < .05. Aim Writing tended to give higher scores than the human rater. From the quantitative results, even though the grading criteria for the human rater and Aim Writing were the same, Aim Writing's scores were higher. In Aim Writing's feedback, all corrections concerned grammar, vocabulary, and sentence structure, while the teacher's feedback covered not only those aspects but also the evaluation of content. Aim Writing's higher scores might therefore result from this different emphasis.
Teachers using the AES systems as a tool to evaluate students' writing need to pay attention to the corrections in order to have a comprehensive understanding of the system's bias.

Time efficiency
The timing of feedback is a controversial topic among researchers. Some believe that immediate feedback prevents errors from being encoded into memory (Lee et al., 2013), while others argue that delayed feedback reduces proactive interference so that corrective information can be encoded without interference from the initial error (Ravand & Rasekh, 2011). For EFL learners' writing tasks, written feedback provided in a timely manner greatly influenced student learning (Basey et al., 2014). The AES system can provide immediate and continual feedback on essay content based on statistical techniques.
The participants expressed their reflections on when they received feedback from Aim Writing and from the instructor. Aim Writing provided instant feedback once users submitted their essays on the input page. However, participants noted that they usually did not receive the instructor's feedback until a week later. Several students pointed out that the timing of feedback affected their willingness to revise their essays. For example:

I think feedback from the instructor was slower than Aim Writing. Usually, I wouldn't want to revise my essay if I receive feedback after more than one week. I think the teacher's feedback should be delivered within two days. (S04)

I can get immediate feedback from Aim Writing so that I know what the errors are in my own essay. It is a good experience. I know my mistakes and I can correct them right on the spot. (S06)

Communicative competence
Communicative competence comprises the language user's grammatical knowledge of syntax, morphology, and phonology, together with social knowledge about how and when to use utterances appropriately. Peter and Chomsky (1968) referred to competence as the "linguistic system" that the language user has internalized for the perception and production of speech. Savignon (1983) proposed a communicative competence model consisting of grammatical competence, discourse competence, socio-cultural competence, and strategic competence for guiding language learning.
Aim Writing contributed to the improvement in grammatical competence of English language learners, especially the lower-level learners.It can provide appropriate vocabulary choices for students in the context.
I feel more confident in my grammar because the grammar mistakes Aim Writing pointed out were ones I usually ignored. After I paid special attention to them, my grammar was better. (S03)

Serving as an instant grammar-correction tool, Aim Writing can largely enhance students' grammatical knowledge and make students reflect on their mistakes so as to avoid repeating them in the future.
The teacher's written feedback also paid attention to this area. However, only students with higher-level English took note of the teacher's instruction in grammar.
I agree more with the teacher's feedback on word choice because she considered the context and encouraged me to use the words and phrases we had newly learned. I can remember them after repeated practice. Aim Writing's suggestions were useful, but the words it offered were sometimes hard, and I couldn't remember them several days later. (S04)

Both Aim Writing's and the teacher's feedback on communicative competence were recognized by the participants, but their effectiveness differed according to students' English ability.

Feedback focus
The participants stated the differences in feedback focus between Aim Writing and the instructor. They pointed out that Aim Writing mostly presented corrective feedback on grammatical errors, including word choice, tenses, and pronouns, which was helpful not only for increasing the clarity of the essay but also for improving their self-efficacy in writing. The feedback could indicate the error as well as provide a corrected version for users to consider:

Aim Writing clearly pointed out which part was missing in the sentence and gave suggestions on adding specific words. (S02)

Aim Writing could suggest I use an alternative word to be more accurate, and when I wrote a similar sentence in another situation, I could still remember the suggested word. (S03)

Aim Writing helped me avoid low-level grammatical mistakes. (S10)
The participants noted that the instructor's feedback contained grammar corrections but focused more on the organization of arguments. Researchers have found that Chinese teachers show a stronger focus on correcting grammar and vocabulary use (Cheng et al., 2021). However, researchers tend to agree that the strategy and influence of teacher feedback are context-specific.
The teacher was not able to point out every grammatical mistake in my essay but could provide suggestions on the arguments. For example, one piece of feedback was to add an example to support the thesis statement. Aim Writing, however, wouldn't tell me what kind of content to add, and its comment on the writing was abstract; that is to say, I didn't know how to further enrich my content based on the comment. (S01)

The instructor's feedback focused more on the ideas we presented. The teacher usually offered suggestions on my essay structure, posed questions to me, and encouraged me to think about the logic between sentences and paragraphs. I could talk to my teacher about my thoughts on revision. Aim Writing did not have these functions. (S04)

The instructor could circle some grammar mistakes in my essay, but maybe due to fatigue in grading, she could not be as efficient as Aim Writing. There were cases where she did not point out spelling mistakes and inappropriate word use. (S07)

The findings suggest that feedback from Aim Writing can promote students' writing proficiency, especially in grammar and vocabulary use, and that intermediate and introductory-level students improved more with its help. However, Aim Writing, though it generally performs well in language correction, can mistakenly flag correct expressions. Future AES systems should expand their corpora as English expressions develop in order to improve reliability. Furthermore, the human rater plays an important role in providing feedback to individuals. Aim Writing focuses on language-level correction, while the human rater can also provide suggestions on the organization of structure and arguments in a more individualized way. Therefore, Aim Writing cannot wholly replace teachers; teachers' individualized feedback is vital to students' improvement in writing. Last, to bring AES into fullest play and relieve teachers of labor, a hybrid model of feedback is needed: AES can evaluate grammar and provide preliminary feedback on language, while teachers' feedback can focus on the organization of structure and arguments. Teachers can choose different combinations of the process to meet students' needs in writing instruction.
The development of automated evaluation scoring systems in China still has a long way to go. For future studies, more samples are needed to investigate the validity of AES. In addition, samples at different English levels are needed to explore which kind of feedback is more efficient and effective. Finally, this study targeted college-level EFL students; future research could involve students at other levels.

Abbreviations

EFL: English as a Foreign Language
Declarations

Table 1.
The rubric of essay writing scoring for the instructor

Table 2.
Themes, categories, and codes extracted from the interviews

Table 5.
Paired Samples Test