
Computer or human: a comparative study of automated evaluation scoring and instructors’ feedback on Chinese college students’ English writing

Abstract

The role of internet technology in higher education, and particularly in teaching English as a Foreign Language, is increasingly prominent because of interest in the ways technology can be applied to support students. Automated evaluation scoring (AES) systems are a typical demonstration of the application of network technology in the teaching of English writing. Many writing scoring platforms that provide instant, corrective online feedback on students' writing have been developed and used in China. However, the validity of Aim Writing, a product developed by Microsoft Research Asia that claims to be the best tool for facilitating Chinese EFL learners' writing, has not been tested in previous studies. In this mixed methods study, the feedback and effects of Aim Writing on college students' writing are investigated and compared with the instructor's feedback. The results indicate that Aim Writing's performance is insufficient to support all students' needs in writing and that colleges should encourage a hybrid model that combines AES and instructor feedback.

Introduction

English essay writing, which requires integrated knowledge of linguistics and content, poses a great challenge for English as a Foreign Language (EFL) learners and teachers. Even though teachers have dedicated effort to writing instruction, EFL learners' performance has not improved, especially in content organization, idea development, and grammatical structure (Chen, 2022). EFL teachers not only need to develop students' linguistic and communicative competence but also need to use relevant feedback techniques to respond to students' writing (Alharbi, 2022). Feedback combined with effective instruction has a positive influence on students' writing ability. Evidence shows that teacher feedback has a positive effect on students' engagement and revision practices (Link et al., 2020; Zhang & Hyland, 2022). Moreover, feedback from automated evaluation systems also helps language learners improve their writing proficiency, self-regulation, and self-efficacy (Ekholm et al., 2014; Naghdipour, 2022; Nückles et al., 2020).

With the expansion of class sizes in China and the emphasis on grammar drilling in College English courses, instruction in English essay writing is overlooked. Previous studies reported that English teachers in China placed far more emphasis on English reading while neglecting English writing. Furthermore, the curriculum and syllabus of College English combine reading and writing, and there are no separately designed English writing courses (Sang, 2017). Students are not provided with channels to discuss their difficulties in writing because of limited class time. They usually cannot receive timely feedback and correction on their writing, and their enthusiasm for English writing is therefore greatly reduced (Yang, 2016). Even when time is set aside for the teaching of writing in class, the content tends to concern how to deal with prompts in the College English Test (CET). This national, high-stakes English test examines the English proficiency of undergraduate students in China and ensures that Chinese undergraduates reach the English levels specified in the National College English Teaching Syllabus (NCETS) (Roach, 2018). Because of its importance, in-class writing instruction is limited to checking syntax and grammar (Wang & Wang, 2012).

The increasing number of EFL students in each class results in a series of problems in current EFL teaching in Chinese universities, such as the high demand for English teachers (Rao & Lei, 2014). Evaluating English writing is a time-consuming, challenging, and burdensome task, further complicated by English teachers' own writing proficiency and their personal beliefs and practices in providing feedback (Yu, 2021). Teachers are normally required to evaluate writing using a rubric similar to that of the CET-4. The excessive time needed to evaluate a large number of students' essays easily leads to teacher burnout, so that they can hardly offer timely feedback on students' writing (Alharbi, 2019). Teachers often give an overall score without detailed feedback or suggestions on the content. Upon receiving feedback, students often feel little motivation to improve their writing because they remain passive throughout the process and there is usually no requirement for redrafting (Lee, 2014).

A solution to the challenges discussed above has been the use of automated evaluation scoring (AES). AES is characterized as a computer technology's capacity to assess and grade writing (Shermis et al., 2013), and it is based on artificial intelligence drawing on features such as grammar, usage, mechanics, style, organization, and content. Developments have also taken place in content evaluation for formative purposes: learners can not only get a holistic score on their writing but also suggestions on how to improve it (Shermis et al., 2016). However, systems such as the Intelligent Essay Assessor (IEA) and WriteToLearn® were initially designed for native English speakers in English-speaking countries and were later adopted in English language education (Liu & Kunnan, 2015). Furthermore, many of the systems that are popular in other countries, such as Criterion® (the e-rater® engine developed by ETS), PEG™, and MY Access!, are inaccessible in China. Previous studies have tended to examine the accuracy and validity of those systems; however, few studies have focused on the effectiveness of automated feedback in improving language learners' writing performance (Geng & Razali, 2020). In China, due to the large population of EFL learners, the number of faculty using AES to provide writing feedback has been increasing. The most extensively used and examined AES systems in China are iWrite and Pigai, but studies found that those systems produced inconsistent outcomes when evaluating writing (Jiang et al., 2020; Koltovskaia, 2020; Li et al., 2015). A recently developed AES system, Aim Writing, which claims to offer proficient feedback based on a new model (see Note 1) (Ge et al., 2018) and to provide evaluation approximating a professional English teacher's feedback, has not been studied in terms of accuracy, validity, or its effect on writing outcomes.

In this article, the authors investigate the efficacy of feedback from Aim Writing for Chinese college EFL students and compare it with the instructor's feedback, along with students' preferences. The goal of this research is to identify the preferred writing feedback model for Chinese college EFL students.

Literature review

Response plays a critical role in learning (Vygotsky et al., 1978). EFL students need to know the merits and drawbacks of their writing in order to improve their skills. Feedback, as a response from external sources, is therefore significant for learners.

Teacher feedback

Feedback plays an important role in improving writing proficiency, and teacher feedback is the most common form of writing instruction (Kamberi, 2013). Previous studies on teacher feedback have mainly explored its focus, forms, and efficiency. Earlier studies indicated that teachers mainly focused on language mistakes in students' writing because they viewed writing as a product and tended to view themselves as language teachers rather than writing instructors (Zamel, 1985). Ferris (2010) pointed out that written corrective feedback could improve second language learners' writing accuracy, but teachers face a dilemma when choosing the type of feedback. Current studies compare two types of feedback focus: focused feedback targeting repeated grammar errors and unfocused feedback addressing errors in general. Eslami (2014) and Farrokhi and Sattarpour (2011) pointed out that focused teacher feedback was more effective because learners can improve their grammar by concentrating on one type of mistake at a time. For college-level students who are learning English for academic purposes, writing should be more formal and career-oriented; in improving writing accuracy, syntactic complexity, and fluency, comprehensive written corrective feedback has proved more effective than focused feedback (Zhang & Cheng, 2021).

Writing instruction has changed in light of insights from research. EFL teachers no longer provide only their own feedback on students' writing: teacher written feedback now tends to be combined with peer feedback, writing workshops, oral conferences, video feedback, and computer-delivered feedback (Hyland & Hyland, 2006; Mathisen, 2012). Nevertheless, despite these emerging sources of feedback, teacher written response is still dominant in most EFL writing classes (Hyland, 2013).

Researchers have also evaluated the effectiveness of teacher feedback in facilitating EFL students' writing, and the findings are inconclusive. Razali and Jupri (2014) found that teachers' written feedback had a positive effect on students' writing ability. Chen (2014) found that learners preferred teachers' extended written comments on content and grammar for improving writing proficiency. However, some studies suggested that much written feedback was of poor quality and focused too much on errors (Yoshida, 2008). Current studies also reveal discrepancies between teachers' feedback and students' perceptions of correction (Muliyah et al., 2020; Agbayahoun, 2016). Teachers' emphasis on direct corrective feedback, especially on grammar (Cheng et al., 2021a), can send the misleading signal that good writing equals grammatical accuracy, which causes students to neglect other important elements of writing (Lee, 2011).

AES feedback

AES systems, which evaluate writing using computer technology, have been an essential part of large-scale writing assessment since 1999 (Dikli & Bleyle, 2014). Research on AES systems has emphasized three aspects: the impact of AES systems on students' writing proficiency, teachers' and students' attitudes toward AES systems, and the application of AES systems in the teaching of English writing.

In terms of writing proficiency, studies indicated that AES tools improved students' learning process and increased their awareness of grammar mistakes (Parra & Calero, 2019). Han and Sari (2022) compared the writing improvement of two groups of Turkish EFL university students, one receiving combined automated and teacher feedback and the other receiving teacher feedback only. They found that although both groups' analytic writing scores improved, combined automated-teacher feedback was more beneficial than teacher-only feedback in reducing grammatical and mechanical mistakes. However, researchers have also pointed out that AES is more effective in enhancing students' use of vocabulary than their grammar, and that long exposure to AES benefits students more than brief exposure (Ngo et al., 2022).

Teachers' and students' perceptions of AES have not been consistent across studies. Chen and Cheng (2008) contended that the limitations of AES in providing suggestions on coherence and idea development, together with teachers' pedagogical practices, led to students' negative attitudes toward AES. Nevertheless, AES can support teachers in achieving pedagogical objectives, and students' writing motivation improves even though limitations exist (Wilson et al., 2021). Sun and Fan (2022) also found that AES could reduce students' avoidance behavior in writing and adjust their anxiety to an ideal level.

Various AES tools have been applied in English writing practice. Studies showed that only with teachers' scaffolding across several rounds of drafting, revising, and editing can AES play a positive role in improving writing proficiency, and that students should be given chances to be involved in writing cycles guided by AES systems (Nunes et al., 2022).

Technology-assisted teacher feedback and technology-assisted peer feedback have also come under the spotlight (Chen, 2014; Huang, 2016). AES systems were found to be useful tools for providing formative, diagnostic, and summative feedback so that EFL learners could self-correct and self-revise effectively (Wang & Wang, 2012). However, the accuracy and reliability of these technology-assisted tools became a central concern in providing hybrid feedback. Although most researchers found consistency between the scores of AES systems and human raters (Wilson & Roscoe, 2019), skepticism about AES systems is increasing because, it is argued, a computer program cannot rate a student's writing as humans do (McCurry, 2010).

Researchers have also surveyed teachers' and students' attitudes toward AES using questionnaires. Most surveys found that students' attitudes toward AES were positive because AES systems respond to writing promptly and help students monitor their improvement (Zhang, 2020).

Research on AES with regard to EFL reveals some limitations. First, most AES systems assess data from native English-speaking writers in large-scale writing assessments (Attali & Burstein, 2004), and studies on assessing non-native speakers' writing are fewer (Vajjala, 2017); studies of the application of AES systems in EFL writing pedagogy remain insufficient. Second, many studies are conducted in order to develop the systems rather than to provide instruction in writing (Qian et al., 2020). Third, research indicates that the efficiency of the systems is unsatisfactory: there are situations in which they are unable to give accurate feedback on the content and logical structure of the writing (Zhang & Cai, 2019), and in such cases students' writing ability cannot be improved by using the system. From teachers' perspectives, on the one hand, they believe the systems can provide timely feedback to ease their workload; on the other hand, since the feedback from the systems is based on a large corpus and cannot provide personalized comments, a combination of different forms of feedback is suggested for future writing teaching (Elola & Oskoz, 2016).

Current studies on AES systems find that most systems are developed by institutions and companies in the United States, and many are exclusive to certain institutions; as a result, many systems are inaccessible in China. In recent decades, AES systems have proliferated in China. Pigai and iWrite are the two most extensively used systems locally, and their performance and validity have been studied thoroughly. Studies showed that both systems had shortcomings in scoring and in providing feedback for learners (Qian et al., 2019; Wu, 2020; Yan, 2019). Aim Writing is another new AES system, developed by Microsoft Research Asia (MSRA). However, existing studies have not paid much attention to this system, especially its validity and efficacy in evaluating the writing of EFL students. To fill this research gap, this study investigates the performance of Aim Writing across three writing tasks among non-English majors at a college in China.

The comparison between teacher’s feedback and AES feedback

Recent EFL writing research on feedback has looked into the efficacy of different feedback forms. Scholars have compared the merits and drawbacks of teacher feedback, peer feedback, and AES feedback (Niu et al., 2021). To determine which feedback is more effective in promoting EFL learners' writing ability, scholars have begun to use hybrid interventions that exploit the merits of each type of feedback.

One focus of the studies testing the effectiveness of AES is the correlation between AES and human raters. Previous studies demonstrated mixed outcomes: while some studies found a moderate or high correlation between the scores of human raters and AES systems (Almusharraf & Alotaibi, 2022), many researchers found only a weak correlation (Huang, 2014).

Another aspect of the comparison is which feedback is more helpful to students' writing. Studies show large discrepancies in the effects of the two feedback types (Dikli & Bleyle, 2014), so instructors' awareness of students' varied needs should be raised.

Methodology

Context

The purpose of this research is to compare essay feedback from an automated, computer-based feedback system with direct feedback from a teacher. The research questions that guided this study are the following.

1. To what extent are Aim Writing's scores associated with the course instructor's?

2. Are there any differences between Aim Writing feedback and teacher feedback, and if so, what are they?

3. According to participants, what is the preferred model of writing feedback?

This research took place at a small private college located in an urban area on the eastern coast of China. The College English course was scheduled for four academic hours per week over sixteen weeks. According to the syllabus, the course instructor should devote two academic hours to reading and writing and the rest to listening and speaking; there was no separate course set aside for English writing. However, this seemingly reasonable syllabus was encumbered by intensive grammar and text lecturing. Course instructors had to spend more than two academic hours demonstrating complex sentence analysis and translating obscure English sentences into Chinese so that as many students as possible could understand the reading materials, leaving limited time to illustrate basic writing skills in class. In most cases, the course instructor assigned a writing prompt at the end of a class and graded the essays within a few weeks, rarely sparing a proper period of time to discuss the problems in the writing. Only when the CET approached would some teachers give general tips on essay structure and content. For writing instruction, teachers usually list the essay structure and have students memorize the positions of important elements such as the thesis, topic sentences, and concluding sentences, without vivid examples. Most students therefore tend to seek sample essays online and memorize them.

About Aim Writing

Aim Writing is a new AES system launched in late 2019 in China by Microsoft Research Asia (MSRA). MSRA is one of the world's leading computer infrastructure and application research institutions, dedicated to advancing computer science in general ("MSRA", n.d.). Aim Writing is a free online writing feedback web page that not only provides instant scores on writing but also comments under three domains: vocabulary (accuracy, variety, and complexity), sentence pattern (clarity, complexity, and fluency), and discourse structure (coherence). It is also easy for Chinese college students to record and view their writing and results on their phones because the system can be linked to WeChat, a widely used instant messaging and social media application in China. It currently supports eight common types of English tests in China: elementary, secondary, college entrance, College English Test Band-4 (CET-4) and Band-6, postgraduate, TOEFL, and IELTS. In each test mode, the system gives feedback according to the specific scoring criteria and writing requirements of that test.

Modern AES systems incorporate linguistic features such as length-based features (e.g., number of words and sentences), readability scores (e.g., Flesch-Kincaid readability), syntactic features (e.g., sentence structure complexity), semantic features (e.g., word information score), and discourse structure (e.g., argument position) to evaluate the quality of writing (Ke & Ng, 2019). They adopt machine learning algorithms and deep learning approaches to detect these features, but research shows that their performance on complex linguistic and cognitive characteristics is still far from human level (Hussein et al., 2019).
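To make such surface features concrete, the short Python sketch below computes a few of them (word and sentence counts, average sentence length, and a Flesch-Kincaid grade estimate) for an essay string. It is a minimal, generic illustration of AES-style feature extraction, not Aim Writing's actual pipeline; the regular expressions and the syllable heuristic are simplifying assumptions.

```python
import re

def surface_features(essay: str) -> dict:
    """Toy length-based and readability features of the kind AES engines use;
    real systems add syntactic, semantic, and discourse features."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay)
    # Crude syllable estimate: each run of vowels in a word counts as one syllable.
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w))) for w in words)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    # Flesch-Kincaid grade level from average sentence length and syllables per word.
    fk_grade = 0.39 * (n_words / n_sent) + 11.8 * (syllables / n_words) - 15.59
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_sentence_length": round(n_words / n_sent, 2),
        "flesch_kincaid_grade": round(fk_grade, 2),
    }

print(surface_features("The essay is short. It has two simple sentences."))
```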

Romano (2019) pointed out that Chinese English learners tend to misuse tense agreement extensively. Aim Writing claims to diagnose the linguistic problems in Chinese EFL learners' writing based on Fluency Boost Learning and Inference algorithms, producing feedback that approximates a human English teacher's. The Fluency Boost Learning and Inference algorithms are built on three fluency boost learning strategies that incorporate large corpus data; the data sets include not only correct native expressions but also error samples. Aim Writing is also based on a natural language processing system. By adopting the fluency boost learning and inference mechanism, a pre-trained language model, and a partial masking text strategy, a sentence submitted to Aim Writing can be edited through multiple rounds, which improves fluency, scoring accuracy, and vocabulary diversity (Ge et al., 2018).
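As a rough illustration of the fluency boost idea (a sentence pair can serve as an extra training instance when the corrected sentence is more fluent than the original; see Note 1), the sketch below scores fluency with a general-purpose language model through the Hugging Face transformers library. The inverse-cross-entropy fluency score and the choice of GPT-2 are assumptions made for illustration, not details of Aim Writing's implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def fluency(sentence: str) -> float:
    # Per-token cross-entropy under the language model; lower entropy = more fluent.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss.item()
    return 1.0 / (1.0 + loss)  # assumed fluency score in (0, 1]

def satisfies_boost_condition(source: str, candidate: str) -> bool:
    # "Fluency boost condition": the candidate rewrite is more fluent than the source,
    # so (source -> candidate) could be used as an additional training pair.
    return fluency(candidate) > fluency(source)

print(satisfies_boost_condition("She go to school yesterday.",
                                "She went to school yesterday."))
```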

Participants

From a class of thirty, ten students volunteered to take part in the research. Judging from their scores on the English section of the College Entrance Examination, their final scores in the previous two College English courses, and their CET-4 scores, the participants' English levels differed. According to the proficiency level descriptions in China's Standards of English Language Ability (National Education Examinations Authority, 2018), three students with good English were at Level Seven, five intermediate students were at Level Five, and two students who struggled with English were at Level Four (see "Appendix 1" for further description). These levels have been aligned with the TOEFL iBT test (National Education Examinations Authority, 2019, Table 1), and the participants' estimated TOEFL iBT levels are listed in Table 1. Apart from the College English course, the students were not enrolled in any other English courses.

Table 1 TOEFL iBT test scores' alignment to China's Standards of English Language Ability (CSE)

Methods

This study employed a mixed methods approach. To ensure the validity and reliability of the research, data triangulation was adopted (Bryman, 2004). Specifically, Pearson correlations and a paired samples t-test were conducted in SPSS to answer the first research question, and semi-structured interviews after the semester were undertaken to answer research questions two and three. The two methods were used sequentially, with the quantitative analysis carried out before the interviews.

Data collection and analysis

Written approval to conduct the study was obtained from the university Institutional Review Board (IRB). To guarantee fair scoring throughout the semester, the scoring rubric was created before the semester began, based on the scoring scale of the writing section of the CET-4. All scores from the course instructor were collected.

All students in the class received three writing assignments distributed across one semester. The writing prompts were designed according to the reading materials in the course textbook, New Horizon College English, published by the Foreign Language Teaching and Research Press, China's largest university press and its largest foreign-language publishing institution. The prompts called for a narrative essay, a biographical narrative essay, and an argumentative essay. The rubric the course instructor used to grade these essays was adapted from the College English Test (CET).

All students received the course instructor's feedback under the rubric (see Table 2). In addition, the ten students who participated in the study uploaded their essays to Aim Writing for extra online feedback after submitting them to the instructor. Both the scores from the course instructor and those from Aim Writing were collected. At the end of the semester, the participating students were invited to a semi-structured interview about their opinions on the two forms of feedback.

Table 2 The rubric used by the instructor to score essay writing

Participants' Aim Writing data were extracted to capture their writing performance, including the general comments from Aim Writing and the score for each essay. Since Aim Writing scores on a percentile (100-point) scale, the instructor also converted her scores to the same scale for comparison.
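For example, if the instructor's CET-style rubric yielded raw scores out of 15 (a hypothetical maximum; the raw scale is not stated here), the conversion to the 100-point scale would be a simple rescaling, as in this sketch:

```python
def to_percent(raw_score: float, rubric_max: float = 15.0) -> float:
    """Rescale a rubric score onto a 100-point scale; rubric_max is a hypothetical value."""
    return round(raw_score / rubric_max * 100, 2)

print(to_percent(13.0))  # -> 86.67 on the 100-point scale
```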

Besides the data from Aim Writing, interviews were conducted at the end of the semester and transcribed verbatim. The interview questions fell into three sections: (1) perceptions of the feedback from Aim Writing, (2) perceptions of the instructor's feedback, and (3) preferred feedback content and model. All interviews were conducted in the researcher's office on the college campus, with each participant invited to a one-on-one, half-hour interview. The interviews were recorded and conducted under interview protocols given to participants in advance. Since the original interviews were conducted in Chinese, the researcher used back-translation to ensure the information from the participants was consistent across both languages.

Thirty writing samples were collected in this study. Using the CET rubric, the instructor assessed all writing samples in Microsoft Word (see Fig. 1), marking corrections with track changes and annotations. Aim Writing provided feedback automatically after a participant uploaded the essay to the web page (see Fig. 2).

Fig. 1 The instructor's feedback for one student

Fig. 2 Feedback from Aim Writing for the same student

In this study, a Pearson correlation and a paired samples t-test were used to evaluate the agreement between the automated scoring system and the human scores in order to answer the first research question. The reliability of the automated scoring system was judged by the degree of agreement between Aim Writing and the human rater.
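The same analysis can be reproduced outside SPSS. The Python sketch below runs a Pearson correlation and a paired-samples t-test with SciPy on placeholder score pairs; the arrays are invented for illustration and are not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical instructor and Aim Writing scores for the same essays (100-point scale).
instructor = np.array([82, 78, 90, 85, 88, 76, 80, 92, 84, 79])
aim_writing = np.array([85, 80, 91, 88, 90, 79, 83, 93, 86, 82])

r, p_r = stats.pearsonr(instructor, aim_writing)   # agreement between the two raters
t, p_t = stats.ttest_rel(instructor, aim_writing)  # paired-samples t-test on the difference

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Paired t = {t:.2f}, df = {len(instructor) - 1} (p = {p_t:.3f})")
```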

The second and third research questions addressed participants' perceptions of feedback from both sources and their preferred feedback model. The interviews were conducted after the semester, when participants had completed all assignments and tests, and were recorded with the participants' consent. All recordings were first transcribed and coded with In Vivo codes. After the first round of coding, four themes were abstracted from the codes. Table 3 shows the themes, categories, and example codes. Each category and its corresponding codes are analyzed below with reference to the original interview extracts. To protect participants' privacy, pseudonyms (S01, S02, …) are used.

Table 3 Themes, categories, and codes extracted from the interviews

Results and discussion

Descriptive statistics for the scores of the automated scoring system Aim Writing and the human rater are presented in Table 4. The average score from Aim Writing is 86.60, while that of the human rater is 84.87. The two averages are close, with the automated scoring system's slightly higher than the human rater's (see Table 4).

Table 4 Summary of Scores from the Instructor and Aim Writing

Paired bivariate correlations (see Table 5) show that the scores rated by Aim Writing had a medium correlation with those rated by the human rater, r = 0.58, p < 0.001. That is, Aim Writing's rating criteria are moderately correlated with the human rater's.

Table 5 Pearson correlations

There was a significant difference between the scores rated by Aim Writing and by the human rater, t = −2.26, df = 29, p < 0.05. Aim Writing tended to give higher scores than the human rater (see Table 6).

Table 6 Paired Samples T-test

The quantitative results therefore show that, even though the human rater and Aim Writing used the same grading rubric, Aim Writing's scores were higher than the human rater's.

Qualitative responses

Time efficiency

The timing of feedback is a controversial topic among researchers. Some believe that immediate feedback is a means to prevent errors that will be encoded into memory (Lee et al., 2013), while others argue that delayed feedback reduces proactive interference so that the correction information can be encoded with no interference by the initial error (Ravand & Rasekh, 2011). In terms of writing tasks for EFL learners, written feedback provided in a timely manner greatly influenced student learning (Basey et al., 2014). The AES system can provide immediate and continual feedback on essay content based on statistical techniques.

The participants reflected on the time it took to receive feedback from Aim Writing and from the instructor. Aim Writing provided instant feedback once users submitted their essays on the input page, whereas participants admitted that they usually did not receive the instructor's feedback until after a week. Several students pointed out that the timing of feedback affected their willingness to revise their essays. For example:

I think feedback from the instructor was slower than Aim Writing. Usually, I wouldn’t want to revise my essay if I receive feedback for more than one week. I think the teacher’s feedback should be delivered within two days. (S04)

I can get immediate feedback from Aim Writing so that I know what the errors are in my own essay. It is a good experience. I know my mistakes and I can correct them right on the spot. (S06)

Communicative competence

Communicative competence comprises the language user's grammatical knowledge of syntax, morphology, and phonology, together with the social knowledge of how and when to use utterances appropriately. Peter and Chomsky (1968) referred to competence as the "linguistic system" that the language user has internalized for the perception and production of speech. Savignon (1983) proposed a communicative competence model consisting of grammatical competence, discourse competence, socio-cultural competence, and strategic competence to guide language learning.

Aim Writing contributed to the improvement of English language learners' grammatical competence, especially for lower-level learners. It can also provide contextually appropriate vocabulary choices for students.

I feel more confident in my grammar because the grammar mistakes Aim Writing pointed out were what I usually ignored. After I paid special attention to them, my grammar was better. (S03)

Aim Writing, serving as an instant grammar correction tool, can greatly enhance students' grammatical knowledge and prompt them to reflect on their mistakes so as to avoid repeating them in the future.

The teacher's written feedback also addressed this aspect. However, only students with higher-level English recognized the teacher's grammar instruction.

I agree more with teacher’s feedback on the choice of words because she considered the context and encouraged me to use the words and phrases we newly learned. I can remember them after repeated practice. Aim Writing’s suggestions were useful, but the words it offered sometimes were hard, I can’t remember several days later. (S04)

Participants recognized the contributions of both Aim Writing and the teacher's feedback to communicative competence, but the effectiveness differed according to the students' English ability.

Feedback focus

The participants described the differences in feedback focus between Aim Writing and the instructor. They pointed out that Aim Writing mostly presented corrective feedback on grammatical errors, including choice of words, tenses, and pronouns, which was helpful not only in increasing the clarity of the essay but also in improving their self-efficacy in writing. The feedback could indicate the error as well as provide a corrected version for users to consider. One participant noted that "Aim Writing clearly pointed out which part was missed in the sentence and gave suggestions on adding specific words". Another indicated that Aim Writing improved vocabulary variety because "Aim Writing could suggest I use an alternative word to be more accurate, and when I wrote a similar sentence in another situation, I could still remember the suggested word".

The participants pointed out that the instructor's feedback contained grammar corrections but focused more on the organization of arguments. Researchers have found that Chinese teachers show a stronger focus on correcting grammar and vocabulary (Cheng et al., 2021b), though they tend to agree that the strategy and influence of teacher feedback are context-specific.

The teacher was not able to point out every grammatical mistake in my essay but could provide suggestions on the arguments. For example, one of the feedback was adding an example to prove the thesis statement. However, Aim Writing wouldn’t tell me to add which kind of content, and the comment on the writing was abstract, that is to say, I didn’t know how to further enrich my content based on the comment. (S01)

The instructor's feedback focused more on the ideas presented by us. The teacher usually brought out suggestions on my essay structure, posted some questions to me, and encouraged me to think about the logic between sentences and paragraphs. I could talk to my teacher about my thoughts on revision. Aim Writing did not have these functions. (S04)

The instructor could circle out some grammar mistakes in my essay, but maybe due to fatigue in grading, she could not be as efficient as Aim Writing. There were cases in which she did not point out the spelling mistakes and inappropriate word use. (S07)

Previous research indicated that teachers' feedback has an advantage over other kinds of feedback in improving students' language proficiency in grammar as well as in meaning-level issues and content (Ruegg, 2015).

The participants contended that the instructor's feedback was individualized and thus carried more weight in promoting their writing proficiency than Aim Writing's feedback. The teacher's feedback was drawn from an evaluation of the organization, grammar, and content in relation to the prompt, so participants felt they could become more proficient in these aspects, which account for a larger proportion of scores in standardized English tests:

The scores for my three essays were almost the same, and the comments were too. I found a problem with Aim Writing, that is, it cannot judge whether the examples I used or the arguments I stated were appropriate to answer the questions in the prompt. There is no space for me to tell the machine what question I am going to answer or what the prompt for this writing is. (S10)

One concern of the participants was the correctness of the grammar errors pointed out by Aim Writing. The participants argued that grammar correction was the most intuitive evaluation given by the system, and that grammar corrections and vocabulary suggestions were the most helpful. However, the system could not always judge whether the grammar was right or wrong in the specific context of students' writing. Two participants raised concerns about pseudo "grammar errors" in the use of articles and about vocabulary suggestions, which misled their revisions.

Students with poor grammar knowledge tended to accept all the grammar advice given by Aim Writing, whereas those with better grammar noticed that the grammar or vocabulary advice sometimes did not fit the context:

The alternative word suggested by Aim Writing was not appropriate in the context of my essay. I looked it up in the English dictionary, and it was not right. (S08)

I think some feedback on the grammar aspect was of no use to me because I didn’t think there was any mistake in my sentences. Why would I change? I compared it with the instructor’s feedback, and the teacher didn’t mark the same sentence, so I believed I was right. (S02)

Another concern was that the summative comments at the end of the essay were homogeneous and of limited help in revision. Aim Writing provided an overall grade on participants' essays and offered comments on three aspects (vocabulary, sentences, and discourse), but the language of the comments was very abstract, for example, "discourse is not in-depth, and not convinced". Most students could not make any real improvements based on such comments.

I found the comments in each of my essays were almost the same. For example, in the vocabulary part, the comments were “the words used in the writing are not advanced”. I know some of the words I used are very common, but how I can revise is not pointed out by Aim Writing. (S08)

Preferred feedback model

The preferred feedback model for most participants was a combination of feedback from Aim Writing and the instructor; a few preferred only the teacher's feedback.

Participants stated their preferred sequence for receiving feedback. Some would take the teacher's feedback first and then check the AES feedback because they "can feel the encouragement and receive valuable suggestions from the teacher's comments"; others preferred to seek AES feedback first because "it can point out grammar mistakes, and then the teacher can provide feedback on content and structure". Besides these two models, some students valued only the teacher's feedback because "this system is of limited help to me. I would only consider the feedback from Aim Writing and compare both feedback (from Aim Writing and the instructor) in grammar and see which one makes sense to me". Table 7 lists the three feedback models provided by the participants.

Table 7 Feedback models

Participants who preferred the first feedback model were mostly those with high English proficiency. They indicated that teacher feedback was more helpful for communicative competence: after receiving it, they had a deeper understanding of the writing's structure and logic, and their self-efficacy in writing was enhanced. They also contended that the AES system could provide alternative words to enrich their vocabulary variety, so they preferred to use the AES system as a polishing tool.

Most participants advocated that the AES system should come first because they could then revise their grammar mistakes immediately, making their writing more coherent and cohesive in meaning. After this process, the teacher could give feedback based on the grammatically corrected version. In this way, the teacher could allocate more time to the content and offer detailed suggestions rather than spending much time correcting grammar mistakes, and students' needs in writing could be better satisfied.

A few students suggested the third model. They preferred to wait for both sets of feedback, compare the corrections, and select the merits of each source. This feedback model would be effective only if the instructor could provide feedback within three days.

Conclusion

The findings suggest that feedback from Aim Writing can promote students' writing proficiency, especially in grammar and vocabulary use, and that intermediate and lower-level English students improved more with the help of Aim Writing. However, although Aim Writing performs well in language correction, it can mistakenly flag correct expressions in many situations. Future AES systems should expand their corpora to keep pace with developments in English usage in order to improve reliability. Furthermore, the human rater plays an important role in providing feedback to individuals: Aim Writing focuses on language-level correction, while the human rater can also provide suggestions on the organization of structure and arguments in a more individualized way. Therefore, Aim Writing cannot wholly replace the teacher's role; teachers' individualized feedback is vital to students' improvement in writing. Finally, to bring AES into full play and relieve teachers' workload, a hybrid feedback model is needed: AES can evaluate the grammar and provide preliminary feedback on language, and teachers' feedback can focus on the organization of structure and arguments. Teachers can choose different combinations of this process to meet students' needs in writing instruction.

The development of automated evaluation scoring systems in China still has a long way to go. Future studies need larger samples to investigate the validity of AES, as well as samples at different English levels to explore which kind of feedback is more efficient and effective. Finally, since this study targeted college-level EFL students, students at other levels could be involved in future research.

Availability of data and materials

All data are available upon the editors' request, and the corresponding author can provide them.

Notes

1. The new model contains Fluency Boost Learning and Inference algorithms. Fluency Boost Learning is a new approach that improves a sentence's fluency without changing its original meaning; thus, any sentence pair that satisfies this condition (called the fluency boost condition) can be used as a training instance.

Abbreviations

EFL:

English as a foreign language


Acknowledgements

We would like to thank the students who participated in the study. We would also extend our heartfelt thanks to the editors and reviewers of the journal for the helpful comments.

Funding

This study is part of the project "2021 Shanghai University Teachers' Industry-University-Research Practice Plan" and received funding from Shanghai Normal University Tianhua College.

Author information


Contributions

HC designed the study, collected the data, and wrote the first draft of the paper. JP analyzed the data and revised the manuscript. Both authors read and approved the final manuscript.

Authors’ information

Huimei Chen is a doctoral student in the Curriculum and Instruction program at Northern Arizona University and also works at Shanghai Normal University Tianhua College, China.

Jie Pan is a doctoral student in the Curriculum and Instruction program at Northern Arizona University and also works at Shanghai Normal University Tianhua College, China.

Corresponding author

Correspondence to Huimei Chen.

Ethics declarations

Ethics approval and consent to participate

Ethics approval was obtained from the Northern Arizona University (NAU) IRB.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

 

China's Standards of English Language Ability (CSE): overall language ability

CSE 9

• Can accurately and thoroughly understand a large variety of language materials

 

• Can easily use all manner of expressions to engage in in-depth verbal communication with others on all kinds of topics; can express him/herself precisely, naturally, idiomatically, and in a particular style

CSE 8

• Can understand different types of language materials on a variety of topics; can comprehend the message and recognize discourse features and linguistic style

 

• Can skillfully use varied manner of expressions to communicate with others tactfully and effectively on an academic or specialized topic on a variety of occasions; can accurately, appropriately, and fully explain, justify, and comment on a range of related topics; can express him/herself precisely, fluently, coherently, and appropriately

CSE 7

• Can understand language materials on a range of topics, including those related to his/her field of specialization; can accurately identify the theme and key points of the material, objectively assess and comment on its content, and understand its deeper meaning

 

• Can engage in in-depth discussion and exchange with others on a range of related academic and social topics; can effectively describe, clarify, explain, justify, and comment on such matters and express him/herself clearly, appropriately, smoothly, and in a conventional manner

CSE 6

• Can understand language materials on a range of topics (including subjects of a more general nature) and fully grasp their key points and logical relationships; can analyze, determine, and evaluate viewpoints, attitudes, and implicit meanings therein

 

• Can discuss a range of familiar topics in academic and work interactions, effectively present information about, compare, and comment on different ideas, and express his/her own opinions; can express him/herself coherently, appropriately, smoothly, and in keeping with relevant stylistic conventions and the features of a particular register of language

CSE 5

• Can understand language materials on general topics discussed in a variety of situations; can grasp their theme, identify the key points, find out facts, views, and intricate details, and get to know the intentions and attitudes of others

 

• Can communicate, discuss, and negotiate with others on topics such as study and work in familiar situations and express his/her viewpoints and display his/her attitude; can describe, clarify, or explain matters on general topics relatively effectively and express him/herself accurately, coherently, and appropriately

CSE 4

• Can understand language materials on common topics discussed in normal social interactions; can identify main themes and key content, grasp the main facts and viewpoints, and understand the intentions and attitudes of others

 

• Can communicate with others on familiar subjects in familiar situations; can describe the development of an event, describe current situations and related activities, point out the main features of things, and briefly discuss his/her viewpoints; can express him/herself fairly accurately, clearly, and coherently

CSE 3

• Can understand simple everyday language materials, glean specific or key information from them, identify key points, and deduce the intentions of others

 

• Can communicate with others in routine or normal social interactions using simple language; can describe personal experiences and aspirations and clearly present his/her reasons and viewpoints; can express him/herself with a basic level of accuracy, coherence, and fluency

Appendix 2

Interview protocol

1. Do you think the system can provide feedback that you can understand?

2. Do you think the system can provide valuable feedback to help you modify your writing?

3. How do you feel about your improvement in English essay writing?

4. After using the system to revise writing online, do you feel that writing is easier?

5. Do you think you will continue to use the system to revise your writing repeatedly in the future?

6. To what extent does the system help reduce your mistakes in word usage?

7. To what extent does the system help reduce your grammatical errors in writing?

8. To what extent does the system help reduce your writing errors in spelling and punctuation?

9. To what extent does the system help you improve your writing in structure and logic?

10. What function of the website is most helpful to you in improving your writing?

11. What function of the website fails to meet your expectations in improving your writing?

12. What aspect of the instructor's feedback is most helpful to you in improving your writing?

13. What aspect of the instructor's feedback fails to help you in improving your writing?

14. What do you think is the most helpful model of providing feedback?

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Chen, H., Pan, J. Computer or human: a comparative study of automated evaluation scoring and instructors' feedback on Chinese college students' English writing. Asian-Pacific Journal of Second and Foreign Language Education, 7, 34 (2022). https://doi.org/10.1186/s40862-022-00171-4

