Abstract
Language learning has increasingly benefited from Computer-Assisted Language Learning (CALL) technologies, especially since Artificial Intelligence became involved in recent years. Writing, acknowledged as a core component of language learning, is supported by technologies such as Automated Writing Evaluation (AWE) and Automated Essay Scoring (AES), which have developed considerably in both computer science and language education. AWE has enhanced EFL students' writing performance to some extent, but such technology can only provide an evaluation in the form of scores, the majority of which are based on holistic scoring, and is therefore unable to provide comprehensive and detailed content-based feedback. In order to provide not only multiple trait-specific evaluation scores but also detailed writing feedback, we propose MsCAEWL, a computer-assisted EFL writing learning system incorporating neural network models and several semantic-based NLP techniques, which fully meets the requirements of writing feedback theory, i.e., multiple, continuous, timely, clear, and multi-aspect interactive guidance feedback. The results of comparison experiments with AWE baseline models and human raters demonstrate the superiority of the proposed system and its high correlation with human ratings. The independent-sample t-test results of the writing learning experiment further confirm that MsCAEWL effectively assists students' EFL writing learning.
Keywords
Introduction
Computer-Assisted Language Learning (CALL) technologies have profited greatly from the rapid development of Artificial Intelligence and have become increasingly involved in language learning. Learning to write and writing to learn are identified as essential components of ESL/EFL (Zhang, 2013); computer-assisted writing evaluation technology has therefore drawn considerable attention in both computer science and language education.
Writing learning has traditionally been viewed as crucially dependent on writing evaluation or feedback (Lloyd-Jones, 1977). Writing feedback helps improve learning outcomes, draws L2 learners’ attention to the gap between the target language and the interlanguage, and stimulates L2 learners to internalize L2 knowledge. At the same time, teachers can also adjust the teaching method through feedback (Zamel, 1982). Effective feedback is supposed to include multiple modifications and reciprocating processes and to indicate revision instructions covering dimensions such as organization, content, and mechanics (Cohen & Cavalcanti, 1990; Ferris, 1995; Hedgcock & Lefkowitz, 1994).
For many non-native English-speaking countries, EFL/ESL writing is the least gratifying task for both teachers and students, especially in China (Mo, 2012). EFL teachers in Chinese high schools or colleges face 40-80 students in one class, and such a class is only one of the 2–5 classes each teacher handles in a semester; the workload of TEFL teachers is thus abnormally high. Moreover, in China the largest English learning group is non-English majors, and there is no dedicated writing course for this group in high schools and universities. The large class size makes it almost impossible for teachers to assign enough writing tasks, let alone provide detailed evaluations or individual instruction. Most of the writing feedback students receive is vague, general, or inconsistent (Beach & Friedrich, 2006).
Recently, TEFL, and especially the teaching of writing, has benefited remarkably from the rapid development of CALL technologies. Computer-assisted writing evaluation technology, known variously as Automated Writing Evaluation (AWE), Automated Essay Scoring (AES), and computer essay grading, has emerged and flourished since the 1960s. In addition to scoring compositions, such technology can also provide diagnostic feedback concerning content, logic, vocabulary, grammar, spelling, etc., which is personalized, timely, objective, and constant (Li et al., 2015; Polio, 2012). With an AWE or AES system, students can receive timely and detailed writing feedback and freely revise their essays an unlimited number of times, so as to internalize knowledge and improve cognitive and speech abilities. The existing AWE technologies and products can provide effective vocabulary and simple syntax modification suggestions; their feedback on complex syntax errors is, however, less satisfactory (Fang, 2010).
The feedback provided by existing AWE/AES technologies and applications is capable of improving composition mechanics, such as accuracy, word complexity, and average sentence length (Li et al., 2017). Nevertheless, they offer little help with feedback on content, organization, and unity, which does not accord with the requirements of writing feedback in second language acquisition theory. Furthermore, neural-network-based AWE models rely heavily on labeled training data: at present, more than 90% of AWE models use scored corpora from Kaggle ASAP (2012), which may lead to less objective scoring results. In addition, lacking the ability to evaluate content relevance, AWE models are vulnerable to spoofing or adversarial-text attacks (Higgins & Heilman, 2014; Kabra et al., 2021; Parekh et al., 2020).
In this study, following the guidance of writing feedback theory and adopting the analytic scoring method, we propose a Multi-strategy Computer-assisted EFL Writing Learning System (MsCAEWL) that incorporates a variety of semantic-based NLP technologies and models, including deep learning, into writing evaluation in order to provide not only writing evaluation scores but also detailed writing feedback. Its introduction overcomes the limitation of previous AWE/AES technologies, which are powerless to evaluate texts along content-based dimensions such as content and organization. Moreover, it addresses the problem that the improvement in writing ability gained from AWE/AES models is limited by their common use of holistic scoring. The proposed system provides sub-scores and total evaluation scores across multiple traits while outputting instructions for improving and correcting the writing to support students' writing learning.
Our contributions are summarized as follows: 1) Following writing feedback theory, our model provides timely, detailed, accurate, multiple trait-specific analytic evaluation feedback on compositions using multiple indicators, rather than the holistic evaluation commonly used in prior AWE models. 2) A package of Natural Language Processing technologies including neural networks is introduced and adapted to yield multiple trait-specific evaluations of an essay based on its semantics, covering traits such as thesis, fluency, and content. 3) Grammatical Error Correction (GEC) and Grammatical Error Diagnosis (GED) are introduced to provide revision and evaluation feedback. 4) A novel method is introduced to assess the complexity of word and syntax usage and to provide detailed lexical and syntactic usage suggestions in accordance with the level of the writer's language abilities.
Literature Review
On top of scoring compositions, Automated Writing Evaluation (AWE), Automated Essay Scoring (AES), computer essay grading, computer-assisted writing evaluation, and other such technologies can also provide diagnostic feedback on content, logic, vocabulary, grammar, and spelling. Furthermore, AWE/AES technologies are distinguished by their personalization, timeliness, impartiality, and consistency (Li et al., 2015; Polio, 2012). With an AWE or AES system, students can receive timely and detailed writing feedback and freely revise essays with no limit on the number of rounds, thus realizing the requirements and goals of the process-oriented writing approach and achieving the purpose of internalizing knowledge and improving cognitive and speech abilities. By employing AWE technology, students can use automated feedback to improve their linguistic accuracy (Saricaoglu & Bilki, 2021). AWE/AES technologies and their effects have elicited vehement debates. The findings of extensive research and quantitative analyses concerning the effects of AWE/AES on various aspects of writing have demonstrated noteworthy long- and short-term improvements in accuracy, learner autonomy, and interaction (Link et al., 2022; Wang et al., 2013). In grammatical error correction especially, errors of certain categories were shown to decrease greatly when AWE was used (Saricaoglu & Bilki, 2021). Note that such improvements may occur during the later stages of a period of sustained use of AWE for writing assistance (Liao, 2016). The utilization of AWE systems has been shown not only to generate positive effects on students' writing performance, but also to significantly enhance the quality of teacher feedback in several dimensions, including the mode, amount, types, and levels of feedback (Jiang et al., 2020). Moreover, teachers' attitudes toward and methods of using AWE systems are positively associated with the performance of such systems in the context of writing learning, underscoring the importance of teachers' roles in the successful integration of these systems (Li, 2021). Besides, different combinations of multi-approach writing teaching integrating AWE yield varying effects on writing performance (Tang & Rich, 2017).
In industry, a large number of commercial AWE and AES products have emerged, including E-rater, Project Essay Grade (PEG), Writing Roadmap, My Access, Intelligent Essay Assessor (IEA), Criterion, Write to Learn and Summary, pigai.org, Youdao Writing, I-write, Bingo English, etc. The early stage of AES/AWE development was dominated by traditional machine learning technologies based on Bayes' theorem (Rudner & Liang, 2002), linear regression (Phandi et al., 2015), rank preference learning (Chen & He, 2013; Yannakoudakis et al., 2011), reinforcement learning (Wang et al., 2015), etc. In recent years, the accelerated development of deep learning, especially neural network technology, has also greatly benefited the field of writing evaluation, through Recurrent Neural Networks (Cai, 2019), Long Short-Term Memory (Alikaniotis et al., 2016; Jin et al., 2018; Liu et al., 2019; Taghipour & Ng, 2016), Convolutional Neural Networks (CNN) (Dong & Zhang, 2016; Dong et al., 2017; Farag et al., 2017), and BERT-based approaches (Sharma et al., 2021). The introduction of attention mechanisms also significantly improved state-of-the-art AES performance (Dong et al., 2017). Automated Writing Evaluation has likewise benefited from pre-training strategies at the forefront of natural language processing (Cai, 2019; Farag et al., 2017; Mim et al., 2021). Many studies have adopted a variety of neural-network-based solutions, such as generating adversarial samples for learning (Farag et al., 2018; Liu et al., 2019), Multitask Learning (Cummins & Rei, 2018), Self-Supervised Learning (Cao et al., 2020), and graph algorithms (Jiang et al., 2021). Some studies have proposed approaches to address the scarcity of training data for writing evaluation (Li et al., 2020; Ran et al., 2018). There are also scoring models that integrate multiple neural network models and schemes (Beseiso et al., 2021).
Grammatical Error Correction (GEC) and Grammatical Error Detection (GED) form an independent track of NLP tasks capable of automatically correcting or detecting mechanical errors in compositions. The task content of GEC/GED partly overlaps with that of AES/AWE: AWE tasks are always accompanied by a diagnosis of grammatical errors, most of which need to be pointed out in their correct form. However, AWE studies pay little attention to GEC approaches, and only a few have introduced early GEC models into AWE work (Liu et al., 2019). The initial research on GEC and GED mostly adopted N-grams (Xie et al., 2015; Zhang & Wang, 2014), confusion sets (Lin & Chu, 2015; Rozovskaya & Roth, 2010), and language models (Brockett et al., 2006), including BERT (Devlin et al., 2019; Hong et al., 2019; Zhang et al., 2020), to diagnose and correct grammatical errors. However, these solutions hardly solve problems such as word disorder and component omission. The Encoder-Decoder architecture (Ge et al., 2018; Sutskever et al., 2014) has achieved comparable success in GEC by addressing end-to-end, in one pass, the issues that other models fail to handle. It is worth noting that most recent advanced GEC architectures exploit the advantages of multiple models to solve different kinds of problems (Chollampatt & Ng, 2018; Rozovskaya & Roth, 2010; Zhang & Wang, 2014), including Transformer-based approaches (Lichtarge et al., 2019; Zhao et al., 2019).
Writing Feedback and ESL Writing
Zamel defines feedback as teaching information that helps the writing revision process and improves language learning (Zamel, 1982). Writing feedback is regarded as input information about second language writing provided to authors to improve their accuracy of expression (Arndt, 1992), and also as information provided to authors to modify their interlanguage (Keh, 1990). The function of feedback is generally considered to be to confirm, understand and clarify requirements (Ellis, 1994). Modification serves as its theoretical core. Learners use modification feedback as a clue to conduct exploratory and open learning, and gradually form scaffolding instruction.
Good and effective writing feedback not only needs to point out whether the writing is correct but also provides suggestions for modifying or improving performance (Zamel, 1982). Based on the practice and verification of generations of pedagogues and linguists, writing feedback is essential to the success of writing instruction, and a few key requirements should be met to ensure its effectiveness (B. Chen & J. Zhang, 2022).
To be more effective, writing feedback should cover all aspects of writing, including content, organization, grammar, and mechanics (Cohen & Cavalcanti, 1990; Ferris, 1995; Hedgcock & Lefkowitz, 1994). Holistic scoring, in which an overall score is set to reflect the writing performance, is traditional and widely adopted by AWE models and teachers. However, single-trait holistic scoring cannot provide diagnostic information (Cohen, 1994) and is not adequate as ideal writing feedback. By comparison, analytic scoring assigns separate scores along the different dimensions mentioned above to assess writing performance, and can thus provide adequate diagnostic information about strengths or insufficiencies in particular aspects of performance and detailed guidance for the development of students' writing (Bauer, 1981; Klein et al., 1998; Weir, 1988).
Theoretically, analytic scoring therefore better satisfies the requirements of effective writing feedback.
However, in ESL teaching, the writing feedback students receive is mostly vague, global, or inconsistent (Beach & Friedrich, 2006), and in most cases delayed. In China especially, it is difficult for teachers to apply writing feedback in the teaching process: in the context of limited educational resources, teachers cannot be required to provide personalized, continuous, and detailed feedback for every student each time an essay is written or revised.
In accordance with the requirements of writing feedback theory for writing evaluation, this study therefore proposes an easy-to-operate automatic machine writing evaluation method, which aims to enable users to obtain immediate responses and use them to revise their compositions repeatedly, the responses containing specific suggestions on organization, the use of grammar and vocabulary, cohesion, etc.
Methodology
Grading Criteria
Definitions of Evaluation Categories.
According to the table above, in the proposed computer-assisted writing learning system, we set five modules to execute the above evaluation requirements for writing.
The mapping between the evaluation modules of MsCAEWL and the evaluation categories of composition properties.
The proposed computer-assisted EFL writing learning system consists of five main modules which function according to the requirements of the general EFL writing evaluation categories, that is, unity, organization, content, fluency, vocabulary, grammar, and mechanics.
Task Definition
In accordance with the requirements of writing feedback in EFL writing teaching, this paper proposes a novel multi-strategy automatic computer-assisted EFL writing learning system, MsCAEWL. The flow chart of the proposed system is shown in Figure 2.
The flow chart of MsCAEWL.
Essays and prompts are combined as inputs into MsCAEWL. The Thesis, Fluency, and Content modules generate the corresponding scores respectively. In the Complexity module, sub-scores are provided at the sentence and vocabulary levels, with the final Complexity score derived from a combination of the two. In addition to detecting and scoring syntactic and lexical errors, the Correction module also provides suggestions for error correction.
Notations.
In the proposed model, each evaluation module adopts different approaches to modeling and calculation when scoring compositions, and the score ranges of the Thesis, Fluency, Content, and Syntactic Complexity modules vary; thus, functions are implemented to normalize and scale the above sub-evaluation scores. The logistic sigmoid function is adopted to map each sub-score into a common range before scaling.
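As a minimal sketch of this step (the exact coefficients are given in the tables below), each raw module score $s_m$ can be squashed by the sigmoid and then scaled by a module-specific coefficient; the symbol $\alpha_m$ is an illustrative notation rather than the paper's own:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \hat{s}_m = \alpha_m \,\sigma(s_m)$$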
Value Range of the Scoring Modules or Functions.
The Score Assignment and Coefficient Setting.
The Thesis Module
The majority of EFL writing exams contain opinion/discussion writing tasks that essentially come with prompts. The writing prompt that students must respond to in an EFL writing task is a compact set of writing instructions that helps students focus on a certain topic, task, or goal. The writing prompt is the key to achieving Unity.
Quality of Organization concerns whether ideas are arranged logically and accurately. Each part of the essay is supposed to be clearly organized and to flow smoothly; in a well-organized essay, the paragraphs support the thesis of the paper.
However, EFL students often fail to grasp the topic or have problems developing content with regard to the prompt, possibly resulting in digression or in failure to arrange ideas or statements to support the thesis. Because they are incapable of identifying the content of texts, prior AWE models neither attend to this issue nor can address it. Worse still, students tend to cheat such models with fancy expressions and words when using automatic writing evaluation, which is hard to prevent given the current state of the art.
This module is incorporated to check the accordance between the prompt and the main body, as well as the unity of each paragraph; i.e., the Thesis module merges the assessments of the Unity and Organization categories.
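The paper does not spell out the relatedness computation here; the following is an illustrative sketch only, assuming a TF-IDF cosine similarity between the prompt and each paragraph (function and variable names are hypothetical, and the actual module is semantic-based and presumably richer):

```python
# Illustrative sketch only: approximate prompt-paragraph unity checking
# with TF-IDF cosine similarity (not the actual Thesis module).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def thesis_unity_scores(prompt: str, paragraphs: list[str]) -> list[float]:
    """Return a relatedness score in [0, 1] for each paragraph vs. the prompt."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([prompt] + paragraphs)
    # Row 0 is the prompt; the remaining rows are the essay paragraphs.
    sims = cosine_similarity(matrix[0], matrix[1:])
    return sims.flatten().tolist()

# A paragraph with a low score would be flagged as a possible digression.
```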
The Fluency Module
Writing Fluency refers to a student's ability to write with a natural flow and rhythm, utilizing appropriate word patterns, vocabulary, and content. Syntactic fluency refers to the extent to which a writer constructs sentences containing linguistically complex structures (Shapiro, 1999). More specifically, a fluent composition flows smoothly from word to word, phrase to phrase, and sentence to sentence. In Natural Language Processing, N-gram-based metrics serve a similar function, capturing the above requirements for Writing Fluency.
As an automatic evaluation metric for text generation and machine translation, BLEU (Bilingual Evaluation Understudy) measures the closeness of machine output to human references and is well known for its strong correlation with human evaluation of adequacy and fluency. Its calculation is as follows:
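The standard BLEU formulation combines the modified n-gram precisions $p_n$ with a brevity penalty $\mathrm{BP}$ over the candidate length $c$ and the effective reference length $r$:

$$\mathrm{BLEU} = \mathrm{BP}\cdot\exp\Bigl(\sum_{n=1}^{N} w_n \log p_n\Bigr), \qquad \mathrm{BP} = \begin{cases}1 & c > r\\ e^{\,1-r/c} & c \le r\end{cases}$$

where the weights $w_n$ are typically uniform, $w_n = 1/N$.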
BLEU compares the n-gram of the candidate with the reference corpus to count the number of overlaps that are independent of the word positions. This n-gram precision scoring captures two aspects of translation: adequacy and fluency. The longer n-gram overlaps between the candidate and reference account for fluency, which means the generated sentences are well-formed and mature in length and structure compared with the reference.
In this study, BLEU with 4-grams is utilized as the writing fluency evaluation metric. We treat the sentences of a composition as candidate system outputs in machine translation, and a standard corpus of sufficient size is used as the reference. Considering that EFL writing requires standard written English, we choose a news collection as the reference corpus. The Fluency score is the average BLEU score of each sentence of the essay against the news corpus (All The News; see section “Datasets”). The calculation is shown in equation (11).
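As a minimal sketch of this averaging, assuming NLTK's sentence-level BLEU with smoothing and a pre-tokenized reference sample (names and data handling are illustrative, not the system's actual implementation):

```python
# Illustrative sketch: average sentence-level BLEU-4 of an essay against
# a reference news corpus, as a proxy for writing fluency.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def fluency_score(essay: str, reference_sentences: list[list[str]]) -> float:
    """reference_sentences: tokenized sentences sampled from the news corpus.
    (In practice one would compare against a manageable subset, not 143k articles.)"""
    smooth = SmoothingFunction().method1
    scores = []
    for sentence in sent_tokenize(essay):
        candidate = word_tokenize(sentence.lower())
        # 4-gram BLEU with uniform weights, smoothed for short sentences.
        scores.append(sentence_bleu(reference_sentences, candidate,
                                    weights=(0.25, 0.25, 0.25, 0.25),
                                    smoothing_function=smooth))
    return sum(scores) / len(scores) if scores else 0.0
```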
The Content Module
In essence, grading the content of writing is a comprehensive evaluation of whether the thesis and purpose are clear, whether the exposition and development consistently support the viewpoint, whether the structure and the choice and use of words are fluent and graceful, and so on. Content evaluation is the core of human judgment of an essay; thus, it regularly takes the form of holistic scoring.
The AWE/AES models assign holistic scores based on the writing content. More than 90% of AES models are trained and tested on the ASAP (Automated Student Assessment Prize) corpus (see section “Comparison experiments”) and its scoring labels. Based on the ASAP corpus as well, this article establishes an LSTM-based model to assess the writing content. Recurrent Neural Networks (RNNs) have been widely applied in various NLP tasks for their outstanding time-series feature extraction capability. Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) (the formulas are given below) is an improvement that solves problems such as the vanishing gradient of RNNs. LSTM is introduced in this section to extract sequential features from essays with different scores in the training set so as to predict the holistic scores of the essays to be tested.
In an LSTM cell, the hidden state is split into two vectors: the cell state $c_t$, which carries long-term memory, and the hidden state $h_t$, which serves as the output at each time step.
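The standard LSTM cell updates, with forget, input, and output gates $f_t$, $i_t$, $o_t$ and element-wise product $\odot$, are:

$$\begin{aligned} f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f)\\ i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)\\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)\\ \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$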
The Complexity Module
As stated in section “Grading Criteria,” Complexity checks lexical and structural diversity and complexity, i.e., the writer's proficiency in vocabulary and grammar. Two sub-modules are incorporated to achieve this end: Lexical Complexity and Syntactic Complexity.
The Lexical Complexity
The existing AWE research and products give improving feedback on vocabulary choice or usage by listing the possible synonyms of words or phrases in the essays, based merely on dictionary-matching technology. The listed words or phrases do not change according to the learner's level, so most students cannot benefit from such feedback to learn vocabulary and usage that is new to them. Thus, word-list scales based on EFL learners' lexical knowledge are introduced to address this problem. The New JACET 8000 (Committee, 2016) is a scaled list of basic words established by the Japan Association of College English Teachers (JACET). Its 8000 words are divided into eight levels conforming to the lexical profile of Japanese EFL learners, each level containing 1000 words. After the author's vocabulary level is positioned using the word scales, the gap between the author's level and the writing requirements is judged, and corresponding evaluation and suggestions on vocabulary use are given accordingly.
According to the learner's level, a higher vocabulary set and a lower vocabulary set are established. The words or phrases in the composition are tentatively replaced by synonyms from the higher and lower vocabulary sets respectively, using a thesaurus. If a replacement is found in the higher vocabulary set, the author still has room for improvement in the use of that replaceable word or phrase; the system outputs the words from the higher vocabulary set as improvement suggestions and at the same time reduces the Lexical Complexity score. Conversely, if a replacement can only be found in the lower vocabulary set, the system outputs no suggestions and the score is increased.
A novel model for scoring lexical complexity is proposed in this study. We use the ratio of the total number of replaceable edits to the weighted total number of words after deduplication as the basis for the Lexical Complexity score. The calculation is as follows:
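The scoring equation itself belongs to the paper; the sketch below illustrates only the replacement logic described above, assuming WordNet as the thesaurus and two hypothetical level word sets derived from a scale such as the New JACET 8000:

```python
# Illustrative sketch of the higher/lower vocabulary replacement check.
# `higher_vocab` / `lower_vocab` are hypothetical word sets built from a
# scaled list (e.g., the New JACET 8000) relative to the writer's level.
from nltk.corpus import wordnet

def synonyms(word: str) -> set[str]:
    """Collect WordNet lemma names as a simple thesaurus lookup."""
    return {lemma.name().replace("_", " ")
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()} - {word}

def lexical_feedback(tokens: list[str], higher_vocab: set[str],
                     lower_vocab: set[str]) -> tuple[dict, int]:
    suggestions, upgradable = {}, 0
    for word in set(tokens):                      # deduplicated word list
        candidates = synonyms(word)
        higher_hits = candidates & higher_vocab
        if higher_hits:
            # Room for improvement: suggest higher-level alternatives
            # and count this word as a replaceable edit.
            suggestions[word] = sorted(higher_hits)
            upgradable += 1
        # Words replaceable only from the lower set raise the score
        # (no suggestion is output for them).
    return suggestions, upgradable
```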
The Syntactic Complexity
The Syntactic Complexity mainly depends on the diversity of structural patterns and the flexible application of grammar, such as the alternation of active and passive sentences, clauses, participles, etc. From the linguistic point of view, these syntactic phenomena have their own characteristic function words, which are collected into a set, e.g., which, when, copular verbs, adverbs, etc. TextRank is incorporated to calculate the representation of the function words in this set based on lexical relations, i.e., the appropriateness of syntax use as represented by the function-word-centered lexical relationships. The TextRank values based on the function words are calculated for the news corpus (All The News, the Kaggle corpus described in section “Datasets”) and for each composition, and the two are compared: the higher the similarity between them, the better the Syntactic Complexity and the higher the score. Because Syntactic Complexity is calculated against a real-world corpus, there is no risk of being cheated by adversarial samples.
TextRank, an application of graph algorithms, takes the words of a text as graph nodes, constructs a graph model, and then calculates the importance of each word based on the weights between words. That is, the ratio of the weight of the edge from node $V_j$ to node $V_i$ to the total weight of all outgoing edges of $V_j$ determines how much of $V_j$'s score is propagated to $V_i$.
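The standard weighted TextRank update, with damping factor $d$ (typically 0.85), is:

$$WS(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}}\, WS(V_j)$$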
The Syntactic Complexity score is the average TextRank score of each sentence from the essay. The calculation is shown in equation (18).
The Correctness Module
As stated in section “Grading Criteria,” Correctness checks the proper use of vocabulary, mechanics, and grammar, i.e., whether the compositions are written in correct standard English. Two sub-modules are incorporated to achieve this goal: the GED module locates the lexical or syntactic errors, and the GEC module corrects them and outputs the correction suggestions.
The Grammatical Error Diagnosis module
Grammatical Error Detection (GED) can be cast as a sequence tagging task, that is, predicting labels based on sequential features. Although BERT and Transformer techniques are popular in all areas of NLP, the RNN family remains dominant in extracting sequential features. Therefore, this study adopts Bi-directional Long Short-Term Memory networks (Bi-LSTM) to detect grammatical errors.
A large number of experimental studies and applications have proved that the BERT model, pre-trained on a huge multilingual corpus, can markedly improve the effect of various downstream tasks owing to its excellent semantic/syntactic representation ability (Devlin et al., 2019; Herzig et al., 2020; Sun et al., 2019; Zhu et al., 2019). In this study, the pre-trained BERT model is fine-tuned on the training data before the grammatical error detection task is carried out.
After the text enters the BERT model for fine-tuning, a new semantic vector representation is obtained, which is then processed by a Bi-LSTM and a conditional random field (CRF) to finally output the error-type tags. The calculation process is shown in Figure 3.
The diagram of the Grammatical Error Diagnosis model.
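A minimal sketch of this BERT-to-Bi-LSTM-to-CRF pipeline, assuming the Hugging Face transformers library and the pytorch-crf package (model name, dimensions, and tag set are illustrative):

```python
# Illustrative sketch: BERT embeddings -> Bi-LSTM -> CRF for error tagging.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF   # pip install pytorch-crf

class GEDTagger(nn.Module):
    def __init__(self, num_tags: int, hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)   # emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # Fine-tuned BERT provides contextual semantic representations.
        reps = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.proj(self.lstm(reps)[0])
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood under the CRF.
            return -self.crf(emissions, tags, mask=mask)
        # Inference: Viterbi-decoded error-type tag sequences.
        return self.crf.decode(emissions, mask=mask)
```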
The Grammatical Error Correction module
Grammatical Error Correction models with SOTA scores mostly regard Grammatical Error Correction as a machine translation task and often adopt an Encoder-Decoder structure. In this study, a novel Grammatical Error Correction architecture is built, inspired by a previous study (B. Chen & J. Zhang, 2022). As in the Grammatical Error Diagnosis module, the training text obtains a new semantic representation after pre-training via BERT and then enters an Encoder-Decoder model with global attention, where the Encoder involves a uni-directional LSTM and the Decoder takes a Bi-LSTM as its core. The specific architecture is depicted in Figure 4.
The architecture of the Grammatical Error Correction model.
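As a minimal sketch of the global attention step between a decoder state and the encoder outputs (the full Encoder-Decoder wiring and the exact attention variant used by the paper are not reproduced here):

```python
# Illustrative sketch: global (dot-product) attention over encoder outputs.
import torch
import torch.nn.functional as F

def global_attention(decoder_state: torch.Tensor,
                     encoder_outputs: torch.Tensor) -> torch.Tensor:
    """decoder_state: (batch, hidden); encoder_outputs: (batch, src_len, hidden).
    Returns a context vector (batch, hidden) attending over all source steps."""
    # Alignment scores between the decoder state and every encoder step.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)            # (batch, src_len)
    # The context is the attention-weighted sum of encoder outputs.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context
```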
The scoring method of the correctness evaluation
The Grammatical Error Diagnosis module is in charge of identifying and showing the errors to users, while the Grammatical Error Correction module is responsible for correcting the located errors. The located and corrected errors are marked as candidate items for losing credits. The mechanism of penalty credits for grammatical errors is depicted in equations (19) and (20).
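Equations (19) and (20) define the exact mechanics; as a hedged sketch of the general idea only, assuming each confirmed error $e$ of type $k(e)$ deducts a fixed penalty $p_{k(e)}$ from the full Correctness credit $S_{full}$, the deduction could take the form:

$$S_{corr} = \max\Bigl(0,\; S_{full} - \sum_{e \in E} p_{k(e)}\Bigr)$$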
Experiments
Datasets
a) Reference corpus in the model
In the Thesis module, the Fluency module, and the Syntactic Complexity sub-module of the Complexity module, the reference corpus plays an essential role. The quantity and quality of the corpus directly affect the scores, and a deficient corpus would make the scores unreliable, which means a massive and error-free corpus is required. Most EFL writing tasks focus on elaborating or discussing opinions and examine the author's proficiency in standard English; the required style is therefore written rather than spoken language. In view of this, news texts, with their rigorous wording, are adopted as the basic dataset.
The All The News dataset from Kaggle consists of 143,000 news articles covering a wide range of topics from 15 American publications, including the New York Times, Breitbart, CNN, Business Insider, etc. This reference corpus comprises news from 2016 to 2017, with a total size of about 1.2 GB.
b) The scored essay corpus
The Statistics of the ASAP Corpus.
Comparison Experiments
Evaluation metric
Quadratic Weighted Kappa (QWK), the official metric of the Automated Student Assessment Prize and regarded as the major measure of AWE model performance, is introduced as the evaluation metric in this section. It measures the agreement between the predicted score set and the label score set. The QWK value ranges from 0 to 1, indicating agreement from chance level to complete agreement. Here, the human annotator rating set is taken as the labels, and the system outputs serve as the prediction scores.
An N-by-N histogram matrix $O$ is constructed over the essay ratings, such that $O_{ij}$ corresponds to the number of essays that received a rating $i$ from the human annotator and a rating $j$ from the system; an expected matrix $E$ is calculated as the outer product of the two ratings' histogram vectors, normalized so that $E$ and $O$ have the same sum.
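With quadratic disagreement weights, the kappa is:

$$\kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}}, \qquad w_{ij} = \frac{(i-j)^2}{(N-1)^2}$$

In practice this quantity can be computed directly with sklearn.metrics.cohen_kappa_score(y_human, y_system, weights="quadratic").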
Comparison based on holistic scoring
The Descriptions of the Baseline Models.
The ASAP corpus with holistic scores as training labels serves as the experimental dataset in this section. Although more than 10 years have passed since the Kaggle AES competition, the labeled dataset continues to benefit the AWE/AES field, and remarkable models, especially ones incorporating neural networks, have kept emerging long after the competition.
However, two issues do not fit the requirements of our experiment and need to be addressed in pre-processing.
First, the number of human annotators per essay is not consistent: most essays carry two rater scores while some have three. Moreover, given that one set of essays adopts trait scoring, i.e., analytic scoring, while the others use holistic scoring, the resolved score is obtained inconsistently: either the higher rater score is directly selected, or a composite score is calculated by the relevant equations. The score range of each prompt thus differs. In this section, we take the resolved scores as baseline scores or labels regardless of how they were obtained. Before the experiments, the score ranges are normalized into (0, 1).
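A standard min-max scaling accomplishes this, assuming each resolved score $s$ of a prompt with observed range $[s_{min}, s_{max}]$ is mapped by:

$$s' = \frac{s - s_{min}}{s_{max} - s_{min}}$$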
Second, for privacy protection, the competition organizer removed all personal information and replaced the relevant entities with marks such as @ORGANIZATION1, @PERSON1, etc. It is worth noting that the replaced entities often act as vital sentence constituents such as subjects or objects. For both people and machines, this is obviously a huge obstacle to reading and understanding, especially for machine learning methods with semantic-based models as their backbones. Since all the modules of the proposed model are highly semantic-dependent, these marks would affect the performance to varying extents. Hence, we randomly choose words under the marked categories to restore the sentences, such as replacing @ORGANIZATION1 with “bank” and @PERSON1 with “Jackson.”
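A minimal sketch of this restoration step, assuming regex matching over the placeholder marks and hypothetical per-category word lists:

```python
# Illustrative sketch: replace ASAP anonymization marks (@PERSON1, ...)
# with random plausible words so semantic models can parse the sentences.
import random
import re

# Hypothetical per-category substitution lists.
FILLERS = {
    "PERSON": ["Jackson", "Maria", "David"],
    "ORGANIZATION": ["bank", "school", "company"],
    "LOCATION": ["Chicago", "the park", "Ohio"],
}

MARK = re.compile(r"@([A-Z]+)\d+")

def restore_entities(text: str, seed: int = 0) -> str:
    rng = random.Random(seed)   # fixed seed keeps preprocessing reproducible
    def sub(match: re.Match) -> str:
        category = match.group(1)
        return rng.choice(FILLERS.get(category, ["something"]))
    return MARK.sub(sub, text)

print(restore_entities("@PERSON1 works at @ORGANIZATION1."))
```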
The Comparison with the Baseline Models on Holistic Scoring.
The comparison results show that MsCAEWL achieves the top result on five of the eight prompts and the best performance on the average QWK track. On the whole, S6, S2, and S4 achieved relatively satisfactory results on some of the tracks. From their experimental results and model architectures, it can be concluded that semantics-oriented structures built on cutting-edge language models, such as LSTM, BERT, and attention mechanisms, greatly boost the experimental results. Though S6 obtains impressive results on most prompts, two obvious weaknesses occur on prompts 2 and 8, which have the longest average essay length, 350 words. On these two tracks, the top models, including ours, are based on RNN-series models, namely LSTM. That is, RNN architectures retain a superior ability to capture sequential semantic features, especially in long texts, despite the fact that S6 specifically adapts BERT to large-scale text.
Owing to the multiple strategies incorporated alongside the neural networks, the proposed system benefits from the advantages of various technologies and can thus meet different challenges. Its superiority in tackling long texts is obvious: the proposed model dominates in almost all essay sets longer than 150 words, which mainly accounts for its superiority in the average score.
Comparison based on the analytic scoring
In this section, we aim to verify the analytic scoring of the proposed system; hence a dataset with trait-specific scores as labels is essential for the validation experiments. Though the ASAP corpus provides trait-specific scores for part of its sets, two obstacles make it unsuitable for these experiments. Only two prompts, 7 and 8, present analytic scores, and their trait categories differ: the scoring of prompt 7 evaluates Ideas, Organization, Style, and Conventions, while Ideas and Content, Organization, Voice, Word Choice, Sentence Fluency, and Conventions are involved in the assessment of prompt 8. That is, the evaluation rubrics of the two prompts differ, as do the aspects of writing ability the examiners focus on. Moreover, the methods of compositing the total scores are distinct: the final total score of each essay in prompt 7 is simply obtained by adding the two raters' summed trait scores, whereas prompt 8 uses a weighted sum of the two raters' scores on each trait. Since the traits and compositing methods of MsCAEWL are not consistent with either prompt, the corpus cannot serve as the basis of the analytic scoring comparison.
Accordingly, a new validation experiment is set up as follows. A group of college students is asked to write an essay according to a prompt. The essays are then scored by human raters and by MsCAEWL following detailed analytic scoring instructions concretely categorized as Thesis, Fluency, Content, Complexity, and Correctness. The trait-specific and total scores from the human and machine raters are then subjected to statistical analysis to reveal the analytic scoring ability of the proposed computer-assisted writing learning system.
Writing Prompt From TEM 4.
The Statistical Results of Trait and Total Scores From Human Raters and MsCAEWL.
The box-plot diagrams in Figure 5 depict the distribution of the rating scores of MsCAEWL and the human annotators. It can be seen that, overall, the machine rating is consistent with the average rating of the three human raters, which also accords with the QWK results in section “Comparison based on holistic scoring.” That is, MsCAEWL is reliable in scoring in comparison with the overall human level.
Box-plot diagram of score distribution by MsCAEWL and human raters.
It should be noted that the distribution of MsCAEWL's ratings is considerably less volatile than that of the human ratings. More importantly, human ratings vary widely from one another, and it is difficult for humans to maintain consistency within each category. Human raters remain in relative agreement only on Thesis and Correctness, which shows that they find it easier to agree when judging digression and quality of organization; likewise, the review of grammar, spelling, mechanics, and other errors is more objective, so human judges are less likely to disagree on such issues. However, for evaluations that require multi-dimensional consideration, such as Fluency, Content, and Complexity, human ratings are prone to diverge. In short, the scoring of MsCAEWL is more robust.
Validation of effects on writing learning assistance of MsCAEWL
The aforementioned experiments can only compare model performance on the basis of scores. Our system, however, offers not only scores but also feedback such as vocabulary and grammar improvements, error corrections, and so on. Scores serve merely as a reference for validating improvements in writing ability; the major priority of the proposed system is to enhance students' writing skills. Accordingly, in this part, we devised experiments to verify whether MsCAEWL can improve students' writing abilities.
Method
In this section, 70 high school sophomores with comparable scores at a high school in Guizhou province were selected as subjects for a 60-day writing learning experiment. Among them, 35 were randomly assigned to the experimental group and 35 to the control group; there was no significant difference in writing proficiency between the two groups. Both groups received ordinary face-to-face writing instruction from the same teacher during the experiment. The difference was that in the experimental group, MsCAEWL was adopted to assist writing learning: according to the writing feedback from MsCAEWL, the experimental group students revised their essays by themselves. A pre-test was performed before the experiment and a post-test after it for both groups. The performances of the two groups were analyzed by independent-sample t-tests.
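A minimal sketch of this analysis with SciPy (the score arrays below are random placeholders, not the study's data):

```python
# Illustrative sketch: significance tests used in the experiment design.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder score arrays; in the study these are the 35 students' scores.
control_post = rng.normal(70, 8, 35)
experimental_post = rng.normal(75, 8, 35)
experimental_pre = rng.normal(70, 8, 35)

# Independent-sample t-test: control vs. experimental post-test scores.
t_ind, p_ind = stats.ttest_ind(control_post, experimental_post)

# Paired-sample t-test: pre- vs. post-test within the experimental group.
t_rel, p_rel = stats.ttest_rel(experimental_pre, experimental_post)

print(f"independent: t = {t_ind:.3f}, p = {p_ind:.4f}")
print(f"paired:      t = {t_rel:.3f}, p = {p_rel:.4f}")
```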
Participants
Materials
In the writing learning experiments, several essay prompts from former National Matriculation English Tests were adopted as the writing learning and teaching materials. Given the short period of the verification experiment, familiar materials are more conducive to students achieving better writing performance. Moreover, the prompt design of the National Matriculation English Tests is rigorous and the difficulty of each prompt varies only slightly, so they are well suited to the experiment of this study. The prompts were selected randomly as follows:
Procedure
Prompts Adopted in Writing Learning.
Thus, in order to simulate a real writing teaching and learning scene, each instruction session is kept within 15 minutes. The students in the control group receive a holistic score, marks for grammatical and lexical errors, and a general comment, just as in daily teaching. For the evaluation in the control group, the raters are also asked to finish the review of each composition within 10 minutes; in this way, the error marks and comments may be rough, inadequate, or even faulty. Note that only the holistic score given by the teacher is presented to the writer, even though MsCAEWL also reviews each essay and outputs the sub- and overall scores as experimental data, which remain completely blind to the control group.
As for the experimental group, besides the ordinary writing instruction, their essays are evaluated entirely by the system rather than by the teacher. After finishing the writing and evaluation, the students may revise each composition according to MsCAEWL's multiple trait-specific scores and writing feedback, and the effect of each revision can be checked because the revised essay can be scored again immediately. There is no limit on the number of such computer-assisted revision rounds; in this experiment, however, revision did not stop until at least three rounds were completed or the score stopped rising. All writing tests were carried out according to the requirements of the National Matriculation English Test, with a full score of 100 credits.
Results and discussions
a) The Effects of Writing Learning Assistance of MsCAEWL on Overall Performance
Group Statistics of Pre-Test Scores in Experimental Group and Control Group.
Independent-Sample t-Test of Pre-Test Scores in Experimental Group and Control Group.
Group Statistics of Post-Test Scores of Control and Experimental Group.
Independent-Sample t-Test of Post-Test Scores of Control and Experimental Group.
The results of the independent-sample t-tests of the pre-test scores confirmed that there was no statistically significant difference between the two groups before the experiment.
In order to examine whether the writing learning assistance of MsCAEWL had taken effect by the end of the writing instruction in the experimental group and the control group, the post-test scores of the two groups were investigated by independent-sample t-tests after the experiment.
The results in Tables 13 and 14 revealed a statistically significant mean difference between the control group and the experimental group.
b) The Effects of Writing Learning Assistance of MsCAEWL on Trait-specific Performance
Having established that MsCAEWL can effectively assist students in learning English writing, and in order to gain a deeper understanding of which aspects of writing learning our system has positively affected and to what extent, paired-sample t-tests were conducted to compare the pre-test and post-test scores on each trait of writing, i.e., Thesis, Fluency, Content, Complexity, Correctness, and the total score.
Statistics of Control Group.
Paired-Sample t-Test of Control Group.
In order to determine whether, and on which traits, MsCAEWL yields writing learning assistance, paired-sample t-tests were conducted on the pre-test and post-test trait scores of each group.
Statistics of Experimental Group.
Paired-Sample t-Test of Experimental Group.
Conclusion and Future Works
How MsCAEWL Can Benefit Writing Learning Under the Guidance of Writing Feedback
In this study, we proposed a computer-assisted writing learning system that integrates neural network models and several semantic-based NLP techniques. Different from previous AWE or AES models, which mostly adopt the holistic scoring method, the design of our proposed system follows writing feedback theory. Combining the advantages of AWE/AES models, neural networks, and multiple NLP technologies, our proposed system can provide a package of writing feedback that fully meets the requirements of writing feedback theory, namely, multiple, continuous, timely, clear, and multi-aspect interactive guidance feedback.
More specifically, as with other AWE models, MsCAEWL is designed for easy operation, allowing students and teachers to use it conveniently in daily writing learning and teaching.
Users can also combine the trait-specific evaluation scores of this system (rather than using all of them) to compose different assessment matrices, thereby achieving different teaching or learning goals. It is worth noting that the utilization of writing feedback theory or AWE models is predicated on a fundamental premise, namely that learners possess sufficient self-regulation to carry out, or teachers can oversee, multiple careful revisions of a writing task. In our observation, even when AWE or teachers proffer detailed and constructive feedback, the benefits derived from it are exceedingly limited if the author revises only once without attending to the feedback thereafter.
Conclusion
In this article, following the guidance of writing feedback theory, we designed a Multi-strategy Computer-assisted EFL Writing Learning System (MsCAEWL) in order to provide multiple trait-specific writing evaluation scores together with detailed writing instructions.
In the comparison experiment on holistic scoring, MsCAEWL achieved the best performance on multiple prompts and on the final average score. Owing to the introduction of various semantic-based NLP technologies, our proposed system outperformed other outstanding baseline models that incorporate a single model as their backbone.
The experiment of analytic scoring proves that in multiple trait-specific scoring, our system is more reliable and more robust than human raters. In other words, MsCAEWL is more suitable as a reference standard for writing verification experiments. It can not only provide more comprehensive multi-aspect assessments but also avoid individual differences in human scoring and inconsistent rating due to fatigue.
However, scoring compositions is only one of MsCAEWL's capabilities. Its most valuable task is to improve the author's writing ability through multiple rounds of revision based on the interactive and all-around feedback provided to the author. In the validation experiment on the assistance effect in writing learning, learners could use MsCAEWL's writing feedback to edit their compositions without restriction on the number of rounds, and they could immediately see the score changes after one round of modification to guide the next. With the assistance of our system, learners' writing skills can be greatly enhanced over time, indicating that writing feedback theory implemented with computer assistance is more effective.
Future Works
In the future, we will focus on text evaluation based on paragraph and text intention extraction, as well as on evaluating the continuity between paragraphs and their support for the topic based on paragraph intention. We will strive to bring the model closer to human thinking in writing evaluation and feedback. Simultaneously, we will consult second language acquisition specialists on the weight allocation and rationality of the traits in writing assessment.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Social Sciences Foundation of China (21BYY162); The Research Project of Introducing High-level Talents of Qiannan Normal College for Nationalities (2021qnsyrc09).
