Sage Journals: Discover world-class research

Abstract

Keywords

Automated essay scoring Bidirectional Encoder Representations From Transformers (BERT)ELion intelligent Chinese composition tutoring system large language models

Writing practice plays a crucial role in the process of language learning. But grading compositions often requires a lot of time and effort, and different teachers have varying standards. Therefore, an automatic essay scoring system is particularly important as it can alleviate the difficulties teachers face in correcting compositions. It is revealed that automated essay scoring systems are predominantly found in the context of English language corpora, with a staggering 95 pieces of literature dedicated to the subject, but there are only two or three pieces of literature related to automated essay scoring systems for Chinese or other languages (Huawei & Aryadoust, 2023). Therefore, the development of an automated essay scoring system in the Chinese context is particularly essential.

Project overview

The ELion Intelligent Chinese Composition Tutoring System (https://elion.ecnu.edu.cn/write/list.htm) is a joint effort between East China Normal University's Shanghai Institute of Artificial Intelligence for Education and Microsoft Research Asia. The ELion Intelligent Chinese Composition Tutoring System, based on the “Compulsory Education Chinese Curriculum Standards,” analyzes students’ compositions from four main dimensions: topic comprehension (keyword detection), content (composition length and quality of words and sentences), expression (fluency, logic, and grammatical errors), and handwriting. The system also provides detailed comments on students’ word choice, sentence structure, and paragraph organization, offering a multi-dimensional and multi-perspective objective analysis of students’ compositions. Teachers can review the system's preliminary assessments, provide additional feedback, write teacher messages, and set examples.

This program began in the spring of 2021 with a modest but specific goal: to reduce teachers’ grading workloads in Chinese composition writing in Chinese primary and secondary schools. As of June 2024, a total of 251 primary and secondary schools had adopted the ELion Chinese Composition Intelligent Tutoring System. The system's reach has expanded from its initial base in the Jiangsu and Shanghai regions to various parts of the country. It has accumulated a user base of 15,240 students and 560 teachers from Grades 3 to 9, with nearly 50,000 compositions uploaded and a monthly active user count exceeding 5,375. As the system's promotion expands, the ELion system has also received unanimous praise from frontline teachers. A primary school teacher in Shanghai skillfully uses the ELion Intelligent Chinese Composition Tutoring System for essay review. By comparing AI-generated reviews with those of the students, the teacher stimulates students to express their creativity based on the AI feedback, fostering interactive learning between students and AI within the classroom setting. The principal of a primary school also pointed out:
Previously, the application of AI in the classroom mainly focused on subjects like mathematics, which are more inclined toward science. There are not many mature AI applications in the Chinese language subject. The performance of ELion has liberated Chinese language teachers to some extent from the task of essay correction, playing an important role.

The introductory paper (Zheng et al., 2023) provides an overview of the ELion Intelligent Chinese Composition Tutoring System. This brief report will focus on our investigation and exploration of large language models (LLMs) in essay scoring and feedback, planned or unplanned, which can be naturally divided into three stages: BERT only, BERT-ChatGPT Synergy for improved feedback generation, and BERT-ChatGPT synergy in an expanded framework for diversified and bold applications. The major underlying AI technology for such a system is called natural language processing (NLP), and the current core architecture for NLP is Bidirectional Encoder Representations From Transformers (BERT). The key element here is known as the transformer, and it is also an essential component for ChatGPT (Chat Generative Pre-Trained Transformer), which has become a hot topic in AI and education. Thus, when this project began, BERT, ChatGPT's cousin, was the clear choice for an automated essay scoring and evaluation system. This is the unfolding story of this system, which is full of surprises and echoes many issues, particularly the use of (generative) LLMs in education. This brief report aims to share some key findings, along with a new working concept and its theoretical support.

Stage I: BERT only

The most critical technological element of the system is an algorithm that automatically assigns scores and assesses textual attributes of essays. RoBerTa, a robust variant of BERT, and several other AI techniques, most notably Optical Character Recognition (OCR) technology for Chinese children's handwriting, are employed by ELion to evaluate essays from students across a spectrum of feature levels.

The operation of ELion's automated essay evaluation algorithm is illustrated in Figure 1. The algorithm converts the images into text encoding using text recognition after students upload photographs of their essays. In addition, it generates fundamental descriptive statistics, including the number of words and paragraphs. The algorithm then analyzes the entire essay, including words, phrases, sentences, and paragraphs, using the RoBerTa model. In addition to typographical errors and inappropriate word usage, the system is capable of identifying language fluency, coherence, logical reasoning, rhetorical techniques, and more. The algorithm then determines grades and provides essay correction information in accordance with the analysis of the aforementioned text. Finally, the intelligent feedback generation technology produces customized revision suggestions and comments for the essay, based on the evaluation results from the preceding steps and predefined comment templates.

Figure 1.
ELion's intelligent essay scoring algorithm with BERT only.

In order to fulfill the demands of the national curriculum and educators regarding “deep” language feature analysis, ELion assembled a teaching and research team of 23 individuals. This team consisted of Chinese educators, educational researchers, evaluation specialists, and AI experts. The group was assigned the responsibility of creating a composition evaluation framework that would be suitable for use in Chinese elementary and intermediate schools. Following a comprehensive examination of established frameworks for evaluating Chinese composition, the group devised a four-tiered assessment system that progresses from superficial to profound (see Figure 2).

Figure 2.
ELion's Chinese composition assessment framework.

Language application layer: Assessing students’ command of grammatical structures, idioms, words, and punctuation, this layer predominantly concentrates on fundamental writing skills. It is especially suitable for students in the lower primary grades.

Language expression layer: Designed specifically for lower and middle primary school students, this layer evaluates students’ language expression skills in four aspects—rhetoric recognition, descriptive techniques, proficient use of words and sentences, and paragraph development.

Discourse anomaly detection: The principal function of this layer is to ascertain whether instances of disorganized writing or plagiarism are present in the work of the students; thus, it bolsters the validity of the evaluation results.

Discourse quality assessment: This stratum predominantly evaluates compositions based on four key dimensions—comprehension of the prompt, substance, articulation, and writing. It offers focused evaluation viewpoints that are particularly suitable for students in upper primary and middle school.

With varying degrees of precision, the scoring and evaluation algorithm based on RoBerTa can do an excellent job of assessing Chinese compositions for the aforementioned qualities. However, generating feedback reports on a large scale regarding the satisfaction of teachers and students remained a significant obstacle at this time.

In this iteration, ELion generated comments utilizing the BERT model combined with predefined templates. The assessment result for each student is considered as “accurate and individualized,” but this report exhibits several shortcomings, primarily in its delivery style: Initially, it appeared tedious and inflexible due to the pre-written nature of numerous sentences within the template. Furthermore, it is possible that these sentences lack amusement, lack friendliness, or fail to be presented in a manner suitable for the students’ grade levels and ages. Teachers are occasionally concerned that younger pupils may not be able to comprehend the report.

Another concern regarding the unexpected surge in popularity of ChatGPT is the perception that this technology possesses such immense power that it can supplant any existing method. People are, at the very least, curious as to why the system is not “upgrading” to ChatGPT.

Stage II: BERT-ChatGPT synergy for improved feedback generation

The team initiated trials of ChatGPT for essay scoring in early December 2022, well in advance of its widespread adoption in the field of education. A number of the aforementioned concerns may be readily remedied by integrating ChatGPT exclusively during the feedback phase. Our initial report exhibits precision in its evaluation yet lacks polish in its presentation. One straightforward resolution entails modifying our report to suit the various patterns specified by the instructors. “Easy to read for third or fourth graders,” “conversational in tone,” and “in a more humorous fashion” are the most favored alternatives.

The feedback generated from templates can be effortlessly altered to conform to various styles due to the exceptional text generation capabilities of GPT technology. By developing GPT prompts, the ELion system can enhance the potential for generating feedback by guaranteeing language generation that is more suitable and in accordance with the true requirements of educators. The teachers’ responses to our small-scale survey contrasting the two types of reports generated by these two distinct methods readily attest to this. The operating method of ELion's second-generation intelligent automatic essay evaluation algorithm, which is built upon BERT-ChatGPT synergy for improved feedback, is illustrated in Figure 3.

Figure 3.
ELion intelligent essay scoring algorithm with BERT-ChatGPT synergy.

The second significant concern is one that has been long anticipated: Is it imperative to substitute the current BERT engine with the ChatGPT engine? A comprehensive evaluation of ChatGPT's efficacy in assessing Chinese compositions in the aforementioned domains is therefore required. A sequence of initial investigations were undertaken to evaluate the performance of ChatGPT in this particular text assessment, driven by the concern that it might surpass the system on which our ardent 2-year labor has been spent. Despite being highly innovative and powerful in content generation, ChatGPT may not be as capable as a specifically designed system when it comes to serving as a general cognitive engine, and it has a number of evident shortcomings in essay scoring and text evaluation. These initial investigations have yielded some indications that a custom-engineered system remains necessary and might not be swiftly supplanted by GPT technology. An initial investigation was undertaken to assess the ChatGPT's efficacy in TOEFL essay scoring (Xia et al., 2024). The achievement of satisfactory scoring results, marginally inferior to the meticulously designed system, is especially noteworthy for the regression effect of scoring, potentially attributable to data limitations in the ChatGPT.

Stage III: BERT-ChatGPT synergy in an expanded framework

Additional research has been conducted to explore the potential advantages of LLMs in the fields of essay evaluation and education at large. In order to facilitate the continued advancement of the ELion essay system, a novel conceptual framework called Learning Copilot Enabled by Accurate Assessment and ChatGPT (LCEAAG) has been introduced. Using ChatGPT in conjunction with rigorous assessment within this framework could facilitate the development of a multitude of educational applications, such as interactive instruction, in addition to essay scoring and evaluation.

The possibility that ChatGPT's explanation fails to satisfy the standards of educators and learners is even more consequential. This practice, known as “prompt engineering,” necessitated furnishing ChatGPT with meticulously crafted and extremely detailed instructions.

An effective approach was proposed and experimented by the ELion team to carry out prompt engineering in a methodical and quality-controlled fashion by leveraging established educational assessment tools, including scoring rubrics and the indicator framework for psychological assessment. These psychological assessment development tools commonly take the form of a hierarchical tree, which illustrates the different facets of the target domain or concept (construct, as employed in psychology and education). These facets span from more generalized to more specific and minute dimensions. An example of narrative essay assessment is presented in Figure 4.

Figure 4.
An assessment indicator framework for narrative essays.

It is possible to generate a set of prompts using the indicator framework, conceivably in accordance with every single bottom-level indicator. By adopting this approach, we can guarantee that we have inquired about every essential aspect pertaining to the subject matter and have not neglected any critical components. It can operate as a quality control mechanism in this manner to achieve the alignment between the prompts and domain knowledge which is labeled as the “tool empowerment” of psychometrics devices.

Additionally, information gathered from instructors can be used to generate a list of prompts as presented in Table 1 that can be cross-referenced with the framework to guarantee comprehensiveness. Through a straightforward survey of educators, we compiled this inventory to ascertain the matters that preoccupy them the most when assessing and grading the essays of their pupils. These points (or tags, as they are referred to in computer science) constitute the most exhaustive compilation of all aspects pertaining to writing feedback and learning that the majority of instructors found intriguing or pertinent.

Table 1.
Tags for essay assessment from teachers’ perspective.

ID Tags (EN) Tags (CN)

1 unexpressive language 语言平淡

2 too general/too specific 详略问题

3 unclear main idea 中心不突出

4 insufficient description 缺乏描写

5 inaccurate expressions 表达不准确

6 vivid writing 文笔生动

7 organized writing 叙事条理

8 substantial content 内容丰富

9 good writing flow 节奏鲜明

10 authentic language 语言自然

11 fluent language 语言流畅

12 well-structured writing 结构完整

13 improper content 选材不当

14 lacking in rhetorics 缺乏文采

15 clear arrangement of ideas 层次分明

16 unclear structure 结构不清晰

17 sincere emotion 情感真挚

18 unexpressive description 描写不生动

19 vivid language 语言生动

20 deficient in emotion 缺乏情感

21 literary interest 文学趣味

22 elegant language 语言优美

23 non-fluent language 语言不流畅

24 creative content 想象丰富

25 succinct language 语言简洁

26 improper main idea 立意不当

27 vivid description 描写生动

28 lacking in details 叙述不具体

29 stimulating reading interests 激发阅读兴趣

30 opening with the main idea 开篇点题

Note. EN = English; CN = Chinese.

Under this LCEAAG framework, ChatGPT is capable of two functions: First, it can conduct the text assessment by adhering to the given prompts generated from the assessment indicators; second, it can engage in intelligent discourse with a student by utilizing assessment data from an external assessment, either in isolation or in conjunction with data from an external assessment and itself. The second function entails interactive instruction, wherein the personalized assessment outcome further augments the guidance provided by the assessment concept framework. When utilized in conjunction with the assessment outcomes which are labeled as the “information empowerment” from the assessment result, this collection of prompts has the potential to enhance the accuracy of an interactive dialogue between the ChatGPT and the students. The whole LCEAAG framework is presented in Figure 5.

Figure 5.
The LCEAAG framework.

The idea of systematic prompt engineering in LCEAAG is inspired by Herbert Simon's (re)definition of “psychology as a science of the artificial.” Following the same line of thinking, we can argue that psychometrics can be redefined as “a science of the artificial.” Psychometrics becomes the interface among the human mind, the target domain, and the LLM. To be more specific, the traditional important technical devices in psychometrics emerge as the interface or bridge between the target domain and the LLM. This interface's main job is to guide the LLM to make a pedagogically relevant and enriching conversation with students. The quality of this interactive tutoring is further ensured through the “accurate assessment” and is conversational and amicable due to the ChatGPT's distinctive advantage. This approach can also be implemented in a range of situations involving psychological and educational evaluations. We are experimenting, for instance, with interactive report feedback for a personality assessment related to the workplace.

The investigation into the possible integration of LLMs into this automated scoring and evaluation system project has progressed through three discrete phases. The unanticipated appearance of ChatGPT disrupts the initial plan, resulting in both positive and negative consequences; thus, this process is also generative. Furthermore, the primary insight gained thus far is that even for LLMs, it may be necessary to incorporate a variety of technologies—both old and new—across various disciplines in order to attain satisfactory results in practice. Innovative technology cannot solve problems in the actual world on its own.

Takeaway message

This article integrates how the ELion Intelligent Chinese Composition Tutoring System evolved in China to address the challenging and time-consuming essay grading requirements.

The ELion system's technical progress occurs in three stages: first, multidimensional assessment using only BERT; second, ChatGPT is added for enhanced feedback generation; and lastly, the LCEAAG framework is implemented. The system combines psychometric tools with rapid engineering to enhance AI's applications in education.

Psychometrics and LLMs enable AI incorporation in education. Based on Herbert Simon's “the sciences of the artificial,” the study redefines psychometrics as a “science of the artificial,” making AI–student interactions pedagogically helpful and intellectually celebratory.

The study underlines that single technology cannot solve real-world problems. Technology and traditional methods must be used to generate cross-domain educational solutions. Generative AI offers pros and cons, stressing the need for customization and disciplinary adaptability in education technology.

ID	Tags (EN)	Tags (CN)
1	unexpressive language	语言平淡
2	too general/too specific	详略问题
3	unclear main idea	中心不突出
4	insufficient description	缺乏描写
5	inaccurate expressions	表达不准确
6	vivid writing	文笔生动
7	organized writing	叙事条理
8	substantial content	内容丰富
9	good writing flow	节奏鲜明
10	authentic language	语言自然
11	fluent language	语言流畅
12	well-structured writing	结构完整
13	improper content	选材不当
14	lacking in rhetorics	缺乏文采
15	clear arrangement of ideas	层次分明
16	unclear structure	结构不清晰
17	sincere emotion	情感真挚
18	unexpressive description	描写不生动
19	vivid language	语言生动
20	deficient in emotion	缺乏情感
21	literary interest	文学趣味
22	elegant language	语言优美
23	non-fluent language	语言不流畅
24	creative content	想象丰富
25	succinct language	语言简洁
26	improper main idea	立意不当
27	vivid description	描写生动
28	lacking in details	叙述不具体
29	stimulating reading interests	激发阅读兴趣
30	opening with the main idea	开篇点题

Footnotes

Contributorship

Chanjin Zheng, the principal researcher, was responsible for the overall conceptualization and supervision of the project, provided key insights into the integration of educational theory with artificial intelligence technology, and contributed to the theoretical foundation of the LCEAAG framework. Wei Xia is mainly in charge of drafting the initial manuscript, organizing theoretical content, and creating illustrations. Shaoguang Mao is responsible for system development, ensuring the system's accuracy in evaluating Chinese compositions. Yan Xia coordinated the technical team and managed the deployment of the system in educational settings.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Wei Xia

References

Huawei

Aryadoust

(2023). A systematic review of automated writing evaluation systems. Education and Information Technologies, 28(1), 771–795. https://doi.org/10.1007/s10639-022-11200-7

Xia

Mao

Zheng

(2024). Empirical study of large language models as automated essay scoring tools in English composition—Taking TOEFL independent writing task for example. https://doi.org/10.48550/arXiv.2401.03401

Zheng

Guo

Xia

Mao

(2023). ELion: An intelligent Chinese composition tutoring system based on large language models. Chinese/English Journal of Educational Measurement and Evaluation, 4(3), Article 3. https://doi.org/10.59863/MPJO6480

ChatGPT,BERT,or Both? This Is Not a Question: The Evolution Story of LLMs in ELion Intelligent Chinese Composition Tutoring System

Abstract

Keywords

Project overview

Stage I: BERT only

Stage II: BERT-ChatGPT synergy for improved feedback generation

Stage III: BERT-ChatGPT synergy in an expanded framework

Takeaway message