Abstract
Objective: To save teachers' grading time and improve the accuracy and efficiency of English composition scoring. Methods: This paper briefly introduces a deep sentence smoothness algorithm and a text semantic matching algorithm based on a graph neural network, and then designs an automatic scoring algorithm for English text on top of them. Results: The experimental data consist of 12,000 essays written by students in the United States, drawn from the Hewlett Foundation's Automated Student Assessment Prize (ASAP), and the essays were scored in a comparative experiment. The comparison shows that the automatic scoring algorithm designed in this paper achieves better scoring results and handles the automatic essay scoring problem more effectively. Among the experimental mean values of all evaluated methods, the algorithm designed in this paper reaches 0.790, the smoothness algorithm 0.768, and the text matching algorithm 0.759, while the two traditional automatic scoring algorithms reach 0.710 and 0.712 respectively, all lower than the algorithm designed in this paper. Conclusion: The experimental results show that good feature selection gives the algorithm good scoring performance and allows it to cope with the automatic scoring problem. They also confirm the feasibility of the algorithm designed in this paper, which can be applied effectively to practical English composition scoring.
Introduction
English education has received great attention, and writing is an important part of examinations such as the middle school entrance examination, CET-4 and CET-6, TOEFL, and IELTS [1, 2]. Writing comprehensively reflects students' understanding and application of language, such as the use of vocabulary and grammar and the mastery of discourse structure. However, because English writing involves complex subjective expression, it cannot be evaluated by simple mechanical means, which creates many problems for grading. From a grading perspective, an essay is far harder to grade than objective questions or short subjective questions and requires considerable manpower and material resources. In simulated examinations, examiners must work under heavy load in a closed environment and are inevitably influenced by personal subjectivity and external factors, so the evaluation standards are not applied entirely consistently; the results are not objective enough, errors arise, and the fairness of the grades is harmed [3, 4]. To save teachers' correction time and improve the accuracy and efficiency of English essay grading, this article applies automatic grading technology to English texts. Smoothness is a key attribute of a text, reflecting its readability and comprehensibility, and is therefore often used as an important scoring indicator. The article focuses on this key factor, designs an automatic scoring algorithm for English text based on sentence smoothness features and text matching features, and compares its performance with other algorithms to explore its feasibility. Because domestic and foreign research environments differ, this type of research is still at a preliminary stage; through this exploration, the article aims to help fill the gap and lay a foundation for further development in this field.
Word2vec word vector model
Principle of Word2vec word vector model
Bengio proposed a neural language model, introducing neural networks into the training of language models and using the language model to generate word vectors; in essence, context words are used to predict the central word, or the first n words are used to predict the next word, and word vectors are a by-product of this process. Later, Mikolov proposed the Word2vec model, which aims to obtain word vectors more directly [5]. The Word2vec model adopts two more efficient neural network architectures, CBOW and Skip-Gram, and uses hierarchical Softmax and negative sampling techniques [6, 7]. The Continuous Bag-of-Words (CBOW) model is a continuous bag-of-words model that removes the time-consuming non-linear hidden layer [8, 9, 10].
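As a concrete illustration, the following is a minimal sketch of training word vectors with the gensim library; the toy corpus and hyperparameters are placeholders and are not the settings used in this paper.

```python
# Minimal Word2vec training sketch with gensim (illustrative only).
from gensim.models import Word2Vec

corpus = [
    ["the", "essay", "is", "well", "organized"],
    ["the", "argument", "is", "clear", "and", "coherent"],
]

# sg=0 selects CBOW (context words predict the central word);
# sg=1 would select Skip-Gram (the central word predicts its context).
# hs=1 enables hierarchical Softmax; negative>0 enables negative sampling.
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=0, hs=0, negative=5)

vector = model.wv["essay"]                     # 100-dimensional word vector
print(model.wv.most_similar("essay", topn=3))  # nearest neighbours in vector space
```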
Function derivation
The CBOW model predicts the central word from its surrounding context words: the input word vectors of the context are averaged to form the hidden-layer vector $h$. The weight matrix from the hidden layer to the output layer is $W'$, so each candidate word receives a score $u = W'^{\top} h$, which is normalized by Softmax into a probability distribution over the vocabulary. The maximum likelihood function and the gradient descent method are used to solve the parameters. For the Skip-Gram model, the calculation process is the opposite of CBOW: the central word is used to predict its context words.
Softmax is used in the classification process to implement multi-class classification: it maps the output neurons to real numbers between 0 and 1 and normalizes them so that they sum to 1, giving a probability distribution over the classes. The function is defined as follows:

$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K,$$

where $z_i$ is the output of the $i$-th output neuron and $K$ is the number of classes.
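A small numerical sketch of the Softmax definition above, written in NumPy for illustration:

```python
# Numerically stable Softmax matching the definition above.
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()   # probabilities sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099]
```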
Doc2vec sentence vector model principle
Representing a sentence or a document as a vector is an important problem with many applications, such as similarity calculation, classification and clustering. For short texts, the bag-of-words model is usually used, but it ignores word order as well as syntactic and semantic information. Doc2vec overcomes these shortcomings of the traditional bag-of-words model and can learn vector representations of sentences in an unsupervised manner from large-scale corpora [11, 12, 13].
Doc2vec architecture
Doc2vec trains sentence vectors in a way very similar to Word2vec. The core idea of training word vectors is that each word is influenced by, and can be predicted from, the words in its context. Doc2vec is trained in the same way: to predict the words in a sentence, features can be generated both from the other words and from the relationship between those words and the sentence itself. The Doc2vec framework is shown in Fig. 1.
Doc2vec model framework.
Paragraphs are mapped to a vector space and represented by the columns of matrix D, while words are mapped to a vector space of the same kind and represented by the columns of matrix W. The paragraph vector and the word vectors are then concatenated or averaged to obtain a feature, and the next word is predicted from this feature. In the variant that predicts the current word from the paragraph matrix and the context words, the paragraph vector is shared across all contexts of a paragraph and is trained together with the rest of the model. In the variant that predicts a random word in the paragraph from the paragraph vector alone, the context of the word is ignored and Softmax classification is used to predict words in the sentence.
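The following is a minimal sketch of learning paragraph vectors with gensim's Doc2vec implementation; the toy corpus and hyperparameters are placeholders, not the configuration used in this paper.

```python
# Minimal Doc2vec training sketch with gensim (illustrative only).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "essay", "has", "a", "clear", "structure"], tags=["d0"]),
    TaggedDocument(words=["the", "argument", "lacks", "supporting", "evidence"], tags=["d1"]),
]

# dm=1 corresponds to the variant that predicts the current word from the paragraph
# vector plus context words; dm=0 would predict randomly sampled words from the
# paragraph vector alone.
model = Doc2Vec(docs, vector_size=100, window=3, min_count=1, dm=1, epochs=20)

# Infer a vector for an unseen sentence.
vec = model.infer_vector(["the", "conclusion", "is", "well", "written"])
print(vec.shape)  # (100,)
```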
Deep sentence smoothness algorithm based on multi-feature fusion
Smoothness is a key attribute of text, and constructing smoothly structured text is an important problem in natural language processing. In automatic essay scoring, a good essay always has a distinctive high-level logical and thematic structure, and the actual choice of words and sentences, as well as their arrangement, serves this high-level structure. Therefore, a deep sentence smoothness algorithm integrating multiple features is proposed. The specific architecture is shown in Fig. 2.
Sentence smoothness algorithm architecture.
It can be seen that the algorithm architecture mainly covers two parts. The first part uses multiple features to represent words as vectors and then forms several different sentence representation matrices that make up the input layer. The second part is the network structure of the smoothness algorithm, which performs feature extraction and models the relationships between sentences; it includes a convolution layer, a pooling layer, a self-attention layer, a multi-layer perceptron layer and an output layer.

The advantage of this algorithm is that feature fusion improves its accuracy, robustness and stability, while reducing the data dimensionality, handling missing data and improving the interpretability of the model. In practical applications, feature fusion is widely used and has become one of the core techniques of many algorithms. Overall, a deep sentence smoothness algorithm that integrates multiple features has prominent advantages in text recognition.
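A rough PyTorch sketch of the layer sequence described above (convolution, pooling, self-attention, multi-layer perceptron, output) is given below; the layer sizes and the final pooling over sentences are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of a smoothness network: conv -> pool -> self-attention -> MLP.
import torch
import torch.nn as nn

class SmoothnessNet(nn.Module):
    def __init__(self, emb_dim=300, conv_channels=100, kernel_size=5, num_heads=4):
        super().__init__()
        # Convolution over each sentence representation matrix.
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size, padding=kernel_size // 2)
        self.pool = nn.AdaptiveMaxPool1d(1)
        # Self-attention models relationships between the sentences of an essay.
        self.attn = nn.MultiheadAttention(conv_channels, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(conv_channels, 50), nn.ReLU(),
            nn.Linear(50, 1), nn.Sigmoid(),   # smoothness score in [0, 1]
        )

    def forward(self, sentences):
        # sentences: (batch, num_sentences, seq_len, emb_dim)
        b, s, l, d = sentences.shape
        x = sentences.view(b * s, l, d).transpose(1, 2)      # (b*s, emb_dim, seq_len)
        x = self.pool(torch.relu(self.conv(x))).squeeze(-1)  # one vector per sentence
        x = x.view(b, s, -1)
        x, _ = self.attn(x, x, x)                            # sentence-level self-attention
        return self.mlp(x.mean(dim=1))                       # (batch, 1) smoothness score

scores = SmoothnessNet()(torch.randn(2, 6, 30, 300))
print(scores.shape)  # torch.Size([2, 1])
```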
The goal of text matching is to evaluate the semantic similarity between a source text and a target text, which is important in many natural language processing tasks. For example, information retrieval, automatic question answering, machine translation, dialogue systems and paraphrase identification can all, to some extent, be abstracted as text semantic matching problems. Based on this, the architecture of a text semantic matching algorithm based on a graph neural network is given, as shown in Fig. 3.
Text matching algorithm architecture.
It can be seen that the text semantic matching algorithm based on a graph neural network mainly covers four parts. The first part preprocesses the sentences, pre-trains the sentence vectors, and extracts the structural features of the article using a recurrent neural network. The second part clusters the sentences and combines the sentence vectors from the documents to form concepts. The third part constructs the vertices and edges of the graph used by the graph convolutional network, uses a Triplet Network to form the feature vectors of the vertices, and uses sentence-level TF-IDF similarity to form the edge weights. The fourth part consists of the graph convolutional network layer followed by the multi-layer perceptron layer and the output layer, which train the vertex feature vectors and then produce the similarity features of the documents.

The algorithm has the following advantages. First, it can be learned end-to-end; end-to-end learning is efficient and reduces the information asymmetry of intermediate steps, so that once a problem is found at the output, every link of the whole system can be adjusted. Second, it is good at reasoning: large-scale graph data can express rich, logically related human common sense and expert rules; graph nodes define understandable symbolic knowledge, and the irregular graph topology expresses reasoning relations such as dependence, dependency and logical rules among the nodes, giving the model strong reasoning ability. Third, it is analyzable: graphs have strong semantic visualization ability, which all GNN models share, and this gives them strong parsing ability. Therefore, the text matching algorithm based on a graph neural network can effectively handle text matching problems.
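As a small illustration of one graph-construction step described above, the sketch below computes sentence-level TF-IDF cosine similarities that could serve as edge weights between graph vertices; the sentences are placeholders and scikit-learn is used here purely for illustration.

```python
# TF-IDF cosine similarity between sentences, usable as graph edge weights.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The essay states its thesis in the first paragraph.",
    "The thesis is stated clearly at the beginning of the essay.",
    "Public transportation reduces traffic congestion.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)   # (num_sentences, vocab_size)
edge_weights = cosine_similarity(tfidf)              # symmetric similarity matrix

# edge_weights[i][j] can then be used as the weight of the edge between
# vertices i and j in the graph fed to the graph convolutional network.
print(edge_weights.round(2))
```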
The automatic essay scoring algorithm designed in this paper integrates sentence smoothness features and text matching features. The overall architecture of the algorithm consists of the basic algorithm layer, the engine layer and the interface layer, as shown in Fig. 4.
The basic algorithm layer contains the basic algorithms and models used in the system, which are called by the engine layer. The word vector module includes the pre-trained Word2vec word vector model and the GloVe model, and the neural network model module includes common deep learning models such as the convolutional neural network, the self-attention network and the graph convolutional network. The engine layer contains the algorithm engines and the storage engine. The storage engine consists of a MySQL database and disk files and is mainly used to store the data sets called by the algorithm engines; the algorithm engines include the sentence smoothness calculation engine, the text matching calculation engine and the total score calculation engine. The interface layer involves four kinds of APIs: the smoothness score API, the text match score API, the total score API, and the user interaction API. The first three APIs interact with the corresponding engines, and the last one receives the score type requested by the user; after the user API receives this request, it calls the first three APIs to obtain the scores and returns them to the user.

This automatic essay scoring algorithm, which integrates sentence smoothness features and text matching features, inherits the advantages of the smoothness algorithm, such as accuracy, robustness, stability and reduced data dimensionality, as well as the advantages of the text matching algorithm, such as end-to-end learning, strong reasoning and strong parsing ability. It can be seen that the algorithm combines the advantages of both feature types and is therefore well suited to essay grading.
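The sketch below illustrates how such an interface layer might route a request to the two engines and combine their outputs into a total score. The stub engines and the equal-weight combination are assumptions made for illustration; the paper does not specify the exact combination rule.

```python
# Illustrative interface-layer sketch: route a score request to the engines.
def smoothness_engine(essay: str) -> float:
    return 0.8   # stub: would call the sentence smoothness model

def matching_engine(essay: str, prompt: str) -> float:
    return 0.7   # stub: would call the graph-based text matching model

def score_api(essay: str, prompt: str, score_type: str = "total") -> float:
    s = smoothness_engine(essay)
    m = matching_engine(essay, prompt)
    if score_type == "smoothness":
        return s
    if score_type == "match":
        return m
    return 0.5 * s + 0.5 * m   # total score: equal-weight combination (assumed)

print(score_api("sample essay text", "sample prompt"))  # 0.75
```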
Architecture of the automatic essay scoring algorithm that incorporates sentence smoothness features and text matching features.
ASAP data set
Experimental data
The essay data came from the Hewlett Foundation's Automated Student Assessment Prize (ASAP), which aims to promote the development of automatic essay scoring algorithms for student writing. The 12,000 essays, written by students in the United States, are divided into eight sets, and each essay set has a corresponding prompt. The selected essays are generally between 150 and 500 words long. The eight data sets have their own characteristics, which makes them well suited to testing the feasibility of the scoring algorithm. The corresponding data sets are shown in Table 1.
Most of the essays in the dataset were manually entered into a computer and are stored in Tab-Separated Values (TSV) files and Microsoft Excel files. The ASAP project also uses data desensitization techniques to anonymize the data and protect the privacy of the authors: algorithms including Named Entity Recognition (NER) identify information such as "organization", "person" and "place" in the articles and replace it with tokens such as "@PERSON1" [14, 15]. For the evaluation index, the Quadratic Weighted Kappa value used by the official data source is selected; it indicates the consistency between two raters, with a value range of [0, 1], where 0 indicates no consistency and 1 indicates complete consistency. If the essays in dataset M have a total of N possible scores, the quadratic weight between scores $i$ and $j$ is

$$w_{i,j} = \frac{(i-j)^2}{(N-1)^2}.$$

The Quadratic Weighted Kappa over the whole dataset M is then calculated as

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}},$$

where $O_{i,j}$ is the number of essays that received score $i$ from the first rater and score $j$ from the second rater, and $E_{i,j}$ is the expected count under the assumption that the two raters score independently, obtained as the outer product of the two raters' score histograms normalized to have the same sum as $O$.
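For reference, a compact implementation of the formula above is sketched below; the two score lists are placeholder values used only to demonstrate the calculation.

```python
# Quadratic Weighted Kappa between two raters' scores.
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    n = max_score - min_score + 1
    # Observed score matrix O and quadratic weight matrix w.
    O = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        O[a - min_score, b - min_score] += 1
    w = np.array([[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)])
    # Expected matrix E: outer product of the score histograms, scaled to sum(O).
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (w * O).sum() / (w * E).sum()

human = [2, 3, 4, 4, 2, 3]     # placeholder scores from rater A
machine = [2, 3, 3, 4, 2, 4]   # placeholder scores from rater B
print(round(quadratic_weighted_kappa(human, machine, 1, 4), 3))  # 0.75
```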
The Fisher transformation is an approximately variance-stabilizing transformation. Because different data sets have different scoring ranges, their influence on the final Mean Quadratic Weighted Kappa needs to be normalized: each per-dataset Kappa value $\kappa_m$ is first transformed as

$$z_m = \frac{1}{2}\ln\frac{1+\kappa_m}{1-\kappa_m},$$

the transformed values are averaged to obtain $\bar{z}$, and the mean is transformed back as

$$\bar{\kappa} = \frac{e^{2\bar{z}} - 1}{e^{2\bar{z}} + 1}.$$

The converted Mean Quadratic Weighted Kappa $\bar{\kappa}$ obtained in this way is used as the final evaluation index.
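The normalization described above can be expressed in a few lines of code; the per-dataset Kappa values below are placeholders, not the paper's results.

```python
# Mean Quadratic Weighted Kappa via the Fisher transformation.
import numpy as np

def mean_qwk(kappas):
    kappas = np.clip(np.asarray(kappas, dtype=float), -0.999, 0.999)  # avoid division by zero
    z = 0.5 * np.log((1 + kappas) / (1 - kappas))                     # Fisher transformation
    z_mean = z.mean()
    return (np.exp(2 * z_mean) - 1) / (np.exp(2 * z_mean) + 1)        # inverse transform

per_dataset_kappa = [0.82, 0.72, 0.69, 0.81, 0.80, 0.81, 0.80, 0.70]  # placeholder values
print(round(mean_qwk(per_dataset_kappa), 3))
```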
The model is built with the PyTorch deep learning framework, and the training server has an 8-core Intel i7-6850K CPU, 32 GB of memory and an NVIDIA GeForce GTX 1070 graphics card. The network settings of the graph-neural-network-based text matching algorithm are as follows: the convolution kernel size is 100*300*5; the GloVe vectors have 200 dimensions; the output dimension of the LSTM is 100; and the multi-layer perceptron has 3 layers with dimensions [100, 50, 20]. The Adam optimizer is used with a batch size of 128 and 10 epochs. In order to better reflect the performance of the automatic essay scoring algorithm, a comparative test was carried out against the deep sentence smoothness algorithm integrating multiple features, the text matching algorithm based on the graph neural network, the EASE algorithm and the CNN-LSTM algorithm. The specific settings are as follows: first, for the corresponding ablation variant, the sentence smoothness vector is removed from the input of the LSTM layer.
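The training configuration listed above (Adam optimizer, batch size 128, 10 epochs, an MLP with [100, 50, 20] layer sizes) can be sketched as follows; the stand-in model and random data are placeholders rather than the full scoring network.

```python
# Minimal PyTorch training-loop sketch reflecting the stated settings.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(),
                      nn.Linear(50, 20), nn.ReLU(),
                      nn.Linear(20, 1))          # MLP with [100, 50, 20] layer sizes

features = torch.randn(1024, 100)                # placeholder document feature vectors
scores = torch.rand(1024, 1)                     # placeholder normalized essay scores
loader = DataLoader(TensorDataset(features, scores), batch_size=128, shuffle=True)

optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

for epoch in range(10):                          # 10 epochs, as stated above
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```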
Experimental results
Based on the data set selection and experimental settings above, the automatic essay scoring experiment is carried out and compared with the other scoring algorithms. The results are shown in Table 2.
Comparison of experimental results
It can be seen that all algorithms, including the EASE algorithm, achieve good experimental results, indicating that with good feature selection, traditional machine learning algorithms can also score English essays automatically to a good standard. Meanwhile, it is shown that the automatic scoring task can be modeled more accurately by manually selecting multiple article features together with some deep features. This also explains why the two deep features chosen in this paper, sentence smoothness and text matching, bring better results. Therefore, the results of the algorithm designed in this paper, which fuses both features, are better than those of the comparison algorithms.
Conclusion
In summary, this paper first explains the Word2vec word vector model and the Doc2vec sentence vector model, which lays a theoretical foundation for the empirical analysis of automatic essay scoring. Secondly, the paper analyzes the algorithms related to English essay scoring, including the deep sentence smoothness algorithm based on multiple features and the text matching algorithm based on the graph neural network. Then, based on these two algorithms, an automatic essay scoring algorithm is designed and its architecture is given. Finally, to test the feasibility of the designed algorithm, a comparative test was carried out against the EASE algorithm, Model(sb), Model(SA) and CNN-LSTM. The results show that the mean value of Model(All) is 0.790, higher than that of the other algorithms, which confirms the feasibility of the algorithm designed in this paper and its potential contribution to the automatic scoring of English compositions. In the future, in the field of education, the algorithm can not only save teachers' correcting time but also improve the accuracy and efficiency of English composition scoring, avoid the influence of subjective factors or fatigue errors in teachers' scoring, and give largely authentic scoring results, so it has a high application prospect.
