Abstract
Text classification is an important task in natural language processing. Multilayer attention networks have achieved excellent performance on text classification tasks, but they also face challenges such as high time and space complexity and the low-rank bottleneck problem. This paper incorporates spatial attention into a neural network architecture that uses fewer encoder layers. The proposed model aims to enhance the spatial information of semantic features while addressing the high temporal and spatial demands of traditional multilayer attention networks. The approach uses spatial attention to selectively weigh the relevance of the spatial locations in the input feature maps, enabling the model to focus on the most informative regions while ignoring less important ones. By incorporating spatial attention into a shallower encoder network, the proposed model achieves improved performance on spatially oriented tasks while reducing the computational overhead associated with deeper attention-based models. To alleviate the low-rank bottleneck problem of multihead attention, this paper proposes a variable multihead attention mechanism that varies the number of attention heads from one encoder layer to the next, achieving a balance between expressive power and computational efficiency. We use two Chinese text classification datasets and an English sentiment classification dataset to verify the effectiveness of the proposed model.
Introduction
Text classification is a crucial task in natural language processing, as it seeks to convert textual information into a machine-readable form and achieve human-like performance. However, the large scale, complexity, and imbalance of textual data make large-scale pretraining and fine-tuning challenging, which in turn poses obstacles to obtaining precise feature representations and constructing effective classification models [1, 2]. Traditional machine learning algorithms, such as naive Bayes classifiers [3], KNN classifiers [4], and SVMs [5], have been effectively established for text classification tasks. Although these methods have achieved good performance, some issues remain, such as their inability to correctly capture word order and semantics and their text recognition limitations caused by the high dimensionality and sparsity of the given data. In recent years, with the rapid development of data mining techniques, neural networks based on CNNs [6], LSTMs [7, 8], and RNNs [9] have gradually been used for text classification, and attention-based network models have been able to effectively address the text classification problem. An attention mechanism focuses on extracting key semantic information by assigning different weights to the words in a sentence to construct attention information and achieve improved model performance. However, multilayer attention networks also have many problems.
Multilayer attention networks can enhance performance by adding more attention layers. However, this also leads to significant time and space overheads. Increasing the number of parameters results in higher memory usage, while deepening a multilayer attention network increases the required training time and causes more gradient-related issues.
Previous studies [10, 11] discussed the low-rank bottleneck problem in multihead attention. Although some heads in a multihead attention mechanism are redundant, simply reducing their number makes it difficult for the model to fully learn the semantic information of the context. Moreover, increasing the per-head vector dimensionality significantly increases the computational overhead of the model.
Self-attention mechanisms mainly focus on word-level cloze learning and do not fully utilize the semantic and spatial information contained in the given data. The extracted semantic feature space is limited, which poses a challenge for complex semantic text classification tasks.
The main contributions of this article are as follows. We propose a text classification model called variable multihead hybrid attention based on BERT (BVMHA), which achieves excellent results on several datasets without introducing external knowledge while requiring a small number of parameters and minimal hardware computation. We incorporate spatial attention into the self-attention module to improve the model's ability to extract semantic information. We present a variable multihead attention mechanism that alleviates the low-rank bottleneck of multihead attention, enabling the model to balance computational speed and expressive power.
The paper is structured as follows. Section 2 reviews the related work, Section 3 describes the structure and details of the proposed model, Section 4 presents the experimental results and analysis, Section 5 provides a visual analysis to demonstrate the effectiveness of the proposed model, and Section 6 concludes the article and discusses future improvements.
Related works
As deep learning continues to be widely applied in natural language processing, related technologies and research algorithms are now being utilized in text classification tasks as well. The TEXTCNN [12] model uses a convolutional neural network for text classification by making some changes to the input layer of the CNN. The model primarily focuses on local text information and therefore fails to capture contextual information, which adversely affects its ability to understand the semantic meaning of the given text.
Traditional word vector models are built upon statistical approaches such as naive Bayes classifiers [3], SVMs [5], KNN classifiers [4], and similar techniques. In contrast, Word2Vec [13] is a deep learning-based word vector model, and ELMo [14], proposed by M. E. Peters et al., utilizes a bidirectional LSTM network to acquire contextual representations of words. ELMo dynamically adjusts its vector representations based on word-specific contextual information obtained from pretrained word vectors. However, ELMo handles the model loss issue by simply stacking the losses of the two directional language models together.
In 2017, Vaswani et al. [15] proposed the transformer model, which parallelizes the computation process by introducing a self-attention mechanism and enables larger models to be trained on larger datasets. Starting in 2018, a series of transformer-based pretrained language models (PLMs) emerged. BERT [16], one of the most widely used, is pretrained with two tasks, masked language modeling and next sentence prediction, and is fine-tuned on text classification to obtain models that achieved the best results at that time. Transformer-based PLMs use deeper network architectures and are pretrained on larger text corpora to learn contextual textual representations by predicting context-based words, and the "pretraining + fine-tuning" training mode [17-20] now occupies a dominant position in the NLP field. Multidimensional feature extraction in the semantic space was achieved in [21] by adding a multidimensional convolution module after the BERT pretraining model, and [22] achieved better results by using spatial attention to assist semantic matching tasks. BERT models, convolutional neural networks, and recurrent neural networks have thus made some progress in text classification. However, BERT mainly focuses on word-granularity completion learning and does not fully utilize the lexical and semantic information contained in the input data, while recurrent neural networks can capture the temporal information of text but are not sensitive to local information. In the face of these problems, we propose a variable multihead hybrid attention approach based on BERT (BVMHA) that achieves improved text classification performance with a small amount of computation.
Model introduction
BVMHA structure
Google proposed the BERT pretrained model in 2018, and BERT excelled in 11 NLP task benchmarks. BERT is built on the transformer architecture, and the transformer encoder is a key factor in its success. The structure of the encoder network contained in BERT is shown in Fig. 1.

Fig. 1. The structure of BERT.
This article proposes a text classification model called BVMHA based on the BERT pretraining model. The improvements of BVMHA over BERT include prompt templates and a multifeature semantic extraction module. The model takes a text utterance as input and constructs a complete label-sentence utterance by integrating it into a prompt. The utterance is then mapped to a word embedding matrix, to which a positional encoding is added, to form the input layer. The pretrained word vectors are fed through a variable multihead hybrid attention layer to extract semantic features from the text. The output of this layer passes through a residual normalization layer and then through a feedforward network with its own residual normalization layer, whose output serves as the input to the next layer. After n such layers, the extracted semantic spatial information passes through a linear layer that reduces the dimensionality to the number of classes, and softmax outputs the class with the maximum probability. The complete structure of BVMHA is shown in Fig. 2.

Fig. 2. The structure of BVMHA.
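To make this data flow concrete, the following is a minimal PyTorch sketch of the overall architecture under the experimental settings (hidden size 768, sentence length 128, 6 encoder layers). The hybrid attention is stood in for by an ordinary multihead attention layer so that the sketch runs on its own; all module and variable names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class BVMHAEncoderLayer(nn.Module):
    """One encoder layer: attention -> add & norm -> feedforward -> add & norm."""
    def __init__(self, hidden=768, ffn_dim=3072, num_heads=12):
        super().__init__()
        # Placeholder: in BVMHA this is the variable multihead hybrid attention
        # (self-attention fused with spatial attention).
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)        # residual normalization after attention
        x = self.norm2(x + self.ffn(x))     # residual normalization after the feedforward network
        return x

class BVMHAClassifier(nn.Module):
    def __init__(self, vocab=21128, hidden=768, max_len=128, n_layers=6, n_classes=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, hidden)      # word embedding matrix
        self.pos_emb = nn.Embedding(max_len, hidden)    # positional encoding
        self.layers = nn.ModuleList([BVMHAEncoderLayer(hidden) for _ in range(n_layers)])
        self.classifier = nn.Linear(hidden, n_classes)  # reduce to the number of classes

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        for layer in self.layers:
            x = layer(x)
        logits = self.classifier(x[:, 0])   # representation of the first ([CLS]) position
        return logits                       # softmax over these logits gives the output class
```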
Before the given word sequences are fed into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then tries to predict the masked words based on the context provided by the other, unmasked words in the sequence. From a technical point of view, the output word prediction proceeds as follows: 1. a classification layer is applied to the encoder output of the model; 2. the output vector is multiplied by the embedding matrix of the word list to convert the output dimensionality to the word-list dimensionality; and 3. softmax is used to calculate the probability of each word in the word list being the [MASK]. Prompt-based learning [23] has been widely explored in classification tasks where prompt templates can be constructed relatively easily, such as text classification [24] and natural language inference [25]. This paper incorporates a prompt during the fine-tuning stage of the BERT model to enhance text classification performance. Specifically, a sentence that includes both the text label and the prompt is constructed and used during fine-tuning. BERT inserts a [CLS] token at the beginning of the text and computes an output vector that represents the semantic representation of the entire text, which is subsequently used for classification.
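The three prediction steps can be sketched as follows; the tensors are placeholders, and the output projection reuses the word-list embedding matrix in the spirit of BERT. With the prompt described below, the same computation scores each candidate label word at the [MASK] position.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 21128, 768
embedding = nn.Embedding(vocab_size, hidden)       # embedding matrix of the word list
transform = nn.Linear(hidden, hidden)              # step 1: classification layer on the encoder output

encoder_output = torch.randn(1, 128, hidden)       # placeholder encoder output (batch 1, seq_len 128)
mask_position = 1                                  # placeholder index of the [MASK] token

h = torch.tanh(transform(encoder_output[:, mask_position]))   # step 1
logits = h @ embedding.weight.t()                  # step 2: map to the word-list dimensionality
probs = torch.softmax(logits, dim=-1)              # step 3: probability of each word being the [MASK]
predicted_word_id = probs.argmax(dim=-1)
```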
To illustrate, for a sentence such as "This item is of poor quality" with a prompt, the model transforms the sentence into "Negative is the classification of 'This item is of poor quality'". To improve upon this, we propose a new sentence construction, namely "[CLS][MASK] is the classification of 'This item is of poor quality'", where [MASK] serves as a placeholder for the word "negative". The loss between the model's prediction at the [MASK] position and the label word is computed via cross-entropy.
Our experimental results demonstrate that incorporating the prompt into the training process can effectively improve the model’s ability to learn relational features between text and labels without the need for changes to the model architecture or an increase in its computational burden. Figure 3 depicts the process of constructing a prompt.

Fig. 3. Process of constructing a prompt.
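A small sketch of the template construction is given below, with an assumed label-word mapping; the exact wording of the template follows the example above.

```python
# Assumed label words that fill the [MASK] slot during fine-tuning.
label_words = {0: "negative", 1: "positive"}

def build_prompt(sentence: str) -> str:
    """Wrap a raw input sentence into the label-sentence prompt template."""
    return f"[CLS] [MASK] is the classification of '{sentence}' [SEP]"

prompted = build_prompt("This item is of poor quality")
# -> "[CLS] [MASK] is the classification of 'This item is of poor quality' [SEP]"
# During fine-tuning, the word predicted at the [MASK] position is compared with
# the label word ("negative" here) via the cross-entropy loss.
```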
The multifeature semantic extraction module proposed in this paper consists of two parts, a variable multihead attention layer and a spatial attention layer; the module extracts different semantic and spatial information features for fusion and finally outputs a classification. The complete structure of the multifeature semantic extraction module is shown in Fig. 4.

Fig. 4. The multifeature semantic extraction module.
The feature matrices extracted by the self-attention and spatial attention modules are added and fused, and the formulas for doing so are shown below.
The input embedding matrix is linearly projected into the query (Q), key (K), and value (V) matrices that feed the self-attention computation shown in Fig. 5.

Fig. 5. Self-attention.
We describe the scaled dot-product self-attention used in each head as $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(QK^{T}/\sqrt{d_{k}}\right)V$, where $d_{k}$ is the dimensionality of the key vectors.
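For reference, a per-head implementation of this computation might look as follows (a sketch, not the exact code of the model).

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (batch, heads, seq_len, head_dim)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)              # attention weights over the sequence
    return weights @ v                                    # weighted sum of the value vectors
```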
Because the number of attention heads in BERT is static, this paper proposes a variable multihead attention mechanism, which enables the model to learn semantic features from different angles while reducing the number of redundant attention heads. The number of attention heads changes from one encoder layer to the next, as illustrated in Fig. 6.

Fig. 6. Variable multihead attention mechanism.
Experiments show that variable multihead attention can enrich the angles of feature extraction and the semantic feature space information, leading to better performance than that of static multihead attention. In this paper, the step size is set to 2, and the number of attention heads lies in the [2, 12] range. The corresponding formula is shown in Equation (6): $h_{l} = h_{1} \pm s\,(l-1)$, with the result restricted to the interval [2, 12].
In the formula, $h_{l}$ denotes the number of attention heads in the $l$-th encoder layer, $h_{1}$ is the initial number of heads, and $s$ is the step size; the plus and minus signs correspond to the increasing and decreasing strategies, respectively.
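The resulting head schedule can be sketched as follows, using our own symbol names and assuming the head count is clipped to the [2, 12] range.

```python
def heads_for_layer(layer_idx, initial_heads=12, step=2, decreasing=True, lo=2, hi=12):
    """Number of attention heads in encoder layer `layer_idx` (0-based)."""
    if decreasing:
        h = initial_heads - step * layer_idx
    else:
        h = initial_heads + step * layer_idx
    return max(lo, min(hi, h))                      # keep the head count inside [2, 12]

print([heads_for_layer(l) for l in range(6)])                                      # [12, 10, 8, 6, 4, 2]
print([heads_for_layer(l, initial_heads=2, decreasing=False) for l in range(6)])   # [2, 4, 6, 8, 10, 12]
```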
This article combines spatial attention and self-attention to construct hybrid attention. Unlike the self-attention mechanism, which focuses on global features, spatial attention makes the model focus on important task-related regions; in classification tasks, for example, spatial attention finds the important parts for processing. First, maximum pooling and average pooling are applied along the channel dimension to obtain two representation matrices, and these two matrices are spliced together. Then, a convolution layer with a sigmoid activation function produces a weight coefficient for every spatial location, which is used to reweight the input features, as shown in Fig. 7.

Fig. 7. Spatial attention.
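A sketch of this computation in the style of CBAM spatial attention is given below; the convolution kernel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):                    # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        """x: features of shape (batch, channels, height, width)."""
        max_pool, _ = x.max(dim=1, keepdim=True)          # maximum pooling over the channel dimension
        avg_pool = x.mean(dim=1, keepdim=True)            # average pooling over the channel dimension
        pooled = torch.cat([max_pool, avg_pool], dim=1)   # splice the two representation matrices
        weights = torch.sigmoid(self.conv(pooled))        # weight coefficient for every spatial location
        return x * weights                                # reweight the input features
```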
Because the word embedding matrix has only one channel, a key issue is how to introduce spatial attention in a way that suits the data characteristics of text. Inspired by the multihead self-attention mechanism, this model uses the multiple attention heads as channels so that spatial attention can operate over the channel dimension. Here, the outputs of the individual attention heads are stacked along the channel dimension before the spatial attention operation is applied, as shown in Fig. 8.

Fig. 8. Spatial attention operations applied to a text embedding.
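A sketch of this adaptation and of the additive fusion of the two feature matrices follows, reusing the SpatialAttention module from the previous sketch; the shapes and the fusion operator are assumptions consistent with the description above and with the experimental settings.

```python
import torch

batch, heads, seq_len, head_dim = 2, 12, 128, 64
spatial_attention = SpatialAttention()                   # module from the previous sketch

# Per-head self-attention output, with the attention heads acting as channels.
self_attn_out = torch.randn(batch, heads, seq_len, head_dim)

# Spatial attention weighs every (position, feature) location of the head "channels".
spatially_weighted = spatial_attention(self_attn_out)

# Additive fusion of the two feature matrices, then the heads are merged back
# into the hidden dimension for the following residual normalization layer.
fused = self_attn_out + spatially_weighted
fused = fused.transpose(1, 2).reshape(batch, seq_len, heads * head_dim)   # (batch, seq_len, 768)
```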
Experimental data sources and dataset construction
Three datasets are used in this experiment, and they are introduced as follows.
ChnSentiCorp is a Chinese sentiment analysis dataset containing online shopping reviews of hotels, laptops, and books.
SST-2 (the Stanford Sentiment Treebank) is a single-sentence classification task containing human annotations of sentences from movie reviews and their sentiments. The task is to determine the sentiment of a sentence. The categories are of two types, positive sentiment (label 1) and negative sentiment (label 0), and only sentence-level labels are used; that is, the task is a binary classification of sentences into positive and negative sentiment.
COLDataset [26] is a Chinese insulting language dataset. The dataset contains a total of 37,516 sentences.
The results obtained on the dev set are used for model parameter tuning, finding good hyperparameters, and simulating the test set to prevent the model from overfitting on the training set. The results obtained on the test set are used to evaluate the performance of the model.
We use 9600 sentences from ChnSentiCorp for training, 1200 for validation, and 1200 for testing. We use 67350 sentences from SST-2 for training, 873 for validation, and 1821 for testing. We use 25762 sentences from COLDataset for training, 6431 for validation, and 5323 for testing. Details of all the datasets are shown in Table 1.
Table 1. The statistics of the utilized datasets
Experimental environment: an Ubuntu 16.04 system with an Nvidia Quadro P5000 GPU.
Experimental parameters: the experiments used the PyTorch deep learning framework (version 1.11.0) with an embedding dimension of 768 for each word, a total word-list size of 21128, a sentence length of 128, 6 encoder layers, 12 attention heads, and a matrix initialization range of 0.02. The model used the cross-entropy function as the loss function, and the AdamW optimizer was used to update the parameters. Fine-tuning ran for 20 epochs, the batch size was set to 64, the learning rate to 2e-5, and the gradient clipping max grad norm to 10. The reported results are averages over five runs with different random seeds.
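A brief sketch of this optimization setup is given below; the synthetic data and the classifier from the earlier architecture sketch are placeholders for the real datasets and model.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

model = BVMHAClassifier()                          # placeholder: the classifier sketched earlier
optimizer = AdamW(model.parameters(), lr=2e-5)     # AdamW with a learning rate of 2e-5
criterion = nn.CrossEntropyLoss()                  # cross-entropy loss

# Placeholder data: random token ids and binary labels instead of a real dataset.
input_ids = torch.randint(0, 21128, (256, 128))
labels = torch.randint(0, 2, (256,))
train_loader = DataLoader(TensorDataset(input_ids, labels), batch_size=64, shuffle=True)

for epoch in range(20):                            # 20 fine-tuning epochs
    for batch_ids, batch_labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_ids), batch_labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)   # gradient clipping
        optimizer.step()
```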
Loss function and evaluation index
Loss function: we use the cross-entropy loss function to measure the deviation of the predicted values from the actual values. The cross-entropy loss function is shown in Equation (7): $L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\log \hat{y}_{ic}$.
In the formula, $y_{ic}$ is the true label indicator of sample $i$ for class $c$, $\hat{y}_{ic}$ is the corresponding predicted probability, $N$ is the number of samples, and $C$ is the number of classes.
The evaluation indices are the classification accuracy (ACC) and the F1 score.
Experimental results are reported on both the dev set and the test set. The purpose of evaluating the dev set is to show the training effect of the model during the training process, while the results on the test set show the final performance of the model with the optimal hyperparameters.
To verify the effectiveness of the BVMHA model, this paper uses TEXTCNN, FastText [27], DPCNN [28], BERT-CNN, BERT (bert-base-cased, bert-wwm-chinese), Albert-base [18] (Albert-base-v2, Albert-chinese-base), Albert-xlarge, RoBERTa [17], MacBERT [29], ERNIE2.0 [30], and ERNIE3.0 [31] for comparative experiments. ERNIE2.0 introduced external knowledge into training, and ERNIE3.0 was the first to introduce large-scale knowledge graphs into pretrained models with tens of billions of parameters to pretrain large-scale knowledge-enhanced models. ERNIE3.0 is significantly different from ERNIE2.0 in terms of its training method and the knowledge it introduces. We add a comparison experiment involving ERNIE to verify our first contribution.
From Tables 2 to 5, it can be seen that the proposed model achieves excellent results in all comparative experiments. It outperforms RoBERTa, the best performer among the baseline models, on both the ChnSentiCorp dev set and test set and outperforms RoBERTa and MacBERT on the COLD dev set. It also achieves excellent performance on SST-2. Compared with BERT, the proposed model likewise achieves improved results.
Table 2. Experimental results obtained on the dev set
Table 3. Experimental results obtained on the test set
Table 4. Floating-point operations
Table 5. Iteration speed of each model (iterations/s)
Although this model does not achieve the best performance on SST-2, its ACC and F1 values are only 0.7% and 0.5% lower, respectively, than those of the best-performing RoBERTa model. A detailed analysis shows that the model does not surpass RoBERTa because RoBERTa has a larger number of parameters and more training data, and its training takes an order of magnitude longer than BERT's. However, the parameters of the proposed model amount to only 50% of RoBERTa's, and its number of calculations is only 49% of RoBERTa's.
Compared with ALBERT-base, which has a smaller number of parameters, BVMHA reduces the number of hardware calculations by 50.1%.
Because the ERNIE series comprises Chinese task models, there are no experimental results for the English dataset SST-2. Comparing the experimental results of ERNIE 2.0 and ERNIE 3.0, we find that the performance of BVMHA on ChnSentiCorp and COLD is on average 1% lower than that of the ERNIE series, which reflects the advantage of introducing knowledge: by introducing Chinese knowledge, the pretrained model can greatly improve its ability to extract semantic features and understand Chinese semantics in Chinese classification tasks.
However, the parameters of BVMHA amount to only 50.85% of ERNIE's, and its number of calculations is only 36.55% of ERNIE's. These substantial reductions in the number of parameters and the amount of computation make the average 1% drop in classification performance an acceptable and competitive trade-off.
The experimental results show that the proposed model can achieve the same level of performance as the baseline models while achieving significant improvements in its numbers of parameters and calculations. The large reductions in parameter count and computation, together with the faster iteration speed, are acceptable given the small reduction in classification performance.
To assess the impact of each module on the performance of the developed model, three ablation settings are tested and compared: removing the prompt, removing the spatial attention module, and removing the variable multihead attention mechanism. The experimental results are as follows.
It can be seen from Table 6 that deleting the semantic prompt words from the full model decreases the accuracy by 0.79% and the F1 value by 0.81%. Deleting the spatial attention module reduces the accuracy on the test set by 0.7% and the F1 value by 0.8% compared with the full model. Deleting the variable multihead attention mechanism reduces the accuracy on the test set by 1.52% and the F1 value by 1.48%. These experiments prove the effectiveness of each introduced module.
Table 6. Ablation experiment
To discuss the correlation between the variable multihead attention mechanism and the number of encoder layers, this paper compares two strategies: increasing and decreasing the number of heads with the encoder layer index. The increasing strategy involves an increase from 2 to 12 heads, and the decreasing strategy entails a decrease from 12 to 2 heads. The results of the two strategies are shown below. It can be seen in Table 7 that when the number of heads decreases, the model performs better, with a 0.2% improvement over the increasing strategy. When the initial number of heads is large, semantic features can be captured from more angles, and the model performs better at capturing low-level features. In the later layers, as the attention stack deepens, the model captures global sentence-level semantic information; reducing the number of redundant heads increases the feature dimensionality of a single head and improves the expressive ability of the model.
Table 7. Experiment in which the number of attention heads is changed
We set different numbers of encoder layers to study the effect of the number of layers on the performance of the model. As shown in Fig. 9, the model with 6 layers has the best performance on the COLD test set. When the number of encoder layers is set to 6, the model maintains fewer parameters and achieves better performance.

Fig. 9. Effects of different numbers of layers on model performance.
This paper further discusses the impact of parameter sharing on the lightweight model. This experiment further reduces the number of parameters by adopting cross-layer parameter sharing; in the results below, length represents the length of the input sentence.
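Cross-layer parameter sharing can be sketched as follows, in the style of ALBERT: one encoder layer instance is reused for every layer, so all layers share a single set of weights. The layer class is the one from the earlier architecture sketch.

```python
import torch.nn as nn

class SharedEncoder(nn.Module):
    """All encoder layers reuse one set of weights (cross-layer parameter sharing)."""
    def __init__(self, hidden=768, n_layers=6):
        super().__init__()
        self.shared_layer = BVMHAEncoderLayer(hidden)   # parameters of a single layer only
        self.n_layers = n_layers

    def forward(self, x):
        for _ in range(self.n_layers):                  # the same layer is applied repeatedly
            x = self.shared_layer(x)
        return x
```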
From Table 8, we can see that cross-layer parameter sharing further reduces the number of model parameters by 60% at a performance cost of 2%-3%. When the sentence length is set to 64, the training speed of the shared-parameter model increases by 25.3%. Although parameter sharing can greatly reduce the number of parameters in the model, it weakens the model's expressive ability.
Table 8. Parameter sharing ablation experiments
Visual analysis
In this section, we provide a visual analysis of the hybrid attention module. The results of the visualization experiment are as follows. Figure 10 is a heatmap of BERT's attention scores with seq_len equal to 128, Fig. 11 is a heatmap of BVMHA's attention scores with seq_len equal to 128, and Fig. 12 is a heatmap of BVMHA's attention scores with seq_len equal to 64.
The input text is "[CLS] This book is really good [SEP] terrible [SEP]".
Because of the introduction of spatial attention, the attention weights obtained by the hybrid attention module attend to contextual information at the visual level. Compared with BERT, this model more accurately captures the key regions of the text and the related information. For example, in the example sentence, the negative emotional words related to the label are the focus of attention, so from the perspective of visual perception, the degree adverbs related to "really terrible" are given higher attention weights. Even though seq_len is reduced by half, BVMHA is still able to focus on the words expressing emotion more effectively than BERT with seq_len equal to 128. The visualization results further prove that introducing spatial attention improves the model's ability to extract spatial information from semantic features.

Fig. 10. BERT self-attention score visualization.

Fig. 11. BVMHA attention score visualization with seq_len equal to 128.

Fig. 12. BVMHA attention score visualization with seq_len equal to 64.
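Heatmaps such as those in Figs. 10-12 can be produced along the following lines; the attention matrix and the token list are placeholders rather than the actual experimental values.

```python
import matplotlib.pyplot as plt
import torch

# Placeholder tokens and averaged attention weights for a short example sentence.
tokens = ["[CLS]", "this", "book", "is", "really", "terrible", "[SEP]"]
attn_scores = torch.softmax(torch.randn(len(tokens), len(tokens)), dim=-1)

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(attn_scores.numpy(), cmap="viridis")   # heatmap of the attention scores
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```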
Conclusion
In this paper, we combine spatial attention and a variable multihead attention mechanism to propose a BERT-based variable multihead hybrid attention text classification model (BVMHA). BVMHA can achieve the same text classification performance as the baseline models while significantly reducing the numbers of parameters and calculations. In our forthcoming research, we intend to delve deeper and seek more effective multifeature semantic extraction methods.
Acknowledgments
This work was supported by the Young Scientists Fund of the Autonomous Region Science and Technology Program (2022D01C83), Natural Science Foundation of Xinjiang Uygur Autonomous Region (2021D01C077), and Science and Technology Plan Project of the Xinjiang Uygur Autonomous Region (2022NC192, 2021B01002).
