Sage Journals: Discover world-class research

Abstract

Most aspect-level sentiment classification networks include the long short-term memory (LSTM) network, coupled with attention mechanism and memory module, is becoming widely applied in aspect-level sentiment classification. Although it has achieved good results, it cannot extract the global and local information of the context at the same time, and it is only based on the semantic relatedness between an aspect and its corresponding context words to model, while neglecting their syntactic dependencies. This paper proposes the aspect-level sentiment classification by combining convolutional neural network (CNN) and proximity-weighted convolution network (PWCN), as well as a new method to calculate the proximity weight. To obtain contextualized word vectors, corpora has been trained by the model of bidirectional encoder representations from transformers (BERT), which can be taken as text features. The CNN is able to extract sequence features from the text and to take the sequence information from the text into account. In addition, the PWCN can consider the syntactic dependencies inside the sentences. The BERT model also has the ability to model complex features of words, such as their syntactic and semantic changes in a linguistic context. Experiments conducted on the SemEval 2014 benchmark demonstrate compared to the well-established ones, the proposed approach had bigger effectiveness.

Keywords

PWCN CNN syntax-awareness BERT aspect-level sentiment classification

Introduction

The aspect-level sentiment classification (also known as the aspect-based sentiment classification). It is a fine-grained sentiment classification task, which aims to identify the polarity of a certain aspect in a particular context, that is, a comment or a review.¹ For instance, for the sentence “the price is reasonable enough the service is poor,” the words “price” and “service” are the aspect ones, and the attitudes to “price” and “service” are positive and negative, separately.

Additionally, there may be totally opposite emotional polarities for different specific aspects in one sentence, the case in point is a positive word “happily,” it may express negative emotions in some specific contexts: Don’t bother me. I’m living happily ever after. So analyzing the specific emotional polarity toward individual aspects can be more effective to help people understand the emotional expression of users, leading to more and more attention in the field. Early works in aspect-level sentiment classification were mainly based on extracting defined features manually from a statistics perspective and adopting machine learning, such as support vector machine, conditional random field, etc.^2,3 The feature quality carries a big weight in the performance of these models, and feature engineering is labor intensive.

Currently, with the maturity of attention mechanism and memory network. More and more such methods have been successfully used in aspect-based sentiment classification.⁴ Despite the effectiveness of these approaches, the syntactic relationship between the aspect and its context is neglected, this may lead to an undesirable result that the aspect attends on contextual words that are descriptive of other aspects. To overcome this Zhang et al.⁵ proposed a convolution network weighing on proximity to provide an aspect-specific representation of contexts that was syntax-aware. Although PWCN is effective in extracting syntactic information and the global emotion information, they cannot capture important local emotion information in sentences. In particular, the linear function cannot accurately describe the position proximity. Also, in a complex sentence, it is possible that each aspect is only related to its adjacent context. It is necessary to estimate the influence scope of each aspect before identifying its sentiment polarity. Therefore, a better language representation model is needed to generate more accurate semantic expressions. Word2Vec⁶ and GloVe⁷ have been widely used to converts words into real numerical vectors. However, there is a problem with these two. Actually, words may differ in meanings in different contexts, and while the targeted sentences are in different languages, the vector representations in the context are the same. An enhancement for them is ELMo,⁸ but it is imperfect since it applies LSTM⁹ in its language model. There are two main problems in LSTM.¹⁰ The first issue is that it is unidirectional, meaning it works by reasoning in order. Even the BiLSTM¹¹ bidirectional model is just a simple addition at the loss, resulting in it being unable to consider the data in the other direction. The other problem is that it is a sequence model. In other words, during its processing, one step cannot proceed until the previous one has completed, leading to poor capability of parallel calculation. The theory behind BERT¹² is similar to ELMo, but the former uses transformer for encoding. When predicting words, a two-way synthesis takes the context characteristics into account, exhibiting remarkable parallel properties. Therefore, we used BERT model to train word vectors in this paper. Inspired by the limitations noted, we propose an aspect-level sentiment classification model by combining convolutional neural network (CNN) and proximity-weighted convolution network (PWCN). The experiments on SemEval 2014 Datasets clearly showed that our model achieved a state-of-art performance.

The major contributions of this paper are summarized as the following:

The PWCN model cannot highlight local features, which plays a significant role in sentiment classification. Thence, an aspect-specific syntax-aware context representation is proposed, which is primarily abstracted by convolutional neural network and bidirectional LSTM, and further strengthened by a proximity-weighted convolution. Results showed that syntactic dependency is effective to improve the outcomes of aspect-level sentiment classification.

The Gaussian function is introduced to substitute the position proximity, so as to better evaluate the proximity weight. The contextual words’ proximity to the aspect is better described, and the model’s performance is further improved.

We use pretrained BERT embeddings to represent the context, which is able to capture obvious word differences such as polysemy. Furthermore, these context-sensitive word embeddings also retrieve other forms of information, which may assist in producing more accurate feature representations and improving model performance.

The rest of this paper is arranged as follows: In Section 2, we review related work. In Section 3 we describe our BERT-CNN-PWCN model in detail. Section 4 is the description of the framework of our model. The experimental results are given and analyzed, which of them prove that our algorithm is effective in Section 5. And finally, we will give a summary to this work.

Related work

In this section, we review related work as follows: First, we discuss the particularities of aspect-based sentiment classification and existing related methods. Secondly, we present recent neural networks for aspect-based sentiment classification. Thirdly, we present some Syntactic Dependency Method for Aspect-Level Sentiment Classification.

Aspect-level sentiment classification

Sentiment analysis (SA) is an hot topic in the area of text mining, which is the calculation of views, sentiments and subjectivity in a text.¹³ There are three levels of granularity in opinion mining, namely document-level, sentence-level, and aspect-level.¹⁴ When a document or a sentence involves multiple emotional expressions, emotional analysis on the former two levels will not be able to accurately extract the deep feelings within the text.

Different from other granularity levels in SA, the emotional polarity, which is on different aspects in the sentence, need to be determined in the aspect-level sentiment classification, and it is dependent not only on the context information, but also the emotional information of different aspects. At present, there are three mainstream methods: lexicon-based, traditional machine learning, and deep learning methods.¹⁵ In the first, Deng et al.¹⁶ proposed a novel hierarchical supervision topic model to construct a topic-adaptive sentiment lexicon (TaSL) for higher-level classification tasks. Federici and Dragoni¹⁷ constructed an aspect-based opinion mining system based on the use of semantic resources for the extraction of the aspects from a text and for the computation of their polarities. Kang et al.¹⁸ proposed a Bayesian inference method to explore the latent semantic dimensions as contextual information in natural language. These methods are easy to understand and use, but their accuracy is not high, and there are problems of dimension explosion and gradient disappearance, which limit their wide application in practice. Secondly, such as those proposed by Jiang et al.,² Marcheggiani et al.,³ which use the support vector machine, conditional random field, etc. The feature quality carries a big weight in the performance of these models, and feature engineering is labor intensive. The third category, deep learning, is receiving increasing attention.

Deep learning for aspect-level sentiment classification

In the late years, more and more techniques, which adopt the deep learning have been integrated into natural language processing (NLP) tasks.¹⁹ Compared with traditional machine learning, they achieves better results in aspect-level emotion classification. Zhou and Long²⁰ proposed a method, which combined CNN with BiLSTM models to analysis the Chinese product reviews. Xue and Li²¹ reported a more accurate and productive model, which was combined convolutional neural networks with gating mechanisms. Dong et al.²² used adaptive recursive neural network to classify the target-dependent sentiment on Twitter. Vo and Zhang²³ applied sentiment lexicons, together with distributed word representations and neural pooling to improve the capability of sentiment analysis. Ma et al.²⁴ built a neural architecture for targeted aspect-based sentiment analysis, while being able to incorporate important commonsense. The performance of these conventional neural models have been more outstanding than traditional machine learning in aspect-level sentiment classification. However, they could only capture context information in an implicit way, leading to imperfections in explicitly, which excludes some important context clues in an aspect.

Attention mechanism and memory network for aspect-level sentiment classification

Currently, with the maturity of attention mechanism and memory network. More and more such methods have been used in NLP and had a good effect, such as machine translation,²⁵ with an improved performance compared to previous approaches. In this field, the generation of representations can be mutually influence by the target and context. For instance, Wang et al.²⁶ applied an attention-based LSTM network to aspect-level sentiment classification. Long et al.²⁷ proposed a BiLSTM based multi-headed attention mechanism, which is integrated into a crossing model of text sentiment analysis. Lin et al.²⁸ built a brand-new framework for aspect-level sentiment classification, which was a deep mask memory network based on semantic dependency and context moment. Zhang et al.²⁹ designed a convolutional multi-head self-attention memory network for the same task. Ma et al.³⁰ developed an interactive attention network (IAN) model derived from LSTM networks and attention mechanism. However, syntactic relations among the aspect with its context words are generally neglected in these studies, which may stunte the validity of aspect-based context representation. Besides, the aspect of sentiment polarity is normally hinged on a key phrase.³¹ Zhang et al.⁵ proposed a convolution network weighing on proximity to provide an aspect-specific representation of contexts that was syntax-aware. However, the long-distance dependencies in the text sequence is only taken into account by this network, therefore the effect of capturing local features is not ideal.

Even though the proposed model is inspired by PWCN, it differs in three main respects. First, it applies convolution neural network to PWCN to extract local emotion information in a text, whereas PWCN uses BiLSTM to obtain the global sentence information, but cannot effectively extract the local aspect information. CNN is forced to capture local information through the use of local receiving domain, shared weight, etc. Therefore, CNN has a strong ability to capture local information. The model adds with convolution neural network can obtain a more comprehensive aspect level text information representation. Secondly, the proposed model use a Gaussian function to calculate the proximity weight for PWCN. As it is experimentally demonstrated that Gaussian function is more consistent with proximity weight than linear function. Thirdly, it applies pretrained BERT embeddings to word embedding to represent the context, whereas PWCN uses Glove. The BERT embeddings is able to capture obvious word differences such as polysemy, but Glove cannot.

The proposed model

In this section, we will introduce BERT-CNN-PWCN, which is to improve the performance of aspect-level sentiment classification. Figure 1 shows a high-level description of our proposed model. First, datasets are pre-processed by BERT as word embeddings. After that, the BERT output is fed into a CNN layer to extract local features. Then BiLSTM is applied to obtain contextual information. The hidden-state representation is used to predict sentiment polarity, which can make further efforts to be enhanced by proximity-weighted convolution.

Figure 1.

The architecture of the proposed model.

The model structure

Word embedding

The first stage to deal with the natural language tasks for computers is to convert the text into data, so that they can understand it. Text representation is the basis in NLP, which plays an essential part in the performance of the NLP system as a whole, and text vectorization is an important way to achieve it. Through text vectorization, each word is transformed to a real numerical vector. Generally speaking, there are two ways to represent: discrete and distributed representations. The distributed representation is one-hot encoding, which is high in the number of dimensions and low in density, being unable show the direct correlation between words. In addition, such a high-dimensional vector will seriously downregulate the calculation speed. This paper adopts the word embedding, which is a distributed representation, to avoid those defects. Word embedding can integrate the one hot high-dimensional representation of the original word into a continuous vector space of lower dimensions, and the two word embedding algorithms adopted in our model will be introduced afterward.

GloVe

In order to combine the merits of global matrix factorization and local context windows, Pennington et al. introduced the GloVe,⁷ which can efficiently leverages statistical information by training only on the non-zero elements in a word-to-word co-occurrence matrix rather than on the entire sparse one or on individual context windows in a large corpus. In the following equations, $X_{ij}$ stands for the matrix of word-to-word co-occurrence counts, where j is the frequency of the center word i occurs in the window content. $X_{i}$ is the sum of the number of times that all words appear in the context of word $i$ . Finally, $P_{ij}$ is served to express the possibility that word j occurs in the context of word i.

X_{i} = \sum_{k} X_{ik}

(1)

P_{ij} = P (j | i) = \frac{X_{ij}}{X_{i}}

(2)

We can easily find out by examples that the ratio $P_{ik} / P_{jk}$ is dependent on the words i, j, and k. By setting the unknown function $F$ , the above ratio can be obtained after operating the word vector, and the details are as follows:

F (w_{i}, w_{j}, {\tilde{w}}_{k}) = \frac{P_{ik}}{P_{jk}}

(3)

Where $w_{i}, w_{j}, w_{k} \in R^{d}$ are the word vectors corresponding to words i, j, and k, respectively. After a series of rigorous derivations that is detailed description in Xue and Li,²¹ we result in the model:

J = \sum_{i, j = 1}^{V} f (X_{ij}) {(w_{i}^{T} {\tilde{w}}_{j} + b_{i} + {\tilde{b}}_{j} - \log X_{ij})}^{2}

(4)

f (x) = {\begin{matrix} {(x / x_{max})}^{α} & if x < x_{max} \\ 1 & otherwise \end{matrix}

(5)

Where the empirical motivation value of $α$ is 3/4 and $f (x)$ is a weighting function.

Bert

BERT word embedding is based on the model of deep bidirectional transformers.¹² Compared with the conventional GloVe-based embedding layer, BERT can not only consider the context features in a comprehensive manner, but also eliminate polysemy. Consider the input data is denoted by $x$ , and $H$ is the embeddings generated after processing $x$ with BERT.

H = BERT (x)

(6)

The BERT-pretrained word embedding will act as the input for subsequent tasks. Apart from the effective semantic feature extraction in the text, it could also improve the performance of the subsequent steps.

Convolutional neural network

Convolutional neural network (CNN, or ConvNet) is among the most important innovations in the computer vision community. Its function is to extract relevant features from sequences.³² In recent years, CNN has been proved capable to solve many problems in NLP, and has produced unexpected results. In our model, the word vector matrix is adopted as the inputs for the CNN, which are further applied used as the inputs for the BiLSTM. Because the very important information about word position will be lost in the pooling layer in NLP, we applied convolutional neural network without pooling layer.

Convolutional Layer is where a word vector is convolution operated with a specific feature detector containing $M$ filters. A sliding convolution kernel of width $k$ can extract k-grams feature in a word vector. The feature map can be generated from a filter $F_{m} (1 \leq m \leq M)$ as follows:

y_{i}^{m} = f (X_{i : i + k - 1} \otimes w^{m} + b^{m})

(7)

Where ⊗ represent the convolution, and the symbol of the weight matrix is denoted by $w^{m} \in R^{k \times d}$ , which corresponds to the filter $F_{m}$ ; $b^{m}$ stands for the bias, and $X_{i : i + k - 1}$ denotes the sliding matrix window consisting rows of $i$ to $i + k - 1$ in the input matrix; $f$ is a non-linear activation function, and here we choose RELU for it. The symbol $y_{i}^{m}$ stands for the feature map generated from filter $F_{m}$ , where is the $i$ -th element of $y^{m}$ .

Bidirectional LSTM

$h_{t}$ is the vector of the hidden layer at time $t$ , and $σ$ stands for the sigmoid function. The symbol ⊙ represents element-wise multiplication.

The advantage of bidirectional LSTM over its standard counterpart is its capability to pick up the information from not only the past but also the future. It is comprised of both the forward and backward neural networks, which are responsible for memorizing the past and future information, respectively, promoting text analysis. And both neural networks are concatenated to form the final output $h$ :

h = [\vec{h_{t}}, \overset{\leftarrow}{h_{t}}]

(8)

Proximity-weighted convolution network

The proximity-weighted convolution network (PWCN) apply the proximity weight, which is the syntactic proximity of context word to the aspect, to calculate its importance in the sentence, followed by inputting it into a convolutional neural network to obtain n-gram information.

Proximity weight

According to the context representation between the semantic and their corresponding aspect in the component words, their proximity weights are the main way to obtain contextual representations. However, this approach neglects syntactic information, which may reduce the models’ effectiveness. The proximity methods focusing on position and dependency proximity formalize syntactical dependency information as the proximity weight, describing the contextual words’ proximity to the aspect.⁵ Moreover, we propose to use Gaussian function to calculate the proximity weight.

Position proximity

The words around the word of one aspect are generally used to describe their core word. Therefore, their position information is regarded as an approximate measure of syntactic proximity. The final representations are given as the following:

p_{i} = {\begin{matrix} 1 - \frac{τ - i}{n} & 0 \leq i < τ \\ 0 & τ \leq i < τ + m \\ 1 - \frac{i - τ - m + 1}{n} & τ + m \leq i < n \end{matrix}

(9)

Where $τ$ is the start token of the aspect term, and $p_{i}$ denotes the proximity weight. The weight reduction is linear to the distance from the word itself to the nearest aspect boundary.

Dependency proximity

Syntax dependency parsing tree is used to calculate the distance between words in our approach. The dependency proximity weights are calculated by:

p_{i} = {\begin{matrix} 1 - \frac{d_{i}}{n} & 0 \leq i < τ & or & τ + m \leq i < n \\ 0 & τ \leq i < τ + m \end{matrix}

(10)

Where $d_{i}$ is the tree-based distance that all words in respect of the aspect term in a sentence.

Gaussian proximity

The former two methods are linear. However, the contextual words’ proximity to the aspect is actually nonlinear, which may lead to false weight results and loss of information. On the other hand, the curve for Gaussian distribution is bell-shaped, and as a value move closer to the center, it becomes larger, and vice versa. This excellent pattern can effectively prevent interfering noises in the information, conforming nonlinear characteristics of position information. Therefore, Gaussian function is a desirable mode for weight distribution. The Gaussian proximity weights of the sentence are defined as:

p_{i} = {\begin{matrix} \frac{1}{\sqrt{2 π}} e^{- \frac{{(i - \frac{τ + τ + m}{2})}^{2}}{2}} & 0 \leq i < τ & and & τ + m \leq i < n \\ 0 & τ \leq i < τ + m \end{matrix}

(11)

Proximity-weighted convolution

Proximity-Weighted Con-volution is actually a 1-dimensional convolution with a kernel of length l and its proximity weight assigned in advance. The weight $r_{i}$ , which is used to denote the i-th word of the sentence representation, can be retrieved as:

r_{i} = p_{i} h_{i}

(12)

The convolution is performed as:

t = ⌊ \frac{l}{2} ⌋

(13)

q_{i} = max (W_{c}^{T} [r_{i - t} \oplus \dots \oplus r_{i} \oplus \dots \oplus r_{i + t}] + b_{c}, 0)

(14)

Where $q_{i} \in R^{2 d_{h}}$ is the output from the convolution layer. The features that are instructive to classification usually occupy a small proportion of the general matrix, and a 1-dimensional max-pooling layer with a kernel of length $n$ can help us filter out the most obvious features, denoted as:

q_{s} = \begin{matrix} {[\begin{matrix} max_{0 \leq i < n} & q_{i, j} \end{matrix}]}^{T} & 0 \leq j < 2 d_{h} \end{matrix}

(15)

Where $q_{i, j}$ denotes the $j - th$ part of $q_{i}$ .

The filtered representation for the most obvious feature $q_{s}$ , which will act as the input to a fully connected layer, is processed afterward by a softmax layer to acquire the conditional probability distribution of sentiment polarity $y$ :

y = softmax (W_{f}^{T} q_{s} + b_{f})

(16)

Where $b_{f}$ and $W_{f}$ are the weighted matrices corresponding of the fully connected layer.

The standard gradient descent algorithm is adopted to train our model, where the cross-entropy loss with $L_{2}$ regularization is defined as the loss function:

L = - \sum_{(S, \overset{\land}{y}) \in D} \sum_{u} {\overset{\land}{y}}_{u} \log y_{u} + λ {‖ Θ ‖}_{2}

(17)

Where $\overset{\land}{y}$ is a one-hot label, $D$ stands for the training dataset, $Θ$ represents all parameters that is trainable in our model, and $λ$ denotes the hyper-parameter for $L_{2}$ regularization.

Experiment

The datasets and experimental environment

In this paper, for the sake of illustrating the performance of our model, which was tested on two real-world datasets from SemEval 2014,³³ including about the Laptop and Restaurant. Moreover, Accuracy and Macro-Averaged F1 Metrics were adopted to verify improvement over the models of comparison.

Precision (P):

P = \frac{TP}{TP + FP}

(18)

Recall (R):

R = \frac{TP}{TP + FN}

(19)

F1-Score (F1):

F 1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2 (TP + FP) (TP + FN)}{TP}

(20)

Accuracy (Acc):

Acc = \frac{TP + TN}{TP + FP + FN + TN}

(21)

Macro-Precision (Macro-P):

Macro - P = \frac{1}{n} \sum_{1}^{n} P_{i}

(22)

Macro-Recall (Macro-R):

Macro - R = \frac{1}{n} \sum_{1}^{n} R_{i}

(23)

Macro-Recall (Macro-R):

Macro - F = \frac{1}{n} \sum_{1}^{n} F_{i} = \frac{2 \times Macro - P \times Macro - R}{Macro - P + Macro - R}

(24)

In our experiments, we used the GloVe⁷ word vector and the pre-trained language model word representation BERT¹² as the word embedding methods, the dimensionalities of which were 300 and 768, respectively. The Adam³⁴ algorithm was adopted as the optimizer for the cross entropy loss function with a learning rate of 0.01. The value of L2-regularization weight was 0.0001 and batch size was 8. The convolutional neural network using 512 filters, and the filter window size was 3 (Table 1).

Table 1.

Classification result matrix.

Categories	Real value
Categories	Class 1	Class 2
Class 1	True class1(T1)	False class2(F1)
Class 2	False class2(F2)	True class2(P2)

Model comparison

We evaluate the performance of this model by comparing with six baseline models, which were:

LSTM: only the last hidden state vector was applied to obtain the sentiment analysis.³⁵

BiLSTM: consisting of two standard LSTMs, its output can be obtained from forward and backward scannings of the text.¹⁹

TD-LSTM: an extension on long short-term memory (LSTM) by incorporating target information. Such target-dependent LSTM approach models the relatedness of a target word with its context words, and selects the relevant parts of contexts to infer the sentiment polarity toward the target.³⁶

TC-LSTM: a method to extend TD-LSTM by incorporating an target connection component, which explicitly utilizes the connections between target word and each context word when composing the representation of a sentence.

ATAE-LSTM: append position-aware influence vectors into the hidden representations of the context words on top of LSTM layer. Finally, the hidden state vector is used to generate the final vector representation.³⁷

IAN: a method based on long-short term memory networks (LSTM) and attention mechanism. IAN utilizes the attention mechanism associated with a target to get important information from the context and compute context representation for sentiment classification.

MemNet: apply attention multiple times on the word embedding, so that more abstractive evidences could be selected from the external memory. The output of the last attention layer is fed to a softmax layer for predictions.

CABASC: a novel content attention based aspect based sentiment classification model, with two attention enhancing mechanisms: sentence-level content attention mechanism is capable of capturing the important information about given aspects from a global perspective, whiles the context attention mechanism is responsible for simultaneously taking the order of the words and their correlations into account, by embedding them into a series of customized memories.³⁸

AoA: approach models aspects and sentences in a joint way and explicitly captures the interaction between aspects and context sentences. With the AOA module, the model jointly learns the representations for aspects and sentences, and automatically focuses on the important parts in sentences.²⁹

PWCN: the connection between a context word and aspect in syntax was used to obtain its importance and integrate it into a convolution network.⁵

Experimental results

Model comparison results on the datasets are shown in Table 2, which is obtained by randomly initializing the average performance of three runs since the performance fluctuates with random initialization. We observe that the BERT-CNN-PWCN-Gauss model get the best performance, especially in laptop reviews. The model in this paper is the most similar to PWCN-Pos and PWCN-Dep, with CNN and Gaussian Proximity added and the utilization of BERT on the pre-training word vector. Compared with PWCN-Dep or PWCN-Pos, our approach improve the accuracy in laptop dataset and Macro-F1 dataset by 0.1125, 0.149, and 0.1537, respectively. The three PWCN approaches, namely BERT-CNN-PWCN-Pos, BERT-CNN-PWCN-Dep, and BERT-CNN-PWCN-Gauss, outperform BERT-PWCN-Pos, BERT-PWCN-Dep, and BERT-PWCN-Gauss by about 0.0006, 0.0081, and 0.0076 on accuracy, respectively. The same three approaches reveal better accuracy for 0.0215, 0.0516, and around 0 on Macro-F1 dataset, suggesting that CNN-based PWCN model is better than PWCN model alone.

Table 2.

Experimental results.

Method		Laptop		Restaurant
Method		Accuracy	Macro-F1	Accuracy	Macro-F1
Baselines	LSTM	0.5144	0.4384	0.6187	0.4720
	BiLSTM	0.5203	0.4563	0.6396	0.5086
	TD-LSTM	0.5865	0.5182	0.6589	0.5139
	TC-LSTM	0.5599	0.5094	0.6536	0.5400
	ATAE-LSTM	0.5771	0.4988	0.6598	0.5307
	IAN	0.5411	0.4871	0.6714	0.5386
	MemNet	0.6006	0.5363	0.6768	0.5434
	CABASC	0.5771	0.5102	0.6607	0.5173
	AoA	0.5379	0.4746	0.6367	0.5117
	PWCN-Pos	0.5360	0.4070	0.6442	0.5677
	PWCN-Dep	0.5361	0.4023	0.7008	0.5881
Proposed models	BERT-PWCN-Pos	0.5369	0.4395	0.6508	0.6396
	BERT-CNN-PWCN-Pos	0.5429	0.4610	0.6553	0.6396
	PWCN-Gauss	0.6213	0.5375	0.7257	0.5638
	BERT-PWCN-Gauss	0.6410	0.5559	0.7500	0.6036
	BERT-CNN-PWCN-Gauss	0.6486	0.5560	0.7511	0.6118
	BERT-PWCN-Dep	0.5364	0.4148	0.7510	0.6458
	BERT-CNN-PWCN-Dep	0.5445	0.4709	0.7539	0.6508

Accuracy and macro-F1 score are the average values of more than 3 random initialization runs, and show the best results in bold.

In order to highlight the advantages of Gaussian proximity. We put the model experimental results before and after using Gaussian proximity into Table 3. It can be seen that the model which used Gaussian Proximity outperforms the model used linear function in most cases, mainly because the Gaussian Proximity is more consistent with the contextual words’ proximity to the aspect.

Table 3.

Experimental results using Gaussian Proximity.

Method		Laptop		Restaurant
Method		Accuracy	Macro-F1	Accuracy	Macro-F1
PWCN	PWCN-Pos	0.5360	0.4070	0.6442	0.5677
	PWCN-Dep	0.5361	0.4023	0.7008	0.5881
	PWCN-Gauss	0.6213	0.5375	0.7257	0.5638
BERT-PWCN	BERT-PWCN-Pos	0.5369	0.4395	0.6508	0.6396
	BERT-PWCN-Dep	0.5364	0.4148	0.7510	0.6458
	BERT-PWCN-Gauss	0.6410	0.5559	0.7500	0.6036
BERT-CNN-PWCN	BERT-CNN-PWCN-Dep	0.5445	0.4709	0.7539	0.6508
	BERT-CNN-PWCN-Pos	0.5429	0.4610	0.6553	0.6396
	BERT-CNN-PWCN-Gauss	0.6486	0.5560	0.7511	0.6118

Accuracy and macro-F1 score are the average values of more than 3 random initialization runs, and show the best results in bold.

Discussion and conclusion

In this paper, we developed a new CNN-based PWCN model, which not only combines CNN and PWCN together, but also proposed Gaussian proximity as a new method of calculating proximity weight. The experimental results demonstrated that our model is better than single PWCN model, such as PWCN-Pos and PWCN-Dep, in terms of aspect-level sentiment analysis based on syntactic dependency. Meanwhile, we used pretrained BERT embeddings to produce more accurate feature representations, and successfully improved model performance that Glove can not.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by National Key Research and Development Project (2018YFB1402900)

ORCID iDs

Siyi Chen

Xinhao Du

References

Zhang

Zhu

, et al. Multiple interactive attention networks for aspect-based sentiment classification. Appl Sci 2020; 10: 2052.

Jiang

Zhou

, et al. Target-dependent Twitter sentiment classification. In: Proceedings of the ACL, 2011.

Marcheggiani

Täckström

Esuli

, et al. Hierarchical multi-label conditional random fields for aspect-oriented opinion mining. In: Proceedings of the ECIR, 2014.

Jiang

Zhang

, et al. An LSTM-CNN attention approach for aspect-level sentiment classification. J Comput Methods Sci Eng 2019; 19: 859–868.

Zhang

Song

. Syntax-aware aspect-level sentiment classification with proximity-weighted convolution network. In: SIGIR’19: Proceedings of the 42nd International ACM SIGIR conference on research and development in information, 2019.

Mikolov

Sutskever

Chen

, et al. Distributed representations of words and phrases and their compositionality. ArXiv 2013, abs/1310.4546.

Pennington

Socher

Manning

. Glove: global vectors for word representation. In: Proceedings of the EMNLP, 2014.

Devlin

Chang

M-W

Lee

, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the NAACL-HLT, 2019.

Zhao

Chen

Guo

, et al. A spatial-temporal attention model for human trajectory prediction. IEEE/CAA J Automat Sin 2020; 7: 965–974.

10.

Yuan

, et al. Integrated deep learning method for workload and resource prediction in cloud systems. Neurocomputing 2021; 424: 35–48.

11.

Zhang

Wang

Lan

. Dynamic hand gesture recognition based on short-term sampling neural networks. IEEE/CAA J Automat Sin 2021; 8: 110–120.

12.

Wang

Lai

, et al. Dimensional sentiment analysis using a regional CNN-LSTM Model. In: Proceedings of the ACL, 2016.

13.

Medhat

Hassan

Korashy

. Sentiment analysis algorithms and applications: a survey. Ain Shams Eng J 2014; 5: 1093–1113.

14.

Pradhan

Vala

Balani

. A survey on sentiment analysis algorithms for opinion mining. Int J Comput Appl 2016; 133: 7–11.

15.

Liu

Chatterjee

Zhou

, et al. Aspect-based sentiment analysis: a survey of deep learning methods. IEEE Trans Comput Soc Syst 2020; 7: 1358–1375.

16.

Deng

Jing

, et al. Sentiment lexicon construction with hierarchical supervision topic model. IEEE/ACM Trans Audio Speech Lang Process 2019; 27: 704–718.

17.

Federici

Dragoni

. A knowledge-based approach for aspect-based opinion mining. In: Sack

Dietze

Tordai

, et al. (eds) Semantic Web challenges. SemWebEval 2016. Communications in Computer and Information Science. Cham: Springer, 2016, Vol. 641, pp.141–152.

18.

Kang

Ren

. Exploring latent semantic information for textual emotion recognition in blog articles. IEEE/CAA J Automat Sin 2018; 5: 204–216.

19.

LeCun

Bengio

Hinton

. Deep learning. Nature 2015; 521: 436–444.

20.

Zhou

Long

. Sentiment analysis of text based on CNN and bi-directional LSTM model. In: 2018 24th international conference on automation and computing (ICAC), Newcastle Upon Tyne, 6–7 September 2018, pp.1–5. New York, NY: IEEE.

21.

Xue

. Aspect based sentiment analysis with gated convolutional networks. ArXiv 2018, abs/1805.07043.

22.

Dong

Wei

Tan

, et al. Adaptive recursive neural network for target-dependent Twitter sentiment classification. In: Proceedings of the ACL, 2014.

23.

D-T

Zhang

. Target-dependent Twitter sentiment classification with rich automatic features. In: Proceedings of the IJCAI, 2015.

24.

Peng

Cambria

. Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In: Proceedings of the AAAI, 2018.

25.

Bahdanau

Cho

Bengio

. Neural machine translation by jointly learning to align and translate. CoRR 2015, abs/1409.0473.

26.

Wang

Huang

Zhu

, et al. Attention-based LSTM for aspect-level sentiment classification. In: Proceedings of the EMNLP, 2016.

27.

Long

Zhou

. Sentiment analysis of text based on bidirectional LSTM with multi-head attention. IEEE Access 2019; 7: 141960–141969.

28.

Lin

Yang

Lai

. Deep mask memory network with semantic dependency and context moment for aspect level sentiment classification. In: Proceedings of the IJCAI, 2019.

29.

Zhang

Zhao

. Convolutional multi-head self-attention on memory for aspect sentiment classification. IEEE/CAA J Automat Sin 2020; 7: 1038–1044.

30.

Zhang

, et al. Interactive attention networks for aspect-level sentiment classification. In: Proceedings of the IJCAI, 2017.

31.

Fan

Gao

, et al. Convolution-based memory network for aspect-based sentiment analysis. In: SIGIR’18: The 41st International ACM SIGIR conference on research & development in information retrieval, 2018.

32.

Pontiki

Galanis

Pavlopoulos

, et al. SemEval-2014 Task 4: aspect based sentiment analysis. In: Proceedings of the COLING 2014, 2014.

33.

Lai

Liu

, et al. Recurrent convolutional neural networks for text classification. In: Proceedings of the AAAI, 2015.

34.

Huang

Carley

. Aspect level sentiment classification with attention-over-attention neural networks. ArXiv 2018, abs/1804.06536.

35.

Kingma

. Adam: a method for stochastic optimization. CoRR 2015, abs/1412.6980.

36.

Tang

Qin

Feng

, et al. Effective LSTMs for target-dependent sentiment classification. In: Proceedings of COLING 2016, the 26th International conference on computational linguistics: technical papers, 2016.

37.

Zeng

Zhou

. Enhancing attention-based LSTM with position context for aspect-level sentiment classification. IEEE Access 2019; 7: 20462–20471.

38.

Liu

Zhang

Zeng

, et al. Content attention model for aspect based sentiment analysis. In: Proceedings of the 2018 world wide web conference, 2018.

A syntactic dependency method for aspect-level sentiment classification by deep learning

Abstract

Keywords

Introduction

Related work

Aspect-level sentiment classification

Deep learning for aspect-level sentiment classification

Attention mechanism and memory network for aspect-level sentiment classification

The proposed model

The model structure

Word embedding

GloVe

Bert

Convolutional neural network

Bidirectional LSTM

Proximity-weighted convolution network

Proximity weight

Proximity-weighted convolution

Experiment

The datasets and experimental environment

Model comparison

Experimental results

Discussion and conclusion

Footnotes

Declaration of conflicting interests

Funding

ORCID iDs

References