Abstract
With the increasing popularity of knowledge graphs (KGs), many applications such as sentiment analysis, trend prediction, and question answering use KGs for better performance. Despite the obvious usefulness of the commonsense and factual information in KGs, to the best of our knowledge, KGs have rarely been integrated into the task of answer selection in community question answering (CQA). In this paper, we propose a novel answer selection method for CQA that uses the knowledge embedded in KGs. We learn a latent-variable model for the representations of the question and answer, jointly optimizing generative and discriminative objectives. The model also uses the question category to produce context-aware representations for questions and answers. Moreover, it uses variational autoencoders (VAEs) in a multi-task learning process with a classifier to produce class-specific representations for answers. The experimental results on three widely used datasets demonstrate that our proposed method is effective and significantly outperforms the existing baselines.
Keywords
Introduction
Knowledge graphs (KGs), such as DBpedia [3] and BabelNet [38], are multi-relational graphs. They consist of entities and relationships among them. Many applications such as sentiment analysis [30], recommender systems [65], relation extraction [62], and question answering integrate the information in KGs by linking the entities mentioned in the text to entities in the KGs.
Community question answering (CQA) forums, such as Stack Overflow and Yahoo! Answers, provide new opportunities for users to share knowledge. In these forums, anyone can ask a question, and a question is answered by one or more members. Unfortunately, there is often no evaluation of how well the given answers relate to the question. This means one has to go through all possible answers to assess them, which is exhausting and time-consuming. Thus, it is essential to automatically identify the best answers for each question.
In this paper, we address the task of answer selection. As defined in SemEval 2015 [36], the goal in this task is to classify the answers given a question into three categories: (i) good, i.e., answers that address the question well; (ii) potentially useful to the user (e.g., because they can help educate him/her on the subject); (iii) bad or useless. It should be noted that a good answer is an answer semantically relevant to the question, not necessarily the correct answer.
Table 1 shows two examples of questions, each with four answers, taken from the SemEval 2015 [36] dataset.1
Example of two questions and four of their answers from the SemEval 2015 dataset
The main difficulty is bridging the semantic gap between question-answer pairs: by recognizing the semantic relatedness of the question and answer, one can determine whether an answer is relevant to its question.
Early work in this area includes feature-based methods for explicitly modeling the semantic relation between the question and answer [36,40]. With the great advances in deep neural networks, most recent studies apply deep learning based methods to answer classification in question answering communities [54,56,57,59,64]. These methods typically use a Convolutional Neural Network (CNN) [39] or Long Short-Term Memory (LSTM) [51] network for matching the question and answer. However, these methods have not achieved high accuracy, for several reasons. The
Despite the usefulness of the commonsense and factual background knowledge in KGs (such as DBpedia [3] and BabelNet [38]), to the best of our knowledge, these KGs have rarely been integrated into recent deep neural CQA networks. KGs provide rich information about entities, especially named entities, and the relations between them. Considering the examples in Table 1, the named entities “Armada” and “Infiniti FX35” in the question and answer do not exist in common word embedding vocabularies such as Word2vec [33] or GloVe [42] and are thus out-of-vocabulary. Therefore, conventional methods assign a negative score to the first answer because they misunderstand the named entities and their relations. By using a comprehensive KG like BabelNet, however, the model can assign the correct label to the answer thanks to the entities and facts it contains.
There are some words that may have different meanings in different contexts. By using the category of the question as the context representative, the correct meaning of the question and answer words can be extracted, and so a more accurate representation of the question and answer would be generated.
The previous methods are unable to encode all semantic information of the question and answer. Also, in [5] it has been shown that it is difficult to encode all semantic information of a sequence into a single vector;
In semantic matching problems, the learned representations must have two main properties. First, each representation must preserve the important details mentioned in the text. Second, each representation must contain discriminative information about its relationship with the target sentence. Following this motivation, and by leveraging external background knowledge and the question category, we use deep generative models for question-answer pair modeling. Due to their ability to obtain latent codes that capture the essential information of a sequence, we expect their resulting representations to be better suited to extracting the question-answer relation.
In the proposed model, in the first step, the question and answer words are disambiguated based on the question category and external background knowledge from our selected KG. At the end of this step, the correct meaning of each word in the current context is captured. In the second step, using the representation of the question subject as the attention source, the noisy parts of the question and answer are discarded and their useful information is extracted. In the final step, the representations of questions and answers are learned using the convolutional-deconvolutional autoencoding framework first proposed in [63] for paragraph representation learning. This framework, which uses a deconvolutional network as its decoder, models the question and answer separately. In this multi-task learning process, the question-answer relevance label is also considered in the representation learning, enabling class-specific representations.
The
We leverage external knowledge from KGs to capture the meaning of the question and answer words and extract the relation between them.
We propose to use the category of the question as context to understand the correct meaning of the question and answer words in the current context. To the best of our knowledge, we are the first to use the question category to have context-aware representations in CQA.
We propose to use two convolutional-deconvolutional autoencoding frameworks that learn separate representations of the question and answer. To the best of our knowledge, we are the first to use this deconvolutional VAE in the answer selection problem.
We introduce a new architecture for answer selection, in which a classifier is combined with variational autoencoders to make the representations class-specific.
Our proposed model achieves state-of-the-art performance in three CQA datasets: SemEval 2015, SemEval 2016 [37], and SemEval 2017 [35].
In the next section, we provide preliminaries in this field. Then we review previous research in Section 3. The proposed model is presented in Section 4. In Section 5, experimental results and analyses are presented. The conclusion is given in Section 6.
Latent-variable model for text processing
The most common way to obtain sentence representations is to use sequence-to-sequence models, due to their ability to leverage information from unlabeled data [21]. In these models, first an encoder encodes the input sentence
In VAEs, the decoder network reconstructs the input conditioning on the samples from the latent code (via its posterior distribution). Given an observed sentence
In Eq. (1),
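In its standard form (written here in generic notation, which may differ from the paper's exact symbols), the variational lower bound of Eq. (1) for an observed sentence $x$ with latent code $z$, inference network $q_{\phi}(z \mid x)$, and decoder $p_{\theta}(x \mid z)$ is:

```latex
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right]
  - \mathrm{KL}\!\left(q_{\phi}(z \mid x) \,\|\, p(z)\right)
```

The first term rewards faithful reconstruction of the sentence from the latent code, while the KL term regularizes the posterior toward the prior $p(z)$.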
Challenges of VAEs for text
Typically, an LSTM network is used as the decoder in VAEs for text generation [4]. However, due to the recurrent nature of LSTMs, the decoder tends to ignore the information in the latent variable: providing the ground-truth words of the previous time steps during training prevents the learned sentence embeddings from carrying enough information about the input [4]. To resolve this problem, we use a deconvolutional network as the decoder, which has been shown to perform best among the alternatives [61]. As noted in [61], deconvolutional networks are typically used in deep learning for up-sampling fixed-length latent representations, usually produced by a convolutional network.
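As a rough illustration of this up-sampling behavior (a toy NumPy sketch under assumed shapes, not the paper's implementation), a one-dimensional transposed convolution expands a short latent feature map into a longer sequence in a single parallel step, with no dependence on ground-truth previous words:

```python
import numpy as np

def conv_transpose_1d(z, kernel, stride=2):
    """Transposed 1D convolution. z: (in_len, in_ch), kernel: (k, in_ch, out_ch)."""
    k, in_ch, out_ch = kernel.shape
    in_len = z.shape[0]
    out_len = (in_len - 1) * stride + k
    out = np.zeros((out_len, out_ch))
    for t in range(in_len):
        # each latent position contributes a k-wide patch to the output
        out[t * stride : t * stride + k] += np.einsum('c,kco->ko', z[t], kernel)
    return out

rng = np.random.default_rng(0)
z = rng.normal(size=(12, 64))            # pooled latent feature map
w = rng.normal(size=(5, 64, 32)) * 0.1   # kernel width 5
seq = conv_transpose_1d(z, w, stride=2)
print(seq.shape)  # (27, 32): (12 - 1) * 2 + 5 = 27
```

All output positions are produced from the latent code alone, which is why this decoder cannot sidestep the latent variable the way a teacher-forced LSTM can.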
Related work
Applications of knowledge graphs
In many NLP and ML applications, KGs are integrated into the models, e.g., sentiment analysis [9,30], recommender systems [7,65], relation extraction [62], entity linking [2], and question answering (QA). For QA, the authors of [25] use KG embeddings for answering questions, especially simple ones. The work in [52], also in QA, leverages relation phrase dictionaries and KG embeddings for answering natural-language questions. In [32], a model is presented that uses KGs for question routing in CQA; topic representations with network structure are integrated into a unified KG question routing framework. The work in [27] presents a survey on the representation, acquisition, and applications of KGs.
Answer selection in CQA
In the literature, the methods for answer classification can be roughly divided into two main groups: feature-based and deep learning methods.
Feature-based methods, with a long research history, employ a simple classifier with manually constructed features: some textual and structural features are selected, and a simple classifier such as a support vector machine (SVM) or KNN is applied to them. The methods presented in [13,19,22,24,36,40,45,49] and [46] all fall into this category. Some of these papers, along with their features, are summarized in Table 2.
Summarization of previous community question answering approaches
In 2015, SemEval organized a task similar to ours, titled “answer selection in community question answering”. Thirteen teams participated in that challenge. The participants mainly focused on defining new features to capture the semantic similarity between the question and its answers; word matching features, special component features, topic-modeling-based features, and non-textual features are typical examples. This shared task was repeated by SemEval in 2016 and 2017 as SemEval 2016 task 3 and SemEval 2017 task 3. The best systems in SemEval 2015, 2016, and 2017 were JAIST [48], KeLP [19,20], and Beihang-MSRA [18], respectively.
In contrast to feature engineering methods, deep learning based methods learn features automatically by end-to-end training, greatly reducing the need for feature engineering. Some of these methods are summarized in Table 2.
The model presented in [39] uses two convolutional neural networks (CNNs) to capture the similarity between the questions and answers, and based on it, label the answer. In [47], a convolutional sentence model is proposed to identify the answer content of a question. Wang and Nyberg [51] present a method that successfully employs recurrent neural networks (RNNs) for this task.
In addition to modeling the similarity of the answer and its question, context modeling is also considered in some recent studies. [57] and [66] propose models in which the labels of the previous and next answers are considered as context information. These methods outperform their counterparts that do not consider context information.
Attention is another technique used for answer selection. The authors of [56] propose an attentive deep neural network that employs an attention mechanism alongside CNN and LSTM networks for answer selection in CQA. In [55], a network called Question Condensing is proposed; based on the question's subject-body relationship, the question's subject is treated as the main part, and the question's body is aggregated with it based on their similarity and disparity. Joint modeling of users, questions, and answers is proposed in [54], in which a hybrid attention mechanism is used to model question-answer pairs; user information is also considered in answer classification. In [59], an advanced deep neural network is proposed that leverages text categorization to improve the performance of question-answer relevance classification; external knowledge is also used to capture important entities in questions and answers. A hierarchical attention model named KHAAS is proposed in [58] for answer selection in CQA.
Recently, various attention models based on the Transformer architecture have been proposed for learning sentence representations [50], and some models use a Transformer network as their encoder or decoder [8,41]. BERT [15] and RoBERTa [1], as contextualized word embeddings, are now widely used. BERT significantly outperformed the previous state-of-the-art results for question answering on the Stanford question answering dataset (SQuAD) by fine-tuning the pre-trained model [31]. In [53], the authors propose a gated self-attention network along with transfer learning from a large-scale online corpus, and report improvements on the TREC-QA [43] and WikiQA [60] datasets for the answer selection task. In [31], a model with a Transformer encoder (CETE) is presented for sentence similarity modeling. In that paper, by utilizing contextualized embeddings (BERT, ELMo, and RoBERTa [1]), two different approaches, namely feature-based and fine-tuning-based, are presented. The CETE model has achieved state-of-the-art performance on the answer selection task in CQA and is our main baseline.
There are still some limitations in the aforementioned methods that make the answer selection in CQA a
Different from the aforementioned studies, in our proposed model, we

Proposed model architecture. The inputs to this architecture are the question's category, the question's subject, the question's body, the answer's body, and the KG. The output is the question-answer relevance label.
Notation list
The main goal of this paper is to address question-answer relevance classification in CQA by using KGs. In our proposed model, depicted in Fig. 1, in the first step, the words in the question and the answer are disambiguated using WSD, leveraging external knowledge from a KG. Through the KG, the entities (especially named entities) and the relations between them are captured. Since noisy information exists in questions and answers, in the next step we employ an attention mechanism to extract the important information. Finally, to infer the question-answer relevance label, we propose a classifier trained in a multi-task learning process with two separate VAEs, one for the question and one for the answer. These VAEs help learn class-specific representations.
Next, we elaborate on the three key components of the model in more detail: initial representation, attention, and multi-task learning. The main notations used in Fig. 1 are summarized in Table 3 for clarity.
Initial representations
Some words may have different meanings in different contexts. Static word embedding methods, such as Word2vec or GloVe, do not address this issue and may lead to incorrect sentence representations. Furthermore, sentences sometimes contain named entities not defined in common word embedding vocabularies (such as “Armada” and “Infiniti FX35” in Table 1), and so they are ignored in sentence representations. Considering these two problems, we propose to disambiguate each word of the question subject, question body, and answer body by leveraging the KG. We also use the question category as the context representative. In this disambiguation procedure, the meaning of each disambiguated word (including named entities) is captured through the KG, and the relations between them are extracted. We use Babelfy, a unified graph-based approach to entity linking (EL) and word sense disambiguation (WSD) [34], to disambiguate the question and answer.
The Babelfy algorithm is a KG based model that requires a semantic network, such as BabelNet, which encodes structural and lexical information. In this semantic network, each vertex is an entity. The Babelfy algorithm has three main steps. First, given a lexicalized semantic network, it assigns each vertex a semantic signature, i.e., a set of related vertices. For the notion of relatedness in this step, the global structure of the semantic network is exploited, yielding a more precise and higher-coverage measure of relatedness; to this end, a structural weighting of the network's edges is used, and for each vertex a set of related vertices is created using random walks with restart. In the second step, for a given text, it applies part-of-speech tagging, identifies all textual fragments, and lists all possible meanings of the extracted fragments. Finally, by creating a graph-based semantic interpretation of the whole text and using the previously computed semantic signatures, it selects the best candidate meaning for each fragment [34].
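The random-walk-with-restart step can be illustrated on a toy graph (a generic formulation with an assumed restart probability of 0.15; BabelNet's actual network and edge weighting are far richer):

```python
import numpy as np

def random_walk_with_restart(A, seed, restart=0.15, iters=100):
    """A: (n, n) adjacency matrix; seed: start vertex index.
    Returns stationary visiting probabilities under a standard RWR iteration."""
    n = A.shape[0]
    P = A / A.sum(axis=0, keepdims=True)   # column-stochastic transition matrix
    e = np.zeros(n); e[seed] = 1.0         # restart distribution concentrated on seed
    p = e.copy()
    for _ in range(iters):
        p = (1 - restart) * (P @ p) + restart * e
    return p

# Toy chain graph 0 - 1 - 2 - 3; vertices nearer the seed score higher
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
p = random_walk_with_restart(A, seed=0)
print(p.round(3))
```

The resulting probability vector ranks vertices by relatedness to the seed, which is exactly the notion Babelfy uses to build per-vertex semantic signatures.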
Based on this process, it can be said that Babelfy uses the context of a word to disambiguate it in a text. In our proposed method, to consider the question category as the contextual information, we simply concatenate it to the question subject, question body, and answer body. The concatenation of these three parts is considered as the input text.
To apply Babelfy to our problem, in its first step we use BabelNet, the largest multilingual KG [38], as the lexicalized semantic network in the disambiguation procedure. BabelNet, which contains both concepts and named entities as its vertices, is obtained from the automatic seamless integration of Wikipedia2
After disambiguating each word and capturing its correct sense in the current context from the KG, we represent it using NASARI [6]. NASARI is a multilingual vector representation of word senses with high coverage, including both concepts and named entities [6]. More specifically, NASARI combines the structural knowledge of semantic networks with statistical information derived from text corpora. This makes it possible to have an effective representation of millions of BabelNet synsets. The output of this step is the initial representation of the question subject, question body, and answer, denoted as
The problem of redundancy and noise is prevalent in CQA [29]. On the other hand, the question subject summarizes the main points of the question and so can be used to extract useful information from the question and answer.
In order to reduce the impact of redundancy and noise, we use the representation of the question subject,
Where
Where
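A minimal sketch of such subject-guided attention, assuming scaled dot-product scoring (an assumption on our part; the paper's exact formulation may differ):

```python
import numpy as np

def subject_attention(H, s):
    """H: (n, d) token representations; s: (d,) question-subject vector.
    Returns an attention-weighted summary of H."""
    scores = H @ s / np.sqrt(H.shape[1])   # relevance of each token to the subject
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over tokens
    return weights @ H                     # (d,) weighted representation

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 300))   # hypothetical token matrix
s = rng.normal(size=300)         # hypothetical subject representation
v = subject_attention(H, s)
print(v.shape)  # (300,)
```

Tokens weakly related to the subject receive near-zero weights, which is how the noisy parts of the question and answer are effectively discarded.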
Multi-task learning
The multi-task learning module in Fig. 1 is based on the Siamese architecture [14]. Siamese neural architectures first appeared in vision (face recognition [10]). They have recently been extensively studied for learning sentence representations and for predicting similarity or entailment relations between sentence pairs as an end-to-end differentiable task [12,23,26,44].
Our model consists of deconvolutional twin networks and extracts question-answer relevance by employing the discriminative information encoded by the encoder network.
As shown in Fig. 1,
To infer the label of the question-answer relevance, two latent features are sampled from the inference network, as
To balance maximizing the variational lower bound and minimizing the classifier loss, the model training objective is defined as follows:
Here,
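Schematically, and with generic symbols of our own choosing ($\mathcal{L}_q$ and $\mathcal{L}_a$ the variational lower bounds of the question and answer VAEs, $\ell_{\mathrm{cls}}$ the classifier loss, and $\alpha$ a balancing weight), such a multi-task objective takes the form:

```latex
\min_{\theta,\phi,\psi}\;
  -\mathcal{L}_{q}(\theta,\phi)
  \;-\; \mathcal{L}_{a}(\theta,\phi)
  \;+\; \alpha\,\ell_{\mathrm{cls}}(\psi)
```

A larger $\alpha$ pushes the latent codes toward class-specific representations, while a smaller $\alpha$ favors faithful reconstruction of the question and answer.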
Experimental results and analysis
In this section, we present the implementation details, the analysis of our proposed framework, and the comparison of experimental results.
Data
We conduct experiments on three widely used CQA datasets, SemEval-2015 Task 33
Each question in the datasets consists of a short title or subject and a detailed description or body. Questions are followed by a list of comments (or answers), each of which is classified into one of three categories: “Definitely Relevant” (Good), “Potentially Useful” (Potential), or “Bad” (bad, dialog, non-English, other). The “Good” label indicates that the answer is relevant to the question and answers it, even though it might be a wrong answer; “Potential” indicates that the answer contains potentially useful information about the question; and “Bad” indicates that the answer is irrelevant or useless. Besides three-class classification experiments, we also conducted experiments for two-class classification. Following previous work, for two-class classification we merge the “Potentially Useful” and “Bad” labels into one label, “Bad”.
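The merging step for the two-class setting amounts to a simple label mapping, which can be sketched as:

```python
def to_two_class(label):
    """Collapse the SemEval three-way labels into the binary setting:
    'Good' stays 'Good'; 'Potential' and 'Bad' both become 'Bad'."""
    return "Good" if label == "Good" else "Bad"

labels = ["Good", "Potential", "Bad"]
print([to_two_class(l) for l in labels])  # ['Good', 'Bad', 'Bad']
```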
Statistics of SemEval 2015, 2016, and 2017 datasets
In the experiments, we compare our proposed method with several baselines:
As mentioned before, we use BabelNet as our KG, which contains both concepts and named entities. NASARI is then used to obtain the embedding of each disambiguated word (sense). The maximum length is set to 50 and the vocabulary size to 5000. For training, we use a convolutional encoder with three layers followed by a deconvolutional decoder with the same number of layers. We try hidden sizes of 100, 300, and 500. The weight parameters are randomly sampled from a uniform distribution
The model is trained using the RMSProp optimizer [17]. Dropout is employed on the latent variable layer with a dropout rate of 0.5.
Quantitative evaluation
For the answer selection task, the standard metrics used in previous work for benchmarking are macro-averaged F1 and Mean Average Precision (MAP). We measure the performance using these metrics on three datasets: SemEval 2015, SemEval 2016, and SemEval 2017.
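For concreteness, the two metrics can be computed as follows (a self-contained sketch of the standard definitions, not the official SemEval scorer):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def average_precision(relevance_ranked):
    """AP for one question: binary relevance of its answers in ranked order."""
    hits, total, ap = 0, sum(relevance_ranked), 0.0
    for i, rel in enumerate(relevance_ranked, 1):
        if rel:
            hits += 1
            ap += hits / i
    return ap / total if total else 0.0

# MAP is the mean of per-question APs (toy example with two questions)
aps = [average_precision([1, 0, 1, 0]), average_precision([0, 1])]
print(round(sum(aps) / len(aps), 3))  # 0.667
```

Macro-averaged F1 weights each class equally, so a model cannot score well by ignoring the small “Potential” class; MAP instead rewards ranking the relevant answers near the top.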
Table 5, Table 6, and Table 7 show the performance of our proposed model against the other baselines for three-class classification on SemEval 2015, SemEval 2016, and SemEval 2017, respectively. It should be noted that, for the baselines whose results are reported only for two-class classification (KeLP [19], Question Condensing [55], MKMIA-CQA [59], KHAAS [58], UIA-LSTM-CNN [54], and CETE [31]), we modified their source code for three-class classification. Also, the CNN [28] and BiLSTM-attention [64] models, whose original implementations target datasets other than ours, were re-implemented for the SemEval datasets. Table 8, Table 9, and Table 10 report the two-class classification results.
Quantitative evaluation results on SemEval 2015 for three-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2016 for three-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2017 for three-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2015 for two-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2016 for two-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2017 for two-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
As shown in Table 5, Table 6, and Table 7, for three-class classification, our proposed model outperforms the other baselines. It beats the state-of-the-art method, CETE, in F1 by about 6%, 4%, and 3% on SemEval 2015, 2016, and 2017, respectively. Similarly, it outperforms the MAP results of CETE on all three datasets. The p-values for these differences are less than 0.05, indicating that the improvements are statistically significant. It should be noted that treating the “potentially useful” label as a separate class, instead of merging it into the “bad” class, requires a more accurate model; this is a key advantage of our approach over its competitors.
Similarly, for two-class classification, as indicated in Table 8, Table 9, and Table 10, our proposed method outperforms the baseline methods in F1 and MAP. Except for MAP on the 2015 and 2017 datasets, the increases are statistically significant. These results show that our model's improvements do not depend only on the number of classes. The experimental results support our hypothesis about the obtained representations for the question and answer; in other words, they indicate that these representations are informative for predicting the relevance of questions and answers.
To analyze the effect of each component of our model, we also report an ablation test in terms of discarding the external knowledge from the KG (w/o KG), the attention on the subject (w/o AS), the question category (w/o category), the deconvolutional decoder (w/o deconv), and the VAE (w/o VAE). For w/o KG, we simply use word embeddings instead of sense embeddings in the initial representation. For w/o category, we disambiguate each question and answer on its own, without category information. For w/o deconv and w/o VAE, we use an LSTM decoder and a simple autoencoder instead of the VAE, respectively. The ablation results are summarized in Table 11 and Table 12 for the three datasets.
We also analyze the performance of the proposed method by starting from a baseline model and incrementally adding one component at a time. The baseline model is the vanilla version, with only two parallel autoencoders to obtain question and answer representations; the concatenation of these representations is sent to an MLP to extract question-answer relevance. Table 13 and Table 14 show the results.
Ablation test of the proposed model on SemEval 2015, SemEval 2016, and SemEval 2017 for three-class classification
Ablation test of the proposed model on SemEval 2015, SemEval 2016, and SemEval 2017 for two-class classification
Analysis of each component impact on SemEval 2015, SemEval 2016, and SemEval 2017 for three-class classification
Generally, all five factors contribute to the results of our proposed model. Notably, F1 and MAP decrease sharply when the KG is discarded. This is within our expectation, since the KG enriches the overall text representation by making it possible to consider all entities (especially named entities), the context, and the useful information. The deconvolutional VAE also makes a large contribution, verifying that a deconvolutional decoder yields a more informative representation. Not surprisingly, combining all components achieves the best performance.
In this subsection, we analyze the model's sensitivity to the CNN-specific hyper-parameters: window size, stride, and filter size (number of filters). Figure 2 and Fig. 3 show how the macro-averaged F1 changes with window size and filter size, respectively.
Analysis of each component impact on SemEval 2015, SemEval 2016, and SemEval 2017 for two-class classification
For the stride, we observe that when it is 4 or greater, the system comes close to fully fitting the training data (overfitting). The best value for stride is 2 for both datasets.
As shown in Fig. 2 and Fig. 3, the best macro-averaged F1 values obtained are 74.91 for SemEval 2015, 68.79 for SemEval 2016, and 70.43 for SemEval 2017, achieved with window size 4, stride 2, and filter size 300.

The influence of window size on model performance.

The influence of filter size on model performance.
In this article, we proposed a new model based on KGs for answer selection in community question answering forums. In the proposed architecture, external background knowledge is used to capture entity mentions and their relations in questions and answers. Also, by using the question category, a context-aware representation is generated for the question and answer. The model is trained in a multi-task learning procedure, in which there are two variational autoencoders in combination with a classifier to capture the semantic relatedness of the question and answer.
Quantitatively, the experimental results demonstrated that our model outperforms all existing baselines. We also conducted an ablation analysis to show the effectiveness of each component of the proposed model. The results confirm our architecture design choices, since all components, especially the KG integration, contribute positively.
Footnotes
Acknowledgement
This work is based upon research funded by the Iran National Science Foundation (INSF) under project No. 4002438.
