Abstract
With the increasing popularity of knowledge graphs (KGs), many applications such as sentiment analysis, trend prediction, and question answering use KGs for better performance. Despite the obvious usefulness of the commonsense and factual information in KGs, to the best of our knowledge, KGs have rarely been integrated into the task of answer selection in community question answering (CQA). In this paper, we propose a novel answer selection method for CQA that uses the knowledge embedded in KGs. We learn a latent-variable model for the representations of the question and answer, jointly optimizing generative and discriminative objectives. The model also uses the question category to produce context-aware representations for questions and answers. Moreover, it uses variational autoencoders (VAEs) in a multi-task learning process with a classifier to produce class-specific representations for answers. The experimental results on three widely used datasets demonstrate that our proposed method is effective and significantly outperforms the existing baselines.
Keywords
Introduction
Knowledge graphs (KGs), such as DBpedia [3] and BabelNet [38], are multi-relational graphs. They consist of entities and relationships among them. Many applications such as sentiment analysis [30], recommender systems [65], relation extraction [62], and question answering integrate the information in KGs by linking the entities mentioned in the text to entities in the KGs.
Community question answering (CQA) forums, such as Stack Overflow and Yahoo! Answers, provide new opportunities for users to share knowledge. In these forums, anyone can ask a question, and a question is answered by one or more members. Unfortunately, there is often no evaluation of how well the given answers relate to the question. This means one has to go through all possible answers to assess them, which is exhausting and time-consuming. Thus, it is essential to automatically identify the best answers for each question.
In this paper, we address the task of answer selection. As defined in SemEval 2015 [36], the goal in this task is to classify the answers given a question into three categories: (i) good, i.e., answers that address the question well; (ii) potentially useful to the user (e.g., because they can help educate him/her on the subject); (iii) bad or useless. It should be noted that a good answer is an answer semantically relevant to the question, not necessarily the correct answer.
Table 1 shows two examples of questions, each with four answers, taken from the SemEval 2015 [36] dataset.1
Example of two questions and four of their answers from the SemEval 2015 dataset
The main difficulty is bridging the semantic gap between question-answer pairs: by recognizing the semantic relatedness of the question and answer, one can determine whether an answer is relevant to its question.
Early work in this area includes feature-based methods for explicitly modeling the semantic relation between the question and answer [36,40]. With the great advances in deep neural networks, most recent studies apply deep learning based methods to answer classification in question answering communities [54,56,57,59,64]. These methods typically use a Convolutional Neural Network (CNN) [39] or Long Short-Term Memory (LSTM) [51] network for matching the question and answer. However, these methods have not achieved high accuracy, for several reasons. The
Despite the usefulness of the commonsense and factual background knowledge in KGs (such as DBpedia [3] and BabelNet [38]), to the best of our knowledge, these KGs have rarely been integrated into recent deep neural CQA networks. KGs provide rich information about entities, especially named entities, and the relations between them. Considering the examples in Table 1, the named entities “Armada” and “Infiniti FX35” in the question and answer do not exist in common word embedding vocabularies such as Word2vec [33] or GloVe [42] and are thus out-of-vocabulary. Therefore, conventional methods assign a negative score to the first answer because they misunderstand the named entities and their relations. By using a comprehensive KG like BabelNet, however, the model can assign the correct label to the answer thanks to the entities and facts it contains.
There are some words that may have different meanings in different contexts. By using the category of the question as the context representative, the correct meaning of the question and answer words can be extracted, and so a more accurate representation of the question and answer would be generated.
The previous methods are unable to encode all semantic information of the question and answer. Also, in [5] it has been shown that it is difficult to encode all semantic information of a sequence into a single vector;
In semantic matching problems, the learned representations must have two main properties. First, each representation must preserve the important details mentioned in the text. Second, each representation must contain discriminative information about its relationship with the target sentence. Following this motivation, and by leveraging external background knowledge and the question category, we use deep generative models for question-answer pair modeling. Due to their ability to obtain latent codes that capture the essential information of a sequence, we expect their resulting representations to be better suited to extracting the question-answer relation.
In the proposed model, in the first step, the question and answer words are disambiguated based on the question category and external background knowledge from our selected KG. At the end of this step, the correct meaning of each word in the current context is captured. In the second step, using the representation of the question subject as the attention source, the noisy parts of the question and answer are discarded and their useful information is extracted. In the final step, the representations of questions and answers are learned using the convolutional-deconvolutional autoencoding framework first proposed in [63] for paragraph representation learning. This framework, which uses a deconvolutional network as its decoder, models the question and answer separately. In this multi-task learning process, the question-answer relevance label is also considered in the representation learning, enabling class-specific representations.
The
We leverage external knowledge from KGs to capture the meaning of the question and answer words and extract the relation between them.
We propose to use the category of the question as context to understand the correct meaning of the question and answer words in the current context. To the best of our knowledge, we are the first to use the question category to have context-aware representations in CQA.
We propose to use two convolutional-deconvolutional autoencoding frameworks that learn separate representations of the question and answer. To the best of our knowledge, we are the first to use this deconvolutional VAE in the answer selection problem.
We introduce a new architecture for answer selection, in which a classifier is combined with variational autoencoders to make the representations class-specific.
Our proposed model achieves state-of-the-art performance in three CQA datasets: SemEval 2015, SemEval 2016 [37], and SemEval 2017 [35].
In the next section, we provide preliminaries in this field. Then we review previous research in Section 3. The proposed model is presented in Section 4. In Section 5, experimental results and analyses are presented. The conclusion is given in Section 6.
Latent-variable model for text processing
The most common way to obtain sentence representations is to use sequence-to-sequence models, due to their ability to leverage information from unlabeled data [21]. In these models, first an encoder encodes the input sentence
In VAEs, the decoder network reconstructs the input conditioning on the samples from the latent code (via its posterior distribution). Given an observed sentence
In Eq. (1),
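In its standard form (written here in generic notation, which may differ from the paper's exact symbols), the variational lower bound of Eq. (1) for an observed sentence $x$ with latent code $z$, inference network $q_{\phi}(z \mid x)$, and decoder $p_{\theta}(x \mid z)$ is:

```latex
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right]
  - \mathrm{KL}\!\left(q_{\phi}(z \mid x) \,\|\, p(z)\right)
```

The first term rewards faithful reconstruction of the sentence from the latent code, while the KL term regularizes the posterior toward the prior $p(z)$.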
Challenges of VAEs for text
Typically, an LSTM network is used as the decoder in VAEs for text generation [4]. However, due to the recurrent nature of LSTMs, the decoder tends to ignore the information in the latent variable: providing the ground-truth words of the previous time steps during training prevents the learned sentence embeddings from carrying enough information about the input [4]. To resolve this problem, we use a deconvolutional network as the decoder, which has been shown to perform best among the alternatives [61]. As noted in [61], deconvolutional networks are typically used in deep learning for up-sampling fixed-length latent representations, usually produced by a convolutional network.
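As a rough illustration of this up-sampling behavior (a toy NumPy sketch under assumed shapes, not the paper's implementation), a one-dimensional transposed convolution expands a short latent feature map into a longer sequence in a single parallel step, with no dependence on ground-truth previous words:

```python
import numpy as np

def conv_transpose_1d(z, kernel, stride=2):
    """Transposed 1D convolution. z: (in_len, in_ch), kernel: (k, in_ch, out_ch)."""
    k, in_ch, out_ch = kernel.shape
    in_len = z.shape[0]
    out_len = (in_len - 1) * stride + k
    out = np.zeros((out_len, out_ch))
    for t in range(in_len):
        # each latent position contributes a k-wide patch to the output
        out[t * stride : t * stride + k] += np.einsum('c,kco->ko', z[t], kernel)
    return out

rng = np.random.default_rng(0)
z = rng.normal(size=(12, 64))            # pooled latent feature map
w = rng.normal(size=(5, 64, 32)) * 0.1   # kernel width 5
seq = conv_transpose_1d(z, w, stride=2)
print(seq.shape)  # (27, 32): (12 - 1) * 2 + 5 = 27
```

All output positions are produced from the latent code alone, which is why this decoder cannot sidestep the latent variable the way a teacher-forced LSTM can.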
Related work
Applications of knowledge graphs
In many NLP and ML applications, KGs are integrated into the models, e.g., sentiment analysis [9,30], recommender systems [7,65], relation extraction [62], entity linking [2], and question answering (QA). For QA, the authors of [25] use KG embeddings for answering questions, especially simple ones. The work in [52], also in QA, leverages relation phrase dictionaries and KG embeddings for answering natural-language questions. In [32], a model is presented that uses KGs for question routing in CQA; topic representations with network structure are integrated into a unified KG question routing framework. The work in [27] presents a survey on the representation, acquisition, and applications of KGs.
Answer selection in CQA
In the literature, the methods for answer classification can be roughly divided into two main groups: feature-based and deep learning methods.
Feature-based methods, with a long research history, employ a simple classifier with manually constructed features: some textual and structural features are selected, and a simple classifier such as a support vector machine (SVM) or KNN is applied to them. The methods presented in [13,19,22,24,36,40,45,49] and [46] all fall into this category. Some of these papers, along with their features, are summarized in Table 2.
Summarization of previous community question answering approaches
In 2015, SemEval organized a task similar to ours, titled “answer selection in community question answering”. Thirteen teams participated in that challenge. The participants mainly focused on defining new features to capture the semantic similarity between the question and its answers; word matching features, special component features, topic-modeling-based features, and non-textual features are typical examples. This shared task was repeated by SemEval in 2016 and 2017 as SemEval 2016 task 3 and SemEval 2017 task 3. The best systems in SemEval 2015, 2016, and 2017 were JAIST [48], KeLP [19,20], and Beihang-MSRA [18], respectively.
In contrast to feature engineering methods, deep learning based methods learn features automatically by end-to-end training, greatly reducing the need for feature engineering. Some of these methods are summarized in Table 2.
The model presented in [39] uses two convolutional neural networks (CNNs) to capture the similarity between the questions and answers, and based on it, label the answer. In [47], a convolutional sentence model is proposed to identify the answer content of a question. Wang and Nyberg [51] present a method that successfully employs recurrent neural networks (RNNs) for this task.
In addition to modeling the similarity of the answer and its question, context modeling is also considered in some recent studies. [57] and [66] propose models in which the labels of the previous and next answers are considered as context information. These methods outperform their counterparts that do not consider context information.
Attention is another technique used for answer selection. The authors of [56] propose an attentive deep neural network that employs an attention mechanism alongside CNN and LSTM networks for answer selection in CQA. In [55], a network called Question Condensing is proposed; based on the question's subject-body relationship, the question's subject is treated as the main part, and the question's body is aggregated with it based on their similarity and disparity. Joint modeling of users, questions, and answers is proposed in [54], in which a hybrid attention mechanism is used to model question-answer pairs; user information is also considered in answer classification. In [59], an advanced deep neural network is proposed that leverages text categorization to improve the performance of question-answer relevance classification; external knowledge is also used to capture important entities in questions and answers. A hierarchical attention model named KHAAS is proposed in [58] for answer selection in CQA.
Recently, various attention models based on the Transformer architecture have been proposed for learning sentence representations [50], and some models use a Transformer network as their encoder or decoder [8,41]. BERT [15] and RoBERTa [1], as contextualized word embeddings, are now widely used. BERT significantly outperformed the previous state-of-the-art results for question answering on the Stanford question answering dataset (SQuAD) by fine-tuning the pre-trained model [31]. In [53], the authors propose a gated self-attention network along with transfer learning from a large-scale online corpus, and report improvements on the TREC-QA [43] and WikiQA [60] datasets for the answer selection task. In [31], a model with a Transformer encoder (CETE) is presented for sentence similarity modeling. In that paper, by utilizing contextualized embeddings (BERT, ELMo, and RoBERTa [1]), two different approaches, namely feature-based and fine-tuning-based, are presented. The CETE model has achieved state-of-the-art performance on the answer selection task in CQA and is our main baseline.
There are still some limitations in the aforementioned methods that make the answer selection in CQA a
Different from the aforementioned studies, in our proposed model, we

Proposed model architecture. The inputs to this architecture are the question's category, the question's subject, the question's body, the answer's body, and the KG. The output is the question-answer relevance label.
Notation list
The main goal of this paper is to address question-answer relevance classification in CQA by using KGs. In our proposed model, depicted in Fig. 1, in the first step, the words in the question and the answer are disambiguated using WSD, leveraging external knowledge from a KG. Through the KG, the entities (especially named entities) and the relations between them are captured. Since noisy information exists in questions and answers, in the next step we employ an attention mechanism to extract the important information. Finally, to infer the question-answer relevance label, we propose a classifier trained in a multi-task learning process with two separate VAEs, one for the question and one for the answer. These VAEs help learn class-specific representations.
Next, we elaborate on the three key components of the model in more detail: initial representation, attention, and multi-task learning. The main notations used in Fig. 1 are summarized in Table 3 for clarity.
Initial representations
Some words may have different meanings in different contexts. Static word embedding methods, such as Word2vec or GloVe, do not address this issue and may lead to incorrect sentence representations. Furthermore, sentences sometimes contain named entities not defined in common word embedding vocabularies (such as “Armada” and “Infiniti FX35” in Table 1), and so they are ignored in sentence representations. Considering these two problems, we propose to disambiguate each word of the question subject, question body, and answer body by leveraging the KG. We also use the question category as the context representative. In this disambiguation procedure, the meaning of each disambiguated word (including named entities) is captured through the KG, and the relations between them are extracted. We use Babelfy, a unified graph-based approach to entity linking (EL) and word sense disambiguation (WSD) [34], to disambiguate the question and answer.
The Babelfy algorithm is a KG based model that requires a semantic network, such as BabelNet, which encodes structural and lexical information. In this semantic network, each vertex is an entity. The Babelfy algorithm has three main steps. First, given a lexicalized semantic network, it assigns each vertex a semantic signature, i.e., a set of related vertices. For the notion of relatedness in this step, the global structure of the semantic network is exploited, yielding a more precise and higher-coverage measure of relatedness; to this end, a structural weighting of the network's edges is used, and for each vertex a set of related vertices is created using random walks with restart. In the second step, for a given text, it applies part-of-speech tagging, identifies all textual fragments, and lists all possible meanings of the extracted fragments. Finally, by creating a graph-based semantic interpretation of the whole text and using the previously computed semantic signatures, it selects the best candidate meaning for each fragment [34].
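The random-walk-with-restart step can be illustrated on a toy graph (a generic formulation with an assumed restart probability of 0.15; BabelNet's actual network and edge weighting are far richer):

```python
import numpy as np

def random_walk_with_restart(A, seed, restart=0.15, iters=100):
    """A: (n, n) adjacency matrix; seed: start vertex index.
    Returns stationary visiting probabilities under a standard RWR iteration."""
    n = A.shape[0]
    P = A / A.sum(axis=0, keepdims=True)   # column-stochastic transition matrix
    e = np.zeros(n); e[seed] = 1.0         # restart distribution concentrated on seed
    p = e.copy()
    for _ in range(iters):
        p = (1 - restart) * (P @ p) + restart * e
    return p

# Toy chain graph 0 - 1 - 2 - 3; vertices nearer the seed score higher
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
p = random_walk_with_restart(A, seed=0)
print(p.round(3))
```

The resulting probability vector ranks vertices by relatedness to the seed, which is exactly the notion Babelfy uses to build per-vertex semantic signatures.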
Based on this process, it can be said that Babelfy uses the context of a word to disambiguate it in a text. In our proposed method, to consider the question category as the contextual information, we simply concatenate it to the question subject, question body, and answer body. The concatenation of these three parts is considered as the input text.
To apply Babelfy to our problem, in its first step we use BabelNet, the largest multilingual KG [38], as the lexicalized semantic network in the disambiguation procedure. BabelNet, which contains both concepts and named entities as its vertices, is obtained from the automatic seamless integration of Wikipedia2
After disambiguating each word and capturing its correct sense in the current context from the KG, we represent it using NASARI [6]. NASARI is a multilingual vector representation of word senses with high coverage, including both concepts and named entities [6]. More specifically, NASARI combines the structural knowledge of semantic networks with statistical information derived from text corpora. This makes it possible to have an effective representation of millions of BabelNet synsets. The output of this step is the initial representation of the question subject, question body, and answer, denoted as
The problem of redundancy and noise is prevalent in CQA [29]. On the other hand, the question subject summarizes the main points of the question and so can be used to extract useful information from the question and answer.
In order to reduce the impact of redundancy and noise, we use the representation of the question subject,
Where
Where
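A minimal sketch of such subject-guided attention, assuming scaled dot-product scoring (an assumption on our part; the paper's exact formulation may differ):

```python
import numpy as np

def subject_attention(H, s):
    """H: (n, d) token representations; s: (d,) question-subject vector.
    Returns an attention-weighted summary of H."""
    scores = H @ s / np.sqrt(H.shape[1])   # relevance of each token to the subject
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over tokens
    return weights @ H                     # (d,) weighted representation

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 300))   # hypothetical token matrix
s = rng.normal(size=300)         # hypothetical subject representation
v = subject_attention(H, s)
print(v.shape)  # (300,)
```

Tokens weakly related to the subject receive near-zero weights, which is how the noisy parts of the question and answer are effectively discarded.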
Multi-task learning
The multi-task learning module in Fig. 1 is based on the Siamese architecture [14]. Siamese neural architectures first appeared in vision (face recognition [10]). They have recently been extensively studied for learning sentence representations and for predicting similarity or entailment relations between sentence pairs as an end-to-end differentiable task [12,23,26,44].
Our model consists of deconvolutional twin networks and extracts question-answer relevance by employing the discriminative information encoded by the encoder network.
As shown in Fig. 1,
To infer the label of the question-answer relevance, two latent features are sampled from the inference network, as
To balance maximizing the variational lower bound and minimizing the classifier loss, the model training objective is defined as follows:
Here,
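Schematically, and with generic symbols of our own choosing ($\mathcal{L}_q$ and $\mathcal{L}_a$ the variational lower bounds of the question and answer VAEs, $\ell_{\mathrm{cls}}$ the classifier loss, and $\alpha$ a balancing weight), such a multi-task objective takes the form:

```latex
\min_{\theta,\phi,\psi}\;
  -\mathcal{L}_{q}(\theta,\phi)
  \;-\; \mathcal{L}_{a}(\theta,\phi)
  \;+\; \alpha\,\ell_{\mathrm{cls}}(\psi)
```

A larger $\alpha$ pushes the latent codes toward class-specific representations, while a smaller $\alpha$ favors faithful reconstruction of the question and answer.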
Experimental results and analysis
In this section, we present the implementation details, the analysis of our proposed framework, and the comparison of experimental results.
Data
We conduct experiments on three widely used CQA datasets, SemEval-2015 Task 33
Each question in the datasets consists of a short title or subject and a detailed description or body. Questions are followed by a list of comments (or answers), each of which is classified into one of three categories: “Definitely Relevant” (Good), “Potentially Useful” (Potential), or “Bad” (bad, dialog, non-English, other). The “Good” label indicates that the answer is relevant to the question and answers it, even though it might be a wrong answer; “Potential” indicates that the answer contains potentially useful information about the question; and “Bad” indicates that the answer is irrelevant or useless. Besides three-class classification experiments, we also conducted experiments for two-class classification. Following previous work, for two-class classification we merge the “Potentially Useful” and “Bad” labels into one label, “Bad”.
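The merging step for the two-class setting amounts to a simple label mapping, which can be sketched as:

```python
def to_two_class(label):
    """Collapse the SemEval three-way labels into the binary setting:
    'Good' stays 'Good'; 'Potential' and 'Bad' both become 'Bad'."""
    return "Good" if label == "Good" else "Bad"

labels = ["Good", "Potential", "Bad"]
print([to_two_class(l) for l in labels])  # ['Good', 'Bad', 'Bad']
```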
Statistics of SemEval 2015, 2016, and 2017 datasets
In the experiments, we compare our proposed method with several baselines:
As mentioned before, we use BabelNet as our KG, which contains both concepts and named entities. NASARI is then used to obtain the embedding of each disambiguated word (sense). The maximum length is set to 50 and the vocabulary size to 5000. For training, we use a convolutional encoder with three layers followed by a deconvolutional decoder with the same number of layers. We try hidden sizes of 100, 300, and 500. The weight parameters are randomly sampled from a uniform distribution
The model is trained using the RMSProp optimizer [17]. Dropout is employed on the latent variable layer with a dropout rate of 0.5.
Quantitative evaluation
For the answer selection task, the standard metrics used in previous work for benchmarking are macro-averaged F1 and Mean Average Precision (MAP). We measure the performance using these metrics on three datasets: SemEval 2015, SemEval 2016, and SemEval 2017.
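For concreteness, the two metrics can be computed as follows (a self-contained sketch of the standard definitions, not the official SemEval scorer):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def average_precision(relevance_ranked):
    """AP for one question: binary relevance of its answers in ranked order."""
    hits, total, ap = 0, sum(relevance_ranked), 0.0
    for i, rel in enumerate(relevance_ranked, 1):
        if rel:
            hits += 1
            ap += hits / i
    return ap / total if total else 0.0

# MAP is the mean of per-question APs (toy example with two questions)
aps = [average_precision([1, 0, 1, 0]), average_precision([0, 1])]
print(round(sum(aps) / len(aps), 3))  # 0.667
```

Macro-averaged F1 weights each class equally, so a model cannot score well by ignoring the small “Potential” class; MAP instead rewards ranking the relevant answers near the top.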
Table 5, Table 6, and Table 7 show the performance of our proposed model against the other baselines for three-class classification on SemEval 2015, SemEval 2016, and SemEval 2017, respectively. It should be noted that, for the baselines whose results are reported only for two-class classification (KeLP [19], Question Condensing [55], MKMIA-CQA [59], KHAAS [58], UIA-LSTM-CNN [54], and CETE [31]), we modified their source code for three-class classification. Also, the CNN [28] and BiLSTM-attention [64] models, whose original implementations target datasets other than ours, were re-implemented for the SemEval datasets. Table 8, Table 9, and Table 10 report the two-class classification results.
Quantitative evaluation results on SemEval 2015 for three-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2016 for three-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2017 for three-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2015 for two-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2016 for two-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
Quantitative evaluation results on SemEval 2017 for two-class classification
∗Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value < 0.05).
As shown in Table 5, Table 6, and Table 7, for three-class classification, our proposed model outperforms the other baselines. It beats the state-of-the-art method, CETE, in F1 by about 6%, 4%, and 3% on SemEval 2015, 2016, and 2017, respectively. Similarly, it outperforms the MAP results of CETE on all three datasets. The p-values for these differences are less than 0.05, indicating that the improvements are statistically significant. It should be noted that treating the “potentially useful” label as a separate class, instead of merging it into the “bad” class, requires a more accurate model; this is a key advantage of our approach over its competitors.
Similarly, for two-class classification, as indicated in Table 8, Table 9, and Table 10, our proposed method outperforms the baseline methods in F1 and MAP. Except for MAP on the 2015 and 2017 datasets, the increases are statistically significant. These results show that our model's improvements do not depend only on the number of classes. The experimental results support our hypothesis about the obtained representations for the question and answer; in other words, they indicate that these representations are informative for predicting the relevance of questions and answers.
To analyze the effect of each component of our model, we also report an ablation test in terms of discarding the external knowledge from the KG (w/o KG), the attention on the subject (w/o AS), the question category (w/o category), the deconvolutional decoder (w/o deconv), and the VAE (w/o VAE). For w/o KG, we simply use word embeddings instead of sense embeddings in the initial representation. For w/o category, we disambiguate each question and answer on its own, without category information. For w/o deconv and w/o VAE, we use an LSTM decoder and a simple autoencoder instead of the VAE, respectively. The ablation results are summarized in Table 11 and Table 12 for the three datasets.
We also analyze the performance of the proposed method by starting from a baseline model and incrementally adding one component at a time. The baseline model is the vanilla version, with only two parallel autoencoders to obtain question and answer representations; the concatenation of these representations is sent to an MLP to extract question-answer relevance. Table 13 and Table 14 show the results.
Ablation test of the proposed model on SemEval 2015, SemEval 2016, and SemEval 2017 for three-class classification
Ablation test of the proposed model on SemEval 2015, SemEval 2016, and SemEval 2017 for two-class classification
Analysis of each component impact on SemEval 2015, SemEval 2016, and SemEval 2017 for three-class classification
Generally, all five factors contribute to the results of our proposed model. Notably, F1 and MAP decrease sharply when the KG is discarded. This is within our expectation, since the KG enriches the overall text representation by making it possible to consider all entities (especially named entities), the context, and the useful information. The deconvolutional VAE also makes a large contribution, verifying that a deconvolutional decoder yields a more informative representation. Not surprisingly, combining all components achieves the best performance.
In this subsection, we analyze the model's sensitivity to the CNN-specific hyper-parameters: window size, stride, and filter size (number of filters). Figure 2 and Fig. 3 show how the macro-averaged F1 changes with window size and filter size, respectively.
Analysis of each component impact on SemEval 2015, SemEval 2016, and SemEval 2017 for two-class classification
For the stride, we observe that when it is 4 or greater, the system comes close to fully fitting the training data (overfitting). The best value for stride is 2 for both datasets.
As shown in Fig. 2 and Fig. 3, the best macro-averaged F1 values obtained are 74.91 for SemEval 2015, 68.79 for SemEval 2016, and 70.43 for SemEval 2017, achieved with window size 4, stride 2, and filter size 300.

The influence of window size on model performance.

The influence of filter size on model performance.
In this article, we proposed a new model based on KGs for answer selection in community question answering forums. In the proposed architecture, external background knowledge is used to capture entity mentions and their relations in questions and answers. Also, by using the question category, a context-aware representation is generated for the question and answer. The model is trained in a multi-task learning procedure, in which there are two variational autoencoders in combination with a classifier to capture the semantic relatedness of the question and answer.
Quantitatively, the experimental results demonstrated that our model outperforms all existing baselines. We also conducted an ablation analysis to show the effectiveness of each component of the proposed model. The results confirm our architecture design choices, since all components, especially the KG integration, contribute positively.
Footnotes
Acknowledgement
This work is based upon research funded by the Iran National Science Foundation (INSF) under project No. 4002438.
