Abstract
Multi-turn response selection is essential to retrieval-based chatbots. The task requires a model to match a response candidate with a conversation context. Existing methods may lose relationship features in the context. In this article, we propose an improved method that extends the learning granularity of the multi-turn response selection model to enhance its ability to learn relationship features of utterances in the context, which is key to understanding a conversation context for multi-turn response selection in retrieval-based chatbots. The experimental results show that our proposed method significantly improves the sequential matching network for multi-turn response selection in retrieval-based chatbots.
Introduction
Dialogue systems and conversational agents—including chatbots, personal assistants, and voice-control interfaces—are becoming ubiquitous in modern society. 1 Recently, increasing attention from both academia and industry has been paid to building non-task-oriented chatbots that can converse naturally with humans on open-domain topics. 2 Existing work on building chatbots includes generation-based methods and retrieval-based methods. 3 Retrieval-based methods have been applied to many chatbots, such as MILABOT from the Montreal Institute for Learning Algorithms 1 and XiaoIce from Microsoft, 4 because they can return informative and fluent responses.
A key step in response selection is measuring the matching degree between a response candidate and an input. 2 Single-turn response selection in retrieval-based chatbots matches a response with only the last input message. 5 However, multi-turn response selection in retrieval-based chatbots requires matching a response with a conversation context (also known as the previous utterances in a continuous conversation session). 6 The key to the retrieval-based multi-turn response selection task is to understand the context of a continuous conversation. In a multi-turn conversation, if a chatbot can completely understand the whole context, it can return the best answer to the user. Understanding the conversation context is difficult for chatbots, because it is challenging to capture utterance-level discourse information and dependencies. 7 Identifying important information in the context and modeling relationships among utterances in the context are the challenges of the multi-turn response selection task. 3
In response to those challenges, many retrieval-based multi-turn response selection models have been proposed. The sequential matching network (SMN) 3 is the most distinguished method, significantly outperforming the state-of-the-art methods for response selection in multi-turn conversation. 3 However, SMN is limited in learning relationship features among utterances in the context. Understanding the relations of utterances plays a key role in mastering dialogues. 8 Modeling relationships among utterances in the context is challenging, while acquiring human annotations on discourse relations is a time-consuming and expensive process that does not scale to large data sets. 8
In this article, an improved method is proposed to help SMN learn more relationship features among utterances in the context. This method is inspired by the idea of the N-gram model proposed by Shannon (1948), 9 who used it to explore language modeling in the context of information entropy, which was also introduced in the same paper. Since then, many studies have improved natural language processing models by changing the granularity of natural language modeling. For example, if the phrase “white house” is processed as two independent units, it will be understood as a house that is white; if the words in this phrase are concatenated into one unit, the model will understand “white house” as the US presidential residence. This example indicates that a bigram model can outperform a unigram model in some natural language understanding tasks. Most studies improve natural language processing models by changing word-level or character-level granularity in the processing unit. Few studies have focused on a larger processing granularity. Our study improves the multi-turn response selection model by changing utterance-level granularity in the processing unit. First, the utterance is the most important learning granularity in the multi-turn response selection task. An analysis of the Multi-view and SMN models on the Ubuntu Corpus shows that utterance-level granularity plays a more important role than word-level granularity in context understanding for multi-turn dialogue, as shown in Table 1. In the Multi-view model, Utter-seq-GRU and Word-seq-GRU are based on utterance-level granularity and word-level granularity, respectively. In the SMN model, M2 and M1 are based on utterance-level granularity and word-level granularity, respectively. Details of M1 and M2 will be introduced in the “SMN method” section.
Second, modeling the relationship between utterances is one of the two major challenges of the multi-turn response selection task (the other is to identify important words in the context). In addition, good features can be learned automatically using a general-purpose learning procedure: in most natural language processing tasks, word vectors are composed of learned features that were not determined ahead of time by experts but were automatically discovered by the neural network. 10 Thus, in this article, we propose a method that feeds each pair of two adjacent utterances into a neural network model as one unit, which helps the model learn better relationship features among utterances in the context. We refer to this method as learning bi-utterance (LBU). By adding our method to SMN, the experimental results on two public data sets show that LBU significantly improves SMN for multi-turn response selection in retrieval-based chatbots.
Experimental results of Multi-view and SMN model ablation.a
SMN: sequential matching network.
The rest of this article is organized as follows. The second section gives a detailed description of related work. The third section presents the principle of the proposed method in three parts. The experimental results and analysis are included in the fourth and fifth sections, respectively. Finally, the work is concluded and future research directions are identified in the sixth section.
Related work
In this section, we briefly review the existing literature that closely relates to the proposed method. First, we introduce the chatbots system. Then, we review basic and advanced models of multi-turn response selection in retrieval-based chatbots.
Chatbots system
The research on chatbots systems originated from the development of conversation systems. Owing to the lack of conversational data, early conversation systems were usually closed-domain, aimed at a specific domain such as psychological consultation. Early conversation systems were built on rules or templates, which required substantial manual effort, and it was difficult to adapt an existing conversation system to another domain. With the advent of big data, the tremendous development of social networks has made it possible for researchers to obtain large amounts of conversational data. Chatbots system research is emerging and receiving increasing attention from academia and industry. Unlike the early conversation systems, the chatbots system is open-domain: users can converse naturally with chatbots on any open-domain topic. Two promising methods have been presented for building chatbots systems: the retrieval-based method, 2,3,6,7,11–15 which selects a response from a large corpus, and the generation-based method, 16–24 which directly generates the next utterance from a trained model. Our study focuses on the retrieval-based method.
Basic models of retrieval-based chatbots
A corpus is the basis of natural language processing, and the lack of large data sets had been a barrier for research in multi-turn response selection 25 until Lowe et al. 25 released the Ubuntu Dialogue Corpus, which provided data for this research. In addition, Lowe et al. 25 presented three benchmark methods, including term frequency-inverse document frequency (TF-IDF), recurrent neural networks (RNN), and long short-term memory (LSTM), for analyzing this data set. Kadlec et al. 26 improved the baselines for the Ubuntu Dialogue Corpus by using a better word embedding method to improve LSTM, and presented convolutional neural networks (CNN) and bidirectional long short-term memory (Bi-LSTM) as new methods. Moreover, Kadlec et al. 26 created an ensemble of multiple neural networks that further enhanced performance and achieved a better result. 26 TF-IDF, RNN, CNN, LSTM, and Bi-LSTM are considered basic models. 3
Advanced models of retrieval-based chatbots
Ensemble models using multiple neural networks are considered advanced models. 26 The Deep Learning to Respond (DL2R) model proposed by Yan et al. 6 is one of the earliest advanced models. This model is an ensemble of three different neural networks, including Bi-LSTM, CNN, and MLP, which significantly outperforms basic models for response selection in multi-turn conversation. Yan et al. 6 also contributed an approach to modeling context in a continuous multi-turn conversation by reformulating the message with other utterances. 3 At the same time, Zhou et al. 7 observed that the works of Lowe et al. 25 and Kadlec et al. 26 view the context and response as word sequences, leaving out utterance-level discourse information and dependencies. Thus, they presented a new ensemble model called Multi-view, 7 which integrates information from both the word sequence view and the utterance sequence view; each view represents relationships between context and response from a particular aspect, and features extracted from the word sequence and the utterance sequence provide complementary information for response selection. 7 Wu et al. 3 verified that Multi-view 7 is not only superior to the methods of Lowe et al. 25 and Kadlec et al. 26 but also better than DL2R 6 on the Ubuntu Dialogue Corpus. Then, Wu et al. 3 provided a brand new ensemble model called SMN 3 for response selection in multi-turn conversation. This model is an ensemble of two GRUs and one CNN. In particular, it constructs a word–word similarity matrix and a sequence–sequence similarity matrix to link the first GRU and the CNN in the model. These two matrices capture important matching information at a word level and a segment (word subsequence) level, respectively, which means important information from multiple levels of granularity in the context is recognized under sufficient supervision from the response and carried into matching with minimal loss. 3 An empirical study on two public data sets shows that SMN can significantly outperform state-of-the-art methods for response selection in multi-turn conversation. 3 Wu et al. 3 also published the Douban Conversation Corpus, the first human-labeled multi-turn response selection data set released to research communities. The SMN is described in detail in the “SMN method” section.
It is worth noting that some leading researchers 3,7 of chatbots consider it essential for a retrieval-based multi-turn response selection model to learn relationship features of utterances in the context. Therefore, in our study, we focus on improving the retrieval-based multi-turn response selection model by enhancing its ability to learn relationship features of utterances in the context.
The method principle
The principle of the retrieval-based multi-turn response selection task is to retrieve candidate answers from the corpus according to a conversation context, instead of considering only the user's last input as in the single-turn response selection task, and then to select the best answer for the user from the retrieved candidates. The key of this task is to select the best answer among all candidate answers according to the conversation context. 12 It is necessary to score each retrieved candidate answer and return the answer with the highest score to the user.
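The score-and-rank selection described above can be sketched as follows. This is a minimal illustration, not the paper's model: `select_best_response` and `toy_score` are hypothetical helpers, and the word-overlap scorer merely stands in for the neural matching model.

```python
def select_best_response(context, candidates, match_score):
    """Score each retrieved candidate against the context and
    return the candidate with the highest matching score."""
    scores = [match_score(context, r) for r in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]

def toy_score(context, response):
    """Toy scorer: count of response words that also appear in the context."""
    context_words = set(" ".join(context).split())
    return len(context_words & set(response.split()))

context = ["how do i install pip", "which ubuntu version do you use"]
candidates = ["use apt to install pip on ubuntu", "hello there"]
print(select_best_response(context, candidates, toy_score))
```

In the actual system, `match_score` would be the trained matching model g(s, r); only the ranking logic is shown here.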
In the proposed method, we add a bi-utterance constructing process to SMN which concatenates each two adjacent utterances as a unit. In addition, a concatenating vectors process is added to manage the generated additional vectors from SMN. The process of SMN with LBU is illustrated in Figure 1.

SMN with LBU. SMN: sequential matching network; LBU: learning bi-utterance.
Multi-turn response selection formalization
The problem solved by the retrieval-based multi-turn response selection task can be expressed as
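A sketch of the standard formalization, following the setup of the cited SMN work (the notation below is assumed, not reproduced from this article):

```latex
% Data set of labeled (context, response) triples
\mathcal{D} = \{(y_i, s_i, r_i)\}_{i=1}^{N}, \qquad
s_i = \{u_{i,1}, \ldots, u_{i,n_i}\}
% y_i \in \{0, 1\}: label; s_i: conversation context of n_i utterances;
% r_i: response candidate. The goal is to learn a matching model
% g(s, r) that scores the suitability of r as a response to s.
```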
SMN method
The first step of SMN is word embedding, which looks up the embedding table and then expresses u as
SMN establishes a word–word similarity matrix and a sequence–sequence similarity matrix, where the word–word similarity matrix is denoted by M1. Since
where
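As a hedged illustration of the word–word similarity matrix M1 (a minimal numpy sketch, not the authors' implementation; `word_word_similarity` is a hypothetical name), each entry is the dot product of an utterance word embedding and a response word embedding:

```python
import numpy as np

def word_word_similarity(utterance_emb, response_emb):
    """M1[i, j] = dot product of the i-th utterance word vector
    and the j-th response word vector."""
    return utterance_emb @ response_emb.T

# Two utterance words and three response words, embedding size 4.
U = np.arange(8, dtype=float).reshape(2, 4)
R = np.ones((3, 4))
M1 = word_word_similarity(U, R)
print(M1.shape)  # (2, 3)
```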
The sequence–sequence similarity matrix of SMN is represented by M2. To get M2, the first gated recurrent unit (GRU) is used to convert U and R into hidden vectors. Assuming the hidden vectors of U are represented by
where
where
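The sequence–sequence similarity M2 can likewise be sketched as a bilinear form over GRU hidden states. This is an assumption-laden sketch: in SMN the matrix A is learned, but it is set to the identity here purely for illustration.

```python
import numpy as np

def sequence_similarity(h_u, h_r, A):
    """M2[i, j] = h_u[i]^T · A · h_r[j]: bilinear similarity between
    the hidden states of utterance and response positions."""
    return h_u @ A @ h_r.T

hidden = 5
h_u = np.random.rand(4, hidden)  # hidden states for 4 utterance words
h_r = np.random.rand(3, hidden)  # hidden states for 3 response words
A = np.eye(hidden)               # identity stands in for the learned A
M2 = sequence_similarity(h_u, h_r, A)
print(M2.shape)  # (4, 3)
```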
After that, the word–word similarity matrix M1 and the sequence–sequence similarity matrix M2 are as input of CNN to extract important matching information to form a matching vector v. Let
where
Through the above steps, SMN gets a matching vector
where l′ represents the last layer of the feature map, I and J are the maximum index of the feature map, and Wc and bc are parameters. It means that SMN uses two different text granularities to extract matching information from multi-turn dialogue corpus and maps it as a vector.
Afterward, SMN inputs the obtained matching vector v into the second GRU for processing. The method is the same as equation (2), and
where W 1 and b 1 are parameters.
In the model training process, SMN uses cross-entropy as the loss function; let θ denote the parameters of
where N is the number of utterances in data set D.
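The training objective can be sketched as a generic binary cross-entropy over candidate scores; this is a plain re-statement of the loss named in the text, not the authors' code, and `cross_entropy_loss` is a hypothetical helper.

```python
import math

def cross_entropy_loss(labels, scores):
    """Average binary cross-entropy over (label, predicted-probability)
    pairs, the loss named in the text for training the matching model."""
    eps = 1e-12  # guards log(0)
    total = 0.0
    for y, p in zip(labels, scores):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)

# A confident correct prediction and a confident correct rejection.
print(round(cross_entropy_loss([1, 0], [0.9, 0.1]), 4))
```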
LBU method
The principle of the proposed LBU method is to feed each pair of two adjacent utterances into SMN as one unit, so that SMN can learn more relationship features among utterances in the context. The bi-utterance constructing step is added between the processing of the first GRU and the construction of the similarity matrices. LBU has two steps, and the first and most important one is to construct bi-utterances. It is defined as equation (9)
where
The second step of LBU is a vector fusion operation, which handles the newly generated vector b. Vector b contains more relationship features between two adjacent utterances. In the process of SMN, it gets a matching vector
where
Algorithm 1 shows the processing of the proposed method. Note that the proposed method uses an essential padding strategy. After concatenating each two adjacent utterances into one, the total number of input units is reduced by 1. For example, if we originally have 10 utterances and concatenate each two adjacent utterances into one bi-utterance, we end up with 9 bi-utterances. To make up for this missing unit, our method includes a padding process that takes the last utterance again and concatenates it with itself to form one more bi-utterance while reading the data.
Learning bi-utterance for SMN.
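The bi-utterance construction and padding strategy described above can be sketched as follows (a minimal re-implementation, not the authors' code; utterances are plain strings here, whereas the model operates on token sequences):

```python
def build_bi_utterances(utterances):
    """Concatenate each two adjacent utterances into one bi-utterance.
    Pads by pairing the last utterance with itself, so the output has
    the same number of units as the input."""
    if not utterances:
        return []
    pairs = [utterances[i] + " " + utterances[i + 1]
             for i in range(len(utterances) - 1)]
    pairs.append(utterances[-1] + " " + utterances[-1])  # padding step
    return pairs

utts = ["hi", "how are you", "fine thanks"]
print(build_bi_utterances(utts))
```

Without the padding step, 10 utterances would yield only 9 bi-utterances; the duplicated final pair restores the original count, as the text explains.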
Experiment
In the experimental section, we verify the proposed method on two public data sets: the Ubuntu Corpus and the Douban Conversation Corpus. For comparison with the SMN method, the data sets and parameters are kept the same as in SMN 3 to ensure that no other factors affect the experimental results.
Experimental data sets
The Ubuntu Corpus 25 is an English multi-turn dialogue data set. It consists of 1 million multi-turn dialogues for training, 0.5 million for validation, and 0.5 million for testing. The Ubuntu Corpus was extracted from the Ubuntu chat logs, where users receive technical support for various Ubuntu-related problems. Positive responses are true responses from humans, and negative ones are randomly sampled. The ratio of positive to negative is 1:1 in the training data set and 1:9 in the validation and testing data sets. This study uses the preprocessed data from the Ubuntu Corpus, in which numbers, URLs, and paths are replaced by special placeholders. 27 The evaluation metric of the Ubuntu Corpus is recall at position k in n candidates (R n @k). 25
The Douban Conversation Corpus 3 is a Chinese multi-turn dialogue data set crawled from the popular Chinese social networking service Douban group. It consists of 1 million multi-turn dialogues for training, 50,000 for validation, and 10,000 for testing. In the test set, every context has 10 response candidates, and each response is labeled “good” or “bad” by human annotators. 2 The evaluation metrics of the Douban Conversation Corpus include recall at position k in n candidates (R n @k), 25 mean average precision (MAP), 28 mean reciprocal rank (MRR), 29 and precision at position 1 (P@1). The Douban Conversation Corpus did not use R2@1 as an evaluation metric, because one context can have more than one correct response in the Douban corpus, which would bring bias to the evaluation.
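The R n @k metric used by both corpora can be sketched as follows; `recall_at_k` is a hypothetical helper computing the metric for one context's scored candidates (the reported numbers average this over all test contexts):

```python
def recall_at_k(score_label_pairs, k):
    """R_n@k for one context: 1 if a true response (label 1) appears
    among the top-k candidates ranked by score, else 0."""
    ranked = sorted(score_label_pairs, key=lambda p: p[0], reverse=True)
    return int(any(label == 1 for _, label in ranked[:k]))

# One context with 10 candidates: the true response is ranked second.
candidates = [(0.9, 0), (0.8, 1)] + [(0.1 * i, 0) for i in range(8)]
print(recall_at_k(candidates, 1), recall_at_k(candidates, 2))  # 0 1
```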
Statistics of Ubuntu Corpus and Douban Conversation Corpus are displayed in Table 2.
Statistics of Ubuntu Corpus and Douban Conversation Corpus.a
Experiment setup
Models in this work are implemented with Theano, 30 and all experiments are processed on a single GPU. Word embedding is initialized by the results of Word2Vec, 31 which runs on the training data, and the dimensionality of word vectors is 200. We followed the work of Wu et al. 3 to set the dimensionality of the hidden states of the first GRU as 200. The window size of convolution and pooling is (3, 3). The number of feature maps is eight. In the second GRU, we set the dimensionality of the hidden states as 100 rather than 50, because LBU requires more space. The experimental results show that tuning the dimensionality of the hidden states of the GRU alone cannot improve SMN. Parameters of models are updated by stochastic gradient descent and optimized by the Adam algorithm. 32 The initial learning rate is 0.001, and parameters of Adam,
Experimental results
We choose three retrieval-based multi-turn response selection models as baselines: DL2R, 6 Multi-view, 7 and SMN. 3 Three different methods, SMNlast, SMNstatic, and SMNdynamic, are used to calculate the final score of response candidates in SMN. Our proposed method builds on the final-score calculation of SMNlast. Table 3 shows the evaluation results on the two data sets. It demonstrates that our proposed method significantly improves SMN and is better than the other models. The experimental results of our proposed method are the best among all methods on the Ubuntu Corpus. Furthermore, the effectiveness of our proposed method is even more significant on the Douban Conversation Corpus.
Experimental results on two data sets.a
MAP: mean average precision; MRR: mean reciprocal rank; LBU: learning bi-utterance; DL2R: deep learning to respond; SMN: sequential matching network.
aBold font indicates the experimental results of proposed method. All the results except ours are from the article of Wu et al. 3
Analysis
We verified the effectiveness of our proposed method via model ablations and explored the possibility of further improving the model by tuning the learning granularity.
Model ablations
We ablated the model to further analyze the effectiveness of our proposed method, as shown in Table 4. There are two channels in SMN to capture important matching information: M1, the word–word similarity matrix, and M2, the sequence–sequence similarity matrix. The performance of the model drops significantly when SMNlast uses only one channel to capture important matching information. If SMNlast uses only the M2 (sequence–sequence similarity matrix) channel, the performance is better than when it uses only the M1 (word–word similarity matrix) channel. Our proposed method can likewise use the two channels M1 and M2 simultaneously to capture important matching information; M1 with LBU refers to the word–word similarity matrix with LBU, and M2 with LBU refers to the sequence–sequence similarity matrix with LBU. The performance also drops significantly when SMNlast with LBU uses only one channel. Note that the improved model using only the M1-with-LBU channel is better than SMNlast using only the M1 channel; likewise, the improved model using only the M2-with-LBU channel is better than SMN using only the M2 channel. This demonstrates that LBU is effective in improving the performance of the whole SMN as well as its individual parts.
Experimental results of SMN model ablation on Ubuntu Corpus and Douban Conversation Corpus.a
MAP: mean average precision; MRR: mean reciprocal rank; LBU: learning bi-utterance; SMN: sequential matching network.
aBold font indicates the experimental results of proposed method.
Learning granularity tuning
Our proposed method concatenates each two adjacent utterances as one unit fed into the model. We also tried learning tri-utterance and n-utterance (n > 3) methods to find out whether further progress is possible with a different n. The results show that concatenating three or more adjacent utterances as one unit cannot improve the model. On the contrary, continuously extending the learning granularity reduces the effectiveness of the model, since it generates redundant relationship information among utterances in the context, which makes it harder for the model to learn good features of the context. Therefore, concatenating each two adjacent utterances as one unit is the best learning granularity for improving multi-turn response selection in retrieval-based chatbots.
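The granularity tuning above can be illustrated by generalizing the bi-utterance construction to a window of n adjacent utterances; `build_n_utterances` is a hypothetical helper (not from the paper), with n = 2 recovering LBU's behavior including its self-padding of the final utterance:

```python
def build_n_utterances(utterances, n=2):
    """Concatenate each window of n adjacent utterances into one unit,
    padding with copies of the last utterance so the output length
    matches the input length."""
    padded = utterances + [utterances[-1]] * (n - 1)
    return [" ".join(padded[i:i + n]) for i in range(len(utterances))]

utts = ["a", "b", "c"]
print(build_n_utterances(utts, 2))  # ['a b', 'b c', 'c c']
print(build_n_utterances(utts, 3))  # ['a b c', 'b c c', 'c c c']
```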
Conclusions
In this article, we propose an improved method based on SMN for multi-turn response selection in retrieval-based chatbots. This method helps SMN learn more relationship features of utterances in the context, which is key to understanding the context of a continuous conversation. The experimental results show that our proposed method significantly improves SMN for response selection in multi-turn conversation. In the future, we will try to apply our proposed method to other text matching tasks.
Acknowledgement
We appreciate the support and advice of Yu Wu.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This article was supported by National Key Research and Development Program of China (2016YFB0502600), Beijing Municipal Science and Technology Project (BJMSJY-170167) and Open Fund of Key Laboratory for National Geographic Census and Monitoring, National Administration of Surveying, Mapping and Geoformation (2017NGCMZD03).
