
Abstract

Conversational recommender systems provide users with item recommendations via interactive dialogues. Graph neural networks have proven to be an adequate representation learning framework for knowledge graphs. However, the knowledge graph involved in the dialogue context is vast and noisy; in particular, noisy graph nodes hinder a node’s aggregation over its neighbor nodes. In addition, although a recurrent neural network can encode the local structure of word sequences in a dialogue context, it may still struggle to remember long-term dependencies. To tackle these problems, we propose a sparse multi-hop conversational recommender model named SMCR, which accurately identifies important edges through matching items, thus reducing the computational complexity of sparse graphs. Specifically, we design a multi-hop attention network to encode the dialogue context, which can quickly encode long dialogue sequences to capture long-term dependencies. Furthermore, we utilize a variational auto-encoder to learn topic information for capturing syntactic dependencies. Extensive experiments on the travel dialogue dataset show that our proposed model significantly improves over state-of-the-art methods in both recommendation and dialogue generation.

Keywords

Conversational recommender systems, graph neural networks, knowledge graphs, multi-hop attention

1. Introduction

Conversational recommender systems (CRS) have become an emerging research topic, aiming to provide users with high-quality recommendations through natural language conversations [1, 2, 3, 4]. The general idea of such systems is that they support a task-oriented, multi-turn dialogue with their users. Unlike traditional recommender systems, CRS utilizes dialogue data to accomplish recommendation tasks. The CRS usually consists of a dialog component interacting with the user and a recommendation component.

Most existing CRSs focus on natural language processing or semantic-rich search solutions for dialogue systems. Traditionally, the CRS mainly asked users about their preferences over pre-defined slots to provide recommendations [2, 5]. Moreover, some studies [6, 7] use natural language dialogues and user interactions to make recommendations, emphasizing fluent response generation and accurate advice. Recently, a popular trend [6, 8] has been to incorporate knowledge or reinforcement learning into enhanced user models or interaction mechanisms to improve the performance of CRS. For example, the task-oriented dialogue system (e.g., Mem2Seq) uses a memory network based on multi-hop attention to incorporate knowledge and user input [9, 10, 11, 12]. In addition, to meet the needs of multiple topics, the deep conversational recommender in travel (DCR) leverages a graph convolutional network (GCN) to capture the relationships between different venues and the match between the venue and dialog context.

However, these existing methods suffer from two issues. (1) The GCN-based deep conversational recommender model cannot adequately capture spatial information. A graph convolutional network assigns the same weight to all neighbors in the same-order neighborhood; combined with a huge and noisy graph, this may lead to overfitting of the GCN-based conversational recommendation model. (2) The RNN-based deep conversational recommendation model suffers from the long-dependency problem when generating conversation topics. The RNN-based hierarchical recurrent encoder-decoder (HRED) is employed to address the long-dependency problem in natural language processing tasks [6, 13, 14, 15, 16]; however, it is still difficult to capture long dependencies well by concatenating the hidden states of the last steps of each position of a sentence. In particular, when two words in the same sentence are far apart, the model cannot accurately capture the dependency between them. This dependency affects the overall structure of the sentence and is very important for ensuring smooth communication between users and agents.

To address these two issues, we observe that sparse graph attention networks [17, 18, 19] can effectively remove edges in a graph that are unrelated to the task. Sparse graph attention networks (SGAT) leverage binary masks assigned to every edge to filter out noisy nodes. Moreover, assigning different attention scores removes the limitation of identical weights within the first-order neighborhood. In addition, we also notice that the combination of memory networks and multi-hop attention in the Mem2Seq model reinforces the capability of capturing long-term dependencies, because embedding vectors are stored in external memory and query vectors can easily access these “memories.” Instead of directly inputting the triples in the knowledge graph and the historical dialogue sequences into the encoder, we use only the context as the input of the encoder, which avoids introducing noisy information. Graph structure information often contains various types of connections and relationships, while sequence information focuses on the temporal or sequential order of data. The two types of information have different characteristics and may introduce conflicting or irrelevant patterns when combined. This interference of noisy information can arise from the complexity and heterogeneity of the combined data sources. Additionally, the combination of graph structure and sequence information may result in a high feature extraction granularity: graph structure information provides fine-grained details about the relationships and connections between entities, while sequence information captures the local dependencies within the data. Meanwhile, to learn topic information, we use variational auto-encoders optimized by KL cost annealing [20], which effectively mitigates the KL-vanishing problem.

In this paper, we propose a novel model called sparse multi-hop conversational recommender systems (SMCR). SMCR is a neural method that generates natural language by fusing recommended items through filtered knowledge graphs. We take full advantage of widely used slot-filling techniques to ensure that items are inserted at the appropriate positions in the sentence. The item recommender selects the appropriate item for the slot, utilizing enhanced graph attention mechanisms to incorporate external knowledge into the dialogue. Our model is deployed in an end-to-end fashion. SMCR has both the controllability of traditional slot-filling models and the flexibility of neural language models. Besides, our model combines the ability of RNNs to quickly deal with sequences and the ability of attention mechanisms to connect different parts of long sequences. The advantage of SMCR is that it preserves the dependencies between the parts of the dialogue while retaining the memory of the dialogue content.

To sum up, the major contributions of this work are as follows:

• We develop a recommendation model based on sparse graph attention to match items with dialogue context, which can accurately identify important edges, thereby reducing the complexity of graph computing and the interference of noisy nodes.
• We design a multi-hop attention network that can quickly encode long conversation sequences to capture long-term dependencies, and we learn topic information through a variational auto-encoder optimized by KL cost annealing.
• We conduct extensive comparative experiments on a multi-domain wizard dataset, demonstrating that our proposed method outperforms state-of-the-art methods.

2. Related work

2.1 Dialogue system

According to different application scenarios, dialogue systems are divided into three types: task-oriented dialogue systems (e.g., Cortana and Siri), chit-chat dialogue systems (e.g., Xiaobing), and question-and-answer dialogue systems (e.g., online store assistants). Traditional dialogue systems are usually based on rules or templates. For example, Weizenbaum et al. [21] developed the Eliza system to simulate psychotherapists’ treatment of people with mental health conditions. Subsequently, Wallace et al. [22] developed the Alice system based on AIML and XML to create stimulus-response chatbots. However, these methods rely on a great deal of manual labeling. To solve this problem, De et al. [23] designed a multi-party dialogue system based on machine learning and rules, leveraging support vector machines for decision-making. In addition, benefiting from the rapid development of deep learning and natural language technologies, more and more researchers focus on dialogue systems based on deep learning. For example, Dhingra et al. [24] combined reinforcement learning and knowledge graphs to develop the KB-InfoBot model, a dialogue agent that provides users with entities from the knowledge base by interactively querying features. Lipton et al. [25] proposed the BBQ network, which uses reinforcement learning for dialogue systems. These studies on dialogue systems can achieve smooth human-computer interaction; nonetheless, we believe that discovering user interests through dialogue and guiding users to complete purchases, subscriptions, and other behaviors has greater commercial value. Therefore, it is imperative to construct a dialogue-based recommendation system.

2.2 Conversational recommender

With a fast uptake of deep learning in recent years, researchers have become more interested in interactive recommender systems. For example, Christakopoulou et al. [26] proposed a novel view that regards the recommendation as an interactive process. Greco et al. [27] utilized hierarchical reinforcement learning to model the CRS target as a target-specific representation module. Sun et al. [2] proposed a unified framework that integrates the recommendation system and the dialogue system to build an intelligent dialogue recommendation system. Due to the lack of publicly available large-scale dialogue datasets, Li et al. [28] provided a REDIAL dataset containing real-world dialogues. In order to proactively ask questions in the conversation, Zhang et al. [29] introduced not only the system ask-user response (SAUR) paradigm for conversational search and recommendations but also designed a unified implementation framework for product search and recommendations in e-commerce. Although these studies have achieved some success, they only use dialogue information in the model, resulting in a lack of sufficient context to express user preferences. Overall, these models perform poorly on evaluation tasks (i.e., recommendation and dialogue). In order to solve these problems, many researchers focus on knowledge-based conversational recommender systems, which can provide external knowledge to narrow the gap between the dialogue system and the recommendation system for improving the performance of the recommender model.

2.3 Knowledge-based conversational recommender

Knowledge graphs (KG) are able to represent the structured relationship among entities and have been successfully used in conversational recommender systems. Chen et al. [6] proposed a novel end-to-end framework and introduced knowledge-grounded information about users’ preferences. Moon et al. [30] proposed a DialKG Walker model, which learns the symbolic transitions of dialog contexts as structured traversals over KG and predicts natural entities to introduce given previous dialog contexts via a novel domain-agnostic and attention-based graph path decoder. Liao et al. [13] combined the sequence-to-sequence model with a neural latent topic component and graph convolutional network to recommend in the tourism field. Lei et al. [31] utilized graphs to solve the multi-round dialogue recommendation problem and proposed the conversational path reasoning framework, which synchronizes the conversation with the graph-based path reasoning. This model makes the use of attributes more explicit and greatly improves the interpretability of conversational recommendations. Zhou et al. [8] adopted mutual information maximization to align the word-level and entity-level semantic spaces with bridging a semantic gap between natural language expressions and item-level user preferences.

To sum up, these works utilize the path of the knowledge graph to simulate the dialogue process or leverage the knowledge graph to model the item. However, in the real world, the dialogue has the characteristics of multiple levels, multiple rounds, and multiple topics, and there are complex dependencies between sub-conversations in the dialogue. In addition, there are many items involved in the dialogue, and each item has many attributes, which will add a lot of calculation to the modeling. Therefore, we argue that the graph generated from the dialogue is complex and sparse, resulting in some noise nodes in the process of extracting and aggregating graph information, which does not contribute to the aggregation result. Effectively distinguishing noisy nodes from important nodes will improve aggregation efficiency and save computational space. Based on these assumptions, we develop sparse graph attention to match items with the dialogue context in order to reduce the complexity of graph computing and the interference of noisy nodes. In addition, we design a multi-hop attention network to encode the dialogue context, which can quickly encode the long dialogue sequences to capture the long-term dependencies.

3. The proposed model: SMCR

In this section, we introduce our proposed conversational recommender method (SMCR), which combines the recommender system and the conversation system. We will illustrate how the encoder based on multi-hop attention maps dialogue information to vectors and how it brings external knowledge to context. The SMCR method consists of two components: a dialogue state tracking module and an SGAT-based recommender module. The detailed structure of the model and a dialogue example is shown in Figs 1 and 2, respectively. Our parameters to learn are organized into three groups. Algorithm 2 presents the training algorithm for our SMCR model. The detailed description of Algorithm 2 is as follows: the conversational recommendation dataset and knowledge graph are used as inputs to the model. In the recommendation branch, firstly, the dialogue dataset is encoded using a multi-hop attention network to obtain the dialogue hidden states. Then, a sparse graph attention network is used to extract the knowledge graph and obtain the graph hidden states. Finally, these two hidden states are passed through a fully connected layer followed by a softmax function to obtain the item probabilities. In the generation branch, similar steps are followed to obtain the hidden states. To differentiate the impact of stop words from non-stop words on the language model, the TopicRNN model is used in conjunction with an RNN decoder to generate word probabilities.

Figure 1.

The proposed SMCR model consists of three components. The left part of the graph is a multi-hop encoder module, which uses a multi-hop attention mechanism to learn historical dialogue information. The upper right part of the graph is the dialogue state tracking module, which is used to capture user dialogue and manage the dialogue state of the system. The lower right part of the graph is an SGAT-based recommender module, which is used to match items and contexts.

3.1 Dialogue state tracking module

We use hierarchical recurrent encoder-decoder (HRED) to model the dialogue state tracking. First, we use HRED to build an encoder based on sentence-level and word-level RNN to encode the context and words separately; then, we use an LSTM-based or GRU-based decoder.

Compared with using a Transformer as the backbone of SMCR, the RNN is more sensitive to the temporal order of sentences, and the memory cost of an RNN is much lower than that of a Transformer. In addition, the positional encoding of the Transformer has shortcomings. When word vectors are linearly transformed, most of their semantic information can be preserved. However, this does not hold for the positional encoding. Therefore, it is unreasonable to add the positional encoding to the word vector.

Figure 2.

A sample dialog between a user and an agent from the dataset. We observe the need for global topic control and a knowledge graph to recommend an appropriate venue.

Multi-hop Encoder.

It is difficult to deal with the problem of long-term dependence when using RNN-based encoders in dialogue generation modeling. Inspired by the end-to-end memory network [32, 33, 34, 35, 36], we develop a multi-hop attention-based encoder to encode the dialogue context, based on the fact that using the attention mechanism in the encoding phase helps to deal with long-term dependencies. Besides, external memories in memory networks reinforce the persistence of memory. Specifically, we consider a dialogue as a sequence of $n$ utterances $D=\{{U_{1}},{U_{2}},\ldots,{U_{n}}\}$, and each $U_{n}\in D$ contains a sequence of ${M_{n}}$ tokens, i.e., ${U_{n}}=\{W_{n,1},\ldots,W_{n,{M_{n}}}\}$, where $W_{n,m}$ is a random variable taking values in the vocabulary $V$, which represents the token at position $m$ in utterance $n$. The encoder maps each word $W_{h,j}$ in utterance ${U_{h}}=\{W_{h,1},\ldots,W_{h,i}\}$ to a word vector ${\theta_{h,j}}$. Iteratively, the encoder encodes the dialogue $D$ into high-level representations $\beta=\{{\beta_{1}},\ldots,{\beta_{N}}\}$, which are called the memory. Then, we consider a query vector $Q$ as a reading head. The model loops over $L$ hops, and at hop $l$ it computes the attention weight of each memory $k$ as

$\displaystyle A_{k}^{l}=\textit{Softmax}({{({{Q}^{l}})}^{T}}\beta_{k}^{l}),$ (1)

where $\beta_{k}^{l}$ is the memory content in position $k$, and $\textit{Softmax}({{z}_{k}})={{e}^{{{z}_{k}}}}/\sum\nolimits_{u}{{{e}^{{{z}_{u}}}}}$ is used to measure the degree of association between the memory and the query vector ${{Q}^{l}}$. The value is between 0 and 1. The closer the memory score is to 1, the greater the degree of association. Then, the model reads out the memory ${{O}^{l}}$ as the sum of $\beta_{k}^{l+1}$ weighted by $A_{k}^{l}$,

$\displaystyle{{O}^{l}}=\sum\nolimits_{k}{A_{k}^{l}}\beta_{k}^{l+1},$ (2)

where the query vector is updated for the next hop by using ${{Q}^{l+1}}={{Q}^{l}}+{{O}^{l}}$ . The result is the memory vector ${{O}^{l}}$ , which will become one of the inputs for the decoder.
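The read-and-update loop of Eqs. (1) and (2) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `multi_hop_read`, the list-of-lists memory layout, and the choice to pass one memory table per hop level are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def multi_hop_read(query, memories):
    """Run L hops of attention over the memory.

    memories[l][k] plays the role of beta_k^l. The list holds L + 1
    tables (an assumption of this sketch) because hop l scores against
    table l and reads content from table l + 1.
    Returns the final query, the last read-out O^l, and the last weights.
    """
    hops = len(memories) - 1
    readout, weights = None, None
    for l in range(hops):
        # Eq. (1): attention weight of each memory slot for the current query.
        weights = softmax([dot(query, slot) for slot in memories[l]])
        # Eq. (2): weighted sum of the next-level memory contents.
        readout = [
            sum(w * slot[d] for w, slot in zip(weights, memories[l + 1]))
            for d in range(len(query))
        ]
        # Query update for the next hop: Q^{l+1} = Q^l + O^l.
        query = [q + o for q, o in zip(query, readout)]
    return query, readout, weights


# Toy memory with three slots; the same table is reused at every level.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
final_query, last_readout, last_weights = multi_hop_read([2.0, 0.0], [memory, memory, memory])
```

Because the query accumulates the read-outs, later hops can attend to slots that were only weakly related to the initial query, which is the intuition behind using several hops.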

TopicRNN Learning.

Although the RNN model can capture the local relationship of the sentence well, it lacks in capturing the dependency relationship of words in a longer-range sequence, while the TopicRNN [37, 38, 39] model can capture the global semantic information in the document well. Since a large part of the long-term dependencies in language derives from semantic coherence, in multiple rounds of multi-topic dialogues, the capture of subtopics will affect the quality of dialogue generation. The generative learning process of the TopicRNN model is described by Algorithm 1.

The detailed description of Algorithm 1 is as follows: Firstly, the user input and context are packaged into a document. Then, a topic vector based on a Gaussian distribution is generated for the document. Next, hidden states for each word in the document are generated. A Bernoulli distribution based on the current word’s hidden state is used to determine whether the word is a stop word. If the current word is a stop word, it is generated from the stop word model; otherwise, it is generated jointly from the stop word model and the topic model.

The output of the decoder is affected by the topic vector $\phi$, which enters the output as a bias, allowing us to separate the global semantics from the local dynamic semantics. The stop word indicator $l_{m}$ determines how the topic vector $\phi$ affects the output. Specifically, if the indicator $l_{m}$ is equal to 1, the word is a stop word, and the output is not affected by the topic vector. Otherwise, the word belongs to the topic $\phi$, and a weight is introduced to increase the proportion of words that belong to topic $\phi$ in the output and to better model stop words and non-stop words. The weight is obtained by the dot product of the transpose of the bias $b_{j}$ and the topic vector $\phi$. It can be seen that the topic vector $\phi$ captures the long-range semantic information, which has a direct impact on the output through an additive procedure.

During model inference, the observations are token sequences and stop word indicators. The log marginal likelihood is as follows:

$\displaystyle\log p({U_{n}},{l_{1:{M_{n}}}}|{U_{1:n-1}})=\log\int_{\phi}p(\phi|{U_{1:n-1}})\prod\limits_{t=1}^{{M_{n}}}p({y_{n,t}}|{h_{n,t}},{l_{t}},\phi)\,p({l_{t}}|{h_{n,t}})\,d\phi.$ (3)

Direct optimization of Eq. (3) is intractable due to the integral over the continuous latent space. Suppose $q(\phi|{U_{1:n}})$ is the variational distribution on the marginalized variable. The variational lower bound of Eq. (3) can then be constructed as:

$\displaystyle L({U_{n}},{l_{1:{M_{n}}}}|q(\phi|{U_{1:n}}),\vartheta)\buildrel\Delta\over{=}{E_{q(\phi|{U_{1:n}})}}\left[\sum\limits_{t=1}^{{M_{n}}}\log p({y_{n,t}}|{h_{n,t}},{l_{t}},\phi)+\sum\limits_{t=1}^{{M_{n}}}\log p({l_{t}}|{h_{n,t}})\right]-{D_{KL}}(q(\phi|{U_{1:n}})||p(\phi|{U_{1:n-1}}))\leqslant\log p({U_{n}},{l_{1:{M_{n}}}}|{U_{1:n-1}},\vartheta).$ (4)

Inspired by the neural variational inference framework and the Gaussian reparameterization trick in Variational auto-encoder (VAE), we construct $q(\phi|{U_{1:n}})$ as an inference network using a feed-forward neural network,

$\displaystyle q(\phi|{U_{1:n}})=N(\phi;\mu({U_{1:n}}),\textit{diag}({\sigma^{2}}({U_{1:n}}))),$ (5)

where $\mu({U_{1:n}})$ and ${\sigma}({U_{1:n}})$ are both feed-forward neural networks that use the ReLU activation function.
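To make the Gaussian reparameterization concrete, the sketch below draws a topic vector from a diagonal Gaussian and evaluates a closed-form KL term. It is a minimal illustration under stated assumptions: `mu` and `sigma` stand in for the outputs of the two inference networks, and the KL is taken against a standard normal prior in the usual VAE direction, which may differ from the exact prior and direction used in the paper's equations.

```python
import math
import random


def sample_topic(mu, sigma, rng):
    """Reparameterized draw: phi = mu + sigma * eps, with eps ~ N(0, 1).

    The randomness lives in eps only, so gradients could flow through
    mu and sigma in an autodiff framework.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ); sigma must be > 0."""
    return 0.5 * sum(
        s * s + m * m - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )


rng = random.Random(0)
phi = sample_topic([0.5, -0.5], [1.0, 0.2], rng)
kl = kl_to_standard_normal([0.5, -0.5], [1.0, 0.2])
```

The KL value is zero exactly when the posterior equals the prior (`mu = 0`, `sigma = 1`) and grows as the posterior moves away from it.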

Suppose that, during training, the one-hot vector for any token $y$ and its stop word indicator are $y$ and $l$, respectively. The corresponding predicted vectors are ${\hat{y}}$ and ${\hat{l}}$. Inspired by Eq. (3.1), the loss for this global topic control component consists of two cross-entropy losses and a KL divergence between the assumed distribution and the learned distribution as follows,

$\displaystyle{L_{\textit{Topic}}}=\textit{avg}.[{L_{\textit{cross}(y,\hat{y})}}+{L_{\textit{cross}(l,\hat{l})}}]-{D_{KL}}(N(0,I)||q(\phi|{U_{1:n}}))*{[1+\exp(-\kappa(\iota-\varpi))]^{-1}},$ (6)

where ${[1+\exp(-\kappa(\iota-\varpi))]^{-1}}$ is a KL annealing function, $\kappa=0.0025$, $\iota$ is the current step, and $\varpi=2500$.
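The annealing weight can be sketched as a logistic schedule. This sketch assumes the standard reading of KL cost annealing, in which $\kappa$ scales the distance between the current step and the midpoint $\varpi$; the function name `kl_anneal_weight` is hypothetical.

```python
import math


def kl_anneal_weight(step, kappa=0.0025, midpoint=2500):
    """Logistic annealing weight in (0, 1) applied to the KL term.

    Early in training the weight is close to 0, so the KL term is
    nearly ignored; it reaches 0.5 at the midpoint and approaches 1
    afterwards.
    """
    return 1.0 / (1.0 + math.exp(-kappa * (step - midpoint)))


# Weight at a few training steps with the constants quoted in the text.
schedule = {step: kl_anneal_weight(step) for step in (0, 1250, 2500, 3750, 5000)}
```

Starting with a small weight lets the decoder learn to use the topic vector before the KL term pulls the posterior toward the prior, which is the usual motivation for annealing.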

Algorithm 1: Learning process of TopicRNN
Input: the user input $U$ and the context $C$
Output: the prediction token $y$
1: $h\leftarrow$ HRED($U$, $C$)
2: Draw a topic vector $\phi\sim N(0,I)$
3: ${H_{0}}=h$ $\leftarrow$ initialize the decoder
4: Given tokens $T=(t_{1},\ldots,t_{n-1})$ from context $C$, for the token $y_{n}$:
5:   Compute the hidden state of the decoder, ${H_{n}}={{f}_{W}}(H_{n-1},T)$;
6:   Draw a stop word indicator from context $C$, ${{l}_{n}}\sim\textit{Bernoulli}(\textit{sigmoid}({{W}^{T}}{{H}_{n-1}}))$;
7:   Draw a token ${{y}_{n}}\sim p({{y}_{n}}|{{H}_{n}},\phi,{{l}_{n}},B)$, where $B$ means Bernoulli distribution,

$\displaystyle p({{y}_{n}}=j|{{H}_{n}},\phi,{{l}_{n}},B)\propto\exp(w_{j}^{T}{{H}_{n}}+(1-{{l}_{n}})b_{j}^{T}\phi)$
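The token distribution in the last step can be illustrated with a small, self-contained sketch. The names `output_weights` (rows $w_j$) and `topic_weights` (rows $b_j$) are hypothetical, and the toy vocabulary of two tokens is chosen only to show how the stop word indicator switches the topic bias on and off.

```python
import math


def token_distribution(hidden, topic, is_stop_word, output_weights, topic_weights):
    """p(y = j) proportional to exp(w_j . H + (1 - l) * b_j . phi).

    When the indicator l is 1 (stop word) the topic term is gated out,
    so only the decoder hidden state shapes the distribution.
    """
    gate = 0.0 if is_stop_word else 1.0
    logits = [
        sum(w * h for w, h in zip(output_weights[j], hidden))
        + gate * sum(b * p for b, p in zip(topic_weights[j], topic))
        for j in range(len(output_weights))
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


W = [[1.0, 0.0], [0.0, 1.0]]  # one row per vocabulary entry
B = [[2.0], [0.0]]            # token 0 is favoured by the topic
as_stop_word = token_distribution([0.0, 0.0], [1.0], True, W, B)
as_topic_word = token_distribution([0.0, 0.0], [1.0], False, W, B)
```

In this toy setting the stop-word case yields a uniform distribution, while the non-stop-word case shifts probability mass toward the token that the topic vector favours.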

3.2 SGAT-based recommender module

SGAT-based Recommender.

Usually, an item has many attributes. For example, when a new visitor queries for a hotel, the hotel has the address, area, network, name, free parking space, etc. It is very suitable for modeling the item by utilizing graph data. When a user sends a request that he wants a Chinese restaurant for dinner, the user clearly provides the system with two constraints like “Chinese” and “restaurant.” Therefore, the system not only accurately captures them but also considers potential constraints such as location and business hours because users are more willing to consider restaurants near the hotel. In order to capture the explicit and latent relationships between these places, we use the sparse graph attention mechanism. Different from the graph convolution network, it can assign different weights to the neighbor nodes of vertexes in the graph and enhance the spatial information of the model. However, in the real world, a graph is large and complex. What’s most important is that the graph is sparse and noisy. Therefore, the graph attention is prone to overfitting if not regularized properly. The SGAT can remove at least 20% of the useless edges from the graph while maintaining high accuracy. In addition, the binary gate in the SGAT model cleverly achieves edge clipping.

We formulate an undirected graph $G=(V,E)$ with a set of nodes $V=\{v_{1},v_{2},\ldots,v_{m}\}$ and a set of edges $E\subseteq V\times V$ connecting these nodes. A dense matrix $M\in{{\mathbb{R}}^{m\times m}}$ holds the node features, and each row of the matrix represents the feature vector of a node. We use $A$ to denote the adjacency matrix and add a self-loop to each node to retain the node’s own information. In detail, we add 1 to each diagonal entry of the adjacency matrix $A$. Let ${A}^{\prime}=A+{{I}_{m}}$ denote the adjacency matrix with added self-connections, where ${{I}_{m}}\in{{\mathbb{R}}^{m\times m}}$ is an identity matrix.

Given such a graph, we generate embeddings of items to calculate the matching score with the dialog context. Finally, we get the recommended items. In general, we employ selective multiple-layer convolutional modules to aggregate feature information of first-order neighbor nodes. We obtain a high-level representation for an item that contains a great deal of extra information. The purpose is to learn how to selectively filter out the nodes that need to participate in the aggregation operation and how to aggregate neighborhood information. We attach a binary gate $b_{ij}\in\{0,1\}$ to each edge $e_{ij}$ to identify edges that will participate in the aggregation operation and to clip edges unrelated to the task. In short, if ${{b}_{ij}}$ is equal to 1, the edge participates in the aggregation operation; if ${{b}_{ij}}$ is equal to 0, it does not. This corresponds to attaching a set of binary masks to the adjacency matrix $A$:

$\displaystyle\bar{A}=A\odot B,B\in{{\{0,1\}}^{N}},$ (7)

where $N$ is the number of edges in graph $G$. Since we want to use as few edges as possible for the semi-supervised node classification, we train model parameters $P$ and binary masks $B$ by minimizing the following ${L_{0}}$-norm regularized empirical risk:

$\displaystyle R(P,B)=\frac{1}{n}\sum\limits_{i=1}^{n}L({{f}_{i}}(M,A\odot B,P),{{y}_{i}})+\lambda{{\|{B}\|}_{0}}=\frac{1}{n}\sum\limits_{i=1}^{n}L({{f}_{i}}(M,A\odot B,P),{{y}_{i}})+\lambda\sum\limits_{(i,j)\in E}{1_{[{b_{ij}}\neq 0]}},$ (8)

where ${{\|{B}\|}_{0}}$ represents the ${L_{0}}$-norm of the binary mask $B$, i.e., the number of non-zero elements in $B$; ${1_{[z]}}$ is an indicator function that equals 1 if the condition $z$ is satisfied and 0 otherwise; and $\lambda$ is a regularization hyper-parameter that balances data loss and edge sparsity. For the encoder function ${{f}}(M,A\odot B,P)$, we define the following attention-based aggregation function:
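The regularized risk of Eq. (8) is simply the mean data loss plus a penalty proportional to the number of edges kept. The sketch below evaluates it for given per-example losses; the helper name `l0_regularized_risk` and the dictionary keyed by edge are assumptions made for illustration.

```python
def l0_regularized_risk(losses, gates, lam):
    """Mean data loss plus lam times the number of non-zero gates (Eq. 8).

    losses: per-example loss values L(f_i(...), y_i)
    gates:  mapping from an edge (i, j) to its binary gate b_ij
    lam:    weight of the sparsity penalty
    """
    data_term = sum(losses) / len(losses)
    kept_edges = sum(1 for b in gates.values() if b != 0)
    return data_term + lam * kept_edges


gates = {(0, 1): 1, (1, 2): 0, (0, 2): 1}  # edge (1, 2) is clipped
risk = l0_regularized_risk([0.2, 0.4], gates, lam=0.1)
```

Increasing `lam` makes each retained edge more expensive, so the optimum shifts toward sparser graphs at the cost of a possibly higher data loss.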

$\displaystyle R_{i}^{(l+1)}=\sigma\left(\sum\limits_{j\in{{N}_{i}}}{{a}_{ij}}R_{j}^{(l)}{{W}^{(l)}}\right),$ (9)

where ${a_{ij}}$ is the attention coefficient of the edge ${e_{ij}}$ . The SGAT assigns an independent attention coefficient for each edge ${e_{ij}}$ at layer $l$ .

We compute normalized attention coefficients through a row-wise normalization of $A\odot B$ as:

$\displaystyle{{a}_{ij}}=\textit{normalize}({{A}_{ij}}{{b}_{ij}})=\frac{{{A}_{ij}}{{b}_{ij}}}{\sum\nolimits_{k\in{{N}_{i}}}{{A}_{ik}}{{b}_{ik}}}.$ (10)
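The masked normalization of Eq. (10) and the neighborhood sum inside Eq. (9) can be sketched with nested lists. This is an illustrative sketch only: the weight matrix $W^{(l)}$ and the nonlinearity $\sigma$ are omitted, the adjacency matrix is assumed to already contain self-loops, and the function names are hypothetical.

```python
def normalized_attention(adj, gates):
    """Row-normalize the masked adjacency A * B (Eq. 10).

    A row whose edges are all clipped stays all-zero instead of
    dividing by zero.
    """
    attention = []
    for row_a, row_b in zip(adj, gates):
        masked = [a * b for a, b in zip(row_a, row_b)]
        total = sum(masked)
        attention.append([v / total if total else 0.0 for v in masked])
    return attention


def aggregate(attention, features):
    """Compute sum_j a_ij * R_j for every node i (the sum inside Eq. 9)."""
    dim = len(features[0])
    return [
        [sum(a * f[d] for a, f in zip(row, features)) for d in range(dim)]
        for row in attention
    ]


adj = [[1, 1, 1], [1, 1, 0], [1, 0, 1]]    # three nodes, self-loops included
gates = [[1, 1, 0], [1, 1, 0], [1, 0, 1]]  # edge 0 -> 2 is clipped
attention = normalized_attention(adj, gates)
pooled = aggregate(attention, [[1.0], [3.0], [5.0]])
```

Node 0 originally has three neighbors (itself, node 1, and node 2); after the gate clips the edge to node 2, its representation is averaged over the two remaining neighbors only.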

To reinforce the capacity of the SGAT model, we augment similar multi-head attention as in GAT. Therefore, we define a multi-head SGAT layer as:

$\displaystyle R_{i}^{(l+1)}=\|_{k=1}^{K}\sigma\left(\sum\limits_{j\in{{N}_{i}}}{{a}_{ij}}R_{j}^{(l)}W_{k}^{(l)}\right),$ (11)

where $K$ is the number of heads, $\parallel$ represents concatenation, ${a_{ij}}$ is the attention coefficient, and ${W_{k}^{(l)}}$ is the weight matrix of head $k$ at layer $l$.

With the updating rules for node representations given in Eq. (11), the objective function takes the form of the cross-entropy loss as follows,

$\displaystyle{L_{\textit{SGAT}}}=-\frac{1}{M}\sum\limits_{i=1}^{M}[{s_{i}}\log({p_{i}})+(1-{s_{i}})\log(1-{p_{i}})],$ (12)

where ${h_{i}}$ means the dialogue context representation, ${s_{i}}$ means the ground truth node vector and ${p_{i}}=\textit{Softmax}({R^{T}}{h_{i}})$ is the item score computed by the SGAT-based model.
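As a rough illustration of how the item scores and the loss of Eq. (12) fit together, the sketch below scores each item embedding against a context vector, applies a softmax, and evaluates the cross-entropy against a one-hot target. The function names and the small `eps` guard against `log(0)` are assumptions of this sketch, and the average is taken over the entries passed in.

```python
import math


def item_probabilities(item_embeddings, context):
    """p = Softmax(R^T h): one probability per item row of R."""
    scores = [sum(r * h for r, h in zip(row, context)) for row in item_embeddings]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sgat_loss(targets, probs, eps=1e-12):
    """Cross-entropy of Eq. (12), averaged over the given entries."""
    terms = [
        s * math.log(p + eps) + (1 - s) * math.log(1 - p + eps)
        for s, p in zip(targets, probs)
    ]
    return -sum(terms) / len(terms)


items = [[1.0, 0.0], [0.0, 1.0]]  # two toy item embeddings
probs = item_probabilities(items, [2.0, 0.0])
loss_correct = sgat_loss([1, 0], probs)  # ground truth is item 0
loss_wrong = sgat_loss([0, 1], probs)    # ground truth is item 1
```

The context vector points toward the first item, so the loss is small when that item is the ground truth and large when the other one is.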

Integration Mechanism.

Given the dialog context, we can predict the next utterance via the dialogue state tracking component and obtain the recommended item by utilizing the SGAT-based recommender model. We employ an integration mechanism to achieve the above two tasks. The Gated Recurrent Unit (GRU) [40, 41, 42] is widely used in end-to-end dialogue systems. In detail, at each decoding step $s$ in turn $n$, the GRU takes the previously generated token and the previous hidden state as input to generate the new hidden state,

$\displaystyle{{h}_{n,s}}=\textit{GRU}({{h}_{n,s-1}},{{\hat{{t}}}_{n,s-1}}).$ (13)

The new hidden state $h_{n,s}$ is then passed to two branches. We illustrate in turn how the next token is generated and how the top-ranked item name is obtained.

In one branch, ${h_{n,s}}$ is used as input for the dialogue state tracking module to generate the next token. The probability of generating the next token is calculated as:

$\displaystyle{{p}_{1}}({{\hat{t}}_{n,s}})\propto\exp({{W}^{T}}{{h}_{n,s}}+(1-{{l}_{s}}){{B}^{T}}\phi).$ (14)

In the other branch, $h_{n,s}$ is passed to the SGAT-based recommender. The probability of predicted items is computed as:

$\displaystyle{{p}_{2}}({{\hat{t}}_{n,s}})=\textit{Softmax}({{R}^{T}}{{h}_{n,s}}).$ (15)

3.2.1 Model optimization

In order to optimize Eq. (8), we use an inequality from stochastic variational optimization to solve the binary optimization problem over the binary mask $B$. The inequality is as follows: given any function $\varphi$ and any distribution $\rho$,

$\displaystyle\mathop{\min}\limits_{b}\varphi(b)\leqslant{{\mu}_{b\sim\rho(b)}}[\varphi(b)],$ (16)

i.e., the minimum of a function is upper bounded by its expectation.
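A tiny brute-force check makes the bound tangible. The sketch below enumerates every binary vector, weights it by a product of independent Bernoulli probabilities (the same form assumed for the gates), and compares the expectation with the true minimum. The objective `toy_objective` is an arbitrary stand-in chosen for the example, not the model's loss.

```python
from itertools import product


def expectation_and_minimum(phi, alphas):
    """Return (E_b[phi(b)], min_b phi(b)) for b in {0, 1}^d.

    Each entry b_k is an independent Bernoulli(alphas[k]) variable.
    """
    expected = 0.0
    minimum = None
    for b in product((0, 1), repeat=len(alphas)):
        prob = 1.0
        for bit, alpha in zip(b, alphas):
            prob *= alpha if bit else 1.0 - alpha
        value = phi(b)
        expected += prob * value
        minimum = value if minimum is None else min(minimum, value)
    return expected, minimum


def toy_objective(b):
    # Smallest when the first gate is on and the second is off.
    return (b[0] - 1) ** 2 + b[1]


loose = expectation_and_minimum(toy_objective, [0.5, 0.5])
tight = expectation_and_minimum(toy_objective, [1.0, 0.0])
```

With uniform probabilities the expectation sits strictly above the minimum, whereas putting all probability mass on the minimizer makes the bound tight. This is why minimizing the expectation over the Bernoulli parameters is a reasonable relaxation of the original binary problem.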

Thus, we can upper bound Eq. (8) by its expectation:

$\displaystyle\tilde{R}(P,\alpha)=\frac{1}{n}\sum\limits_{i=1}^{n}{\mu_{\rho(B|\alpha)}}L({f_{i}}(M,A\odot B,P),{y_{i}})+\lambda\sum\limits_{(i,j)\in E}{\alpha_{ij}},$ (17)

where we assume that each $b_{ij}$ follows a Bernoulli distribution with parameter ${\alpha_{ij}}\in(0,1)$, i.e., ${b_{ij}}\sim\textit{Bernoulli}({b_{ij}};{\alpha_{ij}})$. That is, for every edge $(i,j)\in E$, ${b_{ij}}$ is a binary random variable.

Algorithm 2: Learning process of SMCR
Input: the conversational recommendation dataset $D$ and the knowledge graph $G$
Output: the model parameters ${\Theta_{\mu}}$, ${\Theta_{\sigma}}$ and ${\Theta_{\alpha}}$
1: for $t=1$ to $|D|$ do
2:   Acquire locations’ representations $R$ from $G$ by Eq. (9),
3:   Acquire contexts’ representations ${h_{n,s-1}}$ by the multi-hop attention mechanism using Eq. (2),
4:   Acquire the new hidden state ${h_{n,s}}$ by the gated recurrent unit using Eq. (13),
5:   Compute ${P_{\textit{rec}}}(i)$ using Eq. (14),
6:   Perform Gradient Descent (GD) on Eq. (12) with respect to ${\Theta_{\alpha}}$.
7: end for
8: for $t=1$ to $|D|$ do
9:   Acquire locations’ representations $R$ from $G$ by Eq. (9),
10:   Acquire contexts’ representations ${h_{n,s-1}}$ by the multi-hop attention mechanism using Eq. (2),
11:   Acquire the new hidden state ${h_{n,s}}$ by the gated recurrent unit using Eq. (13),
12:   Draw a stop word indicator from context, ${{l}_{n}}\sim\textit{Bernoulli}(\textit{sigmoid}({{W}^{T}}{{h}_{n-1}}))$,
13:   Compute ${P_{\textit{gen}}}({y_{n}}|{y_{1}},{y_{2}},\ldots,{y_{n-1}})$ using $p({{y}_{n}}=j|{{h}_{n}},\phi,{{l}_{n}},B)\propto\exp(w_{j}^{T}{{h}_{n}}+(1-{{l}_{n}})b_{j}^{T}\phi)$,
14:   Perform Gradient Descent (GD) on Eq. (3.1) with respect to ${\Theta_{\mu}}$, ${\Theta_{\sigma}}$.
15: end for
16: return ${\Theta_{\mu}}$, ${\Theta_{\sigma}}$ and ${\Theta_{\alpha}}$

For the first term, SGAT adopts the hard concrete gradient estimator to compute gradients with respect to $\alpha$. Specifically, the hard concrete estimator employs a reparameterization trick to approximate the original optimization problem in Eq. (17) by a close surrogate function:

$\displaystyle\tilde{R}(P,\log\alpha)=\frac{1}{n}\sum\limits_{i=1}^{n}\mu_{u\sim U(0,1)}L(f_{i}(M,A\odot g(f(\log\alpha,u)),P),y_{i})+\lambda\sum\limits_{(i,j)\in E}\sigma\left(\log\alpha_{ij}-\beta\log\frac{-\gamma}{\zeta}\right),$ (18)

with $f(\log\alpha,u)=\sigma((\log u-\log(1-u)+\log\alpha)/\beta)(\zeta-\gamma)+\gamma$ and $g(\cdot)=\min(1,\max(0,\cdot))$, where $U(0,1)$ is the standard uniform distribution, $\sigma$ denotes the sigmoid function, and $\beta=2/3$, $\gamma=-0.1$ and $\zeta=1.1$.

The model uses different schemes to obtain the binary masks $B$ in the training and testing stages. During training, it uses $\log\alpha_{ij}$ for each edge $e_{ij}$ in the stochastic surrogate of Eq. (18). At the test phase, it generates the masks $\hat{B}$ deterministically by the following formula:

$\displaystyle\hat{B}=\min(1,\max(0,\sigma((\log\alpha)/\beta)(\zeta-\gamma)+\gamma)).$ (19)
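To make the gate computation concrete, the following minimal sketch evaluates a single edge gate as written in Eqs. (18) and (19), using the constants $\beta=2/3$, $\gamma=-0.1$ and $\zeta=1.1$ stated above. It is not the authors' implementation: the function names are hypothetical, it works on one scalar gate at a time rather than on tensors, and the small clipping of $u$ away from 0 and 1 is an added assumption to keep the logarithms finite.

```python
import math
import random

BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1  # constants stated after Eq. (18)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def stretch_and_clip(s):
    """Stretch s from (0, 1) to (GAMMA, ZETA), then apply g(.) = min(1, max(0, .))."""
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def sample_train_gate(log_alpha, rng):
    """Training-time gate g(f(log_alpha, u)) with u drawn from U(0, 1)."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # assumption: avoid log(0)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return stretch_and_clip(s)


def eval_gate(log_alpha):
    """Test-time gate from Eq. (19); no noise is injected."""
    return stretch_and_clip(sigmoid(log_alpha / BETA))


def expected_l0(log_alphas):
    """Regularizer of Eq. (18) without lambda: expected number of open gates."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(la - shift) for la in log_alphas)


rng = random.Random(0)
for la in (-10.0, 0.0, 10.0):
    print(la, eval_gate(la), sample_train_gate(la, rng))
```

Because the stretched interval $(\gamma,\zeta)$ extends beyond $[0,1]$, the clipping step lets a gate reach exactly 0 or exactly 1, which is what allows an edge to be removed outright rather than merely down-weighted.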

4. Experiment

In this section, we conduct extensive experiments aimed at answering the following research questions:

RQ1: How does the performance of our proposed SMCR algorithm compare with other state-of-the-art baselines?

RQ2: How do different hyper-parameter settings (e.g., the number of dimensions and the number of hops in the attention network) affect the performance of the SMCR algorithm?

RQ3: How does our algorithm reduce the complexity of graph calculation and the interference of noisy nodes (i.e., how many useless edges are removed when computing on the knowledge graph)?

4.1 Experimental settings

4.1.1 Datasets

We conduct experiments on two publicly accessible datasets: MultiWOZ (https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.0.zip) and REDIAL (https://redialdata.github.io/website/download).

MultiWOZ is a dataset of dialogues between tourists and consultants at a city's tourist information centre. Specifically, the dataset has 10438 dialogues and 115434 dialogue turns, and about 7306 dialogues have more than ten turns. In addition, the MultiWOZ dataset contains the knowledge graph of the mentioned entities. Our analysis shows that this knowledge graph is very sparse (e.g., a total of 280 entities, and the number of edges per node is only 8.0).

REDIAL is a dialogue recommendation dataset in the movie domain. It was collected and constructed through Amazon Mechanical Turk (AMT). In the REDIAL dataset, each dialogue is carried out by a seeker-recommender pair. It contains 10006 dialogues and 181250 utterances and covers 51699 movies.

4.1.2 Evaluation protocols

Our SMCR algorithm consists of a language generation module and a recommendation module, so we use BLEU score [43, 44], Accuracy, Recall, NDCG, and Perplexity as evaluation metrics to evaluate the performance of our model.

BLEU: Measures how closely the sentence generated by the model matches the ground-truth sentence. It is widely used in several fields (e.g., machine translation and dialogue systems). The BLEU score ranges from 0 to 1, and a higher score indicates better performance.

Accuracy: Computes the average over the system responses when compared with the whole ground-truth set. This metric evaluates the ability to recommend appropriate items from the provided item set and to capture the semantics of dialogues.

Recall: It reflects the number of hits in the interpolated top-N recommendation list in the generated sentence.

NDCG: Builds on Recall by also considering the gain brought by the rank order of the hits. It is worth noting that, compared with traditional recommendation algorithms, a conversational recommender system can capture the connection between dialogue information and items.

Perplexity: A language-model metric that measures the fluency of natural language. Lower Perplexity indicates higher performance of a language model.
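The ranking and fluency metrics above can be sketched as follows. This is an illustration under common textbook definitions, not necessarily the exact evaluation script used in the experiments: it assumes a single ground-truth item per recommendation turn (so the ideal DCG is 1) and natural-log token probabilities, and the item names are hypothetical.

```python
import math


def recall_at_k(ranked_items, target, k):
    """1.0 if the ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k for one relevant item: a hit at rank r earns 1 / log2(r + 1)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


ranked = ["hotel_a", "hotel_b", "hotel_c"]  # hypothetical ranked recommendations
print(recall_at_k(ranked, "hotel_b", 1), recall_at_k(ranked, "hotel_b", 2))
print(ndcg_at_k(ranked, "hotel_b", 3))
print(perplexity([math.log(0.25)] * 4))
```

Under these definitions NDCG@1 coincides with Recall@1, and for larger k NDCG rewards a hit more the closer it is to the top of the list.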

4.2 Baselines

We compare our model with the following baselines:

• HRED [45]: It applies a hierarchical encoder-decoder structure to the dialogue domain. The model utilizes RNNs to encode words and sentences, respectively. Therefore, it can capture both the local and global structure of sentences.
• TopicRNN [37]: It integrates the merits of RNNs and latent topic models, using an RNN to capture syntactic dependencies and latent topics to capture semantic dependencies.
• Mem2Seq [9]: It combines multi-hop attention over memories with a pointer network to produce answers from external information, and thereby captures the different semantic associations between questions and answers.
• DCR [13]: It is equipped with a TopicRNN and a graph convolutional network (GCN) to recommend a travel venue. By using the GCN, it aims to match external information with dialogue information.

4.3 Training setups

We implement our proposed model SMCR in PyTorch. To construct the graph, we use the Deep Graph Library (DGL) to add nodes, node features, and edges. We set the embedding and GRU state sizes to 300 and 100, respectively. A larger embedding dimension provides more parameters to represent the semantic features of words, while a smaller state size reduces computational complexity but may limit the model’s ability to capture long-term dependencies. The number of topics is set to 10, and the stop word frequency is set to 1000. Ten topics are intended to give the model a certain level of topic diversity and coverage: a smaller number of topics may lead to excessive generalization, while a larger number may increase computational complexity. We apply two layers of graph convolutional operations, since with multiple layers the model can obtain richer contextual information from multiple neighboring nodes. During training, the dimension of the hidden unit in the RNN and of the hidden layer of the inference network is set to 64. We use the Adam optimizer with a learning rate of 0.001. All baseline models have their hyperparameters set according to the aforementioned values, with slight variations in the learning rate.
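For reference, the settings listed above can be collected in a single configuration object. The sketch below only restates the reported values; the class and field names are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMCRTrainingConfig:
    """Hyperparameters reported in the training setup (field names are illustrative)."""
    embedding_size: int = 300          # word embedding dimension
    gru_state_size: int = 100          # GRU state size
    num_topics: int = 10               # number of latent topics
    stop_word_frequency: int = 1000    # stop word frequency setting
    num_graph_layers: int = 2          # layers of graph convolutional operations
    hidden_size: int = 64              # RNN hidden unit / inference network hidden layer
    optimizer: str = "adam"
    learning_rate: float = 1e-3


config = SMCRTrainingConfig()
print(config)
```

Keeping the values in one frozen object makes it harder for the baselines and the proposed model to drift apart on settings that are meant to be shared.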

Table 1
Performance of all competitors on the MultiWoz dataset

Models	Accuracy	Recall	BLEU	Perplexity
HRED	0.1210	0.0569	0.2023	10.91
TopicRNN	0.1006	0.0531	0.2359	11.05
Mem2Seq	0.1071	0.0400	0.3034	11.25
DCR	0.1625	0.0505	0.2315	10.24
SMCR	0.1966	0.0649	0.2234	6.56

4.4 Experiment results

4.4.1 Experiment results and analysis (RQ1)

Table 1 shows the performance of our SMCR method compared with other methods on four evaluation metrics on the MultiWOZ dataset. SMCR is better than all baselines on the accuracy metric. Specifically, it improves accuracy by 62.6%, 95.0%, 83.5%, and 21.0% compared with the HRED, TopicRNN, Mem2Seq and DCR methods, respectively. On the recall metric, SMCR also ranks first and is better than DCR by about 28.5%. Benefiting from the edge-pruning strategy of SGAT, subject-irrelevant edges are excluded from feature aggregation by the binary gates. This mechanism prevents noisy information from obscuring critical information, so that more topic-related graph structure information is preserved. Unfortunately, our model does not have an advantage over the other baselines on the BLEU metric. Although the multi-hop attention mechanism can alleviate the long-term dependency problem to a certain extent, its ability to memorize historical dialogue information is still relatively weak. Adding relative and absolute position information to the sequence information may help to overcome this problem. On the perplexity metric, we observe a 35.4% improvement compared with the state-of-the-art method (i.e., DCR). The improvement of SMCR is even more pronounced against the other baselines, reaching 39.9%, 40.6%, and 41.7% compared with the HRED, TopicRNN, and Mem2Seq methods, respectively. A lower perplexity means that the model predicts the next words better given the historical information.
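The relative improvements quoted above follow the usual ratio of the metric difference to the baseline value. The helper below is a minimal sketch of that arithmetic applied to the rounded entries of Table 1; the function name is hypothetical, and values recomputed from rounded table entries can differ slightly in the last digit from the percentages quoted in the text.

```python
def relative_improvement(ours, baseline, lower_is_better=False):
    """Relative improvement over a baseline, in percent."""
    if lower_is_better:
        return (baseline - ours) / baseline * 100.0
    return (ours - baseline) / baseline * 100.0


# Rounded values copied from Table 1 (MultiWOZ).
accuracy = {"HRED": 0.1210, "TopicRNN": 0.1006, "Mem2Seq": 0.1071, "DCR": 0.1625}
perplexity_scores = {"HRED": 10.91, "TopicRNN": 11.05, "Mem2Seq": 11.25, "DCR": 10.24}
smcr_accuracy, smcr_perplexity = 0.1966, 6.56

for name in accuracy:
    acc_gain = relative_improvement(smcr_accuracy, accuracy[name])
    ppl_gain = relative_improvement(smcr_perplexity, perplexity_scores[name], lower_is_better=True)
    print(f"{name}: accuracy +{acc_gain:.1f}%, perplexity -{ppl_gain:.1f}%")
```

For Perplexity the sign is flipped because a lower value is better, so the improvement is measured as the reduction relative to the baseline.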

Table 2 reports the performance of all models on the REDIAL dataset. Our SMCR method outperforms most models on the various evaluation metrics. On the accuracy metric, HRED, TopicRNN, Mem2Seq, DCR and SMCR reach 0.86%, 0.59%, 0.92%, 0.46% and 1.23%, respectively. In particular, on the important recommendation metric of recall (here recall refers to recall@1), SMCR obtains the best result of 0.0047 among all conversational recommendation models, an improvement of about 2.0% over the Mem2Seq method. This is a critical metric in CRS, as it means that the top-ranked items can be recommended to users first. Moreover, SMCR achieves the lowest Perplexity (24.82) among all competitors and the second-highest BLEU (0.0728, behind Mem2Seq at 0.0801). Since lower perplexity values indicate better language model fluency, we conclude that our model achieves strong results in the fluency of the generated dialogue.

Table 2
Performance of all competitors on the REDIAL dataset

Models	Accuracy	Recall	BLEU	Perplexity
HRED	0.0086	0.0045	0.0642	25.01
TopicRNN	0.0059	0.0028	0.0423	39.94
Mem2Seq	0.0092	0.0046	0.0801	41.64
DCR	0.0046	0.0032	0.0631	32.27
SMCR	0.0123	0.0047	0.0728	24.82

Table 3 reports the performance of all models on the NDCG@k metric (i.e., $k=$ 1, 5, 10, 15, 20) on the MultiWOZ dataset. Our SMCR method outperforms the other models across these metrics. In particular, on the most crucial recommendation metric, NDCG@1, SMCR obtains the best result of 0.1324 among all conversational recommendation models, an improvement of about 30% over the DCR method. It is worth noting that the top-ranked items are recommended to users first, which is especially important in CRSs. On the remaining NDCG@k metrics (i.e., $k=$ 5, 10, 15, 20), SMCR is also better than the other baselines; e.g., on NDCG@5, SMCR and DCR reach 0.1784 and 0.1360, respectively. In addition, SMCR outperforms HRED and the other methods on NDCG@5, NDCG@10, NDCG@15, and NDCG@20. A possible reason is that SMCR encodes all of the text, while HRED encodes paragraph-level text in the context encoding stage. It should be noted that, since there are as many as 6800 movie entities and as many as 10064 movie-related entities in the REDIAL dataset, the NDCG metric is not suitable for evaluating the recommendation performance of the model on that dataset, because NDCG amounts to weighting the hits counted by Recall by their reciprocal rank position.

Table 3

Performance of all competitors on the MultiWoz dataset with respect to the different number of NDCG@k

Models	NDCG@1	NDCG@5	NDCG@10	NDCG@15	NDCG@20
HRED	0.0987	0.1601	0.1834	0.1931	0.1995
TopicRNN	0.0760	0.1313	0.1511	0.1592	0.1648
Mem2Seq	0.0961	0.1523	0.1632	0.1752	0.1762
DCR	0.1016	0.1360	0.1482	0.1542	0.1584
SMCR	0.1324	0.1784	0.1904	0.1965	0.2102

In conclusion, by analyzing the experimental data in Tables 1–3, we summarize the possible reasons why our model is effective. Compared with the DCR method, our SMCR method uses a recommender based on SGAT, which effectively removes edges that are not related to the task; in addition, in the convolution operation, it aggregates more information from the neighbour nodes that are associated with the central node. With the encoder based on multi-hop attention, our SMCR method retains more information, because the attention mechanism always focuses on the information related to the query vector. Moreover, compared with the traditional RNN model, the multi-hop attention mechanism uses memory to remember historical information and is more stable when processing long sequences. The benefits of feeding dialogue text and knowledge graph triples separately to different models are clear. A sparse graph neural network is more suitable for capturing features from, and combining, knowledge triples, while an encoder is more appropriate for mapping text information with sequence features into a high-dimensional feature space. Simply feeding the graph structure information together with the sequence information into the encoder as combined input introduces interference from noisy information, and the feature extraction granularity becomes too high.

Table 4

Ablation study on the MultiWoz dataset

Models	Accuracy	Recall	BLEU	Perplexity
SMCR (-MHA)	0.1716	0.0523	0.2023	8.62
SMCR (-SGAT)	0.1618	0.0481	0.2075	9.69
SMCR	0.1966	0.0649	0.2234	6.56

Figure 3.

Performance (i.e., Accuracy, Recall, F1 and NDCG@1) of various conversational recommendation methods with respect to the different number of dimensions.

4.4.2 Ablation study

We conduct an ablation study with two variants of our complete method to show the contribution of each component to the conversational recommender task: SMCR (-MHA) removes the multi-hop attention from the dialogue state tracking module, while SMCR (-SGAT) removes the sparse graph attention network from the recommender module. As shown in Table 4, BLEU and Perplexity degrade after removing the multi-hop attention network. This is because the multi-hop attention mechanism uses external memory units to store and propagate information. Without it, the model is unable to capture the connection between the current hidden state and the previous hidden state, resulting in a decline in memory ability. The limited information restricts the capability of the language model, so the quality of the generated sentences decreases and part of their semantic information is lost. Besides, the sparse graph attention network plays an essential role in the recommendation task. One possible explanation is that the binary gate in SGAT helps to preserve vital information and remove noisy information, and therefore retains the critical knowledge graph information to the greatest extent possible. In conclusion, Table 4 shows that both components help improve the performance of the conversational recommender system.

Figure 4.

Performance (i.e., NDCG@5, NDCG@10, NDCG@15 and NDCG@20) of various conversational recommendation methods with respect to the different number of dimensions.

Figure 5.

Performance (i.e., BLEU and Perplexity) of various conversational recommendation methods with respect to the different number of dimensions.

Figure 6.

Performance(i.e., BLEU and Perplexity) of SMCR with respect to the different number of hops.

4.4.3 Experiment with long-term dependencies (RQ2)

Figures 3–5 show the performance of our SMCR method and the other baselines with the number of dimensions ranging from 8 to 64. When the number of dimensions is relatively small, SMCR achieves the best results on most evaluation metrics, i.e., accuracy, recall, F1, NDCG@1, NDCG@5, BLEU, and Perplexity, and obtains the second-best result on NDCG@10, NDCG@15, and NDCG@20. Unfortunately, when the number of dimensions becomes larger, the performance of SMCR decreases slightly. We infer that a possible reason lies in the operating mechanism of the memory network, which current AI hardware accelerators cannot speed up well. We also reach an important conclusion: compared with other conversational recommendation methods, the performance of SMCR is very stable across dimensions ranging from 8 to 64. On the Perplexity metric in particular, our method is far superior to the other baseline algorithms. In addition, the external memory mechanism used alongside the RNN-based model does not decay as the sequence length increases. These results indicate that the multi-hop attention network we designed to encode the dialogue context can quickly encode long dialogue sequences and capture long-term dependencies.

Figure 6 shows the performance (i.e., BLEU and Perplexity) of our SMCR method with the number of hops ranging from 5 to 10. As the number of hops increases, the BLEU score increases steadily. One possible explanation is that the memory vector in the multi-hop attention network carries most of the semantic information of the sentence. Besides, the dot product of the query tensor and the memory vector models the interaction between different parts of the sentence. Unfortunately, there is a slight decrease in the perplexity score. We infer that a few stop words were not accounted for, and stop words have a great impact on perplexity. From the perspective of semantic analysis, however, the presence of stop words does not fully represent the quality of sentence generation. In addition, meaningless information in sentences also affects perplexity.
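The interaction between the query and the memory described above can be illustrated with a generic multi-hop read in the style of end-to-end memory networks [32]. This sketch is an assumption about the general mechanism rather than a reproduction of Eq. (2): it uses plain lists instead of tensors, omits all learned projection matrices, and updates the query by simply adding the read vector at each hop.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_hop_read(query, memory, num_hops):
    """Repeatedly attend over memory slots and fold the read vector into the query."""
    q = list(query)
    weights = []
    for _ in range(num_hops):
        # Attention weights from the dot product of the query and each memory slot.
        weights = softmax([dot(q, slot) for slot in memory])
        # Read vector: attention-weighted sum of the memory slots.
        read = [sum(w * slot[d] for w, slot in zip(weights, memory)) for d in range(len(q))]
        # Query update for the next hop (simplified: no learned projection).
        q = [qi + ri for qi, ri in zip(q, read)]
    return q, weights


memory = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical encoded dialogue utterances
final_query, attention = multi_hop_read([1.0, 0.0], memory, num_hops=3)
print(final_query, attention)
```

Each additional hop lets the query re-attend to the memory with what it has already read, which is one way to see why more hops can carry more sentence-level semantic information into the final state.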

4.4.4 The complexity of graph calculation (RQ3)

We further analyze the edges removed by SGAT. Figure 7 illustrates the effectiveness of the SGAT method in removing useless edges when modelling the knowledge graph on the MultiWOZ dataset. Owing to the binary gates and the optimization of the ${L_{0}}$ -norm, the number of edges is reduced by 20% within 50 epochs, indicating significant edge redundancy in the knowledge graph used by the conversational recommender. Comparing our SMCR method with the DCR method shows that SGAT can accurately identify important edges: compared with random edge removal or bottom edge removal, our model improves significantly on the evaluation metrics, i.e., accuracy, recall, and BLEU, as can be observed from the experimental results in Tables 1 and 2. This result supports one of the research motivations of this paper, namely to reduce the complexity of graph calculation and the interference of noisy nodes by using the sparse graph attention mechanism.

Figure 7.

The number of edges in the knowledge graph changes with iteration.

Figure 8.

The evolution of the graph created by a synthetic dataset in 10 training epochs.

To illustrate how the sparse graph attention mechanism works, we construct a graph and randomly assign features and labels to its nodes. Then we train a one-layer SGAT on this dataset to complete the node classification task. As shown in Fig. 8, most of the edges are retained, while some useless edges are cut off. Furthermore, we examine why particular edges are removed. For example, the edge from node 13 to node 11 is removed. This is because the feature similarity between node 11 and nodes 8, 3 and 9 is higher, while node 13 is only connected to node 3 and node 9, so removing the edge connecting node 11 and node 13 does not affect node 11. The above experimental results also support our hypothesis: by developing a recommendation model based on sparse graph attention, we can accurately identify important edges by matching items with the dialogue context, thereby reducing the complexity of graph computation and the interference of noisy nodes while effectively improving the performance of the recommendation model.
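The effect of a learned binary mask on neighbour aggregation can be sketched on the example edge discussed above. The code below is a simplified illustration: it replaces attention-weighted aggregation with a plain mean, and the gate values and node features are hypothetical, chosen so that the pruned neighbour is redundant.

```python
def prune_edges(edges, gates, threshold=0.0):
    """Keep only the edges whose gate value is above the threshold."""
    return [edge for edge, gate in zip(edges, gates) if gate > threshold]


def mean_aggregate(node, edges, features):
    """Average the features of all source nodes with an edge into `node`."""
    neighbours = [src for src, dst in edges if dst == node]
    if not neighbours:
        return list(features[node])
    dim = len(features[node])
    return [sum(features[n][d] for n in neighbours) / len(neighbours) for d in range(dim)]


# Directed edges (source, target) into node 11, with hypothetical test-time gates.
edges = [(8, 11), (3, 11), (9, 11), (13, 11)]
gates = [1.0, 1.0, 1.0, 0.0]  # the gate on the edge from node 13 is closed
features = {8: [1.0, 0.0], 3: [0.8, 0.2], 9: [0.9, 0.1], 13: [0.9, 0.1], 11: [0.9, 0.1]}

kept = prune_edges(edges, gates)
removed_ratio = 1.0 - len(kept) / len(edges)
print(kept, removed_ratio)
print(mean_aggregate(11, edges, features), mean_aggregate(11, kept, features))
```

In this toy setting the aggregated representation of node 11 is essentially unchanged after pruning, while one quarter of its incoming edges no longer needs to be processed.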

5. Conclusion

In this work, we developed a conversational recommender system based on sparse graph attention and multi-hop attention. Our key argument is that a conversational recommender based on the graph convolutional network assigns the same weights to different neighbors within the same-order neighborhood, resulting in a huge graph with a lot of noise, which may lead to overfitting of the model. Therefore, we design a recommendation model based on sparse graph attention to match items with the dialogue context, reducing the complexity of graph computation and the interference of noisy nodes. In addition, the conversational recommendation model has a long-term dependency problem when generating conversation topics, because the model cannot accurately capture the dependencies between words; these dependencies affect the overall structure of sentences and are very important for ensuring smooth communication between users and agents. To address this point, we design a multi-hop attention network that can quickly encode long conversation sequences to capture long-term dependencies, and we learn topic information through a variational auto-encoder optimized by KL cost annealing. We conducted empirical studies to validate the effectiveness of our SMCR method. The experimental results show that SMCR outperforms other state-of-the-art methods in the evaluation of both recommendation and dialogue generation. While our algorithm has demonstrated promising results, it is important to acknowledge its limitations. First, the algorithm may have high computational complexity, especially when dealing with large-scale datasets or complex problem domains, which can limit its scalability and efficiency in real-world applications. Second, the algorithm may lack interpretability, meaning that it may be challenging to understand and interpret its underlying reasoning or decision-making process.

In the future, we will continue our work in two directions. First, we will explore the masked language model to further boost the performance of response generation. Second, we will try to perceive the spatial characteristics of the knowledge graph to enhance the recommendation performance.

Acknowledgments

This work is supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (No. KJZD-K202101105, KJQN202001136), the Humanities and Social Sciences Research Program of Chongqing Municipal Education Commission (No. 22SKGH302), and the National Natural Science Foundation of China (No. 61702063).

References

1. Zhou, Zhao W.X., Wang, Zhang, Wang, Wen J.-R., Leveraging historical interaction data for improving conversational recommender system, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2349–2352.

2. Sun, Zhang, Conversational recommender system, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 235–244.

3. Lei, Miao, Hong, Kan M.-Y., Chua T.-S., Estimation-action-reflection: Towards deep interaction between conversational and recommender systems, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 304–312.

4. Lei, Jiang, Chua T.-S., Seamlessly unifying attributes and items: Conversational recommendation for cold-start users, ACM Transactions on Information Systems (TOIS) 39(4) (2021), 1–29.

5. Zou, Chen, Kanoulas, Towards question-based recommender systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 881–890.

6. Chen, Lin, Zhang, Ding, Cen, Yang, Tang, Towards Knowledge-Based Recommender Dialog System, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1803–1813.

7. Liu, Wang, Niu, Che, Liu, Towards conversational recommendation over multi-type dialogs, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 235–244.

8. Zhou, Zhao W.X., Bian, Zhou, Wen J.-R., Improving conversational recommender systems via knowledge graph based semantic fusion, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1006–1014.

9. Madotto, C.-S., Fung, Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 1468–1478.

10. Wang, Cao, Jiang, Wang, Tong, Guo, Incorporating Specific Knowledge into End-to-End Task-oriented Dialogue Systems, in: 2021 International Joint Conference on Neural Networks (IJCNN), IEEE, 2021, pp. 1–8.

11. Dethlefs, Schoene, Cuayáhuitl, A divide-and-conquer approach to neural natural language generation from structured data, Neurocomputing 433 (2021), 300–309.

12. Sun, Yang, Multi-goal multi-agent learning for task-oriented dialogue with bidirectional teacher–student learning, Knowledge-Based Systems 213 (2021), 106667.

13. Liao, Takanobu, Yang, Huang, Chua T.-S., Deep conversational recommender in travel, arXiv preprint arXiv:1907.00710 (2019).

14. Wang, Cui, Zhou, Fung G.P.C., Wong K.-F., Topicrefine: Joint topic prediction and dialogue response generation for multi-turn end-to-end dialogue system, arXiv preprint arXiv:2109.05187 (2021).

15. Serban, Sordoni, Lowe, Charlin, Pineau, Courville, Bengio, A hierarchical latent variable encoder-decoder model for generating dialogues, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 2017.

16. Cui, Shen, Ouchi, Liu, Modeling semantic and emotional relationship in multi-turn emotional conversations using multi-task learning, Applied Intelligence 52(4) (2022), 4663–4673.

17. Sparse graph attention networks, IEEE Transactions on Knowledge and Data Engineering (2021).

18. Zhao, Wang, Shi, Song, Heterogeneous graph structure learning for graph neural networks, in: 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.

19. Pal, Xue, Desai, Banjo A.A.F., Adepiti C.A., Long L.R., Schiffman, Antani, Deep multiple-instance learning for abnormal cell detection in cervical histopathology images, Computers in Biology and Medicine 138 (2021), 104890.

20. Bowman S.R., Vilnis, Vinyals, Dai A.M., Jozefowicz, Bengio, Generating Sentences from a Continuous Space, Conference on Computational Natural Language Learning (2015).

21. Weizenbaum, ELIZA – a computer program for the study of natural language communication between man and machine, Communications of the ACM 9(1) (1966), 36–45.

22. Wallace R.S., The anatomy of ALICE, in: Parsing the Turing Test, Springer, 2009, pp. 181–210.

23. de Bayser M.G., Cavalin, Souza, Braz, Candello, Pinhanez, Briot J.-P., A hybrid architecture for multi-party conversational systems, arXiv preprint arXiv:1705.01214 (2017).

24. Dhingra, Gao, Chen Y.-N., Ahmad, Deng, Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 484–495.

25. Lipton, Gao, Ahmed, Deng, Bbq-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.

26. Christakopoulou, Radlinski, Hofmann, Towards conversational recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 815–824.

27. Greco, Suglia, Basile, Semeraro, Converse-et-impera: Exploiting deep learning and hierarchical reinforcement learning for conversational recommender systems, in: Conference of the Italian Association for Artificial Intelligence, Springer, 2017, pp. 372–386.

28. Kahou, Schulz, Michalski, Charlin, Pal, Towards deep conversational recommendations, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 9748–9758.

29. Zhang, Chen, Yang, Croft W.B., Towards conversational search and recommendation: System ask, user respond, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 177–186.

30. Moon, Shah, Kumar, Subba, Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 845–854.

31. Lei, Zhang, Miao, Wang, Chen, Chua T.-S., Interactive path reasoning on graph for conversational recommendation, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2073–2083.

32. Sukhbaatar, Szlam, Weston, Fergus, End-to-end memory networks, Advances in Neural Information Processing Systems 2015 (2015), 2440–2448.

33. Bouritsas, Frasca, Zafeiriou S.P., Bronstein, Improving graph neural network expressivity via subgraph isomorphism counting, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).

34. Wang, Zhang, Gao, Wang, Long, PredRNN: A recurrent neural network for spatiotemporal predictive learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).

35. Zheng, Liu, Yin, Sentence representation method based on multi-layer semantic network, Applied Sciences 11(3) (2021), 1316.

36. Hua, Mou, Lin, Heidler, Zhu X.X., Aerial scene understanding in the wild: Multi-scene recognition via prototype-based memory networks, ISPRS Journal of Photogrammetry and Remote Sensing 177 (2021), 89–102.

37. Dieng A.B., Wang, Gao, Paisley, Topicrnn: A recurrent neural network with long-range semantic dependency, Proceedings of the 5th International Conference on Learning Representations (2017).

38. Huang, Yuan, Zhang, Qiao, Attention-emotion-enhanced convolutional LSTM for sentiment analysis, IEEE Transactions on Neural Networks and Learning Systems (2021).

39. Zhu, Zhang, Kang, Liu, Aspect-gated graph convolutional networks for aspect-based sentiment analysis, Applied Intelligence 51(7) (2021), 4408–4419.

40. Zhang, Yang, A novel dynamic predictive method of water inrush from coal floor based on gated recurrent unit model, Natural Hazards 105(2) (2021), 2027–2043.

41. Wei, Wang, Niu, Wind speed forecasting system based on gated recurrent units and convolutional spiking neural networks, Applied Energy 292 (2021), 116842.

42. ArunKumar, Kalaga D.V., Kumar C.M.S., Kawaji, Brenza T.M., Forecasting of COVID-19 using deep layer recurrent neural networks (RNNs) with gated recurrent units (GRUs) and long short-term memory (LSTM) cells, Chaos, Solitons & Fractals 146 (2021), 110861.

43. Fan, Bhosale, Schwenk, El-Kishky, Goyal, Baines, Celebi, Wenzek, Chaudhary, Beyond english-centric multilingual machine translation, Journal of Machine Learning Research 22(107) (2021), 1–48.

44. Schlag, Irie, Schmidhuber, Linear transformers are secretly fast weight programmers, in: International Conference on Machine Learning, PMLR, 2021, pp. 9355–9366.

45. Serban, Sordoni, Bengio, Courville, Pineau, Building end-to-end dialogue systems using generative hierarchical neural network models, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30, 2016.

46. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł., Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

47. Emami, Jelinek, A neural syntactic language model, Machine Learning 60(1) (2005), 195–227.

48. Gao, Nie J.Y., Cao, Dependence language model for information retrieval, in: International ACM SIGIR Conference on Research & Development in Information Retrieval, 2004.

49. Bruna, Zaremba, Szlam, Lecun, Spectral networks and locally connected networks on graphs, in: International Conference on Learning Representations (ICLR2014), CBLS, April 2014, 2014, p. http–openreview.

50. Jiang, Order-agnostic cross entropy for non-autoregressive machine translation, in: International Conference on Machine Learning, PMLR, 2021, pp. 2849–2859.

Conversational recommender based on graph sparsification and multi-hop attention
