Abstract
Persuasion is an important yet complex aspect of human intelligence. When undertaken through dialogue, the deployment of good arguments, and therefore counterarguments, clearly has a significant effect on the ability to be successful in persuasion. Two key dimensions for determining whether an argument is “good” in a particular dialogue are the degree to which the intended audience believes the argument and counterarguments, and the impact that the argument has on the concerns of the intended audience. In this paper, we present a framework for modelling persuadees in terms of their beliefs and concerns, and for harnessing these models in optimizing the choice of move in persuasion dialogues. Our approach is based on Monte Carlo Tree Search, which allows optimization in real time. We provide empirical results of a study with human participants that compares an automated persuasion system based on this technology with a baseline system that does not take beliefs and concerns into account in its strategy.
Introduction
Persuasion is an important and multifaceted human facility. The ability to induce another party to believe or do something is as essential in commerce and politics as it is in many aspects of daily life. We can consider examples such as a doctor trying to get a patient to enter a smoking cessation programme, a politician trying to convince people to vote for them in an election, or even just a child asking a parent for a rise in pocket money. There are many components that boost the effectiveness of persuasion, and even simple things, such as how someone is dressed or a well-placed compliment, can affect the person being persuaded. Nevertheless, arguments are a crucial part of persuasion, and resolving a given person’s doubts and criticisms is necessary to win them over.
While arguments can be implicit, as in a product advert, or explicit, as in a discussion with a doctor, in both cases they need to be selected with the target audience in mind. In this paper, we focus on the following two dimensions in which a potential persuadee may judge arguments in the context of a dialogue.
Arguments are formed from premises and a claim, either of which may be explicit or partially implicit. An agent can express a belief in an argument based on the agent’s belief in the premises being true, the claim being implied by the premises, and the claim being true. There is substantial evidence in the behaviour change literature that shows the importance of the beliefs of a persuadee in affecting the likelihood that the persuasion attempt is successful (see for example the review by Ogden [83]). Furthermore, beliefs can be used as a proxy for fine-grained argument acceptability, the need for which was highlighted by empirical studies conducted in [86,97].
Arguments are statements that contain information about the agent and/or the world. Furthermore, they can refer to impacts on the agent and/or the world, which in turn may relate to the concerns of the agent. In other words, some arguments may have a significant impact on what the agent is concerned about. In empirical studies, it has been shown that taking the persuadee’s concerns into account can improve the likelihood that persuasion is successful [29,32,52]. Conceptually, concerns can be seen as related to values as used in value-based argumentation. For instance, values can be used to capture the general goals of an agent as discussed in [9]. However, as we will explain, the way we use concerns in this paper is quite different to the way values are used in value-based argumentation.
To illustrate how beliefs (respectively concerns) arise in argumentation, and how they can be harnessed for more effective persuasion, consider Example 1 (respectively Example 2).
Consider a health advisor who wants to persuade a student to join a smoking cessation programme.
Argument 1: If I give up smoking, I will get more anxious about my studies, I will eat less, and I will lose too much weight.
Argument 2: If I give up smoking, I will start to eat more as a displacement activity while I study, and I will get anxious as I will put on too much weight.
Based on the conversation so far, the health advisor has to judge whether the student believes Argument 1 or Argument 2. With that prediction, the advisor can try to present an appropriate argument to counter the student’s belief in the argument, and thereby overcome the student’s barrier to joining the smoking cessation programme. For instance, if the advisor thinks it is Argument 1, they can suggest that as part of the smoking cessation programme, the student can join free yoga classes to overcome any stress that they might feel from the nicotine withdrawal symptoms.
Consider a volunteer street-fundraising for a hospital charity who has managed to engage in a conversation with a passer-by.
Argument 1: Supporting this hospital will fund innovative cancer research. Argument 2: Supporting this hospital will fund specialized hearing equipment for deaf people.
The volunteer is fundraising in a university area and has managed to stop a passer-by who is likely to be a professor in the nearby institution. Hence, they may guess that supporting research is likely to be a significant concern for the passer-by. The volunteer is likely to have just one chance at convincing the passer-by to sign up, and will regard Argument 1 as the more likely of the two to be convincing for the passer-by.
So in Example 1, the student has the same concerns, but different beliefs, associated with the arguments. In contrast, in Example 2, the passer-by has the same beliefs, but different concerns, associated with the arguments. We therefore see both the concerns and beliefs as being orthogonal kinds of information that an agent might have about an argument, and knowing about them can be valuable to a persuader.
In the research reported in this paper, we consider how beliefs and concerns can be taken into account in automated persuasion systems (APSs). An APS plays the role of the persuader and engages in a dialogue with a persuadee.
In previous research, we have used the epistemic approach to probabilistic argumentation to reason with beliefs [21,62,64,87,105], and the value of this has been supported by experiments with participants [86]. In applying this approach to modelling a persuadee’s beliefs in arguments, we have developed methods for: (1) updating beliefs during a dialogue [57,66,68]; (2) efficiently representing and reasoning with the probabilistic user model [49]; (3) modelling uncertainty in the modelling of persuadee beliefs [51,69]; (4) harnessing decision rules for optimizing the choice of argument based on the user model [50,53]; (5) crowdsourcing the acquisition of user models based on beliefs [56]; (6) modelling a domain in a way that supports the use of the epistemic approach [30]. These developments for taking belief into account offer a well-understood theoretical and computationally viable framework for applications such as behaviour change.
However, belief in an argument is not the only dimension of a user model that could be taken into account. Recent research provides some evidence that taking concerns into account can improve the persuasiveness of a dialogue [32,52]. Thus, in order to model users better, it is worth exploring a combination of a user’s beliefs and their concerns in a coherent framework for strategic argumentation. Beliefs and concerns are different concepts (as seen in Examples 1 and 2). For instance, it is possible for an argument to be believed but to neither raise nor address concerns that are important to an agent. Similarly, it is possible for an argument to be disbelieved but to raise or address concerns that are important to the agent. Ideally, to increase the impact of arguments that we present to an agent, we would want the agent to believe the argument and to see that it raises or addresses concerns that are important to them. However, to date, there is a lack of a computational framework for harnessing both dimensions together in making strategic choices of move in persuasion dialogues.
The aim of this paper is therefore to provide a computational approach to strategic argumentation for persuasion that takes both the concerns and the beliefs of the persuadee into account. They will be used to provide a more advanced user model which can be harnessed by a decision-theoretic APS to choose the arguments to present in a dialogue. To render this approach viable for real-time applications and to dynamically update an APS’s strategy as the dialogue progresses, we present an approach based on Monte Carlo Tree Search. We evaluate our proposal in an empirical study with human participants using an APS based on this technology, which we will refer to as the advanced system.
We proceed as follows: (Sections 2, 3 and 4) We present our setting, from domain and user modelling to dialogue protocols; (Section 5) We present our framework for optimizing choices of moves in persuasion dialogues; (Section 6) We present our approach to acquiring and harnessing the crowdsourced data for user models; (Section 7) We present our experiments for evaluating our technology in automated persuasion systems; (Section 8) We discuss our work with respect to the related literature; and (Section 9) We discuss our contributions and future work.
In the context of this paper, we focus on arguments as they can be found in newspaper articles or in discussions between humans,
We will represent arguments and relations between them through the means of argument graphs as defined by Dung [38], which do not presuppose any particular argument structure and focus on modelling the attack relation.
An
For arguments
An argument graph can be easily depicted as a directed graph, where nodes represent arguments and arcs represent attacks. We will therefore use
Given an argument graph, a natural question to ask is which arguments are acceptable,
In our work, we will use the definition of an argument graph. However, as we will see in Section 5, we do not assume that the agents use dialectical semantics. The reason for not using dialectical semantics is that we are not concerned with determining the arguments acceptable according to normative principles. Instead, we wish to model how persuasion may occur in scenarios where the participants are allowed the freedom of opinion and as such do not need to adhere to any rationality principles. Certain studies show that the performance of dialectical semantics can be rather low in such applications [86,93]. Since we wish to construct a predictive model, we will not impose conditions for when an agent should be persuaded, but rather have a model that reflects how an agent is likely to behave. We refer to Section 8 for a discussion on approaches that follow the dialectical semantics.
In this section we will describe the type of user model that we want to incorporate into our APS. We focus on two possible dimensions – the concerns and the beliefs of the user – and in the next sections we explain how they can be interpreted and modelled.
Concerns
A concern is meant to represent something that is important to an agent. It may be something that they want to maintain (for example, a student may wish to remain
Consider the following arguments about abolishing student fees. Depending on the participant, the first argument could be addressing the concern of student finances whereas the second could be raising the concern of education.
The types of concern may reflect the possible motivations, agenda, or plans that the persuadee has in the domain (
Various agents can, independently of each other, identify similar concerns. Thus, it may be appropriate to group these into types of concern. For example, we could choose to define the type “Fitness” to cover a variety of concepts, from lack of exercise through to training for marathons. The actual types of concern we might consider, and the scope and granularity of them, depend on the application. However, we assume that they are atomic, and that ideally, the set is sufficient to be able to type all the possible arguments that we might want to consider using in the dialogue. We therefore introduce the following notation:
Let
While certain agents can agree on what kinds of concern a given argument raises or addresses, this does not mean that the concerns themselves are equally important or relevant to them. We thus also require information about an agent’s preferences over types of concern. For instance, if we have a collection of arguments on the topic of university fees, we may have types such as “Student Well-being”, “Education”, or “Student Satisfaction”. We may then have an agent that regards “Student Well-being” as the most important concern, “Education” as the second, and “Student Satisfaction” as the least important. Another agent may have an entirely different preference over those categories, and these preferences need to be represented so that they can be harnessed by an APS to put forward more convincing arguments during a dialogue (see also Example 2).
With
By
There are some potentially important choices for how we can model an agent’s preferences. For instance, we can define them as pairwise choices, or assume that they form a partial or even linear ordering. While in the experiments we will focus on the linear approach (see Section 6), our general method is agnostic as to how preferences are represented. There are also numerous techniques for acquiring preferences from participants (see
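To make the linear approach concrete, the sketch below turns an ordering over concern types into numeric preference scores. The rank-based scoring scheme is a hypothetical illustration, not necessarily the representation used in our experiments.

```python
# Hypothetical sketch: deriving numeric preference scores from a linear
# ordering over concern types (most important first). The rank-based
# scoring is an illustrative assumption, not a definition from this paper.

def preference_scores(ordering):
    """Map a linear ordering over concerns to scores in (0, 1]."""
    n = len(ordering)
    return {concern: (n - i) / n for i, concern in enumerate(ordering)}

prefs = preference_scores(["Student Well-being", "Education", "Student Satisfaction"])
# The most preferred concern scores 1.0; the least preferred scores 1/n.
```

Pairwise or partial-order representations could be substituted without affecting the rest of the framework.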
The beliefs of the user strongly affect how they are going to react to persuasion attempts [83]. There is a close relationship between the belief an agent has in an argument, and the degree to which the agents regard the argument as convincing [56]. Furthermore, beliefs can be used as a proxy for fine-grained argument acceptability, the need for which was highlighted by empirical studies conducted in [86,97]. We therefore treat the belief in arguments as a key dimension in a user model. In this section we explain how we represent belief for an individual, and how we can capture the uncertainty in that belief when considering multiple (sub)populations of agents.
For modelling the beliefs of a user, we use the epistemic approach to probabilistic argumentation [21,63,64,106], which defines a belief model as a
A
For a probability distribution
The persuader uses a belief distribution
Definition 4 considers the belief we have in an argument. However, it lacks any quantification of the uncertainty about the belief in an argument. For example, an agent may be certain of the value that
These distributions offer a well-established and well-understood approach to quantifying uncertainty. Additionally, they allow for a principled way of representing subgroups within a population. This is particularly important for applications in persuasion, where different subpopulations may have significantly different beliefs in the arguments in a dialogue. Furthermore, they may also have radically different ways of responding to specific dialogue moves. In such situations, easily obtainable or already gathered data (such as a medical record) can be leveraged to match a new user with a particular subpopulation in order to use a more efficient argumentation strategy.
So rather than use a probability distribution as given in Definition 4 to formalize the belief in an argument, we will use a beta distribution. In a beta distribution for an argument
As we see in the following definition, the shape of the beta distribution is determined by two hyperparameters
A
Whilst this definition may appear complex, it gives a natural way of capturing the probability of a probability value. Furthermore, the definition can be easily understood in terms of capturing Bernoulli trials.
Given the definition for a beta distribution, the mean
Using the beta distributions gives us a number of advantages. The distribution can handle the uncertainty on the belief (i.e. the uncertainty over the value assigned to
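As an illustration of these advantages, the following sketch implements a beta-distributed belief with the standard conjugate update; the class name and the prior values are our own choices for illustration, not part of the framework's formal definitions.

```python
# Sketch of a beta-distributed belief in an argument. The hyperparameters
# alpha and beta can be read as counts of Bernoulli trials: participants
# who did / did not express belief in the argument.

class BetaBelief:
    def __init__(self, alpha, beta):
        self.alpha, self.beta = alpha, beta

    def mean(self):
        # Expected belief in the argument.
        return self.alpha / (self.alpha + self.beta)

    def variance(self):
        # Uncertainty about the belief; shrinks as evidence accumulates.
        a, b = self.alpha, self.beta
        return (a * b) / ((a + b) ** 2 * (a + b + 1))

    def observe(self, believed):
        # Standard conjugate update after observing one participant.
        if believed:
            self.alpha += 1
        else:
            self.beta += 1

b = BetaBelief(2, 2)   # weak prior with mean 0.5
b.observe(True)
print(b.mean())        # 3/5 = 0.6
```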
We can then plug the estimates
Another advantage of beta distributions is that we use them to detect the subpopulations with homogeneous behaviours (

Examples of beta distributions with parameters
The mixture of beta distributions in Fig. 1 shows the initial belief of all the participants in the argument
A
Therefore, a mixture
By extension, the probability of a belief
Figure 1 also presents the mixture of the two components, combined with a weights vector
So beta distributions offer a flexible and practical way of representing the belief in arguments when there is uncertainty in what that belief might be. As we will see, we can collect data from crowdsourced participants about their belief in arguments, and use this to populate the beta distributions in our user models. This involves finding the choice of components that best describes the data, which may involve a trade-off between the number of components and the fit to the data (see [51] for more details).
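To illustrate how such a mixture is evaluated, the sketch below computes the density of a two-component beta mixture; the component parameters and weights are hypothetical values, not fitted from our data.

```python
import math

# Sketch of a mixture of beta distributions, with hypothetical components
# and weights. Each component could model one subpopulation of users.

def beta_pdf(x, a, b):
    # Density of Beta(a, b) at x; the gamma function gives the
    # normalizing constant.
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * x ** (a - 1) * (1 - x) ** (b - 1)

def mixture_pdf(x, components, weights):
    # Weighted sum of component densities; the weights must sum to 1.
    return sum(w * beta_pdf(x, a, b) for (a, b), w in zip(components, weights))

# Two hypothetical subpopulations: sceptics (low belief) and believers.
components = [(2, 8), (8, 2)]
weights = [0.4, 0.6]
density = mixture_pdf(0.8, components, weights)
```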
There are many possible moves and protocols for dialogical argumentation (see Section 8), and they vary in their goals or properties. For the purpose of this study, the protocol we require would have to meet the following principles –
Last, but not least,
In the next sections we will provide a formalization of dialogue moves and the above intuitions concerning the protocol. In this paper we will assume that a dialogue is a sequence of moves
Dialogue moves
In this paper, we will only consider two types of move – the posit move, which will be used by the APS to state arguments, and the menu move, which will allow the user to give their counterarguments. In order to make certain notation easier, throughout this section we will assume that we have a dialogue
The purpose of the
The set of
With this, we define posit moves as follows:
For a dialogue
We note that for every posit move
Now we introduce the menu listing. Please note that displaying the listing does not count as a dialogue move in this approach; it is merely a way to facilitate the user move.
The

Interface for an asymmetric dialogue move for asking the user’s counterarguments. Multiple statements (and their counterarguments) can be displayed, one after another.
A menu move is simply a selection of responses from the menu listings against previously stated arguments, with the added constraints that for any given argument, a choice has to be made, and one cannot pick the null moves and counterarguments at the same time.
For a dialogue
We observe that for a given menu move
The posit and menu moves as defined here are only some of the possible moves that could be used in asymmetric persuasion dialogues. Nevertheless, they are sufficient for our purposes and have the benefit of resembling other popular non-dialogical interfaces for user input. In the next sections, we define our dialogue protocol based on these moves.
Let us now formally define the protocol for our dialogues using the previously proposed moves. The use of posit and menu moves and the lack of certain restrictions on them allows the protocol to meet our principles of asymmetry, timeliness and incompleteness. This, along with certain classical requirements (such as participants taking turns in expressing their opinions), leads to the following formalization.
A dialogue For each step For each step For step For each step For each step For step For each step For the final step
For the purpose of our experiments, we have constructed two argument graphs that we will discuss in Section 6.1. Figure 3 presents the subgraph of the original graph associated with the discussion visible in Table 1. The table presents a dialogue between the user and one of the APSs we have implemented; for the sake of readability, we refer to arguments with their tags rather than textual content.
We observe that the agents take turns in presenting their arguments, and since the definition of our dialogue moves forces an argument–counterargument relation, the system moves are at an even distance from the persuasion goal (argument 0) and the user moves are at an odd distance. In other words, they are (indirect) defenders and (indirect) attackers of the goal respectively. The fact that not every user argument has to be answered by the system in our protocol results in arguments 17, 34, 35, 36 and 83 being unattacked.
An example of a dialogue adhering to the asymmetric posit protocol for the argument graph in favour of maintaining student fees (see data appendix)

A subgraph of the argument graph in favour of maintaining student fees (see data appendix) induced by the dialogue from Table 1.
The above restrictions can be simply explained as follows. First, only the arguments that occur in the graph can be exchanged. The system and the user take turns in the dialogue, with the system making the first move by positing the persuasion goal. The system posits need to be met with appropriate user menu moves, and the counterarguments raised by the user may be (not necessarily fully) addressed by system posits. The system is only forced to give a complete response to the first user move; after that, the responses can be partial. In particular, we limit the number of active dialogue lines to two, i.e., at most two arguments that can still be responded to by the user can be played by the system. When there are more than two active dialogue lines, the number of arguments can be quite large, so we limited the amount of information presented to the user so that it was not overwhelming. We made an exception for the reply to the user’s first move because the user is more likely to understand and appreciate the system’s response to each of the user’s arguments. This limit and the exception for the first move are assumptions, and alternative assumptions could be investigated.
At the end of the dialogue, the system (
Our approach is one of many, and there exist various different dialogue protocols (see also Section 8). Nevertheless, we are not aware of other methods that would adhere to our principles. This protocol is different to the dialogue protocols for abstract argumentation that are used for determining whether specific arguments are in the grounded extension [90] or preferred extension [35]. It is also different to the dialogue protocols for arguments that are generated from logical knowledge bases (
Typically, at any step of the dialogue, there can be multiple move options to choose from. In other words, particularly with large domains, more often than not it can be the case that
Advanced strategy
Our approach to making strategic choices of move is to harness decision trees [50]. A
In the case of dialogical argumentation, a full decision tree represents all the possible dialogues. Each path from the root to a leaf is one possible permutation of the moves permitted by the dialogue protocol

A decision tree for an argumentation dialogue. Each arc is labelled with a posit move in a dialogue, which for readability purposes is assumed to consist of only single arguments in this example. Each branch denotes a dialogue involving exactly three arguments with the first (respectively the second) being posited by the proponent (respectively the opponent). The proponent (decision) nodes are solid boxes, the opponent (chance) nodes are dashed boxes and the leaf nodes are circles.
In order to compare different dialogues so as to be able to select the best one that can be reached from each step, we need to define a
Decision trees are useful tools in artificial intelligence. However, they also have their limits, and quickly become unmanageable in applications with a large number of possible outcomes. For instance, while we can use decision trees for a tic-tac-toe game, Go is too complicated. Unfortunately, given the sizes of argument graphs we will be dealing with in this paper, the same holds for our APS. A possible solution is to make use of appropriate sampling techniques that explore only certain branches of the tree (
Monte Carlo Tree Search (MCTS) [33] methods are amongst the most efficient online methods for approximately solving large sequential decision-making problems (for a review, see [23]). This method is notably used in the
The approach can be roughly split into four phases – selection, expansion, simulation and backpropagation – that are repeated until the desired number of simulations has occurred or some time limit is reached.
Starting from the root of the tree (the current state of the dialogue), the algorithm chooses an action to perform in a black-box simulator of the environment. It uses the UCB1 [4] procedure to choose the action and then observes the new state of the environment that is output by the simulator. It then goes down a level in the tree depending on this new state. The algorithm repeats this step until it reaches a leaf in the tree.
If this leaf is not a terminal state of the problem (
Once the leaf node has been expanded (and is thus not a leaf anymore) the algorithm simulates all the subsequent steps in the dialogue until it reaches a possible terminal state. This simulation does not expand the tree.
Once a terminal state has been reached, a reward can be calculated and then backpropagated up in the tree to calculate the most promising nodes.
These four steps are repeated until the desired number of simulations has occurred or some time limit is reached. At this point, the most promising next argument in the dialogue is selected as the child of the root node with the highest backpropagated reward. This argument is played in the real dialogue, a new state is observed after the user responds and the root of the simulation tree is moved down to the node representing this new state.
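The four phases described above can be sketched as follows. The simulator interface (`legal_moves`, `play`, `is_terminal`, `reward`) and the exploration constant are hypothetical placeholders for illustration, not the actual APS implementation.

```python
import math
import random

# Minimal sketch of the four MCTS phases: selection (with UCB1), expansion,
# simulation and backpropagation, over a generic black-box simulator.

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # move -> Node
        self.visits = 0
        self.total_reward = 0.0

def ucb1(child, parent_visits, c=math.sqrt(2)):
    # Unvisited children are tried first; otherwise balance the average
    # reward (exploitation) against the exploration bonus.
    if child.visits == 0:
        return float("inf")
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def mcts(root_state, sim, n_iterations=1000):
    root = Node(root_state)
    for _ in range(n_iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB1.
        while node.children and len(node.children) == len(sim.legal_moves(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: ucb1(ch, node.visits))
        # 2. Expansion: add one untried child, if the node is not terminal.
        untried = [m for m in sim.legal_moves(node.state) if m not in node.children]
        if untried and not sim.is_terminal(node.state):
            move = random.choice(untried)
            child = Node(sim.play(node.state, move), parent=node)
            node.children[move] = child
            node = child
        # 3. Simulation: random rollout to a terminal state (tree unchanged).
        state = node.state
        while not sim.is_terminal(state):
            state = sim.play(state, random.choice(sim.legal_moves(state)))
        reward = sim.reward(state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    # Recommend the most visited child of the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```

Given a black-box simulator of the dialogue (including simulated user responses), `mcts` returns the recommended next move; once the real user replies, the root of the tree is moved down to the node representing the new state.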
Reward function
The purpose of the reward function is to be able to compare dialogues; the higher the reward, the better or more desirable the dialogue is. In our framework, the reward function is based on the usage of concerns arising in the arguments, and the belief in the persuasion goal at the end of the dialogue. For this, we have designed the reward function in two parts, that are ultimately combined into one single value, as we describe in this section.
The aim of the concern scoring function is to reflect how well the arguments posited by the system match the user’s preferences over concerns. We assume that arguments covering fewer but more important concerns are more interesting to the user than arguments covering more but less relevant ones.
We thus aim to select the most appropriate argument(s) to state out of the possible ones. Every argument uttered in a dialogue (aside from the persuasion goal) is stated in response to one or more arguments that appeared in the previous move (see Section 4.2). We can thus speak about “dialogue parents” of a given argument (
We explain these functions as follows:
In addition, we require the function
Let
The concern score of a given dialogue step associated with the system is now defined in terms of how “good” or “bad” the sibling arguments that were not played are. For obvious reasons, the first step in which the persuasion goal is played is ignored. The score of the dialogue is then simply an average of these values:
Assume
The non-chosen score for a step of the dialogue generates an average preference score for each non-chosen concern. This is done by taking each concern of the arguments played (
We observe that the concern score of a dialogue is always in the
Let

Argument graph used in Example 5.
Consider the following dialogue based on the argument graph in Fig. 5 where the graph and dialogue are hypothetical: 
Assume that the concerns associated with the arguments and the population preference scores are as follows:
The concern score combines the information about the concerns associated with the arguments appearing and not appearing in the dialogue, and the relative preference over those concerns, to a single value in the unit interval. The definition incorporates a bias favouring arguments that have fewer concerns associated with them. Increasing the number of concerns for the chosen arguments causes the non-chosen score (given by the
In principle, we could update a user model during a dialogue using a belief redistribution function that takes the old probability distribution and returns a revised one. To do this, we could consider the notion of an
However, if the update is applied on all of the possible subsets of arguments, it may lead to a computationally intractable problem. To address this issue, we can for instance exploit the structure of the argument graph
A
For further details on the properties of such labellings we refer to [61,88,89]. What is important to note is that every probability distribution has a corresponding labelling, and for every labelling we can find at least one probability distribution producing it.
We also need the following notion of an
Let
Consider the argument graph
Example of an induced argument graph.
Throughout the rest of this section, we will assume that we are working with a graph
We consider three stages of belief for an argument (and hence three labellings) as arising in a dialogue (and so attack and defence is with respect to the arguments in the dialogue).
These are the initial belief, the attacked belief, and the reinstated belief.
The reinstated belief corresponds to the value we use to evaluate the belief in an argument (and thus in the goal) at the end of the dialogue. Note, when a belief is reinstated, it is not necessarily reinstated to its original value. Rather, depending on the belief in its attackers, the reinstated belief may be below its original value, as shown in [97].
In order to model this behaviour of partial effectiveness of updating belief, we introduce the following coefficients to play the role of dampening factors (as suggested in [17]) which causes the effect of an attacker to decrease as the length of the chain of arguments increases. We will introduce the
For
The definition of these coefficients provides a balance between how strongly the attackers are believed and their number. There are other ways to aggregate beliefs of attackers, each with their advantages and disadvantages. Summing the attackers’ beliefs, while simple, may not be in [0,1], whereas taking the maximal or average belief does not reflect the number of attackers. Our definition addresses these issues and additionally provides a form of dampening effect so that the influence of an argument decreases as the length of the chain of arguments increases.
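Purely to illustrate the desiderata just stated (boundedness in [0,1], sensitivity to both the strength and the number of attackers, and a dampening factor), here is one hypothetical aggregation with these properties. It is an illustrative assumption, not the definition of the coefficients used in this paper.

```python
# One illustrative aggregation of attacker beliefs into a coefficient in
# [0, 1]. This is an assumption for illustration only, not the paper's
# actual definition: it is bounded, grows with both the number and the
# strength of attackers, and applies a dampening factor.

def attack_coefficient(attacker_beliefs, damping=0.5):
    """Aggregate attacker beliefs into a single coefficient in [0, 1]."""
    c = 1.0
    for b in attacker_beliefs:
        c *= 1.0 - damping * b   # each attacker erodes the remainder
    return 1.0 - c

attack_coefficient([])          # no attackers: coefficient 0
attack_coefficient([1.0, 1.0])  # several strong attackers compound the effect
```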
In the following, we show that: the coefficients are in the unit interval (Proposition 2); when there are attackers of an argument, and all the attackers have an initial belief of 1 (respectively 0), then the
Let
Assume
In the following result, we assume that the attackers for an argument
From the proof of Proposition 2, we have
Like in the previous result, the following result assumes that the attackers for
From the proof of Proposition 2, we have
Now we define the way initial, attacked, and reinstated belief are calculated for each argument
Let
So
In contrast,

Argument graph used in Example 7.
Consider the argument graph in Fig. 7. The following table gives the 
Here we see that because
We use the formulae in Definition 17 for several reasons. First, the definition takes the initial belief into account (where that comes from the beta distribution). Second, the definition takes the number of attackers into account. In a fine-grained setting, it is normally difficult to have a uniform intuition as to whether a single attacker with high belief should be more effective in decreasing the belief in the attackee than 10 attackers with lower beliefs. Using our
The
For a dialogue
Put simply, the reward function is the product of the two dimensions. This means we give equal weight to the two dimensions. It also means that weakness in either dimension is sufficient to give a low reward.
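A minimal sketch of this combination, with placeholder inputs standing in for the concern score and the final belief in the persuasion goal:

```python
# Sketch of the combined reward: the product of the concern score and the
# final belief in the persuasion goal. Both inputs lie in [0, 1], so
# weakness in either dimension yields a low reward.

def reward(concern_score, goal_belief):
    return concern_score * goal_belief

reward(0.8, 0.9)   # high on both dimensions: high reward
reward(0.8, 0.1)   # disbelieved goal: low reward despite a good concern score
```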
We continue Example 5 where for the dialogue
The reward function is a simple and intuitive way of aggregating the two dimensions, but other aggregations could be specified and used directly in our framework.
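As a minimal sketch of the product reward, assuming both dimensions are scaled to [0,1] (the argument names are ours, not the paper's):

```python
def reward(belief_dimension, concern_dimension):
    """Sketch of the reward described in the text: the product of the two
    dimensions (both assumed to lie in [0, 1], which is our assumption
    about their scales). Weakness in either dimension yields a low reward."""
    return belief_dimension * concern_dimension
```

For instance, a dialogue scoring 0.9 on belief but only 0.1 on concerns receives reward 0.09, lower than a balanced 0.6/0.6 dialogue at 0.36.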
When simulating the results of the system’s actions, we also need to simulate a credible behaviour from the user in order to advance in the simulated dialogue. Thus, it is important to mimic the choices of arguments that the user could make. In order to do that, we propose a multistep process, for each argument
All the subsets of believed counterarguments to the arguments played by the system are then used as the new step in the simulated dialogue,
The assumptions that we have made for the simulation of the user choices are intended to encompass any possible actual behaviour of the users (supposing their choices are based on the same elements: belief in arguments, a ranking over them, and then a choice of whether to play them or not). Therefore, in theory, with enough simulations, the strategy that we obtain is robust with respect to the behaviour of real users.
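The simulation of user choices described above can be sketched as follows; the belief threshold and the data structures are our assumptions:

```python
from itertools import chain, combinations

def believed(args, belief, threshold=0.5):
    # Keep the counterarguments the simulated user believes; the 0.5
    # threshold is an illustrative assumption.
    return [a for a in args if belief.get(a, 0.0) > threshold]

def candidate_user_moves(system_args, attackers, belief):
    """Enumerate all subsets of believed counterarguments to the arguments
    just played by the system; each subset is one possible simulated user
    move (the empty subset models the user having nothing left to say)."""
    pool = sorted({c for a in system_args
                     for c in believed(attackers.get(a, []), belief)})
    return list(chain.from_iterable(
        combinations(pool, r) for r in range(len(pool) + 1)))
```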
Baseline strategy
The baseline and advanced systems use the same protocol and the same argument graphs. This means that the baseline and advanced systems only differ on the argument selection strategy. The baseline strategy is a form of random strategy: When the baseline system has a choice of counterargument to present, it makes a selection using a uniform random distribution.
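A minimal sketch of this baseline strategy, assuming the attack relation is stored as a mapping from each argument to its attackers (our representation, not the implemented system's):

```python
import random

def baseline_move(user_args, attackers, rng=None):
    """Sketch of the baseline strategy: collect the counterarguments that
    attack at least one argument the user just played, and pick one
    uniformly at random."""
    rng = rng or random.Random()
    candidates = sorted({c for a in user_args for c in attackers.get(a, [])})
    return rng.choice(candidates) if candidates else None
```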
Data for domain and user modelling
In this section, we describe the methods used for obtaining the data we required for populating our domain and user model. The domain model is based on an argument graph (Section 6.1) and the assignment of concerns to arguments (Section 6.3), and the user model is based on the belief in arguments (Section 6.2), preferences over concerns (Section 6.4), and classification trees for predicting preferences over concerns (Section 6.5). This data is available as an appendix.3 The Data Appendix, containing the arguments, argument graphs, assignment of concerns to arguments, preferences over concerns, and assignment of beliefs to arguments, is available at the link
For each task, we have recruited a certain number of participants from a crowdsourcing platform called Prolific4
For the experiments that we present in Section 7, we used two argument graphs on the topic of charging students a fee for attending university. Since the experiments were conducted with participants from the UK, the arguments used in the argument graph pertain to the UK context. In the UK, the current situation is that students from the UK or other EU countries pay £9K per year for tuition at most universities. This is a controversial situation, with some people arguing for the fees to be abolished, and with others arguing that they should remain in place.
Each argument graph has a persuasion goal. For the first argument graph, the persuasion goal is
The arguments were hand-crafted; however, the information in them was obtained from diverse sources, such as online forums and newspaper articles, so that it would reflect a diverse range of opinions. These arguments are enthymemes (
For the following data gathering steps, we split the 146 arguments into 13 groups (the groups are distinguished in the Data Appendix files associated with surveys in which grouping was needed). We did this so that no group contained two directly related arguments (

One of the argument graphs for the university case study. The persuasion goal is “Charging students the £9K fee for university education should be abolished”. In order to better show the structure of the graph, the textual content of the arguments has been removed. The text is available in the data appendix.
In order to determine the belief that participants have in each argument, we used the 13 groups of arguments as described in the previous section. For each group, we recruited 80 participants from the Prolific crowdsourcing platform and asked each of them to assign a belief value to every argument in the group.5 Please note that the total number of participants was greater, but many of the submissions were rejected due to failed attention checks, which are a standard tool for rejecting dishonest participants.
For each statement, the participants could provide their belief using a slider bar that has a range from −5 to 5 and a granularity of 0.01. This means that a participant can give a belief such as −2.89 or 0.08. Whilst this granularity is finer than perhaps necessary, we believe it is better to do this than to risk losing information with a coarser scale. We also associated a text description with each integer value as follows: (−5)
Once all the beliefs were gathered, we calculated the beta mixture for every argument (recall Section 3.2), using the method described in [51]. Using an
Types of concern for the topic of charging university tuition fees
Once all the arguments have been defined, they need to be appropriately tagged with concerns. The types of concern that can be associated with the arguments are topic dependent and in this work we manually defined a set of 12 classes (as presented in Table 2). These were based on a consideration of the different possible stakeholders who might have a view on university tuition fees in the UK, and what their concerns might be.
In order to determine the concerns that the participants associate with each argument, we used the 13 groups of arguments as described previously. For each argument described in the Data Appendix, we asked the participants to choose the type of concern they think is the most appropriate from the list presented in Table 2 (
The concern assignment was later post-processed in order to reduce possible noise in the data. The concerns of a given argument were ordered based on the number of times they were selected by the participants. The threshold was set at half of the number of votes of the most popular concern, and only concerns above this threshold were kept. For instance, if
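This vote-thresholding step can be sketched as follows, assuming the raw data for an argument is a list of the concern labels chosen by the participants:

```python
from collections import Counter

def filter_concerns(votes):
    """Sketch of the post-processing described in the text: count the votes
    for each concern, set the threshold to half the vote count of the most
    popular concern, and keep only concerns above that threshold,
    ordered by popularity."""
    counts = Counter(votes)
    if not counts:
        return []
    threshold = max(counts.values()) / 2
    return [c for c, n in counts.most_common() if n > threshold]
```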
For this step, we recruited at least 40 participants from the Prolific6
Please note that the total number of participants was greater, but many of the submissions were rejected due to failed attention checks. Due to minor platform issues, certain arguments received between 41 and 43 responses rather than 40.
After the set of types of concern had been created, the next step was to determine the preferences that the users of our system could have over these types. Preference elicitation and preference aggregation are research domains in themselves, and fully investigating them in our context is beyond the scope of a single paper. Consequently, in this work, we decided to use a simple approach, which was to ask the participants to provide a linear ordering over the types of concern.
It is interesting to note that the results show that on average, the “Education” and “Student Well-being” concerns were ranked respectively first and second and “Government Finances” was ranked last.
For this step, we used 110 participants from the Prolific crowdsourcing website.8 As in the previous tasks, we used attention checks to discard dishonest submissions.
Preferences over concerns may allow an APS to offer a more user-tailored experience: when the APS has a choice of arguments to present as its next move, choosing the argument with the more preferred concern may be advantageous. However, agents may differ in their preferences, and so we need to discover the preferences of the current user during a dialogue. This creates certain challenges. A simple way to achieve this would be to query the user about all the concerns to determine their ranking. However, in practice, we do not want to ask the user too many questions, as this is likely to increase the risk of them disengaging. Longer discussions also tend to be less effective [108]. Furthermore, it is normally not necessary to know all of the preferences of the user. To address this, we can acquire comprehensive data on the preferences of a set of participants, and then use this data to train a classifier to predict the preferences of other participants. Thus, in this study, we created the classification trees using information that we had obtained about the users.
In addition to asking the participants for their ranking of the concerns (as explained in the previous subsection), we asked them to take a personality test. We used the Ten-Item Personality Inventory (TIPI) [46] to assess the values of five personality features based on the OCEAN model [77], one of the most well-known models in the psychology literature. These features were “Openness to experience”, “Conscientiousness”, “Extroversion”, “Agreeableness” and “Neuroticism” (emotional instability). We also asked them to provide some demographic and domain-dependent information, such as age, sex, whether they were a student at a higher education institution, and the number of children they have, both in general and in school or university.
Using all the above data, we learnt a decision tree for each pair of concerns using the Scikit-learn9
As a first stage, we ran a meta-learning process in order to determine the best combination of tree depth and minimum number of samples at each leaf for each pair of types. The meta-learning process is the repeated application of the learning algorithm for different choices of these parameters (
We used cross-validation in the meta-learning to determine the best combination of tree depth and minimum number of datapoints at each leaf. Once the best parameters were found for each pair of types, we then ran the actual learning part using these parameters with all the datapoints concerning the personality and demographic information. We thus obtained one decision tree for each pair of types that was used by the automated persuasion system in the final study.
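Assuming Scikit-learn, this two-stage procedure can be sketched with a cross-validated grid search; the parameter grids and the number of folds below are our assumptions, not the values used in the study:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def fit_pairwise_tree(X, y):
    """Sketch of the two-stage procedure described in the text:
    cross-validated search over tree depth and minimum samples per leaf,
    followed by refitting on all datapoints with the best parameters
    (GridSearchCV's default refit=True does exactly that)."""
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": [2, 3, 4, 5],
                    "min_samples_leaf": [1, 2, 5, 10]},
        cv=5,
    )
    search.fit(X, y)
    return search.best_estimator_
```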
Figure 9 shows the decision tree learnt for the Economy/Fairness pair of types, where “C” (resp. “N”) stands for “Conscientiousness” (resp. “Neuroticism”) in the OCEAN model.

Example of a decision tree for the Economy/Fairness pair, where “C” (resp. “N”) stands for “Conscientiousness” (resp. “Neuroticism”) in the OCEAN model.
We now describe the experiments we undertook to evaluate our approach to strategic argumentation for persuasion using the data and models explained in the previous parts of this paper.
Methods
In this section, we describe the implemented systems that we used for the experiments, and we describe the recruitment of participants.
Implementations used in the experiments
For the experiments, we implemented two versions of our automated persuasion system, and we deployed them with participants to measure their ability to change the belief of participants in a persuasion goal. We excluded the participants from data gathering studies from taking part in the persuasion experiments. The two versions followed the protocol described in Section 4.2, and were implemented as follows.
This was the baseline or control system, and it chose arguments at random from the ones attacking at least one of the arguments presented by the persuadee in the previous step.
This was the system that made a strategic choice of move that maximizes the reward (see Section 5.1). It incorporates the Monte Carlo Tree Search algorithm as presented in Section 5.1.1 and uses the reward function as presented in Section 5.1.2.
Each chatbot was composed of a front-end we coded in Javascript and a back-end in C. The code is available at

High level architecture of chatbot platform.
For the MCTS component of the advanced APS, we set the number of simulations to 1000 to balance the trade-off between the completeness of exploration and the time the participant has to wait. Indeed, the longer they wait, the less engaged they are, which causes a deterioration in the quality of the data. On average, the round trip from sending the counterarguments picked by the user to the back-end through the API, calculating the answer, and returning it to the front-end for presentation (therefore including network time and client-side execution) took between 0.5 and 5 seconds, depending on the number of counterarguments selected by the participant. We argue that these are acceptable times compared to a traditional human-to-human chat experience.
In this study, we used 261 participants recruited from the Prolific crowdsourcing website, which later allowed us to have 126 participants for the advanced system and 119 for the baseline.11 Some of the submissions had to be filtered out due to technical issues.
At the start of each experiment, each participant was asked the same TIPI, demographic and domain-dependent questions as in the ranking of concerns explained in Section 6.5. The full survey description and demographic statistics can be found in the Data Appendix.
After collecting the demographic and personality information, we asked the participants for their opinion on the following statement (using a slider bar ranging from −3 to 3 with 0.01 graduations). We note that the answer 0 (the midpoint) indicates a neutral stance: Are you against (slider to the left) or for (slider to the right) the abolishment of the £9K fees for universities, and to what degree?
Then we presented each participant with a chatbot (either the baseline system or the advanced system). After the end of the dialogue with the chatbot, the participant was again presented the statement about the abolition of the £9K student fee, and asked to express their belief using the slider bar. This way we obtain a value for the participant’s belief before and after the persuasion dialogue.
From the dialogues we obtained from running the advanced system and the baseline system, we obtained a head-to-head comparison for both of the graphs we have considered (Section 7.2.2). This analysis corresponds to the two kinds of conditions we had in mind when designing the experiment. This general analysis is further supplemented by an exploratory study of whether and how certain structural properties of dialogues may have had an impact on the behaviour of the users. Appropriate vertical or horizontal lines are used in tables when reporting on these two kinds of results. We note that not all subtypes of dialogues yielded sufficient samples to allow us to speak of the results with confidence. This is because, while we had control over whether users engaged with the advanced or the baseline system, the graph chosen for the discussion and the nature of the dialogue that was created were essentially determined by the users themselves. All the dialogues are presented in the Data Appendix.
Structural analysis of dialogues
We start by considering the structure of the dialogues produced by our APSs. We focus on three dimensions: completeness, linearity and length.
By a complete dialogue we understand a dialogue such that all the leaves in the subgraph associated with the dialogue are of even depth in the original graph on which the dialogue was based, and no
By a linear dialogue we understand a dialogue such that the subgraph associated with it is simply a chain from the root to the leaf. In other words, at most one argument is used in every dialogue move. Distinguishing these kinds of dialogues is useful due to their simplicity. When faced with such dialogues, many of the well-known argumentation semantics “converge”, i.e. offer similar predictions as to whether an argument is accepted or rejected, what kind of rank it obtains, and so on. A tree-like structure is not necessarily sufficient for that when we consider ranking or gradual semantics, particularly those that aim to balance the strength and the number of the attackers of a given argument [1,3,17]. Linearity therefore removes the issue of how the impact of multiple attackers of a given argument should be approached in a fine-grained setting such as ours. It is also worth noting that branching dialogues are more complex and more demanding of the user, and thus more likely to promote disengagement. It is therefore interesting to consider dialogues that lessen this burden.
The effectiveness of a dialogue can be linked to its length, as seen in [108]. We will also therefore consider if and what kind of a relationship exists between the dialogue length and belief change in our study. For this purpose, we consider two kinds of lengths; one seen as the number of exchanges between the system and the user, and one seen as the number of arguments uttered during the dialogue. While for linear dialogues these two lengths are identical, they are different in situations where more than one argument is uttered in a given turn and branching occurs.
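Assuming a dialogue is represented as a list of turns, each turn listing the arguments uttered in it (our representation), the two lengths can be computed as:

```python
def dialogue_lengths(dialogue):
    """The two lengths discussed in the text: the number of exchanges
    (turns) and the number of arguments uttered during the dialogue."""
    turns = len(dialogue)
    arguments = sum(len(turn) for turn in dialogue)
    return turns, arguments
```

For a linear dialogue the two lengths coincide, e.g. `dialogue_lengths([["a"], ["b"], ["c"]])` gives `(3, 3)`, whereas a branching turn makes them differ: `dialogue_lengths([["a"], ["b1", "b2"], ["c"]])` gives `(3, 4)`.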
We also separate our analysis with respect to the graph that was used for the dialogue. We have the graph built in favour of keeping the university fees (used when the participant was in favour of their abolishment), and the dual abolishing graph (used when the participant was in favour of keeping the fees). While these graphs partially overlap, they are not the same and each one possesses arguments unique to it; the analysis needs to respect these differences.
Tables 3 and 4 show the distributions of dialogue structures of different types that were produced by the advanced and baseline systems.
Analysis of the dialogues with the advanced system w.r.t. completeness, linearity, and used graph
Analysis of the dialogues with the baseline system w.r.t. completeness, linearity, and used graph
In the case of both APSs we therefore observe that the dialogues are more likely to be complete than incomplete, and linear rather than non-linear (i.e. branched). We also note that in both cases, the majority of the participants were in favour of abolishing the university fees, and therefore the graph in favour of keeping them was used for the dialogue – this is however something out of the control of the APS and merely reflects the views of the participant pool.
There are however some differences between the dialogues produced by the two systems. We first observe that while complete discussions are prevalent in both the advanced and the baseline system, there is still quite a difference as to their degree. Deeper analysis of the reasons for this variance in incompleteness (see Table 5) shows that the primary cause of this is associated with the baseline system being less likely to address all of the user’s arguments (i.e. the system is more likely to lead to odd-branched dialogues).
We believe this behaviour is simply a result of the design of the baseline system, where counterarguments are selected from the applicable ones at random. This behaviour might also contribute to the differences in the distributions of different kinds of dialogues produced by the APSs – while complete and linear dialogues are prevalent in both cases, there are differences further down. In particular, incomplete and nonlinear dialogues rank second for the baseline system, while complete and nonlinear dialogues rank second for the advanced system.
Reasons for incompleteness of dialogues in the advanced and baseline systems
An additional message to take out of this is that
Last but not least, we look at the dialogue lengths. The boxplots depicting the results obtained from the advanced and baseline systems can all be found in the appendix. They appear to indicate that a relationship between dialogue lengths and belief change, if it exists, is highly likely to be non-monotonic. We have therefore decided to use Fisher’s exact test of independence to see whether belief change and dialogue lengths are linked in some way. As visible in Table 6, independently of the system, length and dialogue type, all of the
P-values of Fisher’s exact test of independence between belief change and dialogue lengths in discussions carried out by the baseline and the advanced systems. We distinguish two dialogue lengths, one understood as the number of turns that took place and one as the number of arguments exchanged. For the purpose of this analysis, belief change values have been discretized into the following intervals:
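A coarse sketch of such an independence test is given below, binarizing both variables since SciPy’s `fisher_exact` handles only 2×2 tables (the study used finer-grained intervals and was run in R, whose `fisher.test` handles larger tables); the counts here are made up purely for illustration:

```python
from scipy.stats import fisher_exact

def length_vs_change_independence(table_2x2):
    """Fisher's exact test of independence on a 2x2 contingency table;
    a large p-value fails to reject independence of belief change and
    dialogue length."""
    _, p_value = fisher_exact(table_2x2)
    return p_value

# Rows: short / long dialogues; columns: positive / non-positive change.
example_table = [[12, 9],
                 [10, 11]]
```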
A natural next step is to compare the performance of the advanced and baseline systems, by which we understand the belief changes they lead to.
The scatter plots for the before-after beliefs in persuasion goal for each of our APSs are visible in Fig. 11. Please note that a “perfect” system, managing to convince the participant to radically change their stance on the persuasion goal, would be one producing an “after” belief of 3 for any participant with a negative “before” belief, and an “after” belief of −3 for any participant with a positive “before” belief (i.e. the perfect scatter plot would not be diagonal).

Scatter plots of the before and after beliefs for participants that entered a dialogue with the advanced or with the baseline system.
We supplement the scatter plots with additional statistical analysis of the belief change. We can observe that the distributions of beliefs before the dialogues in the advanced and baseline systems are not statistically different, meaning that the two populations are not dissimilar belief-wise for the two systems. The same holds for the distributions of beliefs after the dialogues have taken place, independently of the subdistribution chosen according to a particular structural condition (assuming the sample was large enough to carry out the tests). We also note that the before and after belief distributions per system are also not dissimilar. We highlight that this result speaks only about the distributions, not the effects of the dialogues with the systems. The Wilcoxon rank-sum test was used for establishing these results, and detailed statistics can be found in Tables 11 and 12 in the appendix at the end of this paper.
However, the effects that these systems had on the users are not the same, as visible in Table 7. We used the Shapiro–Wilk test to determine whether the “before” beliefs of the participants were normally distributed. If the answer was positive, Student’s t-test was used to determine whether the changes in beliefs were significant; if the answer was negative, the Wilcoxon signed-rank test was used.12 All calculations were carried out in R, and detailed statistics can be found in the appendix at the end of this paper.
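The test-selection procedure can be sketched as follows (the study ran the equivalent analysis in R; the significance level gating the normality test here is our assumption):

```python
from scipy.stats import shapiro, ttest_rel, wilcoxon

def change_significance(before, after, alpha=0.05):
    """Sketch of the procedure described in the text: the Shapiro-Wilk
    test on the 'before' beliefs selects between a paired t-test and the
    Wilcoxon signed-rank test; returns the p-value of the chosen test."""
    _, p_norm = shapiro(before)
    if p_norm > alpha:                 # normality not rejected
        _, p = ttest_rel(before, after)
    else:
        _, p = wilcoxon(before, after)
    return p
```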
We observe that for the baseline system, independently of the considered subclass of dialogues, the changes in beliefs were either not significant or their significance could not be determined. In turn, the advanced system led to statistically significant changes in beliefs in most cases where significance could be determined.
Results of analysis of statistical significance of belief changes caused by the APSs on different types of dialogues. Shapiro–Wilk test was used to determine whether the “before” beliefs were normally distributed. If they were, t-test was used to determine significance of belief changes; otherwise, Wilcoxon signed-rank test was used. By “−” we understand that due to the nature of the data, exact
Despite these positive results, the actual differences in belief changes between the two systems are rather modest when we look at the numbers. Table 8 shows the average changes in beliefs for the two systems. We distinguish between the normal average, which simply takes the mean of all differences in beliefs (which can be negative or positive depending on the persuasion goal), and the absolute average, which ignores whether the change is negative or positive and focuses on the magnitude only. Additionally, in Tables 9 and 10 we include the percentage distribution of the changes.
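The two summary statistics can be sketched as:

```python
def average_changes(deltas):
    """The two summaries described in the text: the plain mean of the
    belief changes, and the mean of their absolute values."""
    n = len(deltas)
    return sum(deltas) / n, sum(abs(d) for d in deltas) / n
```

Changes of opposite sign cancel in the plain mean but not in the absolute one: `average_changes([0.5, -0.5])` gives `(0.0, 0.5)`.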
Average changes in beliefs in dialogues with the advanced and baseline systems. A change in beliefs can be negative or positive depending on the persuasion goal. The average change takes the mean of all the obtained differences; the absolute average ignores whether the change is negative or positive and focuses on its magnitude. Results are rounded to the third decimal place
Belief change analysis of the dialogues with the advanced system. The results are rounded to two decimal places. Second row shows the interval to which a belief change value should belong to be interpreted as very positive (
Belief change analysis of the dialogues with the baseline system. The results are rounded to two decimal places. Second row shows the interval to which a belief change value should belong to be interpreted as very positive (
An important thing to notice here is the behaviour of the participants in the complete dialogues. The arguments uttered by the APS in these scenarios correspond to the grounded/preferred/stable extensions of the graphs associated with the dialogues. Predictions using classical Dung semantics would put both of our APSs in a strongly winning position, while we would argue that this is not the case here.
Despite these properties of the complete dialogues, we achieve no statistical significance of the changes in beliefs for the baseline system (see Table 7). This important threshold is only passed by the advanced system, which, in contrast to the baseline APS, attempts to tailor the dialogue to the profile of the user. This indicates that relying only on the structure of the graph for selecting arguments in dialogues is insufficient. Presenting just any counterargument to a participant’s argument turned out to be ineffective in the context of this experiment. In contrast, including beliefs and concerns as factors in the selection of counterarguments proved to have statistically significant effects.
Nevertheless, we still need to acknowledge that, in pure numbers, the results are modest. The average changes in beliefs require improvement. While 55.77% of the users that engaged with the advanced system did experience positive changes, this still leaves a significant proportion of participants that were affected negatively or not affected at all. Thus, there is a need to increase both the number of participants experiencing positive changes and the degree of these changes.
We therefore believe that further research in this direction needs to be undertaken. The results require improvement and/or replication; there may also exist other additional factors, besides their beliefs and concerns, that would allow for dialogues to be better tailored for participants.
In general, the changes in beliefs (good direction, bad direction, etc.) are quite similar between the baseline and advanced systems. The averages are also relatively close. Yet, only in the advanced system do we get a statistically significant change in users’ beliefs. Even if we focus on the simpler dialogues (i.e. those that are complete, which are all dialogues that a classical argumentation system would create), the results are similar. Yet again, only the advanced system is significant. So the results suggest that the advanced system is better at changing belief in favour of the persuasion goal than the baseline system.
Our claim at the start of this study was that a dialogue needs to be tailored to the user, otherwise it is less effective than it could be. The lack of significance in the baseline system supports that, as no tailoring was taking place. The advanced system was doing some tailoring, and we obtain significance. Nevertheless, the end gain is not as marked as we might hope for. We get a little less than a 5% increase in the proportion of the population with positive changes, and just over a 1% increase in negative changes. Despite this, we need to remember that it is widely acknowledged that convincing anyone of anything (when they are allowed to have their own opinions, as opposed to a logical reasoning exercise) is a difficult task. So developing a system that is going to persuade the majority of participants on a real and controversial topic such as university fees is not very likely. Therefore, even these small improvements of the advanced system over the baseline system are valuable, and indicate that further research into dialogue tailoring approaches is promising.
Throughout the paper we have been clarifying what assumptions we have made about the systems and methods used in the experiments. Therefore, in the remainder of this subsection, we collate and discuss these assumptions, and how they impact the conclusions we can draw from the experiments.
The argument graphs were edited by the research team. While the arguments and attacks between them were not tested with participants, it has been shown that a group deliberation approach improves performance in argumentation tasks [27]. Nevertheless, we acknowledge that with natural arguments, there is a degree of subjectivity in whether one argument attacks another. Moreover, both systems use the same overlapping graphs, and so the effect is to some degree the same for both systems.
We acknowledge that there are certain restrictions in how the system can react to the user. We assume that each dialogue is started by the system, and that the system is required to respond to the first move by the user. Without this, there would not be meaningful engagement by the user, and no opportunity for belief change. Additionally, when there are more than two active dialogue lines, the number of arguments can be quite large. This can further lead to users being overwhelmed and disengaging; we have therefore limited the amount of information presented to the user at any given time. We made an exception for the reply to the user’s first move because the user is more likely to understand and appreciate the system’s response to each of the user’s arguments. It would be desirable to investigate alternative definitions in future work.
In order to model how propagating the update of beliefs may be influenced by proximity of the original update to the other arguments, we used a dampening factor (as suggested in [17]) which causes the effect of an attacker to decrease as the length of the chain of arguments increases. Given that the existing theoretical proposals for such approaches follow design patterns that are highly undesirable in our setting (please see Section 5.1.2), the strategic system uses a different method that is more suited for epistemic probabilities. Nonetheless, we acknowledge that none of the existing dampening factor proposals have been empirically verified. It would be desirable to consider this in future work, and equip the strategic system with alternative approaches.
We note that even though we use both beliefs and concerns in our strategic system, there are certain differences in how they are treated. The concerns that are associated with each argument have been determined during the pre-dialogue experiments, and we do not ask users for this information in studies that require engaging with our APSs. In a similar fashion, the preferences that each user can have over these concerns are predicted from data obtained using another pre-dialogue experiment, and not directly obtained from each APS user. The purpose of this was to lower the risk of exhausting the users and causing them to disengage. Nevertheless, we acknowledge that we did not check for this domain whether the inferred preference is used by the participants when moving an argument. We have done this check in other domains (commuting by bicycle [52] and improving health by doing more sport [32]), and so we felt it was reasonable to extrapolate from those studies for the purposes of setting up this study. Finally, in the user model, there was the assumption that the preferences over concerns are static throughout the dialogue. The dialogues were relatively short and did not contain any arguments aimed at changing the concerns of a given user, thus allowing us to assume that the concerns have remained static. We leave it to future work to consider how we could incorporate changing of concerns.
In contrast to the above, the beliefs the users have in arguments have been assumed to be dynamic. Furthermore, the belief in the persuasion goal of each dialogue was directly obtained from the user before and after the discussion. Nevertheless, the beliefs in the remaining arguments are computed using beta distributions obtained from pre-dialogue experiments, and not obtained directly. It would be possible to personalize the beliefs of a participant further; however, asking the users for their fine-grained beliefs at each stage runs the risk of the users being overwhelmed and disengaging, while more advanced methods (see, for example, modelling sub-populations using beta distributions [51]) would increase the computational complexity of the system and the waiting time between responses. We therefore leave further investigations of these alternatives to future work.
In the experiments, we only compared a baseline system with a strategic system based on both beliefs and concerns. However, to better understand the proposal, it would be desirable to undertake trials with a strategic system that uses only beliefs and one that uses only concerns.
Literature review
Since the original proposals for formalizing dialogical argumentation [47,74], a number of proposals have been made for various kinds of protocol.
Some strategies have focussed on correctness with respect to argumentation undertaken directly with the knowledge base; in other words, whether the argument graph constructed from the knowledge base yields the same acceptable arguments as those from the dialogue.
In [15], a planning system is used by the persuader to optimize choice of arguments based on belief in premises, and in [16], an automated planning approach is used for persuasion that accounts for the uncertainty of the proponent’s model of the opponent by finding strategies that have a certain probability of guaranteed success no matter which arguments the opponent chooses to assert. Alternatively, heuristic techniques can be used to search the space of possible dialogues [75]. Persuasion strategies can also be based on convincing participants according to what arguments they accept given their view of the structure of an argument graph [76]. As well as trying to maximize the chances that a dialogue is won according to some dialectical criterion, a strategy can aim to minimize the number of moves made [2]. The application of machine learning is another promising approach to developing more sophisticated strategies such as the use of reinforcement learning [5,11,54,71,73,92,94,98] and transfer learning [93].
There are some proposals for strategies using probability theory to, for instance, select a move based on what an agent believes the other is aware of [101], or, to approximately predict the argument an opponent might put forward based on data about the moves made by the opponent in previous dialogues [60]. Using the constellations approach to probabilistic argumentation, a decision-theoretic lottery can be constructed for each possible move [62]. Other works represent the problem as a probabilistic finite state machine with a restricted protocol [65], and generalize it to POMDPs when there is uncertainty on the internal state of the opponent [48]. POMDPs are in a sense more powerful than the MCTS that we advocate in our proposal. However, as discussed in [48], there is a challenge in managing the state explosion in POMDPs that arises from modelling opponents in argumentation.
A novel feature of the protocol in this study is that it allows a form of incompleteness: not every user argument has to be countered for the dialogue to continue. The protocol ensures that, for each user argument presented at each step of the dialogue, if the system has a counterargument that has not yet been presented in the dialogue, then it will present that counterargument. However, if the system has no counterargument for one user argument, but does have one for another user argument played at that step, then the dialogue can still continue. The aim of this tolerance to incompleteness is to reflect how real-life discussions do not require every argument to be countered. This means that discussions can usefully continue, and they do not depend on a participant being able to counter every possible argument from the other participants. Our protocol is thus in contrast to protocols in other approaches to computational argumentation where the aim is for each participant to counter all the arguments of the other agent.
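The tolerance to incompleteness described above can be sketched as follows. The data structures and function names here are hypothetical, intended only to show the rule: each user argument is countered when a fresh counterargument exists, and the step succeeds as long as at least one counter is played.

```python
def system_reply(user_args, counters, played):
    """For each user argument at this step, present an unplayed
    counterargument if one exists; the dialogue continues as long
    as at least one user argument can be countered.

    counters maps each argument to a list of its counterarguments;
    played is the set of arguments already asserted in the dialogue.
    (Illustrative names, not the paper's implementation.)
    """
    replies = []
    for a in user_args:
        fresh = [c for c in counters.get(a, []) if c not in played]
        if fresh:
            replies.append(fresh[0])  # counter this argument
        # no fresh counter available: tolerate it and move on
    return replies  # empty list => dialogue ends at this step
```

A stricter protocol would instead end the dialogue as soon as any single user argument could not be countered.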
In our previous work [52], we developed an APS that selects arguments based on the concerns of the participant in the current dialogue. For this, we assumed that we have a set of arguments, and that each argument is labelled with the type(s) of concern it has an impact on. Furthermore, the system had a model of the user in the form of a preference relation over the types of concern. We did not assume any structure for the preference relation; in particular, we did not assume it is transitive.
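Since the preference relation over concerns need not be transitive, one cannot simply sort candidate arguments by it. A minimal sketch of one way to select among candidates is a Copeland-style count of pairwise preference "wins"; this selection rule is an illustrative assumption, not the exact mechanism of [52].

```python
def best_by_concerns(candidates, concern_of, prefers):
    """Pick the candidate argument whose associated concern wins the
    most pairwise preference comparisons against the concerns of the
    other candidates. prefers(c1, c2) is the user's (possibly
    intransitive) preference relation over concern types.
    (Illustrative, not the paper's exact selection rule.)"""
    def wins(arg):
        c = concern_of[arg]
        return sum(prefers(c, concern_of[other])
                   for other in candidates if other != arg)
    return max(candidates, key=wins)
```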
Conceptually, concerns can be seen as related to value-based argumentation. This approach takes the values of the audience into account, where a value is typically seen as a moral or ethical principle that is promoted by an argument. It can also be used to capture the general goals of an agent, as discussed in [9]. A value-based argumentation framework (VAF) extends an abstract argumentation framework by assigning a value to each argument, and, for each type of audience, a preference relation over values. This preference relation can then be used to give a preference ordering over arguments [9,10,12–14]. The preference ordering is used to ignore an attack relationship when the attackee is preferred to the attacker for that member of the audience. This means the extensions obtained can vary according to who the audience is. VAFs have been used in a dialogical setting to make strategic choices of move [19]. So theoretically, VAFs could take concerns into account, but they would be unable to model beliefs. There is also no decision-theoretic framework for this, nor is there an empirical evaluation of VAFs.
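The audience-relative defeat relation of a VAF can be stated compactly. The sketch below follows the standard definition: an attack succeeds unless the audience prefers the value promoted by the attacked argument over the value promoted by the attacker (function and variable names are ours).

```python
def defeats(attacker, attackee, value_of, prefers):
    """In a value-based argumentation framework, an attack succeeds
    for an audience unless that audience prefers the value promoted
    by the attackee over the value promoted by the attacker.
    prefers(v1, v2) holds when the audience prefers v1 over v2."""
    return not prefers(value_of[attackee], value_of[attacker])
```

Because `prefers` varies per audience, the same attack can succeed for one audience and fail for another, which is why VAF extensions are audience-dependent.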
More recently, the use of values has been proposed for labelling arguments that have been obtained by crowdsourcing. Here a value is a category of motivation that is important in the life of the agent.
The notion of interests as arising in negotiation is also related to concerns. In psychological studies of negotiation, it has been shown that it is advantageous for a participant to determine which goals of the other participants are fixed and which are flexible [41]. In [100], this idea was developed into an argument-based approach to negotiation where meta-information about each agent’s underlying goals can help improve the negotiation process. Argumentation has been used in another approach to co-operative problem solving where intentions are exchanged between agents as part of dialogue involving both persuasion and negotiation [37]. Even though the notions of interests and intentions are used in a different way to the way we use the notion of concerns in this paper, it would be worthwhile investigating the relationship between these concepts in future work.
The empirical approach taken in this paper is part of a trend in the field of computational argumentation for studies with participants (for a review see [26]). This includes studies that evaluate the accuracy of dialectical semantics of abstract argumentation for predicting the behaviour of participants in evaluating arguments [36,97], studies comparing a confrontational approach to argumentation with argumentation based on appeal to friends, appeal to group, or appeal to fun [110,111], studies of the appropriateness of probabilistic argumentation for modelling aspects of human argumentation [86], studies investigating physiological responses to argumentation [109], studies using reinforcement learning for persuasion [54], and studies of the use of predictive models of an opponent in argumentation to make strategic choices of move by the proponent [93]. There have also been studies in psycholinguistics to investigate the effect of argumentation style on persuasiveness [72].
There have already been some promising studies that indicate the potential of using automated dialogues in behaviour change such as using dialogue games for health promotion [28,42,44,45], conversational agents for encouraging exercise [25,82] and for promoting plant-based diets [113], dialogue management for persuasion [6], persuasion techniques for healthy eating messages [107], and tailored assistive living systems for encouraging exercise [43]. However, none of these studies have provided a framework for strategic argumentation, in contrast to the proposal we present in this paper.
Discussion
In this paper, we have presented a framework for user modelling that incorporates the beliefs and concerns of persuadees, as well as introduced a framework for optimizing the choice of moves in dialogical argumentation by taking into account these user models. We have shown how we can crowdsource the data required for constructing the user models, and that this can be used by APSs to make strategic choices of move that outperform a baseline system over a population of participants.
This study therefore indicates the value of taking the beliefs and concerns of agents into account. Furthermore, it indicates the viability and utility of making real-time decisions on moves based on the Monte Carlo Tree Search algorithm. The way we have harnessed this algorithm is quite general, and alternative options for the reward function could be deployed. For instance, the belief component of the reward function could be replaced by different methods of modelling belief updates (e.g. [57,68]), or even by a richer modelling of user beliefs and of how they are updated (e.g. [58,59]). One can also consider different concern scoring approaches, where rather than using preferences, we focus on how the concerns associated with the played arguments are matched between the system and user moves, or investigate different ways of balancing quantity and quality. Finally, the way these two score values are aggregated can also be modified. In future work, we will investigate some of these options to get a more comprehensive understanding of the effectiveness and behaviour of different reward functions.
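Two simple candidates for the final aggregation step are sketched below; both are illustrative possibilities rather than the paper's implemented reward function. The linear form trades the components off against each other, while the multiplicative form requires a move to score well on both dimensions.

```python
def reward_linear(belief, concern, w=0.5):
    """Convex combination of the belief score and the concern score;
    w in [0, 1] trades off belief change against concern match.
    (A hypothetical aggregation; the paper leaves alternatives open.)"""
    return w * belief + (1 - w) * concern

def reward_product(belief, concern):
    """Multiplicative aggregation: a dialogue must score well on both
    dimensions to obtain a high reward."""
    return belief * concern
```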
In the user model, there was the assumption that the preferences over concerns are static throughout the dialogue. As the dialogues were relatively short and did not contain any arguments aimed at changing the concerns of a given user, this assumption was reasonable here. Dynamic concerns are nonetheless a very interesting topic. They raise questions about what kinds of move could possibly change a user's preferences over concerns, and how we could model this. Investigating this would complicate how we make globally optimal decisions over moves, and it would require a different study design for the experiments with participants.
Another topic for future work is the specification of the protocol. Many protocols for dialogical argumentation involve a depth-first approach (e.g., [19]). So when one agent presents an argument, the other agent may provide a counterargument, and then the first agent may provide a counter-counterargument. In this way, a depth-first search of the argument graph is undertaken. With the aim of having more natural dialogues, we used a breadth-first approach. So when a user selects arguments from the menu, the system may then attack more than one of the arguments selected. For the argument graph we used in our study, this appeared to work well. However, for larger argument graphs, a breadth-first approach could also be unnatural. This raises the questions of how to specify a protocol that interleaves depth-first and breadth-first approaches, and of how to undertake studies with participants to evaluate such protocols. Other possibilities for improving the naturalness of the dialogue include considering chains of arguments [99].
The aim of our dialogues in this paper is to raise the belief in goal arguments. A goal argument may, among other things, incorporate an intention to change behaviour, though we accept that there is a difference between having an intention to do something and actually doing it. Nonetheless, having an intention to change behaviour is a valuable step towards actually changing behaviour. We focus on the beliefs in arguments because belief is an important aspect of the persuasiveness of an argument (see for example [56]). Furthermore, beliefs can be measured more easily than intentions in crowdsourced surveys. In future work, we would like to investigate to what extent an increased belief in the persuasion goal translates into actual changes in behaviour. This would be interesting to investigate in healthcare applications, such as persuading participants to undertake regular exercise or to reduce alcohol intake.
We also need to investigate new ways of asking for the beliefs in the arguments. Currently, we do it through a direct question. Unfortunately, this method is vulnerable to participants who misunderstand the instructions, select values carelessly, or lie on purpose. Therefore, we need to create a new, indirect way of asking for the belief. A possible approach is to develop several simpler, indicative questions.
Appendix
Results of the analysis of statistical significance of belief changes caused by the APSs on different types of dialogues. The Shapiro–Wilk test was used to determine whether the "before" beliefs were normally distributed. If they were, the t-test was used to determine the significance of belief changes; otherwise, the Wilcoxon signed-rank test was used. By "−" we understand that, due to the nature of the data, exact values could not be obtained.

| Dialogue type | Advanced: Normality | Advanced: Significance | Baseline: Normality | Baseline: Significance |
| --- | --- | --- | --- | --- |
| All | | | | |
| Keeping Graph | | | | |
| Abolishing Graph | | | | |
| Complete | | | | |
| Incomplete | | − | | − |
| Linear | | | | |
| Nonlinear | | − | | − |
Acknowledgements
This research was funded by EPSRC Project EP/N008294/1 Framework for Computational Persuasion. We thank Dr Andreas Artemiou for his valuable assistance. We also thank the reviewers for their valuable comments for improving the paper.
