Abstract
Drawing on the “Voight-Kampff Empathy Test,” a science fiction version of Turing’s famous thought experiment, we propose the Conversational Action Test (CAT): a new approach to evaluating conversational artificial intelligence (AI) voice agents. We compare social actions in a range of telephone service encounters where one party is an artificial conversational agent to a range of similar human-human calls. The CAT demonstrates a novel paradigm that addresses long-standing theoretical and methodological problems for ostensible “tests” of conversational AI by (a) revealing the conceptual confusion of attempting to “detect” an AI in routine service interactions and (b) focusing, instead, on the situated interactional practices through which an AI “passes” for human. We discuss the implications of the CAT for the design and evaluation of conversational AI, and for the notion of “humanness” as a goal or benchmark for such systems. Data include publicly available human/AI service calls and comparable human-human calls in British and American English.
Keywords
Introduction
In Ridley Scott’s (1982) film Blade Runner, the “Voight-Kampff Empathy Test” distinguishes androids from humans by monitoring the subject’s biometric responses while the examiner describes a series of grotesque scenes. This interpretation of Alan Turing’s (1950) thought experiment, fictionalised by Philip K. Dick, imagines a future of ubiquitous “strong deception” in human-machine communication (Natale, 2023), in which it has become otherwise impossible to tell them apart. By 2019, the year in which the film’s events are set, Google’s conversational artificial intelligence (AI) agent Duplex (Leviathan and Matias, 2018) was able to mimic human callers well enough to make booking calls to real restaurants and salons, with the artificial agent apparently passing as human “in the wild” at its product launch demonstration. Duplex has since been withdrawn amid questions about the ethics of its mimicry (O’Leary, 2019), its efficacy (Bonifacic, 2022), and, ironically, about the authenticity of its demonstration calls (Natale, 2021). Once Duplex was publicly deployed, with its automated agents beginning encounters with the preface: “Hi I’m Google’s automated booking service” (Dwoskin, 2019), businesses apparently started ignoring Duplex’s “spam calls” (Garun, 2019). This suggests that the functionality of these systems may hinge on the ability to pass as human. Though Duplex was discontinued, AI call centers now offer similar services. The “deceptive AI ecosystem” (Zhan et al., 2023) that these systems now inhabit, enhanced by Large Language Models (LLMs), further enables AI agents to navigate a range of conversational situations. Given the challenges of detecting AI-generated text (Else, 2023; Liang et al., 2023) and much-vaunted claims that LLM technologies now “pass the Turing Test” (Adams, 2024), there are increasingly urgent calls for telephonic equivalents of the “Voight-Kampff” test (e.g. Shen et al., 2024).
In this article, however, we start by reconsidering what it means, in practical, interactional terms, for an AI to “pass” as human in the context of a routine service call. Natale (2023: 92–123) suggests that the “Eliza effect,” named after Weizenbaum’s 1960s ELIZA psychotherapist bot, not only biases us to ascribe agency to even the simplest bots, but also constructs a mediagenic narrative about the boundaries between humans and machines. Should we be developing tests for Voight-Kampff-like behavioral “tells” to disambiguate humans from AI? Or does the very concept of a test for humanness uphold a flawed narrative about human authenticity and sociality that, as in Blade Runner, dehumanizes both tester and subject? Here we rethink the notion of such a test in relation to sixty years of research in conversation analysis (CA). We contribute to an emerging approach to “conversational AI” that looks beyond common interpretations of the Turing Test as either a deceptive “imitation game” or as an operationalization of machine “intelligence” (French, 2000) by analyzing, in detail, how routine social actions involving such machines are accomplished interactionally (Liesenfeld and Dingemanse, 2024; Porcheron et al., 2018).
We start from Garfinkel’s (1967: 157) ethnomethodological conceptualization of “passing” as the “work of achieving the ascribed status” of, in this case, a human interlocutor. Garfinkel’s (1967: 118–185) famous case study shows how Agnes, a transgender woman whose gender is under “chronic threat or open contradiction,” uses various situated “passing devices” to protect her gender identity across a range of everyday and institutional interactions. Agnes’ passing devices include euphemism, feigned ignorance, and other contingent strategies to “avoid any tests she thought she might fail” in everyday “passing occasions.” The key point that Garfinkel (1967: 180) makes is that Agnes is a “practical methodologist” of “natural, normal female” social life whose practices do far more than conform to a set of dualistic gendered norms or suppress a fixed catalog of “tells.” Indeed, binary gender “tests” based on definitional characteristics that ignore the situated performativity of gender can result in acts of misgendering (Pino and Edmonds, 2024) that can include persecuting cisgender people as trans (Joubin, 2024). Instead, Agnes learns to recognize and manipulate the “unavoidable, unnoticed texture of relevances” that embed “appearances-of-normal-sexuality” (Garfinkel, 1967: 183) in daily life.
This notion of “passing” presents a radically different challenge to common interpretations of “passing the Turing Test.” It neither aims to ascribe intelligence to machines nor does it, like the fictional Voight-Kampff Test of the eponymous bounty hunters in Blade Runner, aim to place suspects into untroubled categories of either “AI” (Suchman, 2023) or “human.” Instead, in this article, we explore the practical and narrative potential of a “Conversational Action Test” (CAT) that examines the interactional work required to achieve conversational participation as constituted in specific, situated, interactional environments. Here the unit of analysis is not the “person” or the “AI.” Rather, we focus on the mundane, interactional “passing occasions” within routine service calls, where callers and call-takers encounter one another within the limited roles and tasks involved in, for example, making a booking or enquiring about prices. In such highly constrained environments, “passing” as human is hardly the central challenge. Indeed, where we encounter an artificial agent “unannounced” as such, passing as human may still, though perhaps not for much longer, depend more on the basic assumptions or “trust conditions” that underpin a sequentially structured social interaction than on technical sophistication (Ivarsson and Lindwall, 2023; Relieu et al., 2020; Turowetz and Rawls, 2021). Participants may reasonably assume they are talking to a human simply by answering the phone and falling into the pervasive, mutual accountability of social interaction (Coulter, 1979). Our analysis, then, explores the interactional details of human-human service calls (e.g. to a doctor’s surgery or a veterinary practice, or a university contact center), alongside a range of similarly task-constrained service calls performed by an AI conversational agent to human call-takers.
While we can categorize these calls, a priori, as “human-human,” or “human-AI,” such categorizations are neither the starting point nor the end goal of our analysis. Instead, we start with “detailed, concrete observations and descriptions of organizationally achieved social phenomena” in a routine service call (Garfinkel, 2021: 19; see also Eisenmann and Lynch, 2021). Turowetz and Rawls (2021) argue that Garfinkel’s focus on the lifeworld of marginalized identities with “at best, unstable routinization” (Garfinkel, 1967: 179) allows us to study the practical ethno-methods that members in human sociality use to “pass” or avoid contingent “tests we might fail.” Examining a range of service calls where at least one caller, as Suchman (2023: 4) puts it, “travel[s] under the sign of AI,” provides a rich opportunity for analytic observation. This approach also suggests a novel paradigm for developing evaluative tests of conversational AI based on empirical analysis of the “passing occasions” constituted through social situations.
Why test “interaction” rather than “intelligence”?
Interaction is far more explanatory and generative as an empirical material than reductive tests of ostensible intelligence. Most varieties of “Turing Test” use human judges to evaluate machine responses to text-based question-answer sequences as an operational test of “human-level intelligence” (Loebner, 2009), but often overlook the empirical material of interaction itself. Conversation analysts, by contrast, treat interactional resources and practices as their fundamental objects of study. CA has often studied the kinds of standardized question-answer sequences used in Turing Tests in a range of interactional settings. Such question/answer sequences usually structure common “interview activity types” (Levinson, 1979) that place routine, situated, interactional constraints on turn-by-turn talk. These patterns organize how participants solicit and produce accounts (Carlin, 2006; Potter and Hepburn, 2012), and an interactional perspective can explain how (not just that) such tests are “passed.” For example, Weizenbaum’s famous ELIZA bot exploits the interactional constraints of question-answer sequences by reversing pronoun pairs from “your” to “my” in each turn (Wallace, 2009). Critics who decry this kind of passing as algorithmic “trickery” rather than an ostensible test of “AI” (Harnad, 1992; Kurzweil and Kapor, 2009) often suggest making the test harder by, say, extending its length or topical range. However, this overall approach fundamentally treats “intelligence,” operationalized by interaction, as somehow separable from the interactional structures and practices on which the test itself relies, risking “losing the phenomenon” (Eisenmann and Lynch, 2021) entirely.
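The pronoun reversal attributed to ELIZA above can be illustrated with a minimal sketch. The word table, function names, and response template below are hypothetical and are not taken from Weizenbaum's script, which was considerably richer; the point is only that a fixed substitution over the prior turn can yield a sequentially fitted next turn.

```python
import re

# Hypothetical, deliberately tiny reflection table (an assumption for
# illustration; not Weizenbaum's actual script).
REFLECTIONS = {
    "i": "you",
    "me": "you",
    "my": "your",
    "am": "are",
    "you": "I",
    "your": "my",
}


def reflect(utterance: str) -> str:
    """Swap first- and second-person forms, word by word."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(utterance: str) -> str:
    """Return the prior turn to its speaker as a question."""
    return f"Why do you say {reflect(utterance)}?"


print(respond("My mother hates me"))
```

Nothing in this sketch models understanding; whatever coherence the exchange has is supplied by the question-answer format and by the human recipient.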
By contrast, Collins (2018: 50–51) argues that a well-designed test should focus on the “quintessentially human activity” of repair: the ways participants deal with “problems of speaking, hearing and understanding” as they occur within social interaction (Jefferson, 1987; Schegloff et al., 1977: 361). Repair operates as a naturalistic, endogenous, “test” of mutual understanding by enabling coordinated joint action (Albert and de Ruiter, 2018). Contrast this with exogenous “tests” where human judges decide, post hoc, whether the participants’ responses to test questions have matched their assumptions about “human intelligence.” Given the universal availability of repair across languages and cultures (Dingemanse and Enfield, 2024), we can track, monitor, and re-establish mutual ongoing intersubjectivity in interaction if or when it seems to be breaking down. For example, one can initiate repair at any time by flagging a “trouble source” and can enact repair by providing a “trouble solution” before progressing the interaction. The speaker of the trouble source (“self”) and a recipient (“other”) can use a four-way matrix of repair actions that are “self-initiated self-repair,” “self-initiated other-repair,” “other-initiated self-repair,” and “other-initiated other-repair.” Repair thus functions as an infrastructure for intersubjectivity (Schegloff, 1992) between “self” and “other” because each party can initiate and resolve repair at any time. Rather than defining an operational test for the intelligence or subjectivity of one party to an interaction, repair endogenously constitutes each party’s subjectivity as a special case of intersubjectivity through interaction.
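The four-way matrix of repair actions described above can be written out as a small data structure. This is only a sketch: the names `Party`, `repair_type`, and `REPAIR_MATRIX` are illustrative assumptions and do not correspond to any established CA coding tool.

```python
from enum import Enum
from itertools import product


class Party(Enum):
    SELF = "self"    # speaker of the trouble-source turn
    OTHER = "other"  # recipient of the trouble-source turn


def repair_type(initiator: Party, repairer: Party) -> str:
    """Name a repair operation by who flags the trouble and who resolves it."""
    return f"{initiator.value}-initiated {repairer.value}-repair"


# All four combinations of initiator and repairer.
REPAIR_MATRIX = [repair_type(i, r) for i, r in product(Party, Party)]

for label in REPAIR_MATRIX:
    print(label)
```

The matrix is symmetrical with respect to the two parties, which is one way of seeing why repair is available to either participant at any point in the interaction.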
Similarly, the embodied interactional order is often overlooked in operational tests of machine intelligence, and in computational linguistics more broadly. As Goodwin and Heritage (1990) point out in a discussion of Chomsky’s (2002) disregard of linguistic “performance,” informational theories of communication that exclude the “noisy” data of talk cannot deal with how language is used interactionally. Thus, Natural Language Processing (NLP) technologies tend to treat repair, disfluencies, hesitations, glottal cut-offs and other “miscommunication phenomena” as informational noise by filtering them out (Healey et al., 2018). Such embodied interactional resources are, therefore, mostly ignored (Purver et al., 2018), despite their fundamental importance for recognizing, forming, and ascribing social actions (Levinson, 2013). As Pütz and Esposito (2024) demonstrate in their study of interactions with LLM-based chatbots, where repair does occur, it is the humans that do most of the interactional work. In summary, rather than operationalizing tests for “artificial intelligence” through post hoc human judgments about interaction, the CAT proposes examining conversation itself as a material and locus for the observable, endogenously analyzable “embodiment of human sociality” (Schegloff, 2015).
Why a CAT? And what should it test for?
The structural organization of social action is remarkably stable over time and between settings when compared to the situated contingencies of language and meaning (Heritage, 2008). A CAT, then, might draw on the way CA studies social action in specific settings as constituted by sequences of “turn constructional units” (TCUs) that build and progress courses of action (e.g. requests, offers, invitations), where any single action can be achieved via multiple grammatical formats. For example, “requesting” may be achieved by interrogatives (e.g. “can I”; “do you”; “would you”) in some situations, but also by declaratives (e.g. “that cake looks good”) or narrative descriptions (e.g. “I’ve been getting terrible headaches lately”) in others. Such actions are also often defeasibly and tacitly embedded within “pre-sequences” such as “my car is stalled” produced as a precursor for a request for a lift (see Stokoe et al., 2024), or produced through embodied resources such as gaze, head orientation, or gesture (e.g. a “can I have the bill” gesture in a restaurant). In all cases, it is the action—the offer or request—rather than the specific words or practices that implement the action that is consequential for what happens next (e.g. an acceptance, granting, or rejection). Our selection between—and recognition of—one another’s choices between methods for initiating and responding to social actions are what constitutes the situated specificity of human sociality (Goodwin, 2000). In this sense, social action is central to human sociality and could motivate our tests and evaluations of conversational technology (Liesenfeld and Dingemanse, 2024) in terms of situational constitutiveness; that is, the “realness” or “artificiality” of the sociality they achieve.
This approach stands in stark contrast with methods of automatic NLP, where social action is conceptualized as abstract “user intent,” rather than concretely constituted through turns and sequences of social interaction (Albert et al., 2019). Even state-of-the-art LLMs cannot reliably address the long-standing “pragmatics problem” (Cummins and De Ruiter, 2014) of mapping between words and social functions (Stokoe et al., 2024). NLP systems that model the regularities of semantic and lexical features still focus on language, rather than action (Housley et al., 2019), missing out on the pragmatic context that shapes the relevance of any utterance. By “context,” here we refer to the turn-by-turn construction of the prospective and retrospective interpretability of actions and utterances rather than to a generic “bucket theory” of psychological or cultural context (Goodwin and Heritage, 1990). While technologists acknowledge that “context matters” for the sense of any utterance (e.g. Pearl, 2016), it is also often presumed that a task or setting (e.g. a specific type of service call) supplies “context” as a fixed variable (Stokoe et al., 2021; Stokoe and Richardson, 2023). Pragmatic context, on the other hand, is dynamically constructed by local modifications of, say, the organization of turn-taking (see Albert et al., 2019), multi-unit turn design (see Relieu, 2024), or patterns of non-lexical vocalizations, disfluencies, and hesitations (Lopez et al., 2022), and these practices are CA’s central object of study.
A CAT of Google Duplex
Here we use CA to examine an instance of what Natale and Depounti (2024) describe as a “banal deception”: Google Duplex. At its launch, journalists enthusiastically described how this telephone reservation and inquiry-making bot used “pauses and ‘ums’ to mimic a human” (Chen and Metz, 2019), and—within the limitations of its booking task—to interact “flawlessly” enough to “believe the hype” (Amadeo, 2018). These mirror later journalistic responses to the launch of ChatGPT and other LLMs in the early 2020s. In the analyses below, we focus on interactions initiated by Duplex in its publicly available recordings. Our observational focus is informed by related analyses of a wide range of pragmatically similar service calls drawn from the cumulative body of systematic research (including our own previous work) on social interaction in service calls. Building on these analyses, we outline procedures for conducting a putative CAT. We suggest the CAT as a practical method for creating situationally specific threshold criteria for the competences (including those of “AI”-labeled participants) associated with interaction in routine service calls. We then discuss how the procedures and criteria for a CAT may be adapted and replicated for drawing new empirical and conceptual axes for future comparative and applied studies in the field of conversational AI.
Data and methods
Some of CA’s earliest findings document the structure of call-opening sequences (Sacks, 1995: 3–32; Schegloff, 1968). Our analysis uses three data sets that are rich in these routine actions. First, we used the collections of “classic CA data” currently in circulation (Hoey and Raymond, 2022), for example, the Schegloff Media Archive (International Society for Conversation Analysis (ISCA), 2023), featuring hundreds of call openings, appointment-requests, and other routine actions within a range of service call environments. Second, we used several large sets of between 100 and 3000 call recordings from our own previous studies of service calls to doctors’ offices (Stokoe et al., 2016), university administration contact centers (Hoey and Stokoe, 2018), and veterinary surgeries (Stokoe et al., 2020). Our third data set, a set of service call recordings featuring Google Duplex, allowed us to compare actions in human-human service calls to related routine actions in Duplex-human calls.
We were able to access Duplex calls from publicly available recordings produced in Google’s promotional material and technical documentation, although these data came with some analytic and ethical complexities. We first downloaded and transcribed all available Google Duplex calls using Jeffersonian transcription (Hepburn and Bolden, 2017), totaling five complete encounters and several smaller fragments (Leviathan and Matias, 2018). These calls seem to have been edited before publication, possibly for data privacy reasons. We assumed, a priori, that these were all Duplex-human calls, although Chen and Metz (2019) revealed that Google uses human call-center workers for up to 25% of its Google Assistant app calls, while Duplex handles the rest. Where Duplex fails in these calls, the call is transferred to a human operator. One such recording, published online by the New York Times (Chen and Metz, 2019), provides us with at least one like-for-like comparison between Duplex and its human counterpart. We selected calls in which the main purpose was closest to the Duplex calls (e.g. booking appointments for non-urgent services such as annual vaccinations). We used these calls as publicly available data, since they are published online, though we recognize that no explicit consent was given for this research purpose. Nor, for that matter, was consent for this use necessarily given by participants in the calls collected in CA’s canon of “classic data,” published long before contemporary norms of institutional ethical review. Nonetheless, the public, online availability of these data rendered them fair use for our research purposes. Participants in our corpus of 500 human-human service calls consented to us using these recordings for research purposes.
In the analyses below, we follow Schegloff (1987, 2009) by applying previous findings about specific interactional phenomena to new data and by taking a comparative approach. The range of interactional phenomena we focus on here were inductively derived from repeated reviewing and analysis of our data, informed by the wealth of existing CA findings about the structure of service calls (e.g. Flinkfeldt et al., 2021; Hoey and Stokoe, 2018; Lee, 2006, 2011; Schegloff, 1986; Stokoe et al., 2016, 2020; Whalen et al., 2002). We begin each analytic section by outlining an interactional practice identified in previous CA studies of human-human service calls, using examples to describe the interactional features that constitute the phenomenon. We then analyze Duplex calls featuring similar phenomena to see how the actions in question are recognized and accomplished (or not). We aim to show how a “baseline” analysis of routine interactions in a specific environment (here, service calls) can draw on the wealth of interactional research in similar settings to underpin a comparative analysis. A further aim is to show how such analyses allow us to evaluate the ostensibly “artificial” sociality constituted by the actions of an AI participant. We should note that our designation of “artificial” and “AI” here is made a priori, and is, in any case, not the point of this analytic exercise. Whatever our ontological commitments, our analyses only commit to these a priori categories as a convenient starting point for analysis that focuses on methods and practices, not individuals, intelligences, or persons (artificial or otherwise).
Analysis
We present five sections of analysis. In the first two, we examine turn-component and sequential aspects of call openings, in which callers produce (a) first turns in the “reason-for-the-call” slot and (b) “second summonses,” in which callers extend openings by re-doing a summons before progressing to the reason-for-the-call. In three further sections, we examine features of trouble, perturbation, and repair in which callers (c) place and produce “um” and “ah” particles in the unfolding production of turns; (d) mark trouble; and (e) organize and respond to repair initiation. In each of the extracts below, some of which predate mobile and video telephony, we should note that all calls are audio-only. While this provides a somewhat restricted interactional environment where participants cannot see one another, talk is still rich with forms of phonetic embodiment available to both parties through prosodic and intonational variation. We also, therefore, offer some phonetic observations of Duplex’s vocally embodied performance, based on acoustic and impressionistic approaches to comparable human-to-human calls. Together, these analytic approaches allow us to identify Duplex’s capabilities and shortcomings and to reflect on their implications for testing the artificial (or otherwise) sociality of its routine actions.
Reason-for-the-call in service call openings
The first challenge for all participants in service calls, human or otherwise, is to conduct the situationally relevant organization of the call-opening sequence (Whalen and Zimmerman, 1987). The interactional features that constitute this routine include a summons/answer sequence, a greeting from the call-taker, and an official “place-self-identification” (e.g. a business name, Schegloff, 1986: 123). The call-taker usually speaks first, so the criterion for success in this routine is moving from the call-taker’s answering of the ringing phone to the caller’s delivery of the reason-for-the-call. This usually involves placing a service request in the “anchor position” (Schegloff, 1986): the structural slot in the opening where the caller may introduce the first topic. Reaching this point is criterial for a successful call opening because it demonstrates having achieved and progressed beyond mutual recognition of caller and call-taker’s respective roles.
Extracts 1–3 show human-human calls to the vet (extract 1) and doctor’s (2–3) receptions.
In the three extracts below, Duplex (DUP) calls reception (REC) at a salon and two restaurants to make bookings. Each includes all the routine components of a service call opening, albeit with the identification components apparently redacted. Duplex first provides a responsive greeting (e.g. “H↑
[Extracts 4–6: transcripts not reproduced.]
Extract 7 is from our one recording of a call initiated by a human Google call-center worker. The same opening sequence is accomplished, but in this case the business self-identification (the restaurant name) is unredacted.
[Extract 7: transcript not reproduced.]
If we compare the human-human service and human-Duplex calls, we see similarly structured opening sequences containing the same turn components (e.g. greeting, request, etc.), which reflexively accomplish the mutually recognized interactional roles and actions of a “service call.” In these types of sequences, then, based on an examination of routine procedures, a CAT would set the threshold of interactional competence for “passing” as follows: the caller reciprocates any greeting and moves on to the first topic in the next turn.
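As a rough illustration, the opening criterion just described could be checked over an analyst-coded sequence of turns. Everything in this sketch is an assumption made for illustration: the `Turn` structure, the speaker labels, and the action codes (`"greeting"`, `"first_topic"`) are hypothetical, and the coding itself would remain the product of CA transcription and analysis rather than of the code.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # hypothetical labels: "call_taker" or "caller"
    actions: set  # analyst-assigned codes, e.g. {"greeting", "first_topic"}


def passes_opening(turns: list) -> bool:
    """Check the straightforward opening: after the call-taker's answer,
    the caller reciprocates any greeting and reaches the first topic."""
    if len(turns) < 2:
        return False
    answer, reply = turns[0], turns[1]
    if answer.speaker != "call_taker" or reply.speaker != "caller":
        return False
    greeted = "greeting" in answer.actions
    reciprocated = (not greeted) or ("greeting" in reply.actions)
    return reciprocated and "first_topic" in reply.actions


opening = [
    Turn("call_taker", {"greeting", "identification"}),
    Turn("caller", {"greeting", "first_topic"}),
]
print(passes_opening(opening))
```

This sketch covers only routine openings of the kind shown so far; openings that are extended or reorganized in other ways would need their own criteria.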
Re-setting the call opening via a second summons
In some situations, of course, the routine turn components of call openings may be organized somewhat differently. As we have seen, in service calls, the summons of the ringing phone is usually reciprocated with a vocal response including various routine components (e.g. greetings, self-identifications etc.). Where the call-taker’s vocal response is missing, previous interactional studies have identified the “second summons” as a method callers can use to deal with the absence of the vocal response. For example, if the caller does not hear the call-taker’s responsive “hello,” perhaps due to a technical problem, they may re-do their initial summons (i.e. the ringing of the phone), with a spoken, often upward intoned, re-doing of the summons turn, for example, “hello?” (Schegloff, 1968: 1088). Second summonses are also useful for dealing with other kinds of call-opening trouble. For example, Lee (2006) showed that Korean callers often do a second summons if they have not recognized the call-taker’s voice, which can occasion a repeat response, providing the caller with another opportunity to identify the call-taker from their voice sample. In all cases, the second summons works by sequentially deleting whatever the call-taker may have said in their initial summons response and making a re-doing of the response relevant next. A second summons is successfully achieved, then, when the call-taker re-does their summons response.
Extracts 8–9 provide examples of second summonses from a variety of human-human service call settings including calls to the police and to university admissions:
[Extracts 8–9: transcripts not reproduced.]
In extract 8, the Police Desk dispatcher does an official self-identification as a first response, then the caller does a second summons in line 03, occasioning a full repeat of the dispatcher’s first summons-response turn. Note that the second summons here achieves a “reset” of the call when the dispatcher then “re-starts” with a full repeat of the official summons response and institutional identification “Hello (pause). Police desk?” in line 04. In extract 9, the caller is a parent calling university admissions on behalf of their child. The second summonses here deal with troubles of overlapping talk. The caller’s second summons in line 07 comes after a series of delays (lines 03–06) that occasion an overlapped response (line 08). The caller then re-does a second summons adding the call-taker’s name “Anne” (line 10), once again in overlap. This time the call-taker duly re-does their summons response (lines 11–12) sufficiently in the clear to facilitate progress to the first topic at line 16, effectively re-starting the call-opening sequence.
Extracts 10–12 show how Duplex deals with trouble or deviations from the routine structure of service call openings using a second summons to accomplish a “reset” in the opening sequence.
[Extracts 10–12: transcripts not reproduced.]
In each case, Duplex issues a second summons following the call-takers’ first response. This second summons has a rising pitch contour—common in standalone first greetings in English (Kaimaki, 2011). In the calls above, following Duplex’s second summons, the call-taker duly re-issues a response, sequentially re-setting the call opening. In each case, in the following turn, Duplex proceeds to the first topic, as in the straightforward call openings in Extracts 1–7.
Both Duplex’s and human callers’ second summonses above clearly create an opportunity to re-start the call-opening sequence, so a CAT might treat the reset of the call following a second summons as a criterion for successful service call interactions.
Anchor position uh(m)s
Duplex’s developers note that where a response may be expected with no delay, or when dealing with complex activities that may incur what Leviathan and Matias (2018) call “processing delays,” Duplex may interject a “speech disfluency” or a sound stretch that “masks” such delays. However, as Schegloff (2010) points out, these utterances have a wide range of systematic sequential positions, functions, and production characteristics far beyond covering for delays. For example, in a call opening sequence, callers routinely produce an “um,” “uh,” or “ah” (all of which we combine here as “uh(m)”) just before the reason-for-the-call in the “first topic” slot (Schegloff, 1986). This is a different phenomenon from the type of uh(m) that often occurs when participants encounter troubles of speaking or understanding (e.g. Jefferson, 1974). Callers can also produce a first topic without doing a turn-initial uh(m); however, pre-anchor position uh(m)s can project the reason for the call or some form of intersection rather than trouble, as suggested by the way they also occur when the anchor position is “displaced” by some other business (Schegloff, 2010).
Extracts 13–15 below are taken from human-to-human service calls to GP offices, vets, and police dispatchers. In each case, the caller produces this specific type of anchor position uh(m).
[Extracts 13–15: transcripts not reproduced.]
In Extracts 13 and 14, the caller reciprocates the greeting before doing an uh(m) and moving on to provide the first topic in anchor position. Schegloff (2010) uses extract 15 to demonstrate the relevance of an anchor position “Uh” (in line 02), where the caller begins to ask for help and give an address. After the operator interjects and the dispatcher explains the interjection, note how the caller re-starts his request for emergency help (line 05). He repeats the “Uh” in anchor position but deletes the other two “uhs” (“could you uh go to uh”) in his re-doing of his first topic turn, suggesting that only the anchor position uh(m) has some kind of persistent interactional relevance.
In Extracts 11 and 12, above, and in extract 16 below, Duplex produces an uh(m), or a sound stretch that sounds like an uh(m), just before introducing the reason-for-the-call.
→02 DUP: H↑i:: u:::m I would like t’reserve a t 03 M
Despite the claims of the developers to be masking processing delays, the placement of these uh(m)s does not appear arbitrary. These are slotted into the anchor position when the opening sequence is extended in various ways. For example, in Extracts 10 and 11 above, and in extract 17 below, Duplex extends the greeting sequence by using a second summons to reset the call. In these cases, Duplex still produces an uh(m) in anchor position before introducing the first topic.
Note that here in lines 7 and 8, Duplex starts with an announcement about the “reason for the call” (“I’m calling to make a reservation?”), but without actually producing the reservation request. This turn functions as a kind of “pre-request” forming part of a standardized service announcement that the call is from an automated booking service and is being recorded. These types of pro-forma “recording for training and monitoring purposes” announcements are, typically, separate from the “business” of the call. Indeed, once the pro-forma announcement is delivered, Duplex does an anchor position “A::m” just before the first topic in line 10, suggesting that this uh(m) tracks the anchor position, rather than simply being placed after the greeting sequence automatically.
Whatever the interactional consequences of this phenomenon, Duplex’s anchor position uh(m)s successfully occasion a re-doing of the call-taker’s response and, as such, they achieve this interactional practice.
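As a rough illustration only, the positional regularity described in this section could be approximated over transcribed turns. The Python sketch below is not part of our analysis and makes no claim about how Duplex is implemented. The greeting list, the uh(m) pattern, and the function names `normalise` and `anchor_uhm` are all hypothetical simplifications introduced here, and keyword matching of this kind is no substitute for the sequential analysis that identifies anchor position in the first place.

```python
import re

# Hypothetical greeting tokens; a real opening sequence is identified
# sequentially, not by a word list.
GREETINGS = {"hi", "hello", "hiya", "good", "morning", "afternoon", "evening"}

# Assumed spellings of uh(m) once transcription symbols are stripped:
# "uh", "um", "uhm", "ah", "am" (e.g. "A::m"), "er", "erm".
UHM = re.compile(r"(u+(h+m*|m+)|a+(h+m*|m+)|e+r+m*)")


def normalise(turn):
    """Lowercase a transcribed turn and strip Jeffersonian symbols."""
    return re.sub(r"[^a-z\s]", "", turn.lower()).split()


def anchor_uhm(turn):
    """True if an uh(m) directly follows any greeting tokens and is
    itself followed by further talk (a candidate first topic)."""
    tokens = normalise(turn)
    i = 0
    while i < len(tokens) and tokens[i] in GREETINGS:
        i += 1  # skip the greeting component of the turn
    return i < len(tokens) - 1 and bool(UHM.fullmatch(tokens[i]))


# Turn fragments quoted in this section.
print(anchor_uhm("H↑i:: u:::m I would like t’reserve a t"))  # True
print(anchor_uhm("could you uh go to uh"))                    # False
```

The second call returns False because the uh(m)s in that fragment occur mid-turn rather than before the first topic, which loosely mirrors the distinction between anchor position uh(m)s and those associated with troubles of speaking.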
Other-initiated self-repair
One striking feature of Duplex’s calls is that, in a few instances, they involve other-initiated self-repair (i.e. where “other,” the recipient of the trouble-source turn, flags the problem, then allows “self,” the speaker of the trouble source, to solve it). In human-human service calls such as in extracts 18–20, below, this kind of repair operation often occurs when it is especially important that participants achieve and secure a shared understanding of times, dates, and other consequential details.
Note that there is a variety of forms of other-initiation that we see from the participants in these three cases. In extract 18, line 04, we see the call-taker (D) use a partial repeat of the prior turn as an “understanding check.” “
In extract 19, line 11, we see the caller (C) use a less specific “open class” repair initiator “
In extract 20, we see another example of the partial repeat and “understanding check” form of repair initiation. Here the 911 call-taker does an emphatic partial repeat of the prior turn “
In Extracts 21–23, we also see Duplex participating (as trouble-source speaker, or “self”) in several instances of other-initiated self-repair. Note that Extracts 21 and 22 are from call fragments, so they cannot be analyzed in any wider sequential context.
In extract 21, after the call-taker initiates repair at line 05 with “sorry what day?,” Duplex provides the repair solution, inserting the day “Friday” as well as repeating the relevant part of the trouble-source turn “May twenty fifth.” This repair treats the trouble either as an issue of which day of the week the reservation falls on, or as a trouble of hearing/understanding the day and date altogether.
Extract 22 starts as Duplex is dictating a phone number when the call-taker initiates repair by asking Duplex to “start over.” Duplex duly re-starts the dictation, and this time the call-taker displays uptake and alignment by doing a “continuer” (Goodwin, 1986) “Uhuh” at line 33 after the area code, interspersed between Duplex’s ongoing dictation.
In extract 23, the call-taker’s turn at line 08 “Fo::::r
In terms of sequential structure, these three examples all demonstrate the successful accomplishment of other-initiated self-repair because, following the repair procedure, both caller and call-taker proceed with the task at hand. However, Duplex’s responses in Extracts 21–23 do not unambiguously display as specific an orientation to the trouble source as we saw in the human-human examples in extracts 18–20. For example, in extract 21, the call-taker flags the day, but not necessarily the date, as the trouble source, yet Duplex’s response in the following turn both supplies the day (Friday) and re-does the date, disattending to the specificity of the repair initiation “sorry what day?” Similarly, in extract 22, where the call-taker asks Duplex to “start over” when giving a phone number (line 29), Duplex re-does a fully sentential turn prefaced with “the number is . . .” rather than responding to the precision of the repair initiation to “start over”; that is, it re-does the entire turn rather than specifically re-dictating the number. Finally, in extract 23, Duplex’s repair in line 10 “um it’s for fo̲u̲r̲ people” goes along with the call-taker’s misunderstanding that the prior request related to numbers of people, without addressing their prior mishearing of “Wednesday the seventh.”
So, while Duplex’s responses to other-initiations of repair in these data meet the basic sequential criterion for accomplishing repair (i.e. getting the repair done and moving on), our final analysis in this section suggests a certain lack of sensitivity, on Duplex’s part, to the precision of other-initiations of repair to locate and help to swiftly resolve interactional trouble. In the following section, we discuss how our analyses, starting with human-human service calls, show how we might develop a CAT that evaluates the performance of callers or call-takers (artificial or not) in terms of their participation in situated forms of sociality.
Discussion
The aim of this article was to examine how a conversational voice agent interacts on the phone with naïve human interlocutors in service encounters to achieve a form of “banal deception” (Natale and Depounti, 2024). We evaluated Duplex’s turns in relation to conversation analytic research into the structure of call openings, second summonses, uh(m)s that precede a reason for the call, and other-initiated self-repair. We evaluated Duplex’s achievement of these practices against human-to-human calls, following Schegloff’s (2009) guidelines for comparative CA that require analysts to describe the interactional features that constitute a practice, to propose criteria to test its achievement, and to discuss how the practice may transfer to other interactional contexts. Our analysis showed that Duplex’s actions largely achieved these practices in terms of our basic procedural/sequential criteria. In the following section, we discuss how each practice “passes” as conversationally competent and ask what we can infer from observing the degree of specificity with which Duplex responds to other-initiations of repair. We consider the broader implications of using CA in situations that resemble the fictional “Voight-Kampff Test” for artificial sociality. Finally, we propose some aims and procedures for developing a form of “CAT” capable of evaluating sociality in specific interactional situations.
“Passing devices” maintain artificial sociality
Our analysis highlighted several methods that Duplex used to progress through a potentially tricky interaction. First, the opening phase of a service call is a highly routinized site for institutional talk, where contributions from each party fit into a set of mutually expectable sequential “slots” (Drew and Heritage, 1992), although there may still be significant variations. As our human-human data reveal, greetings can vary with time of day (good morning/evening), and may include names and organizational self-identifications that can be more generic or more specified (e.g. “surgery” vs “Limetown Surgery”). Duplex’s practices for moving from call openings into the reason-for-the-call are clearly robust enough to manage these variations. However, even though Duplex’s practices meet our criteria for achieving this call opening structure, they may not make use of all the interactional resources available. Minor variations in the position and composition of turns provide participants with a range of resources for accomplishing their respective, situated identities as they move on to the first topic of the call (Psathas, 1999). For example, in extract 9, the caller’s long gaps, pauses and disfluencies display hesitation or delicacy in formulating her situated identity as the “mother of the official caller.” Duplex’s relatively crude use of second summonses in Extracts 10–12, on the other hand, simply resets the opening sequence, shunting the call toward first topic.
We can thus see this use of second summons as one of several “passing devices” (Garfinkel, 1967): methods for moving through a stretch of interaction where there is a threat of exposing possible “incompetence.” This method is very similar to how Lenny (a telephone “spam trap” bot that simply reads out, with “a soft and slow Australian accent in the manner of an elderly man” (Oberhaus, 2018), a set of 16 carefully scripted, pre-recorded turns to fool telemarketers into wasting their time; see Relieu, 2024; Sahin et al., 2017) occasionally reports trouble on the line: “hello? are you there?,” often resulting in re-setting and sustaining the ongoing interaction.
Some passing devices effectively mimic the way people manage and mark trouble in ongoing talk through delays, disfluencies, and hesitations. The uh(m)s of this sort were enthusiastically applauded by the crowd during a demonstration of Duplex at the Google I/O 2018 keynote (Google Developers, 2018), as well as in media reports that celebrated Duplex’s “authentic” use of speech disfluencies. Indeed, our analysis showed that Duplex sometimes positions uh(m)s in ways that account for their placement (e.g. in overlap resolution, or in call openings just prior to the reason-for-the-call) and build toward a target action such as requesting a reservation. However, though we lack space to reproduce them here, our wider analyses of Duplex calls also found uh(m)s that seemed phonetically and procedurally unfitted to their sequential environments. Perhaps these were masking non-interactional “processing delays,” as the developers claimed (Leviathan and Matias, 2018), rather than being positioned in relation to the unfolding action. Similarly, in the “mystery shopper” calls described by Stokoe et al. (2020), callers simulating clients to test the phone services of a vet’s surgery simply have different issues at stake from genuine pet owners, and thus use different interactional patterns. For example, while real pet owners answered the receptionists’ questions about their pets fluently, mystery shoppers tended to delay, defer, or respond disfluently. Given the way that humans struggle to simulate the behaviors of other humans, even in task-specific contexts such as service calls, we might expect this to remain a long-term challenge for artificial sociality.
Finally, while Duplex’s involvement in other-initiated self-repair is successful, it is also ambiguous since its responses do not always target the specific trouble source cited in the repair initiation. These passing devices may help to smooth the path toward a successful service call closing, but the way Duplex uses them to “bypass” trouble may obviate valuable interactional resources humans use to recognize and deal with miscommunication (Healey et al., 2018; Purver et al., 2018). Indeed, we may depend on the specificity of our abilities to recognize and manage interactional trouble to secure shared understanding and intersubjectivity (Albert and de Ruiter, 2018; Schegloff, 1992; Sidnell, 2014). Where artificial forms of sociality evade repair using a passing device, they may miss an essential, if difficult, step toward understanding and dealing with more unpredictable and complex interactions.
The implications of AI for CA
One outcome of our analysis is to add a new analytical frame to CA, which has included, from the outset, a burgeoning set of studies framed as “institutional talk” in a wide range of settings including helplines, healthcare, and service interactions (e.g. Drew and Heritage, 1992). The structure of talk in these situations is studied in relation to the institutional constraints we can observe on the putatively ubiquitous frame of “everyday talk,” which is understood to encompass a relatively unconstrained range of interactional practices (Hester and Francis, 2001). Each situated form of human sociality, described in terms of the constraints on “institutional talk,” creates a “unique ‘fingerprint’ for each kind of institutional interaction” (Heritage, 1997: 225), providing the basis for informative comparative and evaluative analysis. For example, Stokoe (2013) shows how, even when domain experts set out to simulate an interaction, such as police interview trainers in a role-play, they tend to talk in ways that do not correspond with recordings of real interviews (see also Atkins, 2019; Stokoe et al., 2020). Similarly, CA studies of “atypical interaction” (Antaki and Wilkinson, 2012; Wilkinson et al., 2020) involving disabled people, often in institutional settings, increasingly focus on how people manage constraints on normative interactional patterns rather than on the communication impairments or medical diagnoses of individuals (Bottema-Beutel et al., 2021; Maynard and Turowetz, 2022). In this vein, studies of interactions involving artificial agents may require new analytic frames that can evaluate, for example, conversation design, voice user experience design, and agent design in relation to the specific “fingerprint” of practices and interactional competences that constitute a growing range of contingent, situated socialities (cf. Porcheron et al., 2018).
Such frames will need ongoing revision as interactional studies of artificial sociality extend further beyond task-specific domains of the HCI lab and become an increasingly ubiquitous part of everyday life (Mlynář et al., 2024).
Another implication of our analyses is to show how some interactional phenomena can be amenable to both automated and conversation analytic forms of discovery. The AI methods underpinning Duplex bear comparison, in some ways, with CA in that they are strongly data-driven and use observations as a basis for theorizing, an approach through which, as Sacks (1984: 25) puts it, we “can find things that we could not, by imagination, assert were there.” Duplex’s use of anchor position uh(m)s is a good example of this kind of phenomenon. The machine learning methods that inform some of Duplex’s behaviors may have “discovered” this little-known pattern of behavior, bottom-up, by deriving statistical regularities from processing large numbers of recorded service calls. Duplex’s competent use of this practice therefore addresses some long-standing debates about whether, and how, some CA findings may be amenable to statistical and computational analysis (Button, 1990; Kendrick, 2017; Schegloff, 1993; Stivers, 2015). Although the interactional consequences of anchor position uh(m)s are still unknown, future studies that use AI in this way may identify related patterns in large volumes of data, opening up the possibility of using detailed CA studies to discover their situated interactional relevance (Steensig and Heinemann, 2015).
A research trajectory for a CAT?
When Duplex’s practices and actions pass, and its service calls progress sequentially, this does not equate to Duplex itself “passing a Turing Test” in the vernacular sense of “passing as human.” We call the method used in this article the “CAT” to focus, instead, on the actions and practices that comprise conversational competence and membership within specific interactional situations. We started by examining service calls involving Google’s Duplex along with a wealth of data and findings from prior CA studies of similar interactional settings for comparative analysis. This enabled us to identify, describe, illustrate, and evaluate practices associated with the conduct of competent service encounters and their mutually acknowledged interactional roles. This analytic procedure comprises a test specifically configured for a particular interactional situation. This process may be repeated to design new CATs to evaluate how any agents (human or machine) achieve conversational competence and membership across a range of interactional situations. In this way, the CAT can inform the design and evaluation of AI and voice technologies and may lead to new research questions for CA studies.
To design a CAT for a specific interactional situation, we suggest the following:
1. Specify an interactional setting underpinned by CA research.
2. Gather data featuring candidate actions and practices involving a “tested” party.
3. Gather data of normatively achieved practices in a similar, naturally occurring setting.
4. Transcribe and analyze data from both using standard CA methods.
5. Identify evident candidate practices for a comparative CA analysis (Schegloff, 2009). (a) State a clear understanding of the target phenomenon or practice. (b) Identify situationally specific and observable criteria for recognizing it. (c) Describe how this phenomenon has been examined and analyzed previously. (d) Compare its use between these environments and discuss any differences.
6. Ask if the tested party uses practices competently and is treated as a member.
7. Identify problems or observations that may feed into future design processes.
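For readers who wish to keep a systematic record while working through these steps, the comparison in the later steps could be logged along the following lines. This Python sketch is offered tentatively and is not a tool used in this study. The names `Instance` and `compare`, the field names, and the corpus labels “tested” and “reference” are hypothetical. Each `achieved` value stands for an analyst’s judgment against an explicitly stated criterion, reached through standard CA methods, so the code only tallies judgments and does not make them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One analyzed instance of a candidate practice (hypothetical schema)."""
    practice: str   # the target practice being compared
    corpus: str     # "tested" or "reference" (assumed labels)
    achieved: bool  # analyst's judgment against the stated criterion
    note: str = ""  # observations that may feed into design


def compare(instances):
    """Tally (achieved, total) per practice and per corpus."""
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for inst in instances:
        cell = tally[inst.practice][inst.corpus]
        cell[0] += int(inst.achieved)
        cell[1] += 1
    return {
        practice: {corpus: tuple(counts) for corpus, counts in by_corpus.items()}
        for practice, by_corpus in tally.items()
    }


# Placeholder records only; these are not data from this article.
records = [
    Instance("P1", "tested", True),
    Instance("P1", "tested", False, "placeholder observation"),
    Instance("P1", "reference", True),
]
print(compare(records))
# {'P1': {'tested': (1, 2), 'reference': (1, 1)}}
```

Counts of this kind are bookkeeping at most. The evaluative work of a CAT lies in the qualitative account of how each instance does or does not achieve the practice, which is why each record carries a free-text note.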
Having proposed this procedure for developing CATs, we conclude with a discussion of the implications for the design and interpretation of such tests more broadly.
The CAT evaluates actions, not agents
The CAT evaluates social actions rather than purported “intelligence”—artificial or otherwise—let alone ascribing the category of human or machine. Even if it were straightforward to ascribe humanness and evaluate intelligence, this common interpretation of the “Turing Test” has already been passed many times by simple chat bots (see Wallace, 2009) during the annual Loebner Prize competition (Loebner, 2009) with little impact beyond temporary sensationalist news coverage. Passing this kind of operationalized test of “human intelligence” often turns out to be trivial in both senses of being easy and being inconsequential. Some researchers have therefore advocated raising the bar for what might be considered intelligent up to and including being indiscriminable from a human (Harnad, 1992), or even exceeding human capabilities (Schweizer, 1998). Other proposals suggest extending the time allocated, stipulating the expertise of the judges, or enhancing the complexity or generality of the test (Kurzweil and Kapor, 2009). However, a harder operational test would not necessarily be any more explanatory about how, precisely, the test has been passed, nor what “indiscriminable” may mean in terms of how such judgments are made. Rather than refining operational tests that aim to ascribe human intelligence, the CAT aims to describe and then evaluate the pragmatics of situated human sociality. It describes criteria for evaluating the detailed interactional procedures that constitute each action, at each “passing opportunity.” The analytic procedure of the CAT, using CA, can also provide thorough explanations about precisely how an action “passes” in each specific circumstance under scrutiny. For example, Sahin et al. (2017) use CA to show how Lenny’s call opening turns are designed to maximize coherence and agreement, to report “trouble on the line,” and use misplacement markers such as “by the way” to account for any incoherence with the caller’s prior turn. 
Lenny’s simple recordings are effective without using speech recognition, AI, or any NLP technology aside from playing pre-recorded turns when it detects that the caller has stopped speaking. Passing as human, then, which Lenny achieves with remarkable consistency, may rely more on the normative expectations that constitute the social situation than on sophisticated AI systems.
The consequences of passing a conventional Turing Test have often focused on mediagenic scare stories of AI or robots “taking over” (Whitby and Oliver, 2000). In the case of Duplex, its first demonstration at the 2018 Google I/O conference (Google Developers, 2018) did raise serious ethical questions about whether an AI should masquerade as human in public life (O’Leary, 2019). Similarly, today’s AI-driven social bots are often convincing enough to influence commercial and political choices by emulating social media users, so there is an increasing demand for research into methods for categorizing agents as human or artificial (e.g. Ferrara et al., 2016). The arms race between AI developers and AI-detection measures will drive the sophistication of such systems, but not necessarily explain or ameliorate the consequences of their social actions.
A CA-informed approach such as the CAT, however, which focuses on the analysis of social actions, can achieve far more than simply ascribing the category “human” or “non-human.” It can also show how such categories are used as resources in the production of social actions. For example, Housley et al. (2017) focus on the actions of social media users to show how discursive formulations of membership categories in social media posts can ignite antagonistic readings and responses and open up the potential for spreading false or malicious information. Thus, epithets like “bot” and “troll” are now used as terms of abuse on social media (Ruck et al., 2019), often aimed at users accused of repeating provocative or propagandistic talking points. These categories are harnessed as resources for social action (i.e. doing insulting), rather than working primarily as technical or ontological ascriptions. In terms of social consequences, whether the agent of an utterance is human or not may matter far less than how their utterances are implicated in a specific interactional situation.
Conclusion
The conceit of the Voight-Kampff test in Blade Runner is to ask whether, and how, we define “humanness.” The moral confusion of the protagonist Deckard, who falls in love with an android, shows how our intuitions, as well as more technical and conceptual operational definitions of humanness, may be fundamentally flawed. This focus on judging participants as either human or non-human by operationalizing interaction is a long-standing, though mediagenic, category error. Garfinkel’s work on “trust conditions” showed how, turn-by-turn, interaction works as a “proving ground” for the micro-social structures and mutual expectancies that constitute human sociality. The categorical status of an interlocutor as “human” or “machine” is (still) rarely in question, whereas in everyday talk, the precise “fittedness” and the reciprocity of the design of each response to the previous action is, with each turn, immediately under scrutiny, as summed up in the conversation analytic dictum “why that now?” (Sacks et al., 1974: 241). Humanness, intelligence, and the artificiality (or otherwise) of sociality are not based on the inherent properties of interlocutors but must be ongoingly constituted in and through action. With this proviso, we propose the CAT as a practical method for evaluating and understanding the coming wave of conversational AI through its constitutive involvement in forms of sociality. As Sacks (1995: 536) reminds us, “anthropomorphizing humans” is only an analytic convenience. For Deckard, in the end, “the electric things have their lives too.” What matters is social action and how we conduct our social relationships in and through the technology of talk.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
