Abstract
Conversation—a verbal interaction between two or more people—is a complex, pervasive, and consequential human behavior. Conversations have been studied across many academic disciplines. However, advances in recording and analysis techniques over the last decade have allowed researchers to more directly and precisely examine conversations in natural contexts and at a larger scale than ever before, and these advances open new paths to understand humanity and the social world. Existing reviews of text analysis and conversation research have focused on text generated by a single author (e.g., product reviews, news articles, and public speeches) and thus leave open questions about the unique challenges presented by interactive conversation data (i.e., dialogue). In this article, we suggest approaches to overcome common challenges in the workflow of conversation science, including recording and transcribing conversations, structuring data (to merge turn-level and speaker-level data sets), extracting and aggregating linguistic features, estimating effects, and sharing data. This practical guide is meant to shed light on current best practices and empower more researchers to study conversations more directly—to expand the community of conversation scholars and contribute to a greater cumulative scientific understanding of the social world.
Conversation is one of the most pervasive of all human behaviors—people talk to each other all the time, all over the world (Dunbar et al., 1997). Most interpersonal relationships develop through a series of conversations over time—time spent talking and not talking, together and apart. Although a frequent and familiar task, each conversation is complex—it requires (and enables) people to coordinate their behavior and beliefs about the world (Clark et al., 2011; Jaques et al., 2019; Misyak et al., 2014; Rossignac-Milon et al., 2021). Conversations are consequential, allowing people to pursue a wide array of informational and relational goals (Yeomans et al., 2022) in the short term and over the long term—spanning each individual conversation and longer-term relationships (Fitzsimons & Finkel, 2018). Indeed, the amount and quality of social interaction is one of the most enduring predictors of human well-being (H. K. Collins et al., 2022; Diener & Seligman, 2002; Epley & Schroeder, 2014; Mehl et al., 2010; Quoidbach et al., 2019; Sun, Harris, & Vazire, 2020).
It is no surprise that researchers are increasingly interested in studying conversations, the contextual factors that surround them, and the short- and long-term effects of having them. This practical guide argues for the relevance of this work now, the benefits and challenges researchers should expect from studying conversations, and how to analyze conversation data, pair transcripts with surveys, and share results as the field moves toward a cumulative science of conversation (see Figs. 1 and 2).

Fig. 1. A workflow for researchers to collect and analyze conversation data.

Fig. 2. A map of data sources for conversation research. From the transcript itself, features can be extracted using static text methods and relying on time stamps and interactivity. These conversation features are then compared with preconversation context variables and postconversation outcomes.
Why Now?
At least three developments have enabled a recent boom in conversation research. First, conversations have become increasingly mediated through technology as a consequence of the Digital Revolution and Information Age of the 20th century and the social media era of the 21st century (Rainie & Wellman, 2012), shifts that were accelerated during the COVID-19 pandemic (M. Nguyen et al., 2020). These mediated communication technologies allow for the recording of text, audio, and/or video and thus preserve a rich source of conversation data for analysis. Second, there have been many advances in natural language processing (NLP)—an interdisciplinary subfield at the intersection of linguistics, computer science, and artificial intelligence that seeks to learn, parse, and understand human-language content using quantitative techniques (Hirschberg & Manning, 2015). This field develops computational tools that turn raw conversations into behavioral data—words into numbers—especially at scale (Jurafsky & Martin, 2017). Finally, the value of larger-scale analyses has been underscored by the recent revolution in research practices (Nelson et al., 2018). Taken together, these cultural and methodological developments offer wide promise for the study of conversation across a variety of academic disciplines.
Measuring Conversation
Although conversations are common and consequential, they are also complicated—no two are identical. Researchers have dealt with the complexity of conversation with a wide range of approaches aimed to simplify and isolate different aspects of a conversation. In exchange for simplicity, these approaches can make conversations less natural and more abstract. For example, researchers often study dialogue indirectly by having participants talk to a trained confederate, respond to hypothetical vignettes, make evaluations of carefully selected transcript segments, recall a previous conversation from memory, or offer holistic evaluations of a conversation after it is over. These approaches constitute creative and generative ways to study conversations and were particularly useful when conversation technology was nascent.
These approaches allow researchers to simplify and study conversations, but they also suffer from several well-known biases. For instance, confederate simulations rely on faithful execution of researchers’ instructions; hypothetical and recall methods suffer from errors in forecasting and memory; self-report measures suffer from social-desirability bias, hindsight bias, and demand effects; and experimenter-generated stimuli remove the conversational context in which they would occur in the real world. Conversation is a complex, contextual, and improvisational environment, and these kinds of simplifications can result in a misunderstanding between the assumed, perceived, and actual goals and psychological experiences of the speakers (Stokoe, 2021).
On the other hand, many researchers have taken on the daunting task of studying natural, contextualized conversational behavior, beginning with study of “ordinary language” as early as the mid-20th century (e.g., Garfinkel, 1956; Goffman, 1981; Heritage, 2008; Pomerantz, 1990; Sacks et al., 1974; Schegloff, 1968; Schegloff & Sacks, 1973; Stivers et al., 2010; Stivers & Sidnell, 2012). This work has typically prioritized attention to descriptive detail in natural settings by scrutinizing isolated portions of transcripts at the expense of scalability and controlled measures of outcomes and effects. Furthermore, linguistic inquiry often assumes rationality on the part of speakers (e.g., Goodman & Frank, 2016; Grice, 1975; Misyak et al., 2014) and infers intent on the basis of outcomes. This assumption can be limiting. People constantly deviate from rational behavior (Kahneman, 2002), so it is important to measure both intentions and outcomes to see whether speakers are making wise choices, enacting good behaviors, or making mistakes (Yeomans et al., 2022).
More recent work has conceptualized conversation as a diagnostic window into variables such as health status, personality, and well-being (e.g., G. Collins et al., 2018; H. K. Collins et al., 2022; Conner & Mehl, 2015; de Barbaro, 2019; Jaidka et al., 2020; Mehl & Pennebaker, 2003; Robbins et al., 2011). This simplification abstracts away from the details of each particular conversation and focuses instead on person-level variables. The focus is on what speakers’ behavior says about themselves, rather than the effects of their behavior on their partner, and ignores the specific goals and outcomes of individual conversations.
Many prior articles have compared conversation behavior with context and outcome data (e.g., Weingart et al., 2004; Word et al., 1974). But this work usually relies on human annotation to quantify conversational behavior from recordings or transcripts. Although these insights are useful, they are costly to scale and often do not give a transparent or interpretable definition of how a measure is calculated (see below for more on this point). Likewise, speakers are often asked to quantify the content of the conversation themselves using retrospective survey measures. Again, these measures are convenient but opaque and suffer from the same self-report and memory biases of other survey methods.
In this article, we highlight how recent technological advances provide researchers with novel capabilities to combine the best aspects of these research approaches and directly measure conversation behavior in more natural contexts at scale. The tools for conversation science are rapidly improving—both for recording conversations and for analyzing them—leading to an emerging boom of conversation research in a wide range of contexts across a wide range of academic disciplines (for a review, see Table 1). Modern workflows have made it easier than ever for researchers to combine detailed transcript analysis with algorithmic tools to scale up their insights and obtain robust measures of context and outcome variables surrounding conversational choices.
Table 1. A Nonexhaustive List of Recent Research That Analyzes Transcript Data Across Behavioral Domains, Conducted Across Academic Disciplines
In this practical guide, we aim to make data, tools, and methods more accessible to a wider group of researchers by describing common challenges that face behavioral researchers who wish to study conversations and suggesting approaches that address those challenges. This review is aimed at researchers across disciplines who are looking to incorporate conversation-research methodology into their work for the first time or expand on a body of conversation research by incorporating new methods and techniques.
The Scope of This Article: A Focus on Transcripts
Conversations can include a wide array of psychological and behavioral content, including verbal features—what words are uttered, by whom, and in what order—and nonverbal features—tone of voice, gesture, posture, facial expressions, and so on (via visual and/or audio inputs). We focus primarily on verbal content for three key reasons. First, every conversation includes verbal content, whereas nonverbal cues are not present in many conversations (e.g., emails and phone calls). Second, verbal content presents common challenges for conversation research no matter what other types of cues are also present. The decisions, beliefs, and consequences that stem from the verbal content of conversation are only beginning to be rigorously understood. Finally, although nonverbals can inflect the meaning of the words spoken, it is the words themselves that form most of the meaning: They define the topics of conversation and what is being said about them. Indeed, verbal content has an overwhelming effect on how nonverbals are interpreted (Lapakko, 1997).
For these reasons, the scope of this article focuses on the aspects of conversations that can be captured in a transcript. This includes conversations conducted through sound and through writing.¹ Transcript data primarily include all words and phrases uttered by the speakers, the relative order and timing in which they are produced, and who produced them. In addition, transcript data can include some paralinguistic features (e.g., laughter, backchannel feedback like “yea” or “uh huh,” and disfluencies like “um” or “uhhh”). Likewise, written conversation sometimes includes features intended to represent nonverbal information (e.g., emojis or emoticons).
However, this scope excludes data that are present in many types of conversations. Primarily, this excludes paralinguistic (acoustic) information, such as the tone, pitch, and volume of voice, and visual nonverbal information, such as the speakers’ facial expressions, hand gestures, and body posture. We also focus almost exclusively on monolingual English conversations because the complexities of conversing in two or more languages simultaneously are too manifold for us to address properly herein. Most common NLP tools are available in many languages, although in cases in which researchers are studying dialogue from underresourced languages or other complex sources (e.g., slang, jargon, multiple languages at once), they may want to rely more heavily on expert human annotation.
In cases in which these other sources of information are relevant to the research question, we urge researchers to take a more tailored approach rather than rely only on our simplified workflow. For example, we note that these other types of conversational content can be added or annotated within transcript data (which we address in Capturing Conversation Data below) and can be quite important in some cases.
As a final statement of scope, we have also avoided research on dialogue generation (i.e., building models that can converse autonomously, sometimes called “dialogue agents,” “dialogue systems,” or “chatbots”). Transcripts of conversations that include bots can be analyzed in essentially the same way as conversations that include only humans, but building the chatbots themselves is more arduous. A practical reason to avoid this topic is that it is an especially fast-moving field. For example, between our initial submission of this article and its final acceptance, ChatGPT was released (OpenAI, 2022), followed by a rush of similarly impressive language-generation models. Although the future remains uncertain, we do anticipate that novel and enhanced models will emerge and become available in the coming years and will become increasingly important in the field of conversation analysis.
At this point in time, effective chatbots in the real world tend to be task-specific (e.g., customer-service phone trees, smart-home assistants) or serve narrow roles, such as a conversation facilitator for human speakers (e.g., Adamson et al., 2014; Traeger et al., 2020). When chatbots participate in broader conversations, they often have problems with listening, consistency, factuality, and other basic skills, although this may improve in the near future (M. Huang et al., 2020).
Leveraging the Predictable Structure of Conversation
Conversation is constructed jointly by (at least) two people, each of whom has their own independent goals, preferences, beliefs, perceptions, traits, and choices, often intertwined in an interdependent relational system (Fitzsimons & Finkel, 2018; Yeomans et al., 2022). In light of the difficult coordination puzzle that conversation presents, it is a wonder that humans manage to communicate at all. Remarkably, people do figure out how to understand each other (Goodman & Frank, 2016; Grice, 1975; Misyak et al., 2014). In fact, the predictable and intuitive structure of conversation—a pattern humans learn to recognize and produce from a very young age—facilitates information flow between speakers. The raw data of conversation are carefully structured by the participants themselves. For example, conversation partners alternate turns, jointly establish topics as a common frame of reference, and ask and answer questions (Pickering & Garrod, 2004; Schegloff, 2007).
However, from a researcher’s perspective, conversational transcript data are difficult to analyze quantitatively and involve many steps (see Fig. 1). First, sounds sometimes need to be converted into words (e.g., “uh-huh” or “[laughter]”). Then, all words need to be arranged in sentences and turns. Once the transcript is generated, researchers will notice how conversation data are high-dimensional—no two conversations are exactly alike. Within one conversation, every possible turn branches into an exponentially large decision tree containing what could be said next in quick, recursive cycles across multiple speakers. Although researchers can take advantage of the predictable aspects of conversational structure, they must also sift through the exponential complexity—they must make many judgment calls to determine which features are counted and how to do so from raw text.
The distinctiveness of dialogue versus single-voice text
For our purposes, conversations consist of dialogue² generated between two or more people over a series of turns. This definition primarily serves to distinguish conversations from documents authored from a single perspective, including speeches, essays, newspaper and magazine articles, books, product reviews, legal documents, and social media posts. Although the “great bulk” of language use is conversational (Levinson, 2016), single-voice documents have been the dominant source material in applied text analysis, and many review articles in related fields have focused only on single-voice documents (e.g., Benoit, 2020; Berger et al., 2020; Boyd & Schwartz, 2021; Dehghani & Boyd, 2022; Gentzkow et al., 2019; Grimmer & Stewart, 2013; Hansen & Ash, 2023; Hirschberg & Manning, 2015; Jackson et al., 2022; Pennebaker et al., 2003). Many of the techniques developed for single-voiced documents are also useful for studying conversations. However, the differences between single-voiced text and dialogue motivate different ways that researchers should capture and analyze conversation data.
First, unlike single-voiced text, conversations include multiple interchanging contributors. Each person’s contribution to the full conversation must be disambiguated (e.g., Who said what?). Second, conversations are generated on the spot and responsively, which puts a special priority on understanding the sequence of what is said, when it is said, and how it relates to adjacent conversational turns. Third, conversations are usually less thoroughly edited than single-voice documents, often because the turns are spontaneously composed. This means conversation entails looser sentence structure and breakdowns in the coordination of common ground, including more interruptions, cross-talk, silence, repairs, repetitions, misarticulations, clarifications, backchannels, conflicts, slurs, and jargon (Fox Tree, 2010). Lack of editing means conversations tend to have more spelling and grammatical errors and disfluencies (e.g., “umm,” “uh-huh”). Fourth, conversation often covers many topics and goals (Cooney et al., 2020; Yeomans et al., 2022), whereas most single-voiced documents focus on one or a small number of topics (e.g., product reviews or news articles). These complications of conversation pose many novel challenges (and opportunities) for researchers, even for researchers familiar with text analysis of single-voiced text data.
Managing conversation data sets: analyzing turn-level and speaker-level data simultaneously
Managing conversation data requires researchers to handle two distinct data sets: a turn-level data set (to examine conversational behavior) and a speaker-level data set (to compare that conversational behavior with preconversation or postconversation data, such as individual differences, experimental conditions, or outcomes). Thus, conversation analysis calls for analytical software (e.g., R or Python) that allows researchers to efficiently manage multiple data sets at once.
Turn-level data set: the contents of conversation
Most conversations can be discretized into a series of speaker “turns,” much like a screenplay or script. These data can be represented as a transcript in which each row contains information about a single turn—specifically, who was speaking, the words spoken during the turn, and time stamps indicating when the turn started and ended. This data structure requires the turn-level data set to have unique identifiers for every conversation or group (e.g., Group 1, 2, 3, 4 . . .), every turn in each conversation (e.g., Turn Number 1, 2, 3, 4, 5 . . .), and each speaker in the group (e.g., Speaker A, B, C . . .). We provide an example of a turn-level data set in Table 2.
Table 2. Example of a Turn-Level Data Set
Note: The column labels are “Group ID,” used to distinguish between conversations; “Turn,” an index for each turn in the conversation in order; “Start Time” and “End Time,” indicating the time span of the turn; “Speaker ID,” indicating the speaking participant; “Text,” what was said in the turn; “Question,” a code for whether the turn contained a question; “Laughter,” a code for whether the turn contained laughter; and “Word Count,” a count of words spoken during the turn.
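As a minimal sketch of the turn-level structure described above, the Python snippet below stores each turn as one record and derives three simple per-turn codes. The field names, the example text, and the coding rules (a question mark signals a question, a bracketed "[laughter]" tag signals laughter, and words are whitespace-separated tokens) are illustrative assumptions rather than the coding scheme behind the table.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    group_id: int      # identifies the conversation
    turn: int          # order of the turn within the conversation
    start_time: float  # seconds from the start of the recording
    end_time: float
    speaker_id: str
    text: str

    @property
    def question(self) -> int:
        # Crude assumption: a turn contains a question if it has a "?".
        return int("?" in self.text)

    @property
    def laughter(self) -> int:
        # Assumes the transcriber marked laughter as "[laughter]".
        return int("[laughter]" in self.text.lower())

    @property
    def word_count(self) -> int:
        # Count whitespace-separated tokens, skipping tokens that start with "[".
        return sum(1 for tok in self.text.split() if not tok.startswith("["))


turns = [
    Turn(1, 1, 0.0, 2.1, "A", "Hi, how are you today?"),
    Turn(1, 2, 2.4, 4.0, "B", "Pretty good [laughter] thanks."),
]

for t in turns:
    print(t.group_id, t.turn, t.speaker_id, t.question, t.laughter, t.word_count)
```

Real projects would usually replace these rules with validated feature extractors, but the row-per-turn layout stays the same.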
In general, the boundaries of each turn are determined by the time during which a single speaker is talking. Every new turn will involve a different speaker than the turn prior. Linguists distinguish the concept of a turn from that of an “utterance,” defined as a single continuous expression by a speaker. A turn can be composed of multiple utterances. For example, speakers could send several messages in a row before their partner responds. In that case, as a simplifying assumption, researchers typically collapse multiple consecutive utterances from a single speaker into a single turn.
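The collapsing step can be sketched in a few lines of Python. This is one possible implementation, assuming the utterances belong to a single conversation and are already sorted by time; the dictionary keys ("speaker", "start", "end", "text") are hypothetical names chosen for illustration.

```python
from itertools import groupby


def collapse_utterances(utterances):
    """Merge consecutive utterances by the same speaker into one turn.

    `utterances` is a time-ordered list of dicts from one conversation.
    """
    turns = []
    # groupby starts a new run every time the speaker changes.
    for speaker, run in groupby(utterances, key=lambda u: u["speaker"]):
        run = list(run)
        turns.append({
            "turn": len(turns) + 1,
            "speaker": speaker,
            "start": run[0]["start"],  # the turn starts with its first utterance
            "end": run[-1]["end"],     # and ends with its last one
            "text": " ".join(u["text"] for u in run),
        })
    return turns


utterances = [
    {"speaker": "A", "start": 0.0, "end": 1.0, "text": "hey"},
    {"speaker": "A", "start": 1.5, "end": 2.5, "text": "are you free later?"},
    {"speaker": "B", "start": 4.0, "end": 5.0, "text": "yes!"},
    {"speaker": "A", "start": 6.0, "end": 7.0, "text": "great"},
]
turns = collapse_utterances(utterances)
```

After this step, every turn has a different speaker than the turn before it, which matches the simplifying assumption described above.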
Speaker-level data set: data from outside the conversation
In a speaker-level data set, there is a unique row for each speaker. Each conversation will have multiple rows (one for each speaker in that conversation), and speakers who joined multiple conversations will have multiple rows (one for each conversation). The unique identifiers for the conversation (or group) and speaker included in the turn-level data set can be used to connect the speakers’ conversational behaviors to the speaker-level data set, which also contains the conversation (or group) and speaker identifiers in addition to other variables recorded before the conversation (e.g., random assignment, time of day, context, demographics) and after the conversation (e.g., self-reported survey items, negotiated outcomes). We provide an example of a speaker-level data set in Table 3.
Table 3. Example of a Speaker-Level Data Set With Round-Robin Design
Note: The column labels are “Group ID,” used to distinguish between conversations; “Speaker ID” and “Partner ID,” used to distinguish between participants in a conversation; “Age,” the age of the participant; “Gender” and “Partner Gender,” the gender of the participants in the conversation; “Condition,” which represents experimental assignment; “Liking” and “Partner Liking,” self-reported measures; “Questions,” total number of questions the speaker asked in that conversation; “Laughter,” total amount of speaker laughter in that conversation; “Turn,” total number of turns in the conversation; and “Word Count,” the word count of the speaker in that conversation.
Many researchers will conduct their final analyses in the speaker-level data set because many research questions focus on variation at the person level or context level. When this is the case, the turn-level data set is used to generate measures of conversational behaviors (e.g., the number of questions or interruptions), which are then summarized at the person level and tallied in the speaker-level data set (e.g., Speaker A in Group 4 asked 41 questions, used five hedges, and interrupted three times during the conversation). We provide further detail on this topic in Model Construction below.
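To illustrate how turn-level measures can be tallied into the speaker-level data set, here is a minimal Python sketch. The column names and values are invented for illustration, and the join assumes that the conversation and speaker identifiers are spelled identically in both data sets.

```python
from collections import defaultdict

# Hypothetical turn-level rows (one per turn).
turns = [
    {"group_id": 1, "speaker_id": "A", "question": 1, "word_count": 5},
    {"group_id": 1, "speaker_id": "B", "question": 0, "word_count": 3},
    {"group_id": 1, "speaker_id": "A", "question": 1, "word_count": 7},
]

# Hypothetical speaker-level rows (one per speaker per conversation).
speakers = [
    {"group_id": 1, "speaker_id": "A", "condition": "treatment", "liking": 6},
    {"group_id": 1, "speaker_id": "B", "condition": "control", "liking": 4},
]

# Sum turn-level features within each (conversation, speaker) pair.
totals = defaultdict(lambda: {"questions": 0, "word_count": 0, "own_turns": 0})
for t in turns:
    key = (t["group_id"], t["speaker_id"])
    totals[key]["questions"] += t["question"]
    totals[key]["word_count"] += t["word_count"]
    totals[key]["own_turns"] += 1

# Left-join the totals onto the speaker-level rows via the shared identifiers.
# Speakers with no turns keep zeros rather than dropping out of the data.
empty = {"questions": 0, "word_count": 0, "own_turns": 0}
speaker_level = [
    {**s, **totals.get((s["group_id"], s["speaker_id"]), empty)}
    for s in speakers
]
```

Keeping the join keyed on both identifiers matters because the same speaker label (e.g., "A") will usually recur across conversations.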
Capturing Conversation Data
There are considerable challenges involved in coercing conversation data into the data sets described above, and they vary based on modality. We focus on the two most common conversational modalities, in which words are either written as text or spoken out loud. Each of these major modalities presents unique challenges and opportunities for speakers and researchers (e.g., M. Berry, 2013; Boland et al., 2022; Meredith & Stokoe, 2014; Oba & Berger, in press).
In either case, the fixed cost of structuring a conversation data set is not trivial. Once it is done, a good data set can benefit many subsequent research projects (and, possibly, many different researchers). Thus, we encourage researchers to explore whether it is possible to pilot test their research ideas in data sets from past research, including in archives purpose built for conversation data (e.g., Chang et al., 2020; Liberman & Cieri, 1998; Miller et al., 2017; Reece et al., 2022). For similar reasons, we also encourage researchers to share their own data after they have structured it (see Data Sharing below).
Text-only conversations
Research on text-only conversation has proliferated in part because of the availability of text data, which are easy to record and store. Such data are often produced in massive Internet forums (e.g., Wikipedia, Twitter) or in catalogued archives (e.g., newspaper articles, books, legal documents, earnings calls) in which records are public and accessible to researchers (Hirschberg & Manning, 2015) or can be scraped from webpages using one of many available software tools. In addition, people often have records of conversations conducted by chat or email. Accordingly, some researchers use software that allows consenting participants to extract and share their own text or social media conversations (e.g., Stillwell & Kosinski, 2004). Researchers also collect their own text conversations within controlled experiments with emerging technologies such as ChatPlat (www.chatplat.com; K. Huang et al., 2017), iDecisionGames (www.idecisiongames.com), Smartriqs (Molnar, 2019), and survconf (Brodsky et al., 2022).
Text conversation can be easier to analyze than spoken conversation because the words are already transcribed during the conversation itself (by the speakers). The style of conversation conducted via text is also different; compared with voice conversation, text-only conversation tends to be more asynchronous, with more time for cognitive preparation, reflection, and processing within and between turns; clearer sentence structures; and fewer disfluencies (M. Berry, 2013; Meredith & Stokoe, 2014). Still, text-conversation data present unique challenges for researchers.
Turn boundaries
The time course of text-only conversation can be tricky to pinpoint because transcripts often include only one time stamp per turn: when a message is “sent” or “posted.” If the conversation is more synchronous (e.g., instant messages), the lag time between these stamps may be a useful signal of the time spent reading the last message or composing the next one. If the conversation is more asynchronous (e.g., email), the lag time may not be as informative.
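As a rough illustration of how lag times might be computed from "sent" time stamps, consider the Python sketch below. The message records, the key names, and the time stamp format are assumptions made for the example; real platforms export time stamps in many different formats.

```python
from datetime import datetime

# Hypothetical chat log with one "sent" time stamp per message.
messages = [
    {"speaker": "A", "sent": "2023-01-05 10:00:00", "text": "Are you coming tonight?"},
    {"speaker": "B", "sent": "2023-01-05 10:00:42", "text": "yes, what time?"},
    {"speaker": "A", "sent": "2023-01-05 10:01:10", "text": "7pm"},
]


def response_lags(messages, fmt="%Y-%m-%d %H:%M:%S"):
    """Seconds between each message's time stamp and the previous one's."""
    times = [datetime.strptime(m["sent"], fmt) for m in messages]
    lags = [None]  # the first message has no predecessor
    for prev, curr in zip(times, times[1:]):
        lags.append((curr - prev).total_seconds())
    return lags


lags = response_lags(messages)
print(lags)
```

Note that each lag mixes reading time and composing time, so it is best treated as a noisy signal, especially in asynchronous settings.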
In addition, in text conversations, people can compose their turns simultaneously, which can lead to multiple disjointed threads. When threads overlap, researchers must disentangle them by hand (or else accept some measurement error). Furthermore, most text platforms allow a single person to send multiple messages in a row, essentially replying to themselves. This can be simplified by combining consecutive messages from the same person into discrete, alternating turns or, alternatively, by treating each message as a separate turn.
Standardizing typing
In text-based conversation, people type their own transcripts. Writing style differs across people, cultures, languages, and time, and spelling and grammatical errors are common. There is a range of unique spellings in modern written language, including emojis (e.g., “☺”), variants (e.g., “oh nooo,” “woot!”), representations of sounds (e.g., “jajajaja,” “haha”), and acronyms (e.g., “tbh,” “lmk,” “lol,” “tldr,” “wtf”).
In many analyses, variants are simply ignored, especially if they are rare. However, some research questions might require attention to variants (e.g., grouping different kinds of typed laughter or unpacking emoji valence to detect emotional sentiment). Clear writing errors can be more pernicious given that most feature-extraction systems rely on correct spelling and grammar. To address this, we strongly recommend that a person look through each text at least once, perhaps assisted with spell-checking software, to fix obvious errors.
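For research questions that do require attention to variants, one simple option is to map them onto a shared token before feature extraction. The Python sketch below groups a few kinds of typed laughter. The regular expression and the "[laughter]" token are hypothetical choices for illustration; any such pattern should be checked by hand against the actual data because it will miss some variants and may catch unintended strings.

```python
import re

# Hypothetical pattern: two or more "ha"/"he"/"ja" syllables (with an optional
# trailing "h"), plus the acronyms "lol" (any number of o's) and "lmao".
LAUGHTER = re.compile(r"\b(?:(?:ha|he|ja){2,}h?|lo+l|lmao)\b", re.IGNORECASE)


def tag_laughter(text):
    """Replace typed-laughter variants with one shared token."""
    return LAUGHTER.sub("[laughter]", text)


examples = ["jajajaja that is great", "haha ok lol", "oh nooo"]
tagged = [tag_laughter(e) for e in examples]
print(tagged)
```

Replacing variants with a single token keeps the original word positions intact, so turn-level word counts and ordering are only minimally affected.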
Voice conversations
Research on spoken conversations usually requires additional steps because spoken words are expressed in continuous sound waves that must be discretized into words, sentences, and turns. Some high-stakes audio conversations are routinely transcribed (e.g., interviews, conference calls, government proceedings), and some researchers have examined such documents (e.g., D. S. Berry et al., 1997; Chen et al., 2018; Danescu-Niculescu-Mizil et al., 2012; Hansen et al., 2018). However, the burden of accurately transcribing conversations often falls on researchers themselves. With technological advances, automatic speech recognition (ASR) and speaker disambiguation have improved (Park et al., 2022), but they are still not nearly as good at parsing speech as human transcribers (Errattahi et al., 2018; Meier et al., 2021), and this is likely to remain true for some time. Furthermore, these automated tools are often trained on convenience data samples, so they may be most inaccurate for speakers from underrepresented groups, who may use an accent or vocabulary that is not well represented in the training data (Dehghani et al., 2015).
We urge researchers to put serious effort into assuring data quality both through preparation before the conversations happen and after they have been recorded. Here, we suggest a series of steps and several tips to capture research-quality voice conversations.
Record
Researchers often underestimate the importance of audio-recording quality. This is especially critical when researchers have complete control over the recording protocol (e.g., recording participants speaking to each other inside a behavioral lab). However, there are cases in which researchers have less control, for example, with the Electronically Activated Recorder (e.g., Kaplan et al., 2020; Mehl, 2017; Mehl et al., 2001), the Language ENvironment Analysis system (Ganek & Eriks-Brophy, 2016), and other experience-sampling methods that require people to carry microphones with them throughout the day. Furthermore, online experiments may have people conversing through their own home computers, which researchers do not control. Each of these protocols involves different considerations and constraints for optimizing audio quality. Across all these study designs, we urge researchers to test their recording setup in advance.
High-quality audio recordings will lead to higher-quality transcriptions later. If you are having trouble hearing words when listening to a recording, your transcriber (human or ASR) will certainly struggle. Make sure you can clearly identify what words are being said and by whom. Some of the main factors to consider include microphone quality (e.g., sensitivity, internally generated noise, distortion, and directional characteristics), speaker clarity, background noise, distance from the microphones, and reverb. Ideally, researchers should rely on solutions that do not place a burden on the speakers; for example, a change in microphone placement will be a more reliable fix than asking speakers to enunciate more clearly.
One common decision point for researchers is the number of audio recordings per conversation: Should the entire conversation be captured in one file, or should each individual be recorded separately? A single recording may seem easier to set up but may complicate the analysis later because audio-transcription services often struggle with speaker differentiation, especially when two speakers have similar-sounding voices. With only a single recording, transcribers must determine whether the person talking is (a) different from the previous turn (Did the speaker change?) and (b) the same as any of the previous turns (Has this person spoken before?). This task is especially difficult when speakers have similar speaking styles or vocal registers and as the number of speakers increases. Video recordings can help, although we have found that professional human-transcription services often do not look at videos.
When possible, we recommend collecting separate audio recordings for each speaker. This makes speaker differentiation simple and improves audio quality by moving microphones closer to each speaker. Fortunately, virtual meeting services (e.g., Zoom) record separate audio streams from each computer, which automatically differentiate speakers (if people have their own computer). Some services automatically combine these separate streams into a single turn-by-turn transcript (including Zoom and Microsoft Teams). If separate recordings are set up manually, they must then be combined and sorted into the correct order using the time stamps for each turn.
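As a rough illustration of the manual case, the sketch below interleaves two per-speaker turn lists by their start times. The tuple layout, the speaker labels, and the function name interleave_turns are hypothetical, and the sketch assumes that the time stamps in both recordings already share a common clock.

```python
from heapq import merge
from operator import itemgetter

# Hypothetical per-speaker transcripts: each turn is
# (start_seconds, speaker_id, text), already sorted by start time.
speaker_a = [
    (0.0, "A", "Hi, how are you?"),
    (5.2, "A", "Glad to hear it."),
]
speaker_b = [
    (2.1, "B", "Doing well, thanks."),
    (7.8, "B", "What about you?"),
]


def interleave_turns(*transcripts):
    """Merge sorted per-speaker turn lists into one list ordered by start time."""
    return list(merge(*transcripts, key=itemgetter(0)))


conversation = interleave_turns(speaker_a, speaker_b)
for start, speaker, text in conversation:
    print(f"{start:6.1f}  {speaker}: {text}")
```

Because each input list is already in order, a merge is enough; sorting the concatenated turns would give the same result for unsorted inputs.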
To connect the speaker-level data to the data collected outside the conversation (e.g., demographics and survey data), each speaker and each conversation must have a unique identifier that can be used to link the turn-level and speaker-level data sets. As a safety measure, researchers may consider reading the conversation identifier out loud at the beginning (or end) of the audio recording and using the identifier as the name of the audio file as well. Likewise, speakers in the conversation should say their unique speaker identifier as one of their first turns in the recording so their voices can be unmistakably matched to their conversation-level and speaker-level data.
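The linking step can be sketched as a simple join on the two identifiers. All field names and values below (conversation_id, speaker_id, age, condition) are invented for illustration; the point is only that every turn should find exactly one speaker record and that unmatched turns should be surfaced rather than silently dropped.

```python
# Hypothetical speaker-level data (e.g., from a survey), keyed by
# (conversation_id, speaker_id).
speakers = {
    ("c001", "s01"): {"age": 34, "condition": "warm"},
    ("c001", "s02"): {"age": 29, "condition": "tough"},
}

# Hypothetical turn-level data; speaker "s03" has no survey record.
turns = [
    {"conversation_id": "c001", "speaker_id": "s01", "turn": 1, "text": "Hello!"},
    {"conversation_id": "c001", "speaker_id": "s02", "turn": 2, "text": "Hi there."},
    {"conversation_id": "c001", "speaker_id": "s03", "turn": 3, "text": "Sorry I'm late."},
]


def link_turns_to_speakers(turns, speakers):
    """Attach speaker-level fields to each turn; collect turns with no match."""
    linked, unmatched = [], []
    for turn in turns:
        key = (turn["conversation_id"], turn["speaker_id"])
        if key in speakers:
            linked.append({**turn, **speakers[key]})
        else:
            unmatched.append(turn)
    return linked, unmatched


linked, unmatched = link_turns_to_speakers(turns, speakers)
print(len(linked), "linked turns;", len(unmatched), "unmatched")
```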
We recommend conducting a few test recordings that run through as much of the workflow as possible. The researchers should check to see that the file records well (that the audio is clear and that the spacing of microphones and speakers is appropriate), that it can be played back properly, that it is saved in a format that is compatible with the intended transcription method, and that the researcher can match each recording and each speaker to the metadata. Finally, do not forget to press “record.”
Transcribe
Transcriptions of the audio recordings will form the foundation of the turn-level data set. There are several approaches to generate transcriptions from audio files. Most commonly, researchers pay traditional transcription services, which hire trained humans to type words while they listen to audio recordings. However, this approach is often inadequate (and expensive): The quality is inconsistent, typos are inevitable, and transcribers use different formatting methods (even within the same company). Some researchers hire research assistants to transcribe. Although this affords more control over formatting, the training can be long, and the work can be arduous and inefficient. Others use automated speech-recognition software. Although software is not yet as accurate at recognizing words as the best trained humans, it produces the most precise time stamps and delivers consistent formatting and spelling.
We strongly recommend a hybrid approach, combining automated speech-recognition software with trained humans, which is both accurate and cost-effective. First, automatic speech-recognition software can generate a low-cost first-draft transcription, tackling the easiest sections of the transcript quickly and producing transcripts with consistent formatting and reliable time stamps. Then, this initial draft of the transcript can be edited by a human, who can focus time and attention on the more difficult tasks, such as speaker differentiation and correcting any passages with low-quality audio.
It is important to establish consistent formatting conventions early. Many transcription services (human and machine) export their data in text documents (e.g., Microsoft Word, PDF) rather than tabular files (e.g., Microsoft Excel, CSV). However, as long as all files have a consistent format, researchers can write code to parse the text files into an analyzable tabular format. Subtitle file formats (.VTT files) are also common for mapping utterances to time stamps, and these files can be processed into tabular formats automatically in R (Knight, 2023).
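To make the parsing step concrete, here is a minimal Python sketch that turns a small subtitle file into rows of start time, end time, speaker, and text. It is not the R tool cited above. The sample file, the function names, and especially the "Name: text" speaker prefix are assumptions; services differ in how they label speakers, so the splitting rule would need to match the actual export.

```python
import re

# A tiny WebVTT sample. The "Name: text" speaker prefix is an assumption
# about how a given transcription service labels speakers.
SAMPLE_VTT = """WEBVTT

1
00:00:01.000 --> 00:00:03.500
Alice: Hi, how are you?

2
00:00:04.100 --> 00:00:06.000
Bob: Doing well, thanks.
"""

# Matches "hh:mm:ss.ttt --> hh:mm:ss.ttt" (the hours part is optional).
CUE_TIME = re.compile(
    r"((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(stamp):
    """Convert 'hh:mm:ss.ttt' or 'mm:ss.ttt' to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_vtt(vtt_text):
    """Return one dict per cue with start, end, speaker, and text."""
    rows = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIME.match(line)
            if match:
                text = " ".join(lines[i + 1:]).strip()
                speaker, sep, rest = text.partition(": ")
                if not sep:  # no speaker prefix found
                    speaker, rest = None, text
                rows.append({
                    "start": to_seconds(match.group(1)),
                    "end": to_seconds(match.group(2)),
                    "speaker": speaker,
                    "text": rest,
                })
                break
    return rows


rows = parse_vtt(SAMPLE_VTT)
for row in rows:
    print(row)
```

One known limitation of this sketch: a cue with no speaker label whose text happens to contain a colon followed by a space would be split incorrectly, which is one more reason to settle formatting conventions before parsing.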
There are many automatic transcription services available today (e.g., Otter, Temi, Amberscript, Descript, Trint, Sonix, Happy Scribe, Wreally, Ebby, Scribie; Table A1), and new services and iterations are rapidly emerging (e.g., OpenAI’s Whisper tool, which was released during the revision process of this article). In September 2020, we systematically tested 10 of the most popular transcription services available. Each service transcribed the same series of audio recordings, and we evaluated the services along the following dimensions: (a) transcription accuracy, (b) speaker differentiation, (c) incorporation of time stamps, (d) user-friendliness, and (e) pricing. We summarize our findings in the Appendix.
This review is not meant to be definitive. Rather, its primary purpose is to demonstrate how researchers might test and compare various transcription tools. Automatic speech-recognition products and services have been rapidly evolving over time. Thus, we strongly encourage readers to conduct their own contemporaneous search at the time they require these services, evaluating their options based on the dimensions we list above. Researchers’ needs may also vary depending on what is best for their projects, so there is not a single best transcription service for everyone. However, we believe a hybrid transcription approach—automated transcription followed by human correction—is and will remain the most cost-efficient way to produce accurate, research-quality transcripts, at least in the near term.
Check
Automated transcription services have become more accurate over time, but they are not perfect (and neither are human transcribers). We strongly recommend asking people to listen to the audio recording while reading through the transcript, fixing any mistakes, and ensuring that formatting conventions are consistent throughout.
For example, transcription services have different policies about how to demarcate inaudible moments. Many will simply skip over this moment and leave a blank, whereas others will flag this with “[inaudible],” sometimes with a time stamp including duration. Our preference is typically to use the “[inaudible]” flag, which can be removed as needed; either way, it is essential to be consistent throughout. Furthermore, there are many paralinguistic features that may be ignored by some transcription services. Common examples of these are laughter (“[laughter],” “[laughs],” or “[laughing]”) and interruptions (“[interruption],” “[interposing],” or “–” at the start of an interrupting turn). Similar approaches are taken for other paralinguistic cues, such as sighing, singing, crying, yelling, whispering, or cross-talk. Research questions should inform your approach: If laughter is important, make sure you annotate it and do so consistently.
Checking transcripts can also uncover errors in the time stamps. One common source of error is typos from human transcribers. Large errors can often be detected in later analyses (e.g., typos often result in negative or very long interturn pauses), although smaller errors also happen. When speakers are recorded separately, their time stamps may be aligned to different benchmarks in each recording (e.g., if the recordings start at different times). In this case, time stamps must be realigned to a common reference time before the transcripts from each recording are merged.
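Both checks can be sketched in a few lines. The example below shifts each recording onto a shared clock and then flags interturn pauses that are negative or implausibly long. The offsets, the field names, and the 30-second threshold are all assumptions chosen for illustration. Flagged turns are candidates for manual review rather than confirmed errors, because genuine overlapping speech can also produce a negative pause.

```python
# Hypothetical offsets: when each recording started, in seconds, relative to
# a shared reference clock (assumed known, e.g., from a spoken marker).
recording_offsets = {"rec_A": 0.0, "rec_B": 12.0}

# Hypothetical turns; start and end are relative to each recording's start.
turns = [
    {"recording": "rec_A", "speaker": "A", "start": 14.0, "end": 17.0},
    {"recording": "rec_B", "speaker": "B", "start": 5.5, "end": 9.0},
    {"recording": "rec_A", "speaker": "A", "start": 21.4, "end": 25.0},
    {"recording": "rec_B", "speaker": "B", "start": 103.0, "end": 106.0},  # possible typo
]


def realign(turns, offsets):
    """Shift each turn onto the shared clock and sort by start time."""
    aligned = []
    for turn in turns:
        shift = offsets[turn["recording"]]
        aligned.append({**turn,
                        "start": turn["start"] + shift,
                        "end": turn["end"] + shift})
    return sorted(aligned, key=lambda t: t["start"])


def flag_pauses(turns, max_pause=30.0):
    """Return (turn_index, pause) where the gap before a turn is negative
    or longer than max_pause seconds (an arbitrary illustrative threshold)."""
    flagged = []
    for i in range(1, len(turns)):
        pause = turns[i]["start"] - turns[i - 1]["end"]
        if pause < 0 or pause > max_pause:
            flagged.append((i, round(pause, 3)))
    return flagged


aligned = realign(turns, recording_offsets)
flagged = flag_pauses(aligned)
print(flagged)
```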
It is often useful to have human coders fix errors made by the speakers themselves, too, unless those errors are of research interest (e.g., self- and other-initiated repairs are important conversational phenomena). Some examples include the following:
Include and standardize the spelling of backchannels (e.g., “yeah,” “uh-huh,” “oh”).
Remove erroneously repeated words (e.g., “I thought you . . . thought you were ready”).
Include punctuation (e.g., question marks, periods, commas, ellipses).
Change “gonna,” “sorta,” “dunno,” and so on to “going to,” “sort of,” “don’t know,” and so on.
Correct misspoken words in cases in which the intended meaning is clear (e.g., “nice to mate you”).
There can be subtle but important differences in meaning among nonstandard variations (e.g., “yes,” “yup,” “yasss”). However, there is a trade-off between specificity and statistical power. In general, differentiation could be reasonable if there is an adequate sample size of each variation and if the distinctions matter for the research questions at hand. Otherwise, it may be best to aim for consistency (e.g., “yes” to study linguistic affirmation broadly).
Although transcript checking can be monotonous, the process can be designed efficiently. We typically find it easier for research assistants to complete all tasks for one document at a time rather than completing one task for all documents before moving to the next task. However, to batch tasks like this, you must plan your checking needs in advance. For more efficiency, error checking can also be batched with human-feature annotation (see Feature-Extraction Objectives below).
Extracting Features From Text
Perhaps the most daunting task for conversation researchers is to decide which features to extract from the transcripts. Each “feature” can be thought of as a measure of one behavior in the transcript (e.g., the number of first-person pronouns, the percentage of words that mention food, the average length of pauses). There are a large (and increasing) number of tools available to researchers for this task, and researchers are presented with a wide array of options, even for measuring the same underlying construct (Schweinsberg et al., 2021; Yeomans, 2021).
We offer a brief review of common techniques with a special focus on the challenges of studying dialogue data (vs. single-voice documents). Although tools for these steps are available in several software environments, we point readers to tools in the R software language. However, we note that Python also has many excellent tools for NLP (ConvoKit, in particular; Chang et al., 2020). Note that both Python and R allow users to manage the two data sets—turn-level and speaker-level data—simultaneously. This means researchers can integrate their feature-extraction code with their analysis code (see Model Construction below).
Feature-extraction objectives
Before we introduce common feature-extraction methods below, we first describe the important dimensions on which these methods can differ. This is important because there is no one “correct” approach. Instead, researchers must choose techniques on the basis of their own idiosyncratic objectives and constraints, which are determined by their skill set, audience, research goals, resources, deadlines, and so on. Each of these dimensions should be considered when choosing a feature-extraction method.
Accuracy
First and foremost, researchers should hope the features they extract from text data are valid, accurate measures of the underlying behavior or belief (Flake & Fried, 2020). Thankfully, accuracy can be evaluated empirically within a validation data set that has labels that can be treated as “ground truth” for comparison. For example, a turn-by-turn measure of question asking should correlate as highly as possible with the true number of questions in each turn.
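The question-asking example can be sketched as follows. A deliberately naive extractor (counting question marks) is compared with hand-labeled counts using a Pearson correlation. The validation turns and their labels are invented for illustration, and counting question marks is a stand-in, not a recommended measure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def count_question_marks(turn):
    """A deliberately naive question-asking feature."""
    return turn.count("?")


# Hypothetical validation set: (turn text, human-labeled number of questions).
validation = [
    ("How are you? Did you eat?", 2),
    ("I am fine.", 0),
    ("Tell me where you went", 1),  # labeled a question despite no question mark
    ("Really?", 1),
    ("That sounds great.", 0),
]

predicted = [count_question_marks(text) for text, _ in validation]
truth = [label for _, label in validation]
r = pearson(predicted, truth)
print(f"correlation with human labels: {r:.2f}")
```

The third turn shows how the naive feature misses a question that the human label captures, which lowers the correlation below 1.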
However, accuracy is not an inherent property of any method—it can be defined only within a particular population of interest. For instance, a model trained to label different types of questions in a doctor’s office may not be as valid for labeling question types in a job interview. Researchers should be explicit about their intended populations and the boundary conditions of their results (Simons et al., 2017). They should also routinely conduct tests of “transfer learning” (Weiss et al., 2016; Yeomans, 2021) by explicitly testing how well their methods perform when they are developed in one context and applied to data from a different context.
Fairness
Bias is a concern shared by both humans and artificial-intelligence (AI) systems. Just as humans are prone to unconscious biases (Greenwald & Banaji, 1995), AI models can exhibit algorithmic bias (Kordzadeh & Ghasemaghaei, 2022). Mitigating this bias is essential to ensuring the accuracy and fairness of research outcomes regardless of the initial source.
Because language models learn about the world from the data used to train them, training on biased language data may unwittingly produce models that reinforce and codify prejudice, stereotypes, or other unsavory aspects of human judgment (Caliskan et al., 2017). And, as is often the case with historical (and present-day) data sets, the speakers in the training data may themselves be biased or prejudiced. Sometimes this bias is the subject of research inquiry itself; however, if the focus is on other aspects of human behavior, this bias can undermine the goals of the research. This is especially true when a model or estimate is used to make decisions that affect real people. Consider, for example, an algorithm used to match job candidates to job postings using similarity to exemplars in past training data. If those training data reflect a past in which some demographic groups (e.g., women, minorities) were excluded or discouraged from leadership roles, then a model trained on those data may reinforce that bias going forward. For example, an algorithm employed for recruitment at Amazon was later shown to discriminate against female applicants because the data it learned from showed that most leaders tended to be male (Dastin, 2018).
The accuracy of a model can thus vary across social groups in ways that may have biased consequences for the outcomes of those group members. Models trained on only one kind of speech, such as data from the most commonly studied sources (e.g., from demographic majority groups, from American-English speech), may be much less accurate when they parse speech from groups that are historically underrepresented, from speakers outside the United States, or from others who, for whatever reason, were not included in the training data (Koenecke et al., 2020). This is an issue for all kinds of slang, jargon, and other language that is contextually or socially determined, and this type of language is very common in conversation.
There are no surefire techniques that can ensure a model is unbiased. One approach that has grown more common in recent years is to conduct an “algorithm audit” in which AI systems are evaluated to ensure they work as expected and do so without bias or discrimination (Brown et al., 2021; Koshiyama et al., 2022). Moreover, transfer-learning tests, as described in the Accuracy section, are very useful: By comparing how a model’s accuracy varies across different populations, researchers can evaluate whether particular groups may be adversely affected. When transfer-learning tests are not possible, researchers should explicitly acknowledge the limitations of their training data so that their tools are not misused by others. To improve the model itself, researchers should try to find training data that best represent the people involved, perhaps even oversampling from less numerous groups so that they are accounted for in the model. Above all, we recommend not taking model outputs as ground truth; instead, researchers should try to interpret and understand their models as much as possible, evaluate the contents using their own domain expertise, and be as thorough as possible in making sure the model is behaving as expected.
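One simple piece of such a check is to compute accuracy separately for each group in a labeled validation set and inspect the gap. The sketch below does only that. The group names, labels, and records are invented, and a real audit would also need adequate sample sizes per group and measures beyond raw accuracy.

```python
from collections import defaultdict

# Hypothetical validation records: (group, predicted_label, true_label).
records = [
    ("group_1", 1, 1), ("group_1", 0, 0), ("group_1", 1, 1), ("group_1", 0, 0),
    ("group_2", 1, 0), ("group_2", 0, 0), ("group_2", 0, 1), ("group_2", 1, 1),
]


def accuracy_by_group(records):
    """Return the share of correct predictions within each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, predicted, true in records:
        totals[group] += 1
        hits[group] += int(predicted == true)
    return {group: hits[group] / totals[group] for group in totals}


group_accuracy = accuracy_by_group(records)
accuracy_gap = max(group_accuracy.values()) - min(group_accuracy.values())
print(group_accuracy, "gap:", accuracy_gap)
```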
Interpretability
Behavioral scientists are rarely concerned only with prediction accuracy. They also seek to understand and explain how people behave, which means they also need to understand what drives the results of their statistical models. Interpretability allows researchers to scrutinize their models so that they might improve them and think about how well they might generalize to new contexts (Bianchi & Hovy, 2021). Improving interpretability can also improve fairness by allowing users (including regulatory bodies) to evaluate the model’s strengths and failings in detail (Doshi-Velez et al., 2017; Rudin, 2019), and users generally trust models more when they understand them (Gilpin et al., 2018; Yeomans, Shah, et al., 2019). We recommend a similar skepticism from researchers—so-called black-box methods that are not explained should not be relied on to provide scientific insights.
Although interpretability is almost universally desirable, it is difficult to define or quantify it precisely (Lipton, 2018). But generally speaking, models can be made more interpretable along two dimensions. First, the methods themselves should be transparent. Their exact content, code, and training procedure should be shared and benchmarked against related models across diverse contexts (Mitchell et al., 2019). However, transparency is necessary but not sufficient—many modern NLP models are still too complex to scrutinize, even by experts (Bender et al., 2021). More troublingly, this information is often not shared because of expediency and to prioritize individual success over progress as a field (Belz et al., 2021). For example, the DICTION software package provides only broad generalities about how its features are scored or how its formulae were determined and validated (Hart, 2001) even though its license fee is much higher than open-source models that are much more transparent.
Beyond transparency, models can be made more interpretable by generating outputs in addition to raw feature scores. One approach is to use the model scores to find excerpts from the dialogue that highlight contrasting levels of a given measure (e.g., high vs. low warmth; follow-up vs. switch question). Often, researchers can also extract coefficients directly from the model to reveal which features most affect a model’s output (e.g., K. Huang et al., 2017; Voigt et al., 2017). Even when researchers must rely on an uninterpretable model because of its high accuracy (e.g., human annotators or black-box NLP), they should still try to understand its workings. One approach is to train a simpler model that approximates the predictions of the more complex one and interpret that simpler one instead (Madsen et al., 2021; Ribeiro et al., 2016).
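The excerpt-based approach can be sketched very simply: rank turns by their model score and read the extremes side by side. The turns and warmth scores below are invented, and the function name contrasting_excerpts is hypothetical; any scoring model could supply the scores.

```python
# Hypothetical turns and model-assigned warmth scores (higher = warmer).
turns = [
    "Thanks so much, that is really helpful!",
    "Fine.",
    "I appreciate you taking the time.",
    "Whatever you say.",
    "Sounds good to me.",
]
warmth = [0.92, 0.10, 0.85, 0.05, 0.55]


def contrasting_excerpts(turns, scores, k=2):
    """Return the k lowest- and k highest-scoring turns for side-by-side reading."""
    if len(turns) != len(scores):
        raise ValueError("turns and scores must have the same length")
    ranked = sorted(zip(scores, turns), key=lambda pair: pair[0])
    low = [text for _, text in ranked[:k]]
    high = [text for _, text in reversed(ranked[-k:])]
    return {"low": low, "high": high}


excerpts = contrasting_excerpts(turns, warmth)
for level, examples in excerpts.items():
    print(level, examples)
```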
Scalability
Researchers usually need to anticipate the costs of calculating and extracting features at a large scale. All feature-extraction methods involve direct resource costs. These costs come in the form of upfront investment (e.g., learning how to use a new software package or developing an annotation scheme) and in the marginal cost of applying a method to new data (e.g., computation or annotation time). There are other limitations that affect the costs of implementing different methods. For example, when data are proprietary, identifiable, or otherwise sensitive, some methods (e.g., human annotators reading raw text) may come under more intense scrutiny from stakeholders than other, less invasive methods (e.g., computing average turn length).
Complexity
Many of these objectives are related to the complexity of a feature-extraction method, even though complexity is not itself an objective. Complex features tend to be costlier to implement, but this extra effort is typically justified because of improved accuracy, fairness, or interpretability. Conversation is itself complex, so a perfectly accurate feature extractor would have to be correspondingly complex. Instead, researchers often settle on a trade-off between acceptable effort and acceptable accuracy, and this can be done iteratively: Simpler measures can be used first, and if they are insufficient, then more complex measures can be used. To borrow an idiom, before investing in a more complex method, researchers should first consider if “the juice is worth the squeeze.”
Complexity is often related to the scope of information needed from the transcript to identify a single feature, whether it is responsiveness, warmth, question types, expressions of gratitude, disfluency, or interturn pause length. The simplest and most common methods treat a person’s turns as a block of static text, as if they were single-voice documents (see Static-Text Features). This allows researchers to draw on the large tool kit from single-voice document analysis. However, this ignores the features of text that make conversation unique. For example, some features incorporate the time stamps from the transcripts (see Timing Features). Many other features look at consecutive sequences of turns to understand the structure of how speakers are interacting (see Interactive Features). We illustrate these different input scopes in Figure 2.
NLP versus human annotation
Before computational tools were available, researchers traditionally annotated conversations, scoring various features in transcripts by hand. In theory, any annotation task done by a human could be attempted with an algorithm instead and vice versa. Thus, it is tempting to see NLP as a potential substitute for human labor to automate simple workloads and reduce time spent reading.
However, we argue the opposite: Researchers should consider NLP as a complement to human work. These algorithms make close reading more powerful because they can be used to scale up and interpret human insights. Humans can develop typologies and provide labels to train supervised algorithms. Researchers themselves can read their corpora to guide their intuitions on which algorithms might be the best fit for their data and context.
Advantages of humans
Human and algorithmic feature extraction have contrasting strengths and weaknesses. For example, many conversational phenomena are too complex for current tools to automatically detect with sufficient accuracy. In these cases, trained human annotators usually produce more accurate labels and can be used as the “gold standard” for evaluating NLP performance (Bommasani et al., 2021). Human annotators can use their knowledge about the social context of a conversation to frame their responses, whereas an algorithm typically applies the same scoring rule regardless of context. For example, humans use their knowledge about speakers and context to infer sarcasm, whereas algorithms are typically built to take all of a speaker’s words at face value. Humans are better at understanding nuanced meaning amid social exchange.
Limitations of humans
People can be inconsistent from day to day and between one another—annotators almost always have some amount of disagreement. Furthermore, their thought processes may be hard to know or interpret (Nisbett & Wilson, 1977). Annotators often do not—or cannot—give precise reasons for their judgments. Although the exact protocols used to train the annotators can be shared, this does not guarantee that human annotators followed them or followed them in the same way. Thus, algorithms are not the only black-box feature extractors used in research—humans can be black boxes, too.
Humans can suffer from many of the same problems that algorithms do. Accuracy within and across domains is always a concern. When human annotators perform poorly, it can be hard to know if the task is inherently difficult, human judgment is too subjective, or the annotators are lacking the right training. Human annotators can treat people unfairly because of historical bias and prejudice or inexperience in the domain, among other reasons (Denton et al., 2021). All of the tools available to interpret algorithmic judgments should be used to scrutinize human annotations for unintended biases or blind spots.
Costs of human annotation
The costs of using human annotators are typically higher than using an algorithm. Much of this difference lies in the marginal costs of annotating new data—annotator time scales linearly with the amount of data, whereas the marginal cost of automatically processing more data is trivial once an algorithm is built. However, there are upfront fixed costs for both. For humans, researchers must establish clear definitions and protocols for assigning labels. Annotators then practice until they reach sufficient agreement on training cases. Researchers may revise their protocols during training, as their definitions are applied to edge cases in real data. This process is iterative: drafting a scheme, then testing it individually and via group discussion, revising the scheme, and retesting. These details are usually context-specific, and researchers should work with domain experts to develop their annotation schemes.
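One way to put a number on "sufficient agreement" between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The text above does not prescribe a particular statistic, so kappa is offered here as one common option. The labels and annotator data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical question-type labels from two annotators on six turns.
annotator_1 = ["follow-up", "switch", "follow-up", "switch", "follow-up", "switch"]
annotator_2 = ["follow-up", "switch", "switch", "switch", "follow-up", "follow-up"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on four of six turns, but half of that agreement would be expected by chance, so kappa is much lower than the raw agreement rate.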
Often, researchers try to reduce annotation costs by crowdsourcing label generation to pools of online workers (e.g., from Mechanical Turk). However, crowdsourced workers have their own problems. They are hard to train, do not provide good feedback during protocol development, and can be inattentive. The task must be cleverly allocated across many workers because each one can label only part of the data set (e.g., Benoit et al., 2016; Kiritchenko & Mohammad, 2017). Accuracy concerns are less relevant for simple tasks and can be mitigated in part by averaging over many annotators (although this reduces their cost advantage).
In general, we have found that if annotation tasks are sufficiently complex, a pair of in-house research assistants can produce more accurate labels than a larger pool of crowdsourced workers. Moreover, in-house annotators can complete the necessary checking and cleaning tasks described above (also see Check section).
Human-algorithm hybrids
As with transcription, a hybrid approach may be useful during feature extraction. Human annotations can be used to train interpretable algorithms that reproduce human judgments. This approach identifies the linguistic features that are driving the humans’ judgments. A side benefit to this hybrid approach is that if the resulting algorithm is accurate, it can be directly applied on new data without having to recruit new human annotators. In addition, rough algorithmic approaches can be used as a first pass to focus the efforts of human annotators.
We used this workflow ourselves in K. Huang et al. (2017) when we wanted humans to annotate different question types. First, we applied a simple algorithm to identify turns that included a question (to assist the humans’ search through the transcript). Then, human research assistants coded these questions as one of several question types. After the human annotations were collected, the consensus labels were then fed back into a supervised learning algorithm to train a question-type detector. The final model included both the initial search filter and the supervised model so that it could reproduce the human annotators’ judgments at scale. It was trained on 4,209 annotated question turns within 368 conversations from a lab experiment and then applied to an observational data set with 987 conversations and 19,321 question turns.
Static-text features
There are many review articles covering different methods for extracting features from single-voice documents. For brevity, we review the most common methods and focus on why they may function differently in dialogue. These methods treat turn content as though it were from a single-author document, such as a news article. However, individual turns vary wildly in word count. In practice, this means many turns from one speaker are collapsed into a single piece of text (this is discussed in detail in Aggregating Conversation Features below).
Counting words
A common, straightforward approach to analyze text is the “bag of words” approach: Count each word that occurs at least once, ignoring order. This can produce a very large feature set (perhaps thousands of different words in a single conversation). There are many preprocessing steps commonly used to smooth out the raw counts, including reducing words to their stems, expanding contractions, removing rare words, removing common “stop words,” and constructing “n-grams” (two- or three-word phrases).
These techniques improve models, but they should be considered in light of the specific research questions that are being addressed (Denny & Spirling, 2018). Conversation has a lot of stylistic and structural language, which tends to be determined by the more common function words: pronouns (“you,” “they”), adpositions (“to,” “from”), determiners (“the,” “your”), and adverbs (“mostly”). For example, question words (“who,” “what,” “where,” “when,” “why,” “how,” “which”) are essential for determining what types of questions people are asking (K. Huang et al., 2017; Zhang et al., 2017). However, these words tend to get removed by most off-the-shelf stop-word lists, which were typically built for single-voice text.
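The stop-word concern is easy to see in a small sketch. The example below counts words (and optionally n-grams) for one turn, first with a generic stop-word list and then with the question words removed from that list. The stop-word list is a short invented one, not a published list, and the tokenizer is intentionally crude.

```python
import re
from collections import Counter

# A small illustrative stop-word list (an assumption, not a published list).
STOP_WORDS = {"the", "a", "to", "is", "you", "what", "why", "how", "do", "it", "that"}
QUESTION_WORDS = {"who", "what", "where", "when", "why", "how", "which"}


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text, stop_words=frozenset(), ngram_max=1):
    """Count words and, if ngram_max > 1, n-grams built after stop-word removal."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    counts = Counter(tokens)
    for n in range(2, ngram_max + 1):
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts


turn = "Why do you think that? What is the plan?"
generic = bag_of_words(turn, STOP_WORDS)
dialogue_aware = bag_of_words(turn, STOP_WORDS - QUESTION_WORDS)
print(sorted(generic))
print(sorted(dialogue_aware))
```

With the generic list, nothing in the counts signals that the turn contains two questions; keeping the question words preserves that information. Note that in this sketch n-grams are formed after stop words are dropped, so a bigram can join words that were not adjacent in the original turn.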
Dictionaries
Dictionaries are lists of words generated by expert human annotators that give scores to words that group them into simpler dimensions of meaning. For example, a “food” dictionary would give all the words relating to food (e.g., “pizza,” “broccoli”) a score of 1 and the rest of the words (e.g., “bicycle,” “reading,” “heavenly”) a score of 0. Other dictionaries assign each word a score on a continuous scale using average ratings (e.g., concreteness; Coltheart, 1981; Warriner et al., 2013). To calculate the summary score for the whole text, the scores of the individual words within it are averaged. For binary dictionaries, this score is the percentage of words that comes from a dictionary.
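The two scoring rules described above can be sketched as follows. The word lists and ratings are invented and do not come from any published dictionary. For the continuous case, this sketch averages only over words that have a rating, which is one possible convention and an assumption here.

```python
import re

# Hypothetical dictionaries, invented for illustration only.
FOOD_WORDS = {"pizza", "broccoli", "pasta"}  # binary dictionary
CONCRETENESS = {"pizza": 4.9, "bicycle": 4.8, "idea": 1.6, "heavenly": 2.0}


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def binary_dictionary_score(text, dictionary):
    """Percentage of words in the text that appear in the dictionary."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 100 * sum(t in dictionary for t in tokens) / len(tokens)


def continuous_dictionary_score(text, ratings):
    """Mean rating over the words that have a rating; None if none do."""
    rated = [ratings[t] for t in tokenize(text) if t in ratings]
    return sum(rated) / len(rated) if rated else None


text = "I ate pizza and broccoli on my bicycle"
food_score = binary_dictionary_score(text, FOOD_WORDS)
concreteness_score = continuous_dictionary_score(text, CONCRETENESS)
print(food_score, concreteness_score)
```

Two of the eight words are in the food list, so the binary score is 25%. Because turns vary widely in length, very short turns can produce extreme percentages from a single matching word.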
Dictionaries are common and accessible. The Linguistic Inquiry and Word Count (LIWC) is probably the most often used NLP tool in psychology (Tausczik & Pennebaker, 2010) because it requires no special skill to conduct analyses and many features are simple to understand (e.g., first-person pronouns, words about music). Although dictionaries can be quite useful, users should be aware of their limitations. Most obviously, dictionaries (like bag of words) ignore the order of words, sentences, phrases, and topics, that is, how verbal content unfolds in sequence. For example, most dictionaries do not account for negations (“not bad” vs. “bad”) or relative magnitude (“very bad” vs. “bad” vs. “terrible”; although see Hutto & Gilbert, 2014). Furthermore, the interpretation of dictionary results is often lacking. Although it is tempting to simply take the title of a dictionary at face value, its meaning should be determined from the actual words it contains and the procedure by which it was created and validated. Sometimes these details are not shared publicly.
Furthermore, authors should make sure the dictionary is capturing what is intended in their context by comparing texts from their data with the dictionary’s scores, perhaps starting with texts that get especially high or low scores. Most dictionaries implicitly assume domain-generality, that is, that the contained words each have a single, stable meaning (Hamilton et al., 2016). This is not always true in conversation (Boyd & Schwartz, 2021; Eichstaedt et al., 2021; Yeomans, 2021). For example, even something simple such as emotional sentiment (e.g., positive words minus negative words) can fail to measure closely related concepts such as the experience of happiness or well-being of the speaker (Beasley & Mason, 2015; Jaidka et al., 2020; Kross et al., 2019; Sun, Schwartz, et al., 2020) or the nuances of how a business or product is being described (Frankel et al., 2022; Rocklage et al., 2022). Although domain-specific dictionaries can help address these concerns (e.g., Loughran & McDonald, 2016), the boundary for what is in- versus out-of-domain is not always clear, and researchers are usually best off conducting their own in-domain validation (Benoit et al., 2019; Yeomans, 2021).
Sentence structure
Modern NLP tools can extract not just the words themselves but also the underlying structure of sentences—that is, the grammatical parsing of sentences into subjects, verbs, objects, modifiers, clauses, and so on. This improves the features extracted from a typical bag-of-words model by making use of structures that determine meaning—for example, negations (“bad” vs. “not bad”), named entities (“apple” the company vs. the fruit), and homonyms (“like” the positive-valence verb vs. “like” the valence-neutral adposition). Researchers can use pretrained neural-network models (Honnibal & Johnson, 2015; Manning et al., 2014, 2020) to generate grammar tags for each word and then build features using the tagged set.
These tools have been effectively applied to measure markers of politeness from individual turns (Danescu-Niculescu-Mizil, Sudhof, et al., 2013; Voigt et al., 2017; Yeomans et al., 2020; Yeomans, Kantor, & Tingley, 2018). In conversational text, politeness features often succeed at capturing the robust dimensions of how speakers structure their conversational turns—agreement, disagreement, acknowledgment, hedging, gratitude, subjectivity, apologies, greetings, and goodbyes. Models trained on these dimensions have generalized well across multiple domains because they focus on structural and stylistic features rather than the main content features that tend to define a domain (e.g., specific nouns and verbs). Figure 3 provides an example of politeness features extracted from a data set to show the differences in linguistic style that result from a randomized preconversation assignment to condition.

Fig. 3. An example graph showing dialogue features extracted from negotiation transcripts (Jeong et al., 2019) using the politeness R package (Yeomans, Kantor, & Tingley, 2018). (Top) Comparison of the feature usage between buyers and sellers. (Bottom) Comparison of the feature usage of buyers instructed to be warm and friendly versus tough and firm. All bars show group means and standard errors. Note that plots show feature counts per 100 words because buyers (especially buyers instructed to be warm) use many more words than sellers.
Embeddings
A common approach to detecting semantic content is to use pretrained “embedding spaces” that represent words and sentences as vectors within a space of meaning (e.g., Landauer & Dumais, 1997; Mikolov et al., 2013). Most modern embedding models are extracted from small neural networks trained to estimate which words tend to have the same neighbors (Bhatia et al., 2019). To solve this problem, the inner layer of the neural network groups words with similar meanings close to one another within the space. These embeddings are particularly useful for tasks that involve a similarity calculation—for example, measuring the semantic similarity of two texts (Arora et al., 2017) or improving dictionaries. Rather than using a dictionary to count words in a binary sense (i.e., presence/absence), authors can compute the similarity of a whole document to the dictionary as a continuous measure (e.g., Garten et al., 2018; Sagi & Dehghani, 2014).
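To make the continuous dictionary measure concrete, the sketch below averages word vectors into a document vector and a dictionary centroid and then takes their cosine similarity. The three-dimensional vectors in `VECTORS` are invented for illustration (real pretrained embeddings have hundreds of dimensions), and the averaging step is one simple choice among several, not necessarily the procedure used in the cited work.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
VECTORS = {
    "happy":  [0.9, 0.1, 0.0],
    "glad":   [0.8, 0.2, 0.1],
    "joyful": [0.9, 0.0, 0.1],
    "sad":    [-0.8, 0.1, 0.2],
    "table":  [0.0, 0.9, 0.3],
    "chair":  [0.1, 0.8, 0.4],
}

def mean_vector(words):
    """Average the vectors of the words found in the embedding space."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dictionary_similarity(text, dictionary_words):
    """Continuous similarity between a text and a dictionary's centroid."""
    doc = mean_vector(text.lower().split())
    concept = mean_vector(dictionary_words)
    if doc is None or concept is None:
        return None
    return cosine(doc, concept)

positive_dictionary = ["happy", "joyful"]
print(dictionary_similarity("glad", positive_dictionary))
print(dictionary_similarity("table chair", positive_dictionary))
```

In this toy space, “glad” scores close to the positive dictionary even though it is not one of the dictionary words, whereas a binary word count would give it zero.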
Embedding models have several advantages over raw word counts. Whereas a word-count model treats each word as its own dimension, embedding models group words with similar meanings into a common dimension, which reduces the feature space considerably. In addition, although word-count models typically remove rare words to simplify the estimation, embedding models are pretrained on large data in which most words are seen often enough to be included in the model.
However, embedding spaces are difficult to interpret—the dimensions themselves do not directly correspond to meaningful concepts, and researchers must use other tools to interpret what the model is doing. In addition, many common pretrained embedding models are mapped to individual words, which means that they ignore the order of words spoken in conversation and other sources of contextual variation in meanings. Still, newer models of embeddings can encode entire sentences within an embedding space (e.g., Devlin et al., 2018) and can be fine-tuned to incorporate some contextual differences in meaning if the researchers have enough data. This is a frontier of constant progress in the NLP community.
Timing features
In this section, we review several conversation-specific features that can be derived from time stamps. Many types of conversation features are particularly prevalent in some parts of the conversation (for an example, see Fig. 4). Furthermore, the impact of some features of language may vary in meaning or effect depending on when they are said during a conversation (e.g., Y. Li et al., 2022). The most common use of time stamps is to organize other features of text and to select features from certain parts of the conversation for analysis. This is relevant for causal versus predictive inference (see Model Estimation).

Fig. 4. An example conversational time-series graph showing frequency of question types asked over the course of approximately 300 conversations between strangers (data from K. Huang et al., 2017).
Pauses
Typically, there is some amount of pause between turns, measured as the difference between one turn’s end time stamp and the next turn’s start time stamp. Pauses tend to be longer in asynchronous and text conversations and shorter in synchronous and spoken conversations. Teleconference conversations tend to be somewhere in the middle of the two (Boland et al., 2022). Within a particular data set, pauses of various lengths can be counted as turn-level features (Templeton et al., 2022, 2023). Some researchers simply dichotomize each turn into pause or no pause using a threshold and show that results are robust over a range of thresholds (e.g., Curhan et al., 2022). It is more difficult to define within-turn pauses, in which people pick up after their own silence, and the relevant time stamps are not included in a turn-level data set. Transcribers (human or algorithmic) can be instructed to indicate a midturn pause as a nonverbal (e.g., “so anyways . . . [pause] did you see them at the wedding?”), which can be counted or removed as needed.
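As a rough illustration of the between-turn pause measure, the sketch below computes the gap before each turn and then dichotomizes turns at several thresholds. The turn tuples and the threshold values are invented; appropriate thresholds will depend on the modality and the data set.

```python
# Each turn is (speaker, start_seconds, end_seconds); all values are invented.
turns = [
    ("A", 0.0, 4.2),
    ("B", 4.5, 9.0),
    ("A", 11.0, 15.5),
    ("B", 15.6, 20.0),
]

def gaps_before_turns(turns):
    """Gap before each turn: its start time minus the previous turn's end time.

    The first turn has no preceding turn, so its gap is None.
    """
    gaps = [None]
    for previous, current in zip(turns, turns[1:]):
        gaps.append(round(current[1] - previous[2], 3))
    return gaps

def pause_flags(gaps, threshold):
    """Dichotomize turns: True when the preceding gap is at least `threshold`."""
    return [gap is not None and gap >= threshold for gap in gaps]

gaps = gaps_before_turns(turns)

# Repeat the count over several thresholds as a simple robustness check.
for threshold in (0.25, 0.5, 1.0, 2.0):
    print(threshold, sum(pause_flags(gaps, threshold)))
```

Within-turn pauses are not captured here because they do not appear in turn-level time stamps.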
Interruptions
Sometimes speakers do not leave any time in between their turns or even talk over one another. This often happens when the first speaker is interrupted by the second, and this type of interruption is often given a special annotation in transcripts (e.g., a single dash at the beginning or end of a turn) and a zero or negative gap between the end time of the previous turn and the start time of the interruption. The meaning of these interruptions is the subject of scholarly study—as a signal of disrespect or authority in formal settings (H. Z. Li et al., 2004; Mendelberg & Karpowitz, 2016); a sign of excited, enjoyable discourse (H. Z. Li et al., 2004; Yeomans & Brooks, 2023); or a signal that one person was merely filling dead air until the partner was ready to take a turn. The content of the interrupter’s turn also distinguishes different types of interruptions, such as backchannels, questions, and arguments (Shi et al., 2022).
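The two signals mentioned above, timing and transcript annotation, can be checked separately. The following sketch assumes turns stored as tuples with start and end times and a dash as the cut-off marker; both the data and the dash convention are illustrative assumptions.

```python
# Each turn is (speaker, start_seconds, end_seconds, text); all values are invented.
# The dash marks a cut-off turn, which is one possible transcript convention.
turns = [
    ("A", 0.0, 5.0, "So I was thinking we could-"),
    ("B", 4.6, 8.0, "-wait, did you hear back from them?"),
    ("A", 8.4, 12.0, "Not yet."),
    ("B", 12.0, 13.0, "Okay."),
]

def timing_interruptions(turns):
    """Flag turns that begin with a zero or negative gap after another speaker."""
    flags = [False]  # the first turn cannot interrupt anyone
    for previous, current in zip(turns, turns[1:]):
        gap = current[1] - previous[2]
        flags.append(current[0] != previous[0] and gap <= 0)
    return flags

def annotated_interruptions(turns):
    """Flag turns where the transcript marks a cut-off with a dash."""
    flags = [False]
    for previous, current in zip(turns, turns[1:]):
        flags.append(previous[3].endswith("-") or current[3].startswith("-"))
    return flags

print(timing_interruptions(turns))
print(annotated_interruptions(turns))
```

The final turn is flagged by timing alone, which illustrates why the content of the turn is also needed to tell an interruption apart from a quick reply or a backchannel.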
Speaking time
Time stamps can also be used to measure speech patterns over longer periods. For example, speaking time (i.e., “participation” or “airtime”) is commonly measured as the percentage of the total time that is used by a particular speaker. When time stamps are not available, airtime can be approximated using the number of words spoken by each speaker as a percentage of the total words spoken (although this does not account for when no one is speaking). Dividing the number of words in a turn by the turn’s duration (taken from its time stamps) gives an estimate of the person’s speaking speed (i.e., “cadence”).
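These three quantities can be computed from the same turn-level rows. The sketch below uses invented turns and assumes that the denominator for airtime is the elapsed time from the first turn's start to the last turn's end (so silence counts toward the total); other definitions are possible.

```python
from collections import defaultdict

# Each turn is (speaker, start_seconds, end_seconds, text); all values are invented.
turns = [
    ("A", 0.0, 6.0, "well I think the first option is probably the better one"),
    ("B", 6.5, 8.5, "okay that makes sense"),
    ("A", 9.0, 11.0, "great"),
]

def speaking_seconds(turns):
    """Total seconds spoken by each speaker."""
    totals = defaultdict(float)
    for speaker, start, end, _ in turns:
        totals[speaker] += end - start
    return dict(totals)

def word_totals(turns):
    """Total words spoken by each speaker (whitespace tokenization)."""
    totals = defaultdict(int)
    for speaker, _, _, text in turns:
        totals[speaker] += len(text.split())
    return dict(totals)

def airtime_share(turns):
    """Each speaker's share of elapsed conversation time, silence included."""
    elapsed = max(t[2] for t in turns) - min(t[1] for t in turns)
    return {s: sec / elapsed for s, sec in speaking_seconds(turns).items()}

def word_share(turns):
    """Approximate airtime as each speaker's share of all words spoken."""
    words = word_totals(turns)
    total = sum(words.values())
    return {s: count / total for s, count in words.items()}

def words_per_minute(turns):
    """Speaking speed: words spoken divided by minutes spent speaking."""
    words = word_totals(turns)
    seconds = speaking_seconds(turns)
    return {s: 60.0 * words[s] / seconds[s] for s in words}

print(airtime_share(turns))
print(word_share(turns))
print(words_per_minute(turns))
```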
Interactive features
Backchannels
During conversation, listeners often insert a brief utterance to signal they understand (e.g., “yeah,” “ok,” “mm-hmm”) while someone else is talking. Definitions of backchannels differ across studies and vary according to context (e.g., audio vs. text chat). Typically, backchannels are treated as a single turn within the flow of conversation with zero time gap between the preceding and subsequent turns. This may unnaturally divide the longer turn of the backchannel recipient into two separate turns, which could interfere with sentence-level features. Some researchers have avoided this by considering backchannels as features of the turn receiving the backchannel. Then, each turn has a feature counting the number of backchannels it receives from other speakers (Reece et al., 2022).
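One way to treat backchannels as a feature of the receiving turn is sketched below. It assumes a small, hypothetical list of backchannel tokens and identifies backchannels from text alone; a fuller version would probably also use the time stamps (e.g., requiring a zero gap), and this is not the procedure of any cited study.

```python
# Hypothetical list of backchannel tokens; real lists depend on the context.
BACKCHANNELS = {"yeah", "ok", "okay", "mm-hmm", "uh-huh", "right"}

def is_backchannel(text):
    """True when the whole turn is a single listed backchannel token."""
    return text.lower().strip(" .,!?") in BACKCHANNELS

def fold_backchannels(turns):
    """Rejoin turns split by a backchannel and count backchannels received.

    `turns` is an ordered list of (speaker, text). Returns a list of dicts
    with keys `speaker`, `text`, and `backchannels_received`.
    """
    merged = []
    resume = False  # True right after a backchannel has been folded in
    for speaker, text in turns:
        if merged and speaker != merged[-1]["speaker"] and is_backchannel(text):
            merged[-1]["backchannels_received"] += 1
            resume = True
            continue
        if resume and speaker == merged[-1]["speaker"]:
            # The same speaker continues after the backchannel: rejoin the turn.
            merged[-1]["text"] += " " + text
        else:
            merged.append({"speaker": speaker, "text": text, "backchannels_received": 0})
        resume = False
    return merged

# Invented example turns.
turns = [
    ("A", "So we drove all the way there"),
    ("B", "mm-hmm"),
    ("A", "and the place was closed"),
    ("B", "Oh no, what did you do?"),
]
for turn in fold_backchannels(turns):
    print(turn)
```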
Dialogue acts
Most of what is said in conversation imposes a structure on what is said in subsequent turns: asking different types of questions; stating facts, opinions, or feelings; making requests or commands; signaling understanding, agreement, or disagreement; or initiating repair. These “dialogue acts” are essential to understand how speakers are communicating with one another (Bunt et al., 2010; Stolcke et al., 2000). Other theoretical frameworks (e.g., speech acts; Searle, 1965) capture roughly the same idea, which is that conversational turns are usually more than just statements of fact about the world. Rather, they communicate speakers’ intentions and give structure to the response they expect to receive.
Some dialogue acts can be reasonably approximated with features extracted from individual turns by the politeness package (e.g., gratitude, apologies, acknowledgment; Yeomans, Kantor, & Tingley, 2018; see Fig. 3). However, many other dialogue acts are difficult to identify without information from other turns. For example, adjacency pairs (e.g., consecutive turns such as question/answer, offer/acceptance, misunderstanding/repair) often demarcate essential decisions in a conversation.
There is no universally accepted, domain-general list of dialogue acts. Instead, the set of relevant dialogue acts will change depending on the conversational context (e.g., the modality of exchange, the goals of the speakers). For example, consider the sequence of formal offers within a negotiation. Specific offers are among the most important dialogue acts, so the impact of measurement error on these features would be considerable. In fact, most negotiation platforms (e.g., iDecisionGames or eBay) require that formal offers be made separately from the unstructured stream of conversation so that the speakers themselves can understand their partners. Algorithms may be able to parse the offers in simple negotiations (Lewis et al., 2017), but if the negotiation involves multiple complicated issues, automatic extraction may not be possible and human annotation may be preferred (Jäckel et al., 2022; Weingart et al., 2004). The same treatment may be necessary for other dialogue in which particular turns have formal significance—for example, voting during a meeting or generating creative ideas (Brucks & Levav, 2022).
Accommodation
One of the most common and reliable results in conversation analysis is accommodation—the tendency of one speaker to mirror the linguistic features of the previous speaker (Giles et al., 1991). Several models of accommodation have been proposed. The most common measure combines the entire transcript of each person separately and then calculates the similarity of those two documents (Ireland et al., 2011). However, this ignores order and directionality (e.g., Which of the speakers is doing the accommodating?). Other models are purpose-built for conversation and explicitly identify accommodation from one turn to the next (Danescu-Niculescu-Mizil et al., 2011; Demszky et al., 2021; Doyle & Frank, 2016), and this can be aggregated as a feature of one or several turns.
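A directional, turn-to-turn measure can be sketched as the difference between two proportions: how often a speaker uses a marker when replying to a turn that contained it, minus how often that speaker uses the marker in replies overall. This follows the general form of the turn-level measures cited above, but the marker list (`QUANTIFIERS`), the example turns, and the simplifications are assumptions made for illustration.

```python
# Hypothetical marker list for one stylistic category.
QUANTIFIERS = {"all", "some", "many", "few", "most", "several", "every"}

def uses_marker(text, marker_words):
    """True when any word in the text belongs to the marker category."""
    return any(word.strip(".,!?").lower() in marker_words for word in text.split())

def accommodation(turns, replier, marker_words):
    """P(reply uses marker | previous turn used it) minus P(reply uses marker).

    `turns` is an ordered list of (speaker, text). Only replies by `replier`
    to a different speaker are counted. Returns None when undefined.
    """
    pairs = [
        (uses_marker(prev_text, marker_words), uses_marker(text, marker_words))
        for (prev_speaker, prev_text), (speaker, text) in zip(turns, turns[1:])
        if speaker == replier and prev_speaker != replier
    ]
    triggered = [reply for prompt, reply in pairs if prompt]
    if not pairs or not triggered:
        return None
    baseline = sum(reply for _, reply in pairs) / len(pairs)
    conditional = sum(triggered) / len(triggered)
    return conditional - baseline

# Invented example turns.
turns = [
    ("A", "Most people liked it"),
    ("B", "Some did, yes"),
    ("A", "It was raining"),
    ("B", "That sounds cold"),
    ("A", "Many friends came"),
    ("B", "Several of mine too"),
    ("A", "We left early"),
    ("B", "Same here"),
]
print(accommodation(turns, "B", QUANTIFIERS))
print(accommodation(turns, "A", QUANTIFIERS))
```

Because the measure is computed separately for each replier, it preserves the directionality that whole-transcript similarity discards: in this toy exchange, B mirrors A's quantifier use but A does not mirror B's.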
Researchers have considered several feature sets over which accommodation should be measured. Some articles have focused on mirroring of content (e.g., If I talk about my dog, will you talk about your dog?; Babcock et al., 2014; Fusaroli et al., 2012), and others have focused on stylistic categories (e.g., If I use more quantifiers, will you do the same?; Danescu-Niculescu-Mizil et al., 2011) or syntactic structure (e.g., If I use short, clipped sentences, will you?; Boghrati et al., 2018). Other articles have included a wide range of features, combining content and style (Niederhoffer & Pennebaker, 2002; Srivastava et al., 2018). In truth, it is not clear whether conversational style and content can be cleanly separated, and the two often correlate with one another—in essence, some types of content naturally pair with particular styles. This is a subject of ongoing research.
Topics
Conversations are very often broken into discrete topics (e.g., the weather, then work, then cooking, and so on) based on speakers’ varied intentions (Passonneau & Litman, 1993). There are well-known NLP algorithms that focus on extracting topical content from text (i.e., topic modeling). The most common approach, latent Dirichlet allocation, assumes that each text document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics (Blei et al., 2003; Roberts et al., 2019).
Alas, conversation data are not well suited for topic models built for single-voice text. Topic models focus on the distinctive words that demarcate content and typically remove common words, such as pronouns (e.g., “I,” “you,” “it,” “she,” “they”). However, many turns contain no topic-relevant information (e.g., “Why is that?” could be asked in almost any topic), and most turns are too short to reliably estimate word co-occurrence. Instead, blocks of turns must be segmented into topics for analysis, and dividing dialogue into segments is arguably even harder than assigning a topic to a particular segment (Purver, 2011; although, see Galley et al., 2003; Hearst, 1997; V. A. Nguyen et al., 2014). Furthermore, in both single voice and dialogue, it can be hard to choose the number of topics and interpret the words within each topic (Boyd-Graber et al., 2014; Chang et al., 2009). Still, topic models may be a useful tool for rough exploration and descriptions of the main themes of a body of dialogue.
If the topical structure is important to measure precisely, we suggest researchers avoid relying on an unsupervised algorithm but instead develop their own categories on the basis of their knowledge of the domain and their exploration. For example, conversations can have a list of preassigned topics, which makes ex-post segmentation much easier (e.g., Yeomans & Brooks, 2023). Many conversations that are repeated often (such as sales calls, customer service, doctor-patient interactions, police interviews, and parole hearings) have explicit or accepted dialogue scripts that speakers have been trained to follow as a progression through a series of stages. These scripts can be used to develop domain-specific rules to segment individual transcripts into discrete topics or stages (e.g., Takanobu et al., 2018). This is a subject of ongoing work, and NLP researchers have made progress in tracking topic shifts within dialogue (e.g., Xing & Carenini, 2021; Xu et al., 2021).
Model Construction
Most conversation research does not just examine transcripts. Instead, conversational behavior from transcripts is compared with data from outside the conversation, such as the speakers’ gender, when the conversation took place, the terms they negotiated, or how they felt about each other when the conversation ended. This means that feature counts in the turn-level data set (their words) need to be aggregated and merged with the speaker-level data set (other measures outside the conversation). Then a statistical model must be estimated and interpreted. Finally, the results must be reported and benchmarked.
Aggregating conversation features
Although many conversation features are observed at the turn level, other variables of interest may be measured at a higher level, such as at the level of the conversation, individual, dyad, group, organization, or society. Usually, these are measured once per conversation, either as context variables before the conversation (e.g., mood, location, preferences, random assignment to an experimental condition) or outcome variables after it (e.g., enjoyment, learning, negotiated outcomes). However, they can also be measured once per speaker (e.g., demographics) or after multiple conversations in a relationship.
To estimate the links between conversation features and these higher-level measures, turn-level features should be aggregated in some form (e.g., count, average, sum, standard deviation). These aggregations can then be merged to the speaker-level data set using the speaker- and conversation-level unique identifiers (the dplyr R package makes this process easier; Wickham et al., 2019).
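The aggregate-then-merge step can be illustrated without any particular package. The sketch below is a plain-Python analogue of a grouped summary followed by a left join; the column names (`conversation_id`, `speaker_id`, `word_count`, `questions`, `enjoyment`) and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Invented turn-level rows: one row per turn.
turn_level = [
    {"conversation_id": 1, "speaker_id": "s1", "word_count": 12, "questions": 1},
    {"conversation_id": 1, "speaker_id": "s2", "word_count": 5, "questions": 0},
    {"conversation_id": 1, "speaker_id": "s1", "word_count": 8, "questions": 2},
    {"conversation_id": 2, "speaker_id": "s3", "word_count": 20, "questions": 0},
]

# Invented speaker-level rows: one row per speaker per conversation.
speaker_level = [
    {"conversation_id": 1, "speaker_id": "s1", "enjoyment": 6},
    {"conversation_id": 1, "speaker_id": "s2", "enjoyment": 4},
    {"conversation_id": 2, "speaker_id": "s3", "enjoyment": 5},
    {"conversation_id": 2, "speaker_id": "s4", "enjoyment": 3},
]

def aggregate_turns(turn_rows):
    """Summarize turn-level features per (conversation, speaker) pair."""
    grouped = defaultdict(list)
    for row in turn_rows:
        grouped[(row["conversation_id"], row["speaker_id"])].append(row)
    return {
        key: {
            "n_turns": len(rows),
            "total_words": sum(r["word_count"] for r in rows),
            "total_questions": sum(r["questions"] for r in rows),
            "mean_words_per_turn": mean(r["word_count"] for r in rows),
        }
        for key, rows in grouped.items()
    }

def merge_speaker_level(speaker_rows, aggregated):
    """Left join: keep every speaker, including those with no recorded turns."""
    empty = {"n_turns": 0, "total_words": 0, "total_questions": 0,
             "mean_words_per_turn": None}
    merged = []
    for row in speaker_rows:
        key = (row["conversation_id"], row["speaker_id"])
        merged.append({**row, **aggregated.get(key, empty)})
    return merged

merged = merge_speaker_level(speaker_level, aggregate_turns(turn_level))
for row in merged:
    print(row)
```

Joining on both identifiers keeps speakers who appear in more than one conversation from being collapsed into a single row, and the left join keeps speakers who never spoke rather than silently dropping them.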
Aggregation window
Researchers should almost always aggregate the features of each person in the conversation separately before analysis (e.g., How many questions did Mary ask?) rather than aggregating across the entire transcript (e.g., How many questions did everyone ask?). This is necessary any time speaker-level variables vary within a conversation, such as when speakers occupy different roles, are assigned to different experimental conditions, or differ in demographics.
In addition, researchers may want to aggregate features from only a subset of the conversation. For example, they may remove greetings, off-topic chatter, or final decisions from analysis of a task-focused conversation. In other cases, they may aggregate features only from the beginning of the conversation to focus on each person’s behavior before being influenced by the partner’s manner of speech or because the meaning of a feature changes at different times (e.g., Y. Li et al., 2022).
Controlling for speaking time
Researchers should be clear about counts versus rates. The total word count of each turn and each conversation is used in many analyses; it is a common and simple benchmark to use for prediction tasks. Other times, feature counts are transformed into feature rates to control for the length of each text (e.g., feature count per minute or per 100 words, which is the default in the LIWC-dictionary approach). Analyses are simplest when word counts are relatively similar across texts. When word-count differences are large, researchers must decide whether the difference is exogenous (i.e., determined outside the conversation) or endogenous (i.e., produced by the conversation itself). For example, if someone is studying a mix of 30- and 60-min meetings, then total feature counts would be mainly driven by the prescheduled meeting length, which is exogenous. Thus, controlling for the total word count would make it easier to compare language across the two time frames.
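A small worked example shows how counts and rates can point in different directions. The meeting figures below are invented.

```python
def rate_per_100_words(feature_count, word_count):
    """Convert a raw feature count into a rate per 100 words."""
    if word_count == 0:
        return None
    return 100.0 * feature_count / word_count

# Invented totals for two meetings of different scheduled lengths.
meetings = [
    {"meeting": "30 min", "questions": 12, "words": 3000},
    {"meeting": "60 min", "questions": 20, "words": 6500},
]

for m in meetings:
    rate = rate_per_100_words(m["questions"], m["words"])
    print(m["meeting"], m["questions"], round(rate, 2))
```

Here the longer meeting contains more questions in total but fewer questions per 100 words, so the conclusion depends on which quantity the research question is about.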
Sometimes, total speaking time is an outcome. For example, when people are told to ask more questions, their partner speaks more and enjoys the conversation more (K. Huang et al., 2017). This is not a confound: the increase in talking is due in part to the number of questions asked. Furthermore, enjoyment early in the conversation can increase talking as the conversation continues. In these cases, it may be better to focus only on the early part of the conversation, before differences in speaking time emerge (Shi et al., 2023). Otherwise, researchers should look at both what and how much is said as two distinct outcomes.
Model estimation
Although a review of the rich existing literature on model estimation (i.e., constructing a statistical model to test a hypothesis) is outside of the scope of this article, we briefly touch on several challenges that are particularly common in conversation research.
Units of observation
Although speakers are given their own row in the speaker-level data set, these are not independent observations. There is often some shared variance with their partner in the context and outcomes. There is also shared variance when a speaker is present in multiple conversations (e.g., in a round-robin design or when tracking relationships over time) or when outcomes are measured multiple times per conversation (e.g., once per topic). This is commonly addressed by using cluster-robust standard errors (e.g., through the sandwich R package; Zeileis et al., 2020). Researchers who ignore these issues can end up overstating the precision of their estimates and overfitting models that are too complex to be estimated well by their data sets (Bertrand et al., 2004; Yeomans, Brooks, et al., 2019).
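One way to see why shared variance matters is to compare a standard error that treats every row as independent with one that clusters rows by conversation. The sketch below does this for the simplest possible model, a sample mean, using invented enjoyment ratings in which partners report identical scores. It implements the basic clustered ("CR0") estimator without small-sample corrections, so it is an illustration of the logic and not a substitute for an established package.

```python
import math
from collections import defaultdict

def naive_se(values):
    """Standard error of the mean, treating all rows as independent."""
    n = len(values)
    center = sum(values) / n
    variance = sum((v - center) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)

def cluster_robust_se(values, cluster_ids):
    """Clustered standard error of the mean (CR0, no small-sample correction).

    Residuals are summed within each cluster before squaring, so residuals
    that move together within a cluster inflate the variance estimate.
    """
    n = len(values)
    center = sum(values) / n
    residual_sums = defaultdict(float)
    for value, cluster in zip(values, cluster_ids):
        residual_sums[cluster] += value - center
    return math.sqrt(sum(s ** 2 for s in residual_sums.values())) / n

# Invented data: two speakers per conversation, with identical ratings.
enjoyment = [5, 5, 1, 1, 3, 3]
conversation = [1, 1, 2, 2, 3, 3]

print(naive_se(enjoyment))
print(cluster_robust_se(enjoyment, conversation))
```

With perfectly correlated partners, the clustered standard error is larger than the naive one, reflecting that six rows carry roughly the information of three independent conversations.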
Interpreting effects
The time course of conversation complicates the interpretation of estimated effects. In particular, we distinguish between “causal” relationships (“What is the effect of X?”), “predictive” relationships (“Will X happen next?”), and “descriptive” relationships (“Did X happen?”). All three have some practical value (J. Kleinberg et al., 2015; Mullainathan & Spiess, 2017), but it is important to know the difference. This is especially difficult in interpersonal interaction because there are many possible third variables that could confound any estimate: Someone’s midconversation behavior could either affect outcomes directly, be correlated with something that affects outcomes, or be an outcome of something that happened earlier in the conversation.
The “gold standard” for causal estimation is a randomized experiment in which at least one speaker is randomly assigned to an intervention that affects some part of the conversational behavior (e.g., try to interrupt a lot vs. try not to interrupt at all) or outcomes the speaker or the speaker’s partner will report (e.g., come up with as many ideas as you can vs. choose one idea to pursue). In lieu of experimental control, some empirical approaches can help make causal interpretations more plausible. If speakers have stable conversational tendencies across conversations (e.g., some people always laugh more frequently or have a penchant for arguing), then the random assignment of speakers to their partners can be used as an instrumental variable (Zhang et al., 2020). Researchers have also sharpened their interpretations by focusing on conversation features (as in the Counting Words section) from the beginning of conversations, before speakers are deeply influenced by their partner (e.g., Curhan & Pentland, 2007; Voigt et al., 2017; Zhang et al., 2018). Other common causal inference strategies (e.g., controlling for preconversation variables, matching, event studies) may also be useful (Angrist & Pischke, 2008).
Reporting results
Only a subset of a researcher’s analyses will end up in a final publication. The low cost of additional analyses can be harnessed to produce a variety of benchmark models, alternative specifications, and robustness checks. Although it is often tempting to report only the positive results, these other analyses are often more useful when they produce negative results because they highlight limitations and boundary conditions.
Although not all of these additional analyses need to make the main body of the article, online appendices often have no word limit. In addition, researchers who share their analysis code and data can encourage their readers to explore alternative models themselves. At the very least, researchers should conduct and report basic sanity checks—for instance, that their results cannot be obtained using simpler text analysis, such as word-count or sentiment analysis.
Benchmarks
Often researchers are focused on a particular variable (e.g., question-asking), and they may want to demonstrate that the variable has a uniquely strong relationship with the outcome of interest. However, because conversation data are complex, there are many potential comparisons that can be constructed.
Instead, researchers should always give context to their focal model with some reasonable set of benchmark models (e.g., Eichstaedt et al., 2021; Yeomans, 2021). For example, computer-science articles routinely include tables comparing the performance of many models on the same data set. Because conversation data are rich, benchmarks could be drawn from contextual data or from other features of the transcript. Another approach to check the importance of a single feature is called an “ablation test.” There, a feature is removed from a more complex model—if the performance of the new model decreases, then the removed feature is considered essential for the original model.
Similar concerns arise when selecting control variables. There are many ways to define a model specification using conversation data, and researchers may find value in estimating alternative models to demonstrate robustness—sometimes called a “multiverse” or “specification curve” analysis (Schweinsberg et al., 2021; Simonsohn et al., 2020). The most reliable results will hold not only across individual specifications within a data set but also across data sets and contexts.
Confirmatory versus exploratory results
The high dimensions of text allow for near-infinite researcher degrees of freedom (Yeomans, 2021). This means the standard concerns about p-hacking, data-dependent modeling choices, and nonreplicability should be especially important for conversation research. Best practices include preregistering NLP analyses whenever possible—including exact analysis code, detailed information on what data are collected, and how the sample will be determined (Nelson et al., 2018). Likewise, researchers should be wary of assuming generalizability for models that have been tested in only one data set or one context. However, exploratory results can be tremendously useful (H. K. Collins et al., 2021; D. A. Moore, 2016). Thus, we recommend a balanced approach that prioritizes preregistered results where possible as a complement to (rather than to the exclusion of) well-grounded exploratory work.
When researchers publish results that have not been preregistered, they can still take steps to enhance the credibility of their findings. For example, they can separate validation analyses from their extraction and estimation strategies using cross-validation or split samples within their data set (Poldrack et al., 2020). Although a common default for these validation checks assigns data into training and testing folds randomly, researchers may find added value from nonrandom splits (Weiss et al., 2016). For instance, they could assign data to training and testing at the level of conversations (so that all speakers within a single conversation are all in the same fold together) or the level of speakers (so that when a speaker appears in multiple conversations, all of the speaker’s conversations are grouped into the same fold together). This is also relevant when researchers have data across a large time span. For example, researchers who want to forecast stock prices from CEO interviews might train on data from 2010 to 2020 and then test their model on data from 2021 to 2022 so that their model is tested on a simulation of its eventual application: seeing into the future. Other examples might be training and testing on different company types, countries, or CEO characteristics (e.g., gender). These nonrandom splits allow researchers to make stronger claims about the robustness and generalizability of their conclusions.
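The nonrandom splits described above can be sketched as two small helpers: one that assigns whole groups (conversations or speakers) to folds and one that splits by time. The field names (`conversation_id`, `speaker_id`, `year`) and the rows are hypothetical.

```python
import random

def grouped_folds(rows, group_key, n_folds, seed=0):
    """Assign whole groups to folds so that no group is split across folds."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    fold_of_group = {group: i % n_folds for i, group in enumerate(groups)}
    return [fold_of_group[row[group_key]] for row in rows]

def temporal_split(rows, year_key, last_training_year):
    """Train on rows up to a cutoff year and test on everything after it."""
    train = [row for row in rows if row[year_key] <= last_training_year]
    test = [row for row in rows if row[year_key] > last_training_year]
    return train, test

# Invented speaker-level rows.
rows = [
    {"conversation_id": 1, "speaker_id": "s1", "year": 2019},
    {"conversation_id": 1, "speaker_id": "s2", "year": 2019},
    {"conversation_id": 2, "speaker_id": "s1", "year": 2020},
    {"conversation_id": 2, "speaker_id": "s3", "year": 2020},
    {"conversation_id": 3, "speaker_id": "s4", "year": 2021},
    {"conversation_id": 3, "speaker_id": "s5", "year": 2021},
    {"conversation_id": 4, "speaker_id": "s4", "year": 2022},
    {"conversation_id": 4, "speaker_id": "s6", "year": 2022},
]

folds = grouped_folds(rows, "conversation_id", n_folds=2)
train, test = temporal_split(rows, "year", last_training_year=2020)
print(folds)
print(len(train), len(test))
```

Passing `"speaker_id"` as the group key instead keeps each speaker's conversations together, although in data like these it would then separate partners from the same conversation; which grouping is appropriate depends on the generalization claim being made.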
Data Sharing
Collecting and cleaning conversation data for academic research can be costly in terms of time and money. This can make conversation research prohibitive for early-career scholars and privilege scholars from well-resourced institutions. Moreover, costs may lead individual researchers to be reluctant to share their data with others who did not bear those costs themselves. However, we think this reluctance could be holding conversation science back—it is the costliness of collecting conversation data that makes its sharing especially valuable and productive. The field will be better off if researchers establish norms to share their materials, data, and code openly. We hope to encourage a more cumulative, inclusive, and collaborative research community. To this end, in our own work, we have shared as much of our conversation data as we can. Furthermore, our own research has directly benefited from the generosity of others who were willing to share their data and analyses (e.g., K. Huang et al., 2017; Ranganath et al., 2009).
Open-science practices are important (National Academies of Sciences, Engineering, and Medicine, 2018), and we think they are especially important for conversation science (Reece et al., 2022). First, conversation is so multifaceted that the same data set can be used to answer many research questions, beyond the scope of the initial research question of the researchers who collected the data. Second, the upfront costs of collecting and cleaning large-sample conversation data are immense and may be prohibitive for some researchers. Third, the upfront costs of the analysis are also quite high, so researchers can quickly build on one another’s work by publishing reproducible code that can be shared and improved. Finally, individual hypotheses can be more robustly tested if analyses and results can be replicated over multiple data sets that may have been collected in different contexts.
Data privacy
There are barriers to openly sharing data. In our view, the most common and legitimate concern is privacy. Many common privacy issues are exacerbated in conversation research because conversation data sets include identifiable data (Cychosz et al., 2020; Rubinstein & Hartzog, 2016). When conversations are recorded on video and/or audio, these rich media make it easier for subjects to be identified. Furthermore, even the transcripts of conversations can contain revealing details about a person that could be identifiable, either individually or in combination (Sweeney, 2002). These are essential questions for researchers to grapple with, and although there are more extensive treatments of the relevant issues (e.g., Meyer, 2018; Robbins, 2017), we highlight the main concerns.
Preventive measures
The most important step in accounting for privacy is to obtain explicit consent from participants. In practice, we have found that researchers often fail to anticipate future data-sharing needs and are not clear in asking for permission to store and to share deidentified data. Participants and Institutional Review Boards (IRBs) rarely blanch at these requests in consent forms because data sharing is increasingly an essential part of the research process. Furthermore, an explicit warning about sharing may prompt participants not to share anything truly private.
It is worth assessing the importance of individuating information for the research question. For example, if researchers are studying performance during a negotiation simulation in which the particulars are assigned at random in the case materials, then the speakers’ true personas (including names, demographics, and location) are irrelevant to many research questions. In these cases, researchers should directly ask participants to refrain from providing any identifying information before the conversation begins. However, this restriction can interfere with some research questions. Consider two examples, doctor-patient conversations and speed-dating conversations, in which personal information is essential to the goals of the speakers. In these cases, researchers cannot reasonably ask speakers not to share personal information.
Deidentification
It is best practice to anonymize conversation data sets when possible. This is especially important for conversation data because it is open-ended: During a conversation, people can say virtually anything. If data are to be shared for public use (which we encourage), it is essential that the text be completely deidentified. Many feature-extraction techniques automatically remove identifying information. For example, if an n-gram model is used and all n-grams that appear in fewer than 1% of the texts are removed, this will mechanically remove any individuating information (as long as no individual contributes more than 1% of the texts).
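The rare n-gram filter can be sketched as a document-frequency threshold. The example below uses four invented texts (including a made-up name) and a 50% threshold so that the effect is visible at toy scale; with a realistic corpus the same function could be called with `min_share=0.01`.

```python
from collections import Counter

def ngrams(text, n):
    """List the n-grams (as space-joined strings) in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_frequency(texts, n):
    """Count the number of texts in which each n-gram appears at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(ngrams(text, n)))
    return counts

def frequent_ngrams(texts, n, min_share):
    """Keep only n-grams that appear in at least `min_share` of the texts."""
    threshold = min_share * len(texts)
    frequencies = document_frequency(texts, n)
    return {gram for gram, count in frequencies.items() if count >= threshold}

# Invented texts; the name is made up for illustration.
texts = [
    "thank you so much",
    "thank you very much",
    "my name is jordan rivera",
    "thank you for that",
]

kept = frequent_ngrams(texts, n=1, min_share=0.5)
print(sorted(kept))
```

The name appears in only one text, so it falls below the threshold and never enters the shared feature set.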
Anonymizing raw text is more challenging. This can be done manually—by a human coder reading through each transcript and removing any identifiers—or automatically. For example, there are software packages that can deidentify most data by replacing named entities (e.g., specific names, addresses) with generic tags, although no algorithmic method is perfect (B. Kleinberg, 2023; Mendels et al., 2018). Like transcription, the best approach may be hybrid—using an algorithm as a first pass at anonymization followed by a human check to handle the identifiable information most difficult to detect.
Some conversation data are especially difficult to anonymize (e.g., audio or video data). We are not aware of any robust method for automatically deidentifying video or audio data; it may be better to simply focus on sharing transcriptions and turn-level extracted features (metadata) rather than the complete or raw data. Likewise, even transcripts can be difficult to anonymize. For example, a real-estate negotiation will likely reveal identifying features of the property in question, which can then be linked to other public records. In these cases, we still encourage researchers to share the turn-level data set with the text removed, leaving only the unique identifiers and the extracted features. Note, however, that this is not always a guarantee of deidentification. It is possible that text or demographic variables (e.g., gender) could be reconstructed from the feature counts. This is primarily a risk for very elaborate feature extraction (e.g., sentence embeddings), whereas it is exceedingly unlikely to be an issue with simpler features (e.g., counts of pauses or questions).
We encourage researchers to scrutinize the identifiability of the metadata they collect outside the conversation (e.g., demographics). If there is a concern about these data, they can be deidentified. Common solutions include coarsening variables to broad categories (e.g., reporting age buckets rather than exact age; Samarati & Sweeney, 1998) or perturbing variables by adding noise (e.g., reporting age ±5 years; Kargupta et al., 2003). This is especially important when researchers combine publicly available text data with nonpublic data, for example, if text from someone’s (public) Twitter account is paired with that person’s (private) school transcripts. Because the text can be searched, this risks identification of each participant’s entire record.
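Both solutions are straightforward to implement. The following sketch shows one way to coarsen and to perturb an age variable; the function names, the 10-year bucket width, and the uniform noise distribution are illustrative assumptions, not recommendations drawn from the cited work.

```python
import random


def coarsen_age(age, width=10):
    """Report the bucket an age falls into (e.g., 37 becomes '30-39')."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb_age(age, max_shift=5, rng=random):
    """Add uniform integer noise in [-max_shift, +max_shift], floored at 0."""
    return max(0, age + rng.randint(-max_shift, max_shift))


rng = random.Random(2024)  # fixed seed so the shared data set is reproducible
ages = [19, 37, 64]
print([coarsen_age(a) for a in ages])
print([perturb_age(a, rng=rng) for a in ages])
```

Coarsening preserves accuracy at the cost of resolution, whereas perturbation preserves resolution at the cost of accuracy for any single record; which trade-off is preferable depends on how the variable will be used in later analyses.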
Handling sensitive data
There are unique privacy concerns that arise in many common conversation data settings. Imagine conversations between financial advisors and their clients or between professors and their students. In these cases, researchers must prioritize their responsibilities to protect the rights of the speakers and to uphold the norms of the context in which they were speaking. For example, it is not always possible to collect consent from the speakers themselves, and speakers may not be aware of how their data will end up being used.
Many organizations establish their own policies around data sharing. For example, a company may have permission from its users to share data but may not want to make the raw data public because it considers that information proprietary. We strongly encourage researchers to be proactive about this topic when exploring collaborations with outside organizations. Many of the anonymization techniques mentioned above, such as extracting aggregated linguistic features using open-source software (e.g., Yeomans, Kantor, & Tingley, 2018) and using metadata rather than raw transcript data, can be initiated before researchers see any of the data so that no raw text ever leaves the organization.
Depending on their capabilities, organizations may be able to execute analysis code written by a researcher, so that the researcher never sees more than a small example of the organization's internal data. Many feature-extraction algorithms (e.g., counts of politeness features) remove identifying information from text. The resulting turn-level feature counts could then be analyzed by researchers and shared publicly along with the code that was used to tally the features.
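As an illustration of what such code might look like, the sketch below converts raw turns into turn-level feature counts and drops the text, so that only identifiers and counts would leave the organization. The features tallied here are a toy stand-in chosen for brevity, not the feature set of any published politeness package, and the field names (`turn_id`, `speaker_id`, `text`) are assumed for the example.

```python
import re


def turn_features(text):
    """Tally a few simple features of one turn (a hypothetical feature set)."""
    words = re.findall(r"[A-Za-z']+", text)
    lower = [w.lower() for w in words]
    return {
        "word_count": len(words),
        "question_marks": text.count("?"),
        "gratitude": sum(w in {"thanks", "thank"} for w in lower),
        "please": lower.count("please"),
    }


def strip_text(turns):
    """Keep identifiers and feature counts for each turn; drop the raw text."""
    return [
        {
            "turn_id": t["turn_id"],
            "speaker_id": t["speaker_id"],
            **turn_features(t["text"]),
        }
        for t in turns
    ]


turns = [
    {"turn_id": 1, "speaker_id": "A", "text": "Could you please send the file? Thanks!"},
    {"turn_id": 2, "speaker_id": "B", "text": "Sure, sending it now."},
]
for row in strip_text(turns):
    print(row)
```

Because the script needs only the column names and a small example to be written and tested, an organization could run it internally and return the output table without exposing any raw transcript.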
There are also unique concerns when dealing with text collected from publicly available sources (e.g., social media data or online forums) because there is a heightened risk that the text can be reidentified. If the data set includes metadata that are not publicly available, this creates potential risks for the speakers. For example, if a researcher shares the exact turn-level word embeddings or word counts of entire conversations, that information, although ostensibly anonymized, may be enough to reverse-search and uncover the source of the data. In these cases, researchers may want to increase the anonymity by adding noise to the extracted feature counts and/or the metadata.
Conclusion
This is an exciting time to be studying conversation, a fundamental activity of the social world. With technological advances, it is becoming easier to collect and analyze large-scale conversation data and to pair turn-level conversation data with speaker-level data containing more traditional survey and behavioral measures. Still, collecting and analyzing text data and combining turn-level and speaker-level data sets present unique challenges. The complexities of this domain provide opportunities for researchers to build a community of inquiry that shares methods, tools, and data and strives for an ever-growing, cumulative science of conversation.
Acknowledgements
This article was much improved by helpful comments on earlier drafts from many other researchers, including (among others) Ken Benoit, Ryan Boyd, Gus Cooney, Morteza Dehghani, Grant Donnelly, Bennett Kleinberg, Andrew Knight, Celia Moore, James Pennebaker, Gillian Sandstrom, Martin Schweinsberg, Lyle Ungar, and Simine Vazire.
Transparency
Action Editor: David A. Sbarra
Editor: David A. Sbarra
Author Contribution(s)
