Abstract
Intersubjectivity, the coordination of people’s beliefs and actions, is central to human psychology and society. However, a legacy of Cartesian dualism has fragmented research along two dimensions (psychological vs. interpersonal; structural vs. interactional). This fragmentation obscures how the psychological and interpersonal aspects of intersubjectivity mutually reinforce one another. I argue that, despite epistemological differences, these literatures converge on recurring formal correspondences: three recursive levels of perspective-taking (direct, meta, meta-meta), three-turn interaction sequences (initiation-response-feedback), and triadic self-other-object relations. Building on these convergences, I propose a minimal model of dialogical intersubjectivity that entails three-turn social interaction generating and sustaining three levels of perspective-taking. Each turn can take the prior turn as its object, thereby adding a level of perspective-taking. The exchange of turns rotates participant positions (e.g., speaker becomes listener), introducing an external vantage point on the prior contribution, thus ratcheting up intersubjectivity by one level. This model explains how implicit assumptions become explicit and how social interaction patterns give rise to psychological processes.
Introduction
Intersubjectivity is fundamental to psychology and society. It is central to social coordination, communication, and the stream of thought. It spans from strategic negotiation to embodied simulation, and from narratives to inner dialogues. Yet, because intersubjectivity is central to so many domains, research is fragmented, and basic questions about its structure, function and development remain contested.
Intersubjectivity is like the elephant in the ancient Indian parable that is encountered in complete darkness. Not knowing what it is, people explore it by touch. Each encounters a different aspect, identifying it variously as a snake (trunk), spear (tusk), pillar (leg), and brush (tail). Like a hand that can only feel one aspect of the elephant at a time, each approach to intersubjectivity has provided a valid but incomplete account.
Consider the diversity of definitions. Psychological approaches define intersubjectivity variously as embodied simulation, explicit cognitive perspective-taking, or dynamic inner dialogue (Aveling et al., 2015; Trevarthen, 2015; Vogeley, 2017). Interpersonal approaches define it structurally in terms of agreement, misunderstanding, and feeling understood, and dynamically in terms of coordination, conversational repairs, and the consolidation of mutual understanding (Bavelas et al., 2017; Gallagher, 2020; Schegloff, 1992). Consequently, depending on the approach, intersubjectivity is characterized either as a structural matrix of cognitive representations or as a performative interaction, occurring either within or between people.
This fragmentation can be traced back to Descartes’ (1637) infamous dualism, namely, the ontological separation of mind and matter (Gillespie, 2006a). This dualism has bequeathed theories of intersubjectivity that proceed either from the inside, through the mind (e.g., Husserl, 1931), or from the outside, via social relations (e.g., Mead, 1913; Schutz, 1932). Approaches from the inside start with the self, while those from the outside start with the self-other relationship (Crossley, 1996). However, Descartes also created a second split in the literature. He sought timeless, rationalistic truths underlying the flux of experience, thus privileging structures. The alternative to this Cartesian paradigm is the Hegelian paradigm, which focuses on how minds and societies develop (Marková, 1982) and so privileges dynamic processes and interactions. The debate between the Cartesians and the anti-Cartesians therefore created a split between approaches that focus on structures (e.g., mental representations) and approaches that focus on interactions (e.g., moment-to-moment sequences).
Figure 1 uses these two dimensions of our Cartesian legacy to conceptualize the literature on intersubjectivity.
[Figure 1: Overview of approaches to intersubjectivity]
In the center of Figure 1 are the theories that straddle Descartes’ dualism trying to explain how social interaction gives rise to psychological intersubjectivity. For example, Trevarthen (2015), Tomasello (2019), and Gillespie and Martin (2014) have focused on how specific interactions (mirroring, imitation, emotional attunement, culture, language, and exchanging positions) scaffold psychological intersubjectivity, becoming internalized as psychological functions (e.g., inner dialogue, self-reflection). However, the mechanism connecting interactional and psychological intersubjectivity has remained unclear.
I propose that a key bridging mechanism is turn-taking. Specifically, I argue that human intersubjectivity usually entails three levels of perspective-taking (direct, meta, and meta-meta) and these are incrementally created through minimally three-turn interactions (initiation-response-feedback sequences) within triadic relations.
My aim is not to resolve the longstanding epistemological puzzles inherited from Descartes. Nor am I proposing to collapse the four quadrants or privilege one over others. My epistemological standpoint is pragmatist; theories and their associated epistemologies are just tools that can be more or less useful (Gillespie et al., 2024). Each quadrant is a different way of looking at the elephant of intersubjectivity. While they might be logically incompatible (because of incommensurable assumptions), they are not empirically incompatible (they pertain to the same observables). The psychological is not a mysterious inner substance but a real semiotic activity that can become observable in talk.
My analysis contributes to theories of intersubjectivity by identifying deep convergence across four siloed literatures on three-level, three-turn, and triadic relations. This convergence is the set-up for my main contribution: proposing turn-taking as the mechanism through which interpersonal interaction scaffolds recursive psychological perspective-taking. This mechanism can explain how implicit intersubjective assumptions become explicit and how social interaction patterns become internalized as ‘inner’ dialogue.
I define intersubjectivity as any coordination (‘inter’) of subjective states (‘subjectivity’) occurring either within or between people (Gillespie & Cornish, 2010). This definition spans all four quadrants of Figure 1. I will also use the shorthand terms psychological, interpersonal, structural, and interactional intersubjectivity to refer to the right, left, top, and bottom quadrants of Figure 1, respectively. I will use the terms implicit and explicit to refer to that which has not been verbally indexed as compared to that which has been verbally thematized (brought into language). Turn-taking will refer to the observable interactional change of speakership or turns of action (e.g., in a game). Although I focus on verbal turn-taking, in practice, it is richly supported by non-verbal cues (Mondada, 2007) and occurs in non-human species (Mondémé, 2022). Perspective-taking will refer to the psychological (i.e., intra-psychological) attempt to adopt another person’s perspective.
To theorize perspective-taking as the psychological corollary of turn-taking, I begin by reviewing the research in each quadrant. This reveals striking convergences on three-level intersubjectivity, three-turn interactions, and triadic relations. These convergences, I argue, are not coincidental; each quadrant addresses the same phenomenon, albeit from a different epistemological stance. Then I use these convergences to build a model of dialogical intersubjectivity, in which the interactional processes of turn-taking scaffold and mutually reinforce the psychological processes of perspective-taking. However, before arriving at this argument, the next four sections aim to pinpoint the convergences upon which my argument builds.
Psychological-Structural Approaches
This quadrant inherits Descartes’ focus on the mind: intersubjectivity resides in cognitive structures. This approach looks ‘within’ people, rather than ‘between’ them. The aim is to identify stable representations in the subject’s mind that pertain to other minds. This approach provides precise vocabulary for conceptualizing these mental representations, but risks studying them in isolation from social interactions. This literature spans cognitive, developmental, social and evolutionary psychology. The key insight for my argument is that people do not merely represent other minds; they also model how those other minds represent them in return.
Theory of Mind
Theory of mind refers to the cognitive capacity to empathize with and understand other minds (Gopnik & Wellman, 1992). The term ‘theory’ reflects the premise that other minds, being inaccessible to direct observation, are always inferred. The theory of mind concept became widespread in developmental psychology and has since been used in many domains (Wellman, 2018). Although measuring theory of mind is challenging (Warnell & Redcay, 2019), there is converging evidence for dual systems of embodied simulation and reflective theorizing (Keysers & Gazzola, 2007; Vogeley, 2017).
Simulation-theory of mind focuses on how people understand the mental states of others by activating equivalent neural circuits (mirror neurons; Gallese & Goldman, 1998). This leads to embodied feelings of empathy, upon which cognitive representations are constructed. Simulation is a spontaneous, non-verbal, embodied resonance between self and other. For example, empathy for pain activates brain regions similar to those activated by the direct experience of pain (Lamm et al., 2011).
Theory-theory of mind focuses on the innate and folk psychology beliefs that people have about other minds (Gopnik & Wellman, 1992). It builds upon the mirror neuron system to create more abstract and reflective mind-reading abilities (Gallese & Goldman, 1998). Like naïve scientists, children’s initially simplistic theories of other minds lead to errors that prompt more refined theories (Saxe, 2005), building up from understanding desires and beliefs to representing hidden emotions and false beliefs (Wellman, 2018).
The theory of mind literature has identified the embodied and cognitive components that underlie mental state representation. However, it has tended to ignore the system of social relations that it is not only designed to represent but also embedded in (Hughes & Devine, 2015). Additionally, it has tended to neglect higher-order perspective-taking (i.e., understanding other people’s theory of mind). For the present argument, the question is not only how these representations are structured but also which interactional patterns elicit and ratchet them toward greater complexity.
Perspective-Taking
Perspective-taking is defined as the act of trying to adopt another person’s perspective by imagining their thoughts, feelings, and experiences (Epley et al., 2004). Originally, ‘perspective’ referred to a behavioral orientation (Mead, 1925; O’Toole & Dubin, 1968). It has since been reconceptualized as a cognitive structure, measured with surveys and manipulated with primes (e.g., suggestions to think from the standpoint of others). Research has found that it is positively associated with social cognition, such as decision-making (Tuazon et al., 2019) and creativity (Hoever et al., 2012).
One advance has been the discovery of the importance of meta-meta-perspectives. These include, for example, felt understanding, which refers to whether one thinks another person or group values one’s beliefs (Livingstone, 2023). Research has shown that these beliefs about other people’s beliefs about our own beliefs are important for trust (Thomas et al., 2014), intergroup conflict (Lees & Cikara, 2020), polarization (Lees & Cikara, 2021), and social identity (Livingstone et al., 2019). Accordingly, it is increasingly recognized that perspective-taking must be conceptualized within a broader multi-level architecture of intersubjectivity.
Until recently, this approach had also set aside the relation between perspective-taking and social interaction. However, research has found that priming perspective-taking is much less effective than ‘perspective-getting’ (e.g., asking questions; Eyal et al., 2018). Outside of the laboratory, people don’t merely imagine other people’s perspectives; they talk to them (Kalla & Broockman, 2023; Phelps et al., 2025). This reveals that perspective-taking also needs to be conceptualized in the context of social interaction. Remaining exclusively on the psychological side of Descartes’ dualism is problematic because perspective-taking is inextricably linked to interpersonal perspective-getting.
Orders of Intentionality
Research on orders of intentionality examines recursive perspective-taking. This approach starts with the insight that to understand another mind is to understand the intention of that mind toward an object (Dennett, 1983, 1989). This ‘intentional stance’ conceptualizes recursive levels: zero-order intentionality does not entail a belief, as it is mere action; first-order intentionality is a belief, but not a belief about a belief; second-order intentionality entails beliefs about beliefs; third-order intentionality entails beliefs about beliefs about beliefs, and so on.
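The recursive definition above can be made concrete with a small data structure. The following is a minimal sketch, assuming that a mental state can be represented as a belief whose content is either a plain proposition or another belief; the class and function names (Action, Belief, order_of_intentionality) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Action:
    """Zero-order intentionality: mere action, with no belief involved."""
    description: str


@dataclass(frozen=True)
class Belief:
    """A belief held by `holder`.

    `content` is either a plain proposition (a string) or another Belief,
    which is what makes the structure recursive.
    """
    holder: str
    content: Union[str, "Belief"]


def order_of_intentionality(state: Union[Action, Belief]) -> int:
    """Count nested beliefs: 0 for an action, 1 for a belief, 2 for a belief about a belief."""
    if isinstance(state, Action):
        return 0
    order = 1
    content = state.content
    while isinstance(content, Belief):
        order += 1
        content = content.content
    return order


# Hypothetical example: Ann believes it is raining (first order),
# Bob believes that Ann believes it (second order), and so on.
rain = Belief("Ann", "it is raining")
bob_on_ann = Belief("Bob", rain)
ann_on_bob = Belief("Ann", bob_on_ann)
print([order_of_intentionality(b) for b in (rain, bob_on_ann, ann_on_bob)])  # [1, 2, 3]
```

The sketch only counts levels of nesting; it says nothing about how many levels people routinely use, which is the empirical question.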
Naturally occurring conversations contain many orders of intentionality (Dunbar et al., 1997). However, findings on the typical number of orders of intentionality are conflicting. A study of undergraduates recalling stories with complex mental states found that the number of errors dramatically increased after four orders of intentionality (Kinderman et al., 1998). In contrast, jokes have been found to operate at six or even seven orders of intentionality, and were judged to be most funny at five orders of intentionality (Dunbar et al., 2016).
The literature on orders of intentionality provides rich terminology for conceptualizing recursive intersubjectivity and valuable empirical evidence on its pervasiveness. However, the number of levels routinely used remains contested (more on this debate later).
Takeaway: Recursive Levels of Intersubjectivity
[Table: Psychological-Structural Approaches to Intersubjectivity]
Unanswered questions within these psychological-structural approaches include: How many explicitly indexed levels of intersubjectivity are required for routine coordination? How should the implicit and explicit components of intersubjectivity be conceptualized? And, most importantly for my argument, how are these structures of intersubjectivity scaffolded by specific patterns of interaction? These questions are difficult to address if this approach remains siloed within the cognitive side of Descartes’ dualism.
Interpersonal-Structural Approaches
This quadrant approaches intersubjectivity from the outside: it relocates intersubjectivity into relations that can be compared across persons (agreement, accuracy, misunderstanding). These approaches originate in critiques of Descartes’ (1637) transcendental ego as solipsistic. Mead (1913) and Schutz (1932), for example, situated intersubjectivity in the space between people. The key idea for my argument is not only that Self (S) and Other (O) have different perspectives on the world (X), but also that they recognize this difference and, to some extent, understand that the other recognizes it as well.
Misunderstandings
Ichheiser (1943, 1949) proposed that when Self (S) and Other (O) meet, six representations come into contact: how each person sees themselves, how each person believes they are seen by the other, and how each person sees the other. To conceptualize misunderstandings, he introduced the axiomatic distinction between S’s expression (i.e., S’s verbal or non-verbal communication) and the impression it makes on O (i.e., O’s interpretation). This distinction became fundamental to research on attribution (Heider, 1958), impression management (Goffman, 1959), and interpersonal perception (Kenny, 1994).
For analyzing intersubjectivity, Ichheiser’s framework had two limitations. First, the object around which the perspectives are being coordinated changes from ‘self’ to ‘other’, which means that the levels are not coordinating around the same object. Second, although Ichheiser analyzed misunderstandings, his framework did not enable him to conceptualize how misunderstandings are resolved. Resolution entails realizing that there is a misunderstanding (i.e., S believes that O’s belief about S’s belief is incorrect), which is a level above his framework.
Interpersonal Perception
Laing and colleagues (1966) distinguished three levels of intersubjectivity: direct perspectives (S’s beliefs), meta-perspectives (S’s beliefs about O’s beliefs), and meta-meta-perspectives (S’s beliefs about O’s beliefs about S’s beliefs). They argued that without meta-meta-perspectives there would be no way to resolve misunderstandings, as there would be perspective-taking (meta-perspectives) without any awareness that the perspective-taking might be inaccurate.
Using this three-level framework, Laing and colleagues (1966) developed operational definitions of agreement/disagreement (comparing S’s direct perspective with O’s direct perspective), understanding/misunderstanding (comparing S’s meta-perspective with O’s direct perspective), and realization of understanding/misunderstanding (comparing S’s meta-meta-perspective with O’s meta-perspective). They also distinguished three comparisons made within S’s own perspectives: perceived agreement (S’s direct perspective with S’s meta-perspective), feeling understood (S’s direct perspective with S’s meta-meta-perspective), and perceived understanding (S’s meta-perspective with S’s meta-meta-perspective).
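These operational definitions amount to pairwise comparisons between levels, which can be sketched in a few lines. The following is a minimal illustration, assuming that each perspective can be reduced to a single label (e.g., 'fair' or 'unfair') and that a match is simple equality; the names Perspectives and compare are hypothetical. It covers the three interpersonal comparisons and, as one example of an intrapersonal comparison, feeling understood.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Perspectives:
    """One person's three levels of perspective on the same object X."""
    direct: str      # what this person believes about X
    meta: str        # what this person thinks the other believes about X
    meta_meta: str   # what this person thinks the other thinks they believe about X


def compare(s: Perspectives, o: Perspectives) -> Dict[str, bool]:
    """Compare Self (s) with Other (o), from Self's side of the relation."""
    return {
        # S's direct perspective vs. O's direct perspective
        "agreement": s.direct == o.direct,
        # S's meta-perspective vs. O's direct perspective
        "understanding": s.meta == o.direct,
        # S's meta-meta-perspective vs. O's meta-perspective
        "realization": s.meta_meta == o.meta,
        # S's direct perspective vs. S's own meta-meta-perspective
        "feeling_understood": s.direct == s.meta_meta,
    }


# Hypothetical case: S thinks a policy is fair and assumes O thinks the same;
# O thinks it is unfair and assumes S thinks the same as O.
s = Perspectives(direct="fair", meta="fair", meta_meta="fair")
o = Perspectives(direct="unfair", meta="unfair", meta_meta="unfair")
print(compare(s, o))
```

In this hypothetical case S feels understood although there is disagreement, misunderstanding, and no realization of the misunderstanding, which is the kind of configuration the three-level framework is designed to describe.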
Laing and colleagues’ (1966) framework improves on Ichheiser’s (1943) by keeping each level focused on the same object, explaining how misunderstandings are resolved, and providing a rich terminology for describing complex interpersonal relations. This framework has been expanded to conceptualize societal consensus (Scheff, 1967), personal relationships (Hinde, 1997), and interpersonal perception (Kenny, 1994). Research using this framework has compared the perspectives of parents and children (Sillars et al., 2005), pharmacists and clients (Assa-Eley & Kimberlin, 2005), doctors and patients (Kenny et al., 2010), care-givers and care-receivers (Moore & Gillespie, 2014), and people with autism and their parents (Heasman & Gillespie, 2017). These empirical studies have used questionnaires, but have focused only on direct and meta-perspectives, not meta-meta-perspectives.
Coorientation Framework
The coorientation model of communication, proposed by McLeod and Chaffee (1973), moves beyond person-perception toward perspectives coordinated around any object (X). This model combines Laing et al.’s (1966) interpersonal framework with Newcomb’s (1953) analysis of how S and O co-orient around a generic object X. Therefore, while the interpersonal perception approaches focus on S’s and O’s perceptions of S and O, the coorientation framework focuses on S’s and O’s perceptions of any object X.
The coorientation model has been widely used to study communication patterns between family members (Koerner & Schrodt, 2014; Sillars et al., 2005), employees (Seltzer & Mitrook, 2009; Van Riel & Fombrun, 2007), and human-chatbot relations (Jang & Lee, 2023). As with the interpersonal perception approaches, this research has tended to use questionnaires, and has also tended to neglect meta-meta-perspectives in empirical research.
Takeaway: Three Levels of Explicit Perspective-Taking in Dyadic Relations
[Table: Interpersonal-Structural Approaches to Intersubjectivity]
Note. S = Self; O = Other; p = perspective; X = any object.
A limitation of these interpersonal-structural approaches is their detachment from social interaction. What are the interactional and communicative processes that sustain this web of intersubjectivity? Despite providing a compelling argument for the importance of meta-meta-perspectives, why has the empirical research neglected them? And most importantly for us, why does the structure of intersubjectivity require three levels? These questions are difficult to address if one remains focused on timeless structures. All these questions presuppose social interaction, since that is how misunderstandings are resolved. The next section provides the missing ingredient: three-turn sequences enable understandings to be displayed, repaired, and consolidated.
Interpersonal-Interactional Approaches
This quadrant pushes furthest from Descartes. It is neither psychological nor structural. Instead, it conceptualizes intersubjectivity as something that grows within situated interaction; it is a performance rather than a thing. The focus is on how each turn in an interaction sequence responds to the prior turn, not on unobservable mental states. This research ranges from non-verbal interactions to highly linguistically mediated forms of intersubjectivity. For my argument, this quadrant contributes a crucial ingredient: turn-by-turn organization, especially the functional role of third turns in both repairing and consolidating mutual understanding.
Enactive Intersubjectivity
Enactive intersubjectivity is embodied coordination achieved through direct physical interaction, without requiring mental representations. This approach begins with a criticism of the psychological approaches and shifts the focus to observable interactions (De Jaegher et al., 2016; Gallagher, 2020). Building on Merleau-Ponty (1945) and Wittgenstein (1953), its proponents argue that mental representations are often unnecessary; the anger we see in the face of someone shouting at us is immediate and embodied, not based on looking up a mental representation (Fuchs & De Jaegher, 2009). Even an utterance such as ‘I think’ can be treated not as a marker of subjectivity but as a dynamic stance marker that foregrounds contestability (Kärkkäinen, 2006).
Enactive intersubjectivity is embodied, situated, and often non-linguistic (Fuchs & De Jaegher, 2009; Gallagher, 2023). In two-party coordination, each side is adjusting to the other side, creating a dynamically evolving system. This is evident in protoconversations (primary intersubjectivity; Trevarthen, 1998), where infants and care-givers exchange smiles, vocalizations, and gestures. What is being coordinated are not two representational systems, but instead action, voice, touch, gesture, and gaze. Examples of enactive intersubjectivity include the spontaneous reciprocation of a smile, or how boxers have a conversation of gestures through dodging, feinting, and repositioning.
There is strong empirical evidence for basic enactive intersubjectivity, but the approach is often criticized for failing to scale up to more complex intersubjective phenomena. Gallagher (2023), however, argues that narratives, diagrams, and conversation externalize complex (apparently private) representations, enabling practical coordination, training, and socialization into a world of perspectives. His argument is that whatever is taken as ‘complex’ or ‘private’ intersubjectivity can be decomposed into observable and practical tasks.
The contribution of enactive intersubjectivity is to conceptualize non-representational coordination. It distributes intersubjectivity into body-environment relations, and thus naturalizes it, linking it into non-human forms of coordination. The limitation, however, is that instead of explaining psychological intersubjectivity in terms of social interaction, this approach risks removing the psychological element. In this sense it is similar to conversation analysis (De Jaegher et al., 2016), which also eschews mental representations.
Conversational Repairs
Conversation analysis provides the empirical counterpart to abstract theories of intersubjectivity (Schegloff, 1992). Instead of intersubjectivity being a philosophical problem of knowing other minds, it is studied as a mundane conversational task. It is studied in interactionally endogenous terms, as it arises for participants themselves, without appeal to anything beyond what is empirically evident in the interaction (Schegloff, 2007).
Intersubjectivity is particularly evident in conversational repairs (e.g., rephrasing, clarifications), because it is in the breakdown of intersubjectivity that the mechanisms of perspective coordination become evident. The main components of a repair are who initiates, who repairs, and in which turn the repair occurs. These reveal three main repair types.
[Excerpt: Third-Turn Self-Initiated Repair (from Schegloff, 1992, p. 1303)]
[Excerpt: Second-Turn Other-Initiated Repair (from Schegloff, 1992, p. 1302)]
Finally, self-repairs can also occur in the first turn, when people repair themselves while speaking. This is illustrated in the first turn of Excerpt 2. Marcia is trying to say that the soft-top on the car was ripped off as part of a theft. However, she is unsatisfied with her first attempt, initiating a self-repair (“which iz tihsay”), and clarifying that “
These analyses of intersubjectivity as a practical achievement are insightful. However, like the enactive approach, these analyses resist integration with psychological approaches because the methodology explicitly excludes mental states, considering only what is endogenously observable in the transcript (de Ruiter & Albert, 2017). Yet integration would be beneficial (Albert & de Ruiter, 2018), and I will argue that first-turn self-repairs can bridge interactional and psychological intersubjectivity.
Initiation-Response-Feedback Sequences
Classroom learning is a domain in which psychology cannot be bracketed aside. Learning is, by definition, a situation-transcending psychological phenomenon that is exogenous to the interaction. Research on how learning is achieved through conversations has identified a recurring three-turn sequence: initiation-response-feedback (Sinclair & Coulthard, 1975; Waring, 2009).
[Excerpt: Learning in Three-Turn Sequences]
Broadening this insight, Linell and Marková (1993; Marková & Linell, 1996) argue that the third turn does not merely enable learning; it consolidates mutual understanding. In Excerpt 3, the teacher’s feedback (turn 3) does more than mark the prior turn as correct; it makes the student aware that the teacher knows the student answered correctly. Within this emerging mutual understanding the student might, for example, feel pride. However, such pride is unintelligible within a first turn (unless the student is anticipating a congratulatory third turn). The initiation-response-feedback model thus reveals how intersubjective dynamics ‘ratchet-up’, with subsequent turns enabling novel dynamics (e.g., pride or guilt, or trying to make someone feel pride or guilt).
This broader role for third turns, to consolidate mutual understanding, is widespread. Bavelas and colleagues’ (2017) analysis of students’ getting-acquainted conversations found that over half of all turns were part of three-turn calibration sequences. New information was demonstrated in the second turn and consolidated by follow-ups in the third turn. Such rapid, efficient, and often non-verbal third-turn follow-ups are routine sequences, not just for repairing or learning but also for consolidating mutual understanding. Without these interactions, the levels of intersubjectivity identified in the structural quadrants would be unstable, unevidenced, and uncorrectable.
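One way to read the three-turn sequences described above is as a chain in which each turn takes the prior turn as its object. The following minimal sketch assumes, purely for illustration, that the depth of this chain can be mapped onto the direct, meta, and meta-meta levels; the names Turn, level_of, level_name, and build_sequence are hypothetical, and the mapping is not an established coding scheme.

```python
from dataclasses import dataclass
from typing import List, Optional

MOVES = ["initiation", "response", "feedback"]
LEVELS = ["direct", "meta", "meta-meta"]


@dataclass(frozen=True)
class Turn:
    """One conversational turn, optionally taking the prior turn as its object."""
    speaker: str
    move: str
    about: Optional["Turn"] = None


def level_of(turn: Turn) -> int:
    """Depth of the chain of turns: 1 for a first turn, 2 for a reply to it, and so on."""
    depth = 1
    while turn.about is not None:
        depth += 1
        turn = turn.about
    return depth


def level_name(turn: Turn) -> str:
    """Illustrative mapping of chain depth onto perspective level, capped at meta-meta."""
    return LEVELS[min(level_of(turn), len(LEVELS)) - 1]


def build_sequence(first: str, second: str) -> List[Turn]:
    """Build an initiation-response-feedback sequence with alternating speakers."""
    turns: List[Turn] = []
    prior: Optional[Turn] = None
    for i, move in enumerate(MOVES):
        speaker = first if i % 2 == 0 else second
        prior = Turn(speaker, move, prior)
        turns.append(prior)
    return turns


for t in build_sequence("teacher", "student"):
    print(t.speaker, t.move, level_name(t))
```

The alternation of speakers models the rotation of participant positions: the speaker who initiates is also the one who gives feedback, now positioned as a respondent to the other's response.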
Takeaway: Three-Turn Interaction Sequences
[Table: Interpersonal-Interactional Approaches to Intersubjectivity]
Interpersonal-interactional approaches reveal the moment-to-moment embodied coordination of perspectives as an ongoing achievement. However, these approaches are also epistemologically fractured. Enactive intersubjectivity and conversation analysis are anti-representational, eschewing mental representations. While this narrowed focus powerfully highlights intersubjectivity as a practice, it provides an incomplete account of the psychological aspects of intersubjectivity (e.g., intra-psychological experiences of imagining other perspectives). While many view psychological concepts as antithetical to this approach, the next section will reveal striking similarities. Specifically, three-turn interaction sequences are frequently observed within a single turn. I will use this observation to argue that turn-taking scaffolds perspective-taking, and therefore first-turn self-repairs are observable instances of recursive perspective-taking.
Psychological-Interactional Approaches
This quadrant stems not from Descartes’ ideas, but from his method. Arguably, Descartes was the first phenomenologist, inspiring Husserl (1931). He analyzed the workings of his own mind from the inside. And although he focused on timeless truths, his actual method of meditation entailed a dynamic interplay of perspectives (Gillespie, 2006a). Continuing in this vein, psychological-interactional approaches examine the flow of phenomenological experience as punctuated with intersubjective elements. The focus is on the semiotics of how perspectives (variously termed I-positions, signs, or voices) interact within psychological experience. The key idea that I will use for my argument is that the stream of experience comprises inner dialogues of three-part semiotic sequences that mirror the three-part sequences identified by the interpersonal-interactional approaches.
The ‘I’ and the ‘me’
James (1890) conceptualized the stream of thought as the psychological flow of experience (e.g., images, impulses, thoughts) that passes before the mind’s eye during introspection (Levine, 2018). The ‘I’ is the subject of this stream of thought (i.e., the one who senses, thinks, and acts). The ‘me’ refers to moments of self-awareness within the stream of thought, when the self has become the object of experience (Woźniak, 2018). When the ‘I’ turns upon itself to observe itself, it finds only an inert ‘me’: a hollow memory of a prior initiation. James (1890) vividly described the self as duplex, flipping over on itself, unable to catch its own experiencing, but he did not explain its origin.
Mead (1913, 1934) proposed that the ‘me’ originates in social interaction. He argued that the ‘I’, the subject of action, is common to all organisms, but the ‘me’ is peculiar to humans and arises through perspective-taking. Just as we see the actions of others from the outside (as a ‘she’ or ‘them’), when we take the perspective of others toward ourselves, we see ourselves from the outside (as a ‘me’). Through socialization, Mead proposed, humans develop a situation-transcending structure of intersubjectivity (which he called the “generalized other”; Mead, 1934, p. 90), which is an internalization of the perspectives of significant others (e.g., family, friends, community). Crucially, the ‘I’ and the ‘me’ are not structures within the self; they are phases of a semiotic process turning over upon itself. The shift of perspective that turns the ‘I’ at time one into the ‘me’ at time two is, Mead (1934) argued, derivative of the shift of perspective occurring in social interaction between self and other (Gillespie, 2005).
Dialogical Self
The theory of the dialogical self develops the ideas of James and Mead, with inspiration from Bakhtin (1986). It conceptualizes the self as a landscape of I-positions within which phenomenological experience moves (Hermans, 2002). I-positions, like James’s ‘I’, are places from which the self thinks and speaks. These can either originate in the acting subject (i.e., impulses and attitudes of the self) or in the perspectives of others (i.e., the voices of significant others in the social world). This landscape of potentially discordant I-positions forms a “society of mind” (Hermans, 2002, p. 147).
Dialogical self research does not examine whether there is perspective-taking, or whether it is accurate; instead it examines how I-positions interact, for example, creating inner conflicts. This approach has been particularly useful in clinical contexts (Neimeyer, 2006; Stiles et al., 2004). The therapeutic encounter, it is argued, should aim to create new or strengthened I-positions that enable the client to reflect upon conflicting I-positions, potentially resolving inner conflicts (Hermans & Dimaggio, 2004).
Commonly, conflict between I-positions leads, over time or through therapy, to a third, meta I-position that reconciles the conflict (Kay et al., 2024). For example, Branco and colleagues (2008) analyzed an interview transcript with Rosane, a Catholic woman in Brazil. They identified two I-positions in conflict: Rosane was a Catholic daughter (first I-position) and a lesbian (second I-position). This tension was resolved by the emergence of a third I-position, where Rosane became committed to being a Catholic missionary working within the lesbian community.
Valsiner (2005) has theorized the three-part emergence of meanings within the dialogical self. For example, in response to conflicting meanings, such as being hungry (sign 1) and finding dirty bread (sign 2), third signs, or meta-positions, emerge to regulate the conflict. Thus sign 3 could be insisting that even dirty bread is still bread, arguing that it is not ‘so’ dirty, cleansing the bread, or conceiving of both dirt and bread as part of nature (Josephs & Valsiner, 1998, p. 6). These third signs in the semiotic sequence simultaneously address the prior tension and feed forward to promote and constrain the next sign in the stream of thought (Valsiner, 2018). A key insight about these semiotic sequences is that without a third sign, there would only be semantic tensions without any resolutions or circumventions.
Multivoicedness
Multivoicedness is related to the Dialogical Self approach but broader, as this tradition is not bound to a single theory or focused exclusively on the self. It studies talk and texts in terms of moment-to-moment perspective shifts. The idea is that a detailed analysis of people’s talk can reveal multiple, often colliding points of view that interact in real time as they talk. This approach combines the insights of James and Mead with those of Bakhtin (1986), Rommetveit (1974), Linell (2009), and Marková (2016). It takes a micro-textual approach to studying voices within the mind as they manifest in people’s observable utterances.
Multivoicedness in a focus group
The tension between the first and second voices creates semantic contact between the speaker’s initial voice and a disruptive voice (e.g., a belief attributed to the outgroup; Gillespie, 2020). When the second voice is disruptive, the third typically attempts to quell or shut it down. Tactics include avoidance (e.g., “it is totally separate”), de-legitimizing (e.g., “but they are delusional”), and limiting (e.g., “but it’s not that bad”). The key point is that without the third part of the semiotic sequence, there could be no response to psychological tensions. Accordingly, we again see the same three-part structure, mirroring not only what has been found in the Dialogical Self but also what has been observed in the interpersonal-interactional approaches.
Takeaway: Three-Part Semiotic Sequences
Psychological-Interactional Approaches to Intersubjectivity
While these approaches assume voices in the social world connect to voices in the stream of thought, the mechanism is usually stated vaguely as ‘internalization’ rather than specified in interactional terms. How do the voices of other people become voices in the mind? What is the relation between the inner and outer voices? What types of social relations scaffold this internalization? To address these questions requires specifying the active ingredient in social interaction. My proposal is that this ingredient is turn-taking. Because turns are publicly observable and because speakers also hear themselves as audiences, turn exchange provides a mechanism by which three-part interactional sequences can become three-part semiotic sequences.
Integrative Approaches
Although much of the literature falls into the four quadrants of Figure 1, there is also a substantial literature on how the subjective side of intersubjectivity is embedded in observable social relations (De Jaegher et al., 2010; Gillespie & Martin, 2014; Lawrence & Valsiner, 1993; Marková, 2016; Schilbach et al., 2013; Valsiner & van der Veer, 2000; Zittoun et al., 2007). Many of these approaches stem from Mead (1913) and Vygotsky (1997), who both argued that cognitive functions begin between people, in social interaction, and only subsequently appear in the mind. Since their broad insights, researchers have sought to identify the specific social interaction patterns that might scaffold intersubjectivity. I review their insights before proposing turn-taking as a key ingredient.
Primary, Secondary, and Tertiary Intersubjectivity
The distinction between primary, secondary, and tertiary intersubjectivity (Bråten, 2009; Trevarthen & Aitken, 2001) conceptualizes young children’s social understanding within its interactional and cultural context. Each of these three forms of intersubjectivity is embedded in a different type of interaction: dyadic interaction, coordination around objects, and culture.
Primary intersubjectivity is dyadic attunement (Trevarthen, 1998). It arises through parent-infant face-to-face interaction or protoconversations, where nothing is communicated except connectedness itself. At this level, there is no shared object independent of the dyadic relation and thus there is no reading the mind or intention of others (i.e., no explicit theory-theory of mind). Primary intersubjectivity is built on the mirror neuron system and is comparable to the simulation-theory of mind approach (Ferrari & Gallese, 2007).
Secondary intersubjectivity is shared attention. It originates in triadic relations, with self and other coordinating around a shared object (e.g., child and adult playing with a ball), where each learns to read the intention of the other (e.g., expecting the ball). At this level, the mind of the other is explicitly represented in relation to the shared object and is thus comparable to a theory-theory of mind (Bråten, 2009).
Tertiary intersubjectivity incorporates culture (narratives, folk beliefs) that exists beyond the immediate triadic relation (Bråten, 2009). Narratives scaffold perspectives, providing a web of interacting perspectives that the audience gets drawn into and can experience vicariously. Additionally, language, it is argued, facilitates more complex intersubjectivity, enabling people to ask about people’s intentions, feelings, and perspectives.
Shared Intentionality
Tomasello (2019) has also proposed three different types of intersubjectivity, each embedded in a distinct social relation. These are broadly complementary with primary, secondary, and tertiary intersubjectivity, but put more emphasis on social coordination.
The first stage is emotion-sharing, such as protoconversations between parents and their infants. Although similar to primary intersubjectivity, Tomasello (1999) argues that this stage does not contain enough mutual understanding to be called properly intersubjective. There is dynamic attunement and embodied co-presence, but no social coordination.
The second stage is joint intentionality with recursive understanding within triadic relations; S and O both know that they are attending to X. Chimpanzees, Tomasello (2019) argues, can follow gaze direction and thus attend to the same object, but they do not know that they are attending to the same object. Mutual understanding that both parties are attending to the same object entails a theory-theory of mind.
The third stage is cultural collective intentionality that integrates multiple perspectives within a partially shared social world. At this level, S and O are aware of what is common and what is not; they each know that the other inhabits a recursive multi-perspectival and partially shared social world (Tomasello, 2020).
Tomasello’s (2019) contribution is to ground types of intersubjectivity in specific patterns of interpersonal interaction. He points to “dialogic interactions” (Tomasello, 2019, p. 188), the back-and-forth exchanges within triadic (S-O-X) relationships, through which attempts, failures, and requests for repair incrementally build triadic mutual understanding. This intersubjectivity is then further scaffolded, at the third level, by language, culture, and institutions.
This basic idea is also found among other scholars. Tuomela (2005) argues that to have a ‘we-intention’ implies mutual belief, not only that we know the others’ perspective, but also that they know ours, thus presupposing meta-meta-perspectives. But this structure must be actively coordinated in interaction. Individual intent is bound into shared intentionality, Gilbert (2015) argues, through reciprocal expressions of willingness that create normative commitments neither party can unilaterally rescind. So, again, the focus is on back-and-forth exchanges between S and O in relation to some X. But what is the precise interactional mechanism that bridges from these exchanges into psychological intersubjectivity? I will argue that it is turn-taking.
Position Exchange Theory
Position exchange theory (Gillespie & Martin, 2014; Martin & Gillespie, 2010) builds on Trevarthen’s and Tomasello’s models to propose a precise mechanism through which perspective-taking can arise in triadic social interaction. The idea is that we are socialized into positions that support perspectives; when we exchange positions, we effectively learn to exchange perspectives.
Initially the child interacts with objects and with other people, gradually differentiating themselves from the world, and learning action patterns vis-à-vis objects and people in the world. The child is engaging in social action, but there is no reflexive awareness of it. At this stage children can play a role, but it is not socially coordinated (i.e., they can’t regulate their play from the standpoint of others).
Reflexive intersubjectivity emerges within triadic interactions, when S and O have roles with respect to a task oriented to X. When S and O exchange roles within the task, they not only learn about the perspective of the other but also learn to integrate these perspectives. Consider the game of hide-and-seek (Gillespie, 2006b): the hider regulates their hiding actions from the standpoint of the seeker – because they themselves have been the seeker. Equally, the seeker might find a good place to hide while seeking. Exchanging positions is widespread (e.g., giving-getting, talking-listening, fleeing-chasing, buying-selling), and the idea is that this both cultivates and integrates complementary perspectives, enabling perspective-taking.
Position exchange theory also specifies how narratives (stories, films, reverie) can scaffold intersubjectivity. Narratives invariably contain multiple perspectives (i.e., different characters), and story-telling guides us through these (e.g., alternating between the perspectives of key protagonists). Narratives tend to introduce us first to one perspective, and then to a reciprocating perspective (e.g., Little Red Riding Hood and the wolf, the wolf and the three pigs, or Goldilocks and the three bears). Within narratives the audience experiences position exchange, as they are guided through both sides of a social interaction, thus scaffolding their intersubjectivity.
Takeaway: Triadic Interaction
Integrative Approaches to Intersubjectivity
S = Self; O = Other; X = any object; C = culture.
I build on these approaches, and especially position exchange theory, to argue that turn-taking may be the active ingredient. It is present in all forms of social interaction. Turn-taking spans from non-verbal emotional adjustments to complex conversations and negotiations. As narratives progress, they give ‘turns’ to key characters. This focus on turn-taking develops the position exchange hypothesis. Turn-taking is the most fundamental and pervasive exchange of positions: initiators becoming responders, speakers becoming listeners.
Three-Level Intersubjectivity
What is the minimum number of levels a model of intersubjectivity needs to consider? Comparing the psychological-structural (Table 1) and the interpersonal-structural (Table 2) approaches reveals agreement on recursive levels of intersubjectivity, but disagreement about the number of levels. The interpersonal perception literature conceptualizes three levels, but often only studies two levels. Meanwhile the orders of intentionality literature has studied four (Kinderman et al., 1998), five, and even six (Dennett, 1989; Dunbar et al., 2016) levels. These inconsistencies, I suggest, stem from three methodological issues.
The first source is how implicit and explicit perspective-taking is treated. Kinderman et al. (1998) focused on explicit perspective-taking (i.e., talk about mental states). By contrast, Dunbar and colleagues (2016) counted implied perspectives in jokes. They maintained that before a joke is told, there are already three orders of intentionality (the speaker intends for the listener to understand that the speaker intends to tell a joke). Any mental states within the joke are then added to these “minimum obligatory three mindstates” (Dunbar et al., 2016, p. 133). If these assumed mindstates are removed, the number of explicit orders of intentionality in jokes drops to four, comparable to Kinderman et al.’s (1998) findings. This resolvable confusion reveals that we need to better conceptualize the implicit and explicit aspects of intersubjectivity.
A second source of discrepancy is the conflation of intersubjective depth with intersubjective breadth. Depth refers to the number of recursive perspective-taking levels within a dyadic exchange (e.g., S’s belief about O’s belief about S’s belief). Breadth refers to the number of agents whose perspectives are tracked simultaneously (e.g., tracking A’s, B’s, and C’s separate beliefs). Research on interpersonal perception measures depth within dyads, capping out at meta-meta-perspectives (Assa-Eley & Kimberlin, 2005; Heasman & Gillespie, 2017; Koerner & Schrodt, 2014). Research on orders of intentionality has tended to count the mindstates of all agents, combining intersubjective depth and breadth (Dunbar et al., 2016). A six-order story with three agents may have only three levels of depth. Distinguishing intersubjective depth and breadth explains why intentionality research reports higher orders of intentionality.
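The distinction between depth and breadth can be made concrete with a minimal Python sketch. The representation is an assumption of mine for illustration: each perspective is a chain of agents (e.g., `["C", "B", "A"]` for C's belief about B's belief about A's belief about X), and the function names are hypothetical. In particular, pooling mindstates by summing chain lengths is only one simple way to illustrate how combining depth and breadth inflates the count; it is not the coding scheme used in the cited studies.

```python
from typing import List

Chain = List[str]  # agents from outermost to innermost perspective

def depth(chain: Chain) -> int:
    """Recursive depth: number of nested perspectives in one chain."""
    return len(chain)

def max_depth(chains: List[Chain]) -> int:
    """Deepest single chain of perspective-taking in the story."""
    return max(depth(chain) for chain in chains)

def breadth(chains: List[Chain]) -> int:
    """Number of distinct agents whose perspectives are tracked."""
    return len({agent for chain in chains for agent in chain})

def pooled_order(chains: List[Chain]) -> int:
    """Illustrative pooled count: every mindstate of every agent."""
    return sum(depth(chain) for chain in chains)

# Hypothetical story: A believes X; B believes that A believes X;
# C believes that B believes that A believes X.
story = [["A"], ["B", "A"], ["C", "B", "A"]]

print(pooled_order(story), max_depth(story), breadth(story))
```

Under these assumptions, the hypothetical story pools six mindstates across three agents, while no single chain is more than three levels deep.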
Finally, why have interpersonal-structural approaches neglected the third level of intersubjectivity despite conceptualizing it as critical for resolving misunderstandings? I suggest it is because they use questionnaires. Questionnaire items about meta-meta-perspectives are confusing to answer and produce messy results. The issue is that asking an explicit question about any level of intersubjectivity entails responding at one level higher (Gillespie & Cornish, 2010). In contrast, orders of intentionality have been observed in jokes (Dunbar et al., 2016) and in recalled story elements (Kinderman et al., 1998). Observational methods have an advantage because they don’t require participants to self-report on their perspective-taking; it is done by the researchers.
A self-report question at level N requires the respondent to report on level N from level N+1. The question ‘Do you believe X?’ requires awareness of one’s believing (level 2 operation on level 1 content). The question ‘Do you think she believes X?’ requires awareness of one’s meta-belief (level 3 operation on level 2 content). The question ‘Do you think she believes that you believe X?’ requires level 4 operation on level 3 content, which pushes respondents toward their intersubjective ceiling. Self-report methods are therefore unwittingly capped at studying one fewer intersubjective level than the method appears to target, explaining the systematic neglect of meta-meta-perspectives despite their theoretical importance.
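The arithmetic of this argument can be sketched in a few lines of Python. The function names and the `ceiling` parameter are illustrative assumptions, not an established measure; the sketch only encodes the claim that self-reporting on level N content demands a level N+1 operation.

```python
# Example questionnaire items, keyed by the level of content they target.
QUESTIONS = {
    1: "Do you believe X?",
    2: "Do you think she believes X?",
    3: "Do you think she believes that you believe X?",
}

def operation_level(content_level: int) -> int:
    """Level of operation needed to self-report on content at a given level."""
    return content_level + 1

def highest_reportable_level(ceiling: int) -> int:
    """Highest content level whose self-report stays within `ceiling`
    (a hypothetical limit on a respondent's recursive perspective-taking)."""
    return ceiling - 1

for content_level, question in QUESTIONS.items():
    print(f"{question!r}: level {operation_level(content_level)} "
          f"operation on level {content_level} content")
```

If, for the sake of argument, the ceiling were set at three levels, `highest_reportable_level(3)` would return 2, which is consistent with questionnaires that conceptualize three levels but in practice study two.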
Disentangling these three methodological issues reveals a deep convergence underlying superficial differences. Using different methods and theoretical traditions, these approaches converge on explicitly indexed (not implicit or self-reported) recursive (intersubjective depth, not breadth) perspective-taking having three levels. Why do social relations stabilize on three levels of perspective-taking? How are these anchored in social interaction?
Three-Turn Intersubjectivity
What is the minimum length of interaction sequences that a model of intersubjectivity needs to consider? Comparing the interpersonal-interactional (Table 3) and the psychological-interactional (Table 4) approaches reveals the importance of three-part interaction sequences. Although much literature on turn-taking and conversation analysis has focused on two-turn sequences, three-turn sequences (in interaction) and three-part semiotic sequences (in psychology) are required for complex intersubjective phenomena.
Turn-taking occurs in many non-human species (Mondémé, 2022), and in all human cultures (Nguyen et al., 2022). It is evident in non-verbal interactions, games, queues, and public debates. Turn-taking can be managed by umpires, traffic lights, and moderators. However, the primordial form is informal dialogue, when participants coordinate turns of talk (Sacks et al., 1974). Sometimes speakers select the next speaker (e.g., by asking a question), but more often turn transitions are managed implicitly via cues such as transition-relevance points, gaze, and gesture (Skantze, 2021). Successful turn-taking entails remaining on topic, responding to the prior turn, and not speaking over anyone else. Yet, despite this complexity, speakership is usually transferred smoothly within milliseconds (Templeton et al., 2022).
Turn-taking research has neglected third turns. Usually, it is studied as the transition between turn 1 and 2; any third turn is just another transition to be analyzed in relation to the prior turn. Similarly, enactivist, behaviorist, and cognitive approaches tend to focus on the first two turns (i.e., responses to actions). Conversation analysis, despite studying repairs in the third turn, takes two-turn adjacency pairs as the basic unit of analysis (e.g., question-answer, greeting-greeting, request-response; Schegloff, 2007). It conceptualizes third turn repairs as deviations.
In contrast, I argue that third turns are widespread, frequent, and fundamental to intersubjectivity. Third turn repairs pervade face-to-face and online dialogue (Dingemanse et al., 2015; Goddard & Gillespie, 2025). Moreover, three-turn calibration sequences, where the third turn demonstrates, elaborates, and consolidates understanding, occur many times a minute (Bavelas et al., 2017). Without these consolidations of mutual understanding, resolving misunderstandings, and even conducting strategic brinkmanship, would be impossible (Laing et al., 1966; Schelling, 1966).
The third turn unlocks distinctive intersubjective phenomena. The first and second turn enable emotional resonance, empathy, basic simulation, perspective-taking and second order intentionality. But the third turn enables repairs, consolidation of learning, and feelings of being understood (or misunderstood). Most importantly, the third turn consolidates mutual understanding. Without a third turn, O would have beliefs about S’s perspective, but could not discover or comprehend a misunderstanding. No matter how inaccurate or fantastical O’s understanding of S’s perspective, S would be unable to correct it and, thus, complex social coordination would collapse. Without three-turn interaction we would be unable to mutually agree on the rules through which we coordinate, and thus institutional life, as we know it, would be impossible.
Comparably, three-turn semiotic sequences unlock distinctive psychological phenomena. If thoughts only responded to the prior thought there could be cognitive tension, but no resolution. Research on the dialogical self and multivoicedness shows how one voice prompts a second, creating a tension that is resolved with a third (Table 4). Without the third voice there is only the tension between the first two voices. The tension between ‘I am hungry’ and ‘the bread is dirty’ is resolved by a third sign (e.g., ‘I’ll eat something else’). But this third sign only has meaning due to its position as a third sign; it can’t resolve the tension if acting as the first or second sign (Josephs & Valsiner, 1998). Thus it is the third sign in the sequence that enables resolving psychological tensions.
More speculatively, three-part sign sequences may be central to logical inference. Peirce (1955), who provided one of the most thorough analyses of semiotics and logic, emphasized ‘thirdness’. In his studies of logic and signs he argued for three-part sequences, with the third part being crucial for mediation, meaning, and inference. Consider the classic syllogism: All men are mortal, Socrates is a man, Socrates is mortal. There is no logical inference if the third sign is missing. The first sign states a rule. The second sign, in this case, is an observation. Inference arises with the third sign, which integrates the first (rule) and second (case) to yield a conclusion (e.g., therefore Socrates will die). Thus, even in formal streams of thought (i.e., logical inference), we can detect three-part semiotic sequences.
The minimum length of interaction sequences for intersubjectivity, I argue, is three. Three-turn interpersonal interaction is necessary for mutual understanding, realizing misunderstandings, and resolving misunderstandings; three-part semiotic sequences are necessary for resolving psychological tensions and making logical inferences. Is it just coincidence that the minimum number of levels and turns is the same? Are the three levels derivative of the three turns?
Dialogical Intersubjectivity: From Turn-Taking to Perspective-Taking
The self-other-object (S-O-X) triadic relation is a core unit of analysis across cultural, developmental, and social psychology (Zittoun et al., 2007). However, these triadic models rarely specify the type of relationships within the triad. What do the relations (lines in the triangle) denote? Is mere action sufficient? What occurs within triadic relations that might produce three-level intersubjectivity?
Triadic relations, I argue, only become genuinely triadic (i.e., more than the sum of the parts) when they are functionally unified through three-turn interaction sequences that build three-level intersubjectivity. Three-turn interaction sequences not only involve all components (S, O, X), they also entail a looping back (the third turn responds to the second turn, which responded to the first turn). This cumulatively and recursively builds the levels of intersubjectivity.
Figure 2 illustrates how three-turn interaction within triadic relations can scaffold three levels of intersubjectivity. Starting with the undifferentiated triangle (A), S has an action orientation toward X (B). O then reacts to S’s first turn (C). And finally, S reacts to O’s reaction to S (D). Because each turn responds to the prior turn, and to some extent incorporates it, each turn introduces another level of intersubjectivity: direct perspective (B), meta-perspective (C), and meta-meta-perspective (D). In the figure, only S in the first turn (B) is responding to X; the subsequent turns are responding to the previous responses (not X). The ratcheting of intersubjective levels occurs because each turn can take the prior turn, rather than X, as its object.
Figure 2. Three-turn interaction producing three-level intersubjectivity in a triadic relation
The key idea is that each turn of communication has the potential to introduce a higher level of intersubjectivity. Ichheiser’s (1949) expression-impression distinction helps conceptualize this mechanism. Each communicative expression, or turn of conversation, creates an impression on the audience that is one order of intentionality higher. Whatever S expresses, the impression created for O concerns the perspective of S. Thus, what begins as a direct perspective for S arrives as a meta-perspective for O.
Ratcheting Levels of Intersubjectivity
J = judge; D = defendant; p = perspective; (admit) = the object around which the perspective-taking occurs.
The S-O-X triangle (like the three levels and three turns) is a simplification. It is a distilled set of ingredients, to aid focused theorizing. In reality, S-O-X triadic relations always occur in a context (see Table 5; Zittoun et al., 2007). This context does more than merely ‘surround’ interaction; it can structure who speaks, who interrupts, whether third turns are licensed, and how feedback is received. For example, misunderstandings often arise because social norms (politeness, communicative routines) or power structures (status, role-authority) inhibit third-turn consolidations. In the case of position exchange, the context is even more important, because it creates the entire experiences of S and O that are re-shaped each time they exchange social positions, or roles.
The key point is that each turn can take the prior turn, not X, as its object, thereby foregrounding the perspective in the prior turn (not the object of that perspective).
Each turn not only responds to the prior turn, but subsumes it, bringing another perspective on the prior perspective. The perspective of the prior turn becomes the object of the next turn, namely, a meta-perspective – which itself can become the object of the next turn (i.e., a meta-meta-perspective).
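The ratcheting mechanism can be sketched as a short Python function. This is an illustrative sketch under stated assumptions, not a formalization taken from the literature: the `Turn` class, the `ratchet` function, and the convention of prefixing a speaker and ‘p’ (perspective) to the object are hypothetical, adopted only to mirror labels such as SpX.

```python
from dataclasses import dataclass
from typing import List, Tuple

LEVEL_NAMES = {1: "direct perspective",
               2: "meta-perspective",
               3: "meta-meta-perspective"}

@dataclass
class Turn:
    speaker: str            # e.g., "S" or "O"
    about_prior_turn: bool  # True if the turn takes the prior turn as its object

def ratchet(turns: List[Turn], obj: str = "X") -> List[Tuple[str, int]]:
    """Return (label, level) for each turn in the sequence."""
    results = []
    prior_label, prior_level = obj, 0
    for turn in turns:
        if turn.about_prior_turn:
            # The prior perspective becomes the object: one level is added.
            label, level = f"{turn.speaker}p{prior_label}", prior_level + 1
        else:
            # The turn addresses X itself: a fresh direct perspective.
            label, level = f"{turn.speaker}p{obj}", 1
        results.append((label, level))
        prior_label, prior_level = label, level
    return results

three_turns = [Turn("S", False), Turn("O", True), Turn("S", True)]
for label, level in ratchet(three_turns):
    print(label, "=", LEVEL_NAMES.get(level, f"level {level}"))
```

In this sketch, a level is added only when a turn takes the prior turn as its object. A turn that returns to X resets to a direct perspective, which reflects the claim that turns can, but need not, ratchet intersubjectivity.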
Making the Implicit Explicit
Ratcheting intersubjectivity through turn-taking explains how intersubjectivity moves from being implicit to explicit. Intersubjectivity begins with implicit embodied attunement (Trevarthen, 2015), but it culminates in highly nuanced narratives that dramatize the interplay of multiple perspectives (e.g., Shakespeare; Bakhtin, 1986) and highly reflexive and strategic perspective-taking (e.g., brinkmanship; Schelling, 1966). How does implicit intersubjectivity develop into explicit intersubjectivity?
Impressions always deviate from expressions: what is said diverges from what is heard. Sometimes this deviation is an error (e.g., O misunderstands S), but sometimes it is a valid insight (e.g., O recognizes S’s implicit assumption). In these latter cases, the expression creates ‘surplus’ meaning (Gillespie, 2003), namely, meanings that S did not intend to give (i.e., their implicit assumption).
Consider the following example. While eating a cake, S exclaims “Yum!”. O responds: “You really like cakes!” To which S replies: “You think I eat too much cake!” In this example, both the second and third turns interpret meanings in the prior turns that were not explicitly given. O ‘takes’ the meaning that S likes cake (S did not explicitly say it), and S ‘takes’ the meaning that O thinks S eats too much cake (which again O did not explicitly say). Of course, O and S may have been hinting at these meanings. But, for our purpose, it is enough to point out that each instance of turn-taking has the potential to make explicit meanings that were implicit in the prior turn.
When a subsequent turn takes the prior turn as its object it has the potential to draw out of it meanings that were implicit. The range of meanings and assumptions that can be called out in subsequent turns is infinite, and many will be invalid. Thus turn-taking can both ratchet intersubjective levels and externalize implicit meanings because each turn introduces externality on the prior turn. This occurs when the subsequent turns are not referring to the object of discourse itself, but to the prior turns as turns (i.e., the object of talk shifts from X to SpX or SpOpX).
Reinterpreting Internalization
The proposed model of dialogical intersubjectivity reconceptualizes internalization. Since Vygotsky’s (1997) argument that psychological function is derivative of social interaction, there have been many attempts to explain the mechanism of internalization. But there have also been critiques, arguing that the concept of internalization is an unhelpful byproduct of Descartes’ mistaken dualism (Gallagher, 2020; Stetsenko, 2017). Crossing from observable interactions to the mental substance of psychological perspectives, it is argued, is philosophically impure. Setting aside grand philosophical debates, dialogical intersubjectivity narrows the focus to a tractable question: how do three-turn interactions become three-part semiotic sequences?
Self-initiated self-repairs in the first turn, I argue, are a bridge between the psychological and interpersonal aspects of intersubjectivity. Despite being rarely studied, these are the most common type of repair (Purver et al., 2018). Let us return to Marcia’s self-repair in the first turn of Excerpt 2. What is interesting with such self-repairs is that they have the same sequencing as three-turn repairs (Ginzburg et al., 2007). “Becuz the to:p was ripped off’v iz car” is the trouble source (akin to a first turn), “which iz tihsay” is the repair initiation (akin to a second turn that asks for clarification), and “someb’dy helped th’mselfs” is the repair (akin to a third turn). Thus, refracted within this single turn is a three-turn sequence.
From a dialogical standpoint Marcia’s self-repair can be reinterpreted as self-dialogue. Marcia hears her own utterance in the same way that it is heard by Tony. Marcia is her own audience. This enables Marcia to turn-take within herself. Marcia is simultaneously the subject (‘I’) of her utterance and an observer of her utterance (‘me’). She hears her own utterance as ambiguous. This self-reflection leaves a trace (“which iz tihsay”) that leads to a clarification (“someb’dy helped th’mselfs”).
Internalization, from this dialogical standpoint, does not require any philosophically dubious dualism. All it requires is that people can hear themselves speak (or think), and that they can respond to themselves, in the same way they might respond to anyone else. This is what Mead (1913) called the peculiar significance of the vocal gesture. People observe themselves and respond to themselves as they would to another. Responding to oneself talking is turn-taking within a single turn.
Turn-taking within a single turn can be analyzed either from the non-mentalistic standpoint of conversation analysis, or from the more psychological standpoint of multivoicedness. Focusing on what is empirical, there is no mystical dualism; the text being analyzed is the same. Hearing oneself speak and then responding to oneself as an other (either publicly or privately) is a basis for three-part sign sequences and thus ‘inner’ dialogues. What is important is that a response or perspective becomes an object that can itself be responded to.
This reinterpretation responds to the enactivist objection that internalization replicates Descartes’ problematic dualism (Gallagher, 2020; Stetsenko, 2017). By focusing on self-repair as the empirical phenomenon, this model does not require commitment to internal mental states as ontologically distinct. Intersubjectivity can be approached either from the inside (e.g., phenomenology) or from the outside (e.g., observing others). Likewise, textual excerpts in which speakers respond to their own utterances can be studied either as inner dialogue or as outward-facing self-repair. The difference is one of analytic stance, not ontology. Accordingly, the paradigmatic differences between the quadrants in Figure 1 reveal less about the ontological status of things in the world, and more about our approach to them. Looking beyond these debates about analytic stance, I argue, reveals a striking isomorphism between three-turn sequences and three-level perspective coordination that is observable from both analytic stances.
Conclusion
I have attempted to re-assemble the elephant of intersubjectivity, by systematically reviewing approaches mapped in Figure 1. Each quadrant captures a distinctive and valid aspect of intersubjectivity. I have shown how, despite each having a distinctive epistemological stance, which has led to incompatible assumptions and methods, there is underlying convergence. Beneath this diversity of approaches there is convergence on three levels, three turns, and triadic relations. Identifying this convergence on ‘threes’ is itself a contribution, revealing common ground across traditions. Moreover, I have argued that it is not a coincidence; this convergence points to an underlying integrated architecture for intersubjectivity, that I term dialogical intersubjectivity.
The proposal that perspective-taking arises out of turn-taking is a contribution to the literatures at the intersection of the quadrants, namely, theories that attempt to explain how the psychological and interactional aspects of intersubjectivity are related. The driving mechanism, I propose, is turn-taking, where each turn that takes the prior turn as its object ratchets up a level of intersubjectivity. Turn-taking is also a bridging mechanism that spans Descartes’ dualism. It is observable both ‘between’ individuals (e.g., conversation analysis) and ‘within’ an individual utterance (e.g., multivoicedness and self-repairs).
Dialogical intersubjectivity expands the basic unit of analysis beyond isolated levels, turns or triangles (Zittoun et al., 2007). Although many social interactions can be achieved with two-turn adjacency pairs (question and answer, greetings, commands), these only work by assuming intersubjectivity. Once human reflexive intersubjectivity breaks down, the minimal unit of analysis for conceptualizing its reconstruction is three levels of perspective-taking (enabling feeling misunderstood), three-turn interactions (enabling repairs), and triadic relations (embedded in institutional and cultural contexts). This expanded minimal unit of analysis enables conceptualizing all parts of the elephant of intersubjectivity simultaneously.
Dialogical intersubjectivity can be studied face-to-face, online, or in video. The data can be single cases, big data, or experiments. The key is not in the data or the method, but in the integrative consideration of levels, turns, and triadic relations simultaneously. This integrated model opens new questions. When a prior turn is taken as the object of a subsequent turn, does this observably lead to indexing higher levels of intersubjectivity? Does ratcheting perspectives through turn-taking make implicit assumptions explicit? Do three-turn interactions develop in children alongside three-level intersubjectivity? Do specific types of turn-taking relations (e.g., cooperative, conflictual, hierarchical) foster comparable three-part semiotic sequences? When situational constraints limit interaction to one-turn or two-turn interactions, does this reduce mutual understanding? Conversely, does facilitating three-turn interactions enable feeling understood? If turn-taking can influence meta-meta perspectives, can it be used as an intervention for intergroup conflict, distrust, and polarization?
In applied contexts, dialogical intersubjectivity provides insight into how communication technologies can be designed to foster robust intersubjectivity. For example, chatbots that rely on two-turn interaction may fail to create robust intersubjectivity (Corti & Gillespie, 2016). Similarly, organizational communication channels that permit only unidirectional announcements, or meeting formats that allocate speaking time without back-channel opportunities, may structurally impair intersubjective alignment. These predictions can be examined by comparing mutual understanding, social-emotional engagement, and misunderstandings across two-turn versus three-turn communication scenarios and interventions.
Intersubjective coordination is central to our private and collective lives. However, it does not occur accidentally. Understanding the interactional infrastructure that can scaffold mutual understanding is a crucial task. To date we have tended to build our communication and interactional infrastructure on one-turn (sender-receiver) and two-turn (adjacency pairs) models. I propose that we can only address the challenges of intersubjective coordination with an expanded minimal unit of analysis, namely, three levels of representation constructed through three-turn interaction embedded in triadic social relations.
Footnotes
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Swiss National Science Foundation, 51NF40-205605.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
