Abstract
Karadöller, Sümer and Özyürek make a clear case for moving to a multimodal framework across spoken and signed languages, and they highlight key insights about language acquisition that such a framework can yield. Bringing together current findings on the development of speech–gesture integration and of sign language also reveals a number of desiderata for moving the field forward. Here, I present some elements of a possible framework to capture multimodal language acquisition in spoken and signed languages, and I motivate the need for comparable evidence across signed and spoken languages that directly taps into learning in the real world.
The ecology of language use, and especially language acquisition, is social interaction. In this context, language is not only expressed by the mouth and understood by the ears, or expressed by the hands and understood by the eyes: it comprises information spread across modalities/channels (Holler & Levinson, 2019; Murgiano et al., 2021). Moreover, language acquisition is a social activity: it can be characterised as a type of joint action among people (Fusaroli et al., 2024; Nikolaus & Fourtassi, 2023). Thus, any theory of language development needs to account for how language emerges in the situated context of, primarily, caregiver–child interaction.
Karadöller, Sümer and Özyürek (2025; henceforth KSO) present a rich overview of the current literature on multimodal language development, focused on evidence from speech, gesture and sign. The effort to bring these lines of research together, especially from a cross-cultural perspective, is to be commended. To move the field forward, however, I argue that we need both a conceptual framework that motivates the relationship between linguistic elements and other bodily signals across spoken and signed languages, and data from real-world interactions between children and their caregivers.
A conceptual framework: from language as a system to language as situated
KSO define multimodal language in terms of language modality (signed, spoken). From this, they derive three characteristics (visual indexicality, iconicity and simultaneity) that, while clearly ‘visible’ in sign language, can only be ‘seen’ in spoken language if we bring in the visual modality. Putting aside a potential confusion between modality in the sense of language modality and in the sense of sensory modality, bringing the visual modality into spoken language changes the definition of language and calls for a conceptual framework that emphasises the dynamic nature of language.
Murgiano et al. (2021; see Figure 1) explicitly addressed this by discussing the relationship between language as a population-level system and language as situated. The former comprises the structured, categorical components of speech (for spoken languages) and of manual signs (for sign languages) that can be described in a population at a given time point, in order to characterise the language synchronically and track diachronic change. The latter comprises the ensemble of speech (or sign) and other signals as they dynamically present in specific communicative and physical contexts during interaction. From a situated language perspective, there is no a priori reason for researchers to limit their attention to points and iconic gestures, as a number of other signals (e.g. eye gaze) are also central.

Figure 1. Components of language under a language as situated perspective. Language as a system remains in place under this view, containing those behaviours characterised as systemic. We show different communicative behaviours for both signed (blue square) and spoken (green square) languages. Iconic cues are shown in italics, indexical cues in bold. Some features (e.g. pointing in sign languages) can be characterised as either systemic or contextual.
From the situated language perspective, two separate questions can be asked about multimodal development: how situated language itself develops, and how system-level properties of language are afforded/scaffolded by the situated context. These two questions are often conflated in the literature, as becomes clear when reading KSO, and sometimes only the latter is addressed (e.g. spoken language studies showing that pointing predicts later vocabulary; Rowe & Goldin-Meadow, 2009).
Because situated language involves social interaction, we need to look beyond the child (KSO’s primary focus) to the context of learning. How does situated language develop within a context that includes actions by the child, actions by caregivers and child–caregiver joint actions, and/or how do these actions support system-level language development? For example, in spoken language, studies have addressed how system-level properties of language (e.g. vocabulary) are scaffolded (or not) by the caregiver’s signals (Bang et al., 2023; Shi et al., 2023) or by coordinated behaviours between caregiver and child (West & Iverson, 2017; Yu & Smith, 2012). However, existing studies mostly focus on a single signal (e.g. eye gaze or gestures) and on one part of the dyadic interaction (e.g. actions by the caregiver, or joint actions), and can thus provide only a partial picture.
Data from real-world interactions between children and their caregivers
Answering the questions above requires that we know what goes on in naturalistic interaction. We therefore need corpora of naturalistic (or semi-naturalistic) dyadic interaction between an infant/child and their caregiver, annotated for the different signals co-occurring with speech and shareable across researchers. This type of data is extremely difficult to come by because of how effortful it is to develop corpora annotated for multimodal signals (despite the availability of automatic annotation tools). Furthermore, sharing video data is difficult because of the need to protect children’s identity. Some existing corpora (e.g. Fusaroli et al., 2023; Gu et al., 2025; and the Language Development Project by Susan Goldin-Meadow, https://ldp-uchicago.github.io/docs/index.html) hold great promise, but they are necessarily limited in which signals are annotated, the languages covered (English), the ages of the children and the specific contexts.
For example, ECOLANG (Gu et al., 2025) is a publicly available corpus that provides video recordings and annotations of the multimodal behaviours of English-speaking caregivers in dyadic semi-naturalistic interaction with their children (38 dyads, children aged 3–4). Caregivers were asked to introduce sets of objects provided by the experimenters, where half of the objects were known to the child and half unknown. Moreover, dyads were asked to talk about the objects both when the objects were set in front of them (hence present and manipulable) and in a displaced learning context where the objects were absent (see Figure 2). The corpus also contains data from 32 dyads of familiar adults talking about objects across the same conditions as the child data.

Figure 2. Corpus design. The top panel shows the recording setup. Two cameras record the interaction: camera A focuses on the speaker, camera B on the interaction space. The speaker wears Tobii eye-tracking glasses and a lapel microphone. Grey objects on the table are known to the interlocutor. The images below provide examples of interactions viewed from camera A when objects are present (situated context) and when they are absent (displaced context).
At present, speech, prosody, gestures, eye gaze and object manipulations produced by the caregivers have been annotated. These data provide a snapshot of the distribution of the different multimodal signals adults use when talking to children or to other adults in similar contexts (see Gu et al., 2025). In addition, children’s vocabulary was tested at the time of recording and 1 year later, and immediate learning of the new words was also tested. Thus, ECOLANG allows us to explore the nature of the caregiver’s input and to address the relationship between that input and children’s vocabulary development. The corpus is being extended to a cohort of younger children (Motamedi et al., 2024) and to dyads using British Sign Language (following Perniss et al., 2018), in an effort to track developmental trends and, crucially, to provide comparable data across language modalities.
However, no corpus by itself will be sufficient. True progress can be made by bringing together multiple corpora, because different corpora can provide insight into different aspects of multimodal development (e.g. ages, cultures) and together they can provide converging evidence. The sharing and at least partial comparability of the data will be critical, and to this end any initiative that provides a forum for discussion of data, annotation practices and results (e.g. https://envisionbox.org/index.html) will be essential.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This commentary discusses work that has been supported by a European Research Council Advanced Grant (ECOLANG, 743035) to G.V.
