Abstract
This commentary discusses implementing a unified multimodal framework for studying first language acquisition, responding to Karadöller, Sümer, and Özyürek. It outlines four key considerations: the need for clear definitions of multimodality, the importance of including communicative nonverbal behaviors beyond gestures, the limitations of sequential analyses in capturing true multimodality, and gaps in research on early iconic gesture production in infants. Finally, the commentary advocates for sharing coding manuals and creating large video collections to advance multimodal language acquisition research.
The paper by Karadöller, Sümer, and Özyürek (2025) presents a review of the role multimodality plays in first language acquisition, focusing on both co-speech gestures and sign languages. The authors make a compelling case for adopting a unified multimodal framework for studying first language acquisition, building on its established use in adult language research. I agree that such an approach is essential for capturing the rich and dynamic multisensory nature of children’s language development. In this commentary, I will outline four key considerations for effectively implementing this framework across the field and suggest potential directions for future research.
A clear definition of ‘multimodality’ is essential
To develop a robust ‘multimodal’ framework for understanding first language acquisition processes, clear and consistent definitions are crucial. In the review, the term ‘modality’ is used to refer both to linguistic systems (e.g. speech, gesture, sign) and to sensory domains (e.g. auditory, visual, visual-spatial), which are conceptually distinct. This overlap can create confusion and hinder theoretical clarity. Moving forward, language acquisition researchers should explicitly define ‘modality’ in their studies and distinguish between communicative modes and sensory processing to ensure consistency across the field.
Furthermore, multimodality is about the interplay of sensory domains, not their isolation. Take speech, for example: in face-to-face communication, speech is not merely an auditory phenomenon but an inherently audiovisual experience. Beyond the acoustic signal, speech involves lip movements, facial expressions, and articulatory gestures that provide crucial contextual information. In noisy environments, these visual cues provide supplementary information that enhances speech perception and comprehension for hearing individuals (see Peelle & Sommers, 2015, for a review).
Similarly, co-speech gestures and sign language reveal the multisensory complexity of face-to-face communication. For example, an iconic hand gesture depicting the shape of a ball is simultaneously visual (observable to others) and kinesthetic (experienced by the gesturer through bodily movement). Sign language also exemplifies this multisensory richness, seamlessly integrating visual, kinesthetic, and even tactile modalities. In tactile sign languages used by deaf-blind individuals, communication becomes an even more intricate multisensory interaction, with direct physical engagement fundamentally reshaping linguistic exchange (Van Der Mark, 2023).
Categorizing communication modes by a single dominant sensory modality oversimplifies how humans learn and use language. Therefore, a precise definition of ‘multimodal’ communication that acknowledges the intricate interplay between linguistic systems and sensory channels is essential for implementing a multimodal framework in language acquisition research.
Multimodal communication covers a wider range of communicative signals
A comprehensive multimodal framework must extend beyond hand gestures. While the review acknowledges hands, body, and face in its definition of gesture, it primarily examines hand gestures in relation to spoken and signed language. Although possibly beyond the scope of the review, this approach does not consider research on other essential communicative behaviors – eye gaze, facial expressions, head movements, intonation, and body language. To better understand first language acquisition, we must adopt an inclusive approach that recognizes all bodily communications as integral to meaning-making in both spoken and signed language (Kendon, 2014; McNeill, 2005). Without acknowledging this full spectrum of natural communicative behaviors, we risk misunderstanding how multimodal interactions shape language acquisition (Perniss et al., 2010).
To implement this approach effectively, language acquisition researchers should define the behaviors they are investigating, ensuring clarity and consistency across studies. In addition, developing, validating, and sharing coding manuals for these behaviors will facilitate comparability, reproducibility, and replication, enabling the wider research community to adopt standardized methodologies.
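To make this concrete, a coding manual can be shared in machine-readable form alongside its prose definitions, in the spirit of tier-based annotation tools such as ELAN. The sketch below is a minimal, hypothetical illustration in Python: the tier names, behavior labels, and example annotation are my own inventions for demonstration, not an established standard.

```python
from dataclasses import dataclass

# Hypothetical coding scheme: each communicative behavior is annotated on
# its own tier, with labels drawn from a closed, documented vocabulary.
CODING_SCHEME = {
    "speech":  ["word", "babble", "vocalization"],
    "gesture": ["deictic", "iconic", "conventional", "beat"],
    "gaze":    ["to_caregiver", "to_object", "averted"],
    "facial":  ["smile", "frown", "neutral"],
}

@dataclass
class Annotation:
    tier: str      # which behavior tier (e.g. "gesture")
    label: str     # label from the scheme's vocabulary for that tier
    start_ms: int  # onset relative to the start of the recording
    end_ms: int    # offset relative to the start of the recording

def validate(a: Annotation) -> None:
    """Reject annotations that fall outside the shared coding scheme."""
    if a.tier not in CODING_SCHEME:
        raise ValueError(f"Unknown tier: {a.tier!r}")
    if a.label not in CODING_SCHEME[a.tier]:
        raise ValueError(f"Label {a.label!r} not defined for tier {a.tier!r}")
    if a.end_ms <= a.start_ms:
        raise ValueError("Annotation must have a positive duration")

# Example: a child produces an iconic gesture between 3.2 s and 4.1 s.
validate(Annotation(tier="gesture", label="iconic", start_ms=3200, end_ms=4100))
```

Because the vocabulary is explicit and the validation is automatic, two labs coding the same video against such a scheme could compare, reproduce, and replicate each other’s annotations directly.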
Sequential analyses are unimodal rather than multimodal
The argument that gestures may precede, follow, or develop in parallel with language milestones provides valuable insight into developmental sequences. However, framing the discussion in terms of which modality ‘comes first’ risks reinforcing a unimodal perspective rather than recognizing (sign) language and gesture as part of an integrated communicative system.
That said, examining these sequential relationships remains important from a developmental viewpoint, as they offer insights into children’s readiness to communicate, regardless of the modality they use. Whether a child first expresses a concept through speech, sign, gesture, or another communicative behavior, this initial expression signals their cognitive readiness for that aspect of language acquisition. Furthermore, identifying early markers of communication across modalities can aid in detecting potential developmental concerns, enabling earlier intervention and improving long-term outcomes.
Yet this sequential focus seems to contradict the fundamental nature of multimodality. Multimodality in language acquisition is better understood as children’s simultaneous use of multiple channels of communication – speech, signs, gestures, facial expressions, and body language – to express meaning. Rather than debating the temporal relationship between gesture and speech, or gesture and sign, researchers should examine how children leverage this full repertoire of communicative behaviors to develop their language skills. This broader perspective would better capture the inherently multimodal nature of human communication and provide a richer understanding of children’s developing linguistic capabilities.
By expanding analyses to include hand gestures alongside spoken or signed language, the authors have taken a step toward multimodality, but this still represents a limited view of children’s communicative capabilities. Multimodality requires examining how children simultaneously leverage their full communicative repertoire in language development. Otherwise, we risk simply replacing one restricted analysis (speech or sign only) with another (speech or sign plus gesture).
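As an illustration of what such a simultaneous analysis might look like, the sketch below asks which channels co-occur and for how long, rather than which one came first. It reuses the shape of the hypothetical annotations sketched earlier; the tier names, threshold, and data are invented for illustration only.

```python
from collections import namedtuple
from itertools import combinations

# Same hypothetical annotation shape as in the earlier sketch.
Annotation = namedtuple("Annotation", ["tier", "label", "start_ms", "end_ms"])

def overlap_ms(a: Annotation, b: Annotation) -> int:
    """Duration (in ms) during which two annotations co-occur."""
    return max(0, min(a.end_ms, b.end_ms) - max(a.start_ms, b.start_ms))

def simultaneous_events(annotations: list, min_ms: int = 100) -> list:
    """Find cross-tier pairs overlapping for at least `min_ms`, i.e.
    moments where the child combines channels rather than using
    them one after the other."""
    return [
        (a, b, overlap_ms(a, b))
        for a, b in combinations(annotations, 2)
        if a.tier != b.tier and overlap_ms(a, b) >= min_ms
    ]

# Invented example: a word, an iconic gesture, and gaze to the caregiver
# overlapping within a single communicative act.
session = [
    Annotation("speech",  "word",         3300, 4000),
    Annotation("gesture", "iconic",       3200, 4100),
    Annotation("gaze",    "to_caregiver", 3000, 4500),
]
for a, b, ms in simultaneous_events(session):
    print(f"{a.tier}:{a.label} overlaps {b.tier}:{b.label} for {ms} ms")
```

Framing the question this way keeps the unit of analysis the multimodal act itself, rather than the order in which individual channels appear.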
Evidence on iconic gesture production in infants is scarce
The broad developmental window of 17–36 months for iconic gesture production mentioned in the review likely reflects a research gap rather than true developmental or cross-linguistic variability. Drawing parallels with infant research on sound symbolism, which is also iconic in nature and has been shown to emerge as early as 4 months of age (Öztürk et al., 2013), children’s sensitivity to iconic gestures may develop earlier than currently assumed. Indeed, a recent analysis of five English-speaking children (Green et al., 2025) reported that infants’ first 10 iconic gestures emerged between 12 and 20 months, with most depicting actions and produced independently of adult models. This suggests that the onset of iconic gesture production may be earlier, and its developmental window narrower or less variable, than currently reported.

However, evidence for an earlier emergence of iconic gestures remains limited, and the lack of large-scale studies and systematic observations of infant gesture production across diverse linguistic contexts likely contributes to this gap in our understanding. Expanding research to better capture the forms and contexts of early iconic gestures in naturalistic settings could offer valuable insights into the early stages of communicative development and the role of iconic gestures in language acquisition. In addition, definitions of what constitutes an iconic gesture may differ between researchers, underscoring the importance of clearly defining the behaviors that are the focus of an analysis and sharing coding manuals with the broader research community.
Concluding remarks
The paper by Karadöller, Sümer, and Özyürek represents an important first step in establishing a unified multimodal framework for first language acquisition. I fully support the authors’ call for further research in this area and for revisiting language development theories from a multimodal perspective. To advance the field, it is essential for researchers to define core concepts, share valuable resources such as coding manuals, and create large, naturalistic data sets that capture a wide range of multimodal behaviors produced by both infants and children. Web-based repositories like Databrary (https://nyu.databrary.org/) facilitate the documentation of extensive collections of video recordings of multimodal interactions between caregivers and children, allowing other developmental researchers to analyze these data. The widespread adoption of such platforms would be transformative for multimodal language acquisition research.
