Abstract
The seemingly ubiquitous tendency of caregivers to speak to infants in special ways has captivated the interest of scholars across diverse disciplines for over a century. As a result, this phenomenon has been characterized in quite different ways. Here, we highlight the shift from early definitions of “baby-talk” which implied that the nature of speech directed towards infants would vary in different sociolinguistic contexts, to later terms such as “motherese” or “infant-directed speech” (IDS) which came to refer to a specific set of features, some of which were argued to represent a universal, optimal and culturally invariant form of speech. These divergent conceptualizations of IDS thus reflect broader disciplinary tensions pertaining to the role allotted to cultural processes in psychological research. We hope to contribute to this literature by pointing to the complexity associated with identifying discrete categories of speech (i.e., baby-talk and motherese/IDS) within a complex multi-dimensional sociolinguistic landscape. We also highlight ways in which a lack of attention to the cultural context of infant-caregiver interactions may have led to biased characterizations of IDS. Furthermore, these biases may implicitly penetrate the nature of empirical work on IDS as well. We end with a series of suggestions for future directions.
Chapter 1: Introduction
Overview
Human speech is dynamic and continually modulated within a complex, multidimensional sociolinguistic landscape (Slobin et al., 2014). Researchers interested in characterizing this variability have used terms such as registers, genres and styles to capture meaningful and stable aggregates within this multi-faceted sociolinguistic space (Biber & Conrad, 2019). In the present paper, we focus on one such meaningful aggregate – “baby-talk” or infant-directed speech (IDS) – terms referring to the seemingly ubiquitous tendency of caregivers in many cultural contexts to speak to their infants in ways that contrast with their typical speech to adults. IDS is a subject of great interest to a wide range of researchers due to the role it is thought to play in facilitating infant-caregiver interaction (e.g., Fernald, 1995), infants’ language development (e.g., Rowe & Zuckerman, 2016), and knowledge acquisition more generally (e.g., Csibra & Gergely, 2009).
While early definitions of “baby-talk” (e.g., Ferguson, 1964) implied that the nature of speech directed towards infants would vary across different sociolinguistic contexts, later terms such as “motherese” (Newport, 1975) or “infant-directed speech” (IDS) (e.g., Cooper & Aslin, 1990) came to refer to a specific set of features essential to the IDS construct, some of which were argued to represent a universal and culturally-invariant form of speech (e.g., Fernald, 1995; Kuhl, 2000; Falk, 2004). In this essay, we trace the historical emergence of these conceptualizations of IDS and discuss challenges associated with identifying discrete categories of speech (i.e., baby-talk and motherese/IDS) given the complexity of speech across contexts among varying interlocutors. We then highlight ways in which unexamined assumptions about such categories seem to be manifested in mainstream, “standard-practice” empirical investigations into IDS. We end with suggestions for future directions.
History of Research
Interest in special forms of speech addressed to infants dates back at least to Roman times (Jespersen, 1922, citing Varro; Solomon, 2011). In the 20th century, Sapir (1915, 1929) documented types of speech amongst Nootka Indians that included words used only when speaking to infants and little children. This sort of early 20th century linguistic-anthropological interest in the baby-talk phenomenon emerged in the context of a broad general interest in language as a cultural product (see Sapir, 1921). In fact, Charles Ferguson, an important figure in mid-20th century research on baby-talk, admired Sapir for his interest “in every aspect of human language behavior” (Ferguson, 1997, p. 7).
While interest in baby-talk persisted sporadically throughout the 20th century (e.g., Casagrande, 1948), empiricist-oriented psychologists and linguists became increasingly focused on baby-talk in opposition to Chomsky’s view of language acquisition as largely innate, occurring almost independently of the language environment (see Snow, 1977; Ninio, 2011). To this day, baby-talk remains a major topic of interest for a range of researchers (e.g., ManyBabies, 2020; Hilton et al., 2022).
Consistent with its initial disciplinary roots, Ferguson’s early definition of baby-talk was decisively culture-centered (though his later work took a more universalistic turn, see Ferguson, 1978). Ferguson (1964, p. 103) stated that “By the term baby talk is meant here any special form of a language which is regarded by a speech community as being primarily appropriate for talking to young children and which is generally regarded as not the normal adult use of language.” According to this definition, the key factor is that a given speech community regards a particular form of language as appropriate for use with young children/infants. This of course implies that baby-talk reflects local and conventional attitudes regarding how to speak to infants.
In contrast to Ferguson’s early definition of baby-talk, later terms like motherese (initially introduced by Newport, 1975) came to refer to a particular form of infant-directed speech (IDS) that is thought to be culturally invariant, rather than any form of speech considered to be appropriate for addressing infants in a given speech community. For example, based on commonalities found between IDS produced by Mandarin Chinese, English and German mothers, Grieser and Kuhl (1988) suggested that universal acoustic characteristics of baby-talk (termed “motherese”) may be higher mean pitch, higher pitch minima and maxima, greater pitch variability, shorter vocalizations and longer pauses than would be observed in adult-directed speech (see also Fernald, 1985; 1995; Kuhl, 2004).
From a bird’s-eye view of what is now over a century of fascinating and important research, we thus identify two central theoretical approaches to IDS. First, culture-centered sociolinguistic approaches have conceptualized IDS as expressing diverse, culturally variable, child-rearing and language use systems (e.g., Harkness, 1971; Ochs & Schieffelin, 1984; Pye, 1986; Gaskins, 2006; Solomon, 2011). In contrast, others have portrayed IDS as the product of an evolutionarily specialized system that is manifested universally as a species-typical behavior (e.g., Sachs, 1977; Fernald, 1995; Falk, 2004), the features of which emerge due to caregivers’ attempts to interact with infants who do not yet know the language and whose attention span is limited (e.g., Ervin-Tripp, 1977; Saxton, 2009).
Proliferation of terms
While it is evident that a unique infant-directed way of speaking is a relatively ubiquitous phenomenon observed across many different cultural contexts, the extent to which these ways of speaking share universal characteristics is an unresolved issue and warrants further investigation (a point we will return to in chapter 4). A significant obstacle towards progress on this front is the proliferation of terms used to discuss IDS. To illustrate this pattern, we note that Ferguson’s term baby-talk encompassed a wide range of features thought to be characteristic of infant-directed speech (e.g., syntactical, phonological, lexical). Acoustic features such as pitch modifications (“Ammentum” in Ferguson’s 1964:105 terms) are thus but one component of a broader register category. In contrast, Newport’s (1975) introduction of the term motherese referred primarily to a set of syntactical features, while Fernald’s (1995) view of motherese referred primarily to a key set of acoustic features combined with a constellation of lexical and syntactic characteristics. Finally, Kuhl’s depictions of motherese have sometimes been restricted primarily to acoustics (e.g., Grieser & Kuhl, 1988; Kuhl, 2004) but elsewhere have been broader in scope (e.g., Kuhl, 2000). Complicating matters further, the term infant-directed speech (IDS) has been used to refer to any of the versions described above.
To illustrate how this may have led to conceptual confusion, Ochs and Schieffelin’s (1984) account of the features of Kaluli caregivers’ speech has typically been cited as evidence against universal depictions of IDS (e.g., Rosenberg et al., 2004). Yet this account includes a description of Kaluli mothers’ use of “expressive vocalizations” (p. 289). Had they included an acoustic analysis, we suspect these vocalizations may have been reminiscent of subsequent universalistic descriptions of motherese (e.g., Fernald et al., 1989; Fernald, 1995). Yet in Ochs and Schieffelin’s (1984) account, these expressive vocalizations are tangential in comparison to their emphasis on the paucity (or indeed, absence) of dyadic proto-conversations and simplification efforts in Kaluli caregivers’ speech.
Another example is the confusion that has arisen in the literature related to Shatz and Gelman’s (1973) work on 4-year-olds’ speech modifications to 2-year-old listeners (in contrast to modifications made to adult listeners). This classic study is often cited as evidence of the ubiquity or indeed universality of motherese (e.g., Fernald, 1985; Werker et al., 1994; Levine et al., 2017). However, Shatz and Gelman’s analysis did not address acoustic aspects of the speech stream, and thus those findings are not germane to many of the specific claims about universality at issue. In other words, the evidence supporting (or refuting) claims regarding the universality of (a particular form of) IDS is often difficult to interpret due to the ambiguity surrounding the precise boundaries of the IDS concept itself.
An underlying challenge contributing to this confusion may be the complexity inherent in attempting to identify discrete categories of speech such as IDS within a complex sociolinguistic landscape. In the next section, we directly discuss this challenge. In particular, we note a tension between the interest in identifying, labeling and describing systematic context-specific speech styles (or “registers”) that exist within a language and the understanding that such categorization efforts may inadvertently create the false impression that human speech is neatly sortable into discrete register categories that are akin to “collections of objects – like button-s and pebble-s – that can be identified and enumerated in an unproblematic way” (see Agha, 2004, p. 36). We then discuss ways in which assumptions made by researchers regarding the universality of IDS and the validity of separable speech categories may have impacted empirical investigations of IDS, and in turn the findings and conclusions drawn from such research.
Chapter 2: Sociolinguistics, registers of speech and the problem of categorization
Sociolinguistic variation
A hallmark of sociolinguistic approaches to language is the observation that the structural and acoustic properties of a given speech act vary according to diverse contextual and relational factors surrounding an interaction. Systematic investigations of IDS (then referred to as baby-talk) first emerged within this tradition (e.g., Ferguson, 1964). We present this approach here to elucidate ways in which more recent characterizations of the properties of IDS -- as well as its putative reception by infants -- may have glossed over the sociolinguistic complexity within which speech to infants resides. In particular, we argue that the tendency to compare and contrast IDS to generic “adult-directed” speech (ADS) reflects an underestimation of the extent to which language, even when spoken among adults, is a flexible phenomenon. To illustrate, people adjust their pronunciation, vocabulary, syntax, and intonation depending upon the individual(s) they are addressing and the circumstances of the conversation. As Chambers (2007) noted, sales representatives talk about “automobiles” in the showroom but about “cars” with their neighbors, and they enunciate when reporting to their managers in ways that would seem stilted if they spoke in that manner to their friends after work. Chambers goes on to demonstrate that such adjustments are linguistically subtle, socially meticulous, and seem to be largely automatic, occurring outside conscious awareness.
The general point here is that structural and acoustic properties of a given human speech act are influenced by dynamic interplay between contextual factors surrounding the interaction (e.g., formal/informal, public/private, dyadic/multi-party), sociological characteristics of the interlocutors (e.g., status, occupation, kinship, age), and their motivations (e.g., rapport-seeking, respect-seeking, authority-establishing). This non-exhaustive set of factors jointly determines the type of speech that likely would be considered appropriate or normative within a given context (see Ervin-Tripp, 1976; Hymes, 1992; Yaeger-Dror et al., 2003; Gergely et al., 2017). Furthermore, while certain interactive realities may be analogous across different sociolinguistic contexts (e.g., interaction of a lower-status individual with a high-status individual), these may nevertheless be associated with distinctive speech characteristics, due to different norms of expression (see Pye, 1986). In sum, speech seems best characterized as intrinsically variable and contextualized.
The register
One influential approach taken to studying sociolinguistic complexity has been typological - attempting to systematically characterize various forms of speech that exist within a speech community, giving rise to the notion of “register” (Reid, 1956). Examples of registers identified by sociolinguists include “legalese” (referring to conventional forms of speech considered to be appropriate in legal settings), scientific discourse, journalistic discourse, various slangs and, most relevant for our purposes, “motherese” (aka baby talk or IDS).
For Ferguson (1977), register variation was distinct from regional and social dialect variation on the one hand, and from idiosyncratic and stylistic variation on the other hand. Specifically, a register in a given speech community is defined by the uses for which it is appropriate and by a set of structural features that differentiate it from the other registers in the total repertoire of the community. As with dialect variation, the determination of register boundaries and significant discontinuities was, to a certain extent, seen as arbitrary, since in some instances the variation may be continuous and in other instances there may be overlapping of several registers.
General problems with the register approach
On what grounds should registers be categorized?
From the outset, critics of the register concept pointed to the lack of discreteness of different registers as a problem. In fact, Ferguson (1982) noted that for years, linguists were reluctant to even use the term register “out of respect for the complexity of language variation phenomena” (p. 55). Indeed, sorting varieties of speech into meaningful aggregates exemplifies a basic inductive challenge that characterizes categorization efforts more generally (Markman, 1989; Quine, 1960), given the potentially infinite range of categorization schemes that are theoretically possible for any collection of items. Importantly, the categorization scheme that is eventually settled upon has downstream implications for how the phenomena are conceptualized and what inferences are promoted (Murphy & Medin, 1985; Medin et al., 1997).
To illustrate the import of these issues in the context of infant-directed speech, consider, for example, the type of speech that a “fussy” infant might elicit from a caregiver. The structural and acoustic properties of that speech would likely reflect the caregiver’s motivations and goals (e.g., to soothe, comfort, or distract the infant). Should such speech be categorized as “soothing speech,” “infant-directed speech,” or “fussy-infant speech”? Or perhaps it would best be categorized as one of the other possible designations from an infinite list, such as “that-particular-parent’s-version-of-baby-talk”? Similarly, consider speech that might occur in the first moments of an encounter between family members or close friends who have not seen one another for many years. Should such speech be categorized as “enthusiastic loving speech,” “long-lost family-greeting speech,” or as “adult-directed speech”?
The seemingly obvious answer to these questions is that the categorization of speech varieties is, to a great extent, an act of invention rather than discovery (see Bruner et al., 1956). This is because the decision to categorize speech based on, say, the age of the recipient rather than the context of the interaction or some other dimension is, to some significant degree, a matter of choice. Furthermore, regardless of the decision made, it is undeniable that features that appear in one category (e.g., infant-directed speech that is meant to be soothing) would also appear in a different speech category (e.g., adult-directed soothing speech), rendering category distinctions such as infant-versus-adult directed speech inherently overlapping (see Brown, 1977).
Is there a default form of language use from which registers differ?
A second potential pitfall associated with the register concept is the implication that a standard or baseline form of language use exists against which particular registers may be compared. Yet as Fischer (2016) noted, this assumption is not necessarily justified, particularly due to the presupposition that there is a standard from which usage in peculiar situations deviates. Fischer (2016) followed Wagner (1996) in this line of reasoning; Wagner bemoaned the tendency in foreign language acquisition research to “operate with a concept of ‘normal’ interaction which is understood to be the unproblematic exchange of information. The ‘real’, ‘pure’, ‘intended’, ‘undisturbed’, ‘unmodified’ information [which] is presupposed to be best transferred between native speakers.” (p. 222). This idealized depiction of unmodified interaction then serves as “the baseline towards which studies of modifications orient.” (Wagner, 1996:222). However, “even in baseline interactions, ‘normality’ is interrupted.” (Wagner, 1996:222). In other words, since interactions involve dynamic and situation-specific modifications, it is not clear what the notion of typical speech refers to, since all speech is directed at a specific audience in a particular way for a particular purpose under specific expressive norms.
Implications for IDS-related research
Despite the challenges discussed above, the register concept seems useful in its ability to characterize stable, or at least somewhat systematic varieties of speech that few would deny exist, as Ferguson (1982) noted. However, an additional concern with the register concept relates to the potential cognitive consequences of attaching a register label (e.g., IDS) to dynamically variable speech phenomena. Providing labels has a tendency to reify people’s notions of the categories to which these labels are used to refer. In particular, under certain circumstances (see Gelman et al., 2000), labels may contribute to implicit attribution of an underlying essence to the category denoted by the label. That is, there is a tendency to assume that category members are both alike in fundamental, essential ways, and distinctive from members of contrasting categories in fundamental, essential ways. Put another way, the attribution of essence tends to lead to the expectation of “deep” commonalities among category members, licensing causal inferences regarding the nature and function of that category (Medin & Ortony, 1989). Along with their perceived discreteness, essentialized categories tend also to be experienced as immutable and natural -- in that they are perceived as “carving nature at its joints” (see Haslam et al., 2006).
Commonplace in the IDS literature is a tendency to write about IDS and ADS as highly discrete phenomena that are non-overlapping in their impact on infants. For example, “When parents speak in their regular adult voice, termed adult-directed speech, infants do not attend.” (Rowe & Zuckerman, 2016:827). As we will discuss (see section 4), existing findings aren’t consistent with this assertion. Even infants who attend longer to IDS than ADS do display some level of attention to ADS. And in any case, enhanced attention to IDS over ADS is far from universal in infants. All in all, it is simply inaccurate to say, in general terms, that infants don’t attend to ADS.
Another hallmark of essentialist thinking is to assume that the category of interest tends to be unchanging and even immutable. As we will argue in the following sections, the IDS literature provides substantial evidence that both IDS and ADS change depending on characteristics of the speaker, the addressee, and the context in which speech is embedded. Researchers’ frequent tendency to ignore or downplay the extent of such modulation seems to speak to the immutability assumption at play. Along these lines, very little attention has been paid to ways in which the form that IDS takes within a given culture may evolve over historical time (though see Ferguson, 1964). The relative absence of such investigation may reflect an essentialist assumption of immutability as well (see Iliev & Ojalehto, 2015).
An assumption of naturalness is yet another feature of essentialistic thinking. Regarding IDS, this naturalness assumption may add potency and appeal to theoretical speculations (e.g., Fernald, 1995) that IDS is an optimal form of evolved biological signaling for infants. In some cases, multiple essentialistic assumptions seem to be at play, as in Haden et al.’s (2020) suggestion that IDS and ADS are processed by distinct neural pathways. The distinct neural pathway proposal rides on conceptualizing IDS as unique, invariant, immutable, and processed via receptors evolved through mechanisms of natural selection (see also Zangl & Mills, 2007).
All in all, it seems altogether plausible that the human tendency to essentialize may have shaped the way researchers have conceptualized IDS, and perhaps linguistic registers more generally. Essentializing assumptions seem to have had implications for the kinds of theoretical speculations proposed about IDS, and the nature of empirical work that has (and has not) been undertaken. Recognizing the degree to which psychological essentialism may have influenced current research on IDS offers the opportunity to reassess such implicit assumptions and re-examine what is actually known about the IDS phenomenon in light of the full range of available evidence.
Summary
Psychological investigation of IDS has adopted the sociolinguistic terminology of ‘register’ but has not grappled with the problems associated with this approach. In particular, this literature reveals a tendency to compare and contrast IDS to a generic “adult-directed” category of speech, which in turn reflects a) an underestimation of the extent to which language, even when spoken among adults, varies by context and b) an overestimation of the extent to which language addressed to infants is uniform across diverse cultural (and historical) contexts. Concretely, discussion of the properties of “baby-talk”/“motherese”/“infant-directed speech” in contrast to “adult-directed speech” creates the potentially illusory sense that there are “joints” within the sociolinguistic landscape that afford “natural” and causally-relevant carving. In the following section we explore ways in which these assumptions about the nature of sociolinguistic constructs have shaped empirical investigations of IDS.
Chapter 3: Experimental Treatment of IDS
Contrasting IDS and ADS: Varying versus matching the interactive context
As we’ve just noted, acoustic characteristics of (and infants’ response to) IDS are typically documented via comparison to adult-directed speech. This means that descriptions of IDS characteristics and findings regarding infants’ processing of (and enhanced attention to) IDS are only meaningful and interpretable in relation to the speech stimuli to which they were compared. The way that conditions for such comparisons are mounted, methodologically, impacts the descriptions of IDS and ADS that are derived. We provide a brief review of these methodological issues below.
The varied context approach
In an effort to obtain “representative” IDS and ADS stimuli, one line of research has contrasted speech obtained during what is assumed to be a “typical” interaction with an infant, such as face-to-face play, with what is assumed to be a “typical” interaction with an adult, such as a conversation about one’s daily routine (see Fernald, 1985; Piazza et al., 2017; Broesch & Bryant, 2018; Hilton et al., 2022). The adult interlocutors may be explicitly instructed by experimenters to keep the infant content (Broesch & Bryant, 2018) or to speak to their infants as if they were fussy (Hilton et al., 2022). Critically, prolonged interaction (5–10 minutes) is achieved by either restraining the infant (Fernald, 1985) or by explicitly instructing the adult not to pick the infant up (Broesch & Bryant, 2018).
On the one hand, the decision to allow interactive contexts to vary is reasonable, since real-life interactions with infants versus adults indeed often occur in varied and distinct contexts. On the other hand, we point to several drawbacks associated with the “varying context” approach. First, to keep a restrained infant in a lab content, caregivers must rely solely on their voice (and motions – see Brand & Shallcross, 2008). This is likely to result in highly engaging, non-monotone, rapport-seeking, comforting, entertaining, and affectionate speech. In contrast, adult speakers who are discussing a daily routine with an experimenter (in what is often described as an interview – see Piazza et al., 2017) may not be particularly motivated to keep their interlocutor content, would almost certainly not be attempting to communicate their affection to their interlocutor, and may have little reason to even care if their interlocutor is interested in what is said. In fact, whereas the caregiver is the active leader of the interaction in the IDS condition, the caregiver is likely to respond to the lead of the experimenter more passively in the ADS condition. These factors may all impact the acoustic characteristics of the recorded speech.
To use Halliday’s (1978) terminology, although the mode of the discourse (the modality of speech, e.g., spoken face to face) is constant across the IDS and ADS samples in research of the kind just described, neither the field (the content, task or activity) nor the tenor (the interpersonal relationship between the interlocutors) is comparable. In sum, these confounding variables make it difficult to interpret the significance of differences observed between AD and ID speech stimuli (and infants’ response to these stimuli) gathered in the “varied context” approach.
The matched context approach
In contrast to the “varying context” approach, Werker and McLeod (1989) and Cooper and Aslin (1990) controlled for the content/context of speech across ID and AD conditions by recording adults reciting identical scripts to infants versus adults who were present in the room (Werker & McLeod, 1989), or to an imagined infant versus adult (Cooper & Aslin, 1990). Furthermore, Werker and McLeod (1989) even attempted to address the issue of affect differences between the AD and ID speech stimuli by selecting samples that were evenly matched in terms of affective content (though the details of this procedure were not provided).
Similarly, ManyBabies (2020) obtained their IDS and ADS stimuli by matching the context of the interaction across the two conditions, but also obtained naturalistic, rather than scripted speech (see also Garnica, 1977; Schachner & Hannon, 2011). Specifically, mothers were asked to talk about a series of objects (e.g., ball, shoe) either to their baby (for the IDS samples) or to an experimenter (for the ADS samples) until they ran out of things to say. The final speech stimuli selected from this activity were balanced in various ways (e.g., number of speakers, speaker transitions, number of words). Along these lines, ManyBabies (2020) noted that “All characteristics of the recordings other than register (IDS vs. ADS) were as balanced as possible across clips.” (p. 29)
The central advantage associated with this “matched-context” approach is that it better isolates the addressee’s status (infant vs. adult) as the comparison of interest than does the “varying context” approach. However, advantages and disadvantages of the two approaches trade off, meaning that drawbacks arise with the matched-context approach as well. For one, although the activity itself is identical across ID and AD conditions, the particular activity chosen seems to have been biased towards culturally canonical infant-directed activities. This imbalance makes IDS/ADS comparisons difficult to interpret. For example, Garnica (1977) had participants discuss pictures “thought to be of interest to children” (p. 69) and the script used by Werker and McLeod (1989) for both IDS and ADS was written “to reflect the kind of monologue a parent might deliver when addressing their infant for the day.” (p. 233) Finally, ManyBabies’ (2020) activity of serially removing items from a box and presenting them to an interlocutor is a typical occurrence in adult-infant interactions, but rather atypical for adult-directed interactions. This may have rendered the obtained AD speech disjointed or monotonic. To return to Halliday (1978), while both the mode (the modality), and the field (the content) are now comparable in the matched-context approach, the tenor (the interpersonal relationship) remains incommensurate across conditions. In sum, one concern we have with both the “unmatched” and “matched” context approaches is that the AD speech that is contrasted to IDS may not adequately reflect the diverse and expressive forms of speech that indeed occur between adults. The significance of the contrast between ADS and IDS is thus difficult to interpret.
Until this point, we have discussed ways in which prior research paradigms may not have adequately addressed various contextual and pragmatic imbalances across ID and AD conditions. We now turn to a related potential limitation associated with ID-AD comparisons – failure to account for relational and affective components that may differ across conditions.
Contrasting IDS and ADS: Addressing relational and affective components
Commenting on the tendency to overlook relational imbalances in experimental treatment of AD and ID speech, Ochs and Schieffelin (1984) noted that while ID speech is typically obtained by recording a mother’s speech to her infant (e.g., Newport, 1975), AD speech is typically obtained by recording speech between the mother and an experimenter (see also Grimshaw, 1977). In other words, the IDS stimuli used in most research exemplify speech within a close familial relationship, whereas the ADS stimuli employed exemplify speech occurring between virtual strangers. Standard experimental paradigms therefore confound relational intimacy with the status of the interlocutor (infant vs. adult) when they compare the affective and acoustic characteristics of IDS and ADS. Thus, infants’ response to the two sets of stimuli could reflect more of a response to the acoustics of relational intimacy than to age-of-addressee.
Furthermore, even if relational intimacy were to be kept equivalent, the fact that the AD speech is recorded/observed would likely ensure that participants’ speech is highly self-regulated (see Labov, 1972). Particularly affectionate speech would likely be absent from such observed adult-to-adult interactions given cultural conventions surrounding adult affective privacy. In contrast, mothers may be likely to exaggerate certain speech properties when addressing their infants in an effort to signal their culturally-valued parenting skills (see Lewis et al., 1996). This all may lead to a lack of equivalence in the affective quality of lab-produced IDS versus ADS stimuli, leaving it unclear how to interpret any acoustic differences that might be observed.
Context-specific ID and AD speech as “representative”
Much of the existing research on IDS seems to rest on the assumption that it is straightforward to obtain representative examples of IDS and ADS. This assumption seems worth questioning. First, a variety of verbal modifications likely occur depending on the social relationships and social contexts that are called into play (Ochs & Schieffelin, 1984). Yet, in the case of ADS, researchers have typically exemplified only one particular relationship/context combination: that of distant strangers conversing on a relatively neutral topic. It has not been made clear why a conversation/interview between distant strangers is assumed to be representative of the broader category referred to as ADS. More broadly, since context exerts measurable effects on speech (Slobin et al., 2014), it seems problematic to rely on any one given speech context to represent the general category of ADS.
Related issues emerge with respect to representativeness of speech stimuli vis à vis the IDS category, as well. For instance, previous research has illustrated ways in which IDS tends to differ depending on the target infant’s age (see Furrow et al., 1979; Kaye, 1980; Stern et al., 1983; Fernald & Morikawa, 1993; Kitamura et al., 2001). This variation in IDS is typically finessed (or ignored) in the existing literature investigating infants’ IDS preference. It often isn’t clear whether speech stimuli are argued to be “representative” of, say, speech to a 4-month-old or speech to an 18-month-old. Moreover, infant participants in attentional tasks typically hear IDS stimuli that are directed toward infant addressees of a narrow age range, even though the ages of the infant participants themselves may be considerably more wide-ranging (though see Cooper et al., 1997: Study 3).
To illustrate how this point is overlooked, consider the investigation of age differences in attentional response to IDS in the large-scale, multi-site study by the ManyBabies consortium (to which one of the present authors contributed). In this work, infants of various ages were assessed regarding their IDS responding, but the IDS stimuli embodied just one type of IDS directed at just one age group. An underlying implicit assumption seems to have been that IDS directed at infants at different ages (e.g., 4 months old vs. 18 months old) is essentially equivalent and that infants’ preference for IDS is driven by stable properties of IDS. In contrast to this assumption, Kitamura and Lam (2009) found that, just as IDS directed to infants varies with their age, infants at varying ages display differential responding to different types of IDS as well. For example, 3-month-olds preferred IDS primarily aimed to comfort, while 6-month-olds preferred approval-focused IDS. To bring this point back around to ADS as well, we note that just as speech might differ systematically depending on the age of the infant addressee, such systematic modifications are likely to be present when addressing adults of different ages as well. In other words, yet another factor that is generally not considered is the particular age of the adult-addressee. The implications of this type of variability for our understanding of infants’ response to IDS also have yet to be explored.
Summary
In this section we discussed ways in which cognitive-developmental researchers have operationalized the IDS and ADS constructs in their experiments. In particular, we question the extent to which the available data supports claims regarding both characteristics of IDS, and effects of IDS on infants, due to a host of potentially confounding factors associated with IDS-ADS comparisons in the existing body of research. These confounds stem from limited consideration of the sociolinguistic constraints undergirding all speech acts.
To illustrate the importance of taking sociolinguistic considerations into account when conducting empirical research regarding IDS and how infants respond to it, in the next section we critically evaluate assumptions that seem to undergird two major collaborative IDS-centered research projects: infants’ preference for IDS (e.g., ManyBabies, 2020; Byers-Heinlein et al., 2021) and cross-linguistic investigation of IDS (e.g., Hilton et al., 2022). These are important, landmark projects within the literature on IDS, and make key contributions to current knowledge. At the same time, they illustrate some of the very issues that we hope to foreground here. In particular, infants’ looking behavior in IDS preference experiments typically is assumed to reflect a response to something (assumed to be) intrinsic to the IDS category (e.g., exaggerated intonation contours, see Fernald & Kuhl, 1987). Similarly, the interpretation of recent data pointing to acoustic commonalities in IDS obtained in different languages (Hilton et al., 2022) seems to implicitly assume these patterns may represent universal acoustic features of a (singular) IDS category.
In both cases, researchers have decided to identify speech directed to infants as “IDS” rather than viable alternatives (e.g., “affectionate,” “engaging”, “comforting,” or “intra-familial” speech). Furthermore, speech directed to (strange) adults has been classified as “ADS” (itself a rather broad construct) rather than “stranger” or “interview” speech. These decisions have resulted in as-yet-unwarranted conclusions regarding the nature of IDS and its reception by infants (see Kitamura & Burnham, 1998; Trainor et al., 2000; Singh et al., 2002). We now turn to these points in greater detail.
Chapter 4 – Implications for current research
Do infants prefer IDS?
Research on IDS preference relies primarily on looking-time measures (though see Werker & McLeod, 1989, study 2). In these studies, the amount of time infants spend looking at a visual stimulus while listening to IDS is compared to the amount of time infants spend looking at that same stimulus accompanied by ADS. Longer looking time in the IDS condition is interpreted as reflecting infants’ preference for IDS. Yet as Schroer et al. (2019) note, this (highly robust) result is most neutrally described as IDS eliciting enhanced attention relative to ADS. Although it may not seem inaccurate to call this a “preference,” such terminology nevertheless may be misleading, as preference seems to suggest that IDS is more pleasurable, appealing, or desirable to infants. And this may not be the case. IDS may simply be better at eliciting, capturing, or sustaining infants’ attention than ADS.
Second, even if we were to accept that looking-time measures are adequate indices of preference (in the desirability sense), we note that the ADS stimuli that have been compared to IDS stimuli do not exhibit many of the features likely to drive auditory preferences of infants and adults alike (see Singh et al., 2002). We suspect that, just as for adult listeners, a wide range of factors that modulate speech quality (e.g., charisma, musicality, affect) might drive infants’ speech preferences. In fact, while prior research has not systematically controlled for all these factors when comparing IDS and ADS, when affect was systematically controlled for, infants’ preference (i.e., looking time) followed the affect. Kitamura and Burnham (1998), for example, systematically examined the extent to which vocal affect, as opposed to pitch level, drives infants’ preference (measured via looking time – a point we will return to) for IDS. Specifically, they compared infants’ preference for various IDS stimuli differing in terms of affect and pitch. They found that when pitch level was held constant, infants preferred IDS with higher vocal-affect levels. However, they also found that when affect was held constant, infants displayed no systematic preference for high-pitched IDS relative to low-pitched IDS. These findings suggested that the affective quality of IDS, rather than pitch, may have been what drove infants’ response to IDS in this study, and by implication, in many of the studies on the topic (see Piazza et al., 2017).
In contrast to Kitamura and Burnham (1998), who examined infants’ preference for IDS varying in terms of pitch and/or affect, Singh et al. (2002) extended this line of reasoning and systematically varied affect independently for both IDS and ADS and measured infants’ looking time during exposure to these stimuli (e.g., high-affect IDS vs. low-affect ADS, low-affect IDS vs. high-affect ADS). When affect was held constant, infants exhibited no systematic differences in their looking times for IDS versus ADS speech stimuli. Moreover, when ADS stimuli presented more positive affect than IDS stimuli, infants’ looking time (“preference”) followed the positive affect. Like Kitamura and Burnham (1998), they concluded that infants’ enhanced looking time to IDS is attributable to a more general preference for speech that imparts relatively positive affect rather than to the fact that it is infant-directed, per se.
As we have discussed at length (and as has been noted by authors before us, e.g., Grimshaw, 1977; Ochs & Schieffelin, 1984), IDS-ADS comparison research is typically associated with a host of confounding factors and this is the case for the preference literature as well. Consequently, rather than stating that “infants prefer IDS”, a more appropriate (and limited) conclusion would be that infants’ attention is better captured by IDS of a specific type produced in a specific context in comparison to the ADS to which it was compared.
Finally, we note that even this more circumspect phrasing of the finding requires further amendment because many infants actually do not show attentional enhancement to IDS. In fact, ManyBabies (2020) found that only 1373 out of 2329 infants (just 58.95%) actually showed a numerical IDS “preference” (meaning any measurable level of longer looking to IDS-associated visual stimuli relative to the same stimuli associated with ADS). This represented a percentage of infants that was statistically greater than would be expected by chance alone. Yet a full 41% of the infants in that large-scale replication study did not show even the weakest possible sign of IDS preference. An informal survey of the IDS preference literature suggests that these percentages are fairly typical in such research. This illustrates the extent to which the generic phrasing that “infants prefer IDS” falls short of capturing the actual variation observed regarding the phenomenon at hand. In sum, reified views of IDS have potentially contributed to research designs which fail to consider the non-discrete nature of sociolinguistic registers, rendering narrative depictions of infants’ response to IDS empirically unsupported. We now turn to another example of the potentially misleading consequences of these methodological choices – the claim that IDS (of a particular form) is a universal (or near-universal) phenomenon.
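The arithmetic behind these figures can be checked with a short sketch. The counts (1373 of 2329) come from the text above; the exact one-sided binomial test against a chance level of 0.5 is our own illustrative assumption about how "greater than chance" might be assessed, not necessarily the analysis reported by ManyBabies (2020). The function and variable names are hypothetical.

```python
from math import comb

# Counts as reported in the text for ManyBabies (2020).
N_INFANTS = 2329      # infants included
N_IDS_LONGER = 1373   # infants with numerically longer looking to IDS


def one_sided_binomial_p(successes: int, n: int) -> float:
    """Exact P(X >= successes) for X ~ Binomial(n, 0.5).

    Assumption for illustration: under "chance", each infant is equally
    likely to look longer in the IDS or in the ADS condition.
    """
    tail = sum(comb(n, k) for k in range(successes, n + 1))
    return tail / 2 ** n  # exact integer arithmetic, converted to float last


proportion = N_IDS_LONGER / N_INFANTS
share_without = 1 - proportion
p_value = one_sided_binomial_p(N_IDS_LONGER, N_INFANTS)

print(f"Showed a numerical IDS 'preference': {proportion:.2%}")
print(f"Did not show one: {share_without:.2%}")
print(f"One-sided exact binomial p-value vs. 0.5: {p_value:.2e}")
```

Under these assumptions, the two observations in the paragraph coexist without contradiction: with a sample this large, a proportion of roughly 59% is very unlikely to arise by chance, yet roughly 41% of individual infants still showed no numerical IDS advantage.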
Do cross-linguistic investigations of IDS point to universality?
Recent cross-linguistic investigations of IDS purport to have identified several universal features. For example, Broesch and Bryant (2015) found that IDS produced by mothers in Fiji, Kenya and North America all displayed higher pitch, greater pitch variability, and a reduced rate of production relative to ADS, echoing Kuhl’s (2004) characterization of the universal features of IDS. Similarly, Broesch and Bryant (2018) found that IDS produced by fathers in a small-scale society in Vanuatu and in North America had higher pitch range than ADS (though several interesting differences were also found – a point we will return to). Finally, Hilton et al. (2022) built a large corpus of infant- and adult-directed speech (and song) recordings produced by 410 people living in 21 urban, rural, and small-scale societies. They found that a machine classifier reliably distinguished ID and AD speech (see also Piazza et al., 2017). Taken together, findings such as these make it plausible that certain universal acoustic features of IDS may exist.
Yet considering these findings in light of our preceding discussion, several limitations are apparent. The common thread in these studies is the use of the “unmatched context” approach to ID/AD comparison. Specifically, Broesch and Bryant (2015, 2018) recorded ID speech produced in a context in which the infant and caregiver were facing each other, and the caregiver was explicitly asked to avoid picking the infant up, thus maintaining a distal interactive modality (though touch was allowed). In striking contrast, the AD speech consisted of caregivers’ responses to general descriptive questions such as the age and sex of the infant, as well as their general thoughts regarding the interaction. Similarly, while Hilton et al. (2022) obtained their ID speech by asking caregivers to speak to their infants as if they were fussy, the AD speech was obtained by recording participants speaking to a researcher about a topic of their choice, such as their daily routine.
The ID and AD stimuli in these studies thus differ in many ways other than the age of the interlocutor. Furthermore, rather than capturing the broad category of IDS, these studies capture IDS occurring exclusively in a dyadic distal face-to-face context in which the interaction is the activity. It is therefore entirely possible that the ecological affordances of this particular interactive modality may drive the acoustic characteristics of the observed IDS. The proposed universals may thus reflect these ecological affordances rather than universal properties of IDS more generally (see also Puccini et al., 2010).
This point is particularly important since distal, dyadic, face-to-face interactions have been argued to embody a “prototypical” mode of caregiver-infant interaction in many Western societies while proximal, non-face-to-face multiparty interactions may be more representative of typical interactions with infants in many other societies (see Keller, 2003; Solomon, 2011; Tamis-LeMonda & Song, 2012). While these are both conceptualized as universal interactive modalities that may be present in all societies, across cultures they are weighted differently in terms of their centrality (Keller et al., 2009). For example, the very notion that a caregiver would use speech, rather than, say, touch, to soothe or interact with an infant reflects a particular and localized “distal-centered” parental ethnotheory (see Richman et al., 1992; Keller & Lamm, 2005). It is thus noteworthy (and potentially concerning) that the experimental method employed to investigate universals of IDS relies on what is argued to be a prototypically Western interactive context.
We note further in this regard that whereas systematic descriptions of baby-talk registers across the world tend to be identified in relation to their local context (e.g., “Bengali baby talk,” Dil, 1971; “Toňdol: Sinhala baby talk,” Meegaskumbura, 1980; or “janyarrp, the Gurindji ‘baby talk’,” Jones & Meakins, 2013), analogous descriptions of caregivers’ speech in the US middle class are typically identified in decontextualized terms (e.g., “Motherese,” Newport, 1975; Fernald, 1985; or “infant-directed speech,” Cooper & Aslin, 1990). The potentially problematic assumption that caregiver speech occurring in the US middle class can be interpreted through a culture-free lens is thus reflected in the very terminology used. From a culture-centered perspective on IDS, this assumption may have led to generalized claims regarding alleged universal features of motherese that may prove to be unsupported (see Lieven, 1994; Gaskins, 2006).
To illustrate this point, consider the fact that over the course of an hour of observing Aka caregiver-infant interactions, Hewlett and Roulette (2016) found few cases of speech directed to infants and no signs of IDS/motherese (i.e., high-pitched speech) whatsoever. This raises the possibility that IDS – at least in the form that Western researchers recognize – does not exist in Aka daily life. It is possible, however, that even the Aka caregivers might produce something recognizable as the IDS observed in other sociolinguistic contexts if researchers instructed the caregivers to engage in a prolonged, distal, dyadic, face-to-face interaction with their infant. But if so, it isn’t at all clear what conclusions one should draw from such findings regarding universal acoustic properties of IDS.
Findings pointing to potential cross-cultural differences in IDS are of equivalent interest to those pointing to commonalities. For example, Broesch and Bryant (2018) found that when speaking to infants, both North American and Vanuatu fathers increased their F0 range, yet only Vanuatu fathers increased their average F0. This means that one of the purportedly universal features of IDS – higher mean pitch – did not characterize the North-American fathers’ IDS. Conversely, while North American fathers slowed their speech rate when addressing infants relative to adults, Vanuatu fathers did not. This means that a different alleged universal feature of IDS – slower rate (see Kuhl, 2004) – did not characterize Vanuatu fathers’ IDS.
Conclusions regarding universality clearly are premature given the scarcity of data and confounds in existing experimental designs. It seems altogether too easy – and unwarranted as yet – to leap to claims about universality. Illustrative of this point, while Ferguson found many commonalities in the baby-talk (BT) of six languages (and subsequent studies that included 27 languages), these features were hardly uniform. In fact, in a study of 34 languages, not one of the 22 characteristics that Ferguson believed to represent universal BT was reported for every language studied (see Haynes & Cooper, 1986). This underscores the need for more research before conclusions regarding universality are entertained, especially given how little is currently known about the thousands of ethnolinguistic environments in which children develop and have developed throughout history (see Evans & Levinson, 2009).
Finally, it is worth considering the possibility that discovering a set of necessary and sufficient features to identify a universal IDS phenomenon may not be achievable, and such a search may even be theoretically misguided. For one, most categories – including natural-kind categories – do not exhibit structure that is analyzable in terms of necessary-and-sufficient features (Wittgenstein, 1953; Markman, 1989; Keil, 1992). Thus, it is unclear why we would expect to find an unvarying, core set of acoustic features universally present in IDS. Furthermore, since IDS always occurs within a particular cultural context, the search for a decontextualized and idealized universal form of IDS may be untenable. In fact, the assumption that a core form of IDS exists which is then modulated by cultural factors (e.g., Hilton et al., 2022) reflects a more general tendency in psychology to view culture as a moderator of more basic a-cultural psychological processes (see Miller, 1997). In contrast, when culture is viewed as constitutive of psychological processes, teasing apart the universal from the cultural becomes far more fraught (see Geertz, 1973).
Chapter 5 – Summary and future directions
Summary
In this paper, we discussed complexities associated with identifying discrete categories of speech (e.g., baby-talk, motherese or IDS) within a complex multi-dimensional sociolinguistic landscape. We highlighted ways in which questionable assumptions made in this classification process are manifested in mainstream, “standard-practice” empirical investigations into IDS. Specifically, the comparison between “Infant-directed” speech (IDS) and “Adult-directed” speech (ADS) illustrates the assumption that a “standard” form of language use (i.e., ADS) exists that can be meaningfully contrasted with a “special” register, such as IDS. We discussed ways in which, given the complexity of the sociolinguistic landscape, this assumption may be unjustified.
In that vein, the tendency to refer to a generic IDS irrespective of factors such as the specific age of the infant or context of the interaction may reflect a reified view of IDS that is (implicitly) regarded as a stable entity-like object or essentialized natural kind, potentially overlooking heterogeneity associated with the IDS phenomenon. Depictions of infants’ attentional orienting to IDS in the literature therefore imply that a specific package of acoustic features may elicit a specific type of response from infants (i.e., preference). Yet, since (high-pitched, melodic) IDS occurs within a broader interactive event, it is unlikely to exert specific potion-like effects that operate independently of that interactive context.
Theoretical implications and future directions
IDS (as depicted in the developmental psychology literature) is often characterized as an optimal form of interaction with infants. However, when viewed as a culturally and situationally diverse phenomenon, claims regarding the optimality of IDS are difficult to interpret. That is, there appears to be considerable tension between the existence of cultural differences in conceptualizations of optimal parenting and desirable developmental endpoints on the one hand, and descriptions of (a particular form of) IDS as an optimal form of interaction with infants in the developmental psychology literature on the other hand. Since terms such as optimal relate to an evaluative standard, they (arguably) carry a normative weight (see Kessen, 1979). Given the diversity in parental ethno-theories worldwide (Harkness et al., 2009), there is an urgent need to reconsider the extent to which caregiving practices that are particularly valued in the cultural environments producing IDS research should be presented as universally prescriptive accounts of IDS (see Hruschka et al., 2018).
Terms like motherese, parentese (see Ramírez et al., 2020) and infant-directed speech seem to imply that caregivers (generically) speak to their infants in a uniform way. This overlooks the diverse range of ways in which caregivers interact with their infants. The general lack of attention to the cultural organization of infant-caregiver interactions in the IDS literature is reflected in the experimental methods employed as well. In particular, investigations of proposed universal features of IDS typically place infants at a distance from their caregivers. Caregivers must therefore rely on their voice to achieve a sustained interaction with their infants. This experimental design also likely mimics the typical quasi-conversational interactive pattern that is dominant in Western middle-class families (Ochs & Schieffelin, 1984; Richman et al., 1992; Keller et al., 2009). Conclusions regarding universal features of IDS drawn from observations of behaviors occurring in one particular culturally-canonical interactive context are therefore difficult to interpret. Going forward, discussions of universality can continue to hold weight only as the range of interactive contexts in which IDS (and ADS, for that matter) is examined is broadened in a manner that reflects cultural diversity.
Rather than reference IDS to caregiving optimality, we suggest researchers explore the implications, consequences, and effects of any particular form of IDS – for infants, caregivers, and for the cultural management of children more generally – while taking into account possible costs and/or tradeoffs to any given behavior. Caregiving – and its implications for children’s responding and development – always occurs in relation to a rich, multi-layered contextual backdrop (e.g., Bronfenbrenner, 1979), in which parents may hold many, and even potentially conflicting, goals for their children. For this reason, a specific caregiving behavior – such as use of attention-capturing IDS – likely supports some, but not all, of one’s goals for a child, and may even actively suppress or inhibit some goals. Put another way, any given caregiving behavior holds multiple implications, which caregivers themselves may recognize as trading off against one another.
With respect to IDS, certain verbal techniques such as higher pitch, greater pitch excursion, and the like (which some forms of IDS frequently employ) may help establish a pedagogic “spotlight” that entrains infants’ attention in ways that the speaker desires. In middle-class North American IDS, for example, these acoustic characteristics are often used to highlight object labels at a time when infants are being shown a relevant object referent (Fernald & Morikawa, 1993). In addition to potential benefits for infants’ learning about object labels, however, shaping infants’ attention in this way of course also holds potential costs; it occurs at the expense of their attention to other things and thus potentially constrains their learning in a complex environment that otherwise would afford multiple learning opportunities. This is reminiscent of Bonawitz and colleagues’ (2011) demonstration of a “double-edged sword of pedagogy”: although pedagogy can heighten learning in desired ways, it can also limit exploration and discovery. This analysis leads us to expect that attention-capturing IDS – conceptualized by some as a form of natural pedagogy (Csibra & Gergely, 2009) – likewise will bear costs as well as benefits for infants’ learning.
Future research on IDS will be enriched to the extent that it systematically examines potential costs and benefits of different ways of influencing an infant’s attention and explores the function of IDS in relation to a diverse range of socio-cognitive phenomena such as learning, memory, motivation, causal inference, social responding, and so forth. This approach opens the door to exploring possibly disparate effects of IDS, while taking into account both the culturally varied forms that it takes and the multi-faceted diversity of contexts within which it occurs.
A contextualized perspective on IDS will also help to highlight connections with other lines of work that speak to the significance of cultural diversity in caregiving patterns. One such body of work concerns attitudes toward teaching, which appear to vary considerably across the globe (see Lancy, 2010; 2016). For example, Hewlett and Roulette (2016) interpreted the apparent reluctance of Aka parents to verbally intervene with children’s ongoing activity in order to teach lessons as reflecting a broader emphasis on respecting children’s autonomy. They remarked that Aka and several other hunter–gatherer groups do not have a term for ‘teaching’, but among the Aka the word mateya is used to refer to advice or guidance. The term implies that children retain the option whether or not to be guided by any given instance of guidance. Such an approach to interaction with infants seems quite different from typical descriptions of IDS in the western scientific literature. Specifically, one of the benefits accorded to IDS is its effectiveness as an attention-
Relatedly, Ochs and Schieffelin (1984) argue that the western, middle-class use of an attention-capturing form of speech may reflect an experience on caregivers’ part of egalitarian discomfort with the competence differential between themselves and their infants. Western caregivers thus simultaneously engage in “child-raising” strategies (such as rich mentalistic interpretations of infant behaviors) and “self-lowering” strategies (such as the use of a special-purpose modified form of speech aimed at sustaining infants’ attention) to approximate a more “even” playing-field between their infants and themselves. In contrast, Samoan caregivers in Ochs and Schieffelin’s (1984) study generally did not attempt to adapt situations to meet the perceived needs of young children. Instead, they focused on adapting the child to the situation, and their goals for children’s behavior – in both the short and longer term – also differed. Clearly, the general interactive mode caregivers adopt with infants would seem to affect not only the form that IDS takes, but also a large constellation of other factors in children’s ongoing experience, calling out for analysis of the functional significance of IDS that takes into account this larger sociological landscape. In sum, it is evident that diversity in caregiving orientations (only a sliver of which it has been possible to highlight in this review) worldwide raises major questions regarding characterizations of optimal forms of IDS.
Finally, evolutionary accounts regarding universal acoustics of IDS are conceptually rooted in broader portrayals of basic emotions as natural kind categories (see Fernald, 1995). In both cases, emotions and their vocal expression are argued to be manifested in universal ways (i.e., distinct, culturally invariant facial expression or vocalizations). In light of alternative accounts emphasizing ways in which cultural factors are constitutive of psychological processes such as emotions (Barrett, 2012; Gendron et al., 2014), future work should explore ways in which cultural meaning systems shape the nature of facial and vocal emotional expressions directed towards infants as well.
Concluding remarks
Human speech is dynamic and continually modulated within a complex multidimensional sociolinguistic landscape. The relatively ubiquitous use of ‘special’ forms of speech when interacting with infants has captivated the interest of scholars from a diverse range of fields. In this paper we discussed ways in which characterizations of the nature of these ‘special’ forms of speech and their effects on infants have (or have not) grappled with the complexity of the sociolinguistic landscape within which all speech acts occur. In many cases, unexamined assumptions regarding the ontological nature of the IDS concept have implicitly shaped theorizing and empirical work, yielding findings that, as yet, provide only a very partial glimpse into the full richness of the relevant speech phenomena. We hope our analysis will be useful in continued efforts to characterize how people talk to infants, what values and beliefs are realized thereby, and what such speech accomplishes, both interpersonally and as a scaffold for learning.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
