Abstract
The evolution of human musicality has often been linked to the evolution of the faculty of language, since the development of musical and linguistic abilities seems to share a common phase in ontogenesis. Moreover, singing and speaking are, on the one hand, universal forms of human vocal expression and, on the other, consist of culturally specific elements. This probable co-occurrence of a predisposition to speak and sing with the cultural variability of both forms of communication has prompted researchers to point to gene–culture co-evolution as the mechanism most likely responsible for the emergence of human musicality and the faculty of language. However, in most evolutionary scenarios proposed so far, the evolutionary paths of music and language proceeded independently after diverging from a common precursor. This article, based on observations of contemporary interactions between language and music, presents a different view in which musical and language-like forms of proto-communication interacted, leading to the repurposing of some of their neural mechanisms. In this process, the Baldwinian interplay between plasticity and canalization is proposed as the evolutionary mechanism most likely to have shaped our musicality. The premises that support the presence of cross-domain co-evolutionary interactions in the contemporary communicative niche of Homo sapiens are indicated.
Music and natural language are inseparable parts of contemporary human culture, which often interact in various types of lyrical expression, such as songs and melo-recitation. Although the term music has been used in the Western tradition to refer to many different forms of expression involving sound, songs and instrumental performances have certain functional and structural similarities that can be observed in all, often disparate, human cultures (Mehr et al., 2018, 2019; Savage et al., 2015). These similarities are probably not merely coincidental but could have emerged from a set of abilities shared by Homo sapiens, referred to as musicality (Fitch, 2015; Honing, 2018), which therefore forms a natural basis for music. A crucial consequence of the influence of human musicality on music is that music is a form of sonic expression divided into discrete units of pitch and rhythm. From this perspective, music and natural language, whose analogous discrete units are phonemes, are human-specific forms of communication through sound exemplifying the Humboldt system (Merker, 2002). Despite these similarities, both music and language, as observed in human cultures around the world, are enormously diverse communicative systems. This diversity would not be possible without cultural inventiveness, which, as a part of gene–culture co-evolution, is increasingly treated as an important element in the evolution of cognitive abilities (Jablonka & Lamb, 2005; Laland, 2017; Whitehead et al., 2019), including musicality (Patel, 2023; Tomlinson, 2015). Even if current gene–culture co-evolutionary explanations of the evolution of music (Killin, 2017; Podlipniak, 2017; Savage et al., 2021a; Shilton, 2022; Tomlinson, 2015) allow for different, non-mutually exclusive, adaptive functions of music, they still point to functionally independent evolutionary pathways that could have led to the appearance of human musicality. Admittedly, Savage et al. (2021a) have indicated that elements of musicality may originally have arisen from other adaptations that were not music-specific. However, they have emphasized that only later in the process of the gene–culture co-evolution of human musicality were these formerly non-music-specific abilities exapted (i.e., they were pre-existing traits that were repurposed to fulfill a new function; see Gould & Vrba, 1982) and modified due to the adaptive function of music, which in their view was to promote social bonding. According to Savage et al. (2021a), a clear sequence of events occurred, starting with abilities that arose because of adaptive values unrelated to musicality, followed by the process of gene–culture co-evolution. This in turn led to the development of human musicality, thanks to the adaptive value of social bonding. While this sequence allows for the exaptation of previously non-music-specific abilities, the whole process of gene–culture co-evolution of human musicality took place independently of the evolution of other, non-musical communicative abilities. From this perspective, the gene–culture co-evolution of human musicality was unidirectional and restricted to one domain, in that the co-evolutionary path and all the selective pressures that shaped our capacity for music were related solely to the adaptive value(s) of music.
In this article another view is proposed, in which the Baldwinian co-evolution of a language-like propositional domain and a music-like emotional domain of communication interacted in such a way that the abilities for one were exapted by the other and vice versa. However, in this process, in contrast to classical exaptation by natural selection in genetic evolution (Gould & Vrba, 1982), the main cause of selection leading to exaptation was the cultural invention of using some elements of one domain to achieve functions specific to the other domain. Specifically, certain features of spoken language, such as intonation and syntax, were initially adopted socially from music-like emotional vocalizations. Similarly, some types of concepts originating in associations formerly reserved solely for language-like propositional communication, such as timbral symbolism and iconicity (Imai & Kita, 2014), were taken over by music. One method used to infer the evolutionary interplay of music- and language-like communication is based on Bateson’s model of double description (Bateson, 1979; Hui et al., 2008), which is a type of logical abduction.
The origin of musicality and Baldwinian loops
The debate about the origin of human musicality has been dominated by the search for the adaptive value of music because, for many scholars, musical behavior seems to be biologically useless (Pinker, 1997; Wilson, 2013). Nonetheless, since Darwin, many different adaptive functions have been proposed as possible reasons for the evolution of musicality, such as strengthening social bonds (Dunbar, 2012; Roederer, 1984; Savage et al., 2021a; Storr, 1992), mate attraction (Darwin, 1871; Miller, 2000; Ravignani, 2018), enhancing mother–infant affiliative interactions (Dissanayake, 2001), parent–infant communication and bonding (Leongómez et al., 2021), eliciting attention in parent–infant competition (Mehr & Krasnow, 2017), signaling social strength (Hagen & Bryant, 2003; Hagen & Hammerstein, 2009; Mehr et al., 2021), free-rider recognition (Podlipniak, 2023), deterring predators (Jordania, 2011), and vocal grooming (Dunbar, 1996). However, since musicality is a set of distinct abilities (Fitch, 2015; Honing, 2018), these abilities might have evolved independently because they had different adaptive values. The proposed functions need not be mutually exclusive for the purposes of explaining the evolution of musicality (Harrison & Seale, 2021; Juslin, 2021; Savage et al., 2021b), so long as they can be included in an evolutionary scenario that explains the phylogeny of musicality. Since hominins can also respond to environmental challenges by means of phenotypic adaptations (i.e., by means of acquired features), such as habits and traditions (Avital & Jablonka, 2000; S. E. Fisher & Ridley, 2013; Jablonka & Lamb, 2005), the possibility that such adaptations also played an important role in the evolution of musicality cannot be ruled out. This possibility leads to the assumption that, rather than being solely a result of genetic evolution, human musicality could have been a product of gene–culture co-evolution (Killin, 2016, 2017, 2018; Patel, 2018, 2021, 2023; Podlipniak, 2016, 2017, 2021; Savage et al., 2021a; Tomlinson, 2015), a process in which cultural and genetic evolution interact, leading to the appearance and inheritance of new traits (Lumsden & Wilson, 1982; Richerson et al., 2010).
A special type of gene–culture co-evolution is Baldwinian evolution (Baldwin, 1896a, 1896b). In this process, an adaptive, culturally invented behavioral trait whose learning is time-consuming and effortful comes under genetic control. Although this process may look like the Lamarckian inheritance of acquired traits, the actual reason for the inheritance of such a behavioral trait is the accidental appearance of a mutation conferring a predisposition to learn this behavior faster and with less effort (Hall, 2001), which is then favored by natural selection (although for an alternative explanation, see Hughes, 2012). Importantly, in Baldwinian evolution, genetic inheritance follows cultural inheritance (West-Eberhard, 2005), in which the crucial role is played by learning (Jablonka & Lamb, 2005). Learning, like inventing new behaviors, necessitates a specific kind of developmental (or phenotypic) plasticity (Pigliucci, 2001; West-Eberhard, 2003) that enables the ontogenetic modification of behavior in response to the environment (including the cultural environment). This kind of plasticity is called behavioral plasticity (Dor & Jablonka, 2010; Mery & Burns, 2010) and is grounded in neural plasticity (Dor & Jablonka, 2014). However, neural plasticity can differ depending on the specializations that have evolved. Patel (2023) refers to two types of neural plasticity proposed by Greenough et al. (1987): experience-expectant and experience-dependent plasticity. Both are mechanisms that enable the acquisition of culture-specific traits, such as language-specific vocabulary and writing. However, while the phonological system specific to a particular language is acquired as a result of experience-expectant plasticity, the learning of writing necessitates experience-dependent plasticity. As a result, speech—as the default form of language—is a human universal, but literacy is not observed in all human cultures (Pinker, 1994). This means that experience-expectant plasticity is constrained by a canalized neural system, whereas experience-dependent plasticity opens more space for learning.
These two types of plasticity seem to be solutions to different environmental challenges. Greenough et al. (1987) have proposed that experience-expectant plasticity evolved as a way of dealing with environmental challenges that are ubiquitous and stable. In contrast, experience-dependent plasticity developed, according to them, as a tool for storing information related to experiences that are unique to a particular individual, such as the location of a source of food or shelter. In the case of hominin culture, however, experience-dependent plasticity permits individuals not only to remember information unique to them but also to learn cultural innovations that are widespread throughout the whole group, such as writing. In a stable cultural environment, experience-expectant plasticity, which enables fast learning during sensitive periods, has an advantage over experience-dependent plasticity, which necessitates more time-consuming and effortful learning. From the Baldwinian point of view, experience-expectant plasticity related to a particular trait is therefore a result of canalization, but it allows cultural evolution of this trait, such as a particular language (Dor & Jablonka, 2010, 2014) or culture-specific music (Jan, 2018, 2022; Savage, 2019). By contrast, experience-dependent plasticity is a source of behavioral innovations that go beyond canalized behavior and enable open-ended cultural evolution.
As music has both universal (S. Brown & Jordania, 2013; Mehr et al., 2019; Savage et al., 2015) and idiosyncratic features, the origin of musical behavior can be considered in terms of Baldwinian evolution (Podlipniak, 2017, 2021; Savage et al., 2021a). After all, the Baldwinian model predicts that a behavior will be partly predisposed and partly culture-dependent (Jablonka & Lamb, 2005). Moreover, since music is a multifaceted form of communication in which different elements depend on different specialized abilities, Baldwinian transformation could have happened many times. In line with this assumption, Savage et al. (2021a) have proposed that the evolution of musicality should be considered in terms of the “iterated Baldwin effect” (p. 3), that is, a process in which a culturally invented musical behavioral innovation creates a niche enabling the selection of genetically controlled elements of musicality, allowing subsequent innovation and so on, resulting in “a virtuous spiral” (Savage et al., 2021a, p. 3). However, because the processing of music and language by modern human brains involves the same neural structures, at least to some extent (Steinbeis & Koelsch, 2008a), it would be useful to take into account the broadening of the musical niche into a communicative niche comprising both proto-musical and proto-lingual forms of communication. In this scenario, the cycle of plasticity and canalization that characterizes the iterated Baldwinian process went beyond the musical niche, drawing inspiration for innovations from both music- and language-like behaviors. The proposed extension of the musical niche to a communicative niche assumes that existing proto-musical cognitive tools could have been used by hominins to fulfill new communicative functions specific to a proto-language. Similarly, the cognitive tools specific to a proto-language could have been applied to music-like communication.
Developmental plasticity is required for exaptation of this nature (Hughes, 2012). The exaptation of cognitive tools involves implementing an existing neural submodule, or neural circuitry, in a functionally new module, or circuit. Such repurposing, to use Schlaudt’s term (2022), or neuronal recycling (Dehaene, 2005), demands experience-dependent plasticity and must first be achieved in the domain of culture. In some sense, a cultural niche tinkers with and creates a new device from pre-existing cognitive tools, rather like natural selection, which can be seen as a tinkerer using everything at its disposal to produce a useful tool (Jacob, 1977). The cultural repurposing of an existing neural circuit is therefore the first attempt to cope with an environmental challenge. If this attempt is successful, the next step is canalization by means of expensive cultural inheritance based on learning this repurposing. Since the cost of learning is a burden, when a particular individual in a population is accidentally endowed with the predisposition to learn the new behavior faster and with less effort, natural selection starts to favor this individual and then its progeny. Only then does natural selection canalize the new behavioral trait, in the long term, by genetic inheritance. However, a novel behavior changes the niche, which creates new challenges leading to a new cycle of plasticity, exaptation, and canalization. This is the iterated Baldwin effect described by Savage et al. (2021a). In the case of vocal communication among early hominins, even those as distant from Homo sapiens as Ardipithecus ramidus (Clark & Henneberg, 2017), this niche probably involved the use of different communicative tools, because culture usually tests different variants of behavior to achieve a particular goal (see, e.g., the different methods of opening milk bottles used by certain species of birds, such as Parus major, Parus caeruleus, and Parus ater; J. Fisher & Hinde, 1949; Hawkins, 1950; Jablonka & Lamb, 2005).
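Although the argument here is verbal, the core dynamic of the Baldwin effect lends itself to a quantitative illustration. What follows is a deliberately minimal simulation sketch in Python, offered only to make the logic of a single loop concrete; the population size, learning cost, and mutation parameters are illustrative assumptions, not estimates derived from any study cited here. Each agent carries a heritable predisposition that reduces how much of an adaptive behavior must be learned; because learning is costly, selection on that predisposition gradually shifts the population toward innateness, that is, toward canalization.

# Toy simulation of the Baldwin effect (all parameter values are
# illustrative assumptions). Each agent has a heritable predisposition
# p in [0, 1]: the fraction of an adaptive behavior it acquires innately.
# The remaining fraction (1 - p) must be learned at a fitness cost.

import random

POP_SIZE = 500        # hypothetical population size
GENERATIONS = 200     # hypothetical number of generations
BENEFIT = 1.0         # fitness benefit of performing the behavior at all
LEARNING_COST = 0.6   # cost of learning the non-innate part of the behavior
MUTATION_SD = 0.02    # standard deviation of mutational noise on p

def fitness(p: float) -> float:
    """The behavior is always acquired (innately or by learning), but
    learning the non-innate fraction is time-consuming and effortful."""
    return BENEFIT - LEARNING_COST * (1.0 - p)

def next_generation(pop: list[float]) -> list[float]:
    """Fitness-proportional selection with small mutations on p."""
    weights = [fitness(p) for p in pop]
    parents = random.choices(pop, weights=weights, k=len(pop))
    return [min(1.0, max(0.0, p + random.gauss(0.0, MUTATION_SD)))
            for p in parents]

# Start with a population that must learn almost everything:
population = [random.uniform(0.0, 0.1) for _ in range(POP_SIZE)]
for gen in range(GENERATIONS + 1):
    if gen % 50 == 0:
        mean_p = sum(population) / POP_SIZE
        print(f"generation {gen:3d}: mean innate predisposition = {mean_p:.2f}")
    population = next_generation(population)

In the iterated version proposed by Savage et al. (2021a), each such rise in the innate component would itself modify the communicative niche, creating new selection pressures for the next culturally invented behavior; the sketch models only one turn of that spiral.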
The two forms of human vocal communication
Singing and speaking, or more broadly, music and natural language, have often been understood as functionally different forms of communication, and both could have had their beginnings in hominin vocalizations. However, while individual vocalizations could have been the predecessors of speech, group vocalizations could have been more closely related to music (Jan, 2022), since music seems to be especially well suited to the simultaneous production of different sounds. In this case, such vocalizations must have possessed certain features that could act as anchor points for synchronization. In fact, singing, like instrumental music, uses rhythm and pitch to achieve this goal. Although interindividual synchronization does not necessarily involve the coordination of pitches in time, the use of pitch as an additional anchor point increases the complexity, and thus the combinatoriality, of a system. Auditory-motor synchronization (Zatorre et al., 2007) enables the coordination of matching sounds in time for the purposes of collective singing or playing music. When singers or instrumental musicians align pitches according to culturally specific rules, they can do so because they know how to imitate the fundamental frequency (F0) of a harmonic sound (Bannan, 2008, 2012), producing a range of textures: monophony, heterophony, homophony, and polyphony. In contrast, speech is mainly used in a responsorial way that involves sequential interactions (S. Brown, 2022b). This difference between music and speech can thus be described as the “choric/dialogic distinction” (Haiduk & Fitch, 2022, p. 1), with music being choric and speech dialogic. However, it should not be treated as absolute, in the sense that it precludes possible exceptions to the rules of collectivity and turn-taking. People do sing when they are alone, for example, when they are taking a shower, or to keep themselves company (Falk, 2004), and as soloists in ensemble contexts. Probably because of the collective nature of music, songs differ from speech mainly in that in singing the pitches are stable and discrete, while in speaking they are continuous (Zatorre & Baum, 2012). Conversely, the contour of a melody may be formed of melismas and glissandi rather than being based on intervals between pitches, and sliding between pitches can be choric over long passages, each singer performing as an individual, as in the case of isophony (Gill, 2023; Nikolsky, 2018). Nevertheless, the distinction proposed by Haiduk and Fitch (2022) reflects general tendencies in the ways that sounds are organized in singing and speaking. Importantly, these two different types of vocal communication require specific characteristics of vocal learning (Merker, 2012). While speaking necessitates the imitation of distinctive spectral features of sounds, singing requires the volitional control of F0 (Bannan, 2008, 2012).
As far as the content of communication is concerned, Shilton (2022) has indicated that, while natural language is focused on external objects, music acts as a tool of cooperative interaction by means of temporal and tonal alignment, as described above. In other words, natural language is oriented to extrinsic meaning, whereas music is connected to intrinsic meaning. These two types of meaning are related to the distinction between implicit and explicit knowledge (Schilhab & Gerlach, 2008). In both cases, communication leads to the synchronization of brain states (Abrams et al., 2013; Jiang et al., 2012; Pérez et al., 2017). However, the communication of intrinsic meaning is achieved by directly eliciting motor and emotional states, whereas in the vocal communication of extrinsic meaning, concepts have to be inferred from patterns of sound. This is not to say that intrinsic meaning does not influence extrinsic meaning. Being evolutionarily older, intrinsic meaning can scaffold conceptual meaning, as in the case of the influence of tactile sensations on conceptual knowledge (Ackerman et al., 2010) or the emotions induced by timbre (Wallmark et al., 2018, 2019). After all, emotions also serve as a mechanism for assessing the external world, that is, for efficiently rating the ecological relevance of sound sources (Ma & Thompson, 2015). However, internally and externally oriented communication systems fulfill the different functions of motivating reactions to stimuli and creating a conscious model of reality, respectively. Since these two functions of vocal communication seem to be detached from each other among chimpanzees (Watson et al., 2015), one can assume that hominins used two types of vocal communication systems before language and music evolved, namely externally and internally oriented protolanguages (Podlipniak, 2022). Today music and speech are typically accompanied by gestures, which has led to claims that both these communicative systems are integrated with gestural expressions (Kelly & Ngo Tran, 2023; Nussbaum, 2007). However, while affective gestures are present during both singing and speaking, semantic gestures (deictic, iconic, and symbolic) seem to dominate in language, which suggests that music differs from natural language in the domain of gesturing too. Although the division between music as an intrinsic communicative system and language as an extrinsic communicative system reflects the fundamental characteristics of these two types of human expression, this difference is not absolute. It should be emphasized that both natural language and music convey both intrinsic and extrinsic meanings. Every natural form of speech consists of suprasegmental features (prosody) and segmental features (consonants and vowels) that transmit meaning in different ways. These two components of speech, nonverbal and verbal, are often described as affective vocalization and articulate speech, and are assumed to be based on two different brain pathways (Ackermann et al., 2014). Speech prosody, apart from its many possible contributions to propositional meaning, transmits affective meaning in a similar way to music, that is, by eliciting emotions directly (Frühholz et al., 2014). Conversely, music is often reported as a source of referential meaning (S. Brown, 2022b; Cross, 2009; Cross & Woodruff, 2010; Jan, 2022; Koelsch, 2013; Koelsch et al., 2004; Patel, 2008; Tomlinson, 2023), and although musical semantics are usually thought to be much more ambiguous than the semantics of prose (Cross, 2005), neuroimaging studies have shown overlaps between the parts of the brain responsible for processing meaning in music and in language (Koelsch, 2005, 2011; Koelsch et al., 2004; Painter & Koelsch, 2011; Steinbeis & Koelsch, 2008b).
Common-precursor and multi-source models of music and language evolution
The observed overlaps between musical and speech communication have inspired many scholars to look for a common origin of language and music (Bannan, 2008; S. Brown, 2000, 2017, 2022b; Darwin, 1871; Fitch, 2013; Jan, 2022; Livingstone, 1973; Rousseau, 1998; Spencer, 1890). Many of these proposals have additionally assumed that human musicality and the faculty of language came into existence as the result of biological evolution (Bannan, 2008; S. Brown, 2000, 2017, 2022b; Darwin, 1871; Fitch, 2013; Jan, 2022). The dominant view among these hypotheses is a linear model in which language and music evolved from a common vocal precursor. The main premise for this explanation is based on the similarities between speech prosody and music (London, 2012; Palmer & Hutchins, 2006; Patel & Daniele, 2003; Patel et al., 2006), which include pitch contour, rhythm, stress, loudness, tempo, and pauses. The use of these features to express emotions, so-called affective prosody (or expressive dynamics), is characterized by many intercultural, and even interspecies, similarities (Filippi, 2016, 2020; Filippi et al., 2017; Merker, 2003; Zimmermann et al., 2013). Speech and music also seem to share a common first phase in their ontogenetic development (Brandt et al., 2012; McMullen & Saffran, 2004). A common developmental origin and shared prosodic features can be explained by descent from a common evolutionary precursor; after all, homologies are evidence of shared ancestry. The common-precursor models assume that, after a musilanguage phase (S. Brown, 2000), music and language started to evolve separately. In other words, after the split from a common precursor, the evolution of musicality and of the language faculty took disparate and independent, uni-domain evolutionary paths. The exception to this standard view is S. Brown’s (2022b) theory that protomusic co-opted a rhythmic system that evolved independently from music as a part of dance (S. Brown, 2022a). He has not explained, however, what mechanism led to this co-optation.
An alternative explanation suggests that, rather than having one common precursor, music and language have multiple sources. According to this multi-source explanation, hominins created a communicative niche consisting partly of instinctive affective prosody combined with affective gestures, and partly of culturally invented signals, such as iconic and symbolic gestures and sounds. As communication about internal states and communication about external objects have different functions, the two forms of communication, music-like (internally oriented) and language-like (externally oriented), could initially have evolved separately (Podlipniak, 2022). However, as the social niche started to become more and more complex, thereby creating new challenges, the existing forms of communication proved insufficient. Thus, instead of expanding the existing music- and language-like communicative tools by introducing new features, hominins could have begun to use elements of one form of communication to enhance the communicative capabilities of the other. It is well known that speech prosody can influence social interactions by conveying clues to the speaker’s internal state, such as politeness, impoliteness, dominance, or submissiveness (P. Brown & Levinson, 1987; Culpeper, 2011; Culpeper et al., 2003; Ponsot et al., 2018); these clues can affect the way that lexical and grammatical content is interpreted by the listener. Interpretation is also likely to be influenced by the speaker’s affective gestures, which can function as pragmatic gestures (Lopez-Ozieblo, 2020). It is therefore plausible that hominins could have used music-like tools for communicating internal states to enhance their use of language-like tools for communicating external social relations. Natural selection could have favored such repurposing because it is more economical than creating new structures. The examples of interactions between music and language that can be observed in contemporary cultures, and interpreted as repurposing, suggest that a similar process could have happened in the ancestral-hominin cultures of our species.
Cross-domain interactions between modern communicative phenomena
The observed differences between the phonological systems of contemporary languages and music seem to be the results of experience-expectant plasticity, as people acquire these systems via implicit learning during childhood (McMullen & Saffran, 2004). The appearance of some forms of communication in certain populations must, however, have demanded experience-dependent plasticity. Good examples of such communication systems are whistled (Meyer, 2008, 2015) and drum languages (Akinbo, 2021; Arewa & Adekola, 1980; Seifart et al., 2018). In both cases, music-specific elements, such as pitch and rhythm, are used to code speech-specific features in order to convey propositional meaning. The users of whistled and drum languages can transmit propositional meaning by emulating the tonal and rhythmic patterns of spoken language through sound sequencing (Akinbo, 2021; Seifart et al., 2018) and also, in the case of whistled languages, through the spectral characteristics of harmonic sounds (F0 and formants) (Meyer, 2015). The invention of both of these communication systems was probably a way of overcoming the short range over which speech sounds propagate. What deserves special attention here, however, is that behavioral plasticity consists in this instance of using the resources of an existing system to perform a function specific to another system. This re-use, or repurposing, probably results in the reorganization of certain neural circuits. It has been discovered, for instance, that native users of a whistled language in the mountains of Northeast Turkey exhibit a decrease in left-hemisphere and an increase in right-hemisphere activity, such that symmetric hemispheric processing can be observed when they are listening to and understanding the whistled language, as opposed to speech (Güntürkün et al., 2015).
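The coding principle behind these speech surrogates can be illustrated with a toy sketch. The mappings below are invented for illustration and drastically simplify real systems, which also exploit formant emulation and conventionalized formulas: here, pitch stands in for lexical tone and note duration for syllable rhythm, while all segmental (vowel and consonant) information is discarded.

# Toy illustration of a whistled/drummed speech surrogate (hypothetical
# mappings throughout): propositional content survives only because the
# surrogate preserves the tonal and rhythmic skeleton of the utterance.

TONE_TO_HZ = {"H": 2400, "M": 1800, "L": 1200}  # assumed whistle pitches

def to_whistle(syllables):
    """Map (tone, duration_s) syllables to (frequency_hz, duration_s)
    notes, keeping pitch and rhythm but discarding vowels and consonants."""
    return [(TONE_TO_HZ[tone], duration) for tone, duration in syllables]

# A hypothetical three-syllable word with a high-low-mid tone pattern:
print(to_whistle([("H", 0.2), ("L", 0.3), ("M", 0.2)]))
# -> [(2400, 0.2), (1200, 0.3), (1800, 0.2)]

The point the sketch makes visible is the asymmetry discussed above: a listener can recover words only because the spoken language already assigns meaning to tonal and rhythmic patterns; that is, music-specific resources are put to a propositional use.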
People who are native speakers of tonal and non-tonal languages also exhibit lateralization differences (Wang et al., 2004). Left-hemispheric lateralization of lexical tone processing has been found in native tonal-language speakers (Chien et al., 2020; Gu et al., 2013). There are also structural differences between the brains of native and non-native tonal-language speakers, such that the former have a greater density of gray and white matter near the right anterior temporal lobe and the left insula (Crinion et al., 2009). In tonal languages, relative changes in pitch are not only part of prosody but also become phonological features, used as cues influencing the meaning of words (Maddieson, 2005) and/or their grammatical relationships. Although a lexical tone is not an interval between two musical pitches, musical training enhances the recognition of lexical tones (Patel & Iversen, 2007; Wong et al., 2007); this suggests a possible interaction between music and speech in the cultural development of these communicative systems. The differences between the ways in which tonal and non-tonal languages are processed show that the choice of a particular culture, in this case whether or not to use pitch as part of a language’s phonological system, can induce the reorganization of neural circuitry via experience-expectant plasticity. It could also be that the use of tonal speech in a cultural environment was responsible for producing particular genotypes, because correlations have been found between the population frequency of a variant of the development-related gene ASPM (abnormal spindle-like, microcephaly-associated) and the distribution of tonal languages (Dediu, 2021; Dediu & Ladd, 2007), and between the same gene and the ability to perceive lexical tones (Wong et al., 2012, 2020). Thus, Baldwinian evolution could contribute to language diversity. Another well-documented reorganization of brain circuitry in response to cultural demands concerns differences between musicians and non-musicians (Leipold et al., 2021). In fact, the changes in musicians’ brain networks are a widely used example of experience-dependent plasticity (Leipold et al., 2021; Münte et al., 2002). An important behavioral difference between musicians and non-musicians is that musicians are trained to organize their perception of musical structure by using concepts, often reflected in music notation, such as particular pitch intervals, rhythmic values, and tonal functions, whereas non-musicians are unaware of these concepts when they listen to music. Since conceptual thinking is not a domain of musical communication, this behavioral difference can be interpreted as a result of a conceptualization of music that has been imposed culturally (Zbikowski, 2002).
Culture can have a negative impact on the various forms of vocalization described above by suppressing the development of cognitive abilities to which human beings are predisposed. One example related to vocalization and pitch perception is octave equivalence, the ability to perceive the similarity of two pitches an octave apart (Hoeschele et al., 2012). This ability is widespread among human populations, and it has been claimed that it is a universal feature of music perception (S. Brown & Jordania, 2013; Harwood, 1976). In one study, however, researchers played sequences of sounds to indigenous Tsimane people from the Amazonian rainforest and to North Americans. Unlike the latter, the Tsimane participants ignored octave similarity when they reproduced the sequences. This may indicate that octave equivalence depends on the experience of culture-specific music. Or perhaps the Tsimane participants ignored it because their own music “appears to lack group performance and harmony” (Jacoby et al., 2019, p. 3230; see also McDermott et al., 2016). It has recently been suggested that octave equivalence originates in the social bonding produced by chorusing (Bannan et al., 2022), so the non-communal use of singing can be interpreted as a culturally driven change in the function of music, leading to the suppression of a predisposition to develop octave equivalence. In this case, the trajectory of cultural evolution changes the cultural communicative niche in such a way that natural selection no longer favors individuals who experience pitches an octave apart as perceptually similar. A comparable effect, whereby the ability to recognize lexical tone in speech is suppressed, can be induced by an environment in which the language is non-tonal. A similar process may have led to the suppression of perfect pitch at the expense of developing linguistic abilities, as suggested by Mithen (2006). Nonetheless, the existence of the cultural repurposing or suppression of communication-related cognitive abilities observed today suggests that the same effects could have influenced the evolution of music and language in the past too.
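Octave equivalence also has a simple formal core that is worth stating explicitly: two pitches are octave-equivalent when their fundamental frequencies are related by a power of two, so collapsing the octave amounts to comparing log-frequencies modulo 1. The minimal sketch below illustrates this; the reference frequency and tolerance are arbitrary choices made for the example.

# Octave equivalence as arithmetic on log-frequencies (illustrative
# reference frequency and tolerance; not drawn from any cited study).

import math

def pitch_class(freq_hz, ref_hz=440.0):
    """Position of a pitch within the octave, in [0, 1): the fractional
    part of the log2-frequency relative to an arbitrary reference."""
    return math.log2(freq_hz / ref_hz) % 1.0

def octave_equivalent(f1, f2, tol=0.01):
    """True if the two frequencies differ by (nearly) a whole number of
    octaves, i.e., fall on (nearly) the same pitch class."""
    diff = abs(pitch_class(f1) - pitch_class(f2))
    return min(diff, 1.0 - diff) < tol  # pitch-class distance is circular

print(octave_equivalent(220.0, 440.0))  # True: one octave apart (A3 vs A4)
print(octave_equivalent(220.0, 880.0))  # True: two octaves apart (A3 vs A5)
print(octave_equivalent(220.0, 330.0))  # False: a perfect fifth apart

On this description, a listener without octave equivalence, such as the Tsimane participants discussed above, can be said to behave as if the modulo step were absent, treating pitches a whole octave apart as simply different pitches.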
Cultural invention, canalization, and exaptation as drivers in the evolution of musicality and the language faculty
An intriguing feature of the functional organization of the brain is that structures responsible for the processing of language syntax are also active during the processing of music (Koelsch, 2005; Li et al., 2023; Steinbeis & Koelsch, 2008a). To explain the overlaps between these structures, Patel (2003, 2011) has proposed the shared syntactic integration resource hypothesis, also known as the resource-sharing framework. In this framework, shared neural structures (resources) operate on domain-specific knowledge localized in different areas of the brain. The widely accepted view is that musical syntax is a result of employing syntactic abilities that evolved originally as a part of the language faculty (Lerdahl & Jackendoff, 1983; Patel, 2008; Pinker, 1997). These shared neural resources are therefore seen as elements of a cognitive mechanism underlying language. In an alternative model, in which specialized structures are used by a variety of neural networks (Peretz et al., 2015), shared neural resources can be viewed as an integrated part of a music-specific network (although for yet another explanation, see Asano et al., 2022). The incorporation of these resources into a music-specific network can be explained by culturally driven repurposing, as described above, and probably also by canalization through natural selection. From the psychological point of view, the main difference between the functions of the language-specific and music-specific syntactic systems is that patterns of sound are mapped onto a hierarchy of concepts in the language-specific system, and onto pre-conceptual hierarchical experiences of stability in the music-specific system. This difference can also be seen in terms of hierarchical control over abstract rules in the case of language, and over patterns of stability and instability in motor networks linked to emotions (Asano et al., 2021) in the case of music. Considering that pre-conceptual experiences of stability were probably part of a form of vocal communication that is evolutionarily older than language, it is more likely that shared neural resources originated in a music-like communication system and were incorporated into a language-like communication system than vice versa (Podlipniak, 2023). In this view, pitch hierarchy evolved before language grammar (Podlipniak, 2016), probably with the increased control of the larynx, subglottal system, and supralaryngeal tract in Homo ergaster, affording later-evolving species (e.g., Homo erectus, Homo heidelbergensis, and the Neanderthals) the ability, like Homo sapiens, to control a melody consciously (Deacon, 2000; Morley, 2013, 2014; Wurz, 2009). It may be, however, that the pre-conceptual experiences of stability and instability that became the basis for a syntactic hierarchy in a music-like communication system evolved earlier, as sensations accompanying the motor expression of rhythmic patterns by gestures and vocalizations. In that case, considering that vocal abilities began to increase in Ardipithecus ramidus (Clark & Henneberg, 2017), it is even possible that the first sound hierarchies appeared before the evolution of the genus Homo. Regardless of which proximal function of syntax evolved first (i.e., expressing the hierarchy of concepts in language or pre-conceptual experiences of stability in music), the exaptation of cognitive machinery from one of these communication systems to the other is a convincing explanation for the evolution of the resource-sharing networks captured in Patel’s model. Importantly, the main results of this cross-domain co-evolutionary interaction are the canalized syntactic properties of modern language and music.
Besides octave equivalence, relative pitch is a cognitive ability that could have been repurposed by hominins to fulfill a new function. This ability allows us to recognize transposed melodies as examples of the same prototype, independent of the differences of pitch between the original and transposed melodies. Patel (2023) has speculated that relative pitch could have evolved as a speech-specific specialization and was only later employed in the context of music. However, relative pitch could also have evolved because of its adaptive value to music-like emotional vocalizations consisting of a primitive pitch hierarchy based on pre-conceptual sensations of stability. It could then have been used in language-like propositional vocalizations as a result of social invention. For example, a pitch contour contributing to a music-like emotional vocalization could have become, independently of its absolute pitch, not only a precursor of speech prosody but also a means used by early hominins in language-like propositional vocalizations to indicate attitudes to events or to other hominins. In a broad sense, these attitudes—referred to in linguistics as grammatical mood (Gil, 2021) (e.g., interrogative or indicative, referring to questions and statements, respectively)—resemble internal pre-conceptual states (e.g., questions representing uncertainty and statements representing certainty) of the kind that constituted the original content of music-like vocalizations. This was probably not accidental. Nowadays, mood is typically conveyed by intonation in the majority of languages (Jun, 2005; Warren & Calhoun, 2021), suggesting that intonation is a tool that has been canalized in the course of evolution. In other words, the use of pitch in spoken language to mark statements or questions (Chien et al., 2020; Gussenhoven, 2016; Gussenhoven & Chen, 2000; Jun, 2005; Ma et al., 2011) could have been adopted by hominins from music-like vocalizations and then canalized via the Baldwinian process.
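The computational content of relative pitch can also be made explicit. What transposition preserves is the sequence of pitch intervals; what a coarser, intonation-like use of pitch preserves is only the contour, that is, the direction of each step. A minimal sketch follows; the melodies, written as MIDI note numbers, are invented for illustration.

# Relative pitch as interval invariance under transposition
# (melodies are hypothetical examples in MIDI note numbers).

def intervals(melody):
    """Successive pitch differences in semitones: the representation
    that relative pitch preserves under transposition."""
    return [b - a for a, b in zip(melody, melody[1:])]

def contour(melody):
    """Only the direction of each step (+1 up, -1 down, 0 repeat): the
    coarser representation on which intonation-like cues could rely."""
    return [(d > 0) - (d < 0) for d in intervals(melody)]

def same_under_transposition(m1, m2):
    """True if m2 is m1 shifted by a constant number of semitones."""
    return len(m1) == len(m2) and intervals(m1) == intervals(m2)

original = [60, 62, 64, 60]    # C4 D4 E4 C4
transposed = [64, 66, 68, 64]  # the same melody four semitones higher

print(same_under_transposition(original, transposed))        # True
print(same_under_transposition(original, [60, 62, 65, 60]))  # False: an interval differs
print(contour(original) == contour([60, 63, 64, 59]))        # True: contour alone matches

A rising contour recognized independently of its absolute pitch is exactly the kind of representation that could serve both a melodic prototype and an interrogative intonation, which is what the repurposing account sketched above requires.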
Overlapping brain structures are implicated not only in the processing of musical and linguistic syntax and prosody but also in musical and linguistic semantics (Koelsch, 2005, 2011; Painter & Koelsch, 2011; Steinbeis & Koelsch, 2008b). The extent to which music has meaning is an intriguing and much-debated question. It is generally agreed that music lacks propositional semantics (Lerdahl, 2013), but many specific pieces are interpreted as conveying an intersubjectively consistent extra-musical meaning (Koelsch, 2013; Patel, 2008). The role of emotions in creating concepts may hold a clue to the emergence of the propositional meaning of music. The neural processing of emotional and semantic information converges in the lateral zone of the inferior frontal gyrus pars orbitalis (Belyk et al., 2017), suggesting that emotions were implicated in the evolution of propositional semantics. Importantly, this structure is also a part of the affective prosody network (Belyk & Brown, 2014) and the music production network (Bianco et al., 2022), indicating a potential link between the expression of emotions and semantic signals. In line with this view, Filippi (2020) proposes that vocal emotional expressions facilitated the evolution of language semantics because vocal patterns became associated with meaning. Alternatively, this process could be explained in terms of a language-like communication system repurposing elements of music-like communication. Since the emotional pre-conceptual sensations that accompany the experience of music can be divided into relatively distinct sets of emotions, such as fear, pleasure, anger, joy, and sadness, it can be speculated that these sensations informed the development of the corresponding concepts in language.
A music-like communication system could have repurposed elements of language-like communication too. Spoken language consists of sequences of vowels and consonants that we recognize based on the spectral characteristics of the sounds. It is therefore the changing spectral characteristics that make speech so effective in conveying propositional meaning. From this perspective, the experience of listening to spoken words is similar to the experience of listening to a series of fluctuating timbres. Although it is traditionally assumed that the relationship between the sounds of words and their meaning is arbitrary (de Saussure, 1959), several studies have shown that certain psychoacoustic features of the sounds of words can have a universal, non-arbitrary relationship to their meaning (D’Anselmo et al., 2019; Dingemanse et al., 2015; Erben Johansson et al., 2020; Monaghan et al., 2014; Preziosi & Coane, 2017). This phenomenon is called sound symbolism; Imai and Kita (2014) claim that it is an important part of natural language that can shed light on the evolution of meaning. Non-arbitrary sound–meaning mappings seem to be deeply ingrained in human cognition (Erben Johansson et al., 2020) and may represent the canalized vestiges of the language-like propositional tool of communication used by hominins. Since timbre is important in music, and also conveys iconicity and symbolism, it can be speculated that music owes its sound symbolism to the repurposing of non-arbitrary sound–meaning mappings from language-like propositional vocalizations. It is also probable that the associative tendencies governing sound symbolism in speech are present in music as well. For example, there is evidence of cross-modal correspondences between timbre and the visual or tactile domains (Wallmark, 2019; Wallmark & Allen, 2020).
Propositional meaning can also be conveyed by gestures. This being the case, it can be asked whether propositional meaning emerged in music because elements of motor behavior were repurposed in the auditory domain or vice versa. The findings of research involving individuals who stammer suggest an answer. Stammering reduces the number of gestures accompanying speech (Jaques & Mayberry, 2010) but does not affect singing (Wan et al., 2010). This may be because gestures have different roles in speech and music: affective gestures were repurposed by language-like communication from music-like communication, but not the other way round. If the communicative niche included dance, which, as S. Brown (2022a) proposes, evolved separately from music-like vocalizations, the affective gestures repurposed by language-like communication could have had their roots in motor behavior specific solely to dance.
I propose that all the cross-domain interactions I have described are important in the evolutionary history of both human musicality and the faculty of language. The evolutionary process common to both is as follows: cultural invention of a particular behavior in one domain; canalization of this behavior as a part of this domain; cultural exaptation of this behavior into another domain; and finally canalization of this behavior as a part of this new domain. If all these new culturally canalized behaviors were adaptive, they could have undergone Baldwinian evolution whereby an individual genetically predisposed to learn a particular behavior faster and with less effort would be born sooner or later. The predisposition to learn behaviors involving cross-domain interactions would set the individual on a genetically driven developmental path in which interdomain neural connections would lead to the emergence of a new functionally specialized neural network with its origins in developmental plasticity. Considering that the processing of modern language and music involves many shared structures in the brain, the interactions I have proposed are likely candidates for explaining their evolutionary origin.
Conclusions
My argument has focused on the role of cultural flexibility in the evolution of human musicality and the language faculty as a part of a dynamic process of interaction between these two sets of communicative abilities. The examples of potential interactions and repurposing that I have presented are not, of course, exhaustive. I could also have described cross-domain interactions in relation, for example, to the volitional control of affective prosody in speech; the perception that pitches, rhythms, and phonemes are discrete; and the combinatoriality of discrete musical and speech units. Moreover, the communicative niche of early hominins was probably not restricted to vocal modes of communication. The most useful and opportunistic strategy for hominins to exchange information was probably to combine auditory and visual signals (Zlatev et al., 2020). The fact that both singing and speaking are typically accompanied by involuntary gestures suggests that gestural modes of communication could also have interacted with both the music- and language-like vocalizations of early hominins. Visual signals as the source of iconicity and symbolism could have been invented primarily in the gestural domain in the form of pantomime, which is also a good candidate precursor of semantics (Zlatev et al., 2017).
My argument could be criticized for contradicting the principle of parsimony, according to which the most convincing explanation is the one that fits the evidence with the fewest assumptions or entities. This principle is also applied in evolutionary biology (Sober, 1988). However, the reconstruction of phylogeny by means of the shortest evolutionary tree, according to the principle of parsimony, does not necessarily reflect the actual phylogeny (Stewart, 1993). I have presented a number of examples of repurposing in contemporary culture, and of overlaps between the neural processing of music and language. Together, they form the basis of my premise that cross-domain interactions are an important mechanism underlying the evolution of both human musicality and the faculty of language. Theories of the evolution of music along Baldwinian lines are incomplete if cross-domain interactions are not considered. More research is needed, of course, to elucidate the possible cross-domain evolutionary paths that have led to the emergence of musicality as we know it today. Nevertheless, the cross-domain co-evolutionary interactions that, according to my proposal, drove the evolution of human musicality and the faculty of language bring us closer to understanding the intricate phylogenetic relationships that exist between music and natural language.
Acknowledgements
The author would like to thank two anonymous reviewers for their many useful comments and suggestions. The author would also like to thank Jane Ginsborg for her helpful advice as well as Peter Kośmider-Jones for his language consultation on the first draft of this manuscript.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the National Science Center, Poland (grant number: 2021/41/B/HS1/00541).
