Abstract
Theories of music evolution rely on our understanding of what music is. Here, I argue that music is best conceptualized as an interactive technology, and propose a coevolutionary framework for its emergence. I present two basic models of attachment formation through behavioral alignment applicable to all forms of affiliative interaction and argue that the most critical distinguishing feature of music is entrained temporal coordination. Music's unique interactive strategy invites active participation and allows interactions to last longer, include more participants, and unify emotional states more effectively. Regarding its evolution, I propose that music, like language, evolved in a process of collective invention followed by genetic accommodation. I provide an outline of the initial evolutionary process which led to the emergence of music, centered on four key features: technology, shared intentionality, extended kinship, and multilevel society. Implications of this framework for music evolution, psychology, and cross-species and cross-cultural research are discussed.
Introduction
One of the most significant biases of Western music scholarship has been the treatment of music as a reified auditory stimulus, something we listen to rather than interact with (Cross, 2012; Turino, 2008). Research has therefore tended to separate sound from movement, listening from active participation, and the aesthetic experience from a social one. It remained conceptually grounded in the concert hall, where audiences listen to compositions written and performed by professionals, a state of affairs largely preserved in today's aural consumption of recorded music. This bias has significantly affected music evolution research. Perhaps its biggest influence is the enduring emphasis on ultimate function, which stems, I argue, from the perceived plausibility of the null hypothesis: that music is, evolutionarily speaking, useless. Having been confined to the status of a mere aesthetic object, music was suggested to be a “byproduct,” and the ability to perceive, produce and enjoy it the result of auditory sensitivities that evolved for other purposes such as auditory scene analysis and language (Pinker, 1997). This argument, and the neo-Darwinian logic in which it was couched, dominated the field for the last two decades. Research consequently focused on the various ancestral contexts in which music might be functional: male or male-coalition displays (e.g., Merker, 1999; Miller, 2000), coalition displays more generally (e.g., Hagen and Bryant, 2003), infant care (e.g., Dissanayake, 1999) and social bonding (e.g., Dunbar, 2012). More recently, two contrasting articles both rejected the byproduct hypothesis, but still dealt primarily with the question of function: Mehr et al. (2021) argued that music served as a credible signal in the context of both coalition displays and infant care, while Savage et al. (2021) argued for social bonding as an overarching function, operating in the context of infant care, mating and group cohesion.
There is, however, a growing awareness that this approach to music evolution is overly reductive (Killin, 2016; Savage et al., 2021; Tomlinson, 2015). Four related problems can be identified. First, music cannot be clearly isolated from other forms of communication: it is part of a greater communicative toolkit and is interwoven into different communicative registers and rituals (Cross, 2014). Second, it overlooks the relationship of musical behavior to the unique social niche created and inhabited by humans for the past 2 million years (Shilton et al., 2020). Third, it often implies a strictly neo-Darwinian framework, which does not consider the complexities of developmental plasticity, niche construction and genetic accommodation (Killin, 2016; Tomlinson, 2015). Fourth, it is becoming clear that even in ancestral conditions, music or music-like behaviors would have had multiple and diverse purposes, making the search for a single function unfeasible. This problem was noted by Savage et al. (2021), who consequently proposed social bonding as an umbrella explanation, akin to the notion that "vision is for seeing." However, since the function of social bonding is not unique to music, but shared with many other forms of interaction (e.g., conversation, play), the unique properties of music remain to be clarified. A different approach to the question of function would be to ask what is at the core of music as an interactive strategy, distinguishing it—conceptually if not practically—from other forms of communication. This approach can also potentially clarify why music is preferred over alternative interactive strategies in some contexts, and why it is used differently in different societies.
Here, I expand on Savage et al. (2021), and propose a complementary framework which hinges on the treatment of music as an interactive technology. I present two general theories of affiliative interaction, and argue that music elaborates on a basic process of intentional, gestural and emotional alignment through which attachment to individuals and groups is formed, and which is common to all human communication. I argue that music is distinguished primarily by its focus on accurate temporal alignment, which is achieved through the co-construction of a stable periodic framework. I enumerate six distinctive outcomes of this interactive strategy, which enable musical interactions to include more participants, be more compelling and engaging, and last longer. Next, I argue that music, like tool making and language, evolved first as a social technology and was only later genetically accommodated. Finally, I outline the process by which humans became more dependent on one another and on socially created technologies (Shilton et al., 2020). I argue that loose temporal alignment was critical for consolidating attachment between infants and their multiple caregivers, and that it was later extended towards other types of relationships. Accurate temporal alignment, which underlies musical interaction, was born out of the need to extend attachment to a larger network of kin and non-kin, and is suggested to have been particularly important during regular group gatherings, and for the formation of multilevel societies.
Music as an Interactive Technology
The conceptualization of music primarily as a form of interaction is informed by evidence from four domains of research. First, the anthropological and fieldwork-based study of music across cultures, which has firmly established that music is deeply embedded in social life and is often a participatory activity (Blacking, 1973; Feld, 1984; Lewis, 2013; Merriam, 1964; Nettl, 2015; Savage et al., 2015; Turino, 2008). Second, experiments in social psychology demonstrating how musical interactions support empathy, bonding and prosociality (Mogan et al., 2017; Pearce et al., 2015; Rabinowitch et al., 2013; Weinstein et al., 2016). Third, the psychological study of music perception, which points to the critical importance of embodied anticipation, and consequently suggests that music listening is essentially active, and can in fact be construed as “covert performance” (Cannon & Patel, 2021; Cross, 2010; Huron, 2006; Koelsch et al., 2019; Patel & Iversen, 2014; Vuust & Frith, 2008). And finally, the study of musical behavior in animals, which finds no shortage of species with complex individual vocalizations, but very few with tightly coordinated or entrained group vocalizations, suggesting flexible coordination as the most important bottleneck for musical behavior (Ravignani et al., 2014; Schachner et al., 2009). The word “music” will therefore denote an interactive process throughout the paper, and could be replaced with “music making,” “musical interactions” or simply “musicking.”
Affiliative Interactions
To understand music as a type of interaction, we first need to ground it in a more general theory of affiliative interactions and their effects. On a psychological level, affiliative interactions involve biobehavioral synchrony, a phylogenetically ancient process that underlies the selective formation of attachment bonds in mammals (Feldman, 2017). Biobehavioral synchrony refers to the coordination of behavioral and biological processes between attachment partners, comprising behavioral synchrony (coordination and alignment of gaze, touch, movement and vocalizations), heart rate coupling, endocrine fit and brain-to-brain synchrony. The behavioral component can be described as a bi-directional signal, in which the correspondence between attachment partners reliably indicates attention and affiliation. Behavioral synchrony activates the mammalian attachment system, which evolved originally in the context of consolidating mother-infant bonds and is underpinned by dopamine and oxytocin crosstalk in the striatum (Feldman, 2017). It is essential in alloparental species to flexibly create emotional bonds with multiple caregivers. In humans, it is observed across all types of social bonds: parental, romantic, peer and conspecific (Feldman, 2017).
On a microsociological level, interaction rituals have been suggested as the most basic model of copresent human interactions (Collins, 2004; Goffman, 1959). Interaction rituals are defined as situations when two or more persons share an attentional focus and an emotional state, and align their communicative expressions in both form and periodicity. When successful, they have three main outcomes: group solidarity, positive emotional energy (as well as emotional contagion), and shared symbols and values. Group solidarity is a form of attachment that extends towards a group rather than a specific individual. Emotional energy ranges from enthusiasm and high motivation to depression and apathy. Successful interaction rituals elevate emotional energy, making participants more driven and exuberant. The shared symbols are those objects of shared attention that connect the group, and can consequently transform into sacred objects. Shared values are often embedded in the ritual itself and are derived from the laws of behavior guiding the ritual (e.g., Lewis, 2013).
Participation in rituals is often costly, reliably displaying the commitment of participants to the sacred symbols and social rules which govern the ritual (Henrich, 2009; Wen et al., 2020), while the emotional energy generated during the ritual intensifies it (Collins, 2004). Whereas biobehavioral synchrony explains the neuroendocrine basis of attachment formation between dyads (originally, mother and infant), interaction ritual theory explores how the basic rules of dyadic interaction can be expanded to a group, and how these rituals produce shared symbols and values.
Interaction rituals do not necessarily succeed. They can also fail, and in their failure, point to disparities and dissatisfactions. As such, they also reliably test the strength of personal relationships and group identity. We can clearly recognize when our interaction partners are inattentive or unresponsive, when there is no shared “rhythm” or mutual understanding. The same is true of group rituals, where failure to participate can bring to the fore hidden disputes (e.g., Oloa-Biloa, 2017, pp. 198–199). Interaction rituals are therefore both producers and reliable tests of social solidarity.
Examples of interaction rituals abound—indeed, Collins (2004) considers them to be quite ubiquitous in social life. To take one very well-known example, let us consider the presidential inauguration ceremony in the United States. After a period of heated disagreement, political parties and their supporters need to align behind the elected candidate. Crowds gather in the Capitol and watch together the new president become symbolically one with the state by reciting the oath of office under the nation's flag and in front of the iconic Capitol dome. The ritual enacts the power relations and organizational principles it is meant to uphold: the platform of leaders stands above the citizens, and both watch the newly empowered head of state proclaim his allegiance to the state's basic laws. The audience co-constructs and invigorates the ceremony by its sheer size and uniformity, whether expressed as silent mutual focus or through enthusiastic applause. The emotional energy generated during the ceremony revitalizes the commitment of participants to the symbols and values embedded in it. Just as personal attachments need to be affectively affirmed routinely, so does a person's sense of belonging to a larger community. It cannot be merely stated—it must be felt.
Musical Interactions
Both biobehavioral synchrony and interaction rituals are widely applicable across various types of human interactions. The alignment of attention, emotion, and the shape and rhythm of gestures and vocalizations constitutes the interaction engine that underlies all human communication (Levinson, 2006). While it involves alignment on several levels, some can be prioritized over others, in service of different mutual goals. Conversations, for example, prioritize the alignment of imagined referential meaning (Dor, 2015), and mutual focus is set primarily on encoding and decoding instructive messages, though other types of alignment are occurring simultaneously.
Because all communication relies on a basic interaction engine, music can sometimes be difficult to clearly demarcate as a pattern of interaction. Speech can often sound “musical,” with distinctive pitch movement and rhythmic organization. Conversations can demonstrate a relatively high degree of periodic and tonal alignment between participants (Hawkins, 2014; Robledo et al., 2016), and the backchannel communication that scaffolds them can be more simultaneous and rhythmic, include more participants and result in greater involvement from listeners (Bavelas et al., 2002; Wiessner, 2014). Clear distinctions, therefore, appear to be more a matter of cultural constructs than of any hard cognitive boundaries. Different cultures slice up the communicative spectrum in different ways, utilizing pragmatically and normatively the vocal and gestural channels to serve different purposes (Cross, 2015; Everett, 2012; Lewis, 2014; Seeger, 1987; Senft, 2018). Lewis (2009), for example, describes several interactive categories defined by the Mbendjele, which incorporate musical elements to different degrees: from the hushed, secretive and monotonic “ya miso minai” (speech of four eyes), through the louder, more song-like women's talk (besime ya baito), to gano, a form of storytelling which involves spoken narrative, song and rhythmic entrainment, and massana, a full-blown communal song and dance. It is reasonable to expect similar or even greater entanglements of speech, song and movement in our evolutionary past, as hominins were venturing into the more complex use of multimodal communication and honing their skills in each modality—the vocal modality, in particular (Levinson & Holler, 2014; Mithen, 2005).
That said, the property which seems to distinguish music most consistently is the degree of temporal and tonal alignment between participants (Robledo et al., 2021). The focus of attention is not on some external object or some displaced event participants are trying to imagine together—it is on the participants’ rhythmic, gestural and vocal coordination. If language can be thought of as an extension of the extrinsic component of interactions—the mutual reference to external objects—music is an extension of the intrinsic component of interactions—the embodied alignment which affirms the affiliative relationship between participants and allows for cooperative interactions (Whiteman, 2020).
Musical interactions are therefore a special case of biobehavioral synchrony or interaction ritual, in which action is entrained within a stable periodic framework, instead of being more flexibly coordinated. The shared periodic framework is constituted through the repeated articulation of pitched or unpitched rhythmic patterns (regular sequences of inter-event durations, often experienced as related to one another by simple ratios; Polak et al., 2018). From these, regular pulses are abstracted, according to which vocal utterances, percussion and movement can be coordinated (Clayton et al., 2020; Jones, 2016). The use of the vocal modality prioritizes rhythmic structure and salient pitch movement, mostly through discrete pitch changes. The latter also enables frequency and spectral alignment between multiple voices (Savage et al., 2021). Timelines, metric structures and pitch classes are all elaborations of these basic features, and extend the ability of humans to coordinate and diversify their contributions.
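The notion of rhythmic patterns as inter-event durations related by simple ratios can be illustrated with a minimal sketch. The Python example below is not drawn from the cited studies: the onset times, the function names (inter_onset_intervals, simple_ratios) and the max_denominator threshold are all hypothetical choices made for illustration. It computes the durations between successive events and expresses each as a small-integer ratio of the shortest one.

```python
from fractions import Fraction


def inter_onset_intervals(onsets):
    """Return the durations between successive event onsets (in seconds)."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def simple_ratios(intervals, max_denominator=4):
    """Express each interval as a small-integer ratio of the shortest interval.

    max_denominator is an illustrative assumption: it caps how fine-grained
    a ratio may be before it is rounded to a simpler one.
    """
    shortest = min(intervals)
    return [
        Fraction(interval / shortest).limit_denominator(max_denominator)
        for interval in intervals
    ]


# Hypothetical onset times (seconds) for a long-short-short pattern, played twice
onsets = [0.0, 0.5, 0.75, 1.0, 1.5, 1.75, 2.0]
intervals = inter_onset_intervals(onsets)
ratios = simple_ratios(intervals)
print([str(r) for r in ratios])
```

For this idealized pattern the intervals reduce to the ratios 2:1:1, repeated. Real performances deviate from such exact values, so any rounding threshold of this kind would need to be chosen with care.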
The periodicity of music is prioritized because it is the most foundational resource for creating vocal and gestural alignment between multiple participants. If we are to consider the meaning of a signal as the desired response of its receiver, then the cyclicity of music carries one of its most basic meanings: it invites participation. This is supported by studies showing that repetition of sound stimuli makes them feel more musical (Simchy-Gross & Margulis, 2018), and perhaps best illustrated by the speech-to-song illusion, in which the repetition of a speech phrase re-orients listeners away from decoding its linguistic meaning, to a focus on reproducing its rhythmic and tonal structure (Deutsch et al., 2011). Repetitive rhythmic patterns are, in that sense, like gaze following and pointing for referential communication. Both are invitations to attend and respond jointly to a shared experience. A pointing finger means: “look at this, with me”; a repetitive rhythmic sequence means: “embody this, with me.”
Of course, periodicity is not always as foundational. Musical interactions can have different levels of stratification (the extent to which some participants control and lead the ritual), from the mostly leveled polyphony of BaYaka spirit plays to the ornate recitation of scripture by a single Hazzan in synagogues (Lewis, 2013; Slobin, 2002; Turino, 2008). The more stratified a ritual is, the less emphasis is expected on the structuring of a shared periodic framework enabling wide participation. Instead, the focus will be on elaborate performance techniques, capturing the attention of participants while preventing more active involvement. From the perspective offered here, the focus on a single performer is more similar to oratory, while the shared immersion in non-periodic sound—be it vocal, orchestral or ambient—is a categorically different experience.
These distinguishing features of music raise a question: Why did entrained temporal coordination evolve when less accurate forms of alignment are apparently sufficient to create attachment bonds? Several distinctive outcomes of music's unique interactive strategy can be identified:
Musical interactions increase substantially the potential number of participants, and have been shown to be effective for social bonding in groups of over 200 people (Launay et al., 2016; Weinstein et al., 2016). Speech usually involves one speaker at a time, which limits the ability of others to fully participate, and may be abused by dominant individuals. This problem is exacerbated as group size increases, as is the problem of assessing temporal alignment between participants, upon which attachment formation depends. A stable periodic framework allows for a clear convergence between many participants, and bypasses the limitations of turn-taking by allowing everyone to participate at the same time.
Musical interactions strengthen the effects of emotional contagion, a process in which automatic bodily mimicry results in emotional convergence (Hatfield et al., 1994). Music, as Langer (1957) wrote, reflects the “morphology of feeling,” a proposition affirmed by the consistent association between musical form (e.g., tempo, pitch, timbre) and certain emotional states (Juslin, 2019; Juslin & Laukka, 2003). Enacting that form (and, to a lesser extent, listening to it) changes the participants’ emotional state in a bottom-up process, and can result in such an overwhelming state of convergence that participants feel as though they merged into a larger, collective body (Lewis, 2014).
Musical interactions can engage participants for a very long time. Among Mbendjele and Suya, for example, ritual singing can last for several hours and even days (Lewis, 2013; Seeger, 1987). The Natural History of Song ethnographic corpus, which contains coded ethnographic texts mentioning singing from 60 traditional societies, lists 168 episodes of singing which lasted for 1–10 hours, and 31 which lasted for more than 10 hours (Mehr et al., 2019). Long durations may be partially due to how music alters time perception: listening to music can give the impression that time moves faster or slower or even disappears completely (Schäfer et al., 2013).
Musical interactions can accommodate trancing, defined by Becker (2004, p. 43) as “a bodily event characterized by strong emotion, intense focus, the loss of the strong sense of self, usually enveloped by amnesia and a cessation of inner language”. By focusing primarily on bodily action in the present moment, performers feel more intensely their most basic sense of self as it exists in the here and now, and less their autobiographical self, which connects that present feeling with a personal past and future (Damasio, 1999). This profound change in conscious experience is often associated with religious practices, as it enables performers to embody or commune with spirits, partially explaining the cross-cultural use of music for communication with the supernatural (Nettl, 2015).
Musical interactions allow participants to unite on a more basic, physical level, establishing a floating intentionality that can be vital for temporarily curbing disputes and handling precarious social situations (Cross, 2009).
Musical interactions often demand higher levels of exertion, resulting in greater opioid release which enhances social bonding as well as inducing a general feeling of euphoria (Tarr et al., 2015).
All of the above make music a highly potent technology of engagement (Shilton et al., 2020), capable of producing higher levels of emotional contagion and emotional energy, and creating a stronger sense of group solidarity and a firmer devotion to shared symbols and values. Its ability to extend a shared copresent and embodied experience to tens and even hundreds of participants suggests that the invention of musical interactions was intimately connected to changes in human social organization, in particular, the creation of larger and more stable parties, and the accommodation of larger temporary aggregations.
To summarize, music can create a narrowly focused, present-oriented, shared intentional space that diffuses and defuses existing tensions (if only temporarily), removes individual boundaries and creates a larger, unified self. It bypasses the partial limitation of turn-taking in conversations by foregrounding the backchannel, allowing multiple individuals to contribute more evenly at the same time, and producing an important leveling effect. It also increases substantially the number of people who can participate in a shared intentional space. Musical interactions are inherently affiliative, emphasizing the relational dimension of communication, and focusing more on the interaction itself (an intrinsic rather than extrinsic purpose), and more on the body and the present moment. Group musical interactions are consequently an important feature of social gatherings across cultures, particularly those involving the supernatural—propitiation of spirits, initiations, healing and mourning—but also in other contexts related to group coordination, such as work, recreation, and games (Feld, 1984; Lewis, 2013; Mehr et al., 2019; Merriam, 1964; Nettl, 2015).
If we attempt to provide a general explanation of music parallel to “vision is for seeing,” but in the sense explored here—focusing on process rather than purpose—we arrive at an apparent tautology: music is for musicking. While used in a variety of contexts, music has a single interactive strategy: the tight coordination of vocal and corporal gestures in time. It shares with other forms of interaction the utility of creating social bonds and shared symbols but can be far more powerful than others in generating group solidarity and a shared emotional state because it unites the basic embodied experience of multiple participants. This bridge that music creates between embodiment and social co-production was aptly captured by Langer (1957, p. 199): “Music,” she wrote, “is our myth of the inner life.”
Culturally Driven Evolution
Music was initially conceptualized as a technology to distinguish it from a biological adaptation (Patel, 2008; Pinker, 1997). This is a false dichotomy. Like tool-making technologies, music evolved through cumulative innovation sustained by social learning, and, along with other communication technologies, influenced biological evolution (Dor, 2015; Killin, 2016; Patel, 2018; Tomlinson, 2015). This section aims to explain how evolutionary adaptations may stem from behavioral, developmental, and cultural changes, in a phenotype-first mode of evolution in which “genes are followers, not leaders” (West-Eberhard, 2003, p. 20). More specifically, it adds to the existing literature on music and niche construction (Killin, 2016; Tomlinson, 2015) the concepts of plasticity and genetic accommodation, which have yet to be integrated more fully into the emerging coevolutionary framework (though see Podlipniak, 2017).
The basic process of phenotype-first evolution has been summarized by West-Eberhard (2005). We start with a varied population of developmentally plastic organisms. Environmental changes are then met with variable developmental responses, which consist of new combinations of phenotypic traits. Given the persistence of these environmental changes, there is a consistent selection of the most adaptive responsive phenotypes. This may result in genetic accommodation, in which genetic variations that support the phenotypic adaptations are selected. Given the genetic complexity of natural populations, genetic accommodation does not necessarily require new mutations—it is often likely that standing variation will be sufficient to accommodate new phenotypic responses.
Genetic accommodation may increase or decrease the plasticity of a given trait (Dor & Jablonka, 2010; Schlichting & Wund, 2014). Genetic assimilation, which decreases the plasticity of a trait by making it less dependent on environmental inputs (a process also known as canalization), is one type of genetic accommodation and was famously demonstrated in Drosophila by Waddington (1953). Alternatively, fluctuating environments may result in selection for increased plasticity, where context-dependent responses are more suitable (Jablonka, 2017). Genetic accommodation is especially relevant when considering behavioral adaptations that are based on novel neural associations. Avital and Jablonka (2000, pp. 330–333) have suggested the “assimilate-stretch” principle, in which a behavioral sequence is canalized (becoming less dependent on learning and environmental inputs), simplifying the learning process and freeing up cognitive resources that can later be used for the sophistication of that behavior. Learning is thus guided by predispositions, while remaining open-ended.
In humans, cumulative culture adds another dimension. Accurate social learning—to which humans are predisposed—provides another channel of inheritance, which can result in environmental cues being sustained for longer periods (Jablonka & Lamb, 2005; Laland et al., 2000; Tennie et al., 2009). Selection is then guided by a culturally constructed niche, in a process that essentially turns the gene-first neo-Darwinian view on its head: culture is the fountainhead from which changes in cognition and physiology arise, changes that may eventually become genetically accommodated.
The earliest example of technology transforming human evolution is probably that of tool making and its influence on the human hand. The production of sharpened stone tools originated over 3 million years ago (Harmand et al., 2015), and became more systematic over time. As early as 2 million years ago, the human hand acquired several traits which improved the precision grip abilities underlying stone tool production and use: a bigger thumb-to-fingers ratio, increased thumb robusticity, musculature and opposition efficiency, and broad fingertips (Karakostis et al., 2021; Key et al., 2018; Richmond et al., 2016). A much more recent example of gene-culture coevolution is the relationship between dairying and lactase persistence. There is clear evidence that the cultural and behavioral adaptations related to dairying preceded the corresponding genetic changes by several thousand years (Burger et al., 2020; Gerbault et al., 2011). Studies of lactase persistence also demonstrate how a variety of changes to regulatory pathways can lead to a single phenotypic trait, with different single nucleotide polymorphisms implicated for different populations (Ingram et al., 2009; Ségurel & Bon, 2017).
Dor and Jablonka (2000, 2010, 2014) have written extensively about the relevance of this process for the evolution of language. They argue for a culturally-driven coevolutionary process, in which interactive exploration resulted in communicative innovations (themselves reliant upon individual plasticity), which were later genetically accommodated. Their crucial point is that language was born out of social processes rather than starting with changes in individual cognition: cultural invention preceded and guided biological, neurophysiological, and genetic adaptation. “First we invented language,” they write, “then language changed us” (Dor & Jablonka, 2014, p. 16).
Languages are often shaped by the communicative demands of different social environments. For example, languages that are more exoteric—with larger speaker populations, greater geographical spread, and more contact with other languages—have simpler morphologies and larger phonological inventories (Lupyan & Dale, 2010; Nettle, 2012). These characteristics appear to be shaped by the communicative pressures of exoteric groups, in which more frequent interactions between strangers and a greater proportion of second language adult learners require simpler and more systematic language structures. Following the same logic, Dor and Jablonka suggest core properties of language have emerged to meet the demands of a changing social environment of increasing cooperativity and codependence, and were constructed through the use of a more limited mimetic communication system (Donald, 1991; Dor, 2015). Music, I argue, was likewise constructed, though it is still unclear what social demands—beyond codependence—stimulated the development of its unique interactive strategy.
Behavioral innovations are reliant upon individual plasticity, which is amply provided by the exceptionally large human brain. Dor and Jablonka (2014) illustrate this with the fascinating example of human echolocators: blind people who have learned to perceive their physical environment by making clicking sounds and listening to their echoes (Thaler & Goodale, 2016). Functional magnetic resonance imaging (fMRI) studies show that in these individuals, areas of the brain normally related to visual processing are recruited for the processing of sounds, even producing retinotopic-like maps in the primary visual cortex (Norman & Thaler, 2019; Thaler et al., 2011). Plastic reorganization was also demonstrated in deaf people, with fMRI studies showing auditory cortex activation during the discrimination of temporally complex visual stimuli (Bola et al., 2017). Cross-species studies of the capacity to interact with language and music also demonstrate the importance of plasticity. Several great apes and a single exceptional parrot have learned language-like systems of communication (Patterson & Cohn, 1990; Pepperberg, 1999; Savage-Rumbaugh & Lewin, 1994). A sulphur-crested cockatoo has learned to dance (Patel et al., 2009), and a California sea lion was trained to accurately bob her head to an isochronous beat (Cook et al., 2013). The extent of neural plasticity in these and many other cases makes it all the more likely that individual adaptations to changing social environments need not rely on novel mutations.
Genetic accommodation seems to fit the evolution of musicality (the biological capacity to engage in music) because of the latter's partial modularity, early ontogeny and long evolutionary history. Beat induction appears to play no role in linguistic communication, while fine and relative pitch processing play much more subtle ones. Congenital amusia, a condition impairing fine pitch discrimination, appears in relative isolation from speech disorders, and has a much smaller effect on language processing (though more pronounced in the case of tonal languages) than on music processing (Liu et al., 2012; Peretz, 2016). Studies in newborn infants reveal the very early ontogeny of beat perception (Winkler et al., 2009) and of right hemisphere dominance (Perani et al., 2010). Moreover, music is at least as old as the earliest musical instruments (approximately 40,000 years, Buisson, 1990; Conard et al., 2009), and almost certainly much older (d’Errico et al., 2003). The longer a selective environment persists, the more likely it is that the plastic responses to it will be genetically accommodated. If 8–11,000 years of dairy farming resulted in genetic accommodation, it is highly reasonable that tens or even hundreds of thousands of years of musical interactions would result in a similar process.
The Evolution of Musical Interactions
In the following, I introduce, in chronological order, four key features of human evolution that together laid the foundations for the social construction of music as an interactive technology. Each is based on well-established paleoanthropological evidence, though arguments about their influence on more nuanced interactive dynamics are necessarily conjectural. The four features are: first, the technological niche, which improved motor and emotional control and established human reliance on tools and on the social creation of technologies; second, shared intentionality and social learning, which resulted in humans experiencing their world as a shared intentional space; third, extended kinship, which expanded the creation of attachment bonds through temporal alignment into a wide array of human relationships; and fourth, multilevel society and the creation of larger aggregations, in which entrainment could have played a significant role. Taken together, these evolutionary processes created the prerequisites of musical behavior—fine motor control, multimodal communication, shared intentionality, and bonding behavior extending toward non-kin and groups (see also Killin, 2017, 2018; Tomlinson, 2015). They also profoundly changed human social organization, extending kinship networks and creating the need for communication technologies capable not only of sharing information in a bigger network but of establishing attachment and trust within it.
Technology
The first sharpened stone tools used by hominins predate the emergence of Homo erectus by more than a million years (Harmand et al., 2015; McPherron et al., 2010). During that time, hand morphology and musculature changed substantially, with the most prominent divergent traits suggesting that they coevolved with toolmaking. A growing reliance on tools also coevolved with domain-general and domain-specific changes in brain structure. Overlap in the brain areas involved in tool use and speech suggests that improvements in manual dexterity could also have involved finer motor control in the facial and vocal modalities (Stout & Chaminade, 2012). Furthermore, even the earliest Oldowan tools (2.6 mya) were suggested to have been reproduced through behavior copying rather than individual reinvention (cf. Donald, 1991; Stout et al., 2019). All of the above point to the early start and lengthy development of the human technological niche, marked by an increasing reliance on tools for foraging, which drove the complexification of those tools by means of social learning, which then selected for improved social learning and motor and emotional control (Shilton et al., 2020; Stout & Khreisheh, 2015). It is the first instance of an important pattern in human evolutionary history: a growing reliance on a certain technology drives the improvement of that technology, which then further increases the reliance on it, triggering more improvements, and so on. This evolutionary spiral is relevant not just to stone tool technologies, but to communication and interactive technologies as well (Dor, 2015). The technological niche thus set the stage for the evolution of music by improving fine motor control and social learning skills and engendering the evolutionary spiral dynamic that would underpin the evolution of complex communication technologies.
The technological niche also means that humans have been deft percussionists and have inhabited a distinctive soundscape for millions of years. Both stone tool production and the extraction of marrow involve precise striking and are staggeringly ancient. Rhythmicity aside, it can be safely assumed that percussive activities were accurately executed and that the accompanying sounds were intimately familiar to ancient hominins. It was experimentally demonstrated that Aurignacian-type flint blades can produce distinct pitches, resulting in consistent use-wear patterns (Cross & Blake, 2008). Pitched and unpitched percussion in echo-rich environments can also produce quite impressive sound effects, with several examples of such use in the Paleolithic (Dams, 1985). While percussion is unlikely to be entrained in the context of tool making and marrow extraction, the endurance of this activity for millions of years means it was a proximate target for later explorations of entrained joint action.
Shared Intentionality
The emergence of Homo erectus marked a major advance in the social learning of skills, as indicated by the more complex Acheulian technology, the hunting of large mammals, and the rapid migration out of Africa and across Eurasia. What form did these new modes of social transmission take, and what were their affective, cognitive and communicative requirements? Explicit teaching is unlikely to have been an important factor at any time period, as it is rare even among modern foragers (Shilton, 2019). Instead, social learning seems to occur through the shared experience of foraging and tool-making activities. Novices spend time with experienced adults, observing them as they hunt, forage, make tools and locate raw materials. Through gaze following and attention-guiding gestures, they learn to attend to the same external cues, e.g., the tracks and calls of different animals, the look and texture of suitable lithic and organic materials, or the spot at which it is best to strike a stone core. Both in real time with adults, and in jest with inexperienced peers, novices practice and gradually attain the necessary perceptual-motor skills.
These activities require, at the very least, habitual group foraging, a good degree of social motivation, a theory of mind, and a communicative repertoire enabling shared intentionality. Mimetic communication, a multimodal and representational toolkit that includes bodily and manual gestures, facial expressions, vocalizations and mimicry, was critical to accommodate these new forms of social learning (Donald, 1991; Shilton, 2019). Functionally limited to the here and now, mimesis allowed the diverse types of cooperation which became essential to human subsistence. Most importantly, the regular practice of social learning steadily transformed the environment into an intersubjective space, in which one is increasingly aware of the attention and intent of others, and increasingly motivated to seek out this information.
Large mammal hunting, the extraction of lithic raw materials, and the reduction of large cores for the production of bifacial stone tools were all likely practiced predominantly by males, involving substantial risks and often requiring considerable physical strength. Hunting, in particular, seems to require active cooperation, as in the absence of deadly projectile weapons the capturing and killing of large, prime-aged mammals is unlikely to have been achieved alone (Bunn & Gurtov, 2014). According to Sterelny (2020), these activities indicate the presence of male coalitions, and make it likely that early Homo had Pan-like residential patterns, with subadult female dispersal. Sterelny further suggests that this residential pattern explains the apparent stasis in lithic technology observed between approximately 1.7 and 0.9 Ma. As long as males remained in their natal groups, innovations in Acheulian bifacial technology could not disperse beyond them. Whether or not the stasis was related to male vs. female dispersal, several studies have shown that fluid dispersal (both males and females leaving their natal residence) and extensive kin recognition, common among modern foragers, are essential for cumulative culture (Dyble, 2018; Migliano et al., 2017, 2020). The shift to fluid residence is related to another major factor in human evolution: alloparenting, and the dramatic expansion of kinship relationships.
Extended Kinship
Sometime during erectine evolution, infant care became—like foraging—a cooperative effort (Hrdy, 2009). Extended altricial periods, most likely related to the brain size increase, meant that mothers needed plenty of assistance. Matrilocal residence and female coalitions consequently became more common. Stable pair bonds—perhaps a correlate of male coalitions and the leveling effect of deadly weapons—not only added another provisioning parent but also substantially increased the size of kin networks (Chapais, 2017). Female coalitions would also have resulted in decreased intragroup and intergroup violence—as suggested by the different rates of violence in chimpanzees and bonobos—extending the kin networks even further (Chapais, 2017; Furuichi, 2011; Stanford, 2018). Human sociality went through a profound change, and the added dimension of kinship had its impact on the evolution of interactions.
As previously mentioned, behavioral synchrony—the matching of gaze, movement and vocalization—was essential for the flexible creation of attachment bonds between infants and multiple caregivers. As infant care became a more distributed task, so did the proficiency in this form of interaction, which foregrounds the alignment between attachment partners. Both male and female infants would engage in behavioral synchrony for long developmental periods, making it quite plausible that behavioral synchrony would radiate from early caregiving interactions to pair and peer interactions in later life. It is also important to remember the fluid nature of copresence in caregiving situations, which were not strictly dyadic. Inexperienced caregivers would be present and watching more experienced ones, and infants would occasionally move from the arms of one to another. This fluidity not only enabled the social learning of caregiving skills but also provided a possible arena for entrainment between multiple participants, thus expanding attachment formation from the dyad to the group.
Behavioral synchrony thus became an important component of the human communicative toolkit. Humans were not just guiding each other's attention but were creating affiliative relationships through attentional, gestural and emotional alignment. Adding an affective component to human relationships increased social motivation and, consequently, the amount of active time spent in social interactions—certainly beyond the 10%–20% of active time characteristic of primates (Fuentes, 2021).
Biobehavioral synchrony allowed for kinship to become more flexible—to be socially constructed rather than biologically mandated. By frequently interacting in this manner, unrelated individuals could feel like kin, and a variety of relationships could tap into the neuroendocrine core of the mother–infant attachment bond. In modern humans kinship is extended to non-biological kin, as well as other-than-human persons like animals, forests and spirits—making it one of the key relational principles through which humans perceive their place in the world (Bird-David, 1999), and enabling a significant increase in social complexity.
Multilevel Society
Extended kinship affected all levels of hominin social organization: mother-infant dyads expanded to include more caregivers; temporary small parties grew into larger, more stable bands; and dispersed, fission-fusion communities turned into federated societies (Grueter et al., 2012; Layton et al., 2012). All levels were scaled up, and a new midlevel social entity emerged: the band. In modern foragers the band normally comprises about 30 individuals (though numbers can vary substantially) who are organized as families, engage in cooperative foraging and childcare, and regularly assemble each night at a camp site (Layton et al., 2012). Archeological evidence suggests the residential camp emerged about 400,000 years ago, and perhaps even earlier (Goren-Inbar et al., 2018; Kuhn & Stiner, 2019). This new form of sustained copresence was the primordial soup in which new forms of interaction and joint action were experimented with in multi-participant settings. Various situations could call for coordinated group action, including predator deterrence, coalition displays and play. Predator deterrence through group music making has been documented among several hunter-gatherer groups (Knight & Lewis, 2017). The joint production of loud, heterogeneous sounds seems particularly crucial during dark hours, when humans are most vulnerable to large nocturnal carnivores (and have been for millions of years; Knight & Lewis, 2017; Packer et al., 2011). The compelling nature of music, as well as its influence on participants’ sense of time, makes it a highly viable strategy for extending the duration and intensity of group-coordinated predator deterrence. Another platform for joint action was coalition displays, whether by males at territorial boundaries or by females against dominant males (Power, 2014).
The regular use of fire extended the possibilities of nighttime interactions, which in modern foragers differ substantially from those engaged during the day (Shimelmitz et al., 2014; Wiessner, 2014). While daytime interactions are more economically focused and involve the separation of the band into several parties, the night unites everyone in a single location, with little productive work to be done, and in a somewhat enchanted atmosphere. The shift from extrinsic to intrinsic purpose is therefore quite natural, as interactions are aimed less at achieving a clear target and more at strengthening the connection between group members.
Regular aggregations are another feature of multilevel societies likely to have been of importance to the emergence of musical interactions. Common to many modern foragers, and often corresponding to seasonal variation in resource availability, regular aggregations can involve hundreds of people and have been described as times of intense sociality (Lee, 1972; Mauss & Beuchat, 1979; Shott, 2004). The distinct features of music seem uniquely poised to resolve some of the challenges of aggregated residence. Interpersonal entrainment can join tens and even hundreds of participants in a single copresent interaction, engendering a sense of belonging and general euphoria. These positive emotional effects, along with music's floating intentionality, can help mitigate the greater potential for disputes in large groups (Lee, 1972). Looking at the different levels of social organization—from residential units to temporary, large-scale aggregations—it seems reasonable that the greater the number of participants (and the less familiar participants are with one another), the more formalized group interactions need to be. Thus, band gatherings and seasonal aggregations seem to be the most crucial platforms for the ritualization of entrainment-based interactions.
Conclusion and Future Directions
The framework presented here considers musical interactions to be a type of biobehavioral synchrony on the psychological level, and interaction ritual on the sociological level. Music is considered as part of a larger communicative toolkit and is most consistently characterized by entrained coordination of movement, percussion and vocalization, based on a shared periodic framework. It emerged after the unique features of human sociality had been established, within the context of shared intentionality, extended kinship and an interaction engine based on mimesis.
This framework coheres with some of the fundamental perspectives offered by other theories of music evolution. It is most in line with the social bonding hypothesis (Savage et al., 2021), and helps clarify why social bonding seems to be more relevant to music than to language, despite the fact that both serve that purpose. Since music focuses more on the interactive event itself and less on external objects (present or imagined), it more obviously, and often more powerfully, contributes to social bonding. Biobehavioral synchrony further clarifies the strong relationship of music to infant care and mate bonding, while the emphasis on joint action through entrainment fits the focus on coalitions (Hagen & Bryant, 2003). The concept of costly signaling is relevant primarily to the reliable indication of attachment and commitment to a single partner or to a group through loose temporal alignment or entrainment. The complexity and cultural specificity of ritual behaviors—including musical performance—indicate that they are primarily in-group-directed displays, as only in-group members can judge whether they are performed correctly (cf. Mehr et al., 2021). Finally, the understanding of music as a more emotionally effective copresent interaction explains why the byproduct hypothesis should seem particularly appealing in societies in which such interactions are increasingly unimportant for the maintenance of alliances and institutions, and in which music is experienced mostly outside the context of social gatherings.
The conceptualization of music as a form of biobehavioral synchrony has two important implications. First, it resolves the apparent circularity problem of the social bonding hypothesis raised by Pinker (1997) and rearticulated by Mehr et al. (2021). Musical interactions rely and elaborate on the baseline bonding effects of biobehavioral synchrony. They, therefore, had bonding as well as rewarding effects from the moment humans started exploring them as an interactive possibility.
Second, it potentially offers a different interpretation of studies of reward system activation in response to music listening (Salimpoor et al., 2011; Salimpoor & Zatorre, 2013). Results from such studies are often interpreted within the framework of musical expectation theory (Huron, 2006; Meyer, 1956), which focuses on music's ability to generate salient expectations, combined with the framework of prediction error theory, which predicts positive reward to be generated when outcomes are better than expected, though what that means for music is unclear (Cheung et al., 2019; Salimpoor et al., 2015; Schultz, 2017). However, this interpretation fails to convincingly account for the clear preference for familiar music among most listeners (Madison & Schiölde, 2017; Pereira et al., 2011), since fully predicted outcomes are not meant to trigger a reward. While the expectations generated by music are quite certain to play a critical role, it is possible that working within the framework of attachment theory—which involves dopamine, oxytocin, opioid and endocannabinoid activity in the same brain areas—will provide more fruitful interpretations than theories of musical affect relying solely on prediction error.
The emphasis given here to social co-production and plasticity suggests that cross-species experiments should not just test baseline abilities, but also explore how social motivation and long-term learning can influence the extent to which different species can participate in musical interactions. If music was social from its inception, it makes sense that it should be studied as a plastic response to a form of social interaction. This is very likely what happened in the case of “Snowball,” the dancing cockatoo (Patel et al., 2009), and could probably be reproduced with a paradigm similar to that used by Pepperberg (1999). This experimental approach has the potential to produce insights on baseline abilities of different species, developmentally sensitive periods, and correspondent changes in behavior, physiology and neuroanatomy.
Finally, this framework suggests a positive correlation between temporal alignment and affiliative interaction will be observed across various forms of communication and diverse human cultures. Similar to Savage et al. (2021), it suggests that studies of musical practice across cultures will find a greater prevalence overall of group, participatory music making over solo performances. Levels of participation should correlate with emphasis on rhythm, and vary based on ritual stratification, which itself may depend on social organization. The hypothesized relationship between music and midlevel social structures also suggests musical activity should be more common during the night, and strongly associated with regular aggregations. Altogether, these diverse forms of evidence are expected to deepen our understanding of music as an interactive technology.
Acknowledgements
I am grateful to Ian Cross, Eva Jablonka, Daniel Dor, Anton Killin, Aniruddh Patel, David Huron, Nikki Moran and Jin Hyun Kim for their helpful suggestions and comments on previous versions of this manuscript.
Action Editor
Elizabeth Tolbert, Johns Hopkins University, The Peabody Institute.
Peer Review
Nikki Moran, University of Edinburgh, Reid School of Music. Jin Hyun Kim, Humboldt-Universität zu Berlin, Institute for Musicology and Media Science.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
