Abstract
This article reflects on the “flat” history of timbre space, tracking its emergence as a technical inscription in psychoacoustic experiments and its rise to become a dominant conceptual metaphor in timbre studies. Drawing on Bruno Latour's notion of “immutable mobiles,” the author shows how the idea of a multidimensional timbre space has been propagated through the circulation of diagrams, which make perceptual data on listeners accessible to remote viewers. After surveying laboratory tools and techniques required for the production of these diagrams, the article considers how models of timbre space have been built into new technologies for music composition, performance, and listening, as well as into audio classification schemes and metadata formatting standards like MPEG-7. Mapping connections between psychoacoustic discourses and design practices, the article sheds light on the technoscientific origins of timbre space, examining its articulation to research labs at Bell, CCRMA, and IRCAM, and interrogating its role in determining what counts as sound knowledge.
A Deflationary Strategy for Historicizing Timbre Space
When people talk about music in everyday life, they use metaphors. Pitches go up and down, rhythms are straight or swung, polyphonic textures grow thick and thin. By relating sonic experiences to other modalities, like space, motion, and touch, these metaphors allow listeners to describe music in terms that are meaningful, even if the descriptions don’t necessarily match physical reality (Lakoff & Johnson, 1980; Larson, 2012; Zbikowski, 2008). The situation is no different in the science of sound, where key conceptual metaphors help to crystallize theories of auditory perception and cognition, making them easier to see, grasp, and use in support of a wide range of applications.
One metaphor that has been central to psychoacoustics since the seventies is timbre space (Grey, 1975, 1977; Plomp, 1970, 1975; Wessel, 1973, 1979). Unlike other spatial metaphors in music, timbre space is said to be multidimensional, and it is visualized with the aid of computers as a graph that can be used to infer perceptual distances between instruments. A classic example from John Grey's 1975 dissertation is shown in Figure 1. Each axis here was interpreted by Grey in physical terms through identification of the most likely acoustic correlates; going further, one may also venture an interpretation of the psychological meaning of each axis, invoking a 3-D network of semantic scales and cross-modal associations. For instance, spectral energy distribution on the I-axis has often been described in terms of relative brightness, while onset synchronicity/flux on the II-axis and inharmonicity of the attack on the III-axis are more difficult to disentangle, but might elicit general descriptions of roughness, sharpness, or richness. Timbre space thus acts as a kind of meta-metaphor, remediating other musical metaphors within its multidimensional frame.

Diagram of “timbre space” from Grey 1975, p. 62. The abbreviations used for instruments are as follows: O1 and O2 = oboes; C1 and C2 = clarinets; X1, X2, and X3 = saxophones; EH = English horn; FH = French horn; S1, S2, and S3 = strings; TP = trumpet; TM = trombone; FL = flute; and BN = bassoon.
Further unpacking the contents of the diagram reveals another dimension of the timbre space metaphor: its substitution of instruments for listeners. That is, while the diagram seems to show distances between instruments, what it actually shows is the statistical clustering of listener responses to timbral dissimilarity tests. This translation of listening subjects into instrumental objects produces a probabilistic impression of objectivity, eliding differences between individuals into averages and norms of audition. Through its visualization of these norms, the timbre space diagram tends to obscure the fact that it represents a particular group of listeners responding to a specific set of sounds under certain environmental conditions. Instead, it offers an external perspective, repackaging listeners in a portable and miniature format that makes them legible to remote viewers. These outside viewers, in turn, are interpellated as cyborgian figures, registering timbre similarities through a reduction of instrumental identities to quantifiable sonic parameters. Timbre space can be interpreted, in this way, as an index of an information-processing model of audition that is characteristic of mid-century research in psychoacoustics and the cognitive sciences.
In this article, I will frame timbre space as an instance of what the late sociologist Bruno Latour called immutable mobiles, by which he meant technical inscriptions that communicate data via “written, numbered, or optically consistent surfaces,” thereby “mobilizing resources without transforming them” (Latour, 1986, pp. 21 and 23). Based on work at experimental laboratories and other “centers of calculation” (Latour, 1986, p. 29), these inscriptions distill collected data into formats that are flat, scalable, reproducible, recombinable, and superimposable, making them ideal for the distribution of scientific research (Latour, 1986, pp. 19–20). But for Latour, they are more than just passive charts, graphs, and diagrams; they are active mediators in a consensus-building project, allowing researchers to mobilize a set of facts for larger purposes as they move from empirical data collection to theoretical science. Immutable mobiles are, for this reason, a special kind of visual inscription in science—one that shuttles between local phenomena and global networks, swapping facts and figures for the real-world events they represent, and facilitating a highly specialized strain of scientific discourse.
Applied to timbre space, the concept of immutable mobiles allows us to better understand how diagrams like the one above—which, in Latourian terms, provides “optical consistency” by offering the same window for gazing at all listeners—have proliferated in psychoacoustic research over the course of several decades. While Grey's early research at Stanford University's Center for Computer Research in Music and Acoustics (CCRMA) helped popularize the idea of timbre space, as well as propagate its visual presentation in graphic form, timbre perception experiments have since become commonplace, with many experiments involving similar methods and producing comparable diagrams (e.g., Grey & Gordon, 1978; Iverson & Krumhansl, 1993; Lakatos, 2000; McAdams et al., 1995; Wessel et al., 1987; a meta-analysis of MDS experiments can be found in Thoret et al., 2021). Over time, repeated uses of the same visual trope have coalesced into what Latour would describe as a “cascade of mobile inscriptions” (Latour, 1986, p. 29), which reinforces the conceptual stability of timbre space and helps secure research funding, leading to the production of further studies in what eventually becomes a self-sufficient cycle. While timbre studies remains a polyglot field, it's clear the timbre space paradigm has solidified as part of a vital core of psychoacoustic discourse. This success has also spilled over into adjacent areas of research, making timbre space appear as a topic of interest in fields like composition, computer science, and cognitive neuroscience, and facilitating its integration in commercial applications that range from automatic speech recognition to music recommendation systems. Indeed, as I will suggest here and have argued elsewhere (Morrison, 2022a, 2022b), the rise of so-called “machine listening” today can be traced in part to the encapsulation by timbre space of a psychometric framework for the multidimensional analysis of perceived dissimilarities between sounds.
The broad historical sweep outlined here—where flat diagrams open onto a deep network of scientific and sonic practices—provides a general context for my focus on the role of timbre space in the dissemination of psychoacoustic theories, as well as their technological extension into creative applications and everyday listening situations. My interest is not so much in the scientific accuracy of these theories as compared to others, but rather in the way they travel via inscription, and in the craftsmanship involved in the production of these signifying surfaces. In this respect, I adopt what Latour calls a “strategy of deflation,” focusing on how “grandiose schemes and conceptual dichotomies” are translated into “paper, signs, prints, and diagrams” (Latour, 1986, p. 3). The goal is to sketch a “flat” history of timbre space as a multidimensional metaphor in music and science, and in doing so, to draw out the social and material dynamics that inflect the formation of contemporary sonic technocultures. For this, Latour's early work on immutable mobiles is especially relevant, as are adjacent lines of inquiry taken up by others in science studies, such as Lorraine Daston and Peter Galison (2007), who have documented historical shifts in the representation of objectivity through a detailed analysis of images in scientific atlases dating back to the Enlightenment. 
Closer to the domain of sound, one thinks of Sybille Krämer's (2017, 2023) media-theoretical interpretations of “flattening as cultural technique” in the production of musical scores and analytical diagrams, as well as Nick Seaver's (2021) work on music recommendation systems and the use of “spatializing techniques for analysing cultural data.” Common to these approaches is a concern with the epistemic import of inscribed surfaces and spatial modes of representation, which don’t passively depict reality, but rather actively mediate people's access to it, setting conditions on what counts as scientific (and musical) knowledge. Given this influence, as well as the outsized role played by a handful of research institutions (i.e., Latour's “centers of calculation”), it is also helpful to historicize timbre space by drawing on more overtly political and philosophical perspectives emanating from the broader “spatial turn” in the humanities (Warf & Arias, 2008). In this regard, a critical contribution is found in Henri Lefebvre's Marxist theory of the “production of space,” in which he examines the conditions underlying different types of space (e.g., physical, social, mathematical, mental), noting the way abstract spaces spawned by science tend to become the “locus of a ‘theoretical practice’ which is separated from social practice and which sets itself up as the axis, pivot or central reference point of knowledge” (Lefebvre, 1991, p. 6). When this happens, there is a tendency to start thinking of psychoacoustic concepts like timbre space in naturalized terms, as though they were universal and somehow outside the frames of culture and history. But this would be mistaking the map for the territory and forgetting the messy processes of social and technological mediation that underlie the production of timbre space.
At this point, it is important to note that there were others prior to Grey who attempted to represent timbre spatially, and likewise others who proffered a definition of timbre in terms of multidimensionality. In 1951, Joseph Licklider famously declared timbre to be a “multi-dimensional dimension” (see also Handschin, 1948), and long before that, Carl Stumpf (1890) laid out a multidimensional view of timbre in his Tonpsychologie. Similarly, visualizations of a multidimensional timbre space avant la lettre appear as early as Albersheim (1939), where a vowel study is presented using sound color triangles and cylinders to show perceptual distances, and more generally, spatial representations of perceptual data can be traced as far back as Schrödinger, Helmholtz, and even Newton (Shepard, 1980). But the translation of perceptions to distances in these cases was not supported by empirical studies or computer-based methods in the same way as later timbre perception experiments. This is where Grey's study stands out, as it was among the first to articulate a multidimensional concept of timbre space to a set of digital synthesis technologies, experimental research practices, statistical modeling techniques, and data visualization methods, marking a historical conjuncture that remains active today.
It is also necessary, at this stage, to dispel any notions that timbre space has any kind of monopoly on a definition of timbre, or that its success as a model has settled once and for all which aspects of sound contribute to timbre perception. There remains little consensus around the question of timbre, as is clear from the persistence of alternative models (e.g., formant analysis), as well as from the (in)famously nondescript definition issued by the American National Standards Institute, which asserts that timbre is the “attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar” (ANSI, 1960). To this negative definition, which succeeds only in noting what timbre is not (i.e., loudness or pitch), ANSI adds that “timbre depends primarily upon the spectrum of the stimulus, but it also depends upon the waveform, the sound pressure, the frequency location of the spectrum, and the temporal characteristics of the stimuli” (ANSI, 1960). Although phrased in more positive terms, this description is still not very helpful in narrowing down the relevant parameters of timbre perception, as it touches on several properties of sound. Reacting to this conundrum, the psychoacoustician Albert Bregman once characterized timbre as an “ill-defined auditory wastebasket category” (Bregman, 1990, pp. 92–93), voicing what has become a common refrain among subsequent researchers who have searched for a better definition of timbre, both perceptually and in terms of its acoustic correlates (Siedenburg & McAdams, 2017). Likewise, music researchers and practitioners have noted timbre's poorly defined and neglected status, arguing that it deserves more attention as an alternative to pitch-centric frameworks for music analysis and composition (Boulez, 1987; Erickson, 1975; Hasegawa, 2018; McAdams, 1999; Lavengood, 2020).
But there have also been outliers, like Michel Chion (2011; based on a French article, Chion, 1986), who has provocatively argued that timbre doesn’t really exist in any absolute sense and is no longer a relevant concept in the context of contemporary music practices, which have largely upended traditional relationships between sounds, acoustic instruments, and playing techniques. Adopting the view that timbre is “nothing other than the general physiognomy that allows us to identify a sound as emanating from a specific instrument” (Chion, 2011, p. 237)—and in light of the fact that styles such as musique concrète and its acousmatic offshoots have destabilized this physiognomy—Chion contends that the term has lapsed into a shorthand for the quality of sound itself, standing for everything and nothing as it becomes a problematic placeholder, glossing what are actually diverse materialities and perceptual phenomena (along similar lines, see Smalley, 1994).
Does timbre exist as a psycho-physical fact or is it a techno-social construction? And in either case, what are the stakes of using statistical and spatial methods to define and represent it? These are complicated questions, and again, my objective is not to parse interpretations according to their correctness, or to embrace one representation of timbre as somehow closer to truth or reality. Rather, against the backdrop of contested methods and meanings, I want to show how the idea of timbre space has a history, how it emerged from a specific experimental apparatus in research laboratories, and how it spread through a proliferation of images, which circulated as a kind of currency in psychoacoustic literature, gradually coalescing over decades into an entire subfield of research. I also want to show how, in an extension of its capacity as an immutable mobile, timbre space has been encoded in formatting standards and has been operationalized in digital audio technologies across a wide range of applications. My analysis thus proceeds by tracing the production of timbre space and tracking its decades-long shift from a descriptive diagram to a prescriptive framework for analyzing and indexing sounds in a global information infrastructure.
Restoring the multiple mediations of timbre space requires pulling that which is typically relegated to the background as a “container technology” (Sofia, 2000) into the foreground as part of the actual contents for analysis. It requires performing an “infrastructural inversion,” shifting figure-ground relations to resist the “tendency of infrastructure to disappear (except when breaking down)” (Bowker & Star, 1999, p. 34). To this end, the rest of the article is organized in two main sections (or, in keeping with the prevailing metaphor, spaces): the first covers scientific experiments in the lab, showing how timbre space gets built out of specific technological and cultural conditions, while the second follows designers in the workshop, showing how timbre space subsequently gets built into new musical and scientific instruments, setting conditions on the future development of applications for music composition, performance, and listening. The overall argument is intended as a friendly provocation, and as a way of generating dialogue between the scientific end of the spectrum in timbre studies and recent scholarship that considers timbre (and sound more broadly) through the lens of critical, cross-disciplinary perspectives in new organology (Loughridge, 2018; Tresch & Dolan, 2013; Dolan, 2013), media history (Mills, 2012; Sterne, 2003, 2012), philosophy (Elferen, 2020; Vélez, 2018; Isaac, 2018), and cultural studies (Eidsheim, 2019; Fink et al., 2018). These latter approaches gesture toward a relational understanding of timbre's entanglements with instruments, bodies, languages, and affects, emphasizing how there are many ways of measuring and representing timbre, and showing that how one chooses to do so influences the way sounds are experienced.
By bringing this relational framework into contact with the discourse of timbre space, this article aims to tell the story of how a flat diagram that is commonplace today was at one time new, developing out of a series of historical contingencies to take hold in the collective imagination of scientists, technologists, and musicians alike.
Out of the Lab: Psychoacoustics and the Experimental Apparatus
Imagine you are sitting in a 12 × 12 ft acoustic isolation chamber listening to short recordings of instrumental tones, played in a series of pairs over the course of an hour. For each pair, you’re asked to judge the dissimilarity of the tones relative to all other pairs heard, rating them on a scale of 1 to 30. At times, you feel tired and your mind starts to wander, but you persevere in the name of science. Your judgements in this moment will be collected and forever preserved as “raw” perceptual data, merging with those of other listeners in a matrix of correlations. From there, scientists will use data visualization techniques to build a representation of your responses, rendered as psychological distances between instruments in timbre space. By the end of the process, only a quantitative copy of your qualitative experience will remain fixed in the form of a timbre space diagram, an immutable mobile.
This section revisits the lab space to consider how subjective experiences get translated into a probabilistic form of objectivity in timbre experiments. The focus is on the craft of making timbre space diagrams, on the people, places, and things that must gather to produce the auditory stimulus, to measure and collect listener data, to graph the data in geometric space, and to interpret the space in relation to acoustic features observed in the original sounds. Despite the apparent tidiness of the final diagram, with its clean lines and precision coordinates, the actual production of timbre space is a messy process, accompanied at several stages by technological and social mediations that are too often neglected. By looking at the entire experimental apparatus, it is possible to restore the role of these mediators and situate timbre space as a technoscientific artifact belonging to a deep historical network of actors.
Of Test Tones and Expert Listeners
One way into this analysis is to consider the staging of the original experiments, and in particular, the kinds of technology required to produce auditory stimuli, which in Grey's case were not actually recordings of musical instruments, but rather synthesized sound-alikes. The process is spelled out in considerable detail in his dissertation, where we learn that 16 orchestral instruments were recorded playing the pitch E-flat above middle C (311 Hz) for durations of 280–400 ms (Grey, 1975, p. 26). These analogue recordings were made in a “dry” studio with a Revox tape recorder using Scotch ¼-inch low-noise tape, and they were later digitized at a 25.6 kHz sample rate using an Analogic 14-bit analogue-to-digital converter. Heterodyne filter analysis was used to produce a readout of each instrumental timbre, with the aim of manipulating “the very complex physical factors of tones such that they might be systematically simplified and related to perception” (Grey, 1975, p. 17). This reduced template was then used to digitally resynthesize a copy that was, ideally, “indistinguishable from the original tone” (Grey, 1975, p. 16), and finally, the results were converted back to analogue tape, where they were equalized for pitch and loudness in preparation for use in timbre experiments. For those interested, there are yet further technical details, including distances for microphone placement, types of loudspeakers used for playback, and an intriguing bit on the addition of tape noise to the synthesized sounds to make them seem more like the originals. There are also notes on the selection of 16 “musically sophisticated” listeners at Stanford University, some of whom were “actively involved in advanced instrumental performance and others in conducting or musical composition… [and others] with the production of computer music” (Grey, 1975, p. 30).
All told, Grey gives a thorough account of a highly mediated process, offering a significant degree of insight into the social and material conditions under which the experiments were conducted.
Still today, many timbre space experiments continue to use digitally synthesized versions of orchestral instruments as auditory stimuli; some (e.g., Wessel et al., 1987) have even adopted cross-synthesis methods to test listener perceptions of hybrid timbres like the “guitarnet” (i.e., combination of guitar and clarinet) and to explore the possibility of hearing midpoints between two instrumental timbres. The assumed benefit of strictly controlled analysis and synthesis operations is that researchers will be able to fully define the sonic parameters and thus isolate the effects of timbre, disambiguating it from other parameters so that a comparison might be made between different instruments. But there are practical limitations to this experimental design, as is always the case, not only in timbre studies, but in empirical research in psychoacoustics and psychology more broadly. In particular, the method has had the side effect of reifying sounds into simplified test tones that, despite best efforts, bear little resemblance to the complexities of sound encountered by listeners in the real world. As a result, even those who were early proponents of this method, like Reinier Plomp, have gone on to criticize it for focusing too narrowly on the perception of isolated tones in “clean” laboratory spaces that are cut off from the “dirty” conditions of everyday listening (Plomp, 2002). Perhaps in response to these critiques, some studies have shifted from synthesized stimuli to actual recordings, such as Iverson and Krumhansl (1993), which used short digital samples of 16 orchestral instruments, or Lakatos (2000), which juxtaposed sets of recordings of pitched versus percussive instruments.
In addition, some researchers have established large audio databases, such as the McGill University Master Samples (Opalko & Wapnick, 1987), or more recently online resources like the Vienna Symphonic Library and the RWC (“Real World Computing”) database, which offer access to a searchable library of equalized recordings. This effectively removes the need to produce new recordings for each test and gives everyone, everywhere access to the same sounds for shared use across multiple studies. But even with studies that use recordings, it is common to reduce an instrument's identity to a single pitch—a reduction performed out of necessity to maintain a manageable number of stimuli, but one which ignores that instruments can produce wildly varying timbres and articulations through the use of different playing techniques, making the representation of an instrument as a point in space inadequate. By the same token, most studies limit their sample size because each pair of sounds needs to be compared, and the inclusion of more samples causes the amount of data to increase quadratically, making experiments impractical. Instead, smaller studies are conducted, with results generalized and scaled, or else aggregated, with multiple listeners grouped together as a kind of “super listener” (Sterne, 2020, p. 172). But these composite listeners mobilize the same set of assumptions and practices that one finds in the underlying experiments from which they are built, including not only the use of simplified synthesis recipes to simulate timbral identities, but the use of data sets that are mostly populated with orchestral instruments and musically trained listeners from universities in the Global North, raising questions about the wider relevance of such studies to situations outside of Western classical music traditions.
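The quadratic growth of pairwise comparisons, and the “super listener” aggregation that responds to it, are easy to make concrete. The following is a minimal sketch in Python; the ratings matrices are randomly generated toy numbers, not perceptual data.

```python
import numpy as np

def n_pairs(n):
    """Number of distinct pairwise comparisons for n stimuli."""
    return n * (n - 1) // 2

# Grey's 16 stimuli already yield 120 distinct pairs, and the count
# grows quadratically with the size of the set:
for n in (16, 30, 100):
    print(n, n_pairs(n))  # 16 -> 120, 30 -> 435, 100 -> 4950

# Aggregating listeners into a "super listener" amounts to averaging
# their individual dissimilarity matrices (random toy ratings on a
# 1-30 scale; three listeners, 16 stimuli):
rng = np.random.default_rng(0)
ratings = rng.integers(1, 31, size=(3, 16, 16)).astype(float)
ratings = (ratings + ratings.transpose(0, 2, 1)) / 2  # enforce symmetry
ratings[:, np.arange(16), np.arange(16)] = 0          # zero self-dissimilarity
super_listener = ratings.mean(axis=0)                 # one averaged matrix
print(super_listener.shape)  # (16, 16)
```

The averaged matrix is what typically enters the scaling step, which is precisely how individual differences get elided into the norms of audition discussed above.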
Multidimensional Scaling Tools and Techniques
Beyond the experimental phase, we may also consider how perceptual data collected from listeners is analyzed and statistically rendered in the visual topology of a timbre space diagram. For this, we need to turn our attention to multidimensional scaling (MDS) techniques, without which, it's fair to say, there would be no “timbre space” as such. As detailed in Groenen and Borg (2013), the seeds of MDS can be found in fields like geography going back to the 17th century, as evidenced by an early map of Durham County, England, that incorporates a numerical matrix of the distances between its towns (Jacob van Langren, ca. 1635). But if “scaling” originated as a cartographic problem, it would later be transposed to the problem of mapping “psychological space,” and it is this more recent articulation of MDS techniques to computer-based programming and psychometric research that concerns us here.
An early touchstone in this regard was a 1952 article by Warren Torgerson in Psychometrika outlining a metric MDS approach, in which comparative distances between pairs of stimuli were obtained, translated into absolute distances, and then situated in a multidimensional representation of psychological space (n.b., Torgerson's (1952) work built on the “multidimensional psychophysics” of Richardson, 1938). These classical models were thought to be rather limited, however, because perceptual differences could only be translated to geometric distances in a linear and uniform manner. In response, researchers like Roger Shepard and Joseph Kruskal, working together at Bell Laboratories in the early sixties, developed non-metric MDS models to handle non-linearities in the perception of “proximities” (Shepard, 1962a, 1962b). In short, whereas metric MDS preserved the ratios of the distance estimations, non-metric MDS only preserved the order of the distances (with respect to a “greater than” relationship). Such a method could be used to search for the smallest number of dimensions for representing data in a way that best explains variance (i.e., the principle of parsimony), shifting incrementally over multiple iterations towards a state of statistical stability that “allows the data to speak for themselves” (Kruskal, 1964, p. 9). An image of the resulting proximities could then be produced, offering “a convenient, objective, and uniform way of representing the essential pattern underlying experimental results,” and further ensuring that, no matter what kind of data was handled, “virtually the same, readily appreciated spatial picture may be obtainable” (Shepard, 1974, p. 375). Here, in a nutshell, we find MDS conceived in terms quite like Latour's concept of immutable mobiles, with a strong emphasis on the diagram's utility in communicating and comparing research results through their uniform representation in geometric space.
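The metric procedure admits a compact sketch: square the distances, double-center the matrix, and read coordinates off an eigendecomposition. Below is a minimal NumPy illustration of this classical (Torgerson-style) approach, run on a hypothetical four-point configuration rather than listener data:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) MDS: embed a distance matrix D in k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered "Gram" matrix
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]    # keep the k largest
    scale = np.sqrt(np.clip(vals[order], 0, None))
    return vecs[:, order] * scale         # one row of coordinates per stimulus

# Hypothetical four stimuli laid out on a unit square (not perceptual data)
pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, k=2)
D_hat = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(np.allclose(D, D_hat))  # True: Euclidean distances recovered exactly
```

Because the toy distances really are Euclidean, the embedding recovers them exactly (up to rotation and reflection); perceptual dissimilarities rarely behave so cleanly, which is part of what motivated the non-metric models.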
Shepard and Kruskal also wrote a computer program, MDSCAL, which operationalized non-metric scaling techniques and automatically fitted dissimilarity data into multidimensional representations. The program was initially presented as a “tool for reductively analyzing several types of psychological data” (Shepard, 1962a, p. 125), and early on it was used in experiments on the perception of differences in colors (Shepard, 1962a; based on data from Ekman, 1954), morse code signals (Shepard, 1963; based on data from Rothkopf, 1957), and consonant phonemes (Shepard, 1972; based on data from Miller & Nicely, 1955). But from the beginning, Shepard saw the purpose of MDS extending to “concepts, attitudes, personality structures, or even social institutions, political systems, and the like” (Shepard, 1962a, p. 125), and as he would later reflect, the entire enterprise was grounded in a belief that “not only physics but also psychology can aspire to laws that ultimately reflect mathematical constraints, such as those of group theory and symmetry, and, so, are both universal and nonarbitrary” (Shepard, 2004, p. 1). Indeed, the program would eventually be applied to general data analysis purposes, such as market research on product perception or modeling voter preferences among different social groups, and MDS techniques would later be adopted as a quantitative analytical tool in humanistic fields of study like anthropology (Burton, 1970; Seaver, 2021). Through its application across these different domains, MDS contributed to the idea that all people perceive phenomena along the same lines of interaction, allowing them to become interchangeable points of data in statistical space.
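As a rough modern analogue of such programs (not the historical MDSCAL code), scikit-learn's MDS estimator can run a Kruskal-style non-metric fit on a precomputed dissimilarity matrix; the toy matrix below stands in for averaged listener ratings:

```python
import numpy as np
from sklearn.manifold import MDS

# Toy symmetric dissimilarity matrix for five hypothetical sounds,
# derived here from random 2-D points rather than real perceptual data.
rng = np.random.default_rng(0)
pts = rng.random((5, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

# metric=False requests a Kruskal-style non-metric fit: only the rank
# order of the dissimilarities constrains the solution, not their ratios.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
X = mds.fit_transform(D)
print(X.shape)  # (5, 2): one coordinate pair per sound
```

The estimator also exposes a stress_ attribute, the badness-of-fit measure that Kruskal's iterative procedure minimizes.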
Over the years, several variations on MDS have been developed, most of which are now available as readymade tools in software applications like MATLAB. Looking back, we find key developments in programs like INDSCAL (Carroll & Chang, 1970) – another Bell Labs program, and the one used by Grey to calculate the timbre space diagram in Figure 1 – which was introduced not long after MDSCAL, with the main updates being an ability to simultaneously analyze multiple matrices and to account for how individuals may weight dimensions differently. Subsequent variants include an extended INDSCAL model known as EXSCAL (Winsberg & Carroll, 1989), where “specificities” for different stimuli were factored into the equation of the stress function, as well as models like MULTICLUS (DeSarbo et al., 1991) for combining MDS with hierarchical cluster analysis, and CLASCAL (Winsberg & De Soete, 1993) for assigning weights to latent classes of subjects (n.b., for a fuller version history of MDS technologies, see Groenen and Borg, 2013). These new models were quickly adopted in timbre perception studies, as in the use of EXSCAL by Wessel et al. (1987) to introduce specificities accounting for the unique timbral characteristics of certain instrument sounds (e.g., clarinet or harpsichord), which causes them to stand out from the overall set. Likewise, we find examples of CLASCAL being used for the analysis of latent classes of listeners from varying backgrounds, where each class has its own weighting of dimensions (McAdams et al., 1995). Each of these stages in the development of MDS tools and techniques introduced new possibilities and refinements into the evolving timbre space model.
On the Problem of Interpreting Results
While MDS-based timbre experiments were designed to measure the perception of timbral dissimilarities in listening tests, a parallel objective has always been to interpret these measurements in relation to the acoustic signals themselves, finding which parameters of sound (i.e., dimensions) best explain perceived distances between instrument sounds (Grey & Gordon, 1978; Krimphoff et al., 1994; Misdariis et al., 1998). In theory, a timbre space can be constructed along any number of dimensions, and if you change the set of sounds, you change the relevant dimensions. The idea is to find the optimal number that will define timbre without overfitting and devolving into the so-called “noise of data” (i.e., the limit of statistical advantage). But whereas the rest of the methodology is highly regulated, this step is more subjective. The researcher is supposed to discover the proper sonic dimensions by following the law of economy, using MDS for dimension reduction to limit representation to those features deemed most salient, which are typically those with the most variation across the set. But as Shepard noted of his early studies, the final “stable configurations have been achieved in practice only by resorting to laborious smoothing procedures of a rather inelegant and ad hoc nature” (Shepard, 1962a, p. 128). Moreover, the process of formulating a psychophysical interpretation of the relevant dimensions and instrumental proximities has often been tinged with ambiguity. Shepard laments that, even though “interpretation is the end result which the investigator is seeking… [it] is still sometimes neglected or mishandled” (Shepard, 1974, p. 382), undercutting the neutrality of the MDS method in determining goodness of fit for dimensionality. 
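The trade-off described here can be illustrated with a Kruskal-style stress measure, the badness-of-fit statistic that MDS procedures minimize. In the sketch below, which is illustrative rather than a reconstruction of any historical analysis, dissimilarities are generated from a synthetic configuration whose true dimensionality is known (four), and the stress of a classical MDS solution is computed for increasing numbers of dimensions; the sharp drop followed by a plateau is the “elbow” that researchers use to pick a dimensionality without overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "true" configuration: 10 stimuli in a 4-dimensional space
X = rng.normal(size=(10, 4))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # dissimilarities

def mds_stress(D, k):
    """Kruskal-style stress-1 of a k-dimensional classical MDS solution."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                 # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:k]
    coords = evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))
    Dhat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.sqrt(np.sum((D - Dhat) ** 2) / np.sum(D ** 2))

stresses = [mds_stress(D, k) for k in range(1, 6)]
for k, s in enumerate(stresses, start=1):
    print(f"{k} dimensions: stress = {s:.4f}")
# Stress falls sharply until k matches the true dimensionality (4 here),
# after which extra dimensions buy almost nothing.
```

As the text notes, in real experiments this choice was never so clean: perceptual data are noisy, the “true” dimensionality is unknown, and the stress curve rarely shows an unambiguous elbow, which is why interpretation remained the subjective step in an otherwise regimented methodology.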
To educate would-be MDS researchers, Shepard suggested one should “always try for a solution in a space of three, or, preferably, fewer dimensions where the spatial structure of the entire configuration can be seen and interpreted directly” (Shepard, 1974, p. 382), and he offered tips for interpreting clusters, circles, and linear orderings within these low-dimensional spaces. Thus, as a practical matter, the determination of dimensions and their interpretation in MDS studies was more constrained than one might think from reading the underlying explanations (although this is no doubt changing today with the use of machine-learning techniques, which pose fewer practical limits on the number of dimensions that may be computed as part of a timbre space model, even if anything beyond 3D representations remains difficult to visually comprehend).
Another rationale for perceptual dissimilarity tests has been their ability to bypass semantic representation, to sidestep the question of language by limiting responses to numerical ratings, with no descriptions of sound involved. But timbre researchers have also searched for ways to correlate perceptual data and low-level acoustic features to higher-level semantic descriptions of audio content (Reybrouck, 2013; Saitis et al., 2018; Saitis & Weinzierl, 2019; Zacharakis et al., 2014). Over the last two decades, this descriptive turn has been especially catalyzed by the rise of Music Information Retrieval (MIR) methods, which enable the construction of sound-indexing systems for finding correlations between signals, their perception, and the language used to describe them. Going further, some studies have examined links between timbre perception and a listener's affective responses (Eerola et al., 2012), as well as links to a listener's embodied actions and neural correlates in the brain (Caclin et al., 2006; Wallmark et al., 2018), which have been held up as potential “biological bases of musical timbre perception” (Patil et al., 2012). With these extensions of timbre space, researchers have sought to forge connections between sounds, signals, semantics, affects, and bodies, building an expansive network of associations to mirror the material-semiotic loops of listening.
To be sure, timbre space remains a contested model, even within the psychoacoustic community. Bregman once claimed that no MDS experiment had “been able to show that the dimensions identified as the important ones are completely adequate, even within the restricted range of stimuli used, to account for differences in the qualities of the sounds” (Bregman, 1990, p. 125). There have been several suggestions for how to address this apparent failure. For some, the answer is more psycho-physiological testing to address the role of embodiment in the ecological perception of timbre (Wallmark et al., 2018; Ferrer, 2011). Others have pointed to the need to test a more diverse range of listeners and auditory stimuli, for instance by focusing on the experiences of those wearing cochlear implants (Erickson et al., 2020), or by breaking away from the traditional focus on orchestral instruments and turning instead to studies of non-western instrumental sounds and listeners (Fales, 2002; Levin & Süzükei, 2018; Serafini, 1993; Tenzer, 2018). Others still, such as Reuter and Siddiq, have called for the use of MIR to build a “physical timbre space,” or even a “beyond timbre space,” where consideration would be given to pitch, dynamics, articulations, and other features of the “whole instrument” (Reuter & Siddiq, 2017, p. 162). And increasingly, researchers are using machine-learning techniques, as in the case of Thoret et al. (2021), which draws on sounds and perceptual data from eight previous MDS experiments to train a distance metric that mimics human dissimilarity judgements, promising to reveal high-dimensional acoustic spaces based on the analysis of spectrotemporal modulations, where certain areas can be linked to timbre perception. As this proliferation of methodologies suggests, there have been many extensions and transformations of timbre space research, with old MDS approaches yielding to new ways of conducting experiments and working with perceptual data.
But even as classical MDS becomes less common, its underlying techniques for multidimensional analysis and data visualization persist in psychoacoustic research and have been encapsulated by today's audio-technical systems.
Into the Workshop: Design and the Musical Apparatus
A composer uses automatic orchestration software to render recordings of the human voice as instrumental simulacra. A performer uses gestural controllers to navigate a virtual space of audio synthesis parameters. A listener uses an audio-fingerprinting app on their smartphone to identify a song on the radio. Each of these scenarios involves the articulation of musical practice to technical implementations of psychoacoustic knowledge gleaned from timbre space experiments.
Whereas the last section focused on how timbre space was built out of computer-based tools for MDS analysis, this section focuses on how norms of perception established by timbre space models were subsequently built into new digital tools for music composition, performance, and listening. These tools rely on support from layers of conceptual and technical infrastructure derived from timbre space experiments, and they extend beyond purely musical contexts to a wide range of applications. Some of these, such as police use of speaker identification systems in surveillance (Li & Mills, 2019) or the medical use of voice analysis to diagnose pathologies (Saenz-Lechon et al., 2006), make clear the ethical stakes presented by the encapsulation of timbral metrics in new technologies. But even creative applications raise complex questions around the relation of timbre to the interpretation of social identities and the ways these relations get encoded in music technologies (e.g., see the chapter on Vocaloid synthesizers in Eidsheim, 2019). Because these technologies are often produced en masse and widely distributed, they have a potential to impact diverse populations, as well as a tendency to recede into the background and to be conflated with universal truths, even though they developed under specific cultural and historical conditions. To sketch in some details of these conditions, and to give a sense of the reciprocity underpinning the production of scientific and musical instruments, the following account highlights early attempts to harness timbre space as a compositional aid and performance controller, as well as later developments that led to the consolidation of standard metrics for describing timbre in digital audio metadata and MIR applications. Details for the latter are too numerous to account for fully in the scope of this article, so I confine my discussion to the establishment of the MPEG-7 standard.
This standard, in turn, is framed as a second-generation immutable mobile that propels ideas about the multidimensionality of timbre into a new era of predictive applications.
Centers of Calculation at Bell, CCRMA, and IRCAM
To begin, we shift our attention from the labs of Bell and CCRMA to the design spaces and workshops at the French Institut de recherche et coordination acoustique/musique (IRCAM), where many psychoacoustic theories first found their way into technologies for creative purposes. But in doing so, we continue to follow many of the same actors, as these centers shared research personnel and technologies, and to a certain degree can be considered as institutional clones of one another. As Andrew Nelson details in his history of CCRMA (2015), the research center nurtured a symbiotic relationship with IRCAM that began after a widely publicized visit to Stanford in 1975 by the French composer Pierre Boulez and an entourage of musicians, psychoacousticians, and computer scientists. The trip was viewed as an opportunity to observe state-of-the-art computer facilities at CCRMA and to gather ideas for how a similar setup might be constructed at the new IRCAM building in Paris. A few years later, there would be a reverse trip of CCRMA- (and Bell-) affiliated staff to IRCAM, with researchers like David Wessel, John Chowning, Max Mathews, Jean-Claude Risset, and Andy Moorer taking up temporary posts there. In addition, there was a parallel chain of technology-sharing that accompanied the flow of people between these centers, as documented by Chowning in a 1979 letter to the Rockefeller Foundation:

We have not only served to some extent as a model for Boulez's institute, but a substantial interaction and cooperation between the two centers has developed. In 1976, IRCAM acquired the same computer which we have in order to have access to all of the programs which we have developed over the years… this has saved IRCAM tens of man-years of development work, which in a commercial context would have resulted in a large financial return to CCRMA. In the arts such a financial transaction cannot happen nor should it, which means that the visibility of an activity such as CCRMA must find support elsewhere. (reported in Nelson, 2015, pp. 55–56)
Instrumentalizing Timbre Space
Timbre space provided a possible answer to the psychoacoustic problem, as it offered a multidimensional representation of sound that highlighted parameters thought to be the most perceptually salient. We find one of the first attempts to deploy timbre space in this way in Wessel (1979), which reports on the development of the ESQUISSE program to conduct a timbre perception experiment at IRCAM. Unlike previous studies, however, Wessel was not just interested in what the experiment might reveal about timbre perception; he was equally interested in how this knowledge might be wielded both compositionally and in performance as part of what he called a “musical control structure.” The basic idea was that timbre space might become prescriptive and interactive, rather than just descriptive, thus allowing musicians not only to measure the perceptual distances between instrumental sounds, but also to navigate the space in between and to calculate the construction of hybrid timbres along the way. Wessel described these moves in terms of “timbral analogies,” with the assumption being that they operated in the same way as modulations between different keys in tonal music (Ehresman & Wessel, 1978; Wessel, 1979). By traversing distances along relevant dimensional axes within a graph, one could conceivably use timbral analogies as a structuring device, either compositionally or in performance. Although Wessel didn’t propose any specific instrument for realizing these modulations at the time, he noted the potential of combining gestural control interfaces with real-time synthesis programs, and he suggested the “most natural way to move about in the timbral space would be to attach the handles of control directly to the dimensions of the space” (Wessel, 1979, p. 51).
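The parallelogram model behind these timbral analogies reduces to simple vector arithmetic in the space. The sketch below completes an analogy of the form “A is to B as C is to ?”; the two-dimensional coordinates and instrument glosses are invented for illustration and are not Wessel's measured values.

```python
import numpy as np

# Hypothetical 2-D timbre-space coordinates (axis I glossed as spectral
# energy distribution / "brightness"; axis II as onset character).
A = np.array([0.2, 0.5])   # e.g., a darker sustained sound
B = np.array([0.6, 0.5])   # e.g., the same sound made brighter
C = np.array([0.2, 0.9])   # e.g., a dark sound with a different onset

# The analogy "A is to B as C is to ?" completes the parallelogram:
# apply the same displacement vector (B - A) starting from C.
D_target = C + (B - A)
print(D_target)
```

A synthesis system could then generate the sound nearest `D_target`, realizing the “modulation” from C that mirrors the move from A to B.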
There have since been several design projects aimed at implementing the kind of timbre space interface described above. Indeed, Wessel himself was involved in early-1990s attempts to train neural networks to translate a performer's gestures into real-time synthesis algorithms based on their location in a virtual timbre space (Lee & Wessel, 1992). Elsewhere, in Vertegaal and Eaglestone (1996), we read about options for navigating timbre space and manipulating audio using different input devices like a computer mouse, a joystick, and a powered glove. These and a series of subsequent studies (e.g., Choi et al., 1995; Garnett & Goudeseune, 1999; Wanderley et al., 1998) have wrestled with the ins and outs of mapping real-time gestures to timbre in high-dimensional control spaces. Others have focused more on the utility of timbre space for educational purposes (e.g., the Timbre Explorer of Lam & Saitis, 2021), or else as a compositional aid, treating “timbre space as synthesis space” (Seago et al., 2008) to generate musical materials directly or to plan timbral trajectories over measured distances. Most recently, researchers have incorporated machine-learning techniques and MIR feature-mapping into their designs for both composition (Einbond, 2012; Esling et al., 2018) and performance (Zbyszynski et al., 2019), while researchers working on computer-assisted orchestration programs like Orchidea (Cella, 2022; originally Orchidée, see Carpentier et al., 2010) have integrated timbre space with indexical systems to generate timbre “forecasting” models for predicting what different combinations of musical instruments might sound like, or how close to a given acoustic model they will come.
MPEG-7, Timbre Toolbox, and the Standardization of Audio Description
For decades now, the design of digital musical instruments and software applications has been supported by the encapsulation of timbre space models in audio metadata standards. A key source of these standards is IRCAM, where, in the late 1990s, members of the Analysis-Synthesis research team spearheaded a European Working Group dubbed CUIDAD. This group comprised an international network of academic institutions and private industry stakeholders, including Artspages International AS (NO), University Pompeu Fabra (ES), Ben Gurion University (IL), Creamware Datentechnik Gmbh (DE), Oracle Iberica (ES), and Sony CSL (FR). It drew on experimental research by IRCAM's Music Perception and Cognition team—and specifically, on the results of three timbre perception experiments (Krumhansl, 1989; Lakatos, 2000; McAdams et al., 1995)—to recommend a set of perceptually salient timbre descriptors to be included as part of what would become the MPEG-7 format, also known as the Multimedia Content Description Interface. Approved by the ISO/IEC in 2002 (ISO/IEC 15938:2002), MPEG-7 applies descriptive metadata to audio and visual media content, drawing on a classification system that designates seven low-level timbre descriptors, which are subdivided into two groups. The first pertains to temporal timbral descriptors, including the log attack time and temporal centroid, while the second is for spectral timbral descriptors, including the spectral centroid, harmonic spectral centroid, deviation, spread, and variation. These do not represent every possible timbral descriptor, only those thought to correspond with the most essential perceptual features, and they are nested in MPEG-7 alongside a larger set of low-level audio descriptors, likewise grouped in sub-categories, including temporal, energy, spectral, harmonic, and perceptual (for full details, see Herrera et al., 1999; Peeters et al., 2000; Peeters, 2004).
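To make these descriptors concrete, the sketch below computes simplified versions of two of them, the spectral centroid and the log attack time, from a synthetic tone. The framing scheme, threshold values, and formulas are illustrative approximations chosen for brevity, not the normative MPEG-7 extraction procedures.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr                       # one second of audio
attack = 0.05                                # 50 ms linear attack
env = np.minimum(t / attack, 1.0)
sig = env * np.sin(2 * np.pi * 440 * t)      # synthetic 440 Hz tone

# Spectral centroid: power-weighted mean frequency of the spectrum.
spec = np.abs(np.fft.rfft(sig)) ** 2
freqs = np.fft.rfftfreq(sig.size, d=1 / sr)
centroid = np.sum(freqs * spec) / np.sum(spec)

# Log attack time: log10 of the time the energy envelope takes to climb
# from 2% to 90% of its maximum (threshold choices vary by author).
frame = 256
n_frames = sig.size // frame
rms = np.sqrt(np.mean(sig[: n_frames * frame].reshape(n_frames, frame) ** 2,
                      axis=1))
times = (np.arange(n_frames) + 0.5) * frame / sr
t_start = times[np.argmax(rms > 0.02 * rms.max())]
t_stop = times[np.argmax(rms > 0.90 * rms.max())]
log_attack_time = np.log10(t_stop - t_start)

print(f"spectral centroid ~ {centroid:.1f} Hz")
print(f"log attack time   ~ {log_attack_time:.2f}")
```

For this tone, the centroid sits near the 440 Hz partial and the log attack time near log10 of the 50 ms ramp, which is the sense in which such descriptors compress a signal's timbral character into a handful of numbers suitable for metadata fields.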
Work done at IRCAM was subsequently relayed into a larger project, CUIDADO (Pachet, 2001; Vinet et al., 2002), which operationalized MPEG-7 in the context of two web-based applications: 1) Sound Palette, a creative authoring tool for retrieving and manipulating short audio samples, which can be considered as an extension of the earlier in-house program Studio Online at IRCAM (Ballet et al., 1999); and 2) Music Browser, a classification and retrieval system for navigating music catalogues, which acted as a precursor to services like Spotify, although it lacked streaming capabilities. To support these target applications, CUIDADO was further tasked with developing machine-learning techniques to forge connections between low-level audio descriptors and higher-level “description schemes” for musical features like melody, scale, key, tempo, meter, and genre. In theory, low-level data could be automatically extracted from an audio signal and correlated with knowledge about music and auditory perception to infer symbolic and semantic descriptions at a higher conceptual level, thus instituting a way of analyzing and talking about sound that could be deployed in music indexing and retrieval applications.
The MPEG-7 framework developed by IRCAM also informed later research-based applications like the MATLAB-based Timbre Toolbox, which includes more than 40 audio descriptors covering both global and time-varying aspects of sound. The involvement of IRCAM researchers across all these projects is important, as they acted as a consistent presence at the table of international standards-making negotiations for how timbre is represented in digital audio technologies. And yet, despite this consistency, there is still disagreement about which acoustic features are needed for a comprehensive characterization of timbre, and in both MPEG-7 and Timbre Toolbox, there exist redundancies, with researchers noting that out of dozens of features it may only be possible to distinguish 10 independent classes (Peeters et al., 2011). Such variance hints at the problem of correlation between statistical models and timbre perception, as well as the general potential for larger incongruences between disciplines with different objectives. To take just one instance: as has been noted by scholars in both fields (Aucouturier & Bigand, 2012; Siedenburg et al., 2016a, 2016b), MIR-based research and cognitive psychology tend to diverge on which properties of the audio signal should be used as physical correlates of timbre perception, as well as to what extent one can relate these to high-level constructs, such as genre, mood, instrumentation, and so on. As a result, MIR researchers may analyze hundreds of audio features to infer semantic descriptions of the music based on established statistical correlations, while cognitive psychologists may consider only a handful of acoustic correlates relevant because the field is concerned with a different set of evaluative criteria, such as the perceptual and physiological accuracy of the model, not just its predictive efficiency.
These differences shed light on the often-hidden processes of negotiation that underwrite formatting standards, opening space to interrogate why one set of audio features, instead of another, should become constitutive of timbre.
To this last point, it is useful to think about these historical developments in terms of format theory which, as posed by Jonathan Sterne, dwells on “smaller registers like software, operating standards, and codes, as well as larger registers like infrastructures, international corporate consortia, and whole technical systems”; he adds that, “if there were a single imperative of format theory, it would be to focus on the stuff beneath, beyond, and behind the boxes our media come in” (Sterne, 2012, p. 11). Extended to the case of MPEG-7 and Timbre Toolbox, such a theory can help explain how the categories used to describe audio are themselves contingent on a wider network of institutional, industrial, and governmental organizations involved in negotiating how sound is represented in digital media. It can also draw attention to the longer history of acoustical theories and psychometric techniques that feed into classification systems, allowing us to situate formats like MPEG-7 in relation to a gradual process, in which first sound, then auditory perception, and finally audio description have come to be defined by quantifiable operations. And finally, it can be used to uncover the political dimensions of timbre space, revealing a collision of subjectivities and standards as the science of perception is cast in technological form. For what we ultimately find in timbral taxonomies is an underlying set of assumptions about sound and listening, which are always at risk of collapsing diverse practices and modes of perception within the fuzzy bounds of a contested nomenclature.
Concluding Notes on the History of Technoscientific Artifacts
One imagines that timbre might provide an ideal impetus for exploring diverse epistemologies of listening. After all, it often gets described as an emergent sonic property that resists the discrete, quantifiable structures of notated scores and pitch-based musical frameworks. Understudied and ill-defined, it gets characterized as the “auditory wastebasket” of music, containing all those unknown variables that are not pitch or loudness. But as we’ve seen, in the context of new media, timbre gets reduced to knowable, nameable parameters all the time, and it gets flattened into “optically consistent” surfaces that can be reproduced and circulated in identical form, coalescing into a “cascade of immutable mobiles” that underwrite claims to scientific knowledge. It is with this in mind that the present article has sketched a “flat” history of timbre space, showing how data on listener behaviors was collected, stored, and, at a later stage, employed in the design of new sound technologies and the negotiation of formatting standards.
One area for future work lies in the analysis of specific use cases of music technologies built around theories of timbre space, including compositional tools for generating orchestrations (or pedagogical tools for analyzing them), such as those now being developed by researchers in the ACTOR project (e.g., Orchidea and OrchPlay; for details, see project website, https://www.actorproject.org/workgroups). Timbre-based tools are also being used in popular and commercial music applications, for instance by DJs who want to navigate their music collection by moving along different dimensions of timbre space, or by people using “audio fingerprinting” to automatically identify songs on the radio, enforce copyright on the web, or monitor scheduled programming on broadcast networks (Allamanche et al., 2001; Haitsma & Kalker, 2002; Ramona & Peeters, 2013; Wang, 2003). And of course, timbre-based classifications in MPEG-7 are at work in all kinds of non-musical applications; this includes automatic speech recognition and speaker identification systems, which have been employed both in quotidian circumstances (e.g., speech-to-text transcription for Zoom calls) and for more specialized purposes (e.g., forensic analysis in criminal proceedings), and it also includes remote sensing systems for the analysis of environmental sounds, which, again, may be articulated to laudable goals like maintaining sustainable ecosystems, but can also be put to more questionable use in mass surveillance operations (Kim et al., 2005). Through its historical role in establishing the underlying metrics (timbral descriptors) and the methods (dissimilarity tests) used in these wide-ranging applications, timbre space research can be understood as a key catalyst of what James Parker and Sean Dockray have described as the “planetization of machine listening” (Parker & Dockray, 2023).
Clearly, the ethical stakes of such varied and widespread applications warrant further consideration of the relevance and reach of the underlying psychometric norms derived from timbre perception experiments, which, as I’ve argued, emerged in proximity to an orchestral aesthetics of tone and were primarily developed in the context of music authoring and indexing tools.
Another area for investigation concerns the scale and temporality of timbre space, along with its surrounding cycles of research and design, use and re-use. Decades of timbre experimentation have resulted in a stockpile of perceptual data, in which information is preserved as a resource for future use. From this perspective, any consideration of timbre space as a “holding container” must also account for its associations with a larger system of extraction and “re-sourcing” (Sofia, 2000); the resource in this case is perceptual judgement, which is converted into labor when it gets used to train machine-learning systems. One might question whether subjects who participated in a science experiment fifty years ago ought to be incorporated today as data in commercial products, surveillance tech, and profit-making enterprises. Moreover, one wonders what will happen as timbre space experiments move online, entering the ebb and flow of supply and demand in digital labor markets. Recent studies have turned to the underpaid labor of online “crowd-sourcing” platforms like Amazon Mechanical Turk (Lee, 2010; Marjieh et al., 2024; Samiotis et al., 2022), where thousands of listening subjects can be hired for comparatively marginal sums, extracting vast troves of perceptual data that can then be analyzed or used to train machine-learning and artificial intelligence systems. This method is part of a general trend in experimental behavioral research (Crump et al., 2013; Mason & Suri, 2012), where scientists and subjects are brought together on a global cognitive-capitalist marketplace, and where the all-too-human labor of dissimilarity testing moves out of the research lab and into people's personal living spaces.
Going forward, how will the expanded scale of testing (both in population and geography) and accelerated temporality of such studies affect the coordination of listeners in timbre space, and how will the unequal distribution of resources that underpin this multidimensional space tilt the power dynamics of our technocultural landscape?
Flat representations of lived experience were created through the establishment of specific tools and techniques, and conversely, they can be reinflated to global proportions via technological means. In this article, I have attempted to account, in brief, for the rise of timbre space and its role in spreading scientific sight (or rather, audition) as objectified knowledge. This history includes how timbre perception experiments were conducted, and how, in the process of making diagrams, the subjective space of listening was flattened into the statistical space of MDS representations. Knowledge thus produced was shown to be consolidated in a working definition of sound that was then encoded as digital audio metadata and operationalized using MIR methods. This overall shift from signals to senses to semantics has ushered in a historical progression from descriptive diagrams to more prescriptive and even predictive purposes. The result has been an increasingly “closed world” (Edwards, 1996) of seamless mappings, which risk foreclosing on the possibility of non-normative positions and experiences as they become more ubiquitous and standardized in music technologies. By historicizing timbre space diagrams, we glean an alternative understanding of them not just as immutable mobiles, but also as what Donna Haraway would call “performative images that can be inhabited… [as] condensed maps of contestable worlds” (Haraway, 1997, p. 11). It is with a view to restoring the performativity and contestability of timbre space diagrams that this article has rehearsed a chapter from the history of music, psychoacoustics, and technoscience, and it is hoped that the account provided here will contribute to a cross-disciplinary conversation on the role of material artifacts and their corresponding systems of representation in shaping contemporary sonic practices.
Acknowledgements
Research for this article was supported by funding from UKRI Frontier Research grant EP/X023478/1 (RUDIMENTS).
Action Editor
Kai Siedenburg, Graz University of Technology, Signal Processing and Speech Communication Laboratory
Peer Review
Daniel Muzzulini, Zurich University of the Arts; Lindsey Reymore, Arizona State University, School of Music, Dance and Theatre
Ethical Approval Statement
This research did not require ethics committee or IRB approval. This research did not involve the use of personal data, fieldwork, or experiments involving human or animal participants, or work with children, vulnerable individuals, or clinical populations.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the UK Research and Innovation, (grant number EP/X023478/1).
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
