Abstract
There is a gap in existing critical scholarship that engages with the ways in which current “machine listening” or voice analytics/biometric systems intersect with the technical specificities of machine learning. This article examines the sociotechnical assemblage of machine learning techniques, practices, and cultures that underlie these technologies. After engaging with various practitioners working in companies that develop machine listening systems, including CEOs, machine learning engineers, data scientists, and business analysts, I bring attention to the centrality of “learnability” as a malleable conceptual framework that bends according to various “ground-truthing” practices in formalizing certain listening-based prediction tasks for machine learning. In response, I introduce a process I call Ground Truth Tracings to examine the various ontological translations that occur in training a machine to “learn to listen.” Ultimately, by further examining this notion of learnability through the aperture of power, I take insights acquired through my fieldwork in the machine listening industry and propose a strategically reductive heuristic through which the epistemological and ethical soundness of machine learning, writ large, can be contemplated.
Introduction
Oliver Selfridge and Ulric Neisser, a pattern recognition researcher and a cognitive psychologist, wrote in one of the earliest attempts to understand what would now be grouped under the banner of “machine learning” (ML) that “… any system must fail if it tries to specify every detail of a procedure for identifying patterns that are themselves defined only ostensively. A pattern-recognition system must learn. But how much?” (Selfridge and Neisser, 1960: 65). Indeed, this question of “how” appropriately directs attention to the epistemological foundations of ML. Arguably more vital, however, is the ontological question of “what.”
What is a machine trying to learn? For Selfridge and Neisser, as well as the broader community of early pattern recognition researchers in the late fifties and early sixties 1 , this was Optical Character Recognition (OCR)—the automatic recognition of hand-printed letters via computational statistics. Despite the challenges that drove this scientific community at the time, OCR presented a relatively easy “what.” It was easy in the sense that written letters of the alphabet—a bounded sequence of 26 distinctly shaped characters that can be described or deduced with descriptive features such as whether the letter has a vertical line, crossbar, or “a concavity at the top” (p. 66) 2 —comprised a reasonably stable “ground truth.” In the most basic sense, ground truth refers to information that is assumed to be true for an ML system. With the increasing complexity of the tasks to which ML has been applied over the past six decades, however, agreement on what constitutes adequately stable ground truths for ML systems has become exponentially more complicated.
I localize my observations and analyses of ML epistemology and ontology to the comparatively understudied but rapidly growing industry of machine listening 3 , or what Kang (2022) refers to as voice identification and analysis (VIA) technologies. I focus specifically on the VIA industry because the current shortage of critical work that engages with the ways in which VIA systems intersect with the technical specificities of ML not only reflects a discursive chasm in the ways that these systems are understood, discussed, and critiqued, but also points to an existing lack of nuance which adds to the challenge of parsing out useful ML applications from damaging ones 4 . In this way, the VIA industry also presents an opportunity to shift the focus back to ML broadly to examine how practitioners reconcile ontologically polysemous phenomena—i.e. voice—with the statistical and categorical epistemologies of ML.
The voice, unlike the alphabet, has a notoriously complex ontology. Numerous scholars have pointed to the voice as a dynamic phenomenon (as opposed to a static object) that is both produced and heard according to the sociocultural (e.g. Eidsheim, 2018; Kang, 2022; Stoever, 2016), biological (e.g. Kreiman and Sidtis, 2011), and physical (e.g. Weidman, 2015) conditions in which it is voiced. By complicating and enriching understandings around what voice is (ontology) and how it is known (epistemology), both of which are foundational to the specific ML practice of ground-truthing, these works present crucial apertures through which any voice-adjacent ML technology must be examined. It is from this angle that I also approach my examination of the VIA industry and the ML practices, cultures, and communities that underlie it.
I spoke directly with practitioners involved in various parts of the development pipeline for VIA systems ranging from CEOs/founders, SVPs, ML engineers, computer scientists, data scientists, product developers, project managers, business development managers, database managers, brand evangelists, and staff developers. The companies that these industry professionals work(ed) for range from well-known big tech firms to smaller but still influential industry leaders in VIA technologies, to emerging startups. I reached out to some of these individuals through their LinkedIn pages and Twitter accounts. Others I met by attending flagship industry conferences such as the VOICE Summit and VOICE Global, webinars organized by various VIA companies, and industry talks where I also learned about existing “state-of-the-art” systems and future visions for the industry. Based on the hundreds of hours I spent over the course of 10 months (October 2021–June 2022) speaking with and listening to these professionals and meditating upon our conversations through recordings and personal notes, I learned that conceptions and expectations around what voice is or can do in the context of ML, or more specifically machine listening applications, vary significantly based on who one talks to and the contexts in which they are conceived. There were, for instance, product developers and business development managers who were excited about machine listening systems that could predict the emotional state of a speaker in real time, estimate the likelihood of a speaker defaulting on a loan, forecast how well a speaker would fit in with a certain team, and even match potential dates by analyzing the voice. But there were also engineers and research scientists who represented teams that expressed critical reflexivity toward the current capabilities of not just machine listening applications, but also the current capabilities of ML, writ large.
Ultimately, I argue that this conceptual variance around the ontology and utility of voice in machine listening is a result of negotiations and technical decisions made by ML practitioners that alter notions of a particular ML problem's learnability. It is important to note, here, that I differentiate my use of “learnability” from the more technical use of the term invoked in formal ML learning models such as “Probably Approximately Correct (PAC) learnability,” which essentially describes a statistical situation in which a class of hypotheses is considered learnable when, given enough training examples, there is a high probability that the learned hypothesis will have a low generalization error (Shalev-Shwartz and Ben-David, 2014: 22). Instead, I use “learnability” in a less quantitative sense to refer more broadly to the capacity for a certain qualitative problem to be conceptualized, translated, and formalized into the operational framework of ML as ground truth. This notion of learnability thus does not require a direct mathematical engagement with ML models and algorithms, but rather takes a more holistic approach that qualitatively attends to the processes of conceptualization, translation, and ontological variation that occur in re-conceptualizing qualitative phenomena through the aperture of quantitative ML frameworks.
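For contrast with the qualitative sense adopted here, the formal notion can be stated compactly. The following is a paraphrase of the standard textbook definition of PAC learnability under the realizability assumption; the notation (the hypothesis class H, the sample-complexity function m_H, the learner A, and the error L) is assumed for illustration and is not drawn from the fieldwork.

```latex
% PAC learnability (realizable case), paraphrased; notation is assumed.
A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function
$m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm $A$ such that,
for every $\epsilon, \delta \in (0,1)$, every distribution $\mathcal{D}$ over the
domain $\mathcal{X}$, and every labeling function $f$ that some $h^\ast \in \mathcal{H}$
matches perfectly, running $A$ on a sample $S$ of
$m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples satisfies
\[
  \Pr_{S \sim \mathcal{D}^m}\!\left[\, L_{(\mathcal{D}, f)}\big(A(S)\big) \le \epsilon \,\right] \;\ge\; 1 - \delta ,
\]
where $L_{(\mathcal{D}, f)}(h) = \Pr_{x \sim \mathcal{D}}\left[\, h(x) \ne f(x) \,\right]$
is the generalization error of $h$.
```

Note that the labeling function f is simply given in this definition: the formal account begins only after a ground truth has been fixed, which is precisely the step that the qualitative sense of learnability interrogates.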
I refer to this approach of qualitatively tracing ground-truthing processes as ground truth tracings (GTT) 5 . This intentionally non-mathematical process for the evaluation and analysis of learnability in ML systems is useful for two interrelated reasons. (1) Querying an ML system through the grammar of natural language allows for a productive distancing from the narrower technical discourse of model performance and algorithmic efficiency, and instead foregrounds the broader sociomaterial practices of data collection, preparation, and maintenance 6 , while still engaging with the logical step-by-step sequences related to data analysis that must occur for an ML system to function. (2) As a result, it creates a much-needed collaborative space for interdisciplinary dialog between engineers/developers and social scientists/humanists to collectively think through the broader sociotechnical assemblage of humans, machines, imaginaries, and infrastructures that “learnability” is contingent upon.
This collaborative effort becomes especially important as ML is implemented for use-cases with increasingly higher stakes. In his early article, “The Imitation of Man by Machine,” Neisser (1963) already understood and emphasized that the fundamentally different underlying processes by which machines and humans learn and execute certain tasks can have serious consequences in “real-world” applications. We must take this warning seriously. While differences between human learning and ML may be relatively trivial for the purposes of technical pattern recognition tasks (e.g. OCR), they become, as Neisser asserts, tremendously apparent and important when computers take on the role of adjudicating social decisions—i.e. decisions contingent on complex ontologies mediated by power with people at stake. Drawing from my conversations with practitioners in the VIA industry, I thus splinter my ontology-focused method of GTT once more along the dimensions of subjectivity, stakes, and power to present an intentionally simplified heuristic—the Learnability-Stakes (LS) table—for analyzing both the epistemological and ethical soundness of ML, writ large.
Machine learnability: GTT and ontological dissonance
In the words of Mackenzie (2017: 49), “to understand what machines can learn, we need to look at how they have been drawn, designed, or formalized.” One way to approach this is by taking a closer look at the problem around which an ML system is developed. Jaton (2021: 82) writes in The Constitution of Algorithms that a “problem an algorithm is designed to solve does not preexist: it has to be produced during what one may call a ‘problematization process’ – a succession of collective practices that aim to empirically define the terms of a problem to be solved.” A central step in this problematization process is what is referred to in the ML community as “ground-truthing.” To elaborate on the earlier definition, ground truth points to the referential repository that serves as the base from which ML algorithms are derived—it is literally where the truth and possibility of an algorithm are grounded. Most importantly, this means that a problem of interest and the method of addressing that problem must first be defined and determined through data in the ground truth, thus constituting a task-bounding process and a form of intentional biasing that hardlines the limits of the algorithm and the possible range of outcomes for an ML system.
This specificity of ML is sometimes referred to in industry discourse as “Artificial Narrow Intelligence” or ANI (Kanade, 2022). The modifier “narrow” inserted into the more common phrase “artificial intelligence” is meant to emphasize the “goal-oriented version of AI designed to better perform a single task” (Kanade, 2022). What is misleading about this rhetorical distinction from other forms of AI such as “Artificial General Intelligence” (AGI), however, is that it attempts to expand AI ontology through the implication of another more adaptive version of AI that goes beyond the capabilities of ANI. To be clear, there is no such version, yet. AGI, which refers broadly to a vision of AI that has the capacity to emulate human cognition and adapt to any complex problem, is still a loosely defined concept and an unrealized ambition within the ML community. In fact, as many AI ethicists have pointed out, the hype around AGI distracts from the actual potentials and harms of existing AI/ML applications and the broader ethical, social justice, and practical questions they pose (Johnson, 2022). It is thus important to emphasize that in the current state of AI/ML research, ANI comprises the entirety 7 of artificial intelligence applications. I accentuate this point, here, because it illuminates the centrality of ground-truthing practices in how any current AI/ML system is made to execute a task. As Jaton writes, “we get the algorithms of our ground truths” 8 (Jaton, 2021: 81). Examining an ML system through the aperture of ground truths thus allows one to examine the practical steps by which a claim made around a certain technology's capabilities is “ground-truthed.” Here, the underlying questions, in their simplest form, would be: (1) Is this a learnable problem? (2) If so, how? (3) If not, why not?
In the same way that ground truths must first be formed as a base upon which ML algorithms can be developed, qualitative GTT can also retroactively be performed to interrogate the logical sequences through which a ground truth is constructed for that claim 9 .
Jaton's discussion of saliency detection presents a useful example to contextualize my argument. He writes that in contrast to high-level detection algorithms such as those for face-detection, which are task-specific and thus conceptually easier to construct ground truths for, low-level saliency detection—detecting which parts of an image are “salient”—has “no ‘natural’ ground truth allowing the design and evaluation of computational models” (Jaton, 2021: 56). This impossibility lies in the obvious fact that “what is considered as salient in a natural image tends to change from person to person” (p. 57) and, of course, situation to situation. Performing GTT is thus a fundamental task for ML engineers because it allows them to either dismiss a problem as one in which a ground truth cannot possibly be established—an “unlearnable” problem—or find a way to redefine the problem in such a way that a ground truth can be established. Liu et al. (2007) thus tackled this problem of saliency detection by reframing the task from simply recognizing salient parts of an image, which was deemed to have too much statistical variance, to detecting “the most salient object within a given digital image” (Jaton, 2021: 57). This still has the potential for disagreement between labelers, but it can decrease the variance for images in which there is a relatively obvious distinction of figure vs. ground. By thus repositioning saliency within this binary framework, it became possible to conceive of and build a ground truth database. Of course, this is not to say that in the process of developing a ground truth, negotiations, disagreements, or other complexities related to labeling and classification do not occur—the establishment of a ground truth should never be confused with a “true” quantification or datafication of a “real-world” phenomenon. Rather, it is the translation and flattening of a messy qualitative problem into a usable quantitative reference 10 .
What is implied in a ground truth is thus not necessarily a representation of “reality,” but rather the translatability of a problem of interest, which allows it to be legible to and expressed in the language of mathematics. This numericization allows it to become more sharable, comparable, and malleable (p. 233), thus allowing new statistical instruments to be used in its analysis. The ideal situation is thus one in which a problem of interest can easily take a quantitative form (i.e. be ground-truthed) without losing a significant portion of its qualitative character. The reality, however, is that there are varying degrees to which different qualitative phenomena can be smoothly numericized, and this translation process can require reconceptualizing a qualitative problem to fit the quantitative framework through which the problem is formalized (e.g. redefining general “saliency detection” as the binary task of “figure-ground distinction”). GTT can thus be thought of as a method in which this translation process is made explicit for the purpose of gauging the degree of ontological dissonance that is generated in the conversion from “complex entity” to “scriptural form” (Jaton, 2021: 229). In this way, it should be understood as a mode of interrogating the practical negotiations that ML practitioners must make in operationalizing ML systems.
Consider, for instance, the task of recognizing hand-printed characters that was of interest to the community of early pattern recognition researchers. As Selfridge and Neisser (1960) document in “Pattern Recognition by Machine,” letters of the alphabet can be described in terms of specific features. An “A,” for instance, has a horizontal crossbar near the center, is convex-shaped in the upper half, and concave-shaped in the lower half. Few other hand-printed letters in the alphabet would match this description, and from a regression logic, no other letter would consistently match this description. The qualitative character of an A can thus be relatively accurately described in terms of discrete features, or “cells” if situating this description within a table-based taxonomy. Once the task of interest can be rendered in table format, quantitative tools such as percentages or weights can also be enrolled to aid in the execution of the task.
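The table-based rendering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the feature names, the four-letter table, and the weights are all hypothetical choices made for this example, and they are not the features or procedure used by Selfridge and Neisser.

```python
# Hypothetical feature table: each letter is a row, each descriptive
# feature a column ("cell"). Features and weights are illustrative
# assumptions, not those of Selfridge and Neisser (1960).
FEATURES = ("crossbar", "vertical_line", "top_concavity", "closed_loop")

LETTER_TABLE = {
    "A": {"crossbar": 1, "vertical_line": 0, "top_concavity": 0, "closed_loop": 1},
    "H": {"crossbar": 1, "vertical_line": 1, "top_concavity": 1, "closed_loop": 0},
    "U": {"crossbar": 0, "vertical_line": 1, "top_concavity": 1, "closed_loop": 0},
    "O": {"crossbar": 0, "vertical_line": 0, "top_concavity": 0, "closed_loop": 1},
}

# Assumed weights: the crossbar counts twice as much as any other feature.
WEIGHTS = {"crossbar": 2.0, "vertical_line": 1.0, "top_concavity": 1.0, "closed_loop": 1.0}


def score(observed, letter):
    """Weighted share of features on which the observation agrees with the letter's row."""
    row = LETTER_TABLE[letter]
    matched = sum(WEIGHTS[f] for f in FEATURES if observed[f] == row[f])
    return matched / sum(WEIGHTS.values())


def classify(observed):
    """Return the best-scoring letter; only letters present in the table can be returned."""
    return max(LETTER_TABLE, key=lambda letter: score(observed, letter))


# A cleanly printed "A" and a sloppier one with a dent at the top.
clean_a = {"crossbar": 1, "vertical_line": 0, "top_concavity": 0, "closed_loop": 1}
noisy_a = {"crossbar": 1, "vertical_line": 0, "top_concavity": 1, "closed_loop": 1}

print(classify(clean_a), score(clean_a, "A"))
print(classify(noisy_a), score(noisy_a, "A"))
```

Two points carry over to the broader argument. The classifier can only ever answer with a letter that already has a row in the table, so the table bounds the task; and the weights are a choice made by whoever builds the table, not a property of the letters themselves.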
In contrast, consider the controversies around ML research that have claimed to be able to predict criminality through a computer vision analysis of facial features (Wu and Zhang, 2016a). Without even beginning to delve into the oppressive metanarratives around race and class or the histories of scientific racism in phrenology that such claims invoke, it is possible to examine these claims by performing GTT: is criminality a learnable ML problem?
Wu and Zhang (2016b) write in their response to the highly publicized critiques of their controversial article “Automated Inference on Criminality Using Face Images” (Wu and Zhang, 2016a) that “taking a court conviction at its face value, i.e. as the “ground truth” for ML, was indeed a serious oversight on our part” (Wu and Zhang, 2016b: 1). It is interesting to note, here, that despite the many other refutations made by the authors against their many critics, they do—or perhaps have no other choice but to—accept that a court conviction does not establish an adequate ground truth. This is important because it reflects the efficacy of shared terminology in interdisciplinary dialog. By couching a sociohistorical critique of court convictions within an ML context of technical ground-truthing practices, it not only gives the critique more heft and a technical base from which developers who build ML programs can work, but it also allows for a productive rethinking of how feasible the proposed ML system is.
GTT shows that the only means by which “criminality” can be established as ground truth is to accept an individual's prior conviction as adequately ascertaining that individual's “criminality,” since the only means through which “criminality” can be recognized is through the very justice system that defines it. Once this is accepted as constituting the “criminality” component of the ground truth, a photo of that individual's face is labeled as a “facial instance” of “criminality” to establish the full ground truth of “facial criminality.” In response, however, if it is acknowledged that (1) all persons are internationally entitled to the presumption of innocence under Article 11 of the UN's Universal Declaration of Human Rights, (2) “criminality” does not exist outside of the documents and institutions that produce it, and (3) those documents and institutions do not establish an adequate reference upon which a criminality-prediction ML system can stand because racist and classist conviction practices not only represent a social problem of racial/class inequity, but also a technical deficiency in which the history of biased policing practices skews historical conviction data to overrepresent certain types of appearances 11 , it becomes evident that the sequence of assumptions through which a “facial criminality” ground truth is constructed is both socially and technically unsound. Unsurprisingly, Wu and Zhang do not address the development of an alternative ground truth, and instead, justify their choices by stating that they “maintain a sober neutrality on whatever [they] might find” (2016b: 1).
GTT is important precisely because of its ability to make explicit these inadequacies in ground truth construction. It allows for an understanding of which ML claims are more realistic, and which might be products of eloquent storytelling and weak methodological assumptions. While pernicious ML programs such as “criminality”-prediction immediately raise red flags due to the high stakes associated with the term, other comparably subtler (but not necessarily unproblematic) applications such as predicting “employee-fit” or “personality profiling,” both of which are existing use cases I encountered in my fieldwork, might not. Performing GTT against such claims thus grants one the capacity to gauge an ML program's practicality and understand its strengths and shortcomings without necessarily “opening up” its algorithmic “black box.” After all, as many researchers working under the banner of “Critical Technology Studies” have pointed out, “algorithmic transparency” is a limiting framework through which ML systems can be understood or interrogated (e.g. Burrell, 2016; Christin, 2020; Seaver, 2019, etc.). GTT provides an alternative tool that translates the problems of efficacy and opacity into a question of learnability, which ultimately means foregrounding the specific task that the machine was calibrated for. Learning is by definition a practice that is always in relation to another process—neither human nor machine ever “just learns,” but rather always “learns X.” Understanding X is an indispensable component of interrogating machine learnability.
Use-Case specificity: Ground-truthing the vocality of employee-fit
In “Biometric Imaginaries,” Kang (2022) points to the fluidity of voice to shed light on the fundamentally incompatible logics that voice and biometrics technologies operate through. This observation also extends to voice analysis systems, especially as they are examined through the aperture of ground truths. GTT makes the ontological complexity of voice as it relates to ML more explicit because it forces one to reframe voice analysis as a specific machine-learnable problem. It moves beyond conceptualizing the ontological tensions between voice and a particular technological form, and instead requires identifying a specific translation process that splinters voice into quantified features (e.g. frequency and speed) that are mathematically legible (e.g. spectrogram) to an ML application.
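To make the translation step concrete, the sketch below shows one way a sound could be splintered into the kind of mathematically legible form mentioned above. It is a minimal illustration under stated assumptions: the signal is a synthetic 250 Hz tone and not a voice recording, the sample rate and frame size are arbitrary choices, only frequency is covered (not speed), and the function names are hypothetical. It uses a naive discrete Fourier transform from the Python standard library, not any VIA company's pipeline.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this example)
FRAME_SIZE = 256    # samples per analysis frame (assumed for this example)


def synth_tone(freq_hz, n_samples):
    """Generate a pure sine tone as a stand-in for recorded audio."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n in range(n_samples)]


def frame_magnitudes(frame):
    """Naive DFT of one frame: magnitude of each frequency bin up to half the sample rate."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(signal):
    """Cut the signal into non-overlapping frames; return one row of magnitudes per frame."""
    starts = range(0, len(signal) - FRAME_SIZE + 1, FRAME_SIZE)
    return [frame_magnitudes(signal[s:s + FRAME_SIZE]) for s in starts]


def dominant_frequency(magnitudes):
    """Convert the strongest bin in a frame back into a frequency in Hz."""
    peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return peak_bin * SAMPLE_RATE / FRAME_SIZE


signal = synth_tone(250.0, 4 * FRAME_SIZE)
spec = spectrogram(signal)                    # a time-by-frequency table of numbers
peaks = [dominant_frequency(row) for row in spec]
print(peaks)
```

What comes out is a table of numbers indexed by time and frequency. Everything that does not fit that grid (who is speaking, to whom, in what setting, and why) has been left behind before any learning begins, which is the ontological reduction that GTT asks one to trace.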
As part of my fieldwork, I encountered a startup that offers voice analytics for a suite of behavior prediction, mental health diagnosis, and personality profiling applications. One of the use cases for its voice analytics system was to predict using ML whether an individual will successfully fit into an existing team for recruitment purposes. I spoke separately with the founder and chief scientist (Adam), a senior-level executive (Brian), and a mid-level manager (Colby) at this company 12 . I asked them questions about the design and efficacy of their system, some of which included what the voice signals are that would help make this employee-fit prediction, what the labeling process is like, and how voice data is collected. The responses I received ranged from a genuine “I don't know the specifics” (Brian) to “years of research have allowed us to correlate certain vocal features with certain behaviors and personalities” (Adam) and of course, “there are limits to how much we can disclose the details of how our systems work” (Colby). Simply put, the responses I received were overall unproductive in helping me understand exactly how the “magic” happens. While it is impossible to know the line-by-line details of how this company's version of employee-fit prediction works without extensive cooperation from the company, it is possible to retroactively deduce what the parameters for learning employee-fit entail via GTT: How would “fit” be measured?
On the most basic level, to “learn” employee-fit one must acknowledge that “fit-ability” is presumably a state determined by the relationship between the potential employee, the members of the team for which the employee is being considered, as well as the work, objectives, and motivations that drive the team. An adequately robust ground truth, then, would have to identify variables or classes that measure each existing employee's fit with this specific team. The “team” would also have to be bounded as a specific group of people who work together toward particular team objectives. Even before any voice analysis is introduced, a multitude of both technical and organizational problems surface.
First, the evaluation of existing employees’ fit-ability is an extremely challenging task. Developing a reliable evaluation standard for quantifying employee fit is difficult because it requires a nuanced and dynamic understanding of the team's unique culture and how it adapts to different situations, interpersonal dynamics between current members, and the interplay between potential team members’ various skills and those that are missing from or needed on the team. Knowledge of these criteria is difficult to taxonomize not only quantitatively, but also qualitatively because they represent unstable information that exists and changes across the minds of the employees. Not only that, but they are also only assessable via self-reports with the cooperation of the employees themselves, in which the high-stakes contexts of employment influence both self and peer evaluations and reports of fit-ability. In other words, there is an incredible amount of ontological dissonance as well as unavoidable conflicts of interest that crucially mediate the process of translating employee fit into a taxonomy of classes that can be used to predict the fit-ability of an outside candidate.
Even if it is assumed that such indices for measuring fit are achievable, these hypothetical parameters and datasets would have to be correlated with specific vocal features for each existing employee recorded and measured across various situations, with the additional assumptive caveat that the presence of audio monitors does not alter the ways in which the recorded individuals vocalize. In fact, considering the numerous situational (Eidsheim, 2018), physiological (Kreiman and Sidtis, 2011), and emotional (Scherer et al., 2003) variables that affect vocal expression, the attempt to correlate fit-ability measures to vocal features for the construction of a ground truth database would, on a practical level, border on impossibility. Even if it were assumed to be possible, it would still be a tedious, expensive, and error-prone procedure, as well as an undesirable one with regard to generalized scalability, considering the uniqueness and variability of each team.
So, how is this company in business? GTT shows that deploying ML for an employee-fit prediction system is unviable even if certain assumptions are made with regard to data accessibility. It is still confronted by one too many theoretical and practical complications. However, just as the criminality-prediction example demonstrated the utility of GTT to shed light on the broader contexts upon which an ML system's ground truth is contingent, it becomes important to recognize, here, that this employee-fit system is being developed in the setting of a technology start-up, an industry notoriously associated with both pressures and ambitions of high-growth and scalability. A connotative reading 13 (Poirier, 2021) thus allows one to understand that in this context, it is financially advantageous to develop a product that is generally applicable as opposed to one that is developed for unique situations. Both unsurprisingly and surprisingly, Adam informed me that their system is actually “not for a specific use case. It is for a more generic platform that collects the vocal input and then uses predictive analytics to correlate vocal features and patterns with different behaviors and personalities 14 .” This means that the company is claiming an AGI-adjacent system that transcends the specificity of a particular ML task with the capacity to process a voice and gain different insights ranging from “health, financial behavior, consumption patterns, and interpersonal communication tendencies” (Adam).
This framing makes sense from a financial standpoint—it would be impractical for the company to go through the different steps I outline above for each instance of integration. Instead, it would be in its best interests to redefine the various domains of prediction not as domains that are context-specific, but rather as generalizable patterns that can be statistically measured with a centralized model of vocal expressions and benchmark behavioral patterns. In the case of predicting whether an individual will mesh well with a team, then, employee fit is translated into a ground truth model of a “top candidate” that is correlated with vocal characteristics that they have deemed to be generally associated with “top candidacy.” This generalizing claim is also apparent in Adam's assertion that they are developing a technology that is not only “language independent” but also “culturally independent,” reiterating the widely observed but also heavily critiqued folk theory that the timbral qualities of voice/speech (as opposed to semantics) are stable and direct representations of an individual's “authentic” inner state—most commonly understood as race, class, and gender, but in this case, also top candidacy, along with the numerous characteristics that have been associated with these socially constructed classification schemes (e.g. Eidsheim, 2018; Kang, 2022; Sterne, 2003; Stoever, 2016). In Adam's own words, “voice is a direct link to your character; it can't be manipulated.” What this means specifically in the case of the employee-fit prediction system is that the ground truth for employee-fit vocality is not team-specific (nor specific to any language or culture), and instead based around a singular datafied model of vocal characteristics that supposedly represent the “ideal employee” who can “fit” into “any” team.
This is concerning not only because of the neoliberal ideologies that undoubtedly inform such data around “the ideal worker,” but also because even from a technical standpoint, its reductive conflation of “fit-ability” with “the ideal employee” for the purposes of constructing its ground truth does not adequately engage with any of the important intricacies involved in assessing the successful integration of a particular individual into a specific team. Even at its best, then—i.e. if vocal markers of “top candidacy” truly do exist and this company has identified them—what it functions as is a capitalist instrument that imports neoliberal standards of labor productivity, and then measures how well an employee fits that notion of the ideal neo-liberal worker. Of course, this is not entirely bereft of benefits from a corporate perspective, since regardless of the objectives of the team, on a broader level, it would inevitably operate under the ideological structure of a neoliberal workforce: an ideal neoliberal worker, if such an entity were to exist, would not not be a “fitting employee,” in the same way that the company's employee-fit prediction system does not not work. It thus constitutes a savvy business move that leverages the inscrutability of ML and couples it with an ambiguously functional system that allows the company to offer an “objective” data point for traditionally subjective decision-making contexts such as employee recruitment 15 .
What is important, here, is that gauging learnability via GTT as an analytic framework makes this translation process from “employee-fit” to “ideal employee” explicit. It allows one to understand that the company's employee-fit prediction system must redefine the problem from a practically unlearnable one to a practically learnable one, or at least one that can be actionably quantified. As I show, however, this translation process is often dependent on a series of assumptions, ontological reductions, and importations of external standards that not only change the form of the ML task but also introduce new problems, both social and technical. Indeed, like the move made by Liu et al. for their saliency detection model, this case requires reframing employee-fit into a binary model of “best” vs. the “rest,” where “best” is also defined through a particular ideological alignment. I thus present this example to bring attention to the granularity that a potential use case requires an ML system to account for, and how the friction between theoretical granularity and concerns around practicality can alter its ground truth. It is precisely because of this specificity that current ML systems cannot be discussed in separation from the contexts in which they are deployed.
In addition, this example is also representative of the flexibility of GTT as an investigative method. ML practices in the tech industry are notoriously protected by NDAs, which makes it difficult to consistently conduct close tracings of the actual datasets and labeling procedures that comprise the ground truth for a particular ML system. To augment or offer an alternative to other investigative methods (e.g. Poirier, 2021; Van Rossem and Pelizza, 2022), in which emphasis on a direct rigorous engagement with underlying datasets is justifiably correlated with decreased applicability and accessibility, GTT incorporates a retroactively deductive/logical process in which the end task itself presents the greatest clue as to how a ground truth might be established. This is not to say that the most robust and ideal iteration of GTT would not require direct engagement with the underlying datasets and labeling procedures, but rather to emphasize that even when faced with a relative lack of such information, GTT can still inform a systematic process for deducing the qualitative processes through which ground truth datasets are created 16 . The most realistic (though not necessarily ideal) way in which GTT might be applied is thus as an amalgamation of deductive reasoning and empirical tracing, in which there is room for a shifting balance between the two depending on how much data around the ML system is available.
Subjectivity, stakes, and power: Splintering the GTT-learnability framework
The GTT/learnability framework is a useful analytic tool to gauge the practical feasibility of an ML system. What is missing or unclear from it, however, is a discussion of subjectivity, stakes, and power. Although it explicitly foregrounds the question of “how is this made possible?,” it does not necessarily engage with questions of “who does this serve?,” “what are the stakes and for whom?,” and “should it be made possible?”
In attending to these questions, I momentarily return to Neisser's 1963 article on “The Imitation of Man by Machine,” because I believe his differentiation between explicitly “technical applications” and “social decision-making” (see Introduction) is an important one not only for understanding the epistemological limits of ML systems, but also their ethical ones. Contextualized in a GTT-learnability framework, the difference between “technical” and “social” can break down in the most basic sense to: ((a) “technical”) a problem or task in which the spectrum of possibilities is bounded, and performance or “success” is defined explicitly, vs. ((b) “social”) a problem in which the spectrum of possibilities is unclear, constantly changing, and contingent on broader social factors that result in a fluctuating, interpretive, or contestable standard of performance or success. Sorting the various examples mentioned in this article according to these two criteria makes evident the utility of the framework in gauging the applicability of ML to a particular task. OCR, for instance, fits relatively neatly into category “a,” while tasks such as saliency detection or employee-fit fall under “b,” but can potentially be reconceptualized (generating varying degrees of ontological dissonance) into a problem that fits “a.” Only once a problem is “acceptably” fitted to “a” can it be “solved” with ML. In other words, one key practice of ML practitioners is to conceptualize problems so that they fit “a” and not “b.”
Rachael Tatman 17 , an expert in natural language processing (NLP) with a PhD in computational linguistics, refers to “a” as the “boring” problem: in her words, “like plumbing.” She presents this simile not to view plumbing as a vocation or an infrastructure that is by any means trivial or uninteresting, but rather to express her hope that ML also becomes an important but somewhat taken-for-granted substructure that facilitates specific straightforward tasks. This is a deliberate discursive positioning against the sheen and hype that uplifts and drives so much of the current AI/ML industry. Expressing her concerns around the absence of a “Standard of Care 18 ” for software or ML engineers, Rachael was particularly skeptical and worried about ML systems that “make decisions about people,” which almost always fit in category “b.” As I emphasize throughout the paper, such ML tasks (e.g. Wu and Zhang's criminality-prediction system) often have their technical shortcomings exposed when examined via GTT. Ethical concerns, however, are still left unaddressed. Addressing them is especially vital given the lack of regulated guidelines, which means there are few mechanisms that require ML engineers or the broader organizations for which they work to not only acknowledge the deficiencies of a particular system, but also foreground a system's capacity for harm.
Rachael points out that this is exacerbated by the fact that the people about whom these ML technologies make decisions “don't have legal or practical power over those decisions.” Understood in relation to “automation bias,” which describes how people are more likely to believe automated decision-making systems and to ignore contradictory information made without automation, even if the non-automated information is correct (Skitka et al., 1999), using ML for making decisions about people ultimately means that power is consolidated among those deploying and using the systems, while it is taken away from those on whom the systems are used.
Such widespread lack of understanding around ML applicability and techniques exacerbates a corporate environment in which “spinning a good enough story will secure funding” (Rachael Tatman). A tech culture in which overly ambitious projects and magical thinking (e.g. Elizabeth Holmes and Theranos) can command immense capital and human resources contributes toward the increasing attraction to provocative narratives while moving us away from engaging with the complex realities and true potentials of ML 19 . As Rachael emphasizes, there is an urgent need to engage with both the practical possibilities of ML, as well as the broader stakes of its application.
David, the principal scientist at a “unicorn 20 ” conversation analytics company, shared similar concerns. He was adamant in his view that ML research and implementation should always “stay away from high-stakes industries such as healthcare or law enforcement.” According to David, this was one of the reasons he chose to apply his PhD training in ML to a “lower-stakes” context like customer service. David's story, which elaborates upon many of the concerns introduced by Rachael, presents an interesting example from which we can think through the interplay between a GTT/learnability framework and a subjectivity/stakes/power one. Or more specifically, how subjectivity/stakes/power can still change how one might understand an ML model that GTT shows to be relatively ontologically consonant with the qualitative phenomenon it is modeling.
Broadly speaking, David's work consists of using ML techniques to identify signals that are indicative of “sentiment,” “intent,” and “flow” of a conversation. Constructing a ground truth for a system that identifies “intent” is of course extremely difficult. It consists of initially collecting massive amounts of data from previous calls with known outcomes as well as drawing heavily from existing NLP literature and vocal expressions of emotion/sentiment to make correlations between specific linguistic patterns and audio signals with variously identified relevant “intents” (e.g. “refund,” “exchange,” and “balance check”). GTT makes explicit the extremely difficult and contextual nature of this process, but because in the case of the customer service industry, large internal archives of recorded conversations do exist, it is not necessarily impossible to find somewhat usable correlations between conversation patterns and intent.
That being said, David shared that the accuracy of predictions is still at a very low level—in his words, “slightly better than chance.” If the goal is to simply add a layer of analysis to help streamline incoming customer service calls, he said the technology can help. With regard to finding reliable and accurate signals that will consistently be able to recognize ambiguously defined concepts like “intent,” however, the research is nascent. Following this, he contrasted his work with that of software engineers and emphasized that there is minimal structure to how an AI/ML research team works because it is not about “fixing a specific problem or adding a feature”: it starts from a more fundamental state of turning something into a problem and then experimenting with ML to see if that problematization is the most efficient, or even the correct approach. He mentioned that because of this, a team focused on ML research never truly knows if it is on the right path. Mentioning the lengthy research periods and copious amounts of resources required for research and development in ML, he shared that the time and capital could arguably be better used for other purposes. These resources, of course, not only refer to the vast amount of computing power required to train ML models, but also to the global infrastructure of underpaid human annotators required to construct, maintain, and make usable the ground truth training data that scientists such as David work with (Elliott, 2021; Hao and Paola Hernández, 2022; Wang et al., 2022; Williams et al., 2022). In this way, examining ML systems through the aperture of ground truths also explicitly foregrounds the colonialist underpinnings (Hao, 2022) of the AI industry by shifting attention away from the sheen of algorithms, engineers, and industrial research labs, and toward the actual annotators who do the laborious work of building the ground truth datasets upon which all ML systems depend.
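To make the phrase “slightly better than chance” concrete, the following minimal Python sketch compares a classifier's accuracy against two simple baselines. It is an illustration only, not a description of David's system: the labels, the predictions, and the function names (`majority_baseline_accuracy`, `uniform_chance_accuracy`, `accuracy`) are all hypothetical, and the three intent categories are borrowed from the examples mentioned above.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def uniform_chance_accuracy(labels):
    """Expected accuracy of guessing uniformly among the observed labels."""
    return 1 / len(set(labels))


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground truth labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Hypothetical ground truth "intent" labels for ten recorded calls.
ground_truth = [
    "refund", "refund", "refund", "refund",
    "exchange", "exchange", "exchange",
    "balance check", "balance check", "balance check",
]

# Hypothetical model output for the same ten calls.
predictions = [
    "refund", "refund", "exchange", "balance check",
    "exchange", "refund", "exchange",
    "balance check", "refund", "exchange",
]

model_acc = accuracy(predictions, ground_truth)
majority_acc = majority_baseline_accuracy(ground_truth)
uniform_acc = uniform_chance_accuracy(ground_truth)

print(f"model accuracy:    {model_acc:.2f}")
print(f"majority baseline: {majority_acc:.2f}")
print(f"uniform chance:    {uniform_acc:.2f}")
```

In this toy setup the hypothetical model is right on half of the calls, but always guessing “refund” would already be right 40% of the time, so the model's advantage over the simplest baseline is only ten percentage points. What counts as “chance” therefore depends on how the ground truth labels are distributed, which is itself a product of the ground-truthing decisions that GTT traces.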
David also voiced his concerns about the lack of conversations AI-as-a-Service (AIaaS) companies in general have with the people they say they serve. In the context of his conversation analytics company, he expressed that augmentation and surveillance are two sides of the same coin. Because the ML team requires mining data from the conversations of customer service calls, there is constant audio monitoring of customers and call center agents. Not only is this a concern with regard to storing sensitive information and surveilling agents’ actions, but it also means that in efforts to formalize a task (customer service), the machine listening system is exploiting and commoditizing data gathered from the craft of customer service agents. Mentioning that there are very few standardized evaluations of the concrete benefits of these technologies, David said he is concerned that he is contributing toward a system that is sold based on a story of helping workers, but ultimately has the potential to make fungible those same individuals. After a pause, he said that he likes the general idea of empowering people with tools as long as there is sufficient dialogue and understanding between developers and users.
David's perspective is important because it both highlights and complicates the utility of the GTT/learnability framework. In his discussion of choosing customer service over other industries such as healthcare or law enforcement, he expressed the significance of considering what the stakes are for the ML work that is being conducted, regardless of how feasible it is. This continues in his contemplation on whether the ML tools he builds will actually help the individuals they were intended for, and further, if the resources required to sustain ML research and implementation are justifiable. It thus broadens the perspective from an ontology-informed interrogation of ground-truthing practices to an ethical consideration that attends to the varying power dynamics that mediate the network of actors upon which ML is dependent. Filtering the two aforementioned categories of GTT/learnability through subjectivity/stakes/power ultimately narrows down the range of applications for which ML is ideal.
Acknowledging that both GTT/learnability and the dimensions of subjectivity/stakes/power are better expressed as spectrums as opposed to discrete cells, Table 1 presents an intentionally simplified visualization representing how these questions around subjectivity/stakes/power mesh with and change the GTT/learnability framework. As is evident, the most ideal context is currently one in which there is a relatively bounded standard, which means that high-quality ground truth labels can be developed that do not require extensive translations from the original problem, and the stakes of both the conditions for ground truth construction as well as the effects of the ML system are low. Not only that, but this orthogonal analysis also complicates previously more “clear-cut” situations: a situation with a fluctuating standard might not always reflect a deficient ML system if the associated stakes of development and output are low, while a situation with a bounded standard might not always reflect an acceptable ML system if the associated stakes of development and output are high.
Table 1. Orthogonal relations between subjectivity/stakes/power and GTT/learnability.
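Since the table is intentionally simplified, its logic can also be sketched as a small lookup structure. The Python sketch below is one hypothetical encoding of the four cells rather than a reproduction of Table 1: the names `Standard`, `Stakes`, `LS_TABLE`, and `ls_reading` are assumptions, the cell wording paraphrases the preceding paragraph, and the reading for the fluctuating/high-stakes cell is an inference that the text does not state outright.

```python
from enum import Enum


class Standard(Enum):
    """GTT/learnability axis: how stable the standard of success is."""
    BOUNDED = "bounded"
    FLUCTUATING = "fluctuating"


class Stakes(Enum):
    """Subjectivity/stakes/power axis, collapsed to two values."""
    LOW = "low"
    HIGH = "high"


# Hypothetical paraphrase of the four cells. Only the first three
# readings are described in the text; the last one is an assumption.
LS_TABLE = {
    (Standard.BOUNDED, Stakes.LOW): "most ideal context for ML",
    (Standard.FLUCTUATING, Stakes.LOW): "not necessarily a deficient system",
    (Standard.BOUNDED, Stakes.HIGH): "not necessarily an acceptable system",
    (Standard.FLUCTUATING, Stakes.HIGH): "warrants the closest scrutiny (assumed)",
}


def ls_reading(standard, stakes):
    """Return the heuristic reading for one combination of the two axes."""
    return LS_TABLE[(standard, stakes)]


for (standard, stakes), reading in LS_TABLE.items():
    print(f"{standard.value:<12} standard, {stakes.value:<4} stakes: {reading}")
```

Collapsing each axis into two values is exactly the kind of strategic reduction the table performs; both dimensions are better understood as spectrums, so a lookup like this is a prompt for discussion rather than a verdict on any given system.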
In the same way that Jaton discusses the costs and benefits of ground-truthing practices, however, the simplified table above should not be taken as an end-all index against which the possibility and utility of all ML systems are measured. There are, after all, many highly debated cases in ML research such as autonomous driving, for which the stakes are extremely high, with its standards of learnability ranging anywhere from extreme variance to partial boundedness (depending on the metric by which one considers an autonomous driving system “successful”), that many believe drive rigorous academic and industry research. This table is not meant to serve as a shortcut for validating or discounting such work, but rather the opposite. It is to provoke further contemplation and add another dimension by which one may explicitly foreground the necessary ontological translation processes that are required to make a problem practically learnable, as well as foreground the broader social questions around “who/what is at stake?” and “who/what does this serve?,” with the ultimate purpose of more effectively engaging with and understanding an ML system.
It should also not be mistaken as any kind of groundbreaking addition to the field of ML as it represents a perhaps previously tacit but still prevalent thought process by which any ML engineer or scientist must approach an ML problem. Instead, taken together with the thoughts laid out in this paper, it should be understood as a strategically reductive heuristic through which ML engineers, data scientists, social scientists, humanists, activists, policymakers, and even business analysts interested in ML can collaborate, debate, and think about such systems together.
Ultimately, this Learnability-Stakes (LS) table is my response to a tendency I observe among some humanists, social scientists, and perhaps more importantly, policymakers/analysts to express ethical concerns around technologies with minimal discussion or understanding of their actual capacities, which I believe is ultimately not very different from the overly celebratory hype culture of the tech world (only from the other side of “fear”). The GTT/learnability framework is meant to act as a funneling tool to help non-technicians (such as myself) better understand the operational logics of ML and make explicit the cascade of decisions that massage a problem into a machine “learnable” one. On the other side, this LS table is meant to serve as a push for the developers working in the broader ML community to more explicitly foreground questions of subjectivity/stakes/power that humanists and social scientists have brought forth in engaging with these systems. As I mention throughout, the eventual hope is to build a just and equitable foundation for effective interdisciplinary dialogue and collaboration.
Conclusion
The relatively sparse mentions of “voice” and “listening” in this paper may come as a surprise to readers, especially considering the individuals with whom I conducted my fieldwork. This curiosity is not lost on me. And my short answer is that the absence is ultimately a reflection of the data. I found throughout my fieldwork that although everyone I spoke to was working in the “voice tech” industry, the ontological and epistemological complexities of voice were sidelined by discussions around the potentials of AI/ML for leveraging “voice” as a previously under-utilized domain of data analytics. At best, the messiness of voice was selectively grappled with as technical wrinkles that needed to be formalized into an ML problem. For the majority, however, folk theories of voice and listening or what Eidsheim and Meizel (2019) refer to as “vocal imaginaries” appeared to be sufficient in informing the industry-wide justification around the importance of voice as the “next frontier” of AI and ML.
Under the framework of ML, all qualitative phenomena break down into patterns and correlations. In the words of Mackenzie (2017: 73–74), vectorizing data “produces a common space that juxtaposes and mixes complex localized realities … Similarity and belonging no longer rely on resemblance or a common genesis but on measures of proximity or distance, on flat loci that run as vectors through the space.” Quantitatively flattened into vector space, “voice,” “emotion,” “intent,” “criminality,” “employee-fit,” “saliency,” etc., are all translated into correlatable data. Of course, not all translation processes are equally smooth, and examining each type of application through the lens of GTT/learnability makes explicit the degree of ontological dissonance that is produced in such translations. I sought to show in this paper that the most useful applications for not only machine listening technologies, but also ML systems writ large, are those in which the range of possible outcomes is bounded (e.g. games such as chess or Go, recognizing characters, or detecting objects) and the implied stakes of both the conditions for development and the various outcomes are low. This is not to say that other uses that do not entirely conform to these qualities, such as OpenAI's text-to-image generator DALL-E 2 (Ramesh et al., 2022) 21 , are deficient; rather, it is to present a frame of thinking that allows one to qualitatively engage with ML technologies in a systematized and grounded way. We often hear the phrase “there is no right answer” to many of life's most profound questions, and there is beauty in that uncertainty. We should not try to find answers to those questions with ML. Rather, ML might be best suited for those “boring” questions and bounded tasks, where there may in fact be a “right answer.”
Footnotes
Acknowledgement
I am grateful for the conversations I had with friends, colleagues, mentors, and interviewees in the writing and editing of this paper. These include, but are not limited to: Larry Gross, Josh Kun, John Cheney-Lippold, Amy Lee, fellow members of the Sloan-funded international research collective “Knowing Machines,” three anonymous reviewers, the editors at Big Data & Society, and all of the industry practitioners who kindly shared their time to speak with me.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article. This work was supported by the Annenberg School for Communication and Journalism, University of Southern California.
