Abstract
Drawing on a comparative case study of four landmark image datasets that led up to the creation of a new job—data annotators—I examine the laboratory origins of dispersible labor in the development of artificial intelligence (AI) technology. The rise of large-scale datasets cannot be attributed solely to gig platforms that supply already dispersed labor. To transform data annotation from an in-laboratory, expert task to a microtask that can be performed by data annotators without scientific expertise, scientists introduced into and consolidated through the scientific literature and laboratory life of AI research new organizational repertoires of bureaucratic, centralized, algorithmic, and corporate control that each made data annotators’ work more dispersible. By bringing back the organization and control of scientific labor as a missing link between the technical and social dimensions of AI, this study has implications for research on hidden labor, algorithmic biases, and the future of knowledge work.
Large-scale data of human behaviors and the human labor that produces these data are now recognized as essential components of artificial intelligence (AI) systems. An emerging sociological literature has documented how such labor is often dispersed through gig platforms and, as a result of its dispersion, hidden behind the scaling-up and commercial hype of AI technology (Denton et al. 2021; Gray and Suri 2019; Joyce et al. 2021; Paullada et al. 2021; Shestakofsky 2017; Vallas and Schor 2020). But data work—the generation, annotation, and verification of human data for such algorithms—was not a job designed to be dispersed or even carried out by nonscientists outside of the laboratory until very recently.
Sociologists traditionally treated the production of scientific data as an expert task to be carried out inside the laboratory (Cetina 1999; Collins 1998; Latour and Woolgar 1986; Owen-Smith 2001). Indeed, in the early years of AI research, datasets, if any, were curated in relatively small volumes by small teams of scientists in laboratories. From these laboratories that used to trust only expert work and value algorithmic solutions has nevertheless emerged a multibillion-dollar, rapidly globalizing data annotation industry that delegates knowledge production to unskilled labor (Murgia 2019a). Yet this historical transformation of data annotation, from a task that requires scientific expertise to a form of dispersed labor that can be outsourced to gig workers, remains unaccounted for by the sociology of AI. How did scientific labor become degraded and dispersible behind the development of AI technology?
Drawing on a comparative case study of four landmark image datasets that led up to the creation of a new occupation—data annotators—this article calls for attention to the laboratory origins of dispersible labor in the prehistory of AI technology. I argue that the rise of large-scale datasets in AI research cannot be attributed solely to gig platforms that supply already dispersed labor. To bring the deep learning paradigm out of clean, curated laboratory environments into the messy, uncurated real world, scientists also had to transform the social organization of data annotation in the laboratories even before AI became a black box technology ready for commercialization. Under the guise of methodological innovations, scientists introduced into and consolidated through the scientific literature and laboratory life of AI research new organizational repertoires of bureaucratic, centralized, and algorithmic control that each made data annotation more dispersible. By taking into account how scientific labor is reorganized, redistributed, controlled, and eventually degraded, this study identifies a missing link between the technical and social dimensions of AI technology that has implications for research on hidden labor, algorithmic biases, rising corporate power, and the future of knowledge work.
The Historical Transformation of Data Annotation
Sociologists of knowledge work have long recognized that science is a social organization of work (Barley 1996; Latour and Woolgar 1986; Owen-Smith 2001). However, existing research tends to focus on professional communities (Abbott 1988; Menchik 2021) and academic laboratories (Cetina 1999; Collins 1998; Latour and Woolgar 1986; Owen-Smith 2001) for prototypes of knowledge workers. In a seminal study of how scientific data are produced in Italian and American gravitational wave laboratories, Collins (1998) shows how open versus closed evidential cultures help scientists distinguish between interesting data and uninteresting noise in their everyday work. Scholars have also called for more attention to the invisible labor of technicians (Barley 1996; Barley and Bechky 1994; Plantin 2019; Shapin 1989), lay experts (Epstein 1995; Eyal 2013), and industry representatives (Menchik 2021), but these quasi-experts generally participate in and negotiate with epistemic cultures and forms in similar ways as scientists do. The core of this literature, as Latour’s (1983:141) incisive criticism points out, still puts a premium on “what scientists do inside the walls of these strange places called laboratories.”
Although models of expert work may help explain how AI scientists in the 1960s and 1970s created handmade block-like objects to represent geometric shapes, seeing the social organization of scientific labor as an epistemic problem leaves a widening gap as the emerging occupation of data annotators becomes an important theoretical puzzle and regulatory challenge (Alegria and Yeh 2023; Hoffman et al. 2022; Joyce et al. 2021). To create an image dataset for AI research today, scientists have to collect a large number of images that contain a large variety of objects and, in a process called “data annotation,” attach to each of those images “ground-truth” labels that help the algorithm learn what humans would see in the image, such as which types of objects are and are not present. With this exponential growth in the scale and variety of datasets, there arose a mounting demand for labor power that laboratories could no longer accommodate. Data annotators are workers outside the laboratories with no scientific training and to whom scientists outsource the work of data annotation (Denton et al. 2021).
An emerging literature in the sociology of AI and work importantly reveals how the AI hype obscures the exploitative labor conditions of gig workers, including data annotators, behind the scaling-up and commercialization of AI technology. Shestakofsky (2017) finds AI technology relies on large numbers of human workers for supplemental computational labor and emotional labor. Gray and Suri (2019) draw on rich qualitative and quantitative evidence on Amazon Mechanical Turk and other gig workers to demonstrate the unprecedented scale and dire work conditions of “ghost work” in the contemporary digital economy. Recent sociological work on gig work takes a step further to theorize how digital platforms created a spatially dispersed but managerially centralized and even despotic regime of labor and, under this regime, a global hierarchy of different types of employment for different levels of dispersed labor (Griesbach et al. 2019; Kornberger, Pflueger, and Mouritsen 2017; Vallas and Schor 2020; Wood et al. 2019).
Although this literature on hidden labor draws urgently needed attention to new forms of workplace and labor market inequalities as a result of AI technology, it has not sufficiently explained the emergence of dispersed labor itself. On the one hand, data annotation is not just any precarious job that technology has created. It is the precarious job that made AI work for the first time as a functioning technology for the real world (Denton et al. 2021). It is also a job that, unlike in traditional industries where knowledge work and production floors are strictly separated (Braverman 1974), is still today under close control by the disciplinary values and discursive power of scientists (Paullada et al. 2021; Scheuerman, Hanna, and Denton 2021). On the other hand, the availability of gig platforms also does not explain why a group of scientists who dedicated their careers to finding algorithmic solutions with minimal human intervention would turn to nonexpert labor for their scientific challenge of making AI work. How did a laboratory science that used to rely on expert labor give rise to an AI industry that controls and exploits the dispersed labor of data annotators?
Without a sociological framework for understanding how the technical development of AI is embedded in the social organization of data annotation as a new job, we are missing important conceptual tools for regulating an open, sociotechnical system. For example, algorithmic biases are dismissed as bad input into good algorithms, the results of individual technologists and data annotators who “contaminate” datasets with human subjectivities and prejudice. But they are also the results of structural conditions that allow certain forms of labor to be performed and types of biases to be left uncorrected. To answer “who created AI technology and how” (Renski et al. 2020), we need to understand data annotators’ job not merely as a social externality of technology but as an emerging social organization of knowledge work that was constructed and consolidated to be an indispensable infrastructure for the development and commercialization of AI.
The Laboratory Origins of Dispersible Labor
A good way to understand how an expert task can possibly become a dispersible job is to trace historical changes in the control of work. Experts control their own work, and the work of others when work has to be distributed, by controlling the knowledge base that defines, conceptualizes, and thereby legitimates their monopoly over certain tasks (Abbott 1988). Owen-Smith (2001), for example, finds that in a neuroscience lab, scientific skepticism shared by senior scientists, junior scientists, and technicians becomes a mechanism of control over their work across hierarchies. But when the workers to whom knowledge work is delegated do not share the same professional training or laboratory space with the experts, the epistemic problem becomes an organizational problem of control over unskilled workers.
A rich tradition in the sociology of work has offered great insights into how employers regulate and extract effort from workers in different ways and how direct and despotic control by foremen and frontline managers under Taylorism (Edwards 1979) has developed into indirect control by consent under Fordism (Burawoy 1979; Sallaz 2015, 2019) and algorithmic control by conceding autonomy and flexibility in recent years (Griesbach et al. 2019; Vallas and Schor 2020; Wood et al. 2019). Although scholars have pointed out that algorithmic control may have selectively combined elements from previous regimes to create a paradoxical system of centralized control and dispersed work (Vallas and Schor 2020; Wood et al. 2019), it remains unclear where and how algorithmic control was created and what role scientists played in making the labor of data annotators dispersible.
Laboratories provide one possible site where different forms of control can be brought together and selectively recombined to create the social organization needed for dispersible labor (Latour 1983). To reconcile scientists’ amplified agency through the laboratory (Latour 1983) with structural conditions that facilitate certain forms of control, I introduce what I call the scientists’ repertoires of control: an ensemble of infrastructures and techniques that historically became available to scientists for distributing, motivating, monitoring, and evaluating scientific labor as they sought to solve their epistemic problems in the laboratory. Drawn from the literature on collective action (Clemens 1993; Tilly 2013) and cultural toolkits (Swidler 1986), the concept of repertoire captures the entanglement between the laboratory and global capitalism by helping us precisely envision scientists’ bounded agency in shaping this relationship.
Through the lens of repertoires, scientists choose between, make use of, and recombine alternative organizational, cultural, and technological models that are made available to them for interpreting and acting on a situation. The availability of those models varies from one historical context to another, but it is scientists’ decisions in the laboratory to activate one as opposed to the other that bring these disjoint factors together. Their solutions to epistemic problems, through their success in the laboratory as leverage and testing ground (Latour 1983, 1993), in turn, lay the groundwork for organizing and situating the job of data annotators in global capitalism. Although historical transformations such as the emergence of dispersed labor in science often appear to us as an exogenous shock that just happened, these repertoires help reveal how scientists constructed and accelerated the great separation of data annotation from its laboratory origins.
Case and Methods
Computer vision is an interdisciplinary subfield of AI that aims to computationally model and mimic human vision (“to build computers that see”). It is an ideal empirical case because it is where the deep learning paradigm achieved its first big success. Over the past decade, the field has undergone radical transformation and commercialization in areas such as facial recognition, self-driving vehicles, and robotics. With the deep learning revolution, computer vision has grown out of the niche interests of university laboratories and security agencies to become a highly competitive, multibillion-dollar market that catches the attention of venture capitalists, multinationals, governments, and the popular media. Yet despite the availability of high-quality, large-scale image and video data, human ground truth remains costly and labor-intensive to curate.
At the same time, instrumental to multiple controversial industries, computer vision is also where ethical and regulatory debates are most heated. In 2019, Stanford University, Duke University, and Microsoft removed from their websites three open-access facial recognition datasets after the Financial Times reported the potential intrusion of privacy during data collection and the alleged uses of these data for military and/or repressive purposes (Metz 2019; Murgia 2019b). The extensive use of facial recognition technology in policing in the United States has also drawn concerns and criticisms from scholars and activists because it has resulted in mass surveillance that is invasive, largely unregulated, and disproportionately targeted at marginalized populations (Brayne 2017; Hill 2023; Qian et al. 2022). In short, computer vision encapsulates both the epistemic and ethical dilemmas of AI.
To understand how scientists’ repertoires of control have transformed data annotation work in the historical development of image datasets, this article draws on a comparative case study of MNIST (1994), Pascal VOC (2007), ImageNet (2009), and Google Open Images (2018)—four landmark image datasets in computer vision research. The four datasets were selected from a broader study of 25,689 open science articles and 546 datasets introduced in those articles. The selection was based on several methodological considerations. First, object detection, where the goal is to detect and locate objects of interest in an image that could contain any number of different objects, has historically been a core interest of computer vision research since the beginning of the discipline and the gold standard for measuring progress in the field. Object detection is also what distinguishes the core interest of computer vision from that of machine learning methodologists. Early machine learning methodologists were more interested in image classification—classifying the whole image into different categories—than understanding what humans can see in the image.
Second, these four datasets are among the most widely used datasets in computer vision research. Whereas ImageNet has been the go-to case for algorithmic bias research and the sociology of AI, I extend the usual single-case study to a comparative analysis that situates ImageNet in the historical development of image datasets. Third, these four datasets offer the most detailed documentation of the underlying dataset creation process, a source of data widely used in the literature (Denton et al. 2021; Miceli et al. 2022; Miceli and Posada 2021, 2022; Scheuerman et al. 2021), and corroborating accounts through other primary and secondary sources. Finally, although the purpose of the study is not to enumerate all possible organizations of work or to generalize to the production of all datasets in AI, the four datasets represent useful ideal types because they pioneered the popular paradigm of dataset curation at the time.
This study draws primarily on scientific documentation and digital traces of the design, creation, usage, and ethics of the four landmark datasets. Although participant observation is the gold standard for studying the control of work, it is not an option for a retrospective study of datasets created under work conditions that are historical and can no longer be observed. In place of the unmatched depth of ethnography, a comparative case study takes advantage of the rich deposit of archival data in science to offer invaluable breadth and retrospective reach when we need to go back in time to trace a historical development with expansive temporal and geographic spans. It also helps set the stage for future, more targeted research on the labor process of data annotators in the AI industry and situates ethnographic findings in a broader and more coherent historical narrative.
The original articles that introduce the datasets and the authors’ websites and GitHub repositories that host the datasets typically provide the most detailed documentation on their design, usage, and production process. Although methods sections are often written with a significant amount of retrospective sensemaking, scientists are surprisingly forthcoming about the ways they organized labor in the original articles and in retrospective accounts such as interviews, because labor is rarely a point of discussion (or contention) for their readers but methodological transparency is. I also analyzed relevant research papers, presentations, interviews, newspaper articles, and other forms of secondary accounts by researchers and journalists that give backstories not typically told in articles and official documentation. More recent articles that use those datasets or criticize bias in those datasets are also included.
To establish a “descriptive framework” (Yin 2017) for comparing similarities and differences in “control of work,” I draw from one year of field observations and 17 informant interviews with AI researchers in the early phase of a larger study of the rise of corporate open science in computer vision. These data were mainly collected between August 2022 and July 2023, with some preliminary online observations in the prior year during the pandemic. Although none of the participants with whom I interacted directly participated in the creation of the four datasets under study, they offered important insights into how scientists create, use, evaluate, and think about similar datasets and datasets in general. Offline observations and interviews were conducted in Silicon Valley and Beijing, two leading and competing scientific and commercial hubs that are nevertheless deeply connected by flows of knowledge and migrant labor, and at CVPR, the top academic conference in computer vision, which attracts thousands of academic and industry scientists and representatives every year.
At these locations, I observed graduate seminars, industry events, conference presentations, invited talks, and occasional formal and informal social gatherings of faculty and PhD students. Online sites include social media and collaborative platforms where conversations and debates about datasets occur most often among computer vision researchers: Twitter (English), Reddit (English), GitHub (English), and Zhihu (Chinese). MNIST, Pascal VOC, and ImageNet are frequently referenced in discussions about computer vision research across multiple settings. Although Google Open Images came up only once in class, Google and other companies’ role in creating datasets was nevertheless a recurring theme in conversations among scientists. These observations show how despite a lack of conceptual language, computer vision scientists understand data production as a problem of labor control.
Talks of Labor in the Making of Image Datasets
Labor was not a problem that people recognized from the beginning, when creating computer vision data required no more than a couple of graduate students working in their laboratory. As in a story often told by computer vision researchers themselves, the field of computer vision began in the 1960s at MIT with Lawrence Roberts’s 1963 dissertation on “Machine Perception of Three-Dimensional Solids” and Seymour Papert’s Summer Vision Project to construct “a significant part of a visual system” by the end of summer 1966. Both founding studies of the field, far from solving all problems in computer vision, were uncannily focused on matching two-dimensional line drawings to three-dimensional geometric shapes. As a result of their geometric approach to vision, the first generation of computer vision researchers from MIT would set up in their laboratories balls, bricks, cylinders, and other handmade block-like objects and build worlds of blocks that nonetheless in no way resembled real-world objects. As one of them recalled: “In fact, it has been said that the M.I.T. focus on line drawings kept back the field of computer vision from realistic image analysis tasks” (Shapiro 2020:112). However, it was the dominant approach at the time partly because a few graduate students could easily supply the labor and control the quality required for handmaking blocks and scenes based on the limited number of most common geometric shapes.
Today, as computer vision research has moved away from geometric shapes to real-world objects and scenes, the traditional laboratory model of dataset curation has fallen apart: researchers now seek to train machine learning algorithms on real-world objects that are not reducible to geometric shapes, and the number of images required for object detection datasets has grown into the tens of thousands and beyond. In its place is an ever-growing repertoire of computer hardware and software, including high-performance computers, motion-capture cameras, high-resolution display devices, and software packages, to create, store, display, and process millions of images and videos of objects, persons, and movements in the real world. As some computer vision researchers joked about “the dataset paradigm of AI,” these large-scale, real-world datasets stand in stark contrast to their earliest predecessors in both variety and scale and have now become the blood and bones of contemporary object detection research.
Ethnographic evidence shows that this epistemic transformation was marked by scientists’ excitement and frustration with the huge investment that they had to make in creating datasets to compete with companies that have either abundant high-quality consumer data or readily available labor from their business operations to hand curate data. Academic laboratories can no longer accommodate the surging demand for labor for collecting and labeling images at an unprecedented scale and for extending data sources and evaluative frameworks beyond university campuses to the real world. That was when scientists started to recognize that the real challenge was not epistemic but organizational. A PhD student recalled how he spent most of his time managing undergrad data annotators when he first entered the field as a graduate student at a top Chinese lab in 2016, several years after ImageNet’s big success with gig workers:

Before we started to have all the open-source datasets, doing AI research in a university was like running a small sweatshop of undergrads just to produce the datasets we needed . . . I felt like I was running a startup [rather than a research project]. I think the shop is still up and running even today.
Labor can be crucial to the success of a project even at Oxford University’s Visual Geometry Group, a traditional and international powerhouse of AI research. Parkhi, Vedaldi, and Zisserman (2015), for example, report that with a creative combination of an Internet Movie Database (IMDb) celebrity list, Google and Bing Image Search, and a set of automatic filters to remove duplicates, a small annotation team spent the equivalent of 14 days of manual effort to create a high-quality dataset of 2.6 million images from 2,622 celebrities.
Despite scholarly and media attention on how gig workers made possible explosive growth in both the scale and variety of image data (Denton et al. 2021; Murgia 2019a), the transformation is not simply a history of industry taking over by their sheer volume of resources and power to mobilize dispersed labor. Laboratories’ ability to attract and organize the quasi-expert labor of students allows academic researchers to not only stay in the game but also gain unique competitive advantages against industry. This is especially true in subfields where marginal return to data quality is disproportionately high or data annotation still cannot be transformed into dispersible tasks. Robotics is one such field that particularly attracts computer vision researchers because of its heavy use of image data, as one PhD student explains:

Many computer vision researchers, including my advisor, are turning to robotics . . . partly because robotic data are still too costly for companies to produce at large scale. . . . You have to have a running robot that you can sell to real users in real homes. Only then you can train robotic algorithms on these real-world data at a large scale. [Before that happens] robotic research will rely on laboratory data.
Together, these findings point to the growing influence of gig platforms and the tech industry on computer vision researchers’ thinking of data and the central role of labor that scientists started to recognize even in laboratories. What types of data are included in image datasets? Whose labor is being used to create these datasets? How do scientists organize different types of labor? These talks of labor offer empirical groundings for my framework of comparison.
Table 1 summarizes the comparison. As image datasets grow in scale and variety, computer vision scientists draw on new organizational repertoires of control that became available to them to introduce bureaucratic, standardized, and dispersed forms of labor into computer vision laboratories and methodologies. A combination of these repertoires eventually allows the AI industry to redraw the division of scientific labor and transform data annotation from a scientific task to a microtask that can be not only outsourced to but even specialized in by data annotators with no scientific training. These different repertoires of control at the same time accommodate different types of algorithmic bias.
Table 1. Scientists’ Repertoires of Control in the Historical Transformation of Data Annotation in Artificial Intelligence Research.
Bureaucratic Control: MNIST (1994)
Although many commentators assumed that the use of Amazon MTurk workers for data annotation was the turning point in the history of data work, the idea of outsourcing data collection and annotation had existed long before gig platforms. Computer vision researchers’ first move away from PhD students in laboratory settings took advantage of the readily available bureaucratic repertoires of indirect control provided by government agencies that became interested in AI technology in the early 1990s.
MNIST, the Modified National Institute of Standards and Technology database, introduced in an article by LeCun et al. (1998), is a dataset of 70,000 handwritten digits in grayscale images with a size of 28 × 28 pixels, combining two datasets the National Institute of Standards and Technology (NIST) compiled in 1990 and 1992 from digits written by high school students and Census Bureau employees. It not only marks object detection research’s first step toward real-world datasets but also has had a long-lasting impact on the field as still one of the most popular benchmark datasets for testing radically new machine learning algorithms, used in 1,228 open access machine learning papers in 2021 and 900 papers in 2022.
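For readers unfamiliar with the technical form these data take, the sketch below illustrates how the published MNIST files can be read: each record is nothing more than a 28 × 28 grid of grayscale pixel values paired with a ground-truth digit. The sketch is illustrative rather than drawn from the original article, and the file names assume a locally downloaded and decompressed copy of the dataset.

```python
# Illustrative sketch, not from the original article: parsing MNIST's published
# IDX files, which store each handwritten digit as a 28 x 28 grid of grayscale
# pixel values, alongside a separate file of ground-truth labels.
import struct
import numpy as np

def read_idx_images(path):
    with open(path, "rb") as f:
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        # Each image is a flat run of rows * cols unsigned bytes (0 = background, 255 = ink).
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows, cols)

def read_idx_labels(path):
    with open(path, "rb") as f:
        magic, n = struct.unpack(">II", f.read(8))
        assert magic == 2049, "not an IDX label file"
        return np.frombuffer(f.read(), dtype=np.uint8)

# File names assume a locally downloaded, decompressed copy of the dataset.
images = read_idx_images("train-images-idx3-ubyte")  # shape: (60000, 28, 28)
labels = read_idx_labels("train-labels-idx1-ubyte")  # ground-truth digits 0-9
```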
The NIST is an agency of the Department of Commerce tasked in the late 1980s and early 1990s with providing objective evaluation of emerging computer technologies being sold to the government for recognizing zip codes, postal addresses, and human faces in government databases (Garris 2018). It was also a period when machine learning algorithms started to show promise in recognizing simple patterns in real-world visual data that were becoming increasingly available through government databases (Gates 2011; Lecun et al. 1998). Handwritten digits and letters, as a special case of pattern recognition and as a task so common in bureaucratic work, offered an ideal test ground that interested both the federal government seeking new technology and computer vision researchers lacking labor power to create larger datasets. The NIST’s interest and investment in evaluating machine learning algorithms plugged scientists into the massive repertoire of bureaucratic labor that the federal government had at its disposal for the otherwise impossible task of collecting and labeling 70,000 handwritten digits.
To solicit handwriting images and the corresponding ground truth labels, the NIST mailed out thousands of handwriting sample forms (Figure 1) with return envelopes, first to high school students and later to census field staff across the United States. These forms, available in 100 different templates, provided the ground truth—carefully balanced and randomized mixtures of digits and lower and upper case letters in print—and requested the recipients to produce the data by copying the printed digits and letters in the boxes below by hand (Garris 2018). A geographically diverse sample of writers returned 2,100 handwriting sample forms to the NIST. The NIST lab technicians would sit for long hours to scan and crop the handwritten digits and letters from those forms and to go through the isolated digits to identify obvious errors (Garris 2018). In 1990, the NIST released Special Database 1 to the public domain, which contains 58,527 digit images from 500 different writers across the United States. The NIST has since released updates and extensions and eventually scaled to 3,699 handwriting sample forms and 814,255 segmented and labeled character images in 1995 with Special Database 19 (Garris 2018; Grother 1995). LeCun and his coauthors (1998) would then further curate NIST Special Databases 1 and 3 to create MNIST in a seminal study that first demonstrated the effectiveness of machine learning algorithms in digit recognition, a special and highly simplified case of object recognition (Deng 2012).

Figure 1. Handwriting sample forms used by the NIST to solicit handwritten digits from high school students and Census Bureau employees that eventually became the MNIST dataset.
The creation of MNIST encapsulates the scientists’ first success in incorporating nonscientific labor and envisioning an organizational form that redistributes scientific labor to a geographically diverse group with no scientific training. The labor of NIST lab technicians to mail out thousands of handwriting sample forms and to sort, crop, and clean handwriting on the returned forms was instrumental to the dataset’s success. Also crucial to the success of MNIST was the fact that unlike graduate students whose labor is dually bounded by the competing demands of the project in hand and their scientific careers, the NIST could quickly redirect its staff by making this contingent task of collecting handwritten digits and letters the basis of their performance evaluation and rewards (Garris 2018), a form of indirect control well documented in the labor process literature (Burawoy 1979; Sallaz 2019). By aligning their research closely with an emerging market for pattern recognition technology selling primarily to government agencies, researchers were able to tap into the government’s vast pool of bureaucratic labor and mature regime of control to collect, process, and calibrate large administrative data.
Bureaucratic labor has its limits. Although MNIST was object detection’s first success in scaling datasets to the tens of thousands, and one with a long-lasting impact, this success came at the price of limited variety: only images that the government was good at collecting and had an interest in providing. Not all object detection research can be translated into the recombination of bureaucratic tasks that government agencies are willing to take on and that are legible to the state (Scott [1998] 2020). As a result, the bias of the state inevitably infiltrated computer vision research. In theory, the government provides access to a large scale and variety of data on the national population and to a coincidentally skilled data labor force. But in reality, this population is often heavily skewed toward either rich, White, educated, suburban neighborhoods that are highly visible to the state or poor, Black, undereducated, urban neighborhoods that attract the interests of criminal justice institutions. The more facial recognition algorithms, for example, target police data and incorporate police labor into data annotation, the more these systems are structurally vulnerable to integrating structural disparities in the criminal justice system and facilitating police violence.
Centralized Control: PASCAL VOC (2007)
A key obstacle to scaling up data production inside the laboratory was how to streamline and standardize decisions rather than leaving them to each lab member’s discretion. Although the resemblance between computer vision laboratories and small entrepreneurial shops of the nineteenth century makes laboratories perfect sites for a regime of direct, centralized, and often personal control by the principal investigator (Edwards 1979), the promise of apprenticeship for an independent scientific career tends to prevent such a regime from arising or persisting organically. Treating PhD students not as future scientists but as waged labor that needs close supervision would not only jeopardize the creativity of the lab but also drive away prospective students. Computer vision researchers had to construct temporary yet legitimate situations where direct, centralized control was permissible and standardization of data work was possible. They drew from the organizational repertoires of government and corporate bureaucracy: paperwork, guidelines, training sessions, and interorganizational collaboration.
PASCAL Visual Object Classes (VOC) 2007 is another landmark in object detection research. Introduced in 2007 by a team of computer vision researchers based at universities in the UK, Belgium, and Switzerland (Everingham et al. 2010), PASCAL VOC provides 9,963 images containing 24,640 annotated real-world objects of 20 different classes, including persons, 6 types of animals (bird, cat, cow, dog, horse, sheep), 7 types of vehicles (airplane, bicycle, boat, bus, car, motorbike, train), and 6 types of indoor objects (bottle, chair, dining table, potted plant, sofa, TV/monitor). PASCAL VOC was also created as the benchmark dataset of the PASCAL VOC challenge that ran from 2005 through 2012. Now any research team could “train and compare their algorithms on a consistent set of data with the same objects” (Shapiro 2020:116), and those objects were diverse and drawn from the real world. It was also on this dataset that computer vision researchers first convincingly demonstrated that the groundbreaking performance of deep neural networks on ImageNet also applied to another dataset. PASCAL VOC then served as the de facto gold standard for evaluating almost every new object detection algorithm until Microsoft COCO started a new paradigm in the late 2010s (Girshick 2015; Girshick et al. 2013).
Like most traditional, small-scale datasets, PASCAL VOC is an in-house academic dataset created in university laboratories through a collaboration of 5 faculty authors and 10 graduate student annotators from the University of Leeds, KU Leuven, University of Edinburgh, Microsoft Research Cambridge, and the University of Oxford. At the time, researchers were actively looking for ways to scale dataset production, especially by outsourcing labor through commercial venues and web users. PASCAL VOC’s closest predecessor and competitor—the “LabelMe” dataset—was created at MIT through a web-based interface that encourages untrained web users to casually contribute and share annotations but suffered from incomplete and inaccurate annotation of objects (Everingham et al. 2010; Russell et al. 2008). The design of PASCAL VOC was motivated by the high efficiency as much as the low quality of these early forms of outsourced labor. If faculty are too costly and crowd labor is too unreliable, the ideal annotators should at least follow a carefully curated guideline that ensures everything that could be annotated with confidence is annotated (Everingham et al. 2010).
PASCAL VOC’s most important contribution, as a computer vision researcher would later claim in his talk, would lie in the guidelines (Figure 2) it created and validated for data annotation work. In other words, PASCAL VOC created an organizational structure for controlling data annotating labor. The article also unprecedentedly lays out the labor process of dataset curation in detail:

Consistency was achieved by having all annotations take place at a single annotation “party” at the University of Leeds, following a set of annotation guidelines which were discussed in detail with the annotators. The guidelines covered aspects including what to label; how to label pose and bounding box; how to treat occlusion; acceptable image quality; how to label clothing/mud/snow, transparency, mirrors, and pictures. The full guidelines (Winn and Everingham 2007) are available on the WWW. In addition, during the annotation process, annotators were periodically observed to ensure that the guidelines were being followed. Several current annotation projects rely on untrained annotators or have annotators geographically distributed e.g. LabelMe (Russell et al. 2008), or even ignorant of their task e.g. the ESP Game (von Ahn and Dabbish 2004). It is very difficult to maintain consistency of annotation in these circumstances, unlike when all annotators are trained, monitored, and co-located. (Everingham et al. 2010:309)

Figure 2. PASCAL VOC 2007 annotation guidelines (Everingham et al. 2010).
Annotating PASCAL VOC 2007 took 10 annotators, working at these annotation “parties” under the close supervision of faculty authors and following the annotation guidelines, a total of 700 person-hours. These structures and guidelines not only standardize data annotation and improve data quality but also draw boundaries between the labor of scientists, which entails making decisions on every image, and the labor of annotators, who are reduced to nonscientists implementing the decisions that have been made. This new organizational repertoire for dividing scientific labor helped create a hierarchical structure and centralized control over labor in the computer vision laboratory disguised as methodological innovations. In other words, although computer vision datasets are often constructed as representing human beings’ “natural” and “innate” way of seeing, the annotators’ actual way of seeing is not only implicitly shaped by the sociotechnical context (Denton et al. 2021) but also quite explicitly shaped by the guidelines and supervision that they receive.
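The standardization these guidelines imposed is visible in the annotation records that PASCAL VOC released alongside the images: every decision an annotator was permitted to make (object class, pose, truncation, whether a case was “difficult”) is reduced to a field in a uniform record. The short sketch below, which assumes the standard VOC XML layout and a hypothetical file path, shows how little discretion remains once the guidelines are encoded.

```python
# Illustrative sketch (not from the original article): reading one of the XML
# annotation records distributed with PASCAL VOC, whose fields mirror the
# annotation guidelines. The file path below is hypothetical.
import xml.etree.ElementTree as ET

def read_voc_annotation(path):
    root = ET.parse(path).getroot()
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append({
            "class": obj.findtext("name"),           # one of the 20 VOC classes
            "pose": obj.findtext("pose"),            # e.g., Left, Right, Frontal
            "truncated": obj.findtext("truncated"),  # guideline rule for occlusion/truncation
            "difficult": obj.findtext("difficult"),  # flagged per guideline, not left to discretion
            "bbox": [int(box.findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax")],
        })
    return root.findtext("filename"), objects

filename, objects = read_voc_annotation("VOC2007/Annotations/000001.xml")
```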
The centralization and standardization of annotation labor also enabled researchers to scale the operation of their own laboratories by recruiting an unprecedented number of research assistants and adopting a hierarchical organizational structure. Faculty and, increasingly, PhD students, who used to curate the data themselves, would play the managerial role of dividing the tasks into subtasks that require as little decision-making as possible and of training and monitoring undergraduate research assistants to go out of the laboratory and take photos of objects and scenes in the real world and to download and label images from the internet. Did the standardization of annotation labor help the scientific careers of the PhD students who provided the labor? Among all 10 annotators acknowledged on the PASCAL VOC webpage, none holds a faculty position, and only 3 have remained research-active as corporate researchers by 2023.
As laboratories eventually became outscaled by the demand for even larger datasets, the shift that PASCAL VOC exemplifies toward a division between scientists and annotators, supported by the creation of guidelines and training for controlling annotation labor, opened up the possibility of outsourcing annotation labor through gig platforms and of synchronizing scientific labor with the hierarchical control of labor prevalent in the market for curating even larger-scale datasets, as the authors themselves had already envisioned in the PASCAL VOC article:

Possibilities include recruiting help from a much larger pool of volunteers (in the footsteps of LabelMe), combined with a centralized effort to check quality and make corrections. We are also investigating the use of systems like Mechanical Turk to recruit and pay for annotation (Sorokin and Forsyth 2008; Spain and Perona 2008). Alternatively, commercial annotation initiatives could be considered, like the aforementioned Lotus Hill dataset (Yao et al. 2007), in combination with sampled quality inspection. (Everingham et al. 2010:336)
Although the design of paper and web interfaces, annotation guidelines, and training sessions remains a function of expert judgment on, and biases toward, what constitutes interesting and noise-free data (Collins 1998), these new repertoires of control consolidated scientists’ decisions before these annotation sessions and rendered autonomy and expertise irrelevant during those sessions. These new repertoires pushed the division of scientific labor between faculty and graduate students, although only temporarily during these annotation sessions, toward what would more closely resemble managers’ direct control over workers under Taylorism.
Algorithmic Control: ImageNet (2009)
Only when annotation labor was separated from scientific labor and controlled to ensure quality did outsourcing become a realistic solution to scaling for computer vision. Gig platforms have been said to reconfigure the nature of work by facilitating a new regime of algorithmic control that allows at the same time centralized control and spatial dispersion of labor (Vallas and Schor 2020; Wood et al. 2019). ImageNet’s use of Amazon Mechanical Turk (MTurk)—an online platform on which users complete small, web-based human intelligence tasks posted by other users for a tiny amount of payment—for annotating an unprecedented amount of image data to “map out the entire worlds of objects” has been widely discussed as a turning point in the history of artificial intelligence (Deng et al. 2009; Denton et al. 2021; Gershgorn 2017). But the introduction of MTurk workers into the social organization of data annotation was not simply a flash of scientific genius or, as one of the authors herself described it, a “Godsend” (Fei-Fei 2019). Computer vision scientists had worked on how to use gig workers for a couple of years before ImageNet. Without the new repertoires of control that scientists developed and tested out in the period leading up to ImageNet, gig workers, lacking carefully curated training and centralized control by scientists, would not have fit into the laboratory model of data annotation.
ImageNet is a dataset of 14,197,122 annotated images of 10,000 different categories of objects ranging from tench to toilet tissues, far exceeding all its predecessors in both scale and variety. Although according to the researchers’ calculation, it would take a graduate student 19 years to curate a dataset of this size, ImageNet grew from 0 images to 3 million annotated images of objects from 6,000 categories in five months with the power of gig workers and dispersed annotation labor. As Fei-Fei Li, the computer scientist leading the ImageNet team, recalled when Jia Deng, a PhD student of Li’s at the time, showed her the MTurk website: “I can tell you literally that day I knew the ImageNet project was going to happen,” she said. “Suddenly we found a tool that could scale, that we could not possibly dream of by hiring Princeton undergrads” (Gershgorn 2017).
Although many accounts of ImageNet’s success have portrayed its use of MTurk workers as a groundbreaking contribution, as we have shown, computer vision researchers, including the authors of PASCAL VOC, had already been thinking about using gig workers for some time before ImageNet. The problem that ImageNet solved, inspired by PASCAL VOC (Fei-Fei and Deng 2017), was how to organize and control gig labor and integrate it with expert labor in the laboratory to obtain high-quality data.
To replace 10 well-trained, colocated graduate students under close supervision with 49,000 unskilled, untrained, geographically dispersed MTurk workers across the United States and India, the labor of annotation had to be divided into microtasks that can be managed through a single web interface, carried out in parallel to each other, and efficiently evaluated for quality control. Figure 3 shows the backend interface that MTurk workers see when they accept a task from ImageNet. In this example, workers are instructed to select images that contain the object or depict the concept of a “delta,” defined as a low triangular area of alluvial deposits where a river divides before entering a larger body of water. To ensure that workers understand the object of interest, they also receive quizzes on what a delta is. Most importantly, any candidate image is annotated not by 1 but by 10 MTurk workers, whose annotations are then pooled and checked against each other by an algorithm that dynamically determines the level of inter-annotator agreement required to include the image under different categories (Deng et al. 2009; Fei-Fei 2010). Categories in which votes tend to split without a clear majority are excluded from the dataset (Deng et al. 2009; Fei-Fei 2010).

Figure 3. (a) ImageNet basic user interface; (b) ImageNet definition quiz. The ImageNet backend interface that Amazon Mechanical Turk workers see (Fei-Fei 2010).
Although the digital platform surely contributes to the dehumanizing and decontextualized nature of annotation (Denton et al. 2021), it is the creation of this automated, dispersed quality control algorithm based on redundant labor that distinguishes ImageNet from its predecessors. Annotation is never a task that inherently requires the colocation of workers, but quality control was before ImageNet. In PASCAL VOC, the scientific labor of individual annotators is also standardized and decontextualized by centralized control over their decision-making power. In ImageNet, scientific labor is not just decontextualized but also further divided into subindividual, redundant pieces: One task completed by an MTurk worker constitutes no more than 1/10 of a scientific decision being made, and the labor of the majority is frequently used to devalue and discard the labor of the minority. With not only annotation but also evaluation now being dispersed, scientific training is no longer required even to oversee an annotation task.
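To make the logic of this quality control concrete, the sketch below gives a deliberately simplified version of redundancy-based aggregation in the spirit of the procedure described above: each candidate image receives several independent yes-or-no judgments, a per-category agreement threshold is calibrated on a small gold-standard subset, and categories whose votes split too badly are dropped. It illustrates the general technique, not the authors’ exact algorithm; all names and parameters are assumptions.

```python
# Simplified illustration of redundancy-based quality control in the spirit of
# ImageNet's procedure (Deng et al. 2009), not the authors' exact algorithm.
from collections import Counter

def calibrate_threshold(gold_yes_counts, gold_truth, target_precision=0.95, k=10):
    """Find the smallest number of 'yes' votes (out of k redundant judgments)
    whose precision on a gold-standard subset reaches the target for a category."""
    for threshold in range(1, k + 1):
        accepted = [truth for yes_votes, truth in zip(gold_yes_counts, gold_truth)
                    if yes_votes >= threshold]
        if accepted and sum(accepted) / len(accepted) >= target_precision:
            return threshold
    return None  # votes split too badly: the category is excluded, as in ImageNet

def aggregate(votes_per_image, threshold):
    """Keep an image only if enough of its redundant annotations agree."""
    return [image for image, votes in votes_per_image.items()
            if Counter(votes)["yes"] >= threshold]

# Example: harder, more ambiguous categories end up requiring more agreement.
threshold = calibrate_threshold(gold_yes_counts=[10, 9, 7, 4, 2],
                                gold_truth=[True, True, True, False, False])
kept = aggregate({"img1": ["yes"] * 8 + ["no"] * 2,
                  "img2": ["yes"] * 3 + ["no"] * 7},
                 threshold)  # only "img1" is kept
```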
Corporate Control: Google Open Images (2018)
The consolidation of repertoires of bureaucratic, centralized, and algorithmic control over what we used to know as scientific labor in computer vision laboratories eventually facilitated the full marketization of data annotation as a new job and a new industry. The production of large-scale, high-quality, real-world datasets after ImageNet often combines ImageNet’s dispersed redundancy of gig workers and PASCAL VOC’s centralized control over in-house, semiskilled annotators who are trained to curate training examples, resolve difficult cases, and sometimes audit the work of gig workers. This new hybrid regime of control, encapsulated by new corporate datasets such as Google Open Images, has become the organizational backbone of an industry-led open science of AI.
Google Open Images is a recent dataset of 9 million images containing 16 million labeled objects (15 times greater than ImageNet) from 600 categories appearing in complex, everyday contexts (Benenson and Ferrari 2022; Benenson, Popov, and Ferrari 2019; Kuznetsova et al. 2020). Similar to ImageNet, Google Open Images also utilized a large pool of MTurk workers to disperse annotation labor, only with a more intensified repertoire of dispersed redundancy and control. Not only does it employ a larger pool of MTurk workers; the task itself is further trivialized for scaling and quality control. Whereas ImageNet asked MTurk workers to select images that contain a given object, the latest iteration of Google Open Images replaced it with a yes-or-no question of whether a pixel point on a given image is part of an object (Benenson and Ferrari 2022). This extremely simple question, as the authors argue, would allow annotators to understand immediately and answer quickly “without undergoing any training nor having any understanding of the notion of object boundaries” (Benenson and Ferrari 2022).
Google Open Images and its contemporaries also diverge from ImageNet in two important ways. First, ImageNet is still an academic dataset in the sense that (1) it was created within a university laboratory by a team of a principal investigator and her PhD students and research assistants, (2) it was funded by public money, and (3) it collected publicly available images from the Internet. Google Open Images, on the contrary, was created and funded by Google Research, building on a proprietary dataset owned by Google called JFT and using Google’s Crowdsource Android app to extend coverage to more diverse offline images from users in India, the Middle East, Africa, and Latin America, yet it is free for both academic and commercial use. The rise of corporate open science like Google Open Images raises the question of why corporate laboratories would invest labor and resources in creating open-source datasets instead of proprietary datasets that give them a competitive advantage. But it also entails a seamless integration and easy extension of scientists’ labor repertoire by tech companies’ organizational structure. Whereas the laboratory model of principal investigators and PhD students handmaking blocks is costly to replicate, both the centralized control of annotator teams by a managing researcher and the dispersed control of MTurk workers by a well-designed user interface are the livelihood of tech companies.
Second, in addition to an external pool of MTurk workers who carried out most of the annotation work, Google employed and trained an internal pool of “professional” annotators. Those internal annotators would be provided with extensive guidance on how to interpret and verify the presence of classes in images and go through rounds of qualification tasks and training games on PASCAL VOC with immediate feedback to improve their performance (Kuznetsova et al. 2020). Those internal annotators would typically work on annotation tasks that require more “skilled” labor but could also be used as a high-quality benchmark to check the quality of outsourced annotations. The extensive training and control they are subject to also give their decisions a higher weight in quality control: Whereas an MTurk worker’s annotation has to win a majority vote among seven redundant annotations, an internal annotator’s work is evaluated against only two others.
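The differential weighting described above can be stated as a simple decision rule: an external judgment has to win a majority within a larger pool of redundant annotations than an internal one does. The sketch below is a schematic illustration of that rule only, not Google’s documented pipeline; the pool sizes follow the figures reported in the text, and everything else is an assumption.

```python
# Schematic illustration of differential redundancy, not Google's documented
# pipeline: an external (crowd) judgment must win a majority among seven pooled
# annotations, while a trained internal annotator's judgment is cross-checked
# against only two others (a pool of three).
POOL_SIZE = {"external": 7, "internal": 3}

def accept_label(judgments, annotator_type):
    """Accept a 'yes' label only if it wins a majority of the pooled redundant
    judgments; `judgments` is a list of 'yes'/'no' strings."""
    pool = judgments[:POOL_SIZE[annotator_type]]
    return pool.count("yes") > len(pool) / 2

# A crowd-sourced label needs at least 4 of 7 'yes' votes to be accepted;
# an internal annotator's label needs only 2 of 3.
accept_label(["yes", "yes", "yes", "no", "no", "no", "no"], "external")  # False
accept_label(["yes", "yes", "no"], "internal")                           # True
```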
As corporate researchers reimport these repertoires of control back to tech companies’ organizational structure and bring together gig workers with professional annotators to work together on the same dataset, data annotation is finally brought out of academic laboratories to become a business that can be run by anyone. The global data labeling market was valued at $150 million in 2018 and is expected to grow to more than $1 billion by 2023 (Murgia 2019a). In the place of Amazon Mechanical Turk, there have emerged companies that provide professional data curation services to companies and academics alike, deploying similar labor repertoires to train and control professional annotators in China, India, Southeast Asia, and Africa. Yet much of the groundwork, as we have shown, was laid inside the laboratory.
Discussion
From MNIST to Google Open Images, as computer vision datasets grew in both scale and variety, computer vision scientists made use of repertoires available to them for bureaucratic, centralized, and algorithmic control to divide, standardize, and distribute scientific labor. Under the guise of scientific challenges and methodological innovations, these repertoires of control facilitated the full marketization of data annotation work and a new, hybrid regime of control. A rapidly expanding and globalizing industry, the result of this laboratory project of labor control, now draws workers from all over the world into precarious employment relationships and undignified labor conditions at the core of what we know as AI technology.
Implications for Research on Hidden Labor
The case of data annotators allows us to move beyond hidden labor as a result of technological change toward a more historical and structural approach to questions such as “How is the production of AI social?” (Joyce et al. 2021) and “Who created AI technology and how?” (Renski et al. 2020). My findings show how the technical development of AI is deeply intertwined with the social organization of work and how repertoires of control are particularly important for our understanding of inequality in the age of AI. First, the degradation of data annotation work from an expert task for the laboratory to a form of precarious, monotonous labor that can be outsourced, marketized, and obscured is in itself a new and undertheorized source of inequality in the labor market and in the workplace (Gray and Suri 2019; Joyce et al. 2021; Shestakofsky 2017). Second, scholars have called for theoretical attention to the power behind the rise of a multibillion-dollar data labeling industry across the Global South (Miceli and Posada 2021, 2022; Murgia 2019a). Conceptual tools developed in this study, such as “repertoires of control,” may help pin down the sources and mechanisms of power in future research on the AI industry.
Implications for Research on Algorithmic Bias
Scholars have criticized the AI industry for propagating a psychological and individualistic theory of algorithmic bias as the result of individual technologists and data annotators contaminating good algorithms with their human subjectivities and prejudice (Benjamin 2019; Joyce et al. 2018; Miceli and Posada 2022). This psychological understanding obscures structural conditions, such as the rise of corporate power and the exploitation of hidden labor behind AI technology, that allow certain forms of labor to be performed on data and certain types of biases to be included. Remedial measures suggested by AI ethicists and taken by the AI industry have focused on individual accountability: to audit and ablate individual datasets and data production protocols (Hutchinson et al. 2021; Paullada et al. 2021) and to educate and monitor individual engineers and data workers, often on a case-by-case basis (Hutchinson et al. 2021; Yang et al. 2020).
Although social scientific research on algorithmic bias has offered important insights into the underrepresentation and stereotypical representation of minorities in machine learning datasets (Crawford and Paglen 2021; Denton et al. 2021; DeVries et al. 2019; Paullada et al. 2021; Shankar et al. 2017), this study calls for future research that goes beyond what ends up “in” the datasets to disentangle structural conditions that shape the labor of data annotators and other data workers. A labor perspective would also provide much-needed cultural toolkits for the AI industry and regulators to understand AI as an open system deeply embedded in our capitalist society. In addition, given the disproportionate agency of scientists in translating structural conditions into “technological innovations,” finding an ideal alternative must be construed as a dually scientific and structural problem and requires close collaboration between AI researchers, social scientists, tech companies, and the state based on the consensus that blaming the racist prejudice of AI engineers and data annotators only distracts us from the real problem (Benjamin 2019; Bonilla-Silva 1997).
Implications for Research on the Future of Knowledge Work
The concept of repertoires of control expands the horizon of the sociology of scientific knowledge by building on its long-standing but underdeveloped interest in studying knowledge production as a form of work and laboratories as shop floors (Doing 2004; Hoffman 2021; Owen-Smith 2001; Shapin 1989). Laboratory ethnographies in this tradition inevitably tend to see laboratories as a professional community based on shared culture and status rather than a regime of labor control. Using the case of data annotators in computer vision laboratories, I further open up laboratories to new analytical tools from the labor control literature. Seen through the lens of control, scientific work is not only a problem of how knowledge is produced and legitimized but also of how labor becomes exploitable and monotonous and how science today may have come to rely increasingly on such labor beyond the laboratory. As AI algorithms, whether a fashion or a fad, diffuse, these repertoires of control and the resulting dispersion of scientific labor may be brought into even the less labor-intensive scientific fields and creative industries.
Finally, a consideration of the relationship between science and dispersible labor raises the interesting question of how the sociology of knowledge work may shed new light on emerging institutions of surveillance and inequality in the digital society. On the one hand, recent research has revealed how the division of labor in and between medical and legal professions facilitates the street-level governance of urban poverty (Lara-Millán 2021; Seim 2020). On the other hand, economic and political sociologists have highlighted the role of new instruments and infrastructures of knowledge-making in the governance and reproduction of inequality by state and market institutions (de Souza Leão 2022; Fourcade and Gordon 2020; Fourcade and Healy 2017, 2024; Hirschman 2021). Future research may further explore the labor-intensive nature of those institutions in the age of AI and the potential mediating role of social scientists, and increasingly computer scientists, and their evolving repertoires of control over labor that produces knowledge about the population (Foucault [1976] 1990). As more and more precarious workers participate in the development, commercialization, and maintenance of AI technology, their labor becomes simultaneously the infrastructure and the subject of algorithmic governance. Understanding how their labor is organized, controlled, and dispersed as knowledge work will help us think about the long-term, second-order effects of AI technology on the future of power and inequality.
Acknowledgements
I thank Jeffrey Sallaz for inspiration and encouragement on early iterations of the idea and Ronald Breiger and Corey Abramson for methodological advice on the broader project. I am particularly grateful to Daniel Menchik, Socius special issue editors, and two anonymous reviewers for their extensive comments that greatly improved the article. Finally, I thank all my participants who generously shared with me their time and insights. All errors and omissions are my own.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material is based upon research that has been supported by an American Sociological Association Doctoral Dissertation Research Improvement Grant (ASA DDRIG).
