Abstract
This article presents an ethnographic case study of a corporate-academic group constructing a benchmark dataset of daily activities for a variety of machine learning and computer vision tasks. Using a socio-technical perspective, the article conceptualizes the dataset as a knowledge object that is stabilized by both practical standards (for daily activities, datafication, annotation and benchmarks) and alignment work – that is, efforts including forging agreements to make these standards effective in practice. By attending to alignment work, the article highlights the informal, communicative and supportive efforts that underlie the success of standards and the smoothing of tensions between actors and factors. Emphasizing these efforts constitutes a contribution in several ways. This article's ethnographic mode of analysis challenges and supplements quantitative metrics on datasets. It advances the field of dataset analysis by offering a detailed empirical examination of the development of a new benchmark dataset as a collective accomplishment. By showing the importance of alignment efforts and their close ties to standards and their limitations, it adds to our understanding of how machine learning datasets are built. And, most importantly, it calls into question a key characterization of the dataset: that it captures unscripted activities occurring naturally ‘in the wild’, as alignment work bleeds into moments of data capture.
Introduction
This article analyzes the construction of a dataset designed for benchmarking models in a variety of machine learning and computer vision tasks. Benchmarks are related to the Common Task Framework, which is characterized by curated data, exact task specifications, standardized evaluation metrics and an evaluation dataset. This framework is known in the machine learning community for facilitating the evaluation of models based on how well they perform in comparison to one another in a constructed environment (Rodu & Baiocchi, 2023). Benchmarks tend to ‘become definitions of the essential common tasks to validate the performance of any given model’ (for an overview of benchmarks, see Raji et al., 2021:1).
A growing strand in critical data and algorithm studies has shown that benchmark datasets, like ImageNet, display effects that exceed simple reference points: they impose norms in the machine learning community (Denton et al., 2021; Koch et al., 2021). Moreover, scholars have argued that machine learning datasets are partial socio-technical products (Raji et al., 2021). Currently, numerous quantitative measures that aim to add epistemological handles to datasets circulate in the machine learning community (see Mitchell et al., 2023 for an overview); these might, for instance, assess harmful representations or the range of diversity in a dataset's representation.
But how do machine learning scientists construct datasets in practice? This study aims to understand how machine learning scientists create a benchmark dataset as a socio-technical knowledge object. Through an ethnographic case study of a dataset created by a large corporate-academic group, I investigate the work involved in designing, assembling and transporting the dataset within the machine learning community. In contrast to kitchen-sink approaches that integrate any kind of retrievable item (Jo and Gebru, 2020), this dataset was carefully created from the ground up.
More precisely, the footage the dataset is based on was recorded not in studios or with prepared scripts, but seemingly impromptu, with head-mounted cameras worn in data subjects’ homes and other locations in their daily lives; its creators describe the dataset as capturing activities as they occur ‘in the wild’. Considered as a cascade of individual scenes, this massive dataset might look like a series of first-person perspectives in which hands reach to refuel a machine, push wood through a circular saw as part of a home-improvement task, operate a sewing machine or add shading to a portrait, a desk lamp providing just the right light. Ongoing commentary accompanies the recorded footage, as data annotators have written descriptions of the events that occur and inscribed boxes around objects.
When benchmark datasets are seen as knowledge objects, crafting them involves adherence to agreed-upon standards, which necessitates diverse efforts at alignment. This article details the connection between these efforts and standards and their limitations, emphasizing alignment work's crucial role in the process. At the same time, alignment work is less visible than other efforts, consisting of the informal, communicative and supportive efforts that underlie the success of standards. This work regulates a number of intersections where actors and factors interact, revealing the practical work undertaken by machine learning scientists in assembling socio-technical knowledge objects.
A focus on alignment work provides perspective on the comprehensive work a dataset relies on, contributing to an understanding of how it is created and what this implies for computation efforts. For the dataset in question, practical standards were established to organize the conditions in which unscripted activities could be captured ‘in the wild’, and digital cameras facilitated the observation of these activities from afar. But as interviews with data subjects showed, alignment work entailed a hands-on stewarding that significantly influenced the moments of capture. The dataset creators, for instance, provided guidance on the optimal way to record activities (ensuring maximum camera coverage) and established clear agreements about their parameters – that is, what each should and should not encompass. Alignment work thus bled into – and ultimately domesticated – the activities characterized as occurring naturally ‘in the wild’.
Having laid out the terms and stakes of this inquiry, in the following section, I turn to relevant studies of machine learning datasets that contributed to my analysis. Subsequently, I present the socio-technical framework including the concepts of standard-setting and alignment work. I then detail my ethnographic process, materials that informed the results, and the context of the dataset. The findings are presented step-by-step, with each section analyzing a different practical standard and the subsequent alignment work it relied on. Ultimately, I take issue with the ‘in-the-wild’ characteristic attributed to the dataset by its creators. The article thus presents an ethnographic story of a machine learning dataset, along with commentary relevant to the intersection of the social and computing sciences. I conclude with the suggestion that alignment work is a significant analytic for social analyses of dataset construction because it reveals the practical work that enlivens standards. Throughout the article, I refer to the recorded research participants as data subjects, and the group of machine learning scientists in the dataset project as the group or the dataset creators.
Studies of machine learning datasets
The proliferation of deep learning in computer vision has led to a partial but significant shift in focus from algorithm development to the rigorousness of datasets, as noted by Nassar et al. (2019). The computer science-oriented literature places significant emphasis on data quality and annotation, paying particular attention to the reliability of labels. The current best practice consists of manual human annotation, in which at least two people independently annotate the same data. Consistency between the labels provided by the different annotators, known as ‘inter-annotator agreement’, is integral to the reliability and accuracy of annotations (e.g., Nassar et al., 2019). Evaluation methods, such as computing similarity scores, are commonly employed to measure inter-annotator agreement (e.g., Yang et al., 2023). ‘Datasheets for Datasets’ (Gebru et al., 2018), the gold standard for dataset documentation, includes intended uses and limitations among its important criteria for datasets.
In parallel to the computer science–oriented literature, dataset analysis is a growing area of social research on computation, conducted through methods such as critical historiographies (e.g., Denton et al., 2020, 2021), archival research (Thylstrup, 2022) and ethnography (Jaton, 2021). An initial strand of inquiry found that machine learning datasets are composed solely for a computational other: models. In Jaton's (2017:827) lab ethnography, computer vision scientists created a dataset by downloading images from Flickr. These were annotated by crowdsourced workers, who placed rectangles around regions they identified as salient. Scientists then manually segmented these annotated regions into distinct ‘salient features’. This process laid the foundation for the creation of a saliency detection algorithm. Trained models are thus shaped by the composition of data and the annotation labels with which they are provided, which serve as a model's ‘ground-truth’ (Jaton, 2017). Scheuerman, Denton and Hanna (2021:2) describe a similar machine learning convention accordingly: ‘Computer vision uses a subset of data instances, known as training data, to fit the parameters of a model. Another subset of the data is known as testing data, leveraged to evaluate the model and estimate the ability of the model to generalize to images not seen during training’.
An important area of focus in social analyses of machine learning datasets concerns the power relations and exploitation in data annotation work (which is generally low-paid, crowdsourced and made invisible) and the social organization of dataset construction, especially in the harvesting of online materials (Gray and Suri, 2019). Thakkar and colleagues’ (2022) interview study reveals substantial variations in the valuations of data within the machine learning data supply chain and specific facets of the social organization of workers. Model developers prioritize ‘structured data’ and data stewards emphasize the need for ‘readable’ data, while the valuations of on-the-ground data collectors are primarily linked to their performance at the worksite, despite being ancillary to their core work. According to Miceli and colleagues’ (2020) qualitative study, top-down communication patterns and ad-hoc updates to datasets exclude data workers from relevant workflows in companies. The authors suggest that dataset documentation should be viewed as a boundary object derived from practice, one that allows for consistent yet flexible communication between relevant actors. Scheuerman, Denton and Hanna (2021:24) moreover compare the textual production of multiple datasets to understand their underlying values, showing that certain value dichotomies – ‘efficiency versus care, universality versus contextuality, impartiality versus positionality, and model work versus data work’ – tend to be prioritized when datasets are articulated by their makers.
Another strand of research has shown that datasets contribute to the ongoing production of social and computational realities. Crawford (2021) extends this understanding by describing datasets as ‘classification engines’ because of their reliance on data annotation labels that influence new data instances. Crawford (2021:139) argues that these classification engines serve to ‘naturalize a particular ordering of the world; which produces effects that are seen to justify their original ordering’. Jaton (2023) leverages Kang's (2023) ‘ground-truth tracing’ to show that the stabilization of historically situated knowledge in a benchmark dataset for cancer immunotherapy is contestable through ongoing knowledge production that revisits previous assumptions.
A related concern in social studies of datasets investigates the use of identity categories (e.g., Scheuerman et al., 2020). A key finding suggests that current practices involve ‘the collapsing of human identity – which consists of both visible external and invisible internal aspects – into solely the visible. For example, visible gender expression is being used as a proxy for internal gender identity’ (Scheuerman et al., 2020:21). Malevé (2021:1128) argues that in ImageNet, the context behind each and every photograph is collapsed; although photographs are shaped by a range of institutional assumptions about image characteristics such as perspective and metadata, ‘The data set is not merely a collection of photographs… Flickr's amateur snaps, police portraits, submarine photographs are re-aligned as items of a training set’. In other words, data become reconfigured. Amoore's (2024) research on machine learning in border policing shows a merging of diverse data types – biometric, biographical, transactional and social media – to stack their features. Moreover, data annotation has been shown to superimpose meanings onto data (Miceli et al., 2020). A key realization from these studies is that representational claims are present in machine learning datasets, though the objects they represent become reconfigured. This is especially significant when the specific composition of datasets will form the basis of the classification and prediction architecture of subsequent machine learning models.
Analyzing the movement of machine learning knowledge has a direct kinship with the analyses of data, socio-technical systems and knowledge infrastructures that take place in science and technology studies. Some of the observations about machine learning datasets reflect earlier findings in science and technology studies. For instance, Latour (1995) described the chain of analog transformations underlying the production and circulation of scientific referents. During his fieldwork with geologists, Latour traced the gradual transformations through which soil was decontextualized and turned into labels, numbers and graphs, producing scientific referents. In this stream of locality loss and legibility gain, Latour (1995:170) claimed, ‘Reference … is that which remains constant through a series of transformations’. Haraway (1988:583) located similar tendencies concerning the partial visions of technoscience.
Bowker's (1994) notion of ‘infrastructural inversion’ is another precedent approach that analyzes arrangements that contribute to socio-technical systems before fading away once they are stable. This involves administrative apparatuses (Bowker, 1994:10), classification systems and other infrastructural objects that support systems or tasks, including lists (Bowker and Star, 1999) – and in my view, datasets. Star's (1993) related call to ‘study boring things’ thus remains a highly compelling provocation as we seek to better understand machine learning from a socio-technical perspective.
In sum, the recent research focus in social studies of machine learning and datasets – aligning with long-standing findings in science and technology studies – has taken issue with conceptions of digital data as pristine, instead understanding data as the effect of work situated in space and time and in diverse formations of people, social contexts, power orders and geopolitical structures (Beaulieu & Leonelli, 2021; D'Ignazio and Klein, 2020). Slogans such as Raw Data Is an Oxymoron (Gitelman, 2013) and All Data Are Local (Loukissas, 2019) reveal a common position embraced in social research on computation and data (see also boyd and Crawford, 2012).
By studying the work involved in the construction of a machine learning dataset as a socio-technical knowledge object, this article adds to the growing repertoire of social analyses of machine learning datasets that call for ‘bringing people back in’ (Denton et al., 2020) through ethnographic approaches (Jaton, 2021) and recognize the unequal conditions in such work (Gray and Suri, 2019). Jaton (2023:803) asserts that ‘if we get the algorithms of our ground truths (Jaton, 2017), we get the ground truths of our organizations and metrological equipment’. Thus, I place my contribution in the existing literature that analyzes the activities of machine learning and inquire about the constitutive work that moves a socio-technical knowledge object along a process of stabilization, deeply marked by establishing standards and aligning the relevant constituents in agreement. Moreover, it is significant that this article attends to a novel benchmark dataset, as studies on machine learning datasets tend to accumulate around ImageNet (Deng et al., 2009) – the most influential benchmark dataset in recent computer vision history – and the biased representations built into it (e.g., Shankar et al., 2017; Crawford and Paglen, 2019).
Socio-technical knowledge objects: Standards and alignment work
A socio-technical perspective unsettles the accepted distinction between social and technical processes and considers instead their continuous interaction (Bijker and Law, 1992; Latour, Mauguin and Teil, 1992; Vertesi et al., 2019). This perspective brings into focus a heterogeneous chain involving human actors, technical elements, devices and work. A knowledge object is formed as actors work on, interact with and navigate the dataset through the various components and stages of production (cf. Latour, 1987). At the forefront is the dataset as a progressively socio-technical knowledge object, along with the work integral to its formation. This framework focuses on two work activities conducted during the construction of the dataset: standard-setting and alignment work. The progression of socio-technical knowledge objects, including machine learning datasets, hinges on the establishment of standards and, subsequently, endeavors to enforce them, which boil down to how actors create and enact agreements (Strauss, 1993).
The explicit aim of the dataset in question is to collect data from daily activities in a variety of circumstances through first-person perspectives. Thus, as a socio-technical knowledge object for computer vision and machine learning, the dataset draws on everyday knowledge. To grasp the processes at work in the dataset, I embrace an understanding of ‘daily activities’ as mixtures of diverse and historically situated knowledge. Enactments of everyday life are distinguished by a person's use of socially acquired knowledge, both specialized and general (Berger and Luckmann, 1969). Such variable pieces of knowledge enable enactments of something as seemingly mundane – yet socially received and contingent – as doing the dishes. In this way, the machine learning dataset as a socio-technical knowledge object gains scientific value through its link to the embodied knowledge enacted by the data subjects.
A standard serves as a communal basis for enabling the movement of knowledge objects (Timmermans and Epstein, 2010). Conventions and agreements can be made into formal standards, such as Structured Query Language (a common structure of data retrieval in databases) or established protocols for metadata (including author, file size and date created). Bowker and Star (1999:13) define such standards as ‘a set of agreed-upon rules for the production of (textual and material) objects’. Similarly, Busch (2011:58, 74) contends that standards act as guiding principles for human actions and objects, and that they are used to ensure operational coherency. Standards put complexity at risk, as they aim to simplify, order and formalize work procedures (Lampland and Star, 2009). Notably, Busch (2011:151) observes variability in the definition of success across context-specific standards and asserts that some standards are intentionally crafted to regularize subsequent processes.
Standards can also be more informal (Lampland and Star, 2009). Lampland and Star (2009:16–24) postulate a ‘leaky border’ between standards and conventions and highlight locally negotiated efforts that ‘touch very specific communities in very specific contexts’. Standards are thus connected to the conventions of professional and scientific communities. From these scholars’ research, we learn that standards result from temporary agreements and are enacted in local practices to make further work consistent (Busch, 2011; Lampland and Star, 2009). What I refer to as practical standards are explicit arrangements and methods that follow from ‘best practice’ (established conventions that vary in their formalization) – for instance, an agreed-upon way of annotating a machine learning dataset. This definition helps identify conventions and agreements, both those that are established in the community and those emerging in local practice.
As a concept, alignment work is based on symbolic interactionist theory developed in Susan L. Star's work on infrastructures and standards, and Anselm Strauss's concept of articulation work and studies of invisible work (Kruse, 2021:5; Strauss and Star, 1999). My understanding of the concept draws inspiration from Kruse's (2021) elaborations of it, which is based on research analyzing knowledge infrastructures (see also Vertesi, 2014). Alignment work is directly tied to standards and makes up the ‘work that is necessary to keep the undergirdings of movement in working order’ (Kruse, 2021:11). Unlike explicit standards, it may entail fluid, emotional and communicative efforts (Kruse, 2021). The concept directs attention to the less visible, less recognized work that goes into upholding, repairing or enforcing various types of standards – including forging agreement and distributing associated tasks.
Kruse's (2021:3) research on knowledge infrastructures shows that alignment work is common in settings where different ‘epistemic cultures’ interact (Knorr Cetina, 1999). Alignment work is necessary to enliven agreements at the intersection of distinct knowledge practices – from daily life and from machine learning, for example – where tension inevitably occurs. Drawing on this body of work shows that alignment work is performed to smooth tensions as the knowledge object is moved between sites. I use this framework to analyze the work of defining daily activities and making them recordable, computationally retrievable and possible to disseminate.
Ethnographic approach to a distributed object of study
This article is produced as part of a multi-staffed and multi-sited ethnographic project. The research team studies different types of visual evidence produced by digital data. Inspired by laboratory ethnography, for the specific study described in this article, I followed the dataset project digitally over the course of 14 months. My own data was produced from multiple sources, including interviews, observations and documents. In total, the interview data consisted of 15 interviews with computer scientists and machine learning practitioners, and shorter segments of interviews with data subjects made available by the dataset creators. I also observed the dataset creators’ recorded workshops, the project forum and associated traces on GitHub (Geiger and Ribes, 2011; Turner, 2019). In addition, I collected archival materials and documents, such as data annotation guidelines, benchmark challenges and a data license agreement. I also consulted a published article containing an extensive scientific account of the dataset.
I produced a fieldsite by stitching together digital entry points (Postill and Pink, 2012) for ethnographic analysis. Accordingly, it is best conceived of as, in Hine's (2015:60) words, an ‘artful construction’ rather than as a site I discovered. The main project website contained an overview of the dataset and related important information, including how to obtain access to the dataset; it also had a forum, with community rules and user profiles. The dataset's GitHub page included documented code sequences, associated work logs and its own issues forum. These entry points allowed for insight into several ongoing work activities. As an ethnographer, I also attempted to penetrate deeper into technological production by participating: I wrote a post in the forum, announcing my presence and seeking further interlocutors and contacted key people sourced from the project website.
After signing the license agreement protecting the dataset, I detailed my study to the group and received their informed consent. Signing this agreement enabled me to not only access the data but also establish deeper relationships with both the creators and the dataset's components, including its metadata. Since the dataset is heavily populated, I was unable to receive informed consent from everyone who contributed to it (Sveningsson Elm, 2009). However, the dataset is mostly anonymized, except for a handful of cases in which data subjects have agreed to show their faces. In line with ethical considerations, I took measures to pseudonymize my interlocutors and scrambled online materials to reduce their traceability.
An overview of the dataset
In 2021, a group of computer scientists announced the dataset in a technical publication, a promotional video and an online seminar. It was the result of a corporate-academic collaboration – the main organizing actor, a corporate team, assembled the group and was responsible for the annotation of data, while the academic institutions were responsible for recruiting research participants and collecting data in a range of continents and jurisdictions. The boards of the universities conducted ethics reviews. About 15 local groups recruited data subjects who were willing to wear cameras and contribute footage to the project; ultimately, almost a thousand people in countries across the world recorded footage. According to my interlocutors, the dataset is at the frontier of first-person computer vision, with widespread collaboration required to push the field further.
Promotional videos illustrate how the dataset's creators envisioned the technical outcomes of egocentric vision data. In one, a tidy kitchen is shown from the first-person perspective of a person wearing a pair of smart glasses. The glasses house an AI assistant, which uses a camera sensor to monitor the movements of the person's hands and other objects. During the hustle and bustle of food preparation, an avatar of an older woman appears in the corner of the image and says – to the wearer of the glasses, presumably – ‘My trick is to mix two-thirds of a habanero pepper into the saucepan’. Simultaneously, a translucent box labelled ‘Habanero chili pepper’ appears as a minimalist overlay above the pepper on the kitchen counter.
According to the dataset creators, about 50% of the footage represents handicraft, food preparation, carpentry or domestic work such as washing clothes. Another 20% involves activities such as shopping, eating, reading and playing music. The remainder of the dataset represents activities such as sports and game-playing, watching television and other screen-based media, walking (both indoors and outdoors), commuting and interacting with others. These data play a crucial role in facilitating computer vision tasks, including action recognition, gaze understanding and prediction, hand and object interaction, and 3D scene understanding. These tasks support the functionality of systems commonly referred to as spatial computing and mixed, augmented, virtual and extended reality.
Setting a daily activities standard
To give the knowledge object direction, the group began by establishing what daily activities meant. Deciding how everyday life should be represented required significant work in advance of the practical datafication efforts. Specifically, the group used the results of an American government agency's annual survey on domestic life – using materials from established institutions has been shown to add credibility to scientific projects (Shapin and Schaffer, 1985). This survey, which aggregated a flood of information from respondents in specific categories, provided the group with a point of reference and comparison, and statistical information on how everyday life is lived helped them decide which major types of activities to include in the dataset.
The empirically informed categorizations built into the survey reduced potential ambiguity and open-endedness in the group's interpretations of daily life. At the cost of maintaining productive ambiguity, the survey's categories and results were imported, becoming satisfactory and reliable standards for the datafication efforts. However, some categories of daily life are more difficult to handle and codify than others: shopping and procuring professional, domestic or personal care services fall into the category of ‘buying products and services’. So does obtaining government services like food assistance, a task which blurs the line between market and welfare activities. Moreover, according to the survey's classification system, cooking breakfast is not childcare; instead, it belongs to the category of ‘household activities’. As these issues show, the survey actualizes segmented valuations into the standard.
Aligning the daily activities standard
The survey was a recurring part of the group's presentation of their work in project documents. However, an explicit standard based on these information items – my interlocutors referred to it as ‘the list’ – became unwieldy in practice, and it was not sufficient for developing the knowledge object in itself. The statistical findings did not seamlessly correspond to the daily lives of individual data subjects in countries far from the one in which the survey had originated. Arriving at agreement on their contributions required negotiation between individual data subjects and the dataset creators, illustrating the close kinship between alignment work and standard-setting. Subtler ways of engaging became essential to enlivening the group's standardization of daily activities.
‘The list’ was used as a first step in screening potential data subjects’ eligibility for the research project. When enrolling, potential subjects identified which items on the list they could consider recording as part of their contribution. In my interview with Jane, a senior dataset expert, she recollected questions posed during further interactions with data subjects regarding the range of their hobbies. During a ‘moment of alignment’ (Vertesi, 2014:268), Jane asked about applicants’ normal daily routines and suggested that of the activities on the list, they record those they usually engaged in. This was done to reduce the risk of prompting specific activities that the data subjects did not typically perform. The version of ‘unscriptedness’ produced here shows that the main consideration was the data subject's familiarity with the activity. Negotiating the activities for the data subject was seen as unproblematic, and it also generated further agreement on the conditions for the recordings.
Occasionally, a data subject mentioned an activity that was not on the list, such as rock climbing, and Jane would ask them to record it. In this case, Jane used ‘the list’ to make everyday life reportable (Garfinkel, 1988), and it also provided guidance on covering activities in the aggregate. Alignment work thus temporarily restored the ambiguity of daily life, fitting data subjects to the group's objectives. However, Jane emphasized the importance of not overdoing such work – allowing for many hours of cycling to be recorded, for example. It was important to accumulate coverage of select activities, not streams of dull events with little computational value. Similarly, to spur diversity in the recorded activities, in their instructions to data subjects, other members of the group set explicit time limits for how long given activities could be recorded. The computational utility of the recordings took precedence over daily life as such.
Alignment work occurred as the group negotiated agreement between the activities the survey suggested needed to be covered and those the individual data subjects actually tended to do. Through textual devices and standards – including pre-selected activities, example recordings and time limits for specific activities – the group aligned the data subjects to producing versions of everyday life that, to various extents, corresponded with items on the American survey. In the work of matching the data subjects to it, the standard as such grew, incorporating the interpersonal agreements that occurred in parallel and becoming more extensive compared to the simple activity categories that had been imported to establish a reliable list. Further alignment work stemmed from one of the costs of this import, which related to a recognized problem of surveys on domestic life: respondents generally abstain from reporting morally sensitive activities (Bureau of Labor Statistics, 2021). The group managed such activities by negotiating with data subjects which parts of their lives they were willing to record and which details would be included in the footage. For instance, data subjects were told to avoid recording visible personal traces like tattoos, which could be used to identify them, whenever feasible. In this sense, the alignment work helped forge agreement on the circumstances for the recordings, equipping the data subjects with an understanding of what was expected from them.
Setting a datafication standard
By referencing expert conventions, the dataset creators established a datafication standard. This standard involved head-mounted camera devices adhering to high-definition resolution criteria, generating the desired intensity values within the standardized range of 0–255 across red, green and blue channels, rendering the daily activities computationally legible. This type of image production stands as a long-standing convention within computer vision research. Moreover, at some sites, 3D sensors were prepared for supplementary capture of activities in studio settings. This was done separately from the recording of activities ‘in the wild’, which was performed with devices ranging from head-mounted cameras to sunglasses with built-in cameras.
The variety of devices meant differences in sensor characteristics, such as the fields of view of cameras. But the dataset creators embraced this variety, which, rather than compromising the coherence of the values produced, was seen as enhancing the quality of the data. A key concern for the dataset creators was the problem of overfitting during model development. A common problem in machine learning, overfitting is the tendency of machine learning models to become too reliant on artefacts from the training process, limiting their ability to generalize to new data instances (Alpaydin, 2020:41). The inclusion of different fields of view diversified the visibility fields of the recorded activities, mitigating the issue of a single camera structuring the dataset. The varied camera array thus served as a safeguard against this problem.
Aligning the datafication standard
The skeleton of this socio-technical object, consisting of data subjects and the list of daily activities, gained computational significance once digital values began to accumulate. As data subjects recorded their activities, pixel values – crucial for more significant machine learning accomplishments – began to be produced. In practice, however, this proved to be dependent on work that was both more and less visible. Successful preparation and collection of the recordings relied on several interventions.
One of my interlocutors, Tom, shared stories about the practical work of data collection. As part of a local hub of the project, he was in charge – together with other senior scientists located elsewhere – of transforming people into data subjects who would produce footage. In this specific locale, Tom and his colleagues set up digital meetings to introduce the data subjects to the camera equipment used for recording once the initial recruitment process was complete.
During the meeting, he and his colleagues instructed the data subjects in how to best capture their activities, even providing an example of what a successful recording might look like. Their instruction aligned the data subjects and the recording devices, with the camera as a standardizing device that output numbers to retrievable files. After all, it is one thing to cook pancakes in one's kitchen, and quite another to do so with a camera mounted on one's head, trying to ensure that it captures one's hand movements and surroundings.
When interviewed by the group about their experience of recording their activities, several data subjects alluded to the unfamiliar feeling of recording themselves. For instance, one stated that ‘wearing cameras was a challenge at the beginning’, but that they ‘got used to it as the days passed by’. Another framed it as a ‘unique experience’, and a third brought up a feeling of ‘performance anxiety’. These accounts highlight data subjects’ reflexive consciousness of wearing head-mounted cameras for a scientific project in the context of everyday life.
Usually, Tom or his colleagues would drive to a data subject's home to deliver the camera equipment. Wearing masks to prevent the spread of Covid-19, they would knock on the door, greet the subject and help them set up the hardware. They would then leave, while the data subject would record themself performing the agreed-upon activities, playing board games or eating dinner, for instance, in order to produce data. Meanwhile, the local group representative would wait in a car outside, retrieving the equipment from the data subject once the activity had been recorded. In another locale, the equipment was mailed to subjects after the digital meetings, with procedures depending on local circumstances such as geographical proximity.
Setting an annotation standard
The dataset creation process also required adherence to a standard for data annotation. This standard determined how data would be marked and labelled, and annotation added computational indexicality to the footage. The group's data annotation standard was developed in consideration of three factors: the computer's ability to recognize and organize the data, the need for consistency across annotations and objectives related to computational tasks such as action recognition. The process involved about four million sentences and was estimated to be equivalent to almost 30 years of work. Consider this brief example of annotation for just six seconds of a data scene:
#C places the paint roller in the container of paint
#C reaches for the cup
#C uses his left hand to grasp the water bottle and drilling equipment
#C puts the drilling machine horizontally on the ground
#C removes the water bottle
#C reaches for the container
#C shuts the container
When I asked one of my interlocutors, a senior dataset expert, why annotation matters in the first place, she replied that data couldn't be used without it. I compared annotation to a vehicle that made the data computationally workable, and she agreed, adding that it made the data indexable. It allowed the dataset creators to see in detail what they had captured, she said – for instance, how many times a cyclist turned left or right. Without annotation, they would know only that they had a video of someone cycling, not what actions were captured or how they were conducted.
As an explicit goal for their data annotation, the group had agreed on consistency, which would enable a larger basis for computational processing of the pixel sets discussed above. As my interlocutor's explanation made clear, annotation efforts are critical to rendering the footage indexable through computer-legible handles. Data annotation is thus seen as recovering the ingredients of the footage.
The interpretations, intentions, goals and affects of the data subjects were outside the scope of interest as defined by the annotation standards. Beyond this definition, we can also distinguish between categories that are locally produced and defined by actors and those that are imposed by experts (Hacking, 1999:168). With this double vision of categories from above and below, the particularities of each data scene and the local definitions of the situations they depict became streamlined into a granular, descriptive and seemingly objective vocabulary. Understood in this way, annotation standards become investments in specific descriptive styles.
As I later learned, the style of the annotations was appropriated from the computer vision tasks circulating within the community. For instance, the group's interest in the computer vision task of action recognition required breaking down eventful data scenes into their smallest components (cf. Agre, 1994). For this reason, action- and object-centered annotations such as ‘#C folds pizza box’ dominate the data scenes, making up one annotation layer out of several, with the rest geared towards other tasks. These task-specific annotation standards are spread throughout the dataset. The current convention in computer vision defines the task of action recognition as a classification problem and accuracy is usually assessed by measuring whether the predicted label matches the annotated label, or if it matches any of the five highest-scoring predictions of what it might be (Plizzari et al., 2023:18). Accordingly, the annotation standards enabled the footage to become a knowledge object integrated in machine learning practice.
Aligning the annotation standard
With a defined standard in place, further alignment work coordinated the annotators’ efforts. The group commissioned two companies in the Global South to manually annotate the dataset. In this context, the dataset project exhibits parallels with prior research that asserts the role of ambiguous labor conditions as a foundation for the development of machine learning (Gray and Suri, 2019). The group instructed two workers to annotate the same 5-min chunks of footage with the aim of acquiring richer descriptions and correcting mistakes. The double annotator approach helped enforce the standard of consistency as laid out in the annotation guidelines. The annotation guidelines were a significant vehicle for alignment work vis-à-vis the data annotators, and they are key to understanding how the group's version of consistency was accomplished.
Extensive and schematic in form, these documents provided annotators with illustrative examples of how to describe the footage; one instructs annotators to ‘fix the start time to the moment the person takes up the knife and apple and the end time to the moment they are done, then write ‘#C is slicing an apple’ into the text area’. The annotation style for action recognition is translated into both a simple task described by example text (‘#C is slicing an apple’) and the affordances of an annotation interface (‘fix the start time’). The annotators also followed open instructions, such as ‘Imagine you are viewing this video while talking to a peer who is unable to see it, and you have to describe everything that happens in it’. The results of such instructions were then automatically mined to generate taxonomies that became resources for further annotation.
The annotation guidelines show how the group aligned the data annotators, through textual devices and example cases, to produce what they defined as consistent versions of the footage. The dataset shows, and through the explicit annotation guidelines, it was also made to tell in specific ways. In turn, open access to the guidelines themselves – accounting for how the annotations were constructed – helps the dataset travel to sites beyond the group and is an important resource for external scientists during model work. As such, these documents help align the community of machine learning scientists external to the group with the annotation standard.
Setting a benchmark standard
Once the dataset was considered stable enough, it was made available to the wider machine learning and computer vision community. The announcement of the dataset was tightly linked to a set of benchmark challenges promoting distinct scientific work packages. The dataset was classified as a benchmark – a source not just of new models but of points of comparison. The challenges consisted of computational tasks given names such as ‘episodic memory’, ‘forecasting’ and ‘hand and object manipulation’. During an established conference in the field, the group organized a competition with standardized evaluative metrics to measure the best technical efficacy on these specific tasks. The substantial work required to construct ‘doable problems’ in science has been shown to fade into the background once these problems have been articulated; for instance, Fujimura (1987) describes the coupling of experts possessing relevant know-how with technical objects in situated spaces. Beyond the scientific problems, the group created a contract requiring contestants to adhere to specific licenses, rules and procedures.
Identifying the dataset as a benchmark helped it become a site for comparing the performance of different models. One of my interlocutors underlined the importance of testing machine learning models across many datasets in order to show that they can be generalized. To demonstrate that they have not been overfit to a single dataset, models must perform well on several. Benchmarking is important in this process because it offers a standardized means of assessing and comparing model performance. To date, the contest's leaderboards are populated by about a hundred contestants in various team constellations.
Aligning the benchmark standard
Beyond standardization efforts like the rules of the competition and the shared metrics in the benchmark dataset, another type of alignment work occurred. In structuring the benchmark challenges, the dataset creators managed the relationship between contestants and the dataset. To make the knowledge object circulate, they arranged a platform infrastructure on which the contestants’ models would operate under the same computational conditions, making the contest fair. And while datasets have exceptionally high value in the machine learning community, the group attempted to solidify the dataset's standing even further – for instance, by offering a cash prize to the contest's winners. Moreover, external scientists circulated the knowledge object within their local community. In one of my interviews with Ken, a contestant from a Big Tech corporation, he described his uptake of the dataset accordingly: ‘There was a lot of circling around it internally. I was working on a related task, and I figured that I could make use of this additional benchmark to evaluate my prior work’.
As new actors come into play, the dataset is increasingly adopted and improved, accelerating its entrenchment in the machine learning community. In addition to the competitive aspect of benchmarks, it becomes a temporary site for the work efforts of diverse actors in the wider community of machine learning practitioners. Considering the frequently distributed nature of production within machine learning, this collaboration serves to show that model evaluation is first and foremost a collective accomplishment. Moreover, measured against the scores provided by the group, the contestants produced machine learning models that performed better than the baseline results. As part of the contest arrangement, these models were looped back into the next rendition of the benchmark contest as code and scientific papers, renewing the standard. Consistently incorporating models into the process thus contributes to strengthening their position and credibility within the community.
Interestingly, external scientists also contributed to making the dataset conform to the standards established in the community. They observed and reported numerous perceived errors in the dataset, prompting the creators to improve it, and communicate the associated updates. It has been observed that when a knowledge object is moved between sites in this way, stability is achieved not as ‘a quality (of for example a knowledge object) but as a rather fleeting state that requires the work of a community and can only be temporarily attained’ (Kruse, 2021:15). The ongoing troubleshooting and updates to the dataset suggest that this is also the case in machine learning.
This conduit shows that the construction process does not end when a benchmark dataset becomes a socio-technical knowledge object; rather, it provides the grounds for further work. Moreover, the movement from dataset to model performance contributes to ‘black boxing’ the dataset – as its inner workings become hidden from view (Latour, 1987) among other so-called state-of-the-art datasets. As a result, once a dataset is stabilized, it becomes increasingly challenging to understand its contents and limitations.
Intervening ‘in the wild’
Along with the diversity of the footage, the group regularly presented unscriptedness as a key characteristic of the dataset. This seemed to imply that the recorded activities had been captured without alignment. The group also described these data as having been captured ‘in the wild’, referring to the collection of data on naturally occurring events, including heterogeneous lighting conditions, significant in computer vision. The term nonetheless suggests that data can be produced without significant interventions by the researchers (in contrast to more organized efforts like recording data subjects in studio settings), and it is sometimes used when materials are scraped from digital platforms, which store them without researcher intervention.
As I began following the hugely peopled socio-technical knowledge object more closely, a friction developed between my outsider position and the use of this insider term, ‘in the wild’. It began to puzzle me, and my curiosity grew to encompass questions related to the work that contributed to the stabilized object. To approach these questions, my intervention began by reaffirming the setting of standards in the construction of the dataset, which in my view connects to computer science's general relationship to abstraction (Seaver, 2022:33).
Standards rely on alignment work, which my argument was developed to render visible; such work impacted not just the group's internal efforts but the ways in which data subjects proceeded during their recordings. By understanding alignment work, we come to see the comprehensive efforts that enabled the data subjects to perform their activities. This destabilizes the notion of the dataset as ‘unscripted’ or captured ‘in the wild’, suggesting instead that data subjects performed activities in accordance with their alignment.
My argument draws on the following observations: The American domestic activity survey was used to establish the initial meaning of ‘daily activities’ and provided a list of activities for the dataset. Data subjects negotiated with the group the activities they would record, further structuring their performance. The group instructed data subjects on how best to record themselves while wearing the head-mounted cameras, further aligning them to a shared agreement about the situation. Wearing their temporary visual companions, data subjects moved about in unconventional ways after pressing the record button, and they avoided recording certain aspects of their activities, their surroundings and themselves – tattoos, for example. Annotators then added symbols and text to add additional meanings to the footage, contributing to a remarkable degree to describing the data scenes. Throughout, the process of putting the dataset together relied on the group's alignment of actors and factors.
At times, these contingent efforts were performed reflexively. The instruction to avoid recording certain aspects, for instance, arose from concerns about data privacy and an attempt at de-identification. Other moments of active stewarding followed from legal concerns and expressed an ethos of care – for instance, undressed children were protected from the roving camera frame within a data subject's home. Interestingly, the group's stewarding of footage was more readily articulated when it was aligned with legal standards than it was when sourcing activities from a survey, for instance. Thus, within a more explicit moral framework, the unscripted and ‘in the wild’ attributes of the footage seemed to dissolve. In this sense, the dataset recomposes a clean, ethical version of daily activities.
The dataset marks the creation of not a representation of daily activities as they are performed – which it is presented as being – but, instead, of an object that must be construed through the very range of actors and factors contributing to it. The ‘unscriptedness’ of the dataset footage is the result of meticulous work that thoroughly configured the recorded data scenes, not only setting the stage for the recordings but structuring their contents. The argument presented here thus concerns the dataset's relationship to the ‘daily activities’ as being one that is first and foremost creative and additive. These are not, as they are commonly understood to be, external, natural behavior or objective phenomena that were captured from a detached point of view.
Conclusion
This article has told an ethnographic story of the process of constructing a dataset: the work it relied on, its composition of daily life and the work that enabled it. The story centers on the standard-setting and alignment work that was required to enact practical standards for daily activities, datafication, annotation and benchmarks during the process; in doing so, it advances previous and ongoing discussions by articulating the work necessary to establish and enact specific conventions and agreements. Crafting benchmark datasets, seen as knowledge objects, is a process that involves adherence to agreements or standards that necessitate diverse alignment efforts. Thus, the results of my study contribute to the ongoing discussion about how the machine learning datasets essential to contemporary computer vision models become constituted and how we as analysts understand them.
Throughout the process, the dataset creators performed alignment work. Using a national survey on domestic life in the United States, they created a list of ‘daily activities’, and in instances of alignment with the data subjects, they amended this list to screen the activities that were covered. Through practical walk-throughs on how to best record their activities while wearing a camera, they further aligned the data subjects. To achieve the goal of consistency through task-specific demands, they aligned data annotators through extensive instructions. Meanwhile, external computer scientists helped align the dataset with the standards already put in place.
Alignment work is less visible than overt standards. But once its importance to this process is registered, it becomes difficult to embrace the idea that data was captured in the wild. Comprehensive configuring, including the employment of various alignment scripts, was necessary to produce ‘unscripted’ footage, calling into question the significance of ‘unscriptedness’ as an attribute. Recognizing the persistent efforts that go into alignment work does not dethrone it – it becomes no less scientific. Instead, our understanding of scientific work practices is made more thorough, and our understanding of what dataset construction entails beyond the recognized conventions of scientific practice (that standards should be perceived as rigorous work, for example) is deepened.
The mode of analysis in this article supplements quantitative measures on datasets. As new computational models are forged on and evaluated through this dataset, which activities it records is significant, as is the way those activities are performed. For instance, there is a consistent pull in the dataset towards coherent activities, which influences subsequent computational outcomes. The dataset recomposes ‘unscripted’ everyday activities into sanitized versions of themselves brought into existence by a collective of many, the minutiae of which are narrated by a descriptive other. It locates daily life primarily in the kitchen, not the toilet; the living room, not the bedroom; in distinct activities, and less so in the gaps between them. In this sense, the dataset commits to a situated and specified version of ‘daily activities’. Put differently, the construction of the dataset culminated in a partial version of the stated data objective. This version entails that new and existing models that set out to measure their performance are under the homogenizing control of what daily activities have been made to look like. To develop or compare computational models of messier, neglected or morally ambiguous activities – inherently part and parcel of daily life as it is lived – appears out of reach in and for this dataset. This realization is critical as machine learning scientists develop models that are intended to inhabit such borders.
We have seen that to create datasets and make them valuable for machine learning, it is necessary to impose visibility, difference and similarity. Data annotation is the dominant method for ‘uncovering’ what takes place in recorded footage, and it contributes to the construction of the recorded content; enforcing data annotation standards makes data indexable according to certain automation objectives. I have traced the ways of performing annotation to the computational ‘tasks’ of interest they are contingent on, such as action recognition. This linkage opens an avenue for further research, such as genealogical studies uncovering the historical processes that have shaped specific tasks or ethnographic studies of developing such tasks, or the model work they activate.
Acknowledgements
I would like to express my gratitude to the editors of Big Data & Society for their guidance and to the anonymous reviewers for their constructive comments, which have enhanced the quality of this article. Special thanks are due to my colleagues at Lund University and the STS environment in Sweden, whose support has been instrumental – each contribution has been significant to the fruition of this work.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the H2020 European Research Council (grant number 949050).
