Abstract
As machine learning (ML) technologies move from their discrete existence in research to being highly applied technologies across society, critical scholars have begun to address the epistemological conditions that shape the emergence of such systems and their societal implications. In this paper, we investigate a specific epistemological condition of ML, namely, how ML systems rely on ongoing negotiations and agreements of ‘good enough’ to be deployed. We do so by drawing on ethnographic fieldwork with the British Broadcasting Corporation (BBC) – a large data- and value-driven organisation. In studying the epistemological function and politics of ‘good enough’, we take an AI lab studies approach, following the Recommendations Team's efforts to materialise ‘good enoughness’ and make it negotiable as they develop and modify recommender systems that aim to better serve the BBC's audiences. Through our ethnographic account, we demonstrate how the team relies on various metrics and qualitative evaluations to inform provisional performance thresholds before submitting the ML systems to A/B testing to establish whether one of them is ‘good enough’ to deploy. By following these processes of establishing ‘good enough’, we see how these negotiations are entangled in various, sometimes competing organisational objectives, as well as particular data and technical infrastructures. By extension, we show how the metricised performance scores of A/B testing are negotiated in practice by readjusting performance thresholds to manoeuvre between different values and constraints. Ultimately, our paper shows that establishing ‘good enough’ is a political endeavour of adjusting seemingly objective evaluation criteria to find the best-fitting metrics and ‘right’ thresholds.
Introduction
Is this machine learning (ML) model good enough? This question is fundamental to the evaluation of any ‘real-world’ application of ML (Hutchinson et al., 2022). The significance of this question also became highly evident during the six months in which I (first author) ethnographically followed the work of the ‘Recommendations Team’ 1 at the British Broadcasting Corporation (BBC). This team works closely with domain experts to develop and optimise recommender systems to better serve the BBC audience and its public service mission across various BBC platforms, including BBC News, BBC Sounds, and iPlayer. Recommender systems are ML-based systems that personalise content distribution based on past user behaviour (e.g. clicks) and according to different technical logics (e.g. item similarity) (Bobadilla et al., 2013). During the fieldwork, the question of ‘good enough’ kept recurring at various stages of developing and evaluating recommender systems: as the team discussed the accuracy of particular ML models, the editorial appropriateness of recommendations, and the results of A/B tests. A manager in the Recommendations Team also described how ‘the challenge of developing a good recommender system is the challenge of evaluating it’ (INT17), 2 highlighting the centrality of evaluation in the constitution of ML systems.
While the significance of the question of ‘good enough’ is well established in applied ML practice, limited research has explored how this question is pursued and answered in practice, particularly in highly value-driven and applied ML settings, such as the BBC. This paper aims to fill this gap by examining the various materialisations and negotiations surrounding what constitutes a system as ‘good enough’ to deploy at the BBC. In doing so, we aim to theorise the epistemic role of ‘good enough’ in ML development and shed light on the politics that shape its establishment. With this focus, the paper aligns with recent efforts to better understand the politics of ML evaluation and their implications (Hansen and Luitse, 2024; Luitse et al., 2024).
Understanding how ML systems are judged to be ‘good enough’ in practice is crucial, as such systems are being rapidly deployed across various societal sectors and participate in numerous decision-making processes, shaping citizens’ everyday interactions with different services and institutions. In the context of recommender systems, Mackenzie (2018) describes how the growing shift towards ML-based personalisation marks a broader trend in the probabilisation of social life, as these systems rely on large datasets to calculate the probabilities of what might be ‘interesting’ or ‘relevant’ to the individual. Through this process of probabilisation, experiences such as shopping (Mackenzie, 2018) or listening to music (Kang, 2023; Seaver, 2022) are reconfigured, resulting in new forms of cultural practice. At the BBC, recommender systems are already shaping how users interact with cultural content across its platforms, which can contribute to strengthening or weakening cultural citizenship and social cohesion in societies (Born, 2018b; Ferraro et al., 2024). As a public service media (PSM), the BBC is part of a cohort of value- and mission-oriented media organisations dedicated to offering universal and diverse content (Born, 2018a; Jones, 2022). The BBC's mandate as a PSM, which is funded primarily by UK citizens through a licence fee, is rooted in upholding these values and serving the public's interest (BBC, 2017). Precisely because the BBC is both a highly normative and a large data-driven organisation, the case helps highlight key dynamics and negotiations that shape how ‘good enough’ is established in applied settings, which can extend generatively beyond the case itself.
In the following, we first outline the epistemological significance of ‘good enough’ in ML, arguing that it is an epistemological condition for realising ML systems. While we acknowledge that recommender systems are a specific type of ML system, we connect our argument more broadly to the specificities of ML epistemologies. We then situate our study within existing social studies of ML that foreground the sociotechnical nature of ML development, before outlining our AI lab study approach (Jaton, 2020) and how we intend to study the politics of ‘good enough’. Finally, we follow the Recommendations Team, as they engage in various experimentations and evaluative practices to reach a final decision on whether a recommender system is ‘good enough’ to deploy on the BBC platforms.
‘Good enough’ as an epistemological condition of machine learning
The notion of ‘good enough’ itself is not unique to ML, as it has been used more broadly to challenge ideals of perfection in engineering and particularly in software development (Bialski, 2024). Here, the notion of ‘good enough’ is used as a pragmatic ideal to promote more ‘reasonable’ or ‘rationalistic’ understandings of software development (Collins et al., 1994; Yourdon, 1995). Building on these early conceptions, Bialski (2024), in her recent ethnographic study of software development, demonstrates how this ideal continues to flourish. Yet, she also finds that achieving ‘good enough’ software is, in fact, highly complex and laborious. Even the most mundane software requires ongoing maintenance work and constant ‘negotiations of what is good (enough) or not’ (Bialski, 2024: 164). The notion of ‘just good enough data’ has also been mobilised to challenge critiques of citizen data, grounded in their lack of accuracy, and to express ‘alternative ways of creating, valuing and interpreting datasets’ (Gabrys et al., 2016: 2; Gabrys and Pritchard, 2018). Their work illustrates how a broader set of ‘indicative’ measuring practices serves to differently constitute phenomena, such as pollution, allowing for the reconfiguration of what is at stake and even facilitating political action (Gabrys et al., 2016; Gabrys and Pritchard, 2018: 6). In this paper, we build on these prior conceptualisations and pay particular attention to the negotiated nature of ‘good enoughness’ in ML – how it often serves to pragmatically accommodate various interests and constraints, as well as how measuring practices are world-making practices and thus highly political. Yet we also argue that ‘good enough’ in ML differs from both this pragmatist software ideal and the critique of dominant measurement regimes, because in ML it serves to epistemologically constitute the workings of the ML system.
To understand the epistemological significance of ‘good enoughness’ in ML development, we set off from Amoore's (2020) observation that ‘good enough’ in computer science has a distinct meaning and function. In ML, ‘a “good enough” solution is one that achieves some level of optimisation in the relationship between a given target and the actual output of a model’ (Amoore, 2020: 67). Rather than being a pursuit of exact correctness, ML development is characterised by a process of continued experimentation and optimisation aimed at delivering an output that closely enough approximates the desired target. What exactly constitutes a ‘good enough’ solution is, therefore, always an experimentally achieved accomplishment, rather than a pre-existing state. This premise is tied to the particularities of ML epistemologies, which alter how ‘good enoughness’ can be established and its epistemic function, compared to both software and data practices.
ML systems, as learning systems, infer probabilities from their training data through inductive reasoning (Mackenzie, 2017). Yet, probabilities alone are not actionable and require a decision or classification threshold that specifies when an output should ‘count’ as knowledge or, in our case, when and how a probability translates into a personalised recommendation. By convention, ML systems typically use a threshold of 0.5 (Google for Developers, n.d.), meaning any output with a probability of 50% or higher is treated as a ‘good’ recommendation for the user. Such classification thresholds appear in all ML decision-making, albeit in different forms, and can be combined with other techniques to turn probabilities into knowledge. For example, while 0.5 is the standardised threshold, classification thresholds can be set differently to accommodate specific localised needs or priorities 3 and be used in combination with other thresholds to address diverse values throughout the ML development process, as will become evident in this paper. Yet, these different types of thresholds all serve the same purpose of transforming quantitative outputs into actionable, qualitative decisions on what, in our case, constitutes a ‘good’ or relevant recommendation. Without establishing a threshold for what is ‘good enough’, a system merely generates probabilities without producing actionable knowledge. Thus, ‘good enough’ operates as an epistemic threshold that marks the point at which outputs can be considered useful or satisfactory within a given context.
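To make this concrete, the following minimal sketch (our illustration, with hypothetical items and scores, not any production system) shows how a classification threshold turns model probabilities into actionable recommendations:

```python
# Minimal illustrative sketch: a classification threshold turns model
# probabilities into actionable recommendations (items and scores are
# hypothetical, not drawn from any BBC system).

scores = {"item_a": 0.91, "item_b": 0.47, "item_c": 0.62}

THRESHOLD = 0.5  # the conventional default; adjustable to local priorities

# Only items whose predicted probability clears the threshold 'count' as
# recommendations; below it, the output remains mere probability.
recommendations = [item for item, p in scores.items() if p >= THRESHOLD]
print(recommendations)  # ['item_a', 'item_c']
```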
By placing an emphasis on the epistemological role of ‘good enough’, we join recent studies that have traced how ML epistemologies transform the role of error (Aradau and Blanke, 2021) and addressed the epistemic limits set by ML systems (Kang, 2023), as well as emerging enquiries into the constitutive practices of ML development. Several scholars have noted that the emphasis on AI's inductive learning nature tends to background the human labour and conceptual work involved, for example, in hypothesising (Enni and Herrie, 2021) or formalising knowledge (Girard-Chanudet, 2025), when developing ML systems. In studying the practical development of ML, others have also found that such systems are shaped by various material conditions, such as metrological instruments (Jaton, 2023), data-gathering procedures (Engdahl, 2024) or by available technologies and the context of application (Passi and Sengers, 2020; Seaver, 2022).
We join these studies in foregrounding the sociotechnical nature of ML development. Yet these studies tend to focus either on the early stages of ML development or on its overall processes, paying limited attention to the evaluative processes that establish ‘good enoughness’ and legitimise the deployment of ML systems. Passi and Sengers (2020) begin to address such questions by following the various considerations that enable an ML system to ‘work’ in the context of a law firm, demonstrating that what constitutes a working system is shaped by both organisational objectives and available technologies. The recent work of Wirth et al. (2025) further underscores that ML evaluation should be treated as a continuous process rather than a one-time event. In this paper, we take up this processual approach and extend these insights by focusing on the role of metrics and thresholds in materially governing the establishment of ‘good enoughness’. We further investigate how negotiations over the choice of metrics and thresholds were shaped by the cultural and organisational context of the BBC, as well as its public service and business goals.
An artificial intelligence lab study: examining the politics of ‘good enough’
To study the materialisations and politics of ‘good enough’, we take an AI lab studies 4 approach (Jaton, 2020). In doing so, we follow the ‘mundane actions and work practices to document and make visible’ how ‘good enoughness’ is practically achieved through various practices of experimentation (Jaton, 2020: 20). Specifically, we examine how data scientists, in collaboration with domain experts, employ various quantitative and qualitative ML evaluation methods that both provide ‘evidence’ of model performance and allow them to adjust the models to better align with desired outcomes – for example, by applying weights or fine-tuning parameters – before ultimately deciding whether one of them is ‘good enough’ to deploy. Whereas prior AI lab studies have focused predominantly on so-called ground-truthing practices, namely the practices of composing the training and evaluation datasets used in ML development (see Jaton, 2025), we turn to practices of metricising and thresholding, 5 which are central to establishing whether a model is ‘good enough’. In doing so, we are particularly interested in addressing what Law and Mol (2008) call ‘material politics’, which describes how material practices participate in ordering the world in ways that shape what can be negotiated. To understand how these material politics unfold in the establishment of ‘good enoughness’, we draw on existing work that addresses the political nature of metrics and thresholds.
Measuring practices are inherent to ML development, as they are used to quantitatively characterise dataset composition and model performance (Hutchinson et al., 2022; Mitchell et al., 2023). However, what and how things are measured is not given. Rather, measurements are always constructed and decided upon by those who design these systems, making them inherently political, as they express what is valued (Thakkar et al., 2022) and constitute how the phenomenon can be discussed and acted upon (Gabrys et al., 2016; Gabrys and Pritchard, 2018). In ML, evaluation metrics are the dominant way to measure model performance (Hutchinson et al., 2022). While often criticised for failing to deliver on their aims, such standardised methods of measurement in ML are still often considered ‘objective, reliable, and reproducible’ ways of assessing performance (Scheuerman et al., 2019: 317:13; see also Grill, 2022). As a result, scholars argue that metrics hold performative power in establishing certainty around the capacities and values of ML systems (Grill, 2022). The most common evaluation metrics in ML evaluation are accuracy metrics, which express the proportion of results that were correctly classified. However, as metrics are always partial, expressing only a fraction of the model's performance, using accuracy metrics alone overlooks other vital social and ethical values (Birhane et al., 2022; Grill, 2022). In capturing the political nature of ‘good enoughness’, such metricising practices are, therefore, crucial in setting the conditions for what can be measured and negotiated in the first place. Consequently, to foreground the epistemological politics that shape decisions about whether a model is ‘good enough’ to deploy, we need to understand not only what knowledge is privileged but also who gets to decide what knowledge ‘counts’ in these decisions (see also Schwandt, 2007).
Unlike metrics, thresholds have been much less discussed in critical examinations of ML evaluation. Drawing on Benjamin's (1999) discussion of architectural thresholds, Aradau and Blanke (2022) argue that thresholds constitute a specific mode of governance in the context of ML. Thresholds differ from boundaries in that ‘thresholds are not about separation, but about transition, passage, and transformation’ (Aradau and Blanke, 2022: 192). They invoke ambiguity and possibilities through questions of more or less, rather than setting binary boundaries of either/or. Thresholding practices are therefore inherently negotiable and afford flexibility in managing diverse possibilities and potentialities. Determining what finally constitutes a satisfactory threshold is also highly political, because as Amoore (2020: 69) highlights: ‘To adjust the threshold of what is “good enough” is to decide the register of what kinds of political claims can be made in the world, who or what can appear on the horizon, who or what can count ethico-politically’. In the context of climate, Gabrys and Yusoff (2012) similarly discuss ‘politics at the threshold’, arguing that various threshold values invoke particular worlds, determining what is considered acceptable, possible, or undesirable. Given the epistemological significance of thresholds in ML, the demarcation of a threshold, therefore, not only sets the conditions for how the system works but also shapes what realities are produced at the expense of others, thereby involving ontological politics (see also Mol, 1999). Consequently, by studying thresholding practices, we can highlight the political implications of how the threshold for ‘good enoughness’ is ultimately demarcated, which voices or interests are privileged in setting it, and why.
The recommendations team at the British Broadcasting Corporation
The Recommendations Team at the BBC differs from the ‘classic’ data science lab setting explored by, among others, Jaton (2020), as it functions as a product team that delivers recommender solutions for the BBC services. In a research-focused setting, evaluative practices are narrower and ‘learner-centric’, focusing on demonstrating improved performance and robustness of an ML model on a specific task (Hutchinson et al., 2022: 1861). When ML systems are implemented in particular domains, evaluation practices are instead ‘application-centric’ (Hutchinson et al., 2022: 1861), which involves more diverse evaluation methods to accommodate wider societal, economic and infrastructural considerations, as these systems are supposed to deliver on specific business objectives and interact with existing infrastructures (Aradau and Blanke, 2022; Breck et al., 2018). At the BBC, recommender systems must, for example, interact with existing data infrastructures and content management systems. More importantly, they are intended to meet specific business and editorial objectives, as they are designed to become what are referred to as ‘public service recommenders’ (Piscopo et al., 2024). As recommenders were originally developed in purely commercial settings, they are commonly optimised towards monetisable goals such as purchasing or attention (Stray et al., 2022). Making these systems optimise for public service goals, therefore, involves different forms of editorial alignment work (see e.g. Møller, 2024; Schjøtt Hansen and Hartley, 2021; Stray et al., 2022). Such alignment work extends to the decision on whether a recommender system is ‘good enough’, where a variety of evaluation methods are used to ensure that both business goals, such as growing its audience base, and public service ideals, such as universalism, shape how its recommender systems are made to work. While the BBC, as a public service organisation, does not rely directly on audience growth for its business model, it does so implicitly, as its public service remit remains grounded in maintaining a broad reach within UK society (Born, 2018b).
Recommender systems also pose a particular evaluative challenge because experimentation and evaluation with offline and historical data are not considered sufficient to understand the ‘real’ behaviour and performance of ML models (Castells and Moffat, 2022). As historical user data only reflects content that users were already shown, it is impossible to know how they would have reacted to a different recommendation. 6 To reliably determine how users will actually respond to recommendations, data scientists instead conduct (live) A/B testing of ML models (Castells and Moffat, 2022). In the context of recommender systems, A/B testing assumes a privileged role in establishing ‘good enoughness’, as everything up to that point ‘is only indicative’ (INT1), as a BBC data scientist explained. Consequently, A/B test results served as the grounds for making a final decision on whether a recommender system was ‘good enough’ to deploy at the BBC. In other domains and for other ML applications, this might differ depending on the context or the specific technicalities involved, making it important for future research to examine what evaluative methods shape ‘good enoughness’ across different applications and settings.
Methodology: ethnographic studies of machine learning experimentation
The paper draws on a hybrid ethnographic enquiry conducted at the BBC between September 2023 and February 2024. 7 Ethnography is a central methodology in AI Lab studies and more broadly in the social study of AI systems (see Jaton, 2025). During fieldwork, the first author participated both in person (primarily in the London offices) and online in the Recommendations Team's ongoing activities. Most of the observations centre on the work of two sub-teams developing recommender systems for BBC Sounds and iPlayer, which were very active during this period, with several ongoing projects. However, the first author also participated in general ‘scrum meetings’, team days and meetings in other sub-teams. The hybrid nature of the enquiry was chosen because the team is distributed across various BBC offices in the UK, and, as a result, their work is inherently hybrid. The first author would, therefore, participate on an ongoing basis as a ‘remote’ member of the team via Zoom. Once a month, she would spend approximately a week in London participating in in-person activities in the London offices, often coinciding with occasions when the team gathered in person. The observations were complemented by ongoing ethnographic interviews with key actors involved in recommender system development (Spradley, 1979) and by two workshops with the team.
Consequently, our empirical data consists of: (1) observations recorded as fieldnotes during both online and in-person meetings about specific ML projects; (2) 20 recorded and transcribed interviews 8 with data scientists, editors, engineers, product managers, and management who participated in the observed meetings; and (3) insights from two workshops conducted in December 2023 (one online and one in-person), aimed at qualifying the initial findings with participants and creating a space for reflection. The empirical data were analysed thematically using a bottom-up strategy (Gibbs, 2007). For data analysis, the open-source software Taguette was used to facilitate an iterative process of identifying and connecting emerging themes in the data.
The Recommendations Team continually develops new recommender systems and modifies existing ones. During the observation period, several projects at various stages were underway across the different sub-teams. In the following sections, we first trace three central moments in which ‘good enough’ materialises through diverse evaluative devices as it transitions from the data scientist's laboratory to the hands of the editors, before finally encountering the ultimate test, when the system faces the ‘real’ audience in A/B testing. This initial part of the analysis draws on insights from the various observed projects and the interviews. Second, we conduct a ‘close reading’ of two ethnographic moments in which the team developing recommenders for BBC Sounds discussed the results of two A/B tests to determine whether to deploy one of the tested variants of the ML system. Together, these two parts of the analysis demonstrate how various aspects of ‘good enoughness’ are continuously negotiated using different evaluation methods before a final performance threshold is established based on A/B test results.
From lab experiments to testing ‘real’ performance
The development of recommender systems at the BBC followed a relatively standardised process. Once a new idea for either an entirely new recommender system or a modification to existing services was proposed – most often by the product team – the project would be scoped out. This scoping exercise typically involved data scientists, editors, engineers, and the product manager, who would collaborate to outline the project's aim. At this stage, those involved would specify what ‘good(s)’ the new or modified recommender would fulfil and, importantly, for whom. Such ‘goods’ could include editorial or public service considerations – such as serving underserved audiences or reflecting diversity through content offerings – as well as more business-oriented ‘goods’ such as ensuring increased audience engagement, retention and growth. These discussions, therefore, served to collectively problematise exactly what specific problem the recommender was to solve (see also Jaton, 2017, 2020), and were generally informed by the BBC's editorial principles and organisational performance indicators, as determined by senior figures in the BBC's Product Group, which sets targets for all BBC Products.
This collective problematisation of the project would generally be materialised in a product narrative, a brief scenario-based description outlining the service and its value. Essentially, the narrative would serve as an initial reference point for what ‘goods’ the system would need to achieve to be deemed ‘good enough’. However, this product narrative was relatively high-level and typically did not include technical specifications or measurable evaluation targets.
Offline evaluation: establishing ‘good enough’ machine learning models
Grounded in the product narrative, the data scientists would begin their experimentations. Depending on whether they were building a new recommender or modifying an existing one, they would follow different strategies. For changes to existing systems, they would start by exploring trends in the data to inform their choices about what to test later in a live A/B test. When developing new recommender systems, more experimental steps were necessary. In such cases, the data scientists would typically identify a handful of open-source recommender algorithms 9 they considered most promising for meeting the project objectives and that could realistically run on BBC infrastructure using existing data sources. These models would then be subjected to what was generally referred to as offline evaluation, which involves testing the ML model's performance using offline historical data. 10 Using historical user data, the data scientists would train the ML algorithms on a subset of data containing the various items that users had watched or listened to (i.e. the training dataset), and then evaluate that against another subset of data to see how closely the different models’ predictions matched what that user had consumed next (i.e. the evaluation dataset).
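Schematically, and with hypothetical users and items, such a split can be sketched as follows (a simplification on our part; the team's actual pipeline is not reproduced in our material):

```python
# Illustrative offline evaluation split (hypothetical data). Each user's
# consumption history is split chronologically: earlier items form the
# training dataset, and the held-out final item forms the evaluation
# dataset, against which the model's predictions are compared.

history = {
    "user_1": ["drama_ep1", "drama_ep2", "quiz_show", "drama_ep3"],
    "user_2": ["news_brief", "football_pod_ep1", "football_pod_ep2"],
}

train, evaluation = {}, {}
for user, items in history.items():
    train[user] = items[:-1]      # what the model learns from
    evaluation[user] = items[-1]  # what the user actually consumed next
```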
To conduct these experiments, the team relied on a repository of standardised evaluation metrics originally developed by the BBC's R&D department and continuously adapted to reflect industry developments. The repository contains several accuracy and diversity metrics that operationalise the ‘relevance’ of recommended content to the user according to different logics. Where accuracy metrics measure how well recommended content matches what the user consumed, diversity metrics instead measure the variety of the recommendations produced by the system. 11 Each of these metrics, therefore, delineates what constitutes an accurate or diverse recommendation, which was then used to assess whether the models were ‘good enough’ according to that measure.
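To give a flavour of what such metrics compute, consider two simplified illustrations (our sketches; the BBC's own metric repository is not reproduced here): an accuracy-style hit rate and a genre-based intra-list diversity measure.

```python
# Simplified illustrations of the two metric families (hypothetical
# operationalisations, not the BBC's own implementations).

def hit_rate_at_k(recommended: list[str], held_out: str, k: int = 5) -> float:
    """Accuracy-style metric: did the item the user actually consumed
    next appear among the top-k recommendations?"""
    return 1.0 if held_out in recommended[:k] else 0.0

def intra_list_diversity(genres: list[str]) -> float:
    """Diversity-style metric: share of distinct genres in the list
    (1.0 means every recommendation comes from a different genre)."""
    return len(set(genres)) / len(genres)

print(hit_rate_at_k(["a", "b", "c"], "b"))               # 1.0
print(intra_list_diversity(["drama", "drama", "quiz"]))  # 0.666...
```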
The data scientists used these experiments to determine which ML models were ‘good enough’ to be presented to their editorial colleagues and subjected to further experimentation. As one data scientist explained in an interview: ‘(…) from there [i.e. offline experimentation], we learn if this model is good enough or not. It is more a safeguard for us to know if something is worthy’ (INT8). Generally, the data scientists would compare how the selected ML models performed against each other using the standardised evaluation metrics and also compare them against what they called baseline models as a form of ‘sanity check’ (INT1). These baseline models would be simple random and popularity-based models, meaning models that simply produce random predictions or predictions based on what is most popular. In making their final decision on which models to take forward, the data scientists generally relied on a single accuracy metric to demonstrate the models’ performance to their editorial colleagues. At this particular moment, ‘good enough’ was, therefore, materialised as ‘accurate’ decisions and established through comparison with other ML models. However, offline evaluation was considered rather limited in its epistemic value due to the problems associated with historical data. Additionally, the sole focus on accuracy could not capture what would be an editorially ‘good’ recommendation. Consequently, in the following stage of experimentation, the data scientists would always rely on their editorial colleagues’ qualitative evaluations of whether the recommendations produced were editorially ‘good enough’.
Editorial evaluation: ‘good enough’ recommendations
To facilitate the editorial evaluation of the recommender systems, the data scientists relied on visualisation tools that presented results from the initial offline evaluation in a way that mimicked how they would appear on the on-demand platforms. Specifically, these tools visualised a single user's consumption history from the training data, along with the recommendations produced by one or more ML models for that user. Unlike the offline evaluation, the goal was not to assess the general accuracy of the ML models based on the probabilities of the recommendations’ ‘correctness’. Instead, the aim was to evaluate whether the recommendations produced by these models were also ‘good’ from an editorial point of view. Seeing the recommendations was an important part of assessing their quality, as the visual form enabled the editors to determine whether they reflected the intended objectives and values. As one editor explained, the visualisations would help editors see whether, for example, minority groups (ethnic or demographic) were represented in the recommended content (INT9). During these evaluations, the editors and commissioners would therefore use the visualisations to discuss whether the recommended content was sufficiently diverse, local, or niche – depending on the editorial objectives outlined in the product narrative and the BBC's wider public service mission. If not, they could propose changes to the content collection used for the recommendations or suggest adding business rules. 12 These rules refer to post-processing steps in which the recommended content from the ML algorithm is filtered or weighted before being presented to the audience. One example could be excluding certain types of content from recommendations, such as highly popular content. Additionally, this process served an important safeguarding role, aimed at identifying any ‘problematic’ recommendations that did not align with the BBC's editorial principles, and which could potentially create public outcry and challenge the BBC's remit.
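As an illustration of such a business rule, the following is a minimal sketch (ours, with hypothetical items and scores) of a post-processing step that filters out highly popular content before it is presented:

```python
# Illustrative post-processing 'business rule' (hypothetical data): exclude
# items above a popularity ceiling from the ranked recommendations.

def apply_business_rules(ranked, popularity, popularity_ceiling=0.9):
    # Keep the model's ordering, but drop items deemed too popular.
    return [(item, score) for item, score in ranked
            if popularity.get(item, 0.0) < popularity_ceiling]

ranked = [("hit_show", 0.95), ("local_doc", 0.81), ("niche_pod", 0.74)]
popularity = {"hit_show": 0.97, "local_doc": 0.40, "niche_pod": 0.12}
print(apply_business_rules(ranked, popularity))
# [('local_doc', 0.81), ('niche_pod', 0.74)] – 'hit_show' is filtered out
```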
At this stage, establishing ‘good enoughness’, therefore, no longer relied on the standardised data science methods described above, but rather on situated practices and domain-specific knowledge regarding the available content and its editorial (in)appropriateness. As the data scientists continued to experiment with and optimise these ML models, the editorial stakeholders would be invited to several editorial review sessions, where they would use the visualisation tool to negotiate and adapt the ML models’ performance. The aim of these iterative sessions was, similarly to the offline evaluation, to reach a high enough level of confidence in the models’ performance to determine that they were editorially ‘good enough’ to be submitted to a live A/B test. As an editorial lead described, the visualisation tool ‘helps us to refine the decisions that we might want to make before we kind of go live for experiments and A/B test with audiences’ (INT7). While these editorial evaluations were central to ‘steering’ model behaviour in the desired direction, they still could not account for the models’ live performance. The only way to truly assess whether the selected ML models delivered the intended value was through their encounter with their intended ‘real’ audience, which was always tested via A/B tests.
A/B testing: establishing grounds for a decision on ‘good enoughness’
In short, A/B testing involves presenting subsets of users with two or more variants of recommendations to compare their performance against predetermined metrics. Describing the role of A/B testing, a data scientist during one of the workshops remarked: ‘At some level, you can think one purpose of A/B tests is to check that our offline estimations of what we think we are doing online ring true’ (WORK1). In this way, the A/B test served as the ‘real’ ground truth or benchmark against which the team could test their previously offline-based assumptions (see also Jaton, 2020). Before running such A/B tests, the team would decide which metrics to use to evaluate the results. These metrics were of high importance, as they would establish the grounds for the team's final decision on whether the systems were ‘good enough’ to deploy. Furthermore, the BBC had a limited capacity to run A/B tests, due both to resource constraints and to its commitment not to overburden an unaware audience with tests that could temporarily impoverish their experience. Consequently, it was crucial that the chosen metrics could provide the necessary answers for the team to make an informed decision and ensure high-quality testing.
While the choice of metrics would always depend on the intended aim of the recommender, one data scientist explained that ‘the primary measure of our A/B test will likely be click-through rate or watch-through rate’ (INT1). 13 Click-through rate (CTR) refers to the percentage of users who click on recommended content, whereas watch-through rate refers to the percentage of users who click and then actually watch the content. These metrics were considered easy to monitor in a live system and important for understanding the impact on audience engagement: ‘If it is positive or negative results, we use that to make decisions’, another data scientist highlighted (INT8). A positive or negative outcome in this instance means that there was a statistically significant increase (positive) or decrease (negative) in the number of users who clicked on or watched the recommended content. Notably, these metrics also served as so-called ‘guardrails’, as the same data scientist explained: ‘(…) we do not want to see a drop and go below our business target’ (INT8). Consequently, the metrics provided a means to assess whether the tested systems reached a positive or negative performance threshold, which could inform the decision about which, if any, of the tested models was ‘good enough’ to be deployed.
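For illustration, the following sketch shows how such a ‘positive or negative’ outcome might be established with a standard two-proportion z-test (hypothetical click counts; the team's actual statistical tooling is not described in our material):

```python
# Illustrative significance check for an A/B test on CTR (hypothetical
# counts; a standard two-proportion z-test, not the BBC's own tooling).
from math import erf, sqrt

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Control CTR 4.8%, variant CTR 5.15%, 100,000 impressions each:
z, p = two_proportion_z_test(4_800, 100_000, 5_150, 100_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a 'positive' result only if significant
```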
However, as the quote above reveals, these metrics predominantly captured the ‘business targets’ related to the recommender, namely attention or engagement. In contrast, longer-term engagement and audience growth, in the form of ‘retention metrics’ (INT4), as well as public service considerations, were not directly measured in these tests. The reason for excluding these measurements was partly the difficulty of measuring retention and public service values meaningfully. Retention can be hard to measure because it ‘by definition takes a lot longer to measure’ (INT1), as a data scientist explained, and because the long-term effects were seen as challenging to disentangle from other effects, requiring the measurement of a cumulative effect over time. Public service considerations were even more challenging, as they were seen as ‘quite subtle, hard to define things’ (INT1) that would be difficult to measure. 14 Importantly, they were also often not captured in the current data infrastructures. Developing such metrics would, therefore, potentially require setting up new data infrastructures to capture these considerations. Currently, the A/B testing relies on the existing data infrastructure at the BBC, which was originally implemented as part of the move towards commercial audience analytics systems that gained prominence in media organisations with the advent of online news (see e.g. Christin, 2020). The existing data infrastructures and availability of metrics, therefore, shaped how the A/B test could be set up and, ultimately, the grounds for how ‘good enoughness’ could be established. The following explores how the results of A/B tests were used to determine what constituted a ‘good enough’ performance threshold in different cases, and how these decisions ultimately informed which ML model, with differently configured decision thresholds, would be deployed and begin to make ‘decisions’ on what content users would be recommended.
Negotiating ‘good enoughness’: deciding on performance thresholds
During the fieldwork at the BBC, the Recommendations Team planned and conducted multiple A/B tests to validate the performance of new recommender services or make modifications to existing ones. Only one of the observed A/B tests resulted in an immediate decision to roll out a ‘winning’ variant for production. In the other experiments, the question of whether the results sufficiently demonstrated that the system worked satisfactorily remained less clear-cut. However, this did not necessarily mean that the systems were not deployed. In the subsequent sections, we follow the team's negotiations regarding what constituted a ‘good enough’ performance threshold, whether the systems delivered on the predetermined evaluation criteria, and how these thresholds could be expanded or narrowed to accommodate specific constraints and facilitate a decision. As both experiments aimed to change an existing decision threshold for when content could be recommended, much was at stake in these discussions as they, implicitly, related to how future recommenders would produce knowledge.
The threshold versus training time experiment: expanding the threshold of ‘good enough’
The first experiment was commonly known as the ‘threshold versus training time experiment’. This experiment aimed to test how changes to the current ‘personalisation threshold’ would affect CTR. The personalisation threshold refers to the amount of content a user must have listened to over a specific timeframe to receive personalised recommendations. Currently, this threshold is set to a specific number of listens over a certain period of months. 15 Once a user meets this threshold, their data will enter the training data for the recommender system and be used to produce recommendations for that user. However, the team suspected that it could be beneficial to lower the threshold to a shorter period with fewer listens. This change would mean that more users would receive personalised content more quickly, which was seen as improving their experience and thus increasing the value of the service, and potentially leading to users consuming more content overall. To test this assumption, they conducted an A/B test with multiple variants, including both lower and higher thresholds.
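The mechanism can be sketched as follows (our illustration; for concreteness it uses the values of the eventual winning variant reported below, not the undisclosed pre-existing threshold):

```python
# Illustrative 'personalisation threshold' check. The values mirror the
# winning variant discussed below (two months, two listens); the actual
# implementation and current threshold are not in our material.
from datetime import datetime, timedelta

MIN_LISTENS = 2
WINDOW = timedelta(days=61)  # roughly two months

def is_personalisable(listen_times: list[datetime], now: datetime) -> bool:
    # Only once a user's recent listens clear the threshold does their
    # data enter the training data and yield personalised recommendations.
    recent = [t for t in listen_times if now - t <= WINDOW]
    return len(recent) >= MIN_LISTENS
```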
During a meeting in early November, the team's hypothesis was confirmed as they discussed the experiment's results. The ‘winning’ variant, as the product manager explained, was a combination of two months and two listens, which had achieved the highest increase in CTR. The product manager, furthermore, emphasised that this confirmed the change would not be ‘hurting the performance’, which was generally the baseline for these evaluations geared towards improving performance (Quote from fieldnotes). The statistically significant result served to establish this variant's ‘good enoughness’ by not only reaching the minimum threshold of not negatively affecting the CTR but, in fact, surpassing it. The data scientist further highlighted that implementing this change would also facilitate another ‘good’: ‘it [i.e. the change] would reduce the training time by minimising the amount of training data, so that would get the process to be faster and speed it up significantly’ (Quote from fieldnotes). Changing the threshold would ultimately help lower the time it takes to retrain ML models, which the team does twice a day to ensure the recommendations are ‘fresh’. However, reaping this benefit would require implementing the change across all BBC Sounds recommenders that rely on the tested model, as the same models are often used across multiple rails on the platform.
This A/B test had only tested the change on the personalised rail, ‘Recommended for You’. Rail, in this context, refers to a specific row of content on the mobile or web app, and all testing and performance measuring was always centred on a single rail. However, the data scientist argued that many of the existing personalised rails already have arbitrary thresholds determined without any in-depth analysis. The fact that the change had only been tested on this rail should, therefore, not impede them from rolling out a variant that ‘would allow the entire process to be faster’ (Quote from fieldnotes, data scientist). In making this argument, the data scientist did not attempt to account for similarities between the tested and remaining rails, as these were implicitly seen as comparable use cases. Instead, the arbitrariness of the existing thresholds was used to argue that they would likely also benefit, as there was no evidence that the existing thresholds were superior. However, the decision to roll out the variant across the entire site raised the critical question: ‘How do we explain to the editorial team that we will make these changes without tests?’ (Quote from fieldnotes, data scientist).
The age of content experiment: narrowing the threshold of ‘good enoughness’
The second experiment, referred to as the ‘age of content experiment’, aimed to test the effects on CTR of changing the threshold for the age of content eligible for recommendation. Currently, the threshold is set at content that is 10 years or younger, meaning content older than 10 years will not be recommended. However, there were concerns within the editorial team regarding the cultural appropriateness of older content that might have aged ‘poorly’ in attitude (Quote from fieldnotes). Specifically, they worried about how it could affect the public perception of the BBC, as the recommended content could be considered problematic by contemporary cultural standards. As one editor noted in a meeting, ‘(…) the way we derive comedy has changed in 10–15 years’ (Quote from fieldnotes). To address this concern, they proposed lowering the threshold. However, to understand the effects of this, the team first conducted a 2-week experiment with four variants, setting the threshold at either 13 months, 3 years, 5 years, or 10 years (the current threshold was used as a control). The expectation was that lowering the age of content would positively affect CTR, as the data scientists’ initial data explorations showed that ‘the vast majority of impressions made were on items five years younger’ (Quote from fieldnotes). Impressions refer to content displayed on a user's screen, even if it is not interacted with. In early December, the results were in and, surprisingly, did not confirm the hypothesis:

Overall, they [i.e., variants] are all minus, but it is strange that variant one with 13 months and variant three with five years are quite similar in the negative effects, while the middle one [i.e., variant two; three years] is different. It is worse than the others. Any thoughts? It is a bit weird. (Quote from fieldnotes, product manager)
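Before turning to the team's interpretation of these results, note that the mechanism being varied is itself simple; a minimal sketch of such a content-age eligibility filter (our illustration; field names are assumed):

```python
# Illustrative content-age eligibility filter (our sketch; the tested
# variants set the cut-off at 13 months, 3 years, 5 years, or 10 years).
from datetime import datetime

def eligible_for_recommendation(first_broadcast: datetime, now: datetime,
                                max_age_years: float = 10.0) -> bool:
    # Content older than the cut-off is excluded from the candidate pool.
    return (now - first_broadcast).days / 365.25 <= max_age_years

# Under the 3-year variant, a 2012 programme is excluded in 2023:
print(eligible_for_recommendation(datetime(2012, 6, 1),
                                  datetime(2023, 12, 1),
                                  max_age_years=3.0))  # False
```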
To better understand the results, the team began investigating the test result data more closely before deciding whether to discard the project or conduct further A/B testing. During a meeting in mid-December, the team, along with key editorial stakeholders, discussed the results and how to proceed. Various possibilities for why the test had yielded negative results were explored, but importantly, the product manager offered a potential alternative way forward in making a decision, wondering whether there could be a difference in the effect on CTR across user groups: ‘We know we are super-serving a certain audience segment (…). I suspect that they might see old content and feel nostalgic and engage with it because it is familiar, but with infrequent younger audiences, there might be a difference’ (Quote from fieldnotes). This proposal introduced the possibility of narrowing the threshold of ‘good enough’ to apply solely to a subpopulation of the audience. An editor chimed in, noting that they would not wish to make changes that would upset their main audience, but that, to grow, they might still have to move forward with the change to give those younger audiences a better experience.
Within the team, they often discussed various audience groups, such as their so-called ‘heartland’ audience, which refers to regular BBC consumers and is often equated with affluent middle-aged listeners. 16 However, the team was keen to reach infrequent users, who instead were often associated with younger audiences, as they were seen as a potential source of growth for the BBC's reach and of securing its remit in the future. Narrowing the threshold to only account for infrequent users could, therefore, be argued to be a sensible way to deliver on the project's commercial and public service objectives. Moreover, the project was specifically viewed as targeted toward the younger audience, who were essential to growth, but perhaps also more culturally sensitive than older generations. If one of the variants showed statistically significant results among infrequent users, the team could therefore reassess whether that variant could be deemed ‘good enough’ after all, given the narrowed performance threshold. These constant renegotiations and adjustments of the thresholds of ‘good enough’ illustrate how malleable ML evaluation can be. As an editorial lead also noted in an interview, the testing and evaluation of recommenders tended to almost become self-fulfilling prophecies. There always seemed to be a way to confirm the starting hypothesis and deliver on the project aim (INT7).
Concluding discussion: making machine learning ‘good enough’
When ML systems fail to accurately detect darker-skinned faces or wrongly withhold social security due to misclassifications, the egregious societal impacts of these systems become visible. Such failures can spark public controversy, reopening questions about their appropriateness and leading to the suspension or cancellation of these systems (Ananny, 2023; Ratner and Schrøder, 2024). However, in ML, such errors and other inaccuracies are transient and inherent to its development. They become something to optimise for and improve in the future (Aradau and Blanke, 2021). The decision to deploy an ML system is never grounded in full certainty or exactness but on continuous optimisation until a ‘good enough’ result is achieved. This process involves accounting for the epistemic uncertainties inherent in ML by fine-tuning the system until errors and other undesirable outcomes are recalibrated, though never entirely eliminated.
Unlike with other digital technologies, such as software, we argue in this paper that ‘good enoughness’ in ML is an epistemic condition that shapes the emergence and functioning of these systems. To establish confidence in an ML system being ‘good enough’, practitioners must continuously negotiate which metrics can inform them about a system's performance and decide on which performance thresholds, in the end, constitute a ‘good enough’ result. Decisions regarding metrics and thresholds bring human sense-making into contact with the epistemological and technical conditions of ML development, as these decisions both explicitly and implicitly shape how the systems operate and produce knowledge. Following such processes thereby makes visible how decisions to deploy an ML model are always a sociotechnical accomplishment, involving both material conditions and social negotiations. In this article, we followed a specific case of such negotiations, which is relevant not only because it involves the BBC, one of the world's largest media organisations, but also because, as a PSM, the BBC needs to bring diverse goods to bear in these negotiations. The BBC also resembles other large data-driven organisations that seek to leverage their existing data to enhance their services with ML. While our findings describe this situated context, our discussions of ‘good enough’ more broadly advance an understanding of the epistemological specificity of ML evaluation – particularly regarding the role of metrics and thresholds in constituting what is considered ‘good enough’.
Through our ethnographic account of BBC development processes, we illustrate how establishing ‘good enoughness’ is a continuous process that involves different ways of materialising ML performance to understand and adjust different facets of model behaviour. Simply accounting for the ‘final’ and often singular metrics used to publicly legitimise the deployment of ML systems risks overlooking the many considerations and negotiations that shape the establishment of ‘good enough’ in practice. Viewing ‘good enough’ as a continuous and provisional achievement also challenges current approaches to ethical or responsible AI, which tend to treat ML evaluation statically and only account for potential impacts upon deployment (see also Wirth et al., 2025). Furthermore, our account demonstrates that no ML system is ever self-evidently ‘good enough’; rather, making a system ‘good enough’ is always a political endeavour of adjusting seemingly objective evaluation criteria by finding the ‘right’ thresholds and best-fitting metrics. In making this argument, our paper joins other critiques of ML evaluation that aim to denaturalise and politicise how, for example, accuracy metrics obscure the discriminatory effects of ML systems (see e.g. Birhane et al., 2022; Grill, 2022). Specifically, our paper extends such efforts by showing the intricate ways in which different metrics inform the establishment of various provisional and final thresholds of ‘good enough’ and how these practices shape what comes to ‘count’ as knowledge. These findings enable us to begin to theorise the epistemic function and politics of ‘good enough’ in ML development. Here, we return to Walter Benjamin's (1999) definition of thresholds as transitional, ambiguous and transformative to characterise the unfolding politics of metricising and thresholding practices (see also Aradau and Blanke, 2022).
Following the establishment of ‘good enoughness’, we observed how the data scientists initially set a threshold for model performance in terms of accuracy, which is then qualified through editorial assessments of the recommendations. Finally, the team decides on the performance metrics to use in A/B testing, thereby setting the conditions for evaluating the system. Consequently, we argue that metricising and thresholding practices serve a key epistemic function in ML development: they are the material practices through which probabilistic outputs and provisional results are turned into actionable knowledge about whether a system is ‘good enough’ to deploy.
However, we also observed that, in establishing the final threshold for what constitutes a ‘good enough’ system to deploy based on A/B testing, metricised understandings of performance alone did not inform this decision. Instead, other considerations – economic or editorial – can re-enter the negotiations to either expand or narrow the performance threshold. Here, the inherent ambiguity of thresholds, which mark transition rather than separation (Aradau and Blanke, 2022), affords the flexibility to accommodate competing values and constraints while the final decision retains its appearance of metricised objectivity.
Finally, in establishing a specific threshold for ‘good enough’ by either limiting which audiences are included in the assessment or treating the results as representative of all users across the entire platform, the BBC makes crucial demarcations about whose realities will be of greater importance in how the system works. In setting such thresholds, the team therefore engages in ontological politics (Mol, 1999), deciding which users and whose experiences come to count in the realities the recommender systems produce.
Acknowledgements
Many contributed to the realisation of this work. First of all, this work would not have been possible without the generosity of the BBC professionals who welcomed me into their meetings and daily discussions and allowed us a glimpse into the complexity of building and evaluating ML systems. Here, a special thank you to Rhianne Jones and Natali Helberger, who helped facilitate this access. Secondly, thank you to the Journalism and Democracy group at Roskilde University and the DeepCulture research group at the University of Amsterdam, who provided feedback on this work at various stages.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partly funded through the Horizon2020 project, AI4Media. Grant agreement ID: 951911. This research was partly funded through the ERC project, Deep Culture. Grant agreement ID: 101141330.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
