Abstract
Our article contributes to debates on critical humanitarianism and critical dataset studies by examining the praxis of digital humanitarianism and the entanglement between what we know and how we know human suffering in large-scale disasters. We specifically focus on the “data work” carried out by various professionals in creating and curating datasets used by humanitarian actors to address both current and future crises. In doing so, we explore how these actors make sense of the development, maintenance and (re)use of such datasets. This approach enables us to analyse the underlying principles and practices of data curation and to investigate the challenges that arise when humanitarian and non-humanitarian actors collaborate to produce and sustain these datasets. We also examine how these often tension-filled collaborations influence practitioners' understandings of humanitarian crises and their efforts to reconcile efficiency with humanitarian principles. We argue that while there is no such thing as a “humanitarian dataset”, there are contested processes of datafied humanitarian knowledge production that are situated in specific contexts. We suggest that the construction of these datasets redefines humanitarian knowledge production and is therefore critical to understanding the shifting politics of digital humanitarianism. Our findings highlight three key challenges that characterise the relationships among these groups: the fragmentation of the humanitarian data value chain; the asymmetries in expertise and skills among the diverse professionals involved in dataset curation; and the integration of small and big data in crisis analysis and prediction. These insights carry both epistemic and ethical implications, drawing attention to the problematic relationship between datafication and humanitarian principles.
This article is part of the special theme on Datafied Development. To see a full list of all articles in this special theme, please visit: https://journals.sagepub.com/page/bds/datafied_development
Introduction
Over the past decade, there has been a growing emphasis on digital humanitarianism, or the use of digital technologies and big data to gain insights and address a wide range of humanitarian crises (Madianou, 2021b; Meier, 2015; Mulder et al., 2016; Sandvik, 2023). This “humanitarian datafication” trend (Firoz, 2024) is part of a broader “innovation turn” in the aid industry (Scott-Smith, 2016) that is reshaping organisational structures, partnerships with the private sector, relationships with donors, and the very framing and understanding of humanitarian issues. This approach reflects a belief that data can enhance the efficiency, effectiveness, and accountability of humanitarian interventions by providing timely information about the scale and impact of crisis-induced displacements (Burns, 2015).
Additionally, data is at the core of a rising interest in Anticipatory Humanitarian Action (AHA), an emerging policy and practice field that uses historical records to generate insights into humanitarian crises and displacement trajectories (Iazzolino et al., 2022). Endorsed by international and aid agencies (UN Secretary-General António Guterres, 2018) and technically supported by the United Nations Office for the Coordination of Humanitarian Affairs’ (UNOCHA) Centre for Humanitarian Data (CHD), these AHA initiatives leverage diverse data sources to build and train predictive models (Thalheimer et al., 2022; van den Homberg et al., 2020).
As a result of this growing interest, digital humanitarianism is attracting significant political and financial support from donors and businesses, including tech firms participating in the AI for Social Good (AI4SG) movement, which focuses on the use of Artificial Intelligence (AI) and Machine Learning (ML) to address the Sustainable Development Goals (SDGs) (Bjola, 2022; Cowls et al., 2021; Henriksen and Richey, 2022; Iazzolino and Stremlau, 2024; Tomašev et al., 2020).
Critical humanitarian scholars have examined humanitarian data infrastructures at the intersection of control and care (Pallister-Wilkins, 2018; Tazzioli, 2020). However, the digital humanitarianism literature has not fully explored how this data is created, maintained, and reused. If we adopt the information systems (IS) perspective that data is an “epistemic object” (Knorr Cetina, 1997), always evolving and shaped by material factors, how does this dynamic nature of data affect digital humanitarian interventions?
This article critically examines the praxis of digital humanitarianism, focusing on the “data work” (Parmiggiani et al., 2022) performed by various professionals in creating and curating datasets used by humanitarian actors. The aim of these datasets is not only to help analysts capture the current state of humanitarian crises, but also to anticipate their potential future developments by feeding into models that identify patterns and trajectories. We ask: What challenges arise when bringing together humanitarian and non-humanitarian actors to create and maintain humanitarian datasets? How do these collaborations, often fraught with tensions, affect practitioners’ and policymakers’ understanding of humanitarian crises and the way they reconcile efficiency and humanitarian principles?
To answer these questions, we focus on how humanitarian actors make sense of the development, maintenance and (re)use of these datasets. This approach allows us to analyse the underlying principles and practices involved in organising and managing these “objects of critical study in their own right” (Thylstrup, 2022: 656), a process known as data curation. Data curation involves a diverse group of professionals and includes decisions about which data to select, how to structure it, and how to ensure its effective use and reuse (Fraser, 2019a, 2019b; Mannheimer, 2022; Parmiggiani et al., 2023).
Our main argument is that, while there is no such thing as a “humanitarian dataset”, there are contested processes of datafied humanitarian knowledge production (Thylstrup et al., 2019) situated in specific contexts. We suggest that, by “promot(ing) and mediat(ing) relationships among experts” (Parmiggiani et al., 2022: 142), the construction of these datasets redefines humanitarian knowledge production and is thus crucial for understanding the shifting politics of digital humanitarianism. These shifts are shaped by complex power dynamics involving aid organisations, business entities (Olwig, 2021; Richey, 2018), data scientists, and field workers (Mulder et al., 2016; Richey et al., 2021).
Our findings point to three challenges characterising the relationships among these groups: the fragmentation of the humanitarian data value chain; the asymmetries in expertise and skills among the different professionals involved in the curation of datasets used by humanitarian actors; and the integration of small and big data in crisis analysis and prediction.
By doing so, we contribute to debates on critical humanitarianism and critical dataset studies (Thylstrup, 2022) by highlighting the entanglement between “what we know” and “how we know” and its implications for humanitarian epistemics and ethics.
This article is structured as follows. We begin by discussing the theoretical underpinning of our contribution, which rests on two main bodies of literature: the first revolves around the notion of digital humanitarianism; the second focuses on the concept of data curation, or the practice of constructing and managing datasets in order to ensure their future fungibility. After describing our methodological approach, we present our findings. We then discuss their implications for humanitarian epistemics and ethics, and conclude our paper by drawing insights for further research on emerging trajectories of the global humanitarian sector.
Digital humanitarianism and its critics
Digital humanitarianism is an expanding field of policy and practice poised between the promise of greater efficiency and effectiveness of humanitarian response (Meier, 2015) on the one hand, and the risk of epistemic injustice and invisibilisation of historically marginalised communities, and the growing influence of corporate actors (Henriksen and Richey, 2022; Mulder et al., 2016), on the other. This turn towards digital and data-driven technologies within the humanitarian sector is accompanied by the increased availability of and access to big data, which is reshaping the way social realities are observed and made sense of (Kitchin, 2014). Big data, collected from people's “data exhaust” or through crowdsourcing, are seen to equip humanitarian agencies with the technical ability to quickly and accurately identify, predict, and respond to emergencies. As part of the global initiative for “early warnings for all,” predictive technologies are increasingly embraced as a means to better understand extreme weather patterns in order to provide disaster responders with real-time information about imminent hazards and allow for quick and targeted emergency responses (United Nations, 2022; World Meteorological Organisation, 2022). At the same time, data-intensive tools such as digital identity or biometrics are favoured to promote swift, transparent and fair distribution of cash payments to affected communities reeling from disasters (Holloway et al., 2021). However, the conceptualisation and deployment of these myriad technologies have sparked scholarly concerns along a mix of technical and political-ethical lines.
Critics argue that the shift towards digital humanitarianism reflects the increasing influence of corporate values, such as efficiency and optimisation, which deviate from, and might even undermine, the fundamental principles of humanitarianism (Burns, 2015; Henriksen and Richey, 2022). Despite such diverging values and organisational structures, humanitarian agencies like UNHCR, UNICEF, UNDP, and FAO, and Big Tech corporations like Google, Meta, IBM and Amazon are forging partnerships over data-centric digital technologies to improve the logistics of humanitarian operations and anticipate the scope and size of humanitarian suffering and displacement induced by various forms of natural hazards (such as floods, droughts, and tornadoes), and, to a limited extent, violent conflicts. These synergies cast Big Tech as a major problem-solving actor by virtue of its capacity to mobilise human and technological resources to strengthen the humanitarian response to a broad range of crises and support policymakers’ planning. Nevertheless, critics warn that such shifts in humanitarian responses via corporate-humanitarian partnerships, coupled with the transfer of technologies from the private sector to the humanitarian sector, are insufficiently accompanied by adequate measures to protect against the potential misuse of data collected from vulnerable populations (Jutel, 2022; Madianou et al., 2023). Some of these critical scholars, for instance, argue that Big Tech companies’ own humanitarian projects tend to reproduce their neo-colonial logics, geared towards pursuing profits, rather than achieving humanitarian goals, often at the expense of the individuals they allegedly seek to help (Magalhães and Couldry, 2021; Richey and Fejerskov, 2024; Taylor and Broeders, 2015). In particular, the extraction of data through digital platforms is viewed as a likely interference with, and potential violation of, already disadvantaged communities’ right to privacy and right to information (Holloway et al., 2021; Jacobsen and Fast, 2019). This risk of data extraction from vulnerable communities is further exacerbated by insufficient regulatory frameworks and operational standards aimed at safeguarding the interests and rights of disaster-prone or affected communities (Beduschi, 2022; Rikap and Lundvall, 2022; Sandvik and Jumbert, 2023). Furthermore, the growing emphasis on data-driven humanitarian response is being questioned for imposing additional constraints on the agency of these communities to shape humanitarian aid (Duffield, 2016), making it more challenging for them to hold humanitarian organisations accountable (Madianou, 2021a).
The concerns over the extractive and exploitative risks associated with these data-intensive partnerships and innovations also ought to be viewed in light of the political tensions and contradictions that underpin these systems. This data-centric politics casts a critical spotlight on the promise of technology, akin to what Rothe et al. (2021) term “naïve technological determinism” (58), characterised by the complex and conflicting interests among the proponents of such technologies, the uncertainty surrounding humanitarian work, and the expectations of crisis-prone communities. The ambiguities surrounding the techno-scientific logics of data-driven humanitarianism and the organisational uncertainty over the governance standards for the ‘humanitarian data life cycle’ (including data collection, storage, utilisation, and sharing) (Fejerskov et al., 2024) cast additional doubts on the promise of data-driven humanitarian action. Crisis prediction involves using data to monitor and track hazards before they escalate into emergencies, aspiring to blend technological and scientific rationalities as a form of “politics of probability” that is geared towards making unknowns amenable to scientific discovery (Amoore, 2013: 157). This perspective aligns with Beck's (1992) concept of the risk society, where the nature of risks is constantly changing, which, in turn, makes the technoscientific norms and practices of interpreting and managing risk a matter of socio-political contention.
The humanitarian use of predictive tools is not immune to such organisational inconsistencies and socio-political contentions. Moreover, the practical terrain of data-driven anticipatory action is complicated by ever-shifting unknowns, data scarcity, and dilemmas over what data to collect, how to collect it, and for what purpose (Chaves-Gonzalez et al., 2022). Critical humanitarian scholars and practitioners have raised concerns that the spread of predictive models within the humanitarian assemblage could exacerbate ‘function creep’, whereby data collected and processed to better help vulnerable populations end up strengthening control apparatuses. This was the case, for instance, with the controversy surrounding the EU-funded ITFLOWS project, launched in 2020 to develop EUMigraTool (EMT), a predictive tool based on agent-based modelling intended to assist EU authorities in managing migration flows by analysing data from news media, social media, and conflict databases to simulate migration patterns, forecast migration trends, and identify potential sources of tension between migrants and local communities across EU countries. As revealed by an investigative report, internal memos from ITFLOWS’ ethical board highlighted the risk of EMT being used for profiling, possibly resulting in the targeting of migrants based on factors like ethnicity or immigration status (Campbell and D’Agostino, 2022).
At the same time, data-driven prediction might perpetuate standardised forms of epistemic authority through quantitative modelling and trend analysis at the expense of epistemologies that are reliant on participatory and indigenous methods (Muiderman, 2022). These concerns demand further scholarly attention given the long-held understanding that narrow reliance on measurement and quantification has historically undermined the participatory and contextual development wisdom emanating from local communities (Chambers, 2017) and shifted power into the hands of technocrats (Ferguson, 1994). The focus on quantitative data also risks misinterpreting contextual realities and undermining the rights and voices of the local communities it claims to serve (Sandvik and Jumbert, 2023). These concerns call for a deeper insight into how data is curated and structured to ensure it can be effectively reused in digital humanitarian efforts.
Data curation for humanitarian reusability
In this article, we suggest that the limitations and contradictions of digital humanitarianism stem from the very way in which the data used to inform humanitarian action is collected, organised, and made fungible. We thus turn to a second stream of literature, primarily situated in media studies, critical data studies, and information systems, which revolves around the data curation involved in the construction of datasets.
Data curation is concerned with the ‘behind-the-scenes’ combination of routine and ad-hoc practices aimed at preparing data for future reuse (Parmiggiani et al., 2022), performed by various professionals. Broadly speaking, curation is about “deciding what content to care for and making decisions about ways of representing or displaying it” (Fraser, 2019a). To some extent, all connected users participate in data curation as they interact with digital devices and contribute data to a “curation loop” (Villi, 2012; Pedersen and Burnett, 2018), which enables providers to refine and improve the delivery of valuable content through “algorithmic computation” (Cheney-Lippold, 2017). In this context, data curation produces value from data by connecting “data-laden, algorithm-infused capitalist enterprises to users of diverse technology services and devices” (Fraser, 2019a). However, we adopt a narrower approach to data curation, focusing on the technical and organisational actors and practices tasked with transforming data into a strategic resource within the humanitarian sector and aggregating this resource into datasets. This ‘data work’ (Parmiggiani et al., 2022) is based on a relational process and debunks the widely held, and corporate-derived, view of data as “discrete, interchangeable and governable commodities that must be moved up and down different supply chains to create new value” (Thylstrup et al., 2022: 2). Instead, it reveals datasets as artefacts shaped by diverse epistemic approaches to our understanding of humanitarian crises and reuse strategies. Parmiggiani et al. (ibid.) use the idea of anticipatory generification to “capture the work to make the data collected in a specific setting just about malleable enough to be exported to future unknown analytical contexts” (12).
This process partially aligns with what Aaltonen et al. (2021) describe as the production of data commodities, which are “associated with the open-endedness and ambiguity of data as a medium for sensemaking and knowledge creation […] particularly under conditions where data are repurposed, decontextualized, aggregated, and recontextualized” (402). For humanitarian organisations, the key difference between their creation and maintenance of data commodities and that of commercial enterprises lies in the purposes of this valorisation process and its compliance with humanitarian principles. As quantification logics and efficiency imperatives have spread through the humanitarian space, frameworks, policies and infrastructures to facilitate the exchange of data across organisations have become central to humanitarian operations.
A pivotal innovation-centric organisation in this space, which we will discuss later in the article, is the Humanitarian Data Exchange (HDX). Established by UN OCHA, HDX is specifically tasked with overseeing the curation of datasets produced and used by humanitarian actors, taking into account their possible re-use within the humanitarian-development context and over long periods of time. In so doing, HDX data scientists are concerned not only with making data usable across different agencies, but also with envisaging “future contexts of data (re)use” (Parmiggiani et al., 2022: 3) within the boundaries of humanitarian ethics.
Understanding how this mundane work is carried out behind the scenes, and by whom, is thus crucial to discerning the power dynamics within digital humanitarianism. In this regard, the interdisciplinary debate on data work highlights two key issues that are essential for critically examining the diverse groups and processes involved in creating datasets for humanitarian purposes.
The first issue concerns the epistemic inequalities among the diverse actors involved, who may have different understandings of data reusability and varying levels of influence over the governance of data infrastructures. Viewing data production as the “product of the amalgamation of different actors, interests and social forces” (Dencik et al., 2019: 873) enables us to appreciate the significance of power relationships behind the technical veneer of data curation. By leveraging their expertise and access to resources, some actors could have greater control over reuse strategies, including for commercial purposes, of datasets based on historical records and curated in the present. The data repositories compiled, maintained and shared by these organisations often must comply with open access requirements dictated by donors and embraced by corporate actors. And yet, the a priori emphasis placed on openness risks glossing over the power structures that make ML an uneven space of collaboration. These inequalities are exacerbated by the fact that most humanitarian workers are not specifically trained to work with data (Jarke and Büchner, 2024), and by the opacity that often surrounds how data is collected, organised, and reused—raising concerns about the reliability of the frameworks governing partnerships between humanitarian organisations and private tech contractors. This tension is particularly relevant when dealing with data collected from vulnerable populations. A notable example is the partnership between the World Food Program and the U.S.-based data analytics firm Palantir, which has faced scholarly criticism due to Palantir's close ties with U.S. intelligence agencies (Masiero, 2023).
The second issue highlighted by critical data scholars is the risk of reusing problematic, or ‘toxic’, data for machine learning (Birhane et al., 2021; Harvey and LaPlace, 2021; Thylstrup et al., 2022), such as in the training of the predictive models on which AHA is based. Thylstrup et al. (2022) use the concept of ‘entanglement’ to grapple with the pretence, largely held in ML, of completely extricating data from its context of production in order to obtain untainted algorithmic fodder. Instead, their argument challenges the possibility of drawing a clear boundary between “data and algorithms and ‘good’ and ‘bad’ in machine learning regimes” (5) by emphasising that parameters of ML models bear the legacy of previous iterations with other training datasets. Therefore, as we argue in this article, delving into the construction of these datasets has implications for how digital humanitarianism is changing our understanding of humanitarian crises.
Methodology
Our analytical focus addresses the scarcity of theoretical and empirical insights into the partnerships between conventional humanitarian actors and data experts, concentrating on the operational, political, and ethical implications of such collaborations. We examine the underlying assumptions that inform these partnerships, with closer attention to the issues of equity, accountability and epistemic justice. The article draws on a mix of data sources spanning published documents, illustrative cases and key informant interviews. Firstly, we analyse public documentation about various forms of data-driven and prediction-oriented initiatives involving the international humanitarian sector. Secondly, we draw on key informant interviews and documentary evidence, focusing in particular on the innovation hubs of humanitarian and aid organisations (such as UN OCHA CHD, UNHCR, UNDP, UNICEF, FAO, WHO, Danish Refugee Council, Save the Children).
Defined as “dedicated organisational structures, small or large […] established to forward the agenda of innovation” (Wells, 2023: 270), humanitarian innovation hubs have been established by humanitarian practitioners and the private sector over the past decade to harness ‘technology for good’ (Powell et al., 2022). Usually staffed by individuals with mixed humanitarian and data science backgrounds, these hubs often serve as the interface between the humanitarian and corporate sectors and play a crucial role in translating, synthesising, and harmonising the languages, principles, and goals of both fields. Relevant to our study, these hubs aim to make collected and curated data machine-readable to meet the needs of data analysts.
The study draws on interviews with 15 key informants who have hands-on experience working as data analysts and scientists, most of whom have also been involved in negotiating and managing the process of making humanitarian data open access for anticipatory action and decision-making. More specifically, the interviewees represent diverse technical backgrounds and affiliations, primarily from large-scale humanitarian organisations. Their roles include monitoring and evaluation, data science, and humanitarian risk analysis in areas such as disaster risk mitigation, food insecurity, and conflict-induced displacement. Three participants previously worked in the corporate sector, particularly in cloud engineering, before transitioning to the humanitarian sector. Two interviewees specialised in predictive analysis, focusing on developing, applying, standardising, and validating data science tools to inform humanitarian response. Additionally, two interviewees had experience in data-driven humanitarian advocacy and two interviewees had dual roles in academia and humanitarian consultancy, working on humanitarian innovation and AI for humanitarianism, among other areas.
The small size of our interview sample is mainly due to the concentration of expertise in this fledgling field. Yet, the influence of the organisations in which our informants are based resonates across the global aid sector.
The interview questions primarily focused on the challenges of data curation in humanitarianism. However, interviewees were also asked about how they define prediction; the context and significance of data-driven predictions; their possibilities and limitations; and the data-centric partnerships established and promoted between the humanitarian and private/corporate sectors.
All interviews were transcribed to ensure accuracy and comprehensiveness. Interview codes were developed based on the literature review and refined through emerging themes identified during the documentary review. Transcript analysis followed a collaborative approach: both authors independently reviewed and coded the transcripts, followed by regular meetings to develop shared analytical themes (Cornish et al., 2013).
Insights from the interviews, along with a review of relevant documents and data hosting platforms, were compared and analysed to develop a shared set of findings, as outlined below.
Findings
Our research focused on the challenges that arise from collaboration between humanitarian and non-humanitarian actors in the creation and maintenance of humanitarian datasets, and on how the datasets produced through these collaborations influence humanitarian epistemics and ethics. To explore these questions, we invited humanitarian practitioners and data scientists to reflect on their involvement in constructing such datasets. This section presents our findings, organised around three key themes that emerged from the interviews and the document review: defragmenting the humanitarian data value chain, filling the humanitarian data expertise gap, and integrating big and small data.
The findings are supported by relevant quotes, each accompanied by a description of the interviewee's role, further illustrating the diversity that characterises humanitarian data work.
Defragmenting the humanitarian data value chain
The first issue highlighted by our interviewees regarding the challenges of building datasets for use by humanitarian stakeholders was the relationship between data quality and the fragmentation of the humanitarian data value chain.
There was broad consensus on the generally poor quality of data collected by workers from different organisations and entered into humanitarian digital infrastructures. This often results in issues such as inconsistent data formats, missing or inaccurate data elements, and a lack of data standardisation.
According to a data scientist based at a humanitarian organisation, “There is quite a lot of frustration around inconsistent figures, lack of figures […] We particularly have gaps in the disaggregation of that data. So it might be that the governments can provide us with the total number of refugees, asylum seekers in those countries, but we don't have the breakdown of this data, which obviously is really important from a process for crafting our operations and responding to those populations.” (Interview #4, disaster risk analysis and mitigation)
Moreover, when private contractors are entrusted with data collection, they are not merely middlemen. They exert de facto control over the format in which data is collected. As one lead data scientist explained: “The data is often not in an Excel file. It's in a video, a JPEG, PNG, or an image someone just put on a slide deck, or even PDFs of thousands of pages.” (Interview #3, lead data scientist)
These non-machine-readable formats are difficult for digital humanitarians (especially those working in understaffed innovation hubs) to process effectively. As noted by a data scientist at a UN organisation working on refugee welfare: “In an ideal world, there should be someone in charge of harmonising all those datasets and compiling them together. But in practice, I think it really depends on whether the partners work together and on national interagency structures, and on the presence of clear rules and common tools that are being used by the humanitarian community.” (Interview #6, data analyst in charge of data harmonisation and standardisation)
The fragmentation referred to above often stems from a lack of coordination in implementing protocols that regulate data collection and sharing, resulting in “separate worlds that don't talk to each other” (Interview #6). Defining data-sharing protocols (a set of guidelines specifying the responsible construction and use of datasets) is indeed crucial for standardising data, ensuring consistency across the data value chain, and mitigating risks such as the so-called mosaic effect. This refers to the possibility that even anonymised data, when combined with similar or complementary pieces of information, could potentially identify the individual to whom it pertains (Fournier-Tombs, 2021; Interview #2, cybersecurity expert at a global NGO).
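To make the mosaic effect concrete, the minimal sketch below (written in Python with entirely hypothetical records, column names, and identifiers) shows how two separately anonymised releases, neither of which contains names, can be joined on shared quasi-identifiers so that a unique combination links a sensitive health attribute to a ration card ID.

```python
import pandas as pd

# Two hypothetical, separately anonymised releases: a health survey extract
# and a distribution log. Neither contains names, but both share
# quasi-identifiers (camp, age band, household size).
health_survey = pd.DataFrame({
    "camp": ["A", "A", "B"],
    "age_band": ["30-39", "60-69", "30-39"],
    "household_size": [7, 2, 4],
    "chronic_condition": ["diabetes", "none", "asthma"],
})
distribution_log = pd.DataFrame({
    "camp": ["A", "A", "B"],
    "age_band": ["30-39", "60-69", "30-39"],
    "household_size": [7, 2, 4],
    "ration_card_id": ["RC-1041", "RC-2210", "RC-3377"],
})

# Joining on the shared quasi-identifiers links the sensitive health attribute
# to a ration card ID; wherever a combination is unique, the record becomes
# re-identifiable even though each release was "anonymised" on its own.
linked = health_survey.merge(
    distribution_log, on=["camp", "age_band", "household_size"]
)
unique_matches = linked.groupby(
    ["camp", "age_band", "household_size"]
).filter(lambda group: len(group) == 1)

print(unique_matches[["ration_card_id", "chronic_condition"]])
```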
Although organisations such as the UNHCR have historically used data from different sources and formats, it was only in 2020, following the establishment of the UNHCR Global Data Service, that defragmenting the data value chain became a strategic priority. This is viewed as a precondition for ensuring the reusability of data across organisations and over time—what we refer to as organisational and temporal interoperability. This principle also underpins the establishment of the Humanitarian Data Exchange (HDX) by the UN OCHA Centre for Humanitarian Data (CHD). Described by one of its executives as a “one-stop shop for humanitarian data” (Interview #3, lead data scientist), HDX was designed to facilitate the exchange of data among humanitarian organisations and foster data-driven decision-making.
HDX was inspired by the recognition that, as one interviewee put it: “You encounter situations where you are an information management officer, and you have to go to an individual organisation each time to request data to do analysis that aids in your day-to-day job. HDX started as a project to just bring together humanitarian datasets and to make them more easily accessible to management officers and humanitarian affairs officers—so basically to anyone who's looking to make something more technical.” (Interview #8, data access, sharing, and partnership development)
The CHD took a leading role in certifying the quality of data collected by humanitarian organisations, assessing it for both consistency and compliance with humanitarian principles.
The effort to defragment and consolidate data is also a response to the “data fatigue” experienced by disaster-prone and disaster-affected communities (Interview #10, data analysis, monitoring, and evaluation). Rather than repeatedly collecting new datasets from vulnerable populations, humanitarian actors increasingly rely on secondary data for both analysis and prediction. Leveraging unused secondary datasets (e.g., census data, health and demographic surveys, programmatic data) is therefore considered operationally valuable to avoid duplication and data fatigue, while protecting disaster-prone or affected populations from the burden of repeated data collection (Sandvik, 2023). As one interviewee with a background in monitoring and evaluation, currently working as a data scientist for an international humanitarian organisation, explained: “There have been lots of duplication of data leading to ‘data fatigue’, and also people are saying, ‘Why are you all asking me the same questions?’ There is also some level of assessment fatigue. So the plan is to structure the unstructured [data].” (Interview #10, data analysis, monitoring, and evaluation)
In some cases, the scarcity of demographic data is addressed by using modelling techniques to estimate missing information based on historical records from various sources. However, this approach can compromise the accuracy of the resulting datasets. This issue is particularly pronounced when relying on data from national statistical bodies in crisis areas, as these datasets are often patchy and unreliable—especially when governments may be reluctant to collect or share health data, such as information on cholera outbreaks. The same challenges apply to displacement data, particularly when governments are themselves drivers of the crisis or may manipulate data to attract donor funding.
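As a minimal illustration of this kind of model-based gap filling (the district figures, years, and linear-trend assumption below are ours, purely for the sake of the example), the following sketch extrapolates a missing population count from historical census records and notes where the accuracy risk enters.

```python
import numpy as np

# Hypothetical census counts for a district; no survey has been possible
# since 2015, so more recent figures are missing.
years = np.array([2005, 2010, 2015])
population = np.array([118_000, 131_000, 149_000])

# Model-based gap filling: fit a linear trend to the historical records
# and extrapolate it to the year of interest.
slope, intercept = np.polyfit(years, population, deg=1)
estimate_2023 = slope * 2023 + intercept
print(f"Estimated 2023 population: {estimate_2023:,.0f}")

# The estimate simply carries the pre-crisis trend forward. If displacement,
# conflict or an epidemic broke that trend after 2015, the figure may be far
# off, which is how model-based estimation can quietly degrade dataset accuracy.
```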
Expertise gap
The second theme emerging from the interviewees’ accounts is the asymmetric power dynamics among actors, particularly in their understanding of how to reformat data to improve its reusability. This asymmetry draws attention to the expertise gap between humanitarian field workers and data scientists. As pointed out by an interviewee working for a UN agency, “Data scientists know the limits of what they can do and the format of data they need. They would stress granularity and thoroughness across different fields, not just two or three.” (Interview #5, data standardisation and sharing)
The program director of an NGO that delivers data literacy training to humanitarian workers, however, cautioned against the ever-expanding world of data governance standards, which humanitarian practitioners perceive as overly complex: “There's a need to shift direction. Instead of adding more layers of complexity, data governance should be designed with accessibility and practicality in mind.” (Interview #9, data-driven decision making and advocacy)
The complexity of data management is thus proving to be a barrier for those furthest from the data itself, particularly field workers. The increasing use of technical jargon can also be a challenge for humanitarians without a technical background, making it difficult to even remember all the terms, let alone understand what kind of data should be included in a dataset. As the same specialist explained, “The focus should be on making data governance practices more accessible, not less. This means using clear and concise language that everyone involved can understand. Field workers also report this frustration when grappling with the disconnect between how data is collected and how it's ultimately used. There's a sense that the ethics and practicalities of data collection in the field aren't being fully considered when developing data governance standards.” (Interview #9)
Due to limited technical expertise and funding constraints, humanitarian NGOs have sought to collaborate with tech firms that can provide resources and expertise. A data specialist who worked for a major international NGO, and who had prior corporate experience in cloud computing and information technologies, offered a mixed perspective on the potential benefits and risks of these humanitarian-corporate partnerships: “A lot of the work is being outsourced to the corporate sector. Some of this work is being done pro-bono. Some people in the tech industry are genuinely interested [to support the NGOs], they also have the infrastructure. But at the end of the day, they are for-profit making.” (Interview #11, data science and predictive analysis)
Despite the growing interest in addressing the data expertise gap, the profit-making motive of the private sector poses a risk of data misuse, given the limited safeguarding and oversight mechanisms. In other cases, private sector interest in joining a partnership is driven by a mix of corporate social responsibility goals and the opportunity to test their own predictive models. This was the case, for instance, with an initiative developed by an international NGO in partnership with IBM. As explained by the NGO's lead for the project, the US tech giant was eager to leverage its technical expertise and computational power to analyse data and predict displacement patterns in Afghanistan and Myanmar. However, the initial iteration of the project failed to capture the onset of the refugee crises in both countries, primarily due to the reliance on historical data that was ill-suited to the ‘black swan’ nature of the humanitarian emergencies. The unexpected return of the Taliban to power in Afghanistan, the escalation of the Rohingya genocide, and the lack of signals from the field all contributed to this limitation, compounded by a communication bottleneck between the field workers and the data scientists (Interview #10, NGO senior analyst).
Such frequent bottlenecks are the reason why, as noted by several interviewees, humanitarian NGOs may be better served by developing their own in-house capabilities, enabling them to be more “proactive” as part of anticipatory humanitarian responses. This has led to a growing demand for new technical profiles within the humanitarian sector, despite financial constraints. As suggested by a CHD data scientist, “OCHA is more and more looking for data scientists. These profiles are different from traditional technical personnel. In the past, information management typically referred to GIS, mapmaking. A shift is happening.” (Interview #7, GIS and risk mapping)
Organisations such as the previously mentioned HDX were established specifically to bridge this gap by focusing on the collection of high-quality data and the anticipatory process of “generification”, which transforms datasets into valuable resources for AHA. Interviewees noted that the emphasis on prediction has actually improved data quality itself. By developing predictive models, analysts have become more attentive to identifying issues with data sources and overall quality (Interview #3).
The use of prediction began gaining momentum around 2020, leading to the development of two key models: nowcasting, which addresses short-term needs by filling gaps caused by delays in official statistics; and forecasting, which considers longer timeframes — typically two to three years — to plan for future tasks (Interview #4). To support this shift towards prediction, the CHD broadened its scope “from data responsibility to responsible analytics” (Interview #3), auditing predictive models developed by humanitarian organisations, both within and outside the UN system. The auditing process evaluates the validity of underlying assumptions, as well as a model's transparency and ethical use. This includes examining whether model hypotheses are sound (from both statistical and humanitarian perspectives) and identifying potential biases in the outputs. As one CHD data scientist explained: “…if the model predicts people affected by a crisis, is it excluding areas with data collection problems? We want to avoid situations where the model simply ignores problematic areas, leading to an inaccurate overall picture.” (Interview #4, disaster risk analysis and mitigation)
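The difference between the two approaches can be sketched with a deliberately simple trend calculation; the monthly figures and the persistence assumption below are hypothetical and do not reproduce the CHD's actual models.

```python
import numpy as np

# Hypothetical monthly counts of newly displaced people; the latest official
# figure is delayed, so the current month is still unknown (None).
observed = [4200, 4600, 5100, 5300, 5900, None]
history = np.array([v for v in observed if v is not None], dtype=float)
avg_monthly_change = np.mean(np.diff(history))

# Nowcasting: estimate the missing current month from the recent trend,
# filling the gap left by delayed official statistics.
nowcast = history[-1] + avg_monthly_change

# Forecasting: project the same trend over a longer horizon (24 months here)
# to support planning two years ahead.
forecast_24_months = history[-1] + 24 * avg_monthly_change

print(f"Nowcast for the current month: {nowcast:,.0f}")
print(f"Naive 24-month forecast: {forecast_24_months:,.0f}")
```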
Regarding model transparency, CHD acknowledges there is no foolproof “tick-all-boxes” method for auditing data models. However, they have taken steps to promote responsible use by asking whether the model aligns with humanitarian principles (e.g., using the appropriate data for its intended purpose); evaluating datasets and considering the ethical implications of outputs beyond statistical accuracy; and subjecting models to a peer review process involving a mix of relevant experts—such as statisticians, humanitarians, and domain-specific specialists (e.g., climate experts for climate-related models) (Interview #4).
Small and big data integration
The third theme emerging from the interviews, closely related to the previous two, revolves around the difficulty of integrating big and small data—namely, the quantitative indicators gathered by data scientists and the contextual knowledge provided by field workers and, ideally, local communities. The main question, as effectively summarised by an interviewee from an advocacy organisation, boils down to, “What is being predicted?” The data collected and curated for reusability are intended to produce metrics that aid professionals and policymakers can use in resource allocation to respond to emergencies. However, this approach risks further concealing local agency, as most initiatives can leverage only limited data on the coping capacities of communities affected by disasters. As the same interviewee specified, “There is little understanding of what resources communities have to prepare for or respond to crises. What type of resources does the community possess? For example, can schools be turned into contemporary shelters if there is a crisis?” (Interview #12, data-driven monitoring and advocacy)
However, in a sign of appreciation for the benefits of integrating the two types of data, a data scientist based in a humanitarian innovation hub suggested that the challenge is precisely to “toe the line between these two forms of knowledge. While big data offers vast quantities of information, it often lacks the richness of context.” (Interview #4, disaster risk analysis and mitigation)
Interviewees from humanitarian backgrounds, in particular, expressed concerns about an overreliance on quantitative data, such as census, household, health, and population figures, and, in some cases, social media data. They argued that this reliance could lead to a narrow focus on quantifying the likelihood, trajectory, and impact of humanitarian crises, resulting in metrics that are easily digestible and actionable for aid organisations, policymakers, and donors. The downside of this approach, though, is the danger of overlooking the complex and multifaceted effects of certain disasters, as one risk analyst reflected: “Imagine analysing flood data. Big data tells us the number of people affected, but contextual knowledge, like infrastructure or preparedness, unveils vulnerability. A seemingly less-affected area might have strong resilience measures, while another, with a lower number impacted, could be severely vulnerable due to lack of infrastructure.” (Interview #4, disaster risk analysis and mitigation)
The quote above reflects an understanding that a lack of contextual knowledge limits aid organisations’ ability to fully grasp the unequal impacts of disasters on historically disadvantaged populations. Yet, despite widespread awareness of the complementarity between these different types of data, few efforts have been made to integrate small data—such as operational data, qualitative interviews, and observational reports—with quantitative data to support data-driven predictions.
A notable exception is a project designed and implemented by the UNHCR Innovation Service Team in Somalia, cited by five separate interviewees as a best practice in big-small data integration. Launched in 2019, Project Jetson was the first predictive analytics initiative to leverage ML for anticipating and preparing for population displacement. By partnering with local organisations that sourced historical records dating back seven years, Jetson's data scientists identified ten indicators of displacement. These included meteorological data, market prices, historical population movements, and food insecurity, as recorded by humanitarian agencies in the Somali Horn of Africa during previous droughts (Beduschi, 2022; Schneider et al., 2022). However, following advice from field workers, the project implementers began incorporating qualitative insights derived from local observations. Particularly valuable were reports of a significant decrease in the price of small livestock in local markets. Through interviews with refugees, field workers established that this drop was linked to pre-displacement livestock sales: goats, being difficult and impractical to transport during displacement, were often sold before journeys began. This surge in livestock sales within a short timeframe triggered a corresponding decline in market prices. Crucially, researchers found that this price drop consistently preceded actual displacement events, as refugees “needed to sell their goats and transfer their major financial assets to generate cash for their fleeing journey” (Schneider et al., 2022: 69). This insight prompted UNHCR data analysts to incorporate historical prices of goats and water drums into a broader dataset composed of information mined from various sources, including graphs, reports, spreadsheets, and websites (Schneider et al., 2022).
The outcomes of this project underscored the value of fine-grained qualitative analysis and contextual specificity. In the case of Jetson, the success of the goat price indicator was demonstrably linked to the specific cultural and economic realities of Somalia. But in regions with different displacement patterns or livestock practices, this approach may not be as effective. The interviewees nevertheless acknowledged that the Jetson project highlighted the importance of leveraging local knowledge to identify which types of data should be added to datasets for ML (Interviews #3 and #5).
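A rough sketch of how such field-derived insights can enter a dataset used for ML is given below; the figures, indicator names, and the simple linear model are entirely illustrative and do not reproduce Jetson's actual pipeline.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical monthly indicator table for one region, loosely modelled on the
# kinds of features described for Project Jetson (all values are invented).
features = pd.DataFrame({
    "rainfall_mm": [12, 8, 3, 1, 0],
    "cereal_price_index": [102, 108, 117, 125, 140],
    "goat_price_index": [100, 96, 83, 71, 64],  # field-informed indicator
    "conflict_events": [2, 3, 3, 5, 6],
})
arrivals_next_month = pd.Series([850, 1100, 1700, 2400, 3300])

# A deliberately simple model: the point is not the algorithm but the fact
# that a qualitative, field-derived insight (falling goat prices precede
# displacement) enters the dataset as an ordinary quantitative column
# alongside rainfall, food prices, and conflict data.
model = LinearRegression().fit(features, arrivals_next_month)
print(dict(zip(features.columns, model.coef_.round(1))))
```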
Our findings indicate that evidence-based humanitarian action still relies heavily on quantitative measures and metrics, while the use of contextual and qualitative data is often limited to triangulation purposes, as noted by several interviewees. However, there was broad consensus among study participants that humanitarian practice would benefit from greater attention to qualitative and contextual data that reflects the lived experiences of crisis-prone and crisis-affected populations.
Discussion
The findings we presented in the previous section identified three challenges in the construction of datasets used for humanitarian prediction: defragmenting the humanitarian data value chain; filling the humanitarian data expertise gap; and integrating big and small data. These issues have implications for humanitarian epistemics and ethics, shaping how we understand and engage with humanitarian crises.
Implications for humanitarian epistemics
While the humanitarian sector is eager to adopt predictive technologies, these efforts are hampered by a paradox: the very effort to overcome data scarcity often reveals how deep that scarcity really is. The more humanitarian actors try to solve the problem of not having enough or well-organised data, the less useful data they seem to have, especially when trying to predict complex and uncertain events. Some crises (e.g., conflicts, famines, complex emergencies) are indeed non-linear and unpredictable, making them particularly hard to model or predict using quantitative data alone.
This is further complicated by inadequate data governance standards and mechanisms. AI and ML require large amounts of reliable data to be effective. However, in humanitarian contexts, such data is often scarce or fragmented, creating a mismatch between the promise of prediction tools and the reality of available data.
This challenge has implications for how humanitarian actors “see” and prioritise disasters for intervention. There is a growing preference for quantitative data, which is more readily available for consolidation and sharing than qualitative or contextual data. Although efforts to minimise primary data collection from vulnerable populations—who may be experiencing “data fatigue”—reflect a commendable “do no harm” approach, reliance on specific forms of quantitative datasets not only raises questions about the validity of models used to predict complex disasters (Rivera, 2022; Thompson, 2022), but also risks sidelining the value of non-quantitative, contextual information.
This preference for numbers and statistics can lead to overly simplified predictions, and to the prioritisation of certain types of disasters over others. Our analysis shows that much of the predictive focus centres on a narrow set of indicators and quantitative measures, particularly those related to displacement caused by imminent or active threats. This, in turn, risks drawing disproportionate attention to certain crises while neglecting others.
For example, disasters such as cyclones and hurricanes are both recurring and more easily predicted, thanks to the availability of geospatial data. In contrast, slow-onset or intersecting crises—such as droughts or famines occurring in regions affected by low-intensity conflict—are often harder to quantify and thus risk being oversimplified or overlooked.
The three challenges we identified—fragmentation, power asymmetries, and an overemphasis on quantitative data—affect not only how datasets are constructed by data scientists to support aid workers and policymakers, but also how humanitarian crises are problematised. Murray Li (2014) defines problematisation as the identification of “deficiencies that need to be rectified” (p. 228). She notes that “identifying the problem to be solved and rendering it in technical terms is often quite straightforward in the context of emergency, where immediate needs come to the fore” (ibid.).
However, the entanglement (Thylstrup et al., 2022) of datasets with the models they train—and their continual reuse and repurposing—can have a cascading impact on how we “manage uncertainty and make an unknowable and indeterminate future knowable and calculable” (Amoore, 2013: 7). This may shift the humanitarian gaze away from the socio-political dimensions of disasters - an issue long raised by critical disaster scholars and practitioners (Hewitt, 1983).
Implications for humanitarian ethics
Our findings highlight the significance of datasets as artefacts shaped by constant negotiations among the stakeholders involved. But the three themes we identified reveal that these negotiations are fraught with tensions among the involved partners – data analysts, humanitarian workers, and, in some cases, Big Tech companies engaged on an operational basis – stemming from their different and unequal understandings of data formats, types, and reusability strategies. There is thus a need to think about humanitarian ethics, and the way it is embedded in humanitarian practices and policies, in light of the socio-political value of data, which extends beyond the specific scientific or evaluative purpose for which it was originally collected. Despite the significant emphasis on open access and shareable data, the use of this data is not adequately supervised, nor is the risk of publishing sensitive information that may compromise the rights and well-being of vulnerable populations sufficiently mitigated. While the focus remains on “consolidation” and “collaboration” around available data, our analysis raises questions about standards for how data should be used and reused, and about the forms of prediction to be conducted, by whom and for what purposes.
Given the for-profit nature of the tech firms supporting and shaping the current trend of humanitarian datafication, there is a risk that interactions between the private and aid sectors could dilute core humanitarian principles. Indeed, our analysis shows that efforts to leverage these partnerships have, so far, failed to heed the guidance of organisations such as UN OCHA and the International Committee of the Red Cross, despite these organisations having been particularly proactive in recent years in promoting sound data management practices.
Moreover, the outsourcing of data expertise, largely driven by the push to “scale up” humanitarian decision-making, is characterised by two key trends. First, the increasing quantification of disasters reflects the dominance of larger humanitarian organisations, which limits opportunities for local actors to engage in data-driven prediction. Second, the outsourcing of data analysis roles, and the delegation of predictive responsibilities to external data analysts, risks detaching humanitarian actors from their core commitment to accountability to disaster-affected populations. It also undermines their ability to remain publicly accountable for delays or inaccuracies in predictions (Thompson, 2022).
Despite the challenges we have foregrounded, our analysis underscores the potential of data-driven prediction to transform the practical landscape of humanitarianism. As previously mentioned, the acquisition of in-house capability and the establishment of clear guidelines to harmonise collaboration with external partners are largely viewed as necessary provisions to level the playing field.
Finally, our findings suggest that the increasing emphasis on open access and data reuse may lead to a greater blurring of the existing divide between humanitarian datafication and humanitarian action (Fejerskov et al., 2024). This has further implications for humanitarian epistemics and ethics, especially in light of longstanding criticisms of the sector's tendency to respond to humanitarian crises based on simplistic and selective interpretations of political and contextual dynamics, often with unintended consequences (Keen, 2008; Terry, 2002), raising questions over the ethical justifications of humanitarian action, inaction or delayed action (Slim, 2015). The predictive aspirations of the humanitarian sector, as the findings above suggest, continue to face the challenge of oversimplification and misrepresentation of contextual realities. Nonetheless, the move towards data integration and sharing represents an evolving humanitarian trend, one in which inaction is subject to increasing public scrutiny and pressure and from which humanitarian actors find it ever harder to escape. As one interviewee observed, an accurate prediction means “there's really no excuse for inaction” (Interview #9, data-driven decision making and advocacy).
Conclusions
Our contribution has discussed the production of datafied humanitarian knowledge aimed at informing evidence-based policies and, more recently, at designing and deploying predictive models to strengthen the preparedness of the humanitarian system. We set out by examining the challenges of building and curating humanitarian datasets, identifying three critical issues arising from the tension between adherence to humanitarian principles and the production of metrics. We also suggested that the datasets used in the humanitarian space are the outcome of this tension, which reverberates through the way we understand humanitarian crises and plan our response to them. In so doing, we sought to bridge the academic and practitioner divide surrounding the question of data-driven knowledge production. We explored the “what” and “how” of knowledge generation within the context of data infrastructures collecting and storing humanitarian information, suggesting that this approach sheds light on the current state of humanitarian prediction and potentially offers insights into the future. A key takeaway is the significant role of power dynamics and communication barriers in building humanitarian datasets. We argued that, by critically examining datasets, as advocated by critical data studies, we can understand how their construction and use shape predictive models for specific crises, ultimately prioritising certain crises and associated responses while neglecting others.
Ultimately, our findings suggest that the growing emphasis on anticipation through quantitative data and numbers is self-limiting. In the absence of meaningful efforts to leverage and integrate qualitative and contextual data that are better suited to capturing the lived experiences and realities of communities in or prone to crisis, the focus on numbers and metrics risks misrepresenting and misrecognising the injustices that humanitarian crises tend to inflict. As an exploratory study, we call for further theoretical and empirical attention to the epistemological (knowledge-based) foundations, humanitarian-corporate partnerships, and associated inclusive practices that will define the future of humanitarianism.
Acknowledgements
The authors would like to thank João C. Magalhães for his valuable suggestions on the conceptualisation of this paper. His insights greatly enriched the development of the ideas presented here.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
