Abstract
The proliferation of environmentally oriented programs within the tech industry, and the industry's coinciding efforts toward data and technology democratization, generate concerns about the status of environmental data within the digital economy. While the accumulation of digital personal data has been a cornerstone of the data analytics industry's dominance, many believe environmental data to be a source of “untapped potential.” The potential of environmental data, the argument goes, would benefit equally the digital economy, environmental sciences, and academic data and artificial intelligence experts. This article analyzes the proliferation of the rhetoric about open environmental data by focusing on Microsoft's Planetary Computer cloud computing program and the computer vision experts who curate and use biodiversity data stored on Microsoft's servers. Through an analytical framework of sociotechnical imaginaries, the article draws connections between the visions of the future of environmental knowledge production and governance promoted by Microsoft and the work of computer vision experts who intend to benefit from the potential of environmental data as machine learning training sets while at the same time helping environmental sciences. Although environmental data on the Planetary Computer is democratized, it nonetheless becomes a valued asset to the data economy, often with unintended consequences, such as enabling citizen science biodiversity data to be used by the state surveillance apparatus. The article challenges the view that data's democratization unproblematically serves environmental sciences by examining the consequences of imaginaries of democratization emerging from data industry leaders and the processes of nonmonetary valuation of environmental data by the experts who curate these datasets.
In the past few years, companies such as IBM, Google, Microsoft, and Hewlett-Packard began to market their prowess in data analytics, artificial intelligence and machine learning (AI/ML), and cloud computing as solutions to environmental problems. For example, in 2017, Microsoft launched the AI for Earth initiative, a 50-million-dollar grant program promising to put “Microsoft's cloud and AI tools in the hands of those working to solve global environmental challenges” (Microsoft, 2021a). The program offers technical and financial support to a host of academic and nongovernmental organizations focusing on partnerships between data and environmental scientists. Environmental scientists can collaborate with AI experts through AI for Earth and use storage and processing capacity via infrastructure credits on Microsoft's Azure cloud computing program.
Three years later, in April 2020, Microsoft introduced the second stage of its environmental program: the Planetary Computer. This program is a proposal for a centralized environmental knowledge infrastructure. Such infrastructure would combine cloud computation with open environmental data and allow real-time global monitoring and modeling on Azure. Thematically, AI for Earth and the Planetary Computer are programs primarily concerned with sustainability and biodiversity conservation. At that time, Microsoft's Chief Environmental Officer Lucas Joppa invited the readers of Scientific American to imagine the Planetary Computer “less as a giant computer in a stark white room and more as an approach to computing that is planetary in scale and allows us to query every aspect of environmental and nature-based solutions available in real-time” (Joppa, 2019).
The central theme in Microsoft's Planetary Computer imaginary is the promise of democratization of computational tools and data through cloud computing. Democratization here implies allowing any interested stakeholder to use open-source tools and data to answer environmental questions. As Joppa (2019) explains: A planetary computer […] will require us to build a global network that connects billions, or trillions, of data points about our environment with the computing power and machine learning tools to process them into actionable insights that will empower decision-makers in every corner of the globe.
This statement raises many questions: Who should be building such networks? Who collects the data? Who curates these data points? How does one process data into insights? And subsequently, insights into policies? Commenting on the critical study of the democratization of environmental data practices, sociologist Jennifer Gabrys (2016: 3) writes: the democratisation of technological engagement both brings new ways of addressing environmental problems as well as questions about what is meant by democratisation, especially when extended to questions of the production and circulation of data. (cf. Miller, 2005)
This article responds to Gabrys’ call for critical studies of environmental data, and of data's democratization in particular. It integrates conversations in Science and Technology Studies (STS) about technology and data democratization with anthropological and sociological interests in the value and lifecycles of data. 1 Following the processes of co-production of market forces and life sciences (Jasanoff, 2004; Sunder Rajan, 2006) helps to reveal how organizationally situated conceptions of democratization travel across the industry-academia boundary and to show how industry agendas shape the work of academic researchers in the domain of computer vision.
While critical data studies scholars have devoted much attention to the political economy of data by analyzing human-derived data as assets (Birch et al., 2021), less is understood about the role of open biological data embedded within the economy of private environmental philanthropy (see Lippert, 2015). Nevertheless, the past few years saw a major shift toward critical appraisals of environmental data and AI, particularly in the field of political ecology (Nost and Goldstein, 2022). Following scholars interested in the intersection of data mythologies and the environment (Brevini, 2020), I explore how Microsoft's imaginaries of data analytics and innovation affect data curation practices in environmental sciences. Sheila Jasanoff (2017: 3) has asked, “What is made visible in diverse practices of environmental data collection and what, by contrast, remains unperceived?” I shift this locus of analysis to imaginaries, data curation, and their consequences, in order to understand how environmental records become an asset to nonenvironmental experts, such as the computer vision community, and what role commercial actors play in shaping the materiality and imaginaries of environmental datasets.
While Microsoft propagates forms of democratization grounded in commercial logics that can seamlessly apply in the context of environmental knowledge production, academic computer vision experts curate and host data on Microsoft's servers to advance their own discipline and biodiversity science. Problematizing this dynamic, this article challenges the view that data's democratization is unproblematically serving environmental sciences by examining the consequences of imaginaries of democratization emerging from the data industry leaders and processes of nonmonetary valuation of environmental data.
In the first section, I explore Microsoft's imaginary of democratization by first looking at the company-wide discourses, and later focusing on a more fine-grained description of the Planetary Computer. My analysis of Microsoft's vision for democratization and the Planetary Computer follows the method of bringing to light sociotechnical imaginaries, collectively held, institutionally stabilized, and publicly performed visions of desirable futures, animated by shared understandings of forms of social life and social order attainable through, and supportive of, advances in science and technology. (Jasanoff and Kim, 2015: 4)
The framework of sociotechnical imaginaries offers analytical leverage in articulating how tech industry actors and academic data experts alike work within and promote visions of desirable futures for the environment. Building on the analysis of Microsoft's imaginary of environmental data, the second section of this article describes the prospecting work of computer vision experts who strive to help conservation experts advance their field by providing them with the “untapped” resources of biodiversity data. In effect, ML trained on environmental data gains potential applications beyond the environmental realm thanks to the value of curated biodiversity datasets. One way that this valuation takes place is through the curation of data to serve as AI training sets. Stated otherwise, ML “wants” (Mackenzie, 2015) environmental data, as training datasets are the key component of the contemporary AI economy (Crawford and Paglen, 2021).
The article discusses the value of an open dataset library associated with the AI for Earth and Planetary Computer programs: the Labeled Information Library of Alexandria: Biology and Conservation (LILA BC). LILA BC, as its website states, is a “repository for data sets related to biology and conservation, intended as a resource for both ML researchers and those that want to harness ML for biology and conservation” (LILA BC Website, 2023). As LILA BC was curated by former students of a prominent computer vision researcher, Pietro Perona, I contextualize Perona's and his students' early involvement in environmental data curation. In particular, I recount the history of the involvement of Perona's lab in the curation of biodiversity data and the development of the iNaturalist app by some of the lab members. I argue that while Microsoft openly embraces its mission to help environmental scientists and decision-makers, the data curation practices position Microsoft's computational infrastructure and ML expertise as “obligatory passage point[s]” (Callon, 1984) between information and decision. I end by juxtaposing the Planetary Computer and a similar program under development by American and European space agencies.
Research context and methods
The analysis in this article is a result of a larger historical and ethnographic project about the shifting dynamics of collaboration between data scientists and ecologists conducted by the author between 2018 and 2022. The larger project involved multisited participant-observation and archival research. A key empirical locus was the emerging community of ML experts organizing around a Slack channel started by Microsoft affiliates called “AI for Conservation.” I conducted a Thematic Analysis (Braun and Clarke, 2006) of (a) publicly available Microsoft documents and (b) interviews and statements made by Microsoft actors, with a focus on press and podcast interviews with, as well as conference talks by, Lucas Joppa. The larger project involved over 20 interviews with data scientists and ecologists. While all these interviews informed the analysis and understanding of the Planetary Computer and biodiversity data, three of the informants were affiliated with the LILA BC dataset and one interview is directly quoted here. 2 The themes in the aggregated dataset were identified in an inductive, data-driven manner; no preexisting coding scheme was used, and the thematic categories were assigned during the analysis. Lastly, in an effort to understand the value of LILA BC, the labor put into creating this dataset, and the consequences of LILA BC's interoperability, I relied on the method of “following the data” inspired by Sabina Leonelli's (2016) study of data-centric biology.
The value of open data
Recent contributions to the anthropology of data have argued that “data is understood to be valuable because it can be transformed into something else” (Walford, 2021: 1). In other words, the value of the biodiversity data analyzed in this paper does not depend on the for-profit commodification of data (as, for example, in the case of weather data; Randalls, 2010), but rather derives from data's potential to be transformed into datasets that improve ML algorithms. Walford shows that there is a bifurcation in the valuation of data. First, data becomes valued as an end in itself, after it is collected by an interested party. This kind of value would be assumed by, for example, a biodiversity conservation expert who collected the data. This article is concerned with the other kind of value: the “secondary” value derived from the capacity to transform data—or data's interoperability (Ribes, 2017). 3
As the data lifecycle evolves and data turns into data products, it generates value for multiple communities. The question arises: what are the conditions under which data capital becomes economic and academic capital (Bourdieu, 1984)? This essay builds on a critical insight that “data is made valuable in diverse ways, entering economies not always recognizable as financial” (Douglas-Jones et al., 2021: 10). Therefore, in the following pages I consider data's value by examining an interdisciplinary terrain where the fields of computer vision and conservation biology intersect. To fully capture the trend of open biological data, Leonelli offers conclusions drawn from her own historical research: [Open Data] exemplifies the embedding of scientific research in market logics and contexts. To make it at all feasible for data to travel, market structures and political institutions need to assess not only their scientific value but also their value as political, financial, and social objects: The increased mobility of data is unavoidably tied to their commodification. (2013: 8)
I apply Leonelli's insights from her historical work to the contemporary context of open biodiversity data hosted by Microsoft. Together, Walford's intervention about the potentiality of data as its value and Leonelli's emphasis on the correspondence between value and data's mobility complement my analysis.
Distinct from biological data at large, the process of “counting species” (Youatt, 2015), or collection of biodiversity data, belongs to and produces its own distinct form of environmental politics. Although biodiversity data has a different cultural and ontological status from other forms of biological data, such as genomic data, conservation science has been defined by “explicit efforts to recast conservation as a matter of pegging the ‘value’ of nature to quantifiable measures of industrial worth” (Hayden, 2003: 57). While Hayden's account of “neoliberal nature” centers on the politics of bioprospecting and local ecological knowledge, as this article shows, biodiversity data has also become a matter of concern and value for the data analytics industry and the computer vision community.
Ambiguities of democratization
The concept of democratization in the digital age is notoriously complicated to define (see Miller, 2007). An important starting point for defining the term in this paper is Microsoft's own definition of democratization, which reads: Data democratization is the ability to make digital information accessible to the average non-technical user of information systems, without having a gatekeeper or outside help to access the data. Democratizing data helps users gain unfettered access to important data without creating a bottleneck that impedes productivity.
At face value, this definition postulates the possibility of expanding the accessibility of data and data analytic tools, but with a particular goal-orientation: an increase in productivity. By contrast, Jennifer Gabrys and colleagues center their analysis of the democratization of environmental data on the kinds of alternative political pathways offered by democratization, especially in the hands of citizen scientists. Yet in both Microsoft's and Gabrys et al.'s (2016) understanding of data, ‘types of data’ and ‘types of uses’ are interlinked (…). In other words, there is a co-constitutive dynamic that develops across the range of ways in which data are parsed, processed and put to use. (pp. 157–181) 4
In what follows, I develop an argument that this “co-constitutive dynamic” within the Planetary Computer, instead of aligning with the “non-technical user,” caters to the already skilled community of ML experts. In lieu of problematizing how Microsoft's definition frames the “average non-technical user” or what constitutes “unfettered access,” the thematic analysis in the first part of the paper develops a more comprehensive depiction of how the company's understanding of democratization relates to the Planetary Computer. The aim of the analysis is hence a discursive description of the elusive relationship between democratization and the Planetary Computer. To this end, the analysis first shows that Microsoft positions environmental data within the Open Data Campaign as a means of inciting the whole tech sector to “close the divide” between those who do and those who do not have access to data and computing power; I then show that the strategy of democratizing data and putting computing power in the hands of environmental experts is “good for the planet [and] good for the business” (Microsoft, 2021b). The final point gleaned from Microsoft's discourse about the imaginary of democratization through the Planetary Computer is the allure of the possibility of “optimizing the environment.” Yet to achieve this goal, as Lucas Joppa clarifies, the Planetary Computer needs to also fulfill its potential of bringing “the two worlds” (Joppa's words) of computer science and environmental science together.
Part I: Democratization, the Microsoft way
Since entering the information technology (IT) market, Microsoft has engaged in a series of campaigns designed to position itself as one of the key actors shaping the future of cloud technology and tech policy. The rubric of democratization is one of the crucial tenets of Microsoft's agenda. Two documents released to the public in 2018 by the company offer a glimpse into how Microsoft envisions its democratizing and environmental missions: The Future Computed: Artificial Intelligence and its Role in Society and A Cloud for Global Good: A Policy Road Map for a Trusted, Responsible and Inclusive Cloud. These reports showcase how Microsoft promotes a human-centered, socially relevant, and environmentally friendly computing infrastructure. Behind this rhetoric is an assumption that AI and big data will revolutionize all domains of human life.
These reports construct a democratization-centered narrative about the history of Microsoft, stating that “[w]hen Bill Gates and Paul Allen founded Microsoft over 40 years ago, they aimed to bring the benefits of computing (…) to everyone” (Microsoft, 2018). The company draws on its history of democratizing the PC to promote a vision of open access to AI and data. In words co-authored by Harry Shum, former executive vice president of Microsoft's Artificial Intelligence and Research group, and Microsoft President Brad Smith, “[o]ur approach to AI is making the fundamental AI building blocks like computer vision, speech, and knowledge recognition available to every individual and organization to build their own AI-based solutions” (Microsoft, 2018).
Shum and Smith also observe that, while access to this technology will help level the field of technology innovation, it will also engender a shared sense of responsibility. Therefore, answering the question of what role AI should play in society “requires that people in government, academia, business, civil society, and other interested stakeholders come together to help shape this future” (Microsoft, 2018). Accordingly, both documents include policy recommendations directed foremost at governments but also at industry and civil society. The Future Computed report reads: As a company that is helping to drive technology innovation in this new era, we recognize our responsibility to work in partnership with governments and communities to help advance social and economic progress. (Microsoft, 2018)
In the foreword to The Future Computed, Harry Shum and Brad Smith conjure an image in which a sense of shared responsibility for AI development needs to emerge and AI design choices cannot be limited to the tech sector alone. The two 2018 Microsoft reports do not ignore the company's considerable environmental footprint. For example, they acknowledge that Microsoft consumes even more energy than some countries. To remedy this problem, the reports outline a robust sustainability strategy, in which AI for Earth and the Planetary Computer feature prominently.
The planetary computer as a data intermediary
The data hosted on the Planetary Computer comes almost exclusively from state services including the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), the European Space Agency (ESA), the UK Met Office, the Chinese Meteorological Administration, and the Chinese Academy of Sciences. Thus, Microsoft should be understood as a “data intermediary” (Magalhaes et al., 2013; Sawicki and Craig, 1996; Sein and Furuholt, 2012) between state, academic, expert, and nonexpert communities. Data intermediaries “make use of data in novel ways, connecting entities that would not otherwise be connected” (Schrock and Shaffer, 2017: 3). Such organizations play a critical role in providing open access to data, including data from open environmental repositories. By allowing data to travel across a vast data ecosystem, intermediary organizations intend to play the role of agents of democratization (Van Schalkwyk et al., 2015).
The Planetary Computer caters to ML, data science, and computer vision communities. As Lucas Joppa has often claimed, environmental data is the most complex and exciting dataset out there. Crucially, ML algorithms trained on environmental data can later be modified to accommodate data coming from different domains. Data science is “data-hungry”: since data scientists do not have data of their own, they seek strategic alliances with domain experts, such as environmental scientists, in the process of “data prospecting” (Slota et al., 2020). Prospecting is similar to the work of geological surveying (Bowker, 1994) in as much as it relies on “the notion of unexplored territory that may yield some value once it is better understood” (Slota et al., 2020: 3). The Planetary Computer might be said to provide environmental scientists and ML experts alike with a common infrastructure for surveying the unexplored territory of big environmental data.
Closing the data divide through the slogan “Good for the Planet, Good for Business”
The Planetary Computer belongs to Microsoft's Open Data Campaign, which was launched in 2020 with two partners: the Open Data Institute (ODI) and the Governance Lab (the GovLab) at New York University's Tandon School of Engineering. In the first Campaign report published in April 2021, Jennifer Yokoyama, the Vice President and Deputy General Counsel from Microsoft's Intellectual Property Group, stated: “we are shifting to a new paradigm in technology and business: using data to collaborate, not just compete” (Microsoft, 2021a). The Open Data Campaign frames access to data as a political problem: uneven access to data creates inequality in the extent to which actors (states, corporations) can “innovate with AI.” AI innovation is here promoted as the ultimate goal, while data is a crucial means to reach it.
In the company's discourse, the Open Data Campaign is about “closing the Data Divide.” While the Open Data Campaign and the Planetary Computer are depicted as steps toward the democratization of AI, these narratives are part of a wider “mythology of big data” (Boyd and Crawford, 2012). For example, Microsoft envisions environmental data as an untapped source of value, both for the data science and ML communities and its own commercial initiatives. In his 2020 keynote at the Ecological Society of America (ESA) Annual Meeting, Joppa said: “The planetary data represent the biggest data sets.”
The aspiration here is to form a platform through which “all of the globally available environmental data” (Joppa, 2020) could be deposited on a common computing infrastructure. Joppa described such a platform with three main objectives: to monitor, model, and manage Earth's natural systems. This platform would enable answering questions such as: “Where are the world's forests, where are the world's wetlands, how fast are they changing? What are the sorts of benefits that we are gaining from those ecosystems?” (Joppa, 2020). Since the global policy community is currently lacking such a platform, Microsoft stresses the need to accelerate and scale monitoring, modeling, and management and make such a platform democratized—that is to say, accessible to policymakers, environmental scientists, and the public.
In the narrative of Microsoft and other Big Tech companies, the computing and storage capacities of the Cloud form part of the solution to the environmental crisis. Microsoft's strategy is well illustrated by its rhetoric of “good for the planet, good for business” (Microsoft, 2021b). With its $1 billion investment in climate mitigation technology, the company strives to be carbon negative by 2030 and to remove from the atmosphere, via various carbon sequestration programs, all the CO2 that Microsoft has produced since its founding in 1975. Lucas Joppa is at the forefront of these efforts, but he considers Microsoft's carbon-negative standing as “far from enough for the world to achieve what it needs to achieve.” AI for Earth is a program that represents the company's reach beyond being environmentally neutral toward designing technologies that attempt to change the world for the better.
In one interview (Handavar, 2020), Joppa claimed that global environmental monitoring systems produce “an incredible amount of information”; however, processing and gaining insight into this information represents a technical problem. But, according to Joppa, “[i]t is a problem that I believe the modern cloud platform approach to computing is uniquely positioned to help solve.” The keyword to understanding why the cloud and AI will help solve this environmental data challenge is scale. In Joppa's words: “we are in the tech sector where if there is one word that we live and die by it's scale.” Therefore, one of the main rationales behind the AI for Earth and Planetary Computer programs is to allow grantees to scale their work, as scaling is profitable from the perspective of both the industry and environmental scientists.
The discourse of scale matters here, as part of the critique propagated by Microsoft is that environmental scientists might be good at creating localized knowledge, but they are not proficient or lack funding for scaling their methods, tools, and data repositories. The imperative of scaling is further reinforced by the mission of democratization: bringing software, data, and other computational tools to the cloud is at the same time about expanding the scale of environmental sciences and making them more accessible and democratized. Joppa's words are illuminating: this thing that we’re calling the planetary computer is really an integration of the cloud scale compute globally important environmental data sets, and then building out kind of the programming environments, the machine learning training environments, and investments in some application areas, so that partners all around the world can take advantage of the cloud platform that Microsoft provides in a much more easy (sic) fashion. (Joppa, 2020)
Going back to Microsoft's definition of democratization, noticeable here is Joppa's implicit equation of scaling with global accessibility enabled by a cloud platform.
Optimizing the environment requires a boundary infrastructure for bringing computer and environmental sciences together
In one of the promotional videos for the Planetary Computer, Joppa begins with a statement: “Imagine if we had a Planetary Computer that could tell us exactly what we needed to do to protect planet Earth.” This sociotechnical assemblage has been compared in Microsoft's promotional documents to a personal AI assistant akin to Cortana or Siri. Such an assistant would not only protect the planet, but also control it on an unprecedented scale. In the same video, Joppa makes a swift shift from speaking about nature to speaking about data science: “We're talking about a wide range of environmental concerns. These represent the world's biggest data challenges, the world's biggest compute challenges, and the world's biggest algorithmic challenges.” In Microsoft's discourse promoted by Joppa, the gathering and analysis of these data will enable optimization of the “Earth's operating system”: We talk about how to solve climate change. There's a higher-order question for society: What climate do we want? What output from nature do we want and desire? If we could agree on those things, we could put systems in place for optimizing our environment accordingly (Joppa interview by Strickland, 2019).
Joppa introduces the notion of an “objective function for planet Earth,” and calls the Planetary Computer “the world's largest optimization experiment” (Joppa interview by Strickland, 2019). The objective function is meant to signify the process of getting “our species (…) back in the driver's seat.” Who will be responsible for steering the planet is less clear from Joppa's statements. Even if the actual political consequences of the imaginary of optimizing the environment are hard to discern at this point, I do want to suggest that Joppa's views are indicative of the crystallization of a new form of power within the digital society exercised by what Burrell and Fourcade (2021) call the “coding elite.” In other words, through a reliance on advanced computational data analytics tools, environmental knowledge production is being steered into an ever-growing reliance on a new social elite wielding the mastery of these tools while at the same time “consolidat[ing] power by framing human reasoning as inadequate, even the expert decision-making of high-status professionals” (Burrell and Fourcade, 2021: 222). Reliance on the machine reason of the Planetary Computer is hence one route of displacing fallible human judgment and moving from suboptimal to optimal climate and environmental conditions.
Microsoft argues that it is trying to make it easier for environmental scientists to use artificial intelligence. In one of the interviews describing the vision for the Planetary Computer, Joppa explained that this platform is an attempt to bring two worlds—that of tech developers and environmental scientists—together: I really see it as our job to help show the world what's possible and then to help bring those two worlds [computer & environmental science] together. And that's easier to do when you have a platform upon which to do it. And that's really one of the reasons we're building the Planetary Computer platform. (Joppa in Handavar, 2020)
Joppa thinks like a good sociologist of science in arguing that once two social worlds are sharing a “boundary infrastructure,” 5 a collaboration between them becomes easier. Recognizing the lack of such infrastructure, Joppa's objective is to foster collaborations between data and environmental scientists and allow those collaborations to integrate their work via cloud computing, data storage, and analytics through Microsoft's Azure platform. Joppa sees it as Microsoft's responsibility to bring the two worlds of computer and environmental sciences together. His assumption is that it gets easier once both worlds are already on the same platform. For example, in a data science podcast interview, Joppa spoke about the “Earth operating system”—a phrase to which the host, himself a data scientist, exclaimed: “I absolutely love that explanation. I think you’ve nailed it perfectly for this audience” (Handavar, 2020). This response makes it clear that Joppa self-consciously chooses words to translate ecological concepts into the pidgin dialect of environmental computer science (Galison, 1996).
To sum up, grounded in principles of cooperation and democratic participation, the vision for the Planetary Computer portrays a boundary infrastructure for collaboration between the environmental and computer sciences as “Good for the Planet, Good for Business,” while the environment itself is understood as a cybernetic object amenable to global optimization. Still, Microsoft's vision for the Planetary Computer is but a part of a larger milieu of “collectively imagined forms of social life and social order reflected in the design and fulfillment of nation-specific scientific and/or technological projects” (Jasanoff and Kim, 2009: 120) of data and technology democratization. Cutting across the multiple discursive layers of Microsoft's Planetary Computer project is the desire to enroll computer scientists in solving environmental problems, and thus the second part of the paper singles out computer vision researchers curating biodiversity datasets. As a way of transition, I quote political sociologist Felix Tréguer (2019: 147), who in a chapter titled Seeing Like Big Tech wrote: “In the age of Big Data, the techniques mastered by Big Tech are now seen as crucial to make the digitised world legible and governable.” Building on Tréguer's argument, I uphold the view that Big Tech's influence on environmental sciences must be seen through the lens of “public-private hybridization” (Tréguer, 2019: 147). Microsoft's aspiration to become an open government data intermediary thus requires attention not only to the company's effect on academic actors but also to the inextricable relationship between data's flows and circulation, its value and interoperability, its original contexts of production, and its appropriation into the AI economy of training sets.
Part II: Democratizing environmental data for the AI economy
This part examines how open and democratized biodiversity data becomes a valued asset for the computer vision experts who use it to train AI models. I follow this analytical thread to capture how conceptions of democratization and openness coalesce around a key aspect of Microsoft's imaginary: a view that their environmental programs should ideally help both environmental sciences and computer sciences (or in this case, the computer vision community).
Biodiversity data from citizen science apps and the data hunger of computer vision experts
One of the early types of data hosted on the Planetary Computer was biodiversity data, much of which has been collected by citizen scientists. Since part of the account below shows how biodiversity data from one citizen science app—iNaturalist—has become an asset to the computer vision community, I will now recount the historical context and motivations of the computer vision experts who helped to design the app. iNaturalist is advertised as an app that helps “to connect people to nature” while generating scientifically useful data (see Altrudi, 2021). While this section indicates how experts from one of the most prominent computer vision labs in the United States, led by computer scientist Pietro Perona at the California Institute of Technology (Caltech), helped to design algorithms that automate citizen science data collection, the next subsections depict the subsequent journeys of specific data from the iNaturalist app. I provide the intellectual context of Pietro Perona's work because it was his students who gained the support of the AI for Earth program to curate the LILA BC database.
Around 2010, Perona began collaborating with the Computer Vision Group at Cornell University to produce Visipedia—a visual encyclopedia fundamental for creating a popular citizen science app, iNaturalist (Van Horn, 2019). Perona described the long-term vision behind repositories like Visipedia: “In the future, you could point your phone at a rash or a mole on your skin, and the phone would tell you, ‘Go see a doctor,’ or ‘Take another picture tomorrow and let's see where this goes.’ You would have peace of mind” (Stathatos and Perona, 2023). Later, in an interview for Breakthrough Caltech Magazine, Perona added: “I want my phone to become an expert in my pocket that will tell me about any object I might find” (Stathatos and Perona, 2023).
While Visipedia was about the possibility of identifying any object in the world, the iNaturalist app focused specifically on plants and animals. iNaturalist resembles a social network for both professional and lay experts in biodiversity science. 6 The app gained prominence thanks to the support of the California Academy of Sciences and the National Geographic Society. Grant Van Horn, one of Perona's graduate students, developed the image identification algorithms for the iNaturalist and the Merlin Bird ID apps. Caltech Breakthrough Magazine reports that “Van Horn sees birds as a perfect testbed for future progress in machine vision and learning. Why? Birds vary subtly in looks, songs, and behaviors—and they have a large human fan base that contributes tons of data” (Van Horn in Pourbahrami, 2018). Both Visipedia and iNaturalist were tools for collecting massive open datasets, which could later be used as a testbed for ML algorithms.
The value of biodiversity databases for computer vision experts
For many in Perona's lab, collaborations with the environmental sciences meant access to valuable training data. The drive on the part of computer vision researchers to aid with environmental data collection and curation stems both from an obligation to advance their own field and from a genuine desire to apply their technology to socially relevant problems. But the curation of datasets, although labor-intensive, brings other crucial gains for ML experts beyond a sense of satisfaction with helping to solve environmental problems. In one interview, a key architect of LILA told me: There's a big need for accessible ecology datasets that are in a format that machine learning people can work with, and that is already curated. And that's how LILA started. [LILA] is a website sponsored by Microsoft AI for Earth [that] hosts tons of ecology datasets, and they all use the same standardized format.
In response to the question of how LILA turns into an asset for the computer vision community, one of my interlocutors responded: “step one is to make the data accessible so that computer scientists don’t have to do the work to curate your data for you.” They added: “the act of curating that dataset was a nightmare. And computer vision scientists do not want to spend their time curating datasets. They want to just try their models on already curated datasets.” The labor investment in curating environmental databases is an essential step in the process of making data valuable.
LILA BC is a prime example of one of the principles of the “logic of domains” (Ribes et al., 2019), namely that “both ships will rise,” or, in the context of this specific repository, the notion that by processing data to fit the needs of the computer vision community, both computer vision experts and conservationists will benefit. The LILA BC (2023) website reads: [e]veryone benefits when labeled data is made available. Biologists and conservation scientists benefit by having data to train on, and free hosting allows teams to multiply the impact of their data. (...) ML researchers benefit by having data to experiment with.
While computer vision experts gain access to data on which they can train their models, the conservationists’ benefits are measured in terms of the increased efficiency of the ML models in classifying camera trap photos. The value of these models is measured by the time saved due to automation in image processing. In effect, computer vision experts argue that, by modernizing data collection, camera traps have transformed wildlife ecology and conservation in recent decades (see Nichols and Karanth, 2011).
I use LILA BC as an example of what Slota and colleagues call “data prospecting,” namely “the work of rendering data, knowledge, expertise, and practices of worldly domains available or amenable to engagement with data scientific method and epistemology” (Slota et al., 2020: 1). Data prospecting is one step in the process of valuation of data situated in the political-economic, infrastructural, and ethical milieus of the contemporary AI economy. The authors contrast their notion of prospecting with the similar metaphors of “data capitalism” (West, 2019), “data extractivism” (Sadowski, 2019), and “data colonialism” (Thatcher et al., 2016). As the authors argue, unlike these related concepts, the concept of “data prospecting” takes into consideration the valuation of data not only at the point of analysis but crucially during “earlier moments in data journeys” (2020: 4; see also Leonelli and Tempini, 2020).
Key to the present analysis is Slota et al.'s (as well as Walford's and Leonelli's) insight that data's value is defined by its not-yet-realized potential. As Slota et al. (2020: 9) write, prospecting is fundamental to the practice of data science: “it names the work of discovering data resources ripe for value extraction.” Stated otherwise, data becomes valuable through labor. Datasets need to be curated to become value-laden to the data analytics industry. In this process, the value of data is derived from its utility for prediction practices, and datasets lose association with their domain origins. Open Data is about potentiality and promissory tendencies. Reading Leonelli (2013: 10), we get the impression that the further data travels, the more dimensions of value it is imbued with: “The vision underlying the Open Data movement is that data risk remaining meaningless if they are prevented from traveling far and wide and that travel endows data with multiple forms of scientific as well as financial, social, and political value.” This logic succinctly captures the visions of data democratization held by both Microsoft and the researchers it supports: democratization in a larger sense depends on the labor of curating and prospecting, and this labor in turn aims to make data interoperable, thus increasing its value.
Data: A gift in the wrong hands?
In their 2021 article, Crawford and Paglen asked about the “underlying logic of how images are used to train AI systems to ‘see’ the world.” Crawford and Paglen centered their analysis on one of the most important computer vision datasets: ImageNet. This dataset was instrumental in the revival of neural networks as the dominant approach in AI in the early 2010s (Wiggins and Jones, 2023: 163). This “canonical training set,” as Crawford and Paglen call it, was co-created by another student of Pietro Perona from Caltech: Fei-Fei Li—currently the director of the Stanford Institute for Human-Centered Artificial Intelligence. As Crawford and Paglen (2021) recount, in the early 2010s, the ImageNet dataset became a “critical asset for computer-vision research.” Yet these datasets reproduce metaphysical and ontological assumptions which are not only political—as Bowker and Star (1999) taught us, all classification is political—but often catastrophic. This happens when whole demographics of people are “misclassified,” or even erased.
In line with Crawford and Paglen's argument that “Datasets aren’t simply raw materials to feed algorithms, but are political interventions,” I want to draw attention to the political consequences of environmental data training sets. To this end, I single out some unintended consequences of a data competition based on the iNaturalist open dataset. iNaturalist data stored on Microsoft's servers was used to host ML challenges oriented toward biodiversity applications, fostering computer vision experts’ creativity in solving biodiversity problems. The challenge is associated with the biggest meeting in the field—the Conference on Computer Vision and Pattern Recognition (CVPR)—and more specifically, with the Workshops on Fine-Grained Visual Categorization (FGVC) organized in conjunction with CVPR. The data competition, called iWildCam-FGVC, is supported by AI for Earth and Wildlife Insights.
The iWildCam 2021 challenge focused on developing ML techniques for counting individual members of the same species in moving images from a camera trap. This kind of ML has direct application in identifying individual members not only of nonhuman but also of human “herds.” The winner of the 2020 challenge was a team led by Dr Xiu-Shen Wei, a professor of computer science and engineering at Nanjing University of Science and Technology (NUST) and a founding director of Megvii Research Nanjing, part of the Chinese corporation Megvii. Thanks to the competition and the iNaturalist training set, Megvii's computer vision ML systems became more accurate in identifying not only distinct species but also individual members of a species. In other words, Megvii's ML can determine with a higher level of confidence not only to which species a given animal belongs but also which individual animal it is. Identifying individual members of a species has been shown to be a productive method of assessing species populations, and the scientific value of such systems for biodiversity science is undeniable. Yet the same layers of a neural network responsible for distinguishing zebra A from zebra B can be retrained to distinguish members of different ethnic communities. In this sense, the interoperability of biodiversity data can have unintended consequences, one of which is the value derived from environmental training datasets by the national surveillance apparatus.
Megvii is one of the industry leaders in the international market of face recognition technology. In October 2019, the Bureau of Industry and Security of the US Department of Commerce added Megvii and 27 other Chinese companies to a blacklist for their implication in human rights violations. Megvii was one of the companies whose face recognition technology has been linked to the ethnic cleansing of more than a million Uighur Muslims in the Chinese province of Xinjiang. Megvii denies any allegations of the use of their technology in human rights violations, and the evidence of such connections is sparse. In December 2020, however, Internet Protocol Video Market—a leading independent research organization dedicated to video surveillance—found a confidential report which, according to the Washington Post, shows that: the telecommunications firm [Huawei] worked in 2018 with the facial recognition start-up Megvii to test an artificial-intelligence camera system that could scan faces in a crowd and estimate each person's age, sex and ethnicity.
In the context of economic forces where more training data leads to more profit and market domination, biodiversity training datasets become a free gift. Leonelli (2013) reminds us that data has not only scientific value but can also act as “political, financial, and social objects.” Leonelli (2013) adds: “The increased mobility of data is unavoidably tied to their commodification.” The crucial point here is that not only can data be commodified, but that open and free datasets are part of the political economy, which derives profit from the free gifts of labor of others (on the gift economy on platforms and digital divide, see Fuchs and Horak, 2008). In the scenario I have recounted above, a facial recognition company with ties to the surveillance state apparatus has derived value from citizen scientists collecting data all around the world by improving the accuracy of their algorithms to identify individuals.
Discussion
The Planetary Computer project is in its early stages; hence it is hard to determine who will use it and what its future will be. We can, however, speculate, especially by juxtaposing this industry project with emerging state-led data intermediary platforms. In his post about the launch of the Planetary Computer, Microsoft's president Brad Smith wrote: “We will support and advocate for public policy initiatives that measure and manage ecosystems at the national and global scale” (Microsoft, 2018). But is Microsoft the right organization to help manage global ecosystems? Microsoft sources its data from NASA and the ESA, among other organizations. Yet in late 2021, NASA and ESA announced their own data analytics platform, the Multi-Mission Algorithm and Analysis Platform (MAAP), which “brings together relevant data, algorithms, and computing capabilities in a common cloud environment to address the challenges of sharing and processing data” (NASA, 2019). This description reads almost exactly as if it came from Microsoft's website, but there are important differences between the two initiatives.
While Microsoft openly states that its data is curated to be used in ML workflows, “NASA and ESA are working together to make data and metadata more interoperable across organizations” (NASA, 2019). In other words, Microsoft's intention is to use environmental data primarily as a support for training ML models: The Azure Open Datasets website states that “Curated open public datasets in Azure Open Datasets are optimized for consumption in machine learning workflows” (Microsoft, 2022). In contrast, the NASA-ESA platform addresses the problem of data interoperability head-on and bases its data curation practices on the ideal “to meet the needs of the Earth observation research community.” This is a significant difference—one determined in part by Microsoft's AI-centric view of innovation and decision making.
My analysis shows that Microsoft puts the needs of the ML community ahead of the science-driven questions of the Earth and environmental sciences. This point is supported by the fact that the LILA BC dataset is curated first and foremost to be seamlessly used in the training of computer vision algorithms and only secondarily to inform biodiversity science and policy. Indeed, one of the themes on which the literatures on critical data studies and democratization converge is the growing “Big Data divide” (Andrejevic, 2014). For example, Stefan Baack (2015: 3) complicates the narrative that open data leads to increased political participation and agency in activist circles if such data are not properly “refined.” Likewise, the extent to which citizens, or even environmental experts, will be able to mobilize data hosted on the Planetary Computer depends on a set of technical and practical barriers. The divide between environmental scientists who possess the technical skills to analyze such data and those who do not might exacerbate the existing expertise divide in environmental sciences. Furthermore, access to open data might be complicated if such data have been curated and optimized to be used by a particular community, as in the case explored above.
Unreflective open data and AI initiatives are never neutral (Gurstein, 2011; Kitchin, 2014). The downstream effects of expertise divide in environmental data democratization can be ameliorated by paying attention to a diversity of citizen science sensing practices and uses of data through “creative data citizenship” (Gabrys et al., 2016). But what happens if the democratization of environmental data is leveraged through a data analytics industry potentate like Microsoft? As this article shows, the democratization of citizen science biodiversity data can uncritically reproduce power structures in which the training of AI models achieves priority over informing environmental decisions.
Anthropologist Bill Maurer (2021: 2) has recently asked whether data must inherently reproduce the structure of power within which it was produced or, can data “work more like ‘the gift’, varied in meaning and pragmatic, unfolding depending on the stakes of the games in which people are caught up?” Drawing on the anthropological analysis of gift exchange, Maurer (2021: 2) adds: “a pig is never just a pig, interchangeable with all the others, but ‘this’ pig which sits in a system of relations among other pigs and people, making its transfer particularly commanding.” Data's interoperability, when understood as a key transformation in the process of democratization, makes environmental data just data—this data no longer “sits in a system of relations,” to use Maurer's reference to Marilyn Strathern's (1988) work. Environmental data does in fact become interchangeable with any other data source in a larger project of training computer vision algorithms. Marion Fourcade and Daniel Kluttz also draw on the anthropology of gift exchange (Mauss, 1990) to put forward their critique of “accumulation by gift” and processes of “subsumption” (Polanyi, 2001) of “social relations to economic motives” (Fourcade and Kluttz, 2020). The gift economy of environmental data seems to follow a similar logic. In the case of democratized biodiversity data from the iNaturalist app, the “subsumption” into the Planetary Computer relied on the free labor of the app's users and the curators of LILA BC. But thus far, it is the computer vision community that benefits from these data gifts, especially if we consider that the expertise required to analyze labeled image datasets curated for ML is not yet prevalent in conservation biology.
Democratization of environmental databases and analytic tools might have the capacity to reshape the landscape of environmental knowledge production, governance, and expertise at large. In this new epistemic regime, determining who counts as an expert and who counts as a decision maker will be tightly coupled to proficiency in data analysis (see Lave, 2015). Environmental problems, now rebranded as “data challenges,” thus acquire a new epistemological position in ways that can directly influence the process of policymaking.
Conclusions
In this article, I probed the consequences of Microsoft's commitment to data and AI democratization and traced democratization's framing within the company's reports, the positions of its leaders, and online statements. I focused on the lives and afterlives of environmental data within the data economy to illustrate the complexity of academia-industry relations conjured by the transformation of data into a valued asset. I explored the concept of democratization through this empirical material, yet more work needs to be done to describe how theories of democratization permeate the porous boundaries between academia and industry and, more precisely, how democratization will change environmental knowledge production and governance. This article suggests that while Microsoft's discourse implies the company's intention to support the biodiversity community, the majority of the Planetary Computer's tools and datasets are curated with the ML community in mind. This finding supports the argument that sociotechnical imaginaries of the AI economy can intensify processes of data extractivism and data prospecting and hence erase the original contexts of data production. In effect, environmental data comes to be valued as an AI training dataset rather than as a source of environmental action.
But could environmental science be corrupted if the data on which knowledge production relies were to be hosted on commercial cloud computing services? Are nationally sponsored knowledge infrastructures a better solution—even when they are “under siege” (Edwards, 2019)? Clearly, the manipulation of data is both a political and user-oriented process, and as such, it should be no surprise that Microsoft curates its data to serve the ML community better. Nevertheless, the literature on the co-production of market forces and life sciences warns that market priorities can create infrastructural conditions under which the basic research needs of life scientists (Sunder Rajan, 2006; 2012) or environmental activists (Fortun, 2004) no longer take priority. Considering this, the increased dependence of biodiversity science on commercial (as opposed to state-funded) computing infrastructure might be an alarming development.
The Planetary Computer also has an enormous potential for exacerbating inequalities within transnational science and creating a divide between environmental scientists who have expertise in AI big data analysis and those who do not. But as this article argued, contemporary politics of information is not just about those who do and those who do not have access to open data, but also about a “computational divide” between those who have access to compute power and wield a form of expertise necessary to analyze the data, and those without such computing power and expertise. This divide is visible in the environmental domain and beyond it. Considering the ascendance of a new class of “coding elites,” the computational divide, a term that sits at the intersection of the democratization of data, centralized computing infrastructures, and expertise inequalities, will require further attention in critical scholarship. 7
What are the consequences of the economic logic of AI for environmental data? As the historian of AI Jonnie Penn (2018) wrote in an op-ed for The Economist, “AI thinks like a corporation—and that is worrying.” Our understanding of AI is often obscured by this technology's quasi-mythological cultural status. I find it instructive to follow cultural scholars of technology, such as Tung-Hui Hu (2015), whose historical account of cloud computing conveys that the cloud mirrors an “architecture of our own desire.” Azure, it might be argued, like any other infrastructure, “operate[s] on the level of fantasy and desire” (Larkin, 2013: 333). Considering these views, it is imperative to ensure that concepts like democratization and the computational infrastructure of environmental sciences do not reflect solely the desires of the AI economy.
Footnotes
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The author received funding for the cost of publication of this article from the MIT Open Access Publishing Fund.
