Abstract
Progress in biotechnology is critically dependent on continued access to new biological “components” (genes, proteins, organisms) from nature. Over recent decades, the way that researchers access and use these components has changed dramatically in response to similarly dramatic developments in technology and regulation. The net effect of these changes has been to severely restrict the availability of high-quality genetic data from biodiversity. This bottleneck limits the potential of machine learning (AI) in biotechnology and is a threat to progress across the industry. We suggest that the inevitable demand for high-quality genetic data to train the next generation of biological AI models has the potential to align the economic and technical interests of the bioeconomy with those of biodiverse “provider” countries and communities. The impending era of big data in biotechnology will therefore require the industry to break its dependence on “digital biopiracy” and embrace sustainable partnership-based data supply chains.
Introduction
Biotechnology is the science of taking biological “components”—genes, proteins, organisms—that have evolved in nature and then adapting, engineering or repurposing them to deliver a solution to a problem. 1 The source of these components is our planet’s rich biodiversity, a complex web of ecological variation and interaction evolved over nearly 3.8 billion years. 2 Innovation in biotechnology is dependent upon easy access to new biological components and the ability to understand the properties of those biological components in new ways.
The multi-trillion dollar global bioeconomy represents one of humanity’s few credible routes towards a clean, sustainable, and healthy future for all. 3 The COVID-19 pandemic and an international demand for increased sustainability have accelerated investment, demand, and publicity. 4 Today, the industry is within reach of revolutionary medicines, diagnostics, foods, fuels, crops, materials, and more.
However, decades of technological and economic success in the bioeconomy have been mirrored by a growing sense of injustice over the distribution of the benefits realized. This has translated into a proliferation of national and international regulations restricting commercial access and use of genetic resources from biodiversity. The industry has responded by scaling back commercial bioprospecting activity in favor of product development based on “digital sequence information” (DSI). 5 This shift has been enabled by technical breakthroughs in synthetic biology, significant reductions in the cost of DNA sequencing and synthesis, and the establishment of publicly available collections of DSI (databases) obtained by academic research projects.
For more than a decade, reliance on these public databases has proved to be an imperfect but serviceable source of biological components to fuel progress in biotechnology. In the meantime, the gears of the international legal system have slowly ground towards closing this “digital loophole.”
However, the industry now faces yet another tectonic shift. Biotechnology has progressed beyond the techniques based on human insight and limited by the throughput of assaying laboratories, and is on the threshold of a new period of progress characterized by increasing reliance on data-hungry artificial intelligence (AI) models. Public DSI databases have serious limitations as a source of training data for this new generation of models, and the lack of a sustainable alternative source of large quantities of high-quality genetic data from biodiversity now threatens to limit the pace of innovation and the potential achievements of the impending era of biological AI.
In this paper, we propose that this development presents a natural opportunity for mutually beneficial realignment of the interests of biotechnology companies and biodiversity stakeholders. By adopting a more equitable and inclusive approach to commercial biodiscovery, companies can access sustainable data supply chains capable of meeting the demand for large quantities of high-quality training data. We propose, based on real world case studies, that collaboration can simultaneously give researchers (and their AI) access to a far more detailed and comprehensive understanding of biology, while ensuring that biodiverse countries and communities see the benefits of this progress and are rewarded for their key role in the bioeconomy.
A Brief History of the Relationship Between Biodiversity and Biotechnology
Biotechnology has always depended on access to genetic biodiversity
Biotechnology as an industry was born in the 1970s, enabled by significant advances in both molecular biology and genetic engineering. For many, the first true biotechnology company was Genentech, founded in 1976, 6 which successfully expressed a human gene in bacteria and paved the way for the commercial production of human insulin and other pharmaceuticals through genetic engineering. 7 Genentech was quickly joined by a number of other biotechnology companies across a wide range of medical, agricultural, and industrial applications.
In these early years, DNA sequencing and synthesis was too complex, slow, and expensive to be used at any scale. Instead, DNA was obtained from physical biological samples by lysis (purification of the debris obtained from breaking open cells), 8 by recombinant DNA technology (the use of restriction enzymes to “cut” sequences from within the genome of an organism), 9 and from 1983 onwards by newly developed polymerase chain reaction techniques. 10 These techniques and their dependence on access to physical samples from biodiversity would continue to characterize biotechnology for nearly half a century. Companies inevitably turned to biodiversity to collect physical samples from the natural environment as the starting point for engineering in the wet lab. 11
Tensions between the “users” and the “providers” of biodiversity
However, this technical and commercial success was focused primarily in the industrialized “Global North,” 12 whereas biodiversity rich countries providing the underlying resource tended to be in the “Global South.” 13 As the economic potential of this new industry became apparent, tensions arose between “user” and “provider” countries. The term “biopiracy” emerged in the late 1980s to describe the practice of “exploiting naturally occurring genetic material while failing to pay fair compensation to the community from which it originates.” 14 While the practice itself is centuries old, 15 the pejorative term was coined in the late 20th century in the context of the burgeoning international environmental movement and growing pressure from “provider” countries which felt their genetic resources had been exploited by “user” countries in the Global North. 16
These flames were fanned by a number of particularly high-profile incidents of commercialization without consent throughout the 1980s and 1990s: the development of biopesticides from the Neem plant native to India and Nepal by US chemical company WR Grace; 17 commercialization of a dietary product based on the Hoodia plant of the Kalahari Desert licensed to UK-based company Phytopharm and later to Anglo-Dutch Unilever; 18 and the development of the multi-billion dollar blood pressure drug captopril by US-based company Squibb Pharmaceuticals based on venom taken from a Brazilian viper. 19
Regulation and restricted access
In this context, the 1992 United Nations Convention on Biological Diversity (the CBD or Rio Convention after the city in which it was signed) included as one of three main objectives the “fair and equitable sharing of the benefits arising from use of genetic resources.” 20 To this end Article 15 of the CBD made provision recognizing the sovereign rights of states over their natural resources and the right to determine access to genetic resources, 21 and provided for access to be subject to mutually agreed terms (MAT) 22 and on the basis of prior informed consent (PIC) given by the provider state. 23 The 1990s and early 2000s saw a small number of headline commercial bioprospecting agreements, most notably the Merck-INBio (Costa Rica) agreement in 1991 24 and a singular earnest attempt to engage in ethical bioprospecting in the spirit of the CBD at scale by San Diego-based company Diversa. 25
However, the optimism which followed the agreement of the CBD quickly faded. It was not until 2010 that a framework for this sharing of benefits from the utilization of genetic resources in a fair and equitable way was finally agreed following the tenth meeting of the conference of the parties to the CBD in Nagoya, Japan. Under the “Nagoya” Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilization (ABS), parties are required to ensure that domestic users of genetic resources comply with the relevant legislation of the country providing those resources, generally by obtaining PIC before accessing the resource and doing so in accordance with MAT detailing how any subsequent benefits will be shared. 26 On 12 October 2014 the Nagoya Protocol finally entered into force. Members duly adopted national legislation to implement this international agreement in domestic law. 27 Today, 141 28 countries have signed up as Parties to the Protocol (with the notable exception of the United States, a significant loss given that the country accounts for more than 20% of worldwide use of genetic resources). 29 However, in the two decades which elapsed between the 1992 Earth Summit in Rio de Janeiro and the entry into force of the Nagoya Protocol, biotechnology had changed almost beyond recognition. Technical breakthroughs in synthetic biology together with significant reductions in the cost of DNA sequencing and synthesis meant that companies were no longer dependent on physical samples as the foundation for lab-based research, but could instead base all of their research on genetic data in the form of digital sequence information (DSI).
DSI is of course obtained by sequencing of a physical biological sample, and so ultimately requires the kind of bioprospecting activity which under the Nagoya Protocol would be subject to requirements for MAT and PIC. However, by the time implementing legislation came into force the availability of genetic data in online databases compiled from academic research enabled companies to avoid the need to obtain genetic data directly from biodiversity. Academic researchers have tended to be less deterred by regulation, and have often been subject to less onerous access requirements by provider countries than would be applied to commercial operations. 30 In the spirit of open access and collaboration, these academic researchers upload their results in data form to public databases. 31 In 2010 (the year the Nagoya Protocol was signed) one of the largest of these public databases, GenBank, surpassed 130 million DNA sequences. 32 It has now become widespread practice for commercial companies and academic researchers alike to use DSI from these collections; there are estimated to be between 10 and 15 million unique users of these databases worldwide. 29 For companies themselves, conducting all research using large online databases has had many benefits, not least the ease of access to samples from around the world, and the lack of onerous access and benefit sharing obligations which disincentivize bioprospecting.
The net effect of this is that, since 2015, there has been a dramatic reduction in commercial bioprospecting of the type envisaged by the Nagoya Protocol; instead, companies have exploited what has been described as a “digital loophole” by drawing on DSI published in academic collections. Barely a year after it entered into force, the Nagoya Protocol was dismissed as akin to “regulat[ing] VCR technology in the era of YouTube.” 33
Unsurprisingly, this development engendered a debate as to whether genetic data falls within the scope of “genetic resources” protected under the Nagoya Protocol. Just as predictably, this debate has polarized along the lines of “users” and “providers” of genetic data. 34 Many “provider” countries (including Brazil, 35 India, China, South Africa, and Costa Rica) have adopted national legislation governing access to genetic data notwithstanding the international disagreement as to whether this falls under the scope of Nagoya. 36 MAT and PIC associated with collection of physical samples may also cover the use of resulting genetic data; EU Commission Guidance confirms that although genetic data could be regarded as outside the scope of EU regulations implementing the Nagoya Protocol, companies are nevertheless required to respect any mutually agreed terms which restrict the use and/or publication of this data. 37 The collections widely used for commercial research do not carry out any checks to confirm whether academic contributors have complied with national legislation, or whether any MAT and/or PIC under which the researchers accessed the physical genetic resource permits upload to a public database (still less whether it covers possible uses by companies who may access it through the database and go on to use the sequence data in commercial research). For all these reasons, the commercial utilization of genetic data from public databases has been described more bluntly as “digital biopiracy.” 38
In many respects, the biotechnology industry has repeated the cycle which culminated in the adoption of the CBD and the Nagoya Protocol, and there is now growing international impetus to regulate access to genetic data from biodiversity. In December 2022, the CBD adopted a formal decision agreeing that “the benefits from the use of digital sequence information [DSI] … should be shared fairly and equitably,” that the “distinctive practices in its use require a distinctive solution for benefit-sharing,” and agreeing to develop such a solution. 39 More recently still, the draft agreement under the UN Convention on the Law of the Sea on the conservation and sustainable use of marine biological diversity of areas beyond national jurisdiction, agreed on 4 March 2023, expressly refers to “marine genetic resources and digital sequence information [DSI] on marine genetic resources” 40 and imposes a variety of requirements including notification of activity via a clearing-house mechanism 41 and requiring that the benefits from these activities must be “shared in a fair and equitable manner.” 42
Public Databases Are a Poor Foundation for Biological AI
There is little doubt that biological AI is the future of innovation in the biosciences. 43 Genomics is one of the areas with the greatest potential; as all of life runs on the common “coding language” of DNA, machine learning techniques are perfectly suited to annotation, prediction, and generation tasks here.
“Protein AI”—the application of sophisticated models to single protein-encoding genes—has seen some of the most impressive progress recently. The first transformative breakthrough is widely considered to be AlphaFold2, 44 a protein folding model released by Google’s DeepMind in 2020. Described as “the most important achievement in AI, ever,” 45 AlphaFold2 is considered to have “solved” the protein folding problem that lies at the core of understanding biology. 46
The recent rapid progress in “generalizable” biological AI is built upon the public databases of biodiversity that are under scrutiny from the UN and biodiverse provider countries. One of the largest and most curated public protein databases, UniProt, 47 forms the basis of all leading protein structure prediction models (e.g., AlphaFold2, ESMFold, 48 OpenFold, 49 and RoseTTAFold 50 ), protein-ligand models (e.g., NeuralPlexer 51 ), protein function models (e.g., CLEAN, 52 ProteInfer, 53 and ProteinVec 54 ) and protein generation models (e.g., Chroma, 55 ProteinGAN, 56 ProGen, 57 ProtGPT2, 58 and EvoDiff 59 ). Beyond “protein AI,” the cutting edge of this field is now the development of “genomic language models,” capable of generating genome-length stretches of DNA using techniques that are not dissimilar to the way that OpenAI’s ChatGPT generates poetry. 60,61 Across all these applications, the same fundamental problem arises: the more advanced the model, the higher the quality and quantity of training data needed, and the lower the availability of suitable data for this training. 62
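The next-token principle behind these genomic language models can be illustrated with a deliberately minimal sketch: a toy bigram model over DNA bases in plain Python. This illustrates only the generative principle, not the architecture of any model cited above; the training sequences and function names are invented for the example.

```python
import random
from collections import defaultdict

def train_bigram(sequences):
    """Count which base tends to follow which: the crudest possible
    'language model' of DNA (real models use transformer architectures)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, start="A", length=12, seed=0):
    """Sample one base at a time, conditioned on the previous base:
    the same 'predict the next token' loop ChatGPT uses for words."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        # every base in the toy training data has an observed successor
        bases, weights = zip(*counts[out[-1]].items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)

model = train_bigram(["ATGGCGTTAA", "ATGCCGTTGA"])
print(generate(model))
```

Real genomic language models replace the bigram table with billions of learned parameters, but the loop is the same, which is why their fluency is bounded by the breadth of sequence data they were trained on.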
While the AlphaFold2 algorithm marked a major breakthrough, its performance correlates strongly with the quality and availability of training data. The fundamental correlation between training data and performance cannot be overstated; AlphaFold2 is outperformed by a margin of up to 6× by the recently released BaseFold, a structure prediction algorithm that uses the same model architecture as AlphaFold2, yet is trained on a database several times larger and more diverse than UniProt. 63 BaseFold’s improvements over AlphaFold2 are particularly significant in the “orphan protein problem”—situations where very few similar proteins have ever been found in nature. It follows that similar improvements in model performance would be expected by improving the training datasets for all the models mentioned previously in this paper.
These improvements are seen because all machine learning (or AI) models are “just” sophisticated pattern recognition tools that work by ingesting and calculating statistical representations of vast datasets of examples. The outputs of these models are therefore highly dependent on the information that they ingest. Whether “exploiting” information they have seen or “exploring” beyond it using learned statistical patterns, their performance is critically dependent on the quality, quantity, and information content of training data.
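This dependence on ingested examples can be made concrete with a minimal sketch, assuming nothing beyond the Python standard library: a toy nearest-neighbour “model” that classifies a DNA sequence purely by pattern overlap with its training data. The sequences and labels are invented for the example; the point is that the model can only ever echo what its training set contains.

```python
from collections import Counter

def kmer_profile(seq, k=3):
    """Count overlapping k-mers: the 'statistical representation'
    a simple model ingests from each example."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def similarity(a, b):
    """Number of shared k-mers between two profiles (a crude pattern match)."""
    return sum((a & b).values())

def classify(query, training):
    """1-nearest-neighbour: predict the label of the most similar
    training sequence. With two examples, only two answers are possible."""
    profile = kmer_profile(query)
    return max(training, key=lambda item: similarity(profile, kmer_profile(item[0])))[1]

training = [
    ("ATGGCGGCGGCG", "GC-rich"),
    ("ATGATATATATA", "AT-rich"),
]
print(classify("GCGGCGGCGTTT", training))  # → GC-rich
```

A query unlike anything in the training set matches nothing well, which is the toy-scale analogue of the “orphan protein problem”: no amount of architectural cleverness compensates for patterns the model has never ingested.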
Limited data means limited AI performance
In this context, the public databases that offered an imperfect but serviceable source of biological components at a time when the biotechnology industry relied on human scientists learning from “one sequence at a time” are unfit as training data for machines that are able to learn from the entire database at once.
These databases continue to represent a mechanism for academic collaboration rather than a concerted effort to explore biodiversity. This, combined with the lack of bioprospecting activity in the post-Nagoya regulatory landscape, means that the growth of these databases over the past decade has been piecemeal and relatively sluggish by comparison with advances in sequencing technology. Today GenBank, which as noted above contained 130 million sequences when the Nagoya Protocol was signed, is less than twice that size at the time of writing 64 despite a >100× reduction in the cost of sequencing over the same period. 65 From a size and diversity perspective, there are estimated to be over 1 trillion species on Earth, 66 yet half of all the microbial genomes available publicly come from just 12 species. 67 Additionally, likely due in part to the regulatory challenges, the majority of samples to date have been collected from the US, Europe, and China. 68 Because public databases are collated records of academic studies over several decades, each study, and by extension each entry in the database, has a different purpose, uses different molecular biology techniques, collects different metadata, and adopts different data annotation pipelines.
Meanwhile, the use of these public collections for training models carries with it growing legal uncertainty and reputational exposure. Major AI companies in other domains, from music 69 and images 70 to poster child ChatGPT 71 (including OpenAI, Anthropic, and Stability AI), are facing high-profile legal proceedings in a tranche of test claims which challenge their use of training data without the consent of the parties who originally generated that data. It is increasingly implausible to hope that the use of public DSI collections to train AI models will escape similar scrutiny as legal frameworks for access to training data crystallize across other applications, as the international political pressure to regulate DSI gathers momentum, and as the achievements of biological AI inevitably attract growing public attention.
Better data will unlock new model capabilities
The inconsistent collection and curation of the public datasets not only means that they lack the size, quality, diversity, and context (information content) required for existing biological AI models to reach their full potential; more importantly, it means they offer very limited support for significantly more advanced biological AI models. The recent advances in machine learning outside of the biosciences have arisen thanks to significant improvements in model architectures and compute power. The most powerful of these new architectures have been those capable of understanding context in large datasets, notably Google’s Transformer architecture, first published in 2017, which forms the basis of OpenAI’s ChatGPT. 72
Therefore, it is the absence of “biological context” (describing the environment the genes, proteins, and organisms evolved in) in the public data resources that is likely to be the most significant limitation for the field. Capturing and making these additional signals “machine readable” on a global scale will be the key to unlocking the next levels of success in biological AI.
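What “machine readable” biological context might look like in practice can be sketched as a simple record schema. This is a hypothetical illustration, not a published standard: the field names (sample site, pH, temperature, salinity) are assumptions chosen to show how environmental signals could travel alongside a sequence.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ContextualRecord:
    """A hypothetical context-rich sequence record: the sequence plus
    the environment it evolved in, captured at collection time."""
    sequence: str        # the DSI itself
    sample_site: str     # free-text description of the habitat
    ph: float            # acidity of the sample environment
    temperature_c: float # temperature at collection, in Celsius
    salinity_psu: float  # salinity in practical salinity units

record = ContextualRecord(
    sequence="ATGGCGTTAA",
    sample_site="hot spring outflow",
    ph=3.1,
    temperature_c=82.0,
    salinity_psu=0.4,
)
# Serialize so the context travels with the sequence into a training set
print(json.dumps(asdict(record)))
```

Pairing each sequence with structured context of this kind at collection time is what would allow a context-aware model to learn how environment shapes sequence, rather than seeing each sequence in isolation as today’s public records mostly require.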
Biological AI Needs a Sustainable Data Supply
Because the CBD has severely curtailed commercial bioprospecting activity, and because the bioeconomy has become so dependent on public databases, many in the scientific community consider the CBD’s perceived “limits” on biodiversity research to be unacceptable. 73 Many leading scientific and industry organizations support the notion of benefit sharing for DSI in principle, but not in practice. 74 Proposals to overcome the impasse range from the African Union’s suggestion of a flat 1% levy on retail sales of all products sold in developed countries, 75 to global multilateral benefit-sharing frameworks, 76 and a small number of practical case studies of proactive mutually beneficial benefit sharing from genetic data (notably from Kew Gardens in the UK). 77 As set out above many nations have already adopted domestic legislation which require benefit sharing from commercial use of genetic data. 36
Here, we contribute an alternative perspective from our work at Basecamp Research Ltd., a private biotechnology company based in London, UK, specializing in building foundational datasets for biological AI and developing the next generation of models. The company has successfully built bilateral access and benefit sharing relationships with biodiversity stakeholders around the world, enabling it to collect genetic data from biodiversity at a pace, scale, and quality not previously attempted. Access to this data has in turn enabled the company to outperform all foundational AI models across protein function, structure, and controllable generation. 63,78,79 Experience at Basecamp Research demonstrates that the demand for data to unlock the next generation of AI models in biology presents an opportunity for mutually beneficial realignment of the interests of biotechnology companies and biodiversity stakeholders.
Today, Basecamp Research has access and benefit sharing relationships in more than 20 countries worldwide. Importantly, the company commits to ABS even in countries that do not legally require it, and it commits to ABS from genetic data and AI-generated sequences generated from that genetic data. In less than 4 years, Basecamp Research has invested millions of dollars into biodiversity partnerships and has already paid back royalties to a wide range of biodiversity stakeholders. In addition to the monetary benefits (investment, employment, royalties), Basecamp Research shares nonmonetary benefits including training, transfer of technology, and research collaboration.
By working with local partners and scientists, the company benefits from the experience and passion of experts who have a deep knowledge and understanding of the local biodiversity. For the company, training in-country partners in cutting-edge molecular biology techniques ensures consistent, high-quality data collection, and funding the construction of lab facilities is more effective, efficient, and sustainable than operating “helicopter-science” biodiscovery expeditions. This forms the basis of a lasting relationship between local scientists and Basecamp Research; a particular advantage that opens the possibility of returning to sample sites which prove to be of particular scientific interest, an important benefit given that valuable therapeutic enzyme classes can be over 1000 times more prevalent in particular samples.
Beyond the partnership with Basecamp Research, these skills and resources are highly valued, facilitating participation in the bioeconomy and more advanced conservation techniques. Molecular biology skills, portable laboratories, and increased “bioliteracy” are also the basis of projects that support human health, such as disease monitoring and tracking of COVID-19.
These partnerships, therefore, lay the groundwork for a long-term shift to redress the technological imbalance between “user” and “provider” countries, and create economic and social value directly from biodiversity. This incentivizes and rightly rewards those responsible for preserving this irreplaceable resource; it is for good reason that these “nonmonetary” benefits are among the core goals of most ABS frameworks. 80 The mutual benefits of this model are exemplified by the partnership between Basecamp Research and Costa Rica, which recently awarded the company the ABS Certification from the National Commission for Biodiversity (CONAGEBIO) in recognition of contributions to ABS from DSI. 81
Proactively developing ABS relationships has allowed Basecamp Research to collect a dataset which far exceeds the size, quality, and information content of public DSI collections (Fig. 1). 82 Importantly, the company is able to deploy portable laboratories 83 in each country, giving much greater control and consistency of sample choice, metadata collection, and molecular biology techniques used. These data advantages have translated into a string of state-of-the-art achievements in biological AI 78,79,84 and a wide range of successful industrial partnerships that tackle important environmental issues. 85–87
In many respects the concept of benefit-sharing is more naturally suited to the AI era. Historically, the one-to-one relationship between biological component and product, combined with long development timelines and high attrition rates, meant that benefits to be shared arose only rarely. 88 This in turn meant that providers tended to require a high payout for commercialization of these rare “lottery ticket” samples. In contrast, DSI used to train AI is valuable in aggregate; quantity and quality of data correlate directly with performance. AI also develops products much faster than previous approaches. These advantages combine to offer more predictable payments to a larger number of providers within a shorter period. For the company, they justify both large upfront investments in biodiversity partnerships and much closer public-private relationships within the provider countries than have previously been viable. 89
These early successes demonstrate how proactively engaging in ABS can facilitate development of sustainable data supply chains capable of meeting the demand for large quantities of high-quality training data. Collaboration can simultaneously give researchers (and their AI) access to a far more detailed and comprehensive understanding of biology, while delivering a better deal for communities preserving biodiversity.
Conclusion
The advent of AI in biotechnology brings a watershed moment for the industry. Limited availability of high-quality training data is already slowing the pace of innovation. The nascent big data era in biotechnology presents a natural opportunity to align commercial interests, development goals, and sustainability objectives of stakeholders in the bioeconomy. The growing demand for vast quantities of high-quality genetic data for training large models can only be met by developing sustainable partnership-based data supply chains which actively align incentives and share benefits with the providers of biodiversity.
Companies which continue to depend on the narrowing DSI loophole will face growing legal and reputational challenges and will increasingly suffer from a deficit of quality genetic data. Conversely, companies which proactively and positively engage with the providers of the natural resource that they depend upon can confidently navigate the regulatory minefield, and will be rewarded with access to training data which will enable them to unlock far more powerful technical and biological modeling capabilities than would be possible using only publicly available data. Sustainable and legal access is the key to realizing the technical potential of AI in biology.
Footnotes
Authors’ Contributions
All listed authors contributed to the development of this article.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this work.
