Abstract
Progress in biotechnology is critically dependent on continued access to new biological “components” (genes, proteins, organisms) from nature. Over recent decades, the way that researchers access and use these components has changed dramatically in response to similarly dramatic developments in technology and regulation. The net effect of these changes has been to severely restrict the availability of high-quality genetic data from biodiversity. This bottleneck limits the potential of machine learning (AI) in biotechnology and is a threat to progress across the industry. We suggest that the inevitable demand for high-quality genetic data to train the next generation of biological AI models has the potential to align the economic and technical interests of the bioeconomy with those of biodiverse “provider” countries and communities. The impending era of big data in biotechnology will therefore require the industry to break its dependence on “digital biopiracy” and embrace sustainable partnership-based data supply chains.
Introduction
Biotechnology is the science of taking biological “components”—genes, proteins, organisms—that have evolved in nature and then adapting, engineering or repurposing them to deliver a solution to a problem. 1 The source of these components is our planet’s rich biodiversity, a complex web of ecological variation and interaction evolved over nearly 3.8 billion years. 2 Innovation in biotechnology is dependent upon easy access to new biological components and the ability to understand the properties of those biological components in new ways.
The multi-trillion dollar global bioeconomy represents one of humanity’s few credible routes towards a clean, sustainable, and healthy future for all. 3 The COVID-19 pandemic and an international demand for increased sustainability have accelerated investment, demand, and publicity. 4 Today, the industry is within reach of revolutionary medicines, diagnostics, foods, fuels, crops, materials, and more.
However, decades of technological and economic success in the bioeconomy have been mirrored by a growing sense of injustice over the distribution of the benefits realized. This has translated into a proliferation of national and international regulations restricting commercial access and use of genetic resources from biodiversity. The industry has responded by scaling back commercial bioprospecting activity in favor of product development based on “digital sequence information” (DSI). 5 This shift has been enabled by technical breakthroughs in synthetic biology, significant reductions in the cost of DNA sequencing and synthesis, and the establishment of publicly available collections of DSI (databases) obtained by academic research projects.
For more than a decade, reliance on these public databases has proved to be an imperfect but serviceable source of biological components to fuel progress in biotechnology. In the meantime, the gears of the international legal system have slowly ground towards closing this “digital loophole.”
However, the industry now faces yet another tectonic shift. Biotechnology has progressed beyond the techniques based on human insight and limited by the throughput of assaying laboratories, and is on the threshold of a new period of progress characterized by increasing reliance on data-hungry artificial intelligence (AI) models. Public DSI databases have serious limitations as a source of training data for this new generation of models, and the lack of a sustainable alternative source of large quantities of high-quality genetic data from biodiversity now threatens to limit the pace of innovation and the potential achievements of the impending era of biological AI.
In this paper, we propose that this development presents a natural opportunity for mutually beneficial realignment of the interests of biotechnology companies and biodiversity stakeholders. By adopting a more equitable and inclusive approach to commercial biodiscovery, companies can access sustainable data supply chains capable of meeting the demand for large quantities of high-quality training data. We propose, based on real world case studies, that collaboration can simultaneously give researchers (and their AI) access to a far more detailed and comprehensive understanding of biology, while ensuring that biodiverse countries and communities see the benefits of this progress and are rewarded for their key role in the bioeconomy.
A Brief History of the Relationship Between Biodiversity and Biotechnology
Biotechnology has always depended on access to genetic biodiversity
Biotechnology as an industry was born in the 1970s, enabled by significant advances in both molecular biology and genetic engineering. For many, the first true biotechnology company was Genentech, founded in 1976, 6 which successfully expressed a human gene in bacteria and paved the way for the commercial production of human insulin and other pharmaceuticals through genetic engineering. 7 Genentech was quickly joined by a number of other biotechnology companies across a wide range of medical, agricultural, and industrial applications.
In these early years, DNA sequencing and synthesis was too complex, slow, and expensive to be used at any scale. Instead, DNA was obtained from physical biological samples by lysis (purification of the debris obtained from breaking open cells), 8 by recombinant DNA technology (the use of restriction enzymes to “cut” sequences from within the genome of an organism), 9 and from 1983 onwards by newly developed polymerase chain reaction techniques. 10 These techniques and their dependence on access to physical samples from biodiversity would continue to characterize biotechnology for nearly half a century. Companies inevitably turned to biodiversity to collect physical samples from the natural environment as the starting point for engineering in the wet lab. 11
Tensions between the “users” and the “providers” of biodiversity
However, this technical and commercial success was focused primarily in the industrialized “Global North,” 12 whereas biodiversity rich countries providing the underlying resource tended to be in the “Global South.” 13 As the economic potential of this new industry became apparent, tensions arose between “user” and “provider” countries. The term “biopiracy” emerged in the late 1980s to describe the practice of “exploiting naturally occurring genetic material while failing to pay fair compensation to the community from which it originates.” 14 While the practice itself is centuries old, 15 the pejorative term was coined in the late 20th century in the context of the burgeoning international environmental movement and growing pressure from “provider” countries which felt their genetic resources had been exploited by “user” countries in the Global North. 16
These flames were fanned by a number of particularly high-profile incidents of commercialization without consent throughout the 1980s and 1990s: the development of biopesticides from the Neem plant native to India and Nepal by US chemical company WR Grace; 17 commercialization of a dietary product based on the Hoodia plant of the Kalahari Desert licensed to UK-based company Phytopharm and later to Anglo-Dutch Unilever; 18 and the development of the multi-billion dollar blood pressure drug captopril by US-based company Squibb Pharmaceuticals based on venom taken from a Brazilian viper. 19
Regulation and restricted access
In this context, the 1992 United Nations Convention on Biological Diversity (the CBD or Rio Convention after the city in which it was signed) included as one of three main objectives the “fair and equitable sharing of the benefits arising from use of genetic resources.” 20 To this end Article 15 of the CBD made provision recognizing the sovereign rights of states over their natural resources and the right to determine access to genetic resources, 21 and provided for access to be subject to mutually agreed terms (MAT) 22 and on the basis of prior informed consent (PIC) given by the provider state. 23 The 1990s and early 2000s saw a small number of headline commercial bioprospecting agreements, most notably the Merck-INBio (Costa Rica) agreement in 1991 24 and a singular earnest attempt to engage in ethical bioprospecting in the spirit of the CBD at scale by San Diego-based company Diversa. 25
However, the optimism which followed the agreement of the CBD quickly faded. It was not until 2010 that a framework for this sharing of benefits from the utilization of genetic resources in a fair and equitable way was finally agreed following the tenth meeting of the conference of the parties to the CBD in Nagoya, Japan. Under the “Nagoya” Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilization (ABS), parties are required to ensure that domestic users of genetic resources comply with the relevant legislation of the country providing those resources, generally by obtaining PIC before accessing the resource and doing so in accordance with MAT detailing how any subsequent benefits will be shared. 26 On 12 October 2014 the Nagoya Protocol finally entered into force. Members duly adopted national legislation to implement this international agreement in domestic law. 27 Today, 141 28 countries have signed up as Parties to the Protocol (with the notable exception of the United States, a significant loss given that the country accounts for more than 20% of worldwide use of genetic resources). 29 However, in the two decades which elapsed between the 1992 Earth Summit in Rio de Janeiro and the entry into force of the Nagoya Protocol, biotechnology had changed almost beyond recognition. Technical breakthroughs in synthetic biology together with significant reductions in the cost of DNA sequencing and synthesis meant that companies were no longer dependent on physical samples as the foundation for lab-based research, but could instead base all of their research on genetic data in the form of digital sequence information (DSI).
DSI is of course obtained by sequencing of a physical biological sample, and so ultimately requires the kind of bioprospecting activity which under the Nagoya Protocol would be subject to requirements for MAT and PIC. However, by the time implementing legislation came into force the availability of genetic data in online databases compiled from academic research enabled companies to avoid the need to obtain genetic data directly from biodiversity. Academic researchers have tended to be less deterred by regulation, and have often been subject to less onerous access requirements by provider countries than would be applied to commercial operations. 30 In the spirit of open access and collaboration, these academic researchers upload their results in data form to public databases. 31 In 2010 (the year the Nagoya Protocol was signed) one of the largest of these public databases, GenBank, surpassed 130 million DNA sequences. 32 It has now become widespread practice for commercial companies and academic researchers alike to use DSI from these collections; there are estimated to be between 10 and 15 million unique users of these databases worldwide. 29 For companies themselves, conducting all research using large online databases has had many benefits, not least the ease of access to samples from around the world, and the lack of onerous access and benefit sharing obligations which disincentivize bioprospecting.
The net effect of this is that, since 2015, there has been a dramatic reduction in commercial bioprospecting of the type envisaged by the Nagoya Protocol; instead, companies have exploited what has been described as a “digital loophole” by drawing on DSI published in academic collections. Barely a year after it entered into force, the Nagoya Protocol was dismissed as akin to “regulat[ing] VCR technology in the era of YouTube.” 33
Unsurprisingly, this development engendered a debate as to whether genetic data falls within the scope of “genetic resources” protected under the Nagoya Protocol. Just as predictably, this debate has polarized along the lines of “users” and “providers” of genetic data. 34 Many “provider” countries (including Brazil, 35 India, China, South Africa, and Costa Rica) have adopted national legislation governing access to genetic data notwithstanding the international disagreement as to whether this falls under the scope of Nagoya. 36 MAT and PIC associated with collection of physical samples may also cover the use of resulting genetic data; EU Commission Guidance confirms that although genetic data could be regarded as outside the scope of EU regulations implementing the Nagoya Protocol, companies are nevertheless required to respect any mutually agreed terms which restrict the use and/or publication of this data. 37 The collections widely used for commercial research do not carry out any checks to confirm whether academic contributors have complied with national legislation, or whether any MAT and/or PIC under which the researchers accessed the physical genetic resource permits upload to a public database (still less whether it covers possible uses by companies who may access it through the database and go on to use the sequence data in commercial research). For all these reasons, the commercial utilization of genetic data from public databases has been described more bluntly as “digital biopiracy.” 38
In many respects, the biotechnology industry has repeated the cycle which culminated in the adoption of the CBD and the Nagoya Protocol, and there is now growing international impetus to regulate access to genetic data from biodiversity. In December 2022, the CBD adopted a formal decision agreeing that “the benefits from the use of digital sequence information [DSI] … should be shared fairly and equitably,” that the “distinctive practices in its use require a distinctive solution for benefit-sharing,” and agreeing to develop such a solution. 39 More recently still, the draft agreement under the UN Convention on the Law of the Sea on the conservation and sustainable use of marine biological diversity of areas beyond national jurisdiction, agreed on 4 March 2023, expressly refers to “marine genetic resources and digital sequence information [DSI] on marine genetic resources” 40 and imposes a variety of requirements including notification of activity via a clearing-house mechanism 41 and requiring that the benefits from these activities must be “shared in a fair and equitable manner.” 42
Public Databases Are a Poor Foundation for Biological AI
There is little doubt that biological AI is the future of innovation in the biosciences. 43 Genomics is one of the areas with the greatest potential; as all of life runs on the common “coding language” of DNA, machine learning techniques are perfectly suited to annotation, prediction, and generation tasks here.
“Protein AI”—the application of sophisticated models to single protein-encoding genes—has seen some of the most impressive progress recently. The first transformative breakthrough is widely considered to be AlphaFold2, 44 a protein folding model released by Google’s DeepMind in 2020. Described as “the most important achievement in AI, ever,” 45 AlphaFold2 is considered to have “solved” the protein folding problem that lies at the core of understanding biology. 46
The recent rapid progress in “generalizable” biological AI is built upon the public databases of biodiversity that are under scrutiny from the UN and biodiverse provider countries. One of the largest and most curated public protein databases, UniProt, 47 forms the basis of all leading protein structure prediction models (e.g., AlphaFold2, ESMFold, 48 OpenFold, 49 and RoseTTAFold 50 ), protein-ligand models (e.g., NeuralPlexer 51 ), protein function models (e.g., CLEAN, 52 ProteInfer, 53 and ProteinVec 54 ) and protein generation models (e.g., Chroma, 55 ProteinGAN, 56 ProGen, 57 ProtGPT2, 58 and EvoDiff 59 ). Beyond “protein AI,” the cutting edge of this field is now the development of “genomic language models,” capable of generating genome-length stretches of DNA using techniques that are not dissimilar to the way that OpenAI’s ChatGPT generates poetry. 60,61 Across all these applications, the same fundamental problem arises: the more advanced the model, the higher the quality and quantity of training data needed, and the lower the availability of suitable data for this training. 62
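The next-token principle behind these genomic language models can be illustrated with a deliberately minimal sketch: a toy bigram model over DNA bases in plain Python. This illustrates only the generative principle, not the architecture of any model cited above; the training sequences and function names are invented for the example.

```python
import random
from collections import defaultdict

def train_bigram(sequences):
    """Count which base tends to follow which: the crudest possible
    'language model' of DNA (real models use transformer architectures)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, start="A", length=12, seed=0):
    """Sample one base at a time, conditioned on the previous base:
    the same 'predict the next token' loop ChatGPT uses for words."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        # every base in the toy training data has an observed successor
        bases, weights = zip(*counts[out[-1]].items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)

model = train_bigram(["ATGGCGTTAA", "ATGCCGTTGA"])
print(generate(model))
```

Real genomic language models replace the bigram table with billions of learned parameters, but the loop is the same, which is why their fluency is bounded by the breadth of sequence data they were trained on.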
While the AlphaFold2 algorithm marked a major breakthrough, its performance correlates strongly with the quality and availability of training data. The fundamental correlation between training data and performance cannot be overstated; AlphaFold2 is outperformed by a margin of up to 6× by the recently released BaseFold, a structure prediction algorithm that uses the same model architecture as AlphaFold2, yet is trained on a database several times larger and more diverse than UniProt. 63 BaseFold’s improvements over AlphaFold2 are particularly significant in the “orphan protein problem”—situations where very few similar proteins have ever been found in nature. It follows that similar improvements in model performance would be expected by improving the training datasets for all the models mentioned previously in this paper.
These improvements are seen because all machine learning (or AI) models are “just” sophisticated pattern recognition tools that work by ingesting and calculating statistical representations of vast datasets of examples. The outputs of these models are therefore highly dependent on the information that they ingest. Whether “exploiting” information they have seen or “exploring” beyond it using learned statistical patterns, their performance is critically dependent on the quality, quantity, and information content of training data.
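This dependence on ingested examples can be made concrete with a minimal sketch, assuming nothing beyond the Python standard library: a toy nearest-neighbour “model” that classifies a DNA sequence purely by pattern overlap with its training data. The sequences and labels are invented for the example; the point is that the model can only ever echo what its training set contains.

```python
from collections import Counter

def kmer_profile(seq, k=3):
    """Count overlapping k-mers: the 'statistical representation'
    a simple model ingests from each example."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def similarity(a, b):
    """Number of shared k-mers between two profiles (a crude pattern match)."""
    return sum((a & b).values())

def classify(query, training):
    """1-nearest-neighbour: predict the label of the most similar
    training sequence. With two examples, only two answers are possible."""
    profile = kmer_profile(query)
    return max(training, key=lambda item: similarity(profile, kmer_profile(item[0])))[1]

training = [
    ("ATGGCGGCGGCG", "GC-rich"),
    ("ATGATATATATA", "AT-rich"),
]
print(classify("GCGGCGGCGTTT", training))  # → GC-rich
```

A query unlike anything in the training set matches nothing well, which is the toy-scale analogue of the “orphan protein problem”: no amount of architectural cleverness compensates for patterns the model has never ingested.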
Limited data means limited AI performance
In this context, the public databases that offered an imperfect but serviceable source of biological components at a time when the biotechnology industry relied on human scientists learning from “one sequence at a time” are unfit as training data for machines that are able to learn from the entire database at once.
These databases continue to represent a mechanism for academic collaboration rather than a concerted effort to explore biodiversity. This, combined with the lack of bioprospecting activity in the post-Nagoya regulatory landscape, means that the growth of these databases over the past decade has been piecemeal and relatively sluggish by comparison with advances in sequencing technology. Today GenBank, which as noted above contained 130 million sequences when the Nagoya Protocol was signed, is less than twice that size at the time of writing 64 despite a >100× reduction in the cost of sequencing over the same period. 65 From a size and diversity perspective, there are estimated to be over 1 trillion species on Earth, 66 yet half of all the microbial genomes available publicly come from just 12 species. 67 Additionally, likely due in part to the regulatory challenges, the majority of samples to date have been collected from the US, Europe, and China. 68 Because public databases are collated records of academic studies over several decades, each study, and by extension each entry in the database, has a different purpose, uses different molecular biology techniques, collects different metadata, and adopts different data annotation pipelines.
Meanwhile, the use of these public collections for training models carries with it growing legal uncertainty and reputational exposure. Major AI companies in other domains, from music 69 and images 70 to poster child ChatGPT 71 (including OpenAI, Anthropic, and Stability AI), are facing high-profile legal proceedings in a tranche of test claims which challenge their use of training data without the consent of the parties who originally generated that data. It is increasingly implausible to hope that the use of public DSI collections to train AI models will escape similar scrutiny as legal frameworks for access to training data crystallize across other applications, as the international political pressure to regulate DSI gathers momentum, and as the achievements of biological AI inevitably attract growing public attention.
Better data will unlock new model capabilities
The inconsistent collection and curation of the public datasets not only means that they lack the size, quality, diversity, and context (information content) required for existing biological AI models to reach their full potential; more importantly, it means they offer very limited support for significantly more advanced biological AI models. The recent advances in machine learning outside of the biosciences have arisen thanks to significant improvements in model architectures and compute power. The most powerful of these new architectures have been those capable of understanding context in large datasets, notably Google’s Transformer architecture, first published in 2017, which forms the basis of OpenAI’s ChatGPT. 72
Therefore, it is the absence of “biological context” (describing the environment the genes, proteins, and organisms evolved in) in the public data resources that is likely to be the most significant limitation for the field. Capturing and making these additional signals “machine readable” on a global scale will be the key to unlocking the next levels of success in biological AI.
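What “machine readable” biological context might look like in practice can be sketched as a simple record schema. This is a hypothetical illustration, not a published standard: the field names (sample site, pH, temperature, salinity) are assumptions chosen to show how environmental signals could travel alongside a sequence.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ContextualRecord:
    """A hypothetical context-rich sequence record: the sequence plus
    the environment it evolved in, captured at collection time."""
    sequence: str        # the DSI itself
    sample_site: str     # free-text description of the habitat
    ph: float            # acidity of the sample environment
    temperature_c: float # temperature at collection, in Celsius
    salinity_psu: float  # salinity in practical salinity units

record = ContextualRecord(
    sequence="ATGGCGTTAA",
    sample_site="hot spring outflow",
    ph=3.1,
    temperature_c=82.0,
    salinity_psu=0.4,
)
# Serialize so the context travels with the sequence into a training set
print(json.dumps(asdict(record)))
```

Pairing each sequence with structured context of this kind at collection time is what would allow a context-aware model to learn how environment shapes sequence, rather than seeing each sequence in isolation as today’s public records mostly require.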
Biological AI Needs a Sustainable Data Supply
Because the CBD has severely curtailed commercial bioprospecting activity, and because the bioeconomy has become so dependent on public databases, many in the scientific community consider the CBD’s perceived “limits” on biodiversity research to be unacceptable. 73 Many leading scientific and industry organizations support the notion of benefit sharing for DSI in principle, but not in practice. 74 Proposals to overcome the impasse range from the African Union’s suggestion of a flat 1% levy on retail sales of all products sold in developed countries, 75 to global multilateral benefit-sharing frameworks, 76 and a small number of practical case studies of proactive mutually beneficial benefit sharing from genetic data (notably from Kew Gardens in the UK). 77 As set out above many nations have already adopted domestic legislation which require benefit sharing from commercial use of genetic data. 36
Here, we contribute an alternative perspective from our work at Basecamp Research Ltd., a private biotechnology company based in London, UK, specializing in building foundational datasets for biological AI and developing the next generation of models. The company has successfully built bilateral access and benefit sharing relationships with biodiversity stakeholders around the world, enabling it to collect genetic data from biodiversity at a pace, scale, and quality not previously attempted. Access to this data has in turn enabled the company to outperform all foundational AI models across protein function, structure, and controllable generation. 63,78,79 Experience at Basecamp Research demonstrates that the demand for data to unlock the next generation of AI models in biology presents an opportunity for mutually beneficial realignment of the interests of biotechnology companies and biodiversity stakeholders.
Today, Basecamp Research has access and benefit sharing relationships in more than 20 countries worldwide. Importantly, the company commits to ABS even in countries that do not legally require it, and it commits to ABS from genetic data and AI-generated sequences generated from that genetic data. In less than 4 years, Basecamp Research has invested millions of dollars into biodiversity partnerships and has already paid back royalties to a wide range of biodiversity stakeholders. In addition to the monetary benefits (investment, employment, royalties), Basecamp Research shares nonmonetary benefits including training, transfer of technology, and research collaboration.
By working with local partners and scientists, the company benefits from the experience and passion of experts who have a deep knowledge and understanding of the local biodiversity. For the company, training in-country partners in cutting-edge molecular biology techniques ensures consistent, high-quality data collection, and funding the construction of lab facilities is more effective, efficient, and sustainable than operating “helicopter-science” biodiscovery expeditions. This forms the basis of a lasting relationship between local scientists and Basecamp Research; a particular advantage that opens the possibility of returning to sample sites which prove to be of particular scientific interest, an important benefit given that valuable therapeutic enzyme classes can be over 1000 times more prevalent in particular samples.
Beyond the partnership with Basecamp Research, these skills and resources are highly valued, facilitating participation in the bioeconomy and more advanced conservation techniques. Molecular biology skills, portable laboratories, and increased “bioliteracy” are also the basis of projects that support human health, such as disease monitoring and tracking of COVID-19.
These partnerships, therefore, lay the groundwork for a long-term shift to redress the technological imbalance between “user” and “provider” countries, and create economic and social value directly from biodiversity. This incentivizes and rightly rewards those responsible for preserving this irreplaceable resource; it is for good reason that these “nonmonetary” benefits are among the core goals of most ABS frameworks. 80 The mutual benefits of this model are exemplified by the partnership between Basecamp Research and Costa Rica, which recently awarded the company the ABS Certification from the National Commission for Biodiversity (CONAGEBIO) in recognition of contributions to ABS from DSI. 81
Proactively developing ABS relationships has allowed Basecamp Research to collect a dataset which far exceeds the size, quality, and information content of public DSI collections (Fig. 1). 82 Importantly, the company is able to deploy portable laboratories 83 in each country, giving much greater control and consistency of sample choice, metadata collection, and molecular biology techniques used. These data advantages have translated into a string of state-of-the-art achievements in biological AI 78,79,84 and a wide range of successful industrial partnerships that tackle important environmental issues. 85–87
In many respects the concept of benefit-sharing is more naturally suited to the AI era. Historically, the one-to-one relationship between biological component and product, combined with long development timelines and high attrition rates, meant that benefits to be shared arose only rarely. 88 This in turn meant that providers tended to require a high payout for commercialization of these rare “lottery ticket” samples. In contrast, DSI used to train AI is valuable in aggregate; quantity and quality of data correlate directly with performance. AI also develops products much faster than previous approaches. These advantages combine to offer more predictable payments to a larger number of providers within a shorter period. For the company, they justify both large upfront investments in biodiversity partnerships and much closer public-private relationships within the provider countries than have previously been viable. 89
These early successes demonstrate how proactively engaging in ABS can facilitate development of sustainable data supply chains capable of meeting the demand for large quantities of high-quality training data. Collaboration can simultaneously give researchers (and their AI) access to a far more detailed and comprehensive understanding of biology, while delivering a better deal for communities preserving biodiversity.
Conclusion
The advent of AI in biotechnology brings a watershed moment for the industry. Limited availability of high-quality training data is already slowing the pace of innovation. The nascent big data era in biotechnology presents a natural opportunity to align commercial interests, development goals, and sustainability objectives of stakeholders in the bioeconomy. The growing demand for vast quantities of high-quality genetic data for training large models can only be met by developing sustainable partnership-based data supply chains which actively align incentives and share benefits with the providers of biodiversity.
Companies which continue to depend on the narrowing DSI loophole will face growing legal and reputational challenges and will increasingly suffer from a deficit of quality genetic data. Conversely, companies which proactively and positively engage with the providers of the natural resource that they depend upon can confidently navigate the regulatory minefield, and will be rewarded with access to training data which will enable them to unlock far more powerful technical and biological modeling capabilities than would be possible using only publicly available data. Sustainable and legal access is the key to realizing the technical potential of AI in biology.
Footnotes
Authors’ Contributions
All listed authors contributed to the development of this article.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this work.
