Running ahead of evolution—AI-based simulation for predicting future high-risk SARS-CoV-2 variants

Abstract

The continual emergence of SARS-CoV-2 variants of concern (VOCs) has challenged pandemic control worldwide. To develop effective drugs and vaccines, one needs to efficiently simulate mutations in the SARS-CoV-2 spike receptor-binding domain (RBD) and identify high-risk variants. We pretrain a large protein language model on approximately 408 million protein sequences and construct a high-throughput screening pipeline for predicting binding affinity and antibody escape. As the first work on SARS-CoV-2 RBD mutation simulation, we successfully identify mutations in the RBD regions of 5 VOCs and can screen millions of potential variants in seconds. Our workflow scales to 4096 NPUs with 96.5% scalability and a 493.9× speedup in mixed-precision computing, while achieving a peak performance of 366.8 PFLOPS (34.9% of the theoretical peak) on Pengcheng Cloudbrain-II. Our method paves the way for simulating coronavirus evolution in preparation for future pandemics, which will inevitably occur. Our models are released at https://github.com/ZhiweiNiepku/SARS-CoV-2_mutation_simulation to facilitate future related work.
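The scaling figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes that "scalability" means the measured speedup divided by the ideal linear speedup relative to a smaller baseline run, and that the 34.9% figure is achieved performance divided by the aggregate theoretical peak. The abstract does not state the baseline size or the per-device peak, so both are inferred values, and the function names are hypothetical.

```python
# Figures quoted in the abstract.
NPUS = 4096                 # devices in the largest run
SPEEDUP = 493.9             # reported speedup
SCALING_EFFICIENCY = 0.965  # reported scalability (96.5%)
PEAK_PFLOPS = 366.8         # reported achieved peak performance
FRACTION_OF_PEAK = 0.349    # reported fraction of theoretical peak


def implied_baseline_npus(npus: int, speedup: float, efficiency: float) -> float:
    """Baseline device count b satisfying speedup = efficiency * (npus / b).

    Assumption: scalability is measured speedup over ideal linear speedup.
    """
    ideal_speedup = speedup / efficiency
    return npus / ideal_speedup


def implied_theoretical_peak(achieved_pflops: float, fraction: float) -> float:
    """Aggregate theoretical peak (PFLOPS) implied by the achieved fraction."""
    return achieved_pflops / fraction


baseline = implied_baseline_npus(NPUS, SPEEDUP, SCALING_EFFICIENCY)
theoretical_pflops = implied_theoretical_peak(PEAK_PFLOPS, FRACTION_OF_PEAK)
per_npu_tflops = theoretical_pflops * 1000 / NPUS  # 1 PFLOPS = 1000 TFLOPS

print(f"implied baseline size: {baseline:.2f} NPUs")
print(f"implied theoretical peak: {theoretical_pflops:.1f} PFLOPS")
print(f"implied per-NPU peak: {per_npu_tflops:.1f} TFLOPS")
```

Under these assumptions the numbers are mutually consistent: the speedup and scalability together imply a baseline of roughly 8 NPUs, and the achieved performance implies an aggregate theoretical peak of roughly 1,050 PFLOPS across the 4096 devices.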

Keywords

COVID-19; artificial intelligence; protein language model; mutation simulation; high-risk variants prediction


