Evaluating Protein Language Model Embeddings for Viral Clade Assignment

Abstract

Protein language models (PLMs) provide powerful sequence representations, yet their effectiveness for unsupervised viral clade assignment remains uncertain. In this study, we evaluated embeddings from ProtT5, ProtBert, CARP, and several ESM-2 variants on influenza A/H3N2 hemagglutinin sequences. Using dimensionality reduction (t-SNE, UMAP, PCA, MDS) and clustering with HDBSCAN, we compared PLM embeddings against baseline Hamming distance approaches. Our results show that t-SNE combined with PLM embeddings can recover clade structure, with ProtBert yielding the most stable performance and larger ESM-2 models occasionally achieving lower normalized variation of information scores but with greater variability. These findings suggest that while PLM embeddings capture clade-relevant signals, they also suffer from instability and the loss of site- or nucleotide-specific detail. Future improvements in pooling strategies may enhance their utility for viral surveillance.
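The abstract compares methods by normalized variation of information (VI), where lower values mean closer agreement between the inferred clusters and the clade labels. The sketch below illustrates one way to compute such a score. It is not the authors' implementation. It assumes that VI is normalized by dividing by log(n), that HDBSCAN noise points (label -1) are treated as one ordinary cluster, and that the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (in nats) of a partition given its cluster sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def normalized_vi(labels_a, labels_b):
    """Variation of information between two labelings, scaled to [0, 1].

    VI = 2 * H(A, B) - H(A) - H(B). Dividing by log(n) is an assumed
    normalization; the study may use a different one.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same sequences")
    n = len(labels_a)
    if n < 2:
        return 0.0
    h_a = entropy(Counter(labels_a).values(), n)
    h_b = entropy(Counter(labels_b).values(), n)
    # Joint entropy over (label_a, label_b) pairs.
    h_ab = entropy(Counter(zip(labels_a, labels_b)).values(), n)
    vi = 2.0 * h_ab - h_a - h_b
    # Clamp tiny negative values caused by floating-point rounding.
    return max(vi, 0.0) / math.log(n)


# Hypothetical toy data: clade labels and two candidate clusterings.
clades = ["A", "A", "B", "B"]
matching_clusters = [0, 0, 1, 1]   # same partition, different label names
crossing_clusters = [0, 1, 0, 1]   # splits every clade in half

score_match = normalized_vi(clades, matching_clusters)
score_cross = normalized_vi(clades, crossing_clusters)
```

Because the score depends only on how sequences are grouped, not on the label names themselves, a clustering that reproduces the clades under different names still scores 0, while a clustering that cuts across every clade scores higher.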

Keywords

clustering; influenza A/H3N2; dimensionality reduction; protein language models; viral evolution
