Structure-informed protein language models are robust predictors for variant effects

Adzhubei I, Jordan DM, Sunyaev SR (2013) Predicting functional effect of human missense mutations using polyphen-2. Curr Protoc Hum Genet 76(1):7–20

Google Scholar

Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM (2019) Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16(12):1315–1322

Article CAS PubMed PubMed Central Google Scholar

Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450

Ben Chorin A, Masrati G, Kessel A, Narunsky A, Sprinzak J, Lahav S, Ashkenazy H, Ben-Tal N (2020) Consurf-db: An accessible repository for the evolutionary conservation patterns of the majority of pdb proteins. Protein Sci 29(1):258–267

Article CAS PubMed Google Scholar

Bepler T, Berger B (2021) Learning the protein language: Evolution, structure, and function. Cell Syst 12(6):654–669

Article CAS PubMed PubMed Central Google Scholar

Biswas S, Khimulya G, Alley EC, Esvelt KM, Church GM (2021) Low-n protein engineering with data-efficient deep learning. Nature Methods 18(4):389–396

Article CAS PubMed Google Scholar

Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M: ProteinBERT: A Universal Deep-learning Model of Protein Sequence and Function. https://www.biorxiv.org/content/10.1101/2021.05.24.445464v1

CAGI (2024) The critical assessment of genome interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol 25:53

Article Google Scholar

Chen C, Natale DA, Finn RD, Huang H, Zhang J, Wu CH, Mazumder R (2011) Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation. PloS one 6(4):18910

Article Google Scholar

Chen D, Hartout P, Pellizzoni P, Oliver C, Borgwardt K (2024) Endowing Protein Language Models with Structural Knowledge

Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M et al (2021) Prottrans: Toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 44(10):7112–7127

Article Google Scholar

Esposito D, Weile J, Shendure J, Starita LM, Papenfuss AT, Roth FP, Fowler DM, Rubin AF (2019) Mavedb: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome biol 20(1):1–11

Article Google Scholar

Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, Gal Y, Marks DS (2021) Disease variant prediction with deep generative models of evolutionary data. Nature 599(7883):91–95

Article CAS PubMed Google Scholar

Heinzinger M, Elnaggar A, Wang Y, Dallago C, Nechaev D, Matthes F, Rost B (2019) Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform 20(1):1–17

Article Google Scholar

Hendrycks D, Gimpel K (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415

Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CP, Springer M, Sander C, Marks DS (2017) Mutation effects predicted from sequence co-variation. Nat Biotechnol 35(2):128–135

Article CAS PubMed PubMed Central Google Scholar

Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A et al (2021) Highly accurate protein structure prediction with alphafold. Nature 596(7873):583–589

Article CAS PubMed PubMed Central Google Scholar

Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y et al (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637):1123–1130

Article CAS PubMed Google Scholar

Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos Jr JL, Xiong C, Sun ZZ, Socher R et al (2023) Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 1–8

Marquet C, Heinzinger M, Olenyi T, Dallago C, Erckert K, Bernhofer M, Nechaev D, Rost B (2022) Embeddings from protein language models predict conservation and variant effects. Human Genet 141(10):1629–1647

Article CAS Google Scholar

Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A (2021) Language models enable zero-shot prediction of the effects of mutations on protein function. Adv Neural Inf Process Syst 34:29287–29303

Google Scholar

Ng PC, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11(5):863–874

Article CAS PubMed PubMed Central Google Scholar

Nijkamp E, Ruffolo J, Weinstein EN, Naik N, Madani A (2022) Progen2: exploring the boundaries of protein language models. arXiv preprint arXiv:2206.13517

Notin P, Dias M, Frazer J, Marchena-Hurtado J, Gomez A, Marks DS, Gal Y (2022) Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval https://doi.org/10.48550/arxiv.2205.13760

Pejaver V, Urresti J, Lugo-Martinez J, Pagel KA, Lin GN, Nam H-J, Mort M, Cooper DN, Sebat J, Iakoucheva LM et al (2020) Inferring the molecular and phenotypic impact of amino acid variants with mutpred2. Nat Commun 11(1):5918

Article CAS PubMed PubMed Central Google Scholar

Raimondi D, Tanyalcin I, Ferté J, Gazzo A, Orlando G, Lenaerts T, Rooman M, Vranken W (2017) Deogen2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucl Acids Res 45(W1):201–206

Article Google Scholar

Rao R, Bhattacharya N, Thomas N, Duan Y, Chen P, Canny J, Abbeel P, Song Y (2019) Evaluating protein transfer learning with tape. Adv Neural Inform Process Syst 32:58

Google Scholar

Rao R, Liu J, Verkuil R, Meier J, Canny JF, Abbeel P, Sercu T, Rives A (2021) Msa transformer bioRxiv. https://doi.org/10.1101/2021.02.12.430858

Rao R, Meier J, Sercu T, Ovchinnikov S, Rives A(2020) Transformer protein language models are unsupervised structure learners. Biorxiv, 2020–12

Riesselman AJ, Ingraham JB, Marks DS (2018) Deep generative models of genetic variation capture the effects of mutations. Nat Methods 15(10):816–822

Article CAS PubMed PubMed Central Google Scholar

Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R (2019) Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv https://doi.org/10.1101/622803

Romero PA, Arnold FH (2009) Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 10(12):866–876

Article CAS PubMed PubMed Central Google Scholar

Rubin AF, Min JK, Rollins NJ, Da EY, Esposito D, Harrington M, Stone J, Bianchi AH, Dias M, Frazer J, et al (2021) Mavedb v2: a curated community database with over three million variant effects from multiplexed functional assays. bioRxiv, 2021–11

Shen Y (2022) Predicting protein structure from single sequences. Nat Comput Sci 2(12):775–776

Article PubMed Google Scholar

Shin J-E, Riesselman AJ, Kollasch AW, McMahon C, Simon E, Sander C, Manglik A, Kruse AC, Marks DS (2021) Protein design and variant prediction using autoregressive generative models. Nat Commun 12(1):2403

Article CAS PubMed PubMed Central Google Scholar

Su J, Han C, Zhou Y, Shan J, Zhou X, Yuan F (2023) Saprot: protein language modeling with structure-aware vocabulary. bioRxiv, 2023–10

Sundaram L, Gao H, Padigepati SR, McRae JF, Li Y, Kosmicki JA, Fritzilas N, Hakenberg J, Dutta A, Shon J et al (2018) Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50(8):1161–1170

Article CAS PubMed PubMed Central Google Scholar

Van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CL, Söding J, Steinegger M (2024) Fast and accurate protein structure search with foldseek. Nat Biotechnol 42(2):243–246

Article PubMed Google Scholar

Vig J, Madani A, Varshney LR, Xiong C, Socher R, Rajani NF (2020) Bertology meets biology: interpreting attention in protein language models. arXiv preprint arXiv:2006.15222

Wang S, You R, Liu Y, Xiong Y, Zhu S (2022) Netgo 3.0: Protein language model improves large-scale functional annotations. bioRxiv, 2022–1205519073 https://doi.org/10.1101/2022.12.05.519073

Zhang Z, Xu M, Chenthamarakshan V, Lozano A, Das P, Tang J (2023) Enhancing protein language models with structure-based encoder and pre-training. arXiv preprint arXiv:2303.06275

Zheng Z, Deng Y, Xue D, Zhou Y, Ye F, Gu Q (2023) Structure-informed language models are protein designers. bioRxiv, 2023–02

View original article

HUMAN GENETICS

Like

分享书签

0 0 0 0 0 0 0

More from this channel

Structure-informed protein language models are robust predictors for variant effects

留言 (0)