EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations

Chen KM, Wong AK, Troyanskaya OG, Zhou J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat Genet. 2022;54:1–10.

Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, Assael Y, Jumper J, Kohli P, Kelley DR. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18(10):1196–203.

Zhou J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat Genet. 2022;54(5):725–34.

Hoffman GE, Bendl J, Girdhar K, Schadt EE, Roussos P. Functional interpretation of genetic variants using deep learning predicts impact on chromatin accessibility and histone modification. Nucleic Acids Res. 2019;47(20):10597–611.

Dey KK, Van de Geijn B, Kim SS, Hormozdiari F, Kelley DR, Price AL. Evaluating the informativeness of deep learning annotations for human complex diseases. Nat Commun. 2020;11(1):1–9.

Koo PK, Ploenzke M. Improving representations of genomic sequence motifs in convolutional networks with exponential activations. Nat Mach Intell. 2021;3(3):258–66.

Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet. 2021;53(3):354–66.

Koo PK, Majdandzic A, Ploenzke M, Anand P, Paul SB. Global importance analysis: an interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput Biol. 2021;17(5):e1008925.

de Almeida BP, Reiter F, Pagani M, Stark A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet. 2022;54(5):613–24.

Horton CA, Alexandari AM, Hayes MG, Schaepe JM, Marklund E, Shah N, Aditham AK, Shrikumar A, Afek A, Greenleaf WJ, et al. Short tandem repeats recruit transcription factors to tune eukaryotic gene expression. Biophys J. 2022;121(3):287–8.

Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):1–48.

Fort S, Brock A, Pascanu R, De S, Smith SL. Drawing multiple augmentation samples per image during training efficiently decreases test error. 2021. arXiv preprint arXiv:2105.13343.

Zhu S, An B, Huang F. Understanding the generalization benefit of model invariance from a data perspective. Adv Neural Inf Process Syst. 2021;34:4328–41.

Geiping J, Goldblum M, Somepalli G, Shwartz-Ziv R, Goldstein T, Wilson AG. How much data are augmentations worth? An investigation into scaling laws, invariance, and implicit regularization. 2022. arXiv preprint arXiv:2210.06441.

Puli A, Zhang LH, Oermann EK, Ranganath R. Out-of-distribution generalization in the presence of nuisance-induced spurious correlations. 2021. arXiv preprint arXiv:2107.00520.

Zhou H, Shrikumar A, Kundaje A. Towards a better understanding of reverse-complement equivariance for deep learning models in genomics. In: Machine Learning in Computational Biology, PMLR; 2022. p. 1–33.

Toneyan S, Tang Z, Koo PK. Evaluating deep learning for predicting epigenomic profiles. Nat Mach Intell. 2022;4:1–13.

Kelley DR. Cross-species regulatory sequence activity prediction. PLoS Comput Biol. 2020;16(7):e1008050.

Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nat Rev Genet. 2009;10(4):241–51.

Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26(7):990–9.

Shigaki D, Adato O, Adhikari AN, Dong S, Hawkins-Hooker A, Inoue F, Juven-Gershon T, Kenlay H, Martin B, Patra A, Penzar DD, Schubach M, Xiong C, Yan Z, Boyle AP, Kreimer A, Kulakovskiy IV, Reid J, Unger R, Yosef N, Shendure J, Ahituv N, Kircher M, Beer MA. Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay. Hum Mutat. 2019;40(9):1280–91.

Lu AX, Lu AX, Moses A. Evolution is all you need: phylogenetic augmentation for contrastive learning. 2020. arXiv preprint arXiv:2012.13475.

Kryukov GV, Schmidt S, Sunyaev S. Small fitness effect of mutations in highly conserved non-coding regions. Hum Mol Genet. 2005;14(15):2221–9.

Crawshaw M. Multi-task learning with deep neural networks: a survey. 2020. arXiv preprint arXiv:2009.09796.

Zbontar J, Jing L, Misra I, LeCun Y, Deny S. Barlow twins: self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, PMLR; 2021. p. 12310–12320.

Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, Bengio Y. Learning deep representations by mutual information estimation and maximization. 2018. arXiv preprint arXiv:1808.06670.

Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. 2018. arXiv preprint arXiv:1810.04805.

Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, et al. Population based training of neural networks. 2017. arXiv preprint arXiv:1711.09846.

Liaw R, Liang E, Nishihara R, Moritz P, Gonzalez JE, Stoica I. Tune: a research platform for distributed model selection and training. 2018. arXiv preprint arXiv:1807.05118.

Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: Large-scale machine learning on heterogeneous systems. 2015. https://www.tensorflow.org/. Accessed 31 Oct 2022.

Bradbury J, Frostig R, Hawkins P, Johnson MJ, Leary C, Maclaurin D, Necula G, Paszke A, VanderPlas J, Wanderman-Milne S, Zhang Q. JAX: Composable transformations of Python+NumPy programs. http://github.com/google/jax. Accessed 31 Oct 2022.

Lee NK, Toneyan S, Tang Z, Koo PK. EvoAug Data [Data set]. Zenodo. 2022. https://doi.org/10.5281/zenodo.7265991. Accessed 31 Oct 2022.

Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, PMLR; 2015. p. 448–456.

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

Luo Y, Hitz BC, Gabdank I, Hilton JA, Kagda MS, Lam B, Myers Z, Sud P, Jou J, Lin K, et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020;48(D1):882–9.

Kingma D, Ba J. Adam: a method for stochastic optimization. 2014. arXiv preprint arXiv:1412.6980.

Koo PK, Ploenzke M. Deep learning for inferring transcription factor binding sites. Curr Opin Syst Biol. 2020;19:16–23.

Castro-Mondragon JA, Riudavets-Puig R, Rauluseviciute I, Lemma RB, Turchi L, Blanc-Mathieu R, Lucas J, Boddie P, Khan A, Pérez NM, Fornes O, Leung TY, Aguirre A, Hammal F, Schmelter D, Baranasic D, Ballester B, Sandelin A, Lenhard B, Vandepoele K, Wasserman WW, Parcy F, Mathelier A. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2021;50(D1):165–73.

Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8(2):1–9.

Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30. https://papers.nips.cc/paper_files/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html.

Kokhlikyan N, Miglani V, Martin M, Wang E, Alsallakh B, Reynolds J, Melnikov A, Kliushkina N, Araya C, Yan S, Reblitz-Richardson O. Captum: a unified and generic model interpretability library for PyTorch. 2020. arXiv preprint arXiv:2009.07896.

Tareen A, Kinney JB. Logomaker: beautiful sequence logos in Python. Bioinformatics. 2020;36(7):2272–4.

Majdandzic A, Rajesh C, Koo PK. Statistical correction of input gradients for black box models trained with categorical input features. 2022. bioRxiv preprint. biorxiv.org/content/10.1101/2022.04.29.490102v2.

Lee NK, Toneyan S, Tang Z, Koo PK. EvoAug reproducibility code. GitHub. 2022. https://github.com/p-koo/evoaug_analysis. Accessed 31 Oct 2022.
