Navigating bottlenecks and trade-offs in genomic data analysis

Wetterstrand, K. A. DNA sequencing costs: data. National Human Genome Research Institute www.genome.gov/sequencingcostsdata (2022).

Preston, J., VanZeeland, A. & Peiffer, D. A. Innovation at Illumina: the road to the $600 human genome. Nature Portfolio https://www.nature.com/articles/d42473-021-00030-9 (2021).

Pennisi, E. A $100 genome? New DNA sequencers could be a ‘game changer’ for biology, medicine. Science 376, 1257–1258 (2022).

Regalado, A. China’s BGI says it can sequence a genome for just $100. MIT Technology Review. https://www.technologyreview.com/2020/02/26/905658/china-bgi-100-dollar-genome/ (2020).

Berger, B., Daniels, N. M. & Yu, Y. W. Computational biology in the 21st century: scaling with compressive algorithms. Commun. ACM 59, 72–80 (2016).

Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature 550, 451–453 (2017).

Zheng, G. Our 1.3 million single cell dataset is ready to download. 10x Genomics. https://www.10xgenomics.com/blog/our-13-million-single-cell-dataset-is-ready-to-download (2022).

Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).

Marçais, G., Solomon, B., Patro, R. & Kingsford, C. Sketching and sublinear data structures in genomics. Annu. Rev. Biomed. Data Sci. 2, 93–118 (2019). This work is an excellent in-depth review of sketching for algorithm designers.

Kurzak, J., Bader, D. A. & Dongarra, J. (eds) Scientific Computing with Multicore and Accelerators (CRC, 2010).

Mernik, M., Heering, J. & Sloane, A. M. When and how to develop domain-specific languages. ACM Comput. Surv. 37, 316–344 (2005).

Van der Auwera, G. A. et al. From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinforma. 43, 11 (2013).

McKenna, A. et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

Banks, E. Run the germline GATK best practices pipeline for $5 per genome. GitHub https://github.com/broadinstitute/gatk-docs/blob/master/blog-2012-to-2019/2018-02-12-Run_the_germline_GATK_Best_Practices_Pipeline_for_%245_per_genome.md (2020).

Illumina. DRAGEN Complete Suite; latest version: 4.0.3. AWS Marketplace. https://aws.amazon.com/marketplace/pp/prodview-ypz2tpzy6f5xq (2022).

Shajii, A., Yorukoglu, D., Yu, Y. W. & Berger, B. Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics 32, i538–i544 (2016).

Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 1–4 (2016).

Stein, L. Genome annotation: from sequence to biology. Nat. Rev. Genet. 2, 493–503 (2001).

Lewis, C. M. Genetic association studies: design, analysis and interpretation. Brief. Bioinforma. 3, 146–153 (2002).

Baldi, P. & Brunak, S. Bioinformatics: The Machine Learning Approach (MIT Press, 2001).

Suhre, K., McCarthy, M. I. & Schwenk, J. M. Genetics meets proteomics: perspectives for large population-based studies. Nat. Rev. Genet. 22, 19–37 (2021).

Allis, D. C. & Jenuwein, T. The molecular hallmarks of epigenetic control. Nat. Rev. Genet. 17, 487–500 (2016).

Moses, L. & Pachter, L. Museum of spatial transcriptomics. Nat. Methods 19, 534–546 (2022).

Burgess, D. J. Spatial transcriptomics coming of age. Nat. Rev. Genet. 20, 317–317 (2019).

Berger, B. & Cho, H. Emerging technologies towards enhancing privacy in genomic data sharing. Genome Biol. 20, 1–3 (2019).

Gürsoy, G. et al. Functional genomics data: privacy risk assessment and technological mitigation. Nat. Rev. Genet. 2021, 1–14 (2021).

Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. Introduction to Algorithms (MIT Press, 2022).

Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. 19, A68–A77 (2015).

Zhang, Z. et al. Uniform genomic data analysis in the NCI Genomic Data Commons. Nat. Commun. 12, 1226 (2021).

BackupWorks.com. LTO Program announces price per gigabyte now less than one penny. BackupWorks.com https://www.backupworks.com/LTO-program-cost-per-gigabyte-milestone.aspx (2022).

100,000 Genomes Project Pilot Investigators. 100,000 genomes pilot on rare-disease diagnosis in health care — preliminary report. N. Engl. J. Med. 385, 1868–1880 (2021).

Matange, K., Tuck, J. M. & Keung, A. J. DNA stability: a central design consideration for DNA data storage systems. Nat. Commun. 12, 1358 (2021).

Jacob, B., Wang, D. & Ng, S. Memory Systems: Cache, DRAM, Disk (Morgan Kaufmann, 2010).

Bonfield, J. K. CRAM 3.1: advances in the CRAM file format. Bioinformatics 38, 1497–1503 (2022).

Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L. & Rice, P. M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38, 1767–1771 (2010).

Hernaez, M., Pavlichin, D., Weissman, T. & Ochoa, I. Genomic data compression. Annu. Rev. Biomed. Data Sci. 2, 19–37 (2019). This work is a canonical review of genomic data compression by many of the authors involved in standardization efforts.

Loh, P. R., Baym, M. & Berger, B. Compressive genomics. Nat. Biotechnol. 30, 627–630 (2012).

Langmead, B. & Nellore, A. Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet. 19, 208–219 (2018). This article covers cloud computing in greater depth, including how it is changing genomic data analysis.

Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

Hie, B., Cho, H., DeMeo, B., Bryson, B. & Berger, B. Geometric sketching compactly summarizes the single-cell transcriptomic landscape. Cell Syst. 8, 483–493 (2019).

Hie, B. et al. Computational methods for single-cell RNA sequencing. Annu. Rev. Biomed. Data Sci. 3, 339–364 (2020). This review discusses some of the newer computational challenges presented by scRNA-seq data.

Lähnemann, D. et al. Eleven grand challenges in single-cell data science. Genome Biol. 21, 1–35 (2020).

Evans, C., Hardin, J. & Stoebel, D. M. Selecting between-sample RNA-seq normalization methods from the perspective of their assumptions. Brief. Bioinforma. 19, 776–792 (2018).

Google. All networking pricing. Google Cloud https://cloud.google.com/vpc/network-pricing (2022).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203 (2018).

Chen, Z. et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 40, 1652–1666 (2011).

Gaziano, J. M. et al. Million veteran program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).

Lin, J. C., Hsiao, W. W. W. & Fan, C. T. Transformation of the Taiwan Biobank 3.0: vertical and horizontal integration. J. Transl. Med. 18, 1–13 (2020).

All of Us Research Program Investigators. The “All of Us” research program. N. Engl. J. Med. 381, 668–676 (2019).

Baker, M. & Buyya, R. Cluster computing: the commodity supercomputer. Softw. Pract. Exp. 29, 551–576 (1999).

Goenka, S. D. et al. Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing. Nat. Biotechnol. 40, 1035–1041 (2022).

Marshall, P., Keahey, K., & Freeman, T. in 2011 11th IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing 205–214 (IEEE, 2011).

Grossman, R. L. The case for cloud computing. IT Professional 11, 23–27 (2009).

Cormode, G. & Garofalakis, M. in Proc. 2007 ACM SIGMOD Int. Conf. Management of Data 281–292 (2007).

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).

Berger, B., Waterman, M. S. & Yu, Y. W. Levenshtein distance, sequence comparison and biological database search. IEEE Trans. Inf. Theory 67, 3287–3294 (2020).

He, D. et al. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat. Methods 19, 316–322 (2022).

Kaminow, B., Yunusov, D. & Dobin, A. STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. Preprint at bioRxiv https://doi.org/10.1101/2021.05.05.442755 (2021).

Sarkar, H., Srivastava, A. & Patro, R. Minnow: a principled framework for rapid simulation of dscRNA-seq data at the read level. Bioinformatics 35, i136–i144 (2019).

Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 1–8 (2018).

Kent, W. J. BLAT — the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).

Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv https://doi.org/10.48550/arXiv.1303.3997 (2013).

Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

Grigoryev, D. N. in Big Data Analysis for Bioinformatics and Biomedical Discoveries (ed. Ye, S. Q.) 15–34 (CRC, 2016).

Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

Endrullat, C., Glökler, J., Franke, P. & Frohme, M. Standardization and quality management in next-generation sequencing. Appl. Transl. Genomics 10, 2–9 (2016).

Yorukoglu, D., Yu, Y. W., Peng, J. & Berger, B. Compressive mapping for next-generation sequencing. Nat. Biotechnol. 34, 374–376 (2016).

Shajii, A. et al. A Python-based programming language for high-performance computational genomics. Nat. Biotechnol. 39, 1062–1064 (2021).

Berger, B., Peng, J. & Singh, M. Computational solutions for omics data. Nat. Rev. Genet. 14, 333–346 (2013). This work is an older review of computational challenges and solutions in bioinformatics; this Review assumes background familiarity with its topics.
