The role of single-cell genomics in human genetics

Here, we describe two prospective applications that take advantage of the latest developments and promises of sc-seq to address current challenges in human genetics. The first relates to our quest to understand the non-coding genome, specifically the functional annotation of cis-regulatory elements. Here, sc-seq in combination with functional genomics allows thousands of functional experiments to be performed in a single, multiplexed assay (prospective application 1). The second addresses the challenges spawned by next-generation sequencing (NGS), namely the thousands of rare variants in each genome, the vast majority of which are located in the >98% of the human genome that is non-coding.132 Here, the richness of the sc-seq data combined with machine learning models helps prioritise variants based on biological characteristics, such as gene expression and chromatin accessibility (prospective application 2).

Prospective application 1: single-cell functional genomics—annotating the non-coding genome

Delineating the function of the 98% of the genome that is non-coding and deciphering the pathogenicity of the variants identified in these non-coding regions is the central challenge of human genetics in the next decade. However, we are far from this goal, because: first, the ‘regulatory code’ of the non-coding sequence is still unknown; second, an important aspect of physiological gene regulation by cis-regulatory elements and enhancers lies in the 3D architecture of the genome; and lastly, the number of non-coding variants per generation is so high that traditional functional tests have reached their limit. This underscores the urgent need for high-throughput functional screening technologies to study the endogenous functions of the non-coding genome. Until recently, annotating non-coding regions in their native context (cf. their identification in plasmids, eg, STARR-seq133) faced a trade-off between the number of regions screened simultaneously and the complexity of the phenotypes assayed.134 On the one hand, pooled genetic screens, for example, with CRISPR/Cas9, have mostly been associated with hypothesis-driven assays of simple phenotypes (eg, proliferation/survival screens, reporter expression).135–138 This precludes the systematic investigation of enhancer-driven phenotypes. On the other hand, the unbiased phenotype screening offered by sc-seq has mostly been applied to investigate a handful of mutations or diseases, often in a non-multiplexed fashion.139 Here, the effect of one or a few genomic perturbation(s) is assayed separately or, at best, in an arrayed format, which requires a separate experiment per enhancer and severely restricts scalability.

The integration of sc-seq with CRISPR/Cas9 has resulted in a new set of technologies that eliminate this trade-off, enabling the screening of non-coding (as well as coding) regions and their phenotypic consequences in a multiplexed fashion, at an unprecedented scale. Gasperini et al used this approach to functionally characterise ~6000 human candidate enhancers.19 They used a pool of CRISPR/Cas9-based genetic perturbations to inactivate these candidate enhancers (CRISPRi) in a collection of ~250 000 cells, followed by sc-seq to measure the functional consequence in terms of the expression of >10 560 genes. In total, they could interrogate ~80 000 potential cis-regulatory relationships in a single experiment. The innovation of the technology lies in the ability to identify the perturbation(s) present (ie, the enhancer targeted) in each cell from the sc-seq data, such that the effect of the perturbations on the omics-profile can be quantified. This is achieved by including a transcribable barcode (guide barcode, unique guide index or the gRNA sequence itself) along with every CRISPR guide RNA (gRNA), which gets integrated into the genome (figure 4A,B). Thus, the read-out of the guide barcode along with the rest of the cell’s sequencing library by sc-seq makes it possible to associate the omics-profile of every cell with the perturbation present in that cell (figure 4C). In effect, the technology transforms each perturbed cell into a patient harbouring (non-)coding variant(s), from an expression quantitative trait loci (eQTL) perspective.
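The barcode read-out described above can be illustrated with a minimal Python sketch. It assumes that guide-barcode UMI counts have already been tabulated per cell; the function names, the `min_umis` cutoff and the toy counts are hypothetical and are not taken from any of the cited studies.

```python
from collections import defaultdict


def assign_guides(barcode_umis, min_umis=5):
    """Map each cell to the guide barcodes detected at or above a UMI cutoff.

    barcode_umis: iterable of (cell_id, guide_id, umi_count) tuples.
    min_umis: hypothetical cutoff; a real screen would tune this per dataset.
    """
    guides_per_cell = defaultdict(set)
    for cell_id, guide_id, umi_count in barcode_umis:
        if umi_count >= min_umis:
            guides_per_cell[cell_id].add(guide_id)
    return dict(guides_per_cell)


def cells_per_guide(guides_per_cell):
    """Invert the mapping: guide -> set of cells that carry it."""
    cells = defaultdict(set)
    for cell_id, guides in guides_per_cell.items():
        for guide_id in guides:
            cells[guide_id].add(cell_id)
    return dict(cells)


# Toy counts for illustration only (not real data).
toy_counts = [
    ("cell_1", "gRNA_enh1", 12),
    ("cell_1", "gRNA_enh2", 1),   # below the cutoff, treated as background
    ("cell_2", "gRNA_enh2", 9),
    ("cell_3", "gRNA_enh1", 7),
    ("cell_3", "gRNA_enh2", 6),   # a cell carrying two perturbations
]
assignments = assign_guides(toy_counts)
by_guide = cells_per_guide(assignments)
```

Grouping cells by guide in this way yields, for each perturbation, the set of cells whose omics-profiles can be compared against all other cells.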

Figure 4

Single-cell sequencing (sc-seq) CRISPR screening of cis-regulatory elements. (A) To perturb the various (non-coding) genomic regions of interest, a CRISPR library is created by inserting guide RNAs (gRNAs) targeting these regions, along with barcodes unique to each gRNA, a fluorescent protein (eg, GFP, green fluorescent protein) and the necessary promoters into a (lentiviral) vector. Perturbing non-coding regions, such as enhancers, followed by sc-transcriptome-seq can help establish enhancer-gene relationships. The method can also be used to reveal the functions of unannotated regions (not shown), which can help prioritise variants in those regions. (B) Transfection/transduction is followed by the integration of the gRNA library (and the CRISPR/Cas9 machinery; not shown) into the genome of the cells. The number of gRNAs (and therefore perturbations) per cell can be tuned through the multiplicity of infection (MOI). For simplicity, the figure depicts a maximum of one gRNA per cell. Multiple perturbations within a cell can be used to assess functional cooperativity between regulatory elements (eg, enhancer compensation) or to reduce the number of cells required. FACS-sorting or selection for antibiotic resistance enables filtering out cells without any perturbation. (C) The identity and number of gRNA barcodes detected per cell are determined from the sc-seq data. Cells (columns) can be ordered based on the perturbations they harbour for downstream analysis purposes. Note: each row would traditionally have been a separate experiment and each column a sample or an experimental repeat; these can now be pooled into a single experiment. (D) Enhancer screening: the cis-regulatory functions of non-coding loci on the expression profiles of genes of interest can be investigated using sc-transcriptome-seq. Here, perturbation of enhancer 2 results in the downregulation of gene 1. (E) The entire transcriptome can be assessed for changes in expression on individual (or cooperative) perturbations to establish enhancer-gene relationships. (A–E) Synthesised data based on Gasperini et al and Xie et al.19 150

The initial demonstrations of the technology (called Perturb-seq,15 140 CRISP-seq14 and CROP-seq141) focused on elucidating gene functions (as opposed to non-coding regions). They perturbed the genes in a loss-of-function manner by implementing CRISPR/Cas9 in a knockout fashion (CRISPRko) and measured their trans-effects on the entire transcriptome. Dixit et al knocked out 24 transcription factors in bone marrow-derived dendritic cells and investigated the effect of these perturbations on the single-cell transcriptome. These initial studies screened tens of genes and assessed their trans-effect on the transcriptome of thousands of cells in a single experiment. Since these pioneering demonstrations, the technology has been optimised and adapted, for example, to improve the efficiency,142 to study the effect of upregulation using CRISPR-activation,143 to read out the effect of perturbations on the epigenome144 or proteome,145 to dissect the function of protein domains (sc-Tiling),146 as well as to elucidate gene functions147 and gene regulatory networks related to development, diseases148 or DNA-chromatin structure.149

Xie et al demonstrated that this approach can be used to measure the function of enhancers (figure 4D,E).150 They used the CRISPR/dCas9-KRAB system and sgRNAs to epigenetically suppress the activity of 15 super-enhancers (containing a total of 71 constituent enhancers) in topologically associating domains containing highly expressed genes and quantified their function. They dissected the contribution of the individual constituent enhancers within these super-enhancers and concluded that often only a few constituent enhancers contribute to the regulation of target-gene expression. Moreover, by targeting multiple enhancers per cell (3.2 sgRNAs per cell on average), they could evaluate combinatorial enhancer activity and observed evidence of enhancer compensation. However, the design of this pioneering study was limited to quantifying the function of enhancers with known gene associations. Gasperini et al extended the approach with the goal of functionally characterising 5779 candidate enhancers, where each perturbation was found in a median of 900 cells and each cell contained on average 28 gRNAs. They used an eQTL-inspired analysis framework to establish ~600 new enhancer-gene pairs. A similar use of the dCas9-KRAB system was recently demonstrated by Lopes et al 151 to evaluate the ~15 000 putative regulatory elements engaged by oestrogen receptors in the context of breast cancer. By combining the sc-seq data, HiC maps and functional assays (cell proliferation), they could map the oestrogen receptor-driven oncogenic programme and decipher the role of the respective non-coding regions.
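The logic of an eQTL-inspired test can be sketched in Python as follows: cells carrying a gRNA against an enhancer play the role of variant carriers, and the expression of each nearby gene is compared between carriers and non-carriers. This is a simplified illustration under stated assumptions, not the analysis pipeline of the cited studies: the 1 Mb cis window, the one-sided permutation test (swapped in for a proper count-based statistical model) and all names and toy values are hypothetical.

```python
import random
from statistics import mean


def cis_pairs(enhancers, genes, window=1_000_000):
    """Yield (enhancer, gene) pairs on the same chromosome within `window` bp.

    enhancers: {enhancer_id: (chromosome, position)}
    genes: {gene_id: (chromosome, tss_position)}
    window: assumed cis distance; chosen here purely for illustration.
    """
    for enh_id, (enh_chrom, enh_pos) in enhancers.items():
        for gene_id, (gene_chrom, tss) in genes.items():
            if enh_chrom == gene_chrom and abs(enh_pos - tss) <= window:
                yield enh_id, gene_id


def permutation_test(perturbed, control, n_perm=2000, seed=0):
    """One-sided permutation test for lower mean expression in perturbed cells.

    Returns (difference in means, empirical p-value).
    """
    rng = random.Random(seed)
    observed = mean(perturbed) - mean(control)
    pooled = list(perturbed) + list(control)
    n = len(perturbed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[:n]) - mean(pooled[n:]) <= observed:
            hits += 1
    # Add-one correction keeps the empirical p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy inputs for illustration only (not real data).
enhancers = {"enh1": ("chr1", 500_000), "enh2": ("chr1", 5_000_000)}
genes = {"gene1": ("chr1", 900_000)}
pairs = list(cis_pairs(enhancers, genes))

expr_with_guide = [1, 0, 2, 1, 0, 1]           # gene1 counts, cells with the enh1 gRNA
expr_without_guide = [5, 6, 4, 7, 5, 6, 5, 4]  # gene1 counts, all other cells
effect, p_value = permutation_test(expr_with_guide, expr_without_guide)
```

Restricting the tests to cis pairs keeps the number of hypotheses manageable; a real analysis would additionally correct for multiple testing and calibrate against non-targeting control gRNAs.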

In the above-mentioned studies, the perturbations were introduced in a particular cell type followed by omics-profiling at a particular time-point or, at best, within a time-window. This could, however, lead to missing enhancer functions, because of their time-dependent and cell type-dependent activity. That is, the perturbation of an enhancer active during the development of a specific cell type would go unnoticed if it is perturbed in an unrelated cell type. To overcome this challenge, Jin et al applied the Perturb-seq technology in utero, in the developing brain at E12.5, to study the function of a panel of 35 neurodevelopmental delay-related genes.12 As a result, a wide variety of cell types, including neurons, microglia and oligodendrocytes, were targeted and the effect of the perturbations on the expression of 14 gene modules could be assessed. Another advantage of this approach was the feasibility of querying the effect of otherwise lethal genetic perturbations, since only 0.1% of the cells were perturbed at a time. Application of this approach to screen the function of non-coding regions relevant to development or in an in vivo context is, however, yet to be demonstrated.

Taken together, it is currently possible to screen the effects of perturbing multiple coding or non-coding genomic regions on the transcriptome, epigenome or the proteome, with single-cell resolution in a single experiment. While most studies have used this multiplexed screening technology to evaluate the functions of genes, a handful of studies have already applied it to establish regulatory relationships. The technology can be performed in cultured cells or within a developing organism. Commercial kits are readily available from vendors such as 10x Genomics for similar experiments. In human genetics, the functional characterisation of unannotated genomic regions will assist in interpreting variants in these regions. Global efforts to annotate molecular and cellular phenotypes, such as the MorPhiC programme (National Institutes of Health, funding RFA-HG-21-029), would also benefit from the evolution of such technologies.

The next logical advance is the direct experimental screening of the variants themselves, but this is currently just out of reach of gene-editing technology, owing to its limited efficiency and accuracy. However, early demonstrations of profiling single nucleotide variants do look promising.152 Transduction-based overexpression instead of gene-editing has also been used to overcome these limitations in the functional annotation of variants.147 As the gene-editing technologies improve in efficiency, resolution, accuracy and specificity,153 154 it is increasingly likely that such technologies, currently restricted to research, will find direct applications in personal genomics for high-throughput experimental screening of the variants identified in an individual. In the meantime, computational approaches combined with sc-seq data can be used to prioritise (if not annotate) variants, as is discussed next.

Prospective application 2: in silico variant prioritisation using sc-seq data

Genome-wide association studies (GWAS) and the widespread introduction of NGS technologies in medical genetics have led to a massive increase in the identification of common and rare variants, respectively.155 Most of these variants fall into the non-coding genome. Far from all of these variants have an associated disease phenotype yet, and the experimental screening of variants is expensive, laborious and time-consuming. Currently, databases like ClinVar, HPO and OMIM are used to filter for known gene variants (figure 5A). Computational methods play a key role in the interpretation of the remaining unknown variants, but current variant prioritisation methods, like deep-learning methods to prioritise non-coding variants,156–159 CADD score,160 SIFT159 and several other methods,161 use bulk-seq data to rank these candidate variants by their probability of disease association. Sc-seq can enhance these methods by providing information at a cell type level rather than at a tissue level. This higher-dimensional information can improve the interpretation of how subtle changes can lead to diseases. This section focuses on how machine learning models trained on sc-seq data can prioritise a given set of variants found either through GWAS or whole genome sequencing.

Figure 5

Variant prioritisation workflow with sc-ATAC data. (A) The current molecular diagnostic workflow starts with the next-generation sequencing (NGS) of a patient with a specific disease (eg, autism spectrum disorder), which identifies a large number of variants. Various databases are used to filter and rank the variants in the coding region. Meanwhile, a large number of variants in the non-coding regions are discarded, due to the lack of prioritisation methods. (B–D) Variant prioritisation with sc-seq data helps rank every variant pertinent to a cell type, even if the variant was previously unknown. (B) Supervised machine learning approaches are trained on sc-ATAC peaks, transcription factor sequence motifs and the matched GC content of the peak and non-peak regions of a control sample. (C) sc-ATAC-seq data provide the machine learning model with insights into the chromatin accessibility profile of all the cell types in the tissue, and the sequence motifs inform the model about allelic importance. The pathogenicity of an unknown variant is predicted based on the disruption it causes to the accessibility of the locus. For example, the sequence motif shows the significance of the alleles in a region that is accessible only in cell type 3. Single nucleotide variant (SNV)1 (G>A) therefore causes a large disruption to the epigenome and thus receives a high score, whereas SNV3 (C>G) is in the same accessible region, but the allele at that position in the motif is not significant and the disruption it causes is low; hence it is scored low. (D) Variants are ranked based on the predicted pathogenicity scores. (A–D) Synthesised data based on Corces et al and Trevino et al. 115 162

Corces et al 115 and Trevino et al 162 showed that an sc-ATAC atlas of brain cells can be used to prioritise non-coding variants (figure 5B–D). Corces et al prioritised GWAS variants for Alzheimer’s disease and Parkinson’s disease by developing a machine learning model (gkm-SVM). They trained their model with sc-ATAC-seq data from adult human brains to predict the importance of each allele for chromatin accessibility. Trevino et al also used sc-ATAC-seq data, in their case from the developing human cerebral cortex, to prioritise non-coding de novo mutations from patients with autism spectrum disorder from the Simons Simplex Collection.163 They used BPNet,164 which is a deep convolutional neural network model that predicts each transcription factor’s per base binding signal as counts (ChIP-nexus).163 Both methods were trained with cluster-specific sc-ATAC-seq peaks and the transcription factor sequence motifs in the peak regions, along with non-peak regions whose GC content matches that of the peak regions, to reduce the bias due to GC content in the prediction (figure 5B). This enabled them to predict an importance or disruption score based on the changes in chromatin accessibility with allelic changes (figure 5C and D). In this way, variants could be prioritised whether they were identified de novo or through GWAS approaches. Trevino et al tested their model with sc-ATAC-seq data of cell types from other organs, such as the heart, and found no difference between case and control mutations, signifying that the disease state was highly cell type specific. They were also able to predict the most frequently disrupted motifs in autism. Even though the conservation score and the distance to the gene were similar in case and control mutations, sc-ATAC was able to rank the pathogenic mutations, which would have been difficult to identify by other methods.
Similar sc-ATAC-seq data-based approaches have been used to prioritise 527 GWAS variants in 48 diseases5 and in type 2 diabetes.165 Sc-seq data can also help elucidate the mechanisms by which GWAS variants affect haematopoiesis166 and type 1 diabetes.167
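The idea of scoring a variant by the change it induces in predicted accessibility can be illustrated with a toy Python sketch. A hand-written position weight matrix stands in for the trained models (it is neither gkm-SVM nor BPNet), and the motif, sequence, variant positions and function names are all hypothetical. The disruption score is simply the model output for the alternative allele minus that for the reference allele.

```python
import math

# Hypothetical position weight matrix (per-position base probabilities) for a
# made-up 4 bp motif; a trained accessibility model would replace this.
TOY_PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative position
]
BACKGROUND = 0.25  # assumed uniform base frequency


def accessibility_score(seq):
    """Best log-odds motif match in `seq`; a stand-in for a model prediction."""
    width = len(TOY_PWM)
    best = -math.inf
    for start in range(len(seq) - width + 1):
        score = sum(
            math.log2(TOY_PWM[i][seq[start + i]] / BACKGROUND)
            for i in range(width)
        )
        best = max(best, score)
    return best


def disruption_score(ref_seq, pos, alt):
    """Score(alt) minus score(ref) for a substitution at 0-based `pos`."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return accessibility_score(alt_seq) - accessibility_score(ref_seq)


def rank_variants(ref_seq, variants):
    """Rank (name, pos, alt) variants by absolute disruption, largest first."""
    scored = [
        (name, disruption_score(ref_seq, pos, alt))
        for name, pos, alt in variants
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Toy reference sequence with the motif "GAT" at positions 4-6.
ref = "CCTTGATCCTT"
variants = [
    ("SNV1", 4, "A"),  # G>A at an informative motif position
    ("SNV3", 7, "G"),  # C>G at the uninformative motif position
]
ranking = rank_variants(ref, variants)
```

In this toy setting, SNV1 destroys the best motif match and is ranked first, whereas SNV3 leaves the score unchanged. In practice, one such score would be computed per cell type-specific model, which is what makes the prioritisation cell type aware.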

With such applications being validated across various diseases and with the establishment of the Human Cell Atlas of diseased and healthy individuals, we can be hopeful that in silico variant prioritisation methods using sc-seq data will evolve further to be able to rank the effect of rare de novo mutations for routine clinical diagnostics. Currently, however, these methods stop short of providing insights into the global mechanistic consequences of the ranked variants, especially when the variants are located in unannotated genomic regions. Methods such as sc-RNA-eQTL168, sc-ATAC-eQTL169 and single-cell functional genomic approaches (prospective application 1) can further enable functional annotation of the ranked variants by tracking cellular-level changes, which would be missed by bulk-seq, and help uncover holistic disease biology.
