hadge: a comprehensive pipeline for donor deconvolution in single-cell studies

The hadge pipeline

Hadge offers a user-friendly, zero-config solution for analyzing multiplexed single-cell data at scale (Fig. 1B). Our pipeline takes advantage of Nextflow’s cloud-computing capabilities, enabling efficient use of cloud resources to accelerate the analysis of large datasets. Furthermore, Nextflow’s built-in containerization functionality simplifies deployment, providing a more reliable and reproducible analysis environment. The hadge pipeline consists of 12 deconvolution tools, including five genetics-based tools (Demuxlet [17], Freemuxlet [26], Vireo [22], scSplit [20], and Souporcell [21]), seven hashing-based tools (HTODemux [27], Multiseq [12], HashedDrops [28], Demuxem [15], gmm-demux [29], BFF [24], and Hashsolo [30]), and one doublet-detection method (Solo [30]). All of these tools have been benchmarked in independent publications and are widely used by the scientific community [14, 23, 31, 32]. Furthermore, for methods that require additional preprocessing such as normalization of the HTO counts or variant calling, the hadge pipeline includes a preprocessing step before the genotype-based deconvolution algorithm is applied.

The hadge pipeline has three modes: “genetic,” “hashing,” and “rescue.” In the genetic or hashing mode, the pipeline runs either the genotype- or hashing-based deconvolution workflow allowing for choice of methods and customization of input parameters. Each of these pipelines can be run in parallel across multiple samples, reducing the time and effort required for deconvolution. Finally, in the rescue mode, hadge allows jointly deconvolving hashing experiments with genotype-based deconvolution tools, with the option to recover cells from failed hashing. Lacking prior individual genetic profiles that associate SNPs to explicit donors, genotype-based deconvolution tools assign cells to anonymous donors. Hadge de-anonymizes the donors by calculating the best match between a hashing and a genetic demultiplexing method. After the conversion of the cell deconvolution assignments into a binary matrix with rows representing cell barcodes and columns representing the assigned donors or hashtags, donor genotypes are matched with hashtags by measuring the concordance of two methods in assigning the droplets, computing pairwise Pearson correlation to determine the optimal match, hereby termed “Phi score” (see the “ Methods” section). hadge then generates a new assignment of the cells based on this optimal match between hashing and genotype-based deconvolution to uncover the true donor identity of the cells effectively rescuing cells from failed hashing with a valid genotyped-based deconvolution assignment. Finally, hadge outputs the results of the donor deconvolution for all combinations of methods and hyperparameters tested, both as a separate tabular format and as cell metadata in either Anndata [33] or MuData [34] objects, depending on the users’ choice.

Hashing-based methods’ performance greatly varies with noisy HTO libraries

We applied the hadge pipeline to a hashing dataset of single nuclei sequencing collected from post-mortem brain tissue from multiple sclerosis donors [35]. The hashing count matrix of this dataset presented a high background noise from non-specific antibody binding, which originally resulted in a high number of doublets and negative cells (Fig. 2A, B). We ran both hashing and genetic deconvolution workflows to assess the performance of the two types of approaches. We observed inconsistent hashtag counts (Fig. 2A, B and Additional file 1: Fig. S1). Specifically, hashtag 453 showed a high overall expression, while hashtags 454 and 455 were expressed in relatively low levels (Fig. 2A and Additional file 1: Fig. S1, S6). Due to the variable readout of the hashing oligos, the sample assignment of the hashing-based methods was not consistent. The number of detected singlets varied greatly between different methods (Fig. 2C and Additional file 1: Fig. S1-2, S7). While Hashsolo classified almost every droplet as a singlet, HashedDrops detected only 32 singlets among 4048 non-empty droplets. Notably, DemuxEM and Multiseq exhibited nearly identical performance (Additional file 1: Fig. S1-2,S4), both assigning nearly 1800 singlets, (Additional file 2: Table S1) with Multiseq identifying slightly more singlets and being considerably faster than DemuxEM. (Additional file 3: Table S1). Despite the noisy HTO readouts, the RNA profiles of these cells are still of good quality, allowing demultiplexing to be performed from the RNA library. Since the donor-specific reference genotypes are not available for this experiment, we run all genetic deconvolution tools in reference genotype-free mode. Compared to hashing, genotype-based deconvolution methods performed more consistently and identified significantly more singlets (Fig. 2D and Additional file 1: Fig. S3, S8-9). Each tool classified over 90% of the droplets as singlets, and there was consistent agreement between all tools for 3914 singlets (Fig. 3D). However, scSplit identified 296 droplets as doublets, which were consistently identified as singlets by three other methods. Due to the high consistency among Vireo, Freemuxlet, and Souporcell, and available benchmarks showcasing its favorable performance compared to the other tools [23], we decided to use Vireo as a baseline for genotype-based deconvolution methods.

Fig. 2

Comparison of the performance of donor deconvolution methods. A The violin plot of raw HTO counts shows high count levels of Hashtag 453 in cells with noisy or undetectable expression of the other HTOs. B UMAP visualization of normalized HTO counts colored by HTODemux assignment shows poor separation of the cells based on hashtags, with most droplets assigned to Hashtag 453. C Hashing-based deconvolution methods show the inconsistent assignment of cells, reported as the different proportions of cells identified as one of either singlet, negative, or doublet. D Genetic deconvolution tools show a more consistent assignment of the cell mixture to singlets, doublets, and negatives

Fig. 3

Joint deconvolution recovers high-quality cells. A Overview of the steps to extract cell variants from common SNPs in the population based on the assignment of Multiseq and Vireo. B Heatmap summarizing the donor matching result shows that DemuxEM and Multiseq are in high concordance with all genotype-based deconvolution methods, where all the donors are matched with a high matching score. C Correlation heatmap of donor identification between Vireo and Multiseq. D Sankey plot summarizing the percentages of cells deconvoluted by hashing (Multiseq) and after the joint deconvolution step (Vireo). E Number of donor-specific variants used as input for the refinement step. F Sankey plot summarizing the percentages of cells deconvoluted by hashing (Multiseq) and after the refinement step

Joint deconvolution recovers cells with low-quality hashing data

Beyond determining the optimal combination of hashing- and genotype-based deconvolution methods, hadge aims to rescue cells whose hashing quality was low or whose hashes were missing (Fig. 3A). Hadge performs joint deconvolution with both hashing and genetic deconvolution tools to rescue high-confidence singlets. Only cells that can be confidently genetically deconvoluted are eligible to be rescued. After having demultiplexed the experiment in genotype-free mode [20, 22], the anonymous donors need to be matched to their original sample to be identified. Here, we rely on the hashing deconvolution to provide the known correspondence between the antibody hashtags and the original sample. Cells that are jointly deconvolved provide the key to de-anonymize the genetically rescued cells.

We first define the hashing method that matches the genetic demultiplexing method by calculating the Phi score (see the “ Methods” section). For each pair of hashing and genetic deconvolution outputs, we compute the pairwise Pearson correlation on the binarized cell classification vectors, thereby matching donors where a high correlation is observed. We then compute the matching score by summing the non-negative correlations and dividing by the number of expected donors, obtaining the degree of consistency in donor identification between any two methods. Based on the observed high matching score and the successful matching of all anonymous donors (Additional file 1: Fig. S4-5, S10), two hashing demultiplexing methods performed best compared to Vireo, namely Multiseq and Demuxem, both recovering identical matches between genetic and hashing donors (Fig. 3B, Additional file 1: Fig S4-5 and Additional file 2: Table S1). When the optimal match is identified, the identities recovered using the cells in the intersection between genetic and hashing can be propagated to the cells that are identified by genetic deconvolution alone. Here, we decided to use the joint demultiplexing of Multiseq and Vireo to showcase the rescue mode because of Multiseq’s reduced runtime. For every anonymous donor recovered by Vireo, there was only one hashtag with a high correlation, with scores ranging from 0.53 to 0.89 (Fig. 3C). Using the cells that are jointly deconvoluted into singlets by hashing and genetic demultiplexing, we extended the classification to those cells whose hashing was undetectable (negatives). We identify 90% of the cells as singlets, rescuing 89.7% of the original negatives (Fig. 3D), and double the number of recovered singlets for the hashes with the lowest detection rate (Hash454-456, Fig. 2A). Vireo is implemented to rely on cellSNP, which outputs the recovered genetic variants in each cell. We implemented an optional process in hadge to refine and confirm the quality of the deconvolution, by extracting cell-variants to reconstruct minimal donor genotypes from the common SNPs in the populations. Variants with low coverage (allele depth < 10) or a low frequency of the overrepresented allele (frequency < 0.1) were excluded, revealing 7866 variants that were unique to each donor (Fig. 3E). Since only a small fraction of the hashing-recovered singlets is sufficient to de-anonymize the genetic-singlets, we can use these reconstructed genotypes to run an additional genetic demultiplexing. Therefore, this final refinement step allows to effectively demultiplex cell mixtures without having to generate new SNP references. Using this refining approach, we identify 75% of the cells as singlets, with the number of rescued negatives decreasing to 69.7% (Fig. 3F). Nevertheless, we obtain 97.6% consistent donor assignment between the rescued and the refined assignments (Additional file 1: Table S1), suggesting that these variants were crucial in distinguishing a donor cluster from others during deconvolution.

Recovered cells recapitulate known cell types

To investigate whether the cells that are rescued are of good quality and biologically relevant, we reanalyzed the MS samples, including the recovered cells. We first merged the already existing annotation of the cells with the deconvolution information obtained from the hadge pipeline. We then applied quality filtering, removing cells based on gene content, mitochondrial percentage, and doublet rates (Additional file 1: Fig S12) (see the “Methods” section), reproducing the quality control performed in the original study but with a more stringent doublet detection threshold. With this approach, we retained 3208 cells, rescuing 952 cells that were excluded in the original study. We then embedded the cells using UMAP and calculated Leiden clustering. Most of the rescued cells were distributed across existing clusters, with comparable marker expression between the old and new cells (Fig. 4A, B, D, Additional file 1: Fig. S11). Intriguingly though, the percentage of rescued cells per cluster varied. While most of the clusters consisted predominantly of previously annotated cells mixing with a smaller part of rescued cells, two clusters were composed of more than half or even 100% rescued cells (Fig. 4E). While the smaller one of these, consisting solely of rescued cells, had an almost exclusively high expression of the marker HTR2C, we found the gene marker expression of, e.g., SYT1, SLC17A7, and low GAD2 to be consistent with a neuronal profile with excitatory and non-inhibitory properties in both clusters. Reassuringly, the latter marker expression was in accordance with that of known neuronal clusters [36, 37] (Fig. 4C, D, Additional file 1: Fig. S11).

Fig. 4

Recovered cells recapitulate known cell types. A UMAP of the single-cell gene expression data with old and rescued cells. B Leiden clustering of the dataset with old and rescued cells. C SYT1 expression defines rescued cells as a new cluster of neurons. D Dotplot of a selection of marker genes shows concordant expression of markers in old and rescued cells. E Barplot showing the cluster composition in old and rescued cells, with two neuronal clusters enriched for rescued cells. Colors on top of the barplot identify the cell annotation from A

Benchmarking hadge’s runtime and robustness

To demonstrate the robustness and superior runtime of our proposed pipeline, we benchmarked its performance against two existing pipelines, demuxafy and cellHashR (Table 1). We submitted each pipeline on a Linux server with 32 CPU cores and 160 GB of RAM memory. In all benchmarks, hadge showed superior performance with respect to the optimization of computational resources and runtime (Fig. 5A). Both hadge-genetic and demuxafy successfully executed all methods for the two mpxMS samples and an additional dataset. However, in the hashing deconvolution of the mpxMS data, some methods (bff_cluster, bff_raw) ran but failed to deconvolve the cells in both hadge-hashing and cellhashR. Additionally, one method (demuxmix) failed to initialize in both pipelines and as a standalone method. Hence, we excluded demuxmix from hadge. Notably, despite successfully running gmm_demux within hadge-hashing or when called outside the pipeline, we were not able to run cellhashR’s gmm-demux module.

Table 1 Comparison of donor deconvolution pipelinesFig. 5

Benchmarking performance. A Hadge genetic and hashing demultiplexing pipelines were benchmarked against demuxafy and cellhashR. The benchmark was performed on three samples for each pairwise comparison, for a total of four samples (mpxMS:gx12, mpxMS:gx38, demuxafy dataset, CR-438–21 dataset). B Results of hadge genetic on the onek1k cohort; each boxplot represents the distribution of percentage singlets identified across 75 pools by each genetic deconvolution tool. C Percentage of correctly matched singlets for each tool; each boxplot represents the distribution across 75 pools. D Dotplot showing the effect of the number of cells per pool on the percentage of recovered and matched singlets. The regression line represents the fit of a linear model on the percentage of singlets identified by Freemuxlet (R.2 0.35, p.adj < 0.0001)

Next, we leveraged hadge’s fast multi-sample, multi-process handling to investigate how the input number of cells affects the performance of genetic and hashing demultiplexing. We ran hadge with default parameters on the onek1k dataset, which comprises 75 pools of cell mixtures from 9 to 15 donors each [9] and has ground truth donor-genotypes available. All tools detected a proportion of singlets per pool between 75 and 98% (Fig. 5B). However, when looking at matched donors within the singlets, the performance of the tools varied substantially, with scSplit having the lowest concordance with the original donor identities, while Demuxlet had the best performance in terms of recovered singlets and Vireo recovering the most matched singlets (Fig. 5C). We investigated if these two metrics were associated with the number of cells in each pool. Only Freemuxlet’s singlets percentage had a significant association with the number of cells per pool (R2 = 0.35, p.adj < 0.0001) (Fig. 5D), but all methods were significantly affected by the number of donors in the pools (Additional file 1: Fig. S13A, Additional file 3: Table S2), with the highest concordant calls reached on pools with 14 multiplexed donors (Additional file 1: Fig. S13B). Across all the pools, Vireo, Demuxlet, Freemuxlet, and Souporcell were mostly consistent, confirming these tools’ superior performance, and as observed on the MS dataset (Figs. 2 and 3). Additionally, we ran downsampling on two hashing and one genetic multiplexing experiments and used hadge to obtain the percentages of correctly assigned singlets. The hashing methods showed an overall similar trend with the percentage of matched singlets decreasing with the number of cells, except for HTODemux and Demuxem (Additional file 1: Fig. S14). All the hashing demultiplexing tools were able to correctly assign at least 90% of the cells after downsampling. The performance of the genetic demultiplexing tools was in line with what we observed on the onek1k dataset, and consistent for each tool across the different subsamples, with Vireo having the best performance across the board (~ 99% recovered matching singlets) (Additional file 1: Fig. S14). Hadge allowed us to efficiently benchmark the demultiplexing performances of all the methods across the two workflows. Collectively, these results indicate that the number of donors and cells in the cell mixtures can significantly affect the number of recovered cells and the quality of the deconvolution for both families of demultiplexing methods.

View original article

GENOME BIOLOGY

分享书签

0 0 0 0 0 0 0

More from this channel

hadge: a comprehensive pipeline for donor deconvolution in single-cell studies

留言 (0)