DISSECT: deep semi-supervised consistency regularization for accurate cell type fraction and gene expression estimation

In this section, we first formally define the cell deconvolution task, then present our hypothesis and DISSECT deep learning models, and compare DISSECT’s performance to other state-of-the-art deconvolution algorithms.

Task of cell deconvolution

Given an $m \times n$ gene expression matrix $\textbf{B}$ consisting of $m$ bulk gene expression vectors measuring $n$ genes, the goal of deconvolution is to find an $m \times c$ matrix $\textbf{X}$ of cell type fractions, where $c$ is the number of cell types present in the bulk samples, such that

$$\begin{aligned} \textbf{B} = \textbf{X}\textbf{S}, \end{aligned}$$

(1)

where fractions and gene expression satisfy non-negativity, $0 \le \textbf{X}_{ik}$ and $0 \le \textbf{S}_{kj}$, $\forall i \in [1,m], \forall j \in [1, n]$ and $\forall k \in [1,c]$, and the sum-to-1 criterion, i.e., $\sum \limits _{k=1}^{c}\textbf{X}_{ik} = 1$, $\forall i \in [1,m]$. Here, $\textbf{S}$ is the $c \times n$ signature matrix and is unobserved. Each row $\textbf{S}_{k}$ is a gene expression profile (or signature) of cell type $k$. To utilize a reference-based framework, $\textbf{S}$ can be replaced with $\textbf{S}_{\text{ref}}$ derived from a single-cell experiment by identifying the most representative cell type-specific gene expression [8].

The problem of reference-based cell deconvolution can alternatively be formulated as a learning problem, in which a function $f$ such that $f(\textbf{B}) = \textbf{X}$ is learnt. Since only $\textbf{B}$ is available and $\textbf{X}$ is generally unknown, simulations from a single-cell reference can be used to learn $f$. Given the above formulation of the cell deconvolution task, it is reasonable to assume linearity of deconvolution, i.e., each bulk mixture is a linear combination of the expression vectors of its cells, weighted by the corresponding cell type fractions. Thus, as defined previously in Scaden [9], multiple single cells can be combined in random proportions to generate training examples $\textbf{B}^{\text{sim}}$ and $\textbf{X}^{\text{sim}}$, where each row of $\textbf{B}^{\text{sim}}$ is defined as

$$\begin{aligned} \textbf{B}^{\text{sim}}_{i} = \sum \limits _{k=1}^{c} \sum \limits _{l=1}^{\alpha _{k}} \textbf{e}^k_{l}, \end{aligned}$$

where $\textbf{e}^k_{l}$ is the expression vector of cell $l$ belonging to cell type $k$, and $\alpha _{k}$ is the number of cells belonging to cell type $k$ sampled to construct $\textbf{B}^{\text{sim}}_{i}$. Correspondingly, each element of $\textbf{X}^{\text{sim}}$ is the proportion of cell type $k$ in sample $i$ and is defined as

$$\begin{aligned} \textbf{X}^{\text{sim}}_{ik} = \frac{\alpha _{k}}{\sum \limits _{k'=1}^{c}\alpha _{k'}}. \end{aligned}$$

In this case, since each simulated sample has a distinct signature (i.e., gene expression profile), $\textbf{S}$ is a three-dimensional array with each element $\textbf{S}_{ikj}$ denoting the expression of gene $j$ in cell type $k$ for sample $i$. It is computed as follows:

$$\begin{aligned} \textbf{S}^{\text{sim}}_{ik} = \frac{\sum \limits _{l=1}^{\alpha _{k}} \textbf{e}^k_{l}}{\alpha _{k}}. \end{aligned}$$

The predictor $f$, learned from a simulated dataset, can then be applied to $\textbf{B}$ to estimate $\textbf{X}$. Note that the genes expressed may differ between the single-cell vectors $\textbf{e}_l$ and $\textbf{B}$; therefore, before learning the function $f$, each $\textbf{e}$ is subsetted to include only the genes it shares with $\textbf{B}$. This is why the learning problem is transductive and a separate model needs to be constructed for each $\textbf{B}$.
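The simulation of a single training example can be sketched in a few lines of NumPy. This is a minimal illustration rather than the DISSECT implementation: the function name `simulate_bulk`, the flat Dirichlet draw for the random proportions, and the fixed number of cells per sample are all assumptions made for the example.

```python
import numpy as np

def simulate_bulk(expr, labels, n_types, cells_per_sample=500, rng=None):
    """Simulate one pseudo-bulk sample from a single-cell reference.

    expr:   (n_cells, n_genes) single-cell expression matrix
    labels: (n_cells,) integer cell type labels in [0, n_types)
    Returns the simulated bulk vector (n_genes,), the cell type
    fractions (n_types,) and the per-type signatures (n_types, n_genes).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: random proportions are drawn from a flat Dirichlet.
    target = rng.dirichlet(np.ones(n_types))
    alpha = rng.multinomial(cells_per_sample, target)  # cells drawn per type
    b_sim = np.zeros(expr.shape[1])
    s_sim = np.zeros((n_types, expr.shape[1]))
    for k in range(n_types):
        if alpha[k] == 0:
            continue  # type absent from this sample; signature stays zero
        pool = np.flatnonzero(labels == k)
        chosen = rng.choice(pool, size=alpha[k], replace=True)
        summed = expr[chosen].sum(axis=0)
        b_sim += summed               # bulk = sum over all sampled cells
        s_sim[k] = summed / alpha[k]  # signature = mean profile of type k
    x_sim = alpha / alpha.sum()       # fraction of each cell type
    return b_sim, x_sim, s_sim
```

By construction, the simulated bulk vector equals the total number of sampled cells times the product of the fraction vector and the per-sample signature, which is the linear relation of Eq. 1 up to a scaling factor.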

Exploiting the linearity of deconvolution

The deconvolution task is to learn a cell type-specific gene expression matrix (or signature matrix) $\textbf{S}$, which serves to accurately predict cell fractions and their corresponding gene expression from a bulk gene expression matrix $\textbf{B}$. The actual mixing process of cells to form a tissue is assumed to be linear and, as such, the relationship between $\textbf{B}$ and $\textbf{S}$ is linear. However, $\textbf{S}$ is unobserved, and the deconvolution algorithm is learned using simulations. This learning process is highly dependent on the reference, i.e., the single-cell dataset used to generate the simulations, and is subject to an inherent strong domain shift [14]. To address this, we hypothesize that a consistency-based regularization penalizing the non-linearity of mixtures of real and simulated samples would result in a mapping $\hat{f}$ that is closer to the true mapping $f$. Non-linearity of mixtures of real and simulated samples refers to the violation of Eq. 4, defined later, for the estimated $\textbf{X}_{i}, \textbf{X}^{\text{sim}}_{i}$ and $\textbf{X}^{\text{mix}}_{i}$ using mapping $f$.

Consistency regularization

Consider that $\textbf{B}$ represents the gene expression matrix of the real (test) bulk RNA-seq samples that we want to deconvolve and $\textbf{B}^{\text{sim}}$ represents the gene expression matrix of simulated bulk samples. The number of rows (representing samples) in these two matrices may differ. To simplify the notation, we use the same index $i$ to denote indices for real bulk samples, simulations ($\text{sim}$) and their mixtures ($\text{mix}$, defined further). Given a true bulk RNA-seq sample $\textbf{B}_{i}$ and a simulated sample $\textbf{B}^{\text{sim}}_{i}$ with paired proportions $\textbf{X}^{\text{sim}}_{i}$, defined over a common set of genes, we can generate a mixture $\textbf{B}^{\text{mix}}_{i}$ such that

$$\begin{aligned} \textbf{B}^{\text{mix}}_{i} = \beta \textbf{B}_{i} + (1-\beta )\textbf{B}^{\text{sim}}_{i}, \end{aligned}$$

(2)

which gives us the relation

$$\begin{aligned} \textbf{X}^{\text{mix}}_{i}\textbf{S}^{\text{mix}}_{i} = \beta \textbf{X}_{i} \textbf{S}_{i} + (1-\beta ) \textbf{X}^{\text{sim}}_{i} \textbf{S}^{\text{sim}}_{i}, \end{aligned}$$

(3)

where $\textbf{X}_{i}$ represents the cell fractions of sample $i$ and $\beta \in [0,1]$. Cell types are characterized by a few marker genes that are invariant across cell states and even across tissues [15]. A network that accurately predicts cell type fractions based on gene expression of simulated or real bulk RNA-seq data would thus have to learn them. In the estimation of cell type fractions, we therefore assume that the expression of these marker genes should be identical in the signatures $\textbf{S}^{\text{mix}}_{i}, \textbf{S}_{i} \text{ and } \textbf{S}^{\text{sim}}_{i}$. Hence,

$$\begin{aligned} \textbf{X}^{\text{mix}}_{i} = \beta \textbf{X}_{i} + (1-\beta )\textbf{X}^{\text{sim}}_{i}. \end{aligned}$$

(4)

Equation 4 serves as the formulation to generate pseudo ground truths for these mixtures during learning, and it enables the use of consistency regularization without having to explicitly estimate signatures. In an iterative learning process, $\textbf{X}_{i}$ can be replaced with the predictions of the algorithm from the previous iteration. Naturally, it is also possible to mix only real samples with each other. The number of bulk RNA-seq samples is, however, considerably lower (tens to hundreds) than the number of single cells present in a single-cell experiment (thousands or more). Equation 4 allows generating pseudo ground-truth proportions for the mixtures $\textbf{B}^{\text{mix}}_{i}$ at each step of learning cell type fractions, while Eq. 3 allows generating pseudo ground-truth signatures at each step of learning gene expression profiles.
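The step from Eq. 2 to Eq. 4 can be checked numerically. The sketch below is illustrative only: `mix_samples` is a hypothetical helper, and the toy data assume that real and simulated samples share one signature matrix, which is exactly the assumption made for fraction estimation.

```python
import numpy as np

def mix_samples(b_real, x_real_pred, b_sim, x_sim, beta):
    """Mix real and simulated bulk samples (Eq. 2) and build the
    pseudo ground-truth fractions of the mixtures (Eq. 4)."""
    b_mix = beta * b_real + (1.0 - beta) * b_sim
    x_mix = beta * x_real_pred + (1.0 - beta) * x_sim
    return b_mix, x_mix

rng = np.random.default_rng(0)
S = rng.gamma(2.0, size=(4, 30))            # shared signature: 4 cell types x 30 genes
x_real = rng.dirichlet(np.ones(4), size=8)  # stands in for predicted fractions of real samples
x_sim = rng.dirichlet(np.ones(4), size=8)   # known fractions of simulated samples
b_mix, x_mix = mix_samples(x_real @ S, x_real, x_sim @ S, x_sim, beta=0.3)

# With a shared signature, the mixture is itself a valid bulk sample
# whose fractions are the mixed fractions.
print(np.allclose(b_mix, x_mix @ S))
```

The mixed fractions remain non-negative and sum to 1 because they are a convex combination of two valid fraction vectors.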

Network architecture and learning procedure

We treat the estimation of cell type fractions and the estimation of gene expression profiles per cell type as two separate tasks because of their differing assumptions. For the estimation of cell type fractions, we assume that signatures are identical for each sample, both simulated and bulk, while to estimate gene expression, we relax this condition and use the complete consistency regularization (Eq. 3). An illustration of the method is presented in Fig. 1.

Fig. 1

A Illustration of the simulation procedure using reference single-cell data. The figure shows the simulation of one sample which consists of cell type fractions, simulated gene expression and cell type specific gene expression profiles (i.e., signature matrix). B Detailed overview of an MLP used to estimate cell type fractions. C Overview of an autoencoder used to estimate cell type specific gene expression profiles

Estimation of cell type fractions

The underlying algorithm of the first part of our deconvolution method is an average ensemble of multilayer perceptrons (MLPs). Ensembling reduces the variance by averaging the predictions of different runs [16]. All MLPs share the same architecture and are initialized with different weights. The architecture is: Input (# genes) - ReLU6 (512) - ReLU6 (256) - ReLU6 (128) - ReLU6 (64) - Linear (# cell types) - Softmax. ReLU6 (the output of the ReLU activation clipped at a maximum value of 6) [17, 18] was chosen by grid search over the tested activations (Linear, ReLU, ReLU6, Swish [19]). The final softmax activation enforces the non-negativity and sum-to-1 criteria of deconvolution. We train the network with a batch size of 64 to minimize the per-batch loss function defined below, using the Adam optimizer with an initial learning rate of $1e-5$.
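The forward pass of this architecture can be sketched with plain NumPy. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the Glorot-style initialisation is an assumption (the text only says the weights differ between ensemble members), and training is omitted.

```python
import numpy as np

HIDDEN = (512, 256, 128, 64)  # hidden layer widths listed in the text

def relu6(z):
    # ReLU clipped at a maximum value of 6
    return np.clip(z, 0.0, 6.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_mlp(n_genes, n_types, rng):
    """Return a list of (weights, bias) pairs, one per layer."""
    sizes = (n_genes, *HIDDEN, n_types)
    # Assumption: Glorot-style uniform initialisation.
    return [(rng.uniform(-1, 1, (a, b)) * np.sqrt(6.0 / (a + b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, b):
    h = b
    for w, bias in params[:-1]:
        h = relu6(h @ w + bias)
    w, bias = params[-1]
    return softmax(h @ w + bias)  # rows are non-negative and sum to 1

def ensemble_predict(models, b):
    # Average ensemble: mean of the fraction estimates of all members
    return np.mean([forward(p, b) for p in models], axis=0)
```

Because each member ends in a softmax, the averaged prediction also satisfies the non-negativity and sum-to-1 criteria.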

$$\begin{aligned} \mathcal{L}_{\text{frac}} \left(\textbf{X}^{\text{sim}}_{i}, f\left(\textbf{B}^{\text{sim}}_{i}\right),\textbf{X}^{\text{mix}}_{i}, f\left(\textbf{B}^{\text{mix}}_{i}\right)\right) & = \mathcal{L}_{\text{KL}}\left(\textbf{X}^{\text{sim}}_{i}, f\left(\textbf{B}^{\text{sim}}_{i}\right)\right) \nonumber \\ & \quad + \lambda _1 \mathcal{L}_{\text{cons}}\left(\textbf{X}^{\text{mix}}_{i}, f\left(\textbf{B}^{\text{mix}}_{i}\right)\right), \end{aligned}$$

(5)

where $\mathcal{L}_{\text{KL}}(\cdot , \cdot )$ is the Kullback-Leibler divergence and $\mathcal{L}_{\text{cons}}(\cdot , \cdot )$ is the consistency loss defined as:

$$\begin{aligned} \mathcal{L}_{\text{cons}}\left(\textbf{X}^{\text{mix}}_{i}, f\left(\textbf{B}^{\text{mix}}_{i}\right)\right) = \left\| \textbf{X}^{\text{mix}}_{i} - f\left(\textbf{B}^{\text{mix}}_{i}\right)\right\| ^2_2, \;\text{where} \end{aligned}$$

$$\begin{aligned} \textbf{X}^{\text{mix}}_{i} = \beta f(\textbf{B}_{i}) + (1-\beta )\textbf{X}^{\text{sim}}_{i}. \end{aligned}$$

To generate mixtures, for each batch, we sample $\beta$ uniformly at random for Eq. 4. The interval [0.1, 0.9] was chosen for the uniform distribution to allow for at least some real and some simulated gene expression in the mixture. Since the number of simulations is generally larger (in our experiments, set to 1,000 times the number of cell types) than that of real data, we sample real data to create additional bulk samples, $\textbf{B}_{i}$, until the size equals that of the simulated data, $\textbf{B}^{\text{sim}}_{i}$. This pair of data together with the simulated proportions, $\textbf{X}^{\text{sim}}_{i}$, is then used to create training batches of size 64. For every batch, we generate mixtures according to Eq. 2.
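The per-batch objective of Eq. 5 might be computed as in the following sketch. Several details are assumptions made for illustration: the helper names, the direction of the KL divergence (target relative to prediction), the small clipping constant, and the mean reduction over the batch. In actual training, the prediction on real samples used to build the pseudo ground truth would be treated as a fixed target.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """Row-wise KL(p || q) for batches of fraction vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=1)

def fraction_loss(f, b_real, b_sim, x_sim, lam1, rng):
    """Supervised KL term on simulations plus weighted consistency term.

    f maps a (batch, n_genes) array to (batch, n_types) fractions.
    """
    beta = rng.uniform(0.1, 0.9)                   # one beta per batch
    b_mix = beta * b_real + (1 - beta) * b_sim     # Eq. 2
    x_mix = beta * f(b_real) + (1 - beta) * x_sim  # pseudo ground truth
    supervised = kl_divergence(x_sim, f(b_sim))
    consistency = np.sum((x_mix - f(b_mix)) ** 2, axis=1)
    return float(np.mean(supervised + lam1 * consistency))
```

A predictor that is exactly linear in its input and correct on the simulations drives both terms to zero, whereas a predictor that ignores its input is penalised by the consistency term whenever the simulated fractions differ from its constant output.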

Our loss is inspired by MixMatch [20], which uses unlabelled samples to mix up and match sample predictions. Our adaptation in Eq. 5 addresses the limited number of samples available from true bulk RNA-seq and the unavailability of their ground-truth fractions, and is derived from the definition of the task itself. In essence, Eq. 5 integrates domain knowledge into the objective.

To avoid a scenario where the network does not learn and outputs predictions such that $f(\textbf{B}^{\text{mix}}_{i}) = f(\textbf{B}^{\text{sim}}_{i}) = f(\textbf{B}_{i})$, which is a solution to Eq. 4, we first let the model learn purely from simulated examples. This allows the model to learn meaningful expression profiles to achieve accurate results on simulated examples. We selected $\lambda _1$ based on a grid search over constant and step-wise functions. We adopt a step-wise function for $\lambda _1$, given as:

$$\begin{aligned} \lambda _1 = \left\{ \begin{array}{ll} 0 & \text{if step} \le 2000, \\ 15 & \text{if } 2000 \le \text{step} \le 4000, \\ 10 & \text{otherwise.} \end{array}\right. \end{aligned}$$

We train the network for a predefined number of steps as opposed to epochs, since it is possible to generate infinitely many simulated samples without increasing the intrinsic dimensionality of the data. In our experiments, we limit the number of steps to 5000 as found optimal in Scaden [9].
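The step-wise weight on the consistency loss can be written as a small function. This is a sketch: the function name is hypothetical, and the cases are evaluated in the order in which they are listed, so the boundary value at step 2000 falls into the first case (an assumption, since the stated ranges overlap there).

```python
def lambda_1(step):
    """Weight of the consistency loss at a given training step."""
    if step <= 2000:
        return 0.0   # warm-up: learn from simulated examples only
    if step <= 4000:
        return 15.0  # strongest consistency regularization
    return 10.0      # reduced weight for the remaining steps
```

With 5000 training steps, the consistency term is therefore inactive for the first 2000 steps and active for the remaining 3000.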

Estimation of per sample cell type specific gene expression profiles

Estimation of cell type fractions from bulk RNA-seq requires the assumption that signatures of cell types are shared across single-cell and bulk RNA-seq. However, cell type gene expression profiles (at least for genes that are not invariant across tissue states) may differ between samples. Previously, works such as CSx [8] and TAPE [5] have explored utilizing cell type fractions to estimate gene expression per sample. Here, we make use of a $\beta$-variational autoencoder with a standard normal distribution as prior to estimate the average gene expression of the different cell types from bulk RNA-seq expression levels. To jointly train the network on all cell types, we condition the decoder (at its input layer) on cell type labels. This allows for training a single model to estimate the gene expression of each cell type for a sample. To make use of bulk RNA-seq during training, we regularize the reconstruction loss with a consistency loss defined over the per-cell-type signature. Denoting $f$ as before and $g(\cdot , k)$ as the output of the autoencoder with condition $k$ (corresponding to the cell type label) on the decoder input, this consistency loss is defined as:

$$\begin{aligned} \mathcal{L}_{\text{cons}}^{\text{expr}}\left(f, g, \textbf{B}_{i}^{\text{mix}}, \textbf{B}_{i}, \textbf{X}^{\text{sim}}_{i}, \textbf{S}^{\text{sim}}_{i}\right)&= \left\| f\left(\textbf{B}_{i}^{\text{mix}}\right)_{k} g \left(\textbf{B}^{\text{mix}}_{i}, k\right) - \beta f\left(\textbf{B}_{i}\right)_{k} g(\textbf{B}_{i}, k)\right. \nonumber \\&\quad - \left. (1-\beta ) \textbf{X}^{\text{sim}}_{ik} \textbf{S}^{\text{sim}}_{ik}\right\| ^2_2, \end{aligned}$$

where $\textbf{B}^{\text{mix}}_i$ is given by Eq. 2, and $f(\textbf{B}_{i}^{\text{mix}})_{k}$ is the proportion of cell type $k$ in sample $i$ as estimated during cell type fraction estimation and is fixed during training. In the implementation, we replace $f(\textbf{B}_{i}^{\text{mix}})_{k}$ with $\beta f(\textbf{B}_{i})_{k} + (1-\beta ) \textbf{X}^{\text{sim}}_{ik}$. Thus, this loss forces the learned signature for cell type $k$, $g(\textbf{B}^{\text{mix}}_{i}, k)$, to be closer to the signatures for both real and simulated bulk samples. This loss function makes the assumption that mixing two bulk samples is similar to mixing the individual cell type-specific signatures that constitute those bulks. We added this loss function with a regularization parameter $\lambda _2$ (with default value 0.1) to the loss of the standard $\beta$-variational autoencoder (the weight on the KL divergence, denoted as $\beta ^{\text{KL}}$, is set to 0.1 by default). The total loss function sums up to:

$$\begin{aligned} \mathcal{L}^{\text{expr}}_{\text{total}}\left(f, g, \textbf{B}^{\text{sim}}_{i}, \textbf{B}_{i}^{\text{mix}}, \textbf{B}_{i}, \textbf{X}^{\text{sim}}_{i}, \textbf{S}^{\text{sim}}_{i}\right)&= \left\| \textbf{S}^{\text{sim}}_{ik} - g\left(\textbf{B}^{\text{sim}}_{i}, k\right)\right\| ^2_2 \\&\quad + \lambda _2 \mathcal{L}_{\text{cons}}^{\text{expr}}\left(f, g, \textbf{B}_{i}^{\text{mix}}, \textbf{B}_{i}, \textbf{X}^{\text{sim}}_{i}, \textbf{S}^{\text{sim}}_{i}\right) \\&\quad + \beta ^{\text{KL}} \mathcal{L}_{\text{KL}} (\mathcal{N}(\mu , \sigma ), \mathcal{N}(0,1)), \end{aligned}$$

where $\mathcal{N}(0,1)$ is the standard normal distribution, and $\mu \text{ and } \sigma$ are the empirical mean and standard deviation estimated from the output of the encoder. Both the encoder and decoder consist of two hidden layers. Under default settings used throughout this work, we train the network to minimize the loss function with an Adam optimizer with an initial learning rate of $1e-3$, and the values for hyperparameters $\lambda _2$ and $\beta ^{\text{KL}}$ are respectively 0.1 and $1e-2$. The network is trained for 5000 $\times k$ steps, $k$ being the number of cell types.
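The consistency term for gene expression can be sketched as follows, using the replacement described above for the mixture fractions. The function and argument names are hypothetical, and `g` stands for any callable that returns a per-sample expression profile for cell type `k`; the autoencoder itself and the remaining loss terms are not shown.

```python
import numpy as np

def expression_consistency(g, f_real, x_sim, s_sim, b_real, b_sim, beta, k):
    """Consistency loss over the signature of cell type k, per sample.

    g:      callable (bulk (batch, n_genes), k) -> (batch, n_genes)
    f_real: (batch, n_types) fixed fraction estimates for the real samples
    x_sim:  (batch, n_types) simulated fractions
    s_sim:  (batch, n_types, n_genes) simulated per-sample signatures
    """
    b_mix = beta * b_real + (1 - beta) * b_sim  # Eq. 2
    # Fractions of the mixture, replaced by the mixed fractions (Eq. 4)
    x_mix_k = beta * f_real[:, k] + (1 - beta) * x_sim[:, k]
    lhs = x_mix_k[:, None] * g(b_mix, k)
    rhs = (beta * f_real[:, k][:, None] * g(b_real, k)
           + (1 - beta) * x_sim[:, k][:, None] * s_sim[:, k, :])
    return np.sum((lhs - rhs) ** 2, axis=1)  # squared L2 norm per sample
```

If the real, simulated and mixed samples all share the same signature for cell type `k`, the two sides cancel and the loss is zero; any sample-specific deviation in the estimated signatures produces a positive penalty.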

Estimation of cell type fractions and comparison with flow cytometry

To quantitatively assess the deconvolution algorithm, we first deconvolve six different peripheral blood mononuclear cells (PBMC) bulk datasets for which cell type proportions have already been quantified using flow cytometry (Additional file 1: Table S1). To evaluate deconvolution performance, we utilize root-mean-squared error (rmse) and Pearson correlation (r) for cell type-wise comparisons and Jensen-Shannon distance (JSD) for sample-wise comparisons between estimated fractions and ground truth proportions. The evaluation metrics are defined in the “Evaluation metrics” section. To evaluate our approach, we compared it to state-of-the-art deconvolution methods, MuSiC [21], CSx [8], Scaden [9] and TAPE (TAPE-O and TAPE-A) [5], BayesPrism and BayesPrism-M [7], and bMIND [6]. MuSiC and CSx were chosen for their best performances in benchmarking studies [22, 23]. Scaden and TAPE are selected as both are deep learning-based deconvolution approaches, the latter of which, TAPE-A, performs an adaptation of the network weights for test samples. Since deconvolution is linear, we also considered linear MLPs as a deconvolution algorithm. Further details can be found under the “State of the art” section.
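For orientation, the three metrics might be computed as in the sketch below. The exact definitions used in this work are given in the “Evaluation metrics” section; the function names and the base-2 logarithm in the Jensen-Shannon distance (which bounds it between 0 and 1) are assumptions made for this example.

```python
import numpy as np

def rmse(true, pred):
    """Root-mean-squared error between two arrays of fractions."""
    true, pred = np.asarray(true, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((true - pred) ** 2)))

def pearson_r(true, pred):
    """Pearson correlation between two one-dimensional arrays."""
    return float(np.corrcoef(true, pred)[0, 1])

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))
```

For cell type-wise comparisons, `rmse` and `pearson_r` would be applied to one column (cell type) of the fraction matrices at a time, whereas `js_distance` is applied to one row (sample) at a time.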

We utilize the PBMC8k single cell RNA-seq dataset as reference (Additional file 1: Table S2) for all methods. The first two principal components of combined simulated and real PBMC datasets are visualized in Additional file 2: Fig. S1A, illustrating a domain shift between datasets.

DISSECT obtained the best JSD on every dataset (Fig. 2A), an average improvement of 6 percentage points over the second-placed algorithms. On the GSE65133 dataset, for instance, DISSECT outperforms second-placed Scaden by 8 percentage points (DISSECT: JSD = 0.145, Scaden: JSD = 0.222). Similarly, DISSECT obtains the best rmse on every dataset and improves over the second-placed algorithms by 2 percentage points on average (Fig. 2B). In addition, it achieved the best r on 4 out of 6 datasets (Fig. 2B).

Fig. 2

Evaluation of deconvolution algorithms on six datasets with ground truth information. A Per-sample Jensen-Shannon distance (JSD). Each plot corresponds to a dataset. From left to right and top to bottom: SDY67, Monaco I, Monaco II, GSE65133, GSE107572, and GSE120502. B Root-mean-squared error (rmse, top) averaged over cell types for each of the datasets. Datasets are listed on the x-axis. Pearson’s correlation (r, bottom) averaged over cell types

Furthermore, we computed macro-level r and rmse, i.e., the metrics computed without distinguishing between cell types, as performed previously in [9]. Note that in this setting, JSD remains unaffected as it is a sample-level metric and is therefore excluded. We observe that DISSECT consistently achieves the best rmse across all datasets while achieving the best r on 5 out of the 6 datasets (Additional file 2: Fig. S1).

Since MuSiC can take advantage of multi-sample references, we also evaluated MuSiC using blood data from the Immune Cell Atlas (ICA) (Additional file 1: Table S2). We also evaluated MuSiC with pre-selected marker genes (MuSiC-M) that were selected by CSx. MuSiC-M showed increased performance in 4 out of 6 datasets (Additional file 2: Fig. S2A-B). MuSiC also shows improved performance in the multi-sample setting in both rmse (Additional file 2: Fig. S2A) and r (Additional file 2: Fig. S2B). DISSECT still reaches best performance in rmse (on average 8 percentage points better) and r (on average 13 percentage points better) across all datasets.

Next, we evaluated the cell fraction deconvolution performance on the Monaco I dataset (Additional file 1: Table S1), which contains several closely related and rare cell types and constitutes a relatively hard cell deconvolution task, using the Ota dataset (Additional file 1: Table S1). With a correlation of 0.6, DISSECT’s average performance is 14 percentage points better than that of second-placed Scaden (Additional file 1: Table S3), while Scaden’s average rmse was marginally (1 percentage point) better than that of second-placed DISSECT (Additional file 1: Table S4). To validate that the performance improvement of DISSECT is due to the semi-supervised learning and consistency loss, we performed an ablation study on the SDY67 dataset by successively and cumulatively removing components of the algorithm and testing it again. The following components were removed successively: consistency regularization, the KL divergence loss (replaced by mean squared error), and the nonlinear activation function (replaced by the identity function). The ablation results are shown in Additional file 1: Table S5.

In summary, these results provide strong evidence that DISSECT consistently outperforms current state-of-the-art cell type deconvolution algorithms across six different datasets with ground truth information.

Consistency of predictions and relationship between cell type fractions and biological phenotypes

To further corroborate the above results, we evaluate DISSECT’s performance on three datasets that do not have paired flow cytometry data. In this section, we compare to other established biological facts as well as divergences over different reference single-cell datasets. The bulk datasets together with literature-based expected biological relationships of cell types are listed in Additional file 1: Table S1.

Brain

The ROSMAP dataset consists of 508 bulk RNA-seq samples from the dorsolateral prefrontal cortex (DLPFC) of patients with Alzheimer’s disease (AD) as well as non-AD samples (Additional file 1: Table S1). For 463 of these samples, Braak stages of disease severity have been quantified. Correspondingly, single-nuclei RNA-seq (snRNA-seq) for 48 individuals from the same cohort is available [24]. For 41 of these samples, cell type fractions based on immunohistochemistry (IHC) from a previous work exist [25]. It should be noted that IHC was performed for all neurons and as a result, comparison with respect to excitatory vs inhibitory neurons was not possible. Here, we consider two biological ground truths: first is the ratio of excitatory neurons to inhibitory neurons (Additional file 1: Table S1), and second is the neurodegeneration, or the loss of neurons with increasing Braak Stages [26]. We deconvolved ROSMAP using the Allen Brain Atlas reference (Additional file 1: Table S2).

We computed the JSD between the estimated fractions and the IHC cell type proportions. The fractions estimated by DISSECT had the best average JSD and provided the expected excitatory-inhibitory neuron ratio (3:1–9:1), while other methods generally underestimated this ratio (Fig. 3A). All methods recover a negative correlation between increasing Braak stages and the fraction of neurons (Additional file 2: Fig. S3).

Previously, it has been noted that snRNA-seq and IHC data provide different estimates for some cell types, notably microglia and endothelial cells [25]. It is interesting to observe that DISSECT and Scaden were the only methods where the estimates of microglia resembled closely those obtained from snRNA-seq and IHC data (Fig. 3B). We also computed r and rmse between the IHC cell type proportions and estimated fractions (Fig. 3C). With a correlation r of 0.901 DISSECT proved to be 14 percentage points better than the second-placed linear MLP. DISSECT also displayed the best rmse at 0.079.

Overall, the comparison to IHC and snRNA-seq ground truth information for the ROSMAP data further strengthens our claim that consistency regularization with DISSECT robustly improves cell deconvolution.

Pancreas

The GSE50244 bulk RNA-seq dataset consists of 89 pancreas samples from healthy and type 2 diabetes (T2D) individuals (Additional file 1: Table S1). For 77 of these samples, hemoglobin A1c (HbA1c) levels are available as ground truth information. We performed the deconvolution using three single-cell reference datasets: Baron, Segerstolpe, and Xin (Additional file 1: Table S2). Both the Baron and Segerstolpe datasets contain alpha, beta, gamma, delta, acinar, and ductal cell types, while only alpha, beta, gamma, and delta cell types are present in the Xin dataset. To measure the consistency of deconvolution algorithms, we measured JSDs between the fractions estimated using each of the three references (Additional file 2: Fig. S4A). While several methods showed considerable divergences, indicating reference-dependent deconvolution results, DISSECT displayed the most consistent results with a JSD of $\sim$0.1–0.2 across the three pairs. In terms of recovering the significant negative correlation between the estimated fractions of beta cells and HbA1c levels, DISSECT provided highly significant correlations of between − 0.45 and − 0.47 across the three references (Additional file 2: Fig. S4B). These results further suggest that DISSECT is both precise and robust in cell type deconvolution on real data and is comparatively less affected by the choice of single-cell reference.

Fig. 3

A Left: Boxplots showing JSD between estimated fractions and IHC-based cell type proportions from 41 individuals from ROSMAP. Right: Ratio of excitatory to inhibitory neurons computed from ROSMAP. Expected ratios lie between 3:1 and 9:1 as indicated by dashed lines. B Boxplots showing the microglia proportion as estimated by different methods. Median proportions of microglia estimated using snRNA-seq and IHC are labeled. C Correlations between estimates (y-axis) and IHC cell type proportions (x-axis). D JSD between predicted proportions from Kidney between experiments with Miao and Park as references. E Predictions from TAPE-O and DISSECT from Kidney. From left to right: Proximal tubule (PT), distal convoluted tubules (DCT), and macrophages (Macro). Each row indicates a reference. Error bars show standard deviations, while the height of the bars shows the mean prediction

Kidney

The GSE81492 dataset consists of 10 kidney samples of APOL1 mutant mice, which are a mouse model of chronic kidney disease (CKD) (Additional file 1: Table S1). We deconvolved the dataset using two single-cell references: Miao and Park (Additional file 1: Table S2). Similar to our experiments on the pancreas tissue, we computed the JSD between the cell type fractions estimated from the two references. DISSECT provided the best average JSD (0.09) out of all considered methods (Fig. 3D). We further compare the methods on the recovery of the expected relation of cell type fractions with the biological phenotype (Additional file 1: Table S1). Figure 3E compares the two best methods on JSD, DISSECT and TAPE-O, while Additional file 2: Fig. S5 presents these results on all cell types for all methods. It is known that CKD results in a decrease in proximal tubule cells (PT) and distal convoluted tubules (DCT). Cell type fractions estimated with DISSECT showed a significant loss of PTs and DCTs and a corresponding increase in macrophages, while TAPE-O provided much smaller differences between the control and the CKD model (Fig. 3E). PTs are the most abundant cell type in kidney, making up around 50% of a mouse kidney [27]. DISSECT correctly estimates the high abundance of PTs in healthy kidney, while TAPE-O underestimates them (Fig. 3E).

In summary, it is noteworthy that DISSECT shows state-of-the-art precision and robustness in cell type deconvolution across various ground truth information and 9 datasets, including PBMC, brain, pancreas, and kidney bulk RNA-seq samples. DISSECT also shows superior robustness to the choice of single cell reference.

Application to proteomics and spatial transcriptomics

It is conceivable that DISSECT’s consistency regularization for bulk RNA-seq cell type deconvolution should also lend itself to other biomedical data types in which domain shifts might be a problem. Applications might include, for example, the deconvolution of spatial transcriptomic (ST) and bulk proteomic data with supra-cellular resolution. In order to evaluate these potential use cases, we performed deconvolution of spatial transcriptomics and proteomics samples. Here, our aim is to test the hypothesis that DISSECT is applicable to these data modalities, and we do not intend to perform an exhaustive comparison to the multiple methods developed for these modalities. For comparisons on spatial transcriptomics, we consider four state-of-the-art spatial deconvolution methods: RCTD [28] and Cell2location (C2L) [29], which were shown to perform among the best in a benchmarking study [30], as well as SONAR [31] and CARD [32], both of which can utilize spatial information. For comparisons on proteomic deconvolution, we consider the tested bulk deconvolution methods.

Spatial transcriptomics

We evaluated DISSECT on the task of spatial deconvolution using mouse brain and human lymph node samples (Additional file 1: Table S1). As a ground truth, we considered relationships with biological phenotypes in line with our application of kidney and pancreas datasets (Additional file 1: Table S1). Due to the spatial nature of the ST, we could verify the recovery of neuronal layers in brain (Additional file 2: Fig. S6) and discernment of germinal centers in lymph node (Additional file 2: Fig. S7). DISSECT performs on par with C2L and RCTD on both datasets. The results are provided and discussed in detail in the Additional file 2: Supplementary Note.

Proteomics

To compare the ability of the tested deconvolution methods to recover cell type proportions from proteomics mixtures, we utilized 50 human brain samples (Additional file 1: Table S1). We applied each deconvolution method to these samples using the Allen Brain Atlas reference (Additional file 1: Table S2). Compared to other methods, DISSECT recovered excitatory neurons as the expected majority population in both datasets while keeping the excitatory to inhibitory neuron ratio around the expected range (3:1–9:1) (Additional file 2: Fig. S8). These results strongly suggest that DISSECT reaches state-of-the-art performance on proteomic cell type deconvolution and might be applicable to other biomedical data types.

Evaluation of DISSECT under domain shifts

To assess the impact of consistency regularization on the performance of DISSECT and other algorithms, we used the Ota dataset (Additional file 1: Table S1). Using this dataset in a dynamic domain shift setup (see the “Domain shift experimental setup” section), we evaluated the performance of deconvolution methods. We also included DISSECT without consistency (DISSECT w/o consistency) to assess the impact of semi-supervised learning under varying shifts. The performance of all methods dropped significantly for test sets with domain shifts (Additional file 2: Fig. S9). However, the drop in performance was much smaller for DISSECT than for other methods. Furthermore, a clear advantage of semi-supervised learning with consistency regularization is observed in comparison to DISSECT without consistency, especially in terms of rmse.

Estimation of cell type-specific


GENOME BIOLOGY
