spSeudoMap: cell type mapping of spatial transcriptomics using unmatched single-cell RNA-seq data

Composition of public datasetsHuman brain cortex

Spatial transcriptomic datasets of postmortem human DLPFC tissues were acquired from a 30-year-old subject without neurological disorders. Two Visium slides, named “151673” and “151676”, which are 300 μm apart were selected for the analysis [19]. The count matrix (151673: 33,538 genes across 3639 spots; 151676: 33,538 genes across 3460 spots) and cortical layer information of the spatial spots were utilized. The spots were assigned to a specific cortical layer [ranging from layers 1 to 6 and white matter (WM)] based on the spatial expression pattern of layer-specific marker genes and the expert’s opinion, as described in the paper [20]. The spots for which the layer could not be determined were classified as not available (NA). For the integrative analysis, a single-nucleus dataset obtained from the DLPFC of postmortem healthy individuals was adopted [21]. The count matrix (30,062 genes across 35,212 cells) and the cell type annotations defined based on the well-known marker genes were used [22]. Among the 33 cell types, 10 layer-specific excitatory neurons (from Ex_1_L5_6 to Ex_10_L2_4) were selected. This subpopulation of the single-nucleus dataset was considered a simulation of scRNA-seq data acquired by physically sorting excitatory neurons and was provided as an input for spSeudoMap.

Mouse brain coronal section

The Visium spatial transcriptomic dataset of the mouse brain coronal section was obtained from a C57BL/6 mouse that was at least 8 weeks old. The dataset, named “V1_Adult_Mouse_Brain,” was downloaded from the data repository provided by 10x Genomics [23]. Additionally, single-nucleus data extracted from mouse brain coronal tissue were included in the joint analysis [24]. The brains of a female and a male C57BL/6 mouse, both 56 days old, were sectioned and processed for sequencing analysis. The count matrices were composed of 32,285 genes and 2702 spots for spatial data and 31,053 genes and 40,532 cells for single-cell data. The cell types of the single-cell data were defined using reported marker genes [25, 26] and an in situ hybridization dataset from the Allen Brain Atlas (https://mouse.brain-map.org/static/atlas) [27], as described in the paper [11]. Among the 59 cell types, 23 region-specific neuron types were selected, and the subpopulation of single-nucleus data was utilized for the generation of the cell mixture.

Human breast cancer

Both spatial and single-cell transcriptomics data for breast cancer were extracted from an 88-year-old female subject (ID: 4290) without a previous history of treatment [28,29,30]. The cancer tissue was invasive ductal carcinoma (IDC) with ER-positive (90%, 3+), PR-positive (30%, 2+), and HER2-negative (1+) profiles, and cancer had invaded the adjacent skin and chest wall (pathologic T stage: T4b). The single-cell data (5789 cells and 29,733 genes) from the same patient were composed of cancer or normal epithelial cells, perivascular-like (PVL) cells, cancer-associated fibroblasts (CAFs), endothelial cells, and immune cells. This unsorted data was considered reference datasets for the prediction of spatial cell composition. In addition, CD45+ sorted single-cell data (29,900 cells and 31,993 genes) acquired from ER-positive IDC patients (n=4, ID: BC1, BC2, BC4, BC6) were utilized [31, 32], and the spatial immune cell composition was estimated using spSeudoMap.

Clustering and annotation of single-cell and spatial data

The clustering and visualization of transcriptomics datasets were implemented using Seurat (v.4.0.5) [33] in R (v.4.1.1). First, count normalization was performed, and the total count was set to 10,000 across all cells and spots. Then, the count matrices were natural log-transformed [ln(1 + X)], and the top 2000 highly variable genes (HVGs) were chosen by standardizing counts with the mean-variance relationship (vst method) [4]. Next, the matrices were regressed against the total count and scaled such that the mean and standard deviation for HVGs were 1 and 0, respectively. After reducing the dimensionality of the dataset using principal component analysis (PCA), the top 30 PCs were selected for the downstream analysis. A shared nearest neighborhood (SNN) graph was constructed based on pairwise cell-cell distances calculated on PC space. The Louvain community detection algorithm [34] was applied with a resolution of 0.5, and the resulting clusters were visualized with uniform manifold approximation and projection (UMAP) plots [35].

For human and mouse brain single-nucleus datasets, the cell annotation information offered by the paper was adopted and visualized on UMAP plots. In the unsorted human breast single-cell data, the cell annotation was originally obtained from the pooled single-cell data of all patients, and each cell type of one of the patients (ID: 4290) contained only a small number of cells. Therefore, the cells were reannotated based on the Louvain cell clustering results by an automatic cell type assignment tool [36] and cell type information from the paper [28]. Additionally, the cells from the CD45+ sorted breast dataset were annotated by the automatic cell type annotation method [36].

Meanwhile, for mouse brain spatial data, the spot clusters were named according to the corresponding anatomical structure (amygdala, caudoputamen, cortex, ependyma, hypothalamus, meninges, piriform cortex, thalamus, and white matter) defined in Allen Brain Reference Atlases (https://mouse.brain-map.org/static/atlas) [27].

spSeudoMap: spatial mapping of the transcriptomics of the cell subpopulations

To estimate a spatial map of cell types for single-cell data of cell subpopulations such as sorted or enriched single-cell datasets, a synthetic cell mixture that contains all cell types from spatial data is defined (Fig. 1). It is intended to create a reference dataset that is highly similar to the spatial transcriptomic data. For each mixture, the fraction of the exclusive cell types from the spatial data is assigned, and their synthetic gene expression profiles are created. The rest of the cell mixture is generated from the single-cell data. The process is implemented in Scanpy (v.1.5.1) [37] and Numpy in Python (v.3.7).

$$_=\left(\begin_& \cdots & _\\ \vdots & \ddots & \vdots \\ _& \cdots & _\end\right),_=\left(\begin_& \cdots & _\\ \vdots & \ddots & \vdots \\ _& \cdots & _\end\right)$$

$$_=\left(\begin_^& \cdots & _^\\ \vdots & \ddots & \vdots \\ _^& \cdots & _^\end\right),_=\left(\begin_^& \cdots & _^\\ \vdots & \ddots & \vdots \\ _^& \cdots & _^\end\right)$$

$$_^=\left(_/\sum_^y_\right)\times \textrm,_^=\left(_/\sum_^v_\right)\times \textrm$$

$$C=\left\_1,_2,\cdots, _h\right\}$$

RCsc: raw count matrix for single-cell data, RCsp: raw count matrix for spatial data, NCsc: total normalized single-cell count matrix, NCsp: total normalized spatial count matrix, x, y: number of cells and genes in single-cell data, u, v: number of spots and genes in spatial data, C: set of the index for overlapping genes between single-cell and spatial data. The gene index in set C is shared between the two transcriptomic data (for example, \(_}_1}\) and \(_}_1}\) indicate the counts of the same gene with index c1 in the second cell and spot, respectively).

Fig. 1figure 1

Mapping cell subpopulations to the spatial transcriptomic data with spSeudoMap. The cell types of the single-cell transcriptomic data acquired from cell sorting experiments can be spatially mapped to the tissue using spSeudoMap. The single-cell data of cell subpopulations are composed of sorted cells from the tissue, and the cell types cover only part of those in the spatial transcriptomics data. To create the reference dataset that mimics the spatial data, virtual cell mixtures, pseudospots, are defined in which all cell types from the tissue are included. First, the cell types exclusively present in the spatial data are aggregated and named pseudotypes. The virtual markers for the pseudotypes are selected from the top genes highly expressed in spatial pseudobulk compared to single-cell pseudobulk data. Then, the pseudotype fraction in the spatial spots is estimated from the module scores (sc.tl.score_genes in Scanpy) of the top 20 pseudotype markers. The fraction and gene expression of the pseudotypes are assigned based on the presumed pseudotype fraction and expression of a randomly selected spatial spot. Lastly, the target type proportion of the pseudospot, explained by cell types in the single-cell data, is filled with the randomly sampled cells from the single-cell data of cell subpopulations. Finally, the pseudospot is considered a reference dataset for the domain adaptation method CellDART [10]

First, a fixed number of cells (n; brain: 8 and breast: 10) and cell type annotations are randomly sampled from single-cell data with random weights, and a cell mixture named sub-pseudospot is created as with CellDART [10]. The cell types that overlap between the single-cell and spatial data are called “target types.” These cell types are assumed to be identical to the cell types from single-cell data. The Wilcoxon rank-sum test is performed, and the top l markers for each target cell type are pooled. A marker panel (total number of genes: m) is curated by extracting intersecting genes with the total gene list of spatial data. For each sub-pseudospot, the composite gene expression profiles of the marker panel are calculated.

$$\textrm=_i\right)}_^n=\left(_1,_2,\cdots _n\right)\kern0.50em _i\in \left\$$

$$T=_j\right)}_^m=\left(_1,t,\cdots _m\right)\kern0.50em _j\in C$$

$$F=_i\right)}_^n=\left(_1,_2,\cdots _n\right)\ \sum_^n_i=1$$

$$G=_j\right)}_^m=\left(_1,_2,\cdots _m\right)\kern0.50em _j=\sum_^n_i__j}^$$

n: total number of the cells to be randomly sampled from the single-cell data, m: total number of marker genes from single-cell data: S: randomly selected index of the cells, T: index of the marker genes in the single-cell count matrix (RCsc), F: proportion of each sampled cell in the cell mixture, G: composite gene expression profiles of the cell mixture (named sub-pseudospot).

Next, the cell types present in the spatial data but absent in the single-cell data are aggregated, and the aggregate is named the “pseudotypes.” The pseudotype markers are extracted from pseudobulk analysis of both transcriptomic data. It can be assumed that the genes showing greater pseudobulk expression in spatial data than single-cell data are pseudotype markers. Therefore, the summed counts for each gene are divided by the total counts, and the normalized counts are compared between the two datasets. The ratio of the normalized counts between spatial to single-cell data is log2-transformed, and the genes are sorted in descending order of the log fold change. The top k genes not overlapping with the target type markers are selected as pseudotype markers, and the top 20 genes are used as predictors of the pseudotype fraction. The target types and pseudotype markers are combined and named the “composite marker panel.”

$$\begin _=\left(p_j\right)_^h=\left(\sum_^xa_,\cdots\sum_^xa_\right)\\ _=\left(q_j\right)_^h=\left(\sum_^ub_,\cdots\sum_^ub_\right) \end$$

$$\begin logFC(j)=\log_2\left(\frac^hq_k}^hp_k}\right)\\ w=\underset\left[logFC(j)\right]=\left(r_1,r_2,\cdots,r_k\right)\end $$

BCsc: pseudobulk count matrix for single-cell data, BCsp: pseudobulk count matrix for spatial data, logFC: log fold change between the pseudobulk counts, w: index number for the top k genes with the highest logFC.

Then, the pseudotype fraction in a given spot of the spatial data is presumed to be correlated with an enrichment score for the top 20 pseudotype markers (scanpy.tl.score_genes in Scanpy). The genes of spatial data are divided into 25 bins according to the log-normalized expression level. For each marker gene, a total of 50 control genes are selected from the same bin and pooled. The enrichment score is calculated by subtracting the average expression of control gene pools from that of pseudotype markers [38]. The distribution of a created module score is scaled to have a given mean (M′) and standard deviation (σ′). The values over 1 and less than 0 are replaced with 1 and 0, respectively. The scaled module score is considered as a pseudotype fraction of a spatial spot, assuming a linear relationship between the two.

$$\begin Z_= \left(\frac-M}\sigma\right)\sigma^+M^ \\ Z_=\left(z_i\right)_^u=\left\ 0& if \;Z_<0\\ 1& if\;Z_>1\\ Z_ & otherwise \end\right. \end$$

MSsp: module scores for the top 20 pseudotype markers calculated in spatial data, M, σ: mean and standard deviation of MSsp, M′, σ′ : mean and standard deviation of the presumed pseudotype fraction in spatial transcriptomic spots, Zsp: presumed pseudotype fraction.

To create a reference dataset, a sub-pseudospot is aggregated to the pseudotype portion of a randomly chosen spot, and the combined gene expression for the composite marker panel is calculated (Additional file 1: Fig. S1). The resulting cell mixture is named the pseudospot. Since pseudotype markers are selected according to the log fold change in the pseudobulk approach, the expression of pseudotype markers in pseudotypes is expected to be significantly higher than that of target type markers. Additionally, for simplicity, the expression of all pseudotype markers in a spot is assumed to be directly proportional to that in the pseudotypes of the spot. Thus, the target type marker expression in pseudotypes is set to 0, and the pseudotype marker expression is assigned by multiplying the normalized count of the selected spot by the presumed pseudotype fraction.

$$\begin T^=\left(t_j^ \right)_^=\left\t_j&if\;j\leq m\\ r_&otherwise\end\right.t_j^\in C \end$$

$$\begin G^=\left(g_j^\right)_^=\left\\left(1-z_\beta\right)g_j&if\;j\leq m\\z_\beta b_}^+\left(1-z_\beta\right)\sum_^nf_ia_}^&otherwise\end\right.\end$$

T′: index of the marker genes and pseudotype markers in the single-cell count matrix (RCsc), G′: composite gene expression profiles of the modified cell mixture (named pseudospot), β: randomly selected index of the spot

Finally, the gene expression profiles of the pseudotypes and the sub-pseudospot are summed with the pseudotypes to target type ratio as a weight, and the integrated expression of a pseudospot is obtained (Additional file 1: Fig. S1). Of note, the pseudotype fraction is extracted from the same spatial spot that the expression profile is referenced. The algorithm for generating a pseudospot is summarized with the above formulae. The pseudospots and spatial transcriptomic data are jointly provided as inputs for the domain adaptation in CellDART [10].

Performance evaluation of spSeudoMap

The capability of spSeudoMap to predict the spatial composition of cell subpopulations was assessed in the three different datasets. In human DLPFC tissue, the accuracy of the model to localize 10 layer-specific excitatory neuron types to corresponding cortical layers was measured by receiver operating characteristic (ROC) analysis. The layer annotation of each spatial transcriptomic spot (layers 1 to 6 and WM) offered by the paper was considered a gold standard of spot identity [20]. The area under the ROC curve (AUROC) was calculated to explain the performance across all neuron types. For instance, a fraction of Ex_3_L4_5, the cell type known to be highly localized in layers 4 and 5 is predicted in all spots, and the “sensitivity” and “1 – specificity” of classifying a spot as belonging to layers 4 and 5 are plotted for every threshold value. The AUROC is calculated and it represents the performance of the model in Ex_3_L4_5. In the case of the mouse brain sample, the spot clusters were named after the anatomical region, and the spatial distribution of region-specific excitatory neuron types across the clusters was examined. To further assess the performance of spSeudoMap in brain tissues, the spatial distribution of pseudotype fraction estimated from spSeudoMap was compared with the spatial expression pattern of missing cell type markers. The missing cell type markers were discovered by performing differential gene expression analysis between excitatory neurons and others in unsorted single-cell data containing whole cell types. Wilcoxon rank-sum test was implemented and the Bonferroni method was applied for multiple comparison corrections. The top k (the number of pseudotype markers) genes showing the highest log2 fold change were selected and the module score was calculated. Spearman’s correlation coefficient was computed between the predicted pseudotype fraction and the module score across all spots. Finally, in human breast tissue, the spatial localization patterns of the top immune cell subpopulation were mapped. The spatial correlation patterns were evaluated by Spearman’s correlation coefficient and visualized with a heatmap.

The performance of spSeudoMap was compared in two different aspects. First, the spSeudoMap was tested whether it is superior to existing cell type decomposition methods (CARD, CellDART, Cell2location, DSTG, RCTD, and SPOTlight) in mapping the cell types in the situation that single-cell transcriptomic data only has subpopulation of cell types. The count matrix for selected cell types from the single-cell dataset of human DLPFC was provided as an input for the decomposition tools. CARD predicts spatial cell composition based on non-negative matrix factorization (NMF) and considers spatial correlation by a conditional autoregressive model [12]. Spots with a total count of less than 100 and genes with the number of spots showing a non-zero count of less than 5 were excluded from the analysis. CellDART decomposes spatial cell proportions based on domain adaptation [10]. The numbers of sampled cells in the cell mixture were set to 8 and 10 in brain and breast cancer tissues, respectively. The number of markers per cell cluster was set to 20. Cell2location predicts spatial cell proportion based on the Bayesian statistical model [11]. The expected cell abundance in each spatial spot was set to 8 in the brain tissue. The number of iterations was 30,000, and default values were given to the rest of the parameters in the model. DSTG constructs a linked graph between the spatial data and cell mixture created from single-cell data and applies a graph neural network to estimate spatial cell fraction [9]. RCTD predicts spatial cell compositions using maximum-likelihood estimation based on Poisson distribution [6]. The doublet mode was set to “full mode” which does not limit the number of cell types in the spots. SPOTlight utilizes NMF with regression to estimate the spatial distribution of cell types [8]. A total of 100 cells were randomly sampled from each cell cluster in single-cell data to enhance the computational speed. For all methods, the default parameters suggested in the user guide were applied for analyses.

Second, cell subpopulation mapping results by spSeudoMap were compared with the reference spatial map of cell types. As a surrogate to the real spatial distribution, the reference map was defined as spatial cell composition predicted using the unsorted single-cell dataset covering all cell types. It was presumed that the closer the spatial distribution of cell types obtained from spSeudoMap is to the reference map, the higher the performance of spSeudoMap. CellDART and Cell2location were selected as methods for obtaining the reference map since they showed similar high accuracy in localizing the layer-specific excitatory neurons [10, 11].

Exploration of optimal parameter range in spSeudoMap

The key parameters for spSeudoMap are the number of markers per single-cell cluster (n), the ratio of the total number of single-cell to pseudotype markers (m/k ratio), and the mean and standard deviation of the presumed pseudotype fraction in spatial spots (M′ and σ′). The performance of the spSeudoMap was tested across the various parameters in human DLPFC datasets (slide number: 151676). Since the proportion of 10 layer-specific excitatory neurons was 0.53 among the single-cell data, M′ was set to 0.47. The cortical layer annotation in spatial data was used as a reference, and the layer discriminative accuracy of the predicted neuron fraction was assessed by AUROC. In general, spSeudoMap was capable of stably predicting the spatial distribution of neuron subpopulations with a median AUROC over 0.5 with n larger than 20, m/k larger than 1, and σ′larger than 0.05 (Additional file 1: Fig. S2). The corresponding parameter ranges were selected for the downstream analyses. For the human brain (slide 151673) and mouse brain tissues, n was set to 80, the m/k ratio to 4, and σ′to 0.1. In human breast cancer tissue, n was set to 40, the m/k ratio to 2, and σ′to 0.1. In brain tissues, M′ was assigned based on the proportion of non-excitatory neurons in the unsorted single-cell data containing all cell types and in breast cancer tissue, based on the proportion of non-immune cells (human brain: 0.47, mouse brain: 0.67, and breast cancer: 0.83). Other parameters for the domain adaptation were given as the user guidelines of CellDART [10].

留言 (0)

沒有登入
gif