Multi-slice spatial transcriptome domain analysis with SpaDo

Data description

SpaDo is designed to be compatible with all spatial transcriptomic sequencing technologies and platforms. In this study, we specifically tested its performance on the osmFISH, STARmap, seqFISH + , MERFISH, 10 × Visium, ST, and Slide-seq V2 platforms (Additional file 1: Table S1). Notably, the DLPFC dataset [33] includes 12 human DLPFC slices sampled from three individuals. The DLPFC layers and white matter (WM) were manually annotated by the original study. To obtain the cell type abundance of the above 12 DLPFC slices, we performed spot deconvolution using Cell2location [26] with a single-cell transcriptomic data [47] of DLPFC as reference.

The MERFISH dataset [48] used in our study comprised three samples with Animal_IDs 31, 32, and 33. These samples were characterized by a Bregma value of 0.16 and a specific behavioral trait described as “Aggression to adult”.

The RCC dataset [34] used in our study includes five RCC slices (10 × Visium): “GSM5924041_ffpe_c_51”, “GSM5924043_frozen_a_3”, “GSM5924044_frozen_a_15”, “GSM5924046_frozen_b_1”, and “GSM5924047_frozen_b_7”. Among these, the first four slices contain one or two TLS regions, while the last slice without TLS region is taken as negative control. We performed spot deconvolution of the above 5 RCC slices using Cell2location with single-cell transcriptomic data [49] (P76 and P90) of RCC as the reference, and the annotations of cell subtypes from the original study were merged to 17 main cell types.

The human heart dataset [40] used in our study consists of 19 slices (ST), representing the developing human heart at three developmental stages in the first trimester: 5, 6, and 9 post-conception weeks (PCW). We used single-cell transcriptomic data from the original study as a reference to obtain the cell type deconvolution results with Cell2location.

The chicken heart dataset [42] used in our study consists of 11 slices (10 × Visium) obtained from the early to late four-chambered heart stage: 4, 7, 10, and 14 days. We used single-cell transcriptomic data from the original study as a reference to obtain the cell type deconvolution results with Cell2location.

The organoid dataset [41] used in our study comprises 10 slices (Slide-seq V2) obtained from the developing human cortical organoid at 1, 2, and 3 months. Given the high resolution of Slide-seq V2 (Each spot has a diameter of 10 μm and contains about 1–3 single cells), we analyzed the organoid dataset as single-cell resolution spatial transcriptomic data. Cell type labels were obtained from the original study.

Data preprocessing

We applied different normalization methods depending on the spatial transcriptomic platform used for data generation. For datasets generated from osmFISH, STARmap, and MERFISH platforms, we followed the normalization methods recommended by their respective original studies. This involved dividing the gene counts per cell by the total counts per cell, followed by a log transformation (log(1 + normalized counts)). For datasets obtained from other platforms, including seqFISH + , 10 × Visium, ST, and Slide-seq V2, we performed the standard normalization procedure using the Seurat package. This involved normalizing the gene expression measurements for each cell/spot by the total expression, multiplying the result by a scaling factor of 10,000, and finally applying a log transformation (log(1 + normalized counts)).

Cell type annotation

SpaDo employs distinct strategies for cell type annotation depending on the resolution, whether it is at the single-cell or spot level.

For single-cell resolution spatial transcriptomics data, in our study, we utilized cell type labels from the original studies. In cases where these labels were unavailable, we employed Seurat v4, selecting the top 2000 highly variable genes with default parameters and obtaining cell type annotation results using the parameter “resolution = 2”. Importantly, the robustness of SpaDo to the “resolution” parameter of Seurat v4 was demonstrated in our analysis (Fig. 3c). It is important to note that for multi-slice single-cell resolution transcriptomic data, Seurat must be applied to the entire multi-slice gene expression profile to ensure consistent cell type annotation results.

On the other hand, for spot resolution spatial transcriptomics data, in our study, SpaDo specifically utilized Cell2location [26] to obtain spot annotations. The robustness of SpaDo to other spot deconvolution methods, such as RCTD and SPOTlight, was also demonstrated (Fig. 3d). It is important to highlight that, for multi-slice spot resolution transcriptomic data, Cell2location should be applied to the entire multi-slice dataset using the same single-cell reference to ensure consistent spot deconvolution results.

Calculating SPatially Adjacent Cell type Embedding

SpaDo employed two distinct strategies to calculate SPatial Adjacent Cell type Embedding (SPACE) for single-cell and spot resolution spatial transcriptomic data, respectively. For single-cell spatial transcriptomic data, the K-nearest neighbors (KNN) method was used to identify the adjacent cells of each cell because KNN is able to take full advantages of density information. Then, SpaDo calculated the cell type proportion of these adjacent cells, obtaining the SPACE for each cell. For spot resolution spatial transcriptomic data, SpaDo obtained adjacent neighbors of each spot by searching within a specified radius as spot is distributed evenly. Then, SpaDo calculated the cell type proportion using the deconvolution results of these adjacent spots.

In multi-slice domain detection, we initially generated a SPACE for each slice. Given the consistent cell type annotations across all slices, meaning they share the same embedding space, the SPACEs of cells/spots from different slices became comparable. SpaDo achieved this by concatenating each individual SPACE, thereby obtaining a unified SPACE representation for multiple slices.

Spatial domain detection

In this study, spatial domains are defined as clusters of cells or spots with similar SPACE from single or multiple slices.

For spatial domain detection in each slice, a distance matrix of SPACE was calculated firstly. To measure the similarities of SPACEs, SpaDo were equipped with two distance metrics, the Jensen–Shannon divergence (JSD) [50] and Manhattan distance. As a widely used measure of distribution distance, the JSD is based on the Kullback–Leibler divergence (KL) between two distributions. The KLD of SPACE between two cells or spots P and Q is defined as:

$$}\left(P,Q\right)=\sum _*}(_/_)$$

As a symmetrized, finite, and smoothed version, JSD is defined as follows:

$$}\left(P,Q\right)=(}\left(P,M \right)+}(Q,M))/2$$

where M = (P + Q) / 2. A smaller JSD value indicates a higher similarity between the distributions, while a larger value suggests greater dissimilarity.

Manhattan distance (MD) is also equipped as a candidate in SpaDo software because it is much faster than JSD by sacrificing a little accuracy. MD is the sum of absolute differences between points in their cartesian coordinates and is calculated as:

$$}\left(P,Q\right)=\sum |_-_|$$

Next, SpaDo detects spatial domains by applying hierarchical clustering to the distance matrix. Hierarchical clustering is performed using the hclust() function from R package with default parameters.

For multi-slice spatial domain detection, firstly, SpaDo calculates SPACE of cells or spots for each slice. Because all slices have consistent cell type annotations, i.e., they have the same embedding space, we concatenate SPACE of each slice together to calculate the JSD distance and then perform hierarchical clustering to detect spatial domain. Finally, each domain is backtracked to each slice.

The selection of proper domain numbers

The spatial domain with different resolutions can be obtained by selecting proper domain numbers (Additional file 1: Fig. S1). To determine the optimal spatial domain number, SpaDo offers three optional strategies: (1) automatic selection using the cutreeDynamic() function with parameter “deepSplit = 2” from R package dynamicTreeCut [51]; (2) manually set by users based on their prior knowledge or specific requirements; and (3) visualization of the hierarchical trees and UMAP clustering results can assist in determining the optimal spatial domain number. The last two strategies allow for customization, providing a high level of flexibility and interpretability in the analysis of spatial domains. In this study, for each test data, different approaches were employed. If region labels were provided in the original study, the number of regions will be used as the spatial domain number. If not, SpaDo adopted the first strategy to determine the optimal spatial domain number.

Spatial domain annotation with spatial reference

SpaDo utilizes the annotated datasets, called spatial reference, to annotate newly acquired spatial transcriptomes, referred to as spatial domain queries. Specifically, this process consists of the following four steps: (1) for each spatial domain in the spatial reference, the centroid is calculated by averaging the SPACE of all cells/spots identified as the same domain; (2) the SPACE of each cell/spot in the spatial query dataset is calculated; (3) the JSD distance between the SPACE of each cell/spot in the spatial query and each centroid of SPACE of spatial domain in the spatial reference is calculated; and (4) the spatial domain in the reference with the minimum JSD distance is assigned as the annotation for corresponding cell/spot in the spatial query.

Multi-slice clustering analysis

Intuitively, SpaDo performs multi-slice clustering analysis by assessing the similarity between multiple slices. The similarity is calculated by spatial domain composition. Firstly, SpaDo performs the multi-slice spatial domain detection for multiple slices. Then, the spatial domain composition of individual slice is calculated, which is defined as:

$$^=\left[\frac_}_}\right],j=1,\dots N$$

where \(^\) is a vector, meaning the spatial domain composition of the i-th slice, and \(N\) is the number of detected domains in all slices. \(_\) is the number of cell/spot identified as the j-th domain in the i-th slice. If the j-th domain is absent in the i-th slice, the \(_\) is set to 0. \(_\) is the number of cell/spot in the i-th slice.

Finally, SpaDo performs hierarchical clustering on the spatial domain composition of all slices using pheatmap() function from R package pheatmap with default parameters.

Parameter settings in this study

SpaDo incorporates several key parameters, including the number of k nearest neighbors for single-cell spatial transcriptomics data, searching radius for spot resolution spatial transcriptomics data, and the domain number.

In all tests conducted in this study, the number of k nearest neighbors is consistently set to 30. The selection of the domain number varies based on the test data. If region labels are available in the original study, the number of regions is used as the spatial domain number. In cases where region labels are not provided, SpaDo automatically selects the domain number using the cutreeDynamic() function with the parameter “deepSplit = 2” from the R package dynamicTreeCut [51].

Regarding the searching radius, the default value is “Radius = 2” for all test data, except for the human heart dataset [40]. For the human heart dataset [40], derived from old ST where each spot has a diameter of 100 μm and contains about 10–40 single cells [1], the searching radius is set to 1. This adjustment is made to accommodate the specific characteristics of the dataset.

Sensitivity to distance metrics

It is important to note that SpaDo calculates the distance of SPACE for each spot or cell by default using JSD. To validate its robustness, we systematically compared SpaDo’s performance when employing various distance metrics, which include Euclidean distance, Manhattan distance, Spearman correlation, Pearson correlation, Cosine similarity, and JSD. Specifically, for Spearman, Pearson, and Cosine, where the results represent similarity within the range of − 1 to 1, the corresponding distance was obtained using “1—similarity”. This analysis provides a comprehensive evaluation of SpaDo’s stability across a spectrum of distance measurement approaches.

Sensitivity to sequencing depth and dropouts

The sensitivity of SpaDo to sequencing depth and dropouts was assessed to account for the inherent noise in spatial transcriptomics data. Specifically, we artificially increased the dropout rate in the DLPFC_151673 and osmFISH datasets by randomly setting 10%, 30%, and 50% of the nonzero expression values to zero (Fig. 3g, h). For each dataset, n = 20 random dropout assignments were performed.

The batch effects evaluation of SpaDo

SpaDo effectively addresses batch effects, as demonstrated through a comprehensive analysis involving four spot resolution DLPFC datasets (DLPFC_151673, DLPFC_151674, DLPFC_151675, DLPFC_151676), as well as three single-cell resolution MERFISH datasets. For the spot resolution DLPFC datasets, we compared the SpaDo embedding strategy SPACE with embeddings obtained from SEDR and SpaGCN, both with and without harmony [38]. Harmony was applied using default parameters. Subsequently, to evaluate the performance of SEDR and SpaGCN after incorporating harmony, we calculated the “1-Pearson correlation” as the distance between each spot embedding. In contrast to SpaDo, we refrained from using JSD in this context, given that the embeddings from SEDR and SpaGCN are not distributions and are thus unsuitable for JSD. Following this, we applied the same hierarchical clustering method as SpaDo to conduct multi-slice domain detection for SEDR and SpaGCN, with the specified domain number set at 7.

For the three MERFISH datasets, we conducted a parallel comparison involving the SpaDo embedding strategy SPACE and embeddings derived from SEDR, SpaGCN, and STAGATE, with and without harmony. We followed the same analytical steps as described above, employing default settings for domain number selection.

Evaluation metrics

To evaluate the performance of SpaDo, ground-truth information such as the true spatial domain labels were utilized to calculated two performance metrics: adjusted rand index (ARI) and macro-F1.

For spatial domain detection with single slice, ARI was used to evaluate the performance of each method:

$$}=\frac_\left(\begin_\\ 2\end\right)-\left[_\left(\begin_\\ 2\end\right)_\left(\begin_\\ 2\end\right)\right]/\left(\beginn\\ 2\end\right)}\left[_\left(\begin_\\ 2\end\right)+_\left(\begin_\\ 2\end\right)\right]-\left[_\left(\begin_\\ 2\end\right)_\left(\begin_\\ 2\end\right)\right]/\left(\beginn\\ 2\end\right)}$$

where \(_\) is the number of cells that are assigned to the i-th predicted domain label with their true domain label as the j-th label, \(_=_\left(_\right)\) and \(_=_\left(_\right)\).

For spatial domain annotation with multiple slices, macro-F1 was used to evaluate the performance of each method:

$$}-F1=\frac\sum_^\frac}}_\times }}_}}}_+}}_}$$

where N denotes the number of spatial domains in a dataset. \(}}_\) and \(}}_\) are the precision and recall of the i-th spatial domain in the dataset.

Benchmarking methods

In this study, we benchmarked SpaDo with Scanpy, Seurat v4, SEDR, SpaGCN, STAGATE, BayesSpace, and PASTE in different tests with default parameters (Additional file 1: Table S2).

For spatial domain detection using single-cell spatial transcriptomic data, we benchmarked Scanpy, Seurat v4, SEDR, SpaGCN, and STAGATE with default parameters. BayesSpace was excluded from this scenario as it was specifically designed for spot resolution spatial transcriptomics data. Identical number of domains was set as in the original study.

For spatial domain detection using spot level spatial transcriptomic data, we benchmarked Scanpy, Seurat v4, SEDR, SpaGCN, and BayesSpace with default parameters. STAGATE was excluded due to its occasional instability and failure in handling spot resolution data. Identical number of domains was set as in the original publications.

For reference-based spatial domain annotation, we benchmarked PASTE and Seurat v4 with default parameters.

In the sensitivity test of SpaDo combined with spot deconvolution methods, we benchmarked Cell2location against RCTD and SPOTlight. For Cell2location, the single-cell regression model was trained with default parameters and the Cell2location model was obtained with parameter detection_alpha = 20 for all datasets. Specifically, “N_cells_per_location” was set to 10 for RCC and DLPFC datasets and 20 for human and chicken heart datasets. RCTD and SPOTlight were performed with default parameters.

In all benchmarking tests, the tools were executed on a system with Intel Xeon E5-2696 v4 CPU (2.20 GHz) and GeForce GTX GPU 1080 Ti.

留言 (0)

沒有登入
gif