The covariance environment defines cellular niches for spatial inference

Computational methodsMSSI

When comparing the spatial distribution of genes or markers across a tissue, it is imperative to have a robust metric that takes spatial structure into account. Although ubiquitous metrics, such as Pearson correlation, SSIM and root mean square error, can provide some insight, they lack spatial context (for example, cell–cell proximity or spatial patterns) and measure only per-cell discrepancy.

To devise a metric for spatial data, we borrowed the MS-SSIM45, a ubiquitous metric for the quality of image reconstruction, from computer vision. Given two images, MS-SSIM iteratively downsamples each image, creating an image pyramid79—a multiscale signal representation consisting of the same image at multiple resolutions. MS-SSIM returns a weighted geometric average of the standard SSIM scores between the two images at each scale of the pyramid. Standard SSIM for two images, x and y, is

$$}(x,y)=l(x,y)\cdot c(x,y)\cdot s(x,y),$$

where

$$\beginl(x,y)\,=\,\displaystyle\frac_}}_}}+^}_^+_^+\frac},c(x,y)=\displaystyle\frac__+^}_^+_^+^},\\s(x,y)=\displaystyle\frac_+\displaystyle\frac^}}__+\displaystyle\frac^}}\end$$

M represents the maximum values between x and y; μx and μy are their average values; σx and σy measure how each varies; and σxy represents how much they covary. l(x,y), c(x,y) and s(x,y) are measures of ‘luminance’ (signal brightness), contrast and structure, respectively. Although SSIM is meant for images, it can also be calculated between any two vectors of similar sizes.

We introduce the MSSI as an adaptation of MS-SSIM to spatial transcriptomics that compares count matrices from segmented cells, rather than pixels, using a neighbor graph of spatially neighboring cells to capture structure. Intuitively, MSSI is a spectral analog of MS-SSIM; by rephrasing image coarsening to its graph-based counterpart, we can apply it to segmented cells and produce a multiscale, spatially driven score of expression reconstruction quality.

MSSI compares the expression profiles of two genes from a multiplexed image: $X=_\}}_^$ and $Y=_\}}_^$, where the index i enumerates segmented cells, and x and y can be either (1) two different genes or (2) a ground truth gene and its imputed value. In addition, the spatial coordinate of each cell is $D=_\}}_^$. We first compute the k nearest neighbor (kNN) graph G1 of segmented cells from $_\}}_^$. To generate a subsampled version of the kNN graph, we use a graph coarsening algorithm80, which pools nodes together based on their connectivity pattern, similarly to how image downsampling groups pixels together (Fig. 2c). We iteratively coarsen and blur the graph four times by a factor of 2 and produce the expression of every gene at each scale.

Mathematically, each coarsening step produces a pooled version of the graph $^\}}_^$ and a coarsening operator $^\}}_^$, which is the mapping between nodes at one scale to nodes at the next and allows us to generate pooled versions of the gene expression signals:

After MS-SSIM, we compute the MSSI between the expression profiles at each scale and return their weighted geometric mean. In detail, we compute the l, c and s SSIM-related values at each scale and derive MSSI based on their weighted product, as for the MS-SSIM:

$$}(X,Y,D)=_^,^)}^_}\mathop\limits_^c^,^)}^_}s^,^)}^_},$$

where the weights are equivalent to those in MS-SSIM45:

$$\alpha =(0.0448,\,0.2856,\,0.3001,\,0.2363,\,0.1333)$$

When Xi,Yi are anti-correlated ($_ < 0$), s is negative, which prevents computing the weighted geometric mean; we, thus, clip negative values to 0. This implies that if, at any scale, Xs,Ys are anti-correlated, the MSSI will be 0, its lowest possible value. We also normalize the original-scale gene expression to be between 0 and 1 but do not re-normalize at each coarsening scale.

Spatial covariance representation

Our spatial covariance framework includes three components: the COVET statistic, a similarity metric and an algorithm to robustly and efficiently compute the COVET metric. The COVET framework assumes that the interplay between the cell and its environment creates covarying patterns of expression between the cell and its niche, which can be formulated via the gene–gene covariance matrix of niche cells. The COVET statistic constructs a shifted covariance matrix (which preserves algebraic properties of the covariance matrix) and, thus, enables the use of any measure of statistical divergence between covariances to define a principled quantitative similarity metric to compare niches. The key is to build the COVET statistic in such a manner that two COVET matrices are comparable and to design a computationally efficient algorithm to quantify the statistical divergence between them.

COVET

The inputs to COVET are (1) the gene expression matrices ($X\in ^$), where n is the number of cells and g is the number of genes profiled; (2) the location of each cell in situ; and (3) a parameter (k) that defines the number of nearest cells to be included in the niche. For each cell, we identify its k nearest cells (excluding the cell itself) based on their spatial proximity and construct a niche matrix $_=\left\_\in\right.$$\left.^|\,j\in kNN(i)\right\}$, which represents the gene expression vector for each of those nearest neighbors. This produces an n × k × g tensor $\varOmega =\,\left\_\in\right.$$\left.^|i=1,\ldots ,n\right\}$, which combines the niche matrices of every cell.

The fundamental goal of COVET is to transform those niche matrices into effective representations of a cell’s niche. To this end, we calculate the ‘shifted’ gene–gene covariance matrix between cells in each niche matrix, where, instead of using the classical formulation

$$_}^}=}(_)=\frac(_-}_)^}(_-}_)$$

we swap the niche mean expression $}_}$ with the total expression average $}$ (computing the mean over the entire dataset). This enables direct comparison between covariance matrices, as they are constructed relative to the same reference:

$$_=}(_)=\frac(_-}})^}(_-}}).$$

This creates a representation relative to the entire population, which can better highlight the features that are unique to each niche while also holding the same algebraic properties that the standard covariance matrix holds, namely being positive semi-definite (PSD). Therefore, we can harness measures of statistical divergence to derive a metric on the COVET matrices and quantify differences and similarities between niches. Although we can conceptually use any statistical divergence measure, metrics such as Kullback–Leibler (KL) divergence and Bhattacharyya30 distance are too computationally intensive and lack interpretability.

Distance between COVET matrices

To meaningfully compare between niches, we cannot simply use the sum difference between two niche matrices Ei and Ej, as changing the cells’ order would change the result (whereas there is no meaning to any given order). An intuitive way to quantify niche similarity is by finding the best matching of cells between niche matrices by solving the assignment problem81. Optimal transport (OT)82 is a relaxed version of the assignment problem, where, instead of matching cells one to one, OT finds the best ‘soft assignment’ between cells. However, because this approach has no closed-form solution and does not scale to large datasets, we can use the closed-form solution of OT between covariance matrices, known as the Fréchet distance29, instead:

$$_chet}}(_,_)\,=}(_)+}(_)-2\cdot }(\sqrt__}).$$

The Fréchet distance has time complexity of O(k3) and is, thus, computationally intractable for large-scale datasets, which would require billions of pairwise computations between all niches. To speed up computation, we swap the matrix square root (MSQR) and product operation in the last term of the Fréchet distance and define the AOT distance as:

$$_}(_,_)\,=}(_)+}(_)-2\cdot }\left(\sqrt_}\sqrt_}\right)\,$$

If Σi and Σj are commutative, this is no longer an approximation and ΔAOT = ΔFréchet. Both the approximate and true Fréchet distance require O(k3) operations between each pair of niches and O(n2k3) to compute the full distance matrix; however, using the identity that for symmetric matrices, $}(AB)=\,\sum __\cdot _$, we arrive at:

$$\begin_}(_,_)=\displaystyle\sum _\left(\sqrt__}}\cdot \sqrt__}}+\sqrt__}}\cdot \sqrt__}}-2\cdot \sqrt__}}\cdot \sqrt__}}\right)\\ =\displaystyle\sum ___}}-\sqrt__}}\right)}^=_}-\sqrt_}\,||}_^\end$$

Therefore, when working in square root space, we do not require any computationally extraneous matrix multiplication and many calculations of MSQR. Instead, we first calculate the MSQR of each COVET matrix, which is $O(n^$, and then simply calculate pairwise (squared) Euclidean distance for a total time complexity of $O(n^+^^)$, which is substantially more efficient than $O(^^)$ for large n. For a given PSD matrix A, there could be many possible solutions B that fulfill the equation B2 = A. Although this underdetermination is problematic, there is a unique symmetric PSD solution for the MSQR83. This solution can be found via spectral decomposition and reconstructing with standard square root of the matrix eigenvalues:

$$\sqrt=\sum _\sqrt_}__^},$$

where λi, vi are the eigenvalues/vector of A.

Because AOT can be formalized as the $_^$ between MSQR of COVET matrices, it allows for direct use of any algorithm that is based on the squared Euclidean distance, such as UMAP, tSNE84 and FDL58, clustering31 and DC32 analysis. We can simply compute MSQR of the COVET matrices, flatten the resulting matrices into one-dimensional (1D) vectors and apply the default implementations of all the mentioned algorithms. We can further leverage the squared Euclidean distance representation of the AOT metric and use computational accelerators designed to compute classical pairwise distances for additional speed gains.

We demonstrate that AOT is a good approximation by benchmarking against the true Fréchet distance and the Bhattacharyya distance, another common metric for distances between covariance matrices. Across various sizes of random sets of 64 × 64 covariance matrices, we test the runtime to compute the 10 nearest neighbors matrix in covariance space. As covariance matrices are PSD, to randomly generate n covariance matrices of 64 × 64 elements, we first sample n random 64 × 64 matrices (using the standard normal) and multiply each by its transpose, as a matrix Gramian is always PSD:

$$_}_^=\left\_\cdot _^}\right\}_^;_ \sim N(0,_^}6^}).$$

We find that, whereas AOT produces accurate similarities, its runtime is at least an order of magnitude smaller than that of other metrics, with Fréchet and Bhattacharyya failing on sample sizes larger than 3,000 matrices due to out-of-memory error. Using a GPU implementation of kNN distance built for the Euclidean metric, which can be easily adapted for AOT, the spatial covariance metric is indeed scalable to massive datasets, taking less than 1 min to compute the kNN matrix between 100,000 samples (Extended Data Fig. 1a).

We observe accurate approximation on real COVET matrices, calculated from the eight nearest neighbors of the pharyngeal mesoderm cells from the seqFISH assay37, using the 64 most variable genes among the 350 imaged. Despite its efficiency, AOT does not sacrifice accuracy and concurs highly with Fréchet. We calculate the pairwise distance between the pharyngeal mesoderm COVET matrices according to Fréchet, AOT, Bhattacharyya and naive L2 between matrices. For each pharyngeal mesoderm cell, we find its k nearest neighbors for every metric and compute their Jaccard index with the Fréchet nearest neighbors. Across a wide range of k, AOT-based kNN is highly congruent with Fréchet kNN, whereas Bhattacharyya and naive L2 distances are not (Extended Data Fig. 1b). Qualitatively, using Fréchet, AOT and Bhattacharyya pairwise distances to compute two-dimensional (2D) embeddings and PhenoGraph clusters for the COVET matrices returned similar results (Extended Data Fig. 1c,d).

Choice of k

By default, we select k = 8 neighbors to construct COVET, which usually captures the immediate niche of a cell, but the exact choice of k should reflect the data. For all datasets analyzed in this study, we kept the value of k at the default, demonstrating how finding the optimal k is not required to gain insights from ENVI and COVET. Still, given the computational efficiency of both algorithms, we recommend that users attempt a range of k values at different scales, such as 8, 20 and 50. Users can visualize the ENVI-learned COVET representations with AOT and choose the most appropriate scale for their biological question. We also implemented an option for COVET to be computed on all cells within a given radius, rather than constant number of neighbors, to account for differences in cell density within a tissue.

ENVI algorithm

The ENVI algorithm integrates scRNA-seq and spatial transcriptomics data into a common latent embedding, in a manner that can infer spatial context for scRNA-seq and missing genes for spatial data. The core assumption of ENVI is that the interplay between a cell’s phenotype and its microenvironment, as captured by the COVET matrix, empowers better data integration.

ENVI is grounded on autoencoder variational inference but diverges from previous work9,25,47. Although current methods only model the expression of genes included in both single-cell and spatial datasets, ENVI explicitly incorporates both microenvironment context for spatial data and expression of the full transcriptome for scRNA-seq data. In addition, ENVI contains two decoders: one for expression, which includes additional neurons that learn gene expression only from scRNA-seq data, and one to predict spatial context. Using these decoders, ENVI trains the VAE27 to reconstruct both full transcriptome expression and spatial context from partial transcriptome samples.

To integrate scRNA-seq and spatial data, ENVI learns a common latent space for both data modalities by marginalizing the technology-specific effect on expression via a CVAE28. It achieves this by augmenting the standard VAE with an auxiliary binary neuron in the input layers to the encoding and decoding networks representing each data modality. Integration is crucial, as each modality harbors technology-specific artifacts (Extended Data Fig. 2a). ENVI takes as input the scRNA-seq count matrix Xsc with nsc cells and their full transcriptome of gsc genes as well as counts of segmented cells from spatial transcriptomics matrix Xst from nst cells and gst imaged genes. The algorithm is agnostic to the method used to segment cells before input. It uses the spatial data to compute the COVET matrix for each cell and their MSQR to align with the AOT distance formulation.

Next, ENVI’s conditional autoencoder builds a shared latent space for both data modalities. As the combined embedding must incorporate spatial context and full transcriptome information and must remove confounders relating to modality, we set the latent dimension to 512, substantially larger than standard VAEs in single-cell genomics, which usually contain around 10 neurons25,34,35. As input to the encoder, ENVI takes either spatial or scRNA-seq samples (the latter reduced to the subset of genes that have been imaged), along with the auxiliary neuron c having value 0 for the spatial data and 1 for scRNA-seq. The expression profile along with the auxiliary neuron are transformed into the latent variable l using the same encoding neural network, regardless of data modality:

$$l=\left\}(_},c=0) & _}\in _}\\ }(}_},c=1) & }_}\in _}[:,\,_}]\end\right.,$$

where the encoder returns two vectors, μl and σl, which parameterize a Gaussian with diagonal covariance describing the posterior distribution of the latent. To calculate gradients through random samples, we use the reparameterization trick, which involves generating a sample from the standard normal ε ~ N(0,1) and describing the latent through a function of ε, μl and σl and treating ε as a constant:

$$l \sim N(_,_)\,\Rightarrow l=_+\cdot _,\, \sim N(0,1).$$

Through the training process, our goal is to have the latent encode not only gene expression but also information about the spatial context of a given cell while removing confounding effects to allow transfer learning between modalities. This is achieved by optimizing a single latent space to accurately decode both the full transcriptome and COVET matrix for both data modalities, each missing one of these components. The requisite that the latent space be capable of decoding the spatial niche imbues sufficient spatial information into the latent space during training.

The latent of either modality, along with the appropriate auxiliary neurons, is fed into the ‘expression’ decoder network DecExp. The loss function, calculated by comparing the activations in the output layer to the true expression profiles, needs to reflect the underlying distribution of each data modality. We use the negative binomial distribution to model scRNA-seq data, similarly to previous work25,36, as it suffers from overdispersion and dropout. During training, the scRNA-seq data provide transcriptome-wide expression; therefore, we can include genes whose expression was not provided to the encoder in the loss function, allowing our encoder to model genome-wide expression.

The negative binomial has two parameters per gene: the number of failures, r, and success probability, p. Thus, the output layer of the decoder consists of $2\cdot _}$ neurons, where the first gsc neurons are the parameter r and the latter gsc neurons are p, using the ‘softplus’ nonlinearity for r and the sigmoid function for p to keep it a valid probability:

$$p(}_}=k|NB(r,p))=\frac^^$$

where

$$r,p=}_}(l,c=1)[:,:\,_}],}_}(l,c=1)[:,\,_}:2_}]$$

We use the Poisson distribution to model FISH-based multiplexed imaging data due to its high molecular capture rate3 and have the first gst neurons in the output layer parameterize the per-gene rate parameter λ using ‘softplus’ nonlinearity to ensure that it is a valid rate value:

$$P(}_}=k|}(\lambda ))=\frac^^}$$

where

$$\lambda =}_}(l,c=0)[:,:_}]$$

A standard CVAE, in which all neural parameters are shared aside from the auxiliary neurons, is sufficient to simply integrate between scRNA-seq batches, as demonstrated in scArches34. However, to successfully integrate scRNA-seq and multiplexed FISH-based technologies, a single auxiliary neuron is not sufficient to regress out all biases. In ENVI, only the first gst neurons of the output layer are shared by the two data modalities, whereas the rest are solely trained on the scRNA-seq data. These additional technology-specific parameters improve the ability of ENVI to regress out confounders from the latent embedding, beyond the auxiliary neuron.

Finally, ENVI includes an additional ‘environment’ decoder network DecEnv whose role is to reconstruct the COVET from the latent, which can be trained from the spatial data. The output layer of the environment decoder has $\frac_}\cdot (_}+1)}$ output neurons parameterizing the lower triangular Cholesky factor. The Gramian matrix of the output layer is the mean parameter of a standard normal, reflecting our AOT distance, as the log likelihood of the standard normal is the $_^$ distance.

$$P\left(}^}=^}|N(L\cdot ^},I\,)\right)=\left(2^_}^}}\right)\cdot }^^}}-L\cdot ^}\Vert _^},$$

where $L=}_}(l)$.

The output of the environment decoder is the MSQR of the COVET matrix, which is trained to minimize the $_^$ error with the MSQR of the true COVET matrix. Using the AOT metric in this manner involves computing the MSQR of the COVET samples during training, which can be computationally prohibitive. Instead, we first calculate the MSQR of all COVET matrices, which ENVI is directly trained to reconstruct.

We train ENVI simultaneously on samples from both spatial and single-cell datasets, using mini-batch gradient descent on the variational inference loss. With the learned ENVI model, we impute missing genes for the spatial data by treating the latent embedding of the spatial data as if it were from the single-cell data, using the single-cell auxiliary variable and parameterizing as a negative binomial instead of a Poisson. Conversely, we reconstruct spatial context for the single-cell data by applying the ‘environment’ decoder on its latent, as if it was the latent of the spatial data.

In more detail, we train ENVI to optimize the evidence lower bound (ELBO) with a standard normal prior on the latent, with the goal of increasing the likelihood of the observed data $\_},\,_},_}\}$ for the parameterization of their decoded distributions $\}_},\,}_},}_}\}$ while minimizing the KL divergence between the latent distribution and $N(0,1)$:

$$\beginL=}\,NB(_}|r,p)+}\,}(_}|\lambda )+}\,N(_}|_},I\,)\\\quad-\beta _}(N\_,_\},N\\,)\end$$

To train ENVI to impute missing genes for the spatial data, we generate the latent embedding lst by passing Xst through the encoder, and run the latent layer through the ‘expression’ decoder, but with the inverse auxiliary neuron, as if the embedding came from scRNA-seq data:

where

$$r,p=}_}(_},c=\,1)[:,:_}],}_}(_},c=1)[:,_}:2_}].$$

Similarly, we reconstruct the spatial context for dissociated scRNA-seq samples by passing the scRNA-seq latent embedding lsc through the ‘environment’ decoder:

$$_}^}=E\left[N(_}^},1)\right]\,}\,_}^}=_}\cdot _}^},_}=}_}(_}).$$

To allow flexibility in modeling technologies with different count distributions and molecular capture rates, we implemented the normal, Poisson, negative binomial and zero-inflated negative binomial (ZINB)85 distributions, which can be chosen for either modality to reflect pre-processing steps or varying levels of noise or dropout. The rate or mean parameters (λ for Poisson, r for NB and ZINB and μ for normal) must all be defined per cell and per gene and shared across the single-cell and spatial data. However, all other parameters can be chosen to be either per cell and per gene or simply per gene and can be either shared between technologies or made distinct.

By default, the encoder and two decoder networks consist of three hidden layers, each with 1,024 neurons. The latent embedding consists of 512 neurons, and the prior coefficient is set to β = 0.3. For small datasets whose total samples size is fewer than 10,000 cells, we recommend increasing the reliance on the prior and set β = 1.0. We train ENVI for two14 gradient descent steps with the ADAM optimizer86 with learning rate 10−3 (lowered to 10−4 during the last quarter of training steps) and a batch consisting of 1,024 samples, half taken from scRNA-seq and the other half taken from spatial data. To reduce computational complexity, we subset the scRNA-seq dataset to the union of the 2,048 highly variable genes and all genes included in the spatial dataset rather than the full transcriptome.

ENVI training is constant in both time and memory, whereas methods such as Tangram and NovoSpaRc scale quadratically with dataset size and cannot be GPU accelerated on datasets above a few thousand cells. We benchmarked the run times of ENVI, Tangram9, NovoSpaRc26, gimVI25, uniPort47, deepCOLOR48 and Harmony49 on scRNA-seq datasets of various sizes and on osmFISH, seqFISH, Xenium and MERFISH datasets. All models were trained with their default parameters using a single 12 GB GeForce RTX 2080 GPU, except Tangram, which produced an out-of-memory error above 10,000 cells and was trained with a CPU instead. Model training was stopped prematurely if it exceeded 5 h.

As expected, ENVI’s training time was consistently around 10 min regardless of dataset size (Extended Data Fig. 2b), and Harmony was also constant in time. gimVI runtime grew linearly with dataset size (the model trained for a predefined number of epochs over the datasets), and NovoSpaRc and Tangram were prohibitively slow on larger spatial and scRNA-seq datasets (they learn a cell-to-cell mapping between the spatial and single-cell datasets). We found that GPU acceleration is not possible for Tangram. deepCOLOR and uniPort were also substantially slower than ENVI at larger cell numbers.

Evaluation of integration quality

Batch average silhouette width (bASW), introduced in a recent benchmarking of batch integration methods for scRNA-seq atlases41, evaluates latent integration based on mixing between batches and co-localization of similar cell types within the latent. In brief, bASW computes, for each cell type, how well-mixed batch labels are using the silhouette coefficient and returns the average across all cell types. By treating each modality as a different batch, we could use the bASW score to measure the quality of ENVI’s learned latent. The latent of ENVI is large, consisting of 512 neurons; because silhouette coefficient is affected by the curse of dimensionality, we first compressed the ENVI latent to the top 10 principal components and computed bASW on them.

Benchmarking imputation

We benchmarked ENVI gene imputation following previous approaches25,44 that generate a test set of held-out genes using cross-validation and compared imputed and true expression using Pearson correlation and our spatially aware MSSI metric. We evaluated log expression and imputation profiles, with pseudocount 0.1.

Many algorithms use scRNA-seq data to impute missing genes in spatial transcriptomics data42,43,87,88,89. We compared ENVI to gimVI, Tangram and uniPort for their competitive performance44,47, NovoSpaRc for its use of spatial context and optimal transport for data integration, deepCOLOR48 for its use of a deep generative model and Harmony49 for its prevalence as a batch correction method50.

On the osmFISH dataset, which includes only 33 genes, we performed a full leave-one-out cross-validation by hiding every gene in the imaging panel individually and predicting its expression. On the seqFISH, MERFISH and Xenium datasets, which assay hundreds of genes, we used five-fold cross-validation, whereby the imaged gene set was divided into five random groups, and each model was tested on one withheld group after training on four. To appraise performance, we used a ‘relative’ one-sided t-test, as scores are paired across genes.

We benchmarked all models using their default parameters and instructions on all datasets:

gimVI: We trained for 200 epochs with a batch size of 128 and latent dimension, per author recommendations (https://docs.scvi-tools.org/en/stable/), and parameterized spatial and scRNA-seq datasets with NB and ZINB distributions, respectively. To impute genes with the trained model, we followed manuscript instructions and trained a kNN regression model on the scRNA-seq latent and full transcriptome expression, setting k as 5% of cells in the single-cell dataset. We then applied the regression model on the spatial data latent to predict the expression of unimaged genes.

Tangram: We trained for 1,000 epochs using default parameters (https://github.com/broadinstitute/Tangram). For osmFISH, seqFISH and Xenium datasets, we used the default ‘cells’ mode, and, for the much larger MERFISH atlas, we used the ‘cell-type’ mode, per the tutorial. We set the density prior to be uniform, as our spatial benchmark datasets are single-cell resolution. With the learned mapping, we used the ‘project_genes’ function to impute genes from scRNA-seq onto the spatial dataset.

NovoSpaRc: We followed the repository instructions (https://github.com/rajewsky-lab/novosparc), using an ‘alpha’ coefficient on a spatial location prior of 0.25 and smoothness parameter ‘epsilon’ of 0.005. To compute the scRNA-seq pairwise distance matrix, we used the union of the 2,048 most variable genes and all genes in the spatial dataset. For spatial datasets consisting of multiple samples, we trained a different model on each sample. Because NovoSpaRc does not scale well to large datasets, we reduced the MERFISH-related scRNA-seq dataset to a tenth of its size, sampling uniformly across each cell type. We applied the learned mapping to impute missing genes using the ‘tissue.sdge’ function.

uniPort: We replicated tutorial instructions for integrating spatial and single-cell datasets (https://uniport.readthedocs.io/) by normalizing each dataset according to library size, log transforming counts, executing the ‘batch_scale’ function, training the model for 30,000 iterations with a ‘lambda_kl’ value of 5.0 and, finally, predicting the expression of hidden genes using the ‘predict’ function.

deepCOLOR: We trained for 500 epochs using default parameters from the tutorial (https://github.com/kojikoji/deepcolor). deepCOLOR does not directly impute unimaged genes, so we multiplied the resulting mapping matrix with the scRNA-seq expression of the hidden genes to predict their expression.

Harmony: We treated spatial and single-cell datasets as separate batches and integrated them using the default Harmony implementation in scanpy90 (https://scanpy.readthedocs.io/). We only included genes from the scRNA-seq data that were also in the spatial data (and removed test genes) to produce Harmony embeddings from the principal components of the concatenated dataset. Mirroring gimVI’s imputation procedure, we performed kNN regression on the Harmony embeddings to reconstruct expression of the manually hidden genes.

Impact of data sparsity on ENVI

To validate ENVI’s robustness to single-cell or count data sparsity, which can affect integration9, we benchmarked the full embryogenesis seqFISH and scRNA-seq data against random subsampling to 90% or 80% of counts according to the binomial distribution. For all three datasets, we performed five-fold cross-validation (see the ‘Benchmarking imputation’ subsection for details), finding that removing even 20% of counts does not greatly impact ENVI performance, which still surpasses Tangram on the full dataset (Extended Data Fig. 2c). Using a kNN (k = 5) classifier trained to predict cell type from the scRNA-seq latent space, we assigned labels to the seqFISH data and measured balanced accuracy compared to the original assignment, finding that the ENVI latent space remains reliable upon downsampling; datasets with higher sparsity are only slightly less accurate (Extended Data Fig. 2d).

FDL and DCs

FDL58 and DCs32 capture and visualize continuous trends in single-cell data19. We calculated FDL following the implementation in Van Dijk et al.91, by computing a kNN matrix (using default k = 30), converting to an affinity matrix using an adaptive Gaussian kernel with width = 30 and k = 10, symmetrizing the matrix and using the ForceAtlas92 function ‘force_directed_layout’ for visualization. DC computation followed a similar process to compute a data affinity matrix, except that we multiplied the affinity matrix by the inverse of its degree matrix to compute the normalized Laplacian. The eigenvectors of the Laplacian matrix, in order of eigenvalue magnitude, are the DCs.

Applying ENVI to seqFISH embryogenesis data

We started with pre-processed data from the E8.75 mouse gastrulation study37, which included 351 genes measured with seqFISH (57,536 imaged cells), and paired it with E8.5 scRNA-seq data (12,995 cells) from a second study38. We further processed the scRNA-seq data by removing mitochondrial genes, genes expressed in less than 1% of cells, cells with library size greater than 33,000 (set manually to match the knee point) and cells annotated as ‘nan’ or representing doublets. To avoid confounding batch effects50, we used only the largest scRNA-seq batch (labeled ‘3’). For the seqFISH dataset, we used only the first of three imaged embryos (‘embryo1’), removed cells with abnormally high total expression (threshold set manually to 600) and removed the gene Cavin3, which did not appear in the scRNA-seq dataset. For both datasets, we used cell type annotations provided by the authors and visualized the seqFISH data using spatial coordinates and the scRNA-seq data using a UMAP embedding (Fig. 2a). We also renamed several cell types to resolve nomenclature differences, including changing presomitic mesoderm to somitic mesoderm and splanchnic mesoderm to pharyngeal mesoderm.

We trained ENVI on the union of the 2,048 most variable genes in the scRNA-seq data, all seqFISH-measured genes, all HOX genes and several organ markers (Supplementary Table 1) using default parameters. We visualized the learned latent posterior of the seqFISH and scRNA-seq datasets using UMAP and found that cell types tend to co-embed regardless of modality (Fig. 2b).

To test the imputation of unimaged canonical organ markers Ripply3 (ref.

View original article

NATURE BIOTECHNOLOGY

分享书签

0 0 0 0 0 0 0

More from this channel

The covariance environment defines cellular niches for spatial inference

留言 (0)