Bento: a toolkit for subcellular analysis of spatial transcriptomics data

Overview of Bento data infrastructure for subcellular analysis

In order to facilitate a flexible workflow, Bento is generally compatible with molecule-level resolution spatial transcriptomics data (Fig. 1A), such as datasets produced by MERFISH [13], seqFISH + [14], CosMx (NanoString) [33], Xenium (10 × Genomics) [15, 34], and Molecular Cartography (Resolve Biosciences) [35]. Bento’s workflow takes as input (1) 2D spatial coordinates of transcripts annotated by gene and (2) segmentation boundaries (e.g., cell membrane, nuclear membrane, and any other regions of interest) (Fig. 1B). While 3D molecular coordinates are commonly included, 3D segmentation information is limited to z-stacked 2D segmentation, limiting its usability. If available, Bento can also handle arbitrary sets of segmentations for other subcellular structures or regions of interest. These inputs are stored in the AnnData data format [36], which links cell and gene metadata to standard count matrices, providing compatibility with standard single-cell RNA-seq quality control and analysis tools in the Scverse ecosystem [32]. With a data structure for segmentation boundaries and transcript coordinates in place, Bento can easily compute spatial statistics and measure spatial phenotypes to build flexible multidimensional feature sets for exploratory subcellular analysis and utilize these spatial metrics to augment quality control (Fig. 1C).

Fig. 1

Workflow and functionality of the Bento toolkit. A Single-molecule resolved spatial transcriptomics data from commercial or custom platforms are ingested into Bento where it is converted to the AnnData format (B), where it can be manipulated with Bento as well as a wide ecosystem of single-cell omics tools. C Geometric statistics are illustrated for the seqFISH + dataset, including metrics describing cell and nuclear geometries and cell density to assess overall data quality. D Bento has a standard interface to perform a wide variety of subcellular analyses

Bento offers a precise yet flexible palette of novel complementary subcellular analyses (Fig. 1D). We introduce RNAforest, a multilabel approach for annotating RNA localization patterns adapted from FISH-quant v2 [22]. We find that many RNAs are spatially distributed according to gene function. We then implement RNAcoloc, a context-specific approach to quantify colocalization to characterize how genes colocalize with each other in a compartment-specific manner. Having established systematic patterning and organization of RNA transcripts, we demonstrate RNAflux, an unsupervised method for semantic segmentation of subcellular domains. RNAflux first quantifies subcellular expression gradients at pixel resolution before identifying consistent subcellular domains via unsupervised clustering. We demonstrate the utility of Bento’s tools by applying them to identifying critical localization changes in human iPSC-derived cardiomyocytes upon drug treatment with doxorubicin, a widely used chemotherapeutic known to cause cardiotoxicity [37].

RNAforest: utilizing subcellular landmarks to predict RNA subcellular localization

In computer vision, key points or landmarks are commonly used for tasks like facial recognition [38] and object detection. Analogous to these classical applications, we derive spatial features using cell and nucleus boundaries as landmarks to predict RNA localization patterns from spatial summary statistics. Building on the summary statistics used for classifying smFISH data in FISH-quant v2 [39], RNAforest consists of an ensemble of five binary random forest classifiers rather than a single multi-classifier model to assign one or more labels. These pattern labels, adapted from several high-throughput smFISH imaging experiments in HeLa cells [40,41,42,43], are broadly applicable to eukaryotic cells: (i) nuclear (contained in the volume of the nucleus), (ii) cytoplasmic (diffuse throughout the cytoplasm), (iii) nuclear edge (near the inner/outer nuclear membrane), (iv) cell edge (near the cell membrane), and (v) none (complete spatial randomness). It is important to note, as was done previously in FISH-quant v2 [39] that because of the 2D nature of the dataset, RNA that is in truth cytoplasmic but above or below the nucleus will still appear as though in the nucleus when collapsed in the z-dimension. As we use the FISH-quant v2 pattern simulation framework, this is accounted for in the training dataset.

We used the FISH-quant v2 simulation framework to generate realistic ground-truth data [42]. Each sample is defined as a set of points with coordinates in two dimensions, representing the set of observed transcripts for a gene in a particular cell. In total, we simulated 2000 samples per class for a total of 10,000 samples (see the “ Methods” section). We used 80% of the simulated data for training and held out the remaining 20% for testing (Additional File 1: Fig. S1A). Each sample is encoded by a set of 13 input features, describing characteristics of its spatial point distribution, including proximity to cellular compartments and extensions (features 1–3), measures of symmetry about a center of mass (features 4–6), and measures of dispersion and point density (features 7–13) (Fig. 2A). These features are normalized to morphological properties of the cell to control for variability in cell shape. A detailed description of every feature is described in Additional File 1: Table S1, and model architectures and hyperparameters are described in Additional File 2: Table S2 (see the “ Methods” section).

Fig. 2

Subcellular localization pattern identification with RNAforest. A Thirteen spatial summary statistics are computed for every gene-cell pair describing the spatial arrangement of molecules and boundaries in relation to one another. The features (Supp. Table 1) are inputs for RNAforest, a multilabel ensemble classifier that assigns one or more subcellular localization labels: cell edge, cytoplasmic, nuclear, nuclear edge, and none. The colors for each label are used consistently throughout to figure. Top 10 genes for each label visualized for each label other than “none” in B U2-OS cells and C 3T3 cells. D and E are UpSet plots showing the proportion of measured transcripts assigned to each label. F and G show the relative label proportion across cells for each gene and are colored by the majority label (F and G). H Top 5 consistent genes for each label. I ssGEA identifies the enrichment of GO cellular component domains for each label in the 3T3 cell dataset. Stars denote p-values under thresholds defined in the legend. P-values are derived from ssGSEA permutation tests with Benjamini–Hochberg correction controlling for false discovery rate

We applied RNAforest on two datasets from different spatial platforms, cell types, and gene panel sizes: a MERFISH dataset in U2-OS cells and a seqFISH + dataset in 3T3 cells. Validation performance on manual annotation of subsets of both datasets shows that RNAforest generalizes well despite biological and technical differences (see the “ Methods” section, Additional File 1: Fig. S1B). The MERFISH dataset measured 130 genes (low plexity) with high detection efficiency per gene (111 molecules per gene per cell on average), while the seqFISH + dataset measured 10,000 genes (very high plexity) with lower detection efficiency (8 molecules per gene per cell on average) (Fig. 2B, C, Additional File 1: Fig. S1C-F). In agreement with previous work characterizing RNA localization of 411 genes [43], we find that genes commonly exhibit variability in localization across cells. This suggests that heterogeneity in localization likely generalizes to the entire transcriptome. Of the localization patterns besides “none,” “nuclear” was the most common (22.1%) in the U2-OS osteosarcoma cells (Fig. 2D, F), while “cell edge” was the most common (15.9%) in the 3T3 fibroblast cells (Fig. 2E, G).

In the U2-OS cells, we found many genes to have preferential localization in different subcellular compartments (Fig. 2H). In agreement with our RNAflux findings, we find genes known to localize to the nucleus [44, 45] to be frequently labeled “nucleus” (MALAT1, SOD2) and genes encoding secreted extracellular proteins [13] to be frequently labeled “nuclear edge” (FBN1, FBN2). As expected, we find genes preferentially “nuclear” and “nuclear edge” localized to mirror nucleus and endoplasmic reticulum genes found in a 10 k genes MERFISH study of U2-OS cells that included ER staining [46] (Additional File 1: Fig. S2; see the “ Methods” section). Leveraging the 3T3 seqFISH + dataset’s higher plexity, we were able to ask whether genes with similar localization preferences are functionally related. We applied gene set enrichment analysis to gene localization frequencies to identify enriched gene ontology terms [47] (Fig. 2I; see the “ Methods” section). Secretory processes were enriched in the nucleus and nuclear edge, which may be linked to increased transcription of fibroblast-related functions. Cell edge enriched pathways consisted of those with the cell membrane as their site of function (e.g., endocytosis and tight junction suggesting local translation of these genes). Additionally, the term for cell cycle was significantly enriched in the cytoplasm only. Genes without strong localization preference (most frequently “none”) were not significantly associated with any pathways. These genes likely do not undergo active transport and are functionally independent of local translation.

In summary, RNAforest gives a user a facile method for annotating RNA localization patterns and quantifying heterogeneity in a transcriptome-wide manner independent of RNA abundance. Beyond known RNA localizations, we find that transcript location is generally associated with known gene function, alluding to the systematic spatial regulation of RNA transport. We foresee RNAforest will be a valuable addition to characterize RNA localization across diverse spatial transcriptomics datasets.

RNAcoloc: an approach for context-specific RNA colocalization

In geospatial information processing, a fundamental feature that is often gleaned from large datasets is the colocation of objects (e.g., gleaning socialization metrics from cell phone colocation data in Singapore [48]). Colocation is similarly valuable in understanding co-translation and interaction networks of genes in a biological context [49]. Recent spatial transcriptomics approaches have used a number of colocalization metrics from the geographic information systems and ecology fields, e.g., the bivariate versions of Ripley’s K function (also known as cross-k-function) [50], Moran’s I [51], and the join count statistic [52]. These metrics are designed to measure spatial associations between two populations, i.e., gene A transcripts and gene B transcripts. However, it is more appropriate to think of all transcripts in a single cell from a single population; after all, RNA transcription and localization are not completely stochastic. We have shown that the subcellular distribution of RNA is highly structured with RNAforest. As such, we developed RNAcoloc, an approach that combines the Colocation Quotient (CLQ) [53] metric and tensor decomposition for context-specific RNA colocalization (see the “ Methods” section). The CLQ is a colocalization statistic that is capable of accounting for the biophysical properties of RNA spatial distributions. First, the CLQ considers how clustered the overall RNA population is in a cell and measures whether specific pairs of genes are more clustered than expected given the spatial pattern of the overall population. Second, the CLQ is inherently asymmetric and captures the direction of attraction, i.e., the attraction of gene A to gene B is not the same as the attraction of gene B to gene A. This is most common when gene A and gene B have very different expression levels, which is prevalent due to overdispersion in gene expression data.

RNAcoloc calculates CLQ scores for each gene per cell in a compartment-specific manner, such that each sample has 2 scores, a nucleus and cytoplasm CLQ score. An initial comparison in the U2-OS dataset of global colocalization between nuclear and cytoplasmic fractions unsurprisingly found that transcripts from the same gene tend to cluster more tightly with themselves than with transcripts from other genes (Fig. 3B). Additionally, self-colocalization is significantly stronger in the cytoplasm than in the nucleus. In conjunction with our findings from RNAforest analysis that genes of the same localization pattern tend to have similar functions, this suggests that the RNAs are more tightly spatially regulated once exported from the nucleus.

Fig. 3

Compartment-specific RNA colocalization with RNAcoloc. A Transcripts are separated by compartment (nucleus and cytoplasm) before CLQ scores are calculated for every gene pair across all cells. This yields a cell × gene pair × compartment tensor. B Pairwise comparison of log CLQ distributions for gene pairs and self-pairs, further categorized by compartment. The Mann–Whitney U test was used for comparisons. Stars denote p-values below the legend threshold with Benjamini–Hochberg correction controlling for false-discovery rate. From top to bottom, group sizes are 12,254,430 (cytoplasm gene pairs), 115,187 (nucleus gene pairs), 6,778,402 (cytoplasm self-pairs), and 86,474 (nucleus self-pairs). C Tensor decomposition yields 4 factors. From left to right, the three heatmaps show the loadings of each factor for each dimension—compartments, cells, and gene pairs. Only the top 5 associated gene pairs for each factor are shown. D Top examples of compartment-specific colocalized gene pairs. Black scale bars denote 10 μm

By calculating CLQ scores for every gene–gene pair across compartments, RNAcoloc constructs a tensor of shape P × C × S where P, C, and S represent the number of gene–gene pairs, cells, and compartments, respectively (Fig. 3A; see the “ Methods” section).

RNAcoloc then applies tensor decomposition — specifically, non-negative parallel factor analysis — a data-driven, unsupervised approach for discovering substructure in high-dimensional data [31, 54] to decompose the tensor into k “factors”. The number of factors is determined using the elbow method heuristic, optimizing for the root mean squared error (RMSE) reconstruction loss (see the “ Methods” section). Unlike matrix dimensionality reduction methods, such as PCA, the order of the components (factors) is unassociated with the amount of variance explained. Each factor is composed of 3 loading vectors, which correspond to the compartments, cells, and gene pairs. Higher values denote a stronger association with that factor. Crucially for interpretation, factors derived from tensor decomposition are not mutually exclusive and can share overlapping sets of associated compartments, cells, and gene pairs.

Applied to the U2-OS dataset, RNAcoloc decomposes RNA colocalization into 4 factors. Examining factor loadings indicates two distinct subpopulations of cells with compartment-specific colocalization behaviors; cluster 1 cells exhibit uniform (Factor 0) and cytoplasmic (Factor 3) colocalization, while cluster 2 cells show nuclear (Factor 1) and cytoplasmic colocalization (Factor 2) (Fig. 3C, D). Factor 3 describes the colocalization of gene pairs in the cytoplasm of cluster 1 cells, especially a number of genes that attract PIK3CA transcripts While little is known about PIK3CA RNA interactions, the PI3K pathway regulates mitotic organization, including the regulation of dynein and dynactin motor proteins. DYNC1H1 is among the top genes attracting PI3KCA and specifically encodes cytoplasmic dynein, a motor protein critical for spindle formation and chromosomal segregation in mitosis [55]. This hints that not only is compartmental localization of RNA linked to the cell cycle [46], but RNA-RNA interactions may play a role as well. In cluster 2 cells, MALAT1 attracts CNR2 transcripts more than expected in the cytoplasm. Even though MALAT1 is canonically abundantly localized to the nucleus, this demonstrates that the CLQ score identifies gene pairs colocalizing more than expected despite the disproportionate expression of MALAT1 relative to CNR2, whereas other approaches seem confounded by large differences in expression [45].

We demonstrate the ability of RNAcoloc to quantify compartment-specific gene-pair colocation by exploring cytoplasmic vs. nuclear colocalization. As we found separately with RNAforest, RNAcoloc analysis finds evidence that RNA transport is spatially regulated, especially after nuclear export. We highlight several examples of colocalization suggesting how RNA localization allows the same gene to have multiple functions in a spatially dependent fashion, i.e., depending on its molecular neighbors and local environment [56, 57]. We foresee RNAcoloc will be increasingly relevant as many spatial technologies are beginning to image proteins along with RNA, which can be used to delineate more granular compartments, such as cell organelles or distinct regions, e.g., neuron cell bodies vs dendrites.

RNAflux: unsupervised semantic segmentation of subcellular domains in single cells

To build on RNAforest, we overcame the restricted number of localization patterns defined by the supervised method by framing RNA localization as an unsupervised embedding problem. RNAflux looks at local neighborhoods within the space of a cell and builds a normalized gene composition per neighborhood. Differences in neighborhood compositions can be leveraged to identify distinct subcellular domains in a manner that is entirely unsupervised and independent of cell geometry.

We applied this embedding procedure to compute a gene composition vector for every pixel in 2D coordinate space, generating a spatially coherent embedding across entire cells (Fig. 4A; see the “ Methods” section).

Fig. 4

RNAflux finds distinct subcellular domains with consistent spatial organization and local gene composition. A Flowchart of RNAflux and fluxmap computation. Local neighborhoods of a fixed radius are arrayed across a cell and a normalized gene composition is computed for each pixel coordinate, producing an RNAflux embedding. The first three principal components of the RNAflux embedding are visualized for U2-OS cells coloring RGB values by PC1, PC2, and PC3 values respectively for each pixel. Fluxmap domains are computed from each RNAflux embedding to create semantic segmentation masks of each subcellular domain. B The left panel shows a field of view of U2-OS cells, dots denoting individual molecules colored by gene species, nuclei, and cell boundaries outlined in white. For the same field of view of cells, the center panel shows RNAflux embeddings and the right panel shows fluxmap domains. C The scatter plot shows how the composition of each gene is distributed across fluxmap domains. The position of each point denotes the relative bias of a given gene’s composition across fluxmaps. D Heatmap showing the fraction of pixels with a positive enrichment value for each APEX-seq location for each fluxmap domain. E–I The most highly enriched location is shown for each fluxmap domain. Domain boundaries are denoted by white lines within each cell

Applied to a MERFISH dataset with a target panel of 130 genes across over 1153 U2-OS cells, we demonstrate that RNAflux embeddings can detect transcriptionally distinct subcellular domains. Performing dimensional reduction of the embeddings showed that the top sources of variation spatially correspond to the nucleus, the nuclear periphery, and cytoplasmic regions consistently across cells (Fig. 4B; see the “ Methods” section) confirming that RNAflux measures intracellular transcriptional variation, as opposed to intercellular variation. To delineate compositionally similar domains in a data-driven manner, we cluster pixel embeddings using self-organizing maps (SOMs), effectively performing unsupervised semantic segmentation (see the “ Methods” section). We denote the resulting clusters as “fluxmap domains.” We found that this assigned pixels to 5 fluxmap domains, consistently highlighting spatial regions across every cell (e.g., fluxmap 2 is always nuclear while the remaining domains constitute the cytoplasm) (Fig. 4B). By considering the spatial distribution of molecules across fluxmap domains, we can quantify the composition of molecules for each gene across fluxmaps (Fig. 4C), e.g., nuclear-localized MALAT1 [44].

We sought to characterize the fluxmap domains with known information about RNA localization. We used data from a previous study that measured gene expression at “distinct subcellular locales” via APEX-seq, a technique for proximity labeling and sequencing of RNA [58]. Of the 3288 genes differentially enriched to one or more locales, 63 overlapped with the 130 MERFISH genes. The location enrichment score for each pixel is calculated by taking the weighted sum of its RNAflux embedding and the measured relative enrichment, i.e., log fold change measured by APEX-seq loadings for a given organelle-specific geneset (see the “ Methods” section). Visualizing each pixel’s location-specific enrichment scores from the APEX-seq dataset highlights the subcellular localization of these compartments, including the cytosol, nucleus, nucleolus, nuclear pore, nuclear lamina, endoplasmic reticulum lumen (ER lumen), ER membrane (ERM), and the outer mitochondrial membrane (OMM) (Fig. 4D). We find the nuclear compartments have high scores in domain 2, while the cytoplasm scores rank highest in domains 4 and 5. Both the ERM and OMM scores are the strongest in domain 1 (Fig. 4E).

The most common application for spatial transcriptomics is in mapping heterogeneous cell types in large tissue samples. This presents several challenges. First, the panels for these experiments are weighted heavily towards cell type markers determined by single-nuclei RNA-seq, potentially reducing the intracellular variability in the expression on which Bento relies. Second, substantial intercellular heterogeneity can skew the low-rank embedding by introducing too much variance in requisite cell radii information. To explore the applicability of RNAflux on tissue, we applied it to a previously published breast cancer tissue dataset generated by 10 × Xenium [59]. We successfully reproduced the identification of unique cell types that reflect canonical expression markers (Fig. S4A and B). We apply RNAflux to cell type-disaggregated subsets of the full datasets in fields of view of interest with at least 100 cells of interest and find that fluxmaps 1–3 show different enrichment scores for Nucleus and Endoplasmic Reticulum (ER; combination of ERM and ER lumen). We note that despite the successful delineation of discrete regions in the form of fluxmaps when looking at different regions of tissue that are enriched for different cell types, nuclear and ER enrichment scores change for each fluxmap (Fig. S4C).

In summary, RNAflux finds distinct subcellular domains with consistent spatial organization and local gene composition. As an unsupervised method, RNAflux can be applied to any cell type for inferring subcellular domains from transcript locations and functionally annotated with biological enrichment analysis. This process is best performed on cell type-separated data to guard low-rank embeddings from being generated by cells of vastly different morphologies. This disaggregation is important because, unlike uniform cell lines, heterogeneous tissue composed of functionally diverse stromal cells and leukocytes should indeed be expected to have different distributions and complements of subcellular domains.

Doxorubicin-induced stress in cardiomyocytes depletes RNA from the endoplasmic reticulum

Having established Bento’s utility to characterize RNA localization in U2OS cells, we applied Bento to quantify changes in localization in the context of perturbations in cells. Specifically, we performed single-molecule spatial transcriptomics on doxorubicin-treated and untreated cardiomyocytes to identify RNA localization changes as a result of treatment.

Doxorubicin (DOX) was once one of the most effective broad-spectrum anti-cancer anthracycline antibiotics [60, 61] with particular efficacy against solid malignancies such as lung and breast cancer, as well as hematologic neoplasia [62, 63]. However, DOX’s propensity to cause cardiac damage in patients has led to significant limitations in its clinical use [64]. The exact mechanism by which DOX induces heart failure is unclear, but significant evidence suggests cardiomyocyte injury driven by oxidative stress is a major factor [62, 65,66,67,68]. Specifically, DOX causes stress and dysfunction in multiple cellular compartments in cardiomyocytes such as mitochondria, Sarco/endoplasmic reticulum (SER), deficiencies in calcium signaling, and lipid degradation at the cellular membrane [69]. We reasoned that by measuring the localization of the RNA transcripts of 100 genes crucial to cardiomyocyte health and function (Additional File 2: Table S2) and leveraging the tools developed within Bento, we could recapitulate known dysfunction of subcellular domains in cardiomyocytes upon DOX stress and measure novel RNA localization phenotypes.

We utilized a chemically defined protocol to differentiate human induced pluripotent stem cells (iPSCs) into beating cardiomyocytes and treated them with either DMSO (vehicle) or 2.5 μM DOX for 12 h before fixation (see the “ Methods” section). Single-molecule spatial transcriptomes were measured by Resolve Bioscience using Molecular Cartography. The resulting data was segmented using ClusterMap [24] for cell boundaries and Cellpose [70] for nuclei boundaries (Fig. 5A). Non-myocytes were filtered out using SLC8A1 as a canonical marker for cardiomyocytes (see the “ Methods” section, Additional File 1: Fig. S3A). Comparing vehicle and DOX-treated cardiomyocytes, we found NPPA, a classic marker for cardiac stress [71,

View original article

GENOME BIOLOGY

分享书签

0 0 0 0 0 0 0

More from this channel

Bento: a toolkit for subcellular analysis of spatial transcriptomics data

留言 (0)