Spatially resolved transcriptomics and graph-based deep learning improve accuracy of routine CNS tumor diagnostics

Clinical data and ethics

The tissue collection and processing were performed in accordance with the local ethic regulations (Institutional Review Board Heidelberg, S-318/2022) and the local ethics committee of the University of Freiburg approved the data evaluation, imaging procedures and experimental design (protocols 100020/09 and 472/15_160880). Written informed consent was obtained by all participants. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Spatial transcriptomics of FFPE samples

Preparation of FFPE samples for spatial transcriptomics followed protocols from 10X Genomics. Briefly, a small tissue section from a scored tissue block was rehydrated, cut into 5-µm sections and placed on a Visium slide’s capture frame. For stereotactic biopsies, multiple sections were included in a single frame. The sections were dried at 42 °C for 3 h and stored overnight in a desiccator. Paraffin removal involved sequential incubations in xylene, ethanol and water. The tissue was stained with hematoxylin and eosin (H&E), scanned with an Aperio AT2 slide scanner at ×40 and then decrosslinked in TE buffer (pH 9.0) at 70 °C for 1 h. The 10X Genomics V1 human transcriptome probe set was used for mRNA detection. After a 20-h hybridization at 50 °C, excess probes were removed through washes, ligated probes were retained and RNA digestion released the probes. Spatial barcodes were added through an extension reaction and an index PCR was performed on the basis of qPCR evaluation. Samples were sequenced on an Illumina NovaSeq6000, with targeted read depths of at least 100 million per sample.

Spatial transcriptomics of freshly frozen samples

Freshly frozen samples were processed following 10X Genomics protocols. After cutting 5-µm sections, the tissue was placed in a Visium slide capture frame, incubated with methanol at −20 °C and then washed with isopropanol. H&E staining was applied, followed by imaging on an Aperio AT2 slide scanner. RNA release was achieved by permeabilization, spatial barcodes were added by reverse transcription and DNA was synthesized and amplified for library preparation. Sequencing was performed on a NextSeq500 with targeted read depths of at least 100 million reads per sample.

Extraction of DNA for methylation analysis

For methylation analysis, DNA was extracted using the Maxwell RSC FFPE DNA purification kit. Tumor-rich regions were first identified by pathologists through examination of H&E-stained sections. A punch biopsy, either 1.5 mm or 3 mm in diameter, was then collected from the FFPE block in the selected area. The collected tissue was incubated with proteinase K at 70 °C overnight to remove paraffin and to digest the tissue. Following this, lysis buffer was added to release the DNA. The sample was subsequently cooled to allow the paraffin to solidify and the aqueous phase containing DNA was transferred to a new tube. DNA purification was then completed using the Promega Maxwell system and DNA concentrations were measured with the Qubit system (Invitrogen).

Detection of DNA methylation

Methylation profiling and copy-number analysis were carried out using data from the Infinium MethylationEPIC BeadChip array (850K). Following the protocol established by Pfister et al.22, 250 ng of extracted DNA was subjected to bisulfite conversion with the Zymo EZ methylation kit. The DNA underwent repair using the Infinium HD FFPE restore kit, followed by amplification and hybridization onto an Infinium BeadChip. After several washing steps, single-base extension and fluorescent staining, the chip was scanned using an IScan software system. For bioinformatics, the minfi workflow (available from https://github.com/mwsill/minfi) was used for methylation analysis and conumee was used for copy-number calling (available from https://github.com/mwsill/conumee-2), following the methodology described by Sturm et al.23. Methylation classification was performed using version 12.5 of the classifier24.

CNV detection

Most samples’ copy-number profiles were derived using the Infinium MethylationEPIC BeadChip array (850K), with four samples (GB5, GB14, GB15 and GB16) analyzed using the Infinium MethylationEPICv2.0 BeadChip array (935K). Sample processing was conducted at the neuropathology department in Heidelberg, following the procedure outlined in Pfister et al.22. Tumor-rich areas were identified on H&E-stained slides and DNA was extracted either from ten 10-µm-thick tissue slides or by punch biopsies from paraffin-embedded blocks (1.5 or 3 mm in diameter) using the Maxwell system (Promega). DNA quality was assessed using the Qubit system (Invitrogen) and 250 ng of DNA was then used for bisulfite conversion (Zymo). For methylation and CNV profiling, the samples were analyzed on either the Infinium MethylationEPIC or EPICv2.0 arrays (Illumina). Bioinformatic analyses, including methylation calling and CNV detection, were performed with minfi (available from https://github.com/mwsill/minfi) and conumee (available from https://github.com/mwsill/conumee-2), as described by Sturm et al.23, with classification performed using version 12.5 of the classifier24.

IHC stainings

Immunohistochemical stainings were performed using a Ventana BenchMark Ultra Immunostainer. Stains for GFAP (mouse monoclonal, clone GA5; Cell Signaling), Ki67 (mouse monoclonal, clone MIB-1; Dako), ATRX (mouse monoclonal, clone BSB-108; Bio SB), Hip1R (rabbit, monoclonal, clone EPR9437; Abcam) and Vim (mouse monoclonal, clone V9; Dako) were conducted. ATRX, Hip1R and ATRX stainings were performed according to Sahm et al.25. GFAP and Ki67 were stained according to the protocol published by Wefers et al.26.

Sample segmentation and histological annotation

Manual segmentation of the histological regions was performed by a neuropathologist according to the morphological features of the H&E scan. Spots were annotated using the loupe browser from 10X. A matrix containing the spatial barcode and the annotation of the neuropathologist was exported. The segmentation data frame was imported into the SPATA objects using the addFeature function of the SPATA2 package.

Data import, preprocessing, filtering and normalization for spatial data analysis

For data analysis and quality control, we used the Cell Ranger pipeline from 10X Genomics. To facilitate spatial data analysis, we developed a custom framework. Data can be imported into our SPATA tool either by using a dedicated function (SPATA::initiateSpataObject_10X) or through manual entry of count matrices, barcode–coordinate matrices and H&E images. Standard import procedures include normalizing gene expression, which is achieved using the Seurat version 4.0 package. This process involves scaling transcript counts per spot to a total of 10,000, followed by natural log transformation. To control for batch effects, data normalization included regressing out sample batch variations and the percentage of ribosomal and mitochondrial gene expressions.

CNA estimation

Our CNA analysis leverages a pipeline integrated within the SPATA2 R tool, with a development version available at https://github.com/theMILOlab/SPATA2. The SPATA2::runCnvAnalysis() command enables additional CNA analyses. Chromosomal bins were created using the SPATAwrapper::Create.ref.bins() function, with a bin size of 1 Mbp used for this study. Data were then rescaled and interpolated over a 10-kbp window, with normalization achieved using a loess regression model through SPATAwrappers::runCNV.Normalization().

Multiclass ROC analysis

To evaluate the performance of our classification model in a multiclass setting, we used multiclass ROC analysis. This method extends the traditional binary ROC analysis to handle more than two classes by averaging pairwise comparisons between classes. The algorithm we used for this analysis was developed by Hand and Till27. This approach calculates the AUC by considering each class against all other classes and then averaging these pairwise AUC values. This method ensures that the simplicity of the AUC metric is maintained while providing a comprehensive evaluation for multiclass classification problems. The steps for the multiclass ROC analysis start with a pairwise comparison. For each class k, the AUC for the binary classification of class k is computed versus all other classes. Next, the AUC values obtained from all pairwise comparisons are averaged to get the overall multiclass AUC. We implemented this algorithm using the pROC library in R, which provides functions to compute multiclass ROC curves and AUC values27.

Spatial autocorrelation Moran’s I

We assessed the spatial dependencies among spots using Moran’s I statistics. This analysis discerns whether gene expression patterns across the sample are spatially clustered, randomly distributed or dispersed. Moran’s I is indicative of spatial clustering when positive and spatial dispersion when negative. Moran’s I index is given by the following formula:

$$I=\frac\times \frac___(_-\bar\,)(_-\bar\,)}__-\bar\,)}^},$$

where \(N\) is the total number of spatial spots, \(_\) and \(_\) are the gene expression values at spots i and \(j\), respectively. \(\bar\) is the mean gene expression value across all spots. \(_}\) represents the spatial weight between spots i and \(j\). \(W\) is the sum of all spatial weights \(_}\). The computation of Moran’s I, including the spatial weights matrix, was performed using the ‘inferSpatial.ac()‘ function from the SPATAwrappers package.

Spatial correlation analysis

The hlpr_join_with_aes helper function was used to merge these data, with optional normalization for gene sets (method_gs) and smoothing (smooth), including adjustment of smoothing span (smooth_span) as needed. After data preparation, we constructed a graph representing spatial relationships among spots. This was achieved by generating a positional data frame (pos) containing the x and y coordinates linked to each spot’s barcode. We then created a spatial weight matrix (weight) using the getSpatialNeighbors function from the MERINGUE package, which computes proximity between spots. To identify primary spatial patterns across features, MERINGUE calculates a spatial cross-correlation index (SCI). This index assesses the correlation between gene pairs exhibiting significant spatial heterogeneity, considering genes expressed by a sufficient fraction of cells. The SCI is determined using the following formula:

$$}=\frac\nolimits_^\mathop\nolimits_^_}}\frac\nolimits_^\mathop\nolimits_^_}\left(_-\bar\right)\left(_-\bar\right)}\nolimits_^_-\bar\right)}^}\sqrt\nolimits_^_-\bar\right)}^}},$$

where xi and yj represent the levels of two different features and \(\bar\) and \(\bar\) are the mean feature levels. Wij represents the spatial weight matrix or spatial weights between locations i and j. This matrix captures the spatial relationships or proximity between the spatial units (for example, barcodes or regions) in the spatial grid. The values in the spatial weight matrix Wij quantify the proximity of spots. The feature matrix, feature_mat, contained the expression values of the features for each barcode and was transposed to align with the spatial weight matrix. The resulting matrix encapsulates the spatial correlation coefficients for each pair of features across the spatial grid, providing insights into the spatial distribution and coexpression patterns of the genes of interest.

Imaging IHC and spatial transcriptomic quantification

To quantify the correlation between IHC staining from an image and the expression levels of the corresponding gene, we implemented a detailed image processing and analysis pipeline. We start by converting the IHC image to grayscale by extracting the blue-channel (img[,,3]) and transform into data.frame format using the reshape2::melt() function. The pixel intensity values were extracted and rescaled to a range of 0–1. Next, we performed a Gaussian kernel smoother:

$$\widehat\left(x,y\right)=\frac\nolimits_^_\exp \left(-\frac_\right)}^+_\right)}^}^}\right)}\nolimits_^\exp \left(-\frac_\right)}^+_\right)}^}^}\right)},$$

where zi represents the intensity value at the position (xi, yi) and \(\hat(x,\,y)\) is the smoothed intensity value at the spatial position (x, y). Next, we integrated the spatial distribution of the IHC intensity values (smoothed) and spatial transcriptomic gene expression of the corresponding genes. For the alignment of spatial data, we segmented the spatial data grid into smaller sections and calculates the mean expression value within each segment using the point.in.polygon method of the sf package. If \(_\le y < _}}\) or \(_}}\le y < _\), the x coordinate of the intersection of the ray extending to the right from (x, y) is computed with the following edge:

$$_}=_+\frac_)\,(_}\}}-_))}_}\}}-_)}$$

where xi and yi represent the coordinates of the current vertex of the polygon. If no cells are present in a segment, a value of ‘NA’ is returned. To assess the correlation between image intensities and gene expression, we fitted a linear model to the data, with expression as the dependent variable and cell density as the independent variable. The correlation coefficient was calculated to quantify the relationship between image intensity and expression. The significance of the correlation was assessed using the P value from the correlation test.

Single-cell deconvolution with Cell2location

To set up the Cell2location model, we configured the AnnData object using the setup_anndata function, defining parameters such as the number of cells per location and detection sensitivity. We trained the model on a graphics processing unit for 500 iterations to ensure computational efficiency. After training, we used the export_posterior function to extract the posterior distribution of cell type proportions, sampling 1,000 times for precise estimation across the spatial framework. Median estimates of cell type abundance were saved in adata_vis.obsm[‘q05_cell_abundance_w_sf’]. Finally, we incorporated the cell type abundance data back into the SPATA object using the addFeature function in SPATA2, allowing for further analysis steps within the spatial data framework.

Spatial super-resolution inference with BayesSpace

Our analysis incorporated a super-resolution approach implemented into the runSuperresolution function, specifically designed to augment the resolution of spatial gene expression. We used the BayesSpace11 algorithm, which inferred super-resolution by performing enhanced clustering on principal component (PC) space derived from log-normalized gene expression data and then mapping the high-resolution PCs back to gene expression space using predictive models, such as linear and XGBoost regression, to spatially visualize and analyze refined gene expression patterns at a finer scale. To achieve this from SPATA2 data, we then created a SingleCellExperiment (SCE) object. The preprocessing of the spatial data was conducted through the BayesSpace::spatialPreprocess function, which is designed for Visium platform data. The data were then clustered spatially with the BayesSpace::spatialCluster function, where the number of clusters was set on the basis of our earlier determined unique features count. The clustering process was iterated for repetitions with a burn-in of iterations to ensure model stability and convergence. Following spatial clustering, we applied a super-resolution enhancement using BayesSpace::spatialEnhance, which further refines the spatial resolution of clusters by enhancing the signal within the spatial data.

Inferred IHC visualization

For visualizing inferred IHC results, we used the plotInferredIHC (available from https://github.com/heilandd/NePSTA) function. This function is designed to create illustrative representations of the spatial distribution of cellular features within the tissue on the basis of the enhanced-resolution data. The visualization process began with the extraction of enhanced PC analysis dimensions from the sce.enhanced object and reference PC analysis dimensions from the SCE object using the reducedDim function. We then aligned the dimensions of the enhanced and reference datasets to a common number of PCs. With the BayesSpace::featurePlot function, we generated data for feature visualization on the basis of the enhanced PC analysis results. We normalized the feature intensities using the SPATA2::hlpr_normalize_vctr function to ensure that the intensity values were proportional and visually interpretable. Depending on the chosen type of visualization, either scatter plots or Voronoi tessellation were used to depict the spatial feature data. Scatter plots were rendered using the scattermore::geom_scattermore function, while Voronoi tessellation was executed with ggforce::geom_voronoi_tile. This allowed for a flexible representation of spatial data, either as individual data points or as contiguous spatial regions. The plots were further refined by adjusting the spatial coordinates to the range of the image and scaling the feature intensities for optimal visualization. The final output was a plot that combines the spatial coordinates with the feature data, overlaid onto the spatial image.

Construction of spatial graphs from Visium spatially resolved transcriptomic data

Each spatial transcriptomic dataset was preprocessed using SPATA2, which included log-transforming the count matrix and aligning the imaging data (H&E Image). Nucleus positions were annotated through an automated ilastik pretrained segmentation algorithm. For samples where high imaging quality was lacking and automated segmentation was unsuccessful, we applied a modified version of a recently published CytoSpace algorithm, as described above. We extracted spot coordinates using the getCoordsDf function from the SPATA2 package and calculated a pairwise distance matrix representing spatial distances between all cell pairs on the basis of their x and y coordinates. To prevent computational issues from zero-distance values (self-distances), we substituted these values with a constant of 1,000, ensuring no cell was mistakenly marked as its own neighbor in later steps. We set a distance threshold slightly above the smallest nonzero value in the modified matrix to establish cell adjacency, where cells within this threshold were marked as adjacent (1) and those beyond as nonadjacent (0). The adjacency matrix was converted into an undirected graph using the graph_from_adjacency_matrix function from the igraph package, with rows and columns labeled by each cell’s unique barcode from getCoordsDf. We retrieved the gene expression matrix from the ‘obj’ object, transposing it to match the graph vertices. The matrix was then filtered to include only the top 5,000 variable genes, corresponding to labeled graph vertices, creating a combined structure of spatial and expression data. With this graph, we analyzed local spatial gene expression patterns by examining the neighborhood around specific query spots. A three-hop neighborhood for a query spot included all spots within three edges, capturing the spatial context and connectivity within the defined distance.

The NePSTA graph neural network

The NePSTA framework uses GNNs to analyze spatially resolved multiomics data, integrating clinical, histological and molecular features. Using a GIN architecture, NePSTA models local spatial structures and predicts tumor-related parameters with high accuracy. A detailed description of the network is provided on GitHub (https://github.com/heilandd/NePSTA).

Data splitting and preparation

From 107 EPIC-characterized participants, datasets were divided into training and validation subsets. For multibiopsy samples, datasets were split by biopsy cores; for single-specimen samples, manual segmentation was performed using SPATA2 to create regions. Training data comprised 97,000 subgraphs from tumor datasets and 12,000 subgraphs from healthy controls, created using a three-hop neighborhood approach. Evaluation data comprised subgraphs from the validation datasets covering a range of epigenetic classes for robust model evaluation. A detailed description of the network is provided on GitHub (https://github.com/heilandd/NePSTA).

Node and edge features

Expression profiles (top 5,000 genes), CNAs, histological annotations (one-hot encoded) and encoded H&E image vectors were included in the neural networks. Edge features were defined by spatial proximity with up to six neighbors per node; self-loops ensured that nodes retained the original feature information. A detailed description of the network is provided on GitHub (https://github.com/heilandd/NePSTA).

NePSTA GNN architecture

The back bone consisted of a three-layer GIN to process subgraphs, aggregating node features using MLPs and message-passing techniques. Regularization consisted of batch normalization, dropout layers and leaky rectified linear unit activations to stabilize training. The latent space consisted of a global mean pooling of features to enable graph-level predictions for survival and tumor classification. A detailed description of the network is provided on GitHub (https://github.com/heilandd/NePSTA).

Evaluation metricsMetrics used

Accuracy, precision, recall, F1 score and confusion matrices were used to evaluate performance. Loss functions consisted of cross-entropy loss for categorical variables and L1 norm and mean squared error for continuous variables. Losses were combined using weighted sums that were optimized during training. A detailed description of the network is provided on GitHub (https://github.com/heilandd/NePSTA).

Training and inferenceOptimization

The Adam optimizer was used to minimize task-specific losses over epochs. Gradients were reset after each batch to refine model weights iteratively. A detailed description of the network is provided on GitHub (https://github.com/heilandd/NePSTA).

Evaluation of the subgraph cell composition

To reconstruct the cellular composition of each subgraph, we pinpointed cellular positions as previously described and then determined the probable cellular composition by considering the cell count per spot and the deconvolution scores from Cell2location, using the spAnchor package for implementation. We began by pinpointing each nucleus’s spatial location using the SPATAwrappers getNucleusPosition function and recorded the spot coordinates using the SPATA2 getCoordsDf function. The spatial coordinates representing the positions of nuclei were obtained as \(P=\}}_}}|i=1,\,\ldots ,\,N\,\}\), where pi is the coordinate pair for the ith nucleus and N is the total number of nuclei. Spatial grid coordinates corresponding to the transcriptomics data points were retrieved, denoted as \(G=\}}_}}| \,j=1,\ldots ,M\,\}\), with each gj representing the coordinate pair for the jth grid point. For each grid point gj, a vector of deconvolution scores \(}}_}}=\}}_}}| k=1,\ldots ,T\,\}\) was extracted, where djk represents the score for the kth cell type at grid point j and T is the number of cell types. The scores were normalized to a range of [0, 1] and the number of cells of each type at each grid point was estimated as follows:

$$}}_}}=}\left(\frac} }_}}\times }}_}}}_}=1}^}}} }_}}}\right)$$

where \(} }_}}\) is the normalized score and Nj is the number of cells at grid point j. Cell types were assigned to each grid point gj to create a mapping Mj, correlating grid points with their respective cell types. The cell type mapping was integrated with nucleus position data to produce a comprehensive spatial map of cell type distribution: \(S=\}}_}},}}_}})}}_}}\in P,}}_}}\in M\,\}\). This methodology facilitates the visualization and analysis of the cellular composition within the tissue section, providing insights into the complex spatial organization of the cellular environment.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. Data distribution was measured and tailored methods were applied to handle non-normal distributions in gene expression, methylation and image data.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

留言 (0)

沒有登入
gif