Diagnostic classification of childhood cancer using multiscale transcriptomics

KiCS enrollment and ethics declaration

The SickKids Cancer Sequencing (KiCS) Program is a prospective study of a demographically diverse population of children and adolescents and young adults (AYAs) with refractory, metastatic, relapsed or rare cancers, as well as children with unresolved suspicion of cancer predisposition. It was launched in April 2016 and is an ongoing study. Guardians or capable patients are guided through an informed consent discussion with a trained genetic counsellor or pediatric oncologist. KiCS has been approved by The Hospital for Sick Children’s Research Ethics Board. The first trimester human fetal tissue was collected from an elective termination of pregnancy procedure at Addenbrooke’s Hospital through the ethically approved Wellcome-MRC Cambridge Stem Cell Institute and Department of Clinical Neurosciences tissue bank (REC-96/085). Written informed consent was given for tissue collection by the patient in accordance with the Declaration of Helsinki 2000.

Reference dataset

We used the UCSC Treehouse Childhood Cancer Initiative Compendium (version 9, March 2019)9 as a reference dataset to build the hierarchy of subtypes and train the ensemble CNN classifier. This cohort includes 11,750 tumor samples from TCGA16, TARGET17 and other contributing institutions, prepared with either poly(A) selection or ribosomal depletion. Gene expression counts from the STAR + RSEM Toil RNA-seq pipeline59 of samples in the compendium are publicly available and cover more than 58,000 genes, raw or normalized by log2 transcripts per million (TPM). The same pipeline was applied to any other data in this publication. Alternatively, counts obtained with Kallisto were used for performance benchmarks. To expand the pediatric reference, we added 313 further samples from St. Jude Children’s Hospital Pediatric Cancer Genome Project (PCGP)18, run through the same pipeline after filtering them by alignment quality. While building the subtypes hierarchy, we removed samples with particularly low purity but kept them to boost the ensemble CNN training. Ribodepleted samples showed consistent batch effects across tumor types during the clusters search. We, thus, chose to exclude them from the rest of the analysis and the CNN training; ribodepleted-only classes were removed from the hierarchy. Finally, we added 1,735 normal tissue samples with the best coverage and quality scores from 51 different organ sites from the GTEx project19 to the dataset. To avoid degradation in the output from tumor samples with normal contamination within the Treehouse cohort, the tumor and normal datasets were kept separate at the first stage of the clustering and merged later as separate branches of the classification hierarchy.

In input to the clustering algorithm, genes mapping to non-coding sections of the RNA were removed. Among these remaining, only genes with high variability, accounting for 99% of the cumulative variance on the full cohort, were kept. This reduced the feature space to 18,010 functional genes and pseudogenes, allowing us to speed up the rest of the analysis with a negligible loss of information.

Diagnoses and genomic markers reported by the sharing institutions were used, when available, as a reference for tumor type comparisons and the annotation of clusters.

Quality control and batch effects

Samples included in the final version of our reference cohort were pre-filtered by standard quality control parameters by the groups that generated the data9,18,19. We included GTEx samples with a high number of sequenced reads (as a proxy for coverage). St. Jude samples from the PCGP cohort were filtered based on TPM distribution. Samples were ranked based on the number of protein-coding genes found to have zero expression (TPM = 0) and excluded if more than 25% of their protein-coding genes were not expressed. This resulted in an improvement in RACCOON’s clustering and clarity of each cluster (Supplementary Fig. 13). Any St. Jude sample already present in the Treehouse data was removed. This left us with 512 samples from the St. Jude cohort—62% of the total available.

Differential expression and gene sets analysis

log2 TPM-normalized counts were used for clustering, classification and map projections. For differential expression analysis, TMM normalization and the negative binomial generalized log-linear model fitting from EdgeR60 version 3.30.3 were used instead. Gene set enrichment analysis (GSEA) and single-sample GSEA (ssGSEA) were carried out with gseapy61 version 0.9.5 in Python version 3.6.9 and GSVA62 version 1.36.3 in R version 4.0.2. ssGSEA-based scores were also calculated with gseapy version 0.9.5 on TPM-normalized counts and scaled between 0 and 1 to assure consistency in the comparisons. They were used as part of stemness, immune activity and neuroblastoma identity scores (see Stemness score, Immune deconvolution and activity score and Neuroblastoma cell lineage score in Methods for details). The two-sided Mann–Whitney U-test was used when evaluating significance in comparing these scores and any other distribution between paired groups of samples throughout the text.

Plots and diagrams were produced with Matplotlib63.

Survival analysis

Survival analyses and log-rank tests were carried out with lifelines version 0.21 (ref. 64). Where available, outcomes were defined based on overall survival times provided by the sharing institution. A Cox survival regression of neuroblastoma subtypes was performed with the same library on 161 neuroblastoma observations, of which 81 were censored.

Multilevel clustering

Given a set of data points, RACCOON removes low-information features, reduces their complexity with a non-linear dimensionality reduction algorithm and finally identifies clusters with a density-based approach. The search is continued depth-first for each of the clusters identified iteratively. The search is terminated only when further splits would lead to a particularly suboptimal value of the objective function or the class population is lower than a pre-set bound (for example, 25–50 samples). The features removal cutoff, the number of neighbors employed by uniform manifold approximation and projection (UMAP)65 and the clusters search parameter (for example, maximum clustering distance parameter in DBSCAN) are optimized by maximizing a clustering quality score. For this project, the tunable parameters were optimized with a grid search, and the total silhouette coefficient of the dataset66 was set as the objective function. This score quantifies the quality of clustering by calculating the ratio between the clusters cohesion and their separation. Ranging between −1, when all points are incorrectly assigned, and 1, when all points are well assigned, we set here 0 as the minimum threshold for accepting a set. In a scenario where the best combinations of parameters found still leads to a negative score, the cluster under scrutiny is not split.

We applied RACCOON to our extended dataset to build a hierarchical tree of tumor and normal subtype clusters. The number of final (reduced) dimensions was empirically set to 12, a choice that proved to be a good compromise between accuracy and computational cost. A population cutoff of 25 was applied to stop the search, and nodes with fewer than ten samples were pruned, because their annotation and training for the classifier would be too unreliable. This method initially yielded more than 700 individual clusters. A subset of low-population leaf nodes was removed after manual annotation, for the lack of sufficient biological and gene expression information to support any finding, together with classes populated exclusively by ribodepleted samples. This process left us with a total of 455 clusters (406 tumor and 49 normal tissue classes, respectively), of which 303 are non-overlapping independent terminal (leaf) nodes.

Normal tissue inclusion

Multilevel clustering was applied independently to the normal tissue samples and malignant subsets. Normal and tumor samples from the same organ would have been grouped in common classes at the highest level if they had been mixed. Clustering quality decreases, as less aggressive or low-purity tumors can be difficult to separate from healthy normal samples.

During training, this choice forced OTTER to learn high-level features that distinguish normal tissue from neoplasms, independently from their anatomical location. This boosted OTTERʼs ability to recognize tumor populations in low-quality samples, as the tumor–normal separation is prioritized in the hierarchy. Similarly, we expect generalization to unseen normal tissues to also be improved.

Annotation

Clusters obtained with RACCOON were annotated based on their most characteristic transcriptional features compared to the closest members of their hierarchical family. Differential gene expression and GSEA were carried out to identify each clusterʼs defining gene and pathway expression. Limited clinical information, including age, sex and a diagnostic label, were available for each sample. All 455 separate classes were first annotated to assign a unique label (a code and a name). We then extended the annotation for the five major families of pediatric tumors: CNS, leukemia, neuroblastoma and the two mesodermal classes, as well as the branch stemming from the healthy normal samples. More details on the annotation of these groups can be found at https://rna-atlas.github.io/.

Entropy calculations

Expression variance and its derived quantities (for example, the coefficient of variation) could be used as a proxy of variability; however, they fail when dealing with multimodal or discontinuous distributions. Shannon entropy S is a much more appropriate and robust measure. It is a generalization to Boltzmannʼs thermodynamics entropy; it quantifies the information content or the randomness of a given distribution23. It is defined as follows:

$$S = - \mathop \limits_i $$

where Pi is the probability of an event i, in our case the probability that a certain gene will lead to a specific expression count.

Starting from TMM-normalized data, which already account for the skew introduced by extreme values across the population, the expression was standardized along genes to limit heteroscedasticity. Being additive, Shannon entropy cannot be naively measured on groups with different populations, and it requires enough samples to approximate a continuous distribution. In our case, this holds true for a good number of classes but not for the smallest leaf nodes. We, thus, first approximate the expression distribution of every single gene independently with a fixed-bandwidth Gaussian kernel density estimation and then extract the probabilities for Shannon entropy from the estimated distribution on a 100,000-points grid. Higher-resolution grids approach the limit of differential entropy and approximate the integral better, yet they lead only to marginal changes and increase the computational cost considerably. A mesh to 250,000 points led to a change in entropy of less than 2%, confirming that our choice was close to convergence. The entropy calculation was limited to a subset of more than 14,000 highly variant genes, by filtering those with both consistent low expression across samples and low entropy across all classes. The final values obtained for each gene and each class were divided by the median entropy of the normal tissues cohort class N000. All calculations were carried out with scikit-learn 0.22.2.post1(ref. 67).

RNA-seq expression data are commonly approximated by a negative binomial distribution, which accounts for overdispersion in its mean–variance relationship. The coefficient of variation is a popular measure of mean independent dispersion; however, it still relies on variance and, thus, inherits all its shortcomings when attempting to quantify transcriptional noise.

We observed a fair correlation between entropy and mean expression across all groups (Pearson r = 0.68). We, thus, fit a linear model on the entropy and adjusted the score to account for its dependency on the median expression (Supplementary Fig. 14). The coefficient of determination R2 was 0.46, suggesting that the mean dependent component accounts for less than half of the total entropy, and transcriptional noise within and across tumor types cannot be entirely explained by differences in expression levels alone. This adjusted entropy (S) was used throughout the manuscript.

PaWS calculations

Although entropy entails the overall variability of gene expression within a population, part of this can be translated into a different pattern of activation of relevant biological pathways, thus defining different tumor types. This inter-tumoral heterogeneity is explicitly accounted for by the cluster hierarchy itself. We can define a score based on the number of offspring nodes that a specific group generates and measure it on the classes that we identified. We call this PaWS and define it as follows:

$$PaWS\left( n \right) = \frac \right|}}}\frac}}}\left( \right|} \right)}}}}}\left( \right)}}$$

where \(n = \left\ \right\}\) is a set of samples identified as a cluster or node; L is the set of all leaf nodes l—that is, all the childless nodes, Ln  is the set of leaf nodes that are offspring of n; and root is the hierarchical tree root, a set containing all our dataset samples. The PaWS of n is, thus, defined as the ratio between the cardinality (|Ln,|) of Ln and the cardinality of L – that is, the number of leaf nodes that are offspring of n, over the total number of leaf nodes, weighted by the inverse of the log population ratio of n. This last term was added to account for the fact that smaller clusters will have less probability to be split by the algorithm.

Correlation between heterogeneity scores

The relationship between these quantities is not trivial; we observed a weak correlation (Spearman rank test coefficient = 0.355, P = 1.120 × 10−5) between median entropy and PaWS score, after removing all the leaf nodes, to avoid including clusters with possible subtypes but insufficient population to be split by the algorithm. Entropy is thus a good proxy for intra-cluster expression disorder, as it accounts for that part of expression differences within a population that are not coherent enough to be translated into clear subtypes and yet not able to disrupt the overarching patterns that define the parent class.

Stemness score

A unified stemness score was calculated as the average among CytoTRACE35 single-cell stemness score, mRNAsi36 and the ssGSEA score from Miranda et al.34. The score was then normalized for each inter-tumor type comparison.

Immune deconvolution and activity score

The immune activity score was calculated as the average between Reactome immune system31 ssGSEA score and Gene Ontology immune activity32 score. The result was averaged with methylation-derived leukocyte content fraction by Thorsson et al. (218)33 in TCGA samples where the information was available. The score was then normalized for each inter-tumor type comparison. Immune deconvolution scores and immune cell type ratios were obtained with CIBERSORT68.

Neuroblastoma cell lineage score

A unified neuroblastoma cell lineage score was calculated by first averaging separately neural crest-like and mesenchymal identity ssGSEA scores and adrenergic identity scores from three different publications43,44,45. For each sample, the final score was obtained as the difference between the mesenchymal/NCC-like unified score and the adrenergic unified score, and it was scaled to range between 0 (more adrenergic) and 1 (more mesenchymal).

Single-cell RNA-seq

Fetal age (post-conception weeks (PCWs)) was estimated using the independent measurement of the crown rump length (CRL), using the formula PCW (days) = 0.9022 × CRL (mm) + 27.372.

Paired femora, tibiae and fibulae were dissected from the fetal hind limbs by a specialist bone and soft tissue pathologist (P.B.) under a microscope using sterile microsurgical instruments. The femora were further dissected into proximal and distal halves, to give eight samples in total (paired proximal and distal femora, paired tibiae and paired fibulae). Each sample was then processed into single-cell suspensions. In brief, tissue was digested in a 5 µg ml−1 Liberase TH working solution prepared from Liberase TH powder (Sigma-Aldrich, 5401135001) and 1× PBS on a shaking platform (750 r.p.m.) at 37 °C for 30 minutes. The tissue was gently agitated using a P1000 pipette after 15 minutes. Then, 5 ml of 2% FBS in PBS was added to stop the dissociation, before second-stage digestion with 0.25% trypsin solution for a further 30 minutes at 37 °C, with pipette agitation every 5 minutes. Cells were then spun down at 750g at 4 °C for 5 minutes and resuspended in 50–200 µl of 2% FBS in PBS. Fetal cells were loaded for single-cell RNA-seq directly after sample processing.

Single-cell suspensions from the eight samples were loaded onto a separate channel of a Chromium 10x Genomics Single Cell 3′ v2 library chip as per the manufacturerʼs protocol (PN-120233), aiming for a cell capture recovery of 3,000–5,000 cells. cDNA sequencing libraries were prepared according to the manufacturerʼs protocol and sequenced on an Illumina HiSeq 4000 (2 × 50-bp paired-end reads).

Raw sequence reads in FASTQ format from fetal samples were processed and aligned to the GRCh38-1.2.0 human reference transcriptome using the Cell Ranger version 2.1.1 pipeline69 (10x Genomics) with default parameters.

The resulting expression matrices were processed with SoupX version 1.3.0 (ref. 70) to estimate and remove cell-free mRNA contamination before analysis. Cells with fewer than 300 genes and more than 7,500 genes were filtered, as well as those in which mitochondrial genes represented 10% or more of total gene expression. A quantitative estimation of cell cycle stage was performed on the remaining cells with Seurat version 3.0 (ref. 71). Log-normalization was then performed before data scaling, which used cell cycle score, mitochondrial gene expression level and the total unique molecular identifiers (UMIs) per cell as regression variables.

We normalized the raw expression data to log2 (TPM + 1) and randomly selected 25,000 samples. The resulting dataset was merged with the bulk RNA-seq sarcoma data (T002 MESODM IMMhigh and T003 MESODM STEMhigh). A low-information filtering step was applied, to boost the signal-to-noise ratio and partially remove batch effects, before projecting the data to a lower-dimensionality space with UMAP65. The nearest neighbors cutoff was set as the square root of the total population. The centroid distance between T002 and single-cell data was constantly higher than that between T003 and the single-cell cluster, independently of how the dimensionality reduction was parametrized over a grid of combinations (Supplementary Fig. 15).

Classification

We built a set of mono-dimensional CNNs, called OTTER, which takes the RSEM gene expression reduced output (18,010 log2 TPM genes) as input and returns the membership probability to any or multiple of the 455 hierarchical classes.

We trained these networks on the full reference cohort of more than 13,000 samples, which includes samples at a range of sequencing depths—a computationally expensive task. The resulting model proved markedly more accurate and robust than alternative classification methods, such as k-nearest neighbors, which are affected by tears, deformations and the partial loss of meaningful distances in the dimensionally reduced space, and have limited flexibility when dealing with multiple tumor components.

To identify optimal architectures, we emplyed Hyperopt, a Bayesian hyperparameter optimization library based on a Tree-structured Parzen Estimator (TPE)72. The micro-F1 (μF1) score was chosen as the objective function to guide the search. We enriched this group of models with a number of manually tuned architectures.

All models included one-dimensional convolutional (CV) layers followed by fully connected (FC) layers. The number of filters of CV layers was tuneable and shared across layers, and so was the kernel size. Each CV layer was followed by batch normalization and max pooling with fixed size 4 and stride 2. The size of hidden dense layers was also tuneable and halved at every successive layer, and dropout was activated at parametrizable percentage. The loss function was binary cross-entropy. Adadelta was set as optimizer with a starting learning rate of 0.001 and early stopping.

The top-scoring models among the pool of all candidates were subject to five randomized rounds of five-fold cross. The splitting into train and test sets was stratified to assure a proportional coverage for every class, and early stopping after three epochs was activated to avoid overfitting. Five candidate models were then selected according to their different performances on a set of scores including macro-F1 (MF1) and macro precision recall area under the curve (MAUCPR).

The final classifier was built as an ensemble average of a subset of these models, an unweighted arithmetic mean of three. The ensemble classifier led to an improvement in most scores while limiting the shortcomings of each single model and adding robustness to the final predictions. Finally, the models were trained on all available samples, with an adequate number of epochs to avoid overfitting for each separate case. A comparison with alternative tumor type classification RNA-seq models available in literature can be found in Supplementary Table 6.

The models were built as multilabel and multiclass; both the input labels and the output membership assignment are not exclusive to a single class. A post-processing step ensures consistency among the probabilities of classes within a family: if the classifier assigns higher probability to an offspring node than to its parent, the average of the two is assigned to both classes, and sibling nodes are adjusted accordingly. This correction is then propagated upward along the hierarchy.

The scores in output from the final ensemble are not binarized to allow the user a full picture of confidence scores. To make up for the strong class imbalance in the training dataset, we recalibrated these output probabilities.

We identified a binary classification cutoff that maximizes an adapted Youden J statistic (precision + recall) and then transformed the output scores linearly so that this cutoff value falls at 0.5 of the final probability. Although this change does not affect markedly the resulting output (.998 cosine similarity, .957 hierarchical similarity on the validation cohort), it helps relieve some of the overfitting on minority classes. The median cutoff was .585 (.651 for highest level classes, .425 for the lowest level, .543 for leaf nodes), suggesting that this is an overall balanced classifier.

The input data features were ordered by correlation following a quick agglomerative clustering on their log2-normalized expressions. The input gene arrays are scaled to a 0–1 range, and the labels were transformed to a one-hot Boolean encoding.

All models were built with Keras version 2.2.2 and TensorFlow73 version 1.10.1 backend. All code was run with Python version 3.6.9, and model training was run on our local high-performance computing machine with eight Xeon E5-2670 v2 @ 2.50-GHz or Xeon Gold 6140 CPU @ 2.30-GHz cores and 64 GB of RAM.

Comparison to DNA methylation-based classifier

We compared the results of our transcriptional classifier to a DNA methylation-based classifier52 for a set of CNS tumors. In a previous work57, we profiled 252 high-risk pediatric cancers through multiple sequencing technologies. Sixty-three of these are CNS tumors with data from both DNA methylation and RNA-seq and can be directly compared.

After a manual curation, the methylation classifier matched these tumors to their presenting clinical diagnosis in 86% of the cases. The remaining 14% are either matched to a wrong subtype but within the correct parent family or do not match the expected subtype. The two classifiers agree in almost all of these cases, within the limits of tumor types available in the respective reference datasets, and complement each other in the few exceptions.

The dataset includes a number of tumor subtypes that are rare or absent in our reference cohort (atypical teratoid rhabdoid tumor, diffuse midline gliomas (DMGs) and meningiomas); for the purpose of this comparison, consistency in their assignment to a transcriptionally similar subtype (for example, all DMGs that were assigned to the same proximal subtypes of high-grade gliomas) was considered a match.

Among the 8% of samples matched only to the parent family according to DNA methylation, OTTER, our transcription-based classifier, could correctly identify the subtype of three samples: a medulloblastoma, called as retinoblastoma by DNA methylation; a low-grade glioma, instead of a high-grade glioma; and an ependymoma, whose methylation profile was reflecting immune infiltration. In two cases, the classifiers were in agreement, in spite of a mismatch with respect to the pathologist diagnosis: an IDH wild-type glioma and a medulloblastoma of the G3 subtype. Finally, there are three cases in which the transcriptional classifier fell short, where the DNA methylation matched the correct subtype: an ependymoma, which was not recognized by OTTER due to low purity and high immune infiltration, and two atypical teratoid rhabdoid tumors, a subtype that is absent in our RNA-seq reference.

Both classifiers can provide highly accurate predictions and complement each other in the most complex cases.

Hierarchical similarity score

To evaluate the accuracy of predictions within the hierarchical framework, we employ the hierarchical similarity score (H), a union/intersection score based on the graph information content similarity (SimGIC) that measures the proximity of two points along the class tree while accounting for its structure and populations:

$$\begin\left( \right) = 1 - \Delta _\left( \right)\\ \qquad\qquad \, = \frac\nolimits_ \right) \cap nodes\left( \right)} }}\nolimits_ \right) \cup nodes\left( \right)} }}\\ \qquad\qquad \, = \frac\nolimits_} } \right)} \right)} }}\nolimits_} } \right)} \right)} }}\end$$

where v1, v2 are the membership assignment probability vectors of two samples; nodes(vx) is the list of nodes or classes activated in such vectors; \(nodes\left( \right)\) is the list of all nodes; and w(n) are the nodes’ weights. These are calculated as information content—that is, the probability of a sample falling into the lower node connected to the edge, which can be approximated to the class frequency of observations in the training dataset:

$$w_\left( n \right) = - logp\left( n \right)$$

We also define the partial hierarchical similarity score (η), which looks only at the branches active in the ground truth while disregarding false positives:

$$\begin\eta \left( \right) = 1 - \delta _\left( \right)\\ \qquad \qquad = \frac\nolimits_ \right) \cap nodes\left( \right)} w\left( n \right)}}\nolimits_ \right)} w\left( n \right)}} = \frac\nolimits_})} w\left( n \right)\;}}}\left( \right)}}\nolimits_})} w\left( n \right)\;v_1\left( n \right)}}\end$$

Sequencing depth benchmarks

Stochastic subsampling of the total number of reads was repeated at set intervals for five chosen samples from the KiCS cohort—five times with different random seeds for each set threshold. Each original sample had at least 108 reads (on paired FASTQ files), and its OTTER output was set as ground truth. Accurate classification (.85 with RSEM, .75 H with Kallisto) can be obtained with OTTER with 1 million reads. Although less accurate, Kallisto is considerably faster (Supplementary Fig. 6).

Library preparation and storage benchmarks

OTTER was trained on a dataset of poly(A) sequencing samples from fresh-frozen (FF) tissue. To evaluate its generalizability to alternative library preparation techniques, we tested its performance on a set of 247 samples from the Treehouse Childhood Cancer Initiative compendium version 9 (ref. 9) treated with ribosomal depletion (Supplementary Fig. 16). The closest possible tumor class to the provided diagnostic label was set as ground truth. Tumor subtypes that were absent in our reference were removed from this analysis; however, tumors lacking a matching class in the atlas, but with a consistent population in the reference cohort, were included. As an example, gastrointestinal stromal tumors, which are consistently found in the T078 SARC EPITH/KIT class but lack for now their own subgroup due to a limited population (n = 6), were included, and so were samples with myofibromatosis. The classifier can still identify the correct tumor type, albeit with lower confidence. Although ribdopleted libraries are somewhat compatible with our classifier, the results should be treated cautiously, and the user should be aware that different tumor types will have an unequal impact on the classifierʼs performance.

Formalin-fixed, paraffin-embedded (FFPE) is commonly used for long-term storage of samples, yet degradation of the DNA and RNA in FFPE samples has been described in literature, and most molecular-based analyses seem incompatible with FFPE data74,75 Information on the storage method is available for only 93 of the 247 samples in the Treehouse cohort. We then repeated the analysis on a smaller cohort of matched FF and FFPE samples (n = 52 pairs). This set includes a slice of the KiCS cohort and samples from two different publications75,76. We also stratified the FFPE tumors by library preparation to demonstrate that the impact of library preparation and storage are additive (Supplementary Fig. 16).

Expanding the tumor atlas

We investigated the behavior of OTTER in inference on data from tumor types missing from our transcriptional phenotypesʼ hierarchy, and we measured the effect of adding such data to the RACCOON clustering. To this end, we selected 19 atypical teratoid/rhabdoid tumors (AT/RTs) from an unseen dataset, which was not used for clustering or training, and ran them through the current version of OTTER. Eighteen AT/RT snap-frozen tumor materials and clinical information were collected at The Hospital for Sick Children or through an international collaborative network with consent as per protocols approved by the hospital research ethics boards at participating institutions. All AT/RTs had negative BAF47 immunohistochemistry stain and biallelic SMARCB1-inactivating alterations as confirmed with FISH, MLPA, targeted Sanger sequencing or high-throughput sequencing analyses. RNA-seq libraries were prepared using the Illumina TruSeq RNA Sample Preparation Kit for poly-adenylated mRNA selection and sequenced at the Centre for Applied Genomics77. To these, we added a single AT/RT sample from the KiCS cohort.

Our reference cohort currently contains three samples that have been labeled as AT/RT. Two of these are found in T040 GLI HG/GBM MES, a group of high-grade gliomas and glioblastomas of the mesenchymal subtypes. The remaining sample was grouped by RACCOON with lung adenocarcinoma, likely due to contamination or low tumor purity. A true AT/RT target class is not available to OTTER. Fifteen of 19 samples are assigned to classes along the T040 CNS branch with at least 5% of confidence. AT/RTs also possessed some signal of high stemness, yielding a partial match to the mesodermal stem high class (T004) in 15 of 19 samples.

We then clustered the group of 21 AT/RTs (19 unseen + 2 high-quality samples from the reference cohort) using RACCOON together with samples from the CNS class (Supplementary Fig. 17).

All AT/RTs clustered together within a new class (just below the high-grade gliomas T034, in the same lineage as T040). This demonstrates that a critical threshold of AT/RTs was reached to create a new subtype. To study what the exact threshold is, we performed subsetting experiments. Clustering was repeated using different numbers of AT/RT samples along with 100 other CNS tumors, both of which had been randomly selected five separate times. TPEs were used to speed up the search for repeated runs. We computed the adjusted mutual information (AMI) score on the clustering result by assuming a perfect separation of AT/RTs from all other CNS samples as ground truth, to assess how close the resulting partition was to having an AT/RT-only class (Supplementary Fig. 17).

Population characteristics

We included several publicly available, uniformly processed cancer transcriptomes. The tools described are blinded to the sex of the participants whose samples comprise the input dataset. They rely on gene expression data from the protein-coding transcriptome and were not explicitly trained to recognize sex-chromosome-associated genes. No clinical data were included in the training. To our knowledge, gender identity was not recorded or considered in any of the contributing datasets. Furthermore, the data are not disaggregated by sex from the original institutions. A few cancer histotypes identified by clustering are biased (for example, breast cancer) or exclusive to one sex (for example, testicular, ovarian and uterine cancers), and genes on sex chromosomes may play a substantial role in their pathophysiologies (Extended Data Figs. 2 and 3), yet their transcriptional profiles were not the focus of this work. Although the proportions of the sexes have been noted in the clusters annotation, sex differences did not reach significance in clusters discussed in this work and were, thus, not reported.

KiCS classification review

Occasionally, samples from KiCS have been labeled as ‘concordant in the absence of reference’. In this group, we are counting samples that were assigned to families of tumors close to the target in the absence of a strictly matching subtype. We chose a conservative approach in evaluating these. We counted only those tumors: (1) that matched to a tumor subtype of the expected cell type or tissue of origin to that of the expected diagnosis; (2) where multiple samples with the same diagnosis match the same subtype with the same probability profile; or (3) that match a tumor subtype in which we found reference samples of the same diagnosis but for which there are currently too few samples to create their own class. Finally, each putative match was reviewed by a pediatric oncologist to determine whether the data were sufficient to consider the diagnosis as being confirmed. In total, 16.5% of the tumors were in this category.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

留言 (0)

沒有登入
gif