A comprehensive computational framework that integrates gene expression with pathway activity was developed to annotate cell types within diverse scRNA-seq datasets, utilizing multiple pathway databases. Figure 1 provides a schematic overview of the scMCGraph model, illustrating the major stages of this computational framework. This method facilitated the development of a consensus representation of cell–cell interactions, which are derived from signaling pathways, thereby enabling precise cell type annotation. Affinity matrices, created from gene expression and pathway activity data, captured intricate intercellular relationships and were integrated using advanced computational techniques to form a robust model for cell type annotation.
Fig. 1 Schematic overview of the scMCGraph model. The figure illustrates the workflow of the scMCGraph computational framework, from the initial generation of scRNA expression matrices to the final cell type annotation. Key stages include the construction of matrices based on gene expression, calculation with pathway activity data from six databases via the AUCell algorithm, integration and refinement of data through the graph fusion module, and application of the graph refinement module to highlight significant interactions. The process culminates with the extraction of low-dimensional embeddings by a GNN-encoder and the reconstruction of the data structure by a parallel decoder, enabling accurate cell type annotation.
The process began with the independent construction of cell–cell affinity matrices for both reference and query datasets, based on their respective gene expressions. To further enhance the model’s ability to capture subtle signals from low-expressing cell types while filtering out noise, pathway activity data was integrated with gene expression matrices. The AUCell algorithm, which assesses the activation states of cellular pathways, was employed as a means to reduce the impact of non-essential genes. This pathway-focused approach prioritizes biologically relevant signals, thus enhancing the overall sensitivity of the model. By integrating pathway activity, even subtle biological signals from low-expressing cells are preserved, and noise from irrelevant genes is effectively reduced. To explore pathway-specific cellular functions in more detail, we utilized six different pathway databases. The AUCell algorithm was then applied to each dataset, analyzing these six databases. This analysis produced six pathway-cell affinity matrices for each dataset, illustrating the activation status of individual cells within each pathway. These matrices were then transformed into cell–cell affinity matrices based on pathways, capturing similarities in pathway activation across cells and providing a comprehensive map of cellular interactions grounded in shared biological processes.
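The conversion from a pathway-cell activity matrix to a pathway-based cell–cell affinity matrix can be sketched as follows. This is a minimal illustration, not the published implementation: the text does not state which similarity measure is used, so cosine similarity is an assumption, and the function name `pathway_cell_affinity` and the toy scores are hypothetical.

```python
import numpy as np

def pathway_cell_affinity(activity):
    """Cosine similarity between cells, given a (cells x pathways) activity matrix.

    Assumption: cosine similarity stands in for whatever measure the
    original pipeline uses to compare pathway activation profiles.
    """
    activity = np.asarray(activity, dtype=float)
    norms = np.linalg.norm(activity, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # cells with no active pathway keep a zero profile
    unit = activity / norms
    return unit @ unit.T

# Toy AUCell-like scores for 4 cells and 3 pathways (made-up values).
scores = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 0.0],
])
affinity = pathway_cell_affinity(scores)
```

Cells 0 and 1 share a pathway profile and therefore receive a higher affinity than cells 0 and 2; repeating the step for each of the six databases would give six such matrices per dataset.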
Following the creation of pathway-based cell–cell affinity matrices for each dataset, the analysis was refined by applying similarity network fusion (SNF [39]) to each dataset’s primary cell–cell affinity matrix along with its corresponding pathway-specific matrices as part of the graph fusion module. This was performed for both reference and query datasets, resulting in six SNF-enhanced cell–cell affinity matrices per dataset, each providing an integrated view of gene expression and pathway-specific interactions. By integrating multiple views from pathway-specific data, the SNF module further enhances the model’s ability to capture intercellular relationships and subtle signals from low-expressing cell types, improving the robustness of cell type annotation.
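To make the fusion step concrete, the sketch below fuses one expression-based and one pathway-based affinity matrix with a simplified two-view form of similarity network fusion. It follows the general SNF recipe (a row-normalised full kernel, a sparse nearest-neighbour kernel, and cross-view diffusion), but the neighbour count, the iteration count, and all function names are illustrative assumptions rather than the settings used in scMCGraph.

```python
import numpy as np

def full_kernel(w):
    """Scale each row's off-diagonal weights to sum to 1/2; set the diagonal to 1/2."""
    w = np.array(w, dtype=float)
    np.fill_diagonal(w, 0.0)
    row_sums = w.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    p = w / (2.0 * row_sums)
    np.fill_diagonal(p, 0.5)
    return p

def local_kernel(w, k):
    """Keep each cell's k strongest neighbours (excluding itself) and row-normalise."""
    w = np.array(w, dtype=float)
    np.fill_diagonal(w, 0.0)
    idx = np.argsort(-w, axis=1)[:, :k]
    rows = np.arange(w.shape[0])[:, None]
    s = np.zeros_like(w)
    s[rows, idx] = w[rows, idx]
    row_sums = s.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return s / row_sums

def snf_two_views(w_expr, w_path, k, n_iter=10):
    """Diffuse each view through the other's full kernel, then average the two."""
    p1, p2 = full_kernel(w_expr), full_kernel(w_path)
    s1, s2 = local_kernel(w_expr, k), local_kernel(w_path, k)
    for _ in range(n_iter):
        new1 = s1 @ p2 @ s1.T  # expression view updated with pathway information
        new2 = s2 @ p1 @ s2.T  # pathway view updated with expression information
        p1, p2 = full_kernel(new1), full_kernel(new2)
    fused = (p1 + p2) / 2.0
    return (fused + fused.T) / 2.0  # enforce symmetry

# Toy example: cells {0, 1} and {2, 3} form two groups in both views.
w_expr = np.array([[1.0, 0.9, 0.1, 0.1],
                   [0.9, 1.0, 0.1, 0.1],
                   [0.1, 0.1, 1.0, 0.9],
                   [0.1, 0.1, 0.9, 1.0]])
w_path = np.array([[1.0, 0.8, 0.2, 0.2],
                   [0.8, 1.0, 0.2, 0.2],
                   [0.2, 0.2, 1.0, 0.8],
                   [0.2, 0.2, 0.8, 1.0]])
fused = snf_two_views(w_expr, w_path, k=1)  # k=1 only because the toy graph is tiny
```

Running this once per pathway database, each time pairing the expression-based matrix with one pathway-based matrix, would produce the six fused matrices described above.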
To synthesize these insights into a unified framework, the similarity subspace matrices fusion (SSMF), also within the graph fusion module, was employed for each dataset. This method merged the six SNF-enhanced matrices into a single, comprehensive cell–cell affinity matrix, encapsulating a holistic overview of cellular interactions. Each individual pathway view contributed to the final consensus map by enhancing its ability to prioritize biologically relevant signals and filter out noise. By dynamically adjusting the weights in each pathway view to optimize the consensus map, the model effectively reduces biases and errors that may arise from relying on a single data source. This approach resulted in two composite matrices, one for each of the reference and query datasets, representing a unified map of intercellular and pathway-driven affinities. To enhance the structural detail within these composite matrices, positive pointwise mutual information (PPMI) was applied as part of the graph refinement module, improving the representation of significant associations and reducing data noise. Consequently, two refined, unified cell–cell similarity matrices were obtained, each offering a detailed and integrated view of cellular interactions and pathway networks.
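The PPMI refinement can be illustrated with a short sketch. It treats the fused affinity matrix as a table of co-occurrence weights and keeps only associations that are stronger than expected from the row and column totals. The function name `ppmi` and the toy matrix are hypothetical, and the original pipeline may derive its co-occurrence statistics differently (for example, from random walks on the graph).

```python
import numpy as np

def ppmi(affinity):
    """Positive pointwise mutual information of a non-negative affinity matrix."""
    a = np.asarray(affinity, dtype=float)
    joint = a / a.sum()                       # joint probability of each cell pair
    row = joint.sum(axis=1, keepdims=True)    # marginal over rows
    col = joint.sum(axis=0, keepdims=True)    # marginal over columns
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (row @ col))
    pmi[~np.isfinite(pmi)] = 0.0              # zero weights contribute nothing
    return np.maximum(pmi, 0.0)               # drop negative associations

toy_affinity = np.array([[0.0, 4.0, 1.0],
                         [4.0, 0.0, 1.0],
                         [1.0, 1.0, 0.0]])
refined = ppmi(toy_affinity)
```

In the toy matrix the strong link between cells 0 and 1 receives a larger PPMI value than the weak link between cells 0 and 2, which is the emphasis on significant associations that the refinement step aims for.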
The refined matrices served as the foundation for the cell type annotation model. A GNN-encoder was used to derive a low-dimensional embedding for each cell, crucial for the multi-class classification task. Utilizing known cell labels, a multi-class classification loss function (\(Loss\_L\)) was developed to guide model training. Concurrently, a parallel decoder was employed to reconstruct the six pathway-specific cell–cell similarity matrices initially integrated by SNF, defining a reconstruction loss (\(Loss\_G\)). To enhance the model’s predictive accuracy for cell type annotation, a loss based on Kullback-Leibler divergence (\(Loss\_K\)) was included, combining \(Loss\_L\), \(Loss\_G\), and \(Loss\_K\) in a joint optimization framework. This integrative approach ensured that the model not only accurately annotates cell types but also captures the complexity of pathway-specific interactions, key for understanding the nuanced biological processes involved.
The scMCGraph model demonstrated enhanced robustness and accuracy in handling diverse batch effects, as evidenced by its performance across a spectrum of analyses, including cross-platform, cross-time, cross-sample, and clinical dataset analyses. In this section, the dataset preparation is extensively detailed. Furthermore, the performance of the methods was evaluated using multiple metrics, including the accuracy score (ACC), weighted F1 score, and balanced accuracy score (BA), with each providing complementary insights into the model’s overall performance. This comprehensive evaluation showcased the model’s ability to maintain consistent accuracy not only across different developmental stages and sequencing technologies but also in clinical settings. These attributes highlight scMCGraph’s utility in automated cell type annotation and its potential as a powerful tool for complex single-cell sequencing data analysis.
Dataset preparation
To evaluate the proposed model comprehensively, we implemented a multifaceted experimental framework encompassing cross-platform, cross-time, cross-sample, and clinical dataset analyses. To further validate its scalability to larger datasets, we incorporated the breast invasive carcinoma E-MTAB-8107 [40] dataset, demonstrating the model’s capacity to handle more extensive biological data.
Cross-platform analysis
We utilized human pancreatic datasets (Baron Human [41], Muraro [42], Segerstolpe [43], and Xin [44]) alongside PBMC [45] datasets from both high-throughput (10Xv2, 10Xv3, Drop-Seq, inDrop, Seq-Well) and low-throughput (CEL-Seq2, Smart-Seq2) platforms. This yielded twelve reference-query pairs from the pancreatic datasets and forty-two from the PBMC datasets. The design excluded the reference dataset’s platform from the query sets to minimize platform-specific biases.
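The pair counts follow directly from taking every ordered pair of distinct datasets, as the short sketch below shows. The helper name `reference_query_pairs` is hypothetical, and the assumption that each ordered pair is used exactly once is inferred from the reported totals of twelve and forty-two.

```python
from itertools import permutations

pancreas = ["Baron Human", "Muraro", "Segerstolpe", "Xin"]
pbmc_platforms = ["10Xv2", "10Xv3", "Drop-Seq", "inDrop", "Seq-Well",
                  "CEL-Seq2", "Smart-Seq2"]

def reference_query_pairs(datasets):
    """All ordered (reference, query) pairs in which the two datasets differ."""
    return list(permutations(datasets, 2))

pancreas_pairs = reference_query_pairs(pancreas)
pbmc_pairs = reference_query_pairs(pbmc_platforms)
print(len(pancreas_pairs), len(pbmc_pairs))  # 12 42
```

Four pancreatic datasets give 4 × 3 = 12 pairs and seven PBMC platforms give 7 × 6 = 42 pairs, and no pair uses the same dataset as both reference and query.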
Cross-time analysis
We used the GSE132188 [46] single-cell dataset, containing mouse embryonic pancreatic epithelial cells at stages E13.5, E14.5, and E15.5 (GSM3852753, GSM3852754, GSM3852755), to create two reference-query pairs. This approach allowed us to assess the consistency of the model’s cell type annotations across developmental stages.
Cross-sample analysis
We analyzed PBMC datasets from diverse samples sequenced on various platforms (CEL-Seq2, Drop-Seq, inDrop, Seq-Well, Smart-Seq2), establishing five reference-query pairs to evaluate the model’s robustness against biological variability.
Clinical dataset validation
We extended our evaluation to clinical settings by incorporating single-cell transcriptomic data from patients with atherosclerosis and osteoarthritis. The Human Artery dataset (GSE159677 [47]) included samples from the calcified core and adjacent non-lesioned arterial tissue from endarterectomy patients. The Human Bone dataset (GSE152805 [48]) contrasted chondrocytes from diseased medial and healthy lateral tibial plateaus of osteoarthritis patients. From these datasets, we collected 20 unique reference-query pairs to compare intersubject variability and four pairs to assess differences between disease states, thereby enhancing the model’s clinical applicability.
These diverse experimental setups provide a robust framework to evaluate the model’s ability to accurately annotate cell types across various conditions and clinical states.
Pathway databases
We integrated data from established pathway databases to enrich the model with detailed biological pathways and gene functions. This included the KEGG [49] database for metabolic and cellular processes; the PathwayCommons12 series [50], encompassing the humancyc_hgnc [51], panther_hgnc [52], and pathbank_hgnc [53] subsets; and the Reactome [54] and Wikipathways [55] databases for comprehensive coverage of human and cross-species biological pathways. This strategic integration enhanced the dataset with a multidimensional biological context, improving the model’s performance.
Robustness and accuracy of scMCGraph across diverse batch-effect conditions
The scMCGraph model was developed to provide versatile and accurate automated cell type annotation that remains reliable in the presence of batch effects. The experiments in this section, comprising cross-platform, cross-time, and cross-sample analyses, assess the model’s robustness and accuracy. For comparison, we group the competing methods into three categories: (a) correlation-based methods, (b) marker-based methods, and (c) model-based methods. As illustrated in Fig. 2a, scMCGraph achieves the highest mean ACC on most datasets. The exception is the cross-time dataset of homologous mouse embryonic pancreatic epithelial cells at different embryonic stages, where scMCGraph attained the third-highest mean ACC, surpassed only by SingleR and TOSICA. These results indicate strong accuracy and generalization capability in cell type annotation tasks.
Fig. 2 Visual and quantitative assessment of scMCGraph’s cell type annotation accuracy. a Line chart visualizes the classification accuracy of scMCGraph compared to other methods. The chart demonstrates scMCGraph’s superior performance in accuracy. b The box plot comparing the ACC of the scMCGraph model with other methods across multiple human pancreatic dataset pairs. c Comparison of scMCGraph with 17 other methods on PBMC dataset. d The line graph depicting the cross-time and cross-sample ACC, highlighting the stability of the scMCGraph model across different conditions. e T-SNE plots before and after batch correction, as well as post-feature aggregation, display the scMCGraph’s refinement in cell clustering. f T-SNE plots of cell embeddings mapped by scMCGraph, compared to other models. These plots highlight scMCGraph’s superior ability to cluster cell populations within human pancreas and PBMC dataset pairs.
The model’s efficacy in annotating cell types across diverse sequencing platforms was tested on the pancreatic and PBMC dataset pairs, which cover both high-throughput and low-throughput technologies. This cross-platform assessment, depicted in Fig. 2b and c, shows that scMCGraph mitigates platform-specific biases and maintains high performance across the datasets. Specifically, Fig. 2b compares scMCGraph with other methods on the pancreatic dataset pairs, where our model achieved a mean ACC of 0.9801 with a variance of 0.0106, ranking first on both measures. In Fig. 2c, the model is compared with other methods across the PBMC dataset pairs on ACC, BA, and F1 score. Our model achieved the highest average ACC of 0.8077 with the lowest standard deviation of 0.0723, highlighting its stability. It ranked fifth in average BA at 0.7257, indicating competitive handling of class imbalance, and second in average F1 score (0.7709), just behind CellTypist.
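For readers unfamiliar with the three metrics, the sketch below computes ACC, BA (the mean of per-class recall), and the support-weighted F1 score in plain Python. The labels are toy values, not data from the study, and the helper names are hypothetical; the definitions follow the standard ones for these metrics.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_stats(y_true, y_pred):
    """Return {label: (precision, recall, support)} for labels present in y_true."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    stats = {}
    for label, n in support.items():
        recall = correct[label] / n
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        stats[label] = (precision, recall, n)
    return stats

def balanced_accuracy(y_true, y_pred):
    """Mean recall over classes, so rare cell types count as much as common ones."""
    stats = per_class_stats(y_true, y_pred)
    return sum(recall for _, recall, _ in stats.values()) / len(stats)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    total = len(y_true)
    score = 0.0
    for precision, recall, n in per_class_stats(y_true, y_pred).values():
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / total
    return score

# Toy labels: six T cells and two B cells, with one B cell mislabelled as T.
y_true = ["T"] * 6 + ["B"] * 2
y_pred = ["T"] * 6 + ["T", "B"]
acc = accuracy(y_true, y_pred)
ba = balanced_accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
```

On the toy labels ACC is 0.875 but BA is only 0.75, because the single error removes half of the recall for the rare class. This is why a method can rank differently on ACC and BA.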
The effectiveness of the scMCGraph model was further evaluated in both cross-time and cross-sample scenarios. For the cross-time analysis, the model was applied to datasets derived from mouse embryonic pancreatic epithelial cells across three developmental stages, where it demonstrated notable accuracy. Its robustness was further evidenced by its application to multiple PBMC datasets on the same sequencing platform. Figure 2d displays line graphs that illustrate the outcomes from both the cross-time and cross-sample experiments. Notably, in a specific cross-time pair (reference dataset GSM3852754 and query dataset GSM3852755), scMCGraph outperformed competing methods, achieving the highest accuracy. In the cross-sample analysis, it recorded the highest accuracy in four out of five reference-query pairs, involving platforms such as CEL-Seq2, inDrop, Seq-Well, and Smart-Seq2. These results collectively underscore scMCGraph’s strong generalizability and robustness across varying experimental setups, establishing its utility as a reliable tool in automated cell type annotation and its adaptability to diverse platforms and temporal stages.
Within the scMCGraph framework, the Harmony algorithm [56] was implemented to address batch effects across diverse single-cell datasets, effectively mitigating batch discrepancies during the integration of reference-query dataset pairs. Subsequently, a two-layer GCN aggregated features from two-hop neighbors within the graph, leveraging learned node embeddings for precise cell type annotation. To demonstrate the model’s capabilities concretely, t-SNE visualizations were conducted on the original gene expression data, the batch-effect-corrected features, and the node embeddings post-GCN for the dataset pair GSM3852754 and GSM3852755 (Fig. 2e). These visualizations clearly illustrate the initial disparities in gene expression between the datasets. After applying Harmony, the datasets converge, displaying emerging clustering trends within individual cell populations. Further processing through the GCN highlights these trends, systematically refining cellular features for cell type annotation, thus emphasizing the model’s exceptional proficiency in batch correction and cell type discrimination.
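The statement that a two-layer GCN aggregates features from two-hop neighbors can be checked with a small forward pass. The sketch below assumes the common propagation rule with self-loops and symmetric degree normalisation; the text specifies only "a two-layer GCN", so this rule, the function names, and the unit weights (chosen purely to make the propagation visible) are assumptions.

```python
import numpy as np

def normalize_adjacency(a):
    """Add self-loops and apply symmetric normalisation: D^-1/2 (A + I) D^-1/2."""
    a_hat = np.asarray(a, dtype=float) + np.eye(len(a))
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(a, x, w1, w2):
    """Two graph-convolution layers with a ReLU in between."""
    a_norm = normalize_adjacency(a)
    hidden = np.maximum(a_norm @ x @ w1, 0.0)  # layer 1: one-hop aggregation
    return a_norm @ hidden @ w2                # layer 2: reaches two-hop neighbours

# Path graph 0 - 1 - 2 - 3; only cell 0 carries a non-zero feature.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
features = np.array([[1.0], [0.0], [0.0], [0.0]])
w1 = np.array([[1.0]])  # learned in practice; fixed here for illustration
w2 = np.array([[1.0]])
embedding = gcn_forward(adjacency, features, w1, w2)
```

After two layers the signal from cell 0 reaches cell 2 (two hops away) but not cell 3 (three hops away), so each embedding summarises exactly the two-hop neighbourhood.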
In this work, the aim was to showcase the superior capability of the proposed model in automating cell type annotation and clustering cellular embedding features. Two dataset pairs from the human pancreas—Segerstolpe as reference with Baron as query and Muraro as reference with Segerstolpe as query—were meticulously evaluated. Additionally, two pairs from the PBMC dataset were scrutinized—Drop-Seq as reference with 10Xv2 and 10Xv3 as respective queries. The clustering of cell types across these four reference-query pairs was visualized using t-SNE plots, as shown (Fig. 2f). These findings reveal that conventional methods fail to segregate cell types based on the original gene expression. In the Segerstolpe-Baron pancreas dataset pair, the CHETAH algorithm was unable to effectively separate gamma from alpha cell types as well as delta from beta cell types. The scPred method struggled to discriminate beta from gamma cell types, and the SingleR approach was similarly unable to cleanly segregate gamma from alpha cell types. For the Muraro-Segerstolpe dataset pair, CHETAH again fell short in distinguishing gamma from alpha cell types, and scPred showed persistent admixture of a small number of beta cells in regions rich with alpha cells; SingleR exhibited difficulties in separating beta from delta cell types.
When analyzing the selected PBMC dataset pairs, the results show that in the Drop-Seq-10Xv2 comparison, both CHETAH and SingleR methods failed to adequately resolve CD4+ T cells from cytotoxic T cells as well as cytotoxic T cells from natural killer (NK) cells. The scPred algorithm could not satisfactorily separate CD14+ monocytes from CD16+ monocytes and also struggled with the separation of CD4+ T cells from cytotoxic T cells. In the Drop-Seq-10Xv3 dataset pair, CHETAH and SingleR again faced challenges in segregating CD4+ T cells from cytotoxic T cells, and scPred had difficulty distinguishing certain areas of CD16+ monocytes, cytotoxic T cells, and CD4+ T cells. However, the t-SNE projections generated by the model exhibited clear boundaries between different cell types, with minimal overlap in regions rich in specific cell types. This visually demonstrates that the proposed model has a significant advantage in discerning the gene expression differences among various cell types, thereby more effectively accomplishing the task of cell type annotation.
Clinical validation of scMCGraph for cell type identification across individual and disease states
In this section, the scMCGraph model was validated using clinical datasets from two diseases characterized by significant cellular heterogeneity and clinical importance: atherosclerosis and osteoarthritis. The validation confirmed the model’s ability to accurately delineate clinically relevant cell types. For atherosclerosis, single-cell RNA-sequencing data (GSE159677) from both the calcified core of atherosclerotic plaques and adjacent non-lesioned tissue were used to compare pathological and normal cell environments. In osteoarthritis, chondrocytes from both diseased and healthy states within individuals (GSE152805) were analyzed. From these datasets, 12 reference-query pairs in osteoarthritis and 8 in atherosclerosis were used for cross-individual comparisons, and 4 pairs in osteoarthritis were used for cross-health condition comparisons, enabling detailed comparisons across subjects and disease states.
The experimental results are depicted in Fig. 3a and b, where scMCGraph is shown to outperform advanced comparative methodologies in intersubject experiments. Although scMCGraph’s maximum accuracy in the inter-disease state experiments was slightly lower than that of SingleR, its minimum accuracy was notably higher, highlighting its robust performance and reliability.
Fig. 3 Efficacy and accuracy of scMCGraph in cell type classification across clinical datasets. a The accuracy of the proposed model was compared with other advanced methods on the cross-individual dataset. This comparison demonstrates that scMCGraph achieves superior performance in both accuracy and robustness. b The accuracy of scMCGraph was compared with other methods on the cross-health condition dataset. It can be seen that scMCGraph demonstrates superior reliability and better performance in terms of minimum accuracy. c Visualization of cell type classification accuracies uses a Sankey diagram for GSM4837524-GSM4837528 and a Chord diagram for GSM4837526-GSM4837524, highlighting scMCGraph’s robust performance. d Heat maps show scMCGraph’s cell type discrimination for GSM4837527-GSM4837525 and intercellular relationship insights critical for tissue microenvironment analysis.
The Sankey and Chord diagrams (Fig. 3c) depict the classification results for the GSM4837524-GSM4837528 and GSM4837526-GSM4837524 dataset pairs, respectively. Although scMCGraph faced challenges in accurately annotating rare cell types like B cells and mast cells, it classified most cells correctly, with minimal mislabeling of mesenchymal stem (MSC) cells as fibroblasts. In contrast, other methods such as MarkerCount misclassified numerous MSC cells, and scLearn and scPred had difficulties with myeloid, smooth muscle (SMC) cells, and other cell types. SingleR also showed misclassification of MSC cells. The Chord diagram further elucidates the prediction challenges faced by comparative methods, with MarkerCount and scLearn leaving many cells unannotated and scPred underperforming in classifying NK cells. The scMCGraph model, however, demonstrated robust performance across most cell types, with NK cells being the primary exception.
The heat map (Fig. 3d) of cell type classification for the GSM4837527-GSM4837525 dataset pair is presented alongside a cell correlation heatmap. This comparison reveals that although the model had some difficulties with MSC and plasma cells, it was effective for other cell types. Comparative methods like MarkerCount and scmapCell struggled particularly with SMC, plasma, NK, and MSC cells. scPred showed poor performance for plasma, NK, and MSC cells, with a significant number of cells remaining unannotated across several methods. The cell correlation heatmap serves to elucidate the relationships between cell types, which is imperative for studying cell-to-cell communication and understanding the tissue microenvironment’s complexity.
Pathway analysis and expression profiling in single-cell data
Cellular processes are governed by a complex array of biochemical reactions and tightly controlled pathways. These pathways, which are fundamental to the flow of cellular information, serve as the blueprints for critical functions such as gene interactions, metabolic processes, and signal transduction. Their comprehensive analysis is indispensable for unraveling the complexities of cellular functions and disease mechanisms. In this study, pathway analysis was performed using cell embeddings generated by our model based on the GSM4626768 dataset. t-SNE was applied to these embeddings to identify the pathways that best differentiate distinct cell types within the embedding space. One representative pathway was selected from each of six different pathway databases. From KEGG, the human ribosome pathway (hsa03010) was selected, which elucidates the assembly and functions of ribosomal subunits, vital for the translation of mRNA into proteins. The glycolysis/gluconeogenesis pathway (hsa00010) from humancyc_hgnc was chosen to represent metabolic flux in transitioning between glucose breakdown and synthesis. The vasopressin synthesis pathway from panther_hgnc was included, detailing the regulatory process from gene transcription to hormone secretion. From pathbank_hgnc, the protein synthesis pathway was selected, emphasizing the intricate process of translating genetic information into functional proteins. The Reactome database contributed the peptide chain elongation pathway (R-HSA-156902), a critical step in protein synthesis involving ribosomal catalysis. Lastly, the Wikipathways database provided insights into the cytoplasmic ribosomal proteins pathway, focusing on the protein components of the ribosome. These pathways were then visualized using a t-SNE plot, and the resulting figure (Fig. 4a) highlights the AUCell scores of the selected pathways across different cells, represented through color-coding and contour lines.
Fig. 4 Pathway analysis across single-cell RNA-seq datasets. a T-SNE plot visualization of pathway expression within the GSM4626771 single-cell dataset. Pathways selected from six distinct databases are color-coded to represent their relative expression levels across different cell embeddings. b Heat map shows the AUCell scores of the top 10 pathways from each database across the GSM4626768 (health) and GSM4626771 (disease) datasets. The consistency of pathway expression across datasets and cell types demonstrates the robustness of pathway analysis in single-cell RNA-seq data.
Simultaneously, it is important to consider that single-cell data inherently possess challenges such as high noise levels and dropout events that can obscure true biological signals. Fortunately, the impact of these factors on pathway-level analyses is relatively mitigated due to the nature of pathway information. To assess the robustness of the pathway analysis against these challenges, we first computed AUCell scores for each of the six pathway databases separately across the two datasets, GSM4626768 (health) and GSM4626771 (disease). Both datasets include seven cell types: HomC, RegC, RepC, HTC, FC, preFC, and preHTC. For each dataset, we selected the top 10 pathways with the highest AUCell scores from each pathway database. The heat map in Fig. 4b shows that the AUCell scores of these top 10 pathways are highly consistent across the two datasets (healthy and diseased) for each corresponding cell type, demonstrating the robustness of the pathway analysis in capturing reliable biological insights despite the inherent challenges of single-cell data.
Comprehensive evaluation and optimization of scMCGraph
Parameter selection, ablation studies, and additional evaluations were conducted to optimize and validate the performance of the scMCGraph model for cell type annotation. We evaluated model performance under various parameter settings and ran ablation experiments to assess the contributions of key components, including SSMF, PPMI, pathway databases, and KL divergence. Beyond optimization, we also evaluated the model’s computational efficiency, scalability with larger datasets, sensitivity to pathway sparsity, and predictive reliability using uncertainty quantification techniques. Because parameter choices strongly affect model performance, the human pancreatic datasets were used to investigate the percentage of cells taken as neighbor nodes when constructing the k-NN graphs (i.e., the k value). The results are shown as a box plot in Fig. 5a. A k value of 2% outperforms the 1% and 5% alternatives, striking the balance that maximizes accuracy. At 1%, accuracy is acceptable but suboptimal, and at 5%, the expanded neighborhood degrades performance.
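Expressing k as a percentage of the number of cells, rather than as a fixed count, can be sketched as follows. The function name `knn_graph`, the rounding rule, and the symmetrisation step are assumptions made for illustration; the text specifies only that k is a percentage of the nodes.

```python
import numpy as np

def knn_graph(affinity, fraction):
    """Binary k-NN graph in which k is a fraction of the number of cells."""
    sim = np.array(affinity, dtype=float)
    n = sim.shape[0]
    k = max(1, round(fraction * n))       # e.g. 2% of 100 cells gives k = 2
    np.fill_diagonal(sim, -np.inf)        # a cell is never its own neighbour
    idx = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=int)
    adj[np.arange(n)[:, None], idx] = 1
    return np.maximum(adj, adj.T)         # keep an edge if either cell selects it

# Toy symmetric affinity matrix for 100 cells (random values).
rng = np.random.default_rng(0)
m = rng.random((100, 100))
toy_affinity = (m + m.T) / 2
graph = knn_graph(toy_affinity, 0.02)
```

Tying k to dataset size keeps the neighbourhood proportionally the same for small and large datasets, which is one plausible reason for tuning a percentage rather than an absolute k.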
Fig. 5 Visualization of parameter selection and ablation study results. a The box plot represents the variation in accuracy with different k values in the k-NN graph construction using human pancreatic datasets. b A comparative analysis of the SSMF and SUM methods was conducted using the PBMC dataset from the CEL-Seq2 platform, which underscored the enhanced performance of SSMF in integrating cell similarity graphs with pathway information. c The PPMI ablation study on the Human Bone dataset, showcasing the benefits of PPMI in enhancing the integration of cell similarity graphs by emphasizing biologically relevant pathway information. d The removal of individual sources from the pathway database leads to diminished model performance, as revealed by analyses using the PBMC dataset from the Smart-Seq2 platform, thus underscoring the critical role of pathway diversity. e The evaluation of KL divergence in optimizing model accuracy was demonstrated across 42 PBMC dataset pairings, affirming its essential contribution to enhancing performance. f Training times for different methods. g Memory usage for different methods. h Predictive performance of the scMCGraph method on small datasets at varying proportions. i Comparison of predictive performance of the scMCGraph model with varying degrees of sparsity in pathway databases.
In the SSMF ablation experiment, the PBMC dataset from the CEL-Seq2 platform was used as the reference, with datasets from the six other platforms acting as query datasets. The SUM method, which aggregates the matrices by direct summation, served as the baseline. The outcomes, depicted in a slide bead diagram (Fig. 5b), show that SSMF consistently outperforms the SUM method across all dataset pairs. The margin is largest when the Drop-Seq platform is the query dataset, where SSMF’s ACC surpasses that of the SUM method substantially. This lead underscores SSMF’s proficiency in integrating complex data patterns, thereby delivering a more precise and coherent unified graph structure.
In the PPMI ablation experiment on the Human Bone dataset, five reference-query dataset pairs were formed, and the results are shown in Fig. 5c. PPMI consistently enhances model performance across all pairs. The improvement is most pronounced for the GSM4626767-GSM4626769 reference-query pair, where applying PPMI yields a visibly higher bar in the chart. This increase supports the conclusion that PPMI improves the model’s data integration capabilities by contributing to a more comprehensive global graph structure.
In the pathway database ablation study, the PBMC dataset from the Smart-Seq2 platform was used as the reference dataset, with datasets from the six other platforms serving as query datasets, thus establishing six reference-query dataset pairs. For each pair, one pathway database at a time was removed, and the results were visualized (Fig. 5d). This systematic elimination revealed the individual contribution of each database to the model’s performance: whenever any single pathway database was removed, performance was consistently and significantly lower than with the full complement of databases. These results underscore the importance of a consensus representation approach, in which multiple pathway databases together yield a more robust and accurate integration of diverse biological data. The KL divergence ablation study used the PBMC dataset from each platform in turn as the reference dataset, with the datasets from the remaining six platforms serving as query datasets, resulting in 42 reference-query dataset pairs. These comparisons were visualized using scatter plots, in which each dot represents an individual dataset pair and a solid black line indicates the average ACC across all pairs (Fig. 5e). The average ACC with KL divergence was significantly higher than without it, indicating that incorporating KL divergence improves the ACC of our cell type annotation.
In the computational complexity analysis, we compared scMCGraph with 17 other methods in terms of runtime and memory usage, using the PBMC dataset from the CEL-Seq2 platform as the reference dataset and the PBMC dataset from the 10Xv2 platform as the query dataset. As shown in Fig. 5f and g, singleCellNet is the fastest method, while scAGN uses the least memory. In contrast, scBERT is the most demanding in terms of both time and memory. Correlation-based methods take 32.79 s on average and typically require 2730.67 MB of RAM. Marker-based methods average 10.44 s and use 645.79 MB of RAM, making them the least resource-intensive. Model-based methods take considerably longer, averaging 212.95 s, and use more RAM, averaging 3297.29 MB. Marker-based methods are therefore the least resource-demanding, followed by correlation-based methods, with model-based methods consuming the most resources. Our proposed model, scMCGraph, runs in 43.84 s and requires 1977.31 MB of memory. Among the 14 model-based methods, scMCGraph ranks 6th in speed and 5th in memory efficiency, indicating a balanced trade-off between execution time and memory usage.
To investigate the minimum number of cells required for effective training of scMCGraph, additional experiments were performed with the PBMC dataset from the Smart-Seq2 platform, the smallest dataset in our study, as the reference dataset and the datasets from the six other platforms as query datasets. Cells were randomly sampled from the reference dataset at proportions of 0.2, 0.4, 0.6, 0.8, and 1.0 within each cell type. At the 1.0 proportion, the dataset contained 253 cells in total: 117 cytotoxic T cells, 58 CD4+ T cells, 34 CD14+ monocytes, 22 B cells, 14 megakaryocytes, and 8 CD16+ monocytes. The ACC was calculated for each sampling proportion (Fig. 5h). The average ACC was 0.1162 at both the 0.2 and 0.4 proportions, rose to 0.6529 at 0.6 and to 0.7529 at 0.8, and peaked at 0.7996 when all available cells (1.0 proportion) were used. The corresponding standard deviations were 0.0364 at 0.2 and 0.4, 0.0599 at 0.6, 0.0392 at 0.8, and 0.0447 at 1.0, indicating that the model’s performance stabilized with larger sample sizes. Training effects thus become apparent at the 0.6 proportion (149 cells), with the best performance achieved at the full sample size (253 cells). These results suggest that a minimum of 250 cells in total, with at least 10 cells per cell type, is required for reliable model performance, and they demonstrate that scMCGraph can achieve reliable performance with a relatively small number of cells.
This experiment underscores the model’s scalability, showing its ability to handle smaller datasets while maintaining robust cell type annotation.
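The per-cell-type subsampling can be sketched as follows. The per-type counts are taken from the text. The rounding rule is an assumption: truncating `proportion * n` within each type happens to reproduce the reported 149 cells at the 0.6 proportion, but the authors' exact rule is not stated. `stratified_sample` is a hypothetical helper.

```python
import random

# Per-type cell counts of the Smart-Seq2 reference, as listed in the text.
CELL_COUNTS = {
    "cytotoxic T cell": 117,
    "CD4+ T cell": 58,
    "CD14+ monocyte": 34,
    "B cell": 22,
    "megakaryocyte": 14,
    "CD16+ monocyte": 8,
}


def stratified_sample(labels, proportion, seed=0):
    """Return sorted cell indices, sampling int(proportion * n) cells per type."""
    rng = random.Random(seed)
    by_type = {}
    for idx, label in enumerate(labels):
        by_type.setdefault(label, []).append(idx)
    chosen = []
    for indices in by_type.values():
        k = int(proportion * len(indices))  # truncation is an assumed rounding rule
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Expand the counts into one label per cell (253 labels in total).
labels = [name for name, n in CELL_COUNTS.items() for _ in range(n)]
sizes = {p: len(stratified_sample(labels, p)) for p in (0.2, 0.4, 0.6, 0.8, 1.0)}
print(sizes[0.6], sizes[1.0])  # 149 253
```

Sampling within each type keeps rare populations such as CD16+ monocytes represented at every proportion, which a single draw over all cells would not guarantee.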
To evaluate the scalability and robustness of scMCGraph on larger and more complex datasets, we tested the model on five datasets from the E-MTAB-8107 collection, each containing between 2000 and 4000 cells: sc5rJUQ024, sc5rJUQ026, sc5rJUQ033, sc5rJUQ050, and sc5rJUQ060. To simulate a larger dataset, we concatenated four of these datasets to form the training (reference) set and used the remaining dataset as the test (query) set, rotating the held-out dataset across five cross-validation configurations and calculating the model’s ACC in each case. With sc5rJUQ026, sc5rJUQ033, sc5rJUQ050, and sc5rJUQ060 as the reference (12,218 cells) and sc5rJUQ024 as the query (3426 cells), scMCGraph achieved an accuracy of 0.8573. With sc5rJUQ024, sc5rJUQ033, sc5rJUQ050, and sc5rJUQ060 as the reference (13,428 cells) and sc5rJUQ026 as the query (2216 cells), the accuracy was 0.8150. With sc5rJUQ024, sc5rJUQ026, sc5rJUQ050, and sc5rJUQ060 as the reference (11,796 cells) and sc5rJUQ033 as the query (3848 cells), the accuracy was 0.8132. With sc5rJUQ024, sc5rJUQ026, sc5rJUQ033, and sc5rJUQ060 as the reference (12,626 cells) and sc5rJUQ050 as the query (3018 cells), the accuracy reached 0.9248, the highest value. Finally, with sc5rJUQ024, sc5rJUQ026, sc5rJUQ033, and sc5rJUQ050 as the reference (12,508 cells) and sc5rJUQ060 as the query (3136 cells), the accuracy was 0.7985. The accuracy therefore exceeded 0.80 in four of the five configurations and fell only marginally below it (0.7985) in the fifth. These results show that scMCGraph scales effectively to larger datasets without compromising accuracy.
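The five configurations follow a leave-one-dataset-out scheme, and the reference sizes can be checked directly from the query sizes reported above. The sketch below does only that bookkeeping and does not train a model. `leave_one_out_folds` is a hypothetical helper name.

```python
# Cell counts per dataset, taken from the query sizes reported in the text.
DATASET_SIZES = {
    "sc5rJUQ024": 3426,
    "sc5rJUQ026": 2216,
    "sc5rJUQ033": 3848,
    "sc5rJUQ050": 3018,
    "sc5rJUQ060": 3136,
}


def leave_one_out_folds(sizes):
    """Build one fold per dataset: that dataset is the query, the rest the reference."""
    total = sum(sizes.values())
    folds = []
    for query, n_query in sizes.items():
        folds.append({
            "query": query,
            "reference": [name for name in sizes if name != query],
            "n_query": n_query,
            "n_reference": total - n_query,
        })
    return folds


folds = leave_one_out_folds(DATASET_SIZES)
for fold in folds:
    print(fold["query"], fold["n_query"], fold["n_reference"])
```

The resulting reference sizes (12,218, 13,428, 11,796, 12,626, and 12,508 cells) agree with the values quoted for the five configurations.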
To investigate the sensitivity of scMCGraph to pathway sparsity, we conducted additional experiments using varying proportions of pathways from all pathway databases. In these experiments, we used the Smart-Seq2 platform PBMC dataset as the reference, with the datasets from each of the other six platforms serving as query datasets. For each pathway database, we randomly selected subsets representing 0.2, 0.4, 0.6, 0.8, and 1.0 of the total number of pathways and assessed the model’s predictive performance. The results, shown in Fig. 5i, reveal that scMCGraph’s performance generally improves as the number of pathways increases. The average accuracy ranged from 0.7017 (with a 0.2 proportion of the pathways) to 0.8162 (with all pathways), with standard deviations of 0.0592, 0.0544, 0.0656, 0.0647, and 0.0631 at the five proportions, respectively. Some dataset pairs, such as PBMC1_Smart-PBMC1_inDrop, showed fluctuations in performance between the 0.4 and 0.6 proportions, but the model reached stable performance once the pathway proportion reached 0.8. These findings suggest that while reducing the number of pathways decreases accuracy, performance tends to stabilize when at least 80% of the pathways are retained. Lower proportions may result in an overly sparse pathway network, which could hinder the model’s ability to represent cellular states and degrade predictive performance.
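The pathway subsampling step can be sketched as below, assuming each database is held as a mapping from pathway name to gene set. The toy databases, gene names, and the `subsample_pathways` helper are hypothetical. Rounding to the nearest integer and keeping at least one pathway per database are likewise assumptions.

```python
import random


def subsample_pathways(databases, proportion, seed=0):
    """Keep a random fraction of the pathways in every database."""
    rng = random.Random(seed)
    subset = {}
    for db_name, pathways in databases.items():
        names = sorted(pathways)  # sort for reproducible sampling
        k = max(1, round(proportion * len(names)))
        kept = rng.sample(names, k)
        subset[db_name] = {name: pathways[name] for name in kept}
    return subset


# Toy stand-ins for two pathway databases (hypothetical names and genes).
toy_databases = {
    "db_a": {f"pathway_{i}": {f"GENE{i}", f"GENE{i + 1}"} for i in range(10)},
    "db_b": {f"pathway_{i}": {f"GENE{2 * i}"} for i in range(20)},
}

for proportion in (0.2, 0.4, 0.6, 0.8, 1.0):
    sub = subsample_pathways(toy_databases, proportion)
    print(proportion, {name: len(paths) for name, paths in sub.items()})
```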
Additionally, we adopted the post-hoc uncertainty quantification approach for multi-class problems proposed by Khatri et al. [57] to evaluate the predictive reliability of scMCGraph. Specifically, we first calculated nonconformity scores for each sample in the reference dataset based on the prediction scores. These scores were then used to establish a threshold by selecting the top 0.025 of the nonconformity scores. This threshold, derived from the training set, was applied to determine the confidence sets for each prediction score in the query dataset. The model’s performance was assessed using two key metrics: coverage and efficiency. Coverage is the proportion of true labels included within the confidence sets and reflects the model’s reliability, while efficiency is the average size of these confidence sets, where smaller sets are preferable provided that coverage remains high. To assess the model’s performance across different platforms, we used the PBMC dataset from the Seq-Well platform as the reference and, as query datasets, the PBMC datasets from the three platforms with the largest cell counts among the remaining six (10Xv2, 10Xv3, and Drop-Seq). These datasets contain 5398, 2700, and 2835 cells, respectively. The model achieved an average coverage of 98.48%, indicating high reliability in predicting true labels across datasets, and an average efficiency of 2.97, suggesting that the confidence sets remain small while coverage is maintained. For individual datasets, 10Xv2 achieved the highest coverage of 99.17% with an efficiency of 2.47, followed by 10Xv3 with a coverage of 98.81% and an efficiency of 3.19, and Drop-Seq with a coverage of 97.46% and an efficiency of 3.24.
These results highlight the model’s strong performance in terms of both reliability and efficiency, making it well-suited for applications requiring post-hoc uncertainty quantification in single-cell transcriptomic analyses.
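The threshold, confidence sets, coverage, and efficiency described above can be illustrated with a minimal split-conformal sketch. Several choices here are assumptions and are not confirmed as the exact procedure of Khatri et al. [57]: the nonconformity score is taken as one minus the predicted probability of a class, "the top 0.025" is read as the 97.5% quantile of the reference scores, and the quantile uses the conservative (n + 1) rank rule. All function names and the toy probabilities are hypothetical.

```python
import math


def nonconformity(probs, label):
    """Assumed score: one minus the predicted probability of the given class."""
    return 1.0 - probs[label]


def calibrate_threshold(ref_probs, ref_labels, alpha=0.025):
    """Return the (1 - alpha) conformal quantile of the reference scores."""
    scores = sorted(nonconformity(p, y) for p, y in zip(ref_probs, ref_labels))
    n = len(scores)
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))  # conservative rank rule
    return scores[rank - 1]


def confidence_set(probs, threshold):
    """Return all classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


def coverage_and_efficiency(query_probs, query_labels, threshold):
    """Coverage: share of true labels inside their set. Efficiency: mean set size."""
    sets = [confidence_set(p, threshold) for p in query_probs]
    coverage = sum(y in s for s, y in zip(sets, query_labels)) / len(sets)
    efficiency = sum(len(s) for s in sets) / len(sets)
    return coverage, efficiency


# Toy three-class example (made-up probabilities, not results from the paper).
ref_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
ref_labels = [0, 1, 2, 1]
query_probs = [[0.5, 0.4, 0.1], [0.05, 0.05, 0.9]]
query_labels = [1, 2]

threshold = calibrate_threshold(ref_probs, ref_labels)
coverage, efficiency = coverage_and_efficiency(query_probs, query_labels, threshold)
print(threshold, coverage, efficiency)
```

Under this reading, a small alpha such as 0.025 raises the threshold, which enlarges the confidence sets: coverage goes up at the cost of efficiency, consistent with the high coverage and set sizes of about three labels reported above.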