Identification of Disulfidptosis-Related Genes in Ischemic Stroke by Combining Single-Cell Sequencing, Machine Learning Algorithms, and In Vitro Experiments

Sources and Pre-Processing of Raw Data

Two expression matrices of human peripheral blood from patients with ischemic stroke (GSE16561 and GSE58294) were obtained from the Gene Expression Omnibus (GEO). The GSE16561 cohort used the Illumina platform to profile peripheral blood samples from 39 patients diagnosed with ischemic stroke (over 18 years old) and 24 healthy non-stroke controls (Barr et al., 2010). The GSE58294 cohort included blood from cardiogenic stroke patients and controls, with RNA extracted and analyzed using the whole-genome U133 Affymetrix array. It comprised 23 control samples and 69 cardiogenic stroke samples, with the latter collected at three time points after the event: within 3 h, at 5 h, and at 24 h (Stamova et al., 2014). Robust multi-array average (RMA) normalization is a processing method for Affymetrix microarray data that generates gene expression estimates through background correction, probe-level quantile normalization, and summarization of probe-level intensities into probe-set expression values. The "affy" R package provides functions to perform RMA normalization, which typically involves reading CEL files, applying the RMA algorithm, and extracting the normalized expression values for further analysis. Subsequently, we used the "limma" package for data normalization and batch correction, and the "ComBat" function from the "sva" package to further adjust for batch effects (Leek et al., 2012).

In addition, we downloaded single-cell RNA sequencing (scRNA-seq) data of mouse brain tissue containing 58,528 cells from GSE174574 (Zheng et al., 2022). In this dataset, three mice underwent middle cerebral artery occlusion (MCAO) surgery, while another three underwent sham operations as a control group.

A recent study reported 10 genes associated with disulfidptosis: SLC7A11, SLC3A2, RPN1, NCKAP1, NUBPL, NDUFA11, LRPPRC, OXSM, NDUFS1, and GYS1 (Liu et al., 2023). These genes are referred to here as disulfidptosis-related genes (DRGs). A total of 29 inflammatory pathway-related genes were exported from the Molecular Signatures Database (https://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_INFLAM_PATHWAY).
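
As an illustration of the normalization step within RMA, the following Python sketch applies quantile normalization to a small toy matrix. It is a simplified sketch rather than the "affy" implementation: the analysis itself was carried out in R, the input values are hypothetical, and ties are broken by position instead of being averaged.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns.

    Each value is replaced by the mean, across samples, of the values
    that share its rank. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value over samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns)
                 for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities: two samples (columns), three probes each.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(samples)
```

After normalization, every sample shares the same distribution of values, so remaining differences between samples reflect rank changes rather than overall intensity shifts.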

Machine Learning-Based Screening Method for Key Genes

Three machine learning algorithms, Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest, and Support Vector Machine-Recursive Feature Elimination (SVM-RFE), were selected for their distinct capabilities in screening hub genes from high-dimensional data. The LASSO algorithm reduces dimensionality and enforces sparsity by shrinking the coefficients of irrelevant features to exactly zero, which simplifies the model and improves interpretability (Engebretsen & Bohlin, 2019). We used the "glmnet" R package to implement LASSO, with the tuning parameter (lambda) optimized through tenfold cross-validation to balance model complexity against goodness of fit.
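
For reference, the penalized objective that LASSO minimizes can be written as follows. This assumes a binomial (logistic) model with outcome y_i = 1 for stroke and 0 for control, which the text does not state explicitly. Here N is the number of samples, x_i the expression vector of sample i, and lambda the tuning parameter chosen by cross-validation.

```latex
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
  \left\{ -\frac{1}{N} \sum_{i=1}^{N}
  \left[ y_i \left( \beta_0 + x_i^{\top}\beta \right)
  - \log\!\left( 1 + e^{\beta_0 + x_i^{\top}\beta} \right) \right]
  + \lambda \lVert \beta \rVert_1 \right\}
\]
```

Larger values of lambda drive more coefficients to exactly zero, and the genes whose coefficients remain nonzero at the selected lambda are retained as candidates.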

The Random Forest algorithm, a robust supervised learning method, was chosen for its ability to handle a large number of variables without variable deletion and its capacity to identify the most important features. We employed the "randomForest" R package to construct multiple decision trees and estimate the importance of each candidate hub gene based on the average decrease in impurity. Ranking the genes by importance score, we retained the ten highest-scoring candidate hub genes.
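
The impurity-based importance described above can be illustrated with a single split. The Python sketch below computes the decrease in Gini impurity from splitting samples on one gene's expression. It is a toy example with hypothetical values, not the "randomForest" implementation, which averages such decreases over all splits and trees.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(values, labels, threshold):
    """Weighted drop in Gini impurity from splitting at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Hypothetical expression of one gene in four samples.
expression = [1.0, 2.0, 3.0, 4.0]
status = ["control", "control", "stroke", "stroke"]
perfect_split = impurity_decrease(expression, status, threshold=2.5)
partial_split = impurity_decrease(expression, status, threshold=1.5)
```

A gene whose splits separate the groups cleanly accumulates a larger total decrease and therefore a higher importance score.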

Lastly, the SVM-RFE algorithm was selected for its efficiency in feature selection, particularly in small sample size scenarios (Huang et al., 2014). It iteratively removes less important features, thereby reducing redundancy and focusing on the most relevant variables for the outcome. We applied SVM-RFE with tenfold cross-validation to ensure the stability and generalizability of the selected features, which is crucial for the predictive performance of the model. The combination of these algorithms, each with its unique strengths and hyperparameter tuning strategies, provided a comprehensive approach to identifying key genes in the dataset.
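
The elimination loop at the core of SVM-RFE can be sketched as follows. The scoring function is a placeholder: in SVM-RFE it would refit a linear SVM on the remaining features and return the squared weight of each one, whereas here a fixed set of hypothetical weights stands in for the fitted model. Gene names and weights are illustrative only.

```python
def recursive_feature_elimination(features, score_features, n_keep):
    """Repeatedly drop the lowest-scoring feature until n_keep remain.

    `score_features` maps a list of feature names to a dict of
    importance scores. Returns the surviving features and the order
    in which the others were eliminated.
    """
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        scores = score_features(remaining)
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated


# Hypothetical weights standing in for a refitted linear SVM.
toy_weights = {"g1": 0.9, "g2": -0.1, "g3": 0.4, "g4": -0.7}


def toy_scorer(subset):
    # Squared weight is the usual ranking criterion for a linear SVM.
    return {f: toy_weights[f] ** 2 for f in subset}


kept, dropped = recursive_feature_elimination(
    ["g1", "g2", "g3", "g4"], toy_scorer, n_keep=2)
```

In practice the number of features to keep is itself chosen by cross-validation, for example by selecting the subset size with the lowest cross-validated error.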

Construction and Validation of Nomograms

We used a structured variable selection and model-building process to support the accuracy and clinical utility of the nomogram. First, we conducted multivariable logistic regression analysis using the "rms" R package, integrating key genes with clinical features to construct a predictive model. During model development, we considered the statistical significance of variables and refined their selection through stepwise regression to limit overfitting and improve interpretability. To assess the model's predictive power, we calculated the area under the receiver operating characteristic curve (AUC) using the "ROCR" R package, a standard measure of a model's ability to discriminate between outcomes. We also plotted calibration curves to evaluate the agreement between the model's predicted probabilities and the observed event rates. To further assess the clinical applicability of the nomogram, we performed decision curve analysis (DCA), which estimates the clinical benefit of the model across a range of threshold probabilities.
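
The AUC has a direct interpretation as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The Python sketch below computes it that way for a handful of hypothetical predictions. It is meant only to illustrate the quantity, not to reproduce the "ROCR" implementation.

```python
def auc(scores, labels):
    """AUC as the probability that a case outranks a control.

    Ties count as half. `labels` holds 1 for cases and 0 for controls.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and observed outcomes.
predicted = [0.9, 0.8, 0.4, 0.35, 0.1]
observed = [1, 1, 0, 1, 0]
model_auc = auc(predicted, observed)
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of cases from controls.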

For internal validation, we used the bootstrap method with 100 resamples to simulate different sample distributions and assess the model's robustness. We chose 100 resamples to balance sampling variability against computational cost. This approach gave a more realistic estimate of the model's performance on new data. Finally, we evaluated calibration using the Hosmer–Lemeshow goodness-of-fit test, which examines the differences between predicted probabilities and observed frequencies. Together, these steps supported the reliability of the nomogram and its potential use in clinical decision-making.
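
The resampling idea can be sketched in Python as follows. This simplified example recomputes a performance metric (here the Brier score, chosen only for brevity) on 100 samples drawn with replacement. It does not refit the model in each resample, which a full optimism-corrected validation would do, and the predictions and outcomes are hypothetical.

```python
import random


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2
               for p, y in zip(probabilities, outcomes)) / len(outcomes)


def bootstrap_metric(probabilities, outcomes, n_resamples=100, seed=0):
    """Recompute the metric on samples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    results = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        results.append(brier_score([probabilities[i] for i in idx],
                                   [outcomes[i] for i in idx]))
    return results


# Hypothetical predicted probabilities and observed outcomes.
predicted = [0.9, 0.8, 0.4, 0.35, 0.1, 0.7]
observed = [1, 1, 0, 1, 0, 1]
resampled = bootstrap_metric(predicted, observed)
spread = (min(resampled), max(resampled))
```

The spread of the resampled values gives a rough sense of how much the metric would vary across comparable cohorts.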

Consensus Cluster Analysis and Prediction of Immune Cell Infiltration

We used the "IOBR" software package, which incorporates eight published techniques for quantifying the tumor microenvironment (TME), including CIBERSORT and MCPcounter. We used these tools to batch process and visualize the immune cell composition in order to capture the relative abundance of different cell types more accurately (Zeng et al., 2021).

To determine the optimal number of clusters and to assess the clinicopathological characteristics of the resulting patient subgroups, we performed cluster analysis with the "ConsensusClusterPlus" software package. The reps parameter was set to 100, so that 100 resampling iterations were clustered to evaluate the stability of the results. The pItem and pFeature parameters, which set the proportion of samples and the proportion of features drawn in each resampling iteration, were set to 0.8 and 1, respectively. K-means was chosen as the core clustering algorithm, with Euclidean distance as the distance metric. "ConsensusClusterPlus" clusters multiple subsets of the data and builds a consensus matrix that records how consistently pairs of samples are assigned to the same cluster. Consensus cumulative distribution function (CDF) plots and the relative change in the area under the CDF curve provide quantitative criteria for judging the appropriate number of clusters (Wilkerson & Hayes, 2010).
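
The consensus matrix itself is simple to compute once the resampled clusterings are available. The Python sketch below takes a few hypothetical runs, each covering only the samples drawn in that run, and returns the fraction of runs in which each pair of samples shared a cluster. It illustrates the idea only and does not reproduce "ConsensusClusterPlus".

```python
def consensus_matrix(runs, n_items):
    """Fraction of runs in which each pair of items shared a cluster.

    Each run maps the indices of the items sampled in that run to a
    cluster label. Pairs never sampled together get 0.0.
    """
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for run in runs:
        items = list(run)
        for i in items:
            for j in items:
                sampled[i][j] += 1
                if run[i] == run[j]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n_items)] for i in range(n_items)]


# Three hypothetical resampled clusterings of four samples (indices 0-3).
runs = [
    {0: "a", 1: "a", 2: "b"},
    {0: "a", 1: "b", 3: "b"},
    {1: "a", 2: "b", 3: "b"},
]
matrix = consensus_matrix(runs, n_items=4)
```

Entries close to 0 or 1 indicate stable assignments, and a consensus CDF that rises steeply near those two extremes suggests a well-supported number of clusters.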

Processing of Single-Cell RNA Sequencing Data from Mouse Brain Tissue

We obtained single-cell gene expression profiles of mouse brain tissue from the GSE174574 dataset and pre-processed the scRNA-seq data using the "Seurat" R package (Hao et al., 2021). We first performed quality control, using the "PercentageFeatureSet" function to compute the proportion of mitochondrial gene expression in each cell, and excluded cells with abnormal mitochondrial gene proportions, low counts, or extreme gene expression levels. Subsequently, we applied the "NormalizeData" function to normalize the gene expression data of the retained cells.
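
The normalization step can be illustrated with a short Python sketch. It assumes the default log-normalization of "NormalizeData", in which each cell's counts are divided by the cell's total, multiplied by a scale factor of 10,000, and transformed with log1p. The text does not state which settings were used, and the counts below are hypothetical.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Scale one cell's counts to `scale_factor` total, then apply log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


# Hypothetical raw counts for four genes in a single cell.
cell = [0, 5, 15, 80]
normalized = log_normalize(cell)
```

Dividing by the per-cell total removes differences in sequencing depth between cells, and the log transform compresses the range of highly expressed genes.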

After normalizing the scRNA-seq data and converting it into Seurat objects, we used the "FindVariableFeatures" function to assess the expression variability of each gene across the cell population and selected 2000 highly variable genes (HVGs) based on preset thresholds. We then performed principal component analysis (PCA) on these 2000 HVGs to reveal the main sources of variation among cells, and clustered the cells based on the top 20 principal components (PCs). Clustering used the shared nearest neighbor (SNN) modularity optimization algorithm with the recommended neighbors parameter and an appropriate resolution, and the cells were visualized in two dimensions using Uniform Manifold Approximation and Projection (UMAP). To characterize each cluster, we used the "FindAllMarkers" function to identify its marker genes. Finally, we used the "SingleR" package to annotate the cell populations by correlating our clusters with known cell types.

Based on the expression of the 10 DRGs, we used the "AddModuleScore" function to score the microglia for this gene set (Tirosh et al., 2016). The pseudo-time trajectories of microglia were analyzed using the "Monocle" R package (Borcherding et al., 2019). Finally, "CellChat" was used to analyze the intercellular communication network, assess the expression levels of receptors and ligands, and infer potential intercellular interactions (Jin et al., 2021).
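
The principle behind module scoring can be sketched in Python as follows: a cell's score is the mean expression of the gene set minus the mean expression of a set of control genes. This is a deliberate simplification. "AddModuleScore" draws its control genes at random from bins of similar average expression, whereas here the control genes and all expression values are hypothetical and fixed by hand.

```python
def module_score(cell_expression, gene_set, control_genes):
    """Mean expression of the gene set minus mean of the control genes."""
    in_set = [cell_expression[g] for g in gene_set if g in cell_expression]
    controls = [cell_expression[g] for g in control_genes
                if g in cell_expression]
    if not in_set or not controls:
        raise ValueError("gene set and control set must overlap the data")
    return sum(in_set) / len(in_set) - sum(controls) / len(controls)


# Hypothetical normalized expression in one microglial cell.
# CtrlA and CtrlB are placeholder control genes, not real symbols.
cell = {"Slc7a11": 2.0, "Slc3a2": 1.0, "CtrlA": 0.5, "CtrlB": 1.5}
score = module_score(cell, ["Slc7a11", "Slc3a2"], ["CtrlA", "CtrlB"])
```

A positive score indicates that the gene set is expressed above the background level defined by the control genes in that cell.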

Cell Culture and Cell Transfection

Human microglial cells (HMC3) were purchased from Procell (China). Cells were cultured in minimum essential medium (MEM, including NEAA) (PM150467, Procell, China) supplemented with 10% fetal bovine serum (FBS) (Gibco). All cells were maintained in a humidified incubator at 37 °C with 5% CO2.

An SLC7A11 overexpression plasmid based on the pcDNA3.1/+ vector was designed and synthesized by GenePharma (Suzhou, China). Plasmid transfections were performed using Lipofectamine 3000 (Invitrogen, Carlsbad, USA) according to the manufacturer’s instructions. For knockdown, HMC3 cells cultured in 6-well plates were transfected with SLC7A11-specific siRNAs (GenePharma, Suzhou, China) using Lipofectamine 3000 (Invitrogen, Carlsbad, USA) according to the manufacturer’s instructions. After 48 h of transfection, the cells were collected and used for further analysis. Cells treated with control siRNA or mock-transfected cells were used as negative controls.

Immunofluorescence Staining

Cultured cells grown on coverslips were fixed with 4% paraformaldehyde in PBS for 15 min. After rinsing, cells were incubated with a blocking solution (10% normal goat serum (NGS) and 0.5% Triton X-100 in PBS) for 1 h at room temperature (RT), followed by incubation with primary antibodies overnight at 4 °C. After washing with 1 × PBS, the cells were incubated with the appropriate goat Alexa Fluor 555/488 secondary antibodies (1:200, Thermo Scientific Life Technologies Corporation) for 1 h at RT. The nuclei were counterstained with Hoechst (1:50, Thermo Scientific Life Technologies Corporation). Images were captured with an EVOS imaging system (Thermo Scientific Life Technologies Corporation).

Oxygen Glucose Deprivation/Reoxygenation (OGD/R) Cell Model

HMC3 cells were subjected to OGD/R to establish a cell model of ischemic stroke, as previously described (Huang et al., 2019). In brief, cells were washed and incubated in OGD medium (glucose-free DMEM) in an anaerobic chamber containing a mixture of 95% N2 and 5% CO2 for 6 h at 37 °C. After OGD, cells were rinsed twice and returned to standard medium for 12 h.

Western Blots

Cells were lysed using RIPA buffer (#9806, CST) containing protease inhibitor. The extracted proteins were stored at −80 °C. For western blot analysis, 8–10% SDS–PAGE was used to resolve equal amounts of protein samples from both cell lysate and supernatant. Briefly, lysates were subjected to SDS–PAGE and transferred onto PVDF membranes (Millipore). The membranes were blocked with 5% nonfat milk in TBS containing 0.05% Tween-20 (TBST) at room temperature for 1 h, followed by an overnight incubation at 4 °C with primary antibodies specific for the following proteins, as required by each experiment: iNOS (1:1000 dilution, ab178945, Abcam), Cox2 (1:1000 dilution, #12282, CST), and GAPDH (1:5000 dilution, 60004-1-Ig, Proteintech). The next day, the membranes were washed three times with TBST and then incubated with horseradish peroxidase (HRP)-conjugated secondary antibodies at a dilution of 1:5000 for 1 h at room temperature. Finally, the antigen–antibody complexes were detected by chemiluminescence using the Immobilon Western Chemiluminescent HRP Substrate (Millipore, USA). The housekeeping protein GAPDH was used as an internal control. Each experiment was conducted in triplicate.

Quantitative RT-PCR

Total RNA was extracted using TRIzol Reagent (Invitrogen, Carlsbad, USA). Total RNA concentration was measured with a Nanodrop spectrophotometer, and 1 μg of RNA was used for reverse transcription with the RevertAid First Strand cDNA Synthesis Kit. All experiments were performed in triplicate, and the specificity of the qPCR products was verified by melting curve analysis. GAPDH served as the endogenous control. Sequences of all primers are provided in Supplementary Table 1.

Statistical Analysis

All raw data processing was performed in R software (version 4.2.1). The nonparametric Wilcoxon rank sum test was used to compare continuous variables between two groups. The Kruskal-Wallis test was used to compare more than two independent groups. Correlations were assessed using Spearman correlation analysis. In all statistical analyses, P < 0.05 was considered statistically significant.
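
Spearman correlation is the Pearson correlation of rank-transformed data, which the following Python sketch makes explicit. It is illustrative only, uses hypothetical values, and does not compute a P value. Tied values share their average rank.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired measurements, e.g. a gene and an immune cell score.
rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Because only ranks enter the calculation, the coefficient is insensitive to monotonic transformations of either variable. It is undefined when one variable is constant.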
