Early identification of macrophage activation syndrome secondary to systemic lupus erythematosus with machine learning

Study population

This study included 188 patients diagnosed with SLE (94 patients) or MAS secondary to SLE (94 patients) between May 2012 and January 2023 at the Union Hospital of Tongji Medical College, Huazhong University of Science and Technology. In addition, data from patients with SLE and MAS secondary to SLE at the Second Xiangya Hospital of Central South University, the Central Hospital of Wuhan, the Zhongnan Hospital of Wuhan University, and the Second Affiliated Hospital of Zhejiang University School of Medicine were collected as an external validation set. SLE patients were selected according to the 1997 ACR classification criteria [14]. Patients with MAS secondary to SLE met both the 1997 ACR classification criteria and five of the eight HLH-2004 diagnostic criteria [15]. Exclusion criteria were: (1) age < 14 years; (2) a history of comorbid tumors or other autoimmune diseases; (3) a large amount of missing data. The study was approved by the Ethics Committee of Union Hospital, Tongji Medical College, Huazhong University of Science and Technology.

Candidate predictive variables

The clinical information collected in this study encompassed patient characteristics, clinical features, and laboratory parameters. The dataset included demographic information such as age and gender. Clinical features consisted of the Systemic Lupus Erythematosus Disease Activity Index (SLEDAI), highest recorded body temperature, duration of fever, and reported symptoms. Laboratory indicators included white blood cell count (WBC), hemoglobin (HB), platelet count (PLT), total bilirubin (TBil), serum electrolytes, serum creatinine (SC), alanine aminotransferase (ALT), aspartate aminotransferase (AST), alkaline phosphatase (ALP), total protein (TP), albumin (ALB), globulin (GLB), serum ferritin (SF), triglycerides (TG), serum low-density lipoprotein cholesterol (LDL), serum high-density lipoprotein cholesterol (HDL), lactate dehydrogenase (LDH), percentages of CD3 T cells, CD8 T cells, CD4 T cells, CD19 B cells, and NK cells, as well as levels of interleukin-2 (IL-2), interleukin-4 (IL-4), interleukin-6 (IL-6), interleukin-10 (IL-10), tumor necrosis factor-alpha (TNF-α), interferon-gamma (IFN-γ), antinuclear antibody (ANA), anti-double stranded DNA antibodies (Anti-dsDNA), C-reactive protein (CRP), procalcitonin (PCT), erythrocyte sedimentation rate (ESR), D-dimer, activated partial thromboplastin time (APTT), fibrinogen (FIB), thrombin time (TT), prothrombin time (PT), immunoglobulin A (IgA), immunoglobulin G (IgG), immunoglobulin M (IgM), C3, and C4, totaling 91 variables. Data for the external validation set were collected from the four other domestic hospitals named above.

Data processing and feature engineering

To obtain high-quality data, we handled missing values with multiple methods. Variables with < 5% missing values were filled with the mode (categorical variables) or the mean (continuous variables), while variables with between 5% and 20% missing values were imputed using random forests. We performed Mann-Whitney U tests on the data before and after imputation to confirm that our imputation algorithms produced no significant change in the data distribution.
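The two-tier imputation strategy described above can be sketched as follows; this is a minimal illustration, not the authors' code, and the function names and the 5%/20% thresholds applied per column are assumptions based on the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute(df: pd.DataFrame, categorical: list) -> pd.DataFrame:
    """Fill columns with <5% missing by mode/mean, 5-20% by random forest."""
    out = df.copy()
    miss = out.isna().mean()
    low = miss[(miss > 0) & (miss < 0.05)].index
    mid = miss[(miss >= 0.05) & (miss <= 0.20)].index
    # <5% missing: mode for categorical variables, mean for continuous ones
    for col in low:
        fill = out[col].mode()[0] if col in categorical else out[col].mean()
        out[col] = out[col].fillna(fill)
    # 5-20% missing: regress the column on the fully observed columns
    complete = out.columns[out.notna().all()]
    for col in mid:
        known = out[col].notna()
        rf = RandomForestRegressor(n_estimators=100, random_state=0)
        rf.fit(out.loc[known, complete], out.loc[known, col])
        out.loc[~known, col] = rf.predict(out.loc[~known, complete])
    return out
```

A before/after Mann-Whitney U test (e.g. `scipy.stats.mannwhitneyu`) can then be run per column to check that the imputation has not shifted the distribution.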

To maximize data retention, we incorporated clinically accepted normal ranges as prior knowledge. Specifically, we added derived features indicating whether a value fell outside the reference range (1 for values above the upper limit, -1 for values below the lower limit, and 0 for values within the normal range). We identified and processed outliers using box plots. Finally, we normalized the data via Z-scoring to remove scaling differences across units, giving each feature zero mean and unit variance.
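The range-flag construction and Z-score normalization can be sketched as below; the reference limits and column names are hypothetical, and fitting the normalization on the training data only (then reusing it) matches how the external set is handled.

```python
import pandas as pd

def add_range_flags(df, limits):
    """limits: {column: (lower, upper)} clinical reference ranges."""
    out = df.copy()
    for col, (lo, hi) in limits.items():
        flag = pd.Series(0, index=out.index)
        flag[out[col] > hi] = 1    # above the upper limit
        flag[out[col] < lo] = -1   # below the lower limit
        out[col + "_flag"] = flag
    return out

def zscore_fit(df, cols):
    # store per-column mean and standard deviation from the training set
    return {c: (df[c].mean(), df[c].std(ddof=0)) for c in cols}

def zscore_apply(df, stats):
    # reuse the training-set statistics, e.g. on the external validation set
    out = df.copy()
    for c, (mu, sd) in stats.items():
        out[c] = (out[c] - mu) / sd
    return out
```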

In the external dataset, patients with missing values were excluded from the evaluation. The remaining samples were normalized using the Z-score parameters previously estimated on the training dataset.

Feature selection

High-dimensional data are prone to noise during modeling and often require dimensionality reduction. To address this, we conducted a two-stage feature selection process. In the first stage, the Pearson correlation coefficient and the Variance Inflation Factor (VIF) were used to analyze correlation and collinearity among features. Features irrelevant to the diagnostic outcome and features exhibiting strong multicollinearity were removed, and derived features with clinical significance were constructed based on clinical experience. In the second stage, the Least Absolute Shrinkage and Selection Operator (LASSO) with 5-fold cross-validation was applied to the remaining features. Variable selection using LASSO is depicted in Supplemental Figure S1.
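The LASSO stage can be sketched with scikit-learn; this is an illustrative implementation using an L1-penalized logistic regression with the penalty strength chosen by 5-fold cross-validation (the grid of penalties and the AUC scoring criterion are assumptions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def lasso_select(X, y, feature_names):
    """Return the features whose L1-penalized coefficients are nonzero."""
    model = LogisticRegressionCV(
        Cs=20, cv=5, penalty="l1", solver="liblinear",
        scoring="roc_auc", random_state=0,
    ).fit(X, y)
    keep = np.flatnonzero(model.coef_.ravel() != 0)
    return [feature_names[i] for i in keep]
```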

Model and evaluation

The study employed five classification models, Logistic Regression (LR), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Machine (SVM), and a Scorecard model, to construct an evaluation model for MAS secondary to SLE (Fig. 1). For parameter optimization, an L2 regularization penalty was introduced into the logistic regression model to address overfitting caused by limited data; the L2 term penalizes highly complex weights by adding the sum of the squared weights to the loss function. We used grid search to optimize the relevant hyperparameters and the leave-one-out method for model validation (Supplemental Table S1). Leave-one-out is a cross-validation technique in which a single data point serves as the validation set while the remaining data are used for training; it provides a reliable estimate of model performance when the dataset is small.
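The logistic-regression arm (L2 penalty, grid search, leave-one-out validation) can be sketched as follows; the grid of C values is illustrative, not taken from the paper.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, LeaveOneOut

def fit_l2_logreg(X, y):
    """Grid-search the L2 penalty strength using leave-one-out validation."""
    grid = GridSearchCV(
        LogisticRegression(penalty="l2", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
        cv=LeaveOneOut(),   # one held-out sample per fold
        scoring="accuracy",
    )
    grid.fit(X, y)
    return grid.best_estimator_
```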

For model evaluation, the performances of these models were assessed and compared on the test set and the external dataset. Considering the high mortality risk of MAS secondary to SLE and the potential adverse outcomes of misdiagnosis, we believed the model should assign different weights to different types of error. In the diagnosis of high-risk diseases, false negatives (FN) should be minimized as far as possible. We therefore introduced the F1 score and F2 score as evaluation metrics. The F-score is the weighted harmonic mean of precision and recall (formula 1-1). The F2 score penalizes FN errors by assigning a higher weight to sensitivity (setting β = 2 in formula 1-1), which compels the model to better avoid missed diagnoses.

$$F\text{-}Score=(1+\beta^{2})\cdot\frac{Precision\cdot Recall}{\beta^{2}\cdot Precision+Recall}$$

(1-1)
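A direct implementation of the F-beta formula above, showing how β = 2 rewards recall:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta=1)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For a classifier with precision 0.5 and recall 0.9, the F2 score exceeds the F1 score, reflecting the heavier weight placed on sensitivity.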

Fig. 1 Illustrative overview of the development of the diagnostic machine learning models and scoring system for MAS secondary to SLE

Diagnostic scorecard

In this study, we developed a diagnostic scoring system to predict the probability of MAS secondary to SLE [16]. This system uses the same set of features as the diagnostic models.

The algorithm first discretized the value range of each feature into bins. Each category of a categorical feature was treated as a separate bin, while continuous features were binned with a chi-square (ChiMerge-style) algorithm: the value range is pre-divided into 20 bins, chi-square tests are performed on each pair of adjacent bins, and the pair with the smallest chi-square value (p-value greater than 0.05, indicating the most similar class distributions) is merged. This process was iterated until no further merging was possible or the minimum number of bins was reached.
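A simplified sketch of this ChiMerge-style binning is shown below. It is not the authors' implementation: it starts from 20 equal-frequency bins, repeatedly merges the most similar adjacent pair (smallest chi-square), and, as a simplification, stops at a fixed minimum bin count; the Laplace-style smoothing term is an added assumption to keep the test well-defined for empty cells.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chimerge(values, labels, init_bins=20, min_bins=5):
    """Merge adjacent quantile bins with the most similar class mix."""
    edges = np.unique(np.quantile(values, np.linspace(0, 1, init_bins + 1)))
    while len(edges) - 1 > min_bins:
        idx = np.clip(np.searchsorted(edges, values, side="right") - 1,
                      0, len(edges) - 2)
        stats = []
        for i in range(len(edges) - 2):
            # 2x2 table: (bin i, bin i+1) x (class 0, class 1)
            table = np.array(
                [[np.sum((idx == b) & (labels == c)) for c in (0, 1)]
                 for b in (i, i + 1)]) + 0.5  # smoothing avoids zero margins
            chi2, _, _, _ = chi2_contingency(table)
            stats.append(chi2)
        # drop the interior edge between the most similar adjacent bins
        edges = np.delete(edges, int(np.argmin(stats)) + 1)
    return edges
```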

Step 2 involved calculating the Weight of Evidence (WOE) value for each bin. Sample feature values were then replaced by the corresponding bin's WOE value, and the transformed data were used to train a logistic regression model. WOE is calculated according to the following formula:

$$WoE_{i}=\ln\left(\frac{Pos_{i}/Pos_{total}}{Neg_{i}/Neg_{total}}\right)$$

(2-1)

where $Pos_{i}$ and $Neg_{i}$ are the numbers of positive and negative samples in bin $i$, and $Pos_{total}$ and $Neg_{total}$ are the total numbers of positive and negative samples.
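Formula (2-1) reduces to a few lines of code; here `pos_counts`/`neg_counts` are the per-bin counts of positive (MAS secondary to SLE) and negative (SLE) patients.

```python
import numpy as np

def woe(pos_counts, neg_counts):
    """WOE per bin: log of (share of positives) over (share of negatives)."""
    pos = np.asarray(pos_counts, dtype=float)
    neg = np.asarray(neg_counts, dtype=float)
    return np.log((pos / pos.sum()) / (neg / neg.sum()))
```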

The coefficients A and B of the score transformation function are calculated according to the following formulas:

$$Score=A-B\times \ln\left(odds\right)$$

(3-1)

$$B=\frac{PDO}{\ln 2}$$

(3-2)

$$A=P_{0}+B\times \ln\left(odds_{0}\right)$$

(3-3)

Here, odds denotes the ratio of diseased to non-diseased patients, P0 is the baseline score at the base odds, PDO is the score reduction when the odds double, and A and B are the coefficients of the scoring function. In this paper, the base odds were set to 1/19, P0 to 70, and PDO to 4.14, yielding coefficients A = 52.41 and B = 5.97.
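The stated settings reproduce the reported coefficients directly:

```python
import math

def scorecard_coeffs(base_odds, p0, pdo):
    """Score transformation coefficients: B = PDO/ln(2), A = P0 + B*ln(odds0)."""
    B = pdo / math.log(2)              # score change when the odds double
    A = p0 + B * math.log(base_odds)
    return A, B

A, B = scorecard_coeffs(1 / 19, 70, 4.14)  # -> A ≈ 52.41, B ≈ 5.97
```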

To facilitate assessment of the disease probability corresponding to a score, we established probability thresholds, each defined as the highest score satisfying a predefined disease risk (the conditional probability of MAS secondary to SLE at a given score). In practice, accurately estimating probabilities from sample frequencies was challenging because of the limited sample size. Instead, we used Gaussian kernel density estimation, a non-parametric method for estimating probability density functions (PDFs), to estimate the PDFs of the positive and negative samples; integrating a PDF over an interval gives the probability of falling within that interval. Because positive (MAS secondary to SLE) and negative (SLE) samples are mutually exclusive, the disease risk of a patient within a given score interval is equivalent to the probability of a positive sample falling into that interval.
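This threshold-setting step can be sketched with `scipy.stats.gaussian_kde`. The sketch below assumes equal weighting by class counts and that higher scores correspond to lower disease risk (as implied by Score = A - B·ln(odds)); the function names and the risk-grid details are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def risk_at_score(pos_scores, neg_scores, grid):
    """Estimated P(MAS | score) from class-weighted kernel densities."""
    dp = gaussian_kde(pos_scores)(grid) * len(pos_scores)
    dn = gaussian_kde(neg_scores)(grid) * len(neg_scores)
    return dp / (dp + dn)

def threshold_for_risk(pos_scores, neg_scores, target, grid):
    """Highest score at which the estimated risk still reaches the target."""
    risk = risk_at_score(pos_scores, neg_scores, grid)
    meets = grid[risk >= target]
    return meets.max() if meets.size else None
```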
