Machine-learning based prediction of appendicitis for patients presenting with acute abdominal pain at the emergency department

Data collection

Pseudonymized data were retrospectively collected from 350 patients who presented with AAP and were registered with this complaint in the Dutch triage system at the ED of Jeroen Bosch Hospital, a Dutch teaching hospital in Den Bosch, between July 2016 and January 2023.

These patients’ visits to the ED are referred to as cases. No exclusions were made based on age, pregnancy status, comorbidities, medication use, or symptom presentation (see patient population details in Additional File 2). This inclusive approach aimed to reflect the diversity encountered in daily clinical practice and to develop ML models applicable to the entire AAP population. To limit the influence of any medications on the data, the first series of measurements from each ED visit was extracted. To train the model to differentiate appendicitis from other AAP causes, including those with similar clinical presentations, balanced subsampling was applied. This involved achieving equal numbers of appendicitis and other AAP cases. Additionally, among the other AAP cases, those suspected of appendicitis were balanced with those having non-specific or other AAP causes based on initial assessments by primary care physicians or triage nurses upon ED arrival. Balanced subsampling is a data preprocessing technique that enhances ML model performance on minority classes by balancing class distributions; it adjusts class frequencies without accounting for other parameters. No duplicate cases were introduced into our dataset during this process.
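The balancing step described above can be illustrated with a minimal sketch, assuming simple random undersampling without replacement to the size of the smallest class. The function name, the field names, and the toy cohort are hypothetical and are not taken from the study; the nested balancing of suspected versus non-specific cases could be approximated by applying the same function within the other AAP group.

```python
import random
from collections import defaultdict


def balanced_subsample(cases, label_key, seed=0):
    """Randomly undersample every class to the size of the smallest class.

    Sampling is done without replacement, so no case can appear twice.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for case in cases:
        by_class[case[label_key]].append(case)
    n_per_class = min(len(group) for group in by_class.values())
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.sample(by_class[label], n_per_class))
    return sample


# Hypothetical toy cohort: 40 appendicitis cases and 160 other AAP cases.
cohort = (
    [{"case_id": i, "diagnosis": "appendicitis"} for i in range(40)]
    + [{"case_id": i, "diagnosis": "other_aap"} for i in range(40, 200)]
)
balanced = balanced_subsample(cohort, "diagnosis", seed=42)
```

Because only class frequencies are adjusted, other parameters (for example age or sex) are left as they fall in the random draw.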

Other eligibility criteria included the availability of data from the initial patient evaluation relevant for building models at two key decision points in the ED workup. These data comprised ED intake information, vital signs, medical history, and physical examination findings from ED reports. In addition, blood and urine test results from standardized laboratory order sets, routinely requested for ED patients, were collected for these 350 cases. Detailed information on these parameters can be found in Supplemental Tables 1A to 1E in Additional File 1. Cases were excluded if they had insufficient medical history or physical examination findings or were missing more than 70% of the laboratory test results or vital signs (n = 14), a threshold chosen to preserve enough cases while limiting the number of missing parameters. This resulted in a final dataset of 336 eligible cases. The data extraction was performed using CTcue (IQVIA Nederland B.V., Amsterdam, the Netherlands), a privacy-by-design data extraction tool that automatically pseudonymizes patient data by redacting personally identifiable information and hashing patient IDs.
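The missingness criterion can be sketched as a simple filter. This is an illustration only: it assumes the 70% threshold is checked separately for laboratory tests and for vital signs (the text does not specify whether the two groups were pooled), and all field names and example cases are hypothetical.

```python
def missing_fraction(case, fields):
    """Fraction of the given fields that are missing (None or absent) in one case."""
    return sum(case.get(field) is None for field in fields) / len(fields)


def is_eligible(case, lab_fields, vital_fields, threshold=0.70):
    """Keep a case unless more than `threshold` of its labs or vitals are missing."""
    return (
        missing_fraction(case, lab_fields) <= threshold
        and missing_fraction(case, vital_fields) <= threshold
    )


# Hypothetical parameter names, for illustration only.
LAB_FIELDS = ["crp", "leukocytes", "hemoglobin", "creatinine"]
VITAL_FIELDS = ["heart_rate", "temperature", "systolic_bp"]

cases = [
    # All measurements available.
    {"case_id": "A", "crp": 12.0, "leukocytes": 9.1, "hemoglobin": 8.4,
     "creatinine": 70.0, "heart_rate": 80, "temperature": 37.2, "systolic_bp": 120},
    # 3 of 4 laboratory results missing (75%): excluded.
    {"case_id": "B", "crp": 45.0, "leukocytes": None, "hemoglobin": None,
     "creatinine": None, "heart_rate": 95, "temperature": 38.1, "systolic_bp": 110},
    # 2 of 3 vital signs missing (about 67%): still eligible.
    {"case_id": "C", "crp": 3.0, "leukocytes": 7.0, "hemoglobin": 9.0,
     "creatinine": 65.0, "heart_rate": 72, "temperature": None, "systolic_bp": None},
]
eligible_ids = [c["case_id"] for c in cases if is_eligible(c, LAB_FIELDS, VITAL_FIELDS)]
```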

Reference standard

The determination of ‘appendicitis’ versus ‘other AAP causes’ was based on three criteria: hospitalization, treatment received (e.g. surgery), and International Classification of Diseases 10th Revision (ICD-10) codes. This classification identified 167 appendicitis cases, for which final pathology and/or radiology reports with confirmatory results were also available. Among the confirmed appendicitis cases, 109 patients underwent surgery, while 58 received conservative treatment. Each case was meticulously reviewed by a team of medical coders and, if necessary, the classification was adjusted after a patient’s hospitalization or surgery. The other AAP group comprised 169 cases: 15 patients who were directly discharged from the ED and 154 patients who lacked both ICD-10 codes for appendicitis and surgery. In these cases, appendicitis was neither suspected by ED physicians during their examinations nor confirmed by radiology or pathology reports. While appendicitis can sometimes resolve without treatment, such cases are rare, and there was no clinical evidence of appendicitis in these patients.

Medical history and physical examination

Medical history and physical examination data were extracted from the free-text entries in the ED reports for each case. To structure these data, an initial annotation process was conducted by two researchers, who labeled all medical symptoms in 100 cases, resulting in 367 initial labels. The annotations were performed using the annotation software Doccano (version 1.4) [24]. Labels with a prevalence of less than 5% were then reviewed by two ED physicians for their diagnostic value. Those deemed clinically unrelated to AAP causes were excluded, while others were grouped under overarching labels, reducing the total to 289. This final set of 289 labels was categorized into 73 parameters: 50 binary parameters (e.g., presence of nausea) and 23 nominal parameters (e.g., location of pain). The remaining 236 cases were subsequently annotated using this structured framework.
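The prevalence screen that selected labels for physician review can be sketched as follows. The sketch assumes prevalence is the share of annotated cases in which a label occurs at least once; the function name and the example labels are hypothetical.

```python
from collections import Counter


def low_prevalence_labels(case_labels, threshold=0.05):
    """Return labels present in fewer than `threshold` of cases, with their prevalence.

    `case_labels` holds one list of labels per annotated case; a label is
    counted at most once per case.
    """
    n_cases = len(case_labels)
    counts = Counter(label for labels in case_labels for label in set(labels))
    return {
        label: count / n_cases
        for label, count in counts.items()
        if count / n_cases < threshold
    }


# Hypothetical annotations for 100 cases.
case_labels = [[] for _ in range(100)]
for i in range(40):
    case_labels[i].append("nausea")    # 40% prevalence
for i in range(5):
    case_labels[i].append("rlq_pain")  # exactly 5%: not below the threshold
for i in range(3):
    case_labels[i].append("hiccups")   # 3%: flagged for review

flagged = low_prevalence_labels(case_labels)
```

Flagged labels are candidates for review only; whether a label is excluded or merged under an overarching label remains a clinical judgment.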

Model development

To estimate the probability of appendicitis at two key decision points in the ED workup of AAP, two ML models were constructed using the eXtreme Gradient Boosting (XGBoost) algorithm via the XGBoost package (version 2.0.3) [25]. The first model, coined the History Intake Vitals Examination (HIVE) model, used ED intake information, vital signs, medical history, and physical examination inputs. The second model, coined the HIVE-LAB model, was extended with laboratory test results. XGBoost was selected due to its strong performance in classification tasks and its native ability to handle missing data, a key consideration in both this study and daily clinical practice.

For model development, data from the 336 cases were split into training and validation sets: 80% for training and hyperparameter tuning (n = 268: 133 appendicitis, 135 other AAP causes) and 20% for validation (n = 68: 34 appendicitis, 34 other AAP causes). Repeated stratified 10-fold cross-validation was used for training and tuning to preserve the class distribution across folds, with the mean performance across repetitions used to tune the models. Binary and nominal parameters were converted with the target-based encoding algorithm “CatBoost” (version 2.6.3), which replaces each category with a numerical value representing statistical properties derived from the training data [26, 27]. Subsequently, both models were trained to optimize the area under the receiver operating characteristic curve (AUROC). Hyperparameters were refined using Bayesian optimization through Optuna (version 3.6.1) [28], with 100 trials to identify the optimal settings (see Hyperparameter settings XGBoost in Additional File 2). Model interpretation was performed using SHapley Additive exPlanations (SHAP) values, calculated with the TreeSHAP algorithm (version 0.41.0) [29] and expressed as the percentage contribution of each parameter to the predictions of the XGBoost models. The TRIPOD checklist was followed to increase the transparency of the study’s methodology (Supplemental Table 2 in Additional File 1).
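The data partitioning can be illustrated with a dependency-free sketch. It assumes the 80/20 split was stratified by class with the validation size rounded up per class, which reproduces the reported counts but is not confirmed by the text. The number of cross-validation repetitions is not reported, so `n_repeats=3` is a placeholder, and both function names are hypothetical.

```python
import math
import random


def stratified_split(labels, val_fraction=0.20, seed=0):
    """Split case indices into training and validation sets, class by class."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # round() guards against floating-point noise before rounding up.
        n_val = math.ceil(round(len(idx) * val_fraction, 9))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return sorted(train_idx), sorted(val_idx)


def repeated_stratified_folds(labels, indices, n_splits=10, n_repeats=3, seed=0):
    """Yield (train, test) index lists; each class is dealt round-robin into folds."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for cls in sorted({labels[i] for i in indices}):
            idx = [i for i in indices if labels[i] == cls]
            rng.shuffle(idx)
            for pos, i in enumerate(idx):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, folds[k]


# 1 = appendicitis (167 cases), 0 = other AAP causes (169 cases).
labels = [1] * 167 + [0] * 169
train_idx, val_idx = stratified_split(labels, seed=42)
splits = list(repeated_stratified_folds(labels, train_idx))
```

In each fold, a target-based encoder would be fitted on the training indices only and then applied to the held-out indices, so that outcome information from the held-out cases does not leak into the encoding.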

Reader study - expert diagnosis

A reader study was conducted to compare the outcomes of the HIVE and HIVE-LAB models with the clinical performance of ED physicians using the same validation set (n = 68). Each case was presented in its original format, mimicking the electronic health record system, and independently evaluated by three ED physicians with one, five, and ten years of post-qualification experience. Each ED physician scored the likelihood of appendicitis for each case on a scale from 0 to 100, with 0 being ‘highly unlikely’ and 100 ‘very likely’. This scale mirrored the probability output of the models.

Initially, the physicians scored each case based on intake information, vital signs, medical history, and physical examination findings. Subsequently, they adjusted the likelihood score, if necessary, after evaluating the laboratory test results for the same case. This two-step evaluation reflected real-world diagnostic practice and allowed the added value of the laboratory test results to be assessed (see Example Case in Additional File 2).

Alvarado scoring system

The Alvarado scoring system, also known as MANTRELS (Migration, Anorexia, Nausea-vomiting, Tenderness in right lower quadrant, Rebound pain, Elevation of temperature, Leukocytosis, Shift to the left), is a 10-point clinical scoring system developed for risk stratification of acute appendicitis in patients presenting with AAP (see Supplemental Table 3 in Additional File 1) [12]. A score of ≤ 4 is considered low risk, while a score of ≥ 7 indicates a high risk of appendicitis, necessitating specialist consultation and/or further imaging [15]. This scoring system was applied to the validation set (n = 68) to compare its performance with that of the HIVE model, the HIVE-LAB model, and the ED physicians.
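The scoring rule can be sketched as a small function. The point weights below are the commonly published Alvarado weights (two points each for right lower quadrant tenderness and leukocytosis, one point for every other item); Supplemental Table 3 is not reproduced here, so they should be checked against it. The key names, the treatment of unrecorded findings as absent, and the "intermediate" label for scores of 5 to 6 are assumptions made for illustration.

```python
# Commonly published Alvarado (MANTRELS) weights; the maximum total is 10.
ALVARADO_POINTS = {
    "migration_of_pain": 1,
    "anorexia": 1,
    "nausea_or_vomiting": 1,
    "rlq_tenderness": 2,
    "rebound_pain": 1,
    "elevated_temperature": 1,
    "leukocytosis": 2,
    "left_shift": 1,
}


def alvarado_score(findings):
    """Sum the points of all items recorded as present (missing items count as absent)."""
    return sum(
        points for item, points in ALVARADO_POINTS.items() if findings.get(item, False)
    )


def risk_category(score):
    """Map a score to the risk bands used in the text (<= 4 low, >= 7 high)."""
    if score <= 4:
        return "low"
    if score >= 7:
        return "high"
    return "intermediate"  # scores of 5 or 6; label assumed for illustration


# Hypothetical case: migrating pain, nausea, RLQ tenderness, and leukocytosis.
example = {
    "migration_of_pain": True,
    "nausea_or_vomiting": True,
    "rlq_tenderness": True,
    "leukocytosis": True,
}
example_score = alvarado_score(example)
example_risk = risk_category(example_score)
```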

Statistical analysis

Input parameters are presented as medians with interquartile ranges (IQR) or means with standard deviations (SD), depending on their distribution (Table 1, Supplemental Tables 1A–1C in Additional File 1). Differences between cases with appendicitis and cases with other AAP causes were assessed using Kruskal-Wallis tests or one-way ANOVA for continuous variables, and chi-square or Fisher’s exact tests for categorical variables, as appropriate (Table 1, Supplemental Tables 1A–1E in Additional File 1) [30]. DeLong’s test was employed to compare the AUROC values of the ML models, ED physicians, and the Alvarado score. Statistical significance was set at p < 0.05 (Table 2), and confidence intervals for AUROC values were calculated via bootstrapping.
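As an illustration of the AUROC confidence intervals, the following sketch computes the AUROC from pairwise comparisons and derives a percentile bootstrap interval. The bootstrap variant, the number of resamples (`n_boot=2000`), and the toy scores are assumptions, since the text does not specify them; DeLong's test is not implemented here.

```python
import random


def auroc(y_true, scores):
    """AUROC as the probability that a positive case outranks a negative one.

    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class; AUROC is undefined, so redraw
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(round((alpha / 2) * n_boot))]
    upper = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


# Hypothetical likelihood scores on the 0-100 scale (1 = appendicitis).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [90, 80, 60, 40, 70, 30, 20, 10]
point_estimate = auroc(y_true, scores)
ci_lower, ci_upper = bootstrap_ci(y_true, scores)
```

The same functions apply unchanged to model probabilities, physician scores, and Alvarado scores, because the AUROC depends only on the ranking of the cases.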

Table 1 Patient Characteristics (n = 336)
