Early Prediction of Cardiac Arrest in the Intensive Care Unit Using Explainable Machine Learning: Retrospective Study


Introduction

Critical illness is defined as the presence or potential development of organ dysfunction. Cardiac arrest (CA), a critical condition that impacts patient safety, refers to the sudden cessation of cardiac function due to specific abnormal events, such as ventricular arrhythmia, asystole, and pulseless electrical activity []. At least one abnormal sign, such as respiratory distress or hemodynamic instability, occurs in 59.4% of patients within 1-4 hours before the onset of CA []. Early identification of the causes of CA improves patient survival by approximately 29% within the first hour of the episode and 19% at discharge []. Therefore, early prediction of CA is crucial to allow for more time for clinical intervention, thereby reducing mortality.

Clinical decision support systems (CDSSs) are clinical computer systems that apply algorithms to patient information, use machine learning to evaluate clinical data, and provide clinical decision support [,]. These systems, developed using electronic medical records, utilize various paradigms—such as predicting early cardiac events, heart failure (HF), and critical illness—to enable rapid response through real-time patient monitoring [-]. To enhance the quality and speed of medical services, CA prediction and warning systems have been developed for use in intensive care units (ICUs) within the field of CDSSs []. These computer-based CA prediction algorithms offer new opportunities for clinicians to improve the accuracy of predicting CA events [].

Traditional score–based methods, including the Simplified Acute Physiology Score (SAPS)-II, Sequential Organ Failure Assessment (SOFA), and Modified Early Warning Score (MEWS), are tools used by in-hospital care teams to identify early indicators of CA and initiate early intervention and therapy [-]. However, these score-based systems suffer from low sensitivity or a high false alarm rate []. To address these issues, machine learning methods have been used in CA prediction [,], leading to significant improvements in performance.

Churpek et al [] proposed the use of a random forest (RF) classifier, based on clinical information extracted from a multicenter data set, achieving an area under the receiver operating characteristic curve (AUROC) of 0.83. Similarly, Hong et al [] implemented an RF model using a clinical data set from a retrospective study, attaining an AUROC of 0.97 and an area under the precision-recall curve (AUPRC) of 0.86. Although the authors achieved accurate CA prediction results, their methodology relied heavily on features not commonly used during hospitalization and did not offer real-time predictions. To address this, Layeghian Javan et al [] proposed a stacking method that combines RF, balanced bagging, and logistic regression to predict CA 1 hour in advance, achieving an AUROC of 0.82 on the Medical Information Mart for Intensive Care (MIMIC)-III data set. Kwon et al [] proposed a deep learning–based early warning system that utilizes a recurrent neural network (RNN) to assess risk scores from input vectors measured over an 8-hour period. Their system, based on vital signs extracted from a retrospective multicenter cohort data set, resulted in AUROC and AUPRC values of 0.85 and 0.04, respectively. Additionally, Kim et al [] developed an ensemble-based CA prediction system using the light gradient boosting method (LGBM) to predict CA 1 hour in advance, obtaining AUROC and AUPRC values of 0.86 and 0.58, respectively, using the MIMIC-IV data set.

As mentioned earlier, artificial intelligence (AI) has been applied to CA prediction solutions and has demonstrated high predictive power in several studies []. However, hospitals often group patients with similar conditions and illness severity within the same unit for more efficient treatment. Specifically, ICUs are divided into subtypes, such as general ICUs and cardiac ICUs, to optimize care. In this context, previous studies on CA prediction focused on the entire ICU without accounting for the heterogeneity within subtypes. As a result, the performance of CA prediction models may vary depending on the distinct characteristics of each group [].

Although numerous studies have applied AI to CA prediction [,], challenges persist in their practical application. First, CA prediction studies must confirm clinical validity through multicenter studies []. However, clinical maturity for CA prediction has not been established when monitored in real-time, as validation was typically performed using representative events extracted from the validation site. Second, patients grouped into different ICU subtypes exhibit varying characteristics and likelihood of developing CA. However, the performance of CA prediction models across these subtypes has not been validated. Third, while interpreting the results of prediction models is crucial for clinicians to make informed decisions [], an interpretable model capable of providing this information in real-time monitoring—especially among deep learning–based models—has yet to be developed.

This study proposes a framework for early and accurate prediction of CA across diverse clinical settings, accounting for heterogeneity. We aim to validate the clinical maturity, safety, and effectiveness of the proposed framework by comparing it with existing trigger systems and machine learning methods using a pseudo real-time CA evaluation system. We propose a novel framework that learns patient-independent and subtype-specific characteristics in the ICU to improve CA prediction and reduce the false alarm rate. The framework is built on the tabular network (TabNet), a deep learning–based model optimized for tabular data that can address the overfitting and performance limitations of existing tree-based models []. In addition, a cost-sensitive learning approach was applied to address the class imbalance in CA events. We then used the MIMIC-IV and eICU-Collaborative Research Database (eICU-CRD) to evaluate clinical maturity across various patient populations and ICU subtypes [,]. To illustrate changes in feature importance over time for clinical decisions, we utilized the MIMIC-IV data set []. Therefore, the proposed CA prediction framework can offer clinicians a reliable warning of CA occurrence within 24 hours. It also provides interpretable information about CA alarms and insights for rapid response.


Methods

Data Source

We used 2 databases: MIMIC-IV and eICU-CRD. The MIMIC-IV database, which includes information on vital signs, laboratory tests, and procedural events for ICU patients, was utilized to develop and validate a CA prediction model using multivariate vital sign time-series data from patients with HF. Specifically, MIMIC-IV is a well-known single-center database containing information on 46,520 patients admitted to the Beth Israel Deaconess Medical Center (BIDMC) between 2008 and 2019. The database includes demographic data, International Classification of Diseases (ICD) codes, clinical modification codes, hourly vital signs, inputs and outputs, laboratory test and microbiological culture results, imaging data, treatment methods, medication administration, and survival statistics. In addition, MIMIC-IV includes data from the clinical information system iMDsoft MetaVision. Compared with MIMIC-III, which extracts data from heterogeneous sources, MIMIC-IV provides more comprehensive patient data and detailed information on procedural events, serving as a primary source of clinical information in ICUs []. Consequently, MIMIC-IV data are more homogeneous compared with MIMIC-III data [].

The eICU-CRD contains data from over 200,000 ICU admissions monitored across the United States through the eICU-CRD program developed by Philips Healthcare. This collaborative database includes information on patients admitted to the ICU in 2014 and 2015 [].

Ethical Considerations

The MIMIC-IV database and eICU-CRD are deidentified, transformed, and made available to researchers who have completed human research training and signed a data use agreement. The Institutional Review Board at the BIDMC granted a waiver of informed consent and approved the sharing of the MIMIC-IV database. Similarly, the eICU-CRD data were exempt from Institutional Review Board approval and were also granted a waiver of informed consent [,].

To make the system trustworthy for doctors and patients, we addressed concerns related to bias and fairness in health care AI. Our strategies to mitigate these issues are as follows:

Bias in AI generally arises from 2 primary sources: the data used for algorithmic training (data bias) and the intrinsic design or learning mechanisms of the algorithm itself (algorithmic bias). In health care settings, the involvement of human interaction and decision-making can introduce additional bias due to the inherently complex nature of the process []. To mitigate the impact of data bias, we used a patient-centered data set rather than relying solely on representative event data. Additionally, we conducted a subgroup analysis to identify potential biases in various environments. Key factors contributing to algorithmic bias are label bias and cohort bias [,]. Label bias has been addressed through updates to the MIMIC-IV and eICU-CRD databases [,]. To counter cohort bias, which may arise when different group levels are not adequately considered, we evaluated both patients with heart disease and the broader population using different databases [].

Fairness in health care is multidimensional, involving the equitable distribution of resources, opportunities, and outcomes among diverse patient populations. Health care systems must ensure access to quality care for all individuals without discrimination []. To uphold this fairness, we selected MIMIC-IV and eICU-CRD—2 representative databases in critical care—and proceeded with AI development only after minimizing bias in each database. Additionally, we assessed explainability and the Brier score to address potential errors, harmful outcomes, and biases in AI-generated predictions.

The use of AI-driven predictions in critical care decisions can introduce various biases. Bias related to clinician interaction may be common in CDSSs, with risks including overconfidence in the AI system or desensitization to real-world events due to excessive alerts [,]. To mitigate these biases, we evaluated the false alarm rate, event recall, and sensitivity. Additionally, future deployments will require clinician training on inherent biases and regular monitoring of the algorithm.

Problem Definition

The task of the study is to predict CA events within 24 hours. The input data include the patient’s vital signs and the MEWS over a 12-hour time window. We then generate continuous labels every hour regarding the risk of CA within the next 24 hours and calculate the alarm rate based on the correctness of the alarms. The primary outcome was the AUROC, used to quantitatively assess the prediction results for CA events within 24 hours. The alarm rate, including false alarm rate and event recall, was calculated as a secondary outcome to evaluate alarm fatigue. Sensitivity was also assessed as a secondary outcome to identify any reductions in false alarms or missed CA events. Additionally, we provided clinically interpretable decision support information.
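As a hedged illustration of this labeling scheme, the sketch below produces an hourly binary label that is positive whenever a CA event falls within the following 24 hours; the function and variable names are placeholders, not the study's code.

```python
import pandas as pd

def hourly_ca_labels(prediction_times: pd.DatetimeIndex, ca_time=None) -> pd.Series:
    """Binary label per prediction hour: 1 if a CA event occurs within the
    next 24 hours of that hour, else 0. `ca_time` is None for non-CA stays."""
    horizon = pd.Timedelta(hours=24)
    flags = [
        int(ca_time is not None and t < ca_time <= t + horizon)
        for t in prediction_times
    ]
    return pd.Series(flags, index=prediction_times, name="ca_within_24h")
```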

Prediction Model Framework

Overview

We propose a framework for predicting CA events up to 24 hours in advance. As illustrated in Figure 2, the framework consists of 6 components: data preparation, data preprocessing and extraction, feature generation, feature aggregation and CA event labeling, model development, and evaluation. Details about the open-source tools and development code used are provided in [].

After applying the inclusion and exclusion criteria, we extracted vital signs and calculated the MEWS based on these vital signs. In step 2, we processed and normalized the features after resampling the vital signs and MEWS to a 1-hour resolution. In step 3, we generated multiresolution statistical features and Gini index–based features. The multiresolution statistical features were created using a sliding window approach to segment each vital sign into 4-, 6-, and 12-hour intervals. Next, we generated continuous labels every hour indicating the risk of CA within the next 24 hours. In step 4, we aggregated multiresolution statistical features, Gini index–based features, and labels. In step 5, we developed a TabNet classifier, known for its effectiveness in various classification tasks with tabular data [], and incorporated different cost weights for each class. Finally, in step 6, we evaluated the performance of the proposed model using sensitivity, false alarm rate, event recall, and AUROC.

Figure 1. Patient inclusion and exclusion flow diagram for the MIMIC-IV and eICU-CRD. (A) MIMIC-IV, (B) eICU-CRD. CA: cardiac arrest; eICU-CRD: eICU-Collaborative Research Database; HF: heart failure; ICU: intensive care unit; MIMIC: Medical Information Mart for Intensive Care; SAPS: Simplified Acute Physiology Score; SOFA: Sequential Organ Failure Assessment.

Step 1: Data Preparation

Data were obtained from the MIMIC-IV and eICU-CRD databases to construct cohorts meeting the inclusion and exclusion criteria []. The target populations of the 2 databases differed: MIMIC-IV includes only patients with cardiac issues in the ICU, while eICU-CRD encompasses all ICU patients. The number of CA events per patient also varied between the databases. MIMIC-IV typically has 1 CA event per patient, whereas eICU-CRD often records multiple CA events per patient. As CA events can occur multiple times per patient in a clinical setting, we validated multiple events in the eICU-CRD. Finally, we performed an analysis that accounted for differences across databases to compare the performance of CA prediction between high-risk patient groups and those without CA, as well as across different clinical settings, using the proposed framework, as shown in Figure 1.

For the inclusion and exclusion processing of MIMIC-IV, we applied the criteria to select the study cohort. Patients aged over 18 and under 100 years were included. Records of patients without SOFA and SAPS-II scores were excluded, as these scores were used to compare the prediction performance of the prognostic scales with the proposed framework. HF is a major risk factor for sudden CA and a significant contributor to CA-related mortality. CA is more prevalent in patients with a history of HF or previous CA. Therefore, we included ICU stays of patients with these cardiovascular conditions in the cohort study. For the CA group, we included ICU data if the vital sign data were not outliers and if any events occurred within 1 hour before the CA within 24 hours of patient admission.

For the inclusion and exclusion processing of the eICU-CRD, we acknowledged the differences in target population characteristics as mentioned above. Therefore, the inclusion and exclusion criteria were the same as those for MIMIC-IV, except for the criteria related to patients with high-risk CA.

MIMIC-IV includes only patients with cardiac issues in the ICU, while eICU-CRD encompasses all ICU patients. Consequently, the MIMIC-IV data set included 77 patients in the CA group and 1474 patients in the normal group, whereas the eICU-CRD data set included 106 patients in the CA group and 3641 patients in the normal group.

Figure 2. Overview of the proposed framework. This is composed of 6 steps including data preparation; data preprocessing and extraction; feature generation; feature aggregation and CA event labeling; model development; and evaluation. Three components make up TabNet, including feature transformer, attentive transformer, and feature masking. A split block separates the processed representation for the overall output and is used by the attentive transformer of the next phase. The feature selection mask provides comprehensible details about the functioning of the model for each step, and the masks can be combined to produce global feature important attribution. BN: batch normalization; CA: cardiac arrest; DBP: diastolic blood pressure; EWS: early warning score; HR: heart rate; MBP: mean blood pressure; MEWS: Modified Early Warning Score; ReLU: rectified linear unit; RFE: recursive feature elimination; RR: respiratory rate; SBP: systolic blood pressure; SpO2: oxyhemoglobin saturation; TabNet: tabular network; TEMP: temperature. Step 2: Data Preprocessing and Extraction

We collected data on vital sign parameters, including heart rate (HR), systolic blood pressure (SBP), diastolic blood pressure (DBP), temperature, respiratory rate (RR), and oxyhemoglobin saturation (SpO2), from the experimental database. These vital sign parameters may be recorded as irregular time series due to equipment malfunctions and varying patient responses []. Most prediction models cannot operate directly on irregularly sampled time series; they require data collected at regular time intervals. We therefore used a bucketing technique to manage the irregularities in the time series []. We divided the 12-hour time windows into 12 sequential 1-hour buckets, and the measured values within each bucket were averaged. Consequently, each time series consisted of 12 values at regular 1-hour intervals. If a bucket contained no values, it was marked as null. To address missing values, we used the last observation carried forward (LOCF) and last observation carried backward (LOCB) imputation techniques []. In the LOCF method, missing values are filled by carrying forward the most recent nonmissing value; in the LOCB method, they are filled by carrying backward the subsequent nonmissing value. We primarily used the LOCB method, falling back on the LOCF method when no subsequent value was available.
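A minimal pandas sketch of this bucketing and imputation step, assuming a DataFrame of raw vital sign measurements indexed by timestamp (column names are illustrative):

```python
import pandas as pd

def bucket_and_impute(vitals: pd.DataFrame) -> pd.DataFrame:
    """Average irregular measurements into 1-hour buckets, then impute.

    Empty buckets become NaN; imputation carries the subsequent value
    backward (LOCB) first and falls back to carrying the most recent
    value forward (LOCF) when no later value exists.
    """
    hourly = vitals.resample("1h").mean()  # 12 rows per 12-hour window
    return hourly.bfill().ffill()
```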

Additionally, we extracted early warning scores (EWS) from the vital signs. We used the MEWS [], a composite score commonly used by medical staff to assess illness severity. Each EWS observation was assigned a score ranging from 0 to 3, and the EWS was calculated every hour. To remove outliers, we determined the acceptable range for each variable based on input from medical experts and eliminated values falling outside this range. Because each feature column had a different scale, we normalized each feature using the minimum and maximum values of its acceptable range. We converted the database into an hourly time series with 12-hour intervals. Subsequently, we combined the CA and non-CA groups to perform the imputation task.
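For illustration, the sketch below scores one MEWS parameter (heart rate) on the 0-3 scale; the cutoffs follow a commonly published MEWS variant and may differ from the exact thresholds used in this study.

```python
def mews_heart_rate(hr: float) -> int:
    """MEWS sub-score for heart rate (illustrative thresholds)."""
    if hr < 40:
        return 2
    if hr <= 50:
        return 1
    if hr <= 100:
        return 0
    if hr <= 110:
        return 1
    if hr <= 129:
        return 2
    return 3
```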

Step 3: Feature Generation and Aggregation

Feature Extraction and Processing for Cardiac Arrest Prediction

After applying the inclusion and exclusion criteria, we extracted vital signs—such as HR, SBP, DBP, temperature, RR, and SpO2—and calculated the MEWS based on these vital signs. The features were then processed and normalized after resampling both the vital signs and MEWS to a 1-hour resolution []. The database was organized into an hourly time series with 12-hour intervals. Finally, we combined the CA and non-CA groups to perform the imputation task.

We generated 3 types of features within a 12-hour time window: vital sign–based features, multiresolution statistical features, and Gini index–based features. These features were designed to capture meaningful changes for predicting the occurrence of CA by identifying temporal patterns in vital signs, statistical variations across different resolutions, and the degree of information imbalance. The method for generating these features is outlined in the following sections.

Vital Sign–Based Features

To extract the pattern of vital signs, we used normalized vital signs and MEWS within a 12-hour time window.

Multiresolution Statistical Features

To capture statistical changes across different sections, we created time windows of increasing sizes and extracted summary statistics from these multiresolution sliding windows. For the multiresolution sliding window–based statistical features, the input data were segmented into resolutions of 0-4 hours, 0-6 hours, and 0-12 hours. Each time-series segment of the vital sign data was then aggregated to calculate the mean, median, minimum, maximum, and SD for each feature.

Specifically, the temporal patterns of each biological signal over a 12-hour period, with representative values for each 1-hour segment, show distinctive characteristics between the group that experienced CA and the group that did not. However, compressing the 5 statistical features mentioned above for the entire input window into a single statistical summary may not fully capture the differences in patterns between the groups. Consequently, 5 statistical values were derived by shifting a 4-hour window across the 12-hour input window. This approach allowed us to obtain statistical values for each section using both 6- and 12-hour windows. This feature extraction method, previously used in our research, was shown to enhance CA prediction accuracy by providing a condensed statistical summary of various sections [].
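The following sketch reproduces this multiresolution windowing under one plausible reading: windows of 4, 6, and 12 hours are slid across the 12 hourly values with a stride equal to the window length, and the 5 summary statistics are computed per segment (the stride and feature names are assumptions).

```python
import pandas as pd

def multiresolution_stats(window: pd.DataFrame, sizes=(4, 6, 12)) -> dict:
    """Mean, median, min, max, and SD per vital sign over multiresolution
    segments of a 12-row (hourly) input window."""
    feats = {}
    for size in sizes:
        # Slide a window of `size` hours across the 12-hour input.
        for start in range(0, len(window) - size + 1, size):
            seg = window.iloc[start:start + size]
            for col in seg.columns:
                key = f"{col}_{start}to{start + size}h"
                feats[f"{key}_mean"] = seg[col].mean()
                feats[f"{key}_median"] = seg[col].median()
                feats[f"{key}_min"] = seg[col].min()
                feats[f"{key}_max"] = seg[col].max()
                feats[f"{key}_std"] = seg[col].std()
    return feats
```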

Gini Index–Based Features

Inspired by the Gini index, which measures statistical variance to indicate income inequality in economics, we propose a method to calculate the imbalance of patterns within each vital sign over the input time steps. This method calculates inequality for each vital sign and uses it as a feature to distinguish between situations where CA occurred and did not occur. Previous research suggests that significant changes in temporal patterns often precede CA, making this imbalance a valuable characteristic variable. The index-based feature formulation is expressed as follows:

GV = 1 – (2G/NDiff)

where G is the Gini index; GV denotes the Gini index–based feature of each vital sign; xi and xj are pairwise values of a given vital sign within the input window; and NDiff is the number of intervals in the input vital sign. We calculated the Gini index to assess the impurity of each vital sign's variation and then performed the normalization step, denoted as GV.

For instance, if the Gini index–based feature value is relatively low, it indicates that the values within the input window are stable. Conversely, if there is a rapid change in HR before CA, the Gini index–based feature value will increase. This pattern change is captured by the Gini index, which measures statistical dispersion.
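A hedged sketch of this feature: the mean-absolute-difference form of the Gini index G is an assumption (the paper's own formula for G is not reproduced above), and NDiff is taken as the number of hourly intervals; only the final normalization GV = 1 − 2G/NDiff follows the equation as stated.

```python
import numpy as np

def gini_feature(x: np.ndarray) -> float:
    """Inequality of one vital sign across the input window.

    G below is the classic Gini mean-absolute-difference index
    (assumption); the normalization to GV follows the paper's equation.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Pairwise absolute differences between all hourly values.
    g = np.abs(x[:, None] - x[None, :]).sum() / (2 * n**2 * x.mean())
    n_diff = n - 1  # number of hourly intervals (assumption)
    return 1 - 2 * g / n_diff
```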

Step 4: Feature Aggregation and CA Event Labeling

We aggregated multiresolution statistical features, Gini index–based features, and labels to enhance temporal features and achieve better inter-ICU generalizations from the model, utilizing vital signs and specific clinical latent scores.

To select and screen the most relevant and nonredundant features, we used 2 feature selection methods: recursive feature elimination and the Boruta method. Recursive feature elimination identifies the most relevant features for predicting the target by recursively eliminating a small number of features in each iteration []. This process helps eliminate collinearity within the proposed framework. The Boruta method assesses the relevance of each feature using statistical testing and shadow features [].

To overcome the limitations of conventional feature selection methods, we used an ensemble feature screening approach using a majority voting mechanism. This method was applied to a total of 653 features, resulting in the selection of 86 features from the MIMIC-IV database and 94 features from the eICU-CRD database. In this approach, each feature screening method casts 1 vote for a selected feature, and a feature will receive 2 votes if both feature screening methods choose it.
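A sketch of this ensemble screening using scikit-learn's RFE and the third-party boruta package (assumptions: a random forest as the base estimator, NumPy-array inputs, an illustrative RFE target size, and features kept only when both methods vote for them; none of these details are stated above).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from boruta import BorutaPy  # third-party `boruta` package

def ensemble_screen(X: np.ndarray, y: np.ndarray, n_keep: int = 100) -> np.ndarray:
    """Return indices of features selected by majority voting."""
    rfe = RFE(
        RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0),
        n_features_to_select=n_keep,  # illustrative target size
    ).fit(X, y)

    boruta = BorutaPy(
        RandomForestClassifier(n_jobs=-1, random_state=0),
        n_estimators="auto",
        random_state=0,
    ).fit(X, y)

    # One vote per method; keep features chosen by both (2 votes).
    votes = rfe.support_.astype(int) + boruta.support_.astype(int)
    return np.where(votes == 2)[0]
```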

We then encoded the outcome as a binary indicator denoting the presence or absence of CA for each sample.

Step 5: Model Development

We used a TabNet classifier with features from a 12-hour time window to predict CA events within a 24-hour period. We generated continuous labels every hour indicating the risk of CA in the next 24 hours and calculated the alarm rate based on whether each alarm was correct. Additionally, we applied cost-sensitive learning as an algorithm-level approach to address the extreme class imbalance in the MIMIC-IV and eICU-CRD data sets []: errors on the minority class (the CA group) were penalized with a weight of 100 in both data sets. Used with cost-sensitive learning, the TabNet classifier helps reduce bias or variance and improve the stability of machine learning algorithms [,]. The learning rate was set to 0.01.
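A minimal sketch of this cost-sensitive TabNet setup using the pytorch-tabnet library; beyond the class weight of 100 and learning rate of 0.01 stated above, the class-weighted cross-entropy loss, epoch count, and variable names (X_train, y_train, and so on) are assumptions and placeholders.

```python
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

# Penalize errors on the minority CA class 100x via a weighted loss.
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 100.0]))

clf = TabNetClassifier(optimizer_params=dict(lr=0.01))
clf.fit(
    X_train, y_train,          # screened features / hourly CA labels (placeholders)
    eval_set=[(X_valid, y_valid)],
    loss_fn=loss_fn,
    max_epochs=100,            # illustrative
)
hourly_risk = clf.predict_proba(X_test)[:, 1]  # risk of CA within 24 hours
```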

Step 6: Evaluation

Leave-One-Patient-Out K-Fold Internal Validation

We used leave-one-patient-out (LOPO) K-fold validation to evaluate the individual patient performance of CA prediction and to provide a realistic estimate of model performance on new patients []. This method is a variant of K-fold cross-validation, where the folds consist of individual patients. Specifically, we used a 10-fold LOPO validation [].

Additionally, we established a pseudo-real-time CA evaluation system to assess and compare the real-time CA prediction performance for hospitalized patients, simulating an ICU environment. This system used the LOPO K-fold validation approach to predict CA events within a 24-hour period.
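Patient-grouped folds of this kind can be sketched with scikit-learn's GroupKFold, which guarantees that all hourly samples from one patient fall in the same fold (a sketch; variable names are placeholders, and the study's exact fold construction may differ).

```python
from sklearn.model_selection import GroupKFold

# X: feature matrix, y: hourly labels, patient_ids: one ID per row (placeholders).
gkf = GroupKFold(n_splits=10)  # 10-fold LOPO-style validation
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    # All rows of a given patient land on one side of the split, so each
    # test fold simulates entirely new patients.
    ...
```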

Cross-Data Set External Validation

In this experimental setting, both the proposed method and comparison methods were trained on 1 database and tested on another to evaluate their generalization ability []. We alternately used MIMIC-IV and eICU-CRD as the source and target databases to assess generalization performance. To ensure consistency in feature properties across databases, we excluded SAPS-II, which includes laboratory tests. The experiment consisted of 2 steps. First, we trained the models, including the proposed method and comparison models, using MIMIC-IV as the source database. Next, we tested the CA prediction performance on eICU-CRD using the trained models and compared their performance.

To assess the generalization ability and minimize the impact of specific group characteristics in the learning data, we conducted another cross-data set external validation. We trained the models using eICU-CRD as the source database and evaluated CA prediction performance on MIMIC-IV as the target database.

Subgroup Analysis

We evaluated the CA prediction performance for each ICU subtype to determine if there were differences in performance between the proposed method and the comparative models for each ICU cluster. The common ICU subtypes across MIMIC-IV and eICU-CRD were general, cardiac, neuro, and trauma. Neuro and trauma ICUs were excluded from this analysis due to the low number or absence of admissions of patients with CA. We used 10-fold LOPO cross-validation for subgroup evaluation. In this process, the entire data set assigned to each fold was used for training, and during testing, performance was evaluated by categorizing patients according to their ICU subtype.

Baseline Models

To evaluate the performance of the proposed method, we used the following baseline models: National Early Warning Score (NEWS), SOFA, SAPS-II, logistic regression, k-nearest neighbors, multilayer perceptron, LGBM, RNN, and reverse time attention. Details of these baseline models can be found in .

Evaluation Metrics

We assessed the performance of the proposed and baseline methods using AUROC, event recall, false alarm rate, and sensitivity. Our goal was to evaluate the proposed method in a clinically relevant context by focusing on the percentage of CA events detected and the rate of false alarms. We specifically measured event recall [], which quantifies whether the CA prediction system correctly triggered an alarm in the period preceding a CA event.

ER = NCaptured/NTotal

where ER is the event recall; NCaptured denotes the number of captured events; and NTotal denotes the total number of CA events.

Next, the false alarm rate was defined as the fraction of alarms that did not correspond to an actual event, to quantify wasted operational effort. This concept is analogous to exon prediction in gene discovery. The false alarm rate is evaluated as follows:

FAR = 1 – (NTrue/NAlarm)

where FAR is the false alarm rate; NTrue is the number of true alarms; and NAlarm is the number of total alarms in the CA prediction system.
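Both alarm metrics reduce to simple counts, as in this direct translation of the 2 formulas above:

```python
def event_recall(n_captured: int, n_total: int) -> float:
    """ER: fraction of CA events preceded by at least one correct alarm."""
    return n_captured / n_total

def false_alarm_rate(n_true: int, n_alarm: int) -> float:
    """FAR: fraction of all raised alarms that were not true alarms."""
    return 1.0 - n_true / n_alarm
```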

Explainable Predictions

We extracted both local and global interpretability information by examining the decision masks of TabNet. After determining the impact of each feature using the proposed model, we summarized and visualized the top 25 features with the highest mean values. Additionally, we visualized the impact of features over time using a heatmap and tracked changes in the features with the highest values. To compare the differences in impact between non-CA and CA groups, we conducted a statistical test. Specifically, an independent t test with false discovery rate (FDR) correction was used to assess the differences in interpretability information between the 2 groups.
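With pytorch-tabnet, this mask-based interpretability is exposed through the model's explain method; the sketch below ranks a top-25 global importance list from per-sample feature impacts (`clf` and `X_test` are placeholders for the trained classifier and evaluation features).

```python
import numpy as np

# Per-sample feature impacts and per-step decision masks from TabNet.
explain_matrix, step_masks = clf.explain(X_test)

# Global view: average each feature's impact over all samples and keep
# the 25 features with the highest mean values.
mean_impact = np.abs(explain_matrix).mean(axis=0)
top25 = np.argsort(mean_impact)[::-1][:25]

# Local view: explain_matrix[i] gives the per-feature impact for sample i,
# which can be plotted over time as a heatmap.
```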

Statistical Analysis

Differences in patient characteristics, such as age, ICU length of stay, and vital signs, between the non-CA and CA groups were evaluated using independent t tests. To compare performance metrics between the baseline and proposed models, we used the Kruskal-Wallis test, followed by the honestly significant difference (HSD) test for post hoc analysis. The differences in interpretable information between the non-CA and CA groups were evaluated using an independent t test with FDR correction. A significance level of 5% (P<.05) was used for all analyses.


Results

Patient Characteristics

The patient characteristics are presented as means and SDs in Table 1.

In the 12-hour time window for MIMIC-IV, age did not differ significantly between the CA and non-CA groups. However, the ICU length of stay was statistically different between the 2 groups (P=.02). Significant differences were observed in HR (P<.001), RR (P<.001), SBP (P<.001), DBP (P=.01), SpO2 (P<.001), and temperature (P<.001). In the eICU-CRD data set, age (P=.11) and ICU length of stay (P=.21) were not considered significant because their P values were greater than .05. All vital signs differed significantly in both data sets (P<.05), except HR, which was significant in MIMIC-IV (P<.001) but not in eICU-CRD (P=.41; see full data in ).

Next, we provided the patient characteristics for each ICU subtype, specifically general and cardiac ICUs, as shown in and . In the 12-hour time window for MIMIC-IV, the characteristics of general and cardiac ICUs were similar to those of the overall ICU population between the CA and non-CA groups, except for ICU length of stay (hours). In the general ICU, the length of stay was significantly longer in the CA group (P=.006). In the 12-hour time window for eICU-CRD, notable discrepancies were observed in the characteristics of general and cardiac ICUs between the CA and non-CA groups with respect to HR. In the general ICU, the CA group had a significantly higher HR compared with the non-CA group (P<.001). By contrast, the non-CA group in the cardiac ICU had a lower HR, although this difference was not statistically significant (P=.41).

Table 1. Demographic information of patients from MIMIC-IVa and eICU-CRDb.

| Characteristic | MIMIC-IV cardiac arrest (nc=77) | MIMIC-IV noncardiac arrest (n=1474) | P value | eICU-CRD cardiac arrest (n=106) | eICU-CRD noncardiac arrest (n=3541) | P value |
|---|---|---|---|---|---|---|
| Age (years), mean (SD) | 68.53 (13.63) | 67.64 (13.57) | .74 | 60.03 (16.52) | 62.65 (15.83) | .11 |
| Intensive care unit length of stay (hours), mean (SD) | 318.90 (346.82) | 273.26 (142.96) | .02 | 199.82 (198.01) | 175.28 (144.14) | .21 |
| Heart rate (beats/minute), mean (SD) | 88.79 (17.60) | 87.10 (17.30) | <.001 | 87.45 (19.54) | 87.35 (17.30) | .41 |
| Respiratory rate (breaths/minute), mean (SD) | 21.26 (5.68) | 20.99 (5.70) | <.001 | 20.10 (5.90) | 19.76 (4.85) | <.001 |
| Systolic blood pressure (mmHg), mean (SD) | 111.21 (22.60) | 118.03 (21.27) | <.001 | 118.88 (21.21) | 125.68 (21.73) | <.001 |
| Diastolic blood pressure (mmHg), mean (SD) | 59.41 (14.17) | 59.64 (13.70) | .01 | 63.47 (14.06) | 68.49 (13.86) | <.001 |
| Oxyhemoglobin saturation (SpO2), mean (SD) | 97.22 (3.57) | 96.92 (2.87) | <.001 | 96.94 (4.01) | 96.55 (2.82) | <.001 |
| Temperature (°C), mean (SD) | 36.88 (0.91) | 37.12 (0.64) | <.001 | 36.94 (0.94) | 36.90 (0.58) | <.001 |

aMIMIC: Medical Information Mart for Intensive Care.

beICU-CRD: eICU-Collaborative Research Database.

cn: number of ICU stays.

Feature Screening Strategy

Table 2 illustrates the efficacy of the proposed methodology, both in its original form and when combined with the ensemble feature screening process. Initially, the proposed framework was trained and validated using all features for CA prediction, with validation performed through a 10-fold LOPO cross-validation approach. On the MIMIC-IV data set, the framework achieved an AUROC of 0.75, with an event recall of 0.99, a false alarm rate of 0.80, and a sensitivity of 0.80. On the eICU-CRD data set, it achieved an AUROC of 0.78, with an event recall of 0.99, a false alarm rate of 0.45, and a sensitivity of 0.99.

To minimize the risk of overfitting in the proposed method, feature screening was essential. An ensemble feature screening method was used to identify the optimal feature set. Incorporating the feature sets selected by ensemble screening on the MIMIC-IV data set into the proposed framework yielded an AUROC of 0.79, an event recall of 0.99, a false alarm rate of 0.77, and a sensitivity of 0.89. Incorporating the feature sets selected on the eICU-CRD data set yielded an AUROC of 0.80, an event recall of 0.99, a false alarm rate of 0.36, and a sensitivity of 0.99. The proposed ensemble feature screening approach therefore outperformed using all feature sets.

Table 2. Performance comparison without and with ensemble feature screening methods along with the proposed framework using MIMIC-IVa and eICU-CRDb. Values are mean (SD).

| Method | MIMIC-IV event recall (↑c) | MIMIC-IV false alarm rate (↓d) | MIMIC-IV sensitivity (↑) | eICU-CRD event recall (↑) | eICU-CRD false alarm rate (↓) | eICU-CRD sensitivity (↑) |
|---|---|---|---|---|---|---|
| Proposed method | 0.99 (0.00) | 0.80 (0.04) | 0.80 (0.09) | 0.99 (0.01) | 0.45 (0.14) | 0.99 (0.01) |
| Proposed method with feature screening | 0.99 (0.00) | 0.77 (0.05) | 0.89 (0.06) | 0.99 (0.00) | 0.36 (0.16) | 0.99 (0.00) |

aMIMIC: Medical Information Mart for Intensive Care.

beICU-CRD: eICU-Collaborative Research Database.

cThe ↑ symbol indicates that a higher value for the evaluation metric corresponds to a more meaningful or effective model.

dThe ↓ symbol indicates that a lower value for the evaluation metric corresponds to more meaningful or effective model performance.

Predictive Performance

This section presents the results of CA predictive performance. We evaluated the performance using metrics including AUROC, event recall, false alarm rate, and sensitivity.

In the 12-hour time window from the MIMIC-IV database, we compared the AUROC of the proposed framework with that of baseline methods to investigate CA predictive performance. The proposed method achieved a higher overall AUROC value compared with the baseline methods, as shown in Figure 3. The AUROC results using the proposed method were statistically higher than those of the comparison methods (χ²₁₅=68.67), as determined by the Kruskal-Wallis test with HSD post hoc analysis (see ). Additionally, we compared other performance metrics, including event recall, false alarm rate, and sensitivity, to assess effectiveness in a clinically relevant context. This evaluation focused on detecting CA events within a 24-hour period and minimizing false alarm rates, as shown in Table 3. The proposed method achieved statistically higher performance in event recall and sensitivity (χ²₁₅=90.34 for event recall and χ²₁₅=38.70 for sensitivity), as shown in and . Additionally, the proposed method demonstrated a statistically lower false alarm rate (χ²₁₅=110.00), as detailed in .

We compared the AUROC of the comparison methods and the proposed framework using the 12-hour time window from the eICU-CRD data set. The proposed method achieved statistically higher performance, with an overall AUROC value as shown in and (χ²₁₄=81.38). Additionally, we evaluated other performance metrics, including event recall, false alarm rate, and sensitivity, as detailed in Table 4. The proposed method achieved statistically higher values for event recall and sensitivity compared with other methods, as demonstrated by the Kruskal-Wallis test with HSD (χ²₁₄=90.75 for event recall, χ²₁₄=100.86 for sensitivity), as shown in and . The proposed method also achieved a lower false alarm rate than the comparison methods, except for SAPS-II and LGBM.

Figure 3. Comparison of AUROC performance among baseline models and the proposed method from MIMIC-IV and eICU-CRD. (A) AUROC from MIMIC-IV and (B) AUROC from eICU-CRD. AUROC: area under the receiver operating characteristic curve; DEWS: Deep Early Warning Score; eICU-CRD: eICU-Collaborative Research Database; FS: feature screening; KNN: K-nearest neighbors; LGBM: light gradient boosting method; LR: logistic regression; MIMIC: Medical Information Mart for Intensive Care; MLP: multilayer perceptron; NEWS: National Early Warning Score; RETAIN: reverse time attention; SAPS: Simplified Acute Physiology Score; SOFA: Sequential Organ Failure Assessment.

Table 3. Comparison of LOPOa cross-validation performance using MIMIC-IVb. Values are mean (SD).

| Model | Event recall (↑c) | False alarm rate (↓d) | Sensitivity (↑) |
|---|---|---|---|
| National Early Warning Score ≥5 | 0.87 (0.12) | 0.89 (0.09) | 0.39 (0.10) |
| Sequential Organ Failure Assessment ≥6 | 0.71 (0.13) | 0.90 (0.08) | 0.59 (0.15) |
| Simplified Acute Physiology Score-II ≥32 | 0.91 (0.10) | 0.90 (0.10) | 0.91 (0.10) |
| Logistic regression | 0.99 (0.04) | 0.90 (0.10) | 0.69 (0.21) |
| k-Nearest neighbors | 0.05 (0.06) | 0.96 (0.10) | 0.01 (0.01) |
| Multilayer perceptron | 0.36 (0.13) | 0.88 (0.04) | 0.03 (0.02) |
| Light gradient boosting method | 0.51 (0.17) | 0.87 (0.01) | 0.17 (0.08) |
| DEWSe ≥2.9 | 0.91 (0.11) | 0.92 (0.08) | 0.44 (0.10) |
| DEWS ≥3 | 0.91 (0.11) | 0.92 (0.02) | 0.44 (0.10) |
| DEWS ≥7.1 | 0.85 (0.10) | 0.92 (0.03) | 0.37 (0.09) |
| DEWS ≥8 | 0.84 (0.10) | 0.92 (0.05) | 0.35 (0.09) |
| DEWS ≥18.2 | 0.83 (0.12) | 0.92 (0.09) | 0.29 (0.09) |
| DEWS ≥52.8 | 0.69 (0.13) | 0.92 (0.07) | 0.18 (0.06) |
| Reverse time attention | 0.98 (0.05) | 0.92 (0.03) | 0.94 (0.10) |
| Proposed method | 0.99 (0.00) | 0.80 (0.04) | 0.80 (0.09) |
| Proposed method with feature screening | 0.99 (0.00) | 0.77 (0.05) | 0.89 (0.06) |

aLOPO: leave-one-patient-out.

bMIMIC: Medical Information Mart for Intensive Care.

cThe ↑ symbol indicates that a higher value for the evaluation metric corresponds to a more meaningful or effective model.

dThe ↓ symbol indicates that a lower value for the evaluation metric corresponds to more meaningful or effective model performance.

eDEWS: Deep Learning–Based Early Warning Score.

Table 4. Comparison of LOPOa cross-validation performance using eICU-CRDb. Values are mean (SD).

| Model | Event recall (↑c) | False alarm rate (↓d) | Sensitivity (↑) |
|---|---|---|---|
| National Early Warning Score ≥5 | 0.99 (0.02) | 0.51 (0.16) | 0.68 (0.12) |
| Simplified Acute Physiology Score-II ≥32 | 0.70 (0.21) | 0.44 (0.19) | 0.70 (0.24) |
| Logistic regression | 0.99 (0.02) | 0.48 (0.14) | 0.86 (0.20) |
| k-Nearest neighbors | 0.19 (0.08) | 0.43 (0.27) | 0.02 (0.01) |
| Multilayer perceptron | 0.54 (0.11) | 0.49 (0.14) | 0.08 (0.02) |
| Light gradient boosting method | 0.94 (0.05) | 0.42 (0.14) | 0.67 (0.05) |
| DEWSe ≥2.9 | 0.98 (0.03) | 0.56 (0.18) | 0.63 (0.09) |
| DEWS ≥3 | 0.98 (0.03) | 0.56 (0.18) | 0.63 (0.09) |
| DEWS ≥7.1 | 0.96 (0.04) | 0.55 (0.18) | 0.55 (0.09) |
| DEWS ≥8 | 0.96 (0.04) | 0.55 (0.18) | 0.54 (0.09) |
| DEWS ≥18.2 | 0.92 (0.06) | 0.55 (0.18) | 0.45 (0.08) |
| DEWS ≥52.8 | 0.86 (0.09) | 0.55 (0.18) | 0.29 (0.06) |
| Reverse time attention | 0.99 (0.00) | 0.50 (0.14) | 0.99 (0.01) |
| Proposed method | 0.99 (0.01) | 0.45 (0.14) | 0.99 (0.01) |
| Proposed method with feature screening | 0.99 (0.00) | 0.36 (0.16) | 0.99 (0.01) |

aLOPO: leave-one-patient-out.

beICU-CRD: eICU-Collaborative Research Database.

cThe ↑ symbol indicates that a higher value for the evaluation metric corresponds to a more meaningful or effective model.

dThe ↓ symbol indicates that a lower value for the evaluation metric corresponds to more meaningful or effective model performance.

eDEWS: Deep Learning–Based Early Warning Score.

Subgroup Analysis

We evaluated the performance of the comparison models and the proposed framework across different ICU types, including general and cardiac ICUs. Most ICU types showed similar performance, except for patients in the cardiac ICU within the eICU-CRD data set. As shown in Figure 4, the proposed method demonstrated statistically higher performance compared with the comparative models across all ICU types in both the MIMIC-IV and eICU-CRD data sets. The comparisons by ICU type are as follows: general ICU in MIMIC-IV (χ²₈=29.67), cardiac ICU in MIMIC-IV (χ²₈=44.22), and cardiac ICU in eICU-CRD (χ²₈=45.07). Detailed statistical comparison results are presented in -.

Figure 4. Model performance in different patient cohorts from MIMIC-IV and eICU-CRD. (A) AUROC on ICU types of MIMIC-IV. (B) AUROC on ICU types of eICU-CRD. Boxes in the box plot show the IQR, and the cross marks are outliers with values that lie outside the minimum and maximum ranges of the whiskers, where minimum = Q1 − 1.5 × IQR and maximum = Q3 + 1.5 × IQR. *Statistically significant (P<.05). AUROC: area under the receiver operating characteristic curve; DEWS: Deep Learning–Based Early Warning Score; eICU-CRD: eICU-Collaborative Research Database; ICU: intensive care unit; KNN: k-nearest neighbors; LGBM: light gradient boosting method; LR: logistic regression; MIMIC: Medical Information Mart for Intensive Care; MLP: multilayer perceptron; NEWS: National Early Warning Score; Q1: first quartile; Q3: third quartile; RETAIN: reverse time attention; SAPS: Simplified Acute Physiology Score.

External Validation

We conducted cross-data set external validation to assess the generalization ability of the proposed method and comparison models. After training on the MIMIC-IV data set, we evaluated the clinical validity of predicting CA within 24 hours using the eICU-CRD data set as the test set. Figure 5 and Table 5 present the external validation results for conventional systems (including NEWS, SOFA, and SAPS-II), machine learning–based comparison methods, and deep learning–based scoring systems. The proposed method achieved a higher AUROC, higher event recall, and a lower false alarm rate compared with the comparison methods.

Conversely, we tested the proposed framework by evaluating a cohort from a general hospital setting (eICU-CRD) and a cohort with heart disease (MIMIC-IV). The results showed that the proposed framework achieved superior performance in AUROC, false alarm rate, and sensitivity, as detailed in .

Figure 5. Cross–data set external validation AUROC performance. (A) eICU-CRD after training on MIMIC-IV. (B) MIMIC-IV after training on eICU-CRD. AUROC: area under the receiver operating characteristic curve; DEWS: Deep Learning–Based Early Warning Score; eICU-CRD: eICU-Collaborative Research Database; KNN: k-nearest neighbors; LGBM: light gradient boosting method; LR: logistic regression; MIMIC: Medical Information Mart for Intensive Care; MLP: multilayer perceptron; NEWS: National Early Warning Score; RETAIN: reverse time attention.

Table 5. Cross-data set external validation performance using eICU-CRDa after training MIMIC-IVb.

| Model | Event recall (↑c) (95% CI) | False alarm rate (↓d) (95% CI) | Sensitivity (↑) (95% CI) | Brier score (95% CI) |
|---|---|---|---|---|
| National Early Warning Score ≥5 | 0. | | | |
