Development and validation of early warning score systems for COVID‐19 patients

3.2.1 Outcome definition

We defined respiratory deterioration as the need for advanced respiratory support (high-flow nasal oxygen [HFNO], continuous positive airway pressure [CPAP], non-invasive ventilation [NIV], or intubation) or ICU admission within a prediction window of 24 h. It should be noted, however, that hypoxic respiratory failure is not the only process through which COVID-19 patients deteriorate: some patients deteriorate through shock due to venous thromboembolism or superadded sepsis. Such events may also lead to ICU admission or increased oxygen requirements and so would still be captured by our model.
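To make the labelling concrete, the snippet below sketches one way such a composite 24-h outcome could be derived. The `obs` and `events` tables, their column names, and the single-event simplification are illustrative assumptions, not the paper's actual pipeline.

```python
import pandas as pd

# Illustrative observation and escalation-event tables (hypothetical schema).
obs = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "obs_time": pd.to_datetime(
        ["2020-04-01 08:00", "2020-04-01 20:00", "2020-04-02 09:00"]),
})
events = pd.DataFrame({
    "patient_id": [1],  # one event per patient for brevity, e.g. CPAP started
    "event_time": pd.to_datetime(["2020-04-02 06:00"]),
})

# Label an observation positive if an escalation event (HFNO, CPAP, NIV,
# intubation, or ICU admission) occurs within the next 24 h.
merged = obs.merge(events, on="patient_id", how="left")
window = pd.Timedelta(hours=24)
merged["label"] = (merged["event_time"] > merged["obs_time"]) & (
    merged["event_time"] <= merged["obs_time"] + window
)
# With multiple events per patient, aggregate per observation, e.g.
# merged.groupby(["patient_id", "obs_time"])["label"].any().
```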

3.2.2 Performance of the EWS systems

Table 3 outlines the performance of the EWS systems. NEWS, MCEWS, CEWS, AEWS, LDTEWS:NEWS, and LDTEWS achieved AUROCs of 79%, 78%, 63%, 68%, 80%, and 62%, respectively. The best performing scores were NEWS and LDTEWS:NEWS (Figure 1). The efficiency curves of the various EWS systems are shown in Figure 1.

FIGURE 1.

This figure includes the efficiency and receiver operating characteristic (ROC) curves for the machine learning models and the early warning scores (EWS). (a) The efficiency curves for the EWS in our study; the low performance of the EWS on the efficiency curve metric may be explained by a high false positive rate. (b) The ROC curves for the various EWS in our study (the best performance is for NEWS, with an AUROC of 72%). (c) The performance of the GBT model measured by the efficiency curve metric. (d) The ROC curve and AUROC for the GBT model on the F9 feature set (AUROC of 94%).

We evaluated the performance of the recommended (original) thresholds for the different EWS. The default thresholds are 5, 4, 4, 0.27, and 0.33 for NEWS, CEWS, MCEWS, LDTEWS:NEWS, and LDTEWS, respectively. AEWS does not have a recommended threshold, so we excluded it from this evaluation. The NEWS score had the most balanced sensitivity and specificity (66% and 75%, respectively). NEWS and LDTEWS achieved the lowest accuracy (75% and 73%), with a sensitivity and specificity of 41% and 74% for LDTEWS. CEWS achieved the highest accuracy (91%), but with a sensitivity of 23% and a specificity of 91% (Table 3).

We optimised the thresholds for each score to maximise accuracy, as outlined in the Methods section. The optimised EWS thresholds yielded more balanced performance. LDTEWS:NEWS was the overall best performing score, with an accuracy, sensitivity, and specificity of 67%, 77%, and 67%, respectively. NEWS, MCEWS, CEWS, and AEWS achieved similar accuracies (62%, 61%, 64%, and 60%, respectively). The worst performing score was LDTEWS, with an accuracy of 52% and an AUROC of 62%. The accuracy-optimised thresholds for all scores differed from the recommended values (Table 3).
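As a minimal sketch of this accuracy-based optimisation (toy data; the function name, variable names, and candidate grid are assumptions, not the study code):

```python
import numpy as np

def accuracy_optimised_threshold(scores, labels):
    """Return the EWS cut-off that maximises accuracy on the training set."""
    candidates = np.unique(scores)
    accuracies = [np.mean((scores >= t) == labels) for t in candidates]
    return candidates[int(np.argmax(accuracies))]

# Toy example: NEWS-like integer scores with a noisy ground truth.
rng = np.random.default_rng(0)
scores = rng.integers(0, 15, size=500)
labels = (scores + rng.normal(0, 3, size=500)) > 7
print(accuracy_optimised_threshold(scores, labels))
```

The chosen cut-off is then applied unchanged to the test set, so the threshold itself is never fitted to test data.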

The performance of the EWS in COVID-19 patients was significantly lower than that previously reported in non-COVID patients. The Royal College of Physicians [6] reported an AUROC of 89% for NEWS, compared with 79% in our dataset. Watkinson and colleagues [3] reported AUROCs of 86.8% and 80.8% for MCEWS and CEWS, respectively, compared with 78% and 63% in our dataset. Shamout and colleagues [7] reported that AEWS achieved an AUROC of 83.8%, while AEWS achieved 68% on COVID patients in our dataset. Redfern and colleagues reported an AUROC of 90.1%–91.6% for LDTEWS:NEWS; in COVID patients, the AUROC for LDTEWS:NEWS was 80%. The worst performing score in our study was LDTEWS (AUROC of 62%). The score was developed by Jarvis and colleagues [8], with a reported AUROC ranging between 75% and 80% for discriminating in-hospital mortality in the general in-hospital patient cohort. This indicates that while the predictors used in LDTEWS (HGB, Alb, Na, K, Cr, Ur, WBC) are useful for discriminating in-hospital mortality in non-COVID patients, they are less useful for predicting respiratory deterioration in COVID patients (Table 1).

3.2.3 Performance of the machine learning models

We evaluated the performance of three machine learning models (GBT, RF, and LR) on the training data using internal 5-fold cross-validation, on multiple feature sets as outlined in the feature sets subsection of the Methods and Table 2 (F1–F11). The GBT model outperformed the other models on the different feature sets in our training dataset; we therefore made a design choice to use only the GBT model when evaluating performance on the different feature sets in the test data. The highest AUROCs were achieved using the F1 (83%), F7 (93%), F8 (86%), F9 (94%), and F11 (93%) feature sets. The lowest AUROCs were observed for the F2 (72%), F4 (77%), F5 (69%), and F6 (78%) feature sets. F7 is a simple feature set based on six commonly collected vital signs and their variability; it could represent the scenario of an overrun healthcare facility in which laboratory tests are not readily available.
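The model comparison can be sketched as follows with scikit-learn (the paper does not state its tooling, so the specific estimator classes and the synthetic stand-in for a feature set are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for one feature set (e.g. F7).
X, y = make_classification(n_samples=2000, n_features=12,
                           weights=[0.95], random_state=0)

models = {
    "GBT": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    auroc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUROC {auroc.mean():.2f} (+/- {auroc.std():.2f})")
```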

We compared the performance of the EWS systems and machine learning models in predicting COVID-19 patient deterioration on three main feature sets: F1–F3 (Table 2). On each of the three feature sets, the machine learning model outperformed the EWS systems. For the F1 feature set, the performance of NEWS (AUROC = 79%), MCEWS (AUROC = 78%), CEWS (AUROC = 63%), and AEWS (AUROC = 68%) can be compared with that of GBT (AUROC = 83%). For the F2 feature set, LDTEWS (AUROC = 62%) can be compared with GBT (AUROC = 72%). For the F3 set, LDTEWS:NEWS (AUROC = 80%) can be compared with GBT (AUROC = 85%) (Figure 1). The efficiency curves of the machine learning models and EWS systems are shown in Figure 1.

The overall best performing machine learning model was the GBT model on the F9 feature set (AUROC = 94%). Given the imbalanced nature of our dataset, we tuned the probability-to-class conversion threshold for the GBT model to create the best performing machine learning model, optimising the threshold to maximise accuracy. We identified the threshold that maximises the accuracy of the GBT model on the training set and measured the performance on the test set. The identified threshold was 0.19. The optimised GBT model achieved an accuracy, sensitivity, and specificity of 70%, 96%, and 70%, respectively. The most and least important features are outlined in Table 4. Of the 10 most important features (FiO2, max–min SBP, CRP, max–min HR, PO2, mean cell volume, arterial blood calcium, max–min RR, CtO2C, temperature), four belonged to the F7 (vital signs and variability) feature set, three belonged to the F5 feature set (arterial blood tests), and two belonged to the F4 feature set (venous blood tests). The most important feature was FiO2. Delta is a measure of the variability of a specific variable, calculated as the current value minus its mean over the previous 24 h. The most important vital signs were heart rate, respiratory rate, temperature, and blood oxygen saturation (SpO2).
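To illustrate how the variability (max–min) and delta features can be derived from raw vital signs, a hedged pandas sketch follows; the single-patient series and column names are illustrative, and the paper's exact implementation is described in the Methods.

```python
import pandas as pd

# Hypothetical heart-rate series for one patient, indexed by time.
vitals = pd.DataFrame(
    {"hr": [88, 95, 110, 102, 121]},
    index=pd.to_datetime(
        ["2020-04-01 00:00", "2020-04-01 06:00", "2020-04-01 12:00",
         "2020-04-01 18:00", "2020-04-02 00:00"]),
)

rolled = vitals["hr"].rolling("24h")                 # 24-h lookback window
vitals["hr_max_min"] = rolled.max() - rolled.min()   # variability (range)
vitals["hr_delta"] = vitals["hr"] - rolled.mean()    # current value minus 24-h mean
# For multiple patients, apply the same rolling logic per patient group.
```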

TABLE 4. The performance of machine learning models and the corresponding feature weights for the most and least important features. Sections A, B, C, and D explore the performance of the best machine learning model (the GBT model on the F9 feature set). Section A outlines the performance of the accuracy-optimised threshold for the GBT model; the threshold was set on the training data and tested on the test data. Section B outlines the performance of the GBT model after limiting the predictors to the 20 most important features. Sections C and D outline the performance of the GBT model after reducing the lookback window from 24 h to 6 and 12 h, respectively. Section E outlines the performance after adding FiO2 as a predictor; FiO2 did not improve model performance compared with the optimised threshold model (Section A), although it ranked as the most important feature in the feature importance analysis. Section F outlines the performance of the model after adding age as a predictor; age did not rank among the most important features and did not improve performance over the optimised threshold model (Section A). Section G outlines the performance after adding delta baseline to the vital signs and variations feature set. Section H outlines the performance after adding delta baseline to the all features and variations feature set. Section I outlines the performance of the GBT model with the hard output threshold on the different feature spaces before identifying the best model and adjusting the threshold for accuracy. Section J outlines the feature importance for the GBT model on the F9 feature set, including the ten most and least important features. Abbreviations: Acc, accuracy; Sen, sensitivity; Sps, specificity; Prs, precision.

| Section | Feature set | Model | Threshold (on train) | Acc | Sen | Sps | Prs | AUROC |
|---|---|---|---|---|---|---|---|---|
| A: Accuracy-optimised threshold | F9 | GBT | 0.12 (0.11–0.13) | 0.70 (0.69–0.71) | 0.96 (0.95–0.96) | 0.70 (0.69–0.70) | 0.03 (0.03–0.03) | 0.94 (0.94–0.94) |
| B: Feature selection | F9 (top 20) | GBT | 0.35 (0.32–0.37) | 0.80 (0.80–0.81) | 0.91 (0.90–0.92) | 0.80 (0.80–0.81) | 0.04 (0.04–0.04) | 0.94 (0.94–0.94) |
| C: 6-h lookback window | F9 | GBT | 0.09 (0.08–0.09) | 0.56 (0.56–0.57) | 0.87 (0.86–0.88) | 0.56 (0.56–0.56) | 0.01 (0.01–0.01) | 0.85 (0.85–0.85) |
| D: 12-h lookback window | F9 | GBT | 0.32 (0.30–0.34) | 0.66 (0.65–0.67) | 0.87 (0.86–0.88) | 0.66 (0.65–0.67) | 0.02 (0.02–0.02) | 0.86 (0.86–0.86) |
| E: Adding FiO2 as a predictor | F9 and FiO2 | GBT | 0.15 (0.13–0.18) | 0.72 (0.71–0.73) | 0.89 (0.87–0.91) | 0.72 (0.71–0.73) | 0.03 (0.03–0.03) | 0.93 (0.93–0.93) |
| F: Adding age as a predictor | F9 and age | GBT | 0.19 (0.17–0.21) | 0.73 (0.72–0.74) | 0.94 (0.94–0.95) | 0.73 (0.72–0.74) | 0.03 (0.03–0.03) | 0.93 (0.93–0.94) |
| G: Adding delta baseline to vital signs and delta | F10 | GBT | 0.32 (0.30–0.34) | 0.82 (0.81–0.83) | 0.87 (0.86–0.88) | 0.82 (0.81–0.83) | 0.04 (0.04–0.04) | 0.93 (0.92–0.93) |
| H: Adding delta baseline to all features and delta | F11 | GBT | 0.22 (0.21–0.24) | 0.74 (0.73–0.74) | 0.93 (0.92–0.93) | 0.73 (0.73–0.74) | 0.03 (0.03–0.03) | 0.93 (0.92–0.93) |

Section I: Hard output performance

| Feature set | Model | AUROC |
|---|---|---|
| F1 | GBT | 0.83 (0.83–0.84) |
| F2 | GBT | 0.72 (0.71–0.72) |
| F3 | GBT | 0.85 (0.84–0.85) |
| F4 | GBT | 0.77 (0.76–0.77) |
| F5 | GBT | 0.69 (0.68–0.69) |
| F6 | GBT | 0.78 (0.78–0.79) |
| F7 | GBT | 0.93 (0.92–0.93) |
| F8 | GBT | 0.86 (0.86–0.87) |
| F9 | GBT | 0.94 (0.94–0.94) |
| F10 | GBT | 0.93 (0.92–0.93) |
| F11 | GBT | 0.93 (0.93–0.93) |

Section J: Feature weights

| Highest feature weights | Weight | Lowest feature weights | Weight |
|---|---|---|---|
| FiO2 | 0.258354 | Bilirubin-umol/L | 0.000031 |
| Max-Min SBP | 0.151461 | METHB (BG) | 0.000020 |
| CRP-mg/L | 0.108911 | FCOHB (BG) | 0.000011 |
| Max-Min HR | 0.044093 | CLAC (BG) | 0.000009 |
| PO2 (BG) | 0.033090 | NA+ (BG) | 0.000007 |
| Mean CellVol-fL | 0.026848 | Basophils-x109/L | 0.000003 |
| CA++ (BG) | 0.026313 | masktyp | 0 |
| Max-Min RR | 0.025169 | TEMPERATURE POCT | 0 |
| CTO2C (BG) | 0.024777 | Potassium-mmol/L | 0 |
| TEMP | 0.021500 | avpu | 0 |

We conducted three additional experiments. The first was to limit the predictors of the GBT model to the features that ranked highest on the feature importance scale on the training set. We found that the optimal number of features was 18–20 and subsequently chose to report the performance for the 20 most important features. This forward selection experiment did not impact performance (Table 4). We did not attempt a backward selection approach, which is generally considered preferable in classical statistics, in this study. The second experiment was to include a more granular measurement of oxygen support, the fraction of inspired oxygen (FiO2); including FiO2 did not improve performance (Table 4). The third experiment was to include age as a predictor; doing so did not significantly impact performance (Table 4). The lack of performance gains despite FiO2's high feature importance may be due to multicollinearity, where a subset of existing variables correlates highly with this feature. This is explicit in the construction of the FiO2 variable, which is calculated from source variables already present in the vital signs feature set (respiratory rate, SpO2, Masktype), as outlined in the Methods section.
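A sketch of the forward-selection experiment, assuming a scikit-learn-style GBT (the estimator class and synthetic data are stand-ins for the study's setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Fit on the training set, rank features by importance, refit on the top 20.
X_train, y_train = make_classification(n_samples=2000, n_features=60,
                                       random_state=0)
gbt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
top20 = np.argsort(gbt.feature_importances_)[::-1][:20]
gbt_top20 = GradientBoostingClassifier(random_state=0).fit(
    X_train[:, top20], y_train)
```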

Our results show that summary measures of the variability of vital signs and laboratory markers play an important role in predicting deterioration. Adding the variability (range and mean over the previous 24-h window) and delta (current value minus mean) features to the vital signs feature set added 10 percentage points to the AUROC (vital signs, 83% vs. vital signs and variations, 93%). Similar results were observed for the all features set, where adding the variability and delta predictors added 8 percentage points to the AUROC (all features, 86% vs. all features and variations, 94%). Adding the delta baseline variables to the vital signs and all features spaces did not further improve performance (vital signs, variations, and baseline, 93%; all features, variations, and baseline, 93%). These observations echo common clinical practice, where physicians often analyse trends in parameters rather than their absolute values when evaluating a patient, and highlight the benefits of dynamic monitoring. Moreover, the importance of summarising the variability and changes of vital signs when using them as inputs to machine learning models has already been demonstrated by Shamout and colleagues [26] in their work to develop a deep learning-based early warning system.

The lower performance of the model when using variables from blood gas analysis could partly be explained by inconsistency in the labelling of these samples. The origin of the blood, whether venous or arterial, was frequently missing or mislabelled, perhaps reflecting time pressures on clinical staff, or skewed towards markers minimally influenced by sample provenance (e.g., lactate). This required the use of imputation techniques during preprocessing of the dataset, which may have affected performance. Moreover, some data points in the blood gas readings duplicated information encoded within other feature sets, such as haemoglobin and creatinine.
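For illustration, a simple median imputation such as the following could stand in for that preprocessing step; the paper's exact imputation technique is described in the Methods, so this particular choice is an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Blood-gas matrix with missing values (e.g. pH, lactate); median imputation
# is one plausible choice where sample provenance is unreliable.
X = np.array([[7.41, np.nan],
              [7.35, 4.2],
              [np.nan, 5.0]])
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
```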
