Modelling 30-day hospital readmission after discharge for COPD patients based on electronic health records

Data collection

Obstructive airway disease is one of the ten leading causes of death in Macao. Studies have shown that second-hand smoke affects 14% of the local labor force, increasing the incidence and mortality of COPD in Macao. Kiang Wu Hospital is one of the three major hospitals in Macao and accounts for 47% of the total hospital resources. In this study, we reviewed the health records of COPD inpatients in the EHR system of Kiang Wu Hospital from January 1, 2018, to December 31, 2019. The inclusion criteria were: (1) patients admitted with a main diagnosis of COPD (International Classification of Diseases-10 (ICD-10) code J44); and (2) admission due to acute exacerbation as confirmed by specialists. It should be noted that the labeled data and the trained prediction model in this study are site-specific to regions with similar patient characteristics, although the overall methodology is transferable to other regions or studies.

Variables and measurements

There were 3 categories of data in this study: demographic data, blood test results and clinical therapies (see Table 1). Patients’ demographic data included age, gender, history of tobacco smoking, number of comorbidities (NoC) and number of hospitalizations in the past 12 months (NoH-12). Blood test results included blood eosinophil count (BEC), hemoglobin, white blood cells (WBC) and creatinine. Clinical therapies for COPD in Macao included data about the usage of systemic steroids (prednisolone, dexamethasone, methylprednisolone) and antibiotics, oxygen therapy, noninvasive ventilation (NIV) and pulmonary rehabilitation (PR). It should be noted that there are too many combinations of inhaled medications to treat them directly as a categorical variable, given the limited number of samples in this study. Therefore, we divided the inhaled medications into a few categories. Following previous work34, each hospitalization record was assigned to one of four groups according to the patient’s use of inhaled medications. Group 1 included records in which only one type of inhaled medication was used (i.e., “LABA, LAMA or both”, “SABA, SAMA or both” or ICS only). Group 2 included records with two types of inhaled medications (i.e., “(LABA, LAMA or both) and (SABA, SAMA or both)”, “(LABA, LAMA or both) and ICS” or “(SABA, SAMA or both) and ICS”). Group 3 included records in which all three types of inhaled medications were combined (“(LABA, LAMA or both) and (SABA, SAMA or both) and ICS”). Group 4 referred to records in which the patient did not use any inhaled medication; this grouping rule is sketched below. It should also be noted that some variables such as BEC, hemoglobin, WBC and creatinine indeed vary at every admission for an individual patient due to illness or drug effects; therefore, this study used hospitalization information per patient admission to reflect the dynamic readmission risk.
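As a minimal sketch of this grouping rule, the following MATLAB fragment maps medication usage to the four groups; the logical vectors usedLong, usedShort and usedICS are hypothetical placeholders (one entry per hospitalization record), not names from the study’s dataset.

```matlab
% Hypothetical flags per record: usedLong = "LABA, LAMA or both",
% usedShort = "SABA, SAMA or both", usedICS = ICS.
nTypes = double(usedLong) + double(usedShort) + double(usedICS); % types used
group = zeros(size(nTypes));
group(nTypes == 1) = 1;   % Group 1: only one type of inhaled medication
group(nTypes == 2) = 2;   % Group 2: two types of inhaled medications
group(nTypes == 3) = 3;   % Group 3: all three types combined
group(nTypes == 0) = 4;   % Group 4: no inhaled medications
```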

Data discretization and balancing

Some continuous variables (e.g., BEC, hemoglobin, WBC, creatinine, NoC, NoH-12) were first transformed into categorical variables based on their proper reference ranges. The data imbalance problem, in which the distribution of examples across classes is biased or skewed, generally poses a challenge for predictive modelling: predictive performance is usually poor, especially for the minority classes (the ones with fewer samples). Considering that the numbers of records with readmission (Yes class) and without readmission (No class) in this study were significantly different, a data balancing technique was applied to generate a balanced dataset for data-driven classification model construction. In particular, a down-sampling approach was adopted. In this approach, all records of the Yes class (the one with fewer samples) were first preserved; random sampling was then performed on the records of the No class using the “randperm” function (i.e., random permutation of integers) in MATLAB R2020b, so that the number of sampled No-class records matched the size of the Yes class. A balanced dataset consisting of the same number of records for the Yes and No classes was thereby generated for the subsequent data analysis and classification model construction.
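A minimal sketch of the discretization and down-sampling steps is given below, assuming a table T with a binary variable Readmission ('Yes'/'No'); the variable names and the hemoglobin reference range are illustrative assumptions only.

```matlab
% Discretize a continuous variable into reference-range categories
% (edges are a hypothetical example, not the study's actual cut-offs).
T.HemoglobinCat = discretize(T.Hemoglobin, [-Inf 12 16 Inf], ...
    'categorical', {'low', 'normal', 'high'});
% Down-sampling: keep all Yes records, randomly sample No records.
rng(0);                                                % reproducible sampling
yesIdx = find(strcmp(T.Readmission, 'Yes'));
noIdx  = find(strcmp(T.Readmission, 'No'));
pick   = noIdx(randperm(numel(noIdx), numel(yesIdx))); % random No subset
Tbal   = T([yesIdx; pick], :);                         % balanced dataset
```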

Data analysis and feature selection

Descriptive analysis

Descriptive analysis was first performed for the continuous and categorical variables in the balanced dataset. In particular, for continuous variables the mean and median were computed, while for categorical variables the number and proportion of records in each class were summarized.
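For illustration, a minimal sketch of such summaries follows, assuming Tbal is the balanced table and Age and Gender are illustrative variable names.

```matlab
% Mean/median for a continuous variable; counts/proportions for a categorical one.
fprintf('Age: mean = %.1f, median = %.1f\n', mean(Tbal.Age), median(Tbal.Age));
[counts, labels] = groupcounts(Tbal.Gender);   % number of records per class
proportions = counts / sum(counts);            % proportion per class
```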

Feature selection

Considering that there were more than 15 features (i.e., candidate independent variables) and a limited number of available samples (i.e., number of records) for model construction, feature selection was performed to remove irrelevant and redundant features so that a simpler and more reliable prediction model could be derived. The feature selection methods for continuous and categorical variables are introduced below.

For continuous variables, the two-sample Kolmogorov-Smirnov test (KS test) was adopted to assess their distribution differences under the Yes and No classes. The KS test is a general nonparametric statistical approach to quantify whether two samples come from the same distribution. Suppose two samples of sizes m and n have observed/empirical cumulative distribution functions Fm(x) and Gn(x); the KS statistic is then defined by

$$D_{m,n} = \sup_x \left| F_m(x) - G_n(x) \right|$$

(1)

where sup is the supremum function. The null hypothesis is that the two samples are drawn from the same distribution, and one rejects the null hypothesis (at a significance level α) if Dm,n > Dm,n,α, where Dm,n,α is the so-called critical value. For sufficiently large m and n,

$$D_{m,n,\alpha} = c(\alpha)\sqrt{\frac{m + n}{mn}}$$

(2)

where c(α) is the inverse of the Kolmogorov distribution at α, given by \(c\left( \alpha \right) = \sqrt{ - \frac{1}{2}\ln \left( \frac{\alpha}{2} \right)}\) (e.g., c(0.05) ≈ 1.36). In this study, the “kstest2” function in MATLAB R2020b was adopted with α = 0.05.
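A minimal sketch of this screening step is shown below, assuming x is one continuous feature vector and y the corresponding readmission labels; the names are illustrative.

```matlab
% Two-sample KS test between the Yes- and No-class distributions of x.
[h, p] = kstest2(x(strcmp(y, 'Yes')), x(strcmp(y, 'No')), 'Alpha', 0.05);
% h = 1 rejects the null hypothesis that both samples share one distribution.
```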

For categorical variables, the Chi-Square test of independence was adopted. The Chi-Square test is a statistical hypothesis test whose null hypothesis is that the observed frequencies of a categorical variable match the expected frequencies, i.e., H0: “variable 1 is independent of variable 2”. It is therefore commonly used to determine whether there is an association between two categorical variables. In this study, the Chi-Square statistic (along with its p-value) between each candidate categorical variable and the dependent variable (readmission or not) was returned by the “crosstab” function (i.e., cross-tabulation) in MATLAB R2020b.
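A minimal sketch for one categorical feature follows, assuming g is the feature and y the readmission label; the names and the 0.05 threshold are illustrative.

```matlab
% Chi-Square test of independence between a categorical feature and readmission.
[tbl, chi2stat, p] = crosstab(g, y);   % contingency table, statistic, p-value
selected = p < 0.05;                   % retain features associated with readmission
```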

Classification model

Decision tree model

Upon choosing the features, the next step is to build a classification model using machine learning approaches. Different machine learning-based classification models are available in the literature, such as classification trees, logistic regression, Naïve Bayes, Support Vector Machines (SVM), ensemble approaches and neural networks, among others. Different models have their own pros and cons in terms of accuracy, computation load, transparency, interpretability and reliance on a large labelled dataset. In this study, after a preliminary performance comparison in terms of accuracy via five-fold cross-validation in the MATLAB Classification Learner app, the decision tree model, a so-called white-box model (as opposed to black-box or grey-box models), was adopted. The main rationale for choosing the decision tree model is summarized as follows. First, in the preliminary performance comparison, the decision tree-based approach achieved the best accuracy. Second, a decision tree is simple to understand and interpret: its inherent transparency allows users to follow the path of the tree and thereby understand the decision rules (i.e., if-else rules). Third, the simplicity of the model also makes it less reliant on a large training dataset compared with complex models such as neural networks. Fourth, predictor importance values can be estimated from the decision tree, which can be used to assess the importance of different variables in making the decision. It is also noted that missing data in the training dataset can be handled automatically by the decision tree model (e.g., “fitctree” in the MATLAB environment).

Like many other machine learning models, the decision tree algorithm has hyperparameters that affect its performance and should be properly tuned. These include the hyperparameters controlling the tree depth (e.g., MaxNumSplits, MinLeafSize or MinParentSize) and the split criterion (e.g., gdi, deviance). Different approaches (e.g., grid search, random search, Bayesian optimization) are available to systematically tune these hyperparameters to obtain satisfactory performance; in this study, Bayesian parameter optimization (a sequential model-based optimization) was adopted due to its efficiency in deriving a good solution within a limited number of steps. In addition, 5-fold cross-validation (instead of hold-out validation) was adopted to maximally use the limited dataset, gain stable predictions and avoid overfitting (i.e., good performance on the training dataset but poor performance on the testing dataset). The decision tree algorithm with Bayesian hyperparameter optimization is summarized in the supplementary materials.
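A minimal sketch of this workflow using “fitctree” is given below, assuming the balanced table Tbal has a binary response variable Readmission; the variable names, iteration budget and random seed are illustrative assumptions, not the study’s exact settings.

```matlab
% Decision tree with Bayesian hyperparameter optimization and 5-fold CV.
rng(0);                                      % for reproducibility
mdl = fitctree(Tbal, 'Readmission', ...
    'OptimizeHyperparameters', {'MinLeafSize', 'MaxNumSplits', 'SplitCriterion'}, ...
    'HyperparameterOptimizationOptions', struct( ...
        'Optimizer', 'bayesopt', ...         % Bayesian optimization
        'KFold', 5, ...                      % 5-fold cross-validation
        'MaxObjectiveEvaluations', 30, ...   % illustrative budget
        'ShowPlots', false));
imp = predictorImportance(mdl);              % predictor importance values
view(mdl, 'Mode', 'graph');                  % inspect the if-else rules
```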

Performance evaluation

Metrics to evaluate the performance of machine learning classification models are introduced in this part. True Positive (TP) denotes the correctly predicted positive values; True Negative (TN) denotes the correctly predicted negative values; False Positive (FP) is the scenario where the actual class is negative but the predicted class is positive; and False Negative (FN) is the scenario where the actual class is positive but the predicted class is negative. From these definitions, different metrics can be defined for performance evaluation. For instance, Accuracy is a good measure for symmetric datasets (i.e., where the classes have sizes of the same order of magnitude). Precision and Recall are also commonly used, particularly for data with uneven class distributions. These values are usually first calculated for each class and then averaged across classes. Accuracy, Precision and Recall for a specific class are defined by Eq. (3) below and can be calculated from the confusion matrix.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN},\;Precision = \frac{TP}{TP + FP},\;Recall = \frac{TP}{TP + FN}$$

(3)

A receiver operating characteristic (ROC) curve is a graphical plot illustrating the classification ability of a binary classifier, in which the true positive rate is plotted against the false positive rate at various classification thresholds. Upon plotting the ROC, the area under the ROC curve (AUC) is an effective way to summarize the overall accuracy, taking values from 0 to 1. In general, an AUC of 0.5 suggests no discrimination, 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and more than 0.9 is considered outstanding18.
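A minimal sketch of computing these metrics follows, assuming yTrue and yPred hold 'Yes'/'No' labels and score holds the predicted probability of the Yes class; the variable names are illustrative.

```matlab
% Confusion-matrix metrics and ROC/AUC for the Yes (readmission) class.
C  = confusionmat(yTrue, yPred, 'Order', {'No', 'Yes'}); % rows: actual classes
TN = C(1,1); FP = C(1,2); FN = C(2,1); TP = C(2,2);
accuracy  = (TP + TN) / (TP + TN + FP + FN);
precision = TP / (TP + FP);
recall    = TP / (TP + FN);
[fpr, tpr, ~, AUC] = perfcurve(yTrue, score, 'Yes');     % ROC points and AUC
plot(fpr, tpr); xlabel('False positive rate'); ylabel('True positive rate');
```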

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
