Person-level administrative hospital data have become increasingly available for monitoring disease burden in conditions such as coronary heart disease (CHD); however, this data source records hospital admissions and is therefore susceptible to inflation of event counts if the patient is transferred during their clinical course.1–4 Each transfer leads to a new admission record, potentially overestimating the number of episodes of care relating to the same CHD event. We have previously shown that failure to account for transfers will overestimate CHD counts by 8–13%.1
Hospital datasets in many developed countries, including Australia, New Zealand, Canada, England, Wales and Scotland,5–9 contain many variables that can be used to identify transfers, although no approach is currently considered as “gold standard”. We and others have used admission and discharge dates and/or times together with deidentified patient identifiers, whereby any admission occurring within a certain interval after a previous discharge for an individual, usually one day, is considered part of the same episode of care.1,2,10–12 This approach is based on “interval” data, which relies on information that is deemed sensitive as it has the potential to identify patients. Alternatively, investigators have used “pathway” data without any deidentified patient identifiers,5,6 whereby an episode is identified when a patient is admitted from home or discharged from hospital. Although this approach uses less sensitive data, the reliability of the admission and discharge variables has been questioned.6
Variables selected for a study must be deemed “necessary”,13 and requests for sensitive variables, such as admission and discharge dates/times, will require written justification and approvals from data custodians and human research ethics committees. The impact of using different sets of variables in estimating CHD burden has not been tested, and we therefore aimed to test six algorithms, each using a different combination of variables, to identify and compare counts, rates and trends of CHD episodes. We also tested the algorithms on myocardial infarction (MI), a subset of CHD, to determine if the concordance between algorithms differed between the diagnoses.
Materials and Methods
Data Source and Study Population
For this cohort study, we used person-level linked administrative health data extracted from the Hospital Morbidity Data Collection (HMDC), one of the core data sets of the Western Australian (WA) Data Linkage System. The available data set included all hospital records for any person hospitalised with CHD (International Statistical Classification of Diseases and Related Health Problems, version 10 - Australian Modification [ICD-10-AM]: I20–I25) in WA from 2000 to 2016. Variables available included demographic information (age at admission, sex), admission and discharge dates and times, principal discharge diagnosis field, up to 20 secondary discharge diagnosis fields, hospital category (eg, public, private, rural), type of admission (elective/emergency), where the patient was admitted from, where the patient was discharged to, care type (purpose of treatment) and up to 11 procedure fields.
All variables, except procedure codes, were used without recoding of values. We identified coronary angiography, percutaneous coronary intervention (PCI) and coronary artery bypass graft (CABG) from the 11 procedure fields in the HMDC (Supplementary Table 1). CHD and MI (ICD-10-AM: I21) admissions were identified from the principal discharge diagnosis field of the HMDC data extract.
Identifying and Counting of CHD and MI Episodes Using Different Algorithms
We used six algorithms, each designed to identify a single CHD admission within an episode of care (Supplementary Table 2). The date and datetime algorithms processed data based on time intervals; the admission source and discharge destination algorithms processed data based on pathway variables; and the machine learning models, random forest (RF) and gradient boosting machine (GBM), used a combination of variables:
Date Algorithm
Each row of temporally ordered records belonging to the same patient was compared to the previous admission, and if the admission date was on the same day as, or within one day of, the previous discharge date, the record was assigned to the same episode as the previous record. Only the first CHD admission in the episode was counted.
Datetime Algorithm
This is like the date algorithm, but a record was assigned to the same episode of care only when the admission occurred within 6 hours of the previous discharge. We chose 6 hours as our previous study showed that most transfers are completed within this timeframe (unpublished data).
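The interval-based rule shared by the date and datetime algorithms can be sketched as follows. This is a minimal illustration rather than the study's Stata implementation: the record layout `(patient_id, admit, discharge, is_chd)`, the function name `count_episodes` and the toy records are all assumptions made for the example, and each admission is compared only with the immediately preceding record for the same patient.

```python
from datetime import datetime, timedelta


def count_episodes(records, max_gap, by_date=False):
    """Count episodes of care containing at least one CHD record.

    records: iterable of (patient_id, admit, discharge, is_chd) tuples
             (hypothetical layout, not the HMDC schema).
    max_gap: largest admission-minus-previous-discharge interval that
             still links a record to the previous one.
    by_date: if True, compare calendar dates only (date algorithm);
             otherwise compare full timestamps (datetime algorithm).
    """
    def key(ts):
        # Drop the time of day when only dates are available.
        return datetime(ts.year, ts.month, ts.day) if by_date else ts

    episodes_with_chd = 0
    prev_patient, prev_discharge, counted = None, None, False
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    for patient, admit, discharge, is_chd in ordered:
        same_episode = (
            patient == prev_patient
            and key(admit) - key(prev_discharge) <= max_gap
        )
        if not same_episode:
            counted = False  # a new episode starts with this record
        if is_chd and not counted:
            episodes_with_chd += 1  # only the first CHD record counts
            counted = True
        prev_patient, prev_discharge = patient, discharge
    return episodes_with_chd


# Invented toy records: patient 1 has a transfer 3 hours after discharge,
# then a next-day admission 22 hours after discharge, then a later stay.
RECORDS = [
    (1, datetime(2010, 1, 1, 8), datetime(2010, 1, 2, 10), True),
    (1, datetime(2010, 1, 2, 13), datetime(2010, 1, 5, 9), False),
    (1, datetime(2010, 1, 6, 7), datetime(2010, 1, 8, 11), True),
    (1, datetime(2010, 3, 1, 9), datetime(2010, 3, 3, 9), True),
    (2, datetime(2010, 5, 4, 22), datetime(2010, 5, 7, 12), True),
]

date_count = count_episodes(RECORDS, timedelta(days=1), by_date=True)
datetime_count = count_episodes(RECORDS, timedelta(hours=6))
print(date_count, datetime_count)
```

In this toy data the next-day admission is merged into the preceding episode by the one-day window but starts a new episode under the 6-hour window, so the datetime rule yields the larger count.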
Admission Source Algorithm
This algorithm identified records coded as “admit from home” with a principal discharge diagnosis of CHD. It was designed to identify one admission per episode; in this case, it identifies a record assumed to represent the first-in-episode admission.
Discharge Destination Algorithm
This algorithm identified any record coded as discharged to a non-hospital destination (ie, home, nursing home, died or discharged against medical advice) with a principal discharge diagnosis of CHD. It was designed to identify one admission per episode; in this case, it identifies a record assumed to represent the last-in-episode admission.
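The two pathway rules reduce to simple record filters that need no patient identifier or dates. The sketch below assumes hypothetical field names (`admitted_from`, `discharged_to`, `is_chd`) and code values; the actual HMDC coding differs and is not reproduced here.

```python
# Hypothetical destination codes treated as "non-hospital".
NON_HOSPITAL = {"home", "nursing_home", "died", "against_medical_advice"}


def admission_source_count(records):
    """Count CHD records coded as admitted from home (assumed first-in-episode)."""
    return sum(
        1 for r in records
        if r["is_chd"] and r["admitted_from"] == "home"
    )


def discharge_destination_count(records):
    """Count CHD records discharged to a non-hospital destination (assumed last-in-episode)."""
    return sum(
        1 for r in records
        if r["is_chd"] and r["discharged_to"] in NON_HOSPITAL
    )


# Invented toy records describing three episodes of care:
#   records 1-2: CHD at both hospitals of a transfer
#   records 3-4: first record coded as chest pain, CHD only after transfer
#   record 5:    single CHD admission ending in death
RECORDS = [
    {"admitted_from": "home", "discharged_to": "hospital", "is_chd": True},
    {"admitted_from": "hospital", "discharged_to": "home", "is_chd": True},
    {"admitted_from": "home", "discharged_to": "hospital", "is_chd": False},
    {"admitted_from": "hospital", "discharged_to": "home", "is_chd": True},
    {"admitted_from": "home", "discharged_to": "died", "is_chd": True},
]

source_count = admission_source_count(RECORDS)
destination_count = discharge_destination_count(RECORDS)
print(source_count, destination_count)
```

Here the admission source rule misses the episode whose first record carries a non-CHD principal diagnosis, which illustrates one way the pathway rules can undercount relative to an interval-based rule.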
Random Forest (RF) Algorithm
With this algorithm, a model was trained on records from 2000–2011 to flag CHD admissions identified by the date algorithm, using the parameter settings in Supplementary Box 1. Variables entered into the models are described in Supplementary Table 3. The Youden Index was used to identify the optimised threshold for the model prediction to achieve balanced sensitivity and specificity.14 This model was then tested on admission data from 2012 to 2016 using the cut-off point defined by the Youden Index.
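As an illustration of the threshold selection described above, the Youden Index J = sensitivity + specificity − 1 can be evaluated at each candidate cut-off of the predicted probabilities, keeping the cut-off with the largest J. The sketch below is a plain-Python illustration under assumed inputs: the labels stand in for the date-algorithm flag, and the scores are invented rather than output from the study's models.

```python
def youden_threshold(labels, scores):
    """Return (cutoff, J) maximising J = sensitivity + specificity - 1.

    labels: 1 if the record was flagged by the reference rule, else 0.
    scores: predicted probabilities from a classifier (hypothetical here).
    A record is predicted positive when its score >= cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cut)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cut)
        j = tp / positives + tn / negatives - 1
        if j > best_j:  # ties keep the lowest cutoff
            best_cut, best_j = cut, j
    return best_cut, best_j


# Invented toy data: nine records with imperfect separation.
LABELS = [0, 0, 0, 1, 0, 0, 1, 1, 1]
SCORES = [0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.70, 0.90]

cutoff, j_stat = youden_threshold(LABELS, SCORES)
print(cutoff, j_stat)
```

In the study design, a cut-off of this kind would be fixed on the training years and then applied unchanged to the test years.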
RF, based on decision trees, is an ensemble learning method predominantly used for classification tasks.15 Central to RF’s approach is the construction of multiple decision trees during the training phase, with predictions based on the aggregation (mode or mean) of outcomes from these trees. This ensemble strategy, by leveraging diverse subsets of the dataset for training individual trees and employing a random selection of features (variables) at each decision point, inherently reduces the risk of overfitting - a common challenge in analysing the multifaceted and high-dimensional hospital administrative data associated with CHD admissions.
Gradient Boosting Machine (GBM) Algorithm
This is similar to the RF algorithm but uses GBM instead. GBM parameter settings are described in Supplementary Box 1. GBM, also based on decision trees, distinguishes itself through its methodical construction of an ensemble of weak predictive models, primarily shallow decision trees, which sequentially build upon one another to rectify preceding errors.16 This iterative refinement, guided by the principles of gradient descent to minimize predictive loss, equips GBM with a remarkable capacity for handling complex datasets. Notably, GBM’s adeptness at modelling non-linear relationships and its dynamic adjustment to minimize bias and variance significantly enhance its applicability to the intricacies of administrative health data.
These six algorithms were repeated separately for MI admissions.
Characteristics of First and Last Records in an Episode Based on Different Algorithms
The locations that the patient was admitted from and discharged to were identified for the first and last records respectively for episodes of care identified by the date algorithm. The principal discharge diagnosis of records within episodes of care identified by the admission source and discharge destination algorithms were determined.
Statistical Analysis
Age-standardised rates of CHD and MI episodes were calculated for each algorithm using the direct method. The numerator was the number of episodes of CHD or MI per year, and the denominator was the WA population in each year, stratified by sex and 5-year age group. Rates were standardised using the 2011 Australian population as the standard. Age-adjusted trends were estimated for each algorithm from Poisson regression models, which included 5-year age group and calendar year (continuous). The interaction term “calendar year*algorithm” was used to determine statistical significance, with the date algorithm as the reference.
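The direct method weights each age-specific rate by the share of the standard population in that age group. The sketch below shows the arithmetic only: the three coarse age bands and all counts are invented for illustration (the study used 5-year age groups stratified by sex and the 2011 Australian standard population), and `direct_asr` is a hypothetical helper name.

```python
def direct_asr(events, population, standard, per=100_000):
    """Directly age-standardised rate per `per` person-years.

    events, population: episode counts and person-years by age group
                        in the study population.
    standard: standard population counts for the same age groups.
    """
    total_standard = sum(standard.values())
    weighted = sum(
        events[group] / population[group] * standard[group]
        for group in standard
    )
    return weighted / total_standard * per


# Invented toy numbers, not WA or Australian data.
EVENTS = {"35-54": 50, "55-74": 300, "75+": 400}
POPULATION = {"35-54": 100_000, "55-74": 60_000, "75+": 20_000}
STANDARD = {"35-54": 500_000, "55-74": 300_000, "75+": 200_000}

asr = direct_asr(EVENTS, POPULATION, STANDARD)
print(round(asr, 1))  # about 575 per 100,000 for these toy inputs
```

Because the standard weights are fixed, changes in the standardised rate over time reflect changes in the age-specific rates rather than ageing of the underlying population.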
Data handling, management and statistical analyses, including for counts, rates and trends, were performed using Stata MP 18.0.17 RF and GBM modelling were performed using the scikit-learn Python package for machine learning.18
Ethics Approval
Human Research Ethics Committee approval was obtained from the University of Western Australia (RA/4/1/1491) and the WA Department of Health (2014/55). A waiver of consent was granted by the WA Department of Health HREC as the research met criteria outlined in the National Statement on Ethical Conduct in Human Research. Analyses were conducted according to relevant local and national guidelines and regulations. This study complies with the Declaration of Helsinki.
Results
Using the date algorithm, the number of CHD episodes identified in WA increased from 11,733 in 2000 to 13,274 in 2016, while the corresponding counts for MI hospitalisation episodes were 2605 and 4480 respectively (Figure 1, Supplementary Table 4). Of the six algorithms, the datetime algorithm produced the highest counts of CHD and MI hospitalisation episodes, while the admission source algorithm resulted in the lowest counts. For the 2012–2016 study period, both the RF and GBM algorithms produced counts that were lower than the datetime algorithm but higher than the admission source algorithm.
Figure 1 Counts of CHD and MI episodes determined from six different algorithms using hospital administrative data.
Abbreviations: CHD, coronary heart disease; MI, myocardial infarction; RF, random forest; GBM, gradient boosting machine.
The age-standardised rates of CHD hospitalisation episodes decreased from 2086.2 in 2000 to 1463.1 per 100,000 person-years in 2016 (Figure 2, Supplementary Table 5) with the date algorithm. The corresponding rates for MI increased from 468.2 in 2000 to 498.1 per 100,000 person-years. As with counts, the rates were highest with the datetime algorithm and lowest with the admission source algorithm. Similarly, both the RF and GBM algorithms produced rates that were lower than the datetime algorithm but higher than the admission source algorithm. Rates of CHD and MI hospitalisation episodes based on the datetime algorithm were consistently 1–2% higher than the date algorithm, while CHD rates produced using the discharge destination algorithm were on average 4% lower than the date algorithm, and for MI rates, around 12% lower. Compared with the date algorithm, differences in rates of CHD and MI episodes increased over time using the admission source (−4.1% to −11.1% for CHD; −11.9% to −26.0% for MI), RF (−5.3% to −9.5%; −3.1% to −11.5%) and GBM (−1.5% to −6.3%; −3.2% to −12.1%) algorithms.
Figure 2 Age-standardised rates (ASR) and percentage difference compared to the date algorithm ASR: by CHD and MI episodes identified from hospital administrative data.
Abbreviations: CHD, coronary heart disease; MI, myocardial infarction; RF, random forest; GBM, gradient boosting machine.
Note: Black line for RF is hidden behind the red line for GBM.
Overall, age-adjusted trends in CHD and MI hospitalisation episode rates were similar in the date, datetime, admission source and discharge destination algorithms (Figure 3). The date and datetime algorithms had almost identical trends for both CHD and MI. However, the RF (−3.64%; 95% CI −4.38% to −3.09%) and GBM algorithms (−3.75%; 95% CI −4.29% to −3.21%) had a greater decrease in CHD rates than the date algorithm (−2.53%; 95% CI −3.06% to −2.00%). Similarly, RF (−4.08%; 95% CI −5.01% to −3.15%) and GBM (−4.17%; 95% CI −5.10% to −3.24%) algorithms had a greater decline in MI rates than the date algorithm (−1.88%; 95% CI −2.79% to −0.96%).
Figure 3 Age-adjusted average annual percentage change (95% CI) of CHD and MI episodes using six different algorithms from hospital administrative data. *Statistically different from the date algorithm; dotted vertical lines represent 95% confidence interval for the date algorithm.
Abbreviations: CHD, coronary heart disease; MI, myocardial infarction; RF, random forest; GBM, gradient boosting machine.
The proportion of first hospitalisations in a date algorithm episode coded as “admit from home” decreased from 93.9% in 2000 to 89.2% in 2016 for CHD hospitalisations, while the corresponding proportions for MI were 89.0% and 80.1%, respectively (Supplementary Table 6). The proportion of last hospitalisations in an episode coded as “discharge home” remained consistent at around 97% and 92.3% for CHD and MI, respectively, throughout the study period. Most of the records coded as “admit from home” and “discharge home” had principal diagnoses of CHD (94.2% and 93.2%, respectively), with most of the remaining hospitalisations recording conditions like renal dialysis, chest pain and heart failure (Supplementary Table 7). A similar pattern was observed for episodes of MI hospitalisation (Supplementary Table 8).
Discussion
In this study we compared six different algorithms to determine counts, rates and trends in CHD and MI hospitalisation episodes. We found that differences in age-standardised rates of CHD and MI episodes relative to the date algorithm increased over time with the admission source, RF and GBM algorithms. Furthermore, age-adjusted trends in CHD and MI episode rates using RF and GBM differed significantly from all other algorithms.
The date and datetime algorithms, which reflect patient movement within the hospital system without requiring coding of other variables, produced the highest counts and rates, with only small differences between these two algorithms. The one-day window used for the date algorithm can include one or more 6-hour intervals used for the datetime algorithm. It also potentially incorporates situations where there is greater than 24 hours between the discharge date on one record and the admission date on the subsequent record, thus likely capturing true readmissions within a transfer series. However, the small increase in rates with the datetime algorithm compared to the date algorithm demonstrates that this scenario is uncommon. The magnitude of difference is small and consistent over time, indicating that either algorithm is reasonable for ensuring accurate counts, rates and trends for both diagnoses. With data custodians becoming increasingly reluctant to provide sensitive admission and discharge times, our data demonstrate that the date algorithm is reasonably accurate for those measures, although time variables may be a necessity for other types of analyses.
In contrast, the admission source and discharge destination algorithms produced rates that were around 4% or more lower than the date algorithm, with the admission source algorithm showing a growing discrepancy over time. The poor performance of these two algorithms is likely due to two causes. Firstly, the data are collected for administrative purposes, where the coding of the admission source and discharge destination variables reflects current coding practices and billing requirements rather than epidemiological measurement of disease burden. For instance, the first admission in an episode can be coded as “admit from another hospital” when the patient arrives from a non-ward setting of a different hospital (eg, emergency department). Similarly, a patient can be coded as “discharge to another hospital” when they are transferred to a non-ward setting. In both cases, the non-ward stay is not captured in the HMDC. This is consistent with a government report, which found that 5% of patients diagnosed with acute coronary syndrome (ACS) who were coded as “transferred to another hospital” in the “discharge-to” field of the WA HMDC had no further hospitalisation records.6 Alternatively, this could be due to coding errors,19 but we are unable to determine the extent of this problem in our hospital dataset. Secondly, a small number of these patients could have experienced an in-hospital onset of CHD or MI,20,21 while others could have been transferred to a rehabilitation ward during the hospital episode. In this sense, the admission source and discharge destination algorithms, which presumptively identify the first and last admission in an episode, may have identified diagnoses other than CHD or MI. On the other hand, our date and datetime algorithms identify any episode with at least one record with a principal diagnosis of CHD or MI.
The differences with the RF and GBM algorithms increased over time, probably indicating that the models trained on data from 2000–2011 may not be generalisable to data from 2012–2016. “Noise” from the pathway and other variables may have also disrupted the ability of the RF and GBM models to identify CHD and MI episodes.
The strength of this study was its use of person-linked hospital data, which allowed systematic capture of all hospitalisations, both public and private, ensuring complete identification of the patient cohort and complete hospitalisations. There are several limitations to this study. Firstly, we were not able to differentiate between a transfer and readmission using the date and datetime algorithms. A small number of patients are readmitted within one to two days of discharge following an ACS,22 and our comparison against the date algorithm may not truly reflect this situation. Clinical chart reviews, the gold standard for identifying episodes, would be impractical for the large number of cases over the 17-year study period. In the absence of clinical chart reviews, the RF and GBM algorithms were trained on the date algorithm, which has been used previously to define transfers.1,2,10–12 We are unable to determine if the date algorithm is the best approach for training these two machine learning algorithms. However, even with training using the date algorithm, age-adjusted trends in CHD and MI episode rates using RF and GBM differed significantly from all other algorithms. Secondly, the findings may not be generalisable to other jurisdictions where different coding practices and hospital discharge and transfer protocols can result in different findings. However, with many studies identifying similar coding issues in hospital data for a range of medical conditions,6,20,23–25 these findings can likely be replicated in other jurisdictions. Thirdly, we only tested two machine learning algorithms and our results may not be generalisable to other machine learning models.
Conclusion
We contend that the date or datetime algorithms produced the most valid counts, rates and trends of CHD and MI hospitalisation episodes. Our results emphasise the importance of admission and discharge date/time variables and deidentified patient identifiers when performing this type of analysis. Importantly, RF and GBM algorithms may be less reliable than the other algorithms for identifying and counting CHD or MI episodes in WA hospital administrative data as they produced estimates that varied significantly from the other four algorithms, despite using relevant variables in the dataset as inputs. Other machine learning or deep learning algorithms remain to be tested for this purpose.
Data Sharing Statement
The datasets generated and/or analysed during the current study are not publicly available due to the terms of the ethics approval granted by the Western Australian Department of Health Human Research Ethics Committee (WADOH HREC) and data disclosure policies of the Data Providers. The datasets may be available from the corresponding author upon request and subject to approval from the WADOH HREC and relevant custodians.
Acknowledgments
We acknowledge the support of the Western Australian Data Linkage Branch and data custodians of the Hospital Morbidity Data Collection (HMDC) from the Western Australian Department of Health for providing the linked HMDC dataset.
Author Contributions
All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work.
Funding
This work was supported by the National Health and Medical Research Council (NHMRC) of Australia project grant 1078978. The grant agency does not impose restrictions on conduct of analyses or dissemination of findings. LN is funded by a National Heart Foundation Future Leader Fellowship.
Disclosure
The authors report no conflicts of interest in this work.
References
1. Lopez D, Nedkoff L, Knuiman M, et al. Exploring the effects of transfers and readmissions on trends in population counts of hospital admissions for coronary heart disease: a Western Australian data linkage study. BMJ Open. 2017;7(11):e019226. doi:10.1136/bmjopen-2017-019226
2. Lopez D, Katzenellenbogen JM, Sanfilippo FM, et al. Transfers to metropolitan hospitals and coronary angiography for rural aboriginal and non-aboriginal patients with acute ischaemic heart disease in Western Australia. BMC Cardiovasc Disord. 2014;14(1):58. doi:10.1186/1471-2261-14-58
3. Westfall JM. Double counting of acute myocardial infarction makes estimates of occurrence and case fatality inaccurate. Am J Cardiol. 2002;89(5):651–652. doi:10.1016/S0002-9149(02)02183-5
4. Chan WC, Wright C, Tobias M, Mann S, Jackson R. Explaining trends in coronary heart disease hospitalisations in New Zealand: trend for admissions and incidence can be in opposite directions. Heart. 2008;94(12):1589–1593. doi:10.1136/hrt.2008.142588
5. Australian Institute of Health and Welfare. Validating Algorithms for Incidence of Cardiovascular Disease: Technical Report, Catalogue Number CDK 22. Canberra: AIHW; 2022.
6. Australian Institute of Health and Welfare. Acute Coronary Syndrome: Validation of the Method Used to Monitor Incidence in Australia. A Working Paper Using Linked Hospitalisation and Deaths Data from Western Australia and New South Wales. CVD 68. Canberra: AIHW; 2014.
7. Health New Zealand Te Whatu Ora. National minimum dataset (hospital events) data dictionary; 2021. Available from: https://www.tewhatuora.govt.nz/for-health-professionals/data-and-statistics/nz-health-statistics/data-references/data-dictionaries. Accessed September 13, 2024.
8. Leightley D, Chui Z, Jones M, et al. Integrating electronic healthcare records of armed forces personnel: developing a framework for evaluating health outcomes in England, Scotland and Wales. Int J Med Inform. 2018;113:17–25. doi:10.1016/j.ijmedinf.2018.02.012
9. Canadian Institute for Health Information. DAD data elements, 2024–2025. CIHI; 2024. Available from: https://www.cihi.ca/sites/default/files/document/dad-data-elements-2024-2025-en.pdf. Accessed September 13, 2024.
10. Peng M, Li B, Southern DA, Eastwood CA, Quan H. Constructing episodes of inpatient care: how to define hospital transfer in hospital administrative health data? Med Care. 2017;55(1):74–78. doi:10.1097/MLR.0000000000000624
11. Zhang Q, Zhao D, Xie W, et al. Recent trends in hospitalization for acute myocardial infarction in Beijing: increasing overall burden and a transition from ST-segment elevation to non-ST-segment elevation myocardial infarction in a population-based study. Medicine. 2016;95(5):e2677. doi:10.1097/MD.0000000000002677
12. Fransoo R, Yogendran M, Olafson K, Ramsey C, McGowan KL, Garland A. Constructing episodes of inpatient care: data infrastructure for population-based research. BMC Med Res Methodol. 2012;12(1):133. doi:10.1186/1471-2288-12-133
13. Mitchell RJ, Cameron CM, McClure RJ, Williamson AM. Data linkage capabilities in Australia: practical issues identified by a population health research network ‘proof of concept project’. Aust NZ J Public Health. 2015;39(4):319–325. doi:10.1111/1753-6405.12310
14. Fluss R, Faraggi D, Reiser B. Estimation of the youden index and its associated cutoff point. Biom J. 2005;47(4):458–472. doi:10.1002/bimj.200410135
15. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi:10.1023/A:1010933404324
16. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–1232. doi:10.1214/aos/1013203451
17. Stata 18.0 MP [computer program]. College Station, TX 77845, USA; 2024.
18. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–2830.
19. Kahn JM, Iwashyna TJ. Accuracy of the discharge destination field in administrative data for identifying transfer to a long-term acute care hospital. BMC Res Notes. 2010;3:205. doi:10.1186/1756-0500-3-205
20. Maynard C, Lowy E, Rumsfeld J, et al. The prevalence and outcomes of in-hospital acute myocardial infarction in the department of veterans affairs health system. Arch Intern Med. 2006;166(13):1410–1416. doi:10.1001/archinte.166.13.1410
21. Erne P, Bertel O, Urban P, et al. Inpatient versus outpatient onsets of acute myocardial infarction. Eur J Intern Med. 2015;26(6):414–419. doi:10.1016/j.ejim.2015.05.011
22. Dreyer RP, Ranasinghe I, Wang Y, et al. Sex differences in the rate, timing, and principal diagnoses of 30-day readmissions in younger patients with acute myocardial infarction. Circulation. 2015;132(3):158–166. doi:10.1161/CIRCULATIONAHA.114.014776
23. Kim H, Grunditz JI, Meath THA, Quiñones AR, Ibrahim SA, McConnell KJ. Accuracy of hospital discharge codes in medicare claims for knee and hip replacement patients. Med Care. 2020;58(5):491–495. doi:10.1097/MLR.0000000000001290
24. Zhu Y, Stearns SC. Post-acute care locations: hospital discharge destination reports vs medicare claims. J Am Geriatr Soc. 2020;68(4):847–851. doi:10.1111/jgs.16308
25. Assareh H, Achat HM, Levesque JF. Accuracy of inter-hospital transfer information in Australian hospital administrative databases. Health Inform J. 2019;25(3):960–972. doi:10.1177/1460458217730866