Evaluating Algorithmic Bias in 30-Day Hospital Readmission Models: Retrospective Analysis


Introduction

Background of Algorithmic Bias

Predictive algorithms and machine learning tools are increasingly integrated into clinical decision-making and population health management. However, the increasing reliance on predictive algorithms brings a growing concern that they may exacerbate health disparities [-]. Evidence has shown that widely used algorithms that rely on past health care expenditures to identify high-risk patients have systematically underestimated the health care needs of Black patients []. In addition, studies have shown that the predictive performance of models predicting intensive care unit mortality, 30-day psychiatric readmission, and asthma exacerbation was worse in populations with lower socioeconomic status [,].

Given that algorithmic bias is a potentially pervasive issue, a few checklists have been published to qualitatively identify and understand potential biases in predictive models [,]. However, no agreed-upon quantitative method exists to routinely assess whether deployed models will produce biased results and exacerbate the health disparities faced by marginalized groups [,]. In this study, we define algorithmic bias as differences in the results or performance of predictive models that may lead to differential allocation or outcomes between subgroups [-]. We define disparity as a difference in the quality of health care (the degree to which health services increase the likelihood of desired health outcomes) received by a marginalized population that is not due to access-related factors, clinical needs, preferences, or appropriateness of intervention [,]. Fairness metrics, a set of mathematical expressions that formalize a particular notion of equality between groups (eg, equal false negative rates [FNRs]), have been proposed to measure and detect bias in machine learning models [,]. Although the machine learning community has shown that fairness metrics are a promising way to identify algorithmic bias, these metrics have been criticized as insufficient to reflect the heterogeneous and dynamic nature of health care [,]. Fairness metrics can also be misleading or mutually conflicting because of their narrow focus on equal rates between groups [,]. Furthermore, these metrics may be interpreted without context-specific judgment or domain knowledge and thus fail to connect predictions to interventions and to downstream health care disparities [,]. Most importantly, these measures are often not fully tested in real-world predictive tasks, and there is little evidence on how well their interpretation can guide intervention planning.

Background of Disparity in 30-Day Hospital Readmission

Predicting hospital readmission is widely studied in health care management and delivery [-]. Hospital readmissions, especially unplanned or avoidable ones, are associated with a high risk of in-hospital mortality and are costly and burdensome to the health care system [,]. Since 2012, the Hospital Readmissions Reduction Program of the Centers for Medicare & Medicaid Services (CMS) has imposed financial penalties on hospitals with excessive readmission rates []. CMS has consequently incentivized hospitals to stratify patients by risk so that they can target resource-intensive interventions, such as transitional care and improved discharge planning, to the patients at greatest risk [,,]. Many hospital readmission prediction models have been published; >350 models predicting 30-day readmission were identified in prior systematic reviews and our prior work [,,,,]. Disparities in hospital readmission rates are well studied. For example, past studies have shown that Black patients have higher readmission rates after adjustment for demographic and clinical characteristics [-]. Beyond racial disparities, patients receiving care at racial and ethnic minority-serving hospitals [,] or living in disadvantaged neighborhoods have higher readmission rates [-]. Research has also shown that disparities in health care use, including hospital readmission, relate not only to individuals' racial and ethnic identity but also to their communities []. Other research suggests that the social environment, whether the place of residence or the hospital where one receives care, may explain a meaningful portion of health disparities [,].

Objectives

Despite the abundance of models and the known disparities in hospital readmissions, few studies have evaluated how algorithmic bias, or the disparate performance of these predictive models, may affect patient outcomes and downstream health disparities once the models are deployed. Evidence is especially lacking on whether model-guided allocation of interventions may reduce or aggravate existing health disparities between populations. To address this gap, we aimed to (1) implement a selection of fairness metrics to evaluate whether applying common 30-day readmission prediction models may lead to bias between racial and income groups and (2) interpret the selected fairness metrics and assess their usefulness for facilitating equitable allocation of interventions. In this paper, we take the perspective of a health system or payer that uses an established, validated algorithm to identify patients at high risk of unplanned readmission so that targeted interventions can be planned for them. Thus, our main concern regarding algorithmic bias is the unequal allocation of intervention resources and the resulting unequal health outcomes. Specifically, assuming that the deployed model is validated and has acceptable overall predictive performance, we are concerned about risk scores that systematically underestimate or overestimate the needs of a particular group.


Methods

Study Population and Data

This retrospective study included 1.9 million adult inpatient discharges in Maryland and 8.7 million inpatient discharges in Florida from 2016 to 2019. The analysis used the State Inpatient Databases (SIDs), which are maintained by the United States Agency for Healthcare Research and Quality as part of the Healthcare Cost and Utilization Project (HCUP). The SIDs include longitudinal hospital care data in the United States for all insurance payers (eg, Medicare, Medicaid, private insurance, and the uninsured) and all patient ages []. The SIDs capture >97% of all eligible hospital discharges in each state []. Maryland and Florida were selected because of their different population sizes, population compositions (eg, racial and ethnic distribution and urban to rural ratio), and health care environments (Maryland's all-payer model vs Florida's decision not to adopt Medicaid expansion) [,]. In addition, Maryland and Florida are among a small subset of states whose SIDs contain a "VisitLink" variable that tracks unique patients within the state and across years from 2016 to 2019, allowing longitudinal analysis of readmissions across hospitals and calendar years []. The SIDs were also linked to the American Hospital Association's Annual Survey Database to obtain hospital-level information. The study population excluded admissions in which patients were aged <18 years, died in the hospital, were discharged against medical advice, or had insufficient information to calculate readmission (eg, a missing VisitLink variable or length of stay).
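The exclusion criteria can be expressed as a simple filter over discharge records. This is a minimal sketch: the `Discharge` record and its field names are hypothetical stand-ins, not the actual SID variable names or codes, which should be taken from the HCUP documentation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Discharge:
    # Hypothetical fields; real SID variables and codes differ.
    age: int
    died_in_hospital: bool
    left_against_medical_advice: bool
    visit_link: Optional[str]        # patient linkage ID; None if missing
    length_of_stay: Optional[int]    # days; None if missing


def eligible(d: Discharge) -> bool:
    """Apply the study's exclusion criteria to a single discharge."""
    if d.age < 18:
        return False
    if d.died_in_hospital or d.left_against_medical_advice:
        return False
    # Readmission cannot be computed without patient linkage or length of stay.
    if d.visit_link is None or d.length_of_stay is None:
        return False
    return True


def build_cohort(discharges: List[Discharge]) -> List[Discharge]:
    return [d for d in discharges if eligible(d)]
```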

Study Outcome

The calculation of 30-day readmission followed the HCUP definition []. Every inpatient admission was counted as an index admission. The all-cause 30-day readmission rate was defined as the number of admissions followed by at least 1 subsequent hospital admission within 30 days, divided by the total number of admissions during the study period. Unplanned all-cause 30-day hospital readmissions were identified using the methodology developed by CMS [,]. The study cohort selection process and the determination of unplanned readmissions are outlined in Figure 1.
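The readmission flag can be sketched as follows, assuming each admission carries the VisitLink patient identifier, an admission day measured on a common per-patient timeline, and a length of stay. The `unplanned` field is a hypothetical precomputed input standing in for the CMS planned-readmission algorithm, which is not reproduced here, and the 0 to 30 day window measured from discharge is our reading of the definition rather than HCUP's exact rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Admission:
    admission_id: int
    visit_link: str        # patient identifier across hospitals and years
    admit_day: int         # day on a per-patient timeline (assumed)
    length_of_stay: int    # days
    unplanned: bool        # hypothetical flag from a planned-readmission algorithm


def flag_30day_readmissions(admissions, window=30):
    """Map each index admission to True if an unplanned readmission follows within `window` days."""
    by_patient = defaultdict(list)
    for a in admissions:
        by_patient[a.visit_link].append(a)
    flags = {}
    for stays in by_patient.values():
        stays.sort(key=lambda a: a.admit_day)
        for i, index in enumerate(stays):
            # Every admission is treated as an index admission.
            discharge_day = index.admit_day + index.length_of_stay
            flags[index.admission_id] = any(
                0 <= later.admit_day - discharge_day <= window and later.unplanned
                for later in stays[i + 1:]
            )
    return flags


def readmission_rate(flags):
    """Share of index admissions followed by a qualifying readmission."""
    return sum(flags.values()) / len(flags)
```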

Figure 1. Determination of the study cohort and unplanned all-cause 30-day readmission.

Predictive Models

The LACE index [], the HOSPITAL score [], and the CMS hospital-wide all-cause readmission measure [] were included in the analysis because, based on our prior review, they have been externally validated and are commonly used in practice []. The LACE index and the HOSPITAL score were designed for hospital staff to identify patients at high risk of readmission for targeted intervention efforts; both have been converted to point-based scoring systems and extensively validated. Thus, the 2 models were applied without retraining to obtain predicted risk scores, mimicking how the models are used in practice. Two of the HOSPITAL score predictors, low hemoglobin and low sodium levels at discharge, were not available in the SIDs and were therefore excluded, and the total risk score was adjusted accordingly. Details of model variables and how the 2 models were implemented are reported in and . The CMS measure was evaluated using 2 approaches: applied as-is with the existing coefficients, and retrained to generate new coefficients using 50% of the sample. To ensure comparability between the CMS measure and the other models, the predicted patient-level risk was used without the hospital-level effect from the original measure, and the CMS measure was limited to the "medicine cohort" []. Following the CMS measure's specification report, the patient population was divided into 5 mutually exclusive cohorts: surgery or gynecology, cardiorespiratory, cardiovascular, neurology, and medicine. The cohorts were determined using the Agency for Healthcare Research and Quality Clinical Classifications Software categories []. The medicine cohort was randomly split 50-50 into a retraining data set and a testing data set. The CMS measure includes age and >100 variables representing a wide range of condition categories. The measure was trained on the retraining data set with 5-fold cross-validation and then applied to the testing data set using the new coefficients to obtain the performance and bias metrics for the CMS retrained model.
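Because LACE and the modified HOSPITAL score are point-based, applying them without retraining amounts to summing item points. The sketch below reflects the published point values as we understand them (verify against the original publications before use), drops the hemoglobin and sodium items as described above, and treats a patient as high risk when the score is at or above the thresholds the study reports (10 for LACE and 5 for the modified HOSPITAL score). The function and argument names are hypothetical.

```python
# Point values follow the published LACE index and HOSPITAL score as we
# understand them; verify against the original publications before use.
LACE_HIGH_RISK = 10           # threshold stated in the study
MOD_HOSPITAL_HIGH_RISK = 5    # adjusted threshold stated in the study


def lace_score(los_days: int, emergent: bool, charlson: int, ed_visits_6mo: int) -> int:
    """LACE: Length of stay, Acuity of admission, Charlson index, ED visits in prior 6 months."""
    if los_days < 1:
        l_pts = 0
    elif los_days <= 3:
        l_pts = los_days              # 1, 2, or 3 points
    elif los_days <= 6:
        l_pts = 4
    elif los_days <= 13:
        l_pts = 5
    else:
        l_pts = 7
    a_pts = 3 if emergent else 0
    c_pts = charlson if charlson <= 3 else 5
    e_pts = min(ed_visits_6mo, 4)
    return l_pts + a_pts + c_pts + e_pts


def modified_hospital_score(oncology_discharge: bool, any_procedure: bool,
                            urgent_admission: bool, admissions_prior_12mo: int,
                            los_days: int) -> int:
    """HOSPITAL score without the hemoglobin (H) and sodium (S) items."""
    score = 2 if oncology_discharge else 0        # O: discharge from oncology
    score += 1 if any_procedure else 0            # P: procedure during the stay
    score += 1 if urgent_admission else 0         # IT: urgent or emergent index admission
    if admissions_prior_12mo > 5:                 # A: admissions in prior 12 months
        score += 5
    elif admissions_prior_12mo >= 2:
        score += 2
    score += 2 if los_days >= 5 else 0            # L: length of stay of 5 days or more
    return score


def is_high_risk(score: int, threshold: int) -> bool:
    """Our reading of the thresholds: at or above the cutoff means high risk."""
    return score >= threshold
```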
Separately, the CMS measure with the published coefficients was applied to the full medicine cohort to obtain performance and bias metrics for the CMS as-is model. The existing model thresholds were used to classify patients into the positive (high-risk) class: 10 points for LACE and high risk (5 points in the adjusted scoring) for the modified HOSPITAL score. For the 2 CMS models, the optimal threshold identified by the Youden index [] on the receiver operating characteristic curve was used.
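The Youden index selects the threshold that maximizes J = sensitivity + specificity - 1 along the receiver operating characteristic curve. A minimal sketch, assuming a prediction is positive when its score is at or above the threshold; the quadratic scan is for clarity only, and an analysis at this scale would sort the scores once or use a library ROC routine.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing sensitivity + specificity - 1.

    A case is predicted positive when its score >= threshold; ties in J keep
    the lowest threshold encountered.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        sensitivity = tp / n_pos
        specificity = 1 - fp / n_neg
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```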

Measures

We measured predictive performance and bias between Black and White subpopulations and between low-income and other-income subpopulations. Race is a standardized HCUP variable that captures both race and ethnicity. The low-income group was defined as the lowest (fourth) quartile of median household income, and the remaining 3 quartiles were grouped as other income. The median household income quartiles are provided in the HCUP SIDs and are based on the median household income of the patient's zip code. The predictive performance of each model was assessed for the overall population and for each subpopulation using the area under the receiver operating characteristic curve (AUC), the Brier statistic, and the Hosmer-Lemeshow goodness-of-fit test. Bias was quantified as the between-group difference in false positive rate (FPR; eg, the FPR of Black patients minus that of White patients), FNR, and 0-1 loss, and by the generalized entropy index (GEI). FNR was calculated as the number of false negatives (patients predicted as low risk who had an unplanned 30-day readmission) divided by the total number of positives. Similarly, FPR was calculated as the number of false positives divided by the total number of negatives. The 0-1 loss is the normalized total error rate, calculated as the proportion of incorrect predictions. FPR, FNR, and 0-1 loss differences capture unequal error rates. The GEI, originally a measure of income inequality, has been proposed to measure algorithmic fairness between groups; it ranges from 0 to infinity, with lower scores representing greater equity [].
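A minimal sketch of these bias measures, assuming binary outcomes (1 = unplanned readmission) and binary predictions (1 = high risk). Differences are taken as target group minus reference group, matching the Difference (B-W) and (L-O) columns of Table 4. The between-group GEI follows the benefit-based definition from the fairness literature (benefit b_i = yhat_i - y_i + 1, with each individual's benefit replaced by the mean benefit of their group); it is written from that definition rather than copied from AI Fairness 360, so implementation details may differ. Groups with no positives or no negatives would need guarding against division by zero.

```python
def error_rates(y_true, y_pred):
    """FNR, FPR, and 0-1 loss for binary labels (1 = readmitted / predicted high risk)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return {
        "fnr": fn / (fn + tp),                    # missed readmissions / all readmissions
        "fpr": fp / (fp + tn),                    # false alarms / all non-readmissions
        "zero_one_loss": (fn + fp) / len(y_true), # share of wrong predictions
    }


def group_differences(y_true, y_pred, group, target, reference):
    """Metric(target) - metric(reference), eg, Black - White or low - other income."""
    def rates_for(g):
        idx = [i for i, x in enumerate(group) if x == g]
        return error_rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    t, r = rates_for(target), rates_for(reference)
    return {k: t[k] - r[k] for k in t}


def between_group_gei(y_true, y_pred, group, alpha=2.0):
    """Between-group generalized entropy index (alpha other than 0 or 1).

    Each individual's benefit b_i = yhat_i - y_i + 1 is replaced by the mean
    benefit of that individual's group before applying the GE(alpha) formula.
    """
    if alpha in (0, 1):
        raise ValueError("this sketch only handles alpha other than 0 or 1")
    benefit = [p - y + 1 for y, p in zip(y_true, y_pred)]
    totals, counts = {}, {}
    for g, b in zip(group, benefit):
        totals[g] = totals.get(g, 0.0) + b
        counts[g] = counts.get(g, 0) + 1
    smoothed = [totals[g] / counts[g] for g in group]
    n = len(smoothed)
    mu = sum(smoothed) / n
    return sum((b / mu) ** alpha - 1 for b in smoothed) / (n * alpha * (alpha - 1))
```

A negative FNR difference (as for every model in Table 4) means the reference group, here White or other-income patients, has more missed readmissions.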

Ethical Considerations

This study was not human subjects research, as determined by the Johns Hopkins School of Public Health Institutional Review Board. No compensation was provided.

Statistical Analysis

Primary analyses were conducted using R (version 4.0.2; R Foundation for Statistical Computing). The aggregate condition categories required to determine unplanned readmissions and calculate the CMS measures were derived in SAS software (version 9.4; SAS Institute) using the programs provided by the agencies [,]. GEI measures were calculated using the AI Fairness 360 package published by IBM Corp []. The unit of analysis was the admission. FNR and FPR results were first stratified by individual hospital and visualized in a scatter plot. The racial bias results were then stratified by hospital population composition (eg, percentage of Black patients), which has been shown to be associated with a hospital's overall outcomes []. Hospitals were binned by the percentage of Black patients they served (eg, >10% and >20%), and the racial bias measures and their 95% CIs were calculated for each bin. For FNR, FPR, and 0-1 loss differences, the distribution in each of the 2 groups was calculated, and the significance of the difference was assessed using a 2-tailed Student t test under the null hypothesis that the group difference was 0. For all statistical tests, α was set at .05.
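The text does not spell out how the per-bin 95% CIs or the t tests were computed, so the following is a sketch under stated assumptions. Within each bin of hospitals whose share of Black patients exceeds a threshold, it compares per-admission false negative indicators (among readmitted patients) between Black and White patients with a pooled-variance Student t test, using a normal approximation for the P value and CI that is reasonable only at the large sample sizes of this study. The row keys (`hospital`, `race`, `y`, `pred`) are hypothetical. For small samples, `scipy.stats.ttest_ind(a, b, equal_var=True)` gives the exact Student t P value.

```python
import math
from collections import defaultdict


def student_t_diff(a, b):
    """Pooled-variance two-sample t test of mean(a) - mean(b).

    Returns (difference, 95% CI, two-sided P) using a normal approximation to
    the t distribution; adequate only for large samples.
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    diff = ma - mb
    p = math.erfc(abs(diff / se) / math.sqrt(2))
    return diff, (diff - 1.96 * se, diff + 1.96 * se), p


def black_share_by_hospital(rows):
    total, black = defaultdict(int), defaultdict(int)
    for r in rows:
        total[r["hospital"]] += 1
        black[r["hospital"]] += 1 if r["race"] == "Black" else 0
    return {h: black[h] / total[h] for h in total}


def fnr_difference_by_bin(rows, thresholds=(0.10, 0.20)):
    """Black - White FNR difference in hospitals above each (nested) threshold."""
    share = black_share_by_hospital(rows)
    results = {}
    for th in thresholds:
        positives = [r for r in rows if share[r["hospital"]] > th and r["y"] == 1]
        # 1 = false negative (readmitted but predicted low risk)
        miss_black = [1 - r["pred"] for r in positives if r["race"] == "Black"]
        miss_white = [1 - r["pred"] for r in positives if r["race"] == "White"]
        results[th] = student_t_diff(miss_black, miss_white)
    return results
```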


Results

Demographic and Clinical Characteristics

As presented in Table 1, among the 1,857,658 Maryland inpatient discharges from 2016 to 2019, 55.41% (n=1,029,292) were for White patients and 33.71% (n=626,280) were for Black patients, whereas in Florida, 64.49% (5,632,318/8,733,002) of inpatient discharges were for White patients and 16.59% (1,448,620/8,733,002) were for Black patients.

White patients in both states were older, more likely to have private insurance, and less likely to reside in large metropolitan areas or to be treated in major teaching hospitals, large hospitals, or urban hospitals. Compared with White patients, Black patients in Maryland had a longer length of inpatient stay, more inpatient procedures, fewer inpatient diagnoses, higher inpatient charges, and more comorbidities and were more likely to be discharged to home or self-care. In contrast, Black patients in Florida had fewer inpatient diagnoses, fewer procedures, and lower total charges; they also had longer inpatient stays and more comorbidities and were more likely to be discharged to home or self-care. In both Maryland and Florida, patients in the lowest income quartile were younger, had a longer length of inpatient stay, had more comorbidities, and had fewer procedures than the other-income group; they also had higher inpatient charges in Maryland but slightly lower charges in Florida. The low-income group was less likely to reside in large metropolitan areas but more likely to be treated in major teaching hospitals. Except for those noted in footnote c of Table 1, all characteristics differed significantly between racial groups and between income groups (all P values <.001).

Table 1. Demographic characteristics by race and by income in Maryland (n=1,857,658) and Florida (n=8,733,002).
Characteristic and state | Race: White | Race: Black | Race: Other | Income: Low income | Income: Other income
Discharges, n (%)
MD^a | 1,029,292 (55.41) | 626,280 (33.71) | 187,935 (10.12) | 627,013 (33.75) | 1,225,820 (65.99)
FL^b | 5,632,318 (64.49) | 1,448,620 (16.59) | 1,598,392 (18.3) | 2,600,326 (29.78) | 6,028,720 (69.03)
Age (y), mean (SD)
MD | 61.4 (19.9) | 54.0 (19.4) | 46.8 (20.7) | 56.3 (19.8) | 58.0 (20.8)
FL | 63.1 (19.4) | 51.1 (19.8) | 56.1 (21.6) | 58.0 (20.4) | 60.6 (20.5)
Sex (female; yes), n (%)^c
MD | 586,641 (56.99) | 377,063 (60.21) | 128,138 (68.18) | 364,718 (58.17) | 733,135 (59.81)
FL | 3,050,611 (54.16) | 856,343 (59.11) | 942,996 (59) | 1,454,239 (55.93) | 3,372,155 (55.93)
Payer, n (%)
  Medicare
MD | 550,364 (53.47) | 262,512 (41.92) | 43,461 (23.13) | 293,009 (46.73) | 566,503 (46.21)
FL | 3,386,956 (60.13) | 589,574 (40.7) | 715,483 (44.76) | 1,358,285 (52.24) | 3,301,163 (54.76)
  Medicaid
MD | 142,138 (13.81) | 192,443 (30.73) | 71,450 (38.02) | 192,624 (30.72) | 215,530 (17.58)
FL | 504,531 (8.96) | 373,896 (25.81) | 315,306 (19.73) | 505,805 (19.45) | 682,879 (11.33)
  Private
MD | 306,929 (29.82) | 148,781 (23.76) | 59,381 (31.6) | 120,346 (19.19) | 398,459 (32.51)
FL | 1,183,304 (21.01) | 287,528 (19.85) | 398,712 (24.94) | 422,946 (16.27) | 1,439,482 (23.88)
Residence (large metropolitan), n (%)
MD | 839,688 (81.58) | 586,868 (93.72) | 178,015 (94.71) | 460,587 (73.46) | 1,154,341 (94.17)
FL | 2,939,039 (52.18) | 1,027,033 (70.9) | 1,319,976 (82.58) | 1,530,136 (58.84) | 3,738,731 (62.02)
Length of stay, mean (SD)
MD | 4.73 (6.10) | 5.22 (7.20) | 4.19 (6.22) | 5.17 (6.89) | 4.68 (6.32)
FL | 4.96 (6.37) | 5.24 (7.98) | 4.76 (6.76) | 5.16 (7.26) | 4.89 (6.51)
Total charges, mean (SD)
MD | 17,000 (22,700) | 17,800 (25,800) | 14,600 (23,000) | 18,000 (25,100) | 16,500 (23,200)
FL | 68,500 (88,800) | 62,800 (95,900) | 68,900 (100,000) | 67,300 (93,600) | 67,800 (92,000)
Discharge type (home or self-care), n (%)^c
MD | 780,238 (75.8) | 490,051 (78.25) | 165,538 (88.08) | 486,882 (77.65) | 956,510 (78.03)
FL | 4,408,065 (78.26) | 1,240,314 (85.62) | 1,367,677 (85.57) | 2,102,035 (80.84) | 4,873,457 (80.84)
CCI^d score, mean (SD)
MD | 0.498 (1.04) | 0.594 (1.18) | 0.359 (0.962) | 0.573 (1.13) | 0.486 (1.06)
FL | 0.516 (1.04) | 0.616 (1.23) | 0.529 (1.13) | 0.575 (1.13) | 0.516 (1.07)
Number of procedures, mean (SD)
MD | 1.68 (2.44) | 1.71 (2.60) | 2.00 (2.38) | 1.69 (2.59) | 1.74 (2.44)
FL | 1.57 (2.34) | 1.50 (2.34) | 1.57 (2.31) | 1.49 (2.33) | 1.59 (2.33)
Number of diagnoses, mean (SD)
MD | 15.5 (8.20) | 14.6 (7.84) | 11.1 (7.07) | 15.3 (8.04) | 14.4 (8.08)
FL | 13.3 (7.33) | 11.7 (7.19) | 10.8 (6.91) | 12.6 (7.33) | 12.5 (7.29)
Hospital type (major teaching; yes), n (%)
MD | 181,493 (17.63) | 120,649 (19.26) | 27,808 (14.80) | 172,092 (27.45) | 158,693 (12.95)
FL | 576,819 (10.24) | 231,379 (15.97) | 242,754 (15.19) | 354,901 (13.65) | 691,635 (11.47)
Hospital beds (≥200 beds), n (%)
MD | 741,956 (72.08) | 460,667 (73.56) | 150,745 (80.21) | 482,604 (76.97) | 878,817 (71.69)
FL | 3,546,967 (62.98) | 981,735 (67.77) | 967,036 (60.5) | 1,735,704 (66.75) | 3,730,066 (61.87)
Urban hospital (yes), n (%)
MD | 943,928 (91.71) | 602,779 (96.25) | 179,091 (95.29) | 548,230 (87.44) | 1,187,298 (96.86)
FL | 2,859,038 (50.76) | 858,993 (59.3) | 903,705 (56.54) | 1,340,735 (51.56) | 3,256,784 (54.02)

aMD: Maryland.

bFL: Florida.

cP values were computed between racial groups and between income groups, respectively. All P values are <.001 except the following: P value for female between income groups=.80 and for discharge type between income groups=.99.

dCCI: Charlson Comorbidity Index.

Predictive Performance

The observed 30-day unplanned readmission rates in Maryland were higher in the Black and low-income patient groups (11.13% for White patients, 12.77% for Black patients, 10.59% for other-income patients, and 12.73% for low-income patients; Table 2).

Table 2. Observed and predicted 30-day unplanned readmission rates by model and state.
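Group | Observed (%): MD^e (n=1,857,658) | Observed (%): FL^f (n=8,733,002) | Predicted^a (%), LACE^b: MD (n=1,857,658) | LACE: FL (n=8,733,002) | HOSPITAL^c: MD (n=1,857,658) | HOSPITAL: FL (n=8,733,002) | CMS^d as-is: MD (n=714,917) | CMS as-is: FL (n=2,636,671) | CMS retrained: MD (n=357,458) | CMS retrained: FL (n=1,318,335)
Total | 11.31 | 14.34 | 14.48 | 15.88 | 12 | 14.61 | 10.48 | 10.26 | 14.71 | 16.5
Race
White | 11.13 | 13.94 | 12.92 | 14.81 | 11.62 | 14.07 | 10.27 | 10.11 | 14.14 | 15.97
Black | 12.77 | 17.14 | 18.62 | 21.33 | 14.12 | 18.26 | 10.82 | 10.85 | 15.84 | 18.93
Income
Other | 10.59 | 13.6 | 12.88 | 14.62 | 10.84 | 13.66 | 10.37 | 10.19 | 14.3 | 16.17
Low | 12.73 | 16.03 | 17.6 | 18.72 | 14.29 | 16.77 | 10.67 | 10.43 | 15.46 | 17.25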

aPredicted: the predicted readmission rates for LACE and HOSPITAL were calculated as the percentage of patients at high risk of unplanned readmission based on the model output for the group; and the predicted readmission rates for the two CMS models were the predicted probability of being at high risk of unplanned readmission for the group.

bLACE: The LACE Index for readmission risk.

cHOSPITAL: The modified HOSPITAL score for readmission risk.

dCMS: Centers for Medicare & Medicaid Services (readmission measure).

eMD: Maryland.

fFL: Florida.

A fair and well-calibrated predictive model would be expected to overpredict or underpredict readmission rates to a similar degree across racial and income groups. Compared with the observed readmission rates, the LACE index overestimated readmission rates in all subpopulations, and the overestimation was more pronounced in the Black and low-income populations. The readmission rates estimated by the modified HOSPITAL score were closest to the observed rates. The CMS as-is model underestimated readmission rates across subpopulations, with similar estimated rates between subpopulations, whereas the retrained CMS model overestimated rates in all subpopulations to a similar degree. In Florida, the observed 30-day unplanned readmission rates were higher than those in Maryland in all populations. As in Maryland, Florida's observed readmission rates were higher in the Black and low-income groups (13.94% for White patients, 17.14% for Black patients, 13.6% for other-income patients, and 16.03% for low-income patients), and the models showed similar overestimation and underestimation patterns (Table 2).
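The calibration comparison can be made concrete with the Maryland values from Table 2. The sketch below computes the predicted minus observed rate (percentage points) per group and the difference in that gap between groups. It is limited to LACE and the modified HOSPITAL score because their predicted rates cover the same Maryland population as the observed rates (n=1,857,658), whereas the CMS columns cover only the medicine cohort; the dictionary and function names are illustrative.

```python
# Maryland rates (%) transcribed from Table 2.
OBSERVED = {"White": 11.13, "Black": 12.77, "Other income": 10.59, "Low income": 12.73}
PREDICTED = {
    "LACE": {"White": 12.92, "Black": 18.62, "Other income": 12.88, "Low income": 17.60},
    "HOSPITAL": {"White": 11.62, "Black": 14.12, "Other income": 10.84, "Low income": 14.29},
}


def calibration_gaps(predicted, observed):
    """Predicted minus observed rate, in percentage points, for each group."""
    return {g: round(predicted[g] - observed[g], 2) for g in observed}


def gap_disparity(gaps, group, reference):
    """How much larger the over-prediction is for `group` than for `reference`."""
    return round(gaps[group] - gaps[reference], 2)


for model, preds in PREDICTED.items():
    gaps = calibration_gaps(preds, OBSERVED)
    print(model, gaps,
          "Black-White:", gap_disparity(gaps, "Black", "White"),
          "Low-Other:", gap_disparity(gaps, "Low income", "Other income"))
```

For LACE, the over-prediction is about 5.85 percentage points for Black patients versus 1.79 for White patients, a gap of roughly 4 points, while for the modified HOSPITAL score the corresponding gap is under 1 point.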

As presented in Table 3, in Maryland, the retrained CMS model had better predictive performance (AUC 0.74 in all subpopulations) than the other 3 models, which achieved only moderate overall performance (AUC between 0.65 and 0.68). The modified HOSPITAL score had the best calibration (Brier score 0.16-0.19 across subpopulations), whereas the CMS as-is model performed poorly on the Brier score. In both states, calibration was better in the White population than in the Black population and in the other-income population than in the low-income population, and the AUC was higher or similar in the Black population compared with the White population. In Florida, the CMS retrained model also outperformed the other models in all subpopulations (AUC 0.68-0.72), and the modified HOSPITAL score had the best calibration (Brier score 0.19-0.21). The Hosmer-Lemeshow test was significant (P<.001) for all models across subpopulations (Table 3), which formally indicates lack of fit, although with samples of this size the test detects even very small deviations.
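The performance measures in Table 3 can be sketched as follows: AUC as the probability that a randomly chosen readmitted patient receives a higher score than a randomly chosen non-readmitted patient (ties counted as one-half), and the Brier statistic as the mean squared difference between prediction and outcome. The pairwise loop is quadratic and meant only for illustration; at the scale of this study a rank-based computation or a library routine would be used. Subgroup values come from applying the same functions to each subgroup's rows.

```python
def auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative), ties counted as 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier(y_true, probs):
    """Mean squared difference between predicted value and observed 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)
```

When `probs` are 0/1 class predictions rather than probabilities, the Brier statistic reduces to the 0-1 loss.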

Table 3. Predictive performances of each 30-day readmission model in Maryland and Florida.
Group and performance measure | LACE^a: MD^d | LACE: FL^e | HOSPITAL^b: MD | HOSPITAL: FL | CMS^c as-is: MD | CMS as-is: FL | CMS retrained: MD | CMS retrained: FL
All
AUC^f | 0.68 | 0.68 | 0.65 | 0.66 | 0.66 | 0.65 | 0.74 | 0.69
Brier statistic | 0.19 | 0.21 | 0.17 | 0.20 | 0.44 | 0.37 | 0.32 | 0.32
Hosmer-Lemeshow, P value | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001
Race: White
AUC | 0.68 | 0.67 | 0.64 | 0.65 | 0.65 | 0.63 | 0.74 | 0.68
Brier statistic | 0.18 | 0.21 | 0.17 | 0.20 | 0.41 | 0.36 | 0.31 | 0.32
Hosmer-Lemeshow, P value | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001
Race: Black
AUC | 0.68 | 0.68 | 0.67 | 0.69 | 0.67 | 0.68 | 0.74 | 0.72
Brier statistic | 0.22 | 0.25 | 0.19 | 0.21 | 0.49 | 0.43 | 0.33 | 0.36
Hosmer-Lemeshow, P value | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001
Income: Other
AUC | 0.69 | 0.68 | 0.65 | 0.66 | 0.65 | 0.64 | 0.74 | 0.69
Brier statistic | 0.17 | 0.20 | 0.16 | 0.19 | 0.43 | 0.36 | 0.31 | 0.32
Hosmer-Lemeshow, P value | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001
Income: Low
AUC | 0.68 | 0.67 | 0.66 | 0.67 | 0.67 | 0.66 | 0.74 | 0.70
Brier statistic | 0.21 | 0.23 | 0.19 | 0.21 | 0.47 | 0.39 | 0.33 | 0.34
Hosmer-Lemeshow, P value | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001 | <.001

aLACE: The LACE Index for readmission risk.

bHOSPITAL: The modified HOSPITAL score for readmission risk.

cCMS: Centers for Medicare & Medicaid Services (readmission measure).

dMD: Maryland.

eFL: Florida.

fAUC: area under the curve.

Bias Measures

Differences in misclassification rates (ie, FPR and FNR differences) indicate relative between-group bias, whereas 0-1 loss differences indicate differences in overall error rates between groups. The between-group GEI indicates how unequally an outcome is distributed between groups []. In Maryland, the retrained CMS model and the modified HOSPITAL score had the lowest racial and income bias (Table 4).

Specifically, in Maryland, the modified HOSPITAL score demonstrated the lowest racial bias based on 0-1 loss, FPR difference, and GEI and the lowest income bias based on FPR difference and GEI. The retrained CMS model demonstrated the lowest racial bias based on 0-1 loss and FNR difference and the lowest income bias on all 4 measures. In Florida, racial bias based on FPR and FNR differences was generally greater than in Maryland, especially for FNR differences. In Florida, the modified HOSPITAL score showed the lowest racial bias based on 0-1 loss, FPR difference, and GEI, and the LACE index showed the lowest racial bias based on FNR difference. Each model performed best on at least one measure of income bias, but overall, the modified HOSPITAL score and the retrained CMS model showed the lowest income bias in Florida. In both states, the White and other-income patient groups had a higher FNR, indicating that they were more likely to be predicted as low risk despite having an unplanned 30-day readmission. The Black and low-income patient groups had a higher FPR, indicating that they were more likely to be predicted as high risk without having an unplanned 30-day readmission. Overall error rates were higher in the Black and low-income patient groups than in the White and other-income patient groups, respectively. Except for GEI and the values marked with a footnote in Table 4, all measures showed statistically significant differences (all P values <.001) between racial groups and between income groups.

Table 4. Bias measures of evaluated 30-day readmission models in Maryland and Florida.
Model, state, and measure | All | White | Black | Difference (B-W)^a | Other income | Low income | Difference (L-O)^a
LACE^b
Maryland
0-1 loss | 0.19 | 0.18 | 0.22 | 0.04 | 0.17 | 0.21 | 0.04
FNR^c | 0.69 | 0.72 | 0.63 | −0.10^d | 0.71 | 0.65 | −0.06
FPR^e | 0.12 | 0.11 | 0.16 | 0.05 | 0.11 | 0.15 | 0.04
GEI^f (between-group) | N/A^g | N/A | N/A | 0.03 | N/A | N/A | 0.02
Florida
0-1 loss | 0.21 | 0.21 | 0.25 | 0.04 | 0.20 | 0.23 | 0.03
FNR | 0.68 | 0.71 | 0.60 | −0.11 | 0.70 | 0.64 | −0.06
FPR | 0.13 | 0.12 | 0.18 | 0.05 | 0.12 | 0.16 | 0.04
GEI (between-group) | N/A | N/A | N/A | 0.02 | N/A | N/A | 0.01
HOSPITAL^h
Maryland
0-1 loss | 0.17 | 0.17 | 0.19 | 0.02 | 0.16 | 0.19 | 0.03
FNR | 0.73 | 0.75 | 0.69 | −0.07 | 0.75 | 0.69 | −0.06
FPR | 0.10 | 0.10 | 0.12 | 0.02 | 0.10 | 0.12 | 0.03
GEI (between-group) | N/A | N/A | N/A | 0.01 | N/A | N/A | 0.01
Florida
0-1 loss | 0.20 | 0.20 | 0.21 | 0.01 | 0.19 | 0.21 | 0.02
FNR | 0.68 | 0.71 | 0.59 | −0.12 | 0.71 | 0.64 | −0.07
FPR | 0.12 | 0.12 | 0.14 | 0.02 | 0.11 | 0.14 | 0.02
GEI (between-group) | N/A | N/A | N/A | 0.01 | N/A | N/A | 0.01
CMS^i (as-is)
Maryland
0-1 loss | 0.44 | 0.41 | 0.49 | 0.07 | 0.43 | 0.47 | 0.04
FNR | 0.30 | 0.34 | 0.24 | −0.10 | 0.32 | 0.26 | −0.06
FPR | 0.47 | 0.43 | 0.54 | 0.11 | 0.45 | 0.51 | 0.06
GEI (between-group) | N/A | N/A | N/A | 0.05 | N/A | N/A | 0.02
Florida
0-1 loss | 0.37 | 0.36 | 0.43 | 0.08 | 0.36 | 0.39 | 0.02
FNR | 0.42 | 0.47 | 0.28 | −0.18 | 0.44 | 0.38 | −0.06
FPR | 0.36 | 0.34 | 0.47 | 0.13 | 0.35 | 0.40 | 0.04
GEI (between-group) | N/A | N/A | N/A | 0.05 | N/A | N/A | 0.02
CMS (retrained)
Maryland
0-1 loss | 0.32 | 0.31 | 0.33 | 0.02 | 0.31 | 0.33 | 0.02
FNR | 0.29 | 0.31 | 0.26 | −0.05 | 0.30 | 0.27 | −0.03^d
FPR | 0.32 | 0.31 | 0.35 | 0.04 | 0.31 | 0.35 | 0.03
GEI (between-group) | N/A | N/A | N/A | 0.02 | N/A | N/A | 0.01
Florida
0-1 loss | 0.32 | 0.32 | 0.36 | 0.04 | 0.32 | 0.34 | 0.02
FNR | 0.38 | 0.41 | 0.28 | −0.13 | 0.40 | 0.35 | −0.05^d
FPR | 0.31 | 0.30 | 0.38 | 0.08 | 0.31 | 0.34 | 0.04
GEI (between-group) | N/A
