Imputing Race and Ethnicity: A Fresh Voices Commentary From The Medical Care Blog

The Biden administration is focusing on health equity and improved data collection to measure and analyze disparities and inequities. Imputation is a method of inferring or assigning values, or a vector of probabilities, to missing data. Individual-level data do not always include racial and/or ethnic identification data: some people decline to share their identification; some identification data may be missing at random. In such cases, how does imputing race and/or ethnicity fit with the administration’s efforts, as well as the broader reckoning with the racial equity imperative?

Population health researchers and policymakers often want to know what every person’s race/ethnicity is because it helps us understand and track quality, costs, access to care, and outcomes for different groups. For example, many have asked for COVID-19-related data to be released by race/ethnicity, so we can measure the disproportionate impact in Black, Indigenous, and other People of Color (BIPOC) communities.1

Moreover, the Federal government requires that various entities collect and report race and ethnicity data.2 Recent efforts have focused on reporting quality measures and other outcomes stratified by race/ethnicity.3 However, people sometimes leave race/ethnicity questions blank on a survey, for example, and this is more likely if they do not feel they fit into any of the answer categories.4

BEYOND BLACK AND WHITE

Race and ethnicity have historically been measured in a variety of ways. Since race and ethnicity are social constructs, rather than biological ones, the definitions are fluid.5 In earlier years, races and ethnicities aside from White and Black were measured inconsistently, if at all. In fact, Social Security categories for race were just Black, White, and “Other” until 1980.6

From 1790 to 2020, every US Census has asked about race, using different categories nearly every time.7 Figure 1 shows how the 2020 Census asked about race and ethnicity. Historically, people from the Indian subcontinent were categorized as “Hindu” from 1920 to 1940, as “White” in the next 3 censuses, and as “Asian” since 1980. In another example, some people of Middle Eastern and North African (MENA) descent have lobbied to be included in the White category on the Census. Others from MENA communities have lobbied to have their own category, arguing that being lumped into the White category erases their community.8

FIGURE 1:

The race question on the 2020 Census form. Shown is the US Census Bureau’s question about race as it appeared on the 2020 Decennial Census instrument.

A small study in 2 diverse clinics nearly 20 years ago found that “many patients became angry when asked about race/ethnicity, and some did not understand the question … many respondents identified with a national origin instead of a race or ethnicity.”9 Is it any wonder that the government’s standard categories, as seen on many surveys, are still contentious? Relying on self-reporting thus means dealing with under-reporting and missing data. Increasing self-reporting takes rebuilding broken trust, which is not quick or easy to do.10

IMPUTING MISSING RACE/ETHNICITY DATA IS A LONG-ESTABLISHED AND COMMON PRACTICE

A complete description of approaches to imputation is beyond the scope of this commentary. However, methods to address missing data date back to at least the 1950s.11 Older approaches often involved assigning the sample mean or mode value to missing data.12 In more complex analyses, researchers used other nonmissing variables to predict values in a single regression imputation.12,13 In recent years, multiple imputation with chained equations (MICE) has overcome the limitations of single regression approaches.14,15 MICE uses information from multiple regression models and random, bootstrapped samples. Bayesian and random forest-based regression approaches have also shown promise in terms of reducing misclassification bias.15 In fact, the Medicare Bayesian Improved Surname Geocoding (MBISG) algorithm is the current standard method in use by the Centers for Medicare & Medicaid Services (CMS)’s Office of Minority Health.3,16
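To make the older single-value approach concrete, the following is a minimal sketch of mode imputation for a categorical variable. The function name `impute_mode` and the category labels "A" and "B" are hypothetical placeholders chosen for illustration, not taken from any cited method.

```python
from collections import Counter


def impute_mode(values):
    """Replace None entries with the most common observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed to learn from; leave untouched
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


# Hypothetical responses; None marks a missing answer.
responses = ["A", "A", "B", None, None]
filled = impute_mode(responses)
```

In this toy input, every missing answer is assigned to the largest category, so the share of "B" falls from one third of observed answers to one fifth of the completed data. Regression-based and multiple-imputation methods were developed in part to avoid this kind of distortion.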

Some Federal data resources still use hot-deck imputation. This approach involves imputing data by randomly selecting a value from a similar record. The Medical Expenditure Panel Survey (MEPS), for example, imputes missing data on income and employment in this manner, but not on disability or race/ethnicity.17 For race/ethnicity, MEPS creates edited/imputed versions of the race/ethnicity indicators, filling in from other data sources (where available) and the race/ethnicity of family members.17
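The hot-deck idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the MEPS procedure: the matching variables (`region`, `age_group`), the function name `hot_deck_impute`, and the example values are all hypothetical.

```python
import random


def hot_deck_impute(records, target, match_keys, seed=0):
    """Fill missing `target` values by copying the value from a randomly
    chosen donor record that has the same values on `match_keys`."""
    rng = random.Random(seed)
    # Group observed target values by their matching cell.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            cell = tuple(rec[k] for k in match_keys)
            donors.setdefault(cell, []).append(rec[target])
    completed = []
    for rec in records:
        new = dict(rec)  # copy so the input is not modified
        if new[target] is None:
            pool = donors.get(tuple(new[k] for k in match_keys))
            if pool:  # leave the value missing when no similar donor exists
                new[target] = rng.choice(pool)
                new["imputed"] = True
        completed.append(new)
    return completed


# Hypothetical records; None marks missing income.
records = [
    {"region": "South", "age_group": "65-74", "income": 31000},
    {"region": "South", "age_group": "65-74", "income": 45000},
    {"region": "South", "age_group": "65-74", "income": None},
    {"region": "West", "age_group": "75+", "income": None},
]
completed = hot_deck_impute(records, "income", ["region", "age_group"])
```

The third record receives one of the two donor incomes. The fourth stays missing because no record shares its cell, which shows how much the result depends on how "similar" is defined.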

In contrast, the Census did use a simple form of imputation to address missing race/ethnicity data in the most recent decennial count. According to their explainer: “… if race is reported for a parent, we could use that information to fill in their child’s missing race. If no information is available within the household, we would impute the information using data from similar nearby households.”18 The authors do not address the possible limitations of such an approach, but thoughtful criticisms would seem to be warranted.

Notably, modern approaches to imputing race and/or ethnicity often generate estimated probabilities for statistical modeling, rather than assigning people to specific categories.19 This avoids the potentially problematic issue of directly assigning people to the wrong categories. However, when using these probabilities in models, their coefficients cannot be interpreted the same way as the coefficients estimated with categorical race/ethnicity data.20
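One simple way such probabilities might enter an analysis is as weights when estimating a group-level average, so that each person contributes to every group in proportion to their estimated probability. The sketch below assumes this weighted-mean use; the function name, the group labels, and all numbers are hypothetical.

```python
def probability_weighted_means(outcomes, probabilities, groups):
    """Estimate each group's mean outcome, weighting every person by
    their estimated probability of belonging to that group."""
    means = {}
    for g in groups:
        weight_total = sum(p[g] for p in probabilities)
        weighted_sum = sum(y * p[g] for y, p in zip(outcomes, probabilities))
        means[g] = weighted_sum / weight_total
    return means


# Hypothetical binary outcomes and membership probabilities for 4 people.
outcomes = [1, 1, 0, 0]
probabilities = [
    {"group_a": 0.9, "group_b": 0.1},
    {"group_a": 0.6, "group_b": 0.4},
    {"group_a": 0.2, "group_b": 0.8},
    {"group_a": 0.5, "group_b": 0.5},
]
means = probability_weighted_means(outcomes, probabilities, ["group_a", "group_b"])
```

No one is placed in a single category here, which is the point of the approach. The resulting estimates describe probability-weighted mixtures of people, not groups of individuals who identified themselves.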

An argument can be made that, done correctly, imputation is imperfect but better than nothing. It reduces variance and improves the quality of the data.20 Multiple imputation methods also account for uncertainty in the imputed data. Indirect estimation is certainly less “burdensome”—from the government’s perspective—than gathering this information directly.

SHORTCOMINGS OF IMPUTING RACE/ETHNICITY FROM A HEALTH EQUITY PERSPECTIVE

From a health equity perspective, however, it is worth digging deeper. Can a statistical model actually be constructed to predict race/ethnicity that satisfies different kinds of validity—including face validity, construct validity, replicability, and predictive validity? Imputing race/ethnicity can create bias in terms of misidentification, which is particularly problematic in this context.

Another major statistical issue with imputation is that the methodology implies that these missing data are nonsystematically missing and/or that they belong to the same patterns as the nonmissing data. However, research shows that people who do not volunteer identification data tend to come from under-represented groups.9 If we assess the impact of the health care system on health outcomes through stratification by race/ethnicity, using an algorithm that induces bias in a metric so highly related to our outcome(s) of interest seems ill-advised unless it corrects more bias than it introduces. Whether it does depends on one’s perspective and on which groups are prioritized in the analysis.

The targets of an imputation algorithm are whatever categories the algorithm is trained on. The MBISG method, for example, is based on assigning people to Black, Hispanic, Asian, White, or Other. If more categories are added, and if multiracial identification is allowed, the algorithm will need to be retrained. In contrast, individuals can describe their identities with nuance and specificity. If and when the questionnaires change, individuals can adapt, whereas an imputation algorithm will necessarily lag behind in its predictions. From this perspective, self-report is more ethical, valid, and reliable.

ETHICS AND IDENTITIES

Ethically, we should be concerned about filling in information that has been withheld deliberately. For example, someone who agreed to provide personal, financial, or health data may not have done so if plans to impute race or ethnicity to their data were disclosed.21 Choosing not to answer is a valid response category. Imputation should only be done for truly missing responses.

So much of an individual’s experience of the health care system can be shaped by their race/ethnicity because of systemic bias and structural racism. Race and ethnicity are also associated with the effects of segregation, a relative lack of generational wealth, and many other things—largely as a direct result of federal, state, and local policy and practice.22 Race and ethnicity are very different from other kinds of characteristics that could be imputed, like cholesterol levels.

Race and ethnicity are essential parts of our identities, our cultures, and our experiences. Understandably, given historical precedents, racial and ethnic identities may also be correlated with mistrust of the medical profession and mistrust of government.5 Racism and stigma, independent of economics, influence the care that people receive.23,24 For example, care providers have been shown to underestimate and undertreat the physical pain felt by Black people.25 Those with chronic illnesses face additional stigma that worsens their quality of life.26

Given this, some argue that self-report is the only standard for personal identification—not a benchmark for validation, nor merely the best of many ways to determine a person’s identity. Algorithms that impute racial/ethnic data could exacerbate racial/ethnic biases in clinical decision-making and public policy-making. If imputed race and ethnicity variables do not accurately predict actual race and ethnicity, the conclusions policymakers draw from the imputed data could lead to misinformed policy choices that harm BIPOC populations.

UNDER-REPRESENTATION

The imputation methodologies currently in use, by their very nature, perpetuate under-representation: less-represented identifications are going to be less likely to be assigned (by definition) and BIPOC representation will continue to suffer. This seems backward: shouldn’t the point be to understand the experiences of those least likely to be identified? Echoing the language of the disability-rights movement—“nothing about us without us”27—how can we help inform good policy without good data on those who are known to experience worse care and outcomes?

For example, electronic health records are the source of race/ethnicity in some cancer registries. One study found that American Indians were frequently miscategorized as white in those registries.28 Similarly, individuals with multiple racial/ethnic identities and Indigenous people are often misidentified on death certificates.24 Some healthcare facilities are better than others at collecting race/ethnicity accurately.29 In these cases, how are we to analyze the care provided to under-represented groups if they are misidentified in our data?

Since the mid-2000s, several teams of researchers have developed approaches that use Medicare beneficiaries’ surnames and where they live to impute race/ethnicity.30,31 The method of using surnames has obvious shortcomings: people change their surnames at marriage, people can be adopted by or have parents of different or multiple racial identifications, and so on. In a recent paper, one team of researchers reported that the accuracy of their approach ranged from 88% to 95% for Hispanic, Black, white, and Asian/Pacific Islander people.3 However, American Indian/Alaska Native and multiracial people had much lower correlations between imputed and self-reported information: 12%–54%.
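The general idea of combining a surname with a place of residence can be sketched with Bayes' rule. This is a simplified illustration, not the MBISG algorithm: it assumes surname and location are independent given group membership, and the function name, group labels, and every probability are hypothetical.

```python
def combine_surname_and_geography(p_group_given_surname, p_area_given_group):
    """Update surname-based probabilities with location information.

    posterior(group) is proportional to
        P(group | surname) * P(area | group),
    assuming surname and area are independent given the group.
    """
    unnormalized = {
        g: p_group_given_surname[g] * p_area_given_group[g]
        for g in p_group_given_surname
    }
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


# Hypothetical inputs: surname-based prior and the share of each group's
# population that lives in this person's area.
surname_prior = {"group_a": 0.7, "group_b": 0.2, "group_c": 0.1}
area_likelihood = {"group_a": 0.001, "group_b": 0.004, "group_c": 0.001}
posterior = combine_surname_and_geography(surname_prior, area_likelihood)
```

With these made-up inputs, the location evidence shifts the most probable group from `group_a` to `group_b`. The output can only ever distribute probability across the categories supplied as inputs, so identities outside those categories are never predicted.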

The shortcomings of imputation approaches could be magnified with more use of algorithms. Algorithmic bias can be hard to detect and understand.32 More research is needed on the implications of this issue.

WE NEED MORE AND BETTER DATA

Even more importantly, we need better data to begin with. How should we collect these data? What additional research could inform the next evolution in dealing with this perennial problem?

Discrimination in health care occurs on the basis of skin color and other physical characteristics, less-than-perfect English or stigmatized accents, sexual orientation, gender identity and its presentation, poverty, lack of education or literacy, history of substance use, disability, overweight, various other health conditions, and more. Truly, if the goal is health equity for all, we need to be taking all kinds of potential gaps into consideration. One person of color interviewed on the topic of racial identification questions on a health-related survey said:

What is the point of asking me [my race on this survey]? If it is [about] experiencing discrimination, why aren’t you asking if I’ve experienced discrimination because of my race? Why does it even matter what race I am—if the point is to uncover quality [issues], then that should matter regardless of race.

Following this argument, collecting more specific data on inequities, mistreatment, and gaps in high-quality care should be a priority. Similarly, every survey instrument that asks about race/ethnicity ought to allow respondents to select “Prefer not to answer” as a valid response option.

Also, researchers and policymakers alike need to examine their own internal assumptions. Are you using race/ethnicity as a proxy for something else? Are there data available, or could data be collected, to measure that thing? Either way, be explicit about why race/ethnicity are being used in your models.

ASK THE EXPERTS

What if we asked a racially diverse panel of individuals to weigh in on what should happen if they leave the race/ethnicity question blank on a survey? Shockingly, it does not appear that anyone has published research on this to date. The Census has conducted extensive focus groups on race/ethnicity survey collection in recent years, but it’s not clear whether the participants shed light on these specific issues.33

We should ask those who are most likely to be affected to weigh in on current practices and approaches. How would they feel about: being dropped from an analysis altogether? Being lumped into a missing/unknown or “Other” category? Having probabilities of being of various races/ethnicities assigned based on their last name and where they live?

We should also ask about the race/ethnicity questions themselves. How would they feel about open-ended race and ethnicity questions? What about asking people about their ancestry or family origins instead of their race/ethnicity? We need more qualitative data to get perspectives on this issue.

DATA LINKAGES AND IMPROVED ENROLLMENT

In addition to more qualitative data on this issue, we also need more quantitative data. Self-reported racial/ethnic identification collected via surveys, clinical assessments (such as in nursing homes), registries, and electronic health records (EHRs) can be linked with administrative data.6,34 Perhaps someday, a national healthcare blockchain system will protect privacy and confidentiality while allowing us to trace individuals across different instruments and datasets.35

Failing that, the Federal government has some power to facilitate better data linkages and collection. For example, the Medicare enrollment form could be changed to collect more information besides the basics. CMS could leverage linkages between survey, assessment, registry, and EHR data with enrollment data to improve the accuracy of older, Social-Security-supplied race data.36,37 CMS could work with other Federal and state agencies to use both qualitative and quantitative approaches to improve race/ethnicity measurement.

Even so, some states limit when and how race and ethnicity can be collected.24 Others may push back against collecting such data for ideological reasons concerning the scope and limits of government’s role in citizens’ lives. These and other obstacles to collecting race/ethnicity will continue to stymie efforts to promote health equity.

ALTERNATIVES TO IMPUTATION

Admittedly, we currently lack good alternatives to imputation in the analysis phase. This is one reason why the best alternative is to collect better data to begin with.

When running regression models in many statistical software packages, individuals with missing data on predictors are, by default, simply removed from the model. This is known as “complete-case analysis” or “listwise deletion.” This approach decreases information and sample size and can introduce bias if data are not missing completely at random.38 Still, many studies in the literature have used this approach.39
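A small sketch shows how listwise deletion can shift an estimate when missingness is related to the outcome. The data are invented for illustration, and the function name `complete_case_mean` and field names are hypothetical.

```python
def complete_case_mean(rows, outcome, required):
    """Mean of `outcome` over rows with no missing values in `required`,
    plus the number of rows kept."""
    kept = [r for r in rows if all(r[k] is not None for k in required)]
    return sum(r[outcome] for r in kept) / len(kept), len(kept)


# Hypothetical data: the two people with missing race/ethnicity also have
# higher values on the outcome, so data are not missing completely at random.
rows = [
    {"race_ethnicity": "x", "outcome": 10},
    {"race_ethnicity": "x", "outcome": 12},
    {"race_ethnicity": None, "outcome": 20},
    {"race_ethnicity": None, "outcome": 22},
]
full_mean = sum(r["outcome"] for r in rows) / len(rows)
cc_mean, n_kept = complete_case_mean(rows, "outcome", ["race_ethnicity"])
```

Here half of the sample is dropped, and the complete-case mean understates the mean for everyone in the data.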

Another approach is to narrow the analysis to only people with known white or Black race and drop all others or put them in an “Other” category. While this sidesteps the missing data issue, it leads to a new problem: lack of generalizability.40 Similarly, creating a separate category of people with missing data and analyzing them separately leads to the inability to draw meaningful conclusions about those in that category.

For research in which proportional representation is a primary concern, we could consider designs that sample portions of more-represented groups, rather than add imputed data to less-represented groups. This approach would be based on an understanding that statistical results based on groups with greater representation—such as white males—would be robust even with fewer observations.

In a design based on relative representation, researchers could, based on reliable sources of population characteristics, under-sample over-represented groups while including all (or nearly all) of the least-represented group, most of the next-least represented group, and so on. This approach still assumes that people from under-represented groups who have missing racial/ethnic information are roughly the same with regard to outcomes as people from the same group whose information is reported, which may not be the case. This design would not avoid the need for more, and more reliable, data on the health and health outcomes of underserved populations. However, it could allow for fairer proportional representation without requiring imputation of identification data. This would also be a decision made at the design stage of the study, rather than the analysis phase.
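A design of this kind might be sketched as group-specific sampling fractions applied at the design stage. The fractions, group labels, and function name below are hypothetical choices for illustration; a real design would derive them from reliable population data and would also need to decide how to weight the analysis.

```python
import random


def undersample_by_group(records, group_key, fractions, seed=0):
    """Draw a simple random sample, without replacement, from each group
    using that group's sampling fraction."""
    rng = random.Random(seed)
    by_group = {}
    for rec in records:
        by_group.setdefault(rec[group_key], []).append(rec)
    sample = []
    for group, members in by_group.items():
        k = round(fractions[group] * len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical sampling frame with 10, 50, and 400 people per group.
records = (
    [{"id": i, "group": "least_represented"} for i in range(10)]
    + [{"id": 10 + i, "group": "next_least"} for i in range(50)]
    + [{"id": 60 + i, "group": "most_represented"} for i in range(400)]
)
# Keep everyone in the smallest group, most of the next, a quarter of the largest.
fractions = {"least_represented": 1.0, "next_least": 0.8, "most_represented": 0.25}
sample = undersample_by_group(records, "group", fractions)
```

The sketch requires a known group for every record in the sampling frame, so it does not resolve the question of what to do with people whose identification is missing.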

Another approach is to use geographic, population-level data instead of, or in addition to, individual-level data. Area-level racial/ethnic-related characteristics can involve measures of racial residential segregation and isolation, dissimilarity indices, and historical redlining, reflecting the place-based approach to health.41 Area-based measures are often independently associated with health outcomes, and including them improves health equity by better accounting for social context.42
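As one example of an area-level measure, the dissimilarity index compares how two groups are distributed across the smaller areas (such as tracts) that make up a region: it is half the sum, over areas, of the absolute difference between each group's share of its regional total. The sketch below assumes that standard two-group definition; the counts are invented.

```python
def dissimilarity_index(area_counts):
    """Two-group dissimilarity index.

    `area_counts` is a list of (count_group_a, count_group_b) pairs,
    one per area. Returns a value from 0 (both groups distributed
    identically across areas) to 1 (complete separation).
    """
    total_a = sum(a for a, _ in area_counts)
    total_b = sum(b for _, b in area_counts)
    return 0.5 * sum(abs(a / total_a - b / total_b) for a, b in area_counts)


# Hypothetical regions made of two areas each.
fully_separated = dissimilarity_index([(100, 0), (0, 100)])
evenly_mixed = dissimilarity_index([(50, 50), (50, 50)])
partly_separated = dissimilarity_index([(80, 20), (20, 80)])
```

Because the index is computed from area totals, it describes places, not the identity of any individual living in them.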

CONCLUSIONS

The question of how to handle missing data on race/ethnicity is not a simple matter. We need more and better data to inform health equity analyses and prevent the need to impute in the first place. The onus needs to remain on improving the collection of self-reported data on race and ethnicity, as well as other relevant factors of interest.

Imputation is a common solution to deal with “the missing-data problem,” but much is still unknown about imputation’s implications when it comes to health equity and racial justice.43,44

It is important not to impute race/ethnicity crudely or thoughtlessly, but to think carefully about the validity of your models. Race/ethnicity should never be used as a proxy for whatever the real exposure or confounder is. Properly speaking, the latent variable we are most often studying is the effect of racism and unfair treatment, rather than the personal characteristic of race or ethnicity.45 Thus, imputation has potential repercussions. A better-than-nothing approach can be dangerous, since issues with regard to poor care or poor outcomes stemming from systemic racism can hardly be mitigated by math. Predicting which box a person would check is an indirect measure of an indirect measure.

Imputing missing race/ethnicity information is routinely done—but just because it’s common doesn’t mean it’s right. Many practices that were once common in health and medicine have gone by the wayside. Someday, perhaps imputing race/ethnicity may be seen as another archaic practice from a less-enlightened era.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the following RTI staff for their helpful feedback on the blog posts: Jane Allen, Anupa Bir, Susan Haber, and Pam Spain.

REFERENCES

2. Institute of Medicine; Ulmer C, McFadden B, Nerenz DR, eds. Defining categorization needs for race and ethnicity data. Race, Ethnicity, and Language Data: Standardization for Health Care Quality Improvement. Washington, DC: The National Academies Press; 2009:297.
3. Haas A, Elliott MN, Dembosky JW, et al. Imputation of race/ethnicity to enable measurement of HEDIS performance by race/ethnicity. Health Serv Res. 2019;54:13–23.
4. Woolverton GA, Marks AK. “I just check ‘other’”: evidence to support expanding the measurement inclusivity and equity of ethnicity/race and cultural identifications of US adolescents. Cultur Divers Ethnic Minor Psychol. 2021. doi: 10.1037/cdp0000360.
5. Yearby R. Race based medicine, colorblind disease: how racism in medicine harms us all. Am J Bioeth. 2021;21:19–27.
6. Zaslavsky AM, Ayanian JZ, Zaborski LB. The validity of race and ethnicity in enrollment data for Medicare beneficiaries. Health Serv Res. 2012;47(pt 2):1300–1321.
7. Prewitt K. Racial classification in America: where do we go from here? Daedalus. 2005;134:5–17.
8. Parvini S, Simani E. Are Arabs and Iranians white? Census says yes, but many disagree. The Los Angeles Times. March 28, 2019.
9. Moscou S, Anderson MR, Kaplan JB, et al. Validity of racial/ethnic classifications in medical records data: an exploratory study. Am J Public Health. 2003;93:1084–1086.
10. Thompson K, Glenn J, Moore S. Broken trust and cancer prevention. The Medical Care Blog. 2021. Available at: https://www.themedicalcareblog.com/broken-trust-cancer/. Accessed January 3, 2022.
11. Kish L, Hess I. A “replacement” procedure for reducing the bias of nonresponse. Am Stat. 1959;13:17–19.
12. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. 1995;142:1255–1264.
13. Donders ART, Van Der Heijden GJ, Stijnen T, et al. A gentle introduction to imputation of missing values. J Clin Epidemiol. 2006;59:1087–1091.
14. Murray JS. Multiple imputation: a review of practical and theoretical findings. Stat Sci. 2018;33:142–159.
15. Slade E, Naylor MG. A fair comparison of tree-based and parametric methods in multiple imputation by chained equations. Stat Med. 2020;39:1156–1166.
16. Dembosky JW, Haviland AM, Haas A, et al. Indirect estimation of race/ethnicity for survey respondents who do not report race/ethnicity. Med Care. 2019;57:e28–e33.
17. Agency for Healthcare Research and Quality. MEPS HC-209: 2018 full year consolidated data file documentation. 2020. Available at: https://web.archive.org/web/20220104001938/https://meps.ahrq.gov/data_stats/download_data/pufs/h209/h209doc.shtml. Accessed January 3, 2022.
18. Cantwell P. How we complete the census when households or group quarters don’t respond. Random Samplings [Blog Post]. 2021. Available at: https://web.archive.org/web/20220104210241/https://www.census.gov/newsroom/blogs/random-samplings/2021/04/imputation-when-households-or-group-quarters-dont-respond.html. Accessed January 4, 2022.
19. Xue Y, Harel O, Aseltine RH Jr. Imputing race and ethnic information in administrative health data. Health Serv Res. 2019;54:957–963.
20. Silva GC, Trivedi AN, Gutman R. Developing and evaluating methods to impute race/ethnicity in an incomplete dataset. Health Serv Outcomes Res Method. 2019;19:175–195.
21. Randall M, Stern A, Su Y. Five ethical risks to consider before filling missing race and ethnicity data: workshop findings on the ethics of data imputation and related methods. 2021. Available at: https://www.urban.org/research/publication/five-ethical-risks-consider-filling-missing-race-and-ethnicity-data. Accessed January 3, 2022.
22. Lynch EE, Malcoe LH, Laurent SE, et al. The legacy of structural racism: associations between historic redlining, current mortgage lending, and health. SSM Popul Health. 2021;14:100793.
23. Nelson A. Unequal treatment: confronting racial and ethnic disparities in health care. J Natl Med Assoc. 2002;94:666–668.
24. Institute of Medicine (US) Committee on Understanding and Eliminating Racial and Ethnic Disparities in Health Care. In: Smedley BD, Stith AY, Nelson AR, eds. Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. Washington, DC: National Academies Press; 2003.
25. Hoffman KM, Trawalter S, Axt JR, et al. Racial bias in pain assessment and treatment recommendations, and false beliefs about biological differences between blacks and whites. Proc Natl Acad Sci. 2016;113:4296–4301.
26. Hood AM, Crosby LE, Hanson E, et al. The influence of perceived racial bias and health-related stigma on quality of life among children with sickle cell disease. Ethn Health. 2020:1–14. doi: 10.1080/13557858.2020.1817340.
27. Charlton JI. Nothing About Us Without Us: Disability Oppression and Empowerment. Berkeley, CA: University of California Press; 2000.
28. Clegg LX, Reichman ME, Hankey BF, et al. Quality of race, Hispanic ethnicity, and immigrant status in population-based cancer registry data: implications for health disparity studies. Cancer Causes Control. 2007;18:177–187.
29. Office of Quality and Patient Safety. Facility race/ethnicity concordance reports. Statewide Planning and Research Cooperative System. 2014. Available at: https://www.health.ny.gov/statistics/sparcs/reports/race_eth/. Accessed January 3, 2022.
30. Eicheldinger C, Bonito A. More accurate racial and ethnic codes for Medicare administrative data. Health Care Financ Rev. 2008;29:27–42.
31. Elliott MN, Fremont A, Morrison PA, et al. A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. Health Serv Res. 2008;43(pt 1):1722–1736.
32. Obermeyer Z, Powers B, Vogeli C, et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366:447–453.
33. US Census Bureau. Research to improve data on race and ethnicity. 2021. Available at: https://web.archive.org/web/20220104205543/https://www.census.gov/about/our-research/race-ethnicity.html. Accessed January 4, 2022.
34. Jarrín OF, Nyandege AN, Grafova IB, et al. Validity of race and ethnicity codes in medicare administrative data compared with gold-standard self-reported race collected during routine home health care visits. Med Care. 2020;58:e1–e8.
35. Gordon WJ, Catalini C. Blockchain technology for healthcare: facilitating the transition to patient-driven interoperability. Comput Struct Biotechnol J. 2018;16:224–230.
36. National Cancer Institute. About the SEER-MHOS Linked Data Resource. 2021. Available at: https://healthcaredelivery.cancer.gov/seer-mhos/overview/history.html. Accessed August 20, 2021.
37. National Cancer Institute. About the SEER-CAHPS Data Resource. 2021. Available at: https://healthcaredelivery.cancer.gov/seer-cahps/overview/. Accessed August 20, 2021.
38. National Cancer Institute. Guidance Document: Missing Data in SEER-CAHPS. 2020. Available at: https://healthcaredelivery.cancer.gov/seer-cahps/researchers/missing-data-guidance.pdf. Accessed January 4, 2022.
39. Long JA, Bamba MI, Ling B, et al. Missing race/ethnicity data in Veterans Health Administration based disparities research: a systematic review. J Health Care Poor Underserved. 2006;17:128–140.
40. Kukull WA, Ganguli M. Generalizability: the trees, the forest, and the low-hanging fruit. Neurology. 2012;78:1886–1891.
41. US Census Bureau. Housing patterns: guidance for data users; appendix B. Measures of residential segregation. 2017. Available at: https://web.archive.org/web/20220104213532/https://www.census.gov/topics/housing/housing-patterns/guidance/appendix-b.html. Accessed March 15, 2022.
42. Lines LM. Artificially intelligent social risk adjustment. The Medical Care Blog. 2021. Available at: https://web.archive.org/web/20220104213654/https://www.themedicalcareblog.com/artificially-intelligent-social-risk-adjustment/. Accessed January 4, 2022.
43. Grundmeier RW, Song L, Ramos MJ, et al. Imputing missing race/ethnicity in pediatric electronic health records: reducing bias with use of U.S. Census Location and Surname Data. Health Serv Res. 2015;50:946–960.
44. Rhodes W. Improving disparity research by imputing missing data in health care records. Health Serv Res. 2015;50:939–945.
45. Graetz N, Boen CE, Esposito MH. Structural racism and quantitative causal inference: a life course mediation framework for decomposing racial health disparities. J Health Social Behav. 2022:00221465211066108.
