A process to deduplicate individuals for regional chronic disease prevalence estimates using a distributed data network of electronic health records

1 INTRODUCTION

Learning health systems (LHSs) that leverage data for rapid, continuous improvement operate amid broader secular epidemics of chronic disease and substance use that exceed any one healthcare system's ability to address.1-4 Public health problems freely transcend county boundaries and provider networks. A nationwide LHS, based on a federated data sharing model,5 proposes to combine LHS concepts with established public health strategies, such as estimating disease prevalence.6 One common LHS challenge is reconciling fragmented data collected across an ecosystem of electronic health records (EHRs). Record fragmentation limits a single system's ability to learn from patients' experiences and outcomes at the individual or population level. A treatment (or exposure) may be recorded in one healthcare system, while related outcomes may be recorded in another. Conceptually, this identity management (IM) challenge includes several component activities: (a) uniquely identifying and linking individuals across multiple data sources and distinct healthcare organizations, (b) aggregating individual-level health data from multiple sources, and (c) reconciling data and discrepancies across sources (eg, removing duplicates, resolving changing residence data over time). For example, while healthcare organizations identify unique individuals and assign medical record numbers using internal IM tools, health information exchanges (HIEs) facilitate Health Insurance Portability and Accountability Act (HIPAA)-compliant data sharing from one covered entity to another using cross-entity IM (eg, master patient index).7, 8

EHR distributed data networks (DDNs) can leverage federal investments for research, quality improvement, and public health, all valued domains for any LHS. Federated data sharing, recognized by funding agencies,9 and adopted by clinical data research networks,10, 11 can preserve privacy and security as data remain behind firewalls of DDN-participating healthcare organizations, until queried for specific approved uses. The importance of IM in a DDN is likely influenced by the specific use case and geographic proximity of participating organizations. For example, whereas some PCORnet clinical data research networks may have limited geographic overlap and duplication of patients, others have implemented IM solutions.12, 13 Regional DDNs designed for quality improvement or public health surveillance in a defined region may be especially likely to experience patient duplication. Risks of duplication bias in public health DDNs have been recognized, but data to inform decisions are lacking.14 Prevalence estimates from DDNs can be biased when individuals who access multiple health systems are represented more than once;15 however, the degree of bias may differ by use case.

Efforts to define, scope, and address problems caused by duplication for a variety of public health use cases are needed.16 Building on previous DDN-based surveillance,17-19 we sought to implement and evaluate methods to deduplicate DDN prevalence estimates. For pilot use cases we selected type 1 diabetes mellitus (T1DM) and type 2 diabetes mellitus (T2DM), health conditions that require enhanced coordination across primary and specialty care settings. Across settings, multiple records may exist for the same child, leading to potentially biased prevalence estimates. Our goal was to empirically evaluate a scalable process to deduplicate T1DM or T2DM prevalence estimates among pediatric patients receiving healthcare services at two large health systems in the same region.

2 METHODS

2.1 Setting

The Denver metropolitan area, an urban and suburban region, has collaboratively developed a DDN (ie, Colorado Health Observation Regional Data Service [CHORDS]) through a consortium of state/local public health departments, health systems, federally qualified health centers (FQHCs), community mental health centers, a regional HIE, a university, non-profit organizations, and other key stakeholders. Data from EHRs are normalized to a common data model and queried using DDN data aggregation software (ie, PopMedNet [PMN]).20 A detailed description of the development of CHORDS has been published elsewhere.17 This study focuses on an IM approach that could scale to deduplicating prevalence estimates in the seven-county Denver metropolitan area, which includes over 50% of Colorado's 5.8 million residents.

2.2 Populations

Two health systems contributing data to the CHORDS Network (“data partners”) participated in this study. Data partner 1 (DP1) is a large, integrated safety-net health system recognized as an LHS9 that provides care for the majority of low-income individuals in the City and County of Denver (~30% of Denver's population). Data partner 2 (DP2) is a large pediatric tertiary care facility that participates in a network LHS.10 Children seen in primary care at DP1 are routinely referred to DP2 for many types of specialty care. Patients with T1DM are referred to a specialty diabetes program affiliated with DP2 that did not contribute data to this study. Both data partners have multiple locations throughout the Denver metropolitan area. Some DP1 and DP2 locations are within 3 miles of one another.

The eligible population (denominator) for this evaluation was children (less than 18 years of age on the date of the encounter) residing in the seven-county Denver metropolitan area with at least one 2017 healthcare encounter at either data partner. Individuals with incomplete information for unique identification by the HIE were excluded (n = 47 347, 2% of all records; see Figure 1).

FIGURE 1. STROBE flow diagram representing the number of unique patients across two distributed data network partners participating in a study of identity management's influence on type 1 and type 2 diabetes prevalence

For its chronic disease surveillance mission, the CHORDS Network leverages numerous case definitions drawn from the Centers for Medicare and Medicaid Services Chronic Conditions Data Warehouse.21 Cases were identified as individuals with at least one International Classification of Diseases (ICD) code for a billing or problem list diagnosis of T1DM or T2DM. Individuals with at least one T1DM diagnosis, at either data partner, were classified as T1DM cases. Likewise, a single T2DM diagnosis resulted in a person being classified as a T2DM case. We did not distinguish cases with both T1DM and T2DM diagnosis codes.

2.3 Data sources and distributed network

Data Governance: Data partners executed a Data Use Agreement (DUA) to share a record-level limited dataset and business associate agreements (BAA) to share personally identifiable information (PII) with the HIE. The HIE has participation agreements with each data partner and routinely manages patients' identities as part of its core business functions. The HIE assigned a unique network-wide identifier (ie, LINK_ID) for each patient, which linked patients across data partners. The Colorado Multiple Institutional Review Board reviewed the CHORDS Network as non-human subjects research for public health uses.

Patient Matching: Each data partner generated, from its data warehouse, a panel file containing pertinent demographic data for every individual seen (Figure 2). Each record in the panel file included a site-specific identifier (PERSON_ID) and a series of pre-specified PII fields. Data partners exchanged panel files with the HIE using secure file transfer protocols (SFTP). The HIE used a proprietary referential matching process that combined a database of PII with rules-based, probabilistic linkage methods to identify unique individuals in the panel file and assign a unique, network-wide identifier for linkage (LINK_ID). The HIE returned the LINK_ID and PERSON_ID for each patient to each data partner.

FIGURE 2. Generating stratified, deduplicated estimates of diabetes prevalence through a distributed query process that minimizes exchange of PHI

CHORDS Data Model: CHORDS used a data model (ie, Virtual Data Warehouse) adapted from other common data models.22, 23 Both data partners extracted, transformed, and loaded (ETL) EHR data into 24 distinct tables, with specific fields and data formats organized for PMN queries. Network identifiers were stored in the LINKAGE table (Figure 2), where one or more rows were associated with each LINK_ID. More than one row was required when the patient matching algorithm identified duplicates within a data partner and assigned the same network-wide identifier (LINK_ID) to several site-specific records (ie, distinct PERSON_ID values).
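The one-to-many relationship between LINK_ID and PERSON_ID can be illustrated with a minimal sketch. The rows, identifier values, and function name below are hypothetical and are not drawn from the CHORDS LINKAGE table; only the two field names come from the text.

```python
from collections import defaultdict

# Hypothetical LINKAGE rows at a single data partner: (PERSON_ID, LINK_ID).
linkage_rows = [
    ("P001", "CID1"),
    ("P002", "CID2"),
    ("P003", "CID2"),  # the same person registered twice at this partner
    ("P004", "CID3"),
]


def persons_by_link_id(rows):
    """Group site-specific PERSON_IDs under their network-wide LINK_ID."""
    grouped = defaultdict(list)
    for person_id, link_id in rows:
        grouped[link_id].append(person_id)
    return dict(grouped)


grouped = persons_by_link_id(linkage_rows)

# LINK_IDs with more than one PERSON_ID are within-partner duplicates.
within_partner_duplicates = {
    link_id: ids for link_id, ids in grouped.items() if len(ids) > 1
}
print(len(grouped), "unique individuals;", within_partner_duplicates)
```

Counting distinct LINK_ID values rather than PERSON_ID values is what would keep such a patient from being counted twice within one partner.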

Distributed Query Logic: We developed a two-step query process: (a) select a cohort of unique individuals across data partners and (b) classify those individuals into cases (yes/no). Limited exchange of associated PII permitted demographic (age, gender, and race/ethnicity) and geographic (census tract) stratification. We designed an automated process to reconcile medical, demographic, and geographic information that conflicted across partners (eg, an individual may be diagnosed at one data partner and not another). Manual reconciliation was resource-intensive and infeasible. Automated reconciliation required decisions to identify which value(s) to use in estimating prevalence. Individual-level data exchange was limited to three variables: LINK_ID, diagnosis status and final 2017 visit date. Individual demographic and geographic data were selected from the system whose data were used in Step 2. Below, we describe the two-step (ie, cohort selection and stratification) query processes:

Cohort Reconciliation: For each condition (T1DM or T2DM) a cohort was selected by generating lists of all eligible individuals (see above) with an initial query. Query results contained three fields: the LINK_ID, a binary indicator for case status, and last visit date. With these fields a single data partner was selected to contribute a given patient's data (medical, demographic, geographic) to prevalence estimates. Rules for data partner selection are described below (mock results are represented in Table 1).

Decision 1: When an individual (CID1) was seen by both data partners, yet only one identified the patient as a case, select the record from the data partner identifying the patient as a case.

Decision 2: When an individual (CID2 or CID3) was seen by both data partners, and has the same case status, select the record from the data partner with the most recent 2017 visit.

Decision 3: When an individual (CID4 or CID5) was seen by both data partners, and has the same case status and the same most recent 2017 visit date, select the data partner at random.

TABLE 1. Example of reconciliation process for selecting data partners to contribute demographic and geographic data for individuals seen in multiple health care systems (selection criteria are highlighted)

| Network identifier | Diagnosis present, data partner 1 | Diagnosis present, data partner 2 | Final 2017 visit date, data partner 1 | Final 2017 visit date, data partner 2 | Selected data partner |
|---|---|---|---|---|---|
| CID1 | Yes | No | January 1, 2017 | December 31, 2017 | 1 |
| CID2 | Yes | Yes | December 31, 2017 | January 1, 2017 | 1 |
| CID3 | No | No | January 1, 2017 | December 31, 2017 | 2 |
| CID4 | Yes | Yes | January 1, 2017 | January 1, 2017 | 2 (random) |
| CID5 | No | No | December 31, 2017 | December 31, 2017 | 1 (random) |

The query tool applied these logical rules to produce two mutually-exclusive lists of LINK_IDs - one for each data partner.
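The three decision rules can be sketched in a few lines of Python. This is an illustration only, not the PMN query logic: the function name, the tuple layout, and the seeded random generator are assumptions, and the mock values follow Table 1. Individuals seen at only one data partner need no rule and are simply assigned to that partner.

```python
import random
from datetime import date


def select_partner(dp1, dp2, rng):
    """Return 1 or 2: the partner that contributes a duplicated individual's data.

    dp1 and dp2 are (is_case, final_2017_visit_date) tuples, one per partner.
    """
    case1, visit1 = dp1
    case2, visit2 = dp2
    if case1 != case2:  # Decision 1: prefer the partner recording a diagnosis
        return 1 if case1 else 2
    if visit1 != visit2:  # Decision 2: prefer the most recent 2017 visit
        return 1 if visit1 > visit2 else 2
    return rng.choice([1, 2])  # Decision 3: tie, select at random


# Mock query results mirroring Table 1.
mock = {
    "CID1": ((True, date(2017, 1, 1)), (False, date(2017, 12, 31))),
    "CID2": ((True, date(2017, 12, 31)), (True, date(2017, 1, 1))),
    "CID3": ((False, date(2017, 1, 1)), (False, date(2017, 12, 31))),
    "CID4": ((True, date(2017, 1, 1)), (True, date(2017, 1, 1))),
    "CID5": ((False, date(2017, 12, 31)), (False, date(2017, 12, 31))),
}

rng = random.Random(0)  # seeded only so the sketch is repeatable
selected = {cid: select_partner(a, b, rng) for cid, (a, b) in mock.items()}

# Two mutually exclusive lists of LINK_IDs, one per data partner.
dp1_list = sorted(cid for cid, partner in selected.items() if partner == 1)
dp2_list = sorted(cid for cid, partner in selected.items() if partner == 2)
print(dp1_list, dp2_list)
```

Because every LINK_ID is assigned to exactly one partner, the two lists cannot overlap, which is the property the second (stratification) query relies on.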

Stratification: The second query produced aggregate counts, limited to the lists of patients identified through reconciliation. Each data partner's care population and cases were grouped by demographic and geographic factors. American Community Survey data for neighborhood poverty (ie, greater than 20% of population living below the federal poverty level: yes/no) was assigned for each patient, based on census tract of residence. We conducted additional analyses, limited to data on ambulatory encounters and following the same analytic approach, to assess the impact of care setting on prevalence estimates and on duplication bias. Once data partners incorporated the Network Identifier and ensured patient populations were distinct, each data partner returned tables of counts. Stratum-specific counts were summed across data partners and used to generate prevalence estimates for the cohort overall and for each stratum.
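As a rough illustration of the final aggregation step, the sketch below sums stratum-specific counts from two partners and converts them to prevalence per 1000. All counts, stratum labels, and names are hypothetical; the only property taken from the text is that the partners' patient lists are already mutually exclusive, so counts can be added without double counting.

```python
from collections import Counter

# Hypothetical aggregate tables returned by each data partner:
# stratum -> (cases, eligible patients).
dp1_counts = {"Female": (3, 1000), "Male": (2, 1100)}
dp2_counts = {"Female": (9, 2000), "Male": (8, 2100)}


def pooled_prevalence_per_1000(*partner_tables):
    """Sum stratum-specific counts across partners, then compute prevalence."""
    cases, patients = Counter(), Counter()
    for table in partner_tables:
        for stratum, (n_cases, n_patients) in table.items():
            cases[stratum] += n_cases
            patients[stratum] += n_patients
    return {s: 1000 * cases[s] / patients[s] for s in patients}


prevalence = pooled_prevalence_per_1000(dp1_counts, dp2_counts)
print(prevalence)
```

Only aggregate counts cross organizational boundaries in this step, which is consistent with the limited data exchange described above.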

2.4 Analytic approach

Outcome: The primary outcomes of interest were the prevalence of T1DM and T2DM in a pediatric population. Estimated prevalence before deduplication, within and between systems, was calculated by dividing total number of individuals with a diabetes diagnosis (of a given type) by the number of eligible patients. De-duplicated prevalence estimates divided the unique number of cases by the unique number of eligible patients. We reported confidence intervals (95%) for all prevalence estimates. Because the process of selecting data for an individual may influence case counts and prevalence estimates we tested three alternative decision rules: selecting the data source with the latest 2017 encounter, while ignoring case status (Alternate 1); selecting the data source with the initial 2017 encounter, while ignoring case status (Alternate 2); and selecting the data partner at random (Alternate 3).
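The prevalence calculation can be sketched as follows. The text does not state how the confidence intervals were computed; the normal-approximation (Wald) interval used here is an assumption, chosen because it is simple and can produce the negative lower bounds seen for small strata. The function name is hypothetical. The inputs are the deduplicated T1DM case count (758) and unique patient count (218 437) reported in the Results.

```python
import math


def prevalence_per_1000(cases, patients, z=1.96):
    """Prevalence per 1000 with a Wald (normal-approximation) 95% interval.

    The interval method is an assumption; the source does not specify one.
    """
    p = cases / patients
    half_width = z * math.sqrt(p * (1 - p) / patients)
    return 1000 * p, 1000 * (p - half_width), 1000 * (p + half_width)


estimate, lower, upper = prevalence_per_1000(758, 218_437)
print(f"{estimate:.1f} per 1000 ({lower:.2f}, {upper:.2f})")
```

The point estimate rounds to the 3.5 per 1000 reported for T1DM after deduplication; the bounds may differ slightly from the published interval depending on the method and rounding the authors used.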

3 RESULTS

Among 58 351 eligible children seen at DP1 and 167 569 seen at DP2 (Table 2), the DP2 population had a higher T1DM prevalence and a lower T2DM prevalence compared to the DP1 population. DP2's population was younger, had a greater proportion of male and white patients, and a smaller proportion of patients of Hispanic ethnicity than the DP1 population.

TABLE 2. Distribution of demographic characteristics and disease prevalence for patient populations (<19 years old) with any encounter during the study period among two data partners, seven-county Denver metro area, 2017

| Characteristic | Data partner 1 | Data partner 2 | P-value |
|---|---|---|---|
| Number of patients | 58 351 | 167 569 | n/a |
| Diabetes prevalence (per 1000) | | | |
| Type 1 | 1.6 | 4.1 | <.0001 |
| Type 2 | 1.2 | 0.9 | .03 |
| Sex (percent) | | | <.0001 |
| Female | 50 | 48 | |
| Male | 50 | 52 | |
| Unknown | 0 | <1 | |
| Age group in years (percent) | | | <.0001 |
| 0-3 | 21 | 33 | |
| 4-6 | 16 | 16 | |
| 7-9 | 16 | 15 | |
| 10-12 | 17 | 14 | |
| 13-15 | 17 | 14 | |
| 16-17 | 13 | 8 | |
| Race and ethnicity (percent) | | | <.0001 |
| Non-Hispanic (NH) White | 13 | 46 | |
| Hispanic | 68 | 32 | |
| NH Black | 14 | 8 | |
| NH Asian | 4 | 3 | |
| NH American Indian or Alaska Native | <1 | <1 | |
| NH multiple races | 1 | 4 | |
| NH race unknown or not reported | 2 | 7 | |
| Residing in census tract with ≥20% below federal poverty level (percent)a | | | <.0001 |
| Yes | 44 | 18 | |
| No | 56 | 82 | |

Note: P-values calculated using Pearson's Chi-squared test.
a Some addresses could not be geolocated to the census tract.

Aggregation across data partners, without deduplication, would have estimated 226 100 children from the Denver region seen by these two data partners. We identified 218 437 unique individuals after deduplication, with 7628 (3.5%) seen in both systems (Table 3). Individuals seen by both data partners had a higher prevalence of both T1DM and T2DM than individuals seen in a single system. Compared to individuals seen in only one system, duplicates were more likely to be identified as Hispanic, non-Hispanic black, or non-Hispanic Asian, and were substantially more likely to reside in a higher poverty neighborhood.

TABLE 3. Distribution of demographic characteristics and disease prevalence for patient populations (<19 years old) with any encounter during the study period among two data partners, by duplicate status, seven-county Denver metro area, 2017

| Characteristic | Duplicate: Yes | Duplicate: No | P-value |
|---|---|---|---|
| Number of patients | 7628 | 210 809 | |
| Diabetes prevalence (per 1000) | | | |
| Type 1 | 5 | 3.4 | .03 |
| Type 2 | 4 | 0.8 | <.0001 |
| Sex (percent) | | | .13 |
| Female | 49 | 48 | |
| Male | 51 | 52 | |
| Unknown | 0 | <1 | |
| Age group in years (percent) | | | <.0001 |
| 0-3 | 20 | 30 | |
| 4-6 | 19 | 16 | |
| 7-9 | 17 | 15 | |
| 10-12 | 15 | 15 | |
| 13-15 | 17 | 15 | |
| 16-18 | 11 | 9 | |
| Race and ethnicity (percent) | | | <.0001 |
| Non-Hispanic (NH) White | 10 | 40 | |
| Hispanic | 64 | 39 | |
| NH Black | 16 | 9 | |
| NH Asian | 4 | 3 | |
| NH American Indian or Alaska Native | <1 | <1 | |
| NH multiple races | 1 | 3 | |
| NH race unknown or not reported | 3 | 6 | |
| Residing in census tract with ≥20% below federal poverty level (percent)a | | | <.0001 |
| Yes | 46 | 23 | |
| No | 54 | 77 | |

a Some addresses could not be geolocated to the census tract.

The prevalence estimates of T1DM and T2DM before and after deduplication are presented in Table 4. Prevalence did not change after IM processes for either condition. There was no observed change in prevalence for any demographic or geographic subgroup after deduplication, even for the subgroups that were most affected by deduplication (eg, Hispanic patients).

TABLE 4. Prevalence (per 1000) of Type 1 and Type 2 diabetes among patient populations (<19 years) for all encounter types from two health care systems, before and after deduplication, seven-county Denver Metropolitan Area, Colorado, 2017

| Characteristic | Type 1, before deduplication | Type 1, after deduplication | Type 2, before deduplication | Type 2, after deduplication |
|---|---|---|---|---|
| Overall | 3.4 (3.2, 3.6) | 3.5 (3.3, 3.7) | 1.0 (0.9, 1.1) | 0.9 (0.8, 1.0) |
| Sex | | | | |
| Female | 3.6 (3.2, 4.0) | 3.7 (3.3, 4.1) | 1.1 (0.9, 1.3) | 1.0 (0.8, 1.2) |
| Male | 3.3 (3.0, 3.6) | 3.3 (3.0, 3.6) | 0.8 (0.6, 1.0) | 0.8 (0.6, 1.0) |
| Age in years | | | | |
| 0-3 | 0.4 (0.2, 0.6) | 0.4 (0.2, 0.6) | 0 (0, 0) | 0 (0, 0) |
| 4-6 | 1.6 (1.2, 2.0) | 1.6 (1.2, 2.0) | 0 (0, 0) | 0 (0, 0) |
| 7-9 | 3.2 (2.6, 3.8) | 3.2 (2.6, 3.8) | 0.3 (0.1, 0.5) | 0.3 (0.1, 0.5) |
| 10-12 | 5.5 (4.7, 6.3) | 5.6 (4.8, 6.4) | 0.8 (0.5, 1.1) | 0.8 (0.5, 1.1) |
| 13-15 | 7.1 (6.2, 8.0) | 7.2 (6.3, 8.1) | 2.2 (1.7, 2.7) | 2.0 (1.5, 2.5) |
| 16-17 | 7.5 (6.3, 8.7) | 7.7 (6.5, 8.9) | 4.9 (4.0, 5.8) | 4.8 (3.8, 5.8) |
| Race | | | | |
| Non-Hispanic (NH) White | 5.5 (5.0, 6.0) | 5.5 (5.0, 6.0) | 0.5 (0.3, 0.7) | 0.5 (0.3, 0.7) |
| Hispanic | 1.9 (1.6, 2.2) | 1.9 (1.6, 2.2) | 1.4 (1.2, 1.6) | 1.3 (1.1, 1.5) |
| NH Black | 3.1 (2.3, 3.9) | 3.0 (2.2, 3.8) | 1.8 (1.2, 2.4) | 1.6 (1.0, 2.2) |
| NH Asian | 1.0 (0.3, 1.7) | 1.1 (0.3, 1.9) | 0.6 (0.0, 1.2) | 0.6 (0.0, 1.2) |
| NH American Indian or Alaska Native | 4.3 (−0.6, 9.2) | 4.5 (−0.6, 9.6) | 2.9 (−1.1, 6.9) | 3.0 (−1.1, 7.2) |
| NH multiple races | 2.5 (1.3, 3.7) | 2.5 (1.3, 3.7) | 0.4 (−0.1, 0.9) | 0.4 (−0.1, 0.9) |
| NH race unknown or not reported | 3.4 (2.4, 4.4) | 3.4 (2.4, 4.4) | 0.2 (0.0, 0.4) | 0.2 (−0.1, 0.5) |
| Residing in census tract with ≥20% below federal poverty levela | | | | |
| Yes | 2.2 (1.8, 2.6) | 2.1 (1.7, 2.5) | 1.4 (1.1, 1.7) | 1.3 (1.0, 1.6) |
| No | 3.9 (3.6, 4.2) | 3.9 (3.6, 4.2) | 0.8 (0.7, 0.9) | 0.8 (0.7, 0.9) |

a Some addresses could not be geolocated to the census tract.

Concordance of recorded demographic attributes for duplicate patients was variable. Duplicate patients had high recorded gender agreement (98%) and were likely to have the same case status (>99% for both T1DM and T2DM). However, substantial discordance of recorded race and ethnicity was observed between systems. Agreement on Hispanic ethnicity was relatively high (86%), yet race agreed between systems for only 53% of individuals; race data were often missing or unknown in one system but not the other.
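A simple percent-agreement calculation of this kind might look like the sketch below. The paired values and the function name are hypothetical, and treating a missing or unknown value in one system as a disagreement is an assumption; the text does not specify how such pairs were scored.

```python
def percent_agreement(pairs):
    """Percent of duplicate patients whose two recorded values match exactly."""
    matches = sum(1 for value_dp1, value_dp2 in pairs if value_dp1 == value_dp2)
    return 100 * matches / len(pairs)


# Hypothetical recorded race for four duplicate patients: (DP1 value, DP2 value).
race_pairs = [
    ("White", "White"),
    ("Black", "Unknown"),  # unknown in one system counts as a disagreement here
    ("Asian", "Asian"),
    ("Unknown", "White"),
]

race_agreement = percent_agreement(race_pairs)
print(race_agreement)
```

Under this scoring, missing data in either system lowers agreement even when the recorded values do not actually conflict.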

While insufficient to affect the post-deduplication prevalence estimate for either condition, the selection method did influence the number of cases that we identified. The approach prioritizing diagnosis identified the largest number of cases (758 T1DM cases; 201 T2DM cases). Implementing alternate selection logic resulted in 7 to 10 fewer cases of T1DM and 5 to 11 fewer cases of T2DM, depending on the algorithm.

When restricting the analysis to ambulatory encounters, we observed a lower prevalence of T1DM (2.5 cases per 1000, 95% CI: 2.2-2.8) than the prevalence including all encounter types (3.5 cases per 1000, 95% CI: 3.3-3.7). There was no evidence that restricting to ambulatory encounters affected T2DM prevalence. As with the all-encounter-types analysis, in the ambulatory encounter-only analysis, there was no observable change in prevalence after deduplication overall or for any demographic subgroups (results not displayed). We also identified fewer diabetes cases of either type in the ambulatory-only analysis than in the all-encounters analysis. We identified 399 fewer T1DM cases (53% of 758) and 58 fewer T2DM cases (29% of 201), depending on the choice of care setting (primarily) as well as the selection algorithm.

4 DISCUSSION

This study describes a process we designed to link and deduplicate individuals for prevalence estimate activities using a regional DDN. To our knowledge, this is one of the very few studies to report implementing HIPAA-compliant IM across a DDN to generate deduplicated prevalence estimates.24

The process we designed and tested generated and stored a network-wide identifier for use in distributed public health queries. The two-step query process limited the amount of PII exchanged to a parsimonious limited data set. While we chose T1DM and T2DM in youth as the chronic conditions to test in the development of the algorithm, the deduplication method could be adapted to other chronic conditions including refinement for more prevalent or episodic conditions (eg, depression or substance use disorder). In addition, more refined case definitions for T1DM and T2DM could improve the accuracy of the reported prevalence estimates.

Importantly, unlike many population-based surveys used for public health surveillance, DDN-based prevalence estimates integrated with the LHS mindset provide a powerful approach to evaluating interventions in a given region. Prevalence is a metric that can help health systems, county health departments and others continuously learn how to best respond to pressing public health challenges, including but not limited to diabetes.

In this initial pilot test of our process, involving only two data partners, deduplication had no measurable effect on pediatric diabetes prevalence estimates. Very low disease prevalence estimates may have resulted in fewer opportunities for cross utilization. Analyses were limited to only two data partners; neither was a referral center for diabetes. The relatively small degree of overlap between the data partners was unexpected - given referring relationships and geographic proximity - and likely contributed to the null finding. Because we selected just 1 year of data, utilization overlap might have been greater had we extended the observation period. Approximately 4% of the 218 437 pediatric patients included in this pilot were represented in both systems during 2017 (n = 7628). The prevalence of T1DM and T2DM was higher in the duplicate population, but duplicates represented a very small share of the overall number of patients. Furthermore, individuals who were not assigned network identifier values (eg, missing critical matching variables) were excluded from this analysis. Our findings might have been different if the prevalence of diabetes or the degree of overlap differed considerably from the population
