The Data-Adaptive Fellegi-Sunter Model for Probabilistic Record Linkage: Algorithm Development and Validation for Incorporating Missing Data and Field Selection


Introduction

Quality patient care requires comprehensive health care data from a broad set of sources. Electronic medical record (EMR) data are increasingly distributed across many sources as the era of digital health care accelerates in the United States. However, EMR data from independent databases often lack a common patient identifier, which impedes data aggregation, causes inefficiencies (eg, unnecessarily repeated tests), compromises patient care, and hinders research. Record linkage is therefore a requisite step for effective and efficient patient care and research. Without a unique universal patient identifier, linking patient records is a nontrivial task. The simplest class of approaches is the deterministic method, which requires strict identity of selected data elements of a pair of records, such as name, birthdate, gender, and Social Security number. Although deterministic algorithms are generally simple to implement and achieve excellent specificity, they have low sensitivity, are not robust to missing data, cannot quantify the uncertainty of the matching process, and are inflexible to changing data characteristics.

The Fellegi-Sunter (FS) [] model is widely used for probabilistic record linkage based on the binary agreement or disagreement of a select set of fields of record pairs, such as Social Security number, first name, middle name, last name, and date of birth. The FS model is in essence a latent class model applied to record linkage problems. The latent class variable is the unobserved true match status, and the parameters in the model are the match prevalence, the probabilities of field agreement among true matches (m-probabilities), and the probabilities of field agreement among nonmatches (u-probabilities). A record pair’s matching weights are defined as the logarithms of the m- and u-probability ratios, and the sum of the weights is the matching score of the pair. Record pairs are then classified into matches and nonmatches based on their matching scores for a given threshold. Linking algorithms based on the FS model have been shown to outperform deterministic algorithms []. However, methodological gaps exist in configuring and applying the FS model.

First, it is well known that missing data are prevalent in real-world EMR data []. Data necessary for matching records are often missing from clinical data for many reasons: values may be coded as “unknown,” nonexistent (eg, a person with no middle name), or omitted due to privacy concerns (eg, Social Security number). Missing field values decrease the information content in the data and consequently hinder matching accuracy. Matching only records with full information is undesirable because it excludes many records and thus misses matches. One study found that mother’s date of birth was often absent because it was not the focus of pediatricians’ attention []. However, this information significantly improved the linkage procedure when present. Therefore, effective accommodation of missing data is needed to maximize linkage. Common strategies in practice involve excluding records with missing values in any of the matching fields when estimating match weights [] or treating a missing field’s agreement status as disagreement [] (missing as disagreement [MAD]). The former lacks efficiency because of the loss in sample size due to exclusion. The latter does not account for the fact that true matches can contain missing fields and lacks a theoretical justification. Another strategy is to model missing data in a matching field as a third category, in addition to the categories of agree and disagree []. However, it is well established that including missing data by adding a “missing” category causes serious biases, even when data are missing completely at random [-]. In a model-based approach, Enamorado et al [] assume that data in matching fields are missing at random (MAR) conditional on the true match status. Their comprehensive simulation studies show that the FS model with MAR incorporated outperforms deterministic linkage in social science applications linking voter files. How the FS model with MAR incorporated compares with the FS model applied to zero-filled data, in which missing values in the original data are replaced by 0 under MAD, has not been evaluated. Furthermore, while the MAR approach has been evaluated on and applied to voter registries, its performance in linking EMR files is not known.

Second, although there may be numerous fields (or attributes) across record files, not all of them are useful for matching. For example, when matching 2 obstetrics and gynecology databases, the field “gender” is not informative. In real-world data, there are also likely dependencies among the data fields. As we have demonstrated [], the FS model exhibits poor matching accuracy when the fields are highly correlated. As more fields are used in the FS model, more dependencies may be introduced. Ideally, the FS model should use a minimally sufficient set of fields. However, we are unaware of data-driven methods for matching field selection. In practice, expert input is solicited to identify an appropriate subset of matching fields. Several iterations may be required to achieve the desired match accuracy using a manually reviewed data set with known match statuses among record pairs. This process is neither scalable nor generalizable and is infeasible in privacy-preserving record linkage []. We are also unaware of any work that evaluates the effects of missing data treatment and matching field selection simultaneously.

We will evaluate the effects of incorporating missing data treatment and matching field selection into the FS algorithm on linkage performance using 4 real-world use cases in our local operational data aggregation system, a health information exchange (HIE) environment into which different data sources are integrated. The 4 use cases included HIE record deduplication (labeled as Indiana Network for Patient Care [INPC]), linkage of public health registry records from the Marion County Health Department to the HIE (labeled as MCHD), linkage of Social Security Death Master File records of the Social Security Administration to the HIE (labeled as SSA), and deduplication of newborn screening records (labeled as NBS). We hypothesize that proper treatment of missing data and data-driven matching field selection will enhance linkage performance.


Methods

Blocking

Records need to be compared in record linkage to ascertain whether they belong to the same entity. Forming record pairs by the Cartesian product of the 2 files (or of a file with itself in the case of deduplication) results in an enormously large number of pairs. For example, the data set from the INPC (the INPC use case) has 47,334,986 records (Table 1) and would form 2.24 quadrillion record pairs by the Cartesian product. A common strategy is “blocking on” certain fields (blocking variables) to reduce the number of record pairs; that is, retaining only those record pairs with exact agreement in the blocking variables. Blocking helps to enrich matches by restricting the search space. We applied 5 blocking schemes to each use case. In the INPC use case, the five blocking schemes are the Social Security number (SSN); first name and telephone number (FN-TEL); day, month, and year of birth and zip code (DB-MB-YB-ZIP); first name, last name, and year of birth (FN-LN-YB); and day, month, and year of birth and last name (DB-LN-MB-YB). These five blocking schemes contained 613 million record pairs, with the number of pairs in each block listed in Table 1. Within each block, record pairs are compared field by field for a collection of matching fields, yielding a vector of comparison results for each pair. For example, if only 3 matching fields are compared by exact comparison (for agreement or disagreement), the vectors will have 2^3=8 possible patterns when there are no missing data. In general, if K matching fields are compared, there will be 2^K total agreement patterns.
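To make the blocking step concrete, the following is a minimal Python sketch, not our production implementation; the record layout, field names, and the block_pairs helper are hypothetical. Records that agree exactly on every field of a blocking key are paired within that block, and records missing any key field cannot participate in that scheme.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key_fields):
    """Pair records that agree exactly on all key_fields (one blocking scheme).

    records: list of dicts; None marks a missing field value. Records missing
    any key field are excluded from this scheme (they may still be paired
    under another blocking scheme)."""
    buckets = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if None not in key:                      # cannot block on missing values
            buckets[key].append(rec["rec_id"])
    for ids in buckets.values():
        yield from combinations(sorted(ids), 2)  # unordered candidate pairs

# Example with the DB-MB-YB-ZIP scheme: records 1 and 2 share birthdate and
# zip code, so they form a candidate pair; record 3 is missing ZIP.
records = [
    {"rec_id": 1, "DB": 5, "MB": 7, "YB": 1980, "ZIP": "46202"},
    {"rec_id": 2, "DB": 5, "MB": 7, "YB": 1980, "ZIP": "46202"},
    {"rec_id": 3, "DB": 5, "MB": 7, "YB": 1980, "ZIP": None},
]
print(list(block_pairs(records, ["DB", "MB", "YB", "ZIP"])))  # [(1, 2)]
```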

Table 1. Summary of four use cases, Indiana Network for Patient Care (INPC), newborn screening (NBS), Social Security Administration (SSA), and Marion County Health Department (MCHD), with information on the number of records in each use case, blocking schemes, and the numbers of record pairs in blocking schemes.

INPC (47,334,986 records)
  SSNa: 53,054,690 pairs
  FN-TELb: 41,729,402 pairs
  DB-MB-YB-ZIPc: 133,553,036 pairs
  FN-LN-YBd: 193,865,283 pairs
  DB-LN-MB-YBe: 191,181,498 pairs

NBS (765,813 records)
  MRNf: 4,147,098 pairs
  TELg: 2,644,454 pairs
  MB-DB-ZIP: 8,083,396 pairs
  LN-FNh: 3,005,368 pairs
  NK_LN-NK_FNi: 1,217,736 pairs

SSA (89,556,520 records)
  SSN: 805,331 pairs
  FN-LN-ZIP: 18,103 pairs
  FN-LN-MI-YB: 1,395,395 pairs
  FN-LN-MI-DB-MB: 547,376 pairs
  FN-LN-DB-MB-YB: 722,167 pairs

MCHD (471,298 records)
  SSN: 869,454 pairs
  TEL: 28,238 pairs
  DB-MB-YB-ZIP: 5,083,429 pairs
  FN-LN-YB: 3,378,017 pairs
  DB-LN-MB-YB: 3,701,460 pairs

aSSN: Social Security number.

bFN-TEL: first name and telephone number.

cDB-MB-YB-ZIP: day, month, and year of birth and zip code.

dFN-LN-YB: first name, last name, and year of birth.

eDB-LN-MB-YB: day, month, and year of birth and last name.

fMRN: medical record number.

gTEL: telephone number.

hLN-FN: last name, first name.

iNK_LN, NK_FN: next of kin last name and first name.

The FS Model

Formally, for the ith pair of records, let δi denote the unobserved true match status (a latent binary class variable) with a value of 1 indicating a match and 0 indicating a nonmatch (ie, the class label for the match and nonmatch classes), Yi=(Yi1, … ,YiK) be the vector of agreements in K fields, and yi=(yi1, … ,yiK) be the observed agreements. In addition, let n be the total number of pairs and ρ=P(δ=1) be the match prevalence in the total n pairs of records. Assuming independent observations (yi, δi), i=1, … ,n, the complete data likelihood is

\[ L_c = \prod_{i=1}^{n} \left[\rho\, P(Y_i = y_i \mid \delta_i = 1)\right]^{\delta_i} \left[(1-\rho)\, P(Y_i = y_i \mid \delta_i = 0)\right]^{1-\delta_i}, \]

and the marginal distribution of yi is

\[ P(Y_i = y_i) = \rho\, P(Y_i = y_i \mid \delta_i = 1) + (1-\rho)\, P(Y_i = y_i \mid \delta_i = 0). \]

For a given i, the posterior probability of δi=1 is

\[ P(\delta_i = 1 \mid y_i) = \frac{\rho\, P(y_i \mid \delta_i = 1)}{\rho\, P(y_i \mid \delta_i = 1) + (1-\rho)\, P(y_i \mid \delta_i = 0)}. \]

If the true match statuses δi were known, the MLE of ρ for the complete data likelihood would be \( \hat{\rho} = \tfrac{1}{n}\sum_{i=1}^{n} \delta_i \). When the δi are unknown, this problem is known as latent class modeling because the model parameters are estimated without the class label being observed.

A popular algorithm, named after Fellegi and Sunter [] in the probabilistic record linkage literature, further assumes

\[ P(Y_i = y_i \mid \delta_i) = \prod_{k=1}^{K} P(Y_{ik} = y_{ik} \mid \delta_i); \]

that is, conditional independence of Yi1, … ,YiK within each latent class. The FS model greatly simplifies the estimation process, producing estimates of the field-specific probability of agreement given that a pair is a match, mk=P(Yik=1|δi=1), and the field-specific probability of agreement given that a pair is a nonmatch, uk=P(Yik=1|δi=0). Model estimates can be obtained by using the Expectation-Maximization (EM) algorithm [] on the complete data likelihood or by using standard optimization routines on the marginal likelihood. The FS approach allows the model parameters to be estimated based on the observed agreements of pairs without the use of a training set, qualifying it as an unsupervised learning algorithm.
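As an illustration of the EM route, here is a minimal Python sketch of the FS latent class model under conditional independence. The function name, starting values, and convergence rule are our own choices, and the sketch assumes complete (nonmissing) agreement vectors; missing comparisons are discussed in the Treatment of Missing Data section.

```python
import numpy as np

def fs_em(Y, max_iter=500, tol=1e-8):
    """EM estimation for the FS model under conditional independence.

    Y: (n, K) 0/1 array of field agreements with no missing values.
    Returns estimates of rho (match prevalence), m, and u."""
    n, K = Y.shape
    rho, m, u = 0.1, np.full(K, 0.9), np.full(K, 0.1)   # starting values
    for _ in range(max_iter):
        # E-step: posterior probability g_i = P(delta_i = 1 | y_i)
        log_m = Y @ np.log(m) + (1 - Y) @ np.log(1 - m)
        log_u = Y @ np.log(u) + (1 - Y) @ np.log(1 - u)
        num = rho * np.exp(log_m)
        g = num / (num + (1 - rho) * np.exp(log_u))
        # M-step: posterior-weighted sample proportions
        rho_new = g.mean()
        m = np.clip(g @ Y / g.sum(), 1e-6, 1 - 1e-6)            # m_k estimate
        u = np.clip((1 - g) @ Y / (1 - g).sum(), 1e-6, 1 - 1e-6)  # u_k estimate
        if abs(rho_new - rho) < tol:
            rho = rho_new
            break
        rho = rho_new
    return rho, m, u
```

The E-step computes each pair's posterior match probability from the current (ρ, m, u); the M-step replaces the parameters with posterior-weighted proportions.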

Classification of Record Pairs

Match scores are defined as the logarithm of the likelihood ratio,

\[ S_i = \log \frac{P(Y_i = y_i \mid \delta_i = 1)}{P(Y_i = y_i \mid \delta_i = 0)}. \]

Under the conditional independence assumption, the match score for the ith pair is the sum of the logarithms of the field-specific likelihood ratios,

\[ S_i = \sum_{k=1}^{K} \log \frac{P(Y_{ik} = y_{ik} \mid \delta_i = 1)}{P(Y_{ik} = y_{ik} \mid \delta_i = 0)} = \sum_{k=1}^{K} \left[ y_{ik} \log \frac{m_k}{u_k} + (1 - y_{ik}) \log \frac{1 - m_k}{1 - u_k} \right]. \]

Match scores are computed using the estimated mk and uk from the FS model and are in turn used to rank all record pairs, with a higher score indicating a higher likelihood that a record pair is a match. In our study, we used the estimated match prevalence \( \hat{\rho} \) to set the threshold as the upper \( \hat{\rho} \)-th quantile of the scores. A record pair is then declared a match if its score is greater than the threshold; otherwise, it is declared a nonmatch.
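A minimal sketch of scoring and thresholding, continuing the NumPy conventions of the EM sketch above (the function names are ours):

```python
import numpy as np

def match_scores(Y, m, u):
    """Per-pair match score: the sum of field-level log likelihood ratios."""
    return Y @ np.log(m / u) + (1 - Y) @ np.log((1 - m) / (1 - u))

def classify_pairs(Y, rho, m, u):
    """Declare the top rho fraction of pairs (by score) to be matches,
    mirroring the upper rho-th quantile threshold described above."""
    scores = match_scores(Y, m, u)
    threshold = np.quantile(scores, 1 - rho)
    return scores > threshold
```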

Treatment of Missing Data

Formally describing the missing data mechanism is important for devising an approach to account for missing data. Missing data are generally classified into 3 types []. First, the most restrictive type of missing data is missing completely at random (MCAR), which assumes that the missingness in a variable is independent of all observed or unobserved variables. In this situation, the parameter estimates are unbiased when record pairs with any missing data are excluded. However, omitting missing data may lower the precision of estimated parameters due to the smaller sample size. In addition, MCAR is a strong assumption that cannot be verified with the data at hand. Second, MAR is a less-restrictive yet more realistic missing data model that assumes that the missingness in a variable is independent of unobserved data, although it can depend on other observed variables. Finally, missing not at random (MNAR) asserts that the missingness of a variable is related to the unobserved variable itself. To handle MNAR, knowledge of the missing mechanism is required to model the missing process in the estimation of the parameters and matching scores.

In record linkage applications, missing values in matching fields are typically handled by excluding records with missing values in any of the matching fields when estimating match weights [] or by treating the field’s agreement status as a disagreement []. Excluding records with missing values is justifiable only when the data are MCAR. Thus, excluding records when the MCAR assumption does not hold leads to inaccurate results due to bias and low precision; the bias arises from the wrong model assumption and the low precision from the reduced sample size. Alternatively, treating missing data as disagreement (MAD) implicitly invokes an MNAR assumption, which may yield inaccurate results when its underlying premise, that all missing data represent disagreement, is incorrect. This strong assumption is likely false for data to be linked. For example, if the middle name is absent because it does not exist, a missing value in both records of the record pair can provide information that the 2 records belong to the same person. In contrast, the assumption of MAR is the least restrictive among the 3 types of missing mechanisms, and we hypothesize that it will yield superior match performance. Assuming MAR, missing data are handled using the full information likelihood approach, which uses all available data (ignoring the matching fields with missing values) in the FS model under the assumption of conditional independence of the matching fields.

The predictive results are obtained in the same way for the FS model with MAR and with MAD; the difference lies in the manner in which missing data are treated. When MAD is used, fields with missing data are set to “disagreement” (coded as 0), and the FS algorithm as is can proceed on the data with missing values replaced by zeros. When MAR is used, the FS algorithm is run on the nonmissing data. In either case, the parameters mk and uk and the match prevalence are estimated, and match scores are calculated for all pairs. The threshold for a pair to be a match is set to the upper \( \hat{\rho} \)-th quantile of the scores. A record pair is then declared a match if its score is greater than the threshold; otherwise, it is declared a nonmatch.
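The following sketch contrasts the two treatments in the E-step. Missing comparisons are coded as NaN; the mar flag and function name are ours. Under MAR, the missing fields simply drop out of both class likelihoods (the full information likelihood), whereas under MAD they are first recoded as disagreement.

```python
import numpy as np

def posterior_match_prob(Y, rho, m, u, mar=True):
    """P(delta_i = 1 | observed comparisons); NaN in Y marks a missing field.

    mar=True : full information likelihood -- terms for missing fields are
               dropped from both the match and nonmatch likelihoods.
    mar=False: MAD -- missing comparisons are recoded as disagreement (0)."""
    Y = np.asarray(Y, dtype=float).copy()
    if not mar:
        Y = np.where(np.isnan(Y), 0.0, Y)      # zero-filled data
    obs = ~np.isnan(Y)                          # mask of observed comparisons
    Y0 = np.where(obs, Y, 0.0)                  # placeholder, masked out below
    log_m = (obs * (Y0 * np.log(m) + (1 - Y0) * np.log(1 - m))).sum(axis=1)
    log_u = (obs * (Y0 * np.log(u) + (1 - Y0) * np.log(1 - u))).sum(axis=1)
    num = rho * np.exp(log_m)
    return num / (num + (1 - rho) * np.exp(log_u))
```

In a full MAR fit, the M-step would likewise average only over the observed comparisons for each field.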

Selection of Matching Fields

Fields that are 100% missing within a blocking scheme contain no information and are not considered further. We examined 2 approaches to selecting matching fields: the standard practice of subject matter expert-guided field selection and a data-driven approach. In the data-driven approach, all fields were considered putative matching fields. A necessary condition for a field to be useful in matching is that it exhibits variability; for example, if the value of a field is fixed (no variation), it cannot separate matches from nonmatches. Thus, a blocking variable can no longer be used as a matching field in a block formed using that blocking variable. When running an FS model, we started with the largest possible set of fields; fields may then be dropped from the model, starting with the fields with the least variation, until the FS algorithm converges (see the sketch below).
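A minimal sketch of this backward-elimination loop; fit_fn is a hypothetical callback standing in for an FS fit that reports whether it converged, and the variance criterion follows the strategy described above.

```python
import numpy as np

def select_fields(Y, field_names, fit_fn, min_fields=2):
    """Drop the least-variable field until the FS fit converges.

    Y: (n, K) comparison array with NaN marking missing comparisons.
    fit_fn: hypothetical callback taking a column subset of Y and
    returning (converged: bool, params)."""
    fields = list(field_names)
    while len(fields) >= min_fields:
        idx = [field_names.index(f) for f in fields]
        converged, params = fit_fn(Y[:, idx])
        if converged:
            return fields, params
        # variance of each field's observed (non-missing) agreement indicator
        variances = [np.nanvar(Y[:, field_names.index(f)]) for f in fields]
        fields.pop(int(np.argmin(variances)))   # drop the least-variable field
    raise RuntimeError("no converged FS model found")
```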

Data Sets of 4 Use Cases and Gold Standards

We evaluated the matching performance of the missing data treatment (MAD and MAR) and matching field selection (expert-specified fields vs data-driven fields) by conducting a 2-by-2 factorial design using 4 real-world use cases in our local HIE environment. The 4 use cases contain data that were generated as part of clinical or public health processes.

The 4 use cases included deduplicating clinical records in a state-level HIE, linking a public health registration file to clinical data in the HIE, linking death records to clinical data in the HIE, and deduplicating the Health Level Seven International (HL7) messages for newborns less than 1 month of age from the HIE. For each use case, blocking was performed to confine the comparisons to a subspace of record pairs enriched with true matches []. Five blocking schemes were selected for each use case based on expert input from our laboratory. The total number of records for each use case and the number of record pairs per blocking scheme are listed in Table 1. To assess the matching performance, we selected record pairs for human review by performing proportional sampling from the union of record pairs, with strata defined by the five blocking schemes (see the sketch after Table 2). To compare the sensitivity of the algorithms, a total of 5884 true matches were necessary to test a 2% absolute difference in discordant rates of the 2 algorithms among the true matches with 80% power at a 2-sided significance level of .05. The same number of true nonmatches was required to test a 2% difference in specificity. We sampled record pairs until we reached at least 5884 pairs in the class of true matches and in the class of true nonmatches. Each record pair was reviewed by 2 reviewers; in the case of a disagreement in the classification of the pair, a third reviewer adjudicated the pair. Table 2 summarizes the manually reviewed sets for the 4 use cases.

Table 2. Manual review results for the 4 use cases.

Use case   Number of pairsa   Pairs deemed matches   Pairs deemed nonmatches   Match prevalenceb
INPCc      15,000             7840                   7160                      0.523
SSAd       16,500             5950                   10,550                    0.361
NBSe       15,000             7967                   7033                      0.531
MCHDf      15,500             5927                   9573                      0.382

aNumber of pairs is the total number of pairs sampled for manual review; review classified each pair as either a match or a nonmatch.

bMatch prevalence is the ratio of the number of pairs deemed matches to the total number of pairs manually reviewed for each use case.

cINPC: Indiana Network for Patient Care.

dSSA: Social Security Administration.

eNBS: newborn screening.

fMCHD: Marion County Health Department.
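A minimal sketch of the proportional sampling used to build the review sets. The exact protocol (eg, how pairs appearing in several blocks are assigned to strata, and the resampling loop used to reach the 5884-per-class target) is not fully specified above, so the assignment rule and function name here are assumptions.

```python
import random

def proportional_sample(blocks, total_n, seed=0):
    """Sample pairs for manual review, proportionally to stratum size.

    blocks: dict mapping blocking-scheme name -> list of candidate pairs.
    A pair found in several blocks is assigned to the first stratum in
    which it appears (an assumed tie-breaking rule). In practice, one
    would repeat the draw until each review class holds >= 5884 pairs."""
    random.seed(seed)
    seen, strata = set(), {}
    for name, pairs in blocks.items():          # strata over the union of blocks
        strata[name] = [p for p in pairs if p not in seen]
        seen.update(strata[name])
    n_union = sum(len(v) for v in strata.values())
    sample = []
    for name, pairs in strata.items():
        k = round(total_n * len(pairs) / n_union)   # proportional allocation
        sample.extend(random.sample(pairs, min(k, len(pairs))))
    return sample
```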

Deduplicating HIE (INPC)

This data set reflected demographic records from geographically proximal hospital systems that participate in the HIE. Blocking was as described earlier. The data contained a subset of 15,000 sampled gold standard pairs with 7840 (52.3%) true positives and 7160 (47.7%) true negatives. Patients from hospitals in close proximity cross over to nearby institutions, creating the need to identify common records. New value-based purchasing models, such as Accountable Care Organizations, have dramatically increased the need to identify and capture information on patients seeking care from other institutions.

HIE and Vital Records for Ascertaining Death Status (SSA)

These data reflect a combination of the Social Security Death Master File and HIE data. We applied five blocking schemes (Table 1): SSN; first name, last name, and zip code (FN-LN-ZIP); first name, last name, middle initial, and year of birth (FN-LN-MI-YB); first name, last name, middle initial, and day and month of birth (FN-LN-MI-DB-MB); and first name, last name, and day, month, and year of birth (FN-LN-DB-MB-YB). This data set contained a subset of 16,500 sampled gold standard pairs with 5950 (36.1%) true positives and 10,550 (63.9%) true negatives. Accurately and comprehensively updating health records with patients’ death status is critical for robust clinical quality measurement, public health reporting, and high-quality clinical research.

Deduplicating Newborn Registration Data (NBS)

This data set included demographic data for newborns derived from multiple hospitals and clinics within the HIE. These data were limited to patients aged <2 months. We applied five blocking schemes (Table 1): medical record number (MRN); telephone number (TEL); month and day of birth and zip code (MB-DB-ZIP); last name and first name (LN-FN); and next of kin’s last name and first name (NK_LN-NK_FN). This data set contained a subset of 15,000 sampled gold standard pairs, with 7967 (53.1%) true positives and 7033 (46.9%) true negatives. Matching in this cohort is important because not all infants receive appropriate screening for harmful or potentially fatal disorders that are otherwise unapparent at birth []. Public health screening tests must be linked to patient records to avoid harmful delays in diagnosis.

Public Health Registry Linked to Clinical Registrations (MCHD)

This data set comes from the MCHD, Indiana’s largest public health department. The registry contains a master list of demographic information for clients who receive public health services such as immunization; Women, Infants, and Children nutrition support; and laboratory testing [,]. The registry also tracks population health trends and supports other public health activities. Duplicate patient records are often unintentionally added. We applied five blocking schemes (Table 1): SSN; telephone number (TEL); day, month, and year of birth and zip code (DB-MB-YB-ZIP); first name, last name, and year of birth (FN-LN-YB); and day, month, and year of birth and last name (DB-LN-MB-YB). This data set contained a subset of 15,500 sampled gold standard pairs with 5927 (38.2%) true positives and 9573 (61.8%) true negatives. We linked the complete patient registry to patient records in the aforementioned HIE.

The 4 data sets contained subsets of the following fields: MRN, SSN, last name (LN), first name (FN), middle initial (MI), nickname (NICK_SET), ethnicity (ETH_IMP), sex, month of birth (MB), day of birth (DB), year of birth (YB), street address (ADR), city, state (ST), zip code (ZIP), telephone number (TEL), email, last name of next of kin (NK_LN), first name of next of kin (NK_FN), first name of treating physician (DR_FN), and last name of treating physician (DR_LN). The last 4 fields were used only in the NBS use case.

Analyses of Use Cases

For each use case, blocking was performed first, and five blocks of record pairs were generated. The blocking schemes are listed in Table 1. The FS model was applied 4 times in each block based on the 2-by-2 factorial design, where missing data are treated using either MAD or MAR, and matching fields are either expert-specified or selected by the data-driven method. The parameters of the FS model can be estimated using the Newton-Raphson approach or the EM algorithm, both of which maximize the likelihood function of the model. Exact agreement on the following fields (when available for a use case) was considered in the matching in the INPC, MCHD, and SSA use cases: street address, city (in address), DB, MB, YB, email, ethnicity, FN, LN, MI, MRN, nickname, sex, state, ZIP, and TEL. MCHD and SSA do not have MRN; all 4 use cases include the nickname and ethnicity as derived fields. The fields used in the NBS use case were slightly different; exact agreement on the following fields was considered: street address, city, DB, MB, YB, physician’s FN, physician’s LN, email address, ethnicity, FN, LN, MI, nickname, MRN, NK_FN, NK_LN, sex, SSN, state, TEL, and ZIP. The fields available for exact-agreement matching and their percentages of missing values are summarized in - for the 4 use cases.

Within each run of the FS model, the estimate of block-specific prevalence under each missing treatment was used to classify record pairs as matches and nonmatches (see Classification of Record Pairs); the union of matches from all 5 blocks is the set of matches obtained.
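The union step is simple but worth stating precisely; in this sketch (names ours), a pair counts as a match overall if any block's FS run declares it one.

```python
def overall_matches(block_matches):
    """Union of block-level decisions: a pair is a match if any blocking
    scheme's FS run classifies it as one.

    block_matches: dict mapping block name -> set of (id1, id2) pairs
    declared matches within that block."""
    matched = set()
    for pairs in block_matches.values():
        matched |= pairs
    return matched
```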

Evaluation of Record Linking Performance

To evaluate the accuracy of these matching models, we calculated the following metrics: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1-score, as well as their respective 95% CIs based on 999 bootstrap samples. The abovementioned metrics were estimated from gold standard sets for which manual reviews established the true match status. Our primary evaluation metric was the F1-score.
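A minimal sketch of these metrics with percentile bootstrap CIs over 999 resamples, as described above; the function names are ours, and the sketch assumes both classes are present in every resample.

```python
import numpy as np

def linkage_metrics(truth, pred):
    """Sensitivity, specificity, PPV, NPV, and F1-score for 0/1 arrays."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    tp = np.sum((truth == 1) & (pred == 1))
    tn = np.sum((truth == 0) & (pred == 0))
    fp = np.sum((truth == 0) & (pred == 1))
    fn = np.sum((truth == 1) & (pred == 0))
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv,
            "npv": npv, "f1": 2 * ppv * sens / (ppv + sens)}

def bootstrap_ci(truth, pred, metric="f1", B=999, alpha=0.05, seed=0):
    """Percentile bootstrap CI from B resamples of the gold standard pairs."""
    rng = np.random.default_rng(seed)
    truth, pred = np.asarray(truth), np.asarray(pred)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, len(truth), len(truth))  # resample with replacement
        stats.append(linkage_metrics(truth[idx], pred[idx])[metric])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```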

Ethics Approval

This study was reviewed and approved by the Indiana University Institutional Review Board (IRB#: 1703755361).


Results

The 4 use cases contained missing data to various extents. Notably, 45.8% of the 47,334,986 records in the INPC use case had no SSN, making it necessary to add blocking schemes that do not rely on the SSN. For the NBS use case, the SSN is typically missing because infants do not receive an SSN for at least 2 to 6 weeks after birth, and often later if parents do not initially request the identifier. When linking the other 2 use cases, SSA and MCHD, to the INPC, because the INPC data set is missing SSN in 45.8% of its records, blocking on SSN alone yielded only 4547 out of 5950 (76%) and 1531 out of 5927 (26%) of true matches in SSA and MCHD, respectively, based on the manually reviewed subsets ( and ).

Additional blocking schemes are essential to increase match sensitivity. As the FS algorithm is performed using paired data per blocking scheme, its performance is directly affected by the extent of missing values in the agreement vectors obtained by comparing pairs of records within each block. We summarized the proportions of missing data in the 5 blocking schemes of each use case in -. The proportion of missing values per block ranges from 0% to 100%. Fields that were missing 100% within a blocking scheme contained no information and, therefore, were not considered further. The extent of missing values in a matching field does not necessarily negatively correlate with the discriminating power of the field.

Even matching fields with substantial missing values proved useful in discriminating matches from nonmatches. For example, the agreement status of the email address comparison is missing for 99% of record pairs in the DB-LN-MB-YB blocking scheme of the INPC use case; the m- and u-probabilities were estimated to be 0.01147 and 0.000204 under MAD and 0.3830 and 0.02553 under MAR, respectively. The large ratios of the m-probability over the u-probability in either case indicate the utility of the email address in linkage. As another example, the agreement status of the zip code comparison is also missing for 99% of record pairs in the FN-LN-MI-YB block of the SSA use case, and the m- and u-probabilities were estimated to be 0.02073 and 8.49×10−7 under MAD and 0.7538 and 0.000137 under MAR, respectively. In both examples, the estimates of the m- and u-probabilities are much larger under MAR than under MAD, suggesting that a downward bias might be incurred by artificially setting missing values to disagreement in the MAD approach.
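To make the “large ratios” concrete, the snippet below turns the reported email-field estimates into agreement weights. Base-2 logarithms are one common convention in the record linkage literature (the base is not stated above), so treat the weight scale as illustrative.

```python
import math

# m- and u-probability estimates for the email field in the DB-LN-MB-YB
# block of the INPC use case, as reported in the text.
estimates = [("MAD", 0.01147, 0.000204), ("MAR", 0.3830, 0.02553)]
for label, m, u in estimates:
    weight = math.log2(m / u)   # field-level agreement weight log2(m/u)
    print(f"{label}: m/u = {m / u:6.1f}, agreement weight = {weight:4.2f}")
# MAD: m/u ≈ 56.2, weight ≈ 5.81;  MAR: m/u ≈ 15.0, weight ≈ 3.91
```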

The fields used by the final FS models, either expert-specified or data-driven, per block per use case, are summarized in Table 3. Except for the SSA use case, where the number of fields available for matching is limited, the number of data-driven fields was greater than the number of expert-specified fields in the remaining 3 use cases.

Table 3. Summary of modeling information by data use case and by blocking scheme, showing the expert-specified fieldsa and data-driven fieldsa used in the Fellegi-Sunter model.

INPC
  DB-LN-MB-YBb
    Expert-specified: MRNc FNd SEXe TELf ADRg ZIPh SSNi
    Data-driven: MRN FN SEX TEL ADR ZIP SSN CITY EMAIL ETH MI NICK STj
  DB-MB-YB-ZIP
    Expert-specified: MRN LN FN SEX TEL ADR SSN
    Data-driven: MRN LN FN SEX TEL ADR SSN CITY EMAIL ETH MI NICK ST
  FN-LN-YB
    Expert-specified: MRN SEX DB MB TEL ADR ZIP SSN
    Data-driven: MRN SEX DB MB TEL ADR ZIP SSN CITY EMAIL MI ST ETH NICK
  FN-TEL
    Expert-specified: MRN LN SEX DB MB YB ADR ZIP SSN
    Data-driven: MRN LN SEX DB MB YB ADR ZIP SSN CITY EMAIL ETH MI NICK ST
  SSN
    Expert-specified: MRN LN FN SEX DB MB YB TEL ADR ZIP
    Data-driven: MRN LN FN SEX DB MB YB TEL ADR ZIP CITY EMAIL ETH MI NICK ST

SSAk
  FN-LN-DB-MB-YB
    Expert-specified: SSN MI ZIP
    Data-driven: SSN MI ZIP
  FN-LN-MI-DB-MB
    Expert-specified: ZIP YB SSN
    Data-driven: ZIP YB SSN
  FN-LN-MI-YB
    Expert-specified: DB MB ZIP SSN
    Data-driven: DB MB ZIP SSN
  FN-LN-ZIP
    Expert-specified: MI DB MB YB SSN
    Data-driven: MI DB MB YB SSN
  SSN
    Expert-specified: LN FN MI DB MB YB ZIP
    Data-driven: LN FN MI DB MB YB ZIP

NBSl
  LN-FN
    Expert-specified: MRN SEX DB MB YB TEL ADR ZIP
    Data-driven: MRN SEXm DB MB YBm TEL ADR ZIP CITY DR_FN DR_LN MI NK_FN NK_LN
  MB-DB-ZIP
    Expert-specified: MRN LN FN SEX YB TEL ADR
    Data-driven: MRN LN FN SEX YB TEL ADR CITY DR_FN DR_LN ETH MI NK_FN NK_LN NICK
  MRN
    Expert-specified: LN FN SEX DB MB YB TEL ADR ZIP
    Data-driven: LN FN SEXm DB MB YB TELm ADR ZIP CITY DR_FN DR_LN ETH MI NK_FN NK_LN ST
  NK_LN-NK_FN
    Expert-specified: MRN LN FN SEX DB MB YB TEL ADR ZIP
    Data-driven: MRN LNm FN SEX DB MB YB TEL ADR ZIP CITY DR_FN DR_LN ETH LN MI NICK ST
  TEL
    Expert-specified: MRN LN FN SEX DB MB YB ADR ZIP
    Data-driven: MRN LN FN SEX DB MB YBm ADR ZIP CITY DR_FN DR_LN ETH MI NK_FN NK_LN ST

MCHDn
  LN-FN
    Expert-specified: MRN SEX DB MB YB TEL ADR ZIP
    Data-driven: MRN SEXm DB MB YBm TEL ADR ZIP CITY DR_FN DR_LN MI NK_FN NK_LN
  MB-DB-ZIP
    Expert-specified: MRN LN FN SEX YB TEL ADR
    Data-driven: MRN LN FN SEX YB TEL ADR CITY DR_FN DR_LN ETH MI NK_FN NK_LN NICK
  MRN
    Expert-specified: LN FN SEX DB MB YB TEL ADR ZIP
    Data-driven: LN FN SEXm DB MB YB TELm ADR ZIP CITY DR_FN DR_LN ETH MI NK_FN NK_LN ST
  NK_LN-NK_FN
    Expert-specified: MRN LN FN SEX DB MB YB TEL ADR ZIP
    Data-driven: MRN LNm FN SEX DB MB YB TEL ADR ZIP CITY DR_FN DR_LN ETH LN MI NICK ST
  TEL
    Expert-specified: MRN LN FN SEX DB MB YB ADR ZIP
    Data-driven: MRN LN FN SEX DB MB YBm ADR ZIP CITY DR_FN DR_LN ETH MI NK_FN NK_LN ST

aColumns “Expert-specified fields” and “Data-driven fields” display the fields used in the Fellegi-Sunter (FS) model.

bDB-LN-MB-YB: day, month, and year of birth and last name.

cMRN: medical record number.

dFN: first name.

eSEX: sex.

fTEL: telephone number.

gADR: address.

hZIP: zip code.

iSSN: Social Security number.

jFields (italicized) selected only by data-driven methods.

kSSA: Social Security Administration.

lNBS: newborn screening.

mFields not selected by the data-driven method but specified by experts.

nMCHD: Marion County Health Department.

The matching metrics of the 4 use cases, evaluated on their respective ground truth sets of randomly selected and manually reviewed record pairs, are displayed in Table 4. From Table 4, we observe the following:

1. MAR improves the F1-score in general, whether matching fields are expert-specified or data-driven; the improvement in the F1-score comes from improved sensitivity with comparable or better PPV. The largest improvement in the F1-score occurred in the NBS use case: 0.874 with MAR using data-driven fields compared with 0.837 with MAD using expert-specified fields.

2. MAD using expert-specified fields had higher F1-scores than MAD using data-driven fields (except for NBS). As the number of data-driven fields is usually greater than the number of expert-specified fields in a block, we hypothesized that the artificial correlations among the large number of data-driven fields induced by MAD adversely affect the match performance.

3. MAR coupled with data-driven fields yielded F1-scores comparable to or larger than those of MAR with expert-specified fields, and larger than the F1-scores of MAD with either method of field selection.

In the SSA use case, the F1-scores of the 2 methods were similar, 0.873 for MAD and 0.875 for MAR, with either expert-specified or data-driven matching fields, because both field-selection approaches selected the same set of matching fields. We examined the classification results within the classes of true matches and true nonmatches in the ground truth set to determine whether the classified matches and nonmatches were similar or whether the 2 methods made different mistakes. From the diagonals in Table 5, we can see that the 2 methods produce roughly congruent classification results in the classes of true matches and true nonmatches. FS under MAR is slightly more sensitive than FS under MAD: 26 true matches that are misclassified as nonmatches by MAD are recovered as matches by MAR; only 3 true nonmatches correctly classified by MAD are misclassified as matches by MAR, while 1 true nonmatch that is misclassified as a match by MAD is correctly classified as a nonmatch by MAR. In summary, the classification results are similar in the SSA use case; FS under MAR is slightly more sensitive than FS under MAD, while maintaining PPV, NPV, and specificity (Table 4).

The algorithms performed differently across use cases, partly because of differences in data quality. The F1-score was 0.979 for INPC but only 0.874 for NBS (Table 4). First, the INPC and NBS use cases have very different patterns of missing data across matching fields; for example, SSN is missing from 52.6% to 69.7% of record pairs (except for the SSN block, which by definition has no missing SSN) across the 5 blocks in INPC, whereas SSN is missing in more than 98% of record pairs in NBS ( and ). Second, the discriminating powers of the same fields were different in the 2 use cases. A field has high discriminating power if its agreement rate is high among matches and low among nonmatches; otherwise, its data quality is low. For example, the probabilities of agreement in the LN and FN fields in the comparable date-of-birth and zip code blocking schemes of INPC and NBS show differential data quality: 94.45% of matches in INPC agree on the LN, while only 89% of matches in NBS do; conversely, only 0.45% of nonmatches in INPC agree on the LN, but 2.70% of nonmatches in NBS do; 94.36% of matches in INPC agree on the FN, while only 66.99% of matches in NBS do; and only 0.40% of nonmatches in INPC agree on the FN, but 3.45% of nonmatches in NBS do (). As the matching score is a summation of the log-scale ratios of the agreement probabilities of matches versus nonmatches across all matching fields, data quality directly affects matching performance.

Table 4. Matching results of the 4 use cases evaluated on their respective ground truth sets of randomly selected and manually reviewed record pairs. Cells show the estimate (95% CI).

Expert-specified fields

INPCa (N=15,000)
  MADb: sensitivity 0.962 (0.958-0.967); specificity 0.990 (0.987-0.992); PPV 0.990 (0.988-0.992); NPV 0.960 (0.955-0.964); F1-score 0.976 (0.974-0.978)
  MARc: sensitivity 0.970 (0.966-0.974); specificity 0.988 (0.986-0.991); PPV 0.989 (0.987-0.991); NPV 0.968 (0.964-0.972); F1-score 0.980 (0.977-0.982)

SSAd (N=16,500)
  MAD: sensitivity 0.781 (0.770-0.792); specificity 0.995 (0.994-0.996); PPV 0.989 (0.986-0.992); NPV 0.890 (0.884-0.895); F1-score 0.873 (0.866-0.879)
  MAR: sensitivity 0.785 (0.775-0.796); specificity 0.995 (0.993-0.996); PPV 0.989 (0.985-0.991); NPV 0.892 (0.886-0.897); F1-score 0.875 (0.869-0.882)

NBSe (N=15,000)
  MAD: sensitivity 0.795 (0.786-0.804); specificity 0.881 (0.874-0.889); PPV 0.883 (0.876-0.891); NPV 0.791 (0.782-0.801); F1-score 0.837 (0.830-0.843)
  MAR: sensitivity 0.860 (0.852-0.868); specificity 0.873 (0.865-0.881); PPV 0.885 (0.877-0.892); NPV 0.846 (0.838-0.855); F1-score 0.872 (0.866-0.878)

MCHDf (N=15,500)
  MAD: sensitivity 0.944 (0.937-0.949); specificity 0.989 (0.987-0.991); PPV 0.982 (0.979-0.986); NPV 0.966 (0.962-0.969); F1-score 0.963 (0.959-0.966)
  MAR: sensitivity 0.946 (0.940-0.952); specificity 0.988 (0.986-0.990); PPV 0.980 (0.976-0.983); NPV 0.967 (0.964-0.971); F1-score 0.963 (0.959-0.966)

Data-driven fields

INPC (N=15,000)
  MAD: sensitivity 0.579 (0.568-0.590); specificity 0.988 (0.986-0.991); PPV 0.982 (0.978-0.985); NPV 0.682 (0.672-0.690); F1-score 0.729 (0.719-0.737)
  MAR: sensitivity 0.970 (0.966-0.974); specificity 0.987 (0.984-0.989); PPV 0.988 (0.985-0.990); NPV 0.968 (0.964-0.972); F1-score 0.979 (0.976-0.981)

SSA (N=16,500)
  MAD: sensitivity 0.781 (0.770-0.792); specificity 0.995 (0.994-0.996); PPV 0.989 (0.986-0.992); NPV 0.890 (0.884-0.895); F1-score 0.873 (0.866-0.879)
  MAR: sensitivity 0.785 (0.775-0.796); specificity 0.995 (0.993-0.996); PPV 0.989 (0.985-0.991); NPV 0.892 (0.886-0.897); F1-score 0.875 (0.869-0.882)

NBS (N=15,000)
  MAD: sensitivity 0.813 (0.805-0.822); specificity 0.875 (0.867-0.883); PPV 0.880 (0.873-0.888); NPV 0.805 (0.796-0.814); F1-score 0.845 (0.839-0.852)
  MAR: sensitivity 0.865 (0.858-0.873); specificity 0.870 (0.863-0.878); PPV 0.883 (0.876-0.890); NPV 0.851 (0.842-0.859); F1-score 0.874 (0.868-0.880)

MCHD (N=15,500)
  MAD: sensitivity 0.635 (0.622-0.648); specificity 0.970 (0.967-0.974); PPV 0.929 (0.921-0.937); NPV 0.811 (0.804-0.818); F1-score 0.754 (0.745-0.764)
  MAR: sensitivity 0.954 (0.948-0.959); specificity 0.988 (0.985-0.990); PPV 0.979 (0.976-0.983); NPV 0.972 (0.968-0.975); F1-score 0.967 (0.963-0.970)

aINPC: Indiana Network for Patient Care.

bMAD: missing as disagreement.

cMAR: missing at random.

dSSA: Social Security Administration.

eNBS: newborn screening.

fMCHD: Marion County Health Department.

Table 5. Cross-tabulation of ground truth and classification results by the Fellegi-Sunter model under missing as disagreement (MAD) and missing at random (MAR) for the Social Security Administration use case. Rows are MAD classifications; columns are MAR classifications.

Manually reviewed status: match
                 MAR nonmatch   MAR match   Total
  MAD nonmatch   1277           26          1303
  MAD match      0              4647        4647
  Total          1277           4673        5950

Manually reviewed status: nonmatch
                 MAR nonmatch   MAR match   Total
  MAD nonmatch   10,495         3           10,498
  MAD match      1              51          52
  Total          10,496         54          10,550
Discussion

The US health care system will likely not have a unique and universal patient identifier in the near future, so innovations such as incorporating missing data under MAR and data-driven field selection in linkage algorithms are necessary to optimize existing methods, ensure accurate patient identity, and support patient safety. Our findings are important because they demonstrate improvements in linkage performance across 4 different but representative use cases. Our HIE-based patient-matching laboratory has experience matching clinical data from heterogeneous sources, including hospitals (inpatient and emergency departments) [,], ambulatory care settings [], public health syndromic surveillance systems [], electronic laboratory reporting in case detection systems [], and NBS data []. Thus, we can specifically measure variations in patient-matching performance across different use cases for the same patient-matching approaches. Through this work, we have shown that the 2 enhancements to patient-matching algorithms may have additive value across different clinical use cases.

Although the assumption of missing at random is not verifiable, the success of the FS algorithm coupled with MAR in our 4 different use cases indicates that missing at random is a reasonable assumption. As MCAR is a special case of MAR, our algorithm also works when data are MCAR. These results will inform future research and development in the patient-matching space.

Furthermore, the superior performance observed with MAR using data-driven fields over the other combinations in the 2-by-2 design and 4 use cases suggests its potential value for incorporation into privacy-preserving record linkage (PPRL) methods. In PPRL, to preserve privacy, fields can be tokenized (eg, using bigrams) into smaller parts and compared []. As the tokens do not reveal the actual nature and content of the field, an expert cannot specify matching fields as they can with fields such as names, DB, SSN, etc. PPRL is thus a scenario in which data-driven field selection coupled with MAR in the FS model appears useful.

Finally, using many data-driven fields may raise a concern about model overfitting, which is a prominent cause of poor performance in machine learning algorithms. In many applications of latent class models in medical research, many covariates are available, and the number of covariates can overwhelm the number of observations. This is the main motivation in much of the variable selection literature for identifying a subset of variables to (1) estimate the association between the covariates and the response variable and (2) obtain a parsimonious model that describes the covariates and the response variable []. However, in record linkage, the primary concern is not estimation but rather prediction of the unknown response variable (the class label of match or nonmatch), so model parsimony is irrelevant. Sample sizes in record linkage are generally large: in our 4 use cases, even the sample size of the smaller class of our smallest use case, MCHD, is approximately 180,035 ( and ). Hence, the number of matching fields used was small relative to the overall and within-class sample sizes, without concern for overfitting. The FS model for record linkage is an unsupervised algorithm in that the response variable, which indicates whether a pair of records belongs to the same entity, is unknown and must be inferred. As such, the veracity of any unsupervised classification algorithm applied to a set of record pairs is tested on a representative ground truth set in which the class labels are obtained through manual review. As described in the Methods section, we created a gold standard set of randomly selected record pairs for each of the 4 use cases. Most importantly, the match status of the record pairs in the gold standard sets was not used in the FS model fitting process (including field selection). As the data-driven fields used in the FS model under MAR uniformly performed best for all 4 use cases, overfitting is not a concern, because overfitting tends to hurt performance on held-out data.

While we strive to generate results that are applicable to the broadest possible audience using a health informatics research laboratory that captures a diverse set of data elements with varying data characteristics, we cannot assure generalizability with complete certainty. If our data are not representative of other health systems, then our linkage results may not be applicable. If the missing data mechanism is not MAR or MCAR (eg, if the missingness of a data element is related to its value), our algorithm will likely not work. Before applying our methods to a data environment with missing data, we recommend creating a ground truth set of randomly selected record pairs whose match status is manually reviewed to determine whether our methods are applicable to a specific data environment.

Finally, our results suggest that accommodating missingness in patient-matching algorithms can improve accuracy. While the FS model is widely used, different FS implementations and completely different models (eg, decision trees or boosting algorithms) may exhibit a greater or lesser effect. We will explore the potential of these machine learning tools in our future work.

In summary, the combination of data-driven matching field selection and the MAR method produced the best overall performance for the 4 real-world matching use cases. The MAR method maintained or improved F1-scores across the use cases.
