The adoption of electronic health records (EHRs) promises easily searchable databases that can serve as an accessible resource for prospective and retrospective research []. EHRs often fall short of this promise due to limited local data and poor data quality (DQ) [,]. To overcome these limitations, several institutions have harmonized databases and model ontologies, including PCORnet (The National Patient-Centered Clinical Research Network), All of Us, MIRACUM (Medical Informatics in Research and Care in University Medicine), and the EHDEN Project [-]. These programs strive to offer high-quality data for research purposes []. However, EHR DQ remains highly variable, with some studies showing completeness of EHR parameter values ranging from 60% to 100% [,]. Such inconsistencies significantly limit the generalizability and applicability of lessons learned across these datasets for broader medical and research purposes.
Multiple initiatives have aimed to measure and improve EHR data [,]. Early efforts in DQ assessment (DQA) demonstrated inconsistent reporting and a need for universal terminology standards in DQA efforts []. In response, standardized ontologies for DQA have been developed, such as through the efforts of the International Consortium for Health Outcomes Measurement, the 3×3 DQA guidelines, and the terminologies proposed by Kahn et al [] and Wang et al [,-]. More recently, artificial intelligence (AI) and natural language processing techniques have automated quality initiatives, including data assessment and augmentation [,]. Nonetheless, these techniques introduce their own set of quality requirements, including fairness metrics, handling intolerable or lost data, and mitigating data drift []. Measuring the results of applying these techniques in real-world clinical contexts has given rise to another field that has become crucial for EHR improvement, namely, data performance assessment (DPA) [].
In this review, we critically evaluate peer-reviewed literature on the intersection of DQA and DPA applications, as well as trends in their automation [-,-]. The purpose of this scoping review was to combine these 3 areas to formulate a clearer road map for evaluating EHR datasets for medical research and practice.
This scoping literature review was conducted according to the 2018 PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines; the completed checklist is shown in [].
Literature Search
A search was performed for all full-text research articles published in English in PubMed from inception to May 7, 2023. A list of the exact search terms is included in .
Article Selection
Four investigators (JB, RS, TRB, and YPP) reviewed the selected studies during the title and abstract screening. A further 4 investigators (ML, RS, TOB, and YPP) conducted the full-text review and final extraction of articles. Title or abstract screening, full-text review, and final extraction were based on the consensus opinion between 2 independent reviewers. Conflicts were resolved by a third reviewer. Article management and calculation of interrater reliability (Cohen κ) were performed using Covidence systematic review software (Veritas Health Innovation).
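For readers unfamiliar with the statistic, the short sketch below illustrates how Cohen κ quantifies interrater agreement beyond chance; the reviewer votes are hypothetical, and in practice Covidence computed this value automatically.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical include/exclude screening decisions by two independent reviewers.
reviewer_1 = ["include", "exclude", "include", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "include", "exclude", "include"]

# Cohen kappa corrects raw agreement for the agreement expected by chance.
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen kappa: {kappa:.2f}")
```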
Inclusion Criteria
Titles and abstracts were screened to include original research articles assessing the DQ and performance of all or part of a hospital’s EHR system. We looked for studies reporting on 1 or more aspects of DQ (the assessment of EHR data without consideration of follow-up actions) and data performance (the assessment of EHR data applications) as defined ().
Table 1. Data quality and performance indicator definitions, mitigation strategies, and references.

Data quality
- Completeness (or, conversely, missingness). Definition: the absence of data points, without reference to data type or plausibility []. Mitigation strategies: automated data extraction; data imputation. Relevant studies: [-,,,-]
- Conformance. Definition: the compliance of data with expected formatting, relational, or absolute definitions []. Mitigation strategies: preemptively enforced data format standardization. Relevant studies: [-,,,-,-,,]
- Plausibility. Definition: the possibility that a value is true given the context of other variables or temporal sequences (ie, patient date of birth must precede date of treatment or diagnosis) []. Mitigation strategies: periodic realignment with logic rule sets or objective truth standards; thresholding. Relevant studies: [-,,,,,,-,,-]
- Uniqueness. Definition: the lack of duplicate data among other patient records []. Mitigation strategies: two-level encounter or visit data structure. Relevant studies: []

Data performance
- Correctness or accuracy. Definition: whether patient records are free from errors or inconsistencies when the information provided in them is true [,]. Mitigation strategies: periodic validation against internal and external gold standards. Relevant studies: [,-,,,,]
- Currency or recency. Definition: whether data were entered into the EHR within a clinically relevant time frame and are representative of the patient state at a given time of interest [,]. Mitigation strategies: enforcing predetermined hard and soft rule sets for timeline of data entry. Relevant studies: [,,,,,,]
- Fairness (or, conversely, bias). Definition: the degree to which data collection, augmentation, and application are free from unwarranted over- or underrepresentation of individual data elements or characteristics. Mitigation strategies: periodic review against a predetermined internal gold standard or bias criterion. Relevant studies: [,,,,,]
- Stability (or, conversely, temporal variability). Definition: whether temporally dependent variables change according to predefined expectations [,]. Mitigation strategies: periodic measurement of data drift against a baseline standard of data distribution. Relevant studies: [,,,]
- Shareability. Definition: whether data can be shared directly, easily, and with no information loss []. Mitigation strategies: preemptively enforced data standardization. Relevant studies: [,]
- Robustness. Definition: the percent of patient records with tolerable (eg, inaccurate, inconsistent, and outdated information) versus intolerable (eg, missing required information) data quality problems []. Mitigation strategies: timely identification of critical data quality issues. Relevant studies: []

EHR: electronic health record.
Data Quality

Conformance
Conformance refers to the compliance of data with expected formatting, relational, or absolute definitions [].

Plausibility
Plausibility refers to the possibility that a value is true given the context of other variables or temporal sequences (ie, the patient’s date of birth must precede the date of treatment or diagnosis) [].

Uniqueness
Uniqueness refers to the lack of duplicated records [].

Completeness (or Conversely, Missingness)
With regard to completeness, missingness is the absence of requested data points, without reference to conformance or plausibility as defined [].

Data Performance

Correctness or Accuracy
Correctness or accuracy refers to whether patient records are free from errors or inconsistencies when the information provided in them is true [,].

Currency or Recency
Currency or recency refers to whether data were entered into the EHR within a clinically relevant time frame and are representative of the patient state at a given time of interest [,].

Fairness (or Conversely, Bias)
With regard to bias, fairness refers to the degree to which data collection, augmentation, and application are free from unwarranted over- or underrepresentation of individual data elements or characteristics.

Stability (or Conversely, Temporal Variability)
With regard to stability, temporal variability refers to whether temporally dependent variables change according to predefined expectations [,].

Shareability
Shareability refers to whether data can be shared directly, easily, and with no information loss [].

Robustness
Robustness refers to the percent of patient records with tolerable (eg, inaccurate, inconsistent, and outdated information) versus intolerable (eg, missing required information) DQ problems [].
We additionally included studies reporting on data imputation methods, defined as techniques used to fill in missing values in an EHR, such as through statistical approximation and the application of AI.
Exclusion Criteria
We excluded tangential analyses of DQ in articles focused primarily on clinical outcomes. As such, studies discussing data cleaning as part of quantifying clinical outcomes were excluded from our analysis. Proposals or study protocols with no results were also excluded during the screening process.
Article Quality Assessment
Full-text articles were additionally scored as meeting or missing criteria for (1) data integrity: comprehensiveness of reporting for each main outcome, including attrition and exclusions from the analysis and the reasons for them; (2) method clarity: a clear description of DQA data sources, analysis steps, and criteria; (3) outcome clarity: outcomes reported in plain language, in their entirety, and without evidence of selective reporting; and (4) generalizability: applicability of the DQ techniques described in the article to other clinical settings.
The flow diagram for article selection is shown in . A total of 154 records were identified in the PubMed library using the search terms defined in . After the removal of 31 duplicates and 72 articles identified as irrelevant, 51 studies proceeded to full-text review. Full-text review excluded a further 25 articles for the reasons listed in , leaving a final total of 26 original research studies [-,,,,,,-]. The Cohen κ between the different pairs of reviewers ranged from 0.28 to 0.54 during the screening process and from 0.54 to 1.00 during the full-text review.
Study characteristics are shown in and . Exactly half of the identified articles targeted general EHR DQ analysis [-,,,-,,], while the other half focused on a particular specialty or diagnosis () [,,,,,-,-]. The latter included primary care (n=3, 12%) [-], cardiovascular disease (n=3, 12%) [,,], anesthesia or pain medicine (n=2, 8%) [,], intensive care units (n=2, 8%) [,], and pediatrics [], oncology [], and infectious disease (n=1 each, 4%) [].
Article quality assessment conducted as part of our review process identified that 14 (54%) of the articles [-,,,,,-,,] had at least 1 common study design or reporting limitation, with 5 of the articles having more than 1 [,,,,]. Among the limitations identified, 6 (30%) concerned methods that were not clearly stated [,,,,,], 5 (25%) concerned incomplete data [,,,,], 5 (25%) concerned findings that were not generalizable to other settings [,-,], and 4 (20%) concerned outcomes that were not clearly stated () [,,,].
Commonly referenced DQ and performance indicators are summarized in . Respective definitions, mitigation strategies, and references are listed in .
Figure 1. PRISMA 2020 flow diagram detailing study selection and reasons for exclusion for all articles considered for this scoping review. PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

Table 2. Frequency of clinical specialties among all papers and study limitations among all limitations identified by reviewers in this analysis. Values are n (%).

Specialty
- ICU: 2 (8)
- Anesthesia or pain medicine: 2 (8)
- General: 13 (50)
- Cardiovascular: 3 (12)
- Infectious disease: 1 (4)
- Oncology: 1 (4)
- Pain medicine: 0 (0)
- Pediatrics: 1 (4)
- Primary care: 3 (12)

Limitations
- Incomplete data: 5 (25)
- Methods not clearly stated: 6 (30)
- Outcomes not clearly stated: 4 (20)
- Not generalizable to other settings: 5 (25)

ICU: intensive care unit.
Table 3. Elements of data quality and performance commonly referenced by papers included in this review. Values are n (%).

Data quality
- Completeness: 21 (81)
- Conformance: 18 (69)
- Plausibility: 16 (62)
- Uniqueness: 1 (4)

Data performance
- Correctness or accuracy: 14 (54)
- Currency: 7 (27)
- Fairness or bias: 6 (23)
- Stability: 4 (15)
- Shareability: 2 (8)
- Robustness: 1 (4)

Data Quality Assessment

Completeness
Completeness was the most cited element of DQ analysis, with references in 21 (81%) of all articles [-,,,-]. Importantly, 19 (73%) studies integrated data from multiple clinical sites [,-,,,,,,-], which was associated with issues in data collection and missingness “across organizational structure, regulation, and data sourcing” []. Clinical domains reported to be prone to low data completeness included patient demographics, with Estiri et al [] highlighting the issue for records of patient ethnicity and Thuraisingam et al [] for mortality records (eg, missing year of death), and medication management, with Thuraisingam et al [] highlighting the issue for dosage, strength, or frequency of prescriptions and Kiogou et al [] for missing dates or reasons for discontinuation of medications.
To combat data missingness, Lee et al [] used natural language processing algorithms to automatically extract data from patient records, while a further 5 studies made use of data imputation techniques. Among the latter, 2 articles generated synthetic data, while another 3 supplemented datasets with information from external datasets. Fu et al [] generated synthetic data by modeling providers’ assessments of EHR data based on different information sources according to their individual characteristics (eg, tendency to ascertain delirium status based on the Confusion Assessment Method vs prior International Statistical Classification of Diseases coding or nursing flow sheet documentation), while Zhang et al [] used a generative adversarial network (GAN) trained on real longitudinal EHR data to create single synthetic EHR episodes (eg, an outpatient or inpatient visit). Meanwhile, Lee et al [] supplemented existing EHR records on heart failure by aggregating data from open-source datasets of heart failure biomarkers (including the Database of Genotypes and Phenotypes and the Biologic Specimen and Data Repository Information Coordinating Center) and using literature guidelines to create a standard set of cardiovascular outcome measures, while Curtis et al [] supplemented missing EHR mortality records with data from the US Social Security Death Index and the National Death Index, and Mang et al [] used a manually generated stand-alone synthetic dataset to test the development of a new software tool for DQ assessment.
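As a minimal illustration of how field-level completeness can be quantified before any of these mitigation steps are applied, the sketch below computes the proportion of nonmissing values per field. The pandas-based code, column names, and threshold are our own illustrative assumptions, not drawn from the reviewed studies.

```python
import pandas as pd
import numpy as np

# Hypothetical extract of EHR records; column names are illustrative only.
records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "ethnicity": ["Hispanic", None, None, "Non-Hispanic"],
    "year_of_death": [np.nan, 2021, np.nan, np.nan],
    "med_dose_mg": [500, np.nan, 250, 1000],
})

# Completeness = share of nonmissing values per field (1 - missingness).
completeness = records.notna().mean().round(2)
print(completeness)

# Fields falling below a prespecified threshold can be flagged for
# automated extraction or imputation, as described above.
low_quality_fields = completeness[completeness < 0.8].index.tolist()
print("Fields needing mitigation:", low_quality_fields)
```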
Conformance
Conformance was the second most cited element of DQA, with references in 18 (69%) articles [-,,,-,-,,]. Similar to completeness, DQ checks on conformance were performed automatically across most studies. Mitigation strategies included enforcing strict formatting rules at the time of data entry, for example, by using International Statistical Classification of Diseases codes to define the cause of death or a diagnosis of delirium [,].
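A minimal sketch of an automated conformance check of this kind is shown below; the simplified ICD-10 pattern and sample codes are illustrative assumptions rather than a complete validator.

```python
import re

# Simplified ICD-10 format check (illustrative, not an exhaustive validator):
# one letter, two digits, and an optional decimal part of 1-4 alphanumerics.
ICD10_PATTERN = re.compile(r"^[A-Z]\d{2}(\.[0-9A-Z]{1,4})?$")

def conforms_to_icd10(code: str) -> bool:
    """Return True if a diagnosis code matches the expected ICD-10 format."""
    return bool(ICD10_PATTERN.match(code.strip().upper()))

# Hypothetical cause-of-death codes pulled from an EHR extract.
codes = ["I21.9", "i219", "F05", "heart attack", "E11.65"]
for code in codes:
    status = "conformant" if conforms_to_icd10(code) else "nonconformant"
    print(code, "->", status)
```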
Plausibility
Plausibility was the third most cited element of DQA with references in 16 (62%) articles [-,,,,,,-,,-]. Clinical domains prone to issues with plausibility included patient baseline physical characteristics, medication, and laboratory records. Estiri et al [] and Wang et al [] reported significant rates of plausibility issues for baseline physical characteristics, with higher error rates for records of patient height as compared to weight, likely due to the multiple flow sheet fields for height, including “estimated,” “reported,” and “measured,” which are generally averaged or selectively dropped. Pharmacologic data were prone to issues with plausibility due to timeliness (eg, antiretroviral therapy was dispensed before or more than 30 days after the visit date []) or discrepancies between diagnoses and drugs (eg, nonsteroidal anti-inflammatory drug prescription on the date of gastroduodenal ulcer diagnosis []). Finally, laboratory results were also prone to issues with plausibility due to value ranges, units, timing (eg, laboratory time was at an invalid time of day or in the future), and discrepancies between diagnoses and laboratory records (eg, drug was documented as present but there was no laboratory record) or drug prescriptions and laboratory records (eg, metformin was prescribed prior to a documented hemoglobin A1c laboratory result, or warfarin was prescribed without a follow-up international normalized ratio laboratory result) []. Notably, this may reflect poorly integrated health care systems where laboratories are being drawn at disparate institutions.
A total of 18 (69%) studies used logic statements to assess plausibility [,-,,,,,,,-], including rules to determine temporal plausibility (eg, laboratories drawn at an invalid time of day [eg, 10:65 AM] [], extubation occurring prior to intubation [], or death date occurring before birth date []), diagnostic or procedural plausibility (eg, a procedure marked as an outpatient when it is only performed on an inpatient basis [] or an obstetric diagnosis given for a biologically male patient [,,]), alignment with external standards or expectations (eg, laboratory result absent for diagnosis or drug [] or demographic alignment of medication name and dose with expected value ranges []), and others. A total of 11 (42%) studies used thresholding to identify data of low or questionable quality [,,,,,,,,,,], including clinical and physiological value ranges (eg, BMI between 12 and 90 kg/m2 [] or fraction of inspired oxygen between 10% and 100% []) and logical thresholds (eg, recorded date of arrival prior to the date of data collection initiation [] or difference of >730 days when comparing age in years and date of birth fields []).
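The sketch below illustrates how such logic rules and thresholds might be combined into a single plausibility check; the field names, rule set, and thresholds (adapted from the examples above) are assumptions for demonstration only.

```python
from datetime import date

# Minimal rule-based plausibility check; field names and rules are illustrative.
def plausibility_flags(record: dict) -> list[str]:
    flags = []
    # Temporal plausibility: death must not precede birth.
    if record.get("death_date") and record["death_date"] < record["birth_date"]:
        flags.append("death date precedes birth date")
    # Threshold plausibility: BMI within a clinically plausible range.
    bmi = record.get("bmi")
    if bmi is not None and not (12 <= bmi <= 90):
        flags.append(f"implausible BMI: {bmi}")
    # Threshold plausibility: FiO2 between 10% and 100%.
    fio2 = record.get("fio2_pct")
    if fio2 is not None and not (10 <= fio2 <= 100):
        flags.append(f"implausible FiO2: {fio2}")
    # Diagnostic plausibility: obstetric diagnosis for a male patient.
    if record.get("sex") == "M" and record.get("obstetric_dx"):
        flags.append("obstetric diagnosis recorded for male patient")
    return flags

example = {"birth_date": date(1980, 5, 1), "death_date": date(1979, 1, 1),
           "bmi": 8.2, "fio2_pct": 45, "sex": "M", "obstetric_dx": True}
print(plausibility_flags(example))
```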
Uniqueness
Finally, 1 (4%) study reported on data uniqueness. Aerts et al [] measured the frequency of patient record duplications (ie, when patient records were erroneously copied during data merging or reprocessing). To reduce the rate of record duplications, the researchers in the study suggest a 2-level data structure, with more general patient data being recorded at the encounter level (which can include multiple visits during a single clinical episode) and diagnosis or procedure-specific data at the level of the particular visit.
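A minimal sketch of a duplicate check keyed to such a 2-level encounter-visit structure is shown below; the column names and sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical visit-level rows after a data merge; diagnosis-level data are
# keyed to (encounter_id, visit_id), mirroring the two-level structure above.
visits = pd.DataFrame({
    "encounter_id": [101, 101, 101, 102],
    "visit_id":     [1, 2, 2, 1],          # visit 2 of encounter 101 is duplicated
    "diagnosis":    ["I10", "E11.9", "E11.9", "J45"],
})

# Flag rows that repeat an (encounter, visit, diagnosis) combination.
duplicates = visits.duplicated(subset=["encounter_id", "visit_id", "diagnosis"],
                               keep="first")
uniqueness = 1 - duplicates.mean()
print(f"Uniqueness: {uniqueness:.0%}")

deduplicated = visits.loc[~duplicates]
```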
Data Performance Assessment

Correctness or Accuracy
Correctness or accuracy was the most cited element in data performance analysis, with references in 14 (54%) of all articles [,,,,,,,-,]. The metric was evaluated via manual review in 8 (57%) of the 14 articles that reported the measure [,,,,,,,]. A total of 5 (36%) articles evaluated it in comparison to an external standard, including national registries [,], EHR case definitions based on billing codes [], and literature guidelines with high research use [], or, in the case of a newly proposed AI technique for synthetic data augmentation, comparison to a previously published GAN model performance []. A further 3 (21%) assessed correctness or accuracy against an internal standard by calculating the proportion of records satisfying internally predetermined rule sets [,,]. Of note, Curtis et al [] and Terry et al [] used both manual review and comparison to an external gold standard for validation.
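As a simple illustration of validation against an external gold standard, the sketch below computes the proportion of EHR-derived values that agree with a registry; the vital-status data are invented for demonstration and do not reproduce any specific study's method.

```python
import pandas as pd

# Hypothetical EHR-derived values versus a gold standard (eg, a national registry
# or manual chart review).
ehr = pd.Series(["alive", "dead", "alive", "alive", "dead"], name="ehr_vital_status")
gold = pd.Series(["alive", "dead", "dead", "alive", "dead"], name="registry_vital_status")

# Correctness = share of records whose EHR value matches the gold standard.
correctness = (ehr == gold).mean()
print(f"Correctness vs external standard: {correctness:.0%}")
```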
Currency or Recency
Recency was the second most cited data performance element, with references in 7 (27%) articles [,,,,,,]. Among these, 5 (71%) studies evaluated the metric according to internally predetermined hard rule sets (eg, whether a patient who is obese had a weight recording within 1 year of the previous data point or whether data were entered into the EHR within 3 days of the clinical encounter [,,]) or soft rule sets (eg, whether the data were entered into the EHR within a subjectively determined clinically actionable time limit [,]), while 2 (29%) used external standards, including national registries and guidelines [,].
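The following sketch illustrates a hard recency rule of the kind described above (data entered within 3 days of the encounter); the function and field names are illustrative assumptions.

```python
from datetime import date, timedelta

# Hard recency rule: was the record entered into the EHR within a fixed lag
# of the clinical encounter? The 3-day limit mirrors the example above.
def is_current(encounter_date: date, entry_date: date, max_lag_days: int = 3) -> bool:
    lag = entry_date - encounter_date
    return timedelta(0) <= lag <= timedelta(days=max_lag_days)

print(is_current(date(2023, 4, 1), date(2023, 4, 3)))   # True: entered within 3 days
print(is_current(date(2023, 4, 1), date(2023, 4, 10)))  # False: stale entry
```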
Fairness or Bias
The third most cited data performance element was fairness or bias, with references in 6 (23%) articles [,,,,,]. Among these, Lee et al [], Thuraisingam et al [], Tian et al [], and García-de-León-Chocano et al [] assessed fairness by manual review, while Fu et al [] and Zhang et al [] did so through automated review against a predetermined internal gold standard (ie, distribution of data characteristics within a real EHR dataset) or data bias criterion (ie, critic model measuring Jensen-Shannon divergence between real and synthetic data over time), respectively.
Stability
Data stability was the fourth most cited performance element, referenced in 4 (15%) articles [,,,]. All 4 articles that measured data stability did so via temporal statistical analyses of data drift according to a predetermined internal baseline standard of data distribution [,,,].
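As one possible way to operationalize such a drift check, the sketch below compares a current extraction against a baseline distribution using a two-sample Kolmogorov-Smirnov test; the choice of statistic, the variable, and the synthetic values are our own illustrative assumptions, not those of the reviewed studies.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic data for demonstration: baseline vs most recent hemoglobin A1c values.
rng = np.random.default_rng(0)
baseline_hba1c = rng.normal(loc=6.5, scale=1.0, size=5000)   # historical extraction
current_hba1c = rng.normal(loc=7.1, scale=1.2, size=5000)    # latest extraction

# Two-sample KS test: a small p value suggests the distributions have shifted.
statistic, p_value = ks_2samp(baseline_hba1c, current_hba1c)
if p_value < 0.01:
    print(f"Possible data drift detected (KS statistic={statistic:.3f})")
```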
Shareability
Shareability was referenced in 2 (8%) articles from our analysis [,]. Both studies measured the performance metric by way of manual review in a pre- and posttest analysis of data standardization [,].
Robustness
Finally, García-de-León-Chocano et al [] reported on information robustness by way of statistical estimation of critical (eg, missing or null required values) versus noncritical (all other) DQ issues that may obstruct subsequent data applications and performance measures.
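A minimal sketch of this type of robustness estimate, assuming a hypothetical list of required fields, is shown below.

```python
import pandas as pd

# Robustness sketch: share of records whose quality problems are intolerable
# (eg, missing required fields) rather than tolerable. Column names and the
# required-field list are illustrative assumptions.
REQUIRED_FIELDS = ["patient_id", "birth_date"]

records = pd.DataFrame({
    "patient_id": [1, 2, None, 4],
    "birth_date": ["1980-05-01", None, "1992-03-10", "1975-11-30"],
    "phone":      [None, "555-0100", None, "555-0199"],  # optional field
})

# A record is "intolerable" if any required field is missing.
intolerable = records[REQUIRED_FIELDS].isna().any(axis=1)
robustness = 1 - intolerable.mean()
print(f"Records free of intolerable issues: {robustness:.0%}")
```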
Interventions for Improving DQ and Performance
Three articles included in our analysis reported effective interventions to improve DQ and performance [,,]. In terms of DQ, Walker et al [] reported an increase in compliance with 155 completeness and plausibility data checks from 53% to 100% across 6 clinical sites after 3 rounds of DQA. In terms of DQ and performance, Puttkamer et al [] reported both higher data completeness and recency following the implementation of a continuous data reporting and feedback system. Finally, Engel et al [] reported increased shareability (concept success rate, ie, whether data partners converted information from their individual EHRs to the shared database) from 90% to 98.5%, as well as a notable reduction in the percentage of sites with over 3 DQ errors, from 67% to 35%, across more than 50 clinical sites over 2 years.
This scoping review provides an overview of the most common and successful means of EHR DQ and performance analysis. The review adds to a growing body of literature on the subject, most recently supplemented by a systematic review by Lewis et al []. To our knowledge, ours is the first review of specialty-specific applications of DQ alongside performance assessments. We identified and analyzed a total of 26 original research articles recently published on the topic. The results serve to characterize the most common medical fields making use of such assessments, the methodologies they use for conducting them, and areas for specialty-specific, as well as generalizable, future improvement. Finally, the discussion proposes a set of 6 unique and practical recommendations for minimizing modifiable DQ and performance issues arising during data extraction and mapping.
Article Characteristics
Our review noted a paucity of DQ assessments within clinical specialties, where expert domain knowledge plays a key role in identifying logic inconsistencies. Half of all identified articles concerned general EHR data assessments, while the other half focused on medical fields such as primary care, cardiovascular disease, intensive care, or anesthesia, with a notable absence of psychiatry, emergency medicine, and the surgical specialties. This points to a lack of peer-reviewed research and underuse of DQ and performance strategies across a wide spectrum of the medical field. There remains a wide knowledge gap between how data are entered and acted upon clinically and how they appear in silico. Therefore, more effort should be directed toward supporting EHR data assessment initiatives in these specialties, with close collaboration between clinical users and data scientists.
More than half of the articles included in this scoping review had common limitations, including the use or reporting of incomplete data, methods, and outcomes. Among the articles flagged for incomplete data, the chief issues included data attrition during extraction [,] and unclear or missing reporting [,,], pointing to a need for higher information interoperability and reporting standards, such as those put forth by Kahn et al []. These standards recommend using a harmonized and inclusive framework for the reporting of DQ assessments, including standardized definitions for completeness, conformance, plausibility, and other measures as discussed previously.
Similar issues were observed with methods reporting, with several articles underreporting steps in their data extraction or analysis, thereby limiting the replicability and generalizability of their findings [,,,]. Unclear reporting or underreporting was a substantial issue for outcomes as well, with low-scoring articles reporting only partial or overly high-level results, suggesting selective reporting bias [,,,]. To align with the standards set by articles scoring high in reporting quality, we recommend stating all data sourcing, methods, and results according to predetermined definitions of DQ or performance (see above) in enough detail that they could be easily replicated by researchers at an unrelated institution.
A final article quality pitfall concerned articles that were too specific to a particular health system or clinical context. The chief issue among original research articles that scored “low” in our generalizability assessment was their overreliance on in-house DQ checks or measures that could only be implemented within their specific institutional EHR [,-,]. To increase generalizability, we recommend relying on external DQ standards, such as societal guidelines, previously published measures, or open-source databases, to the extent possible before resorting to the development of new in-house tools that limit generalizability outside the local clinical context [,-].
Data Quality Assessment
The marked drop-off between the use of completeness, conformance, and plausibility versus other indicators () demonstrates that the field has settled on these measures as the main components of EHR DQ analysis. Taking this into consideration, we recommend measuring all 3 for a general assessment of clinical DQ. Of note, there is a significant drop-off between 81% (n=21) of studies reporting on completeness versus 69% (n=18) on conformance and 62% (n=16) on plausibility, which indicates an opportunity for limited but quick DQ “checks” using completeness measures only. More specialized analyses may require further reporting, including uniqueness in the event of data merger with the possibility of duplicate results. These may be particularly important in the case of EHR DQ assessments following information reconciliation from the merger of multiple data sources, including patient demographics or baseline physical characteristics and laboratory or pharmacological data, which were shown to be particularly prone to errors in DQ.
Our review additionally demonstrates that issues with data completeness, conformance, and plausibility may be at least partially addressed with data imputation methods. While previously these methods were either too limited in scope (completeness only), crude (eg, augmenting missing data with the mean of the entire dataset or a value’s k-nearest neighbors), or computationally expensive (eg, individual values calculated via regression models based on predetermined sets of correlated features), our review suggests that these tasks are being increasingly automated. Specifically, data attrition contributing to missingness and nonconformity at the extraction stage may be minimized with AI data extractor algorithms, such as the one described by Lee et al []. In cases where further extraction is no longer feasible, the dataset may be augmented by (1) using large language models for extracting structured data available in other formats (eg, laboratory values recorded in the text of media files from outside patient records); (2) incorporating or cross-referencing data from well-established outside data repositories (eg, the US Social Security Death Index for mortality records [] or the Database of Genotypes and Phenotypes and the Biologic Specimen and Data Repository Information Coordinating Center for biomarkers of heart failure and other conditions []); or (3) generating synthetic data, for example, by modeling providers’ behaviors with respect to different information types or sources [] and by using GANs to create synthetic care episodes based on longitudinal EHR observations [].
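For illustration, the sketch below applies k-nearest neighbor imputation, one of the simpler approaches mentioned above, using scikit-learn; the variables and values are synthetic, and real deployments would use the more sophisticated extraction and generation methods just described.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Synthetic example: columns are age (years), systolic BP (mm Hg), hemoglobin A1c (%).
X = np.array([
    [64.0, 142.0, 7.1],
    [58.0, np.nan, 6.4],   # missing systolic BP
    [71.0, 155.0, np.nan], # missing A1c
    [60.0, 138.0, 6.8],
])

# Each missing value is filled from the 2 most similar complete records.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)

# Imputed values should themselves be checked for plausibility and bias,
# as discussed in the fairness section below.
```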
Data Performance Assessment
Correctness or accuracy was by far the most reported measure among the data performance indicators examined in our review. While certainly integral to assessing a dataset’s usability and potential for downstream clinical or research impact, correctness alone is insufficient to guarantee the success of said applications. A technically “correct” dataset may still be practically limited if it is outdated, biased, inconsistent, or entirely idiosyncratic. We therefore recommend that future data assessments consider including additional measures of recency, fairness, stability, and shareability among their core set of performance indicators, as each contributes a unique measure of a dataset’s applicability. Importantly, our review noted considerable heterogeneity in the definitions used for these additional measures (eg, defining data recency in terms of whether the information was logged into the EHR within a set time vs whether it represents a patient’s state at a given time period [] [,]), suggesting that further efforts are needed to harmonize outcome definitions in the field of data performance analysis in particular. Nonetheless, the predominance of internal standard comparisons for measuring recency and stability in our review demonstrates that these indicators may be essential for individualized EHR DPAs and should, therefore, be considered on a case-by-case basis (eg, in epidemiology, where the timing and consistency of reporting can be of essential importance, or in quality improvement initiatives, where a researcher might want to compare pre- vs postintervention results). Likewise, shareability ought to be considered when assessing dataset performance for interoperability purposes (eg, data integration, sharing, and reporting).
As discussed previously, data fairness assessments can and should be considered for monitoring overall EHR bias, as well as the bias inherent to any data imputation methods as discussed above. Our review indicates that this is a rapidly developing field, with fairness assessments to date mostly requiring manual review against national guidelines or disease registries, or, in the case of synthetic data, real EHR datasets [-]. Nonetheless, such gold standards are not always readily available (eg, what is the standard distribution of age or race in the real world?), so tech-savvy researchers have more recently resorted to detecting fairness during the validation of machine learning models or algorithms instead of the data itself [-]. Several research articles from our analysis proposed ways of automating the process. Fu et al [] present a straightforward way of measuring the agreement of AI-generated synthetic data against a gold standard dataset. Zhang et al [] suggest that while such straightforward analysis may be valuable, it is insufficient to measure true fairness, and they go on to propose a method of measuring bias via Jensen-Shannon divergence, which can be calculated for comparisons of real-world and synthetic data. The latter article also suggests a way of preventing synthetic data drift through condition regularization (ie, minimizing contrastive loss by regularizing the synthetic dataset against a real dataset distribution) and fuzzying (ie, adding controlled noise to broaden the dataset distribution before the AI training phase). To our knowledge, this is the most recently proposed technique for fairness assessment in the field, and more research is needed to validate and augment it. Whether through Jensen-Shannon divergence or alternative methods, we recommend that all future data assessment projects measure and report model performance and fairness for sensitive groups.
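As a rough sketch of the general idea (not a reproduction of Zhang et al's method), the code below compares the distribution of a sensitive attribute in real versus synthetic data using Jensen-Shannon divergence; the binned proportions are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Proportions of records per age band in real vs synthetic EHR data
# (invented values; each distribution sums to 1).
real_age_dist      = np.array([0.10, 0.25, 0.30, 0.20, 0.15])
synthetic_age_dist = np.array([0.05, 0.20, 0.35, 0.25, 0.15])

# SciPy returns the Jensen-Shannon *distance* (square root of the divergence),
# so we square it; with base=2 the divergence is bounded between 0 and 1.
js_distance = jensenshannon(real_age_dist, synthetic_age_dist, base=2)
js_divergence = js_distance ** 2
print(f"Jensen-Shannon divergence: {js_divergence:.4f}")  # 0 = identical distributions
```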
Finally, García-de-León-Chocano et al [] propose a way of calculating data robustness. The calculation draws on comparing tolerable versus intolerable DQ issues, which may be particularly important prior to using the information. We strongly suggest that DQ assessments include a robustness calculation immediately before calculating data performance measures for downstream applications, which will allow for timely intervention in the case of significant issues with data completeness, conformity, or plausibility that merit additional data collection, review, or imputation steps as discussed above. The above findings and recommendations are summarized in .
Table 4. Recommendations for future EHR data quality and performance assessments.

Article characteristics
- Paucity of specialty-focused EHR data assessments: incentivize (eg, through quality improvement initiatives and grants) more EHR data assessments, particularly in psychiatry, emergency medicine, and surgical specialties.
- Incomplete reporting: use standardized frameworks for measuring and reporting data quality and performance assessments (eg, ).
- Poor replicability: describe DQA methods in enough detail that they could be replicated by a research team at a different institution.
- Limited generalizability: use already available data quality tools and standards (eg, the DQA guidelines proposed by Weiskopf et al []) before developing proprietary methodologies.

DQA
- Inconsistent methodologies: analyze completeness, conformance, and plausibility at every DQA (completeness alone may be applicable for quick data quality checks).
- Data missingness and nonconformity: use available AI-based data extraction algorithms (eg, Lee et al []) and augment data using external and synthetic datasets (eg, Zhang et al []).

Data performance assessment
- Inconsistent methodologies: augment correctness or accuracy measurement with recency, fairness, stability, and shareability performance metrics.
- EHR data bias: automate data fairness assessments by measuring agreement of AI-extracted data against a gold standard dataset (eg, manually extracted data) and preventing drift via condition fuzzying and regularization (eg, Zhang et al []).
- Timeliness of analysis: calculate dataset robustness prior to detailed data quality and performance analysis (eg, as described by García-de-León-Chocano et al []).

EHR: electronic health record. DQA: data quality assessment.
Further Recommendations
Based on the review and our team’s experience with DQ improvement initiatives, we recommend that administrators minimize modifiable DQ and performance issues arising during extraction by first using Internet of Things devices (eg, “smart” patient beds and infusion pumps) that directly upload measurements or settings to the EHR instead of requiring manual data entry. Second, the EHR’s interface should be anchored to a predefined data workflow and ontological structure agreed upon in collaboration with clinical and data administrators (eg, encounters start at the time of patient check-in instead of when a physician first sees the patient, and all encounter times are recorded in 1 location using standard units). Finally, the plausibility of automatically entered data should be periodically validated such that corrections can be made when necessary (eg, a minute-by-minute electrocardiogram plausibility check that can detect if an electrocardiography lead falls off a patient’s chest and needs to be replaced to record accurate measurements). Wherever possible, a reference data format (eg, electrocardiogram voltage between 0.5 and 5 mV) for the validation should be provided.
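A minimal sketch of such a reference-format validation is shown below; the field names and the infusion pump range are assumptions, while the electrocardiogram range mirrors the example above.

```python
# Illustrative reference formats for validating automatically captured device data.
REFERENCE_FORMATS = {
    "ecg_voltage_mv":  {"min": 0.5, "max": 5.0,   "unit": "mV"},
    "pump_rate_ml_hr": {"min": 0.1, "max": 999.0, "unit": "mL/hr"},
}

def validate_device_reading(field: str, value: float) -> bool:
    """Return True if a device reading falls within its reference range."""
    ref = REFERENCE_FORMATS[field]
    return ref["min"] <= value <= ref["max"]

# A near-zero ECG voltage suggests a detached lead that needs to be replaced.
print(validate_device_reading("ecg_voltage_mv", 0.02))  # False: lead likely off
print(validate_device_reading("ecg_voltage_mv", 1.2))   # True: plausible signal
```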
To minimize modifiable issues arising during data mapping, we furthermore recommend first establishing rules for how to treat (1) “missing,” (2) “modified,” or (3) “overlapping” data, such as whether (1) fields with no value should be regarded as data points or artifacts; (2) data points that have been subsequently modified should be updated or retained; and (3) one data source should take precedence over another in case of duplicate records (eg, weight recordings measured by weighing scale should supersede those measured by a hospital bed). Finally, standards for parent-child encounters should be instituted (eg, if a postoperative outpatient clinic visit should be assigned as a unique encounter or as a child encounter of the parent surgery visit).
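The sketch below illustrates one way to encode such a source-precedence rule for overlapping weight recordings; the source labels and ranking are illustrative assumptions.

```python
# Source-precedence rule for overlapping data: lower rank = preferred source.
SOURCE_PRIORITY = {"weighing_scale": 1, "hospital_bed": 2}

def resolve_duplicate_weights(readings: list[dict]) -> dict:
    """Pick one weight reading per time point according to source precedence."""
    return min(readings, key=lambda r: SOURCE_PRIORITY.get(r["source"], 99))

readings = [
    {"source": "hospital_bed",   "weight_kg": 83.4, "time": "2023-04-01T08:00"},
    {"source": "weighing_scale", "weight_kg": 81.9, "time": "2023-04-01T08:00"},
]
# The scale reading supersedes the bed reading, as in the example above.
print(resolve_duplicate_weights(readings))
```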
The provenance of outside facility records should also be maintained (eg, keeping records of where and when outside laboratory measures were taken) so that potential issues with externally collected data, such as less accurate laboratory techniques, can be identified.
Limitations
While this scoping review provides valuable insight into the existing literature on EHR DQ analytics, it has several limitations. Foremost, it is important to acknowledge the limited sample of 154 articles identified by our original search criteria and, consequently, the limited number of 26 original research articles included in our final analysis after full-text review. Among these articles, there was significant heterogeneity in settings and outcomes of interest, which may limit the validity of direct comparisons between the studies, as well as the generalizability of our findings. The review was furthermore restricted to articles available in the PubMed library, which may introduce publication bias, as well as to articles available only in English, which may introduce language bias to our study selection and subsequent analysis. Finally, while the review focused on EHR DQ and performance assessments, it did not include adjacent areas that may have a pronounced impact on clinical data recording and use, such as EHR implementation or use. Future research should consider broader inclusion criteria and explore additional dimensions of EHR DQ to provide a more comprehensive understanding of this important topic.
Conclusions
The findings of this scoping review highlight the importance of EHR DQ analysis in ensuring the accuracy and reliability of clinical data. Our review identified a need for specialty-specific data assessment initiatives, particularly in the fields of psychiatry, emergency medicine, and surgery. We additionally identified a need for standardizing DQ reporting to enhance the replicability and generalizability of outcomes in the field. Based on our review of the existing literature, we recommend analyzing DQ in terms of completeness, conformance, and plausibility; data performance in terms of correctness; and use case–specific metrics such as recency, fairness, stability, and shareability. Notably, our review demonstrated several examples of DQ improvement with the use of AI-enhanced data extraction and supplementation techniques. Future efforts in augmenting DQ through AI should make use of data fairness assessments to prevent the introduction of synthetic data bias.
TOB was supported by the National Institutes of Health (NIH; OT2 OD032701); the National Institute of Diabetes and Digestive and Kidney Diseases (NIH/NIDDK; K01 DK120784 and R01 DK121730); the National Institute of General Medical Sciences (NIH/NIGMS; R01 GM110240 and R01 GM149657); the National Institute of Biomedical Imaging and Bioengineering (NIH/NIBIB; R01 EB029699); the National Institute of Neurological Disorders and Stroke (NIH/NINDS; R01 NS120924); University of Florida (UF) Research (DRPD-ROSF2023 [00132783]); and the University of Florida Clinical and Translational Science Institute (AWD10247), which was supported in part by the NIH National Center for Advancing Translational Sciences (UL1TR001427). AB was supported by the NIH (OT2 OD032701), the National Institute of General Medical Sciences (NIH/NIGMS; R01 GM110240 and R01 GM149657), the National Institute of Biomedical Imaging and Bioengineering (NIH/NIBIB; R01 EB029699), the National Institute of Neurological Disorders and Stroke (NIH/NINDS; R01 NS120924), and the National Institute of Diabetes and Digestive and Kidney Diseases (NIH/NIDDK; R01 DK121730). TJL was supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01 GM149657). BS was supported by the NIH (OT2 OD032701), by the National Institute of Diabetes and Digestive and Kidney Diseases (NIH/NIDDK; R01 DK121730), and by the National Institute of General Medical Sciences (NIH/NIGMS; R01 GM110240 and R01 GM149657). JB was supported by the NIH (T32 GM008721). The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the paper; and decision to submit the paper for publication. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH and other funding sources.
YPP performed the investigation, data curation, and writing—original draft, review, and editing. TRB contributed to investigation, data curation, and writing—original draft. MMR performed data curation, investigation, and writing—review and editing. ML performed investigation. RS contributed to investigation. ZG did the investigation, methodology, and writing—review and editing. JB did the data curation, methodology, writing—review and editing—and supervision. TOB performed data curation, methodology, and supervision. BS performed data curation, methodology, and supervision. TJL contributed to data curation, methodology, and supervision. AB performed data curation, methodology, and supervision.
None declared.
Edited by Christian Lovis; submitted 06.03.24; peer-reviewed by Michelle Mun, Zhengyang Liu; final revised version received 14.05.24; accepted 08.06.24; published 06.11.24.
© Yordan P Penev, Timothy R Buchanan, Matthew M Ruppert, Michelle Liu, Ramin Shekouhi, Ziyuan Guan, Jeremy Balch, Tezcan Ozrazgat-Baslanti, Benjamin Shickel, Tyler J Loftus, Azra Bihorac. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 6.11.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.