Application of Spatial Analysis on Electronic Health Records to Characterize Patient Phenotypes: Systematic Review

Introduction

Electronic health records (EHRs) have significantly enriched clinical decision support by providing relatively cost-effective, time-efficient, and convenient access to records for large patient populations [,]. Because EHRs often contain patient addresses, spatial analysis can add value through high-resolution geocoding. The simplest such analysis is mapping, which can promote a better understanding of health disparities. Further, geocoded patient records can be linked to external data such as environmental, demographic, and socioeconomic factors, supporting more refined patient phenotyping and a deeper understanding of patient exposures for targeted interventions [].

The possibilities for applying spatial analysis to individual-level, EHR-derived data extend beyond geocoding, basic mapping, or external data linkage. For instance, spatial network analysis can examine proximity to sources of pollution [], measure accessibility to health care facilities [], and optimize resource allocation to mitigate health disparities []. Spatial clustering pinpoints statistically significant spatial and spatiotemporal hotspots and cold spots [], especially when longitudinal EHR data are considered. Moreover, spatial and spatiotemporal modeling can identify localized patterns, trends, and relationships within a specific region [,]. Identifying underserved communities through spatial analysis can enhance clinical decision support to implement targeted interventions such as screening, vaccination, or health education campaigns.

Despite the availability of advanced spatial analysis methods, most studies primarily focus on basic mapping or geocoding. Moreover, while these methodologies have the potential to better describe the context of individual patients in biomedical studies, there is a need for their improved application to derive more meaningful insights. To accurately address medical conditions, identify a disease in a patient, and scale that to cohorts of patients, phenotyping is required []. Phenotypes are a combination of observable traits, symptoms, and characteristics. They can contain inclusion and exclusion criteria (eg, diagnoses, procedures, laboratory reports, and medications) and can be used to recruit patients who fit the necessary criteria for clinical trials.

A prior systematic review used spatially linked EHR data to investigate the effects of social, physical, and built environments on health outcomes []. Another study highlighted the need to integrate spatial data related to individual patients into health care decision-making and practice []. Nonetheless, this is the first comprehensive study to systematically review US-based studies that applied spatial analysis to EHR-derived data to characterize patient phenotypes for clinical decision support and interventions. This review collates and synthesizes existing literature that used individual-level health data from EHRs in conjunction with advanced spatial analyses and patient phenotyping. Thus, the main objectives of this review are (1) to evaluate the degree to which advanced spatial methods are currently being used with individual-level data sourced from EHRs in the United States, (2) to identify areas of spatial analyses most applicable to biomedical studies, (3) to categorize publications concerning their biomedical and clinical areas and the specific patient phenotypes they target, and (4) to highlight knowledge gaps and propose future research directions for harnessing the potential of spatial analysis to enhance the context of individual-level data sourced from EHRs for biomedical studies.

Methods

Overview

This systematic review was performed using the protocols outlined by the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to identify the studies that satisfy the eligibility criteria for subsequent data extraction and synthesis ().

Data Source

A comprehensive search for peer-reviewed studies was carried out by screening titles and abstracts in the PubMed/MEDLINE, Scopus, and Web of Science databases using the search terms in . The search was conducted on August 29, 2023, without limitations on study design or specific health domains.

Table 1. The search strategy key terms.

Theme (a) | Key terms
Spatial analysis | (“Geospatial*” OR “Geo-spatial*” OR “Spatio-Temporal” OR “Spatial Temporal” OR “Space-Time” OR “Space Time” OR “Spatiotemporal” OR “Geocod*” OR “Spatial Autocorrelation” OR “Spatial Interpolation” OR “Spatial Epidemiology” OR “Spatial Data” OR “Spatial Modeling” OR “Spatial Modelling” OR “Spatial Mapping” OR “Geographic Mapping” OR “Georeferenc*” OR “Spatial Analys*” OR “Spatial Inequalit*” OR “Spatial Disparit*” OR “Spatial Dependenc*” OR “Spatial Access*” OR “Geographical Mapping” OR “Geographical Visualization” OR “Geographic Visualization” OR “Geovisualization” OR “Geographical Information System*” OR “Geographic Information System*” OR “Geofencing” OR “Geographical Distribution*” OR “Geographic Distribution*” OR “Spatial Statistic*” OR “Spatial Bayesian” OR “Spatial Hotspot*” OR “Spatial Cluster*” OR “Geographic Cluster*” OR “Geographic Hotspot*” OR “Remote Sensing” OR “Global Positioning System” OR “Spatial Pattern*” OR “Spatial Data Mining” OR “Spatial Variabilit*” OR “Spatial Heterogeneit*” OR “Geostatistic*” OR “Spatial Covariance” OR “Spatial Regression” OR “Spatial Uncertaint*” OR “Spatial Point Pattern*” OR “Kriging” OR “Cartography” OR “Spatial Decision Support System*” OR “OpenStreetMap” OR “Location-Based Services” OR “Spatial Quer*” OR “GIS” OR “Web GIS” OR “Satellite Imager*” OR “ArcGIS” OR “QGIS” OR “Risk Mapping”) AND
EHR (b) | (“EHR” OR “EMR” OR “EPR” OR “Electronic Health Record*” OR “Electronic Medical Record*” OR “Electronic Patient Record*” OR “EDW” OR “Enterprise Data Warehouse” OR “RDW” OR “Research Data Warehouse”)

(a) The selected studies that used spatial analysis of EHR data were manually excluded if they lacked patient phenotype characteristics or were not conducted based on the US data.

(b) EHR: electronic health record.

Search Strategy

The initial search comprised 2 main categories. The first category included a broad set of key terms related to spatial analysis. The second category used the key terms associated with EHRs. Henceforth, our reference to EHRs will also encompass electronic medical records (EMRs), electronic patient records (EPRs), enterprise data warehouses (EDWs), and research data warehouses (RDWs). The Boolean operator AND was applied to combine the 2 categories.

For PubMed/MEDLINE, Scopus, and Web of Science, we used a consistent search strategy tailored to the specific features and functionalities of each platform. We used the advanced search options available on these databases to input the key terms from . The search was conducted across titles and abstracts. For Google Scholar, due to its distinct search engine and more limited filtering options compared to the other databases, we conducted broad search queries with the same key terms. We then manually reviewed the results to identify and include relevant studies that met our criteria.
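The two-category structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_query` and the shortened term lists are hypothetical, and the exact field tags and syntax required by each database are not reproduced here.

```python
def build_query(spatial_terms, ehr_terms):
    """Join terms with OR inside each category, then AND across the 2 categories."""
    def or_group(terms):
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return or_group(spatial_terms) + " AND " + or_group(ehr_terms)


# A small, illustrative subset of the key terms from the search strategy table.
SPATIAL_TERMS = ["Geospatial*", "Geocod*", "Spatial Analys*", "Kriging"]
EHR_TERMS = ["EHR", "EMR", "Electronic Health Record*"]

query = build_query(SPATIAL_TERMS, EHR_TERMS)
print(query)
```

Keeping the term lists separate from the joining logic makes it easier to reuse the same terms when adapting the query to each platform's advanced search form.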

Study Selection

The retrieved abstracts and titles were imported into Covidence systematic review software (Veritas Health Innovation), where duplicate records across the source databases were automatically removed. Two reviewers (AM and BH) independently assessed the eligibility of the studies based on the following inclusion and exclusion criteria.

The studies were eligible for primary inclusion if they (1) were written in English; (2) were original peer-reviewed studies; (3) used individual-level patient data derived from EHRs, EMRs, EPRs, EDWs, or RDWs; and (4) incorporated at least 1 spatial method. Conversely, the studies were excluded if they (1) were not peer-reviewed (eg, letters, editorials, reviews, case reports, abstracts, and grey literature), (2) solely geocoded addresses or generated basic visualizations (eg, dot map and choropleth map) without any spatial analysis, or (3) were not based on US data.

The reviewers (AM and BH) independently reviewed the full texts of all remaining studies. Studies were also excluded if they lacked phenotype characteristics. Further, we manually checked the references of all selected studies for possible inclusion. A third reviewer (AVA) was consulted to break ties.

Data Extraction

Upon identifying studies that satisfied all inclusion criteria, two reviewers (AM and BH) extracted the following items for each study: title, publication year, country and region, sample size, study period, spatial methodologies, and key findings from the spatial methods. Moreover, studies were assessed to identify clinical domains (including primary and secondary when applicable), health conditions or problems, and themes (including social determinants of health [SDOH], environmental factors, ecological aspects, climate, microbiome, genomics, and clinical phenotypic characteristics). Previous publications have emphasized the importance of data domain sources in phenotyping, underscoring the need for validating the created phenotype [] and using multiple data sources. Thus, in cases where the included publications did not provide details of data sources but instead referenced previously published works, referenced publications were reviewed. Additionally, we cataloged the types of EHRs that served as the sources.

Narrative Synthesis

There is no universally accepted classification for spatial analysis methods. In this review, we have adopted and refined a classification framework based on the study of Nazia et al [], which initially categorized methods into frequentist and Bayesian approaches and spatial and spatiotemporal methods. This classification was further broken down into descriptive, clustering, and modeling techniques []. Therefore, following data extraction, the studies were categorized into the following spatial methodology classifications: descriptive, clustering, modeling (frequentist), spatiotemporal (frequentist), and Bayesian. The phenotype characteristics were extracted and recorded as free text. It should be noted that the categories were not mutually exclusive.

The quality appraisal of the studies was not feasible due to the substantial heterogeneity in spatial methodologies and health domains. The geospatial distribution of the included studies was visualized using ArcGIS Pro software (version 3.0; ESRI).

Results

Study Selection

The initial search yielded 1758 references. After removing duplicate records, we identified 952 studies for abstract and title screening, from which 375 were selected for full-text review. Of these, 322 studies were excluded as they only contained geocoding or basic mapping without any spatial analysis. Additionally, 15 studies were omitted because they lacked patient phenotype characteristics (n=2) or were not based on US data (n=13). We further manually searched references and Google Scholar and found 11 new studies that met the eligibility criteria. Therefore, 49 studies that fulfilled the inclusion criteria were retained for data extraction and synthesis. depicts the PRISMA flowchart for the study selection process.
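The selection counts reported above can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in the text; the variable names are illustrative, and the numbers of duplicates and of records excluded at screening are derived here rather than reported.

```python
# Counts reported in the study selection paragraph.
initial_records = 1758
after_deduplication = 952
full_text_reviewed = 375
excluded_basic_mapping = 322
excluded_no_phenotype = 2
excluded_non_us = 13
added_manual_search = 11

# Derived quantities (not stated explicitly in the text).
duplicates_removed = initial_records - after_deduplication
excluded_at_screening = after_deduplication - full_text_reviewed

# Studies retained from the database search, then the final included set.
retained_from_databases = (
    full_text_reviewed
    - excluded_basic_mapping
    - excluded_no_phenotype
    - excluded_non_us
)
included = retained_from_databases + added_manual_search

print(duplicates_removed, excluded_at_screening, retained_from_databases, included)
```

The final value matches the 49 included studies reported in the text.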


Figure 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) study selection flowchart.

Temporal and Geographic Distribution of Studies

Of the 49 included studies, a limited number (n=7, 14%) were published prior to 2017. The earliest included study was published in 2011, and publication frequency has increased markedly since 2017 (n=42, 86%), likely due to increased adoption of EHR systems and growing familiarity with spatial analysis techniques among researchers. Only one study [] was conducted at the national level. General characteristics of the included studies are presented in . Most studies were concentrated in North Carolina (n=8, 16%), Pennsylvania (n=6, 12%), California (n=6, 12%), and Illinois (n=4, 8%). illustrates the geospatial distribution of studies at the state level in the United States.

Table 2. General characteristics of the included studies.

No. | Author | Year | Region | Sample size, n | Study period
1 | Ali et al [] | 2019 | Atlanta | 4613 | 2002-2010
2 | Beck et al [] | 2018 | Cincinnati | 24,428 | 2011-2016
3 | Bravo et al [] | 2018 | Durham | 147,000 | 2007-2011
4 | Bravo et al [] | 2019 | Durham | 147,351 | 2007-2011
5 | Bravo et al [] | 2019 | Durham | 41,203 | 2007-2011
6 | Brooks et al [] | 2020 | Delaware | 5421 | 2020
7 | Carey et al [] | 2021 | Utah | 366 | 2006-2015
8 | Casey et al [] | 2016 | Pennsylvania | 20,569 | 2006-2013
9 | Chang et al [] | 2015 | Wisconsin | 103,690 | 2007-2009
10 | Cobert et al [] | 2020 | Durham | 10,352 | 2013-2018
11 | Davidson et al [] | 2018 | Denver | 21,578 | 2011-2012
12 | DeMass et al [] | 2023 | South Carolina | 2195 | 2019-2020
13 | Epstein et al [] | 2014 | Los Angeles | 5390 | 2007-2011
14 | Gaudio et al [] | 2023 | Tennessee | 2240 | 2015-2021
15 | Georgantopoulos et al [] | 2020 | South Carolina | 3736 | 1999-2015
16 | Ghazi et al [] | 2022 | Twin Cities, Minnesota | 20,289 | 2012-2019
17 | Garg et al [] | 2023 | Chicago | 777,994 | 2007-2012
18 | Grunwell et al [] | 2022 | Georgia | 1403 | 2015-2020
19 | Hanna-Attisha et al [] | 2016 | Flint, Michigan | 1473 | 2013-2015
20 | Immergluck et al [] | 2019 | Atlanta | 13,938 | 2002-2010
21 | Jilcott et al [] | 2011 | Eastern North Carolina | 744 | 2007-2008
22 | Kane et al [] | 2023 | Kansas and Missouri | 2427 | 2011-2020
23 | Kersten et al [] | 2018 | San Francisco | 47,175 | 2007-2011
24 | Lantos et al [] | 2018 | North Carolina | 3527 | N/A (a)
25 | Lantos et al [] | 2017 | Durham | 3527 | ≤2015
26 | Lê-Scherban et al [] | 2019 | Philadelphia | 3778 | 2016
27 | Lieu et al [] | 2015 | Northern California | 154,424 | 2000-2011
28 | Lipner et al [] | 2017 | Colorado | 479 | 2008-2015
29 | Liu et al [] | 2021 | Cincinnati and Houston | 88,013 | 2011-2016
30 | Mayne et al [] | 2019 | Chicago | 14,309 | 2015-2017
31 | Mayne et al [] | 2018 | Chicago | 4748 | 2009-2013
32 | Oyana et al [] | 2017 | Memphis | 28,793 | 2005-2015
33 | Patterson and Grossman [] | 2017 | Nationwide | ~100 million | 2003-2010
34 | Pearson and Werth [] | 2019 | Philadelphia | 642 | 2000-2017
35 | Samuels et al [] | 2022 | New Haven | 6366 | 2013-2017
36 | Schwartz et al [] | 2011 | Pennsylvania | 47,769 | 2009-2010
37 | Sharif-Askary et al [] | 2018 | North Carolina | 558 | 1998-2013
38 | Sidell et al [] | 2022 | Southern California | 446,440 | 2020-2021
39 | Siegel et al [] | 2022 | Delaware | 3449 | 2012-2020
40 | Soares et al [] | 2017 | Pennsylvania | 2049 | 2011-2012
41 | Sun et al [] | 2022 | Southern California | 395,927 | 2008-2018
42 | Tabano et al [] | 2017 | Denver | 31,275 | 2009-2011
43 | Wakefield et al [] | 2020 | Memphis | 3754 | 2015-2017
44 | Wilson et al [] | 2022 | Chicago | 39,211 | 2014-2016
45 | Winckler et al [] | 2023 | Southern California | 7896 | 2017-2019
46 | Xie et al [] | 2017 | Philadelphia | 27,604 | 2011-2014
47 | Xie et al [] | 2023 | Washington | 242,637 | 2015-2019
48 | Zhan et al [] | 2021 | Central Texas | 21,923 | 2019
49 | Zhao et al [] | 2021 | Wisconsin | 43,752 | 2007-2012

(a) Not applicable.


Figure 2. Geospatial distribution of the included studies at the state level in the United States.

Spatial Methodologies

Overview

Most studies used frequentist rather than Bayesian methods. Among frequentist methods, the most prevalent category was clustering (n=29), followed by descriptive (n=12), modeling (n=6), and spatiotemporal analyses (n=2). More detailed explanations of the spatial methods used in the included studies are provided in .

Descriptive Analyses

Descriptive analyses were categorized into four groups: spatial sampling (n=2), spatial overlay (n=2), proximity analysis (n=4), and spatial interpolation (n=4).

Spatial Sampling

A 2 SD ellipse method is used to optimize spatial sampling density. This ellipse contains almost 95% of the locations of patients and is used to ensure that the collected samples reflect the underlying spatial pattern in data, particularly when resources are limited []. Lantos et al [] and Lantos et al [] adopted this approach when sampling women who underwent cytomegalovirus antibody testing during pregnancy, especially in peripheral areas with limited subject representation.
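As a rough illustration of the idea, the sketch below flags which patient locations fall inside an ellipse defined by the sample covariance of the coordinates. It assumes projected (planar) coordinates and uses the squared Mahalanobis distance as the membership test; the function name is hypothetical, and the share of points actually covered depends on the point distribution and on how a given software package scales the ellipse axes.

```python
def inside_sd_ellipse(points, n_sd=2.0):
    """Return one boolean per (x, y) point: True if inside the n_sd ellipse."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample variances and covariance of the coordinates.
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = var_x * var_y - cov_xy ** 2
    if det <= 0:
        raise ValueError("points are collinear; the ellipse is degenerate")
    flags = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Squared Mahalanobis distance via the inverse of the 2x2 covariance matrix.
        d2 = (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det
        flags.append(d2 <= n_sd ** 2)
    return flags


# Illustrative coordinates: a compact cluster plus one distant location.
example_points = [
    (0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
    (0.4, 0.6), (0.6, 0.4), (0.5, 0.4), (0.4, 0.5), (10, 10),
]
example_flags = inside_sd_ellipse(example_points)
```

Locations flagged as outside the ellipse correspond to the peripheral areas where additional sampling effort may be needed.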

Spatial Overlay

Spatial overlay integrates various spatial data sources, often maps, to represent their shared features. Wakefield et al [] overlaid the map of major radiation treatment interruptions based on race onto the map of median household income. Their analysis implied that regions with higher income levels experienced lower rates of radiation treatment interruption. Samuels et al [] spatially joined patient addresses to the nearest city parcels and computed an estimate of the incidence of emergency department visits for asthma for each parcel [].

Proximity Analysis

Proximity analysis includes measuring distances between geographic features to identify nearby features within a defined distance or buffer zone to uncover proximity patterns []. Wilson et al [] created temporal and spatial buffers to assess the correlation between individual exposure to violent crime and blood pressure. Schwartz et al [] evaluated the associations between environmental factors and BMI within a 0.5-mile network buffer from the place of residence. Casey et al [] investigated the associations between prenatal residential greenness and birth outcomes within 250-m and 1250-m buffers. Using a geographic information system service area network analysis, Jilcott et al [] examined BMI percentile and proximity to fast-food and pizza establishments among adolescents within 0.25-mile Euclidean and network buffer zones.
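A straight-line (Euclidean-style) buffer can be sketched as follows. This is a simplified illustration, not the method of any cited study: it uses great-circle distance on latitude and longitude, assumes a mean Earth radius, and does not model street-network buffers, which require routing data. All names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def count_within_buffer(home, places, radius_miles):
    """Count places whose straight-line distance from home is within the radius."""
    return sum(
        1 for lat, lon in places
        if haversine_miles(home[0], home[1], lat, lon) <= radius_miles
    )


# Hypothetical residence and three hypothetical establishments.
home = (36.0, -79.0)
places = [(36.0, -79.0), (36.002, -79.0), (36.01, -79.0)]
nearby = count_within_buffer(home, places, radius_miles=0.25)
```

A network buffer would replace `haversine_miles` with a travel distance along roads, which generally yields a smaller and irregularly shaped catchment than the circular buffer.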

Spatial Interpolation

Ordinary Kriging is one of the most widely used spatial interpolation techniques that leverages the spatial autocorrelation structure of observed locations to estimate values at unmeasured locations []. Hanna-Attisha et al [] applied ordinary Kriging with a spherical semivariogram model based on observations of the children’s elevated blood lead level geocoded to the home address to visualize blood lead level variations before and after water source changes. Mayne et al [] interpolated the levels of neighborhood physical disorder based on an exponential variogram. Patterson and Grossman [] demonstrated spatial variations for the incidence rates of each International Classification of Diseases, Ninth Revision diagnostic code based on an exponential variogram. Sun et al [] estimated monthly average concentrations of fine particulate matter to investigate the associations between air pollution exposure during pregnancy and gestational diabetes mellitus.
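The spherical and exponential models mentioned above describe how semivariance grows with separation distance. The sketch below gives one common parameterization of each; the parameter names (`nugget`, `partial_sill`, `range_`) are illustrative, and some software defines the exponential model with a factor of 3 in the exponent so that the range parameter is the practical range. Fitting these models to data and solving the kriging system are not shown.

```python
import math


def spherical_variogram(h, nugget, partial_sill, range_):
    """Spherical model: rises to the sill and is flat beyond the range."""
    if h == 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill
    r = h / range_
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)


def exponential_variogram(h, nugget, partial_sill, range_):
    """Exponential model: approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_))


# Semivariance at a few illustrative separation distances.
distances = [0, 2.5, 5, 10, 20]
spherical_values = [spherical_variogram(h, 0.0, 1.0, 10.0) for h in distances]
exponential_values = [exponential_variogram(h, 0.0, 1.0, 10.0) for h in distances]
```

The practical difference is that the spherical model treats observations farther apart than the range as uncorrelated, whereas the exponential model lets correlation decay gradually without a hard cutoff.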

Spatial Clustering

Overview

Spatial clustering techniques assess whether health outcomes are random, uniform, or clustered and pinpoint the locations of clusters []. Spatial clustering was the most widely used category (n=29) among all studied categories. Moran I clustering and cluster detection were the most frequent techniques (n=10), followed by kernel/point density estimation (n=5), spatial scan statistics (n=4), and Getis-Ord Gi* statistics (n=4).

Kernel/Point Density Estimation

Kernel density estimation generates a smooth surface to visualize areas of the most significant spatial intensity by calculating a distance-weighted count of events within a specified radius per unit area []. Several studies adopted kernel density estimation to analyze patterns, including cholera hospitalization [], comparison of the spatial intensity of chronic kidney disease with nonchronic kidney disease patients [], and comparison of the spatial intensity of breast cancer and nonbreast cancer []. Using the point density function, Beck et al [] pinpointed hotspots of inpatient bed-day rates within a 2-mile radius of a medical center, and Kane et al [] estimated the number of participants per square mile.
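The distance-weighted count described above can be sketched with a quartic (biweight) kernel, which is one common choice. This is a minimal illustration assuming planar coordinates and a fixed search radius; the function name and the sample events are hypothetical, and the cited studies may have used different kernels or software defaults.

```python
import math


def quartic_kernel_density(location, events, radius):
    """Density at `location`: kernel-weighted count of events per unit area."""
    x0, y0 = location
    total = 0.0
    for x, y in events:
        d = math.hypot(x - x0, y - y0)
        if d < radius:
            # Quartic kernel, normalized to integrate to 1 over the unit disk.
            total += (3.0 / math.pi) * (1.0 - (d / radius) ** 2) ** 2
    return total / radius ** 2


# Hypothetical event locations (eg, geocoded cases) in projected units.
events = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0)]
density_at_origin = quartic_kernel_density((0.0, 0.0), events, radius=1.0)
```

Evaluating this function over a regular grid of locations produces the smooth intensity surface; larger radii give smoother surfaces at the cost of local detail.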

Global and Local Moran I

Global Moran I (GMI) evaluates the overall pattern for spatial autocorrelation [] by inferring if a variable is spatially clustered or overdispersed versus being randomly distributed under the null hypothesis []. Local Moran I (LISA) is used to locate statistically significant clusters including hotspots, cold spots, and outliers []. GMI has been adopted to analyze spatial clustering of health outcomes including gestational diabetes mellitus [], day-of-surgery cancellation [], obesity [], and COVID-19 []. All exhibited clustered patterns. Xie et al [] analyzed 3 groups: depression, obesity, and comorbid cases, confirmed clustering for all outcomes, and identified spatial clusters and outliers. Pearson and Werth [] found random distributions for dermatomyositis (DM) and subtypes, classic DM, and clinically amyopathic DM. Meanwhile, Davidson et al [] pinpointed clusters with higher or lower depression prevalence, and Winckler et al [] identified a cluster of low use of acute pediatric mental health interventions in less-densely populated rural border areas.
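For readers unfamiliar with the statistic, the global Moran I can be computed directly from a vector of area-level values and a spatial weights matrix. The sketch below assumes a simple binary adjacency matrix for areas arranged in a line; the helper names are hypothetical, and inference (eg, permutation-based P values) is not shown.

```python
def global_morans_i(values, weights):
    """Global Moran I for values x_i and spatial weights w_ij (w_ii = 0)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between neighboring areas.
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    ss = sum(d * d for d in z)
    return (n / w_sum) * (cross / ss)


def chain_weights(n):
    """Binary adjacency for n areas in a row: each area neighbors the next."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w


w = chain_weights(4)
clustered = global_morans_i([1, 2, 3, 4], w)    # similar values are adjacent
alternating = global_morans_i([1, 0, 1, 0], w)  # dissimilar values are adjacent
```

Positive values indicate that similar values tend to be neighbors (clustering), negative values indicate that dissimilar values tend to be neighbors (dispersion), and values near the expectation under the null suggest a random arrangement.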

GMI and semivariograms or variograms can also identify spatial autocorrelation in model residuals. If detected, the models are adjusted accordingly to avoid biased estimates. For example, Lipner et al [] modeled nontuberculous mycobacteria disease, shifting the use from a nonspatial Bayesian model to a spatial model when spatial autocorrelation was found in residuals. Similarly, Georgantopoulos et al [] incorporated spatial random effects into a prostate cancer model due to significant autocorrelation in the residuals. Sharif-Askary et al [] used variograms to assess spatial dependency in cleft lip or palate, leading to a geostatistical model over standard logistic regression. Conversely, Casey et al [] found no spatial autocorrelation in nonspatial model residuals.

The bivariate GMI quantifies the overall spatial dependence between two distinct variables (positive value indicates high values of one variable are surrounded by high values of the other or low values are surrounded by low values, while negative value implies high values of one variable are surrounded by low values of the other) []. Bivariate LISA assesses the relationship between the two variables at the local level. Pearson and Werth [] used bivariate GMI for the prevalence of DM, classic DM, and clinically amyopathic DM with airborne toxics but found no overall spatial dependencies. However, bivariate LISA identified local dependencies at the zip code level. Garg et al [] applied bivariate GMI and found significant overall associations between longer (average) distances to the nearest supermarket and higher incidence of diabetes, and bivariate LISA identified significant “high-high” relationships at the zip code level. Gaudio et al [] used bivariate LISA and found no local association between radiation therapy interruption and social vulnerability index at the zip code level.

Getis-Ord Gi*

The Getis-Ord Gi* statistic identifies high- or low-value clusters (hotspots and cold spots) by assessing deviations of health outcomes at locations from the average within a defined neighborhood []. Lê-Scherban et al [] measured racial residential segregation by examining the deviations in the African American residents in each census tract from the mean of neighboring tracts. Similarly, Mayne et al [] measured racial residential segregation for the percentage of non-Hispanic Black residents. Ali et al [] identified significant community-onset methicillin-resistant Staphylococcus aureus (CO-MRSA) hotspots with distinct patterns between cases and controls. Kersten et al [] detected the high- and low-value clusters for the child opportunity index and median household income.
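The Gi* statistic can be written as a z-score comparing the weighted local sum around each location with what the global mean would predict. The sketch below is a minimal version under stated assumptions: binary weights that include each location itself, areas arranged in a line, and hypothetical helper names. It does not apply multiple-testing corrections.

```python
import math


def getis_ord_gi_star(values, weights):
    """Gi* z-score per location; weights[i][i] should be nonzero (self included)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        w = weights[i]
        sw = sum(w)
        sw2 = sum(x * x for x in w)
        numerator = sum(w[j] * values[j] for j in range(n)) - mean * sw
        denominator = s * math.sqrt((n * sw2 - sw ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


def chain_weights_with_self(n):
    """Binary weights for n areas in a row, including each area itself."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            if 0 <= j < n:
                w[i][j] = 1.0
    return w


# Hypothetical rates: high values at one end, low values at the other.
rates = [10, 10, 1, 1, 1]
gi_scores = getis_ord_gi_star(rates, chain_weights_with_self(len(rates)))
```

Large positive scores mark candidate hotspots and large negative scores mark candidate cold spots; in this toy example the first area scores highest because both it and its neighbor have high values.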

Spatial Scan Statistics

The spatial scan statistics technique identifies high- and low-risk clusters and estimates their relative risks []. It also can incorporate covariates to characterize underlying patterns []. Lipner et al [] found that people living in zip codes within the primary cluster had an almost 2.5 times greater risk of nontuberculous mycobacteria disease. Lieu et al [] identified clusters of underimmunization and vaccine refusal among children, with rates ranging from 18% to 23% inside the clusters compared to 11% outside.

The technique can also pinpoint cold spots. Brooks et al [] identified areas with significantly lower COVID-19 testing than expected, indicating a need for interventions. Zhan et al [] observed significantly low rates of up-to-date colorectal cancer screening.
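To make the idea concrete, the sketch below scores a single candidate zone under the Poisson model commonly used with spatial scan statistics: it computes the relative risk inside versus outside the zone and the log likelihood ratio for a high-rate cluster. The function names and the numbers are hypothetical. A full scan evaluates many candidate zones and assesses significance by Monte Carlo replication, which is not shown here.

```python
import math


def relative_risk(cases_in, expected_in, total_cases):
    """Risk inside the zone divided by risk outside, using expected counts."""
    inside = cases_in / expected_in
    outside = (total_cases - cases_in) / (total_cases - expected_in)
    return inside / outside


def poisson_scan_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio for a high-rate cluster in one candidate zone."""
    c, e, big_c = cases_in, expected_in, total_cases
    if c <= e:
        return 0.0  # not an excess of cases, so no evidence of a high-rate cluster
    llr = c * math.log(c / e)
    if big_c > c:
        llr += (big_c - c) * math.log((big_c - c) / (big_c - e))
    return llr


# Hypothetical zone holding 10% of the population but 50 of 200 cases.
total_cases = 200
expected_in_zone = total_cases * 0.10
zone_rr = relative_risk(50, expected_in_zone, total_cases)
zone_llr = poisson_scan_llr(50, expected_in_zone, total_cases)
```

The zone with the largest log likelihood ratio across all candidates is reported as the primary (most likely) cluster. Scanning for cold spots uses the same ratio with the condition reversed to cases below expectation.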

Spatial Modeling (Frequentist)

Among the included studies, generalized additive models (GAMs) were the most frequently used spatial models. GAMs can account for spatial autocorrelation by incorporating smooth functions (such as thin-plate regression splines) of spatial coordinates [], allowing geographic variation to be estimated with or without covariate adjustment. GAMs were used to identify spatial variability in asthma prevalence [,] and cytomegalovirus [,], although such variation often diminished after adjustment for demographic factors such as race and age. Less commonly used geospatial models were generalized linear mixed effects [] and spatial error [] models.

Spatiotemporal Analysis

Only 2 studies explored spatiotemporal patterns, and no spatiotemporal modeling was conducted. Oyana et al [] used space-time scan statistics to study the spatiotemporal patterns of childhood asthma and found a significant frequency increase (2009-2013) and a rising trend from 4 to 16 per 1000 children (2005-2015). Ali et al [] used the space-time cube tool and emerging hotspot analysis to analyze the spatial-temporal trends and evolving patterns of CO-MRSA from 2002 to 2010. They identified several types of space-time hotspots of CO-MRSA including new, consecutive, intensifying, sporadic, and oscillating hotspots.

Bayesian Analysis

The studies using Bayesian methods were categorized into empirical Bayes smoothing (n=5) and Bayesian modeling (n=6).

Empirical Bayes Smoothing

Empirical Bayes smoothing was used by Lê-Scherban et al [], Liu et al [], Tabano et al [], and Xie et al [] to stabilize estimated rates in areas with limited data points by borrowing information from the overall population []. Zhao et al [] used nonparametric kernel smoothing to estimate the prevalence of childhood obesity in areas with sparse observations (n<20 individuals) [].
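The notion of borrowing information from the overall population can be illustrated with a global empirical Bayes smoother based on a method-of-moments estimate of the prior. This is one common formulation and is offered only as a sketch: the function name and the example counts are hypothetical, and the cited studies may have used local (neighborhood-based) variants or different software implementations.

```python
def empirical_bayes_smooth(cases, populations):
    """Shrink each area's raw rate toward the overall rate.

    Areas with small populations are shrunk more than areas with large ones.
    """
    total_pop = sum(populations)
    prior_mean = sum(cases) / total_pop
    raw = [c / p for c, p in zip(cases, populations)]
    mean_pop = total_pop / len(populations)
    # Method-of-moments estimate of the between-area variance (floored at 0).
    prior_var = (
        sum(p * (r - prior_mean) ** 2 for p, r in zip(populations, raw)) / total_pop
        - prior_mean / mean_pop
    )
    prior_var = max(prior_var, 0.0)
    smoothed = []
    for r, p in zip(raw, populations):
        # Weight on the raw rate grows with the area's population.
        w = prior_var / (prior_var + prior_mean / p) if prior_var > 0 else 0.0
        smoothed.append(w * r + (1.0 - w) * prior_mean)
    return smoothed


# Hypothetical areas: one with very few residents, two with many.
example_cases = [1, 50, 30]
example_pops = [10, 1000, 1000]
smoothed_rates = empirical_bayes_smooth(example_cases, example_pops)
```

In this toy example, the raw rate of the smallest area (1 case among 10 residents) is pulled strongly toward the overall rate, while the rates of the two large areas change much less.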

Bayesian Modeling

Bayesian modeling can account for spatial and temporal dependencies and quantify uncertainty by specifying prior distributions []. Among the studies, the conditional autoregressive (CAR) prior was the most commonly used, with 2 variants: intrinsic and multivariate CAR. Intrinsic CAR was used to assess the spatial variations in diabetes in relationship with racial isolation [], hypertension related to racial isolation [], and type 2 diabetes mellitus with the built environment []. Multivariate CAR was used to identify areas with higher or lower-than-expected prostate cancer while controlling for risk factors []. Moreover, hierarchical Bayesian modeling, which can incorporate hierarchical structures [], was used to investigate spatial distributions of patients admitted for drug-related reasons concerning the area deprivation index []. Bayesian negative binomial hurdle models, which can account for excessive zeros and overdispersion, were used to examine spatial variation between patient responses to the questions concerning unhealthy home environments and the mean number of emergency department visits after screening [].
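For orientation, the intrinsic CAR prior is usually stated through its full conditional distributions. The expression below is the standard textbook form, with notation assumed here rather than taken from the included studies: \phi_i is the spatial random effect for area i, j \sim i denotes the areas adjacent to i, n_i is the number of such neighbors, and \tau^2 is a variance parameter.

```latex
\phi_i \mid \phi_{-i} \;\sim\; \mathcal{N}\!\left( \frac{1}{n_i} \sum_{j \sim i} \phi_j,\; \frac{\tau^2}{n_i} \right)
```

Each area's effect is centered on the average of its neighbors' effects, and areas with more neighbors have smaller conditional variance, which is how the prior encodes spatial smoothing.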

Phenotyping

Clinical Domain Characteristics and Themes

The largest categories of studies were the infectious disease (n=7), endocrinology (n=7), and oncology (n=6) domains. Additionally, 19 studies had a pediatric domain or focus, as noted with an additional column in . Maternal and newborn care was classified as its own domain (n=8), but it overlapped with other domains such as nephrology, endocrinology, and infectious disease.

Table 3. Clinical domains and condition or problem of focus for each publication.

Condition by clinical domain (a) | Secondary clinical domain (b) | Pediatric population involved
Pediatric
DoSC (c) [] | — (d) | ✓
EBLL (e) [] | — | ✓
Disparities in inpatient bed-day rates [] | — | ✓
Maternal and newborn care
Under immunization; vaccine refusal [] | — | ✓
Preterm birth; small for gestational age; hypertensive disorder of pregnancy [] | — |
Preterm birth; small for gestational age; low birth weight; low Apgar score [] | — |
Hypertension [] | — |
Hypertension [] | — |
Hypertension; diabetes [] | Endocrinology |
Hypertension; diabetes; CKD (f) [] | Endocrine; nephrology |
Hypertension, disorder of pregnancy [] | Maternal and newborn care |
Endocrinology
GDM (g) [] | Maternal and newborn care |
T2DM (h) [] | — |
T2DM [] | — |
Obesity [] | — |
Obesity [] | — | ✓
Obesity [] | — | ✓
Obesity [] | — | ✓
Obesity; depression [] | Psychiatry |
Psychiatry
Acute pediatric mental health interventions or services [] | — | ✓
Depression [] | — |
Telemedicine use in developmental-behavioral pediatrics [] | — | ✓
Drug overdoses [] | Emergency medicine |
Emergency medicine
Disparities in pediatric acute care visit frequency and diagnoses [] | — | ✓
Disparities in use of PICU (i) [] | — | ✓
Emergency department use [] | — |
Pulmonary
Asthma, emergency department asthma visits [] | Emergency medicine |
Asthma [] | — | ✓
Asthma [] | — | ✓
Asthma [] | — |
Asthma [] | — |
Infectious disease
Coccidioidomycosis [] | Pulmonary |
Community-associated MRSA (j) [] | — | ✓
Community-onset-MRSA [] | — | ✓
COVID-19 [] | — |
COVID-19 [] | — |
CMV (k) [] | Maternal and newborn care | ✓
CMV [] | — | ✓
Nontuberculous mycobacterial infection [] | — |
Oncology
RTI (l) [] | — |
RTI [] | — |
Colorectal cancer screening [] | — |
Prostate cancer [] | — |
TNBC (m) [] | — |
Disparities in genomic answers for kids (GA4K) [] | — | ✓
Maxillofacial
Cleft lip or palate [] | — | ✓
Nephrology
CKD [] | — |
Rheumatology
Dermatomyositis [] | Neurology; dermatology |
All domains
Geospatial variation of disease incidence [] | — |

(a) Condition or problem of focus column displays the general condition of the study and may not directly correspond to the phenotype.

(b) Publications with more than 1 clinical domain and those with a pediatric component are noted as such.

(c) DoSC: day-of-surgery cancellation.

(d) Not applicable.

(e) EBLL: elevated blood lead levels.

(f) CKD: chronic kidney disease.

(g) GDM: gestational diabetes mellitus.

(h) T2DM: diabetes mellitus, type 2.

(i) PICU: pediatric intensive care unit.

(j) MRSA: methicillin-resistant Staphylococcus aureus.

(k) CMV: cytomegalovirus.

(l) RTI: radiation treatment interruption.

(m) TNBC: triple-negative breast cancer.

The relationship between the clinical domains and the “conditions or problems of focus” in each study was examined (). In some cases, direct correspondence was observed, while in other instances, the “condition or problems of focus” differed from the phenotype of the patient cohort. In many studies, one or more overlapping domains were observed (eg, rheumatology, neurology, and dermatology for the study of DM). Asthma (n=5), hypertension (n=5), and diabetes (n=4) were studied most frequently. Three studies did not focus on any health condition but rather on examining disparities in either a data source or a specific domain or cohort (eg, disparities in the use of pediatric intensive care units).

Every study was attributed to at least one prominent theme, with the possibility of multiple themes. SDOH themes were prevalent in many studies. To organize and present this information, we used the domains defined by the Healthy People 2030 framework []. There are 5 domains in the SDOH framework (), with the corresponding counts of these domains being seen as themes of the studies. Most studies had 1 or more SDOH themes (n=42). Many studies focused either on all the domains or SDOH holistically without particular focus on any specific domain (n=32). However, some studies contained prominent themes that were not directly related to SDOH, which were phenotypic features (n=4), followed by environmental (n=3), and ecological (n=2), with climate, genomics, and microbiome, each contributing one study.

Table 4. SDOH (a) themes examined within the framework of Healthy People 2030 SDOH domains [].

Labels and SDOH domains | Counts, n
SDOH 1: Economic stability (employment, food insecurity, housing instability, poverty) | 2
SDOH 2: Education access and quality (early childhood development and education, enrollment in higher education, high school graduation, language, and literacy) | N/A (b)
SDOH 3: Health access and quality (access to health services, access to primary care, health literacy) | 5
SDOH 4: Neighborhood and built environment (access to foods that support healthy dietary patterns, crime and violence, environmental conditions, quality of housing) | 14
SDOH 5: Social and community context (civic participation, discrimination, incarceration, social cohesion) | 5
All 5 SDOH domains or SDOH as a whole | 36
Non-SDOH focus | 8

(a) SDOH: social determinants of health.

(b) Not applicable.

Clinical Phenotype Features

For each publication, clinical phenotype definitions were extracted (). In almost all studies, phenotype definitions included demographic details such as patient age, race, and gender, along with some diagnostic characteristics (eg, asthma diagnosis). Only a limited number of phenotypes were observed to be validated (n=8). The most frequently observed method for phenotype validation was a manual chart review of all matches or a sample of matched charts. None of the studies with chart review as a validation method shared information on the match rate. Additionally, only two studies [,] were observed to use validated eMERGE Network computable phenotypes from the Phenotype Knowledgebase [-].

Discussion

Principal Findings

This systematic review is the first comprehensive investigation of spatial methodologies within EHR-derived data in the United States. The findings reveal that a considerable portion of studies predominantly focus on basic mapping or geocoding, with a limited use of advanced spatial analysis methods. Spatial clustering and descriptive analysis were the most used methods, while space-time modeling, either frequentist or Bayesian, was not widely applied. The diverse use of spatial analysis for EHR-derived data in different health domains highlights the potential to incorporate spatial methods to enhance the context of individual patients for future biomedical research. We found limited use of EHR-derived data for spatial analysis, probably due to the challenge of safeguarding patient privacy. Address data, crucial for spatial analysis, is highly confidential and often restricted from sharing. Researchers and institutions often use geographic masking techniques [,] to balance data use and privacy protection by altering the precise geographic coordinates while preserving the overall spatial characteristics of data. Encouraging the adoption of spatial analysis could promote biomedical knowledge sharing and collaboration.
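
To make the idea of geographic masking concrete, the following is a minimal sketch of one common variant, often called donut masking, in which each point is displaced by a random distance between a minimum and a maximum radius. The review does not state which masking method the cited work uses, so the function names, the 100 m and 500 m radii, and the small-displacement approximation are all illustrative assumptions rather than a recommended configuration.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def donut_mask(lat, lon, min_m=100.0, max_m=500.0, rng=None):
    """Displace a point by a random distance in [min_m, max_m] along a random bearing.

    Hypothetical helper for illustration; the radii are assumptions, not guidance.
    """
    rng = rng or random.Random()
    # Sample the squared distance uniformly so points are spread evenly over the ring area.
    dist = math.sqrt(rng.uniform(min_m ** 2, max_m ** 2))
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Small-displacement approximation: convert metre offsets to angular offsets.
    dlat = dist * math.cos(bearing) / EARTH_RADIUS_M
    dlon = dist * math.sin(bearing) / (EARTH_RADIUS_M * math.cos(math.radians(lat)))
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres, used here only to check the displacement."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

The minimum radius prevents a masked point from landing on or very near the true address, while the maximum radius bounds how far local spatial patterns are distorted. In practice, the radii would typically be chosen with reference to local population density and institutional privacy requirements.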

The use of EHR data for spatial analysis can present several challenges, particularly in accurately geocoding patient addresses. Issues such as address formatting errors, incomplete or outdated addresses, and potential inaccuracies in geocoding services can influence the outcome of spatial analysis []. Advanced geocoding algorithms and manual verification processes can mitigate these issues. For instance, Goldberg et al [] developed a web-based system for rapid manual intervention of previously geocoded data, significantly improving the match rate and quality of individual geocodes with minimal time and effort. When addresses are only available at the ZIP code level, further complications arise because ZIP code boundaries are often not well-defined and can change over time []. Spatial smoothing techniques and ZIP code centroids can mitigate some of these challenges. We recommend standardizing address formats before geocoding (using tools like the US Postal Service address verification), using advanced geocoding services, leveraging higher-resolution geographical data when possible, and integrating multiple spatial scales to enhance the accuracy and reliability of spatial analysis using EHR data.
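
As a rough illustration of what standardizing address formats before geocoding can involve, the sketch below normalizes case, whitespace, and punctuation and abbreviates a few street suffixes. It is not the US Postal Service verification tool mentioned above; the function name and the deliberately tiny abbreviation map are assumptions made for this example, and a production pipeline would rely on a full standardization service.

```python
import re

# Hypothetical, intentionally small abbreviation map for illustration only;
# real postal standards cover many more suffixes and unit designators.
SUFFIXES = {
    "street": "ST",
    "avenue": "AVE",
    "road": "RD",
    "boulevard": "BLVD",
    "drive": "DR",
}


def normalize_address(raw):
    """Return an upper-cased, single-spaced address with common suffixes abbreviated."""
    # Replace punctuation with spaces, keeping '#' (unit marker) and '-' (ranges).
    text = re.sub(r"[^\w\s#-]", " ", raw)
    tokens = text.upper().split()
    # Abbreviate any token that matches a known suffix; leave the rest unchanged.
    return " ".join(SUFFIXES.get(tok.lower(), tok) for tok in tokens)
```

Under these assumptions, an input such as "123  Main Street, Apt. 4" becomes "123 MAIN ST APT 4", so two differently typed versions of the same address are more likely to resolve to the same geocode.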

We acknowledge that not all patient phenotypes are inherently suited for spatial analysis, and integrating phenotypes derived from genomics, imaging, and clinical notes can be particularly challenging. However, evidence suggests that spatial techniques can provide valuable insights even in areas where their application may initially appear difficult. For instance, Baker et al [] demonstrated the effectiveness of spatial analysis in genomics by combining single-nucleotide polymorphism genotyping with geospatial K-function analysis. Their study of typhoid in Nepal found significant geographic clustering of cases. Canino [] developed a robust framework that integrated biological data with geographic information from EMRs. The system identified correlations between patient profiles and geographic factors such as environmental exposures related to pollution. Future interdisciplinary studies can explore developing frameworks that integrate genomics or notes with geospatial datasets to reveal complex relationships and patterns.
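
For readers unfamiliar with K-function analysis, the sketch below shows a naive estimate of Ripley's K for planar point coordinates. It is not the implementation used by Baker et al; it omits edge correction and significance testing, and the function name and inputs are assumptions for illustration. Under complete spatial randomness, K(r) is approximately πr², so an estimate well above that value at a given radius suggests clustering.

```python
import math


def ripley_k(points, r, area):
    """Naive Ripley's K estimate at radius r, without edge correction.

    points: list of (x, y) tuples in projected (planar) units
    area:   area of the study region in the same units squared
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Count ordered pairs (i, j), i != j, whose separation is at most r.
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= r:
                pairs += 1
    # Scale the pair count by the estimated intensity of the point pattern.
    return area * pairs / (n * (n - 1))
```

A typical use would compare the estimate with πr² across a range of radii, ideally against simulation envelopes generated from the population at risk rather than a uniform background.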

The application of spatiotemporal analysis to EHR-derived data was mainly limited to exploring spatiotemporal clusters, with no spatiotemporal modeling. This might be due to the technical expertise required for analysis, data complexity, the availability of longitudinal data, and computational challenges. The Bayesian framework offers a more adaptable approach to handle complex spatial and temporal dependencies, control confounding variables [], and incorporate prior information, such as existing medical literature and expert opinions, resulting in more interpretable results [,]. Moreover, spatiotemporal Bayesian modeling can aid in understanding disease trends and progressions, seasonality, and long-term shifts at the local level []. Bayesian modeling can also account for uncertainty in parameter estimates and predictions to assess the reliability of findings before implementing interventions []. Thus, future research should explore spatial and spatiotemporal modeling, focusing on Bayesian approaches. Moreover, ignoring spatial dependence in modeling can bias parameter estimates [,,]. Additional state-of-the-art methods, such as space-time autoregressive models and generalized additive models for location, scale, and shape, also provide flexibility in modeling complex relationships. Spatiotemporal point process models also contribute by analyzing the distribution of health events and underlying states over space and time.
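
One simple way to check for the spatial dependence discussed above, before deciding whether a spatial model is needed, is to compute global Moran's I on area-level values or model residuals. The review does not prescribe this statistic here, so the sketch below is offered only as an illustrative diagnostic; the function name and the binary adjacency weights are assumptions.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weights matrix.

    values:  one observation (or model residual) per area
    weights: weights[i][j] > 0 when areas i and j are neighbours, else 0
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of all weights, used to normalise the cross-product term.
    s0 = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between neighbouring areas.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


# Four areas arranged in a line, with binary adjacency between consecutive areas.
LINE_WEIGHTS = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
```

Values near zero suggest little spatial autocorrelation, positive values indicate that neighbouring areas tend to be similar, and negative values indicate that neighbours tend to differ. A clearly nonzero statistic on regression residuals is one signal that a model with an explicit spatial component may be warranted.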

JMIR MEDICAL INFORMATICS
