Deriving neighborhood-level diet and physical activity measurements from anonymized mobile phone location data for enhancing obesity estimation

Study area and data

Study area

We selected three US cities for this study: New York City (NYC), Los Angeles (LA), and Buffalo. NYC and LA are two megacities located on the east coast and west coast respectively, while Buffalo is a medium-sized city that the authors are familiar with and that is located close to the Midwest region of the US. Although other cities could also have been selected, these three cities allow a comparison of results across cities of different sizes located in different geographic regions. The time period of our study is the year 2018, and the geographic unit of analysis is the census tract, which is roughly comparable to a neighborhood. We chose this time period and this geographic unit largely because of data availability: the obesity data used in this study come from the PLACES project of the Centers for Disease Control and Prevention (CDC), whose data are for the year 2018 and whose smallest geographic unit is the census tract [37]. Figure 1 shows the city boundaries of NYC, LA, and Buffalo and their census tracts. The geographical boundaries of these three cities were obtained from the 2018 TIGER/Line Shapefile products provided by the US Census Bureau.

Fig. 1

The city boundaries of NYC, LA, and Buffalo and their census tracts: a NYC; b LA; c Buffalo

Obesity data

The outcome variable that we focus on in this study is neighborhood-level obesity prevalence. We obtained census tract-level obesity prevalence among adults (age ≥18) from the CDC PLACES Project. The obesity prevalence is recorded in percentages (e.g., a value of 26.6 indicates that the obesity prevalence for that census tract is 26.6%). Among all the census tracts in the three studied cities, 227 census tracts (7.0%) were excluded from this study because they either have fewer than 50 residents or their obesity prevalence is missing from the CDC data. The numbers of census tracts included for analysis for NYC, LA, and Buffalo are 1995, 947, and 77, respectively. Note that there are only 77 census tracts in Buffalo, and this small number of geographic units affects our analysis results and the training of machine learning models later. We also compute the global Moran’s I index for obesity prevalence. Global Moran’s I is a common metric for quantifying spatial autocorrelation in data, and it is calculated based on both locations and the values (e.g., obesity prevalence) at these locations. The value of global Moran’s I generally ranges from −1 to 1, with −1 indicating a strong negative spatial autocorrelation (i.e., dissimilar values tend to be located next to each other) and 1 indicating a strong positive spatial autocorrelation (i.e., similar values tend to cluster together).
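As an illustration of how global Moran’s I combines locations and values, the following minimal sketch computes the index from a list of values and a binary neighbor matrix. The function name, the four-tract chain layout, and the prevalence values are hypothetical and are not taken from the study data; the spatial weights actually used for the census tracts are not specified here.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a spatial weights matrix.

    weights[i][j] > 0 when units i and j are neighbors; the diagonal is ignored.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of all off-diagonal weights.
    s0 = sum(weights[i][j] for i in range(n) for j in range(n) if i != j)
    # Spatially weighted cross-products of deviations from the mean.
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n) for j in range(n) if i != j
    )
    squared_dev = sum(d * d for d in dev)
    return (n / s0) * (cross / squared_dev)


# Four hypothetical tracts along a line; adjacent tracts are neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [20.0, 22.0, 30.0, 32.0]    # similar values sit next to each other
alternating = [20.0, 32.0, 20.0, 32.0]  # high and low values alternate

i_clustered = morans_i(clustered, chain)      # positive
i_alternating = morans_i(alternating, chain)  # negative
```

In this toy layout the clustered values give a positive index and the alternating values give a negative one, which matches the interpretation of the two ends of the range given above.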

Anonymized mobile phone location data

The anonymized mobile phone location data used in this study are provided by the company SafeGraph, which made its data freely available to the research community. The SafeGraph data were collected from over 45 million smart mobile devices (mostly smartphones) and roughly 11.8 million POIs covering the entire United States [38, 39]. As noted previously, the data were aggregated to census tracts and POIs, and we only have POI visits without individual-level GPS trajectories. Using a sample of data in NYC, we plot the visits from census tracts to fast-food restaurants in a week of 2018 (Fig. 2). In this figure, each curve links a census tract (whose centroid is represented by a yellow dot) and a fast-food restaurant (represented by a red dot), which indicates that some residents from the census tract visited that fast-food restaurant during that week.

Fig. 2

A map visualization of the visits from census tracts to fast-food restaurants in a week of 2018 in NYC

Neighborhood-level socioeconomic and demographic data

In this study, we aim to understand to what extent the neighborhood-level diet and physical activity measurements derived from anonymized mobile phone location data can enhance obesity estimation, in addition to the neighborhood-level socioeconomic and demographic variables typically used in existing studies. We select variables in six categories: (1) race and ethnicity, (2) gender, marital status, and age, (3) education, (4) economic status, (5) housing condition, and (6) urbanicity. These variables are selected based on the existing literature. In particular, variables in categories (1), (2), (3), (4), and (6) were used in previous studies, such as Ball et al. in 2002 [40], Black et al. in 2008 [17], Yan et al. in 2015 [24], and Puciato et al. in 2020 [41], and variables in category (5) were used in previous studies, such as Norman et al. in 2010 [42] and Fitzpatrick et al. in 2018 [20]. Table 1 presents the detailed notations and descriptions of these variables. We obtained data for these variables from the American Community Survey (ACS) of the US Census Bureau. Note that there is a potential limitation in the socioeconomic and demographic data from the Census and the obesity prevalence data from CDC. The estimates in these two datasets are interval estimates, and the quality of the data varies spatially, as pointed out in the literature [43]. Nevertheless, these datasets are the best available for this study, and we acknowledge their limitations.

Table 1 Notations and descriptions of the six categories of neighborhood-level variables

Overview of study design

The objective of this study is to derive neighborhood-level diet and physical activity measurements from anonymized mobile phone location data and investigate to what extent the derived measurements can enhance obesity estimation. Figure 3 provides an overview of our study design, using NYC as an example. We first derive neighborhood-level diet and physical activity measurements from anonymized mobile phone location data based on the visits of neighborhood residents to different types of POIs. In particular, we focus on three types of POIs, which are fast-food restaurants, fitness and sports centers, and nature parks. We will explain why we choose to focus on these three types of POIs in the next section. With the derived measurements, we conduct two sets of analyses to examine their ability to enhance obesity estimation at the neighborhood level. In the first set of analyses (baseline analyses), we estimate obesity prevalence at the neighborhood level using the six categories of socioeconomic and demographic variables (see Table 1); in the second set of analyses (test analyses), we add the derived diet and physical activity measurements to the socioeconomic and demographic variables to examine the extent to which these derived measurements can help improve obesity estimation. We use five different statistical and machine learning models to perform these two sets of analyses.

Fig. 3

An overview of the study design using NYC as an example

Deriving neighborhood-level diet and physical activity measurements

The neighborhood-level diet and physical activity measurements are derived in the following three steps. First, we identify a number of POI types that have been shown in the literature to be linked to diet and physical activity. In particular, three types of POIs are identified in this study: fast-food restaurants [26, 44], fitness and sports centers [45, 46], and nature parks [47, 48]. It is worth noting that these three types of POIs capture only some aspects of the everyday life of people related to diet and physical activity, and they certainly do not represent all the places where people can exercise or purchase healthy food. For example, people can also purchase healthy food from grocery stores and full-service restaurants. However, these places can serve unhealthy food as well [36]. Meanwhile, the anonymized mobile phone location data do not contain information about the specific products that a person purchased at a place. Thus, we do not know, e.g., whether a visit to a grocery store or full-service restaurant involves a healthy food purchase. By contrast, visits to fast-food restaurants, fitness and sports centers, and nature parks have relatively clear associations with corresponding diet and physical activity. Thus, we chose to focus on these three types of POIs.

Second, we use the anonymized mobile phone location data to derive the total number of visitors from the studied census tracts to these three types of POIs. The original SafeGraph data are organized around POIs: for each POI, they provide the number of people who visited it during a time period and the home census tracts of those visitors (inferred from the nighttime locations of the mobile devices in the previous six weeks). Here, we reverse the focus of the data from POIs to census tracts and compute the total number of visitors from each census tract who visited a given type of POI. In this way, we can measure how the residents of neighborhoods (approximated by census tracts) visit different POIs. Figure 4 illustrates this process. It is worth noting that the residents of a neighborhood can visit POIs outside of their neighborhood and also outside of the studied city boundary (especially in the case of LA, whose boundary has a narrow strip connecting to the southern parts of the city). When deriving POI visit information for a neighborhood, we included all POIs that were visited by the neighborhood residents regardless of whether the POIs are within the neighborhood or city boundary. The total numbers of POIs used to derive neighborhood-level visit information are provided in Additional file 1: Table S1. There is a privacy-related limitation in the data: for privacy protection, SafeGraph recorded the number of visitors from a census tract to a POI as 4 if the actual number of visitors is 4 or fewer. Thus, a census tract that has 4 visitors to a POI recorded in the data may in fact have 2, 3, or 4 visitors (if a census tract has only 1 visitor to a POI, this visit is removed by SafeGraph for privacy protection). To address this data limitation, we generate random integers from 2 to 4 for these records, following a power-law distribution typically observed in human travel behaviors [49, 50].
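The handling of the privacy floor described above can be sketched as follows. This is only an illustration under stated assumptions: the exponent alpha = 2.0, the function name, and the sample counts are hypothetical, and the exponent used in the study is not given in this section. The sketch draws from a discrete power law truncated to the three possible true counts.

```python
import random


def impute_censored_count(recorded, rng, alpha=2.0):
    """Replace a floor-coded count of 4 with a random draw from {2, 3, 4}.

    Draw probabilities are proportional to k ** -alpha, i.e. a discrete
    power law truncated to the three possible true counts. alpha=2.0 is an
    assumed placeholder, not a value reported in the study.
    """
    if recorded != 4:
        return recorded  # counts above the floor are kept as recorded
    candidates = [2, 3, 4]
    weights = [k ** -alpha for k in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the illustration is repeatable
recorded_counts = [4, 9, 4, 4, 15, 4]  # hypothetical tract-to-POI visitor counts
imputed_counts = [impute_censored_count(c, rng) for c in recorded_counts]
```

With a positive exponent, smaller counts are drawn more often than larger ones, so a recorded 4 is more likely to be replaced by 2 than by 4.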

Fig. 4

An illustration of deriving neighborhood-level diet and physical activity measurements by reversing the data focus from POIs to census tracts. The POIs visited by the residents of a neighborhood can be outside of the neighborhood or even outside of the city boundary

Third, we divide the total number of visitors aggregated to census tracts by the total number of devices residing in the same census tracts to obtain place visit frequency. Eq. (1) summarizes this computing process:

$$\mathrm{PVF}_{j}=\frac{\sum_{i=1}^{n}v_{ij}}{d_{j}}$$

(1)

where \(\mathrm{PVF}_{j}\) is the place visit frequency of census tract j for one type of POI; \(v_{ij}\) is the number of visitors from census tract j to POI i related to diet and physical activity; n is the total number of POIs in one type of places (e.g., fast-food restaurants) in the study area; \(d_{j}\) is the total number of mobile devices in census tract j. We apply Eq. (1) to each census tract and to each of the three types of POIs. As a result, we obtain three types of diet and physical activity measurements.
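The computation in Eq. (1), together with the reversal of the data focus from POIs to census tracts, can be sketched as below. The record layout, the function name, and all identifiers and counts are hypothetical and chosen only for illustration; they do not reproduce the SafeGraph schema.

```python
from collections import defaultdict


def place_visit_frequency(poi_records, devices_by_tract):
    """Apply Eq. (1) for one POI type.

    poi_records: iterable of (poi_id, {home_tract: visitor_count}) pairs,
    i.e. a POI-centric layout like the one described in the text.
    devices_by_tract: {tract_id: number of devices residing in the tract}
    for the studied tracts.
    """
    visitors_by_tract = defaultdict(int)
    # Reverse the focus from POIs to census tracts by summing over POIs.
    for _poi_id, visitors_by_home in poi_records:
        for tract, count in visitors_by_home.items():
            visitors_by_tract[tract] += count
    # Divide by resident devices; studied tracts with no visits get 0.
    return {
        tract: visitors_by_tract.get(tract, 0) / devices
        for tract, devices in devices_by_tract.items()
        if devices > 0
    }


# Hypothetical fast-food POIs and visitor home tracts.
fast_food = [
    ("poi_a", {"tract_1": 6, "tract_2": 3}),
    ("poi_b", {"tract_1": 4}),
    ("poi_c", {"tract_3": 5}),  # home tract outside the studied tracts
]
devices = {"tract_1": 50, "tract_2": 30, "tract_4": 20}
frequency = place_visit_frequency(fast_food, devices)
```

Running the same function once per POI type (fast-food restaurants, fitness and sports centers, nature parks) would yield the three measurements per census tract.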

Statistical and machine learning models

We use five different statistical and machine learning models to examine the potential improvement brought by the derived measurements for obesity estimation. These models are: ordinary least squares (OLS), geographically weighted regression (GWR), random forest (RF), deep neural network (DNN), and geographical random forests (GRF). The first two are statistical models, while the last three are machine learning or artificial intelligence (AI) models. We use machine learning models in addition to statistical models because there has been an increasing interest in using AI models for health studies [51,52,53]. AI models are often based on mechanisms quite different from those of statistical models, such as neurons and decision trees. Thus, using both statistical and machine learning models allows us to understand how the derived diet and physical activity measurements function in models with different internal mechanisms. Among the five models, GWR and GRF are spatially explicit models that accommodate the spatial heterogeneity typically existing in geographic data [54, 55], while OLS, RF, and DNN are non-spatial models. In the following, we briefly describe each model.

Ordinary least squares

OLS is a statistical model of analysis that estimates the relationship between multiple input independent variables and the target outcome variable. The OLS model used in this work is in the form of Eq. (2):

$$\mathrm{Obesity}=\theta_{0}+\theta_{r}r+\theta_{a}a+\theta_{s}s+\theta_{e}e+\theta_{h}h+\theta_{u}u\,(+\theta_{v}v)+\varepsilon$$

(2)

where \(\theta_{0}\) is the intercept and \(\varepsilon\) is the error term; \(\theta_{r}\), \(\theta_{a}\), \(\theta_{s}\), \(\theta_{e}\), \(\theta_{h}\), \(\theta_{u}\) are the coefficients for the six categories of socioeconomic and demographic variables respectively, and \(\theta_{v}\) are the coefficients for the three types of diet and physical activity measurements based on place visits. \(\theta_{v}v\) is within a pair of parentheses in the equation because diet and physical activity measurements will not be included in the baseline analyses. Note that each of \(\theta_{r}\), \(\theta_{a}\), \(\theta_{s}\), \(\theta_{e}\), \(\theta_{h}\), \(\theta_{u}\), \(\theta_{v}\) contains multiple coefficients for the variables in that category (e.g., \(\theta_{v}\) contains three regression coefficients for the three types of diet and physical activity measurements).

Geographically weighted regression

GWR has been frequently used in geographic data analysis to model spatially varying relationships between variables [56, 57]. GWR fits a local OLS model for each geographic unit (i.e., census tract in this study) based on weighted data from nearby geographic units, and therefore can be considered as an ensemble of local models [58]. Specifically, the GWR model used in this work is in the form of Eq. (3):

$$\mathrm{Obesity}_{i}=\theta_{0}(x_{i},y_{i})+\theta_{r}(x_{i},y_{i})r+\theta_{a}(x_{i},y_{i})a+\theta_{s}(x_{i},y_{i})s+\theta_{e}(x_{i},y_{i})e+\theta_{h}\left(x_{i},y_{i}\right)h+\theta_{u}\left(x_{i},y_{i}\right)u\,\left(+\theta_{v}\left(x_{i},y_{i}\right)v\right)+\varepsilon_{i}$$

(3)

where \((x_{i},y_{i})\) are the spatial coordinates of geographic unit i. The coefficients have the same meaning as in OLS, but vary across geographic locations to capture potentially heterogeneous local processes. We configured the GWR model following the recommendations of the GWR developers [59]: we employed the bisquare kernel to specify the weights of the data from nearby geographic units based on their distances to the current location, and we applied the golden section search approach to identify the optimal bandwidth, which determines the number of nearby geographic units to be included for fitting the local model.
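To make the two configuration choices more concrete, the sketch below shows a bisquare kernel and a generic golden section search. It is a simplified, self-contained illustration rather than the mgwr implementation: the function names, the distances, and the toy score function are hypothetical, and in an actual GWR fit the score being minimized would be a model-fit criterion evaluated at each candidate bandwidth.

```python
import math


def bisquare_weights(distances, bandwidth):
    """Bisquare kernel: (1 - (d / b)^2)^2 for d < b, and 0 otherwise."""
    return [
        (1.0 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0
        for d in distances
    ]


def golden_section_search(score, low, high, tol=1e-6):
    """Minimize a unimodal function `score` on the interval [low, high]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = low, high
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if score(c) < score(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0


# Hypothetical distances from one tract to itself and four other tracts.
distances = [0.0, 1.0, 2.0, 3.0, 5.0]
weights = bisquare_weights(distances, bandwidth=4.0)

# Toy stand-in for a fit criterion whose minimum is at bandwidth 3.
best_bandwidth = golden_section_search(lambda b: (b - 3.0) ** 2 + 1.0, 1.0, 10.0)
```

The kernel gives the focal unit a weight of 1, lets weights decay smoothly with distance, and assigns zero weight to units at or beyond the bandwidth, which is why the bandwidth controls how many nearby units contribute to each local model.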

Random forest

Random forest is a bagging-based machine learning model that applies an ensemble learning technique by constructing a group of decision trees [60]. Compared with OLS that assumes a linear relation, RF can model nonlinear relations between input features and the target variable. Given this ability, RF has been used in a variety of previous studies in which the input features and the target variable likely have a nonlinear relation [61, 62].

Deep neural network

DNNs and other deep learning models have shown outstanding predictive power in recent years [63, 64]. A DNN is made of multiple successive layers of neurons and can learn a complex nonlinear relation between the input features and the target variables. The model architecture can be configured flexibly with different numbers of total layers and different numbers of neurons. Additional components, such as dropout layers or batch normalization, could also be added depending on the application.

Geographical random forests

GRF is a disaggregation of a global RF model into multiple local RF models across different spatial locations [55]. The core idea of GRF is similar to that of GWR: a local model is fitted for each geographic unit. That is, for each location i, a local RF is trained using only a number of nearby geographic units. Such a design allows the RF model to adapt to different local contexts.
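The selection of nearby geographic units for each local model can be illustrated with a small sketch. The centroid coordinates, the value of k, and the choice to include the focal unit in its own local set are assumptions made for illustration only; the local random forests themselves are not fitted here.

```python
import math


def nearest_units(centroids, i, k):
    """Indices of the k units closest to unit i (unit i itself included)."""
    order = sorted(
        range(len(centroids)),
        key=lambda j: math.dist(centroids[i], centroids[j]),
    )
    return order[:k]


# Hypothetical tract centroids in planar coordinates.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (6.0, 5.0)]

# One local training set per location; a local RF would be fitted on each.
local_training_sets = {
    i: nearest_units(centroids, i, k=3) for i in range(len(centroids))
}
```

Each location thus gets its own training subset, so a model fitted on it reflects only the local context.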

We implement all the models in Python with the following packages: statsmodels for OLS, mgwr for GWR, scikit-learn for RF, tensorflow for DNN, and scikit-learn for GRF. For the machine learning models, we also perform hyperparameter tuning to identify the best model architecture. Two metrics, \(R^{2}\) and root mean square error (RMSE), are used to assess the accuracy of the five models for obesity estimation. For the statistical models, \(R^{2}\) and RMSE are obtained directly from the model fitting results. For the machine learning models, \(R^{2}\) and RMSE are obtained via tenfold cross-validation. In addition, for the two statistical models, OLS and GWR, we also report their adjusted \(R^{2}\) and Akaike information criterion (AIC), which take into account the increased model complexity when additional variables, i.e., the derived diet and physical activity measurements, are included. Given that GWR is an ensemble of local linear models, its AIC is calculated based on the log-likelihood of the full model and the effective number of parameters derived from the selected bandwidth. We used the mgwr package from the GWR developers to calculate its AIC values, and more details can be found in their papers [59, 65].
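The evaluation metrics and the tenfold split can be sketched in plain Python as follows. This is an illustration only: the function names, the sample values, and the predictor counts are hypothetical, the folds here are contiguous without shuffling, and the way folds were formed in the study is not specified in this section. The AIC function uses the general definition (twice the number of parameters minus twice the log-likelihood), not the GWR-specific calculation from mgwr.

```python
import math


def r2_and_rmse(observed, predicted):
    """Return (R2, RMSE) for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R2; n_predictors excludes the intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic(log_likelihood, n_params):
    """General AIC definition: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def kfold_indices(n_obs, k=10):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_obs // k + (1 if i < n_obs % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_obs) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical observed and estimated obesity prevalence (%).
observed = [24.0, 28.0, 30.0, 34.0]
predicted = [25.0, 27.0, 31.0, 33.0]
r2, rmse = r2_and_rmse(observed, predicted)

# With 77 tracts (as in Buffalo), each test fold holds only 7 or 8 tracts.
fold_sizes = [len(test) for _train, test in kfold_indices(77, k=10)]

# Same R2 with three extra predictors gives a lower adjusted R2.
adj_baseline = adjusted_r2(0.60, n_obs=77, n_predictors=20)
adj_test = adjusted_r2(0.60, n_obs=77, n_predictors=23)
```

The last comparison shows why adjusted \(R^{2}\) and AIC are reported for OLS and GWR: adding the three derived measurements must improve the fit enough to offset the penalty for the extra parameters.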
