Deviations from typical paths: a novel approach to working with GPS data in the behavioral sciences

We outline a framework involving six steps, presented in Table 1. These steps apply to both primary and secondary data analysis. The first three steps, which typically occur prior to data collection, involve formalizing questions, definitions, and assumptions: (1) establishing a research question, (2) establishing theory and formalizing assumptions, and (3) operationalizing the target dynamic and determining a priori groupings of data. The final three steps represent our primary contribution and outline the process of analyzing location tracking data after they have been collected: (4) defining and determining typical paths, (5) calculating deviations, and (6) analyzing deviations. Each step is described in sequence below.

Table 1 Steps for deviations from typical paths framework

Formalizing questions, definitions, and assumptions

Establish a research question

Before estimating a typical path and computing deviations, we must first establish a research question that defines both the typical travel behavior we are trying to capture and the purpose for studying deviations from this typical behavior. Researchers must decide, for example, whether the study aims to examine individual differences in the magnitude of variability, the representativeness of a typical path, or the association of transit behavior on a given day with experiences on that day. Researchers must also consider whether the focus is on temporal variability, spatial variability, or both. The chosen research question must align with the data used. For example, questions focused on intraindividual variability in daily travel behavior will require multiple days of data to address. The research question will also be critical in selecting an appropriate path estimation procedure, quantification of deviations, and analysis.

Establish theory and formalize assumptions of the target dynamics of travel behavior

The second step is to establish the appropriate theory to characterize the behavioral dynamics under consideration. Theory is essential at this step in the process because it orients the research to the temporal dynamics of the travel behaviors, which are critical to decisions involving data collection and processing. For example, before data processing begins, researchers must decide on the timescale over which behaviors unfold (seconds, minutes, hours, days, weeks) and their cycle for repetition (hours, days, weeks). Researchers may realize that the behavior unfolds across nested timescales, such as momentary behaviors nested within a day, nested within day of the week, nested within the season. Additionally, researchers must refer to or develop a theory to dictate whether a meaningful growth process (e.g., shrinking or expansion of life space or travel behavior during the study period) is also expected. For example, increased travel away from an established location or route could be reflective of a growing flexibility or freedom to travel, or it may indicate an increased need to obtain resources that the currently-accessed built environment does not readily provide [25]. Theory can also establish whether the dynamics of the target travel behavior allow for a typical path to exist. Some researchers have suggested, for example, that there is not a typical day of non-commute travel [14], but perhaps modeling the commute travel as the typical behavior could provide meaningful context to the non-commute travel that allows for new insights. In cases when data are already collected, exploratory analyses or corresponding qualitative reports may provide inspiration for establishing inherent dynamics.

Due to the flexibility of our technique, it is essential to consider the dynamics of the travel behavior prior to estimating typical paths and deviations. This will assist in defining which controls and constraints are necessary to discover informative deviations in travel behavior. For example, does a lunch excursion from work count as a deviation that is worthy of study? Is a change in common travel behavior, such as a stop at a gas station, a meaningful deviation? Is a route change, such as a traffic detour, of interest? When possible, establishing these analysis goals prior to data collection allows researchers to ask participants for these key times or locations to facilitate easier data processing. Either way, clear definitions for target dynamics of travel behavior can improve precision of the estimated path, which will improve estimates of deviations.

Operationalize target dynamic and determine a priori groupings of data

Now that the theoretical framework has been established, data processing decisions can be made to organize the data to align with the target dynamic. Similar to establishing theory, it is preferable to make these decisions prior to data collection when possible. However, adjustments can be made after data collection to accommodate unanticipated behaviors in the data. First, researchers need to decide how to best group the observations based on anticipated travel behavior. A typical path will be estimated for each group of observations, meaning that the groups should reflect the shared travel behavior. A typical path should have an established scope, based on start and end location, time of day, or both. The groupings therefore define identifiable and repeatable behaviors across the observed time series.

We note that the duration of a path may differ across studies. For example, a full-day path is considered relevant for life space, but a study on commute patterns may focus on distinct commutes to and from work. Within a study, multiple typical paths such as weekday and weekend paths may exist for each person. Depending on the study, these categorizations may be established prior to data processing as context- or hypothesis-driven, or they may be explored during processing and analysis as data-driven groupings. In this paper, we will focus on a priori established groupings. For example, studies concerning commute-oriented travel should identify days on which the individual commutes. Given the increased availability of work-from-home options, a longer period of data collection may be needed to identify the typical path for individuals who commute less often. Similarly, individuals who travel to multiple workplaces will require their commutes to be grouped by workplace.

Second, after groupings of observations have been established, the data need to be organized to reflect the theory and align with the path estimation procedure in statistical software. Long format, in which each observed location is a separate row, will be needed in most cases. If separate groupings, such as weekdays and weekends, exist and correspond to separate typical paths to be estimated, the observations for each group should be maintained either in separate datasets or in a single dataset with clear labels. These labels can be used for subsetting prior to path estimation and again to assign points to the appropriate typical path when calculating deviations.
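As an illustration of the labeled long-format layout described above, the following minimal Python sketch attaches a weekday or weekend label to each row and buckets rows by person and group. The field names (`person`, `timestamp`, `lat`, `lon`), the coordinates, and the weekday/weekend rule are hypothetical choices made for this example; a real study would derive its groupings from theory, as discussed in the preceding steps.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical long-format data: one row per observed location.
# The field names (person, timestamp, lat, lon) are illustrative assumptions.
rows = [
    {"person": "p01", "timestamp": "2023-03-06T08:05:00", "lat": 40.7934, "lon": -77.8600},
    {"person": "p01", "timestamp": "2023-03-06T08:06:00", "lat": 40.7941, "lon": -77.8612},
    {"person": "p01", "timestamp": "2023-03-11T10:30:00", "lat": 40.8010, "lon": -77.8550},
    {"person": "p02", "timestamp": "2023-03-12T14:00:00", "lat": 40.7800, "lon": -77.8700},
]

def day_type(timestamp: str) -> str:
    """Label an ISO timestamp as 'weekday' or 'weekend' (Monday=0 ... Sunday=6)."""
    return "weekend" if datetime.fromisoformat(timestamp).weekday() >= 5 else "weekday"

def group_rows(rows):
    """Attach a group label to each row and bucket rows by (person, group)."""
    groups = defaultdict(list)
    for row in rows:
        labeled = {**row, "group": day_type(row["timestamp"])}
        groups[(labeled["person"], labeled["group"])].append(labeled)
    return dict(groups)

groups = group_rows(rows)
for key, members in sorted(groups.items()):
    print(key, len(members))
```

Keeping the label on every row means the same key can be used twice: once to subset points before estimating a path, and again to match each point to its path when deviations are computed.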

Once these steps concerning research goals and data processing have been completed, the procedure for obtaining typical paths and deviations from data can commence. The remaining analytical steps, including technical details and modifications, are described in the sections below. As these steps represent the primary contribution of this paper, each step is assigned its own section.

Defining and determining typical paths

The fourth step in this framework is defining and determining typical paths. For our purposes, a path is a continuous mapping of the unit interval into the two-dimensional space defined by latitude and longitude coordinates, otherwise known as a plane curve. Existence of a typical path is a critical assumption for our framework, and prior steps leading to the organization of observations into groups must establish groupings for which a typical path is believed to exist. Each grouping of observations will correspond to an estimated path. This definition can be extended to the three-dimensional space including time if deviations in timing are of primary or additional interest. Alternatively, timestamps can be used to define groupings and covariates in analysis, rather than used as a third dimension during path estimation. Specific features of the path can be encoded in the definition and estimation procedure. For example, closed paths with the same starting and ending points allow for treating the entire day as the grouping because we can assume that the individual starts and ends each day or trip at a common location. As mentioned above, other groupings of observed location points are possible, such as focusing on the commute to work. In this case, we would instead fit a curve that is not closed to the data, with a known starting point at home and end point at work. Assumptions such as limiting travel to known roads can be incorporated via alternative path estimation techniques such as map-matching, lending face validity to the estimated path. It is also possible to replace the estimated continuous path in this framework with a dense set of discrete ordered or unordered points, defined as (latitude, longitude) pairs, representing all geographic locations that the individual passes through as part of their route.

The process of determining a typical path can be done in any number of different ways, and depends on the design of the study and data collection. Below, we describe principal curves as a data-driven option, followed by a brief note on the use of self-reported typical paths. We also offer potential adjustments to account for study design and research goals. Alternative approaches to estimating a typical path can be substituted in this modular framework.

Data-driven typical paths—principal curves and variations

Estimating a path is essentially fitting a curve with no gaps or discontinuities through the data. This process should be completed separately for each grouping if multiple typical paths need to be estimated—for example, typical paths for weekdays and weekends should be estimated separately from separate datasets. There are several options for fitting curves to location data to produce an estimated travel path, most of which are considered smoothing procedures in the statistics literature. We recommend selecting a smoothing procedure based on matching known properties of the data with the assumptions of the procedure. For example, when travel is known to be limited to roads, additional steps could be taken to limit the estimated typical path to known roads. Within this paper, we focus primarily on the use of principal curves to estimate a typical path for two reasons. First, they are relatively simple and easy for behavioral researchers without prior experience in the analysis of GPS data to use via implementations in common statistical software. Second, they facilitate estimation of a typical path over the entire course of study by simply including all data points for all relevant days or trips when estimating a curve.

Principal curves are a nonlinear generalization of principal components, or equivalently a variation on nonlinear regression that allows for symmetric treatment of variables when minimizing errors [26]. Conceptually, each location on the curve is the average of all nearby data points—in this case, all nearby observed location coordinates. While principal curves utilize a similar least squares criterion as linear regression, they minimize squared deviations in all variables compared to only the response variable in regression frameworks. This fits well with the unsupervised learning goal of path estimation [27].
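The contrast with regression can be written compactly. The notation below is introduced only for illustration and does not appear in the original text: $g$ denotes a regression function, $f$ a curve parameterized by $\lambda$, and $\lambda_f(\mathbf{x})$ the parameter value of the point on the curve closest to $\mathbf{x}$. It follows the standard self-consistency formulation of principal curves.

```latex
\begin{align*}
% Regression: only errors in the response variable are minimized
\text{regression:} \quad
  & \min_{g} \sum_{i=1}^{n} \bigl( y_i - g(x_i) \bigr)^2 \\
% Principal curve: orthogonal distances in all coordinates are minimized
\text{principal curve:} \quad
  & \min_{f} \sum_{i=1}^{n}
    \bigl\lVert \mathbf{x}_i - f\bigl(\lambda_f(\mathbf{x}_i)\bigr) \bigr\rVert^2,
  \qquad
  \lambda_f(\mathbf{x}) = \arg\min_{\lambda} \lVert \mathbf{x} - f(\lambda) \rVert \\
% Self-consistency: each curve point is the mean of the data projecting onto it
\text{self-consistency:} \quad
  & f(\lambda) = \mathbb{E}\bigl[ \mathbf{X} \mid \lambda_f(\mathbf{X}) = \lambda \bigr]
\end{align*}
```

The last line is the formal counterpart of the statement that each location on the curve is the average of the observations that project onto it.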

In the context of travel behavior, a principal curve is a nonparametric smoother that reflects a typical path traveled by an individual. Multiple technical definitions of principal curves and corresponding algorithms have been offered by various authors [26, 28, 29, 30, 31, 32, 33]. We make use of the approach described by [26] and implemented in the princurve package in R [34]. This approach begins with the principal component line and follows a standard Expectation-Maximization (EM) approach, iterating between projecting the observed points onto the current curve and updating the curve by calculating conditional expectations, until the change in total squared distance from all points to the curve is less than a prespecified threshold. We make use of smoothing splines fit using generalized cross-validation during the expectation step, though a variety of options for smoothers exist and can be substituted. Further adjustments to improve the estimated path after fitting a principal curve are described at the end of this section.
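The alternating structure of this algorithm can be sketched in a few dozen lines of Python. This is a simplified illustration, not the princurve implementation: it substitutes a Gaussian kernel smoother for the smoothing splines with generalized cross-validation, treats coordinates as planar, and uses hypothetical function names and default settings (`span`, `tol`, `max_iter`) chosen for this example.

```python
import math

def first_pc_lambda(pts):
    """Initial ordering: scores of the points on the first principal component line."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    syy = sum((p[1] - my) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the leading eigenvector
    return [(p[0] - mx) * math.cos(theta) + (p[1] - my) * math.sin(theta) for p in pts]

def smooth(pts, lam, span):
    """Conditional-expectation step: kernel-weighted mean of points with nearby lambda."""
    h = span * (max(lam) - min(lam)) or 1.0  # bandwidth as a fraction of the lambda range
    curve = []
    for i in sorted(range(len(pts)), key=lambda k: lam[k]):
        w = [math.exp(-0.5 * ((lj - lam[i]) / h) ** 2) for lj in lam]
        tot = sum(w)
        curve.append((sum(wj * p[0] for wj, p in zip(w, pts)) / tot,
                      sum(wj * p[1] for wj, p in zip(w, pts)) / tot))
    return curve  # polyline vertices ordered along the curve

def project(pts, curve):
    """Projection step: nearest location on the polyline; lambda is arc length to it."""
    cum = [0.0]
    for a, b in zip(curve, curve[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    lam, total = [], 0.0
    for p in pts:
        best_d2, best_lam = float("inf"), 0.0
        for k, (a, b) in enumerate(zip(curve, curve[1:])):
            dx, dy = b[0] - a[0], b[1] - a[1]
            seg2 = dx * dx + dy * dy
            t = 0.0 if seg2 == 0 else max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg2))
            d2 = (p[0] - a[0] - t * dx) ** 2 + (p[1] - a[1] - t * dy) ** 2
            if d2 < best_d2:
                best_d2, best_lam = d2, cum[k] + t * math.sqrt(seg2)
        lam.append(best_lam)
        total += best_d2
    return lam, total

def principal_curve(pts, span=0.1, tol=1e-3, max_iter=20):
    """Alternate smoothing and projection until the total squared distance stabilizes."""
    lam, prev = first_pc_lambda(pts), float("inf")
    for _ in range(max_iter):
        curve = smooth(pts, lam, span)
        lam, total = project(pts, curve)
        if abs(prev - total) <= tol * max(total, 1e-12):
            break
        prev = total
    return curve, total

# Hypothetical data: a half-circle route with small deterministic jitter.
pts = [(math.cos(math.pi * i / 59) + 0.02 * math.sin(7 * i),
        math.sin(math.pi * i / 59) + 0.02 * math.cos(5 * i)) for i in range(60)]
curve, total = principal_curve(pts)
print(f"{len(curve)} curve points, total squared distance {total:.3f}")
```

A fixed-bandwidth kernel smoother pulls the ends of the curve inward, which is one reason the choice of smoother and its tuning matter in practice. For real analyses, an established implementation such as the princurve package cited above is the safer choice.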

An important limitation of applying the principal curves framework to travel behavior is that the resulting simple curves cannot intersect themselves. As a result, some travel patterns that naturally occur over the course of a trip or day may be impossible to estimate. This type of travel behavior may be more likely in dense urban environments. There are several possibilities for addressing this limitation. One option is to use finer-grained segmentation of observations to develop subgroups for fitting portions of the typical path at a time, resulting in a collection of principal curves defined based on shorter trips or trip segments that can be nested within group for analysis. Another option is to make use of an alternative smoother. For example, strategies to locally fit principal curves could resolve these situations [32, 33]. The princurve package in R [34] offers a choice of periodic LOWESS for fitting closed curves that could reflect travel in a loop. A third option is to incorporate time or temporal ordering of observed points as a third variable to prevent the self-intersection that occurs in two dimensions when a route crosses back over a segment later in time, such as returning on the same route as departing or taking a cloverleaf highway interchange. Following the estimation of a path over time, the path can be reduced to two dimensions for computations of deviations if temporal deviations are not of interest, or the path can remain in three dimensions to allow for the study of temporal deviations in addition to location deviations.

In addition to principal curves, several options have been proposed in the literature for estimating and smoothing a path from point-based location data, such as splines and functional data analysis techniques [35], hidden Markov models [36], and Kalman filters [37]. All of these statistical smoothing techniques require selection of one or more tuning parameters that describe the interval of time over which to smooth and order of the assumed underlying functional form (in the case of spline- and kernel-based smoothers) or the amount of noise anticipated in the measurement and overall process (in the case of Kalman filters) [37]. Principal curves similarly require decisions concerning tuning parameters; however, many modifications to the principal curve algorithm have been proposed that allow for flexibility in adapting the framework to a variety of situations. For this reason, we have chosen to focus on principal curves as a technique that is broadly applicable—though other techniques may be preferable given specific assumptions concerning travel behaviors. Regardless of the choice of smoothing technique, the estimated typical path should be graphed and assessed visually for goodness of fit to the observed points and to ensure that the estimated typical path is sensible in light of established theory of travel behavior.

Once a typical path has been estimated, researchers may choose to use additional techniques to make adjustments to the estimated typical path that improve precision and add context. Such optional tasks can complement the smoothing procedure to provide the most accurate typical path possible, though they should be done carefully to avoid overfitting. One such technique is map matching, which makes use of known features of the geographic space, such as roads and traffic patterns, to refine estimated travel patterns [38]. Map matching techniques treat locations as nodes connected by directed edges in graphs which can be matched to known nodes and road segments in an established map. Several researchers have proposed map matching algorithms to define or improve estimated paths [39], particularly in low-sampling-rate settings [40]. Although these techniques can increase the accuracy of the estimated path for individual trips under the assumption that all travel is along known roads, they require a reference library of road networks. Alternative principal curve fitting techniques can also be used to improve the fit by bringing in additional data—for example, [41] offer an adaptive algorithm for principal curve estimation that incorporates known endpoints, allowing for specification of known start and end points to trip groupings that can improve fit of the path near the end points.

Self-reported typical paths

We note that self-reports of typical travel routes provide an alternative to data-driven path estimation. Self-reported typical paths can also be treated as a complement to data-driven paths for validating paths or comparing findings, for example in studies interested in recall. These routes can be stored concisely as turn-by-turn directions or an ordered list of waypoints (e.g., intersections of roads) prior to analysis. Depending on the prior established grouping, it may be necessary to collect multiple routes per person. When obtaining self-reported typical paths, it can be helpful to have local maps available and to ask follow-up questions that validate participants' reflections on their typical travel behaviors. Questions such as “Do you have any stops that you typically make along the way?” or “Do you take different routes if there is traffic?” can validate recall, and questions such as “Do you also make this trip on weekends?” can ensure that no participant-specific groupings have been missed. Once typical paths have been established, either using data-driven techniques or self-report, researchers can proceed to the next step of computing deviations.

Computing deviations

Once the typical travel routes have been defined, the next step is to calculate distances from the associated typical path for every observed location point. As with estimation of a typical path, the step of recording deviations is modular and may be replaced or built upon with alternatives to suit the specific application. Here, we highlight use of Euclidean distance to quantify distance of each observation to the estimated path.

We begin with a collection of points belonging to a particular estimated path. The most straightforward organization is to subset the observed points based on the same grouping for which typical paths have been estimated—for example, when separate paths are estimated for weekdays and weekends, calculate deviations for weekday points based on the weekday path and deviations for weekend points based on the weekend path. The points in a given grouping should not belong to multiple paths (i.e., each point can belong to a weekday or a weekend but not both), but can belong to only a portion of a longer typical path depending on interpretation goals. For example, deviations for only a morning commute can be calculated and studied using a path based on travel for an entire day.

Each point is assigned a quantification of its deviation from its associated typical path via a distance measure. In practice, the nearest location on the path can be found by densely sampling along the estimated path and using the path point nearest to the observed point for calculating distance. Often the estimated path will already be stored as a dense sampling of points in statistical software, and this process can be automated by calculating the distance from the observed point to all path points and taking the minimum. Distance can reflect a spatial deviation such as traveling to a new location, a temporal deviation such as traveling to an established location at a new time, or both, depending on whether time was included in the estimation of the typical path and in which dimensions the distance is calculated.
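The minimum-over-path-points calculation described above can be sketched as follows. The function name `deviations` and the toy path are hypothetical, and coordinates are assumed to be already expressed in planar units. Because `math.dist` accepts tuples of any equal length, the same function would apply if a suitably scaled time coordinate were included as a third dimension.

```python
import math

def deviations(points, path):
    """Distance from each observed point to the nearest densely sampled path point."""
    return [min(math.dist(p, q) for q in path) for p in points]

# Hypothetical typical path: a straight segment from (0, 0) to (1, 0),
# stored as 101 densely sampled points.
path = [(i / 100, 0.0) for i in range(101)]

# Hypothetical observations: on the path, off to the side, and past the end.
points = [(0.5, 0.0), (0.25, 0.3), (1.2, 0.0)]

for point, dev in zip(points, deviations(points, path)):
    print(point, round(dev, 3))
```

The accuracy of this approximation depends on how densely the path is sampled: the error in each distance is at most the spacing between consecutive path points.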

The simplest version of a deviation is a spatial deviation based on Euclidean distance, to produce a distance “as the crow flies” from the observed point to the typical path. Further adjustments, such as using a Haversine distance or map projection to adjust for the curvature of the earth, are possible [42]—though in many cases these adjustments are trivial in comparison to the measurement error in real-world passive location tracking. Distance may also be measured using travel distance along roads or estimated travel time to obtain a more practical measure of the movement that would be required to resume travel along the typical path from the current point.
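To give a sense of scale for these adjustments, the sketch below compares a haversine distance with a Euclidean distance computed after a simple local (equirectangular) projection. The function names, the spherical Earth radius, and the example coordinates are assumptions made for illustration, and the specific projection is one of several that could be used.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def planar_m(lat1, lon1, lat2, lon2):
    """Euclidean distance in meters after a local equirectangular projection."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_M
    y = math.radians(lat2 - lat1) * EARTH_RADIUS_M
    return math.hypot(x, y)

# Hypothetical observed point and nearest path point, a few kilometers apart.
obs = (40.00, -77.00)
nearest = (40.02, -77.03)
d_hav = haversine_m(*obs, *nearest)
d_pla = planar_m(*obs, *nearest)
print(f"haversine: {d_hav:.2f} m, planar: {d_pla:.2f} m, difference: {abs(d_hav - d_pla):.4f} m")
```

Note that either function converts degrees to meters. Applying Euclidean distance directly to raw latitude and longitude values yields distances in degrees, in which a degree of longitude covers less ground than a degree of latitude away from the equator.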

Analysis of deviations

The final step of this framework is analysis of deviations to gain insights into atypical behavior. Once distances have been computed for every point, they can be analyzed, alone or with complementary information, using a variety of methods. If data were separated into multiple datasets for path estimation and calculation of deviations, they may need to be recombined into a single dataset with retained labels prior to analysis.

The process of computing deviations results in a calculated deviation for every observed point. Depending on the sampling rate of the GPS device and distance traveled, this could mean hundreds or thousands of values for every travel behavior grouping (e.g., day) defined for analysis. These rich data allow for a variety of analytic approaches. Deviations for individual sampled points can be used directly in multilevel models, growth curve models, or other techniques focused on intraindividual variance and covariance based on the research question [43]. These models address the longitudinal, repeated measures design of location-tracking studies, as well as the clustering of observations within day and within person—making them a natural fit for many research questions concerning variability in travel. Additional steps could be taken to identify atypical destinations by including timestamps and clustering observations that deviate from the estimated path.

Summary measures can also be derived to indicate individual differences in travel behaviors for use as either independent or dependent variables. For example, the maximum deviation, total or average deviation, or number of deviation clusters in a grouping may also be useful quantities to estimate. The average of all deviations can be interpreted as a measure of “goodness of fit” for the estimated typical path(s), where a larger average deviation suggests additional behavior that is not captured by the typical path(s), error in measurement such as noisy GPS data, or both. Daily average deviations can describe how typical each day was. In our example below, we demonstrate the utility of such summary measures, as well as the importance of selecting a summary measure that aligns with project goals. The analytic approach should be chosen to match the research protocol and reflect the established theory in terms of grouping, timescale, and frequency of behavior cycle. The approach should also be robust to any data quality concerns such as missing data and irregular sampling rates that occur with some collection methods.
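The summary measures named above can be derived with a simple aggregation. In this minimal sketch the input is assumed to be a list of (day label, deviation) pairs, and the values shown are hypothetical. The count of deviation clusters is omitted because it requires a clustering step that depends on study-specific choices.

```python
from collections import defaultdict
from statistics import mean

def daily_summaries(records):
    """Summarize point-level deviations by day: count, mean, maximum, and total."""
    by_day = defaultdict(list)
    for day, deviation in records:
        by_day[day].append(deviation)
    return {
        day: {"n": len(vals), "mean": mean(vals), "max": max(vals), "total": sum(vals)}
        for day, vals in by_day.items()
    }

# Hypothetical deviations in meters for two days of one participant.
records = [
    ("2023-03-06", 10.0), ("2023-03-06", 30.0), ("2023-03-06", 20.0),
    ("2023-03-07", 5.0), ("2023-03-07", 400.0),
]

for day, summary in sorted(daily_summaries(records).items()):
    print(day, summary)
```

In this toy example the second day has a much larger maximum than the first, while a mean alone would blur whether the day involved one large excursion or many small ones. This is the kind of difference that motivates matching the summary measure to project goals.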


INTERNATIONAL JOURNAL OF HEALTH GEOGRAPHICS
