Conducting surveys on platforms such as Amazon Mechanical Turk (MTurk) has proliferated as a cost-effective and fast way of collecting data about health [-]. The number of studies using MTurk for social science research has been steadily increasing due in part to its ease of use, existing tools to support research activities, and quick turnaround for data collection []. In addition to the relatively low costs of conducting survey research with MTurk [], another potential benefit is being able to reach participants and retain them in longitudinal studies []. If the goals of a research study involve having a representative sample of participants, it is important to assess how well MTurk can meet that need.
MTurk is one of many ways to collect nonprobability survey samples, which are defined and created by researchers from a pool of available participants [,]. Previous research has found differences between the characteristics of MTurk respondents and the US general population: MTurk participants are generally younger, more likely to be female and White, and have lower incomes and higher education levels than the US general population, differences that have persisted over time [-].
Previous Work
Whether to collect a nonprobability or a probability-based sample may depend on the research question. While “statistical sampling theory suggests that any estimate of a parameter will be more accurate when that parameter is estimated using data from a random sample” [], adjustment approaches applied after sample collection may improve the comparability of a nonprobability sample to the general population []. However, nonresponse bias due to attrition can significantly affect inferences drawn from either a probability or a nonprobability panel [,]. Attrition over time reduces sample size, which lowers the power of any statistical analysis, while differential attrition can bias inference in less predictable ways []. Several methods exist to control for bias introduced by nonresponse, including sample weighting, that reduce the impact of nonresponse on inferences. Survey attrition has been noted as a critical concern with using MTurk []. Still, there is limited information about the effects of survey attrition in longitudinal studies using MTurk and the extent to which it limits the inferences that can be drawn [].
Previous research has shown that nonresponse patterns vary by survey population and survey type on MTurk. In a 3-wave longitudinal study fielded from April 2020 to March 2021, MTurk respondents who were younger, Hispanic, and had self-rated difficulty with the survey were more likely to drop out in subsequent survey waves []. Rates of nonresponse for short-term studies (ie, a few days to a few weeks) tend to be lower than for long-term studies (ie, a month or more) []. Factors related to nonresponse vary by survey type, time between survey waves, and the underlying population [-]. Longer surveys and those with greater response burden produce higher rates of nonresponse [] in all types of longitudinal surveys [,], including internet survey panels [,]. Most of these studies have focused primarily on samples of the general population [] rather than on subgroups with clinical conditions.
We use the Mercer et al [] framework to assess the impact of nonresponse on estimation and bias in a longitudinal study of individuals with back pain. The authors propose 3 elements for assessing the impact of selection bias on survey estimates. We adapt this framework to evaluate nonresponse, assuming that the baseline data reflect the population and that nonresponse bias behaves like selection bias when assessing longitudinal surveys. The 3 elements proposed by Mercer et al [] are “exchangeability” (whether all confounding variables are known and measured), “positivity” (whether the sample includes all necessary kinds of units in the target population), and “composition” (whether the sample distribution matches the target population with respect to confounders, or can be adjusted to match it). Assessing and addressing issues with exchangeability, positivity, and composition has been shown to improve inference in causal and survey analyses that must contend with selection bias. In this article, we use the same framework to improve inference in the presence of nonresponse bias in MTurk studies.
It is essential to understand the factors associated with attrition in longitudinal studies with internet panels, given their widespread use. To improve exchangeability, it is also important to understand and assess what factors could confound inference due to nonresponse. While studies have previously examined these issues among general populations, the factors associated with attrition may vary among populations with different health conditions. It is estimated that 39% of the US adult population has back pain []; back pain accounts for the largest share of years lived with disability in the United States []. Healthier individuals are more likely to respond to surveys, and longitudinal surveys risk losing an increasing number of less healthy participants in successive survey waves [,]. Given that health and pain are multidimensional, multiple measures of health and pain may be necessary to capture the confounding due to poorer health and increased pain. As more studies use surveys to assess back pain, nonresponse due to poorer health can significantly impact inference drawn from analyses, even longitudinal analyses, if differential attrition by pain status is observed. In addition, MTurk workers are known to have a high turnover rate []. The inability to follow up could be another important source of attrition.
Goals of the Study
As part of a more extensive study, we collected survey data on MTurk from individuals who self-identified as having back pain. To improve sample quality, we implemented a range of tactics to screen out poor-quality data, including requiring that participants had completed several previous tasks and met an approval threshold, as well as postsurvey data cleaning to screen out those who reported having one or both of 2 fake health conditions included in the survey. What remained was a sample of self-selected, higher-quality participants who were surveyed 3 times over 6 months.
Because of the prevalence of back pain, the attention it receives, and the use of survey methods to assess it, we analyzed data from a 6-month, 3-wave longitudinal panel survey to (1) describe the patterns of survey response and nonresponse among MTurk members with back pain, (2) identify factors associated with survey response over time (to assess “exchangeability”), (3) assess the impact of nonresponse on sample characteristics (to assess “positivity”), and (4) assess how well inverse probability weighting can account for differences in sample composition (to assess “composition”). We hypothesize that those with poorer health, more pain symptoms or greater pain severity, specific pain, and nonchronic pain will be least likely to respond to follow-up surveys. Weighting may be able to correct for nonresponse, but it is unclear whether the sample is sufficiently varied for such adjustment to succeed.
We developed web-based surveys to collect data from MTurk participants and used the platform CloudResearch (formerly TurkPrime) to field the survey in 2021 []. Individuals who reported having back pain at baseline (wave 1) were offered the opportunity to complete follow-up surveys after 3 months (wave 2) and 6 months (wave 3). We did not note in the wave 1 survey instructions that this was a longitudinal study because only those who met the inclusion criteria for the longitudinal study were asked whether they wanted to participate in follow-up surveys. At the beginning of wave 2 and wave 3 recruitment, all eligible participants who consented to participate in follow-up survey waves were sent a recruitment email telling them that the follow-up survey was available, that it would take approximately 25 minutes to complete, what the payment for completing it would be, and that they had up to 5 weeks to return it. Weekly reminder emails (1-4 weeks after the recruitment email) were sent to all nonparticipants reiterating the timeline for survey completion, the approximate time to complete it, and the payment for completion.
Based on previous data collection efforts, we recruited individuals with the goal of a final wave 1 sample of about 1500 individuals with back pain []. Those invited to participate at baseline had to have completed a minimum of 500 previous human intelligence tasks (HITs) on MTurk with a successful completion rate of at least 95%. No additional requirements were imposed for participation in the wave 1 survey. These threshold values were selected to enhance data quality. Previous research [] and pilot tests of the survey found that a 95% approval threshold and at least 500 completed HITs improve data quality and that limiting samples in this way does not shrink the pool of available workers enough to push the sample below the 1500-participant target []. While more recent studies [] have shown that the approval rate alone is insufficient to ensure high-quality responses, we used a range of steps to ensure high-quality responses, including reputation, number of previous tasks, and attention checks (described in the Measures section). Given the structure of the MTurk interface, we are unable to determine the impact that the approval and completed-HIT thresholds have on the sample profile (ie, we cannot quantify the number of individuals who tried to complete the survey but could not because of the thresholds for participation, as those individuals would not see the survey). Additional detail on the collection of wave 1 data is described by Qureshi et al [].
Ethical Considerations
All participants provided electronic consent at the beginning of the survey. Those who completed the general health and back pain surveys at wave 1 were offered US $3.50 for their participation. Participants were offered an additional US $5 per subsequent completed survey (wave 2 and wave 3). All baseline participants (even those who did not participate in wave 2) were asked to participate in wave 3. Data were deidentified and are stored online []. All procedures were reviewed and approved by the research team’s institutional review board (RAND Human Subjects Research Committee FWA00003425; IRB00000051) and conformed to the principles of the Declaration of Helsinki. The study was funded by the National Institutes of Health’s National Center for Complementary and Integrative Health (grant 1R01AT010402).
Measures
The main outcome variables were participation in wave 2 and participation in both waves 2 and 3, each defined as a binary outcome (0 for no participation and 1 for participation). We used several exposure variables, including self-reported demographic variables, self-reported health conditions, and self-reported back pain assessments.
Each survey asked about demographic characteristics (age, sex, race or ethnicity, employment status, income, education, and marital status) and health conditions. Health conditions were assessed in 2 forms. First, we asked “Have you EVER been told by a doctor or other health professional that you had…” for each of the following conditions: hypertension, high cholesterol, heart disease, angina, heart attack, stroke, asthma, cancer, diabetes, chronic obstructive pulmonary disease, arthritis, anxiety disorder, and depression. Then, we asked “Do you currently have…” for each of the following conditions: allergies or sinus trouble, back pain, sciatica, neck pain, trouble seeing, dermatitis, stomach trouble, trouble hearing, and trouble sleeping. We included these measures to allow examination across multiple dimensions of health, supporting “exchangeability” for inference.
We also included 2 fake conditions in the survey that were used to screen out low-quality respondents. Individuals who endorsed one or both fake conditions were not asked to participate in the back pain follow-up survey, even if they endorsed having back pain. Overall, 15% (996/6832) of respondents endorsed one of these fake conditions, and their responses were deemed dishonest or careless. Those reporting fake conditions were more likely to identify as male and non-White, to be younger, to report more health conditions, and to take longer to complete the survey. Their responses had less internal consistency reliability on several health measures than those of respondents who did not endorse a fake condition (Hays et al []).
Those who reported having back pain were asked to participate in a follow-up survey that included additional questions related to their back pain. If an individual opted not to continue, they were paid for completing the first part of the survey and were not included in further analysis. The survey included questions about whether the respondent’s back pain was “chronic” according to 1 of 4 definitions: (1) their back pain had persisted for at least 3 months; (2) their back pain had persisted for at least 3 months and they had pain on at least half the days in the past 6 months; (3) a health provider had told them that their pain is chronic; or (4) they believe their back pain is chronic. We also asked whether their back pain was due to a “specific” medical condition. We categorized individuals with back pain into 4 groups []—those with specific chronic back pain, those with specific nonchronic back pain, those with nonspecific chronic back pain, and those with nonspecific nonchronic pain. The survey also included the Impact Stratification Score (ISS) [], Oswestry Disability Index (ODI) [], Roland Morris Disability Questionnaire (RMDQ) [], the Pain, Enjoyment of Life and General Activity scale (PEG) [], and the Keele STarT Back Screening Tool (SBST) [].
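To make the screening and categorization steps concrete, the following is a minimal sketch in Python with pandas. The study does not publish code, and all column names here (fake_condition_1, fake_condition_2, back_pain, specific_pain, chronic_pain) are hypothetical placeholders, not the study's actual variable names.

```python
import pandas as pd

def screen_and_categorize(df: pd.DataFrame) -> pd.DataFrame:
    # Drop respondents who endorsed either fake condition; their
    # responses are treated as careless or dishonest.
    kept = df[~(df["fake_condition_1"] | df["fake_condition_2"])].copy()

    # Restrict to those reporting current back pain (the follow-up cohort).
    kept = kept[kept["back_pain"]].copy()

    # Cross the two binary indicators to form the 4 pain groups, eg,
    # "specific chronic" or "nonspecific nonchronic".
    specific = kept["specific_pain"].map({True: "specific", False: "nonspecific"})
    chronic = kept["chronic_pain"].map({True: "chronic", False: "nonchronic"})
    kept["pain_group"] = specific + " " + chronic
    return kept
```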
Statistical Analysis
We report response rates to the wave 2 and 3 surveys among those responding to the wave 1 survey to assess the “positivity” of samples for inference. In addition, we report descriptive statistics on age, sex, race or ethnicity, income, education, marital status, self-reported health conditions, the proportion who endorsed each back pain type, and back pain measure scores for those who participated in each survey wave. We report differences between those who did and did not complete the wave 2 survey, and between those who did and did not complete both the wave 2 and wave 3 surveys, using t tests for continuous variables and chi-square tests for categorical variables.
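As an illustration of these comparisons, here is a minimal Python sketch using scipy and pandas. The study itself used Stata, and the column names (age, education, responded_w2) are hypothetical placeholders.

```python
import pandas as pd
from scipy import stats

def compare_groups(df: pd.DataFrame, responded: str = "responded_w2") -> None:
    resp = df[df[responded] == 1]
    nonresp = df[df[responded] == 0]

    # Continuous variable (eg, age): two-sample t test of responders
    # versus nonresponders.
    t, p = stats.ttest_ind(resp["age"], nonresp["age"], nan_policy="omit")
    print(f"age: t={t:.2f}, P={p:.3f}")

    # Categorical variable (eg, education): chi-square test on the
    # category-by-response contingency table.
    table = pd.crosstab(df["education"], df[responded])
    chi2, p, dof, _ = stats.chi2_contingency(table)
    print(f"education: chi2={chi2:.2f}, df={dof}, P={p:.3f}")
```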
Next, we report estimates from stepwise logistic regression models predicting response to wave 2 (model 1) and to both waves 2 and 3 (model 2). We used backward elimination with a selection criterion of α=.157 and a forward selection criterion of α=.05 to select the variables to include in the models []. These selection criteria determine whether a variable is included in the final model; using a backward elimination criterion of α=.157 rather than α=.05 is meant to reduce overfitting of the final model, a common issue with stepwise models []. We report the odds of completing the subsequent surveys. Based on previous studies, age, sex, race, and ethnicity were included in the regression models [,]. We also examined education, marital status, income categories, employment, health conditions, type of pain, pain impact, and time to complete the questionnaire as predictor variables.
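The sketch below illustrates one common variant of backward elimination for a logistic response model, in Python with statsmodels rather than the Stata procedure actually used. It is a simplification: it omits the forward entry step at α=.05 and the retention of the demographic variables, and the function and variable names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, outcome: str,
                       candidates: list[str], alpha: float = 0.157):
    """Iteratively drop the least significant predictor until all
    remaining p values are at or below the stay criterion (alpha)."""
    kept = list(candidates)
    while kept:
        X = sm.add_constant(df[kept])
        fit = sm.Logit(df[outcome], X).fit(disp=False)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return fit  # every remaining predictor meets the stay criterion
        kept.remove(worst)  # drop the weakest predictor and refit
    return None  # no predictor survived elimination
```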
Finally, we used inverse probability weighting to examine sample characteristics in waves 2 and 3 based on the model 1 and model 2 results, to assess how well sample weights correct for nonresponse in later waves and thereby assess “composition.” Model weights are derived from the estimated probabilities of completion from the aforementioned stepwise logistic regression models. By using inverse probability weights, we overweight respondents who resemble those who dropped out, approximating how the original sample would have looked had everyone responded to both follow-up waves. As a sensitivity analysis, we also derived inverse probability weights from models that included all candidate variables without backward elimination. Similarity of baseline characteristics between the full sample at baseline and the weighted estimates for those who participated in later waves indicates that observed variables can account for the bias introduced by sample attrition. All analyses were conducted using Stata software (version MP17; StataCorp) []. The study conforms to the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist for cohort studies (Table S1 in ).
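To illustrate the weighting step, here is a minimal Python sketch in which each respondent's weight is 1 divided by their predicted probability of response, and a weighted baseline mean is compared against the full wave 1 sample. It assumes a fitted statsmodels logistic model (such as the one returned by the previous sketch) and hypothetical column names.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def ipw_balance_check(df: pd.DataFrame, fit, predictors: list[str],
                      responded: str, var: str) -> None:
    # Predicted probability of responding (the response propensity);
    # `predictors` must match the variables retained in the fitted model.
    X = sm.add_constant(df[predictors])
    phat = fit.predict(X)

    mask = df[responded] == 1
    weights = 1.0 / phat[mask]  # inverse probability of response

    # If weighting works, the weighted respondent mean should move back
    # toward the full baseline mean for each characteristic.
    baseline_mean = df[var].mean()
    weighted_mean = np.average(df.loc[mask, var], weights=weights)
    print(f"{var}: baseline={baseline_mean:.2f}, "
          f"weighted respondents={weighted_mean:.2f}")
```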
A total of 1678 adults who responded in wave 1 qualified to take subsequent surveys; that is, they did not endorse a fake condition on the wave 1 survey and consented to participate in a future survey []. Of those who qualified from the total wave 1 sample, 983 (59%) responded in wave 2. Of the 983 who responded in wave 2, a total of 703 (42% of wave 1 respondents) also responded in wave 3. The 8 respondents who responded only in waves 1 and 3 (ie, not in wave 2) were excluded from further analyses. Compared with those who did not respond, respondents in wave 2 were older, had higher incomes, were more likely to have never been married, were less likely to be Hispanic, were less educated, and were less likely to be employed full-time. We saw similar trends for those who responded in both waves 2 and 3 versus those who did not ().
shows the overall sample distribution at wave 1 and response rates in waves 2 and 3. Generally, those who were older, female, non-Hispanic, not married or living with a partner, and at low (ie, US $0-US $39,999) or high (more than US $60,000) income levels were more likely to respond during waves 2 and 3 than their counterparts. These differences were more apparent when comparing wave 1 with wave 3. However, when comparing wave 3 response among those who responded in wave 2, response rates were generally 65%-75% and not systematically different by characteristic. In addition, the sample prevalence of health conditions was similar between the unweighted samples of those who participated in wave 2 only and those who participated in waves 2 and 3 ().
Those who responded in wave 2 completed the wave 1 survey in less time than those who did not respond (30 min vs 35 min, P<.001). Those who responded to the wave 3 survey likewise spent less time completing the wave 2 survey than those who did not respond (24 min vs 26 min, P=.02), similar to the advertised completion time. We found no differences in the time of day (morning, afternoon, evening, or nighttime) at which the baseline survey was completed between responders and nonresponders to the wave 2 survey or to the waves 2 and 3 surveys.
Respondents in wave 2 had fewer reported health conditions than those who did not respond (5.8 vs 6.6, P<.001). A similar trend was observed for those who responded versus those who did not respond to both waves 2 and 3, though the effect was not significant (5.7 vs 6.0, P=.08). There were differences between responders and nonresponders in wave 2 for 15 conditions, with nearly all being less common for responders than nonresponders, except for arthritis, anxiety, and allergies. There were also differences between responders and nonresponders in waves 2 and 3, but for fewer (11) conditions ().
Respondents to wave 2 were less likely to have nonspecific low back pain and more likely to have chronic low back pain than those who did not respond, with similar patterns for those who did and did not respond to both the waves 2 and 3 surveys. Participants in wave 2 and in both waves 2 and 3 reported less pain intensity and pain interference, and better health on the ISS, ODI, RMDQ, PEG, and SBST measures ().
Table 1. Characteristics of those participating in wave 1 only versus those who also responded in wave 2 (at 3 months) and in waves 2 and 3 (at both 3 and 6 months).

| Characteristic | Responded in wave 1 only (N=695), n (%) | Responded in waves 1 and 2 only (N=983), n (%) | P value (wave 1 vs waves 1 and 2) | Responded in all 3 waves (N=703), n (%) | P value (wave 1 vs all 3 waves) |
| Age (years), mean (SD) | 39.13 (10.84) | 42.47 (12.01) | <.001 | 43.58 (12.14) | <.001 |
| Age category (years) | | | <.001 | | |

aCOPD: chronic obstructive pulmonary disease.
Table 4. Pain impact reported by those in the baseline (wave 1) sample who did not and did respond at wave 2 (at 3 months) and at waves 2 and 3 (at both 3 and 6 months).

| Pain assessment | Responded in wave 1 only (N=695) | Responded in waves 1 and 2 only (N=983) | P value (wave 1 vs waves 1 and 2) | Responded in all 3 waves (N=703) | P value (wave 1 vs all 3 waves) |
| Nonspecific, proportion (SD) | 0.80 (0.69) | 0.55 (0.79) | <.001 | 0.64 (0.48) | <.001 |
| Chronic, proportion (SD) | 0.84 (0.37) | 0.92 (0.26) | <.001 | 0.94 (0.24) | .016 |
| Pain intensity, z score (SD) | 0.85 (0.88) | 0.62 (0.89) | <.001 | 0.50 (0.78) | .002 |
| Pain interference, z score (SD) | 0.80 (0.69) | 0.55 (0.79) | <.001 | 0.57 (0.88) | .003 |
| Impact Stratification Score (ISS), mean (SD) | 22.07 (7.41) | 19.34 (8.57) | <.001 | 18.99 (8.6) | .02 |
| Oswestry Disability Index (ODI), mean (SD) | 26.98 (15.99) | 22.39 (15.99) | <.001 | 22.06 (16.09) | .15 |
| Roland Morris Disability Questionnaire (RMDQ), mean (SD) | 10.35 (6.63) | 8.09 | | | |