Standard multiple imputation of survey data didn’t perform better than simple substitution in enhancing an administrative dataset: the example of self-rated health in England

The original census and survey datasets comprised 2,848,155 and 374,218 records respectively, of which 1,390,094 (49%) and 134,717 (36%) were respondents who were aged 25–64, lived in England, and were usual residents in non-communal households. In total, 634 (0.5%) respondents in the survey dataset had missing values for at least one key or auxiliary variable and were omitted from analyses, leaving 134,083 survey respondents in the analytical dataset. Distributions of imputation variables in the census and survey are presented in Table 1. Distributions were broadly similar in the two datasets, with survey respondents slightly older, more educated, and more likely to be female, own their home, and be married than those from the census.

Census and survey responses to questions on self-rated health are presented in Table 2. Overall, survey respondents were less positive about their health, with 78% rating it as good or very good compared with 83% of census respondents. Figure 1 presents a scatter plot of the proportion of survey versus census respondents in each of the 576 combinations of age, sex, tenure and region who rated their health as bad or very bad. The dashed line is the line of equality, representing perfect agreement between census and survey measures, while the solid line shows the regression line of best fit describing the association between the two, with the shaded band showing its 95% confidence interval. The intercept and slope of this line of best fit are presented in Table 3. There was a strong linear relationship between the proportions in the two datasets (correlation = 0.93; Table 3). However, the survey overestimated the proportion of respondents with bad or very bad self-rated health, as evidenced by the lack of correspondence between the regression line (intercept (95% confidence interval): 0.01 (0.00, 0.01); slope (95% confidence interval): 0.82 (0.79, 0.84)) and the line of equality.
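The comparison of a fitted regression line with the line of equality can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names `fit_line` and `crossing_with_equality` are hypothetical, the data are synthetic values lying exactly on a made-up line, and which dataset sits on which axis is an assumption made only for the example.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x, plus Pearson correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
    corr = float(np.corrcoef(x, y)[0, 1])
    return float(intercept), float(slope), corr

def crossing_with_equality(intercept, slope):
    """x at which the fitted line meets y = x, or None if the lines are parallel."""
    if np.isclose(slope, 1.0):
        return None
    # intercept + slope * x = x  =>  x = intercept / (1 - slope)
    return intercept / (1.0 - slope)

# Synthetic group-level proportions (hypothetical, NOT the study data):
# 576 groups whose y-values lie exactly on y = 0.01 + 0.8 * x.
x = np.linspace(0.01, 0.30, 576)
y = 0.01 + 0.8 * x

intercept, slope, corr = fit_line(x, y)
crossing = crossing_with_equality(intercept, slope)
print(f"intercept={intercept:.3f}, slope={slope:.3f}, corr={corr:.3f}")
print(f"fitted line meets the line of equality at x={crossing:.3f}")
```

A high correlation alone does not indicate agreement: in this synthetic case the correlation is perfect, yet the fitted line lies above the line of equality for x below 0.05 and below it for larger x. Agreement requires an intercept near 0 and a slope near 1 as well.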

Table 2 Overall distribution of self-rated health in original census data versus data from or imputed from survey data

Fig. 1 Comparison of proportion of bad or very bad self-rated health in original census data versus survey data

Table 3 Linear associations between proportion of bad or very bad self-rated health across 576 groups comparing original census data with data from or imputed from survey data

Corresponding results for the multiply imputed data are presented in Fig. 2 and in Tables 2 and 3. The overall distribution of bad or very bad self-rated health imputed into the census from survey data using standard logistic or Poisson regression was very similar to that for the raw survey data (6.3% and 6.2% of imputed census data versus 6.2% of raw survey data were bad or very bad) and, therefore, differed from the original census values (5.1%). Results for the 576 combinations of age, sex, tenure and region (Fig. 2, top left and right) were also very similar to those for raw survey data, with a strong linear relationship, but were generally overestimates relative to the original census values (logistic intercept: 0.00 (−0.00, 0.00); slope: 0.82 (0.79, 0.84); correlation: 0.95; Poisson intercept: 0.00 (−0.00, 0.00); slope: 0.80 (0.78, 0.82); correlation: 0.95). The overall distribution of self-rated health imputed into the census using ordinal logistic regression was again more similar to the original survey data than to the original census data (6.2% bad or very bad self-rated health). Initially, it seemed that the association for the 576 categories was a better fit to the original census data (intercept: −0.01 (−0.01, −0.00); slope: 1.00 (0.98, 1.03)) than that from the raw survey data. However, while there was reasonable linear agreement between values in the middle of the range, the imputed data substantially overestimated the proportion of bad or very bad self-rated health at the lower and upper ends of the distribution and, in practice, a quadratic model was a better fit in describing the association between imputed and original census values (Fig. 2, bottom left). Results for data imputed into the census using multinomial logistic regression were again very similar to those for the raw survey data (6.5% bad or very bad self-rated health; intercept: −0.00 (−0.01, −0.00); slope: 0.83 (0.81, 0.85); correlation: 0.95; Fig. 2, bottom right).
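The point that a slope close to 1 can hide curvature can be illustrated with a small synthetic example. The sketch below is not the authors' analysis: the data are hypothetical values constructed to overshoot the line of equality at both ends of the range, the helper `residual_sum_of_squares` is an illustrative name, and the assignment of census values to the x-axis is an assumption made only for the example.

```python
import numpy as np

def residual_sum_of_squares(x, y, degree):
    """Residual sum of squares of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals ** 2))

# Hypothetical group-level proportions (NOT the study data): the "imputed"
# values exceed the "census" values at both ends of the range and fall
# slightly below them in the middle.
census_prop = np.linspace(0.0, 0.30, 576)
imputed_prop = census_prop + 2.0 * (census_prop - 0.15) ** 2 - 0.01

# A straight-line fit still returns a slope of about 1 ...
linear_slope, linear_intercept = np.polyfit(census_prop, imputed_prop, deg=1)

# ... but a quadratic fit leaves far smaller residuals.
rss_linear = residual_sum_of_squares(census_prop, imputed_prop, degree=1)
rss_quadratic = residual_sum_of_squares(census_prop, imputed_prop, degree=2)

print(f"linear slope: {linear_slope:.3f}")
print(f"RSS linear: {rss_linear:.6f}, RSS quadratic: {rss_quadratic:.6f}")
```

Because the curvature here is symmetric about the midpoint of the range, the straight-line slope is unaffected by it. For that reason, comparing residuals (or inspecting the scatter plot) is more informative than the slope and intercept alone.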

Fig. 2 Comparison of proportion of bad or very bad self-rated health in original census data versus data imputed from survey data
