Estimating and displaying population attributable fractions using the R package: graphPAF

In this section, we consider a scenario where the risk factor is either binary, or perhaps multi-category with some level indicating ‘elimination’ for the risk factor. We will define appropriate PAF estimands for cross-sectional, case–control and cohort study designs, and describe causal exchangeability conditions that are needed for estimation.

Definitions of PAF under differing study designs

PAF for prevalent disease

Cross-sectional and case–control designs can be used to estimate the proportion of prevalent disease that is attributable to a risk factor. Let Y denote a binary disease outcome (1 indicating disease) for a randomly selected individual from the population, and \(Y_0\) the same binary disease outcome where the individual is instead sampled from a hypothetical population with the risk factor eliminated. The population attributable fraction for prevalent disease can be defined as:

$$PAF = \frac{P(Y = 1) - P(Y_0 = 1)}{P(Y = 1)},$$

(1)

where \(P(Y = 1)\) represents the prevalence of disease in the current population, and \(P(Y_0 = 1)\) the prevalence of disease in the hypothetical population with the risk factor eliminated.

PAF for incident disease

In longitudinal cohort studies, a cohort of healthy individuals is followed over time, with some eventually developing disease. In this setting, a different kind of population attributable fraction can be estimated, in which the cumulative incidence of disease for that cohort as a function of time is compared to what the incidence would be if the risk factor had been eliminated from the cohort:

$$PAF(t) = \frac{P(T \le t) - P(T_0 \le t)}{P(T \le t)}$$

(2)

Here random sampling is interpreted as random sampling from the cohort of interest, and PAF, Y and \(Y_0\) from (1) are replaced by PAF(t), \(I\{T \le t\}\) and \(I\{T_0 \le t\}\), with the random variables T and \(T_0\) representing time to disease in the current population and under hypothetical elimination of the risk factor. In the setting of competing events, such as death, we can interpret T as the time an individual would have developed disease had the competing event not occurred, and the PAF in terms of disease prevented under elimination of the risk factor, provided the competing event does not happen. It should be noted that, in the presence of competing events, (2) can only be estimated under limited conditions, such as non-informative censoring of the true survival time, T, by the competing event. However, we can also incorporate competing events directly into the definition of PAF. Suppose \(\Delta\) represents an indicator for the event that happens first, with \(\Delta = 1\) indicating that disease occurred before any other event. We can then write:

$$PAF^*(t) = \frac{P(T \le t \text{ and } \Delta = 1) - P(T_0 \le t \text{ and } \Delta_0 = 1)}{P(T \le t \text{ and } \Delta = 1)}$$

(3)

Note that as \(t \rightarrow \infty\), \(PAF^*(t)\) converges to the PAF for disease incidence described in [13]. While (3) may at first seem a more sensible estimand than (2) in the presence of competing events, care must be taken in its interpretation. For instance, if the risk factor leads to early mortality through mechanisms other than the disease of interest, (3) may be negative for large t even when the risk factor causes disease. In other words, while (3) is the proportional difference in cumulative incidence of disease by time t due to removing the risk factor, it can't be interpreted as disease incidence prevented by eliminating the risk factor. In contrast, (2) does have an interpretation in terms of prevented hypothetical incidence in the absence of competing events.

Conditions for estimation

Note that (1), (2) and (3) are causal entities, and unbiased estimation with observational data requires relatively strong assumptions. For instance, if the random variable \(A \in \{0,1\}\) represents the observed risk factor (\(A=0\) coding for elimination), one cannot say that \(P(Y=1 \mid A=0)=P(Y_0=1)\) unless the risk factor A is randomly assigned. Informal sufficient conditions for the possibility of asymptotically unbiased estimation of (1) are:

1.

Unambiguous definition and measurement of the potential outcome \(Y_0\), representing risk factor elimination. (This is essentially the famous stable unit treatment value assumption (SUTVA), first described in [14].)

2.

The measurement of a collection of covariates C, so that for each observed value of C, \(P(Y=1 \mid A=0, C)=P(Y_0=1 \mid C)\). (This will be true if, within joint strata of the covariates C, the risk factor A behaves as if it were randomly assigned.) The collection C is sometimes referred to as a sufficient adjustment set of covariates.

3.

\(P(A=0 \mid C)>0\) for all possible values of the sufficient adjustment set of covariates, C. (Informally, absence of the risk factor is possible, regardless of an individual's covariate values.)

4.

The proposed model for disease, conditional on risk factor and covariates, \(P(Y=1 \mid A=0, C)\) is correctly specified.

Similar conditions need to be assumed to estimate (2) and (3). The variables C are often, but not always, a set of confounders of the risk factor/outcome relationship (that is, they are joint causes of A and Y). We will assume the veracity of these conditions (including the measurement of a sufficient adjustment set C) in what follows, although their validity should be carefully considered in any practical application.

Estimators of PAF

Differing estimators of PAF are appropriate depending on the study design. In the following, we detail estimation formulae for PAF separately for cross-sectional, case–control, cohort and survey designs. In defining these estimators, the following notation will be used:

$$\begin{aligned} i \in \{1, \dots, N\} &: \text{indexes the } N \text{ individuals in the sample} \\ Y_i \in \{0, 1\} &: \text{binary disease outcome for individual } i \\ A_i &: \text{risk factor value for individual } i \\ C_i &: \text{covariate values for individual } i \\ \hat{P}(Y=1 \mid A_i, C_i) &: \text{estimated probability of disease given risk factor value } A_i \text{ and covariate values } C_i \\ \pi &: \text{disease prevalence in the population} \end{aligned}$$

Cross sectional designs

In cross-sectional designs, a random sample, \(i \in \{1, \dots, N\}\), from the population is collected at a fixed point in time. In this case, the PAF in the population at the time of data collection, (1), can be estimated by:

$$\frac{\sum_{i=1}^{N}\big(\hat{P}(Y=1 \mid A_i,C_i)-\hat{P}(Y=1 \mid A_i=0,C_i)\big)}{\sum_{i=1}^{N}\hat{P}(Y=1 \mid A_i,C_i)}$$

(4)

(4) is implemented by graphPAF and utilises the parametric g-formula, sometimes referred to as 'standardisation' [15], to estimate PAF. Such an approach only requires the specification of a model for the response. Doubly robust approaches, which require models for both the risk factor and the outcome, have also been suggested in the literature [16].
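To make (4) concrete, the following minimal sketch computes the plug-in estimate by hand, outside graphPAF. The data frame df, outcome y, risk factor a and covariates c1, c2 are hypothetical names used for illustration only:

```r
# Parametric g-formula ('standardisation') sketch for estimator (4).
fit <- glm(y ~ a + c1 + c2, family = binomial(link = "logit"), data = df)
p_obs <- predict(fit, newdata = df, type = "response")    # P-hat(Y=1 | A_i, C_i)
df0 <- df
df0$a <- 0                                                # hypothetical risk factor elimination
p_elim <- predict(fit, newdata = df0, type = "response")  # P-hat(Y=1 | A=0, C_i)
paf_hat <- (sum(p_obs) - sum(p_elim)) / sum(p_obs)        # plug-in estimate of (4)
```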

Case–control designs

In case–control studies, disease cases are preferentially selected into the sample and paired with healthy controls, often in a 1–1 fashion. This implies that the sampled individuals will not be representative of the source population, and equation (4) can't be used directly. However, if the disease prevalence, \(\pi\), is known, a similar estimator is available that re-weights the contributions of cases and controls using weights, \(w_i\), with \(w_i=1\) in controls, chosen to ensure that the empirical weighted prevalence of disease satisfies \(\frac{\sum_{i=1}^{N} w_i Y_i}{\sum_{i=1}^{N} w_i}=\pi\):

$$\frac{\sum_{i=1}^{N}w_i\big(\hat{P}(Y=1 \mid A_i,C_i)-\hat{P}(Y=1 \mid A_i=0,C_i)\big)}{\sum_{i=1}^{N}w_i\,\hat{P}(Y=1 \mid A_i,C_i)}$$

(5)

One can think of this approach as weighted standardisation. (5) could be considered a direct approach to estimation, in that it estimates the counterfactual probability in (1) by averaging estimated observed-data probabilities. In addition to re-weighting the contributions for each individual in equation (5), the estimates \(\hat{P}(Y_i=1 \mid A_i,C_i)\) may need to be adjusted so they reflect, on average, the assumed disease prevalence in the population. This could be achieved by using estimates from a model fit using weighted maximum likelihood, although this is not what graphPAF does. The correction employed by graphPAF is described below, in the subsection "Estimation of prevalent and incident PAF".

If \(\pi\) is unknown, (5) can't be used for estimation in case–control studies. Instead, the formula of [17] should be used:

$$1 - \frac{1}{N_c}\sum_{i:\,Y_i=1}\frac{\hat{P}(Y=1 \mid A_i=0,C_i)}{\hat{P}(Y=1 \mid A_i,C_i)} = 1 - \frac{1}{N_c}\sum_{i:\,Y_i=1}\widehat{RR}_i^{\,-1}$$

(6)

where \(\widehat{RR}_i=\hat{P}(Y=1 \mid A_i,C_i)/\hat{P}(Y=1 \mid A_i=0,C_i)\) is the estimated relative increase in disease risk encountered by individual i based on their risk factor value \(A_i\), and \(N_c=\sum_i I\{Y_i=1\}\) is the number of cases in the data set. \(\widehat{RR}_i\) can be approximated by the corresponding odds ratio in case–control study designs provided the disease is relatively rare.

Surveys

Equation (5) can also be used to estimate prevalent PAF, (1), for datasets collected via surveys, where \(w_i\) is proportional to the inverse of the sampling fraction, that is, the probability that an individual with covariates \(C_i\) is selected from the population into the sample. In this case, the probabilities \(\hat{P}(Y=1 \mid A_i,C_i)\) should be estimated using weighted maximum likelihood, with weights \(w_i\), so they reflect population quantities. See [18] for more details regarding PAF estimation with survey data.

Cohort designs

In cohort designs, Cox proportional hazards models are often used to estimate (2). Under the proportional hazards assumption, suppose \(\widehat{HR}(C_i,A_i)\) is the estimated hazard ratio for an individual with covariates \(C_i\) and risk factor \(A_i\), compared to their hazard assuming all covariates and risk factors were at reference levels (defined as 0 for continuous covariates). Let \(\hat{\Lambda}_0(t)\) be an estimate of the cumulative baseline hazard function; graphPAF uses the Kalbfleisch–Prentice estimate (the default in the survival package) for \(\hat{\Lambda}_0(t)\). PAF(t) is estimated as:

$$\widehat{PAF}(t) = \frac{\sum_{i=1}^{N}\big(e^{-\hat{\Lambda}_0(t)\widehat{HR}(C_i,A_i=0)}-e^{-\hat{\Lambda}_0(t)\widehat{HR}(C_i,A_i)}\big)}{\sum_{i=1}^{N}\big(1-e^{-\hat{\Lambda}_0(t)\widehat{HR}(C_i,A_i)}\big)}$$

(7)

To estimate \(PAF^*(t)\), \(\widehat{HR}(C_i,A_i)\) can be replaced in (7) by the estimated Fine–Gray subdistribution hazard ratio for disease incidence, \(\widehat{HR}^{FG}(C_i,A_i)\), and \(e^{-\hat{\Lambda}_0(t)}\) by \(e^{-\int_0^t \hat{h}_0^{FG}(u)\,du}\), where \(h_0^{FG}(u)\) is the baseline subdistribution hazard function at time u. As described in the estimation section below, these functions can be estimated by prior weighting of the Cox model.

Estimates from summarised data

In the case of a binary risk factor, A, Miettinen [19] showed that equation (1) can be re-expressed as:

$$PAF = P(A=1 \mid Y=1)\,\frac{RR_e - 1}{RR_e}$$

(8)

where Y and A are the indicators for disease and risk factor exposure for a randomly selected individual from the population, and \(RR_e = \frac{P(Y=1 \mid A=1)}{P(Y_0=1 \mid A=1)}\) is the causal relative risk in risk-factor-exposed individuals. The above formula can be re-expressed as:

$$PAF = \frac{P(A=1)RR_u}{P(A=1)RR_u + 1 - P(A=1)}\,\frac{RR_e - 1}{RR_e}$$

(9)

where \(RR_u=P(Y=1 \mid A=1)/P(Y=1 \mid A=0)\) is the unadjusted relative risk; see [20] for details. Under a no-effect-modification assumption, that is, assuming \(\frac{P(Y_1=1 \mid C=c)}{P(Y_0=1 \mid C=c)}\) is constant across confounder strata \(C=c\), \(RR_e\) can be replaced by the conditional relative risk \(RR_c = P(Y_1=1 \mid C=c)/P(Y_0=1 \mid C=c)= P(Y=1 \mid A=1,C=c)/P(Y=1 \mid A=0,C=c)\), within any stratum \(C=c\). These observations facilitate estimation of prevalent PAF, by replacing \(P(A=1)\), \(RR_u\) and \(RR_e\) in (9) with estimates of risk factor prevalence, and unadjusted and adjusted relative risk, from the published literature. Conditions for the approach to be valid, in addition to the conditions listed under the subsection "Conditions for estimation", include no effect modification (so that \(RR_e=RR_c\)), that \(\widehat{RR}_c\) has been correctly adjusted for confounding, and transportability of relative risks if the estimates of relative risk and PAF pertain to differing populations. Extensions to (9) for risk factors with multiple levels and continuous exposures are given in [20].

Often a simpler approach, based on Levin’s formula, is used to derive estimates of PAF from summary data:

$$PAF_L = \frac{P(A=1)(RR_c - 1)}{P(A=1)(RR_c - 1) + 1}$$

(10)

However, \(PAF_L\) will only equal equation (1) in special circumstances, such as complete absence of confounding. In general, Levin’s approach will generate asymptotically biased estimators for PAF, even if consistent estimators for \(P(A=1)\) and \(RR_c\) are plugged into (10).
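As a quick numeric illustration of this bias, formulas (9) and (10) can be compared directly. The values below are made-up placeholders, chosen only to show that the formulas disagree when \(RR_u \ne RR_c\):

```r
# Illustrative comparison of Miettinen's formula (9) and Levin's formula (10).
prev <- 0.3   # P(A=1), placeholder value
RR_u <- 2.5   # unadjusted relative risk, placeholder value
RR_c <- 2.0   # confounder-adjusted relative risk, placeholder value

# Miettinen (9): P(A=1|Y=1) recovered via RR_u, then scaled by (RR_e - 1)/RR_e
# (with RR_e replaced by RR_c under no effect modification)
paf_9 <- (prev * RR_u / (prev * RR_u + 1 - prev)) * (RR_c - 1) / RR_c

# Levin (10): agrees with (9) only when RR_u = RR_c, i.e. no confounding
paf_10 <- prev * (RR_c - 1) / (prev * (RR_c - 1) + 1)

c(paf_9 = paf_9, paf_10 = paf_10)  # the two differ whenever RR_u != RR_c
```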

General features of the graphPAF package

The main functions used by graphPAF for estimating differing types of population attributable fractions are detailed in Table 1. To estimate PAF with individual data, the user needs to specify a fitted statistical model, usually supplied to the functions in Table 1 through the model argument. As listed in Table 1, the supported R model functions vary depending on the type of PAF, but include glm model objects fit using the binomial family with the log or logit link, lm objects, polr model objects fit using the MASS package, and clogit or coxph model objects fit using the survival package. Common required arguments for many of the functions include the dataset used to fit the model, data, the risk factor of interest, riskfactor, and the reference value for that risk factor, refval, which codes for 'non-exposure'. Other exported functions listed in Table 1 include auxiliary functions to assist simultaneous fitting of multiple models, functions for plotting and nomograms, and functions that estimate PAF from summary data.

Table 1 The main estimation functions in the graphPAF package and supported model types

Confidence intervals

As a default, point estimates of PAF are printed to the screen. For individual data, bootstrap-calculated confidence intervals for PAF can be requested via ci=TRUE. Bootstrapping is performed via the R package boot, with confidence interval calculations produced by boot::boot.ci. Parallelization over multiple CPU cores is available by setting the option boot.ncpus to an integer above 1. The number of bootstrap replications can be changed using the argument boot_rep, which has a default of 50. As a default, bias-corrected symmetric confidence intervals are produced, which combine bootstrap estimates of bias and standard error with the point estimate. This default was set with computational speed in mind, and boot_rep might be increased for smaller datasets where bootstrap sampling is relatively cheap. Efron and Tibshirani [21] show that 200 bootstrap replications are usually sufficient for bootstrap estimates of standard errors, and consequently for symmetric confidence intervals for normally distributed estimators, but values as low as 25 replications can be useful to assess variability when sampling is expensive. The default type of bootstrap confidence interval, ci_type="norm", can be changed by setting the ci_type argument to "basic", "perc" or "bca". Note that boot_rep should ideally be increased to about 2,000 if this default is changed. In particular, the number of bootstrap replications needs to be larger than the number of rows in the dataset if ci_type="bca"; otherwise the function will return an error. See the documentation of the boot package for more details. When confidence intervals are produced with individual data, any of the functions PAF_calc_discrete, impact_fraction, PAF_calc_continuous, ps_paf, joint_paf, seq_paf and average_paf produces additional output, beyond estimates and confidence intervals, regarding estimation and inference settings. This extra output can be suppressed by setting the argument verbose=FALSE. A short sketch of these options is given below.
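For example, a percentile-interval request with an increased replication count might look as follows. Here model_exercise refers to a fitted model of a supported type; one such conditional logistic model is constructed in the subsection "Estimation of prevalent and incident PAF" below:

```r
# Percentile bootstrap intervals with 2,000 replications, suppressing
# the extra diagnostic output.
PAF_calc_discrete(model = model_exercise, riskfactor = "exercise",
                  refval = 0, data = stroke_reduced,
                  ci = TRUE, boot_rep = 2000, ci_type = "perc",
                  verbose = FALSE)
```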

Data sets available

Two simulated datasets are provided with the graphPAF package. stroke_reduced is a matched case–control dataset including 10 stroke risk factors for 6,856 stroke cases and 6,856 stroke controls. The simulations were calibrated according to probability distributions estimated using a Bayesian network model fitted to real data from the INTERSTROKE project [22]. Stroke cases and healthy controls, cases being indicated by case=1, are matched by age group, gender and region; matched cases and controls share the same values of the matching variables. Risk factors have differing datatypes, including binary (smoking, stress, diabetes, high_blood_pressure, early_stage_heart_disease, exercise), ordinal (alcohol) and continuous (waist_hip_ratio, lipids, diet). To illustrate survival analyses, a time-to-event variable, time>0, and an event indicator, event, with 0 indicating right-censored observations, are also included. Hordaland_data is a second example of a case–control dataset, pertaining to 5,000 individuals with chronic cough and 5,000 controls, with two binary risk factors (urban.rural, occupational.exposure) and one ordinal risk factor, smoking.exposure, simulated based on the relationships in [23].

Estimation of prevalent and incident PAF

The function PAF_calc_discrete estimates PAF for categorical risk factors (binary risk factors, or multiple-category risk factors) collected via cross-sectional, case–control and longitudinal cohort designs. As an example, consider estimating the PAF for the variable exercise, a binary indicator of physical inactivity, using the dataframe stroke_reduced.

First, to deal with the case–control matching, we fit a conditional logistic model to describe the relationships between the prevalence of stroke, exercise and assumed confounders, with the following commands:

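The original code figure is not reproduced here; a sketch of such a fit, assuming a matched-set identifier column named strata and an illustrative confounder list (both assumptions, not taken from the paper), might look like:

```r
library(survival)
library(graphPAF)
# Conditional logistic regression respecting the matched case-control
# design: stroke status (case) as a function of exercise and assumed
# confounders, stratified by matched set.
model_exercise <- clogit(case ~ exercise + smoking + stress + diet +
                           alcohol + strata(strata),
                         data = stroke_reduced)
```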

The PAF for physical inactivity can then be calculated using the function PAF_calc_discrete. Note that the reference value of the risk factor variable in R, that is, the value corresponding to no risk factor exposure, is specified below as refval.

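A sketch of the call (the output figure is not reproducible from the extraction):

```r
# Default estimation of PAF for physical inactivity; refval = 0 marks
# 'regular exercise' as non-exposure.
PAF_calc_discrete(model = model_exercise,
                  riskfactor = "exercise",
                  refval = 0,
                  data = stroke_reduced)
```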

For case–control datasets such as stroke_reduced, the 'Bruzzi' method is recommended as it doesn't require specification of disease prevalence, provided the disease is relatively rare (the approximation of risk ratios with odds ratios might be unacceptably inaccurate otherwise). This is the default method employed by graphPAF. If the prevalence, or alternatively the average incidence over a period of time, is known, the 'direct' approach, as described by equation (5), can also be used to estimate PAF in case–control studies, by specifying calculation_method="D" and a value for prev as arguments, as indicated below. Note that when disease prevalence is specified as \(\pi\), graphPAF adjusts predicted probabilities by adding a constant to the linear predictor of the estimated model to ensure that \(\sum_{i=1}^{N} w_i \hat{P}(Y_i=1 \mid A_i,C_i) / \sum_{i=1}^{N} w_i = \pi\).

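A sketch of the direct (weighted standardisation) call, using the yearly incidence value given in the text below:

```r
# Direct estimation, supplying an assumed yearly incidence of first
# stroke of 0.0035 as the prevalence argument.
PAF_calc_discrete(model = model_exercise,
                  riskfactor = "exercise",
                  refval = 0,
                  data = stroke_reduced,
                  calculation_method = "D",
                  prev = 0.0035)
```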

The above calculation assumes that the yearly incidence (averaged over the cohort) of first stroke is 0.0035, which effectively estimates PAF(1), that is equation (2) at \(t=1\). If disease prevalence (rather than an estimated incidence) is used in the argument prev, PAF_calc_discrete will estimate (1). Note that when lifetime disease incidence across the cohort is low, relative risks and hazard ratios should correspond and one would expect equation (1) to be approximately equal to (2) at varying t.

As described earlier, confidence intervals can be specified with ci=TRUE. The following command estimates PAF in the same way, but adds confidence intervals:

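A sketch, mirroring the preceding direct call (an assumption; the original figure is not recoverable) with confidence intervals added:

```r
# As before, but with bootstrap confidence intervals requested.
PAF_calc_discrete(model = model_exercise,
                  riskfactor = "exercise",
                  refval = 0,
                  data = stroke_reduced,
                  calculation_method = "D",
                  prev = 0.0035,
                  ci = TRUE)
```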

For PAF calculations with cross-sectional datasets, a glm model should first be fit that describes the relationship between risk factor and disease, conditional on covariates. Note that in this case only logistic or log-linear binomial models are permitted in graphPAF. Provided the sample is representative of the source population, standardisation (calculation_method="D") should be used, but the disease prevalence, prev, no longer needs to be specified. PAF_calc_discrete will then estimate equation (1).
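For instance, with a hypothetical cross-sectional dataframe cs_data containing a binary outcome y (all names here are illustrative, not from the package), the workflow might look like:

```r
# Logistic model for a representative cross-sectional sample.
model_cs <- glm(y ~ exercise + smoking + age,
                family = binomial(link = "logit"), data = cs_data)
# Standardisation; no prevalence argument is needed for this design.
PAF_calc_discrete(model = model_cs, riskfactor = "exercise",
                  refval = 0, data = cs_data,
                  calculation_method = "D")
```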

By default the reference level for a binary risk factor (that is a risk factor coded as 0/1) is set to 0. In the examples above, 1 codes physical inactivity. However, PAF_calc_discrete can also estimate PAF for multiple-category risk factors provided refval is set correctly.

In cohort datasets, estimation focuses on (2). graphPAF assumes the statistical model is a proportional hazards regression for time to the event, fit via the R function coxph from the survival package. As an example, in the dataframe stroke_reduced, time denotes a simulated survival time to some event in the stroke controls (individuals with event=0 are considered not to have experienced the event at study completion or when they left the study, and are censored). We are interested in the proportion of the events in the sub-cohort that might have been avoided, at various follow-up times, if nobody in the cohort was hypertensive. The following model might be fit, which models the relative hazard of the event as a function of hypertension and possible confounders:

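A sketch of such a model fit; the confounder list is illustrative, and restricting to the stroke controls (based on the description of time above) is an assumption:

```r
library(survival)
# Cox model for the hazard of the event as a function of hypertension
# and possible confounders, fit to the control sub-cohort.
model_htn <- coxph(Surv(time, event) ~ high_blood_pressure + smoking +
                     diabetes + exercise,
                   data = subset(stroke_reduced, case == 0))
```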

At time 0, nobody had experienced an event, but over time the cumulative number of events, and also the proportion of events that might be avoided in the absence of the risk factor, will change. The user can specify the times, t, at which to calculate PAF(t) using the argument t_vector:

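A sketch of the call, with yearly follow-up times (the exact time grid used in the original figure is an assumption):

```r
# PAF(t) at follow-up times 1 to 9 years.
PAF_calc_discrete(model = model_htn,
                  riskfactor = "high_blood_pressure",
                  refval = 0,
                  data = subset(stroke_reduced, case == 0),
                  t_vector = 1:9)
# Per the text, this yields roughly PAF(1) = 0.397 and PAF(9) = 0.176.
```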

The results indicate that while 39.7% of events that happen within a year might have been avoided in a hypertension-free population, only 17.6% of events that happen within 9 years would be avoided. This is the typical pattern one expects for an event such as death, which can only be delayed, not prevented, by the risk factor's absence.

If it is preferred to estimate (3) rather than (2), and data on competing events exist, a weighted Cox model should instead be used, with weights calculated using the function finegray from the survival package. Sending the weighted Cox model to PAF_calc_discrete will utilise the Fine–Gray modification of (7) described earlier. See [24] for more details.
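A sketch of this workflow, assuming a hypothetical cohort dataframe my_cohort with a competing-risks status factor status (levels "censor", "disease", "death"; all names here are illustrative):

```r
library(survival)
# finegray() expands the data with case weights for the subdistribution
# hazard of "disease"; the status factor's first level must be censoring.
fg_data <- finegray(Surv(time, status) ~ ., data = my_cohort,
                    etype = "disease")
# Weighted Cox model for the subdistribution hazard (covariates illustrative).
fg_model <- coxph(Surv(fgstart, fgstop, fgstatus) ~ high_blood_pressure +
                    smoking + diabetes,
                  weight = fgwt, data = fg_data)
# Passing fg_model to PAF_calc_discrete then targets PAF*(t).
```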

Estimation of PAF for data collected on surveys

PAF can be estimated with survey data by using the argument weight_vec in the PAF_calc_discrete function, where weight_vec is a vector of survey weights that correspond to the inverse of the sampling fractions. A regression model (glm or coxph) estimated with weighted maximum likelihood, using the same weights, should be given as the model argument in this case. For instance, stroke_reduced contains a column of weights, weights, giving approximate inverse sampling fractions: 0.9965 for a control and 0.0035 for a case. We use these to illustrate how to estimate PAF from a survey. First, a glm is fit with weighted likelihood, to ensure that the estimated probabilities are representative of the population, using the inverse sampling fractions as the argument weights:

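A sketch of such a weighted fit (the confounder list is illustrative):

```r
# Logistic model fit by weighted maximum likelihood, so that estimated
# probabilities reflect the population rather than the sample.
model_survey <- glm(case ~ exercise + smoking + stress + diet + alcohol,
                    family = binomial(link = "logit"),
                    data = stroke_reduced,
                    weights = stroke_reduced$weights)
```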

Then PAF is estimated using PAF_calc_discrete, with the argument weight_vec set to the same vector of inverse sampling weights:

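A sketch of the call; the use of calculation_method="D" (weighted standardisation, as in equation (5)) is an assumption based on the description of surveys above:

```r
PAF_calc_discrete(model = model_survey,
                  riskfactor = "exercise",
                  refval = 0,
                  data = stroke_reduced,
                  calculation_method = "D",
                  weight_vec = stroke_reduced$weights)
```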

Note that confidence intervals will be calculated using standard bootstrap techniques (including bootstrapping the inverse sampling fractions in weight_vec) and may give incorrect results for surveys involving cluster sampling. One work-around would be to use graphPAF as a tool to generate point estimates, but to design a custom resampling regime that replicates the population sampling scheme used in the survey, and bootstrap according to this custom scheme.

Estimation of PAF using summary data

The functions paf_miettinen and paf_levin implement estimators based on (9) and (10). If confidence intervals for the relative risk and prevalence are available, these functions will return confidence intervals for PAF based on approximate propagation of imprecision, [25].

As an example, suppose we did not have the full stroke_reduced dataset, but instead only had 95% confidence intervals for the prevalence of inactivity: (0.843,0.855), unadjusted odds ratio relating inactivity and stroke (1.514, 1.833) and the odds ratio between inactivity and stroke, adjusted according to the clogit model given earlier in this section: (1.427, 1.806). We will use these odds ratios to provide approximate estimates for the relative risks, \(RR_u\) and \(RR_c\) in (9) and (10). Estimates for PAF and associated confidence intervals are produced by:

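A sketch of these calls, using the confidence intervals quoted above (the argument names conf_prev, conf_RR and conf_RRu are those described in the next paragraph):

```r
# Miettinen-based estimate (9), using both unadjusted and adjusted odds
# ratios as approximations to the relative risks RR_u and RR_c.
paf_miettinen(conf_prev = c(0.843, 0.855),
              conf_RR   = c(1.427, 1.806),
              conf_RRu  = c(1.514, 1.833))
# Levin's formula (10), which uses only the adjusted relative risk.
paf_levin(conf_prev = c(0.843, 0.855),
          conf_RR   = c(1.427, 1.806))
```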

If the risk factor has multiple non-reference levels, conf_prev, conf_RR and conf_RRu should be specified as matrices, with each row giving separate confidence intervals for prevalence and relative risk at each non-reference level of the risk factor. In the above example, the estimated PAF and confidence interval from paf_miettinen are very similar to those produced via PAF_calc_discrete with the default setting calculation_method="B". If the relative risk of inactivity varies substantially over strata of confounders, and this effect modification is reflected by appropriate interaction terms in the statistical model, the results of the two approaches will differ, with PAF_calc_discrete being more accurate. As mentioned in the subsection "Estimates from summarised data", Levin's formula generates asymptotically biased estimates of PAF when \(RR_c\) and \(RR_u\) differ, even if the estimate of \(RR_c\) substituted into the formula is correctly adjusted for confounding. Here, the extent of confounding is relatively minor, and the results of paf_miettinen and paf_levin are very similar.

Estimation of impact fractions

While population attributable fractions (PAF) can summarise the overall impact or importance of a risk factor on disease burden, they tend to give an overly optimistic impression of what an intervention on that risk factor might achieve. The predominant reasons for this are first that it may be difficult if not impossible to eliminate the risk factor from the population (think of the difficulties in preventing all forms of smoking or alcohol-use or enticing an entire population to change their dietary habits) and second that even if one could eliminate the risk factor, disease risk in individuals who previously were exposed might not equal the disease risk if they were never exposed (for instance, former smokers may have higher disease risk than comparable individuals who never smoked) [26].

In contrast, population impact fractions purport to measure the proportional reduction in disease risk from a realistic health intervention that may reduce the prevalence of a risk factor (rather than eliminate the risk factor), or perhaps favorably change the collective statistical distribution of many risk factors. The function impact_fraction in graphPAF can estimate impact fractions under the study designs considered above (cross-sectional, cohort and case–control). We first need to specify how the health intervention changes the distribution of risk factors that might affect disease, through the new_data argument. For instance, imagine a health intervention (perhaps a national campaign to encourage walking) reduces the prevalence of inactivity by 20%. Assuming the intervention has no effect on any other risk factor, the following code shows how such an intervention might be specified using the new_data argument:

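A sketch of constructing such a dataset; the choice of which inactive individuals to flip is illustrative (graphPAF only requires that new_data has the same columns as the original data):

```r
# Post-intervention population: a randomly chosen 20% of inactive
# individuals become active; all other risk factors are left unchanged.
new_df <- stroke_reduced
inactive <- which(new_df$exercise == 1)
set.seed(1)  # for reproducibility of this illustration
flip <- sample(inactive, size = round(0.2 * length(inactive)))
new_df$exercise[flip] <- 0
```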

The impact fraction for such an intervention is then calculated using:

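A sketch of the call:

```r
impact_fraction(model = model_exercise,
                data = stroke_reduced,
                new_data = new_df)
# Per the text, this returns an impact fraction of about 0.067 (6.7%).
```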

indicating that the health intervention might result in a 6.7% reduction in the rate of strokes. Note that this calculation really refers to the difference in disease risk in two comparable populations, one with a reduced rate of inactivity. Since changing one’s behaviour may not completely eliminate cumulative damage due to prior unhealthy lifestyle, this estimated 6.7% might overestimate the impact of the intervention at least in the short term. If the 20% reduction in ‘inactivity’ is sustained through the population over years, this estimate may approximate the long run effect of the health intervention.

PAF nomograms

graphPAF facilitates plotting of the inter-relationships between prevalence, odds ratios and attributable fractions over multiple risk factors using methods described in detail in [7]. These plots utilise the concept of ‘approximate-PAF’, derived in the same paper:

$$PAF_{approx} = \log(OR)\,\pi_c \approx PAF$$

(11)

where OR is the causal odds ratio between a risk factor and disease, and \(\pi_c\) is the prevalence of the risk factor in controls. This approximation stems from a Taylor expansion of the PAF around a relative risk of 1, and will be most accurate for risk factors that have relatively small effects on a relatively rare outcome. One interesting observation regarding approximate PAF is the symmetric roles that risk factor prevalence and log-odds ratio play in its definition, indicating that similar changes in either lead to a similar impact on disease at a population level. To create a fan plot, risk factor data (names, prevalences and log-odds ratios) must first be summarised into an rf_summary object before plotting. For instance:

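The original code figure is not recoverable; the sketch below uses placeholder prevalences and log-odds ratios (not the INTERSTROKE estimates), a shortened risk factor list, and an assumed rf_names argument, so the exact rf_summary signature should be checked against the package documentation:

```r
# Risk factor summary object for fan plotting; all numeric values are
# illustrative placeholders.
rf_data <- rf_summary(
  rf_names = c("Hypertension", "Inactivity", "Smoking", "Cardiac disease"),
  rf_prev  = c(0.35, 0.85, 0.22, 0.04),            # prevalence in controls
  risk     = c(1.00, 0.45, 0.55, 1.20),            # adjusted log-odds ratios
  log      = TRUE                                  # risk supplied on log scale
)
```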

creates such an object for 10 risk factors from the INTERSTROKE database. rf_prev represents the prevalence of the risk factor in controls. For risk factors with more than two levels (here ApoB/ApoA, waist_hip_ratio and alcohol have 3 levels), the prevalences of the non-reference levels of the risk factor should be used as rf_prev. While technically rf_prev should be the prevalence of the risk factor in controls, this can be substituted with population prevalence, when prevalence in controls is unavailable, if the disease is rare. By default, the argument risk should specify confounder-adjusted log-odds ratios for the association between risk factor and outcome, although odds ratios or risk ratios can be used via the setting log=FALSE. Note that log-odds ratios can be conveniently estimated via logistic regression models. Plotting this rf_summary object, using default settings, produces Fig. 1 below.
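A sketch of the plotting call (dispatch to a plot method for rf_summary objects is an assumption based on the text):

```r
plot(rf_data)  # default settings produce the fan plot shown in Fig. 1
```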

Fig. 1

Fan plot displaying prevalences, odds ratios and approximate PAF for INTERSTROKE risk factors. Approximate PAF is represented as both the slope of the line adjoining a point to the y-axis, and also the y-axis intercept of that adjoining line. The fan plot indicates that hypertension and inactivity are the two most prominent risk factors in stroke pathogenesis. Cardiac disease is an outlier on the plot: while it has the highest estimated relative risk, it has low prevalence (less than 5%) in comparison with the other risk factors and is only ranked 7th in terms of approximate PAF.
