Machine learning in causal inference for epidemiology

Causal research can generally be divided into two approaches: confirmatory and exploratory [13]. The main goal of the confirmatory approach is the evaluation of the evidence, relying on a priori knowledge and assuming, as a starting hypothesis, a causal structure describing the relationships between the variables involved (e.g., using directed acyclic graphs (DAGs)). Data analysis is then performed to confirm or refute the starting hypothesis. The exploratory approach, on the other hand, does not start with a priori hypotheses. Instead of specifying a model prior to data analysis, it aims at stimulating the exploration of alternative hypotheses and infers the causal model directly from the data. A branch of causal methods named causal discovery has been developed for this purpose, exploiting the power of ML [14]. In this article, we will focus on methods that integrate ML for causal effect estimation in the confirmatory approach.

The problem of model misspecification

The use of parametric models is very popular thanks to their simplicity and useful asymptotic properties that allow the construction of confidence intervals and hypothesis testing [7]. As the sample size increases, the central limit theorem and the law of large numbers may be used to reach desirable properties: efficiency, consistency and asymptotic normality [7]. However, for the estimator to converge in probability to the true parameter value (i.e. to be consistent), and to gain other desirable asymptotic properties, it is assumed that the underlying model is correctly specified. In practice, however, parametric models are often misspecified and, consequently, they cannot optimally capture the true data-generating process. One of the strong and often unverifiable assumptions parametric models rely on is the correct specification of the exposure-outcome relationship. If this assumption is unmet, the estimate can suffer from “estimation bias” [15]. To specify a parametric model correctly, it is necessary (i) to assume that the true data-generating process belongs to a specific parametric family (in this way, by correctly specifying the link function, a possibly nonlinear relationship can be mapped into a linear one), (ii) to include a correct set of exposure-covariate and/or covariate-covariate interactions, if any, and (iii) to model potential nonlinearities appropriately [16]. An example is the use of logistic regression to estimate the propensity score: it restricts the type of relationship between exposure and confounders, assuming that the log-odds of exposure are appropriately described by a linear combination of the covariates [16].

Classical statistical theory ensures that an estimator obtained with maximum likelihood estimation is asymptotically efficient, i.e. that it achieves the lowest possible variance among all consistent estimators in large samples, under certain regularity conditions (smoothness). This optimality holds only when the assumed parametric model is correct and the sample size is large. As a result, while parametric approaches offer simplicity and computational efficiency, they may not adequately capture the complexity of real-world data, because the assumption on the underlying distribution is often too restrictive.

Nonparametric or semiparametric methods do not rely on assuming that the data follow a specific parametric distribution indexed by finite-dimensional parameters. Nonparametric models are particularly useful when there is limited knowledge about, or reluctance to make assumptions on, the underlying exposure mechanism, outcome mechanism, or both. Despite the absence of parametric assumptions, nonparametric estimators can still attain well-characterised convergence rates, and valid confidence intervals (CIs) can be constructed even when ML techniques are used to handle high-dimensional data and capture complex relationships between variables.

In recent years, estimators for causal effects that exploit the predictive power of ML have been developed [17]. These methods join the strengths of the two, apparently distinct, perspectives of causal inference and ML, so that each one can take advantage of the other. The integration of ML methods in estimators for a causal effect can mitigate the assumption of correct model specification thanks to their flexibility and capability to approximate complex functions, to handle interactions and nonlinearities, and to avoid functional-form restrictions [7, 8].

Definition of the causal framework

According to the counterfactual theory of causation [18], questions about the causal effect of an exposure A on an outcome Y in a particular population can be expressed in terms of counterfactual contrasts. A counterfactual is a ‘what-if’ statement that describes what would have happened in the target population under different exposure levels than those actually observed. A key causal estimand is the average treatment effect (ATE) that, for a binary exposure, represents the difference between the expected value of the outcome that would have occurred under exposure A = 1 (exposed) and the outcome that would have occurred under exposure A = 0 (unexposed) (the so-called potential outcomes). Mathematically, it is defined as:

ATE = E[Y(1) − Y(0)]

where E denotes the expectation, and Y(1) and Y(0) are the potential outcomes under A = 1 and A = 0, respectively.
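As a toy illustration, the following sketch simulates data from a hypothetical data-generating process in which both potential outcomes are known, computes the ATE as the average difference of the potential outcomes, and contrasts it with the crude, confounded comparison of observed group means (all names and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: one confounder W,
# binary exposure A, and potential outcomes Y(0), Y(1).
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-W)))    # exposure probability depends on W
Y0 = W + rng.normal(size=n)                  # potential outcome under A = 0
Y1 = W + 2.0 + rng.normal(size=n)            # potential outcome under A = 1 (true ATE = 2)

ate = np.mean(Y1 - Y0)                       # ATE = E[Y(1) - Y(0)], close to 2
Y = np.where(A == 1, Y1, Y0)                 # observed outcome
naive = Y[A == 1].mean() - Y[A == 0].mean()  # crude contrast, inflated by confounding
```

In observed data only one of the two potential outcomes is available per individual, so the `ate` line is not computable in practice; the crude contrast `naive` overstates the effect here because W raises both the exposure probability and the outcome.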

To estimate the ATE from observed data, several critical steps and assumptions must be considered within a formal causal framework, such as the Causal Roadmap [19] (Fig. 1A): (i) after the identification of the research question and (ii) the specification of the causal model (e.g., through a DAG) representing the assumed relationships between variables, (iii) the research question is translated into the causal estimand of interest (e.g. the ATE). To make the causal estimand quantifiable from the observed data, (iv) it is translated into a statistical estimand. However, to establish a causal interpretation of the statistical estimand, it is essential to ensure that the following identifiability assumptions are met [15, 19]:

Counterfactual consistency: The observed outcome is consistent with the potential outcomes under the observed exposure level.

No interference: The potential outcomes for an individual are not affected by the exposure status of other individuals.

Exchangeability: The distribution of potential outcomes is the same across exposed and unexposed, given the covariates.

Positivity: There is a non-zero probability of receiving each level of the exposure for all levels of covariates.

After evaluating the assumptions encoded in the causal model and ensuring adequate data support, (v) the statistical parameter can be estimated.

Statistical estimators of the ATE

In this article, we focus on the estimation of the ATE. The Risk Difference (RD) is a straightforward measure of the ATE (for continuous or binary outcomes). However, packages implementing the methods illustrated here are versatile and capable of providing treatment-effect estimates also on the risk ratio and odds ratio scales (for binary outcomes). Furthermore, they are able to accommodate other causal estimands beyond the ATE, such as the Average Treatment effect among the Treated (ATT) and among the controls (ATC) [20,21,22], and a variety of structural models, as detailed in Table 1 [21, 22].

To estimate the ATE, causal inference approaches typically involve fitting “nuisance models” [7] to the data before the final parameter estimation step. These nuisance models aim to estimate the conditional expectation of the outcome given exposure and confounders (outcome mechanism) and/or the conditional probability of exposure given the confounders, namely the propensity score (exposure mechanism).

Traditionally, “nuisance models” are fitted using parametric models. When the set of confounders is high-dimensional, and especially when their number exceeds the sample size, traditional parametric models have a high probability of being misspecified [10]. Since nuisance models are purely predictive problems that do not involve causal interpretation [23], they can benefit from the use of methods with high predictive ability that are particularly suited to work with high-dimensional data, such as ML. Supervised ML techniques, like decision trees, random forests, support vector machines, neural networks, and ensembles like the SuperLearner are particularly suited for the purpose [19, 24,25,26].

Fig. 1 Visual synthesis of the article. In A, the different steps of a causal inference framework. In B, estimators for causal effects that integrate Machine Learning methods, bridging the gap between statistical inference and Machine Learning

Plug-in estimators of the ATE

The predictions obtained from the nuisance models with a single ML method, or with the SuperLearner, can be integrated into the estimator for the ATE (Fig. 1B). One example is the class of plug-in estimators, statistical estimators where estimates of specific quantities, such as parameters or functions, are plugged into a predefined formula to compute the estimate of interest. Two examples are Inverse Probability Weighting (IPW) and g-computation: they involve plugging estimated quantities (the propensity score (PS) in the case of IPW, and the potential outcomes in the case of g-computation) into a specific formula to estimate the ATE. They are “singly robust” estimators because they rely on the correct specification of one nuisance model, either the one representing the exposure mechanism (for IPW), or the one representing the outcome mechanism (for g-computation) [24].

The PS, the nuisance model for the exposure mechanism in IPW, aims at summarising the information from all confounders in one parameter, the “propensity” to be exposed to the exposure of interest, allowing for an optimal balance of observed covariates between exposed and unexposed. The PS can then be used to control for confounding in different ways. For example, it can be integrated in the IPW estimator for the causal parameter of interest: each observation is weighted by the inverse of the probability, conditional on all confounders, of receiving the exposure that the individual actually received. The weight is 1/PS for exposed and 1/(1-PS) for unexposed individuals. The weights serve to create a pseudo-population where the exposure status no longer depends on the confounders.
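Assuming the PS is available, the IPW estimator of the ATE can be sketched as follows. In this simulated example the true logistic form of the PS is used directly; in practice it would be estimated, e.g. with logistic regression or an ML method:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
W = rng.normal(size=n)                  # confounder
ps = 1 / (1 + np.exp(-W))               # propensity score (true logistic form here)
A = rng.binomial(1, ps)                 # exposure
Y = 2.0 * A + W + rng.normal(size=n)    # outcome; true ATE = 2

# IPW: weight exposed individuals by 1/PS and unexposed by 1/(1 - PS),
# then contrast the weighted means of the two pseudo-populations.
ipw_ate = np.mean(A * Y / ps) - np.mean((1 - A) * Y / (1 - ps))
```

The weighting removes the dependence of exposure on W, so `ipw_ate` is close to the true value of 2, whereas the unweighted contrast of group means would not be.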

G-computation, on the other hand, is an example of an estimator that requires a nuisance model for the outcome mechanism. In this case, the potential outcomes, treated as a missing data problem, are predicted from the model for the outcome. The potential outcomes are then plugged into the g-computation estimator to obtain an estimate of the ATE.
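A minimal g-computation sketch on simulated data, using an ordinary least-squares outcome model as a stand-in for a flexible ML learner:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-W)))
Y = 2.0 * A + W + rng.normal(size=n)    # true ATE = 2

# Outcome model Q(A, W): ordinary least squares here (in practice a
# flexible ML learner, e.g. the SuperLearner, could be used instead).
X = np.column_stack([np.ones(n), A, W])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predict both potential outcomes for every individual ("filling in"
# the missing counterfactual), then average the individual differences.
Q1 = np.column_stack([np.ones(n), np.ones(n), W]) @ beta   # prediction under A = 1
Q0 = np.column_stack([np.ones(n), np.zeros(n), W]) @ beta  # prediction under A = 0
gcomp_ate = np.mean(Q1 - Q0)
```

Note that the validity of `gcomp_ate` rests entirely on the outcome model being correctly specified, which is what makes g-computation singly robust.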

ML techniques can replace the use of parametric models for the computation of the PS and the potential outcomes, improving the quality of the prediction.

However, the potential advantages of using ML to estimate the nuisance models in plug-in estimators come at a price and involve challenges associated with increased complexity, overfitting, sample size requirements and, especially, the risk of (plug-in) bias. The reason is that ML methods are solving an optimization problem for the prediction of the nuisance models, but the bias-variance trade-off they reach may be suboptimal for the task of interest, i.e., obtaining an unbiased estimate of the ATE [12]. Additionally, when integrating nonparametric models, plug-in estimators typically exhibit bias larger than \(1/\sqrt{n}\), where n is the sample size, and experience slower convergence rates compared to parametric methods. This phenomenon is known as the curse of dimensionality, and it implies that exponentially larger sample sizes are required to obtain parameter estimates that are close to the true parameter values [24, 27,28,29].

Doubly-robust estimators

In response to the limitations of plug-in estimators, doubly-robust estimators [24] have been proposed. They achieve useful asymptotic properties, including the construction of valid confidence intervals, even when nuisance models are estimated using ML [19, 30].

They are called doubly-robust because they provide two opportunities to obtain an unbiased estimator of the ATE. Similarly to singly-robust estimators, they also require predictive steps before the effect estimation but, in this case, two separate nuisance models, one for the exposure and one for the outcome mechanism, are obtained (Fig. 1B). After the prediction of propensity and outcome models, the two nuisance models are combined for the estimation of the target causal effect. Such an estimator will be consistent if either the propensity or the outcome model is specified correctly, but not necessarily both [30]. However, the asymptotic efficiency and the ability to perform standard parametric-rate inference (e.g. with rates of convergence typically associated with parametric models) on the target parameter can be achieved only if both nuisance models are specified correctly [30].

The advantage of the use of ML in conjunction with doubly-robust estimators is the ability of doubly-robust estimators to achieve small bias more readily than singly-robust ones, owing to the mathematical properties of their estimation error. Specifically, the bias is less than \(1/\sqrt{n}\), where n is the sample size, if the errors in both nuisance models are substantially smaller than \(n^{-1/4}\), a condition that ML estimators can satisfy under smoothness and sparsity assumptions [10].

It is important to be cautious, as it has been shown that doubly-robust estimators are generally less efficient than those obtained with correctly specified parametric models based on maximum likelihood estimation [17]. Moreover, if both nuisance models are misspecified, the resulting estimate may exhibit larger bias than the one obtained with a single, misspecified maximum likelihood model [24]. However, while parametric models may converge faster and require smaller sample sizes to achieve a certain level of efficiency, they may not necessarily exhibit higher accuracy compared to ML models, which typically offer greater flexibility and may capture more complex relationships within the data.

To ensure the statistical validity of confidence intervals, doubly-robust ML estimators require sample splitting and cross-fitting. Sample splitting involves dividing the study population into estimation and training samples. The training sample is used for training ML algorithms to estimate nuisance models, while the estimation sample is employed for estimating the ATE. This yields a doubly-robust estimate of the ATE, derived from a random half of the study population. However, the resulting confidence intervals tend to be wider than those obtained using the entire sample, due to the halved sample size. To mitigate this issue and regain some of the lost efficiency, cross-fitting involves repeating the estimation procedure multiple times using different subsets of the data for training and estimation. Averaging the estimates obtained from these different subsets reduces variability, yielding more precise estimates of the treatment effect.
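The splitting-and-averaging mechanics of cross-fitting can be sketched as follows, here with K = 5 folds on simulated data and a simple least-squares outcome model standing in for an ML learner (the doubly-robust combination of nuisance models is omitted to keep the focus on the folds):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-W)))
Y = 2.0 * A + W + rng.normal(size=n)    # true ATE = 2

K = 5
folds = rng.integers(0, K, size=n)      # random fold assignment
fold_estimates = []
for k in range(K):
    train, est = folds != k, folds == k
    # Train the nuisance (outcome) model on the training folds only...
    X_tr = np.column_stack([np.ones(train.sum()), A[train], W[train]])
    beta, *_ = np.linalg.lstsq(X_tr, Y[train], rcond=None)
    # ...and estimate the effect on the held-out estimation fold.
    m = est.sum()
    Q1 = np.column_stack([np.ones(m), np.ones(m), W[est]]) @ beta
    Q0 = np.column_stack([np.ones(m), np.zeros(m), W[est]]) @ beta
    fold_estimates.append(np.mean(Q1 - Q0))

# Averaging across folds uses every observation for estimation exactly once,
# recovering the efficiency lost to a single split.
crossfit_ate = np.mean(fold_estimates)
```

Each observation contributes to the final estimate only through a nuisance model it was not used to train, which is what breaks the correlation between nuisance fitting and effect estimation.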

In the next three sections, we will discuss the three most commonly used doubly-robust estimators: Augmented Inverse Probability Weighting (AIPW), Double/Debiased Machine Learning (DML) and Targeted Maximum Likelihood Estimation (TMLE). We will explore their conceptual details, principles, advantages and examples of applications. In Table 1, for each explored method, relevant theoretical articles, tutorials, worked examples, reviews and software are listed.

Augmented inverse probability weighting and double/debiased machine learning

AIPW, first proposed by Robins and colleagues [32] and further developed by Scharfstein and colleagues [33], is a doubly-robust estimator based on the estimating equation methodology [12]. As in IPW, the basic idea is to use weights to adjust for differences in the distribution of confounders between exposed and unexposed. To obtain the AIPW estimator, the IPW estimator is augmented by a term that involves the outcome regression. The augmentation term is the weighted average of the two potential outcomes [34] and serves: (1) to increase the efficiency, resulting in a smaller variance than that of the IPW estimator [35], and (2) to provide the estimator with the double-robustness property [36].

If the PS is well specified, then the AIPW estimator simplifies to the IPW estimator. Conversely, if the PS is misspecified, the AIPW estimator reduces to the outcome model [36].

AIPW, derived from the semiparametric efficiency theory, maintains the double robustness property even when combined with ML techniques [28].
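A minimal sketch of the AIPW estimator on simulated data is shown below. The outcome model is deliberately crude (it ignores both exposure and confounders) to illustrate double robustness: with a correctly specified PS, the augmentation term still recovers the ATE:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
W = rng.normal(size=n)
g = 1 / (1 + np.exp(-W))                # propensity score (true logistic form here)
A = rng.binomial(1, g)
Y = 2.0 * A + W + rng.normal(size=n)    # true ATE = 2

# Deliberately misspecified outcome model: a constant prediction
# that ignores A and W entirely.
Q1 = Q0 = np.full(n, Y.mean())

# AIPW: outcome-model contrast plus inverse-probability-weighted
# residuals (the augmentation term).
aipw_ate = np.mean(
    Q1 - Q0
    + A * (Y - Q1) / g
    - (1 - A) * (Y - Q0) / (1 - g)
)
```

Despite the useless outcome model, `aipw_ate` is close to 2 because the PS is correct; symmetrically, a correct outcome model would rescue a misspecified PS.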

AIPW serves as the foundation for the broader Double/Debiased Machine Learning (DML) framework. In its full-sample implementation, AIPW uses data from all individuals to estimate both the PS and the outcome model, along with the final ATE estimate. However, this full-sample approach carries the risk of introducing correlation between the nuisance models and the final ATE estimate, potentially impacting performance in unpredictable ways [10].

To address this limitation, the DML framework [28], proposed in 2018 by Chernozhukov and colleagues, builds upon AIPW by incorporating sample splitting and cross-fitting techniques. Splitting the sample into two parts, one to estimate the nuisance parameters and the other to compute the final ATE estimate, reduces the risk of bias of the full-sample estimator. Moreover, sample splitting helps to mitigate overfitting bias, allowing for the use of various ML methods such as lasso, random forests, and neural networks, depending on the data characteristics and problem at hand. In DML, ML methods are used to predict, separately, the outcome Y and the exposure A from the covariates. The predictions are then combined by regressing the residuals of Y on the residuals of A, guided by an estimating equation that ensures double robustness [22, 28, 37], overcoming the problems of plug-in estimators. DML is particularly suited to settings with a large number of covariates, and its authors provide guidance on selecting appropriate ML methods based on the specific characteristics of the data and the problem at hand [28].
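A minimal sketch of the DML residual-on-residual (partialling-out) idea on simulated data, with a single sample split and least-squares predictions standing in for ML learners:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-W))).astype(float)
Y = 2.0 * A + W + rng.normal(size=n)    # true effect of A on Y = 2

# Single sample split: first half trains the nuisance models,
# second half is used for estimation.
half = n // 2
idx_tr, idx_est = np.arange(half), np.arange(half, n)

def fit_predict(x, y, x_new):
    """OLS of y on [1, x]; a stand-in for an arbitrary ML learner."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.column_stack([np.ones_like(x_new), x_new]) @ beta

# Predict Y and A from the covariates on held-out data, then regress
# the Y-residuals on the A-residuals to estimate the effect.
res_Y = Y[idx_est] - fit_predict(W[idx_tr], Y[idx_tr], W[idx_est])
res_A = A[idx_est] - fit_predict(W[idx_tr], A[idx_tr], W[idx_est])
dml_ate = np.sum(res_A * res_Y) / np.sum(res_A ** 2)
```

In a full DML analysis the roles of the two halves would also be swapped and the two estimates averaged (cross-fitting), and the nuisance predictions would come from flexible ML learners rather than OLS.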

Targeted maximum likelihood estimation

TMLE is a doubly-robust, maximum-likelihood–based estimation method, developed by van der Laan and Rubin [31]. In addition to the initial estimation of the outcome and exposure models, TMLE involves a “targeting” step to get the best estimate of our target parameter of interest (e.g., ATE) [38].

To give some insights into how the method works, an example is provided illustrating the technical steps involved in using TMLE to estimate the ATE of a binary exposure A on an outcome Y, adjusted for baseline confounders W:

Prediction of the outcome model

In the first stage, the conditional expectation of the outcome given exposure and covariates, \(Q = E(Y \mid A, W)\), is modelled and used to predict every individual’s outcome. Such a model can be fitted using the SuperLearner. We can then obtain an estimate of the ATE based on g-computation. However, this estimate is singly-robust (thus, susceptible to bias): its validity relies entirely on a correct estimate of \(Q\), and its bias-variance trade-off is optimised for \(Q\) rather than for the ATE.

Prediction of the propensity score

To overcome this problem, information on the exposure mechanism is used. The PS, \(\:P(A=1|W)\), is estimated, for example, using the SuperLearner.

The clever covariate and estimation of the fluctuation parameter ε

The PS is used to create a variable, named the clever covariate, defined as \(\frac{1}{P(A=1\mid W)}\) for the exposed individuals and \(-\frac{1}{1-P(A=1\mid W)}\) for the unexposed individuals. The clever covariate is crucial in updating the initial outcome estimates using information on the exposure mechanism, and in optimising the bias-variance trade-off for the target parameter (e.g., the ATE) rather than for \(Q\). A predefined regression model is used to update the initial outcome estimates: the observed outcome \(Y\) is regressed on the clever covariate as the only predictor, with the initial outcome prediction \(Q\) entering as a fixed offset. The regression coefficient ε that is estimated, based on maximum likelihood estimation, is called the fluctuation parameter. By solving an estimating equation (which sets the efficient influence function equal to zero (see Supplementary Material)) [11], the clever covariate ensures that the estimator becomes approximately unbiased and gains useful asymptotic properties [12].

Updating of the outcome model

The fluctuation parameter is then used to update the initial estimate of \(\:Q\), yielding the two final potential outcomes. The ATE is then computed as the average difference between the two updated potential outcomes across individuals.
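The four steps above can be sketched as follows on simulated data with a continuous outcome, using a linear fluctuation (for binary or bounded outcomes, TMLE typically uses a logistic fluctuation on a transformed scale instead). The initial outcome model is deliberately crude to show how the targeting step removes its bias when the PS is correct:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
W = rng.normal(size=n)
g = 1 / (1 + np.exp(-W))                # propensity score (true logistic form here)
A = rng.binomial(1, g)
Y = 2.0 * A + W + rng.normal(size=n)    # true ATE = 2

# Step 1: initial outcome model, deliberately crude (group means,
# ignoring W), standing in for a SuperLearner fit. The g-computation
# estimate based on it, mean(Q1 - Q0), is badly confounded.
Q1 = np.full(n, Y[A == 1].mean())
Q0 = np.full(n, Y[A == 0].mean())
Q = np.where(A == 1, Q1, Q0)

# Step 2: propensity score g (known here; estimated in practice).
# Step 3: clever covariate H and fluctuation parameter epsilon,
# via least squares of the residual on H with Q as a fixed offset.
H = A / g - (1 - A) / (1 - g)
eps = np.sum(H * (Y - Q)) / np.sum(H ** 2)

# Step 4: update the potential-outcome predictions and average.
Q1_star = Q1 + eps / g
Q0_star = Q0 - eps / (1 - g)
tmle_ate = np.mean(Q1_star - Q0_star)
```

The crude initial estimate `mean(Q1 - Q0)` is pulled toward the truth by the targeting step, because with a correct PS the fluctuation solves the efficient influence function equation regardless of the initial \(Q\).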

The literature on TMLE is expanding [38], and this technique is becoming the most widely used doubly-robust approach [50,51,52]. A recent systematic review examined the increasing adoption of TMLE in public health and epidemiological studies, on a wide range of research questions and outcomes [38]. The diverse applications of TMLE highlight the variety of complex causal effect estimation problems where this method can show its potential, such as multiple time point interventions, longitudinal data, post-intervention effect modifiers, dependence of the exposure assignment between units or censoring, causally connected units, hierarchical data structures, randomisation at the cluster level, large electronic health record data, and meta-analyses [38].

Practical guidelines and tutorials have been published on the implementation of TMLE to model the effects of a binary exposure [11, 37] and sequential interventions with time-varying confounders [38]. These resources offer valuable insights into applying TMLE methodology in various research settings.

Comparison between AIPW and TMLE

Since TMLE and AIPW are both based on the efficient influence function (see Supplementary Material), both are mathematically efficient and exhibit similar asymptotic properties. However, while both estimators perform well in large sample settings, they behave differently in finite sample settings, with AIPW estimates subject to larger variability than TMLE estimates [30]. An important difference is that, while both solve the efficient influence function estimating equation, TMLE is also a loss-based (substitution) estimator that makes use of maximum likelihood estimation. Estimating-equation-based methodology aims at providing estimators with minimal asymptotic variance, without imposing constraints to ensure that the estimated values are realistic and feasible within the context of the observed data [12]. AIPW has the same weaknesses as IPW when it comes to the positivity assumption and unstable weights. Under dual misspecification and near-positivity violations, it has been shown that AIPW performs worse than TMLE, and that it is unstable when values of the PS are close to zero [12]. On the other hand, AIPW can be relatively easier to implement, as it does not involve the iterative updating of models, and might require fewer computational resources compared to TMLE.

Table 1 List of relevant theoretical articles, tutorials, worked examples, reviews and software for AIPW, DML and TMLE
