As noted in the introduction, the example is based on a previous report documenting that vitamin D deficiency predicted subsequent suicide (Umhau et al., 2013). The sample was a 1:1 matched (on age, gender, rank, and timing of when blood samples were collected) case-control sample of n = 495 suicide decedents and n = 495 controls from the active-duty US military with a history of combat deployment. The matched case-control design creates certain analysis complexities described below.
Detailed information on the assessment of vitamin D deficiency and the baseline measures examined as possible prescriptive predictors is presented in the original report (Umhau et al., 2013). The baseline measures included two classes of biological risk factors—continuous levels of 22 fatty acids and five trace elements—all tested in the same blood serum samples used to test for vitamin D deficiency, a series of socio-demographics and military career variables abstracted from military records, and information from medical records of psychiatric diagnoses, all as of the time of the vitamin D measurement. The categorical variables among these predictors were coded as dummies, whereas the ordinal and interval variables among the predictors were standardized (mean 0 and variance 1) and stabilized into quartiles. These transformations were important for using the full range of machine learning algorithms included in the analysis.
2.2 Analysis methodsOverview: Numerous methods exist to estimate HTE (Robertson et al., 2020). The major challenge in the case of suicide is that even though some prescriptive predictors have been documented, none has been strong enough alone to guide precision treatment planning. This has prompted growing interest in combining information across multiple prescriptive predictors to create a composite measure of HTE (Salazar de Pablo et al., 2021). This is done most often by estimating a proportional interaction model that contains multiple interactions and combining these interactions to create a single composite measure of HTE. The latter is done using counter-factual logic: that is, by computing a predicted outcome score for each person under each intervention regimen (i.e., regardless of which regimen received) based on model coefficients and then comparing predicted individual-level regimen-specific outcome scores to select the intervention estimated to yield the better outcome for each patient (Kovalchik et al., 2013).
However, when data-driven methods are used to search for interactions, as they typically are in building composite HTE measures, there is a danger of over-fitting. Indeed, a recent simulation suggested that the majority of detected interactions in HTE models are likely to be false positives unless methods are used to reduce over-fitting (van Klaveren et al., 2019). Methods that protect against over-fitting can be used for this purpose to produce composite HTE estimates (VanderWeele et al., 2019).
With these issues in mind, it is often useful to develop an overall risk model prior to attempting to study HTE and to do so using methods that minimize risk of over-fitting by developing the model in a training sample and then testing it in an independent holdout test sample. In the case of our 495 case-control pairs, we did this by creating a training sample made up of a random 33% of observations (i.e., n = 164 matched case-control pairs) and a 67% testing sample (n = 331 pairs). We then used 10-fold cross-validation (10F-CV) in the training sample to develop several different machine learning models. Finally, we applied each model to the testing sample to evaluate the extent to which we could detect meaningful HTE. It is noteworthy that a more typical choice in a substantive context would be to use a 67% training sample and 33% test sample, as a small training sample increases risk of over-fitting.
Special issues in working with case-control samples: Case-control samples of the sort used in the vitamin D example are often used in research on suicide given the fact that suicide is a rare outcome. An analysis of a full sample, especially one based on an administrative data system, would often include thousands of non-cases for every case. Case-control analysis addresses this problem by selecting a probability sample of controls from the full set of controls either with or without weights to adjust for this under-sampling (Keogh & Cox, 2014).
The use of weights allows a wide range of models to be estimated that are appropriate for dichotomous outcomes and allows predicted risk differences to be estimated (Pedroza & Truong, 2016). When weights are not used, in comparison, the model should be estimated with logistic regression, as only the odds ratio can be estimated without bias with such a sample (Hosmer et al., 2013). The individual-level predicted odds generated by a logistic regression model can be converted into individual-level predicted risks post hoc when information is available on the probabilities of selection of controls (Rose & van der Laan, 2014). This is critical for HTE analysis, as the latter focuses on risk differences rather than risk ratios or odds ratios when the outcome is a dichotomy. In the example considered here, the sampling fractions used to select matched controls were not available, making it impossible to use weighting to recover predicted risks. We consequently work with logistic models and examine differences in predicted odds. We caution the reader, though, that this is being done merely to illustrate the general approach and that practical applications should use weights either prior to or subsequent to estimating the initial models and generate estimates of predicted risk rather than predicted odds.
A question might be raised on how to determine whether to weight before or after estimating a model based on a case-control design. The answer depends on the investigator's decision either to give equal weight to false positives and false negatives or, as is often the case in models for rare dichotomous outcomes, to give greater weight to detecting the rare outcome (i.e., minimizing false negatives, as when a premium is placed on detecting suicides) than to correctly classifying non-cases. Numerous methods exist for giving greater weight to detecting the rare outcome, some of them involving the use of weights but others involving either under-sampling non-cases, pseudo-sampling replicates of cases, or using a combination of both without weights (He & Ma, 2013).
Risk modeling: The first approach to estimating HTE that we investigate is known as risk modeling. This approach uses a 1 degree of freedom test to evaluate the strength of HTE by generating a prediction model for the joint effects of all predictors (excluding the target risk factor) to fit a “base risk” model for the outcome in the total sample (Kent et al., 2016). The base risk is then estimated for each observation in the sample regardless of target risk factor score and used to define subgroups for investigating whether the aggregate association between the target risk factor and the outcome varies with base risk.
The intuition underlying the risk modeling approach is that some people have very low risk of the negative outcome, in which case an intervention to change a target risk factor is unlikely to have a large effect. The strength of this approach is that it provides a stable estimate of variation in aggregate outcome risk that can be evaluated with 1 degree of freedom, thereby avoiding the problem of over-fitting and often showing evidence of HTE. Consistent with this thinking, a recent secondary analysis of 32 large clinical trials (primarily in cardiology) found that most trials with significant aggregate treatment effects also had significant HTE, where the highest absolute intervention effects usually occurred among patients with the highest base risk and lowest absolute intervention effects among patients with lowest base risk (Kent et al., 2016). As a striking example, a trial of early intervention versus usual care for unstable angina found that more than half the significant aggregate treatment effect was due to an extremely strong effect among the one-eighth of patients with highest base risk and that there was no meaningful intervention effect among the 50% of patients with lowest base risk (Fox et al., 2005).
Many approaches are available to develop a base risk model, from a simple multiple regression model to a complex machine learning model. We used a stacked generalization approach based on the super learner ensemble machine learning method to develop our base risk model (Polley, 2018). In this approach, the data are analyzed in parallel in a 10F-CV training sample with a set (ensemble) of parametric and flexible prediction algorithms designed to capture nonlinearities and interactions among predictors. Results are then combined by generating a weighted composite of individual-level predicted outcome scores across the algorithms via a meta-learner equation (i.e., an equation in which the predicted outcome scores based on each algorithm is included as a separate predictor).
The advantage of this approach over others is that the composite predicted outcome score is guaranteed in expectation to perform at least as well as the best component algorithm according to a pre-specified criterion (Polley et al., 2011). We defined this as the area under the receiver operating characteristic curve (AUC), but other criteria might be preferred in other cases. Consistent with recommendations (Naimi & Balzer, 2018), we used a diverse super learner ensemble to reduce risk of misspecification (Kabir & Ludwig, 2019). These included several different linear algorithms (logistic regression, regularized regression, spline and polynomial spline regressions, support vector machines) and several different regression tree-based algorithms (boosting and bagging ensemble trees, Bayesian Additive regression trees; Table 1). All the variables in the dataset other than vitamin D status were included as potential predictors.
TABLE 1. Algorithms used in the super learner ensemble Algorithm Description I. Super learner Super learner is an ensemble machine learning approach that uses cross-validation (CV) to select a weighted combination of predicted outcome scores across a collection of candidate algorithms (learners) to yield an optimal combination according to a pre-specified criterion that performs at least as well as the best component algorithm. R package: Super learner (van der Laan et al., 2007). II. Linear algorithms in the super learner library A. Generalized linear models Maximum likelihood estimation with logistic link function. R package: stats (Nelder & Wedderburn, 1972). B. Elastic Net Elastic net is a regularization method that minimizes the problem of overlap among predictors by explicitly penalizing over-fitting with a composite penalty λ, where MPP is a mixing parameter penalty with values between 0 and 1 that controls relative weighting between the lasso penalty (Plasso) and the ridge penalty (Pridge). The parameter λ controls the total amount of penalization. The ridge penalty handles multicollinearity by shrinking all coefficients smoothly towards 0 but retains all variables in the model. The lasso penalty allows simultaneous coefficient shrinkage and variable selection, tending to select at most one predictor in each strongly correlated set, but at the expense of giving unstable estimates in the presence of high multicollinearity. The elastic net approach of combining the ridge and lasso penalties has the advantage of yielding more stable and accurate estimates than either ridge or lasso alone while maintaining model parsimony. R package: glmnet (Friedman et al., 2010). 11 different glmnet specifications were used, varying the α hyperparameter over the values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1. C. Adaptive splines Adaptive spline regression flexibly captures both linear and piecewise nonlinear associations as well as interactions among these associations by connecting linear segments (splines) of varying slopes and smooths to create piece-wise curves (basis functions). Final fit is built using a stepwise procedure that selects the optimal combination of basic functions. R package: Earth (Milborrow, 2021). 3 different Earth specifications were used, varying the degree hyperparameter over the values 1, 3, and 5. D. Adaptive polynomial splinesb Adaptive polynomial splines are like adaptive splines but differ in the order in which basis functions (e.g., linear vs. nonlinear) are added to build the final model. R package: polspline; Kooperberg, 2020. III. Tree-based algorithms A. Bagging Random forest. Independent variables are partitioned (based on contiguous values) and stacked to build short decision trees that are combined (ensemble) to create an aggregate “forest”. Random forest builds numerous trees in bootstrapped samples and generates an aggregate tree by averaging across trees, thereby reducing over-fitting. R package: ranger (Wright & Ziegler, 2017). 3 different ranger specifications were used, each with the following hyperparameter values: max.depth = (6, 8, 8), num.trees = (1500, 1700, 1000), mtry = (10, 4, 20), splitrule = (“gini,” “hellinger,” “extratrees”) B. Gradient boosting Gradient boosting algorithms build a sequential ensemble of shallow successive regression trees that iteratively learn the residuals from prior trees. This is a flexible method, where the number of trees, interaction depth, and shrinkage are leveraged to build flexible models. R package: CatBoost (Prokhorenkova et al., 2019). 2 different CatBoost specifications were used, each with the following hyperparameter values: Iterations = (50, 100), learning_rate = (0.3, 0.8), depth = (8, 10). C. Extreme gradient boosting A fast and efficient implementation of gradient boosting. R package: XGBoost (Chen & Guestrin, 2016). 5 different XGBoost specifications were used, each with the following hyperparameters: Ntrees = (1000, 100, 500, 100, 800), max_depth = (6, 2, 6, 8, 4), shrinkage = (0.001, 0.1, 0.1, 0.1, 0.001), gamma = (0.3, 0.5, 0.0, 0.5, 0.8), minobspernode = (20, 10, 20, 10, 20), and colsample_bytree = (0.3, 0.8, 0.5, 0.3, 0.8) D. DBARTS Fits Bayesian additive regression trees. R package: dbarts. (Dorie, 2020). Abbreviation: DBARTS, Discrete Bayesian Additive Regression Trees Sampler. a Each linear algorithm was estimated separately with five different lasso screeners where dfmax = 10, 15, 20, 30 and all predictors. Each tree algorithm was estimated separately with five different ranger screeners for number of predictors equal to 10, 15, 20 30 and all predictors. Hyperparameter tuning was achieved by treating different specifications of individual algorithms as separate learners in the ensemble, as detailed in the body of the table. b Hyperparameters: Default values were used unless otherwise noted.We applied the training sample model results to the test sample to generate individual-level predicted log-odds of suicide based on all available predictors in the dataset other than vitamin D deficiency. These predicted values were then included as the key predictor in a conditional logistic regression model that adjusted for the matching of cases and controls (Sun et al., 2011) and included a dummy variable for vitamin D deficiency, the main effect of the predicted values, and an interaction between vitamin D deficiency and the predicted values. The existence of HTE was determined by evaluating the significance of the interaction. We also examined the possibility of nonlinearities by estimating a separate conditional logistic regression model in which dummy variables were created for quartiles of the predicted log-odds and interactions were estimated between the dummy for vitamin D deficiency and the log-odds dummies.
Lasso penalized logistic regression: As noted above, the conventional approach to HTE estimation is the proportional interaction model (Kovalchik et al., 2013). Such a model can be estimated by including all potential prescriptive predictors, the target risk factor, and two-way interactions between the target risk factor and the potential prescriptive predictors. The existence of HTE is evaluated by testing the significance of the interactions. However, as such a model will almost certainly overfit the data (van Klaveren et al., 2019), it is usually wise to use some type of penalty to minimize risk of over-fitting. One way to do this is with Least Absolute Shrinkage and Selection Operator (Lasso) penalized regression. Lasso performs both variable selection and regularization by forcing the sum of the absolute standardized values of all regression coefficients in a model to be less than some fixed value associated to the regularization penalty, thereby forcing some coefficients to zero and using internal CV to determine the optimal penalty value (Tibshirani, 1996). This reduces risk of over-fitting. This is the second approach we investigated in our illustrative analysis. Specifically, we used lasso to select a small set of stable interactions based on 10F-CV in the training sample of our dataset to estimate a reduced conditional logistic model in the test sample. We then searched for HTE by evaluating the significance of the coefficients in this proportional interaction specification.
Causal forest: Although CV can be used to minimize the problem of over-fitting (Abadie et al., 2018), the lasso penalized regression approach, like many other approaches for estimating HTE, can be faulted because accuracy still requires correct specification of both the (possibly nonlinear) main effects and the (possibly complex nonlinear and higher order) interactions. However, other algorithms exist that estimate interactions directly and do not require correct specification of the main effects although they do require correct specification of the interaction terms (Pan & Zhao, 2021; Wang et al., 2018). The third approach we investigate in our illustrative analysis uses one of the most recently developed of these algorithms of this sort: causal forest (Wager & Athey, 2018). This algorithm is an extension of random forests, an algorithm that averages predicted outcome values over many classification trees, each based on a subsample of predictors, to correct for the problem of over-fitting in more conventional regression trees (Breiman, 2001). The causal forest algorithm uses the same logic, but rather than splitting to minimize prediction error in an outcome, it splits to maximize differences in the association between a target dichotomous risk factor and an outcome across subgroups. The output is a predicted slope defined as the individual's odds-ratio of the outcome in the presence versus absence of the target risk factor. Importantly, this estimate is independent of whether the individual does or does not have the target risk factor, as the estimate is based on counter-factual logic in which each person is implicitly assigned two estimated outcomes—one in the presence and the other in the absence of the target risk factor. The individual's estimated HTE is the difference between these two predicted values.
In addition to applying the causal forest algorithm to our training sample, we calculated SHapley Additive exPlanations (SHAP) values to estimate the relative importance of each predictor variable in the model in the 10F-CV training sample (Lundberg & Lee, 2017). SHAP values represent the marginal contribution to overall model accuracy of each variable in a predictor set. The causal forest model was then re-estimated with only the top five and then the top 25 predictors defined by SHAP values to evaluate the effects of over-fitting on model results. The 3 causal forest models estimated in the training sample were then used to generate 3 separate sets of individual-level predicted odds-ratios in the test sample. The existence of HTE under each of these specifications was determined by evaluating the significance of the interaction between the dummy variable for vitamin D deficiency and the predicted individual-level odds-ratios for reactivity to this deficiency.
Data management and estimation of the conditional logistic models were carried out in SAS version 9.4 (SAS Institute Inc., 2013). The lasso, super learner, and causal forest models were estimated in R version 3.6.3 (R Core Team, 2020). SHAP values were estimated in Python (Lundberg, 2018).
留言 (0)