Minimum sample size calculations for external validation of a clinical prediction model with a time‐to‐event outcome

1 INTRODUCTION

Clinical prediction models aim to inform the diagnosis and prognosis of individuals. They utilize multiple predictors in combination to predict an individual's outcome value (for continuous outcomes) or outcome event risk (for binary or time-to-event outcomes).1-3 Before such models are considered for use in clinical practice, it is important to evaluate their predictive accuracy in new data. This usually requires an external validation study (and typically more than one), where a prediction model is applied to new individuals that were not part of the model development dataset (and may even be from a different population), so that predicted outcomes can be compared with observed outcomes.

We recently published two articles in Statistics in Medicine that describe how to calculate the minimum sample size for external validation of a clinical prediction model with either a continuous outcome,4 or with a binary outcome.5 The sample size criteria are based on ensuring precise estimation of key measures of a model's predictive performance in external validation, including calibration and, for binary outcomes, discrimination and net benefit. Calibration refers to the agreement between observed and predicted values, and can be minimally summarized in the validation sample by the calibration-in-the-large (ie, mean observed compared to mean predicted value for continuous outcomes, or mean observed compared to mean predicted event risk for binary outcomes) and the calibration slope (ie, the slope of a linear or logistic regression model that includes the model's linear predictor as the only covariate). Discrimination refers to how well a model's linear predictor (ie, predicted risks of the outcome event) separates those with and without the outcome event, and is usually quantified by the C-index. Net benefit is a measure of clinical utility, and provides a weighted measure of the potential benefits relative to the harms of using a model at a defined clinical risk threshold.6

Many prediction models are developed using time-to-event (survival) models (such as QRISK37 and QCOVID8), and require external validation in datasets that will naturally contain censoring. As previous publications focused on continuous and binary outcomes,4, 5, 9, 10 we now address how to calculate the minimum sample size required for external validation of prediction models with a time-to-event outcome. We focus predominantly on the sample size required to precisely estimate the calibration slope, as our previous articles,4, 5, 9 and those of others,11, 12 have shown in applied examples and simulation studies that the calibration slope is usually the measure that requires the largest sample size to estimate precisely. However, we also show that other measures such as discrimination and net benefit can be examined in the proposed framework. The remainder of the article is outlined as follows. In Section 2 we explain how calibration can be examined at a particular time point using pseudo-observations. In Section 3 we propose a simulation-based framework for identifying the sample size that targets a precise estimate of calibration (in terms of the calibration slope and calibration curves) at a particular time point, which requires the user to specify the anticipated overall event risk at that time, the distribution of the prediction model's linear predictor, and the censoring and event time distributions. Section 4 provides an illustrative example aiming to ensure precise calibration across the entire spectrum of predicted risks. Section 5 then considers sample size in the context of potential clinical decision making, where particular regions of predicted risk may be more important to focus on. Section 6 extends to other issues including unknown linear predictor distributions, multiple time points, and competing risks, and Section 7 concludes with discussion.

2 ESTIMATING CALIBRATION AT A PARTICULAR TIME POINT

We assume that the developed prediction model enables the calculation of predicted risks over time for each individual in the validation dataset. For example, a regression-based time-to-event prediction model would provide all the regression coefficients and, crucially, the baseline survival (or baseline event) probability at the time points of interest for prediction. Then, for an individual $i$, their predicted risk $\hat{F}_i(t)$ of having the event by time $t$ can be calculated. For example, a model developed using proportional hazards regression will lead to risk predictions of the form,

$$\hat{F}_i(t) = 1 - \hat{S}_i(t) = 1 - \hat{S}_0(t)^{\exp\left(\hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i} + \cdots + \hat{\beta}_k X_{ki}\right)} = 1 - \hat{S}_0(t)^{\exp(\mathrm{LP}_i)} \qquad (1)$$

where $\hat{S}_i(t)$ is the survival probability by time $t$, and $\hat{S}_0(t)$ is the baseline survival probability by time $t$ (where "baseline" refers to individuals whose $\mathrm{LP}_i$ is zero). The model's linear predictor ($\mathrm{LP}_i$) is the $\hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i} + \cdots + \hat{\beta}_k X_{ki}$ component of the model, and is sometimes referred to as the prognostic index.13 As mentioned, we assume that $\hat{S}_0(t)$ (or similarly $\hat{F}_0(t)$) has been reported by the model developers for each time point of interest so that absolute risk predictions can be calculated. If only the linear predictor (eg, from a Cox model) is provided, it is not possible to make risk predictions and so the reported model is not fit for purpose.13
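As a minimal sketch in R of Equation (1), with hypothetical values for the reported baseline survival and an individual's linear predictor:

```r
# Predicted event risk by time t from a proportional hazards model (Equation (1)),
# given the reported baseline survival probability S0_t at time t and an
# individual's linear predictor LP; the values below are hypothetical.
predicted_risk <- function(LP, S0_t) 1 - S0_t^exp(LP)
predicted_risk(LP = 0.5, S0_t = 0.95)   # predicted risk of about 0.08 by time t
```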

For validation, we assume that there is a key time point for which accurate risk predictions are required. For example, QRISK3 focuses primarily on predicting the risk of cardiovascular disease by 10 years, and subsequent clinical decision making is based on these risks. Hence, a calibration assessment in the validation study should examine whether the predicted risks agree with the observed risks by this time point. Although calibration could be checked across all time points simultaneously (eg, to measure the calibration slope averaged across the whole follow-up period),14 this is not especially informative to potential users of the model such as clinicians and patients, as for decision and communication purposes the event risk by a particular time (eg, 10 years) is more important.

2.1 Situations with no censoring before the time horizon for prediction

If there is no censoring by the time point of interest, then the outcome event status (usually presence or absence of a certain health outcome) is truly known for all individuals in the validation dataset. Then, the calibration slope at time point $t$ can be estimated by fitting a calibration model in the form of the following logistic regression model,5

$$\operatorname{logit}(p_i) = \alpha + \beta \operatorname{logit}\left(\hat{F}_i(t)\right) \qquad (2)$$

where $p_i$ is the underlying risk of the event occurring by time $t$ for individual $i$ in the validation dataset, $\beta$ is the calibration slope, and $\hat{F}_i(t)$ is the predicted risk of the event by time $t$ as derived from the existing prediction model.
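For example, a minimal sketch in R, assuming a data frame d with the observed 0/1 event status by time t (event_t) and the model's predicted risks (F_hat); both column names are hypothetical:

```r
# Calibration slope via the logistic calibration model of Equation (2),
# applicable only when event status by time t is known for everyone.
logit <- function(p) log(p / (1 - p))
fit <- glm(event_t ~ logit(F_hat), family = binomial, data = d)
coef(fit)   # alpha (intercept) and beta (calibration slope)
```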

2.2 Situations with censoring before the time horizon for prediction

If there is censoring before the time point of interest in the validation dataset, then the true outcome event status is unknown for the censored individuals. This makes it problematic to directly examine the calibration of predicted risks with observed risks at the time point of interest. A common approach is to create risk groups (eg, 10 groups defined by tenths of predicted risk), and to plot the average predicted risk against the observed (1 − Kaplan-Meier) risk for each group. However, this is unsatisfactory, as the number of risk groups and the thresholds used to define them are arbitrary, and what we desire is a plot that examines calibration across the entire range of predicted risks (0-1).

To address this, the use of pseudo-observations (or pseudo-values) has been proposed,15-18 which are derived using the jackknife (ie, leave-one-out) estimator, and give pseudo-observed event probabilities for each individual accounting for non-informative right censoring. We denote these pseudo-observations by $\tilde{F}_i(t)$ and, provided that censoring is independent of covariates, they yield unbiased estimates of the true $F_i(t)$.16 In brief, the pseudo-observation for individual $i$ is calculated in the validation dataset as,

$$\tilde{F}_i(t) = n F_{KM}(t) - (n-1) F_{KM}^{(-i)}(t),$$

where $F_{KM}(t) = 1 - S_{KM}(t)$ is the Kaplan-Meier (or another nonparametric) estimate of the cumulative incidence at time $t$ using all $n$ individuals in the validation dataset, and $F_{KM}^{(-i)}(t)$ is the cumulative incidence estimate recalculated on the $n-1$ individuals after removing individual $i$. Note that the $\tilde{F}_i(t)$ are not constrained to fall within the range 0 to 1, but the key property is that $E[\tilde{F}_i(t)] = F_i(t)$.19 This allows the generation of calibration plots at time point $t$ (avoiding risk grouping) and the calculation of the corresponding calibration slope and calibration curves, as shown by Royston.17 In particular, a generalized linear model can be fitted with $\tilde{F}_i(t)$ as the outcome response, and the expected value $E[\tilde{F}_i(t)]$ modelled using a particular link function. Royston suggests using a complementary log-log link function,17 such that,

$$\ln\left[-\ln\left(1 - E\left[\tilde{F}_i(t)\right]\right)\right] = \alpha + \beta \ln\left[-\ln\left(1 - \hat{F}_i(t)\right)\right] \qquad (3)$$

where $\beta$ is the calibration slope, and $\hat{F}_i(t)$ is the predicted risk calculated from the existing prediction model for individual $i$ at time $t$. This is equivalent to modelling the log of the cumulative hazard function $H(t)$, as $-\ln(1 - \hat{F}_i(t)) = \hat{H}_i(t)$, and this scale is often used in survival modelling. Other link functions (eg, logit) could be chosen to model $E[\tilde{F}_i(t)]$, and we return to this issue in Section 5.

Model fitting is discussed by Andersen and Perme,16 who suggest the use of generalized estimating equations followed by a sandwich estimator for the variance of parameter estimates (which is needed to account for the correlation of the pseudo-observations themselves). This is implemented in various software modules, including stcoxcal for Stata,17 specifically developed to allow researchers to fit Equation (3) and estimate the calibration slope of time-to-event prediction models in new data. More generally, pseudo-observations can be derived directly (eg, using stpci in Stata,20 or using the pseudo or prodlim packages in R21) and then generalized linear models fitted (eg, using glm in Stata or geese in R), with robust standard errors calculated.
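To make this concrete, below is a minimal sketch in R of both stages: deriving the pseudo-observations by direct jackknife (slow for large n; prodlim::jackknife or the pseudo package mentioned above are faster) and then fitting Equation (3) via geese with a robust sandwich variance. The vectors time, status, and F_hat, and the time point t_star, are assumed inputs from the validation dataset.

```r
library(survival)
library(geepack)

# Kaplan-Meier estimate of the cumulative incidence F_KM(t) = 1 - S_KM(t)
km_risk <- function(time, status, t) {
  fit <- survfit(Surv(time, status) ~ 1)
  1 - summary(fit, times = t, extend = TRUE)$surv
}

# Jackknife pseudo-observations at time t (leave-one-out, as defined above)
pseudo_obs <- function(time, status, t) {
  n     <- length(time)
  F_all <- km_risk(time, status, t)
  sapply(seq_len(n), function(i) {
    n * F_all - (n - 1) * km_risk(time[-i], status[-i], t)
  })
}

# Fit the calibration model of Equation (3): cloglog link, robust (sandwich) SEs
d <- data.frame(pseudo = pseudo_obs(time, status, t_star),
                cll    = log(-log(1 - F_hat)),   # cloglog of predicted risks
                id     = seq_along(time))
fit <- geese(pseudo ~ cll, id = id, data = d, family = gaussian,
             mean.link = "cloglog", corstr = "independence", scale.fix = TRUE)
summary(fit)$mean   # row "cll" gives the calibration slope and its robust SE
```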

Equation (3) examines calibration assuming a linear relationship (on the complementary log-log scale), but the actual relationship between observed and predicted risks may be non-linear. Hence, it is also important to estimate a flexible calibration curve with a 95% confidence interval,12, 22 for example by fitting a spline or fractional polynomial function and using robust standard errors (or bootstrapping) to derive the confidence interval. Alternatively, the stcoxcal Stata package by Royston (and calPlot in the pec package in R) uses a nearest neighbor smoothed running line to produce the calibration curve and its confidence interval,17, 23 and we use this approach in our article.
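As one possibility, the sketch below fits a natural cubic spline (rather than the nearest neighbor smoother used by stcoxcal and calPlot) on the cloglog scale, reusing the data frame d and geese set-up from the previous sketch; the 95% bands use the robust variance matrix.

```r
library(splines)

# Spline basis for the cloglog-transformed predicted risks
basis <- ns(d$cll, df = 3)
d_spl <- cbind(d, setNames(as.data.frame(basis), c("b1", "b2", "b3")))
fit_c <- geese(pseudo ~ b1 + b2 + b3, id = id, data = d_spl, family = gaussian,
               mean.link = "cloglog", corstr = "independence", scale.fix = TRUE)

# Evaluate the fitted calibration curve over a grid of predicted risks
grid_F <- seq(0.01, 0.99, by = 0.01)
X      <- cbind(1, predict(basis, newx = log(-log(1 - grid_F))))
eta    <- drop(X %*% fit_c$beta)
se_eta <- sqrt(rowSums((X %*% fit_c$vbeta) * X))   # robust pointwise SEs

plot(grid_F, 1 - exp(-exp(eta)), type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted risk", ylab = "Observed (pseudo-value based) risk")
lines(grid_F, 1 - exp(-exp(eta - 1.96 * se_eta)), lty = 2)   # lower 95% band
lines(grid_F, 1 - exp(-exp(eta + 1.96 * se_eta)), lty = 2)   # upper 95% band
abline(0, 1, col = "grey")                                   # ideal calibration
```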

3 CALCULATING THE SAMPLE SIZE TO ESTIMATE THE CALIBRATION SLOPE PRECISELY

We now consider how to calculate the sample size for a cohort study aiming to precisely estimate the calibration slope, assuming that the validation dataset will contain observations censored before the time point of interest for prediction. Our approach requires the user to provide some information and to make some assumptions, as in any sample size calculation.

3.1 Specify the anticipated linear predictor distribution

Fundamentally, the model's anticipated linear predictor (LPi) distribution in the validation population must be specified. For example, if the validation population is anticipated to be similar to that used for model development, then it makes sense to assume the linear predictor distribution for validation will be similar in shape to that observed for development. Sometimes this distribution is presented in the model development article as a histogram (either for all individuals, or for each of the events and non-events groups separately), or perhaps the developers report it was approximately normally distributed and give the mean and SD. If information about the linear predictor distribution is unavailable, then the model development team could be asked directly to provide details. We illustrate this in our example later.

If Royston's D statistic is presented,24, 25 then the SD (σ) of the linear predictor distribution can be estimated by,

$$\hat{\sigma} = D \Big/ \sqrt{8/\pi} \qquad (4)$$

assuming the linear predictor is normally distributed, which may be a strong assumption. The mean of this normal distribution can simply be set to zero, as without loss of generality the linear predictor can be assumed to have been centered at its original mean. The only impact of this is on the magnitude of the baseline hazard used when simulating survival times (see below).

Jinks et al26 propose the following equation, based on empirical evidence, for approximating Royston's D when only Harrell's C-index is reported2, 27:

$$D = 5.50(C - 0.5) + 10.26(C - 0.5)^3 \qquad (5)$$

Thus, if only Harrell's C-index is reported, we can use Equation (5) to approximate Royston's D statistic, and then apply Equation (4) to estimate the SD of the linear predictor distribution, assuming it is normally distributed. When using the C-index or Royston's D statistic reported from a model development study, ideally their values should have been adjusted for optimism due to any overfitting. For example, this could be the C-index after optimism-adjustment based on results from Harrell's bootstrapping approach2; after a penalized regression approach has been used; or based on the C-index estimated in any independent validation (test) datasets (provided the data used for independent validation are reflective of the target population in whom the model is to be further validated). Note that there are many other types of discrimination measures for survival data, such as Uno's C, which (unlike Harrell's C-index) does not ignore censored observations,28 and further work is needed on how to use them to approximate Royston's D and the distribution of the linear predictor.
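For example, a short sketch with a hypothetical reported C-index of 0.70:

```r
# From a reported (optimism-adjusted) Harrell's C-index to an assumed SD of a
# normally distributed linear predictor, via Equations (5) and then (4).
C_index  <- 0.70                                                  # hypothetical
D        <- 5.50 * (C_index - 0.5) + 10.26 * (C_index - 0.5)^3    # Equation (5)
sigma_LP <- D / sqrt(8 / pi)                                      # Equation (4)
sigma_LP   # about 0.74
```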

If no information is available for informing the distribution of the linear predictor, and also when the validation population is expected to be very different to that used for development, then a pilot study could be used to observe the distribution more closely.5

3.2 Specify the anticipated distribution of survival and censoring times

The assumed distribution of survival times must also be specified for the validation population. To make the approach practical, here we assume survival times follow an exponential distribution, with the baseline rate parameter (λ) chosen to ensure F(t) (and thus S(t)) in the validation population is as anticipated at the time point of interest for prediction. Crucially, λ needs to be chosen conditional on the effect of the linear predictor. For example, we suggest using trial and error (or a simple one-parameter non-linear optimization) to identify the value of λ that gives the desired F(t) at the time point of interest, based on a large simulated dataset of survival times drawn from an exponential distribution with baseline rate parameter λ and a single covariate LPi whose coefficient is 1 (ie, assuming risks from the existing prediction model will have a perfect calibration slope across all time points). A root-finding sketch is given below.
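The sketch below replaces trial and error with a one-parameter root-finding step, solving E[1 − exp(−λ exp(LPi) t)] = F(t) over a large set of draws from the assumed linear predictor distribution; all numerical values are illustrative.

```r
# Find the exponential baseline rate lambda giving the anticipated F(t_star),
# assuming a calibration slope of 1 (coefficient of 1 on LP).
set.seed(101)
LP       <- rnorm(1e6, mean = 0, sd = 0.74)  # assumed (illustrative) LP distribution
t_star   <- 10                               # time point of interest
F_target <- 0.10                             # anticipated F(t_star)
lambda <- uniroot(function(l) mean(1 - exp(-l * exp(LP) * t_star)) - F_target,
                  interval = c(1e-10, 1))$root
```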

The exponential distribution assumes a constant hazard over time and so will, in most situations, be an overly simple approximation of the truth. More complex survival functions could be assumed if more information is available (eg, if the model development data were available), although we anticipate the exponential distribution will be a pragmatic compromise for most users. The impact of different survival distributions is considered in Section 5.

The censoring distribution must also be specified. For example, an exponential distribution could be assumed with censoring rate matching that reported for the development dataset or in similar studies in the same field, to ensure the censoring proportion matches that anticipated by the time point of interest for prediction. Our example illustrates this later. However, it is also important for the censoring distribution to reflect how and when participants will enter the cohort study. Most censoring is administrative (eg, end of study follow-up), and so a study that recruits everyone at the study onset is likely to have a different censoring pattern than a study that recruits individuals gradually over time or at specific intervals. We examine the impact of censoring distributions in Section 5.
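For an exponential censoring distribution, the rate follows directly from the anticipated censoring proportion; a two-line sketch with illustrative numbers:

```r
# P(censored before t) = 1 - exp(-gamma * t), so invert for the rate gamma
cens_rate <- function(p_cens, t_star) -log(1 - p_cens) / t_star
cens_rate(0.30, 10)   # rate of about 0.036 for 30% censored by 10 years
```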

3.3 Specify the target SE for the calibration slope

Finally, the target SE for the calibration slope must be specified. This choice is subjective, and potentially context specific, as what constitutes "precise" depends on the magnitude and range of risk predictions arising from the model, which are defined by the outcome event rate and the linear predictor distribution. It may also depend on the potential clinical implementation of the model; in particular, it may be more important to have precise calibration in some ranges of the calibration plot (eg, predicted risks between 0.05 and 0.20) where clinical decision thresholds are likely to lie, compared to other ranges (eg, predicted risks between 0.5 and 1.0) where miscalibration is less of a concern. As in our previous articles,4, 5, 9 we recommend starting with a target SE of 0.051, which corresponds to a narrow 95% confidence interval for the calibration slope of 0.9 to 1.1. The aim is to identify the sample size that is expected to (ie, will on average) give this target SE (and thus target confidence interval width), and we describe how to do this below.
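For reference, this target follows directly from the desired 95% confidence interval width:

```r
# A 95% CI has width 2 * 1.96 * SE, so for a target width of 0.2:
0.2 / (2 * 1.96)   # = 0.051
```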

3.4 Use a simulation-based approach to estimate the required sample size

After the set-up phase, further steps are required using a simulation-based framework,29, 30 which proceeds iteratively until the sample size is identified that meets the target SE value on average. The steps are outlined in Figure 1. The process starts by assuming the model will be well calibrated in the validation dataset, which is a pragmatic place to start. This could be changed, for example if the calibration slope is anticipated to be less than 1 upon validation due to suspected overfitting during model development; however, this usually leads to lower sample sizes and therefore assuming a slope of 1 is more conservative.4, 5, 9 The process focuses on precise estimation of calibration slope, but precision of other performance measures should also be checked, such as the calibration-in-the-large, the C-index, and net benefit. If the sample size is adequate for precisely estimating the calibration slope, then usually it will be adequate for these other measures too.5

FIGURE 1 Suggested process to identify the sample size required to precisely estimate the calibration slope (and subsequent calibration curves) at a key time point when externally validating the performance of a prediction model with a time-to-event outcome

Once this sample size has been identified, we recommend producing calibration plots for 10-20 of the already simulated datasets of this sample size, including a flexible calibration curve with a 95% confidence interval for each,12, 22 as described in Section 2. The calibration curve and its confidence interval on these plots can then be inspected, to ascertain whether the precision of the curve appears generally acceptable, or whether a smaller or larger sample size is actually needed. Similarly, the empirical distribution of calibration curves could be displayed by presenting the entire set of simulated calibration curves on the same graph (without confidence intervals). This helps to reveal the potential range of calibration curves that might be observed in practice, and if variability is too high then a larger sample size is needed. Our applied example will illustrate these ideas.
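To illustrate the core of the loop, below is a minimal sketch of one simulation iteration (steps 2 to 8 of Figure 1) under the assumptions described above; it reuses pseudo_obs() and the geese() calibration fit from Section 2, and is a sketch of the general approach rather than the exact supplementary code. The mean SE over many replicates at each candidate n is compared against the target.

```r
sim_slope_se <- function(n, lambda, gamma, t_star, draw_lp) {
  LP      <- draw_lp(n)                            # assumed LP distribution
  t_event <- rexp(n, rate = lambda * exp(LP))      # calibration slope of 1
  t_cens  <- pmin(rexp(n, rate = gamma), t_star)   # administrative censoring at t_star
  time    <- pmin(t_event, t_cens)
  status  <- as.numeric(t_event <= t_cens)         # 1 = event observed
  F_hat   <- 1 - exp(-lambda * exp(LP) * t_star)   # model's predicted risks
  d <- data.frame(pseudo = pseudo_obs(time, status, t_star),
                  cll    = log(-log(1 - F_hat)),
                  id     = seq_len(n))
  fit <- geese(pseudo ~ cll, id = id, data = d, family = gaussian,
               mean.link = "cloglog", corstr = "independence", scale.fix = TRUE)
  summary(fit)$mean["cll", "san.se"]               # SE of the calibration slope
}

# eg, the expected SE at a candidate sample size (values illustrative):
# mean(replicate(100, sim_slope_se(n = 5000, lambda = 5e-4, gamma = 0.426,
#                                  t_star = 3,
#                                  draw_lp = function(n) rnorm(n, 4.60, 0.699))))
```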

4 APPLIED EXAMPLE

Ensor et al developed a prognostic time-to-event model to predict the risk of a recurrent venous thromboembolism (VTE) following cessation of therapy for a first VTE.31 The model included predictors of age, gender, site of first clot, D-dimer level, and the lag time from cessation of therapy until measurement of D-dimer (often around 30 days). These predictors corresponded to six parameters in the model, which was developed using the flexible parametric survival modeling framework of Royston and Parmar.32, 33

A key time point for prediction was 3 years, and the final model equation for calculating predicted risks was given as,

$$\hat{F}_i(3) = 1 - \hat{S}_0(3)^{\exp(\mathrm{LP}_i)}$$

where $\hat{S}_0(3) = 0.9983$, and $\mathrm{LP}_i$ = [(−0.0105 × Age) + (0.545 × Male) + (1.735 × Site: Proximal DVT) + (1.756 × Site: PE) + (0.701 × ln[D-dimer]) + (−0.291 × ln[lag time])].

At the design stage of a validation study, we must now determine the minimum sample size needed to precisely estimate the calibration slope of this model when used to predict risk at 3 years, in order to precisely examine calibration across the whole range of predicted risks. Assuming the validation population will be similar to the development population, we used the process described in Figure 1 to calculate the required sample size. Stata code to implement the approach for this example is provided in the Supplementary Material (and R code available at https://www.github.com/gscollins1973/). The process and findings are now summarized.

4.1 Step 1: Set-up process

- Specify the time point of interest for checking calibration performance: The key time point of interest for validation was 3 years.
- Specify the model's anticipated linear predictor (LPi) distribution in the validation population: The distribution of LPi values was not described in the model development article; however, upon request the authors provided a histogram. This is displayed in Figure 2, and it could be closely approximated by assuming LPi follows a skew normal distribution with a mean of 4.60, a variance of 0.65, a skewness parameter of −0.5, and a kurtosis parameter of 5.
- Specify the assumed F(t) (or S(t)) in the validation population at the time point of interest: The model development article reports a Kaplan-Meier survival curve in the development sample, in which about 17% of the population had a VTE recurrence by 3 years, and thus F(3) = 0.17 and S(3) = 0.83. Hence, we assumed these values would be the same in the validation population.
- Specify the assumed distribution of survival times in the validation population, conditional on the effect of LPi: We assumed the existing model is well calibrated, and identified (in a large simulated dataset) that an exponential distribution with a baseline rate parameter of about 0.00050 ensures F(3) is 0.17 when the log hazard ratio for the effect of LPi is 1 (ie, the calibration slope is 1). Hence, survival times were drawn from this distribution.
- Specify the assumed distribution of censoring times in the validation population, and the maximum follow-up time: The model developers told us that the censoring rate was high, with about 72% of individuals censored before 3 years. To mirror this, and assuming a constant rate of censoring, we drew censoring times from an exponential distribution with a rate parameter of 0.426 (as this gives a probability of being censored by 3 years of about 0.72), and classed an individual as censored if their generated censoring time was less than their survival time. We also assumed the maximum follow-up time would be 3 years, and so all individuals whose generated survival and censoring times were both after 3 years were censored at 3 years.
- Specify the target value for the SE of the calibration slope: We initially chose a SE of 0.051, to target a confidence interval width of 0.2 (ie, 0.9 to 1.1 assuming the slope is 1) for the calibration slope as measured on the complementary log-log scale of the calibration model of Equation (3). We emphasize this SE is a starting point, and may need to be refined subsequent to observing calibration plots and curves in simulated data, as described below.

FIGURE 2 Comparison of the linear predictor distribution from the model development dataset (as supplied by the model developers), an assumed skew normal distribution (mean 4.60, variance 0.65, skewness −0.5, kurtosis 5), and an assumed N(4.60, 0.699²) distribution
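As a sketch of how these step 1 values can be recovered, using the simpler N(4.60, 0.699²) approximation to the linear predictor from Figure 2 and the root-finding approach of Section 3.2:

```r
set.seed(2021)
LP <- rnorm(1e6, mean = 4.60, sd = 0.699)   # approximated LP distribution

# Baseline exponential rate giving F(3) = 0.17 under a calibration slope of 1
lambda <- uniroot(function(l) mean(1 - exp(-l * exp(LP) * 3)) - 0.17,
                  interval = c(1e-8, 1))$root   # approximately 0.00050

# Exponential censoring rate giving about 72% censored before 3 years
gamma <- -log(1 - 0.72) / 3   # about 0.42, close to the 0.426 used above
```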

4.2 Steps 2 to 9: Simulation-based approach to identify the required sample size

Under the assumptions from step 1, the iterative process from steps 2 to 9 identified that about 14250 individuals (with about 1430 observed events by 3 years) are required to estimate the calibration slope with a target SE of 0.051. To better appreciate this level of precision, we plotted the calibration curve and its 95% confidence interval for the simulated datasets of this sample size. Figure 3A provides a representative example, which shows precise calibration across the whole spectrum of predicted risks, especially in the range of risks between 0 and 0.5. This is also reflected by the spread of calibration curves from all the simulated datasets with a sample size of 14250, as shown in Figure 4A for a random sample of 500 of the curves. There is little variation in the range of 0 to 0.5, but it gradually increases as the predicted risk moves closer to 1, because the number of participants with predicted risks in that upper region is small.
