Statistical Approaches for Establishing Appropriate Immunogenicity Assay Cut Points: Impact of Sample Distribution, Sample Size, and Outlier Removal

A review was conducted of data from CP experiments for 16 clinical ADA assays validated at Regeneron’s bioanalytical laboratory in accordance with current regulatory guidance and/or recommendations (Table I) (9,10,11). All methods were bridging electrochemiluminescent (ECL) assays on the Meso Scale Discovery (MSD) platform. Fifteen of the 16 molecules examined were mAb therapeutics of differing isotypes (IgG1 or hinge-stabilized IgG4), and one was a receptor-Fc biologic. The therapeutic proteins were specific for a variety of targets, including soluble or membrane-bound and endogenous or exogenous molecules.

The datasets were generated using drug-naïve samples from either disease-state or healthy individuals, drawn from either baseline clinical study samples or samples obtained from commercial vendors. For all datasets, the assay signal for each sample was normalized to the negative control (NC) response (signal-to-noise, \(S/N\)) on each plate by dividing the Mean Counts for each individual serum sample by the NC Mean Counts from the corresponding plate.
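The per-plate normalization described above can be sketched as follows; this is a minimal illustration in which all sample identifiers, plate numbers, and counts are made up:

```python
import pandas as pd

# Hypothetical plate readout: mean ECL counts per serum sample, plus the
# negative-control (NC) Mean Counts for the plate each sample was run on.
# All identifiers and numbers here are invented for illustration.
data = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "plate": [1, 1, 2, 2],
    "mean_counts": [150.0, 320.0, 140.0, 600.0],
})
nc_mean_counts = {1: 100.0, 2: 120.0}  # NC Mean Counts per plate

# Normalize each sample to its own plate's NC response: S/N = sample / NC.
data["s_over_n"] = data["mean_counts"] / data["plate"].map(nc_mean_counts)
print(data["s_over_n"].round(2).tolist())
```

Normalizing against the NC on the same plate removes plate-to-plate signal drift before any CP statistics are computed.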

The survey of datasets was performed to evaluate the contribution of different sources of variability and to compare the data distributions to better inform decisions about CP determination. Subsequently, these datasets were used to compare CPs calculated with SAS® software using linear mixed-effects ANOVA models against CPs determined in-house using a “boxplot” approach in two widely available software applications, Excel and JMP. The different calculation methods had minimal impact on the screening and confirmation CPs (Table II).

Table II Comparison of Cut Point Determination Approaches

Biological Variability Within Validation Study Sample Sets

Variation in sample responses can be attributed to a combination of analytical and biological factors. To investigate the sources of screening assay response variability (i.e., log(\(S/N\))) in the CP experiments for the 16 validation studies, a linear mixed-effects ANOVA model was used (Table I). On average, almost 90% of the variation was explained by the sample (biological) random effect. For one dataset, 97% of the variability in assay response was due to biological factors, and for all evaluated datasets, at least 78% of the variability was due to individual samples (for Studies E and F, each sample was run only once; therefore, any biological variation would be captured in the residual effect). This indicates that sample-to-sample variation, or biological variability, explains a large majority of the overall log(\(S/N\)) differences observed and is therefore the key component in CP determination. For the 16 datasets, the contribution from analyst, assay run, and residual factors (which capture measurement-to-measurement, or analytical, variability) explains on average approximately 10% of the variation observed. A similar trend was observed in the confirmation assay, where biological variability was the largest contributor to %Inhibition variation for 12 out of 14 studies (Supplemental Table I).
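The authors fit the variance components in SAS®; the same decomposition can be illustrated for a balanced design with the classical one-way random-effects ANOVA estimators. This sketch uses simulated data, and the group counts, replicate counts, and standard deviations are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k, r = 50, 4  # 50 drug-naive samples, 4 replicate log(S/N) measurements each

# Simulated log(S/N): large sample-to-sample (biological) spread and small
# measurement-to-measurement (analytical) noise; both SDs are made up.
sample_effect = rng.normal(0.0, 1.0, size=k)                     # biological, SD 1.0
y = sample_effect[:, None] + rng.normal(0.0, 0.3, size=(k, r))   # analytical, SD 0.3

# Classical balanced one-way ANOVA variance-component estimators.
group_means = y.mean(axis=1)
msb = r * ((group_means - y.mean()) ** 2).sum() / (k - 1)      # between-sample MS
msw = ((y - group_means[:, None]) ** 2).sum() / (k * (r - 1))  # within-sample MS
var_between = max((msb - msw) / r, 0.0)  # biological variance component
var_within = msw                         # analytical (residual) component
frac_biological = var_between / (var_between + var_within)
print(f"biological share of variance: {frac_biological:.0%}")
```

With the biological SD set much larger than the analytical SD, the sample random effect dominates, mirroring the ~90% biological share reported for the 16 studies.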

Distribution of the Datasets: Non-Normality and Right Skewness

For 15 of the 16 sample populations, the normality assumption was not confirmed (\(p < 0.05\)) by the Shapiro–Wilk test (Table I). This is consistent with published data indicating CP datasets are typically non-normally distributed (18). Another measure of the relative symmetry of a distribution is skewness. All 16 datasets had a positive (right) skew. Eight of the 16 datasets had a skewness coefficient greater than or equal to 1.0, which can be considered highly skewed, while only 3 had a coefficient less than 0.5, considered low skewness (19). Together, these metrics indicate that a large majority of the screening datasets were non-normal and/or asymmetric. Similar results were observed for the distribution of the confirmation datasets, indicating these datasets were also typically non-normal and/or asymmetric (Supplemental Table I).
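Both diagnostics are readily available in scipy; the sketch below applies them to simulated right-skewed data standing in for a CP dataset (the distribution parameters are arbitrary):

```python
import numpy as np
from scipy import stats

# Simulated right-skewed S/N values standing in for a CP dataset.
rng = np.random.default_rng(1)
s_over_n = rng.lognormal(mean=0.1, sigma=0.4, size=200)

# Shapiro-Wilk test: p < 0.05 rejects the normality assumption.
w_stat, p_value = stats.shapiro(s_over_n)

# Skewness coefficient: >= 1.0 is often called highly skewed, < 0.5 low.
skewness = stats.skew(s_over_n)
print(f"Shapiro-Wilk p = {p_value:.2g}, skewness = {skewness:.2f}")
```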

Impact of IQR Factor and Skewness on Outlier Determination, Cut Point, and Sample Positivity

The interquartile range (IQR) is the spread of the middle 50% of the data (25th–75th percentile or first to third quartile). A common approach to outlier determination is to set a “fence” using an IQR multiple (factor) to identify “outside values” (20). However, the approach suggested by Tukey (1977) of applying a 1.5-fold factor to classify “outside values” assumes an approximately symmetrical distribution, which is not the case for skewed CP datasets.
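A Tukey-style IQR fence can be sketched as follows. This is illustrative only: the data are simulated, and JMP's quartile convention may differ slightly from numpy's default percentile method:

```python
import numpy as np

def iqr_outlier_mask(values, factor=1.5):
    """Flag values outside the Tukey fences [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)

# Simulated skewed log(S/N) data: a larger factor flags fewer "outside values".
rng = np.random.default_rng(2)
log_sn = rng.lognormal(sigma=0.3, size=200)
for factor in (1.5, 2.0, 3.0):
    n_out = int(iqr_outlier_mask(log_sn, factor).sum())
    print(f"IQR factor {factor}: {n_out} outliers")
```

Because the fences sit symmetrically around the quartiles, a right-skewed dataset concentrates its flagged points in the upper tail, which is what makes the 1.5-fold factor aggressive for CP data.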

To evaluate the impact of the IQR factor on screening positivity, the percentage of outliers removed, and the screening CP factor, screening data from the 16 datasets were analyzed using the JMP boxplot method with IQR factors ranging from 1.5 to 4.0 (Fig. 1). As the IQR factor increased, the percentage of naïve or baseline samples that would screen as positive decreased, in most cases reaching a plateau at an IQR factor of 3.0 (Fig. 1a). Outlier-inclusive screening positivity ranged from 6.0 to 14.0% across studies using an IQR factor of 1.5, versus 4.7 to 8.8% using an IQR factor of 3.0.

Fig. 1

Effect of IQR factor used for outlier removal on (a) observations screening positive (outlier-inclusive), (b) percentage of screening outliers removed, and (c) screening cut point factor (SCP). All studies used the non-parametric cut point except for Studies I and J, which used the parametric cut point

The plateau in screening positivity at IQR factors ≥ 3.0 was related to changes in the number of outliers removed from the datasets. At an IQR factor of 1.5, two datasets (Studies F and K) had more than 8% of the data points removed as outliers, and nine datasets (Studies A, B, D, F, G, K, M, O, and P) had more than 4% of data points removed (Fig. 1b). However, in each case, an IQR factor of 3.0 resulted in less than half the number of outliers being identified compared to an IQR factor of 1.5. Importantly, the total percentage of samples screening positive is approximately the percentage of outliers removed plus the target 5% false-positive rate specified in regulatory guidance and white papers. Consequently, those datasets with a large percentage of outliers removed (Fig. 1b) also had an outlier-inclusive screening positivity substantially greater than the recommended 5% rate.

Figure 2 shows similar data for the impact of the IQR factor on the confirmation positivity rate, outlier removal, and CP. A high percentage of outlier removal has an even greater effect on outlier-inclusive positivity for the confirmation CP, which targets an FPER of 1% (Fig. 2). An IQR factor of 1.5 resulted in 5 of the 16 studies having an outlier-inclusive positivity greater than 5%, including one dataset with more than 10% of samples confirmed positive. As in the screening assay, an IQR factor of 3.0 substantially reduced the number of outliers identified.

Fig. 2

Effect of IQR factor used for outlier removal on (a) observations confirming positive (outlier-inclusive), (b) percentage of confirmation outliers removed, and (c) confirmation cut point factor (CCP). All studies used the non-parametric cut point except for Studies I and J, which used the parametric cut point

The screening and confirmation distributions range in skewness from relatively low to highly right-skewed (Table I and Supplemental Table I). The positivity rate obtained with parametric CP values increased with increasing skewness (for both screening and confirmation assays; Fig. 3). However, when using non-parametric CP estimates, the screening positivity rate was close to 5% regardless of the level of skewness. Similar results were obtained for the confirmation assay, with a positivity rate close to 1% for non-parametric CP estimates. Studies I and J were both relatively symmetric and allowed for the use of the parametric CP estimate. However, in both cases, the non-parametric CPs (SCP 1.40 and 1.41, CCP 52.0% and 41.0%, respectively, for I and J) were similar to the parametric CPs (SCP 1.30 and 1.41, CCP 47.0% and 40.2%, respectively). This suggests that although parametric CP estimations are suitable for distributions with low skewness, non-parametric estimates may be applicable across a wide range of skewness.
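The contrast can be illustrated with two commonly used formulas, assuming a parametric SCP of mean + 1.645 SD (which targets a 5% FPER only under normality) and a non-parametric SCP equal to the empirical 95th percentile. The data below are simulated and right-skewed, not the study datasets:

```python
import numpy as np

# Simulated right-skewed screening data (not the study datasets).
rng = np.random.default_rng(3)
log_sn = rng.lognormal(sigma=0.5, size=5000)

# Assumed formulas: parametric CP targeting 5% FPER under normality,
# non-parametric CP as the empirical 95th percentile.
parametric_cp = log_sn.mean() + 1.645 * log_sn.std(ddof=1)
nonparametric_cp = np.percentile(log_sn, 95)

param_pos = (log_sn > parametric_cp).mean()        # inflated by right skew
nonparam_pos = (log_sn > nonparametric_cp).mean()  # ~5% by construction
print(f"parametric: CP {parametric_cp:.2f}, positivity {param_pos:.1%}")
print(f"non-parametric: CP {nonparametric_cp:.2f}, positivity {nonparam_pos:.1%}")
```

On symmetric data the two CPs nearly coincide (as seen for Studies I and J), but on skewed data the normal-theory formula undershoots the upper tail and the parametric positivity rises above the 5% target.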

Fig. 3

Relationship between validation sample population skewness and naïve sample positivity for the (a) screening anti-drug antibody assay and (b) confirmation anti-drug antibody assay for 16 clinical validation studies. SCP: screening cut point value, CCP: confirmation cut point value

Clinical Impact of the Number of Baseline Samples Used to Estimate Population-Specific Cut Point Factors

Observing that biological factors explain the majority of variation in assay response, we wanted to evaluate the impact of using larger numbers of individuals in an attempt to capture a true representation of the diversity of the study population. To do this, we investigated the impact of sample size using in-study baseline assay responses from two large phase 3 studies (\(N>3000\)) for two different molecules. Similar to the 16 CP validation datasets, the distribution of log(\(S/N\)) of the baseline samples for both studies exhibited high skewness (5.17 and 2.21, excluding samples that confirmed as drug-specific). For each study, random baseline selections of different sizes (50, 100, 150, and 300 subjects) were sampled 100 times with replacement after each subset (subjects were eligible to be selected once per sampling but were returned to the dataset between samplings). For example, log(\(S/N\)) values from 50 subjects were randomly selected from the clinical assay dataset and used to generate an estimated cut point; this procedure was repeated 100 times per sample size for the entire dataset (see Supplemental Fig. 1). The JMP boxplot method was used to exclude outliers (IQR × 3.0) and calculate the non-parametric CP factor for each of the 100 samplings. A boxplot representation of the distributions of CPs for the 100 samplings at each sample size is shown in Fig. 4. In addition, the population CP value calculated for the entire baseline dataset, both with and without outlier removal, is indicated by the dashed horizontal reference lines in Fig. 4. As expected, for both studies, higher sample sizes were associated with a tighter range of potential calculated CPs, centered around the 95th percentile of all baselines excluding outliers. For example, increasing the sample size from 50 to 150 reduced the standard deviation of calculated CPs by 45% for Clinical Study 1 (0.40 vs. 0.22) and by 56% for Clinical Study 2 (0.43 vs. 0.19).
In addition, the median calculated CP moved closer to the overall population 95th percentile (excluding outliers) as sample size increased. However, increasing the sample size from 150 to 300 reduced the standard deviation by only a further 32% for both Clinical Study 1 (0.22 vs. 0.15) and Clinical Study 2 (0.19 vs. 0.13), suggesting diminishing returns as sample size increases beyond 150 individuals.
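The resampling procedure can be sketched as follows. An illustrative simulated distribution replaces the actual clinical baseline data, and the outlier step approximates the JMP boxplot method with numpy percentiles:

```python
import numpy as np

# Stand-in for a large baseline log(S/N) dataset (the real data are not shown).
rng = np.random.default_rng(4)
baseline = rng.lognormal(sigma=0.4, size=3500)

def nonparametric_cp(values, iqr_factor=3.0):
    """Drop IQR-fence outliers, then return the empirical 95th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    keep = (values >= q1 - iqr_factor * iqr) & (values <= q3 + iqr_factor * iqr)
    return np.percentile(values[keep], 95)

# 100 samplings per size; replace=False within a sampling mirrors "subjects
# eligible once per sampling", with all subjects returned between samplings.
cp_sd = {}
for n in (50, 100, 150, 300):
    cps = [nonparametric_cp(rng.choice(baseline, size=n, replace=False))
           for _ in range(100)]
    cp_sd[n] = float(np.std(cps))
print({n: round(sd, 3) for n, sd in cp_sd.items()})
```

The spread of the 100 estimated CPs shrinks roughly as \(1/\sqrt{n}\), which is the pattern behind the tightening boxplots in Fig. 4.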

Fig. 4

Larger sample sizes randomly selected from baseline datasets were associated with tighter distributions of possible CPFs. Two phase 3 studies (Clinical Studies 1 and 2) for different molecules were selected where the number of baseline sample screening assay results was large (\(N>3000\)). Subsets of baseline samples for each study were randomly selected, with replacement between subsets, 100 times with sample sizes of 50, 100, 150, and 300. Outlier exclusion was performed for each random sample using the boxplot method with an IQR factor of 3.0. The distribution of non-parametric cut point factors (empirical 95th percentile) is shown as box-and-whiskers, where the whiskers denote the 5th and 95th percentile of observed CPFs (outliers not shown). The 95th percentile of the entire baseline set for each study is shown as dashed reference lines, before and after outlier removal with an IQR factor of 3.0

To further assess the clinical impact of SCPs set with different sample sizes, we investigated the effect of cut points generated with 50 or 150 individuals on in-study baseline samples. The range of different SCPs shown in Fig. 4 (from sample sizes of 50 and 150) was applied to the \(S/N\) results for the predose clinical samples for each study, excluding samples identified as having preexisting ADA using the validated assays, and the resulting FPER in the screening assay was calculated for each SCP. Therefore, a set of 100 FPERs was determined for the corresponding set of 100 SCPs for each sample size of 50 and 150.
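A sketch of this evaluation, using a simulated baseline distribution in place of the clinical data and the empirical 95th percentile as the SCP for each sampling:

```python
import numpy as np

# Simulated baseline S/N distribution standing in for the clinical data.
rng = np.random.default_rng(5)
baseline_sn = rng.lognormal(sigma=0.4, size=3000)

def screening_fper(scp):
    """False-positive rate: fraction of baseline S/N values above the SCP."""
    return (baseline_sn > scp).mean()

# For each sample size, set 100 SCPs (each the empirical 95th percentile of a
# random subset) and count FPERs outside the 2-11% acceptance benchmark.
fpers = {}
for n in (50, 150):
    scps = [np.percentile(rng.choice(baseline_sn, n, replace=False), 95)
            for _ in range(100)]
    fpers[n] = np.array([screening_fper(cp) for cp in scps])
    outside = int(((fpers[n] < 0.02) | (fpers[n] > 0.11)).sum())
    print(f"sample size {n}: {outside}/100 FPERs outside 2-11%")
```

Because each SCP is noisier at a sample size of 50, the resulting FPERs scatter more widely and more of them miss the acceptance window, consistent with the trend in Table III.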

A screening FPER in the range of 2% to 11% is generally considered acceptable (7). For Clinical Study 1, SCPs set using a sample size of 50 subjects resulted in 36 out of 100 FPERs falling outside of the 2–11% benchmark. In contrast, using a larger sample size of 150 resulted in only 12 out of 100 FPERs falling outside of the accepted range (Table III). A similar trend was observed for Clinical Study 2: when the sample size was 150, only 3 out of 100 FPERs were outside the 2–11% range, versus 10 out of 100 FPERs when SCPs were set with a sample size of 50.

Table III Effect of Sample Size on Baseline FPER

To understand the clinical impact of outlier removal, we evaluated the effect of varying the IQR factor on baseline sample positivity using these same studies. To do this, we estimated non-parametric SCPs using the entire baseline datasets after excluding outliers with an IQR factor of either 1.5 or 3.0 (Table IV). These SCPs were applied back to the baseline \(S/N\) dataset to estimate an overall FPER. As shown in Table IV, an IQR factor of 3.0 results in an FPER closer to the target 5%, while an IQR factor of 1.5 results in an FPER of around 10% (9.5% and 10.1% for Clinical Studies 1 and 2, respectively) in the screening assay for both studies.
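This back-application can be sketched as follows. The data are simulated, and for simplicity only high-side outliers are trimmed, whereas Tukey fences are two-sided:

```python
import numpy as np

# Simulated skewed baseline S/N dataset (illustrative only).
rng = np.random.default_rng(6)
baseline_sn = rng.lognormal(sigma=0.5, size=3000)

fper_at = {}
for factor in (1.5, 3.0):
    q1, q3 = np.percentile(baseline_sn, [25, 75])
    iqr = q3 - q1
    # Trim high-side outliers only (a simplification of two-sided fences).
    trimmed = baseline_sn[baseline_sn <= q3 + factor * iqr]
    scp = np.percentile(trimmed, 95)              # non-parametric SCP after trimming
    fper_at[factor] = (baseline_sn > scp).mean()  # applied back to full baseline
    print(f"IQR factor {factor}: SCP = {scp:.2f}, FPER = {fper_at[factor]:.1%}")
```

The aggressive 1.5-fold fence removes a larger share of the upper tail, lowers the resulting SCP, and therefore pushes the FPER well above the 5% target when the SCP is applied back to the full baseline set.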

Table IV Effect of IQR Factor on Baseline FPER
