A Novel Bootstrapping Test for Analytical Biosimilarity

Biosimilars are gaining clinical, regulatory, and commercial importance as innovator products go off patent [1, 2].

During the development of biosimilar products (test product, TP), similarity to the reference product (RP) needs to be shown. This can be done at multiple levels, such as product quality attributes, pharmacokinetic (PK), animal, or clinical studies [3, 4]. Demonstrated similarity at the quality-attribute level at the end of the manufacturing process, or at the PK level, may serve as supporting information or even as a surrogate for clinical studies, with the expectation of no meaningful differences in efficacy and safety of the product [5].

Regulatory Background

Recent regulatory efforts resulted in a series of guidance documents and reflection papers about statistical aspects of analytical biosimilarity assessment [6, 7].

The requirements of the US Food and Drug Administration (FDA) [6] are less specific than those of the European Medicines Agency (EMA) in its latest reflection paper. In general, FDA states that "The objective of the comparative analytical assessment is to verify that each attribute, as observed in the proposed biosimilar and the reference product, has a similar population mean and similar population standard deviation." More specifically, they propose to conduct a quality range approach, i.e. checking whether a defined fraction of TP batches lies within \(\overline{X}_{\mathrm{RP}} \pm k \times s_{\mathrm{RP}}\), where \(\overline{X}_{\mathrm{RP}}\) represents the sample mean and \(s_{\mathrm{RP}}\) the sample standard deviation of the reference product. The factor \(k\) can be adjusted depending on the criticality of the quality attribute; i.e. for more critical attributes a lower value of \(k\) might be chosen. Additionally, the sponsor may use equivalence tests. For low-risk quality attributes, graphical comparison may be applied. It is important to note that equivalence tests have previously been employed for the most important quality attributes [8]. However, these equivalence tests are no longer strongly recommended by either FDA or EMA.
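As an illustration, the quality range check can be sketched in a few lines. All lot values, the choice of k, and the 90% criterion below are illustrative assumptions, not values from any filing:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative lot release values for one quantitative quality attribute
rp_lots = rng.normal(loc=100.0, scale=2.0, size=10)  # reference product
tp_lots = rng.normal(loc=100.5, scale=1.8, size=8)   # proposed biosimilar

k = 3.0  # range multiplier; tightened for more critical attributes
lower = rp_lots.mean() - k * rp_lots.std(ddof=1)
upper = rp_lots.mean() + k * rp_lots.std(ddof=1)

# Fraction of TP lots falling inside the RP-derived quality range
frac_inside = np.mean((tp_lots >= lower) & (tp_lots <= upper))
print(f"quality range: [{lower:.2f}, {upper:.2f}], "
      f"TP lots inside: {frac_inside:.0%} (criterion: e.g. >= 90%)")
```

Note that both range limits are computed from sample statistics of the RP lots and are therefore themselves subject to sampling error, a point taken up later in this section.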

Recent regulatory efforts of EMA resulted in a detailed reflection paper that describes the establishment of analytical biosimilarity. One important message from this reflection paper is that the process needs to be understood as a distribution from which individual lots are sampled. Any claim of similarity/equivalence must be understood as a claim about the underlying distributions, not about the actual samples. This is the basic concept of statistical inference. Additionally, inferential statistics require that the two distributions being compared are representative of the true TP and RP manufacturing processes, respectively. This is also covered in EMA and FDA guidelines; e.g. FDA states that TP lots should be "representative of the intended commercial manufacturing process" [6]. Therefore, for the methods presented in this contribution we assume that samples are representative, and the following workflow can be extracted from [7]:

1.

Define general aim

(non-inferiority or equivalence). This depends on the nature of the critical quality attribute (CQA) to be investigated. When dealing with an impurity, a non-inferiority claim is sufficient; otherwise, equivalence is requested.

2.

Define CQAs to be investigated

CQAs might differ in their mathematical nature (continuous or binary).

Although there are slight differences between EMA and FDA practice, neither agency expects all quality attributes to be identical between the RP and TP. However, prior to the actual analytical biosimilarity comparability assessment, quality attributes should be ranked according to their criticality and impact on efficacy and safety. Depending on this ranking, varying rigor in the biosimilarity assessment might be applied. For CQAs that cannot be quantified or have less impact on clinical outcomes, graphical comparison of the raw data is suggested.

3.

Define similarity condition

Similarity condition is a term used by the recent EMA reflection paper [7] to define an a priori agreement on when two data distributions are to be considered "similar", i.e. what is the maximum allowed difference between two underlying distributions. This decision making benefits from knowledge of the impact such differences could have on clinical outcome. In practice, these impacts are usually not known, and risk assessments need to be added to submissions to support the definition of an appropriate similarity condition.

It is important to note that EMA stresses that a similarity condition must always be agreed upon before a similarity criterion is applied.

4.

Definition of a statistical test/ “similarity criterion”

The term similarity criterion was introduced in the latest EMA reflection paper and is understood as the concrete instruction on how to use data to make statements about the a priori agreed similarity condition. In practice, the similarity criterion can be understood as the actual test procedure.

Any test employed should have a defined Type I error (the agency's risk of wrongly declaring biosimilarity), sometimes also called the false positive rate.

The latest EMA reflection paper notes that understanding the operating characteristics of each test is important, i.e. the chance of false positive/false negative results. The agency expects applicants to discuss operating characteristics and justify an acceptably low chance of false positive conclusions, i.e. of concluding similarity where actually no similarity exists.

5.

Conduct experimental study plan and sampling strategy controlling for measurement variability

e.g. estimate the analytical sample size to account for measurement variability. If the within-batch (analytical) variability is larger than the between-batch variability, it can be reduced by taking replicate measurements of each sample. In specific cases this may even lower the number of manufacturing lots required for the biosimilar.
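The effect of replicates can be sketched as follows: when each lot is reported as the mean of \(n_{\mathrm{rep}}\) analytical replicates, the variance of the reported lot value is \(\sigma_{\mathrm{between}}^2 + \sigma_{\mathrm{analytical}}^2/n_{\mathrm{rep}}\). All numbers below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

sigma_between = 1.0     # assumed true batch-to-batch SD
sigma_analytical = 2.0  # assumed within-batch (measurement) SD, here larger

n_lots = 10
true_lot_means = rng.normal(100.0, sigma_between, n_lots)

for n_rep in (1, 4, 16):
    # each lot measured n_rep times; the replicate mean is reported per lot
    meas = true_lot_means[:, None] + rng.normal(0, sigma_analytical,
                                                (n_lots, n_rep))
    lot_estimates = meas.mean(axis=1)
    # observed SD approaches sqrt(sigma_between^2 + sigma_analytical^2/n_rep)
    theoretical = np.sqrt(sigma_between**2 + sigma_analytical**2 / n_rep)
    print(f"replicates={n_rep:2d}: observed SD={lot_estimates.std(ddof=1):.2f}, "
          f"theoretical SD={theoretical:.2f}")
```

As the theoretical column shows, replicates shrink only the analytical component; the batch-to-batch component remains and sets the floor for the observed lot-to-lot variability.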

6.

Perform equivalence/non-inferiority testing

7.

Consideration regarding false positive conclusion and risk mitigation of non-comparability results

A usual and expected drawback of current analytical biosimilarity analysis is that a separate test is conducted for each CQA. This leads to the well-known phenomenon of multiplicity, where the Type I error and Type II error of the overall assessment are inflated. However, addressing multiplicity is not the focus of this contribution.

This contribution will focus on a novel statistical test for the comparison of quality attributes — also called analytical biosimilarity assessment — between the reference product (RP) and the biosimilar candidate (TP), which is required to achieve licensure [4, 6]. The same comparability exercise is of relevance when evaluating the impact of a change in the manufacturing process [9].

We aim to focus especially on two critical aspects of this workflow: the similarity condition and the similarity criterion. Without a clear understanding of the similarity condition, any formulation of a similarity criterion (the realisation of a test) is meaningless. In the past, the following similarity conditions have been used:

Previously, for equivalence tests on the difference in means between two distributions, FDA proposed to use 1.5 times the estimated standard deviation of the reference product as the equivalence acceptance criterion. However, this guideline was withdrawn by FDA.

Although not stated explicitly, a previous publication tailored to the biosimilar comparison task defined the "equivalence region" as requiring at least the central 99.7% of the TP distribution to lie within the central 99.7% of the RP distribution [10].

In the area of pre- and post-change comparison, a recent publication used the process capability or out-of-specification (OOS) rate to set up similarity conditions [11]. Assuming that the specification of the RP equals 3σ of the RP process, this similarity condition is very similar to the one we define in this publication (see the "Results" section). Although this is frequently used in practice, the calculated specification is derived from point estimates of the sample mean and standard deviation and is therefore subject to sampling error.
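The central-interval condition cited above can be written explicitly: \(\mu_{\mathrm{TP}} \pm z\sigma_{\mathrm{TP}}\) lies within \(\mu_{\mathrm{RP}} \pm z\sigma_{\mathrm{RP}}\) exactly when \(|\mu_{\mathrm{TP}}-\mu_{\mathrm{RP}}| + z\sigma_{\mathrm{TP}} \le z\sigma_{\mathrm{RP}}\), with z = 3 for the central 99.7%. A minimal sketch of the resulting boundary in the standardized quantities \(K1 = |\mu_{\mathrm{TP}}-\mu_{\mathrm{RP}}|/\sigma_{\mathrm{RP}}\) and \(K2 = \sigma_{\mathrm{TP}}/\sigma_{\mathrm{RP}}\) (function name is ours, for illustration only):

```python
def max_k1(k2: float, z: float = 3.0) -> float:
    """Largest standardized mean shift K1 = |mu_TP - mu_RP| / sigma_RP such
    that mu_TP +/- z*sigma_TP still lies within mu_RP +/- z*sigma_RP,
    for a given SD ratio K2 = sigma_TP / sigma_RP."""
    return max(0.0, z * (1.0 - k2))

for k2 in (0.5, 0.8, 1.0, 1.2):
    print(f"K2={k2}: similarity requires K1 <= {max_k1(k2):.2f}")
```

The trade-off is visible directly: a TP with smaller variability (K2 < 1) may tolerate some mean shift, while a TP at or above the RP variability (K2 ≥ 1) allows none under this condition.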

In terms of similarity criteria/statistical tests, two main approaches have been used in the past for claiming biosimilarity for quantitative CQAs in BLA/MAA/NDA filings:

The first is the two one-sided t test (TOST), used to demonstrate that the difference in means between biosimilar and innovator is within the acceptance criteria. As an example, for Mvasi [12], Amgen's biosimilar to Avastin, similarity of binding to VEGF and of the anti-proliferation activity in HUVEC cells was assessed using TOST.

The second is to look at the "population within the population," generally referred to as quality ranges or range tests. One example is Truxima, Celltrion's biosimilar to Mabthera [13]. Similarity of a quality attribute was claimed when most biosimilar lots fell within a range calculated as 3 times the sample standard deviation of the innovator.

Flaws of Equivalence Tests and Range Tests

It is important to note that both equivalence tests and simple range test approaches have flaws and do not comply with all regulatory requirements.

Equivalence Tests

Equivalence tests are designed to reject the null hypothesis that distributional parameters such as the means or variances of TP and RP depart too much from each other. If this null hypothesis can be rejected, the populations are called "equivalent." To establish what "too much" means, quantitative equivalence boundaries have been proposed [14]. Although some commonly accepted equivalence acceptance criteria exist (such as \(1.5\sigma_{\mathrm{RP}}\) for the TOST on the difference in means for showing analytical biosimilarity, or 80–125% of the RP for average similarity in evaluating PK data), all of these limits are chosen arbitrarily. They may be adapted according to the criticality of the CQA to account for residual risk.

The most frequently applied equivalence test in past analytical biosimilarity testing is the TOST [14,15,16]. This test examines only the mean difference between RP and TP and does not take differences in variance into account. In a rather theoretical but extreme case of a very large number of TP and RP lots, it is possible to claim similarity for a biosimilar product with a large difference in variance but a small difference in mean relative to the innovator [10]; to be precise, in this case only the mean difference needs to be smaller than the equivalence margin. Since this test is no longer present in the updated FDA guideline, we will not go into detail here and refer to the literature, which even proposes alternative tests with better power [17]. Equivalence tests for variances also exist (F-tests as described in [18]). However, separate equivalence tests on mean and variance neglect the interplay between the two in still yielding acceptable product; e.g. a biosimilar with some (even larger) mean difference to the RP may, due to its small variability, still produce predominantly acceptable product. Vice versa, a biosimilar candidate that shows little or no mean difference to the RP may tolerate somewhat higher variability (compare Fig. 2). Neither element is taken into account when performing separate equivalence tests on mean and variance.
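For illustration, a TOST on the mean difference can be sketched as two one-sided Welch t tests. The margin here is set to 1.5 times the RP sample standard deviation in the spirit of the withdrawn FDA criterion; the data and margin choice are illustrative assumptions, and treating the estimated margin as fixed makes the nominal level only approximate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rp = rng.normal(100.0, 2.0, 10)   # illustrative reference-product lots
tp = rng.normal(100.8, 2.0, 10)   # illustrative biosimilar lots

margin = 1.5 * rp.std(ddof=1)     # margin in the spirit of 1.5 * s_RP

# TOST: reject both H0: diff <= -margin and H0: diff >= +margin,
# implemented here by shifting the RP sample by the margin.
p_lower = stats.ttest_ind(tp, rp - margin, equal_var=False,
                          alternative="greater").pvalue
p_upper = stats.ttest_ind(tp, rp + margin, equal_var=False,
                          alternative="less").pvalue
p_tost = max(p_lower, p_upper)
print(f"margin={margin:.2f}, TOST p-value={p_tost:.3f}, "
      f"equivalence claimed at 5%: {p_tost < 0.05}")
```

Note that this procedure never rejects because of a variance difference alone, which is precisely the flaw discussed above: a TP with inflated variance but small mean shift can still pass.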

Range Tests

A quality range approach as defined by FDA is a biosimilarity assessment method that takes into account both sample mean and sample variance, overcoming that limitation of the TOST. Moreover, it accounts for the combination of both distributional parameters in reaching acceptable product. The rationale is straightforward: if most TP batches lie within the RP population, it is likely that the TP distribution lies within the range of the RP.

A frequently applied realisation of this concept is to claim biosimilarity when a sufficiently large percentage of TP batches (e.g. 90%) falls within an acceptable range derived from the RP population. We note that with fewer than 10 TP batches, usually all batches need to be within the given acceptable range. Several methods have been used to define the acceptable range, such as the minimum–maximum of all RP batches (the so-called Min–Max test), three standard deviations (the so-called 3SD test), or a tolerance interval. Amongst them, Min–Max has been considered the range with the least regulatory risk and the most manufacturer risk [10]. On the contrary, as described in Fig. 2 of [10], the tolerance interval is the range with the most regulatory risk and the least manufacturer risk, as it shows the highest false acceptance rate at a typical sample size of n = 10. For these sample sizes, according to Fig. 2 of [10], the three standard deviation test [16] is the one of the three methods that balances regulatory and manufacturer risks, as it shows comparably low false acceptance and false rejection rates, and it is similar to the quality range method recommended by FDA [6]. For all of these simple range tests, such as the 3SD test or the Min–Max test, it is generally easier to claim biosimilarity with fewer TP batches, which may discourage the sponsor from increasing the number of manufactured batches [10]. Statistically speaking, the false positive error (Type I error) of quality range tests, i.e. the chance of concluding biosimilarity although there is none, is a function of the sample size. This is not the preferred behaviour of a proper statistical test, which usually keeps the Type I error at a constant, pre-defined level, usually 5%. Regulatory agencies have taken note of these flaws and call for a biosimilarity test that keeps the Type I error, which equals the regulatory risk, independent of the sample size.
Since this is known to the agencies, there is a need to develop a test that controls the regulatory risk. In some cases the range test is formulated to pass when a large fraction, or all, of the TP lots fall within \(\overline{X}_{\mathrm{RP}} \pm C \times s_{\mathrm{RP}}\), where \(\overline{X}\) and \(s\) are the sample mean and standard deviation, respectively. Then C can be defined as a function of the similarity condition, \(n_{\mathrm{TP}}\), and \(n_{\mathrm{RP}}\) to assure a defined alpha level of the test (Type I error) [11]. However, such a test has certain drawbacks. For ease of comparability with the nomenclature of the previous publication, let us define \(K1=\frac{|\mu_{\mathrm{TP}} - \mu_{\mathrm{RP}}|}{\sigma_{\mathrm{RP}}}\) and \(K2=\frac{\sigma_{\mathrm{TP}}}{\sigma_{\mathrm{RP}}}\), where \(\mu\) and \(\sigma\) are the true population mean and standard deviation.
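The sample-size dependence of the range test's false-acceptance rate can be illustrated with a small Monte Carlo sketch. All distribution parameters and the 90% pass fraction are illustrative assumptions, not values from this study:

```python
import numpy as np

rng = np.random.default_rng(7)

def sd3_test_passes(rp, tp, frac_required=0.9, k=3.0):
    """Pass if at least frac_required of TP lots fall in mean_RP +/- k*s_RP."""
    lo = rp.mean() - k * rp.std(ddof=1)
    hi = rp.mean() + k * rp.std(ddof=1)
    return np.mean((tp >= lo) & (tp <= hi)) >= frac_required

# A fixed "non-biosimilar" truth: shifted mean and inflated variance
mu_rp, sd_rp = 100.0, 2.0
mu_tp, sd_tp = 103.0, 3.0

for n in (5, 10, 20, 40):
    passes = sum(
        sd3_test_passes(rng.normal(mu_rp, sd_rp, n), rng.normal(mu_tp, sd_tp, n))
        for _ in range(5000)
    )
    print(f"n={n:2d}: false-acceptance rate ~ {passes / 5000:.2f}")
```

With these settings the pass rate falls as n grows although the underlying truth never changes, i.e. the Type I error is not held at a fixed level.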

Figure 1 shows the Type I error as a function of C, K1, and K2 at a sample size of \(n_{\mathrm{TP}}=n_{\mathrm{RP}}=20\). K1 and K2 have been chosen such that 99% (as a synonym for 3SD) of the TP distribution lies within 99% of the RP distribution. This equals the decision boundary between the biosimilar and non-biosimilar regions as defined in Fig. 2. We see in Fig. 1 that, in order to achieve a 5% Type I error, the level C needs to be adjusted as a function of K1 and K2. However, a priori knowledge of K1 and K2 is not available in practice; we only have sample estimates. This contribution therefore also investigates the impact of different levels of C (all originating from K1 and K2 values of the same similarity condition) on the power and Type I error.
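How the Type I error of such a range test varies with C can be sketched by simulation at one assumed boundary setting of the similarity condition (here K1 = K2 = 0.75, chosen so that K1 + 3·K2 = 3, the 3SD reading of the condition; sample sizes and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed boundary setting K1 + 3*K2 = 3 (illustrative choice)
k1, k2 = 0.75, 0.75
mu_rp, sd_rp = 0.0, 1.0
mu_tp, sd_tp = k1 * sd_rp, k2 * sd_rp
n = 20  # n_TP = n_RP = 20, as in Fig. 1

def range_test(rp, tp, c, frac=0.99):
    """Pass if at least frac of TP lots fall in mean_RP +/- c*s_RP."""
    lo = rp.mean() - c * rp.std(ddof=1)
    hi = rp.mean() + c * rp.std(ddof=1)
    return np.mean((tp >= lo) & (tp <= hi)) >= frac

for c in (2.0, 2.5, 3.0, 3.5):
    hits = sum(
        range_test(rng.normal(mu_rp, sd_rp, n), rng.normal(mu_tp, sd_tp, n), c)
        for _ in range(5000)
    )
    print(f"C={c}: Type I error ~ {hits / 5000:.3f}")
```

The estimated Type I error grows with C, so C would have to be calibrated against the unknown K1 and K2 to hit a 5% level, which is exactly the difficulty this contribution investigates.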

Fig. 1

Example data for demonstration: Type I error of a range test (99% of TP being within \(\overline{X}_{\mathrm{RP}} \pm C \times s_{\mathrm{RP}}\)) with varying levels of C, K1 (\(=\frac{|\mu_{\mathrm{TP}} - \mu_{\mathrm{RP}}|}{\sigma_{\mathrm{RP}}}\)), and K2 (\(=\frac{\sigma_{\mathrm{TP}}}{\sigma_{\mathrm{RP}}}\)). The horizontal red line indicates the significance level of 0.05

Fig. 2

Example of simulation study results. A highly powerful test would distinguish biosimilarity exactly at the biosimilar decision boundary, with a full acceptance rate for settings in the green shaded area and no acceptance of biosimilarity above the orange line in the "non-biosimilar" area

Requirements to a Novel Test for Showing Analytical Biosimilarity

In this contribution, we want to establish a novel statistical test that reduces the abovementioned flaws and is compliant with current regulatory requirements. Specifically, we will focus on three criteria to accept such a new test:

The test should have an easy-to-define and clearly formulated null hypothesis based on the similarity condition to be tested; i.e. the test is designed to reject the null hypothesis of being not biosimilar. This is currently not fulfilled by simple range tests such as the 3SD or Min–Max test.

The test simultaneously checks the underlying populations and not only single characteristics such as the mean and the variance of TP and RP. This is currently not achieved by simple equivalence tests such as the TOST.

The operating characteristics should be easy to understand, and the Type I error (the agency's risk) should be controlled along the entire similarity condition, independent of the sample sizes of TP and RP. This is of utmost importance and currently not achieved by simple quality range tests.

For CQAs that do not pass the biosimilarity tests mentioned above, extensive characterisation and investigation need to be performed to understand their potential impact on potency and safety, e.g. in clinical trials. Both the analytical analyses and the clinical trial data need to be provided by the sponsor, and the agency will evaluate the biosimilarity claim based on the totality of the evidence and the residual uncertainty.
