Cannons and sparrows II: the enhanced Bernoulli exact method for determining statistical significance and effect size in the meta-analysis of k 2 × 2 tables

A non-parametric exact test of overall statistical significance for dichotomous categorical meta-analysis

Jakob Bernoulli’s notion of what is now called a Bernoulli Trial offers the basis for a non-parametric approach to aggregating multiple epidemiological studies based on dichotomous categorical data. The enhancements to the Bernoulli method developed in this paper offer a practical exact method for assessing the overall statistical significance. A related technique is developed below to estimate the effect size of a dichotomous meta-analysis.

One of the many important contributions of this outstanding seventeenth-century mathematician was the idea of a fixed probability of an event over a sequence of independent trials, which led to what are now called Bernoulli Trials and to the related Binomial Distribution. In brief, Bernoulli viewed a set of statistical events as a series of independent coin flips, with each flip having a probability p of obtaining a head and q = 1 − p of obtaining a tail. This hypothetical coin is often treated as a fair coin where both p and q equal 0.5. The simplest Bernoulli Trials approach encompasses a series of n flips and answers questions of the type: what is the probability of observing x heads in n such flips? (See, for example, Rosner [13].) In epidemiology, one could consider each of the k contributing studies of a meta-analysis as a single Bernoulli Trial with p = 0.5. The combination of the k studies could then be analyzed as a binomial distribution. This is the standard Sign Test (see, for example, [14]).

For example, for a meta-analysis of 20 studies, if 15 out of 20 studies had more cases in the exposure group than in the control group, we could ask: What is the probability that 15 or more of the 20 studies could have shown a larger effect in the exposure group strictly by chance alone? If this cumulative probability is less than a pre-specified level of Type I error (e.g., 0.05), one would reject the null hypothesis and conclude there probably exists a statistically reliable relationship between exposure and the end point used.
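As a quick illustration, this cumulative probability can be computed directly in R (the language used later in this paper for the simulations); the snippet below is purely illustrative:

```r
# Sign Test for the example above: the probability of observing 15 or more
# "successes" out of 20 studies when each study is a fair coin (p = 0.5)
sum(dbinom(15:20, 20, 0.5))  # ~0.021, below the 0.05 threshold
```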

The principal reason that this approach has seen little use in practical epidemiology is that it suffers from two critical deficits. First, the dichotomous Bernoulli heads-vs.-tails approach does not deal with the third possibility of a tie. The author of this study believes that no truly useful method has been offered to date for situations in which there are an identical number of events in the exposure and control arms of a study, other than to discard the study. Second, a truly exact enhanced Bernoulli (EBT) method requires a complete convolution of the frequency distributions of the contributing studies in order to derive the combined frequency distribution. Even for equal sample sizes, each of the k contributing studies could have a different Bernoulli probability, p, requiring a full convolution to determine the null distribution of the total number of times there were more cases in the exposure group than in the control group across the k contributing studies. Before dealing with the ties problem, the determination of the combined distribution will be outlined.

Combining the individual studies contributing to the meta-analysis

A critical problem is finding a method for combining the individual study binomial distributions of the k contributing studies each with a possibly different p value into an overall frequency distribution.

Prior to the widespread availability of computing power, the convolution of a large number of individual binomial distributions was typically handled by approximate methods given the unwieldy nature of the calculations. Even with the advent of available computer power, convolution is still often impractical. As an example, for a meta-analysis involving 24 studies each with a unique binomial distribution, there are over 2 million unique combinations of the studies that need to be considered just to calculate the single discrete probability that exactly 12 of the 24 studies have more cases in the exposure group than in the control group. However, an exact algorithm was laid out in a readily implementable fashion by Butler and Stephens in a 1993 technical report [15] and can easily be run even on a personal computer. The algorithm yields the exact probability distribution of the convolution of individual binomial distributions, which in the present application correspond to the specific studies contributing to a meta-analysis. The method makes use of a recurrence relationship inherent in the binomial distribution which allows the semi-automatic calculation of its probabilities without resort to the simple but overwhelmingly inefficient enumeration of all possible combinations of studies. This easily established relationship can be stated as:

$$P\left( X = j \right) = \left( 1 - p \right)^{n} \;\;\mathrm{if}\; j = 0$$

$$P\left( X = j \right) = \left\{ \frac{n - j + 1}{j} \right\} \times \left\{ \frac{p}{1 - p} \right\} \times P\left( X = j - 1 \right) \;\;\mathrm{if}\; j \ge 1$$
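A minimal sketch of this algorithm in R follows; the function names and interface are illustrative rather than taken from Butler and Stephens. The first function builds a binomial probability mass function with the recurrence above; the second folds the k individual distributions together one at a time to obtain the exact distribution of their sum.

```r
# Binomial pmf computed via the recurrence above (assumes 0 <= p < 1)
binom_pmf <- function(n, p) {
  pmf <- numeric(n + 1)
  pmf[1] <- (1 - p)^n                                  # P(X = 0)
  for (j in seq_len(n)) {
    pmf[j + 1] <- pmf[j] * ((n - j + 1) / j) * (p / (1 - p))
  }
  pmf
}

# Exact distribution of the sum of k independent binomials, built by
# convolving one study's distribution into the running total at a time
convolve_binomials <- function(n_vec, p_vec) {
  dist <- binom_pmf(n_vec[1], p_vec[1])
  for (i in seq_along(n_vec)[-1]) {
    nxt <- binom_pmf(n_vec[i], p_vec[i])
    new_dist <- numeric(length(dist) + length(nxt) - 1)
    for (a in seq_along(dist)) {
      for (b in seq_along(nxt)) {
        new_dist[a + b - 1] <- new_dist[a + b - 1] + dist[a] * nxt[b]
      }
    }
    dist <- new_dist
  }
  dist  # dist[s + 1] = P(total number of successes = s)
}
```

In the meta-analysis application each study contributes a single Bernoulli trial, so `convolve_binomials(rep(1, k), pi_vec)` returns a vector of length k + 1 holding the exact null distribution of the number of studies with more cases in the exposure group.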

Figure 2 compares the estimated number of computer executable steps required in the Butler and Stephens method relative to a traditional convolution.

Fig. 2 Estimated computer executable steps per Butler and Stephens vs. traditional convolution

As can be seen, a traditional convolution is only tractable when the number of contributing studies is less than or equal to approximately 20.

The ties problem

The next problem in adapting the standard Bernoulli Trials technique to practical meta-analysis is a procedure to deal with the situation where there are an identical number of cases in both the exposure and control arms of a study contributing to the meta-analysis. In studies with small sample sizes and/or low disease probabilities, the highest probability tie is typically the “0/0” tie, in which no cases are observed in either the exposure or the control arm.

A first step in dealing with ties is to more clearly define the criteria for a “success”. The present EBT approach defines a success as there being a strictly greater number of cases in the exposure group relative to the control group. Under this definition, the same number of cases in both arms of the study, or more cases in the control arm of the study, is considered a “failure”. In essence, this is a trinomial situation: there are successes, failures, and ties. We simply combine the outcomes in which there are more cases in the control group than in the exposure group with the tie situations and call the combination “failures”.

Equation 9 below forms the basis of the EBT method. The Greek capital letter “Π” has been chosen to specify the probabilities of there being more cases in one arm of the study relative to the other to differentiate these parameters from the underlying disease probabilities:

$$\Pi_{E_i} + \Pi_{C_i} + \mathrm{prob}\left( \mathrm{tie} \right)_{i} = 1$$

(9)

where \(\Pi_{E_i}\) = probability of there being strictly more cases in the exposure group relative to the control group in Study i; \(\Pi_{C_i}\) = probability of there being strictly more cases in the control group relative to the exposure group in Study i; \(\mathrm{prob}(\mathrm{tie})_i\) = probability of finding exactly the same number of cases in both groups of Study i.

Assuming that \(\Pi_{E_i}\) and \(\Pi_{C_i}\) would be equal under the null hypothesis of no difference between exposure and control groups and rearranging terms, we have:

$$2\Pi_{E_i} + \mathrm{prob}\left( \mathrm{tie} \right)_{i} = 1$$

(10)

Solving for \(\Pi_{E_i}\) we have:

$$\Pi_{E_i} = \frac{1 - \mathrm{prob}\left( \mathrm{tie} \right)_{i}}{2}$$

(11)

Thus, the only requirement for calculating the \(\Pi_{E_i}\) parameter for each contributing study is to first determine the probability of all tie situations for that study.

This is a very straightforward procedure. To determine \(\mathrm{prob}(\mathrm{tie})_i\) for each of the contributing studies, all of the tie situations need to be enumerated and then their probabilities summed together.

As a simple example, assume that Study i has 100 participants in each of its exposure and control arms and that the underlying event (disease) probability p is 0.01.

The probability that there are no cases among these 100 participants in the exposure arm would then be:

$$\mathrm{Prob}\left( 0\ \mathrm{cases} \right) = 0.01^{0} \times \left( 0.99 \right)^{100} = 0.99^{100} \approx 0.37$$

Similarly, the probability of there being no cases in the control arm would also be 0.37.

Thus, the probability of a “0, 0” tie would be \(0.37^{2} \approx 0.13\), which is surprisingly large.

Table 3 lists the probabilities for the first five tie situations and sums these probabilities to determine \(\mathrm{prob}(\mathrm{tie})_i\).

Table 3 Probability of observing exactly the same number of cases in both the exposure and control groups for background event probability equal to 0.01 and sample size equal to 100 as a function of the number of observed cases

As shown in Table 3, there is over a 30% probability of obtaining a tie for zero cases through five cases in both the exposure and control groups. Applying Equation (11) to this hypothetical study, we see that, under the null hypothesis of equal probabilities, \(\Pi_{E_i}\) and \(\Pi_{C_i}\) are both equal to 0.35. Thus, due to ties, the nominal 0.50 value for \(\Pi_{E_i}\) and \(\Pi_{C_i}\) has been greatly reduced.
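The complete tie probability is simply the sum over all possible tie counts. A minimal sketch in R, assuming equal per-arm sample size n and a common null event probability p (the function names are illustrative):

```r
# Probability that both arms observe exactly the same number of cases:
# sum over j of P(X = j)^2, where X ~ Binomial(n, p) in each arm
prob_tie <- function(n, p) sum(dbinom(0:n, n, p)^2)

# Eq. (11): tie-adjusted probability of strictly more exposure-arm cases
pi_E <- function(n, p) (1 - prob_tie(n, p)) / 2

prob_tie(100, 0.01)  # ~0.31, consistent with Table 3
pi_E(100, 0.01)      # ~0.35, matching the worked example above
```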

The EBT technique is indeed a “vote counting” method, and such methods have been greatly disparaged by Rothman [16], among others, as “methods to avoid”. However, unlike a simple Sign Test, the EBT method is based on a reasonable approach to the ties problem and combines the individual \(\Pi_{E_i}\) values by doing the equivalent of a formal convolution of the frequency distributions of the individual contributing studies.
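Putting the pieces together, one possible end-to-end form of the significance test might look as follows. This is a sketch under stated assumptions, not the paper's reference implementation: in particular, the pooled per-study estimate of the null event probability is an assumption, and the sketch reuses `prob_tie` and `convolve_binomials` from the earlier snippets.

```r
# Exact EBT significance test for k studies with per-arm sample sizes n
# (a vector of length k), exposure-arm case counts cases_E, and
# control-arm case counts cases_C
ebt_significance <- function(cases_E, cases_C, n) {
  k <- length(cases_E)
  p_hat <- (cases_E + cases_C) / (2 * n)         # pooled null event probability (assumed)
  pi_i  <- (1 - mapply(prob_tie, n, p_hat)) / 2  # Eq. (11), one value per study
  dist  <- convolve_binomials(rep(1, k), pi_i)   # exact null distribution
  s <- sum(cases_E > cases_C)                    # observed number of "successes"
  p_value <- sum(dist[(s + 1):(k + 1)])          # P(S >= s) under the null
  list(successes = s, p_value = p_value)
}
```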

A non-parametric exact method for the estimation of effect size for dichotomous categorical meta-analysis

Basic estimation technique

A second exact technique was developed to estimate the effect size for dichotomous categorical meta-analysis. As a starting point, one might simply form the ratio of the average observed event probabilities, \(p_{E_i}\) and \(p_{C_i}\), in the exposure and control groups respectively of each study and average these ratios across the k contributing studies. This simple approach, however, is highly biased. As shown in the underlying model described in Eqs. 1–4, the number of observed “successes” in the exposure and control arms of the k contributing studies each depend on an identical source of variation, captured by \(\varepsilon_{i}\) in the model. The exposure group, however, contains an additional source of variation, captured by \(\varepsilon_{E_i}\). Figure 3 illustrates the problem of estimating the effect size by simply forming the ratio of \(p_{E}\) to \(p_{C}\).

Fig. 3 Demonstration of the inappropriateness of directly comparing the \(p_{E}\) and \(p_{C}\) distributions to estimate effect size

Even for the relative risk of 1.0 depicted in the figure, the exposure distribution will have positive excursions that are not compensated for by equally robust negative excursions, at least for small (rare) values of the event probability.

The differential skew of the \(p_{E_i}\) distribution relative to the \(p_{C_i}\) distribution was used to address this issue. The additional skew in the exposed group due to the source of variation \(\varepsilon_{E_i}\) in Eq. 2 was estimated by taking the difference between the total exposure group skew and the expected skew from a pure binomial with the same observed event probability. The observed average \(p_{E}\) across the k contributing studies was then reduced by a factor proportional to this difference in skew levels.
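A hedged sketch of this correction is given below. The paper states only that the average observed exposure-arm proportion is reduced in proportion to the excess skew; the proportionality constant `c` and the exact functional form here are assumptions for illustration.

```r
# Sample skewness of the k observed exposure-arm proportions
sample_skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Hypothetical form of the skew correction: shrink the mean observed
# exposure proportion by the skew in excess of a pure Binomial(n, p)
adjusted_p_E <- function(p_E_obs, n, c = 1) {
  p_bar <- mean(p_E_obs)
  binom_skew <- (1 - 2 * p_bar) / sqrt(n * p_bar * (1 - p_bar))  # binomial skewness
  excess <- sample_skew(p_E_obs) - binom_skew
  p_bar * (1 - c * excess)  # "reduced by a factor proportional to" the excess (assumed form)
}
```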

Monte Carlo simulation of the EBT and DL techniques for statistical significance and effect size estimation

A series of Monte Carlo simulations was conducted to evaluate the EBT statistical significance test and the effect size estimation techniques and to compare them to the widely used DerSimonian–Laird (DL) Inverse Variance technique. The simulations were written and executed in the statistical language R [17]. The DerSimonian–Laird results were calculated using the “meta” package in R.

Five levels of relative risk (the ratio of exposure group to control group event probability) of 1.0, 1.25, 1.5, 1.75, and 2.0 were crossed with three levels of disease background event probability (0.005, 0.01, and 0.05) and three levels of sample size (50, 100, and 200). Finally, the number of studies entering into each meta-analysis was chosen to be 5, 10, 20, or 40. These choices allowed direct comparisons with the earlier work cited above [12, 18]. In actuality, the background event probabilities were restricted to the small values that are typically encountered in epidemiological studies, as discussed in Table 2.

In addition, the heterogeneity between the contributing studies, τ² in Eq. 4, was evaluated at 0 (homogeneity), 0.4, and 0.8 to, again, allow comparisons to the earlier work. The last value of 0.8 represents a very large variance among the studies: at τ² = 0.8, a nominal exposure group event probability \(p_{E}\) of 0.05 would vary from 0.007 to 0.39, which is over a 35:1 ratio. Finally, the common variability in both the exposure and control groups, represented by \(\gamma^{2}\) in Eq. 1, was chosen to be 0.5 to again allow direct comparison with the earlier work.
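A minimal sketch of one simulation scenario follows. The multiplicative log-normal form of the random effects is an assumption consistent with the ranges quoted above (the paper's exact model is given in Eqs. 1–4, defined earlier in the paper); all names are illustrative.

```r
# Generate one simulated meta-analysis of k studies with n participants per arm
simulate_meta <- function(k = 10, n = 100, p0 = 0.05, rr = 1.5,
                          tau2 = 0.8, gamma2 = 0.5) {
  eps   <- rnorm(k, 0, sqrt(gamma2))   # variation common to both arms (gamma^2, Eq. 1)
  eps_E <- rnorm(k, 0, sqrt(tau2))     # additional exposure-arm heterogeneity (tau^2, Eq. 4)
  p_C <- pmin(p0 * exp(eps), 1)
  p_E <- pmin(p0 * rr * exp(eps + eps_E), 1)
  data.frame(cases_E = rbinom(k, n, p_E),
             cases_C = rbinom(k, n, p_C))
}

# One replication, fed to the EBT test sketched earlier:
# d <- simulate_meta(); ebt_significance(d$cases_E, d$cases_C, rep(100, 10))
```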

The statistical significance and effect size were evaluated using both the EBT and DL techniques for each replication. All simulation runs were conducted with 10,000 replications, and a value of 0.05 was used as the pre-specified level of Type I error. The “Mid-P” technique advocated by Agresti [19] and others was used to determine the p values in a less conservative manner, leading to more realistic power levels.
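A sketch of the Mid-P calculation on the convolved null distribution: only half of the probability of the observed count enters the tail. Here `dist` and `s` are as in the EBT sketch above.

```r
# Mid-P exact p value: P(S > s) + 0.5 * P(S = s)
mid_p_value <- function(dist, s) {
  0.5 * dist[s + 1] + sum(dist[seq_along(dist) > s + 1])
}
```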

Results from the Monte Carlo simulations: testing statistical significance

Figure 4 shows the results of both the EBT and the DL methods. To simplify presentation, only scenarios in which the expected number of cases was greater than or equal to two were utilized. Table 4 shows the included scenarios.

Table 4 Scenarios included in the analysis of statistical significance

When the Relative Risk equals one, the power is the Type I error or, equivalently, the false alarm rate. The basic finding was that the EBT method maintained the pre-specified level of Type I error for both the homogeneous and heterogeneous scenarios, while the DL method had many violations of this level for heterogeneous scenarios. For the homogeneous scenario where τ² = 0, both the EBT and the DL methods respect the pre-specified Type I error level. However, for τ² = 0.4 and τ² = 0.8, the DL method exhibits large violations of this level. As expected, as the number of contributing studies increases, the power for Relative Risk greater than one increases for both the EBT and DL methods. A separate analysis showed that the standard deviation of the power estimates in Fig. 4 was less than or equal to 0.42% (i.e., 0.0042).

Fig. 4 Power as a function of number of studies, relative risk, and heterogeneity. A, C, E, and G are for the EBT method and B, D, F, and H are for the DL method

In actuality, comparing the power of the EBT and DL techniques for Relative Risk ratios greater than 1.0 is not truly permissible due to the large number of violations of the pre-specified Type I error by the DL technique.

Figure 5 is a comparison of Type I error (false alarm rate) for the EBT technique and the DL technique as a function of heterogeneity (τ²).

Fig. 5 Type I error for the EBT and DL methods as a function of heterogeneity

As can be clearly seen, the EBT technique is relatively resistant to the effects of increasing heterogeneity over a very large heterogeneity range. The DL technique, however, exhibits a monotonically increasing sensitivity to heterogeneity. A related aspect of any meta-analysis technique's ability to perform well in the face of heterogeneity is its resistance to “contamination” from one or a small number of “rogue studies”. Since the EBT method does not allow such rogue studies to directly affect the test statistic, it should be much more resistant to these distortions.

The large costs of discreteness have been studied by Agresti [20] and others.

A first cost of discreteness results when the number of contributing studies is small. The general issue of overcoverage is highlighted in Fig. 6.

Fig. 6 Interval overcoverage as a function of the number of contributing studies

The overcoverage is greatest for the smallest numbers of contributing studies, k, and generally decreases as the number of contributing studies increases. As Fig. 6 demonstrates, even an unrealistic level of 500 contributing studies is still associated with a relatively large level of overcoverage. While such discreteness clearly reduces power, it could be argued that a statistically significant finding based on extremely sparse tables and a handful of studies requires stronger evidence. Unfortunately, the majority of meta-analyses consist of only two or three studies, as Kontopantelis et al. have shown in their extensive analysis of all meta-analyses in the Cochrane Library [21].

Additional Monte Carlo testing was done for unbalanced designs (unequal sample sizes in the exposure and control arms of the contributing studies) and for meta-analyses with unequal sample sizes across contributing studies. Table 5 shows the sample sizes for the two groups for a typical unbalanced design in which the control group sample size is twice the exposure group sample size. The sum of the two sample sizes across both arms of each study was chosen to be 200, yielding an average sample size of 100, to allow comparison with the balanced designs of Fig. 4.

Table 5 Sample sizes for simulation of unbalanced designs

Table 6 below shows the results of the simulation for heterogeneity values τ² = 0 and τ² = 0.8, Event (“disease”) Probability of 0.05, Number of Studies = 10, and Sample Size (avg.) = 100, at the same five levels of Relative Risk used above. The simulation run consisted of 10,000 replications, as in Fig. 4.

Table 6 Power (%) for the unbalanced design of Table 5: τ² (heterogeneity) equal to 0 and 0.8; event probability = 0.05; number of studies equal to 10; sample size (avg. per study arm) equal to 100

As the results in Table 6 show, when the heterogeneity was equal to 0.8, the Type I error (Relative Risk = 1.0) remained below the specified value of five percent for the EBT technique but was far above this point for the DL technique.

Table 7 below shows the sample sizes for the exposure and control groups for each of the contributing studies for a design with unequal sample size across the contributing studies. This particular design was chosen as a relatively extreme case. As can be seen, the average sample size across the two groups was maintained at 100 to allow comparison of the simulation results with the equal sample size scenarios of Fig. 4.

Table 7 Sample sizes for simulation of unequal sample size designs

Table 8 below shows the results of the simulation for heterogeneity values of τ² = 0 and τ² = 0.8, Event (“disease”) Probability of 0.05, and Sample Size (individual study arm average) = 100, at the same five levels of Relative Risk as used above. The simulation run consisted of 10,000 replications, as in Fig. 4.

Table 8 Power (%) for the unequal sample size design of Table 7: τ² (heterogeneity) equal to 0 and 0.8; event probability = 0.05; number of studies equal to 10; sample size (avg. per individual study arm) equal to 100

Most importantly, at a heterogeneity level of 0.8, the EBT technique was superior to the DL technique at protecting the pre-specified level of Type I error.

A clear finding of the Monte Carlo simulations common to both meta-analysis techniques studied is the apparent fruitlessness of searching for small effect sizes. Both the EBT and DL techniques are very poor at reliably finding statistically significant results until the relative risk approaches 2.0. While this finding does not directly bear on the issues studied in this report, it does serve as a cautionary tale to those who continue to try to tease out very small effects, especially from sparse data.

Results from the Monte Carlo simulations: effect size estimation

Figures 7 and 8 capture the basic findings for estimating the Effect Size.

Fig. 7 Effect size as a function of relative risk and heterogeneity. A and B correspond to the EBT and DL methods respectively

Fig. 8 Semi-interquartile range as a function of relative risk and heterogeneity. A and B correspond to the EBT and DL methods respectively

Again, only simulation scenarios in which the expected number of observed cases was greater than or equal to two were utilized. Since the effect of the number of studies contributing to the meta-analysis was small for effect size estimation, results were averaged across this variable. As shown in Fig. 7, both methods were reasonably successful at estimating the levels of relative risk. However, both methods generally underestimated the relative risk for τ² = 0 and overestimated it for τ² = 0.4 and τ² = 0.8. Finally, as shown in Fig. 8, the semi-interquartile range for the DL method was considerably smaller than for the EBT method.
