Application of a Hermite-based measure of non-Gaussianity to normality tests and independent component analysis

1. Introduction

Tests for normality play many important roles in neural data analysis. Two of these are complementary. One such role is to determine whether a distribution is sufficiently close to normal (Wilcox and Rousselet, 2018; Kwak and Park, 2019) to justify an assumption of normality. A contrasting role is to identify distributions with the greatest deviation from normality (McKeown and Sejnowski, 1998; Vigário et al., 2000; James and Hesse, 2004; Onton et al., 2006). This is the key step in independent component analysis (ICA), and is motivated by the heuristic that unmixed sources have the lowest entropy, and are therefore highly non-Gaussian (Comon, 1994; Bell and Sejnowski, 1995). Perhaps as a consequence of these disparate goals, there are many tests of normality (Shapiro et al., 1968; Yazici and Yolacan, 2007) and related ICA contrast functions (Grouiller et al., 2007; Rejer and Górski, 2013), but each of these has trade-offs. Here we introduce a new method, and compare its utility to existing approaches for a range of simulated datasets.

The advantages and disadvantages of the various kinds of normality tests and ICA contrast functions can be understood in terms of the strategies that they use. For example, moment-based normality tests can be overly sensitive to outliers (Shapiro et al., 1968). Tests based on cumulative distribution functions and frequency statistics give appropriate weight to tails (Shapiro et al., 1968; Mendes and Pala, 2003), but may overlook the shape near the center. ICA contrast functions that utilize entropy-based measures face the difficulty of estimating entropy from limited data (Paninski, 2003).

Our strategy diverges from these methods: it is based on approximating the distribution of data by using a set of orthonormal basis functions, the Hermite functions (Szegö, 1939). In principle, since the coefficients of a Hermite expansion can be estimated via simple linear estimators that are insensitive to outliers, this approach provides advantages for some kinds of data distributions.

Here, we conduct a thorough investigation to assess whether a Hermite functions-based measure of non-Gaussianity is useful as a normality test and as an ICA contrast function. We note that similar approaches using Hermite functions have been previously proposed for moment-based normality tests (Almuzara et al., 2019; Amengual et al., 2022) and distribution shape-sensitive ICA (Puuronen and Hyvärinen, 2011), but a comparison to other normality tests or applicability as an ICA contrast function had not been undertaken. We test the proposed method by using datasets constructed to have simple distributions as well as realistic EEG simulations. We then benchmark the performance of our method against common normality tests and ICA contrast functions included in a popular ICA package, FastICA (Hyvärinen, 1999). Our method has advantages as a normality test for heavy-tailed and asymmetric distributions. However, as an ICA contrast function, these advantages are only realized for a niche of datasets: small datasets with components that have heavy tailed or asymmetric distributions.

2. Methods

We first introduce Hermite functions and the properties that underlie our proposed measure of non-Gaussianity. Next, we formulate the method which serves as a normality test and a contrast function for ICA. Finally, we detail the datasets and procedures we use for assessing the performance of the proposed method.

2.1. Background on Hermite functions

Our approach to assess the non-Gaussianity of a probability distribution p(x) centers on its expansion in terms of an orthonormal family, the Hermite functions Hn(x) (Szegö, 1939). In the Hermite expansion of p(x), the first term, H0(x), is a Gaussian. This property allows the Hermite expansion to separate the Gaussian and the non-Gaussian parts of p(x).

Each Hermite function (Hn(x)) is given by the Hermite polynomial of the same order (hn(x)), multiplied by a Gaussian envelope e-x22. In the standard convention used here, the Hermite polynomials hn(x) are orthogonal with respect to the Gaussian weight e−x2, and are given by

hn(x)=(-1)nex2dndxne-x2.    (1)

The Hermite function of order n (Hn(x)) thus corresponds to the product of the Hermite polynomial of the same order (hn(x)) and a Gaussian with mean 0 and variance 1.

Hn(x)=12nn!πe-x22hn(x).    (2)

With this definition, the orthonormality of Hermite functions is expressed by

∫−∞∞Hm(x) Hn(x) dx=δmn.    (3)

This property allows us to express probability distribution functions p(x) in terms of Hermite functions:

p(x)=∑i=0∞aiHi(x).    (4)

If p(x) is known, the coefficients ai can be found in the following manner:

ai=∫−∞∞Hi(x) p(x) dx.    (5)

This evaluation of the coefficients ai follows from the orthonormality property of Hermite functions:

∫−∞∞Hi(x) p(x) dx = ∫−∞∞Hi(x)∑j=0∞ajHj(x) = ∫−∞∞aiHi(x)Hi(x) =ai.    (6) 2.2. Hermite function-based measure of non-Gaussianity

In the Hermite expansion of p(x) (Equation 4), H0 is Gaussian (Equation 2), and thus its coefficient a0 can be considered to represent the Gaussian part of p(x). We leverage this to estimate non-Gaussianity of p(x) via the coefficients of the higher-order Hermite functions (Hi(x) where i≠0) that enter into its Hermite expansion (Equation 4). Specifically, we propose to measure non-Gaussianity as the relative power in the coefficients corresponding to all but the zeroth order Hermite function:

J=1-a02∑i=0∞ai2.    (7)

The minimum possible value of J is 0, which is only achieved for a Gaussian distribution with mean 0 and variance 1, i.e., when a0 is non-zero and all other ai are 0. Positive values of J indicate a deviation from this distribution.

However, this measure cannot be used directly for three reasons. The first reason is that our data might not have 0 mean and 1 variance. We address this by standardizing our data before we estimate J. The second reason is that the estimation of ai uses p(x) (Equation 5), which is unknown. Instead, we estimate the aiest from samples xj drawn from p(x):

aiest=∑j=1NHi(xj)N.    (8)

The third reason is that the infinite sum in the denominator of equation 7 is divergent for distributions which are not square integrable (such as a discrete distribution based on a finite number of samples). Thus, we modify the estimator in Equation (7) by truncating the denominator to a finite number of terms n:

Jn=1-(a0est)2∑i=0n(aiest)2.    (9)

By limiting the number of Hermite coefficients, we trade off resolution of the shape of the distribution for an estimator that is robust and avoids overfitting—a kind of regularization. As we show in the Section 3, the choice n = 15 works well in many practical situations, and we use that value unless otherwise stated.

Here, we examined two applications of J15. First, as a normality test, we compare its ability to detect deviations from a normal distribution with that of standard normality tests, and to the measures of normality typically used as ICA contrast functions. Second, since performance as a test of normality need not necessarily translate into utility as a contrast function in ICA, we directly compared an implementation of ICA that uses J15 as a contrast function with standard ICA (Hyvärinen, 1999), for simple low-dimensional datasets with known statistics and in realistic simulated EEG datasets.

2.3. Evaluation of J15 as a normality test

To determine the sensitivity of J15 to typical ways in which distributions may deviate from the normal distribution, we compared its ability to detect non-Gaussianity for three families of distributions: a family of distributions that differed in the extent of bimodality, a family of unimodal symmetric distributions that ranged from light- to heavy-tailed, and a family of unimodal distributions that varied in the degree of asymmetry.

2.3.1. Bimodal distributions

To examine sensitivity to bimodality, we used a family of distributions consisting of a mixture of Gaussians (Equation 10) with a parameter α that controls the mixing weights.

p(x)=αN(x;μ1,σ1)+(1-α)N(x;μ2,σ2)    (10)

where N(x; μi, σi) denotes normally distributed random variable x with mean μi and standard deviation σi. The distribution is Gaussian for α = 0, or 1 and bimodal for intermediate values.

2.3.2. Light- and heavy-tailed distributions

To examine the gamut from light- to heavy-tailed distributions, we used the family of symmetric generalized normal distributions (Equation 11).

p(x)=C(β) e−|x|β2;    (11)

where C(β) is a normalizing constant. This distribution is Gaussian for β = 2, light-tailed (platykurtic) for β>2, and heavy-tailed (leptokurtic) for β < 2.

2.3.3. Unimodal asymmetric distributions

To examine sensitivity to asymmetry, we used the family of generalized extreme value distributions (Equation 12).

p(x)=e−(1+κx)−1κ(1+κx)−1−1κ.    (12)

This distribution has a heavier left tail for negative values of κ, a heavier right tail for positive values of κ, and a transition near −0.3.

2.3.4. Procedure

For each of the above families, we estimated non-Gaussianity via J15 for a range of distribution shapes (generated by varying the family parameter α, β, or κ), and compared these values with measures provided by five standard tests of normality and three standard ICA contrast functions. These comparisons were repeated for 1,000 runs of simulated data, and different sizes of datasets (103, 104, and 105).

While there are many normality tests available today, we focused on five of the most common tests that use a range of strategies. Among the standard normality tests used, two tests were based on cumulative distribution functions (Kolmogorov-Smirnov test Kolmogorov, 1933 and Anderson-Darling test Anderson and Darling, 1952), two were moment-based (Jarque-Bera test Jarque and Bera, 1987 and D'Agostino's k-squared test D'Agostino et al., 1990), and one used frequency statistics (Shapiro-Wilk test Shapiro and Wilk, 1965).

The ICA contrast functions compared here are part of a standard ICA package, FastICA (Hyvärinen, 1999). These comparisons were chosen for two reasons. First, we are targeting analysis of EEG datasets; FastICA is commonly used for this purpose (Vigário, 1997; Rejer and Górski, 2013). The second reason relates to the strategy taken by the new method: it finds components one by one, as does FastICA (via the “deflate” argument for approach). We did not compare our approach to methods (Amari et al., 1995; Belouchrani et al., 1997) that find all components simultaneously, as the proposed contrast function, in its present form, cannot do this. Details of the ICA contrast functions used are discussed in Section 2.4.

2.4. Evaluation of J15 as a contrast function in ICA

To assess the applicability of J15 to ICA, we evaluate the accuracy and precision with which it could identify the maximally non-Gaussian 1D projections of 2- and 5-dimensional datasets with simple density functions, and in high-dimensional datasets drawn from a standard biologically-realistic simulation of EEG data. 1D projections of the 2D datasets were further used to find the optimal number of Hermite functions required in Equation (9) to assess non-Gaussianity. In each case, the true independent components were known, so we could compare accuracy and precision with standard ICA measures used in FastICA (Hyvärinen, 1999), and the dependence of these measures on the size of the dataset.

The standard ICA contrast functions used were: FastICA-I, which is kurtosis-based (invoked with the argument pow3), and FastICA-II and FastICA-III, which are two approximations of differential entropy (invoked with the arguments tanh and gaus). They are defined by

FastICA-I(X)=14E[X4],    (13) FastICA-II(X)=E[log(cosh(X))],    (14)

and

FastICA-III(X)=E[−e−X22].    (15) 2.4.1. 2D and 5D datasets

In the two-and five-dimensional datasets, the component variables were independently drawn from a Gaussian distribution or a member of one of the above families (Section 2.3), with the following parameters: bimodal symmetric (α = 0.5;N(x; −2, 1) and N(x; 2, 1)), bimodal asymmetric (α = 0.7;N(−2, 1) and N(x; 2, 0.4)), heavy-tailed (β = 1), light-tailed (β = 10), and unimodal asymmetric (κ = 0).

Two-dimensional datasets were generated by sampling a product distribution that was Gaussian along one axis and non-Gaussian along the orthogonal axis. For each dataset sampled from the distribution, an exhaustive search (adaptive step size from 1° down to 0.1°) was done to estimate the direction of maximal non-Gaussian projection of data, i.e., the direction that yielded the maximum value of J15 or one of the standard contrast functions. This direction is referred to as the estimated direction of unmixing and its convergence to the true non-Gaussian axis of the underlying distribution measures accuracy. Precision is measured from the estimated direction of unmixing, i.e., the direction at which non-Gaussianity is maximal. Specifically, precision is the angular deviation from this direction at which estimates of non-Gaussianity first reliably deviate from maximal non-Gaussianity.

1D projections of these 2D datasets were also used to find an appropriate number of Hermite functions for estimating non-Gaussianity in Equation (9). We used these datasets for two reasons: (i) they allow us to test for different aspects of non-Gaussianity, and ii) they allow for an exhaustive search. That is, we can determine the precision with which they identify the maximally non-Gaussian direction, which is critical to ICA. For this analysis, we varied the highest order of Hermite functions n in Equation (9) from 2 to 50, and, for each value of n, we considered variants in which the denominator included (a) all orders up to n, (b) even orders up to n, and (c) odd orders up to n (but also including n = 0). Dataset sizes were 102, 103, 104, and 105 samples. An exhaustive search similar to 2D datasets was done to find the estimated direction of unmixing, but with a smaller range of angles (−10° to 10°, with step-size 0.1° for −2° to 2°, and 0.5° otherwise). Error is defined as the average angle between the estimated direction of unmixing and the true direction of unmixing. Confidence limits for error were determined from 1,000 bootstraps.

The 5D datasets were generated by sampling a product distribution that was non-Gaussian along one axis and Gaussian along the orthogonal axes. Since an exhaustive search was impractical, we modified the standard FastICA algorithm (Hyvärinen, 1999) to use a Hermite-based contrast function. This required modification of the search algorithm of FastICA to gradient descent with a time-based learning schedule, since the standard search algorithm requires the contrast function to be linear in the distribution, but J15 is not. To increase the likelihood that the search converged to an extremum for all contrast functions, we used 25 random starts and selected the best result for analysis. For a pilot subset of distributions, using 100 random starts did not result in further improvement. These differences in search algorithm increased the computational cost of J15 by more than a factor of 20; for some of the larger datasets analyzed here, this factor was more than 1,000. We took a final step to ensure that any apparent difference in performance of the contrast functions was due to the contrast function itself and not due to differences in the search algorithm: we chose, for each contrast function, the direction at which it achieved its extreme value, including consideration of the directions identified by extremizing the other contrast functions.

Error and precision were assessed with measures that were analogous to those used in the 2D datasets for the estimation of an optimum set of Hermite functions. Error was quantified by the angle between the estimated and true directions of unmixing. Precision was quantified by the angular variance, a generalization of the circular variance to higher-dimensional distributions. In general, for a set of N unit vectors U, the angular variance av(U) is defined as:

av(U)=1−R; where R2=|1N∑j=1Nuj→|2.    (16)

Since an estimated direction of unmixing u⃗est and its negative cannot be distinguished by ICA, we replaced each estimated direction of unmixing by u⃗est=u⃗estsgn(u⃗est.u⃗true) before applying Equation (16).

2.4.2. Simulated EEG datasets

To test the approach on datasets that have a level of complexity comparable to biological data (e.g., EEG and the artifacts that corrupt EEG signals) yet also have a known ground truth, we used the simBCI package (Lindgren et al., 2018) to generate 249-electrode EEG data based on a realistic head model. The simulation included ocular artifacts referred to in the paper as eye blinks generated by each eye. We tested the ability of ICA to identify the directions corresponding to these artifacts. Simulations were 5 min long and sampled at 250 Hz. To assess the effect of the quantity of data, we applied ICA to 6-s, 15-s, 30-s, 1-min, 2.5-min, and 5-min segments, and analyzed 30 examples of each run length. As is standard, principal compoenent analysis was applied prior to ICA for an initial dimensionality reduction; we retained 50 principal components accounting for over 99% of the variance observed. Finally, each ICA run produced 20 components from which the two components corresponding to ocular artifacts were determined by visual inspection of the waveforms. Accuracy and precision were measured as described above for the 5D datasets, separately for the signals associated with each eye.

3. Results

Below, we explore the use of Hermite-based contrast function as a normality test and a viable option for ICA. As detailed in Section 2, the approach capitalizes on the fact that Hermite functions form an orthonormal basis set, in which one basis vector, the zeroth-order Hermite function, is a Gaussian. Thus, a probability distribution can be expressed as a linear sum of Hermite functions, with the zeroth order term representing the Gaussian part of the distribution. This allows us to measure non-Gaussianity by comparing the size of the zeroth-order coefficient to the size of higher-order coefficients, as formalized by Equation (9).

We assess the performance of this measure in two kinds of applications: as a normality test of one-dimensional distributions, and as a contrast function in ICA for multi-dimensional datasets. The tests of applicability to ICA use simple two- and five-dimensional datasets with known distributions and high-dimensional simulated EEG datasets with ocular artifacts.

3.1. Comparison of measure of non-Gaussianity to normality tests and contrast functions

We evaluate the potential use of J15 as a measure of non-Gaussianity by measuring its sensitivity to typical deviations from the normal distribution. To this end, we used three families of one-dimensional datasets deviating from normality in different aspects of the distribution: modes, heaviness of tails, and symmetry. For each family, a range of distributions with varying shape were tested. We expect that within a family, a good measure of non-Gaussianity will grow as the distribution shape becomes more non-Gaussian. To determine the advantage of the proposed method over the existing ones, we compared its performance against standard normality tests and contrast functions included in a popular ICA package (Hyvärinen, 1999).

Figure 1 compares the sensitivity of J15 for non-Gaussianity to standard normality tests for the three families described in Section 2.3.

www.frontiersin.org

Figure 1. Comparing sensitivity of J15 and standard normality tests for 1D datasets. Each panel analyzes a parametric family of distributions, shown below each abscissa (see Section 2.3 for details). (A–C) Bimodal distributions (Section 2.3.1) with a range of sample sizes. (D) Heavy-tailed distributions (Section 2.3.2). (E) Light-tailed distributions (Section 2.3.2). (F) Asymmetric distributions (Section 2.3.3). Within each panel, measures were normalized so that the maximum value of the mean is 1. Error bars correspond to ± 2SD.

For bimodal distributions (Figures 1AC), the Hermite-function measure, as expected, identifies the bimodal distribution as maximally non-Gaussian, and indicates that non-Gaussianity decreases as bimodality becomes less prominent. However, the standard tests of normality deviate from this behavior, and the way that they deviate can be understood in terms of the approach these tests employ. Specifically, the moment-based tests are heavily influenced by the tails and symmetry of the distributions: the D'Agostino's k-squared test, which is dominated by kurtosis performs poorly for distributions that are skewed (asymmetric bimodal), while the Jarque-Bera test, in which skewness has a heavier weighting, performs poorly for symmetric bimodal distributions. Note that though skewness and kurtosis both peak near the ends (α = 0, 1), D'Agostino k-squared test peaks for bimodal distribution (α = 0.5) and has only minor peaks near the ends. Tests based on cumulative distribution functions (Kolmogorov-Smirnov test and Anderson-Darling test) measure non-Gaussianity via the accumulated difference between the distributions of the data and a Gaussian: asymmetric distributions accumulate the difference faster resulting in greater sensitivity to non-Gaussianity for asymmetric distributions than for bimodal distributions. As tests based on frequency statistics (Shapiro-Wilk test) use principles intuitively similar to tests based on cumulative distribution functions, they behave similarly. Note also that these differences in behavior persist over a wide range of sample sizes (103 to 105 in Figures 1AC). We also examined the performance of the Lilliefors test and the Cramer-von Mises criterion; these are related to the Kolmogorov-Smirnov test and performed similarly to it (not shown) for these distributions and the others described below.

For heavy-tailed and light-tailed distributions (Figures 1D, E), all measures track non-Gaussianity and increase as the distribution deviates further from a normal distribution. For these symmetric distributions, moment-based tests rely solely on kurtosis since their skewness is zero. Despite this, the two moment-based tests have opposite performance for the two sets of distributions: the Jarque-Bera test performs poorly for light-tailed distributions and D'Agostino's k-squared test performs poorly for heavy-tailed distributions. This can be attributed to the way kurtosis is used by the two tests. In the Jarque-Bera test, kurtosis is squared, therefore it grows rapidly and is more sensitive to heavy-tailed distributions. D'Agostino k-squared test uses a function involving the cube root of kurtosis, which decreases slower than kurtosis for heavy tailed distributions but is approximately a linear function of family parameter β for light-tailed distributions. By virtue of the symmetry and unimodality of these distributions, tests based on cumulative-distribution (Kolmogorov-Smirnov test and Anderson-Darling test) and frequency statistics (Shapiro-Wilk test) perform as well as or better than J15. These trends persisted when sample size was varied (data not shown).

For asymmetric distributions (Figure 1F), all measures track non-Gaussianity but the standard normality tests have low precision, as indicated by wide error bars. Both moments-based tests perform poorly for distributions with heavier left tails, but for distributions with heavier right tail, D'Agostino's k-squared test performs better than Jarque-Bera test despite the latter giving more weight to skewness. This is because for this family of distributions, kurtosis rises faster with κ than skewness. Tests based on cumulative distribution function and frequency statistics quickly accumulate the difference between the cumulative data distribution and a Gaussian distribution and perform with good accuracy but less precision.

Figure 2 extends this analysis to the three contrast functions typically used in FastICA (Hyvärinen, 1999): the default function, based on kurtosis (FastICA-I), and two options based on approximations of entropy (FastICA-II and III). Results show that the performance of all three contrast functions has many features in common with that of the moment-based methods (Figure 1 in red and pink). For example, in bimodal distributions (Figures 2AC), the measure of non-Gaussianity peaks either for bimodal symmetric distribution in the center (α = 0.5) like Jarque-Bera test or for asymmetric bimodal distributions near the ends (α = 0, 1) like the D'Agostino's k-squared test. Consequently, they are relatively insensitive to deviations from non-Gaussianity in asymmetric bimodal distributions (α~0.25 or ~0.75). For heavy-tailed distributions (Figure 2D), FastICA-I behaves similarly to Jarque-Bera test (low accuracy and precision) but FastICA-II and III perform similarly to D'Agostino's k-squared test (good accuracy and precision). For light-tailed distributions (Figure 2E), all three FastICA methods perform similarly to D'Agostino's k-squared test and perform better than J15 with respect to precision and accuracy. For asymmetric distributions (Figure 2F), FastICA-I performs similarly to Jarque-Bera test (low accuracy and precision) and FastICA-II and III perform similarly to D'Agostino's k-squared test (good accuracy but low precision).

www.frontiersin.org

Figure 2. Comparison of the sensitivity of J15 to that of FastICA contrast functions for 1D datasets. (A–C) Bimodal distributions (Section 2.3.1) with a range of sample sizes. (D) Heavy-tailed distributions (Section 2.3.2). (E) Light-tailed distributions (Section 2.3.2). (F) Asymmetric distributions (Section 2.3.3). Other conventions as in Figure 1.

3.2. Utility as a contrast function in independent component analysis

To assess the utility of a Hermite-based measure of non-Gaussianity (J15) as a contrast function for ICA, we compared its performance against the ICA contrast functions used in Figure 2 for multi-dimensional datasets. For simple 2D datasets, we compare accuracy and precision in an exhaustive search and further use this framework to choose the set of Hermite functions for estimation of non-Gaussianity using Equation (9). For simple 5D datasets and simulated EEG datasets, we compare accuracy using ICA implementation of the contrast functions in FastICA (Hyvärinen, 1999).

3.2.1. 2D datasets

We generated 2D datasets by mixing a Gaussian and a non-Gaussian component (Section 2.4). The non-Gaussian components were drawn from the three families of distributions analyzed in Figures 1, 2 which represent different aspects of non-Gaussianity (bimodality, heaviness of tails, and asymmetry). Details of the distributions are provided in Section 2.4.

For 2D distributions, all methods were accurate (Figure 3) but J15 had the best precision for most distributions. The precision for FastICA contrast functions followed trends observed for 1D datasets in Figure 2. For bimodal distributions (Figures 3A, B), FastICA contrast functions have wider peaks than J15, corresponding to the finding in 1D distributions (Figures 2AC) that these methods have reduced sensitivity for structure near the center. For heavy-tailed distributions, FastICA methods have low precision (width of peak) and for light-tailed distributions, they have high precision. This behavior of FastICA methods is due to their over-sensitivity to weight in tails. For unimodal asymmetric distributions, as was seen for 1D datasets, J15 has better precision than FastICA contrast functions (Figure 3E). As expected, precision increased with the number of samples but the differences between methods persisted (data not shown).

留言 (0)

沒有登入
gif