Analyzing atomic force microscopy images of virus-like particles by expectation-maximization

Analyses of VLP images by area

Classical statistics presents an inconclusive trend for mean VLP area for the compressed VLPs as the particles age over the first 4 h in solution at 20 °C. The mean VLP area increases from 1509 nm² to 1668 nm² over hours 2 to 3. However, the mean observed VLP area decreases to 1397 nm² after 4 h of thermal aging. Across the same period, the standard deviation of the observed VLP distribution increases from 183 nm² to 248 nm² to 307 nm².

Histograms of the three aged VLP collections show that their distributions are non-normal (Fig. 1); consequently, the sample mean and standard deviation do not present a complete or nuanced view of VLP morphological changes. Cursory study of the histograms shows that the distributions become skewed towards larger VLP areas. All three samples have a significant number of VLPs around 1600 nm², but later aging times include a selection of VLPs of approximately 2000 nm².

Fig. 1: The Gaussian Mixture Model shows an increasing spread in particle size distributions as the VLPs thermally age.

Optimized Gaussian Mixture Model profiles for the single-pass AFM images based on VLP area following 2 h (A), 3 h (B) and 4 h (C) of thermal aging. The VLP suspensions were left at room temperature for 2–4 h prior to placement on a mica surface.

Quantile–quantile (QQ) plots better illuminate the deviation from normality of the three VLP populations (Fig. 2). A QQ plot is a graphical technique for determining whether a set of observed data belongs to a particular theoretical distribution. A quantile is the value below which a given fraction of the observations fall. For an ordered (i.e., lowest value to highest value) data set, the quantiles of the observed data are plotted against the quantiles of the theoretical distribution. Here, the theoretical quantiles are expressed as standard deviations from the mean because the data are assumed to be normally distributed. Were the observed data to adhere to the theoretical distribution, all points would align along the reference line (Fig. 2, red). All three collections of VLP sizes show significant deviations from linearity, indicating that none are normally distributed.
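The construction of a normal QQ plot described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Fig. 2: the function names, the plotting position (i − 0.5)/n, the use of a correlation coefficient as a linearity summary, and the example areas are all assumptions introduced here.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical, observed) pairs for a normal QQ plot.

    Theoretical quantiles are in standard deviations from the mean; observed
    values are standardized with the sample mean and standard deviation.
    """
    ordered = sorted(values)
    n = len(ordered)
    mu, sd = mean(ordered), stdev(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position (an assumed convention)
        points.append((std_normal.inv_cdf(p), (x - mu) / sd))
    return points

def qq_correlation(points):
    """Pearson correlation of the QQ points; values near 1 suggest linearity."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical VLP areas in nm^2 (illustrative only, not the measured data)
areas = [1180, 1210, 1490, 1520, 1550, 1570, 1600, 1620, 1650, 1990]
pts = normal_qq_points(areas)
r = qq_correlation(pts)
print(f"QQ correlation: {r:.3f}")
```

Plotting the pairs in `pts` against the line y = x gives the QQ plot; points that bend away from the line at either end correspond to the tail deviations discussed in the text.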

Fig. 2: Quantile–quantile plots of VLP area extracted from the AFM images versus a theoretical Gaussian distribution show that the thermally aged particles are not normally distributed.

Plots are shown for samples following 2 h (A), 3 h (B) and 4 h (C) of thermal aging. The nonlinearity of the Q–Q plots indicates that a single Gaussian does not accurately describe the VLP distribution. The red line represents perfect agreement between noiseless data and the model.

If the collections of VLPs exist as a mixture of distinct normal distributions, each with a different mean area and standard deviation, this data can be modeled by a GMM and deconvolved by the EM algorithm. Each set of VLPs extracted from collections of three AFM images was modeled with 2, 3, 4, and 5 latent Gaussian distributions. The EM algorithm extracts the mean, standard deviation, and relative contribution of each component to the mixture. The optimal complexity of a GMM for each collection was determined based on the fit of each model, as expressed by the log likelihood value and the linearity of the QQ plots for each model. When the inclusion of an additional Gaussian no longer significantly improves the log likelihood and linearity of the QQ plot, the GMM is assumed to be optimal. The final GMM can be expressed in tabular form (Table 1) or graphical form (Fig. 1).
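To make the fitting step concrete, the following is a minimal sketch of EM for a one-dimensional Gaussian mixture, written with the Python standard library only. It is not the authors' implementation: the function names, the quantile-based initialization, the fixed iteration count, the floor on the standard deviation, and the example data are all assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, means, sigmas):
    """Density of the Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

def fit_gmm_em(data, k, n_iter=300):
    """Fit a k-component 1D GMM by EM; returns (weights, means, sigmas, loglik)."""
    ordered = sorted(data)
    n = len(ordered)
    # Initialization (an assumption): means at evenly spaced sample quantiles,
    # equal weights, and one shared starting standard deviation.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    spread = (ordered[-1] - ordered[0]) / (2.0 * k)
    sigmas = [spread] * k
    weights = [1.0 / k] * k
    sigma_floor = 1e-3 * spread  # keeps a component from collapsing to a point

    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        resp = []
        for x in ordered:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sigmas)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate relative contribution, mean, and std. dev.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, ordered)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, ordered)) / nj
            sigmas[j] = max(math.sqrt(var), sigma_floor)

    loglik = sum(math.log(mixture_density(x, weights, means, sigmas))
                 for x in ordered)
    return weights, means, sigmas, loglik

# Hypothetical areas in nm^2: a small group near 1180 and a large group near 1560
small = [1100, 1140, 1180, 1200, 1220, 1260]
large = [1500 + 4 * i for i in range(30)]
areas = small + large

fits = {k: fit_gmm_em(areas, k) for k in (1, 2, 3)}
for k, (w, m, s, ll) in fits.items():
    print(k, [round(v) for v in m], [round(v, 2) for v in w], round(ll, 1))
```

Because adding components tends to raise the log likelihood, the comparison of interest is how much it improves from one model to the next, which is why the text pairs the log likelihood with the linearity of the QQ plot when choosing the model complexity.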

Table 1 Fitted Gaussian Mixture Models parameters for thermally aged VLPs imaged by a single pass of an AFM tip

Application of the EM algorithm (Table 1 and Fig. 3A) indicates that after 2 h of thermal aging, the VLPs are best modeled by a multi-normal distribution with mean areas of 1184 nm² (15% of the VLPs) and 1556 nm² (85% of the VLPs). After 3 h, three intrinsic normal distributions are extracted from the GMM (Table 1 and Fig. 3B): 1331 nm² (12% of the VLPs), 1585 nm² (60% of the VLPs), and 1993 nm² (28% of the VLPs). After 4 h of aging, four components were observed (Table 1 and Fig. 3C): 1107 nm² (41% of the VLPs), 1351 nm² (20% of the VLPs), 1639 nm² (30% of the VLPs), and 2011 nm² (8% of the VLPs). For the 4-h images, four Gaussians were chosen based on the visual fit of the model to the data (Fig. 3C). Cursory comparison of the QQ plots for the GMM (Fig. 3) indicates a significantly better fit of the model to the data than with a single normal distribution (Fig. 2). Pearson’s Chi-squared test for count data indicated that the modeled distributions at the three separate times are statistically different (p < 0.01).

Fig. 3: Quantile–quantile plots of VLP area extracted from the AFM images versus the optimized Gaussian Mixture Model distribution show that the thermally aged particle distribution conforms to the GMM.

Plots are shown for samples following 2 h (A), 3 h (B) and 4 h (C) of thermal aging. That the Q–Q plots are linear indicates confidence that the GMM accurately describes the VLP distribution. The simplest GMM that yielded a linear Q–Q plot was retained. The red line represents perfect agreement between noiseless data and the model.

The 95% confidence interval for the mean value of each intrinsic distribution can be estimated by \(\pm t\sigma /\sqrt{n},\) where t is the tabulated Student t-value for n − 1 degrees of freedom, σ is the standard deviation of the fitted normal distribution, and n is the number of observed samples in the distribution (Table 1, column 5). For each Gaussian in the GMM, determination of n is not straightforward. Here, n for each distribution is estimated as the number of observations divided by the number of distributions in the GMM. Hence, n is 18, 11, and 6 for these three collections of VLPs analyzed by AFM.
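As a worked illustration of the interval half-width t·σ/√n and of the estimate of n, the sketch below uses the particle counts reported for the area-based analysis (36, 33, and 23 VLPs with 2, 3, and 4 components). The function names are hypothetical, the t-values are standard two-sided 95% tabulated values entered by hand, and the σ of 100 nm² is an assumed value rather than an entry from Table 1.

```python
import math

# Two-sided 95% Student t critical values for the degrees of freedom needed
# here (standard tabulated values, rounded to three decimals).
T_95 = {5: 2.571, 10: 2.228, 17: 2.110}

def effective_n(n_observed, n_components):
    """Estimate n per component as observations divided by components."""
    return round(n_observed / n_components)

def ci_half_width(sigma, n):
    """Half-width of the 95% confidence interval for a component mean."""
    t = T_95[n - 1]  # n - 1 degrees of freedom
    return t * sigma / math.sqrt(n)

# Counts of VLPs and GMM components for the 2-h, 3-h, and 4-h collections
n_values = [effective_n(36, 2), effective_n(33, 3), effective_n(23, 4)]
print(n_values)

# Assumed sigma of 100 nm^2 (hypothetical, for illustration only)
half_widths = [ci_half_width(100.0, n) for n in n_values]
print([round(h, 1) for h in half_widths])
```

For a fixed σ the interval widens quickly as n falls, because both the t-value and 1/√n grow, which is one reason the components of the 4-h model carry comparatively wide limits.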

Considering the three GMM analyses as a whole, three trends are evident. First, all optimized GMMs extract a set of VLPs with an area of approximately 1600 nm², and these sets are of comparable size based on the 95% confidence limits. Second, the two longer aging times exhibit VLPs with a comparable average area around 2000 nm². These larger observed structures would be expected because, as the bonds within the VLPs change, the VLPs may become structurally less rigid and consequently flatten to a larger area (refs. 15, 43). Third, the 2 h and 4 h VLPs present a smaller structure with an average area around 1100 nm², while the 3 h and 4 h VLPs have a larger structure around 1300 nm². Given the small number of AFM sampling sites and number of VLPs analyzed, these two distributions might converge to a single normal distribution in a larger collection of AFM images of VLPs.

Analysis of all 92 determined VLP areas from the three aging times as a single distribution resolves four distinct groups of VLPs (Fig. 4D). While a three-factor GMM appears to be a reasonable description of the VLP histogram (Fig. 4C), the QQ plot of the model shows significant deviations at the large-area end of the particle distribution and minor deviations at the small-area end (Fig. 4A). Adding a fourth term to the GMM provides a much better fit of the model to the data, reducing deviations at both extremes (Fig. 4B). The parameters of the 3- and 4-factor GMMs are similar: the distribution centered at ~1615 nm² differs between the models only by 4 nm² in mean, 2 nm² in standard deviation, and 2% in contribution. However, inclusion of a fourth component adds a factor explicitly modeling the distribution of the smallest particles (µ = 1054 nm²; σ = 48 nm²) and slightly shifts the estimated means and standard deviations of the other two distributions (Table 1). Consequently, the net observed distribution of VLP sizes is better described by a 4-factor GMM than by a 3-factor GMM. Pearson’s Chi-squared test for count data indicated that the 3-factor and 4-factor models are statistically different at p = 0.051.

Fig. 4: The four-component Gaussian Mixture Model fits the VLP size distribution better than the three-component Gaussian Mixture Model.

The improvement of fit in increasing from a three-component Gaussian Mixture Model (A, C) to a four-component Gaussian Mixture Model (B, D) is evident in the increased linearity of the respective quantile–quantile plots (A, B).

Applying the GMM to all 92 VLPs provides results similar to those from analyzing the three aging times separately. The ensemble set of VLPs returns distributions centered at 1054 nm², 1479 nm², 1617 nm², and 2045 nm². Every distribution of VLPs resolved by applying a GMM to individual aging times aligns with the four distributions observed in the ensemble data. The 3-h (1993 ± 120 nm²) and 4-h (2011 ± 66 nm²) distributions are statistically indistinguishable from the ensemble distribution centered at 2045 ± 62 nm². Similarly, the ensemble distribution at 1617 ± 22 nm² is matched by the 2-h (1566 ± 60 nm²), 3-h (1585 ± 46 nm²), and 4-h (1639 ± 80 nm²) distributions. However, the resolved GMM parameters deviate more between the ensemble and individual aging times at the low-area end of the sample set. For the individual aging times, the mean values of the resolved distributions with the smaller areas all lie between the ensemble distributions centered at 1054 nm² and 1479 nm². This could be explained by the low number of VLPs that inherently belong to the smallest distribution. For example, for 2 h, the distribution centered at 1184 nm² contains only 15% of the parent distribution (i.e., 5 or 6 VLPs total). It is unsurprising that the GMM would fail to resolve two sub-populations and would instead estimate a single mean between the 1054 nm² and 1479 nm² centers.
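One simple way to read the comparisons above is to check whether the 95% confidence intervals of two means overlap. The sketch below applies that check to the means and limits quoted in this paragraph. Treating interval overlap as the criterion for "statistically indistinguishable" is an assumption made here for illustration; the text does not state the exact test used, and the helper names are hypothetical.

```python
def interval(center, half_width):
    """Return the (low, high) bounds of center ± half_width."""
    return (center - half_width, center + half_width)

def overlaps(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Means and 95% limits (nm^2) as quoted in the text
ensemble_large = interval(2045, 62)
ensemble_mid = interval(1617, 22)

large_by_time = {"3 h": interval(1993, 120), "4 h": interval(2011, 66)}
mid_by_time = {
    "2 h": interval(1566, 60),
    "3 h": interval(1585, 46),
    "4 h": interval(1639, 80),
}

large_matches = {t: overlaps(iv, ensemble_large) for t, iv in large_by_time.items()}
mid_matches = {t: overlaps(iv, ensemble_mid) for t, iv in mid_by_time.items()}
print(large_matches)
print(mid_matches)

# Share of the 36 VLPs at 2 h assigned to the smallest component (15%)
print(0.15 * 36)
```

Overlap of intervals is a conservative heuristic rather than a formal hypothesis test, so it is best read as a consistency check on the reported limits.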

Analyses of VLP images by VLP width (single AFM tap)

Manually determining the width of the VLP in each image enabled observation of more VLPs than the area-based method discussed previously. With the width-based method, 54, 33, and 42 individual particles were extracted from the three thermal aging times, compared to 36, 33, and 23 VLPs by the area-based method. Note that in the following section, where each VLP was scanned twice by AFM prior to determining the VLP area and width, fewer than 10 VLP areas could be reliably determined across three times, yet 98 individual VLP widths were calculated.

Analysis of the width-based collection of VLP data presents a slightly less complex view of the VLP populations at each time than the area-based analysis (Table 2 vs. Table 1). Perusal of the QQ plots shows good linearity for each model (Fig. 5A, C, E). For the 2-h time, both methods present the majority of the VLPs as larger, with a smaller population (15% vs. 10%) being of lesser dimension (Fig. 1A vs. Fig. 5B). The width-based measurements model the larger VLPs with 2 factors; however, the mean values of both distributions are statistically indistinguishable. For the 3-h time, the model fit for the width-based measurements did not improve when using more than one normal distribution. This is in contrast to the area-based analysis, which was optimally modeled with 3 normal distributions, each with a statistically different average area (Fig. 1B vs. Fig. 5D). Similarly, the area-based analysis for the 4-h time identifies 4 unique VLP distributions, each with a statistically different mean area, while the width-based analysis identified only 3 unique populations (Fig. 1C vs. Fig. 5F). Pearson’s Chi-squared test for count data indicated that the modeled distributions at the three separate times are statistically different (p < 0.01).

Table 2 Fitted Gaussian Mixture Model parameters for thermally aged VLPs imaged by a single pass of an AFM tip

Fig. 5: The optimized Gaussian Mixture Model shows a good fit to the distribution of VLP widths.

Quantile–quantile plots and optimized Gaussian Mixture Model fits to the VLP widths extracted from the AFM images following 2 h (A, B), 3 h (C, D) and 4 h (E, F) of aging, after a single nanoindentation AFM pass. That the Q–Q plots are linear indicates confidence that the GMM accurately describes the VLP distribution. The simplest GMM that yielded a linear Q–Q plot was retained. The red line represents perfect agreement between noiseless data and the model.

The difference in model complexity between area-based and width-based analyses of the AFM images holds when all nine collected images, across three aging times, are combined into a single population (Fig. 4D vs. Fig. 6A). The area-based analysis resolved 4 distributions of VLPs: two minor components that were either smaller (7%) or larger (9%) than the two main factors that constituted 84% of the particles (Table 1). However, the width-based analysis resolved only two main components, seemingly unable to extract the smaller and larger minor components. Of course, without further analysis of a larger data set, it is impossible to conclude with certainty whether the 2-component or 4-component model is the more faithful description of the true VLP distribution.

Fig. 6: Comparison of optimized Gaussian Mixture Models for single- and double-pass AFM imaging indicates that the VLPs are significantly wider after the second pass with nanoindentation.

Distributions presented are from the ensemble of all AFM images collected with a single AFM pass (A) and a double AFM pass (B).

Analyses of VLP images by VLP width (double AFM tap)

Because the protocol for estimating VLP area mostly fails when individual particles form clusters or are touching, an alternate procedure to collect AFM images was investigated. Here a rapid, low spatial resolution AFM image of a large area was collected to identify regions of interest with the greatest number of non-contiguous VLPs. Those regions of interest were then resampled at higher spatial resolution. As such, the VLPs were double sampled during analysis. Unfortunately, with the stiff AFM probes, this resulted in wider (80–140 nm vs. 50–100 nm) and flatter VLPs observed during analyses. Ultimately, the resampled VLPs were too flat at the edges to determine the VLP boundary for the area-based procedure. Consequently, only width-based models could be constructed.

For the 0-, 1-, and 2-h times, EM analyses indicated a bi-normal distribution (Fig. 7). At 0 h and 1 h, the corresponding means of the two distributions (the first at ~98 nm and the second at ~113 nm) are statistically indistinguishable between the two times at the 95% confidence level (Table 3). However, after 2 h, both distributions have a statistically greater mean. The smaller of the two distribution means increases from ~98 nm to 110 nm and the larger increases from ~113 nm to 125 nm. This may indicate that the VLPs lose rigidity during aging and spread out over a larger area following AFM compression (refs. 15, 43). By comparison, the 3-h data were optimally modeled with a single Gaussian distribution. This distribution lacked the features wider than 130 nm that are present in the 2-h data. One possible explanation is that further aged VLPs were tapped sufficiently flat that they did not rise above the baseline noise of the image and image processing.

Fig. 7: Optimized Gaussian Mixture Model fits to the VLP widths extracted from the double-pass AFM imaging show a one- or two-component model.

Distributions presented are from AFM images collected after 0 h (A), 1 h (B), 2 h (C), and 3 h (D) of aging, following a second nanoindentation AFM pass.

Table 3 Fitted Gaussian Mixture Model parameters for thermally aged VLPs imaged by a double pass of an AFM tip

Analyzing the ensemble data from all 12 AFM images, spanning 4 aging times, indicates four normal sub-populations of VLP diameters (Table 3). However, the symmetric spacing and widths of the three leftmost Gaussians (Fig. 7B) are consistent with describing a platykurtic distribution by normal curves. Without a more extensive investigation, it is not feasible to determine whether the double tapping of the AFM analyses leads to a wider, non-normal distribution of VLP widths, or whether this one set of data was anomalously wide. However, the fact that the EM algorithm did not model the data by fitting two Gaussians centered at the 100 nm and 110 nm spikes in the histogram lends credence to the view that this is a single platykurtic distribution of VLPs rather than a bimodal one. Consequently, the ensemble data are better viewed as two distributions of VLP widths across all observed aging times.
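A platykurtic distribution is flatter-topped and lighter-tailed than a normal distribution, which shows up as a negative excess kurtosis. The short sketch below computes that statistic for an evenly spread, flat-topped set of widths. The function name and the data are hypothetical and serve only to illustrate the term; no kurtosis values are reported in the text.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3.

    A normal distribution gives 0; negative values indicate a platykurtic
    (flat-topped) shape.
    """
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    return m4 / (m2 * m2) - 3.0

# Hypothetical flat-topped widths in nm, evenly spread from 90 to 130
flat_widths = [90 + 0.5 * i for i in range(81)]
k_flat = excess_kurtosis(flat_widths)
print(round(k_flat, 2))  # negative for this flat-topped example
```

A mixture model has no single flat-topped component, so EM can approximate such a shape only by placing several evenly spaced Gaussians side by side, which matches the pattern described for the three leftmost components.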

Taken together, our methodology, which uses AFM in conjunction with a GMM fit by the EM algorithm, provides an opportunity to investigate the bulk morphology of VLPs and the potential to identify VLP morphological changes. Herein, we report that VLP morphological changes are occurring, based on observed changes in the shape and diameter of the VLPs. These changes in VLP shape and size may be due to a multitude of factors, including room temperature aging, local temperature and pH alterations, stabilization buffer contents, and number of freeze-thaw cycles (refs. 15, 43, 44). Moreover, the purified HPV VLP intermediates studied herein were specifically selected for straightforward analytical method development, because these intermediates allow a sample of only particles to be investigated (i.e., no drug product formulation components). Establishing the definitive cause of any VLP shape or size changes would require further studies. Notably, the methodology showcased here illustrates, for the first time, the potential of nano-indentation AFM in combination with GMM and EM to probe the internal structural integrity of VLPs, as opposed to their topography, to reveal information about changes in VLPs.
