zMAP toolset: model-based analysis of large-scale proteomic data via a variance stabilizing z-transformation

Workflow of zMAP module

Given a large-scale ILMS data set, zMAP toolset transforms each ILMS intensity into a z-statistic that essentially assesses the statistical significance of the deviation of this measurement from that (of the same protein) in the corresponding reference sample. The final p-value for a certain protein is derived by integrating the associated statistical significances in all samples. For this p-value to make biological sense, the reference samples in all MS runs must be biologically identical, and the corresponding null hypothesis is no differential expression between each involved condition and the biological state represented by the reference samples. In practice, reference samples are almost always designed to be the average of all the involved conditions. In this case, the corresponding null hypothesis amounts to no differential expression between each pair of the conditions, and we refer to significant proteins as hypervariable ones.

Two computational modules, named zMAP and reverse-zMAP, have been implemented in the zMAP toolset for handling different scenarios. We first introduce the former, which is specifically developed for ILMS data sets without real reference samples but whose MS runs are all associated with the same biological composition (i.e., they are biological replication of each other).

The basic principle of zMAP module is considering the average of the ILMS samples from each MS run as a pseudo reference profile for these samples. Then, the M-values of each sample against the corresponding reference profile are scaled based on estimated signal variances, producing z-statistics for a final integration across MS runs. Unlike traditional z-score transformation, which is based on observed variance, zMAP module derives the variance estimates by borrowing strength between proteins with similar intensity levels and fitting a smooth mean-variance curve (MVC) for each MS run. Figure 1a depicts the workflow of zMAP module on a single MS run. In detail, after normalizing protein-level ILMS intensities across multiple samples generated by the same MS run, zMAP makes a log2 transformation and models the results as following normal distributions, with the mean and variance parameters linked by an unknown MVC. To fit this MVC, a mean-variance scatter plot for all proteins is drawn, and zMAP uses a sliding window to scan this plot. In this process, the proteins with similar average log2-intensities across samples are grouped together, and a variance estimate is derived for each individual window. For the latter part, zMAP performs a quantile regression that regresses the observed variances of included proteins against a scaled chi-square distribution by fitting a straight line through the origin, and the slope is taken as the variance estimate. To avoid the influence of potential hypervariable proteins, which can lead to an overestimated variance, only a certain proportion of the proteins with the smallest observed variances are used for the quantile regression. Finally, the MVC is fitted by regressing the variance estimates from all windows against the corresponding intensity levels. Notably, this sliding-window process equipped with the robust quantile regression can avoid the requirement of biological replicates within each single MS run for MVC fitting.

Fig. 1figure 1

Applying zMAP module to ILMS samples generated by a single MS run. a The workflow of zMAP module. b The experiment design of each MS run for the iTRAQ data set regarding human erythropoiesis at fetal and adult stages. HSPC, hematopoietic stem and progenitor cell; ProE, proerythroblast. c Mean-variance scatter plot of all proteins, with colors indicating the BH-adjusted p-values derived for identifying hypervariable proteins. All detected hemoglobin subunits are explicitly marked. d, e For each sample, the associated (d) traditional z-scores or (e) z-statistics derived by zMAP are plotted against the corresponding theoretical quantiles of the standard normal distribution. f For the hemoglobin subunits and a list of housekeeping proteins, heat maps showing their traditional z-scores and z-statistics derived by zMAP in all samples. The associated chi-square statistics and BH-adjusted p-values are also displayed

To demonstrate the practical utility of zMAP module, we applied it to an iTRAQ data set regarding human erythropoiesis at both fetal and adult stages [30]. This data set was comprised of five replicate MS runs, each of which generated four proteomic profiles corresponding to hematopoietic stem/progenitor cells (HSPCs) and proerythroblasts (ProEs) at fetal and adult stages (labeled as F0, F5, A0, and A5; Fig. 1b).

We first applied zMAP to the first MS run. In practical analysis, an essential parameter for the sliding-window process, denoted by W, is the exact proportion of proteins in each window that are used to perform the quantile regression. The setting of W depends directly on the abundance of differentially expressed proteins across samples (Additional file 2: Note S1). Here, we tried the default setting (W = 30%) and assessed the goodness of fit for each window by calculating the R-squared (R2) statistic. All the R2 values were above 0.95 (Additional file 1: Fig. S1a). Besides, zMAP also generated a diagnostic plot to summarize the quantile-quantile plots of all windows. Specifically, the observed variances associated with each window were first ordered and scaled by the corresponding variance estimate (i.e., the slope of the fitted line). Then, the scaled variances were averaged across all windows (note that all windows covered the same number of proteins). Finally, the results, together with error bars to indicate the variability across windows, were plotted against the theoretical quantiles (Additional file 1: Fig. S1b). We found that the scaled variances that had been used for the quantile regressions matched well with the line \(y=x\), while the other scaled variances, especially the largest ones, deviated clearly from the line, suggesting the default setting of W was suitable in this case. In practice, this diagnostic plot can be useful for fine-tuning W.

After the sliding-window process, the variance estimates and average log2-intensities of all windows were subject to a regression procedure to fit the MVC. Based on previous studies of iTRAQ data [24, 31, 32], an exponential decay function was fitted, with an R2 of 0.996. zMAP next identified hypervariable proteins among the four conditions by calculating the ratio of the observed variance of each protein to the corresponding variance indicated by the MVC, which was obtained by applying the exponential function to the average log2-intensity of the protein. This ratio followed a scaled chi-square distribution under the null hypothesis of no differential expression between each pair of conditions, and a p-value was accordingly derived. As an evaluation of zMAP, we specifically examined the results regarding all known subunits of hemoglobin (the HBA2 subunit was excluded because none of the five MS runs had detected it). It was found that zMAP deemed all these subunits as significant hypervariable proteins (BH-adjusted p-value < 1e − 7 for each of them; Fig. 1c).

Finally, zMAP transformed the log2-intensities of each protein by subtracting its average log2-intensity and dividing the results by the standard deviation implied by the MVC. The only difference between this z-transformation and the traditional z-score transformation is that the latter uses observed standard deviation as the scaling factor, which may cause a compression of biologically meaningful signal changes across samples. Here, we tried both transformations and aligned the results with the standard normal distribution separately for each sample (Fig. 1d, e). The empirical distributions of traditional z-scores were even more concentrated around zero than the standard normal distribution, while those of the z-statistics derived by zMAP were relatively more long-tailed, suggesting an improved statistical power for identifying hypervariable proteins. We further examined the transformation results of the hemoglobin proteins and a list of housekeeping proteins (Fig. 1f). For the traditional transformation, the intensity differences among samples were considerably compressed for both classes of proteins. In comparison, the zMAP transformation dramatically increased the sensitivity to biologically meaningful intensity differences without sacrificing the specificity. Specifically, all the hemoglobin proteins were upregulated during human erythropoiesis at fetal and/or adult stage. Consistent with previous research [33, 34], HBE1, HBG1, HBG2, and HBZ were mainly expressed at the fetal stage; HBB and HBD were mainly expressed at the adult stage; HBA1 was highly expressed at both stages. We applied zMAP separately to each of the other four MS runs and found similar results (Fig. 2a).

Fig. 2figure 2

Using different statistics to integrate ILMS samples across MS runs. a For the hemoglobin subunits and the housekeeping proteins, heat maps showing their traditional z-scores and z-statistics derived by zMAP for all the five MS runs. The associated chi-square statistics and BH-adjusted p-values are also displayed. bd Two-dimensional PC plots generated by performing PCA on all samples from all MS runs. Each plot corresponds to a kind of measurements for integrating ILMS samples across MS runs. e, f For each kind of measurements, the PCC between each pair of samples with the same biological label was calculated. The PCCs derived from the z-statistics of zMAP are compared to those from (e) traditional z-scores and (f) centered log2-intensities

Integrating proteomic profiles across MS runs and detecting proteins important for human fetal erythropoiesis

A hierarchical clustering of all samples from the total five MS runs was performed based on the Pearson correlation coefficient (PCC) between each pair of them. When the PCCs were calculated by using original (untransformed) log2-intensities, the samples were perfectly clustered by their MS runs of origin, indicating severe batch effects (Additional file 1: Fig. S2a). By contrast, the samples were clustered by their biological labels when the z-statistics derived by zMAP were used (Additional file 1: Fig. S2b).

We next benchmarked zMAP against two other methods for integrating these samples, which were respectively based on traditional z-scores and centered log2-intensities. The latter was equivalent to the M-values of each sample against the corresponding pseudo reference profile. Similar to the z-statistics of zMAP, both traditional z-scores and centered log2-intensities have led to a hierarchical clustering result that was perfectly consistent with the biological labels of samples, suggesting an effective removal of batch effects (Additional file 1: Fig. S2c, d).

We performed PCA on all the samples by using separately the measurements provided by each method. It was found that the samples were well clustered by their biological labels in the two-dimensional PC plot generated by zMAP (Fig. 2b). Moreover, the F0 and A0 clusters were relatively close to each other, and two separate differentiation trajectories starting from them could be depicted in the plot, corresponding to HSPC-to-ProE differentiation at fetal and adult stages respectively. In comparison, the PC plots produced by the other two methods did not distinguish between different biological labels as clearly as did the zMAP plot (Fig. 2c, d). In particular, the F0 and A0 samples tended to mix with each other.

For a more quantitative evaluation of the methods, we assessed the consistency between each pair of samples with the same biological label, based on the measurements provided by each method. Each biological label was associated with five samples from five different MS runs, and we calculated the PCC between each pair of them. In each case, the PCC derived based on the z-statistics of zMAP was higher than those from the other two methods (Fig. 2e, f).

As a general technical problem of iTRAQ experiments, ratio compression has been suggested to arise from contamination during precursor ion selection by co-eluting peptides [24, 35]. These peptides contribute a background value equally to each reporter ion signal, leading to an increase in the intensity levels and a shrinkage of observed intensity fold changes across samples. For the same protein, the influence of ratio compression can be different across MS runs, resulting in different degrees to which the observed variance of log2-intensities is diminished. In the data set about human erythropoiesis, we observed, for specific proteins, that the observed variances in different MS runs were strongly negatively correlated with the average log2-intensities, and that such correlation was effectively eliminated when the observed variances were scaled based on the corresponding MVCs (Additional file 1: Fig. S3a, b). Globally, the median PCC (for all proteins) between observed variances and average log2-intensities was − 0.34, and it became 0.09 when the scaled variances were used (Additional file 1: Fig. S3c).

We next used zMAP to identify hypervariable proteins based on all samples generated by the total five MS runs. This analysis could be performed in either an unsupervised or a supervised manner. For the former, the chi-square statistics derived by zMAP along with the associated numbers of degrees of freedom were summed across MS runs, producing p-values that assessed the overall expression variability of each protein (see “Methods”). In total, we identified 2290 significant hypervariable proteins (BH-adjusted p-value < 0.05; Additional file 2: Note S2). A hierarchical clustering of these proteins revealed quite a few meaningful protein expression patterns (Fig. 3a). For example, the largest cluster comprised 591 proteins whose expression was decreased during human erythropoiesis at both fetal and adult stages. GO enrichment analysis showed that these proteins were significantly enriched in several biological processes, including regulation of actin filament length, regulation of cellular component size, and glycolytic process. All of them were associated with stem cell maintenance and self-renewal [36, 37]. Another example cluster consisted of 316 proteins that were specifically upregulated during adult erythropoiesis. These proteins were enriched in biological processes related to ATP and nucleoside synthesis. Consistently, it has been reported that many ATP synthesis genes are subject to post-transcriptional regulation in adult ProEs [30].

Fig. 3figure 3

Simultaneously comparing ILMS samples from all MS runs. a Heat map showing the z-statistics of significant hypervariable proteins identified from the unsupervised comparison analysis. These proteins were clustered into 12 groups. Representative biological processes for most groups (from GO enrichment analysis) are displayed. b In the supervised comparison analysis, ranking all proteins by the standard deviation of average z-statistics in the four biological conditions. Top 0.5% were selected as hypervariable proteins. c Heat map showing the z-statistics of these hypervariable proteins. Proteins that were detected in only one or two MS runs are not displayed. d Using qRT-PCR to measure the mRNA expression levels of S100A8, S100A9, CHI3L1, and PNMT in human fetal HSPCs under different conditions. NC, negative control. e Assessing the progress of erythroid differentiation under different conditions based on the expression of CD71 and CD235 (accessed via flow cytometry). Two biological replicates using different shRNAs were incorporated. KD, knock-down

Alternatively, the hypervariable analysis could be conducted in a supervised manner in which the samples were grouped by their biological labels. Here, we tried a simple computational pipeline for selecting hypervariable proteins across the four conditions. First, the z-statistics associated with each protein were averaged separately within each group. Then, the standard deviation of the average z-statistics was used to rank all proteins, and we selected top 0.5% as hypervariable ones (Fig. 3b). We further clustered these proteins and found that the vast majority of them had elevated expression specifically in one or two conditions (Fig. 3c). Similar results were observed when different cutoffs were applied to the selection of hypervariable proteins (Additional file 1: Fig. S4). Previously, we had studied the dynamics of protein expression during human erythroid differentiation at adult stage [30]. Here, we focused on fetal erythropoiesis and selected four proteins for further exploration, which were PNMT, CHI3L1, S100A9, and S100A8. All of them showed a potential to promote erythroid differentiation at fetal stage: the expression of PNMT and CHI3L1 was concentrated in F5; S100A9 and S100A8 were mainly expressed in F0 and F5, with the expression in the latter being even higher than in the former (Fig. 3c). To assess whether these proteins were required for fetal erythropoiesis, we employed an in vitro differentiation assay of human fetal HSPCs into ProEs, and utilized short hairpin RNA (shRNA) to separately knock down the four proteins in fetal HSPCs on day 0 (Fig. 3d). Upon the depletion of each protein, we observed significantly suppressed erythroid differentiation. Specifically, fetal HSPCs with PNMT, CHI3L1, S100A9, or S100A8 knock-down generated much fewer CD71+CD235+ ProEs on day 6 (Fig. 3e; Additional file 1: Fig. S5), suggesting these genes play an indispensable role in the maturation of human fetal erythroid cells.

Workflow of reverse-zMAP module

The practical applicability of zMAP module is limited. For example, in cancer studies with a large cohort of patients, different MS runs are typically designed to handle different individuals. The zMAP module is not applicable in such cases because the pseudo reference profiles constructed by it for different MS runs are not biologically identical, owing to the heterogeneity across patients. Another practical concern is that, when the number of conditions involved in each single MS run is large (such as in 8-plex iTRAQ and 11-plex TMT platforms), the differential proteins between samples can be abundant and the proportion of proteins that are suitable to use for the quantile regression in each sliding window can be very small, leading to unreliable variance estimates. Besides, fitting a single MVC for a large number of samples may not be flexible enough to allow for the variation of mean–variance trend across samples.

In the above scenarios, we recommend adopting the experiment design of adding a true reference sample to each MS run, which has been widely adopted by many cancer proteomic studies [23, 38] and is also the only requirement for applying the reverse-zMAP module. The basic principle of reverse-zMAP module is to fit sample-specific MVCs by separately comparing each sample to the corresponding reference sample, for which the M-values of all proteins are calculated and a sliding window is used to group proteins with close intensity levels (Fig. 4a). In each window, the M-values of the enclosed proteins are approximately considered as following the same normal distribution, and the associated parameters are estimated by applying a quantile regression against the standard normal distribution. In detail, this regression is achieved by fitting a straight line, and the intercept and slope of it are taken as the mean and standard deviation estimates, respectively (similar to the zMAP module, the M-values are ordered and only the middle 50%, by default, are used to fit the line). Next, the standard deviation estimates from all windows are gathered to fit an MVC, and the mean estimates are used to model the trend of M-values along the range of intensity levels, producing an M-A curve that essentially serves as a baseline for correcting for normalization biases. Finally, the M-values of all proteins are transformed into z-statistics, with the M-A curve and MVC used for centering and scaling them, respectively (see “Methods”).

Fig. 4figure 4

Benchmarking for the reverse-zMAP module. a The workflow of reverse-zMAP module. b, c For the ovarian carcinoma data set, the PCC between each pair of samples with the same biological context was calculated. The PCCs derived from the z-statistics of reverse-zMAP are compared to those from (b) M-values and (c) traditional z-scores

In order to benchmark reverse-zMAP, we collected a TMT data set that comprised three replicate MS runs [39], such that we could compare different methods for integrating samples across MS runs by evaluating the consistency between replicate samples after applying the corresponding transformations. Each MS run generated 11 samples, profiling the proteomes of 7 low-grade serous ovarian carcinoma (LGSOC) cell lines, a uniform mixture of protein extracts from these LGSOC cell lines, a uniform mixture of 5 high-grade serous ovarian carcinoma (HGSOC) cell lines, and a uniform mixture of all the 12 ovarian carcinoma cell lines (two replicates for this mixture).

We then considered the LGSOC-mixed sample in each MS run as internal reference and accordingly applied the reverse-zMAP module. Similar to the previous benchmarking analysis, we have also used M-values and traditional z-scores to integrate samples across MS runs. The former was derived by comparing each sample to the corresponding reference sample, and the latter was derived by applying z-score transformation to all the M-values (from all MS runs) associated with each protein. Finally, we calculated the PCC between each pair of samples with the same biological context, based on the different kinds of signal measurements. It was found that the PCC derived based on the z-statistics of reverse-zMAP was always higher than those from M-values and traditional z-scores (Fig. 4b, c).

Applying reverse-zMAP to a TMT data set about human hepatocellular carcinoma (HCC)

We applied reverse-zMAP to a TMT data set that profiled the proteomes of 159 hepatitis B virus (HBV)-related HCC patients [23]. This data set comprised 33 MS runs, each of which generated 11 samples corresponding to the tumor tissues and NATs of 5 patients plus a reference sample (samples of 6 patients were later excluded in the original study because of low quality; Additional file 1: Fig. S6). A mixture of equal amounts of protein extracts from the tumor tissues and NATs of 50 patients was used to generate the reference sample in each MS run.

When applying reverse-zMAP, we examined the goodness of fit for the associated regressions. For all pairwise comparisons, the quantile regressions performed in the sliding-window process all achieved an R2 value above 0.99, and the median R2 values for the associated fitting of M-A curves and MVCs were 0.87 and 0.72, respectively (Additional file 1: Fig. S7). Note that the observed mean–variance trend varied considerably across samples and was typically not as regular as observed on the previous iTRAQ data set (Additional file 1: Fig. S8; Additional file 2: Note S3). We therefore used natural cubic spline interpolation for the fitting of all MVCs (and also the fitting of all M-A curves). After transforming all M-values into z-statistics, we examined their distribution separately for each sample. Specifically, the z-statistics associated with each sample were ordered and were plotted against the corresponding theoretical quantiles of the standard normal distribution. It was found that the middle 50% of the z-statistics matched very well with corresponding theoretical quantiles, while the z-statistics at the two ends had even larger absolute values than corresponding theoretical quantiles (Additional file 1: Fig. S9).

We next identified hypervariable proteins for this data set. For each protein, the sum of squares of all the associated z-statistics was compared to a chi-square distribution, producing a p-value that assessed the overall expression variability of the protein. In total, 3097 significant hypervariable proteins were identified (Bonferroni-adjusted p-value < 0.01), and most of them showed consistently increased/decreased expression in tumor tissues compared to NATs (Additional file 1: Fig. S10). We further performed PCA on all the 159 pairs of samples by using z-statistics of the hypervariable proteins as features (Fig. 5a). In the two-dimensional PC plot, samples originated from different MS runs were mixed together, indicating the associated batch effects were effectively removed.

Fig. 5figure 5

Applying reverse-zMAP to an HCC TMT data set. a Performing PCA on all samples of the HCC data set, with the z-statistics (derived by reverse-zMAP) of hypervariable proteins as features. b Venn diagram showing the overlap among PC1-correlated proteins, prognosis-related ones, and hypervariable ones. c Plotting the log-hazard ratio of each protein against the PCC between its z-statistics and the PC1 scores of samples. Colors indicate the BH-adjusted p-values for identifying prognosis-related proteins. d Hierarchical clustering of all patients based on the PC1 scores of their tumor samples. The patients were accordingly classified into three groups. The p-value associated with each clinical feature was derived by applying ANOVA to comparing PC1 scores. e Kaplan–Meier curves for the three groups of HCC patients. The p-value assessed the survival difference across the groups and was derived by applying log-rank test. f Violin plots showing the z-statistics of NPC1 in all NAT samples and different groups of tumor samples. The p-values were derived by applying t-test. g Dividing all patients into two groups based on the median expression of NPC1 in their tumor tissues and assessing the survival difference between the groups. h Using western blot to measure the expression of NPC1 in HepG2 cell line under different conditions. i, j Using MTT assay, transwell migration assay, and plate clone formation experiment to quantitatively assess the influence of NPC1 knock-down on the cell proliferation, migration, and colony forming ability of HepG2 cell line, respectively. Three biological replicates were generated for each experiment. k Heat map showing the GSVA scores of biological pathways associated with differential activity across sample groups. The average GSVA score of each pathway in each sample group is also displayed. The differential pathways were identified by applying limma to comparing GSVA scores (BH-adjusted p-value < 0.05)

It was also observed that the tumor samples and NAT samples were largely separated from each other along the PC1 dimension, suggesting the PC1 score might have the potential to quantitatively assess tumorigenesis and even the stage of tumor progression, with respect to the proteomic landscape. To further explore the role of PC1, we defined PC1-correlated proteins based on the PCC between z-statistics and PC1 scores. In total, we found 1638 positively correlated proteins (PCC > 0.5) and 1488 negatively correlated ones (PCC <  − 0.5), which corresponded very well to upregulated and downregulated proteins in tumor tissues, respectively (Additional file 1: Fig. S11a). Pathway enrichment analysis showed that the positively correlated proteins were enriched in several pathways related to genetic information processing, such as spliceosome, cell cycle, and mismatch repair, while the negatively correlated proteins were enriched in metabolic pathways associated with normal liver function, such as retinol metabolism [40, 41], drug metabolism [42], and fatty acid degradation [43] (Additional file 1: Fig. S11b). These results suggested excessive cell proliferation and the disorder of liver metabolism in HCC tumor.

We next evaluated the prognostic association of the PC1-correlated proteins. We first defined prognosis-related proteins by performing a regression of the overall survival time of the patients on the expression intensities of each protein. More specifically, a Cox proportional hazards model was separately fitted for each protein, with its z-statistics in tumor samples as the only predictor [44]. In total, we defined 585 prognosis-related proteins (BH-adjusted p-value < 0.05), including 335 prognosis-favorable proteins (hazard ratio < 1) and 250 prognosis-unfavorable ones (hazard ratio > 1). A significant overlap was observed between these prognosis-related proteins and the PC1-correlated ones: 69.6% of the prognosis-related proteins were also PC1-correlated (Fig. 5b). As a comparison, only 3.9 and 0.8% of the prognosis-related proteins were identified as PC2 and PC3-correlated ones, respectively (Additional file 1: Fig. S12a). Further examination revealed that the prognosis-favorable proteins remarkably overlapped with the PC1-negatively correlated proteins, while most of the prognosis-unfavorable proteins were PC1-positively correlated ones (Additional file 1: Fig. S12b). We also globally examined this relationship in a more quantitative manner: the log-hazard ratio of each protein was plotted against the PCC between its z-statistics and the PC1 scores. A strong positive correlation between these two statistics was observed (Fig. 5c).

Intriguingly, the PC1 scores of the tumor samples were significantly associated with several key clinical features of the HCC patients (Fig. 5d). For example, patients with tumor thrombus got significantly larger PC1 scores than the others (p-value = 5e − 3). Similar results were also observed on patients with high alpha-fetoprotein (AFP) level. These observations strongly implied the clinical implication of PC1. We next made a classification of all the HCC patients based on the PC1 scores of their tumor samples, producing three subgroups of patients (Fig. 5d). A clear survival difference was observed between these subgroups (Fig. 5e), with the corresponding p-value being even more significant than those resulting from the classifications based on TNM or BCLC stage (Additional file 1: Fig. S13). Specifically, the patients in group III showed clearly worse prognosis and also much higher frequency of tumor thrombus than those in groups I and II (Fig. 5d, e), suggesting PC1 could contribute to elucidating the molecular events underlying HCC progression.

Following this speculation, we tried identifying HCC progression-related proteins based on the protein expression profile across the subgroups of patients as well as the prognostic association. We selected four proteins for further exploration, including two potentially oncogenic ones (NPC1 and UBE2C) and two potential tumor suppressors (PIPOX and MMAA) (Fig. 5f, g; Additional file 1: Fig. S14). As an independent verification, we collected RNA expression data of 363 HCC patients from the TCGA database [45]. For each of the four proteins, a significant survival difference was observed when the patients were divided into two groups based on the median RNA expression level of the corresponding gene (Additional file 1: Fig. S15). We also experimentally explored the roles of NPC1 and UBE2C in two HCC cell lines (HepG2 and Huh7). For NPC1, shRNA was used to knock down its expression in both cell lines, which resulted in significant suppression of cell proliferation, migration, and colony forming ability (Fig. 5h–j; Additional file 1: Fig. S16a-c). For UBE2C, we established its overexpression in the two cell lines and observed significant increases of all the three indexes (Additional file 1: Fig. S16d-i). These results confirmed that elevated expression of the two genes contributed to the proliferation and invasion of HCC cells.

Applying various downstream analyses on the z-statistics of reverse-zMAP

In practice, a common downstream analysis is to assess the overall enrichments of protein sets based on the abundance of individual proteins. Here, we used biological pathways collected from the KEGG database [46] to define protein sets of interest. The GSVA method [47] was then applied to the z-statistic matrix produced by reverse-zMAP for the HCC TMT data set, which quantified the activity of each biological pathway in each sample. Finally, the pathways exhibiting differential activity across sample groups were identified and clustered (Fig. 5k).

It was observed that many of the identified pathways had stepwise increased/decreased activity from NAT samples to tumor samples in groups I–III, showing a good correlation with the PC1 scores. Example pathways with decreased activity across the sample groups included many liver function-related ones, such as bile secretion, retinol metabolism, and vitamin B6 metabolism; examples with increased activity included those related to cell proliferation or cancer, such as cell cycle, mismatch repair, transcriptional misregulation in cancer, and viral carcinogenesis (Fig. 5k). Notably, a few NAT samples were associated with even larger PC1 scores than some tumor samples in group I. Concordantly, compared to the other NAT samples, these samples showed clearly lower activity of many liver function-related pathways (e.g., bile secretion and retinol metabolism) and much higher activity of several cancer-related pathways (e.g., viral carcinogenesis) (Additional file 1: Fig. S17).

We further dug into the NAT samples with excessively large PC1 scores. It was noted that these samples, compared to the other NAT samples, displayed clearly higher activity of Jak-STAT signaling pathway and leukocyte transendothelial migration (Additional file 1: Fig. S18a). The former has been implicated in the pathogenesis of inflammation [48], and the latter is vital for innate immunity and inflammation response [49]. Inspired by these observations, we examined the serum gamma-glutamyl transferase (GGT) concentrations, a commonly used biomarker for hepatitis, of the HCC patients. Intriguingly, all the patients whose NAT samples had excessively large PC1 scores were associated with abnormally high levels of serum GGT (Additional file 1: Fig. S18b). Together, these findings suggested the high-PC1 NATs could be in a state of severe inflammation, providing proteome-level insights into the progression from HBV-infected liver tissue to HCC.

We concluded here that the variation of PC1 score across NAT samples made biological sense as well as that across the tumor samples. Since the PC1 scores were derived from the z-statistics of the hypervariable proteins, this conclusion implied a potential advantage of unsupervised hypervariable analysis over traditional differential analysis: in addition to the differences between sample groups, hypervariable analysis may also capture the heterogeneity within each group. In the original study of the HCC patients, 1274 differentially expressed proteins between the tumor and NAT samples were identified, 1113 (87.4%) of which were also identified as hypervariable proteins in this study (Additional file 1: Fig. S19a). We then clustered the proteins uniquely identified as hypervariable ones and found that two of the resulting clusters were associated with clear expression heterogeneity across the NAT samples (these two clusters were referred to as C4 and C5, which comprised 176 and 241 proteins, respectively). Consistent with the above speculation of hepatitis, both C4 and C5 showed elevated expression levels specifically in the NAT samples with excessively large PC1 scores and were also significantly enriched in biological pathways that suggest activated immune and inflammatory responses (Additional file 1: Fig. S19b). For example, the C4 proteins were enriched in the pathway of ECM-receptor interaction, and it has been suggested that the deposition and remodeling of ECM can enhance local immune response to chronic hepatitis tissue [50]; the C5 proteins were enriched in neutrophil extracellular trap formation and leukocyte transendothelial migration, which are crucial for innate immunity and inflammation response [51, 52]. In order to more quantitatively dissect the expression heterogeneity of the two protein clusters, we hierarchically clustered all NAT samples into two subgroups based on their PC1 scores, which successfully isolated the NAT samples with excessively large PC1 scores (NAT II group, 12 samples) from the others (NAT I group, 147 samples). We then calculated for each sample the average z-statistic across C4/C5 proteins, as an overall expression evaluation for the protein cluster. It was found that the expression of both C4 and C5 was considerably higher in NAT II than in NAT I and the three subgroups of tumor samples, while the differences among NAT I and the tumor subgroups were not as distinct (Additional file 1: Fig. S19c, d). Since NAT II only accounted for a small proportion (7.5%) of all NAT samples, the overall expression difference between all NAT and tumor samples was also not distinct (especially for the C4 cluster, to which the corresponding t-test p-value was only 0.31 even with such a large sample size (159 vs. 159); Additional file 1: Fig. S19c, d), which explained why the associated proteins had not been identified as differential ones in the original study.

As another demonstration of the utility of the z-statistics derived by reverse-zMAP, we constructed a protein co-expression network for the HCC patients based on the z-statistics of their tumor samples. In this network, each protein pair associated with a positive and significant partial correlation coefficient (PTCC) was considered to be co-expressed (see “Methods”). We then specifically took out the sub-network consisting of the hypervariable proteins and identified co-expression modules from it. Finally, we performed functional annotation for each module by identifying enriched biological pathways.

It was found that most of the modules were significantly enriched within one or more pathways, suggesting the proteins belonging to the same module had coordinated functions (Additional file 1: Fig. S20a). We also noticed that the expression of the proteins in the same module tended to have consistent correlations with the PC1 scores (Additional file 1: Fig. S20b). For example, there were two modules enriched within the cell cycle pathway, and all their members had previously been identified as PC1-positively correlated proteins.

Associating z-statistics with the mutation landscapes of the HCC patients

To search for potential driver mutations of proteomic variation across the HCC patients, we performed a quantitative trait locus (QTL) analysis that was aimed at identifying significant gene-protein associations, in the sense that the abundance of the protein was significantly associated with the mutation status of the gene. For each candidate gene-protein pair, the z-statistics of the protein in tumor samples were linearly regressed against the (non-silent somatic) mutation indicators of the gene, with the age and gender variables properly accounted for by treating them as cov

留言 (0)

沒有登入
gif