Detailed information on participant selection, pulmonary function testing, and endotoxin assessment has been described elsewhere [17,18,19] and is summarized below.
Study population and study design
The Shanghai Textile Worker Study was initiated in 1981. It included 447 cotton workers from two cotton textile mills, who were exposed to airborne cotton dust and endotoxin, and 472 unexposed silk workers from a neighboring silk textile mill in Shanghai, China. Participants had no symptoms of respiratory disease and had worked for at least 2 years in the industry before the baseline survey. Seven follow-up surveys were subsequently conducted, in 1986, 1992, 1996, 2001, 2006, 2011, and 2016 (see study schema in Fig. 1). Notably, the loss-to-follow-up rate (excluding deaths) remained around 30% at the last survey in 2016.
Fig. 1: Shanghai textile worker study schema (1981–2016). Most of the workers retired between 1992 and 2001. Serum samples for proteomic analysis were collected from all 453 workers who participated in the 2016 survey.
Spirometry measurements
Forced expiratory spirograms were performed both before and after work shifts on the first workday following a 2-day rest period. All eligible retirees also participated in the follow-up cohort surveys. Workers were instructed to abstain from smoking for at least an hour before the test. Each worker performed up to seven trials to generate three acceptable curves. Several spirometric indices were recorded, including forced expiratory volume in one second (FEV1) and forced vital capacity (FVC), but the analyses focused primarily on FEV1. Acceptable FEV1 tracings varied by no more than 5% or 100 mL, whichever was greater. The highest FEV1 value from technically acceptable tests was used in the subsequent analyses.
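The acceptance rule and the choice of the analysis value can be sketched as follows. This is a minimal illustration rather than the study's own software. The function name `select_fev1`, the use of millilitres, and the reading of the rule as "the spread among the three largest FEV1 values must not exceed the greater of 5% of the largest value or 100 mL" are all assumptions made for the example.

```python
def select_fev1(trials_ml, max_trials=7, needed=3):
    """Return the highest FEV1 (mL) if the three best trials agree, else None.

    Hypothetical helper: the tolerance is the greater of 5% of the largest
    value or 100 mL, which is one possible reading of the rule in the text.
    """
    if len(trials_ml) > max_trials:
        raise ValueError("at most seven trials are allowed per session")
    if len(trials_ml) < needed:
        return None  # not enough curves to judge agreement
    best = sorted(trials_ml, reverse=True)[:needed]
    tolerance = max(0.05 * best[0], 100.0)
    if best[0] - best[-1] <= tolerance:
        return best[0]  # highest value among the agreeing curves
    return None


print(select_fev1([3000, 2950, 2900, 2500]))
print(select_fev1([1500, 1450, 1380]))
```

Under this reading, the 100 mL floor matters mainly for small lung volumes, where 5% of FEV1 would be a very tight tolerance.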
Respiratory questionnaire
We used a modified version of the American Thoracic Society (ATS) standardized respiratory symptom questionnaire [20], which was translated into Chinese and back-translated into English. The questionnaire was used at all follow-up surveys to collect work, medical, and smoking history, including basic characteristics, working status, retirement date, smoking status, pack-years of smoking, respiratory symptoms (including chronic bronchitis, chronic cough, and dyspnea), and respiratory syndromes (including byssinosis).
Endotoxin assessment
Airborne cotton dust levels were measured in the workplace with a vertical elutriator during the first four surveys, and the concentrations of gram-negative bacterial endotoxin in the dust samples were determined by chromogenic assay. In 1996, synthetic fibers were introduced as blends in the cotton mills, which reduced both cotton dust and endotoxin exposures by 50% relative to pre-blend levels [21]. The airborne endotoxin concentration estimated from samples collected in the silk mills closely matched ambient levels; silk workers were therefore considered unexposed to endotoxin.
Sample collection and proteomics profiling
Peripheral blood samples were collected from every participant who attended the 2016 survey. Serum and buffy coat were then separated from whole blood and stored at −80 °C. Proteins were quantified using data-independent acquisition mass spectrometry, a high-throughput proteomics strategy that can quantify proteins accurately and reproducibly in a complex proteome [22, 23]. Details of the proteomics profiling procedure can be found in the supplementary methods.
Quality control
Relative protein quantification values were obtained by sample normalization followed by log2 transformation. We first excluded proteins lacking annotation in the UniProt database, because their existence and functional attributes remained uncertain. The proteomic data contained missing values, which could be attributed to low abundance in certain samples or to technical issues. We therefore removed protein sequences with more than 50% missing values, so as to exclude sequences identified in only a small number of samples, which might produce false-positive signals due to technical issues. The remaining missing values can arise from several sources, such as peptide misidentification, abundance below the detection limit, and incomplete trypsin digestion; some of these are intensity-dependent, while others are not. We therefore imputed missing values with the sequential k-nearest neighbor method. In addition, to account for missingness not at random caused by values falling below the detection limit, a sensitivity analysis imputed missing values with the minimum value observed for the corresponding protein sequence.
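The missingness filter and the minimum-value imputation used in the sensitivity analysis can be illustrated with a short sketch. It is not the authors' pipeline: the function name `filter_and_impute_min`, the dictionary layout, and the use of `None` for missing values are assumptions. The sequential k-nearest neighbor imputation of the main analysis is a different algorithm and is not reproduced here.

```python
def filter_and_impute_min(matrix, max_missing=0.5):
    """Drop proteins with too many missing values, then impute the rest.

    matrix: dict mapping a protein identifier to a list of log2 intensities,
    one per sample, with None marking a missing value (hypothetical layout).
    Proteins missing in MORE than `max_missing` of samples are removed;
    remaining gaps are filled with that protein's observed minimum.
    """
    cleaned = {}
    for protein, values in matrix.items():
        if not values:
            continue  # nothing measured for this protein
        observed = [v for v in values if v is not None]
        missing_fraction = 1 - len(observed) / len(values)
        if missing_fraction > max_missing:
            continue  # identifiable in too few samples
        floor = min(observed)
        cleaned[protein] = [floor if v is None else v for v in values]
    return cleaned


example = {
    "P1": [1.0, None, 3.0, 2.0],    # 25% missing: kept, gap filled with 1.0
    "P2": [None, None, None, 5.0],  # 75% missing: removed
    "P3": [None, None, 4.0, 6.0],   # exactly 50% missing: kept
}
print(filter_and_impute_min(example))
```

Minimum-value imputation treats every gap as a value at the lower end of the observed range, which is why it suits the below-detection-limit scenario but not gaps caused by misidentification.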
Exploratory study
UK Biobank (UKB) was used as an independent exploratory cohort. In this prospective study, four spirometry tests were conducted among 502,309 individuals, of whom 48,544 had more than two FEV1 measurements. We first excluded participants without at least two spirograms with acceptable starts. For each participant, we then compared each FEV1 with their maximum FEV1; spirograms were considered reproducible if they were within 250 mL of the maximum FEV1, based on standard spirometry guidelines [24]. Proteomic analysis was conducted on plasma samples collected at baseline recruitment from a randomized subset of 53,026 UKB participants. Proteomic profiling was performed with the Olink Explore 3072 platform, which measures 2923 unique proteins [25]. Protein levels were reported as normalized protein expression values. The present analysis included 6177 individuals with two or more FEV1 measurements and available proteomic data.
Statistical analysis
Owing to the limited sample size and the large number of proteins under investigation, a conventional linear regression of the rate of FEV1 decline on each protein proved inadequate: no protein was significantly associated with the rate of FEV1 decline after multiple-testing correction (Fig. S1). Consequently, a multi-step strategy was employed to explore the relationship between proteins and lung function (see workflow in Fig. 2).
Fig. 2: Flowchart of this study. Four steps were designed in this study: (1) Quality control and imputation for the protein levels; (2) Exploring the associations between proteins and FEV1 based on four distinct models; (3) Combining the results of each model using the ACAT approach; (4) External exploration and validation using proteomic data from UK Biobank and two-sample Mendelian randomization. Notes: FEV1, forced expiratory volume in one second; ACAT, aggregated Cauchy association test; FDR, false discovery rate; MR, Mendelian randomization; HBB, hemoglobin subunit beta; IG, immunoglobulin.
In the first step, recognizing that the association between long-term lung function trends and protein levels may not be purely linear, we used four distinct models to assess the relationship between proteins and FEV1: a cluster-based model, a restricted cubic spline (RCS) model, a latent class mixed model (LCMM), and a mixed model for repeated measurements (MMRM). A comprehensive description of these models can be found in the supplementary methods. All models were adjusted for the same covariates: age, height, gender, cumulative pack-years of smoking, log-transformed cumulative endotoxin exposure (set to zero for silk workers), and years since retirement.
In the second step, we combined the P-values obtained from the four models for each protein using the aggregated Cauchy association test (ACAT) [26], a robust method for combining P-values that accommodates arbitrary dependency structures. The combined P-values were adjusted for the false discovery rate (FDR) using the Benjamini–Hochberg method [27], with statistical significance defined as FDR-q < 0.05. ACAT was performed with the R package ACAT.
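For readers unfamiliar with the two procedures, the sketch below shows the Cauchy combination statistic and the Benjamini–Hochberg adjustment in plain Python. It is an illustration only. The analysis itself used the R package ACAT, and the function names `acat` and `benjamini_hochberg`, the equal default weights, and the example P-values are assumptions. The sketch also omits the special handling of extremely small P-values that a production implementation would need.

```python
import math


def acat(p_values, weights=None):
    """Combine P-values with the Cauchy combination test.

    T = sum_i w_i * tan((0.5 - p_i) * pi) / sum_i w_i
    p_combined = 0.5 - arctan(T) / pi
    Equal weights are assumed when none are given.
    """
    k = len(p_values)
    if weights is None:
        weights = [1.0 / k] * k
    total = sum(weights)
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, p_values)) / total
    return 0.5 - math.atan(t) / math.pi


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q[i] = running_min
    return q


# Hypothetical P-values for one protein from the four models.
combined = acat([0.001, 0.5, 0.5, 0.5])
print(combined)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Because the tangent transform is dominated by the smallest P-values, a single strong signal from one model can still drive the combined result, which is the property that makes the method attractive when the four models capture different shapes of association.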
In the validation study, we investigated the association between the rate of FEV1 decline and each protein using linear regression. The rate of FEV1 decline was defined as the slope of a linear regression model fitted to all reproducible FEV1 measurements against age for each participant. The model was adjusted for baseline age, sex, height, pack-years of smoking, and baseline FEV1. The results were adjusted for FDR, and FDR-q values < 0.05 were considered significant.
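The per-participant rate of decline is an ordinary least-squares slope, which can be sketched as below. The function name `fev1_slope`, the units (litres and years), and the example values are hypothetical. The sketch covers only the slope calculation, not the covariate-adjusted regression on protein levels.

```python
def fev1_slope(ages, fev1_values):
    """Least-squares slope of FEV1 against age for one participant.

    slope = sum((age - mean_age) * (fev1 - mean_fev1)) / sum((age - mean_age)^2)
    A negative slope means FEV1 declines with age.
    """
    n = len(ages)
    if n < 2 or n != len(fev1_values):
        raise ValueError("need at least two paired age/FEV1 measurements")
    mean_age = sum(ages) / n
    mean_fev1 = sum(fev1_values) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    if sxx == 0:
        raise ValueError("all measurements were taken at the same age")
    sxy = sum((a - mean_age) * (f - mean_fev1)
              for a, f in zip(ages, fev1_values))
    return sxy / sxx


# Hypothetical participant measured at ages 50, 55, and 60 (FEV1 in litres).
print(fev1_slope([50, 55, 60], [3.0, 2.9, 2.8]))
```

With only two measurements the slope is simply the difference quotient, so estimates for such participants are more sensitive to measurement noise than those based on three or four spirometry visits.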
Protein-protein interaction analysis
Because the Shanghai Textile Worker Study and UKB used different proteomics assays, the overlap between their measured proteins was limited. Protein-protein interaction (PPI) network analyses were therefore conducted separately for the proteins potentially associated with lung function in each cohort. These analyses were performed using the STRING database (version 12.0) [28].
Mendelian randomization analysis
Finally, to further validate our findings and explore the causal relationship between proteins and lung function, we conducted two-sample MR analyses between the significant proteins and FEV1. Genome-wide association study (GWAS) summary data for FEV1 were obtained from the Neale Lab UK Biobank release (http://www.nealelab.is/uk-biobank), comprising 361,194 samples from the United Kingdom. The GWAS was adjusted for age, age², sex, age × sex, age² × sex, and 20 genetic principal components. The protein quantitative trait loci (pQTL) data were derived from the AGES cohort of 5368 elderly Icelanders [29], in which a GWAS of 4782 serum proteins was performed. For the main analysis, we used the inverse variance weighted (IVW) method. Several alternative MR methods relying on different assumptions were also applied as sensitivity analyses: (i) MR-Egger [30], which relies on the Instrument Strength Independent of Direct Effect assumption and provides a causal effect estimate as well as a test for pleiotropy; (ii) MR-Pleiotropy Residual Sum and Outlier (MR-PRESSO) [31], which identifies and removes outliers with horizontal pleiotropic effects; and (iii) Generalized Summary-data-based Mendelian Randomization (GSMR) [32], which accounts for linkage disequilibrium (LD) between instrumental variables and detects and eliminates genetic instruments with apparent pleiotropic effects on both exposure and outcome. For the MR analysis from FEV1 to proteins, independent instrumental variants were selected with a standard procedure. We first identified genome-wide significant single nucleotide polymorphisms (SNPs) with P-value < 5 × 10⁻⁸; the European population of the 1000 Genomes Project phase 3 then served as the LD reference panel to obtain independent instrumental variables with r² < 0.001 or physical distance >10,000 kb.
In the inverse analysis from proteins to FEV1, the significance threshold was relaxed to P-value < 5 × 10⁻⁶ owing to the limited number of independent instrumental variables. We used F statistics to assess the strength of the genetic instruments and the risk of weak instrument bias. MR analyses were performed using the R packages TwoSampleMR and GSMR.
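As an illustration of the quantities involved, the sketch below computes a per-variant F statistic and a fixed-effect IVW estimate from summary statistics. It is a simplified stand-in, not the TwoSampleMR or GSMR implementation, and those packages may use a random-effects variant by default. The function names, the approximation F = (beta / SE)² for a single variant, and the example effect sizes are assumptions.

```python
import math


def f_statistic(beta_exposure, se_exposure):
    """Approximate single-variant F statistic: (beta / SE)^2."""
    return (beta_exposure / se_exposure) ** 2


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse variance weighted causal estimate.

    estimate = sum(bx * by / se_y^2) / sum(bx^2 / se_y^2)
    SE       = sqrt(1 / sum(bx^2 / se_y^2))
    Inputs are per-variant lists of equal length.
    """
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for two independent variants.
estimate, standard_error = ivw_estimate([0.1, 0.2], [0.05, 0.10], [0.01, 0.01])
print(estimate, standard_error)
print(f_statistic(0.1, 0.02))
```

In this toy example both variants imply the same ratio of outcome effect to exposure effect, so the IVW estimate equals that ratio; with real data the estimate is a precision-weighted average of heterogeneous per-variant ratios.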
All statistical analyses were performed using R software (version 4.2.0).