Polymorphisms in transcription factor binding sites and enhancer regions and pancreatic ductal adenocarcinoma risk

Study populations

As the discovery population, genotyping data from PanScan I, PanScan II, PanScan III, and PanC4 were downloaded from the database of Genotypes and Phenotypes (dbGaP) website (study accession numbers: phs000206.v5.p3 and phs000648.v1.p1, project reference: #12644). All the individuals were genotyped using Illumina InfiniumHumanHap550v3 (PanScan I), Illumina InfiniumHuman610-Quad (PanScan II), OmniExpress arrays (PanScan III) or HumanOmniExpressExome-8v1 (PanC4) DNA Analysis Genotyping BeadChips. After merging the four genotype datasets (hereafter referred to as PanScan/PanC4), we performed imputation on the genotype data with the TOPMed imputation panel (version TOPMed-r2), followed by quality control steps. We excluded subjects with cryptic relatedness (PI_HAT > 0.2), gender mismatches, and variants with a minor allele frequency (MAF) < 0.01, completion rate and call rate < 98%, low-quality imputation score (INFO score < 0.7), evidence for violations of Hardy–Weinberg equilibrium (p < 1 × 10−5), leaving 7,509,345 variants genotyped on 14,266 individuals (7205 cases and 7061 controls) in the final dataset. PLINK 2.0 was used to perform principal component analysis on genotypes from all study populations, merged with genotypes of subjects from phase 3 of the 1000 Genomes Project. Individuals who did not cluster with the 1000 Genomes subjects of European descent in the principal component analysis (N = 439) were excluded from further analysis.

In order to narrow the list of variants, the summary statistics of a meta-analysis based on three East Asian studies [the Japan Pancreatic Cancer Research (JaPAN) consortium GWAS, the National Cancer Center (NCC) GWAS, and the BioBank Japan (BBJ) GWAS] comprising 2,039 pancreatic cancer patients and 32,592 controls in the Japanese population was used [20]. Genotyping on these individuals was performed using Illumina HumanCoreExome (JaPAN), Illumina HumanHap550/Illumina Human610-Quad (NCC) or Illumina HumanOmniExpressExome/Illumina HumanOmniExpress (BBJ) Genotyping BeadChips. Imputation was performed on each dataset with the 1000G phase3 v5 reference panel. A total of 7,914,378 variants remained after the post-imputation quality control, excluding variants with a MAF < 0.01 and low-quality imputation score (INFO score < 0.5).

A total of 7182 individuals (3392 PDAC cases and 3790 controls) from the PANcreatic Disease ReseArch (PANDoRA) consortium were genotyped to validate the previously selected variants. PANDoRA was previously described in detail [21]. It is a multicentric consortium consisting of 11 European countries (Greece, Italy, Germany, the Netherlands, Denmark, the Czech Republic, Hungary, Poland, Ukraine, Lithuania, and the UK), whose samples and data have been collected at the German Cancer Research Center (DKFZ, Heidelberg, Germany), where the DNA bank and the central database were established. PDAC cases were defined as individuals with an established diagnosis of PDAC. Controls were patients from the general population without any pancreatic disease at recruitment, individuals hospitalized for reasons other than cancer, or blood donors. Data were collected on sex, age, and country of origin for each case and control. Controls were recruited in the same geographical regions as the cases. Controls from the Netherlands and Germany were obtained respectively from the ‘European Prospective Investigation into Cancer and Nutrition’ (EPIC) [22] and the ‘Epidemiological investigations on chances of preventing, recognizing early and optimally treating chronic diseases in an elderly population’ (ESTHER) [23]. For this study, only PANDoRA subjects with self-declared European ancestry were included.

SNP selection

In order to improve our chances of finding associations with PDAC risk, first of all we limited our selection to SNPs showing association at p < 10−4 in the PanScan/PanC4 dataset and not in LD among them [r2 > 0.6, checked with LDlink (https://ldlink.nci.nih.gov)], resulting into 2575 SNPs. The SNPs in TFBSs were retrieved from the SNP2TFBS database [24], containing annotations for 200 transcription factors with SNPs predicted to alter their affinity for binding. The effects of SNPs in the whole genome on TF binding were estimated using position weight matrices (PWM), which model the specificity of TF binding. The database calculates a score based on the difference between the PWM match scores of both alleles for each SNP-TF binding. The TFBS SNP list was generated by downloading all predicted variant-TF interactions from the SNP2TFBS database (N = 2,281,137, involving 1,900,881 unique SNPs).

To obtain enhancer SNPs, we used a defined list of enhancer regions from published research [25], which used the activity-by-contact (ABC) model to predict which enhancers regulate which genes in 131 human cell types and tissues. First, we extracted the genomic positions of enhancers and their target genes reported for the normal pancreatic tissue and the PANC-1 pancreatic cancer cell line. We thus obtained 55,967 enhancer regions. As the next step, we mapped SNPs on the enhancer regions via UCSC genome browser tools [26] to create a list of SNPs situated in enhancers consisting of 1,190,420 SNPs.

We checked the associations of polymorphisms in TFBS and enhancers with PDAC risk in the discovery dataset (PanScan/PanC4). In the following step, we performed a meta-analysis between TFBS SNPs and enhancer SNPs present in the results of both PanScan/PanC4 and East Asian GWAS summary statistics, using the “meta” and “metafor” R packages. We then excluded known pancreatic cancer risk loci and the SNPs in LD with them (r2 > 0.6). To select SNPs for the replication phase, we applied the following inclusion criteria: p < 10−4 in the meta-analysis and a significant association in both PanScan/PanC4 and the East Asian GWAS summary statistics (p < 0.05). In addition, we selected SNP rs17358295, because it showed the most significant association in the East Asian GWAS summary statistics (p = 4.6 × 10−3), despite having only a modestly significant association in the meta-analysis (p = 2.7 × 10−2). We finally picked the top significant SNPs in TFBSs and enhancers for genotyping analysis on the PANDoRA population. Summary information on the SNPs in TFBSs and enhancers selected for replication is included in Additional file 3: Data S3.

Sample preparation and genotyping

The sample preparation and genotyping process were conducted at a single laboratory at German Cancer Research Center in Heidelberg, Germany. DNA extraction from the whole blood of both cases and controls within the PANDoRA consortium was carried out using a Qiagen-manufactured kit (Hilden, Germany). To ensure uniformity, the order of DNA samples from case and control subjects was randomized on plates, guaranteeing an equal representation of cases and controls in each batch. Genotyping was conducted via allele-specific PCR-based TaqMan technology (ThermoFisher, Applied Biosystems, Waltham, MA) by ordering TaqMan SNP Genotyping Assays for the selected seven SNPs. The PCR protocol was performed with TaqMan Genotyping Master Mix in 384-well plates following the manufacturer's recommendations. The PCR plates were read on a ViiA7 real-time instrument (Applied Biosystems), and genotypes were determined using the ViiA7 RUO Software, version 1.2.2 (Applied Biosystems).

Statistical analysis

In the discovery phase, a logistic regression analysis was carried out by computing odds ratio (OR), 95% confidence intervals (95% CI), and p values to test the association between the SNPs and PDAC risk. The analysis was performed on 14,266 individuals (PanScan/PanC4) and was adjusted for sex, age, and the top ten principal components to avoid confounding due to population stratification. A meta-analysis was performed between the East Asian GWAS summary statistics and the discovery population. The top seven SNPs were analyzed in PANDoRA using logistic regression, adjusting for age, sex and country of origin (PANDoRA lacks GWAS data, therefore principal component data are not available). Deviation from Hardy–Weinberg equilibrium was tested for the variants genotyped in PANDoRA using the control subjects. Finally, a meta-analysis was conducted between the results of the three populations with a total of 56,079 individuals. Meta-analysis models were chosen depending on the heterogeneity (fixed-effect: I2 < 50%, random-effect: I2 ≥ 50%). In order to take into account the number of independent tests, LD, with a threshold of r2 > 0.6, was used to discard variants representing the same association. The Bonferroni-corrected threshold for statistical significance was 0.05/2575 = 1.94 × 10−5.

Functional annotation

Several databases were utilized to link the variants with the best associations to potential functional explanations. To identify the possible effect of the SNPs on gene expression (eQTL/sQTL analysis), we used the data available in the Genotype-Tissue Expression (GTEx) project (https://www.gtexportal.org). We used the Ensembl Variant Effect Predictor (VEP) (https://www.ensembl.org/info/docs/tools/vep/index.html), HaploReg (https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php), RegulomedB (https://www.regulomedb.org), Expression Atlas (https://www.ebi.ac.uk/gxa/home), The Human Protein Atlas (https://www.proteinatlas.org), TNMplot (https://tnmplot.com/analysis/) to check for regulatory potentials (for example, changes in transcription factor affinity, chromatin state regulation, changes in the expression). By using the Ensemble website (https://www.ensembl.org/), we analyzed the regions near the significant SNPs to look for regulatory regions.

留言 (0)

沒有登入
gif