Two protocols were used for the present study, both reviewed and approved by the Dana Farber/Harvard Cancer Center (DF/HCC) Institutional Review Board, which oversees clinical cancer protocols for all Harvard institutions including MGH. For the tumor blocks from sporadic lung cancer cases, the approved protocol DF/HCC no. 13-416 includes permission for molecular analysis, privacy of results and publication of deidentified results. For the T790M familial pedigree, the cases were initially collected under protocol DF/HCC no. 94-138 at the time of the initial publication23 and reconsented under protocol DF/HCC no. 13-416 for the present study. Under both protocols, all participants provided written informed consent for collection of tumor and tissue specimens and clinical information for inclusion in a tissue repository for future research, including DNA and RNA sequencing, except where noted, and written informed consent for sharing of clinical information in deidentified publications.
PatientsTen patients were selected retrospectively from an institutional database as having undergone resection for two or more early stage lung cancers, with at least one lesion having a known EGFR mutation by next-generation sequencing analysis. These patients gave informed consent for their biological materials to be included in this database. These tumors were classified as early stage by the reviewing pathologist at the time, even though our genetic analysis suggests that two of the patients had metastatically related tumors. No patients had received systemic treatment for cancer preoperatively. Clinical histories were extensively reviewed (see Table 1 and Supplementary Table 1 for summary of patient and tumor characteristics). Patient genetic ancestry was inferred from electronic health records. Gender was self-reported and was not considered in the study design or analysis because we were too limited in our sample availability to stratify based on gender. An additional two patients with a known inherited EGFR T790M mutation were included for proof of concept.
Specimen collection and histopathologySections from the entire tumor and representative lung parenchyma distinct from the tumor were fixed in formalin and embedded in paraffin. Histopathological slides were made from the FFPE sections and retrospectively reviewed by a single observer, expert pulmonary pathologist (M.M.K.), who histologically classified and staged individual tumors in accordance with the World Health Organization classification of lung tumors59 and the 8th edition American Joint Committee on Cancer lung cancer staging guidelines60, respectively. The areas of highest tumor purity and histologically normal lung tissue were also selected.
TNA extraction protocolTotal nucleic acid (TNA) was extracted from designated tumor and normal lung tissue from each patient using the standard protocol of Agencourt Formapure kit (Supplementary Table 1). The approximate location of the normal samples relative to the tumors is shown in Extended Data Fig. 1a.
Whole-exome sequencingBefore WES, a standardized PicoGreen dsDNA Quantitation Reagent test (Invitrogen) was used to quantify DNA in triplicate. The Fluidigm Genotyping fingerprint genotyping of 95 frequent SNPs was used for the quality control identification check (Fluidigm). Using the KAPA Library Prep kit and palindromic forked adapters from Integrated DNA Technologies, libraries were constructed from double-stranded (ds)DNA. Before hybridization, libraries were combined. Utilizing a 37-Mb target, hybridization and capture were carried out using the essential components of Illumina’s Rapid Capture Enrichment Kit. On the Agilent Bravo liquid handling system, the library building, hybridization and capture processes were all fully automated. Library pools were denatured on the Hamilton Starlet using 0.1 N NaOH after post-capture enrichment. DNA libraries were cluster amplified using HiSeq 4000 exclusion amplification reagents and HiSeq 4000 flowcells in accordance with the manufacturer’s (Illumina) instructions. HiSeq 4000 flowcells were sequenced using sequencing-by-synthesis chemistry. RTA (v.2.7.3) or later was then used to examine the flowcells. Sequencing of each pool of entire exome libraries was done using paired 76-cycle runs with two 8-cycle index reads across the number of lanes required to provide coverage for all libraries in the pool.
Sequence data were analyzed using the Broad Institute’s Cancer Genome Analysis WES Characterization Pipeline, in which aligned BAM files were inputted into a standardized WES, somatic, variant-calling pipeline as previously described61, which included MuTect (v.1.1.6) for calling somatic single nucleotide variants (sSNVs), Strelka2 (v.2.9.9) for calling small indels, deTiN (v.2.0.1) for estimating tumor-in-normal (TiN) contamination, ContEst (v.1.4-437-g6b8a9e1; GATK v.3.7.0) for estimating cross-patient contamination, AllelicCapSeg (v.22) for calling allelic copy number variants and ABSOLUTE (v.1.5) for estimating tumor purity, ploidy, cancer cell fractions and absolute allelic copy number. Artifactual variants were filtered out using a token panel-of-normals (PoN) filter, a blat filter, an OxoG filter and an FFPE filter.
Filtering formalin fixation (FFPE) and other potential artifactsFFPE and OxoG artifacts, which are developed with inherent lead strand asymmetry (orientation bias) owing to context specificity of the mutational processes, were filtered as previously described61,62. In brief, OxoG is an artifact signature resulting from oxidative damage to guanine during library preparation, which causes guanine to pair with adenine instead of cytosine, ultimately causing an observed G>T mutation. These artifacts will occur only on one strand whereas a somatic event will show the change on both strands of DNA, and this orientation bias is used to distinguish real events from artifacts. The cohort also had single nucleotide artifacts resulting from the use of FFPE samples, wherein formaldehyde causes deamination of cytosine resulting in C>T mutations similar to those of the aging signature, but with the same orientation bias observed in OxoG events, allowing us to use the same algorithm for determining orientation bias that has previously been used on FFPE samples62. Given the importance of this issue for mosaicism and tumor relatedness analysis, we have performed additional filtering, in which we have ‘force called’ the identified mutations across tumors and normals in the cohort. Force calling is the process of gathering the evidence for a mutation at a particular genomic site (that is, determining the number of mutated and WT-independent sequencing reads that cover the site). In standard WES depth, three or more mutated reads in the tumor and none in the normal are required to detect a somatic mutation at a non-noisy site. However, FFPE can dramatically increase the noise at particular sites, hence it is important to assess the number of mutated reads in other FFPE samples (normal or tumor) that do not have a somatic mutation at the site. We indeed found that some mutations in the FFPE context do appear at noisy sites which are also present in other cases, although typically at much lower allele counts. Accordingly, we applied additional conservative filtering to remove any mutations seen in more than two patients or more than two counts in the normal samples not associated with a mutated tumor. In the tree figures, we annotated both the total number of mutations per clone and the number of mutations that were not in the FFPE context, and there is a column in Supplementary Table 2 stating which mutations are in the FFPE context.
PoN filteringTo remove sequencing artifacts and frequent germline events, SNVs and indels were filtered using PoNs which includes 8,334 WES normals63. Briefly, the panel includes eight values for each site, which describe the percentage of normals, different modes of artifact and the likelihood that the event is a germline event at that site.
Phylogenetic analysis and subclonal architecture inferenceThe cancer cell fraction (CCF, represented as a probability density distribution ∈ [0,1]) of individual somatic alterations is estimated using the ABSOLUTE64 algorithm (v.1.5) which calculates the sample purity, ploidy and local absolute DNA copy number of each mutation. This CCF distribution represents independent estimates for each somatic event. As multiple somatic events are expected to originate from the same subclone, that is, sharing the same proportion of cancer cells, a clustering algorithm is employed to estimate these proportions and assign events to each of the subpopulations. PhylogicNDT65 (v.1.0) is able to identify individual clusters even when the number of mutations per cluster is small. PhylogicNDT performs phylogenetic, tree-building and clustering analysis. In brief, the PhylogicNDT Clustering module employs a multidimensional nonparametric Dirichlet process (DP) on the raw CCF probability density distributions of the somatic variants to learn the underlying clonal structure from the data. The DP is based on the approach where the DP mixing parameter, α, is sampled and learned via a Markov Chain Monte Carlo (MCMC) method from the data. The starting assumption of the method is that the posterior CCF distributions are drawn from a mixture of multidimensional distributions with an unknown number of clusters. In this approach, we employ a single, weak prior on α parameterized by a negative binomial distribution over the number of clusters k (default prior is shape = 3 and scale = 3). It is worth noting that this prior has a minimal effect on the resulting number of clusters. The algorithm is designed as a Gibbs sampler where individual mutations are consecutively assigned to the available clusters through a multinomial distribution representing the likelihood of the mutation belonging to the cluster based on the n-dimensional distribution of the cluster CCF position (which is re-estimated after each assignment). At each iteration, there is a probability that the cluster will not retain any mutations and thus will be closed, or that a mutation would open a new cluster. α is re-sampled on each DP iteration from a mixture model that depends on N, the γ shape and scale parameters, specified prior for k and the current k. Sampling with such an approach allows the mixing parameter to be learned from the data, rather than requiring its specification upfront. As the number of clusters change through the process, α changes accordingly, contributing to the convergence of the method. After completion of the DP-MCMC, the ‘burn-in’ iterations are discarded (first half of the MCMC chain) and the posterior N-dimensional CCF distribution of every mutation is estimated based on the average of the CCF distributions along the MCMC chain. The DP also provides a posterior distribution on the total number of clusters. Whenever a fixed number of clusters was required, we selected the least complex likely solution (that is, the lowest number with >10% posterior probability). According to that posterior, the somatic events are then assigned to subclonal cell populations via applying a hierarchical clustering algorithm on their N-dimensional posterior CCF distributions, to obtain the CCF distribution of each cluster. Finally, the probability that a mutation belongs to a particular cluster is calculated based on the normalized product of the MCMC CCF distributions associated with the mutation and the posterior CCF distribution of the specific cluster. This uncertainty in cluster membership is later used in constructing the ensemble of phylogenetic trees. Clusters are usually based on >10 mutations, but, if the clusters are minor (for example, based on only a few shared events) and their CCF pattern is very distinct across samples, then a cluster can be formed based on only a few mutations if the CCF confidence intervals (CIs) are defined66.
Construction of phylogenetic treesThe BuildTree component of PhylogicNDT uses the generated posterior distributions on cluster positions and mutation membership to calculate the ensemble of possible trees that support the phylogenetic relationship of the detected cell populations. This algorithm employs an MCMC Gibbs sampler over the branch positions within the tree and parent–child relationships among clones. In each iteration, a subclone can move to a place in the tree according to a multinomial probability calculated based on the pigeon-hole rule (that is, the sum of CCFs of sibling clones cannot exceed the CCF of the parent clone), accounting for the uncertainty in assignment of mutations to subclones. The likelihood of the entire tree is determined by multiplying the pigeon-hole probabilities for all nodes in the tree (that is, parent–children relationships). In each MCMC iteration, the tree likelihoods are used to draw the new location of a single clone. In addition, all mutations are randomly assigned to clones based on the match between their CCF distribution and the clones’ CCF distributions, which are then updated based on the assignment of mutations. This mutation shuffling ensures that the uncertainty in the tree structure also takes into account the uncertainty in mutation assignment. Finally, the MCMC generates a posterior distribution over the possible trees (that is, a ‘forest’ of trees). Clusters with <10% CCF across all analyzed samples from the patient were excluded from the tree. Tumors with <10% purity were excluded from detailed phylogeny analysis.
Construction of phylogenetic tree diagramsPhylogenetic tree diagrams throughout the paper are designed in the following way: theoretical cell populations are circles and clones derived from the WES are squares. Any germline EGFR mutation found in normal lung tissue is denoted at the top. Branches are configured based on shared and distinct mutations in each clone. Numbers within lineage tracings represent the number of new additional exonic mutations identified in each clone. Numbers in parentheses are exonic mutations that are not in the FFPE context. Additional driver mutations found in tumors are also annotated, including any somatically acquired EGFR mutations, with the clones where they were identified in gray boxes. We cannot determine whether clones that share a boxed EGFR mutation developed independently or from a shared precursor. Resected tumors are assigned to clones based on their majority population in the layered pie charts shown below the tree. Pie charts below the tree indicate clonal representation within each resected tumor.
Mutational signature analysisMutational signatures were determined using SignatureAnalyzer (v.3ddba7a)67. SignatureAnalyzer is a Bayesian NMF (BayesNMF) method that probabilistically infers the number of signatures, K, through the automatic relevance determination technique and returns highly interpretable and sparse representations for both underlying mutational signature profiles and patient attributions that strike a balance between data fitting and model complexity. We ran BayesNMF 10,000× using the graphics processing unit implementation with exponential priors for the signature matrixW and activity matrix H and displayed the solution with the maximum posterior68. Finally, we compared the identified signatures with those in COSMIC (v.3.2) based on cosine similarity.
Poly(G) genotype data preprocessingGeneration and analysis of repeat (poly(G)) genotypes were performed as previously described37,38,39. Briefly, 33 poly(G) loci were PCR amplified using primers targeting their flanking sequences. Primer sequences can be found in Naxerova et al.38. All reactions were run in duplicate. PCR product length was measured using an ABI 3730xl DNA Analyzer and exported as tab-delimited text files through the Thermo Fisher Scientific Microsatellite Analysis Tool (https://www.thermofisher.com/us/en/home/cloud/all-analysis-modules/sanger-analysis-modules.html). Reactions with intensities <10% of the average intensity for that patient and locus were excluded. If the length distributions of both duplicates were similar (Jensen–Shannon divergence <0.11), the duplicate with higher fluorescence intensity was picked as the representative replicate. At a larger discordance between length distributions, the poly(G) tract was excluded from analysis in all samples of that patient. More details on filtering and quality control of poly(G) genotypes are provided in ref. 38. Source data for the poly(G) analysis are available at https://github.com/mblohmer/polyG_egfr_lc.
Poly(G) genotype data analysisAmplification of microsatellites produces a characteristic fragment stutter pattern as a result of polymerase slippage during PCR. The mean fragment length at each locus, which represents the genotype of the most recent common ancestor of all sampled cells69, was used to simplify this stutter pattern to a single value. Somatic shifts in poly(G) length (mutations) are reported in relation to the normal (germline) sample from each patient. To construct phylogenetic trees using the mean length of poly(G) markers, distance matrices containing all the samples from one patient were constructed using the Manhattan distance. This distance measures the sum of indels among all poly(G) markers in two samples, normalized by the number of poly(G) markers analyzed. As the Manhattan distance simply counts the number of mutations, it scales linearly with the number of cell divisions separating two samples41. However, the Manhattan distance is affected by a sample’s purity because the presence of normal cells within a tumor reduces the mean length. Based on the distance matrices, phylogenetic trees were constructed using the neighbor-joining method implemented in the R package ape70. Evolutionary distance between two tumors was estimated using Pearson’s correlation coefficient (r) between the two vectors of poly(G) marker lengths, as described in detail in ref. 41. Only patients in which at least half of all poly(G) markers could be successfully amplified across all samples were considered for this analysis. Samples were analyzed in two batches and only tumors that were analyzed in the same batch were compared with each other. Pearson’s correlation coefficient estimates the fraction of cell divisions in the history of two tumors that they have spent as part of the same lineage. A correlation of 0 means the lineages giving rise to two tumors split at the zygote stage and that they share 0% of their cell divisions, whereas a correlation of 1 means that the tumors’ lineages coincide and they share 100% of their cell divisions. Pearson’s correlation coefficient compares only the direction of mutations, not their magnitude, thus its estimation of evolutionary distance is not affected by purity and mutation rate. To assess the expected correlation value of two unrelated tumors, the distribution of r was calculated based on tumors from different patients, across all possible tumor pairs in this cohort, in which at least 15 of the same poly(G) loci were successfully amplified in both samples. CIs for the correlation of poly(guanine) mean lengths were calculated by resampling the poly(G) repeats that were used to calculate it 1,000× with replacement and using the 2.5th to 97.5th percentiles of the results.
Cell linesThe mouse NIH/3T3 cells were from the American Type Culture collection (cat. no. CRL-1658) and the human NCI-H2228 lung adenocarcinoma cells were a gift from A. Hata (MGH).
EGFR-mutant constructThe WT EGFR expression plasmid pHAGE-EGFR was a gift from G. Mills and K. Scott (Addgene plasmid no. 116731)71. EGFR-mutant constructs containing patient-specific DNA mutations were generated by site-directed mutagenesis (Agilent QuikChange II XL) and confirmed by Sanger sequencing. Overall expression of ectopic EGFR constructs was approximately tenfold higher than the level of endogenous expression in human NIC-H2228 cells, a lung cancer cell line that expresses moderate levels of EGFR. These constructs are available on request.
EGFR in vitro activity assayEGFR signaling in mouse NIH/3T3 cells, which lack endogenous EGFR expression, is measured under three culture conditions: baseline culture (10% serum), 24-h serum starvation (0% serum) and 5 min after addition of EGF (100 ng ml−1) to starved cultures. Western blotting for EGFR autophosphorylation at Tyr845 and downstream AKT phosphorylation at Ser473 are used as markers of EGFR activation.
Western blot analysisCells were lysed with radioimmunoprecipitation assay buffer (Sigma-Aldrich, cat. no. R0278) containing protease and phosphatase inhibitors (Life Technologies, cat. nos. A32965 and A32957). Lysate was cleared and western blotted according to standard protocols. The following antibodies were used: EGFR (Cell Signaling Technologies, cat. no. 4267, 1:500 dilution in 5% bovine serum albumin (BSA) in phosphate-buffered saline–Tween (PBST), imaged on LiCor); EGFR pY845 (Cell Signaling Technologies, cat. no. 6963, 1:500 dilution in 5% BSA in PBST, imaged on LiCor); AKT1/2 (Cell Signaling Technologies, cat. no. 9272, 1:1,000 dilution in 5% milk in PBST, imaged on film); AKT pS473 (Cell Signaling Technologies, cat. no. 4060, 1:500 dilution in 5% milk in PBST, imaged on film); and vinculin (Sigma-Aldrich, cat. no. MAB3574, 1:2,000 dilution in 5% BSA in PBST, imaged on LiCor). LiCor Image Studio software v.5.2.5 was used for LiCor western blot quantification and ImageJ v.2.3.0 for film western blot quantification.
Transformation assaysFor soft agar colony formation assays, NIH/3T3 cells stably expressing RFP, WT EGFR or mutant EGFR were suspended in Dulbecco’s modified Eagle’s medium + 10% fetal bovine serum containing 0.4% agarose with no additional EGF for 3 weeks. Colony growth was assayed by staining with 0.2% Crystal Violet in methanol for 10 min, followed by manual counting using ImageJ v.2.3.0.
DdPCR analysisDdPCR to detect EGFR L858R mutations was performed on TNA extracted from tissue slides. We used the commercially validated probe set for EGFR WT and p.L858R c.2573T>G (EGFR HEX and L858R FAM; BioRad Assay, ID dHsaMDV2010021). Samples were prepared following the standard protocol (BioRad). Briefly, 2× ddPCR Supermix for probes (no dUTP) was combined with 20–400 ng of patient DNA, 1× primer/probe mix and 5 U of MseI restriction enzyme (New England Biolabs). After droplet generation, samples were thermocycled with an annealing/extension temperature of 55 °C. BioRad QuantaSoft Analysis Pro software v.1.0 was used for droplet analysis and quantification. Variant allele frequency was calculated as mutant copies per total copies.
Statistics and reproducibilityNo statistical methods were used to predetermine sample sizes, but our sample sizes are similar to those reported in previous publications15,25. Tumors with <10% purity were excluded from detailed phylogeny analysis. Reactions with intensities <10% of the average intensity for that patient and locus were excluded from poly(G) analysis. If the length distributions of both duplicates were higher than Jensen–Shannon divergence = 0.11, the poly(G) tract was excluded from analysis in all samples of that patient. Patient 2 was excluded from our poly(G) analysis because fewer than half the poly(G) tracts were successfully amplified across all samples. Some samples evaluated by WES were not assessed via ddPCR or poly(G) analysis because of low sample quantity. No other data were excluded from the analysis. Randomization was not applicable to the present study because the patient samples were retrospectively selected owing to a satisfying set of criteria (more than one EGFR-mutant-resected lung tumor) and were not stratified into treatment arms. The investigators were not blinded to allocation during experiments and outcome assessment. A Kruskal–Wallis test and post-hoc Dunn’s test with Holm–Bonferroni correction were used to calculate the P value between groups in Fig. 3g. This test did not assume normality of the data. A two-way analysis of variance (ANOVA) with a Benjamini, Krieger and Yekutieli correction was used to calculate the q-value between groups in Fig. 4d–k. A one-tailed, unpaired Student’s t-test was used to calculate the P value between groups in Fig. 4m,n. The P value was not corrected for multiple comparisons. GraphPad Prism v.9.2 and R v.4.1.2 were used for statistical analyses. Data distribution in Fig. 4 was assumed to be normal but this was not formally tested.
Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
留言 (0)