Predictability of B cell clonal persistence and immunosurveillance in breast cancer

Study population

Eight participants with metastatic breast cancer enrolled within the Vall d’Hebron Institute of Oncology (VHIO) Warm Autopsy Program were included within this study. Ethical approval from the institutional review board of the Vall d’Hebron University Hospital (Barcelona, Spain) was obtained for the use of biospecimens with linked pseudo-anonymized clinical data. The ten participants with primary invasive early breast cancer included in this study were enrolled in the TransNEO study at Cambridge University Hospitals NHS Foundation Trust. Appropriate ethical approval from the institutional review board (research ethics ref.12/EE/0484) was obtained for the use of biospecimens with linked pseudo-anonymized clinical data. All participants provided informed consent for sample collection, and all participants consented to the publication of research results. Full details regarding sample collection, DNA and RNA extraction, library preparation and sequencing have been published elsewhere8,16. No statistical methods were used to predetermine sample sizes, but our sample sizes are similar to those reported in previous publications16. When performing statistical testing, we assessed whether the data met the assumptions of the tests used.

DNA somatic mutation calling and neoantigen prediction

Somatic mutations (Fig. 2d) and predicted HLA class I neoantigens (Fig. 2f) were identified from whole-exome sequencing data and tumor phylogenetic trees (Fig. 2e) were generated using OncoNEM44, as previously described16. MHC class II allele genotyping was performed on the normal tissue DNA sequencing data using HLA-HD45 (version 1.4) using default parameters. MHC class II neoantigens were predicted from the whole-exome mutation data using mixMHC2pred46 (version 1.2) and putative candidates with a percentage rank cutoff of 2% were retained (Fig. 2f).

TME composition and activity deconvolution from bulk RNA-seq

RNA-seq data from the early and metastatic breast cancer cohorts were processed as previously described8,16. Briefly, FASTQ files were aligned to the GRCh37 assembly of the human genome using STAR47 (version 2.5.2b) in two-pass mode and counting of reads aligned over exonic features performed using HTSeq48 (version 0.6.1p1) in read strand-aware union overlap resolution mode.

Immune cell enrichment was performed using MCPcounter22 (version 1.2.0), using as input normalized log-transformed RNA-seq expression data (Extended Data Fig. 2b), and enrichment over 14 cell types using 60 genes21 (Figs. 2c and 3g and Extended Data Figs. 2c,d and 4i). In Extended Data Fig. 2e, published TCGA B cell and T cell enrichment scores are shown21. Correlations between tumor microenvironment components shown in Fig. 2c and Extended Data Fig. 2b were generated using the cor function in the base R stats package and visualized using the corrplot package (version 0.92). The TLS gene signature (CCL19, CCL21, CXCL13, CCR7, CXCR5, SELL, LAMP3)23 shown in Fig. 3g and Extended Data Figs. 2d,e and 4i was calculated using gene-set enrichment analysis. TCGA TLS enrichment scores (Extended Data Fig. 2e) were obtained using FPKM normalized counts provided by TCGA (Genomic Data Commons data release 37.0).

The cytolytic activity score20 (CYT; Extended Data Fig. 2c) was computed as the geometric mean of GZMA and PRF1 expression (TPM, 0.01 offset). The T cell inflamed score49 (Fig. 3g and Extended Data Figs. 2c and 4i) was computed using the GSVA50 R package (version 1.38.2) using as input the log-normalized expression of 18 inflammatory genes (TIGIT, CD27, CD8A, PDCD1LG2, LAG3, CD274, CXCR6, CMKLR1, NKG7, CCL5, PSMB10, IDO1, CXCL9, HLA-DQA1, CD276, STAT1, HLA-DRB1 and HLA-E), while the interferon-γ score49 (Fig. 3g and Extended Data Figs. 2c and 4i) was computed using gene-set variation analysis of six genes (IFNG, STAT1, IDO1, CXCL10, CXCL9 and HLA-DRA). The B cell activation score shown in Extended Data Fig. 2c was computed using GSVA on the MSigDB51 (version 7.3) C5 Gene Ontology Biological Processes POSITIVE_REGULATION_OF_B_CELL_ACTIVATION (GO:0050871) gene set, using as input the log2 TPM expression, with 0.01 offset.

Healthy tissue GTEx isotype analysis

In the healthy tissue BCR isotype analysis shown in Extended Data Fig. 1c, normalized gene counts (TPM) were downloaded from the GTEx17 consortium website (version 8, https://gtexportal.org/home/datasets) and the expression of IGH isotypes retained. Expression data were available for 3,905 samples from organ sites sampled within this study (GTEx n: brain = 2,642, breast = 459, liver = 226, lung = 578). In Extended Data Fig. 1c, the heat map shows the proportion of isotype TPM expression per organ site. In Extended Data Fig. 1d, the median z-score scaled expression of BCR isotypes is shown. The expression values of CD3D, CD3G, CD3E and CD247, which encode for the four different parts of the CD3 complex, were summed to calculate TCR expression. In Extended Data Fig. 4i, samples with high expression of unswitched transcripts were defined as those with a >50th percentile expression of IGHD/IGHM genes, while those with low expression of unswitched transcripts were defined as those with a ≤50th percentile expression of IGHD/IGHM.

BCR library preparation and sequencing

BCR libraries were prepared from RNA samples extracted from 27 metastatic sites and 25 primary breast tumors. BCR variable heavy domains were first amplified using a protocol we have previously described52. Briefly, RNA was reverse transcribed to cDNA using a mixture of IgA/IgD/IgE/IgG/IgM isotype specific primers, incorporating 15 nucleotide unique molecular identifiers (UMIs). The resulting cDNA was used as a template for PCR amplification using a set of six FR1-specific forward primers including sample-specific barcode sequences (seven nucleotides) along with a reverse primer specific to the reverse transcription primer. For three of the replicate libraries, a modified primer set was used where the sample-specific barcode was instead incorporated into the reverse transcription primers after the UMI.

BCR variable heavy domain amplicons (~450 bp) were quantified by TapeStation (Beckman Coulter) and subjected to gel purification. Dual-indexed sequencing adapters (KAPA) were ligated onto ≤500 ng of amplicon per sample using the HyperPrep library construction kit (KAPA). The adaptor-ligated libraries were finally PCR amplified (initial denaturation at 95 °C for 1 min, for 2–83 cycles at 98 °C for 15 s, 60 °C for 30 s, 72 °C for 30 s and a final extension at 72 °C for 1 min). The libraries were sequenced on an Illumina MiSeq using the 2 × 300-bp chemistry.

BCR-sequencing processing

Raw BCR-sequencing reads were processed for analysis using the Immcantation framework, using previously described parameters (docker container v3.0.0)52,53. Briefly, paired-end reads were joined based on a minimum overlap of 20 nucleotides, and a maximum error of 0.2, and reads with a mean Phred score below 20 were removed. Primer regions, including UMIs and sample barcodes, were then identified within each read, and trimmed. Together, the sample barcode, UMI, and constant region primer were used to assign molecular groupings for each read. Within each grouping, usearch54 was used to subdivide the grouping, with a cutoff of 80% nucleotide identity, to account for randomly overlapping UMIs. Each of the resulting groupings is assumed to represent reads arising from a single RNA. Reads within each grouping were then aligned, and a consensus sequence determined. To remove low-level noise, molecular groupings with two or fewer sequences contributing to the UMI consensus were filtered out (Supplementary Table 1). Duplicate reads were then collapsed into a single processed sequence. IgBlast55 (version 1.14.0) was used to annotate the processed sequences, and unproductive sequences were removed. Sequence data from replicate libraries were then pooled for analysis.

BCR clonotype assembly

Annotation of TCR and BCR sequences were performed using IMGT/HighV-QUEST56 (version 1.8.5) and clonotype assembly performed using MRDARCY57, which was run using default parameters. B cell clones are groups of B cells from an individual that derive from the same pre-B cell, and thus have identical BCR sequences or BCR sequences related by SHM. Computationally, BCRs from clonal B cells can be clustered together via network generation using a previously described pipeline26. Briefly, each vertex represents a unique sequence, and the relative vertex size is proportional to the number of identical reads. Edges join vertices that differ by single-nucleotide non-indel differences and clusters are collections of related, connected vertices. A clone (cluster) refers to a group of clonally related B cells, each containing BCRs with identical CDR3 regions and IGHV gene use, or differing by single point mutations, such as through SHM. Likewise, a T cell clone (cluster) refers to a group of related T cells arising from the same pre-T cell, each containing TCRs with identical CDR3 regions and TCRV gene usage.

BCR CDR3 overlap with reference pathogen antibody libraries

A reference antibody database with known binding to viral or bacterial antigen was constructed from existing public databases: the structural antibody database58, abYsis human antibody database59 and the immune epitope database60. Antibody sequences corresponding to synthetic fusion proteins and animal-derived BCRs were excluded.

After preprocessing, 5,800 antibody sequences reacting to antigens were retained, including those derived from human immunodeficiency virus-1 (n = 3,525), Clostridium tetani (n = 817), influenza A (n = 486), vaccinia virus (n = 92), hepatitis C virus (n = 80), Streptococcus pneumoniae (n = 59), Staphylococcus aureus (n = 38) and human betaherpesvirus 5 (n = 32) were used for downstream analysis (Supplementary Table 2).

To determine potential matches, we screened the cancer CDR3 amino acid sequences to the reference antibody database, allowing for up to three amino acid mismatches by fuzzy string matching via a custom Python script. The proportions of BCRs/sample associated with known binding to viral or bacterial antigen across clone classes (Extended Data Fig. 4b) and degree centrality (Extended Data Fig. 5d) were calculated to show that the observations made were not secondary to established systemic responses to non-cancer antigens.

TCR library preparation and sequencing

TCR-sequencing library preparation, sequencing and repertoire identification and network analysis performed by us have been described previously16. Briefly, MiSeq libraries were prepared using the same protocol as for the BCR libraries. Raw MiSeq reads were filtered for base quality, primer and constant region trimming, annotation and clustering using the same protocol as for the BCR libraries but using TCR as the chain parameter.

Clonal overlap between metastatic sites

In Fig. 2 and Extended Data Fig. 2, the clonal repertoire analyses for participants 308 and 315 that were dependent on sequencing depth were generated by subsampling each sample to 90% of the number of unique VDJ sequences present in the sample with the lowest depth (unique VDJ subsampling thresholds: participant 308: n = 980 (BCR), 4,657 (TCRα), 2,620 (TCRβ); participant 315: n = 1,524 (BCR), 3,199 (TCRα), 2,535 (TCRβ). Throughout the paper, we have used the term ‘relative level’ to indicate that the analyses were performed using subsampled data.

In Fig. 2b,d,f and Extended Data Fig. 2a,f,g, the relative level of shared BCR/TCR VDJ sequences was computed by calculating the number of shared VDJ sequences between different metastatic sites in 10,000 subsampling operations and then computing the median of the number of overlaps across iterations. In Fig. 2e, the median Jaccard coefficient of shared VDJ sequences derived in the same 10,000 subsampling operations was used to generate BCR and TCR similarity matrices, from which hierarchical clustering was performed to generate the BCR and TCR clonal similarity trees via the hclust function in R using the ward.D2 agglomeration method. In the spatio-migratory maps of B cell clonal migration shown in Extended Data Fig. 2f,g, the clonal repertoire analyses for participants 308 and 315 were generated by calculating the median number of shared BCR clones across the same 10,000 subsampling operations.

Clonal overlap correlations with tumor genomic landscape

The tumor phylogenetic trees were generated using OncoNEM44, as previously described by us16. The hclust (hierarchical clustering) function in the base R stats package was used to compute the BCR, TCR and genomic trees using the ward.D2 agglomeration method. The comparison of the hclust objects was done using the cophenetic correlation, using the cor_cophenetic function from the dendextend package (version 1.15.2)61. A permutation test was used to calculate correlation one-sided P values, where the tree labels were randomly shuffled for 100 permutations, while keeping the tree topologies constant. The comparison of the BCR and TCR Jaccard clustering trees with the genetic trees was done by using the cophenetic definition for edge-weighted trees. In this version of the cophenetic, the distance between each pair of nodes is the sum of the weights of edges along the path connecting these pairs of nodes.

BCR and TCR clonotype classification

In all participants with more than one tumor sampled (metastatic breast cancer cohort participants: 308, 315, 323, 330; early breast cancer cohort: all participants; Fig. 1), the clone proportion per sample was calculated by dividing the number of UMIs from each clone identified using MRDARCY with the total number of UMIs present in the sample. BCR clones were classified as stem, clade or private depending on whether they were observed in all, some or a single sample from the same participant, respectively (Extended Data Fig. 4a). Stem and clade clones were considered to be immunosurveilling given that they were present in more than one metastatic sample from a single participant.

We further refined the stem, clade and private clone classification by taking into account clone size (percentage of UMIs) to identify clonal expansion. We fitted a Gaussian mixture model to the log percentage UMI values of all BCR sequences of all early and metastatic breast cancer samples using the MClust (version 5.4.9)62 R package to identify an overall BCR clone size cutoff threshold for expanded versus unexpanded clones. This threshold was set to ensure representation of all four clonal classes in all samples and that the expanded clones represented less than 10% of the total repertoire. Using this threshold, BCR clones were classified into four categories: (A) private and expanded, (B) shared and expanded, (C) private and unexpanded and (D) shared and unexpanded (Fig. 3a). Clones where clone size was above the cutoff threshold in some sites (that is, expanded) and below the threshold in others (that is, unexpanded) were classified as expanded.

CDR3 probability of generation analysis

We calculated BCR CDR3 Pgen as a result of VDJ recombination with OLGA25 version 1.2.4 using as input the default human B cell heavy chain model and the amino acid CDR3 sequence of each BCR (Fig. 3c). In Fig. 3c, Pgen scores derived from BCR-sequencing data obtained from the peripheral blood mononuclear cells from a published healthy participant26 are shown. Antigen-experienced BCRs were defined as those that were class switched (IgA, IgE, IgG) and had more than four somatic mutations. Antigen-inexperienced BCRs were defined as non-class-switched BCRs (IgD and IgM) with four or fewer mutations.

Isotype usages and SHM across BCR clone classes

In Fig. 3d, the number of UMIs in each clone per IGH isotype were counted for each sample and summarized by summing the UMI counts by clone class (A, B, C, D) for each isotype/sample, resulting in 192 sample/clone class combinations (48 samples × 4 BCR clone classes) for all 9 BCR isotypes. The total proportion of unswitched BCRs comprised the sum of the proportion of IgD and IgM UMI BCRs. In Fig. 3f, the total proportion of each IGH isotype across sequential samples obtained during therapy in early breast cancer and metastatic samples is shown. Statistical comparisons between early breast cancer time points were performed using an ordinal logistic regression to identify whether there was a monotonic association between IGH isotype proportion and time point. Statistical comparisons between early and metastatic breast cancer samples were performed using Wilcoxon rank-sum tests. In Extended Data Fig. 4g, the data plotted in Fig. 3f are subset across the four BCR clone classes.

BCRs were classified into four SHM categories (no, low, high and very high SHM) using the normalmixEM function from the mixtools R package (version 2.0.0), providing as input the log SHM count. The thresholds used were 0–1 mutation, 1–10 mutations, 11–33 mutations and >33 mutations for the no, low, high and very high SHM categories, respectively. In Extended Data Fig. 4d, the proportion of BCRs for each of the four SHM classes per sample is shown across the four BCR clone classes. In Fig. 3e, highly mutated BCRs were defined as those BCRs classified as having high and very high SHM counts. Statistical comparisons between early breast cancer time points were performed using an ordinal logistic regression to identify whether there was a monotonic association between the percentage of highly mutated BCRs and time point. Statistical comparisons between early and metastatic breast cancer samples were performed using Wilcoxon rank-sum tests.

In the analyses shown in Fig. 3g, all samples from all participants were used. The sample isotype usage was calculated by summing the total number of BCR UMIs per isotype per sample and then dividing this by the total number of UMIs within the sample, as described previously. The total proportion of unswitched BCR comprised the sum of the proportion of IgD and IgM BCRs. The mean sample BCR mutation count was calculated by first calculating the mean SHM per clone per sample, and then calculating the mean SHM per sample (so that larger clones are not overrepresented). Samples with high SHM and CSR were defined as those with a >50th percentile SHM and CSR, while those with low SHM and CSR were defined as those with a ≤50th percentile SHM and CSR (Fig. 3g). In Fig. 3g, data from participants with more than one tumor site sampled are shown (early breast cancer cohort: all participants (n = 10), metastatic breast cancer cohort participants: 308, 315, 323, 330), as classification into the four clonal groups required the sampling of more than one site/participant. In Fig. 3g, all samples from all participants are shown.

BCR clonal expansion and diversification

We calculated BCR clonal expansion by first subsampling each tumor’s BCR-sequencing data to 90% of the number of unique UMIs present in the sample with the lowest depth and summing the total number of UMIs associated with each unique BCR VDJ sequence. The Gini, Shannon index and mean clone sizes were calculated using the ineq R package (version 0.2–13), the posterior R package (version 1.4.1) and custom code, respectively. The mean of 1,000 iterations was used to calculate the final clonal expansion metrics (Fig. 4a and Extended Data Fig. 5a).

To calculate the per-site proportion of immunosurveilling clones (Fig. 4b), the total number of unique VDJ sequences per clone across all samples was calculated, and clones that were present in more than one site and had at least four unique VDJs in at least one metastatic site retained. The proportion of each of these clones across all samples was then calculated by dividing the total number of VDJs per clone per sample by the sum of the number of VDJs for that clone in all samples. The mean of these clone proportions per site was then calculated (Fig. 4b). In Extended Data Fig. 5b, the percentage of clones per sample that had at least four unique VDJ sequences were calculated.

BCR clonal network analysis

Network clustering of BCR clones was performed using MRDARCY57 in participants with more than one site sampled (metastatic breast cancer cohort: participants 308, 315, 323 and 330, early breast cancer cohort: all 10 participants). BCRs were clustered using a sequence identity threshold of 0.95, and clones that were present in a minimum of two tumor samples for each participant and had a minimum of ten unique BCR sequences were retained (number of clones retained in metastatic dataset: participant 308 = 204; participant 315 = 733; participant 323 = 85; participant 330 = 23).

For each BCR clone, the ends of the multiple sequence alignment were trimmed until 95% of all BCR sequences had an aligned nucleotide at the end of the sequence, with a minimum trimmed length of 80 nucleotides required for network clustering to be performed. A distance matrix was subsequently constructed for all sequences per clone, identical BCR sequences grouped together into clusters, and the abundance of these clusters across metastatic sites was calculated by dividing the total number of UMIs present in the cluster by the total number of UMIs in the sample being analyzed. BCR clone network diagrams were generated by computing the pairwise Hamming distances between sequences using the phangorn63 R package (version 2.7.1), followed by neighbor-joining tree estimation and phylogenetic tree construction and optimization using the pml and optim.pml functions in phangorn (Fig. 4c,d and Extended Data Fig. 5c).

To calculate the degree of a BCR sequence, a minimum spanning tree was calculated on the Hamming distance matrix using the mst function in the ape64 R package (version 5.6), which was then converted into an undirected graph using the graph_from_adjacency_matrix function in the igraph R package (version 1.2.10). The degree centrality was then computed using the degree function in igraph (Fig. 4c–f and Extended Data Figs. 5c–h).

We validated in our network clustering findings in four independent datasets (Extended Data Fig. 5i). Two metastatic breast cancer datasets (from the Hartwig Medical Foundation (HMF)27 and the Rapid Autopsy tumor Donation program (RAP) at the UNC at Chapel Hill28) were identified and TRUST4 (ref. 65) was used to reconstruct the BCR immune receptor repertoires from the RNA-seq data, which were then processed using MRDARCY. Sixteen participants in the HMF dataset had breast tumor RNA-seq data for more than one metastatic deposit and clonotype assembly, and intra-participant comparison was only possible in one participant (participant ID: HMFN_0320), which had a higher coverage (1,085 BCRs identified in one sample and 1,757 in another). Similarly, clonotype assembly and intra-participant comparison were possible in one participant in the RAP dataset (participant ID: 828433). BCR-sequencing data for diabetes29 and a multiple sclerosis30 datasets were downloaded from the iReceptor gateway66 and processed using MRDARCY. Eight participants in the diabetes dataset and three participants in the multiple sclerosis dataset had multisite BCR-sequencing data for which clonotype assembly and intra-participant comparisons were possible.

We have created and uploaded an R framework hosted at https://github.com/sjslab/BCR-Immunosurveillance to generate network clustering of BCR clones and compute the centrality analyses from BCR repertoire data derived from BCR sequencing, as well as BCR repertoire data obtained from bulk RNA-seq data.

To determine the predictability of immunosurveilling clones based on BCR degree in the early and metastatic breast cancer cohorts (Fig. 4f and Extended Data Fig. 5g,h), we calculated the sensitivity, specificity and accuracy of a classification that categorizes BCRs as immunosurveilling or not based on a series of degree cutoffs (>1, >2 >10). Model performance metrics were generated using the confusionMatrix function in the caret (version 6.0-90) R package.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

留言 (0)

沒有登入
gif