All samples were collected with informed consent in concordance with institutional review board (IRB) approval. Primary breast carcinoma samples were collected during surgical resection and verified by standard pathology (IRB protocol 201108117). Blood was collected at the time of surgery into vacuum tubes containing ethylenediaminetetraacetic acid (EDTA) (BD Bioscience). Cells were isolated by Ficoll-density centrifugation and frozen in fetal bovine serum with 5% dimethyl sulfoxide. Clinical data were captured in accordance with IRB protocol 20108117 at the time of informed consent and entered into the REDCap database.
Statistics and reproducibilityRelevant statistics are referred to in each of the associated methods sections. We did not use statistical methods to predetermine a sample size and patients were not randomly selected as patients were enrolled as they entered the clinic. We excluded samples that did not pass sample preparation QC. Further information on research design is available in the Nature Research Reporting Summary linked to this article. Data distribution was assumed to be normal but this was not formally tested. Data collection and analysis were not performed blind to the conditions of the experiments.
Human sample processingAfter verification by an attending pathologist, a 1.5 × 1.5 × 0.5-cm portion of the tumor was removed, photographed, weighed and measured. Each piece was then subdivided into 6–9 pieces (depending on the original size) and then further subdivided into four transverse cut pieces. Pieces were then placed into formalin, snap frozen in liquid nitrogen, DMEM and formalin, respectively. Relevant protocols can be found at https://www.protocols.io/view/biospecimen-collection-and-processing-2-0-bp2l6b3bzgqe/v1(ref. 89). As per the institutional requirement, pathology restricts the sampling of any tumors to 2 cm or above with no restriction on maximal tumor size or burden.
Pathologic parameters and assessmentEach tumor that was subdivided into smaller increments was subjected to H&E stain and was assessed by a pathologist for the following parameters: tumor differentiation and grade, percentage of tumor-infiltrating lymphocytes, lymphovascular invasion and perineural invasion. Tumor viability was also assessed by the presence or absence of necrosis. Both slices of each tumor piece, both L1 and L4 when available were subjected to assessment.
Mouse sample collection and processingB6.FVB-Tg(MMTV-PyVT) 634Mul/LellJ (strain 022974) female mice were purchased from the Jackson Laboratory. Mice were killed by carbon dioxide asphyxiation and five pairs of mouse mammary glands were collected at 12 weeks old. For each pair, the left mammary gland was flash frozen in liquid nitrogen and the right glands were fixed in 10% neutral buffered formalin (Epredia, 5725) then embedded in paraffin. The five left mammary glands were pooled together for snRNA-seq and snATAC-seq sample preparation. The paraffin-embedded glands were used for H&E staining. All animal experiments were approved by the Washington University in Saint Louis Institutional Animal Care and Use Committee office.
Genomic DNA and RNA extractionTumor tissues and corresponding normal tissues were obtained from surgically resected specimens and after a piece was removed for fresh single-cell preparation the remaining sample was snap frozen in liquid nitrogen and stored at −80 °C. Before bulk-RNA/DNA extraction, samples were cryopulverized (Covaris) and aliquoted for bulk extraction methods. Genomic DNA was extracted from tissue samples with either the DNeasy Blood and Tissue kit (QIAGEN, 69504) or the QIAamp DNA Mini kit (QIAGEN, 51304). Total RNA was extracted with TRI reagent (Millipore Sigma, T9424) and treated with DNase I (QIAGEN, 79254) using an RNeasy MinElute Cleanup kit (QIAGEN, 74204). RNA integrity was evaluated using either a Bioanalyzer (Agilent Technologies) or TapeStation (Agilent Technologies). Germline genomic DNA was purified from cryopreserved peripheral blood mononuclear cells using the QIAamp DNA Mini kit (QIAGEN, 51304) according to the manufacturer’s instructions. DNA quantity was assessed by fluorometry using the Qubit dsDNA HS Assay (Thermo Fisher Scientific, Q32854) according to the manufacturer’s instructions (Thermo Fisher Scientific). Relevant protocols can be found at https://www.protocols.io/view/bulk-dna-extraction-ding-lab-bsnhndb6, https://www.protocols.io/view/bulk-rna-isolation-ding-bsnfndbn (refs. 90,91).
Whole-exome sequencingA total of 100–250 ng of genomic DNA was fragmented on the Covaris LE220 instrument targeting 250-bp inserts. Automated dual indexed libraries were constructed with the KAPA Hyper library prep kit (Roche) on the SciClone NGS platform (PerkinElmer). Up to ten libraries were pooled at an equimolar ratio by mass before hybrid capture targeting a 5-µg library pool. The library pools were hybridized with the xGen Exome Research Panel v.1.0 reagent (IDT Technologies) that spans a 39-Mb target region (19,396 genes) of the human genome. The libraries were hybridized for 16–18 h at 65 °C followed by a stringent wash to remove spuriously hybridized library fragments. Enriched library fragments were eluted and PCR cycle optimization was performed to prevent over amplification. The enriched libraries were amplified with KAPA HiFi master mix (Roche) before sequencing. The concentration of each captured library pool was accurately determined through qPCR utilizing the KAPA library Quantification kit according to the manufacturer’s protocol (Roche) to produce cluster counts appropriate for the Illumina NovaSeq-6000 instrument. Then, 2 × 150 paired-end reads were generated, targeting 12 Gb of sequence to achieve ~100× coverage per library.
RNA sequencingTotal RNA integrity was determined using Agilent Bioanalyzer or 4200 TapeStation. Library preparation was performed with 500 ng to 1 μg total RNA. Ribosomal RNA was blocked using FastSelect reagents (QIAGEN) during cDNA synthesis. RNA was fragmented in reverse transcriptase buffer with FastSelect reagent and heating to 94 °C for 5 min, 75 °C for 2 min, 70 °C for 2 min, 65 °C for 2 min, 60 °C for 2 min, 55 °C for 2 min, 37 °C for 5 min and 25 °C for 5 min. mRNA was reverse transcribed to yield cDNA using SuperScript III RT enzyme (Life Technologies, per manufacturer’s instructions) and random hexamers. A second strand reaction was performed to yield ds-cDNA. cDNA was blunt ended, had an A base added to the 3′ ends and then had Illumina sequencing adaptors ligated to the ends. Ligated fragments were then amplified for 15 cycles using primers incorporating unique dual index tags. Fragments were sequenced on an Illumina NovaSeq-6000 S4 instrument generating approximately 30 million paired-end 2 × 150 reads per library.
Single-cell suspension preparationFor each tumor, approximately 15–100 mg of 2–4 sections of each tumor and/or normal piece of tissue were cut into small pieces using a blade and processed separately. Enzymes and reagents from the human tumor dissociation kit (Miltenyi Biotec, 130-095-929) were added to the tumor tissue along with 1.75 ml DMEM. The resulting suspension was loaded into a gentleMACS C-tube (Miltenyi Biotec, 130-093-237) and subject to the gentleMACS Octo Dissociator with Heaters (Miltenyi Biotec, 130-096-427). After 30–60 min on the heated dissociation program (37h_TDK_1), samples were removed from the dissociator and filtered through a 40-μM Mini-Strainer (PluriSelect, 43-10040-60) or 40-μm nylon mesh (Fisher Scientific, 22-363-547) into a 15-ml conical tube on ice. The sample was then spun down at 400g for 5 min at 4 °C. After removing supernatant, when a red pellet was visible the cell pellet was resuspended using 200 μl to 3 ml ACK Lysis Solution (Thermo Fisher, A1049201) for 1–5 min. To quench the reaction, 10 ml PBS (Corning, 21-040-CM) with 0.5% BSA (Miltenyi Biotec, 130-091-376) was added and spun down at 400g for 5 min at 4 °C. After removing supernatant, cells were resuspended in 1 ml PBS (Corning, 21-040-CM) with 0.5% BSA. Live and dead cells were visualized using Trypan blue. If greater than 40% of dead cells were present, the sample was spun down at 400g for 5 min at 4 °C and subjected to the dead cell removal kit (Miltenyi Biotec, 130-090-101). Finally, the sample was spun down at 400g for 5 min at 4 °C and resuspended in 500 μl to 1 ml PBS with 0.5% BSA to a final concentration of 700 to 1,500 cells per μl. A step-by-step protocol is found at https://www.protocols.io/view/wu-sc-prep-protocol-for-solid-tumors-v2-1-yxmvmkp5bg3p/v1 (ref. 92).
Single-cell library prep and sequencingUtilizing the Chromium Next GEM Single Cell 3′ GEM, Library & Gel Bead kit v.3.3 and Chromium instrument, approximately 17,500 to 25,000 cells were partitioned into nanoliter droplets to achieve single-cell resolution for a maximum of 10,000–15,000 individual cells per sample (10x Genomics). The resulting cDNA was tagged with a common 16-nt cell barcode and 10-nt unique molecular identifier (UMI) during the reverse transcriptase (RT) reaction. Full-length cDNA from poly-A mRNA transcripts was enzymatically fragmented and size selected to optimize the cDNA amplicon size (approximately 400 bp) for library construction (10x Genomics). The concentration of the 10x single-cell library was accurately determined through qPCR (Kapa Biosystems) to produce cluster counts appropriate for the HiSeq 4000 or NovaSeq-6000 platform (Illumina). The 26 × 98-bp sequence data were generated targeting 50,000 read pairs per cell, which provided digital gene expression profiles for each individual cell.
Single-nuclei RNA and ATAC library preparation and sequencingApproximately 20–30 mg of cryopulverized powder from BRCA specimens was resuspended in 2 ml lysis buffer (10 mM Tris-HCl (Thermo Fisher, 15567027) (pH 7.4); 10 mM NaCl (Themo Fisher, AM9759); 3 mM MgCl2 (Thermo Fisher, AM9530G) and 0.1% NP-40 (Sigma, 74385-1L)) plus 0.1 U μl−1 RNase Inhibitor (Invitrogen, AM2696). This suspension was pipetted gently 6–8 times, incubated on ice for 30 s and pipetted again 4–6 times. The lysate containing free nuclei was filtered through a 40-μm cell strainer. We washed the filter with 1 ml Wash and Resuspension buffer (1× PBS (Corning, 21-040-CM) + 2% BSA (Miltenyi Biotec, 130-091-376) + 0.2 U μl−1 RNase inhibitor) plus 0.1 U μl−1 RNase Inhibitor and combined the flow through with the original filtrate. After a 6-min centrifugation at 500g and 4 °C, the nuclei pellet was resuspended in 300 μl Wash and Resuspension buffer plus 0.1 U μl−1 RNase Inhibitor. After staining with 1 μl 7AAD (ATAC or multiome) or DRAQ5 (RNA) the nuclei were further purified by FACS. FACS-purified nuclei were centrifuged again and resuspended in a small volume (~30 μl Wash and Resuspension buffer plus 0.1 U μl−1 RNase Inhibitor). After counting and microscopic inspection of nuclei quality, the nuclei preparation was diluted to ~1,000 nuclei per μl. Then, ~20,000 nuclei were used for snRNA-seq by the 10x Chromium platform. For snRNA, we loaded the single nuclei onto a Chromium Next GEM Chip G kit and processed them through the Chromium Controller to generate GEMs (gel beads in emulsion). For a subset of samples with joint snRNA and snATAC the multiome kit, Chromium Next GEM Single Cell Multiome ATAC + Gene Expression was used. For ATAC-only samples FACS-purified nuclei were centrifuged again and resuspended in 5 μl Diluted Nuclei Buffer. After counting and microscopic inspection of nuclei quality, the nuclei preparation, ~10,000 nuclei were used for snATAC-seq by the 10x Chromium platform. We loaded the single nuclei onto a Chromium Next GEM Chip H kit and processed them through the Chromium Controller to generate GEMs. After that, post-GEM-RT Cleanup was performed with target cell recovery ≥2,000. We then prepared the sequencing libraries following the manufacturer’s protocol. All sequencing was performed on an Illumina NovaSeq-6000 S4 flow cell. The libraries were pooled and sequenced using the XP workflow according to the manufacturer’s protocol with a 28 × 8 × 98-bp sequencing recipe. The resulting sequencing files were available in FASTQ format per sample after demultiplexing. A step-by-step protocol is found at https://www.protocols.io/view/wu-sn-prep-protocol-for-solid-tumors-snrna-protoco-14egn7w6zv5d/v1 (ref. 93).
Fluorescence-activated cell sortingDepending on the pellet size, 100–500 μl nuclei suspension in the wash buffer (2% BSA + 1× PBS + RNase inhibitor) was stained with DRAQ5 or 7AAD for RNA or ATAC sequencing, respectively (7AAD was used for multiome processing). Namely, snRNA-seq nuclei were stained with 1 μl DRAQ5 per 300 μl of the sample and snATAC-seq nuclei were stained with 1 μl 7AAD per 500 μl sample. Sorting gates were based on size, granularity and dye staining signal.
Immunofluorescence and microscopyFresh tissues were fixed in 10% neutral buffered formalin (Epredia, 5725) at room temperature overnight but for less than 24 h. Tissues were then dehydrated, infiltrated with wax and embedded into paraffin blocks. After tissues were processed into formalin-fixed paraffin-embedded blocks, 5-μm sections were cut and placed on glass slides. Next, sections were deparaffinized and rehydrated, followed by antigen retrieval using Tris EDTA buffer, pH 9 (Genemed, 10-0046) or 1× sodium citrate, pH 6 (Sigma, C9999) according to the manufacturer’s recommendation for specific antibodies. Then, sections were blocked with 100 mM glycine for 20 min, followed by blocking with 10% normal serum and 1% BSA for 1 h at room temperature. A negative control and a secondary antibody control were used in each experimental setting. Primary antibodies for MELK (Thermo Fisher, MA517120) and E-cadherin (R&D, AF748) were applied on sections at 4 °C overnight, followed by the incubation of appropriate secondary antibodies the next day. Images were collected using a Leica DMi8 microscope.
CODEX preparation and imagingCarrier-free monoclonal or polyclonal anti-human antibodies were purchased from different companies (Supplementary Table 6) and verified using immunofluorescence (IF) staining in multiple channels. After screening, antibodies were conjugated using an Akoya Antibody Conjugation kit (Akoya Biosciences, SKU 7000009) with a barcode (Akoya Biosciences) assigned based on the IF staining results. Several common markers were directly purchased through Akoya Biosciences (Supplementary Table 6). CODEX staining and imaging were performed according to the manufacturer’s instruction (CODEX User Manual, Rev C). In brief, 5-μm formalin-fixed paraffin-embedded sections were placed on APTES (Sigma, 440140)-coated coverslips and baked at 60 °C overnight before deparaffinization. The next day, tissues were incubated in xylene, rehydrated in ethanol and washed in ddH2O before antigen retrieval with TE buffer, pH 9 (Genemed, 10-0046) in boiling water for 10 min in a rice cooker. The tissues were then blocked using blocking buffer (CODEX staining kit, SKU 7000008) and stained with the marker antibody panel (Supplementary Table 6) to a volume of 200 µl for 3 h at room temperature in a humiliated chamber. Imaging of the CODEX multicycle experiment was performed using a Keyence fluorescence microscope (model BZ-X810) equipped with a Nikon CFI Plan Apo λ ×20/0.75 objective, the CODEX instrument (Akoya Biosciences) and CODEX Instrument Manager (Akoya Biosciences). Exposure times, dilutions and the order of markers per cycle are listed in Supplementary Table 6. The raw images were then stitched and processed using the CODEX processor (Akoya Biosciences). After multiplex imaging was completed, H&E staining was performed on the same tissue.
ST preparation and sequencingOCT-embedded tissues were cryosectioned and placed on Visium Spatial Gene Expression Slide following Visium Spatial Protocols-Tissue Preparation Guide (10x Genomics, CG000240 Rev A). Briefly, fresh tissues were coated carefully and thoroughly with room temperature OCT without any bubbles. OCT-coated tissues were then placed on a metal block chilled in dry ice until the OCT turned solidified and white. After an RNA quality check using TapeStation and morphology check using H&E staining for the OCT-embedded tissues, blocks were scored into proper size that fit the Capture Areas and then sectioned into 10-μm sections. After the tissue placement into the Capture Area, sections were fixed in methanol, stained with H&E and imaged at ×20 magnification using the brightfield imaging setting on Leica DMi8 microscope. Tissues were then permeabilized for 18 min and ST libraries were constructed following Visium Spatial Gene Expression Reagent Kits User Guide CG000239 Rev A (10x Genomics). Briefly, cDNA was reverse transcribed from the poly-adenylated mRNA which was captured by the primers on the slides. Next, the second strand was synthesized and denatured from the first strand. Free cDNA was then transferred from slides to tubes for further amplification and library construction. Libraries were sequenced on the S4 flow cell of Illumina NovaSeq-6000 system. A step-by-step protocol is found at https://doi.org/10.17504/protocols.io.x54v9d3opg3e/v1 (ref. 94).
Quantification and statistical analysisGenomic data analysis Tumor-normal somatic variant callingSomatic variants were called from WES tumor and normal paired BAMs using somaticwrapper v.1.6, a pipeline designed for detection of somatic variants from tumor and normal exome data. The pipeline merges and filters variant calls from four callers: Strelka v.2.9.2 (ref. 95), VarScan v.2.3.8 (ref. 96), Pindel v.0.2.5 (ref. 97) and MuTect v1.1.7 (ref. 98). Single-nucleotide variant (SNV) calls were obtained from Strelka, VarScan and MuTect. Indel calls were obtained from Stralka2, VarScan and Pindel. The following filters were applied to get variant calls of high confidence: normal VAF ≤ 0.02 and tumor VAF ≥ 0.05, read depth in tumor ≥14 and normal ≥8, indel length <100 bp, all variants must be called by two or more callers, all variants must be exonic and exclude variants in dbSNP but not in COSMIC.
Tumor-only somatic variant callingTumor-only somatic variants were called using Mutect2 (v.4.1.2.0) best-practice pipeline (gatk.broadinstitute.org) with the GDC Panel of Normal (PON) data (gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files; gatk4_mutect2_4136_pon.vcf.tar). To further reduce false positives, we kept the mutation sites with ≥20× coverage, >3 reads and ≥0.1 tumor VAF, which was supported by bam-readcount (https://github.com/genome/bam-readcount).
Somatic variant rescueIn some tumor cases, we called somatic variants of driver genes for some pieces, but not for all pieces. Therefore, we used bam-readcount to rescue those variants for the piece(s) without them. We kept default parameters for bam-readcount, except setting –min-mapping-quality as 20 and –min-base-quality as 20.
Germline variant calling and annotationGermline variant calling was performed using an in-house pipeline GermlineWrapper v.1.1, which implements multiple tools for the detection of germline INDELs and SNVs. Germline SNVs were identified using VarScan v.2.3.8 (with parameters –min-var-freq 0.10–p-value 0.10, –min-coverage 3–strand-filter 1) operating on a mpileup stream produced by SAMtools v.1.2 (with parameters -q 1 -Q 13) and GATK v.4.0.0.0 (ref. 99) using its haplotype caller in single-sample mode with duplicate and unmapped reads removed and retaining calls with a minimum quality threshold of 10. Germline INDELs were identified using VarScan (version and parameters as above) and GATK (version and parameters as above) in single-sample mode. SNVs were based on the union of raw GATK and VarScan calls. We required that indels were called by Pindel or at least two out of the three callers (GATK, VarScan and Pindel). Cutoffs of minimal 10× coverage and 20% VAF were used in the final step to report high-quality germline variants. All resulting variants were limited to the coding region of the full-length transcripts obtained from Ensembl release 100 plus additional two base pairs flanking each exon to cover splice donor/acceptor sites. We also required variants to have an allelic depth ≥5 for the alternative allele in both tumor and normal samples and filtered out any indels longer than 100 bp.
Germline variant pathogenic classificationGermline variants called with GermlineWrapper were annotated with the Ensembl Variant Effect Predictor100 (v.100 with default parameters, except where –everything) and their pathogenicity was determined with our automatic pipeline CharGer48 (v.0.5.4 with default CharGer scores; https://github.com/ding-lab/CharGer/tree/v0.5.4), which annotates and prioritizes variants based on the American College of Medical Genetics and Genomics–Association for Molecular Pathology (ACMG–AMP) guidelines101. The detailed implementation, score of each evidence level and parameters used are as previously described102.
We selected rare variants with ≤0.05% allele frequency in gnomAD (r2.1.1) or 1000 Genomes103. We also performed readcount analysis using bam-readcount (https://github.com/genome/bam-readcount; v.0.8 with parameters -q 10, -b 15) in both normal and tumor samples. We required variants to have at least five counts of the alternative allele and a VAF of at least 20% in both tumor and normal. Variants affecting known cancer predisposition genes (previously described elsewhere102) were manually reviewed with the Integrative Genomics Viewer software (21221095; v.2.8.2). We considered variants to be ‘pathogenic’ if they were known pathogenic variants in ClinVar; ‘likely pathogenic’ if CharGer score >8; and a ‘prioritized variant of uncertain significance’ if CharGer score >4.
Variants called in cases where a normal sample was not available (tumor only) were further filtered for removal of potential somatic events. Variants in these cases called by our germline pipeline which were also called by our somatic pipeline were filtered out, as well as variants not previously reported in gnomAD or reported in COSMIC (extracted from Variant Effect Predictor annotation).
RNA quantificationWe used an in-house bulk-RNA-seq analysis pipeline for quantification (https://github.com/ding-lab/HTAN_bulkRNA_expression). In brief, for each sample, the raw sequence reads were aligned into BAM files using STAR (v.2.7.4a) two-pass alignment with GRCh38 as the reference. The resulting BAM files were then quantified as a raw count matrix using feature counts (subread, v.2.0.1). For both alignment and quantification, gene annotations were based on Gencode v.34. The raw counts were then converted to FPKM-UQ based on GDC’s formula and then log2 transformed with one pseudocount (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#upper-quartile-fpkm).
Expression-based subtypingBulk expression subtyping was performed according to the methods detailed by Parker et al.104 using the log2 upper quartile-normalized FPKM reads for all bulk-RNA-seq samples. Median values of all 50 PAM50 genes were provided for median adjustment. To minimize the influence of any one sample on subtype calls, median values for the 50 PAM50 genes were bootstrapped using a subset of the bulk-RNA-seq data comprising all 17 ER− samples and an equal number of randomly selected ER+ samples. PAM50 subtype assignments were called for all samples using 1,000 such sets of median values and a final subtype assignment for each sample was the subtype most commonly called across these iterations. The PAM50 subtype-calling algorithm was run using code provided by Parker et al.104 using R v.4.0.2.
scRNA-seq and snRNA-seq quantification and analysis scRNA-seq and snRNA-seq data preprocessingFor each sample, we obtained the unfiltered feature–barcode matrix per sample by passing the demultiplexed FASTQs to the CellRanger v.6.0.2 ‘count’ command using default parameters and the prebuilt GRCh38 genome reference (refdata-gex-GRCh38-2020-A; GRCh38 and Ensembl 93). Seurat v.4.1.0 (refs. 105,106) was used for all subsequent analysis. First, a series of quality filters were applied to the data to remove those barcodes that fell into any one of these categories recommended by Seurat: too few total transcript counts (<300); possible debris with too few genes expressed (<200) and too few UMIs (<1,000); possible more than one cell with too many genes expressed (>10,000) and too many UMIs (>10,000); possible dead cell or a sign of cellular stress and apoptosis with too high proportion of mitochondrial gene expression over the total transcript counts (>10%). Doublets were filtered out using Scrublet (https://github.com/AllonKleinLab/scrublet). Scrublet was run on each sample separately with the following parameter settings: expected_doublet_rate = 0.06, min_counts = 2, min_cells = 3, min_gene_variability_pctl = 85 and n_prin_comps = 30. The doublet score threshold was adjusted manually. We constructed a Seurat object using the unfiltered feature–barcode matrix for each sample. Each sample was scaled and normalized using Seurat’s ‘SCTransform’ function to correct for batch effects (with parameters vars.to.regress = c(‘nCount_RNA’, ‘percent.mito’) and variable.features n = 2,000). Any merged analysis or subsequent subsetting of cells/samples underwent the same scaling and normalization method. Cells were clustered using the original Louvain algorithm and top 30 PCA dimensions via ‘FindNeighbors’ and ‘FindClusters’ (with parameters: resolution = 0.5) functions. The resulting merged and normalized matrix was used for the subsequent analysis. Each sample was scaled and normalized using Seurat’s ‘SCTransform’ function to correct for batch effects (with parameters: vars.to.regress = c(‘nCount_RNA’, ‘percent.mito’), variable.features.n = 3,000). We then merged all samples and repeated the same scaling and normalization method. All cells in the merged Seurat object were then clustered using the original Louvain algorithm (Blondel et al. 2008) and top 30 PCA dimensions via ‘FindNeighbors’ and ‘FindClusters’ (with parameters: resolution = 0.5) functions. The resulting merged and normalized matrix was used for subsequent analysis.
scRNA-seq and snRNA-seq cell type annotationThe main cell types were assigned to each cluster by manually reviewing the expression of a comprehensive set of marker genes. The marker genes used were:
CCL5, FYN, CCL4, IL7R, GNLY (T/NK cells); CD8B, CD8A, CD3E, CD3D (CD8+ T cells); CD4, CD3E, CD3D, SELL, CCR7, IL7R, TCF7, LEF1 (CD4+ T cells); XCL2, XCL1, KLRF1, KIR2DL3, IL2RB, CD7, KLRB1, KLRD1, GZMA, PRF1, CD160, NCAM1 (NK cells); NK markers and CD3E, CD3D (NKT cells); FABP4, VWF, ACKR1, LDB2, PECAM1 (endothelial cells); COL1A2, COL1A1, COL3A1, SFRP2, DCN, SMOC2, ITGBL1, FBLN1, CDH11, PDGFRA, SVEP1, PDPN, LRRC15, CILP, LUM, MFAP5, FBLN2, OLFML3, RNASE4 (mCAF); MGP, SCGB2A2, SLPI, LTF, PTN, KIT, ALDH1A3 (LP cells); SPP1, APOE, LYZ, APOC1, HLA-DRA (monocytes/macrophages); ATP1B2, DES, AOC3, NDUFA4L2, MCAM, HIGD1B, CPE, KCNJ8, ABCC9, IGFBP7, TAGLN, ACTA2, MYL9, CALD1 (vCAF); CFD, DCN, GSN, EBF1, PRKG1 (CAF); CD1C, HLA-DPA1, HLA-DRB1, HLA-DRA, CD74, HLA-DPB1 (cDC2); KRT14, DST, MMP7, MIR205HG, MT1X, OXTR, KRT17, FST (basal/myoepithelial cells); EPCAM, AMBP, MUC1 (epithelial cells); TIMP1, FN1, POSTN, ACTA2, BST2, LY6D, COL6A1, SLC20A1, COL6A2, KRT16, CD9, S100A4, EMP1, LRRC8A, EPCAM, PDPN, ITGB1, PDGFRA, THY1 (fibroblasts); BANK1, CD79A, CD74, MS4A1, MEF2C, CD19, CD79B (B); RNASE1, C1QA, C1QB, C1QC, SELENOP (cDC1); MUCL1, MGP, ERBB4, ANKRD30A, AZGP1,AGR2, STC2 (LM cells); TPSB2, TPSAB1, CPA3, MS4A2, HPGDS,KIT, ENPP3 (mast cells); PTGDS, FCHSD2, GPR183, NR3C1, TCF4 (DCs); FCHSD2, GZMB, PTGDS, TCF4, GPR183, IL3RA, CLEC4C, NRP1, LILRA4, TLR7, TLR9, IRF7, GZMB (pDC); IGKC, IGLC3, IGLC2, IGHG1, IGHA1, IGHG3, IGHG4, IGHA2, FCRL5, TNFRSF17 (plasma); TIMP1, COL1A1, COL1A2, COL3A1, SPARC, ANIN, CDCA3, TPX2, CDCA8, FAM64A, NUF2, BIRC5, CEP55, SKA1, KIF15, TTK, MELK, TOP2A, PBK, CCNA2, SPC25, MKI67 (cCAF); PNPLA2, CAV1, FABP4, PPARG, CEBPA, LEP, CIDEA, SHOX2, SLC7A10, SLC36A2, P2RX5 (adipocytes); ITGAM, LGALS3, CD68, CD163, LYZ, ADGRE1, LAMP2 (macrophages); CD14, FCGR3A, FCGR1A (monocytes); and MYC, ESR1, AR, PGR, CDH1, AKT1, ERBB2, EPCAM, KRT8, KRT18, KRT19 (tumor, markers are also shared by benign breast duct cell types).
We further subdivided certain cell types into subtypes or cell states using the following: IKZF2, TNFRSF18, FOXP3, CTLA4, IL7R, IL2RA (Treg cells); GZMH, GZMB, GZMA, PRF1, IFNG, FASLG, LAMP1, CD8A, CD8B, CD3E, CD3D (cytotoxic T cells); GZMK (pre-exhausted T cells); VSIR, TIGIT, ICOS, EOMES, HAVCR2, PDCD1, BTLA, CD244, LAG3, CD160, CTLA4, CD96 (exhausted T cells); CD69, CD28, CD44, DPP4 (activated T cells); high ribosomal gene expression (RPhigh CD4+ T cells); IL1A, IL1B (IL1A+ macrophages); high TLR2 and CD86 and low CD163 and MRC1 (M1 macrophages); high CD163, high MRC1 (M2 macrophages); high CD163, MRC1, TLR2 and CD86 (M1/M2 macrophages); high CD163 and low MRC1 (M2 partial macrophages); MKI67 (proliferation marker); CD69, CCR7 (medium-low), SELL (medium-low) and IL7R (medium-low) (activated T cells). Note that T cell marker sets were clearer once subsetted and re-clustered.
snRNA-seq mouse cell type annotationTo annotate the mouse single-cell data the following markers were used: Epcam, Krt5, Acta2, Myh11, Krt14, Trp63, Krt17, Myl9 (basal); Areg, Cited1, Ly6d, Prlr, Esr1, Pgr, S100a6 (LMs); Kit, Aldh1a3, Cd14, Gabrp, Tspan8 (LP cells ); Birc5, Hmgb2, Stmn1, Mki67 (cycling cells); Ptprc, Fyb (immune); Col4a1, Sparc, Col4a2, Lamb1, Col5a2 (stroma); and Fabp4, Lpl, C4b, Mylk, Hk2, Slc4a4, Dio2, Vegfa (fibroblasts).
snRNA-seq PAM50 subtype assignmentFor single-nuclei data, to overcome data sparseness, expression-based subtyping was performed at the cluster level. After cell type annotation of the combined Seurat object across all samples, a separate Seurat object was created comprising only tumor cells and was clustered again using the original Louvain algorithm and top 30 PCA dimensions via ‘FindNeighbors’ and ‘FindClusters’ (with parameters, resolution of 0.5) functions. Mean expression of the 50 genes used in the PAM50 algorithm were obtained per cluster. PAM50 subtype assignments were obtained for each of these cluster means in the same way as was conducted for bulk-RNA-seq data (without bootstrapping).
Gene regulatory networksTo infer gene regulatory networks of TFs, we used pySCENIC interface (v.0.11.2) from the SCENIC pipeline107. We applied SCENIC on SCT-normalized assay of sampled merged snRNA combo object, 500 cells sampled randomly per cell type of each sample. The first step of the SCENIC workflow is utilizing a regression per-target approach, GRNBoost2, to infer coexpression modules. We provided the list of unique TFs that are present in the JASPAR2020 db108 as input. Steps 2 and 3 of regulon prediction were run with default parameters and using RcisTarget hg38__refseq_r80 v.9 gene-motif ranking databases (10 kbp around the transcription start site (TSS) and 500 bp around TSS). Next, we recalculated the AUCell score, the regulon activity, to identify significant regulons (TFs) in each cell type. Then only regulons with at least 20 regulated genes were considered in the final heatmap. Finally, we generated a binary regulon activity heatmap to show gene regulatory networks relationships between TFs and their target genes using the ComplexHeatmap R package.
Mapping and quantification of snATAC-seq and snMultiome-seqTo process sequenced snATAC-seq and snMutiome-seq data, we used cellranger-atac count (v.2.0, 10x Genomics) and cellranger-arc count (v.2.0, 10x Genomics) pipelines, respectively. These pipelines filter and map snATAC-seq reads and identify transposase cut sites. The cellranger-arc count pipeline also performs filtering and alignment of snRNA-seq reads. The GRCh38 human reference was used for mapping reads.
Peak calling for snATAC-seq dataWe performed peak calling using MACS2 (ref. 109). We removed peaks from the Y chromosome and peaks overlapping genomic regions containing ‘N’. All peaks were resized to 501 bp centered at the peak summit defined by MACS2. After this, we performed an iterative removal procedure described in Corces et al.110 to get the set of non-overlapping peaks. In brief, we started by retaining the most significant peak by MACS2 peak score (−log10(q value)) and removed all peaks that had a direct overlap with it. We repeated this procedure for the remaining peaks, until we had the set of non-overlapping peaks (sample peak set). The resulting set of sample peaks was used to calculate a peak-count matrix using FeatureMatrix from the Signac package v.1.3.0 (https://github.com/timoast/signac), which was also used for downstream analysis.
QC of snATAC-seq dataQC filtering of the snATAC-datasets was performed using functions from the Signac package. Filters that were applied for the cell calling include: number of fragments in peaks >1,000 and <20,000, percentage of reads in peaks >15, ENCODE blacklist regions percentage <0.05 (https://www.encodeproject.org/annotations/ENCSR636HFF/), nucleosome banding pattern score <10 and enrichment score for Tn5-integration events at transcriptional start sites >2. Peaks were annotated using R package ChiPseeker111 using the transcript database TxDb.Hsapiens.UCSC.hg38.knownGene. The promoter region was specified (−1,000,100) relative to the TSS.
Normalization, feature selection and dimension reduction of snATAC-seq dataThe filtered peak-count matrix was normalized using term frequency-inverse document frequency (TF-IDF) normalization implemented in the Signac package. All peaks were used as features for dimensional reduction. We used the RunSVD Signac function to perform singular value decomposition on the normalized TF-IDF matrix, which is known as latent semantic indexing (LSI) dimension reduction. The resulting 2:30 LSI components were used for nonlinear dimension reduction using function RunUMAP from the Seurat package.
Clustering of snATAC-seq dataThe nuclei were clustered using a graph-based clustering approach implemented in Seurat. First, we utilized the Seurat function FindNeighbors to construct a Shared Nearest Neighbor graph using the 2:30 LSI components. Next, we used the FindClusters function to iteratively group nuclei together while optimizing modularity using the Louvain algorithm.
Merging of snATAC-seq data across samplesMerging of snATAC-seq datasets was performed using functions from the Signac and Seurat packages. To normalize the peaks’ significance scores across samples, we converted MACS2 peak scores (−log10(q value)) to a score per million as described by Corces et al.110. To get the set of peaks for merging, we first combined peaks from all samples of the cohort. For overlapping peaks across samples, we performed an iterative removal procedure using normalized peak scores as described above. This yields the cohort level peak set. The resulting list of peaks was quantified and was used to create a peak-cell matrix so that the set of features was the same across all snATAC-seq samples. After that, the merge function from the Seurat package was used to merge snATAC-seq datasets. Next, we performed TF-IDF normalization. LSI-dimensional reduction was performed using the RunSVD function. Nonlinear dimension reduction was performed using the RunUMAP function with the first 2:50 LSI components.
Cell type label transfer from snRNA-seq to snATAC-seq dataCell type label transfer was performed using functions from Signac and Seurat. First, we quantified chromatin accessibility associated with each gene by summing the reads overlapping the gene body and its upstream region of 2 kb, thus creating the gene by cell matrix. Coordinates for the genes were used from the Ensembl database v.86 (EnsDb.Hsapiens.v86 package). Next, we performed log-normalization of the resulting matrices using the NormaliseData function. The integration of paired snATAC-seq and snRNA-seq datasets was performed using the FindTransferAnchors function with the canonical correlation analysis option for the dimensional reduction. We then utilized the TransferData function to transfer cell type labels from the snRNA-seq dataset to the snATAC-seq dataset using the obtained set of anchors from the previous step. Then cell types were re-evaluated at the cohort-object level, where for each cluster the cell type label was assigned as the most abundant cell type in that cluster.
Identifying differentially accessible chromatin regions using snATAC-seq dataTo identify differentially accessible chromatin regions, we used the FindMarkers function from the Seurat package with logistic regression test and the fraction of fragments peaks as a latent variable to reduce the effect of different sequencing depths across cells. Additionally, we also specified the following parameters: min.pct = 0.1, min.diff.pct = 0.1, logfc.threshold = 0 and only.pos = F. Bonferroni correction was applied for P value adjustment using all peaks from each comparison and peaks were considered significant if they had an adjusted P value <0.05.
Calculation of TF motif scores using snATAC-seq dataTo evaluate TF binding accessibility profiles, we used the ChromVar tool112, which calculates bias-corrected deviations (TF motif scores) corresponding to gain or loss of accessibility for each TF motif relative to the average cell profile. To identify TFs with differential activity between cell groups of snATAC-seq data, we used a two-sided Wilcoxon rank-sum test for the whole set of TFs in the assay and applied FDR correction for the resulting P values. For subtype-specific TFs we additionally required them to have positive fold change (FC) in RNA-seq data when compared between tumor cells from each subtype versus pooled tumor cells from all other subtypes.
Identifying lineage-specific TF motifs in accessible chromatin regions using snATAC-seq dataTo identify cancer lineage-specific motif profiles in open chromatin, we first compared tumor cells and their putative benign cell of origin versus all other epithelial cell groups (list of cell groups was basal-like BC, HER2-enriched BC, luminal A/B BC, LM cells, LP cells and basal myoepithelial cells) using a two-sided Wilcoxon rank-sum test followed by FDR correction, as above. Cancer-associated nonlineage-specific TFs were filtered out by excluding TFs enriched in both luminal BC versus LM cells and basal-like BC versus LP cells. Individual differential motif accessibility comparisons were conducted for each BC subtype and benign cell type versus all other epithelial cells and any TFs significantly enriched in a nonlineage cell type (with mean motif score difference of at least 0.5) were also excluded from the lineage-specific motif list.
Visualizing the coverage of snATAC-seq dataFor snATAC-seq coverage plots, we used the CoveragePlot function from the Signac package. For tumor samples, we plotted coverage for tumor cells only and for normal cell populations we plotted coverage for combined cells across all samples.
Monocle pseudotime analysis using snATAC-seq dataTrajectory-based analyses of snATAC-seq data were performed using Monocle2 (ref. 113). To build case-level trajectories, we used subsetted data from the snATAC-seq merged object. For each case we subsetted 1,000 cells from the pool of its tumor cells and also 1,000 cells from each of the combined sets of normal cell populations, such as LM, basal/myoepithelial and LP cells. To create a Monocle cds object we used the function newCellDataSet on slot count and top 50,000 expressed peaks and with parameters lowerDetectionLimit = 0.5 and expressionFamily = negbinomial.size(). To estimate size factors for each cell and dispersion function for the peaks, we used functions estimateSizeFactors and estimateDispersions. Dimensionality reduction was performed using reduceDimension function with DDRTree method and max_component = 10.
scVarScan mutation mappingWe applied an in-house tool called scVarScan that can identify reads supporting the reference and variant alleles covering the variant site in each individual cell by tracing cell and molecular barcode information in an scRNA bam file. The tool is freely available at https://github.com/ding-lab/10Xmapping. For mapping, we used high-confidence somatic mutations from WES data.
CNV calling on bulk whole-exome dataSomatic copy number variants (CNVs) were called using GATK v.4.1.9.0114. The hg38 human reference genome (National Cancer Institute GDC data portal) was binned into target intervals using the PreprocessIntervals function, with a bin-length of 1,000 bp and the interval-merging-rule set to OVERLAPPING_ONLY. A PON was generated using each normal sample by utilizing the GATK functions CollectReadCounts with the argument–interval-merging-rule OVERLAPPING_ONLY, followed by CreateReadCountPanelOfNormals with the argument–minimum-interval-median-percentile 5.0.
For tumor samples, reads that overlapped the target interval were counted using the GATK function CollectReadCounts. Tumor read counts were standardized and de-noised using the GATK function DenoiseReadCounts, with the PON specified by–count-panel-of-normals. Allelic counts for tumor samples were generated for variants present in the af-only-gnomad.hg38.vcf file, following GATK best practices (variants further filtered to 0.2 > allele frequency > 0.01 and entries marked with ‘PASS’), using the GATK function CollectAllelicCounts. Segments were modeled using the GATK function ModelSegments, utilizing the de-noised copy ratio and tumor allelic counts as inputs. Copy ratios were called on the segment regions using the GATK function CallCopyRatioSegments.
To map the copy number ratios from segments to genes and assign amplifications or deletions, the Bedtools intersection114 was used. For genes overlapping multiple segments, a custom Python script was employed to call that gene as amplified, neutral or deleted based on a weighted copy number ratio calculated from copy ratios of each segment overlapped, the lengths of the overlaps and the z-score threshold used by the CallCopyRatioSegments function. If the resulting z-score cutoff fell within the range of the default z-score thresholds used by CallCopyRatioSegments (0.9, 1.1), the bounds of the default z-score threshold were utilized instead, replicating the logic of the CallCopyRatioSegments function. Similarly, to map copy number ratios from segments to chromosome arms, another script was used following the same approach to call whether the chromosome arm was amplified, neutral or deleted.
scRNA CNV detectionTo detect large-scale chromosomal CNVs using scRNA-seq data, inferCNV (v.0.8.2) was used with default parameters recommended for 10x Genomics data. All cells that were not tumor were pooled together for the reference normal set. inferCNV was run at a sample level and only with post-QC filtered data.
Differential scRNA expression analysesFor cell-level and cluster-level differential expression, we used the ‘FindMarkers’ or ‘FindAllMarkers’ Seurat function as appropriate, with a minimum pct. of 0.25 and looking only in the positive direction, as lack of expression is harder to interpret due to the sparsity of the data. The resulting DEGs were then filtered for adjusted P < 0.05 and sorted by FC. All differential expression analyses were carried out using the ‘SCT’ assay.
Cell surface annotationTo annotate a given biomarker, we annotated each DEG by their subcellular location. Three databases were used to curate the subcellular location information: (1) Gene Ontology term 0005886; (2) Mass Spectrometric-Derived Cell Surface Protein Atlas115; and (3) The Human Protein Atlas subcellular location data based on The Human Protein Atlas v.19.3 and Ensembl v.92.38.
Identification of tumor markers using snRNA-seq dataTo identify tumor cell surface markers, we used functions of the Seurat package. We compared gene expression between tumor cells and non-tumor cells across snRNA-seq samples of basal and luminal subtypes. The pipeline consists of the
留言 (0)