Annotation-free deep learning algorithm trained on hematoxylin & eosin images predicts epithelial-to-mesenchymal transition phenotype and endocrine response in estrogen receptor-positive breast cancer

Cell culture and endocrine-resistant cell model

The T47D cell line was obtained from ATCC and cultured according to ATCC’s recommendations in Dulbecco’s modified Eagle’s medium (DMEM; Gibco) supplemented with 10% FBS (Gibco) and 1% penicillin/streptomycin (P/S). The fulvestrant-resistant T47D (T47D-FulvR) cell model was established following a previously described method [13], using 0.5 µM fulvestrant (Fulv, S1191, Selleck). All cells were maintained at 37 ℃ in a 5% CO2 incubator.

Gene expression datasets

Both T47D parental and T47D-FulvR cell samples were sequenced on a BGISEQ platform following the manufacturer’s instructions. Publicly available gene-expression data and the corresponding clinical annotations for breast cancer patients were downloaded from the METABRIC [14] and The Cancer Genome Atlas (TCGA) [15] databases. FPKM values were converted into transcripts per million (TPM). In addition, gene-expression profiles and clinical data from eight breast cancer cohorts (GSE125738, GSE85536, GSE111563, GSE20181, GSE147271, GSE87411, GSE59515 and E-MTAB-9917; details provided in Supplementary Table S1) were obtained from the Gene Expression Omnibus (GEO) and ArrayExpress databases [16, 17]. PAM50 subtypes were determined using the PAM50 classifier [18].
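The FPKM conversion mentioned above is commonly done per sample by dividing each gene's FPKM by the sample's FPKM total and multiplying by 10^6, which yields transcripts per million (TPM). The following is a minimal sketch of that standard rescaling, not the authors' pipeline; the function name and the gene labels are hypothetical.

```python
def fpkm_to_tpm(fpkm_by_gene):
    """Rescale one sample's FPKM values so that they sum to one million.

    fpkm_by_gene: dict mapping gene name -> FPKM for a single sample.
    """
    total = sum(fpkm_by_gene.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in fpkm_by_gene.items()}


# Toy sample with placeholder gene names (assumption, for illustration only).
tpm = fpkm_to_tpm({"GENE_A": 1.0, "GENE_B": 3.0})
print(tpm)
```

Because the rescaling is applied within each sample, TPM values sum to the same total in every sample, which makes relative expression easier to compare across samples than raw FPKM.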

Generation of the EMT-score

Milena P. Mak and colleagues previously identified a pan-cancer EMT-related gene signature consisting of 77 genes [8]. From this signature, we selected the 75 genes that are most specific to breast cancer, comprising 52 ‘mesenchymal’ genes and 23 ‘epithelial’ genes (Supplementary Table S2), to form the EMT-related gene signature used in this study. The EMT score for each breast cancer sample was calculated as the mean mRNA expression level of the ‘mesenchymal’ genes minus that of the ‘epithelial’ genes, so that higher scores indicate a more mesenchymal phenotype. ER+ breast cancers in both the TCGA-BRCA and METABRIC cohorts were then classified by tertiles of their EMT scores as epithelial (Epi-, EMT score ≤ lower tertile cutoff), intermediate (lower tertile cutoff < EMT score < upper tertile cutoff) or mesenchymal (Mes-, EMT score ≥ upper tertile cutoff) subtype [19].
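The scoring and tertile split can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the score is oriented so that higher values are more mesenchymal (as the tertile definitions imply), the tertile cutoffs come from Python's `statistics.quantiles` with its default interpolation, and the gene names are placeholders, not entries copied from Supplementary Table S2.

```python
from statistics import mean, quantiles


def emt_score(expr, mes_genes, epi_genes):
    """Mean expression of mesenchymal genes minus mean of epithelial genes."""
    return mean(expr[g] for g in mes_genes) - mean(expr[g] for g in epi_genes)


def tertile_labels(scores):
    """Label each sample Epi, Intermediate, or Mes by tertiles of its score.

    scores: dict mapping sample id -> EMT score.
    """
    low_cut, high_cut = quantiles(scores.values(), n=3)
    labels = {}
    for sample, score in scores.items():
        if score <= low_cut:
            labels[sample] = "Epi"
        elif score >= high_cut:
            labels[sample] = "Mes"
        else:
            labels[sample] = "Intermediate"
    return labels


# Placeholder expression values and gene names (assumptions).
expr = {"MES_1": 8.0, "MES_2": 6.0, "EPI_1": 3.0, "EPI_2": 5.0}
print(emt_score(expr, ["MES_1", "MES_2"], ["EPI_1", "EPI_2"]))

cohort = {f"s{i}": float(i) for i in range(1, 10)}
print(tertile_labels(cohort))
```

Different quantile conventions can move samples that sit exactly on a cutoff, so the tertile boundaries should be computed once per cohort and reused.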

Clinical logistic regression model

To investigate the predictive power of clinicopathologic characteristics in distinguishing between the Epi- and Mes-phenotypes, we developed a logistic regression model incorporating age, histological type, ER level, HER2 status, and PAM50 subtype. These features were extracted from the TCGA database and selected based on their significant differences between the two transcriptome phenotypes.

Gene set enrichment analysis (GSEA)

GSEA was performed to identify enriched pathways and to functionally annotate the RNA-seq data, using the predefined hallmark gene sets from the Molecular Signatures Database (version 2023.2.Hs) and the “clusterProfiler” R package for the enrichment analyses.

Pathological cohorts

TCGA-BRCA image cohort

A total of 536 ER+ breast cancer patients in the TCGA-BRCA cohort were classified as either Mes- or Epi-phenotype based on their transcriptional EMT scores. We collected a total of 1,076 eligible H&E-stained whole slide images (WSIs) corresponding to the aforementioned 536 cases from The Cancer Image Archive (TCIA) tissue slide dataset [20], with 534 labeled as Epi-phenotype and 542 as Mes-phenotype. The WSIs were scanned at 40× or 20× magnification.

ZEYY endocrine response cohort

Patients with ER+ breast cancer who underwent core biopsy or surgery at the Second Affiliated Hospital of Zhejiang University between 2015 and 2021 were retrospectively identified from the database of the Department of Pathology. In this cohort, patients with more than 10% of invasive cancer cells positive for ER expression were designated as ER+ and received endocrine therapy with selective estrogen receptor modulators or aromatase inhibitors ± ovarian function suppression, as recommended by their treating physicians. Resistance to endocrine therapy encompassed both intrinsic and acquired resistance [21, 22]. Accordingly, the eligibility criteria for tumor tissues defined as endocrine-resistant were as follows: (1) early breast cancers with a disease-free interval (DFI) of less than 24 months; (2) recurrent lesions that developed during endocrine therapy; (3) advanced or metastatic disease that progressed after endocrine therapy. ER+ cases with pathologically confirmed lymph node metastasis but without distant metastasis that achieved a DFI of more than 12 months after completing more than 5 years of adjuvant endocrine therapy were classified as the endocrine-sensitive subgroup. After a thorough review of the initial pathology reports, slide quality, treatment modalities, and prognostic information of eligible patients, a total of 63 cases with 144 H&E-stained WSIs were retrospectively collected for this study, including 25 resistant cases with 62 slides and 38 sensitive cases with 82 slides. The CONSORT diagram is shown in Supplementary Fig. 1. WSIs were scanned at 40× magnification using a KF-PRO-400 digital slide scanner (Jiangfeng Bio-Information Technology Co, Ningbo, China). Low-quality images, owing to extreme fading, low resolution, or the absence of invasive tumor regions, were excluded.

Pathological image preprocessing and patch generation

Otsu’s thresholding was applied to the high-resolution (40×) H&E digital pathology images to separate foreground tissue from the background. Subsequently, all slides were tessellated into non-overlapping patches of 1024 × 1024 pixels, which were then downsampled by a factor of 2, resulting in 3-channel patches of 512 × 512 pixels while preserving the integrity of the pathological information.
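The preprocessing steps can be illustrated with a small pure-Python sketch. It assumes 8-bit grayscale intensities for the thresholding step, drops partial tiles at the slide edges, and downsamples by 2 × 2 average pooling; none of these details are stated in the text, and all function names are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the intensity t that maximizes between-class variance.

    pixels: iterable of integer intensities in [0, levels).
    Values <= t form one class (darker tissue on a bright H&E background).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0, sum0 = 0, 0
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def tile_origins(width, height, tile=1024):
    """Top-left corners of non-overlapping tiles; partial edge tiles are dropped."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]


def downsample_2x(channel):
    """Halve a single-channel image (list of rows) by 2 x 2 average pooling."""
    h, w = len(channel), len(channel[0])
    return [[(channel[y][x] + channel[y][x + 1]
              + channel[y + 1][x] + channel[y + 1][x + 1]) / 4
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]


print(otsu_threshold([10] * 50 + [200] * 50))
print(len(tile_origins(4100, 2048)))
```

In practice a tile would be kept only if enough of its pixels fall on the tissue side of the threshold; that fraction is a tunable choice not specified here.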

Tumor tissue filtration

To train a filter for non-tumor tissues, we manually curated a subset of patches from 60 WSIs, purposefully selecting those that predominantly contained stroma, necrosis, lymphocytes, or blood vessels, which were considered non-tumorous for our analysis. We then used an EfficientNetV2B2 architecture as the backbone of a classifier that screened all patches to exclude these non-target elements. The entire dataset was processed through the filter to retain patches containing more than 50% tumor tissue.

Mes- and Epi-phenotype classification

For Mes- and Epi-phenotype classification, we developed a second model based on EfficientNetV2B2 using the tumor patches retained by the filter. Each patch was assigned the Mes or Epi label of its corresponding slide, based on the patient’s EMT signature score, and the model was trained with binary cross-entropy loss. After all tumor patches from a slide had been classified, the mean of their prediction values determined the slide’s final prediction. We assessed the classifier’s performance on an internal test set from the TCGA-BRCA cohort and validated it on the independent ZEYY endocrine response cohort. Specifically, the TCGA-BRCA dataset was split into training and internal test sets in a 9:1 ratio. Thresholds were not optimized within the ZEYY cohort, to avoid data leakage that could overestimate model performance. Instead, thresholds pre-established on the TCGA-BRCA training set were used, providing an unbiased evaluation of the model’s generalizability to the independent ZEYY cohort. Performance metrics included area under the curve (AUC), accuracy, positive predictive value (PPV), and negative predictive value (NPV).
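The slide-level aggregation and the threshold-based metrics can be sketched as below. This is a minimal illustration, assuming each patch prediction is a probability of the Mes class and that a single fixed threshold (0.5 here, a placeholder rather than the value used in the study) converts slide scores into labels.

```python
from statistics import mean


def classify_slides(patch_probs_by_slide, threshold=0.5):
    """Average patch probabilities per slide, then threshold (1 = Mes, 0 = Epi)."""
    return {slide: int(mean(probs) >= threshold)
            for slide, probs in patch_probs_by_slide.items()}


def binary_metrics(y_true, y_pred):
    """Accuracy, PPV, and NPV from dicts keyed by slide id."""
    tp = sum(1 for s in y_true if y_true[s] == 1 and y_pred[s] == 1)
    tn = sum(1 for s in y_true if y_true[s] == 0 and y_pred[s] == 0)
    fp = sum(1 for s in y_true if y_true[s] == 0 and y_pred[s] == 1)
    fn = sum(1 for s in y_true if y_true[s] == 1 and y_pred[s] == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }


# Toy patch predictions for three slides (made-up numbers).
patch_probs = {"a": [0.9, 0.8, 0.4], "b": [0.2, 0.1, 0.6], "c": [0.6, 0.7]}
truth = {"a": 1, "b": 0, "c": 0}
preds = classify_slides(patch_probs)
print(preds, binary_metrics(truth, preds))
```

Fixing the threshold before touching the external cohort, as the text describes, means only `classify_slides` is run on the external slides and no parameter is tuned on them.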

Model architecture and training details

The non-tumor tissue filtration and EMT phenotype classification models were both based on EfficientNetV2 and were trained by transfer learning, fine-tuning from weights pre-trained on the ImageNet dataset to leverage learned features and accelerate convergence. To enhance model robustness and reduce overfitting, we applied data augmentation techniques, including random rotations, flips, and color jittering, to simulate the variability of real-world pathology images. A cosine decay learning rate schedule with warm restarts was employed: the learning rate decreased along a cosine curve and was reset periodically to help escape local minima and explore the loss landscape more thoroughly. Weight decay and dropout were applied as additional regularization strategies, and a batch size of 8 was selected after empirical evaluation of memory usage and convergence speed. The initial learning rate was set to 1e-5 to balance convergence rate and training stability. The cross-entropy loss function was used to train the classifier.
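A cosine schedule with warm restarts follows the usual form lr = lr_min + 0.5 (lr_max − lr_min)(1 + cos(π t / T)), where t counts steps since the last restart and T is the current cycle length. The sketch below implements that formula directly. Only the initial learning rate of 1e-5 comes from the text; the cycle length, cycle multiplier, and minimum learning rate are assumptions chosen for illustration.

```python
import math


def cosine_restart_lr(step, base_lr=1e-5, min_lr=0.0, cycle_len=1000, cycle_mult=1):
    """Learning rate at a given step under cosine decay with warm restarts.

    cycle_len and cycle_mult are hypothetical settings: the first cycle lasts
    cycle_len steps and each later cycle is cycle_mult times longer.
    """
    t, length = step, cycle_len
    while t >= length:          # skip completed cycles
        t -= length
        length *= cycle_mult
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / length))
    return min_lr + (base_lr - min_lr) * cosine


for step in (0, 500, 999, 1000):
    print(step, cosine_restart_lr(step))
```

The rate starts at `base_lr`, falls to roughly half of it mid-cycle, approaches `min_lr` at the end of the cycle, and jumps back to `base_lr` at the restart.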

Heatmap generation for WSIs

To provide a comprehensive tissue assessment, we applied the trained phenotype classifier across entire WSIs. The classifier scanned the images with a stride of 128 pixels, producing detailed heatmaps that depicted the model’s activation levels and highlighted the predicted phenotypic status throughout the tissue.
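One way to build such a heatmap is to slide a window across the slide with a 128-pixel stride and average the scores of all windows that overlap each 128 × 128 cell. The sketch below does this under assumptions not stated in the text: a 512-pixel window, a window size divisible by the stride, and a hypothetical `score_fn(x, y)` that stands in for the classifier's output on the window whose top-left corner is at (x, y).

```python
def sliding_heatmap(width, height, score_fn, window=512, stride=128):
    """Average overlapping window scores into a grid of stride-sized cells.

    Returns a list of rows; cells never covered by a window are None.
    """
    if window % stride != 0:
        raise ValueError("window must be a multiple of stride in this sketch")
    cols, rows = width // stride, height // stride
    span = window // stride                       # cells covered per window side
    total = [[0.0] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            score = score_fn(x, y)
            for r in range(y // stride, y // stride + span):
                for c in range(x // stride, x // stride + span):
                    total[r][c] += score
                    count[r][c] += 1
    return [[total[r][c] / count[r][c] if count[r][c] else None
             for c in range(cols)]
            for r in range(rows)]


# Stand-in scorer: windows starting in the right half score 1.0 (toy example).
heat = sliding_heatmap(1024, 512, lambda x, y: 1.0 if x >= 512 else 0.0)
print(heat[0])
```

Averaging overlapping windows smooths the transition between regions, which is why cells near the boundary in the toy example take intermediate values.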

Statistical analysis

For statistical analysis, the Mann-Whitney U test was used for numeric data, and Pearson’s chi-squared test or Fisher’s exact test was used for categorical data. Box-and-whisker plots, generated with the ggplot2 R package, show the median and the 25th and 75th percentiles, with whiskers representing the minima and maxima of the distributions. A two-sided p value of <0.05 was considered statistically significant. Analyses were performed using R software version 4.1.0 and SPSS version 20 (SPSS Inc., Chicago, IL).
