This study was approved by the Stanford Institutional Review Board under protocol no. 47044. Four persons with FAP (one male and three female) were involved in this study (Supplementary Table 1). FAP tissues were collected at the time of partial or full colectomies from participants. Immediately following colectomy, participant-matched non-neoplastic colorectal mucosa, adenomatous polyps and adenocarcinomas were snap-frozen and preserved in liquid nitrogen. One FAP adenocarcinoma (A001-C-007) was embedded in an optimal cutting temperature (OCT) compound before being stored at −80 °C. Sporadic CRCs from six donors were obtained from the Stanford Tissue Bank. Tissues were examined for histopathology to confirm their disease states. Informed consent was obtained from all participants.
Organoid cultureTissue samples were collected from participants and processed for organoid generation according to the protocol detailed in Pleguezuelos-Manzano et al.72. Briefly, samples were collected on ice in a collection medium (advanced DMEM/F12 supplemented with 10 mM HEPES, 1× Glutamax and 1% penicillin–streptomycin). Tissue samples were washed in collection medium, minced and digested for 30 min at 37 °C in 5 mg ml−1 collagenase type II (Sigma-Aldrich). Samples were then filtered using a 100-μm strainer, washed five times in collection medium and plated in Geltrex (ThermoFisher).
Organoids were cultured in a complete medium (advanced DMEM/F12 supplemented with 10 mM HEPES, 1× Glutamax, 1% penicillin–streptomycin, 1× B27 without vitamin A, 10 mM nicotinamide, 1.25 mM N-acetylcysteine, 500 nM A83-01 (Tocris), 10 μM SB202190 (Sigma-Aldrich), 100 ng μl−1 Noggin (R&D Systems), 1 μg ml−1 human recombinant R-spondin (Stemcell), 0.3 nM Wnt-FC (Immunoprecise), 50 ng ml−1 EGF (Shenandoah Biotechnology), 2.5 μM CHIR 99021 (Tocris) and 100 μg ml−1 Normocin (InvivoGen). Furthermore, 10 μM Y-27632 was added to the medium for the first 3 days after seeding.
For drug experiments, organoids were trypsinized and plated at 30,000 cells per well in 24-well plates. After 5–7 days, organoids were incubated in a complete medium containing 500 nM JQ1 for 24 h. Organoids were then isolated in Cell Recovery Solution (Corning) on ice for 1 h, washed with PBS and centrifuged to retrieve cell pellets. Cell pellets were then processed for RNA extraction.
Cell linesHT29 (American Type Culture Collection (ATCC) HTB-38) and HCT116 (ATCC CCL-247) human CRC cell lines were obtained from the ATCC. Cells were maintained in DMEM/F12 (ThermoFisher, 11320033) with 10% FBS and 1% penicillin–streptomycin. Primary human colonic epithelial cells (Cell Biologics, H-6047) were maintained in epithelial cell growth medium (Cell Biologics, H6621) as suggested by the vendor. The identity of all cell lines was authenticated by respective vendors. All cell lines tested negative for Mycoplasma and experiments were performed before reaching ten population doublings.
For drug experiments, cells were seeded in six-well plates with 50% confluence. After 1 day, cells were grown in a complete medium containing 500 nM JQ1 for 24 h. Cell pellets were then trypsinized for collection and processed for RNA extraction.
CRISPR and CRISPRi assaysFor stable expression of dCas9–KRAB, pHR-SFFV-KRAB-dCas9-P2A-mCherry (Addgene, 60954) was transfected into HEK293T cells using the Lenti-X packaging single shots (Takara Bio, 631275), following the manufacturer’s protocol. Assembled viral particles were isolated after 72 h of incubation and collected by filtering the culture media through a 0.45-µm filter. Viral titers were determined using the Lenti-X GoStix Plus (Takara Bio, 631280). Viral infections were conducted with a multiplicity of infection of 10, where 2.0 × 105–1.0 × 106 cells were incubated in full culture medium containing the viral particles and 8 µg ml−1 polybrene for 48 h. Positive cells were selected on the basis of a high expression of mCherry by fluorescence-activated cell sorting (FACSAria II, BD Bioscience). Cells with stable cassette expression were selected by a second sorting performed 2–3 weeks after the initial.
For delivery of gRNA or Cas9/dCas9–gRNA complex, the synthesized crRNA:tracrRNA duplex (Integrated DNA Technologies (IDT)) was transfected into 1.0 × 105–2.0 × 105 cells using the 4D nucleofector X unit (Lonza) with or without preincubation with equal molar Cas9/dCas9 protein (IDT, 1081058 and 1081066). We followed the protocol provided by IDT for the Lonza nucleofector system. Nucleofection of the cells used the following kits and programs: HPCEC, P3/CM137; HT29, SF/FF137. After nucleofection, cells were seeded in 96-well plates and collected for downstream analysis after 48 h. To obtain statistical robustness, each experiment was repeated with 2–4 trials with two replicates in each trial.
mHi-CmHi-C was performed as a derivative of Tri-HiC, a high-resolution modified Hi-C protocol18,31, with minor modifications. Initially, 5–10 mg of snap-frozen tissue was placed into a tissueTUBE-TT05 (Covaris, 520071) and cryopulverized using the Covaris CP02 cryoPREP automated dry pulverizer, following the manufacturer’s procedure. The pulverized tissue was then subjected to freeze substitution73 by submerging it in 1 ml of −80 °C 0.01% formaldehyde (ThermoFisher, 28906), 97% ethanol and 2% water. Following this, samples were incubated on dry ice for 3 h at a rotor spinning speed of approximately 100 r.p.m. They were then placed in a CoolCell Container (Corning) and transferred to a −20 °C freezer for overnight incubation. On day 2, the container was moved to a 4 °C cold room and spun on a rotor at approximately 100 r.p.m. for 1 h to bring the sample temperature above the freezing point.
Subsequently, the tissue samples were separated from the ethanol solution by centrifuging at 300g for 5 min in a 4 °C microcentrifuge. Crosslinking was carried out by incubating the samples with 1 ml of 1% TBS-formaldehyde for 10 min at room temperature. The solution was then quenched by adding 80 μl of 2.5 M glycine and incubated for an additional 5 min. The samples were centrifuged, washed once with 1 ml of TBS (pH 7.5) and resuspended in 250 μl of Hi-C lysis buffer (10 mM Tris-HCl pH 8.0, 10 mM NaCl and 0.2% Igepal CA630) with an additional 50 μl of proteinase inhibitor cocktail (Sigma, P8340). Nuclear extraction was performed on ice by squeezing the samples 15–20 times with 1.5-ml disposable pellet pestles (Fisher Scientific, 12-141-368).
The crude suspension was then centrifuged at 1,500g for 5 min at 4 °C, resuspended in 800 μl of Hi-C lysis buffer and passed through a 100-μm strainer (Sysmex). After another centrifugation, the purified nuclei were resuspended in 170 μl of 10 mM Tris-HCl containing 0.5% Triton X-100 (Sigma, 93443). This was followed by incubation at room temperature with rotation for 15 min. Then, 10 μl of 1% SDS, 20 μl of Cutsmart buffer (New England Biolabs (NEB)), 3 μl each of HinP1I (NEB, R0124S), DdeI (NEB, R0175L), CviAII (NEB, R0640L) and FspBI (ThermoFisher, ER1762) and 0.6 μl of MseI (NEB, R0525M) were added to the suspension in the indicated order. The mixture was then incubated at 25 °C and 37 °C for 2 h each with rotation. To halt the restriction digestion, the suspension was incubated in a 62 °C heating block for 20 min, followed by cooling down. End repair was carried out by adding 30 μl of a solution containing 0.5 mM biotin-14-deoxyadenosine triphosphate (Active Motif, 14138), 0.5 mM biotin-14-deoxycytidine triphosphate (AAT Bio, 17019), 0.5 mM deoxythymidine triphosphate, 0.5 mM deoxyguanosine triphosphate and 4 μl of Klenow DNA Pol (NEB, M0210L) to the mixture. This was then incubated for 1 h at 37 °C with rotation. For ligation, a 750-μl solution containing 1× NEB T4 DNA ligase buffer (NEB, B0202), 120 μg of BSA (ThermoFisher, AM2616) and 2,000 U of T4 DNA ligase (NEB, M0202M) was added. The mixture was incubated at room temperature for 90 min, followed by 4 °C overnight and then room temperature for an additional 60 min with rotation.
Reverse crosslinking was performed by centrifuging the mixture at 1,500g for 5 min. The supernatant was then replaced with a mixture of 300 μl of 1× T4 ligase buffer, 30 μl of 20 mg ml−1 proteinase K (ThermoFisher, 25530049), 30 μl of 10% SDS and 40 μl of 5 M NaCl. This suspension was then incubated at 66 °C for 4 h. The DNA content was purified by phenol–chloroform extraction and resuspended in 20 μl of 10 mM Tris-HCl.
To generate the mHi-C sequencing library, 300 ng of purified DNA was tagmented with 2.5 μl of Tn5 transposase (APExBIO, K1155; discontinued) loaded with equimolar mosaic ends containing Illumina Nextera i5 and i7 extensions, according to the manufacturer’s protocol. The tagmentation was performed in 100 μl of buffer containing 10% DMF, 10 mM Tris-HCl and 150 mM NaCl at 55 °C for 10 min. The product was then column-purified (Zymo D4014) and PCR-amplified for two cycles using the NEBNext master mix (NEB, M0544L) with Illumina Nextera primers and conditions. Biotin enrichment was then performed by adding 20 μl of Dynabeads MyOne Streptavidin C1 (ThermoFisher, 65001) and incubating at room temperature for 30 min with rotation. The magnetic beads were washed three times with 1× wash buffer (10 mM Tris-HCl pH 7.5, 1 mM NaCl and 0.5 mM EDTA) and once with 10 mM Tris-HCl. Final libraries were obtained by amplifying the beads with an additional eight cycles of PCR, followed by purification with solid-phase reversible immobilization (Beckman, B23318) size selection at a 0.5×–1.1× range. The 33 samples were combined into two pools and sequenced using two NovaSeq (Illumina) S4 200-cycle flow cells.
Real-time PCR and RNA sequencing (RNA-seq)Total RNA was extracted from approximately 5–10 mg of frozen tissues or approximately 1.0 × 105–1.0 × 106 cells from organoid or cell culture using the Zymo Quick-RNA Miniprep (Zymo R1054), according to the manufacturer’s instructions. After purification, DNA digestion was carried out using the DNA-free DNA removal kit (ThermoFisher, AM1906). For complementary DNA (cDNA) synthesis, up to 1 μg of total RNA was processed using the Superscript IV reverse transcription (RT) system (ThermoFisher, 18091050), with Oligo dT provided in the kit serving as the primer. For RT–PCR, 50 ng of synthesized cDNA was mixed with 10 μl of TaqMan Fast Advanced Master Mix (ThermoFisher, 4444557) and 1× primers and then examined in the QuantStudio 6 Flex system (ThermoFisher). Relative gene expressions were normalized against the internal expression of GAPDH using the ΔΔCt method. Sequencing libraries of mRNA were prepared from 200 ng–1 μg of total RNA using the NEBNext Ultra unstranded preparation kit (NEB, E7775S and E7490S), in accordance with the manufacturer’s protocol. Samples were sequenced on a NovaSeq S1 flow cell for 50-bp paired-end sequencing, resulting in an average of 86.3 million raw paired reads per sample.
ATAC-seqATAC-seq was conducted following the latest ENCODE tissue protocol as described previously61. Sequencing was performed on a NovaSeq S1 flow cell using 50-bp paired-end sequencing, resulting in an average of 53.2 million unique fragments mapped for each sample.
Enzymatic methyl sequencing (EM-seq)EM-seq was executed as described previously74. Libraries were constructed using the NEBNext EM-seq kit (NEB), following the manufacturer’s guidance. Sequencing was carried out using the novel ultrahigh-throughput UG-100 (Ultima Genomics) sequencer.
mHi-C data processingInitial processing of mHi-C data was executed using the distiller pipeline (https://github.com/open2c/distiller-nf) with default parameters configured for the SLURM cluster. Deduplicated pair files were then input into Juicer pre37 to generate KR-balanced .hic matrices at resolutions of 200 bp, 500 bp, 1 kbp, 2 kbp, 5 kbp, 10 kbp, 20 kbp, 50 kbp, 100 kbp, 250 kbp, 500 kbp and 1 Mbp using a quality score filter of 30. To generate piled-up master matrices for various stages and all samples, pair files were initially merged and sorted using pairtools (https://github.com/open2c/pairtools).
Stripe calling was carried out as previously described18, with minor modifications to the parameters. We noted an enrichment of mappable reads at open chromatin regions, raising the possibility of false stripe signal detection by measuring raw read count over-representation. To address this, the stripe-calling algorithm normalized long-range contacts against read mappability at each locus, evaluated by distal interactivity-independent self-ligation events. Specifically, long-range (>1.5 kb) and short-range (<1 kb) mapped read pairs (+ or − orientation) were separated into two .bam files using awk and SAMtools. BEDTools was then used to map these reads to two binning bed tracks: a local one with a 2-kb window and a background one with a 50-kb window, both featuring a 100-bp sliding size. Assuming that the over-representation sourced from mappability was proportional between long-range and short-range contracts, the expected count number for each bin was calculated as (longbg/shortbg) × shortlocal. Using MACS2 ‘bdgcmp -m qpois’, the local long-range read count for each bin was examined for statistical significance of enrichment against the expected number. The log fold change signal (stripe strength) was then calculated with the same formula by inputting the actual and expected values into MACS2 ‘bdgcmp -m logFE’. To prevent NaN (not a number) errors, a pseudo-count of 1 was added.
After the determination of stripe q values, each 100-bp bin was counted for the number of samples demonstrating significance (false discovery rate (FDR) < 0.01). Bins with at least three sample hits were deemed significant. These bins were then merged and only windows with a minimum size of 500 bp were included as final stripe anchors. Stripe peaks overlapping with ENCODE blacklist regions were removed. To mitigate gender variations among participants, only autosomal chromosomes were included for downstream analyses.
Loop calling was performed using the HiCCUPS algorithm from Juicer tools37 with the following parameters: ‘-r 500,1000,2000,5000,10000 -f 0.1 -p 4,2,2,2,2 -i 20,10,10,6,6 -t 0.1,1.25,1.75,2 -d 2000,2000,4000,10000,20000’. Given that library complexity significantly affects loop calling power, the analysis was not performed for each individual sample. Instead, it was executed on pooled libraries of (1) all samples; (2) all mucosa; (3) all polyps; and (4) all adenocarcinomas. Postprocessed loop pixels from all profiles at various resolutions were subsequently merged in the order of high resolution to low resolution from combined, mucosa, polyp and adenocarcinoma libraries. A loop with lower priority was filtered if both anchors overlapped with a higher-priority loop. This master loop list was then applied to each sample to execute individual loop quantification. Loop strengths were calculated by dividing read counts in the identified loops by the expected count from the donut background and log-transforming the results. A pseudo-count of 1 was added as necessary to prevent NaN errors. For stage-specific counting, loops with average loop strengths greater than 1.2-fold in samples of the specified stage were regarded as positive.
For annotations for stripe and loop anchors, features were mapped against the Ensembl regulatory build36 and TSS from Gencode using BEDTools. If an anchor overlapped with multiple features, the primary annotation adhered to the following order: promoter or TSS (within −1.5 kb to +0.5 kb of any Gencode transcript), active enhancers (defined by H3K27Ac marks in the Ensembl regulatory build), CTCF-binding sites and open chromatin.
For aggregation analysis of loops, aggregated peak analysis (APA) from Juicer was conducted with the parameters ‘-r 200 -u -n -0 -w 500 -k KR -q 20’. The enrichment score was computed as the average intensity of the 10 × 10 center pixels (2 kb) against the mean of the 100 × 100 pixels from the bottom left. For the aggregation of stripes, the same function was executed with the parameters ‘-r 200 -u -n -0 -w 250 -k KR -q 20’. Given the phenomenal distance-dependent interaction decay at the vicinity of stripes, interaction intensities at specific distances (that is, matrix diagonals) were normalized against the average intensity of the distance. The fold enrichment of the aggregated stripes was then calculated by averaging the normalized values in the center ten pixels. For visualizations of loop and stripe APA, matrices were log-transformed before being plotted onto heat maps.
For SV calling, contact matrices in .mcool format were generated by using the distiller pipeline, using default parameters as described above. These matrices were analyzed using the PredictSV function in the EagleC package41 with the following parameters: ‘--balance-type ICE --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.9999’. Identified SVs were visually confirmed on the Hi-C heat maps.
ATAC-seq processingATAC-seq results were processed using the ENCODE-DCC ATAC-seq pipeline (https://github.com/ENCODE-DCC/atac-seq-pipeline) with default settings. To generate the integrated peak list from all samples, a 100-bp binning track was created and mapped with the pseudo-replicated peak regions from each sample using BEDTools. Bins with at least three hits were deemed valid and merged; intervals with a minimum size of 300 bp were included as final peak sites. Peak fold enrichments of samples were then obtained from pipeline-derived fold change bigWig tracks.
Analysis of P–E connectivityTo elucidate the connectivity between promoters and their neighboring regulatory context, P–E pairs were determined for all ATAC-seq peaks within a 200-kb distance from target promoters. Interaction zones were defined as being within −1.5 kb to +0.5 kb near the TSS for promoters and within 500 bp of ATAC-seq peaks for distal enhancers. Contact frequencies between promoter and each target enhancer were calculated using Juicer Straw37 to extract read counts at 1,000-bp resolution. These raw contacts were normalized against the read mappability of the promoter, which was defined as the KR-normalized contact frequency or, in the case where balanced matrix was not available, the ratio of short-range self-ligation contact (<1.0 kb, + or − orientation) reads per kilobase per million mapped reads at the TSS region versus that for the 50-kb neighborhood. Normalization for coverage between samples was achieved by dividing frequencies of each contact by the long-range contact (with a minimum 1.5-kb distance threshold) densities in the 5–50-kb background donut region, in the unit of contacts per 1 × 1 kb square. The aggregated normalized contacts of the gene with all distal regulatory elements represented the total P–E connectivity of that gene.
The accessibility of interacting regulatory element peaks, referred to as ‘enhancer accessibility’ in the text, was calculated as the log sum of the fold enrichment signals of all ATAC-seq peaks that were involved in calculating the P–E connectivity. This value was extracted from the processed ATAC fold change bigWig tracks using pyBigWig, which represented the overall activity of the regulatory context for each gene, and was used for comparisons with P–E connectivity levels. For promoter accessibility, the top two quantiles of mean fold change of the ATAC-seq signals in the defined TSS region were retrieved using pyBigWig and log-transformed. For downstream analyses, genes showing nonpositive values for P–E connectivity, promoter accessibility or promoter stripe enrichment in all three stages were removed. For the remaining genes, missing or negative values were replaced with zero.
RNA-seq processingRNA-seq results were processed using T. Bencomo’s pipeline (https://github.com/tjbencomo/bulk-rnaseq), which uses Salmon for quantifying transcript levels and DESeq2 for identifying differential genes (FDR < 0.1, fold change > 1.3). Transcription levels (TPM) of genes were obtained by summing the transcript-based TPM from the Salmon output (.rf).
DNA methylation processingA total of 21,175,510 CpG sites with measurable methylation ratios were identified across all samples. The methylation degree of features, including mHi-C hotspots, ATAC peaks and gene promoters, was calculated by averaging the methylation percentage for all valid CpG sites within the feature. Regions with average methylation < 25% were classified as demethylated, while those > 40% were considered methylated. Regions with average methylation between 25% and 40% were classified as intermediately methylated and excluded from the methyl versus demethyl analyses. For methylation changes between two stages, regions demonstrating a >15% difference with <0.1 FDR were classified as significantly hypomethylated or hypermethylated on the basis of the direction of change. For correlation analyses with other features, the degree of demethylation (100% minus methylation percentage, annotated as ‘demethylation’) was often used to maintain a positive correlation between methylation degree and regulatory activity.
Mappability analysisThe mappability of mHi-C, in situ Hi-C and intact micro-C at annotated regulatory regions was visualized by retrieving their read coverage from a .bw file that compiled only short-range self-ligation contacts, defined as those with an interaction distance of less than 1.5 kb and a + or − orientation. To compare the distal interaction signal against mappability, long-range interactions, defined as those with an interaction distance greater than 2.0 kb, were extracted and aggregated similarly. The fold enrichment of interaction was then calculated as the ratio of long-range to short-range interactions at 100-bp resolution. Note that this ratio is equivalent to the ‘raw’ stripe strength before normalization against the average enrichment of the local background.
Principal component analysis (PCA)PCA of mHi-C and RNA-seq was performed using the Python sklearn.decomposition.PCA package. For P–E connectivity, analysis was conducted using either untransformed or scaled-by-sample matrices.
Motif enrichment analysisFor each feature (P–E connectivity, enhancer accessibility, TSS accessibility, TSS stripe strength, TSS methylation and gene expression), sequences of promoter regions for genes with the top 10% feature strength in mucosa were extracted using BEDTools. Motif enrichment was then calculated using the AME tool in the MEME suite75 with the JASPAR 2022 vertebrate motif database serving as the reference. The −log10(P values) for significantly enriched motifs for any of the features were included for hierarchical clustering.
Gene ontologyEnrichment analysis of significantly upregulated or downregulated genes in pathways was performed and visualized using the web-based gene set analysis toolkit76. The method of over-representation was selected to test enrichments in the Kyoto Encyclopedia of Genes and Genomes pathway against the protein-coding genome. Analysis was performed using default parameters.
Machine learningFor the initial and differential models, structural and epigenetic feature values in unaffected mucosa or their fold changes in polyps or adenocarcinoma were compiled for each promoter. For fold-change calculation, a pseudo-count of 1 was added to connectivity and methylation to avoid zero values. These features were further compiled with the expression levels of the genes in mucosa, the binary binding status of all TFs in the ENCODE chromatin immunoprecipitation with sequencing (ChIP-seq) database at their promoters and the sum of the binary binding status for the TFs in their distal enhancer contexts.
The raw value and their rank transform were combined, resulting in a total of 1,374 features for each gene. To train the models, 2,800 randomly selected genes were excluded as the test dataset and the rest were fit to the differential gene expression changes in polyps and adenocarcinoma using a sequential model with three intermediate layers and one dense output. Each layer included 2,048, 512 and 128 neurons in order and was filtered with a 20% dropout rate. Models were trained for up to 50 epochs, with the final model represented by the iteration that showed the lowest mean squared error for the test dataset. The training and evaluation of the models were performed using the Tensorflow Keras module in Python.
For evaluation of feature importance in the models, the Shapley additive explanations (SHAP)77 package was used for analysis with default parameters. Average feature importance was calculated as the absolute mean of the importance across all genes. Overall directionality was represented by the numerical mean of the importance.
TCGA gene expression analysisA list of differentially expressed genes and their fold changes were obtained from the GEPIA database47. The prediction of differential expression was performed using receiver operating characteristic (ROC) analysis in the Python sklearn package, where predicted differential fold changes from the initial model were used as thresholds for the upregulated and downregulated genes. For prediction of the directionality, genes were scored by their consistency of dysregulation, averaging their alterations (−1 for downregulation, 0 for unchanged and +1 for upregulation). The correlation between the directionality score and the prediction accuracy was then evaluated.
Statistics and reproducibilityAll statistical analyses were performed using Python, R or Excel. Unless specified in the figure legend, statistical significance was calculated using the Mann–Whitney U-test or unpaired t-test, assuming a normal distribution of samples, between two experimental conditions. Raw P values < 0.05 or adjusted FDR values < 0.1 were considered statistically significant, as indicated in each figure legend. All statistical tests were two-tailed. The exact P values are reported in each figure, except where P was equal to 1 (not significant (NS)) or below a low-value threshold (2 × 10−16).
Sample sizes were predetermined to have a minimal power of 0.8 for a 1.5-fold effect size with a coefficient of variance less than or equal to 0.3. The exact sample sizes are indicated in the figure legends. No data points or outliers were excluded from the experiment. Randomization and blinding of samples were not possible for this study. However, all experiments involving case–control comparisons were performed in the same batch and treated identically without prior designation. In bar plots, values and error bars indicate the sample mean and s.e.m. unless specified otherwise in the figure legend. In box–whisker plots, the center indicates the median value and the bounds of the box are defined by the first and third quartiles. Whiskers are drawn within the 1.5 interquartile range.
Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
留言 (0)