CREdb: A comprehensive database of Cis-Regulatory Elements and their activity in human cells and tissues

Database design

A data model was designed to store information related to regulatory elements including promoters, enhancers, silencers, transcription factors, and called segments (other elements). In addition to these elements, the data model also stores the genes they interact with, biosamples with data supporting the regulatory elements, biological activity of regulatory elements in these biosamples, and the data sources used to collect these elements. The conceptual form of this data model is presented in Fig. 4C, and detailed information about the attributes of each entity is provided in Supplemental Methods.

Fig. 4

Methodology and structure of CREdb. (A) Curation methodology for the generation of CREdb. Raw data are extracted from databases containing data pertaining to CREs, biosamples, and genes. Relevant metadata are then collected for the biosamples, and these samples are mapped to standard ontologies to facilitate efficient comparisons between source datasets. CREs and genes are filtered to only human sources and standardized to the same terminology and, where necessary, reference genome. Consensus data elements are then generated based on overlap between each group. Curated CREs, genes, and biosamples are then mapped to the data model to generate the final CREdb. (B) Consensus generation of elements. For each element type (promoter, enhancer, etc.), elements were clustered where they had at least 20% overlap and condensed into a consensus range for the element. In this example, five promoters have sufficient overlap to be considered part of a single consensus element. This allows for more sensitive detection of the element when querying with data that might not match any one variant of the site identified in the source databases. (C) Data model of the final CREdb resource. This conceptual model represents the entities and relationships of the final CREdb. Regulatory elements sit at the center of the model, with interactions between themselves (enhancer/promoter interactions). For each biosample, relationships are identified between genes, regulatory elements, and their respective activity in that sample.

Data sources

Regulatory element data with corresponding biospecimen annotations were collected from the following databases for inclusion in CREdb: ENCODE SCREEN [5], ENdb [6], RefSeq [11], SilencerDB [12], Silencer-Candidates [13], ReMap2020 [14], GeneHancer [15], FANTOM5 [16], EpiMap [17], Ensembl Regulatory Build [18] and EnhancerAtlas [19]. Gene data were collected from the following databases for inclusion in CREdb: HGNC [20], Gencode [11], and RefSeq [7]. All databases were downloaded on 22 July 2021.

Data curation

Data from each database were filtered to only human regulatory elements and genes. Duplicate entries were removed, and available data were mapped to the CREdb database tables. Database-specific choices were made during the curation process to convert the available data into the universal format provided by the data model; details of these choices are given in Supplemental Methods. For data sources where the elements were aligned to GRCh37, the elements were remapped to GRCh38 to harmonize all elements against the same reference genome. The data sources that required remapping were FANTOM5, EpiMap, SilencerDB, EnhancerAtlas, Silencer-Candidates, and ENdb. A diagram of the curation and mapping process can be found in Fig. 4A.

To harmonize biosample metadata across data sources, raw values were classified into distinct categories: “cell line,” “primary cell,” “in vitro differentiated cells,” “tissue,” and “disease.” Original biological source values were then mapped to standard ontologies widely accepted by the biomedical community. For example, for the “cell line” type, raw values were mapped to Experimental Factor Ontology (EFO), Cellosaurus (CVCL), and Cell Line Ontology (CLO) terms. For “primary cell” and “in vitro differentiated cells,” most values were mapped to Cell Ontology (CL) standard terms. For the “tissue” and “disease” types, most values were mapped to the Uber-anatomy ontology (UBERON) and the Human Disease Ontology (DOID), respectively. To enrich the annotations for cell lines and primary cells, additional information about the corresponding diseases and tissues was also extracted.

Standard ontology alignment was generated using a custom Excel plugin based on SciGraph [21], which stores and manages ontologies. The plugin automatically searches for the closest term in a selected ontology and returns the preferred term (PT), Ontology, Ontology ID, and a matching score. Manual review by domain experts was then performed for all terms with a matching score below 0.80 (where 1 is the highest score for the exact match).
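The lookup-and-review workflow described above can be illustrated with a minimal Python sketch. It is not the SciGraph-based plugin: the string similarity from the standard-library difflib module is a stand-in for the plugin's matching score, the three-term vocabulary is a toy example, and the names closest_term and REVIEW_THRESHOLD are hypothetical. Only the 0.80 review cut-off is taken from the text.

```python
from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.80  # scores below this go to manual review (from the text)

# Toy vocabulary for illustration only; a real run would load a full ontology.
CELL_ONTOLOGY = {
    "CL:0000192": "smooth muscle cell",
    "CL:0000540": "neuron",
    "CL:0000127": "astrocyte",
}


def closest_term(raw_value, vocabulary):
    """Return the best-matching term for a raw biosample label.

    The score is difflib's similarity ratio (1.0 for an exact match),
    used here as an assumed stand-in for the plugin's matching score.
    """
    query = raw_value.strip().lower()
    best_id, best_label, best_score = None, None, -1.0
    for term_id, label in vocabulary.items():
        score = SequenceMatcher(None, query, label.lower()).ratio()
        if score > best_score:
            best_id, best_label, best_score = term_id, label, score
    return {
        "preferred_term": best_label,
        "ontology_id": best_id,
        "score": best_score,
        "needs_review": best_score < REVIEW_THRESHOLD,
    }


exact = closest_term("Smooth muscle cell ", CELL_ONTOLOGY)
abbreviation = closest_term("SMC", CELL_ONTOLOGY)
```

In this sketch an exact label is accepted automatically, whereas an abbreviation such as "SMC" scores poorly against every term and is flagged, which is the kind of case the expert review step is meant to catch.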

Consensus generation

The harmonized data were used to generate three sets of combined tables utilizing data from all available databases:

1. A combined element table generated by concatenation and genomic re-sorting of all element tables, activity tables, and element-gene interaction tables.

2. A consensus element table derived from the combined element table using the ‘bedmap’ tool from the ‘bedops’ tool suite [22] to perform clustering by a reciprocal overlap of 20% with all elements of the same type (enhancers, promoters, silencers, and called segments). Transcription factors were excluded because they are a product of ReMap’s clustering, and elements shorter than 10 bp or longer than 50,000 bp were excluded based on size. An overview of this methodology is presented in Fig. 4B.

3. A master activity table created by taking the consensus at the biosample level: within each consensus element, the activity profiles of all biosample replicates were merged and then aggregated across all datasets to determine consensus activity. FANTOM5 and EnhancerAtlas were handled differently: gene TPM (transcripts per million) was aggregated by taking the median across replicates, and percentile expression was then calculated by rank-normalizing within each sample. An element was considered active in the master activity table if any of the following were true:

a) it was called active in at least two datasets;

b) its FANTOM5 and EnhancerAtlas expression percentiles were both greater than zero;

c) its FANTOM5 or EnhancerAtlas expression percentile was greater than 20%.
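The numeric rules in the list above (reciprocal overlap, size filter, and activity criteria a to c) can be written down as a short Python sketch. This is an illustration under stated assumptions rather than the pipeline itself, which used bedmap: intervals are taken to be half-open (start, end) pairs, percentiles are assumed to be on a 0 to 100 scale, a missing percentile is treated as zero, and all function names are hypothetical.

```python
def reciprocal_overlap(a, b, min_frac=0.20):
    """True if half-open intervals a and b overlap by at least
    min_frac of the length of *each* interval."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a[1] - a[0])
            and overlap >= min_frac * (b[1] - b[0]))


def passes_size_filter(start, end, min_bp=10, max_bp=50_000):
    """Keep elements whose length is between 10 bp and 50,000 bp inclusive."""
    return min_bp <= end - start <= max_bp


def is_active(n_active_datasets, fantom5_pct=None, enhanceratlas_pct=None):
    """Apply criteria (a)-(c). Percentiles are assumed to be on a 0-100
    scale; None (no data) is treated as 0, which is an assumption."""
    f = 0.0 if fantom5_pct is None else fantom5_pct
    e = 0.0 if enhanceratlas_pct is None else enhanceratlas_pct
    if n_active_datasets >= 2:      # (a) called active in >= 2 datasets
        return True
    if f > 0 and e > 0:             # (b) both percentiles above zero
        return True
    return f > 20 or e > 20         # (c) either percentile above 20%
```

Note that the overlap test is reciprocal: a 100 bp element fully contained in a 1,000 bp element covers only 10% of the larger one, so the pair does not qualify even though the smaller element is overlapped completely.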

Activity by contact data

Candidate enhancers identified using the Activity by Contact (ABC) model by Nasser et al. [8] were included as a separate table by extracting each candidate and its relevant information from the provided bed files.

Enrichment of GWAS signals in tissue-specific CREs

GWAS lead variants were downloaded from the GWAS Catalog (https://www.ebi.ac.uk/gwas/api/search/downloads/alternative). A BED file was generated by extending each lead variant by 100 bp both upstream and downstream. The overlap between the GWAS signal BED file and tissue-specific consensus regulatory elements was computed using the bedtools (version 2.27.1) intersect command. The Experimental Factor Ontology (EFO) mapping provided by the GWAS Catalog was used to annotate the trait/phenotype of each GWAS study. The enrichment analysis was carried out by hypergeometric testing with the Python library SciPy (v1.7.3). Each unique variant-phenotype pair was counted only once. Phenotypes with fewer than ten hits in a tissue were excluded from subsequent analysis.
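The window expansion and the enrichment test can be sketched in a few lines of Python. The sketch uses only the standard library so that it is self-contained; the upper-tail probability it computes should correspond to scipy.stats.hypergeom.sf(k - 1, M, n, N). The function names and the choice of population (all unique variant-phenotype pairs) are assumptions, since the text does not spell out how the test was parameterized.

```python
from math import comb

MIN_HITS = 10  # phenotypes with fewer hits in a tissue were excluded (from the text)


def expand_variant(chrom, pos, flank=100):
    """Turn a 1-based variant position into a 0-based, half-open BED
    interval covering the variant plus `flank` bp on each side."""
    start = max(0, pos - 1 - flank)
    end = pos + flank
    return chrom, start, end


def hypergeom_enrichment_p(k, M, n, N):
    """Upper-tail probability P(X >= k) of the hypergeometric distribution.

    Assumed parameterization (not stated in the text):
      M: all unique variant-phenotype pairs
      n: pairs belonging to the phenotype of interest
      N: pairs whose window overlaps a tissue-specific CRE
      k: pairs that are in both groups
    """
    total = comb(M, N)
    tail = sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1))
    return tail / total


window = expand_variant("chr1", 1000)
p_value = hypergeom_enrichment_p(k=5, M=10, n=5, N=5)
```

With this convention each lead variant yields a 201 bp window, and a phenotype would only be tested in a tissue when its hit count k is at least MIN_HITS.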

snATAC-seq analysis

The snATAC-seq dataset was downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6459701) in the form of fragment alignment genomic coordinates (GSM6459701_NOSN01_snATAC.fragments.tsv.gz) and a peak-by-cell count matrix (GSM6459701_NOSN01_snATAC.filtered_peak_bc_matrix.h5). The Signac R package (version 1.11.0) was used to load the dataset and perform downstream analysis. Latent semantic indexing (LSI) dimensionality reduction was carried out by applying a term frequency-inverse document frequency (TF-IDF) transformation followed by singular value decomposition (SVD). The top 30 LSI components (excluding the first) were used to compute shared nearest neighbors (SNNs), which were then used to cluster nuclei with the SLM algorithm in the FindClusters function. Gene activity scores were computed for protein-coding genes by summing snATAC-seq reads mapped to the gene body and the promoter (5 kb upstream of the TSS) using the GeneActivity function. Cell clusters were annotated by inspecting the gene activity scores of known marker genes of neural cell types. The genomic coordinates of smooth muscle cell-specific CREs were lifted over to hg19 using the liftOver command in the rtracklayer R package (version 1.60.1) to be compatible with the snATAC-seq fragment coordinates.
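The gene activity region described above (gene body plus 5 kb upstream of the TSS) depends on strand, which the following Python sketch makes explicit. It is a conceptual illustration, not Signac's GeneActivity implementation: coordinates are assumed to be 0-based and half-open, fragments are counted if they overlap the region at all, and both function names are hypothetical.

```python
def gene_activity_region(start, end, strand, upstream=5000):
    """Extend a gene body by `upstream` bp on the side of its TSS.

    On the + strand the TSS is at `start`, so the region grows to the left;
    on the - strand the TSS is at `end`, so the region grows to the right.
    """
    if strand == "+":
        return max(0, start - upstream), end
    if strand == "-":
        return start, end + upstream
    raise ValueError(f"unknown strand: {strand!r}")


def count_fragments(region, fragments):
    """Count fragments (start, end) that overlap the region (simplified)."""
    region_start, region_end = region
    return sum(1 for f_start, f_end in fragments
               if f_start < region_end and f_end > region_start)


plus_region = gene_activity_region(10_000, 20_000, "+")
minus_region = gene_activity_region(10_000, 20_000, "-")
n_fragments = count_fragments(
    plus_region, [(4_900, 5_100), (20_000, 20_100), (100, 200)]
)
```

Repeating the count for every protein-coding gene and every cell barcode would give a gene-by-cell matrix of the kind used here to annotate clusters from marker genes.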
