Improving reusability along the data life cycle: a regulatory circuits case study

All the results presented in the paper were obtained by relying on Semantic Web technologies. Our strategy was to create an RDF graph for each dataset handled in the Regulatory Circuits project (biological dataset, experimental context dataset, 394 tissue-specific TF-gene datasets, 808 sample-specific TF-gene datasets). Then, the capability of RDF to identify groups of related triples as named graphs was used to link all the RDF datasets together. This modular design makes it possible (1) to assign metadata describing each group, (2) to improve the performance of SPARQL queries by considering only the relevant portions of a dataset, (3) to extend the dataset by adding new groups such as new sample data and (4) to reuse portions of the dataset in other studies. Note that this addresses the limits to data reusability identified at the beginning of the Background section.

Biological data from Regulatory circuits

The Regulatory Circuits website and supplementary data give access to unstructured, disconnected and diversely formatted tabulated files related either to input biological data (FANTOM5 data, genomic coordinates of genes and regions, occurrences of TF binding sites...) or to intermediate computation results (59 files). The main output of the in silico analyses resulting from the Regulatory Circuits project consists of maps (called networks) describing interactions between TFs and their target genes in each of the 394 studied tissues.

Each network is described by an oriented graph in which TFs are connected to genes. The nodes are annotated with biological information (gene IDs for both TFs and target genes). The edges are annotated with a unique score aggregating two different weights representing the respective contributions of the enhancer and promoter regions to the predicted strength of the TF-gene regulation. These respective contributions as well as the formula used to compute the final score are neither described nor available. These 394 tissue-specific TF-gene interaction networks are provided as tabulated files, and the pipeline to produce them is neither usable nor reproducible.

Biological data RDF graph

The biological data graph contains the minimal set of biological entities required to build the Regulatory Circuits networks together with their attributes (values), and describes the relations between these entities. As detailed in [14] and depicted in Fig. 1, it is based on five main types of biological entities: three related to genes or proteins (gene, transcript, TF) and two related to chromosomal regulation regions (promoter, enhancer), connected by five reified relations (see below).

Fig. 1

Biological data RDF graph: structure of the RDF graph and its population. a Graphical representation of the structure of the biological data graph from the Regulatory Circuits project. Boxes represent classes of entities. The grey boxes represent mappings to external resources. b Data integrated into the biological data graph before running the injection queries.

The identifiers of genes (19,125 instances of the class Gene), transcripts (53,549 instances of the class Transcript) and transcription factors (691 instances of the class TF) are constructed based on the names provided by the Regulatory Circuits datasets (HGNC reference identifiers for TF and Gene, Ensembl transcript names for Transcript). These identifiers are linked to identifiers from external databases such as UniProtKB [16] (release 2021_04) and Ensembl [17] (release 104) as follows. Genes are associated with the UniProt identifier of their reviewed proteins; when several proteins are reviewed for a gene, the longest one is selected. Both genes and transcripts are associated with their Ensembl identifiers as already available in the Regulatory Circuits datasets.
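The protein selection rule described above can be sketched as follows. This is only an illustration of the stated rule, not the authors' code: the `Protein` record, its field names and the accession values are hypothetical, and the behaviour on ties (the first longest candidate wins) is an assumption, since the text does not specify it.

```python
from typing import List, NamedTuple, Optional


class Protein(NamedTuple):
    accession: str  # protein identifier (placeholder values below)
    reviewed: bool  # True for a reviewed entry
    length: int     # sequence length in amino acids


def select_protein(candidates: List[Protein]) -> Optional[Protein]:
    """Return the longest reviewed protein, or None if none is reviewed."""
    reviewed = [p for p in candidates if p.reviewed]
    if not reviewed:
        return None
    # Assumption: on ties, max() keeps the first longest candidate.
    return max(reviewed, key=lambda p: p.length)


# Hypothetical candidates for a single gene.
candidates = [
    Protein("ACC_1", True, 350),
    Protein("ACC_2", True, 412),
    Protein("ACC_3", False, 980),  # longest, but not reviewed
]
chosen = select_protein(candidates)
```

Here `chosen` is the 412-residue reviewed entry; the longer unreviewed entry is ignored.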

There are two classes of regulatory regions: Promoter (184,828 entities) and Enhancer (43,011 entities).

The dataset comprises five types of reified relations: two between TFs and regulatory regions, weighted by the confidence of the transcription factor binding site in the region (1,169,797 entities for TF_promoter and 524,816 for TF_enhancer); two between regulatory regions and transcripts, weighted by the distance and the Weight_Distance between those entities (123,441 entities for promoter_transcipt and 950,514 for enhancer_transcipt); and a last one between transcripts and genes (53,449 entities). Each instance of the classes Promoter or Enhancer is associated with two sets of 808 float values, one corresponding to its expression value in each sample, and the other corresponding to its normalized relative rank in each sample compared to the 807 others. Similarly, each instance of the Transcript class is associated with 808 float values describing its normalized relative rank in each sample compared to the others. This rank information is directly provided by Regulatory Circuits. Contrary to the promoters and enhancers, no measured expression value is provided for transcripts. For the LERC resource, the ranks were computed according to the methodology described in [6], using the maximum of the ranks of the transcript’s promoters. Each rank identifier is built by using the sample’s identifier (libId).
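As a minimal sketch of the last step only (taking, for each sample, the maximum rank among the promoters of a transcript), assuming promoter ranks are already available as plain dictionaries. The data layout, function name and identifiers are hypothetical, and the normalization of [6] is not reproduced here.

```python
from typing import Dict, List


def transcript_ranks(
    promoter_ranks: Dict[str, Dict[str, float]],
    transcript_promoters: Dict[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """For each transcript and sample, keep the max rank among its promoters.

    promoter_ranks maps promoter id -> {sample id (libId) -> rank}.
    transcript_promoters maps transcript id -> list of promoter ids.
    """
    result: Dict[str, Dict[str, float]] = {}
    for transcript, promoters in transcript_promoters.items():
        per_sample: Dict[str, float] = {}
        for promoter in promoters:
            for sample, rank in promoter_ranks.get(promoter, {}).items():
                if sample not in per_sample or rank > per_sample[sample]:
                    per_sample[sample] = rank
        result[transcript] = per_sample
    return result


# Hypothetical ranks for two promoters of one transcript in two samples.
promoter_ranks = {
    "p1": {"lib1": 0.2, "lib2": 0.9},
    "p2": {"lib1": 0.7, "lib2": 0.4},
}
ranks = transcript_ranks(promoter_ranks, {"t1": ["p1", "p2"]})
```

In this toy input, transcript `t1` takes its `lib1` rank from `p2` and its `lib2` rank from `p1`.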

Figure 1 compiles the total number of triples and entities in the biological data graph.

Sample-specific weights of the TF-gene regulation networks

Each TF-gene interaction is characterized by a promoter weight and by an enhancer weight. As shown in Fig. 1, the relation between a TF and a regulatory region is described by a confidence value, and the rank of the regulatory region is described by a value associated with the sample. The promoter weight is defined by \(weightP=\max \left(confidence \times \sqrt{Rank\_promoter\_sample \times Rank\_transcript\_sample}\right)\), where the maximum is computed for all the possible promoters mediating the interaction. The enhancer weight is defined by \(weightE=\max \left(confidence \times Weight\_Distance \times \sqrt{Rank\_transcript\_sample \times Rank\_enhancer\_sample}\right)\). These formulas were generated according to the method section of [6].
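The two definitions above can be illustrated outside of SPARQL with a short Python sketch. It is not the query used in the project: the function names, the tuple layout and the numeric values are hypothetical, and returning 0.0 when no regulatory region mediates the interaction is an assumption made for the example.

```python
import math
from typing import Iterable, Tuple


def weight_p(promoters: Iterable[Tuple[float, float, float]]) -> float:
    """Promoter weight for one TF-gene pair in one sample.

    Each tuple is (confidence, rank_promoter_sample, rank_transcript_sample).
    """
    return max(
        (conf * math.sqrt(rank_p * rank_t) for conf, rank_p, rank_t in promoters),
        default=0.0,  # assumption: no mediating promoter gives a null weight
    )


def weight_e(enhancers: Iterable[Tuple[float, float, float, float]]) -> float:
    """Enhancer weight for one TF-gene pair in one sample.

    Each tuple is (confidence, weight_distance, rank_enhancer_sample,
    rank_transcript_sample).
    """
    return max(
        (
            conf * w_dist * math.sqrt(rank_t * rank_e)
            for conf, w_dist, rank_e, rank_t in enhancers
        ),
        default=0.0,  # assumption: no mediating enhancer gives a null weight
    )


# Hypothetical values: two promoters and one enhancer for the same TF-gene pair.
wp = weight_p([(0.8, 0.25, 1.0), (0.5, 0.81, 1.0)])
we = weight_e([(0.6, 0.5, 0.64, 1.0)])
```

In this example the second promoter wins the maximum even though its confidence is lower, because its rank in the sample is higher.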

The SPARQL query for computing weightP is given in Fig. 2, where SAMPLE must be replaced by the identifier of an actual sample. A similar query for computing weightE is available on the GitHub repository of the project (cf. Availability section). The relations with a null weight are excluded to avoid overloading the graph.

Fig. 2

SPARQL query for computing the sample-specific value of the weight associated with promoters for the TF-gene regulation relations (in the WHERE clause) and inserting it in the corresponding sample graph (in the INSERT clause)

Each sample-specific network contains some values such as ranks that depend on values from the other networks, so that the 808 sample-specific networks have to be computed simultaneously. In order to save time and CPU usage, we executed these queries once (11.2 days of CPU time) and integrated the final 808 sample-specific networks into our resource triplestore, using an INSERT operation as shown in Fig. 2.

Tissue-specific weights of the TF-gene regulation networks

At the tissue level, each TF-gene interaction is characterized by (i) a promoter weight (max of the promoter weights among the samples composing the tissue), (ii) an enhancer weight (max of the enhancer weights among the samples composing the tissue), (iii) a Max score combining the two previous ones, and (iv) an RC score extracted from the Regulatory Circuits output data files. We designed a SPARQL query to compute tissue-specific promoter/enhancer weights and Max scores and re-inject them into the tissue-specific RDF graphs (Fig. 3). It computes the weights of TF-gene relations in a tissue-specific network formed by two separate samples. The queries for tissue-specific networks with more samples (up to 33) or a single sample are available in the GitHub repository of the project.
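The aggregation in (i) and (ii) amounts to a per-relation maximum over the samples of a tissue. A minimal Python sketch, assuming the sample-specific weights are held in dictionaries keyed by hypothetical sample identifiers and (TF, gene) pairs; the Max score is left out because its combination rule is not spelled out in this section.

```python
from typing import Dict, Iterable, Tuple

Relation = Tuple[str, str]  # (TF, gene)


def max_over_samples(
    weights_by_sample: Dict[str, Dict[Relation, float]],
    samples: Iterable[str],
) -> Dict[Relation, float]:
    """Tissue-level weight: max of the sample-level weights of each relation."""
    tissue: Dict[Relation, float] = {}
    for sample in samples:
        for relation, weight in weights_by_sample.get(sample, {}).items():
            if relation not in tissue or weight > tissue[relation]:
                tissue[relation] = weight
    return tissue


# Hypothetical promoter weights for a tissue made of two samples.
promoter_weights = {
    "SAMPLE1": {("TF_A", "GENE_X"): 0.30, ("TF_A", "GENE_Y"): 0.10},
    "SAMPLE2": {("TF_A", "GENE_X"): 0.45},
}
tissue_promoter = max_over_samples(promoter_weights, ["SAMPLE1", "SAMPLE2"])
```

The same function can be applied to the enhancer weights, and it works unchanged for tissues composed of one sample or of many.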

Fig. 3

SPARQL query for computing tissue-specific values of the weights associated with promoters and enhancers and the global score for the TF-gene regulation relations from the values of the samples SAMPLE1 and SAMPLE2 associated with the tissue TISSUEx, and inserting them in the graph describing the corresponding tissue

Experimental context graph

The experimental context graph describes the experimental information about the 808 samples (cell types, organs, patient clinical data [age, gender...], diseases...) and about the 394 tissues (linked to the samples they are composed of), as well as the mappings to reference databases. Note that the experimental data (expressions and ranks) belong to the biological data graph. All the information contained in the experimental context graph is extracted from the nmeth_.3799-S2.xlsx file present in the Regulatory Circuits supplementary data, and formatted to respect the identifiers of the generated sample-specific or tissue-specific RDF graphs.

When applicable, we also include links to other knowledge bases from the Linked Open Data such as gene identifiers from Ensembl, protein identifiers from UniProtKB, cell types and anatomical structures from the Uberon and the Foundational Model of Anatomy ontologies.

Metadata graph

The metadata graph contains all the information about the other graphs including their VoID descriptions, as well as the associations of the samples and tissues from the experimental context graph with their respective graph containing their specific regulatory network.

This explicit representation of the metadata about the samples and the tissues can be queried by the users for identifying the subset of samples or tissues they are interested in. The modular approach described next allows the user to retrieve the corresponding portions of the dataset.

Structuration and computation of the modular graphs

We took advantage of the notion of named graph in the RDF model to design a modular structure for Regulatory Circuits that makes it possible to identify the subset of the samples and tissues that meets the user’s requirements, to retrieve the corresponding networks and to combine them with additional data. To do so, we first created an RDF named graph for general biological data such as the binding and neighborhood relations between TFs, regulatory regions and genes. Second, we created a distinct RDF named graph for each sample- and tissue-specific network (see the INSERT clauses of the queries in Figs. 2 and 3 that generate the weights and scores of regulation relations in specific graphs based on information from the biological data graph). Third, an additional metadata graph associates each of these named graphs with the corresponding sample or tissue. Fourth, the descriptions of the samples and tissues (i.e. the organs, cell types and patient characteristics) as well as the composition relations of tissues into samples are represented in the experimental context graph. Thus, a user can query the experimental context graph to identify the samples and tissues that meet some constraints, and retrieve the associated networks. Likewise, new samples or tissues can be combined with Regulatory Circuits by adding the corresponding graphs and generating the associated metadata and experimental context graphs. In both cases, the weights and scores for the regulation relations of new datasets can be re-computed with the queries from Figs. 2 and 3, addressing the reusability and reproducibility requirements.
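The retrieval path described above (experimental context graph, then metadata graph, then the matching network graphs) can be mimicked with an in-memory stand-in for a triplestore. This is a toy sketch, not the LERC vocabulary or the project's queries: all graph names, prefixes, predicates and values below are hypothetical.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical dataset: each key is a named graph, each value its triples.
dataset: Dict[str, Set[Triple]] = {
    "graph:context": {
        ("sample:S1", "ctx:organ", "liver"),
        ("sample:S2", "ctx:organ", "lung"),
    },
    "graph:metadata": {
        ("sample:S1", "meta:hasNetworkGraph", "graph:S1"),
        ("sample:S2", "meta:hasNetworkGraph", "graph:S2"),
    },
    "graph:S1": {("tf:A", "reg:regulates", "gene:X")},
    "graph:S2": {("tf:B", "reg:regulates", "gene:Y")},
}


def networks_for(
    data: Dict[str, Set[Triple]], predicate: str, value: str
) -> Dict[str, Set[Triple]]:
    """Return the network graphs of the samples matching a context constraint."""
    # Step 1: select samples in the experimental context graph.
    samples = {s for s, p, o in data["graph:context"] if p == predicate and o == value}
    # Step 2: follow the metadata graph to the names of their network graphs.
    names = sorted(
        o
        for s, p, o in data["graph:metadata"]
        if s in samples and p == "meta:hasNetworkGraph"
    )
    # Step 3: retrieve only those named graphs.
    return {name: data[name] for name in names}


liver_networks = networks_for(dataset, "ctx:organ", "liver")
```

Adding a new sample in this model only requires a new network graph plus one triple in each of the context and metadata graphs, which mirrors the extension mechanism described in the text.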

Availability

The original datasets of the Regulatory Circuits project were downloaded as tabulated files from the website of the original project [5].

All data related to the Linked Extended Regulatory Circuits (LERC) resource are available on the website of the project: https://regulatorycircuits-lod.genouest.org. The RDF version of the dataset is under the Attribution 4.0 International (CC BY 4.0) license. The SPARQL queries used to generate the sample- and tissue-specific TF-gene graphs are available on GitHub: https://github.com/mlouarn/RCsparql/. The generated Turtle files are available at https://zenodo.org/record/4889146.
