TFscope: systematic analysis of the sequence features involved in the binding preferences of transcription factors

TFscope overview

TFscope aims to identify the sequence features responsible for the binding differences observed between two ChIP-seq experiments. Typically, TFscope can be used to identify differences between two experiments targeting the same TF in different cell types or conditions, or two experiments targeting two paralogous TFs that share similar motifs. TFscope takes as input two sets of ChIP-seq peaks corresponding to the two ChIP-seq experiments and then runs the three steps illustrated in Fig. 1: sequence selection and alignment, feature extraction, and model learning. In the sequence alignment step, TFscope first selects the two sets of peaks to analyze. By default, the comparison focus on the peaks that are unique to the first vs. the second experiment (see the “Material and methods” section). Alternatively, users can choose to compare the peaks unique either to the first or the second experiment vs. the intersection set. A last possibility is to use a dedicated differential peak caller to identify differential peaks between the two experiments [16]. Then, TFscope identifies the most likely binding site using a strategy similar to Centrimo [8] and UniBind [21] and parses the sequence around the peak summit with the PWM associated with the target TF (if several versions of the motif are available or if the analysis involves two paralogous TFs with similar motifs, the most discriminative PWM is chosen to scan the two sets of sequences; see Material and Methods). The FIMO tool [23] is used for this analysis, and the position with the highest PWM score serves as an anchor point to extract the 1 Kb long sequence centered around this position. At the end of the alignment step, we get two classes of sequences centered on the most likely TFBSs of the input ChIP-seq peaks. Sequences with no occurrence of the motif around the peak summit are discarded. Alternatively, if the motif of the target TF is unknown, or if the user prefers to keep all sequences, the search of the most likely TFBS is skipped, and sequences remain centered on the peak summits.

Fig. 1figure 1

The TFscope approach. In the first step (sequence selection and alignment), the set of ChIP-seq peaks to be compared are selected. In a classical analysis, peaks in the intersection (i.e., associated with both ChIP-seq experiments) are removed, and the peaks in the two complementary sets are selected for comparison. Alternatively, users can choose to compare one of the complementary sets vs. the intersection set. Next, the most likely TFBSs of the selected peaks are identified and used to extract the 1 Kb sequences centered on these sites. If the motif of the target TF is unknown, this step can be skipped and sequences remain centered on the peak summit. All sequences are then used for the second step (feature extraction). Three dedicated modules extract three kinds of sequence features that can be useful for discriminating the two classes. The TFscope-DM module learns a new PWM that discriminates the sequences solely on the basis of the core motif (if the target motif is unknown this module is also skipped). The TFscope-NE module searches for specific nucleotidic environments (i.e., frequency of specific k-mers in specific regions) that are different in the two classes. The TFscope-CF module searches for binding sites of specific co-factors whose presence in specific regions differs between the two classes. All of these features (variables) are then gathered into a long table, and a logistic model (Expression (1)) is learned on the basis of these data (Feature selection and model learning step). A special penalty function (LASSO) is used during training, for selecting only the best variables in the model (in bold in the table)

TFscope then runs the three modules detailed below to extract three kinds of sequence features that are discriminative of the two sequence classes (feature-extraction step). The first module (TFscope-DM) learns a new PWM of the core motif. This PWM differs from the original PWM used to parse the sequence, as it focuses on potential core motif differences that may exist between the two sequence sets. This module returns a single variable \(\text (s)\), which is the score of the new PWM on each sequence s. Note that this module is omitted if the motif of the target TF is unknown. The second module (TFscope-NE) searches for pairs of (k-mer,region) for which the frequency of the k-mer in the defined region is different between the two sets of sequences. For example, the frequency of the 3-mer ACA in region \([-150:+500]\) (with 0 being the anchor point of the sequences) may be globally higher in sequences of the first class than in those of the second class. The idea is to capture nucleotidic environment differences that may exist between the two classes. We use for this module a slight modification of the DExTER method recently proposed to identify long regulatory elements [38]. This module returns a potentially large set of \(\text _i(s)\) variables that corresponds to the frequency of ith k-mer in the associated ith region for each sequence s. The third module (TFscope-CF) uses a library of PWMs (in the experiments below the JASPAR2020 library was used [20]) and searches for pairs of (PWM,region) for which the PWM score in the identified region is different between the two sets of sequences (see below). The idea is to identify all co-factors of the target TF whose binding sites differ between the two classes: either because these binding sites are in majority present in one class and not the other or because the locations of these binding sites differ between the two classes. This third module returns a set of variables \(\text _j(s)\) that corresponds to the score of the jth PWM in the identified jth region for each sequence s.

All variables are then integrated into a global model that aims to predict if a sequence belongs to the first or the second class (learning step). We used a logistic regression model:

$$\begin P(1|s) = S\left( a\cdot \text (s) + \sum \limits _i b_i\cdot \text _i(s) + \sum \limits _j c_j\cdot \text _j(s)\right) , \end$$

(1)

where P(1|s) is the probability that sequence s belongs to the first class, S is the sigmoid function, \(\text (s)\) is the score of the discriminative motif for sequence s, \(\text _i(s)\) is the value of the ith nucleotidic-environment variable for sequence s, \(\text _j(s)\) is the value of the jth co-factor variable for sequence s, and a, \(b_i\) and \(c_j\) are the regression coefficients which constitute the model parameters. Because the set of variables identified by the last two modules is usually large and the variables are often correlated, the model is trained with a LASSO penalty function [55] that selects the most relevant variables—i.e., many regression coefficients (a, \(b_i\) and \(c_j\)) are set at zero. We use for this the glmnet package in R. Note that the right amount of penalization is automatically deduced from the data by glmnet using an internal cross-validation procedure. Finally, once a model has been trained, its accuracy is evaluated by computing the area under the ROC (AUROC) on several hundred sequences. To avoid any bias, this is done on a set of sequences that have not been used in the previous steps (\(70\%\) sequences are used for training, and 30\(\%\) for the test).

TFscope-DM: identification of differences in the core motif

The first TFscope module learns a new discriminative PWM. Recall that at the end of the alignment step, the most likely binding site of each ChIP-seq peak has been identified with the JASPAR PWM associated with the TF, and all sequences are aligned on these sites. If several versions of the PWM are available, the most discriminative PWM is used (see Material and Methods). We then extract the K-length sub-sequence corresponding to the occurrence of the motif in each sequence (with K being the size of the PWM). The first module aims to learn a new PWM that could discriminate these two sets of K-length sequences. First, each sequence s is one-hot encoded in a \(K\times 4\) matrix s. Then a logistic model with \(K\times 4\) parameters is learned to discriminate the two sequence classes:

$$\begin P(1|s) = S\left( \sum \limits _^K \sum \limits _^4 a_ \cdot \textbf_\right) , \end$$

(2)

where P(1|s) is the probability that sub-sequence s belong to the first class, S is the sigmoid function, \(\textbf_\) is the entry of the one-hot matrix s indicating whether the kth nucleotide of sequence s corresponds to the jth nucleotide of \(\\) or not, and \(a_\) is the regression coefficients of the model.

Once this model has been learned, it can be used to predict if a sequence belongs to the first or second class. The sigmoid function being monotonically increasing, this can be done easily by computing the linear function inside the parenthesis of Expression (2) and using the result as a score reflecting the likelihood of class 1. Interestingly, this score function has exactly the same form as that used to compute a score with a PWM. Consequently, the logistic model of Expression (2) is strictly speaking a regular PWM with parameters \(a_\). The interest of learning a PWM in this way is twofold. First, we take advantage of all the algorithmic and theory developed for logistic regression. Most notably, as the likelihood function of a logistic model is convex, we are guaranteed that the learned model is optimal, which means that the inferred discriminative PWM is the best PWM for our problem. This is a major difference from approaches previously proposed to learn a discriminative PWM, such as DAMO [46] or STREME [6]. The reason for this is that these approaches do not exactly address the same problem as ours: they do not search for a PWM that discriminates two sets of sequences that are perfectly aligned and of the same length as the PWM. Instead, they take as input two sets of sequences that are usually much longer than the PWM, and their goal is to identify a motif whose presence can be used to discriminate the two sets, i.e., a problem that is known to be NP-hard [37]. Hence, these approaches rely on heuristics and do not warrant returning the best PWM for our problem (see the “Discussion” section for more details on these differences). The second advantage of learning a PWM via a logistic regression approach is that a LASSO penalty may be included in the optimization procedure in order to obtain a model with fewer variables [55] (see the “Material and methods” section). In practice, this means that many \(a_\) parameters are set at zero and hence that the resulting PWM is simpler and easier to interpret.

Note that, like DAMO [46], the PWMs output by our method are not obtained from position probability matrices (PPMs), i.e., the probabilistic models that are often associated with PWMs. This avoids the constraints attached to PPMs (see section Discussion and the work of Ruan and Stormo [45] for further details), but this also impedes representing PWMs with the classical logo graphics based on information theory [47]. Instead, our PWMs are represented by “mirror-logos” such as that in Fig. 2B (middle). These logos provide the sign of the parameters, which allows us to easily distinguish nucleotides that are more present in sequences of one or the other class.

TFscope-NE: identification of differences in nucleotidic environment

The second TFscope module extracts features related to the nucleotidic environment around the core binding motif. More precisely, this module constructs variables defined by a pair (kmer,region) such that the frequency of the identified k-mer in the identified region is, on average, different in the two classes. We used for this a slight modification of the DExTER method initially proposed to identify pairs of (kmer,region) whose values are correlated with an expression signal. The DExTER optimization function was modified to return variables correlated with classes rather than with expression signals (see the “Material and methods” section). The TFscope-NE module explores short k-mers up to length 4. To prevent this module from capturing information related to the core-motif, this motif is masked before running the TFscope-NE analysis.

TFscope-CF: identification of differences in co-factor combinations

The third TFscope module extracts features related to co-factors. This module constructs variables defined by a pair (PWM,region) such that the score of the PWM in the identified region is, on average, different between the two classes. For example, we may observe that sequences of the first class often have a potential binding site for a specific TF in region [-250,0] upstream of the binding site of the target TF, while the sequences of the second class do not have these potential binding sites. Hence, the goal of this module is to identify, for each PWM of the library, a specific region of sequences in which the scores of this PWM are higher in one class than in the other one.

Sequences are first segmented in bins of the same size. We used 13 bins in the following experiments. The number of bins impacts the precision of the approach but also the computing time for the analysis. For each PWM, TFscope scans all sequences with FIMO [23], and the best score achieved on each bin of each sequence is stored. Then, TFscope searches the region of consecutive bins for which the PWM gets the most different scores depending on the class of the sequences. A lattice structure is used for this exploration (see Fig. 1 and details in the ”Material and methods” section). For each PWM of the library, TFscope-CF selects the region that shows the greatest differences and returns a variable corresponding to this PWM and region. As for TFscope-NE, the core-motif is masked before running the analysis.

Analysis of the cellular specificities of 272 ChIP-seq pairs

We first sought to apply TFscope to identify binding site differences of TFs in different cell types using a selection of 272 pairs of ChIP-seq experiments downloaded from the GTRD database [31]. Data were filtered using the UniBind p-value score [21] to minimize the effects linked to technical issues or indirect binding. In UniBind, the authors studied the distance between the ChIP-seq peaks and the position of the most likely binding site (inferred with the PFM associated with the studied TF). They showed that this binding site is sometimes far from the ChIP-seq peak and that the peak could be a false positive. A dedicated method named ChIP-eat determines genomic boundaries inside which the binding sites are likely true positives and provides a p-value measuring peak enrichment in these boundaries. We used this p-value to remove ChIP-seq experiments that could be affected by technical issues and indirect binding. Moreover, for this analysis, we only selected pairs of experiments that showed strong binding site differences according to the Jaccard’s distance (see the “Material and methods” section). The 272 pairs were chosen to provide a wide view of the ChIP-seq data in GTRD, i.e., pairs that were too close to another already selected pair were discarded (see the pair selection procedure in the “Material and methods” section). These 272 pairs involve a total of 86 different TFs and 168 cell types (see Additional file 1: Fig. S1A for statistics about the number of pairs targeting each TF). The distribution of Jaccard indexes of all pairs of ChIP-seq with Unibind p-value \(<10^\) is shown in Additional file 1: Fig. S1B. For comparison, the figure also reports the Jaccard index of ChIP-seq pairs targeting the same TF in the same cell type, but originating from different studies. As we can see, ChIP-seq pairs from different cell types are usually much more different than in experiments with the same cell type. Hence, in these experiments, many of the differences observed between cell types are likely due to the cellular specificity rather than the technical artifacts. Moreover, a low Jaccard index also means that the number of common peaks is small in proportion, hence comparing unique peaks makes sense for these analyses. The Jaccard index distribution of the 272 selected ChIP-seq pairs is shown in Additional file 1: Fig. S1C. Among the peaks of these experiments, a very low number (\(<2\%\)) do not have the target motif (Additional file 1: Fig. S1D) and were discarded.

TFscope learns both discriminative and informative core motifs

We first assessed the TFscope ability to identify core motif differences in the ChIP-seq experiment pairs. In this analysis, we only used the score function of the learned PWM (Expression (2)) to discriminate the two cell types. For comparison, we also used the score of the original PWM on this problem. Accuracies were measured by AUROC on an independent set of sequences (see Fig. 2A). If several versions of the original PWM were available, we used the version that provides the best AUROC. As we can see, the new PWM outperforms the original PWM most of the time. Moreover, we can also observe that for some of the 272 experiment pairs, the core motif is sufficient to differentiate the two cell types with high accuracy (AUROC \(>75\%\) for 49 experiments). As already discussed, the discriminative PWM differs from the original PWM as it specifically models the differences while removing features common to the two classes. The “mirror-logo” representation summarizes these differences and shows which features are associated with which cell type. To help interpret this logo, TFscope also outputs the PPMs built from the sequences associated with the two cell types. We used for this the K-length sub-sequences that were used by TFscope-DM to learn the discriminative PWM and independently estimated two PPMs from these two sequence sets simply by counting the frequency of each nucleotide at each position. For example, the first three logos on Fig. 2B show the mirror logo of the discriminative PWM learned by TFscope for discriminating CEBPA binding sites between the SKH1 and U937 cell types, in between the two PPMs associated with these cell types. Note here that the canonical CEBPA motif is more often associated with ChIP-seq peaks collected in U937 than in SKH1, although both cell types show very similar motifs. The mirror logo indicates for example that the T nucleotides at positions 3 and 4 are more often missing in the SKH1 sequences than in the U937 sequences.

Fig. 2figure 2

TFscope learns discriminative and informative motifs. A AUROCs achieved by the TFscope PWMs vs. original PWMs on the 272 experiments. B The first three logos represent the PWM learned by TFscope for discriminating CEBPA binding between U937 and SKH1 cell types in between the two PPMs estimated on these two cell types. The bottom logo represents the discriminative PWM learned by DAMO on the same training set. C AUROCs achieved by the TFscope PWMs vs. DAMO PWMs on the 272 experiments. D Gini score of the PWMs learned by TFscope and DAMO. The higher the Gini score, the simpler the model

As an alternative approach, we compared the accuracy of the discriminative PWM to that of the PPMs built directly from the sequences associated with the two cell types. For this, the two PPMs were transformed in PWMs and combined in a logistic model trained to determine if a sequence belongs to the first or the second cell types on the basis of a linear combination of the two PWMs (see the “Material and methods” section). As shown in Additional file 1: Fig. S2A, the PWM learned by TFscope outperforms this two-PWMs approach most of the time and especially at AUC\(>70\%\), illustrating that discriminative PWMs capture sequence features that are not captured when estimating PWMs independently on each cell type. We also used the same approach using PWMs learned with Homer [24] instead of the PPMs independently estimated on the two cell types, and observed that the TFscope PWM also outperforms the two-Homer-PWM approach on these experiments (Additional file 1: Fig. S2B and the “Material and methods” section for details).

We next sought to compare these results to those obtained with another method designed to optimize PWMs using a discriminative approach. We used the DAMO approach for this comparison, as it is one of the rare methods that do not rely on PPM to learn a PWM. Recall that DAMO, like other classical approaches to learn PWMs, was not designed to exactly address the same problem as ours. Indeed, DAMO usually takes as input sequences that are not aligned and that are much longer than the target PWM. Nevertheless, it could also be used on our simpler problem. However, as illustrated in Fig. 2C, it does not achieve the same accuracy as TFscope on this problem, which was somewhat expected as the logistic classifier used by TFscope theoretically returns the most discriminative PWM.

Another striking fact that emerged when we compared the discriminative motifs learned by DAMO to those of TFscope was that the DAMO motifs appear much more complex, with many positions without clear preferences (bottom logo on Fig. 2B). On the contrary, thanks to the LASSO penalty used for learning, the TFscope motif is easier to interpret, with many positions set at zero. This aspect was assessed systematically on the 272 experiments using a score function based on the Gini coefficient for measuring the motif simplicity (see the “Material and methods” section). As illustrated in Fig. 2D, TFscope motifs have higher Gini coefficient, and are thus simpler and easier to interpret than their DAMO counterpart.

Finally, we observed that increasing the size of the PWM on both sides slightly improves the AUROC of the model, especially until 4 nucleotides (see Additional file 1: Fig. S2C). After 4 nucleotides, further increasing the size of the flanking regions still slightly increases the AUROC, but the gain is likely due to PWM capturing part of the large nucleotidic environment, something specifically captured by the TFscope-NE module. Hence, to facilitate the model interpretation, in the following we only used the discriminative PWM with 4 flanking nucleotides on both sides (denoted as DM+8).

TFscope provides meaningful information about co-factors

We next sought to investigate the information gained by the position of the binding sites of potential co-factors for cell type prediction. For this, we used a simplification of the model of Expression (1) which only uses the core motif and the co-factor variables for the prediction—i.e., the \(\text _i\) variables capturing the nucleotidic environment were removed from the model. The accuracy of this model was compared to that of a similar model that also uses the score of potential co-factors, but without integrating the position information. This model, which strongly resembles the TFcoop approach we previously proposed [56], simply uses the best score achieved by the different PWMs in the whole sequence. Hence, the predictive variables of this model are the best scores achieved at any position in the sequence, while in Expression (1) TFscope uses the best score achieved in a specific region identified as the most informative for each co-factor. While the two models have exactly the same number of parameters (i.e., the number of PWMs in the PWM library), the TFscope variables greatly increase the accuracy of the approach (Fig. 3A), illustrating that the position of co-factors relative to the considered TFBS also carry important information. Note that, as we will see hereafter, TFscope provides a graphical representation of all identified co-factors, and position information can be easily retrieved.

Finally, we attempted to assess the relevance of the co-factors identified by TFscope. We thus selected all experiments comparing cell-line pairs involving HepG2, K562, MCF7, GM12878, MCF10A, or IMR90. These cell lines were chosen because they were among the most represented cell lines in the 272 experiments. This involves a total of 19 experiments. RNA-seq data measuring gene expression in the same cell lines were downloaded from ENCODE. Next, for each of the 19 experiments, we extracted the 15 most important variables selected by the model (see the “Material and methods” section) and identified the co-factors present among these 15 variables (this involves a total of 203 co-factors for the 19 experiments). For each co-factor, we then compared its gene expression level (RPKM) in the cell line where it had been identified as associated with the target TF (denoted as cell type \(+\)), and in the other cell line (cell type −). As we can see in Fig. 3B, the gene expression level of the co-factor is usually higher in cell type \(+\) than in cell type − (149 vs. 45, binomial test p-value \(3.e^\)). Similarly, for 79 co-factors, the gene expression level is null in cell type − (RPKM\(<1\)) and non-null in cell type \(+\) (RPKM\(>1\)), while the opposite is true for only 4 co-factors (binomial test p-value \(< 2.e^\)). Hence, very often, the differential presence of binding motifs between the two cell types is corroborated by the difference of expression of the identified co-factors.

Fig. 3figure 3

TFscope provides meaningful information about co-factors. A Position of co-factors helps for predicting the cell-specificity. This plot reports AUROCs achieved by TFscope models using two different scores for co-factors. On the x-axis, the score of a co-factor on a sequence is the best achieved at any position of the sequence, i.e., the co-factor position is not considered in the score computation. On the y-axis, the score of each co-factor is only computed on a specific region identified by the TFscope-CF module to be the most informative for this co-factor. B Gene expression level (RPKM) of co-factors identified by TFscope. Cell type \(+\) (resp. −) is the cell type where sequences have the highest (resp. lowest) scores for the PWM associated with the co-factor

TFscope assesses the relative importance of each sequence feature

We next ran TFscope with the full model of Expression (1) on the 272 pairs of experiments and compared its accuracy (AUROC) to that of different alternatives: the original PWM only, the discriminative PWM only, and three incomplete TFscope models that only use two of the three kinds of genomic information. These incomplete models were obtained by taking the full TFscope model trained with all variables and by setting at zero either the \(\text \) variable (model TFscope w/o core motif information) or the \(\text _i\) variables (TFscope w/o nucleotidic environment information) or the \(\text _j\) variables (TFscope w/o co-factor information). Figure 4A reports the accuracy achieved by all these models. As we can see, the full TFscope model successfully integrates the three kinds of genomic information and outperforms the alternative models. Note also that the accuracy is often good, with a median AUROC above \(80\%\). Moreover, there is a strong link between the accuracy of the approach and the Jaccard distance between the ChIP-seq peaks in the two cell types (Pearson r = 0.51; see Fig. 4B), i.e., experiments with a low proportion of ChIP-seq peaks shared by the two cell types often have good accuracy (remember that these peaks are removed before the analyses). In other words, when the two ChIP-seq experiments are really different, TFscope accurately predicts these differences. For the sake of generality, we also ran TFscope using HOCOMOCO PPMs instead of JASPAR PPMs on a random selection of 10 experiments and observed very similar accuracy (Additional file 1: Fig. S3.A).

Fig. 4figure 4

Accuracy achieved by TFscope for discriminating binding sites of different cell types. A Distribution of AUROCs achieved by TFscope and several alternative models for discriminating binding sites of one TF in two different cell types. B Link between TFscope accuracy and the similarity of ChIP-seq peaks in the two cell types. ChIP-seq experiments that have a high proportion of peaks in common have a low Jaccard distance (Jaccard distance = 1 - Jaccard index). C Distribution of TFscope models according to the most discriminative features: mainly co-factors, discriminative motif \(+\) co-factors, or nucleotidic environment \(+\) co-factors

Fig. 5figure 5

Core motif, nucleotidic environment and co-factors together determine cell specificity. A–C PWM logos, radar plot, and location of the most important variables in the JUND comparison between liver and lung carcinoma. D–F Discriminative PWM, radar plot, and location of the most important variables in the CTCF comparison between B lymphocyte and rhabdomyosarcoma. In the PWM logos (A and D) the discriminative PWM is in between the two PPM logos estimated on the two cell types. Radar plots (B and E) summarize the AUROC achieved by TFscope and several alternative models. Location plots (C and F) provide the identity and location of the most important variables (black: DM; green: co-factors; brown: nucleotidic environment). The numbers on the right indicate the variable ranks, from the most important (rank 1) to the least important. The color of segments indicates the cell type associated with each variable

These good performances legitimate the use of TFscope to investigate the relative importance of each kind of genomic information in the different comparisons. For this, in addition to the logo of the discriminative PWM, TFscope outputs a radar plot that summarizes the accuracy of the different models and alternatives and a location plot that summarizes the position of the most important variables of the model (see the “Material and methods” section). For example, Fig. 5B reports the radar plot obtained when analyzing the binding differences of TF JUND between liver and lung carcinoma. For this experiment, the core motif is clearly the most discriminant information (Fig. 5A), since removing this information lead to the largest drop in AUROC. Besides, peaks detected in lung harbor additional AP-1 motifs around the core motif (Fig. 5C). JunD belongs to the AP-1 family of dimeric TFs, which associate members of the Jun (c-Jun, JunB and JunD) and Fos (c-Fos, FosB, Fra-1/Fosl1 and Fra-2/Fosl2) families. The canonical view is that Jun family members can homodimerize, while Fos family members must, at physiological concentrations, heterodimerize with one of the Jun proteins to bind DNA. The c-Fos proteins homodimerize only when overexpressed [53]. Importantly, Fos:Jun heterodimers have a stronger affinity for DNA than the Jun:Jun homodimers [9]. According to various expression data listed in the EBI Expression Atlas (https://www.ebi.ac.uk/gxa/home), Fos TFs are less expressed in liver than in lung. Thus, the JunD binding preferences observed in liver vs. lung might merely be explained by the expression of Fos TFs: because the probability of forming Fos:Jun heterodimers is greater in lung than in liver, JunD will bind DNA with a higher affinity in lung than in liver. For comparison, the discriminative motif and the radar plot of the CTCF experiment between CD20 and RH4 are shown in Fig. 5D–E. Here, the most discriminative information seems to be the nucleotidic environment. The location plot provides additional information (Fig. 5F). We can see that CD20 favors A/T rich environment in the vicinity of the binding motif (\(\sim +/-100\) bp around the motif), and C/G nucleotides in the larger surrounding region (\(+/-500\) bp). On the contrary, RH4 prefers a nucleotide environment rich in TG and CA dinucleotides. All results obtained on the 272 experiments are available at https://gite.lirmm.fr/rromero/tfscope/-/tree/main/results.

Finally, in an attempt to provide a broad picture of the genomic strategies involved in the control of binding differences between cell types, we ran a K-means clustering on the importance profiles inferred by TFscope. More precisely, all 272 experiments were described by a vector of length 3 obtained by subtracting the AUROC of TFscope w/o DM, TFscope w/o NE, and TFscope w/o CF from that of the full TFscope model. Each experiment was then represented by three values representing the three AUROC losses associated with the three kinds of information. We then ran several K-means clustering with different numbers of classes to study the distribution of the experiments. A plot of cluster variance according to the number of classes k suggests that the most common control-strategies can be broadly classified into 3 or 4 classes (Additional file 1: Fig. S3.B). We thus set \(k=3\) for simplicity, and got the following 3 classes: (1) one class where the co-factors are clearly the most important feature, (2) one class where DM (major) along with the co-factors (minor) are the most important features, and (3) one class where the nucleotidic environment (major) along with the co-factors (minor) are the most important features. Figure 4C reports the distribution of the 272 models in these 3 classes, thus highlighting the fact that the co-factors are by far the most common mechanism involved in the binding differences between cell types. Interestingly, for 32 out of 248 experiments with an AUROC \(>70\%\) (13\(\%\)), the PWM of the target TF is also present among the co-factors, which means that the TF often has additional binding motifs around the core binding site identified as the most likely TFBS of a sequence and that these additional binding motifs are more present in one of the two cell types.

TFscope correctly handles repeat elements

Repeat elements, and notably short tandem repeats, may play an important role in TF binding [1, 12, 25, 33]. These sequences are handled in TFscope via the TFscope-EN module. According to the above experiments, in a majority of cases (\(58\%\)), co-factors are the most influential feature, while the nucleotide environment is the primary predictor in \(26\%\) of experiments. We thus sought to study how TFscope compares to a model that would predict cell sp

留言 (0)

沒有登入
gif