Biomolecules, Vol. 12, Pages 1771: Sequence-Based Prediction of Protein Phase Separation: The Role of Beta-Pairing Propensity

1. Introduction

Liquid-liquid phase separation (LLPS), or demixing, with the coexistence of a diluted state and a dense condensed one, is a classic subject in polymer physics [1]. The essential physics is aptly captured within the Flory-Huggins (FH) approach, a simple lattice theory, where the free energy of mixing per lattice site can be derived using a mean-field assumption [2,3]. The driving force underlying LLPS is the exchange of chain/solvent interactions for chain/chain and solvent/solvent interactions under conditions for which this process is energetically favorable, a balance quantified by the Flory parameter.

Evidence has been mounting in recent years that protein LLPS underlies the formation of membrane-less organelles (MLOs) in living cells [4]. In fact, eukaryotic cells are composed of numerous compartments or organelles that carry out specific functions and provide spatio-temporal control over cellular materials, metabolic processes, and signaling pathways. However, cells also harbor several MLOs that lack a delimiting membrane. These are supra-molecular assemblies composed of proteins, nucleic acids, and other molecular components, that are present in the nucleus as well as in the cytoplasm. On the other hand, bacterial biomolecules were also shown to undergo LLPS [5], as well as viral ones [6].

One feature that has attracted considerable attention is the presence of intrinsically disordered regions (IDRs) in proteins that are able to drive LLPS [7]. These regions display a sequence-intrinsic preference for conformational heterogeneity or disorder under native conditions [8]. The detailed understanding of the biological function of disordered bio-molecular condensates, whose formation is driven by LLPS, is currently the focus of a major effort undertaken by a large community in cell biology [9].

In particular, several key proteins in neuro-degenerative disorders are components of MLOs [10].
The observed conversion of dynamic protein droplets to solid aggregates [11] shows them to be meta-stable or inherently unstable, and shows that specific cellular processes keep them from solidifying. These liquid-to-solid transitions are accelerated by disease mutations [12] that seem to target β-zippers in IDRs [13], which makes them more prone to fold into stable amyloid structures [14].

In fact, it was recently suggested that in the crowded cellular environment the liquid condensed state should be considered as a fundamental state of proteins along with the structured native state and the solid-like amyloid state [15]. The signature of amyloid fibrils consists of pairs of closely mating β-sheets along the fibril axis; their presence is implicated in several degenerative pathologies triggered by aberrant protein mis-folding and subsequent aggregation [16].

On the other hand, there is much debate on the type of interactions that underlie protein LLPS at a molecular level. In fact, a central issue in the field is the ability to predict which proteins can undergo LLPS in physiological conditions in living cells, based on the knowledge of the amino-acid sequence alone, in particular for IDRs [9]. The understanding of the sequence determinants of phase separation in IDRs is still basic, but it is clear that different flavors of IDRs exist that determine the type of stimulus the IDR responds to [17], depending as well on fluctuations in the microenvironment and on the specific context [18]. The sequence also likely determines the emergent properties of its dense phase, i.e., dense-phase concentration [19], and material properties such as visco-elasticity [20].

Multi-valent interactions of heterogeneous modular binding domains and their target motifs can drive LLPS of proteins with a well-defined native structure [21], yet the forces promoting LLPS of IDRs are less understood.
A role has been suggested for several weak non-covalent interactions such as electrostatic, dipole-dipole, pi-pi stacking, cation-pi, hydrophobic and hydrogen bonding (namely β-zipper) interactions [4].

In particular, the importance of charge patterns has been highlighted [22], whereas pi-pi stacking interactions were found to involve non-aromatic as well as aromatic groups in folded globular proteins, so that a phase separation predictive algorithm was built based on pi interaction frequency [23].

The use of one or few of those features to predict protein LLPS is typical of the so-called first-generation phase separation predictors [24]. More recently, other approaches were proposed, in which either the multiple features associated with phase separation are comprehensively incorporated within a unique score [25] or LLPS prediction is based on the large conformational entropy associated with nonspecific side-chain interactions in the dense condensed state [26].

In this work, we wish to test the possible role played by β-strand pairing as one of the many interaction modes driving protein LLPS. In fact, the formation of β-sheets with a high degree of structural order is fundamental in both the structured native state (intra-chain pairing) and the amyloid state (inter-chain pairing), so that a subtle balance between different factors is at play [27].

On the one hand, we may expect a reduced β-pairing propensity for protein sequences driving LLPS. Accordingly, at the intra-chain level a disordered conformational ensemble would be favored for IDRs in the diluted state, whereas at the inter-chain level the droplet condensed state would be promoted over the amyloid state. On the other hand, an increase in β-pairing propensity may help the demixing of IDR sequences.

We evaluated β-pairing propensity by means of the well-established PASTA algorithm [28,29], which allows the prediction of amyloid interactions between co-aggregating proteins [30]. The PASTA energy function evaluates the stability of putative cross-pairings between different sequence stretches and is based on knowledge-based statistical potentials, estimated separately for parallel and anti-parallel interactions.

We used different properties of the sequence stretches involved in the β-pairings with the best PASTA scores to build a scoring function which predicts the ability of a protein chain to drive LLPS. We also built a generalized score, by combining the PASTA properties with PScore, possibly the best performing first-generation predictor, based on pi-pi interactions [23]. Significant improvement over PScore in the performance of the generalized score would signal that β-pairing information may be crucial to better characterize LLPS behavior.

The development of effective LLPS predictors depends strongly on the availability of reliable LLPS datasets, for both positive [31,32,33] and negative [26] sets. The latter may be built based on proteomes from different organisms. Interestingly, our results depend on the choice of the positive set and, to a lesser extent, on the choice of the negative set.

3. Results

In this contribution we consider a number of possible scoring functions where we include different features from the output of the PASTA algorithm [30], originally introduced to predict the propensity to aggregate into amyloid structure, and combine them with PScore, a phase-separation predictor built on the frequency of pi-pi contacts [23]. We then estimate their abilities to classify a protein sequence according to its phase separating behaviour. We have also compared our results with the original PScore.

3.1. Data Sets

We considered two possible choices for both the positive set (the sequences that are known to undergo phase separation) and the negative set (the sequences that do not undergo phase separation) used to train the scoring functions.

In all cases, the protein sequences were taken from already published data sets (see Section 2.1).

3.1.1. Positive Sets

As a first possibility for the positive set we use PP (https://phasepro.elte.hu, accessed on 30 June 2020) [33], which provides manually curated protein regions from a variety of organisms, whose association with liquid–liquid phase separation was experimentally validated in the literature, either “in vitro” or “in vivo”. PP was used as a positive set for the development and training of PScore [23].

As a second possibility for the positive set we use LLPS [26], obtained by merging PP with two other data sets for liquid-liquid phase separation, REV and LPS-D. The REV data set is a subset of PhaSepDB (http://db.phasep.pro/, accessed on 30 December 2020); it includes proteins whose involvement in either “in vivo” or “in vitro” liquid–liquid phase separation can be found in the literature [32]. The LPS-D data set is a subset of LLPSDB (http://bio-comp.org.cn/llpsdb, accessed on 30 December 2020); it collects “droplet-driving” proteins, observed to undergo “in vitro” liquid–liquid phase separation spontaneously as one component, with well-defined experimental conditions and phase diagrams. LLPS thus includes PP as a proper subset.

3.1.2. Negative Sets

Both choices for the negative set were assembled in [26]. As a first possibility we use hsnLLPS, the Swiss-Prot human proteome from which all proteins that appear in any of the liquid–liquid phase separation data sets were removed. As a second possibility we use nsLLPS, a collection of proteins sampled from the proteomes of 9 different organisms in order to reproduce the frequencies with which sequences from different organisms appear in the LLPS dataset, after removal of all proteins that appear in any of the liquid–liquid phase separation data sets.

For LLPS, hsnLLPS, and nsLLPS we had to consider only the entry sequences yielding a PScore, resulting in 442, 16,360, and 3503 entries, respectively (see Section 2.1).

3.2. Scoring Functions

In this study we report the performances in predicting the phase separation behaviour of protein sequences for a total of six scoring functions, which are defined as follows:

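$$ s_{00} = \frac{E}{N} + \tilde{\beta}\,\ln(S+1) + \tilde{\gamma}\,\ln(l_p), \qquad (3) $$

$$ s_{4} = \alpha\,\frac{E}{N} + \beta\,\ln(S+1) + \gamma\,\ln(l_p) + P, \qquad (7) $$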

where E and P are the best PASTA score and the PScore, respectively; N is the number of amino acids in the protein sequence; S is the average register shift over the best five β-pairings and lp is the length of the best β-pairing. All possible pairings between two different, possibly overlapping, stretches from the same sequence are searched for (see Section 2.2 for details on the PASTA algorithm and its different outputs used here; see Section 2.3 for details on the PScore).

The PASTA “energy density” E/N is used with the mean field Flory-Huggins approach [7] in mind. In the PASTA algorithm one specific β-pairing is assumed to form between two sequence stretches in two different chains while all other chain portions remain disordered; the contribution of this specific interaction to the phase separation of the full length chains then needs to be normalised by the chain length.
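As an illustration of how the composite score of Equation (7) combines its inputs, the following minimal Python sketch evaluates s4 from the quantities defined above. The function name, the argument names and all numerical values are hypothetical placeholders chosen for illustration; they are not PASTA or PScore outputs, and the weights are not the optimised ones.

```python
import math


def s4_score(E, N, S, lp, P, alpha, beta, gamma):
    """Evaluate Eq. (7): alpha*E/N + beta*ln(S+1) + gamma*ln(lp) + P.

    E  : best PASTA score (energy) of the sequence
    N  : number of amino acids in the sequence
    S  : average register shift over the best five beta-pairings
    lp : length of the best beta-pairing
    P  : PScore of the sequence
    alpha, beta, gamma : weights (hypothetical values in this sketch)
    """
    if N <= 0 or lp <= 0 or S < 0:
        raise ValueError("N and lp must be positive and S non-negative")
    return alpha * E / N + beta * math.log(S + 1) + gamma * math.log(lp) + P


# Hypothetical inputs, for illustration only.
example = s4_score(E=-5.0, N=300, S=0.0, lp=6, P=2.0,
                   alpha=1.0, beta=1.0, gamma=1.0)
```

With S = 0 (in-register pairing) the register-shift term vanishes, and with all three weights set to zero the function reduces to the PScore alone.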

We use the logarithm of quantities that are in their essence numbers of consecutive residues along the chain, such as lp or S, because in this way they are most naturally connected to entropies. For example, the entropy loss associated with a constraint such as loop closure or anchoring of one or both ends, imposed on otherwise conformationally heterogeneous segments of length m, scales like ln(m).

In particular, the leading contribution to the entropy loss in going from the diluted to the dense phase due to β-pairing is proportional to lp. This is in principle already taken into account within the PASTA energy parametrization [28]. The subleading contribution would be proportional to ln(lp), and can be interpreted as the entropy loss due to anchoring the two sequence portions through either one or both the two end pairs in the pairing, while leaving the remaining pairs still free. By decreasing lp, the entropy of the dense droplet phase is increased with respect to the dilute phase.

We observe that a similar argument can be used to estimate the entropy loss due to the anchoring, to the β-paired sequence stretch, of the two sequence portions flanking it on either side. By assuming lp≪N, and denoting by l the position along the sequence of the paired stretch, we estimate the entropy loss as ln(l) + ln(N−l). This expression is maximum for l=N/2, so that the entropy of the dense droplet phase is higher when the β-paired sequence stretch is closer to one end of the chain than to its center.
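The location of the maximum can be checked with a short calculation, treating l as a continuous variable (an assumption made here only for the purpose of the check):

$$ \Delta S(l) \propto \ln(l) + \ln(N-l) = \ln\big[l\,(N-l)\big], \qquad \frac{d\,\Delta S}{dl} = \frac{1}{l} - \frac{1}{N-l} = 0 \;\Rightarrow\; l = \frac{N}{2}, $$

$$ \frac{d^{2}\Delta S}{dl^{2}} = -\frac{1}{l^{2}} - \frac{1}{(N-l)^{2}} < 0 . $$

The second derivative is negative everywhere, so l = N/2 is the only maximum: the estimated entropy loss is 2 ln(N/2) at the center of the chain and decreases monotonically towards either end.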

Also, the entropy of the dense phase increases in the case of an off-register pairing for two reasons, if we assume that only one β-pairing per chain is formed. First, two chains can pair two different sequence segments (say A and B) in two different ways (say A from chain 1 with B from chain 2 or B from chain 1 with A from chain 2), whereas for in-register pairing there is only one possibility, that is A from chain 1 with A from chain 2. In the case of a multi-chain condensate the combinatorics of the different possible arrangements may easily lead to a high entropy gain. Second, if we assume that in the amyloid phase both possible out-of-register pairings described above are formed, a constraint is then placed on the sequence segment between the two portions A and B, which is S−lp+1 residues long, implying an entropy loss that scales as ln(S+1−lp) ∼ ln(S+1), if lp≪S. The larger S, the higher the entropy difference favoring the condensed droplet state over the amyloid one.

3.3. The LLPS Positive Set

We optimised the parameters α, β and γ in the different scoring functions by maximising the AUC on the training set and then used the optimised parameters to find AUC and MCC on the test sets (see Section 2.4 for details on AUC and MCC). Training was performed using a k-fold cross validation with k=4 such that the full dataset (the union of the positive and the negative set) is split randomly in k equally sized subsets with one of them used in turn as the test set and the union of the other k−1 ones as the training set (see Section 2.5 for details on the cross validation training procedure).

The PScore function was already trained previously [23], so we evaluated its performance on the test set by computing the corresponding AUC and MCC for each realisation of k-fold cross validation, whereas we did not evaluate its performance on the training set.
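The cross-validation scheme described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the procedure of Section 2.5: the score has a single free weight w (score = w·x0 + x1), the optimisation is a plain grid search over hypothetical candidate weights, and the helper names (auc, k_fold_indices, cross_validate) are introduced here for the example.

```python
import random


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count 1/2. Returns 0.5 (uninformative) if either class is empty,
    a convention adopted only for this sketch.
    """
    if not pos_scores or not neg_scores:
        return 0.5
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def k_fold_indices(n_items, k=4, seed=0):
    """Randomly split range(n_items) into k folds of (nearly) equal size."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(features, labels, weights_grid, k=4, seed=0):
    """For each fold, pick the weight maximising the training AUC and
    report (best weight, test AUC) on the held-out fold."""
    folds = k_fold_indices(len(labels), k, seed)

    def subset_auc(idx, w):
        pos = [w * features[j][0] + features[j][1] for j in idx if labels[j] == 1]
        neg = [w * features[j][0] + features[j][1] for j in idx if labels[j] == 0]
        return auc(pos, neg)

    results = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        best_w = max(weights_grid, key=lambda w: subset_auc(train_idx, w))
        results.append((best_w, subset_auc(test_idx, best_w)))
    return results
```

Repeating the call with different seeds gives one AUC value per fold and per realisation of the random split, which is how a distribution of test-set AUC values can be built up.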

In this section we show the results we have obtained by using LLPS as the positive set.

3.3.1. Performance Evaluation: AUC on the Test Set

We begin by presenting in Figure 3 the results obtained with hsnLLPS as the negative set. In Figure 3a we show the normalised distributions of AUC on the training set. While s0 and s00 perform the worst, the performances of s2, s1, s3, s4 get increasingly better.

In Figure 3b we show the normalised distributions of AUC on the test set using the parameters optimised on the training set. The performances of s0 and s00 are worse than that for PScore, whereas s2 performs similarly to PScore and s1, s3, s4 perform increasingly better than PScore. All the trends observed for AUC distributions on training and test sets with hsnLLPS as the negative set remain qualitatively the same if nsLLPS is instead chosen as the negative set (see Figure S1).

To check statistical significance we perform one-way ANOVAs on AUC with the scoring functions as the factors, using different combinations of them. The results of these ANOVAs are summarised in Table 1 for hsnLLPS as the negative set and in Table S1 for nsLLPS as the negative set.

In both cases, we found that the AUC values of the scoring function pairs (P, s2) and (s1, s3) do not differ in a statistically significant way from each other, whereas the AUC values for other combinations of scores, such as (P, s1, s2) or (s1, s3, s4), are instead different in a statistically significant manner. Taken together, this shows that the terms E/N and ln(lp) can be fruitfully added to PScore, in order to improve the performance of the scoring function, whereas the addition of the term ln(S+1) does not provide a statistically significant improvement.
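For reference, the F statistic underlying a one-way ANOVA of this kind can be computed as in the following sketch, where each group would hold the test-set AUC values obtained with one scoring function. The function name is hypothetical and the sketch stops at the F statistic; turning it into a p-value requires the F distribution, for which a library routine such as `scipy.stats.f_oneway` could be used instead.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variability of the values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F means that the spread between the mean AUC values of the different scores is large compared with the spread of AUC values within each score.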

A series of Kolmogorov-Smirnov tests on each of the AUC distributions on the test set for different scores (see File S1) found that the distributions are normal for all scores with both choices for the negative set (p>0.3) except for the score s0 with hsnLLPS as the negative set (p=0.034).

3.3.2. Performance Evaluation: MCC on the Test Set

The values of AUC and MCC on the test set evaluated using the parameters optimised on the training set indicate the performance of each of the scores to classify protein sequences according to their phase separating behaviour. In Table 2, we summarise the mean values of AUC and MCC on the test set for each of the scores, along with the optimised parameters as defined in Equations (2)–(6), with hsnLLPS as the negative set, whereas in Table S2 we show the same values obtained with nsLLPS as the negative set.
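Unlike AUC, MCC depends on the relative sizes of the positive and negative sets, which matters when comparing values obtained against negative sets of very different size. The sketch below illustrates this with the set sizes quoted in Section 3.1.2 (442 positives, 16,360 and 3503 negatives); the true and false positive rates used in the example are hypothetical values chosen for illustration, not results from this study, and the function name is an assumption.

```python
import math


def mcc_from_rates(tpr, fpr, n_pos, n_neg):
    """Matthews correlation coefficient at a given operating point.

    Expected (possibly fractional) confusion-matrix counts are derived
    from the true positive rate, the false positive rate and the set sizes.
    """
    tp = tpr * n_pos
    fn = n_pos - tp
    fp = fpr * n_neg
    tn = n_neg - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


# Same hypothetical operating point, two negative-set sizes.
mcc_large_negative_set = mcc_from_rates(0.5, 0.05, n_pos=442, n_neg=16360)
mcc_small_negative_set = mcc_from_rates(0.5, 0.05, n_pos=442, n_neg=3503)
```

At the same operating point, the larger negative set contributes more false positives in absolute terms, which lowers the precision and hence the MCC.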

Although trends in the comparison between different scoring functions are similar for both negative sets, as we will discuss below, there is a clear quantitative difference between the two choices. Prediction against hsnLLPS is harder than against nsLLPS, resulting in lower AUC values.

For hsnLLPS as the negative set, s4 is identified as the best performing score with AUC ≈0.79, with a statistically significant improvement over PScore (AUC ≈0.75). Performances of the scores s1 and s3 are similar to each other and better than PScore, as also pointed out by the one-way ANOVA. Similar trends are obtained with nsLLPS as the negative set, with AUC ≈0.84 for PScore, improved to AUC ≈0.88 with score s4.

Using hsnLLPS as the negative set, a weak improvement over the PScore performance in MCC brought about by the addition of the PASTA terms is apparent only for score s4, whereas scores s1 and s3 improve AUC but not MCC with respect to PScore.

This can be rationalized by looking at the ROC curves shown in Figure 4a for each of the considered scoring functions. ROC curves are derived on the entire data set (LLPS as the positive set and hsnLLPS as the negative set), using the optimised parameters, which are summarised in Table 2. Composite scores obtained by combining PScore with PASTA terms improve AUC in the high sensitivity, low specificity portion of the ROC curve, whereas MCC is typically related to its low sensitivity, high specificity portion.

Similar trends are observed when nsLLPS is used as the negative set, with MCC =0.56 for score s4 improving with respect to MCC =0.51 for PScore (see Table S2 and Figure 4c). The values of MCC are much higher for nsLLPS because of its much smaller size with respect to hsnLLPS (see Section 2.4).

3.3.3. Optimised Weights

Finally, we have also studied the distribution of the weights used in the definition of the scoring functions. These parameters are optimised in the training set to produce the maximum AUC. Normalised distributions of the optimised parameters are shown in the first row of Figure S2 for hsnLLPS as the negative set. From Kolmogorov-Smirnov tests (see File S1) we found that the obtained data follow a normal distribution in most cases (p>0.1), with α for scores s1 and s3 providing borderline cases (p≈0.02), and the only clear exception of β̃ in score s0 (p≈3·10^−14).

We note that the distributions of the α parameter for scores s1 and s3 are essentially the same (see Figure S2a) and that the support of the distributions for the β parameter contains or is close to the 0 value (Figure S2b).
Both facts confirm that the contribution of the ln(S+1) term is not significant.

Similar trends are observed if nsLLPS is used as the negative set (see the third row of Figure S2), although in this case most distributions turn out to be not normal according to the Kolmogorov-Smirnov test (see File S1: the only parameters with a clearly normal distribution, p>0.05, are β in score s3 and α and γ in score s4). Despite this, the values and distributions of γ̃, α and γ obtained for the two considered negative sets are roughly consistent with each other, with the α parameter being scaled down by a factor of roughly 2 when considering nsLLPS in place of hsnLLPS as the negative set.

3.4. The PP Positive Set

In this section we show the results we have obtained by using PP as the positive set. The PScore function was already trained with PP as the positive set, and with a data set from the human proteome as the negative set [23], presumably similar to hsnLLPS.

3.4.1. Performance Evaluation: AUC on the Test Set

We begin by presenting in Figure S3 the results obtained with hsnLLPS as the negative set. In Figure S3a we show the normalised distributions of AUC on the training set. While s0 and s00 perform the worst, and similarly to each other, the performances of s1, s2, s3, s4 get increasingly better, although s3 and s4 show a quite similar distribution. In Figure S3b we show the normalised distributions of AUC on the test set using the parameters optimised on the training set.
The performances of s0 and s00 are similar to each other and worse than that for PScore, whereas s1 appears to perform slightly better than PScore and s2, s3, s4 perform similarly to each other and clearly better than PScore.

All the trends observed for AUC distributions on training and test sets with hsnLLPS as the negative set remain qualitatively the same if nsLLPS is instead chosen as the negative set (see Figure S4).

To check statistical significance we perform one-way ANOVAs on AUC with the scoring functions as the factors, using different combinations of them. The results of these ANOVAs are summarised in Table 3 and Table S3 for both possible choices of the negative set.

In both cases, we found that the AUC values of the scoring function pair (s0, s00) and of the ones within the set (s2, s3, s4) do not differ in a statistically significant way from each other. On the other hand, the AUC values for other combinations of scores, such as (P, s1) or (s1, s3, s4), are instead different in a statistically significant manner. Taken together, this shows that the terms E/N and ln(S+1) can, each in its turn, be fruitfully added to PScore, in order to improve the performance of the scoring function. The resulting improvement, however, is much larger for the ln(S+1) term, so that the further addition of the E/N term (i.e., going from score s2 to score s3) does not provide a statistically significant improvement. On the other hand, the addition of the term ln(lp) does not provide a statistically significant improvement under any condition.

A series of Kolmogorov-Smirnov tests on each of the AUC distributions on the test set for different scores (see File S1) found that the distributions are normal for all scores with both choices of the negative set (p>0.1).

3.4.2. Performance Evaluation: MCC on the Test Set

In Table 4, we summarise the mean values of AUC and MCC on the test set for each of the scores, along with the optimised parameters as defined in Equations (1)–(5), with hsnLLPS as the negative set, whereas in Table S4 we show the same values obtained with nsLLPS as the negative set.

Although trends in the comparison between different scoring functions are basically similar for both negative sets, as we will discuss below, there is a clear quantitative difference between the two choices. Prediction against hsnLLPS is harder than against nsLLPS, resulting in lower AUC values.

For hsnLLPS as the negative set, any of the scores s2,s3,s4 can be identified as the best performing one, with AUC ≈0.83, with a statistically significant improvement over PScore (AUC ≈0.78). The performance of the score s1 (AUC ≈0.80) is also better than PScore, as also pointed out by the one-way ANOVA. Similar trends are obtained with nsLLPS as the negative set (see Table S4), with AUC ≈0.87 for PScore, improved to AUC ≈0.92 with scores s3,s4.

Using hsnLLPS as the negative set, a weak improvement over the PScore performance in MCC brought about by the addition of the PASTA terms is apparent only for score s4 (0.26 vs. 0.25), whereas scores s2 and s3 have the same AUC as score s4, but do not improve MCC with respect to PScore. Score s1 has a higher AUC, yet the same MCC as PScore.

This can be rationalized by looking at the ROC curves shown in Figure 4b for each of the considered scoring functions. ROC curves are derived on the entire data set (PP as the positive set and hsnLLPS as the negative set), using the optimised parameters, which are summarised in Table 4. Composite scores obtained by combining PScore with PASTA terms improve AUC in the high sensitivity, low specificity portion of the ROC curve, whereas MCC is typically related to its low sensitivity, high specificity portion.

Similar trends are observed when nsLLPS is used as the negative set, with MCC =0.50 for score s4 improving with respect to MCC =0.46 for PScore (see Table 4 and Figure 4d). The values of MCC are much higher for nsLLPS because of its much smaller size with respect to hsnLLPS (see Section 2.4).

3.4.3. Optimised Weights

Finally, we have also studied the distribution of the weights used in the definition of the composite scoring functions. These parameters are optimised in the training set to produce the maximum AUC. Normalised distributions of the optimised parameters are shown in the second row of Figure S2 for hsnLLPS as the negative set. From Kolmogorov-Smirnov tests (see File S1) we found that the obtained data follow a clearly normal distribution only for all parameters (α, β, γ) in score s4 (p>0.1), whereas all other cases are borderline (0.01<p<0.05), with the exception of β̃ for score s0 which is clearly not normal (p≈10^−16).

We observe that the distributions of the β parameter for scores s2, s3, s4 are very similar to each other (Figure S2e) and that the support of the distribution for the γ parameter contains the 0 value (Figure S2f).
These facts confirm that the contribution of the ln(lp) term is not significant and the contribution of the ln(S+1) term is the dominant one.

Similar trends are observed if nsLLPS is used as the negative set (see the fourth row of Figure S2 and Table S4), with the exception of the distributions of the β parameter, which become bimodal ones. With nsLLPS as the negative set, most distributions turn out to be clearly not normal (p<2·10^−3) according to the Kolmogorov-Smirnov test (see File S1), with the exception of α in scores s1, s3, s4, and γ in score s4 (p>0.05).

Overall, the values and distributions of all parameters obtained for the two considered negative sets are roughly consistent with each other. Interestingly, the α parameter is scaled down by a factor of roughly 2 when considering nsLLPS in place of hsnLLPS as the negative set, with PP as the positive set as well.

3.5. Sequences Classified Differently by s4 and PScore

In order to better characterize why the addition of the PASTA-related terms to the PScore improves the classification of phase separating proteins, we selected in LLPS the 44 protein sequences (set S̃4) that are correctly classified as positives by the score s4 but not by PScore, against the negative set hsnLLPS, when a false positive rate FPR=0.3 is taken as the threshold. We chose FPR=0.3 because it is the part of the ROC curve where the improvement brought about by s4 over PScore is the clearest, with 319 true positives detected by score s4 (set S4) against 296 detected by PScore (set SP). Conversely, we also selected in LLPS the 21 sequences (set S̃P) that are correctly classified by PScore but not by s4, against the negative set hsnLLPS, at FPR=0.3. For clarity, we restrict this analysis to the larger positive set and the larger negative set.
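The selection of the two differential sets can be sketched as follows. This is a minimal illustration under stated assumptions: a higher score is taken to mean a stronger prediction of phase separation, a sequence is called positive when its score exceeds a threshold set on the negative-set scores so that the requested FPR is not exceeded, and all function and variable names are hypothetical.

```python
def threshold_at_fpr(neg_scores, fpr):
    """Threshold t such that at most a fraction fpr of negatives score > t."""
    ordered = sorted(neg_scores, reverse=True)
    # Number of negatives allowed above the threshold (epsilon guards
    # against floating-point round-off in fpr * len).
    n_allowed = int(fpr * len(ordered) + 1e-9)
    if n_allowed >= len(ordered):
        return float("-inf")
    return ordered[n_allowed]


def true_positive_ids(pos_scores, neg_scores, fpr):
    """Identifiers of positives scoring above the FPR-matched threshold.

    pos_scores maps a sequence identifier to its score.
    """
    t = threshold_at_fpr(neg_scores, fpr)
    return {pid for pid, s in pos_scores.items() if s > t}


def differential_sets(pos_a, neg_a, pos_b, neg_b, fpr=0.3):
    """Positives detected only by score A and only by score B, each score
    being thresholded on its own negative-set distribution."""
    set_a = true_positive_ids(pos_a, neg_a, fpr)
    set_b = true_positive_ids(pos_b, neg_b, fpr)
    return set_a - set_b, set_b - set_a
```

With score A playing the role of s4 and score B that of PScore, the two returned sets correspond to S̃4 and S̃P; the counts quoted above are mutually consistent, since 319 − 44 = 296 − 21 = 275 sequences are detected by both scores.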

The different features in the sequences in the two sets should encapsulate the biophysical properties captured by the PASTA-related terms. Analyzing them we observe some common trends. Low-complexity domains are present to some extent in both sets (a more detailed study of the intrinsic disorder content is presented in the next Section 3.6). Yet we observe that FUS and the other low-complexity domains studied in [35], or proteins enriched in proline/glycine residues that self-organize into elastomeric assemblies [36], are absent from both sets. In fact, the sequences mentioned above are classified correctly already by PScore alone (and remain correctly classified by s4). Moreover, the energy parameters used in PASTA typically favor the β-pairing of hydrophobic residues, making the algorithm less suitable to investigate β-pairing in low-complexity prion-like domains, for which “ad hoc” predictors are typically developed [37]. The high propensity for phase separation of low-complexity domains is at any rate well captured by PScore, and maintained upon addition of the PASTA-related terms.

We instead observe in set S̃4 the presence of generally short (4-8 residues) hydrophobic stretches, the ones picked up by the PASTA algorithm for the best β-pairing (always a parallel in-register pairing), enriched in V, I, L, F residues. In set S̃P, the stretches selected by PASTA become longer than in set S̃4. This is not surprising since the score s4, optimised for LLPS as the positive set against the negative set hsnLLPS, penalizes the increase in lp, the length of the stretch. For 3 sequences in set S̃P, the best β-pairing takes place with an in-register anti-parallel arrangement, whereas all other pairings are in-register parallel as in set S̃4. Interestingly, most of the stretches are flanked by several charged/polar residues in both sets.
We provide a list of the sequence stretches selected by PASTA along with the corresponding sequences in the File S2 for both sets S̃4 and S̃P.

A non-trivial difference between the two sets can be detected if we consider the position within the chain of the sequence stretch involved in the best β-pairing predicted by PASTA. Interestingly, as discussed in Section 3.2, this feature may affect the entropy of the droplet state. If m is the position of the initial residue in the stretch, n the position of its final residue, and N the overall length of the sequence, we can compute the fractional position f of the stretch along the sequence as

$$ f = \min\left(\frac{m+n}{2N},\; 1 - \frac{m+n}{2N}\right). \qquad (8) $$

With this definition, 0<f≤1/2: the closer the stretch is located to the center of the sequence, the higher the value of f. If we compute the average fractional position f of the sequence stretches selected by PASTA for the best β-pairing, we obtain f=0.232±0.020 for set S̃4 and f=0.158±0.020 for set S̃P. The fractional positions of the stretches in the two sets differ in a statistically significant way with a Z-score Z=3.7, with the stretches in set S̃P located closer to the chain ends.
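The fractional position of Equation (8) is straightforward to compute; the sketch below is a direct transcription, with a hypothetical function name and 1-based residue positions assumed for m and n.

```python
def fractional_position(m, n, N):
    """Fractional position f of a stretch spanning residues m..n (1-based)
    in a sequence of length N, following Eq. (8)."""
    if not (1 <= m <= n <= N):
        raise ValueError("expected 1 <= m <= n <= N")
    midpoint = (m + n) / (2 * N)
    # Distance of the stretch midpoint from the nearest chain end,
    # in units of the chain length.
    return min(midpoint, 1 - midpoint)
```

For example, a stretch spanning residues 48 to 52 of a 100-residue chain gives f = 0.5, whereas stretches spanning residues 1 to 5 or 96 to 100 give small values of f, reflecting their proximity to a chain end.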

As discussed in Section 3.2, the sequences in set S̃P would then increase the entropy of the droplet state with respect to the sequences in set S̃4. Not capturing this effect within the s4 scoring function might explain its failure in classifying correctly the sequences in set S̃P, providing a clue for further improvement of LLPS prediction.

3.6. Intrinsic Disorder Prediction for Different Sequence Sets

In order to gain a further insight into the biophysical interpretation of our results, we computed the intrinsic disorder scores of different sequence sets using the MobiDB-lite consensus predictor, which combines a set of eight complementary intrinsic disorder predictors [38]. The results for the total disorder fraction (the fraction of residues classified as disordered) and the mean length of the predicted disordered segments are summarized in Table 5.

The two negative sets differ with respect to their intrinsic disorder content, with hsnLLPS being more disordered than nsLLPS. This is expected, since the human proteome is known to be characterized by a higher fraction of intrinsic disorder with respect to the proteomes of less complex organisms [
