PrePPI: A structure informed proteome-wide database of protein-protein interactions

The identification of proteins that interact with one another is a challenging problem of central importance in fundamental biology and in medicine. The term protein-protein interaction (PPI) is widely used and has multiple meanings. Two proteins can interact directly, either by forming a binary physical complex or by being in physical contact within a multi-protein complex. Indirect interactions can include two proteins that are part of a complex but are not in physical contact, or that are part of a pathway or network that mediates their interaction. Multiple experimental and computational tools are available to detect or predict PPIs, and their results are compiled in multiple databases. Here we report a new version of our Predicting Protein-Protein Interactions (PrePPI) database [1, 2], describe its unique features, and compare its performance to that of other databases. We also place PrePPI’s prediction algorithm in the context of recent structure-based, co-evolution, and deep learning-based developments in the prediction of PPIs.

The key element of the PrePPI algorithm, which is summarized in Figure 1, is proteome-wide template-based modeling of PPIs, both direct and indirect. Not accounting for splice variants and posttranslational modifications, there are ∼200 million possible pairwise combinations of human proteins. However, since we consider full proteins as well as their individual domains, we need to examine ∼4.55 billion pairwise interactions and, since we make multiple interaction models for each pair, the number of pairwise combinations evaluated is in the tens of billions (see Methods). PrePPI’s ability to consider such a large number of potential PPIs is enabled by an efficient scoring function based on the similarity of the modeled interface to the interface of a known complex in the Protein Data Bank (PDB) [3]. We highlight these points because it is important to distinguish our goals from standard template-based modeling. Further, we are not necessarily trying to produce an accurate model of the complex as might be judged, for example, in the CAPRI (Critical Assessment of PRediction of Interactions) experiment [4], although a better model will obviously produce a more reliable prediction. Rather, our hypothesis is that, in the derivation of a structural modeling (SM) score, our models are good enough to provide a clue that two proteins form a physical complex. Thus, a model that would score poorly according to CAPRI metrics might be reliable enough to provide a yes or no prediction as to whether two proteins interact and, in addition, produce a low-resolution structural pose for the interaction. As discussed below, PrePPI uses non-structural information as well. For example, if two proteins are co-expressed and have a good SM score, the likelihood of an interaction, as given in PrePPI by a naïve Bayesian network, will increase. A PPI with a low SM score but a high non-structural score suggests that the interaction is indirect.
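To make the evidence-combination step concrete, the following is a minimal sketch of how a naïve Bayesian scheme can merge independent lines of evidence into a single likelihood of interaction. It is not PrePPI's implementation: the function names, the example likelihood ratios, and the prior odds are all hypothetical values chosen for illustration. The only assumption taken from the naïve Bayes framework itself is that evidence sources are treated as conditionally independent, so their likelihood ratios multiply.

```python
from math import prod


def combined_likelihood_ratio(likelihood_ratios):
    """Multiply per-evidence likelihood ratios.

    Under the naive Bayes assumption of conditional independence,
    the joint likelihood ratio is the product of the individual ones.
    """
    return prod(likelihood_ratios)


def posterior_probability(likelihood_ratios, prior_odds):
    """Convert prior odds plus evidence into a posterior probability."""
    posterior_odds = prior_odds * combined_likelihood_ratio(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical likelihood ratios for one protein pair (illustrative only).
sm_lr = 40.0            # structural modeling evidence
coexpression_lr = 5.0   # non-structural evidence (co-expression)

# Hypothetical prior odds that a random protein pair interacts.
prior_odds = 1.0 / 1000.0

structure_only = posterior_probability([sm_lr], prior_odds)
with_coexpression = posterior_probability([sm_lr, coexpression_lr], prior_odds)
print(f"SM only: {structure_only:.4f}")
print(f"SM + co-expression: {with_coexpression:.4f}")
```

In this toy setting, adding supportive co-expression evidence raises the posterior, mirroring the behavior described in the text. A pair with a small structural likelihood ratio but large non-structural ratios would still reach a high posterior, which is the pattern the text associates with indirect interactions.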

Testing and validating computational predictions is a complicated challenge since experimental databases themselves contain sources of uncertainty, and the degree of overlap between them is still quite low in spite of the proliferation of observations from high-throughput screens. Moreover, they are often based on different definitions of PPIs. Mass spectrometry-derived databases (e.g. BioPlex 3.0 [5]) focus explicitly on multi-protein complexes [6], while Y2H-based databases (e.g. HuRI [7]) focus on binary interactions. Among derived databases, the widely used STRING database [8] has a category for physical interactions but does not distinguish binary interactions from those in multi-protein complexes, whereas databases such as APID [9] and HINT [10] include both direct and indirect interactions. As depicted in Table 1, overlap between these various databases is limited (see Methods for a description of each database). Of note, Interactome3D, which contains PDB structures and high-quality homology models, is well represented in most of the databases, but the HINT high-quality literature-curated database (HINT HQ-LC) contains the highest percentage of Interactome3D structures.
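As an illustration of how overlap between PPI databases can be quantified, the sketch below computes the fraction of interactions in one set that also appear in another. It is an assumption-laden example rather than the procedure used for Table 1: the function names and the toy protein identifiers are hypothetical, and interactions are treated as unordered pairs so that (A, B) and (B, A) count as the same PPI.

```python
def normalize(pairs):
    """Represent each interaction as an unordered pair, dropping self-pairs."""
    return {frozenset((a, b)) for a, b in pairs if a != b}


def overlap_fraction(db_a, db_b):
    """Fraction of interactions in db_a that are also present in db_b."""
    a, b = normalize(db_a), normalize(db_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Toy data with placeholder identifiers (not real database contents).
toy_db_1 = [("P1", "P2"), ("P3", "P4"), ("P5", "P6"), ("P7", "P8")]
toy_db_2 = [("P2", "P1"), ("P9", "P10")]

print(overlap_fraction(toy_db_1, toy_db_2))  # share of toy_db_1 found in toy_db_2
print(overlap_fraction(toy_db_2, toy_db_1))  # share of toy_db_2 found in toy_db_1
```

The measure is asymmetric by design, since databases differ greatly in size: a small, curated set can be almost fully contained in a large one while covering only a small share of it.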

In earlier versions of PrePPI [1, 2], training was done on yeast PPIs and testing was done on human interactions, with the true positive dataset comprising PPIs with at least two literature references. No attempt was made at the time to train on datasets of binary physical interactions since PrePPI predicts both direct and indirect interactions. Here we have taken a more refined approach, training the structural modeling component of PrePPI on HINT HQ-LC [10].

To evaluate PrePPI’s structure-based algorithm, we have used Escherichia coli K-12 (here E. coli) as a test organism and compared predictions from PrePPI’s structural modeling component to predictions from the threading component of Threpp [11]. Technology closely related to Threpp powers the PEPPI server [12] which, like PrePPI, uses Bayesian statistics to integrate structural and non-structural information. In contrast to PrePPI, however, the PEPPI webserver allows a user to input only two protein sequences at a time. As described below, the PrePPI database of human PPIs contains about 200 million entries, and the highest-confidence predictions (∼1.3M) appear in the online application, which can be queried in multiple ways, for example by inputting a single protein and retrieving all predicted binding partners.

Compared to previous versions of PrePPI, in addition to improved training, features of the current version include the replacement of homology models with models from the AlphaFold Protein Structure Database [13], leading to increased structural coverage of the proteome; separate training of the structural modeling and non-structural components; a refined definition of PDB template complexes [3]; the implementation of a more accurate algorithm, PredUs 2.0, for predicting interfacial residues [14]; and a website with expanded functionality. We believe that PrePPI is a unique resource that generates novel hypotheses for the existence of PPIs, both direct and indirect. Moreover, given the ongoing developments in the use of deep learning-based approaches to predict the structure of binary complexes, PrePPI predictions can be used as a starting point for the construction of accurate structural models.
