Missense3D-PPI: a web resource to predict the impact of missense variants at protein interfaces using 3D structural data

Residues on the protein surface that are not directly involved in function are generally more tolerant to amino acid substitutions than residues in the buried core of a protein [1]. However, residues involved in protein-protein interaction (PPI), also known as interface residues, are an exception to this principle. As previously demonstrated by our group and others [1, 2], interface residues are enriched in disease-causing amino acid substitutions. The damaging effect of variants affecting protein interaction sites is difficult to predict, and the majority of in silico variant prediction methods perform worse on variants located at the interface than on those located on the remaining protein surface or in the buried protein interior [3].

Genetic variants that disrupt protein interfaces are an important contributor to human disease [4, 5]. These variants generally preserve the folding and stability of the monomeric protein but may impair its function by affecting the many biological processes that rely on protein interaction, such as trafficking and signalling. Identifying and predicting the effect of a variant on PPI requires knowledge of the residues forming a protein interface. In recent years there has been an increase in the availability of three-dimensional (3D) structures of protein complexes, both experimentally solved and obtained from protein docking and homology modelling [6]. These 3D coordinates are publicly available from databases such as PDB [7], Interactome3D [8], GWYRE [9] and PrePPI [10]. Although at present the coverage of the protein interactome remains limited, we can expect an exponential increase in 3D coordinates of PPI complexes in the coming years as a result of the recent breakthrough in protein modelling achieved by AlphaFold and similar deep learning approaches [11, 12].

The availability of 3D coordinates allows us not only to predict the damaging effect of a variant, but also to understand the molecular mechanisms by which it affects protein structure and function. In 2019, we launched Missense3D [13], which predicts the effect of a variant on the folding and stability of a monomeric protein. However, Missense3D, like other algorithms such as HOPE [14] and SAAP [15], uses the 3D coordinates of a single protein chain and may therefore fail to identify the detrimental effect of variants located on the protein surface that affect PPI. Such a damaging effect can only be predicted when the 3D coordinates of a protein complex are taken into account, an approach used by algorithms such as the energy-based programs BeAtMuSiC [16], MutaBind2 [17] and mCSM-PPI2 [18]. However, most of these in silico prediction tools have been trained on engineered protein variants deposited in databases such as Skempi [19] and Protherm [20], and do not perform equally well when used on other datasets of variants [21, 22].

To date, the use of 3D structures to predict the effect of a variant remains relatively limited compared to the use of sequence conservation, thus calling for the development of new user-friendly algorithms that can be easily implemented to enhance variant prediction. We present Missense3D-PPI, a purely structure-based algorithm for the prioritization and characterization of missense variants occurring at protein-protein interfaces. Missense3D-PPI is available at the Missense3D web portal (http://missense3d.bc.ic.ac.uk).

Algorithm

Experimental structures and missense variants

Figure 1 presents the Missense3D-PPI pipeline. We extracted ∼4 million human missense variants from our in-house Missense3D-DB database [23], which contains the phenotypic annotation of variants from ClinVar [24] and UniProt [25] and minor allele frequency (MAF) data from GnomAD [26]. To identify missense variants occurring at a PPI site, we extracted 16,609 high-resolution (≤2.5 Å) X-ray crystal structures of human dimers and multimers from the Protein Data Bank (PDB) [27]. For each protein complex, we selected the experimental structure with the best resolution and without mutations in the protein interface.

Interface residues were defined as residues whose relative solvent accessibility (RSA), calculated using an in-house program, differed by ≥5% between the monomeric and the complex structure. Each interface residue was categorised as core, rim or support according to the change in RSA between the monomeric (RSAmonomer) and complex (RSAcomplex) form:

core residue: RSAmonomer ≥ 9% and RSAcomplex < 9%; RSAmonomer - RSAcomplex ≥ 5%

rim residue: RSAmonomer ≥ 9% and RSAcomplex ≥ 9%; RSAmonomer - RSAcomplex ≥ 5%

support residue: RSAmonomer < 9% and RSAcomplex < 9%; RSAmonomer - RSAcomplex ≥ 5%
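The three rules above can be expressed as a short function. The following is a minimal sketch, not the in-house program used in this work: the function name, the constant names and the use of percentage values as plain floats are assumptions made for illustration, and the RSA values are taken as given rather than computed from coordinates.

```python
from typing import Optional

# Thresholds taken from the definitions above; the names are hypothetical.
BURIAL_CUTOFF = 9.0    # % RSA separating buried (<9%) from exposed (>=9%)
MIN_RSA_CHANGE = 5.0   # % minimum loss of RSA on complex formation


def classify_interface_residue(rsa_monomer: float,
                               rsa_complex: float) -> Optional[str]:
    """Return 'core', 'rim' or 'support', or None for a non-interface residue.

    Both arguments are relative solvent accessibilities in percent.
    """
    # A residue is at the interface only if it loses >=5% RSA in the complex.
    if rsa_monomer - rsa_complex < MIN_RSA_CHANGE:
        return None

    if rsa_monomer >= BURIAL_CUTOFF:
        # Exposed in the monomer: core if buried on binding, rim otherwise.
        return "core" if rsa_complex < BURIAL_CUTOFF else "rim"

    # Buried in the monomer; a drop of >=5% implies it is buried in the
    # complex as well, which matches the support definition.
    return "support"


if __name__ == "__main__":
    for mono, comp in [(40.0, 3.0), (40.0, 20.0), (8.0, 2.0), (40.0, 38.0)]:
        print(mono, comp, classify_interface_residue(mono, comp))
```

Because the ≥5% difference is shared by all three categories, it is checked once up front, and the category then depends only on which side of the 9% cutoff each RSA value falls.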

We based our definition of core and rim on our definition of the change between buried and exposed status [13], but we acknowledge that several other definitions can be found in the literature. Only variants occurring at interface residues were retained, and the final dataset comprised 1,279 missense variants (pathogenic n=733, benign n=546) in 434 proteins and 545 PDB coordinates of PPI complexes. The benign dataset included variants with an annotation of “benign” from ClinVar or UniProt and variants with MAF>1% and no “pathogenic” annotation in ClinVar and/or UniProt. Human leukocyte antigen (HLA) proteins were excluded from the final dataset because of their highly polymorphic antigen-binding amino acid residues [28]. Over 1,000 naturally occurring haemoglobin variants have been described and extensively studied clinically and in vitro [29]. In our dataset, haemoglobin variants annotated as “unstable” were also included in the “damaging” dataset.
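The benign and pathogenic labelling criteria can be sketched as follows. This is an illustrative reading of the rules in the text, not the pipeline's actual code: the function name, the representation of annotations as a set of lowercase strings and the handling of missing MAF values are assumptions. Giving "pathogenic" precedence when a variant carries conflicting annotations is also an assumption, since the text does not say how such cases were resolved. The HLA exclusion and the haemoglobin "unstable" rule are left out for brevity.

```python
from typing import Optional, Set

MAF_BENIGN_CUTOFF = 0.01  # MAF > 1%, expressed as a fraction


def label_variant(annotations: Set[str],
                  maf: Optional[float] = None) -> Optional[str]:
    """Assign 'pathogenic', 'benign' or None (unlabelled) to a variant.

    annotations: lowercase ClinVar/UniProt annotations (hypothetical format).
    maf: minor allele frequency as a fraction, or None if unavailable.
    """
    # Assumption: a pathogenic annotation overrides everything else.
    if "pathogenic" in annotations:
        return "pathogenic"
    # Explicit benign annotation from ClinVar or UniProt.
    if "benign" in annotations:
        return "benign"
    # Common variants with no pathogenic annotation are treated as benign.
    if maf is not None and maf > MAF_BENIGN_CUTOFF:
        return "benign"
    # Anything else stays out of the dataset.
    return None


if __name__ == "__main__":
    print(label_variant({"benign"}))
    print(label_variant(set(), maf=0.05))
    print(label_variant({"pathogenic"}, maf=0.05))
    print(label_variant(set(), maf=0.001))
```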
