On the relevance of query definition in the performance of 3D ligand-based virtual screening

Chemical diversity in DUD-E+ and DUD-E+-diverse

Prior to the analysis of the impact of the query definition on the outcome of the VS, the degree of structural resemblance of the chemical space defined by actives and decoys in the DUD-E+ database was examined. To this end, for each target the similarity between actives and decoys relative to the template compound was estimated by means of the Tanimoto index using a 2D Morgan fingerprint (radius = 2; 2048 bits; see above).

Figure 2A shows that the actives included in DUD-E+ generally exhibit a wider range of similarity to the template compound, which is in contrast with the rather uniform deviation obtained for the decoys. Compared to the template, the Tanimoto similarity of decoys is generally comprised between 0.03 and 0.14, and the global similarity determined for the whole set of 92 targets amounts, on average, to 0.11. The similarity index determined for the actives exhibit larger differences, and in 23 cases is ≥ 0.2. In light of these results, the DUD-E+-Diverse subset was generated as a curated dataset aimed to reduce the impact of structural analogs of the template compounds. To this end, only actives with low Tanimoto similarity ( ≤ 0.1) were retained to minimize the potential influence of the 2D bias on the results of the VS. As noted in Fig. 2B, both actives and decoys exhibit close similarity profiles to the template for the 15 datasets retained in in the DUD-E+-Diverse.

Fig. 2figure 2

2D Structural similarity determined by using the Tanimoto index (Morgan fingerprint) for actives (blue) and decoys (red) relative to the template compound for the targets included in (A) DUD-E+ and (B) DUD-E+-Diverse datasets. Dots stand for the average value of actives and decoys in each dataset. The blue/red areas denote the similarity indexes corresponding to the first and forth quartile for each target

Structural resemblance between distinct query definitions

As noted above, distinct protocols were considered to generate the query conformation of the template compound, including the preservation of the X-ray structure (QXR) or its energy-minimized structure (QEMXR), or alternatively the conformers corresponding to the lowest energy structures in the gas phase (QLEG) and in aqueous solution (QLEW), and finally an ensemble of conformations (QENS). The structural similarity between the query conformations relative to the crystallographic geometry was estimated from the RMSD determined for the heavy atoms of the template compound.

Compared to the X-ray conformation, the energy minimized structure leads to a narrow RMSD distribution with a peak centered at 0.7 Å (Fig. 3A). A wider distribution is obtained for the set of lowest-energy conformations (QLEG and QLEW) and for the multiple query (QENS), which exhibit a similar profile characterized by a broader RMSD peak centered at 1.9–2.2 Å. Moreover, the tail of the RMSD distribution is extended up to 6 Å. These trends are preserved in the subset of 15 datasets included in the DUD-E+-Diverse (Fig. 3B). Overall, although sampling of the conformational space increases the dissimilarity between the conformations of the template, there is generally a large resemblance between the distinct definitions of the query conformer, and only 25–35% of the QLEG, QLEW and QENS conformers are characterized by RMSD values > 3 Å.

Fig. 3figure 3

Symmetry-corrected heavy atom root-mean square deviation (RMSD; Å) profiles determined relative to the X-ray structure (QXR) for the distinct query definitions (QEMXR: blue; QLEG: orange; QLEW: green; QENS: red) of the template compounds included in the (left) DUD-E+ dataset and (right) the subset of targets of the DUD-E+-Diverse dataset

Performance of a single query on the virtual screening with pharmscreen

The VS results obtained for the DUD-E+ dataset using the distinct query conformations are examined in this section. The discussion is generally focused on the analysis of the global trends, paying special attention to the ROCe (1%, 2% and 5%) values as long as no relevant differences were observed in the AUC curves (see Supporting Information Table S3 for information about the ROCe and AUC values for the entire set of targets, and results for individual sets in Tables S4 − S7).

Figure 4 displays the distribution of ROCe (1% and 5%) values obtained for the whole set of targets using measurements of the PharmScreen similarity for the distinct query definitions. The distribution profiles are highly similar. Indeed, paired t-tests confirmed that no statistically significant differences between the ROCe distribution obtained for the distinct queries relative to the profile obtained for the QXR conformer were found. The similar performance obtained between QXR and QEMXR is not unexpected due to the close structural resemblance of the corresponding conformations, as 90% of the energy-minimized compounds have a RMSD deviation < 1 Å from the X-ray structure (Fig. 3). A similar performance is obtained for QLEG, QLEW and QENS despite the shift in the position of the peak and the larger tails found in the RMSD profiles relative to the X-ray conformation (Fig. 3).

Fig. 4figure 4

Distribution of ROCe values (top: 1%; bottom: 5%) determined for the set of targets included in DUD-E+ dataset using the distinct query definitions for the template

In agreement with previous studies [20,21,22,23], Fig. 4 suggests that the choice of the.

query conformation has globally a minor effect on the outcome of the VS. However, this does not imply that sensible differences might be found for individual targets. In fact, this can be noticed in Fig. 5A, which shows the representation of the difference in the ROCe values (ΔROCe) obtained for the QLEG and QLEW queries relative to the QXR in front of the RMSD determined between these conformations. Similar enrichments (|ΔROCe| < 1) are obtained for most of the targets. Nevertheless, there is a subset of targets where larger ΔROCe values are found, often revealing a better retrieval of actives when the QLEG/QLEW conformers are used in the LBVS (positive values of ΔROCe), as noted for mk01, hxha and pur2, or alternatively when the QXR query is adopted as query conformations (negative values of ΔROCe), such as ppara, mcr and xiap (see Table 1). Noteworthy, this behavior is not strictly associated to the RMSD between the query conformations, since |ΔROCe| values larger/lower than 10 (ROCe 1%) and 3 (ROCe 5%) are observed for cases characterized with highly similar conformations (RMSD values between queries ≤ 2 Å). Furthermore, inspection of Fig. 5B reveals that the actives recovered exclusively in the VS performed for the QLEG/QLEW queries (i.e., not found in the VS carried out against the QXR query) consistently exhibit a larger structural overlap with QLEG/QLEW compared to the overlap obtained for the actives retrieved uniquely when the QXR query is used in the LBVS.

Overall, this analysis points out that for certain targets the recovery of actives can be largely affected by the choice of the query definition. This trait cannot be attributed to geometrical differences between query definitions, as they are found along the whole range of RMSD values. Rather, it can be ascribed to the chemical resemblance between actives and template, as long as the presence of a common structural motif would favor the generation of similar conformations upon energy minimization either in the gas phase or in solution. In turn, this would justify why the choice of QLEG/QLEW queries favors the identification of actives with a large hydrophobic resemblance, since the chemical skeleton shared by template and actives would tend to maximize the overlap of the 3D hydrophobic distribution. In contrast, choice of the X-ray conformation may be better suited when there is low chemical resemblance between template and actives, as the crystallographic structure may be affected by factors that cannot be inherently associated to the conformational sampling, such as steric constraints imposed upon binding to the target, which would explain the lower degree of hydrophobic similarity shown in Fig. 5B.

Table 1 ROCe 1 and 5% values obtained for selected targets characterized by large |ΔROCe| values when QXR, QLEG and QLEW queries are used in the screening of the DUD-E+ datasetFig. 5figure 5

A) Representation of the difference in ROCe (ΔROCe) 1 and 5% versus the RMSD between query conformation obtained for (left) QLEG and (right) QLEW queries relative to the X-ray crystallographic one (QXR) for the targets included in the DUD-E+ dataset. B) Distribution of the actives uniquely recovered at 1, 2 and 5% according to the structural overlap (0: minimum overlap; 1: maximum overlap) exhibited against the QXR (magenta), QLEG (orange) and QLEW (green) queries. The structural overlap was estimated from the Tanimoto coefficient obtained from the comparison of the projection of the van der Waals radii of atoms onto a grid (spacing of 0.5 Å) of the hydrophobicity-guided alignment of query and active. Dashed lines denote the average value, and upper/lower dotted lines stand for the 25%/75% of the cases

To test the preceding hypothesis, Fig. 6 compares the 2D structures of template and top-ranked actives retrieved using the QLEG query for mk01, pur2 and thb and the 3D overlaid structures of both query and active. Noteworthy, the N-(2-hydroxy-1-phenylethyl)-1H-pyrrole-2-carboxamide moiety present in the template of mk01 is shared with the top-ranked actives (Tanimoto index of 0.82), reflecting the large chemical similarity found for the actives in this target (see Fig. 2A). Therefore, the resemblance of the chemical scaffold should favor the generation of conformations for the actives exhibiting a pronounced overlap with the template QLEG query, as noted in the 3D structural overlap between the volumes of query and active (≥ 0.70). A similar behavior is expected to occur in targets pur2 and thb due to the preservation of a common chemical scaffold in the 2D structures of template and top-ranked actives, which facilitates a large 3D overlap between the volumes of the aligned compounds (close to 0.65; see Fig. 6).

In contrast, the existence of a common scaffold is less evident in the comparison of the chemical skeleton of template and top-ranked actives in cases where the performance of the VS is much larger for the QXR query, as noted in the examples shown in Fig. 7. Despite the chemical dissimilarity between template and actives (Tanimoto index < 0.27), the retrieved actives exhibit a notable 3D overlap (close to 0.50) with QXR. In certain cases, this may be attributed to the adoption of a conformationally strained fold adopted by a flexible template to fit the topological constraints of the binding pocket, as noticed in ppara. Here the ligand (AZ2) is anchored through the hydrogen bonds formed by the terminal carboxylate group with residues Tyr314, Tyr464 and His440 in the interior of the binding pocket. The folded bioactive conformation, which is characterized by the normal arrangement of the two benzene rings, reflects the need to fill the L-shaped pocket, which is facilitated by the flexibility of the ethoxy tether (PDB ID 1I7G; see Fig. 8A). A similar situation can be noticed in mcr, where the hydrogen bonds formed by the sulfonamide group of the ligand (6QE) with residues Asn770 and Thr945 force the terminal cyclopropyl ring to be oriented toward the difluorobenzene ring located at the other end of the molecule (closest C-C distance of 3.2 Å), thus leading to the strained structure observed in the bioactive conformation (PDB ID 5L7G; see Fig. 8B).

Finally, the better performance obtained with QXR can be attributed to the bias of the energy minimization in the gas phase to favor the adoption of intramolecular interactions, which are nevertheless absent in the bioactive conformation. This is illustrated by tysy: although the template and the top-ranked active share a significant fraction of the chemical scaffold (Fig. 9A), the lowest-energy conformation of the template is remarkably different from the bioactive one due to the stabilization afforded by the electrostatic interaction between the carboxylate group and the exocyclic amino group located at the ends of the compound (Fig. 9B). It is worth noting that this latter case is corrected when the lowest-energy conformation of the template in aqueous solution is considered (Fig. 9B). The almost perfect overlay observed in this latter case reflects the fact that the formation of such an intramolecular contact is penalized by hydration effects, which favor the adoption of an extended conformation that enables the solvent exposure of both carboxylate and amino groups. This finding agrees with earlier studies that noticed the relevance of accounting for solvation effects in the conformational search (see for instance [42]).

Fig. 6figure 6

2D Structure of query and top-ranked actives for selected targets where choice of the QLEG query leads to a better performance in the VS relative to the X-ray structure. For each active compound, the structural resemblance (Tanimoto index using Morgan Fingerprint, radius 2) with the template and the structural overlap between the overlaid QLEG query and active are indicated

Fig. 7figure 7

2D Structure of query and top-ranked actives for selected targets where choice of the QXR. query leads to a better performance in the VS relative to the lowest energy-minimized conformer. For each active compound, the structural resemblance (Tanimoto index using Morgan Fingerprint, radius 2) with the template and the structural overlap between the overlaid QLEG query and active are indicated

Fig. 8figure 8

Representation of the bioactive conformation of ligand (A) AZ2 and (B) 6QE (yellow sticks) found in the X-ray structure of the complex with the binding pocket of the peroxisome proliferator-activated receptors alpha (ppara; PDB ID 1I7G) and mineralocorticoid receptor (mcr; PDB ID 5L7G), respectively. The shape of the pocket filled by the ligand is shown as a gray wireframe, and selected protein residues are highlighted as sticks

Fig. 9figure 9

A) 2D Structure of template and top-ranked active (CHEMBL-93,048) reported by QXR in tysy. B) Molecular overlay obtained by using QXR, QLEG, and QLEW

Overall, these results indicate that the degree of scaffold similarity between template and actives can modulate the performance of the LBVS depending on the choice of the query conformation. When the template and the actives share a large structural motif in their chemical scaffolds, it is reasonable to expect that the compounds will populate similar conformational spaces, enhancing the chance of finding conformers able to overlap with the template. In this context, the usage of the low-energy query may be better suited for the VS. However, when the structural diversity between template and actives increases, or alternatively when the query conformation is influenced by the formation of specific interactions or steric effects, the 3D information encoded in the X-ray structure may be more adequate for early recovery of the actives.

This interpretation is supported from the analysis of the maximum common substructure (MCS) as a metric to quantify the degree of identity between the chemical scaffolds of template and top-ranked actives. The results obtained for a subset of six targets (sahh, pur2, mk01, hivrt, wee1 and pa2ga) are displayed in Fig. 10. Targets sahh, pur2 and mk01 were selected since they are characterized by a high 2D similarity between template and top-ranked actives in conjunction with a notable bias in 2D similarity between actives and decoys (Fig. 2A). In contrast, hivrt, wee1 and pa2ga were selected because they exhibit low 2D similarity between template and top-ranked actives and a narrow gap between actives and decoys (Fig. 2A). To compare the relation between the query selection and the 2D hit similarity, we analyzed the characteristics of those hits present in the ROCe 5% that were found by only one of the two queries (QXR or QLEG/QLEW). As discussed above, choice of QLEG leads to a better performance in the VS for sahh, pur2 and mk01, reflecting the higher MCS found between template and actives (Fig. 10). On the other hand, the outcome of the VS is enhanced when QXR is chosen as query for hivrt, wee1 and pa2ga, and the actives exhibit a lower MCS with the template (Fig. 10).

Fig. 10figure 10

Distribution of the actives recovered in the ROCe 5% for targets sahh, pur2, mk01, hivrt, wee1 and pa2ga when either QXR (magenta), QLEG (orange) or QLEW (green) are used as queries. The profiles show the distribution according to the MCS of the actives found uniquely in the VS against QLEG/QLEW but not in QXR, and viceversa

Performance of multiple queries on the virtual screening outcome with pharmScreen

Can the use of multiple queries modulate the influence of the chemical resemblance between template and actives? To answer this question, the previous analysis was extended to the results obtained when an ensemble of conformers of the template are used as query in the VS. Hereafter, the discussion is focused on the results derived using the Parallel fusion algorithm, which led to a slightly better performance (comparison of the results derived for other fusion techniques is provided in Supporting Information Table S7-S9).

Although the use of multiple queries has globally a minor effect on the outcome of the VS (note the similar ROCe distribution relative to the use of QXR in Fig. 3), the representation of the ΔROCe (1% and 5%) values determined relative to the QXR conformation versus the RMSD (Fig. 11A) mimics the trends discussed above (see Fig. 5A). The ΔROCe shows similar values (|ΔROCe| < 1) for most of the targets, but differences larger than 10% (ROCe 1%) and 3% (ROCe 5%) are often found, reflecting a better performance when QXR is used as query (negative ΔROCe values), as noted for xiap, hxh4, mcr and pa2ga, or QENS is otherwise considered (positive ΔROCe values), as observed for mk01, pur2, grik1, and hmdh (Table 2). Furthermore, inspection of the distribution of actives recovered at 1, 2 and 5% for this subset of targets consistently exhibits a larger hydrophobic similarity to QENS compared to the hydrophobic overlap determined for the actives retrieved when QXR is used in the VS (Fig. 11B).

Fig. 11figure 11

A) Representation of the difference in ROCe (ΔROCe) 1% and 5% versus the RMSD averaged for the ensemble of queries (QENS) relative to the X-ray structure (QXR) of the template for the targets in the DUD-E+ dataset. B) Distribution of the actives uniquely recovered at 1, 2 and 5% according to the structural overlap (0: minimum overlap; 1: maximum overlap) exhibited against the QXR (magenta) and QENS (orange) queries. The structural overlap was estimated from the Tanimoto coefficient obtained from the comparison of the projection of the van der Waals radii of atoms onto a grid (spacing of 0.5 Å) of the hydrophobicity-guided alignment of query and actives. Dashed lines denote the average value, and upper/lower dotted lines stand for the 25%/75% of the cases

Table 2 ROCe 1 and 5% values obtained for selected targets characterized by large |ΔROCe| values when QXR and QENS queries are used in the screening of the DUD-E+ dataset

As an example, the template of grik1 includes an alanine group, which is present in the 10 top-ranked actives recovered in the VS performed using the QENS query (Fig. 12). In fact, the VS results suggest that the performance obtained using QXR or QENS is affected by the degree of scaffold similarity between template and actives (Fig. 13). When the 2D similarity is high, the actives adopt conformations that resemble the conformers included in the template’s ensemble. However, if the common structure shared between template and actives is low, the X-ray structure may be better suited for early recovery of the actives.

As a final remark, let us note a distinctive trait observed in the comparison of the QENS query and the lowest-energy conformation (QENS; see Fig. S3 in Supporting Information). A similar performance is generally found for most targets. Nevertheless, in few cases, especially ppara and tysy, differences larger than 10% (ROCe 1%) and 3% (ROCe 5%) are found with the use of QENS. Thus, the choice of the ensemble query may be valuable to correct the limitations found for QLEG (see discussion above).

Fig. 12figure 12

2D structure of query (bold) and top-ranked actives selected upon comparison against QENS for grik1

Fig. 13figure 13

Distribution of the actives recovered in the ROCe 5% for targets sahh, pur2, mk01, hivrt, wee1 and pa2ga when either QENS (red) or QXR (magenta) are used as queries. The profiles show the distribution obtained according to the MCS of the actives found uniquely in the VS against QENS but not in QXR, and viceversa

Impact of query conformation on the virtual screening outcome with phase shape

The performance of the VS may be influenced by the nature of the molecular descriptors used to measure the similarity between template and actives/decoys. Whereas the preceding discussion relies on the usage of the QM-derived Hyphar descriptors from PharmScreen, a complementary analysis was performed resorting to Phase-based shape descriptors. For the sake of comparison with the results shown above, this analysis is limited to QXR and QLEG. The summary of the AUC and ROCe results obtained for the whole set of targets is provided in Supporting Information Table S10, and results for individual sets in Supporting Information Table S11.

Inspection of the distribution profiles shown in Fig. 14A, which represents the ΔROCe (1% and 5%) values determined between QLEG and QXR queries versus the RMSD between these conformations, reproduces the general trends obtained with the Hyphar hydrophobic descriptors. Positive and negative ΔROCe values denote targets where the use of QLEG or QXR leads to a better performance. The former case is observed for targets such as thb, cdk2, ppard and pa2ga, whereas the latter case is found for wee1, fak1, mcr, tysy, hxh4 and hmdh (Table 3). Furthermore, the distribution of actives recovered at 1% and 2% for this subset of targets exhibits a larger shape-based similarity to QLEG compared to the shape overlap determined for the actives retrieved when QXR is used as query (Fig. 14B). Note, however, that this trait is not reflected in the distribution obtained at ROCe 5%. This can be ascribed to the fact that the alignments between template and actives exhibit a low 3D overlap for certain actives with a high degree of chemical diversity relative to the template, as noticed for targets pa2ga (Fig. 15A) and hxk4 (Fig. 15B).

Fig. 14figure 14

A) Representation of the difference in ROCe (ΔROCe) 1% and 5% versus the RMSD between the QLEG query conformation relative to the X-ray crystallographic one (QXR) for the targets included in the DUD-E+ dataset. B) Distribution of the actives uniquely recovered at 1, 2 and 5% according to the structural overlap (0: minimum overlap; 1: maximum overlap) exhibited against the QXR (magenta) and QLEG (orange) queries. The structural overlap was estimated from the Tanimoto coefficient obtained from the comparison of the projection of the van der Waals radii of atoms onto a grid (spacing of 0.5 Å) of the Phase shape-guided alignment of query and active. Dashed lines denote the average value, and upper/lower dotted lines stand for the 25%/75% of the cases

Table 3 ROCe 1 and 5% values obtained for selected targets characterized by large |ΔROCe| values when QXR and QLEG queries are used in the screening of the DUD-E+ datasetFig. 15figure 15

Alignment between QXR and a subset of chemically diverse actives selected using Phase Shape and characterized by low overlap with QXR. A) Target pa2ga (from left to right): CHEMBL356606, CHEMBL158383, CHEMBL514692, and CHEMBL347957. B) Target hxk4 (from left to right): CHEMBL6475058, CHEMBL551445, CHEMBL48514, and CHEMBL572840

Impact of query conformation in finding novel scaffolds (DUD-E+-diverse)

To further check the influence of the structural bias between actives and decoys relative to the template, the DUD-E+-Diverse database was designed to select targets where (i) actives have low structural resemblance to the template, (ii) both actives and decoys have comparable 2D structural similarities to the template (Fig. 2B), and ii) there is an appropriate ratio between the number of decoys and actives in the dataset (Supporting Information Table S2). The AUC and ROCe values obtained for the results of the 15 targets included in DUD-E+-Diverse are shown in Supporting Information Table S12 (results for individual targets are given in Supporting Information Tables S13 and S14).

Compared to Figs. 4 and 16A shows larger differences in the shape of the ROCe 1% and 5% distribution profiles, which are however not statistically significant because of the low number of targets with high RMSD differences. Figure 16B shows that similar ROCe values are obtained for all the targets but hivrt and pa2ga, where the use of QXR gives a better performance.

With regard to pa2ga, the ROCe values obtained using QXR (ROCe 1%: 12.9; 5%: 7.1) outperforms those obtained with QLEG (ROCe 1%: 1.6; 5%: 0.7). This can be related to the adoption of a folded structure of the long, flexible aliphatic chain in the binding pocket, which favors the retrieval of actives such as CHEMBL332993 (Fig. 17). In hivrt (ROCe 1% 6.2; 5%: 3.3 for QXR, and 0.0 and 0.3, respectively, for QLEG) the bioactive structure shows a closed structure, which enables the recovery of CHEMBL104349, whereas QLEG fails in finding active compounds (Fig. 17). Overall, these results support the better performance of the QXR query in cases where the bioactive conformation of the template can be altered upon binding to the protein target, as noticed for pa2ga and hivrt.

Fig. 16figure 16

A) Distribution of ROCe values (top: 1%; bottom: 5%) determined for the set of targets included in DUD-E+-Diverse dataset using QXR and QLEG query definitions for the template. B) Representation of the difference in ROCe (ΔROCe) 1% and 5% versus the RMSD between the QLEG query conformation relative to the X-ray crystallographic one (QXR) for the targets included in the DUD-E+ dataset

Fig. 17figure 17

PharmScreen alignment of QXR /QLEG (green/orange) with a top-ranked active (cyan) for pa2ga and hivrt

Final remarks

The global analysis of the results reveals that the performance, on average, of the 3D LBVS has generally a mild dependence on the choice of the template’s query conformation, which agrees with the results reported in previous works [20,21,22,23]. This tendency can be attributed to a large extent to the fact that the RMSD between the bioactive (X-ray) structure of the ligand and low-energy conformers is in most cases lower than 2.5 Å for around 70% of the targets. Nonetheless, the results obtained for specific targets also reveals noticeable differences in the performance of the screening depending on the template’s conformation used as query, underscoring the significance of 3D data.

As a general trait, in cases where the actives exhibit a notable structural similarity with the chemical scaffold of the template, choice of the energy-minimized conformation can be better suited. This can be attributed to the presence of a common chemical scaffold between template and active, which should favor the adoption of similar conformational spaces, and hence the recovery of actives that exhibit a significant 3D structural overlap with the template. Nevertheless, this scenario may be challenged by different factors, such as the introduction of conformational strain in the template upon binding to the protein target, either arising from the steric constraints imposed by the topology of the binding pocket, or the formation of specific interactions with target residues that compensate the energetic penalty required for adopting a conformation distinct from the lowest-energy conformer, as discussed before for ppara and mcr.

When these factors are important in the screening against templates endowed with large conformational flexibility, the usage of the X-ray structure can be expected to be better suited than the lowest-energy conformer to enhance the early recovery of actives, since the bioactive conformation encodes information relative to the molecular determinants that must be fulfilled by the ligands to achieve a proper binding in the target pocket, as discussed for pa2ga and hivrt.

An alternative scenario that may affect the appropriate choice of the query conformation is the reliability of the computational protocol adopted to generate the query conformation. Computational efficiency may favor the implementation of energy minimization in the gas phase, but this may lead to the formation of artifactual intramolec

留言 (0)

沒有登入
gif