PharmRL: pharmacophore elucidation with deep geometric reinforcement learning

In this method, we trained a convolutional neural network (CNN) model to identify favorable points of interaction (pharmacophore features) on the binding site and developed a deep geometric Q-learning algorithm that attempts to select an optimal subset of these interaction points to form a pharmacophore. The CNN model is trained on pharmacophore features derived from protein-ligand co-crystal structures and is iteratively fine-tuned with adversarial examples to ensure predicted points of interaction are physically plausible and close to relevant functional groups on the protein. The reinforcement learning algorithm employs an SE(3)-equivariant neural network [24] as the Q-value function. This network progressively constructs a protein-pharmacophore graph, at each step either incorporating an available pharmacophore feature into the graph or determining that the current graph is already optimal. The pipeline for the method is shown in Fig. 2. Importantly, this framework still has the ability to accommodate expert guidance in selecting and adding features while automating a significant portion of the traditional pharmacophore elucidation process.

Pharmacophore definition

A pharmacophore is defined as a set of points \(\{(X_f, Z_f)\}\) that propose positions of interaction between the given protein binding site and a potential ligand. More specifically, each point in a pharmacophore has a 3D coordinate \(X_f \in \mathbb{R}^3\) and a feature class \(Z_f\). The feature class is any of the following: hydrogen donor, hydrogen acceptor, aromatic, hydrophobic, positive ion, or negative ion. Pharmacophore search software such as Pharmit [1] can be used to retrieve molecules that satisfy the feature and position constraints specified by a given pharmacophore. In this work, pharmacophores are developed in two major steps. First, potential points of interaction on a binding site are identified using a convolutional neural network (CNN). A subset of these identified points is then selected with a reinforcement learning model to form a pharmacophore. More details follow in subsequent sections.

Molecular conformation generation and pharmacophore screening

Molecule conformers for pharmacophore screening are generated using RDKit [25] for the DUD-E and Covid Moonshot datasets, with 25 energy-minimized conformers produced per molecule. The LIT-PCBA dataset, however, is prohibitively large for conformation generation. Therefore, we submit the list of molecules from this dataset directly to the Pharmit server [26]. This approach saves on compute as Pharmit’s database already contains conformers for most of these molecules due to significant overlap with other datasets hosted on the server. By default, Pharmit stores 20 conformers per molecule. Pharmit is also used to screen pharmacophores on these conformers. The software retrieves and aligns conformers that match the spatial constraints specified by the pharmacophore, with a tolerance radius of 1 Å for all of its features. We also remove conformers that overlap with the protein (receptor exclusion in Pharmit) from the screening results. We ensure that only one conformer per molecule is returned by the software when calculating our performance metrics.
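As an illustration of this conformer-generation step, the following sketch uses RDKit to embed and energy-minimize conformers for a single molecule. The ETKDGv3 embedding parameters and the MMFF94 force field are assumptions here, as the text only specifies the number of minimized conformers.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def generate_conformers(smiles, n_confs=25, seed=42):
    """Embed and energy-minimize n_confs conformers for one molecule."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()          # embedding method is an assumption
    params.randomSeed = seed
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)
    AllChem.MMFFOptimizeMoleculeConfs(mol)  # force-field choice is an assumption
    return mol

mol = generate_conformers("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as a toy example
print(mol.GetNumConformers())
```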

CNN training

The CNN model is trained to predict whether a given point on the binding site is a plausible point of interaction. Specifically, the CNN predicts if any of the six feature classes are present at the given point. It is trained in a multilabel classification manner so that it can predict the presence of multiple classes at the evaluated point. This approach accounts for overlap between different classes. For instance, certain aromatic groups can be viewed as hydrophobic, and similarly, some hydrogen acceptor groups may also be regarded as negative ion functional groups with the ability to form salt bridges.

The CNN takes as input a voxelized representation of the protein structure located in a cubic volume of edge length 9.5 Å, at a resolution of 0.5 Å, centered at the query point. The libmolgrid [27] Python library with its default atom types is used to voxelize the protein structure. The model is trained for 256 epochs with a batch size of 256, using the Adam optimizer at a learning rate of 1e−5. The model checkpoint with the best metrics on the test set is saved. The CNN architecture is provided in Fig. 3.

Fig. 3

CNN architecture for predicting pharmacophore feature points. The CNN takes the local grid around the query point as input and provides confidence scores on the presence of the 6 classes at that point
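A minimal sketch of the multilabel training setup described above is given below, assuming a placeholder `PharmCNN` module standing in for the architecture in Fig. 3 and a dataloader yielding (voxel grid, label) pairs; six independent sigmoid outputs with binary cross-entropy implement the multilabel objective, and the optimizer settings follow the text.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the architecture in Fig. 3. The input is a voxel grid
# for a 9.5 Å cube at 0.5 Å resolution (~20 voxels per side); the number of
# atom-type channels is an assumption.
class PharmCNN(nn.Module):
    def __init__(self, in_channels=14, n_classes=6):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, n_classes)   # one logit per feature class

    def forward(self, x):
        return self.head(self.body(x))

model = PharmCNN()
criterion = nn.BCEWithLogitsLoss()              # multilabel: independent sigmoids
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def train_epoch(loader):
    for grids, labels in loader:                # labels: float tensor of shape (B, 6)
        optimizer.zero_grad()
        loss = criterion(model(grids), labels)
        loss.backward()
        optimizer.step()
```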

The model is initially trained on pharmacophore features extracted from the PDBBind V. 2019 dataset [28]. For each structure we use pharmacophore feature interaction points identified by Pharmit as our training samples. The command line used to extract features for each crystal structure is pharmit pharma -receptor receptor.pdb -in ligand.mol2 -out pharmit.json. The extracted dataset is split into three cross-validation folds with data points (pharmacophore features) from similar ligands placed in the same fold. Ligand similarity is determined by Tanimoto similarity over RDKit [25] fingerprints. Two ligands with a Tanimoto similarity greater than 0.9 are considered similar and are clustered together into the same fold. In total we have 157,252 data points, with approximately 104,835 data points in the training sets and 52,417 data points in the test sets. We train separate models for each fold and use the best performing model for inference.
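The fold assignment by ligand similarity could be implemented roughly as follows; the choice of RDKit's topological fingerprint and the union-find grouping of similar ligands are assumptions, as the text only specifies RDKit fingerprints and a 0.9 Tanimoto cutoff.

```python
from rdkit import Chem, DataStructs

def cluster_by_tanimoto(smiles_list, threshold=0.9):
    """Group ligands whose pairwise Tanimoto similarity exceeds the threshold."""
    fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]
    parent = list(range(len(fps)))               # union-find over ligands

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            if DataStructs.TanimotoSimilarity(fps[i], fps[j]) > threshold:
                parent[find(i)] = find(j)        # merge similar ligands into one cluster

    clusters = {}
    for i in range(len(fps)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())               # clusters are then distributed over folds
```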

Table 1 Pharmacophore interaction distance thresholds

To enhance the robustness of pharmacophore feature predictions, the CNN undergoes retraining with adversarial samples. Adversarial samples are generated through a two-step process. First, the protein binding sites are discretized at a resolution of 0.5 Å, and the CNN is evaluated at each grid point. Predictions that are too close to protein atoms are labeled as negative. Additionally, predictions for which the complementary functional groups of interest on the protein are too distant are collected as adversarial samples. For instance, hydrogen acceptor predictions beyond 4 Å from any hydrogen donor functional group on the protein are considered negatives. Thresholds for pharmacophore features and their complementary functional groups are outlined in Table 1. Complementary functional groups on the protein are found using the same SMARTS strings as those defined in Pharmit. The adversarial samples are then added as negative data points to the training set to retrain the model.
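A sketch of the adversarial-negative selection for one feature class is shown below. The clash distance of 2 Å and the input arrays (`pred_points`, `protein_coords`, `complement_coords`) are hypothetical; the complementary-group cutoff corresponds to the 4 Å hydrogen-acceptor example, with the per-class values taken from Table 1 in practice.

```python
import numpy as np

def adversarial_negatives(pred_points, protein_coords, complement_coords,
                          clash_dist=2.0, max_complement_dist=4.0):
    """Return predicted points that should be relabeled as negatives.

    pred_points: (N, 3) coordinates predicted positive for one feature class.
    protein_coords: (M, 3) all protein atom coordinates.
    complement_coords: (K, 3) coordinates of complementary functional groups
        on the protein for this feature class (found via SMARTS matching).
    """
    d_prot = np.linalg.norm(pred_points[:, None] - protein_coords[None], axis=-1)
    d_comp = np.linalg.norm(pred_points[:, None] - complement_coords[None], axis=-1)
    too_close = d_prot.min(axis=1) < clash_dist           # clashes with the protein
    too_far = d_comp.min(axis=1) > max_complement_dist    # no partner group nearby
    return pred_points[too_close | too_far]
```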

From CNN predictions to pharmacophore features

Pharmacophore features are individual interaction points found in the binding site. One key assumption is that these features should be in proximity to complementary interaction feature groups on the protein. These features are inferred through a multi-step process.

Fig. 4

Steps followed to obtain pharmacophore feature points from CNN predictions on a binding site

The pharmacophore feature generation process is shown in Fig. 4. As before, the binding site is first gridded at a resolution of 0.5 Å and the CNN is evaluated at each grid point, resulting in a dense grid of feature confidences. We then determine the number of feature points to extract from each connected component of predictions. To do so, we use the complementary functional groups on the protein that lie close to the connected components (Fig. 4b): for each such group, the top predicted point within the corresponding distance threshold (see Table 1) is taken (Fig. 4c).

Finally, feature points are refined by grouping predictions that are near each other through agglomerative clustering. A distance threshold of 1.5 Å is used as the criterion for merging clusters, and the centroid of each cluster is taken as the predicted pharmacophore feature (Fig. 4d).
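This merging step could look roughly like the following, using scikit-learn's agglomerative clustering with a 1.5 Å merge threshold; the linkage criterion is not stated in the text, so complete linkage is an assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def merge_feature_points(points, threshold=1.5):
    """Merge nearby predicted points and return one centroid per cluster."""
    points = np.asarray(points)
    if len(points) < 2:
        return points
    clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=threshold, linkage="complete"
    ).fit(points)
    return np.array([points[clustering.labels_ == c].mean(axis=0)
                     for c in np.unique(clustering.labels_)])
```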

Formation of pharmacophores from features

A subset of the candidate pharmacophore features is selected to form a full pharmacophore. This process is modeled as a reinforcement learning problem. Specifically, the method is a deep Q-learning framework that utilizes an SE(3)-equivariant neural network [24] to model the Q-value function. The RL algorithm is trained on the Database of Useful Decoys, Enhanced (DUD-E), as it provides an extensive set of actives and decoys for each protein-ligand system it contains.

Why reinforcement learning

Modeling pharmacophores presents a substantial challenge because it involves selecting a concise set of features suitable for virtual screening. Pharmacophores are built by combining specific features, and this combination greatly influences their performance. Notably, adding or removing a single feature can significantly impact the overall performance, making it challenging to assess the individual importance of each feature in isolation. This complexity poses a hurdle for traditional supervised learning approaches, such as the CNN.

However, reinforcement learning (RL) offers a different perspective. RL has the potential to consider the long-term consequences of adding a single feature to a pharmacophore. Consequently, an RL algorithm can sequentially incorporate features into a pharmacophore model while considering the overall value of the fully formed pharmacophore, rather than just the immediate value of each individual feature added along the way.

Pharmacophore selection as a Markov decision process (MDP)

The generation of the pharmacophore follows an iterative process via the construction of a heterogeneous 3D graph. The graph contains “pharmacophore” nodes (\(V_f\)) representing pharmacophore features and “protein” nodes (\(V_p\)) that contain the protein atoms in proximity to the corresponding pharmacophore features. Each iteration involves adding pharmacophore feature nodes and their associated protein atoms to the graph. The structure of the graph in the next step depends entirely on its current state, making the process akin to a Markov decision process (MDP).

In the context of reinforcement learning, a Markov decision process (MDP) is defined with a set of states \(s \in S\) that provide information about the environment, actions \(a \in A\) that move the environment from the current state to the next state, and a reward function \(R: S \times A \rightarrow \mathbb{R}\) that provides a reward value for each state-action pair.

Fig. 5

MDP process used for the iterative construction of the protein-pharmacophore graph. At each time point t, the action is to choose the next graph (\(G_{t+1}\)). The environment takes this choice and provides an F1 score for that pharmacophore, along with the possible graphs to choose from (\(\{G^{s}_{t+2}\}\)) for the next iteration

Here, at a given time point (t) in the iterative process, we define a state (\(s_t\)) as a heterogeneous protein-pharmacophore graph denoted as \(G_t(V_f, V_p, E_{ff}, E_{fp})\), consisting of pharmacophore feature nodes (\(V_f\)), protein nodes (\(V_p\)), and edges (\(E_{ff}\) and \(E_{fp}\)) connecting feature nodes to other feature nodes and to protein nodes, respectively. The edges are formed based on predefined distance thresholds \(\delta_{ff}\) and \(\delta_{fp}\).

This definition leads to a set of possible states \(\{s_{t+1}\}\) that can be reached at time point \(t+1\) by considering the addition of a feature not present in the current graph but within a distance of \(\delta_{ff}\) from any feature node in the graph. This results in a set of proposed graphs denoted as \(\{G^{p}_{t+1}\}\). The current graph \(G_t\) is also added to this set, forming a superset \(\{G^{s}_{t+1}\} = \{\{G^{p}_{t+1}\}, G_t\}\). The action (\(a_t\)) then involves selecting one of the graphs from this proposal set as the next state. If the current graph is selected, the process terminates.

The reward for each step, \(r_t = R(s_t, a_t) = R(G_{t+1})\), is calculated as the F1 score obtained by running the pharmacophore, formed from the combination of the feature nodes in the graph \(G_{t+1}\), on a dataset containing actives and decoys. Pharmit, the tool used, requires a pharmacophore with at least 3 features to screen molecules. Therefore, until the current graph has at least 3 feature nodes, we assign a reward of 0 and do not include the current graph in the proposal set. A schematic representation of the MDP process at time point t is given in Fig. 5.
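A sketch of one environment step under this MDP is given below: proposals are formed by adding any unused feature within \(\delta_{ff}\) of the current graph, the current graph joins the proposal set once it has at least 3 features, and the reward is the F1 score of a Pharmit screen. The `screen_f1` callable and the tuple representation of features are hypothetical stand-ins.

```python
import numpy as np

def proposal_set(current_features, all_features, delta_ff, min_features=3):
    """Enumerate candidate next graphs for one MDP step.

    current_features / all_features: lists of (coord, feature_class) tuples,
    with coord given as a plain (x, y, z) tuple. Each proposal is the current
    feature set plus one nearby unused feature.
    """
    proposals = []
    for feat in all_features:
        if feat in current_features:
            continue
        if not current_features or any(
            np.linalg.norm(np.asarray(feat[0]) - np.asarray(f[0])) < delta_ff
            for f in current_features
        ):
            proposals.append(current_features + [feat])
    if len(current_features) >= min_features:
        proposals.append(current_features)       # "stop" action keeps the graph
    return proposals

def reward(chosen_features, screen_f1, min_features=3):
    """F1 score from a Pharmit screen once the pharmacophore is large enough."""
    return screen_f1(chosen_features) if len(chosen_features) >= min_features else 0.0
```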

Deep Q-learning

The objective of reinforcement learning is to learn a policy \(\pi^*: S \rightarrow A\) that maximizes the cumulative (discounted) reward obtained from an MDP. In Q-learning, the function Q(s, a) is trained to predict future rewards given an action taken at a state. In this context, for a policy \(\pi\), the Q-value of a state-action pair is given by:

$$\begin{aligned} Q^\pi (s_t,a_t) = Q^\pi (G_t,a_t) = Q^\pi (G_{t+1}) = \mathbb{E}\left[ \sum \limits _{i=t}^{T} \gamma ^{i-t}\, r_i\right] \end{aligned}$$

(1)

where \(\gamma\) is a predetermined reward discount factor. The discount factor implicitly weighs the importance of the immediate reward relative to the cumulative reward. The optimal policy at a state is then \(\pi^*(s) = \mathop{\mathrm{argmax}}_a Q^{\pi^*}(s,a)\). For this problem, this equation translates to:

$$\begin{aligned} \pi ^*(G_t) = \mathop{\mathrm{argmax}}_{G_{t+1} \in \{G^{s}_{t+1}\}} Q^{\pi ^*}(G_{t+1}) \end{aligned}$$

(2)

An SE(3)-equivariant neural network is used to parameterize the Q function (Fig. 6). The neural network is trained to minimize the objective \(l(\theta) = \mathbb{E}\left[\left( y_t - Q(G_{t+1};\theta)\right)^2\right]\), where \(\theta\) is the parameter set of the neural network and \(y_t\) is given by:

$$\begin{aligned} y_t = r_t + \gamma \cdot \max \limits _{G_{t+2} \in \{G^{s}_{t+2}\}} Q(G_{t+2};\theta ) \end{aligned}$$

(3)
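A minimal sketch of how Eqs. (1)-(3) translate into a single training step is shown below, assuming the Q network scores one candidate graph at a time; the squared-error form of the objective and the use of a separate target network (introduced later in the training details) follow standard deep Q-learning practice.

```python
import torch

def q_learning_loss(q_net, target_net, G_next, r_t, next_proposals, gamma=0.99):
    """Squared-error loss for one transition (G_t -> G_{t+1}).

    G_next: the graph chosen at step t (the action), i.e. G_{t+1}.
    next_proposals: candidate graphs reachable from G_{t+1};
        empty if the episode terminated at this step.
    """
    q_pred = q_net(G_next)                                # Q(G_{t+1}; theta)
    with torch.no_grad():
        if next_proposals:
            q_next = torch.stack([target_net(g) for g in next_proposals]).max()
            y_t = r_t + gamma * q_next                    # Eq. (3)
        else:
            y_t = torch.tensor(float(r_t))                # terminal state
    return (y_t - q_pred) ** 2
```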

Graph featurization

We construct heterogeneous graphs \(G(V_f, V_p, E_{ff}, E_{fp})\), consisting of pharmacophore feature nodes (\(V_f\)), protein nodes (\(V_p\)), and edges (\(E_{ff}\) and \(E_{fp}\)) connecting feature nodes to other feature nodes and to protein nodes, respectively. Since we model a 3D graph, each node has a 3D coordinate in addition to its node features. We can therefore construct our edges \(E_{ff}\) and \(E_{fp}\) using appropriate distance thresholds \(\delta_{ff}\) and \(\delta_{fp}\). The thresholds themselves were decided through hyperparameter sweeps. The node features for the protein nodes are one-hot encodings of the atom types defined by the libmolgrid library. The atom types are listed in Table 2. The node features for the interaction feature nodes are the output of the final hidden layer of the CNN; we use this output as it is essentially an embedding of the local information around that point. The edge features provided to the model are a radial Gaussian basis embedding of the edge distance.
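A rough construction of this heterogeneous graph is sketched below using PyTorch Geometric's `HeteroData`; the threshold values, the number of radial basis functions, and the helper names are placeholders, since the tuned values come from the hyperparameter sweep.

```python
import torch
from torch_geometric.data import HeteroData

def radial_basis(dist, n_rbf=16, cutoff=8.0):
    """Embed edge lengths with evenly spaced Gaussian radial basis functions."""
    centers = torch.linspace(0.0, cutoff, n_rbf)
    return torch.exp(-((dist.unsqueeze(-1) - centers) ** 2))

def build_graph(feat_pos, feat_x, prot_pos, prot_x, delta_ff=6.0, delta_fp=4.0):
    """feat_x: CNN hidden-layer embeddings (N_f, 32); prot_x: one-hot atom types."""
    g = HeteroData()
    g["feature"].pos, g["feature"].x = feat_pos, feat_x
    g["protein"].pos, g["protein"].x = prot_pos, prot_x

    d_ff = torch.cdist(feat_pos, feat_pos)
    d_fp = torch.cdist(feat_pos, prot_pos)
    ff = ((d_ff < delta_ff) & (d_ff > 0)).nonzero().t()    # feature-feature edges
    fp = (d_fp < delta_fp).nonzero().t()                   # feature-protein edges

    g["feature", "near", "feature"].edge_index = ff
    g["feature", "near", "feature"].edge_attr = radial_basis(d_ff[ff[0], ff[1]])
    g["feature", "near", "protein"].edge_index = fp
    g["feature", "near", "protein"].edge_attr = radial_basis(d_fp[fp[0], fp[1]])
    return g
```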

Table 2 Atom types used to featurize protein nodes

Q-function neural network

We train an SE(3)-equivariant graph neural network as our Q-function. The neural network consists of separate embedding layers for the different node and edge types, k message passing layers, a global mean pooling aggregation layer and a final fully connected layer that predicts the Q-value. The input graph has two types of edges: feature node–protein node and feature node–feature node. To model this heterogeneity we have separate message passing weights for the two edge types.

Fig. 6

The SE(3)-equivariant neural network takes a protein-pharmacophore graph as input and predicts the Q-value

The message passing layers in our model utilize SE(3)NN convolution layers, implemented through the e3nn Python package [29]. These convolution layers are based on a spherical harmonic basis with varying orders represented by “l” and operate on scalar features (\(l=0\)), vector features (\(l=1\)), and higher-order features (\(l>1\)). In our implementation, we do not exceed \(l=2\). Additionally, for the convolution, we model our edge features by a concatenation of the scalar features of the nodes involved and the edge embedding.

The SE(3)NN commences with basic scalar node features and progressively generates higher-order features with both odd and even parity as the network deepens, accomplished through tensor product convolutions. To determine the network’s width, we define two parameters: ns determines the number of scalar features produced by each layer, while nv dictates the number of higher-order features of each type (\(l=1\), \(l=2\)) for both odd and even parity. We use \(ns = 32\) and \(nv = 8\) for our neural network.
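In e3nn terms, these width parameters could be expressed roughly as follows; the exact layer wiring of the network is not reproduced here, and the tensor-product construction shown is only one way to realize the convolution described above.

```python
from e3nn import o3

ns, nv = 32, 8  # values used in this work

# Hidden node features: ns scalars plus nv features of each order and parity up to l=2
irreps_hidden = o3.Irreps(f"{ns}x0e + {nv}x1o + {nv}x1e + {nv}x2o + {nv}x2e")

# Edge directions expanded in spherical harmonics up to l=2
irreps_sh = o3.Irreps.spherical_harmonics(lmax=2)

# One tensor-product convolution mixing node features with edge harmonics;
# its weights would be generated from the concatenated scalar/edge features
tp = o3.FullyConnectedTensorProduct(irreps_hidden, irreps_sh, irreps_hidden,
                                    shared_weights=False)
print(irreps_hidden, tp.weight_numel)
```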

Training details

The DUD-E dataset is split into training and test sets, with the test set being the diverse subset of the DUD-E dataset. This subset contains 8 proteins that are representative of all the proteins in the dataset. The neural network operates on a protein-pharmacophore graph as input. Protein nodes are represented by atom types, while pharmacophore nodes take as features the output of the final hidden layer of the CNN. The final hidden layer of the CNN can be interpreted as an embedding of the local environment around the feature point; it provides a latent vector of size 32. Initially, the model is trained using pharmacophore features extracted from protein-ligand co-crystal structures and is then fine-tuned using pharmacophore features obtained from CNN predictions. Since the ligand-derived features are obtained from crystal structures, they also carry directional information about aromatic/hydrogen-bonding interactions between the ligand and the protein, which is used while evaluating generated pharmacophores. A hyperparameter sweep was also conducted while training on pharmacophore features extracted from the cognate ligand. The model that provides the best mean F1 score on ligand-extracted features is used to train an ensemble of 5 models on CNN-predicted features.

The training algorithm iterates through the protein-ligand systems in the training set, generating training samples through episodes simulated using an \(\epsilon\)-greedy policy. \(\epsilon\)-greedy balances exploration and exploitation by setting a probability \(\epsilon\) with which a random action is taken instead of the action decided by the neural network. During training, epsilon decays exponentially according to the equation \(\epsilon_t = \epsilon_T + (\epsilon_o - \epsilon_T)\, e^{-\alpha t}\), where \(\epsilon_o\) and \(\epsilon_T\) are the initial and final epsilons, and \(\alpha\) is a predetermined decay-rate parameter. As a result, the initial iterations of RL training focus on exploring as many graphs as possible, while later iterations focus on optimizing the learnt policy based on the graphs sampled by the neural network, which by then produces better graph proposals. While training on ligand-based features, an \(\epsilon_o\) value close to 0.9 is used, but when fine-tuning on CNN features, \(\epsilon_o = 0.5\) is used as less exploration is required at this stage.
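The decay schedule, written out as code with the endpoint and decay-rate values as placeholders:

```python
import math

def epsilon(t, eps_o=0.9, eps_T=0.05, alpha=1e-3):
    """Exponentially decaying exploration rate; eps_T and alpha are placeholders."""
    return eps_T + (eps_o - eps_T) * math.exp(-alpha * t)
```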

To simulate an episode, we begin by randomly selecting a protein-ligand system from the dataset. Initially, we set up an empty protein-pharmacophore graph as the starting state. In the first step of the simulation, the policy is allowed to select any pharmacophore feature, along with its corresponding protein atoms, to add to the graph. Subsequent steps only permit the addition of a feature node (and its associated protein atoms) if they are within a distance of \(\delta_{ff}\) from the feature nodes already present in the graph. This criterion is used to generate a set of proposed graphs for the next step. Additionally, if the current graph contains at least 3 feature nodes, it is included in this set.

At each step, the policy selects a graph from this proposal set, and the associated reward for that action is collected. This process continues iteratively until either the same graph is selected again or the maximum number of steps (T) allowed in an episode is reached. While training on ligand features we set \(T = 10\), and on CNN features we set \(T = 5\).

We maintain a replay memory M of capacity N that stores the latest training samples generated from the simulations. In addition we use a separate target neural network with fixed parameters that provides the target for training the Q function neural network. This stabilizes training of the neural network. Every C episodes, the parameters of the target network are updated as a linear combination of the target and Q function network parameters. The importance given to the target network parameters in the update is defined by another parameter \(\tau\).
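A sketch of the replay memory and the \(\tau\)-weighted target-network update is given below; the buffer capacity, batch size, and \(\tau\) value are placeholders, with the tuned values listed in Table 3.

```python
import random
from collections import deque

import torch

replay = deque(maxlen=10_000)        # replay memory M with capacity N (placeholder)

def store(transition):
    """transition: (G_t, G_{t+1}, r_t, next_proposals)."""
    replay.append(transition)

def sample_batch(batch_size=32):
    return random.sample(replay, min(batch_size, len(replay)))

@torch.no_grad()
def soft_update(target_net, q_net, tau=0.1):
    """theta_target <- tau * theta_target + (1 - tau) * theta_q, every C episodes."""
    for p_t, p_q in zip(target_net.parameters(), q_net.parameters()):
        p_t.mul_(tau).add_((1.0 - tau) * p_q)
```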

The full training algorithm is provided in Algorithm 1.


Algorithm 1 Deep Q-learning algorithm to train Q function network

Hyperparameter sweep

To identify the optimal combination of parameter values, we perform a Bayesian hyperparameter search during the training process using pharmacophore features from the ligand. We keep track of the mean F1 score on the test set for models trained with various parameter combinations, and we select the parameter set that yields the highest F1 score.

Table 3 Hyperparameter options searched through for RL Q function model

This hyperparameter search is executed through the use of wandb (https://wandb.ai/site). A comprehensive list of hyperparameters and their respective selected values can be found in Table 3.

Performance metrics

We evaluate the methods presented in this work through several metrics. To calculate these metrics we define the following:

True positives (TP): # of molecules returned by the pharmacophore that are known to be actives

False positives (FP): # of molecules returned by the pharmacophore that are known to be decoys

True negatives (TN): # of molecules not returned by the pharmacophore that are known to be decoys

False negatives (FN): # of molecules not returned by the pharmacophore that are known to be actives

The metrics we evaluate the methods on are:

Hit rate: The hit rate is given by

$$\begin{aligned} HR = \frac{TP + FP}{TP + FP + TN + FN} \end{aligned}$$

(4)

Precision: The precision is given by

$$\begin{aligned} P = \frac{TP}{TP + FP} \end{aligned}$$

(5)

Recall: The recall is given by

$$\begin{aligned} R = \frac{TP}{TP + FN} \end{aligned}$$

(6)

F1 score: The F1 score is given by:

$$\begin{aligned} F1 = \frac{2 \cdot P \cdot R}{P + R} \end{aligned}$$

(7)

Enrichment factor: The enrichment factor is given by:

$$\begin{aligned} EF = P \bigg/ \frac{TP + FN}{TP + FP + TN + FN} \end{aligned}$$

(8)

Guner-Henry: The Guner-Henry metric is given by:

$$\begin{aligned} GH = \left[ \frac{TP\,\big(3(TP+FN) + (TP+FP)\big)}{4\,(TP+FP)(TP+FN)}\right] \left[ 1 - \frac{FP}{FP + TN}\right] \end{aligned}$$

(9)

We place emphasis on the F1 score and the enrichment factor in our experiments. The F1 score is used because it remains relatively unbiased on an unbalanced dataset. The enrichment factor quantitatively compares the fraction of actives among our hits with the fraction of actives in the full dataset, indicating how much our screening approach has enriched the hits.
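For reference, these metrics can be computed from the raw screening counts as sketched below; the hit-rate expression follows Eq. (4) above, and the zero-division guards are a practical assumption.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from raw screening counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    hit_rate = (tp + fp) / total                       # fraction of library returned
    ef = precision / ((tp + fn) / total) if (tp + fn) else 0.0
    gh = (tp * (3 * (tp + fn) + (tp + fp)) / (4 * (tp + fp) * (tp + fn))
          * (1 - fp / (fp + tn))) if (tp + fp) and (tp + fn) and (fp + tn) else 0.0
    return {"HR": hit_rate, "P": precision, "R": recall,
            "F1": f1, "EF": ef, "GH": gh}
```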
