The BindingDB dataset is publicly accessible and contains experimentally measured binding affinities expressed as Kd, Ki, IC50, and EC50 values. For the external test, we extracted drug-target pairs in which the protein is a human kinase and the binding affinity is recorded as a Kd value. These Kd values are then transformed into log space as described above.
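As an illustration, a minimal sketch of this transformation, assuming the standard pKd convention used for kinase affinity data (pKd = −log10(Kd in molar), with Kd given in nM); the column names below are placeholders, not the actual BindingDB schema:

```python
# Hedged sketch: convert Kd values (in nM) to log space (pKd).
# The DataFrame columns are illustrative, not the real BindingDB field names.
import numpy as np
import pandas as pd

pairs = pd.DataFrame({
    "smiles":   ["CC(=O)Oc1ccccc1C(=O)O"],
    "sequence": ["MKKFFDSRREQGGSGLGSGSS"],
    "kd_nM":    [45.0],
})
pairs["pKd"] = -np.log10(pairs["kd_nM"] * 1e-9)   # pKd = -log10(Kd in molar)
print(pairs[["kd_nM", "pKd"]])                    # 45 nM -> pKd ~ 7.35
```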
The Davis dataset is divided into six parts: five are used for cross-validation and one is held out for testing. We use the same training and testing scheme as GraphDTA. Hyperparameters are tuned on the five parts with five-fold cross-validation. After tuning, we train on all five parts and evaluate the performance on the held-out test part. To evaluate the generalizability of the model, BindingDB is used as the external test dataset.
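The outline below illustrates this scheme; the random split is only a placeholder for the predefined Davis folds, and the training/evaluation calls are sketched as comments:

```python
# Hedged outline: 6 parts, 5-fold CV on five parts for hyperparameter selection,
# then one final model trained on all five parts and tested once on the sixth.
import numpy as np

rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(30056), 6)    # 30056 = 68 drugs x 442 kinases in Davis
train_folds, test_fold = folds[:5], folds[5]

for i in range(5):                                   # hyperparameter search (5-fold CV)
    val_idx = train_folds[i]
    trn_idx = np.concatenate([train_folds[j] for j in range(5) if j != i])
    # ... train a candidate configuration on trn_idx, validate on val_idx ...

final_trn_idx = np.concatenate(train_folds)          # final model: all five parts
# ... train with the selected hyperparameters, evaluate once on test_fold ...
```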
2.2. Input Data Representation

GraphATT-DTA takes a SMILES string as the compound input and an amino acid sequence string as the protein input. First, the SMILES string is converted to a graph structure with atoms as nodes and bonds as edges using the open-source Deep Graph Library (DGL) v.0.4.3(2) [27], DGL-LifeSci v.0.2.4 [28], and RDKit v.2019.03.1(1) [29]. We use the atom features defined in GraphDTA (i.e., atom symbol, number of adjacent atoms, number of adjacent hydrogens, implicit valence of the atom, and whether the atom is in an aromatic structure). We leverage the bond features used by the directed message-passing neural network (DMPNN; i.e., bond type, conjugation, whether the bond is in a ring, and stereo configuration). Table 2 and Table 3 list detailed information for each feature. Each amino acid type in the sequence is encoded as an integer, and sequences are truncated to a maximum length of 1000. Sequences shorter than the maximum length are padded with zeros. This maximum length covers at least 80% of all proteins.

2.3. Drug Representation Learning Model

A molecule is naturally represented by a graph structure consisting of atoms and bonds. The GNN uses this structural information and applies a message-passing phase consisting of a message_passing and an update function. In the message_passing function, node v aggregates information from the hidden representations of its neighbors, $h_w^{(t)}$. In the update function, the previous hidden representation $h_v^{(t)}$ is updated to a new hidden representation $h_v^{(t+1)}$ using the message $m_v^{(t+1)}$ and the previous hidden representation $h_v^{(t)}$:

$m_v^{(t+1)} = \mathrm{message\_passing}\big(h_v^{(t)}, \{h_w^{(t)} : w \in N(v)\}\big)$ (2)

$h_v^{(t+1)} = \mathrm{update}\big(m_v^{(t+1)}, h_v^{(t)}\big)$ (3)

where N(v) is the set of neighbors of v in graph G, and the hidden representation at time step t = 0 is initialized with the atom features $x_v$ (i.e., $h_v^{(0)} = x_v$).
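As an illustration of these two steps, the sketch below builds a molecular graph from a SMILES string with DGL-LifeSci and runs a small GCN-style stack to produce per-atom embeddings. It is a minimal example, not the authors' implementation: the canonical atom featurizer stands in for the GraphDTA atom features of Table 2 (bond features of Table 3 are omitted because a plain GCN does not use them), and the layer width is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import GraphConv
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer

atom_featurizer = CanonicalAtomFeaturizer()            # node features stored under ndata['h']
g = smiles_to_bigraph(
    "CC(=O)Oc1ccccc1C(=O)O",                           # aspirin, used here only as an example
    add_self_loop=True,
    node_featurizer=atom_featurizer,
)

class DrugEncoder(nn.Module):
    """Each GraphConv layer performs one message_passing + update step of Eqs. (2)-(3)."""
    def __init__(self, in_dim, hidden_dim=128, n_layers=3):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * n_layers
        self.layers = nn.ModuleList(
            [GraphConv(dims[i], dims[i + 1], activation=F.relu) for i in range(n_layers)]
        )

    def forward(self, graph, feats):
        h = feats
        for layer in self.layers:          # t = 0, ..., n_layers - 1
            h = layer(graph, h)            # aggregate neighbor messages, then update h_v
        return h                           # drug embedding matrix D, shape (Na, hidden_dim)

encoder = DrugEncoder(in_dim=atom_featurizer.feat_size())
D = encoder(g, g.ndata['h'])               # per-atom embeddings, e.g. torch.Size([13, 128])
```

Replacing GraphConv with another DGL layer (e.g., GATConv or GINConv) changes only the message_passing and update rules, in the spirit of the variants listed in Table 4.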
This mechanism, in which atoms aggregate and update information from neighboring nodes, captures information about the substructure of the molecule. GNN models have variants, such as the graph convolutional network (GCN) [30], graph attention network (GAT) [31], graph isomorphism network (GIN) [32], message-passing neural network (MPNN) [33], and directed message-passing neural network (DMPNN) [34], which can be obtained by specifying the message_passing function, $m_v^{(t+1)}$, and the update function, $h_v^{(t+1)}$ (see Table 4). The output is a drug embedding matrix, $D \in \mathbb{R}^{N_a \times d}$, where $N_a$ is the number of atoms and d is the dimension of the embedding vectors. In the drug embedding matrix, each atom carries the information of its neighboring atoms (i.e., its substructure), with the neighborhood radius determined by the number of GNN layers.

2.4. Protein Representation Learning Model

The Davis and BindingDB datasets contain 21 and 20 amino acid types, respectively; hence, we consider 21 amino acid types for learning and 20 for testing. The integer-encoded protein amino acid sequences are fed to an embedding layer. The embeddings are then passed through three consecutive 1D convolutional layers, which learn representations from the raw protein sequence data. The CNN captures local dependencies by sliding filters over the input features, and its output is the protein sub-sequence embedding matrix, $S \in \mathbb{R}^{N_s \times d}$, where $N_s$ is the number of sub-sequences. The number of amino acids in a sub-sequence depends on the filter size: the larger the filter, the more amino acids each sub-sequence covers.
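A minimal sketch of such a protein encoder is shown below; the embedding dimension, number of filters, and filter size are illustrative placeholders rather than the tuned hyperparameters of Table 5, while the maximum length of 1000 and the 21 amino acid types follow the text above.

```python
# Integer-encoded amino acids -> embedding -> three 1D convolutions, producing the
# protein sub-sequence embedding matrix S of shape (Ns, d).
import torch
import torch.nn as nn

class ProteinEncoder(nn.Module):
    def __init__(self, n_amino_acids=21, emb_dim=128, d=128, kernel_size=8, max_len=1000):
        super().__init__()
        # index 0 is reserved for zero-padding of shorter sequences
        self.embedding = nn.Embedding(n_amino_acids + 1, emb_dim, padding_idx=0)
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, d, kernel_size), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size), nn.ReLU(),
        )

    def forward(self, seq_idx):                 # seq_idx: (batch, max_len) integer codes
        x = self.embedding(seq_idx)             # (batch, max_len, emb_dim)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, channels, length)
        S = self.convs(x).transpose(1, 2)       # (batch, Ns, d); Ns shrinks with each conv
        return S

encoder = ProteinEncoder()
dummy = torch.randint(1, 22, (1, 1000))         # one integer-encoded, padded sequence
S = encoder(dummy)                              # e.g. torch.Size([1, 979, 128])
```

With this illustrative kernel size of 8 and no padding, each of the 979 sub-sequence positions summarizes a window of 22 residues after three layers, showing how the filter size controls the sub-sequence length.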
2.5. Interaction Learning Model

The relationship between the protein and the compound is a key determinant for DTA prediction. An attention mechanism lets each member of the input pair influence the computation of the other's representation, so that the pair jointly learns their relationship. GraphATT-DTA constructs the relation matrix R as the dot product of the compound and protein embeddings, where $R \in \mathbb{R}^{N_a \times N_s}$. This matrix encodes the relationship between compound substructures and protein sub-sequences, allowing GraphATT-DTA to model local interactions by attending to the crucial substructure and sub-sequence pairs. An atom-wise and a sub-sequence-wise softmax are applied to the relation matrix to construct the substructure and sub-sequence significance matrices, as given in (5) and (6). An element of substructure_significance indicates the importance of a substructure to a sub-sequence; similarly, an element of subsequence_significance indicates the importance of a sub-sequence to a substructure.

$\mathrm{substructure\_significance} = a_{ij} = \dfrac{\exp(r_{ij})}{\sum_{i=1}^{N_a}\exp(r_{ij})}$ (5)

$\mathrm{subsequence\_significance} = s_{ij} = \dfrac{\exp(r_{ij})}{\sum_{j=1}^{N_s}\exp(r_{ij})}$ (6)
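A minimal sketch of this step is shown below; the dimensions are arbitrary, and the random matrices D and S stand in for the outputs of the drug and protein encoders above.

```python
import torch
import torch.nn.functional as F

Na, Ns, d = 13, 979, 128                           # e.g. atoms, sub-sequences, embedding size
D = torch.randn(Na, d)                             # drug embedding matrix (GNN branch)
S = torch.randn(Ns, d)                             # protein sub-sequence embedding matrix (CNN branch)

R = D @ S.t()                                      # relation matrix, shape (Na, Ns)
substructure_significance = F.softmax(R, dim=0)    # normalize over atoms: Eq. (5)
subsequence_significance = F.softmax(R, dim=1)     # normalize over sub-sequences: Eq. (6)
```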
The substructure_significance is applied to the drug embedding matrix via element-wise multiplication (⊙) of $a_j$ and D, where $a_j \in \mathbb{R}^{N_a \times 1}$ and $j = 1, \ldots, N_s$. Here, $a_j$ indicates the importance of each substructure to the jth sub-sequence, and $D'_{(j)} \in \mathbb{R}^{N_a \times d}$ denotes the drug embedding matrix weighted by the importance of the jth sub-sequence. The drug vector $d''_{(j)}$ is constructed by (8) and carries the information of the compound with respect to the jth sub-sequence, where $d''_{(j)} \in \mathbb{R}^{1 \times d}$.

$d''_{(j)} = \sum_{a} D'_{(j)ab}$ (8)
$D'' = \mathrm{concat}\big[d''_{(1)}, d''_{(2)}, \ldots, d''_{(N_s)}\big]$ (9)
Concatenating $d''_{(j)}$ over all sub-sequences lets $D'' \in \mathbb{R}^{N_s \times d}$ carry the information of all sub-sequences and the compound. The new drug feature, $\mathrm{drug\_feature} \in \mathbb{R}^{1 \times d}$, is then constructed by (10) to reflect all protein sub-sequences and compound atoms.

$\mathrm{drug\_feature} = \sum_{i} D''_{ij}$ (10)
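Continuing the sketch after Eq. (6), summing $a_j \odot D$ over atoms for every j is equivalent to a single matrix product, so $D''$ and the drug feature can be formed without an explicit loop over sub-sequences:

```python
# Attended drug feature of Eqs. (8)-(10): row j of D'' equals sum_a (a_j ⊙ D) = a_j^T D.
D_pp = substructure_significance.t() @ D           # D'': (Ns, d)
drug_feature = D_pp.sum(dim=0, keepdim=True)       # (1, d), Eq. (10)
```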
The new protein feature is calculated in the same way. Element-wise multiplication of the subsequence_significance and the protein embedding matrix yields $P'_{(i)}$, the protein embedding matrix weighted by the significance of the ith substructure, where $s_i \in \mathbb{R}^{1 \times N_s}$, $S \in \mathbb{R}^{N_s \times d}$, and $P'_{(i)} \in \mathbb{R}^{N_s \times d}$. Summation over $P'_{(i)}$ produces the protein vector $p''_{(i)}$, which carries the sub-sequence information with respect to the ith substructure of the compound, where $p''_{(i)} \in \mathbb{R}^{1 \times d}$. After concatenating the $p''_{(i)}$, the summation of $P''$ produces the new protein feature vector reflecting the compound substructure significance, where $P'' \in \mathbb{R}^{N_a \times d}$ and $\mathrm{protein\_feature} \in \mathbb{R}^{1 \times d}$.

$p''_{(i)} = \sum_{a} P'_{(i)ab}$ (12)
$P'' = \mathrm{concat}\big[p''_{(1)}, p''_{(2)}, \ldots, p''_{(N_a)}\big]$ (13)
$\mathrm{protein\_feature} = \sum_{i} P''_{ij}$ (14)
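The protein side is symmetric; continuing the same running sketch:

```python
# Attended protein feature of Eqs. (12)-(14): row i of P'' equals s_i S.
P_pp = subsequence_significance @ S                # P'': (Na, d)
protein_feature = P_pp.sum(dim=0, keepdim=True)    # (1, d), Eq. (14)
```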
The protein and drug features, which carry the local-to-global interaction information, are concatenated, and fully connected layers then predict the binding affinity. We use the mean squared error (MSE) as the loss function.
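A minimal sketch of this prediction head follows; the hidden layer size and the target value are illustrative, not the tuned settings of Table 5, and random vectors stand in for the attended features computed above.

```python
# Concatenate the attended features and regress the affinity with an MSE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 128
drug_feature = torch.randn(1, d)        # stand-in for the attended drug feature
protein_feature = torch.randn(1, d)     # stand-in for the attended protein feature

head = nn.Sequential(nn.Linear(2 * d, 512), nn.ReLU(), nn.Linear(512, 1))
joint = torch.cat([drug_feature, protein_feature], dim=1)   # (1, 2d)
pred = head(joint)                                          # predicted affinity (e.g., pKd)
loss = F.mse_loss(pred, torch.tensor([[7.35]]))             # illustrative target value
```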
2.6. Implementation and Hyperparameter Settings

GraphATT-DTA was implemented with PyTorch 1.5.0 [35], and the GNN models were built with DGL v.0.4.3(2) [27] and DGL-LifeSci v.0.2.4 [28]. Early stopping with a patience of 30 epochs was used to avoid overfitting and improve generalization. The hyperparameter settings are summarized in Table 5; they were selected with five-fold cross-validation over multiple experiments. The number of GNN layers is important because it determines how many hops of neighbor nodes the model considers. With more layers, the model can aggregate information from more distant neighbors; however, too many layers can cause an over-smoothing problem in which all node embeddings converge to the same value. Conversely, if the number of layers is too small, the graph substructure is not captured. Proper layer configuration is therefore important, and the optimal number of GNN layers was chosen experimentally for each GNN graph embedding model. Detailed results can be found in Supplementary Table S2 and Supplementary Figure S1.
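For illustration, the early-stopping logic described above can be sketched as follows; the random "validation MSE" is a placeholder for an actual validation pass, and the checkpointing call is indicated only as a comment.

```python
# Hedged sketch of early stopping with a patience of 30 epochs.
import random
random.seed(0)

best_val, patience, wait = float("inf"), 30, 0
for epoch in range(1000):
    val_mse = random.random()            # placeholder for evaluate(model, val_loader)
    if val_mse < best_val:
        best_val, wait = val_mse, 0      # improvement: reset the patience counter
        # torch.save(model.state_dict(), "best.pt")   # checkpoint the best model
    else:
        wait += 1
        if wait >= patience:             # 30 epochs without improvement: stop training
            print(f"early stop at epoch {epoch}, best validation MSE {best_val:.3f}")
            break
```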