DeepCubist: Molecular Generator for Designing Peptidomimetics Based on Complex Three-Dimensional Scaffolds

Methodological Concept

DeepCubist is conceptualized to include two design stages, as illustrated in Fig. 1. At the first stage, a preferred scaffold for reproducing the spatial side chain arrangement of a target peptide is determined. To this end, a database of 3D scaffolds with methyl groups initially placed at three substituent positions is constructed, enabling initial superposition of the scaffolds onto the Cα–Cβ bonds of the target peptide. At the second stage, heteroatoms and unsaturated bonds are introduced into selected frameworks to provide further functionality and support synthetic accessibility.
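The superposition underlying the first stage can be illustrated with RDKit. The following is a minimal sketch in which a toy trimethyl-substituted scaffold is aligned onto a single Cα–Cβ bond of alanine; the SMILES strings, atom indices, and restriction to one bond are illustrative assumptions, not the actual DeepCubist matching procedure, which superposes three such bond vectors simultaneously.

```python
# Minimal sketch of the scaffold/peptide superposition with RDKit.
# SMILES, atom indices, and the single-bond mapping are illustrative.
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign

# Toy trimethyl-substituted bicyclic scaffold and alanine as a stand-in residue.
scaffold = Chem.AddHs(Chem.MolFromSmiles("CC1CC2CCC1(C)CC2C"))
reference = Chem.AddHs(Chem.MolFromSmiles("CC(N)C(=O)O"))
AllChem.EmbedMolecule(scaffold, randomSeed=42)
AllChem.EmbedMolecule(reference, randomSeed=42)

# Map a scaffold methyl carbon and its ring attachment atom onto the
# Cbeta/Calpha pair of the reference (atom indices are illustrative).
atom_map = [(0, 0), (1, 1)]  # (scaffold idx, reference idx)
rmsd = rdMolAlign.AlignMol(scaffold, reference, atomMap=atom_map)
print(f"RMSD over mapped atoms: {rmsd:.3f}")
```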

Fig. 1

An overview of DeepCubist. The two major stages of the computational design approach are illustrated

Template scaffolds

Construction of DeepCubist’s scaffold database began with defining a qualifying 3D scaffold as a tricyclic or tetracyclic bridged ring system consisting of 5- and/or 6-membered rings. This scaffold definition can be modified for different applications depending on specific requirements. Our definition ensured that scaffold structures could be chemically diversified to a greater extent than, for example, bicyclic systems, while restricting the theoretically possible chemical complexity and hence increasing the likelihood of achieving synthetic accessibility. For our proof-of-concept investigation, scaffolds meeting this definition and consisting of 10 to 14 carbon atoms were then systematically generated, as illustrated in Fig. 2.

Fig. 2

Generation of a 3D scaffold database

1) Six fused or bridged bicyclic systems consisting of 5- and/or 6-membered rings were computationally constructed as starting points (the number can be varied).

2) Tricyclic ring systems were then exhaustively generated by extending the bicyclic systems with fragments comprising m carbon atoms added to any pair of ring atoms (a minimal sketch of this bridging operation is shown after this list). From the resulting tricyclic ring systems, tetracyclic structures were obtained by adding fragments with n carbon atoms to every atom pair of the tricyclic systems. Hence, (m, n) fragment combinations were defined to obtain target scaffolds with 10 to 14 carbon atoms, depending on the size of the original bicyclic system.

For example, for bicyclic ring system 1 consisting of eight carbon atoms, fragment combinations (m, n) with m + n = 2 were used to exhaustively construct scaffolds with 10 carbon atoms. As a result of these operations, a total of 1347 bridged ring systems were obtained at this stage.

3) The generated tri- and tetracyclic candidate structures were then filtered to collect chemically feasible 3D scaffolds with limited strain energy. Scaffold conformers were generated using the “ligand preparation” option of Discovery Studio 2020 [5] and conformers with a “clean energy” value of no more than 100 kcal/mol were collected, yielding 405 different 3D scaffolds with no chiral information.

4) Finally, combinations of three substituents were added to each 3D scaffold, in each case permitting the presence of at most one quaternary carbon for ease of synthesis (the number of substituents can vary). The introduction of substituent combinations resulted in a total of 28,440 unique carbon atom scaffolds with no chiral information. These carbon atom scaffolds can be classified as 3D cyclic skeletons, following the hierarchical scaffold definition of Bemis & Murcko [6]. These skeletons served as input for the design of final 3D scaffolds containing heteroatoms and unsaturated bonds, as further described below.
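The bridging operation from step 2 can be sketched with RDKit as follows. The choice of bicyclo[2.2.2]octane as the starting structure, the fixed fragment size m = 2, and the candidate filtering are illustrative assumptions, not the original DeepCubist implementation.

```python
# Illustrative sketch of step 2: extend a bicyclic ring system with an
# m-carbon bridge between every pair of ring atoms (here m = 2), yielding
# tricyclic candidates.
from itertools import combinations
from rdkit import Chem

def add_carbon_bridge(mol, i, j, m):
    """Return a copy of mol with an m-carbon bridge between atoms i and j."""
    rw = Chem.RWMol(mol)
    prev = i
    for _ in range(m):
        new_idx = rw.AddAtom(Chem.Atom(6))            # bridging carbon
        rw.AddBond(prev, new_idx, Chem.BondType.SINGLE)
        prev = new_idx
    rw.AddBond(prev, j, Chem.BondType.SINGLE)          # close the new ring
    out = rw.GetMol()
    Chem.SanitizeMol(out)
    return out

# Bicyclo[2.2.2]octane (8 carbons) + 2-carbon bridges -> 10-carbon tricyclics.
bicyclic = Chem.MolFromSmiles("C1CC2CCC1CC2")
ring_atoms = [a.GetIdx() for a in bicyclic.GetAtoms() if a.IsInRing()]
candidates = set()
for i, j in combinations(ring_atoms, 2):
    try:
        candidates.add(Chem.MolToSmiles(add_carbon_bridge(bicyclic, i, j, 2)))
    except Exception:
        continue  # skip chemically invalid candidates
print(len(candidates), "unique tricyclic candidates")
```

For the subsequent strain filter (step 3), which relies on the proprietary “clean energy” of Discovery Studio, MMFF94 conformer energies computed with RDKit would be a possible open-source stand-in.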

Generative model

Once 3D carbon skeletons are generated, they must be converted into chemically meaningful scaffolds. For this purpose, DeepCubist employs a deep generative model based on SMILES strings [7] as a standard text-based molecular representation. Such generative models have been applied, for example, to construct target-focused virtual libraries [8] or natural product-like compounds [9], demonstrating the ability to generate chemical structures of varying complexity. For training such models, SMILES of existing compounds are often augmented with randomized SMILES [10] to support learning of the chemical language encoded by string representations. As the deep learning architecture, a transformer model from natural language processing was selected [11]. In contrast to other sequence-to-sequence models, transformer models operate on the basis of attention mechanisms that identify and strongly weight the representation elements most important for achieving accurate predictions during the training phase [11]. As further discussed below, the transformer model was trained to convert 3D carbon scaffolds into compounds containing heteroatoms and unsaturated bonds, that is, candidate compounds with chemical features amenable to synthesis.
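The randomized SMILES augmentation referenced above can be sketched with RDKit by re-rooting the SMILES traversal at different atoms; the helper name and the cap on the number of variants are illustrative assumptions.

```python
# Sketch of SMILES augmentation with rooted (randomized) SMILES using RDKit.
from rdkit import Chem

def augment_smiles(smiles, max_variants=10):
    """Generate up to max_variants distinct SMILES strings for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    variants = {Chem.MolToSmiles(mol)}  # canonical SMILES first
    for atom in mol.GetAtoms():
        if len(variants) >= max_variants:
            break
        # Re-root the SMILES traversal at a different atom.
        variants.add(Chem.MolToSmiles(mol, rootedAtAtom=atom.GetIdx(),
                                      canonical=False))
    return sorted(variants)

print(augment_smiles("c1ccccc1O"))  # phenol written several different ways
```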

Source and target structures for training

Drug- and natural product-like compounds were retrieved from ChEMBL version 30 [12] and COCONUT [13], a database of natural products, respectively. A total of 1,914,739 ChEMBL and 406,919 COCONUT compounds were obtained, referred to as original compounds. For model derivation, all possible target (output) structures were extracted from the original ChEMBL and COCONUT compounds by removing all exocyclic atoms from primary ring substituents and replacing the removed fragments (including, for example, ester, amide, or sulfone moieties) with a hydrogen atom, as illustrated in Fig. 3. Thus, target structures represented consistently defined scaffolds with primary substituents for deep learning as well as candidate structures for further chemical modification. Source (input) structures were then obtained by converting target structures into cyclic skeletons through replacement of all heteroatoms with carbon and conversion of all bond orders to 1 (single bonds), as also illustrated in Fig. 3. After the original compounds were decomposed, target structures with no more than eight atoms in individual rings and containing only permitted elements were collected for modeling. A total of 53,075 pairs of target and corresponding source structures were obtained. The use of these pairs of corresponding source and target structures for model derivation provided the basis for generating 3D scaffolds containing heteroatoms and unsaturated bonds from our newly generated database of 3D carbon skeletons described above. The 53,075 target structures were found to contain 268 of the 405 enumerated 3D scaffolds; hence, the remaining 137 scaffolds were novel.
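The conversion of target structures into source skeletons can be sketched with RDKit; this is an illustrative reimplementation of the described rules, not the original code.

```python
# Illustrative sketch of deriving a source structure from a target scaffold:
# replace all heteroatoms with carbon and set every bond order to single.
from rdkit import Chem

def to_carbon_skeleton(target_smiles):
    rw = Chem.RWMol(Chem.MolFromSmiles(target_smiles))
    for atom in rw.GetAtoms():
        atom.SetAtomicNum(6)          # every atom becomes carbon
        atom.SetFormalCharge(0)
        atom.SetNumExplicitHs(0)
        atom.SetNoImplicit(False)
        atom.SetIsAromatic(False)
    for bond in rw.GetBonds():
        bond.SetBondType(Chem.BondType.SINGLE)  # all bond orders set to 1
        bond.SetIsAromatic(False)
    skeleton = rw.GetMol()
    Chem.SanitizeMol(skeleton)
    return Chem.MolToSmiles(skeleton)

print(to_carbon_skeleton("c1ccc2[nH]ccc2c1"))  # indole -> fused carbon bicycle
```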

Fig. 3

Source and target structures for training the transformer model

String representations for training

For converting scaffolds into compounds using SMILES-based deep generative models, substitution sites in input structures are often marked as wild-card sites such as “*” to enable chemical diversification [14,15,16]. Furthermore, transformer-based retrosynthetic predictions have been improved by minimizing the edit distance between augmented input and output SMILES strings, compared to using unique canonical SMILES [17]. The edit distance between two SMILES strings is defined as the number of editing operations (insertion, deletion, and substitution) required to transform one string into the other. Corresponding SMILES representations with minimized edit distance closely link these representations for learning, which tends to reduce error rates. In our study, this strategy was applied for model derivation, as illustrated in Fig. 4A. After source and target SMILES were augmented by generating additional SMILES rooted at each atom using RDKit [18], newly generated SMILES with the smallest edit distance were paired using the sequence alignment module implemented in Biopython [19], as shown in Fig. 4B. In accordance with the DeepCubist design strategy, heteroatoms in target SMILES were replaced with carbon atoms to obtain corresponding source SMILES. Then, the additional SMILES strings were aligned with the original source SMILES using the “pairwise2.align.globalxx” function of Biopython. In the alignment, identical characters obtain a score of 1; otherwise, the score is 0. Since source structures were generated from target structures, gaps (“-”) in aligned SMILES strings can only occur in source SMILES.
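A minimal sketch of this pairing and alignment step is given below. The SMILES strings are illustrative, and the Levenshtein implementation is a generic stand-in for whichever edit distance routine was actually used; note that Bio.pairwise2 is deprecated in recent Biopython releases but provides the interface referenced in the text.

```python
# Sketch: pair augmented SMILES by minimal edit distance, then align them
# with Biopython (globalxx: match = 1, mismatch/gap = 0).
from Bio import pairwise2

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

target = "CC1=CC(=O)NC1"                       # illustrative target SMILES
source_variants = ["CC1CCCC1", "C1CCC(C)C1", "C1CC(C)CC1"]

# Pair the source SMILES closest to the target in edit distance...
source = min(source_variants, key=lambda s: edit_distance(s, target))
# ...and align the pair; gaps ("-") can only appear in the source string.
alignment = pairwise2.align.globalxx(source, target)[0]
print(alignment.seqA)
print(alignment.seqB)
```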

Fig. 4

Molecular representations. (A) illustrates the generation and (B) the alignment of source and target SMILES for transformer training

Model derivation

Pairs of source and target structures were randomly divided into 42,990 training (90%) and 4777 validation (10%) instances. Following data separation, the SMILES augmentation and alignment steps were carried out. Original SMILES were iteratively augmented with randomized SMILES to obtain a total of 168,137 pairs for training and 18,624 pairs for validation. A multi-head attention transformer model was constructed using PyTorch [20]. SMILES tokens were embedded in 512 dimensions, the number of attention heads was set to 8, the number of sub-layers in both the encoder and decoder units was set to 3, and the dimensionality of the feed-forward network was set to 512. For all remaining parameters, default settings were used. The model architecture including parameter settings is schematically illustrated in Fig. 5. For structure generation, SMILES tokens were sampled according to the learned probability distribution.
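The stated configuration maps directly onto PyTorch’s built-in transformer module, as sketched below; the vocabulary size, sequence lengths, and omission of positional encodings are simplifying assumptions, not settings from the paper.

```python
# Sketch of the stated transformer configuration using PyTorch.
import torch
import torch.nn as nn

VOCAB_SIZE = 100   # assumed number of SMILES tokens
D_MODEL = 512      # embedding dimension from the text

embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
model = nn.Transformer(
    d_model=D_MODEL,
    nhead=8,                # 8 attention heads
    num_encoder_layers=3,   # 3 encoder sub-layers
    num_decoder_layers=3,   # 3 decoder sub-layers
    dim_feedforward=512,    # feed-forward dimensionality
)
out_proj = nn.Linear(D_MODEL, VOCAB_SIZE)

# One forward pass on dummy token sequences, shaped (seq_len, batch).
src = embedding(torch.randint(VOCAB_SIZE, (40, 2)))
tgt = embedding(torch.randint(VOCAB_SIZE, (35, 2)))
logits = out_proj(model(src, tgt))

# Generation step: sample the next token from the learned distribution.
probs = torch.softmax(logits[-1], dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
print(logits.shape, next_token.shape)
```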

Fig. 5

Transformer architecture and parameter settings

Scripts for the calculations and the data can be obtained via the following link:

https://www.dropbox.com/s/4gdhew9xjit43e4/DeepCubist_Materials.zip?dl=0.
