CIMB, Vol. 44, Pages 5638-5654: MIFNN: Molecular Information Feature Extraction and Fusion Deep Neural Network for Screening Potential Drugs

The discovery of new drugs and the repurposing of existing drugs are popular tasks in biochemistry. Predicting drug effectiveness or toxicity from molecular properties plays an important role in these tasks. In recent years, with the development of machine learning, and especially the emergence of deep learning [1,2], many methods have achieved better performance on this prediction problem. Drug molecules or compounds are converted into computer-readable formats, such as molecular graphs [3], molecular fingerprints [4], and molecular descriptors. Features are then extracted from these representations by various means, including deep learning, and used for subsequent classification or prediction tasks.

Schneider's studies [5] have shown that prediction accuracy improves directly when the extracted features capture the biochemical information of a molecule more comprehensively across different kinds of molecular information. Molecular descriptors and fingerprints are usually designed for specific chemical and biological tasks. For example, one of the most classical molecular descriptors is the Simplified Molecular Input Line Entry System (SMILES), which is used for retrieving molecular information. Molecular descriptors are designed for special tasks related to molecules, and they are more flexible than molecular fingerprints. SMILES is especially used for molecular retrieval [6] and is widely used in molecular datasets; the datasets used in this paper also use SMILES as a general retrieval format. Yang et al. [7] proposed molecular directed information. The same team also proposed the Chemprop model based on this descriptor, which has achieved great success in molecular screening. Given the success of molecular directed information in molecular screening, this paper selects it as part of the input sequence.

Durant et al. [8] proposed a key-based molecular fingerprint, the Molecular Access System (MACCS), to retrieve molecules by molecular substructure. With the development of deep learning and increasingly complex needs, some studies began to build molecular descriptors and molecular fingerprints from the three-dimensional spatial coordinates of molecules, such as the 3Dmol network proposed by Chunyan Li et al. [9] and the 4D fingerprint proposed by Senese et al. [10]. In 2010, Rogers et al. [11] proposed Morgan fingerprints, which encode the neighborhood of each atom and the bonding connectivity within molecules. In 2020, Prasad et al. [12] won the sixth round of the Statistical Assessment of the Modeling of Proteins and Ligands competition using the Morgan fingerprint combined with deep learning, indicating that the Morgan fingerprint performs well in prediction and screening.

How the various descriptors and molecular fingerprints compare in classification performance is still unclear. The study by Mayr et al. [13] demonstrated that molecular fingerprint models contain a wider range of feature information than molecular descriptors obtained using convolutional models, while the experimental results of Wu et al. [14] showed the opposite. This discrepancy can be partly attributed to differences between datasets in evaluation metrics and molecular species. It is also partly due to the different domains for which the two types of molecular information were designed. The study by Wu et al. [14] illustrated that molecular fingerprints are more specific to the chemical structure of molecules and the presence of certain substructures. In contrast, molecular descriptors focus on the type and number of atoms in molecules and on the shape of molecules.

In the earlier study of Tseng et al. [15], an attempt was made to combine the two fingerprints, and a novel fingerprint designed on this basis showed better predictive performance. Accordingly, Wang et al. [16] implemented joint fingerprinting and feature engineering. Traditional feature extraction methods include the genetic algorithm proposed by Pérez-Castillo et al. [17] and the partial least squares method proposed by Su et al. [18]. The study by Hu et al. [19] showed that traditional feature extraction methods are cumbersome and require a wide range of professional knowledge, which significantly reduces efficiency. Subsequent studies have shown that the proper selection and fusion of molecular fingerprints and molecular descriptors can significantly improve classification performance [20,21].

Although these models and methods have made some progress, many problems are still worthy of further exploration. One critical problem is the performance gap between different molecular descriptors and molecular fingerprints: the two kinds of representation differ, and they need to be made more complementary. Another problem is how to optimize the structure of deep learning networks for feature extraction. As deep learning advances, the effectiveness and efficiency of feature extraction improve continuously. Taherkhani's study [22] found that, with deep learning algorithms, computers can automatically identify and select the more important feature information, which provides a significant advantage in processing large-scale drug data. Drug screening differs significantly from traditional deep learning tasks in that the labels in drug molecular datasets are unevenly distributed, sometimes extremely so. Because of this, overfitting is more likely to occur when more complex network models are used for learning. The studies by Tetko et al. [23,24] also illustrated this problem.

Therefore, we focus on molecular fingerprints and molecular descriptors that contain more abundant molecular information, and we combine both in a multimodal way to obtain more complete features. Through the careful design of a deep learning network structure, we aim to achieve better feature extraction while avoiding overfitting. To this end, this paper designs the Molecular Information Fusion Neural Network (MIFNN). Two different forms of molecular information are processed by two different networks to produce feature information, and the two parts of feature information are fused and passed to the classification module to obtain the final classification results.
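The two-branch data flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the MIFNN implementation: the names `to_grid`, `branch_1d`, `branch_2d`, and `fuse` are hypothetical, the two "branches" are simple pooling placeholders standing in for the convolutional networks, the inputs are toy values, and fusion is assumed to be plain concatenation.

```python
from typing import List


def to_grid(bits: List[int], width: int) -> List[List[int]]:
    """Fold a flat fingerprint bit vector into rows of `width` bits."""
    if len(bits) % width != 0:
        raise ValueError("fingerprint length must be a multiple of width")
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def branch_1d(sequence: List[float]) -> List[float]:
    """Placeholder for the 1D network: returns [mean, max] of the sequence."""
    return [sum(sequence) / len(sequence), max(sequence)]


def branch_2d(grid: List[List[int]]) -> List[float]:
    """Placeholder for the 2D network: fraction of set bits in each row."""
    return [sum(row) / len(row) for row in grid]


def fuse(features_a: List[float], features_b: List[float]) -> List[float]:
    """Fusion assumed here to be concatenation of the two feature vectors."""
    return features_a + features_b


# Toy inputs (assumptions): a short descriptor sequence and a 16-bit fingerprint.
descriptor_seq = [0.2, 0.8, 0.5, 0.1]
fingerprint = [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0]

fused = fuse(branch_1d(descriptor_seq), branch_2d(to_grid(fingerprint, width=4)))
print(len(fused))  # one fused vector per molecule, passed on to the classifier
```

In a real model the placeholders would be replaced by trained networks; the point of the sketch is only that each molecule yields one feature vector per branch and that the classifier sees their combination.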

In MIFNN, we use two convolution networks with different dimensions to extract features from molecular information. In the one-dimensional convolution network, we process the molecular directed information proposed by Yang et al. [7] and add an attention mechanism and bidirectional long short-term memory (bi-LSTM). The LSTM module has been widely used in natural language processing. The study of Xie [25] confirmed that a separate LSTM mechanism is conducive to screening drug molecules. However, a single LSTM module often ignores part of the sequence information, whereas molecular directed information expresses a molecule through directional transmission between atoms. The research of Jiang et al. [26] and Chen et al. [27] showed that, if the sequence diagram information between atoms can be preserved completely, the subsequent feature extraction can be carried out more effectively. From the research of Lenselink et al. [28] and Öztürk et al. [29], we know that bidirectional LSTM can improve feature extraction from protein sequences under the framework of deep learning. Hence, we use bi-LSTM to avoid the disadvantages of the single LSTM mentioned above.

In the two-dimensional convolution network, features are extracted from the Morgan fingerprint. Because Morgan fingerprints share many common bits, the data can be spliced into a two-dimensional map to obtain complete feature information. For the classifier module, we use the particle swarm optimized support vector machine (PSO-SVM), that is, a traditional SVM optimized by the PSO algorithm. The PSO algorithm is an adaptive algorithm based on alpha stable distribution and dynamic fractional calculus proposed by Deng [30]. In the comparative experiments of Zhang et al. [31], the classifier optimized by the PSO algorithm improved significantly.

Our MIFNN has three advantages: (1) feature extraction methods with different dimensions are used, and splicing and fusion methods are used to obtain more comprehensive molecular features; (2) the bi-LSTM mechanism and attention module are used for feature extraction to obtain complete molecular information, which provides good support for subsequent prediction and classification and leads to better results; (3) we choose the PSO support vector machine as the classifier, which can obtain more accurate classification results without easily overfitting [32]. We extensively evaluated our model against other recently released neural network structures for feature extraction and conducted several comparative experiments on eight publicly available test sets provided by Wu et al. [14] and Mayr et al. [13]. Our goal was to outperform other network structures on the public datasets under the same evaluation index. According to our control experiment, our classification results performed better on both the training set and the test set. To evaluate the performance of the current model objectively, a control group was established for different feature information using the classification network. The results show that there is still room to improve the structure of the deep learning model in subsequent work.
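To make the PSO-SVM idea mentioned above more concrete, the following is a minimal sketch of a standard global-best particle swarm optimizer. It is not the adaptive variant of Deng [30] and not the authors' code. The function names, the swarm parameters, and the toy objective are all assumptions: `toy_error` is a stand-in for the cross-validated error of an SVM as a function of two hyperparameters, with an arbitrarily chosen minimum.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pso_minimize(
    objective: Callable[[Sequence[float]], float],
    bounds: List[Tuple[float, float]],
    n_particles: int = 30,
    n_iters: int = 200,
    inertia: float = 0.7,
    c1: float = 1.5,
    c2: float = 1.5,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Standard global-best PSO; returns (best position, best value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_error(params: Sequence[float]) -> float:
    """Hypothetical stand-in for SVM validation error over two hyperparameters."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


best_params, best_error = pso_minimize(toy_error, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
print(best_params, best_error)
```

In an actual PSO-SVM setup, the objective would train and validate an SVM for each candidate hyperparameter pair, so each evaluation is far more expensive than in this toy example and the swarm size and iteration count would be chosen accordingly.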
