MeSH2Matrix: combining MeSH keywords and machine learning for biomedical relation classification based on PubMed

As shown in Fig. 2, we start with the creation of the MeSH2Matrix dataset using our novel principle of the matrix of correspondence. Then, we explore leveraging the MeSH2Matrix dataset for biomedical relation classification. For this, we focus on machine learning-based methods (particularly neural networks) and train selected models on the MeSH2Matrix dataset. Finally, we perform extensive feature analysis of the trained neural networks to better understand the efficacy of the representations encoded in the MeSH2Matrix in classifying biomedical relation types.

Fig. 2

Illustration of our methodology. We begin by generating the MeSH2Matrix dataset through our innovative concept of the correspondence matrix. Next, we leverage machine learning approaches, particularly neural networks, for the task of biomedical relation classification using MeSH2Matrix. Lastly, we conduct feature analysis to gain deeper insights into MeSH2Matrix-based classification

MeSH2Matrix

In this section, we cover the underlying principle of the MeSH2Matrix dataset creation as well as the practical workflow employed.

Matrix of correspondence

We build our approach on the assumption that the qualifiers of two co-occurring MeSH terms can outline the type of semantic relation between them, as shown in the examples in Table 1 [4, 8]. Let \(t_1\) and \(t_2\) be two semantically related MeSH terms that are not assigned a relation type. Our method first searches for the PubMed scholarly publications having both \(t_1\) and \(t_2\) as MeSH headings. Then, for each retrieved record, we extract the qualifiers \(q_{t_1}\) and \(q_{t_2}\) (e.g., therapeutic use for Sofosbuvir/therapeutic use) respectively corresponding to \(t_1\) and \(t_2\) (e.g., Sofosbuvir for Sofosbuvir/therapeutic use). This enables the creation of (\(q_{t_1}\), \(q_{t_2}\)) pairs as shown in Fig. 3. When a term is assigned two or more qualifiers (e.g., \(t\)/Z/U for Paper 3 in Fig. 3), this means that the paper deals with a facet of a characteristic of the considered topic. In such a situation, we treat the qualifiers as though they were independently assigned to the MeSH term for the paper (e.g., \(t\)/Z and \(t\)/U for Paper 3 in Fig. 3). We restrict the number of considered publications to the 100 most relevant research papers according to the PubMed Best Match search algorithm [36]. This prevents issues related to the request limits of the NCBI PubMed API (Error 429). After the couples of MeSH qualifiers are retrieved, we draw a matrix of correspondence \(M(t_1,t_2)\). This is a square matrix over the qualifiers (\(q_1,...,q_n\)) (see Footnote 8) where each element \(m_{ij}\) is the number of records featuring both \(t_1/q_i\) and \(t_2/q_j\) as MeSH keywords divided by the total number of records with the two MeSH terms \(t_1\) and \(t_2\):

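$$\begin{aligned} m_{ij} = \frac{\text{number of records with } t_1/q_i \text{ and } t_2/q_j}{\text{number of records with } t_1 \text{ and } t_2}, \quad m_{ij} \in [0,1]. \end{aligned}$$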

(1)

This matrix of correspondence encodes the nature of the semantic relation between \(t_1\) and \(t_2\). As a practical example, as of March 6, 2022, there are 32 PubMed records where Hepatitis C and Sofosbuvir are featured together as MeSH headings. Of these 32 publications, 15 have drug therapy and therapeutic use as the respective qualifiers of Hepatitis C and Sofosbuvir. In this situation, the value representing the association between drug therapy and therapeutic use in the Hepatitis C-Sofosbuvir matrix is 15/32 = 0.469.
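As a minimal sketch of this computation (not the authors' implementation), the Python snippet below builds a correspondence matrix from per-record qualifier lists. The function names, the `/`-separated heading format, the handling of a `*` major-topic marker, the three-qualifier vocabulary and the 17 filler records are illustrative assumptions; only the count of 15 out of 32 records comes from the example above.

```python
from itertools import product


def split_heading(heading):
    """Split a heading such as 'Sofosbuvir/therapeutic use' into (term, qualifiers).

    Assumption: qualifiers follow the term, separated by '/', and a leading '*'
    (major-topic marker) may be present and is ignored.
    """
    term, *qualifiers = [part.strip().lstrip("*") for part in heading.split("/")]
    return term, qualifiers


def correspondence_matrix(records, qualifiers):
    """Return M where M[i][j] is the share of records pairing q_i (subject) with q_j (object).

    `records` holds one (subject_qualifiers, object_qualifiers) tuple per record
    in which both terms co-occur. Every qualifier must appear in `qualifiers`.
    """
    index = {q: k for k, q in enumerate(qualifiers)}
    n = len(qualifiers)
    counts = [[0] * n for _ in range(n)]
    for subject_quals, object_quals in records:
        # Several qualifiers on one term are treated as independently assigned;
        # sets ensure a record is counted at most once per cell.
        for qi, qj in product(set(subject_quals), set(object_quals)):
            counts[index[qi]][index[qj]] += 1
    total = len(records)
    if total == 0:
        return [[0.0] * n for _ in range(n)]
    return [[c / total for c in row] for row in counts]


# Toy vocabulary (hypothetical; a real qualifier list would be longer).
QUALIFIERS = ["drug therapy", "therapeutic use", "virology"]

# 15 records pair drug therapy with therapeutic use, as in the example above;
# the remaining 17 records are made-up filler so that the total is 32.
records = [(["drug therapy"], ["therapeutic use"])] * 15 + [(["virology"], [])] * 17

M = correspondence_matrix(records, QUALIFIERS)
cell = M[0][1]  # 15 / 32
```

Records in which one of the two terms carries no qualifier contribute to the denominator but to no cell, which matches the definition of \(m_{ij}\) as a share of all co-occurrence records.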

Fig. 3

Process for the retrieval of the couples of MeSH qualifiers. \(t_1\) is the subject MeSH term, \(t_2\) is the object MeSH term, \(q_{t_1}\) are the subject qualifiers, \(q_{t_2}\) are the object qualifiers, and c is the set of the couples of the extracted MeSH qualifiers

Building MeSH2Matrix with the matrix of correspondence

Here we describe the concrete workflow through which we create our MeSH2Matrix dataset. The first step involves extracting biomedical relations from Wikidata with the help of our SPARQL query featured in Figure D6. The query identifies all items in Wikidata with a MeSH Descriptor ID (wdt:P486) and stores them in the %item named result set. For each of these items, the query finds any relationships they have with other items and ensures that these related items also have a MeSH Descriptor ID (wdt:P486), storing the identifier value in ?object. The query then returns the MeSH ID of the original item (?subject), the Wikidata property corresponding to the type of relationship (?reltype), and the MeSH ID of the related item (?object). The result of this extraction is a set of N tuples \((s_i,r_i,o_i)\), \(i = 1,...,N\), where \(s_i\) is a subject term, \(o_i\) is an object term and \(r_i\) represents their relation type. The output of the query is saved as a tab-separated values (TSV) file to allow its automatic processing.

Recall that the underlying idea of our MeSH2Matrix dataset is the matrix of correspondence \(M(s_i,o_i)\), which encodes useful relationships of the subject-object association (using only the qualifiers of \(s_i\) and \(o_i\)). To obtain the matrix \(M(s_i,o_i)\) for a given subject term \(s_i\) and object term \(o_i\), we get their respective qualifiers from PubMed using the NCBI Entrez API. Each term is described in PubMed by a qualifier, giving rise to a qualifier couple (\(q_{s_i},q_{o_i}\)) where \(q_{s_i}\) is the qualifier assigned to the subject \(s_i\) and \(q_{o_i}\) is the qualifier assigned to the object \(o_i\). However, a term association can have many couples of MeSH qualifiers. We therefore build the matrix of correspondence \(M(s_i,o_i)\) from all the qualifier couples extracted for the subject \(s_i\) and the object \(o_i\). Each matrix is subsequently assigned the relation type \(r_i\) corresponding to the subject-object association as a label.
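The labelling step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the pipeline used for the dataset: the TSV is assumed to have no header and exactly three columns in the order subject, relation type, object, and `get_matrix` is a hypothetical caller-supplied function that returns the correspondence matrix for a term pair (or `None` when no co-occurrence records exist).

```python
import csv
import io


def load_triples(tsv_text):
    """Parse (subject, relation type, object) rows from tab-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [tuple(row) for row in reader if len(row) == 3]


def build_dataset(triples, get_matrix):
    """Pair each subject-object matrix with its relation type as the label."""
    samples = []
    for subject, reltype, obj in triples:
        matrix = get_matrix(subject, obj)  # hypothetical lookup, e.g. backed by PubMed
        if matrix is not None:
            samples.append((matrix, reltype))
    return samples


# Placeholder identifiers, not real MeSH IDs or Wikidata properties.
triples = load_triples("S1\tR1\tO1\nS2\tR2\tO2\n")
dataset = build_dataset(triples, lambda s, o: [[0.0]] if s == "S1" else None)
```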

For a better analysis of our proposed approach, we extract the features of the considered Wikidata relation types and verify their names as well as whether they are taxonomic, symmetric, or biomedical by applying SPARQL queries on Wikidata using Wikibase Integrator, coupled with human validation.

Biomedical relation classification with machine learning

This section covers our approach to exploring biomedical relation classification with machine learning. It is important to note that our objective here is to demonstrate the efficacy of machine learning methods, particularly neural networks, and not to obtain the optimal machine learning technique for our task. As a result, this research does not consider many other machine-learning approaches.

Machine learning models explored

Machine learning-based approaches handle biomedical relation classification as a supervised learning classification task, where labeled data is used to train models. In this paper, we provide benchmark results on our dataset, using three machine learning models:

SVM:

Support vector machines (SVMs) [37] are best suited for samples with many features because their ability to learn is independent of the dimensionality of the feature space [38]. They have been used extensively in biomedical classification tasks [39,40,41,42] due to their ability to generalize well with data consisting of sparse high-dimensional features. For our baseline, we trained a linear support vector machine. For this, we flattened each \(89\times 89\) matrix into a single feature vector of 7,921 values.

D-Model:

Neural networks (NNs) have produced state-of-the-art results in the area of relation classification [27, 42,43,44]. The major advantage of neural network-based approaches lies in their ability to directly learn the latent feature representation from the labeled training data without requiring experts to carefully craft them [3]. For our experiments with neural networks, we designed D-Model, a simple multi-layer perceptron with an input layer with an output feature size of 3,960, a hidden layer of 1,980 and an output layer with an output feature size corresponding to the number of classes. The rationale for the choice of layer sizes is twofold: (1) we tested different sizes and this configuration gave the best result, and (2) we followed [45,46,47] in keeping the hidden layer size between the input layer size and the output layer size. The ReLU activation function [48] was used between the input and hidden layers to introduce non-linearity. The output layer is connected to a softmax activation function, which converts the model's output into a probability distribution over the classes. Although NNs have shown great promise for relation classification, they are highly susceptible to overfitting [49] and require extensive hyperparameter tuning. Therefore, we experimented with regularization techniques (early stopping and dropout) and tuned the hyperparameters (learning rate, batch size, etc.) in order to produce the best-performing D-Model.

C-Net:

Convolutional neural networks (CNNs) are a type of neural network that can successfully capture the spatial and temporal dependencies in an image through the application of the convolution operation and relevant filters. Their potential was first witnessed in computer vision around 2012 [50], and they have since been used extensively, even in biomedical relation classification [44, 51, 52]. CNNs perform better on image datasets due to the reduction in the number of parameters involved and the reusability of their weights; they are therefore best suited for image-type data. Furthermore, with CNNs we can work directly with the 2-dimensional matrix (instead of flattening it, as for SVM and D-Model). To explore the impact of CNNs on MeSH2Matrix, we decided to interpret our feature matrix as spatially correlated features and designed C-Net, a simple CNN-based architecture made up of four convolution layers (each layer consisting of a 2-dimensional convolution, batch normalization [53], a ReLU activation function [48] and max-pooling) and two fully connected layers (Fig. 4). After the fully connected layers, the final layer uses the softmax activation function to obtain the probabilities of the input matrix belonging to each class. CNN-based models, while very promising, require practical knowledge to configure the model architecture with regard to performance [54] and to set the hyperparameters for the best optimization [55]. We therefore conducted hyperparameter tuning and optimization, as for D-Model, in order to explore reasonable ranges for the sensitive hyperparameters of the classification model.
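The two neural architectures described above can be sketched in PyTorch as follows. This is a hedged illustration, not the released models: the D-Model layer sizes follow the text, but the dropout placement and rate are assumptions, and for C-Net every kernel size, padding, channel count and fully connected width is hypothetical because the text only fixes the number and type of layers. Both sketches return logits and apply softmax at inference, a common PyTorch convention when training with `nn.CrossEntropyLoss`, which expects unnormalized scores.

```python
import torch
from torch import nn


def build_d_model(n_classes, n_features=89 * 89, first=3960, hidden=1980, dropout=0.0):
    """MLP with the layer sizes given in the text; returns class logits."""
    return nn.Sequential(
        nn.Flatten(),                  # 89 x 89 matrix -> 7,921-value vector
        nn.Linear(n_features, first),
        nn.ReLU(),                     # non-linearity between input and hidden layers
        nn.Dropout(dropout),           # placement and rate are assumptions
        nn.Linear(first, hidden),
        nn.Linear(hidden, n_classes),
    )


def conv_block(c_in, c_out):
    """One convolution layer: 2-D convolution, batch norm, ReLU, max-pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # assumed kernel and padding
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),               # halves each spatial side (rounding down)
    )


class CNetSketch(nn.Module):
    """Four convolution blocks followed by two fully connected layers."""

    def __init__(self, n_classes, size=89, channels=(8, 16, 32, 64), fc_hidden=128):
        super().__init__()
        blocks, c_in = [], 1
        for c_out in channels:         # channel widths are hypothetical
            blocks.append(conv_block(c_in, c_out))
            c_in = c_out
        self.features = nn.Sequential(*blocks)
        side = size
        for _ in channels:
            side //= 2                 # 89 -> 44 -> 22 -> 11 -> 5
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(c_in * side * side, fc_hidden),
            nn.ReLU(),
            nn.Linear(fc_hidden, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


batch = torch.rand(2, 89, 89)                               # random stand-in matrices
d_model = build_d_model(n_classes=5, first=64, hidden=32)   # reduced sizes for a quick demo
c_net = CNetSketch(n_classes=5)
d_probs = torch.softmax(d_model(batch), dim=1)
c_probs = torch.softmax(c_net(batch.unsqueeze(1)), dim=1)   # add a channel axis
```

Calling `build_d_model(n_classes)` with the default arguments gives the 7,921 to 3,960 to 1,980 layout described for D-Model; the demo uses smaller widths only to keep it lightweight.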

Experimental setup

We performed two rounds of classification: one with all relation types (195), and another with 5 categories obtained after grouping the initial 195 relation types (see “MeSH2Matrix dataset” section for more details on grouping). Then, we apply three scenarios of classification to the best-performing model according to the first two rounds to study the effect of various factors on the efficiency of biomedical relation classification based on MeSH2Matrix:

Scenario 1: Restriction to the matrices based on 100 scholarly publications or more (Blue in Fig. 5)

Scenario 2: Restriction to the well-represented relation types (10+ matrices)

Scenario 3: Restricted generalization to a unique superclass at once [56]
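The first two scenarios amount to simple filters over the labelled matrices. The sketch below assumes each sample is a dictionary with hypothetical `n_publications` and `label` fields and that the 10+ threshold is counted over the whole dataset; Scenario 3 is not covered because its grouping depends on the superclass definitions.

```python
from collections import Counter


def scenario_1(samples, min_publications=100):
    """Keep matrices built from at least `min_publications` records."""
    return [s for s in samples if s["n_publications"] >= min_publications]


def scenario_2(samples, min_matrices=10):
    """Keep matrices whose relation type has at least `min_matrices` samples."""
    counts = Counter(s["label"] for s in samples)
    return [s for s in samples if counts[s["label"]] >= min_matrices]


# Made-up samples: 12 of type "A" (half of them with 100 records), 3 of type "B".
samples = (
    [{"label": "A", "n_publications": 100}] * 6
    + [{"label": "A", "n_publications": 50}] * 6
    + [{"label": "B", "n_publications": 100}] * 3
)
kept_1 = scenario_1(samples)
kept_2 = scenario_2(samples)
```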

Fig. 4

CNN-based architecture for the classification of the MeSH qualifier-based matrices

We split our dataset into training (33,457 samples), validation (13,012 samples), used for early stopping, regularization and hyperparameter tuning, and testing (9,294 samples), used for the final evaluation of the model. For SVM training, we merged the training and validation sets, making a total of 46,469 samples for training. For the training of D-Model and C-Net, we used the Adam optimizer [57]. The code for all our deep learning experiments was written using the PyTorch deep learning framework [58], while for SVM we implemented the training using the LinearSVC package.

Fig. 5

Top twenty relation types according to the number of generated matrices: Number of generated matrices (Orange), Number of matrices based on 100+ scholarly publications (Blue)

Evaluation metrics

To assess the efficiency of the three proposed models, we rely on four basic measures that provide insights into the behavior of classification algorithms [59]:

True Positives (TP): the number of items correctly assigned to their respective classes

True Negatives (TN): the number of items correctly not assigned to unrelated classes

False Positives (FP): the number of items mistakenly assigned to unrelated classes

False Negatives (FN): the number of items mistakenly not assigned to their respective classes.

These measures are combined to provide the two main statistical metrics used in our study: Accuracy and F1-Score [59]. Accuracy is defined as the ratio of the number of correct predictions to the overall number of predictions, as shown in Eq. 2 [59]:

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$

(2)

The F1-Score combines Precision (\(\frac{TP}{TP + FP}\)) and Recall (\(\frac{TP}{TP + FN}\)) in the following way:

$$\begin{aligned} F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$

(3)
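As a small worked example of Eqs. 2 and 3, the following Python sketch derives the four counts for one class in a one-vs-rest fashion and computes both metrics. The label lists are made up, and how the per-class values are aggregated across classes (for instance macro or weighted averaging) is left open here because the text does not specify it.

```python
def one_vs_rest_counts(y_true, y_pred, cls):
    """Return (TP, TN, FP, FN) for class `cls` against all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Eq. 2: share of correct decisions among all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Eq. 3: harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels for three classes.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, "a")  # (2, 3, 0, 1)
acc_a = accuracy(tp, tn, fp, fn)                           # 5 / 6
f1_a = f1_score(tp, fp, fn)                                # precision 1, recall 2/3
```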

As sample size per class as well as class imbalance alter the values of the Accuracy and the F1-Score [60], we additionally consider three metrics that evaluate these two factors for every scenario: Arithmetic Mean, Geometric Mean, and GA-Ratio [61]. Let \(x_i\) be the size of the class i and N be the number of classes, the arithmetic mean and geometric mean are defined as:

$$\begin{aligned} Arithmetic Mean = \frac{1}{N}\sum \limits _{i=1}^{N}x_i \end{aligned}$$

(4)

$$\begin{aligned} Geometric Mean = \sqrt[N]{\prod \limits _{i=1}^{N}x_i} \end{aligned}$$

(5)

These measures have been shown to efficiently evaluate the sample size per class [61]. The Arithmetic Mean diverges from the Geometric Mean when the class distribution is uneven [61]. This lets us account for the complexity of the classification tasks and judge whether a comparison between different classification situations is reasonable. In this context, the GA-Ratio is the ratio of the geometric mean of the class sizes to their arithmetic mean, as shown in Eq. 6. It is an efficient metric to evaluate the class distribution of a dataset. As a result of the inequality of arithmetic and geometric means, the GA-Ratio always ranges between 0 and 1, where 1 corresponds to an equal class distribution and 0 corresponds to an absolute class imbalance [61]. Here, the Geometric Mean cannot have a value of 0 because every class must have at least one item to exist, so the GA-Ratio is always strictly positive:

$$\begin{aligned} GA = \frac{Geometric Mean}{Arithmetic Mean} \end{aligned}$$

(6)
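The three class-distribution measures are straightforward to compute. The sketch below takes the GA-Ratio as geometric mean over arithmetic mean, the orientation that yields values in (0, 1] by the inequality of arithmetic and geometric means; the class sizes are invented for illustration. The geometric mean is computed through logarithms to avoid overflow when multiplying many large class sizes.

```python
import math


def arithmetic_mean(sizes):
    """Eq. 4: sum of class sizes divided by the number of classes."""
    return math.fsum(sizes) / len(sizes)


def geometric_mean(sizes):
    """Eq. 5: N-th root of the product of class sizes (all sizes must be > 0)."""
    return math.exp(math.fsum(math.log(x) for x in sizes) / len(sizes))


def ga_ratio(sizes):
    """Eq. 6: geometric mean over arithmetic mean; 1.0 means perfectly balanced."""
    return geometric_mean(sizes) / arithmetic_mean(sizes)


balanced = ga_ratio([50, 50, 50])   # equal class sizes
skewed = ga_ratio([1, 10, 100])     # geometric mean 10, arithmetic mean 37
```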

To confirm whether the scenario-based evaluation results for the best-performing model apply to the two other models (SVM and C-Net), we generate confusion matrices for the MeSH2Matrix-driven training of the three machine learning algorithms based on the five superclasses. A confusion matrix is a table that records the associations between the expected classification classes and the predicted ones. Throughout this paper, rows stand for true labels and columns stand for predicted labels. The classes are sorted in the same order in rows and columns for all the generated confusion matrices. As a result, the accurately classified items are featured on the main diagonal, running from the top left to the bottom right [59].
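A confusion matrix following this convention (rows for true labels, columns for predicted labels, same class order on both axes) can be built as in the short Python sketch below; the class names and label lists are placeholders.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count samples per (true class, predicted class) cell."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1   # row = true label, column = prediction
    return matrix


classes = ["a", "b", "c"]                 # placeholder class names
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

cm = confusion_matrix(y_true, y_pred, classes)
correct = sum(cm[k][k] for k in range(len(classes)))  # main diagonal
```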

Feature analysis

Although machine learning models have gained widespread adoption in recent times, several reservations still exist about how these models could be used and the level of trust that should be granted to them. These reservations stem from the black-box nature of the models and have limited their adoption, especially in the medical domain. In their work, [62] demonstrate an inverse non-linear relation between the explainability of a model and its complexity: as a model is trained on a larger amount of data, its complexity increases and it becomes harder to explain. To make the output of a machine learning model more acceptable in the medical AI domain, more human-based reasoning needs to be applied [63] in the form of explainability. To enhance the adoption of deep learning models in the medical AI domain, [62] recommended the use of various explainable AI (XAI) techniques; hence the addition of a feature analysis and explainability section to this work.

XAI techniques for understanding the predictions of a model can be broadly categorized into two classes: model-agnostic techniques and model-specific techniques, of which the model-agnostic techniques are the more popular [64]. Due to the nature of our dataset, model-specific feature permutation and ablation have been the simplest strategies for evaluating the feature significance in our models [64]. However, results have shown that the robustness of these techniques is very limited [65]. In addition, although Shapley Additive Explanations (SHAP) [66] and Local Interpretable Model-agnostic Explanations (LIME) [67] are the most widely adopted techniques for model explainability, [68] showed that they are less robust and highly prone to adversarial attacks. A more flexible gradient-based technique known as Integrated Gradients (IG) was proposed by [69] and reported to be more robust than the previously mentioned techniques [70]; hence our choice of Integrated Gradients as the XAI technique for this work. It is important to note that using integrated gradients for feature importance attribution has its limitations, one of which is that the function learned by the model may be “over-influenced” by just one of the features [71].

Computing the integrated gradients (IG) [69] of neural networks with respect to the input features addresses the problem of feature attribution with a gradient-based approach that satisfies two fundamental axioms: Sensitivity and Implementation Invariance.

Sensitivity here implies that a non-zero attribution value should be assigned to a feature if it is the only feature differing between a baseline input and an input data sample that receive different predictions. Implementation Invariance, on the other hand, concerns functionally equivalent networks: two neural networks with different architectures are functionally equal if they have equal outputs for each input data sample, and the attributions computed for such networks should be identical.

The relevance of integrated gradients, over other techniques, for the explainability of neural networks as highlighted by [69] includes the fact that the use of IG does not require any modification to the original neural network. It can also be used to extract rules from the model and as a tool for debugging deep learning models [69].

Integrated Gradients can be mathematically represented as:

$$\begin{aligned} IntegratedGradient_i(x) := (x_i - x_i^\prime ) \times \int _{\alpha = 0}^{1} \frac{\partial F(x^\prime + \alpha \times (x - x^\prime ))}{\partial x_i} d \alpha \end{aligned}$$

where \(x_i\) and \(x_i^\prime\) are the input data sample and the baseline input along the \(i^{th}\) dimension, F is the function computed by the model, and \(\alpha\) is the interpolation coefficient between the baseline and the input. In order to understand how each feature in our training dataset contributes to the overall decision-making of our model, we employed integrated gradients as an XAI technique to explain our two best-performing models, D-Model and C-Net, at the superclass level. Using IG allows us to compare a CNN model and an ANN model on equal terms, as opposed to other techniques that would only treat the input of a CNN model as an image, which is not the case for our dataset. The inability of the permutation explainer in SHAP to handle datasets with a feature size greater than 225 also informed our choice of integrated gradients.

To compute the integrated gradients for D-Model, we set \(x_i^\prime\) to be a zero-vector with the same shape as \(x_i\). The integral was approximated numerically: \(\partial F\) was evaluated at small intervals along the path between \(x_i^\prime\) (zeros) and \(x_i\) (non-zeros), and the values, each multiplied by the interval size, were summed. The computation was done for 8080 test samples and the results are shown throughout the “Insights from feature analyses” section (Figs. 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and Supplementary Figures S1 to S10). The same computation was done for the C-Net model; however, \(x_i^\prime\) was set to a zero-matrix of shape \(89\times 89\) to match the image-like input of a CNN model. \(89\times 89\) represents the dimensions of the MeSH2Matrix matrices, as we considered 89 predefined MeSH qualifiers in this work.
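The numerical approximation described above can be illustrated with a short, dependency-free Python sketch. It is not the code used for the reported results: the stand-in model is a hypothetical three-feature logistic function with made-up weights and an analytic gradient, the integral is approximated with a midpoint Riemann sum, and the step count is an arbitrary choice. A useful sanity check is that the attributions sum approximately to the difference between the model output at the input and at the baseline.

```python
import math

WEIGHTS = [1.0, -2.0, 0.5]   # hypothetical model parameters
BIAS = 0.1


def model(x):
    """Stand-in for F: a logistic function of a weighted sum."""
    z = sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def model_gradient(x):
    """Analytic gradient of `model` with respect to each input feature."""
    y = model(x)
    return [y * (1.0 - y) * w for w in WEIGHTS]


def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate the path integral from `baseline` to `x` with a midpoint sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps                      # midpoint of the k-th interval
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g / steps                     # gradient times interval size
    return [(v - b) * t for v, b, t in zip(x, baseline, totals)]


x = [0.4, 0.0, 0.9]          # made-up input; the second feature equals the baseline
baseline = [0.0, 0.0, 0.0]   # zero baseline, as in the text

attributions = integrated_gradients(model_gradient, x, baseline)
gap = abs(sum(attributions) - (model(x) - model(baseline)))
```

A feature whose value equals the baseline receives zero attribution, since the factor \((x_i - x_i^\prime)\) vanishes.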
