Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph

Predicting influential scholarly documents

One of the primary objectives of this workflow was to identify influential scholarly documents within different categories and uncategorized data. The World Health Organization (WHO) assigned the articles in the used corpus to four categories, which may not be sufficiently granular; we therefore further divided the dataset into thirty categories. This categorization aimed to evaluate the impact of more finely categorized data on identifying influential scholarly documents. The classification workflow was executed in two experimental setups to achieve this goal. As depicted in Fig. 1, the first experiment utilized a corpus with four labelled categories, which the World Health Organization verified. The second experiment employed a corpus with thirty labelled categories, obtained with a machine learning-based clustering method. The workflow for categorizing the scholarly documents into thirty labels is presented in Fig. 2. Fig. 3 depicts the experiment that utilized the corpus with thirty labelled categories, which we produced using the categorization method of Fig. 2. As shown in Fig. 4, the four-label, thirty-label, and uncategorized data were considered as input for the influential scholarly document prediction. The impact of categorized versus uncategorized data on the classification task was compared using different machine learning-based methods.

Table 1 Label count (used for this experiment) of the World Health Organization (WHO) (COVID-19 Global literature on coronavirus disease) corpus

Input corpus

The primary data source for this work is the World Health Organization (WHO) COVID-19 corpus, containing scholarly documents mostly from biology and medicine applicable to the COVID-19 crisis (Table 1). The WHO COVID-19 corpus was enriched with citation counts from the OpenCitations corpus by queries based on the respective scholarly document DOIs. The four values of the target variable (“Topics” according to the WHO), which the WHO verified, were utilized for the multi-class classification task, and topic modeling extracted the thirty additional target classes. For the influential scholarly document classification, the binary target variable was derived from the citation counts, which are not part of the WHO corpus and were obtained from OpenCitations, using the median citation count as a threshold.
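To make the target construction concrete, the following minimal sketch derives the binary influence label from citation counts using the median as the threshold. The column names, the example DOIs, and the decision to label documents at or above the median as influential are assumptions for this illustration only.

import pandas as pd

# Hypothetical DataFrame: one row per scholarly document, with citation
# counts previously retrieved from OpenCitations via the DOI.
df = pd.DataFrame({
    "doi": ["10.1186/s13054-020-03219-4", "10.1000/example-1", "10.1000/example-2"],
    "citation_count": [42, 3, 17],
})

# Binary target for the influential scholarly document classification:
# documents at or above the median citation count are labelled 1 (assumed).
threshold = df["citation_count"].median()
df["influential"] = (df["citation_count"] >= threshold).astype(int)
print(threshold, df["influential"].tolist())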

For the scholarly document classification task, we enhanced the dataset by adding different levels of categorization. The “WHO-4” dataset was enhanced with four different target classes (Vaccines, Long_Covid, Traditional_medicine, and Variants), which the WHO verified. On the other hand, the “WHO-30” dataset was enhanced with thirty different target classes, which were obtained through a machine learning-based clustering method.

In the next phase of this study, we aim to investigate the effect of utilizing an uncategorized versus a categorized corpus on the classification of influential scholarly documents. To accomplish this, samples of 50,000 and 1,906 scholarly documents about COVID-19, published in 2022 and 2023 respectively, were randomly selected from the World Health Organization’s database. However, subsets of 8,080 and 69 scholarly documents were removed from these samples due to the absence of abstracts.

Fig. 5 Elbow curve for choosing the number of clusters

Table 2 Example terms for each cluster for 5 example classes (from 30)

Machine learning methods (clustering and topic modeling)

To categorize the WHO dataset into thirty different categories (which we refer to as the WHO-30 dataset), we employed the k-means clustering [22] method. The k-means clustering algorithm is widely used for grouping similar data points based on their features. Additionally, we applied Principal Component Analysis (PCA) [23] to reduce the dimensionality of the data. The n_components parameter specifies how many principal components to keep; we set it to 0.95, which means the method keeps the number of principal components needed to explain 95% of the variance. PCA is used to identify patterns in data and reduce the number of features while maintaining important information. A random state of 42 was used to initialize the PCA algorithm to ensure the reproducibility of the results. We utilized TF-IDF vectorization from the scikit-learn library to vectorize the data and employed k-means on the vectorized data to cluster the scholarly documents. To determine the optimal number of clusters, k, we utilized the Elbow Method [24], which involves computing the sum of squared distances from each point to its assigned cluster center. In Fig. 5, we present the elbow curve for choosing the number of clusters.
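The pipeline described above can be sketched roughly as follows. The toy abstracts, the vectorizer settings, and the small range of k values are placeholders for illustration; in the study the full WHO abstracts were used and k = 30 was chosen from a larger range.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder abstracts; in the study these are the WHO corpus abstracts.
abstracts = [
    "vaccine efficacy and antibody response in clinical trials",
    "booster dose immunity and antibody titers after vaccination",
    "mechanical ventilation for critically ill covid patients",
    "oxygen therapy and intensive care unit outcomes",
    "long covid symptoms and fatigue after infection",
    "persistent symptoms and rehabilitation in long covid",
]

# TF-IDF vectorization of the abstracts.
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)

# PCA keeping enough components to explain 95% of the variance (random_state=42).
pca = PCA(n_components=0.95, random_state=42)
X_reduced = pca.fit_transform(X.toarray())

# Elbow method: inertia (sum of squared distances to assigned centers) for several k.
k_values = range(2, 6)  # the study explored a larger range and chose k = 30
inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_reduced).inertia_
    for k in k_values
]

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Sum of squared distances")
plt.show()

# Final clustering with the chosen k (kept small here to match the toy data).
cluster_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_reduced)
print(cluster_labels)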

Fig. 6 Overview of the workflow for the test and train split of the WHO-4 and WHO-30 data. The number in the dataset name (e.g., 3 in WHO-4-3) is the id of the label within the dataset; here, three corresponds to the WHO-4 label Traditional_medicine

We applied clustering to the entire WHO corpus of scholarly documents and identified the essential keywords in each cluster. We used k-means clustering to group the documents and topic modeling to identify the themes within each cluster. To identify the themes efficiently, we extracted keywords that characterize each cluster. To discover the topics of the scholarly documents in each cluster, we employed the Latent Dirichlet Allocation (LDA) [25, 26] approach for topic modeling. LDA represents each scholarly document as a distribution over topics and each topic as a distribution over keywords. In Table 2, we present an example of the keyword list for each topic: the keywords in the “Terms” column were identified as the most probable terms to be associated with each topic and were used to define the topic clusters. The first three keywords were chosen as the cluster name based on their high probability of being associated with that topic.
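A rough sketch of this step using scikit-learn's LDA implementation is shown below. The text above does not name the specific LDA implementation, so the classes, parameters, and toy documents here are assumptions; only the idea of topics as term distributions and the three-term cluster names is taken from the description.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the abstracts assigned to one k-means cluster.
cluster_docs = [
    "vaccine efficacy trial immune response antibody",
    "vaccine dose booster immunity antibody titer",
    "hospital ventilation oxygen therapy intensive care",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(cluster_docs)

# Each LDA topic is a distribution over the vocabulary.
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:10]]
    # The three most probable terms give the cluster name (cf. Table 2).
    print(f"Topic {topic_idx}: {'_'.join(top_terms[:3])} -> {top_terms}")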

Train and test split

In this study, we employed a stratified sampling approach to split the data for our influence-level classification task. Specifically, we split the test and train data within each of the categorized datasets (WHO-4 and WHO-30) to ensure that the distribution of classes within the test and train sets is representative of the overall class distribution within the categorized data. This ensures that the model is not biased towards any particular class when making predictions. For the uncategorized data, we utilized the entire corpus and applied a single train and test split across it. This allows us to evaluate the performance of our models on both categorized and uncategorized data, provides an overall view of the performance of our method, and lets us compare the models when they are trained and tested on categorized versus uncategorized data, which helps us understand the effect of categorization on model performance. The train and test split workflow for the WHO-4 and WHO-30 categorized data is presented in Fig. 6.
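A minimal sketch of such a stratified split with scikit-learn is shown below; the toy data and the 80/20 ratio are assumptions, as the exact split ratio is not stated above.

from sklearn.model_selection import train_test_split

# Toy abstracts and binary influence labels for one categorized subset (e.g., WHO-4-3).
X = [f"abstract text {i}" for i in range(10)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,   # split ratio assumed for illustration
    stratify=y,      # preserve the class distribution in both splits
    random_state=42,
)
print(y_train, y_test)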

Data pre-processing

According to the literature review, stopword removal enhances interpretability but has minimal effect on classifier accuracy [18]. We also evaluated substituting tokens with their stems and with their respective lemmas. Both methods reduced classifier accuracy and were therefore not employed in the pre-processing stage.

Document representation methods

The Term Frequency-Inverse Document Frequency (TF-IDF) weighting method is the most popular document representation method for scholarly documents. The TF-IDF document representation was built by extracting uni-grams, bi-grams, and tri-grams from the scholarly documents.

The main goal of the binary bag-of-words (BOW) document representation method was to improve interpretability. Like the TF-IDF representation, the BOW representation was built from extracted uni-grams, bi-grams, and tri-grams, but as a binary incidence matrix.
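Both representations can be sketched with scikit-learn's vectorizers as follows. Only the n-gram range is taken from the text above; the toy abstracts and any other vectorizer settings are assumptions for this illustration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

abstracts = [
    "mechanical ventilation in critically ill patients",
    "lung microbiome and antibiotic treatment in intensive care",
]

# TF-IDF representation over uni-, bi-, and tri-grams.
X_tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(abstracts)

# Binary bag-of-words: a 0/1 incidence matrix over the same n-gram range.
X_bow = CountVectorizer(ngram_range=(1, 3), binary=True).fit_transform(abstracts)

print(X_tfidf.shape, X_bow.shape)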

The Bidirectional Encoder Representations from Transformers (BERT) model was utilized as the state-of-the-art, embeddings-based document representation method. As with the other document representation techniques, we used the same WHO corpus and applied the BERT tokenizer to it. We used the pre-trained model with twelve hidden layers and twelve attention heads. The weights were those released by the original authors, pre-trained on English Wikipedia [27, 28] and the BooksCorpus [27, 28].
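As an illustration, a document embedding can be obtained from the pre-trained model roughly as follows. Pooling the final hidden state of the [CLS] token is one common choice and is an assumption here, since the text above does not specify how document vectors were pooled.

import torch
from transformers import BertModel, BertTokenizer

# bert-base-uncased: 12 layers, 12 attention heads, weights released by the original authors.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

abstract = "The importance of airway and lung microbiome in the critically ill."
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Final hidden state of the [CLS] token as a 768-dimensional document vector.
doc_embedding = outputs.last_hidden_state[:, 0, :]
print(doc_embedding.shape)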

Machine learning methods (Influential scholarly documents prediction)

For the machine learning experiments, we utilized the random forest [7, 29], linear support vector machine (Linear SVC) [30], and logistic regression [31] classifiers as implemented in scikit-learn, a free machine learning library for Python. For training and testing purposes, we considered each scholarly document’s abstract.

Table 3 Overview of the input parameter grid

For the neural network (BERT) experiment, we utilized the open-source Transformers library [13] from Hugging Face. In our experiment, the BERT model processed the input tokens in three steps, where each input token representation is the sum of its token, segment, and position embeddings. We used bert-base-uncased [28] for this experiment, which contains 12 Transformer encoders, 12 attention heads, and 110M parameters. Uncased means the model does not differentiate between upper-case and lower-case text. For the pre-trained BERT model, we used the “BertTokenizer” [13] from this library because the model has a specific, fixed vocabulary, and the BERT tokenizer has a particular way of handling out-of-vocabulary words. For tokenizing the corpus, we used “batch_encode_plus,” which tokenizes each document and maps every token to its integer id in the vocabulary. To design the classification task, we utilized “BertForSequenceClassification,” where “from_pretrained” was used to load the pre-trained model and num_labels to set the number of labels. We used the “AdamW” [13] optimizer from the Hugging Face library.
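A condensed sketch of this setup is given below. The learning rate, batch size, toy data, and the use of torch.optim.AdamW (equivalent to the Hugging Face AdamW mentioned above, which is deprecated in recent library versions) are assumptions for illustration.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy abstracts and binary influence labels.
abstracts = ["first abstract text about vaccines", "second abstract text about ventilation"]
labels = torch.tensor([1, 0])

# batch_encode_plus tokenizes the abstracts and maps tokens to vocabulary ids.
encodings = tokenizer.batch_encode_plus(
    abstracts, padding=True, truncation=True, max_length=512, return_tensors="pt"
)

loader = DataLoader(
    TensorDataset(encodings["input_ids"], encodings["attention_mask"], labels), batch_size=2
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate is an assumption

model.train()
for input_ids, attention_mask, batch_labels in loader:
    optimizer.zero_grad()
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
    out.loss.backward()
    optimizer.step()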

We also tuned different hyperparameters to find the best result for the different machine learning methods. We used k-fold cross-validation from scikit-learn, which provides cross-validation with grid-search hyperparameter optimization via the GridSearchCV class. We used the inner loop of a nested cross-validation, where the training dataset defined by the outer loop is used as the dataset for the inner loop. We also configured the hyperparameter search to refit a final model on the entire training dataset using the best hyperparameters. As described before, we utilized nested cross-validation for fine-tuning the hyperparameters. Nested cross-validation is an approach to model hyperparameter optimization that attempts to overcome the problem of overfitting the training dataset. The procedure treats model hyperparameter optimization as part of the model itself and evaluates it within the broader k-fold cross-validation procedure used for comparing and selecting models. A set of different hyperparameters for the different machine learning methods was optimized according to the grid in Table 3.
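A compact sketch of this nested procedure with scikit-learn is shown below; the toy data, the pipeline, the parameter grid, and the numbers of folds are illustrative placeholders rather than the values from Table 3.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy abstracts and binary influence labels.
texts = ["covid vaccine antibody response"] * 10 + ["oxygen therapy intensive care"] * 10
labels = [1] * 10 + [0] * 10

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
    ("clf", RandomForestClassifier(random_state=42)),
])
param_grid = {"clf__n_estimators": [100, 300], "clf__max_depth": [None, 20]}

# Inner loop: grid search over the hyperparameters, refitting on the best ones.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, refit=True)

# Outer loop: the tuning itself is treated as part of the model and evaluated.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(search, texts, labels, cv=outer_cv)
print(scores.mean())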

Feature extraction for model explainability

Numerous feature importance approaches have been proposed for the random forest, support vector machines, logistic regression, and neural networks (BERT), but many of these methods fail to produce consistent results [32]. In addition, feature significance for models like the random forest can be computed using various model-independent techniques. In this study, Shapley values (SHAP) [33] and LIME [34] were used. These techniques give local feature importance values for a specific test instance, in contrast to the scikit-learn feature_importances_ method for the random forest, which provides global feature importance scores.

SHapley Additive exPlanations (SHAP). The SHAP value builds on Shapley’s notion from game theory. Each observation receives a unique set of SHAP values, allowing for local interpretation of the data; aggregating these values over observations also makes a global interpretation possible.

Local Interpretable Model-agnostic Explanations (LIME). LIME demonstrates which feature values affect a specific prediction, and how. This explanation can only be considered approximate because the LIME model is built by perturbing the explained instance, varying the feature values and observing the effect of each change on the prediction. The explanation is obtained by locally replacing the explained model with an interpretable one.

Feature importance for Random Forest. Random forest models are challenging to understand because they contain many trees; the individual trees are sophisticated, and many trees jointly influence a decision. However, the random forest learning approach is designed so that producing estimates of feature relevance scores is straightforward [35]. In this research, we obtained random forest feature importance scores using scikit-learn’s feature_importances_ technique, which is based on the mean and standard deviation of the accumulated impurity decreases within each tree.
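The three explainability approaches described above can be sketched together as follows. The toy corpus and the specific SHAP and LIME calls (TreeExplainer, LimeTextExplainer) are illustrative assumptions, not necessarily the exact configuration used in this study.

import numpy as np
import shap
from lime.lime_text import LimeTextExplainer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = ["covid vaccine antibody response"] * 10 + ["oxygen therapy intensive care"] * 10
labels = [1] * 10 + [0] * 10

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
rf = RandomForestClassifier(random_state=42).fit(X, labels)

# Global importance: scikit-learn's impurity-based feature_importances_.
features = tfidf.get_feature_names_out()
top = np.argsort(rf.feature_importances_)[::-1][:5]
print([(features[i], round(rf.feature_importances_[i], 3)) for i in top])

# Local importance with SHAP for one test instance.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X[:1].toarray())  # per-feature contributions

# Local explanation with LIME on the raw text, via the full pipeline.
pipeline = make_pipeline(tfidf, rf)
lime_explainer = LimeTextExplainer(class_names=["non-influential", "influential"])
explanation = lime_explainer.explain_instance(texts[0], pipeline.predict_proba, num_features=5)
print(explanation.as_list())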

Leveraging the domain-independent Knowledge Graph DBpedia for improved influential paper classification

An annotation tool called DBpedia Spotlight [36] was used to extract entities from the abstracts. We utilized those entities to identify the corresponding DBpedia resources, and the results were filtered based on confidence, support, and similarity score measures. For example, ’antibiotic,’ ’mechanical_ventilation,’ and ’dysbiosis’ are three entities that were extracted from the abstract of the scholarly document “The importance of airway and lung microbiome in the critically ill” [37] (DOI: https://doi.org/10.1186/s13054-020-03219-4). These entities fulfill all the requirements from Table 4 (the parameters used for this experiment). The selected entities were connected to their respective URIs from DBpedia.
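A minimal sketch of such an annotation call is shown below, assuming the public DBpedia Spotlight web service; the confidence and support values here are placeholders, the actual thresholds being those listed in Table 4.

import requests

# Public DBpedia Spotlight annotation endpoint (the service can also be self-hosted).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

abstract = ("The importance of airway and lung microbiome in the critically ill: "
            "antibiotic use, mechanical ventilation and dysbiosis ...")

response = requests.get(
    SPOTLIGHT_URL,
    params={"text": abstract, "confidence": 0.5, "support": 20},  # placeholder thresholds
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

# Each annotated resource carries its DBpedia URI, surface form, support, and similarity score.
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], resource["@URI"], resource["@similarityScore"])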

Table 4 DBpedia Spotlight Parameters

In DBpedia Spotlight, “confidence” measures the likelihood that a text fragment corresponds to a specific DBpedia resource. The confidence value is between 0 and 1, generated by the DBpedia Spotlight’s annotation algorithm. The higher the confidence score, the more likely the text fragment corresponds to the identified DBpedia resource.

In DBpedia Spotlight, “support” is a metric that counts how many times a particular DBpedia resource is mentioned in the annotated text. This metric can be used to determine the popularity or frequency of an entity in the text and can provide additional information on the relevance and context of the annotation. The DBpedia Spotlight API returns the support count and other information, such as the identified resource’s confidence score, surface form, offset, and URI. This information can help evaluate the relevance and quality of the annotation and identify errors or mistakes in the annotation process.

In DBpedia Spotlight, “similarity score measures” are methods used to calculate the similarity between a text fragment and a DBpedia resource, which is used in determining the confidence of the annotation. They compare the text fragment being annotated to the resources available in the DBpedia KG by comparing the surface form of the text fragment with the labels and alternative labels of the resources in the graph, and by comparing the context of the text fragment with the abstract of the resource.

Generators. A KG offers a diverse spectrum of additional features, such as specific relations, unqualified relations, qualified relations, entity types, etc. [38]. In this work, we obtained such features from the DBpedia KG and converted these newly generated features into additional columns. The input was designed to contain at least one column holding URIs to establish connections between the KGs. For example, for ’antibiotic,’ the generators produce the additional features ’Anti-infective_agents’ and ’Bactericides.’ We only utilized the Direct Type and Unqualified Relation generators for this experiment.

Direct type - Direct types refer to the explicit assignment of a class or category to an entity (using rdf:type) within the graph. This assignment allows for more accurate and efficient querying of the KG and understanding of the entities within it.

Unqualified relation - Unqualified relations refer to edges in the graph that lack a more formal qualification, such as a label or a type, which can make it challenging to understand the definition and context of the relationship. This can be a significant limitation when utilizing DBpedia for particular applications like information extraction and semantic search. Our research focused on identifying and measuring the instances of unqualified relations in the DBpedia knowledge graph and utilizing them for the influential scholarly document prediction task. As an example, the unqualified connection between the entities “Prague” and “Charles Bridge” in DBpedia would be “There is a connection between Prague and Charles Bridge”; the relation itself does not provide any additional information about the nature of that connection.
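The two generators can be illustrated at the SPARQL level as follows. This is only a conceptual sketch against the public DBpedia endpoint; it is not necessarily the implementation used to build the feature columns in this study, and the example resources (Antibiotic, Prague, Charles_Bridge) are taken from the text above purely for illustration.

from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

# Direct types: explicit rdf:type assertions of an annotated entity (e.g., dbr:Antibiotic),
# each of which could become an additional binary feature column.
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?type WHERE { <http://dbpedia.org/resource/Antibiotic> rdf:type ?type }
""")
types = [b["type"]["value"] for b in sparql.query().convert()["results"]["bindings"]]

# Unqualified relation: does any edge connect two resources, regardless of the predicate?
sparql.setQuery("""
    ASK WHERE {
      { <http://dbpedia.org/resource/Prague> ?p <http://dbpedia.org/resource/Charles_Bridge> }
      UNION
      { <http://dbpedia.org/resource/Charles_Bridge> ?p <http://dbpedia.org/resource/Prague> }
    }
""")
connected = sparql.query().convert()["boolean"]
print(len(types), connected)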
