Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature

DDI corpus and annotation guideline

There are two sample pools in this study. The first is the screened sample pool, which consists of PubMed abstracts retrieved with the keyword queries [“drug interaction” AND (Type of Study)] and [“drug combination” AND (Type of Study)], where “Type of Study” is one of the categories defined in Table 1: clinical trial, pharmaco-epidemiology study, and case report. Sample abstracts are reviewed and annotated according to the DDI abstract selection criteria in Table 1, yielding a corpus of 933 positive DDI abstracts and 799 negative abstracts; these are the initial labeled samples in the screened sample pool. Table 1 presents the inclusion and exclusion criteria for screened-pool abstract selection. In this study, 5,000 abstracts are randomly selected from the screened samples to form the screened sample pool.

The other sample pool is the unscreened sample pool. It consists of 10,000 abstracts randomly selected from PubMed that do not overlap with the screened sample pool. This unscreened sample pool, in contrast, contains data that are largely not DDI relevant. The data distributions of the screened and unscreened sample pools are shown in Table 2.

Two annotators with complementary skills in biology and informatics developed this corpus. Mrs. Shijun Zhang has a master’s degree in biology and has worked in Dr. Li’s lab for 7 years with the primary research responsibility of corpus development for drug-interaction text mining [27]; Mrs. Weixin Xie, a PhD student in medical informatics, has conducted pharmacology and drug-interaction text-mining research under Dr. Li’s supervision. Labeling training and education begin with an initial calibration step: the two annotators label each abstract according to the inclusion and exclusion criteria outlined in Table 1 [28], their agreement is evaluated on the first 30 positive abstracts (30 in each of the three DDI categories), and they receive further training based on that analysis.

Table 1 Inclusion and exclusion criteria for clinical DDI abstract selection

Table 2 Statistics of DDI corpus

Sampling strategies in active learning

Uncertainty sampling in AL refers to selecting the least confident new samples, e.g. abstracts with predicted probability around 0.5 in a binary classification (i.e. DDI relevant or not), for the next round of labeling and training in the machine learning analysis.

Positive sampling refers to selecting the most confidently positive new samples, e.g. those with predicted probability close to 1 in a binary classification, for the next round of labeling and training. Positive sampling is essential in the unscreened sample pool, because most of its samples are negative. In the screened sample pool, where most samples are positive, the positive sampling scheme is not needed.

Random negative sampling selects a random subset of the unscreened pool as negative samples, because more than 99% of the unscreened-pool abstracts are not DDI related. These random negative samples may contain a very small fraction of positive samples [33].
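The three pool-level selection rules above can be sketched as follows, assuming a fitted scikit-learn-style classifier `clf` with `predict_proba`, an unlabeled feature matrix `X_pool`, and illustrative batch sizes (all names here are assumptions, not from the original implementation):

```python
import numpy as np

def uncertainty_sampling(clf, X_pool, k=100):
    """Select the k samples whose predicted DDI probability is closest to 0.5."""
    p = clf.predict_proba(X_pool)[:, 1]          # P(DDI positive)
    return np.argsort(np.abs(p - 0.5))[:k]       # least confident first

def positive_sampling(clf, X_pool, k=100):
    """Select the k samples predicted positive with the highest confidence."""
    p = clf.predict_proba(X_pool)[:, 1]
    return np.argsort(-p)[:k]                    # probability closest to 1 first

def random_negative_sampling(n_pool, k=100, seed=0):
    """Draw a random subset of the unscreened pool and treat it as negative."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_pool, size=k, replace=False)
```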

Similarity sampling aims to quickly screen out samples that resemble the samples already in the corpus. The cosine similarity (cosSIM), based on the TF-IDF (Term Frequency-Inverse Document Frequency) [34] of each unlabeled sample and all the samples in the corpus, is used for this evaluation. The TF(t) and IDF(t) of a term t (word t) are formulated as

$$TF\left(t\right)=\frac{\mathrm{Number\;of\;times\;term}\;t\;\mathrm{appears\;in\;an\;abstract}}{\mathrm{Total\;number\;of\;terms\;in\;the\;abstract}};\quad IDF\left(t\right)=\mathrm{ln}\frac{\mathrm{Total\;number\;of\;abstracts}}{\mathrm{Number\;of\;abstracts\;containing\;term}\;t};$$

TF(t) measures how frequently a term t occurs in an abstract, and IDF(t) measures how important the term t is. In fact, certain terms that occur too frequently have little power in determining relevance; therefore, we need to weigh up the effects of the less frequently occurring terms. The TFIDF of term t is then computed as follows:

$$TFIDF\left(t\right)=TF\left(t\right)\times IDF\left(t\right);$$

Multiplying TF(t) and IDF(t) yields the TFIDF score of a term t in an abstract; the higher the score, the more relevant that term is in that particular abstract. For each abstract, we derived the 30 key terms with the highest TFIDF scores, and the frequency vector of each abstract over these terms was used to calculate the cosine similarity (cosSIM). For example, if abstracts A and B are represented as two n-dimensional vectors, \(A=(a_{1},\dots ,a_{n})\) and \(B=(b_{1},\dots ,b_{n})\), the cosine similarity between A and B is given by the formula below:

$$cosSIM\left(A,B\right)=\frac{A\cdot B}{\left\|A\right\|\left\|B\right\|}=\frac{\sum_{i=1}^{n}a_{i}b_{i}}{\sqrt{\sum_{i=1}^{n}a_{i}^{2}}\sqrt{\sum_{i=1}^{n}b_{i}^{2}}}.$$

Here, the cosine similarity of abstracts A and B ranges from 0 to 1. In this study, each sample in a pool has similarity values with the DDI-related abstracts in the corpus; the higher the similarity value, the more likely the abstract is DDI related. Similarity sampling is applied in conjunction with the other sampling strategies in both sample pools, which benefits the training of the models.
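A minimal sketch of the similarity scoring, assuming `corpus_texts` (labeled DDI abstracts) and `pool_texts` (unlabeled abstracts) are lists of strings; it uses scikit-learn’s full TF-IDF vectors rather than the 30 key terms per abstract described above, so it approximates rather than reproduces the exact procedure:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_scores(corpus_texts, pool_texts, top_k=20):
    """Score each pool abstract by its maximum cosine similarity to the DDI corpus."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    corpus_vecs = vectorizer.fit_transform(corpus_texts)   # TF-IDF of labeled DDI abstracts
    pool_vecs = vectorizer.transform(pool_texts)           # same vocabulary for pool abstracts
    sims = cosine_similarity(pool_vecs, corpus_vecs)       # shape: (n_pool, n_corpus)
    best = sims.max(axis=1)                                # closest corpus abstract per pool abstract
    top_idx = np.argsort(-best)[:top_k]                    # candidates for manual review
    return top_idx, best[top_idx]
```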

Existing AL analyses use only uncertainty sampling. In this paper, we study whether random negative sampling, positive sampling, and similarity sampling increase the performance of the AL analysis.

Active learning with random negative sampling converges to the same optimal classifier as standard active learning

In our AL analysis, random negative sampling avoids manual labeling and thus reduces expense, but the small fraction of mislabeled negative samples it introduces requires correction to avoid classifier bias. Through the iterative AL process, we expect this bias to be reduced asymptotically to zero as the sample size grows. However, this random negative sampling scheme is beyond the scope of the current AL framework [35, 36], which allows no mislabeled samples. Here is a heuristic proof to clarify that AL with random negative sampling converges to the same AL optimal classifier.

Let us use a notion similar to that of Balcan and Long [37]. We assume that the data points \((x, y)\) are drawn from an unknown underlying distribution \(D\) over \(X\times Y\). \(X\) is called the feature space (e.g. word frequencies in abstracts), and \(Y\) is the abstract label. Here, \(Y=\{-1,+1\}\) and \(X=\mathbb{R}^{d}\), where \(d\) is the dimension. Without loss of generality, we further assume that the feature space X is centered at 0 after a linear transformation. Let \(\mathbb{H}\) be the class of linear classifiers through the origin, that is, \(\mathbb{H}=\{w:w\in \mathbb{R}^{d},\left\|w\right\|=1\}\). In AL, the goal is to identify a classifier \(w\in \mathbb{H}\) with small misclassification error, where \(err\left(w\right)=P_{(x,y)\sim D}\left[sign\left(w\bullet x\right)\ne y\right]\). Balcan and Long showed that, for an arbitrarily small error \(\epsilon\) and probability \(\delta\), an AL needs at most \(O\left(\left(d+\mathrm{log}(1/\delta )+\mathrm{log}\,\mathrm{log}\left(1/\epsilon \right)\right)\mathrm{log}(1/\epsilon )\right)\) labeled samples to identify a classifier with misclassification error less than \(\epsilon\) with probability higher than \(1-\delta\). This AL theory requires no misclassification error among the sample labels.

In the unscreened sample pool, let us assume the mislabeled negative sample size, \(n_{mis}\), is much smaller than the number of negative samples in random negative sampling, \(n_{-}\), i.e. \(n_{mis}\ll n_{-}\). The number of true positive samples in the training set, \(n_{+}\), is also smaller than \(n_{-}\). Therefore, the error rate before AL is approximated in Eq. (1). Using the AL classifier, \(n_{mis}\times n_{-}/(n_{-}+n_{+})\) of the mislabeled negative samples will be predicted to be positive, and their labels will be calibrated through manual labeling in the AL. After AL calibration, the error rate will be reduced to Eq. (2).

$$Error\;rate\;before\;AL:\;\epsilon+\frac{n_{mis}}{n_{-}};$$

(1)

$$Error\;rate\;after\;AL:\;\epsilon+\frac{n_{mis}}{n_{-}}\times\frac{n_{+}}{n_{-}+n_{+}}.$$

(2)

Practically, consider \(n_{mis}=n_{-}/1000\) and \(n_{+}=n_{-}/4\). The error is \(\epsilon+0.001\) before AL, \(\epsilon+0.001/5\) after one step of AL calibration, and \(\epsilon+0.001/5^{m}\) after m steps. Therefore, the misclassification error due to the mislabeled data goes to zero extremely fast. This heuristic proof does not yet consider complications such as nonlinear classifiers, general log-concave distributions, and inseparable positive and negative data in \(\mathbb{R}^{d}\). Existing AL theories [37, 38] have shown that error \(\epsilon\) still holds with the required \(O\left(\left(d+\mathrm{log}(1/\delta )+\mathrm{log}\,\mathrm{log}\left(1/\epsilon \right)\right)\mathrm{log}(1/\epsilon )\right)\) labeled samples under these conditions. We, however, use a similar argument to show that the mislabeling error \(n_{mis}/(n_{-}+n_{+})\) becomes small after a number of AL steps.
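A short numeric check of the error decay above, using the illustrative ratios \(n_{mis}=n_{-}/1000\) and \(n_{+}=n_{-}/4\) from the text; the per-step shrinkage factor \(n_{+}/(n_{-}+n_{+})=1/5\) follows the reconstruction in Eq. (2), and the pool size is an arbitrary illustration:

```python
# Residual mislabeling error (excess over epsilon) after m AL calibration steps:
# (n_mis / n_minus) * (1/5)**m, since n_plus / (n_minus + n_plus) = (1/4)/(5/4) = 1/5.
n_minus = 100_000                       # illustrative number of random negative samples
n_mis = n_minus / 1000                  # mislabeled "negatives" that are actually positive
n_plus = n_minus / 4                    # labeled positive samples
shrink = n_plus / (n_minus + n_plus)    # fraction of mislabeled samples left uncorrected per step

for m in range(5):
    print(m, n_mis / n_minus * shrink ** m)   # 0.001, 0.0002, 4e-05, ...
```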

Positive sampling improves AL optimization when the sample population is overwhelmingly negative

Following the same notation, \(\epsilon\) is the prespecified misclassification error. Collecting positively labeled samples is not an easy task using uncertainty sampling alone when the sample population is overwhelmingly negative. Here, the population positive sample size, \(N_{+}\), is significantly smaller than the population negative sample size, \(N_{-}\). The misclassification error rate of negative samples is \(\epsilon\times\frac{N_{-}}{N_{-}+N_{+}}\), while the misclassification error of positive samples is \(\epsilon\times\frac{N_{+}}{N_{-}+N_{+}}\). Given a positive sample in the sample pool available for selection, uncertainty sampling focuses on misclassified samples, and it has an \(\epsilon\times\frac{N_{+}}{N_{-}+N_{+}}\) chance of selecting this positive sample. On the other hand, in positive sampling, the top \(\alpha\) (a percentage) of positively predicted samples will be selected. Hence, positive samples have a \(\left(1-\epsilon\times\frac{N_{+}}{N_{-}+N_{+}}\right)\times\alpha\cong\alpha\) chance of being selected, since \(N_{+}\ll N_{-}\). Therefore, positive sampling has an \(\alpha/\left(\epsilon\times\frac{N_{+}}{N_{-}+N_{+}}\right)=\frac{\alpha}{\epsilon}\times\frac{N_{-}+N_{+}}{N_{+}}\) times higher probability of selecting the positive sample than uncertainty sampling. Practically, considering \(N_{+}\) = 50,000 positive DDI or PG related abstracts and \(N_{-}\) = 25,000,000 negative abstracts in PubMed, a misclassification rate \(\epsilon=0.20\), and the top \(\alpha =20\%\) of positively predicted samples being selected, positive sampling has a \(\frac{\alpha}{\epsilon}\times\frac{N_{-}+N_{+}}{N_{+}}=501\) times higher chance of selecting this positive sample in AL than uncertainty sampling does.
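A quick arithmetic check of the 501-fold factor with the numbers quoted above (\(N_{+}\) = 50,000, \(N_{-}\) = 25,000,000, \(\epsilon=0.20\), \(\alpha=20\%\)):

```python
N_plus, N_minus = 50_000, 25_000_000
eps, alpha = 0.20, 0.20

p_uncertainty = eps * N_plus / (N_minus + N_plus)   # chance uncertainty sampling picks the positive
p_positive = (1 - p_uncertainty) * alpha            # ~alpha, since N_plus << N_minus
ratio = alpha / p_uncertainty                       # (alpha/eps) * (N_minus + N_plus) / N_plus
print(round(ratio))                                 # 501
```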

Active learning implementation with multiple sampling schemes in the unscreened sample pool (Fig. 1)

Random negative sampling and initial training and validation datasets: following the random negative sampling scheme, the initial training set contains 100 random negative samples from the unscreened sample pool and 100 labeled positive samples from the screened sample pool, on which machine learning model ML1 is trained. The initial external validation set consists of 50 labeled negative samples, 50 positive samples, and 50 random negative samples.

Uncertainty sampling, positive sampling, and similarity sampling: after predicting the unlabeled samples in the pool, a random subset (100 samples) of low-confidence samples (uncertainty sampling) and of high-confidence positively predicted samples (positive sampling) is collected from the unscreened sample pool. In the meantime, using the similarity values between these extracted samples and the samples in the corpus, the top 20 samples with high similarity are extracted and manually reviewed (similarity sampling).

Updating training and validation sets: the samples reviewed and labeled through the preceding multiple sampling process are divided equally between the initial training and external validation datasets, producing the new training set and external validation set for the next round.

Re-training: using the updated training set, ML1 is re-trained, and the multiple sampling scheme is applied again. In total, four iterations are performed in the active learning analysis.

Performance evaluation: the performance of ML1 from all rounds of the AL analysis is evaluated using the updated external validation dataset. A minimal code sketch of one such iteration is given below.
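The following sketch ties the steps above together for one AL iteration in the unscreened pool. The helpers `uncertainty_sampling`, `positive_sampling`, and `similarity_scores` are the sketches given earlier; `manual_review`, `train_model`, `train_set`, and `valid_set` are hypothetical stand-ins for the manual annotation step, the model-fitting routine, and the running datasets, not names from the original implementation:

```python
import numpy as np

def al_iteration_unscreened(clf, X_pool, pool_texts, corpus_texts,
                            train_set, valid_set, manual_review, train_model):
    """One AL round: multiple sampling -> manual labeling -> split -> retrain ML1."""
    # 1. Uncertainty and positive sampling draw candidates from the unscreened pool.
    candidates = np.union1d(uncertainty_sampling(clf, X_pool, k=100),
                            positive_sampling(clf, X_pool, k=100))

    # 2. Similarity sampling keeps the 20 candidates closest to the DDI corpus.
    top_idx, _ = similarity_scores(corpus_texts,
                                   [pool_texts[i] for i in candidates], top_k=20)
    selected = candidates[top_idx]

    # 3. Manual review assigns labels to the selected abstracts (hypothetical step).
    labeled = manual_review(selected)            # e.g. list of (pool index, label) pairs

    # 4. Newly labeled samples are split equally between training and validation sets.
    half = len(labeled) // 2
    train_set.extend(labeled[:half])
    valid_set.extend(labeled[half:])

    # 5. ML1 is re-trained on the updated training set for the next round.
    return train_model(train_set), train_set, valid_set
```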

Active learning implementation with multiple sampling schemes in the screened sample pool (Fig. 1)

Datasets: the initial training set contains 100 positive samples and 100 negative samples, on which machine learning model ML2 is trained. The initial external validation set consists of 50 labeled negative samples, 50 positive samples, and 50 random negative samples.

Uncertainty sampling and similarity sampling: because AL in the screened sample pool uses labeled samples as training sets, only uncertainty sampling and similarity sampling are applied to the unlabeled samples in the screened sample pool. After predicting the unlabeled samples in the screened sample pool, a random subset (100 samples) of low-confidence samples (uncertainty sampling) is collected. Then, using the similarity values between the extracted samples and the samples in the corpus, the top 20 samples with high similarity are extracted and manually reviewed (similarity sampling).

Updating training and validation sets: the samples reviewed and labeled through the preceding multiple sampling process are divided equally between the initial training and external validation datasets, producing the new training set and external validation set for the next round.

Re-training: using the updated training set, ML2 is re-trained, and the multiple sampling scheme is applied again. In total, four iterations are performed in the active learning analysis.

Performance evaluation: the performance of ML2 from the four rounds is evaluated using the updated external validation dataset.

Data preprocessing

All abstracts are processed after downloading from PubMed. The desired content (titles and abstracts) is parsed and converted into GENIA format, and the abstract files are saved in text format in a folder. After lowercase conversion and stop-word tokenization, a basic whitespace tokenizer produces a Doc object for each file, consisting of the text split on single space characters. This Doc produces the tokens that are fed into the models.
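A minimal preprocessing sketch, under the assumptions that the abstracts are plain-text files in one folder and that a small illustrative stop-word list is used; the GENIA conversion and the exact tokenizer classes of the original pipeline are not reproduced here:

```python
import os

STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "are"}  # illustrative subset

def load_and_tokenize(folder):
    """Lowercase, remove stop words, and split each abstract on single space characters."""
    docs = {}
    for fname in os.listdir(folder):
        if not fname.endswith(".txt"):
            continue
        with open(os.path.join(folder, fname), encoding="utf-8") as fh:
            text = fh.read().lower()
        tokens = [tok for tok in text.split(" ") if tok and tok not in STOPWORDS]
        docs[fname] = tokens          # the token list fed into the models
    return docs
```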

Fig. 1 Stratified active learning with multiple sampling schemes

Machine learning and deep learning analyses

A support vector machine (SVM) is used as the traditional machine learning method in AL. The appearance frequency of terms from the Doc followed a Poisson distribution and was represented as a categorical term-document occurrence matrix based on word counts. Terms with low frequency standard deviations (SDs) were considered to lack useful information and specificity; therefore, terms with frequency SD > 0.03 were selected as features and used to train the models.
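A minimal sketch of this SVM pipeline in scikit-learn, assuming hypothetical `train_texts`/`train_labels` holders for the training abstracts and labels; `VarianceThreshold` with a 0.03² cutoff stands in for the frequency SD > 0.03 rule (an approximation, since it filters on raw count variance):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.svm import SVC

# Term-document occurrence matrix -> keep terms with frequency SD > 0.03 -> SVM.
svm_pipeline = Pipeline([
    ("counts", CountVectorizer(lowercase=True, stop_words="english")),
    ("sd_filter", VarianceThreshold(threshold=0.03 ** 2)),    # variance > 0.0009, i.e. SD > 0.03
    ("svm", SVC(kernel="linear", probability=True)),          # probabilities needed for sampling
])

# Hypothetical usage:
# svm_pipeline.fit(train_texts, train_labels)
# probs = svm_pipeline.predict_proba(pool_texts)[:, 1]
```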

FastText [39, 40] is used as a relatively simple deep learning (DL) algorithm in the AL analysis. We utilize the “torch” module as the text-mining package in Python. FastText is a multi-step approach for text classification (Fig. 2); a minimal sketch of such a model follows the layer descriptions below.

Fig. 2

Input layer: it is a document consisting of words, for example, “loratadine”, “increases”, “the”, “myopathy”, “risk”, “of”, “simvastatin”.

Embedding layer: it maps words and character N-grams (N = 2) into embedding vectors by looking up a hashed dictionary initialized from the global vectors (GloVe). The input words and N-grams are represented as an array of embeddings from which the features are extracted.

Pooling layer: the pooling layer produces a fixed-length vector by performing element-wise averaging over all the word and N-gram embeddings, which is then passed to the output layer.

Softmax regression: the sigmoid function \(\varphi \left(z\right)=\frac{1}{1+e^{-z}}\) is used to formulate the prediction probability for an abstract: DDI positive or DDI negative.
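A minimal FastText-style classifier sketch in PyTorch, reflecting the embedding, mean-pooling, and sigmoid output layers described above. The vocabulary size, embedding dimension, and example token IDs are illustrative; the hashing of character N-grams and the GloVe initialization are assumed to happen upstream and are not shown:

```python
import torch
import torch.nn as nn

class FastTextClassifier(nn.Module):
    """Embedding -> mean pooling over words and character N-grams -> linear -> sigmoid."""
    def __init__(self, vocab_size=100_000, embed_dim=100):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")  # embedding + pooling
        self.fc = nn.Linear(embed_dim, 1)                                     # output layer

    def forward(self, token_ids, offsets):
        pooled = self.embedding(token_ids, offsets)       # element-wise average of embeddings
        return torch.sigmoid(self.fc(pooled)).squeeze(1)  # P(abstract is DDI positive)

# Example: two abstracts packed into one flat ID tensor with offsets.
model = FastTextClassifier()
ids = torch.tensor([3, 17, 42, 7, 99, 5], dtype=torch.long)
offsets = torch.tensor([0, 3], dtype=torch.long)          # abstract 1: ids[0:3], abstract 2: ids[3:]
probs = model(ids, offsets)                               # tensor of shape (2,)
```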

Performance evaluation

The DDI IR AL analysis is evaluated using the following evaluation metrics: Precision (P) = TP/(TP + FP), Recall (R) = TP/(TP + FN), and the F1-score = (2*P*R)/(P + R). P is reported when R is set to 0.95. This pre-specified high recall rate ensures that we miss only a small fraction (i.e. 0.05) of DDI-relevant papers in our DDI IR analysis.
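A minimal sketch of reporting precision at the pre-specified recall of 0.95, assuming `y_true` holds the gold labels and `y_score` the predicted DDI probabilities from the external validation set (the toy arrays are illustrative only):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def precision_at_recall(y_true, y_score, target_recall=0.95):
    """Precision at the best operating point whose recall still reaches the target."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    ok = recall >= target_recall
    return precision[ok].max()          # best precision among points with R >= 0.95

# Toy example.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_score = np.array([0.9, 0.4, 0.8, 0.7, 0.35, 0.65, 0.3, 0.55, 0.85, 0.6])
p = precision_at_recall(y_true, y_score)
f1 = 2 * p * 0.95 / (p + 0.95)          # F1 with recall fixed at 0.95
```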
