An annotated dataset for extracting gene-melanoma relations from scientific literature

In this section, we briefly review the core content of MGDB used to build the MGR base dataset. Then, the procedure to collect the genes linked to melanoma is described. Next, we show how the dataset was exploited with machine learning techniques. Finally, we report on the procedure to annotate the large collection of publications about melanoma that are part of the MGR extended dataset.

MGDB

In MGDB (Melanoma Gene Database), the genes and their relationships with melanoma were manually extracted from PubMed abstracts. Each gene in the database can be accessed through the gene Basic Information web page (Fig. 2).

Fig. 2

The MGDB Basic Information page reports a description of the annotated genes (in this case APAF-1)

This page contains the GeneID (e.g., 317) in Entrez Gene [23], the official gene symbol (e.g., APAF-1), and the synonyms of the gene in Entrez Gene (e.g., CED4). In addition, the Basic Information page includes the PubMed ID (PMID) of the abstract and the snippets in the abstract that explicitly support the causal relation between the gene and melanoma (Fig. 3). However, the full text of the abstract is not included.

Fig. 3

The snippets associated with a gene (in this case APAF-1) contain the textual evidence supporting the relation between the gene and melanoma

A snippet can be a complete sentence in the abstract, but also a smaller fragment of it, or it can be split across adjacent sentences. An example for each of the three cases is given below:

Complete sentence: BRAF and NRAS mutations are commonly acquired during melanoma progression.

Smaller fragment: The results confirm that c-kit is vastly expressed in uveal melanoma,...

Across adjacent sentences: Previously identified as well as novel driver genes were detected by scanning CNAs of breast cancer, melanoma and liver carcinoma. Three predicted driver genes (CDKN2A, AKT1, RNF139) were found common in these three cancers by comparative analysis.

The genes in the snippets can be mentioned by their official symbol, by their full name, by one of their synonyms, or even by a pronoun. Melanoma can appear in any of its variants depending on whether it affects the skin (e.g., skin melanoma), the eyes (e.g., uveal melanoma), or, for example, the nasal mucosa (e.g., mucosal melanoma). Melanoma is also mentioned using generic terms such as tumor or cancer and, in a few cases, is not mentioned directly at all, as in the following example:

Importantly, metastatic outgrowth was found to be consistently associated with activation of the transforming growth factor-beta signaling pathway (confirmed by phospho-SMAD2 staining) and concerted up-regulation of POSTN, FN1, COL-I, and VCAN genes-all inducible by transforming growth factor-beta).

Table 1 summarizes the data on MGDB at two levels of analysis granularity.

Table 1 MGDB contains 1,272 relations at concept level and at least 1,403 at mention level

At the level of concepts, where an entity consists of all the mentions which refer to one conceptual entity, the database contains 1,272 relations between the gene and melanoma concepts that co-occur in 910 abstracts. At the level of mentions, where the same snippet could contain more than one mention of the gene and where the exact position of the mentions is unknown, the number of relations can only be estimated based on the number of snippets. Since one snippet contains at least one relation, we can assume that 1,403 relations is a reasonable conservative estimate of the total number of relations for the 1,403 snippets. Figure 4 provides a concrete example for the two types of concept- and mention-level annotation.

Fig. 4

Concept-level: there is one relation between gene 〈ID: 30014 〉 and melanoma 〈ID: D008545 〉. Mention-level: there are three relations between three mentions (SPANX, SPANX, sperm protein associated with the nucleus) of gene 〈ID: 30014 〉 and two mentions (melanoma, melanoma) of disease 〈ID: D008545 〉

In article 〈PMID: 19318807 〉 gene 〈SPAN-X 〉 has thirteen mentions formed by the set but one concept (ID: 30014). Melanoma has eight mentions formed by the set but one concept (ID: D008545). At concept-level there is one relation between gene 〈ID: 30014 〉 and melanoma 〈ID: D008545 〉. At mention-level there are three relations. The first relation is between one mention (SPANX) of concept 〈ID: 30014 〉 and one mention (melanoma) of concept 〈ID: D008545 〉 expressed in the snippet:

A high percentage of skin melanoma cells expresses SPANX proteins.

The second relation is between one mention (SPANX) of concept 〈ID: 30014 〉 and one mention (melanoma) of concept 〈ID: D008545 〉. Then, the third relation is between one mention (sperm protein associated with the nucleus) of concept 〈ID: 30014 〉 and one mention (melanoma) of concept 〈ID: D008545 〉. These last two relations are expressed in the snippet:

The expression of SPANX (sperm protein associated with the nucleus in the X chromosome) gene family has been reported in many tumors, such as melanoma,
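The difference between the two granularities can be illustrated with a small sketch. The record type below is hypothetical (MGDB itself does not expose mention-level records); it simply encodes the three mention pairs of the example above and counts the relations at both levels.

```python
from typing import NamedTuple


class MentionRelation(NamedTuple):
    """Hypothetical record for one mention-level relation."""
    pmid: str
    gene_id: str          # Entrez Gene ID
    gene_mention: str
    disease_id: str       # MeSH ID
    disease_mention: str
    snippet: int          # index of the supporting snippet (illustrative)


relations = [
    MentionRelation("19318807", "30014", "SPANX", "D008545", "melanoma", 0),
    MentionRelation("19318807", "30014", "SPANX", "D008545", "melanoma", 1),
    MentionRelation("19318807", "30014",
                    "sperm protein associated with the nucleus",
                    "D008545", "melanoma", 1),
]

# Mention level: every mention pair counts as a separate relation.
mention_level = len(relations)

# Concept level: mention pairs collapse onto (document, gene, disease).
concept_level = len({(r.pmid, r.gene_id, r.disease_id) for r in relations})

print(mention_level, concept_level)
```

Collapsing on the (document, gene concept, disease concept) triple is what turns the three mention-level relations into a single concept-level relation.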

Despite the high quality of the annotated relations in MGDB, this annotation suffers from some limitations that need to be addressed to make it consumable by text mining algorithms. These limitations can be summarized as follows:

Relations between two entities can appear in many snippets of an abstract, but sometimes MGDB just reports one snippet for each single entity pair in the abstract. Browsing the database we observed a certain number of such cases. For example, for document 〈PMID: 23537197 〉 and gene 〈PARP1 〉 the following snippet has been annotated in MGDB:

Genetic variants in PARP1 (rs3219090) and IRF4 (rs12203592) genes associated with melanoma susceptibility in a Spanish population.

Nevertheless, this document contains other snippets in which the same relation is expressed, e.g.,

We confirm the protective role in Malignant Melanoma of the rs3219090 located on the PARP1 gene (p-value 0.027).

We confirmed the proposed role of rs3219090, located on the PARP1 gene, and rs12203592, located on the IRF4 gene, as protective to Malignant Melanoma.

The position of the genes and melanoma in the snippets is unknown. Moreover, a gene and melanoma can occur more than once in a snippet. For example, to provide evidence of the relation between gene EIF1AX and melanoma in article 24423917, MGDB shows this snippet:

The tumour showed mutations in GNA11 and EIF1AX that are typical for uveal melanoma and absent from cutaneous melanoma.

Melanoma entities are not always explicitly mentioned in the snippets. For example, for article 15133496 and gene FAP, this snippet is reported:

FAP expression is specifically silenced in proliferating melanocytic cells during malignant transformation.

There is no information about the type of melanoma induced by the genes.

MGR base dataset collection

To overcome the limitations described at the end of the previous section and build the MGR base dataset, the following 5 steps were applied.

1.

Gene collection from MGDB was performed by querying the gene Basic Information web page (Fig. 2). For each gene, we retrieved its GeneID, its snippets and the PMIDs of the abstracts containing the snippets (Fig. 3). These data were downloaded on March 27, 2018 from the MGDB website (footnote 2).

2.

Abstract collection was conducted using the PMIDs from step 1 as input to retrieve the abstracts from PubMed (footnote 3). These data were collected on April 4, 2018. There are two main reasons why we used whole abstracts rather than individual snippets. Firstly, a snippet in MGDB does not always correspond to a complete sentence. For example, for document 〈PMID: 23103111 〉 and gene 〈OCA2 〉 this snippet has been annotated:

supports existing GWAS data on the relevance of the OCA2 gene in melanoma predisposition,

However, NLP tools often require entire sentences to work properly. Secondly, the annotated entities in the abstracts that do not take part in any relationship were used to generate the negative examples necessary for applying machine learning techniques.

3.

Mention detection was obtained with OGER [24], a fast and accurate Named Entity Recognition and Linking tool for the biomedical domain. Firstly, the gene and disease mentions in the abstracts were recognized. OGER was configured to annotate the entities that are included in Entrez Gene. We annotated their text span (e.g., RUNX3, malignant melanoma), and their category (i.e., Gene, Disease). Secondly, the human genes were selected by matching the gene mentions with the human genes in MIM (Mendelian Inheritance in Man) [25]. Finally, the melanoma diseases were picked out by retaining those disease mentions whose text span contained the term melanoma. Coreference resolution of terms such as tumor or cancer to recognize a small minority of remaining and not yet recognized melanoma diseases is left to a subsequent study.

4.

Concept identification was applied to (i) annotate the genes recognized in the previous step with their concept identifier from Entrez Gene produced with OGER, (ii) associate the most general concept of melanoma (ID: D008545) in the MeSH hierarchy to the annotated diseases. The fact that MGDB does not specify any information about the type of melanoma involved in the relation is the reason for adopting a unique concept ID for all the melanomas.

5.

Relation extraction between the annotated mentions and concepts was performed by leveraging the mention- and concept-level annotation provided with MGDB.

The mention-level annotation of the dataset was created by linking the gene and melanoma mentions co-occurring within a snippet on condition that (i) such relation exists in MGDB, and (ii) the entity mentions in the snippet are not split across sentences. Since a snippet can contain multiple mentions of the same gene or melanoma concept, and MGDB does not specify the mention pair involved in the relationship, this can lead to the production of multiple gene-melanoma relations. For example, in the snippet below three mentions of gene 〈STK11 〉 (LKB1, STK11, LKB1) are linked to melanoma by our procedure, even if only one of such mentions (LKB1) is actually involved in the relationship.

Germline mutations in the LKB1 gene (also known as STK11) cause the Peutz-Jeghers Syndrome, and somatic loss of LKB1 has emerged as causal event in a wide range of human malignancies, including melanoma, lung cancer, and cervical cancer.

We estimate that 14.87% of the snippets in MGDB contain more than one mention of the same entity and could therefore cause this issue. However, this percentage also includes several cases where multiple mentions of a gene do all participate in the relationship. For example, in the snippet below both 〈bone morphogenetic protein 4 〉 (gene official full name) and 〈BMP4 〉 (gene official symbol) are in relationship with melanoma:

An altered expression of bone morphogenetic protein 4 (BMP4) has been found in malignant melanoma cells.

Similarly, in the following snippet both 〈melanocortin 1 receptor 〉 (gene official full name) and 〈MC1R 〉 (gene official symbol) are involved in the relationship:

Furthermore variants in melanocortin 1 receptor (MC1R) and microphthalmia-associated transcription factor (MITF) give a moderately increased risk to develop melanoma.

Generating the mention-level annotation aims at (i) preserving the snippets that express the relations in MGDB, and (ii) providing researchers with automatically pre-processed entity mentions. The SemEval-2013 Task 9 dataset is an example of a dataset that uses a similar type of annotation.

The concept-level annotation was produced generating one relation for each 〈gene,melanoma 〉 concept pair in an abstract, provided that such relation exists between two mentions of these concepts in a snippet of the abstract. It offers a natural way to (i) go beyond the limitation of the annotation used in MGDB, which sometimes reports just one snippet per relation, (ii) exploit the information at mention-level to decide if a relation holds between two concepts in the abstract. Examples of datasets that use a similar kind of annotation are the BioCreative V Track 3 and BioCreative VI Track 5 datasets.
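Steps 3 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Mention` type and the function names are hypothetical, the gene symbol `STK11` stands in for the Entrez GeneID, and mentions are assumed to arrive already recognized (e.g., by OGER) with the index of their sentence.

```python
from dataclasses import dataclass
from itertools import product

MELANOMA_MESH_ID = "D008545"  # most general melanoma concept in MeSH


@dataclass(frozen=True)
class Mention:
    text: str
    category: str    # "Gene" or "Disease"
    concept_id: str  # placeholder identifiers in this sketch
    sentence: int    # index of the sentence containing the mention


def keep_melanoma(mentions):
    """Keep genes; keep diseases only if their span contains 'melanoma',
    mapping them to the single general MeSH concept."""
    kept = []
    for m in mentions:
        if m.category == "Gene":
            kept.append(m)
        elif "melanoma" in m.text.lower():
            kept.append(Mention(m.text, m.category, MELANOMA_MESH_ID, m.sentence))
    return kept


def link_mentions(mentions, mgdb_gene_ids):
    """Mention level: pair every gene mention listed in MGDB for this snippet
    with every melanoma mention found in the same sentence."""
    genes = [m for m in mentions
             if m.category == "Gene" and m.concept_id in mgdb_gene_ids]
    diseases = [m for m in mentions if m.category == "Disease"]
    return [(g, d) for g, d in product(genes, diseases)
            if g.sentence == d.sentence]


def concept_relations(pairs):
    """Concept level: one relation per (gene concept, melanoma concept) pair."""
    return {(g.concept_id, d.concept_id) for g, d in pairs}


# Mentions of the LKB1/STK11 snippet quoted above, all in sentence 0.
snippet_mentions = [
    Mention("LKB1", "Gene", "STK11", 0),
    Mention("STK11", "Gene", "STK11", 0),
    Mention("Peutz-Jeghers Syndrome", "Disease", "other", 0),
    Mention("LKB1", "Gene", "STK11", 0),
    Mention("melanoma", "Disease", "unmapped", 0),
    Mention("lung cancer", "Disease", "other", 0),
    Mention("cervical cancer", "Disease", "other", 0),
]

kept = keep_melanoma(snippet_mentions)
pairs = link_mentions(kept, mgdb_gene_ids={"STK11"})
print(len(pairs), concept_relations(pairs))
```

On this snippet the sketch produces three mention-level relations, one for each gene mention, and a single concept-level relation, which mirrors the over-generation discussed in the text.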

At the end of these steps, there are 211 out of 1,403 (15.04%) relations at mention-level and 28 out of 1,272 (2.2%) at concept-level which, despite being in MGDB, have not been annotated in the dataset.

Regarding the mention-level annotation, we manually verified that most of the missed relations concern occurrences of melanoma that are not explicitly mentioned in the snippets and that we therefore cannot recognize (140 mentions); the other cases refer to genes not correctly identified by OGER (71 mentions). Below, we report an example of each of the two cases:

Melanoma not explicitly mentioned: BRAF and GNAQ mutations in melanocytic tumors of the oral cavity.

Entities not recognized by OGER (e.g., miR-222): this suggests that targeted therapies suppressing miR-221/-222 may prove beneficial in advanced melanoma.

As far as the concept-level annotations are concerned, the missing relations are due to genes not correctly identified (26 concepts), and melanoma entities that we could not recognize (3 concepts). Table 2 shows that, despite these shortcomings, more than 84% of the relations in MGDB have been maintained in the mention-level dataset and more than 97% in the concept-level dataset. Finally, 3 out of 910 abstracts were found to be without relations and were removed from consideration.
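The retained percentages follow directly from the counts given in the text. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
# Relations reported in MGDB and relations missed during annotation,
# as stated in the text.
mgdb = {"mention": 1403, "concept": 1272}
missed = {"mention": 211, "concept": 28}

# Relations maintained in the MGR base dataset at each level.
retained = {level: mgdb[level] - missed[level] for level in mgdb}

# Share of MGDB relations maintained, in percent.
retained_pct = {level: round(100 * retained[level] / mgdb[level], 2)
                for level in mgdb}

print(retained, retained_pct)
```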

Table 2 Number of relations in MGDB and of relations that have been maintained in the MGR base dataset

To ensure the reproducibility of our results, the dataset is randomly split into two parts: a training set for training the models (2/3 of the data) and a test set for evaluating them (1/3). Figure 5 shows the annotation for article 〈PMID: 15986140 〉 in the test set. Consistent with the license adopted for MGDB, we make the dataset freely available (footnote 4) for research purposes only.
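A reproducible random split of this kind can be sketched as below. The text does not state the unit of splitting or the random seed; splitting by abstract and the seed value are assumptions of this sketch, and the identifiers are placeholders rather than real PMIDs.

```python
import random


def split_dataset(doc_ids, test_fraction=1 / 3, seed=0):
    """Shuffle document IDs with a fixed seed and split them into
    (train, test); the fixed seed makes the split repeatable."""
    ids = sorted(doc_ids)               # fixed starting order
    random.Random(seed).shuffle(ids)    # deterministic shuffle
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# 907 placeholder document IDs (the number of abstracts kept in the dataset).
train_ids, test_ids = split_dataset(range(907))
print(len(train_ids), len(test_ids))
```

Splitting at the level of whole abstracts, rather than sentences, would keep all sentences of a document on the same side of the split.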

Fig. 5

PubTator annotation for article 〈PMID: 15986140 〉. Concept-level: gene 〈ID: 6774 〉 is related to melanoma 〈ID: D008545 〉. Mention-level: gene 〈ID: 6774 〉 at position START:10,END:15 is related to melanoma 〈ID: D008545 〉 at position START:24,END:32

MGR base dataset validation

NLP annotation and biomedical curation applications are based on different needs. In NLP, relation extraction usually requires considering the mentions of given entities in the document and deciding whether two specific mentions are connected by a relation. Conversely, typical curation applications only need to know that a document supports the existence of a relation between two given entities (without considering the specific mentions expressing the relation).

With this view in mind, we formulate the relation extraction task as a two-class classification problem. First, in the training phase our models are trained on the mention-level annotation of the dataset that specifies the mention pair involved in the relationship. Then, in the evaluation phase the trained models are used to predict whether in a sentence a given candidate mention pair is in relationship or not. Since the predicted relations have to be evaluated between pairs of entities rather than between pairs of mentions, the mention-level annotation produced by the models is transformed into the corresponding concept-level annotation. This is done assuming that a relationship between two entities exists if at least one relation between a mention pair of the entities is found by the models in the abstract.
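The transformation from mention-level predictions to concept-level relations can be sketched as follows. The tuple layout and the document and gene identifiers are hypothetical; only the aggregation rule (a relation holds if at least one mention pair is predicted positive) comes from the text.

```python
def to_concept_level(mention_predictions):
    """Collapse mention-level predictions into concept-level relations.

    mention_predictions: iterable of (doc_id, gene_id, disease_id, label),
    where label is 1 if the classifier predicts a relation for that
    mention pair and 0 otherwise.
    """
    return {(doc_id, gene_id, disease_id)
            for doc_id, gene_id, disease_id, label in mention_predictions
            if label == 1}


predictions = [
    ("DOC_A", "GENE_1", "D008545", 0),  # first mention pair: no relation
    ("DOC_A", "GENE_1", "D008545", 1),  # second mention pair: relation
    ("DOC_A", "GENE_2", "D008545", 0),  # only negative predictions
]

concept_predictions = to_concept_level(predictions)
print(concept_predictions)
```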

Training the models at mention-level enables us (i) to apply standard NLP relation extraction approaches that work at sentence-level, rather than at document-level, (ii) to use the snippets reported in MGDB that explicitly support the given relations and that have been annotated with the target mentions. Evaluating the models at concept-level is in the spirit of better matching the actual requirements of practical applications.

The examples for training and testing the models were generated as follows. First, the dataset was processed with spaCy (footnote 5; model en_core_web_sm-2.0.0) for tokenization, sentence splitting, and lemmatization. Then, positive and negative examples were generated for all the sentences containing at least one gene and one melanoma entity. Concerning the training data, the following rule was adopted:

In the pseudo-code above, a relation holds if the sentence containing the two entity mentions is equal to or includes the text snippet that in MGDB captures the relation between the two mentions. The rationale for discarding some examples is to reduce the false negatives produced by models trained on relations that, despite being positive, were not annotated in the MGR base training set because of the one-snippet-per-relation constraint.
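One reading of this rule that is consistent with the description can be sketched as follows. It is an assumption, not the authors' pseudo-code: a candidate pair is taken to be positive when its sentence contains an MGDB snippet for that pair, discarded when MGDB reports the relation for the abstract but in a different snippet, and negative otherwise. The function name and the third example sentence are hypothetical.

```python
def label_training_pair(sentence, mgdb_snippets):
    """Label one candidate gene-melanoma pair found in `sentence`.

    mgdb_snippets: snippets that MGDB reports for this gene in this
    abstract (empty if MGDB has no relation for the pair).
    """
    if any(snippet in sentence for snippet in mgdb_snippets):
        return "positive"   # sentence equals or includes a snippet
    if mgdb_snippets:
        return "discard"    # related in MGDB, but evidenced elsewhere
    return "negative"       # no relation in MGDB for this pair


parp1_snippets = [
    "Genetic variants in PARP1 (rs3219090) and IRF4 (rs12203592) genes "
    "associated with melanoma susceptibility in a Spanish population."
]

title = parp1_snippets[0]
other = ("We confirm the protective role in Malignant Melanoma of the "
         "rs3219090 located on the PARP1 gene (p-value 0.027).")
unrelated = "GENE_X was not altered in the melanoma samples."  # hypothetical

print(label_training_pair(title, parp1_snippets))
print(label_training_pair(other, parp1_snippets))
print(label_training_pair(unrelated, []))
```

Under this reading, the second PARP1 sentence is discarded rather than used as a negative example, since it expresses a true relation that MGDB simply does not list as a snippet.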

We also create a “test dataset”, representative of the unlabeled data that would be found in a real-world application of our system. In this test dataset each sentence represents an example to be classified. To evaluate the predicted relations between pairs of entities rather than between pairs of mentions, we also produced a concept-level annotation of the test dataset. In this regard, the following rule was adopted:

In the pseudo-code above, a relation holds if the abstract containing the two entities is equal to the abstract that in MGDB captures the relation between the two entities.

After the data had been generated, each sentence was represented by a set of features that depends on the model used (see “Models” section).

For model development, the training set was split into two parts: a dev-train for training (2/3 of the training data mentioned in the previous section) and a dev-test for tuning the models (1/3). Then, the resulting models were used to annotate the test set.

Finally, to establish a lower bound on the expected performance on the dataset, we adopted the two baselines used in the evaluation process of the BioCreative V chemical-disease relation (CDR) task: co-occurrence of genes and melanoma entities across sentences in the whole abstract (abstract-level baseline), and co-occurrence of gene and melanoma entity mentions in the same sentence only (sentence-level baseline).
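The two co-occurrence baselines can be sketched as below. The input representation (one pair of concept-ID sets per sentence) and the gene identifiers are assumptions made for illustration.

```python
from itertools import product


def sentence_level_baseline(sentences):
    """Predict a relation for every gene and melanoma concept that
    co-occur in the same sentence."""
    pairs = set()
    for genes, diseases in sentences:
        pairs |= set(product(genes, diseases))
    return pairs


def abstract_level_baseline(sentences):
    """Predict a relation for every gene and melanoma concept that
    co-occur anywhere in the abstract."""
    all_genes = set().union(*(genes for genes, _ in sentences))
    all_diseases = set().union(*(diseases for _, diseases in sentences))
    return set(product(all_genes, all_diseases))


# One abstract as a list of (gene IDs, disease IDs) per sentence.
abstract = [
    ({"GENE_1"}, set()),        # gene only
    ({"GENE_2"}, {"D008545"}),  # gene and melanoma in the same sentence
]

print(sentence_level_baseline(abstract))
print(abstract_level_baseline(abstract))
```

The abstract-level baseline always predicts a superset of the sentence-level predictions, so it trades precision for recall.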

Models

The MGR base dataset was tested with both traditional models like decision trees, and more recent neural network-based models such as Convolutional Neural Network (CNN) and Bidirectional Encoder Representations from Transformers (BERT).

The success of deep learning for NLP applications largely depends on word vector representations, also known as word embeddings. With word embeddings, words are represented by real-valued vectors of tens or hundreds of dimensions. This representation is learnt in such a way that words that are used in similar ways have similar vector representations. This differs from the thousands or millions of dimensions needed by more traditional word models, where each word has a unique representation, independent of how it is used. As an example, consider the two sentences: POT1 gene has been associated with melanoma and POT1 gene has been implicated in melanoma. Using pre-trained word embeddings, the neural networks used in our experiments can exploit the semantic relationship between associated and implicated (their vector representations are relatively close in vector space) to compute a distance between the two sentences. The same cannot be said for the decision-tree models, which rely on traditional word representations: these models can only check whether a particular word, represented as a feature, is present or not. Early word embedding models, such as Word2Vec [15], which we use with our CNN approach, learn one fixed representation per word. However, this cannot capture how the meaning of a word depends on its context. More recently, systems like BERT [16] showed that generating a different embedding for each word depending on the context in which it appears can outperform previous methods. The methods that we used for the experiments described in this paper are the following.
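The contrast between dense embeddings and traditional one-word-per-dimension representations can be illustrated with cosine similarity. The three-dimensional vectors below are invented toy values, not real Word2Vec or BERT embeddings; they are chosen only to show the effect.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy dense embeddings (made-up values for illustration only).
dense = {
    "associated": [0.8, 0.1, 0.3],
    "implicated": [0.7, 0.2, 0.3],
    "melanoma":   [0.1, 0.9, 0.2],
}

# One-hot representation: one dimension per vocabulary word.
one_hot = {
    "associated": [1, 0, 0],
    "implicated": [0, 1, 0],
    "melanoma":   [0, 0, 1],
}

sim_dense = cosine(dense["associated"], dense["implicated"])
sim_dense_other = cosine(dense["associated"], dense["melanoma"])
sim_one_hot = cosine(one_hot["associated"], one_hot["implicated"])

print(sim_dense > sim_dense_other, sim_one_hot)
```

With one-hot vectors every pair of distinct words is equally dissimilar, which is why a model relying on them can only test whether a given word is present.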

The decision-tree method implemented in scikit-learn (footnote 6) is an optimized version of the Classification and Regression Trees (CART) algorithm. As features, we considered the lemma of the tokens in a window from two tokens before the leftmost entity of the pair, to two tokens after the rightmost entity of the pair.
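The feature window can be sketched as follows. The function name, the token-span convention (end index exclusive), and the inclusion of the entity tokens themselves are assumptions of this sketch; the lemmas are written by hand and may differ from what spaCy would produce.

```python
def window_lemmas(lemmas, gene_span, disease_span, width=2):
    """Return the lemmas from `width` tokens before the leftmost entity
    to `width` tokens after the rightmost entity.

    Spans are (start, end) token indices with `end` exclusive.
    """
    left = min(gene_span[0], disease_span[0])
    right = max(gene_span[1], disease_span[1])
    return lemmas[max(0, left - width): right + width]


# Hand-lemmatized tokens of: "In this study the POT1 gene has been
# associated with melanoma in several families"
lemmas = ["in", "this", "study", "the", "POT1", "gene", "have", "be",
          "associate", "with", "melanoma", "in", "several", "family"]

features = window_lemmas(lemmas, gene_span=(4, 5), disease_span=(10, 11))
print(features)
```

Lemma lists of this kind could then be turned into binary word-presence features before being passed to a decision-tree classifier.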

The CNN approach is essentially the same as that described in [4]. The only change in our work is that we used pre-trained embeddings (footnote 7) calculated on PubMed [26] instead of ones calculated on a more general-purpose text corpus. For our experiments, we used the code implementation of [27], in which each word is represented by concatenating its word embedding and its shortest distances relative to the two entities in the relationship.

BioBERT [28] is a version of BERT that has been pre-trained on large-scale biomedical corpora. The BERT authors have shown that unsupervised pre-training of language models on a large corpus, followed by fine-tuning, is beneficial for many NLP tasks. BioBERT extends this work and demonstrates that pre-training BERT on additional biomedical corpora helps it to analyze complex biomedical texts. We conducted our experiments using the code and the default configuration provided with BioBERT [29]. As input, we fed BioBERT a plain text file that consists of one sentence for each positive and negative example.

MGR extended dataset annotation

To demonstrate the usefulness of the models implemented in the previous section, we used BioBERT fine-tuned on the mention-level MGR base training dataset to annotate a large set of PubMed abstracts about melanoma. First, the query in Additional file 1 was run on April 4, 2018 to retrieve 90,028 publication abstracts. Then, we removed from these abstracts the ones already annotated in MGDB, obtaining 89,137 abstracts. After that, we followed the procedure described in steps 3 and 4 of the “MGR base dataset collection” section to annotate the gene and melanoma concepts in the abstracts. Finally, the relations among the annotated concepts were extracted with the BioBERT model. Figure 6 shows that in article 〈PMID: 10446968 〉 gene 〈ID: 5728 〉 is associated with melanoma 〈ID: D008545 〉. This dataset will be referred to as the MGR extended dataset. We make it available to the research community in the Mendeley Data repository (footnote 8).

Fig. 6

Extracted relation between gene 〈ID: 5728 〉 and melanoma 〈ID: D008545 〉 for article 〈PMID: 10446968 〉

A manual assessment of the quality of the entire MGR extended dataset would be too demanding in terms of human resources. For this reason, the quality of the dataset has been estimated through (i) a direct evaluation of a sample subset of relations, and (ii) an indirect evaluation using the results obtained on the MGR base dataset.

Regarding the direct evaluation, two domain experts assessed the quality of a sample subset of relations. First, the relations in the MGR extended dataset were sorted in decreasing order of the scores produced by BioBERT. Then, from these relations we considered those involving genes not classified in MGDB. These relations should be the most difficult for our classifier to extract (they are not present in our training dataset), but also the most interesting because they are new and not in MGDB. As the output of this phase, we obtained 2,265 genes and 6,866 relations annotated at concept-level. Among these relations, the two domain experts were asked to verify the first 700 (those with the highest probability). The first expert manually checked the odd-numbered relations, while the second expert checked the even-numbered relations. At concept-level, 600 relations out of 700 were judged as correct (precision: 85.71%), while at mention-level 1,070 out of 1,506 relations were correct (precision: 71.05%).

The agreement between the two experts was calculated using the measure studied by Scott [30]. It is defined as the percentage of judgments on which the two experts agree when annotating the same data independently. To compute this measure, 100 additional relations were selected from the dataset and verified by both experts. At concept- and mention-level the observed agreement is 73.00% and 71.31%, respectively.

Concerning the indirect evaluation, we used the results obtained by the models on the MGR base dataset as an estimator of the quality of the relations extracted from the MGR extended dataset. This approximation is possible because both the vast majority of publications in the MGR base dataset (precisely 891 out of 907) and all the publications in the MGR extended dataset belong to the same set of 90,028 abstracts extracted from PubMed. This allows us to consider with confidence the publications in the MGR base dataset as representative of the publications in the MGR extended dataset.
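The precision and observed-agreement figures of the direct evaluation are simple ratios, which the sketch below reproduces. The function names are illustrative, and the two judgment lists are hypothetical data constructed only to show how percentage agreement is computed.

```python
def precision(correct, total):
    """Precision in percent, rounded to two decimals."""
    return round(100 * correct / total, 2)


def observed_agreement(judgments_a, judgments_b):
    """Percentage of items on which two annotators give the same judgment."""
    agreed = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return 100 * agreed / len(judgments_a)


# Counts reported in the text.
concept_precision = precision(600, 700)
mention_precision = precision(1070, 1506)

# Hypothetical judgments on 100 relations: the experts agree on 73 of them.
expert_1 = [1] * 100
expert_2 = [1] * 73 + [0] * 27
agreement = observed_agreement(expert_1, expert_2)

print(concept_precision, mention_precision, agreement)
```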
