Classifying literature mentions of biological pathogens as experimentally studied using natural language processing

Pathogen information has been curated into several existing databases. One key resource is the NCBI Taxonomy, which provides a reference set of biological organisms, and their taxonomic classification. We can identify publicly available resources specific to pathogen detection, e.g. NCBI Pathogen Detection [12], the study of pathogenic phenotyping, e.g. PathoPhenDB [2], and toxins related to pathogens, e.g. TADB2.0 [13], the bacteria type II toxin-antitoxin database. There are also other resources that are not publicly available [14], including the Biological Materials Information Program (BMIP).

The scientific literature contains information about research on pathogens and the research institutes performing research on them through, e.g. author affiliations. There is previous work in identifying pathogens in the scientific literature, which tends to focus on specific pathogens and/or specific aspects of pathogens. Among this work we can point to the Bacteria Biotope challenge task at the BioNLP shared tasks [15], which focuses on certain bacteria and their habitats and phenotypes, including 491 individual microorganisms mentioned in 392 articles. There is previous work using the literature to identify the relation of pathogens to the environment [16], pathogen-disease prediction using ontologies and literature mining [17], identification of the geolocation of pathogen samples (e.g. GeoBoost [18, 19]) for phylogeography or other aspects of pathogens related to biodiversity [20, 21], in addition to toxins [13, 22].

Despite this existing work on identifying pathogen mentions in the scientific literature, no comprehensive work characterises a large set of different pathogen types or focuses on literature describing the experimental study of pathogens. Such a resource could be used to evaluate pathogen annotation methods or to annotate a broad set of microorganisms, including pathogenic organisms, PrPSc prions and toxins.

In its broad objective as well as its methodology, our work is related to the Chemical Indexing task of the recent BioCreative NLM-Chem track [23]. To construct the NLM-Chem dataset [24], Medical Subject Heading (MeSH) index terms corresponding to chemicals are assigned to an article as a topic term. Similarly, we leverage MeSH index terms as a proxy for identifying key entities in articles in our automatically constructed dataset. The computational task in each case is primarily to identify the topic/key entities mentioned in an article. In our work, we focus on biological pathogens rather than chemicals, and aim for a narrower definition of the entities that we consider relevant. Our dataset also takes advantage of resources beyond MeSH index terms over PubMed, to identify literature for a substantially broader set of pathogens.

In the following sections, we present a methodology to develop a data set that can be used to tune and evaluate pathogen identification methods. Then, we evaluate several methods to identify and characterise experimentally studied pathogens that include dictionary methods and state-of-the-art deep learning methods, based on our constructed data set.

Methods

In this section, we describe the methodology used to construct the READBiomed-Pathogens pathogen literature data set that we use to develop and evaluate methods for pathogen identification and characterisation. This data set consists of MEDLINE/PubMed citation records; the texts that we analyse are the title and abstract texts within these records.

Within the set of microscopic organisms, our work considers specific types of pathogens that are classified within the NCBI Taxonomy [25]. The most relevant organism types are bacteria, fungi, protozoa, viroids and viruses [26, 27]. We have recovered information about a set of common pathogenic organisms and a selection of less frequent ones that were found in the NCBI Taxonomy at the species level.

We have also considered other pathogens that cannot be categorised within an organism taxonomy but are still relevant to be studied, such as PrPSc prions [28], which are misfolded proteins that cause diseases such as Creutzfeldt-Jakob disease. We have considered prions of common species. As well, we have considered a set of common toxins generated by other pathogens including bacteria or fungi, such as enterotoxins that are produced and secreted by bacteria [29].

Finally, the pathogens represented in the READBiomed-Pathogens data set can be split into three main categories: pathogenic organisms (2848 terms), PrPSc prions (14 terms) and toxins (19 terms).

READBiomed-Pathogens data set generation

In this section, we describe how we generated READBiomed-Pathogens, leveraging existing resources from the National Center for Biotechnology Information (NCBI), which is part of the US NIH / National Library of Medicine, using the E-utilities [30].

To develop this data set, we are specifically interested in recovering literature citations that are relevant to the pathogens of interest. We draw on other NCBI resources, including Medical Subject Headings (MeSH) [31] corresponding to pathogen terms. MeSH headings are assigned manually to MEDLINE citations and provide highly reliable labels reflecting key topics addressed in publications. More recent MEDLINE records combine manual annotation with automatic annotation, but our data set was developed before this automatic MeSH indexing was put in place. This might need to be considered by future work following our approach.

Figure 1 shows a diagram of the process that we followed to create our data set, which is further explained in the following sections. We considered three types of pathogens: pathogenic organisms, prion proteins that cause infectious disease, and pathogenic toxins.

Fig. 1 Diagram of the generation of the READBiomed-Pathogens data set

The NCBI offers other relevant resources to identify additional relevant scientific articles. In the case of the pathogenic organisms, GenBank [32] is a gene database with links to PubMed and allows recovering citations in which genes related to the pathogenic organisms have been identified in the scientific literature. Depending on the pathogen type, we queried different data sources to obtain document identifiers from PubMed or PubMed Central. We used these identifiers to build a data set with relevant literature for the pathogens of interest. This methodology can be straightforwardly applied for additional pathogens not considered in the current study.

In constructing our dataset, we have adopted two simplifying assumptions about the relationship between experimentally studied pathogens and the literature:

If a pathogen is included as a MeSH index term for an article in PubMed, then it is a focus entity of the research described in that article, and it is experimentally studied.

If a GenBank record for a pathogen links to an article in PubMed, then the pathogen is a focus entity of the research described in that article, and it is experimentally studied.

While it is clear that these assumptions are overly simplistic, they provide a reasonable proxy for our target task that allows us to conduct larger-scale computational experiments with machine learning methods.

Pathogenic organisms

We grouped pathogens corresponding to biological organisms, including bacteria, fungi, protozoa, viroids, and viruses, together, since most species are directly available from the NCBI Taxonomy [33]. The NCBI Taxonomy contains most of the pathogenic organisms of interest. To find the pathogenic organisms in the NCBI taxonomy database, we searched for the name of a pathogen in the NCBI taxonomy vocabulary, first seeking to match a scientific name. If there was no match, then we expanded the search into all name fields available in the NCBI Taxonomy database. We only consider cases in which a single NCBI Taxonomy record was returned.
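The lookup order described above (scientific name first, then all name fields, accepting only an unambiguous result) can be illustrated with a minimal sketch. The in-memory `TAXONOMY` table, its identifiers and the function name `find_pathogen` are hypothetical stand-ins for the NCBI Taxonomy database and are not part of our implementation.

```python
from typing import List, Optional

# Hypothetical toy records standing in for the NCBI Taxonomy name table.
# The tax_id values are placeholders, not real NCBI identifiers.
TAXONOMY = [
    {"tax_id": 1, "scientific_name": "Escherichia coli",
     "other_names": ["Bacillus coli", "enteric bacterium"]},
    {"tax_id": 2, "scientific_name": "Proteus mirabilis",
     "other_names": ["enteric bacterium"]},
]


def find_pathogen(name: str, records: List[dict] = TAXONOMY) -> Optional[dict]:
    """Return the single matching record, or None if there is no unique match."""
    key = name.strip().lower()
    # First pass: exact match on the scientific name only.
    hits = [r for r in records if r["scientific_name"].lower() == key]
    if not hits:
        # Second pass: expand the search to every other name field.
        hits = [r for r in records
                if key in (n.lower() for n in r["other_names"])]
    # Only cases returning exactly one record are accepted.
    return hits[0] if len(hits) == 1 else None
```

In this sketch a name that maps to several records, such as the toy synonym "enteric bacterium", is discarded rather than resolved heuristically.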

For each NCBI Taxonomy pathogen recovered, we obtained pathogen synonyms, a list of all strains in the NCBI Taxonomy database and identifiers of the pathogen from other resources provided by NCBI such as the MeSH controlled vocabulary and GenBank. To obtain article identifiers (PubMed or PubMed Central IDs), we searched PubMed for citations indexed with MeSH terms corresponding to the pathogen identifiers recovered from the NCBI taxonomy record and extracted direct mappings to PubMed from GenBank records linked to from the NCBI Taxonomy.
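As an illustration of the MeSH-based retrieval step, the following sketch builds an E-utilities ESearch request for PubMed citations indexed with a given MeSH heading. It only constructs the URL and makes no network call; the function name `mesh_esearch_url`, the `retmax` default and the choice of JSON output are assumptions made for this example rather than the settings used to build the data set.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_esearch_url(mesh_heading: str, retmax: int = 100) -> str:
    """Build an ESearch URL for PubMed citations indexed with a MeSH heading."""
    params = {
        "db": "pubmed",
        # Restrict the search to the MeSH index using the [MeSH Terms] field tag.
        "term": f'"{mesh_heading}"[MeSH Terms]',
        "retmax": retmax,       # assumed page size, for illustration only
        "retmode": "json",      # assumed output format, for illustration only
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = mesh_esearch_url("Escherichia coli")
```

The returned identifier list would then be paged through and merged with the PubMed links obtained from GenBank records.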

As already mentioned, for each pathogen identified in the NCBI Taxonomy database, we recovered all the subspecies identifiers using recursive queries. Information recovered for the subspecies was added to the pathogen record to encompass all possible variants of each pathogen as fully as feasible.
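The recursive recovery of subspecies identifiers can be sketched as a depth-first traversal of the taxonomy tree. The `CHILDREN` mapping and its identifiers are hypothetical; in practice each lookup would be a query against the NCBI Taxonomy database.

```python
from typing import Dict, List

# Hypothetical parent -> children mapping; identifiers are placeholders.
CHILDREN: Dict[int, List[int]] = {
    1: [2, 3],   # species 1 has two direct children
    2: [4],      # child 2 has one strain below it
}


def collect_descendants(tax_id: int, children: Dict[int, List[int]]) -> List[int]:
    """Return every identifier below tax_id, visiting children depth-first."""
    found: List[int] = []
    for child in children.get(tax_id, []):
        found.append(child)
        # Recurse so that strains below a subspecies are also collected.
        found.extend(collect_descendants(child, children))
    return found
```

The names and identifiers collected for each descendant are then merged into the record of the species-level pathogen.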

PrPSc prions data set

Prions are misfolded proteins that produce diseases such as Creutzfeldt-Jakob disease. We are interested in pathogenic prions such as the scrapie isoform of proteins (PrPSc) associated with specific animal species such as sheep, human or moose. To recover citations relevant to PrPSc and the species of interest, we identified MeSH indexing as a key resource. While there is no entry in MeSH for variants of PrPSc prions, a MeSH heading for PrPSc prions in general, as well as for each of the species, is available. Since specific prion types do not appear as entries in MeSH, in order to recover MEDLINE citations relevant to PrPSc proteins for humans, we utilize a template query.

Of the 14 prions of interest, it was possible to collect documents for 7 of them. The following species were not found in MeSH: elk, greater kudu, moose, mule, nyala, onyx and ostrich. To collect relevant citations for the data set, we reused the query example presented above as a template. We collected the citations in MEDLINE that were identified by this template for each one of the species.

Toxins data set

Toxins, even if some are related to pathogens, are chemicals and hence do not appear in the NCBI Taxonomy database. Therefore, we explored the MeSH controlled vocabulary as a resource for toxins indexing in MEDLINE.

Of the 19 toxins in our list of pathogens, 12 were not found in MeSH: Abrus abrin toxin, Anatoxin-A, Batrachotoxin, Brevetoxin, decarbamoylsaxitoxin, Fusariotoxins (T-2), gonyautoxins, Maitotoxin, Mycotoxin, neosaxitoxin, Palytoxin, and Ricinus ricin toxin. For the 7 toxins that could be mapped to a MeSH entry, we added the citations that were indexed with that toxin in PubMed to our data set.

Additional work could extend our set of toxins to the ones available in chemical databases such as ChEBI (Chemical Entities of Biological Interest).

READBiomed-Pathogens data set statistics

In this section, we provide statistics of the data collected for the different categories of pathogens in Table 1. It was not possible to find PubMed citations for 122 pathogenic organisms in the NCBI Taxonomy database, which in most cases are viruses, e.g. viper retrovirus. We found that just over 10% of all pathogenic organisms were available as MeSH headings, in comparison to the pathogens available in GenBank. On the other hand, the number of citations available per pathogen is larger in MeSH. Despite these differences in the information contained in each database, an advantage of using MEDLINE MeSH indexing is that articles have been manually indexed. Hence, we can determine the articles in which the pathogen has been identified as sufficiently relevant to be included in the index terms. This allows us to identify which MEDLINE citations we should consider when evaluating pathogen characterisation algorithms.

Table 1 Statistics of READBiomed-Pathogens. This includes the number of pathogens identified in the resources MeSH and GenBank compared to the total number of pathogens in our set of interest (Total), and the average number of PubMed citations (Avg. PMIDs) associated with each pathogen in MeSH and GenBank

Table 2 shows the top pathogenic organisms sorted by the number of unique citations recovered from MeSH indexing or GenBank. The bacterium Escherichia coli is the pathogen with the most citations from both sources. Most of these frequent pathogens appear in MeSH, but the ranking is different in the two lists, reflecting differences in scope.

Table 2 Top 10 pathogenic organisms identified from MeSH indexing and GenBank sorted by number of PubMed identifiers recovered from the NCBI resources. (*) Salmonella enterica subsp. enterica serovar Typhimurium

Pathogen characterisation

In our work, we define experimental pathogen characterisation as the identification of a pathogen in the text that is experimentally studied in the published work. That is, we aim to ignore pathogens that are mentioned, but for which there is no evidence that the researchers presenting the work had actively experimented on the pathogen in their research, and hence held samples in their facility. Irrelevant pathogen mentions may occur in the context of references to previous or similar work, e.g. in background information, or they may be mentioned in the context of a comparison between organisms. For example, in the citation PMID:13129609, Escherichia coli is mentioned repeatedly in the article, but the experimentally studied pathogen is Proteus mirabilis; E. coli only supports the research on Proteus mirabilis. As another example, PMID:21979562 mentions H1N1, but only in the context of the patient presentation: the patient had received a vaccine against the H1N1 virus. Additional examples are available in Additional file 1 Appendix C in the supplementary material.

The NLP methods were developed using the UIMA (Unstructured Information Management Architecture) framework [34]. Pathogen characterisation was split into two steps.

In the first step (pathogen identification), a specific case of named entity recognition of biological concepts [35], mentions of pathogens are identified in the text of the citations (title and abstract texts). Here, we utilise a dictionary method or regular expressions since the pathogen names are specific and derived from a closed vocabulary. The objective of this step is to identify as many mentions of pathogens in the texts as possible.

In the second step (pathogen filtering), the aim is to remove the pathogen mentions that are not relevant to the objective of identifying research that describes active experimentation with pathogens, and, conversely, to retain pathogen mentions that correspond to active experimentation.

Pathogen identification

We developed a methodology for the identification of pathogens in text. We followed distinct strategies depending on the pathogen type, as shown in Fig. 2. We used UIMA [36] as the framework to develop the pathogen identification components. We explain it in more detail in this section.

Fig. 2 Pathogen identification diagram. An input text is processed by a set of dictionaries and regular expressions built on the pathogen list. UIMA is the NLP framework that we have used for the development of our method

For pathogenic organism identification, we built a dictionary matching approach using a tool called ConceptMapper [36], which is part of the UIMA sandbox tools. ConceptMapper is highly configurable and scales well with large dictionaries. It has previously been shown to be highly effective for recognising biological concepts in scientific publications [37]. We evaluated several of the ConceptMapper options, together with a dictionary containing pathogenic organism terms and relevant terms for the identification of PrPSc prions.

For the generation of the dictionary for pathogenic organisms, we used a version of the NCBI Taxonomy available from the OBO Foundry. For each pathogenic organism, we selected the terms for the pathogen of interest and all its subspecies.

We made some modifications to the terms extracted from the OBO NCBI Taxonomy. To increase recall, the word “subtype” was removed, e.g., “H1N1 subtype” becomes “H1N1”. Additionally, to reduce the size of the dictionary, we removed terms starting with “influenza a virus (” or “influenza b virus (”, which removed several thousand entries with no impact on recall. For viruses, we removed the ending “virus”, since requiring it reduced recall: in most cases the virus is implied, e.g. “influenza A” vs. “influenza A virus”. Additionally, we removed virus names with one letter, e.g. “B virus” from “Hepatitis B virus”, which could annotate “Influenza B virus” incorrectly. After this processing, only terms with more than three letters were kept. Abbreviations are ambiguous, and results showed that most of the abbreviations had long forms that matched one of the dictionary entries for that pathogen in the MEDLINE citations.
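The term clean-up rules above can be summarised in a short sketch. The function name `normalise_term`, the order in which the rules are applied, and the reading of "more than three letters" as more than three characters (so that "H1N1" is kept) are assumptions made for this illustration.

```python
import re
from typing import Optional

# Prefixes of strain-level influenza names that are dropped from the dictionary.
DROPPED_PREFIXES = ("influenza a virus (", "influenza b virus (")


def normalise_term(term: str) -> Optional[str]:
    """Return the cleaned dictionary term, or None if the term is discarded."""
    if term.lower().startswith(DROPPED_PREFIXES):
        return None
    # "H1N1 subtype" -> "H1N1"
    term = re.sub(r"\s*\bsubtype\b", "", term, flags=re.IGNORECASE).strip()
    # One-letter virus names such as "B virus" are too ambiguous to keep.
    if re.fullmatch(r"[A-Za-z] virus", term):
        return None
    # "influenza A virus" -> "influenza A"
    term = re.sub(r"\s+virus$", "", term, flags=re.IGNORECASE)
    # Assumption: length is counted in characters, so "H1N1" survives.
    return term if len(term) > 3 else None
```

Terms for which the function returns `None` are simply left out of the dictionary.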

We have 2637 pathogenic organisms in our dictionary and a total of 83,757 distinct terms, with an average of 31 terms per pathogen. A large number of terms is justified as well by the term variation contributed by the subspecies, e.g. “Borrelia burgdorferi strain N40” for “Borrelia burgdorferi”. Since the longest match is preferred, the term linked to subspecies will be preferred.
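The longest-match preference can be illustrated with a simplified token-based matcher. This is a sketch of the general technique and not the ConceptMapper implementation or its API; the function name `longest_match` and the toy dictionary are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def longest_match(text: str, dictionary: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, pathogen) spans, preferring the longest dictionary term.

    `dictionary` maps lower-cased, space-separated terms to a pathogen name.
    """
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    max_len = max((len(term.split()) for term in dictionary), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tok for tok, _, _ in tokens[i:i + n])
            if candidate in dictionary:
                matches.append((tokens[i][1], tokens[i + n - 1][2],
                                dictionary[candidate]))
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no term starts at this token
    return matches


TOY_DICTIONARY = {
    "borrelia burgdorferi": "Borrelia burgdorferi",
    "borrelia burgdorferi strain n40": "Borrelia burgdorferi",
}
```

Because both toy terms map to the same pathogen, preferring the strain-level term changes the annotated span but not the pathogen that is reported.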

For PrPSc prions, we have split the set into a tuning set that includes the pathogens Sc (cattle), Sc (cat), Sc (deer) and Sc (goat), and a testing set that includes the pathogens Sc (human), Sc (mink) and Sc (sheep). Table 3 shows the results on the selected tuning set. We identify that missed PrPSc annotations are mostly due to missing species mentions in the citations. This is especially a problem in citations with no abstract. There are also wrong annotations, according to MeSH indexing, where another species might be mentioned in the citation that is not relevant to the research but is mentioned as background information.
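Missed and wrong annotations relative to MeSH indexing translate into recall and precision errors, respectively. The sketch below shows one way such scores could be computed per citation; micro-averaging over all citations and the function name `precision_recall_f1` are assumptions for illustration, and the toy data are not taken from our results.

```python
from typing import Dict, Set, Tuple


def precision_recall_f1(predicted: Dict[str, Set[str]],
                        gold: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged scores; both arguments map a PMID to a set of pathogens."""
    tp = fp = fn = 0
    for pmid in set(predicted) | set(gold):
        p = predicted.get(pmid, set())
        g = gold.get(pmid, set())
        tp += len(p & g)   # annotated and indexed in MeSH
        fp += len(p - g)   # annotated but not indexed (wrong annotation)
        fn += len(g - p)   # indexed but not annotated (missed annotation)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with hypothetical PMIDs.
gold = {"pmid-1": {"Sc (cattle)"}, "pmid-2": {"Sc (deer)"}}
predicted = {"pmid-1": {"Sc (cattle)", "Sc (cat)"}, "pmid-2": set()}
scores = precision_recall_f1(predicted, gold)
```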

Table 3 PrPSc identification algorithm results for tuning data set

For toxin identification, we initially generated a dictionary using the name of the toxin and additional terms using their mapping to the MeSH controlled vocabulary, when available. As in the case of the PrPSc prions, we have split the toxins that could be mapped to MeSH into tuning and testing sets. The tuning toxins are: aflatoxins, botulinum toxins, ciguatoxins and conotoxins. The testing toxins are: enterotoxins, saxitoxins and tetrodotoxin. Table 4 shows the results of applying the dictionary to the tuning toxins. We find that the precision tends to be quite high, which might indicate that if a toxin appears in the citation, it is very likely to be relevant. We find as well that some mentions are missed due to term variability. For instance, the pathogen aflatoxins might appear in text as “antiaflatoxinB1”. Another example is botulinum toxin, which might appear in text as “onabotulinumtoxinA” or “botulinum neurotoxin E”. We could expand the dictionary, but the term matching would also need to become more flexible, since in some cases toxin names get combined with other terms or more specific terms are used. This approach has been used in prior work on gene name normalization [38].

Table 4 Toxin identification on tuning data set

We also approached matching toxin names using regular expressions. We implemented a set of regular expressions based on the names of the toxins; the regular expressions can be found in Additional file 1 Appendix B of the supplementary material. Toxin synonyms extracted from the MeSH controlled vocabulary have been added, and the expressions match both uppercase and lowercase letters by including the case-insensitive flag (?i) within the regular expression. Results using the regular expressions appear in Table 4. Precision is the same while recall has increased significantly.
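To illustrate how regular expressions can absorb the term variability discussed above, the following sketch defines two case-insensitive patterns. These patterns are hypothetical examples written for this illustration; the expressions actually used are those listed in Additional file 1 Appendix B.

```python
import re
from typing import Dict, List

# Hypothetical patterns; (?i) makes each expression case-insensitive.
TOXIN_PATTERNS = {
    # Allows prefixes and suffixes, e.g. "antiaflatoxinB1".
    "aflatoxins": re.compile(r"(?i)\w*aflatoxins?\w*"),
    # Allows "onabotulinumtoxinA" and "botulinum neurotoxin".
    "botulinum toxins": re.compile(r"(?i)\w*botulinum\s*(?:neuro)?toxins?\w*"),
}


def find_toxins(text: str) -> Dict[str, List[str]]:
    """Map each toxin with at least one match to the matched strings."""
    found = {}
    for name, pattern in TOXIN_PATTERNS.items():
        hits = [m.group() for m in pattern.finditer(text)]
        if hits:
            found[name] = hits
    return found
```

Allowing arbitrary word characters around the toxin name is what recovers fused forms that an exact dictionary lookup misses.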

Pathogen filtering

The pathogen identification methods described in the previous section find all possible mentions of pathogens in the MEDLINE citations. However, not all the identified pathogens are described as experimentally studied in those citations. Some pathogens might be mentioned as part of the background of existing work, and others might be mentioned although the researchers were not directly working with the pathogen (e.g., in a review paper or in research describing surveillance of a pathogen-related disease). We evaluate a strategy to identify relevant training data and examine several models. As stated above, we assume that if a pathogen appears as a MeSH heading in the indexing of the citation, it is experimentally studied.

Data set for pathogen focus detection

We developed a data set in which pathogen mentions in the citations are labelled as relevant or not to the research described, as determined by the MeSH terms assigned to the citation. These terms capture the key focus topics of the paper; inclusion of a pathogen name in the MeSH terms for a citation indicates that the pathogen is relevant to the core research contribution of the paper. The data set was created by identifying mentions of pathogens using the dictionary method described above, similar to the processing done in [39], and then deciding on their relevance based on the inclusion of the pathogen name in the set of MeSH index terms for the article. The mentions of the pathogens of interest were substituted within the text of the citation by a common text representation @PATHOGEN$. This resulted in a total of 133,076 examples.
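The construction of labelled examples can be sketched as follows. Generating one example per distinct pathogen in a citation, with only that pathogen's mentions masked, is an assumption made for this illustration, as is the function name `build_examples`.

```python
from typing import List, Set, Tuple

MASK = "@PATHOGEN$"


def build_examples(text: str,
                   mentions: List[Tuple[int, int, str]],
                   mesh_pathogens: Set[str]) -> List[Tuple[str, bool]]:
    """Return (masked_text, label) pairs, one per distinct pathogen mentioned.

    `mentions` holds non-overlapping (start, end, pathogen) character spans.
    The label is True when the pathogen is among the citation's MeSH terms.
    """
    examples = []
    for pathogen in sorted({p for _, _, p in mentions}):
        spans = sorted((s, e) for s, e, p in mentions if p == pathogen)
        parts, cursor = [], 0
        for start, end in spans:
            parts.append(text[cursor:start])  # text before the mention
            parts.append(MASK)                # the mention itself is masked
            cursor = end
        parts.append(text[cursor:])           # remaining text
        examples.append(("".join(parts), pathogen in mesh_pathogens))
    return examples
```

Masking the mention prevents a classifier from memorising pathogen names and pushes it to rely on the surrounding context.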

Pathogen filtering methods

We used the data set developed in the previous section to train several classifiers based on supervised machine learning algorithms. We trained a linear Support Vector Machine (SVM) [40], implemented with stochastic gradient descent and the modified Huber loss [41], which is suited to imbalanced data [42], and AdaBoostM1 [43] (using the MTIMLExtension package [44]).

For both methods, the text of the citations, which includes the title and the abstract, was tokenized, stop words were removed, and either unigrams alone or unigrams and bigrams were used as features. This processing was done using the BinaryPipeFeatureExtractor from the MTIMLExtension package.
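The feature extraction can be approximated by the sketch below, which produces a binary (presence-only) set of unigram and optional bigram features. It is not the BinaryPipeFeatureExtractor itself; the tokenisation pattern, the toy stop word list and the choice to form bigrams after stop word removal are assumptions for this example.

```python
import re
from typing import Set

# Toy stop word list for illustration only.
STOP_WORDS = frozenset({"the", "of", "in", "a", "an", "and", "was", "were",
                        "with", "to"})


def ngram_features(text: str, use_bigrams: bool = True) -> Set[str]:
    """Return the set of unigram (and optionally bigram) features in the text."""
    # Keep '@' and '$' so the @PATHOGEN$ placeholder stays a single token.
    tokens = [t for t in re.findall(r"[a-z0-9@$]+", text.lower())
              if t not in STOP_WORDS]
    features = set(tokens)
    if use_bigrams:
        # Assumption: bigrams are built over the stop-word-filtered sequence.
        features.update(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    return features
```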

In addition, we fine-tuned a BERT [45] based classifier using HuggingFace pre-trained models [46]. We specifically used a pre-trained model named BioBERT [47], which has been pre-trained using biomedical literature. BERT has a limit of 512 tokens, so documents longer than 512 tokens were truncated. BERT-like models tokenize text using the WordPiece algorithm, which breaks words into several subwords.
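The effect of subword tokenisation and truncation can be illustrated with a simplified greedy longest-match-first WordPiece routine. The toy vocabulary, whitespace pre-tokenisation and function names are assumptions for this sketch; in practice the tokenizer distributed with the pre-trained model performs these steps.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword segmentation exists for this word
        pieces.append(piece)
        start = end
    return pieces


def encode(text: str, vocab: Set[str], max_len: int = 512) -> List[str]:
    """Tokenise and truncate so that the output has at most max_len tokens."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    # Reserve the final position for the [SEP] token.
    return tokens[:max_len - 1] + ["[SEP]"]


TOY_VOCAB = {"afla", "##toxin", "##s", "is", "toxic"}
```

Because a single word may expand into several subwords, a document can exceed the 512-token limit well before it reaches 512 words.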

Table 5 shows the performance of the machine learning algorithms on this set. The fine-tuned BioBERT models perform better than SVM and AdaBoostM1; the difference in performance is similar to that observed in other related MEDLINE categorization tasks [48, 49]. These trained algorithms will be evaluated on the pathogen filtering results using the manually annotated set described in the Results section.

Table 5 Classification results

We experimented with enriching the text with additional information provided by the Scientific Discourse Tagger [50], based on evidence that discourse information may be valuable for our task [51]. Using these features with the learning algorithms did not provide a significant change in results and additional work is needed to identify the best way to leverage the discourse annotations.
