ResidueFinder: extracting individual residue mentions from protein literature

The results of a large number of investigations are summarized in Table 1, which is based on the spreadsheet “Additional file 2.xlsx” provided in the Additional Material. A readme file for navigating Additional file 2.xlsx is provided in Additional File 1.doc. The regexes used in the calculations underlying Table 1 are given in the Supplementary Material: the MutationFinder regex in Additional File 3.txt; a cut version of the MutationFinder regex in Additional File 4.txt; Regex 1 in Additional File 5.txt; a cut version of Regex 1 in Additional File 6.txt; Regex 2 in Additional File 7.txt; a cut version of Regex 2 in Additional File 8.txt; Regex 3 in Additional File 9.txt; and a cut version of Regex 3 in Additional File 10.txt.

For a detailed guide to interpreting the analysis for one of the papers in the study, we chose Paper #6 in the “highlights” tab of the spreadsheet “RF Excel Supplement”, available as Additional File 2.xlsx. This paper is PMID 10370099: Höllerer-Beitz, Gerhild, Roland Schönherr, Michael Koenen, and Stefan H. Heinemann. “N-terminal deletions of rKv1.4 channels affect the voltage dependence of channel availability.” Pflügers Archiv 438, no. 2 (1999): 141–146. We show results from analyzing the full text of the paper. This text has 6 residue mentions based on a close manual inspection.

Columns B-D show results from the original MutationFinder. Two mutations are found, and four residue mentions are not found (as expected, because MutationFinder does not look for mentions of residues not associated with mutations). Columns E-G show results from a “cut” version of the MutationFinder regex, demonstrating that the speedup of the “cut” version is achieved at no cost in recall.

Columns H-J show results from Regex 1, revealing that 4 residue mentions are found and 2 are not. Columns K-M, for the “cut” version of Regex 1, show that only 2 of the 6 residue mentions are found. For both the full and cut versions of Regex 1, one false positive is returned.

Columns N-S show that for both the full and cut versions of Regex 2, 4 of the 6 residue mentions are found, and that one false positive is returned.

Columns T-Y show that for both the full and cut versions of Regex 3, all 6 of the residue mentions are returned (for a recall of 100 %) but that 9 false positives are also returned.

Column AD provides annotation giving the nature of the false positives (bibliography, equipment description, etc.).

Column AE shows the context in the paper for all the returned expressions, both true and false positives.

Other papers mentioned in the spreadsheet may be interrogated in analogous fashion.

To allow the reader to evaluate the timings in Table 1, we note that the computer ran an Ubuntu operating system with an AMD Athlon 64 X2 processor at 1 GHz and access to 2.8 GB of RAM. Because of the modest performance of this machine, we expect the timings to be better on newer machines.

The first set of investigations summarized in Table 1 was done on a set of 20 papers, all on potassium channels (a particular interest of ours), that mention individual amino acids, randomly selected from a much larger set of over a thousand research articles. The first two rows show the performance of MF on this set. (The regex for MF is provided in Additional File 3.txt.) We see from rows 1 and 2 that the full text provides a significantly more severe test of the program than abstracts alone, especially with respect to recall. Comparison of rows 2 and 4 with rows 1 and 3 shows that it was possible to improve the efficiency of MF dramatically, without compromising performance at all (for this particular set of papers), by eliminating from the regex all patterns except the most common. This is described in the Implementation section and detailed in Additional File 2.xlsx. We would recommend this approach for using MF to process a large number of papers, with the caveat that the particular regexes to be removed should be checked against sets of papers other than this particular set.

Comparing rows 1 and 3 with rows 5–6 shows the effect of simply using the analogous filters of MutationFinder in ResidueFinder (Regex 1). This comparison reveals that there are many more formats for a residue mention than for a mutation mention, so the performance of RF with this regex is statistically far worse than that of MF on the same set of papers.

To compare rows 5–6 with rows 7 and 9, one should understand the differences between Regex 1 and Regex 3. The expansion was made in two steps: (1) addition of the two patterns responsible for the largest number of false negatives with Regex 1, and (2) for each of those patterns, addition of a version that includes a space between the amino acid identifier and the location number. Although standard nomenclature suggests not including a space [39], we found numerous examples of a space inserted in the text, warranting inclusion of that variation in the regex. Different classes of errors cause FPs; these error types are broken down in Table 2, while the two types of FN are shown in Table 3.
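To make the space-variant idea concrete, here is a minimal sketch of the kind of pattern involved. This is an illustration only, not the actual Regex 3 (which is given in Additional File 9.txt); the pattern names are our own.

```python
import re

# Illustrative sketch (NOT the actual Regex 3): a one-letter or three-letter
# amino acid code, optionally separated from the position number by a single
# space, so both "K533" and "Lys 533" are matched.
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"
THREE_LETTER = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
                "Phe|Pro|Ser|Thr|Trp|Tyr|Val")

residue_pattern = re.compile(
    r"\b(?:[%s]|%s) ?\d+\b" % (ONE_LETTER, THREE_LETTER)
)

text = "Deletion up to Lys 533 removed residue K533."
print(residue_pattern.findall(text))  # → ['Lys 533', 'K533']
```

The single optional space (`" ?"`) is what step (2) adds; without it, only the no-space form would be captured.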

Table 2 gives the classes of error resulting in FPs for Paper Set 1. Since some mentions leading to FPs have more than one contributing cause, the total number of incidences of causes adds up to more than the total number of FP identifications.

Table 3 shows the causes of FN errors in Paper Set 1: residues were not found either because they were embedded in a non-readable image or because the regex did not have the correct pattern to identify them as residues. In principle, the FN errors due to images could be overcome with OCR technology.

Table 2 Classes of FP errors in RF using Regex 3 on full text

Table 3 Classes of FN errors in RF using Regex 3

In rows 8 and 10 we see the results of introducing a “cut” version of Regex 3. The cut version is created by eliminating all notations except (1) a single capital-letter amino acid code or three-letter amino acid code followed, with no space, by the location number, and (2) a single capital-letter or three-letter amino acid code followed by one space and then the location number. The MF and RF regexes include many other possibilities that turn out to be relatively rare, so the “cut” version runs hundreds of times faster with only minor degradation of performance.
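The full-versus-cut timing comparisons can be reproduced with a simple harness. The patterns below are stand-ins for illustration (the real full and cut regexes are in the Additional Files), and the toy text will not reproduce the hundreds-fold speedup seen on real corpora, but the measurement method is the same.

```python
import re
import timeit

# Stand-in patterns: a "full" version with extra optional notation, and a
# "cut" version keeping only the common "Xaa123" / "Xaa 123" forms.
full_pattern = re.compile(
    r"\b(?:Lys|Arg|His|Asp|Glu|K|R|H|D|E)(?:[- ]?(?:residue )?)\d+\b")
cut_pattern = re.compile(r"\b(?:Lys|Arg|His|Asp|Glu|K|R|H|D|E) ?\d+\b")

text = "The mutant alters Lys 533; see also residue R87 nearby. " * 200

for name, pat in [("full", full_pattern), ("cut", cut_pattern)]:
    elapsed = timeit.timeit(lambda: pat.findall(text), number=100)
    print(f"{name}: {elapsed:.3f} s, {len(pat.findall(text))} matches")
```

On real article text, the rarely-matched alternatives dominate the cost of the full regex, which is why pruning them yields such a large speedup at a small recall cost.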

Comparing row 11 with 13 and row 12 with 14 shows the results of creating a “cut” version of MutationFinder, as measured by performance on the development and test corpora used by those authors. We see that MF, as is the case with RF, is dramatically faster with only minor degradation of performance as indicated by precision, recall, and F-measure. Note that these corpora consist of abstracts.

Beginning with row 15 and continuing through row 26, we introduce results on a random set of 100 papers. The papers were screened to mention “amino acid” or “residue” in either a MeSH term or the text in a search of PubMed Central. Following screening, the 100 papers were chosen from all that passed the screen by a random number generator operating on PMC identifying numbers. By comparing rows 5–10 with rows 15–26, we show the effects of moving from a set of papers selected randomly from a particular field (potassium channels, rows 5–10) to a set of papers selected randomly from all fields of protein science (rows 15–26). We find that performance on the randomly selected papers does not seem systematically different from performance on the potassium channel papers. The fact that the K+ channel papers were selected by intensive manual search provided an opportunity to estimate the recall of the keyword search for “amino acid” and “residue”. We found that this keyword search retrieved 265 out of 329 K+ channel papers that we had previously ascertained to contain amino acid mentions, for a recall of 0.805. Based on other comparisons between the K+ channel set and the more general set, we see no reason to expect the recall for the more general papers to be significantly different from that for the K+ channel papers.
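The random-selection step can be sketched as follows. The ID list here is a hypothetical stand-in for the real set of screened PMC identifiers, and the fixed seed is our addition for reproducibility; the paper does not specify the generator used.

```python
import random

# Hypothetical stand-in for the PMC IDs that passed the
# "amino acid" / "residue" keyword screen.
screened_pmc_ids = [f"PMC{n}" for n in range(1000000, 1005000)]

rng = random.Random(0)  # fixed seed so the draw is reproducible
sample = rng.sample(screened_pmc_ids, 100)  # 100 papers, no repeats
print(len(sample))
```

`random.sample` draws without replacement, so no paper can be selected twice.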

Rows 15–19 show results for variants on Regex 1 (similar to MutationFinder) and Regex 3, which is somewhat streamlined as described in Implementation.

By doing pairwise comparisons between rows 15–16, 17–18, 21–22, and 23–24, we see that the “cut” version of each regex suffers only marginal degradation of performance compared to the full version but is sped up by a factor of hundreds, in most cases over 200-fold. Thus, we would recommend the “cut” versions for processing a very large number of papers.

By pairwise comparison between rows 15 and 17, 16 and 18, 21 and 23, and 22 and 24, we see the effect of removing the bibliography from the text. We find processing time reduced to approximately ¾ of its original value, a moderate deterioration in precision, and essentially no deterioration in recall. The underlying phenomenon is that the regex found almost no true positives in the bibliography, so scanning the bibliography is essentially wasted computer time. Moreover, culling the bibliographies from PMC articles is readily automatable, so we recommend it in this context. On the other hand, for articles not available in PMC, the variety of formats makes culling the bibliography more difficult.
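For PMC articles, bibliography culling is straightforward because full-text XML keeps the reference list in a dedicated back-matter element. The snippet below is a sketch under the assumption that the input is PMC-style JATS XML, where references live under `<back>`; it is not the authors' actual pipeline.

```python
import xml.etree.ElementTree as ET

# Toy PMC-style (JATS) article: body text plus a <back> reference list.
jats = """<article>
  <body><p>Residue K533 controls channel availability.</p></body>
  <back><ref-list><ref><mixed-citation>Smith, A123 ...</mixed-citation></ref></ref-list></back>
</article>"""

root = ET.fromstring(jats)
# Drop the back matter (bibliography) before extracting text for the regex.
for back in root.findall("back"):
    root.remove(back)

text = "".join(root.itertext())
print("K533" in text, "A123" in text)  # → True False
```

After the cull, spurious matches against author initials and citation numbers (a source of the FPs noted above) simply cannot occur.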

Rows 19 and 25 show performance on abstracts only. Statistically, the performance on abstracts looks better than performance on full texts, but this is misleading because there are many more amino acid mentions in the text than in the abstract. Comparing row 15 with 19 and row 21 with 25 shows that inspecting only the abstracts misses over 90 % of the amino acid mentions. By comparing 15 with 16, 17 with 18, 21 with 22, 23 with 24, and 25 with 26, we see that the “cut” version of each regex loses only 12.3–15.1 % of true positive amino acid mentions but achieves a much better speedup than keeping the full regex and scanning only abstracts. Because document preparation for full-text PMC papers takes only marginally more effort than for abstracts alone, and because an increasing fraction of papers are available in PMC, scanning abstracts alone does not seem to us a useful strategy unless one is inspecting collections that include a large fraction of papers not available in PMC.

Rows 27 through 30 represent calculations designed to compare our work with that of Verspoor et al. [34], which motivates us to shift from evaluation based on “at least one mention of an amino acid in a paper” to “all mentions of an amino acid in a paper”. To facilitate manual verification of program performance by the “all mentions” criterion, we randomly chose 20 papers out of the set of 100 that was the subject of the calculations in rows 15 through 26. Comparison of row 21 with 29 shows that RF’s performance on the 20 papers was essentially the same as on the 100, suggesting that the 20 are a representative sample. Comparison of row 29 with 30 shows that the statistical performance of our program by the “at least one mention” and “all mentions” criteria was essentially the same. We note that the program output includes all mentions for future reference, regardless of whether performance is calculated on an “at least one mention” or an “all mentions” basis. With this equivalence in mind, we applied RF to the Verspoor et al. [34] corpus of papers, with the results shown in rows 32 (single count) and 33 (full count). Comparing row 27 to row 31 and row 28 to row 33, we see that our program performs significantly better on the Verspoor et al. corpus than on randomly selected papers. We ascribe this to the mode of selection of the Verspoor et al. papers, which were chosen to have as subjects proteins for which PDB structures were known. We hypothesize that papers thus selected will also have more standardized nomenclature for amino acids, and will therefore be more amenable to an automated search for amino acid mentions by regular expressions than our papers selected by keyword search. Comparing rows 30 and 31, we see that RF, while performing better on the Verspoor corpus than on randomly selected papers, does not perform as well as the Verspoor program based on the F1 measure. Close inspection shows that the difference is due to the larger number of false positives, and hence lower precision, from RF. On the other hand, RF showed better recall than Verspoor, so that the F2 measure, which emphasizes recall, showed the two programs to be closely matched.
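The F1-versus-F2 contrast can be seen directly from the F-beta formula: F1 weights precision and recall equally, while F2 weights recall four times as heavily. The precision and recall values below are made-up illustrative numbers, not the paper's measurements.

```python
# F-beta: (1 + b^2) * P * R / (b^2 * P + R); b = 1 gives F1, b = 2 gives F2.
def f_beta(precision, recall, beta):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical numbers shaped like the comparison in the text:
p_rf, r_rf = 0.70, 0.95  # lower precision, higher recall (RF-like)
p_v, r_v = 0.90, 0.80    # higher precision, lower recall (Verspoor-like)

print(round(f_beta(p_rf, r_rf, 1), 3), round(f_beta(p_v, r_v, 1), 3))  # 0.806 0.847
print(round(f_beta(p_rf, r_rf, 2), 3), round(f_beta(p_v, r_v, 2), 3))  # 0.887 0.818
```

With these illustrative numbers, the high-recall system loses on F1 but wins on F2, exactly the pattern described above.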
