Performance assessment of ontology matching systems for FAIR data

We evaluated the performance of three existing ontology matching systems using reference alignments based on the UMLS and BioPortal. Additionally, we analyzed the top-level hierarchies of mappings using manually created mappings between the top-level classes of ontology pairs. These experiments should contribute to the use case of querying distributed data sources, in the context of FAIR data.

Principal findings

What is the performance of automated ontology matching systems to expose mappings between ontologies used in the rare disease research domain?

The systems exposed, on average, 5,726 mappings between ORDO-SNOMED CT, 3,295 mappings between NCIt-ORDO, and 23,134 mappings between NCIt-SNOMED CT. Obtained F1-scores were 0.55/0.66 (AML, UMLS/BioPortal), 0.46/0.53 (FCA-Map), and 0.55/0.58 (LogMap). The results obtained for the modules were comparable to those of the whole ontologies. As there was no gold standard available, the systems’ overall low precision (0.39-0.54) and high recall (0.64-0.96) suggest that (automatically) evaluating the correctness of mappings is indeed challenging. The systems retrieved most mappings in the reference alignments, but also exposed many additional mappings, hence the lower precision. Because both reference alignments were known to be incomplete (silver standard and baseline), further research is needed to assess whether the additional mappings returned by the systems are correct. The results of the OAEI 2019 Large BioMed track (SNOMED CT-NCIt large fragment task [34]) are the closest available reference for interpreting performance, as the other tracks and tasks of the OAEI use other ontologies or reference alignments. Using a UMLS-based reference alignment (inconsistent mappings were flagged to be ignored), AML obtained an F1-score of 0.76, FCA-Map 0.65, and LogMap 0.71. Those OAEI results are better, although this is not a one-to-one comparison because the reference alignment and ontology versions differ. Moreover, the OAEI reports using a large fragment of SNOMED CT, resulting in fewer mappings than when using the whole ontology (18,887 vs. 14,200 by AML).

Using consensus alignments (i.e., mappings selected by multiple systems) improved performance across the board (Table 11 and Additional File 1). As one would expect, selecting a higher number of votes (vote=3, mappings selected by all systems) resulted in higher precision and lower recall. In practice, one could prioritize precision over recall or vice versa for a specific application, and the ability to select a consensus alignment that fits those needs can be useful.
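The voting scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function name `consensus_alignment`, the representation of a mapping as a pair of class identifiers, and the toy alignments are all hypothetical.

```python
from collections import Counter


def consensus_alignment(alignments, min_votes):
    """Return the mappings proposed by at least `min_votes` systems.

    `alignments` maps a system name to a set of (source, target) pairs.
    """
    votes = Counter()
    for mappings in alignments.values():
        # Converting to a set ensures each system votes at most once per mapping.
        votes.update(set(mappings))
    return {mapping for mapping, count in votes.items() if count >= min_votes}


# Hypothetical toy alignments; the identifiers are placeholders, not real classes.
alignments = {
    "AML": {("a1", "b1"), ("a2", "b2"), ("a3", "b3")},
    "FCA-Map": {("a1", "b1"), ("a2", "b2")},
    "LogMap": {("a1", "b1"), ("a4", "b4")},
}

vote2 = consensus_alignment(alignments, min_votes=2)  # found by at least two systems
vote3 = consensus_alignment(alignments, min_votes=3)  # found by all three systems
```

In this toy case `vote3` is a subset of `vote2`, which mirrors the trade-off described above: raising the vote threshold can only shrink the alignment, so recall cannot increase, while precision tends to rise.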

To what extent are currently available ontology matching systems useful for implementation in FAIR-related projects that focus on querying distributed data?

All systems were able to generate alignments without user intervention, which is important for data querying. Run times varied from minutes up to a few hours, depending on the size of the input ontologies. The matching systems exposed equivalence relations between classes of ontology pairs. The use case depicted in Fig. 2 requires equivalence mappings and automated matching, so applying AML, FCA-Map, and/or LogMap in the context of this use case would be a sensible decision. However, for a matching service for querying data, high precision is more important than high recall, hence the need for additional work on validating the correctness of mappings. All systems support OWL ontologies as input and export alignments as a machine-readable RDF file, which allows alignments from several matching systems to be joined. In addition, analyzing the top-level hierarchies of matched classes was shown to be effective in revealing mappings with classes from the same hierarchy. Table 8 shows that on average 10% of mappings had an incorrect top-level hierarchy; the remaining 90% were mappings whose top-level hierarchies matched according to the manually created mappings. For example, given the query ‘count all patients with a rare disease’, the hierarchical analysis can reveal that ‘cystic fibrosis’ is a rare disease and that its records should be counted. The hierarchical analysis can be used in situations where no reference alignments are available. Finally, the use of modules allows faster development and testing of matching techniques than working with the whole ontologies. Using modules instead of the whole ontologies could be considered if speed or resources are important factors. Additionally, modularization can be used as a type of structure-level matching by removing content from the ontology that is not relevant for the application [10]. Because modularization removes content from the ontologies, it could improve or worsen the results of matching systems that use structural matching techniques, although we did not test this hypothesis.

Strengths and limitations

Several strengths and limitations can be identified. A strength of this study is its practical approach to ontology matching using a FAIR data use case, using ontologies relevant for the rare disease domain. A limitation is that AML, FCA-Map, and LogMap are not the only matching systems available, although they cover most of the matching techniques as specified by the classification model defined by Euzenat et al. [10]. Particularly, systems that leverage machine-learning techniques were not included. Likewise, other ontologies exist that would be of use in the rare disease domain, especially considering the large number of ontologies present in, among other repositories, BioPortal. The BioPortal Annotator and Recommender were used to select the ontologies and create the seed signatures, but other similar tooling could also be used.

Another strength is the use of modules based on seed signatures derived from rare disease data elements. The smaller modules made it easier to assess the mappings manually while performing the experiments. Also, it shows potential for implementation in matching services where it is desirable to work with smaller chunks of large ontologies, e.g. for faster run times. However, since the list of rare disease data elements was not validated, it was not possible to draw any additional conclusions from the modules versus whole ontology results. For instance, we did not know if the classes contained in the modules were the most relevant ones for use in the rare disease domain.

Evaluation with BioPortal and UMLS reference alignments

Precision, recall, and F1-score were used as performance measures because they are widely known and used by the OAEI. However, both precision and recall introduce a problem when used in the context of ontology matching that should be mentioned. As stated by [35], both precision and recall are set-theoretic measures that do not discriminate between mappings that may be semantically equivalent but not identical. Thus, when a mapping is not present in the reference alignment, it is by definition considered to be incorrect (false positive). Semantic precision and recall could solve this problem by also considering mappings that are, semantically speaking, close to a mapping in the reference alignment, for example when a mapping’s class is a super- or subclass of the corresponding class in the reference alignment.
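The set-theoretic nature of these measures can be made concrete with a short Python sketch. The function name `evaluate` and the toy mappings are assumptions for illustration only; the definitions are the standard ones (precision is the share of returned mappings found in the reference, recall is the share of reference mappings that were returned, and F1 is their harmonic mean).

```python
def evaluate(system, reference):
    """Set-theoretic precision, recall and F1 of `system` against `reference`.

    Both arguments are sets of (source, target) pairs.
    """
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: the system returns every reference mapping
# plus two additional mappings that the reference does not contain.
reference = {("a1", "b1"), ("a2", "b2")}
system = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}

precision, recall, f1 = evaluate(system, reference)
```

Here recall is perfect while precision is only one half, because the two additional mappings are counted as false positives regardless of whether they are actually correct. With an incomplete reference alignment, this is exactly the situation in which precision may be underestimated.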

To our knowledge, the UMLS and BioPortal based reference alignments were the only ones available that offered mappings between a wide variety of ontologies, including SNOMED CT, ORDO, and NCIt. We consider using two reference alignments for the evaluation a strength, as both contain different mappings despite their overlap of 28 to 57% (Table 4). The BioPortal mappings were considered to be a baseline alignment, as previously mentioned by the OAEI [22]. The precision and recall for BioPortal were both the highest for AML (0.54 and 0.96 respectively, whole ontology). This is consistent with the fact that both AML and BioPortal (LOOM) base their mappings on lexical techniques only. On the other hand, the mappings derived from the UMLS were considered to be a silver standard, since the Metathesaurus is maintained by domain experts. A limitation of the UMLS reference alignment used for this study concerns the CUI codes for ORDO: as ORDO is not included in the UMLS, those CUIs were extracted from ORDO itself. The BioPortal F1-scores for the modules were 15% lower on average than those for the whole ontologies, which could be due to the low number of mappings in the reference alignments. Finally, earlier research mentioned that the UMLS reference alignment contains incoherent mappings [33], namely mappings that contain logical errors following from the union of the input ontologies and the mapping set [36]. Moreover, logical incoherences were also found in BioPortal mappings [37]. Such mappings were neither removed nor examined during the evaluation performed in this study.

Hierarchical analysis of mappings

Discarding false positive mappings, whose top-level hierarchy classes were not manually matched, did not result in much higher precision scores (up to 0.06 points higher). Nonetheless, analyzing top-level hierarchies of matched classes can be of value when applying ontology matching for FAIR data. First of all, hierarchies can exploit information about the origin of a class. For instance, if ‘pneumonia’ and ‘asthma’ are both part of the ‘disease’ hierarchy, they can be classified as such, even if it remains unknown if the classes themselves can be used interchangeably due to the lack of a reference alignment. This can be useful when querying data over multiple sources (example Fig. 2). Additionally, some classes were present in multiple mappings (a class mapped to multiple other classes), and in such cases the hierarchy analysis was able to detect incorrect mappings.
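The idea behind the top-level hierarchy check can be sketched as follows. This is a simplified illustration, not the implementation used in the study: it assumes each class has at most one parent and that the hierarchy is acyclic (real ontologies such as SNOMED CT allow multiple parents), and all identifiers, function names, and the toy hierarchies are hypothetical, loosely modeled on the soft tissue example.

```python
def top_level_class(cls, parent):
    """Walk up a single-parent, acyclic hierarchy to the class without a parent."""
    while cls in parent:
        cls = parent[cls]
    return cls


def hierarchy_consistent(mapping, parent_src, parent_tgt, top_level_matches):
    """Check whether the top-level classes of a mapping were manually matched."""
    src, tgt = mapping
    pair = (top_level_class(src, parent_src), top_level_class(tgt, parent_tgt))
    return pair in top_level_matches


# Hypothetical child -> parent relations for a source and a target ontology.
parent_src = {"src:SoftTissue": "src:AnatomicStructure"}
parent_tgt = {
    "tgt:SoftTissueStructure": "tgt:BodyStructure",
    "tgt:DisorderOfSoftTissue": "tgt:Disorder",
}

# Manually created mappings between top-level classes (assumed for this sketch).
top_level_matches = {
    ("src:AnatomicStructure", "tgt:BodyStructure"),
    ("src:Disease", "tgt:Disorder"),
}

candidates = [
    ("src:SoftTissue", "tgt:SoftTissueStructure"),
    ("src:SoftTissue", "tgt:DisorderOfSoftTissue"),
]

# Mappings whose top-level hierarchies do not correspond are flagged for review.
flagged = [
    m for m in candidates
    if not hierarchy_consistent(m, parent_src, parent_tgt, top_level_matches)
]
```

In this toy case the same source class is mapped to two target classes, and only the mapping into the disorder hierarchy is flagged. Flagged mappings are candidates for review rather than proven errors, since the manual top-level mappings may themselves be incomplete.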

Our list of manual mappings between top-level classes may not be complete, which is a limitation of the hierarchical analysis. In addition, our method needs top-level classes to be manually matched and thus cannot be done automatically. However, even large ontologies tend to have few top-level classes (e.g., SNOMED CT and NCIt both have 19 top-level classes). Table 12 shows four potentially incorrect mappings (as returned by the matching systems) and their top-level hierarchies. The first mapping in Table 12 is Soft tissue and Disorder of soft tissue: the first refers to the anatomic structure of soft tissue, the second to a disorder of this soft tissue. The second example is Aneurysmal Bone Cyst and Aneurysmal bone cyst, in which the labels are lexically identical. However, the first refers to the disease and the second to the body structure. The last example is Cell Proliferation matched with Hyperplasia; the top-level hierarchies reveal that the first class is a Biological Process and the second a Body structure. This example underlines the importance of having mappings evaluated by domain experts. Hyperplasia is the result of cell proliferation [38], thus the mapping could be considered correct depending on the application. Moreover, after manual inspection of the alignments, we found true positive mappings whose top-level hierarchy was incorrect according to our manual mappings. Those mappings were not flagged as incorrect because they were included in at least one of the reference alignments. They could nevertheless point to incorrect mappings in the reference alignments, the examination of which is outside the scope of this work.

Table 12 Four examples of mappings (NCIt-SNOMED CT) that are potentially incorrect based on their top-level hierarchies

Consensus alignments

We obtained better results using consensus alignments than using individual alignments. Consensus alignments have also been used by certain tracks of the OAEI [22, 39]. Harrow et al. mention that consensus alignments only compare how matching systems perform against each other. False positives are still likely to occur, as more than one system can find the same incorrect mapping. Furthermore, correct mappings may be found by only one system and would thus not be included in a consensus alignment.

Relation to other work

The evaluation of the ontology matching systems relates to earlier research on matching disease and phenotype ontologies, and large biomedical ontologies (Large BioMed), both of which are tracks of the OAEI [22, 40]. In addition to the BioPortal baseline reference alignments, a consensus alignment based on a voting mechanism (multiple systems returning the same mapping), and manually curated mappings, were used to evaluate the matching systems. The large biomedical ontologies track of the OAEI uses the UMLS as reference alignment, which is based on an earlier work that extracted pairwise mappings from the UMLS Metathesaurus [33]. Moreover, our work relates to a paper published in 2020 which presented a generic workflow for making data FAIR [41]. Ontology matching systems should be included in the FAIRification workflow when dealing with multiple ontologies and some form of automated matching of classes is desired.

Table 13 Rare disease data items, 117 in total. Items are extracted from the common data elements for rare diseases [14], and the Orphanet rare disease classifications [15]

Future research

We explored the use of ontology matching systems for a use case in the context of FAIR data and showed that existing ontology matching systems have the potential to be implemented in such environments. A problem that has not yet been solved is how mappings can be evaluated as useful or correct for their specific application. Obtaining complete reference alignments is a challenging task and such alignments are not readily available. The (automatic) evaluation of mappings for usage in an on-the-fly matching service within a FAIR data environment will become important. Therefore, future research should focus on developing methods for evaluating mappings that can be used by such matching services. Moreover, situations where no reference alignments are available should be considered. Those developments should be driven by specific use cases. In addition, it would be beneficial to include top-level hierarchy analysis as an additional method in ontology matching systems. Future research could focus on how to integrate this method into existing (modular) systems and workflows. For example, AgreementMaker offers an extensible architecture which may enable the inclusion of our method. We should acknowledge that additional matching methods, utilizing the structure or logic of ontologies, are not limited to top-level hierarchies. Future research could focus on discovering and analyzing other methods that have not yet been implemented by existing matching systems. Lastly, our experiment did not include matching systems based on a machine learning approach. Earlier research has demonstrated that an approach based on representation learning is effective at ontology matching [42]. Hence, investigating machine learning-based systems could be of added value. The recently added machine learning extension to the Matching and EvaLuation Toolkit (MELT) framework could aid such efforts [43]. MELT also offers so-called filters, one of which is a classifier that can be trained to classify a mapping as correct or incorrect. Such a filter could be used to improve the precision of an alignment, provided it can be trained with positive and negative mappings. For the latter, a gold standard is required or negative mappings need to be created manually.


JOURNAL OF BIOMEDICAL SEMANTICS
