Analysis and implementation of the DynDiff tool when comparing versions of ontology

DynDiff has been experimentally evaluated on large ontologies in the biomedical domain. Our goal was to assess the performance of the algorithm against existing tools in terms of execution time, as well as its ability to identify the correct change actions. We also propose additional experiments to evaluate the influence of the ontology matcher on our method.

Materials

During this evaluation, we used seventy-three versions from seven different ontologies. Table 4 provides the average number of classes (C), properties (P), individuals (I), subsumption relationships (R) and attributes (A) that are contained in the various versions of these ontologies.

Table 4 Average number of concepts (\(\textbf{C}\)), properties (\(\textbf{P}\)), individuals (\(\textbf{I}\)), subsumption relationships (\(\textbf{R}\)) and attributes (\(\textbf{A}\)) in the different versions of each of the datasets. The Versions column shows the versions collected

We collected these ontologies from BioPortal [5]. For each of the datasets, the IDs of the first and last versions used are listed in the “Versions” column, as well as the number of versions (in parentheses). We used the notation AA to indicate that, for these ontologies, we selected the latest version published in January of each year. For Gene Ontology (GO), Interlinking Ontology for Biological Concepts (IOBC) and the Ontology of Coronavirus Infectious Disease (CIDO), we indicate in the table the first and last version used.

Experimental method and metrics

The main research objective of this work deals with the scalability, robustness and completeness of the method. The scalability and robustness aspects concern the ability of the approach to compare large ontologies, while the completeness aspect concerns its ability to identify a larger set of ontological changes than state-of-the-art methods. To evaluate these objectives, we executed the DynDiff and COnto-Diff algorithms in the same computing environment and used the following metrics to analyse the results obtained.

Execution time: This metric allows us to compare the computation time required by DynDiff and COnto-Diff and thus to measure the scalability and robustness of both algorithms. Here we studied two main indicators:

Composition: we studied the composition of the execution time, distinguishing between the loading time and the actual algorithm execution time. We did this to understand the performance of the DAO (loading) and service (algorithm execution) layers. Additionally, it provides a benchmark of the algorithms based on the Apache Jena Fuseki implementation.

Trends and behaviours: we studied the behaviour and trends of the execution time given a set of characteristics (such as input size, output size and result composition in terms of changes, among others); in addition, we analysed the outcomes to evaluate the scalability of the algorithm with respect to these characteristics.

Composite-to-low-level ratio: for this indicator, the low-level changes are the basic changes, while the composite changes are the higher-level changes, grouping the complex and heuristic ones. We compute the ratio between composite and low-level changes to analyse how complete the algorithm is when computing them. Note that composite changes are harder to compute and require more computational time and resources.

% change tool comparison: The following indicator is used to compare the results of DynDiff (DD) with those of COnto-Diff (CO) [12]:

$$\begin{aligned} \%ChangeTools_{type} = \frac{(N_{DD_{type}}-N_{CO_{type}})}{\max (N_{DD_{type}}, N_{CO_{type}})}, \quad type \in \{basic, composite\} \end{aligned}$$

(2)

where N is the number of changes of a given tool and of a given type: basic or composite (complex or heuristic). The COnto-Diff algorithm serves as a ground truth since it was compared (in previous work) with other existing solutions such as PromptDiff. We compared the “execution time” indicator and evaluated the proposed “%ChangeTools” indicator. This analysis also enabled us to explain the differences in the composite-to-low-level ratio. The methodology here is the same, except that there is no API service (and thus no UI); everything is set up through the IDE and the intermediate and final results are kept in a local database.

% change matcher comparison: To compare with our default matcher, we selected the AML (Agreement Maker Light) matcher because it is open source (we were able to set up a comparable environment and execute the code) and because of its good performance in the 2019 OAEI campaign. By changing the default matcher of our tool, we are able to see the effect of the new matcher on the performance of our algorithms. We used the following equation to measure the matcher impact:

$$\begin{aligned} \%ChangeMatchers = \frac{(N_{AML}-N_{DM})}{\max (N_{AML}, N_{DM})} \end{aligned}$$

(3)

where DM refers to the default matcher and AML to the new one. N refers to the number of heuristic changes for the given matcher (DM or AML). The methodology used to gather the results of the new matcher is the same as that described for the default one, although we limit the analysis to three ontologies: SNOMEDCT, in order to evaluate the results on a large ontology; ICD9CM, because it is a medium-sized ontology; and CIDO, as it also contains properties and instances.
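To make both indicators concrete, the sketch below computes them from raw change counts. It is only an illustration, under the assumption that Eqs. 2 and 3 normalise the signed difference by the larger of the two counts; the class name, method name and sample counts are hypothetical and are not part of DynDiff.

```java
// Hypothetical helper illustrating Eqs. 2 and 3: a signed relative difference
// between two change counts, expressed as a percentage in [-100, 100].
public final class ChangeComparison {

    /** %ChangeTools / %ChangeMatchers: positive when countA > countB. */
    public static double percentChange(long countA, long countB) {
        if (countA == 0 && countB == 0) {
            return 0.0; // both tools/matchers found nothing: no difference
        }
        return 100.0 * (countA - countB) / (double) Math.max(countA, countB);
    }

    public static void main(String[] args) {
        // Illustrative counts only: DynDiff 10500 vs COnto-Diff 10000 basic changes,
        // AML 320 vs default matcher 250 heuristic changes.
        double changeTools = percentChange(10_500, 10_000);
        double changeMatchers = percentChange(320, 250);
        System.out.printf("%%ChangeTools(basic)       = %+.1f%%%n", changeTools);
        System.out.printf("%%ChangeMatchers(heuristic) = %+.1f%%%n", changeMatchers);
    }
}
```

With this form, a value of 0% means that both tools (or matchers) found the same number of changes, and the sign indicates which one found more.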

Evaluation of DynDiff

The first set of experiments examines the relationship between the execution time (in seconds) of DynDiff and the total amount of detected changes (referenced as “composition” in the previous section).

The total execution time is made up of two processes: the loading time of the local triplestore and the actual execution time of the algorithm. As shown in Fig. 3, in at least 75% of the cases, the loading time represents less than 18% of the total execution time (see the upper hinge). The maximum loading time was observed for the GO ontology and represents 34% of the total execution time; the minimum loading time was observed for the CIDO ontology and represents 4% of the total execution time. The high loading time of GO is probably due to the parsing step undertaken to transform its original OBO format into OWL. The other ontologies have a native OWL format, avoiding the parsing step. We observed that the parsing process adds extra triples to GO, such as blank nodes, that require more processing time to load and to recompose the graph in memory.

Fig. 3 Variance of the ratio of loading time to execution time through all iterations of the proposed ontologies
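For context, the composition indicator can be measured with a simple timing harness around the two phases. The sketch below is a minimal illustration: only the Apache Jena loading calls are real API; the file names are placeholders and computeDiff() is a stand-in for the algorithm itself, which is not shown here.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

// Sketch of separating loading time from algorithm time for two ontology versions.
// computeDiff() is a placeholder, not DynDiff's real API.
public final class PhaseTiming {

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        Model oldVersion = RDFDataMgr.loadModel("ontology_v1.owl"); // loading phase (DAO layer)
        Model newVersion = RDFDataMgr.loadModel("ontology_v2.owl");
        long loadNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        computeDiff(oldVersion, newVersion);                        // algorithm phase (service layer)
        long diffNanos = System.nanoTime() - t1;

        double loadRatio = (double) loadNanos / (loadNanos + diffNanos);
        System.out.printf("loading = %.1f%% of total time%n", 100.0 * loadRatio);
    }

    private static void computeDiff(Model oldVersion, Model newVersion) {
        // placeholder: the real implementation derives basic, complex and
        // heuristic changes from the two models
    }
}
```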

The next step of our evaluation details how the execution time of the algorithm is spent. For this, we start with the analysis of “trends and behaviours”. Figure 4 shows the total number of changes computed per ontology (diff between consecutive versions) versus the execution time. The plotted execution time represents the average of 10 iterations for each Diff calculated between two consecutive versions. The goal of this analysis is to determine whether the execution time increases proportionally to the number of changes, in order to validate the scalability of our algorithm.

Fig. 4 Average execution time (secs) vs total of changes per computed diff between consecutive versions of ontologies

As shown in Fig. 4, the execution time follows two patterns: (i) reference and (ii) high variance. The reference pattern is characterized by the fact that, for each ontology, there is a proportional relation between the execution time and the number of changes. Notice that the ratio can differ between ontologies (as indicated by the red, for NCIt, and green, for SNOMED CT, dotted lines in Fig. 4). According to our findings, two ontologies do not follow the reference pattern: GO and IOBC. For instance, GO (with many concepts and few instances) has an execution time varying between 300s and 850s to compute 30K to 300K changes. IOBC (with few concepts and many instances) spends between 400s and 900s to compute 50K to 700K changes. For both ontologies, we observed a non-linear increase of the execution time versus the number of changes. In the next steps of our analysis, we evaluate the potential impact of other factors on the execution time of our algorithm.

The first factor analysed was the computation environment. Figure 5 shows that there is an inherent variability among the 10 iterations used to compute the diff between the different versions. Here we see the total number of changes produced and a box-plot quartile analysis over the 10 iterations performed. This variability is mainly due to the start-up time of the Java Virtual Machine in the first iterations, as well as the randomness inherent in not using a dedicated server or machine for the computation.
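The averaging protocol itself is straightforward; the sketch below illustrates it, with a hypothetical runDiff workload standing in for an actual DynDiff computation.

```java
// Sketch of the repetition protocol behind Figs. 4 and 5: each diff is executed
// ten times and the mean is reported, so that JVM start-up effects in the first
// iterations are averaged out. The workload below is a placeholder.
public final class IterationTiming {

    static double[] timeIterations(Runnable runDiff, int iterations) {
        double[] seconds = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            runDiff.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    public static void main(String[] args) {
        double[] samples = timeIterations(() -> {
            // placeholder workload standing in for one DynDiff computation
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
        }, 10);
        double mean = java.util.Arrays.stream(samples).average().orElse(0.0);
        System.out.printf("average over %d iterations: %.3f s%n", samples.length, mean);
    }
}
```

Reporting the mean over ten runs dampens, but does not remove, the start-up effects visible in the first iterations.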

Fig. 5 Average of execution time (secs) vs average of total changes for DynDiff

Additionally, Spark is used here for resource management and is known to have high costs at the start and finish of jobs according to the benchmark done by [35].

The next series of experiments will demonstrate that computing composite changes can have an impact on the execution time. This is an important analysis because our approach proposes to compute more composite changes than the other existing approaches.

Composite vs low-level changes

One of the main objectives of this work is to identify and characterize changes that are not only machine- but also human-interpretable. For this reason, it is important to evaluate the capacity of DynDiff to transform the basic changes into composite ones. Figure 6 shows the percentage of basic changes consumed by the composite changes (consumed basic changes/total basic changes) for 2-3 different computed diffs per ontology.

Fig. 6 Percentage of consumed basic changes by computed diff. We present some examples computed for seven different ontologies

Note that IOBC consumes few basic changes. The most frequent change observed in this ontology was the deletion of labels, which is classified as a basic change. This partially explains the low execution time required for 700K changes, as shown in Fig. 4. In contrast, the algorithm consumes (on average) more than 50% of the basic changes for NCIt and SNOMEDCT. For example, for the NCIt 2005-2006 diff computation, the algorithm was able to identify a large subgraph composed of a little more than 48K concepts (addSubgraphC), consuming almost all the basic changes. The consequence of this higher consumption of basic changes is observed in Fig. 4 as a higher execution time for these ontologies.

Thus, the execution time can be reduced by observing the evolution of an ontology and selecting only the composite changes that are useful for the end-users. The analysis of the composite changes is closely related to the context and the use that will be made of this information. For instance, we plan to use the types of changes to evaluate the impact of these changes on the mappings [9] and on the semantic annotations of documents [10]. The ultimate goal is to automatically update the mappings and annotations according to the evolution of the ontologies. For this context, the set of rules that we selected covers all our needs. Other contexts would require only a subset of these rules, resulting in fewer composite changes to compute and consequently in a reduction of the execution time for some ontologies. However, bottlenecks can also come from the matcher and the rules for detecting heuristic changes. Depending on the chosen matcher and the type of ontology it is applied to, the performance of our algorithm can differ. This will be detailed later, when the differences in the results given by the matchers are analysed.
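To illustrate how a composite change such as addSubgraphC can consume basic changes, the sketch below groups a connected block of added concepts into a single composite operation. It is a deliberately simplified, hypothetical rule written for this explanation; DynDiff's actual rule set and data structures are not shown here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of how a composite rule "consumes" basic changes:
// added concepts whose parents already existed start a new subgraph, and all
// added descendants are folded into one addSubgraphC change.
public final class SubgraphGrouping {

    record AddC(String concept, String parent) { }                // basic change
    record AddSubgraphC(String root, Set<String> concepts) { }    // composite change

    static List<AddSubgraphC> group(List<AddC> basicChanges) {
        Set<String> added = new HashSet<>();
        basicChanges.forEach(c -> added.add(c.concept()));

        // roots: added concepts whose parent is not itself an added concept
        return basicChanges.stream()
                .filter(c -> !added.contains(c.parent()))
                .map(root -> {
                    Set<String> members = new HashSet<>();
                    collect(root.concept(), basicChanges, members);
                    return new AddSubgraphC(root.concept(), members);
                })
                .toList();
    }

    private static void collect(String concept, List<AddC> changes, Set<String> out) {
        out.add(concept);
        for (AddC c : changes) {
            if (c.parent().equals(concept) && !out.contains(c.concept())) {
                collect(c.concept(), changes, out);
            }
        }
    }

    public static void main(String[] args) {
        List<AddC> basic = List.of(
                new AddC("B", "A"),   // "A" existed before, so "B" roots a new subgraph
                new AddC("C", "B"),
                new AddC("D", "B"));
        group(basic).forEach(g ->
                System.out.println("addSubgraphC rooted at " + g.root() + " consumes " + g.concepts()));
    }
}
```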

Combining the size of the ontology, the quantity of detected changes, and the ratio of composite changes enables a multi-perspective view of the execution time, as shown in Fig. 7. The width of the bars represents the size of the input versions (number of triples representing concepts, relationships, attributes, properties and instances) and the height of the bars represents the number of detected changes. The colours inside the bars distinguish basic changes from composite ones. The ratio between basic and composite changes is also indicated by a red label. The horizontal axis of the figure indicates the average execution time of the algorithm.

Note that an increase of the input (width) or the output (height) cannot per se justify the increase in computation time. However, if the quantity of composite changes is also taken into account, the correlation becomes clearer. The bar furthest to the right (with the highest execution time) has the following characteristics: large input ontologies and a high number of detected changes, with 20% of them being composite changes. Although this does not explain all the cases, it is clear that every execution time over 500s has a large ontology as its input, even if the number of detected changes is not that large. For instance, one diff computed for SNOMEDCT has an execution time greater than 900s while the number of changes is close to 100K. Thus, the type of ontology also impacts the execution time. By comparing Figs. 7 and 4, we can observe that it is time-consuming to compute diffs for SNOMEDCT. One hypothesis is that the sophisticated hierarchical relationships within SNOMEDCT could partially justify this high execution time. A deeper analysis would be necessary to demonstrate this hypothesis, but such a demonstration is out of the scope of our work, since this factor impacts all Diff algorithms in the same way.

Fig. 7 Effect of the number of changes (output), the input size of the ontology versions (width), and the percentage of complex changes on the average execution time (secs)

The next step will be to compare our approach with the state-of-the-art in this domain, i.e. COnto-Diff.

Comparison with COnto-Diff

The close collaboration with the University of Leipzig and, in particular, with the researchers developing the COnto-Diff tool [12] was crucial to setting up an execution environment in which the tools could be fairly compared. The rules for Properties \(\textbf{P}\) and Instances \(\textbf{I}\) cannot be compared with COnto-Diff because COnto-Diff was not designed to analyse these elements. Inspired by the work of [20], our approach added this more granular view to characterize the changes with, for instance, addSubClass, addSubProperty, addOtherRelationship, etc.

Execution time

The first comparison factor is the total execution time spent to compute the Diff. In Fig. 8, we can observe the total changes computed by each algorithm (COnto-Diff and DynDiff) for the consecutive versions of the seven proposed ontologies (we used the average execution time over 10 iterations). Note that, on average, the execution time of DynDiff varies less than that of COnto-Diff, especially for larger ontologies or for a large number of changes. The execution-time trends of both algorithms are presented as dotted lines in the figure. For instance, when there are more than 230K changes, COnto-Diff needs twice as much time as DynDiff to compute the changes. The IOBC exception (700K changes) is observed because COnto-Diff does not compute changes in instances (which constitute most of the changes in IOBC). Moreover, the GO 2019-2020 diff was excluded from the analysis because COnto-Diff was unable to compute it.

Fig. 8 Comparison of average execution time (secs) vs total changes per computation between consecutive versions of the proposed ontologies

The variance in the total execution time for each approach can be seen in Fig. 9. Note the good performance of COnto-Diff for smaller ontologies. This can be explained by the larger number of rules to test and the higher loading time of DynDiff. The novelty is that DynDiff performs well for larger ontologies, and even better for larger ontologies with a large number of changes, which confirms its robustness. The box-plot presents the total execution time per Diff. We grouped the Diffs by ontology and, within each group, the values are ordered (ascending) by the number of computed changes.

Fig. 9 Comparison of the execution time per iteration and its variance for each of the computed diffs: (a) small to medium and (b) large sized ontologies

Looking more closely at the variability in the execution time, COnto-Diff shows a higher variance for CIDO on some diffs, such as 0403-0408 and 0202-0312, which have the highest number of complex changes in this group. These two diffs contain up to 300% more complex changes than the other versions. For larger ontologies, the variability is more evident, such as for IOBC 105-111. This variability can be caused by certain characteristics of the ontology. For instance, IOBC 105-111 uses labels in both Japanese and English. This high variability was not observed with DynDiff. However, for MESH 2015-2016 or NCIt 2015-2016, we observe a higher variability with DynDiff. Since the number of changes in these two ontologies is quite small (around 200), we suspect that the start-up and finishing overheads introduced by Spark had an impact on our experiments.

As an indicator of accuracy, we compare the changes detected by DynDiff with those from COnto-Diff. We defined the \(\%ChangeTools_{type}\) metric (see Eq. 2) to quantify the differences in the changes computed by each tool. We start by comparing the basic changes, as shown in Fig. 10. A value close to 0% indicates that the same number of changes was found by both tools. The differences observed, such as for IOBC and CIDO, express the capacity of either DynDiff or COnto-Diff to characterize more types of changes than the other one. If the value is positive, DynDiff detects more changes; if it is negative, COnto-Diff detects more changes. For IOBC and CIDO, a plausible explanation for the differences is the inability of COnto-Diff to process labels in Japanese, as well as its inability to identify changes in instances and properties (since it has no rules for them). Similar results are also observed for the analysis of composite changes. The slight differences of 2-4% for the ontologies are mainly due to the rules for heuristic changes.

Fig. 10 Comparison of the basic changes computed by DynDiff vs those computed by COnto-Diff

The negative percentages are due to the ability of COnto-Diff to compute addInnerC and delInnerC in order to detect the concepts added or deleted between a leaf and a non-leaf node. Our approach uses addSubGraph and delSubGraph, which group several add/delete operations into a more complex operation. This decision generates more changes for COnto-Diff, expressed as negative values in the figure. This analysis shows that both approaches behave similarly for ontologies that do not have changes in properties or individuals, but only DynDiff can identify these kinds of changes, making it more complete in its ability to detect high-level changes.

Influence of the matcher

The reason for evaluating several matchers is to show their potential impact on the results and to highlight the importance of choosing a matcher wisely. The comparison is between the built-in matcher and the AML matcher, version 3.1. Before selecting AML, we evaluated 7 matchers (all matchers that performed well for large biomedical ontologies in the OAEI 2019 [36]). In this evaluation, AML had the highest overall F-measure and ranked fourth in terms of execution time. From a technical perspective, our experiments were executed on an Ubuntu 18 laptop with an Intel Core i5-6300HQ CPU @ 2.30GHz x 4 and 15 GB of RAM. The ontologies used were SNOMEDCT (with 306,500 concepts on average) and NCIt (with 66,724 concepts on average).

For the experiments presented in Fig. 11, we decided to deactivate AML's word-matcher feature because, when applied to our dataset, it generates an “out-of-memory” error for large ontologies such as SNOMEDCT. The alignment threshold used was 60%, meaning that matches with a similarity value below 60% were discarded. For readability reasons, we present the outcomes of our evaluation for three ontologies: SNOMEDCT, ICD9CM and CIDO. We compared the %ChangeMatchers metric (see Eq. 3) between the default matcher and AML. Note that we evaluated only the difference in heuristic changes, since we observed that the impact is less noticeable for the other types of changes.

In Fig. 11, we can observe the significant impact of AML on one diff computation of CIDO and one of SNOMEDCT. These diffs have few heuristic changes, so a small variation in the number of detected changes significantly increases the value of the metric we used. Otherwise, AML has a positive impact on the detection of changes in SNOMEDCT, allowing us to detect (on average) more heuristic changes than our built-in matcher. The impact was less impressive for the other types of ontologies. The reason for this is the complexity of the relations within the ontology; the mappings between triples are more complex to determine, requiring more efficient matchers. Another factor to consider is the quantity of changes between versions. A matcher can increase the execution time of the algorithm. Thus, it is necessary to balance the efficiency of a matcher, the quantity of heuristic changes and the execution time of the algorithm.
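As a small illustration of the threshold step described above, the sketch below discards correspondences whose similarity falls below 0.60 before they are used for heuristic-change detection. The Mapping type and the sample values are hypothetical; this is not AML's actual API.

```java
import java.util.List;

// Illustration of the 60% alignment threshold: matches below the threshold are
// discarded before heuristic-change detection. Mapping is a hypothetical type.
public final class ThresholdFilter {

    record Mapping(String sourceConcept, String targetConcept, double similarity) { }

    static List<Mapping> applyThreshold(List<Mapping> alignment, double threshold) {
        return alignment.stream()
                .filter(m -> m.similarity() >= threshold)
                .toList();
    }

    public static void main(String[] args) {
        List<Mapping> alignment = List.of(
                new Mapping("C0001", "C0001_renamed", 0.93),
                new Mapping("C0002", "C0042", 0.41));   // below 0.60, discarded
        applyThreshold(alignment, 0.60)
                .forEach(m -> System.out.println(m.sourceConcept() + " -> " + m.targetConcept()));
    }
}
```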

Fig. 11 Comparison of increase in heuristic changes using AML matcher

To better understand how matchers affect the execution time, we recommend looking at the benchmarks provided by the OAEI. In our case, we used the 2019 benchmark [36] to select AML. When using DynDiff, end-users can select the matcher best adapted to their application scenario.
