A multipurpose TNM stage ontology for cancer registries

OWL class hierarchies used in the different CR TNM ontologies

Figures 1 and 2 show the class hierarchies of the two original CR TNM ontologies (the one described in [9] and ENCR TNM-o, respectively), expanded in part for stage IIB of the TNM site “breast”. Both ontologies classify stage groups according to the different TNM sites (e.g. breast stage IIB). An advantage of the ontology in Fig. 2 is that it is immediately apparent which values of the TNM parameters are possible within a given stage and TNM site, and which ICD-O-3 topography codes are associated with breast cancer. A disadvantage is the number of equivalent classes required (indicated by the small brown circles at the ends of the arrows), which can lead to subtle errors arising from unintended equivalences.

Fig. 1

Class structure of the TNM ontology developed in [9]. Arrows point to subclasses and “+” signs in the top left corner of certain classes indicate the class contains more subclasses than those shown

Fig. 2

Class structure of the original ENCR TNM-o ontology used in the ENCR validation checks developed by the authors. Solid lines signify subclasses; broken lines signify object properties (with different colours representing different object properties); and brown circles touching the arrows denote equivalences

In view of the advantages of having a single ontology that addresses the multiple needs of a CR, ENCR TNM-o was refactored to separate completely the major underlying validation components. This allows ENCR TNM-o v2 to be used in isolation, independently of the ENCR data-validation checks.

In order to achieve the unification of the two original TNM ontologies, the design of the axioms was aligned as far as possible with that of [9] but extended on the basis of the data-validation ontology to incorporate all TNM sites, all codes of the individual T, N, and M parameters, and all ICD-O-3 morphology codes grouped by morphology categorisation. Figures 3 and 4 show this alignment for ENCR TNM-o v2 (which is discussed further in the "Ontology design" section). Figure 3 illustrates the TNM-related classes, and Fig. 4, the ICD-O related classes.

Fig. 3

Class structure of ENCR TNM-o v2 showing the TNM-related associations. Solid lines signify subclasses; broken lines signify object properties

Fig. 4

Class structure of ENCR TNM-o v2 showing the ICD-O-3 related associations

In Fig. 3, the classes TNMStage and TNMStageIIB correspond to the classes EC and EC_IIB of Fig. 1, respectively, but there is an important distinction in the resultant subclass name (cf. TNMSiteEd7Breast and TumoresDeMama_EC_IIB). We considered it important to decouple the TNM site (e.g. breast) from the TNM stage (e.g. stage IIB), since the concept of stage is essentially independent of the specific cancer site. We also divided the TNM classes more comprehensively between a generic ontology and a TNM edition-specific ontology, to avoid having to redefine all the TNM classes for each TNM edition. Furthermore, we introduced a TNMCodeSpace class to encapsulate the permissible values of the T, N, and M parameters for the different cancer sites.

Regarding the relation with ICD-O-3, all morphology codes have been defined in ENCR TNM-o v2 and grouped under specific morphological categories, some of which are shown in Fig. 5. One example of the relationship between morphology code and morphology category is shown in Fig. 4 for the morphology code M_8140_3 and the adenocarcinoma morphology category. This is in contrast to Fig. 1, where only a descriptive morphological term is used (cf. the UndifferentiatedOrAnaplasticCarcinoma class). ENCR TNM-o v2 also differentiates between pathological and clinical TNM. The resulting ontology is thus a more comprehensive model and more readily scalable to different TNM editions.

Fig. 5

Class structure of MorphologicalGroup in ENCR TNM-o v2 showing some of the morphological categories expanded in part for the carcinoma class

ENCR TNM-o v2 ontology structure

ENCR TNM-o v2 draws on concepts that go beyond TNM and which serve other needs within the wider context of the work of CRs. Examples include the ICD-O-3 codes (broken down into their constituent parts, e.g. topography, morphology, behaviour codes, etc.) and the grouping of sets of morphology codes into relevant morphological categories (describing carcinomas, melanomas, sarcomas, etc.).

In order to separate these concerns and allow optimal reuse, ENCR TNM-o v2 is based on a modular design in which the individual concerns or domains are encapsulated in separate ontologies. OWL ontologies (essentially files written in OWL) may import other OWL ontologies/files to build larger ontologies consisting of a number of separate ontologies. By modular design, we mean the separation of inherently different concerns into different abstractions, each encapsulated in its own ontology. These ontologies can nevertheless be integrated in a larger ontology and linked appropriately within it without interfering with their individual descriptions or axiomatic definitions.

The concept is illustrated in Fig. 6, which shows the import structure of ENCR TNM-o v2; an ontology is imported by another in the direction of the arrowed line. Table 2 provides an overview of some of the metrics associated with the constituent ontologies. For ontologies that import others the numbers are cumulative, and the parenthesised numbers give the ontology-specific counts. The metrics are defined as follows: Class count is the number of distinct classes; SubClassOf is the number of SubClassOf axioms (through which a class is made a subclass of another named or unnamed class); Object property is the number of object properties; Equivalent classes is the number of equivalent or defined classes; GCI count is the number of general concept inclusions, i.e. SubClassOf axioms whose subclass is a complex class expression; and Logical axiom count is the number of logical axioms (which includes SubClassOf but not Class count).

Fig. 6

Structure of the ontology import tree. An ontology that points to another ontology is imported by the ontology pointed at. The structure is adaptable to any classification of codes and TNM edition; only the relevant ontology needs to be swapped out

Table 2 Overview of the individual ontologies shown in Fig. 6

The expressivity is also cumulative, where the meanings are: AL – Attributive Language, the basic description language; ALC – AL with complements (including full existential quantification and concept union); ALCI – ALC with inverse properties. The superscript D denotes datatype properties (used for specifying age in the axioms of TNM sites that require it, e.g. thyroid gland).

Consequently, the TNM axioms can be specified according to any TNM edition. To define an ontology for another TNM edition, only the axioms specific to that edition have to be defined; the rest of the ontology structure remains the same. The structure is therefore adaptable and scalable to any particular edition of TNM. In a similar fashion, it is also possible to change the morphology code groupings in the MorphologyGrouping ontology without necessarily having to change all the associated TNM-related axioms.

There is nevertheless a significant number of classes to change within the edition-specific ontology (cf. the bracketed numbers in the final column of Table 2). It is, however, a relatively straightforward task to make global replacements of version-dependent strings (e.g. TNMEd7 to TNMEd8) in an OWL file and, once that is done, to tweak the individual classes where there are differences between the editions. Furthermore, once the edition-specific ontology has been finalised, there is generally no need to change it further. Whereas it would in principle be possible to define many of the stage-related TNM parameters in the generic TNM ontology (since many of them are identical between editions), maintenance would then become more complicated should a future TNM edition require changes to a rule that was common to all the previous editions (the common rule would need to be removed from the generic TNM ontology and refactored into all the TNM edition-specific ontologies).
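The global replacement step can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure actually used for ENCR TNM-o v2: the function name `retarget_edition`, the default tokens `Ed7`/`Ed8`, and the OWL fragment are assumptions made for the example.

```python
import re


def retarget_edition(owl_text: str, old: str = "Ed7", new: str = "Ed8") -> str:
    """Replace an edition token (e.g. Ed7 -> Ed8) throughout OWL text.

    Hypothetical helper: the negative lookahead prevents "Ed7" from
    matching inside a longer number such as "Ed70".
    """
    pattern = re.compile(re.escape(old) + r"(?!\d)")
    return pattern.sub(new, owl_text)


# Illustrative fragment only; not copied from the ENCR TNM-o v2 files.
sample = (
    '<owl:Class rdf:about="#TNMEd7"/>\n'
    '<owl:Class rdf:about="#TNMSiteEd7Breast"/>'
)

print(retarget_edition(sample))
```

A replacement of this kind only renames identifiers; classes whose definitions differ between editions still have to be edited individually afterwards.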

Once the ontology of a new TNM edition has been developed in this manner, it does require full testing, especially of the classification structures that have changed between editions. This is generally performed by passing a set of test records through the reasoner using the programme interface (described further in the "Results" section) and verifying the inferred stage is the same as that specified in each test record.
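A regression check of this kind might look as follows. The sketch assumes a callable `infer_stage` that wraps the reasoner; here it is replaced by a hypothetical stub (`stub_infer_stage`) that encodes only the breast stage 0 rule, and the record field names are likewise assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def find_mismatches(
    records: List[Record], infer_stage: Callable[[Record], str]
) -> List[Tuple[Record, str]]:
    """Return (record, inferred stage) pairs where inference disagrees
    with the stage given in the test record."""
    mismatches = []
    for record in records:
        inferred = infer_stage(record)
        if inferred != record["expected_stage"]:
            mismatches.append((record, inferred))
    return mismatches


def stub_infer_stage(record: Record) -> str:
    # Hypothetical stand-in for the reasoner call; it knows one rule only.
    if (
        record["T"] == "PTis"
        and record["N"] in {"CN0", "PN0"}
        and record["M"] in {"CM0", "PM0"}
    ):
        return "TNMStage0"
    return "unclassified"


test_records = [
    {"id": "r1", "T": "PTis", "N": "PN0", "M": "CM0",
     "expected_stage": "TNMStage0"},
    # Deliberately wrong expectation, to show how a failure is reported.
    {"id": "r2", "T": "PTis", "N": "CN0", "M": "PM0",
     "expected_stage": "TNMStageIIB"},
]

mismatches = find_mismatches(test_records, stub_infer_stage)
for record, inferred in mismatches:
    print(f"{record['id']}: expected {record['expected_stage']}, inferred {inferred}")
```

In practice the stub would be swapped for the real reasoner interface, and the test set would concentrate on the stage groups whose definitions changed between editions.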

Ontology design

OWL is based on the open-world assumption (OWA) which limits the inferences that can be made by any reasoning mechanism on statements known to be true – the philosophy being that there may be other information not yet known to the reasoner that may invalidate the inferences drawn.

In an OWL ontology, a reasoner can infer further classifications on the basis of information that is known and through which inferences can be made. OWL provides a number of mechanisms for imposing restrictions on the information available that allow such inferences to be made. One of the mechanisms relates to the “defined class” attribute. Defined classes essentially express equivalence: a defined class is considered to contain a set of necessary and sufficient conditions that make it automatically equivalent to any other class containing those same conditions. Thus in description logic, the axiom:

TNMSiteKidney ≡ ∃hasMorphology.Carcinoma ⊓ ∃hasTopography.C649

states an equivalence between the class TNMSiteKidney and the intersection of the object property hasMorphology having some carcinoma with the object property hasTopography having some ICD-O-3 topography code C64.9.

Another mechanism is the general concept inclusion (GCI) construct [26], whereby an anonymous (or complex) class expression is made a subclass of an atomic class (in contrast to the more usual way of constructing classes using an OWL user interface such as Protégé [27]). This mechanism results in the subsumption by the atomic class of any class that satisfies the conditions specified in the complex class expression.

Thus, if instead of making the class TNMSiteKidney a subclass of a complex class expression such as:

TNMSiteKidney ⊑ ∃hasMorphology.Carcinoma ⊓ ∃hasTopography.C649

the complex class expression is made a subclass of TNMSiteKidney:

∃hasMorphology.Carcinoma ⊓ ∃hasTopography.C649 ⊑ TNMSiteKidney

the effect is that any class will be subsumed by TNMSiteKidney if it contains the intersection of the two classes:

∃hasMorphology.Carcinoma and ∃hasTopography.C649

Depending on the type of information one wishes to extract from an ontology, both subclassing constructs may be useful and it is worth noting that if both expressions are declared simultaneously, one has by definition [28] the equivalent class:

TNMSiteKidney ≡ ∃hasMorphology.Carcinoma ⊓ ∃hasTopography.C649

Using defined classes with complex class expressions can, however, lead to unintended equivalence inferences where identical expressions occur in two or more defined classes. Where there are many such complex expressions, it becomes difficult to ensure that clashes do not occur. For this reason, the GCI approach was considered the most appropriate even though it tended to increase the number of axioms. GCIs are also known to cause performance issues [29, 30], but this cost was considered acceptable in order to avoid potentially subtle inference errors.
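The clash problem can be made concrete with a small syntactic check. The sketch below is an assumption-laden simplification: each defined class is reduced to a set of (property, filler) conjuncts, and two classes are flagged when those sets are identical. The class names HypotheticalSiteA and HypotheticalSiteB are invented for the example, and a real reasoner would also detect equivalences that are semantic rather than purely syntactic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Conjunct = Tuple[str, str]  # (object property, filler class)


def find_equivalence_clashes(defined: Dict[str, List[Conjunct]]) -> List[List[str]]:
    """Group defined classes whose right-hand sides are the same
    intersection of existential restrictions (order ignored)."""
    by_expression = defaultdict(list)
    for name, conjuncts in defined.items():
        by_expression[frozenset(conjuncts)].append(name)
    return [sorted(names) for names in by_expression.values() if len(names) > 1]


defined_classes = {
    # From the text: TNMSiteKidney ≡ ∃hasMorphology.Carcinoma ⊓ ∃hasTopography.C649
    "TNMSiteKidney": [("hasMorphology", "Carcinoma"), ("hasTopography", "C649")],
    # Hypothetical classes, included only to provoke and to avoid a clash.
    "HypotheticalSiteA": [("hasTopography", "C649"), ("hasMorphology", "Carcinoma")],
    "HypotheticalSiteB": [("hasMorphology", "Carcinoma"), ("hasTopography", "C50")],
}

clashes = find_equivalence_clashes(defined_classes)
print(clashes)
```

With GCIs in place of equivalences, the two clashing expressions would simply be subsumed by both named classes, without the named classes being inferred equivalent to each other.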

Many of the equivalence axioms used in the data-validation ontology were consequently refactored. For example, the morphology category axiom:

Mesothelioma ≡ (M_9050 ⊔ M_9051 ⊔ M_9052 ⊔ M_9053 ⊔ M_9054 ⊔ M_9055)

was remodelled as six separate general class axioms, following the pattern:

∃hasMorphology.M_905X ⊑ Mesothelioma

where “X” signifies values between 0 and 5.

The number of axioms could be reduced in some instances by using three-digit morphology codes (e.g. M_905) and making the latter the superclasses of the four-digit codes, e.g.:

∃hasMorphology.M_905 ⊑ Mesothelioma

where

M_9050, M_9051, M_9052, M_9053, M_9054, M_9055 ⊑ M_905

Using this pattern, the data-validation TNM ontology could be more closely aligned with that of [9].
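The effect of the three-digit superclass can be sketched as a lookup that walks up the class hierarchy until a GCI applies. The dictionaries below mirror only the mesothelioma axioms quoted above; the function name and data layout are assumptions made for the illustration.

```python
from typing import Optional

# M_9050 ... M_9055 ⊑ M_905
PARENT = {f"M_905{digit}": "M_905" for digit in range(6)}

# ∃hasMorphology.M_905 ⊑ Mesothelioma (a single GCI instead of six)
CATEGORY_GCI = {"M_905": "Mesothelioma"}


def morphology_category(code: str) -> Optional[str]:
    """Walk from a morphology code up through its superclasses and
    return the first morphology category reached, if any."""
    current: Optional[str] = code
    while current is not None:
        if current in CATEGORY_GCI:
            return CATEGORY_GCI[current]
        current = PARENT.get(current)
    return None


print(morphology_category("M_9053"))
print(morphology_category("M_9056"))
```

Any four-digit code declared under M_905 is categorised through the single GCI, whereas a code outside the declared hierarchy is left uncategorised.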

Encoding of TNM stage is performed on the basis of the various permissible codes ascribed to the individual T, N, and M categories. The codes ascribed to the T, N, and M categories are dependent on topography or site of the primary tumour as well as on the TNM edition. This introduces the notion of a symbol code space for each category, and was modelled in the ontology by a defined class for each TNM site specified by the TNM edition. Since the TNM sites have unique names, any clashes in the equivalence statements are avoided. For example the code space for T for the TNM site Breast in TNM edition 7 is specified by the intersection of the Breast TNMEd7 site with the union of all the associated T codes:

TNMSiteEd7Breast ⊓ ∃hasT.(CT0 ⊔ CT1 ⊔ CT1a ⊔ CT1b ⊔ CT1c ⊔ CT2 ⊔ CT3 ⊔ CT4 ⊔ CT4a ⊔ CT4b ⊔ CT4c ⊔ CT4d ⊔ CTX ⊔ PT0 ⊔ PT1 ⊔ PT1a ⊔ PT1b ⊔ PT1c ⊔ PT1mi ⊔ PT2 ⊔ PT3 ⊔ PT4 ⊔ PT4a ⊔ PT4b ⊔ PT4c ⊔ PT4d ⊔ PTX ⊔ PTis) (1)

where the classes prefixed by the letter “C” denote clinical T and those prefixed by the letter “P” denote pathological T. Any T code outside this code space is not recognised for this particular TNM site. Also in this axiom, the class TNMSiteEd7Breast is the superclass of the intersection of the generic TNM class of the same site and the object property hasTNMEdition acting on the class TNMEd7:

TNMSiteBreast ⊓ ∃hasTNMEdition.TNMEd7 ⊑ TNMSiteEd7Breast (2)

Finally, the TNM generic class for the TNM topographic site “Breast” is the superclass of the intersection of the object properties related to the ICD-O-3 topographic code C50 and the morphology category denoted by the Carcinoma class which itself consists of object properties related to a number of ICD-O-3 morphology codes:

∃hasTopography.C50 ⊓ ∃hasMorphology.Carcinoma ⊑ TNMSiteBreast

where the morphology category Carcinoma is the superclass of the morphology subcategory Adenocarcinoma.

Adenocarcinoma ⊑ Carcinoma

which is described in terms of specific ICD-O-3 morphology codes, one example being:

∃hasMorphology.M_850 ⊑ ∃hasMorphology.Adenocarcinoma

These aspects were not modelled in the ontology of [9]; also, instead of modelling stage as the intersection of the TNM category classes and topography class as in the example below for stage 0 breast cancer:

C50 ⊓ Tis ⊓ N0 ⊓ M0 ⊑ BreastCancer_CS_0

we preferred to represent a general class of stage 0 as an intersection of object properties (of T, N, and M) with the class TNMSiteEd7Breast defined in axiom (2) and an object property of hasBehaviour with BehaviourCode2 (corresponding to in situ tumours):

TNMSiteEd7Breast ⊓ ∃hasBehaviour.BehaviourCode2 ⊓ ∃hasT.Tis ⊓ ∃hasN.N0 ⊓ ∃hasM.M0 ⊑ TNMStage0

Defining the axioms in this way reduces the need to create a separate class for each combination of stage group and TNM site and also allows the conceptually different classes of topography, T, N, and M to be declared disjoint.

Another aspect we modelled in ENCR TNM-o v2 was the concept of code-spaces for T, N, and M encapsulating all the respective codes for a given TNM cancer site. The permissible sets of codes are in general different for different cancer sites and this feature was not modelled in the ontology of [9], where for instance the axiom for stage IIIC breast cancer is:

C50 ⊓ N3 ⊓ M0 ⊑ BreastCancer_CS_IIIC

This axiom is entirely independent of T and would miss any associated data-validation errors. In ENCR TNM-o v2, the same class is modelled as:

TNMSiteEd7BreastCodeSpaceT ⊓ ∃hasN.N3 ⊓ ∃hasM.M0 ⊓ ∃hasBehaviour.BehaviourCode3 ⊑ TNMStageIIIC

where the first term is provided by axiom (1), in line with the set of T category values for breast cancers derived from Table 1.
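The role of the code space in the stage IIIC axiom can be illustrated with a plain set-membership check. The T codes are those listed in axiom (1); the function name, the argument layout, and the use of generic category names (N3, M0) and a bare behaviour digit are simplifying assumptions, not the ontology's own encoding.

```python
# T code space for TNMSiteEd7Breast, transcribed from axiom (1).
BREAST_ED7_T_CODES = frozenset(
    "CT0 CT1 CT1a CT1b CT1c CT2 CT3 CT4 CT4a CT4b CT4c CT4d CTX "
    "PT0 PT1 PT1a PT1b PT1c PT1mi PT2 PT3 PT4 PT4a PT4b PT4c PT4d PTX PTis".split()
)


def stage_iiic_breast_ed7(t_code, n_category, m_category, behaviour):
    """Return "TNMStageIIIC" only when T lies inside the breast code space
    and N, M and behaviour match the axiom; otherwise return None."""
    if (
        t_code in BREAST_ED7_T_CODES
        and n_category == "N3"
        and m_category == "M0"
        and behaviour == "3"
    ):
        return "TNMStageIIIC"
    return None


print(stage_iiic_breast_ed7("PT2", "N3", "M0", "3"))
print(stage_iiic_breast_ed7("PT9", "N3", "M0", "3"))  # T code outside the code space
```

A record with an unrecognised T code therefore receives no stage, which is the behaviour that lets a validation layer flag it, whereas an axiom that ignores T would classify it as stage IIIC regardless.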

By extending these axioms for the entire set of TNM cancer sites, the ontology is able to provide a comprehensive representation of the TNM tables such as those shown in Table 1. Moreover, defining the axioms in this way using general concept inclusions after the manner proposed in [9] provides the means of automatically deriving stage from knowledge of the parameters on which it depends.

Subsumption of classes

As a consequence of this design, an input record specifying an object property of hasMorphology with morphology code subclassed under M_850 will be subsumed under:

∃hasMorphology.Carcinoma

If the input record were also to specify an object property of hasTopography with a topography code in the class hierarchy of C50, this together with the morphological designation will be subsumed under TNMSiteBreast. TNMSiteBreast together with an object property of hasTNMEdition of TNMEd7, will in turn be subsumed under TNMSiteEd7Breast.

Finally, if the input record specifies the object properties with the corresponding correct T, N, and M codes (in this case PTis, CN0 or PN0, and CM0 or PM0), the whole input record will be subsumed under TNMStage0. The subsumption schema for the breast cancer example just described is illustrated in Fig. 7, where the boxed axioms are subsumed by the circled classes to provide the final subsumption under TNMStage0.

Fig. 7

Subsumption schema of the GCI axioms in which the boxed axioms are subsumed by the circled classes
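The subsumption chain for the breast stage 0 example can be imitated with a small forward-chaining sketch. It is not a description-logic reasoner: records are reduced to one filler per object property, and the class names M_8500 and C509, as well as the placement of the prefixed codes (PTis, PN0, CM0, etc.) under their generic categories, are assumptions introduced for the illustration.

```python
# Named-class hierarchy (child -> parent); partly assumed, see lead-in.
SUBCLASS_OF = {
    "M_8500": "M_850",
    "Adenocarcinoma": "Carcinoma",
    "C509": "C50",
    "PTis": "Tis",
    "CN0": "N0", "PN0": "N0",
    "CM0": "M0", "PM0": "M0",
}

# ∃hasMorphology.M_850 ⊑ ∃hasMorphology.Adenocarcinoma
FILLER_GCI = {("hasMorphology", "M_850"): "Adenocarcinoma"}

# GCIs as (required named classes, required restrictions, conclusion).
RULES = [
    (set(), {("hasTopography", "C50"), ("hasMorphology", "Carcinoma")},
     "TNMSiteBreast"),
    ({"TNMSiteBreast"}, {("hasTNMEdition", "TNMEd7")}, "TNMSiteEd7Breast"),
    ({"TNMSiteEd7Breast"},
     {("hasBehaviour", "BehaviourCode2"), ("hasT", "Tis"),
      ("hasN", "N0"), ("hasM", "M0")},
     "TNMStage0"),
]


def expand(prop, filler):
    """All (property, class) restrictions entailed by one asserted filler."""
    seen, todo = set(), [filler]
    while todo:
        cls = todo.pop()
        if cls in seen:
            continue
        seen.add(cls)
        if cls in SUBCLASS_OF:
            todo.append(SUBCLASS_OF[cls])
        if (prop, cls) in FILLER_GCI:
            todo.append(FILLER_GCI[(prop, cls)])
    return {(prop, cls) for cls in seen}


def classify(record):
    """Apply the GCIs repeatedly until no new named class is inferred."""
    restrictions = set()
    for prop, filler in record.items():
        restrictions |= expand(prop, filler)
    inferred, changed = set(), True
    while changed:
        changed = False
        for named, needed, conclusion in RULES:
            if (conclusion not in inferred and named <= inferred
                    and needed <= restrictions):
                inferred.add(conclusion)
                changed = True
    return inferred


record = {
    "hasMorphology": "M_8500", "hasTopography": "C509",
    "hasTNMEdition": "TNMEd7", "hasBehaviour": "BehaviourCode2",
    "hasT": "PTis", "hasN": "PN0", "hasM": "CM0",
}
print(sorted(classify(record)))
```

Each rule corresponds to one boxed axiom in the schema, and each conclusion to a circled class; changing the T code to one that is not under Tis stops the chain at TNMSiteEd7Breast.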

The ontology takes into full consideration the ICD-O-3 morphology codes, which are themselves classified according to the categories of malignant neoplasms specified in Table 25 of ICD-O-3 first revision [23] and adapted from [31]. Via the generic TNM classes, the axioms also provide a scalable architecture that minimises duplication of classes between different TNM editions. The design therefore provides a comprehensive basis for a general-purpose TNM ontology that can serve the various TNM-related tasks within a CR.

Comparison of metrics of TNM ontologies

The expressivity of ENCR TNM-o v2 is nominally ALCIQ(D) (ALCI with qualified cardinality restrictions), but the qualified cardinality restriction arises solely from one axiom in one of the imported ontologies and is not used explicitly within the TNM ontology; thus the expressivity can be considered as ALCI(D) – the same level of expressivity as the ontology developed in [9]. Table 3 shows a comparison between the different ontologies using the same metrics as those described in the "ENCR TNM-o v2 ontology structure" section.

Table 3 Comparison of axiom and class counts between the CR-related TNM ontologies. The metrics of the ontology of [9] were taken from the ontologies directly downloaded from: https://github.com/djogopatrao/tnm_ontology/tree/master/ontologies
