Identification of missing hierarchical relations in the vaccine ontology using acquired term pairs

To demonstrate our approach, we used version 1.1.192 of VO (released on 03/19/2022) in Web Ontology Language (OWL). Using the OWLReady2 Python library [37], we obtain the names and ancestors of each VO concept. Then, leveraging the ancestor information, we generate linked and unlinked concept-pairs. Each concept-pair with common lexical feature(s) further derives an acquired term pair (ATP) denoting the term difference between the two concepts. If the same ATP can be obtained from both a linked concept-pair and an unlinked concept-pair, then the unlinked concept-pair is flagged as indicating a potentially missing is-a relation.

Representation of concepts

Our approach requires a concept to be represented as a set of its features. In this work, we obtain the features from concept names as follows. We first convert the name of a concept to lowercase. Then we tokenize the concept name into words and remove duplicate words. The result is a set of words that serve as the lexical features of the concept. For example, consider the concept “Hepatitis B Surface Antigen Vaccine 0.01 MG/ML” (VO:0003423). This concept would be represented as {hepatitis, b, surface, antigen, vaccine, 0.01, mg/ml}.
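The representation step can be illustrated with a minimal Python sketch. It assumes that tokenization is a plain whitespace split, which the text does not specify; the function name `lexical_features` is hypothetical and introduced only for illustration.

```python
def lexical_features(name: str) -> frozenset:
    """Lowercase a concept name, split it on whitespace, and drop duplicate words.

    Assumption: whitespace tokenization; the original work may tokenize differently.
    """
    return frozenset(name.lower().split())


features = lexical_features("Hepatitis B Surface Antigen Vaccine 0.01 MG/ML")
print(sorted(features))
# ['0.01', 'antigen', 'b', 'hepatitis', 'mg/ml', 'surface', 'vaccine']
```

A frozenset is used so that feature sets are hashable and can later be stored inside other sets or used as dictionary keys.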

Generation of linked concept-pairs

We leverage the ancestor information for each concept obtained through OWLReady2 to construct a set of linked concept-pairs as follows. A given concept-pair C and A would form a linked concept-pair L(C,A) if the following constraints are satisfied:

1. A is an ancestor of C; and

2. C and A have at least one common lexical feature.

Note that linked concept-pairs are ordered pairs. That is, L(C,A) indicates that C is the descendant and A is the ancestor. This means that L(C,A) and L(A,C) are different pairs. However, L(C,A) and L(A,C) would usually not both exist in an ontology, as together they would form a cycle.

For example, the concepts “infectious bursal disease virus vaccine” (VO:0001497) and “viral vaccine” (VO:0000609) in Fig. 1 form a linked concept-pair, as VO:0000609 is the parent of VO:0001497 and both concepts have the common lexical feature {vaccine}. Similarly, considering Fig. 2, the concepts “Bovine rotavirus” (NCBITaxon:10927) and “Rotavirus” (NCBITaxon:10912) form a linked concept-pair.

Fig. 1. Valid missing is-a suggestion between concepts VO:0000961 and VO:0001220. A missing is-a relation identified between the concepts “live attenuated infectious bursal disease virus vaccine” (VO:0000961) and “live attenuated viral vaccine” (VO:0001220) was confirmed as valid by the domain expert.

Fig. 2. Valid missing is-a relation between concepts VO:0001507 and VO:0000753. A missing is-a relation identified between the concepts “Bovine Respiratory Syncytial Virus vaccine” (VO:0001507) and “Respiratory syncytial virus vaccine” (VO:0000753) was confirmed as valid by the domain expert.

We iterate through ancestors of all the concepts and construct a set of all linked concept-pairs.
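The construction of linked concept-pairs can be sketched as follows. This is a minimal illustration on toy data, not the original implementation: the dictionaries `names` and `ancestors` are hypothetical stand-ins for the information obtained through OWLReady2, the ancestor sets are deliberately simplified, and whitespace tokenization is assumed.

```python
def features(name: str) -> frozenset:
    # Assumption: lowercase + whitespace tokenization + de-duplication.
    return frozenset(name.lower().split())


# Hypothetical toy data: concept ID -> name, and concept ID -> set of ancestor IDs.
names = {
    "VO:0001497": "infectious bursal disease virus vaccine",
    "VO:0000609": "viral vaccine",
    "BFO:0000040": "material entity",
}
ancestors = {
    "VO:0001497": {"VO:0000609", "BFO:0000040"},
    "VO:0000609": {"BFO:0000040"},
    "BFO:0000040": set(),
}


def linked_pairs(names: dict, ancestors: dict) -> set:
    """Return ordered pairs (C, A) where A is an ancestor of C and the names share a word."""
    pairs = set()
    for c, ancs in ancestors.items():
        for a in ancs:
            if a == c:
                continue  # guard in case an ancestor set contains the concept itself
            if features(names[c]) & features(names[a]):
                pairs.add((c, a))
    return pairs


linked = linked_pairs(names, ancestors)
print(linked)
```

In this toy hierarchy only (VO:0001497, VO:0000609) qualifies, because “material entity” shares no word with either vaccine name.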

Generation of unlinked concept-pairs

We leverage the hierarchical information of VO to construct a set of unlinked concept-pairs. A given concept-pair C and D would form an unlinked concept-pair U(C,D) only if the following conditions are satisfied:

1. C≠D;

2. D is not an ancestor of C and C is not an ancestor of D;

3. C and D have at least one common lexical feature;

4. C and D both belong to the same ontology (note that VO contains external ontology concepts); and

5. C and D fall within the same subhierarchy out of the 19 different subhierarchies under concept “material entity” (BFO:0000040) of VO.

Here the fifth condition requires the unlinked concept-pair to be in the same subhierarchy of “material entity” for the following reasons: (1) the vast majority of VO concepts (including vaccines) are under “material entity”; and (2) the subhierarchies under “material entity” model different domains of VO.

Note that unlinked concept-pairs are ordered as well. That is, U(C,D) is considered to be different from U(D,C). However, in certain situations, one of them could form a linked concept-pair. For instance, if U(C,D) is an unlinked concept-pair but L(D,C) is a linked concept-pair, then we do not include U(C,D) in the unlinked concept-pair set. Otherwise, both are included.

As an example, the concepts “live attenuated infectious bursal disease virus vaccine” (VO:0000961) and “live attenuated viral vaccine” (VO:0001220) in Fig. 1 form an unlinked concept-pair for three reasons: VO:0001220 is not an ancestor of VO:0000961; both concepts are in the subhierarchy rooted at “processed material” (OBI:0000047), which is a subhierarchy under “material entity” (BFO:0000040); and both concepts have the common lexical features {live, attenuated, vaccine}. Similarly, in Fig. 2, the concepts “Bovine Respiratory Syncytial Virus vaccine” (VO:0001507) and “Respiratory syncytial virus vaccine” (VO:0000753) form an unlinked concept-pair.

For each subhierarchy under “material entity” (BFO:0000040), we iterate through all combinations of concept-pairs and construct a set of all unlinked concept-pairs.
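The five conditions above can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not from the original work: the source ontology of a concept is taken to be the prefix of its identifier, each concept is mapped to a single subhierarchy root through a hypothetical `subhierarchy` dictionary, the toy ancestor sets are invented for the example, and whitespace tokenization is assumed.

```python
from itertools import permutations


def features(name: str) -> frozenset:
    # Assumption: lowercase + whitespace tokenization + de-duplication.
    return frozenset(name.lower().split())


# Hypothetical toy data (the hierarchy here is illustrative only).
names = {
    "VO:0000961": "live attenuated infectious bursal disease virus vaccine",
    "VO:0001220": "live attenuated viral vaccine",
    "VO:0001497": "infectious bursal disease virus vaccine",
}
ancestors = {
    "VO:0000961": {"VO:0001497"},
    "VO:0001220": set(),
    "VO:0001497": set(),
}
# Assumed mapping from each concept to its subhierarchy root under "material entity".
subhierarchy = {cid: "OBI:0000047" for cid in names}


def unlinked_pairs(names: dict, ancestors: dict, subhierarchy: dict) -> set:
    """Return ordered pairs (C, D) satisfying the five unlinked-pair conditions."""
    pairs = set()
    for c, d in permutations(names, 2):  # ordered pairs, so C != D (condition 1)
        if d in ancestors[c] or c in ancestors[d]:
            continue  # condition 2: neither concept is an ancestor of the other
        if not features(names[c]) & features(names[d]):
            continue  # condition 3: at least one common lexical feature
        if c.split(":")[0] != d.split(":")[0]:
            continue  # condition 4: same source ontology (assumed from the ID prefix)
        if subhierarchy[c] != subhierarchy[d]:
            continue  # condition 5: same subhierarchy under "material entity"
        pairs.add((c, d))
    return pairs


unlinked = unlinked_pairs(names, ancestors, subhierarchy)
print(sorted(unlinked))
```

Iterating over all ordered pairs and checking the subhierarchy afterwards yields the same set as iterating within each subhierarchy separately; the latter simply avoids comparing concepts that can never qualify.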

Generation of acquired term pairs

A linked or unlinked concept-pair derives an ATP which emphasises the unique lexical features of each concept. Let the lexical features of a concept-pair C1 and C2 be F(C1) and F(C2) respectively. Then the ATP generated by the concepts is defined as:

$$ATP(C_1, C_2) = (F(C_1) - F(C_2),\; F(C_2) - F(C_1)),$$

i.e., the ATP is obtained by removing common lexical features and maintaining unique ones. For instance, consider the linked concept-pair “infectious bursal disease virus vaccine” (VO:0001497) and “viral vaccine” (VO:0000609) in Fig. 1. By removing common lexical features, we obtain ({infectious, bursal, disease, virus}, {viral}) as the ATP. Similarly, from the unlinked concept-pair “Bovine Respiratory Syncytial Virus vaccine” (VO:0001507) and “Respiratory syncytial virus vaccine” (VO:0000753) in Fig. 2, we obtain the ATP ({bovine}, {}). The second set of the ATP in this instance is an empty set since the lexical features of concept VO:0000753 form a subset of the lexical features of VO:0001507.

Note that a concept-pair (C1,C2) would generate an ATP(C1,C2) that is different from an ATP(C2,C1) generated by concept-pair (C2,C1).
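The ATP definition amounts to two set differences, which can be sketched as below. The function names `features` and `atp` are hypothetical, and whitespace tokenization is assumed.

```python
def features(name: str) -> frozenset:
    # Assumption: lowercase + whitespace tokenization + de-duplication.
    return frozenset(name.lower().split())


def atp(name1: str, name2: str) -> tuple:
    """Return (F(C1) - F(C2), F(C2) - F(C1)) as a tuple of frozensets."""
    f1, f2 = features(name1), features(name2)
    return (f1 - f2, f2 - f1)


linked_atp = atp("infectious bursal disease virus vaccine", "viral vaccine")
unlinked_atp = atp("Bovine Respiratory Syncytial Virus vaccine",
                   "Respiratory syncytial virus vaccine")

print(sorted(linked_atp[0]), sorted(linked_atp[1]))
# ['bursal', 'disease', 'infectious', 'virus'] ['viral']
print(sorted(unlinked_atp[0]), sorted(unlinked_atp[1]))
# ['bovine'] []
```

Because the result is an ordered tuple, swapping the two names swaps the two sets, which reflects the remark that ATP(C1,C2) differs from ATP(C2,C1).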

Discovery of potentially missing is-a relations

Given a linked concept-pair L(C1,C2) and an unlinked concept-pair U(C3,C4), if ATP(C1,C2)=ATP(C3,C4), then we suggest a potentially missing is-a relation: C3 is-a C4. In other words, if an ATP derived by a linked concept-pair can also be derived by an unlinked concept-pair, this is considered to indicate a potentially missing is-a relation between the two concepts of the unlinked concept-pair.

For example, in Fig. 1, the linked concept-pair “infectious bursal disease virus vaccine” (VO:0001497) and “viral vaccine” (VO:0000609) derives the ATP ({infectious, bursal, disease, virus}, {viral}), which can also be derived by the unlinked concept-pair “live attenuated infectious bursal disease virus vaccine” (VO:0000961) and “live attenuated viral vaccine” (VO:0001220). Hence, this denotes a potentially missing is-a relation: VO:0000961 is-a VO:0001220.

Similarly, in Fig. 2, the linked concept-pair “Bovine rotavirus” (NCBITaxon:10927) and “Rotavirus” (NCBITaxon:10912) and the unlinked concept-pair “Bovine Respiratory Syncytial Virus vaccine” (VO:0001507) and “Respiratory syncytial virus vaccine” (VO:0000753) derive the same ATP ({bovine}, {}), and thus indicate a potentially missing is-a relation: VO:0001507 is-a VO:0000753. Note that, as in this example, the linked or unlinked concept-pairs may originate from external ontologies, since VO reuses concepts from external ontologies.

Given the unlinked concept-pairs and linked concept-pairs, Algorithm 1 shows the procedure that was used to extract such potentially missing is-a relations.

Note that the same potentially missing is-a relation C3 is-a C4 may be obtained from multiple linked concept-pairs L(C1,C2) and L(C5,C6) if they derive the same ATP. We remove such duplicate cases from our final set of potentially missing is-a relations.
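The matching step can be sketched as follows. This sketch is not claimed to reproduce Algorithm 1; it only illustrates the idea of comparing ATPs across the two pair sets on toy data. The function names are hypothetical and whitespace tokenization is assumed. Collecting the results in a set removes duplicate suggestions that arise when several linked concept-pairs derive the same ATP.

```python
def features(name: str) -> frozenset:
    # Assumption: lowercase + whitespace tokenization + de-duplication.
    return frozenset(name.lower().split())


def atp(name1: str, name2: str) -> tuple:
    f1, f2 = features(name1), features(name2)
    return (f1 - f2, f2 - f1)


def suggest_missing_isa(linked: set, unlinked: set, names: dict) -> set:
    """Return unlinked pairs (C3, C4) whose ATP is also derived by some linked pair."""
    linked_atps = {atp(names[c], names[a]) for c, a in linked}
    return {(c, d) for c, d in unlinked if atp(names[c], names[d]) in linked_atps}


names = {
    "VO:0001497": "infectious bursal disease virus vaccine",
    "VO:0000609": "viral vaccine",
    "VO:0000961": "live attenuated infectious bursal disease virus vaccine",
    "VO:0001220": "live attenuated viral vaccine",
    "NCBITaxon:10927": "Bovine rotavirus",
    "NCBITaxon:10912": "Rotavirus",
    "VO:0001507": "Bovine Respiratory Syncytial Virus vaccine",
    "VO:0000753": "Respiratory syncytial virus vaccine",
}
linked = {("VO:0001497", "VO:0000609"), ("NCBITaxon:10927", "NCBITaxon:10912")}
unlinked = {
    ("VO:0000961", "VO:0001220"),
    ("VO:0001220", "VO:0000961"),  # reversed order derives a different ATP
    ("VO:0001507", "VO:0000753"),
}

suggestions = suggest_missing_isa(linked, unlinked, names)
print(sorted(suggestions))
```

Storing the linked ATPs in a set makes each lookup for an unlinked concept-pair a constant-time hash check rather than a scan over all linked concept-pairs.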

Post-processing

We further perform a filtration step on the set of potentially missing is-a relations as described below. For an unlinked concept-pair U(C3,C4) and a linked concept-pair L(C1,C2) generating the same ATP, if the name of the concept C3 is the same as that of the concept C1, or the name of the concept C4 is the same as that of the concept C2, then we do not suggest a potentially missing is-a relation between C3 and C4. This is because two concepts with the same name but different identifiers may reveal a different type of quality issue (e.g., duplicate concepts) rather than a missing is-a relation. For example, the linked concept-pair “F fusion protein” (VO:0011167) and “Measles virus protein” (VO:0010784) and the unlinked concept-pair “F fusion protein” (VO:0011208) and “Measles virus protein” (VO:0010784) generate the same ATP: ({f, fusion}, {measles, virus}). However, VO:0011167 and VO:0011208 have the same name: “F fusion protein”. Hence, we do not suggest a missing is-a relation between VO:0011208 and VO:0010784.
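The filtration step can be sketched as below. The text does not state what happens when several linked concept-pairs derive the same ATP and only some of them share a name with the unlinked concept-pair; this sketch assumes that a suggestion is kept if at least one supporting linked concept-pair passes the name check. That rule, the function names, and the whitespace tokenization are all assumptions made for illustration.

```python
def features(name: str) -> frozenset:
    # Assumption: lowercase + whitespace tokenization + de-duplication.
    return frozenset(name.lower().split())


def atp(name1: str, name2: str) -> tuple:
    f1, f2 = features(name1), features(name2)
    return (f1 - f2, f2 - f1)


def filtered_suggestions(linked: set, unlinked: set, names: dict) -> set:
    """Suggest (C3, C4) only when a matching linked pair (C1, C2) shares no name with it."""
    kept = set()
    for c3, c4 in unlinked:
        for c1, c2 in linked:
            if atp(names[c1], names[c2]) != atp(names[c3], names[c4]):
                continue  # this linked pair does not support the suggestion
            if names[c3] == names[c1] or names[c4] == names[c2]:
                continue  # same names: likely a duplicate-concept issue instead
            kept.add((c3, c4))
    return kept


names = {
    "VO:0011167": "F fusion protein",
    "VO:0011208": "F fusion protein",
    "VO:0010784": "Measles virus protein",
    "VO:0001497": "infectious bursal disease virus vaccine",
    "VO:0000609": "viral vaccine",
    "VO:0000961": "live attenuated infectious bursal disease virus vaccine",
    "VO:0001220": "live attenuated viral vaccine",
}
linked = {("VO:0011167", "VO:0010784"), ("VO:0001497", "VO:0000609")}
unlinked = {("VO:0011208", "VO:0010784"), ("VO:0000961", "VO:0001220")}

result = filtered_suggestions(linked, unlinked, names)
print(result)
```

Here the “F fusion protein” pair is dropped because its only supporting linked concept-pair carries the same names, while the bursal disease vaccine pair is retained.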

Manual evaluation of identified potentially missing is-a relations

Potentially missing is-a relations identified by this method need to be manually reviewed for validation and confirmation before their adoption into VO. We randomly selected a subset of suggested potentially missing is-a relations for domain expert evaluation. For each missing is-a relation in the subset, the names of the two concepts together with their identifiers were provided to a domain expert (author YH, who has expertise in microbiology, vaccinology, and nephrology, and currently leads the development of VO). The domain expert examined whether the suggested relation is valid, not only theoretically but also in terms of its suitability to the current modeling practices of VO.
