Alignment of vaccine codes using an ontology of vaccine descriptions

Construction of the VaccO ontology

A vaccine code in a medical coding system stands for an individual vaccine product or for a pharmacologic group of vaccines. To prepare the creation of the VaccO ontology, we first identified categories of the properties used to define the vaccine groups in a number of general, drug-specific, and custom, database-specific coding systems: SNOMED-CT, Read-2, MeSH, ATC, BNF, and Additional Health Data (AHD) from the database of The Health Improvement Network (THIN).

Immunization targets (i.e., vaccine-preventable diseases and their pathogens) were used in all coding systems for the definition of vaccine codes (Table 1). Vaccine-preventable diseases and pathogens may be used interchangeably to describe equivalent vaccine groups (e.g., ‘Vaccine against cervical cancer’ and ‘Human papillomavirus vaccine’). Vaccine codes were further defined based on vaccine strategies, ingredients (including adjuvants, excipients, and active ingredients), routes of administration, and valences (which can denote the number of pathogen strains targeted by a vaccine or the number of components in combination vaccines).

Table 1 Categories of properties used in vaccine descriptions. A check mark (\(\checkmark\)) indicates that a property category (row) is used for defining vaccine codes in a coding system (column)

The VaccO ontology is specified using the Web Ontology Language (OWL2) [32]. Classes are hierarchically structured by the subclass relation (is-a) and their extension is specified by expressions of description logic (DL) describing the properties of the class [33]. For example, the class of influenza vaccines can be defined by the DL expression Vaccine that immunizes-against Influenza, where Vaccine and Influenza refer to other classes and immunizes-against is a property. A class can further contain one or more terms to state the meaning of the class in free text.

The categories of vaccine properties, vaccines, and vaccine products are represented by fundamental classes, which lay out the overall structure of the VaccO ontology: Vaccine, Valence, Route, Ingredient, Strategy, Disease, and Pathogen (see Fig. 1). Classes for pharmacological groups and vaccine products are defined as subclasses of Vaccine. The other classes in the VaccO ontology and their English terms were compiled from the following resources (by manual analysis if not stated differently):

Classes for vaccine products and their ingredients were extracted from the Art57 DB using a Python script.

Common pharmacological vaccine groups and their abbreviations (such as ‘DTaP’) were identified in vaccine literature [5, 34,35,36,37] and a monograph from the US Centers for Disease Control and Prevention [38].

Vaccine strategies and terms were extracted from descriptions in literature, classes in the VO ontology, and vaccine codes in MeSH.

To the best of our knowledge, indications of drugs, including immunization targets of vaccines, are not defined in any publicly available, formalized resource. Instead, we extracted classes for pathogens and diseases, and the causal relationships between them, from the descriptions of MeSH headings (‘scope notes’). Terms were automatically compiled from the codes that the Unified Medical Language System [39] links to the MeSH headings of pathogens and diseases in the following coding systems: Consumer Health Vocabulary (CHV) [40], International Statistical Classification of Diseases, 10th Revision, Clinical Modification [41], Medical Dictionary for Regulatory Activities [42], MeSH, the taxonomy of the National Center for Biotechnology Information [43], and SNOMED-CT.

Administration routes were identified in the Art57 DB and the VO ontology, and terms (including common abbreviations) were compiled from literature and a monograph of the FDA [44].

Classes and terms for valences (‘1-valent’ up to ‘30-valent’) were generated automatically, and common terms for valences 1 to 10 were added manually (e.g., ‘pentavalent’).

Fig. 1

Structure of the core VaccO ontology. Fundamental classes representing property categories are shown as orange boxes. Properties marked with an asterisk are propagated along subclass relations (is-a) and containment relations (has-ingredient). Their domains are expanded along the same relations. Examples for representing a vaccine product from the Article 57 database (‘Havrix’), and a vaccine group defined by ATC code J07BC02 (‘Hepatitis A, inactivated’) are shown with dashed frames. The visualization follows the Graffoo specification [45]

Relations between classes are expressed in OWL2 using (existential) object properties. An object property is defined by its domain and by its range. For example, the domain of the object property has-ingredient is the class Vaccine and its range is the class Ingredient. Other object properties in VaccO are immunizes-against (relating Vaccine and Active ingredient with Pathogen and Disease), has-strategy (relating Vaccine and Active ingredient with Strategy), has-valence (relating Vaccine with Valence), has-route (relating Vaccine with Route), causes (relating Pathogen with Disease), and caused-by (relating Disease with Pathogen). Property chains were defined to propagate properties from ingredients to the vaccines that contain them, and to unify pathogens and diseases as immunization targets when they are in a causal relation (Table 2). For example, the property chain has-ingredient \(\circ\) immunizes-against \(\Rightarrow\) immunizes-against states that if a vaccine has an ingredient that immunizes against a specific target (left-hand side), the vaccine also immunizes against that target (right-hand side).
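The effect of such a property chain can be illustrated outside OWL with a small sketch. The class names and dictionaries below are hypothetical toy data, not content of VaccO, and the step from pathogen to disease is an assumption about how the causal relation is used; in practice this inference is left to an ontology reasoner.

```python
# Hypothetical toy facts (illustrative only, not taken from VaccO).
has_ingredient = {"ExampleVaccine": {"ExampleAntigen", "ExampleAdjuvant"}}
immunizes_against = {"ExampleAntigen": {"Hepatitis A virus"}}
causes = {"Hepatitis A virus": {"Hepatitis A"}}


def inferred_targets(vaccine):
    """Collect immunization targets of a vaccine.

    Step 1 applies has-ingredient o immunizes-against => immunizes-against.
    Step 2 (an assumption) adds diseases caused by targeted pathogens.
    """
    targets = set(immunizes_against.get(vaccine, set()))
    for ingredient in has_ingredient.get(vaccine, set()):
        targets |= immunizes_against.get(ingredient, set())
    for target in list(targets):
        targets |= causes.get(target, set())
    return targets


print(sorted(inferred_targets("ExampleVaccine")))
```

The adjuvant contributes no target because it has no immunizes-against entry, so only the active ingredient's target is propagated to the vaccine.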

Table 2 Example inferences about compiled vaccine classes using property chains in VaccO: the propagation of the a) immunization targets and b) vaccine strategies from the active ingredients to vaccines, and c) the definition of immunization targets interchangeably by pathogen and vaccine-preventable diseases

Representation of vaccine descriptions in VaccO

The representation of vaccine descriptions in VaccO involves three steps: The identification of vaccine properties in the free-text description, the compilation of the vaccine properties into logical expressions in the ontology, and the normalization of the comprised information as property values.

Identification of vaccine properties in free text

The set of all terms assigned to the classes in an ontology is called the ontology dictionary. The VaccO ontology dictionary constitutes the basis for identifying references to its classes in free text. Each occurrence of a term from the dictionary in an input text is considered a reference to the associated class. We refer to the set of classes identified in an input text t as C(t). For example, the input text \(t=\) ‘Live/attenuated influenza vaccine’ contains references to the classes in \(C(t)=\{\mathrm{Attenuated}, \mathrm{Influenza}, \mathrm{Vaccine}\}\).

We prepared the dictionary of VaccO for multilingual input by automatically translating all English terms to Spanish, Italian, and Catalan (the languages of the vaccine code descriptors in the ADVANCE data sources) using Google Translate [46]. The multilingual dictionary is stored in the Apache Solr text search platform, and a Solr plugin for dictionary-based concept identification, Solr TextTagger, is used to identify occurrences of terms from the ontology dictionary in free text [47, 48].
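Dictionary-based identification of classes can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Solr-based setup described above; the dictionary entries and the whole-word matching rule are assumptions made for the example.

```python
import re

# Hypothetical dictionary: lower-case term -> class name (not VaccO content).
DICTIONARY = {
    "influenza": "Influenza",
    "flu": "Influenza",
    "attenuated": "Attenuated",
    "vaccine": "Vaccine",
}


def identify_classes(text):
    """Return C(t): classes whose terms occur as whole words in the text."""
    lowered = text.lower()
    found = set()
    for term, cls in DICTIONARY.items():
        # \b keeps 'flu' from matching inside 'influenza'.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(cls)
    return found


print(sorted(identify_classes("Live/attenuated influenza vaccine")))
```

Because several terms may point to the same class, the descriptors ‘Flu vaccine’ and ‘Influenza vaccine’ yield the same set of classes in this sketch.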

Compilation of vaccine properties into the VaccO class

The representation of vaccine descriptions in VaccO is based on the compilation of a VaccO class c identified in the descriptor to a DL expression describing a vaccine, \([\![ c ]\!]\). The compilation depends on the category of c and corresponds to c itself if it is a vaccine (a class being a DL expression), or to the class of vaccines with a specific property if c is a vaccine property:

$$[\![ c ]\!] := \left\{ \begin{array}{ll} c & \text{if } c \text{ is-a Vaccine} \\ \text{Vaccine that has-strategy } c & \text{if } c \text{ is-a Strategy} \\ \text{Vaccine that immunizes-against } c & \text{if } c \text{ is-a Pathogen or Disease} \\ \text{Vaccine that has-ingredient } c & \text{if } c \text{ is-a Ingredient} \\ \text{Vaccine that has-valence } c & \text{if } c \text{ is-a Valence} \\ \text{Vaccine that has-route } c & \text{if } c \text{ is-a Route} \end{array}\right.$$

For example, the disease class Tuberculosis is compiled to the DL expression Vaccine that immunizes-against Tuberculosis. A set of classes is compiled into the conjunction of the compiled individual classes, \([\![ \{c_1, \ldots, c_n\} ]\!] := [\![ c_1 ]\!]\ \mathrm{and}\ \ldots\ \mathrm{and}\ [\![ c_n ]\!]\).

A textual description t of a vaccine is represented by the compiled vaccine class \(V(t) := [\![ C(t)]\!]\), defined by the result of compiling the classes identified in the description. For example, the vaccine class for the descriptor ‘Live/attenuated influenza vaccine’ is defined by the DL expression Vaccine that immunizes-against Influenza and has-strategy Attenuated.
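The compilation rules can be sketched as a small function that renders DL expressions as strings. The category table below is hypothetical and covers only a few classes; the sketch also does not merge conjuncts into the compact form used in the text.

```python
# Hypothetical category assignment for a few classes (not VaccO content).
CATEGORY = {
    "Vaccine": "Vaccine",
    "Influenza": "Disease",
    "Tuberculosis": "Disease",
    "Attenuated": "Strategy",
}

# Object property used for each category of vaccine property.
PROPERTY_FOR_CATEGORY = {
    "Strategy": "has-strategy",
    "Pathogen": "immunizes-against",
    "Disease": "immunizes-against",
    "Ingredient": "has-ingredient",
    "Valence": "has-valence",
    "Route": "has-route",
}


def compile_class(cls):
    """Compile one class c into a DL expression string [[c]]."""
    category = CATEGORY[cls]
    if category == "Vaccine":
        return cls
    return f"Vaccine that {PROPERTY_FOR_CATEGORY[category]} {cls}"


def compile_classes(classes):
    """Compile a set of classes into the conjunction of their expressions."""
    return " and ".join(f"({compile_class(c)})" for c in sorted(classes))


print(compile_classes({"Attenuated", "Influenza", "Vaccine"}))
```

Sorting the classes only makes the output deterministic; the order of conjuncts has no logical meaning.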

Normalization to property values

The property values P(t) of a vaccine description t are an assignment of each object property in VaccO (immunizes-against, has-route, etc.) to all subclasses of the property range that conform to the vaccine description and the information available in VaccO. Formally, the property values P(t) contain for each property p all subclasses c in the range of p, where \(\mathrm{VaccO} \vDash [\![ C(t) ]\!] \sqsubseteq \mathrm{Vaccine}\ \mathrm{that}\ p\ c\) (using the notation by Baader [33]). For example, the property values for the descriptor ‘DTwP’ are [immunizes-against: ; has-strategy: ].

The compiled vaccine class links information from the vaccine description with information in the VaccO ontology. An ontology reasoner is required to access information implied by the ontology, and the comparison of two compiled vaccine classes can only assess specification, generalization, or equivalence. However, the property values are an explicit representation of all information about a vaccine description implied by the ontology, and they can be compared with each other more flexibly using similarity measures for sets. Furthermore, equivalent vaccine descriptions based on pathogens (‘Influenza virus vaccine’), diseases (‘Flu vaccine’), abbreviations (‘IIV3’), or products (‘Influvac’) are normalized to the same property value [immunizes-against: ].

The representation of vaccine classes and the conversion to property values were implemented in Java using the OWL2 application programming interface and the JFact ontology reasoner [49, 50].

Figure 2 summarizes the pipeline for representing a textual vaccine description using the VaccO ontology.

Fig. 2

Pipeline for representing a textual vaccine description t using the VaccO ontology

Automatic code alignment and evaluation

An alignment between a source coding system and a target coding system assigns each source code to its closest corresponding target code. Our algorithm for creating an alignment first scores the similarity between each source code and each target code (where 1 indicates maximal similarity and 0 indicates no similarity). The target code with the highest similarity score is then assigned to the source code, provided that the score was larger than a preset similarity threshold. If the maximum score does not reach the threshold, no target code is assigned. If multiple target codes have the same maximum similarity score larger than the threshold, all target codes are assigned unless the target coding system has a taxonomic hierarchy. In that case, only the most general target codes with maximum similarity are assigned.
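The assignment step can be sketched as follows. The function and parameter names are hypothetical, the similarity function is passed in by the caller, and the hierarchy is assumed to be a tree given as a child-to-parent dictionary. Treating a score equal to the threshold as not assigned is one reading of the description above.

```python
def has_ancestor_in(code, candidates, parent):
    """True if any ancestor of code (following parent links) is a candidate."""
    current = parent.get(code)
    while current is not None:
        if current in candidates:
            return True
        current = parent.get(current)
    return False


def align(source_codes, target_codes, similarity, threshold, parent=None):
    """Assign to each source code the target codes with maximal similarity.

    parent: optional dict child -> parent for a hierarchical target system
    (assumed to be a tree); if given, only the most general tied codes
    are kept.
    """
    alignment = {}
    for s in source_codes:
        scores = {t: similarity(s, t) for t in target_codes}
        best = max(scores.values(), default=0.0)
        if best <= threshold:
            alignment[s] = []  # no target code is similar enough
            continue
        tied = [t for t in target_codes if scores[t] == best]
        if parent is not None:
            tied_set = set(tied)
            tied = [t for t in tied if not has_ancestor_in(t, tied_set, parent)]
        alignment[s] = tied
    return alignment
```

For instance, if targets x and y tie at the maximum score and y is a child of x, only x is assigned when the hierarchy is supplied, and both are assigned otherwise.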

Alignment methods

We evaluated our alignment algorithm using two baseline similarity methods and three similarity methods involving the representation of vaccine descriptions in VaccO as described above. Example alignments for the VaccO-based methods are shown in Fig. 3.

Fig. 3

Example compilation of the textual descriptors t of vaccine codes X, Y1, Y2, and Y3 into classes in VaccO. Above: VaccO classes are identified in the code descriptors (blue boxes in the source and target code descriptors on the left) and compiled into vaccine classes (V(X), V(Y1), ...). Below: Representation of the vaccine descriptors in the VaccO similarity methods. The classes identified in the descriptor of code X do not overlap with those in the descriptors of codes Y1, Y2, or Y3, and the DL-expressions are not equivalent, resulting in a similarity of 0 for similarity methods Classes and Equivalence and a missing alignment for X. However, property values of code X and the target codes overlap, and X is assigned in Properties to code Y1, which has maximal similarity with X (Y1: 0.5, Y2: 0.3, Y3: 0)

Method Tokens implemented a simple lexical technique. Each code descriptor was tokenized, and the similarity between two codes was measured by the Jaccard coefficient of the two sets of tokens. The Jaccard coefficient of two sets \(s\) and \(t\) is defined as \(\left| s \cap t \right| /\left| s \cup t \right|\).
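A minimal sketch of this lexical similarity is given below. The tokenization rule (lower-casing and splitting on non-word characters) is an assumption, since the exact tokenizer is not specified here, and returning 0 for two empty token sets is a convention chosen for the example.

```python
import re


def tokenize(descriptor):
    """Split a descriptor into a set of lower-case word tokens (assumed rule)."""
    return set(re.findall(r"\w+", descriptor.lower()))


def jaccard(s, t):
    """Jaccard coefficient |s & t| / |s | t|; 0.0 if both sets are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


a = tokenize("Influenza vaccine")
b = tokenize("Live influenza vaccine")
print(jaccard(a, b))  # two shared tokens out of three distinct tokens
```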

Method Metamap used the MetaMap program to identify UMLS concept unique identifiers (CUIs) for each code descriptor, abstracting over word inflections and synonyms [51]. MetaMap uses a dictionary of English terms, and thus can only find concepts in English text. Similarity was defined by the Jaccard coefficient of the two sets of CUIs.

Method Classes represented a code with descriptor \(t\) as the set of classes identified in the code descriptor, \(C(t)\). Similarity was defined by the Jaccard coefficient of the classes of the source code and the classes of the target code.

Method Equivalence represented a code with descriptor \(t\) by the compiled vaccine class, \(V(t)\). Similarity between two codes was 1 if their compiled vaccine classes were equivalent and 0 otherwise. Assessing equivalence involved information implied by the VaccO ontology and was checked using the ontology reasoner.

Method Properties represented a code with descriptor \(t\) by its property values, \(P(t)\). The similarity between a source code and target code was defined as 0 if the values of property immunizes-against differed, and by the overlap between the property values otherwise. The overlap was defined as the Jaccard coefficient between the property values.
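The following sketch shows one way to compute this similarity. Property values are modelled as a dictionary from property name to a set of class names, and the Jaccard coefficient is taken over all (property, class) pairs; both choices are assumptions made for illustration, as is the example data.

```python
def property_pairs(values):
    """Flatten {property: {classes}} into a set of (property, class) pairs."""
    return {(p, c) for p, classes in values.items() for c in classes}


def properties_similarity(source, target):
    """0 if the immunization targets differ, else Jaccard overlap of all pairs."""
    if source.get("immunizes-against", set()) != target.get("immunizes-against", set()):
        return 0.0
    s, t = property_pairs(source), property_pairs(target)
    union = s | t
    return len(s & t) / len(union) if union else 0.0


# Hypothetical property values for three codes.
x = {"immunizes-against": {"Influenza"}, "has-strategy": {"Attenuated"}}
y1 = {"immunizes-against": {"Influenza"}}
y2 = {"immunizes-against": {"Tetanus"}}
print(properties_similarity(x, y1), properties_similarity(x, y2))
```

Here x and y1 share the immunization target but only one of two property pairs, giving 0.5, while x and y2 differ in their immunization targets and score 0.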

Reference mappings

To evaluate our code alignment algorithm, we used two reference sets with manually curated alignments (Table 3). The first reference set Vactype used the Vactype coding system as a target. Vactype was developed as a pragmatic solution to harmonize the vaccine descriptors in the databases that participated in early vaccine studies of the ADVANCE project [20]. It used English descriptors, and currently comprises 43 codes (for 28 single immunization targets with strategies, and 15 combinations). The Vactype reference set used five custom vaccine coding systems with multilingual descriptors from European EHR databases as source coding systems: the Catalonian Information System for Research in Primary Care (SIDIAP) with Catalan descriptors [52], the Spanish Base de datos para la Investigación Farmacoepidemiológica en Atención Primaria (BIFAP) with Spanish descriptors [53], the Italian paediatric database Pedianet with both English and Italian descriptors [54], and the regional primary care database of Venetia with Italian descriptors. The alignments in the Vactype reference set were manually created and validated by the database custodians in a proof-of-concept study of the ADVANCE project [20].

Table 3 Vaccine coding systems, languages, and number of source codes in the reference sets

The second reference set Atc comprised alignments from coding systems in the UMLS to the ATC target coding system. As of 2017, the ATC system contained 114 vaccine codes (with prefix J07). The coding systems with the largest number of mappings to ATC vaccine codes in the UMLS were used as source coding systems in the ATC reference set: Veterans Affairs National Drug File (VANDF), MeSH, CHV, Vaccine Administered (CVX), and NDF-RT. We corrected 17 code assignments where the source codes were not assigned to the most specific, corresponding ATC code in the UMLS.

Reflexive alignments in which either Vactype or ATC was both the source coding system and the target coding system were included in the evaluation to assess the completeness of the intermediate representation used by the different similarity methods.

Performance measures

The comparison of an automatically generated alignment with a reference alignment is based on the number of correctly generated assignments (true positive, TP), the number of incorrectly generated assignments (false positive, FP), and the number of reference assignments that were not generated (false negative, FN). The performance of a generated alignment was assessed by its precision (\(\mathrm{TP}/\left( \mathrm{TP} + \mathrm{FP} \right)\)), recall (\(\mathrm{TP}/\left( \mathrm{TP} + \mathrm{FN} \right)\)), and F-score (\(2 \cdot \mathrm{precision} \cdot \mathrm{recall}\ /\ (\mathrm{precision} + \mathrm{recall})\)). We also report the average performance measures over all source coding systems in each reference set (excluding reflexive alignments).
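These measures can be computed from two sets of assignments, as in the sketch below. Representing an alignment as a set of (source code, target code) pairs and returning 0 when a denominator is zero are assumptions made for the example.

```python
def evaluate(generated, reference):
    """Return (precision, recall, F-score) for sets of (source, target) pairs."""
    tp = len(generated & reference)   # correctly generated assignments
    fp = len(generated - reference)   # incorrectly generated assignments
    fn = len(reference - generated)   # reference assignments not generated
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical alignments: one correct, one wrong, one missed assignment.
generated = {("a", "x"), ("b", "y")}
reference = {("a", "x"), ("b", "z")}
print(evaluate(generated, reference))
```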
