Constructing a knowledge graph for open government data: the case of Nova Scotia disease datasets

A knowledge graph construction process can be performed in the following steps: 1) knowledge acquisition, to collect semi-structured data from an API; 2) knowledge extraction, to extract entities and their relationships; 3) knowledge fusion, to construct an ontology, assign entities and relationships, and interlink entities with external ontologies and datasets; and 4) knowledge storage, to store the knowledge graph in a triple store. To generate a knowledge graph for the NSOD disease datasets, we transform the collected datasets into RDF using a multi-dimensional data model, a custom ontology, semantic rules, and an interlinking process. The following subsections describe these steps in detail.

Data model

The metadata of each NSOD dataset consists of information about that dataset, such as its name, publisher, publication date, category, department, etc., which can be transformed to RDF using the VoID [12], DCMI, DCAT, and RDFS vocabularies. Each observation in an NSOD dataset comprises a collection of dimensions, measures, and attributes that can be described by a Data Structure Definition (DSD). Figure 2 shows an example observation in an NSOD dataset.

Fig. 2 An example of an observation in an open statistical dataset [4]
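As a minimal sketch of this metadata step, the snippet below expresses one dataset's descriptive metadata with rdflib; the dataset URI, literal values, and term choices are illustrative assumptions, not the paper's actual data or code.

```python
# A minimal sketch (not the paper's actual code): describing one NSOD
# dataset's metadata with DCAT, Dublin Core, and VoID terms via rdflib.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.bind("dcat", DCAT)
g.bind("void", VOID)
g.bind("dcterms", DCTERMS)

ds = URIRef("http://example.org/dataset-giardiasis")  # hypothetical URI
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.title, Literal("Giardiasis cases in Nova Scotia")))  # assumed values
g.add((ds, DCTERMS.publisher, Literal("Province of Nova Scotia")))
g.add((ds, DCTERMS.issued, Literal("2017-01-01")))
g.add((ds, DCTERMS.subject, Literal("Communicable diseases")))

print(g.serialize(format="turtle"))
```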

To model the multi-dimensional NSOD datasets, the RDF Data Cube vocabulary is used, following the W3C recommendation [13]. The RDF Data Cube vocabulary allows publishers to integrate and slice across their datasets [14]; it enables statistical data to be represented in standard RDF and published in conformance with the principles of linked data [15]. Slices are frequently useful for grouping subsets of observations within a dataset. For instance, we can group all the observations about a given region or category in a dataset.
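A minimal sketch of such a slice, assuming a hypothetical eg namespace and example URIs (the paper's own URI scheme may differ):

```python
# Grouping observations into a qb:Slice with rdflib; URIs, the fixed
# year dimension, and the eg namespace are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

QB = Namespace("http://purl.org/linked-data/cube#")
EG = Namespace("http://example.org/ns#")  # hypothetical namespace

g = Graph()
g.bind("qb", QB)

sl = URIRef("http://example.org/slice-ns-2017")
g.add((sl, RDF.type, QB.Slice))
g.add((sl, EG.refPeriod, Literal(2017)))  # the dimension fixed by this slice

obs = URIRef("http://example.org/obs-giardiasis-2017")
g.add((obs, RDF.type, QB.Observation))
g.add((sl, QB.observation, obs))  # attach the observation to the slice
```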

Ontology

To the best of our knowledge, no existing ontology can be re-used as-is given the nature of the NSOD datasets. However, we re-use an existing data model for describing multi-dimensional data (the RDF Data Cube vocabulary), an external disease ontology, and best-practice vocabularies such as the Statistical Data and Metadata eXchange (SDMX) to develop a custom ontology for the disease-related datasets of NSOD. The datasets were coded as entities with distinct data structure definitions, slices, and observations.

All the datasets in the ontology are instances of the class DataSet, and the nomenclature used for datasets is "dataset-dataset_name". Each dataset has one associated data structure definition (qb:DataStructureDefinition), which defines the dataset's dimensions, measures, and attributes and is linked to the DataSet by the structure property. The dimensions, measures, and attributes are linked to the data structure definition by the properties dimension, measure, and attribute, respectively. Also, the classes qb:Slice and ObservationGroup are used to group observations by one or more dimensions. Each slice is linked to the data structure definition using the sliceKey property. The observations are attached to a dataset by the observation property and to the respective slices by the observationGroup property. Figure 3 illustrates a sample observation based on the defined ontology. Table 1 also shows the prefixes used in the ontology.
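A hedged sketch of this structure using standard RDF Data Cube terms; the component URIs and the eg namespace are assumptions rather than the published ontology:

```python
# Illustrative only: the "dataset-dataset_name" naming scheme and the
# DataSet -> DataStructureDefinition linkage described above.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

QB = Namespace("http://purl.org/linked-data/cube#")
EG = Namespace("http://example.org/ns#")  # hypothetical namespace

g = Graph()
ds = EG["dataset-giardiasis"]   # "dataset-" + dataset name
dsd = EG["dsd-giardiasis"]

g.add((ds, RDF.type, QB.DataSet))
g.add((ds, QB.structure, dsd))  # one DSD per dataset
g.add((dsd, RDF.type, QB.DataStructureDefinition))

# Components (dimensions, measures, attributes) hang off the DSD.
comp = EG["component-refPeriod"]
g.add((dsd, QB.component, comp))
g.add((comp, QB.dimension, EG["refPeriod"]))
```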

Fig. 3 An observation based on the defined ontology

Table 1 Re-used vocabularies

Interlinking datasets to external ontologies and datasets

We use an external ontology, the Disease Ontology, to enrich the knowledge graph with domain knowledge. We link the NSOD diseases to the Disease Ontology based on the cosine similarity between the disease names. Through this interlinking process, we enrich each disease with its parent (super-class) diseases and enable users to search the knowledge graph by a disease's direct super-classes (e.g., viral disease). We also use Geonames to represent the regional dimension as resources instead of literals. This adds semantics to the statistical data in case other regional datasets (e.g., other provincial datasets) are joined to the knowledge graph.
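A minimal sketch of the name-matching step, assuming a character n-gram TF-IDF representation and an acceptance threshold of 0.8 (both implementation choices are ours for illustration; the paper does not specify them):

```python
# Cosine similarity between NSOD disease names and Disease Ontology labels.
# The vectorization scheme, threshold, and label lists are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nsod_names = ["Giardiasis", "Hepatitis B", "Influenza"]  # sample names
do_labels = ["giardiasis", "hepatitis B", "influenza"]   # sample DO labels

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(nsod_names + do_labels)
sim = cosine_similarity(vec.transform(nsod_names), vec.transform(do_labels))

for i, name in enumerate(nsod_names):
    j = sim[i].argmax()
    if sim[i, j] >= 0.8:  # assumed acceptance threshold
        print(f"{name} -> {do_labels[j]} (similarity {sim[i, j]:.2f})")
```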

As the DBpedia knowledge graph includes a broad scope of entities covering different areas of disease knowledge, we also connect the disease names of an NSOD dataset to this knowledge graph. To do so, we use Python to search for each disease name in DBpedia via its SPARQL endpoint and connect each observation to the DBpedia resource using the owl:sameAs property. For example, the disease Giardiasis is linked to http://dbpedia.org/resource/Giardiasis.
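A hedged sketch of this lookup, using the SPARQLWrapper library against the public DBpedia endpoint; the exact query and the label-matching strategy are assumptions, and error handling and rate limiting are omitted:

```python
# Look up a DBpedia resource by English label and return its IRI so an
# owl:sameAs link can be emitted. The query shape is an assumption.
from SPARQLWrapper import SPARQLWrapper, JSON

def find_dbpedia_resource(disease_name):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s WHERE {{
            ?s rdfs:label "{disease_name}"@en .
            FILTER(STRSTARTS(STR(?s), "http://dbpedia.org/resource/"))
        }} LIMIT 1
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0]["s"]["value"] if rows else None

print(find_dbpedia_resource("Giardiasis"))
# expected: http://dbpedia.org/resource/Giardiasis
```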

Rules

Complex formal semantics in a knowledge graph allows a reasoner to infer relationships between data items in different datasets [16]. This step adds more meaning to the knowledge graph and links entities together through an additional semantic layer. The Semantic Web Rule Language (SWRL), an example of a rule markup language, is used to standardize the publishing and sharing of inference rules. As a proof of concept, we designed an SWRL rule to infer the transitive relationship of diseases in a dataset using the Protégé rule engine. This implies that if an observation x includes a disease y, which is a form of disease z in the Disease Ontology, then the graph will infer that observation x implicitly includes disease z. The rule states that:

$$hasDisease(?x, ?y) \wedge \mathit{is\_a}(?y, ?z) \implies hasDisease(?x, ?z)$$

Another example semantic rule concerns observations with a high number of cases for a particular disease. Based on the current number of cases of each disease in Nova Scotia, we considered 1,000 disease cases per 100,000 population to be high for the province. Those observations can be defined by the following rule:

$$Observation(?obs) \wedge numberOfCases(?obs, ?n) \wedge \mathit{greaterThan}(?n, 1000) \implies HighDiseaseCases(?obs)$$
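Outside Protégé, the effect of these two rules can be checked with a plain-Python sketch; the data values and field names below are made up for illustration:

```python
# Plain-Python approximation of the two SWRL rules above; not SWRL itself.
is_a = {  # assumed fragment of the Disease Ontology's is_a hierarchy
    "giardiasis": "parasitic infectious disease",
    "parasitic infectious disease": "disease by infectious agent",
}

observations = [
    {"id": "obs1", "disease": "giardiasis", "numberOfCases": 1250},
    {"id": "obs2", "disease": "giardiasis", "numberOfCases": 40},
]

def super_diseases(d):
    # transitive closure of is_a: hasDisease(?x,?y) and is_a(?y,?z) imply hasDisease(?x,?z)
    while d in is_a:
        d = is_a[d]
        yield d

for obs in observations:
    obs["inferredDiseases"] = list(super_diseases(obs["disease"]))
    obs["highDiseaseCases"] = obs["numberOfCases"] > 1000  # second rule

print(observations)
```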

Transformation process

The structural metadata about the dimensions and measures of the NSOD datasets generally differs from dataset to dataset. We developed a configuration setting that specifies the dimensions and measures of each dataset, so that other datasets with different dimensions and measures can be added. This allows semi-automatic updating of the graph with input data and keeps the datasets semantically connected to the external ontologies and the Linked Open Data cloud. For example, several disease datasets have a number-of-cases property that can be mapped to a single predicate (eg:numberOfCases) across the knowledge graph. A possible shape for such a configuration is sketched below.
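This is an assumed shape for the per-dataset configuration; the actual settings in the project repository may differ:

```python
# Hypothetical per-dataset configuration: which columns are dimensions
# and which are measures. Names are illustrative, not the repo's schema.
DATASET_CONFIG = {
    "giardiasis": {
        "dimensions": ["refArea", "refPeriod", "sex"],
        "measures": ["numberOfCases", "rateper100kpopulation"],
    },
    "hepatitis_b": {
        "dimensions": ["refArea", "refPeriod"],
        "measures": ["numberOfCases"],
    },
}

def columns_for(dataset_name):
    """Return the expected columns for one dataset's observations."""
    cfg = DATASET_CONFIG[dataset_name]
    return cfg["dimensions"] + cfg["measures"]
```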

In the transformation process, we use the Dublin Core Metadata [17], the most widely used metadata schema, to describe the metadata elements of the datasets, such as publication date, dataset title, subject or category, source, contributor, etc. The corresponding elements of each observation are mapped to RDF triples based on the vocabularies listed in Table 2.
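In the spirit of Table 2, such a mapping might look like the following dictionary; the exact element-to-vocabulary pairs are assumptions, since Table 2 is not reproduced here:

```python
# Hypothetical field-to-predicate mapping in the spirit of Table 2.
FIELD_TO_PREDICATE = {
    "title": "dcterms:title",
    "published": "dcterms:issued",
    "category": "dcterms:subject",
    "year": "dimension:refPeriod",
    "zone": "dimension:refArea",
    "gender": "sdmx:sex",
    "cases": "eg:numberOfCases",
}
```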

Table 2 Mapping vocabularies

Knowledge graph constructor

The knowledge graph constructor is the main component of the knowledge graph construction process (see Fig. 4). It connects the various parts of the system by collecting data from different sources, transforming them into a unified multi-dimensional model based on the W3C standards, interlinking them with external ontologies, and translating the defined rules to enable semantic reasoning over the knowledge graph. Eventually, the datasets are added to the graph as observations, ensuring that they conform to the prescribed metadata, structure, and Semantic Web protocols. We wrote a Python program to construct the knowledge graph; it is available at https://github.com/erajabi/Nova_Scotia_Open_Data.

Fig. 4 Knowledge graph construction process

Queries

We use the built-in SPARQL tab in Protégé to pose a set of designed queries against the knowledge graph, answering questions that cannot be expressed through explicit linkage alone. We designed the questions with the help of Nova Scotia health stakeholders, taking into account the semantic rules developed in the Rules section. For example, some diseases are sub-classes of the infectious disease class in the Disease Ontology, and we use the rdfs:subClassOf property to retrieve the results. The queries are outlined below.

Figure 5 shows the two queries we defined, along with sample results. Both queries leverage the rules defined earlier.

Query 1: List of viral infectious diseases along with their number of cases in Nova Scotia in different years.

In this query, we use the doid:is_a relationship (together with the transitive rule above) to identify all the diseases classified as "viral infectious diseases". A possible SPARQL formulation is sketched below.
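This is a hedged reconstruction, not the exact query in Fig. 5; the prefixes and property names (eg:, dimension:, doid:) follow the paper's usage, but their IRIs are assumptions:

```python
# Hypothetical reconstruction of Query 1. The property path doid:is_a+
# substitutes for the materialized inference when no reasoner is attached.
QUERY_1 = """
PREFIX eg:        <http://example.org/ns#>
PREFIX doid:      <http://purl.obolibrary.org/obo/doid#>
PREFIX rdfs:      <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>

SELECT ?diseaseLabel ?year ?cases WHERE {
  ?obs eg:hasDisease ?disease ;
       eg:numberOfCases ?cases ;
       dimension:refPeriod ?year .
  ?disease doid:is_a+ ?parent .
  ?parent rdfs:label "viral infectious disease" .
  ?disease rdfs:label ?diseaseLabel .
}
ORDER BY ?year
"""
```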

Query 2: List of viral infectious diseases with a high number of cases (more than 1,000 cases) in Nova Scotia in 2017.

In this query, we use the HighDiseaseCases class to infer the results based on the rule defined in the Rules section. A possible formulation is sketched below.
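Again a hedged reconstruction under the same assumptions; the query relies on the reasoner having classified qualifying observations into HighDiseaseCases:

```python
# Hypothetical reconstruction of Query 2, restricted to 2017 and to
# observations inferred to be HighDiseaseCases by the SWRL rule.
QUERY_2 = """
PREFIX eg:        <http://example.org/ns#>
PREFIX doid:      <http://purl.obolibrary.org/obo/doid#>
PREFIX rdfs:      <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>

SELECT ?diseaseLabel ?cases WHERE {
  ?obs a eg:HighDiseaseCases ;           # inferred by the rule
       eg:hasDisease ?disease ;
       eg:numberOfCases ?cases ;
       dimension:refPeriod 2017 .
  ?disease doid:is_a+ ?parent .
  ?parent rdfs:label "viral infectious disease" .
  ?disease rdfs:label ?diseaseLabel .
}
"""
```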

Fig. 5 The designed queries. An online SPARQL editor was used to improve the readability of the SPARQL queries

Knowledge graph

The final knowledge graph included 2,883 triples with 24 classes, 23 object properties, and two data properties. All 21 disease datasets were successfully transformed into the knowledge graph, with a total of 252 observations. Each observation includes several dimensions, such as gender (sdmx:sex), observation year (dimension:refPeriod), and area of observation (dimension:refArea). It also contains a few measures, such as the disease rate per 100,000 population (eg:rateper100kpopulation) and the number of disease cases (numberofcases). Additionally, an observation has disease information (eg:hasDisease) and disease label (rdfs:label) properties, which are connected to the DBpedia knowledge graph using the owl:sameAs property. The knowledge graph is publicly available on Zenodo under the Creative Commons Universal Public Domain Dedication (CC0 1.0) license.
