Pathling: analytics on FHIR

We addressed the identified use cases through the design of a “FHIR Analytics API” (see Fig. 1), which refers to a specialization of the FHIR API that focuses on providing functionality useful for health data analytics applications. FHIR provides standard mechanisms for extending the functionality of its API such as extended operations and search profiles, and we have taken advantage of these to deliver new functionality that can be consumed by existing FHIR client software in a frictionless way.

Fig. 1 Operations of a FHIR Analytics API

The additional operations purposefully target capabilities that are not currently possible or not easily achieved using the core FHIR REST API specification, such as aggregation and transformation of data. Search capabilities are also provided that allow for more expressive queries than the core FHIR Search API. The FHIR Analytics API is designed to complement, rather than replace, the transactional capabilities of the standard FHIR REST API.

Import operation

The first feature provided by the FHIR Analytics API is the Import Operation (see Fig. 2), which provides a way of making bulk data available to the server and preparing it for subsequent queries. A standard FHIR server can import data using the create and update REST operations (and batches of these operations), however this approach does not scale well to the volume of data needed for analytic applications.

Fig. 2

The Import Operation is designed to accept data that has been extracted from systems through the HL7 FHIR Bulk Data Access interface [19]. This specification is being rapidly adopted by vendors [20], providing a standard way of sharing bulk data with other systems, including analytics tools. This operation is also based upon the Draft Bulk Import Implementation Guide [21], an effort by the FHIR community to design a standard operation for the efficient import of large data sets.

One of the differentiating aspects of a bulk data import request (relative to the FHIR transaction/batch operation) is the method for providing data to the server. Within FHIR transaction and batch requests, data is provided inline within the HTTP request itself. In a bulk import request, the client provides the server with URLs that the data files can be retrieved from. The protocols used within these URLs can include sources other than HTTP. Pathling currently supports the retrieval of data from Hadoop File System (HDFS) [22] and Amazon S3 URLs, which both use protocols that are optimized for the retrieval of large files.
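Although the original figures are not reproduced here, a bulk import request of this kind can be sketched as a FHIR Parameters resource. The parameter names below ("source", "resourceType", "url") follow the general shape of the draft bulk import proposal and should be treated as illustrative rather than as Pathling's exact API:

```json
{
  "resourceType": "Parameters",
  "parameter": [
    {
      "name": "source",
      "part": [
        { "name": "resourceType", "valueCode": "Patient" },
        { "name": "url", "valueUrl": "s3://example-bucket/Patient.ndjson" }
      ]
    }
  ]
}
```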

The other difference with bulk import is that the response may be asynchronous: the server responds immediately with a URL rather than making the client wait for the operation to complete. This URL can be used to receive updates on the progress of the operation and information about how to retrieve the final result. This is important for operations that take long periods of time to complete, which could otherwise be hampered by timeouts and limitations on request size within HTTP implementations.

FHIR data are made available to the server via NDJSON [23], which is a way of representing collections of FHIR resources that is more bandwidth and storage efficient than a FHIR Bundle. Each NDJSON file provided to the operation contains a collection of instances of a single FHIR resource type.
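As an illustration, an NDJSON file of Patient resources simply places one complete resource on each line, with no enclosing Bundle structure (the values here are invented):

```
{"resourceType": "Patient", "id": "1", "gender": "male", "birthDate": "2001-05-05"}
{"resourceType": "Patient", "id": "2", "gender": "female", "birthDate": "1970-01-01"}
```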

Aggregate operation

The second feature of the FHIR Analytics API is the Aggregate Operation (see Fig. 3), which facilitates the execution of aggregation-based queries across a data set. The concept of the aggregate operation is very similar to that of a “pivot table” [24], commonly used within spreadsheet applications. This provides a familiar and flexible set of tools that can power a range of different applications that satisfy the “Exploratory data analysis” use case, including visualizations.

Fig. 3

Expressions are used to describe the aggregations, groupings and filters that form the definition of the query. We use FHIRPath [25] for this purpose, a language that is used within the FHIR specification. FHIRPath is a graph-based traversal language, and allows for “paths” through the resource graph to be described succinctly and in absence of some of the complexities of dealing with FHIR data types, cardinalities, resource references, and missing data.

The Aggregate Operation is defined within FHIR as a “type-level” operation, which means that it is invoked on a particular resource type, with the collection of all resources of that type becoming the subject of the operation. This subject resource becomes the root context for the FHIRPath expressions that are supplied to the operation.

One or more aggregation expressions can be provided, serving the purpose of defining a set of summary values that are to be calculated over the data set. Some examples of aggregation expressions are “count” and “sum”.

Grouping expressions are evaluated against each resource in order to determine the set of groups that it should be counted within. Aggregation expressions are then executed to determine a result for each group of resources, and these results are provided within the response to the operation.

Filter expressions serve to constrain the scope of the input collection. Resources that do not evaluate as true for all supplied filter expressions are excluded from the results.

A simple worked example of the Aggregate Operation follows, given the data set of patients in Table 1, each with a gender, deceased status (“deceasedBoolean”) and birth date (“birthDate”).

Table 1 Example Patient data set

A query can be composed that contains one aggregation expression, A:

figure a

The query contains two grouping expressions: B, grouping on the patient gender; and C, grouping on the deceased status of the patient:

figure b

Finally, the query defines a single filter D, which excludes all patients with a birth date on or after January 1, 2000:

figure c
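Rendered as FHIRPath, the four expressions described above might look like the following. The originals appear only in the figures, so this is a plausible sketch rather than a verbatim reproduction:

```
A: count()
B: gender
C: deceasedBoolean
D: birthDate < @2000-01-01
```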

Table 2 shows the result, which contains a row for each distinct combination of B and C found in the data set, along with the result of A calculated over the resources that are a part of that grouping.

Table 2 Example result from Aggregate Operation

Note that patient 1 was omitted from the results by filter D. The combinations [female, false] and [other, true] were not included, as there were no resources matching these values.
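The grouping, aggregation and filtering semantics described above can be simulated in plain Python. The patient rows below are hypothetical (the actual Table 1 is not reproduced here), chosen only to be consistent with the stated result: patient 1 is excluded by filter D, and patients 2 and 4 fall into the [female, true] grouping:

```python
from collections import Counter
from datetime import date

# Hypothetical patient rows (id, gender, deceasedBoolean, birthDate);
# invented for illustration, not the paper's actual Table 1.
patients = [
    ("1", "male",   False, date(2001, 5, 5)),
    ("2", "female", True,  date(1970, 1, 1)),
    ("3", "male",   False, date(1985, 3, 3)),
    ("4", "female", True,  date(1960, 6, 6)),
    ("5", "other",  False, date(1990, 9, 9)),
]

# Filter D: exclude patients born on or after January 1, 2000
filtered = [p for p in patients if p[3] < date(2000, 1, 1)]

# Groupings B (gender) and C (deceasedBoolean), aggregation A (count):
# one result row per distinct combination present in the filtered data
result = Counter((gender, deceased) for _, gender, deceased, _ in filtered)

for (gender, deceased), count in sorted(result.items()):
    print(gender, deceased, count)
```

Combinations with no matching resources, such as [female, false], simply never appear in the counter, mirroring their absence from Table 2.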

FHIRPath search profile

The third feature of the FHIR Analytics API is FHIRPath-powered search functionality, surfaced through a FHIR search profile. This search profile accepts any number of FHIRPath filter expressions, returning a Bundle resource containing matching resources within the server.

This search functionality can be used in conjunction with the Aggregate Operation, which returns a “drill down” expression with each grouping in the response. Each of these expressions can be used with the search API to retrieve the individual resources that comprise a grouping within the Aggregate Operation result. This provides the basis for delivering applications that satisfy the aforementioned “Patient cohort selection” use case.

FHIRPath filter expressions can also be combined with standard search parameters from the FHIR search specification. This allows the client to customize the representation of results in the response. An example of this is “_elements”, which controls the data elements that are included with each resource in the response.

An example drill-down expression for the [female, true] grouping in the previous example Aggregate response would be:

figure d

This would return patients 2 and 4 from the example data set. Note that this incorporates one of the distinct value pairs resulting from groupings B and C as a matching condition, as well as the overall filter condition from expression D.
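A plausible form for this drill-down expression, combining the grouping values from B and C with filter D (the original figure is not reproduced here), would be:

```
gender = 'female' and deceasedBoolean = true and birthDate < @2000-01-01
```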

FHIRPath provides additional expressive power relative to the FHIR Search API, such as:

1. The ability to refer to any element, not just those that are the subject of a defined search parameter;
2. Unlimited nesting and bracketing of expressions;
3. The ability to follow resource references that traverse multiple levels of relationships; and
4. Support for complex terminology operations within criteria.

Extract operation

Finally, the Extract Operation (see Fig. 4) is designed to create custom data extracts for input into other tools and workflows. It is designed to simplify the task of reaching into the FHIR data model and producing a flattened rendition of selected parts of the data set. The full power of FHIRPath is available for use within this operation, including terminology functions.

Fig. 4

The Extract Operation takes a subject resource type and a set of FHIRPath expressions as input. The result of the operation is a delimited text file with a column for each input expression. Each row in the file contains the results of evaluating those expressions against a single FHIR resource.

A simple worked example of the Extract Operation follows, using the previously stated example Patient data set in Table 1, a Practitioner data set in Table 3 and a MedicationRequest data set in Table 4.

Table 3 Example Practitioner data set

Table 4 Example MedicationRequest data set

A query can be composed that uses MedicationRequest as the subject resource. It contains four column expressions: D, the resource ID of the MedicationRequest; E, the resource ID of the subject Patient; F, the provider identifier of the prescribing Practitioner; and G, the text representation of the prescribed medication.

figure e
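The column expressions themselves appear only in the figure, but based on the R4 MedicationRequest structure they might be sketched as follows; the `resolve()` traversals and the identifier path are assumptions made for illustration:

```
D: id
E: subject.resolve().id
F: requester.resolve().identifier.value
G: medicationCodeableConcept.text
```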

Table 5 shows the result, which contains a row for each MedicationRequest resource. Each row contains the result of evaluating the expressions D, E, F and G against the resource.

Table 5 Example result from Extract operation

The Extract operation returns a URL that can be used to retrieve the result of the operation, rather than returning the result inline within the HTTP response. This is to account for the potential large size of an Extract result when executed upon a data set with a large number of resources.

The Extract operation also features a “limit” parameter that allows the user to specify the maximum number of rows to be returned within the result. This can be useful to preview the format of a result without executing the query against the entire data set.

Terminology functions

As an important source of meaning within the information model, terminology is a core element of both the FHIR and FHIRPath specifications. The FHIRPath specification contains a number of terminology functions that we have implemented and extended to provide a set of capabilities useful for analytic queries.

In our implementation, these functions delegate terminology queries to an implementation of the FHIR Terminology Service API. This approach removes the need to import and maintain a read-optimized view of terminology data, and it creates a clear separation of concerns between the query engine and the source of terminology knowledge.

Value set membership

Value set membership testing is fundamental to the task of categorizing and extracting ontological information from codes. We provide this capability through the implementation of the FHIRPath function, “memberOf”. This function takes a URI that identifies a defined set of codes, and uses the terminology server to determine the membership of a set of input concepts based on its knowledge of the code systems involved.

An example of the “memberOf” function follows, given the data set of SNOMED CT coded procedures in Table 6. This example uses a SNOMED CT implicit ValueSet URI that refers to an Expression Constraint Language (ECL) expression, which is a standard mechanism within FHIR for describing a set of SNOMED CT concepts.

Table 6 Example data set used with memberOf function

Note that the actual structure of the “code” field in the FHIR Procedure resource is CodeableConcept, which is a complex structure that can accommodate multiple codings along with text. This has been simplified for this example to a simple SNOMED CT code, along with a label for readability.

Suppose a search query is made against the Procedure resource with the following FHIRPath condition:

figure f

The following ECL expression is used within the argument:

figure g

The result of this operation would be a FHIR Bundle containing the Procedure resources with SNOMED CT codes that are a type of “Procedure”, where the “Procedure site” is a “Heart structure”. There are 1,638 concepts that would match this query within the international release of SNOMED CT (February 2022), out of a total of 58,737 procedure concepts. The operation would return the Procedure resources with IDs 1 and 3, which are both types of procedures that are performed on the heart.
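The condition and ECL expression described above can be sketched as follows. The SNOMED CT identifiers shown (71388002 |Procedure|, 363704007 |Procedure site|, 80891009 |Heart structure|) are the standard concepts for those terms, but the original figures are not reproduced here, so this should be treated as illustrative:

```
// FHIRPath condition (sketch), with [ECL] a placeholder for the expression below:
code.memberOf('http://snomed.info/sct?fhir_vs=ecl/[ECL]')

// ECL expression, URL-encoded when embedded in the ValueSet URI:
<< 71388002 |Procedure| : 363704007 |Procedure site| = << 80891009 |Heart structure|
```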

The information required to derive this result includes the subsumption relationships and attributes of the codes that exist within SNOMED CT itself. The terminology server has knowledge of these relationships along with the ability to understand ECL, and Pathling requests this information at execution time using the FHIR Terminology Service API.

Subsumption testing

Code subsumption testing is particularly useful for terminologies such as SNOMED CT, which feature deep hierarchies and large numbers of concepts. The functions “subsumes” and “subsumedBy” allow us to efficiently move up and down the subsumption hierarchy within our queries, based on the ontological data held by the terminology server.

An example of an expression that uses subsumption testing follows:

figure h

This expression could be used to filter Condition resources to only those that are a type of “Diabetes mellitus” (which is the concept referred to by the identifier “73211009”). Within the international release of SNOMED CT (February 2022), there are 119 concepts that would match this query.
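A plausible rendering of such an expression, evaluated against the “code” element of the Condition resource, is shown below; the exact dialect details (such as the bar-delimited Coding literal) are an assumption for illustration:

```
code.subsumedBy(http://snomed.info/sct|73211009)
```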

Concept translation

Concept translation allows us to traverse mappings between codes that are known to the terminology server. The “translate” function takes a set of input concepts and asks the terminology server to return the targets of mapping relationships found within a particular map. This is particularly useful in the area of data analytics, where the harmonization of heterogeneous data capture is a common requirement. It can also be used with maps that are implicitly defined within particular terminologies. One example of this is the use of SNOMED CT historical associations to translate inactive codes to the updated codes that replace them.

Another example translates SNOMED CT codes to Read CTV3:

figure i

The argument refers to a SNOMED CT implicit ConceptMap URI. This is the standard mechanism within FHIR for describing a simple map within SNOMED CT. This URI refers to a map that is released as part of the international edition of SNOMED CT (identified by “900000000000497000”).

The translate function can also be “reversed”, which instructs it to retrieve the source concepts within a map when given a set of target concepts. It is also capable of filtering results to a defined subset of relationship types, such as “related to”, “broader”, “narrower” and “equivalent”.
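Sketched as FHIRPath, a translation using the SNOMED CT to Read CTV3 implicit ConceptMap might look like this (illustrative; the original figure is not reproduced here):

```
code.translate('http://snomed.info/sct?fhir_cm=900000000000497000')
```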

Property and designation lookup

Property and designation lookup uses a FHIR terminology service to join across from coded data to the known attributes of those codings.

Examples of information that can be retrieved via properties and designations are synonyms, parent and child concepts, display text in other languages and more. Complex terminologies such as SNOMED CT and LOINC also provide ontological information via properties, such as finding sites for disorders, or methods for procedures.

Property and designation lookup has not yet been implemented within Pathling. The details of the design and implementation of this feature will be the subject of a future paper.

Pathling

Pathling is an implementation of the FHIR Analytics API concept described in this paper, and has been made freely available under a permissive open source license via GitHub. It is written using Java and Scala, and distributed as a Docker image. The components that comprise the Pathling solution are shown in Fig. 5.

Fig. 5 Implementation components

Pathling is an Apache Spark application at its core. The work done by Cerner on the Bunsen library was adopted and enhanced to enable the import of FHIR data into Spark. Each FHIR resource type is represented as a single Spark data set, utilizing data types such as “struct” and “array” to represent nested data with varying cardinality and optionality. Each data set is written to a Parquet file for persistence and subsequent retrieval.

A FHIRPath parser, written using ANTLR [26], translates FHIRPath expressions into queries that can be executed using Spark. Parsing of each sub-expression results in an object that represents a FHIRPath expression, along with its associated Spark query representation. These FHIRPath objects can then be used as inputs to various FHIRPath functions and operators that have been implemented as Spark query transformations.

The terminology functions manifest as Spark mapping operations, where input codings are deduplicated and sent to the configured FHIR terminology service. Multiple requests are made in parallel to resolve queries about different portions of the coded data, based upon a configurable level of parallelism. Ontoserver [27] was primarily used for the development of this capability, as it is an existing mature implementation of the FHIR Terminology Services API with good feature coverage and high performance suitable for analytic workloads.
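The deduplicate-then-resolve pattern described above can be sketched in plain Python. This is a minimal illustration with a stubbed terminology lookup, not Pathling's actual implementation; the function names and the example codes are invented:

```python
from concurrent.futures import ThreadPoolExecutor

def lookup_member_of(code, value_set_uri):
    # Stand-in for a request to a FHIR terminology server (e.g. a
    # ValueSet validate-code call); the answers here are hard-coded.
    heart_procedure_codes = {"232717009", "175135009"}
    return code in heart_procedure_codes

def member_of(codes, value_set_uri, parallelism=4):
    # Deduplicate so each distinct code is resolved only once,
    # then resolve the distinct codes in parallel.
    distinct = list(set(codes))
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        answers = list(pool.map(
            lambda c: lookup_member_of(c, value_set_uri), distinct))
    resolved = dict(zip(distinct, answers))
    # Map the resolved answers back onto the original (duplicated) input
    return [resolved[c] for c in codes]

print(member_of(["232717009", "80146002", "232717009"],
                "http://example.org/ValueSet/heart-procedures"))
```

Deduplication matters because clinical data sets typically contain a small number of distinct codes repeated across a large number of resources, so resolving each distinct code once greatly reduces traffic to the terminology server.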

The Aggregate, Search and Extract operations compose individual FHIRPath expressions into a composite query. HAPI is used as a framework for exposing the operations within an API that can be consumed by standard FHIR client software. The results of each operation query are transformed into the appropriate FHIR resource for return to the user, whether that be a Parameters, OperationOutcome or Bundle resource.

Pathling implements the asynchronous request pattern defined within the FHIR specification to accommodate long-running queries over large data sets. The asynchronous pattern is based upon RFC 7240 [28], and utilizes standard HTTP headers such as “Prefer” and “Content-Location” to provide a mechanism for querying the status of a job and eventually receiving the final result.
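The exchange follows the pattern below; the paths are illustrative, while the status codes and headers follow the FHIR asynchronous request pattern:

```
POST /fhir/Patient/$aggregate HTTP/1.1
Prefer: respond-async

HTTP/1.1 202 Accepted
Content-Location: https://example.com/fhir/$job?id=abc123

GET /fhir/$job?id=abc123 HTTP/1.1

HTTP/1.1 200 OK            (result ready; 202 with progress while still running)
```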

The use of Docker as a distribution format enables users to deploy Pathling using their own infrastructure. This makes it easier for computation to take place where the data currently resides, avoiding egress across boundaries of data custodianship.

Libraries

It is our intent that the components that make up the Pathling server implementation can be used independently of each other and re-composed into other novel solutions. The first step that we have taken towards this is to make the individual modules that make up the server implementation available via Maven Central. This allows them to be used independently within projects that use Java, Scala and other Java Virtual Machine (JVM) languages. For example, the encoders module can be used to encode FHIR data into Spark data frames, which can then be queried via SQL.

We have also made a library available to Python users through the Python Package Index (PyPI). The Pathling Python API provides access to the Spark encoders and the terminology functions directly from Python, without the need for a running Pathling server. This allows for closer integration with Python-based data science workflows, and also the incorporation of Pathling functionality into bespoke Python applications.

We also have plans to implement methods within the Python API that provide access to the Aggregate and Extract operations, which will see the library reach functional parity with the server implementation. A package for the R language is also planned, which will provide equivalent functionality to the Python library for R users.

User interface

As a demonstration of its utility, the Pathling API was used to develop an experimental exploratory data analysis user interface, showing the use of the Aggregate and Search Operations within a generic tool for exploring FHIR data sets (see Fig. 6). A number of synthetic FHIR data sets were created using Synthea [29] that allowed for the demonstration of the features of the API, and the underlying terminology capability.

Fig. 6 Experimental user interface for exploratory data analysis
