Learning pharmacometric covariate model structures with symbolic regression networks

The method described in this paper aims to automatically learn closed-form pharmacometric parameter–covariate relationships from data. We consider learning expressions for PK model parameters from drug administration and resulting blood plasma concentration data, both of which are time series, though not necessarily synchronously or equidistantly sampled. The overarching setup is shown in Fig. 1, and we will focus on a concrete example based on a large multi-study data set published in [4] for the anesthetic drug propofol, commonly modeled with a mammillary three-compartment model [12].

Fig. 1

Mammillary three-compartment model example illustrating our novel method. The objective is to automatically learn the covariate model \(\mathbf{f}\) that maps a known covariate vector \(\boldsymbol{\varphi}\) (comprising e.g., age, gender, or genetic factors) to the parameter vector \(\boldsymbol{\theta}\) (e.g. rate constants \(k_{ij}\) and volumes \(V_\cdot\)) of a fixed-structure pharmacokinetic (PK) model. The method is data-driven in that it uses drug administration profiles (time series data) u, and model-based in that it assumes a PK model of known structure. In this example, \(\mathbf{f}\) is learned to minimize some error measure between observed (i.e. measured from samples) blood plasma concentrations and corresponding predictions \(C_\text{pred}\) by the model. Dots in the graphs show instances of dose changes and blood samples, respectively

Data set

We demonstrate our method on a data set composed of propofol plasma concentration observations from 1031 individuals from 30 clinical studies aggregated by Eleveld et al. [4], from here on referred to as the data set. Ethical approvals of the underlying studies are declared in the original publications, referenced by [4]. The data set contains 15,433 observations, of which 11,530 were arterial and 3903 venous. Of the 1031 individuals, there were 670 males and 361 females, with ages ranging from 27 weeks to 88 years and weights ranging from 0.68 to 160 kg, see Table 1.

Table 1 Covariate candidates considered in the PK model of [4] as well as in our modeling. Individuals of the data set fall within the reported ranges

The main reason for choosing this data set as a demonstrator is that propofol is a drug with well-studied pharmacokinetics. Evidence of this, which also constitutes a benchmark for our demonstrator, is the model presented in [4]. Furthermore, the data set has been openly disclosed by Eleveld et al., enabling transparent third-party analysis of our work.

In our model development, we consider age, weight, BMI, gender, and blood sampling site (arterial or venous) as potential covariates of our PK model. These are the same candidates as considered in [4], where individual demographics were disclosed as part of the data set.

The data was pre-processed in the same way as in [4]: data points corresponding to subsequent infusion changes spaced closer than 1 s apart in the time dimension, or 0.5 \(\mu\mathrm{g\,s^{-1}}\) in the dose dimension, were merged.
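For illustration, a minimal sketch of this merging step is given below, written in Julia (also the language of our implementation [27]). The function name and the exact merging rule are our own interpretation of the description above and may differ in details from the pre-processing of [4]; it is a sketch, not the reference implementation.

```julia
# Hypothetical sketch: merge subsequent infusion-change points that are closer
# than dt_min in time or du_min in dose to the previously kept point.
function merge_infusion_changes(t::Vector{Float64}, u::Vector{Float64};
                                dt_min = 1.0, du_min = 0.5)
    t_merged = [t[1]]
    u_merged = [u[1]]
    for k in 2:length(t)
        if (t[k] - t_merged[end] < dt_min) || (abs(u[k] - u_merged[end]) < du_min)
            u_merged[end] = u[k]            # merge: keep the latest rate, drop the extra point
        else
            push!(t_merged, t[k])
            push!(u_merged, u[k])
        end
    end
    return t_merged, u_merged
end
```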

Pharmacokinetic model

We consider a three-compartment mammillary model to describe the pharmacokinetics of propofol. The drug concentration \(x_i\) \([\mu\mathrm{g\,ml^{-1}}]\) in compartment \(i \in \{1,2,3\}\) is

$$\dot{x}_1 = - (k_{10} + k_{12} + k_{13})\, x_1 + k_{21} x_2 + k_{31} x_3 + \frac{u}{V_1},$$

(1a)

$$\dot{x}_2 = k_{12} x_1 - k_{21} x_2,$$

(1b)

$$\dot{x}_3 = k_{13} x_1 - k_{31} x_3,$$

(1c)

where \(k_{ij}\) describes the drug transfer rate [1/s] from compartment i to j (with \(k_{10}\) denoting elimination from the central compartment). The drug is administered at rate u [\(\mu\mathrm{g\,s^{-1}}\)] to the central compartment (\(i=1\)), which is also where the propofol plasma concentration is measured. The volume of the central compartment is \(V_1\) [L].
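As a concrete illustration, a minimal Julia sketch of the right-hand side of Eq. 1 is given below. The parameter ordering in \(\theta\) is an assumption made for this example, not a convention taken from [27].

```julia
# Right-hand side of the three-compartment model, Eq. 1.
# State x = [x1, x2, x3] (compartment concentrations), input u (infusion rate),
# parameters θ = (k10, k12, k21, k13, k31, V1) -- ordering assumed for this sketch.
function three_compartment_rhs(x, u, θ)
    k10, k12, k21, k13, k31, V1 = θ
    dx1 = -(k10 + k12 + k13) * x[1] + k21 * x[2] + k31 * x[3] + u / V1
    dx2 = k12 * x[1] - k21 * x[2]
    dx3 = k13 * x[1] - k31 * x[3]
    return [dx1, dx2, dx3]       # predicted plasma concentration is x[1]
end
```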

In the literature, the equivalent parameterization in terms of volumes \((V_1, V_2, V_3)\) and clearances \((CL, Q_2, Q_3)\) constitutes a common alternative to Eq. 1. The conversions between the two parameterizations are

$$CL = k_{10} V_1\,\mathrm{[L\,s^{-1}]},$$

(2a)

$$Q_2 = k_{12} V_1\,\mathrm{[L\,s^{-1}]},$$

(2b)

$$Q_3 = k_{13} V_1\,\mathrm{[L\,s^{-1}]},$$

(2c)

$$V_2 = \frac{k_{12}}{k_{21}} V_1\,,$$

(2d)

$$V_3 = \frac{k_{13}}{k_{31}} V_1\,.$$

(2e)

We have chosen to implement our method using the parameterization in Eq. 1 due to numerical benefits. These are further explained in [13], where we developed a fast and natively differentiable simulator for the third-order mammillary model.
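For reference, the conversions in Eq. 2 translate directly into a small Julia helper, assuming rate constants in \(\mathrm{s^{-1}}\) and \(V_1\) in L as above; the named-tuple return value is an illustrative choice, not part of our implementation.

```julia
# Convert the rate-constant parameterization of Eq. 1 to volumes and clearances, Eq. 2.
function rates_to_clearances(k10, k12, k21, k13, k31, V1)
    CL = k10 * V1          # [L/s], Eq. 2a
    Q2 = k12 * V1          # [L/s], Eq. 2b
    Q3 = k13 * V1          # [L/s], Eq. 2c
    V2 = k12 / k21 * V1    # [L],   Eq. 2d
    V3 = k13 / k31 * V1    # [L],   Eq. 2e
    return (; CL, Q2, Q3, V2, V3)
end
```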

Covariate model

The covariate model, shown in Fig. 1, is expressed as a function \(\mathbf{f}\) that maps the covariate vector \(\boldsymbol{\varphi}=[\varphi_1,\ldots,\varphi_{n_\varphi}]^\top\) to the vector \(\boldsymbol{\theta}=[\theta_1,\ldots,\theta_{n_\theta}]^\top\) of PK model parameters. Thus \(\mathbf{f}\) has components \(f_1,\dots,f_{n_\theta}\), each mapping the covariate vector \(\boldsymbol{\varphi}\) to one of the \(n_\theta\) PK model parameters.

In our example with the three-compartment model for propofol, there are \(n_\varphi = 5\) covariates and \(n_\theta = 6\) PK model parameters according to Table 1. To not favor covariates based on their scale, all input covariates were normalized during model development. Continuous inputs (age, weight, and BMI) were scaled from 0 to 1, and categorical inputs (gender and blood sampling site) were scaled to \(\pm 0.5\).
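A minimal sketch of this normalization is given below. Min–max scaling over the data set ranges is our assumption of how the continuous covariates are mapped to [0, 1], and the sign convention for the categorical coding is arbitrary here; see [27] for the actual implementation.

```julia
# Sketch of the covariate normalization described above.
# Continuous covariates are min-max scaled to [0, 1] using data set ranges
# (passed in explicitly); categorical covariates are coded as ±0.5.
scale01(x, lo, hi) = (x - lo) / (hi - lo)

function normalize_covariates(age, weight, bmi, is_male, is_arterial,
                              age_range, weight_range, bmi_range)
    return [scale01(age, age_range...),
            scale01(weight, weight_range...),
            scale01(bmi, bmi_range...),
            is_male ? 0.5 : -0.5,
            is_arterial ? 0.5 : -0.5]
end
```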

Predictive performance

To assess the quality of a particular covariate model candidate \(\mathbf{f}\), we need a performance measure that captures how well \(\mathbf{f}\) reflects the training data. Most optimization methods, including the ones used in this paper, rely on this measure being scalar.

For comparability, we employ the same scalar performance measures as those used in [4]: We train our model to minimize an ensemble average of absolute logarithmic error (ALE). Subsequently, we evaluate predictive performance in terms of ALE, and three additional error measures: logarithmic error (LE), prediction error (PE), and absolute prediction error (APE).

For each individual in the data set, there is a vector of observation–prediction errors, where each entry corresponds to the difference between a blood sample observation and the value predicted by the model. Observation–prediction errors can thus be computed for a single sample, over the time series of one individual, or over the entire data set. This prompts a consistent notation: we index by ij the error of sample j for individual i, whereas an aggregate error over the time series of individual i is indexed by i.

The per-sample (absolute) logarithmic error (A)LE [14] is thus

$$\text{LE}_{ij} = \ln\!\left(C_{\text{obs}_{ij}}/C_{\text{pred}_{ij}}\right),$$

(3a)

$$\text{ALE}_{ij} = |\text{LE}_{ij}|,$$

(3b)

where \(C_{\text{obs}_{ij}}\) are observed (measured) plasma concentrations and \(C_{\text{pred}_{ij}}\) are corresponding predictions produced by the model.

Similarly, the per-sample (absolute) prediction error (A)PE [15] is

$$\text{PE}_{ij} = \left( \dfrac{C_{\text{obs}_{ij}}-C_{\text{pred}_{ij}}}{C_{\text{pred}_{ij}}} \right) \cdot 100\,\%,$$

(4a)

$$\text{APE}_{ij} = |\text{PE}_{ij}|.$$

(4b)

To avoid division by zero when \(C_{\text{pred}_{ij}}=0\), special casing is needed in which the error is set to zero for such samples.

Using Eqs. 3 and 4, corresponding per-individual errors were computed by taking the median over the samples of each individual, resulting in, for example, the median absolute logarithmic error (MdALE):

$$\text{MdALE}_i = \underset{j}{\text{median}}\left( \text{ALE}_{ij} \right), \quad j = 1,\dots,n_i,$$

(5)

where \(n_i\) is the number of entries of the time series from individual i.

To train the model of [4] and the ones considered here, a single scalar loss function representing model fit is required. This is obtained by averaging across the individuals. Training the covariate model to minimize ALE thus translates into minimizing

$$J_\text{ALE} = \dfrac{1}{n} \sum_{i=1}^{n} \text{MdALE}_i,$$

(6)

where n is the number of individuals in the data set.
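The chain from per-sample errors (Eqs. 3–4) via per-individual medians (Eq. 5) to the scalar training loss (Eq. 6) can be sketched in a few lines of Julia. Here, `C_obs` and `C_pred` are assumed to be vectors holding, per individual, the vectors of observed and predicted concentrations at the observation times; the function names are ours.

```julia
using Statistics  # median, mean

# Per-sample errors, Eqs. 3 and 4.
le(c_obs, c_pred)  = log(c_obs / c_pred)                                        # Eq. 3a
ale(c_obs, c_pred) = abs(le(c_obs, c_pred))                                     # Eq. 3b
pe(c_obs, c_pred)  = c_pred == 0 ? 0.0 : (c_obs - c_pred) / c_pred * 100        # Eq. 4a [%], zero special case
ape(c_obs, c_pred) = abs(pe(c_obs, c_pred))                                     # Eq. 4b [%]

# Per-individual median, Eq. 5, and ensemble-average training loss, Eq. 6.
mdale(obs_i, pred_i) = median(ale.(obs_i, pred_i))
J_ALE(C_obs, C_pred) = mean(mdale.(C_obs, C_pred))
```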

Similarly to the analysis performed in [4], ALE and APE were used as indicators of model accuracy and LE and PE were used as indicators of bias. Values closer to zero for ALE and APE reflect better accuracy and values closer to zero for LE and PE indicate less bias.

Clinically acceptable ranges for MdPE\(_i\) and MdAPE\(_i\) are 10–20% and 20–40%, respectively [4, 15, 16]. Since \(\ln(1.2) \approx 0.18\) and \(\ln(1.4) \approx 0.34\), this translates into clinically acceptable ranges of MdLE\(_i < 0.18\) for the bias measure and MdALE\(_i < 0.34\) for the accuracy measure.

Of note, the choice of loss function affects how outliers are penalized; for example, averaging the loss across individuals penalizes outliers more heavily than taking a median across individuals would.

Symbolic regression networks

At the core of our methodology lies a small artificial neural network (ANN) with a specific structure, that of a symbolic regression network [17], representing a simple closed-form expression. In our case, the purpose is to learn human-readable closed-form expressions describing the covariate model \(\mathbf{f}\). A schematic illustration of such a network is shown in Fig. 2.

Fig. 2

Symbolic regression network with three layers, each marked by a gray box. The output of node \(z_{l,i}\) at layer l is the \(i^{\text{th}}\) component of \(\mathbf{z}_l = W_l \mathbf{y}_l + \mathbf{b}_l\), where \(\mathbf{y}_l\), \(W_l\), and \(\mathbf{b}_l\) are the input vector, weight matrix, and bias vector of that layer. The base expressions \(g_{l,i}\) acting on \(z_{l,i}\) take on the role of activation functions used in ordinary ANNs. For example, the output of the first layer (and therefore input to the second layer) is \(\mathbf{y}_2 = \mathbf{g}_1(W_1 \mathbf{y}_1 + \mathbf{b}_1)\). Input and output of the network are the covariate vector \(\boldsymbol{\varphi} = \mathbf{y}_1\) and the PK parameter \(\theta_k = \mathbf{y}_4\), respectively

In general, ANNs can be viewed as flexible function approximators that can be trained to represent input–output mappings that fit available data. An ANN consists of \(n_l\) layers, where the output vector \(\mathbf{y}_{l+1}\) of layer l is obtained by applying a vector of nonlinear activation functions \(\mathbf{g}_l\) to an affine transformation \(\mathbf{z}_l\) of the layer input \(\mathbf{y}_l\):

$$\mathbf{z}_l = W_l \mathbf{y}_l + \mathbf{b}_l,$$

(7a)

$$\mathbf{y}_{l+1} = \mathbf{g}_l(\mathbf{z}_l).$$

(7b)

The linear weight matrices \(W_l\) and bias vectors \(\mathbf{b}_l\) constitute the free parameters used to train the network. In the conventional case, the components of the activation functions \(\mathbf{g}_l\) are monotonically increasing functions, such as sigmoids [18]. In contrast, a symbolic regression network can be understood as an ANN with the additional constraint that the resulting approximator \(\mathbf{f}\) should be a human-readable closed-form expression of a mathematical function [17, 19,20,21]. Training of symbolic regression networks is therefore often referred to as equation learning [17]. The methodology has gained broad attention for its ability to produce impressive results in discovering (known) physical laws from data [22].
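In code, one layer of Eq. 7 is simply an affine map followed by elementwise base expressions. A minimal Julia sketch is given below; the type and function names are ours, not taken from [27].

```julia
# One network layer, Eq. 7: affine transformation followed by base expressions.
struct Layer
    W::Matrix{Float64}   # weight matrix W_l
    b::Vector{Float64}   # bias vector b_l
    g::Function          # vector-valued base expression g_l, applied to z_l
end

(layer::Layer)(y) = layer.g(layer.W * y .+ layer.b)   # y_{l+1} = g_l(W_l y_l + b_l)

# A whole symbolic regression network is a composition of such layers:
forward(layers, φ) = foldl((y, layer) -> layer(y), layers; init = φ)
```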

The activation functions \(\mathbf{g}_l\) of a symbolic regression network represent mathematical functions that we refer to as base expressions. In standard symbolic regression, a sequence of topologies with different base expressions and scalar parameters (of which the weight matrices and bias vectors in Eq. 7 constitute a special case) is evaluated in search of one that fits the available data well [23]. Conventional equation learning is thus a combinatorial problem with poor (exponential) time complexity, and is therefore often approached using genetic algorithms [23].

Instead of relying on random perturbations as in genetic algorithms, we use gradient-based optimization methods common to conventional ANN training. To ensure human-readability, we enforce sparsity of the ANN by alternating training epochs with pruning epochs, in which the least important parameters are removed from the network. We next rely on a concrete example based on the Eleveld data set, to describe and illustrate this approach.

We set out with the nominal (unpruned) network of Fig. 2. It constitutes our nominal representation of \(f_k\), where the inputs are the covariates according to Table 1. Thus, the output of the network represents one of the PK parameters \(\theta_k\), such as \(k_{10}\) or \(V_1\) of Eq. 1. For a model with \(n_\theta\) PK parameters, we use a parallel interconnection of \(n_\theta\) symbolic regression networks, each modeling one component \(f_k\) of \(\mathbf{f}\), mapping the covariate vector \(\boldsymbol{\varphi}\) to one PK parameter \(\theta_k,\ k=1,\dots,n_\theta\).

In our example, the base expressions of each layer \(l \in \{1,2,3\}\), with input vector \(\mathbf{z}_l = \begin{bmatrix} z_{l,1}&z_{l,2}&\dots \end{bmatrix}^\top\), were chosen to cover previously published PK models for propofol, such as [4] and [12]:

$$\begin{aligned} \mathbf{g}_1(\mathbf{z}_1)&= \begin{bmatrix} z_{1,1}\\ z_{1,2} \cdot z_{1,3}\\ |z_{1,4}|^{z_{1,5}} \end{bmatrix},\\ \mathbf{g}_2(\mathbf{z}_2)&= \begin{bmatrix} z_{2,1}\\ z_{2,2} \cdot z_{2,3}\\ \dfrac{z_{2,4}}{|z_{2,5}| + 1} \end{bmatrix},\\ g_3(z_3)&= |z_{3,1}|, \end{aligned}$$

where \(g_3\) assures a positive output of the final layer. The division in \(\mathbf{g}_2\) has the term one in the denominator to assure that the output does not blow up as \(|z_{2,5}| \ge 0\) approaches zero.
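The base expressions above translate directly into Julia functions operating on the layer outputs \(\mathbf{z}_l\), which can be plugged into the `Layer` sketch given earlier. The component indexing below mirrors the displayed expressions but is otherwise our own illustrative choice.

```julia
# Base expressions of the three layers (cf. the expressions above).
g1(z) = [z[1],                       # 1 (feedforward / identity)
         z[2] * z[3],                # multiplication
         abs(z[4])^z[5]]             # power function
g2(z) = [z[1],                       # 1 (feedforward / identity)
         z[2] * z[3],                # multiplication
         z[4] / (abs(z[5]) + 1)]     # division; +1 keeps the denominator away from zero
g3(z) = abs(z[1])                    # absolute value, assures a positive PK parameter
```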

Training

Minimizing the loss function of Eq. 6 over the trainable parameters of the covariate model \(\mathbf{f}\) is referred to as training. In our case, these parameters are stacked into a vector \(\boldsymbol{\gamma}\), made up of the elements of all weight matrices and bias vectors. Since the data set is static, we can view training as minimization of the scalar-valued function \(J_\text{ALE}(\boldsymbol{\gamma})\). In order to do so, we need to evaluate \(J_\text{ALE}(\boldsymbol{\gamma})\). Doing so requires simulation of one PK model for each individual in the data set, to obtain the predicted plasma concentration at each observation time instance. This can be done efficiently using the method introduced in [13].

In addition to supporting fast simulation, the method of [13] enables exact and efficient evaluation of derivatives of individual prediction errors with respect to the trainable model parameters. This allows us to train the covariate model \(\mathbf{f}\) using conventional ANN back-propagation, with the stochastic gradient-based optimization algorithm ADAM [24]. As with artificial neural networks in general, establishing formal convergence guarantees is challenging. However, in practice both standard deep learning networks and our symbolic regression networks converge to models that fit data adequately well. Similarly to the deep learning case, the convergence rate varies with data, choice of activation functions, learning rate, and initialization of the training. Our implementation, using the neural network package Flux [25], relies on the differentiable programming capabilities of the Julia language [26]. A full disclosure of our implementation is found in the GitHub repository [27].
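A heavily simplified sketch of the training step is given below. It uses Zygote directly for back-propagation and a plain gradient step in place of the ADAM update rule; the actual implementation uses Flux and ADAM [24, 25], see [27]. `J_ALE` here stands for a (hypothetical) function evaluating Eq. 6 from the stacked parameter vector \(\boldsymbol{\gamma}\) by simulating all individuals.

```julia
using Zygote  # reverse-mode automatic differentiation (used via Flux in [27])

# One simplified training loop on the stacked parameter vector γ.
function train!(γ::Vector{Float64}, J_ALE; epochs = 1_000, η = 1e-3)
    for _ in 1:epochs
        g, = Zygote.gradient(J_ALE, γ)   # ∂J_ALE/∂γ by back-propagation
        γ .-= η .* g                     # plain gradient step (ADAM in the real implementation)
    end
    return γ
end
```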

Pruning

To obtain simple human-readable expressions from an initially dense symbolic regression network, such as the one in Fig. 2, we alternate between parameter training of the fixed network structure and pruning of the network.

The goal of this pruning is to obtain a sparse network structure, which translates into a readable expression for the corresponding covariate model. In the pruning process, we remove covariates and network parameters that have relatively little influence on the network output. This process is visually exemplified in Fig. 3.

It is desirable to only include as many covariates as needed to explain the data [5]. Therefore, we start by identifying and removing the least important covariates from the symbolic regression network to obtain expressions with fewer covariates. Next, we prune network parameters \(\boldsymbol{\gamma}\) (linear weights and biases) to obtain simple covariate expressions. Pruning a network parameter is achieved by fixing its value to zero and removing it from the vector \(\boldsymbol{\gamma}\) of trainable parameters.

Before each pruning iteration, we train the network until convergence. This translates into finding a local minimum of the loss function in Eq. 6. At such minima, the partial derivatives of the loss function with respect to the trainable parameters are zero. Therefore, the second derivatives define the first non-zero terms of the Taylor series expansion that describe local parameter sensitivities, as further explained in Appendix A. The second-order derivatives make up the elements of the Hessian matrix. Specifically, its diagonal elements represent sensitivities with respect to the corresponding individual parameters.

In the Taylor series expansion, the second-order terms take on the form

$$S(\gamma_k) = \gamma_k^2 H_{kk},$$

(8)

where \(H_{kk}\) is the \(k^\text{th}\) diagonal element of the Hessian of the loss function with respect to the network parameters, and \(S(\gamma_k)\) denotes the salience of the parameter \(\gamma_k\) [28].

For pruning of the network parameters \(\boldsymbol{\gamma}\), Eq. 8 may be used directly. However, computing the salience of a covariate is not as straightforward, since the Hessian elements differ between individuals. A solution to this is to sum the salience contributions of the individuals,

$$S(\varphi_k) = \sum_{i=1}^{n} \varphi_{i,k}^2 (H_i)_k,$$

(9)

where \((H_i)_k\) is the \(k^\text{th}\) diagonal element of the Hessian \(H_i\), determining the sensitivity with respect to covariate \(\varphi_k\) for individual i, and n is the number of individuals.

We use the Zygote package [29] in Julia to compute the Hessian diagonal elements. In Appendix A, we give a more detailed description of the Hessian-based pruning method, providing mathematical insight into this methodology.
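A minimal sketch of the salience-based ranking of Eq. 8 is given below; for brevity it extracts the diagonal from Zygote's full Hessian, and `loss` denotes a (hypothetical) function of the trainable parameter vector \(\boldsymbol{\gamma}\). The pruning helper simply clamps the least salient parameter to zero; in the real implementation the parameter is also removed from the trainable set, see [27].

```julia
using Zygote, LinearAlgebra

# Salience of each trainable parameter, Eq. 8: S(γ_k) = γ_k^2 * H_kk.
function parameter_salience(loss, γ::Vector{Float64})
    H = Zygote.hessian(loss, γ)          # Hessian of the loss at the current parameters
    return γ .^ 2 .* diag(H)
end

# Prune the parameter with the smallest salience by fixing it to zero.
function prune_least_salient!(loss, γ)
    S = parameter_salience(loss, γ)
    γ[argmin(S)] = 0.0
    return γ
end
```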

Fig. 3

Pruning sequence of a symbolic regression network with output pharmacokinetic parameter \(\theta_2\) (one of the rate constants \(k_{ij}\) of Eq. 1). Input covariates are age, weight, gender, and arterial or venous sampling (AV). The nominal network has three dense layers, each followed by base expressions such as 1 (feedforward), multiplication, power function, division, and absolute value. A black line represents a connection between two nodes, and a gray line represents a pruned (removed) connection. The final network represents the covariate expression of Eq. 10b

Recipe: Symbolic regression

A compact summary of our training and pruning scheme is provided below. Initially, we start with a nominal symbolic regression network, like the one shown in Fig. 2. It is sequentially trained and pruned until we obtain a final expression of sufficient complexity and fit. Our training and pruning sequence of a symbolic regression network is as follows:

1. Choose a nominal symbolic regression network architecture, and corresponding base expressions.

2. Train the network until convergence.

3. Compute the salience of each (remaining) covariate, \(S(\varphi_k)\) of Eq. 9.

4. Sort the covariates by salience and remove the covariate with the smallest salience.

5. If the desired final number of covariates is reached, continue to step 6; otherwise return to step 2.

6. Train the reduced network until convergence.

7. Compute the salience of each (remaining) trainable network parameter, \(S(\gamma_k)\) of Eq. 8.

8. Sort the network parameters by salience and remove the N parameters with the smallest salience.

9. If the desired final number of network parameters is reached, continue to step 10; otherwise return to step 6.

10. Train the reduced network until convergence.

11. Convert the resulting network to a readable functional expression.

The number of covariates and network parameters to keep in the final symbolic regression network is a trade-off between fit to data and complexity of the final expression. In our example, illustrated in Fig. 3, we remove one covariate at each pruning iteration, until only two covariates are left. Next, we remove the \(N=10\) least sensitive parameters in the first parameter pruning iteration, and then one parameter per subsequent pruning iteration until only twelve network parameters are left. More details of the pruning and training can be found in [27]. In [30], we demonstrate that our methodology can identify functions of known shapes correctly.
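Putting the recipe together, a high-level driver for the alternating training/pruning loop could look as follows. All helper functions (`train_to_convergence!`, `salience_of_covariates`, `salience_of_parameters`, `remove_covariate!`, `remove_parameters!`, `n_covariates`, `n_parameters`, `to_expression`) are hypothetical stand-ins for the steps above; the default stopping values mirror our example, and [27] contains the actual implementation.

```julia
# Hypothetical driver for the recipe (steps 1-11), assuming a nominal network `net`
# (step 1) and the training data `data`.
function symbolic_regression(net, data; n_cov_final = 2, n_par_final = 12, N = 10)
    train_to_convergence!(net, data)                     # step 2
    while n_covariates(net) > n_cov_final                # steps 3-5
        S = salience_of_covariates(net, data)            # Eq. 9
        remove_covariate!(net, argmin(S))
        train_to_convergence!(net, data)
    end
    while n_parameters(net) > n_par_final                # steps 7-9
        S = salience_of_parameters(net, data)            # Eq. 8
        remove_parameters!(net, sortperm(S)[1:N])        # prune the N least salient
        train_to_convergence!(net, data)                 # steps 6 and 10
        N = 1                                            # one parameter per subsequent iteration
    end
    return to_expression(net)                            # step 11
end
```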

Initialization of the parameter values associated with step 1 of the recipe affects the fit of the resulting final model. To mitigate the risk of poor model fit due to an unfortunate initialization, the recipe can be run several times, keeping the best-fitting model. In our example, we executed the recipe eight times.

Limits of performance

Structural mismatch between the asserted PK model structure of Eq. 1 and the data, in combination with measurement errors, induces upper and lower limits on the prediction errors of the trained covariate model. Here we explain how these limits can be characterized by training two additional models.

Even with a very complex covariate model, one cannot expect a perfect fit to data (zero loss). This is because the fixed (three-compartment) PK model structure is only an approximation of the actual pharmacokinetics, and because of (blood sample) measurement errors. Part of the loss remaining after applying our covariate modeling scheme can thus be attributed to this mismatch. To indicate how much, we optimize one set of (three-compartment) PK parameters for each individual in the data set. This results in a completely individualized, covariate-free model. While it will fit the data better than any covariate model, it does not generalize to new individuals. This makes it practically useless for purposes other than providing an upper limit on performance.

Another natural question is how much we gain (in terms of loss) by considering covariate dependencies. To answer it, we optimize a constant, i.e. covariate-free, model, in which all individuals share the same (three-compartment) PK parameter values. This model constitutes a lower bound on the achievable predictive performance that any covariate-based model should beat.

For these two additional models, PK parameter optimization was done with the optimization package Optim.jl [31], to minimize the same loss as for our covariate model based on symbolic regression. Implementation details can be found in [27].
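A sketch of how the two reference models can be fitted with Optim.jl is given below. `loss_individual` and `loss_pooled` are hypothetical functions evaluating the same MdALE-based loss for one individual and for the whole data set, respectively, and Nelder–Mead is used here merely as an example of an available solver; the actual solver choice and setup are documented in [27].

```julia
using Optim

# Upper performance limit: one PK parameter set optimized per individual.
function fit_individual_models(loss_individual, data, θ0)
    return [Optim.optimize(θ -> loss_individual(θ, ind), θ0, NelderMead()).minimizer
            for ind in data]
end

# Lower performance limit: one PK parameter set shared by all individuals.
fit_pooled_model(loss_pooled, data, θ0) =
    Optim.optimize(θ -> loss_pooled(θ, data), θ0, NelderMead()).minimizer
```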
