A proof-of-concept reinforcement learning-based tool for non-parametric population pharmacokinetics workflow optimization

Pharmacometricians execute a sequential design process. Each step has three conceptual parts: the data are compared to the current model predictions and a hypothesis is formed, a model is constructed, and an optimization is performed. This is repeated until the comparison of data and model predictions is satisfactory. The iterations of this sequence are an informed exploration of potential models. In this work, we remove hypothesis formation (which requires human intelligence) and determine new models at each iteration step based solely on the optimization performance of the current model (and without human intervention).

Since optimization performance is only one of many potential metrics of how informative a model is about the data, the need for human intelligence is not eliminated, but the initial (and usually most time-consuming) part of the pharmacometrician's task is automated: that is, given a small set of parameterized models, find the best fit to the data.

The task of determining, from a set of "best-fit" models, which is the most informative for the specific question asked of the modeler cannot be answered solely through automated design. Unlike the question of best fit, which has a true metric (in the case of maximum likelihood methods, the likelihood or one of its transformations: log-likelihood, Akaike Information Criterion, Bayesian Information Criterion, ...), the question of which model is most informative also relies on qualitative assessment.

Pharmacokinetic models

Pharmacokinetics (PK) is the study of the absorption, distribution, metabolism, and elimination processes of a drug in the body and the mathematical equations to model these processes. In pop-PK, the usual way to represent these systems is by compartmental models. A compartmental model is a mathematical construction in which the body is represented as a set of "interconnected tanks". Drug flows between the tanks and is assumed to be instantly and evenly distributed within each of these tanks.

Tanks which correspond to measured drug concentrations are assigned a volume \(v_i\) to relate the amount of drug inside the tank, \(X_i\), to the concentration. The connections that dictate drug mass transfer between the environment and tanks and between tanks are typically represented by a rate constant \(k_{ij}\), where i is the originating tank, j is the destination, and 0 can be used to represent the environment. An alternative parameterization of mass transfer uses clearance terms describing drug-containing volumes cleared per unit time, e.g., \(Cl\) or \(Q\).

Simple PK models without pharmacodynamic (PD) outputs usually contain one to three compartments [14], depending on the need to describe absorption from extravascular drug administration (e.g. oral) and/or linear or non-linear decline of log-transformed drug concentrations with respect to time after dosing. It is important to highlight that pop-PK modeling is usually empiric in that equations are derived to mathematically fit the shape of the concentration-time profiles without regard to underlying mechanistic reasons for the shapes of the profiles. While understanding of anatomy or physiology can inform choice of compartments and types of transfers, detailed mathematical equations with large numbers of parameters to describe such processes are characteristic of a physiologically based PK (PBPK) approach and not a pop-PK approach [13]. Little physiology is associated with non-PBPK models, except for the systemic concentration in the central compartment [6].

Equations (1) and (2) are the analytic solutions for one- and two-compartment models with a constant drug input (infusion) rate \(R_0\), respectively, adapted from [4]; in Eq. (1), \(k_e\) denotes the first-order elimination rate constant:

$$X(t)=X(0)e^{-k_e t}+\frac{R_0}{k_e}\left(1-e^{-k_e t}\right)$$

(1)

$$\begin{aligned} X_c(t)=\;&\frac{X_c(0)}{\lambda_1-\lambda_2}\Big((\lambda_1-k_{21})e^{-\lambda_1 t}+(k_{21}-\lambda_2)e^{-\lambda_2 t}\Big)\\ &+\frac{R_0}{\lambda_1-\lambda_2}\left(\frac{(\lambda_1-k_{21})}{\lambda_1}\left(1-e^{-\lambda_1 t}\right)+\frac{(k_{21}-\lambda_2)}{\lambda_2}\left(1-e^{-\lambda_2 t}\right)\right) \end{aligned}$$

(2)

with

$$\lambda_1,\lambda_2=\frac{(k_{12}+k_{21}+k_{10})\pm\sqrt{(k_{12}+k_{21}+k_{10})^2-4\,k_{21}k_{10}}}{2}$$

(3)
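As an illustration, the following is a minimal sketch of how Eq. (1) could be evaluated in Python, using the symbols of the reconstruction above (initial amount X(0), constant infusion rate \(R_0\), elimination rate constant \(k_e\)); the function and the example values are hypothetical:

```python
import numpy as np

def one_compartment_amount(t, x0, r0, ke):
    """Amount of drug X(t) for a one-compartment model with constant
    infusion rate r0 and first-order elimination rate constant ke (Eq. 1)."""
    t = np.asarray(t, dtype=float)
    return x0 * np.exp(-ke * t) + (r0 / ke) * (1.0 - np.exp(-ke * t))

# Example: 100 mg bolus at t = 0, no infusion, ke = 0.1 h^-1
times = np.linspace(0.0, 24.0, 25)                # hours
amounts = one_compartment_amount(times, x0=100.0, r0=0.0, ke=0.1)
concentrations = amounts / 20.0                   # divide by a volume v (L) to get mg/L
```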

Non-parametric maximum likelihood methods

Our laboratory has long favored non-parametric (NP) pop-PK methods. The mathematical requirements for applying likelihood as a guiding metric for NP-PK model development were first completely expressed by [10]. Mallet's paper relates likelihood and optimal design theory in the context of PK. One conclusion is that the maximum likelihood (ML) solution is a discrete distribution of weighted support points, with the number of points less than or equal to the number of subjects.

Thus, algorithms that rely on likelihood to guide model development maximize equation

$$L(w,\phi)=\prod_{i=1}^{N}\sum_{k=1}^{K} w_k\, p(Y_i \mid \phi_k)$$

(4)

with respect to the support points \(\phi = (\phi_1, \ldots, \phi_K)\) and weights \(w = (w_1, \ldots, w_K)\) such that \(\phi_k \in \Theta\), \(w_k \ge 0\) for \(k=1, \ldots, K\), \(K \le N\) and \(\sum_{k=1}^{K} w_k = 1\). Benders' decomposition is often invoked to allow the above optimization to be carried out in two parts: optimization of support point placement and optimization of their weights [2]. The distribution that maximizes equation (4) is a consistent estimator of the true mixing distribution, which means that it will converge to the true distribution as the number of subjects grows large; this was proved by [7].
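For concreteness, a minimal sketch of the objective in Eq. (4), assuming the per-subject densities \(p(Y_i \mid \phi_k)\) have already been evaluated into an \(N \times K\) matrix; names are hypothetical and the logarithm is taken only for numerical stability:

```python
import numpy as np

def np_log_likelihood(psi, w):
    """Log of Eq. (4). psi[i, k] = p(Y_i | phi_k); w holds the support-point weights.

    psi : (N, K) array of subject-by-support-point likelihoods
    w   : (K,) array with w_k >= 0 and sum(w) == 1
    """
    mixture = psi @ w                  # sum_k w_k p(Y_i | phi_k), shape (N,)
    return np.sum(np.log(mixture))     # log prod_i (...) = sum_i log(...)

# Toy example: 3 subjects, 2 candidate support points
psi = np.array([[0.10, 0.02],
                [0.05, 0.20],
                [0.08, 0.07]])
w = np.array([0.6, 0.4])
print(np_log_likelihood(psi, w))
```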

By construction, NP methods converge to a locally optimal solution given the search space boundary, the initial estimated support point locations and the error function of the associated regression model. Over 20 years ago, LAPKB developed the Non-parametric Adaptive Grid (NPAG) algorithm [16] and, more recently, the faster Non-parametric Optimal Design (NPOD) algorithm [8].

In this work, NPOD is initialized using a Sobol sequence of 51 elements within the input boundary, and the error is fixed to be constant. An interior point method (IPM) is used to find the weight of each point such that the likelihood is maximized. The support points with high probability are saved as the initial support.
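A minimal sketch of this initialization step, assuming SciPy's quasi-Monte Carlo Sobol generator and a hypothetical two-parameter search space; the IPM weight optimization described above is not shown:

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical 2-parameter boundary: elimination rate constant and volume
lower = np.array([0.01, 5.0])
upper = np.array([1.00, 150.0])

sampler = qmc.Sobol(d=2, scramble=True, seed=0)
unit_points = sampler.random(51)                 # SciPy warns that 51 is not a power of 2,
                                                 # but the 51 points are still generated
support = qmc.scale(unit_points, lower, upper)   # map [0, 1)^2 into the search boundary

# Each row of `support` is an initial support point; an interior point method
# would then assign weights to these points to maximize the likelihood in Eq. (4).
```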

NPAG

NPAG is an iterative grid search algorithm that uses a grid expansion/contraction step to add extra support points to the current pool \(\{\phi_k\}\); then, after determining \(p(Y_i \vert \phi_k)\) for each point in the expanded pool, it applies a primal-dual interior point algorithm to choose the best N-or-fewer supports from this pool, consistent with equation (4): \(\{\phi_k\} \rightarrow \{\phi_k\}'\). The algorithm ends when no improvement can be found. NPAG was first introduced by Leary [8] and details of the algorithm are found in [16]. NPAG and NPOD differ in their method for determining the extra support points given the current support.
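As a heavily simplified sketch of the expansion idea only (not the actual NPAG implementation), each current support point might spawn candidates perturbed by \(\pm \Delta\) of the search-space range in each dimension; the pooled candidates would then be re-weighted by the IPM (not shown) and low-probability points dropped before \(\Delta\) is contracted:

```python
import numpy as np

def expand_grid(support, delta, lower, upper):
    """Illustrative candidate generation for one expansion step.

    support : (K, D) array of current support points
    delta   : fraction of the search-space range used as the perturbation
    """
    span = upper - lower
    candidates = [support]
    for d in range(support.shape[1]):
        for sign in (-1.0, +1.0):
            shifted = support.copy()
            shifted[:, d] += sign * delta * span[d]
            candidates.append(shifted)
    pool = np.vstack(candidates)
    # keep only candidates inside the search boundary and remove duplicates
    inside = np.all((pool >= lower) & (pool <= upper), axis=1)
    return np.unique(pool[inside], axis=0)
```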

NPOD

The initial set of support points is saved as an "anchor" set. Each support point is allowed to move in turn, leaving all other points in the anchor at their positions. Usually a Nelder-Mead algorithm is applied to move the points, and likelihood is used as the objective during movement. The resulting "optimal" points are then added to the anchor set, the IPM is applied, and the less likely points are removed. This procedure is repeated to convergence. Details about the NPOD algorithm are to be published elsewhere, but the general idea is presented in a poster by Leary et al. [8].
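A minimal sketch of the point-movement idea, reusing the toy likelihood shape of the Eq. (4) sketch and SciPy's Nelder-Mead implementation; the anchor bookkeeping and the IPM weight step are omitted and all names are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=0.2, size=8)            # toy per-subject observations

def p_y_given_phi(phi):
    """Toy subject likelihoods p(Y_i | phi) for a scalar parameter phi."""
    return np.exp(-0.5 * ((y - phi) / 0.2) ** 2)

def log_likelihood(support, w):
    psi = np.column_stack([p_y_given_phi(phi) for phi in support])
    return np.sum(np.log(psi @ w))

anchor = np.array([0.5, 1.5])                         # current ("anchor") support points
weights = np.array([0.5, 0.5])                        # weights from the last IPM step

# Move point 0 with Nelder-Mead while point 1 stays at its anchor position
def objective(phi0):
    moved = np.array([phi0[0], anchor[1]])
    return -log_likelihood(moved, weights)            # minimize the negative log-likelihood

result = minimize(objective, x0=[anchor[0]], method="Nelder-Mead")
print("candidate position for point 0:", result.x[0])
```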

Machine learning tools

Markov decision process

A Markov Decision Process (MDP) is a framework to define decision-reward problems. The decision maker is called the agent; the agent interacts with the environment, changing the environment's state and receiving a reward. The MDP framework proposes that any goal-directed problem can be reduced to three elements exchanged between an agent and an environment: the choice made by the agent (the action), the environment's state, and the reward that indicates to the agent whether it is reaching its goal [15]. The sequence of actions taken until a termination signal is received is called an episode.

The agent interacts with the environment in discrete time steps. On each step the agent has access to a pool of actions \(a \in A\) and can select one to perform. After this, the environment gives the agent a new observation and an associated reward \(r \in R \subset \mathbb{R}\). An observation is a (partial) representation of the environment's state \(s \in S\). An MDP can be defined by the quintuple \(M=\langle S,A,P,R,\gamma \rangle\) [9], where P are the transition probabilities and \(\gamma\) is the discount factor; these concepts are defined later in this document.

In the pharmacokinetic modeling case, the agent would be the pharmacometrician, the environment is the software with which the PK data are modeled, the actions are the decisions made to fit the model, and the reward is the quantity used to evaluate the fitness of the model. The termination signal would be the moment when the pharmacometrician decides that the criteria are met or that the model is good enough and stops the modeling process. Figure 1 represents the PK optimization problem defined by a dataset and a pool of models in terms of the RL framework.

Fig. 1 PK optimization problems in terms of RL. RL reinforcement learning, PK pharmacokinetic, \(S_t\) state at time t, \(R_t\) reward at time t, \(A_t\) action at time t
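As a sketch of this framing only (not the actual software), a Gym-style environment could expose the model-selection problem to an agent as follows; the action pool, the state encoding and the reward definition are all hypothetical choices:

```python
from dataclasses import dataclass, field

@dataclass
class PKModelSelectionEnv:
    """Toy environment: the agent picks candidate models to fit; the reward
    is derived from a fit metric (e.g., the log-likelihood) on the dataset."""
    candidate_models: list            # pool of parameterized model identifiers
    fit_metric: callable              # model identifier -> log-likelihood on the data
    best_so_far: float = float("-inf")
    history: list = field(default_factory=list)

    def reset(self):
        self.best_so_far = float("-inf")
        self.history = []
        return tuple(self.history)    # state: the sequence of models tried so far

    def step(self, action):
        ll = self.fit_metric(self.candidate_models[action])
        # hypothetical reward: improvement over the best log-likelihood seen so far
        reward = ll if not self.history else max(ll - self.best_so_far, 0.0)
        self.best_so_far = max(self.best_so_far, ll)
        self.history.append(action)
        done = len(self.history) >= len(self.candidate_models)
        return tuple(self.history), reward, done

env = PKModelSelectionEnv(candidate_models=["1cmt", "2cmt"],
                          fit_metric=lambda m: {"1cmt": -120.0, "2cmt": -95.0}[m])
state = env.reset()
state, reward, done = env.step(1)     # the agent tries the two-compartment model
```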

Reinforcement learning

Reinforcement Learning (RL) is one of the three main branches of machine learning. It is defined by a set of tools that aim to train different kinds of agents over an MDP in how to make decisions that maximize the total expected reward over a defined time horizon.

A critical problem that needs to be addressed in RL is the trade-off between exploration and exploitation. In an ideal scenario an agent would prefer to exploit actions that have proven to return a high reward, but in order to obtain that knowledge the agent first has to explore those or similar actions. In their introduction to RL, Sutton and Barto [15] express that the equilibrium between exploration and exploitation is an open problem that, for example, does not need to be addressed in supervised or unsupervised learning.

This way of learning is very reminiscent of the way living beings learn: taking actions, analyzing the consequences and internalizing the results to make better decisions in the future, while balancing safe decisions that secure a known reward against exploring new options. And because these RL-driven agents do not have the time, computational capacity or memory constraints humans do, they will end up evaluating sequences of actions that might not be obvious to a modeler from the very beginning.

Other RL concepts

The future reward \(G_t\), also known as the return, is the total sum of discounted rewards over an infinite time horizon: \(G_t = R_{t+1}+\gamma R_{t+2}+ \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\).

The discount factor \(\gamma \in [0,1]\) describes how much the algorithm values future rewards versus present ones; the closer it is to 1, the more importance the algorithm gives to future rewards.
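A minimal numeric sketch of the return and of the effect of the discount factor, using a hypothetical reward sequence:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 5.0]                  # rewards R_{t+1}, R_{t+2}, ...
print(discounted_return(rewards, gamma=1.0))    # 6.0  : future rewards fully valued
print(discounted_return(rewards, gamma=0.5))    # 0.875: future rewards heavily discounted
```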

The learning rate \(\alpha\) is a hyperparameter that defines how fast we want the agent to learn. A learning rate of 0 means that the agent is not going to learn any new information, and a learning rate of 1 means that the agent is going to prioritize new information over old.

The agent's policy \(\pi(s)\) describes the best action to take in a given state to maximize this total sum of rewards. This policy can be deterministic, \(\pi(s) = a\), or stochastic, \(\pi(a|s) = P_\pi[A=a|S=s]\).

The value of a state, \(V_\pi(s)\), is defined as the total sum of rewards the agent will get if it is in state s and then follows the policy \(\pi\) until termination: \(V_\pi(s)=\mathbb{E}_\pi[G_t|S_t=s]\).

The value of a state-action pair, \(Q_\pi(s,a)\), similarly to \(V_\pi(s)\), is the total sum of rewards the agent might obtain if it is in state s, but in this case it takes action a and then follows the policy \(\pi\): \(Q_\pi(s,a)=\mathbb{E}_\pi[G_t|S_t=s,A_t=a]\).

On-policy and off-policy distinguish between two different kinds of RL algorithms: in the former, the agent is trained using trajectories that follow the agent's own policy; in the latter, the agent learns a policy different from the one it actually follows during the training process.

Model-based RL is the subset of algorithms in which the transition probabilities P and rewards R are known beforehand; model-free RL defines the algorithms in which the learning process is independent of P and R. The latter is the case in the pharmacokinetic model building process. Keep in mind that P and R are two of the five elements that define the MDP quintuple.

The end goal of RL is to train the agent to maximize the total sum of discounted rewards; for finite MDPs this means finding the optimal policy \(\pi^*\), defined as the policy for which \(v_{\pi^*}(s) \ge v_\pi(s)\) for all \(s \in S\) and all policies \(\pi\). In the same way, the state-action value for an optimal policy can be expressed as follows, adapted from [15]:

$$Q_{\pi^*}(s,a)=\mathbb{E}\left[R_{t+1}+\gamma\, v_{\pi^*}(S_{t+1})\mid S_t=s, A_t=a\right]$$

(5)

Because of these relationships, there are two different approaches to maximizing this total sum of discounted rewards: value learning methods, in which the goal is to train the agent to learn \(v_\pi(s)\) or \(Q_\pi(s,a)\), and policy learning methods, in which the agent seeks to infer \(\pi\) directly.

Dynamic programming

Dynamic Programming (DP) refers to the set of algorithms used to solve optimization problems by dividing them into multiple sub-problems, each of which can be optimized independently by computing an optimal policy. This is what Bellman calls the 'principle of optimality'; in his words: "an optimal policy has the property that, whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision" [1].

The central idea of DP is to use value functions to organize the search for better policies. This is done in three steps, policy evaluation, policy improvement and policy iteration, to find a \(v_\pi(s)\) or \(Q_\pi(s,a)\) that satisfies the Bellman optimality equation (6); for the state-value version of this equation and a more in-depth explanation of these topics refer to [15].

$$Q_{*}(s,a)=\sum_{s',r} p(s',r\mid s,a)\left[r+\gamma\,\underset{a'}{\max}\,Q_{*}(s',a')\right]$$

(6)

One property of interest of DP methods is that they update their estimates of the state-action values using the values of their successors, as is evident in (6); this idea is called bootstrapping. By using this concept, RL algorithms are able to propagate information about good state-action pairs backwards to the beginning of the episode.
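A minimal sketch of Eq. (6) as tabular Q-value iteration on a hypothetical two-state, two-action MDP with known transition probabilities and rewards:

```python
import numpy as np

n_states, n_actions = 2, 2
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((n_states, n_actions))
for _ in range(500):                  # iterate the backup in Eq. (6) to a fixed point
    V = Q.max(axis=1)                 # max_a' Q(s', a')
    Q = R + gamma * P @ V             # expectation over successor states s'

print("Q*:", Q)
print("greedy policy:", Q.argmax(axis=1))
```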

Monte Carlo methods

Monte Carlo (MC) methods in RL are used to estimate the value functions and thereby discover optimal policies; unlike DP algorithms, MC methods do not need to know the transition probabilities or rewards beforehand. Monte Carlo methods rely on the exploration of a real or virtual environment, acquiring experience by sampling sequences of actions, states and rewards over multiple episodes. A difference between Monte Carlo methods and DP is that the former only update their estimates at the end of each episode, which imposes a set of restrictions: each episode must terminate regardless of the sequence of actions taken, the methods average complete returns, it is not possible to learn from partial results, and they do not bootstrap.

A benefit MC methods offer over DP that is especially relevant to the population pharmacokinetic model selection process is that they can learn how to behave optimally just by analyzing these exploration steps over the environment; in other words, it would be possible to capture the state-action sequences taken by an expert and introduce that knowledge to the agent directly.
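A minimal sketch of this idea: complete recorded episodes (for example, trajectories captured from an expert modeler) are replayed after termination, and every first visit of a state-action pair is credited with the full discounted return that followed it:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """episodes: list of recorded trajectories [(state, action, reward), ...]."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        for t in reversed(range(len(episode))):            # walk the episode backwards
            s, a, r = episode[t]
            g = r + gamma * g                               # accumulate the return G_t
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:   # first visit only
                returns[(s, a)].append(g)
    return {sa: sum(v) / len(v) for sa, v in returns.items()}       # averaged Q estimates

episodes = [[("s0", "a1", 0.0), ("s1", "a0", 1.0)],
            [("s0", "a0", 0.0), ("s1", "a0", 2.0)]]
print(first_visit_mc(episodes))
```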

To guarantee convergence in this random exploration process, MC methods implement a greedy policy, that is, one that for each \(s \in S\) chooses the action with the maximal action-value. The theorem presented in (7) is borrowed from [15] and shows how, with a greedy policy, the algorithm can guarantee that at a given step k, \(\pi_{k+1}\) is going to be as good as or better than \(\pi_k\).

$$\begin{aligned} \pi_{k+1}(s) &\doteq \underset{a}{\arg\max}\; Q_{\pi_k}(s,a)\\ Q_{\pi_k}(s,\pi_{k+1}(s)) &= Q_{\pi_k}\Big(s, \underset{a}{\arg\max}\; Q_{\pi_k}(s,a)\Big)\\ &= \underset{a}{\max}\; Q_{\pi_k}(s,a)\\ &\ge Q_{\pi_k}(s, \pi_k(s))\\ &\ge v_{\pi_k}(s). \end{aligned}$$

(7)

The main problem is the fact that the agent needs to explore and take random actions in order to acquire experience, but it also needs to take greedy actions to be able to converge; obtaining an equilibrium between these two divergent behaviors is a key aspect of Reinforcement Learning algorithms. One way to do it is by using \(\epsilon\)-greedy policies, which means that the agent balances exploration and exploitation by taking the greedy action with probability \(1 - \epsilon + \frac{\epsilon}{|A(s)|}\) and each non-greedy action with probability \(\frac{\epsilon}{|A(s)|}\).
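A minimal sketch of the \(\epsilon\)-greedy selection rule described above, for a hypothetical row of the Q table:

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Greedy action with probability 1 - eps + eps/|A|, each other action with eps/|A|."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform over all actions
    return int(np.argmax(q_row))               # exploit: greedy action

rng = np.random.default_rng(42)
q_row = np.array([0.1, 0.7, 0.3])              # Q(s, a) for a single state s
actions = [epsilon_greedy(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
print("fraction of greedy choices:", actions.count(1) / len(actions))
```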

Temporal difference reinforcement learning

Temporal Difference (TD) learning is the combination of Monte Carlo and Dynamic Programming ideas. TD can learn from raw experience like MC and can also bootstrap information like DP. A key difference between MC and TD methods is that the latter do not need to wait until the end of the episode to update \(Q_\pi(s,a)\), as they update the state-action pair after each time step.

In summary, the main characteristics of TD algorithms are:

They do not require previous knowledge of the model (model-free).

They can be implemented in an incremental way.

They do not need to wait until the end of the episode to learn.

Exploration is not penalized as heavily as in other algorithms.

SARSA

SARSA (State-Action-Reward-State-Action) is an on-policy TD algorithm; as in other on-policy algorithms, its goal is to estimate the value matrix \(Q_\pi(s,a)\) for all states s and actions a, following an \(\epsilon\)-greedy policy \(\pi\). This is done by taking the quintuple \((s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\) and updating \(Q_\pi(s,a)\) after every transition from \(s_t\) to \(s_{t+1}\) using Eq. (8), a modified version of the Bellman equation.

$$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\left(r_{t+1}+\gamma\,Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right)$$

(8)

The general idea of the SARSA algorithm is to bootstrap the future information our policy \(\pi\) could provide while evaluating present rewards; how relevant each of those elements is, is controlled by the parameters \(\gamma\) and \(\alpha\), respectively. The final algorithm used to train the agent in this paper is an interpretation made by Sutton and Barto [15] of the original work by G. Rummery and M. Niranjan [12] and is presented in Algorithm 1.

Because the final goal of SARSA is to estimate \(Q_\pi(s,a)\), some limitations are imposed; in this case the most relevant is that both the state space S and the action space A should be finite and discrete. This is because of the convergence conditions that Monte Carlo methods impose and the general assumptions of the algorithm. We explain our approach to overcoming those limitations.

Algorithm 1 SARSA, adapted from [15]
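A minimal tabular sketch of the SARSA loop in Algorithm 1, assuming a generic episodic environment with `reset()` and `step(action)` methods (such as the model-selection environment sketched after Fig. 1) and the \(\epsilon\)-greedy rule above; all names and default hyperparameters are hypothetical:

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))       # tabular Q(s, a)

    def policy(state):                                 # epsilon-greedy with respect to Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)
            # Eq. (8): bootstrap the value of the next state-action pair
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q
```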
