Reinforcement learning for intensive care medicine: actionable clinical insights from novel approaches to reward shaping and off-policy model evaluation

Clinical context

In this paper we focused on COVID-19 patients who required mechanical ventilation in the ICU to maintain adequate oxygenation and carbon dioxide removal. Important ventilator parameters include respiratory rate, tidal volume, peak pressure, plateau pressure, positive end-expiratory pressure (PEEP) and fraction of inspired oxygen (FiO2). For most of these parameters, the ventilator mode determines whether they are controlled by the healthcare professional, controlled by the patient, or only monitored. However, regardless of ventilator mode, PEEP and FiO2 are always controlled by healthcare professionals rather than only monitored. There is no consensus on optimal values for PEEP and FiO2 [4, 5]. If PEEP is too low, the lung may collapse, causing decreased compliance and hypoxaemia due to shunting. Overzealous application of PEEP may lead to reduced preload and hence reduced cardiac output, decreased compliance, and increased dead-space ventilation causing hypercarbia and acidosis [6]. Similarly, if FiO2 is too low, hypoxia will ensue, ultimately leading to organ failure and death. However, an FiO2 that is too high is associated with oxygen toxicity to the lung and other organs [7]. Personalisation of PEEP and FiO2 may therefore be a valuable strategy. Some recent clinical trials in this direction have shown promising results [8, 9]. However, these approaches are labour intensive and require particular expertise in respiratory physiology as well as additional monitoring devices. RL may therefore be a promising approach in this setting. To the best of our knowledge, no previous attempts have been made to use RL to optimise mechanical ventilation settings in COVID-19. A limited number of attempts have been made to use RL for optimising ventilator settings in patients with respiratory failure from causes other than COVID-19. Prasad et al. applied Fitted Q-iteration to optimise weaning protocols in ICUs, with the clinical goal of reducing reintubation occurrences and regulating patient physiological stability [10]. Peine et al. developed an RL-based model called VentAI to minimise 90-day mortality by optimising ventilatory settings, specifically tidal volume (TV), PEEP and FiO2, which are among the most commonly used settings on a ventilator under controlled ventilation [11]. Kondrup et al. proposed DeepVent, a model based on Conservative Q-Learning (CQL), using a similar setup to Peine et al., and evaluated it using Fitted Q Evaluation (FQE) [12]. Additionally, they introduced an intermediate reward function component based on the modified APACHE II score.

Data extraction and preprocessing

The data for this study were sourced from the DDW, a database compiling information on critically ill COVID-19 patients from 25 ICUs in the Netherlands [13]. The DDW encompasses data on 3464 patients, covering two distinct periods of the pandemic in the Netherlands, often referred to as "wave 1" and "wave 2". These terms denote the first and the second major surge of COVID-19 cases, respectively, each of which saw a dramatic increase in infections. The database contains more than 200 million individual clinical data points. At the time of this study, a snapshot of the DDW containing 3051 of the full 3464 patients was used for this experiment. The overall ICU mortality was 24.4%. Respiratory and haemodynamic parameters were among the most commonly recorded, including ventilation mode, prone positioning and ventilator settings. Medication administration and daily fluid balance were available for most patients. Laboratory records were widely available. Clinical features were derived using time aggregations applied at 4 h, 6 h and 24 h intervals. A list of all available data parameters and features is available in Additional file 1: Appendix SA. Patient admissions were selected based on length of ICU stay, use of invasive mechanical ventilation, and data availability, as described in Fig. 1.

Fig. 1

Patient selection flowchart

Reinforcement learning problem definition

RL is a computational approach where a computer program, known as an 'agent', learns to make decisions by interacting with an environment, in this case, represented by ICU patient data. The goal of the agent is to maximise a 'reward', which in the ICU context, translates to optimal patient treatment outcomes. RL involves the agent interacting with the environment through 'states', which are snapshots of aggregated patient data over specific time intervals, and 'actions', the medical interventions or treatment decisions. A 'trajectory' pertains to the sequential events and decisions made throughout the entire course of a single patient's admission, encompassing the complete set of states and the corresponding actions executed during that individual's stay in the ICU.

Q-learning, an off-policy algorithm in RL, learns optimal decision strategies even from suboptimal actions. In the ICU, this means it can learn effective treatment strategies by analysing both optimal and non-optimal medical decisions made by healthcare professionals. This ability is crucial as it allows for learning from a wide range of historical real-world ICU scenarios without the need for experimental interventions on patients. In Q-learning, a function known as Q(s, a) estimates the utility of taking a particular action (a) in a given state (s), and then following the optimal strategy thereafter.
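As a brief illustration of the general idea (the standard Q-learning update, not our exact deep-learning implementation), the Q-function is iteratively moved towards the Bellman target:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

where \(\alpha\) is the learning rate, \(\gamma\) the discount factor trading off immediate against future rewards, and \(r_{t+1}\) the reward observed after taking action \(a_t\) in state \(s_t\).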

This section outlines the specific components of our model: the state space, action space, and reward structure, integral to our RL approach.

The state space defines the range of all possible conditions that an ICU patient can experience, as represented by a comprehensive combination of key clinical indicators and measurements. We used one-hour time steps and combined ventilator sensor data, medication records, clinical scores, and laboratory results into one state space. Missing values were imputed using the carry-forward method, up to clinical cut-offs; see Additional file 1: Appendix SA for the full list of features.
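For illustration only, the sketch below shows how carry-forward imputation with a maximum validity window might look. The one-hour resampling grid matches our time step, but the per-feature carry-forward limits in `MAX_CARRY_HOURS` are hypothetical placeholders; the actual clinical cut-offs are listed in Additional file 1: Appendix SA.

```python
import pandas as pd

# Hypothetical per-feature carry-forward limits (in hours); the actual
# clinical cut-offs are listed in Additional file 1: Appendix SA.
MAX_CARRY_HOURS = {"peep": 8, "fio2": 8, "lactate": 24}

def build_state_space(raw: pd.DataFrame) -> pd.DataFrame:
    """Resample one admission to a 1-hour grid and carry values forward,
    but only up to a feature-specific cut-off duration."""
    hourly = raw.resample("1h").last()  # one state per hour (DatetimeIndex assumed)
    for feature, max_hours in MAX_CARRY_HOURS.items():
        if feature in hourly:
            # On an hourly grid, limit=max_hours caps carry-forward at that many hours
            hourly[feature] = hourly[feature].ffill(limit=max_hours)
    return hourly
```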

The action space defines the set of all possible interventions and treatment adjustments that the RL model can select from, specifically centred around the settings of positive end-expiratory pressure (PEEP) and fraction of inspired oxygen (FiO2). Figure 2 shows the distribution of actions in the dataset. We binned PEEP and FiO2 based on clinical cut-offs, and specified one action for non-ventilation (NV) to allow the entire ICU admission to be used as a trajectory for training RL models. Our decision to focus the action space on PEEP and FiO2, diverging from the broader action spaces of previous studies [10, 11], is strategic. It makes the RL model simpler, reducing the need for extensive training data and computational resources. Concentrating on these key ventilatory parameters allows for direct control, facilitating more straightforward and broadly applicable decision-making. It is worth noting that factors like tidal volume, ventilation mode, and others are incorporated into the state space, ensuring that the model still considers their influence while keeping the action space concise. This approach supports the model's adaptability to diverse ICU conditions while maintaining its capacity to provide clear, actionable guidance essential for effective patient care.
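As an illustration, a minimal sketch of how such a discrete PEEP/FiO2 action grid could be encoded is shown below. The number of bins, the bin edges and the `encode_action` helper are assumptions made for this sketch; the actual clinical cut-offs follow Fig. 2.

```python
import numpy as np

# Hypothetical bin edges; the actual clinical cut-offs follow Fig. 2.
PEEP_BINS = [0, 6, 10, 14, np.inf]      # cmH2O -> 4 PEEP levels
FIO2_BINS = [0.21, 0.4, 0.6, 0.8, 1.0]  # fraction -> 4 FiO2 levels
NV_ACTION = 0                            # dedicated "non-ventilation" action

def encode_action(peep: float, fio2: float, ventilated: bool) -> int:
    """Map a (PEEP, FiO2) setting to one discrete action index (0..16)."""
    if not ventilated:
        return NV_ACTION
    peep_bin = int(np.digitize(peep, PEEP_BINS[1:-1]))   # 0..3
    fio2_bin = int(np.digitize(fio2, FIO2_BINS[1:-1]))   # 0..3
    return 1 + peep_bin * 4 + fio2_bin                   # 1..16
```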

Fig. 2

Action space density distribution of the historical actions of physicians in the dataset and illustration of the RL king-knight policy restriction. The red box shows under which actions the RL policy may recommend cessation of mechanical ventilation and the yellow box shows what actions a policy may next recommend if the current action (the small green box) is PEEP 6–10 cmH2O with FiO2 40–60%. NV stands for Non-invasively ventilated

In the context of our RL approach, a "reward" serves as a quantitative measure of the quality of patient care, with multiple rewards provided throughout a single patient trajectory. In this study, the reward function encompasses various short-term treatment goals, including oxygenation and ventilation, as well as long-term treatment goals such as mortality, length of stay, and discharge destination, collectively guiding the agent towards optimising patient treatment outcomes. For the intermediate reward, we included the P/F ratio, the ratio of arterial partial pressure of oxygen (PaO2) to inspired oxygen concentration (FiO2), as a measure of oxygenation [14, 15]. The P/F ratio is also used to classify the severity of ARDS [16] and is a confirmed risk factor for mortality in COVID-19 patients [17]. However, optimal P/F targets are not well defined [18, 19]. We also included dead space, using Enghoff's modification of Bohr's equation [20] to estimate dead-space ventilation [18] from the arterial partial pressure of carbon dioxide (PaCO2) and the end-tidal carbon dioxide (ETCO2) level: Vd/Vt = (PaCO2 − ETCO2) / PaCO2. Removal of CO2 is a primary goal of mechanical ventilation [21] and dead space is correlated with mortality in ARDS patients [22,23,24,25]. As exact targets are ill-defined, we primarily used changes between consecutive measurements as a delta target in the intermediate reward formulation. We defined a terminal reward component based on several factors: mortality, length of stay, and the patient's discharge destination after the hospital admission. The primary objective is to improve quality of life, which is conventionally measured using Quality-Adjusted Life Years [26]. However, due to insufficient post-discharge data, we employed the length of stay (LOS) in the ICU as a surrogate measure for quality of life, to approximate the impact of healthcare interventions on patients' well-being. The exact formula for the reward function is provided in Additional file 2: Appendix SB.
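A minimal sketch of the intermediate reward components described above is shown below, assuming delta-based rewards between consecutive states. The weights `w_pf` and `w_vd` and the dictionary keys are placeholders; the exact formula and scaling are given in Additional file 2: Appendix SB.

```python
def pf_ratio(pao2: float, fio2: float) -> float:
    """P/F ratio: arterial oxygen tension divided by inspired oxygen fraction."""
    return pao2 / fio2

def dead_space_fraction(paco2: float, etco2: float) -> float:
    """Enghoff/Bohr estimate: Vd/Vt = (PaCO2 - ETCO2) / PaCO2."""
    return (paco2 - etco2) / paco2

def intermediate_reward(prev_state: dict, state: dict, w_pf: float = 1.0, w_vd: float = 1.0) -> float:
    """Delta-based intermediate reward: improvements in oxygenation and
    reductions in dead space between consecutive states are rewarded.
    Weights and scaling are placeholders (see Appendix SB for the exact formula)."""
    d_pf = pf_ratio(state["pao2"], state["fio2"]) - pf_ratio(prev_state["pao2"], prev_state["fio2"])
    d_vd = dead_space_fraction(state["paco2"], state["etco2"]) - dead_space_fraction(prev_state["paco2"], prev_state["etco2"])
    return w_pf * d_pf - w_vd * d_vd  # higher P/F is better, lower Vd/Vt is better
```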

Policy formulation

In RL models, while the primary goal is to learn optimal policies, a trained model does not by itself prescribe how an action should be selected from its Q-values. The common practice is to employ a "greedy" policy, selecting the action with the highest expected reward. However, this approach might not always be suitable, particularly for unstable ARDS patients, where a significant deviation from the previous action could be detrimental. To mitigate this issue, we propose a policy restriction that confines the model to actions close to the current one, allowing at most one step up or down in PEEP and/or FiO2. As illustrated in Fig. 2, the agent moves over the action grid like a chess piece: deviations from the current action are restricted to king-like moves, and the RL policy's advice on stopping mechanical ventilation is likewise constrained to a limited set of actions (red box in Fig. 2). We term this approach the "king-knight" policy; it allows for structured flexibility, and when a patient is not on mechanical ventilation, any action is permissible.
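The sketch below illustrates one way such a restricted greedy policy could be implemented on the hypothetical 4×4 PEEP/FiO2 grid from the earlier action-space sketch. The grid size, the `NV_ACTION` index and the choice of the lowest PEEP/FiO2 bin as the region from which cessation may be recommended are assumptions for illustration, not the exact rules of our policy.

```python
import numpy as np

NV_ACTION = 0  # index of the "non-ventilation" action from the earlier sketch

def allowed_actions(current_action: int) -> list[int]:
    """King-move restriction: from a ventilated action, only the current
    PEEP/FiO2 bin and its direct neighbours (one step in any direction) are
    allowed. Cessation of ventilation (NV) is assumed here to be allowed
    only from the lowest PEEP/FiO2 bin (red box in Fig. 2)."""
    if current_action == NV_ACTION:
        return list(range(17))               # not ventilated: any action allowed
    peep_bin, fio2_bin = divmod(current_action - 1, 4)
    allowed = []
    for dp in (-1, 0, 1):
        for df in (-1, 0, 1):
            p, f = peep_bin + dp, fio2_bin + df
            if 0 <= p < 4 and 0 <= f < 4:
                allowed.append(1 + p * 4 + f)
    if peep_bin == 0 and fio2_bin == 0:      # hypothetical cessation region
        allowed.append(NV_ACTION)
    return allowed

def restricted_greedy(q_values: np.ndarray, current_action: int) -> int:
    """Pick the highest-Q action among the allowed neighbouring actions."""
    return max(allowed_actions(current_action), key=lambda a: q_values[a])
```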

Model architecture, training and off-policy evaluation

The model architecture used in this paper is a Dueling Double-Deep Q Network (DDQN) [27, 28], as used in previous work [29]. We used an extensive hyperparameter grid to find an optimal set of model and training settings, including a variable number of hidden layers (3 to 5) and nodes per layer (32, 64, 128). Two learning rate (LR) decay schedules (ReduceLROnPlateau and StepLR) were explored during training. Training was performed using Prioritised Experience Replay [30] and all models were implemented in PyTorch [31]. We used the MAGIC [32] OPE estimator to assess policy performance. We defined the physician behavioural policy using K-nearest neighbours, as in previous work [29].
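For illustration, a minimal sketch of a dueling Q-network in PyTorch is given below. The layer sizes correspond to values from the hyperparameter grid above, but the training loop (Double-DQN targets, LR decay, prioritised replay) is omitted and the exact architecture details are simplified.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: a shared trunk feeds separate value and
    advantage streams, combined as Q = V + A - mean(A)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64, layers: int = 3):
        super().__init__()
        dims = [state_dim] + [hidden] * layers
        trunk = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            trunk += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.trunk = nn.Sequential(*trunk)
        self.value = nn.Linear(hidden, 1)          # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)
```

In the Double-DQN variant, the online network selects the greedy next action while the target network evaluates it, which reduces the overestimation of Q-values typical of standard Q-learning.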

Experimental setup

Given the unknown optimal trade-off between intermediate and terminal rewards, our study involved training models using six different weightings of the two reward components. To assess the generalisability of policies, we introduced cross off-policy evaluation, in which policies trained under a specific set of reward weights were evaluated on the remaining five sets of reward weights. In our experiments, the weighting factor was varied across a set of predefined values: [0.25, 0.5, 1, 2, 4, 8]. This evaluation methodology requires that each individual reward component, namely the intermediate and the terminal reward, inherently reflects clinically desirable outcomes in isolation. The experimental setup and methodology are depicted in Fig. 3. The best-performing models were selected for further clinical policy inspection.
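Schematically, the cross-OPE procedure can be summarised as in the sketch below. The functions `train_policy` and `ope_estimate` are placeholders standing in for the DDQN training routine and the MAGIC estimator, respectively, and are not part of any specific library.

```python
WEIGHTS = [0.25, 0.5, 1, 2, 4, 8]   # intermediate-vs-terminal reward weightings

def cross_ope(dataset, train_policy, ope_estimate) -> dict:
    """Train one policy per reward weighting, then evaluate each policy
    under every *other* weighting (cross off-policy evaluation)."""
    policies = {w: train_policy(dataset, reward_weight=w) for w in WEIGHTS}
    scores = {}
    for w_train, policy in policies.items():
        scores[w_train] = {
            w_eval: ope_estimate(policy, dataset, reward_weight=w_eval)
            for w_eval in WEIGHTS if w_eval != w_train
        }
    return scores
```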

Fig. 3

Experiment design with a framework for off-policy evaluation and model and policy selection through cross-OPE evaluation and clinical policy inspection

Policy evaluation and clinical policy inspection

To assess RL policies, off-policy evaluation (OPE) is used. OPE allows the appraisal of a proposed AI policy by estimating its performance on real-world, historical ICU data, thus measuring its potential impact and effectiveness in past patient care scenarios without the need for actual policy execution.

To evaluate policies in a rigorous manner, we propose a novel metric termed "delta-Q". The Q-value, Q(s, a), is the output of our model for a given state-action pair, indicating the expected utility of an action in a specific state. Delta-Q is defined as the difference between the Q-value of the action selected by our model and the Q-value of the action the physician actually took in the same state. Essentially, it measures the discrepancy in expected action quality between the AI's decision and the physician's choice. A delta-Q of zero implies that the model's decision aligns with the physician's, while a positive delta-Q suggests that the model's action might lead to improved treatment outcomes compared to the historical decision. By examining delta-Q values across all state-action pairs, we can identify specific areas of treatment that warrant further clinical scrutiny. Notably, delta-Q can be employed at both the dataset level and for individual patient trajectories, enabling comprehensive analysis.
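In code, delta-Q for a single state reduces to a simple difference of Q-values, as in the sketch below. Here `q_values` would be the model's Q-values for one state, `policy_action` the action chosen by the (possibly restricted) policy, and `physician_action` the action actually taken; the function name is ours and purely illustrative.

```python
import numpy as np

def delta_q(q_values: np.ndarray, physician_action: int, policy_action: int) -> float:
    """delta-Q = Q(s, a_policy) - Q(s, a_physician) for a single state.
    Zero means the policy agrees with the physician; positive values flag
    states where the model expects a better outcome from a different action."""
    return float(q_values[policy_action] - q_values[physician_action])
```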

We also extend upon previous work on clinical policy inspection [29] and propose several model and policy visualisations for clinical validation and operationalisation. Due to the black-box nature of deep learning algorithms, we aim to provide clinical insight into policy behaviour for individual patients. For example, we can use delta-Q as a metric for alignment between physician and policy actions. Sudden changes in delta-Q, due to changes in state or physician action, over the course of an admission can be used as a clinical alert. This allows physicians to evaluate model behaviour on a case-by-case basis and to align it with the clinical context in which the trained model and policy will be used.
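As a sketch of such an alert, abrupt jumps in delta-Q along an admission could be flagged as follows; the threshold is an arbitrary placeholder and would require clinical calibration.

```python
def delta_q_alerts(delta_q_trajectory: list, jump_threshold: float = 0.5) -> list:
    """Flag time steps where delta-Q changes abruptly between consecutive
    hourly states; the threshold is a placeholder, not a validated value."""
    alerts = []
    for t in range(1, len(delta_q_trajectory)):
        if abs(delta_q_trajectory[t] - delta_q_trajectory[t - 1]) > jump_threshold:
            alerts.append(t)
    return alerts
```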
