Algorithms at the Bedside: Moving Past Development and Validation

We need help.

For seventy years, it has been known that humans can simultaneously handle only seven bits of information (1). Although a “bit” of information in 1956 and a “byte” of data in 2024 are different things, the number 7 has persisted through many cognitive psychology experiments. And if your mind is tied up managing a sudden noise (e.g., an alarm) and a question from a colleague, you can then handle only five more bits of information. Much work has been done to understand how we “chunk” data into patterns (2), and the ability to find patterns improves with experience yet remains subject to the same cognitive limit (3). However, more recent work shows that our ability to handle chunks is limited to 3–5 (4) (and again, fewer when we are distracted). Not surprisingly, critical care physicians handle high information loads by learning to “see patterns” (a.k.a. chunks), and as expertise increases, pattern recognition also increases (5). But even the most highly qualified experts have limited attention and memory, let alone limited knowledge given the ever-expanding scientific literature.

Relying only on an individual clinician’s cognition, even if they have the highest expertise possible, is quite troublesome given the amount of data each clinician is asked to handle in every ICU. Manor-Shulman et al (6) showed that about 1,350 data points are documented daily on patients in the PICU, and about 1,500 data points are documented daily on children on mechanical ventilation. Not included in these numbers are the undocumented data (e.g., waveform data, waveform interactions, collegial and family discussions), which probably take the data load up by two or three orders of magnitude. Further, a typical attending intensivist often cares for ten or more patients. Doing the multiplication, that is roughly 13,500 to 15,000 documented data points for each clinician every day. To pretend any of us can ingest and analyze even most of the data is just that—pretend.

The promise of artificial intelligence (AI), and particularly machine learning, in critical care is its ability to ingest the documented and undocumented data, including images and text, and to discover patterns in the data that clinicians currently do not, and frankly cannot, recognize because clinicians are humans. Hence, the promise of AI in critical care is great. Yet AI’s contribution to bedside critical care remains absent.

In this issue of Pediatric Critical Care Medicine, Chanci et al (7) use machine learning techniques to combine more than 50 parameters into a single number that “…predicts the need for intubation in children between 24 hours and up to 7 days after hospital admission.”

Three questions about this (and every other predictive algorithm) are: 1) does this new data point reduce a clinician’s cognitive burden by reducing 50 numbers into one; 2) where, when, to whom, and how should this new data point be presented; and 3) when the algorithm is wrong or biased, how will the mistakes (either discovered by the clinicians, or worse, blindly accepted by the clinicians) be incorporated into algorithm improvements?

Asked succinctly, does this new prediction algorithm help (question no. 1)?

Answered succinctly, no. The data supporting “no” are in Supplemental Tables 4–6 in (7). For the 17,841 PICU stays, the algorithm correctly fired 921 times and correctly did not fire 8,856 times. It incorrectly fired 7,820 times and incorrectly did not fire 244 times (the latter includes the 227 “late positives,” when the child was intubated before the algorithm fired).

Considering the positive alerts, the precision of the algorithm, defined as true positive alerts divided by all positive alerts, is only 921/(921 + 7,820) = 11% (also called the positive predictive value [PPV]). If you change the alarm threshold so that the sensitivity falls to about 50% (i.e., the alert fires only half the time when a child needs intubation), the PPV rises to about 50%. Neither scenario helps. Continuing to hammer the point, Supplemental Table 7 in (7) shows that about 88% of the alerts are false, and on day 6 of admission, 98% of the alerts are false. Thus, it is an understatement when the authors comment, “While minimization of false positive alarms is needed, these patients may warrant increased monitoring and vigilance.” Yet, to follow up on the last part of that sentence, the data presented show that 20% of the patients were intubated even before the alert fired; obviously, they were already being monitored.
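The arithmetic behind these percentages can be checked directly from the four counts above. The short Python sketch below is ours (a minimal illustration, not code from the article); the variable names are ours, and the sensitivity it prints is simply what those counts imply rather than a figure reported by the authors.

```python
# Recomputing the alert metrics from the counts reported in
# Supplemental Tables 4-6 of (7). Illustrative sketch only.
tp = 921     # correctly fired (true positives)
tn = 8_856   # correctly did not fire (true negatives)
fp = 7_820   # incorrectly fired (false positives)
fn = 244     # incorrectly did not fire (incl. 227 "late positives")

total_stays = tp + tn + fp + fn        # 17,841 PICU stays
ppv = tp / (tp + fp)                   # precision / positive predictive value
sensitivity = tp / (tp + fn)           # recall at the published alarm threshold

print(f"total stays: {total_stays}")       # 17841
print(f"PPV:         {ppv:.1%}")           # ~10.5%, i.e., the ~11% quoted above
print(f"sensitivity: {sensitivity:.1%}")   # ~79%, as implied by these counts
```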

Back to the questions, where, when, to whom, and how should this new data point be presented (question no. 2)? Asked differently, how should the algorithm output be incorporated into the workflow?

Because all the parameters used in this algorithm are derived from the electronic medical record (EMR), a simple solution would seem to be a pop-up alert in the EMR. There are at least three problems with this solution, which can be explained briefly by the concept of “work as imagined” vs. “work as done” from human factors engineering (8,9). First, there is often a charting delay between the time an event happens and the time it is documented; thus, an alert might fire at 2 pm based on data that occurred at noon. Second, the alert may only be seen when a clinician signs into the EMR, and in this case, the clinician may be signing in to place orders for intubation. Third, the vital sign inputs could be enriched by analysis of the continuous monitor data, but monolithic EMRs do not incorporate data at high frequency (and much information is consequently lost).

An answer to question no. 3 is even more complicated: when the algorithm is wrong, how will the mistakes be incorporated into algorithm improvements? It is hard to even know when an algorithm is wrong. The issue of “ground truth” is often overlooked. If you try to predict something and that something has a nebulous (i.e., noncomputable) definition, then any prediction must be calibrated against an approximation of ground truth. Further, using data to predict future clinician behavior (e.g., endotracheal intubation) when those data often also depend on clinician behavior (e.g., laboratory data sent only with a specific order) will also be nebulous given the wide variability of clinician behaviors (10). Incorporating mistakes into algorithm improvements will require continuous tracking of ground truth and routine algorithm recalibrations.
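One way to make “routine algorithm recalibrations” concrete, purely as an illustrative sketch and not a method described in (7), is to periodically refit a calibration mapping from the raw model scores to the outcomes adjudicated as ground truth during the most recent review period. The snippet below uses isotonic regression from scikit-learn; the scores and outcomes are made-up placeholders.

```python
# Illustrative sketch (not from the article): periodic recalibration of an
# alert algorithm's scores against newly adjudicated "ground truth" labels.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def recalibrate(raw_scores: np.ndarray, observed_outcomes: np.ndarray) -> IsotonicRegression:
    """Fit a monotone mapping from raw model scores to observed event rates."""
    calibrator = IsotonicRegression(out_of_bounds="clip")
    calibrator.fit(raw_scores, observed_outcomes)
    return calibrator

# Hypothetical numbers: scores from the last review period and the adjudicated
# outcomes (1 = child was intubated, 0 = not intubated).
scores = np.array([0.10, 0.35, 0.40, 0.55, 0.70, 0.90])
outcomes = np.array([0, 0, 1, 0, 1, 1])

calibrator = recalibrate(scores, outcomes)
print(calibrator.predict(np.array([0.5])))  # recalibrated probability for a new score
```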

In conclusion, we, as a field, need to move past “just” development and validation. For any article describing a new predictive analytic, the discussion section should have a paragraph beginning with, “And at the bedside, based on our in-depth human factors engineering investigations of clinical workflow and cognitive work, we believe this alert should be…” even if the sentence is completed with “… ignored until the PPV is improved and the team-based workflow is better organized.” The following paragraph should start with, “To keep the algorithm current, we should include adaptive, continuous, and active learning (among many others), seek a computable definition of ground truth … and discover the best clinical pathways” (11,12). This paragraph will be even harder to complete. Yet without both paragraphs, we will be doing remarkable math, but will neither be supporting clinicians nor improving patient care.

We need help. AI can help if, and only if, it is designed and implemented considering the realities of clinical work and by using a participatory human-centered design.

1. Miller GA: The magical number seven plus or minus two: Some limits on our capacity for processing information. Psychol Rev. 1956; 63:81–97
2. Mathy F, Feldman J: What’s magic about magic numbers? Chunking and data compression in short-term memory. Cognition. 2012; 122:346–362
3. Chase WG, Simon HA: Perception in chess. Cogn Psychol. 1973; 4:55–81
4. Cowan N: The magical mystery four: How is working memory capacity limited, and why? Curr Dir Psychol Sci. 2010; 19:51–57
5. Fackler JC, Watts C, Grome A, et al.: Critical care physician cognitive task analysis: An exploratory study. Crit Care. 2009; 13:R33
6. Manor-Shulman O, Beyene J, Frndova H, et al.: Quantifying the volume of documented clinical information in critical illness. J Crit Care. 2008; 23:245–250
7. Chanci D, Grunwell JR, Rafiei A, et al.: Development and validation of a model for endotracheal intubation and mechanical ventilation prediction in PICU patients. Pediatr Crit Care Med. 2024; 25:212–221
8. Deutsch ES: Bridging the gap between work-as-imagined and work-as-done. PA Patient Saf Advis. 2017; 14:80–83
9. Tresfon J, Brunsveld-Reinders AH, van Valkenburg D, et al.: Aligning work-as-imagined and work-as-done using FRAM on a hospital ward: A roadmap. BMJ Open Qual. 2022; 11:e001992
10. Beaulieu-Jones BK, Yuan W, Brat GA, et al.: Machine learning for patient risk stratification: Standing on, or looking over, the shoulders of clinicians? NPJ Digit Med. 2021; 4:62
11. Yu C, Liu J, Zhao H: Inverse reinforcement learning for intelligent mechanical ventilation and sedative dosing in intensive care units. BMC Med Inform Decis Mak. 2019; 19:57
12. Liu S, See KC, Ngiam KY, et al.: Reinforcement learning for clinical decision support in critical care: Comprehensive review. J Med Internet Res. 2020; 22:e18477
