A Pilot, Predictive Surveillance Model in Pharmacovigilance Using Machine Learning Approaches

Definitions

A “true signal” is defined as any AE determined to be causally associated with the drug product and included in the drug product label at the time of data analysis (i.e., an adverse drug reaction [ADR]). A “potential new signal” is defined as an AE that might be caused by the drug product and requires further assessment. This assessment is a manual process that includes the review of all cases (i.e., individual case safety reports [ICSRs]) that reported that specific AE with the use of that specific drug product. From this assessment, the signal is either confirmed (when a causal association between the AE and the drug is established) or refuted (when a causal association between the AE and the drug cannot be established).

Individual AEs are coded at the preferred term (PT) level in the Medical Dictionary for Regulatory Activities (MedDRA). [7]

Scheme of Machine Learning Pipeline

The workflow of our ML pipeline began with splitting the whole dataset for each drug into a training set and a test set. The training set was used for model training and hyperparameter tuning, while the test set was held out separately to evaluate model performance after the final model had been obtained from the training set. On the training set, threefold cross-validation was conducted to select the best model algorithm and hyperparameter setting; cross-validation was used to control over-fitting arising from the high complexity of ML algorithms and the large number of features provided during model training and selection. The final model was then trained on the whole training set with the selected best hyperparameter setting. Finally, model performance was evaluated on the held-out test set. Figure 2 illustrates the workflow of our ML pipeline. For each drug, this pipeline was applied separately to generate model performance results and any potential new signals for further manual assessment performed by humans.

Fig. 2

Workflow of the machine learning pipeline for potential new signal detection. ADR adverse drug reaction (i.e., true signal), PT preferred term
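A minimal Python sketch of this workflow is shown below, assuming a PT-level feature matrix and binary signal labels already split temporally into training and test portions (all variable names, as well as the choice of F1 as the cross-validation scoring criterion, are assumptions for illustration only; the actual hyperparameter grids are given in the Machine Learning Algorithms section).

```python
# Sketch of the pipeline: threefold cross-validation for hyperparameter
# selection on the training set, a refit of the best setting on the whole
# training set, and a single evaluation on the held-out test set.
# X_train, y_train, X_test, y_test, param_grid, and the F1 scoring choice
# are hypothetical, not taken from the original analysis code.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score
from xgboost import XGBClassifier

def run_pipeline(X_train, y_train, X_test, y_test, param_grid):
    search = GridSearchCV(
        XGBClassifier(eval_metric="logloss"),
        param_grid,
        cv=3,           # threefold cross-validation on the training set
        scoring="f1",   # assumed selection criterion
        refit=True,     # refit the best setting on the whole training set
    )
    search.fit(X_train, y_train)
    final_model = search.best_estimator_

    # One evaluation pass on the held-out test set
    y_pred = final_model.predict(X_test)
    return final_model, {
        "sensitivity": recall_score(y_test, y_pred),
        "ppv": precision_score(y_test, y_pred),
    }
```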

Data Source and Train-Test Splitting

For Drug X (a mature product), post-marketing (PM) data from 2017 to 2018 were extracted from AE reports at the PT level, including demographic features of the patients (e.g., age, race, country) and characteristics of the AEs (e.g., event seriousness, event outcome, time to event), and were used for both the training and test sets. For Drug Y, clinical trial data from Phase 3 trials were extracted for the training set, and PM data from 2021 to 2022 were extracted for the test set. During the collection of PTs for signal classification, we retained PTs with ≥ 5 occurrences, as well as PTs with < 5 occurrences that had at least one serious report. As the patient data reviewed did not contain identifiable information, no informed consent, ethics committee, or Institutional Review Board approval was sought or required.
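As an illustration only, the PT retention rule above could be implemented as in the following sketch, assuming a case-level data frame with hypothetical columns pt, case_id, and a Boolean serious flag:

```python
# Sketch of the PT retention rule: keep PTs with >= 5 occurrences, or
# PTs with < 5 occurrences but at least one serious report.
# The data frame and column names (pt, case_id, serious) are hypothetical.
import pandas as pd

def retain_pts(cases: pd.DataFrame) -> pd.Index:
    counts = cases.groupby("pt")["case_id"].nunique()       # occurrences per PT
    any_serious = cases.groupby("pt")["serious"].any()      # any serious report per PT
    keep = counts[(counts >= 5) | ((counts < 5) & any_serious)].index
    return keep
```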

Given that raw safety data are retrieved at the case level, they were transformed to the granularity of PTs by aggregating the data of all cases associated with each PT. The following features, selected after review by internal safety experts, were used for ML modeling: number of occurrences of a PT, number of occurrences of all other PTs, gender, age group, country, event seriousness, event outcome, change in dose due to the AE, reporter event causality to drug, AE time-to-onset, and MedDRA System Organ Class. For each feature, aggregation was done by summing the number of cases or subjects (depending on whether the feature was at the case or subject level) belonging to each category of the feature for each PT. For missing data in a case, a new “unknown” category was created and assigned to the case [8].
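One possible implementation of this case-to-PT aggregation is sketched below; the data frame layout and column names are hypothetical, and missing values are mapped to an “Unknown” category before counting, as described above.

```python
# Sketch of aggregating case-level features to one row per PT: for each
# categorical feature, count the cases in each category, producing columns
# such as country_US, country_Unknown, seriousness_Serious, and so on.
# Column names are hypothetical; missing values become an "Unknown" category.
import pandas as pd

def aggregate_to_pt(cases, feature_cols):
    cases = cases.copy()
    cases[feature_cols] = cases[feature_cols].fillna("Unknown")

    pieces = []
    for col in feature_cols:
        counts = pd.crosstab(cases["pt"], cases[col])       # cases per category per PT
        counts.columns = [f"{col}_{c}" for c in counts.columns]
        pieces.append(counts)

    features = pd.concat(pieces, axis=1).fillna(0)
    features["n_occurrences"] = cases.groupby("pt").size()  # total occurrences of the PT
    return features
```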

The train-test splitting of the whole dataset for each drug was conducted sequentially, such that the data used for model training were collected prior to the data used for model testing. This sequential split mimics the process in which the model is trained on labeled true signals (i.e., labeled ADRs) and then applied to future AE data for safety signal monitoring. Moreover, this approach prevented information leakage from the use of future true new signals. Table 1 shows the breakdown of true signal PTs and non-signal PTs in the training and test sets. Because PTs were retained in both the training and test sets according to the aforementioned occurrence and seriousness criteria, only a subset of the true signal PTs in the test set was available to the ML models during training, which made the assessment of model performance conservative.
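A simple sketch of this sequential split is shown below, assuming each record carries a report date (the column name and cutoff value are hypothetical); for Drug Y, the training portion instead came from Phase 3 clinical trial data as described above.

```python
# Sketch of the sequential (temporal) train-test split: cases received before
# a cutoff date go to training, later cases go to testing, so no information
# from future reports leaks into model training. Names are hypothetical.
import pandas as pd

def temporal_split(cases: pd.DataFrame, cutoff: str):
    cases = cases.assign(report_date=pd.to_datetime(cases["report_date"]))
    train = cases[cases["report_date"] < pd.Timestamp(cutoff)]
    test = cases[cases["report_date"] >= pd.Timestamp(cutoff)]
    return train, test
```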

Table 1 Number of signal vs. non-signal preferred terms in datasets for Drugs X and Y

When the test set was applied, each PT was predicted to fall into one of two categories: signal PT or non-signal PT. A predicted signal PT was either a true signal (i.e., a labeled ADR) or a potential new safety signal requiring further assessment by the safety team to confirm or refute it.

Machine Learning Algorithms

Gradient boosting-based ML approaches [9,10,11,12] were chosen as the main modeling methodology because of their superior performance in a variety of data science prediction tasks. Boosting, based on the idea of combining a committee of iteratively trained weak classifiers (e.g., decision tree stumps) to produce a powerful final model, has been one of the most powerful learning ideas in ML over the last two decades. In contrast to the recently popularized deep learning approaches (e.g., deep neural networks), which often require training on very large datasets to achieve good performance, gradient boosting-based approaches can still perform very well on smaller datasets. We chose the implementation of gradient boosting algorithms in XGBoost [10], an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

The tuning of the XGBoost machine learning algorithm was conducted through cross-validation, using a grid search over the following hyperparameters for Drug X: n_estimators with the grid [10, 20, 30, 50, 60, 75], max_depth with the grid [4, 6, 8, 10, 15, 20], and learning_rate with the grid [0.05, 0.10, 0.25, 0.50]. For Drug X, the final selected hyperparameter values were as follows: n_estimators = 50, max_depth = 4, and learning_rate = 0.10. The hyperparameters and their search grids for Drug Y were the same, and the final selected values were as follows: n_estimators = 30, max_depth = 4, and learning_rate = 0.10.
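For reference, these search grids and selected settings can be written directly as configuration dictionaries that plug into the cross-validation sketch shown earlier (variable names are illustrative only).

```python
# Hyperparameter search grids reported above, in XGBoost/scikit-learn form.
param_grid = {
    "n_estimators": [10, 20, 30, 50, 60, 75],
    "max_depth": [4, 6, 8, 10, 15, 20],
    "learning_rate": [0.05, 0.10, 0.25, 0.50],
}

# Final settings selected by threefold cross-validation on the training set.
best_params_drug_x = {"n_estimators": 50, "max_depth": 4, "learning_rate": 0.10}
best_params_drug_y = {"n_estimators": 30, "max_depth": 4, "learning_rate": 0.10}
```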

Model Performance Evaluation

Model performance was first evaluated based on objective quantitative metrics. Prediction accuracy is a good metric for calibrating classification performance for a non-rare outcome. However, prediction accuracy can look overly optimistic for a poorly performing model when events are rare. For example, accuracy would be extremely high for a useless model that predicts every case to be a non-signal. We therefore used a pair of classification performance metrics, sensitivity (also known as true positive rate or recall) and positive predictive value (also known as precision), to measure model performance. These two metrics are defined in Table 2 based on the four types of classification results often presented in a so-called confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Sensitivity and positive predictive value (PPV) are defined as TP/(TP + FN) and TP/(TP + FP), respectively. Specificity and negative predictive value (NPV) are defined analogously as TN/(TN + FP) and TN/(TN + FN), respectively. For instance, for a rare signal detection problem with a 5% occurrence rate, a simple model based on a biased coin flip (with 5% vs. 95% probabilities of predicting signal vs. non-signal) gives only 5% for both sensitivity and PPV, because such a model uses only the population percentage of true signals. In addition to the assessment based on quantitative classification performance metrics, a manual assessment of potential new signals was conducted by humans to evaluate the generalizability of the ML models.
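These definitions translate directly into code; the sketch below computes all four metrics from a binary confusion matrix (function and variable names are illustrative only).

```python
# Sketch of the metrics defined above, computed from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

def classification_metrics(y_true, y_pred):
    # With labels=[0, 1], ravel() returns counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate / recall
        "ppv":         tp / (tp + fp),   # positive predictive value / precision
        "specificity": tn / (tn + fp),
        "npv":         tn / (tn + fn),
    }
```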

Table 2 Binary classification confusion matrix with four types of results

Strategy to Handle Imbalanced Labels

It is a well-known phenomenon that ML binary classifiers tend to be overwhelmed by the event making up the major portion of the training data and hence overpredict the majority event [13]. An ML binary classifier can predict an event of interest (e.g., being a true signal) for any new instance given its covariate information. Behind the scenes, many ML binary classifiers are trained to predict a quantitative outcome on a scale between 0 and 1, representing the probability of the event of interest. To convert the predicted probability to a predicted binary outcome, it is common practice to apply a probability decision threshold beyond which the case is predicted to be the event of interest. By default, this probability decision threshold is often chosen to be 0.5 for binary classification problems [14]; however, with imbalanced events, this leads to a very low sensitivity for detecting the minority event, which is often the target of interest in a scientific investigation. In addition, with the default threshold, very few new signals tend to be generated, which undermines the purpose of a predictive surveillance ML system. Adjusting the probability decision threshold is a viable and widely used approach with good performance [15], and it does not require model retraining with either sample weighting or resampling of the training set. While the probability decision threshold can be selected by maximizing a mathematical criterion such as the F1 score [16], in our application setting we recommend choosing it based on practical operational limits, i.e., the limited resources devoted to manual review. In practice, because the goal is often to minimize the chance of missing a true safety signal, we recommend choosing the lowest probability threshold that generates as many potential signals as the resource limit for manual review permits. In our pilot study, the probability decision threshold was chosen to generate about ten potential new signals for manual review, for demonstration purposes, while achieving about 50% sensitivity and about 35% PPV.
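As an illustration of this budget-driven threshold choice, the sketch below picks the threshold so that roughly ten unlabeled PTs with the highest predicted signal probabilities are flagged for manual review; the array names and the exclusion of already-labeled ADRs from the budget are assumptions rather than the exact procedure used in the study.

```python
# Sketch of choosing the probability decision threshold so that roughly
# `review_budget` PTs (excluding already-labeled ADRs) are flagged as
# potential new signals for manual review. Names are hypothetical.
import numpy as np

def threshold_for_budget(proba, already_labeled, review_budget=10):
    # Rank unlabeled PTs by predicted signal probability, highest first.
    candidate_proba = np.sort(proba[~already_labeled])[::-1]
    if len(candidate_proba) <= review_budget:
        return 0.0  # flag everything if there are fewer candidates than the budget
    # Threshold equal to the probability of the review_budget-th ranked PT.
    return candidate_proba[review_budget - 1]

# Example usage (hypothetical):
# predictions = final_model.predict_proba(X_test)[:, 1]
# threshold = threshold_for_budget(predictions, is_labeled_adr, review_budget=10)
# potential_new_signals = (predictions >= threshold) & ~is_labeled_adr
```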
