The objective of this study was to analyse the variability of the MCIC for the COMI score using diverse calculation methods. Building upon previous studies [5, 10], these methods were categorized into two types: distribution-based and anchor-based. Within the scope of this study, an additional classification was introduced for the anchor-based methods, splitting them into non-predictive and predictive. Our results showed that the different calculation methods provided very different MCIC values for improvement, ranging from 0.3 (for the predictive model corrected for prevalence and reliability) to 4.9 (for the within-patients score change). The observed variability was at times logical and arose from the distinct methodological approaches used to calculate the MCIC. For example, within distribution-based methods, variability arises from the different statistical quantities used, such as the standard deviation of the baseline score versus that of the change score. For anchor-based methods, differences primarily stem from the definitions of the groups derived from the anchor. A clear limitation of distribution-based methods is that they do not use clinical questionnaires to assess patients’ perceptions of change and are based solely on the distribution of scores in the analysed sample, making them sample-specific [8, 35]. In fact, we agree with previous studies that have concluded that distribution-based methods are more likely to measure the minimal detectable change (MDC) rather than the minimal clinically important change (MCIC) [9, 36, 37].
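The sample-specificity of distribution-based methods can be made concrete with a minimal sketch. The scores below are simulated, not the study data, and the half-SD criterion is only one common convention; the point is that the resulting thresholds depend entirely on which spread statistic (baseline versus change score) is taken from which sample:

```python
import numpy as np

# Illustrative only: simulated COMI scores, not the study cohort.
rng = np.random.default_rng(0)
baseline = rng.uniform(3, 10, size=500)                    # COMI at baseline (0-10 scale)
follow_up = np.clip(baseline - rng.normal(2.2, 2.0, 500), 0, 10)
change = baseline - follow_up                              # positive = improvement

# Two common distribution-based candidates (half-SD convention):
mcic_half_sd_baseline = 0.5 * baseline.std(ddof=1)         # from the baseline spread
mcic_half_sd_change = 0.5 * change.std(ddof=1)             # from the change-score spread
print(round(mcic_half_sd_baseline, 2), round(mcic_half_sd_change, 2))
```

Both candidates are functions of this particular sample's dispersion alone, with no reference to any patient-rated anchor.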
Some of the values derived from distribution-based and non-predictive anchor-based methods seem to lack intuitive significance as a “minimal clinically important change”: a change in the COMI of less than one point would barely be perceived by a patient and is also close to the measurement error (Table 2). On the other hand, a change of 4.9 is almost half of the COMI maximum score, suggesting that it would represent a substantial change rather than just a minimal change perceived as important. Concerning non-predictive anchor techniques, while they involve patients’ ratings of change, the choice of ‘item groups’ used for the computation is a crucial element that greatly affects the MCIC values. Moreover, the concept of a minimal clinically important change implies a threshold; relying on group averages may result in unusually high MCIC values, potentially underestimating the number of patients who achieve the MCIC.
Predictive anchor-based methods are dependent on the initial dichotomization of the anchor question to define ‘improved’/‘not improved’ and ‘deteriorated’/‘not deteriorated’ groups. For improvement, the MCIC results were consistent and stable across the different methods. This suggests their enhanced reliability in a clinical environment, as they use a threshold concept rather than a statistic of a specific item. Moreover, their values were very close to those reported in other similar studies [14, 27, 38]. The clinical validity was confirmed by high sensitivity and specificity values for all methods except the logistic model corrected for both prevalence and reliability, which had a specificity of 0.56 (Table 3). Indeed, the stability among the different methods was heavily affected by the corrections implemented in the logistic model [39]. Owing to the high prevalence (79%) of the positive class (the ‘improved’ group) following the dichotomization of the anchor question, the corrections had a notable effect. The challenge with this approach lies in the fact that these corrections, derived from empirical formulae and recommended for use when the prevalence of the positive class deviates from 50% (but is still between 30 and 70%), may not function optimally when the prevalence is very high [40]. This needs further investigation, specifically in the field of spine surgery, where the proportion of good outcomes is typically high [41, 42].
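The uncorrected predictive approach can be sketched as follows. The data are simulated (the sample size, anchor labels, and true 50% point are assumptions, not study values): a logistic regression of the dichotomized anchor on the change score is fitted, and the MCIC is taken as the change at which the predicted probability of ‘improved’ equals 0.5:

```python
import numpy as np

# Illustrative only: simulated anchor data, not the study cohort.
rng = np.random.default_rng(1)
n = 2000
change = rng.normal(2.5, 2.5, size=n)                 # simulated COMI change scores
p_true = 1 / (1 + np.exp(-1.5 * (change - 2.0)))      # assumed true 50% point at 2.0
improved = (rng.random(n) < p_true).astype(float)     # dichotomized anchor response

# Fit logistic regression of 'improved' on the change score (Newton-Raphson)
X = np.column_stack([np.ones(n), change])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (improved - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

mcic = -beta[0] / beta[1]   # change score at which P(improved) = 0.5
print(round(mcic, 2))
```

The prevalence and reliability corrections discussed above would then shift this raw threshold, which is precisely where the instability for extreme prevalences arises.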
Concerning deterioration, the MCIC values varied widely across the methods, sometimes yielding very small or even positive values (a phenomenon that has been reported before [13, 14, 43]), which in theory does not make sense. A further problem was that the proportion of deteriorated patients was very low (1%), meaning not only that the number of data points available to characterize deterioration was limited, but also that the adjustments to the logistic regression would still leave notable residual bias in the calculated MCIC [40].
The distribution of changes in the COMI score showed that ‘deteriorated’ patients peaked around zero, while ‘not deteriorated’ patients were spread between 0 and 10 (Fig. 2C and 2D). This pushed the ROC methods to choose a positive threshold to minimize errors, namely false positives and false negatives (Fig. 2C). Conversely, the MCICs for the predictive models were all negative (range −3.1 to −0.8), resulting in many false negatives (i.e. deteriorated patients being wrongly classified as not deteriorated). This was due to the logistic regression favouring the ‘not deteriorated’ class (Fig. 2D). In this study, the values obtained from the ROC analysis, ranging from 0.9 for Farrar to 1.1 for Youden, are higher than the MCIC of 0.3 used to determine deterioration in the study by Mannion et al. [14]. This is probably due to the less negative average delta COMI for the deteriorated group (formed by patients who answered ‘made things worse’) in the present study (−0.3, SD 1.2, vs −0.7, SD 2.2, in [14]).
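The ROC/Youden approach referred to above can be sketched generically (simulated, imbalanced groups whose sizes and means are illustrative assumptions only): candidate cut-offs of the change score are scanned and the one maximizing sensitivity + specificity − 1 is retained, which is why overlapping, imbalanced distributions pull the chosen threshold around:

```python
import numpy as np

# Illustrative only: simulated, imbalanced groups (sizes and means assumed).
rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(3.0, 2.0, 800),   # anchor: 'improved'
                         rng.normal(0.5, 1.5, 200)])  # anchor: 'not improved'
labels = np.concatenate([np.ones(800), np.zeros(200)])

# Youden index: choose the cut-off maximizing sensitivity + specificity - 1
best_j, best_t = -1.0, None
for t in np.unique(scores):
    pred = scores >= t
    sens = pred[labels == 1].mean()                   # true positive rate
    spec = (~pred)[labels == 0].mean()                # true negative rate
    if sens + spec - 1 > best_j:
        best_j, best_t = sens + spec - 1, float(t)
print(round(best_t, 2), round(best_j, 2))
```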
Anchor-based methods, despite their effectiveness in capturing patient perceptions, have been criticized. One criticism concerns recall bias, which can affect long-term responsiveness and bias the responses to the anchor question: a patient may not accurately recall their past condition when answering it, leading to uncertainty about the actual benefits of a medical procedure [44, 45].
A further criticism of predictive anchor-based methods concerns the dichotomization of the multi-point “global outcome scale”. Indeed, the process of dichotomizing patients into ‘improved’/‘not improved’ and ‘deteriorated’/‘not deteriorated’ for the predictive models is arbitrary and can be adjusted based on the level of improvement of interest. For instance, if the focus is on a notable clinical improvement, the ‘improved’ group could consist solely of patients who reported that the surgery ‘helped a lot’ [5]. Conversely, if even a minor improvement is of interest, patients who reported being ‘helped only a little’ could also be included in the ‘improved’ group. It may be even more difficult to dichotomize the anchor question for the deterioration analysis. In the present study, the patients who answered ‘did not help’ had a mean change in the COMI that was very close to that of the patients who answered ‘made things worse’, suggesting that they could belong to the same (deteriorated) group. Furthermore, deterioration may be hard to assess, as the datasets are usually heavily imbalanced towards the ‘not deteriorated’ group [43].
The present study has some limitations. First, while we used a large sample of over 9,000 patients to provide robust and generalizable results, we acknowledge that this cohort comprised a wide variety of patient characteristics and diagnoses for which we did not stratify in our analysis. Future research should investigate how MCIC values might vary across different subgroups, such as specific spinal conditions, age groups, or genders, to provide more tailored guidance for clinical interpretation and decision-making [27, 46]. Second, we were not able to calculate the smallest detectable change (SDC) of the COMI questionnaire, which could have been used as a threshold to define meaningful MCIC values. However, the SDC has been extensively explored in other studies, with values ranging from 1.34 to 1.98 [20, 47, 48], which are comparable to some of our distribution-based and non-predictive anchor-based values. Finally, for the anchor-based predictive models in this study, the corrections were applied to the logistic regression models despite the prevalences in this study (0.01 and 0.8) lying outside the suggested range (0.3–0.7) for which these corrections were originally defined. This discrepancy could potentially affect the credibility of the MCIC values derived from these two approaches [33, 40].
Given that no method has been definitively proven to be superior to the others, the selection of the MCIC calculation method in clinical research must be approached with care [36, 49]. Nonetheless, it is difficult to consider some of the methods employed in this study as defining minimal clinically important change scores: they are primarily based on statistical variations in the score change and might be more appropriately termed detectable changes. The various MCIC thresholds that can be computed using different methods are rarely compared or analysed on the same dataset to measure the impact of the chosen calculation method. Caution is therefore needed when reporting and using MCIC values in clinical settings. The MCIC could be influenced by different factors, for example the COMI value at baseline [50]. Future work should evaluate the role of the baseline value, as a reduction of 2 points may be interpreted differently by the patient depending on the starting point. One potential solution could be to consider relative changes (in the COMI) rather than absolute changes, although this has its own attendant problems [28]. All of these challenges highlight the complexity of MCIC determination. A recent study performed a similar evaluation of another instrument, the Zurich Claudication Questionnaire (ZCQ), obtaining findings in line with those reported here [51]. There is probably no “hard and fast” or definitive MCIC value for a given questionnaire that functions under all circumstances and for all patients. While we preferentially recommend MCICs derived from predictive anchor-based methods, ongoing research should address the remaining uncertainties and refine methodologies, especially with respect to corrections for the prevalence of the given outcome and the reliability of the anchor question.
The predictive anchor-based methods are grounded in both patients’ perceptions and a data-driven approach and they appear to yield the most consistent values for assessing improvement, providing a balanced and accurate measure of clinical relevance. In the present study, the MCIC values were more stable across different predictive anchor-based methods, gave high values of sensitivity and specificity, and were similar to the 2.2-point MCIC for the COMI currently in use, suggesting that a ballpark figure of this scale remains valid for different scenarios.