Understanding effect size: an international online survey among psychiatrists, psychologists, physicians from other medical specialities, dentists and other health professionals

WHAT IS ALREADY KNOWN ON THIS TOPIC

Most effect size indices are poorly understood by clinicians.

The magnitude of effect size is most likely to be interpreted correctly when presented with dichotomous outcome measures.

How effect sizes are interpreted when they are presented using the control event rate (CER) and the experimental event rate (EER), rather than the risk difference alone (ie, EER−CER), has never been investigated.

WHAT THIS STUDY ADDS

Presenting results using the CER and the EER led to the most accurate interpretation of the effect size.

Effect sizes presented as risk ratios are often misinterpreted, even though medical professionals report high confidence in, and perceived usefulness of, the risk ratio.

Relative outcome measures must be supplemented with absolute measures to avoid misinterpretation.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

The findings of our study provide authors of scientific papers with a recommendation on how to present results in future papers in the most comprehensible way.

Further initiatives are needed to improve the education of health professionals in health research methodology.

Background

Clinicians, patients and policymakers, when choosing among treatments, reach a decision based on numbers: numbers expressing the magnitude of their efficacy—effect size indices—derived typically from systematic reviews of randomised controlled trials (RCTs) or explicitly from large randomised trials.

The same effect can be expressed in various ways.1 When the outcome of interest is a continuous variable, as is typically the case in mental health, the most common way of presentation is the standardised mean difference (SMD),2 3 the difference in means between the experimental and control arms divided by their SD.4 When the outcome is measured on the same scale across studies, a perhaps more intuitive option is the simple mean difference (MD) between the two groups.3 The MD is also usually the primary index in individual RCTs. There are also other proposed ways of presenting continuous outcomes, including the ratio of means (RoM)5 and the difference in means divided by each included instrument’s minimal important difference (MID units).6 For dichotomous variables, the commonly used effect sizes include the risk ratio (RR) and OR as relative measures, as well as the risk difference (RD) and number needed to treat (NNT) as absolute measures.4 7 For dichotomous outcomes, the Cochrane Collaboration’s Summary of Findings tables recommend showing both the control event rate (CER) and the experimental event rate (EER) to facilitate evidence users’ understanding of the results of systematic reviews.8
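As a concrete illustration of the continuous-outcome indices described above, the sketch below computes the MD, SMD and RoM from per-arm summary statistics. The means, SDs and sample sizes are invented for illustration and are not taken from the survey:

```python
import math

# Hypothetical summary statistics per arm (mean, SD, n); not survey data.
mean_t, sd_t, n_t = 4.8, 2.0, 100  # experimental arm
mean_c, sd_c, n_c = 6.0, 2.0, 100  # control arm

# Mean difference (MD): difference in the raw units of the scale.
md = mean_t - mean_c

# Standardised mean difference (SMD): MD divided by the pooled SD.
sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                      / (n_t + n_c - 2))
smd = md / sd_pooled

# Ratio of means (RoM): defined only when both means have the same sign.
rom = mean_t / mean_c

print(round(md, 1), round(smd, 2), round(rom, 2))  # → -1.2 -0.6 0.8
```

With an identical SD of 2 in both arms, a 1.2-point reduction corresponds to an SMD of 0.6, illustrating how the same effect reads very differently across the three indices.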

Previous research suggests that the same magnitude of effect, when expressed by different indices, may be interpreted differently by users of evidence. Akl et al 9 reviewed and compared the absolute risk reduction (which is the same as RD) and the relative risk reduction (which is converted from RR as (1.0–RR)×100, for example, 60% reduction instead of RR=0.4) and concluded that the two did not differ in terms of correct interpretations of the effect, but that the relative risk reduction greatly increased the willingness to adopt the intervention.9 The RD and the NNT did not differ in terms of interpretability or persuasiveness.
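The conversions among the relative and absolute measures compared by Akl et al can be shown numerically. The event rates below are hypothetical, chosen so that the relative risk reduction reproduces the 60% example in the text:

```python
import math

cer = 0.50                # control event rate (hypothetical)
eer = 0.20                # experimental event rate (hypothetical)

rr = eer / cer            # risk ratio: 0.4
rrr = (1.0 - rr) * 100    # relative risk reduction: 60%
arr = cer - eer           # absolute risk reduction (|RD|): 0.30
nnt = math.ceil(1 / arr)  # number needed to treat, rounded up: 4
```

The same treatment effect thus reads as a "60% relative reduction", an "absolute reduction of 30 percentage points" or "four patients treated per additional event prevented", which is precisely why the choice of index can sway interpretation.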

Johnston et al 10 investigated clinicians’ understanding and perception of the usefulness of six statistical formats for presenting outcomes from meta-analyses based on a hypothetical scenario about the treatment of chronic pain. Their results showed that all tested effect size indices were rather poorly understood (rate of correct responses lower than 50% for every tested index), especially those representing continuous outcomes (eg, SMD). However, Johnston et al did not examine some widely used or recommended indices, such as the NNT separately or the CER together with the EER. Moreover, the small versus large effect size values, presented in their questionnaire, were not calculated/determined consistently across indices, thereby making it difficult to interpret the proportion of correct responses for different indices. Finally, they surveyed clinicians in internal medicine and family medicine only and did not include doctors of other specialities or people from other healthcare professions.

Objective

In this study, we examined various psychiatric, medical and allied professionals’ understanding of eight different effect size indices for dichotomous and continuous outcome data: SMD, MD, MID unit, RoM, RR, RD, CER/EER and NNT. We aimed to find out which of these would be best suited to present the efficacy of medical treatments in the most comprehensible way. We also investigated respondents’ confidence while dealing with these measures as well as their perceived usefulness. Finally, we evaluated the influence of various demographic characteristics on the understanding of the effect size indices. We chose chronic pain as an example to make comparison with Johnston et al’s results easier and because it is a symptom familiar to professionals and laypersons alike.

Methods

We hereby report our study in accordance with the consensus-based checklist for reporting of survey studies.11

Participants

Since health research methodology and its understanding are integral to any health expert’s activities, we decided to recruit a broad range of medical professionals. Medical doctors of all specialities and training levels as well as dentists, medical and dental students and people from other healthcare professions (eg, psychologists, nurses, pharmacists) were eligible to participate. Participants had to be sufficiently proficient in English. We distributed the link to the online questionnaire (see online supplemental file 1) by email, together with an invitation and further explanations and descriptions of the project. We used mailing lists of hospitals, doctors’ networks and personal contacts.

Questionnaire

We created a digital questionnaire (see online supplemental file 1) to reach as many potential respondents as possible from as broad a range of backgrounds as possible. To allow comparison with previous studies and because English is the lingua franca of science, the questionnaire was mainly in English, except for the introductory explanations, which were presented in the local language when necessary to increase the accessibility of the questionnaire. To design and conduct the questionnaire, we used the online survey tool SoSci-Survey (V.3.3.13). The survey was completely anonymous. Tracing the participants’ IP addresses was impossible, and data protection was always guaranteed by secure internet communication and the secure website. The participants were informed about the processing of their data in the invitation. The questionnaire comprised two parts. In the first part, we asked about demographic and background information. The second part assessed participants’ understanding and perceived usefulness of eight effect size indices: SMD, MID unit, MD and RoM for continuous outcomes; and RR, RD, CER and EER, and NNT for dichotomous effect measures.

Before introducing the actual questions, we presented a clinical scenario. The scenario described a hypothetical meta-analysis of randomised trials of interventions for patients with chronic non-cancer pain. Pain often shares a complex interplay with psychiatric diseases, since the persistent nature of chronic pain can contribute to the development or exacerbation of psychiatric conditions such as depression and anxiety.12

Pain was measured on a visual analogue scale (VAS) between 0 (no pain) and 10 (worst pain ever). Before treatment, the average score on the VAS was approximately 6 points, as reported in a large-scale study of similar patients.13 All subsequent questions were based on this scenario. For each of the eight effect size indices, we determined a small, medium and large treatment effect (table 1) in accordance with Cohen’s rule of thumb, which defines a small effect as SMD=0.2, a medium effect as SMD=0.5 and a large effect as SMD=0.8.14 Our exact approach for calculating and defining the required effect sizes is explained in online supplemental file 2.

Table 1

Small, medium and large treatment effects for all eight effect size indices

In the digital questionnaire, each participant assessed the magnitude of the effect for all eight effect size indices. To reduce the respondents’ burden and avoid response fatigue and errors, we chose only one of small, medium or large effects for each effect size index. Thus, for each index, the participants had to choose one of three possible answers (small effect, medium effect and large effect). The sequence of the eight indices as well as the presented effect size was randomised automatically to prevent order effects.

Additionally, the participants indicated how certain they felt about their own answers and how useful they found the given effect size index. For every effect size question, their confidence and perceived usefulness were assessed on a 7-point Likert scale, with response options ranging from ‘not at all’ (1 point) to ‘extremely confident’ (7 points) and ‘not useful in understanding the size of the effect’ (1 point) to ‘extremely useful in understanding the size of the effect’ (7 points), respectively. We carried out a pretest of our survey with the help of 20 individuals who were not involved in the project and who matched our survey target group.

Survey period

The online questionnaire was accessible for participation for a period of 2 months, from 15 February 2022 to 15 April 2022.

Outcomes

Our primary outcome was the proportion of respondents who correctly understood each effect size index. Correct understanding was defined as the right estimation of the effect size (small, medium or large) that was presented with the respective index. Secondary outcomes were the respondents’ confidence while dealing with these measures, their perceived usefulness as well as sociodemography and other factors that were associated with the understanding of the eight statistical formats.

Sample size

Assuming a proportion of correct answers of 50%,10 and aiming for a 95% CI width of 10% (ie, a margin of error of 5%), we calculated our required sample size to be at least 384.
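This figure follows from the standard normal-approximation formula for estimating a proportion, n = z²p(1−p)/e²; a quick check with the values given in the text:

```python
# Sample size to estimate a proportion with a given margin of error.
z = 1.96   # standard normal quantile for two-sided 95% confidence
p = 0.50   # assumed proportion of correct answers (the most conservative choice)
e = 0.05   # margin of error (half of the 10% CI width)

n = z**2 * p * (1 - p) / e**2
print(round(n, 1))  # → 384.2
```

Assuming p = 0.5 maximises p(1−p) and therefore gives the largest, ie, safest, required sample size.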

Statistical analysis

We included only fully completed and returned questionnaires in the analysis and used descriptive statistics to summarise the respondents’ characteristics. We then calculated the proportion of correct answers to the questions about the magnitude of presented effects, with corresponding 95% CIs. We applied multivariable logistic regression to examine the relative performance of the different indices; the index that produced the largest rate of correct answers was chosen as the reference. We also compared the rate of correct answers for small effect sizes with those for medium and large effects. We examined the influence of expertise in conducting systematic reviews and experience in health research methodology on the results, and contrasted the performance of respondents from mental health professions with that of the rest of the participants. We summarised the respondents’ confidence and perceived usefulness for each index as mean scores on the 7-point Likert scale, reported with 95% CIs. All statistical procedures were performed using Excel (V.2301) and R software (V.4.2.2).
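A 95% CI for a proportion of correct answers can be computed with the normal approximation, as sketched below; the paper does not specify which CI method was used, and the counts are illustrative (derived from the 56% of 762 respondents reported later):

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% CI for a proportion (illustrative sketch;
    the study does not state which CI method was actually used)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Example: roughly 56% correct answers among 762 respondents.
p, lo, hi = proportion_ci(427, 762)
print(round(p, 2), round(lo, 2), round(hi, 2))
```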

Findings

Participants’ characteristics

In total, 1316 people participated in our survey. Overall, 762 participants fully completed and returned the questionnaire, a response rate of 57.9%. Respondents came from 13 different countries, of which Germany was the most frequent (50.3%). Of the respondents, 58.3% stated they had no experience in health research methodology. A minority of 124 participants (16.3%) had conducted at least one systematic review with meta-analysis themselves (table 2).

Table 2

Characteristics of all participants who fully completed the survey

Correct understanding of the effect size indices

The proportion of correctly evaluated effect magnitudes varied between 43% and 56% across the effect size indices (figure 1A). The best results were ascertained for the CER and EER: 56% of the participants estimated a given effect size correctly when it was presented with the CER and EER. The RD turned out to be the second best. The SMD and NNT showed similar results. The RR ranked clearly lower, and the MID unit was the least understood.

Figure 1

Proportion of correct answers, perceived confidence and perceived usefulness. (A) Proportion of correct answers regarding the estimation of the size of treatment effects presented by the eight effect size measures. (B) Participants’ perceived confidence (on a scale between 1 and 7) while dealing with the effect size measures. A higher value stands for higher confidence. (C) Participants’ perceived usefulness (on a scale between 1 and 7) for the effect size measures. A higher value stands for higher perceived usefulness. Error bars=95% CI. CER & EER, control event rate and experimental event rate; MD, mean difference; MID unit, difference in minimal important difference units; NNT, number needed to treat; RD, risk difference; RoM, ratio of means; RR, risk ratio; SMD, standardised mean difference.

Logistic regression

In the multivariable logistic regression taking the CER and EER as the reference, there was strong evidence that all the indices except the RoM and the RD performed worse than the CER and EER by 5 percentage points or more (table 3). Medium and large effect sizes tended to be estimated incorrectly more often than small effect sizes. We also examined factors associated with correct understanding. The data suggested that education in health research methodology improved the assessment of given effect sizes to a small degree. There was no evidence that speciality (mental health vs others) or experience in conducting systematic reviews made any meaningful contribution.

Table 3

Results of the logistic regression analysis

Perceived confidence and usefulness

Respondents felt most confident about using CER and EER while they were least confident about using MID unit (figure 1B, online supplemental file 3). Likewise, they found CER and EER to be the most useful presentation approach while they rated MID unit as the least useful (figure 1C, online supplemental file 3). NNT, RD and RR were also highly appreciated. In both categories, RR ranked high even though it was rather poorly understood comparatively.

Understanding of the indices by the magnitude of the effect size

For all the effect size indices (except RR and NNT), the most correct answers were noted for small effect sizes (online supplemental file 4). When presented with large or medium effect sizes, the participants tended to underestimate the magnitude of the effect (ie, interpret them as representing smaller effects). Only in the case of the RR, larger effects were better interpreted, indicating that small or medium effects were misinterpreted as larger effects.

Discussion

The CER and EER as a method of presentation were understood best, followed by the RD. The lowest rate of correct answers was seen for the MID unit, followed by the MD. These results were generally in line with the respondents’ indications of their perceived confidence and usefulness. The NNT, SMD and RoM ranked in between. Experience in health research methodology had a positive impact on the rate of correct answers.

The performance of the RR, arguably the preferred summary index in many meta-analyses for dichotomous outcomes,15 was peculiar. Although participants clearly understood the CER and EER or the RD better than the RR, respondents indicated greater confidence in, and perceived usefulness of, the RR. As perhaps expected, the misinterpretation of the RR lay in the direction of over-interpreting small effects, a common mistake by consumers of evidence, especially when presented with the relative risk reduction. The RR was also less correctly understood than the RD in Johnston’s study.10 These findings clearly suggest that the RR should not be the sole summary index used to communicate the effect of an intervention. A recent survey found that the RR was the only reported data presentation method in the abstracts of many RCTs in leading journals,16 a practice that will likely mislead evidence users and that needs improvement. The NNT is sometimes advocated as the preferred way to make the RR more clinically interpretable.7 However, the methods for calculating its 95% CI can be confusing (eg, when the 95% CI of the RD is (−0.2 to 0.2), the correct NNT interval is (−∞ to −5, 5 to ∞) but is often misunderstood as (−5 to 5)); it has therefore been suggested that the EER, CER and RD are better options,17 as our findings bear out.
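The CI pitfall just described can be made explicit in code. The helper below is a hypothetical illustration, not a published implementation: when the RD interval crosses zero, the corresponding NNT region consists of two disjoint rays, not the naive interval between the reciprocals:

```python
import math

def nnt_region(rd_low, rd_high):
    """Convert a 95% CI for the risk difference into the NNT region.
    Hypothetical helper for illustration only."""
    if rd_low > 0 or rd_high < 0:
        # RD interval excludes zero: the NNT CI is an ordinary interval.
        return [tuple(sorted((1 / rd_low, 1 / rd_high)))]
    # RD interval includes zero: NNT spans (-inf, 1/rd_low] and [1/rd_high, inf).
    return [(-math.inf, 1 / rd_low), (1 / rd_high, math.inf)]

# The example from the text: an RD CI of (-0.2, 0.2).
print(nnt_region(-0.2, 0.2))  # two rays: up to -5 and from 5 upwards
```

Naively taking the reciprocals of the interval endpoints would give (−5 to 5), which wrongly excludes all plausible NNT values above 5.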

With regard to continuous outcomes, the MD is often defended as more easily interpretable than the SMD, particularly if the instrument, in our case pain intensity on a 10-point scale, is familiar to the audience.18 This was not the case in our survey. In Johnston’s study, the MD was also one of the two least correctly interpreted indices (along with the MID unit).10 The MD has been shown to be slightly less generalisable than the SMD.19 It is possible that, if the MD represents natural units such as weight or a laboratory value instead of a 0–10 pain score as in our survey, it may be interpreted more correctly than the SMD. Moreover, we must remember that, in the current survey, experience in health research methodology influenced interpretability. The MD will probably continue to be perceived as a readily understood index of effect size when results are presented to the lay public or to less methodologically trained health professionals. However, we must keep in mind that, behind this apparent ease of understanding, their interpretation may remain misleading, especially when the unit is unfamiliar, for example, scores on a particular psychopathology scale. Among the more experienced evidence users in our sample, interpretations based on the SMD were more correct than those based on the MD, and the two indices were perceived as equally helpful with equal confidence.

The index based on MID units was the least correctly interpreted and was perceived to be the least helpful. In Johnston’s study, it was one of the least correctly interpreted indices and the second least useful. First, this was probably driven by the unfamiliarity of current medical professionals with the concept of the MID, even though it has been around for three decades.20 Second, it remains possible that using the MID, which represents the smallest important pre–post change, in the context of a between-group comparison may have been conceptually misguided.21 22 The place and value of the MID unit approach to expressing effect size need further research and educational outreach. By contrast, another newly proposed method to summarise a continuous outcome, the RoM, was as well understood as the SMD. Unfortunately, the use of the RoM is limited by the fact that it can be calculated only when the scores in the intervention and control groups are both positive or both negative.4 When this condition is met, the RoM remains a viable option as an effect size index.

It must be noted that even the best-performing indices led to correct interpretations in only slightly more than half of the questions, and the perceived confidence and usefulness hovered around the middle value on a scale of 1 to 7. This performance must be interpreted in the context of our questionnaire design, in which participants only had to choose among three possible answers (small effect, medium effect, large effect) when estimating the presented effect sizes. This design would naturally have increased the rate of right guesses because, by chance alone, one in three answers would be correct. Furthermore, a small effect could not be underestimated, while a large effect could not be overestimated. However, the characterisation of various effects is bound to be subjective, and perfect or near-perfect correct answers may not be pragmatically achievable or to be expected. The fact that participants tended to underestimate large or medium effect sizes could also indicate that effects in pain interventions are mostly modest. Furthermore, it implies that what we defined as a large effect in our survey may, in reality, yield only a limited impact and might not be considered a large effect.

Limitations

The study faces limitations typical of survey studies, with voluntary participation potentially under-representing those uninterested in evidence-based medicine.23 Also, people who do not feel confident with the English language might not have taken part in our study. Another reason for non-participation could be that individuals found the questionnaire too time-consuming during their daily work routine. For others, the number of questions may have caused response fatigue, prompting them to stop completing the questionnaire. Our participants, while spread across 13 countries, were predominantly from high-income nations. The absence of respondents from middle-income and low-income countries, where training in health research methodology is likely to be less common, limits generalisability. We did not include the OR in our survey because it is already known to be difficult to understand and can easily be misinterpreted compared with the RR.24 To reduce respondents’ time burden and response fatigue, and thereby avoid non-response, we decided to focus on the RR as a relative index of efficacy. Nevertheless, it would have been interesting to examine the OR in our questionnaire. The study’s definitions of small, medium and large effects could be seen as arbitrary. We concede this caveat but argue that no alternative could have been any more plausible. Following Johnston et al, we examined both perceived confidence and usefulness, but these concepts probably overlap. In online supplemental file 3, we present a scatterplot, which indeed demonstrates that these measures are very likely related.

Conclusion and clinical implications

Our findings suggest that presenting results using the CER and EER leads to the most correct interpretation of the effect size. This presentation was also associated with the highest perceived confidence and usefulness among various healthcare and related workers. However, all the tested effect size indices were only moderately correctly interpreted, with only a 13 percentage point difference in correct answers between the best-performing index (CER and EER: 56%) and the worst (MID unit: 43%). While relative measures including the RR (and the OR) remain the most externally valid summary indices,15 they should be supplemented with absolute measures such as the CER and EER. The current study provides strong empirical support for the way the Summary of Findings tables are structured in the Cochrane Library.25 In the field of psychiatry, the presentation of both the CER and EER is particularly crucial, given the often high placebo response rates. For instance, in the acute treatment of schizophrenia with antipsychotics, a meta-analysis revealed that approximately 30% of patients exhibited at least minimal improvement after about 6 weeks under placebo, while 50% demonstrated such improvement under drug treatment.1 In the case of antidepressants for major depression, approximately 37% respond to placebo,26 compared with 52% responding to antidepressants.27 When the outcome is continuous, such as pain intensity, our results support the use of the SMD over the MD, which was less correctly interpreted and is less externally valid. The interpretability of the SMD must apparently be cultivated through education in health research methodology among professionals, while we must be aware of the faux interpretability of the MD for the lay public when conveying the results of continuous outcomes.
The SMD can also be converted into the CER and EER using the validated conversion method28 29: supplementing the summary SMD with the converted CER and EER may be as helpful as supplementing the RR with the same. Since, in our survey, even the best-performing indices led to correct interpretations in only slightly more than half of the questions, further initiatives are needed to improve the education of health professionals in health research methodology, including skills in interpreting effect size.
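One way such a conversion can work is to map the SMD onto an expected EER for a given CER via the standard normal distribution. This is a sketch assuming normally distributed outcomes; the validated method cited in the text may differ in detail:

```python
from statistics import NormalDist

def eer_from_smd(cer, smd):
    """Expected experimental event rate implied by an SMD and an assumed
    control event rate, under a normal-distribution model:
    EER = Phi(Phi^-1(CER) + SMD). Illustrative sketch only."""
    nd = NormalDist()
    return nd.cdf(nd.inv_cdf(cer) + smd)

# Example: with a 30% control response rate, a medium effect (SMD = 0.5)
# implies an experimental response rate of roughly 49%.
print(round(eer_from_smd(0.30, 0.5), 2))
```

An SMD of zero leaves the event rate unchanged, and larger SMDs shift the experimental rate further from the control rate, which is what makes the converted CER/EER pair an intuitive companion to the summary SMD.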

Data availability statement

Data are available upon reasonable request. The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

Ethics statements

This study involves human participants and was approved by the Ethics Committee of the Technical University of Munich (682/21 S-SR). Participants gave informed consent to participate in the study before taking part.

Acknowledgments

This project will be part of Ferdinand Heimke’s doctoral thesis.
