A case study of the informative value of risk of bias and reporting quality assessments for systematic reviews

This literature study shows that reporting of experimental details is often incomplete, frequently resulting in unclear risk-of-bias assessments. We observed this both for animal and for human studies, across two main study designs: disease-control comparisons and, in a smaller sample, investigations of experimental treatments. Overall, reporting is slightly better for elements that contribute to the “story” of a publication, such as the background of the research question, interpretation of the results and generalisability, and worst for experimental details that relate to differences between what was planned and what was actually done, such as protocol violations, interim analyses, and assessed outcome measures. The latter also results in overall high RoB scores for selective outcome reporting.

Of note, we scored this more stringently than SYRCLE’s RoB tool [13] suggests and always scored a high RoB if no protocol was posted, because only comparing the “Methods” and “Results” sections within a publication would, in our opinion, result in an overly optimistic view. Within this sample, only human treatment studies reported posting protocols upfront [31, 32]. In contrast to selective outcome reporting, we would have scored selection, performance, and detection bias due to sequence generation more liberally for counterbalanced designs (Table 2), because randomisation is not the only appropriate method for preventing these types of bias. Particularly when blinding is not possible, counterbalancing [33, 34] and Latin-square-like designs [35] can decrease these biases, while randomisation would risk imbalance between groups due to “randomisation failure” [36, 37]. We would have scored a high risk of bias for blinding for these types of designs, because of the increased predictability of the sequence. In practice, however, we did not include any studies reporting Latin-square-like or other counterbalancing designs.
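As a rough illustration (not drawn from any included study), the following Python sketch contrasts a Latin-square-like counterbalanced allocation, in which each treatment occupies each position in the sequence exactly once, with simple per-subject randomisation, which can leave positions unbalanced by chance. The treatment labels and the number of subjects are hypothetical.

```python
import random

# Hypothetical treatment labels; in a Latin-square-like design each treatment
# appears once in every position of the sequence, balancing order effects.
treatments = ["A", "B", "C", "D"]

def latin_square_orders(treatments):
    """Cyclic Latin square: row i is the treatment list rotated by i positions."""
    n = len(treatments)
    return [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]

def randomised_orders(treatments, n_subjects):
    """Simple randomisation: an independent shuffle per subject, which can by
    chance leave some positions over-represented ("randomisation failure")."""
    return [random.sample(treatments, len(treatments)) for _ in range(n_subjects)]

for row in latin_square_orders(treatments):
    print("counterbalanced:", row)
for row in randomised_orders(treatments, 4):
    print("randomised:     ", row)
```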

One of the “non-story” elements that is reported relatively well, particularly for human treatment studies, is the blinding of participants, investigators, and caretakers. This might reflect scientists being more aware of potential bias on the part of participants: they may consider themselves more objective than the general population, while regarding the risk of influencing patients as more relevant.

The main strength of this work is that it is a full formal analysis of RoB and RQ in different study types: animal and human, baseline comparisons, and treatment studies. The main limitation is that it is a single case study on a specific topic: the nPD test in CF. The results shown in this paper are not necessarily valid for other fields, particularly as we hypothesise that differences in scientific practice between medical fields relate to differences in translational success [38]. Thus, it is worthwhile to investigate field-specific informative values before selecting which elements to score and analyse in detail.

Our comparisons of different study and population types show lower RoB and higher RQ for human treatment studies compared with the other study types for certain elements. Concerning RQ, the effects were most pronounced for the type of experimental design being explicitly mentioned and the reporting of adverse events. Concerning RoB, the effects were most pronounced for baseline differences between the groups, blinding of investigators and caretakers, and selective outcome reporting. Note, however, that the number of included treatment studies is considerably lower than the number of included baseline studies, and that the comparisons were based on only k = 12 human treatment studies. Refer to Table 3 for the absolute numbers of studies per category. In addition, our comparisons may be confounded to some extent by publication date. The nPD was originally developed for human diagnostics [39, 40], and animal studies were only reported later [41]. The use of the nPD as an outcome in (pre)clinical trials of investigational treatments began later still [42, 43].

Because we did not collect our data to assess time effects, we did not formally analyse them. However, we informally inspected publication dates by RoB score for blinding of investigators and caretakers, and by RQ score for ethics evaluation (in box plots with dot overlay); these showed more reported and fewer unclear scores in more recent publications (data not shown). While we thus cannot rule out confounding of our results by publication date, the results suggest mildly improved reporting of experimental details over time.

This study is a formal comparison of RoB and RQ scoring for two main study types (baseline comparisons and investigational treatment studies), in both animals and humans. Performing these comparisons within the context of a single SR [16] resulted in a small but relatively homogeneous sample of primary studies on the nPD in relation to CF. At conferences and from colleagues in the animal SR field, we had heard that reporting would be worse for animal than for human studies. Our comparisons allowed us to show that, particularly for baseline comparisons of the nPD in CF versus control, this is not the case.

The analysed tools [12, 13, 15] were developed for experimental interventional studies. While some of the elements are less appropriate for other types of studies, such as animal model comparisons, our results show that many of the elements can be applied and could still be useful, particularly if the reporting quality of the included studies were better.

Implications

To correctly interpret the findings of a meta-analysis, awareness of the RoB in the included studies is more relevant than the RQ on its own. However, it is impossible to evaluate the RoB if the experimental details have not been reported, resulting in many unclear scores. When each included study has at least one unclear or high RoB score, the overall conclusions of the review become inconclusive. For SRs of overall treatment effects that are performed to inform evidence-based treatment guidelines, RoB analyses remain crucial, even though the scores will often be unclear. Ideally, especially for SRs that will be used to plan future experiments or develop treatment guidelines, analyses should include only those studies consistently showing a low risk of bias (i.e. low risk on all elements), as sketched below. However, in practice, consistently low-RoB studies in our included literature samples (> 20 SRs to date) are too scarce for meaningful analyses. For other types of reviews, we think it is time to consider whether complete RoB assessment is the most efficient use of limited resources. While these assessments regularly reveal problems in reporting, which may help to improve the quality of future primary studies, the unclear scores do not contribute much to understanding the effects observed in meta-analyses.
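As a minimal sketch of what “low risk on all elements” means in practice, the snippet below filters a hypothetical set of RoB scores; the study names, elements, and scores are invented for illustration and are not taken from our data.

```python
# Hypothetical per-study RoB scores; element names and values are illustrative only.
rob_scores = {
    "Study 1": {"sequence_generation": "low", "blinding": "unclear", "selective_reporting": "high"},
    "Study 2": {"sequence_generation": "low", "blinding": "low", "selective_reporting": "low"},
    "Study 3": {"sequence_generation": "unclear", "blinding": "low", "selective_reporting": "low"},
}

# Keep only studies scoring low risk of bias on every assessed element.
consistently_low = [
    study for study, elements in rob_scores.items()
    if all(score == "low" for score in elements.values())
]

print(consistently_low)  # in this made-up example, only "Study 2" qualifies
```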

With PubMed already indexing nearly 300,000 records mentioning the term “systematic review” in the title, abstract, or keywords, we can assume that many scientists are spending substantial amounts of time and resources on RoB and RQ assessments. Particularly for larger reviews, it could be worthwhile to restrict RoB assessment to either a random subset of the included publications or a subset of relatively informative elements. Even a combination of these two strategies may be sufficiently informative if the results of the review are not directly used to guide treatment decisions. The subset could give a reasonable indication of the overall level of evidence of the SR while saving resources. Different suggested procedures are provided in Table 5. The authors of this work would probably have switched to such a strategy early in the data extraction phase, had the funder not stipulated full RoB assessment in the funding conditions.
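As a rough sketch of the “random subset of publications” strategy, the snippet below draws a reproducible random sample from a list of included records; the identifiers, subset size, and seed are arbitrary illustrations rather than a recommendation.

```python
import random

# Hypothetical identifiers for the publications included in a large review.
included_publications = [f"record_{i:03d}" for i in range(1, 101)]

random.seed(42)  # fixed seed so the drawn subset is reproducible and auditable
rob_subset = random.sample(included_publications, k=20)

print(sorted(rob_subset))
```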

Table 5 Examples of potential SR procedures to evaluate the included studies and when to use them

We previously created a brief and simple taxonomy of systematised review types [44], in which we advocate that RoB assessment be a mandatory part of any SR. We would still urge anyone calling their review “systematic” to stick to this definition and perform some kind of RoB and/or RQ assessment, but having two independent scientists follow a lengthy and complex tool for all included publications, resulting in 74.6% of the assessed elements not being reported, or 77.9% unclear RoB, can, in our opinion, in most cases be considered inefficient and unnecessary.
