Meta-analysis of (single-cell method) benchmarks reveals the need for extensibility and interoperability

Given the rapid development and uptake of new technologies in biology (e.g. high-throughput DNA sequencing, single-cell assays and imaging technologies), methodologists are presented with nearly unlimited opportunities to apply or develop computational tools to process, model, and interpret large-scale datasets across a wide range of applications. Unsurprisingly, the data explosion [1, 2] is mirrored by a massive increase in the number of computational methods; for example, at the time of writing, 1318 tools are listed in a database for the analysis of single-cell RNA-seq data [3, 4] and more than 370 tools are listed for the analysis of spatial omics data [5]. This creates challenges in determining which tools to use for discovery [6]. In particular, researchers need to convince themselves that the tools they develop or use are performant, and a typical approach is via formal benchmarks. Benchmarks can be decomposed into five steps: (1) formulating a computational task (or subtask) to be investigated (e.g. calling differentially expressed genes); (2) collecting reference datasets, either by generating (realistic) synthetic datasets or by deriving ground truth from experimental data; (3) defining performance criteria (e.g. sensitivity, specificity); (4) evaluating a representative set of methods against the performance criteria across multiple reference datasets; and (5) formulating conclusions and guidelines.
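As a minimal sketch of steps (2)–(4), the core of a benchmark can be viewed as a loop over reference datasets, methods, and performance metrics. The Python sketch below is purely illustrative: the datasets, methods, and metrics are hypothetical placeholders rather than components of any real benchmark.

```python
# Minimal sketch of a benchmark core (steps 2-4): run every method on every
# reference dataset and score each result with every performance metric.
# All dataset, method, and metric definitions are hypothetical placeholders.
import random
import pandas as pd

def load_dataset(name, n=200, seed=0):
    """Step 2 placeholder: simulate binary ground truth (e.g. DE vs. non-DE genes)."""
    rng = random.Random(f"{name}-{seed}")
    truth = [rng.random() < 0.3 for _ in range(n)]
    scores = [t + rng.gauss(0, 1) for t in truth]   # noisy per-gene evidence
    return scores, truth

def method_a(scores):  # placeholder "method": lenient threshold on the evidence
    return [s > 0.5 for s in scores]

def method_b(scores):  # placeholder "method": stricter threshold
    return [s > 1.0 for s in scores]

def sensitivity(pred, truth):  # step 3: performance criteria
    tp = sum(p and t for p, t in zip(pred, truth))
    return tp / max(sum(truth), 1)

def specificity(pred, truth):
    tn = sum((not p) and (not t) for p, t in zip(pred, truth))
    return tn / max(sum(not t for t in truth), 1)

METHODS = {"method_a": method_a, "method_b": method_b}
METRICS = {"sensitivity": sensitivity, "specificity": specificity}

records = []
for ds_name in ["sim_easy", "sim_hard"]:                 # step 2: reference datasets
    scores, truth = load_dataset(ds_name)
    for m_name, method in METHODS.items():               # step 4: evaluate methods
        pred = method(scores)
        for crit, metric in METRICS.items():
            records.append({"dataset": ds_name, "method": m_name,
                            "metric": crit, "score": metric(pred, truth)})

results = pd.DataFrame(records)                          # input to step 5: conclusions
# pivot_table averages over datasets (mean) by default
print(results.pivot_table(index="method", columns="metric", values="score"))
```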

In terms of developing and disseminating new methods, the minimum requirement for quality is usually that the new approach provides a benefit over existing approaches. However, the current standard allows developers to be their own “judge, jury and executioner” [7], giving them some freedom to choose the settings and the evaluations used in the benchmark. There is a notable tension here, since methodologists who want to develop high-quality novel approaches may also be under pressure to publish [8]. One can argue that the risk of biases and over-optimistic evaluations of new approaches is minimized by the standard scientific review process; however, this process is also known to have its challenges [9]. Ultimately, it is almost a foregone conclusion that a newly proposed method will report comparatively strong performance [7, 10]. Thus, claims from individual method development papers need to be scrutinized, preferably from a neutral (i.e. independent) standpoint [11, 12]. Neutral benchmarking appears to be a popular approach, since over 60 benchmarks have been conducted for single-cell data analysis alone (see below and Additional file 1). But even when done neutrally, the community may still want a mechanism to challenge, extend, or personalize the assessments (e.g. update or add reference datasets, run methods with alternative parameters, use different metrics, or rank methods differently).

Neutral benchmarking shares common ground with community “challenges” for consolidating the state of the art. In some subfields of computational biology, there is a long history of such challenges, such as the biennial CASP (Critical Assessment of Structure Prediction) [13] and the DREAM (Dialogue on Reverse Engineering Assessment and Methods) challenges [14], where participants are invited to propose solutions to a predefined problem. The challenge model is growing in success, including leaderboards that give real-time feedback, but these can sometimes reinforce a narrow view of performance (e.g. a single measure of performance on a single dataset) [15]. In addition, for a challenge to be conceptualized in the first place, the community needs to formalize and frame existing problems, have access to suitable reference datasets, gather indications that the challenge can be solved, and only then assess whether current technologies and methods would be able to solve it. This necessarily requires some common ground and knowledge in the field, which is often gained through neutral benchmarking studies. Although community challenges foster engagement and innovation, they are typically time-gated in scope, whereas one could also imagine benchmarking as a continuous process in which challenges are integrated into the subfield’s trajectory. Initiatives in this direction include OpenEBench [16], which provides a computing platform and infrastructure for benchmarking events, and ‘Open Problems in Single-Cell Analysis’, which is focused on formalizing (single-cell data analysis) tasks to foster innovation in method development while providing infrastructure and datasets so that new methods can be tested [17, 18].

There are several obstacles to running and using the results of benchmarks. In fast-moving subfields, benchmark results rapidly become out of date and, in some cases, competing methods are never directly compared because they are developed simultaneously. Current benchmarks are always a snapshot in time, while tool development is continuous. Most benchmarks share common components, such as reference datasets, a set of methods, and metrics to score their performance, but there are typically no pervasive standards for the system or strategy of benchmarking in computational biology, except for those predefined by challenges. This lack of standards can, for example, lead to different rankings of the same methods for the same task [19, 20, 21]. Similarly, shortcomings of existing performance metrics may only be discovered after earlier benchmarks have been published [22]. Another area in which benchmarks underdeliver is the interpretation of performance results: benchmark authors (or challenge organizers) generally get to make all the decisions about how the performance evaluation is conducted (e.g. how a ranking is determined, given multiple criteria).
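To make the ranking issue concrete, the toy example below uses invented scores for three hypothetical methods on two metrics; two common aggregation choices, mean score versus mean rank, then crown different “best” methods from identical per-metric results.

```python
# Toy illustration (hypothetical scores): the same per-metric results can yield
# different overall rankings depending on how criteria are aggregated.
import pandas as pd

scores = pd.DataFrame(
    {"metric_1": [0.99, 0.50, 0.49],   # one metric dominated by method_A
     "metric_2": [0.40, 0.45, 0.44]},  # another metric where method_A does worst
    index=["method_A", "method_B", "method_C"],
)

# Aggregation 1: average the raw scores across metrics (higher is better).
mean_score = scores.mean(axis=1).sort_values(ascending=False)

# Aggregation 2: rank methods within each metric, then average the ranks (lower is better).
mean_rank = scores.rank(ascending=False).mean(axis=1).sort_values()

print("Aggregate by mean score:\n", mean_score)  # method_A comes out first
print("Aggregate by mean rank:\n", mean_rank)    # method_B comes out first
```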

Altogether, it remains an open question whether computational biology has reached a benchmarking optimum, achieving fair comparisons in a timely, independent, and continuous manner, while also keeping the barrier low enough for colleagues, including those in adjacent fields, to participate. In this report, we review current benchmarking practices in a subfield of genome biology (single-cell data analysis) to get an understanding of where the current state-of-the-art is, and based on our findings, we postulate what elements of benchmarking would be considered desirable for the future.

State-of-the-art in benchmarking

To understand the current state of practice in benchmarking studies in computational biology, we crowdsourced a list of single-cell method benchmarks (studies from 2018–2021 were selected by the project team as exhaustively as possible, on the basis that at least a preprint had been posted, that the study compared computational methods for some form of single-cell data, and that it was not a primary method publication; see Additional file 1 for the list of studies). We then designed a questionnaire to query various attributes of a benchmark study (see Methods, Additional file 2) and crowdsourced the review of these benchmarks. We focused on single-cell methods because it is an active area of methodological research, with an acute method explosion [3, 4], and because a large number of benchmarks have been conducted. The questions asked for each benchmark include both factual (e.g. “Whether synthetic data is available”) and opinion-based (e.g. “Degree to which authors are neutral”) assessments. Overall, we queried the scope, extensibility, neutrality, open science, and technical features of each benchmark. We required that at least two reviewers answer the questionnaire for each benchmark, and in the case of large discrepancies between responses (disagreement on a factual question or a large difference in opinion), results were consolidated manually by a third reviewer (see Additional file 1 for consolidated responses and Additional file 3 for original responses). Questions were organized into two topics: (1) overall design of the benchmark and (2) code and data availability and technical aspects.
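As an illustration of this consolidation rule, the sketch below flags discrepancies between two reviewers' responses; the question names and the opinion-gap threshold are hypothetical and convey only the logic, not the exact criteria used in our review.

```python
# Hypothetical sketch of the consolidation rule: flag a benchmark for
# third-reviewer consolidation if two reviewers disagree on a factual question
# or differ substantially on an opinion score (a 1-5 scale is assumed here).
OPINION_GAP = 2  # assumed threshold; the actual cutoff used in the study may differ

def needs_consolidation(answers_r1, answers_r2, factual_questions, opinion_questions):
    """Return the questions whose answers should be resolved manually."""
    flagged = []
    for q in factual_questions:
        if answers_r1[q] != answers_r2[q]:      # any factual disagreement is flagged
            flagged.append(q)
    for q in opinion_questions:
        if abs(answers_r1[q] - answers_r2[q]) >= OPINION_GAP:  # large opinion gap
            flagged.append(q)
    return flagged

# Example usage with invented responses
r1 = {"synthetic_data_available": "Yes", "author_neutrality": 5}
r2 = {"synthetic_data_available": "No",  "author_neutrality": 4}
print(needs_consolidation(r1, r2,
                          factual_questions=["synthetic_data_available"],
                          opinion_questions=["author_neutrality"]))
# -> ['synthetic_data_available']
```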

Overall design of benchmarks

We assessed the overall design of the 62 surveyed single-cell benchmarks, including the number of datasets, methods, and evaluation criteria used in each comparison, as well as the neutrality of the authors. The number of benchmark datasets varied greatly, with 2 benchmarks using only 1 dataset and 1 benchmark using thousands of simulated datasets (median = 8). Likewise, the number of methods evaluated for the chosen task varied from 2 to 88 (median = 9). Finally, the number of evaluation criteria, defined here as numerical metrics that compare method results against a ground truth, varied from 1 to 18 (median = 4), showing that current benchmarks tend to include more methods and datasets than evaluation criteria (Fig. 1A). The range of analysis tasks covered by benchmarks mirrors the range of available tools quite well [3, 4]. A notable exception is the category of visualization tasks, which accounts for 40% of the available tools but was formally benchmarked only once (Fig. 1B). Seventy-two percent of the manuscripts were first released as preprints, and 66% tested only default parameters (Fig. 1C). We also enquired about the neutrality of the authors, defined as whether the authors of the benchmark were involved in one or several of the methods evaluated. For more than 60% of the benchmarks, the authors were completely independent of the evaluated methods (Fig. 1D); neutrality is a desired attribute, although not absolutely required. More than 75% of benchmarks also assessed secondary measures (Additional file 4 Figure S1), such as runtime, memory usage, and scalability.

Fig. 1

Overall design of 62 single-cell method benchmarks. Overview of the crowdsourced meta-analysis across surveyed benchmarks. A Numbers of entities (datasets, methods, metrics) present in each benchmark (each dot is a benchmark); jitter is added to the X-axis. B Data analysis tasks. C Percentages of benchmarks that were first posted as a preprint and of benchmarks that explored parameter space beyond default settings. D Reviewers’ opinions on neutrality (whether the benchmark authors were involved in the methods evaluated); jitter is added to the X- and Y-axes of the scores

Code/data availability, reproducibility, and technical aspects

Also important for the uptake of benchmarking results are the open science and reproducibility practices of the studies. Thus, the second group of questions related to the availability of data, code, and results, as well as technical aspects. Figure 2A gives an overview of the availability of the different levels of data for benchmarking studies, highlighting that input data (often including ground truth) is frequently available (97% of studies). However, intermediate results, including outputs of methods run on datasets (19%) and performance results (29%), were only sparsely available. For studies that generated simulated data, less than half (19/46 articles) made their synthetic data available. Only 10% of benchmarks provided performance results in an explorable format. On the technical side, most benchmarks reported software versions of the methods being evaluated (68%), although provenance tracking (tracking of inputs, outputs, parameters, software versions, etc.) was not explicitly used. Another aspect of reproducible practice relates to workflow tools that orchestrate the datasets through methods and metrics. Although their use in computational biology is increasing [23], we observed that fewer than 25% of the surveyed benchmarks used any form of them (see Fig. 2B). In a similar vein, containerization of software environments is quite mature in computational biology [24] but is rarely utilized in benchmarking (8% of studies). Concerning the methods that are compared, R and Python remain the dominant programming languages for single-cell methods, mirroring the summaries in the scRNA-tools database [4]; see also Additional file 4 Figure S1.
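For illustration, a minimal provenance record for a single method run might capture inputs, outputs, parameters, and software versions, as in the Python sketch below. The file paths, package list, and helper names are placeholders; in practice, a workflow manager (e.g. Snakemake, Nextflow) or a container would handle much of this bookkeeping.

```python
# Hedged sketch of minimal provenance tracking for one method run: record inputs,
# outputs, parameters, and software/environment versions alongside the result.
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata

def file_sha256(path):
    """Content hash so the exact input/output files can be identified later."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def provenance_record(input_path, output_path, method_name, params, packages):
    """Assemble a JSON-serializable record of one method run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "method": method_name,
        "parameters": params,
        "input": {"path": input_path, "sha256": file_sha256(input_path)},
        "output": {"path": output_path, "sha256": file_sha256(output_path)},
        "python": sys.version,
        "platform": platform.platform(),
        "package_versions": {p: metadata.version(p) for p in packages},
    }

# Example usage (paths and package list are placeholders):
# record = provenance_record("data/dataset.h5ad", "results/clusters.csv",
#                            "method_a", {"resolution": 1.0}, ["numpy", "pandas"])
# print(json.dumps(record, indent=2))
```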

Fig. 2

Code/data availability, reproducibility, and technical aspects of 62 single-cell method benchmarks. A Each column of the heatmap represents a benchmark study and each row represents a factual question; responses are represented by colours (Yes: blue; Partially: orange; Not Applicable: white; No: red). Not Applicable corresponds to benchmarks that did not use simulated data ("synthetic data is available" row) and to a benchmark that evaluated secondary measures only ("performance results available" row). "Results available" refers to the outputs of computational methods run on datasets; "performance results" refers to the results that are compared to a ground truth. B Type of workflow system used (benchmarks with no workflow or no available code are shown in red, otherwise grey). C Reviewers’ opinions on the availability and extensibility of benchmarking code; jitter is added to the X- and Y-axes of the scores. D Licence specification across benchmarking studies (benchmarks without a licence or without available code are shown in red, otherwise grey)

We next scored the degree to which code is available on a scale of 1 to 5 (Fig. 2C), where 1 means ‘not at all’ and 5 means ‘completely’. For over 75% of the benchmarks, code was fully or partially available, such as in a GitHub repository, although with clearly different levels of completeness and description. We also gathered opinions on how extensible the available code is (e.g. how easy it would be to incorporate an additional dataset, method, or evaluation criterion; see Fig. 2C); among the 47 studies sharing code, only two received a high score for extensibility.

Explicit code licensing is somewhat sporadic for benchmark studies (Fig. 2D); of the 47 studies that made code available, 19 (40%) did not specify a licence. This can become an important consideration when re-using (public but licensed) code for building data analysis pipelines or extending benchmarks. Of the benchmarks that did specify a licence, the free software licences MIT and GPLv3 appeared to be dominant.

Taken together, our meta-analysis shows that most benchmark results are, at least in principle, reproducible, since code and input data are shared. However, a fully reproducible analysis would also require information about the software environment (e.g. operating system, libraries, packages) or an available container, which is sometimes documented but often not readily available in the benchmarks that we evaluated. Thus, a significant amount of redundant work would be required to re-establish or extend the vast majority of surveyed benchmarks.
