Dissecting self-supervised learning methods for surgical computer vision

Automatic analysis and interpretation of visual signals from the operating room (OR) is the primary concern of surgical computer vision, a fast-growing discipline that is expected to play a major role in the development of reliable decision support systems for surgeons (Maier-Hein et al., 2017). Recent developments in the field have indeed produced increasingly refined vision algorithms; however, most of these studies have been conducted on datasets containing small numbers of recorded procedures, all manually annotated by clinical experts. Future developments will require much larger quantities of data in order to account for variations in anatomy, patient demographics, clinical workflow, surgical skills, instrumentation, and image acquisition (Maier-Hein et al., 2022).

For that purpose, raw video data can be supplied on a very large scale by laparoscopic surgeries, since they are guided by intra-abdominal video streams: in the United States, nearly 1M laparoscopic cholecystectomies are performed each year, resulting in approximately 630k hours of footage for this one type of procedure alone. Yet the datasets used to train current surgical vision models remain disproportionately small. For example, Cholec80 (Twinanda et al., 2016b), one of the most popular datasets in the field (Maier-Hein et al., 2017), barely exceeds 50 h of recordings. Apart from medico-legal constraints, the critical factor behind this scarcity of data is the reliance on manual annotations. While labels for natural images can easily be supplied by the general public, surgical annotations usually require clinical expertise. As a result, the fully supervised approach – i.e., training models on entirely annotated datasets – may prove unsustainable in surgical computer vision.
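For intuition, the footage figure follows from simple arithmetic, under the assumption (implied by the numbers above, not stated in the cited sources) that an average procedure lasts roughly 38 minutes:

```latex
\underbrace{10^{6}}_{\text{procedures/year}}
\times
\underbrace{0.63\ \text{h}}_{\approx 38\ \text{min/procedure}}
\approx 6.3 \times 10^{5}\ \text{h of footage/year}
```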

In computer vision, an alternative has emerged in the form of Self-Supervised Learning (SSL) (Jing and Tian, 2021). Considerable progress has been made in this area, with increasingly refined methods for extracting rich vector representations from images without labels, using only the raw pixel data. This research topic has so far not been thoroughly explored in surgical applications. In the few self-supervised training tasks proposed by the community, learning from the visual content itself is generally de-emphasized in favor of utilizing other available sources of information – for example, time (Funke et al., 2018; Yengera et al., 2018), stereoscopy (Yang and Kahrs, 2021), or robot kinematics (Sestini et al., 2021). State-of-the-art natural image SSL methods, with their advanced representational capabilities, have yet to be adequately demonstrated on surgical images.
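To make the learning signal concrete: the contrastive methods studied below (e.g., SimCLR, MoCo v2) embed two augmented views of each image and train the network to match views of the same image against all others in the batch. The following is a minimal PyTorch sketch of SimCLR's NT-Xent objective, with an illustrative temperature value; it is a generic sketch of the published loss, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent: z1, z2 are (N, D) projections of two augmented views
    of the same N images; view i in z1 is the positive of view i in z2."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2N, D) stacked views
    sim = z @ z.t() / temperature             # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))         # exclude self-similarity
    n = z1.size(0)
    # row i (< n) pairs with row i + n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```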

However, expanding SSL methods beyond natural images can be challenging, especially in a domain as complex as surgery. Most notably, heavy parameter tuning based on heuristics (Xiao et al., 2020) might be required. Robustness against large variations in domains and tasks is also not guaranteed; in-depth performance analysis has essentially been conducted on general computer vision datasets (Feichtenhofer et al., 2021a), most commonly ImageNet, which contains 14M images and over 1000 visually distinct classes. In contrast, Cholec80, one of the most prominent surgical computer vision datasets (Maier-Hein et al., 2017), contains 80 videos of procedures, amounting to under 200k frames at 1 fps. Only 7 classes of surgical phases and 7 classes of tools are featured; moreover, the visual evidence needed to distinguish them is highly sparse, especially for time-based tasks such as surgical phase recognition, a coarse-grained form of activity recognition. Further, since surgical videos can last several hours while depicting a relatively stable scene, it is non-trivial to determine how existing SSL frameworks can best accommodate frames coming from the same procedure. Finally, these issues may be exacerbated by surgery-specific confounding factors such as smoke, bleeding, occlusions, or rapid tool movements. Such fundamental differences between natural and surgical image data motivate the need for a thorough study of SSL in the surgical domain.
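To make the same-procedure issue concrete, the sketch below (entirely hypothetical; it does not describe this paper's pipeline) indexes frames at 1 fps by procedure so a sampler can cap how many frames of one video – and thus of one largely static scene – end up in the same batch:

```python
import random
from collections import defaultdict

def build_frame_index(video_lengths_sec):
    """Index frames at 1 fps, keyed by procedure ID.
    video_lengths_sec: {video_id: duration in seconds}."""
    index = defaultdict(list)
    for vid, duration in video_lengths_sec.items():
        index[vid] = [(vid, t) for t in range(int(duration))]  # one frame per second
    return index

def sample_batch(index, batch_size, max_per_video=4):
    """Draw a batch with at most `max_per_video` frames per procedure
    (assumes batch_size / max_per_video <= number of procedures)."""
    batch = []
    for vid in random.sample(list(index), k=len(index)):  # shuffled procedures
        if len(batch) >= batch_size:
            break
        k = min(max_per_video, batch_size - len(batch), len(index[vid]))
        batch.extend(random.sample(index[vid], k))
    return batch
```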

The work presented here thoroughly addresses this need in three distinct steps (see Fig. 1). We select four SSL methods – MoCo v2 (Chen et al., 2020a), SimCLR (Chen et al., 2020b), SwAV (Caron et al., 2020), and DINO (Caron et al., 2021) – suitably covering the state of the art in general computer vision, and extensively examine hyperparameter variations for each of them on Cholec80. We identify key differences with the natural image domain, highlighting hyperparameter tuning as a non-trivial and crucial element of SSL method transfer. In the second step, we set hyperparameters to their optimal values and test the quality of the representations learned through each of these methods on two classic surgical downstream tasks: phase recognition and tool presence detection. Furthermore, we examine how these approaches respond to varying amounts of labeled and unlabeled data in a practical semi-supervised setting. Here, we show that these methods, while generic in design, achieve state-of-the-art performance on both tasks and significantly mitigate the reliance on annotated data, adding up to 7.4% phase recognition F1 score and 20.4% tool presence detection mAP. In the final step of the study, we extend our experiments to additional tasks and datasets: phase recognition and tool presence detection on HeiChole (Wagner et al., 2021), phase recognition and tool presence detection on CATARACTS (Al Hajj et al., 2019), action triplet recognition on CholecT50 (Nwoye et al., 2022b), semantic segmentation on Endoscapes (Alapatt et al., 2021), and 8- and 25-class semantic segmentation on CaDIS (Grammatikopoulou et al., 2021), thereby extensively covering the domain of surgical vision with SSL.
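For reference, a standard way to assess frozen SSL representations on a downstream task is a linear probe. The sketch below shows the general shape of such a protocol under assumed details – a ResNet-50 backbone, a hypothetical checkpoint path, and the 7 Cholec80 phase classes – rather than the exact evaluation code used in this study:

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50()
# Hypothetical path: load SSL-pretrained weights before probing.
# backbone.load_state_dict(torch.load("ssl_pretrained_r50.pth"))
backbone.fc = nn.Identity()                # expose 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = False                # keep the feature extractor frozen
backbone.eval()

head = nn.Linear(2048, 7)                  # 7 surgical phases in Cholec80
optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    """One optimization step on the linear head only."""
    with torch.no_grad():
        feats = backbone(images)           # (N, 2048) frozen features
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```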

This paper’s contributions are as follows:

1. Benchmarking of four state-of-the-art self-supervised learning methods (MoCo v2 (Chen et al., 2020a), SimCLR (Chen et al., 2020b), SwAV (Caron et al., 2020), and DINO (Caron et al., 2021)) in the surgical domain.

2. Thorough experimentation (∼200 experiments, 7000 GPU hours) and analysis of different design settings – data augmentations, batch size, training duration, frame rate, and initialization – highlighting the need for, and providing intuitions towards, principled approaches for domain transfer of SSL methods (see the augmentation sketch after this list).

3. In-depth analysis of the adaptation of these methods – originally developed on other datasets and tasks – to the surgical domain, with a comprehensive set of evaluation protocols spanning 10 surgical vision tasks performed on 6 datasets.

4. Extensive evaluation (∼280 experiments, 2000 GPU hours) of the scalability of these methods to various amounts of labeled and unlabeled data through an exploration of both fully and semi-supervised settings.
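As referenced in contribution 2, data augmentation is one of the design settings swept. Below is a sketch of a SimCLR/MoCo v2-style augmentation pipeline in torchvision; the magnitudes and probabilities shown are illustrative defaults from the natural image literature, not the values explored in this study:

```python
import torchvision.transforms as T

ssl_augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def two_views(pil_image):
    """Produce the two correlated views consumed by a contrastive objective."""
    return ssl_augment(pil_image), ssl_augment(pil_image)
```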
