The deep learning segmentation method was trained on 115 randomly selected patches from the aforementioned expert-annotated patches and evaluated against the same set of 19 patches that were used to assess expert inter-rater agreement, which were also kept separate from the training data selection pool. The training dataset was drawn entirely from separate tissue cores (and therefore patients) from those used in the validation dataset. Because validation metrics were not used for early stopping, model selection, or anything else other than extraction of results, the validation set was not kept separate from the evaluation dataset. We recognize that this introduces some bias to the hyperparameter tuning process, but only for the high-level loss and regularization hyperparameters that were tuned, so the impact and potential for overfitting should be minor. Due to the large size of the patches (480x480 pixels), they were subdivided during training and evaluation. Further details on the patch subdivision are in the appendices (Online Resource 1). For all tests, AdaDelta [17] (an adaptive stochastic gradient descent technique) was used as the optimizer, as it was found to produce good results during network development. The optimizer hyperparameters were \(\rho =0.975\) and a learning rate of 0.005. A batch size of 4 and input image dimensions of 224x224 pixels were used for all tests.
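For reference, a minimal sketch of this optimizer configuration is shown below, assuming a TensorFlow/Keras setup (suggested by the Kaggle ConvNeXt weights used later); the Keras argument names are assumed to map directly onto the reported hyperparameters.

```python
import tensorflow as tf

# Optimizer settings as reported above; Keras exposes AdaDelta's decay
# constant as `rho` and the step size as `learning_rate`.
optimizer = tf.keras.optimizers.Adadelta(learning_rate=0.005, rho=0.975)

# Training configuration used for all tests.
BATCH_SIZE = 4
INPUT_SHAPE = (224, 224, 3)  # sub-patches cut from the 480x480 annotated patches
```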
Since readily available ConvNeXt weights trained extensively on ImageNet [18] exist, these were used as a starting point for the backbone weights. Given the broad nature of the ImageNet dataset, some of the patterns learned from it were thought likely to transfer to RNAscope segmentation. The exact weights used were the convnext_base_21k_1k_224_fe weights (pretrained on ImageNet-21k and then on ImageNet-1k, with the final classification layers removed) provided by Sayak Paul on Kaggle [19]. The non-backbone layers were initialized with random weights. Because of this mix of pre-trained and newly initialized weights, there was potential for the early epochs of training to destroy the learnt patterns in the pre-trained backbone section: the non-backbone layers would require some epochs to reach useful weights and therefore would not initially allow the backbone layers to be correctly penalized/rewarded based on their ability to recognize useful features. To prevent this loss of feature recognition capability, the backbone layers were locked for the first segment of training, and only the non-backbone layers had their weights trained. Later segments allowed all layers to be trained, which allowed the backbone weights to be fine-tuned. Input images were normalized using the ImageNet normalization statistics to ensure the pre-trained weights were used effectively.
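A minimal sketch of the input normalization and backbone locking described here is given below; the ImageNet channel statistics are the standard published values, while the backbone loading call is hypothetical, since the exact packaging of the convnext_base_21k_1k_224_fe weights is not described in this section.

```python
import tensorflow as tf

# Standard ImageNet channel statistics (RGB, images scaled to [0, 1]).
IMAGENET_MEAN = tf.constant([0.485, 0.456, 0.406])
IMAGENET_STD = tf.constant([0.229, 0.224, 0.225])

def normalize(image):
    """Match the input distribution the pre-trained backbone expects."""
    image = tf.cast(image, tf.float32) / 255.0
    return (image - IMAGENET_MEAN) / IMAGENET_STD

# Hypothetical loading of the pre-trained feature extractor; randomly
# initialized upscaling layers would then be attached on top of it.
backbone = tf.keras.models.load_model("convnext_base_21k_1k_224_fe")
backbone.trainable = False  # locked during the first segment of training
```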
The ground truth masks contained a 5-pixel cross pattern over each RNAscope dot location to mitigate the low positive class representation in the dataset. Stain normalization was not applied to the images, since the RNAscope stain in this dataset was so underrepresented that it was often removed entirely by stain normalization. Some data augmentations were applied during training: rotation of up to 90 degrees and vertical and/or horizontal flips.
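As an illustration of the mask construction and augmentations described above, a sketch is given below; the exact cross shape (centre pixel plus its four neighbours) and the use of continuous rotation angles are assumptions based on the description.

```python
import numpy as np
from scipy.ndimage import rotate

def draw_cross(mask, y, x):
    """Mark a 5-pixel cross (centre plus 4-connected neighbours) at a dot
    location in the ground truth mask."""
    h, w = mask.shape
    for dy, dx in [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]:
        if 0 <= y + dy < h and 0 <= x + dx < w:
            mask[y + dy, x + dx] = 1
    return mask

def augment(image, mask):
    """Rotation of up to 90 degrees plus random vertical/horizontal flips,
    applied identically to the image and its mask."""
    angle = np.random.uniform(0.0, 90.0)
    image = rotate(image, angle, reshape=False, order=1)
    mask = rotate(mask, angle, reshape=False, order=0)  # keep mask binary
    if np.random.rand() < 0.5:
        image, mask = np.flipud(image), np.flipud(mask)
    if np.random.rand() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    return image, mask
```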
Fig. 12 The loss during training of the deep learning network with different loss functions
Fig. 13 The validation loss during training of the deep learning network with different loss functions
Several sets of tests were run to evaluate which loss function and regularization methods produced the most stable training and best results. The results of these tests are detailed in the following subsections. The results of some less significant sets of tests are instead in the appendices (Online Resource 1).
Loss Functions
The first set of tests examined loss functions. As we were developing a segmentation network, loss functions that represented the segmentation problem well were sought. The heavy class imbalance involved in RNAscope segmentation was also taken into consideration; a loss function weighting each pixel and class equally would be unlikely to perform well for this task. Using such a function would likely lead to the network being trained to ignore the positive class entirely and to predict only negatives, as it could obtain a score of more than 99% by that type of metric.
The loss functions selected for comparison were binary cross-entropy, Jaccard loss, Dice loss, and Tversky loss. Binary cross-entropy does not account for class imbalance and was expected to perform poorly, but was included for the sake of comparison. Jaccard and Dice loss both incentivize overlap (intersection) between predictions and ground truth, while penalizing over-detection. Tversky loss is similar to Dice loss, but allows the user to set the weighting of recall (\(\alpha \)) and precision (\(\beta \)) in the formula, which must sum to one. This allows the sensitivity to be tuned to match the problem. Dice loss is identical to Tversky loss with \(\alpha =0.5\) and \(\beta =0.5\). Tversky loss was tested with \(\alpha =0.6\) and \(\alpha =0.7\) (increased sensitivity/recall).
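A minimal sketch of the Tversky loss under the weighting convention described here (assuming a TensorFlow/Keras setup) is shown below; \(\alpha \) weights false negatives (recall) and \(\beta \) weights false positives (precision), so \(\alpha =\beta =0.5\) recovers Dice loss.

```python
import tensorflow as tf

def tversky_loss(alpha=0.6, beta=0.4, smooth=1e-6):
    """Tversky loss with alpha weighting false negatives (recall) and beta
    weighting false positives (precision); alpha = beta = 0.5 is Dice loss."""
    def loss(y_true, y_pred):
        y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
        tp = tf.reduce_sum(y_true * y_pred)
        fn = tf.reduce_sum(y_true * (1.0 - y_pred))
        fp = tf.reduce_sum((1.0 - y_true) * y_pred)
        tversky_index = (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)
        return 1.0 - tversky_index
    return loss
```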
These tests were run for 1024 epochs, split into two halves. For the first half, the pre-trained ImageNet backbone weights were locked, and only the upscaling layers were trained. For the second half, all layers were trained. DropPath was disabled for the ConvNeXt backbone section, and DropOut (\(r=0.15\)) was enabled for the upscaling and final layers of the network. The validation intersection over union throughout training for each loss function test is shown in Fig. 11, and the best post-processing \(F_1\)-score for each test is shown in Table 1. The loss and validation loss during training are shown in Figs. 12 and 13, respectively.
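A sketch of this two-segment schedule, reusing the Tversky loss sketch above as the example loss, might look as follows; the `model`, `backbone`, `train_ds`, and `val_ds` objects are placeholders for the network and datasets described elsewhere.

```python
# Phase 1 (first 512 epochs): backbone frozen, only the upscaling and
# final layers are trained.
backbone.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.005, rho=0.975),
              loss=tversky_loss(alpha=0.6, beta=0.4))
model.fit(train_ds, validation_data=val_ds, epochs=512)

# Phase 2 (remaining 512 epochs): unfreeze everything so the pre-trained
# backbone can be fine-tuned; recompile so the change takes effect.
backbone.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.005, rho=0.975),
              loss=tversky_loss(alpha=0.6, beta=0.4))
model.fit(train_ds, validation_data=val_ds, epochs=512)
```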
Binary cross-entropy loss performed surprisingly well given that it does not account for class imbalance, weighting each pixel evenly. It reached a validation binary intersection over union score of 0.107 after 1024 epochs, as shown in Fig. 11, and was noticeably more stable than the other loss functions. Intersection over union loss functions are known to be unstable [20], and this is evident in Fig. 11; Dice loss and Jaccard loss were so unstable that they failed to converge on a good solution. The modified sensitivity weighting of Tversky loss allowed it to converge despite its inherent instability, and its better representation of the optimization problem produced higher \(F_1\)-scores than binary cross-entropy. Although Tversky loss with \(\alpha =0.7\) ended with a lower validation intersection over union than binary cross-entropy, its \(F_1\)-score after post-processing was much higher: 0.678, compared to just 0.576 for binary cross-entropy. This implies that Tversky loss learns a more robust representation of the problem, even if its pixel-wise accuracy is not as high. Given the large amount of training instability when using Tversky loss, it was deemed likely that more regularization would improve outcomes. This is explored in the following sets of tests.
Fig. 14 The \(F_1\)-score of the deep learning network trained with Tversky loss (\(\alpha =0.6\)) when using differing grey thresholds and area thresholds for segmentation post-processing. The max \(F_1\)-score is at the area threshold of 0 pixels and grey threshold of 254
The two best-performing networks were those that used Tversky loss. Surface plots of the \(F_1\)-score at different grey threshold and area threshold values (as defined in Section “Post-Processing”) for both networks trained using Tversky loss are shown in Figs. 14 and 15. These decision surface plots demonstrate that these classifiers may not generalize well using the peak threshold values; any deviation in the size of segmentation detections (impacting the area threshold) or grey value (impacting the grey threshold) would cause the \(F_1\)-score to drop significantly. However, both graphs display a larger plateau around an area threshold of 5 and a grey threshold of 250, which would mitigate this generalization issue.
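A minimal sketch of the kind of post-processing these thresholds imply is given below, assuming an 8-bit prediction map binarized at the grey threshold, with connected components at or below the area threshold discarded; the exact definitions are those in the “Post-Processing” section.

```python
import numpy as np
from skimage.measure import label, regionprops

def extract_dots(pred_map, grey_threshold=250, area_threshold=5):
    """Binarize an 8-bit prediction map at grey_threshold, discard connected
    components whose area is <= area_threshold pixels, and return the
    centroids of the remaining detections."""
    binary = pred_map >= grey_threshold
    labelled = label(binary)
    centroids = [region.centroid for region in regionprops(labelled)
                 if region.area > area_threshold]
    return centroids
```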
Fig. 15 The \(F_1\)-score of the deep learning network trained with Tversky loss (\(\alpha =0.7\)) when using differing grey thresholds and area thresholds for segmentation post-processing. The max \(F_1\)-score is at the area threshold of 0 pixels and grey threshold of 254
While the network using Tversky loss with \(\alpha =0.7\) had a slightly higher best \(F_1\)-score (0.678) than the network using Tversky loss with \(\alpha =0.6\) (0.663), its validation intersection over union was less stable and it appeared to be overfitting. Therefore, Tversky loss with \(\alpha =0.6\) was selected for use in the remaining tests.
Backbone Regularization
The second set of tests examined the impact of enabling the DropPath regularization layers already present in the ConvNeXt backbone. As with the previous loss function tests, DropOut (\(r=0.15\)) was enabled for the upscaling and final layers of the network. The backbone layers were again locked for the first half of training, but the total number of epochs was increased to 2048 to ensure any overfitting would be evident. Backbone DropPath values of 0.0, 0.1, and 0.2 were tested. The validation intersection over union during training is shown in Fig. 16. The best \(F_1\)-score for each test is shown in Table 2. The loss and validation loss during training are shown in Figs. 17 and 18, respectively.
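For reference, a sketch of what a DropPath (stochastic depth) layer does is shown below; this is illustrative only and is not the exact implementation inside the pre-trained ConvNeXt blocks.

```python
import tensorflow as tf

class DropPath(tf.keras.layers.Layer):
    """Stochastic depth: during training, randomly zero an entire residual
    branch for a given sample and rescale the surviving samples."""
    def __init__(self, drop_prob=0.2, **kwargs):
        super().__init__(**kwargs)
        self.drop_prob = drop_prob

    def call(self, x, training=False):
        if not training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast across all other axes.
        shape = [tf.shape(x)[0]] + [1] * (len(x.shape) - 1)
        mask = tf.floor(keep_prob + tf.random.uniform(shape, dtype=x.dtype))
        return (x / keep_prob) * mask
```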
Fig. 16 The validation intersection over union during training of the deep learning network with different levels of backbone (downscaling layer) regularization
Enabling DropPath in the backbone layers improved training performance in terms of validation intersection over union, validation loss, and \(F_1\)-score. Heavier regularization resulted in slower convergence but better performance by the end of training. The performance increase was expected, given the unstable nature of Tversky loss as an intersection-over-union based metric, which was also demonstrated in Fig. 11. Over the increased 2048 epoch training period, the network with DropPath \(d=0.0\) was clearly overfitting, and the network with \(d=0.1\) was beginning to overfit and decrease in validation intersection over union by the end of training. For this length of training, DropPath \(d=0.2\) appeared to work very well, as it was the most stable, did not overfit, and performed the best by the end of training. Increasing DropPath further would result in slower training convergence. DropPath \(d=0.2\) for the backbone layers was used for the remaining tests.
Table 2 The best \(F_1\)-scores of the deep learning network when trained using different backbone DropPath probabilities
The \(F_1\)-score for the network trained with a backbone DropPath of 0.2 at different area threshold and grey threshold values is shown in Fig. 19. As expected, this shows that the more heavily regularized segmentation network produces more robust and generalizable classifications; minor changes to the size of detections (impacting the area threshold) or major changes to grey value (impacting the grey threshold) would not significantly change the \(F_1\)-score.
Upscaling Node Regularization
The next set of tests examined the viability of using DropPath regularization for the upscaling nodes (the nodes coloured red in Fig. 5) instead of DropOut. The network only converged with DropOut; further details on this set of tests are included in the appendices (Online Resource 1). Additionally, details of an inconclusive set of tests on regularization of the final layers (as depicted in Fig. 4) are also contained in the appendices (Online Resource 1).
Generated Data
The final set of tests investigated the effect of using artificial training data generated by the method described in Section “Data Generation”. This data generation method was used to generate 1071 patches for use as additional training data. Given the simpler nature of the generated images, they were used to make the early phases of training easier and were entirely removed by the end of training. The exact number of generated samples in each phase of training for this set of tests is shown in Table 3. Each of the 115 real patches was sampled 4 times per epoch to give 460 real samples per epoch, whereas each generated patch was sampled only once, with the patch count varying each phase. The backbone layer weights of the network were locked only for phase 1.
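A sketch of how this per-phase sampling might be assembled is given below; the per-phase generated-patch count is a placeholder argument, since the actual counts are those listed in Table 3.

```python
import random

REAL_REPEATS = 4  # each of the 115 real patches sampled 4 times -> 460 per epoch

def build_epoch(real_patches, generated_patches, n_generated):
    """Build one epoch for a given training phase: every real patch repeated
    REAL_REPEATS times plus n_generated generated patches sampled once each
    (n_generated decreases each phase and reaches 0 by the final phase)."""
    samples = list(real_patches) * REAL_REPEATS
    samples += random.sample(generated_patches, n_generated)
    random.shuffle(samples)
    return samples
```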
As previously discussed, DropOut (\(r=0.15\)) was enabled for the upscaling layers of the network. Different regularization methods for the final section of the network (as shown in Fig. 4) were also tested: DropOut \(r=0.15\), DropPath \(d=0.1\), DropPath \(d=0.2\), and no regularization. Due to the inconclusive nature of the previous final section regularization test (in the appendices), each test was run 10 times, and the metrics were then averaged for each set of parameters. One test using no generated data and equivalent training length was added for comparison. The validation intersection over union during training is shown in Fig. 20. The best \(F_1\)-score for each test is shown in Table 4. The loss and validation loss during training are shown in Figs. 21 and 22, respectively.
Fig. 17 The loss during training of the deep learning network with different levels of backbone (downscaling layer) regularization
Fig. 18 The validation loss during training of the deep learning network with different levels of backbone (downscaling layer) regularization
The \(F_1\)-score for one instance (out of 10) of the highest performing configuration (using generated data and no final section regularization) at different area threshold and grey threshold values is shown in Fig. 23. Although there was a small amount of variation between instances, the decision surface for this instance shows good generalizability in both area threshold and grey threshold, which correspond to generalizability in detection size and intensity, respectively. The \(F_1\)-score only begins to significantly drop at area thresholds of 4 or higher and is stable across any grey threshold.
Inclusion of the generated data facilitated much faster convergence initially, with the tests using no artificial data consistently performing worse over the first 500 epochs. This changed at around 1000 epochs, by which point the tests without artificial data had converged better. By the end of training, the artificial data tests again converged better, which is also evident in their higher final \(F_1\)-scores. Including the simpler generated data therefore appears to have been helpful for boosting performance, and altering the training phase scheme to shorten or omit phases 3–5 could help to prevent the slow training around epoch 1000. This would come with the risk of destabilizing the training process, since the change in training data would happen less gradually, but having fewer (albeit larger) changes in the training set could also increase stability. The average \(F_1\)-score for the tests with no final section regularization was the highest, implying that there is sufficient and more effective regularization in the earlier layers of the network.
A final test was conducted to evaluate the performance of the network when trained entirely on one expert’s annotated data and then tested against the other expert’s annotated data. Any patches with annotations by both experts were used in the training dataset, with the corresponding expert’s annotations used as ground truth. This resulted in one test having a training dataset of 113 patches and a test dataset of 31 patches and the other test having a training dataset of 50 patches and a test dataset of 94 patches. This test was intended to assess how well the deep learning network could generalize the subjective annotations of a single expert; however, the results should be interpreted as indicative only. This is because the network’s capability to generalize would be limited when trained using only a single expert’s data, especially with as few as 50 training images. The results of these tests are shown in Table 5. Generated data was used according to the training phases in Table 3.
Fig. 19 The \(F_1\)-score of the deep learning network trained with backbone DropPath (\(d=0.2\)) when using differing grey thresholds and area thresholds for segmentation post-processing. The max \(F_1\)-score is at the area threshold of 2 pixels and grey threshold of 253
Table 3 The amount of each type of training sample used in each phase of network training when using generated data
Fig. 20 The validation intersection over union during training of the deep learning network using generated data and differing final section regularization methods. Due to the larger number of patches per epoch in early phases of tests using artificial data, the epoch data has been standardized to a length of 460 training samples per epoch for these runs to make them directly comparable with the test with no artificial data
Table 4 The best \(F_1\)-scores of the deep learning network when trained using generated data and differing final section regularization methods
Fig. 21 The loss during training of the deep learning network using generated data and differing final section regularization methods. Due to the larger number of patches per epoch in early phases of tests using artificial data, the epoch data has been standardized to a length of 460 training samples per epoch for these runs to make them directly comparable with the test with no artificial data
Fig. 22 The validation loss during training of the deep learning network using generated data and differing final section regularization methods. Due to the larger number of patches per epoch in early phases of tests using artificial data, the epoch data has been standardized to a length of 460 training samples per epoch for these runs to make them directly comparable with the test with no artificial data
Fig. 23 The \(F_1\)-score of one instance of the deep learning network trained with generated data and no final section regularization when using differing grey thresholds and area thresholds for segmentation post-processing. The max \(F_1\)-score is at the area threshold of 2 pixels and grey threshold of 246
The first generalization test, which had a much larger amount of available training data, showed decent performance with an \(F_1\)-score of 0.624. The second test showed much worse performance, with an \(F_1\)-score of only 0.314. Overall, these results show that the deep learning network has the potential to learn features that generalize well, but that it cannot train well on as few as 50 training patches.
Comparison to Literature
Most of the existing methods for RNAscope dot segmentation are not fully automated or do not fully segment RNAscope dots, instead leaving some groups as clusters [10]. The commercial methods do not have a verified accuracy for correct identification of RNAscope dot positions, and the mechanisms by which they operate are not in the public domain. Compared to these existing methods, our deep learning method offers a fully automated approach with no user configuration required, and it segments every RNAscope dot rather than leaving some groups as clusters. The feature-based segmentation method by Davidson et al. [13] offered automated, full RNAscope dot segmentation and was therefore directly comparable to our method. Table 6 shows a comparison of our method to expert performance and to the grey level texture feature method developed by Davidson et al. Our deep learning method outperformed both other methods by a considerable margin.