Cervical lymph node metastasis prediction from papillary thyroid carcinoma US videos: a prospective multicenter study

This study had two phases: a retrospective model pre-training phase and a prospective model fine-tuning phase. The fine-tuning phase comprised three steps: training, test, and validation. Figure 1 shows the structure and development process of our multi-scale, multi-frame, and dual-direction deep learning (MMD-DL) model. In both the pre-training and fine-tuning phases, patients underwent thyroidectomy after US examination, and the postoperative pathological reports were used as the gold standard to determine whether the thyroid cancer had metastasized.

Fig. 1

Illustration of the multi-scale, multi-frame, and dual-direction deep learning (MMD-DL) model. a Flowchart of the training stages of MMD-DL. b Architecture of the pre-trained feature extractor. c Architecture of MMD-DL with transverse and longitudinal ultrasound videos as inputs and lymph node metastasis probability as the output

Retrospective model pre-training phase

From September 2017 to December 2018, PTC patients who underwent thyroid examination and surgery at the First Medical Center of the Chinese PLA General Hospital were enrolled to pre-train the DL model. Gray-scale US images of the maximum transverse and longitudinal sections were collected by radiologists with more than five years of US experience.

The inclusion criteria were (1) patients with PTC confirmed after thyroidectomy; (2) patients who underwent thyroid US examination within 2 weeks before surgery; and (3) patients who received a thyroidectomy and lymph node dissection consistent with the Chinese Guidelines [25], with the ground truth of LNM determined by pathology.

The exclusion criteria were (1) patients who received a biopsy before US examination; (2) insufficient US image quality or an incomplete set of US videos; (3) other pathological types of thyroid cancer, such as medullary carcinoma and undifferentiated carcinoma; (4) presence of distant metastases; and (5) patients who underwent surgery in other hospitals.

Both transverse and longitudinal US images were used for pre-training so that the feature extractors learned a basic perceptual ability for diagnosing LNM.

Prospective model fine-tuning phase

Patient enrollment and sample size

The multicenter prospective study was approved by the institutional ethics committees of all participating hospitals (ethics committee approval number S2019-212–06) and was registered under clinical trial registration number ChiCTR1900025592.

Patients with suspected PTC from four centers, namely the First Medical Center of the Chinese PLA General Hospital, the Fourth Medical Center of the Chinese PLA General Hospital, Beijing Tongren Hospital, and China–Japan Friendship Hospital, were consecutively enrolled from January 2019 to July 2021. All of the centers are located in Beijing.

All patients were operated on by surgeons with more than 15 years of experience in thyroid surgery and an annual surgical volume of more than 1000 cases. All pathological specimens were sent to the pathology department for paraffin fixation and histological analysis by two or more experienced pathologists. The inclusion and exclusion criteria were the same as those listed above for the pre-training phase.

We assumed that at least 30% of enrolled patients would have cervical LNM. On this basis, the sample size required to estimate a receiver operating characteristic (ROC) curve was calculated as no fewer than 217 patients (α: 0.05, 1-β: 0.85, width of the confidence interval: 0.125, confidence level: 0.95). Allowing for an expected dropout rate of 20%, at least 261 patients needed to be enrolled.
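The dropout adjustment above can be reproduced with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes the figure of 217 was inflated by multiplying by (1 + dropout rate), which is the convention that yields 261. Dividing by (1 − dropout rate) is a common alternative and gives a larger target.

```python
import math


def inflate_by_dropout(required: int, dropout_rate: float) -> int:
    """Enrollment target when the required size is multiplied by (1 + dropout).

    ASSUMPTION: this multiplicative convention is inferred from the numbers
    in the text (217 -> 261); the source does not state the formula.
    """
    return math.ceil(required * (1 + dropout_rate))


def divide_by_retention(required: int, dropout_rate: float) -> int:
    """Alternative convention: divide the required size by (1 - dropout)."""
    return math.ceil(required / (1 - dropout_rate))
```

Calling `inflate_by_dropout(217, 0.20)` returns 261, matching the enrollment target stated in the text, whereas `divide_by_retention(217, 0.20)` returns 272.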

Clinical pathological data and US features

Clinical characteristics, including age, sex, number of tumors, tumor size, location, presence of Hashimoto thyroiditis, type of thyroidectomy, type of lymph node dissection, clinical T stage, and clinical N stage, were obtained from the patients’ medical records. Pathological T stage and lymph node metastasis results were obtained from the patients’ pathological reports after surgery. The American Joint Committee on Cancer staging of thyroid cancer was applied to evaluate the TNM stage [26].

The multicenter standardized US videos were acquired by radiologists with more than 6 years of experience using a Supersonic Aixplorer System with an S15–4 linear-array transducer (SuperSonic Imaging, France) at a center frequency of 8 MHz (range, 4 to 15 MHz). Patients lay supine with the neck extended and the head turned toward the contralateral side. The gain was 40%, the depth was 4 cm, the frame rate was 40 Hz, and the focus was set at the target depth. Dynamic acquisition started from the edge of one side of the thyroid lobe and swept evenly and slowly until it reached the other side of the lobe. The scanning direction was fixed, from top to bottom and from left to right, with no back-and-forth scanning. More details on standardized US video acquisition are given in Additional file 1: Method S1.

US features of the tumors were obtained from US examinations according to the American College of Radiology Thyroid Imaging, Reporting and Data System [27].

DL model development

DL model development is divided into two stages, as shown in Fig. 1a. In the first stage, we pre-train a feature extractor using the retrospective US images, the structure of which is shown in Fig. 1b. In the second stage, based on the pre-trained feature extractor, we build a multi-scale, multi-frame, and dual-direction deep learning (MMD-DL) model and fine-tune it. The structure of MMD-DL is shown in Fig. 1c.

In the first stage, the feature extractor adopts three networks to extract feature vectors from the US images at three scales, namely large, middle, and small. ResNet18 is adopted as the network because of its popularity and resistance to overfitting. The multi-scale design helps the model focus on the characteristics of the lesion's exterior, edge, and interior areas and avoid omitting features in important regions. The features are fused by several fully connected layers to output the diagnostic result.
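The multi-scale idea can be illustrated with a small cropping sketch. The text does not specify how the three scales are derived, so everything below is an assumption made for illustration: the `Box` type, the function names, and the scale factors (2.0, 1.2, 0.6) are hypothetical, chosen so that the large crop includes tissue around the lesion, the middle crop roughly follows its edge, and the small crop covers its interior.

```python
from typing import Dict, NamedTuple


class Box(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int


# HYPOTHETICAL scale factors; the source does not give the actual values.
SCALE_FACTORS = {"large": 2.0, "middle": 1.2, "small": 0.6}


def scaled_crop(lesion: Box, factor: float, width: int, height: int) -> Box:
    """Scale the lesion box about its center by `factor`, clipped to the image."""
    cx = (lesion.x0 + lesion.x1) / 2
    cy = (lesion.y0 + lesion.y1) / 2
    half_w = (lesion.x1 - lesion.x0) * factor / 2
    half_h = (lesion.y1 - lesion.y0) * factor / 2
    return Box(
        max(0, round(cx - half_w)),
        max(0, round(cy - half_h)),
        min(width, round(cx + half_w)),
        min(height, round(cy + half_h)),
    )


def multi_scale_crops(lesion: Box, width: int, height: int) -> Dict[str, Box]:
    """Return one crop box per scale; each would feed its own network."""
    return {
        name: scaled_crop(lesion, factor, width, height)
        for name, factor in SCALE_FACTORS.items()
    }
```

In such a setup, each of the three crops would be resized to the network input size and passed to its own ResNet18, with the resulting feature vectors concatenated before the fully connected layers.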

In the second stage, MMD-DL uses two branches to extract features from the transverse (horizontal) and longitudinal (vertical) scans, respectively, after US video preprocessing (Additional file 2: Method S2). Each branch consists of a multi-scale feature extractor that has the same structure as the pre-trained feature extractor and is initialized with the same weights at the start of fine-tuning. To fuse temporal features, the feature extractor processes, one by one, the five frames obtained from preprocessing each US video. Finally, a fully connected layer fuses all features extracted from the multi-scale, multi-frame, and dual-direction video frames and outputs the diagnostic probability. Details of the model and its training strategy are given in Additional file 3: Method S3.
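The data flow of the second stage can be sketched in plain Python to make the bookkeeping explicit: two directions, five frames per video, and three scales per frame give 30 feature vectors that a single fully connected layer fuses into one probability. This is a simplified sketch and not the authors' implementation. The even-spacing rule for choosing frames and the sigmoid output are assumptions (the actual preprocessing is described in Method S2), and the function names are hypothetical.

```python
import math
from typing import List, Sequence

N_DIRECTIONS = 2  # transverse and longitudinal videos (from the text)
N_FRAMES = 5      # frames kept per video (from the text)
N_SCALES = 3      # large, middle, small (from the text)


def sample_frame_indices(n_frames: int, k: int = N_FRAMES) -> List[int]:
    """Pick k evenly spaced indices: the midpoint of each of k equal segments.

    ASSUMPTION: even spacing is illustrative; the source defers the actual
    frame selection rule to Method S2.
    """
    if n_frames < k:
        raise ValueError("video has fewer frames than requested")
    return [int((i + 0.5) * n_frames / k) for i in range(k)]


def fuse(features: Sequence[Sequence[float]],
         weights: Sequence[float],
         bias: float) -> float:
    """Concatenate all feature vectors and apply one fully connected layer.

    ASSUMPTION: a sigmoid maps the single output unit to a probability.
    """
    flat = [value for vector in features for value in vector]
    if len(flat) != len(weights):
        raise ValueError("weight vector does not match concatenated features")
    logit = sum(w * v for w, v in zip(weights, flat)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

With a 100-frame video, `sample_frame_indices(100)` selects frames 10, 30, 50, 70, and 90. The `features` argument of `fuse` would then hold `N_DIRECTIONS * N_FRAMES * N_SCALES` vectors, one per direction, frame, and scale.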

Then, the model was applied to the prospective US videos for testing and validation. The training, test, and validation steps used non-overlapping patient populations. The methods used to measure model performance are described in Additional file 4: Method S4.

Impact of AI assistance on radiologists with different levels of experience

Two junior radiologists (Yi Mao and Guozheng Zhao) with 1 year of experience in thyroid US, two intermediate radiologists (Yan Wang and Lin Yuan) with 5 years of experience in thyroid US, and two senior radiologists (Mingbo Zhang and Mengjie Song) with over 8 years of experience in thyroid US were invited to interpret the same US videos of the test and validation cohorts. The radiologists were shown ultrasound videos that they had not seen before. After they gave the prediction of LNM based on their evaluation of US videos, the AI-predicted probability and AI-generated heatmap were provided to them as assisting information (Additional file 5: Method S5). Then, they performed the second-round diagnosis. Their predictive performances with and without AI assistance were compared.

Statistical analysis

Categorical variables were presented as frequency (percentage), and normally distributed continuous variables were presented as mean with a 95% confidence interval (CI). Categorical variables were compared using the χ2 test, and normally distributed continuous variables were compared using Student’s t-test. The area under the ROC curve (AUC) was used to measure predictive performance. All of the statistical analyses above were performed with SPSS software (version 26, Chicago, IL). The DeLong test was used to compare AUCs in GraphPad Prism (version 8, CA, USA). A two-sided P < 0.05 was considered to indicate statistical significance.
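For readers who want to see what the AUC measures, the following is a minimal pure-Python sketch of the empirical AUC, computed as the fraction of (metastatic, non-metastatic) patient pairs that the model ranks correctly, with ties counted as half. It is not the authors' analysis code, which used SPSS; the function name `empirical_auc` is hypothetical, and the DeLong comparison of two AUCs is not implemented here.

```python
from typing import Sequence


def empirical_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Empirical AUC via pairwise comparison (Mann-Whitney formulation).

    labels: 1 for pathology-confirmed LNM, 0 otherwise.
    scores: predicted LNM probabilities, in the same order as labels.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    # Each correctly ordered pair scores 1, each tie scores 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

For example, with labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75.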
