Data-driven risk stratification and precision management of pulmonary nodules detected on chest computed tomography

Participant recruitment

The retrospective primary dataset comprised participants from the MCC in the West China Hospital of Sichuan University, which included subjects aged 18 years and older who underwent a voluntary chest CT imaging examination from 2013 to 2022. The independent testing dataset was derived from the MSC. This cohort was initiated between 2019 and 2022 in various communities across Sichuan Province, China (clinical trial registration number: ChiCTR1900024623). It was conducted at multiple sites, including Longquan District in Chengdu, Pidu District in Chengdu, Mianzhu City and Ganzi Tibetan Autonomous Prefecture. As an observational study, the lung cancer screening program of the MSC enrolled local residents who were 40 years of age or older and volunteered to undergo chest CT scans. The study received ethical clearance from the Ethics Committee of West China Hospital, Sichuan University (no. 2023.2287).

Participants were eligible to be included in the study only if they met all the following criteria: (1) pulmonary nodules diagnosed by chest CT or LDCT scans; (2) noncalcified and solitary pulmonary nodules including solid nodules, mGGNs (both ground-glass and solid components) and pGGNs; (3) agreed to finish the clinical questionnaire; and (4) agreed to provide written informed consent. Participants were excluded from study enrollment if they met any of the following criteria: (1) received any surgical resection of pulmonary nodules before enrollment, (2) combined with other serious lung diseases such as pulmonary fibrosis or bronchiectasis, and (3) chest CT image quality failed to meet the required standards, CT volumes with <50 slices or patients with no qualifying volumes.

Gold standard of risk level

The gold standard for determining the malignancy risk of pulmonary nodules was established by consistently considering pathological findings and clinician ratings, utilizing the rules outlined in Extended Data Fig. 2. First, nodules pathological diagnosed as malignancy were categorized as extremely high risk (label 4). Then, the nodules pathologically confirmed benign or with VDT exceeding 600 days during 2-year follow-up were considered benign and classified as not extremely high risk (labels 1–3). Finally, the nodules were initially assessed by three junior doctors (3–5 years of experience), followed by quality control and final review from two senior physicians (over 10 years of experience).

Clinical information collection

Each participant provided written informed consent and completed a questionnaire on risk factors for lung cancer, including age, sex, smoking status, history of cancer, family history of cancer and family history of lung cancer. The pathological diagnostic outcomes for the participants were obtained from the hospital information system or acquired through telephone follow-up.

Chest CT images were taken using the following equipment: United Imaging Healthcare uCT 960+, uCT 780, uCT 528 (UNICOM vehicle-mounted CT), Siemens Somatom Sensation 16 spiral CT, Somatom Definition Flash dual-source CT, Somatom Sensation AS128 CT, GE Optima CT680, etc. CT image scanning and reconstruction parameters: tube voltage ~100–120 kV, tube current 120 mAs, collimator 128 × 0.6 mm, pitch 0.6, matrix 512 × 512, scan time ~4–5 s, scan layer thickness ~1–5 mm, lung window width ~1,500–1,800 HU and window level ~−500 to −600 HU, mediastinal window width ~250–350 HU and window level ~40–50–HU. CT images stored in digital imaging and communications in medicine (DICOM) format were interpreted by radiological physicians. A panel of three radiologists, including two junior radiologists and one senior radiologist with substantial expertise, collaboratively performed the Lung-RADS v2022 grading for the pulmonary nodules of enrolled participants.

Data preprocessingImage resampling strategy

The images were resampled to a target spacing of 1 × 1 × 1 mm3 to obtain local structural information using an interpolation strategy. Linear and nearest-neighbor interpolation methods were applied for isotropic and anisotropic images, respectively, to suppress resampling artifacts.

Intensity normalization

Images were normalized with lung window (window width: 1,500 HU and window level: −600 HU) using the z-score standardization method, and the normalized intensity values were clipped to (−1, 1) to achieve fast convergence of the model.

Handling imbalanced data

Imbalanced data distribution occurred in the lung cancer screening data, where the majority of the data (~98%) were benign, while the minority were malignant. To address this challenge, we calculated weights for each class based on the inverse frequency of the class in the dataset. This means that classes with fewer samples receive higher weights, making them more influential in the learning process72.

Details in AI detection model

A convolutional neural network was used for the intelligent detection of each pulmonary nodule, resulting in a patch with a size of 96 × 96 × 96, as detailed previously41. The resulting patches were then used as inputs to develop the risk stratification model.

Model development

The development of the C-Lung-RADS pipeline comprised two phases: phase 1 for initial risk classification and phase 2/2+ for identifying the suspicious malignant nodules.

Phase 1: initial risk classification of nodules by a classification tree

A classification tree model was developed to identify various risks of nodules considering nodal intrinsic properties (that is, density and size). A grid search was applied to define optimal thresholds for four-category risk stratification.

Construction of the classification tree

In phase 1 training, a classification tree was constructed with 36,052 pulmonary nodules from as many participants, where the inputs were nodule’s density, size and solid component’s size in mGGNs, and the output was the risk level (~1–4). A grid search approach was introduced to determine the optimal splitting nodes of the classification tree for effective four-category discrimination. The process involved three steps:

First, all the continuous variables were discretized to establish the classification rules.

Second, for each density type of nodule, a univariate classification tree was trained using nodule size as the sole feature. Rules such as maximum depth, minimum samples for node splitting and class weight were recorded. Size splitting points between adjacent risk levels and their neighbor size were considered as alternative thresholds. A grid search identified top-performing threshold combinations for risk stratification, resulting in N threshold candidates for each nodule type.

Finally, for all density types of nodules, a multivariate classification tree was trained. Based on the threshold candidates of three types of nodules, grid research was conducted to create comprehensive combinations. The performance was calculated for each combination, with the best-performing one regarded as the final rule for risk stratification in phase 1.

Evaluation metrics

The performance of the risk stratification rule was evaluated by classification tree loss, receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC) and information value (IV). The ROC curve reflected the trade-off between the sensitivity and specificity of the model, with a higher AUC representing better performance. IV measured the predictive ability of a categorical variable x to the target binary outcome. The computation of IV depended on the weight of evidence (WOE), which could reflect the difference in the positive–negative ratio between the current group and the overall sample. Detailed definitions of WOE and IV are as follows:

$$}_=\mathrm\frac}P}_}}P}-\mathrm\frac}N}_}}N},$$

(1)

$$}_=\left(\frac}P}_}}P}-\frac}N}_}}N}\right)\times }_=\left(\frac}P}_}}P}-\frac}N}_}}N}\right)\times \left(\mathrm\frac}P}_}}P}-\mathrm\frac}N}_}}N}\right),$$

(2)

where \(X\) is the group of categorical variables from 1 to 4, i is the current category, \(\#\) denotes the number, \(P\) refers to overall positives, \(_\) refers to the positives in the \(}\) category, \(N\) refers to overall negatives and \(_\) refers to the negatives in the \(}\) category. A reasonable classification scheme entailed \(}_\) increasing with i, indicating a higher malignancy proportion with escalating risk levels. Therefore, the initial four-category risk stratification rule was achieved in phase 1, with 1 representing low risk, 2 representing mid risk, 3 representing high risk and 4 representing extremely high risk, used for screening of non-low-risk nodules.

Phase 2: malignancy evaluation by a deep learning model

Deep learning algorithms have indeed shown promising results in identifying malignancies, differentiating cancer subtypes and predicting tumor invasiveness73,74,75. In phase 2, a DCNN model was developed to generate image-level malignant probabilities of nodules.

Construction of the DCNN model

During the training process of phase 2, a DCNN model was developed using 5,452 pulmonary nodules from as many participants. Nodule images were input to predict malignancy probability, aiming to differentiate malignant nodules from benign ones.

The DCNN architecture included an input block, four continuous down-sampling blocks and an output block, referring to a prior publication76. Briefly, (1) the input block was a three-dimensional convolutional layer for converting images into semantic representations, (2) the four down-sampling blocks included four convolutional layers for generating feature maps, (3) a global average pooling (GAP) layer regularized the network to prevent overfitting and (4) a fully connected layer as the output block was adopted to generate the malignancy probability for nodules, which was further translated into a malignant or benign classification. Notably, the CAM served as the attention map to guide the network to focus on the nodule region with visual interpretability. To improve the classification performance of the DCNN, a loss function \(}\) was calculated, composed of cross-entropy loss (\(}}_}\)) and CAM loss (\(}}_}\)), as defined below:

$$ }_}}=\frac\Vert }\_}^ (x,y)-}}_^ (x,y)\Vert__}},$$

(5)

$$}}_}=-\log \frac^_\right)}}_^_\right)}},$$

(6)

$$_=\frac\sum _}_\left(x,y\right),$$

(7)

where \(}\) denotes the combined ratio; \(}}_}\) measures the Dice similarity coefficient between nodule_mask and \(}_\), driving the network to learn more spatially discriminative feature representations and to focus on nodule regions; \(}\_}^\) and \(}_^\) are the minimum–maximum normalization of \(}\_}\) and \(}\), respectively; \(}_\) is defined as the CAM for class i; and \(}_(x,y)\) indicates the importance of the activation at \((x,y)\), leading to an image belonging to class i. H and W denote the height and width of the nodule mask, respectively; l1 is the L1 norm, a type of norm used in mathematics to measure the size of a vector; e is Euler’s number, a mathematical constant approximately equal to 2.71828; zi represents the activation value of class i in the CAM; and j is an index variable used in the softmax function to represent different classes or categories.

During the training process, other parameters were carefully set. The learning rate used to refine the network was reduced from a large initial value (1 × 10−3) to a small value (1 × 10−5). The Adam optimizer was set to betas of (0.9, 0.999) and epsilon of 1 × 10−8. Data augmentation techniques such as shifting, scaling, flipping, cropping, rotating and adding noise were employed to improve model robustness.

Model calibration

The output of malignancy probability was calibrated by Platt scaling and temperature scaling in the training stage. By adjusting the scaling and intercept parameters of the logistic regression model, the calibrated probabilities could better reflect the true likelihood of each class. Temperature scaling involved adjusting the temperature parameter in softmax function to scale down or up the predicted probabilities of the classes. With these strategies, the output probabilities of the DCNN model could be well calibrated and interpreted as reliable estimates of confidence in its predictions.

Inference configuration

The DCNN was implemented in PyTorch with one Nvidia Tesla V100 graphics processing unit. We randomly selected 20% of the primary dataset as an internal testing dataset, with its loss computed at the end of each training epoch. The training process was considered converged if the loss stopped decreasing within ten epochs.

Phase 2+: multidimensional diagnosis by GBR

To better identify malignant nodules, a multivariable regression model was constructed integrating multidimensional information from imaging, clinical and follow-up features.

Construction of the GBR model

In phase 2+ training, a GBR model was developed to predict the final risk level of nodules by integrating multidimensional information. Specifically, the imaging feature referred to AI-predicted malignant probability from phase 2, and clinical features included lung cancer risk factors (that is, age, sex, smoking status, history of cancer, family history of cancer and family history of lung cancer). Follow-up features consisted of the specific growth rate (SGR) and VDT, calculated from follow-up pair CT images77:

$$}}=\frac2}}=\frac2\times \Delta T}(_/_)},$$

(9)

where \(_\) and \(_\) are the nodule volumes of a follow-up pair of images quantified from AI-based segmentation results78 and \(\Delta T\) represents the time interval between scans. SGR was decomposed into its positive (SGR+) and negative (SGR−) components. This strategy enabled the model to independently evaluate the trends of growth and reduction in nodule volume changes. Multidimensional features were fused as input for the GBR model to output malignant probability and corresponding risk level. The GBR model was trained on 15,290 CT examinations (including follow-up scans) from 5,452 participants and represented as follows:

$$G\left(x\right)=}(\;_}\left(I\;\right)+_}\left(_}\right)+_}\left(_}\right)),$$

(10)

where \(G\left(x\right)\) is the output malignancy probability of the GBR model, \(_}\) is the DCNN and the input i is the CT nodal patch to generate the logits of malignancy probability. \(_}\) and \(_}\) represent the clinical features and coefficients, while \(_}\) and \(_}\) represent the nodule follow-up features and coefficients. Three items in the formula served as weak prediction models, trained sequentially to compensate the weakness of their predecessor and assembled together to become the ultimate trained model. The least absolute shrinkage and selection operator (LASSO) algorithm was used to select the most important features and generate optimal coefficients. Logistic loss was used as the classification loss function. Finally, the GBR model was represented as follows:

$$\begin}\left(1\times }\; }+0.11\times }\left(}\right)+3.00\times 10^\times }\right.\\\qquad\quad+\,0.07\times }+0.19\times }\; }\; }+0.09\\\qquad\quad\left.\times\, }\; }\; }\; }+185\times }_\right).\end$$

(11)

First, continuous variables were normalized to 0–1. Fivefold cross-validation was applied to avoid grouping bias and model overfitting.

Second, a GBR algorithm started creating a single leaf based on the imaging model (\(}(_}\left(I\right))\)), with the error between the output and label \(y\) serving as the learning target for the second model.

Third, when training the second model, LASSO regression was used to select the most important clinical features (\(_}\)) and generate the optimal coefficients (\(_}\)). To eliminate the strong correlation between sex and smoking, clinical features were fitted in two steps. First, sex was balanced through subsampling to fit the regression model for other clinical features. This step was repeated 50 times, using a bagging strategy to create an ensemble of multiple models. Second, a gradient-boosting strategy was employed for univariate regression on the sex dimension. This updated the model \(}(_}\left(I\right)+g_}}\left(_}}\right))\), and the error between the output and label \(y\) served as the learning target for the third model.

Fourth, LASSO regression was used in training the third model to select critical follow-up features (\(_}\)) and optimize coefficients (\(_}\)).

Finally, a more robust multidimensional model was generated (\(}(\;_}\left(I\;\right)+_}\left(_}\right)+_}\left(_}\right))\)) to distinguish malignant nodules effectively compared with single- and dual-dimensional models. The output provided the malignancy probability for each nodule. Nodules with a probability of malignancy below 0.5 were predicted to be benign and retained their original risk levels, while those with a higher probability of malignancy were predicted to be malignant and assigned a risk level of 4. Therefore, the C-Lung-RADS criteria for four-category risk stratification for pulmonary nodules were finalized.

Model validationInternal testing of models

For phase 1, the internal testing dataset included 9,012 nodules from 9,012 participants, which were screened by the classification tree to identify the initial risk level of nodules. For phases 2 and 2+, the internal testing dataset included 1,351 nodules from 1,351 participants, first graded by the DCNN to generate malignant probability, which combined with clinical and follow-up features as the input of the GBR model to determine the final risk level.

Independent testing of models

In phase 1, a total of 16,375 CT examinations from 14,437 participants were decided by the classification tree. In phase 2, the CT images of 1,951 nodules (from 3,327 examinations) were fed into DCNN model to generate malignancy probability. This, along with clinical and follow-up features, served as input for the GBR model to determine the final risk level.

Evaluation metrics

A variety of metrics were assessed including ROC curves and corresponding AUC values, accuracy, sensitivity, specificity, FPR, positive predictive value and negative predictive value.

Statistical analysis

Shapiro–Wilk tests were used to check the normal distribution of continuous variables. Continuous variables that were approximately normally distributed were represented as mean ± s.d. Continuous variables with asymmetrical distributions were represented as median (25th, 75th percentiles). Categorical variables were expressed as counts and percentages and compared using chi-square tests. The performance of the classification tree was compared with Lung-RADS v2022 using ordinary two-way analysis of variance (ANOVA) followed by Sidak’s multiple comparison tests. To quantitatively compare the size of the solid component between the four different risk nodules, statistical analyses were performed using Kruskal–Wallis H tests followed by Dunnett’s multiple comparison tests. In addition, Mann–Whitney U tests were used to compare the malignancy probability and SGR+ distribution between the benign and the malignant nodules. To compare the malignancy probability among three models (single-, dual- and multidimension models), Friedman tests followed by Dunnett’s multiple comparison tests were applied. Two-tailed adjusted P values were obtained and represented by asterisks, with *P < 0.05, **P < 0.01 and ***P < 0.001. All statistical analyses were implemented using IBM SPSS 26.0. All plots were drawn by GraphPad Prism 9 and Origin 2021. All figures were created by Adobe Illustrator 2023.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

留言 (0)

沒有登入
gif