As depicted in Fig. 1, the overall procedure for automatically grading KOA severity based on X-ray images consists of three main components: segmentation, feature extraction, and classification using a hierarchical classification method.
Fig. 1Overview of the automatic knee osteoarthritis severity grading based on X-ray images. The total joint space, the lateral joint space, the medial joint space, and osteophyte were segmented using U-Net model, respectively. The mask of subchondral bone was generated based on the segmentation result of total joint space. A hierarchical classification method was employed to assessment KOA severity using geometric and radiomic features. KOA: knee osteoarthritis. KL: Kellgren-Lawrence
ParticipantsThe present study adhered to the principles outlined in the Declaration of Helsinki and received approval from the West China Hospital, Sichuan University. Informed consent forms were signed by all participants included in the study.
A total of 5317 knee joint X-ray images from 4074 patients (mean ± standard deviation (SD) age = 52.08 ± 14.35 years; 1268 males) between July 2009 and April 2021 were analyzed in the current study. Among the 4074 patients, 2848 had only one X-ray image, while the remaining patients had multiple X-ray images. The training cohort comprised 80% of the patients, randomly selected (3259 patients, 4247 radiographs), while the remaining patients formed the testing cohort (815 patients, 1070 radiographs). All X-ray images from a patient were included in only one cohort. Table 1 provides an overview of the demographic characteristics of the training and testing cohorts.
Table 1 Demographic characteristics of training and testing cohortsAll patients included in the study met the following inclusion criteria: (1) availability of clinical information and (2) availability of images. Patients who had (1) missing images, (2) low image quality, (3) incomplete clinical information, or (4) a history of knee joint trauma or tumor were excluded.
Image acquisitionEach patient in the study underwent radiography using a digital radiography system. The imaging parameters were standardized, with a fixed tube voltage of 50 kV, an exposure condition of 0.8 mAs, a source-film distance of 60 cm, and an exposure index ranging from 1400 to 1800. Given that the Rosenberg view is more sensitive than Antero-posterior view in identifying the joint space narrowing which played an important role in KL grading [23], the field of the knee radiograph encompassed the entire knee joint, leg, and thigh was scanned using Rosenberg view.
Image reading and manual segmentationLateral joint space (LJS), medial joint space (MJS), total joint space (TJS), and osteophyte were manually segmented using ITK-SNAP software [24], respectively. The manual segmentation was performed by three senior orthopedic doctors who have more than ten years’ experience in X-ray image reading and clinical practice. If there was inconsistency in the manual segmentation, a final decision was reached through discussion among three senior doctors. In addition, these three doctors made diagnosis of KL grade of KOA based on the interpretation of the X-ray images. To ensure accuracy, three doctors independently reviewed all the acquired X-ray images. The KL grade, which assesses joint space and osteophytes in KOA, was determined based on the X-ray images [25]. In cases where there was inconsistency in the KL grade assigned by the doctors, a final decision was reached through discussion among them. These three senior orthopedic doctors were blinded to the patients’ clinical information during manual segmentation and KL grade diagnosis.
Segmentation based on U-Net modelTo reduce the time-consuming and labor-intensive process of precise segmentation for doctors, four U-Net models (as depicted in Fig. 2) were used to segment LJS, MJS, TJS, and osteophyte, respectively. The U-Net [26] architecture consists of an encoder and a decoder. The encoder includes four downsample blocks. A downsample block consists of two convolutional layers, two batch normalization layers, a rectified linear unit (ReLU), and a max-pooling layer. The decoder, on the other hand, comprises four upsample blocks. A upsample block consists of a transpose convolutional layer, two convolutional layers, two batch normalization layers, and a ReLU. This structure is constructed to generate the final segmentation output. Two dropout2d layers were added between the encoder and decoder to prevent overfitting.
Fig. 2The flowchart of the segmentation using the U-Net models. Four U-Net models were used to segment the total joint space, the lateral joint space, the medial joint space, and the osteophyte, respectively
To normalize the X-ray images, the following equation was applied:
where \(\:x\) and \(\:y\) represent the original and normalized intensity of each pixel, respectively. Further, \(\:min\) and \(\:max\) denote the minimum and maximum value of the intensity of all pixels in the corresponding original X-ray image, respectively. Due to the large size of the X-ray image (as depicted in Fig. 2), both the normalized X-ray image and the corresponding manually segmented masks were cropped. During cropping, a smallest rectangle that contained LJS, MJS, TJS, and osteophyte was first generated. Then, a large rectangle was generated by expanding 50 pixels outward from the edges of the smallest rectangle. Thus, the normalized X-ray image and the corresponding manually segmented mask were cropped along the boundary of the large rectangle.
Given that the size of cropped normalized X-ray image and manually segmented mask fed into the U-Net model was not consistent between patients, spline interpolation [27,28,29] was used to resize the cropped normalized X-ray image and manually segmented mask. Thus, the size of the cropped normalized X-ray images and corresponding cropped manual segmentation masks was changed to 224 × 512, and then these cropped normalized X-ray images and corresponding cropped manual segmentation masks were used to construct the U-Net model. To validate the constructed U-Net model, approximately one-fourth of the patients randomly selected from the training cohort were designated as the validation cohort (815 patients, 1041 X-ray images), while the remaining patients were used for training (2444 patients, 3206 X-ray images). Each patient’s X-ray images were included in only one cohort.
The U-Net model was trained using the cropped normalized X-ray images and their corresponding cropped manual segmentation masks from the training cohort. Subsequently, the cropped normalized X-ray images and their corresponding cropped manual segmentation masks from the validation cohort were employed to select the best-performing U-Net model. Finally, the cropped normalized X-ray images and their corresponding cropped manual segmentation masks from the testing cohort were used to evaluate the performance of the selected U-Net model. Segmentation result was obtained based on constructed U-Net model. The size of the segmentation result was restored to the size of the corresponding original cropped manual segmentation mask.
For the osteophyte segmentation, X-ray images without osteophyte were excluded from the training, validation, and testing cohorts. Thus, cropped normalized X-ray images with osteophyte in training cohort were used to construct the U-Net model, and cropped normalized X-ray images with osteophyte in validation cohort were used to select the best-performing U-Net model. The cropped normalized X-ray images with osteophyte from the testing cohort were used to evaluate the performance of the selected U-Net model. In addition, Supplementary Method1 provided details of another U-Net model that constructed using total training cohort including X-ray images without osteophyte to segment osteophyte.
The U-Net models were constructed using PyTorch on a Windows computer. The GPU used was an NVIDIA Quadro RTX 5000 with 16 GB of memory, while the CPU was an Intel Core i9-9980XE with 128 GB of memory. The training process involved setting the number of epochs to 100, the batch size to 4, and the initial learning rate to 0.0001. During training, the learning rate was updated every 20 epochs by dividing the previous learning rate by 10. The optimization of the U-Net model was performed using adaptive moment estimation (Adam). Further details regarding the evaluation of the U-Net model can be found in Supplementary Method2.
Postprocessing for segmentationAfter LJS, MJS, TJS, and osteophyte were segmented, all restored segmentation results were generated. As Fig. 3A shown, we found that some evident false positive regions existed in some segmentation results. To avoid the effect on feature extraction, these false positive regions were removed. Specifically, for each patient, after the restored segmentation results obtained, the number of pixels of connected regions in the restored segmentation results was calculated. For the TJS, the regions with the number of pixels less than 2500 were removed. For the LJS and MJS, only the largest connected region was retained. For the osteophyte, the opening was used to remove false positive regions. Thus, the pixel values of these false positive areas were zeroed (Fig. 3B) in the postprocessed segmentation results.
Fig. 3Examples of the false positive regions in the restored segmentation and postprocessed results. A: The restored segmentation results of U-Net models were not performed postprocessing. The area marked by the red arrow was a false positive area. B: The restored segmentation results of U-net models were performed postprocessing
Geometric features extractionIn accordance with the current KL grading system, joint space narrowing and osteophyte formation are crucial factors in diagnosing KL. For this study, 27 geometric features were extracted from the postprocessed segmentation results of LJS, MJS, TJS, and osteophytes. These geometric features encompassed width (n = 9), area (n = 5), range (n = 6), ratio (n = 6), number (n = 1). For a detailed overview of these geometric features, please refer to Supplementary Method3.
Radiomic feature extractionThe original X-ray images were used to extract radiomic features. To extract radiomic features from the knee subchondral bone, a region of interest was generated by expanding 20 pixels outward from the upper and lower edges of postprocessed segmentation results of TJS. The open-source pyradiomics package [30] was utilized to extract radiomic features from the original X-ray images. Prior to feature extraction, the original X-ray images underwent preprocessing using seven filters: Square Root, Wavelet, Square, Gradient, Logarithm, 2D Local Binary Pattern, and Exponential. Shape, textural and first-order features were extracted from both the original and preprocessed X-ray images, resulting in a total of 1032 features extracted from each X-ray image. Detailed information about these radiomic features can be found in the pyradiomics documentation (https://pyradiomics.readthedocs.io/en/latest/features.html). For a detailed overview of the procedure of feature extraction using Pyradiomics, please refer to Supplementary Method4.
Hierarchical classification methodFigure 4 illustrates the hierarchical classification method developed for predicting the severity of KOA. This method involves four sub-task classifications. First, low-level KL grades (KL grade 0–2) and high-level KL grades (KL grade 3–4) are classified. Then, the classification of KL grade 0 versus KL grade 1–2, and the classification of KL grade 3 versus KL grade 4 are performed. Finally, KL grades 1 and 2 are assigned to their respective groups.
Fig. 4The structure of hierarchical classification method. In the hierarchical classification method, there were four sub-task classifications. The low-level KL grades (i.e., KL grade 0–2) and the high-level KL grade (i.e., KL grade 3–4) were first classified. Then, the classification of the KL grade 0 and KL grade 1–2, and the classification of the KL grade 3 and KL grade 4 were performed. Last, the KL grade 1 and the KL grade 2 were classified. OA: Osteoarthritis. KL: Kellgren-Lawrence
In the hierarchical classification method, the geometric, radiomic, and combined models were constructed using geometric features, radiomic features, and the combination of geometric and radiomic features, respectively. Figure 5 provides an illustration of these models, which were used to compare the performance of different feature sets in classifying KL grades. Further details on the evaluation of the classification models can be found in Supplementary Method2.
Fig. 5Schematic overview of training and testing in the geometric model, radiomic model, and combined model. A: The flowchart of the construction of the geometric model. B: The flowchart of the construction of the radiomic model. C: The flowchart of the construction of the combined model. LASSO: least absolute shrinkage and selection operator. GradSearchCV (10): grid search with 10-fold cross-validation. T test: Student’s t-test. Chi-Square test: Pearson’s Chi-Square test
Geometric model constructionIn this study, logistic regression (LR) models with L2 regularization (LR_L2), LR with elastic-net regularization (LR_EN), support vector machines (SVM), random forest (RF) and extreme gradient boosting (XGBoost) were evaluated to determine the best classifier for predicting the severity of KOA based on clinical and geometric features.
Figure 5A demonstrates the construction of the geometric model, which utilized clinical features (age and gender) and geometric features. Feature selection was performed using Pearson’s chi-square test for discrete features (gender) and Student’s t-test for continuous features. Subsequently, the feature values in the training cohort were scaled to a range of 0 to 1 based on the minimum and maximum values of features in the training cohort. The same scaling was applied to the corresponding features in the testing cohort. The least absolute shrinkage and selection operator (LASSO) was employed for further feature selection. A random undersampling method [31] (namely, RandomUnderSampler [31, 32]) was employed in the training cohort to overcome the problem of the imbalance between majority and minority classes.
Optimal hyper-parameters for the classifiers were determined through a grid search with 10-fold cross-validation (CV) in the training cohort. For LR_L2, the regularization parameter λ was selected from 21 values [2− 10, 2− 9, …, 29, 210]. For LR_EN, the λ and \(\:}_\) ratios were selected from 21 values [2− 10, 2− 9, …, 29, 210] and 10 values [0.1, 0.2, …, 0.9, 1], respectively. For SVM, the C parameter was selected from 21 values [2− 10, 2− 9, …, 29, 210]. For RF, the number of estimators and maximum depth was selected from 21 values [1, 5, 10, 15, …, 95, 100] and 19 values [10, …,95, 100], respectively. For XGBoost, the number of estimators, maximum depth and learning rate was selected from 20 value [5, 10, 15, …, 95, 100], 20 values [5, 10, 15, …, 95, 100], and 4 values [0.2, 0.1, 0.01, 0.001], respectively. Finally, classifiers with their corresponding optimal hyper-parameters were constructed using the undersampled training cohort, and these classifiers were then tested using the testing cohort. Based on the highest AUC of training cohort, the best geometric model was selected in each sub-task classification. The Scikit-learn library (version 0.23.2) was utilized for implementing the classification models.
Radiomic model constructionIn the current study, LR_L2, LR_EN, SVM, RF and XGBoost models were evaluated to determine the best classifier for predicting the severity of KOA based on radiomic features.
Figure 5B illustrates the construction of the radiomic model, which utilizes radiomic features. The process of constructing the radiomic model is similar to that of the geometric model. To weaken the multicollinearity, the Pearson’s correlation analysis [33, 34] was used before radiomic features scaling. Based on Pearson’s correlation analysis, these radiomic features with absolute Pearson’s correlation coefficient lager 0.8 were removed. In addition, LASSO was employed for further feature selection prior to applying RandomUnderSampler. After the LR_L2, LR_EN, SVM, RF and XGBoost were constructed, the best radiomic model was selected based on the highest AUC of training cohort in each sub-task classification.
Combined model constructionAs depicted in Fig. 5C, based on the highest AUC value of the training cohort, the geometric features from the best geometric model and the radiomic features from the best radiomic model were stacked together to form combined features. These combined features were obtained before performing RandomUnderSampler in their respective models. LASSO was employed for feature selection prior to applying RandomUnderSampler. Optimal hyper-parameters for the classifiers were determined through a grid search with 10-fold CV in the training cohort. The hyper-parameters setting was consistent with those of the geometric model. LR_L2, LR_EN, SVM, and RF models were evaluated to determine the best classifier for predicting the severity of KOA based on the combined features.
Hierarchical classification method evaluationTo evaluate the hierarchical classification method, a strict decision strategy was employed in this study. According to this strategy, if all four sub-task classifications are accurately predicted, the output of the hierarchical classification method was considered a correct prediction. Otherwise, it was considered an incorrect prediction.
Significant features selectionFor each sub-task classification in the hierarchical classification method, features that play important roles in the corresponding classification task were extracted based on the weights of the combined model. Thus, we first ranked all weights according to their absolute values and then selected the top 10% of weights. Features corresponding to these top 10% of weights were selected as significant features, which play important roles in the classification task.
留言 (0)