Automatic Screening of Pediatric Renal Ultrasound Abnormalities: Deep Learning and Transfer Learning Approach


Introduction

Renal abnormalities are important findings in pediatric medicine. It is well accepted that “silent” renal abnormalities can be effectively detected through ultrasound (US) screening, which makes both early diagnosis and intervention possible [,]. US is a safe, relatively cheap, and convenient medical modality. Portable US probes and internet connectivity have advanced considerably in recent years, extending the reach of pediatric renal US screening throughout the world. However, current methods remain limited by the lack of automated processes that accurately classify diseased and normal kidneys [].

Common renal abnormalities identified in US images in a series of more than 1 million school children included hydronephrosis (39.6%), unilateral small kidney (19.8%), unilateral agenesis (15.9%), cystic disease (13.9%), abnormal shapes—ectopic, horseshoe, and duplication of kidney (8%)—as well as others, that is, stones, tumors, and parenchymal diseases (1.5%) [].

Thus far, far fewer publications have addressed computer-aided interpretation of US images than of computerized tomography or magnetic resonance images [,]. The use of US presents unique challenges, such as varying angles of image sampling, low image quality caused by noise and artifacts, high dependence on operator experience, and high inter- and intra-observer variability across different institutes and manufacturers’ US systems []. According to a review of medical US published in 2021 [], only 3 studies involved deep learning for renal US image classification [,,].

This study was performed to select normal pediatric renal US images, as well as different types of renal abnormalities previously mentioned, for purposes of machine learning. Through the pretreatment of original images, adequate grouping of images, and deep neural network training, we hope that renal images can be correctly classified as either normal or abnormal. The aim of this study is to establish an artificial intelligence (AI) model for screening renal abnormalities to enhance the well-being of children even in areas where there is no pediatric nephrologist.


Methods

Ethics Approval

This study was approved by the institutional review board of Taichung Veterans General Hospital (No. CE20204A).

Materials

The images used were all derived from original images acquired in the pediatric US examination room at Taichung Veterans General Hospital from January 2000 to December 2020. Four different US machines, manufactured by Philips and Acuson, were used in this study. All images were obtained by a US technician with more than 20 years of experience, using a 4 MHz sector transducer. We chose only longitudinal views of the right and left kidneys.

We established 2 data sets. One data set was for training, and the other was for validation. The images in these 2 data sets were totally different.

Image Preprocessing and Data Cleaning

All images were stripped of their identifying data, including name, date of birth, date of examination, and chart number. The size of all the images was 600x480 pixels. We processed the images using software to obtain adequate illustrations for machine learning. As shown in Figure 1, after preprocessing, each image contains a kidney, a square of liver obtained during the same examination, and a gray scale gradient in the left upper part of the image.

Figure 1. Preprocessing images for machine learning.

Image Grouping

Normal images were those showing a kidney of normal size and shape, with a clear renal cortex and medulla and without hydronephrosis, hyperechogenicity, cysts, stones, or any space-occupying lesion. We prepared 330 images for this group. There were a total of 1269 abnormal renal images. The abnormalities included hydronephrosis, hyperechogenicity, cysts, stones, and space-occupying lesions. The numbers of images and examinations are summarized in Table 1. Hyperechogenicity in the renal US images included increased renal cortex echogenicity as compared to the liver, poor differentiation of the renal cortex and medulla, and inverted echogenicity of the renal cortex and medulla. These findings were judged by 2 pediatric nephrologists.

Table 1. Distribution of images and examinations in the training and testing augmented database.

| Diagnosis | Training (cases/images) | Testing (cases/images) | Totals (cases/images) |
| --- | --- | --- | --- |
| Normal | 132/264 | 32/66 | 164/330 |
| Abnormal | | | |
| Stone | 146/342 | 37/85 | 183/427 |
| Cyst | 100/215 | 25/53 | 125/268 |
| Hyperechogenicity | 60/132 | 15/33 | 75/165 |
| Space-occupying lesions | 108/181 | 26/45 | 134/226 |
| Hydronephrosis | 68/146 | 16/37 | 84/183 |
| Total | 614/1280 | 151/319 | 765/1599 |

Machine Learning

We performed feature extraction with the ResNet-50 model [-] in PyTorch, pretrained on the ImageNet data set []. We used the pretrained weights of ResNet-50, so there was no backpropagation through the feature extractor when training on US images. The input data were renal US images of 800x600 pixels in size. We resized the images to 224x224 pixels prior to feeding them into the network.

For classification, we redefined the final fully connected layers, which output the image classification as abnormal or normal. After the training images passed through ResNet-50, there were 2048 outputs. The final fully connected block had 4 components. The first was a linear layer with the 2048 extracted features as inputs and 512 outputs. The second was a rectified linear unit (ReLU), a piecewise linear function that passes positive values unchanged and outputs zero otherwise. The third was a dropout layer, added to prevent overfitting. The fourth was another linear layer with 512 inputs and 2 outputs, which stand for the 2 categories, that is, the abnormal and normal classes with their probabilities.

We optimized the model with the Adam optimizer at a learning rate of 0.01 []. The convolutional neural network was trained for a total of 30 epochs. We created a 94 MB model to classify normal versus abnormal renal US images. Figure 2 is a summary of our deep learning structure.
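
As a rough sanity check on the reported model size, the parameter count implied by the architecture described above can be worked out by hand. The sketch below is illustrative only. The ResNet-50 parameter total (25,557,032, including its original 1000-class output layer) is the figure for the standard torchvision implementation and is an assumption about the setup used here, as is the storage of weights as 32-bit floats.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Redefined head: Linear(2048 -> 512), ReLU, Dropout, Linear(512 -> 2).
# ReLU and dropout have no trainable parameters.
head_params = linear_params(2048, 512) + linear_params(512, 2)

# Assumption: standard torchvision ResNet-50 parameter total,
# which includes the original 2048 -> 1000 ImageNet output layer.
RESNET50_TOTAL_PARAMS = 25_557_032
backbone_params = RESNET50_TOTAL_PARAMS - linear_params(2048, 1000)

total_params = backbone_params + head_params

# Assumption: each weight is stored as a 32-bit (4-byte) float.
size_mib = total_params * 4 / 2**20

print(f"head parameters:     {head_params:,}")
print(f"backbone parameters: {backbone_params:,}")
print(f"approximate size:    {size_mib:.1f} MiB")
```

Under these assumptions the weights occupy roughly 93.7 MiB, which is in line with the reported 94 MB. Batch-normalization buffers and file-format overhead are ignored in this estimate.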

Figure 2. Brief summary of machine learning.

Experimental Setup

We implemented a training-testing approach. To establish the model, the data set was randomly divided into 1272/1599 (79.55%) images for training and 327/1599 (20.45%) images for testing. We repeated this randomization of the data set 10 times, each time repeating the machine learning procedure described in the previous section. For validation of the 94 MB model, there was another validation data set with 327 pediatric renal US images, including 66 (20.2%) normal, 37 (11.3%) hydronephrosis, 53 (16.2%) cyst, 95 (29.1%) stone, 53 (16.2%) hyperechogenicity, and 26 (7.9%) space-occupying US images. All these images were totally different from those in the data set used for establishing the model.
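
The repeated random split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_indices`, the use of seeds 0 to 9, and splitting at the level of individual images (rather than cases) are all assumptions made for the example.

```python
import random

def split_indices(n_images: int, n_test: int, seed: int):
    """Shuffle image indices and return (train, test) index lists."""
    rng = random.Random(seed)          # seeded so each split is reproducible
    indices = list(range(n_images))
    rng.shuffle(indices)
    return indices[n_test:], indices[:n_test]

N_IMAGES = 1599   # total images in the data set
N_TEST = 327      # images held out for testing (about 20%)

# Ten independent randomizations; the seeds are hypothetical.
splits = [split_indices(N_IMAGES, N_TEST, seed) for seed in range(10)]

train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

Each of the 10 splits would then be used to retrain and test the model, so that the spread of the resulting accuracies indicates how sensitive the performance is to the particular partition.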

Evaluation of Performance

We evaluated the performance from a single image result. The diagnostic performance was measured by accuracy, specificity, sensitivity, positive predictive value, and negative predictive value. To calculate the above metrics, we defined an abnormal result as positive and a normal result as negative.
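
The five metrics follow directly from the counts of a 2x2 confusion matrix, with abnormal as the positive class. The sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not taken from the study data.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard screening metrics; abnormal = positive, normal = negative."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # abnormal images correctly flagged
        "specificity": tn / (tn + fp),   # normal images correctly passed
        "ppv": tp / (tp + fp),           # flagged images that are truly abnormal
        "npv": tn / (tn + fn),           # passed images that are truly normal
    }

# Hypothetical counts, for illustration only.
metrics = screening_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```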


Results

After 30 epochs of training on these 1599 pediatric renal US images, we obtained satisfactory results. The performance metrics in the test part of the data set are shown in Table 2. The accuracy in different abnormalities ranged from 95% to 100%.

Table 2. Evaluation metrics for screening different abnormalities from test renal ultrasound images in the data set.

| Diagnosis (number) | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC-ROC (a) | PPV (b) (%) | NPV (c) (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Stone | 100 | 100 | 100 | 0.974 | 100 | 100 |
| Cyst | 95.2 | 88.5 | 100 | 0.945 | 100 | 91.7 |
| Hyperechogenicity | 98.3 | 96.2 | 100 | 0.938 | 100 | 97.1 |
| Space-occupying lesions | 98.7 | 95.6 | 100 | 0.935 | 100 | 97.1 |
| Hydronephrosis | 100 | 100 | 100 | 0.998 | 100 | 100 |
| Overall | 98.4 | 96.39 | 100 | 0.961 | 100 | 97.2 |

(a) AUC-ROC: area under the receiver operating characteristic curve.

(b) PPV: positive predictive value.

(c) NPV: negative predictive value.

The accuracy for each abnormality ranged from 95.2% to 100%, with an overall accuracy of 98.4%. The areas under the curve (AUCs) ranged from 0.935 to 0.998, and the AUC for overall performance was 0.961. To check the consistency of the machine learning performance, we repeated the experiment 10 times using different randomizations of the 80%/20% training/test split. The accuracies ranged from 95.2% to 98.4%, and there was no difference between these 10 tests (P>.05). We also performed a 5-fold cross test, and the results are shown in Table 3.
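
A 5-fold partition of the data set can be sketched as below, assuming that the "5-fold cross test" follows the usual k-fold cross-validation scheme, in which each image is used for testing exactly once. The function name `k_fold_indices` and the seed are hypothetical.

```python
import random

def k_fold_indices(n_items: int, k: int, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

cv_splits = list(k_fold_indices(n_items=1599, k=5))
for number, (train, test) in enumerate(cv_splits, start=1):
    print(f"Test {number}: {len(train)} training images, {len(test)} test images")
```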

We validated the 94 MB model with another 327 pediatric renal US images. The classifications included 66 (20.2%) normal, 37 (11.3%) hydronephrosis, 53 (16.2%) cyst, 95 (29.1%) stone, 53 (16.2%) hyperechogenicity, and 26 (7.9%) space-occupying US images. The performance for each single image is summarized in Table 4. Accuracy in the different abnormalities ranged from 89.9% to 94.1%, with an average of 92.3%. The AUC ranged from 0.934 to 0.996 (Figure 3). The overall AUC was 0.959, and the macro F1 was 0.924.

Table 3. Results of the 5-fold cross test.

| | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Normal accuracy (%) | 80 | 87.9 | 87.9 | 87.9 | 87.9 | 86.32 |
| Stone accuracy (%)/AUC (a) | 91.2/0.925 | 92.9/0.897 | 89.4/0.923 | 89.4/0.925 | 94.3/0.927 | 91.60/0.927 |
| Cyst accuracy (%)/AUC | 75.4/0.858 | 90.6/0.896 | 84.9/0.927 | 90.6/0.898 | 82.1/0.891 | 85.3/0.903 |
| Hyperechogenicity accuracy (%)/AUC | 84.8/0.848 | 81.8/0.855 | 81.8/0.862 | 81.8/0.862 | 81.8/0.891 | 84.2/0.859 |
| Space-occupying lesion accuracy (%)/AUC | 92.5/0.903 | 84.9/0.881 | 94.5/0.917 | 83.0/0.874 | 82.6/0.863 | 86.8/0.896 |
| Hydronephrosis accuracy (%)/AUC | 100/0.965 | 91.9/0.888 | 89.2/0.940 | 94.6/0.932 | 91.4/0.871 | 94/0.928 |
| Overall accuracy (%)/AUC | 87.8/0.903 | 89/0.887 | 87.8/0.928 | 87.5/0.902 | 87.7/0.901 | 88.3/0.900 |

(a) AUC: area under curve.

Table 4. Evaluation metrics for screening different abnormalities from other renal ultrasound images for validation.

| Diagnosis | US images, n (%) | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC-ROC (a) | PPV (b) (%) | NPV (c) (%) | F1-score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Normal | 66 (20.2) | N/A (d) | N/A | 90.9 | N/A | N/A | N/A | N/A |
| Stone | 93 (28.4) | 93.2 | 94.7 | N/A | 0.973 | 93.2 | 92.3 | 0.927 |
| Cyst | 53 (16.2) | 91.6 | 92.5 | N/A | 0.940 | 91.6 | 93.8 | 0.918 |
| Hyperechogenicity | 53 (16.2) | 89.9 | 88.7 | N/A | 0.940 | 89.9 | 90.9 | 0.897 |
| Space-occupying lesions | 26 (7.9) | 91.3 | 92.3 | N/A | 0.934 | 91.3 | 96.81 | 0.923 |
| Hydronephrosis | 37 (11.3) | 94.1 | 100 | N/A | 0.996 | 94.2 | 100 | 0.957 |
| Overall | 328 (100) | 92.9 | 96.1 | N/A | 0.959 | 93.6 | 77.92 | 0.924 (e) |

(a) AUC-ROC: area under the receiver operating characteristic curve.

(b) PPV: positive predictive value.

(c) NPV: negative predictive value.

(d) N/A: not applicable.

(e) Macro F1.
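
The macro F1 is the unweighted mean of the per-class F1-scores, so the overall value in Table 4 can be reproduced from the five abnormality rows. The short check below uses only the F1-scores reported in the table; the variable names are illustrative.

```python
# Per-class F1-scores as reported in Table 4.
per_class_f1 = {
    "stone": 0.927,
    "cyst": 0.918,
    "hyperechogenicity": 0.897,
    "space-occupying lesions": 0.923,
    "hydronephrosis": 0.957,
}

# Macro averaging gives every class equal weight, regardless of image count.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro F1 = {macro_f1:.3f}")
```

Because each class counts equally, a smaller class such as space-occupying lesions (26 images) influences the macro F1 as much as the largest class.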

Figure 3. Area under the receiver operating characteristic curves of different image abnormalities and the overall performance. AUC: area under curve.

Discussion

The main finding of this study is a useful AI model for screening abnormal pediatric renal US images, with an overall accuracy of 92.9% in the validation set. These results fulfill the main purpose of this study: to develop a useful computer-aided diagnosis model that automatically screens for various abnormal patterns in pediatric renal US images. In this study, the machine learning methods were based on a convolutional neural network and fine-tuning, along with our unique methods for image preprocessing and our classification strategies, which together achieved a feasible model for clinical purposes. We constructed a stable classifier that combined transfer learning with training from scratch, balancing the training of a medical data set taken from an adequate sample size.

Clinical applications of AI in nephrology are versatile, but the use of renal US in this field is still in its infancy [,]. Reports derived from renal US images alone have been relatively limited up until now, with the major reports involving acute and chronic injuries [-]. Most renal image studies for AI used magnetic resonance imaging, computerized tomography, and patient histology for tumors, stones, nephropathy, transplantation, and other conditions [-]. The key challenges associated with deep learning involving US include reliability, generalizability, and bias []. Basic studies for enhancing AI performance in renal US have begun and are ongoing [-].

There have been 4 reports from studies involving clinical AI applications in pediatric renal US abnormalities [, ,,]. Zheng et al [] found that the deep transfer learning method offers satisfactory accuracy in identifying congenital anomalies in the kidney and urinary tract, even when the data set is as small as 50 children with congenital anomalies in the kidney and urinary tract and 50 control children. Yin et al [] performed a similar study to detect posterior urethral valves. Sudarharson et al [] used 3 variant data sets for identifying renal cysts, stones, and tumors, with an accuracy rate of 96.54% in good-quality images and 95.58% in noisy images. Smail et al [] attempted to use AI for grading hydronephrosis with the 5-point scoring system from the Society of Fetal Urology (SFU). The best recorded performance was a 78% accuracy rate when dividing hydronephrosis into mild and severe. However, the accuracy rate was only 51% when using the 5-point system. In our study, we established a single 94 MB model to classify normal versus abnormal pediatric renal US images. The abnormalities included renal cysts, stones, and tumors, as reported by Sudarharson et al []. In addition, the model was able to identify images of hydronephrosis and hyperechogenicity. Compared with the results of the study performed by Smail et al [], our results showed better classification accuracy for hydronephrosis. The 37 validation images showed moderate or severe hydronephrosis, that is, SFU classes II, III, and IV. Our model achieved 100% sensitivity, compared with the previously reported sensitivity of 46%-54% [].

For SFU class I, our model had an accuracy of 71.7% (119/166). Grading of hydronephrosis remains an ongoing challenge []. Extremely early intervention for treatment of mild hydronephrosis remains inadequate. If a child with mild hydronephrosis also has other renal abnormalities, such as stones, cysts, or hyperechogenicity, it is highly likely that our model would flag these conditions.

The unique image pretreatment in this study was designed to provide a comparison with liver echogenicity from the same examination. This step is necessary for identifying hyperechogenicity. Other abnormalities, such as hydronephrosis, cysts, stones, and tumors, showed no difference in classification regardless of whether the input images included the square containing liver echogenicity and the gray scale gradient in the left part of the image shown in Figure 1. As demonstrated in Table 4, the accuracy and sensitivity for hyperechogenicity identification were lower than those for other abnormalities. Increased echogenicity is an important finding in evaluating muscle, thyroid, vascular, and renal diseases []. Gray scale US has a general sensitivity of 62% to 77%, a specificity of 58% to 73%, and a positive predictive value of 92% for detecting microscopically confirmed renal parenchymal diseases. These figures reveal that echogenicity change is not sensitive enough for detecting renal disease. Abnormalities in renal echogenicity include increased echogenicity, poor differentiation of the cortex and medulla, and inverted echogenicity of the renal cortex and medulla []. In practice, we often cannot obtain a square containing homogeneous liver echogenicity for purposes of machine learning. When the model’s classification is compared with that of a pediatric nephrologist, the results are acceptable. It is also difficult for the naked eye to discriminate subtle gray scale differences. Currently, so-called “radiomics” information, which can aid AI analysis of US imaging, is emerging [], and a more precise assessment of US pixels may enhance the utility of hyperechogenicity.

A limitation of this study is that the images came from a single medical center. More images from different hospitals, areas, ethnicities, and US manufacturers need to be used. We conducted a small-scale external validation using US images from machines made by other companies, including General Electric, Siemens, and Toshiba. After image pretreatment, the results were 100% sensitivity, 80% specificity, and 90% accuracy. Another limitation is the moderate number of images contributing to the data set. We did not separate images of the right and left kidneys for training, though the results were acceptable. We will further validate our method on larger data sets.

In conclusion, this study proposed an automatic model for screening various abnormalities in pediatric renal US images. We will continue to enhance the model’s performance as we conduct additional evaluation studies of its future clinical applications, including use as auxiliary software for screening children’s renal abnormalities in remote areas.

Acknowledgments

This study was supported in part by grants from Taichung Veterans General Hospital (No. TCVGH-1106506B, TCVGH-1116504C). The statistical work was partially supported by the Ministry of Science and Technology, Taiwan, R.O.C. (grants MOST 110-2118-M-A49-002-MY3 and 110-2634-F-A49-005).

Conflicts of Interest

None declared.

Edited by C Lovis; submitted 12.07.22; peer-reviewed by M Lee, Y Fan, SJC Soerensen, V Khetan, S Yin; comments to author 03.08.22; revised version received 16.09.22; accepted 02.10.22; published 02.11.22

©Ming-Chin Tsai, Henry Horng-Shing Lu, Yueh-Chuan Chang, Yung-Chieh Huang, Lin-Shien Fu. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 02.11.2022.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
