Objective approach to diagnosing attention deficit hyperactivity disorder by using pixel subtraction and machine learning classification of outpatient consultation videos

Participants

We included 43 children who had received a diagnosis of ADHD (24 boys and 19 girls; age [mean ± standard deviation (SD)]: 7 years 6 months ± 2 years 1 month) and 42 children who had not received a diagnosis of ADHD (21 boys and 21 girls; age [mean ± SD]: 7 years 9 months ± 2 years 2 months). No significant difference was observed between the ages of the children in the two groups. Diagnoses of ADHD were based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) criteria, and ADHD severity was assessed using the SNAP-IV. A continuous performance test (CPT) was used to measure sustained and selective attention in the patients with ADHD. Patients were excluded if they had a history of severe intellectual disability, drug abuse, or head injury, or had received a diagnosis of a psychotic disorder. The patients without ADHD had diagnoses of headache, epilepsy, and dizziness, which are common in pediatric neurology. For each patient, a family member or legal guardian provided written informed consent for the child's participation. The study was conducted in accordance with the Declaration of Helsinki. Ethical approval was obtained from the Institutional Review Board of Kaohsiung Medical University Hospital [KMUIRB-SV (I)-20190060].

Movement Recording and Analysis

We used pixel subtraction quantification to analyze video footage obtained during consultations with a pediatric neurologist. We used a two-dimensional camera (I-Family IF-005D) to record movement videos of each patient. The videos were captured at a sampling rate of 30 Hz and a resolution of 1280 × 720 pixels. The video recorder was placed in a fixed, unobtrusive position in the consultation room, as illustrated in Fig. 1. Our pixel subtraction method and movement analysis diagram are presented in Fig. 2. The input video frames, originally in color, were three-dimensional; we converted them to grayscale images. For example, consider the first two frames of the video sequence, referred to as the first frame (\(F_1\)) and the second frame (\(F_2\)). The original color images of these frames are shown in Fig. 2a and b, and their corresponding grayscale images are shown in Fig. 2c and d. This conversion significantly reduced computational time without compromising the results of the movement analysis. After obtaining a series of sequential grayscale images, pixel subtraction was performed pairwise for each consecutive image pair. Because the video was captured at a sampling rate of 30 Hz, the frames were numbered \((F_1, F_2, \dots, F_{30})\) for the first second of the video. Pixel subtraction for the first pair was calculated as \(\left|F_2-F_1\right|\), followed by \(\left|F_3-F_2\right|\), and so forth, up to \(\left|F_{30}-F_{29}\right|\). This process was repeated for each consecutive frame pair along the temporal dimension of the video, resulting in a series of pixel-subtracted images tracked only within the region of interest (ROI), as shown in Fig. 2c and d. The ROI, depicted as the red rectangular region, was used to limit the analysis to the relevant area. The resulting pixel-subtracted image is shown in Fig. 2e. The final pixel-subtracted image was obtained by filtering with a significant-movement threshold, as shown in Fig. 2f. The detailed definitions of the ROI and the significant-movement threshold are provided in subsequent sections. The resulting series of subtracted images was used for movement analysis. Because each patient's consultation time varied, only the initial 4 min of each video recording were employed for movement analysis to minimize comparison bias. When the patients visited the pediatric neurologist, they sat on a medical chair. If a child maintained a stable sitting posture over time, the pixel values of consecutive images did not markedly differ, and consequently, the calculated frame-by-frame pixel subtraction values were approximately zero. By contrast, if a patient exhibited fidgeting behavior, such as swaying or swiveling, the pixel values differed, resulting in large frame-by-frame subtraction values. A previous study demonstrated that all measured human body movements are contained within frequencies of up to 20 Hz [12]. Therefore, to explore whether different sampling rates affect the performance of pixel subtraction and machine learning classification, we evaluated various sampling rates when implementing pixel subtraction. For example, let \(Q=(F_1, F_2, \dots, F_{30})\) be the sequence of frames of the first second of a video. If we obtain five subtracted images per second, the original 30 Hz video is downsampled to 6 Hz; that is, after downsampling, \(Q^{d}=\left(F_1^{d},F_2^{d},\dots,F_6^{d}\right)=(F_1, F_6, \dots, F_{26})\). The corresponding downsampled subtracted image sequence \(Q'^{d}\) was defined as follows.

Fig. 1

Video recorder view in the consultation room

Fig. 2

Diagram of the pixel subtraction method

$$\begin{aligned}Q'^{d}&=\left(q_{1}^{d},q_{2}^{d},\dots,q_{N}^{d}\right)\\&=\left(\left|F_{1+\frac{S}{d}}-F_{1}\right|,\left|F_{1+\frac{2S}{d}}-F_{1+\frac{S}{d}}\right|,\dots,\left|F_{S\cdot F}-F_{S\cdot F-\frac{S}{d}}\right|\right)\end{aligned}$$

(1)

where \(d\) represents the downsampling rate; \(S\) represents the original sampling rate, which is 30 frames per second; \(F\) represents the number of selected video seconds (thus \(S\cdot F\) is the total number of selected video frames); and \(N\) represents the resulting number of subtracted images.

Accordingly, in downsampling to 6 Hz, five subtracted images are obtained per second, resulting in 1,200 consecutive subtracted images for a 4-minute video. In our approach, when no substantial movement occurred, the pixel values of any two consecutive images were approximately equal. Thus, the output pixel values were near zero after pixel subtraction, and the pixels in the output image appeared nearly black. By contrast, if any change or movement occurred between the capture of the two input images, the light portion of the subtracted image (Fig. 2e) indicated a movement difference. Using this pixel subtraction technique, we identified small movements in our participants that were imperceptible to the naked eye.
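To make the procedure concrete, the following is a minimal Python sketch of the grayscale conversion, temporal downsampling, and pairwise pixel subtraction described above. It is illustrative rather than the authors' code; the file name, the 6 Hz target rate, and the continuous pairing of downsampled frames are our assumptions.

```python
# Illustrative sketch of grayscale conversion, downsampling, and pixel subtraction.
import cv2

cap = cv2.VideoCapture("consultation.mp4")  # hypothetical input video (30 Hz)
step = 30 // 6                              # keep every 5th frame: 30 Hz -> 6 Hz

frames = []
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        # Convert the 3-channel color frame to grayscale to reduce computation.
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    index += 1
cap.release()

# Pairwise absolute difference of consecutive downsampled frames (cf. Fig. 2e).
subtracted = [cv2.absdiff(frames[k], frames[k + 1]) for k in range(len(frames) - 1)]
```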

In the present study, movement was identified and tracked only within the ROI representing the participant's movement in the subtracted images to avoid the influence of other individuals on the analytical results. Moreover, because each patient's height varied, slight differences existed in the defined ROI for each patient. Therefore, we selected the corresponding ROI from the subtracted image sequence \(Q'^{d}\) for each patient, obtaining the ROI subtracted image sequence. The ROI is depicted as the red rectangular region in Fig. 2c and d. We set a threshold \(\theta\) for the pixel value. For example, if the pixel value difference in the first subtracted image, \(\left|q_{1}^{d}\left(i,j\right)\right|\), exceeded the threshold \(\theta\), \(SM_{1}\left(i,j\right)\) was set to 1; otherwise, it was set to 0. In dynamic image processing, all pixels in \(SM_{k}\left(i,j\right)\) with a value of 1 are considered to be the result of movement [13]. This process was then repeated for subsequent images in the sequence. The significant movement within the \(k\)th frame, \(SM_{k}\), as shown in Fig. 2f, was defined as follows:

$$SM_{k}\left(i,j\right)=\begin{cases}1,&\text{if }q_{k}^{d}\left(i,j\right)>\theta\\0,&\text{otherwise}\end{cases},\quad\left(i,j\right)\in SA,\quad k=1,\dots,N$$

(2)

where \(q_{k}^{d}\left(i,j\right)\) represents the pixel value in the \(k\)th subtracted frame within the ROI, with \(i\) and \(j\) representing the pixel \(x\) and \(y\) coordinates, respectively; \(SA\) represents the set of image coordinates corresponding to the ROI; \(N\) represents the number of subtracted images in the 4-minute video; and \(\theta\) represents the threshold value. On the basis of our experiments, the threshold pixel value \(\theta\) was set to the constant value of 100 [13], which was determined to represent significant movement.
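A brief Python sketch of Eq. (2) follows; the ROI coordinates are placeholders (the actual red-rectangle ROI in Fig. 2 was defined per patient), while the threshold \(\theta=100\) follows the text.

```python
# Minimal sketch of Eq. (2): binarize a subtracted frame inside the ROI.
import numpy as np

THETA = 100                            # significant-movement threshold from the text
X0, Y0, X1, Y1 = 200, 100, 900, 600    # hypothetical ROI bounds (per-patient in practice)

def significance_mask(subtracted_frame: np.ndarray) -> np.ndarray:
    """Return SM_k: 1 where the pixel difference exceeds THETA, else 0."""
    roi = subtracted_frame[Y0:Y1, X0:X1]
    return (roi > THETA).astype(np.uint8)
```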

The pixels in each thresholded subtracted frame were summed to quantify the extent of patient movement in that frame. The resulting vector characterizes patient movement throughout the entire video. The movement measurement sequence, the vector \(M\), was defined as follows:

$$M=\left(m_{1},m_{2},m_{3},\dots,m_{N}\right)$$

(3)

where

$$m_{k}=\sum_{\left(i,j\right)\in SA}SM_{k}\left(i,j\right),\quad k=1,\dots,N$$

where \(SM_{k}\left(i,j\right)\) represents the binarized pixel value in the \(k\)th frame within the ROI, with \(i\) and \(j\) representing the pixel \(x\) and \(y\) coordinates, respectively.
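Building on the two sketches above, computing the movement vector of Eq. (3) reduces to summing each binary mask; `subtracted` and `significance_mask` are the illustrative objects defined earlier.

```python
# Sketch of Eq. (3): m_k is the pixel sum of the k-th binary mask; M stacks them.
import numpy as np

M = np.array([significance_mask(s).sum() for s in subtracted])
```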

Patients with ADHD often exhibit fidgeting behavior or otherwise noticeable movement when seated. This movement can be quantified through the mean (\(\mu\)), averaged variance (\(\overline{Var}\)), and averaged Shannon entropy (\(\overline{SE}\)), which were used in this study to analyze the movement vector.

Greater average movement indicated fidgeting. The mean movement of the sequence \(M\) was defined as follows:

$$\mu=\frac{1}{N}\sum_{k=1}^{N}m_{k}$$

(4)

where \(m_{k}\) represents the value of \(M\) corresponding to the \(k\)th frame.

Greater variance in movement was considered to indicate greater fidgeting. To avoid the influence of outliers when calculating the variance directly from the movement sequence, we used a sliding window to calculate the variance within each time window and then computed the average of the variances across all time windows. The averaged variance in movement of the sequence \(M\) was defined as follows:

$$\overline{Var}=\frac{1}{L}\sum_{k=1}^{L}Var\left(M_{k}\right),\quad L=N-W$$

(5)

where \(Var\left(M_{k}\right)=\frac{1}{W}\sum_{i=j}^{j+W-1}\left(m_{i}-\overline{M_{k}}\right)^{2}\)

where \(M_{k}=\left(m_{j},m_{j+1},\dots,m_{j+W-1}\right)\), with \(j=\frac{\left(k-1\right)W}{2}+1\), represents the \(k\)th movement subsequence of \(M\), with a window size of \(W\), and \(L\) indicates the number of movement subsequences. The window and overlap sizes were set to 5 and 2.5 s, respectively. \(Var\left(M_{k}\right)\) represents the variance of \(M_{k}\), and \(\overline{M_{k}}\) represents the mean of \(M_{k}\).
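A hedged sketch of the averaged windowed variance follows. With five subtracted frames per second, a 5 s window corresponds to \(W=25\) samples and a 2.5 s overlap to a step of about 12 samples; both counts are our reading of the text rather than reported values.

```python
# Sketch of Eq. (5): average the variance over 50%-overlapping sliding windows.
import numpy as np

def averaged_variance(M: np.ndarray, W: int = 25, step: int = 12) -> float:
    starts = range(0, len(M) - W + 1, step)
    return float(np.mean([np.var(M[j:j + W]) for j in starts]))  # np.var uses 1/W
```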

Greater entropy in movement was considered to indicate irregular and unpredictable patient movement. Accordingly, we used Shannon entropy to extract the rhythm of patient movement. Shannon entropy is calculated on the basis of the probability distribution of movement; a higher entropy value indicates greater information content in the movement sequence and thus greater unpredictability and complexity of movement. To avoid the influence of outliers when calculating Shannon entropy directly from the movement sequence, we used a sliding window to calculate the Shannon entropy within each time window and then computed the average of the Shannon entropies across all time windows. The averaged Shannon entropy for movement of the sequence \(M\) was defined as follows:

$$\overline{SE}=\frac{1}{L}\sum_{k=1}^{L}SE\left(M_{k}\right),\quad L=N-W$$

(6)

where

$$SE\left(M_{k}\right)=-\sum_{i=j}^{j+W-1}p_{m_{i}}\log_{2}\left(p_{m_{i}}\right)$$

where \(SE\left(M_{k}\right)\) represents the Shannon entropy of \(M_{k}\), and \(p_{m_{i}}\) represents the probability of occurrence of \(m_{i}\) in the movement subsequence \(M_{k}\).
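The entropy feature can be sketched in the same style; estimating \(p_{m_{i}}\) from the empirical distribution of values within each window is our assumption, since the text does not specify the estimator.

```python
# Sketch of Eq. (6): average the Shannon entropy over the same sliding windows.
import numpy as np

def shannon_entropy(window: np.ndarray) -> float:
    _, counts = np.unique(window, return_counts=True)  # empirical distribution
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def averaged_entropy(M: np.ndarray, W: int = 25, step: int = 12) -> float:
    starts = range(0, len(M) - W + 1, step)
    return float(np.mean([shannon_entropy(M[j:j + W]) for j in starts]))
```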

Feature discriminability analysis

To evaluate the discriminability between the ADHD and non-ADHD groups in terms of each movement feature, we employed classification analyses based on six machine learning methods: support vector machine (SVM), random forest, decision tree, k-nearest neighbor (KNN), adaptive boosting (AdaBoost), and extreme gradient boosting (XGBoost). The machine learning library scikit-learn was utilized for comparative analysis [14]. We employed nested cross-validation to optimize the model hyperparameters and evaluate each model's classification performance. The outer loop employed 10-fold cross-validation for model training and testing. During each iteration of the outer loop, one fold of the ADHD and non-ADHD patients' movement features was used as the test dataset, and the remaining nine folds were used as the training dataset. The training dataset obtained from each outer-loop iteration was then used for hyperparameter optimization and model training. The hyperparameter optimization used a grid search with 5-fold cross-validation to identify suitable parameters for each machine learning classification model (a code sketch is provided after this list). The grid search for tuning the hyperparameters of each classification model was defined as follows:

1.

Support vector machine (SVM): The kernel type of the SVM was specified as the radial basis function. The model parameters "gamma" and "C" were optimized using a grid search over predefined ranges. Default values were used for all other parameters in the library.

2.

Random forest: The model parameter "n-estimators" was optimized using a grid search over a predefined range. Default values were used for all other parameters in the library.

3.

Decision tree: The tree algorithm was specified as CART. The model parameter "max-depth" was optimized using a grid search over a predefined range. Default values were used for all other parameters in the library.

4.

K-nearest neighbor (KNN): The model parameter "n-neighbors" was optimized using a grid search over a predefined range. Default values were used for all other parameters in the library.

5.

Adaptive boosting (AdaBoost): The weak classifiers of AdaBoost were specified as CART trees. The model parameter "n-estimators" was optimized using a grid search over a predefined range. Default values were used for all other parameters in the library.

6.

Extreme gradient boosting (XGBoost): The weak classifiers of XGBoost were specified as CART trees. The model parameters "max-depth", "learning rate", and "n-estimators" were optimized using a grid search over predefined ranges. Default values were used for all other parameters in the library.

To minimize biased comparisons of classification results among different movement features with different machine learning classification models, the resampling strategy of nested cross-validation was repeated 10 times. Consequently, a total of 100 pairs of training and testing datasets were obtained. The training datasets were used to train each classification model, and the testing datasets were used to evaluate the classification performance of the trained classification model.
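The following scikit-learn sketch illustrates the nested cross-validation protocol for one of the six models (the SVM). The mock data, the placeholder parameter grid, and the random seeds are our assumptions, since the paper's exact search ranges are not reproduced here.

```python
# Hedged sketch of the nested cross-validation protocol (outer 10-fold,
# inner 5-fold grid search, resampling repeated 10 times -> 100 test folds).
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X = np.random.rand(85, 3)            # 85 patients x 3 movement features (mock data)
y = np.array([1] * 43 + [0] * 42)    # 1 = ADHD, 0 = non-ADHD

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # placeholder ranges
inner = GridSearchCV(SVC(kernel="rbf"), param_grid,
                     cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

scores = []
for repeat in range(10):
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
    scores.extend(cross_val_score(inner, X, y, cv=outer))  # inner CV runs per fold
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```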

Feature error estimation and analysis

To evaluate the reliability of the classification performance metrics for the different movement features across the various machine learning classification models, we calculated the standard error of the mean (SEM) [15] for each classification model and movement feature. Although nested cross-validation was repeated 10 times in this study, we further quantified the variability across sampling iterations to mitigate potential overfitting and to identify the movement feature with the best performance and the smallest error range. The SEM quantifies the variability in classification performance across multiple iterations of cross-validation, providing error estimates for the different performance metrics of each classification model. The SEM for the classification performance metrics was defined as follows:

$$SEM=\frac{\sigma}{\sqrt{n}}$$

(7)

where \(\sigma\) represents the standard deviation of one of the performance metrics across multiple iterations of cross-validation, and \(n\) represents the number of iterations, which was set to 100 in this study, corresponding to the total number of training and testing dataset pairs.
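As a worked example, the SEM can be computed directly from the 100 per-fold scores collected in the sketch above; the use of the sample standard deviation (ddof=1) is our choice, as the text does not specify it.

```python
# SEM = sigma / sqrt(n) over the 100 per-fold accuracy values.
import numpy as np

scores = np.asarray(scores)                      # 100 per-fold accuracies from above
sem = scores.std(ddof=1) / np.sqrt(len(scores))  # n = 100
```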

Statistical analysis

All statistical analyses were conducted using SAS v9.4 (SAS Institute, Cary, NC, USA). The results are presented as means ± SDs. The ADHD and non-ADHD groups were compared in terms of their movement features by using two-sample t-tests, and their sex distributions were compared by using the chi-squared test. A P value < 0.05 was considered significant.
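For readers who prefer an open-source route, the two reported tests can be reproduced in Python as sketched below (the study itself used SAS); the feature values are mock inputs, while the sex counts come from the Participants section.

```python
# Illustrative equivalents of the reported tests: two-sample t-test and chi-squared.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
adhd_feature = rng.normal(1.2, 0.4, 43)      # mock movement-feature values, ADHD
non_adhd_feature = rng.normal(0.8, 0.4, 42)  # mock values, non-ADHD

t_stat, p_val = stats.ttest_ind(adhd_feature, non_adhd_feature)
chi2, p_sex, dof, _ = stats.chi2_contingency([[24, 19], [21, 21]])  # boys/girls per group
```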
