To our knowledge, few studies have assessed the effects of image resolution, color depth, noise level, and low light on the inference of eye opening and closing and of body landmarks from digital images.
The aim of the present study was to test the accuracy of commonly used deep-learning models across different image resolutions, lighting conditions, color depths, and noise levels, and thereby to establish baseline thresholds at which model performance drops below an acceptable level.
Establishing these parameters is important for future work applying computer vision to actual patient monitoring. We hypothesized that progressive image degradation would gradually decrease model accuracy up to a certain threshold, beyond which the model would fail completely.
Previous literature on human pose estimation and facial landmark detection is first summarized, followed by a description of the methods for decreasing image quality and testing model accuracy. Results of the effects of image resolution, color depth, noise level, and low light are then reported and discussed.
2. Related Works

Human pose estimation using deep learning has been the subject of intense research in recent years and has been reviewed in [29]. In general, there are two types of multi-subject pose estimation algorithms. The top-down approach first detects all human subjects in a particular scene and subsequently localizes all keypoints for each detected subject. Algorithms that use this technique include G-RMI [30], Mask R-CNN [31], MSRA [32], CPN [33], and ZoomNet [34]. By combining high- and low-resolution representations through multi-scale fusion while maintaining a high-resolution backbone, HRNet [35] and HigherHRNet [36] achieved excellent keypoint detection results.

In contrast, the bottom-up approach identifies all keypoints first and then assigns each keypoint to an individual subject. Algorithms that employ this method include DeepCut [37], DeeperCut [38], and MultiPoseNet [39]. By introducing part affinity fields, OpenPose became the most popular bottom-up algorithm [40]. The concept of part affinity fields was expanded in PifPaf through the addition of a part intensity field [41].

Facial landmark detection is closely related to pose estimation and has benefited from advances made in human pose estimation. Algorithms used for facial landmark detection have been reviewed in [42]. The earliest algorithms used deformable facial meshes, which have since been replaced by ensembles of regression trees [43] such as those included in the Dlib open-source library [44]. Because they have very high computation speeds and are easy to implement, these models have become widely used in research. More recently, algorithms developed for pose estimation, such as HRNet [35], have been adapted for facial landmark detection [45]. In addition, newer methods based on shape models (e.g., dense face alignment [46]), heatmaps (e.g., the style-aggregated network [47], aggregation via separation [48], FAN [49], and MobileFAN [50]), and direct regression (e.g., PFLD [51], deep graph learning [52], and AnchorFace [53]) have been proposed.

3. Materials and Methods

3.1. Data

Two hundred images (100 images of humans with eyes open and 100 images with eyes closed) randomly chosen from the Closed Eyes in the Wild Dataset [54] were used to assess model accuracy as image quality was gradually degraded.

To generate out-of-sample images, photographs of the primary author with eyes open and closed were captured using a 13 MP smartphone camera (Moto E XT2052-1, f/2.0, 1/3.1), with the face occupying approximately half of the image height. Images were obtained using three 300-lumen dimmable light sources placed 150 cm in front of the face.
To test the effects of image quality on the accuracy of pose estimation, images from the COCO 2017 validation dataset [55] (https://cocodataset.org/#overview, accessed on 23 June 2022) were used. Specifically, images depicting exactly one person (Figure S6) were extracted along with their body keypoint annotations (921 images).

3.2. Model Description

Facial landmark recognition was performed using the pretrained model in Dlib v19.24.0 (http://dlib.net/, accessed on 20 June 2022). Sixty-eight facial landmarks were predicted by the model (see Supplementary Figure S5), where points 36 to 41 and 42 to 47 delineate the right and left palpebral fissures, respectively (a usage sketch is given at the end of Section 3.3).

3.3. Modifications Made

Two hundred images from the Closed Eyes in the Wild Dataset were used to assess model accuracy as image quality was gradually degraded. To generate images of different resolutions (a total of 8000 images), the original images were resized from 100 × 100 pixels to 20 × 20 pixels at intervals of 10 pixels while maintaining the aspect ratio. The image color depth was successively decreased from 16.7 M colors to 8 M, 1 M, 512 K, 216 K, 64 K, 8 K, 1 K, 729, 512, 343, 216, 125, 64, 27, and 8 colors. Gaussian noise was added by replacing randomly chosen pixels with random pixel values; noise intensity was varied by changing the probability of replacing a given pixel from 0% to 10% at intervals of 1% and from 10% to 50% at intervals of 10% (a minimal sketch of this degradation scheme is given below).
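As a point of reference, the following minimal Python sketch reproduces the spirit of this degradation scheme. It is not the authors' code: the per-channel quantization used to reach the listed color depths (k levels per channel gives k^3 colors, e.g., 7^3 = 343) and the random pixel-replacement noise are assumptions based on the description above, and the file path is illustrative.

```python
import numpy as np
from PIL import Image

def reduce_resolution(img, height):
    """Resize to the given image height while keeping the aspect ratio."""
    width = max(1, round(img.width * height / img.height))
    return img.resize((width, height))

def reduce_color_depth(img, levels_per_channel):
    """Quantize each RGB channel to `levels_per_channel` values,
    giving levels_per_channel**3 possible colors (e.g., 7 -> 343 colors)."""
    arr = np.asarray(img, dtype=np.float32)
    step = 255.0 / (levels_per_channel - 1)
    return Image.fromarray((np.round(arr / step) * step).astype(np.uint8))

def add_pixel_noise(img, probability, seed=0):
    """Replace a `probability` fraction of pixels with uniformly random RGB values."""
    rng = np.random.default_rng(seed)
    arr = np.asarray(img).copy()
    mask = rng.random(arr.shape[:2]) < probability
    arr[mask] = rng.integers(0, 256, size=(int(mask.sum()), 3), dtype=np.uint8)
    return Image.fromarray(arr)

# Example: one degraded variant of a Closed Eyes in the Wild image (path illustrative).
original = Image.open("cew_example.png").convert("RGB")
degraded = add_pixel_noise(reduce_color_depth(reduce_resolution(original, 60), 7), 0.05)
```

Chaining the three operations in this way makes it straightforward to sweep each quality parameter independently while holding the others at their original values, which mirrors the per-parameter analysis reported in the Results.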
For images captured using the smartphone camera, light intensity at the level of the face was measured using a smartphone light meter application (https://play.google.com/store/apps/details?id=com.tsang.alan.lightmeter&hl=en_CA&gl=US, accessed on 15 July 2022). Images under different lighting conditions were captured by varying the light intensity across 21 levels, from 42 to 2 lux at intervals of 2 lux. Since the smartphone-captured images had a higher resolution (1000 × 800 pixels) than images from the Closed Eyes in the Wild Dataset (100 × 100 pixels), images of different resolutions (a total of 789 images) were generated by first resizing the original images from an image height of 1000 pixels to 100 pixels at intervals of 100 pixels and then from an image height of 100 pixels to 10 pixels at intervals of 10 pixels.

For the COCO 2017 keypoint dataset, images of different quality (a total of 44,208 images) were generated. To generate images of different resolutions, the original image width (500 pixels) was decreased to 50 pixels at intervals of 50 pixels and then from 50 to 5 pixels at intervals of 5 pixels. Images with different color depths and noise levels were generated using the same quality degradation scheme as outlined above for eye open–closed inference.
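Before turning to the study measurements, the following hedged sketch shows how the pretrained Dlib predictor from Section 3.2 can be queried and how the eye aspect ratio used in Section 3.5 can be derived from its output. The predictor file name and image path are illustrative, and the EAR formulation follows the common definition from [56]; the authors' exact implementation may differ.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard pretrained 68-point model distributed with Dlib (file name illustrative).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_aspect_ratio(pts):
    """EAR for one eye given its six landmarks (Dlib points 36-41 or 42-47):
    mean vertical opening divided by the horizontal palpebral fissure length."""
    vertical = np.linalg.norm(pts[1] - pts[5]) + np.linalg.norm(pts[2] - pts[4])
    horizontal = np.linalg.norm(pts[0] - pts[3])
    return vertical / (2.0 * horizontal)

def eye_openness(image):
    """Return the mean EAR over both eyes, or None if no face is detected
    (such images are counted as missing values in the analysis)."""
    faces = detector(image, 1)
    if len(faces) == 0:
        return None
    shape = predictor(image, faces[0])
    coords = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
    right_ear = eye_aspect_ratio(coords[36:42])  # right palpebral fissure
    left_ear = eye_aspect_ratio(coords[42:48])   # left palpebral fissure
    return (right_ear + left_ear) / 2.0

# Usage (path illustrative): lower EAR values indicate closed eyes.
img = dlib.load_rgb_image("face_example.jpg")
print(eye_openness(img))
```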
3.4. Study Procedures

3.5. Measurements/Statistics

The opening and closing of the eyes were quantified using the eye aspect ratio (EAR) as described in [56], i.e., the ratio of the vertical to the horizontal dimension of the palpebral fissure. Palpebral fissure dimensions were estimated on randomly chosen images (100 images with eyes open and 100 images with eyes closed) from the Closed Eyes in the Wild Dataset. For pose estimation, the mean absolute error (MAE) of the x and y pixel coordinates of each predicted keypoint versus the ground truth was computed, and an average MAE was calculated over all keypoints. One-way ANOVA with multiple comparisons was carried out using the images of the best quality as the comparator. An adjusted p-value of less than 0.05 was considered statistically significant. The rate of model failure was computed by dividing the number of images in which the models were unable to detect faces/humans by the total number of images. Statistical calculations were performed in GraphPad Prism 9 and Python 3.9.

4. Results

4.1. Eye Open–Close Inference

As shown in Figure 1, when the image resolution was reduced below 60 × 60 pixels, model estimates of closed-eye dimensions (EAR of 0.19) deviated from the true dimensions (EAR of 0.18, Figure 1), and the model failed to detect the face and eyes in a larger number of images (Figure S1D) at 30 × 30 pixels (24%) compared with baseline (17%). Similar trends were observed in the open-eye dataset (Figure S2A,D): EAR was 0.30 at the full image resolution (100 × 100 pixels) and deviated to 0.31 when the resolution was decreased to 50 × 50 pixels; missing values increased from 5% at baseline to 10% when the resolution was reduced to 30 × 30 pixels.

When the color depth was reduced from 16.7 M colors to 343 colors, closed-eye dimensions deviated significantly (EAR of 0.18 vs. 0.17 at baseline, Figure S1B). The deviation was largest when the color depth was reduced to 27 colors in both the open-eye (EAR of 0.33 vs. 0.30 at baseline, Figure 2) and closed-eye (EAR of 0.19 vs. 0.17 at baseline) datasets. Furthermore, the percentage of missing values also increased as the color depth decreased to 343 colors (from 17% to 27% in the closed-eye dataset, Figure S1E, and from 5% to 6% in the open-eye dataset, Figure S2E).

As shown in Figures S1C and S2C, eye dimension estimates deviated from the true dimensions when 7 to 9% of the original image pixels were replaced by noise (closed-eye EAR of 0.20 with 9% noise vs. 0.17 at baseline; open-eye EAR of 0.32 with 7% noise vs. 0.30 at baseline). However, the percentage of missing values (Figures S1F and S2F) began to increase when as few as 4% of pixels were replaced by random noise (from 17% to 38% in the closed-eye dataset and from 5% to 10% in the open-eye dataset).

For images captured under different light intensities, the model prediction of the palpebral fissure dimension started to deviate from the true dimension as the image size was reduced to heights of 50–70 pixels (EAR of 0.20 at 1000 pixels vs. 0.22 at 70 pixels, Figure S3A, and EAR of 0.34 at 1000 pixels vs. 0.33 at 50 pixels, Figure S3C).
Similarly, the number of missing values, i.e., images in which the model failed to identify the face and/or both eyes, increased sharply below this resolution: the percentage of missing values increased from 19% at 40 pixels to 95% at 30 pixels in the closed-eye dataset and from 19% at 40 pixels to 76% at 30 pixels in the open-eye dataset.

The model prediction of the palpebral fissure dimension deviated more gradually from the true dimension as the light intensity decreased below 12 lux (Figure 3 and Figure S3D). At a light intensity of 8 lux, the model was increasingly unable to correctly identify the face and both eyes (missing values of 16% at 42 lux vs. 21% at 8 lux).

4.2. Human Pose Estimation

The prediction accuracy of the model for human poses on the COCO dataset decreased significantly when the image height was reduced below 200 of the original 500 pixels (MAE of 1.3 pixels vs. 0.98 pixels, respectively, Figure 4). Since human subjects occupied, on average, 150 × 200 pixels of the original images, this indicates that the model remained accurate down to a resolution of roughly 60 × 80 pixels for the region depicting the human subject. Similarly, the fraction of images in which the model was unable to identify the human subject started to increase dramatically below this resolution threshold (the percentage of missing values increased from 17% at 200 pixels to 84% at 100 pixels, Figure S4D).

When the color depth was reduced to 512 colors or fewer, pose estimates began to deviate significantly from the ground truth (MAE of 0.98 pixels at 16.7 M colors vs. 1.12 pixels at 512 colors). The percentage of missing values also increased sharply when the color depth was reduced to 343 colors or fewer (10% at 16.7 M colors vs. 14% at 343 colors, Figure S4E).

As shown in Figure S4C, the pose estimation error relative to the ground truth began to rise significantly compared with baseline when 5% of the original image pixels were replaced by noise (0.97 pixels at baseline vs. 1.17 pixels with 5% noise). Similarly, the percentage of missing values (Figure S4F) started to increase when 4% of pixels were replaced by random noise (10% at baseline vs. 13% with 4% noise).

5. Discussion

This study systematically tested the effects of image quality on facial feature extraction and human pose estimation using common deep learning models.
For the determination of eye opening and closing with Dlib, the resolution of facial images can be reduced to 60 × 60 pixels without significantly affecting the model's estimation of eye dimensions. When the color depth of the images was reduced to 343 colors or fewer, the eye dimensions estimated by the model began to deviate from the true dimensions, and it became increasingly difficult for the model to identify the face. The accuracy of the model's eye dimension estimates began to decrease when 7% of the original image pixels were replaced by noise. Interestingly, even when images of the face were taken under low lighting conditions (14 lux), eye dimensions could still be determined accurately enough to differentiate between open and closed eyes. Under very low lighting (6 lux), the model could still identify the face in most instances.
For human pose estimation using OpenPose, the resolution of regions representing human subjects can be reduced to 60 × 80 pixels without significantly affecting model accuracy or performance. Color depth reduction from 16.7 M to 512 colors resulted in a significant increase in the mean absolute error of model prediction. The addition of more than 4% Gaussian noise also increased model error.
Typically, contemporary convolutional neural networks are trained using images with resolutions greater than a few hundred pixels in width and height. Large image datasets (e.g., Microsoft COCO [55], ImageNet [57], the MPII Human Pose Dataset [58], and the CMU Panoptic Dataset [59]) used for the recognition and pose estimation of human subjects usually contain images with resolutions of 300 to 500 pixels in height and width. Images of similar resolutions are also found in frequently used datasets for facial landmark annotation (e.g., the AFLW Dataset [60] and 300-W [61]) and emotion detection (e.g., AffectNet [62], CK+ [63], and EMOTIC [64]). Furthermore, in medical imaging with MRI [65,66], PET [65,67], and CT [14,68], deep learning applications are typically trained using images with resolutions ranging from 128 × 128 to 512 × 512 pixels.

Previous studies have investigated the application of pose estimation algorithms to low-resolution images [69,70]. However, few studies have assessed the effects of image resolution, color depth, noise level, and low light on the inference of eye opening and closing and of body landmarks from digital images. Therefore, the present study tested the accuracy of commonly used deep-learning models while varying image resolution, lighting conditions, color depth, and noise level, which allowed us to establish baseline threshold values for future work applying computer vision to continuous patient monitoring.

Limitations of this work include the use of relatively small image datasets; our study may therefore be underpowered to detect changes in model prediction caused by small decreases in image quality. Furthermore, subjects in the COCO body keypoint dataset do not all occupy the same number of pixels, which may have introduced heterogeneity in model accuracy. Future work could therefore test multiple networks for a given task using larger numbers of images. In addition, variability (e.g., head tilt) exists in the photos captured by the smartphone camera, which may limit the reproducibility of the results. Moreover, only the OpenPose and Dlib models were tested, without model fine-tuning; newer deep learning models (e.g., RetinaFace [71] and MediaPipe [72,73]) should be studied in future work. Future work should also assess the effects of video quality, rather than photo quality, on model accuracy.

6. Conclusions

In this study, the effects of image quality on facial feature extraction and human pose estimation using the Dlib and OpenPose models were systematically assessed. These models failed to detect eye dimensions and body keypoints only at very low image resolutions (the failure rate for eye dimension estimation increased from below 20% to over 70% when the facial image resolution was decreased from 40 × 40 to 30 × 30 pixels), light intensities (failure rate for eye dimension estimation of 16% at 42 lux vs. 21% at 8 lux), and color depths (failure rate for pose estimation of 10% at 16.7 M colors vs. 14% at 343 colors). The baseline threshold values established here will be essential for future work applying computer vision to continuous patient monitoring.
Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jimaging8120330/s1. Supplementary Figure S1: Palpebral fissure dimension and model performance in the closed-eyes dataset as a function of image quality using the Closed Eyes in the Wild Dataset; Supplementary Figure S2: Palpebral fissure dimension and model performance in the open-eyes dataset as a function of image quality using the Closed Eyes in the Wild Dataset; Supplementary Figure S3: Palpebral fissure dimension and model performance as a function of image resolution and light intensity; Supplementary Figure S4: Model performance in the COCO body keypoint dataset as a function of image quality; Supplementary Figure S5: Facial landmark prediction using Dlib, showing an example of model output with the localization of 68 landmarks.

Author Contributions

Conceptualization, R.Z.Y. and V.H.; Methodology, R.Z.Y. and V.H.; Software, R.Z.Y.; Validation, R.Z.Y.; Formal analysis, R.Z.Y., A.S., D.D., H.L. and V.H.; Investigation, R.Z.Y., A.S., D.D., H.L., B.P. and V.H.; Data curation, B.P.; Writing—original draft, R.Z.Y. and V.H.; Writing—review and editing, R.Z.Y., A.S., D.D., H.L., B.P. and V.H.; Visualization, R.Z.Y.; Supervision, V.H. All authors have read and agreed to the published version of the manuscript.
Funding

R.Z. Ye received funding from the Canadian Institutes of Health Research (CIHR funding reference number: 202111FBD-476587-76355).
Institutional Review Board Statement

Not applicable.
Informed Consent Statement

Not applicable.
Data Availability Statement

Acknowledgments

Vitaly Herasevich is the corresponding author of this work.
Conflicts of Interest

The authors declare no conflict of interest.
References

1. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469.
2. Balla, P.B.; Jadhao, K. IoT based facial recognition security system. In Proceedings of the 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai, India, 5 January 2018; IEEE: New York, NY, USA.
3. Zhang, Z. Technologies raise the effectiveness of airport security control. In Proceedings of the 2019 IEEE 1st International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Kunming, China, 17–19 October 2019; IEEE: New York, NY, USA.
4. Ives, B.; Cossick, K.; Adams, D. Amazon Go: Disrupting retail? J. Inf. Technol. Teach. Cases 2019, 9, 2–12.
5. Berman, D.S.; Buczak, A.L.; Chavis, J.S.; Corbett, C.L. A survey of deep learning methods for cyber security. Information 2019, 10, 122.
6. Shen, L.; Margolies, L.R.; Rothstein, J.H.; Fluder, E.; McBride, R.; Sieh, W. Deep learning to improve breast cancer detection on screening mammography. Sci. Rep. 2019, 9, 12495.
7. Yala, A.; Lehman, C.; Schuster, T.; Portnoi, T.; Barzilay, R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 2019, 292, 60–66.
8. Becker, A.S.; Marcon, M.; Ghafoor, S.; Wurnig, M.C.; Frauenfelder, T.; Boss, A. Deep learning in mammography: Diagnostic accuracy of a multipurpose image analysis software in the detection of breast cancer. Investig. Radiol. 2017, 52, 434–440.
9. Milletari, F.; Ahmadi, S.-A.; Kroll, C.; Plate, A.; Rozanski, V.; Maiostre, J.; Levin, J.; Dietrich, O.; Ertl-Wagner, B.; Bötzel, K. Hough-CNN: Deep learning for segmentation of deep brain regions in MRI and ultrasound. Comput. Vis. Image Underst. 2017, 164, 92–102.
10. Liu, S.; Wang, Y.; Yang, X.; Lei, B.; Liu, L.; Li, S.X.; Ni, D.; Wang, T. Deep learning in medical ultrasound analysis: A review. Engineering 2019, 5, 261–275.
11. Akkus, Z.; Galimzianova, A.; Hoogi, A.; Rubin, D.L.; Erickson, B.J. Deep learning for brain MRI segmentation: State of the art and future directions. J. Digit. Imaging 2017, 30, 449–459.
12. Avendi, M.R.; Kheradvar, A.; Jafarkhani, H. A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Med. Image Anal. 2016, 30, 108–119.
13. Gibson, E.; Giganti, F.; Hu, Y.; Bonmati, E.; Bandula, S.; Gurusamy, K.; Davidson, B.; Pereira, S.P.; Clarkson, M.J.; Barratt, D.C. Automatic multi-organ segmentation on abdominal CT with dense v-networks. IEEE Trans. Med. Imaging 2018, 37, 1822–1834.
14. Weston, A.D.; Korfiatis, P.; Kline, T.L.; Philbrick, K.A.; Kostandy, P.; Sakinis, T.; Sugimoto, M.; Takahashi, N.; Erickson, B.J. Automated abdominal segmentation of CT scans for body composition analysis using deep learning. Radiology 2019, 290, 669–679.
15. Ye, R.Z.; Noll, C.; Richard, G.; Lepage, M.; Turcotte, E.E.; Carpentier, A.C. DeepImageTranslator: A free, user-friendly graphical interface for image translation using deep-learning and its applications in 3D CT image analysis. SLAS Technol. 2022, 27, 76–84.
16. Ye, R.Z.; Montastier, E.; Noll, C.; Frisch, F.; Fortin, M.; Bouffard, L.; Phoenix, S.; Guerin, B.; Turcotte, E.E.; Carpentier, A.C. Total postprandial hepatic nonesterified and dietary fatty acid uptake is increased and insufficiently curbed by adipose tissue fatty acid trapping in prediabetes with overweight. Diabetes 2022, 71, 1891–1901.
17. Magi, N.; Prasad, B. Activity monitoring for ICU patients using deep learning and image processing. SN Comput. Sci. 2020, 1, 123.
18. Davoudi, A.; Malhotra, K.R.; Shickel, B.; Siegel, S.; Williams, S.; Ruppert, M.; Bihorac, E.; Ozrazgat-Baslanti, T.; Tighe, P.J.; Bihorac, A. The intelligent ICU pilot study: Using artificial intelligence technology for autonomous patient monitoring. arXiv 2018, arXiv:1804.10201.
19. Ahmed, I.; Jeon, G.; Piccialli, F. A deep-learning-based smart healthcare system for patient's discomfort detection at the edge of Internet of things. IEEE Internet Things J. 2021, 8, 10318–10326.
20. Yeung, S.; Rinaldo, F.; Jopling, J.; Liu, B.; Mehra, R.; Downing, N.L.; Guo, M.; Bianconi, G.M.; Alahi, A.; Lee, J.; et al. A computer vision system for deep learning-based detection of patient mobilization activities in the ICU. NPJ Digit. Med. 2019, 2, 11.
21. Davoudi, A.; Malhotra, K.R.; Shickel, B.; Siegel, S.; Williams, S.; Ruppert, M.; Bihorac, E.; Ozrazgat-Baslanti, T.; Tighe, P.J.; Bihorac, A. Intelligent ICU for autonomous patient monitoring using pervasive sensing and deep learning. Sci. Rep. 2019, 9, 8020.
22. Rahim, A.; Maqbool, A.; Rana, T. Monitoring social distancing under various low light conditions with deep learning and a single motionless time of flight camera. PLoS ONE 2021, 16, e0247440.
23. Ren, W.; Liu, S.; Ma, L.; Xu, Q.; Xu, X.; Cao, X.; Du, J.; Yang, M.-H. Low-light image enhancement via a deep hybrid network. IEEE Trans. Image Process. 2019, 28, 4364–4375.
24. Guo, X.; Li, Y.; Ling, H. LIME: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 2016, 26, 982–993.
25. McCunn, L.J.; Safranek, S.; Wilkerson, A.; Davis, R.G. Lighting control in patient rooms: Understanding nurses' perceptions of hospital lighting using qualitative methods. HERD Health Environ. Res. Des. J. 2021, 14, 204–218.
26. Bernhofer, E.I.; Higgins, P.A.; Daly, B.J.; Burant, C.J.; Hornick, T.R. Hospital lighting and its association with sleep, mood and pain in medical inpatients. J. Adv. Nurs. 2014, 70, 1164–1173.
27. Leccese, F.; Montagnani, C.; Iaia, S.; Rocca, M.; Salvadori, G. Quality of lighting in hospital environments: A wide survey through in situ measurements. J. Light Vis. Environ. 2016, 40, 52–65.
28. Ring, E.; Ammer, K. The technique of infrared imaging in medicine. In Infrared Imaging: A Casebook in Clinical Medicine; IOP Publishing: Bristol, UK, 2015.
29. Liu, W.; Mei, T. Recent advances of monocular 2D and 3D human pose estimation: A deep learning perspective. ACM Comput. Surv. 2022, 55, 80.
30. Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
31. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
32. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
33. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
34. Jin, S.; Xu, L.; Xu, J.; Wang, C.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P. Whole-body human pose estimation in the wild. In European Conference on Computer Vision; Springer: Berlin, Germany, 2020.
35. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
36. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
37. Pishchulin, L.; Insafutdinov, E.; Tang, S.; Andres, B.; Andriluka, M.; Gehler, P.V.; Schiele, B. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
38. Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016.
39. Kocabas, M.; Karagoz, S.; Akbas, E. MultiPoseNet: Fast multi-person pose estimation using pose residual network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.