Yoga is a traditional Indian way of keeping the mind and body fit, through physical postures (asanas), voluntarily regulated breathing (pranayama), meditation, and relaxation techniques. The recent pandemic has seen a huge surge in numbers of yoga practitioners, many practicing without proper guidance. This study was proposed to ease the work of such practitioners by implementing deep learning-based methods, which can estimate the correct pose performed by a practitioner. The study implemented this approach using four different deep learning architectures: EpipolarPose, OpenPose, PoseNet, and MediaPipe. These architectures were separately trained using the images obtained from S-VYASA Deemed to be University. This database had images for five commonly practiced yoga postures: tree pose, triangle pose, half-moon pose, mountain pose, and warrior pose. The use of this authentic database for training paved the way for the deployment of this model in real-time applications. The study also compared the estimation accuracy of all architectures and concluded that the MediaPipe architecture provides the best estimation accuracy.
Keywords: Artificial intelligence, deep learning, machine learning techniques, pose estimation techniques, skeleton and yoga
Yoga is an ancient Indian science and a way of living that includes the adoption of specific bodily postures, breath regulation, meditation, and relaxation techniques practiced for health promotion and mental relaxation. In recent years, yoga has been adopted internationally for its health benefits. Among several techniques, physical postures have become very popular in the Western world. Yoga is not only about the orientation of the body parts but also emphasizes breathing and being mindful.[1] The traditional Sanskrit name for yoga postures is asanas. During the pandemic, many people have used yoga to keep themselves physically and mentally fit.[2] Many people practice fine forms of asanas without a teacher to guide them, either because no trained yoga instructors are available or because they are unwilling to engage one. Nevertheless, it is important to perform asanas correctly, so that the practitioner does not sustain injury.[3] Furthermore, asanas should be practiced systematically, paying attention to the orientation of the limbs and the breathing. Improper stretching, performing inappropriate asanas, and breathing inappropriately when exercising can be injurious to health. Improper postures can lead to severe pain and chronic problems.[4] Hence, a scientific analysis of asana practice is all-important. The present work was developed with this in mind.
Pose estimation techniques can be used to identify the accurate performance of yoga postures.[5] Pose estimation algorithms have been used to mark the key points and draw a skeleton on the human body for real-time images and used to determine the best algorithm for comparing the poses. Posture estimation tasks are challenging as they require creating datasets from which real-time postures can be estimated.[6]
This study estimated the five asanas performed by the participant using four different deep learning architectures: EpipolarPose, OpenPose, PoseNet, and MediaPipe. These architectures are especially suitable for pose estimation. Deep learning architectures were trained for the abovementioned five asanas. The training was carried out on an authentic database at S-VYASA Deemed to be University, hence suitable for real-time and practical applications. The dataset consisted of about 6000 images of the above five postures, of which 75% of the dataset was used in training the model, whereas 25% was used for testing.
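The 75%/25% partition described above can be pictured with a short sketch. The text does not say how the split was drawn, so the random shuffle, the fixed seed, the `split_dataset` helper, and the placeholder file names are all assumptions made for illustration.

```python
import random


def split_dataset(items, train_fraction=0.75, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the roughly 6000 images.
image_names = [f"img_{i:04d}.jpg" for i in range(6000)]
train_set, test_set = split_dataset(image_names)
print(len(train_set), len(test_set))  # 4500 1500
```

With five classes, a per-class (stratified) split would keep the posture proportions equal in both parts; whether the study did this is not stated.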
Human Body Modeling
Human body modeling is essential to estimate a human pose by locating the joints in the body skeleton from an image. Most methods use kinematic models where the body's kinematic structure and shape information is represented by its joints and limbs.[7] Different types of human body modeling are shown in [Figure 1].
The human body can be modeled using a skeleton-based (kinematic) model, a planar (contour-based) model, or a volumetric model, as shown in [Figure 1]. The skeleton-based model represents a human body having different key points showing the positions of the limb with orientations of the body parts.[8],[9]
However, the skeleton-based model does not represent the texture or shape of the body. The planar model represents the human body by multiple rectangular boxes yielding a body outline showing the shape of a human body.[10] The volumetric model represents a three-dimensional (3D) model of well-articulated human body shapes and poses.[11] The challenges involved in human pose estimation are that the joint positions could change due to diverse forms of clothes, viewing angles, background contexts, and variations in lighting and weather,[12] making it a challenge for image processing models to identify the joint coordinates and especially difficult to track small and scarcely visible joints.
Human Pose Estimation
Computer vision is used to estimate the human pose by identifying human joints as key points in images or videos, for example, the left shoulder, right knee, elbows, and wrist.[13] Pose estimation seeks the exact pose in the space of all performed poses. It can be done by single-pose or multipose estimation: a single object is estimated by the single-pose method, and multiple objects are estimated by the multipose method.[14] Human posture assessment can be done by model-based mathematical estimation, known as generative strategies, or by learning directly from images, known as discriminative strategies.[15] Image processing techniques use AI-based models, such as convolutional neural networks (CNNs), whose architecture can be tailored to human pose inference.[16] Pose estimation can follow either a bottom-up or a top-down approach.
In the bottom-up approach, body joints are first estimated and then grouped to form unique poses, whereas top-down methods first detect a boundary box and only then estimate body joints.[17]
Pose estimation with deep learning
Deep learning solutions have shown better performance than classical computer vision methods in object detection. Therefore, deep learning techniques offer significant improvements in pose estimation.[18],[19]
The pose estimation methods compared in this research include EpipolarPose, OpenPose, PoseNet, and MediaPipe.
EpipolarPose
EpipolarPose constructs a 3D structure from a 2D image of a human pose. The main advantage of this architecture is that it does not require any 3D ground truth data.[20] A 2D image of the human pose is first captured, and epipolar geometry is then used to train a 3D pose estimator.[21] Its main disadvantage is that it requires at least two cameras. The sequence of steps for training is shown in [Figure 2]. The upper row of [Figure 2] (orange) depicts the inference pipeline and the bottom row (blue) shows the training pipeline.
The input block consists of images of the same scene (human pose) captured from two or more cameras. These images are then fed to a CNN pose estimator. The same set of images is then fed to the training pipeline, and after triangulation, the 3D human pose obtained (V) is fed back to the upper branch. Hence, this architecture is self-supervised.
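To make the triangulation step concrete, the sketch below recovers one 3D joint from two camera rays using the simple midpoint method. This is a swapped-in illustration, not the triangulation used by EpipolarPose itself: it assumes the camera centres and the ray directions toward the joint are already known in a common world frame, and the function name `triangulate_midpoint` is hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return the 3D point midway between the closest points of two rays.

    c1, c2: camera centres; d1, d2: ray directions toward the joint.
    All are 3-element sequences expressed in the same world frame (assumed).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w0 = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the joint cannot be triangulated")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]


# Two made-up cameras, both looking at a joint located at (1, 2, 5).
joint = triangulate_midpoint([0, 0, 0], [1, 2, 5], [2, 0, 0], [-1, 2, 5])
print(joint)
```

With a single camera there is only one ray per joint, so the depth along that ray stays undetermined, which is why at least two views are needed.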
OpenPose
OpenPose is a 2D approach for pose estimation.[22] The OpenPose architecture is shown in [Figure 3]a, [Figure 3]b, [Figure 3]c. Input images can also be sourced from a webcam or CCTV footage. The advantage of OpenPose is the simultaneous detection of body, facial, and limb key points.[23] [Figure 3]a shows VGG-19, a trained CNN architecture from the Visual Geometry Group that is used to classify images using deep learning. It has 16 convolutional layers and 3 fully connected layers, making 19 layers in total, hence the name VGG-19. The features extracted by VGG-19 are fed to a “two-branch multistage CNN,” as shown in [Figure 3]b. The top part of [Figure 3]c predicts the position of the body parts, and the bottom part predicts the affinity fields, i.e., the degree of association between different body parts. By these means, the human skeletons in the image are evaluated.
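The layer count behind the name VGG-19 can be checked with a few lines. The list below is the commonly published VGG-19 layout (numbers are the output channels of each convolution, "P" marks a pooling layer); the variable names are chosen here for illustration only.

```python
# Commonly published VGG-19 layout: integers are the output channels of a
# convolution layer, "P" marks a max-pooling layer.
VGG19_CONV_CONFIG = [
    64, 64, "P",
    128, 128, "P",
    256, 256, 256, 256, "P",
    512, 512, 512, 512, "P",
    512, 512, 512, 512, "P",
]
# Output sizes of the three fully connected layers of the classifier.
VGG19_FC_UNITS = [4096, 4096, 1000]

conv_layers = sum(1 for v in VGG19_CONV_CONFIG if v != "P")
pool_layers = VGG19_CONV_CONFIG.count("P")
weight_layers = conv_layers + len(VGG19_FC_UNITS)

print(conv_layers, pool_layers, weight_layers)  # 16 5 19
```

Pooling layers have no trainable weights, so they are not counted in the 19.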
Figure 3: (a) VGG-19 Convolution Neural Network (C-Convolution, P-Pooling). (b) Convolution layer branches. (c) OpenPose architecture
PoseNet
The PoseNet can also take video inputs for pose estimation; it is invariant to image size; hence, it gives a correct estimation even if the image is expanded or contracted[24],[25] and can also estimate single or multiple poses.[26] The architecture shown in [Figure 4] has several layers with each layer having multiple units. The first layer includes input images to be analyzed; the architecture consists of encoders that generate visual vectors from the image. These are then mapped onto a localization feature vector. Finally, two separated regression layers give the estimated pose.
MediaPipe
This is an architecture for reliable pose estimation. It takes a color image and pinpoints 33 key points on the image. The architecture is shown in [Figure 5].
A two-step detector–tracker ML pipeline is used for pose estimation.[27] Using a detector, this pipeline first locates the pose region-of-interest (ROI) within the frame. The tracker subsequently predicts all 33 pose key points from this ROI.[28]
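The control flow of a detector-tracker pipeline can be sketched without the library itself. In the sketch below, `run_pipeline`, `fake_detector`, `fake_tracker`, the ROI and key point tuple layouts, and the 0.5 confidence threshold are all hypothetical stand-ins; the sketch also assumes that the detector is re-run only when the tracker loses the pose, and it is not MediaPipe's actual API.

```python
NUM_KEYPOINTS = 33  # number of pose key points predicted by the tracker


def run_pipeline(frames, detect_roi, track_pose, min_confidence=0.5):
    """Run the detector only when no ROI is carried over from the last frame."""
    roi = None
    results = []
    detector_calls = 0
    for frame in frames:
        if roi is None:
            roi = detect_roi(frame)  # step 1: locate the pose ROI
            detector_calls += 1
        if roi is None:
            results.append(None)  # no person found in this frame
            continue
        # Step 2: predict all key points inside the ROI.
        keypoints, confidence, next_roi = track_pose(frame, roi)
        results.append(keypoints)
        # Reuse the ROI while tracking is confident; otherwise re-detect.
        roi = next_roi if confidence >= min_confidence else None
    return results, detector_calls


# Hypothetical stand-ins for the real models.
def fake_detector(frame):
    return (0.0, 0.0, 1.0, 1.0)  # x, y, width, height (assumed layout)


def fake_tracker(frame, roi):
    confidence = 0.1 if frame == "lost" else 0.9
    keypoints = [(0.5, 0.5, confidence)] * NUM_KEYPOINTS
    return keypoints, confidence, roi


poses, detector_calls = run_pipeline(
    ["f0", "f1", "lost", "f3"], fake_detector, fake_tracker
)
print(len(poses), detector_calls)  # 4 2
```

Skipping the detector on most frames is what makes this design attractive for real-time use, since only the lighter tracking step runs on every frame.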
Methodology Adopted
Initially, the image of a yoga practitioner performing an asana was captured by a camera and fed separately to the four deep learning architectures, which then estimated the pose performed by the practitioner by comparing it with the pretrained model. If the pose did not match any of the five asanas, an error was shown.
Twenty practitioners in the age group of 18–60 years performing different postures in real time were captured and fed separately to the proposed architectures, and a comparison of the estimated accuracy was done.
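The matching step described above, including the error case, might look like the following sketch. The paper does not state how a mismatch is decided, so the per-asana scores, the `label_pose` helper, and the 0.6 threshold are hypothetical.

```python
ASANAS = [
    "Ardha Chandrasana",
    "Tadasana",
    "Trikonasana",
    "Veerabhadrasana",
    "Vrukshasana",
]


def label_pose(scores, threshold=0.6):
    """Return the best-matching asana, or an error if no score is high enough.

    scores: mapping from asana name to a match score in [0, 1] (assumed form).
    """
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return "error: pose does not match any trained asana"
    return best


# Made-up scores for two captured images.
clear_match = dict(zip(ASANAS, [0.05, 0.85, 0.04, 0.03, 0.03]))
no_match = dict(zip(ASANAS, [0.22, 0.20, 0.19, 0.21, 0.18]))
print(label_pose(clear_match))  # Tadasana
print(label_pose(no_match))     # error: pose does not match any trained asana
```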
Results
Pose estimation for five yoga postures was done using different proposed techniques. The results of pose estimation were shown for each of the five asanas for all the four architectures used. For simplicity, the images of the same individual were shown (after taking consent) for all estimations and comparisons. The five yoga poses considered for posture estimation are as follows:
Ardha Chandrasana/half-moon pose,
Tadasana/mountain pose,
Trikonasana/triangular pose,
Veerabhadrasana/warrior pose,
Vrukshasana/tree pose.
Results of pose estimation using EpipolarPose
The pose estimation results obtained for five yoga postures using an EpipolarPose are shown in [Figure 6].
Figure 6: Key point detection by EpipolarPose. (a) Ardhachandrasana. (b) Tadasana. (c) Trikonasana. (d) Veerabhadrasana. (e) Vrukshasana
Results of pose estimation using OpenPose
The pose estimation results obtained for five yoga postures using OpenPose are shown in [Figure 7].
Figure 7: Key point detection by OpenPose. (a) Ardhachandrasana. (b) Tadasana. (c) Trikonasana. (d) Veerabhadrasana. (e) Vrukshasana
Results of pose estimation using PoseNet
The pose estimation results obtained for five yoga postures using PoseNet are shown in [Figure 8].
Figure 9: Key point detection by MediaPipe. (a) Ardhachandrasana. (b) Tadasana. (c) Trikonasana. (d) Veerabhadrasana. (e) Vrukshasana
Results of pose estimation using MediaPipe
The pose estimation results obtained for five yoga postures using MediaPipe are shown in [Figure 9].
Pose estimation of the five yoga postures was done for the different methods, as shown in [Figure 6], [Figure 7], [Figure 8], [Figure 9]. After validation of the model, 20 sample images were captured in real time and fed individually to the model, and the posture accuracy was estimated. The average accuracy is summarized in [Table 1]. The accuracy measure used here is the classification score, which is the ratio of the number of correct predictions (CP) to the total number of predictions (TP), where TP is the sum of CP and the number of wrong predictions (WP).
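The classification score defined above, CP / (CP + WP), translates directly into code. The function name `classification_accuracy` and the sample labels are illustrative only; the labels are made-up data and not the study's results.

```python
def classification_accuracy(predicted, actual):
    """Return CP / (CP + WP) for two equally long label sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)  # CP
    wrong = len(actual) - correct                                  # WP
    return correct / (correct + wrong)                             # CP / TP


# Made-up labels for 20 test images: 16 correct, 4 wrong.
actual_labels = ["Tadasana"] * 20
predicted_labels = ["Tadasana"] * 16 + ["Vrukshasana"] * 4
score = classification_accuracy(predicted_labels, actual_labels)
print(f"{score:.0%}")  # 80%
```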
It is observed that the accuracy of prediction using EpipolarPose was around 50%. This is because EpipolarPose is generally suited for describing and analyzing multicamera vision systems dealing with two viewpoints of the same points in a pair of images.[29] As this work involves capturing the image from only one camera, the pose accuracy is lower, and it may also be observed that fewer key points are detected [Figure 6]a.
It is observed that the accuracy of prediction using OpenPose was around 70%. OpenPose is preferred for 2D pose detection in multiperson systems, covering body, facial, foot, and hand key points.[30] It is reported to have been used for vehicle detection as well. This method has difficulty estimating poses when the ground truth has nontypical postures and in crowded images, where key points overlap. The number of key points detected is higher than with EpipolarPose, yet accuracy was compromised during computation [Figure 7]a because graphics processing unit (GPU)-powered systems were not used.
It has been reported that when a fully connected layer is used to detect the features in PoseNet, the results worsen because the network is likely to overfit to the training data. In our work, PoseNet gave an accuracy of about 80%. [Figure 8] shows the key point detection by PoseNet.
However, MediaPipe has better accuracy than EpipolarPose, OpenPose, and PoseNet. It may be observed from [Table 1] that the lower accuracy of the other methods could also be due to pose estimation using a single camera.
Background light and contrast also influence the accuracy values. Even so, MediaPipe provides better results and estimates postures more accurately than the other methods; hence, it is the most suitable technique for pose classification. The accuracy for a few postures is also lower with MediaPipe because it does not detect the neck key point.
The accuracy of each method could be increased further by enlarging the training dataset; nevertheless, the results clearly illustrate the comparison between the different pose estimation methods.
Discussion
The present study used four different deep learning architectures, i.e., EpipolarPose, OpenPose, PoseNet, and MediaPipe, which are suitable for pose estimation to evaluate yoga postures, and the results support the fact that MediaPipe has better accuracy compared to the other methods despite using a single camera.
Muhammed et al.[21] used a self-supervised EpipolarPose pose estimation model which does not need 3D ground-truth data or camera parameters. During training, a 3D pose is obtained using the geometry of 2D poses estimated from multiview images and is used to train a 3D pose estimator. Furthermore, Yihui et al.[32] proposed a differentiable epipolar transformation model in which the 2D detector leverages 3D-aware features to improve 2D pose estimation.
Haque et al.[33] used a CNN to estimate the human pose present in a 2D image with an accuracy of 82.68, and Dushyant et al.[34] reported on techniques using a CNN to estimate 2D and 3D pose features using an architecture called SelecSLS Net and then predicted a skeletal model fit. Jose and Shailesh[35] used a 3D CNN architecture, a modified version of the C3D architecture, for pose estimation, which gave an accuracy of 91.5%. Santosh Kumar et al.[36] reported an accuracy of 99.04% using a CNN for feature extraction and LSTM for temporal prediction.
Yoga is a form of physical exercise that demands accurate performance. Anilkumar et al.[37] reported on a yoga monitoring system which estimates and analyzes the yoga posture and notifies the user of errors in the posture through a display screen or a wireless speaker. The inaccurate body pose of the user can be pointed out in real time, so that the user can rectify the mistakes. In that work, the nose is assumed to be the origin, so that all calculations are done with respect to the location of the nose in the image. An imaginary horizontal line passes through the nose's coordinates. This line is the X-axis, and all the angles are calculated with respect to it. However, in our work, we have divided the image into quadrants and compared the key points. Deepak and Anurag[38] uploaded a photo of the user performing the pose and compared it with the pose of the expert, and the difference in angles of various body joints was calculated. Based on this difference of angles, feedback is provided to the user to improve the pose.
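The nose-as-origin angle calculation and the angle-difference feedback described above can be sketched as follows. The pixel coordinates, the convention that image y grows downward, and the helper names `angle_from_nose` and `angle_difference` are assumptions made for illustration; the cited works may compute their angles differently.

```python
import math


def angle_from_nose(nose, point):
    """Angle in degrees, in [0, 360), of the nose-to-point vector measured
    from the horizontal line through the nose."""
    dx = point[0] - nose[0]
    dy = nose[1] - point[1]  # image y grows downward; flip so "up" is positive
    return math.degrees(math.atan2(dy, dx)) % 360.0


def angle_difference(a, b):
    """Smallest absolute difference between two angles in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


# Made-up pixel coordinates (x, y) for a user and an expert image.
nose = (100, 100)
user_wrist = (200, 100)    # level with the nose, to the right
expert_wrist = (100, 0)    # directly above the nose

user_angle = angle_from_nose(nose, user_wrist)
expert_angle = angle_from_nose(nose, expert_wrist)
print(user_angle, expert_angle, angle_difference(user_angle, expert_angle))
```

A feedback rule could then flag any joint whose difference exceeds a chosen tolerance; the tolerance itself would need to be tuned per posture.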
Chen et al.[39] proposed a yoga posture recognition system that used Microsoft Kinect to detect the joints of the human body and extract the skeleton and then calculated various angles to estimate the poses, reporting an accuracy of 96%. Chiddarwar et al.[40] reported a technique for an Android application and discussed the methodology used for yoga pose estimation. However, the present study demonstrated that MediaPipe has better accuracy compared to the other methods despite using a single camera.
Further research would be needed to expand this technique for other advanced postures for pose estimation and correction using the same methodology which involves simple tools with better accuracy to assist individuals practicing Yoga postures as a self-evaluation as well as a biofeedback mechanism.
Conclusions
Human pose estimation can be effectively used in the health and fitness sector. Pose estimation for fitness applications is particularly challenging due to the wide variety of possible poses with large degrees of freedom, occlusions where the body or other objects hide limbs from the camera, and a variety of appearances or outfits. This work estimates the accuracy for different postures and compares it across four different architectures. Based on the results, the study concludes that the MediaPipe architecture provides the best estimation accuracy.
Acknowledgment
The authors would like to thank B N M Institute of Technology and SVYASA Deemed to be University for jointly collaborating toward the completion of this research work.
Ethical clearance
The study was approved by the Institutional Ethics Committee of Swami Vivekananda Yoga Anusandhana Samsthana (S-VYASA), Bengaluru (Approval Letter No: RES/IEC-SVYASA/193/2021.). The study procedure was explained and signed consent was obtained from the participants.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References
Correspondence Address:
D Mohan Kishore
Swami Vivekananda Yoga Anusandhana Samsthana (S-VYASA), Jigani, Bengaluru – 560105, Karnataka
India
DOI: 10.4103/ijoy.ijoy_97_22