Applied Sciences, Vol. 13, Pages 387: Assessing the Applicability of Machine Learning Models for Robotic Emotion Monitoring: A Survey

1. Introduction

Mental health plays a vital role in our overall well-being. However, in recent times, mental health issues have escalated significantly. A survey of the mental health of 600,000 U.S. people showed that the number of adolescents reporting a depressive episode doubled between 2009 and 2017, and that many of these cases eventually ended in suicide [1]. This clearly indicates that mental health issues are growing steadily within our society. Numerous mental health issues are linked to social isolation [2,3,4]. The concern is further intensified by the upward trend in single-person households, especially in developed countries, where the figure reaches an alarming 60% [5]. Furthermore, loneliness is not limited to adolescents. Many elderly people receive less familial support and end up living alone. Hence, loneliness is becoming pervasive in our society, and mental health is sure to deteriorate if nothing is done to remedy the situation. Socially assistive robots can be useful in dealing with loneliness, as they can also function as companions [6,7].

Assistive robots have already seen widespread success in the healthcare and medicine sectors [8]. Their versatile contributions in these sectors, such as surgery [9], radiation therapy, cancer treatment, and animal therapy [10], lead us to believe that robots can also play a crucial role in coping with the current worldwide mental health situation. One possible use is to monitor patients' mental health and refer them to a professional neuro-therapist. The traditional approach to mental health monitoring is based entirely on patients recounting their days, so professionals have to rely on patients to give a true and accurate account of their health. However, people often have difficulty remembering events accurately. Further, sadness is highly correlated with depression among patients and is a prime component of clinical diagnoses [11]. Robots could therefore prove highly beneficial for monitoring mental health through emotion monitoring. Numerous ML studies and literature reviews have shown the potential of different sensors to monitor human emotions, but there is still a research gap in identifying suitable signal sources and ML models for robotic applications. Moreover, our literature search could not find any uniform methodology or analysis for assessing the available resources. As different studies used different datasets and sources with varying evaluation metrics, making a proper comparative analysis is a challenging task.

While conducting the survey, we also came across a few survey and review papers in the same or similar fields. Dzedzickis et al. [12] reviewed the sensors and methods used for mental health monitoring; however, their work pivoted on the sensors themselves and an engineering view of the emotion recognition process. The survey by Mohammed et al. [13] also did not prioritize the machine learning methods used for emotion recognition; instead, it focused on the challenges researchers face in developing human–robot interaction systems. Saxena et al. [14] analysed ML methods and feature-based techniques separately, without addressing the robotic applicability of these approaches; moreover, their survey mostly covered discrete emotion recognition, with only a single study in the valence-arousal category. The foremost objective of Yadav et al. [15] was speech emotion recognition and visual systems.
Many other signal sources that could potentially contribute to emotion recognition were not considered in their review.

This paper aims to analyze and determine which machine learning methods and signal sources are the most appropriate for emotion monitoring through robots. While there is plenty of research on emotion recognition and mental health, we systematically set a boundary to capture the latest work in this field, considering all papers relevant to emotion recognition and monitoring through machine learning from June 2015 to August 2022. Machine learning is one of the core subjects of this survey; however, not every machine learning algorithm is covered, because the surveyed researchers favoured the most widely used methods in their experiments. Further, even though there is a third emotion category (the hierarchical model), we only considered the two most widely used emotional categories. Recognizing emotions through robots could provide an accurate account of people's mental states, but to make this a reality we must determine the means to recognize emotions accurately. High accuracy in classifying emotion was therefore prioritized in our decision-making, alongside ease of implementation and the accessibility of signal sources. The current study can serve as a basis for future implementations of robotic mental health monitoring.

3. Methods

We explored six academic databases due to their relevance to the topic: IEEE Xplore, Google Scholar, ANU SuperSearch, Scopus, PubMed Central, and ResearchGate. To search these databases, a set of keywords was derived in consultation with university librarians. These keywords were robot*, emotion recognition, and sensor* (where * denotes a wildcard). For consistency, the six databases were searched for papers containing all three terms in any meta field. As technology is evolving rapidly, and to keep our review up to date, the search results were narrowed to papers published in the last seven years, from 1 June 2015 to 1 August 2022.

If a database searched this way returned fewer than 200 results, all of its papers were added to the screening pool. Where there were more than 200 results (Google Scholar, ANU SuperSearch, and Scopus), the results were sorted by the engine's definition of relevance and the top 200 were added to the screening pool. This resulted in a collection of 1141 articles in total, of which 885 were unique.
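Neither the databases' export formats nor our screening scripts are detailed here; purely as an illustration of the deduplication step (1141 retrieved records reduced to 885 unique ones), the minimal Python sketch below merges hypothetical CSV exports and removes duplicates by normalised title. The file paths and the "title" column name are assumptions, not part of the actual protocol.

```python
# Minimal sketch of merging search exports and removing duplicate records.
# Assumes one CSV export per database with a hypothetical "title" column;
# real exports differ between IEEE Xplore, Scopus, PubMed Central, etc.
import glob

import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("exports/*.csv")]
records = pd.concat(frames, ignore_index=True)  # 1141 records in our search

# Normalise titles so formatting differences do not hide duplicates.
records["title_key"] = (
    records["title"]
    .str.lower()
    .str.replace(r"[^a-z0-9]+", " ", regex=True)
    .str.strip()
)
unique = records.drop_duplicates(subset="title_key")  # 885 unique in our search
print(f"{len(records)} retrieved, {len(unique)} unique")
```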

Following this, the records were screened to retain papers relevant to our research interests. Articles were excluded if they: (1) were not original peer-reviewed publications, (2) were not related to emotion recognition, (3) did not mention the applicability of the research to improving robots, machines, or agents, or (4) did not state the applicability of the research in a mental health context. To validate the exclusions, two reviewers screened the 885 unique papers independently; their decisions disagreed on 17% of the papers. After discussion with a third reviewer, 80 papers were finally included for detailed analysis. The process and milestones are displayed in Figure 3. Both quantitative and qualitative data were then extracted from the 80 papers: the aim of each paper, the number of participants, the physiological data used, the methods used for emotion classification, the emotional category type, and the outcome of each paper.

4. Results and Discussion

Following data extraction, we assembled the classification accuracy results from the papers and identified the different signal sources. We found a total of 18 signal sources generated from different parts of the human body. Applying different machine learning methods, the studies attempted to identify or monitor human emotions. Among all papers, only one [64] used fully synthesized data, with no participants involved. Because many experiments involved multiple sensors, the number of signal sources and sensors is greater than the number of experiments. In total, 112 signals were studied across different physical and physiological sources, namely brain, lung, skin, heart, muscle, imaging, speech, tactile, etc. (details can be found in the Supplementary Material). The choice of classifier plays a key role in accurately classifying emotions; accordingly, the surveyed experiments used a variety of supervised, unsupervised, and hybrid classifiers.

To allow for accurate comparisons, papers are split into two main categories of emotion: discrete and valence-arousal. However, even among discrete emotion studies there are intensity experiments, e.g., anger intensity [65] and stress level [66], which add another dimension to the emotion classification task. Since this is not the same kind of classification as mapping a user state to the six basic emotions, and all results within a shared category should be comparable to each other, such experiments are categorized as "other". Gesture recognition tasks that are not validated against emotions are also placed in the other category. The distribution of emotion classification types is illustrated in Figure 4.

The focus of our study was detecting emotion correctly for better mental health monitoring. Among the 80 papers, 70 provided a single accuracy percentage or a range of accuracy percentages as an evaluation metric. Of the remaining ten studies, eight used different evaluation metrics for their experiments and are therefore excluded from Figure 5 and Figure 6. Carmona et al. [64] reported their results in terms of sensitivity and specificity, whereas Yu et al. [65] used RMSE to evaluate the accuracy of predicted ratings. Spaulding et al. [67] measured performance in terms of area under the curve, with results varying from 55% to 62%. The experiments performed by Wei et al. [68] and Mencattini et al. [69] were evaluated in terms of the correlation coefficient. Bhatia et al. [45] evaluated their performance on the basis of mean average precision, whereas Hassani et al. [70] and Yun et al. [71] used predictive and statistical data analysis rather than classification. The remaining two studies did not include any evaluation metric at all, as their aims were beyond classification tasks: Al-Qaderi et al. (2018) [72] proposed a perceptual framework for emotion recognition, and Miguel et al. (2019) [73] showed that socio-emotional brain areas do not react to affective touch in infants. These studies provide conclusions to their research questions but do not yield a percentage accuracy figure.

Discrete emotion classification experiments make up 54% of the total, followed by experiments mapping emotions to the valence-arousal plane (30%). The remaining 16% use emotion models that do not fit either of these emotion-label categories. Most experiments classified emotions using discrete labels, such as happy, neutral, or sad, or using the continuous valence-arousal plane. How studies used the plane varied: some split the V-A plane into quadrants to create four emotion labels (Figure 2; see the sketch below), while others measured distance along the valence and arousal axes. To visualize the findings of our survey, we created two separate scatter plots for the discrete (Figure 5) and valence-arousal (Figure 6) categories. The highest accuracies for the discrete and V-A categories are shown in Table 1 and Table 2. The graphs were plotted based on the data we assembled across the 80 papers. Four papers [50,74,75,76] did not provide separate accuracies for valence and arousal; instead, they provided an overall accuracy for their whole experiment. Neural network-based methodologies were plotted under the common label NN. Similarly, Bayes variants were denoted as Bayesian, tree-based methods were placed under DT, and hybrid methodologies or combinations of different methods were denoted as fusion.

In Figure 5, the accuracies of the methods are plotted against the source of the signals, with different colors denoting the methodologies used in the experiments. If an experiment was conducted under several experimental settings, the best result across those settings was used. Further, if an experiment used multiple sources together, it was considered a fusion source. The highest accuracy was achieved with imaging signals: of the 27 imaging signal studies, 25 reported accuracies above 80%. While two of the imaging signals showed poor accuracies (44.90% and 46.70%), the rest ranged from 80.33% to 99.90%. Facial imaging can therefore potentially be the most prevalent signal for emotion recognition. The brain, heart, and skin signal sources provided good accuracies, above 60–70%. Another signal of interest is speech audio, for which classification accuracies varied widely, from 55% to 99.55%; however, with accuracies of up to 90% in some cases, speech is definitely a signal worth considering for mental health robots. On the other hand, the tactile signal did not perform well at all: with an accuracy of 22.30%, tactile signals were the worst performer in the discrete category. A similar pattern can be seen in Figure 6, where tactile signals also had very low accuracies. Accuracies obtained from eye signals were unsatisfactory as well (52.70% and 59.60%).
The lung and muscle signals deserve a separate mention: they contributed only a few data points, so it would not be constructive to draw conclusions from their average performance. It is also worth noting that most of the fused-source signals had accuracies over 80%, and more than 90% in some cases; fusing signals from different sources is therefore another interesting approach to emotion recognition.

In Figure 6, the accuracies for valence are plotted against the accuracies for arousal. Different colors, shapes, and sizes represent different methodologies, sources, and numbers of participants, respectively. For accuracies reported as a range, the maximum value was used in both figures. The only consistently well-performing source is brain signals; none of the other sources provided good accuracy values, and of the 10 best accuracies, 8 came from brain signals. Brain signals, together with other associated signals, might also be useful for diagnosing and managing other brain disorders, for example multiple sclerosis [122,123] and autism spectrum disorder [124].
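As noted above, several of the surveyed studies reduce the continuous valence-arousal plane to a four-class problem by splitting it into quadrants (Figure 2). The following minimal Python sketch shows this common label construction; the 1–9 rating scale with a neutral midpoint of 5 and the quadrant names are conventional assumptions for illustration, not the exact setup of any particular surveyed paper.

```python
def va_quadrant(valence: float, arousal: float, midpoint: float = 5.0) -> str:
    """Map a (valence, arousal) rating pair to one of four quadrant labels.

    Assumes self-assessment ratings on a scale centred at `midpoint`
    (e.g., 1-9 ratings with a neutral midpoint of 5). Label names follow the
    common circumplex convention; individual studies differ in naming and
    thresholds.
    """
    high_v = valence >= midpoint
    high_a = arousal >= midpoint
    if high_v and high_a:
        return "HVHA"   # e.g., happy / excited
    if high_v and not high_a:
        return "HVLA"   # e.g., calm / relaxed
    if not high_v and high_a:
        return "LVHA"   # e.g., angry / fearful
    return "LVLA"       # e.g., sad / bored


# Example: a rating of valence 7.5 and arousal 3.0 becomes a "HVLA" (calm) label.
print(va_quadrant(7.5, 3.0))
```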

Table 2. Summary of the included papers in the valence-arousal emotional category.

Authors | Participant No. | Source | Datasets | Methods | Accuracy Valence (%) | Accuracy Arousal (%)
Altun et al. [125] | 32 | Tactile | Tactile | DT | 56 | 48
Mohammadi et al. [126] | 32 | Brain | EEG | KNN | 86.75 | 84.05
Wiem et al. [127] | 24 | Fusion | ECG, RV | SVM | 69.47 | 69.47
Wiem et al. [128] | 25 | Fusion | ECG, RV | SVM | 68.75 | 68.50
Wiem et al. [129] | 24 | Fusion | ECG, RV | SVM | 56.83 | 54.73
Yonezawa et al. [130] | 18 | Tactile | Tactile | Fuzzy | 69.1 | 63.1
Alazrai et al. [131] | 32 | Brain | EEG | SVM | 88.9 | 89.8
Bazgir et al. [132] | 32 | Brain | EEG | SVM | 91.1 | 91.3
Henia et al. [133] | 27 | Fusion | ECG, GSR, ST, RV | SVM | 57.44 | 59.57
Marinoiu et al. [134] | 7 | Imaging | RGB, 3D | NN | 36.2 | 37.8
Henia et al. [135] | 24 | Fusion | ECG, EDA, ST, Resp. | SVM | 59.57 | 60.41
Salama et al. [136] | 32 | Imaging | RGB | NN | 87.44 | 88.49
Pandey et al. [137] | 32 | Imaging | RGB | NN | 63.5 | 61.25
Su et al. [138] | 12 | Fusion | EEG, RGB | Fusion | 72.8 | 77.2
Ullah et al. [139] | 32 | Brain | EEG | DT | 77.4 | 70.1
Yin et al. [140] | 457 | Skin | EDA | NN | 73.43 | 73.65
Algarni et al. [141] | 94 | Imaging | RGB | NN | 99.4 | 96.88
Panahi et al. [142] | 58 | Heart | ECG | SVM | 78.32 | 76.83
Kumar et al. [106] | 94 | Imaging | RGB | NN | 83.79 | 83.79
Martínez-Tejada et al. [116] | 40 | Brain | EEG | SVM | 59 | 68

For both Figure 5 and Figure 6, neural network-based methods outperformed the other methods. K-Nearest Neighbours and Support Vector Machines also performed well in both emotional categories, although Decision Tree-based methods slightly outperformed KNN and SVM in the discrete category. The highest accuracy was scored by neural network-based methods, but the most common method across the surveyed studies, used in 30% of them, is SVM. There could be two reasons for the comparatively lower number of papers using NNs. First, training NNs or any deep network requires a large amount of data, which is hard to come by with physiological signals; even the largest sample size across all of the papers was 457 participants, for an EDA-based experiment, whereas image datasets can contain thousands of faces, not counting video datasets. Second, neural network-based methods demand more computational effort, as SVM training is faster than training neural networks [143] (a minimal sketch of a typical SVM pipeline is given below).

It is also noticeable that the accuracy of imaging (RGB sensors) appears much higher than that of other sensors, representing 40% of the papers with the highest accuracies. However, while facial expression accuracy is very high compared with other physiological categories, there is a key difference in emotional validation. Facial expression is a kind of derived signal: people can counterfeit a smile. Unless we can differentiate between fake and genuine facial expressions, the emotional expressions we obtain from patients might not always represent their true mental state. One could argue that if robots were used in a person's home environment, where they are more likely to be relaxed, they would be more likely to capture the person's genuine emotions. Furthermore, there is evidence that facial muscles activate differently depending on whether a smile is genuine or acted [144]. However, since none of the papers investigated the difference between genuine and acted smiles, it remains to be seen how useful standalone imaging (RGB cameras) can be for mental health monitoring.

Moreover, even though emotion recognition using imaging sources scored well, there is much debate on the link between facial expression and true emotion [145], and for our purpose of mental health monitoring it is vital that we determine the patients' mental state accurately. In addition, high computational power is required for the deep NN methods used to analyse imaging data. It is thus unclear whether imaging is the most suitable sensor for mental health monitoring.
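To make the SVM-versus-NN trade-off concrete, the sketch below shows the typical pipeline used in much of the surveyed work: hand-crafted features extracted from signal windows, standardised, and classified with an RBF-kernel SVM. The feature count and the randomly generated data are placeholders for illustration, not taken from any surveyed experiment.

```python
# Hypothetical baseline: SVM on hand-crafted features from windowed signals.
# Features and data are synthetic placeholders; the surveyed papers use their
# own feature sets (statistical, spectral, etc.) and datasets.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_windows, n_features = 400, 12          # e.g., mean, SD, band powers per window
X = rng.normal(size=(n_windows, n_features))
y = rng.integers(0, 4, size=n_windows)   # four discrete emotion labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

Such a pipeline trains in seconds on a laptop, which is one reason SVMs remain popular when the available physiological data are limited.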
In addition, brain signal sources such as EEG are too invasive and non-consumer-friendly to be used in this space [146,147]. EDA-based skin signals, however, also performed well. Out of the 80 papers, 17 used skin signals for emotion recognition; only two used standalone EDA [81,140], while the rest used a fusion of sensors. Notably, the three EDA emotion recognition experiments that used CNNs achieved accuracies ranging from 68.5% to 95%, averaging 79%. CNNs are therefore likely a good strategy for classifying skin data, but incorporating skin signals into robots remains challenging. A future direction of this study is to investigate the feasibility of incorporating skin-based physiological sensors into robots. Another signal source of interest, at 7.1% of the total, is speech audio. Audio recordings of people talking are used to classify their emotions, and such recordings can easily be obtained by robots, although classification accuracy varies widely across speech-audio studies. With accuracies of over 90% in some cases [103,105], speech audio is definitely a signal worth considering for mental health robots.
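As a sketch of how a CNN can be applied to skin-conductance data, the following minimal 1D CNN (written here with Keras, which none of the surveyed papers is confirmed to use) classifies fixed-length EDA windows into four emotion classes. The window length, channel count, and layer sizes are illustrative assumptions, not the architecture of any surveyed experiment.

```python
# Illustrative 1D CNN for windowed EDA signals; shapes and architecture are
# assumptions for this sketch, not taken from any surveyed paper.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window_len, n_channels, n_classes = 512, 1, 4   # e.g., ~2 min of 4 Hz EDA

model = keras.Sequential([
    keras.Input(shape=(window_len, n_channels)),
    layers.Conv1D(16, kernel_size=7, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder data standing in for labelled EDA windows.
X = np.random.randn(256, window_len, n_channels).astype("float32")
y = np.random.randint(0, n_classes, size=256)
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```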

We conducted an extensive survey and found some promising results; however, there are still a few limitations to address in future work. The main limitation of the current study is the lack of common ground for comparison: each experiment differs from the others in sources, ML models, and sometimes even evaluation metrics. Therefore, our study could not directly compare different models and sources; instead, we set up a priority list of sources and ML models for robotic emotion monitoring. Moreover, most of the neural network-based methods outperformed traditional ML methods, yet in many cases the experiments suffered from a lack of data. As per our survey, the NN-based LSTM was the highest-performing method on valence-arousal data, but only a few experiments used NNs, so the applicability of NN models to emotion recognition still needs to be explored. Another limitation is that, even though facial imaging data achieved the highest accuracy among the data sources, most of the works did not consider fake expressions; humans are capable of faking an expression, which may alter the results. Further, humans can experience more than one feeling at a time, which none of the experiments considered. Fake emotions and multiple simultaneous emotions therefore also need to be considered in future experiments in this field.

5. Summary

Our survey assessed 80 recent articles on robotic emotion recognition covering two emotional categories, discrete and valence-arousal, and discussed the applicability of different sources and ML models in robots. For both categories, neural network-based methods, especially CNN, performed best: in the discrete category, the highest accuracy of 99.90% was achieved by a CNN, while another neural network-based method, LSTM, was the best performer in the valence-arousal category, with accuracies of 99.40% and 96.88% for valence and arousal, respectively. The majority of the experiments that used neural networks reported accuracies above 80%. Besides neural network models, SVM can serve as an alternative, as it has been widely used by numerous researchers, is easy to implement, and achieved accuracies of 80% to 99.82%. Among the signal sources, imaging signals were the best-performing and most widely used. Within the VA category, the top eight best-performing models used brain signals, showing their great potential for this kind of recognition; both imaging and brain signals performed well across the VA and discrete categories. Tactile signals performed worst in both categories, which suggests they should be used cautiously for human emotion recognition. It is also noticeable that fused signals performed comparatively better than individual signals. In terms of applicability, brain signals require sophisticated acquisition devices and data-processing procedures, while imaging signals can be readily used in ML models; therefore, for emotion monitoring within humanized robots, we believe imaging sources could be the first choice. Overall, neural networks and SVM as ML methods, and facial imaging as a signal source, appear most promising for further research on emotion monitoring, with some focus on fusing signals to make the models more robust.
