Obesity is a public health issue that leads to chronic diseases [], including diabetes [], cancer [], and cardiovascular diseases []. In the United Kingdom, obesity levels increased from 15% in 1993 to 28% in 2019 []. Similarly, in the United States, obesity levels increased from 14.5% in the 1970s to 39.6% []. Furthermore, poor diet was estimated to have contributed to 11 million deaths globally in 2017 [].
Given these alarming statistics, gaining insight into people’s dietary habits is crucial for designing effective interventions aimed at promoting a healthy lifestyle. Dietary behavior tracking includes a spectrum of approaches ranging from manual to highly automated methods. At the most manual end, traditional food diaries require users to record, manually or digitally, every item they eat or drink. The most commonly used manual tools to assess dietary intake and eating behaviors are 24-hour recalls, food records (food diaries), and food frequency questionnaires [,]. Major limitations of these methods include participant burden and recall or memory bias [], which can lead to under- and overreporting of dietary intake. Digital tools and apps (eg, MyFitnessPal []) simplify the manual input process and integrate nutritional data, yet they require active user engagement and, in some cases, nutrition knowledge to estimate calorie intake from precooked meals. A visual and less structured alternative is photographing meals, which offers another way to recall and review dietary choices, sometimes shared with a dietitian for professional advice.
Related Works
The existing research indicates a growing interest in developing automated tools for monitoring eating activities. Regarding the tools and studies closest to ours, several studies have explored monitoring eating activities based on sensor-enabled glasses. Most of these studies focus on detecting eating or drinking episodes [-] and are performed in controlled environments [,]. Only one study has explored a more complicated scenario than the typical eating or noneating detection [] by exploring the detection of chewing events using eyeglasses equipped with electromyography sensors in a study involving 10 participants in both controlled and real-life conditions. Compared to the existing work, we present the first study to use smart glasses with integrated optical surface tracking sensors and deep learning (DL) to accurately identify both eating and chewing events, assessed both in controlled laboratory settings and through real-life trials, thus addressing research gaps and demonstrating its efficacy in natural environments.
Objective
This study aimed to develop and evaluate a novel, noninvasive system for automatically monitoring eating behavior by detecting eating and chewing activities. The system aims to enhance the accuracy and ease of tracking eating behaviors, addressing the limitations of self-reporting by providing precise, objective data.
The study provides a comprehensive evaluation of the proposed method using a combination of laboratory-controlled and real-life user studies, ensuring a robust and noninvasive way to distinguish chewing activity from other activities, such as speaking, teeth clenching, grinding, smiling, frowning, brow raising, and winking.
The real-life data collection and analysis address a substantial gap in previous research and allow for the evaluation of the system’s performance in natural settings, providing insights into its practical application and adaptability.
Throughout this study, several terms related to eating behaviors are used. To ensure clarity and consistency in their use, the following definitions are provided:
Bite: the act of placing food into the mouth, chewing it, and then swallowing it as part of the eating process.
Chew: a masticatory cycle involving the grinding or crushing of food with the teeth, preparing it for swallowing.
Chewing: the overall process of breaking down food with the teeth.
Chewing rate: the frequency of masticatory cycles (chews) per unit of time, measured in chews per second.
Eating segment: a continuous period during which the participant consumes food without interruption, encompassing consecutive bites and chewing cycles without pauses between bites. Thus, one eating segment can include one or several bites and chewing events.

Smart Glasses and Data Collection Setup
Overview
In this section, we describe our data collection setup, providing insights into the configuration and sensors of the used smart glasses. In addition, we describe the methodologies used for data collection in both controlled laboratory settings and real-life scenarios.
In contrast to methods that require manual input, in this study, we propose an approach to automatic monitoring of eating behavior by monitoring facial muscle activations using optical sensors incorporated in the smart glasses frame. The approach offers real-time feedback that can be integrated with mobile health apps, allowing users to monitor their dietary habits seamlessly. The data collected can be used to personalize dietary recommendations, support weight management programs, and contribute to research in nutritional epidemiology. Ultimately, the goal is to empower individuals with actionable insights to improve their eating habits and promote long-term health and well-being.
The proposed system is depicted in Figure 1: (A) facial muscles associated with chewing that we aim to monitor; (B) the areas of skin that are monitored by the system; and (C) OCO optical sensors embedded in smart glasses. One of the muscles associated with the chewing activity is the temporalis muscle. The temporalis muscle is near the temple and extends downward in a direction toward the mouth. It controls the movement of the lower jaw (eg, opening and closing of the mouth). This area is monitored by the OCO temple sensor in the glasses. Other muscles that are activated during chewing are the cheek muscles, such as the zygomaticus major and minor. This area is monitored by the OCO cheek sensor in the glasses. Our approach is based on the assumption that chewing activates multiple facial muscles, which causes the facial skin to move in a parallel direction relative to the sensors embedded within the glasses frame. These movements of the facial skin in the X-Y plane are monitored by our novel patented optical tracking sensors (OCO). The optical sensor data are then analyzed using DL to distinguish chewing activity from other activities that cause facial skin movements, such as speaking, teeth clenching, smiling, frowning, winking, and similar activities.
Figure 1. (A) The 2 types of facial muscles related to chewing (temporalis and zygomaticus); (B) skin areas monitored by the smart glasses; and (C) the placement of OCO sensors within the glasses frame.

OCOsense Smart Glasses and OCO Sensors Data
The OCOsense smart glasses integrate 6 optical tracking (OCO) sensors [], 3 proximity sensors, a 9-axis inertial measurement unit, an altimeter, and dual speech detection microphones. The OCO sensors use optomyography, an optical noncontact methodology, to measure skin movement in 2 dimensions resulting from underlying myogenic activity. They consist of an optical surface tracking sensor that measures relative movements on the skin’s surface in 2 dimensions (X and Y). These sensors operate accurately within a range of 4 to 30 mm without requiring direct skin contact []. Positioned within the glasses frame, their focus lies on monitoring skin movement over specific facial muscle groups, including the frontalis and corrugator muscles on both sides of the forehead, the zygomaticus major and minor muscles on the left and right sides of the cheeks, the orbicularis muscles around each eye, and the left and right temples.
The eating activity activates two types of facial muscles that we can monitor with the glasses: (1) the temporalis muscle, which is near the temple and controls the movement of the lower jaw (opening and closing of the mouth), and (2) the zygomaticus major and minor, which are located in the cheek area and are activated during the chewing activity. Therefore, in this paper, we primarily focus on data collected from the cheek and temple OCO sensors (marked with green rectangles in Figure S1 in ), as these areas are more relevant to eating activity compared with the rest of the sensors available in the glasses (marked with red rectangles in Figure S1 in ). The corresponding sensor data are presented in Figures S2 and S3 in .
Data Collection Methodology
For the development and evaluation of our method, we collected 2 data sets. The first data set was collected in a laboratory environment, while the second data set was collected in-the-wild. The laboratory data enabled us to establish a foundational understanding of eating behaviors under controlled conditions. However, evaluating the method on real-life data allows for assessing its generalization capability and adaptability to diverse and unpredictable environments.
To be more precise, we used the laboratory data sets to:
1. Perform a statistical analysis comparing measurements obtained during 3 activities (eating or chewing, speaking, and teeth clenching) from both temple and cheek sensors, assessing skin movement along both the X and Y axes. This analysis is based on data from 28 participants for whom we have both eating and noneating labeled data.
2. Develop and evaluate DL models for chewing detection. We compared the performance of 4 DL architectures. For the best-performing DL method, we conducted a more detailed analysis, including the impact of individual sensors on the performance of the DL models (eg, temple, cheek, temple + cheek, left temple + left cheek, and right temple + right cheek) and the impact of the segmentation window size on the performance of the DL models, varying the window size between 2 and 15 seconds.

We used the real-life data set to evaluate the chewing detection and eating segment detection methods using data collected in-the-wild. A summary of the collected data sets is presented in Table 1.
Participant recruitment involved booking a time for data collection through social media announcements and completing a Google form to confirm eligibility. Eligible participants were required to be in good health, with no history of eating disorders and no dietary restrictions, allergies, or intolerances. In addition, participants with conditions affecting facial muscle activation, such as stroke or facial palsy, or any other conditions impacting normal and symmetrical chewing and swallowing were excluded. An important inclusion criterion was a proper glasses fit, to ensure accurate detection of skin movements by the sensors.
Table 1. Summary of collected data sets.

Data set | Participants, n | Median duration | Total duration
Eating (laboratory) | 28 | 9 min 49 s | 369 min 15 s
Noneating (laboratory) | 126 (same 28 + 98 new) | 11 min 50 s | 1601 min 47 s
Real life | 8 | 907 min | 7163 min

Study Procedures
Laboratory-Based Data Collection (Controlled Environment)
In the laboratory-based experiments, we collected two data sets:
1. Eating data set: the participants engaged in a full meal, providing them with the freedom to choose from a diverse range of food options, including:
Crispy or hard foods: apples, carrots, nuts, crisps, and crackers
Creamy or soft foods: porridge, banana, yogurt, fruit salad, and green salad
Chewy foods: breakfast bars; pop-tart; toast, bagel, or croissant; and biscuits
In addition, they were allowed to eat with or without utensils, based on their preference. There were no time constraints for completing the meal. Participants ate their meals in a laboratory setting designed to simulate a natural dining environment. They consumed their meals alongside the researchers, which helped create a more relaxed and realistic atmosphere. Despite the laboratory setting, participants were encouraged to consume their meals in a natural manner, simulating real-life conditions. This allowed for varied behaviors; for example, some participants used their phones during meals and others engaged in conversations. During the data collection, the participants were continuously video recorded, providing synchronized data between the video recording and sensor data. This enabled us to manually label each chewing segment later. For the eating activity, we annotated all segments where the participants had food in their mouth. Two researchers independently coded the bites, ensuring reliability and validity through cross-verification. The availability of video data allowed for accurate annotation, as each segment was reviewed by at least 2 researchers to confirm the presence of food in the participants’ mouths.
2. Noneating data set: the data collection was performed in a controlled laboratory setting, where participants were instructed on the activities they should perform. First, participants performed a subset of activities associated with facial muscle engagement. This category includes brushing teeth, engaging in conversation, reading aloud, and diverse expressions of bruxism, encompassing teeth clenching, grinding, and tapping. In addition, we incorporated various facial expressions and gestures, such as smiling, frowning, winking, and similar, to capture a diverse range of facial movements. Moreover, we included a variety of activities that do not specifically rely on facial muscle engagement. These include hygiene-related activities such as handwashing and dishwashing, routine activities such as walking and sitting in a chair, and physical activities such as jogging and stair climbing.
Real-Life Data Collection (Uncontrolled Environment)
In the real-life setting, the participants were instructed to wear the OCOsense smart glasses continuously for a minimum of 8 hours a day over a span of 2 days. The participants were allowed to follow their daily routines without any imposed limitations during this period. This enabled the capture of eating behaviors in various settings such as home, workplace, and other public spaces. In addition, there were no restrictions placed on participants regarding their food choices or other diet-related decisions. For the data collection procedure, we developed an application that collects data from the glasses and enables the participants to annotate when they are engaged in eating activities. More specifically, they were asked to press a button when they started eating and press it again when they finished eating. A researcher monitored the number of labeled eating events per day per participant. In instances where participants forgot to press the start or end buttons, they were asked to note the approximate times of their eating sessions. These cases were then manually analyzed by a researcher using the sensor data from the glasses to provide precise labels for the eating start and end times. These labeled segments served as the ground truth for subsequent experiments. The annotations collected with this approach result in whole data segments labeled as eating, yet these segments may also include a range of activities beyond eating itself, such as engaging in conversation or pausing briefly between bites, which typically occur during regular real-life meals.
Statistical Analysis
Data Preprocessing
To perform the statistical comparison, the following data preprocessing steps were applied to the sensor data:
Calculation of the vector magnitude for each sensor: as the OCO sensors measure skin movement in 2 dimensions (X and Y), the vector magnitude was calculated for each sensor ().
Combination of the processed sensor signal values from the left and right sensors: the vector magnitude value from the left cheek sensor was added to the vector magnitude value from the right cheek sensor, and the same was done for the temple sensors. This resulted in the creation of 2 signals, one representing the total cheek movement (left + right) and one representing the total temple movement (left + right).
Smoothing of the resulting signals: the resulting cheek and temple signals were smoothed using a rolling median filter with a window size of 15 samples (0.3 s) to reduce the effects of noise on the signals.

Hypothesis Testing
Hypothesis testing was conducted using the Wilcoxon signed-rank test, a nonparametric alternative to the paired 2-tailed t test. This test evaluates the distribution of differences between related paired samples to ascertain whether they originate from the same distribution. The null hypothesis is that the samples derive from the same distribution. To account for multiple comparisons, P values were adjusted using the Bonferroni correction method (α=.05).
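To make these steps concrete, the following is a minimal Python sketch, assuming the sensor streams are available as a pandas DataFrame with illustrative per-axis column names (eg, left_cheek_x); the Wilcoxon helper mirrors the paired test with Bonferroni correction over the 3 activity pairs described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import wilcoxon

def cheek_temple_movement(df):
    """Compute smoothed total cheek and temple movement signals.

    Assumes per-axis columns for each OCO sensor, eg 'left_cheek_x',
    'left_cheek_y', ... (column names are illustrative).
    """
    def magnitude(prefix):
        return np.sqrt(df[f"{prefix}_x"] ** 2 + df[f"{prefix}_y"] ** 2)

    # Combine left and right vector magnitudes into one signal per region
    cheek = magnitude("left_cheek") + magnitude("right_cheek")
    temple = magnitude("left_temple") + magnitude("right_temple")

    # Rolling median over 15 samples (~0.3 s) to suppress noise
    cheek = cheek.rolling(15, center=True, min_periods=1).median()
    temple = temple.rolling(15, center=True, min_periods=1).median()
    return cheek, temple

def paired_test(means_activity_a, means_activity_b, n_comparisons=3, alpha=0.05):
    """Wilcoxon signed-rank test on per-participant mean movements
    (eg, eating vs speaking) with a Bonferroni-adjusted decision."""
    stat, p = wilcoxon(means_activity_a, means_activity_b)
    return p, p < alpha / n_comparisons
```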
Chewing Detection Methodology
Overview
This section describes the method used in this study for automatic chewing and eating segment detection. Initially, the sensor data undergo preprocessing, including filtering and segmentation into windows. Then, both the filtered signals and their frequency representations are used as input to DL models, which classify the windows into chewing or nonchewing. To enhance the accuracy in real-life scenarios, we introduce a supplementary model, a hidden Markov model (HMM). This integration enables the grouping of chewing predictions and the construction of coherent eating segments. Finally, we calculate the number of chews and the chewing rate for the detected eating segments. The block diagram of the pipeline is shown in Figure 2.
Figure 2. Overview of the developed method for detection of chewing and eating segments estimation.

Signal Preprocessing
The method uses data from the 4 OCO sensors: the left and right temple and the left and right cheek sensors. Let ORC, OLC, ORT, and OLT denote these sensors in the specified sequence. The set of sensors can be represented as S = {ORC, OLC, ORT, OLT}, where each sensor Si reads data in the time interval T from timestamps t1 to tn. The main objective is defined as:
1. Partitioning T into partially overlapping windows of equal size W = {W1, W2, …, Wm} and assuming a target activity set Y = {chewing, nonchewing}
2. Assigning each window Wi a target label Yj from the target label set Y and training a classifier accordingly

First, to remove the noise from the data, a fifth-order median filter was applied to each sensor channel within the sensor set S. This filter was proven to effectively remove noise while preserving essential signal features in our previous studies on expression recognition using the same type of sensor []. Following the median filter, the next step in the process involved determining the appropriate window size for data segmentation. We experimented with various window sizes ranging between 1 and 15 seconds. Once the sensor set S was segmented into windows (W), the next step was to further enhance the information carried by the input signals. To achieve this, we applied a Fourier transformation to each sensor within the segmented windows. This transformation allowed us to convert the time-domain signals into their frequency representations, thereby extracting additional features from the data. The Fourier transformation process provided valuable insights into the frequency components present in the sensor data, which could be crucial for detecting subtle patterns associated with chewing activity.
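The preprocessing and segmentation steps can be sketched as follows; the sampling rate and channel ordering are assumptions for illustration, and the 4-second window with a 1-second slide reflects the default configuration reported later in the paper.

```python
import numpy as np
from scipy.signal import medfilt

FS = 50                 # assumed sampling rate (Hz)
WIN_S, STEP_S = 4, 1    # 4-second windows with a 1-second slide

def denoise(sensors):
    """sensors: array of shape (n_samples, n_channels) holding the temple
    and cheek OCO channels (x and y each); ordering is illustrative."""
    # Fifth-order median filter applied per channel to remove noise
    return np.stack([medfilt(sensors[:, c], kernel_size=5)
                     for c in range(sensors.shape[1])], axis=1)

def segment(signal, win=WIN_S * FS, step=STEP_S * FS):
    """Partition the recording into partially overlapping windows."""
    starts = range(0, len(signal) - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

def add_frequency_features(windows):
    """Append the magnitude spectrum of each channel to the filtered window."""
    spectrum = np.abs(np.fft.rfft(windows, axis=1))
    return windows, spectrum  # both representations are fed to the DL models
```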
Chewing Detection With DL Models
In this study, we used 4 distinct DL models based on convolutional neural networks (CNNs) for chewing detection. We focused on DL architectures commonly used for wearable sensor data, such as CNN 1D [,], CNN 2D [], an attention model [], and convolutional long short-term memory (ConvLSTM) []. By using these common architectures, we aim to demonstrate the baseline accuracy achievable with existing methods. This serves as a foundation upon which further improvements can be made. Specifically, developing DL architectures tailored to the unique specifications of the glasses and the specific use case of detecting eating activity could potentially enhance accuracy beyond the baseline results established in this study.
An overview of the architectures and their associated hyperparameters is as follows:
CNN 2D: Our initial model adopts a standard CNN [], crafted to extract hierarchical spatial features from input data. The feature extraction module consists of 3 consecutive convolutional layers, each followed by group normalization and max-pooling layers. Extracted features are then passed through 2 fully connected layers, each containing 128 neurons, connecting to the output nodes.
ConvLSTM: Expanding on the CNN’s foundation, the ConvLSTM model [] introduces a temporal dimension to our analysis. It shares the same convolutional layers with the CNN 2D architecture and integrates 2 LSTM layers, each featuring 128 hidden units. This modification allows the model to effectively capture sequential patterns and dependencies within the data.
Attention model: Incorporating insights from attention mechanisms, the attention model [] comprises 4 convolutional layers with 64 feature maps, followed by 2 LSTM layers, each with 128 hidden units [], and an attention layer. The attention layer allows the model to prioritize relevant information during the learning process.
CNN 1D with statistical features: The last model incorporates a 1D CNN architecture [], enhanced with statistical features. It consists of a single convolutional layer with 256 filters followed by a max-pooling layer. The resulting features are then flattened and fused with statistical features extracted from the filtered sensor data, including mean, variance, and absolute sum. The joint vector is then processed through a fully connected layer with 1024 neurons, capturing both spatial and statistical characteristics.

The determination of architecture parameters, such as the kernel size in the convolutional layers, the output size of the CNN layers, the LSTM units, and the fully connected units, was guided by a pragmatic approach focused on achieving a balance between the model’s ability to capture complex data patterns and the model’s complexity. These parameters were fine-tuned on the validation set to optimize performance.
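As an illustration of the ConvLSTM variant described above, the following is a minimal PyTorch sketch; the number of input channels, feature-map sizes, kernel sizes, and pooling configuration are assumptions, since the text only specifies 3 convolutional blocks with group normalization and max pooling followed by 2 LSTM layers with 128 hidden units.

```python
import torch
import torch.nn as nn

class ConvLSTM(nn.Module):
    """Sketch of a ConvLSTM-style classifier: 3 conv blocks with group norm
    and max pooling, followed by 2 LSTM layers with 128 hidden units.
    Channel counts and kernel sizes are illustrative assumptions."""

    def __init__(self, n_channels=8, n_classes=2):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (32, 64, 64):               # assumed feature-map sizes
            blocks += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.GroupNorm(4, out_ch),
                       nn.ReLU(),
                       nn.MaxPool2d((2, 1))]      # pool over time, keep channels
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(input_size=64 * n_channels, hidden_size=128,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):
        # x: (batch, 1, time, sensor_channels)
        f = self.features(x)                      # (batch, 64, time', channels)
        f = f.permute(0, 2, 1, 3).flatten(2)      # (batch, time', 64 * channels)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])              # classify from the last step
```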
Each model was trained for 100 epochs with a batch size of 256. Before the start of the training process, we used orthogonal initialization for both weights and biases, aiming to enhance the stability and effectiveness of neural network training. Cross-entropy loss was used as the objective function for training. Furthermore, all the models were trained using the Adam optimizer with an initial learning rate of 1e-3. To avoid overfitting and to reduce the training time, early stopping was applied, monitoring the validation F1-macro score with a patience of 15 epochs. In the end, the optimal weights were selected based on the epoch with the highest validation F1-macro score.
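A minimal training loop reflecting this setup might look as follows; the data loader contents and the restriction of orthogonal initialization to parameters with 2 or more dimensions are assumptions.

```python
import torch
from sklearn.metrics import f1_score

def train(model, train_loader, val_loader, max_epochs=100, patience=15):
    """Adam (lr 1e-3), cross-entropy loss, early stopping on the validation
    F1-macro score with a patience of 15 epochs, as described in the text."""
    for p in model.parameters():
        if p.dim() >= 2:                      # orthogonal init where shape allows
            torch.nn.init.orthogonal_(p)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_f1, best_state, waited = -1.0, None, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:             # batch size 256 in the paper
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        # Validation F1-macro for early stopping and model selection
        model.eval()
        preds, labels = [], []
        with torch.no_grad():
            for x, y in val_loader:
                preds += model(x).argmax(1).tolist()
                labels += y.tolist()
        f1 = f1_score(labels, preds, average="macro")
        if f1 > best_f1:
            best_f1, waited = f1, 0
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        else:
            waited += 1
            if waited >= patience:
                break
    model.load_state_dict(best_state)
    return model
```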
Detection of Eating Segments
In the initial phase of our eating detection system, we use a DL model to detect chewing moments at a window-level granularity. By incorporating the temporal dependence between the detected chews, we aim to enable our system to identify not only individual chewing instances but also to discern when eating segments occur within real-life data. This allows us to effectively mitigate the occurrence of short false-positive predictions and consolidate densely clustered chewing instances into coherent eating segments. By doing so, we anticipate a more robust and precise analysis of dietary patterns.
To address the temporal dependence between chewing events in real life, we integrate an HMM as a supplementary model that analyzes the detected chews from the DL model. The HMM was initialized and trained as described in the study by Stankoski et al []. This process is visually illustrated in Figure S4 in .
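The HMM itself follows the initialization and training procedure of Stankoski et al and is not reproduced here; as a simplified stand-in, the sketch below smooths the window-level chewing probabilities with a two-state Viterbi pass, where the transition probability is an illustrative assumption.

```python
import numpy as np

def viterbi_smooth(chew_probs, p_stay=0.98):
    """Simplified two-state (noneating / eating) Viterbi smoothing of the
    window-level chewing probabilities; p_stay is an assumed self-transition
    probability, not the value used in the paper."""
    emis = np.column_stack([1 - chew_probs, chew_probs])   # P(obs | state)
    trans = np.array([[p_stay, 1 - p_stay],
                      [1 - p_stay, p_stay]])
    n = len(chew_probs)
    delta = np.zeros((n, 2))
    back = np.zeros((n, 2), dtype=int)
    delta[0] = np.log(emis[0] + 1e-12) + np.log(0.5)       # uniform prior
    for t in range(1, n):
        scores = delta[t - 1][:, None] + np.log(trans)     # best previous state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emis[t] + 1e-12)
    # Backtrack to obtain the smoothed eating / noneating sequence
    states = np.zeros(n, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):
        states[t] = back[t + 1, states[t + 1]]
    return states  # 1 = window belongs to an eating segment
```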
Detection of Number of Chews and Chewing Rate Estimation
After the detection of eating segments, to determine the number of chews in an eating segment, we additionally analyze and process the signals from the sensors. Figure 3 presents a visual representation of the data processing steps used in the detection of chews within a randomly selected eating segment from the data set. The initial step involves identifying the signal with the highest root mean square value. Subsequently, we use a 2-step filtering process to enhance the selected signal. First, a median filter with a kernel size of 5 is applied, followed by a second-order bandpass filter within the frequency range of 0.5 to 3 Hz. For the calculation of the number of chews, we used an existing peak detection algorithm (using SciPy []) on the processed signal. This involves configuring the threshold and distance parameters to identify relevant peaks in the signal. Furthermore, peaks with insufficient prominence are excluded from the final set.
Each retained peak after this step is considered as a separate chew in the signal. The parameter values used in the filtering and peak detection processes were determined empirically.
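A sketch of this chew-counting step, using SciPy's filtering and peak detection, is shown below; the sampling rate and the height, distance, and prominence values are assumptions, as the paper states only that these parameters were set empirically.

```python
import numpy as np
from scipy.signal import medfilt, butter, filtfilt, find_peaks

def count_chews(segment_signals, fs=50):
    """Count chews within a detected eating segment.
    segment_signals: dict of {sensor_name: 1-D array}; fs and the peak
    detection thresholds are illustrative assumptions."""
    # 1. Pick the channel with the highest RMS value
    signal = max(segment_signals.values(),
                 key=lambda s: np.sqrt(np.mean(s ** 2)))
    # 2. Median filter (kernel 5), then 2nd-order band-pass at 0.5-3 Hz
    signal = medfilt(signal, kernel_size=5)
    b, a = butter(N=2, Wn=[0.5, 3.0], btype="bandpass", fs=fs)
    signal = filtfilt(b, a, signal)
    # 3. Peak detection; each retained peak is counted as one chew
    peaks, _ = find_peaks(signal,
                          height=0.01,             # assumed amplitude threshold
                          distance=int(0.2 * fs),  # assumed minimum peak spacing
                          prominence=0.005)        # assumed prominence threshold
    return len(peaks)
```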
Following the detection of chews within eating segments, we extend the analysis to estimate the chewing rate. To achieve this, we use the same signal with the highest root mean square value. This signal is subjected to further analysis through Fourier transformation to compute its frequency spectrum. By examining the resulting spectrum, we identify the most substantial frequency component, which corresponds to the dominant chewing frequency within the examined eating segment.
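The chewing rate estimation can be sketched in the same way: the dominant frequency of the selected signal's spectrum is taken as the chewing rate, with the 0.5 to 3 Hz band reused as an assumed search range.

```python
import numpy as np

def chewing_rate(signal, fs=50):
    """Estimate the dominant chewing frequency (chews per second) of an
    eating segment from the magnitude spectrum of the selected signal."""
    spectrum = np.abs(np.fft.rfft(signal - np.mean(signal)))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= 0.5) & (freqs <= 3.0)   # assumed plausible chewing band
    return freqs[band][np.argmax(spectrum[band])]
```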
Figure 3. Processing steps for detecting the number of chews in an eating segment: (A) cheek and temple signals; (B) selection of the signal with the highest root mean square (RMS) value; (C) filtering of the chosen signal; and (D) detection of peaks and chews in the filtered signal.

Evaluation Setup
To evaluate the effectiveness of the models, we used the Leave-One-Group-Out cross-validation technique. This involved dividing the initial data set into N separate groups, where the data from a single participant are present in only one subset. Each model is trained on combined data from N-2 subsets, leaving one subset to be used as the validation data set and a second subset for testing the final model. Thus, all the models are person-independent, that is, the experimental results demonstrate the model’s accuracy on unseen test users.
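A sketch of this person-independent splitting scheme is given below; the rule for picking the validation participant is an assumption, as the paper only states that one subset is held out for validation and one for testing.

```python
import numpy as np

def leave_one_group_out_splits(groups):
    """Yield (train, val, test) index sets: for each participant used as the
    test set, one other participant is held out for validation and the
    remaining N-2 form the training set. The choice of the 'next' participant
    for validation is an illustrative assumption."""
    participants = np.unique(groups)
    for i, test_p in enumerate(participants):
        val_p = participants[(i + 1) % len(participants)]
        train_idx = np.where(~np.isin(groups, [test_p, val_p]))[0]
        val_idx = np.where(groups == val_p)[0]
        test_idx = np.where(groups == test_p)[0]
        yield train_idx, val_idx, test_idx
```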
Regarding evaluation metrics, we used recall, precision, and F1-score. Recall indicates the proportion of actual chewing segments correctly identified by the model, while precision denotes the proportion of identified chewing segments that are truly chewing segments. The F1-score is the harmonic mean of recall and precision, which is a more balanced metric compared with accuracy, especially in unbalanced data sets where one of the classes is more frequent. The reported metrics reflect the models’ ability to detect chewing at a window level, and they are calculated as follows:
Precision = TP / (TP + FP) (1)
Recall = TP / (TP + FN) (2)
F1-score = (2 × Precision × Recall) / (Precision + Recall) (3)
In the equations (1) to (3), TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. In the context of chewing detection, these metrics can be interpreted as follows:
TP indicates the number of windows from the chewing class correctly classified as chewing.
FP indicates the number of windows from the nonchewing class incorrectly classified as chewing.
FN indicates the number of windows from the chewing class incorrectly classified as nonchewing.

In addition, for the evaluation of eating detection in the real-life scenario, we used a custom metric to provide deeper insights into the models’ performance within eating segments. The metric was defined to analyze the number of eating segments that are correctly identified based on the frequency of positive chewing predictions within each eating segment:
Detected eating segments: the number of eating segments where at least 50% of the instances (windows) are correctly identified as eating.

Ethical Considerations
To ensure ethical compliance, ethics approval was obtained from the London-Riverside Research Ethics Committee on July 15, 2022 (ref: 22/LIO/0415). After a detailed explanation of the experimental procedure, all participants provided written informed consent before participating in the study. The consent forms addressed the use of their data. To protect participant privacy, all data were deidentified. The participants who took part in the laboratory sessions were compensated with US $26.5, while those involved in the real-life study received US $26.5 per day for their participation. The experiment was conducted following institutional ethical provisions and the Declaration of Helsinki.
In this section, we present the results from the experiments. The Statistical Analysis of OCO Sensors for Facial Muscle Movements section presents the outcomes of the statistical analysis, focusing on the ability of the OCO sensors to detect facial muscle movements during various activities, including eating. The Laboratory-Based Data Set DL Experiments section assesses the performance of different DL models, sensor combinations, and window sizes for chewing detection in a controlled laboratory data set. Finally, the Real-Life Data Set Experiments: Chewing and Eating Segments Detection section presents the results obtained with the real-life data set and evaluates the performance of the method for the detection of eating segments.
Statistical Analysis of OCO Sensors for Facial Muscle Movements
To evaluate the ability of the OCO sensors to detect facial muscle movements during different activities, we first conducted a statistical analysis. Our focus was on comparing measurements obtained from both temple and cheek sensors, assessing skin movement along both the X and Y axes. In this context, we focused on comparing facial muscle movements during the activities of eating or chewing, speaking, and teeth clenching. The selection of these activities was based on the potential similarity in facial muscle activation patterns. For example, Figure 4 presents 6 graphs. The top row shows measurements from sensors placed over the zygomaticus major muscle (cheek area) and the bottom row from sensors positioned on the temples. Each column of graphs represents 1 of the 3 activities being measured (eating, speaking, and clenching). The horizontal axis of each graph represents time in seconds, and the vertical axis shows the magnitude of skin movement in millimeters. By comparing these graphs, we can assess the differences and similarities in facial muscle activation patterns during the 3 activities.
We calculated mean movements measured from the cheek and temple OCO sensors for each participant during eating or chewing, speaking, and teeth clenching. The mean values were calculated over all data points corresponding to each activity, resulting in n=28 (number of participants present in both the eating and noneating laboratory data set) tuples, with each tuple comprising 3 values representing the mean cheek or temple movement for eating or chewing, speaking, and teeth clenching.
Figure 5 shows the mean cheek (left plot) and temple (right plot) movements during the different activities, presented on the x-axis, and the results from the Wilcoxon signed-rank (paired) test with Bonferroni correction (α=.05).
For the cheek OCO sensors, we can observe increased movement during eating (median value 0.113 mm) compared with the relatively lower values observed during speaking (median value 0.036 mm) and teeth clenching (median value 0.008 mm). The results from the statistical test further indicate significant differences in cheek movements between speaking and eating (P<.001), eating and teeth clenching (P<.001), as well as speaking and teeth clenching (P<.001).
Similarly, for the temple OCO sensors, a notable increase in movement is observed during eating, with a median value of 0.027 mm, compared with 0.008 mm during speaking and 0.002 mm during teeth clenching. The statistical tests affirm the significance of these differences, demonstrating that mean temple movements differ significantly between speaking and eating (P<.001), eating and teeth clenching (P<.001), as well as speaking and teeth clenching (P<.001).
These findings highlight the potential sensitivity of the cheek and temple OCO sensors in capturing distinct patterns and subtle variations in facial muscle activation across different activities.
Figure 4. Sensor signals from the sensors on the right cheek and temple recorded during eating, speaking, and teeth clenching activities performed by one participant.

Figure 5. Wilcoxon signed-rank (paired) test with Bonferroni correction for comparing mean cheek and temple movements during activity pairs (n=28): clenching versus eating, clenching versus speaking, and eating versus speaking. Statistical significance annotations: *if P ∈ [.01, .05); **if P ∈ [.001, .01); ***if P ∈ [.0001, .001); and ****if P<.0001.

Laboratory-Based Data Set DL Experiments
In this section, we present the sample characteristics of the data set and the results of the experiments for chewing detection conducted on the laboratory-based data set, offering insights into the results achieved across various DL architectures, sensor combinations, and window sizes.
Sample Characteristics
The laboratory-based data set consists of 2 subsets, one for eating activities and another for noneating-related activities. In the controlled eating data set, we gathered data from a cohort of 28 participants, comprising 13 (46%) males and 15 (54%) females, with an average age of 25.6 (SD 9.1) years. The data set comprises a total of 6.1 hours of recorded data. The noneating data set includes data from the same 28 participants, along with an additional 98 participants (n=48, 49% males and n=50, 51% females), with an average age of 23.3 (SD 6.4) years. Each participant contributed data for various activities, totaling 26.7 hours of recorded data. In summary, the data set comprises 126 participants and spans a combined total of 32.8 hours of recorded data.
DL Models for Chewing Detection
In this section, we present a comparison of the various DL architectures used for the task of chewing detection. Table 2 provides a summary of the performance metrics, including F1-score, recall, and precision for the chewing class, for each architecture.
The results show that all architectures demonstrated strong results, indicating that the sensor data provided by the glasses are informative for the chewing detection task. ConvLSTM demonstrated the highest F1-score of 0.91, precision of 0.92, and recall of 0.89. CNN 2D also performed well with balanced metrics, achieving a slightly lower precision of 0.90, recall of 0.90, and F1-score of 0.90. In contrast, the attention model displayed moderate performance, with precision, recall, and F1-score of 0.89, 0.90, and 0.89, respectively. The CNN 1D architecture, despite exhibiting a high precision of 0.90, fell short in recall at 0.86, resulting in a lower overall F1-score of 0.88.
The confusion matrices for the evaluated models are presented in Figure 6. They provide additional insights into the models’ behavior. Notably, the ConvLSTM model also demonstrated a lower number of false positives (FPs), totaling 1749 instances. This number is approximately 20% lower than that of the second-best model, CNN 2D, which recorded 2089 FP instances.
Figure S5 in shows the FP rates for various noneating activities detected by the ConvLSTM model. Socializing has the highest rate (0.72%), followed by reading (0.28%). Both involve speaking, leading to confusion with eating due to similar facial movements. The overall FP rate is 2%, also shown in Figure 6 (0.02 in confusion matrix D).
Table 3 provides a comparison of complexity and resource metrics for the evaluated architectures, focusing on network parameters, computational complexity expressed as the number of floating-point operations per second (FLOPS) during a forward pass, and the size of the model. CNN 1D, with 2.4 gigaFLOPS and a 270-kB model size, is the smallest model, making it suitable for embedded applications in the future. CNN 2D, although larger, offers a balanced trade-off between performance and model size. The attention model, despite having fewer parameters than CNN 2D, has the highest computational complexity of 80.25 gigaFLOPS. ConvLSTM demonstrates a balance between accuracy and resource requirements.
Considering the results, ConvLSTM emerged as the preferred choice for the chewing detection task based on the model accuracy, computational complexity, and resources needed; thus, this architecture was used in the subsequent experiments.
Table 2. Performance metrics of different deep learning architectures for chewing detection. Precision, recall, and F1-score are calculated for the eating class.

DL architecture | Precision | Recall | F1-score
CNNa 1D | 0.90 | 0.86 | 0.88
CNN 2D | 0.90 | 0.90 | 0.90
Attention model | 0.89 | 0.90 | 0.89
ConvLSTMb | 0.92c | 0.89 | 0.91

aCNN: convolutional neural network.
bConvLSTM: convolutional long short-term memory.
cBest performing algorithm.
Figure 6. Confusion matrices for the evaluated deep learning architectures: (A) convolutional neural network (CNN) 1D; (B) CNN 2D; (C) attention model; (D) convolutional long short-term memory (ConvLSTM).

Table 3. Complexity and resource metrics for the evaluated deep learning (DL) architectures.

DL architecture | Total parameters | Computational complexity (GFLOPSa) | Model size (MB)
CNNb 1D | 43,366 | 2.41 | 0.27
CNN 2D | 1,208,578 | 20.92 | 4.88
Attention model | 407,620 | 80.25 | 2.68
ConvLSTMc | 997,890 | 31.42 | 4.03

aGFLOPS: giga floating-point operations per second.
bCNN: convolutional neural network.
cConvLSTM: convolutional long short-term memory.
Impact of Individual Sensors on the Performance of Chewing Detection Models
In this section, we present the results from the analysis of the impact of individual sensors on the performance of the chewing detection models. Having identified the ConvLSTM architecture as the best-performing architecture among the models evaluated in the previous experiments, we proceeded with this architecture for a series of experiments encompassing various sensor combinations. The tested sensor combinations included temple, cheek, temple and cheek, as well as the left versus the right side. The results from these experiments are presented in Table 4.
From Table 4, it can be observed that the cheek sensor outperforms the temple sensor. Specifically, in detecting chewing segments, the cheek sensor achieves a recall of 0.88, which is 4 percentage points higher than the recall achieved by the model trained with temple sensor data (0.84).
The performance of the model trained with data from the cheek sensors can be attributed to the role of the cheek region in eating activities, predominantly chewing. The sensors are adept at capturing the specific circular movements of the cheek area during such activities, which produce distinct signal patterns associated with eating. The ability to capture these specific patterns results in the model’s high precision in distinguishing eating episodes, thus enhancing recall rates.
Although the temporalis muscle is uniquely activated during chewing activity, the results show that the activation measured by the temple sensor is not equally high across all participants. However, if we combine the temple and the cheek sensors, we can see that the recall is improved by 1 percentage point. This shows that the temple sensor data provide additional information to the model.
In addition, we explored the performance of the models using only one side of the temple and cheek sensors. On the basis of the results, we can see that the combination of the sensors measuring the right temple and cheek achieves a recall of 0.89, which is 3 percentage points higher than the recall achieved by the model trained with left temple and cheek sensor data. This might be expected because most people prefer to chew food on one side of their mouth [,], where muscle activation is higher, which results in higher values in the sensor data.
Table 4. Performance metrics of convolutional long short-term memory for chewing detection with multiple combinations of sensor data. Precision, recall, and F1-score are calculated for the eating class.

Sensor combination | Precision | Recall | F1-score
Temple | 0.83 | 0.84 | 0.83
Cheek | 0.92 | 0.88 | 0.90
Temple+cheek | 0.92a | 0.89 | 0.91
Left temple+left cheek | 0.90 | 0.86 | 0.88
Right temple+right cheek | 0.89 | 0.89 | 0.89

aThe selected combination shows the best results based on the F1-score.
Window Size Impact on the Performance of Chewing Detection Models
This section presents the results of the analysis of how window size influences the performance of the chewing detection models. For this purpose, a series of experiments were conducted, exploring various window sizes that extend beyond the default 4-second window size used in the previous experiments. For this analysis, we used a consistent 1-second window slide, with the aim to prevent delays in prediction changes and to ensure that the model will be able to promptly detect eating-related movements. The results from the experiments are presented in Table 5.
The performance of the ConvLSTM architecture demonstrated a noticeable enhancement with the increase in window size in terms of precision, recall, and F1-score. More specifically, as the window size extends from 2 to 10 seconds, we consistently observe improvements in results. However, upon reaching a 15-second window, we observe saturation in performance metrics, where the obtained results remain consistent with those achieved at the 10-second window. This is probably because longer windows might include nonchewing data, leading the model to misclassify entire instances as noneating.
Although improvements can be observed for the 6- and 10-second window sizes, we decided to proceed with the 4-second window. This decision was based on its advantage of processing less data compared with the 6- and 10-second windows, leading to reduced computational demand and potentially lower energy use.
Table 5. Performance metrics of convolutional long short-term memory for chewing detection with various window sizes. Precision, recall, and F1-score are calculated for the eating class.

Window size | Precision | Recall | F1-score
2 seconds | 0.90 | 0.86 | 0.88
4 seconds | 0.92 | 0.89 | 0.91
6 seconds | 0.93 | 0.92 | 0.92
10 seconds | 0.93a | 0.94 | 0.93
15 seconds | 0.92 | 0.94 | 0.93

aBest performing result.
Real-Life Data Set Experiments: Chewing and Eating Segments Detection
Overview
To assess the effectiveness of our chewing detection and eating segment detection methods using data collected in-the-wild, we conducted a series of experiments. In the first subsection, we present the sample size of the data set. Then, in the second subsection, we present the results of the chewing detection method using real-life data. Next, in the third subsection, the evaluation of the eating segment detection is presented. In the last subsection, we show the estimation of the chewing characteristics.
Sample Characteristics
The real-life setting data collection involved 8 participants (5 males and 3 females; average age 30.8, SD 12.4 years). Each participant wore the glasses for a minimum of 8 hours per day over 2 days, resulting in 16 hours of recorded data per participant and a total of 128 hours of recorded data.
Chewing Detection Evaluation Using Real-Life Data
This evaluation allows us to explore whether a model trained with seminaturalistic behavior data collected in a laboratory setting can perform well on real-life data from unseen participants. Figure 7 presents the results obtained on the real-life data set at a window level using the model for chewing detection. The classification report is shown in Table 6. It shows that the model achieved a precision of 0.95, a recall of 0.82, and an F1-score of 0.88 for the eating class. The accuracy of this model was 98%.
We derived the probability density function of the model’s probability outputs. The resulting graph, depicted in Figure 8, reveals a bimodal distribution, exhibiting one smaller peak near a probability of 0.2 and a larger, more substantial peak beginning at approximately 0.8 probability. The prominence of the second peak starting from a higher probability threshold signifies the model’s strong confidence in identifying chewing activity within the labeled eating segments. In addition, the predictions around the first peak can be interpreted as instances where the model is relatively certain that chewing is not occurring within the eating-labeled segments.
These results are in line with our expectations and understanding of the real-life data set. As previously described, the ground truth of the real-life data set contains only the information on when eating segments took place. When evaluating the chewing detection method on a data set where eating segments are labeled, the presence of false negatives can be attributed to the nature of the data set. Eating segments may encompass various activities beyond just chewing, such as talking, short breaks between bites, holding food, and similar. Therefore, segments labeled as “eating” may indeed involve nonchewing activities.
Figure 7. Confusion matrix for the chewing detection model evaluated on the real-life data set on window level.

Table 6. Classification report for the chewing detection model evaluated on the real-life data set on window level.

Class | Precision | Recall | F1-score
Noneating | 0.99 | 1.00 | 0.99
Eating | 0.95 | 0.82 | 0.88
Macroaverage | 0.97 | 0.91 | 0.94

Figure 8. Probability density function of the model’s output probabilities for the chewing-labeled instances.

Evaluation of the Method for Eating Segment Detection
As previously described, the ground truth for the real-life data set contains information on when the eating segments took place. This means that the annotated eating segments may contain short breaks between bites, conversations, food preparation, and similar. Because of this, we evaluated the eating segment detection based on the temporal information of the chewing detection algorithm as described in the Detection of Eating Segments section.
The results obtained on a segment level are shown in Table 7. This table contains the total number of eating segments, the number of detected eating segments, and the number of falsely detected eating segments for each participant. An eating segment is considered detected if >50% of the instances in the labeled segment are predicted as chewing. The results of this evaluation show that from a total of 74 eating segments labeled by the participants, we can accurately detect 71 eating segments. The number of falsely detected eating segments is relatively low for all participants, with a total of 7 false detections.
Furthermore, we extended our analysis of the real-life data set to explore the suitability of the sensor data obtained from the smart glasses in-the-wild for capturing more detailed eating-related metrics, beyond only detecting instances of eating. In particular, we aimed to quantify the number of chews and the chewing rate within eating segments, although this method was not subjected to formal evaluation, mainly because of the lack of ground truth in the real‑life data set.
Table 7. Evaluation of eating segment (ES) detection on the real-life data set, including total number of ES, number of true detected ES, number of false detected ES, and mean duration of falsely detected ES per participant.

Participant ID | Total ES | True detected ES | False detected ES | Mean duration of falsely detected ES, SD (min)
1 | 5 | 4 | 0 | —a
2 | 12 | 12 | 1 | 1.77 (0)
3 | 20 | 20 | 0 | —
4 | 7 | 6 | 2 | 0.72 (0.5)
5 | 5 | 4 | 3 | 0.54 (0.2)
6 | 15 |