Remote physiological signal recovery with efficient spatio-temporal modeling

1 Introduction

The continuous development of modern society is steadily improving people's living standards. At the same time, however, the incidence of cardiovascular disease is rising, possibly due to increased work pressure and a faster pace of life. The detection of human physiological indicators is of great importance for sensing both affect and health status (Loh et al., 2022; Gupta et al., 2022). However, traditional physiological measurement methods are mostly contact-based and have the following shortcomings. They are not applicable in certain application scenarios, such as analyzing the cognitive pressure experienced by interpreters during simultaneous translation or monitoring the psychological state of criminal suspects during interrogation. Moreover, contact measurement requires the active cooperation of the tested person, and a deviation in the position of the measuring instrument on the skin can easily cause a large deviation in the measurement results (Zaunseder et al., 2017; Kumar et al., 2023). In addition, although the electrocardiograph provides accurate measurements, it is relatively expensive and must be operated by professionals, which makes it unsuitable for daily physiological measurements.

Plethysmography is the detection of the cardiovascular pulse wave through methods such as variations in air pressure or impedance (Verkruysse et al., 2008). Remote photoplethysmography (rPPG) instead uses light reflectance. This technology provides contactless monitoring of human cardiac activity by capturing the pulse-induced, periodic, weak color variations on the skin surface with a conventional camera (McDuff, 2022; Wang et al., 2017; Gupta et al., 2020; Ryu et al., 2021). It is based on the principle that blood absorbs more light than the surrounding tissues; therefore, changes in blood volume (caused by the systole and diastole stages of the heart) affect transmission and reflectance (Song et al., 2020). Although this pulse-induced variation is subtle, it can be remotely measured on the human face under normal ambient light with a consumer-level camera from a distance of several meters (Verkruysse et al., 2008). The rPPG has many advantages, such as contactless measurement, simple operation, and low cost. It provides a new solution for physiological signal measurement and its applications in affective computing (Dasari et al., 2021).

Thanks to the rapid development of computer vision techniques, the subtle changes in skin appearance caused by cardiac activity can be detected by low-cost cameras (Verkruysse et al., 2008). Classic signal processing methods proved the feasibility of rPPG-based heart rate measurement through the initial success of prototype methods (Yue et al., 2020). However, these methods often degrade in the presence of artifacts, such as movements, lighting variations, and different skin tones (Shao et al., 2021). With the extensive application of deep learning in various research fields, such as biometrics (Labati et al., 2022), affective computing (Jung and Sejnowski, 2022), and the internet of things (Yao et al., 2018), recent studies have also begun to focus on deep learning-based rPPG due to its better representation ability (McDuff, 2022). Several deep learning models, such as convolutional neural networks (both 2D and 3D (Zhan et al., 2020)) and recurrent neural networks (gated recurrent units (GRU) (Niu et al., 2019a) and long short-term memory (LSTM) (Hill et al., 2021)), have been successfully applied to rPPG recovery tasks. However, deep learning-based rPPG still cannot effectively model spatio-temporal information (Ren et al., 2021).

A method of modeling spatio-temporal information by generating a feature map called STMap (Niu et al., 2019b; Niu et al., 2020; Lu et al., 2021) requires preprocessing including face detection, facial landmark localization, face alignment, skin segmentation, and color space transformation, which is considerably complicated. Besides, some existing methods directly regress a single value, such as heart rate, as the final output instead of recovering the whole waveform, whereas the waveform could be helpful for further analysis of more refined physiological indicators. Furthermore, various loss functions exist, such as the L1 loss (mean absolute error, MAE) (Niu et al., 2019a), the L2 loss (root mean squared error, RMSE) (Tsou et al., 2020), the negative Pearson correlation coefficient (Yu et al., 2020a), and more complicated combinations of these losses (Lu et al., 2021). However, the comparison of different loss functions is rarely studied.

Motivated by the above discussion, this paper aims to realize robust contactless physiological signal measurement. To this end, we propose an rPPG waveform recovery method based on efficient spatio-temporal modeling. It is achieved through a three-dimensional central difference convolution (3D-CDC) operator (Yu et al., 2021) within a dual-branch structure composed of motion and appearance branches, together with a soft attention mask that assigns higher weights to skin regions with stronger physiological signals. The 3D-CDC can effectively describe intrinsic patterns by combining gradient and intensity information. Moreover, to the best of our knowledge, we introduce the Huber loss (Wang et al., 2022) for the first time in the rPPG task; it combines the advantages of both the L1 and L2 losses and shows better performance than using these losses separately.

This paper is an extended version of our conference paper (Zhao et al., 2021). The following are the main differences with respect to (Zhao et al., 2021): 1) We propose a multi-task 3D-CDC for joint pulse wave and respiration wave measurement in addition to heart rate measurement; 2) Ablation studies are performed to show the effectiveness of each module, e.g., 3D-CDC, the dual-branch architecture, and the soft attention mask; 3) Robustness of the proposed method with respect to video compression is evaluated and compared with other methods; 4) Extended quantitative and qualitative analyses are provided. The current paper includes additional experiments, data, and interpretation, which add value to the work proposed in (Zhao et al., 2021).

Our main contributions are as follows:

• An accurate rPPG measurement method based on a 3D-CDC attention network for efficient spatio-temporal modeling is proposed. The utilized 3D-CDC operator can extract temporal context by aggregating temporal difference information.

• Huber loss is adopted as the loss function for rPPG measurements. By evaluating different loss functions and their combinations, we show that better performance is achieved with the Huber loss alone by focusing on the intensity-level constraint.

• A multi-task variant of the proposed method for joint measurement of cardiac and respiratory activities is developed. It has the advantage of sharing information between related physiological signals, which can further improve accuracy while reducing computational costs.

• Extensive experiments show superior performance on public databases. Both cross-database evaluations and ablation studies are conducted, and the effects of video compression are evaluated, which demonstrates the effectiveness and robustness of the proposed method.

The rest of the paper is organized as follows: Section 2 provides the related work and Section 3 gives details about the framework and each module. Section 4 introduces the evaluation settings and implementation details. Section 5 provides the performance of the proposed models on public databases and rigorous ablation studies. Finally, the paper is concluded in Section 6.

2 Related work

2.1 Signal separation-based rPPG

The remote physiological signal detection method based on rPPG is favored by researchers because it is non-invasive and can obtain physiological signals without any direct contact with the subject's skin. The underlying mechanism is the delivery of blood flow to the whole body due to the periodic contraction and relaxation of the heart, resulting in blood volume changes in the vessels. Because blood vessels and other tissues absorb and reflect different wavelengths of light differently, subtle color changes occur in skin areas with a rich vascular distribution, such as the face or palm. When a part of human skin tissue containing pulsatile blood is observed with a remote color camera, the measured signal of the skin surface shows a certain color variation over time, due both to motion-induced intensity/specular variations and to pulse-induced subtle color changes (Wang et al., 2017). Unlike the specular variations, the diffuse reflection component is associated with the absorption and scattering of light in skin tissues and therefore contains the pulse signal.

The task of rPPG algorithms is to derive the pulse signal from the RGB signals captured by the camera. Blazek et al. (Blazek et al., 2000) proved that blood pulse signals could be measured with a remote near-infrared imaging system, and a similar technique using a visible-band camera was presented shortly after (Wu et al., 2000). Verkruysse et al. (2008) further developed this concept and first proved the feasibility of using a low-cost camera to detect the human heart rate, obtaining the heart rate signal by analyzing a facial video taken under visible light. Many subsequent studies focused on artifact elimination during rPPG measurement, addressing movement, facial expression, skin tone, illumination variations, etc.

In terms of eliminating illumination artifacts, there are mainly two solutions: one is to directly separate the illumination-change signal from the pulse signal through signal separation methods, and the other is to use the non-skin background area outside the face region as an artifact reference (Nowara et al., 2020). Anti-motion-interference methods can be roughly divided into 1) blind source separation methods that separate out the motion signal components (Poh et al., 2010); 2) methods based on color models, such as CHROM (De Haan and Jeanne, 2013) and POS (Wang et al., 2017), which distinguish motion signals from pulse signals by analyzing skin color models; and 3) methods based on motion compensation, which apply global and local motion compensation to eliminate the influence of head translation and rotation (Cheng et al., 2016).

However, practical applications revealed that signal separation-based methods can only target a specific interference and cannot effectively deal with multiple interferences coexisting in a real scene. To further improve the robustness of contactless rPPG pulse wave recovery, and to explore the feasibility of deep learning-based rPPG recovery, the research trend has shifted from signal separation-based methods to data-driven methods.

2.2 Data-driven rPPG

The widespread use of deep learning in computer vision has led to the development of numerous contactless heart rate measurement techniques. Chen et al. proposed a convolutional attention network that employed normalized difference frames as input to predict the derivative of the pulse wave signal (Chen and McDuff, 2018). Niu et al. generated a spatio-temporal map representation by aggregating information from multiple small regions of interest, and cascaded the spatio-temporal map with a ResNet to predict the heart rate (Niu et al., 2019a). Yu et al. designed the spatio-temporal networks PhysNet (Yu et al., 2020a) and rPPGNet (Yu et al., 2019) for pulse wave signal recovery, introduced temporal difference information into the ordinary three-dimensional convolutional network, and subsequently constrained the convergence of the model with a self-defined loss function (Yu et al., 2020b). Nowara et al. (2020) used the inverse of the attention mask to estimate the artifacts, and used it as the input of a sequence learning model to improve the estimation.

The Transformer is becoming the preferred architecture for many computer vision tasks. By fully exploiting the self-attention mechanism to overcome the spatial locality of convolution, two recent rPPG works preliminarily showed that the Transformer structure can match the performance of the most advanced convolutional networks (Yu et al., 2022). However, whether it can exceed the performance of convolutional networks on large data sets remains to be studied (Kwasniewska et al., 2021). More recent works explore the Transformer architecture with multimodal sources (e.g., RGB and NIR) (Liu et al., 2023), different color spaces (Liu et al., 2024), and multistage frameworks (Zhang et al., 2023; Zou et al., 2024a). Furthermore, Mamba, a very recent sequence model backbone, has also been investigated in the rPPG task (Zou et al., 2024b).

In addition, models trained with generative adversarial networks can generate realistic rPPG waveforms. For example, PulseGAN (Song et al., 2021) used a conditional generative adversarial network to optimize the waveforms obtained with signal separation methods. Dual-GAN (Lu et al., 2021) used dual generative adversarial networks to model the background artifacts for better pulse wave recovery. However, this method involved facial landmark detection, ROI extraction, color space transformation, and other preprocessing. The complexity of these preprocessing steps limits real-time application in natural scenes.

There have also been attempts to use meta-learning (Lee et al., 2020). Liu et al. (2020) proposed a meta-learning framework that used model-agnostic meta-learning algorithms for training and utilized signal separation-based methods to generate pseudo labels. However, due to the limitations of supervised learning, the performance of existing methods degrades in cross-database evaluation and practical applications. The learned spatio-temporal features are still vulnerable to lighting conditions and movements, and fail to fully exploit the extensive temporal context to improve the spatio-temporal representation. Therefore, introducing an enhanced temporal feature learning module is a workable solution.

2.3 Spatio-temporal modeling

The rPPG signal contains temporal information that changes with the cardiac cycle; therefore, the modeling of spatio-temporal information is crucial. Early spatio-temporal deep learning networks directly extracted motion characteristics between frames using 3D convolution. Tran et al. (2015) proposed C3D, a homogeneous network built from small 3D convolution kernels. Feichtenhofer et al. proposed a network called SlowFast, which consists of a slow pathway operating at a low frame rate and a fast pathway operating at a high frame rate (Feichtenhofer et al., 2019). Liu et al. (2020) introduced a temporal shift module into the convolutional network for rPPG-based physiological measurement on a mobile platform. Ouzar et al. (2023) proposed an end-to-end pulse rate estimation method based on depthwise separable convolutions.

The 3D-CDC (Yu et al., 2021; Yu et al., 2020c) was proposed as an innovative replacement for plain 3D convolution. It is realized by a unified 3D convolution operator that incorporates spatio-temporal gradient information to deliver more robust and discriminative modeling capability. The CDC has been adapted for tasks such as gesture recognition and face anti-spoofing, achieving state-of-the-art performance. Yu et al. (2021) combined 3D-CDC with neural architecture search for gesture recognition. The difference between that work and the work described in this paper is that the former uses 3D-CDC as a searchable convolution component. Neural architecture search usually needs large databases to support the search; however, for rPPG tasks, only relatively small data sets are available. Therefore, in this paper, the 3D-CDC module is directly applied to extract the spatio-temporal representation, and is combined with a dual-branch structure and a soft attention mask to further exploit the rPPG-intrinsic temporal characteristics.

3 Methods

3.1 General framework

As Figure 1 shows, a multi-task 3D temporal central difference convolutional attention network with Huber loss is proposed to achieve robust pulse and respiration wave recovery. In particular, the normalized video frame difference is used as the input for motion representation based on the optical model of skin reflection, and a separate appearance branch is introduced to assign higher weights to the skin regions with stronger physiological signals. Temporal 3D-CDC is adopted as the backbone to capture rich temporal context. A multi-task measurement head with Huber loss is then used to output the final prediction. The proposed framework requires no region-of-interest extraction preprocessing; the attention mechanism between the two branches achieves a similar function. As the distribution of physiological signals is not uniform over the whole facial area, the attention mechanism can learn soft attention masks and assign higher weights to skin areas with stronger signals, which is beneficial for accuracy improvement.


Figure 1. A multi-task 3D temporal central difference convolutional attention network for remote physiological measurements.

3.2 Architecture

3.2.1 Skin reflection modeling

The principle of rPPG is that when a light source irradiates skin tissue, the reflected light intensity changes with the variation of the measured substance (Wang et al., 2017); here, the measured substance is the variation of blood volume in the vessels. The transmitted light intensity detected by the camera thus contains the corresponding physiological information of the tissue. Specifically, the skin reflections can be modeled as

$$V_k(t)=I(t)\cdot\big(V_s(t)+V_d(t)\big)+V_n(t) \quad (1)$$

where $V_k(t)$ is the RGB value of the $k$th skin pixel, $I(t)$ is the light intensity related to the light source, camera distance, and skin tissue absorption, $V_n(t)$ is the random quantization noise of the camera, and $V_s(t)$ and $V_d(t)$ represent the specular and diffuse reflections of the skin, respectively. These reflections contain stationary and time-varying components. After expanding the two components given in Equation 1, $V_k(t)$ can be rewritten as

$$V_k(t)=\underbrace{I_0\cdot\big(1+\Psi(N(t),P(t))\big)}_{\text{intensity variation}}\cdot\Big(\underbrace{C}_{\text{constant}}+\underbrace{\Phi_s(N(t),P(t))}_{\text{specular component}}+\underbrace{P(t)}_{\text{diffuse component}}\Big)+V_n(t) \quad (2)$$

where $I_0$ denotes the stationary component of the light intensity, $\Psi(N(t),P(t))$ is the intensity variation detected by the camera, and $\Phi_s(N(t),P(t))$ is the time-varying part of the specular reflection. The desired pulse wave signal is denoted by $P(t)$, and $N(t)$ denotes variations caused by non-physiological changes, such as changes in background light, head movement, speech, and facial expression.

The aim of any rPPG-based method is to extract $P(t)$ from $V_k(t)$. To further simplify Equation 2, spatial averaging of pixels is first applied to reduce the camera quantization error $V_n(t)$. This is accomplished by using bicubic interpolation to downscale each frame to $L\times L$ pixels. The choice of $L$ involves a trade-off between reducing camera noise and maintaining spatial resolution (Wang et al., 2015). Subsequently, any product between time-varying terms, such as $\Psi(N(t),P(t))\cdot P(t)$, is neglected because the stationary components are significantly larger than the time-varying ones. Furthermore, the constant term, which varies with the subject's skin tone and lighting conditions and is usually dominant, can be reduced by taking the first-order derivative of both sides of Equation 2. After applying these simplifications, we obtain

$$V_k'(t)=I_0\cdot\frac{\partial\Psi(N(t),P(t))}{\partial t}+I_0\cdot C\cdot\frac{\partial\Phi_s(N(t),P(t))}{\partial t}+I_0\cdot\frac{\partial P(t)}{\partial t} \quad (3)$$

It can be gathered from Equation 3 that $V_k'(t)$ still depends on the stationary light intensity $I_0$. The spatial distribution of $I_0$ is irrelevant to physiology, but it differs between video recording setups due to different distances to the light source and uneven skin contours (Chen and McDuff, 2018). The intensity $I_0$ can be removed by dividing $V_k'(t)$ by the temporal mean $\overline{V_k(t)}$ as

$$\frac{V_k'(t)}{\overline{V_k(t)}}=\frac{1}{C}\cdot\frac{\partial\Psi(N(t),P(t))}{\partial t}+\frac{\partial\Phi_s(N(t),P(t))}{\partial t}+\frac{1}{C}\cdot\frac{\partial P(t)}{\partial t} \quad (4)$$

Following (Chen and McDuff, 2018), the discrete-approximation form of Equation 4 can be written as

$$\frac{V_k'(t)}{\overline{V_k(t)}}\approx\frac{V_k(t+\Delta t)-V_k(t)}{V_k(t+\Delta t)+V_k(t)} \quad (5)$$

which is the normalized frame difference, where $\Delta t$ is the sampling interval.

Based on these deductions, a machine learning model is suitable for capturing the complex relationship between $V_k(t)$ and $P(t)$. The normalized difference between consecutive frames, as given in Equation 5, serves as the input of the motion branch of the learning model. The motion representation thereby captures the physiological processes under a variety of lighting conditions. Subsequently, the appearance information in facial videos can be used to guide where and how the physiological processes should be approximated.
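To make the motion input concrete, the following is a minimal NumPy sketch of Equation 5. The function name is illustrative, and the final standardization step is an assumption borrowed from common DeepPhys-style pipelines rather than part of the derivation above.

```python
import numpy as np

def normalized_frame_difference(frames, eps=1e-7):
    """Motion-branch input per Equation 5.

    frames: float array of shape (T, H, W, 3) holding the downsampled
    video frames. Returns an array of shape (T - 1, H, W, 3).
    """
    num = frames[1:] - frames[:-1]        # V_k(t + dt) - V_k(t)
    den = frames[1:] + frames[:-1] + eps  # V_k(t + dt) + V_k(t)
    diff = num / den
    # Standardize over the clip so the network sees zero-mean inputs
    # (an assumed, commonly used step, not part of Equation 5 itself).
    return (diff - diff.mean()) / (diff.std() + eps)
```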

3.2.2 Efficient spatio-temporal modeling

The rPPG is a periodic time-varying signal; therefore, the spatio-temporal representation of the facial video is the core step in rPPG signal extraction. The 3D convolution can naturally be used as a spatio-temporal information extractor. Compared with conventional 3D convolution, temporal 3D-CDC concentrates on differences in the temporal gradient by incorporating temporal gradient information into a single 3D convolution operation, calculating the central difference over the adjacent local spatio-temporal region (Yu et al., 2020d). The 3D-CDC consists of two main steps, sampling and aggregation, where the aggregation is biased towards the center-oriented temporal gradient of the sampled values; it can be expressed as Equation 6:

$$\mathrm{3DCDC}(l_0)=\sum_{l_n\in\mathcal{C}}\omega(l_n)\cdot x(l_0+l_n)+\theta\cdot\Big(-x(l_0)\cdot\sum_{l_n\in\mathcal{R}}\omega(l_n)\Big) \quad (6)$$

where $x$ is the input feature map, $\mathcal{C}$ denotes the local receptive field cube, $\omega$ denotes the learnable weights, $l_0$ is the current location on the feature map, and $l_n$ enumerates the locations in $\mathcal{C}$ and the adjacent time steps in $\mathcal{R}$. The hyperparameter $\theta$ trades off the importance of intensity and gradient information. The 3D-CDC thus provides more discriminative and reliable modeling capability without any extra parameters.
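As a concrete illustration, below is a minimal TensorFlow sketch of a temporal 3D-CDC layer. It follows Equation 6 under the assumption, as in public CDC implementations, that the central-difference term can be realized by spatially summing the kernel weights of the temporally adjacent slices into an equivalent 1 × 1 × 1 convolution; the class name and the default θ value are illustrative, not taken from the paper's code.

```python
import tensorflow as tf

class TemporalCDC3D(tf.keras.layers.Layer):
    """Sketch of temporal 3D central difference convolution (Equation 6).

    theta trades off intensity vs. temporal-gradient information;
    theta = 0 recovers a vanilla Conv3D. A temporal kernel size of 3
    (one past and one future slice around the center) is assumed.
    """

    def __init__(self, filters, kernel_size=3, theta=0.6, **kwargs):
        super().__init__(**kwargs)
        self.theta = theta
        self.conv = tf.keras.layers.Conv3D(
            filters, kernel_size, padding="same", use_bias=False)

    def call(self, x):
        out_normal = self.conv(x)  # vanilla aggregation term of Equation 6
        if self.theta == 0:
            return out_normal
        # Sum the kernel weights of the temporally adjacent slices
        # (t = 0 and t = 2) over the spatial dims, which yields an
        # equivalent 1x1x1 kernel for the -x(l_0) difference term.
        w = self.conv.kernel  # shape (kT, kH, kW, C_in, C_out)
        w_adj = (tf.reduce_sum(w[0], axis=[0, 1])
                 + tf.reduce_sum(w[2], axis=[0, 1]))
        w_diff = tf.reshape(w_adj, [1, 1, 1, w.shape[3], w.shape[4]])
        out_diff = tf.nn.conv3d(x, w_diff, strides=[1, 1, 1, 1, 1],
                                padding="SAME")
        return out_normal - self.theta * out_diff
```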

3.2.3 Dual branch and soft attention

The first-order derivative in the reflection modeling removes the constant terms that are generally associated with the subject's skin tone and lighting conditions, so the proposed model can partially reduce the dependence of the learned model on the skin tones and lamp spectra in the training data. In the motion representation, however, each pixel is assumed to be equally weighted in skin reflection modeling. Although the use of the normalized frame difference helps to reduce the influence of background pixels to a certain extent, those pixels would still increase the artifacts and affect the rPPG measurement. Previous studies have used custom regions of interest for rPPG measurement, but this requires additional preprocessing such as facial landmark detection or skin detection. Moreover, not all skin pixels contribute equally to rPPG measurement because physiological signals are not evenly distributed over skin regions. Therefore, it is beneficial to add an attention module that assigns higher weights to skin areas with stronger physiological signal representation.

As the differential operation in the motion representation removes the appearance information, a separate appearance branch is utilized based on (Chen and McDuff, 2018). Unlike the motion branch, which takes the normalized frame differences as input, the appearance branch takes the downsampled frames as input. The two branches have the same structure, except that the appearance branch lacks the last three layers. The attention masks are estimated with a 1 × 1 × 1 convolution filter right before the pooling layers. The soft attention mask is defined in Equation 7:

$$\mathrm{SoftAtt}(X_M^j)=\frac{H_j\,W_j\cdot S(\omega_j X_A^j+b_j)}{2\,\big\|S(\omega_j X_A^j+b_j)\big\|_1}\odot X_M^j \quad (7)$$

where $X_A^j$ and $X_M^j$ are the feature maps of convolution layer $j$ of the appearance and motion branches, respectively, and $H_j$ and $W_j$ are the height and width of the feature maps of convolution layer $j$. The sigmoid function is denoted by $S(\cdot)$, $\omega_j$ and $b_j$ are the weights and bias of the convolution kernel, respectively, $\|\cdot\|_1$ is the $\ell_1$ norm, and $\odot$ denotes the element-wise product. Applying the sigmoid function followed by $\ell_1$ normalization generates a soft attention mask that avoids extreme values. The attention mask is the bridge between the motion and appearance branches, assigning higher weights to the skin regions with stronger physiological signals through joint learning.
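A minimal sketch of the attention computation in Equation 7 is given below; the function and argument names are illustrative, and the per-frame spatial normalization for 5D feature maps is our assumption.

```python
import tensorflow as tf

def soft_attention(x_app, x_motion, conv1x1x1):
    """Equation 7: re-weight motion features with an appearance mask.

    x_app, x_motion: feature maps of shape (B, T, H, W, C) from the
    appearance and motion branches; conv1x1x1 is a Conv3D layer with
    kernel size 1 and a single output channel.
    """
    mask = tf.sigmoid(conv1x1x1(x_app))  # S(w_j X_A^j + b_j)
    h, w = x_app.shape[2], x_app.shape[3]
    # L1-normalize per frame and rescale by (H_j * W_j) / 2 to avoid
    # extreme values, as in Equation 7.
    l1 = tf.reduce_sum(tf.abs(mask), axis=[2, 3], keepdims=True)
    mask = (h * w) * mask / (2.0 * l1)
    return mask * x_motion  # element-wise product with motion features
```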

3.2.4 Multi-task and loss function

Two types of loss functions are widely used in deep learning-based rPPG. One type minimizes the point-by-point error, such as the MAE and RMSE. The other minimizes the waveform similarity error, such as the negative Pearson correlation coefficient. The former focuses on the intensity-level constraint, which is relatively simple and easy to converge, but may cause overfitting. For instance, the RMSE decreases slowly when approaching a local minimum during gradient descent, and as a squared error it is sensitive to abnormal values. The MAE reduces the sensitivity to outliers, but its gradient remains relatively constant, which may cause it to miss the local minimum. In contrast, the latter type imposes a constraint in the frequency domain, forcing the model to learn periodic features in the target frequency band; as the artifacts in rPPG may be large in realistic environments, such losses can be difficult to converge. Therefore, we compare different loss functions (MAE, RMSE, negative Pearson, ε-insensitive, and their combinations, detailed in Section 5.4), and find that the Huber loss achieves the best rPPG recovery performance. The Huber loss is defined as

$$L_{\mathrm{Huber}}=\frac{1}{N}\sum_{i=1}^{N}\Big[\mathbb{I}\big(|y_i-\hat{y}_i|\le\delta\big)\cdot\frac{(y_i-\hat{y}_i)^2}{2}+\mathbb{I}\big(|y_i-\hat{y}_i|>\delta\big)\cdot\Big(\delta\,|y_i-\hat{y}_i|-\frac{1}{2}\delta^2\Big)\Big] \quad (8)$$

where $y_i$ is the ground-truth pulse or respiration waveform, $\hat{y}_i$ is the corresponding prediction of the proposed method, and $\mathbb{I}(\cdot)$ is the indicator function. When the error between the predicted rPPG signal and the ground truth is less than or equal to the threshold $\delta$ (the default value of 1 is used), the Huber loss behaves like the L2 (RMSE) loss; otherwise, it behaves like the L1 (MAE) loss. Thus, the Huber loss combines the advantages of the RMSE and MAE while being less sensitive to outliers in the training data.
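For reference, Equation 8 can be written directly in TensorFlow as the short sketch below; the built-in tf.keras.losses.Huber(delta=1.0) implements the same piecewise form.

```python
import tensorflow as tf

def huber_loss(y_true, y_pred, delta=1.0):
    """Equation 8: quadratic for |error| <= delta, linear beyond it."""
    err = tf.abs(y_true - y_pred)
    quadratic = 0.5 * tf.square(err)         # L2-like region
    linear = delta * err - 0.5 * delta ** 2  # L1-like region
    return tf.reduce_mean(tf.where(err <= delta, quadratic, linear))
```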

Due to respiratory sinus arrhythmia, a rhythmic fluctuation of the cardiac cycle at the respiratory frequency, PPG signals also contain information about respiration (Berntson et al., 1993). Thus, the physiological signal $P(t)$ is an intricate synthesis of the pulse and respiration waves, and the two are related in their underlying mechanism. Therefore, a multi-task network is constructed to measure the pulse and respiratory signals simultaneously, reducing the computational cost by about half. The intermediate representation is shared, and only separate fully connected layers are used to regress the pulse and respiration, as shown in Figure 1. The multi-task loss is defined as

$$L_{\mathrm{Total}}=\alpha\cdot L_{\mathrm{Huber}}^{\mathrm{hr}}+\beta\cdot L_{\mathrm{Huber}}^{\mathrm{rsp}} \quad (9)$$

where $\alpha=\beta=1$ is adopted in our experiments based on empirical studies.
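The multi-task objective then reduces to a weighted sum of two Huber terms; a sketch using the huber_loss function above:

```python
def total_loss(pulse_true, pulse_pred, resp_true, resp_pred,
               alpha=1.0, beta=1.0):
    """Equation 9 with the default alpha = beta = 1. For HR-only
    training (see Section 4.2), set beta = 0."""
    return (alpha * huber_loss(pulse_true, pulse_pred)
            + beta * huber_loss(resp_true, resp_pred))
```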

4 Results

4.1 Database and evaluation settings

The proposed method is evaluated on three publicly available databases: UBFC (Bobbia et al., 2019), PURE (Stricker et al., 2014), and COHFACE (Heusch et al., 2017), which are commonly used in recent research.

The UBFC-rPPG database (Bobbia et al., 2019) comprises 42 videos from 42 subjects, recorded with a web camera at 30 frames per second and a resolution of 640 × 480, and stored in an uncompressed format. A pulse oximeter was used to obtain the ground-truth PPG data. All scenes were indoors under different lighting conditions. During collection, subjects were asked to do mental arithmetic to manipulate the heart rate. The PURE database (Stricker et al., 2014) contains 10 subjects, each of whom participated under six different recording conditions, e.g., sitting still, speaking, and slow/fast head movements. This database was recorded at 30 frames per second using an industrial camera and stored in an uncompressed format at a resolution of 640 × 480. The PPG data were also collected with a pulse oximeter.

The COHFACE database (Heusch et al., 2017) comprises 160 videos from 40 subjects, with four videos per subject taken under two different lighting conditions. The videos were recorded with a webcam at 20 frames per second and a resolution of 640 × 480, and compressed in MPEG-4 format. The bit rate was 250 kb/s, which makes the database considerably challenging due to the compression. The recorded physiological signals include blood volume pulse and respiratory signals. Note that other databases also exist for rPPG research, such as VIPL-HR (Niu et al., 2018), OBF (Li et al., 2018), and AFRL (Estepp et al., 2014). The OBF and AFRL databases are currently not publicly available. We obtained the VIPL-HR database and the ground-truth PPG signals for training our method; however, after further investigation, we found that the PPG signals in VIPL-HR were not evenly sampled. The ratio of contact PPG sampling points to the number of video frames varied from 2 to 4, making it unsuitable for training our method.

The average heart rate (HR) estimation task is evaluated on all three databases, while the respiration rate (RR) estimation task is evaluated on the COHFACE database. In particular, we follow the evaluation settings in (Niu et al., 2020; Tsou et al., 2020; Heusch et al., 2017). For the UBFC database, data from 28 subjects are used as the training set and data from the remaining 14 subjects are used as the test set. For the PURE and COHFACE databases, data from 60% of the subjects are used for training, and data from the remaining 40% for testing.

4.2 Implementation details

To avoid overfitting, the second and fourth convolutional layers are each followed by an average pooling layer and a dropout layer. The input of the appearance branch is preprocessed by downsampling each video frame to a size of 36 × 36, since 36 has been reported as a good trade-off between retaining spatial resolution and reducing camera noise (Chen and McDuff, 2018). We pick $\alpha=\beta=1$ for the multi-task loss function so that the pulse and respiration estimations are regarded equally. A second-order Butterworth filter is used to further filter the network's output. The cut-off frequencies are 0.75 and 2.5 Hz for HR, and 0.08 and 0.5 Hz for RR. The position of the highest peak in the power spectrum of the filtered signal determines the estimated HR or RR.
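A minimal SciPy sketch of this post-processing step follows (the function name is illustrative): band-pass filtering with a second-order Butterworth filter and picking the highest spectral peak.

```python
import numpy as np
from scipy.signal import butter, filtfilt, periodogram

def estimate_rate(waveform, fs, low, high):
    """Estimate HR or RR in cycles per minute from a recovered waveform
    sampled at fs Hz. Use (low, high) = (0.75, 2.5) Hz for HR and
    (0.08, 0.5) Hz for RR, as stated above."""
    b, a = butter(2, [low, high], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, waveform)
    freqs, power = periodogram(filtered, fs=fs)
    band = (freqs >= low) & (freqs <= high)
    return 60.0 * freqs[band][np.argmax(power[band])]
```

For example, estimate_rate(rppg, fs=30, low=0.75, high=2.5) returns the heart rate in beats per minute for a 30 fps clip.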

We implement our method in TensorFlow 2.0. The Adadelta optimizer is used to train the model on an NVIDIA GeForce RTX 2080Ti GPU. The learning rate is set to 1.0, and all other parameters are kept at the defaults of the Adadelta optimizer. The number of training epochs is chosen per database, with early stopping based on visual inspection of ten-fold cross-validation. Apart from the proposed method with 3D-CDC, we also implement a standalone 3D CNN model to verify the effectiveness of the central difference mechanism; all other modules are the same except for the convolution operation. During HR-only evaluations, the network is trained with an HR-based loss, i.e., $\alpha=1$ and $\beta=0$ in Equation 9.
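The optimizer setup and a single training step can be sketched as follows; model construction is omitted, `model` is assumed to map a clip to a (pulse, respiration) pair as in Figure 1, and total_loss refers to the sketch in Section 3.2.4.

```python
import tensorflow as tf

# Adadelta with learning rate 1.0; all other arguments keep the
# TensorFlow defaults, as described above.
optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0)

@tf.function
def train_step(clip, pulse_true, resp_true):
    with tf.GradientTape() as tape:
        pulse_pred, resp_pred = model(clip, training=True)
        loss = total_loss(pulse_true, pulse_pred, resp_true, resp_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```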

Note that we do not reproduce all previous methods, but refer to the results reported in the corresponding papers. For comparison, classical non-deep learning methods, such as POS (Wang et al., 2017) and CHROM (De Haan and Jeanne, 2013), are used as baselines. Not all previous methods have been evaluated on the aforementioned three databases; however, the state-of-the-art methods on each database are compared, such as Dual-GAN (Lu et al., 2021) on UBFC, DeepPhys (Chen and McDuff, 2018) on PURE, and DeeprPPG (Liu and Yuen, 2020) on COHFACE.

4.3 Intra-database evaluation

4.3.1 HR estimation on UBFC-rPPG

Table 1 shows the intra-database evaluation results on the UBFC-rPPG database. The proposed method based on the 3D temporal central difference convolutional attention network outperforms both the traditional and recent deep learning-based methods, with an MAE, RMSE, and correlation coefficient of 0.34, 1.12, and 0.997, respectively. Note that the evaluation metrics MAE and RMSE here denote the errors of the estimated heart rate and respiration rate, rather than the point-by-point errors used in the loss functions. Two examples of the rPPG signal predicted by the proposed recovery network on this database, together with the corresponding ground-truth PPG signals collected by the sensor, are shown in Figure 2. In most cases, the recovered curve fits the ground-truth signal, but there are unfavorable cases such as the one shown in Figure 2B. The failure may be due to a noisy ground-truth signal caused by artifacts during sensor collection. Even under this noisy condition, our method is still able to reconstruct a sinusoidal-like curve.


Table 1. Intra-database evaluation on UBFC-rPPG.


Figure 2. Two cases of the recovered signal curve on UBFC-rPPG: (A) Subject 39 and (B) Subject 48. The red line represents the recovered signal, while the blue line represents the ground-truth.

4.3.2 HR estimation on PURE

Table 2 shows the intra-database evaluation results on the PURE database. The proposed method outperforms existing methods, with an MAE, RMSE, and correlation coefficient of 0.78, 1.07, and 0.999, respectively. Two examples of the rPPG signal predicted on the PURE database and the corresponding ground-truth signals are shown in Figure 3. Even in the "bad" case in Figure 3B, the recovered curve generally fits the ground-truth well, exhibiting only a small phase difference.


Table 2. Intra-database evaluation on PURE.


Figure 3. Two cases of recovered PPG signal curve on PURE: (A) Sample 07–03 and (B) Sample 07–01. The red line represents the recovered signal, while the blue line represents the ground-truth.

4.3.3 HR and RR estimation on COHFACE

Table 3 shows the intra-database evaluation results on the COHFACE database. The proposed method significantly outperforms prior methods, with an MAE, RMSE, and correlation coefficient of 1.71, 3.57, and 0.965, respectively. Due to the video compression of COHFACE, the performance is not as good as on the other two databases. Two examples of the rPPG signal predicted on this database and the corresponding ground-truth signals are also provided.
