SmartFPS: A neural network-based wireless-inertial fusion positioning system

1. Introduction

Location and navigation services are indispensable to modern society and scientific development. However, the Global Navigation Satellite System (GNSS) usually cannot locate users effectively indoors because GNSS signals are irregularly attenuated by clouds, building walls, and ceilings. Over the last decades, many efforts have been made to leverage indoor wireless signals for navigation services, giving rise to indoor positioning technology. At present, mainstream indoor positioning technologies include Wi-Fi (Zhuang et al., 2016), Bluetooth (Paterna et al., 2017), ultra-wideband (UWB) (Mahfouz et al., 2008), Lidar (Li et al., 2020), computer vision (Hwang and Song, 2011), inertial navigation (Kang and Han, 2015; Wu et al., 2019; Hou and Bergmann, 2021), visible light (Zhuang et al., 2018, 2019; Hua et al., 2021), and so on. Each positioning technique has its advantages as well as its limitations. For example, inertial navigation is prone to accumulative errors due to system noise and drift (Abdulrahim et al., 2010). Modeling signal propagation is challenging for Wi-Fi and Bluetooth since they are easily affected by occlusions, and traditional positioning algorithms, such as trilateration (Yang et al., 2020), usually have low positioning accuracy. Current UWB devices are too expensive to be widely deployed (Alarifi et al., 2016). Lidar is also costly and has specific requirements for the wall reflection coefficient (Roche et al., 2022), so it fails under poor lighting conditions. In general, low-cost, high-precision, and highly stable positioning cannot be achieved with a single positioning system. Therefore, most practical positioning solutions leverage multiple sensors on the user's mobile device, such as the gyroscope and accelerometer of the inertial unit and the Wi-Fi and Bluetooth modules in the smartphone. Many fusion positioning systems have been proposed in recent years, such as fusions of Wi-Fi, Bluetooth, Lidar, and inertial sensors (Chen et al., 2014; You et al., 2020; Yang et al., 2022).

Current wireless-inertial fusion positioning systems widely adopt empirical propagation models of wireless signals and filtering algorithms such as the Kalman filter (KF) or the particle filter (PF). However, in many practical positioning scenarios, the complexity of the fusion system, or even of a single subsystem, is very high. For pedestrian navigation, for example, it is quite challenging to model a step-heading estimation system (SHS) or the distribution characteristics of wireless signals in the environment. Moreover, as system complexity grows, parameter errors have a larger influence on the final state estimate. Although many studies apply machine learning techniques to deal with the complexity of a single positioning system, fusion is still based on filtering algorithms. It is rather challenging to determine the noise parameters of filtering methods, especially the noise of wireless positioning, which is irregularly distributed over the whole area; if wireless positioning adopts the fingerprinting method, the noise must be analyzed area by area. Furthermore, assigning fixed values to the process and observation covariance matrices compromises the accuracy of the noise parameters. Considering these factors, deep learning is a more straightforward way to model the noise accurately and build the whole model at the same time. In this study, we propose a fusion localization system based on an end-to-end neural network, together with a transfer learning strategy that improves the model's performance on samples with different distributions. We verify our methods in a Bluetooth-inertial fusion positioning scenario, considering the low cost and quick deployment of commercial Bluetooth beacons and the convenience of leveraging built-in smartphone sensors. Our contributions are summarized as follows:

1) An end-to-end trainable neural network-based inertial-wireless fusion positioning system, namely SmartFPS, is proposed for the first time, together with a training procedure based on multi-task learning. Large field experiments show that our method outperforms filter-based fusion systems by 36.5%, even when the particle filter is given optimal parameters.

2) A transfer learning strategy for SmartFPS is proposed based on the generative adversarial network (GAN) to deal with device heterogeneity and other attack factors in practical positioning scenarios. Training the GAN-based network simultaneously with labeled data from the source domain and unlabeled data from the target domain improves positioning accuracy by over 50% compared with training SmartFPS on the source-domain dataset alone.

2. Related works

In wireless-inertial fusion localization systems, the processing and modeling of inertial sensors, of wireless signals, and of their fusion must be carefully considered. In recent decades, inertial navigation and indoor wireless positioning have each evolved from empirical modeling to machine learning. Fusion methods, especially filtering methods, have also been improved to cope with complex processes and non-Gaussian noise. This section discusses the development of these three components of fusion positioning separately to show the advantages of replacing all empirical models with deep learning.

2.1. Pedestrian inertial navigation

Pedestrian inertial navigation mainly includes the strap-down inertial navigation system (SINS) (Bortz, 1971) and the step-heading estimation system (Jirawimut et al., 2003). However, these methods rely on accurate sensor calibration or on the estimation of step length and heading, both of which are corrupted by the noise and drift of inertial systems.

As for SINS, even a small error in the angle estimate propagates through multiple integration stages and affects the final position exponentially. This effect is especially severe for the low-cost MEMS sensors in smartphones. The extended Kalman filter (EKF) is usually applied to reduce this error.

One measure to deal with the drift problem is to close the integration loop periodically by imposing external constraints on the system. The most widely used constraint is the zero velocity update (ZUPT) (Foxlin, 2005). ZUPT assumes the sensor is at rest and can be applied during the stance phase, provided the sensor is attached to the foot. It is easily incorporated into the INS structure as a pseudo-measurement of zero velocity. With ZUPT, open-loop integration only occurs during the swing phase of the foot; over such short durations the accumulated drift is limited, so longer tracking durations become feasible. However, for reliable output, ZUPT must only be applied when the foot is completely stationary. Problems arise when the sensor is mounted above the sole: the peeling motion of the transition from standing to swinging means the heel rises soon after the foot lands, so a sensor on the midfoot begins to experience acceleration as the foot lifts. These small accelerations occur before the strict end of the stance phase, so they must be accounted for by assigning a non-zero covariance to the ZUPT pseudo-measurement, as sketched below.
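
As a concrete illustration, the following minimal sketch (our own, not taken from the cited works) folds a zero-velocity pseudo-measurement into a Kalman measurement update for a simple [position, velocity] state; the non-zero `R_zupt` is what absorbs the small pre-swing accelerations described above.

```python
import numpy as np

def zupt_update(x, P, R_zupt=1e-4):
    """Kalman measurement update with a zero-velocity pseudo-measurement.

    x : state vector [px, py, pz, vx, vy, vz]
    P : 6x6 state covariance
    R_zupt : pseudo-measurement noise; kept non-zero to absorb the small
             accelerations that occur before the strict end of stance
    """
    H = np.hstack([np.zeros((3, 3)), np.eye(3)])  # observe velocity only
    z = np.zeros(3)                               # pseudo-measurement: v = 0
    R = R_zupt * np.eye(3)

    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ (z - H @ x)              # corrected state
    P = (np.eye(6) - K @ H) @ P          # corrected covariance
    return x, P
```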

To simplify the modeling of inertial navigation, a deep learning-based method, IONet (Chen et al., 2018), was proposed. Unlike a traditional SHS, the step length and heading change are estimated by training a neural network on sequential inertial data. Since neural networks excel at modeling non-linear systems, the accuracy of step-length and heading estimation is significantly improved. IONet is a two-layer long short-term memory (LSTM) network, a type of neural network designed for time-series data. Its input is the accelerometer and gyroscope sequence within a fixed time window, and its output is the step length and rotation angle. Given the pedestrian's initial position and heading, the system tracks position and heading from each step's estimated length and rotation angle. Experiments showed that this neural network-based method significantly improves positioning accuracy over traditional pedestrian dead reckoning (PDR), while also simplifying complex processes such as data pre-processing and noise analysis.

Inspired by IONet, RoNIN was proposed to further improve deep learning-based inertial navigation (Herath et al., 2020). That study investigated three network structures: a ResNet, an LSTM, and a temporal convolutional network (TCN). Unlike IONet, the output of RoNIN is a velocity vector. Since acquiring ground-truth labels in large environments is challenging, the study used the Tango system, attaching one Android phone to the front of the body while another phone was held in one hand to collect inertial data. The pre-processing of the inertial data was explained in detail, including a spatiotemporal calibration process that aligns the inertial data and the ground truth from the two devices. Verified on a massive dataset in a large environment, the proposed system achieved much higher positioning accuracy than IONet.

2.2. Indoor positioning based on wireless signals

Wireless signal-based positioning systems generally build their models on features such as received signal strength (RSS), angle of arrival (AOA), and time of flight (TOF). Because it is simple to implement, most wireless positioning solutions adopt the RSS feature. RSS-based positioning algorithms generally fall into three types: proximity (Yu et al., 2020), trilateration, and fingerprinting.

The proximity method is the simplest positioning method: it determines whether the receiving device is close to a wireless beacon according to the RSS value and directly uses the beacon's position as the device's position. Its positioning accuracy is therefore usually low, but it is highly robust: although the RSS fluctuates, the smoothed RSS stays within a fixed range when the device is very close to the beacon.

RSS-based trilateration estimates a device's location from the RSS values of three or more wireless beacons. Due to RSS fluctuation, the three circles given by the RSS-derived distances do not intersect at a single point but overlap in an area, so the position is usually solved with non-linear optimization methods such as least squares, as sketched below. Trilateration accuracy depends on having at least three line-of-sight (LOS) wireless signals. In practice, deployment costs keep the distance between beacons large, and environmental interference or occlusion is inevitable, so it is difficult to receive three LOS signals continuously. Even when three LOS signals exist, the signal-to-noise ratio (SNR) is low at some locations because of high noise and weak signal. As a result, trilateration often has low robustness.
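
A minimal sketch of RSS-based trilateration by non-linear least squares; the beacon coordinates and RSS-derived distances below are illustrative values, not measurements from this study.

```python
import numpy as np
from scipy.optimize import least_squares

# Beacon positions and distances estimated from RSS via a path-loss model;
# the noisy distances make the three circles overlap in an area.
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
dists = np.array([5.4, 6.1, 4.8])

def residuals(p):
    # difference between geometric ranges and RSS-derived ranges
    return np.linalg.norm(beacons - p, axis=1) - dists

sol = least_squares(residuals, x0=beacons.mean(axis=0))
print("estimated position:", sol.x)
```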

Fingerprint positioning technology is widely used to solve the problem of wireless positioning under a low SNR. In the offline stage, RSS values from the different wireless beacons are collected at various locations in the scene, and each combination of measurements becomes the fingerprint of that location. After the system is deployed, the RSS combination measured online (in real time) is compared with the offline measurements to estimate the user's location. Machine learning methods such as K-nearest neighbors (KNN) and the support vector machine (SVM), as well as deep learning methods such as the multi-layer perceptron (MLP), can be used to solve the fingerprint-matching problem; a minimal matcher is sketched below. Because fingerprint localization needs to survey the environment to obtain the fingerprints or characteristics of the environment, it is also called scene-analysis localization.
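
A minimal weighted-KNN fingerprint matcher, assuming an offline database of surveyed RSS vectors; all array names are illustrative.

```python
import numpy as np

def knn_fingerprint(online_rss, fp_rss, fp_pos, k=3):
    """Weighted K-nearest-neighbors fingerprint matching.

    online_rss : (B,) RSS vector measured online (B beacons)
    fp_rss     : (N, B) offline fingerprint database
    fp_pos     : (N, 2) surveyed 2D positions of the fingerprints
    """
    d = np.linalg.norm(fp_rss - online_rss, axis=1)  # signal-space distance
    idx = np.argsort(d)[:k]                          # k closest fingerprints
    w = 1.0 / (d[idx] + 1e-6)                        # inverse-distance weights
    return (fp_pos[idx] * w[:, None]).sum(axis=0) / w.sum()
```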

2.3. Fusion algorithms based on filtering methods

The filtering algorithm is the most widely used fusion approach in mainstream indoor fusion positioning systems. The filtering algorithms applied to positioning fusion mainly include (1) the discrete Bayesian filter (Fox et al., 2003); (2) the Kalman filter; and (3) the particle filter (Chen et al., 2022). The Kalman filter family also includes the extended Kalman filter (Ozbek et al., 2010), the unscented Kalman filter (UKF) (You et al., 2020), and others. Each filter has its advantages and drawbacks.

The Kalman filter is theoretically the optimal estimator for unimodal linear systems with Gaussian noise, but not for non-linear, non-Gaussian systems. To address this limitation, improved variants such as the extended Kalman filter and the unscented Kalman filter were proposed in subsequent research.

Since the extended Kalman filter is a first-order approximation of a non-linear system, it performs poorly on strongly non-linear systems; second-order and third-order extended Kalman filters were therefore proposed. In addition, the extended Kalman filter cannot guarantee convergence: the algorithm diverges if the initial state error is large or the process model is incorrect.

The UKF can handle non-linear, continuous, multivariate problems. Its sigma points can also accommodate a certain degree of non-Gaussian noise but cannot accurately represent complex non-Gaussian distributions.

The particle filter can handle non-linearity and non-Gaussian noise. However, depending on the degree of non-linearity and non-Gaussianity, the number of particles, the particle generation strategy, and the resampling strategy significantly affect accuracy; a minimal filtering cycle is sketched below. More particles also increase the computational cost and lower the calculation speed. In addition, particle filters are unstable on high-dimensional systems, where the particle weights easily become extremely uneven, resulting in a loss of particle diversity.
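
For reference, one bootstrap particle filter cycle with a resampling trigger that guards against the diversity loss mentioned above; the motion and likelihood models are user-supplied stand-ins, not models from this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, weights, motion, likelihood):
    """One bootstrap particle filter cycle over 2D positions.

    particles  : (N, 2) position hypotheses
    weights    : (N,) normalized importance weights
    motion     : callable propagating particles (e.g., a PDR step)
    likelihood : callable giving the measurement likelihood (e.g., from RSS)
    """
    # 1) propagate through the (possibly non-linear) motion model plus noise
    particles = motion(particles) + rng.normal(0.0, 0.1, particles.shape)
    # 2) reweight by the (possibly non-Gaussian) measurement likelihood
    weights = weights * likelihood(particles)
    weights = weights / weights.sum()
    # 3) resample when the effective sample size collapses, which signals
    #    the weight-degeneracy / diversity-loss problem described above
    n = len(weights)
    if 1.0 / np.sum(weights ** 2) < 0.5 * n:
        idx = rng.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights
```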

In addition to the above, many studies have proposed other well-known filters, including the ensemble Kalman filter (EnKF) (Hua et al., 2021), the adaptive Kalman filter (AKF) (Mehra, 1970), and the switched Kalman filter (SKF) (Wu et al., 2004). However, all of these methods still suffer from accuracy and divergence problems.

3. SmartFPS architecture

In this section, the general architecture of SmartFPS is presented first. Then, the modules of the network are explained separately. Finally, a multi-task learning-based training method is introduced.

3.1. System overview

SmartFPS is an end-to-end trainable neural network for wireless-inertial fusion navigation. Its inputs are the inertial signal sequence and the wireless signal sequence, and its outputs are the current pedestrian position and heading. The structure of the fusion positioning system is illustrated in Figure 1. SmartFPS consists of four modules: (1) the inertial encoder, an LSTM-based feature extractor for inertial navigation data; (2) the wireless encoder, a convolutional neural network (CNN)-based wireless positioning module; (3) the attention layer; and (4) the fusion decoder, an LSTM-based fusion positioning module. The inertial encoder outputs the latent-space tensor of step length and rotation angle, and the wireless encoder outputs extracted wireless-signal features rather than a final position. The module outputs pass through the attention layer and then into the fusion decoder, which uses the hidden state of the last timestep as its initial state and finally estimates the position and heading; a skeleton of this data flow is sketched after Figure 1.


Figure 1. Architecture of SmartFPS.
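
The following PyTorch sketch outlines the four-module data flow of Figure 1. Layer sizes, kernel sizes, and the simplified wireless encoder (the paper's pooling and fully connected layers are omitted) are our assumptions, not the exact published architecture.

```python
import torch
import torch.nn as nn

class SmartFPSSketch(nn.Module):
    """Four-module data flow of Figure 1 (sizes are illustrative)."""

    def __init__(self, n_beacons=20, hidden=128):
        super().__init__()
        # (1) inertial encoder: Bi-LSTM over accel+gyro windows
        self.inertial_enc = nn.LSTM(6, hidden, num_layers=2,
                                    bidirectional=True, batch_first=True)
        # (2) wireless encoder: CNN over the beacon-by-time RSS matrix
        self.wireless_enc = nn.Sequential(
            nn.Conv1d(n_beacons, 64, kernel_size=7), nn.SELU(),
            nn.Conv1d(64, 64, kernel_size=5), nn.SELU(),
            nn.Conv1d(64, hidden, kernel_size=3), nn.SELU())
        # (3) asymmetric attention, applied to the wireless features only
        self.attn = nn.Linear(hidden, 1)
        # (4) fusion decoder: stateful LSTM plus two FC output layers
        self.fusion_dec = nn.LSTM(3 * hidden, hidden, num_layers=2,
                                  batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.SELU(),
                                  nn.Linear(64, 4))  # (x, y, rx, ry)

    def forward(self, imu, rss, state=None):
        # imu: (B, T, 6) window; rss: (B, n_beacons, T) window
        _, (h, _) = self.inertial_enc(imu)
        ins = torch.cat([h[-2], h[-1]], dim=-1)            # last-step state
        w = self.wireless_enc(rss).transpose(1, 2)         # (B, T', hidden)
        a = torch.softmax(torch.tanh(self.attn(w)), dim=1) # attention weights
        ctx = (a * w).sum(dim=1)                           # context vector
        fused = torch.cat([ins, ctx], dim=-1).unsqueeze(1)
        out, state = self.fusion_dec(fused, state)         # carries (h, c)
        return self.head(out[:, -1]), state

# usage: pos_dir, state = SmartFPSSketch()(torch.randn(2, 100, 6),
#                                          torch.randn(2, 20, 100))
```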

Compared with the filtering fusion positioning system, the fusion positioning system based on the whole network structure has the following advantages:

1) A fusion positioning system with a complete network structure realizes end-to-end computation, which is easier to train and deploy than positioning networks fused by filtering algorithms. In the latter case, each positioning network must output noise matrices during training, and for complex systems the noise matrix is difficult to represent with a few parameters; such training and fusion schemes are therefore complicated and cumbersome to implement.

2) Compared with the Kalman filter, SmartFPS does not require all noise to be white, and there is no need to pre-determine noise coefficients: it can approximate any white or colored noise through data-driven training.

3) A fusion positioning system with a complete network structure is faster than the particle filter: SmartFPS inference generally takes milliseconds, whereas the particle filter becomes much more expensive as the number of particles grows. The particle filter also loses particle diversity over time, so parameters such as the resampling strategy need to be tuned.

3.2. Inertial encoder

The inertial encoder in the fusion positioning network is an inertial-navigation feature extraction network based on a Bi-LSTM. Its input is the 6-axis accelerometer and gyroscope data from the inertial unit of the smart device; a single sample can be expressed as (accx, accy, accz, gyrox, gyroy, gyroz). The LSTM processes a time series of such samples, here the accelerometer and gyroscope data within a time window. The window length can be selected according to the application scenario; in this study, a 1 s window is used, considering the frequency of pedestrian steps.

One primary concern for the inertial encoder is its objective, which carries its physical meaning and may influence the fusion performance. The inertial encoder derives from earlier studies of deep learning-based inertial navigation. Among the related works, such as IONet (Chen et al., 2018) and RoNIN (Herath et al., 2020), we carefully selected the objective and adopted the step length and rotation angle, as IONet did. RoNIN instead outputs the velocity vector (vx, vy) and achieves excellent accuracy. However, this causes a problem when the pedestrian walks straight at a constant speed: the velocity vector could then point in any direction, so the trained network is not decoupled from the map. We also found that an inertial network trained with this objective is difficult to fuse with the wireless encoder. The authors of RoNIN, although not discussing this point, may have tried to mitigate the issue by randomly rotating the inertial data for augmentation before training; however, this significantly increases the amount of training data and lowers the training speed, which is cumbersome for transfer learning in a new scenario. Unlike RoNIN, we adopt the step length and rotation angle, which are completely decoupled from the map.

The objective function of the inertial positioning network is as follows:

$$\min_{\theta} L(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i; \theta) = \frac{1}{n}\sum_{i=1}^{n} \left( \|\tilde{l} - l\|_2^2 + \lambda\,\|\widetilde{d\varphi} - d\varphi\|_2^2 \right)$$

where n is the batch size, θ denotes the network parameters, l is the step length, dφ is the rotation angle, and λ is the weight of the rotation-angle loss. The training target of the network thus has two parts, step length and rotation angle. Because the two parts have different physical meanings, their value ranges also differ: the maximum step length depends on the pedestrian's stride and the selected window size and typically remains within [0, 1], while the rotation angle in this study lies within (−180°, 180°]. Both ranges should therefore be considered when choosing λ; in our tests, we set λ to 2. A minimal loss implementation is sketched below.
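
A direct implementation of this objective might look as follows; the remark on angle wrap-around is our own observation, not part of the paper's formulation.

```python
import torch

def inertial_loss(pred_len, true_len, pred_dphi, true_dphi, lam=2.0):
    """Joint loss on step length and rotation angle (lambda = 2 per the paper).

    Step lengths stay roughly within [0, 1] for a 1 s window while rotation
    angles span (-180, 180] degrees, hence the weighting factor.
    """
    len_term = (pred_len - true_len) ** 2
    ang_term = (pred_dphi - true_dphi) ** 2    # per-step rotations are small,
    return (len_term + lam * ang_term).mean()  # so wrap-around is rare
```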

3.3. Wireless encoder

The wireless encoder extracts meaningful characteristics from the wireless signal sequence and maps them into the latent space. Unlike previous studies, this study uses dynamically collected wireless time-series signals as input. The input feature is a matrix composed of the signal sequences of all wireless beacons deployed in the environment within a time window: its width is the window size, and its height is the number of wireless beacons. The output 2D position represents the position at the middle of the time window. Both generalization performance and training accuracy should be weighed when choosing the window size. An excessively long window should be avoided: dynamically acquired signal data over a long window introduces time dependency, increasing the complexity of the scenario and reducing the generalization of the wireless positioning network. Conversely, because the wireless signal received by a smartphone is noisy, too short a window makes the network learn the noise instead of the signal itself, leading to overfitting. The window size should therefore be selected carefully, with the optimum found from the positioning accuracy on the validation set.

The wireless encoder is built on a CNN. Although the wireless signal is collected dynamically and a dynamic sequence usually implies temporal dependencies, this network aims to learn the signal itself within the high-noise sequence rather than the timing relationship, which can be regarded as a filtering process for high-noise signals. The wireless encoder mainly includes three convolutional layers, a max pooling layer, a global average pooling layer, and two fully connected layers.

The convolution kernel size of each layer of the wireless encoder gradually decreases, i.e., from global attention to local attention, and the stride also takes different values for tensors of different lengths and widths. The loss function of the wireless positioning network is as follows:

$$\min_{\theta} L(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i; \theta) = \frac{1}{n}\sum_{i=1}^{n} \left( (\tilde{x} - x)^2 + (\tilde{y} - y)^2 \right)$$

where x and y are the two-dimensional position coordinates. It is worth noting that this network uses the "selu" function as the activation for all CNN layers. The "selu" function is expressed as follows:

$$a = \begin{cases} \lambda z, & z > 0 \\ \lambda \alpha (e^{z} - 1), & z \le 0 \end{cases}$$

Compared with the "relu" activation, "selu" has no dead zone; compared with the sigmoid, its gradient is less prone to vanishing; and compared with the "elu" function, it requires no parameter selection. Most importantly, "selu" automatically normalizes the sample distribution toward zero mean and unit variance, thereby speeding up network convergence. It acts like a batch normalization layer without increasing the network depth. We also found that the "selu" function improves the training speed. The self-normalizing behavior can be checked numerically, as sketched below.
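
A quick numerical check of the self-normalizing property; the constants below are the standard SELU values from the self-normalizing networks literature, which the text above does not state explicitly.

```python
import numpy as np

# Standard SELU constants derived for the self-normalizing fixed point
lam, alpha = 1.0507, 1.6733

def selu(z):
    return np.where(z > 0, lam * z, lam * alpha * (np.exp(z) - 1.0))

x = np.random.randn(1_000_000)   # pre-activations with zero mean, unit var
y = selu(x)
print(round(y.mean(), 3), round(y.std(), 3))  # both remain close to 0 and 1
```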

3.4. Attention layer

The asymmetric attention layer is based mainly on the local attention mechanism in deep learning (Vaswani et al., 2017). The attention mechanism, a concept from natural language processing, is mainly used to implement contextual logic between the encoding and decoding networks. It generally includes three steps: computing alignment scores from the hidden states, computing the softmax weights, and computing the overall context vector. Local attention is defined relative to global attention: global attention attends to all units of the input sequence, while local attention attends to only part of them.

In SmartFPS, the attention mechanism is applied only to the output of the wireless encoder, because the hidden states output by the wireless encoder and the inertial encoder have different physical meanings. The output of the inertial encoder is the hidden state of the last step, so it carries no notion of context; applying attention to it would break the integrity of the hidden state and feed only part of the hidden-state sequence into the decoder. For the wireless encoder, in contrast, the beacons are at different positions, so as the pedestrian moves, the signal strength and stability of each beacon differ. Moreover, because of wall occlusion and multipath effects, the signal strength alone cannot determine whether a beacon's signal should be used. The attention mechanism therefore lets the localization network learn the local characteristics of the environment and select appropriate wireless-signal features per region, which is especially valuable when the number of beacons is large. The attention layer is computed as follows:

$$u_t = \tanh(W h_t + b), \qquad \alpha_t = \operatorname{softmax}(u_t), \qquad s = \sum_{t=1}^{M} \alpha_t h_t$$

where W is the weight of the fully connected layer in the attention layer, b is its bias, and ht is the hidden state input to the attention layer. Compared with global attention, the asymmetric attention mechanism greatly reduces the number of network parameters, which helps suppress model overfitting and improves training and inference speed. A direct implementation is sketched below.
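
A direct PyTorch implementation of the three equations above, applied to the wireless encoder's hidden sequence only:

```python
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """u_t = tanh(W h_t + b); alpha = softmax(u); s = sum_t alpha_t h_t."""

    def __init__(self, hidden):
        super().__init__()
        self.proj = nn.Linear(hidden, 1)   # W and b

    def forward(self, h):                  # h: (B, M, hidden)
        u = torch.tanh(self.proj(h))       # (B, M, 1) alignment scores
        alpha = torch.softmax(u, dim=1)    # weights over the M units
        return (alpha * h).sum(dim=1)      # context vector s: (B, hidden)
```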

3.5. Fusion decoder

The fusion decoder is a two-layer LSTM whose hidden state and cell state are initialized by the state of the previous moment. Its input consists of the inertial navigation features and the attention-selected wireless-signal features, which can be regarded as the observations at the current moment. The recurrent network here plays a role similar to filtering, but unlike filtering algorithms, the noise and the observation are not separated, and the state and noise features of the previous moment also initialize the network as its hidden state. State quantities, observations, system processes, and noise are thus coupled through the network, eliminating the tedious process of determining system noise values. This matters especially for fingerprint-based wireless positioning, where the noise at different locations is difficult to quantify. The hidden state output by the fusion decoder is further passed through two fully connected layers, the last of which outputs the pedestrian's two-dimensional position vector (x, y) and direction vector (rx, ry). The target is the squared error of the two-dimensional coordinate vector and the direction vector, with a scale factor κ.

The final objective function of the fusion decoder network is as follows:

$$\min_{\theta} L(\theta) = \sum L(x, y; \theta) = \sum \left( \|\tilde{d} - d\|_2^2 + \kappa\,\|\tilde{r} - r\|_2^2 \right)$$

where θ is the network parameter, d is the two-dimensional position vector (x, y), and r is the direction vector (rx, ry). A stateful decoding sketch is given below.
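
A stateful decoding loop illustrating how the previous window's (h, c) initializes the next step; the 384-dimensional input and the dummy feature stream are assumptions consistent with the module sketch after Figure 1.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 384 = inertial features (256) + attended wireless
# features (128); the stream below stands in for real encoder outputs.
decoder = nn.LSTM(input_size=384, hidden_size=128, num_layers=2,
                  batch_first=True)
head = nn.Sequential(nn.Linear(128, 64), nn.SELU(), nn.Linear(64, 4))

feature_stream = [torch.randn(1, 1, 384) for _ in range(10)]  # dummy windows
state = None                        # (h, c) carried from the previous moment
for fused in feature_stream:
    out, state = decoder(fused, state)  # state couples past estimate/noise
    x_y_rx_ry = head(out[:, -1])        # position (x, y), direction (rx, ry)
```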

3.6. Training method based on multi-task learning

Since SmartFPS is a large network consisting of several modules, the training procedure must be treated carefully. Because the encoders have different objectives, training them together without constraints may corrupt their physical meanings and cause the whole network to collapse. Training them individually and freezing their weights while training the fusion decoder works but is cumbersome. We therefore propose a multi-task learning strategy. Multi-task learning (Caruana, 1997) is a branch of deep learning that aims to improve learning efficiency by exploiting the similarity between different tasks and solving them simultaneously. Here, the loss of the fusion decoder and the losses of the encoders are trained at the same time. The objective of our multi-task learning network is:

$$\min_{\theta} L(\theta) = \min_{\theta} \left( L_{\mathrm{fusion}}(\theta) + \lambda_1 L_{\mathrm{ins}}(\theta) + \lambda_2 L_{\mathrm{wireless}}(\theta) \right)$$

Although multi-task learning is generally used to improve learning efficiency, here it protects the integrity of the intermediate-layer output features: the fixed output positions retain the physical meaning of each sub-network's output. A single training step over the joint objective is sketched below.
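
One optimizer step on this joint objective might look as follows; the λ values are tuning assumptions, not values from the paper.

```python
def multitask_step(optimizer, fusion_loss, ins_loss, wireless_loss,
                   lam1=0.5, lam2=0.5):
    """Single step on the joint multi-task objective. The auxiliary encoder
    losses keep each encoder's output anchored to its physical meaning
    while the fusion loss trains the decoder."""
    loss = fusion_loss + lam1 * ins_loss + lam2 * wireless_loss
    optimizer.zero_grad()
    loss.backward()   # gradients flow into both encoders and the decoder
    optimizer.step()
    return loss.item()
```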

4. SmartFPS transfer learning

Since environmental factors and device parameters in practical scenarios often differ from those of the training data, trained models often perform poorly when deployed. This section proposes a transfer learning method to solve this problem based on the generative adversarial mechanism. First, the attack factors that may cause trained models to fail are discussed; then, the proposed transfer learning method is explained in detail.

4.1. Attack factors

To analyze the influence of different pedestrians on the characteristics of the inertial data, two testers carried the same device and walked back and forth along the same straight-line path, holding the smartphone in a pose similar to that of the training set. Figures 2A–D show the time series of the accelerometer and gyroscope data collected by the two testers (tester 1, 1.76 m tall; tester 2, 1.63 m tall). From Figures 2A, B, the peak-to-peak value of tester 1's z-axis acceleration is larger than that of tester 2; that is, the phone swings up and down with greater force. In addition, frequent sub-peaks appear near the peaks in tester 2's data, which is caused by that tester's walking habits. Figures 2C, D show that the peak-to-peak value of tester 1's gyroscope data is smaller than that of tester 2. Considering these attack factors, most pedestrian inertial navigation research must add an analysis of pedestrian motion characteristics to the system. However, the diversity and variation of pedestrian movements are difficult to model accurately, so pedestrian positions cannot be tracked accurately by relying solely on a pedestrian inertial navigation system.


Figure 2. Signal sequences of the inertial sensors. (A) Accelerometer z-axis for tester 1 (1.76 m). (B) Accelerometer z-axis for tester 2 (1.63 m). (C) Gyroscope z-axis for tester 1 (1.76 m). (D) Gyroscope z-axis for tester 2 (1.63 m).

In addition to pedestrian motion, different devices also produce different data characteristics. Since calibration can address problems such as bias and drift, we consider only the noise characteristics of the inertial hardware itself across devices. To verify the difference in inertial noise characteristics between devices, two smartphones (Xiaomi Mi6 and ZTE A2017) were used to collect inertial data in a horizontal, stationary state, and the Allan variance was used to analyze the noise components (a minimal implementation is sketched after Figure 3). The Allan variance curves of the accelerometer and gyroscope sequences of the two devices at rest are shown in Figures 3A–D. From Figures 3A, B, the dominant noise component of the Mi6 and A2017 accelerometers below 1 s is Gaussian white noise, but with quite different coefficients; for the A2017, small colored-noise components appear near 1 s. From Figures 3C, D, white noise is detected in both gyroscope sequences, again with different coefficients. The noise characteristics of inertial sensors therefore differ considerably between devices.


Figure 3. (A) Allan variance of the Mi6 accelerometer along the x-axis. (B) Allan variance of the ZTE A2017 accelerometer along the x-axis. (C) Allan variance of the Mi6 gyroscope along the x-axis. (D) Allan variance of the ZTE A2017 gyroscope along the x-axis.
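
A minimal (non-overlapping) Allan deviation computation, sufficient to produce curves like those in Figure 3 from a static log; function and variable names are ours.

```python
import numpy as np

def allan_deviation(x, fs, taus):
    """Allan deviation of a static IMU sequence x sampled at fs Hz.

    On a log-log plot, a slope of -1/2 indicates white noise; flatter or
    positive slopes reveal bias instability and colored noise components.
    """
    devs = []
    for tau in taus:
        m = int(tau * fs)                 # samples per averaging cluster
        n = len(x) // m
        means = x[:n * m].reshape(n, m).mean(axis=1)  # cluster averages
        avar = 0.5 * np.mean(np.diff(means) ** 2)     # Allan variance
        devs.append(np.sqrt(avar))
    return np.array(devs)

# Example: 200 Hz static gyro log, tau from 10 ms to 100 s
# sigma = allan_deviation(gyro_z, fs=200, taus=np.logspace(-2, 2, 40))
```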

There are also differences between the wireless signal receivers of different smartphones. To verify this, the two phones were placed horizontally on a table at the same distance from a Bluetooth beacon to collect RSS data. Figures 4A–D show the RSS data at distances of 0.6 and 1.5 m from the beacon. Figures 4A, B show the Bluetooth RSS data of the two devices at 0.6 m: Figure 4A is the signal time series, in which the middle dotted line marks the mean and the upper and lower dotted lines mark the mean plus or minus one standard deviation, and Figure 4B shows the histograms of the signal strength values. The sequential noise of the two devices differs considerably, with high noise induced in the A2017 signal. In addition, although the two devices are at the same distance from the beacon, their RSS values differ: the mean is −72 dBm for the Mi6 and −68 dBm for the A2017. In the histogram, the Mi6 shows two obvious peaks with clearly discrete amplitudes, while the A2017 noise is of higher frequency but smaller amplitude variation; many outliers appear in the A2017 time series. Figures 4C, D show the RSS at 1.5 m. Figure 4C shows that the A2017 signal still exhibits high-frequency noise, and the histogram in Figure 4D shows that the two peaks of the Mi6's Bluetooth signal are farther apart, while the A2017 also exhibits two peaks.


Figure 4. (A) Received signal strength sequence of a Bluetooth beacon at 0.6 m. (B) Received signal strength distribution of a Bluetooth beacon at 0.6 m. (C) Received signal strength sequence of a Bluetooth beacon at 1.5 m. (D) Received signal strength distribution of a Bluetooth beacon at 1.5 m.

In general, there are irregular differences in the amplitude and noise characteristics of both the inertial and the wireless signal data, so the generalization performance of SmartFPS under such differences must be enhanced.

4.2. Transfer learning based on GAN

The transfer learning algorithm of SmartFPS targets the first three CNN layers of the wireless positioning network and the first two LSTM layers of the inertial positioning network. Since the transfer learning methods for the two networks are similar, only that of the wireless positioning network is explained here.

The wireless encoder maps the features of the wireless signal sequence, so the transfer learning algorithm trains those layers and substitutes the trained weights into the fusion positioning network. The transfer learning structure of the wireless positioning network is based on the GAN, a strategy known for its success in image processing. A GAN consists of a generator and a discriminator: the generator produces fake wireless sequences resembling the target domain from the wireless positioning features of the source domain, and the discriminator judges whether the generated sequences resemble the target domain. The generator thus improves with each training epoch until, eventually, the discriminator can no longer make the right judgment.

We also add a cycle-consistency loss to the network, so the generator network includes a source-domain and a target-domain generator, which share one unique encoder. Following CycleGAN, the source-domain generator must restore the fake target-domain sequences produced by the target-domain generator (these fake target-domain signals are derived from source-domain data) back to source-domain data. Similarly, the target-domain generator must restore the fake source-domain sequences produced by the source-domain generator (these fake source-domain signals are derived from target-domain data) back to target-domain data. This mechanism makes the GAN less sensitive to the generator network when the source and target data labels are poorly matched. In addition, an identity loss is often introduced alongside the cycle-consistency loss: data from a domain should remain unchanged after passing through that domain's own generator.

We also introduce a reconstruction network, a CNN whose role is to restore the original signal from the encoder's output features. Unlike the cycle-consistency mechanism, the reconstruction network is unique (shared across domains). Although the recovered signal will not be identical to the original, it preserves the signal's integrity as much as possible from the characteristics common to the source and target domains; this also protects the encoder from being unduly influenced by the generative network. Since a GAN needs to train the generator and the discriminator separately, we likewise divide the transfer learning training into two steps: (1) training the generator targets and (2) training the discriminator targets.

The forward propagation routes of the two targets are illustrated in Figure 5. The generator and the encoder are trained with the adversarial (GAN) loss, the cycle-consistency loss, the identity loss, the wireless positioning loss, and the reconstruction loss, combined into a single target with different weight factors. The combined loss function is (taking the source domain as an example):

$$L_{c,S} = L_{\mathrm{gan}} + \lambda_1 L_{\mathrm{cycle}} + \lambda_2 L_{\mathrm{identity}} + \lambda_3 L_{\mathrm{pred}} + \lambda_4 L_{\mathrm{recon}}$$

Figure 5. Transfer learning structure of wireless positioning encoder in SmartFPS.

The training procedure of the wireless positioning encoder transfer learning is similar to training a GAN: it is important to train only the generator or only the discriminator while freezing the weights of the other network, although the other objectives can be trained together. The procedure is listed in Table 1, and a minimal alternating schedule is sketched after it. The transfer learning structure of the inertial encoder is similar, with the CNN replaced by an LSTM. After the wireless positioning encoder and the inertial encoder are trained separately, the weights of each encoder are substituted into the fusion network to accomplish the transfer learning target.


Table 1. Algorithm 1: Training procedure of wireless positioning encoder transfer learning.
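
A heavily simplified sketch of the alternating schedule in Algorithm 1, using stand-in linear modules; in SmartFPS, `G` would be the shared encoder plus domain generators and `D` the domain discriminator, and the real generator objective adds the cycle, identity, prediction, and reconstruction terms from the combined loss above.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 16))   # stand-in generator (with encoder)
D = nn.Sequential(nn.Linear(16, 1))    # stand-in domain discriminator
g_opt = torch.optim.Adam(G.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

src = torch.randn(32, 16)              # source-domain batch (labeled)
tgt = torch.randn(32, 16)              # target-domain batch (unlabeled)

for step in range(100):
    # 1) generator step: fool D with fake target-domain sequences
    #    (cycle/identity/pred/recon terms would be added to g_loss here)
    fake_tgt = G(src)
    g_loss = bce(D(fake_tgt), torch.ones(32, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # 2) discriminator step: separate real target from generated,
    #    with the generator frozen via detach()
    d_loss = (bce(D(tgt), torch.ones(32, 1)) +
              bce(D(G(src).detach()), torch.zeros(32, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
```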

5. Results

5.1. Experiment settings

The experiments were conducted in the corridor area plus a whole laboratory area on one floor of the Chuangzhi Building, Nanjing, China, as shown in Figure 6A. The test area is about 20 m long and 30 m wide. A total of 20 Bluetooth beacons were deployed, with 5–8 m between adjacent beacons; beacons No. 17–20 were deployed inside the laboratory. The site includes a stair hall and an elevator hall that can block Bluetooth signals, so smartphones generally receive 3 or 4 line-of-sight Bluetooth signals during positioning. The beacons were the E5 model from Yunli Physics. We changed the broadcast period from the default 500 ms to 100 ms to increase the training data density, and reduced the broadcast power from the default 0 dBm (coverage radius 50 m) to −8 dBm (coverage radius 22 m). The beacons were attached to the walls 1.5 m above the ground.


Figure 6. (A) Experiment testbed. (B) Lidar SLAM on an iPhone. (C) Device setup on the body.

The training and test data were collected with the Android phone facing the user, assuming that the user keeps an eye on the screen during positioning.
