Applied Sciences, Vol. 12, Pages 12244: Multi-Vehicle Tracking Based on Monocular Camera in Driver View

3.2. JDE Module

There are three prediction heads at the end of our designed one-stage detection architecture. The detection part combines a CSPDarkNet-53 backbone with a modified bi-directional feature pyramid network (BiFPN) as the neck layer [13].

CSPDarkNet-53 is an upgraded version of DarkNet-53 [9], a feature extractor used for object detection. Its core unit is the CSPResNet, which divides a basic feature map into two parts using the CSPNet strategy and then merges these parts through a cross-stage hierarchical structure. In this way, the convolutional neural network used as the backbone in YOLOv4 [14] is transferred to our JDE without any modification.

The BiFPN is a multi-scale feature fusion block that balances efficiency and accuracy. It fuses the three feature maps output by the backbone block and enables features to flow in both the top-down and bottom-up directions. Unlike traditional methods, BiFPN learns the importance of input feature maps with different resolutions by adding an additional weight. These fused features are then fed into the corresponding prediction head, which consists of several convolution layers. As shown in Figure 1, the proposed architecture stacks two BiFPN blocks. Moreover, the framework can be made deeper and wider by increasing the number of BiFPN blocks as well as their input features.
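The weighted fusion can be illustrated with a short sketch. Below is a minimal PyTorch example of the fast normalized fusion commonly used in BiFPN, assuming the input feature maps have already been resized to a common resolution; the class name WeightedFusion and the epsilon value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion: each input feature map gets a learnable
    non-negative weight, normalized so the weights sum to (almost) one."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        # features: list of tensors with identical shapes (N, C, H, W)
        w = torch.relu(self.weights)      # keep learned weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so they sum to ~1
        return sum(wi * fi for wi, fi in zip(w, features))
```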

Each prediction head is composed of a series of convolution layers and outputs a three-dimensional map of size (4A + 2A + E) × H × W, where A is the number of anchors assigned to its corresponding feature scale and E is the dimension of the appearance embedding. The map consists of the box coordinate-offset regression of size 4A × H × W, the box classification of size 2A × H × W, and the appearance embedding of size E × H × W.
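As a hedged illustration of this output layout, the sketch below builds one such head in PyTorch; the hidden width of 256 channels, the anchor count A = 4, and the embedding dimension E = 512 are assumptions for the example, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

A, E = 4, 512                     # assumed anchors per scale, embedding dim
head = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 4 * A + 2 * A + E, kernel_size=1),
)

x = torch.randn(1, 256, 19, 34)   # a fused feature map from the neck
out = head(x)                     # shape: (1, 4A + 2A + E, 19, 34)
box_offsets = out[:, : 4 * A]               # 4A x H x W regression map
cls_scores  = out[:, 4 * A : 6 * A]         # 2A x H x W classification map
embeddings  = out[:, 6 * A :]               # E x H x W appearance embedding
```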

3.3. Loss Function

There are three subtasks in each prediction head of our multi-vehicle tracker, so the total loss function L consists of the classification, regression, and embedding losses of all prediction heads:

L = \sum_{i=1}^{N} \left( w_{\mathrm{cls}}^{i} L_{\mathrm{cls}}^{i} + w_{\mathrm{reg}}^{i} L_{\mathrm{reg}}^{i} + w_{\mathrm{emb}}^{i} L_{\mathrm{emb}}^{i} \right)

(1)

where $N = 3$ is the number of prediction heads; $w_{\mathrm{cls}}^{i}$ and $L_{\mathrm{cls}}^{i}$ are the weight and loss function for the foreground/background classification task; $w_{\mathrm{reg}}^{i}$ and $L_{\mathrm{reg}}^{i}$ are the weight and loss function for the bounding box regression task; $w_{\mathrm{emb}}^{i}$ and $L_{\mathrm{emb}}^{i}$ are the weight and loss function for the appearance embedding task; and $i = 1, 2, 3$. All weights $w_{*}^{i}$ are carefully tuned for optimal performance.

In addition, $L_{\mathrm{cls}}$ is formulated as a cross-entropy loss:

L_{\mathrm{cls}} = -\left( y \log(p) + (1 - y) \log(1 - p) \right)

(2)

where $y$ is a binary indicator (0 or 1) for foreground/background classification and $p$ is the predicted probability of the positive class.

For the regression loss, we use the smooth L1 loss:

L_{\mathrm{reg}}(x) =
\begin{cases}
0.5 x^2, & |x| < 1 \\
|x| - 0.5, & \text{otherwise}
\end{cases}

(3)
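A minimal NumPy sketch of Equation (3), applied element-wise:

```python
import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    """Smooth L1 loss from Equation (3):
    0.5 * x^2 when |x| < 1, and |x| - 0.5 otherwise."""
    absx = np.abs(x)
    return np.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

# e.g. smooth_l1(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]))
# -> [1.5, 0.125, 0.0, 0.125, 1.5]
```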

Moreover, $L_{\mathrm{emb}}$ is formulated in the same way as in [7]:

L_{\mathrm{emb}} = -\log \frac{\exp(f^{T} g^{+})}{\exp(f^{T} g^{+}) + \sum_{i} \exp(f^{T} g_{i}^{-})}

(4)

where $f$ is the embedding of a selected anchor instance in a mini-batch, $g^{+}$ is the weight of the positive sample with respect to $f$, and $g_{i}^{-}$ are the weights of the negative samples.

For optimization, let $m$ and $v$ denote the 1st moment vector and 2nd moment vector, respectively, let $\alpha$ be the learning rate of the model, and let $\beta_1$, $\beta_2$, and $\varepsilon$ be hyper-parameters. The parameters $m$, $v$, and $\theta$ are updated during training as follows:

m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) \frac{\partial \mathrm{Loss}}{\partial \theta_{t-1}}, \quad
v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) \left( \frac{\partial \mathrm{Loss}}{\partial \theta_{t-1}} \right)^{2}, \quad
\hat{m}_t \leftarrow \frac{m_t}{1 - \beta_1^{t}}, \quad
\hat{v}_t \leftarrow \frac{v_t}{1 - \beta_2^{t}}, \quad
\theta_t \leftarrow \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}

(5)
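Equation (5) is the standard Adam update rule; a minimal NumPy sketch of one step is shown below. The default hyper-parameter values are the commonly used ones, not values reported in this work.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam parameter update following Equation (5)."""
    m = beta1 * m + (1 - beta1) * grad              # 1st moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2         # 2nd moment estimate
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```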

3.5. Motion Model

In multi-vehicle tracking, we use a constant-velocity motion model with a Kalman filter [15], assuming the tracking system is a linear Gaussian process. In the following, we derive the constant-velocity motion model for objects represented by the detected bounding-box center coordinates together with the aspect ratio and height. Suppose each target follows a linear Gaussian model:

f_{k|k-1}(x \mid \epsilon) = \mathcal{N}(x; F_{k-1}\epsilon, Q_{k-1})

(6)

where $f_{k|k-1}(\cdot \mid \epsilon)$ is the state transition probability of a single object at time $k$ given the previous state $\epsilon$, and $g_k(z \mid x)$ is the single-object likelihood function, which defines the probability of observing $z$ given the state $x$. $\mathcal{N}(\cdot; m, P)$ is a Gaussian density with mean $m$ and covariance $P$. $F_{k-1}$ is the state transition matrix and $Q_{k-1}$ denotes the covariance matrix of the process noise. $H_k$ is the measurement matrix, and $R_k$ denotes the covariance matrix of the measurement noise, which can be estimated from the detections and ground truth of the training datasets.

Our objective now is to obtain the formulations of $F_k$, $Q_k$, $H_k$, and $R_k$. Suppose that, at time $k$, the center of the detection box to be estimated is denoted by $(x_{b,k}, y_{b,k})$, and that the aspect ratio and height of the bounding box detected in image coordinates are represented by $a_{b,k}$ and $h_{b,k}$, respectively. The velocities of the box center, the aspect ratio, and the height are denoted by $(\dot{x}_{b,k}, \dot{y}_{b,k})$, $\dot{a}_{b,k}$, and $\dot{h}_{b,k}$, respectively. The state space model can then be expressed in vector-matrix form as follows:

\underbrace{\begin{bmatrix} x_{b,k} \\ y_{b,k} \\ \dot{x}_{b,k} \\ \dot{y}_{b,k} \\ a_{b,k} \\ h_{b,k} \\ \dot{a}_{b,k} \\ \dot{h}_{b,k} \end{bmatrix}}_{X_k}
=
\underbrace{\begin{bmatrix}
1 & 0 & \delta t & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & \delta t & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & \delta t & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & \delta t \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}}_{F_{k-1}}
\underbrace{\begin{bmatrix} x_{b,k-1} \\ y_{b,k-1} \\ \dot{x}_{b,k-1} \\ \dot{y}_{b,k-1} \\ a_{b,k-1} \\ h_{b,k-1} \\ \dot{a}_{b,k-1} \\ \dot{h}_{b,k-1} \end{bmatrix}}_{X_{k-1}}
+
\underbrace{\begin{bmatrix} \frac{\delta t^2}{2} w_{x,k-1} \\ \frac{\delta t^2}{2} w_{y,k-1} \\ \delta t \, w_{x,k-1} \\ \delta t \, w_{y,k-1} \\ \frac{\delta t^2}{2} w_{a,k-1} \\ \frac{\delta t^2}{2} w_{h,k-1} \\ \delta t \, w_{a,k-1} \\ \delta t \, w_{h,k-1} \end{bmatrix}}_{W_{k-1}}

(8)

where $w_{*,k-1}$ denotes a piecewise-constant white acceleration that can be described by a zero-mean Gaussian white noise, $w_{*,k-1} \sim \mathcal{N}(0, \sigma_{*,k-1}^2)$. The variance $\sigma_{*,k-1}^2$ determines the relaxation level of the constant-velocity assumption, and $\delta t$ is the time between frames. Equation (8) can also be represented as:

X_k = F_{k-1} X_{k-1} + W_{k-1}

(9)

where the value of the state transition matrix $F_{k-1}$ is given in Equation (8) and $W_{k-1} \sim \mathcal{N}(0, Q_{k-1})$. Therefore, $Q_{k-1}$ can be obtained by computing the covariance of $W_{k-1}$ as:

Q_{k-1} = \mathrm{Cov}(W_{k-1}) = E\left[ W_{k-1} W_{k-1}^{T} \right]

(10)

where $E[W_{k-1} W_{k-1}^{T}]$ denotes the expectation of the outer product of $W_{k-1}$ with itself. $Q_{k-1}$ can now be obtained as:

Q_{k-1} =
\begin{bmatrix}
\frac{\delta t^4}{4}\sigma_{w_x}^2 & 0 & \frac{\delta t^3}{2}\sigma_{w_x}^2 & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{\delta t^4}{4}\sigma_{w_y}^2 & 0 & \frac{\delta t^3}{2}\sigma_{w_y}^2 & 0 & 0 & 0 & 0 \\
\frac{\delta t^3}{2}\sigma_{w_x}^2 & 0 & \delta t^2 \sigma_{w_x}^2 & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{\delta t^3}{2}\sigma_{w_y}^2 & 0 & \delta t^2 \sigma_{w_y}^2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{\delta t^4}{4}\sigma_{w_a}^2 & 0 & \frac{\delta t^3}{2}\sigma_{w_a}^2 & 0 \\
0 & 0 & 0 & 0 & 0 & \frac{\delta t^4}{4}\sigma_{w_h}^2 & 0 & \frac{\delta t^3}{2}\sigma_{w_h}^2 \\
0 & 0 & 0 & 0 & \frac{\delta t^3}{2}\sigma_{w_a}^2 & 0 & \delta t^2 \sigma_{w_a}^2 & 0 \\
0 & 0 & 0 & 0 & 0 & \frac{\delta t^3}{2}\sigma_{w_h}^2 & 0 & \delta t^2 \sigma_{w_h}^2
\end{bmatrix}

(11)

There are two key observations in the derivation of $Q_{k-1}$:

1. $E[w_{x,k-1} w_{x,k-1}] = \sigma_{w_x}^2$. Likewise, $E[w_{y,k-1} w_{y,k-1}] = \sigma_{w_y}^2$, $E[w_{a,k-1} w_{a,k-1}] = \sigma_{w_a}^2$, and $E[w_{h,k-1} w_{h,k-1}] = \sigma_{w_h}^2$, where $\sigma_{w_x}^2$ is the variance ($\sigma_{w_x} = \sqrt{\sigma_{w_x}^2}$ is the standard deviation).

2. $E[w_{x,k-1} w_{y,k-1}] = 0$, since there is no correlation between the x-axis and y-axis noises. Similarly, $E[w_{x,k-1} w_{a,k-1}] = 0$, $E[w_{a,k-1} w_{h,k-1}] = 0$, and so on.

This completes the derivation of $F_{k-1}$ and $Q_{k-1}$. We set different values for the variances ($\sigma_{w_x}^2$, $\sigma_{w_y}^2$, $\sigma_{w_a}^2$, and $\sigma_{w_h}^2$) after tuning them individually in our experiments.
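As a sketch, the following NumPy function assembles $F_{k-1}$ from Equation (8) and $Q_{k-1}$ from Equation (11); the function name and argument order are ours, and the variance arguments stand in for the individually tuned values.

```python
import numpy as np

def constant_velocity_fq(dt, sx2, sy2, sa2, sh2):
    """Build F (Equation (8)) and Q (Equation (11)) for the
    state [x, y, x', y', a, h, a', h']."""
    F = np.eye(8)
    for pos, vel in [(0, 2), (1, 3), (4, 6), (5, 7)]:
        F[pos, vel] = dt                       # position += velocity * dt
    Q = np.zeros((8, 8))
    for pos, vel, s2 in [(0, 2, sx2), (1, 3, sy2), (4, 6, sa2), (5, 7, sh2)]:
        Q[pos, pos] = dt ** 4 / 4 * s2         # position-position term
        Q[pos, vel] = Q[vel, pos] = dt ** 3 / 2 * s2  # cross term
        Q[vel, vel] = dt ** 2 * s2             # velocity-velocity term
    return F, Q
```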

For efficient computation, the observations at time $k$ can be represented by the following state space model in vector-matrix form:

\underbrace{\begin{bmatrix} z_{x,k} \\ z_{y,k} \\ z_{a,k} \\ z_{h,k} \end{bmatrix}}_{Z_k}
=
\underbrace{\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}}_{H_k}
\underbrace{\begin{bmatrix} x_{b,k} \\ y_{b,k} \\ \dot{x}_{b,k} \\ \dot{y}_{b,k} \\ a_{b,k} \\ h_{b,k} \\ \dot{a}_{b,k} \\ \dot{h}_{b,k} \end{bmatrix}}_{X_k}
+
\underbrace{\begin{bmatrix} v_{x,k} \\ v_{y,k} \\ v_{a,k} \\ v_{h,k} \end{bmatrix}}_{V_k}

(12)

where $(z_{x,k}, z_{y,k})$ is the center point of a detection box at time $k$, and $z_{a,k}$ and $z_{h,k}$ are the aspect ratio and height of the detection box in image coordinates at time $k$. $v_{x,k}$, $v_{y,k}$, $v_{a,k}$, and $v_{h,k}$ are the observation noises corresponding to $z_{x,k}$, $z_{y,k}$, $z_{a,k}$, and $z_{h,k}$, respectively, which are zero-mean Gaussian white noises; for instance, $v_{x,k} \sim \mathcal{N}(0, \sigma_{v_x,k}^2)$.

The measurement matrix $H_k$ projects the state space onto the observation space, and Equation (12) can also be represented as:

Z_k = H_k X_k + V_k

(13)

where the value of $H_k$ is given in Equation (12) and $V_k \sim \mathcal{N}(0, R_k)$. Thus, $R_k$ can be obtained by computing the covariance of $V_k$ as:

R_k = \mathrm{Cov}(V_k) = E\left[ V_k V_k^{T} \right]

(14)

Here, the covariance matrix of the measurement noise $R_k$ is expressed as:

R_k =
\begin{bmatrix}
\sigma_{v_x}^2 & 0 & 0 & 0 \\
0 & \sigma_{v_y}^2 & 0 & 0 \\
0 & 0 & \sigma_{v_a}^2 & 0 \\
0 & 0 & 0 & \sigma_{v_h}^2
\end{bmatrix}

(15)

where $E[v_{x,k} v_{a,k}] = 0$, $E[v_{x,k} v_{y,k}] = 0$, $E[v_{a,k} v_{h,k}] = 0$, and so on, since the noise components are mutually uncorrelated. This completes the derivation of $H_k$ and $R_k$. We set different values for the variances ($\sigma_{v_x}^2$, $\sigma_{v_y}^2$, $\sigma_{v_a}^2$, and $\sigma_{v_h}^2$) after tuning them individually in our experiments.
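A corresponding sketch for the measurement side builds $H_k$ from Equation (12) and the diagonal $R_k$ from Equation (15); again, the function name is ours and the variance arguments are placeholders for the tuned values.

```python
import numpy as np

def measurement_hr(svx2, svy2, sva2, svh2):
    """Build H (Equation (12)), which selects [x, y, a, h] from the
    8-dimensional state, and the diagonal R (Equation (15))."""
    H = np.zeros((4, 8))
    H[0, 0] = H[1, 1] = H[2, 4] = H[3, 5] = 1.0
    R = np.diag([svx2, svy2, sva2, svh2])
    return H, R
```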

Then, every track is assigned its own filter and follows the prediction and update process to obtain an estimated state. To some extent, the speed of the tracking phase therefore depends on the number of objects being tracked.

1. Kalman Filter Prediction

The predicted state $\hat{X}_{k|k-1}$ is

\hat{X}_{k|k-1} = F_k \hat{X}_{k-1|k-1}.

(16)

The predicted estimate covariance $P_{k|k-1}$ is

P_{k|k-1} = F_k P_{k-1|k-1} F_k^{T} + Q_k.

(17)

2. Kalman Filter Update

The measurement residual $\tilde{y}_k$ is

\tilde{y}_k = Z_k - H_k \hat{X}_{k|k-1}.

(18)

The residual covariance $S_k$ is

S_k = H_k P_{k|k-1} H_k^{T} + R_k.

(19)

The Kalman gain $K_k$ is

K_k = P_{k|k-1} H_k^{T} S_k^{-1}.

(20)

The updated state estimate $\hat{X}_{k|k}$ is

\hat{X}_{k|k} = \hat{X}_{k|k-1} + K_k \tilde{y}_k.

(21)

The updated estimate covariance $P_{k|k}$ is

P_{k|k} = (I - K_k H_k) P_{k|k-1}.

(22)
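Putting Equations (16)-(22) together, a minimal NumPy implementation of one predict/update cycle might look as follows; this is a sketch under the linear Gaussian assumptions above, not the authors' exact code.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Prediction step, Equations (16)-(17)."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Update step, Equations (18)-(22)."""
    y = z - H @ x_pred                        # measurement residual
    S = H @ P_pred @ H.T + R                  # residual covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ y                    # updated state estimate
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred  # updated covariance
    return x_new, P_new
```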
