Applied Sciences, Vol. 13, Pages 377: Multi-Granularity Dilated Transformer for Lung Nodule Classification via Local Focus Scheme

Inspired by [17,35], we combine their merits to construct a Deformable Dilated Transformer as the backbone. The key characteristic of the regular transformer is that it can attend to all locations in an image for long-range spatial modeling. To mitigate the problems of limited spatial resolution and slow convergence, as shown in Figure 3, we leverage the deformable attention of [17] to learn a small set of vital sampling points around a reference point, regardless of the resolution of the feature maps. Given the original input $x \in \mathbb{R}^{C \times H \times W}$, the deformable attention is calculated as Equation (6):

$$\mathrm{DefAtt}(z, p_q, x) = \sum_{n=1}^{N} W_n \left[ \sum_{k=1}^{K} A_{nqk} \cdot x(p_q + \Delta p_{nqk}) \, W'_n \right]$$

(6)

where $W_n$ and $W'_n$ are learnable weight matrices, $q$ is the index of a query element with content feature $z$ and 2D reference point $p_q$, and $n$ and $k$ index the attention head and the sampled keys, respectively. $K$ is the total number of sampled keys, which is much smaller than $H \times W$. $\Delta p_{nqk}$ denotes the sampling offset and $A_{nqk}$ is the attention weight of the $k$th sampling point in the $n$th attention head; both are obtained by linear transformations of the content feature $z$. The offsets $\Delta p_{nqk}$ are 2D real numbers with unconstrained range, and the attention weights satisfy $\sum_{k=1}^{K} A_{nqk} = 1$. We implement the above principle as in [17], with the aim of reducing computational cost and avoiding the risk of overfitting. Moreover, we replace the feedforward layer (i.e., a regular 2D convolution) with two dilated convolutional layers to further enlarge the spatial receptive field. Let $m$ be the number of stacked $k \times k$ dilated convolutions, where $k$ is the filter size. Denoting the dilation rate as $r$, the kernel size $k'$ after dilation is calculated as Equation (7):

$$k' = k + (k - 1)(r - 1)$$

(7)

Let $R_m$ be the effective receptive field of layer $m$, which is defined as Equation (8):

$$R_m = R_{m-1} + (k' - 1) \times \prod_{i=1}^{m} s_i$$

(8)

where $R_{m-1}$ is the receptive field of layer $m-1$, and $s_i$ is the stride of layer $i$. From the above equations, we can see that dilated layers increase the receptive field without introducing additional parameters. By combining deformable attention with dilated convolutional layers (Figure 3), we obtain a generalized high-level global-local representation.
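The receptive-field arithmetic above is easy to verify with a short script. The sketch below uses the standard dilated-kernel size $k' = k + (k-1)(r-1)$ and accumulates the receptive field layer by layer, with the stride product as written in Equation (8) (for the stride-1 dilated layers used here, the product is 1 and the two conventions for its upper index coincide):

```python
def dilated_kernel(k, r):
    """Effective kernel size of a k x k convolution with dilation rate r."""
    return k + (k - 1) * (r - 1)

def receptive_field(layers):
    """Receptive field after each layer, starting from R_0 = 1.

    layers: list of (kernel_size, dilation, stride) tuples.
    Implements R_m = R_{m-1} + (k' - 1) * prod(s_i) from Equation (8).
    """
    R, stride_prod, fields = 1, 1, []
    for k, r, s in layers:
        stride_prod *= s
        R += (dilated_kernel(k, r) - 1) * stride_prod
        fields.append(R)
    return fields

# Two stacked 3x3 convolutions with stride 1: dilation 1, then dilation 2.
print(receptive_field([(3, 1, 1), (3, 2, 1)]))  # [3, 7]
```

With the same two layers and dilation 1 throughout, the receptive field would only reach 5; raising the second layer's dilation rate to 2 grows it to 7 at no parameter cost, which is exactly the point made above.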
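As a concrete illustration of Equation (6), here is a minimal NumPy sketch of deformable attention for a single query. It is a toy under stated assumptions, not the paper's implementation: the weight matrices ($W_n$, $W'_n$ and the offset/attention projections) are random placeholders, and the feature map is sampled with nearest-neighbour lookup rather than the bilinear interpolation used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W_sp = 8, 16, 16      # channels and spatial size of the feature map
N, K, D = 4, 4, 8           # heads, sampled keys per head, per-head dim

x = rng.standard_normal((C, H, W_sp))   # input feature map
z = rng.standard_normal(C)              # content feature of the query
p_q = np.array([7.3, 5.8])              # 2D reference point (row, col)

# Hypothetical random projection weights (learned in the real model).
W_out = rng.standard_normal((N, C, D)) / np.sqrt(D)      # W_n
W_val = rng.standard_normal((N, D, C)) / np.sqrt(C)      # W'_n
W_off = rng.standard_normal((N * K * 2, C)) / np.sqrt(C) # offset head
W_att = rng.standard_normal((N * K, C)) / np.sqrt(C)     # attention head

# Offsets and attention weights are linear functions of z; the offsets are
# unconstrained 2D reals, and the weights are softmax-normalized per head
# so that sum_k A_nqk = 1.
offsets = (W_off @ z).reshape(N, K, 2)
logits = (W_att @ z).reshape(N, K)
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def sample(feat, pt):
    """Nearest-neighbour sampling (real implementations interpolate)."""
    r = int(np.clip(round(pt[0]), 0, feat.shape[1] - 1))
    c = int(np.clip(round(pt[1]), 0, feat.shape[2] - 1))
    return feat[:, r, c]

out = np.zeros(C)
for n in range(N):                       # sum over heads
    head = np.zeros(D)
    for k in range(K):                   # sum over sampled keys
        v = W_val[n] @ sample(x, p_q + offsets[n, k])  # W'_n x(p_q + Δp_nqk)
        head += attn[n, k] * v                          # weighted by A_nqk
    out += W_out[n] @ head               # project back with W_n

print(out.shape)  # (8,)
```

Note that the cost is $O(NK)$ samples per query instead of $O(HW)$ attention scores, which is why the sampled-key budget $K \ll H \times W$ keeps the attention cheap at any feature-map resolution.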
