This paper was accepted to CVPR 2017. It's an interesting paper that borrows ideas from RNNs to combine information from both lower and higher layers' feature maps.
SSD and RRC
The RRC model can be viewed as a recurrent version of SSD. In my opinion, it is an SSD specially tuned to perform well at detecting small objects.
In SSD, each feature layer is responsible for detecting objects of a particular scale range. Higher layers, with larger receptive fields, respond to larger objects, while lower layers are designed to respond to small objects. The problem is that the lower layers capture fine details of the input object but lack the semantic information available in the higher layers, whereas the higher layers have large receptive fields but lose the details of thin and small structures.
Figure 1. SSD
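To make this multi-scale layout concrete, here is a minimal PyTorch sketch (not the authors' code) of SSD-style heads, one detector per feature layer; the channel counts, anchor count, and class count are illustrative assumptions.

```python
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    def __init__(self, in_channels=(512, 1024, 512, 256), num_anchors=4, num_classes=2):
        super().__init__()
        # One 3x3 conv head per feature layer: each predicts class scores and
        # 4 box offsets for every anchor at every spatial location.
        self.heads = nn.ModuleList([
            nn.Conv2d(c, num_anchors * (num_classes + 4), kernel_size=3, padding=1)
            for c in in_channels
        ])

    def forward(self, feature_maps):
        # feature_maps: list of tensors, ordered from lower layers
        # (fine details, small objects) to higher layers (coarse, large objects).
        return [head(f) for head, f in zip(self.heads, feature_maps)]
```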
RRC tries to solve this problem by combining information from all the feature layers. In particular, in my opinion, it adds the semantic information from higher layers to the lower layers so as to perform better on small objects.
Figure 2. RRC
The way RRC solves this problem is through two operations: rolling and recurrence.
Rolling
For layer p, we first have a 12 * 40 * 256 feature map. By applying convolution and max pooling to layer p-1's feature map, we generate a 12 * 40 * 19 feature map. By applying deconvolution (transposed convolution) to layer p+1's feature map, we generate another 12 * 40 * 19 feature map. We then concatenate these three feature maps and obtain a 12 * 40 * (19 + 256 + 19) feature map.
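Below is a minimal PyTorch sketch (not the authors' code) of this rolling step for one layer. The 19-channel rolled-in features and the 12 * 40 spatial size follow the numbers above; the kernel sizes and the neighbouring layers' channel counts and shapes are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Rolling(nn.Module):
    def __init__(self, c_prev=512, c_p=256, c_next=256, c_roll=19):
        super().__init__()
        # Shrink layer p-1: 1x1 conv to c_roll channels, then 2x2 max pooling
        # halves its spatial size down to layer p's 12x40.
        self.down = nn.Conv2d(c_prev, c_roll, kernel_size=1)
        # Grow layer p+1: transposed convolution (deconvolution) doubles its
        # spatial size up to 12x40 and outputs c_roll channels.
        self.up = nn.ConvTranspose2d(c_next, c_roll, kernel_size=2, stride=2)

    def forward(self, f_prev, f_p, f_next):
        # Assumed shapes (C x H x W): f_prev 512x24x80, f_p 256x12x40, f_next 256x6x20.
        down = F.max_pool2d(self.down(f_prev), kernel_size=2, stride=2)
        up = self.up(f_next)
        # Concatenate along the channel dimension: (19 + 256 + 19) x 12 x 40.
        return torch.cat([down, f_p, up], dim=1)
```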
Recurrent
After obtaining the 12 * 40 * (19 + 256 + 19) feature map, we apply a 1 * 1 convolution with 256 output channels to reduce it back to 12 * 40 * 256.
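A minimal PyTorch sketch (not the authors' code) of this recurrent reduction, reusing the Rolling sketch above: the same rolling and 1x1 reduction are re-applied with shared weights, so the feature map stays at 256 x 12 x 40 after every iteration. For simplicity it only updates layer p while keeping its neighbours fixed; in the paper, rolling is applied to every feature layer at each iteration.

```python
import torch
import torch.nn as nn

class RecurrentRolling(nn.Module):
    def __init__(self, rolling, c_p=256, c_roll=19, steps=5):
        super().__init__()
        self.rolling = rolling  # the Rolling module sketched above
        # 1x1 convolution reduces 19 + 256 + 19 channels back to 256.
        self.reduce = nn.Conv2d(c_roll + c_p + c_roll, c_p, kernel_size=1)
        self.steps = steps

    def forward(self, f_prev, f_p, f_next):
        outputs = [f_p]
        for _ in range(self.steps):
            # Shared weights across iterations: this is the "recurrent" part.
            f_p = torch.relu(self.reduce(self.rolling(f_prev, f_p, f_next)))
            outputs.append(f_p)  # each iteration's features feed a detection head
        return outputs  # steps=5 gives 6 outputs in total
```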
Bounding Box Regression Space Discretization
A group of feature maps in a layer (e.g. conv4_3) is responsible for regressing bounding boxes within a certain size range. SSD has one regressor per size range. In RRC, the range is further divided into several finer sub-ranges, with a different regressor assigned to each. It's similar to the idea of piecewise linear regression.
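A rough PyTorch sketch of the idea (not the authors' code): separate regression heads for the finer sub-ranges of one layer. The sub-range boundaries, channel count, and anchor count are illustrative assumptions.

```python
import torch.nn as nn

class DiscretizedBoxHeads(nn.Module):
    def __init__(self, in_channels=256, num_anchors=4,
                 sub_ranges=((0.10, 0.15), (0.15, 0.20), (0.20, 0.25))):
        super().__init__()
        self.sub_ranges = sub_ranges
        # One 3x3 regression head per sub-range, each predicting 4 box offsets per anchor.
        self.heads = nn.ModuleList([
            nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, padding=1)
            for _ in sub_ranges
        ])

    def forward(self, feature_map):
        # At training time an anchor would be matched to the head whose
        # sub-range contains its scale; here we simply return all predictions.
        return [head(feature_map) for head in self.heads]
```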
Experiments
The outputs of each RRC iteration
RRC is applied 5 times during training, so there are 6 outputs and 6 corresponding loss functions.
We would expect the loss of later outputs to be smaller, but in fact the lowest loss is achieved at the fourth output. The reason RRC eventually degrades the prediction is mainly the lack of an effective memory mechanism (e.g. an LSTM).
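A minimal sketch of what "6 outputs and 6 corresponding loss functions" means for training; `detection_loss` and the stage outputs are hypothetical placeholders, not the paper's implementation.

```python
def total_loss(stage_outputs, targets, detection_loss):
    # stage_outputs: predictions from the initial output plus the 5 RRC
    # iterations (6 in total); each gets its own loss against the same targets.
    losses = [detection_loss(out, targets) for out in stage_outputs]
    return sum(losses)
```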
Results on KITTI
In the paper, results are reported only on the KITTI dataset, so the performance on datasets like PASCAL VOC and COCO is still unknown. The model's performance on multi-class detection is also an open question.