ResNeXt is a CNN architecture that builds on ResNet. The paper is here.
ResNeXt
According to the paper, a 101-layer ResNeXt can outperform a 200-layer ResNet while having only 50% of the complexity. What sets ResNeXt apart is its use of group convolution. In addition to depth and width, the paper proposes a new dimension called cardinality, and shows that increasing cardinality can improve model performance.
Below is the comparison of ResNet (left) and ResNeXt (right).
Like ResNet, a ResNeXt block still has a single shortcut connection, but its convolutional layers are split into parallel branches. The number of branches is called the “cardinality”. ResNeXt has a similar number of parameters to ResNet.
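The block structure described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the name `aggregated_block` and the lambda branches are hypothetical stand-ins for the real convolutional stacks, and the branch outputs are assumed to be combined by summation before the shortcut is added.

```python
def aggregated_block(x, branches):
    """Return x + sum of branch(x) over all branches.

    x is a flat list of floats standing in for a feature tensor;
    len(branches) plays the role of the cardinality.
    """
    total = [0.0] * len(x)
    for branch in branches:
        out = branch(x)
        total = [t + o for t, o in zip(total, out)]
    # Single shortcut: the unchanged input is added back once.
    return [xi + ti for xi, ti in zip(x, total)]


# Two hypothetical branches (cardinality 2); real branches are conv stacks.
toy_branches = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
]
result = aggregated_block([1.0, 2.0], toy_branches)
print(result)  # [5.0, 9.0]
```

Adding more entries to `toy_branches` raises the cardinality without changing the shortcut.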
Group Convolution
First, let’s recall how a convolutional layer works. Each convolutional layer has a certain number of kernels (e.g. 64, 128, 256). Each kernel is a 3D block of size (C, K, K), where C is the input depth (the number of input channels, i.e. the number of feature maps output by the previous layer) and K is the kernel size (e.g. 3 in the image above). Each kernel scans across the input and outputs one feature map, so the number of output feature maps of the layer equals the number of kernels.
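To make the shapes concrete, here is a minimal pure-Python sketch of such a layer. It assumes stride 1, no padding and no bias, and the function names `conv2d_naive` and `conv_weight_count` are made up for this example; a real framework would do the same work far more efficiently.

```python
def conv2d_naive(x, kernels):
    """Naive convolution (stride 1, no padding, no bias).

    x:       input as nested lists with shape [C][H][W]
    kernels: nested lists with shape [N][C][K][K]
    returns: N feature maps with shape [N][H-K+1][W-K+1]
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0][0])
    out = []
    for kernel in kernels:  # each kernel yields exactly one feature map
        fmap = []
        for r in range(h - k + 1):
            row = []
            for col in range(w - k + 1):
                # Sum over all C channels and the K x K window.
                row.append(sum(
                    kernel[ch][i][j] * x[ch][r + i][col + j]
                    for ch in range(c)
                    for i in range(k)
                    for j in range(k)
                ))
            fmap.append(row)
        out.append(fmap)
    return out


def conv_weight_count(c_in, n_kernels, k):
    """Number of weights: n_kernels kernels, each of size (c_in, k, k)."""
    return n_kernels * c_in * k * k


# C=2 input channels of size 4x4, N=3 kernels of size (2, 3, 3), all ones.
x = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
kernels = [[[[1.0] * 3 for _ in range(3)] for _ in range(2)] for _ in range(3)]
maps = conv2d_naive(x, kernels)
print(len(maps), len(maps[0]), len(maps[0][0]))  # 3 2 2
print(conv_weight_count(2, 3, 3))  # 54
```

Three kernels go in and three feature maps come out, regardless of how many input channels there are.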
Several years ago, when a single GPU did not have enough memory, the kernels were split into groups and each group was placed on a different GPU. As discussed above, each kernel generates one feature map, so after grouping the kernels, each group generates part of the output feature maps. In the usual formulation, each group also reads only its own slice of the input channels. This is called group convolution.
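The bookkeeping behind group convolution can be sketched as follows. The sketch assumes the usual formulation in which both the input channels and the kernels are divided evenly among the groups; `group_layout` and `grouped_weight_count` are hypothetical helper names, not part of any library.

```python
def group_layout(c_in, n_kernels, groups):
    """For each group, return (input channel range, kernel index range)."""
    if c_in % groups or n_kernels % groups:
        raise ValueError("channels and kernels must be divisible by groups")
    in_per, out_per = c_in // groups, n_kernels // groups
    return [
        ((g * in_per, (g + 1) * in_per), (g * out_per, (g + 1) * out_per))
        for g in range(groups)
    ]


def grouped_weight_count(c_in, n_kernels, k, groups):
    """Each kernel spans only c_in // groups channels instead of c_in."""
    if c_in % groups or n_kernels % groups:
        raise ValueError("channels and kernels must be divisible by groups")
    return n_kernels * (c_in // groups) * k * k


# 4 input channels and 6 kernels split into 2 groups (half-open ranges).
print(group_layout(4, 6, 2))  # [((0, 2), (0, 3)), ((2, 4), (3, 6))]
# A 3x3 layer with 128 input channels and 128 kernels:
print(grouped_weight_count(128, 128, 3, 1))   # 147456
print(grouped_weight_count(128, 128, 3, 32))  # 4608
```

With the number of kernels fixed, the weight count shrinks by a factor equal to the number of groups, which is what leaves room to widen the layer at the same cost.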
GPUs now have enough memory to hold all the kernels, but ResNeXt reintroduces the idea of group convolution to improve model performance.
With that, the key idea of ResNeXt is not hard to understand. The image above shows three equivalent network architectures. By using groups, the model can improve accuracy while maintaining the model complexity and the number of parameters.
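A quick weight count illustrates why the complexity stays roughly the same. The widths below are assumptions taken from the commonly quoted configuration (a 256-channel block with a 64-channel ResNet bottleneck, versus a ResNeXt block with cardinality 32 and 4 channels per branch); they are not stated in the text above, and biases and batch-norm parameters are ignored.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Weights of a k x k convolution, ignoring biases and batch norm."""
    return c_out * (c_in // groups) * k * k


# Assumed ResNet bottleneck: 256 -> 64 (1x1), 64 -> 64 (3x3), 64 -> 256 (1x1).
resnet = (
    conv_weights(256, 64, 1)
    + conv_weights(64, 64, 3)
    + conv_weights(64, 256, 1)
)

cardinality, d = 32, 4  # assumed: 32 branches, 4 channels each

# Form 1: 32 separate branches, each 256 -> 4 -> 4 -> 256.
branches = cardinality * (
    conv_weights(256, d, 1) + conv_weights(d, d, 3) + conv_weights(d, 256, 1)
)

# Form 2: a single path of width 128 whose 3x3 layer uses 32 groups.
width = cardinality * d
grouped = (
    conv_weights(256, width, 1)
    + conv_weights(width, width, 3, groups=cardinality)
    + conv_weights(width, 256, 1)
)

print(resnet, branches, grouped)  # 69632 70144 70144
```

Under these assumptions the branch form and the grouped-convolution form have identical weight counts, and both are within about 1% of the ResNet bottleneck.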
ResNeXt has also been adopted in Mask R-CNN, which achieves state-of-the-art results on COCO instance segmentation and object detection tasks.