Computer Vision News - November 2021

Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
Best of ICCV 2021: Best Paper Award

Han Hu is a Principal Researcher in the Visual Computing Group at Microsoft Research Asia. His work proposing a new general-purpose backbone for computer vision has just won the Marr Prize for Best Paper at this year's conference. Huge congratulations to Han and his colleagues on taking home this prestigious award! He spoke to us ahead of his live Q&A session and before receiving the award.

The CNN backbone has dominated the field of computer vision for 30 years. Recent works have applied Transformers to a computer vision backbone for certain tasks, but an important question has not been answered: can Transformers be a general-purpose backbone for computer vision? Transformers have been used in natural language processing (NLP) for years and are now starting to show good results in computer vision; their novelty lies in using self-attention layers rather than convolutional layers. This paper proposes a new Transformer-based architecture, the Swin Transformer, and demonstrates that it performs much better than CNN backbones. It seeks to prove that this architecture can be applied to many different computer vision tasks.

To make the Transformer work well in computer vision, Han and his colleagues looked carefully into the key differences between visual and text signals. They found three priors that are fundamental to vision signals and introduced them into this architecture: hierarchy, locality, and translation invariance. There is also a crucial design that makes it practical in terms of speed: non-overlapping shifted windows, from which the method takes its name (Swin = Shifted windows). This is much faster than the traditional sliding-window approach because of its more hardware-friendly memory access.
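
To make the windowing idea concrete, here is a minimal sketch in PyTorch (not the authors' released code) of how a feature map can be partitioned into non-overlapping windows and cyclically shifted by half a window between consecutive blocks. The helper name window_partition, the 7x7 window size, and the tensor shapes are illustrative assumptions.

    import torch

    def window_partition(x, window_size):
        # Split a (B, H, W, C) feature map into non-overlapping windows of
        # shape (num_windows * B, window_size, window_size, C).
        B, H, W, C = x.shape
        x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

    # Illustrative shapes: a 56x56 feature map with 96 channels, 7x7 windows.
    x = torch.randn(1, 56, 56, 96)
    regular = window_partition(x, window_size=7)             # 64 windows of 7x7x96
    # Cyclic shift by half a window before partitioning (the "shifted window" step).
    shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
    shifted_windows = window_partition(shifted, window_size=7)
    print(regular.shape, shifted_windows.shape)              # torch.Size([64, 7, 7, 96]) twice

Because self-attention is computed only within each window, the cost grows roughly linearly with image size rather than quadratically, and the shift between consecutive blocks lets information flow across neighbouring windows.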
