Computer Vision News - October 2021
Medical Imaging Technology - Best of MICCAI 2021

Introduction

Recently, transformer-based models have gained a lot of traction in natural language processing and computer vision thanks to their scalability, their capacity for learning pretext tasks, and their better modeling of long-range dependencies in sequences of input data. In computer vision, vision transformers and their variants have achieved state-of-the-art performance through large-scale pretraining and fine-tuning on downstream tasks such as classification, detection, and segmentation. Specifically, input images are encoded as a sequence of 1D patch embeddings, and self-attention modules learn a weighted sum of values calculated from hidden layers. This flexible formulation allows the model to effectively learn long-range information, which raises the question: what is the potential of transformer-based networks for 3D segmentation in medical imaging?

Novel methodologies that leverage transformer-based or hybrid (CNN + transformer) approaches have demonstrated promising results in medical image segmentation across different applications. In this article, we will take a deep dive into one such network architecture (UNETR) and also examine other transformer-based approaches in medical imaging (TransUNet and CoTr).

1. UNETR

NVIDIA researchers have proposed to leverage the power of transformers for volumetric (3D) medical image segmentation, introducing a novel architecture dubbed UNEt TRansformers (UNETR). UNETR employs a pure vision transformer as the encoder to learn sequence representations of the input volume and effectively capture global multi-scale information, while also following the successful U-shaped network design for the encoder and decoder.

Why UNETR: Although Convolutional Neural Network (CNN)-based approaches have powerful representation learning capabilities, their ability to learn long-range dependencies is limited by their localized receptive fields.
As a result, such a deficiency in capturing multi-scale contextual information leads to sub-optimal segmentation of structures with various shapes and scales.

Ali Hatamizadeh is a research scientist at NVIDIA. He received his PhD and MSc in Computer Science from the University of California, Los Angeles. Prerna Dogra is a Senior Product Manager for Healthcare at NVIDIA, where she leads the Clara Application Framework and the collaborative open-source initiative Project MONAI.
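To make the sequence formulation described above concrete, here is a minimal, hypothetical NumPy sketch (not the official UNETR code) of how a 3D volume can be split into non-overlapping patches, flattened into a 1D sequence of embeddings, and passed through a single scaled dot-product self-attention step; all sizes and weight matrices here are illustrative assumptions.

```python
import numpy as np

def patchify_3d(volume, patch=16):
    """Split a cubic volume (D, H, W) into non-overlapping 3D patches,
    each flattened to a 1D vector -> sequence of shape (N, patch**3)."""
    d, h, w = volume.shape
    pd, ph, pw = d // patch, h // patch, w // patch
    x = volume.reshape(pd, patch, ph, patch, pw, patch)
    # Group the three patch-index axes first, then flatten each patch.
    x = x.transpose(0, 2, 4, 1, 3, 5).reshape(pd * ph * pw, patch ** 3)
    return x

def self_attention(seq, wq, wk, wv):
    """Single-head self-attention: a weighted sum of values, with weights
    from the softmax of scaled query-key dot products."""
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over keys
    return attn @ v                                # (N, d_v)

rng = np.random.default_rng(0)
vol = rng.standard_normal((32, 32, 32))            # toy 3D input volume
seq = patchify_3d(vol, patch=16)                   # (8, 4096): 8 patches

d_model, d_k = seq.shape[1], 64                    # illustrative sizes
wq = rng.standard_normal((d_model, d_k)) * 0.01
wk = rng.standard_normal((d_model, d_k)) * 0.01
wv = rng.standard_normal((d_model, d_k)) * 0.01
out = self_attention(seq, wq, wk, wv)              # (8, 64) sequence output
```

Because every patch attends to every other patch, each output embedding mixes information from the whole volume in one step, which is exactly the long-range modeling advantage the article contrasts with the localized receptive fields of CNNs.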