MIDL Vision 2022
“The very successful autoregressive models can assign likelihoods to sequences and work very well at various tasks. We’d love to use them directly on medical images, but medical images are 3D, and they’re huge – much bigger than the 2D datasets people work with in standard computer vision. You can’t apply a transformer directly to a sequence of intensities for a medical image; it’s just not going to fit in memory.”

The team used a compression technique to solve this, modeling the datasets using vector-quantized variational autoencoders, or VQ-VAEs. These allowed them to obtain highly compressed inputs while still achieving high-quality reconstructions without losing information. The images are compressed massively – 16 times along each of the three dimensions, so a factor of 16³.

“You take a 256³ medical image, and our representation of that image became 10³ or 12³,” Mark explains. “Using these high compression rates, we can get sequences rich in information, richly representing the image, but small enough to easily fit in the memories of transformers. We can successfully train transformers on these small sequences and build good likelihood models. We found that a high compression rate was crucial. We don’t get good results if we compress the images less, say by eight times along each dimension. That’s an ablation study in the paper. Something about having a compressed representation of your image seemed to help the transformer effectively detect out-of-distribution samples and correctly assign likelihoods to them.”

Oral Presentation
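To make the compression arithmetic concrete, here is a minimal sketch of the idea – not the authors' model. The learned VQ-VAE encoder is stood in for by simple block average pooling, and the codebook is random; both are illustrative assumptions only. The point is the shape bookkeeping: downsampling by 16 along each of three axes shrinks the number of discrete tokens by 16³, turning a volume far too large for a transformer's context into a short sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
FACTOR = 16                        # compression per spatial dimension
volume = rng.random((64, 64, 64))  # small stand-in for a 256^3 scan

# "Encode": average-pool 16x16x16 blocks -> a 4x4x4 latent grid
# (a real VQ-VAE uses a learned convolutional encoder here)
d = volume.shape[0] // FACTOR
latent = volume.reshape(d, FACTOR, d, FACTOR, d, FACTOR).mean(axis=(1, 3, 5))

# "Quantize": snap each latent value to the index of its nearest
# codebook entry (scalar codebook for simplicity; real VQ-VAEs use
# vector-valued codebook entries)
codebook = rng.random(8)
tokens = np.argmin(np.abs(latent[..., None] - codebook), axis=-1).ravel()

# 4^3 = 64 discrete tokens instead of 64^3 = 262,144 raw intensities
print(tokens.shape)
```

The resulting flat token sequence is what an autoregressive transformer would then model, predicting each codebook index from the previous ones to assign a likelihood to the whole volume.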