datascEYEnce! Computer Vision News

RETFound is, as the name already reveals, a foundation model: a model that learned a representation of the eye through self-supervised learning (stage 1 in the graphic) and can later be fine-tuned on specific tasks without the need for a huge amount of data (stage 2 in the graphic).

The encoder Yukun applies is a large-scale vision transformer, while the decoder is a smaller vision transformer. The encoder-decoder architecture already hints that this is an autoencoder; specifically, they used a masked autoencoder. The twist lies in how the data is arranged: instead of feeding the network the whole image, the image is divided into tiles and a fraction of the tiles is hidden. In their experiments, hiding 75% of the tiles worked well for color fundus images, while for OCT images hiding 85% of the tiles achieved the best results. You can see some examples in the figure!

Other non-generative techniques were only implemented after suggestions by reviewers. The contrastive methods (SimCLR, SwAV, DINO, and MoCo-v3) ended up performing better than a supervised pre-training strategy, but slightly worse than the masked autoencoder. This doesn't really matter to the researchers, though: they wanted to prove the general adaptability of a model trained in a self-supervised manner to a supervised downstream task; the specific self-supervised approach is just a tool.

After pretraining the autoencoder first on natural images and subsequently on fundus images (or, alternatively, OCT images), the decoder is replaced by a multi-layer perceptron. For the fine-tuning on
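To make the tile-masking idea concrete, here is a minimal sketch of how an image can be split into square tiles with 75% of them hidden, as in the fundus experiments described above. The tile size, function name, and zero-filling of hidden tiles are illustrative assumptions, not details from the RETFound code.

```python
import numpy as np

def mask_image_tiles(image, tile_size=16, mask_ratio=0.75, seed=0):
    """Split an image into square tiles and hide a fraction of them.

    Mirrors the masked-autoencoder input preparation: the network
    only sees the visible tiles and must reconstruct the hidden ones.
    (Tile size and zero-filling are illustrative choices.)
    """
    h, w = image.shape[:2]
    n_rows, n_cols = h // tile_size, w // tile_size
    n_tiles = n_rows * n_cols
    n_hidden = int(round(n_tiles * mask_ratio))

    rng = np.random.default_rng(seed)
    hidden = rng.choice(n_tiles, size=n_hidden, replace=False)

    masked = image.copy()
    for idx in hidden:
        r, c = divmod(int(idx), n_cols)
        masked[r * tile_size:(r + 1) * tile_size,
               c * tile_size:(c + 1) * tile_size] = 0.0
    visible = np.setdiff1d(np.arange(n_tiles), hidden)
    return masked, visible

# Example: a 224x224 "fundus image" gives 14x14 = 196 tiles;
# with mask_ratio=0.75, 147 tiles are hidden and 49 remain visible.
img = np.ones((224, 224), dtype=np.float32)
masked, visible = mask_image_tiles(img, mask_ratio=0.75)
```

For the OCT setting mentioned above, the same sketch applies with `mask_ratio=0.85`.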
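The fine-tuning step, swapping the decoder for a multi-layer perceptron on top of the frozen or fine-tuned encoder, can be sketched as follows. All shapes, names, and the two-class output are illustrative assumptions; the stand-in encoder is just a placeholder for the pretrained vision transformer.

```python
import numpy as np

def encoder(x, w_enc):
    """Stand-in for the pretrained ViT encoder: input -> feature vector."""
    return np.tanh(x @ w_enc)

def mlp_head(features, w1, b1, w2, b2):
    """Task-specific MLP that replaces the MAE decoder for fine-tuning."""
    hidden = np.maximum(features @ w1 + b1, 0.0)  # ReLU layer
    return hidden @ w2 + b2                        # class logits

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 768))                 # 4 images as flat inputs (illustrative)
w_enc = rng.normal(size=(768, 256)) * 0.01    # pretrained encoder weights (stand-in)
w1, b1 = rng.normal(size=(256, 64)) * 0.1, np.zeros(64)
w2, b2 = rng.normal(size=(64, 2)) * 0.1, np.zeros(2)  # e.g. disease vs. healthy

logits = mlp_head(encoder(x, w_enc), w1, b1, w2, b2)  # shape (4, 2)
```

During stage 2, the head (and optionally the encoder) would be trained on the small labeled downstream dataset.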
RkJQdWJsaXNoZXIy NTc3NzU=