Computer Vision News - August 2021

Generating Counterfactuals...

This project builds a pipeline for explainability and discovery, applied to Diabetic Macular Edema (DME) prediction models. The gold standard for diagnosing this disease is a 3D scan of the retina based on Optical Coherence Tomography (OCT), while a cheaper but less accurate option, and by far the most widely used, is Color Fundus Photography (CFP), a 2D image of the retina. Screening is done by looking for the presence of lipid deposits called hard exudates (HE), a proxy feature for retinal thickening. The dataset consists of 7,072 paired CFP and OCT images labeled by two medical doctors. Note that only CFP images, together with labels derived from either CFP or OCT, are used for training.

Previous studies found that the same CNN architecture trained with labels derived from human experts grading OCT significantly outperforms one trained with HE labels based on CFP images, but they do not explain why this is so. The authors fill this knowledge gap with a method based on the generation of counterfactuals. The outcome is a scientific discovery that answers the question of why OCT-derived labels are more accurate. Briefly, the method first identifies salient regions; it then creates image translation models which determine what in those regions influences the predictions; it amplifies those modifications to enhance human interpretability; and finally it extracts a minimal set of hand-engineered features which are used to train an SVM classifier. The performance of this classifier is then compared with that of a CNN trained on raw images. Will the two outcomes be similar? Let's look at all the steps in more depth (hedged code sketches of some of them follow at the end of this summary):

0) Train a CNN model M based on Inception-v3 on the DME dataset (obtaining an AUC of 0.847). A multi-task version of this model is also trained (AUC of 0.89).

1) Input ablation to evaluate the importance of known regions: two known landmarks (the optic disc and the fovea) are used. Circular crops of different radii (from 0.25 to 5) are taken around both landmarks, and the rest of the pixels become background. From this experiment, the authors conclude that the model gets most of its information from the region surrounding the fovea (i.e. the macula): the model is looking at the right region.
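As a rough illustration of step 0, here is a minimal training sketch, assuming TensorFlow/Keras. The dataset path, input size, classification head, and hyperparameters are placeholders, not the authors' actual recipe:

```python
# A rough sketch of step 0: an Inception-v3 binary classifier for DME.
# Assumes TensorFlow/Keras; paths and hyperparameters are hypothetical.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet",
    input_shape=(299, 299, 3), pooling="avg")

model = models.Sequential([
    layers.Rescaling(1. / 127.5, offset=-1),  # InceptionV3 expects [-1, 1]
    base,
    layers.Dense(1, activation="sigmoid"),    # binary DME prediction
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])

# Hypothetical directory layout: dme_cfp/train/<class>/<image>.png
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dme_cfp/train", image_size=(299, 299),
    batch_size=32, label_mode="binary")
model.fit(train_ds, epochs=10)
```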
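The step-1 ablation can be pictured as blanking everything outside a circle around a landmark before re-scoring the image with model M. A minimal sketch, assuming NumPy images and hypothetical landmark coordinates; scaling the radii by the optic disc diameter is an assumption here, as the article does not state the unit:

```python
# A minimal sketch of the step-1 input ablation: keep a circular region
# around a landmark (fovea or optic disc) and blank the rest.
import numpy as np

def circular_crop(image, center, radius, background=0):
    """Keep only pixels within `radius` of `center`; blank the rest."""
    h, w = image.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    cx, cy = center
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    ablated = np.full_like(image, background)
    ablated[mask] = image[mask]
    return ablated

# Illustrative use: crops of growing radius around the fovea, each
# re-scored with the trained model M (names here are hypothetical).
# for r in [0.25, 0.5, 1, 2, 5]:
#     patch = circular_crop(cfp_image, fovea_xy, r * disc_diameter_px)
#     score = model.predict(patch[None, ...])
```

Plotting the model's AUC against the crop radius for each landmark is what lets the authors say the signal is concentrated around the fovea.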
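Finally, the last stage of the pipeline trains an SVM on the minimal set of hand-engineered features and compares its AUC with the CNN's. A minimal sketch, assuming scikit-learn; the feature matrix below is randomly generated placeholder data, not the authors' features:

```python
# A minimal sketch of the final stage: an SVM on a small feature set.
# X stands in for the hand-engineered features; it is random placeholder
# data here, so the resulting AUC is meaningless except as an example.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                # placeholder features
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)   # placeholder labels

svm = SVC(kernel="rbf")
auc = cross_val_score(svm, X, y, scoring="roc_auc", cv=5).mean()
print(f"SVM cross-validated AUC on placeholder data: {auc:.3f}")
```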
