CNNs and ViTs with various experiments, they only used gradient-based attack methods. Previous research has observed that ViTs focus more on low-frequency features, while CNNs rely more on high-frequency features. From this point of view, gradient-based attacks, which tend to perturb high-frequency features through spatial-domain perturbations, might fool CNNs more easily than ViTs.

To mitigate this bias, we formulate an attack framework that can directly attack the pixel values, the magnitude spectrum, and the phase spectrum of an image, allowing flexible perturbations in both the spatial and spectral domains. Attacking different components induces different distortion patterns in the image (Fig. 1 on the previous page). The distortion pattern also varies depending on the target model. Our research reveals that Vision Transformers exhibit similar or greater vulnerability to the phase attack, which primarily injects perturbations in the low-frequency regions, while Convolutional Neural Networks are more vulnerable to the pixel attack, which injects perturbations mainly in the high-frequency regions.

To examine the effect of the phase attack on the spectral characteristics of images, we apply the Fourier transform to the difference between the original and attacked images and analyze the magnitudes in different frequency regions (Fig. 2 below). For ResNet50, mainly the high-frequency regions are distorted, whereas for ViTs the distortion is concentrated in the low-frequency region. This is consistent with the earlier observation that CNNs and ViTs rely more on high-frequency and low-frequency information, respectively.

Example of the perturbed image resulting from the phase attack, the distortion in the pixel domain, and the distribution of the distortion over different frequency regions.

Computer Vision News 26 WACV Poster Presentation
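
For readers who want to experiment with the idea, the spectral decomposition and the frequency-region analysis described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`decompose`, `recompose`, `perturb`, `band_energy`), the multiplicative form of the magnitude perturbation, the additive form of the phase and pixel perturbations, and the split into four equal-width radial bands are all hypothetical choices. It works on a single-channel image and does not include the optimization that would actually search for adversarial perturbations.

```python
import numpy as np


def decompose(image):
    """Split a 2-D grayscale image into its magnitude and phase spectra."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)


def recompose(magnitude, phase):
    """Rebuild a spatial-domain image from magnitude and phase spectra.

    An arbitrary phase perturbation breaks conjugate symmetry, so the
    inverse transform can be complex; keeping only the real part is a
    simplifying assumption of this sketch.
    """
    return np.real(np.fft.ifft2(magnitude * np.exp(1j * phase)))


def perturb(image, delta_pixel=None, scale_mag=None, delta_phase=None):
    """Apply optional magnitude, phase, and pixel perturbations.

    Assumed forms (hypothetical): magnitude is scaled element-wise,
    phase and pixel perturbations are added element-wise.
    """
    magnitude, phase = decompose(image)
    if scale_mag is not None:
        magnitude = magnitude * scale_mag
    if delta_phase is not None:
        phase = phase + delta_phase
    attacked = recompose(magnitude, phase)
    if delta_pixel is not None:
        attacked = attacked + delta_pixel
    return attacked


def band_energy(original, attacked, n_bands=4):
    """Share of distortion magnitude per radial frequency band.

    Band 0 holds the lowest frequencies, band n_bands - 1 the highest.
    The two images must differ, otherwise the total is zero.
    """
    diff = attacked - original
    # Shift the zero-frequency component to the centre of the spectrum.
    magnitude = np.abs(np.fft.fftshift(np.fft.fft2(diff)))
    height, width = diff.shape
    yy, xx = np.indices((height, width))
    radius = np.hypot(yy - height // 2, xx - width // 2)
    # Equal-width radial bands between the centre and the farthest corner.
    edges = np.linspace(0.0, radius.max(), n_bands + 1)
    band = np.digitize(radius, edges[1:-1])
    totals = np.bincount(band.ravel(), weights=magnitude.ravel(),
                         minlength=n_bands)
    return totals / totals.sum()
```

Calling `band_energy(image, perturb(image, delta_phase=d))` for some perturbation `d` returns one fraction per band, which is the kind of per-region summary the frequency analysis above describes. Radial bands are only one possible way to define "frequency regions"; the binning used in the actual study may differ.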