Computer Vision News

While the quality of generated captions has improved notably, their automatic evaluation has also received significant attention, yet until now it has relied on metrics that use only a few human references or noisy web-collected data. Moreover, these standard metrics often fail to align closely with human judgment. To bridge this gap, we propose a novel evaluation metric based on positive-augmented contrastive learning, which achieves the highest correlation with human judgment. This new and efficient metric, called PAC-S, is obtained simply by fine-tuning the CLIP architecture with additional positive examples produced by two synthetic generators, one for text and one for images (BLIP and Stable Diffusion, respectively).

With these advancements on the evaluation side, we shifted our focus to enhancing the caption generation task itself and devised another augmented architecture, called PMA-Net. The idea originated from the observation that the attention operator, at the core of most captioning models, is unable to attend to past training examples, which reduces its effectiveness. To address this limitation, we propose a prototypical memory network that can recall and exploit past activations. In this case, the memory augmentation is fully integrated into the architecture.

Sara Sarto
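The positive-augmented contrastive fine-tuning behind PAC-S can be sketched as a multi-positive InfoNCE objective: for each image, both its real caption and a synthetically generated caption count as positives against all other candidates. The function and embedding names below are illustrative assumptions, not the actual PAC-S training code.

```python
import numpy as np

def l2norm(x):
    # Normalize embeddings to unit length, as in CLIP-style similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pac_contrastive_loss(img_emb, txt_emb, syn_txt_emb, tau=0.07):
    """Multi-positive InfoNCE (sketch): each image's positives are its
    real caption and a synthetic caption generated for it. Names and
    details are assumptions; the actual PAC-S objective may differ."""
    img, txt, syn = map(l2norm, (img_emb, txt_emb, syn_txt_emb))
    candidates = np.concatenate([txt, syn], axis=0)   # (2N, d) caption pool
    logits = img @ candidates.T / tau                 # (N, 2N) similarities
    n = img.shape[0]
    pos_mask = np.zeros_like(logits)
    pos_mask[np.arange(n), np.arange(n)] = 1          # real caption positive
    pos_mask[np.arange(n), n + np.arange(n)] = 1      # synthetic positive
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return float(-np.log((e * pos_mask).sum(axis=1) / e.sum(axis=1)).mean())
```

Matched image/caption embeddings should yield a lower loss than shuffled ones, which is what the contrastive fine-tuning exploits.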
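The prototypical memory idea in PMA-Net can be illustrated as attention whose keys and values are extended with prototype vectors that summarize past activations. This is a minimal sketch under assumed interfaces: the prototype construction here is a toy k-means, whereas PMA-Net learns its prototypes within the architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_augmented_attention(q, k, v, mem_k, mem_v):
    """Scaled dot-product attention whose key/value sets are extended
    with prototype memory slots (illustrative sketch of the idea)."""
    k_all = np.concatenate([k, mem_k], axis=0)
    v_all = np.concatenate([v, mem_v], axis=0)
    attn = softmax(q @ k_all.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v_all

def build_prototypes(past_acts, n_proto, iters=10, seed=0):
    """Toy k-means that distills stored past activations into a small
    set of prototype vectors (a stand-in for PMA-Net's learned memory)."""
    rng = np.random.default_rng(seed)
    protos = past_acts[rng.choice(len(past_acts), n_proto, replace=False)]
    for _ in range(iters):
        dists = ((past_acts[:, None] - protos[None]) ** 2).sum(-1)
        assign = np.argmin(dists, axis=1)
        for j in range(n_proto):
            members = past_acts[assign == j]
            if len(members):
                protos[j] = members.mean(axis=0)
    return protos
```

Because the prototypes are just extra key/value rows, the memory is recalled through the same attention operator the captioner already uses, which is what makes the augmentation fully integrable into the architecture.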
RkJQdWJsaXNoZXIy NTc3NzU=