Computer Vision News - April 2023

Alberto Testoni

Alberto proposes to evaluate machine-generated dialogues at a deeper level by capturing the interplay between the Encoder and Decoder components of neural architectures. As a case study, he considers entity hallucinations: the generation of words that are inconsistent with the image the conversation is about. He shows that hallucinations have a detrimental cascade effect on subsequent dialogue turns. By adapting Transformer-based models to these tasks, he finds that more sophisticated visual processing plays a crucial role in reducing hallucinations.

The progressive move towards ever deeper evaluation criteria led Alberto to study the effectiveness of question-asking strategies in humans and machines. Inspired by cognitive studies on children and adults, Alberto proposes Confirm-it (Figure 1), a model based on a beam search re-ranking technique that implements a confirmation-driven strategy. Confirm-it outperforms other decoding strategies on both surface-level and more fine-grained metrics, and it generates dialogues that are also the most informative for humans playing the same task.

Finally, Alberto broadens the view on what is still missing to achieve human-like dialogue systems. He presents a large-scale study of the human conversations used to train computational models, with the aim of unveiling the pragmatic phenomena that make human communication successful in Visual Dialogue tasks.

For more information, see Alberto’s website.

Figure 1: In the “GuessWhat?!” game, an Oracle is assigned a target object in an image, and a Questioner has to ask questions to identify it. The charts show the probability distributions over the different objects after the dialogue exchanges. Among the possible follow-up questions (a-c), Confirm-it selects the one that tests the target intermediate hypothesis (b).
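The idea of re-ranking beam search candidates with a confirmation-driven criterion can be illustrated with a small Python sketch. This is not Alberto's implementation: the `Candidate` class, the `predicted_belief` field, the `rerank_by_confirmation` function, and all the numbers are assumptions made up for illustration. The sketch assumes that each candidate question comes with a beam search score and with a predicted probability distribution over the objects after the question is asked, and it prefers the question that most increases the probability of the current top hypothesis.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """A follow-up question proposed by beam search (hypothetical structure)."""
    question: str
    log_prob: float                     # beam search score, assumed given
    predicted_belief: Dict[str, float]  # assumed belief over objects after asking


def rerank_by_confirmation(
    belief: Dict[str, float], candidates: List[Candidate]
) -> List[Candidate]:
    """Sort candidates so that the one that best confirms the current
    top hypothesis comes first; ties are broken by the beam search score."""
    # The intermediate hypothesis is the object that is currently most probable.
    hypothesis = max(belief, key=belief.get)

    def confirmation_gain(candidate: Candidate) -> float:
        # How much the probability of the hypothesis is expected to grow.
        return candidate.predicted_belief.get(hypothesis, 0.0) - belief[hypothesis]

    return sorted(
        candidates,
        key=lambda c: (confirmation_gain(c), c.log_prob),
        reverse=True,
    )


# Toy example with made-up numbers.
belief = {"dog": 0.5, "cat": 0.3, "ball": 0.2}
candidates = [
    Candidate("(a) Is it on the left?", -0.5, {"dog": 0.5, "cat": 0.3, "ball": 0.2}),
    Candidate("(b) Is it the dog?", -1.2, {"dog": 0.8, "cat": 0.1, "ball": 0.1}),
    Candidate("(c) Is it round?", -0.9, {"dog": 0.3, "cat": 0.2, "ball": 0.5}),
]
ranked = rerank_by_confirmation(belief, candidates)
best_question = ranked[0].question
```

In this toy setting, question (b) ranks first even though plain beam search scores it lowest, because it is the only candidate expected to raise the probability of the current hypothesis.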
