Aniruddha Kembhavi (top left) is a Senior Director at the Allen Institute for AI (AI2), leading the Perceptual Reasoning and Interaction Research (PRIOR) team, where Christopher Clark (center) and Jiasen Lu (top right) are Research Scientists, Sangho Lee (bottom left) is a Postdoctoral Researcher, and Zichen “Charles” Zhang (bottom right) is a Predoctoral Young Investigator. Before their poster session this afternoon, they speak to us about their highlight paper proposing Unified-IO 2, a versatile autoregressive multimodal model.

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating images, text, audio, and action. It handles multiple input and output modalities and covers a wide range of tasks from vision research. Unlike traditional models that rely on specialized components for different tasks, it uses a single encoder-decoder transformer to handle all tasks, with a unified loss function and pretraining objective.
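To make the single-model idea concrete, here is a minimal sketch (not the authors' code) of how a unified encoder-decoder can serve every modality: all inputs and targets are serialized into one shared token vocabulary, one transformer processes them, and one cross-entropy loss is applied regardless of which modality is being generated. The class name, vocabulary size, and dimensions below are hypothetical.

```python
import torch
import torch.nn as nn

VOCAB = 1000  # hypothetical shared vocabulary covering text/image/audio/action tokens
D = 64        # hypothetical model width

class UnifiedEncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.transformer = nn.Transformer(d_model=D, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.head = nn.Linear(D, VOCAB)  # one output head shared by all modalities

    def forward(self, src_tokens, tgt_tokens):
        # src/tgt are mixed-modality token ids drawn from the shared vocabulary
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        out = self.transformer(src, tgt)
        return self.head(out)

model = UnifiedEncoderDecoder()
src = torch.randint(0, VOCAB, (2, 16))  # e.g. tokenized image + text prompt
tgt = torch.randint(0, VOCAB, (2, 8))   # e.g. audio or action tokens to generate
logits = model(src, tgt)
# A single loss function, whatever modality the targets come from
loss = nn.CrossEntropyLoss()(logits.reshape(-1, VOCAB), tgt.reshape(-1))
print(loss.item())
```

The point of the sketch is the design choice the paper highlights: because every modality maps into one token space, there are no task-specific branches to maintain; one pretraining objective suffices.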