CVPR Daily - Friday

Unified-IO 2

“It’s a super broad model,” Christopher tells us. “It can take many different modalities as input and output. It can do image, text, audio, and video as input and can generate text, image, and audio output. Within those modalities, we basically threw in every task we could think of that vision researchers have been interested in. It’s a super, super broad model. I think it’s one of the most broadly capable models that exists today.”

While language models can perform many tasks and input and output all kinds of structured language, handling diverse inputs and outputs in computer vision is more challenging.

“When it comes to computer vision, it’s a mess,” Aniruddha says bluntly. “Sometimes, you have to input an image. Sometimes, you have to output a bounding box. Sometimes, you have to output a continuous vector like a depth map. Inputs and outputs in computer vision are very heterogeneous. That’s why, for the last 10 years, people have been building models that can do one or two things.”
