UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
ICCV Workshop Paper and Demo

Soon Yau Cheong is a second-year PhD student at the University of Surrey, under the supervision of Andrew Gilbert and Armin Mustafa. He tells us about his novel multimodal diffusion model for text, pose, and visual prompting. This work will be presented next month at ICCV 2023 during the Computer Vision for Metaverse workshop.

In this paper, Soon proposes a new diffusion model that can fuse three distinct modalities simultaneously to generate and edit images of people. Previously, separate models have been used to generate images from text or to transfer a person's appearance from a source image to the pose of a target image. In a departure from those models that operate in isolation, UPGPT represents the first-ever integration of text prompting, pose guidance, and visual prompting, unlocking new creative possibilities for image generation and editing.

The purpose of integrating text, pose, and visual prompting under a single unified model is one of efficiency. Why rely on multiple disparate models when a comprehensive model can effectively harness these diverse inputs? "It makes sense," Soon affirms. "Why do we have so many different models if one can do all? Even if you have different models, the base model already has a rich understanding of what a human should look like. We have one model that can learn everything!"

Achieving this was not without its difficulties. Data availability was a challenge Soon faced. He used the DeepFashion dataset, but the 3D pose information the model required was not inherently present. Ultimately, he had to employ other software to extract it, adding an extra layer of complexity to the process.
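To make the idea of fusing three modalities more concrete, here is a minimal sketch of one way text, pose, and visual features could be combined into a single conditioning sequence for a diffusion model. It is written in PyTorch with hypothetical module names and dimensions (CLIP-style text tokens, SMPL-style pose parameters, an image-style embedding are assumptions); it illustrates the general technique only and is not the authors' actual UPGPT implementation.

```python
# Illustrative sketch only: hypothetical names and dimensions,
# not the authors' actual UPGPT code.
import torch
import torch.nn as nn

class MultimodalConditioner(nn.Module):
    """Projects text, pose, and visual (style) features into a shared
    token space that a diffusion U-Net could attend to via cross-attention."""
    def __init__(self, text_dim=768, pose_dim=72, style_dim=512, cond_dim=640):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)    # e.g. CLIP-style text features (assumed)
        self.pose_proj = nn.Linear(pose_dim, cond_dim)    # e.g. SMPL-style pose parameters (assumed)
        self.style_proj = nn.Linear(style_dim, cond_dim)  # e.g. appearance/style image embedding (assumed)

    def forward(self, text_tokens, pose_params, style_embed):
        # text_tokens: (B, T, text_dim), pose_params: (B, pose_dim), style_embed: (B, style_dim)
        text = self.text_proj(text_tokens)                # (B, T, cond_dim)
        pose = self.pose_proj(pose_params).unsqueeze(1)   # (B, 1, cond_dim)
        style = self.style_proj(style_embed).unsqueeze(1) # (B, 1, cond_dim)
        # Concatenate along the sequence axis: one conditioning sequence, three modalities.
        return torch.cat([text, pose, style], dim=1)      # (B, T+2, cond_dim)

# Usage example with random stand-in features.
cond = MultimodalConditioner()
tokens = cond(torch.randn(2, 77, 768), torch.randn(2, 72), torch.randn(2, 512))
print(tokens.shape)  # torch.Size([2, 79, 640])
```

The design choice sketched here, projecting every modality into a shared conditioning space, is what lets one model accept any subset of the inputs rather than training a separate network per task.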