VideoGameBunny: Towards vision assistants for video games

Mohammad Reza Taesiri is currently a postdoc at the University of Alberta, working on large vision language models under the supervision of Cor-Paul Bezemer. Mohammad is also the first author of a lovely paper that was accepted at WACV 2025 as an oral.

This work is about creating vision language models for video games. These days, we have chatbots to which you can send an image and start a conversation about it, asking specific questions about the image itself: What is the content of the image? How many people are in the image? These models are closed source, but there are some open-source alternatives, the most famous being LLaVA. However, these models are not very good with video game content. If you send a video game screenshot and ask a question about the game or the game world, they usually struggle to answer. The goal of VideoGameBunny was to create a model that is more familiar with video game context and can understand and answer questions about video game content better than other models.

Let's look at this image taken from the paper: a video game screenshot and a simple question. Are there any visible glitches or errors in the game environment? If you ask LLaVA, it says that yes, an additional download progress bar seems to be stuck. If you look closely at the top of the screenshot, there is indeed a download progress bar there, but it is a normal part of the interface. Mohammad Reza's model is not confused by the progress bar and answers the question correctly, because it understands what is going on.

What was the biggest challenge in creating such a powerful model? When it comes to creating a new model, there are two things that are very important. The first thing is data, the data we need to collect. And we need to be very, very careful with
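For readers who want to try the kind of query described above with the open-source baseline, here is a minimal sketch using the Hugging Face transformers integration of LLaVA 1.5. The checkpoint id follows the public llava-hf releases; the screenshot path and the question are placeholders taken from the paper's example, and swapping in a game-tuned model such as VideoGameBunny would require loading that model's own checkpoint and prompt format.

```python
# Minimal sketch: asking an open-source vision language model a question
# about a game screenshot. Assumes the public llava-hf/llava-1.5-7b-hf
# checkpoint; "screenshot.png" is a placeholder path.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("screenshot.png")  # a video game screenshot
prompt = (
    "USER: <image>\n"
    "Are there any visible glitches or errors in the game environment? "
    "ASSISTANT:"
)

# Moves tensors to the model's device; only floating-point tensors
# (the image pixels) are cast to float16.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```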