DAILY WACV Sunday
Oral Presentation

The thing Mohammad Reza is most proud of about this model is that it is tiny. It is relatively small, yet it can perform on par with, or better than, models that are far larger.

What would the author do if he had a magic wand to add one more feature to the model? The answer is definitely better quality data. “I never say no to better quality data,” confides Mohammad Reza. “And if I want to redo this project from scratch, I would spend more time to create higher quality data.”

What would be an ideal direction for continuing this work? One immediate direction is extending the model to video. Currently, it works only with a single image, but the authors want to expand this capability to video: for example, sending a video and asking a question about it. This is one of the things they are working on at the moment, generating a video dataset and creating a video model.

Another direction, which is a little harder, is a component that allows the model to control the video game. Currently, it only outputs text, but imagine it could output the actions to play in the game. “There are some works in robotics,” he shares, “which they call vision-language-action models (VLAs). It's a combination of vision and language and action. That part