WACV 2025 Daily - Monday

A Video is Worth 10,000 Words

The most interesting idea in this work is using large language models to take videos that have very long text captions and reliably, controllably rewrite those captions into different words with the same meaning. It is interesting to take a five-sentence, nitty-gritty, play-by-play description and condense it down to a five-word summary. The goal is to verify that existing video-language models understand short descriptions just as well as they understand very long ones.

Matt did this work during an internship with SRI International, and the idea arose organically from conversations between him, his mentor Michael Cogswell, and the team's manager and the paper's last author, Ajay Divakaran.

The most challenging part was designing and executing the user study itself, where the team ended up using 15 people to check parts of the data to make sure that the LLM output meant the same thing as the original text annotations. Setting that up, organizing it, and tabulating it in a way that would convince readers the data was trustworthy was, for Matt, probably the hardest part of executing the paper.

On the scientific side, understanding why the video-language models struggled with the new synthetic data was challenging. There were cases where a human could still fairly easily tell which video a short caption matched, but the video-language models would make very strange decisions. It was not easy to find a way to identify which examples were easy compared to the hard ones, to identify when a short caption loses information versus when it doesn't, and to identify when it has lost the identifying information versus when it is still unique. Those are some key challenges, on both the organizational side and the technical side.
And some of those are still open questions. The paper has the merit of providing preliminary answers as to why the models might still struggle, but some of that is still open. To get at this, the team first looked at shortening in terms of the number of words lost, but that didn't really matter; then they tried to detect when information was lost: they counted unique nouns, and they looked at the overlap in meaning between the vector embeddings of the different captions. “We looked at how,” Matt explains, “when we went from long to short, how those embeddings moved with respect to each other, to become closer, which indicated more ambiguity or not. And then we would have humans go in once we flagged some and just spot check and say, you know, to a human
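The embedding comparison Matt describes can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the vectors below are toy stand-ins for the outputs of a real caption encoder, and the flagging threshold is a hypothetical choice.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for text-encoder embeddings of two videos' captions.
long_a  = np.array([0.9, 0.1, 0.3])   # long caption, video A
long_b  = np.array([0.1, 0.8, 0.4])   # long caption, video B
short_a = np.array([0.6, 0.4, 0.4])   # shortened caption, video A
short_b = np.array([0.4, 0.6, 0.4])   # shortened caption, video B

# If shortening pulls the two videos' caption embeddings closer
# together, the short captions are more ambiguous and worth a
# human spot check.
drift = cosine(short_a, short_b) - cosine(long_a, long_b)
if drift > 0:
    print(f"captions moved closer by {drift:.3f}; flag for human review")
```

With real encoders, the same comparison over a whole dataset gives a ranked list of the most ambiguous shortened captions, which is where the human spot checks described above come in.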
