Computer Vision News - October 2023

combining information from these modalities, leading to significant advances in video summarization and instructional video analysis.

The first part of Medhini’s thesis introduces a non-parametric approach to synthesizing long videos from short clips using representations learned via contrastive learning. Segments of the short video are repeatedly stitched together coherently to create dynamic yet consistent outputs. A learned distance metric is used to choose segments, allowing clips to be compared in a way that scales to more challenging dynamics and can be conditioned on other data, such as audio.

Next, Medhini introduces CLIP-It, a novel technique for generating concise visual summaries of lengthy videos guided by natural language cues. Specifically, a user-defined query or a generated video caption is used to create a visual summary that best matches this natural language prompt.

She then focuses on summarizing instructional videos, capitalizing on the audio-visual alignment between narration and actions, as well as the similarity in task structure across multiple videos, to produce informative summaries. Fig 1 illustrates this method of creating a video summary without external supervision.

To further enrich the comprehension of instructional videos, she introduces a cutting-edge approach for learning and verifying the procedural steps within instructional content, enabling the model to grasp long, complex video sequences and ensure procedural accuracy.

Lastly, her work explores the potential of large language models for answering questions about images by generating executable Python code. This involves first defining modules that are useful for answering questions and that use pre-trained image-language models in the background.
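The language-guided summarization idea can be illustrated with a minimal sketch: embed each frame and the text query in a shared space (as a CLIP-style model would), then keep the frames whose embeddings best match the query. The `language_guided_summary` function and the toy embeddings below are illustrative stand-ins, not the actual CLIP-It architecture, which uses a language-guided attention head rather than simple cosine ranking.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between each row of `a` and the vector `b`.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b)
    return a_n @ b_n

def language_guided_summary(frame_embeds, query_embed, k):
    """Return indices of the k frames most similar to the query,
    restored to temporal order."""
    scores = cosine_sim(frame_embeds, query_embed)
    top = np.argsort(scores)[-k:]   # k highest-scoring frames
    return np.sort(top)             # keep the summary in video order

# Toy 4-frame "video" in a 3-d embedding space; the query direction
# matches frames 1 and 3.
frames = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.1, 0.0],
                   [0.1, 1.0, 0.0]])
query = np.array([0.0, 1.0, 0.0])
print(language_guided_summary(frames, query, k=2))  # -> [1 3]
```

In practice the frame and query embeddings would come from a pre-trained vision-language encoder, and the selected frames (or shots) would be concatenated into the summary video.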
As seen in Fig 2, using a few sample prompts, an LLM can be instructed to orchestrate these modules into meaningful code snippets, which can be executed to answer questions about the image in an explainable fashion. Currently, her research efforts are directed towards exploring the use of large vision and language models for video understanding.
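The orchestration step can be sketched in a few lines. The `find` and `count` modules below are hypothetical stubs standing in for pre-trained vision-language components; in the real system the LLM emits the `generated_code` string from a prompt, and each module call runs an actual model on the image.

```python
# Hypothetical "modules" an LLM could compose into an executable program.
def find(image, name):
    # Stub detector: return the patches labelled `name`.
    # (A real module would run an open-vocabulary detector here.)
    return [p for p in image["objects"] if p["label"] == name]

def count(patches):
    return len(patches)

# The kind of snippet an LLM might generate for the question
# "How many dogs are in the image?"
generated_code = "answer = count(find(image, 'dog'))"

# Toy "image" represented as labelled object patches.
image = {"objects": [{"label": "dog"}, {"label": "cat"}, {"label": "dog"}]}

scope = {"find": find, "count": count, "image": image}
exec(generated_code, scope)
print(scope["answer"])  # -> 2
```

Because the answer is produced by an explicit, inspectable program rather than an opaque forward pass, each intermediate step (what was detected, what was counted) can be examined, which is what makes the approach explainable.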
