Computer Vision News - November 2023

Exclusive Interview

knew every single scientist in your field. Now, you are meeting thousands and cannot learn all their names. What is your message to our growing community?

A number of different messages. The first one is there are a lot of applications of current technologies where you need to tweak an existing technique and apply it to an important problem. There’s a lot of that. Many people who attend these conferences are looking for ideas for applications they’re interested in: medicine, environmental protection, manufacturing, transportation, etc. That’s one category of people – essentially AI engineers. Then, some people are looking for new methods because we need to invent new methods to solve new problems.

Here’s a long-term question. The success we’ve seen in natural language manipulation and large language models – not just generation but also understanding – is entirely due to progress in self-supervised learning. You train some giant transformer to fill in the blanks missing from a text. The special case is if the blank is just the last word. That’s how you get autoregressive LLMs.

Self-supervised learning has been a complete revolution in NLP. We’ve not seen this revolution in vision yet. A lot of people are using self-supervised learning. A lot of people are experimenting with it. A lot of people are applying it to problems where there’s not that much data, so you need to pre-train on whatever data you have available or synthetic data and then fine-tune on whatever data you have. So, there is some progress in imaging. I’m really happy about this because I think that’s a good thing, but the successful methods aren’t generative. The kind of methods that work in these cases aren’t the same kind of methods that work in NLP. In my opinion, the idea that you’re going to tokenize your video or learn to predict the tokens is not going anywhere.
We have to develop specific techniques for images because images and video are considerably more complicated than language. Language is discrete. It makes it simple, particularly when having to handle uncertainty. Vision is very challenging. We’ve made progress. We have good techniques now that do self-supervised learning from images. The next step is video.

Once we figure out a recipe to train a system to learn good representations of the world from video, we can also train it to learn predictive world models: Here’s the state of the world at time T. Here’s an action I’m taking. What’s going to be the state of the world at time T+1? If we have that, we can have machines that can plan, which means they can reason and figure out a sequence of actions to arrive at a goal. I call this objective-driven AI. This is, I think, the future of AI systems. Computer vision has a very important role to play there. That’s what I’m working on. My research is entirely focused on this!
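As an editorial illustration, not something taken from the interview, the predict-then-plan idea described above can be sketched in a few lines of Python. Everything here is an assumption made for the example: the state is a single integer, `world_model` is a hand-written stand-in for a learned predictor, and `plan` simply tries every action sequence up to a fixed horizon. A real system would learn the model from video and use a far more efficient search.

```python
from itertools import product


def world_model(state, action):
    """Hypothetical stand-in for a learned predictor.

    Given the state at time T and an action, return the predicted
    state at time T+1. Here the "world" is just a position on a line.
    """
    return state + action


def plan(start, goal, actions=(-1, 0, 1), horizon=4):
    """Search for the action sequence whose predicted outcome is closest to the goal.

    Every sequence of `horizon` actions is rolled out through the
    world model; the cost is the distance between the predicted
    final state and the goal.
    """
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        state = start
        for action in seq:
            state = world_model(state, action)  # imagine one step ahead
        cost = abs(goal - state)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost


if __name__ == "__main__":
    seq, cost = plan(start=0, goal=3)
    print("planned actions:", seq, "remaining distance:", cost)
```

The point of the sketch is the division of labour: the model only predicts what happens next, while the planner uses those predictions to choose actions that minimise an objective.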
