In addition to knowing the 3D geometry of a scene, can we enrich these representations with semantics: this is a table, this is a chair? The long-term goal is to develop agents and robots that can move through any space without additional training and simply understand what is happening around them. This is applicable in many fields, and the team wants the model to be as general as possible.

One of the main challenges the authors initially encountered is that these foundation models started out performing reconstruction in a pairwise manner: you take two images, produce an output for that pair, and then run some form of global optimization to align the results. For 3D panoptic segmentation, this turned out to be a challenge. Then came models that support multi-view prediction directly: you feed many images in, run a single feed-forward pass, and immediately get the outputs (the two styles are sketched in code below). This is the MUSt3R model, a scalable multi-view version of DUSt3R, both developed by the team: “It sort of all clicked together!”

Another major challenge, maybe an even bigger one, was the data. There is a lot of 3D data available to train these foundation models, but only a small portion of it has segmentation annotations, for instance for panoptic segmentation, so they used one dataset of about 700 scenes. That is many images, but the diversity inside those images was very limited. You can imagine there are only around 700 different types of chairs, probably even fewer, because most scenes were recorded in the same institutions or the same types of places. The team had to find a way to capture more visual diversity, and Lojze likes the solution they found: “We use a combination of 3D data
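To make the earlier contrast between pairwise and multi-view reconstruction concrete, here is a minimal sketch in Python. The interfaces (`pair_model`, `multi_view_model`, `global_align`) are hypothetical stand-ins for illustration, not the actual DUSt3R or MUSt3R APIs.

```python
from itertools import combinations

def reconstruct_pairwise(images, pair_model, global_align):
    """DUSt3R-style pipeline: predict per-pair outputs, then align them globally."""
    pair_preds = {}
    for i, j in combinations(range(len(images)), 2):
        # Each forward pass only ever sees two views at a time.
        pair_preds[(i, j)] = pair_model(images[i], images[j])
    # A separate optimization stage registers all pairwise outputs into one
    # consistent scene; any per-pixel labels must survive this alignment step.
    return global_align(pair_preds)

def reconstruct_multiview(images, multi_view_model):
    """MUSt3R-style pipeline: a single feed-forward pass over all views."""
    # The network ingests the whole image set at once and directly returns
    # scene-level outputs, so no post-hoc alignment is needed.
    return multi_view_model(images)
```

The sketch only shows why the multi-view route is attractive for panoptic 3D reconstruction: the scene-level output comes out of one pass, instead of having to be stitched together afterwards.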