“Originally, people were more focused on static scenes,” Weirong says. “There are very well-established methods for bundle adjustment that try to estimate the camera pose and recover 3D geometry from multi-view images or RGB video. However, the world we’re living in today is a 4D dynamic world.”

The difficulty comes from the coupling of two distinct types of motion: one caused by the camera’s movement, and the other by the motion of objects within the scene. In casual handheld video, both happen simultaneously. Researchers have previously attempted to sidestep the issue by masking out dynamic regions or by modeling moving objects separately, but both approaches have limitations. “The challenge is that there’s no easy, direct way to constrain how a dynamic object moves in 3D,” he points out. “Therefore, it’s hard to reconstruct them!”

His insight was to separate – or decouple – the two motions from the 2D perspective. The resulting framework, BA-Track, introduces the paper’s key contribution: a motion-decoupled point tracker that solves the correspondence problem in dynamic video. The model relies on a dual-network design. “We use learning-based techniques – a transformer-based network, with one part predicting the total motion and the other predicting the dynamic parts,” Weirong explains. “When we combine them, we use total motion…”
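To make the decoupling concrete, here is a minimal PyTorch sketch of the idea: a shared transformer encodes per-point track features, one head predicts the total 2D motion and another predicts the dynamic (object-induced) component, and their difference recovers the camera-induced motion that can feed a standard bundle adjuster. All names, dimensions, and the feature pipeline here are illustrative assumptions, not BA-Track’s actual implementation.

```python
import torch
import torch.nn as nn

class MotionDecoupledTracker(nn.Module):
    """Sketch of a dual-head tracker: a shared transformer encodes point-track
    features; one head predicts the total 2D motion, another the dynamic part.
    Subtracting the two leaves the camera-induced (static) motion."""

    def __init__(self, feat_dim=128, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.total_head = nn.Linear(feat_dim, 2)    # total 2D displacement
        self.dynamic_head = nn.Linear(feat_dim, 2)  # object-induced component

    def forward(self, track_feats):
        # track_feats: (batch, num_points, feat_dim) per-point features
        h = self.encoder(track_feats)
        total = self.total_head(h)      # camera motion + object motion
        dynamic = self.dynamic_head(h)  # object motion alone
        static = total - dynamic        # camera-induced motion, usable for BA
        return total, dynamic, static

# Usage with dummy features for 256 tracked points; in practice the static
# tracks would be passed to a conventional bundle-adjustment backend.
tracker = MotionDecoupledTracker()
feats = torch.randn(1, 256, 128)
total, dynamic, static = tracker(feats)
```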