Computer Vision News

ECCV 2024 Paper

Of course, there are always challenges when entering a project like this. One of the most daunting aspects was processing the vast datasets required to make the system work. "It was months of tedious work," Robin admits. "While we used off-the-shelf methods to extract character and camera poses from videos with good accuracy, we had a lot of post-processing to do. It wasn't the most interesting part of the project, but at the end of the day, it's what makes it work!"

A crucial part of the work is the camera generation framework known as 'DIRECTOR,' which stands for DiffusIon tRansformEr Camera TrajectORy and is designed to create smooth and realistic camera movements based on textual prompts. It is a diffusion-based model that learns the distribution of a ground-truth dataset. The data distribution is progressively perturbed with Gaussian noise until it matches a standard normal distribution. A neural network is then used to iteratively denoise samples drawn from that Gaussian, ultimately generating new camera trajectories that align with the ground-truth distribution.

While DIRECTOR is based on established diffusion theory, the architecture draws inspiration from the Diffusion Transformer (DiT) model, which proposed different configurations and ways to incorporate conditioning within the diffusion framework. "We took inspiration from DiT and wanted to put this kind of architecture into the motion world," Robin explains. "Here, we're dealing with camera movement, but we could put our work in the human motion community. It would be the closest community to ours."

Looking to the future, Robin tells us the team is already working on a …
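The diffusion process Robin describes can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the authors' DIRECTOR code): a camera trajectory is modeled as an array of per-frame poses, the forward process perturbs it toward a Gaussian, and the reverse process starts from pure noise and iteratively denoises it. The `toy_denoiser` is a placeholder for the trained transformer, which in the real system would also condition on the text prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                     # number of diffusion steps (assumed)
frames, pose_dim = 30, 6   # 30 frames, 6-DoF camera pose per frame (assumed)

# Linear noise schedule beta_t and its cumulative products.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, eps):
    """Forward process: perturb a clean trajectory x0 with Gaussian noise.
    At large t the result is close to a standard normal sample."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def toy_denoiser(x_t, t):
    """Placeholder for the trained transformer: predicts the added noise.
    A real model would also take the text-prompt conditioning as input."""
    return x_t * np.sqrt(1.0 - alpha_bar[t])

def p_sample_loop():
    """Reverse process: start from Gaussian noise, denoise step by step."""
    x = rng.standard_normal((frames, pose_dim))
    for t in reversed(range(T)):
        eps_hat = toy_denoiser(x, t)
        # Standard DDPM-style posterior mean given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x  # a trajectory-shaped sample from the learned distribution

traj = p_sample_loop()
print(traj.shape)  # (30, 6)
```

In the real system the denoiser is a transformer trained so that reversing the noise reproduces trajectories from the ground-truth distribution; here the placeholder only demonstrates the mechanics of the sampling loop.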