sticking to its very clear goals, including avoiding the need to determine transition lengths manually, which can be ambiguous and subjective, and not applying any postprocessing. “We observed in prior works that if you closely analyzed the motions they were generating, in the boundaries of this postprocessing, they showed some artifacts and abrupt transitions,” German continues. “These were a result of this artificial postprocessing they were applying, and we didn’t want that.”

To transfer the semantics of the text into the motion, the model needs to know the absolute position of each frame within the sequence. “Let’s say we want to generate five minutes of motions of different actions sequentially,” German suggests. “We need to know the beginning and ending of each action. Also, we want it to be invariant to the absolute position because we don’t have training data that takes as long as five minutes. We want a method that can generate motion, no matter at which position, which is against having access to the absolute position. We have two things fighting against each other and need to harmonize them somehow.”

A breakthrough came in the form of Blended Positional Encodings (BPE), a new concept for diffusion models that uses both absolute and relative positional encodings in the denoising process. In essence, diffusion models transform noise into the target motion. Initially, absolute information is used to recover the global motion coherence, and then relative positions are used to build smooth and realistic transitions between actions.
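To make that idea concrete, here is a minimal, self-contained sketch (in Python/NumPy) of how a denoising step could lean on absolute positional information early on and relative positional information later. It is an illustration under stated assumptions, not the authors’ implementation: the function names (sinusoidal_ape, relative_bias, blended_attention_bias) and the hard switch at the midpoint of the denoising schedule are hypothetical choices made for readability.

```python
# Illustrative sketch of the intuition behind Blended Positional Encodings:
# early (noisy) denoising steps use absolute positions for global coherence,
# later (cleaner) steps use relative positions for smooth local transitions.
# All names and the midpoint switch are assumptions, not the paper's code.
import numpy as np

def sinusoidal_ape(num_frames: int, dim: int) -> np.ndarray:
    """Standard sinusoidal absolute positional encoding, one vector per frame."""
    pos = np.arange(num_frames)[:, None]              # (T, 1)
    i = np.arange(dim // 2)[None, :]                  # (1, D/2)
    angles = pos / np.power(10000.0, 2 * i / dim)
    pe = np.zeros((num_frames, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def relative_bias(num_frames: int, max_dist: int = 32) -> np.ndarray:
    """Toy relative positional bias: depends only on the offset between frames."""
    offsets = np.arange(num_frames)[:, None] - np.arange(num_frames)[None, :]
    return -np.abs(np.clip(offsets, -max_dist, max_dist)) / max_dist  # (T, T)

def blended_attention_bias(step: int, total_steps: int,
                           num_frames: int, dim: int) -> np.ndarray:
    """
    Choose the positional scheme from the denoising step:
    early steps -> absolute encodings recover the global structure,
    late steps  -> relative bias shapes smooth transitions between actions.
    """
    if step < total_steps // 2:                       # early, noisy steps
        ape = sinusoidal_ape(num_frames, dim)
        return ape @ ape.T / np.sqrt(dim)             # absolute-position similarity
    return relative_bias(num_frames)                  # late, nearly clean steps

if __name__ == "__main__":
    T, D, steps = 8, 16, 50
    early = blended_attention_bias(step=5, total_steps=steps, num_frames=T, dim=D)
    late = blended_attention_bias(step=45, total_steps=steps, num_frames=T, dim=D)
    print("early-step bias (absolute positions):", early.shape)
    print("late-step bias (relative offsets):", late.shape)
```

The point of the sketch is only the scheduling: the positional information fed to the model is a function of where it is in the denoising process, which is how the two competing requirements German describes can coexist.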