CVPR Daily - Friday

“Our work is a pre-training of such networks,” Sophia explains. “We have come up with two auxiliary tasks. One is a geometric task, and the second is to learn semantic information. The paper’s novelty comes from combining these two pre-trainings, giving the network, at the same time, 3D geometry information and semantic information through distillation.”

Sophia proposes OccFeat, a self-supervised pre-training method for camera-only BEV segmentation networks. It pre-trains the BEV network via occupancy prediction and feature distillation tasks.

“The pre-training task we use is asking the model to predict a 3D volume from images,” she continues. “This volume encodes occupancy information, whether or not a 3D voxel is occupied, and predicts features in the occupied voxels that come from a pre-trained image model.”

Occupancy prediction provides a 3D geometric understanding of the scene, but the geometry learned is class-agnostic. To address this, Sophia integrates semantic information into the model in 3D space through distillation from a self-supervised pre-trained image foundation model, DINOv2. Models pre-trained with OccFeat show improved BEV semantic segmentation performance, especially in low-data scenarios.

Figure: Overview of OccFeat’s self-supervised BEV pre-training approach. OccFeat attaches an auxiliary pre-training head on top of the BEV network. This head “unsplats” the BEV features to a 3D feature volume and uses it to predict (a) the 3D occupancy of the scene (occupancy reconstruction loss) and (b) high-level self-supervised image features characterizing the occupied voxels (occupancy-guided distillation loss).
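To make the two auxiliary objectives concrete, here is a minimal NumPy sketch of how an occupancy reconstruction loss and an occupancy-guided feature distillation loss could be combined. All function names, tensor shapes, and loss choices (binary cross-entropy for occupancy, cosine distance for distillation) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def occfeat_losses(pred_occ_logits, target_occ, pred_feats, teacher_feats):
    """Illustrative sketch of OccFeat-style pre-training losses.

    pred_occ_logits: (N,)   per-voxel occupancy logits from the BEV head
    target_occ:      (N,)   binary ground-truth occupancy per voxel
    pred_feats:      (N, C) features predicted for each voxel
    teacher_feats:   (N, C) frozen teacher (e.g. DINOv2-style) features

    Shapes and loss choices are assumptions for illustration only.
    """
    eps = 1e-7

    # (a) Occupancy reconstruction: binary cross-entropy over all voxels,
    #     encouraging the model to learn class-agnostic 3D geometry.
    p = 1.0 / (1.0 + np.exp(-pred_occ_logits))
    occ_loss = -np.mean(target_occ * np.log(p + eps)
                        + (1.0 - target_occ) * np.log(1.0 - p + eps))

    # (b) Occupancy-guided distillation: cosine distance between predicted
    #     and teacher features, averaged over OCCUPIED voxels only, so the
    #     semantic signal is injected where there is actual geometry.
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

    cos_sim = np.sum(l2norm(pred_feats) * l2norm(teacher_feats), axis=-1)
    occupied = target_occ > 0.5
    distill_loss = np.mean(1.0 - cos_sim[occupied]) if occupied.any() else 0.0

    return occ_loss, distill_loss
```

The key design point the article describes is visible in the mask: geometry is supervised everywhere, while semantic distillation is restricted to occupied voxels, which is what makes the distillation "occupancy-guided."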

RkJQdWJsaXNoZXIy NTc3NzU=