Computer Vision News - November 2021
Best of ICCV 2021
so I think that’s the most difficult part to make sure that one run is perfect for these kinds of large models.”
The idea for the work was a joint one between Aishwarya and Nicolas Carion, a postdoc in her lab and the last author on the paper. His paper, End-to-End Object Detection with Transformers, which was an oral at ECCV 2020, presented a new transformer-based object detection framework called DEtection TRansformer (DETR). It performs object detection end to end, with no need for non-differentiable components such as non-maximum suppression. Previous works could not be trained end to end because gradients do not flow through these components.
“This work started in a slightly different way,” Aishwarya discloses. “We started out working on extending DETR to take in multi-modal input. Then after a few months of trying simpler versions, we came up with this novel approach of doing it through modulated detection.”
A number of multi-modal understanding papers have appeared in the last couple of years, since transformers began to be used everywhere, but this paper takes a different approach from the others, which might just be what caught the ICCV reviewers’ attention.
“You can see on many benchmarks that we show a huge improvement, even compared to papers that came out one or two months before us,” Aishwarya points out. “On one of the benchmarks, we made the error rate half of what it was before! We had more than five points increase on the referring expressions dataset. I think that will have convinced them because we offer a new approach to the same problem, and it clearly works very well.”
Aishwarya reveals she is already working on a follow-up paper for CVPR next year. Can she reveal any details at this early stage?
“We’re trying to learn how to train our model with less supervision than we have in this paper because even though it worked really well as a method, it still requires quite strong supervision, so one direction would be to try to
Oral Presentation