Computer Vision News - May 2018
Results: The performance of the CNN architectures described above is presented in the table below. All of the CNNs outperformed the fully connected baseline, with Inception and ResNet achieving the best performance. The first five rows show that ResNet performed best; note that the last row gives the results of a second ResNet run with a longer training schedule, which improved results even further.

The authors then investigated how training on different subsets of the labels affects performance, by forcing the network to generalize. The following table gives the results for varying label-set sizes; all models are variants of ResNet-50 trained on 70M videos.

Below you can see three frames captured from a video classified by ResNet-50, with the instantaneous model outputs overlaid. The 16 labels (out of 30K) with the highest peak scores over the course of the video were selected for display; the different sound sources present at different points in the video are clearly distinguished. More in this video. Code and examples can be found here.

Conclusions: The authors demonstrate that state-of-the-art CNN architectures can be used for audio classification and achieve excellent accuracy compared with a simple fully connected network or earlier image-classification architectures, and that training on larger label sets can improve performance further. Viewing classified video segments also gives a subjective impression of the model's performance.
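For readers who want to experiment, here is a minimal sketch, not the authors' published code, of how an off-the-shelf ResNet-50 can be repurposed for multi-label audio classification over spectrogram patches. The single-channel 96x64 log-mel input size, the reduced label count, and the sigmoid multi-label training head are our own assumptions for illustration.

```python
# Sketch only: adapting a stock ResNet-50 to multi-label audio classification
# on log-mel spectrogram patches. Input shape, label count and loss are
# assumptions made for this example, not details taken from the article.
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_LABELS = 3000  # hypothetical, reduced label-set size for the example

model = resnet50()  # randomly initialised backbone, no pretrained weights
# Accept single-channel spectrograms instead of 3-channel RGB images.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Replace the 1000-way ImageNet classifier with a multi-label head.
model.fc = nn.Linear(model.fc.in_features, NUM_LABELS)

criterion = nn.BCEWithLogitsLoss()  # one independent sigmoid per label

# Dummy batch: 8 patches of 96 time frames x 64 mel bands.
x = torch.randn(8, 1, 96, 64)
y = torch.randint(0, 2, (8, NUM_LABELS)).float()
loss = criterion(model(x), y)
loss.backward()
```

The only changes to the image model are the first convolution (to accept one input channel) and the final layer (to emit one score per audio label), which is what makes reusing image-classification architectures for audio so convenient.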
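The per-video display described above boils down to a simple selection step: take the peak of each label's score curve over time and keep the labels with the highest peaks. Below is a short sketch of that step; the array shapes, function name, and random scores are our own assumptions, not the authors' code.

```python
# Sketch of the display-label selection: keep the k labels whose
# instantaneous scores peak highest anywhere in the video.
import numpy as np

def top_labels_by_peak(frame_scores: np.ndarray, label_names, k: int = 16):
    """frame_scores: (num_frames, num_labels) per-frame model outputs."""
    peak_per_label = frame_scores.max(axis=0)       # peak score over time
    top = np.argsort(peak_per_label)[::-1][:k]      # k highest-peaking labels
    return [(label_names[i], float(peak_per_label[i])) for i in top]

# Example with random scores for 300 frames and 30,000 labels.
rng = np.random.default_rng(0)
scores = rng.random((300, 30000))
names = [f"label_{i}" for i in range(30000)]
for name, peak in top_labels_by_peak(scores, names):
    print(name, round(peak, 3))
```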