Enhancing contrastive learning with temporal cognizance for audio-visual representation generation

2022
Audio-visual data allows us to leverage different modalities for downstream tasks: the individual streams can complement each other in a given task, yielding a model with improved performance. In this work, we present experimental results on action recognition and video summarization tasks. The proposed modeling approach builds on recent advances in contrastive-loss-based audio-visual representation learning. Temporally cognizant audio-visual discrimination is achieved in a Transformer model by learning with a masked feature reconstruction loss over a fixed time window, in addition to the contrastive loss. Overall, our results indicate that adding temporal information significantly improves the performance of the contrastive-loss-based framework. We achieve an action classification accuracy of 66.2%, versus 64.7% for the next best baseline, on the HMDB dataset. For video summarization, we attain an F1 score of 43.5 versus 42.2 on the SumMe dataset.
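The combined objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it pairs an InfoNCE-style contrastive term, which matches audio and visual features at the same timestep, with a mean-squared reconstruction loss computed only over masked timesteps. The function names, the weighting factor `alpha`, and the temperature value are all illustrative assumptions.

```python
import numpy as np

def contrastive_loss(audio, visual, temp=0.1):
    """InfoNCE-style loss: audio/visual features from the same
    timestep are positives, all other timesteps are negatives."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / temp                          # (T, T) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # positives on the diagonal

def masked_reconstruction_loss(features, reconstructed, mask):
    """MSE over the masked timesteps only (mask: boolean array of shape (T,))."""
    diff = (features - reconstructed)[mask]
    return np.mean(diff ** 2)

def total_loss(audio, visual, recon_audio, mask, alpha=1.0):
    """Contrastive term plus (hypothetically weighted) reconstruction term."""
    return (contrastive_loss(audio, visual)
            + alpha * masked_reconstruction_loss(audio, recon_audio, mask))
```

In this sketch, `recon_audio` stands in for the Transformer's predictions of the masked features within the fixed time window; in training, the gradient of the reconstruction term would encourage temporally aware representations alongside the cross-modal discrimination.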
