Dynamic inference with grounding based vision and language models

By Burak Uzkent, Amanmeet Garg, Wentao Zhu, Keval Doshi, Jingru Yi, Andy Wang, Mohamed Omar
2023
Transformers have recently been applied successfully to vision and language tasks. For example, recent image-and-language models with more than 200M parameters learn visual grounding during pre-training and show impressive results on downstream vision and language tasks. However, these large models contain a great deal of computational redundancy, which limits their run-time efficiency. To address this problem, we propose dynamic inference for grounding-based vision and language models, conditioned on the input image-text pair. We first design an approach that dynamically skips multi-head self-attention and feed-forward network layers across the two backbones and the multimodal network. Additionally, we propose dynamic token pruning and fusion for the two backbones: we remove redundant tokens at different levels of the backbones and fuse the image tokens with the language tokens in an adaptive manner. To learn policies for dynamic inference, we train agents using reinforcement learning. To this end, we replace the CNN backbone in a recent grounding-based vision and language model, MDETR, with a vision transformer, calling the result ViTMDETR. We then apply our dynamic inference method to ViTMDETR, called D-ViTMDETR, and perform experiments on image-language tasks. Our results show that we can improve the run-time efficiency of the state-of-the-art models MDETR and GLIP by up to ∼50% on Referring Expression Comprehension, Referring Expression Segmentation, and VQA, with a maximum accuracy drop of only ∼0.3%.
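
To illustrate the layer-skipping idea described in the abstract, below is a minimal, hypothetical PyTorch sketch of input-conditioned skipping of a transformer layer's self-attention and feed-forward sublayers. The module names (SkipPolicy, DynamicEncoderLayer) and the Gumbel-softmax gating are illustrative assumptions, not the paper's implementation; the paper trains its skip and token-pruning policies with reinforcement learning.

# Minimal sketch (not the paper's code) of per-input skipping of MHSA/FFN sublayers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipPolicy(nn.Module):
    """Predicts per-sublayer keep/skip decisions from a pooled token summary."""
    def __init__(self, dim: int, num_decisions: int):
        super().__init__()
        self.head = nn.Linear(dim, num_decisions * 2)  # logits for {skip, keep}

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        pooled = tokens.mean(dim=1)                      # (B, dim)
        logits = self.head(pooled).view(-1, 2)           # (B * num_decisions, 2)
        # Differentiable discrete gates; an RL-trained agent could replace this.
        gates = F.gumbel_softmax(logits, tau=1.0, hard=True)[:, 1]
        return gates.view(tokens.size(0), -1)            # (B, num_decisions)

class DynamicEncoderLayer(nn.Module):
    """Transformer layer whose MHSA and FFN sublayers can be skipped per sample."""
    def __init__(self, dim: int = 256, heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, gate_attn, gate_ffn):
        # gate_* are (B,) tensors in {0, 1}; 0 skips that sublayer for the sample.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + gate_attn[:, None, None] * attn_out
        x = x + gate_ffn[:, None, None] * self.ffn(self.norm2(x))
        return x

if __name__ == "__main__":
    tokens = torch.randn(4, 196, 256)               # e.g. ViT patch tokens
    policy = SkipPolicy(dim=256, num_decisions=2)   # one gate per sublayer
    gates = policy(tokens)
    layer = DynamicEncoderLayer()
    out = layer(tokens, gates[:, 0], gates[:, 1])
    print(out.shape)                                # torch.Size([4, 196, 256])

In this sketch, samples whose gate is 0 simply bypass the corresponding sublayer via the residual connection; in practice a real implementation would also avoid computing the skipped sublayer to realize the run-time savings.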