Aligning vision language models with contrastive learning
2024
In recent years, Vision Language Models (VLMs) have achieved significant advances, building on the success of large language models. The common strategy for aligning vision and language models involves a two-step process: an alignment (or pretraining) stage and an instruction tuning stage. During the alignment stage, a projection module is trained on a paired image-text dataset to map image embeddings into the language space. In the instruction tuning stage, the model is trained to answer specific questions about the images. In this work, we focus on the alignment stage and identify a significant gap between the embeddings of image and text pairs when VLMs are trained with a next-token prediction loss. To address this issue, we employ a contrastive training strategy similar to that used by Radford et al. [39] alongside next-token prediction training. Our findings indicate that this joint pretraining method improves VLM performance by approximately 2% across various multimodal evaluations without any additional compute or training data. To assess the robustness and generalizability of joint training, we experiment with multiple large language models and observe similar performance improvements. Furthermore, we explore the importance of prompts in contrastive training with various LLM options. We also provide a detailed analysis of the choice of vision encoder, projection layer, and LLM to use with the proposed joint training approach.
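A minimal sketch of the joint objective described above, assuming a CLIP-style symmetric InfoNCE contrastive term added to the standard next-token prediction loss; the function name, pooling of embeddings, temperature, and weighting coefficient are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def joint_alignment_loss(image_emb, text_emb, lm_logits, target_ids,
                         temperature=0.07, contrastive_weight=1.0):
    """Combine next-token prediction with a CLIP-style contrastive loss.

    image_emb:  (B, D) pooled image embeddings from the projection module
    text_emb:   (B, D) pooled text embeddings from the language model
    lm_logits:  (B, T, V) language-model logits for the caption tokens
    target_ids: (B, T) ground-truth caption token ids
    """
    # Standard next-token prediction (autoregressive cross-entropy).
    ntp_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        target_ids.reshape(-1),
    )

    # CLIP-style symmetric InfoNCE over the in-batch image-text pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive_loss = 0.5 * (
        F.cross_entropy(logits, labels)               # image -> text
        + F.cross_entropy(logits.t(), labels)         # text  -> image
    )

    # The relative weight of the contrastive term is an assumed hyperparameter.
    return ntp_loss + contrastive_weight * contrastive_loss
```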