Vision-language models, which map images and text to a common representational space, have demonstrated remarkable performance on a wide range of multimodal AI tasks. But they’re typically trained on text-image pairs: each text input is associated with a single image.
This limits the models’ applicability. You might, for instance, want a vision-language model to take two input images and identify differences between them, or you might want to make inferences from a 3-D fusion of ultrasound or x-ray cross sections. In the Amazon Store, multiple images are frequently associated with a single product, and you might want to execute a query that factors in several of those images.
The standard way around this limitation is to concatenate a set of images and feed them to a model as, essentially, one enormous image. But this misses an opportunity to create a richer representation — or embedding — that systematically draws on complementary information from multiple images.
At this year’s Winter Conference on Applications of Computer Vision (WACV), we presented a new method for producing an aggregated embedding of multiple images, which improves performance on several multimodal AI tasks.
We considered four methods of fusing multiple images: one computes an element-wise average of the embeddings of the individual images; one uses max pooling, which records the highest value for each image feature across all images; and the other two use neural-network attention mechanisms, one with a gate on the attention values and one without.
We tested our approach on three different tasks: product categorization, product attribute inference, and image captioning. As a baseline, we used a model that takes concatenated images as input, fine-tuned on each task, and we used three metrics to measure the results: accuracy, precision, and recall.
Across the board, the model using an ungated attention mechanism outperformed the others, sometimes by a considerable margin. On the image-captioning task, for instance, it was 6.4% better than baseline, and on the product attribute inference task, its precision and recall were 6.9% and 7.9% better than baseline, respectively.
Model architecture
Vision-language models typically involve an image encoder, which produces an embedding of an input image, and a projection layer, which learns to project the image embedding into the representational space of a trained large language model (LLM).
Sometimes, a query embedding generator intervenes between the image encoder and the projection layer. The query embedding generator is trained on a combination of image embeddings and the associated image captions, so it learns linguistic representations of the image embeddings that can help the projection layer better navigate the LLM’s representational space.
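As a rough illustration of this pipeline, the sketch below chains a visual encoder and a projection layer. All dimensions, weights, and function names here (`image_encoder`, `projection_layer`) are illustrative stand-ins, not the actual model components:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 256   # visual encoder output dimension (illustrative)
LLM_DIM = 512   # LLM token-embedding dimension (illustrative)

# Random weights stand in for trained parameters.
W_enc = rng.standard_normal((32 * 32 * 3, EMB_DIM)) * 0.01
W_proj = rng.standard_normal((EMB_DIM, LLM_DIM)) * 0.01

def image_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for a visual encoder: flatten the image and map it to EMB_DIM."""
    return image.ravel() @ W_enc

def projection_layer(embedding: np.ndarray) -> np.ndarray:
    """Linear projection of an image embedding into the LLM's representational space."""
    return embedding @ W_proj

img = rng.standard_normal((32, 32, 3))          # toy "image"
llm_token = projection_layer(image_encoder(img))
print(llm_token.shape)                           # (512,)
```

In the full architecture, the MIVC described below would sit between `image_encoder` and `projection_layer`, collapsing several per-image embeddings into one before projection.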
We introduce a multiple-instance visual component (MIVC) that, in either architecture, receives the output of the visual encoder and creates a unified representation of multiple input images.
Permutation-invariant attention
The visual encoder learns to recognize features of the input data — which might be low-level properties like color gradients across image patches or higher-level properties like particular shapes — and assigns each input a value along each feature dimension.
Our first MIVC method simply averages the feature values of the input images, while max pooling selects the highest value for each feature across all the images.
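These two pooling operations reduce to simple reductions over the image axis of an embedding matrix. A minimal sketch on toy per-image embeddings:

```python
import numpy as np

# One row per input image, one column per feature (toy values).
embeddings = np.array([
    [0.2, 0.9, 0.1],
    [0.8, 0.4, 0.3],
    [0.5, 0.6, 0.7],
])

# Element-wise average of the per-image embeddings.
avg_fused = embeddings.mean(axis=0)

# Max pooling: highest value for each feature across all images.
max_fused = embeddings.max(axis=0)
```

Both operations are permutation invariant by construction: reordering the rows of `embeddings` leaves `avg_fused` and `max_fused` unchanged.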
The attention mechanism is fine-tuned on particular tasks and learns which features of which images are most important for those tasks. We want the representation of multiple images to be invariant to the order in which the images pass to the visual encoder, so we devised an attention mechanism whose attention value for each image depends not only on that image's embedding but also on the embeddings of the other images.
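A sketch of such a permutation-invariant attention fusion, following the common attention-based pooling formulation from the multiple-instance learning literature (the dimensions and random weights here are illustrative, not the paper's trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_fuse(H, V, w):
    """Fuse per-image embeddings H (rows) into one embedding.

    Each image i gets a score w . tanh(V h_i); a softmax over images
    turns the scores into weights, so every weight depends on every
    image's embedding. Permuting the images only permutes the weights,
    leaving the weighted sum -- the fused embedding -- unchanged.
    """
    scores = np.tanh(H @ V.T) @ w        # one scalar score per image
    a = np.exp(scores - scores.max())    # numerically stable softmax
    a /= a.sum()                         # attention weights sum to 1
    return a @ H                         # weighted sum of embeddings

D, L = 8, 4                              # embedding dim, attention dim (toy)
H = rng.standard_normal((3, D))          # three image embeddings
V = rng.standard_normal((L, D))          # attention parameters (random here)
w = rng.standard_normal(L)

fused = attention_fuse(H, V, w)
permuted = attention_fuse(H[[2, 0, 1]], V, w)
assert np.allclose(fused, permuted)      # order of images doesn't matter
```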
The gated attention mechanism is like the basic attention mechanism, except that it learns an additional sigmoid function that boosts higher attention values and reduces lower ones, in an attempt to isolate the most crucial features of the input signal. In our experiments, however, it didn’t work as well as the basic attention mechanism.
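The gated variant can be sketched the same way, adding a learned sigmoid gate that modulates the attention scores. This follows the standard gated-attention formulation from the multiple-instance learning literature rather than the paper's exact implementation, and the weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def gated_attention_fuse(H, V, U, w):
    """Gated attention fusion: score_i = w . (tanh(V h_i) * sigmoid(U h_i)).

    The sigmoid gate multiplies the tanh activations element-wise,
    sharpening the contrast between high and low attention scores.
    """
    gate = 1.0 / (1.0 + np.exp(-(H @ U.T)))    # sigmoid gate per image
    scores = (np.tanh(H @ V.T) * gate) @ w
    a = np.exp(scores - scores.max())          # softmax over images
    a /= a.sum()
    return a @ H

D, L = 8, 4                                    # toy dimensions
H = rng.standard_normal((3, D))                # three image embeddings
V = rng.standard_normal((L, D))
U = rng.standard_normal((L, D))                # gate parameters
w = rng.standard_normal(L)

fused = gated_attention_fuse(H, V, U, w)
print(fused.shape)                             # (8,)
```

Like the ungated version, this is permutation invariant: the softmax normalizes over the whole set of images, so reordering them leaves the fused embedding unchanged.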
Because we fine-tuned the attention mechanism on the target task, we fine-tuned the baseline model, too, to ensure fair comparison. But on the attribute inference and captioning tasks, fine-tuning actually diminished the baseline model’s performance. If we use the zero-shot concatenated-image model as the baseline, the improvements offered by our method shrink slightly: on the image-captioning task, our advantage contracts to 5.6%, and on the product attribute inference task, the advantages on precision and recall contract to 5.5% and 7%. But that’s still a significant difference.
At present, the attention mechanism applies only to the visual encoding pipeline, and it operates under the assumption that all images are independently and identically distributed. In ongoing work, we’re investigating whether cross-modal attention and incorporating correlations across images offer any further improvements.