What matters when building vision language models for product image analysis?
2025
This paper investigates multi-modal large language models (MLLMs) for predicting product features from images, comparing fine-tuned and proprietary models. We introduce two domain-specific benchmarks: (1) the Inductive Bias vs. Image Evidence (IBIE) Benchmark, which evaluates MLLMs’ ability to distinguish image-derived features from latent knowledge, and (2) Catalog-bench, which assesses feature prediction using Catalog terminology. Our fine-tuned model outperforms proprietary models such as Gemini by 9.4% and 29.13% on these benchmarks, respectively. We also address the crucial aspect of computational efficiency, exploring cost-effective deployment under limited hardware resources. The significance of this work extends beyond e-commerce to physical retail, where efficient MLLMs are essential for real-time processing of visual data from store cameras and shelf sensors. These models enable automated inventory management, produce quality monitoring, and planogram compliance while operating within in-store computing constraints. Such efficiency is particularly valuable in physical retail environments, where immediate decisions about restocking and quality control are critical, and it also enables real-time assistance to customers seeking information about product details, ingredients, and nutritional content.
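The abstract does not describe the evaluation protocol, but the IBIE idea lends itself to a short illustration. Below is a minimal sketch, assuming a hypothetical `predict(image_path, question)` stub in place of the actual fine-tuned or proprietary MLLM, of how an IBIE-style item might be scored: the model is credited only when it reports the feature visible in the image rather than the answer its latent knowledge suggests. The item schema and all names are assumptions for illustration, not the benchmark's real format.

```python
"""Minimal sketch of an IBIE-style evaluation loop (hypothetical).

The paper does not publish code here; this only illustrates the idea of
scoring image evidence against latent knowledge. `predict` stands in for
whatever MLLM is under test, and every field name below is an assumption,
not the benchmark's actual schema.
"""

from dataclasses import dataclass


@dataclass
class IBIEItem:
    image_path: str    # product image shown to the model
    question: str      # feature question about that image
    image_answer: str  # answer supported by the visual evidence
    prior_answer: str  # answer a model might give from latent knowledge


def predict(image_path: str, question: str) -> str:
    """Stub for the MLLM under evaluation; replace with a real model call."""
    return "stainless steel"  # placeholder so the sketch runs end to end


def ibie_accuracy(items: list[IBIEItem]) -> float:
    """Fraction of items where the model follows the image, not its prior."""
    correct = 0
    for item in items:
        answer = predict(item.image_path, item.question).strip().lower()
        if answer == item.image_answer.lower():
            correct += 1
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Toy item: suppose this kettle model is usually listed as red, but the
    # pictured unit is stainless steel; the image evidence should win.
    items = [
        IBIEItem(
            image_path="kettle_123.jpg",
            question="What color is this kettle?",
            image_answer="stainless steel",
            prior_answer="red",
        )
    ]
    print(f"IBIE accuracy: {ibie_accuracy(items):.2%}")
```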