What matters when building vision language models for product image analysis?
2025
This paper investigates multi-modal large language models (MLLMs) for predicting product features from images, comparing fine-tuned and proprietary models. We introduce two domain-specific benchmarks: (1) the Inductive Bias vs. Image Evidence (IBIE) Benchmark, which evaluates MLLMs’ ability to distinguish image-derived features from latent knowledge, and (2) Catalog-bench, which assesses feature prediction using Catalog terminology. Our fine-tuned model outperforms proprietary models such as Gemini by 9.4% and 29.13% on these benchmarks, respectively. We also address the crucial aspect of computational efficiency, exploring cost-effective deployment under limited hardware resources. The significance of this work extends beyond e-commerce to physical retail, where efficient MLLMs are essential for real-time processing of visual data from store cameras and shelf sensors. These models enable automated inventory management, produce quality monitoring, and planogram compliance while operating within in-store computing constraints. Such efficiency is particularly valuable in physical retail environments, where immediate decisions about restocking and quality control are critical, and it also enables real-time assistance to customers seeking information about product details, ingredients, and nutritional content.
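The abstract does not describe the evaluation protocol, but the IBIE idea lends itself to a short illustration. Below is a minimal sketch, assuming a hypothetical `predict(image_path, question)` stub in place of the actual fine-tuned or proprietary MLLM, of how an IBIE-style item might be scored: the model is credited only when it reports the feature visible in the image rather than the answer its latent knowledge suggests. The item schema and all names are assumptions for illustration, not the benchmark's real format.

```python
"""Minimal sketch of an IBIE-style evaluation loop (hypothetical).

The paper does not publish code here; this only illustrates the idea of
scoring image evidence against latent knowledge. `predict` stands in for
whatever MLLM is under test, and every field name below is an assumption,
not the benchmark's actual schema.
"""

from dataclasses import dataclass


@dataclass
class IBIEItem:
    image_path: str    # product image shown to the model
    question: str      # feature question about that image
    image_answer: str  # answer supported by the visual evidence
    prior_answer: str  # answer a model might give from latent knowledge


def predict(image_path: str, question: str) -> str:
    """Stub for the MLLM under evaluation; replace with a real model call."""
    return "stainless steel"  # placeholder so the sketch runs end to end


def ibie_accuracy(items: list[IBIEItem]) -> float:
    """Fraction of items where the model follows the image, not its prior."""
    correct = 0
    for item in items:
        answer = predict(item.image_path, item.question).strip().lower()
        if answer == item.image_answer.lower():
            correct += 1
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Toy item: suppose this kettle model is usually listed as red, but the
    # pictured unit is stainless steel; the image evidence should win.
    items = [
        IBIEItem(
            image_path="kettle_123.jpg",
            question="What color is this kettle?",
            image_answer="stainless steel",
            prior_answer="red",
        )
    ]
    print(f"IBIE accuracy: {ibie_accuracy(items):.2%}")
```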