-
WACV 2025 Workshop on Physical Retail in AI, 2025. This paper investigates multi-modal large language models (MLLMs) for predicting product features from images, comparing fine-tuned versus proprietary models. We introduce two domain-specific benchmarks: (1) Inductive Bias vs. Image Evidence (IBIE) Benchmark, which evaluates MLLMs’ ability to distinguish between image-derived features and latent knowledge, and (2) Catalog-bench, which assesses feature prediction
-
2025. General vision-language models (VLMs) trained on web data struggle to understand and converse about real-world e-commerce product images. We propose a cost-efficient approach for collecting training data to train a generative VLM for e-commerce product images. The key idea is to leverage large-scale, loosely-coupled image-text pairs from e-commerce stores, use a pre-trained LLM to generate multi-modal instruction-following
-
2025. Automated construction of shopping carts from medical prescriptions is a vital prerequisite for scaling up online pharmaceutical services in emerging markets due to the high prevalence of paper prescriptions that are challenging for customers to interpret. We present RxLens, a multi-step end-to-end Large Language Model (LLM)-based deployed solution for automated pharmacy cart construction comprising multiple
-
Diffusion models have revolutionized the landscape of generative AI, particularly in the application of text-to-image generation. However, their powerful capability of generating high-fidelity images raises significant security concerns about the malicious use of state-of-the-art (SOTA) text-to-image diffusion models, notably the risks of misusing personal photos and copyright infringement through the
-
ICASSP 2025, 2025. We propose a low-shot image classification method called LIMO, which can train an accurate image classification model under conditions of acute data scarcity. LIMO uniquely assembles existing knowledge from a set of diverse models and builds a novel mixture of experts architecture for low-shot image classification. LIMO’s architecture introduces a minimal number of new model parameters, such that the added
Related content
-
January 03, 2023: Automated methods with a little human guidance use annotators’ time much more efficiently.
-
December 26, 2022: Combining contrastive training and selection of hard negative examples establishes new benchmarks.
-
December 16, 2022: University of Wisconsin-Madison associate professor and ARA recipient has authored a series of pioneering papers on real-time object instance segmentation.
-
December 09, 2022: Why multimodal identification is a crucial step in automating item identification at Amazon scale.
-
November 22, 2022: Francesco Locatello on the four NeurIPS papers he coauthored this year, which largely concern generalization to out-of-distribution test data.
-
November 15, 2022: Models that map spoken language to objects in an image would make it easier for customers to communicate with multimodal devices.