- 2025: Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language
- 2025: In this paper, we tackle the novel computer vision problem of depth estimation through a translucent barrier. This is an important problem for robotics when manipulating objects through plastic wrapping, or when predicting the depth of items behind a translucent barrier for manipulation. We propose two approaches for providing depth prediction models the ability to see through translucent barriers: removing
- IEEE Robotics and Automation Letters, 2025: We extend our previous work, PoCo [1], and present a new algorithm, Cross-Source-Context Place Recognition (CSCPR), for RGB-D indoor place recognition that integrates global retrieval and reranking into an end-to-end model and keeps the consistency of using Context-of-Clusters (CoCs) [2] for feature processing. Unlike prior approaches that primarily focus on the RGB domain for place recognition reranking
- WACV 2025 Workshop on Physical Retail in AI: This paper investigates multi-modal large language models (MLLMs) for predicting product features from images, comparing fine-tuned versus proprietary models. We introduce two domain-specific benchmarks: (1) the Inductive Bias vs. Image Evidence (IBIE) Benchmark, which evaluates MLLMs' ability to distinguish between image-derived features and latent knowledge, and (2) Catalog-bench, which assesses feature prediction
- 2025: General vision-language models (VLMs) trained on web data struggle to understand and converse about real-world e-commerce product images. We propose a cost-efficient approach for collecting training data to train a generative VLM for e-commerce product images. The key idea is to leverage large-scale, loosely coupled image-text pairs from e-commerce stores, use a pre-trained LLM to generate multi-modal instruction-following
Related content
- February 22, 2024: Method preserves knowledge encoded in teacher model's attention heads even when student model has fewer of them.
- February 14, 2024: Derek Chibuzor utilized his SURE experience to gain "exposure to an aerospace research project in a professional research environment."
- February 07, 2024: New approach enables sustainable machine learning for remote-sensing applications.
- January 19, 2024: Attention-based representation of multi-image inputs improves performance on downstream vision-language tasks.
- December 20, 2023: Novel architectures and carefully prepared training data enable state-of-the-art performance.
- October 27, 2023: Motion vectors, which are common in popular video formats, can be used to efficiently track regions of interest across multiple frames of video, generating motion-aware masks that improve video representation learning.