De-noised vision-language fusion guided by visual cues for e-commerce product search
2024
In e-commerce applications, vision-language multimodal transformer models play a pivotal role in product search. The key to successfully training a multimodal model lies in the alignment quality of the image-text pairs in the dataset. In practice, however, the data is often collected automatically with minimal manual intervention, so the alignment of image-text pairs is far from ideal. In e-commerce, this misalignment can stem from noisy and redundant non-visual-descriptive text attributes in product descriptions. To address this, we introduce the MultiModal alignment-guided Learned Token Pruning (MM-LTP) method. MM-LTP employs token pruning, conventionally used for computational efficiency, to perform online text cleaning during multimodal model training. By enabling the model to discern and discard unimportant tokens, MM-LTP allows training on implicitly cleaned image-text pairs. We evaluate MM-LTP on a benchmark multimodal e-commerce dataset comprising over 710,000 unique Amazon products. Our evaluation centers on visual search, a prevalent e-commerce feature. Through MM-LTP, we demonstrate that refining text tokens enhances training of the paired image branch, which leads to significantly improved visual search performance.
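As a rough illustration of the idea, the sketch below shows one plausible form of learned token pruning for text cleaning: a small scoring head assigns an importance score to each text token and a learned threshold masks out low-scoring tokens before fusion. The class name, scoring head, threshold, and soft-masking temperature are all illustrative assumptions, not the paper's actual MM-LTP implementation.

```python
# Hypothetical sketch of learned token pruning for online text cleaning (PyTorch).
# All names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn


class LearnedTokenPruner(nn.Module):
    """Scores each text token and down-weights tokens below a learned threshold."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)             # per-token importance score
        self.threshold = nn.Parameter(torch.tensor(0.5))   # learned pruning threshold

    def forward(self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor):
        # token_embeddings: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
        scores = torch.sigmoid(self.scorer(token_embeddings)).squeeze(-1)  # (batch, seq_len)
        # Soft keep-mask keeps pruning differentiable during training;
        # at inference, tokens scoring below the threshold can simply be dropped.
        keep = torch.sigmoid((scores - self.threshold) * 10.0) * attention_mask
        pruned_embeddings = token_embeddings * keep.unsqueeze(-1)
        return pruned_embeddings, keep


# Usage: prune noisy product-description tokens before the fusion encoder.
pruner = LearnedTokenPruner(hidden_dim=768)
tokens = torch.randn(2, 32, 768)   # dummy text-token embeddings
mask = torch.ones(2, 32)
cleaned, keep_mask = pruner(tokens, mask)
```

In such a setup, the keep-mask would typically be supervised or regularized by an alignment signal from the image branch, so that tokens irrelevant to the product image are the ones pruned.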