Multimodal interaction

MixGen: A new multi-modal data augmentation

Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, Mu Li

WACV 2023 Workshop on Pretraining Large Vision and Multimodal Models

2023

Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, data is only augmented either for images or for text in previous works. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating

Computer vision

Multimodal context carryover

Prashan Wanigasekara, Nalin Gupta, Fan Yang, Emre Barut, Zeynab Raeesy, Kechen Qin, Stephen Rawls, Xinyue Liu, Chengwei Su, Spurthi Sandiri

EMNLP 2022

2022

Multi-modality support has become an integral part of creating a seamless user experience with modern voice assistants with smart displays. Users refer to images, video thumbnails, or the accompanying text descriptions on the screen through voice communication with AI-powered devices. This raises the need to either augment existing commercial voice-only dialogue systems with state-of-the-art multimodal

Conversational AI

Benchmarking robustness under distribution shift of multimodal image-text models

Jielin Qiu, Yi Zhu, Xingjian Zhen, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li

NeurIPS 2022 Workshop on Distribution Shifts (DistShifts)

2022

Multimodal image-text models have shown remarkable performance in the past few years. However, the robustness of such foundation models against distribution shifts is crucial in downstream applications. In this paper, we investigate their robustness under image and text perturbations. We first build several multimodal benchmark datasets by applying 17 image perturbation and 16 text perturbation techniques

Computer vision

GraVL-BERT: Graphical visual-linguistic representations for multimodal coreference resolution

Danfeng Guo, Arpit Gupta, Sanchit Agarwal, Jiun-Yu Kao, Shuyang Gao, Arijit Biswas, Chien-Wei Lin, Tagyoung Chung, Mohit Bansal

COLING 2022

2022

Learning from multimodal data has become a popular research topic in recent years. Multimodal coreference resolution (MCR) is an important task in this area. MCR involves resolving the references across different modalities, e.g., text and images, which is a crucial capability for building next-generation conversational agents. MCR is challenging as it requires encoding information from different modalities

Conversational AI

A multi-level alignment training scheme for video-and-language grounding

Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai

ICDM 2022 Workshop on Foundation Models for Vision and Language

2022

To solve video-and-language grounding tasks, the key is for the network to understand the connection between the two modalities. For a paired video and language description, the semantic relation is reflected by the similarity of their encodings. A good multi-modality encoder should capture both inputs’ semantics well and encode them in the shared feature space where embedding distance gets properly

Computer vision

Relaxing contrastiveness in multimodal representation learning

Zudi Lin, Erhan Bas, Kunwar Yashraj Singh, Gurumurthy Swaminathan, Rahul Bhotika

WACV 2023

2022

Multimodal representation learning for images with paired raw texts can improve the usability and generality of the learned semantic concepts while significantly reducing annotation costs. In this paper, we explore the design space of loss functions in visual-linguistic pretraining frameworks and propose a novel Relaxed Contrastive (ReCo) objective, which acts as a drop-in replacement of the widely used

Computer vision

YORO - Lightweight end to end visual grounding

Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno Vasconcelos

ECCV 2022 Workshop on International Challenge on Compositional and Multimodal Perception

2022

We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred to via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without CNN backbone

Computer vision

MixGen: A new multi-modal data augmentation

Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, Mu Li

2022

Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, data is only augmented either for images or for text in previous works. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating

Computer vision

Multimodal semi-supervised learning for text recognition

Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman

2022

Until recently, the number of public real-world text images was insufficient for training scene text recognizers. Therefore, most modern training methods rely on synthetic data and operate in a fully supervised manner. Nevertheless, the amount of public real-world text images has increased significantly lately, including a great deal of unlabeled data. Leveraging these resources requires semi-supervised

Computer vision

TheWebConf: Stable themes, new wrinkles

Larry Hardesty

April 21, 2022

Amazon Scholar Eugene Agichtein on incorporating knowledge into natural-language-processing models, multimodal interactions, and more.

Search and information retrieval

Multimodal interaction
