Multimodal representation learning from images paired with raw text can improve the usability and generality of the learned semantic concepts while significantly reducing annotation costs. In this paper, we explore the design space of loss functions in visual-linguistic pretraining frameworks and propose a novel Relaxed Contrastive (ReCo) objective, which acts as a drop-in replacement for the widely used