Various Vision-Language Pre-training (VLP) models (e.g., CLIP, BLIP) have emerged and dramatically improved performance on public general-domain benchmarks (e.g., COCO, Flickr30k). Such models typically learn cross-modal alignment from large-scale, well-aligned image-text datasets. Adapting these models to downstream applications in specific domains, such as fashion, requires fine-grained in-domain