iEdit: Localised text-guided image editing with weak supervision
2024
Diffusion models (DMs) can generate realistic images with text guidance when trained on large-scale datasets. However, they offer limited controllability over the generated images. We introduce iEdit, a novel method for text-guided image editing conditioned on a source image and a textual prompt. Because no fully annotated dataset with target images exists, previous approaches either perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, both of which struggle to preserve source-image fidelity. We propose to automatically construct a dataset derived from LAION-5B, containing pseudo-target images and descriptive edit prompts. The dataset allows us to incorporate a weakly supervised loss function that generates the pseudo-target image from the source image's latent noise conditioned on the edit prompt. To encourage localised editing, we propose a loss function that uses segmentation masks to guide the editing during training and, optionally, at inference. Trained with limited GPU resources on the constructed dataset, our model outperforms counterparts in image fidelity and CLIP alignment score, and qualitatively on both generated and real images.
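A minimal sketch of what a mask-weighted denoising loss of this kind could look like is shown below. The function name `masked_edit_loss`, the background weight `lambda_bg`, the tensor shapes, and the noise-schedule value are illustrative assumptions, not the paper's exact formulation; the stand-in tensors simply make the snippet runnable.

```python
# Sketch of a mask-weighted diffusion loss for weakly supervised,
# localised text-guided editing. Shapes, schedule, and weighting
# scheme are assumptions for illustration only.
import torch

def masked_edit_loss(eps_pred, eps_true, mask, lambda_bg=0.5):
    """Weight the standard diffusion MSE so errors inside the edit region
    (mask == 1) count fully, while the background is down-weighted to
    encourage localised changes and source preservation."""
    per_pixel = (eps_pred - eps_true) ** 2        # (B, C, H, W)
    weights = mask + lambda_bg * (1.0 - mask)     # emphasise the edit region
    return (weights * per_pixel).mean()

# Toy usage with random stand-ins for latents and the predicted noise.
B, C, H, W = 2, 4, 64, 64
z_src = torch.randn(B, C, H, W)                   # source-image latents
noise = torch.randn_like(z_src)                   # sampled Gaussian noise
alpha_bar_t = torch.tensor(0.7)                   # assumed schedule value at step t
z_t = alpha_bar_t.sqrt() * z_src + (1 - alpha_bar_t).sqrt() * noise
eps_pred = torch.randn_like(z_t)                  # stand-in for UNet(z_t, t, edit prompt)
mask = (torch.rand(B, 1, H, W) > 0.5).float()     # segmentation mask of the edit region
print(masked_edit_loss(eps_pred, noise, mask).item())
```

The same mask weighting could, in principle, be applied at inference time to restrict updates to the masked region, which matches the abstract's note that masks guide editing during training and optionally at inference.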