HopNet: Harmonizing object placement network for realistic image generation via object composition
2025
Realistic image generation is an increasingly desired but deceptively complicated computer vision task, especially when a specific object must appear in the image. Whether generating product advertisements or building novel datasets, object composition for realistic image generation depends on realistic object placement as well as believable object harmonization. To address this task, we introduce HopNet, the first network designed for end-to-end realistic image generation via object composition. HopNet achieves state-of-the-art performance on both of its pivotal tasks, object placement and harmonization. Unlike conventional methods that employ a separate model for each task, HopNet integrates object placement and harmonization in a single model so that it can learn the information that is correlated between them. It uses a transformer-based framework to encode both foreground objects and background scenes, and it concurrently learns the attention mechanisms crucial for both object placement and harmonization. We introduce a modified sparse contrastive loss that allows our model to learn from multiple good and bad placements while also learning object harmonization in a self-supervised manner. HopNet generalizes well to challenging scenes while removing the compounding errors associated with using separate models for each subtask.
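To make the idea of learning from multiple good and bad placements concrete, the following is a minimal sketch of a generic multi-positive contrastive loss over sparsely labeled candidate placements. It is not HopNet's modified sparse contrastive loss, whose exact form is not given here. The function names, the `temperature` parameter, and the convention that `None` marks an unlabeled placement are all assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def multi_placement_contrastive_loss(scores, labels, temperature=0.1):
    """Generic multi-positive contrastive loss (illustrative, hypothetical).

    scores: one compatibility score per candidate placement.
    labels: 1 for a good placement, 0 for a bad one, None if unlabeled.
    Unlabeled candidates are skipped, which is one simple way to handle
    sparse annotations (an assumption, not the paper's formulation).
    """
    # Keep only labeled candidates and apply temperature scaling.
    labeled = [(s / temperature, y) for s, y in zip(scores, labels) if y is not None]
    positives = [s for s, y in labeled if y == 1]
    if not positives or len(positives) == len(labeled):
        raise ValueError("need at least one good and one bad labeled placement")
    # -log( sum over good exp(s) / sum over all labeled exp(s) )
    return logsumexp([s for s, _ in labeled]) - logsumexp(positives)


if __name__ == "__main__":
    labels = [1, 1, 0, None]
    # Good placements scored high: the loss is small.
    print(multi_placement_contrastive_loss([0.9, 0.8, 0.1, 0.5], labels))
    # Good placements scored low: the loss is large.
    print(multi_placement_contrastive_loss([0.1, 0.2, 0.9, 0.5], labels))
```

The loss in this sketch shrinks as the good placements collectively outscore the bad ones, and any number of good placements can share the numerator, so several acceptable locations for the same object are not penalized against each other.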