Relation Rectification in Diffusion Model

National University of Singapore
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

*Corresponding Author.

(a) Our approach enables a diffusion model to generate images with the correct directional relation specified in the textual prompt, where it originally failed. (b) Our method can synthesize relations between diverse, unseen objects in a zero-shot manner.


Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often struggle to accurately depict visual relationships between objects. As we uncover through careful analysis, this issue arises from a misaligned text encoder that struggles to interpret specific relationships and to differentiate the logical order of the associated objects. To resolve it, we introduce a novel task termed Relation Rectification, which aims to refine the model so that it accurately represents a given relationship it initially fails to generate. Our solution uses a Heterogeneous Graph Convolutional Network (HGCN) that models the directional relationships between relation terms and their corresponding objects within the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder so that the embedding space accurately reflects the textual relation. Crucially, our method leaves the parameters of the text encoder and diffusion model unchanged, preserving the model's robust performance on unrelated descriptions. We validate our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative improvements in generating images with precise visual relations.
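To illustrate why a directed, typed graph can separate two prompts that share the same words in swapped order, here is a minimal sketch in plain Python. It is not the paper's HGCN: the 2-d embeddings, the per-edge-type weight matrices `W_SUBJ_TO_REL` and `W_REL_TO_OBJ`, and the function `relation_adjustment` are all hypothetical choices made for illustration. The sketch runs one round of message passing along subject -> relation -> object and compares the result with an order-insensitive sum of the same tokens.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]

# Toy 2-d token embeddings (hypothetical values, not text-encoder outputs).
EMB: Dict[str, Vector] = {
    "book": [1.0, 0.0],
    "bowl": [0.0, 1.0],
    "on": [0.5, 0.5],
}

# One weight matrix per edge type (hypothetical; a trained network would learn these).
W_SUBJ_TO_REL: Matrix = [[2.0, 0.0], [0.0, 2.0]]
W_REL_TO_OBJ: Matrix = [[1.0, 0.0], [0.0, 1.0]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


def relation_adjustment(subject: str, relation: str, obj: str) -> Vector:
    """One round of message passing on the directed graph subject -> relation -> object.

    Returns the updated object-node feature as a stand-in for an adjustment vector.
    """
    h_rel = add(EMB[relation], matvec(W_SUBJ_TO_REL, EMB[subject]))
    h_obj = add(EMB[obj], matvec(W_REL_TO_OBJ, h_rel))
    return h_obj


def bag_of_words(tokens: List[str]) -> Vector:
    """Order-insensitive baseline: plain sum of the token embeddings."""
    total = [0.0, 0.0]
    for token in tokens:
        total = add(total, EMB[token])
    return total


forward = relation_adjustment("book", "on", "bowl")   # "a book on a bowl"
swapped = relation_adjustment("bowl", "on", "book")   # "a bowl on a book"
```

In this toy setup `forward` and `swapped` differ because the two edge types apply different weights, whereas `bag_of_words` gives the same vector for both word orders. That difference is the kind of directional signal the abstract attributes to modeling relations with a heterogeneous graph.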



Our RRNet architecture. Given object-swapped prompts (OSPs) and their exemplar images, RRNet learns to produce adjustment vectors. The rectified embeddings are then used as the condition that guides the generation process of a frozen Stable Diffusion (SD) model. The upper-left part shows the heterogeneous graph that RRNet uses to model the relation direction. After optimization with the negative loss and the denoising loss, SD is able to generate images with the correct relation direction.
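The idea of adding a learned adjustment vector to an otherwise frozen text embedding can be sketched as follows. This is only an illustration under stated assumptions: the function `rectify`, the toy 2-d embeddings, and the choice to adjust the last token position are hypothetical, and the caption does not specify which token positions RRNet modifies.

```python
from typing import List

Vector = List[float]


def rectify(token_embeddings: List[Vector], adjustment: Vector, position: int) -> List[Vector]:
    """Return a copy of the embeddings with `adjustment` added at `position`.

    The input list is never modified, mirroring a frozen text encoder whose
    outputs are only post-processed.
    """
    rectified = [list(vec) for vec in token_embeddings]  # copy each token vector
    rectified[position] = [x + d for x, d in zip(rectified[position], adjustment)]
    return rectified


# Toy output of a frozen text encoder: three tokens, two dimensions each.
frozen = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Assumption for illustration: adjust only the last token position.
adjusted = rectify(frozen, adjustment=[1.0, -1.0], position=-1)
```

Because `rectify` returns a new list, the original embeddings remain available unchanged for prompts that need no rectification, which is in the spirit of keeping the text encoder and diffusion model frozen.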


      title={Relation Rectification in Diffusion Model}, 
      author={Yinwei Wu and Xingyi Yang and Xinchao Wang},