Abstract
Diffusion models have demonstrated impressive performance in text-to-image generation and image editing. However, in instruction-based image editing, they often encounter two challenges: (1) inaccurate localization of the editing targets and (2) unintended modifications in nontarget regions. Both issues stem from the attention mechanisms of diffusion models, which process the image globally. To address these limitations, we conduct a systematic analysis of attention maps under editing instructions and design localization instructions to obtain the desired attention. We propose Instruction Attention Maps (IAM)-Edit, a localized image editing framework that explicitly decouples the editing pipeline into two stages: region localization followed by region-aware editing. Specifically, to localize the editing region, a mask is generated by clustering patches of self-attention maps and combining the resulting clusters with the focal points of cross-attention maps under the editing instruction. To preserve nonediting regions, we apply an attention modulation method that adjusts cross-attention weights at each denoising step based on the generated mask, so that the denoising process focuses on the editing region. Experiments show that IAM-Edit outperforms state-of-the-art methods both qualitatively and quantitatively.
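The two stages can be illustrated with a toy sketch. It is not the IAM-Edit implementation: the abstract does not specify the clustering algorithm, how clusters are combined with the cross-attention focal points, or the modulation rule. The sketch assumes plain k-means over the rows of a patch-to-patch self-attention matrix, a single focal point taken as the patch with the highest cross-attention, and modulation that simply scales cross-attention weights outside the mask. The names `kmeans`, `localize`, `modulate`, and `outside_scale` are hypothetical.

```python
import random


def kmeans(rows, k, iters=20, seed=0):
    """Plain k-means over equal-length rows; returns one cluster label per row."""
    rng = random.Random(seed)
    centers = [list(rows[i]) for i in rng.sample(range(len(rows)), k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, row in enumerate(rows):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
            labels[i] = dists.index(min(dists))
        # Update step: move each non-empty cluster's center to its mean.
        for j in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def localize(self_attn, cross_attn, k=2):
    """Stage 1 (assumed form): cluster patches by their self-attention rows,
    then keep the cluster that contains the cross-attention focal point.

    self_attn:  N x N list of lists, patch-to-patch attention.
    cross_attn: length-N list, attention from the instruction to each patch.
    Returns a length-N list of booleans (True = inside the editing region).
    """
    labels = kmeans(self_attn, k)
    focal = max(range(len(cross_attn)), key=lambda i: cross_attn[i])
    return [label == labels[focal] for label in labels]


def modulate(cross_weights, mask, outside_scale=0.0):
    """Stage 2 (assumed form): scale cross-attention weights of patches outside
    the mask, leaving patches inside the mask untouched.

    cross_weights: N x T list of lists (patches x text tokens).
    """
    return [
        [w if inside else w * outside_scale for w in row]
        for row, inside in zip(cross_weights, mask)
    ]


if __name__ == "__main__":
    # Toy input: 16 patches forming two groups that attend only within themselves.
    n = 16
    self_attn = [
        [0.125 if (i < 8) == (j < 8) else 0.0 for j in range(n)] for i in range(n)
    ]
    cross_attn = [0.0] * n
    cross_attn[3] = 1.0  # the instruction attends most strongly to patch 3
    mask = localize(self_attn, cross_attn, k=2)
    weights = modulate([[1.0] * 5 for _ in range(n)], mask)
    print(mask)
    print(weights[0], weights[15])
```

In a real diffusion model, the modulation step would be applied inside the cross-attention layers at every denoising step, with the mask resized to each layer's resolution; the sketch applies it once to a single weight matrix.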
