IAM-Edit: Localized Image Editing via Instruction Attention Maps

Abstract

Diffusion models have demonstrated impressive performance in text-to-image generation and image editing. However, in instruction-based image editing, they often encounter two challenges: (1) inaccurate localization of the editing targets and (2) unintended modifications in non-target regions. Both issues stem from the attention mechanisms of diffusion models, which process the image globally. To address these limitations, we conduct a systematic analysis of attention maps under editing instructions and design localization instructions to obtain the desired attention. We propose Instruction Attention Maps (IAM)-Edit, a localized image editing framework that explicitly decouples the editing pipeline into two stages: region localization followed by region-aware editing. Specifically, to localize the editing region, a mask is generated by clustering patches of self-attention maps and combining the clusters with the focal points of cross-attention maps under the editing instruction. To preserve non-editing regions, we apply an attention modulation method that adjusts cross-attention weights at each denoising step based on the generated mask, enabling the denoising process to focus on the editing region. Experiments show that IAM-Edit outperforms state-of-the-art methods both qualitatively and quantitatively.
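The two stages described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes plain k-means as the clustering method, takes the strongest cross-attention responses as the "focal points", and models attention modulation as a simple down-weighting of cross-attention outside the mask. The function names (`kmeans`, `localize`, `modulate`) and the parameters `k`, `top`, and `suppress` are hypothetical and chosen only for illustration.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain k-means on row vectors; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Squared distance from every row to every center: shape (N, k).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster is empty
                centers[c] = features[labels == c].mean(axis=0)
    return labels


def localize(self_attn, cross_attn, k=2, top=1):
    """Stage 1 (assumed form): mask = clusters containing the focal points.

    self_attn:  (N, N) patch-to-patch self-attention map.
    cross_attn: (N,) attention of each patch to the editing instruction.
    """
    labels = kmeans(self_attn, k)
    focal = np.argsort(cross_attn)[-top:]   # patches with the strongest response
    return np.isin(labels, labels[focal])   # boolean mask of shape (N,)


def modulate(cross_weights, mask, suppress=0.1):
    """Stage 2 (assumed form): damp cross-attention outside the mask.

    cross_weights: (N, T) cross-attention weights for one denoising step.
    """
    scale = np.where(mask, 1.0, suppress)
    return cross_weights * scale[:, None]


# Toy 4x4 patch grid: patches 0-7 attend to each other, as do patches 8-15.
n = 16
self_attn = np.zeros((n, n))
self_attn[:8, :8] = 1 / 8
self_attn[8:, 8:] = 1 / 8

# The instruction responds most strongly at patch 10.
cross_attn = np.zeros(n)
cross_attn[10] = 1.0

mask = localize(self_attn, cross_attn, k=2)
modulated = modulate(np.full((n, 3), 1 / 3), mask, suppress=0.1)
print(mask.reshape(4, 4).astype(int))
```

In this toy setup the mask covers patches 8 to 15, the self-attention cluster that contains the focal point, and cross-attention weights outside that region are scaled down by the assumed `suppress` factor. In an actual diffusion pipeline the modulation would be applied inside the cross-attention layers at every denoising step.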

Keywords

diffusion models, localized image editing, attention maps
