Abstract
The Direct Segment Anything Model (DirectSAM), pre-trained on SA-1B (the training set of the Segment Anything Model), demonstrates exceptional performance in class-agnostic edge detection. This work explores its application to remote sensing imagery, where semantic edge detection, extracting structures such as buildings, roadways, and coastlines, has clear practical value. These applications are currently handled by separately training specialized models on individual datasets in each specific domain. We present DirectSAM-RS, a semantic edge detection foundation model for remote sensing built upon DirectSAM. It retains the powerful segmentation ability acquired from natural images while leveraging a large-scale dataset curated for semantic edge detection in remote sensing. The dataset contains over 34k image-text-edge triplets, making it more than 30 times larger than any individual existing dataset. DirectSAM-RS augments the DirectSAM architecture with a prompter module, consisting of a text encoder and cross-attention layers, enabling flexible conditioning on target class labels or referring expressions. We evaluate DirectSAM-RS in both zero-shot and fine-tuning settings and show that it achieves state-of-the-art performance on various downstream benchmarks.
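The prompter mechanism described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the dimensions, weight matrices, and single-head formulation are illustrative assumptions. It shows only the core idea: queries come from image features while keys and values come from the encoded text prompt, so the fused features are conditioned on the class label or referring expression.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_prompter(image_feats, text_feats, Wq, Wk, Wv):
    """Single-head cross-attention (illustrative, not the paper's exact module).

    image_feats: (N_img, d) flattened visual patch features
    text_feats:  (N_txt, d) token embeddings of the text prompt
    """
    Q = image_feats @ Wq                 # queries from the image
    K = text_feats @ Wk                  # keys from the text prompt
    V = text_feats @ Wv                  # values from the text prompt
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return image_feats + attn @ V        # residual text-conditioned features

# Toy sizes, chosen arbitrarily for the sketch
rng = np.random.default_rng(0)
d = 16
img = rng.normal(size=(64, d))           # e.g. an 8x8 patch grid, flattened
txt = rng.normal(size=(4, d))            # e.g. tokens for "building edges"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention_prompter(img, txt, Wq, Wk, Wv)
print(out.shape)  # (64, 16): same shape as the input image features
```

Because the conditioning is injected through attention rather than a fixed class head, the same backbone can serve different target classes simply by changing the prompt text.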
