Abstract
Monocular visual odometry (VO) is crucial for ego-motion estimation in autonomous systems, but it suffers from scale ambiguity, computational inefficiency, and poor generalization across different motion scales. In this paper, we introduce a novel end-to-end monocular VO framework that combines convolutional feature extraction with Transformer-based spatial-temporal feature modeling. Our framework operates directly on image patches and spatial coordinates instead of traditional descriptors, improving efficiency for monocular VO, where inter-frame motion is typically small. To address scale ambiguity, we integrate a multi-scale feature extraction module into the SuperPoint network using a Feature Pyramid Network (FPN). In addition, we design a hierarchical Transformer that enhances feature matching through spatial-temporal-aware attention guided by geometric priors, improving robustness in challenging scenes. A joint loss function combining pose loss, geometric consistency, and feature association, coupled with curriculum learning, ensures effective generalization. Evaluated on the KITTI dataset, our method achieves higher trajectory estimation accuracy than existing state-of-the-art learning-based models such as DeepVO and TSformer-VO, and it performs competitively with or better than traditional methods like ORB-SLAM3, especially on challenging sequences.
