Abstract
Monocular visual odometry (VO) is crucial for ego-motion estimation in autonomous systems, but it suffers from scale ambiguity, computational inefficiency, and poor generalization across different motion scales. In this paper, we introduce a novel end-to-end monocular VO framework that combines convolutional feature extraction with Transformer-based spatial-temporal feature modeling. Our framework operates directly on image patches and spatial coordinates instead of traditional descriptors, improving efficiency for monocular VO, where inter-frame motion is typically small. To address scale ambiguity, we integrate a multi-scale feature extraction module into the SuperPoint network using a Feature Pyramid Network (FPN). In addition, we design a hierarchical Transformer that enhances feature matching through spatial-temporal-aware attention guided by geometric priors, improving robustness in challenging scenes. A joint loss function combining pose loss, geometric consistency, and feature association, coupled with curriculum learning, ensures effective generalization. Evaluated on the KITTI dataset, our method achieves higher trajectory estimation accuracy than existing state-of-the-art learning-based models such as DeepVO and TSformer-VO, and it performs competitively with or better than traditional methods like ORB-SLAM3, especially on challenging sequences.
