Spatially-enhanced recurrent memory for long-range mapless navigation via end-to-end reinforcement learning

Abstract

Recent advancements in robot navigation, particularly with end-to-end learning approaches such as reinforcement learning (RL), have demonstrated remarkable efficiency and effectiveness. However, successful navigation still fundamentally depends on two key capabilities: mapping and planning, whether implemented explicitly or implicitly. Classical approaches rely on explicit mapping pipelines to transform and register egocentric observations into a coherent map for the planning module. In contrast, end-to-end learning often achieves this implicitly—through recurrent neural networks (RNNs) that fuse current and historical observations into a latent space for planning. While existing architectures, such as LSTM and GRU, can capture temporal dependencies, our findings reveal a critical limitation: their inability to effectively perform spatial memorization. This capability is essential for transforming and integrating sequential observations from varying perspectives to build spatial representations that support planning tasks. To address this, we propose spatially-enhanced recurrent units (SRUs)—a simple yet effective modification to existing RNNs—that enhance spatial memorization. To improve navigation performance, we introduce an attention-based network architecture integrated with SRUs, enabling long-range mapless navigation using a single forward-facing stereo camera. Additionally, we employ regularization techniques to facilitate robust end-to-end recurrent training via RL. Experimental results demonstrate that our approach improves long-range navigation performance by 23.5% overall compared to existing RNNs. Furthermore, when equipped with SRU memory, our method outperforms both RL baseline approaches—one relying on explicit mapping and the other on stacked historical observations—achieving overall improvements of 29.6% and 105.0%, respectively, in diverse environments that require long-horizon mapping and memorization capabilities. 
Finally, we address the sim-to-real gap by leveraging large-scale pretraining on synthetic depth data, enabling zero-shot transfer for deployment across diverse and complex real-world environments.

Keywords

spatial memory, end-to-end, mapless navigation, recurrent neural networks, reinforcement learning

Introduction

End-to-end learning for robot navigation has recently gained significant attention with its potential to address two major challenges inherent in classical modular approaches: (a) system delays and (b) the difficulty of modeling complex kinodynamic environmental interactions. These challenges have traditionally hindered the development of high-speed platforms with intricate dynamics, such as legged-wheeled robots. However, end-to-end learning approaches face their own challenges, particularly in achieving efficient spatial mapping. Unlike classical mapping pipelines, which explicitly transform historical ego-centric observations into a coherent map frame for downstream planning, end-to-end learning relies on neural networks to implicitly learn this process. This requires the network to iteratively build and update an environmental representation of the surroundings and understand the spatial-temporal relationships between observations.

In autonomous driving, large-scale mapping modules (Mescheder et al., 2019; Mohajerin and Rohani, 2019; Wang et al., 2025; Wei et al., 2023) are trained on thousands of hours of data, enabling robust spatial mapping with specifically designed architectures, such as occupancy networks (Mescheder et al., 2019) or occupancy grid maps (Mohajerin and Rohani, 2019). However, such approaches are not easily deployable on smaller robotic platforms and often struggle to generalize to environments beyond structured road networks. In contrast, embedded robots often rely on end-to-end learning approaches, either by imitating behaviors from datasets (Cèsar-Tondreau et al., 2021; Karnan et al., 2022; Loquercio et al., 2021; Shah et al., 2023) or by optimizing policies through reinforcement learning (RL) (Surmann et al., 2020; Wijmans et al., 2020; Zhu et al., 2017). These methods typically employ specific network architectures, such as recurrent neural networks (RNNs), to implicitly learn spatial-temporal mappings (Wijmans et al., 2020, 2023). While these approaches have demonstrated success in structured indoor environments with discretized action and observation spaces, their performance often diminishes in more complex, real-world scenarios that involve continuous action spaces and dynamic motions.

Recently, for real-world deployments, researchers have started integrating explicit mapping pipelines (Miki et al., 2022b) to fuse ego-centric observations and provide environmental information to learning modules for tasks such as perceptive locomotion (Miki et al., 2022a) and navigation (Francis et al., 2020; Lee et al., 2024; Weerakoon et al., 2022). This raises an important question: can end-to-end learning networks with implicit memory mechanisms, such as RNNs, match or surpass the performance of approaches that rely on explicit mapping pipelines? Specifically, do RNNs have inherent limitations in learning spatial-temporal mappings?

While RNNs excel at capturing temporal dependencies, showcased by their success in various sequential tasks, such as natural language processing (Sutskever et al., 2014) and time-series prediction (Siami-Namini et al., 2019), their ability to learn spatial transformations and memorization remains a topic of research. RNNs are designed to process sequences of data by maintaining an internal state that captures temporal dependencies. However, it is not yet clear to what extent they can effectively learn spatial transformations and integrate observations from different perspectives. Classical approaches achieve spatial registration through homogeneous transformations in three-dimensional space, aligning observations into a consistent local or global frame. For RNNs to achieve effective spatial registration, they must not only memorize sequences but also learn to transform and integrate observations across time and space.

In this work, we examine the spatial-temporal memory capabilities of several recurrent architectures, including long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), gated recurrent unit (GRU) (Cho et al., 2014), and recent state-space models (SSMs) such as S4 (Gu et al., 2022) and Mamba-SSM (Gu and Dao, 2023). We evaluate these models on two criteria: (i) their ability to memorize temporal sequences and (ii) their capacity to register and transform sequential observations across varying spatial perspectives. Our findings indicate that, while these models perform well in capturing temporal dependencies, they exhibit limitations in spatial registration, particularly under conditions of dynamic ego-motion and rapidly changing perspectives.

To address this limitation, we introduce spatially-enhanced recurrent units (SRUs), a simple yet effective modification to standard LSTM and GRU units that enhances their spatial registration capabilities when processing sequences of ego-centric observations. Unlike classical mapping pipelines that rely on explicit homogeneous transformations, our approach enables the recurrent units to implicitly learn the transformations between varying observation perspectives. To further enhance performance on long-range navigation tasks, we propose an attention-based network architecture integrated with SRUs, allowing the model to learn long-range mapless navigation policies using only ego-centric observations via end-to-end RL. Our experiments demonstrate improvements in spatial awareness compared to the baselines. With the SRU memory, the implicit recurrent approach via RL with sparse rewards promotes robust exploration in complex 3D and maze-like environments, outperforming the baselines that rely on explicit mapping and memory modules.

To prevent premature convergence to suboptimal strategies and fully exploit the capabilities of the proposed attention-based recurrent structure, we find that incorporating regularizations during end-to-end RL training is crucial. Furthermore, to address the sim-to-real gap caused by noisy depth images, we pretrain the image encoder on a large-scale synthetic dataset and augment the data using a fully parallelized depth-noise model, adapted from Handa et al. (2014), Barron and Malik (2013a), and Bohg et al. (2014a). In summary, our main contributions are as follows:

• Addressing spatial mapping limitations with SRUs: We identify that standard RNNs, while effective in capturing temporal dependencies, can struggle with spatial registration of observations from different perspectives. To overcome this, we introduce spatially-enhanced recurrent units (SRUs) that enhance the ability to learn implicit spatial transformations from sequences of ego-centric observations.

• End-to-end reinforcement learning with SRUs and attention-based policy: We integrate SRUs into the proposed attention-based network architecture, enabling improved end-to-end reinforcement learning for long-range mapless navigation tasks using only ego-centric observations.

• Large-scale pretraining for zero-shot sim-to-real transfer in long-range mapless navigation: By leveraging large-scale synthetic pretraining and a parallelizable depth-noise model, our system bridges the sim-to-real gap, enabling zero-shot deployment on a legged-wheel platform in diverse real-world environments, using a single forward-facing stereo camera for long-range mapless navigation.

Related works

The navigation and planning problem has been studied extensively for decades. Early approaches relied on classic search-based methods, including Dijkstra’s algorithm and A* (Dijkstra, 1959; Hart et al., 1968) operating on pre-discretized grids, as well as on sampling-based techniques such as the Rapidly-exploring Random Tree (RRT) family, including variants like RRT* and RRT-Connect (Karaman and Frazzoli, 2011; Kuffner and LaValle, 2000; LaValle et al., 2001), and probabilistic roadmap (PRM) methods, such as Lazy PRM and SPARS (Bohlin and Kavraki, 2000; Dobson and Bekris, 2014; Kavraki et al., 1996). While these techniques have achieved significant success in robotics and real-world applications (Wellhausen and Hutter, 2023), they depend on either building a navigation or occupancy map online or having one predefined. Consequently, they often struggle in unknown or dynamic environments (Yang et al., 2022a), particularly when planning under static world assumptions or when complex kinodynamic constraints are present (Ortiz-Haro et al., 2024; Webb and van den Berg, 2012). Moreover, these classical methods typically require an additional perception and mapping module, and the predetermined traversability or occupancy maps are usually based on heuristic designs rather than being optimized for a specific robotic platform.

To address these limitations, recent research has increasingly turned to learning-based approaches, especially for more complex robotic agents (e.g., quadrupeds or legged-wheel systems). For instance, recent works in imitation learning leverage large-scale video data or demonstrations (Bojarski et al., 2016; Loquercio et al., 2021; Pfeiffer et al., 2017; Shah et al., 2022, 2023) to directly map raw egocentric sensory inputs to navigation actions. Given the challenges of capturing dynamic, closed-loop interactions from purely offline data, researchers have also explored model-free reinforcement learning (RL) methods (Bhattacharya et al., 2025; Choi et al., 2019; Fu et al., 2022; Hoeller et al., 2021; Huang et al., 2023; Lee et al., 2024; Ruiz-Serra et al., 2022; Shi et al., 2019; Truong et al., 2021; Wijmans et al., 2020; Wu et al., 2021) that train navigation policies end-to-end by simulating the entire robot dynamics. By replacing the traditional perception, mapping, and planning pipeline with a tailored network—such as architectures based on recurrent networks (Choi et al., 2019; Hoeller et al., 2021; Wijmans et al., 2020; Wu et al., 2021) or Transformers with attention mechanisms (Bhattacharya et al., 2025; Huang et al., 2023; Ruiz-Serra et al., 2022; Zeng et al., 2024)—these approaches have achieved improvements in navigation tasks as well as in robotic locomotion (Kareer et al., 2023; Miki et al., 2022a; Yang et al., 2022b).

A key challenge with end-to-end approaches is learning a robust state representation from the partial observations provided by egocentric sensors. Recent studies have attempted to mitigate this challenge by incorporating explicit mapping and memory mechanisms (Cimurs et al., 2021; Fu et al., 2022; Lee et al., 2024; Savinov et al., 2018) or by employing specialized network architectures like RNNs (Choi et al., 2019; Hoeller et al., 2021; Wijmans et al., 2020). However, RNNs—originally designed to capture temporal sequences in language tasks (Cho et al., 2014)—are not inherently well-suited for spatial mapping, particularly when processing sequential egocentric observations from continuously changing perspectives. For instance, while prior studies in indoor navigation have shown that spatial cues can be decoded from RNN memories, this effect has been demonstrated only with binary contact sensing and does not extend to high-dimensional visual inputs (Wijmans et al., 2023). Moreover, recent findings suggest that variations in recurrent network architectures have minimal impact on the final task-level rewards achieved through reinforcement learning (Duarte et al., 2023). This indicates that, despite architectural differences, some fundamental limitations may persist across these RNN units.

In this paper, we explore a key limitation of existing RNN-based architectures in addressing partial observability—their spatial memorization capabilities—highlighting their shortcomings in learning spatial transformations and integrating observations from different perspectives. We then introduce spatially-enhanced recurrent units (SRUs) and demonstrate their effectiveness in improving long-range mapless navigation tasks with a specifically designed attention-based network structure via end-to-end reinforcement learning.

Problem statement

Consider a robot operating in a three-dimensional (3D) environment $E \subset \mathbb{R}^{3}$. At each time step t, the robot is located at a configuration defined by its position and orientation in SE(3), and receives an observation o_t through its egocentric sensors. The navigation objective is to start from an initial relative goal position $p_{1} \in \mathbb{R}^{3}$, expressed in the robot’s egocentric frame, and reach a designated goal region $G \subset \mathbb{R}^{3}$ such that the relative goal position satisfies ‖p_t‖ < ϵ, where ϵ > 0 is a specified tolerance, within a finite time horizon t ≤ T_max. The robot follows a policy π that maps its current state s_t to an action a_t. However, due to the egocentric setup, the agent’s current state s_t is not fully observable from a single sensor snapshot. Formally, we model the navigation task as a partially observable Markov decision process (POMDP), characterized by the tuple:

(S, A, T, R, O, Z, γ),

with the following components:

• $S$ : the set of all possible states of the environment.

• $A$ : the set of actions available to the robot.

• $T : S \times A \times S \to [0,1]$ : the state transition function, where $T (s, a, s^{'})$ denotes the probability of transitioning from state s to state s′ when action a is taken.

• $R : S \times A \to \mathbb{R}$ : the reward function, assigning a scalar reward to each state-action pair.

• $O$ : the set of all possible observations.

• $Z : S \times A \times O \to [0,1]$ : the observation function, specifying the probability of receiving an observation given the current state and action.

• γ ∈ [0, 1]: the discount factor that balances immediate reward and future payoffs.

The action $a_{t} \in A$ is executed in the robot’s local frame at time t. At each time step t, the robot receives an observation $o_{t} \in O$ and combines it with its historical observations $H_{t - 1}$ to determine the current state, where $H_{t - 1}$ is defined as:

H_{t - 1} ≔ {o_{1}, o_{2}, \dots, o_{t - 1}} .

We define a function f that fuses current and historical observations into an estimate ${\hat{s}}_{t}$ of the unobservable state $s_{t} \in S$ :

{\hat{s}}_{t} = f (o_{t}, H_{t - 1}) .

The policy π then maps the estimated state ${\hat{s}}_{t}$ to an action a_t:

a_{t} = π ({\hat{s}}_{t}) .

Due to the robot’s ego-motion, the current observation o_t can be captured from a different perspective or observation frame compared to historical observations in $H_{t - 1}$ . Therefore, to fuse observations from different viewpoints, the function f typically involves a spatial transformation that aligns the observations into a coherent reference frame to estimate the current state s_t for the policy π. In classical mapping pipelines, spatial transformations are typically achieved through homogeneous transformations that combine rotations and translations. However, in end-to-end learning approaches, the function f, which can be parameterized by neural network weights, is learned implicitly.
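As a rough illustration of the interface defined above, the following Python sketch separates the fusion function f from the policy π. The names `rollout`, `mean_fuse`, and `negate_policy`, as well as the toy running-mean fusion, are assumptions made for this example only and are not part of the proposed method.

```python
from typing import Callable, List, Sequence

import numpy as np

Observation = np.ndarray


def rollout(
    observations: Sequence[Observation],
    fuse: Callable[[Observation, List[Observation]], np.ndarray],
    policy: Callable[[np.ndarray], np.ndarray],
) -> List[np.ndarray]:
    """Apply s_hat_t = f(o_t, H_{t-1}) and a_t = pi(s_hat_t) over a sequence."""
    history: List[Observation] = []  # H_{t-1}, empty before the first step
    actions: List[np.ndarray] = []
    for o_t in observations:
        s_hat = fuse(o_t, history)     # estimate from current and past observations
        actions.append(policy(s_hat))  # the action depends on the estimate only
        history.append(o_t)            # H_t is H_{t-1} extended by o_t
    return actions


# Hypothetical stand-ins: f averages all observations, pi negates the estimate.
def mean_fuse(o_t: Observation, history: List[Observation]) -> np.ndarray:
    return np.mean(np.stack(history + [o_t]), axis=0)


def negate_policy(s_hat: np.ndarray) -> np.ndarray:
    return -s_hat
```

In the recurrent setting studied in the remainder of the paper, the explicit `history` list is replaced by a fixed-size hidden state that the network must learn to update.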

Methodology

Overview

To tackle the long-range navigation task, we first examine and demonstrate the limitations of existing recurrent architectures (e.g., LSTM, GRU, S4, and Mamba-SSM) in a spatial-temporal memory task. We then introduce the spatially-enhanced recurrent units (SRUs). Next, we integrate SRUs into an attention-based network architecture to learn long-range mapless navigation via end-to-end reinforcement learning. Furthermore, we discuss the importance of incorporating regularization techniques to prevent early overfitting, which we find to be crucial to enhance SRUs’ spatial memorization. Finally, we address the sim-to-real gap by pretraining the depth image encoder on large-scale synthetic depth data and incorporating a parallelizable depth-noise model, enabling zero-shot transfer to real-world environments.

Background: Recurrent neural networks

Recurrent neural networks (RNNs) are a class of neural networks designed to process sequential data by maintaining a hidden state that captures temporal dependencies. Given a sequence of inputs x≔(x₁, x₂, …, x_T), an RNN computes a sequence of hidden states h≔(h₁, h₂, …, h_T) using the following recursive formula:

h_{t} ≔ f (x_{t}, h_{t - 1}),

where f is a function that combines the previous hidden state $h_{t - 1}$ with the current input $x_{t}$ to compute the current hidden state $h_{t}$. This is analogous to the function mentioned earlier, which fuses the current observation o_t with the historical observations $H_{t - 1}$ into the estimated current state ${\hat{s}}_{t}$. The hidden state h_t captures the network’s internal representation at time t, encoding information from the entire input sequence up to that point. However, the vanilla RNN can suffer from vanishing and exploding gradients (Cho et al., 2014; Hochreiter and Schmidhuber, 1997). To address these problems, several variants have been proposed, including LSTM and GRU, which are now commonly used in sequential tasks. These models introduce gating mechanisms that control the flow of information through the network, enabling better long-term memory retention and gradient flow. They adopt gates and residual connections across temporal sequences, resulting in strong performance in various sequential tasks. The standard LSTM unit takes the following form:

\begin{aligned} i_{t} & = σ (W_{x i} x_{t} + W_{h i} h_{t - 1} + b_{i}), \\ f_{t} & = σ (W_{x f} x_{t} + W_{h f} h_{t - 1} + b_{f}), \\ o_{t} & = σ (W_{x o} x_{t} + W_{h o} h_{t - 1} + b_{o}), \\ g_{t} & = \tanh (W_{x g} x_{t} + W_{h g} h_{t - 1} + b_{g}), \\ c_{t} & = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ g_{t}, \\ h_{t} & = o_{t} ⊙ \tanh (c_{t}), \end{aligned}

and the GRU unit:

\begin{aligned} z_{t} & = σ (W_{x z} x_{t} + W_{h z} h_{t - 1} + b_{z}), \\ r_{t} & = σ (W_{x r} x_{t} + W_{h r} h_{t - 1} + b_{r}), \\ {\tilde{h}}_{t} & = \tanh (W_{x h} x_{t} + W_{h h} (r_{t} ⊙ h_{t - 1}) + b_{h}), \\ h_{t} & = (1 - z_{t}) ⊙ {\tilde{h}}_{t} + z_{t} ⊙ h_{t - 1}, \end{aligned}

where $i_{t}$, $f_{t}$, $o_{t}$, and $g_{t}$ are the input, forget, output, and cell gate activations in LSTM, respectively. Similarly, $z_{t}$, $r_{t}$, and ${\tilde{h}}_{t}$ are the update, reset, and candidate hidden states in GRU, respectively. The weights $W$ and biases $b$ are learnable parameters, and σ denotes the sigmoid activation function. Note that for all the letters used in the RNN formulations, we use a “monospaced” font style to prevent confusion with the symbols used in the remainder of the paper.
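For concreteness, a single step of each unit can be written out in NumPy as follows. This is a minimal sketch of the standard equations above; the helper names (`init_params`, `gate`, `lstm_step`, `gru_step`), the dictionary-based parameter layout, and the random initialization are assumptions for illustration, not an implementation used in this work.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(gates, n_in, n_hid, rng):
    """Random W_x*, W_h* and zero b_* per gate name (illustrative initialization)."""
    p = {}
    for k in gates:
        p[f"W_x{k}"] = rng.normal(scale=0.1, size=(n_hid, n_in))
        p[f"W_h{k}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        p[f"b_{k}"] = np.zeros(n_hid)
    return p


def gate(p, k, x, h):
    """Pre-activation W_xk x + W_hk h + b_k shared by all gates."""
    return p[f"W_x{k}"] @ x + p[f"W_h{k}"] @ h + p[f"b_{k}"]


def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(gate(p, "i", x, h_prev))  # input gate
    f = sigmoid(gate(p, "f", x, h_prev))  # forget gate
    o = sigmoid(gate(p, "o", x, h_prev))  # output gate
    g = np.tanh(gate(p, "g", x, h_prev))  # cell candidate
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c


def gru_step(x, h_prev, p):
    z = sigmoid(gate(p, "z", x, h_prev))            # update gate
    r = sigmoid(gate(p, "r", x, h_prev))            # reset gate
    h_tilde = np.tanh(gate(p, "h", x, r * h_prev))  # candidate uses r * h_prev
    return (1.0 - z) * h_tilde + z * h_prev
```

Both steps combine the input and the previous hidden state additively inside each nonlinearity, which is the structural property that the spatially-enhanced units later modify.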

More recently, state-space models (SSMs) were introduced as sequence models (Gu et al., 2022), inspired by the general form of the state-space models widely used in control theory. Such models take the following form in discrete time:

\begin{aligned} x_{t} & = \bar{A} x_{t - 1} + \bar{B} u_{t}, \\ y_{t} & = \bar{C} x_{t} + \bar{D} u_{t}, \end{aligned}

where $x_{t}$ represents the state at time $t$, $u_{t}$ the input, and $y_{t}$ the output. The matrices $\bar{A}$, $\bar{B}$, $\bar{C}$, and $\bar{D}$ define the system dynamics.
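The discrete-time recurrence above can be unrolled directly, as in the NumPy sketch below. The function name `ssm_scan` and the zero initial state are assumptions for illustration; practical SSM layers such as S4 evaluate this recurrence with structured matrices and more efficient algorithms, which are not reproduced here.

```python
import numpy as np


def ssm_scan(A_bar, B_bar, C_bar, D_bar, u, x0=None):
    """Unroll x_t = A x_{t-1} + B u_t and y_t = C x_t + D u_t over a sequence u."""
    x = np.zeros(A_bar.shape[0]) if x0 is None else x0
    ys = []
    for u_t in u:                    # u has shape (T, input_dim)
        x = A_bar @ x + B_bar @ u_t  # state update
        ys.append(C_bar @ x + D_bar @ u_t)
    return np.stack(ys), x           # outputs of shape (T, output_dim), final state
```

With a scalar state and $\bar{A} = 0.5$, a unit impulse input decays geometrically in the output, which shows how the state linearly accumulates past inputs.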

In Gu et al. (2020a), a so-called HiPPO matrix is proposed for $\bar{A}$ and $\bar{B}$ to optimally project historical information into the current state via a polynomial basis. Although this approach has led to a new family of RNN models (e.g., S4, S5, Mamba-SSM) that excel at capturing long-term temporal dependencies, their emphasis on long temporal processing is not the focus of this paper; therefore, we omit further details on these models.

Spatial mapping limitations in RNNs

Achieving long-range mapless navigation from egocentric observations requires the robot to perform effective spatial mapping. In three-dimensional (3D) space, spatial mapping is commonly achieved using homogeneous transformations, which combine rotations and translations. A general representation of such a transformation is expressed as:

\begin{aligned} \begin{bmatrix} p^{'} \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0^{⊤} & 1 \end{bmatrix} \begin{bmatrix} p \\ 1 \end{bmatrix}, \end{aligned}

where p and p′ are the coordinates of a point in the original and transformed frames, R is the rotation matrix, and t is the translation vector. In the context of RNNs, this spatial mapping and transformation is implicitly learned through the function f, which integrates the current observation o_t with the historical observations $H_{t - 1}$ to estimate the current state s_t.
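As a small worked example of the homogeneous transformation above, the NumPy sketch below builds the 4×4 matrix from R and t and applies it to a point. The helper names `rot_z`, `make_transform`, and `apply_transform` are chosen for this illustration only.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def make_transform(R, t):
    """Assemble the 4x4 homogeneous matrix [[R, t], [0^T, 1]]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def apply_transform(T, p):
    """Map a 3D point p through T using homogeneous coordinates."""
    return (T @ np.append(p, 1.0))[:3]
```

For a 90° rotation about z followed by a unit translation along x, the point (1, 0, 0) is first rotated to (0, 1, 0) and then shifted to (1, 1, 0). The product of the rotated point and the added translation is the multiplicative-plus-additive structure that a recurrent unit would have to reproduce implicitly.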

In this section, we assess the spatial and temporal mapping performance of existing recurrent structures (namely LSTM, GRU, and the recent S4 and Mamba-SSM) on two fronts: (i) temporal memorization and (ii) spatial transformation and memorization. Consider an abstract scenario relevant to the navigation task, in which a robot is initialized at a pose in SE(3) and moves randomly within the three-dimensional environment $E \subset \mathbb{R}^{3}$. At each time step t, the robot receives an observation o_t containing the coordinates of observed landmarks $l_{t}^{i} \in \mathbb{R}^{3}$, defined relative to the robot’s current frame (indicated by the subscript t). Each landmark is also associated with a binary categorical label cⁱ, which is temporally relevant and independent of the observation frame. Additionally, the robot is provided with its ego-motion transformation matrix $M_{t}^{t - 1}$, representing the transformation from the previous pose at time step t − 1 to the pose at time step t. This transformation enables the network to align and integrate observations into a unified reference frame, akin to classical homogeneous transformations. Over a sequence of T time steps, the RNN processes these observations. At the final step T, the network is evaluated simultaneously on its ability to:

• Memorize and accurately predict the sequence of binary categorical labels associated with the observed landmarks, ensuring temporal association and order preservation (Temporal Task).

• Transform and register the spatial coordinates of all observed landmarks into the final robot frame at t = T, achieving spatial alignment and memorization of their positions (Spatial Task).
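To make the Spatial Task target concrete, the sketch below computes where each observed landmark should end up in the final robot frame by chaining the ego-motion matrices. It assumes, as a convention not stated explicitly in the text, that $M_{t}^{t - 1}$ maps coordinates expressed in frame t − 1 to coordinates expressed in frame t; the function name `register_to_final_frame` and the zero-based indexing are likewise assumptions, and the actual data generation is described in Appendix A.

```python
import numpy as np


def register_to_final_frame(landmarks, motions):
    """Express each landmark, observed in its own frame, in the final frame.

    landmarks[k]: 3D point observed in frame k, for k = 0..T-1.
    motions[k]:   4x4 matrix mapping frame k-1 coordinates to frame k
                  coordinates (assumed convention); motions[0] is unused.
    """
    T = len(landmarks)
    out = np.zeros((T, 3))
    acc = np.eye(4)  # maps frame k coordinates to final-frame coordinates
    for k in range(T - 1, -1, -1):
        out[k] = (acc @ np.append(landmarks[k], 1.0))[:3]
        if k > 0:
            acc = acc @ motions[k]  # extend the chain one step into the past
    return out
```

Earlier observations pass through a longer chain of transformations, so a network that approximates this mapping imperfectly would be expected to accumulate larger errors on them.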

The training details are provided in Appendix A. The results indicate that while LSTM, GRU, S4, and Mamba-SSM effectively encode temporal sequences and retain landmark categories, as shown in Figure 1(a), they face significant challenges in accurately memorizing and transforming landmark coordinates from sequential ego-centric observations. This limitation is reflected in the higher mean squared error (MSE) when recalling observed landmark positions during training, as depicted in Figure 1(b).

Figure 1.

Training for the spatial-temporal memorization: (a) Temporal memorization loss shows that standard RNN units (LSTM, GRU, S4, and Mamba-SSM) effectively recall sequential information. (b) Spatial memorization loss indicates that these units struggle with accurate spatial transformations and memorization under changing observation perspectives, resulting in misaligned landmark coordinates.

Spatially-enhanced recurrent unit

To address the spatial mapping limitations of existing RNN units, we propose a modification to the standard LSTM and GRU architectures that introduces an additional spatial transformation operation. This enhancement results in a new class of units, termed spatially-enhanced recurrent units (SRUs). The added operation enables the network to implicitly learn spatial transformations, aligning and memorizing observations from varying perspectives while preserving robust temporal memorization capabilities.

The effectiveness of this approach is demonstrated by the training results of the spatial mapping task mentioned above and illustrated in Figure 1. With the SRU modification, the network effectively transforms and memorizes observed landmark coordinates from different perspectives, as indicated by the spatial loss curve in Figure 1(b), while preserving similar temporal memorization performance compared to standard LSTM and GRU units, as shown in Figure 1(a). The design of SRUs emerged through iterative experimentation and analysis of spatial mapping performance. The final formulation draws inspiration from the multiplicative form of homogeneous transformations and recent research on the use of the “star operation” (element-wise multiplication) to enhance the representational capacity of neural networks (Ma et al., 2024).

The following equations detail the modifications to incorporate spatial transformations into both LSTM and GRU units, ensuring a balance between spatial memorization and temporal dependency learning. In each case, we compute an additional spatial transformation term, denoted as $s_{t}$ , which acts as a mechanism to implicitly transform and align the candidate state with the current observation’s perspective. For the modified LSTM, referred to as SRU-LSTM, we define:

\begin{aligned} s_{t} & = W_{x s} x_{t} + b_{s}, \\ g_{t} & = \tanh (s_{t} ⊙ (W_{x g} x_{t} + W_{h g} h_{t - 1} + b_{g})) . \end{aligned}

Similarly, for the modified GRU, referred to as SRU-GRU, the formulation is enhanced as follows:

\begin{aligned} s_{t} & = W_{x s} x_{t} + b_{s}, \\ {\tilde{h}}_{t} & = \tanh (s_{t} ⊙ (W_{x h} x_{t} + W_{h h} (r_{t} ⊙ h_{t - 1}) + b_{h})) . \end{aligned}

To further enhance spatial-temporal memorization, we extend the SRU-LSTM with a refined gating mechanism (Gu et al., 2020b), simply referred to as SRU-Ours in the following sections. This mechanism introduces an additional refining function to address gating saturation issues during recurrent training. The final modification, compared to the vanilla LSTM unit, is as follows:

\begin{aligned} s_{t} & = W_{x s} x_{t} + b_{s}, \\ g_{t} & = \tanh (s_{t} ⊙ (W_{x g} x_{t} + W_{h g} h_{t - 1} + b_{g})), \\ r_{t} & = i_{t} ⊙ (1 - {(1 - f_{t})}^{2}) + (1 - i_{t}) ⊙ f_{t}^{2}, \\ c_{t} & = r_{t} ⊙ c_{t - 1} + (1 - r_{t}) ⊙ g_{t} . \end{aligned}
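The SRU-LSTM and SRU-Ours updates can be sketched in NumPy as shown below. The sketch assumes that the gates $i_{t}$, $f_{t}$, $o_{t}$ and the output equation $h_{t} = o_{t} ⊙ \tanh (c_{t})$ are unchanged from the vanilla LSTM, since only the modified lines are listed above. The function names, the parameter dictionary, and the initialization (including setting $b_{s}$ to ones) are illustrative choices, not details taken from this work.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def refine_gate(i, f):
    """r = i (1 - (1 - f)^2) + (1 - i) f^2, bounded by f^2 and 1 - (1 - f)^2."""
    return i * (1.0 - (1.0 - f) ** 2) + (1.0 - i) * f ** 2


def init_sru_params(n_in, n_hid, rng):
    """Illustrative initialization; b_s = 1 is an assumption for this sketch."""
    p = {}
    for k in "ifog":
        p[f"W_x{k}"] = rng.normal(scale=0.1, size=(n_hid, n_in))
        p[f"W_h{k}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        p[f"b_{k}"] = np.zeros(n_hid)
    p["W_xs"] = rng.normal(scale=0.1, size=(n_hid, n_in))
    p["b_s"] = np.ones(n_hid)
    return p


def sru_step(x, h_prev, c_prev, p, refine=True):
    def pre(k):  # W_xk x + W_hk h_prev + b_k
        return p[f"W_x{k}"] @ x + p[f"W_h{k}"] @ h_prev + p[f"b_{k}"]

    i, f, o = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
    s = p["W_xs"] @ x + p["b_s"]  # spatial term, computed from the input only
    g = np.tanh(s * pre("g"))     # element-wise product inside the candidate
    if refine:                    # SRU-Ours: refined gate replaces f and i
        r = refine_gate(i, f)
        c = r * c_prev + (1.0 - r) * g
    else:                         # SRU-LSTM: standard cell update
        c = f * c_prev + i * g
    h = o * np.tanh(c)            # assumed unchanged from the vanilla LSTM
    return h, c
```

The only structural change relative to the standard LSTM candidate is the multiplication by `s`, which lets the current input scale the recurrent pre-activation instead of merely being added to it.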

The effectiveness of SRU is further validated and compared to the standard LSTM unit in the designed spatial mapping task, as illustrated in Figure 2. In this task, the robot follows a spiral path, observing landmarks from varying perspectives along its trajectory. At the end of the path, the robot is tasked with memorizing and transforming the observed landmark coordinates into the final frame, as well as recalling the associated categories of the observed landmarks. Our experiments demonstrate that the SRU modification enables the network to effectively learn spatial transformations. In contrast, the baseline models struggle to align and memorize landmark coordinates observed in earlier steps, resulting in higher spatial errors, particularly for earlier observations, as depicted in Figure 2(c). However, both SRU and baseline models achieve 100% accuracy in recalling the categories of the observed landmarks. Since the temporal task results are identical across all models, they are not visualized specifically in Figure 2. Furthermore, the latest recurrent units, such as S4 and Mamba-SSM, which excel at long-term temporal memorization, exhibit even worse spatial memorization capabilities, as shown in both Figure 1(b) and Figure 2(c). The SRUs exhibit consistently low spatial errors across all observation steps, underscoring their superior spatial memorization capabilities.

Figure 2.

Spatial mapping comparison: (a) and (b) depict the spatial mapping performance of LSTM and SRU-LSTM units on synthetic data, respectively, as the robot follows a spiral path, observing landmarks from different perspectives. At the end of the path, the robot is tasked with memorizing and transforming the observed landmark coordinates into the final robot frame. Numbers indicate observation time steps. (c) illustrates the mean spatial memory errors (log scale) across observation step indices, ordered from the final (15) to the initial step (1), averaged over various randomly generated trajectories and observations.

Attention-based recurrent network architecture for navigation

To leverage the SRUs in the navigation context, we propose an attention-based recurrent network architecture for long-range mapless navigation tasks using raw front-facing stereo depth input, as illustrated in Figure 3. The network consists of a pretrained depth image encoder, two spatial attention layers incorporating both self-attention and cross-attention mechanisms to enhance and compress the encoded visual features, and a recurrent unit (SRU) that learns a spatial-temporal representation of the state by fusing the current observation with historical observations. Finally, a multilayer perceptron (MLP) head computes the actions from the recurrent hidden state, outputting velocity commands for the robot’s locomotion controller.

Figure 3.

Attention-based recurrent network architecture for navigation: The network integrates a pretrained image encoder and an attention mechanism to compress and emphasize relevant features from encoded observations. These features, combined with proprioceptive inputs, are processed by the SRU unit, which learns spatial transformations and temporal dependencies and fuses them with historical observations to estimate the robot’s state. The state is then mapped to actions using an MLP-based head integrated with the temporally consistent (TC) dropout layer for improved robustness and generalization.

Depth encoder pretraining and simulated perception noise

For the depth encoder, we adopt a convolutional neural network (CNN) backbone based on RegNet (Radosavovic et al., 2020), chosen for its simplicity and efficiency. This is further enhanced with a feature pyramid network (FPN) (Lin et al., 2017) to capture spatial features across multiple scales. The encoder is pretrained for self-reconstruction on large-scale synthetic depth image data from TartanAir (Wang et al., 2020) using a variational autoencoder (VAE) framework. This pretraining enables the encoder to learn and extract robust and generalizable features from depth images, facilitating effective downstream navigation learning and deployment. However, depth images captured in simulation often differ from those obtained in real-world environments due to various sensor artifacts and noise. To address this sim-to-real gap, we integrate a parallelized depth-noise model, adapted from Barron and Malik (2013b), Bohg et al. (2014b), and Handa et al. (2014), which introduces configurable noise to the depth images, such as:

• Edge Noise: Distortions at sharp depth discontinuities due to abrupt changes in the scene.

• Filling Noise: Blurring artifacts introduced during the interpolation of missing or unregistered pixels.

• Rounding Noise: Quantization effects resulting from sensor resolution limitations, causing rounding errors.

Figure 4 illustrates an example of the simulated stereo depth noise. The depth-noise model is designed for efficient batch processing, enabling parallelized pretraining on large-scale datasets and during RL with simulated depth images. For implementation details, refer to Appendix B.
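As a rough illustration of the rounding noise listed above, the following sketch quantizes the disparity of a stereo sensor and converts it back to depth. It is not the model from Appendix B: the focal length, baseline, and disparity step are hypothetical values, and the function names are introduced here for illustration only. It merely shows why quantization error tends to grow with distance for a stereo camera.

```python
def quantize_stereo_depth(depth_m, focal_px=400.0, baseline_m=0.12, disparity_step=0.25):
    """Round-trip a depth value through a quantized disparity (illustrative only).

    focal_px, baseline_m and disparity_step are hypothetical sensor values.
    """
    if depth_m <= 0.0:
        return 0.0  # treat non-positive depth as an invalid pixel
    disparity = focal_px * baseline_m / depth_m  # ideal disparity in pixels
    quantized = round(disparity / disparity_step) * disparity_step
    if quantized <= 0.0:
        return 0.0  # disparity rounded to zero: depth is not recoverable
    return focal_px * baseline_m / quantized


def add_rounding_noise(depth_image):
    """Apply the quantization to every pixel of a nested-list depth image."""
    return [[quantize_stereo_depth(d) for d in row] for row in depth_image]


# Error at a near pixel versus a far pixel under the assumed sensor values.
near_error = abs(quantize_stereo_depth(1.0) - 1.0)
far_error = abs(quantize_stereo_depth(9.0) - 9.0)
```

Because all pixels are processed independently, the same operation can be applied element-wise to a whole batch of images, which is in the spirit of the batch processing described above.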

Figure 4.

Simulated stereo depth noise: (a) synthetic depth image from the TartanAir dataset (Wang et al., 2020) and (b) image with augmented artificial noise. The depth-noise model introduces edge, filling, and rounding noise to the depth images, simulating realistic sensor artifacts.

Attention layers for feature compression

During navigation, humans and animals tend to focus on the most relevant spatial cues rather than attempting to memorize all available information (Matthis et al., 2018). This selective attention enables more efficient and effective memory usage. To emulate this, we combine self-attention and cross-attention mechanisms in our architecture. These spatial attention layers process high-dimensional visual inputs, extracting the information most relevant to the robot’s current state. Specifically, given the feature map $F_{t} \in \mathbb{R}^{C \times H \times W}$ encoded by the pretrained depth encoder, each spatial feature $F_{t}^{ij}$ (where i ∈ {1, …, H} and j ∈ {1, …, W}) is enriched with global context via a self-attention mechanism, resulting in a refined feature map $F_{t}'$. Here, H and W denote the height and width of the feature map, respectively, while C represents the number of channels. The self-attention mechanism computes the attention weights across the spatial dimensions of the feature map, enabling the network to fuse and emphasize relevant features while suppressing less important ones. Next, $F_{t}'$ is processed by a cross-attention layer, where the query is derived from the robot’s egocentric proprioceptive state $o_{t}^{prop}$ (which includes linear velocity v_t, angular velocity ω_t, projected gravity n_t, and the previous action a_t−1) as well as the current relative goal position p_t. This procedure is illustrated in Figure 3. This operation compresses the 2-dimensional feature map into a 1-dimensional latent representation ${\hat{F}}_{t} \in \mathbb{R}^{C \times 1}$, which preserves the most relevant spatial features while reducing the dimensionality. This compressed feature ${\hat{F}}_{t}$ is then concatenated with the proprioceptive state $o_{t}^{prop}$ and the relative goal position p_t before being passed to the recurrent memory unit. There, it is fused with historical observations to form a spatial-temporal representation of the robot’s current state, integrating both exteroceptive and proprioceptive information.
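The compression step can be sketched as attention pooling: a single query scores every spatial location, and the softmax-weighted sum of the locations yields a length-C vector. The sketch below is a simplification under stated assumptions: it uses one head, omits the learned query, key, and value projections, and takes the query as an already-projected length-C vector standing in for the proprioceptive state and goal. The function names are illustrative, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(feature_map, query):
    """Compress a C x H x W feature map into a length-C vector.

    feature_map: nested list indexed as [c][i][j].
    query: length-C list standing in for the projected proprioceptive
           state and goal position (the projection itself is omitted).
    Single head with identity key/value projections: a simplification.
    """
    C = len(feature_map)
    H = len(feature_map[0])
    W = len(feature_map[0][0])
    # Flatten the spatial grid into H*W tokens of dimension C.
    tokens = [[feature_map[c][i][j] for c in range(C)]
              for i in range(H) for j in range(W)]
    # Scaled dot-product score between the query and each spatial token.
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(C)
              for tok in tokens]
    weights = softmax(scores)
    # The weighted sum of tokens gives the compressed feature.
    pooled = [sum(w * tok[c] for w, tok in zip(weights, tokens))
              for c in range(C)]
    return pooled, weights


# Toy example: C=2 channels on a 1x3 grid; the query favors channel 0.
fmap = [[[0.0, 1.0, 5.0]],   # channel 0
        [[1.0, 1.0, 1.0]]]   # channel 1
pooled, weights = cross_attention_pool(fmap, query=[1.0, 0.0])
```

In this toy case the location with the largest channel-0 value receives the largest weight, so changing the query shifts which locations dominate the pooled feature.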

Figure 5 visualizes the output attention weights of the cross-attention layer across four distinct attention heads (depicted in different colors), overlaid on the depth input. It highlights how the attention mechanism focuses on depth features relevant to the robot’s current state. When the robot’s movement direction is manually altered, the output attention weights shift accordingly, emphasizing spatial regions and obstacles in the new direction. These behaviors emerge naturally during end-to-end learning, demonstrating the policy’s ability to effectively acquire critical spatial cues for navigation.

Figure 5.

Visualization of cross-attention weights corresponding to different robot states over raw real-world depth input: (a) when the robot turns left, the attention weights highlight the left region, focusing on the left pillar in the depth image; (b) when the robot moves straight, the attention weights emphasize the central region, capturing both pillars; (c) when the robot turns right, the attention weights focus on the right region, concentrating on the right pillar. Distinct colors in the attention weights visualizations represent different attention heads.

Spatially-enhanced recurrent unit

The spatially-enhanced recurrent unit (SRU), as described in Section 4.4, processes the compressed feature map ${\hat{F}}_{t}$ , along with the robot’s current proprioceptive state $o_{t}^{prop}$ and the previous hidden state h_t−1, to generate a spatial-temporal representation of the surrounding environment from the robot’s egocentric observations. The observation of the current proprioceptive state $o_{t}^{prop}$ provides essential ego-motion information, equivalent to the transformation matrix $M_{t}^{t - 1}$ in Section 4.3. The SRU learns to implicitly perform spatial transformations, aligning the current observation feature map ${\hat{F}}_{t}$ with the previous hidden state h_t−1. The resulting hidden state h_t, which encapsulates the integrated environmental information to estimate the robot’s current state s_t, is subsequently passed through a multi-layer perceptron (MLP) head to compute the robot’s action a_t.

Learning navigation with sparse rewards and regularizations

The final attention-based network with the SRU is trained end-to-end using RL to achieve long-range mapless navigation, with the objective of maximizing the cumulative reward over the episode. The reward function for the navigation task is designed as a combination of task-level rewards r^task, regularization r^reg, and penalty r^pen terms, as follows:

r_{t} = α_{1} r_{t}^{task} - α_{2} r_{t}^{reg} - α_{3} r_{t}^{pen} .

Here, α₁, α₂, and α₃ are coefficients used to balance the contributions of the task-level reward, regularization, and penalty terms, respectively. The task-level reward $r_{t}^{task}$ is the reward signal that encourages the agent to reach the goal. To promote exploration in complex environments, we adopt time-based rewards that provide feedback at the end of the episode, similar to Rudin et al. (2022), Zhang et al. (2024), and He et al. (2024). This approach provides a sparse reward signal, encouraging the agent to reach the goal without being distracted by intermediate rewards. However, with a long episode length T_max (for example, 60 s) and a rewarding period T_r of only 2 s, the network may learn to delay progress until the final step. To mitigate this, we introduce a random check that activates the reward at any time step during the episode with a small probability δ_check. This check incentivizes the agent to attempt reaching the goal earlier, without compromising the overall sparsity of the reward signal. The final reward formulation is adapted from He et al. (2024) and is given as follows:

r_{t}^{task} = \frac{1 (t > T_{max} - T_{r} \lor random < δ_{check})}{1 + ‖ \frac{p_{t}}{σ} ‖_{2}}

where 1(⋅) is the binary indicator function, T_r is the rewarding period, ∨ represents the logical OR operation, p_t is the relative goal position at time t with respect to the current robot pose, and σ is a scaling factor controlling the reward’s spatial sensitivity. Similar to He et al. (2024), we adopt two reward configurations: a tight reward with a small σ to encourage precise goal-reaching behavior, and a loose reward with a larger σ to promote exploration and stabilize training through intermediate guidance.
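The task reward above can be sketched in a few lines. The defaults for sigma and delta_check below are placeholders rather than the paper's values, and the function name and the rng parameter are introduced here for illustration.

```python
import math
import random


def task_reward(t, p_t, t_max=60.0, t_r=2.0, sigma=1.0, delta_check=0.0, rng=random):
    """Sparse time-based goal reward following the formula above.

    t: elapsed episode time in seconds.
    p_t: relative goal position, e.g. (x, y).
    sigma and delta_check defaults are placeholders, not the paper's values.
    """
    in_reward_window = t > t_max - t_r            # last T_r seconds of the episode
    random_check = rng.random() < delta_check     # occasional early check
    if not (in_reward_window or random_check):
        return 0.0
    scaled_distance = math.sqrt(sum((c / sigma) ** 2 for c in p_t))
    return 1.0 / (1.0 + scaled_distance)


# Same position, 5 m from the goal, inside the rewarding period.
tight = task_reward(59.0, (3.0, 4.0), sigma=0.5)   # small sigma: sharp falloff
loose = task_reward(59.0, (3.0, 4.0), sigma=5.0)   # large sigma: gentle falloff
```

With a small σ the reward decays quickly away from the goal, while a larger σ still returns a noticeable reward several meters away, which matches the tight and loose configurations described in the text.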

The regularization term $r_{t}^{reg}$ encourages smooth behaviors by penalizing rapid action changes and excessive joint accelerations. This is implemented using L1 regularization on the difference between the current action a_t and a momentum-filtered version of previous actions $a_{t}^{m}$ , as defined below:

a_{t}^{m} = λ \cdot a_{t - 1}^{m} + (1 - λ) \cdot a_{t},

where λ is the momentum factor. The regularization reward is then given by:

r_{t}^{reg} = β_{1} \cdot ‖ a_{t} - a_{t}^{m} ‖_{1} + β_{2} \cdot ‖ j_{t}^{acc} ‖_{1},

where β₁ and β₂ are regularization coefficients, and $j_{t}^{acc}$ represents joint-level accelerations from the simulation environment. The penalty term $r_{t}^{pen}$ discourages unsafe behaviors such as collisions or excessive tilt:

r_{t}^{pen} = η_{1} \cdot 1 (collision) + η_{2} \cdot \max (0, | θ_{t} | - θ_{safe})

where η₁ and η₂ are penalty coefficients, θ_t is the robot’s current tilt angle, and θ_safe defines the safe tilt threshold. The reward formulation described above is consistently applied throughout the entire RL training process, which is conducted end-to-end using an Asymmetric Actor-Critic (Pinto et al., 2018) setup and trained with PPO, without employing any additional environment or reward curricula. Further training details are provided in Appendix C.
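The regularization and penalty terms, and their combination with the task reward, can be written out directly from the equations above. All coefficient values in the toy step are placeholders chosen for readability, not the values used in training, and the helper names are illustrative.

```python
def momentum_filter(prev_filtered, action, lam):
    """a_t^m = lam * a_{t-1}^m + (1 - lam) * a_t, element-wise."""
    return [lam * m + (1.0 - lam) * a for m, a in zip(prev_filtered, action)]


def l1_norm(values):
    """Sum of absolute values."""
    return sum(abs(v) for v in values)


def regularization_term(action, filtered_action, joint_acc, beta1, beta2):
    """r_t^reg = beta1 * ||a_t - a_t^m||_1 + beta2 * ||j_t^acc||_1."""
    diff = [a - m for a, m in zip(action, filtered_action)]
    return beta1 * l1_norm(diff) + beta2 * l1_norm(joint_acc)


def penalty_term(collided, tilt, tilt_safe, eta1, eta2):
    """r_t^pen = eta1 * 1(collision) + eta2 * max(0, |tilt| - tilt_safe)."""
    return eta1 * float(collided) + eta2 * max(0.0, abs(tilt) - tilt_safe)


def total_reward(r_task, r_reg, r_pen, alpha1, alpha2, alpha3):
    """r_t = alpha1 * r_task - alpha2 * r_reg - alpha3 * r_pen."""
    return alpha1 * r_task - alpha2 * r_reg - alpha3 * r_pen


# Toy step with placeholder coefficients (not the paper's values).
a_prev_m = [0.0, 0.0]
a_t = [1.0, -1.0]
a_m = momentum_filter(a_prev_m, a_t, lam=0.5)
r_reg = regularization_term(a_t, a_m, joint_acc=[0.0], beta1=1.0, beta2=1.0)
r_pen = penalty_term(collided=False, tilt=0.1, tilt_safe=0.3, eta1=1.0, eta2=1.0)
r_t = total_reward(r_task=1.0, r_reg=r_reg, r_pen=r_pen,
                   alpha1=1.0, alpha2=0.1, alpha3=1.0)
```

Note that because the filtered action already includes a_t, the difference a_t − a_t^m equals λ(a_t − a_{t−1}^m), so a larger momentum factor penalizes deviations from the action history more strongly.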

Training regularization

To mitigate overfitting and enhance robustness, we incorporate two additional regularization strategies during training. These strategies are crucial for training a robust spatial-temporal representation with SRUs, as explained below and demonstrated in the experimental results. The regularization techniques are as follows:

• Deep mutual learning (DML): As described in Xie et al. (2025), DML involves training two policies simultaneously, enabling them to mutually distill knowledge from each other. This approach enhances generalization and mitigates the risk of convergence to suboptimal solutions. The mutual distillation is achieved by incorporating a Kullback–Leibler (KL) divergence loss between the two policies, both of which are trained using standard proximal policy optimization (PPO) (Schulman et al., 2017).

• Temporally consistent dropout (TC-Dropout): Adapting from Hausknecht and Wagener (2022), we apply a consistent dropout mask across time steps during both rollout and training, ensuring stable memory learning within the recurrent structure.

As shown in Figure 1, SRUs converge more slowly on spatial memorization than on temporal dependencies, reflecting the inherent difficulty of learning spatial transformations and forming spatial memory. This discrepancy can lead the network to favor easier-to-learn solutions early in training, relying on temporal features while neglecting spatial memorization, which results in suboptimal performance. Regularization techniques that mitigate early overfitting and promote the exploration of more challenging spatial-temporal features during policy optimization are therefore crucial. First, we employ deep mutual learning (DML) tailored for reinforcement learning (RL) (Xie et al., 2025). DML trains multiple policies in parallel, allowing them to distill knowledge from one another. This mutual distillation improves the network’s generalization and fosters the learning of robust and essential features. Because the models regularize each other, they are less likely to converge prematurely to suboptimal solutions that rely solely on easy-to-learn features, such as temporal dependencies. Instead, DML encourages the formation of spatial-temporal representations, leading to improved overall performance. This approach is critical for leveraging the full potential of the SRU network, as demonstrated in the experimental results.
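To make the mutual distillation term concrete, the sketch below adds a KL divergence between two policies' action distributions to each policy's PPO loss. It assumes diagonal Gaussian policies; the direction of the KL term, the weight kl_weight, and the function names are assumptions made for illustration and may differ from Xie et al. (2025).

```python
import math


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def dml_losses(ppo_loss_a, ppo_loss_b, dist_a, dist_b, kl_weight=0.1):
    """Add a mutual-distillation KL term to each policy's PPO loss.

    dist_a / dist_b are (mean, std) pairs produced for the same observation.
    kl_weight is a hypothetical coefficient. In an autodiff framework the
    peer's distribution would typically be treated as a constant target.
    """
    (mu_a, std_a), (mu_b, std_b) = dist_a, dist_b
    loss_a = ppo_loss_a + kl_weight * gaussian_kl(mu_b, std_b, mu_a, std_a)
    loss_b = ppo_loss_b + kl_weight * gaussian_kl(mu_a, std_a, mu_b, std_b)
    return loss_a, loss_b


# Two policies that disagree on the mean action by one standard deviation.
loss_a, loss_b = dml_losses(1.0, 2.0, ([0.0], [1.0]), ([1.0], [1.0]))
```

When both policies output the same distribution, the KL term vanishes and each loss reduces to plain PPO; the term only pulls the policies together where they disagree.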

Second, consistent dropout (Hausknecht and Wagener, 2022) addresses a critical issue in on-policy reinforcement learning: standard dropout samples different masks in the rollout and training stages, so the policy being updated no longer matches the one that collected the data. Building on this, we extend consistent dropout with temporal consistency for training the recurrent structure. Specifically, during the data rollout stage, we maintain the same dropout mask across all time steps, ensuring temporal consistency. During the training stage, the same mask used in the rollout is reapplied to the policy network. This approach promotes stable memory learning through recurrent connections and enhances the robustness of the learned policy.
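A minimal sketch of this idea follows: the mask is sampled once, reused at every time step of the rollout, stored, and restored during training. The class and method names are hypothetical, and sampling one mask per episode is an assumption about how "all time steps" is scoped.

```python
import random


class TemporallyConsistentDropout:
    """Dropout whose mask is sampled once and then reused across time steps."""

    def __init__(self, size, drop_prob, seed=None):
        assert 0.0 <= drop_prob < 1.0
        self.size = size
        self.drop_prob = drop_prob
        self.rng = random.Random(seed)
        self.mask = None

    def reset(self):
        """Sample a fresh mask (e.g. at episode start) and return a copy to store."""
        keep = 1.0 - self.drop_prob
        # Inverted dropout: kept units are scaled by 1 / keep.
        self.mask = [(1.0 / keep) if self.rng.random() < keep else 0.0
                     for _ in range(self.size)]
        return list(self.mask)

    def load(self, mask):
        """Restore a stored rollout mask before recomputing outputs in training."""
        self.mask = list(mask)

    def __call__(self, features):
        return [f * m for f, m in zip(features, self.mask)]


layer = TemporallyConsistentDropout(size=4, drop_prob=0.5, seed=0)
stored_mask = layer.reset()
rollout_out = [layer([1.0, 2.0, 3.0, 4.0]) for _ in range(3)]  # same mask each step
layer.reset()               # a later rollout draws a new mask
layer.load(stored_mask)     # training replays the stored rollout mask
train_out = layer([1.0, 2.0, 3.0, 4.0])
```

Storing the mask alongside the rollout data is what allows the training pass to reproduce the exact network that generated each action.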

Experiments

To evaluate the effectiveness of the proposed spatially-enhanced recurrent unit (SRU) and the attention-based network architecture in enhancing long-horizon robot navigation, we conduct experiments in both simulated and real-world environments. We compare the SRU against standard LSTM and GRU units in long-range mapless navigation tasks, focusing on their performance in end-to-end reinforcement learning (RL) training and navigation success rates (SR). Additionally, we compare the SRU policy, integrated with our proposed network structure and trained using recurrent RL, against two current state-of-the-art (SOTA) baselines (Huang et al., 2023; Lee et al., 2024) for robot navigation with RL. Our evaluation highlights the advantages of the implicit recurrent memory provided by SRU in solving long-range mapless navigation tasks across diverse environments.

Furthermore, we ablate the role of our proposed spatial attention layers in compressing features from encoded observations to improve memorization and overall navigation performance. We compare our approach against the convolution and average pooling method used in Wijmans et al. (2020), as well as the attention mechanism introduced in Huang et al. (2023). We also investigate the impact of regularization techniques on training the SRU unit end-to-end in RL, evaluating their effectiveness in preventing early convergence to suboptimal solutions and enhancing navigation performance. Finally, we explain and validate the pretrained image encoder’s ability to bridge the sim-to-real gap by demonstrating zero-shot transfer across diverse and complex real-world environments.

Experimental setup

We conduct our experiments in simulated 3D environments using NVIDIA IsaacLab (Mittal et al., 2023), which provides a realistic physics engine and fast, parallelizable simulation capabilities. The environments are designed to challenge the robot’s navigation capabilities and include maze-like structures, randomly generated pillars, stairs, and environments with negative obstacles, such as holes and pits, as shown in Figure 6. The robot is equipped with front-facing depth sensors as the only exteroceptive input, capturing the surrounding environment from an egocentric perspective. Additionally, a state estimation and localization module provides the robot’s proprioceptive state $o_{t}^{prop}$ , including linear and angular velocities (v_t and ω_t), projected gravity (n_t), and the relative goal position (p_t) with respect to the robot’s frame at time step t. Given the limited field of view (FoV) of the depth camera (Horizontal FoV: 105°, Vertical FoV: 78°) and a maximum range of 10 m, the robot must rely on its spatial-temporal mapping capabilities to navigate through the terrain and reach the designated goal region effectively. The robot’s motion is controlled by a set of linear and angular velocities, referred to as the action a_t, which is the output of the policy network. The navigation policy operates at a frequency of 5 Hz. The robot is equipped with a learning-based locomotion controller (Lee et al., 2024), operating at 50 Hz. This controller takes the action output a_t from the high-level navigation policy and executes it to control the robot. The policies are trained end-to-end using reinforcement learning, without employing any distillation or teacher-student setups.
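The two control rates imply that each navigation action is held for ten locomotion controller steps. The loop below sketches that relationship; the policy and controller are hypothetical stand-ins, not the actual modules.

```python
POLICY_HZ = 5        # navigation policy rate from the text
CONTROLLER_HZ = 50   # locomotion controller rate from the text
STEPS_PER_ACTION = CONTROLLER_HZ // POLICY_HZ  # controller steps per policy action


def run_episode(policy, controller_step, duration_s):
    """Hold each navigation action for several low-level controller steps."""
    n_controller_steps = int(duration_s * CONTROLLER_HZ)
    action = None
    log = []
    for k in range(n_controller_steps):
        if k % STEPS_PER_ACTION == 0:
            action = policy(k / CONTROLLER_HZ)   # new velocity command at 5 Hz
        log.append(controller_step(action))      # command tracked at 50 Hz
    return log


# Hypothetical stand-ins for the navigation policy and locomotion controller.
policy_call_times = []


def dummy_policy(t):
    policy_call_times.append(t)
    return (1.0, 0.0)  # (linear velocity, angular velocity)


log = run_episode(dummy_policy, lambda a: a, duration_s=1.0)
```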

Figure 6.

Simulated environments used for training and testing RL-based navigation tasks: (A) maze, (B) random pillars, (C) stairs, and (D) pits. These environments are parameterizable and can be randomly generated during both training and testing using the NVIDIA IsaacLab (Mittal et al., 2023) simulation framework.

Comparison with recurrent units

We evaluate the performance of the proposed spatially-enhanced recurrent units (SRUs) compared to standard LSTM and GRU units. Given the superior spatial memorization capability of SRUs, as demonstrated in Figure 1, we hypothesize that integrating SRUs will improve the performance of navigation policies on long-range mapless navigation tasks. To test this hypothesis, we train policy networks with different recurrent units end-to-end using RL, under the same conditions, in the simulated 3D environments shown in Figure 6. All policies share the same components (attention, training regularization, and a pretrained encoder); only the recurrent unit differs. We then evaluate their navigation performance.

As shown in Figure 7, the policy with the SRU memory unit outperforms those with standard LSTM and GRU units in terms of average episode return during training, with results averaged across three random seeds. (Note: GRU training can exhibit instability, so only its successful runs are included in the analysis.) Table 1 summarizes the navigation performance of policies using different architectures. The best-performing model for each unit (determined by the highest average return) is selected for comparison. Results are averaged over 4800 episodes across 120 randomly generated environments that differ from the training set, and are reported as success rate (SR) for each environment. The SRU variants consistently outperform the standard LSTM and GRU units, achieving an average 21.8% improvement in SR across all environments with the SRU modification alone. Incorporating the refined gating mechanism in the SRU-Ours model further boosts the results, achieving an overall 23.5% increase in SR and demonstrating the effectiveness of these enhancements. Notably, in stair-like environments, where the 3D structure and significant occlusions make navigation difficult without precise spatial memorization and registration, policies with SRUs more than double the success rate of those with standard LSTM and GRU units.

Figure 7.

Training curve comparison between policies integrated with different recurrent units: The average return from three random seeds during training. The architecture with SRU units achieves a higher return compared to the baseline LSTM and GRU units.

Table 1.

Navigation success rate (SR) for policies integrated with different recurrent units across four environment types: Maze, random pillars, stairs, and pits.

Navigation success rate, SR (%)
Model	Maze	Pillar	Stair	Pit	Overall
GRU	68.1	73.6	35.7	66.7	61.0
LSTM	70.3	78.2	33.1	72.7	63.5
SRU-GRU	73.1	78.8	74.1	74.8	75.2
SRU-LSTM	75.9	76.7	79.3	74.1	76.5
SRU-Ours	76.0	81.0	82.8	75.6	78.9

The table compares standard LSTM and GRU units with our proposed spatially enhanced counterparts: SRU-GRU, SRU-LSTM, and SRU-Ours. SRU-Ours achieves the highest SR in every column.
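The text does not state how the 21.8% and 23.5% figures are averaged. One reading that reproduces both from the Overall column of Table 1 is sketched below; the grouping of models is an assumption, not something the source confirms.

```python
overall_sr = {
    "GRU": 61.0, "LSTM": 63.5,
    "SRU-GRU": 75.2, "SRU-LSTM": 76.5, "SRU-Ours": 78.9,
}


def mean_sr(names):
    """Mean overall SR of the named models."""
    return sum(overall_sr[n] for n in names) / len(names)


def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# Assumed grouping: standard units versus SRU variants.
baseline = mean_sr(["GRU", "LSTM"])
sru_only = mean_sr(["SRU-GRU", "SRU-LSTM"])
all_sru = mean_sr(["SRU-GRU", "SRU-LSTM", "SRU-Ours"])

gain_sru_only = round(relative_gain(sru_only, baseline), 1)  # 21.8
gain_all_sru = round(relative_gain(all_sru, baseline), 1)    # 23.5
```

Under this reading, both percentages are relative improvements over the mean of the two standard units rather than absolute percentage-point differences.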

Figure 8 compares example trajectories traversed by the SRU and standard LSTM policies. In maze environments, the LSTM policy gets trapped in a dead-end corridor, looping between points B and C, while the SRU policy successfully passes through the corridor, demonstrating better spatial memorization capability. In stair-like environments, although the LSTM policy eventually reaches the destination, it exhibits frequent back-and-forth movements (areas A and B), indicating less reliable spatial-temporal memorization and estimation of the current state than the SRU policy. In pit environments, when turning at area A, the LSTM policy fails to avoid previously encountered pits that are no longer visible in the current depth observation. In contrast, the SRU policy effectively recalls the locations of the pits and other previously observed obstacles, enabling it to avoid them during turns and backward motion.

Figure 8.

Comparison of navigation trajectories using (a) Navigation Policy with LSTM Unit and (b) Navigation Policy with SRU-Ours. The traversed trajectories are shown in yellow. In maze environments, the LSTM policy becomes trapped in a dead-end corridor, repeatedly looping between points B and C, while the SRU policy successfully navigates through the corridor, traverses region D, and reaches the goal. In stair-like environments, the LSTM policy exhibits frequent back-and-forth movements in areas A and B, indicating unreliable spatial-temporal mapping. In pit environments, the LSTM policy fails to avoid previously encountered pits during turns at area A, whereas the SRU policy effectively recalls their locations and avoids them, even during backward motion.

Comparison against RL-based navigation baselines

Next, we compare our proposed network structure, trained using recurrent reinforcement learning with SRU, against two state-of-the-art RL baseline methods: the goal-guided transformer-based RL approach (GTRL) (Huang et al., 2023) and the RL approach with explicit mapping and historical path (EMHP) (Lee et al., 2024). GTRL employs a goal-guided transformer (GoT) architecture to extract task-relevant visual features from stacked historical observations, enabling mapless navigation using only egocentric input. EMHP employs an external mapping pipeline to integrate historical observations for local mapping (Miki et al., 2022b) and uses an explicit historical traversed path to address POMDP challenges in long-range mapless navigation. While explicit mapping (EMHP) can theoretically achieve high accuracy in spatial-temporal registration, it has two major drawbacks: (i) it introduces significant delays that hinder real-time performance, especially on high-speed, agile platforms (Lee et al., 2024), and (ii) it relies on heuristic rules (e.g., fixed context window lengths) to select information, limiting its ability to capture complex spatial-temporal dependencies and abstract information beyond the selected context window.

Figure 9(a) presents a comparison between the EMHP baseline and our SRU-based approach. The EMHP policy collects historical paths for approximately 20 m, which is insufficient to navigate the long corridor spanning around 30 m. In contrast, recurrent neural networks offer an unlimited context window and can learn intricate spatial-temporal dependencies optimized for the given task. This allows the SRU-based policy to adapt to long-horizon navigation challenges more effectively. Moreover, our end-to-end architecture processes raw depth sensor inputs, reducing latency and better supporting the agile and fast motion of legged-wheel platforms during deployment. For the GTRL baseline, temporal history is provided by stacking several past observation frames, which are then fused using the transformer-based architecture as described in Huang et al. (2023). Following the original approach in Huang et al. (2023), we use the 4-frame history in our experiments. However, the choice of the number of stacked frames remains heuristic: a short history may miss important context, while a longer history increases computational cost quadratically. For a fair comparison, we retrain the GTRL baseline, as described in Huang et al. (2023), within our environment. We replace the RGB input with depth images and utilize the same on-policy optimization method (PPO) for end-to-end RL training. To ensure consistency with the platform utilized in Lee et al. (2024), this comparison is conducted using the simulated wheeled ANYmal (Hutter et al., 2016) robot model, which differs from the robot model used in the other comparisons in this paper. All policies are trained under identical conditions and evaluated on an independent test environment set to ensure a fair comparison. 
Note that our policy and the GTRL baseline rely solely on a front-facing camera with a limited field of view, whereas the EMHP approach incorporates a local height scan with a similar range for environmental detection but benefits from a complete 360-degree field of view for mapping.

Figure 9.

Comparison of proposed mapless method with SRU recurrent memory against the EMHP baseline approach in a maze environment. The robot’s traversed trajectory is shown in yellow, with traversal order marked as A, B, C, and D. (a) The EMHP approach starts looping in the long corridor between points B and C, failing to navigate through the dead-end corridors. (b) Our approach, with SRU recurrent memory, successfully navigates from start to goal, rerouting through the dead-end corridors and reaching the goal through area D.

The experimental results (see Table 2) indicate that our architecture, leveraging an implicit memory representation within the recurrent module, outperforms both baselines. Compared to EMHP, it achieves a 29.6% relative improvement in SR. Against GTRL, it achieves a 105.0% relative improvement. These results underscore the limitations of explicit mapping or stacked-frame histories for long-horizon navigation under limited context. Figure 9 illustrates a representative comparison: in a long-horizon maze environment, the EMHP approach, despite achieving the higher SR of the two baselines, is constrained by the limited horizon of its explicit memory; it fails to navigate through the maze and eventually loops in a long corridor. In contrast, our policy successfully traverses a dead-end corridor and reaches the goal, demonstrating the effectiveness of the SRU in learning implicit spatial-temporal mapping for long-horizon navigation tasks. To further validate the effectiveness of SRU for implicit spatial-temporal memorization in mapless navigation, we replaced GTRL’s stacked historical observations with our SRU-based recurrent memory while keeping all other components unchanged. This modified variant, denoted GTRL*, achieves a 73.6% relative improvement over the original GTRL baseline, increasing the SR from 38.2% to 66.3% (Table 2). This substantial gain highlights the advantage of SRU’s implicit recurrent memory over heuristic stacking of historical observations in capturing spatial-temporal dependencies for improved navigation performance. Notably, compared to GTRL*, our complete SRU-based approach (Ours) achieves an additional 18.1% relative improvement, attributable to our proposed spatial attention layers. The impact of these attention layers is further discussed later.

Table 2.

Overall navigation success rate (SR) comparison against baselines.

Method	SR (%)
GTRL (w. historical obs.)	38.2
EMHP (w. explicit mapping/path)	60.4
GTRL* (w. SRU memory)	66.3
Ours (w. SRU memory)	78.3

Policies with SRU implicit recurrent memory outperform: (i) GTRL (stacked historical observations) and (ii) EMHP (explicit mapping and historical path). GTRL* denotes our modified GTRL variant where stacked observations are replaced by SRU implicit memory, yielding a substantial gain over the original GTRL baseline. Ours achieves the highest SR.
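The relative improvements quoted in the text follow from the SR values in Table 2, computed as (new − baseline) / baseline:

```latex
\begin{align*}
\text{Ours vs. EMHP:}   &\quad \frac{78.3 - 60.4}{60.4} \approx 29.6\% \\
\text{Ours vs. GTRL:}   &\quad \frac{78.3 - 38.2}{38.2} \approx 105.0\% \\
\text{GTRL* vs. GTRL:}  &\quad \frac{66.3 - 38.2}{38.2} \approx 73.6\% \\
\text{Ours vs. GTRL*:}  &\quad \frac{78.3 - 66.3}{66.3} \approx 18.1\%
\end{align*}
```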

Although the EMHP approach achieves comparable overall navigation performance, we also analyze the success rate (SR) as a function of travel distance to evaluate long-range memorization and generalization capabilities. As shown in Figure 10, with a maximum episodic time of 60 s (consistent with training) and the robot’s maximum speed set to 1.5 m/s, the EMHP approach’s SR drops significantly when the travel distance exceeds 40 m. In contrast, our SRU-based approach maintains an SR of over 80% up to 50 m. When the maximum episode time is extended to 120 s, the EMHP’s SR still declines to below 60% at the same 40-m distance, constrained by its fixed context window. Conversely, our SRU-based approach sustains an SR of over 70% for distances up to 120 m, demonstrating the SRU’s superior ability to implicitly learn spatial-temporal mappings and generalize to distances beyond the training range. The baseline’s reliance on a fixed explicit memory window limits its capacity to capture long-range dependencies, hindering its generalization in extended long-distance navigation tasks.

Figure 10.

Success rate sorted by travel distance: comparison between the EMHP baseline approach, which uses explicit mapping and a fixed-length historical path, and our approach, which employs the implicit recurrent memory of SRU. Our method maintains a high success rate over longer distances and extends effectively with longer episodic times. In contrast, the baseline’s success rate drops significantly for longer travel distances, even when the maximum episodic time is doubled, due to its fixed context window limitation.

Furthermore, we observed that the EMHP policy struggles to effectively learn to climb staircases unless dense reward guidance is provided. We believe this limitation arises from the inherent difficulty explicit memory mechanisms face in capturing intricate spatial-temporal features, which are essential for the robot to develop the maneuvers required to overcome 3D obstacles effectively. Lastly, the end-to-end recurrent setup offers a simpler and more maintainable solution compared to baseline methods. In contrast, baselines rely on an external mapping pipeline and the storage of additional historical paths or observations for each robot, which can introduce complexity and overhead during both training and real-world deployment.

Importance of spatial attention layers

We now examine the role of the proposed spatial attention layers in the network architecture and evaluate their impact on navigation performance. These layers are designed to compress and emphasize relevant features from encoded observations, addressing a key challenge faced by recurrent structures: the difficulty of retaining long-term information due to the exponential decay of memory over time. By selectively focusing on the most salient features, the attention mechanism emphasizes the most relevant spatial cues for navigation based on the robot’s state and reduces the information density passed into the recurrent memory at each step. We hypothesize that this mechanism can improve the network’s memorization and navigation capabilities, enabling it to handle complex, long-range tasks more effectively.

To test this, we conduct an ablation study by: (i) removing the attention layers from our network architecture and replacing them with convolution followed by average pooling for feature compression, as implemented in Wijmans et al. (2020), and (ii) comparing the performance of our proposed spatial attention layers against the goal-guided transformer (GoT) architecture proposed in Huang et al. (2023). The GoT architecture utilizes a modified vision transformer (ViT) that integrates the goal state as an additional token. It performs self-attention across both visual feature tokens and the goal token to extract goal-relevant features. In contrast, our approach first applies self-attention exclusively to visual tokens to enhance spatial features. Subsequently, the goal and proprioceptive state are used as queries in the cross-attention layer to compress and extract the most relevant features. For a fair comparison in the ablation experiments, we use identical training settings for all approaches, integrating the SRU memories and the pretrained encoder while varying only the attention layers used to process visual features during RL. The GoT integrated with SRU is the GTRL* approach, as described in Section 5.3. Figure 11 shows the average return rewards during training. The network without the attention layers exhibits significantly lower performance compared to the two policies utilizing attention mechanisms. Additionally, our proposed spatial attention layers outperform the GoT attention mechanism. Table 3 shows the SR performance of the three configurations: (i) without attention, (ii) with GoT attention, and (iii) with our proposed spatial attention (Ours). Our method achieves a 56.2% relative SR improvement over the no-attention baseline, highlighting the importance of selectively compressing and extracting spatial features for long-range mapless navigation when utilizing implicit recurrent memory. 
Furthermore, it achieves a 15.4% relative improvement (18.1% when trained with the ANYmal robot model, as shown in Table 2) over the policy utilizing GoT attention. This demonstrates that the proposed two-stage spatial attention mechanism more effectively extracts task-relevant cues, enhancing recurrent memorization and policy optimization.

Figure 11.

Average training return rewards for attention ablations (all using SRU recurrent memory): (1) without attention (w/o.) (Wijmans et al., 2020); (2) goal-guided transformer (GoT) attention (Huang et al., 2023); and (3) the proposed two-stage spatial attention (Ours). The proposed spatial attention achieves the highest returns, indicating more effective extraction of task-relevant spatial cues for improved recurrent memorization.

Table 3.

Navigation success rate (SR) for policies integrated with different attention configurations.

Attention configuration	SR (%)
w/o. attention	50.5
GoT (GTRL*)	68.4
Spatial attention (Ours)	78.9

The policy with the proposed spatial attention (Ours) achieves the highest SR, outperforming (i) the baseline without attention (Wijmans et al., 2020) and (ii) the goal-guided transformer (GoT) (Huang et al., 2023) integrated with SRU memory, referred to as the GTRL* approach. This highlights the importance of an attention mechanism for training mapless navigation end-to-end and the effectiveness of the proposed spatial attention structure.
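The relative improvements quoted for Table 3 follow directly from the listed success rates. A short check, where the helper name `relative_improvement` is illustrative:

```python
def relative_improvement(new_sr, base_sr):
    """Relative change of new_sr over base_sr, in percent."""
    return 100.0 * (new_sr - base_sr) / base_sr

# Success rates (%) copied from Table 3.
sr = {"w/o attention": 50.5, "GoT (GTRL*)": 68.4, "Ours": 78.9}

gain_over_no_attention = relative_improvement(sr["Ours"], sr["w/o attention"])
gain_over_got = relative_improvement(sr["Ours"], sr["GoT (GTRL*)"])

print(f"{gain_over_no_attention:.1f}%")  # 56.2%
print(f"{gain_over_got:.1f}%")           # 15.4%
```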

Notably, the attention effect emerges naturally during the end-to-end RL training without requiring additional supervision or auxiliary losses. Figure 12 illustrates the attention weights generated by the cross-attention layer over raw visual inputs in three distinct real-world deployment scenarios: an indoor office, an outdoor terrace, and a forest environment. The attention weights, with four attention heads (depicted in different colors), dynamically emphasize the most relevant spatial cues, such as obstacles and navigable free space, based on the robot’s state at the time the depth input was recorded. This highlights the effectiveness of training the spatial attention mechanism end-to-end and its ability to generalize across diverse and challenging environments.

Figure 12.

Visualization of attention weights for the cross-attention layer in three distinct real-world deployment scenarios over raw depth inputs: (a) office environment, (b) outdoor terrace environment, and (c) forest environment. The attention weights dynamically highlight relevant spatial cues for navigation based on the robot’s state. The RGB images in the top corners are included for visualization purposes only.

Training with regularizations

We evaluate the role of regularization techniques in the end-to-end training of the recurrent network using reinforcement learning. First, as shown in Figure 1 and discussed in Section 4.6, while the SRU unit effectively enhances the network’s ability to learn implicit spatial memorization from sequential observations, the learning curve indicates that spatial memory learning can converge significantly slower than temporal memorization. This discrepancy, combined with the inherent properties of standard policy optimization algorithms like PPO—which restrict deviations from previous optimization steps—and the complex structure of attention networks with RNNs prone to overfitting, suggests that without proper regularization, the network may converge to suboptimal strategies. Such strategies might overly rely on easier-to-learn temporal features to solve navigation tasks, thereby failing to establish robust spatial-temporal memorization. To test this hypothesis, we conduct an ablation study by removing the regularization techniques, specifically deep mutual learning (DML), from the standard PPO training setup and comparing the performance against the setup with DML regularization.
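To make the role of DML concrete, the sketch below assumes the standard formulation from the deep mutual learning literature: two peer policies are trained jointly, and each adds a KL term that pulls it toward the other's action distribution. The diagonal Gaussian policy, the `weight` parameter, and all numeric values are hypothetical. The exact loss used in this work is defined in the method section and may differ.

```python
import math

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl

def dml_loss(ppo_loss, own_policy, peer_policy, weight=1.0):
    """PPO loss plus a mimicry term pulling this policy toward its peer.

    The peer is treated as a fixed target: in a real training loop no
    gradient would flow into `peer_policy` through this term.
    """
    mu, std = own_policy
    peer_mu, peer_std = peer_policy
    return ppo_loss + weight * gaussian_kl(peer_mu, peer_std, mu, std)

# Hypothetical outputs of two peer policies for the same observation:
# (means, standard deviations) of linear and angular velocity commands.
policy_a = ([0.50, 0.10], [0.20, 0.20])
policy_b = ([0.40, 0.00], [0.20, 0.30])

loss_a = dml_loss(ppo_loss=1.25, own_policy=policy_a, peer_policy=policy_b)
loss_b = dml_loss(ppo_loss=1.40, own_policy=policy_b, peer_policy=policy_a)
```

The extra term is zero only when both peers agree, so neither policy can settle into a narrow strategy without the other following, which is one plausible reading of how DML discourages early convergence to temporal-only shortcuts.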

Figure 13 illustrates that the network without DML exhibits lower average return rewards during training. Notably, the performance difference between standard LSTM and SRU modifications becomes more pronounced when regularization techniques are applied. As shown in Table 4, the SR performance improves from 61.8% to 65.7% (a 6.3% increase) without DML and from 63.5% to 78.9% (a significant 24.3% increase) with DML. This finding underscores that, in certain RL tasks, the network’s architecture alone may not be the sole limiting factor. Instead, the optimization process plays a critical role in fully leveraging the network’s potential, highlighting the importance of effective training strategies.

Figure 13.

Training curve comparison between policies trained using PPO with deep mutual learning (DML) regularization and PPO: The network with DML regularization techniques achieves higher returns compared to the network trained with vanilla PPO.

Table 4.

Comparison of the overall navigation success rate (SR) with and without DML regularization for policies using LSTM and SRU units.

RL training	SR (%)
LSTM w/o. DML	61.8
LSTM w. DML	63.5
SRU-Ours w/o. DML	65.7
SRU-Ours w. DML	78.9

DML significantly enhances SR for SRU (over 20%) and provides a marginal improvement for LSTM (2.8%), showcasing DML’s effectiveness in unlocking SRU’s potential for long-range mapless navigation.

Additionally, we observe that incorporating temporally consistent dropout (TC-D) into the recurrent training can also positively impact navigation performance. As shown in Table 5, the SR improves from 77.2% to 78.9% when tested in new, randomly generated environments. These findings align with the discussion in Hausknecht and Wagener (2022), which highlights the benefits of using dropout in RL training to enhance the network’s generalization and robustness.

Table 5.

Evaluation of the overall navigation success rate (SR) with and without temporally consistent dropout (TC-D).

RL training	SR (%)
SRU-Ours w/o. TC-D	77.2
SRU-Ours w. TC-D	78.9

The network with TC-D is able to maintain a similar (or even higher) SR compared to the network without TC-D, while improving robustness and generalization.
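A minimal sketch of the idea behind TC-D, under the assumption that "temporally consistent" means a dropout mask is sampled once per sequence and then reused at every time step. The class name, sizes, and drop probability are hypothetical, and the actual layer is specified in the method section.

```python
import random

class TemporallyConsistentDropout:
    """Dropout whose mask is sampled once per sequence and reused at every step."""

    def __init__(self, size, drop_prob, seed=0):
        self.size = size
        self.drop_prob = drop_prob
        self.rng = random.Random(seed)
        self.mask = None
        self.reset()

    def reset(self):
        """Sample a fresh mask, e.g. at the start of an episode or sequence."""
        keep = 1.0 - self.drop_prob
        # Inverted dropout: kept units are scaled by 1 / keep.
        self.mask = [
            (1.0 / keep) if self.rng.random() < keep else 0.0
            for _ in range(self.size)
        ]

    def __call__(self, features):
        """Apply the stored mask; the same units are dropped at every step."""
        return [f * m for f, m in zip(features, self.mask)]

dropout = TemporallyConsistentDropout(size=6, drop_prob=0.5, seed=42)
step_1 = dropout([1.0] * 6)
step_2 = dropout([1.0] * 6)   # same dropped units as step_1
dropout.reset()               # new sequence: the mask is resampled
```

Holding the mask fixed keeps the recurrent state from being perturbed by a different noise pattern at every step, which would otherwise accumulate over long sequences.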

Large-scale pretraining for sim-to-real transfer

In this section, we evaluate the pretrained image encoder, trained on a large-scale synthetic dataset, for its ability to bridge the sim-to-real gap in real-world perception. Additionally, we assess the effectiveness of the proposed depth noise model in reducing discrepancies between synthetic and real-world data. To this end, we conduct zero-shot transfer experiments on legged-wheel platforms across diverse real-world environments to demonstrate the generalization of our approach.

Pretrain and depth noise

Here, we analyze the latent space distribution of encoders trained under two distinct conditions: (i) an encoder trained exclusively on simulated depth images generated during RL navigation training (RL images), and (ii) an encoder pretrained on large-scale synthetic data from Wang et al. (2020), augmented with the proposed parallelizable depth noise model (Figure 4). To evaluate these encoders, we compare the latent features extracted from their outputs using two data sources: (i) RL images and (ii) real-world stereo depth images captured by the ZEDX camera during deployment (real-world images). This analysis highlights how the latent space distributions differ depending on the data source used to train the encoder.

Figure 14 illustrates a 2D principal components analysis (PCA) (Dunteman, 1989) projection of the latent features. When the features are derived from the encoder pretrained on large-scale synthetic data (Figure 14(a)), the latent distribution of RL images spans a wider range that encompasses the features extracted from real-world data. This indicates that the encoder pretrained on large-scale synthetic data effectively captures a wide range of features, enabling it to generalize well to real-world scenarios. In contrast, the encoder trained solely on RL images (Figure 14(b)) exhibits a narrower latent space distribution, failing to cover many real-world features. This suggests that an encoder trained exclusively on simulated depth images collected during RL navigation training may struggle to generalize effectively to real-world data when deployed.

Figure 14.

Comparison of latent space distributions: (a) The feature distribution from the encoder pretrained on large-scale synthetic data effectively covers the distribution of real-world data, indicating better generalization. (b) The feature distribution from the encoder trained solely on simulated data collected during RL fails to cover the distribution of real-world depth images, posing challenges in generalizing to real-world data.

Additionally, Figure 15 provides a qualitative comparison of depth reconstruction using features extracted from the same two pretrained encoders. The comparison is based on a real-world stereo depth input captured during deployment. The encoder pretrained on large-scale data demonstrates effective reconstruction of the depth image, with only minor blurring effects (Figure 15(b)). In contrast, the encoder trained solely on RL images struggles to reconstruct the input depth image effectively, resulting in outputs with significant artifacts and noise (Figure 15(c)).

Figure 15.

Comparison of depth image reconstruction using features from encoders pretrained on different data sources. (a) Original input stereo depth image from real-world deployment, captured using the ZEDX camera. (b) Reconstructed depth image using features extracted from the encoder pretrained on large-scale synthetic data with noise augmentation. (c) Reconstructed depth image using features extracted from the encoder trained exclusively on simulated images collected during RL navigation training.

To quantitatively assess the distributional disparity of features from encoders trained on different sources, we adapt the method from Lee et al. (2018) to measure the Mahalanobis distance (MD) for the latent distributions derived from each encoder. In addition to the two pretraining sources mentioned earlier, we also analyze the latent distribution of the encoder pretrained on large-scale synthetic data without noise augmentation. This allows us to evaluate the effectiveness of the proposed depth noise model in further reducing the sim-to-real gap between synthetic depth images and real-world stereo depth. The MDs are computed between the latent features of real-world images and the latent feature distributions of RL images extracted from each encoder. As shown in Figure 16, pretraining on large-scale synthetic data effectively reduces the MD, lowering the median from 1.15 (RL images) to 0.82 (large-scale synthetic data without noise). This demonstrates the pretrained encoder’s effectiveness in covering the distribution of real-world perception inputs. Furthermore, incorporating the proposed depth noise model further reduces the MD to 0.69, underscoring its role in narrowing the differences between synthetic and real-world depth data. These results highlight that the encoder, pretrained on large-scale data and augmented with the proposed depth noise model, can effectively minimize the sim-to-real gap, enabling improved generalization to real-world environments.

Figure 16.

Comparison of Mahalanobis distances between the latent features of real-world images and the latent feature distributions of RL images, using encoders pretrained on different sources: (i) RL images, (ii) large-scale synthetic data without noise augmentation, and (iii) large-scale synthetic data with noise augmentation.
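For readers unfamiliar with the metric, the sketch below computes a Mahalanobis distance from a query feature to a Gaussian fitted on a reference set. It assumes a single Gaussian over the RL-image latents and uses toy 2D features. The helper names, the ridge term, and the data are hypothetical, and any normalization applied to the values reported in Figure 16 is not reproduced here.

```python
import math

def mean_and_cov(samples, ridge=1e-6):
    """Sample mean and covariance (with a small ridge for invertibility)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for s in samples:
        diff = [s[i] - mean[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    for i in range(d):
        cov[i][i] += ridge
    return mean, cov

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    d = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * d
    for r in reversed(range(d)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, d))
        x[r] = (aug[r][d] - tail) / aug[r][r]
    return x

def mahalanobis(query, mean, cov):
    """sqrt((q - mean)^T cov^{-1} (q - mean))."""
    diff = [q - m for q, m in zip(query, mean)]
    return math.sqrt(sum(a * b for a, b in zip(diff, solve(cov, diff))))

# Hypothetical 2D latent features of simulated (RL) images ...
rl_latents = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mean, cov = mean_and_cov(rl_latents)
# ... and of two real-world images.
near = mahalanobis([1.0, 1.0], mean, cov)  # exactly at the mean
far = mahalanobis([4.0, 1.0], mean, cov)   # three units off along the first axis
```

A smaller distance means the real-world feature lies closer to the bulk of the simulated features in units of their spread, which is the sense in which a lower median in Figure 16 indicates a smaller sim-to-real gap.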

Real-world tests on legged-wheel robot

To evaluate how well the pretrained image encoder and the proposed attention-based recurrent network generalize to real-world environments, we conduct several zero-shot transfer experiments on a Unitree B2W robot with a learning-based locomotion policy from RIVR. The robot is equipped with a front-facing ZEDX stereo depth camera and an NVIDIA Jetson AGX Orin that runs the policy onboard. The pretrained encoder and network are deployed directly on the robot without any fine-tuning on real-world data. In all tests, the robot receives no prior information about the environment and uses only the stereo depth images from the front-facing camera as exteroceptive input. Additionally, a LiDAR-based state estimation and localization module (Chen et al., 2023) provides the robot’s proprioceptive state, including linear and angular velocities v_t and ω_t, projected gravity n_t, and relative goal position p_t with respect to the robot’s frame. The robot is controlled by linear and angular velocity commands, referred to as action a_t, which are the input to the locomotion policy.
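As an illustration of the policy interface, the sketch below groups the proprioceptive inputs into one structure and shows how a relative goal position could be obtained from a world-frame pose. It assumes a planar (yaw-only) simplification. The field names, the helper `goal_in_robot_frame`, and all values are hypothetical and do not describe the deployed state estimation module.

```python
import math
from dataclasses import dataclass

@dataclass
class ProprioState:
    """Proprioceptive policy input at time t (field names are illustrative)."""
    lin_vel: tuple   # v_t, linear velocity in the robot frame
    ang_vel: tuple   # omega_t, angular velocity in the robot frame
    gravity: tuple   # n_t, gravity direction projected into the robot frame
    goal_rel: tuple  # p_t, goal position relative to the robot

def goal_in_robot_frame(robot_xy, robot_yaw, goal_xy):
    """Rotate the world-frame offset to the goal into the robot's heading frame."""
    dx = goal_xy[0] - robot_xy[0]
    dy = goal_xy[1] - robot_xy[1]
    cos_y, sin_y = math.cos(robot_yaw), math.sin(robot_yaw)
    # Apply the inverse (transpose) of the yaw rotation.
    return (cos_y * dx + sin_y * dy, -sin_y * dx + cos_y * dy)

# Robot at the origin facing +y (yaw = 90 degrees); goal 10 m away along +y.
p_t = goal_in_robot_frame((0.0, 0.0), math.pi / 2, (0.0, 10.0))
# p_t is approximately (10.0, 0.0): the goal lies straight ahead of the robot.

state = ProprioState(
    lin_vel=(0.5, 0.0, 0.0),
    ang_vel=(0.0, 0.0, 0.1),
    gravity=(0.0, 0.0, -1.0),
    goal_rel=p_t,
)
```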

First, we conduct an experiment in an office environment, as shown in Figure 17, to compare the navigation performance of our policy with the SRU memory module against a baseline model using a standard LSTM unit. In this experiment, the robot is tasked with navigating from one side of the office to the other while avoiding obstacles. To evaluate the long-term spatial-temporal memorization capabilities of the SRU module, several passageways are temporarily blocked, requiring the robot to backtrack and search for alternative routes to reach the goal. Additionally, dynamic changes are introduced by unblocking certain areas during navigation to further assess the robustness of the SRU-enhanced policy. The policy with SRU demonstrates the ability to explore dead ends, navigate around obstacles, and re-evaluate its path to adapt to dynamic changes in the environment (Figure 17(a)). The robot successfully reaches the goal, showcasing the effectiveness of utilizing the SRU memory module to learn robust spatial-temporal memorization from sequential observations. In contrast, the baseline model with a standard LSTM fails to reach the goal and repeatedly loops between dead-end areas, as shown in Figure 17(b).

Figure 17.

Comparison of navigation trajectories (orange) in an office environment. A, B, and C indicate areas that the robot traverses in sequence. (a) shows that the robot using the SRU memory module successfully navigates through two dead ends and reaches the goal while adapting to changes in the environment (the blocker located in area A was initially set and later removed). (b) illustrates that the baseline model with a standard LSTM fails to reach the goal and repeatedly loops between the dead-end areas C and B.

To further evaluate the generalization and performance of the proposed network architecture in long-horizon navigation tasks, we conduct experiments in a variety of real-world environments—including an indoor campus main hall, outdoor terrace areas, and forest environments—using the same pretrained encoder and navigation policy (see Figure 18). In these experiments, the robot is tasked with navigating to a designated goal and returning to its starting point. Note that the policy is designed to maintain episodic memory only between the start and the goal and is reset when a new goal is given. The results demonstrate that the policy generalizes effectively to unseen environments, handling diverse obstacles such as walls, stairs, vegetation, bushes, and trees, as well as navigating uneven terrains. Additionally, the policy adapts to larger-scale scenarios, including extended goal distances of more than 70 m and traversing over 100 m, as shown in Figure 18(c). Note that the maximum start-goal distance during RL training is 30 m. The figures show the trajectories of the robot successfully navigating through these environments, with point clouds generated from the state estimation module (Chen et al., 2023) provided solely for visualization. Note that, due to the absence of a dedicated mapping module or loop closure mechanism, the trajectories shown may exhibit some drift and errors.

Figure 18.

Evaluation of long-range mapless navigation in diverse real-world environments: (a) main hall of a university, (b) outdoor terrace, and (c) forest environment with natural obstacles. In each scenario, the robot is tasked with two separate navigation goals (memory reset between goals), resulting in two trajectory segments (orange and blue). Labels A, B, and C mark key areas traversed by the robot.

Limitations and future work

While the proposed SRUs in this paper demonstrate significant improvements in spatial-temporal learning capabilities, their recurrent nature remains subject to exponential memory decay, which can limit their ability to retain global context over extended sequences. As a result, the long-range navigation capabilities presented in this paper are centered on local, mapless navigation using egocentric sensing. In this context, “long-range” refers to planning horizons that extend well beyond the local perception radius (e.g., 10 m), enabling rerouting from local dead ends without reliance on an explicit global map. Extending this approach to global-scale navigation—spanning kilometers or hours—would likely require additional mechanisms or architectural enhancements, such as the integration or maintenance of a global map.
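The effect of exponential decay can be illustrated with a toy scalar recurrence, h_t = a * h_(t-1) + (1 - a) * x_t, in which the weight of an input seen k steps ago shrinks as a^k. This is a deliberately simplified stand-in, not the SRU update, and the retention factors below are hypothetical.

```python
import math

def steps_until_faded(retention, threshold=0.01):
    """Steps after which an old input's weight, retention**k, drops below threshold."""
    return math.ceil(math.log(threshold) / math.log(retention))

for retention in (0.9, 0.99):
    print(retention, steps_until_faded(retention))
# 0.9 44
# 0.99 459
```

Even with a retention factor close to one, the influence of an observation fades after a bounded number of steps, which is why kilometer- or hour-scale navigation would likely need memory beyond the recurrent state.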

Furthermore, while SRUs enhance the network’s capacity for implicit spatial memorization and improve long-range navigation performance, the precise characteristics of the information retained and utilized during end-to-end navigation training remain unclear. This highlights a broader challenge in explainable artificial intelligence, where understanding the internal representations and decision-making processes of neural networks continues to be an active area of research (Mi et al., 2024). Future work could explore integrating SRUs with recent advancements in foundation pretraining, such as DINO (Caron et al., 2021), to combine their strengths in scene understanding with the efficiency of recurrent structures, further enhancing the policy’s performance in complex real-world environments. Investigating auxiliary losses or additional regularization techniques to further leverage the potential of spatial-temporal memorization in SRUs could also be beneficial. Additionally, extending the application of SRUs to other domains, such as robotic manipulation and 3D reconstruction, could unlock new possibilities and advancements in spatial-temporal learning. In summary, while SRUs are effective, they represent a simple yet practical solution—not necessarily unique or optimal—that proves successful in our end-to-end mapless navigation context. More importantly, this work aims to highlight the potential of implicit spatial memory mechanisms in addressing complex navigation challenges while identifying opportunities for further exploration and optimization in both methodology and application domains.

Conclusion

In this study, we identify and address a limitation of existing recurrent neural network architectures in the context of navigation: while RNNs excel at modeling temporal sequences, they are not inherently designed for spatial memorization or transforming observations from varying perspectives. This limitation makes them less effective in building the spatial representations required for mapless navigation using egocentric perception. To address this, we propose spatially-enhanced recurrent units (SRUs), which integrate an implicit spatial transformation operation into standard GRU and LSTM structures. These SRUs are incorporated into a novel attention-based architecture, trained end-to-end via reinforcement learning, achieving long-horizon navigation tasks with a single forward-facing depth camera. Our research further highlights the importance of regularization strategies in end-to-end reinforcement learning frameworks. Techniques such as temporally consistent dropout and deep mutual learning are crucial for fully leveraging SRUs’ potential and preventing early overfitting. Experiments demonstrate SRUs’ superior navigation performance compared to standard LSTM and GRU models. Moreover, we compare our implicit recurrent memory-based approach with a state-of-the-art baseline that utilizes explicit mapping and historical paths. Our findings illustrate the superior effectiveness of recurrent memory structures for long-range mapless navigation tasks. Additionally, through ablation studies, we demonstrate the role of specific design choices, particularly the spatial attention mechanism, in enhancing overall navigation performance. Lastly, we analyze and address the challenge of sim-to-real transfer for stereo depth perception by integrating large-scale pretraining. This approach enables successful zero-shot transfer and robust generalization across diverse real-world environments, including indoor, outdoor, and forest scenarios, as demonstrated in the supplemental video, underscoring the practical applicability and effectiveness of our proposed methodology.

Acknowledgments

The authors acknowledge Nikita Rudin, Takahiro Miki, Jonas Frey, Pascal Roth, and Chong Zhang for their valuable feedback and discussions. The authors also extend their gratitude to Marco Trentini for his assistance in conducting real-world experiments and testing the LiDAR-inertial state estimation module. Additionally, the authors recognize the RIVR team for their technical support with the legged-wheel robot platform utilized in this research.

ORCID iDs

Fan Yang

Per Frivik

David Hoeller

Chen Wang

Cesar Cadena

Marco Hutter

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project is funded by the Swiss National Science Foundation (SNSF) under project No. 227617 and through the National Centre of Competence in Research (NCCR) Automation, as well as by the European Union’s Horizon Europe Framework Programme under grant agreements No. 101070596 and No. 101070405. Additional support was provided by Armasuisse Science and Technology and Mercedes-Benz AG.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

References

Barron, Malik (2013a) Intrinsic Scene Properties From a Single rgb-d Image. CVPR.

Barron, Malik (2013b) Intrinsic scene properties from a single rgb-d image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 17–24. IEEE.

Bhattacharya, Rao, Parikh, Kunapuli, Tao, Matni, Kumar (2025) Vision transformers for end-to-end vision-based quadrotor obstacle avoidance. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 1–8.

Bohg, Romero, Herzog, et al. (2014a) Robot Arm Pose Estimation Through Pixel-Wise Part Classification. ICRA.

Bohg, Romero, Herzog, et al. (2014b) Robot arm pose estimation through pixel-wise part classification. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3143–3150. IEEE.

Bohlin, Kavraki (2000) Path planning using lazy prm. In: Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), Volume 1, pp. 521–528. IEEE.

Bojarski, Del Testa, Dworakowski, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.

Caron, Touvron, Misra, et al. (2021) Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660. IEEE.

Cèsar-Tondreau, Warnell, Stump, et al. (2021) Improving autonomous robotic navigation using imitation learning. Frontiers in Robotics and AI 8: 627730.

Chen, Nemiroff, Lopez (2023) Direct lidar-inertial odometry: lightweight lio with continuous-time motion correction. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 3983–3989. IEEE.

Cho, van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734.

Choi, Park, Kim, et al. (2019) Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 5993–6000. IEEE.

Cimurs, Suh, Lee (2021) Goal-driven autonomous exploration through deep reinforcement learning. IEEE Robotics and Automation Letters 7(2): 730–737.

Dijkstra (1959) A note on two problems in connexion with graphs. Numerische Mathematik 1(1): 269–271.

Dobson, Bekris (2014) Sparse roadmap spanners for asymptotically near-optimal motion planning. The International Journal of Robotics Research 33(1): 18–47.

Dozat (2016) Incorporating Nesterov Momentum into Adam. In: Proceedings of the 4th International Conference on Learning Representations, Workshop Track, San Juan, Puerto Rico, 2–4 May 2016, pp. 1–4.

Duarte, Lau, Pereira, et al. (2023) Lstm, convlstm, mdn-rnn and gridlstm memory-based deep reinforcement learning. ICAART 2: 169–179.

Dunteman (1989) Principal Components Analysis. Sage, Vol. 69.

Francis, Faust, Chiang HTL, et al. (2020) Long-range indoor navigation with prm-rl. IEEE Transactions on Robotics 36(4): 1115–1134.

Fu, Kumar, Agarwal, et al. (2022) Coupling vision and proprioception for navigation of legged robots. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17273–17283. IEEE.

Gu, Dao (2023) Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

Gu, Dao, Ermon, et al. (2020a) Hippo: recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems 33: 1474–1487.

Gu, Goel, Ré (2022) Efficiently modeling long sequences with structured state spaces. In: The International Conference on Learning Representations (ICLR).

Gu, Gulcehre, Paine, et al. (2020b) Improving the gating mechanism of recurrent neural networks. In: International Conference on Machine Learning, pp. 3800–3809. PMLR.

Handa, Whelan, McDonald, et al. (2014) A benchmark for rgb-d visual odometry, 3d reconstruction and slam. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1524–1531. IEEE.

Hart, Nilsson, Raphael (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4(2): 100–107.

Hausknecht, Wagener (2022) Consistent dropout for policy gradient reinforcement learning. arXiv preprint arXiv:2202.11818.

He, Zhang, Xiao, et al. (2024) Agile but safe: learning collision-free high-speed legged locomotion. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands. IEEE.

Hochreiter, Schmidhuber (1997) Long short-term memory. Neural Computation 9(8): 1735–1780.

Hoeller, Wellhausen, Farshidian, et al. (2021) Learning a state representation and navigation in cluttered and dynamic environments. IEEE Robotics and Automation Letters 6(3): 5081–5088.

Huang, Zhou, et al. (2023) Goal-guided transformer-enabled reinforcement learning for efficient autonomous navigation. IEEE Transactions on Intelligent Transportation Systems 25(2): 1832–1845.

Hutter, Gehring, Jud, et al. (2016) Anymal-a highly mobile and dynamic quadrupedal robot. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 38–44. IEEE.

Karaman, Frazzoli (2011) Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research 30(7): 846–894.

Kareer, Yokoyama, Batra, et al. (2023) Vinl: visual navigation and locomotion over obstacles. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 2018–2024. IEEE.

Karnan, Warnell, Xiao, et al. (2022) Voila: visual-observation-only imitation learning for autonomous navigation. In: 2022 International Conference on Robotics and Automation (ICRA), pp. 2497–2503. IEEE.

Kavraki, Svestka, Latombe, et al. (1996) Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation 12(4): 566–580.

Kuffner, LaValle (2000) Rrt-connect: an efficient approach to single-query path planning. In: Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), Volume 2, pp. 995–1001. IEEE.

LaValle, Kuffner, Donald, et al. (2001) Rapidly-exploring random trees: progress and prospects. Algorithmic and computational robotics: New Directions 5: 293–308.

Lee, et al. (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In: Advances in Neural Information Processing Systems. MIT Press, vol. 31.

Lee, Bjelonic, Reske, et al. (2024) Learning robust autonomous navigation and locomotion for wheeled-legged robots. Science Robotics 9(89): eadi9641.

Lin, Dollár, Girshick, et al. (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. IEEE.

Loquercio, Kaufmann, Ranftl, et al. (2021) Learning high-speed flight in the wild. Science Robotics 6(59): eabg5810.

Ma, Dai, Bai, et al. (2024) Rewrite the stars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5694–5703. IEEE.

Matthis, Yates, Hayhoe (2018) Gaze and the control of foot placement when walking in natural terrain. Current Biology: CB 28(8): 1224–1233.

Mescheder, Oechsle, Niemeyer, et al. (2019) Occupancy networks: learning 3d reconstruction in function space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470. IEEE.

Mi, Jiang, Luo, et al. (2024) Toward explainable artificial intelligence: a survey and overview on their intrinsic properties. Neurocomputing 563: 126919.

Miki, Lee, Hwangbo, et al. (2022a) Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics 7(62): eabk2822.

Miki, Wellhausen, Grandia, et al. (2022b) Elevation mapping for locomotion and navigation using gpu. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2273–2280. IEEE.

Mittal, et al. (2023) Orbit: a unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters 8(6): 3740–3747.

Mohajerin, Rohani (2019) Multi-step prediction of occupancy grid maps with recurrent neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10600–10608. IEEE.

Ortiz-Haro, Hönig, Hartmann, et al. (2024) idb-a*: iterative search and optimization for optimal kinodynamic motion planning. IEEE Transactions on Robotics 41: 2031–2049.

Pfeiffer, Schaeuble, Nieto, et al. (2017) From perception to decision: a data-driven approach to end-to-end motion planning for autonomous ground robots. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 1527–1533. IEEE.

Pinto, Andrychowicz, Welinder, Zaremba, Abbeel (2018) Asymmetric actor critic for image-based robot learning. In: Proceedings of Robotics: Science and Systems. Pittsburgh, Pennsylvania. DOI: 10.15607/RSS.2018.XIV.008

Radosavovic, Kosaraju, Girshick, et al. (2020) Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436. IEEE.

Rudin, Hoeller, Bjelonic, et al. (2022) Advanced skills by learning locomotion and local navigation end-to-end. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2497–2503. IEEE.

Ruiz-Serra, White, Petrie, et al. (2022) Towards self-attention based visual navigation in the real world. arXiv preprint arXiv:2209.07043.

Savinov, Dosovitskiy, Koltun (2018) Semi-parametric topological memory for navigation. In: International Conference on Learning Representations (ICLR).

58.

Schulman

Wolski

Dhariwal

, et al. (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

59.

Shah

Sridhar

Bhorkar

, et al. (2022) Gnm: A general navigation model to drive any robot. arXiv preprint arXiv:2210.03370.

60.

Shah

Sridhar

Dashora

, et al. (2023) ViNT: a foundation model for visual navigation. In: 7th Annual Conference on Robot Learning. MIT Press.

61.

Shi

, et al. (2019) End-to-end navigation strategy with deep reinforcement learning for mobile robots. IEEE Transactions on Industrial Informatics 16(4): 2393–2402.

62.

Siami-Namini

Tavakoli

Namin

(2019) The performance of lstm and bilstm in forecasting time series. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 3285–3292. IEEE.

63.

Surmann

Jestel

Marchel

, et al. (2020) Deep reinforcement learning for real autonomous mobile robot navigation in indoor environments. arXiv preprint arXiv:2005.13857.

64.

Sutskever

Vinyals

(2014) Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems. MIT Press. Vol. 27.

65.

Truong

Yarats

, et al. (2021) Learning navigation skills for legged robots with learned robot embeddings. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 484–491. IEEE.

66.

Wang

Zhu

Wang

, et al. (2020) Tartanair: a dataset to push the limits of visual slam. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–4916. IEEE.

67.

Wang

Huang

Sun

Yan

Xing

(2025) Uniocc: A unified benchmark for occupancy forecasting and prediction in autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE.

68.

Webb

Berg Jvd (2012) Kinodynamic rrt*: Optimal motion planning for systems with linear differential constraints. arXiv preprint arXiv:1205.5088.

69.

Weerakoon

Sathyamoorthy

Patel

, et al. (2022) Terp: reliable planning in uneven outdoor environments using deep reinforcement learning. In: 2022 International Conference on Robotics and Automation (ICRA), pp. 9447–9453. IEEE.

70.

Wei

Zhao

Zheng

, et al. (2023) Surroundocc: multi-camera 3d occupancy prediction for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 21729–21740. IEEE.

71.

Wellhausen

Hutter

(2023) Artplanner: robust legged robot navigation in the field. Field Robotics 3: 413–434.

72.

Wijmans

Kadian

Morcos

Lee

Essa

Parikh

Savva

Batra

(2020) DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. In: International Conference on Learning Representations (ICLR).

73.

Wijmans

Savva

Essa

, et al. (2023) Emergence of maps in the memories of blind navigation agents. AI Matters 9(2): 8–14.

74.

Wang

Esfahani

, et al. (2021) Learn to navigate autonomously through deep reinforcement learning. IEEE Transactions on Industrial Electronics 69(5): 5342–5352.

75.

Xie

Cao

Wang

, et al. (2025) Representation convergence: Mutual distillation is secretly a form of regularization. arXiv preprint arXiv:2501.02481.

76.

Yang

Cao

Zhu

, et al. (2022a) Far planner: fast, attemptable route planner using dynamic visibility update. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 9–16. IEEE.

77.

Yang

Zhang

Hansen

, et al. (2022b) Learning vision-guided quadrupedal locomotion end-to-end with cross-modal transformers. In: International Conference on Learning Representations. IEEE.

78.

Zeng

Zhang

Ehsani

, et al. (2024) Poliformer: Scaling on-policy rl with transformers results in masterful navigators. arXiv preprint arXiv:2406.20083.

79.

Zhang

Jin

Frey

, et al. (2024) Resilient legged local navigation: learning to traverse with compromised perception end-to-end. In: 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 34–41. IEEE.

80.

Zhu

Mottaghi

Kolve

, et al. (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364. IEEE.

