Sage Journals: Discover world-class research

Abstract

This work introduces Deep Policy Similarities (DeePS), a learning-based bisimulation approach designed to enhance generalization in reinforcement learning for robotic control. Traditional reward-based bisimulation metrics often fail to enable effective policy transfer, particularly in environments with inconsistent reward structures. DeePS overcomes these limitations by leveraging policy similarities and approximating the policy similarity metric using a forward dynamics model. This approach facilitates more efficient and effective policy transfer across semantically equivalent environments. Through a series of control experiments, DeePS is shown to significantly outperform standard reinforcement learning methods and reward-based bisimulation approaches. In the noisy cartpole environment with randomized rewards, DeePS achieved 53.5% and 79.0% higher test rewards compared to Soft Actor-Critic and Deep Bisimulation for Control (DBC), respectively. Additionally, in a zero-shot evaluation on the Minigrid simple crossing environment, DeePS outperformed existing approaches, with test rewards 79.9% and 99.5% higher than DBC and RAPID, respectively. These results demonstrate that DeePS significantly enhances the ability of reinforcement learning models to generalize to diverse, unseen environments. This makes DeePS a valuable advancement for reinforcement learning, particularly in robotic applications where adaptability and robustness are critical.

Keywords

Deep learning for robotics reinforcement learning policy similarities generalization policy transfer

Introduction

Reinforcement learning is a powerful framework for developing autonomous systems capable of making sequential decisions to maximize cumulative rewards. In the realm of robotic control, reinforcement learning enables robots to learn complex behaviors through interactions with their environment, eliminating the need for pre-programmed instructions. This capability is critical for tasks such as navigation, manipulation, and coordination in complex environments with undefined solutions.

A significant breakthrough in reinforcement learning for robotics was the development of Deep Q-Network,¹ which integrated deep learning with traditional Q-learning.² This innovative approach enabled reinforcement learning to manage high-dimensional state spaces, making it feasible to apply reinforcement learning techniques to more sophisticated robotic tasks. For instance, reinforcement learning was successfully applied to train robots for object manipulation, locomotion, and autonomous navigation.^3–6

Despite these advances, applying reinforcement learning to real-world robotic problems remains challenging, particularly in achieving generalization—the ability to transfer learned behaviors from training environments to unseen, semantically equivalent environments. Real-world environments are inherently unpredictable, posing a significant challenge for reinforcement learning models that are often optimized for static, well-defined environments. For instance, a robot trained with reinforcement learning to navigate a specific indoor environment may perform well in training, but struggle in new settings with different layouts or goals. This difficulty in adapting to changes highlights the fragility of current reinforcement learning approaches as a result of overfitting to training conditions leading to poor performances in novel situations.^7,8

Traditional reinforcement learning benchmarks such as MuJoCo⁹ or the Arcade Learning Environment¹⁰ evaluate agents in controlled and repeatable settings. While useful for assessing specific agent capabilities, these benchmarks fail to capture the complexity and variability of dynamic real-world environments. To this end, more recent benchmarks, such as those presented in Nichol et al.,¹¹ which introduce a train–test split reveal that standard deep reinforcement learning methods perform poorly under varied conditions. This discrepancy in performance between tightly controlled experiments and practical deployment further underscores the need for reinforcement learning models that are robust and can generalize effectively across diverse conditions.

To address this challenge, studies have explored various strategies to improve the generalizability of reinforcement learning in robotics. Techniques such as data augmentation^12–14 and domain randomization^15,16 introduce variability during training, helping to develop more robust models. Regularization and dropout^17,18 were also applied to neural models to reduce overfitting to training environments. In parallel, procedurally generated environments^7,19 offer a means to create diverse training conditions, alleviating concerns with overfitting. These environments continuously generate new and varied scenarios, guiding the agent to learn more adaptable policies. However, these existing methods largely fail to leverage the temporal dependency between sequential decisions when considering generalization for reinforcement learning—a characteristic that is unique to reinforcement learning.

More recent approaches have adapted the temporal properties of reinforcement learning to learn general policies with bisimulation techniques.^20–25 Bisimulation methods, in general, seek to encode state representations such that states leading to similar behaviors are represented similarly. This helps in learning policies that are more general and adaptable by recognizing and responding to behavioral similarities in different states. Notably, bisimulation approaches are flexible and can be integrated with other existing enhancement techniques such as the data augmentation or algorithmic modifications described previously.

Enhancing the generalization capabilities of reinforcement learning is crucial for deploying it in real-world robotics, making it a key focus of ongoing research. By addressing these generalization challenges, reinforcement learning can maximize its potential, enabling trained robots to perform a wide range of tasks autonomously and effectively in diverse settings.

The contribution of this work is as follows: (i) we introduce a novel learning-based bisimulation approach that leverages policy-driven actions to enhance the generalization capabilities of reinforcement learning models efficiently and (ii) propose the use of an additional nonlinear projection layer on the encoder representations for improved performance. (iii) We also demonstrate that our proposed method improves generalization across diverse environments that are semantically equivalent, through a series of control-related experiments.

These advancements represent a critical step toward developing generalized reinforcement learning policies that are not only effective in controlled scenarios, but also capable of handling the complexities and unpredictability of real-world applications. Our findings suggest that integrating bisimulation with policy-driven strategies using a learning-based approach can bridge the gap between theoretical research and practical deployment, paving the way for more reliable and versatile robotic systems.

Markov Decision Processes

Markov decision processes (MDPs)²⁶ provide a mathematical framework for modeling decision-making in situations where outcomes are influenced by both the action of the agent and the probabilistic elements of the environment. MDPs are widely used in various fields of work, including robotics, transportation systems,²⁷ and artificial intelligence, particularly in the context of reinforcement learning.

An MDP is defined by a tuple $(S, A, P, R, γ)$ where:

$S$ is a finite set of states representing all possible situations the agent might encounter.

$A$ is a finite set of actions available to the agent at each state.

$P : S \times A$ is the state transition probability function, where $P (s^{'} | s, a)$ represents the probability of transitioning to state $s^{'}$ from state $s$ after taking action $a$ .

$R : S \times A$ is the reward function, where $R (s, a)$ gives the immediate reward received after taking action $a$ in state $s$ .

$γ \in [0, 1]$ is the discount factor, which quantifies the importance of future rewards compared to immediate rewards.

The objective in reinforcement learning is to find an optimal policy

π *

that maximizes the expected cumulative reward for the agent over time and the policy

π : S \to A

determines the action that the agent should take in each state.

MDPs assume the Markov property, where the future states depend only on the current state and action, simplifying decision-making. While this assumption facilitates modeling and optimization within individual environments, generalization across environments requires understanding and identification of similarities between the states for effective policy transfer. Solving an MDP focuses on maximizing returns in a single environment^28–30 and generalization requires the policy to recognize and adapt to similarities across multiple MDPs.

Although deep reinforcement learning techniques have shown remarkable success in developing effective policies for individual MDPs, $M$ , they face difficulties when agents are required to operate across a collection of environments, represented as $ρ = {M_{1}, \dots, M_{n}}$ , after being trained on only a limited subset of the collection akin to zero-shot learning. Agents trained on this limited subset often struggle to adapt to new but semantically related scenarios. This highlights the critical challenge of generalization in reinforcement learning: ensuring policies perform robustly in unseen environments with shared behavioral structures.

MDPs form the backbone of reinforcement learning algorithms, where an agent learns to make decisions by interacting with the environment. Through trial and error, the agent seeks to discover the optimal policy that yields the highest rewards and MDPs provide the theoretical foundation for understanding how agents can learn effective strategies in these uncertain environments. It offers a robust framework for modeling complex decision-making problems essential for developing intelligent robotic systems capable of operating autonomously.

Bisimulation in Reinforcement Learning

In MDPs, bisimulation provides a systematic approach to measure the similarity between states, where bisimilar states are expected to exhibit similar behavior. This approach is especially useful for tasks such as model reduction or state aggregation,³¹ which aim to simplify a complex MDP while maintaining key behavioral characteristics. A significant advancement for applying bisimulation in MDPs is the introduction of the bisimulation metric,²⁰ which provides a quantitative measurement of the similarity between states, focusing particularly on their reward structures.

Definition 1 (Bisimulation metrics²⁰)

Let $F : M \to M$ be defined by

\begin{aligned} F (d) (s_{i}, s_{j}) = & max_{a \in A} (| R (s_{i}, a) - R (s_{j}, a) | \\ + c W_{1} (d) (P (\cdot | s_{i}, a), P (\cdot | s_{j}, a))) \end{aligned}

(1)

where the operator

F

has a unique fixed-point

d_{\sim}

in the metric space

M

representing the bisimulation metric.

This bisimulation metric provides a nuanced method for evaluating similarities between states in MDPs, comprising two main components: the absolute difference in immediate rewards from specific actions and the variation in future state transitions, assessed using the 1-Wasserstein distance,³² denoted as $W_{1}$ , which is adjusted by a discount factor $c$ , similar to the $γ$ in MDPs. Derived from optimal transport theory, the 1-Wasserstein distance quantifies the similarity between two distributions by determining the minimum cost to transform one distribution into the other. Essentially, a lower cost indicates a higher similarity between the distributions, implying that the corresponding actions lead to similar subsequent states and thus comparable future values.

The bisimulation metric evaluates the differences between states $s_{i}$ and $s_{j}$ by examining the maximum disparity across all actions. While effective, the use of the $max$ operator can lead to overly pessimistic estimates, limiting policy transfer,³³ and increasing computation costs, particularly in environments with continuous action spaces. These challenges make the bisimulation metric impractical for complex applications.

The on-policy bisimulation metric²² was introduced to address these limitations by focusing on a specific policy $π$ and evaluating only actions it selects. By disregarding actions not taken under the chosen policy, the metric reduces unnecessary pessimism and computational complexity, enhancing the efficiency and practicality of bisimulation in reinforcement learning, especially in environments with large action spaces.

Definition 2 (

π

-bisimulation metric²²)

Let $F : M \to M$ be defined by

\begin{aligned} F (d) (s_{i}, s_{j}) = & | R^{π} (s_{i}) - R^{π} (s_{j}) | \\ + c W_{1} (d) (P^{π} (\cdot | s_{i}), P^{π} (\cdot | s_{j})) \end{aligned}

(2)

where the operator

F

has a unique fixed-point

d_{π}

in the metric space

M

representing the

π

-bisimulation metric.

However, relying solely on reward similarities may not be sufficient for effective policy transfer between environments which is essential for the broader adoption of these metrics in reinforcement learning. The policy similarity metric (PSM) represents a novel approach that differs from traditional reward-based bisimulation metrics by focusing on the alignment of actions derived from policy $π$ when evaluating state similarity.²³ This method offers a new perspective and is particularly effective for policy transfer with its emphasis on the consistency of policy actions.

Definition 3 (Policy similarity metric²³)

Let $F : M \to M$ be defined by

\begin{aligned} F (d) (s_{i}, s_{j}) = & DIST (π (s_{i}), π (s_{j})) \\ + c W_{1} (d) (P^{π} (\cdot | s_{i}), P^{π} (\cdot | s_{j})) \end{aligned}

(3)

where the operator

F

has a unique fixed-point

d *

in metric space

M

representing the PSM.

In the PSM, emphasis is placed on the difference in agent actions measured using a pseudo-metric DIST. For discrete action spaces, DIST uses the total variation distance as its measure. In continuous domains, it corresponds to the $L_{1}$ distance between the means of the policy actions. We illustrate the differences between PSM and $π$ -bisimulation metric within the context of MDPs in Figure 1.

Figure 1.

Consider MDPs with states $x$ and $y$ . The optimal action at each state is represented by the colored path, with the value of the reward indicated by the color. (a) The distance between states $x_{1}$ and $y_{1}$ measured using the $π$ -bisimulation is non-zero, while the distance, according to PSM, is zero. (b) The $π$ -bisimulation indicates a distance of zero between states $x_{4}$ and $y_{4}$ , while the PSM measures a non-zero distance. Under $π$ -bisimulation, the policy is expected to behave differently when the states are deemed bisimilar, which is both anomalous and inconsistent. MDP: Markov decision process; PSM: policy similarity metric.

As shown in Figure 1, the action-based PSM provides a more effective comparison metric for generalizing learned policies. Unlike $π$ -bisimulation, which is heavily dependent on reward structures, the PSM focuses on the similarity of actions, emphasizing behavioral similarities based on agent reactions rather than environmental state values. This makes PSM more suitable for generalizing between states without imposing excessive restrictions on the reward similarities and the distinction is crucial because reward-based $π$ -bisimulation can introduce significant challenges in policy transfer, particularly when dealing with states that, despite having similar values, necessitate divergent actions to achieve optimal outcomes.

By prioritizing action-based comparisons, PSM facilitates more robust policy generalization and enhances the agent’s ability to adapt to new and varied environments. This capability is essential for real-world robotic applications where consistency and adaptability of the learned policies are critical for success.

Learning-based bisimulation metrics

Learning-based bisimulation approaches have been extensively explored^{22,24,25,34,35} in recent works, marking a significant progress in addressing the generalization challenges posed by environments with extensive or continuous state spaces. These advanced methods utilize deep learning to approximate the bisimulation metric, facilitating its scalable application in complex settings and contributing to the development of more robust and generalized reinforcement learning approaches that can be effectively applied across diverse environments.

The Deep Bisimulation for Control (DBC)³⁴ is a notable example that approximates the $π$ -bisimulation metric by modeling environmental dynamics. It estimates the transition probabilities as a Gaussian distribution, parameterized by mean $μ$ and covariance $Σ$ and represents the next latent state probabilistically as $h^{'} \sim N (μ, Σ)$ . This allows the convenient substitution of the computationally intensive 1-Wasserstein distance with the 2-Wasserstein distance,^31,36 which has a closed-form solution given as:

\begin{aligned} W_{2} (N (μ_{i}, Σ_{i}), N (μ_{j}, Σ_{j}))^{2} = & ‖ μ_{i} - μ_{j} ‖_{2}^{2} \\ + Tr (Σ_{i} + Σ_{j} - 2 (Σ_{i}^{1 / 2} Σ_{j} Σ_{i}^{1 / 2})^{1 / 2}) \end{aligned}

(4)

The

π

-bisimulation metric can then be estimated as:

\begin{aligned} {\hat{d}}_{π} (s_{i}, s_{j}) = & | R^{π} (s_{i}) - R^{π} (s_{j}) | \\ + c W_{2} ({\hat{P}}^{π} (\cdot | ϕ (s_{i})), {\hat{P}}^{π} (\cdot | ϕ (s_{j}))) \end{aligned}

(5)

where

\hat{P}

is the dynamics model, and

ϕ

is the encoder.

DBC directly enforces bisimilarity properties on the encoder representations by applying an auxiliary loss designed to aggregate bisimilar states (i.e. in this case, states with very similar values). This is achieved by minimizing the mean squared error:

L_{DBC} = E_{(s_{i}, s_{j}) \sim D} {[| ϕ (s_{i}) - ϕ (s_{j}) | - {\hat{d}}_{π} (s_{i}, s_{j})]}^{2}

(6)

Gradient stopping is used for the

W_{2}

term to ensure that the

L_{DBC}

loss affects only the encoder, promoting the alignment of the latent space with the estimated

π

-bisimulation metric.

However, the use of an approximate dynamics model to estimate the $π$ -bisimulation metric can result in significant issues, such as the runaway expansion of the latent space.³¹ To prevent this unconstrained expansion, existing work proposed to normalize the latent space within a theoretically predefined closed ball $B_{c}$ to constrain the search for the unique metric within a bounded space.

To the best of our knowledge, most learning-based bisimulation research continues to focus on the reward-based $π$ -bisimulation metric, despite its inefficiencies for policy transfer (see Figure 1). This limitation motivates our proposal of a learning-based approach that leverages the action-based PSM to address these shortcomings.

Deep policy similarities

This work proposes adapting the DBC methodology to learn generalized policies using action-based similarities, specifically the PSM. We call our proposed methodology the Deep Policy Similarities (DeePS), where a neural network $\hat{P}$ is employed to model the environmental dynamics and estimate long-term differences in order to approximate the PSM.

Definition 4 (Policy similarity metric using approximate forward dynamics)

Let $F : M \to M$ be defined by

\begin{aligned} F (d) (s_{i}, s_{j}) = & DIST (π (s_{i}), π (s_{j})) \\ + c W_{1} (d) ({\hat{P}}^{π} (\cdot | s_{i}), {\hat{P}}^{π} (\cdot | s_{j})) \end{aligned}

(7)

where the operator

F

has a unique fixed-point

\hat{d} *

in metric space

M

representing the approximate PSM.

By representing the transition probabilities as a Gaussian distribution, $N (μ, Σ)$ , it is possible to further write the approximate PSM as:

\begin{aligned} \hat{d} * = & DIST (π (s_{i}), π (s_{j})) + c W_{2} ({\hat{P}}^{π} (\cdot | ϕ (s_{i})), {\hat{P}}^{π} (\cdot | ϕ (s_{j}))) \end{aligned}

(8)

We present a theoretical bound for convergence guarantees on the approximate PSM, similar to the convergence guarantees established for the approximate

π

-bisimulation in Kemertas and Aumentado-Armstrong.³¹

Lemma 1 (Bounds on Wasserstein distances³⁷)

For any two distributions $μ$ , $λ$ over a space $X$ , for all $p \geq 1$ ,

W_{1} (μ, λ) \leq W_{p} (μ, λ) \leq diam {(X)}^{(p - 1) / p} W_{1} (μ, λ)^{1 / p}

(9)

where

diam

is the diameter of the space.

Theorem 1 (Boundedness condition for convergence)

Assume $S$ is compact. If the support of an approximate forward dynamics model $\hat{P}$ (i.e., $S^{'} = supp (\hat{P})$ ) is a closed subset of $S$ , then there exists a unique approximate PSM $\hat{d} *$ as in equation 7. The bounds of this metric, dependent on the nature of the action space, are given as:

diam (S; \hat{d} *) \leq {\begin{matrix} \frac{1}{1 - c} & discrete \\ \frac{1}{1 - c} (A_{max} - A_{min}) & continuous \end{matrix}

(10)

Proof.

The existence proof is virtually identical to the proof presented in Kemertas and Aumentado-Armstrong.³¹ For continuous MDPs, the proof of the existence of the unique metric depends on the state space $S$ being compact.²¹ Therefore, if the support of the approximate forward dynamics model is a closed subset of the compact $S$ , then the same proof can be used to show the existence of a unique metric.

\begin{aligned} F (d) - F (d^{'}) & = c (W_{1} (d) {\hat{P}}_{i, j}^{π} - W_{1} (d^{'}) {\hat{P}}_{i, j}^{π}) \\ = c (W_{1} (d - d^{'} + d^{'}) {\hat{P}}_{i, j}^{π} - W_{1} (d^{'}) {\hat{P}}_{i, j}^{π}) \\ \leq c (W_{1} (‖ d - d^{'} ‖_{\infty} + d^{'}) {\hat{P}}_{i, j}^{π} - W_{1} (d^{'}) {\hat{P}}_{i, j}^{π}) \\ \leq c (‖ d - d^{'} ‖_{\infty} + W_{1} (d^{'}) {\hat{P}}_{i, j}^{π} - W_{1} (d^{'}) {\hat{P}}_{i, j}^{π}) \\ = c ‖ d - d^{'} ‖_{\infty}, \forall (s_{i}, s_{j}) \in S \times S \end{aligned}

where

{\hat{P}}_{i, j}^{π} = ({\hat{P}}^{π} (\cdot | s_{i}), {\hat{P}}^{π} (\cdot | s_{j}))

. This implies that

F

is a

c

-contraction and a unique metric

\hat{d} *

exists according to the Banach fixed-point theorem. To prove that the distance is bounded, according to Lemma 1,

\begin{aligned} supp (\hat{P}) \subseteq & S ⟹ sup_{s_{i}, s_{j} \in S \times S} W_{p} (\hat{d} *) ({\hat{P}}^{π} (\cdot | s_{i}), {\hat{P}}^{π} (\cdot | s_{j})) \\ \leq diam (S; \hat{d} *), \forall p \geq 1 \end{aligned}

Then,

\begin{aligned} \hat{d} * (s_{i}, s_{j}) = & DIST (π (s_{i}), π (s_{j})) + c W_{p} (\hat{d} *) ({\hat{P}}^{π} (\cdot | s_{i}), {\hat{P}}^{π} (\cdot | s_{j})) \end{aligned}

implies that

\hat{d} * (s_{i}, s_{j}) \leq DIST (π (s_{i}), π (s_{j})) + c \cdot diam (S; \hat{d} *)

and

\begin{aligned} diam (S; \hat{d} *) & \leq DIST (π (s_{i}), π (s_{j})) + c \cdot diam (S; \hat{d} *) \\ \leq \frac{1}{1 - c} DIST (π (s_{i}), π (s_{j})), \forall (s_{i}, s_{j}) \in S \times S \end{aligned}

For discrete action spaces,

DIST

is the total variation distance bounded between 0 and 1. For continuous spaces,

DIST

is the

L_{1}

between the means of the policy. Therefore,

diam (S; \hat{d} *) \leq {\begin{matrix} \frac{1}{1 - c} & discrete \\ \frac{1}{1 - c} (A_{max} - A_{min}) & continuous \end{matrix}

where

A_{max}

and

A_{min}

are the upper and lower bounds on the action, respectively.

Instead of normalizing the latent states within a closed ball $B_{c}$ , we notice that employing a nonlinear projection effectively prevents runaway expansion in the latent space. The projection layer disrupts the positive feedback loop within the DBC architecture, where encoder representations estimate transition probabilities that, in turn, constrain these representations, thus stabilizing the latent space. Moreover, the projection eases the constraints on the encoder. Direct enforcement of bisimilarity properties onto the encoder representations is too restrictive since the aggregated representations of two bisimilar observations will always produce the same value estimates. This conflicts with our expectation that states with different values but similar action sequences should be aggregated. We instead enforce the bisimilarity properties on a nonlinear projection such that bisimilar states need not have identical encoder representations, as long as their projections are similar. Therefore, the auxiliary loss function is given as:

L_{DeePS} = E_{(s_{i}, s_{j}) \sim D} {[| z_{θ} (h_{i}) - z_{θ} (h_{j}) | - \hat{d} * (s_{i}, s_{j})]}^{2}

(11)

where

h = ϕ (s)

is the encoder representation, and

z_{θ}

is the nonlinear projection. We present our DeePS architecture in Figure 2, where the policy network and dynamics model are trained with standard reinforcement learning algorithms (e.g. PPO³⁸, SAC³⁹) and maximum likelihood estimation, respectively. The overall training process is demonstrated in Algorithm 1.

Figure 2.

The proposed DeePS architecture bears a strong resemblance to the DBC architecture, with the key distinction being the omission of the reward network and the additional projection layer. The encoder processes the input observation, extracting a latent representation $h = ϕ (s)$ . This latent representation is the basis for the policy network $π$ and also the projection layer $z_{θ}$ . The dynamics model $\hat{P}$ takes a concatenation of the latent representation and policy action, $[h, π (h)]$ , to predict the next latent state in the form of a Gaussian distribution, $h^{'} \sim N (μ, Σ)$ . The bisimilarity properties are then enforced on the projection $z_{θ} (h)$ instead of the latent representations. DeePS: Deep Policy Similarities; DBC: Deep Bisimulation for Control.

Experiments

Noisy sparse cartpole environment

We study the effectiveness of our proposed approach using a modified version of the traditional cartpole environment. This variant³¹ offers a more challenging adaptation of the well-known OpenAI Gym environment⁴⁰ and is specifically designed to test the robustness and adaptability of reinforcement learning models for control under increasingly complex conditions. The noisy sparse cartpole environment introduces three key modifications:

Sparse reward structure: Unlike the original environment, which provides rewards for keeping the pole within a wide angular range at each timestep, this adapted version restricts the rewards to a narrow angular window of $[- θ_{rew}, θ_{rew}]$ with $θ_{rew}$ drastically reduced to 1% of the original threshold.

Noisy observations: The environment complicates state observations by concatenating the state vector with $N_{m} \cdot dim (S)$ dimensions of noise, sampled from an isotropic Gaussian distribution, mimicking uncertainty in observations. This added noise challenges the agent’s ability to generalize and correctly aggregate noisy states into consistent representations.

Continuous action space: The action space, $A \in [- 1, 1]$ , is redefined from discrete to continuous, demanding more nuanced control strategies. This shift requires the agent to make precise adjustments to the force applied to the cart, significantly increasing the complexity of the control problem and evaluating the agent’s ability to handle fine-grained action selection.

These adaptations not only increase the complexity of the task but also necessitate advanced techniques to accurately capture latent state representations for developing effective policies. By evaluating our approach in this challenging environment, which mimics the complexities of a robotic control problem, we aim to demonstrate its robustness and ability to learn generalized policies, highlighting its potential for real-world robotic applications.

We assess the performance of our proposed approach DeePS alongside its reward-based counterpart DBC, both of which employ the Soft Actor-Critic (SAC)³⁹ as the standard reinforcement learning model. This analysis aims to underscore the strengths and potential weakness of our method in comparison to the state-of-the-art for generalization in reinforcement learning which leverages on the $π$ -bisimulation metric. For DBC, $L_{2}$ normalization is applied to the latent representations to ensure convergence. In contrast, DeePS uses a nonlinear projection to prevent runaway expansion in the latent state space.

In models that utilize intrinsic rewards, the forward model error in the latent space is employed.^31,41–43 This intrinsic reward is mathematically defined as:

{\tilde{r}}_{I, t} = η_{r} ‖ \hat{ϕ} (s_{t}, a_{t}) - ϕ (s_{t + 1}) ‖_{2}^{2} / (2 n)

(12)

where

\hat{ϕ} (s_{t}, a_{t}) = E_{\hat{ϕ} (s_{t + 1}) \sim \hat{P} (\cdot | s_{t}, a_{t})} [\hat{ϕ} (s_{t + 1})]

denotes the expected latent state representation predicted by the approximate forward dynamics model, and

η_{r} > 0

is the weight coefficient. To ensure the intrinsic reward remains within practical bounds, it is capped at a maximum value of

R_{max}

, leading to a final intrinsic reward of

r_{I, t} = min (R_{max}, {\tilde{r}}_{I, t})

. This upper bound on the intrinsic reward ensures it contributes positively to the learning process without overwhelming the true reward signal.

Network architecture and hyperparameters

The neural network architecture used in this experiment closely follows the design outlined in Kemertas and Aumentado-Armstrong.³¹

The encoder network comprises a four-layer multilayer perceptron (MLP) with LeakyReLU activation functions applied after each layer except the final one. The output dimension of the encoder is set to 50. Separate encoders are used for the actor and critic networks, with the weights of their fully connected layers shared to ensure consistency in the learned representations. However, the shared weights are only updated by the critic, and gradients from the actor are stopped before propagating through the encoder.

The critic network employs double Q-learning, with each Q-function represented by a three-layer MLP featuring ReLU activation functions after each layer, except the final one. Similarly, the actor network, which is responsible for generating the mean and covariance for a diagonal Gaussian distribution representing the policy, follows the same structure. Both the actor and critic networks have a hidden layer dimension of 256. Additionally, the weight matrices of the fully connected layers in both networks are orthogonally initialized, and all biases are set to zero. The target critic network undergoes a soft update with a factor of $τ = 0.005$ every two iterations, while the actor network is also updated every two iterations.

When utilizing a predictive dynamics model, an additional network approximates a Gaussian distribution representing the transition probabilities. For simplicity, deterministic transitions were assumed in this environment. The forward dynamics model is a two-layer MLP with a hidden layer dimension of 512, with layer normalization applied after the first fully connected layer, followed by a ReLU activation function to introduce nonlinearity.

In our proposed approach, which incorporates a nonlinear projection, the projection layer consists of a single fully connected layer with an output dimension of 32, followed by a ReLU activation function to introduce nonlinearity.

All other hyperparameters used in the experiment are summarized in Table 1.

Table 1.

Hyperparameters for noisy sparse cartpole environment.

Hyperparameter	Value
Batch size	512
Discount factor ( $γ$ )	0.99
Discount factor ( $c$ )	0.99
Initialization steps	1000
Learning rate	0.001
Maximum intrinsic reward ( $R_{max}$ )	0.1
Replay buffer size	50,000
Scaling factor ( $η_{r}$ )	2.0
Training steps	50,000

Results and discussion

Our initial analysis concentrates on the boundedness of the state space. We demonstrate the feasibility of DeePS in the noise-free setting ( $N_{m} = 0$ ), comparing the performance with and without the projection layer. We also evaluate the performance of a variant that normalizes the latent representation rather than using a nonlinear projection. Figure 3 shows that the norm of the representations for the model without projection grows rapidly and uncontrollably, ultimately leading to the model’s failure in learning an effective policy. Furthermore, we observe that the average norm of the representations, when using the nonlinear projection, remains well within the theoretical limits, unlike the normalization approach where the norm grows to the maximum allowed by the bounds.³¹ This observation further validates the effectiveness of our solution in preventing unconstrained expansion of the latent space.

Figure 3.

The solid lines represent the average episode rewards, while the dashed lines indicate the $L_{2}$ norm values of the encoder representations. The shaded regions depict the 95% confidence intervals.

The performance of the different models (SAC, DBC, DBC w/ ir, DeePS, and DeePS w/ ir) are illustrated in Figure 4, and the average episode rewards are summarized in Table 2, with noise intensity incrementally raised from $N_{m} = 0$ to $N_{m} = 2$ . Overall, approaches utilizing intrinsic rewards perform the best as expected given the sparsity of the rewards. Among the different approaches, the action-based DeePS w/ ir showed the best performance.

Figure 4.

Performance comparison of SAC, DBC, and DeePS (ours) in the noisy sparse cartpole environment. ‘ir’ indicates the use of intrinsic rewards. Each plot represents the average episode reward over 10 episodes, averaged across 20 runs, with the 95% confidence interval indicated. SAC: Soft Actor-Critic; DBC: Deep Bisimulation for Control; DeePS: Deep Policy Similarities.

Table 2.

The average episode reward over 10 episodes for the noisy sparse cartpole environment, averaged across 20 runs, is presented with the standard deviation in brackets. ‘ir’ indicates the use of intrinsic rewards. The maximum attainable episode reward is 200.

	$N_{m} = 0$	$N_{m} = 1$	$N_{m} = 2$
SAC	200.00 (0.00)	117.87 (64.87)	10.25 (2.80)
DBC	165.78 (58.96)	178.74 (52.73)	93.73 (98.62)
DBC w/ ir	197.75 (10.08)	197.15 (12.75)	194.22 (17.42)
DeePS	200.00 (0.00)	157.03 (46.74)	95.22 (93.59)
DeePS w/ ir	200.00 (0.00)	200.00 (0.00)	199.07 (3.29)
DeePS (norm)	200.00 (0.00)	54.10 (78.12)	16.38 (43.27)
DeePS (norm) w/ ir	200.00 (0.00)	200.00 (0.00)	199.85 (0.65)
DeePS (w/o projection)	34.20 (16.45)	38.54 (12.72)	36.02 (8.86)
DeePS (w/o projection) w/ ir	35.98 (17.73)	41.90 (40.64)	38.63 (4.81)

Bold values are the best performing algorithm for each scenario.

SAC: Soft Actor-Critic; DBC: Deep Bisimulation for Control; DeePS: Deep Policy Similarities.

The SAC, a representative of standard reinforcement learning methods, managed to achieve optimal performance in the noise-free setting despite the sparse rewards. However, its performance declined significantly as the noise intensity increased, eventually failing at $N_{m} = 2$ . This underscores the limitation of standard reinforcement learning approaches in generalizing noisy observations as consistent latent representations.

Interestingly, the DBC approaches, both with and without intrinsic rewards, failed to achieve optimal performance even in the noise-free setting, indicating instability. This instability may stem from inaccurate state clustering where states with similar values but different optimal actions are aggregated incorrectly, as discussed previously. The declining performance of DBC, along with the high standard deviations, further supports the notion that state aggregation using $π$ -bisimulation is insufficient for effective generalization.

Although DeePS also experienced reduced performance as noise intensity increased (as seen in Table 2), the evaluation curves suggest that, unlike DBC which appeared to plateau, training was incomplete for DeePS, hinting that extended training could improve performance. Our approach showed delayed improvements during training, likely due to the difficulty of policy learning in sparse reward settings early in the training process, and state aggregation based on these random or poor policies hurt the training process. This points to a potential weakness in our approach in the reliance on the quality of the baseline policy. This is evident from the improved performance of DeePS w/ ir as seen in Figure 4 when intrinsic rewards were used to counteract reward sparsity. Despite this weakness, DeePS still outperforms the SAC, indicating that the model likely benefits from state aggregation based on policy-driven behaviors.

We conducted additional ablation studies to assess the impact of nonlinear projection on performance. As shown in Table 2, the model suffers catastrophic failure when no projection or normalization constraint is applied to the latent space. When intrinsic rewards are included, models utilizing nonlinear projection and normalization exhibit comparable performance. However, in the absence of intrinsic rewards, the normalized variant performs significantly worse, highlighting the robustness of the nonlinear projection approach.

In this experiment, our approach based on an approximation of the policy-driven PSM demonstrates significant promise in addressing the generalization challenge more effectively than standard reinforcement learning algorithms or the reward-based DBC. The superior performance of our approach validates the feasibility of using approximate forward dynamics to estimate the PSM, which is then used to improve generalization for reinforcement learning.

Noisy cartpole with randomized rewards

To study the impact of reward structures on the generalization ability of our action-based approach, we introduce slight variations to the noisy cartpole environment. We simulate a generalization problem where semantically equivalent environments have different reward structures by integrating randomized rewards. Specifically, the agent receives a reward that uniformly varies between 50% and 150% of the original reward for each successful timestep. This setup disregards reward sparsity to allow for greater variations in state values.

The randomization of reward signals introduces stochasticity, creating a more challenging and realistic environment for evaluating generalization. By testing the models in these varied environments, we can assess their ability to learn robust policies that generalize across different reward structures—essential for real-world applications where reward functions may be inconsistent. This allows us to compare the flexibility and adaptability of our proposed policy-driven DeePS approach to standard reinforcement learning algorithms and DBC.

Network architecture and hyperparameters

In general, we made use of the same network architecture and hyperparameters as with the previous experiment. However, to better manage the increased complexity introduced by randomized rewards, we made several adjustments to both the network architecture and training process. The hidden layer dimensions for both the actor and critic networks were reduced from 256 to 128 in an attempt to reduce the model’s capacity, preventing it from memorizing and overfitting to specific reward structures. In addition, we decay the learning rate by 0.99 every 1000 environment steps and increase the training batch size from 512 to 1024 to stabilize the training process. These adjustments help ensure that the models can handle the increased complexity and stochasticity of the noisy cartpole environment with randomized rewards, facilitating a thorough evaluation of the robustness and adaptability of our proposed method in a setting that closely simulates real-world scenarios.

Results and discussion

The performance of different models (SAC, DBC, and DeePS) in the noisy cartpole environment with randomized rewards is illustrated in the provided plots (Figure 5) and the table of results (Table 3). The results demonstrate the superiority of our proposed DeePS approach against the SAC and DBC across different noise levels ranging from $N_{m} = 0$ to $N_{m} = 20$ .

Figure 5.

Performance comparison of SAC, DBC, and DeePS (ours) in the noisy cartpole environment with randomized rewards. Each plot represents the average episode reward over 10 episodes, averaged across 20 runs, with the 95% confidence interval indicated. SAC: Soft Actor-Critic; DBC: Deep Bisimulation for Control; DeePS: Deep Policy Similarities.

Table 3.

The average episode reward over 10 episodes for the noisy cartpole environment with randomized rewards, averaged across 20 runs, is presented with the standard deviation in brackets. The maximum attainable episode reward is 200.

	$N_{m} = 0$	$N_{m} = 10$	$N_{m} = 20$
SAC	190.29 (23.10)	186.73 (17.41)	115.94 (43.45)
DBC	199.73 (1.19)	170.68 (17.69)	99.41 (49.82)
DeePS	199.00 (3.53)	195.31 (8.56)	177.99 (17.21)

Bold values are the best performing algorithm for each scenario.

SAC: Soft Actor-Critic; DBC: Deep Bisimulation for Control; DeePS: Deep Policy Similarities.

In the noise-free setting ( $N_{m} = 0$ ), both DeePS and DBC achieved near-optimal performance, with DBC slightly outperforming DeePS. However, as noise level increases from $N_{m} = 0$ to $N_{m} = 20$ , DeePS maintained strong performance, while both DBC and SAC showed clear signs of deterioration. At the highest noise level ( $N_{m} = 20$ ), DeePS outperformed SAC and DBC by approximately 53.5% and 79.0%, respectively, in terms of episode rewards.

The evaluation curves in Figure 5 further support these findings, showing that DeePS consistently delivers higher and more stable performance as noise increases. The decline in SAC’s performance underscores the limitation of standard reinforcement learning in effectively generalizing across MDPs with variable reward structures. While DBC marginally outperformed DeePS in the absence of noise, its performance deteriorated more rapidly as noise levels increased, eventually performing worse than SAC at $N_{m} = 20$ . DBC also exhibited greater instability, reflected in its larger standard deviation, particularly in high-noise environments. These results suggest that reward-based state aggregation methods not only negatively affect performance in environments with inconsistent reward structures, but also introduce instability.

Overall, these experiments highlight the effectiveness of DeePS in generalizing across continuous control tasks affected by noise and inconsistent rewards. Our method consistently outperforms both standard reinforcement learning algorithms and reward-based bisimulation approaches, demonstrating its robustness in handling complex and variable environments. By focusing on action-based similarities, DeePS enables more reliable policy transfer and adaptability, making it a valuable tool for real-world robotic applications. This success underscores the potential of our approach to enhance the development of more resilient and versatile reinforcement learning agents for actual robotic systems.

Crossing environment

We further validate our proposed approach in a simple environment that mimics robotic navigation problems in indoor settings. For this purpose, we utilize the simple crossing environment (SimpleCrossingS9N1-v0) implemented in the procedurally generated Minigrid environment.¹⁹ Each environment consists of a $7 \times 7$ grid world, where the agent starts in the top left corner of the room and must reach the green goal square in the opposite corner. To do so, the agent must navigate through a single crossing point in the wall that spans either horizontally or vertically across the room. At each step, the agent receives a partial observation of shape $(7, 7, 3)$ , describing the current environment in its forward direction. The agent’s action space consists of seven actions: turn left, turn right, move forward, and four other redundant actions not used in this environment. A discounted reward of $1 - 0.9 (s t e p / m a x_s t e p)$ is given when the agent successfully reaches the goal, where $s t e p$ is the number of steps taken to reach the goal and $m a x_s t e p$ is the maximum allowable steps, set at 324. If the agent fails to reach the goal within $m a x_s t e p$ , it receives a reward of 0. This environment challenges the agent to locate the crossing in the wall and move toward the goal as quickly as possible to maximize rewards.

We study generalization by training the agent on 10 different layouts (training set) and testing it on three other semantically equivalent layouts (testing set) as shown in Figures 6 and 7, respectively. The layouts in the training and testing sets do not overlap, requiring the agent to adapt to the test environments after training for 1,500,000 timesteps in the training set. This presents a very challenging task known as zero-shot generalization.

Figure 6.

Environment layouts used to train the agent in the crossing environment. The agent should move through the crossing in the wall towards the goal located at the bottom right corner as quickly as possible.

Figure 7.

Environment layouts used to test the agents in the crossing environment. The agent should move through the crossing in the wall towards the goal located at the bottom right corner as quickly as possible.

We evaluate the performance of our proposed approach alongside standard reinforcement learning and also DBC. Additionally, we consider the use of the RAPID algorithm,⁴⁴ designed to enhance reinforcement learning in procedurally generated environments. While the RAPID algorithm is not specifically designed for generalization, it is state-of-the-art for improving agent performance in procedurally generated environments. The RAPID algorithm uses a ranking buffer to gather high-quality state-action transitions for imitation learning. These transitions are ranked based on the weighted sum of their extrinsic rewards and their local and global exploration scores, defined as

S = w_{0} S_{ext} + w_{1} S_{local} + w_{2} S_{global}

(13)

where

S_{ext}

is the extrinsic reward,

S_{local}

is the local exploration score measuring episode diversity,

S_{global}

is the global count-based exploration score with hyperparameters

w_{0}

w_{1}

and

w_{2}

, respectively. All models are trained with Proximal Policy Optimization (PPO).³⁸

As noted in previous experiments, DeePS is dependent on the quality of the baseline policy during training. Therefore, to ensure effective generalization, we first train the DeePS model with an additional RAPID loss for 1,000,000 timesteps to establish a strong baseline policy (for the training domain) before applying state aggregation for the remaining 500,000 timesteps.

Network architecture and hyperparameters

The encoder network consists of a two-layer MLP with Tanh activation function applied after the initial layer. The output dimension of the encoder is set to 64. Both the actor and the critic are single fully connected layers. For simplicity, transitions are assumed to be deterministic, and the forward dynamics model is a two-layer MLP with hidden dimensions of 512. Layer normalization is applied after the initial layer, followed by a ReLU activation function to introduce nonlinearity. The projection layer is a single fully connected layer with a dimension of 32, followed by a ReLU activation for nonlinearity. Gradient clipping with a norm value of 1.0 is applied for the PPO loss. The hyperparameters used are summarized in Table 4.

Table 4.

Hyperparameters for crossing environment.

Hyperparameter	Value
Learning rate	0.0001
Training steps	1,500,000
Clip	0.2
Discount ( $γ$ )	0.99
Epochs	4
Entropy coefficient	0.01
Generalized advantage estimation ( $λ$ )	0.95
Horizon	128
Minibatch	4
Batch size	256
Buffer size	10,000
Extrinsic weight ( $w_{0}$ )	1.0
Local exploration weight ( $w_{1}$ )	0.1
Global exploration weight ( $w_{2}$ )	0.001

Results and discussion

The performance of the different models (PPO, RAPID, DBC, and DeePS) is shown in Figure 8, and the summary of the results provided in Table 5.

Figure 8.

Comparison of training (top) and testing (bottom) performance for PPO, RAPID, DBC, and DeePS (ours) in the crossing environment. The training plots represent the average episode rewards over 100 episodes, and the testing plots represent the average episode rewards over 30 episodes. Both are averaged across five runs, with 95% confidence intervals included. PPO: Proximal Policy Optimization; DBC: Deep Bisimulation for Control; DeePS: Deep Policy Similarities.

Table 5.

The average episode reward over 100 episodes (train) and 30 episodes (test) for the crossing environment, averaged across five runs, is presented with the standard deviation in brackets.

	train	test
PPO	0.488 (0.454)	0.224 (0.365)
RAPID	0.864 (0.082)	0.202 (0.255)
DBC	0.929 (0.014)	0.224 (0.069)
DeePS	0.895 (0.091)	0.403 (0.104)

Bold values are the best performing algorithm for each scenario.

PPO: Proximal Policy Optimization; DBC: Deep Bisimulation for Control; DeePS: Deep Policy Similarities.

We observed that PPO, as a representative of standard reinforcement learning models, performed poorly on both the training and testing sets. Although extended training could potentially enhance PPO’s performance on the training set, its testing performance is unlikely to improve due to its tendency to overfit to the training conditions.

The other models demonstrated strong training performance, as shown in Table 5, with DBC achieving the highest average training reward. Surprisingly, DBC performed well in the test environments during the early stages of training. However, its near-optimal training performance did not translate into effective generalization as the training progressed. This is evident in its test performance, which remained comparable to PPO despite the superior training results. The inefficiency of DBC is further illustrated in the evaluation curve in Figure 8, where its test performance deteriorates sharply after plateauing in training, underscoring its susceptibility to overfitting.

Similarly, while RAPID achieved better training performance than the baseline PPO, this improvement did not generalize. In fact, RAPID’s test performance was observed to be worse than PPO and clear signs of overfitting were apparent in its evaluation curve as seen in Figure 8. This further highlights the impracticality of using RAPID for generalization purposes.

In contrast to the other models, DeePS demonstrated substantial improvements in generalization, achieving approximately 99.5% higher test rewards than RAPID and 79.9% higher test rewards than DBC (and PPO). Despite achieving slightly lower training performance compared to DBC, DeePS’s performance on the test environments was the best among all models, with a test reward of 0.403, nearly doubling that of PPO and DBC. This result highlights the ability of DeePS to strike a balance between strong training performance and robust generalization.

As seen in the evaluation curves, DeePS effectively leverages its state aggregation mechanism to enhance both training and testing performance. Unlike RAPID and DBC, DeePS avoids overfitting, enabling it to perform consistently across unseen environments. By focusing on action-based similarities, DeePS offers a reliable method for policy transfer and adaptation, ensuring that agents remain effective across diverse environments.

Although DeePS’s test performance remains below optimal levels, it represents a significant step forward in reinforcement learning methodologies. By improving generalization without compromising training performance, DeePS demonstrates its potential to enhance the robustness and adaptability of reinforcement learning policies, particularly in challenging and varied environments.

Limitations

While the proposed DeePS method is effective in enhancing the generalization of reinforcement learning policies, its reliance on a quality baseline and certain assumptions about environmental transitions limit its overall effectiveness. Specifically, the assumption of a Gaussian transition probability may not hold across all environments, and the requirement of accurate modeling of the transition dynamics may limit the reliability of the metric estimates. Nonetheless, a significant amount of existing works have shown the possibility of accurately modeling environmental transitions.^45–48 which supports the use of our proposed methodology.

Conclusion

In conclusion, this work introduced a learning-based bisimulation approach that leverages policy similarities. Our experiments demonstrated the effectiveness and robustness of the proposed methodology, showing that DeePS outperforms the state-of-the-art reward-based bisimulation methods. Furthermore, the approach manages to preserve the integrity of the training process while substantially improving the model’s ability to generalize to new, unseen environments with semantically equivalent dynamics. Even in environments with different reward structures, DeePS was able to generalize effectively. This makes it particularly valuable for real-world robotic applications, where adaptability and robustness are essential.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Unmanned Vehicles Core Technology Research and Development Program through the National Research Foundation of Korea (NRF), Unmanned Vehicle Advanced Research Center (UVARC) funded by the Ministry of Science and ICT (MSIT), the Republic of Korea (no. 2020M3C1C1A0108237512).

ORCID iD

Han-Lim Choi

References

Mnih

Kavukcuoglu

Silver

, et al. Playing Atari with deep reinforcement learning. arXiv, 2013, https://arxiv.org/abs/1312.5602.

Watkins

CJCH

Dayan

. Q-learning. Mach Learn 1992; 8: 279–292.

Kiran

Sobh

Talpaert

, et al. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans Intell Transp Syst 2022; 23: 4909–4926.

Kiumarsi

Vamvoudakis

Modares

, et al. Optimal and autonomous control using reinforcement learning: A survey. IEEE Trans Neural Networks Learn Syst 2018; 29: 2042–2062.

Kober

Bagnell

Peters

. Reinforcement learning in robotics: A survey. Int J Rob Res 2013; 32: 1238–1274.

Zhao W, Queralta JP and Westerlund T. Sim-to-real transfer in deep reinforcement learning for robotics: A survey. In: 2020 IEEE symposium series on computational intelligence (SSCI), Canberra, ACT, Australia, 1–4 December 2020, pp. 737–744. IEEE Press. https://doi.org/10.1109/SSCI47803.2020.9308468

Cobbe

Klimov

Hesse

, et al. Quantifying generalization in reinforcement learning. In: Chaudhuri K and Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning. Vol. 97 of proceedings of machine learning research. PMLR, Long Beach CA, USA, 10–15 June 2019, pp. 1282–1289, https://proceedings.mlr.press/v97/cobbe19a.html.

Zhang

Vinyals

Munos

, et al. A study on overfitting in deep reinforcement learning. arXiv, 2018, https://arxiv.org/abs/1804.06893.

Todorov

Erez

Tassa

. MuJoCo: A physics engine for model-based control. In: 2012 IEEE/RSJ international conference on intelligent robots and systems, Vilamoura-Algarve, Portugal, 7–12 October 2012, pp. 5026–5033. IEEE Press.

10.

Bellemare

Naddaf

Veness

, et al. The arcade learning environment: An evaluation platform for general agents. J Artif Intell Res 2013; 47: 253–279.

11.

Nichol

Pfau

Hesse

, et al. Gotta learn fast: A new benchmark for generalization in RL. arXiv, 2018, https://arxiv.org/abs/1804.03720.

12.

Lee

Shin

, et al. Network randomization: a simple technique for generalization in deep reinforcement learning. arXiv, 2019, https://arxiv.org/abs/1910.05396.

13.

Raileanu

Goldstein

Yarats

, et al. Automatic data augmentation for generalization in reinforcement learning. In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS and Vaughan JW (eds) Advances in neural information processing systems. Curran Associates, Inc., 2021, vol. 34, pp. 5402–5415.

14.

Laskin

Lee

Stooke

, et al. Reinforcement learning with augmented data. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF and Lin H (eds) Advances in neural information processing systems. Curran Associates, Inc., 2020, vol. 33, pp. 19884–19895.

15.

Mehta

Diaz

Golemo

, et al. Active domain randomization. In: Kaelbling LP, Kragic D and Sugiura K (eds) Proceedings of the conference on robot learning. Vol. 100 of proceedings of machine learning research. PMLR, 2020, pp. 1162–1176, https://proceedings.mlr.press/v100/mehta20a.html.

16.

Tobin

Fong

Ray

, et al. Domain randomization for transferring deep neural networks from simulation to the real world. In: 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), Vancouver, BC, Canada, 24–28 September 2017, pp. 23–30. IEEE Press.

17.

Wang

Kang

Shao

, et al. Improving generalization in reinforcement learning with mixture regularization. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF and Lin H (eds) Advances in neural information processing systems. Curran Associates, Inc., 2020, vol. 33, pp. 7968–7978.

18.

Farebrother

Machado

Bowling

. Generalization and regularization in DQN. arXiv, 2018, https://arxiv.org/abs/1810.00123.

19.

Chevalier-Boisvert

Dai

Towers

, et al. Minigrid & miniworld: modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR, 2023, abs/2306.13831.

20.

Ferns

Panangaden

Precup

. Metrics for finite Markov decision processes. In: Proceedings of the 20th conference on uncertainty in artificial intelligence (UAI '04), Arlington, Virginia, USA, 7–11 July 2004, pp. 162–169. AUAI Press.

21.

Ferns

Panangaden

Precup

. Bisimulation metrics for continuous Markov decision processes. SIAM J Comput 2011; 40: 1662–1714.

22.

Castro

. Scalable methods for computing state similarity in deterministic markov decision processes. Proc AAAI Conf Artif Intell 2020; 34: 10069–10076.

23.

Agarwal

Machado

Castro

, et al. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In: International conference on learning representations, 3–7 May 2021.

24.

Wang

Yang

Dong

, et al. Efficient potential-based exploration in reinforcement learning using inverse dynamic bisimulation metric. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M and Levine S (eds) Advances in neural information processing systems. Curran Associates, Inc., 2023, vol. 36, pp. 38786–38797.

25.

Zang

Zhang

, et al. Understanding and addressing the pitfalls of bisimulation-based representations in offline reinforcement learning. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M and Levine S (eds) Advances in neural information processing systems. Curran Associates, Inc., 2023, vol. 36, pp. 28311–28340.

26.

Bellman

. A markovian decision process. J Math Mech 1957; 6: 679–684.

27.

Salkham

Cunningham

Garg

, et al. A collaborative reinforcement learning approach to urban traffic control optimization. In: 2008 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, Sydney, NSW, Australia, 9–12 December 2008, vol. 2, pp. 560–566. IEEE Press.

28.

Bellman

. The theory of dynamic programming. Bull Am Math Soc 1954; 60: 503–515.

29.

Tamar

Thomas

, et al. Value iteration networks. In: Lee D, Sugiyama M, Luxburg U, Guyon I and Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., 2016, vol. 29.

30.

Sun

Gordon

Boots

, et al. Dual policy iteration. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N and Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., 2018, vol. 31.

31.

Kemertas

Aumentado-Armstrong

. Towards robust bisimulation metric learning. In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS and Vaughan JW (eds) Advances in neural information processing systems. Curran Associates, Inc., 2021, vol. 34, pp. 4764–4777.

32.

Figalli

Glaudo

. An invitation to optimal transport, Wasserstein distances, and gradient flows. EMS Press, 2021.

33.

Castro

Precup

. Using bisimulation for policy transfer in MDPs. Proc AAAI Conf Artif Intell 2010; 24: 1065–1070.

34.

Zhang

McAllister

Calandra

, et al. Learning invariant representations for reinforcement learning without reconstruction. arXiv, 2020, https://arxiv.org/abs/2006.10742.

35.

Zhang

Sodhani

Khetarpal

, et al. Learning robust state abstractions for hidden-parameter block MDPs. In: 9th International conference on learning representations, ICLR 2021, May 3–7, 2021. OpenReview.net, 2021, https://openreview.net/forum?id=fmOOI2a3tQP.

36.

Kemertas

Jepson

. Approximate policy iteration with bisimulation metrics. Transactions on Machine Learning Research. 2022. https://openreview.net/forum?id=Ii7UeHc0mO.

37.

Santambrogio

. Optimal transport for applied mathematicians: calculus of variations, PDEs, and modeling. Springer International Publishing, 2015.

38.

Schulman

Wolski

Dhariwal

, et al. Proximal policy optimization algorithms. arXiv, 2017, https://arxiv.org/abs/1707.06347.

39.

Haarnoja

Zhou

Abbeel

, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Dy J and Krause A (eds) Proceedings of the 35th international conference on machine learning. Vol. 80 of proceedings of machine learning research. PMLR, 2018, pp. 1861–1870, https://proceedings.mlr.press/v80/haarnoja18b.html.

40.

Brockman

Cheung

Pettersson

, et al. OpenAI Gym. arXiv, 2016, https://arxiv.org/abs/1606.01540.

41.

Burda

Edwards

Pathak

, et al. Large-scale study of curiosity-driven learning. arXiv, 2018, https://arxiv.org/abs/1808.04355.

42.

Pathak

Agrawal

Efros

, et al. Curiosity-driven exploration by self-supervised prediction. In: Precup D and Teh YW (eds) Proceedings of the 34th international conference on machine learning. Vol. 70 of proceedings of machine learning research. PMLR, 2017, pp, 2778–2787, https://proceedings.mlr.press/v70/pathak17a.html.

43.

Stadie

Levine

Abbeel

. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv, 2015, https://arxiv.org/abs/1507.00814.

44.

Zha

Yuan

, et al. Rank the episodes: a simple approach for exploration in procedurally-generated environments. In: International conference on learning representations, 2021, https://openreview.net/forum?id=MtEE0CktZht.

45.

Racanière

Weber

Reichert

, et al. Imagination-augmented agents for deep reinforcement learning. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al. (eds) Advances in neural information processing systems. Curran Associates, Inc., 2017, vol. 30.

46.

Deng

Jang

Ahn

. DreamerPro: reconstruction-free model-based reinforcement learning with prototypical representations. In: Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G and Sabato S (eds) Proceedings of the 39th international conference on machine learning. Vol. 162 of proceedings of machine learning research. PMLR, 2022, pp. 4956–4975, https://proceedings.mlr.press/v162/deng22a.html.

47.

Hafner

Lillicrap

, et al. Dream to control: Learning behaviors by latent imagination. arXiv, 2019, https://arxiv.org/abs/1912.01603.

48.

Okada

Taniguchi

. Dreaming: model-based reinforcement learning by latent imagination without reconstruction. In: 2021 IEEE international conference on robotics and automation (ICRA), Xi'an, China, May 30–Jun 5 2021, pp. 4209–4215. IEEE Press. https://doi.org/10.1109/ICRA48506.2021.9560734

Enhancing generalization in reinforcement learning for robotic control through deep policy similarities

Abstract

Keywords

Introduction

Markov Decision Processes

Bisimulation in Reinforcement Learning

Definition 1 (Bisimulation metrics 20 )

Learning-based bisimulation metrics

Deep policy similarities

Experiments

Noisy sparse cartpole environment

Network architecture and hyperparameters

Results and discussion

Noisy cartpole with randomized rewards

Network architecture and hyperparameters

Results and discussion

Crossing environment

Network architecture and hyperparameters

Results and discussion

Limitations

Conclusion

Footnotes

Declaration of conflicting interests

Funding

ORCID iD

References

Definition 1 (Bisimulation metrics²⁰)