Abstract

Recent work has shown diffusion models are an effective approach to learning the multimodal distributions arising from demonstration data in behavior cloning. However, a drawback of this approach is the need to learn a denoising function, which is significantly more complex than learning an explicit policy. In this work, we propose Equivariant Diffusion Policy, a novel diffusion policy-learning method that leverages domain symmetries to obtain better sample efficiency and generalization in the denoising function. We theoretically characterize when a diffusion policy is equivariant and analyze the SO (2) symmetry of full 6-DoF control. We furthermore evaluate the method empirically on a set of 12 simulation tasks in MimicGen, and show that it obtains a success rate that is, on average, 34.5% higher than the baseline Diffusion Policy. We also evaluate the method on a real-world system to show that effective policies can be learned with relatively few training samples, whereas the baseline Diffusion Policy cannot.

Keywords

equivariance, diffusion model, robotic manipulation

Introduction

The recently proposed Diffusion Policy (Chi et al., 2023) formulates robotic manipulation action prediction as a diffusion model that denoises the action conditioned on the observation, thereby better capturing the multimodal action distribution of the demonstration data in Behavior Cloning (BC). Although Diffusion Policy often outperforms baselines on various benchmarks (Gupta et al., 2020; Mandlekar et al., 2022), a significant limitation is that the denoising function is inherently more complex than a standard policy function. Specifically, for a single state-action pair (s, a), the denoising process utilizes a mapping (s, a + ɛ^k, k)↦ɛ^k for all possible k and ɛ^k, where ɛ^k is Gaussian noise conditioned on step k. This formulation creates a considerably more challenging learning problem compared with an explicit BC s↦a.

In this paper, we leverage equivariant neural network models to incorporate task symmetry as an inductive bias in the diffusion process, substantially simplifying the denoising function learning problem. Although equivariant diffusion models have been studied by a number of prior works (Brehmer et al., 2023; Chen et al., 2023; Guan et al., 2023; Hoogeboom et al., 2022; Ryu et al., 2023b), our work is the first to comprehensively study and implement this concept for visuomotor policy learning. As illustrated in Figure 1, when a state and noisy trajectory action are rotated about the gravity axis, the corresponding denoised trajectory undergoes an equivalent transformation. This symmetry-aware approach enables our model to achieve significantly greater data efficiency and generalization capabilities than non-symmetric baselines, effectively addressing the high data requirements typically associated with diffusion-based methods.

Figure 1.

Equivariance in diffusion policy. Top left: a randomly sampled trajectory. Top right: a valid trajectory after denoising. If the state and the random trajectory are both rotated (bottom left), and we rotate the noise accordingly in the denoising process, we will end up with a successful trajectory in the rotated state (bottom right).

Our contributions are as follows:

• We propose Equivariant Diffusion Policy, a novel BC approach based on equivariant diffusion.

• We theoretically show that the diffusion policy is equivariant when the denoising function is equivariant, justifying modeling the denoising function using an equivariant network.

• We theoretically demonstrate the use of SO (2)-equivariance in the context of 6-DoF control for robotic manipulation, which prior methods (Jia et al., 2023; Wang et al., 2022b) leveraged in a less expressive SE (2) action space.

• We provide a thorough demonstration of our method in both simulated and physical systems. In simulation, we evaluate on 12 manipulation tasks in the MimicGen benchmark (Mandlekar et al., 2023) and outperform the baseline Diffusion Policy by an average success rate of 34.5% when trained with 100 demos. On hardware, we show that successful policies can be learned with a small number of demonstrations for 12 different manipulation tasks, including long-horizon tasks like bagel baking, coffee making, etc.

This work is an extended version of our conference paper (Wang et al., 2024b), substantially expanding both the theoretical foundations and practical implementations of Equivariant Diffusion Policy. Particularly, while our conference version focused on SO (2) rotational equivariance, here we include T (3) (the group of 3D translations) translational symmetry alongside rotational symmetry. This extension is enabled by our novel Equivariant Point Transformer architecture, which processes point cloud inputs in a manner that preserves both types of symmetries. Concretely, we extend the content of the prior work in the following ways:

• We update the theoretical analysis of Equivariant Diffusion Policy, refining the proposition and proof from a probability density perspective that is more comprehensive. See Section Theory of Equivariant Diffusion Policy.

• We propose a new version of Equivariant Diffusion Policy using point cloud input, referred to as EquiDiff (PC), powered by our novel Equivariant Point Transformer architecture. See Section Equivariant Point Transformer and Translation Symmetry.

• EquiDiff (PC) achieves additional translational symmetry (and is thus SO (2) × T (3)-equivariant) and reaches a performance that is 12.6% higher than the conference version. See Section Standard Baseline Comparison.

• We include an ablation study for EquiDiff (PC) that examines the effect of each design choice of our work. See Section Ablation Study.

• We include a new real-world experiment with six new challenging manipulation tasks, demonstrating the advantage of EquiDiff (PC) compared with the version from the conference paper. See Section EquiDiff with Point Cloud Input.

Related work

Diffusion models

Diffusion models (Sohl-Dickstein et al., 2015) learn distributions by modeling the reverse of a diffusion process, which is a Markov chain that gradually adds Gaussian noise to the data until it transitions to a Gaussian distribution. Denoising diffusion models (Ho et al., 2020; Song and Ermon, 2019) can be interpreted as learning the gradient field of an implicit score during training, where inference applies a sequence of score optimization steps. This new family of generative methods has proven to be effective for capturing multimodal distributions in planning (Janner et al., 2022; Liang et al., 2023) and policy learning (Chi et al., 2023; Pearce et al., 2022; Wang et al., 2022c; Xian et al., 2023; Ze et al., 2024). However, these methods did not leverage the geometric symmetries underlying the task and the diffusion process. Xu et al. (2022) and Hoogeboom et al. (2022) show that leveraging SO (3) symmetries from the domain in the diffusion process dramatically improves sample efficiency and generalization ability in molecular generation. EDGI (Brehmer et al., 2023) extends Diffuser (Janner et al., 2022) to equivariant diffusion planning with improved performance, but relies on the ground-truth state as the input. Ryu et al. (2023b) propose bi-equivariant diffusion models for visual robotic manipulation, but are limited to open-loop settings. Yang et al. (2024a) integrate SIM (3)-equivariant networks with diffusion models to enable a scalable, generalizable policy, but are limited to tasks involving a single object. By contrast, we exploit domain symmetries during the diffusion process to attain an effective closed-loop visuomotor policy for complex manipulation tasks.

Equivariance in manipulation policies

Robots operate within a three-dimensional Euclidean space, where manipulation tasks inherently encompass geometric symmetries such as rotations. Recent works (Eisner et al., 2024; Gao et al., 2024; Hu et al., 2024; Huang et al., 2023a, 2023b; Jia et al., 2023; Kim et al., 2023; Kohler et al., 2023; Lim et al., 2024; Liu et al., 2023; Nguyen et al., 2023, 2024; Pan et al., 2023; Simeonov et al., 2023; Wang et al., 2021a, 2022b; Yang et al., 2024b) compellingly show that improvement in sample efficiency and performance can be obtained by leveraging symmetries in policy learning. Wang et al. (2022a) and Zhu et al. (2022, 2023) show the efficiency of equivariant models for on-robot learning. Other works (Huang et al., 2022, 2023, 2024a, 2024b; Ryu et al., 2023a; Simeonov et al., 2022) learn an open-loop pick-and-place policy with few demonstrations. While this prior work considers symmetries in either SE (3) open-loop or SE (2) closed-loop action spaces, our paper studies symmetries in an SE (3) closed-loop action space, and is the first to study the symmetries in diffusion policy.

Closed-loop visuomotor control

Closed-loop visuomotor policies are more robust and responsive but struggle with learning from diverse trajectories and predicting long-horizon actions. Previous methods (Florence et al., 2019; Rahmatizadeh et al., 2018; Toyer et al., 2020; Zhang et al., 2018) directly map from observations to actions. However, this type of explicit policy learning struggles to learn multimodal behavior distributions and may not be expressive enough to capture the full range and fidelity of trajectory data (Orsini et al., 2021; Pearce et al., 2022). Several works propose implicit policies (Florence et al., 2021; Jarrett et al., 2020) with energy-based models (Du and Mordatch, 2019; Grathwohl et al., 2020). However, training is challenging because a substantial volume of negative samples is needed to effectively learn an optimal energy score function for state-action pairs. Recently, Chi et al. (2023) and Pearce et al. (2022) model action generation as a conditional denoising diffusion process and demonstrate strong performance by adapting diffusion models to sequential environments. Our work builds on Chi et al. (2023) but focuses on equivariance in the diffusion process.

Background

Problem statement

We study policy learning using behavior cloning. The agent is required to learn a mapping from the observation o to the action a that mimics an expert policy. Both o and a can contain a number of time steps, that is, o = {o_t−(m−1), …, o_t−1, o_t}, a = {a_t, a_t+1, …, a_t+(n−1)} where m is the number of history steps observed and n is the number of future action steps. The observation contains both visual information (images or voxels) and the pose vector of the gripper.

Let $T_{t} \in R^{4 \times 4}$ be the current SE (3) pose of the gripper in the world frame. The action a_t specifies a desired pose $A_{t} \in R^{4 \times 4}$ of the gripper and an open-width command $w_{t} \in R$. The pose can be either absolute (T_t+1 = A_t, also called position control) or relative (T_t+1 = A_tT_t, also called velocity control). To noise and denoise via addition and subtraction as in the standard diffusion process, we vectorize the SE (3) pose A_t into a vector a_t during diffusion and denoising, and orthogonalize the noise-free action vector after denoising so that it again represents a valid pose.
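As a rough illustration of the vectorize-then-orthogonalize step, the sketch below flattens a pose, perturbs it additively, and maps the result back to a valid SE (3) matrix. The SVD projection and the helper names (`vectorize_pose`, `orthogonalize_pose`, `next_pose`) are assumptions made for illustration; the text does not specify which orthogonalization is used.

```python
import numpy as np

def vectorize_pose(A):
    """Flatten a 4x4 homogeneous pose into a 16-vector (row-major)."""
    return A.reshape(-1)

def orthogonalize_pose(a):
    """Map a (possibly noisy) 16-vector back to a valid SE(3) matrix.

    Assumption: the rotation block is projected onto SO(3) with an SVD,
    and the bottom row is reset to [0, 0, 0, 1].
    """
    M = a.reshape(4, 4)
    U, _, Vt = np.linalg.svd(M[:3, :3])
    # Flip the last singular direction if needed so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    A = np.eye(4)
    A[:3, :3] = U @ D @ Vt
    A[:3, 3] = M[:3, 3]
    return A

def next_pose(A_t, T_t, relative):
    """Absolute control: T_{t+1} = A_t. Relative control: T_{t+1} = A_t T_t."""
    return A_t @ T_t if relative else A_t

rng = np.random.default_rng(0)
A_clean = np.eye(4)
A_clean[:3, 3] = [0.1, -0.2, 0.3]
a_noisy = vectorize_pose(A_clean) + 0.05 * rng.standard_normal(16)
A_rec = orthogonalize_pose(a_noisy)
```

The additive noise generally takes the vector off the SE (3) manifold, which is why some projection of this kind is needed before the action can be executed.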

Diffusion policy

Chi et al. (2023) proposed Diffusion Policy to model the multimodal distribution in behavior cloning using Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020). Diffusion Policy learns a noise prediction function ɛ_θ (o, a + ɛ^k, k) = ɛ^k using a network ɛ_θ parameterized by θ. The network is expected to predict the noise component of the input a + ɛ^k. During training, transitions (o, a) are sampled from the expert dataset. Then, random noise ɛ^k (conditioned on a randomly sampled denoising step k) is added to a. The loss is $L = ‖ ε_{θ} (o, a + ε^{k}, k) - ε^{k} ‖^{2}$. During inference, given an observation o, DDPM performs a sequence of K denoising steps starting from a random action $a^{K} \sim N (0, I)$ to generate an action a⁰ defined inductively by

a^{k - 1} = α (a^{k} - γ ε_{θ} (o, a^{k}, k) + ϵ),

(1)

where $ϵ \sim N (0, σ^{2} I)$, and α, γ, σ are functions of the denoising step k (also known as the noise schedule). The action a⁰ is expected to be a sample from the expert policy π.
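The training loss and the update in Equation (1) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the schedule values for α, γ, σ and the placeholder `zero_predictor` are hypothetical, and a real system would use a trained network for ɛ_θ and a proper DDPM noise schedule.

```python
import numpy as np

def ddpm_loss(eps_theta, o, a, eps_k, k):
    """Squared error between the predicted and the injected noise."""
    return np.sum((eps_theta(o, a + eps_k, k) - eps_k) ** 2)

def sample_action(eps_theta, o, a_K, alpha, gamma, sigma, rng):
    """Iterate a^{k-1} = alpha_k (a^k - gamma_k eps_theta(o, a^k, k) + eps)."""
    a = a_K
    K = len(alpha)
    for k in range(K, 0, -1):
        # eps ~ N(0, sigma_k^2 I), drawn fresh at every denoising step
        eps = sigma[k - 1] * rng.standard_normal(a.shape)
        a = alpha[k - 1] * (a - gamma[k - 1] * eps_theta(o, a, k) + eps)
    return a

# Hypothetical schedule and placeholder noise predictor (assumptions).
K = 10
alpha = np.full(K, 1.01)
gamma = np.full(K, 0.1)
sigma = np.linspace(0.0, 0.05, K)   # sigma[0] = 0: no noise at the final step k = 1
zero_predictor = lambda o, a, k: np.zeros_like(a)

rng = np.random.default_rng(0)
o = np.zeros(4)
a_K = rng.standard_normal((16, 10))  # 16 action steps, 10-dimensional actions
a_0 = sample_action(zero_predictor, o, a_K, alpha, gamma, sigma, rng)
```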

Equivariance

A function f is equivariant if it commutes with the transformations of a symmetry group G. Specifically, ∀g ∈ G,

f (ρ_{x} (g) x) = ρ_{y} (g) f (x),

(2)

where ρ: G → GL (n) is called the group representation that maps each group element to an n × n invertible matrix that acts on the input and output through matrix multiplication. We sometimes leave the actions implicit and write f (gx) = gf (x).
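Equation (2) can be checked numerically for simple functions. The sketch below uses two hypothetical toy functions on $R^{2}$ (they are illustrations, not part of the method): one commutes with planar rotations and one does not.

```python
import numpy as np

def rot(g):
    """2x2 rotation matrix for angle g (radians)."""
    return np.array([[np.cos(g), -np.sin(g)],
                     [np.sin(g),  np.cos(g)]])

def f_equivariant(x):
    """Scales x by its norm; rotations preserve the norm, so f(gx) = g f(x)."""
    return np.linalg.norm(x) * x

def f_not_equivariant(x):
    """Elementwise square; it depends on the coordinate axes and breaks Equation (2)."""
    return x ** 2

x = np.array([1.0, 2.0])
g = 0.3
lhs = f_equivariant(rot(g) @ x)   # f(rho(g) x)
rhs = rot(g) @ f_equivariant(x)   # rho(g) f(x)
```

Here both the input and the output transform by the same rotation matrix, so ρ_x = ρ_y.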

We mainly focus on the group SO (2) × T (3), where SO (2) is a group of planar rotations (i.e., rotation around the z-axis of the world) and T (3) is a group of 3D translations. This group captures the symmetry in many robotic tasks without enforcing unnecessary out-of-plane rotation equivariance (which is often invalid due to gravity and the canonical pose of objects). Notice that SO (2) × T (3) can be decomposed and a function that is both SO (2)-equivariant and T (3)-equivariant would be SO (2) × T (3)-equivariant.

We sometimes approximate SO (2) with the discrete subgroup C_u ⊂ SO (2) containing u discrete rotations, and there are three particular representations of SO (2) or C_u that are of interest in this paper:

1) the trivial representation ρ₀ defines SO (2) or C_u acting on an invariant scalar $x \in R$ by ρ₀(g)x = x.

2) the irreducible representation ρ_ω defines SO (2) or C_u acting on a vector $v \in R^{2}$ by a 2 × 2 rotation matrix with frequency ω, $ρ_{ω} (g) v = (\begin{matrix} \cos ω g & - \sin ω g \\ \sin ω g & \cos ω g \end{matrix}) v$ .

3) the regular representation ρ_reg that defines C_u acting on a vector $x \in R^{u}$ by u × u permutation matrices. Let g = r^m ∈ C_u = {1, r¹, …, r^u−1} and $(x_{1}, \dots, x_{u}) \in R^{u}$ . Then ρ_reg (g)x = (x_u−m+1, …, x_u, x₁, x₂, …, x_u−m) cyclically permutes the coordinates of $R^{u}$ .

A representation ρ can also be a combination of different representations, that is, $ρ = ρ_{0}^{n_{0}} \oplus ρ_{1}^{n_{1}} \oplus ρ_{2}^{n_{2}} \in G L (n_{0} + 2 n_{1} + 2 n_{2})$ . In such a case, ρ(g) is an (n₀ + 2n₁ + 2n₂) × (n₀ + 2n₁ + 2n₂) block diagonal matrix that acts on $x \in R^{n_{0} + 2 n_{1} + 2 n_{2}}$ .
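The representations above are easy to write down as explicit matrices. The following sketch builds them with NumPy; the function names are chosen here for illustration and do not correspond to any library API.

```python
import numpy as np

def rho_trivial(g):
    """Trivial representation: every group element acts as the 1x1 identity."""
    return np.eye(1)

def rho_irrep(g, omega):
    """Irreducible representation of frequency omega for rotation angle g."""
    c, s = np.cos(omega * g), np.sin(omega * g)
    return np.array([[c, -s], [s, c]])

def rho_regular(m, u):
    """Permutation matrix of g = r^m in C_u: (rho x)[i] = x[(i - m) mod u]."""
    return np.roll(np.eye(u), m, axis=0)

def direct_sum(*blocks):
    """Block-diagonal matrix built from the given square blocks."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

g = 0.4
# rho_0 ⊕ rho_1 ⊕ rho_2 acting on R^5 (n_0 = n_1 = n_2 = 1)
rho = direct_sum(rho_trivial(g), rho_irrep(g, 1), rho_irrep(g, 2))
shifted = rho_regular(1, 4) @ np.array([1.0, 2.0, 3.0, 4.0])
```

With u = 4 and m = 1, the regular representation moves the last coordinate to the front, matching the cyclic permutation described in item 3.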

Method

Theory of equivariant diffusion policy

The main contribution of this paper is a method that incorporates equivariance in the diffusion process for policy learning. As a theoretical justification, we analyze the noise prediction function ɛ and show that if ɛ is equivariant, then the policy being modeled is also equivariant. This implies equivariant neural networks have the correct inductive bias to model this function.

Proposition 1

If the noise prediction function ɛ: o, a^k↦ɛ^k is SO (2)-equivariant, that is, for all g ∈ SO (2),

ε (g o, g a^{k}, k) = g ε (o, a^{k}, k),

(3)

then the policy function is SO (2)-equivariant, that is, for all g ∈ SO (2) and any measurable set $A$,

π (g A ∣ g o) = π (A ∣ o),

(4)

where $g A = \{g a ∣ a \in A\}$.

Proof

Notation and Setup.

Let q (a^k−1∣a^k, ɛ^k) denote the transition density induced by the DDPM update,

a^{k - 1} = α (a^{k} - γ ε^{k} + ϵ),

where the noise $ϵ \sim N (0, σ^{2} I)$ is sampled from an isotropic Gaussian and α, γ, and σ depend only on k (and are invariant under the group action).

Let p_k (a^k∣o) denote the probability density of the intermediate state a^k given the observation o. The policy π is defined by probability density p_π (a∣o) = p₀ (a∣o). For any measurable set $A$ ,

π (A ∣ o) = \int_{A} p_{π} (a ∣ o) d a .

Invariance of q.

Since the group action is linear and acts identically on a^k, ɛ^k, and ϵ, applying a rotation g to the DDPM update gives

g a^{k - 1} = α (g a^{k} - γ g ε^{k} + g ϵ) .

Because the Gaussian density is invariant under rotations (i.e., the probability density at gϵ equals that at ϵ) and rotations preserve volume, we have

q (a^{k - 1} ∣ a^{k}, ε^{k}) = q (g a^{k - 1} ∣ g a^{k}, g ε^{k}) .

(5)

Substituting the equivariance condition (Equation (3)) into Equation (5) yields

q (g a^{k - 1} ∣ g a^{k}, ε (g o, g a^{k}, k)) = q (a^{k - 1} ∣ a^{k}, ε (o, a^{k}, k)) .

Since o influences the density only via ɛ, we can write

q (g a^{k - 1} ∣ g a^{k}, g o) = q (a^{k - 1} ∣ a^{k}, o) .

(6)

Proof by Induction.

For k = 0, 1, …, K, we prove by backward induction that

p_{k} (g a^{k} ∣ g o) = p_{k} (a^{k} ∣ o) for all g \in S O (2) .

Base Case (k = K):

Since the initial state a^K is drawn from an isotropic Gaussian (and is independent of o), its density is given by

p_{K} (a^{K} ∣ o) = p_{N (0, I)} (a^{K}),

where $p_{N (0, I)}$ is the density function of the Gaussian $N (0, I)$. Because the Gaussian density is invariant under rotations, for any g ∈ SO (2) we have

\begin{aligned} p_{K} (a^{K} ∣ o) & = p_{N (0, I)} (a^{K}) \\ & = p_{N (0, I)} (g a^{K}) \\ & = p_{K} (g a^{K} ∣ g o) . \end{aligned}

Inductive Step:

Assume that for some k ∈ {1, …, K},

p_{k} (g a^{k} ∣ g o) = p_{k} (a^{k} ∣ o)

holds for all g ∈ SO (2). Then the density at step k − 1 is given by

p_{k - 1} (a^{k - 1} ∣ o) = \int q (a^{k - 1} ∣ a^{k}, o) p_{k} (a^{k} ∣ o) d a^{k} .

Similarly, for transformed observation and action,

p_{k - 1} (g a^{k - 1} ∣ g o) = \int q (g a^{k - 1} ∣ a^{k}, g o) p_{k} (a^{k} ∣ g o) d a^{k} .

We change the variable of integration from a^k to ga^k. Since rotations map the domain of integration onto itself and preserve volume (d (ga^k) = da^k),

p_{k - 1} (g a^{k - 1} ∣ g o) = \int q (g a^{k - 1} ∣ g a^{k}, g o) p_{k} (g a^{k} ∣ g o) d a^{k} .

(7)

Substituting Equation (6) into Equation (7) and applying the inductive hypothesis,

\begin{aligned} p_{k - 1} (g a^{k - 1} ∣ g o) & = \int q (a^{k - 1} ∣ a^{k}, o) p_{k} (a^{k} ∣ o) d a^{k} \\ & = p_{k - 1} (a^{k - 1} ∣ o), \end{aligned}

as desired. In particular, for k = 0 we have

p_{π} (g a ∣ g o) = p_{π} (a ∣ o) .

Equivariance of π.

By the invariance of the density p_π, for any measurable set $A$ we obtain

\begin{aligned} π (A ∣ o) & = \int_{A} p_{π} (a ∣ o) d a \\ & = \int_{g A} p_{π} (a ∣ g o) d a \\ & = π (g A ∣ g o) . \end{aligned}

Thus, the policy function is SO (2)-equivariant.
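The sample-path version of this argument can be checked numerically: if the observation, the initial noisy action, and every noise draw are rotated by the same g, an equivariant ɛ yields a rotated denoised action. The sketch below is only an illustration under strong assumptions: `eps_toy` is a hand-written equivariant function on $R^{2}$ rather than a trained network, and the schedule constants are hypothetical.

```python
import numpy as np

def rot(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s], [s, c]])

def eps_toy(o, a, k):
    """Toy SO(2)-equivariant noise predictor (an assumption, not a trained model)."""
    d = a - o
    return d * np.linalg.norm(d) / (1.0 + k)

def denoise(o, a_K, noises, alpha=1.0, gamma=0.2):
    """Run the DDPM update with a fixed list of noise draws, one per step."""
    a = a_K
    K = len(noises)
    for k in range(K, 0, -1):
        a = alpha * (a - gamma * eps_toy(o, a, k) + noises[k - 1])
    return a

rng = np.random.default_rng(0)
K = 20
o = np.array([1.0, 0.5])
a_K = rng.standard_normal(2)
noises = 0.05 * rng.standard_normal((K, 2))

R = rot(np.pi / 3)
a0 = denoise(o, a_K, noises)
# Rotate the observation, the initial action, and every noise draw by g.
a0_rot = denoise(R @ o, R @ a_K, noises @ R.T)
```

Because the rotated noise draws have the same probability density as the original ones, this path-wise identity is the mechanism behind the density-level statement in Proposition 1.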

Figure 2 illustrates the equivariance property of ɛ. If we infer ɛ for all actions in the action space, we effectively acquire a gradient field towards the expert trajectory. The figure shows that if the function ɛ is equivariant, such a gradient field would also be equivariant. Thus, the expert policy is equivariant. Notice that the figure shows the average of all action time steps.

Figure 2.

Equivariance of the denoising function ɛ. Left: In observation o, the goal for the gripper is to reach the green block while avoiding the blue obstacle. Right: The expert trajectory and the gradient field associated with the denoising function. If the policy is equivariant, both the denoising function and the entire gradient field are equivariant. The orange boxes show the equivariance of ɛ with a particular input ɛ^k.

SO (2) representation on 6DoF action

A key step in defining an Equivariant Diffusion Policy is to define how actions a_t transform linearly under SO (2). We describe this SO (2) transformation in terms of irreducible SO (2) representations, which allows us to build the equivariance constraint into the denoising network.

Proposition 2

There exist irreducible representations that describe how SO (2) acts on an SE (3) gripper action a_t. In absolute pose control, let a_t = Vec_c(A_t) where Vec_c flattens an SE (3) pose $A_{t} \in R^{4 \times 4}$ into a vector by column, $g a_{t} = {(ρ_{1} \oplus ρ_{0}^{2})}^{4} (g) a_{t}$ . In relative-pose control, let a_t = Vec_r (A_t) where Vec_r flattens A_t into a vector by row, $g a_{t} = P^{- 1} [(ρ_{0}^{6} \oplus ρ_{1}^{4} \oplus ρ_{2}) (g)] P a_{t}$ , where P is a fixed change-of-basis matrix.

Absolute control

We first consider absolute pose control, where the model infers the absolute pose to which the gripper is to move, that is, T_t+1 = A_t. Let T_g be the transformation matrix corresponding to the SO (2) rotation along the z-axis of the world frame,

T_{g} = \begin{bmatrix} \cos g & - \sin g & 0 & 0 \\ \sin g & \cos g & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} ρ_{1} (g) & & \\ & ρ_{0} (g) & \\ & & ρ_{0} (g) \end{bmatrix},

(8)

where $ρ_{1} (g) = (\begin{matrix} \cos g & - \sin g \\ \sin g & \cos g \end{matrix})$. The SO (2) action on A_t is $g A_{t} = T_{g} A_{t} = (ρ_{1} \oplus ρ_{0}^{2}) (g) A_{t}$. Vectorizing A_t by column gives $a_{t} = \mathrm{Vec}_{c} (A_{t}) = {[A_{t}^{1 T}, A_{t}^{2 T}, A_{t}^{3 T}, A_{t}^{4 T}]}^{T}$, where $A_{t}^{i}$ is the ith column of A_t. By the rule of matrix multiplication, we have $g A_{t}^{i} = (ρ_{1} \oplus ρ_{0}^{2}) (g) A_{t}^{i}$ and $g a_{t} = {(ρ_{1} \oplus ρ_{0}^{2})}^{4} (g) a_{t}$.

Since the gripper open width is invariant, gw_t = ρ₀ (g)w_t, we can append w_t to a_t and add an extra ρ₀ to the representation. We can also simplify the representation by removing the constants in the transformation matrix and removing the last row in the rotation part of the transformation matrix (i.e., the 6D rotation representation (Zhou et al., 2019)). The resulting action vector would be $a_{t} \in R^{6} \times R^{3} \times R$ , where the first six elements are the 6D rotation, the following three elements are the translation, and the last element is the gripper open width. In such a case, we have $g a_{t} = (ρ_{1}^{3} \oplus (ρ_{1} \oplus ρ_{0}) \oplus ρ_{0}) (g) a_{t}$ .
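Both forms of the absolute-control representation can be verified numerically. The sketch below is an illustrative check, assuming a column-wise layout for the simplified 6D rotation block (the text does not fix the ordering inside the first six elements); the helper names are hypothetical.

```python
import numpy as np

def T_g(g):
    """4x4 homogeneous rotation about the world z-axis, as in Equation (8)."""
    T = np.eye(4)
    c, s = np.cos(g), np.sin(g)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def vec_c(A):
    """Column-wise flattening Vec_c."""
    return A.reshape(-1, order="F")

def simplified_action(A, w):
    """10-vector: rotation block without its last row, translation, gripper width."""
    rot6 = A[:2, :3].reshape(-1, order="F")   # assumed column-wise ordering
    return np.concatenate([rot6, A[:3, 3], [w]])

def rho_simplified(g):
    """rho_1^3 ⊕ (rho_1 ⊕ rho_0) ⊕ rho_0 as a 10x10 block-diagonal matrix."""
    out = np.eye(10)
    for i in (0, 2, 4, 6):
        out[i:i + 2, i:i + 2] = T_g(g)[:2, :2]
    return out

rng = np.random.default_rng(0)
A_t = np.eye(4)
A_t[:3, :] = rng.standard_normal((3, 4))  # arbitrary entries; the identities are linear in A_t
g, w = 0.7, 0.04

rho_abs = np.kron(np.eye(4), T_g(g))      # (rho_1 ⊕ rho_0^2)^4: four copies of T_g
full_lhs, full_rhs = vec_c(T_g(g) @ A_t), rho_abs @ vec_c(A_t)
simp_lhs = simplified_action(T_g(g) @ A_t, w)
simp_rhs = rho_simplified(g) @ simplified_action(A_t, w)
```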

Relative control

For relative gripper pose, that is, T_t+1 = A_tT_t, the group action on A_t satisfies (gA_t)T_gT_t = T_g (A_tT_t) (because the rotation g ∈ SO (2) applies to both the current pose and the change of pose). Solving for gA_t we get $g A_{t} = T_{g} A_{t} T_{g}^{- 1}$. Let a_t = Vec_r (A_t) where $\mathrm{Vec}_{r} : R^{n \times m} \to R^{n \cdot m}$ flattens a matrix into a vector by row. Here we want to find a linear action ρ_A that satisfies $g a_{t} = ρ_{A} (g) \mathrm{Vec}_{r} (A_{t}) = \mathrm{Vec}_{r} (T_{g} A_{t} T_{g}^{- 1})$. Solving for $ρ_{A} \in R^{16 \times 16}$ we have the group action of SO (2) on Vec_r (A_t) as

ρ_{A} = \begin{bmatrix} \begin{matrix} c^{2} & - \frac{s_{2}}{2} & 0 & 0 & - \frac{s_{2}}{2} & s^{2} & 0 & 0 \\ \frac{s_{2}}{2} & c^{2} & 0 & 0 & - s^{2} & - \frac{s_{2}}{2} & 0 & 0 \\ 0 & 0 & c & 0 & 0 & 0 & - s & 0 \\ 0 & 0 & 0 & c & 0 & 0 & 0 & - s \\ \frac{s_{2}}{2} & - s^{2} & 0 & 0 & c^{2} & - \frac{s_{2}}{2} & 0 & 0 \\ s^{2} & \frac{s_{2}}{2} & 0 & 0 & \frac{s_{2}}{2} & c^{2} & 0 & 0 \\ 0 & 0 & s & 0 & 0 & 0 & c & 0 \\ 0 & 0 & 0 & s & 0 & 0 & 0 & c \end{matrix} & & & & \\ & ρ_{1} (g) & & & \\ & & I_{2} & & \\ & & & ρ_{1} (g) & \\ & & & & I_{2} \end{bmatrix},

(9)

where c = cos g, s = sin g, c₂ = cos 2g, s₂ = sin 2g, c² = (cos g)², s² = (sin g)². We need to decompose ρ_A into irreducible representations of SO (2) in order to implement it in an equivariant neural network. To do so, we calculate a change-of-basis matrix P (see Appendix Decomposing Group Representation in Relative Pose Control into Irreducible Representations) such that Pρ_AP⁻¹ is a block diagonal matrix consisting of irreducible representations,

P ρ_{A} P^{- 1} = \begin{bmatrix} I_{6} & & & & & \\ & ρ_{1} (g) & & & & \\ & & ρ_{1} (g) & & & \\ & & & ρ_{1} (g) & & \\ & & & & ρ_{1} (g) & \\ & & & & & ρ_{2} (g) \end{bmatrix} .

(10)

In other words, $g a_{t} = P^{- 1} [(ρ_{0}^{6} \oplus ρ_{1}^{4} \oplus ρ_{2}) (g)] P a_{t}$. We can then use $P ρ_{A} P^{- 1} = ρ_{0}^{6} \oplus ρ_{1}^{4} \oplus ρ_{2}$ as the group representation of the output of the equivariant network, then construct the 4 × 4 transformation matrix using P. For easier implementation, we add a ρ₀ for the gripper action w_t, remove the constants in the transformation matrix, and decompose $SE (3) = SO (3) \times R^{3}$. See Appendix Simplifying Group Action in Relative Pose Control for details.
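The relative-control representation can be cross-checked without computing P explicitly. The sketch below builds ρ_A from the standard row-major identity Vec_r (XAY) = (X ⊗ Yᵀ) Vec_r (A), which is an assumption of this illustration rather than the derivation used in the appendix, and then counts eigenvalues to confirm the multiplicities implied by Equation (10).

```python
import numpy as np

def T_g(g):
    """4x4 homogeneous rotation about the world z-axis."""
    T = np.eye(4)
    c, s = np.cos(g), np.sin(g)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def vec_r(A):
    """Row-wise flattening Vec_r."""
    return A.reshape(-1)

g = 0.7
T = T_g(g)
T_inv = np.linalg.inv(T)
# Row-major identity: vec_r(X A Y) = (X ⊗ Y^T) vec_r(A), with X = T_g, Y = T_g^{-1}.
rho_A = np.kron(T, T_inv.T)

rng = np.random.default_rng(0)
A_t = rng.standard_normal((4, 4))
lhs = vec_r(T @ A_t @ T_inv)
rhs = rho_A @ vec_r(A_t)

# rho_0 has eigenvalue 1; rho_w has the eigenvalue pair exp(+i w g), exp(-i w g).
eig = np.linalg.eigvals(rho_A)

def multiplicity(z):
    return int(np.sum(np.abs(eig - z) < 1e-6))

counts = [multiplicity(np.exp(1j * w * g)) for w in (0, 1, -1, 2, -2)]
```

The counts should come out as six copies of eigenvalue 1, four of each exp(±ig), and one of each exp(±2ig), which matches the block structure $ρ_{0}^{6} \oplus ρ_{1}^{4} \oplus ρ_{2}$.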

Implementation of equivariant diffusion policy

Now that we have the theoretical grounding of the equivariance in the noise prediction function ɛ, this section will introduce the network architecture of our Equivariant Diffusion Policy. We first present an SO (2)-equivariant version in this section, then introduce the SO (2) × T (3)-equivariant version in Section Equivariant Point Transformer and Translation Symmetry.

As is shown in Figure 3, our network consists of three main parts: encoding (white box), denoising (yellow box), and decoding (gray box). We implement our network using the escnn library (Cesa et al., 2021). First, an equivariant observation encoder and an equivariant action encoder take inputs o and a^k, respectively, to create equivariant embeddings e_o and e_a. The embeddings will be in the form of a regular representation of the subgroup C_u ⊂ SO (2) (where u is the number of discrete rotations in the group). The embeddings have shape $e_{o} \in R^{u \times d_{o}}$ and $e_{a} \in R^{u \times d_{a} \times n}$ , where d_o and d_a are the dimensions of the embeddings and n is the number of action steps. Here, each of the d_o or d_a × n dimensional vectors encodes the features for a specific group element (i.e., a rotation angle). Second, in the denoising step, let $e_{o}^{g} \in R^{d_{o}}$ and $e_{a}^{g} \in R^{d_{a} \times n}$ be a pair of partial embeddings corresponding to the same group element g. We process each pair with a 1D Temporal U-Net (where the 1D conv is applied to the n temporal dimension of $e_{a}^{g}$ , and $e_{o}^{g}$ is used as conditioning, adopted from the prior works (Chi et al., 2023; Janner et al., 2022)) to calculate an equivariant noise embedding. Specifically, letting k be the denoising step, U the U-Net, and z its output, we have $z^{g} = U (e_{o}^{g}, e_{a^{k}}^{g}, k)$ . Since the same network is applied for all g ∈ C_u, the output is an equivariant embedding of the noise in the regular representation. We refer to this strategy of processing each element of the regular representation as the “pointwise equivariant processing.” Finally, an equivariant decoder will decode the noise ɛ^k.
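The pointwise equivariant processing step can be illustrated with a few lines of NumPy. In this sketch `toy_unet` is a hypothetical stand-in for the 1D Temporal U-Net, and d_o = d_a is assumed for simplicity; the point is only that applying the same function to every group element commutes with the cyclic shift of the regular representation.

```python
import numpy as np

def toy_unet(e_o_g, e_a_g, k, W):
    """Stand-in for U (hypothetical): conditions the action features of one
    group element on the observation features of the same group element."""
    return np.tanh(W @ e_a_g + e_o_g[:, None] + 0.01 * k)

def pointwise_process(e_o, e_a, k, W):
    """Apply the same network to each group element of the regular representation."""
    return np.stack([toy_unet(e_o[g], e_a[g], k, W) for g in range(e_o.shape[0])])

u, d, n = 8, 4, 16                      # group order, feature size, action steps
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))
e_o = rng.standard_normal((u, d))       # observation embedding, shape (u, d_o)
e_a = rng.standard_normal((u, d, n))    # action embedding, shape (u, d_a, n)

z = pointwise_process(e_o, e_a, 3, W)
# Acting with r in C_u cyclically shifts the group axis of both embeddings.
z_shift = pointwise_process(np.roll(e_o, 1, axis=0), np.roll(e_a, 1, axis=0), 3, W)
```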

Figure 3.

Overview of our Equivariant Diffusion Policy architecture. First, two equivariant encoders take the input observation o and the noisy action sequence a^k and generate the equivariant observation and action embeddings. Second, a 1D Temporal U-Net generates the equivariant noise embedding, which is then decoded into the noise prediction ɛ^k. We can then perform a denoising step using Equation (1) to get a^k−1. This process is iterated K times until we acquire the noise-free action sequence a⁰.

Equivariant point transformer and translation symmetry

We consider different input modalities in our work (i.e., images, voxel grids, and point clouds). Although images and voxel grids can be effective visual observations, point clouds have the advantage of directly capturing the 3D geometry of objects without the limited resolution problem that typically exists with voxel grids. For image and voxel inputs, we can use a simple 2D or 3D equivariant CNN as the equivariant observation encoder (as in Figure 3). However, with point cloud inputs, we found that a simple equivariant version of PointNet (Qi et al., 2017) does not perform well enough in complex manipulation tasks. To address this limitation, we propose an Equivariant Point Transformer that effectively captures geometric relationships while maintaining equivariance properties.

Our architecture is based on point transformer (Zhao et al., 2021). Let x be the input feature vector; p be the input point coordinates; y be the output feature vector; i, j be the indices of points; and $N (i)$ be the set of neighbor points of x_i. We use the same point transformer layer as in the prior work:

y_{i} = \sum_{x_{j} \in N (i)} \mathrm{softmax} (w (q_{i} - k_{j} + δ)) ⊙ (v_{j} + δ),

(11)

where q, k, v are the outputs of three MLPs with input x (i.e., q = q(x), k = k(x), v = v(x)); w is another MLP; ⊙ is pointwise multiplication; and δ = θ(p_i − p_j) is the relative position embedding where θ is an MLP. In order to make Equation (11) equivariant, we need to make all the MLPs q, k, v, θ, w equivariant. Moreover, we need to ensure that the other operations in Equation (11) do not break the equivariance. To achieve this, we use the regular representation ρ_reg as the output types of all MLPs, since the regular representation is compatible with all pointwise operations such as addition, subtraction, and pointwise multiplication. The full architecture is shown in Figure 4.
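For reference, Equation (11) can be sketched in plain NumPy. This is the ordinary, non-equivariant layer: single random linear maps stand in for the MLPs q, k, v, θ, w (an assumption), and the equivariant version described above would replace them with equivariant layers whose outputs are regular representations.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def point_transformer_layer(x, p, neighbors, params):
    """x: (N, C) features, p: (N, 3) coordinates, neighbors: (N, M) indices."""
    Wq, Wk, Wv, Wt, Ww = params                      # linear stand-ins for the MLPs
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    delta = (p[:, None, :] - p[neighbors]) @ Wt      # (N, M, C): theta(p_i - p_j)
    logits = (q[:, None, :] - k[neighbors] + delta) @ Ww
    attn = softmax(logits, axis=1)                   # normalize over the M neighbors
    return np.sum(attn * (v[neighbors] + delta), axis=1)   # (N, C)

rng = np.random.default_rng(0)
N, C, M = 32, 8, 4
x = rng.standard_normal((N, C))
p = rng.standard_normal((N, 3))
# k-nearest neighbors from pairwise squared distances (each point includes itself)
d2 = ((p[:, None, :] - p[None, :, :]) ** 2).sum(-1)
neighbors = np.argsort(d2, axis=1)[:, :M]
params = (0.1 * rng.standard_normal((C, C)), 0.1 * rng.standard_normal((C, C)),
          0.1 * rng.standard_normal((C, C)), 0.1 * rng.standard_normal((3, C)),
          0.1 * rng.standard_normal((C, C)))

y = point_transformer_layer(x, p, neighbors, params)
y_shifted = point_transformer_layer(x, p + np.array([1.0, -2.0, 0.5]), neighbors, params)
```

Since the coordinates enter only through p_i − p_j, translating every point by the same offset leaves the output unchanged.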

Figure 4.

Bottom: Our equivariant point transformer architecture. Top (gray): equivariant point transformer layer; middle left (blue): down sample block; middle right (yellow): equivariant point transformer block.

One advantage of our equivariant point transformer is that the relative position embedding δ = θ (p_i − p_j) makes the encoder translationally invariant, that is, translating the entire point cloud will not change the observation embedding. We further translate the action from the world frame to the gripper frame, making the entire policy T (3)-invariant (recall that T (3) is the group of 3D translations). Specifically, when the point cloud and the action are translated by the same amount, both the observation embedding and the action in the gripper’s translation frame are invariant, as shown in Figure 5. As a result, the point cloud version of our method has SO (2) × T (3) symmetry rather than just SO (2). (A similar process called “Relative Trajectory” is proposed in Chi et al. (2024).) This SO (2) × T (3) symmetry allows our model to generalize across both position and orientation changes in the environment, a capability that neither image-based nor voxel-based methods can fully achieve. In our experiments in Sections Standard Baseline Comparison and Ablation Study, this comprehensive equivariance contributes significantly to the superior performance of our point cloud-based approach, particularly in environments with high variability in object positions and orientations.

Figure 5.

Translation invariance via predicting the action in the gripper translation frame.
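The effect of predicting actions in the gripper translation frame can be sketched as follows. The function names and numeric values are hypothetical; only the translational part of the action is shown, and the orientation is left in the world frame.

```python
import numpy as np

def to_gripper_frame(action_xyz, gripper_xyz):
    """Express world-frame action positions relative to the gripper position."""
    return action_xyz - gripper_xyz

def to_world_frame(action_rel, gripper_xyz):
    """Inverse map, applied to the denoised action before execution."""
    return action_rel + gripper_xyz

gripper = np.array([0.4, 0.1, 0.2])
actions = np.array([[0.45, 0.10, 0.20],
                    [0.50, 0.12, 0.18]])     # two future gripper positions
t = np.array([0.3, -0.2, 0.05])              # arbitrary translation of the whole scene

rel = to_gripper_frame(actions, gripper)
rel_shifted = to_gripper_frame(actions + t, gripper + t)
```

The relative targets are identical before and after the translation, so a policy that also sees a translation-invariant observation embedding produces world-frame actions that move with the scene.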

We also replace the pointwise equivariant processing in the U-Net (Section Implementation of Equivariant Diffusion Policy) with a Frame Averaging interface (Puny et al., 2022). Specifically, let U be the U-Net, k be the denoising step, e_a and e_o be the action and observation embeddings (in the form of regular representations), and G be a symmetry group (e.g., C₈). The noise embedding z is calculated as

z = \frac{1}{| G |} \sum_{g \in G} ρ_{\mathrm{reg}} (g) U (ρ_{\mathrm{reg}}^{- 1} (g) e_{o}, ρ_{\mathrm{reg}}^{- 1} (g) e_{a}, k) .

Compared with the pointwise equivariant processing, Frame Averaging enables the U-Net to access the entire regular representation (rather than one element at a time) while also ensuring the equivariance. See Puny et al. (2022) for details.
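The averaging formula above can be sketched for G = C_u acting by cyclic shifts. In this illustration `toy_unet` is a hypothetical network that mixes information across the group axis, so it is not equivariant on its own; d_o = d_a is assumed for simplicity.

```python
import numpy as np

def rho_reg(x, m):
    """Act with g = r^m on a regular-representation feature (group axis is axis 0)."""
    return np.roll(x, m, axis=0)

def toy_unet(e_o, e_a, k, W):
    """Stand-in for U (hypothetical): mixes features across all group elements."""
    h = e_a + e_o[:, :, None] + 0.01 * k
    return np.tanh(np.einsum("gh,hdn->gdn", W, h))

def frame_average(e_o, e_a, k, W):
    """z = 1/|G| * sum_g rho(g) U(rho(g)^-1 e_o, rho(g)^-1 e_a, k)."""
    u = e_o.shape[0]
    z = 0.0
    for m in range(u):
        z = z + rho_reg(toy_unet(rho_reg(e_o, -m), rho_reg(e_a, -m), k, W), m)
    return z / u

u, d, n = 8, 4, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((u, u))
e_o = rng.standard_normal((u, d))
e_a = rng.standard_normal((u, d, n))

z = frame_average(e_o, e_a, 3, W)
z_shift = frame_average(rho_reg(e_o, 1), rho_reg(e_a, 1), 3, W)
raw = toy_unet(e_o, e_a, 3, W)
raw_shift = toy_unet(rho_reg(e_o, 1), rho_reg(e_a, 1), 3, W)
```

The averaged output shifts with its inputs even though the raw network output does not, at the cost of |G| forward passes through U.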

Simulation experiments

Experimental settings

We first evaluate our Equivariant Diffusion Policy (EquiDiff) with image (Im), voxel (Vo), or point cloud (PC) input on 12 manipulation tasks from MimicGen (Mandlekar et al., 2023) (Figure 6). The RGB observation is an agent-view image and an eye-in-hand image with a size of 3 × 84 × 84. The voxel grid observation has a size of 4 × 64 × 64 × 64 where the first channel is binary occupancy and the remaining three channels are RGB. The point cloud observation has a size of 1024 × 6 (i.e., xyzrgb). The point cloud only contains points above the table, as suggested in Ze et al. (2024). All tasks have a full 6 DoF SE (3) action space. We define the rotation of the observation as a point cloud rotation, a voxel grid rotation, or an image rotation. Notice that in the image version of our method, there is a mismatch between the rotation of the agent-view image and the rotation of the ground-truth state since the agent view is not orthogonally top-down. Although top-down observations could be captured, we use the observation settings in the published dataset from MimicGen (Mandlekar et al., 2023) to demonstrate the generalizability of our method; prior work (Wang et al., 2023) has demonstrated that an equivariant CNN is still able to capture symmetry in such a scenario. On the other hand, the point cloud and the voxel versions eliminate this symmetry mismatch as the rotation of the point cloud or the voxel grid aligns with the rotation of the ground-truth state. To better leverage the equivariance, we also add a rotation augmentation in the point cloud and voxel versions of our method following our analysis in Section Method.

Figure 6.

The experimental environments from MimicGen (Mandlekar et al., 2023). The left image in each subfigure shows the initial state of the environment; the right image shows the goal state. (a) Stack D1, (b) Stack three D1, (c) Square D2, (d) Threading D2, (e) Coffee D2, (f) Three Pc. Assembly D2, (g) Hammer cleanup D1, (h) Mug cleanup D1, (i) Kitchen D1, (j) Nut assembly D0, (k) Pick place D0, and (l) Coffee preparation D1.

Standard baseline comparison

We evaluate our Equivariant Diffusion Policy for both absolute pose control and relative-pose control. We compare our method with the following baselines:

1. DiffPo-C: the original diffusion policy (Chi et al., 2023) trained with the 1D Temporal U-Net (Janner et al., 2022). Notice that the baseline shares the same U-Net architecture as our method, but it does not have any equivariant structure.

2. DiffPo-T: same as above, but trained with a transformer.

3. DP3: the 3D diffusion policy (Ze et al., 2024) trained with a PointNet encoder.

4. ACT: the Action Chunking Transformer (Zhao et al., 2023) trained as a conditional VAE.

5. BC RNN: a recurrent architecture from Mandlekar et al. (2022).

Notice that the voxel version of our method and DP3 utilize 3D inputs constructed from four cameras, while the image version of our method and the other baselines directly use RGB images from two cameras. As our main baseline, we evaluate DiffPo-C in both absolute and relative-pose control. We evaluate the other baselines in the same control mode as in the original work (absolute for DiffPo-T, DP3, and ACT, and relative for BC RNN). See Appendix Simulation Environments and Training Detail for the details.

Results. Table 1 shows the experimental results of different methods using absolute control in terms of the maximum success rate among 50 evaluations throughout the training. Our Equivariant Diffusion Policy with point cloud input (EquiDiff (PC)) consistently achieves the best overall performance, significantly outperforming all baselines in most environments. The performance advantage is particularly pronounced in the low-data regime (i.e., with 100 or 200 demonstrations). Specifically, as shown in Table 2, EquiDiff (PC) trained with just 100 demos achieves an average success rate of 76.5% across all environments, outperforming the best baseline with the same amount of data by 34.5%. Remarkably, this performance even exceeds that of all baselines trained with 1000 demos, clearly demonstrating the exceptional sample efficiency of our approach. Compared to our conference version (EquiDiff (Vo)), the point cloud model delivers an additional 12.6% performance improvement, validating the benefits of incorporating both rotational and translational symmetries.

Table 1.

The performance of our method in absolute control compared with the baselines in simulation.

		Stack D1			Stack Three D1			Square D2			Threading D2
Method	Obs	100	200	1000	100	200	1000	100	200	1000	100	200	1000
EquiDiff (PC)	PCD	98 (+22)	100 (+3)	100 (=)	90 (+52)	96 (+24)	97 (+3)	67 (+59)	81 (+62)	75 (+26)	55 (+38)	60 (+25)	59 (=)
EquiDiff (Vo)	Voxel	99 (+23)	100 (+3)	100 (=)	75 (+37)	91 (+19)	91 (−3)	39 (+31)	48 (+29)	63 (+14)	39 (+22)	53 (+18)	55 (−4)
EquiDiff (Im)	RGB	93 (+17)	100 (+3)	100 (=)	55 (+17)	77 (+5)	96 (+2)	25 (+17)	41 (+22)	60 (+11)	22 (+5)	40 (+5)	59 (=)
DiffPo-C	RGB	76	97	100	38	72	94	8	19	46	17	35	59
DiffPo-T	RGB	51	83	99	17	41	84	5	11	45	11	18	41
DP3	PCD	69	87	99	7	23	65	7	6	19	12	23	40
ACT	RGB	35	73	96	6	37	78	6	18	49	10	21	35
		Coffee D2			Three Pc. assembly D2			Hammer cleanup D1			Mug cleanup D1
Method	Obs	100	200	1000	100	200	1000	100	200	1000	100	200	1000
EquiDiff (PC)	PCD	78 (+31)	74 (+8)	75 (−4)	66 (+62)	72 (+66)	69 (+26)	81 (+27)	81 (+10)	82 (−5)	65 (+22)	71 (+12)	71 (+6)
EquiDiff (Vo)	Voxel	65 (+18)	73 (+7)	76 (−3)	37 (+33)	58 (+52)	71 (+28)	70 (+16)	66 (−5)	73 (−14)	53 (+10)	65 (+6)	68 (+3)
EquiDiff (Im)	RGB	60 (+13)	79 (+13)	76 (−3)	15 (+11)	39 (+33)	69 (+26)	65 (+11)	63 (−8)	77 (−10)	49 (+6)	64 (+5)	67 (+2)
DiffPo-C	RGB	44	66	79	4	6	30	52	59	73	43	59	65
DiffPo-T	RGB	47	61	75	1	4	43	48	60	76	30	43	63
DP3	PCD	34	45	69	0	1	3	54	71	87	21	33	53
ACT	RGB	19	33	64	0	3	24	38	54	71	23	31	56
		Kitchen D1			Nut assembly D0			Pick place D0			Coffee preparation D1
Method	Obs	100	200	1000	100	200	1000	100	200	1000	100	200	1000
EquiDiff (PC)	PCD	84 (+17)	86 (+1)	86 (−5)	91 (+36)	94 (+26)	95 (+11)	59 (+24)	79 (+14)	90 (+7)	84 (+19)	85 (+23)	88 (+12)
EquiDiff (Vo)	Voxel	85 (+18)	89 (+4)	88 (−3)	67 (+12)	77 (+9)	83 (−1)	58 (+23)	68 (+3)	82 (−1)	80 (+15)	83 (+21)	85 (+9)
EquiDiff (Im)	RGB	67 (=)	77 (−8)	81 (−10)	74 (+19)	85 (+17)	94 (+10)	42 (+7)	74 (+9)	92 (+9)	77 (+12)	83 (+21)	85 (+9)
DiffPo-C	RGB	67	85	87	55	68	83	35	65	83	65	62	58
DiffPo-T	RGB	54	75	81	31	32	46	15	37	50	38	51	76
DP3	PCD	45	71	91	16	24	58	12	15	34	10	22	63
ACT	RGB	37	61	87	42	64	84	7	17	50	32	46	65

We experiment with 100, 200, and 1000 demos in each environment and report the maximum task success rate among 50 evaluations throughout training. Results are averaged over three seeds. The number in parentheses is the difference between our method and the best baseline (decreases are italicized). Bold success rates mark the best result; bold differences are those greater than 10%.

Table 2.

The average performance over 12 tasks of Equivariant Diffusion Policy compared with baselines.

		Average over 12 environments
Method	Ctrl	100	200	1000
EquiDiff (PC)	Abs	76.5 (+34.5)	81.6 (+23.8)	82.3 (+10.9)
EquiDiff (Vo)	Abs	63.9 (+21.9)	72.6 (+14.8)	77.9 (+6.5)
EquiDiff (Im)	Abs	53.7 (+11.7)	68.5 (+10.7)	79.7 (+8.3)
DiffPo-C	Abs	42.0	57.8	71.4
DiffPo-T	Abs	29.0	43.0	64.9
DP3	Abs	23.9	35.1	56.8
ACT	Abs	21.3	38.2	63.3
EquiDiff (Vo)	Rel	48.8 (+15.5)	58.0 (+10.7)	70.2 (−0.1)
EquiDiff (Im)	Rel	35.4 (+2.1)	50.4 (+3.1)	74.0 (+3.7)
DiffPo-C	Rel	33.3	47.3	63.2
BC RNN	Rel	22.9	41.2	70.3
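The headline averages can be reproduced from the per-task entries. The sketch below recomputes the 100-demo, absolute-control averages for EquiDiff (PC) and DiffPo-C (the best baseline on average) from the Table 1 values; the variable names are illustrative only. Because the Table 1 entries are themselves rounded, averages recomputed this way need not match Table 2 to the last digit for every method, although they do for these two rows.

```python
from statistics import fmean

# Success rates (%) with 100 demos, absolute control, copied from Table 1.
# Task order: Stack, Stack three, Square, Threading, Coffee, Three Pc. assembly,
# Hammer cleanup, Mug cleanup, Kitchen, Nut assembly, Pick place, Coffee preparation.
equidiff_pc = [98, 90, 67, 55, 78, 66, 81, 65, 84, 91, 59, 84]
diffpo_c = [76, 38, 8, 17, 44, 4, 52, 43, 67, 55, 35, 65]

ours = round(fmean(equidiff_pc), 1)      # average over the 12 tasks
baseline = round(fmean(diffpo_c), 1)     # best baseline on average
gap = round(ours - baseline, 1)          # difference in percentage points

print(f"EquiDiff (PC): {ours}, DiffPo-C: {baseline}, gap: {gap}")
```

The gap is an absolute difference in success rate, that is, percentage points rather than a relative improvement.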

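Table 3 shows the results of different methods using relative control. Our method with voxel input achieves the best average performance with 100 and 200 demos and is on par with BC RNN with 1000 demos (70.2% vs 70.3%; Table 2), while our method with RGB input is only marginally better than the baselines (by 2.1 to 3.7 points on average).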

Table 3.

Same experiment as Table 1 in relative control.

		Stack D1			Stack Three D1			Square D2			Threading D2
Method	Obs	100	200	1000	100	200	1000	100	200	1000	100	200	1000
EquiDiff (Vo)	Voxel	95 (+14)	100 (+5)	100 (=)	59 (+33)	76 (+24)	83 (−9)	25 (+17)	35 (+14)	52 (−7)	33 (+20)	39 (+13)	46 (−1)
EquiDiff (Im)	RGB	75 (−6)	96 (+1)	100 (=)	25 (−1)	63 (+11)	92 (=)	11 (+3)	21 (=)	48 (−11)	11 (−2)	22 (−4)	49 (+2)
DiffPo-C	RGB	81	93	99	26	52	86	6	13	37	13	26	40
BC RNN	RGB	59	95	100	12	48	92	8	21	59	7	13	47
		Coffee D2			Three Pc. Assembly D2			Hammer Cleanup D1			Mug Cleanup D1
Method	Obs	100	200	1000	100	200	1000	100	200	1000	100	200	1000
EquiDiff (Vo)	Voxel	55 (+12)	59 (+7)	64 (−12)	5 (+3)	5 (=)	55 (+28)	64 (+21)	62 (+8)	67 (−5)	39 (+14)	43 (+4)	62 (−5)
EquiDiff (Im)	RGB	41 (−2)	59 (+7)	66 (−10)	1 (−1)	5 (=)	59 (+32)	49 (+6)	52 (−2)	69 (−3)	29 (+4)	36 (−3)	65 (−2)
DiffPo-C	RGB	43	51	67	2	2	20	43	54	65	25	39	55
BC RNN	RGB	37	52	76	0	5	27	32	43	72	19	39	67
		Kitchen D1			Nut assembly D0			Pick place D0			Coffee preparation D1
Method	Obs	100	200	1000	100	200	1000	100	200	1000	100	200	1000
EquiDiff (Vo)	Voxel	69 (+27)	83 (+19)	89 (+8)	53 (+11)	65 (+3)	72 (−13)	40 (+5)	58 (−1)	79 (−3)	48 (+6)	71 (+18)	73 (+12)
EquiDiff (Im)	RGB	61 (+19)	72 (+8)	83 (+2)	44 (+2)	65 (+3)	87 (+2)	29 (−6)	55 (−4)	91 (+9)	49 (+7)	59 (+6)	79 (+18)
DiffPo-C	RGB	42	64	81	42	62	75	35	59	82	42	53	51
BC RNN	RGB	31	47	81	35	58	85	21	41	77	14	32	61

Improvement with different levels of equivariance

We further analyze how the performance improvement of our method varies across tasks with different levels of equivariance. Since equivariant models generalize automatically across object poses, equivariance should be more useful when the distribution of initial object poses has greater variance. We qualitatively group the tasks into three levels: (1) high-equivariance tasks, where the object poses are initialized randomly within the workspace; (2) intermediate-equivariance tasks, where each object is initialized within a certain range, with some randomness inside that range; and (3) low-equivariance tasks, where the position and/or orientation of certain objects has no randomness. Figure 7(a) shows the three task groups. Figure 7(b) shows the performance improvement of our Equivariant Diffusion Policy with point cloud input in absolute pose control over the standard diffusion policy. Generally, the high-equivariance tasks benefit more from injecting symmetry into the network architecture. Moreover, our method’s strong performance in the intermediate- and low-equivariance tasks indicates its robustness and generalizability: the model’s symmetry is helpful even when the task is only partially symmetric.

Figure 7.

(a) The three task groups are based on the level of equivariance and their initial object distribution. Images were generated by taking the average of five random initialization states. (b) The performance improvement of our Equivariant Diffusion Policy (PC) compared with the original diffusion policy in absolute pose control. Blue environments are high-equivariance tasks; green environments are intermediate-equivariance tasks; red environments are low-equivariance tasks.

Sample-efficient baseline comparison

In this section, we evaluate our method against two sample-efficient baselines:

1. RISE: a diffusion transformer architecture with a sparse 3D encoder, taking point clouds as inputs (Wang et al., 2024a).

2. ISP: an equivariant policy using spherical projection to project the input eye-in-hand RGB image onto a sphere for equivariant reasoning (Hu et al., 2025).

We experiment with 100 demos (i.e., the low-data regime) in this evaluation. As shown in Table 4, although ISP performs comparably to EquiDiff (Vo) and RISE slightly outperforms EquiDiff (Im), EquiDiff (PC) remains the best-performing method. Averaged across all 12 tasks with 100 demos, EquiDiff (PC) outperforms ISP by 13.6 and RISE by 20.0 absolute points (76.5 vs 62.9 and 56.5, respectively).

Table 4.

The performance of our method compared with two sample-efficient baselines in simulation.

Method	Obs	Mean	Stack D1	Stack three D1	Square D2	Threading D2	Coffee D2	Three Pc. D2
EquiDiff (PC)	PCD	76.5	98	90	67	55	78	66
EquiDiff (Vo)	Voxel	63.9	99	75	39	39	65	37
EquiDiff (Im)	RGB	53.7	93	55	25	22	60	15
RISE	PCD	56.5	100	87	34	38	48	45
ISP	RGB	62.9	98	81	56	19	67	39
Method	Obs		Hammer Cl. D1	Mug Cl. D1	Kitchen D1	Nut Asse. D0	Pick Place D0	Coffee prep. D1
EquiDiff (PC)	PCD		81	65	84	91	59	84
EquiDiff (Vo)	Voxel		70	53	85	67	58	80
EquiDiff (Im)	RGB		65	49	67	74	42	77
RISE	PCD		73	54	71	44	31	53
ISP	RGB		73	54	64	85	56	63

We experiment with 100 demos in each environment and report the maximum task success rate among 50 evaluations throughout training. Results averaged over three seeds. Bold indicates best performance.

Figure 8.

The real-world environments. The left image of each subfigure shows the initial state of the environment; the right image shows the goal state. See Appendix Real-Robot Environment Details for a detailed task description. (a) Oven opening, (b) Banana in bowl, (c) Letter alignment, (d) Trash sweeping, (e) Hammer to drawer, and (f) Bagel baking.

Ablation study

We perform an ablation study to understand the importance of different components of EquiDiff (PC). Specifically, we consider the following variations:

1. EquiDiff (PC): the complete model.

2. No FA: replaces the Frame Averaging with the pointwise equivariant processing.

3. No TSFM: replaces the equivariant point transformer with a simple equivariant point net.

4. No Trans Equi: does not translate the action to the gripper frame (thus the policy does not have the translation symmetry described at the end of Section Equivariant Point Transformer and Translation Symmetry).

5. No Rot Equi: removes all the SO(2)-equivariant structure in the model. This is essentially DP3 but with translation symmetry.
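One way to read the "translate the action to the gripper frame" step ablated in No Trans Equi is sketched below: point positions and action positions are expressed relative to the current gripper position, so translating the whole scene leaves the network's inputs and targets unchanged. This is an assumption about the mechanism, not the authors' implementation, and all function names are hypothetical.

```python
def to_gripper_frame(points_xyz, action_positions, gripper_position):
    """Express point and action positions relative to the gripper position."""
    def rel(p):
        return tuple(pi - gi for pi, gi in zip(p, gripper_position))
    return [rel(p) for p in points_xyz], [rel(a) for a in action_positions]

def to_world_frame(relative_positions, gripper_position):
    """Map gripper-frame positions (e.g., predicted actions) back to the world frame."""
    return [tuple(ri + gi for ri, gi in zip(r, gripper_position))
            for r in relative_positions]

def translate(positions, t):
    """Translate every position by the offset t."""
    return [tuple(pi + ti for pi, ti in zip(p, t)) for p in positions]

points = [(0.5, 0.25, 0.0), (1.0, -0.5, 0.25)]
actions = [(0.75, 0.0, 0.5)]
gripper = (0.25, 0.25, 0.5)
offset = (2.0, -1.0, 0.5)

original = to_gripper_frame(points, actions, gripper)
shifted = to_gripper_frame(translate(points, offset),
                           translate(actions, offset),
                           translate([gripper], offset)[0])
translation_invariant = original == shifted
```

Because the relative representation is identical before and after the shift, a policy trained on it sees the same data wherever the scene sits in the workspace.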

As shown in Table 5, removing the rotational symmetry has the largest negative impact on the model, decreasing the average success rate by 39.3 percentage points (from 76.5% to 37.2%). Removing the translation symmetry and the point transformer architecture decreases the average by 17.5 and 15.7 points, respectively, whereas replacing Frame Averaging with pointwise processing costs only 1.7 points. These results demonstrate the importance of the key pieces of our model: rotational and translational symmetry and the equivariant point transformer architecture.

Table 5.

The performance of our EquiDiff (PC) compared with different ablated variations.

Ablation	Average	Stack	Stack three	Square	Threading	Coffee	Three Pc.	Hammer C.	Mug C.	Kitchen	Nut Asse.	Pick Place	Coffee Prep.
EquiDiff (PC)	76.5	98	90	67	55	78	66	81	65	84	91	59	84
No FA	74.8 (−1.7)	96	94	63	50	71	69	82	59	85	92	51	85
No TSFM	60.8 (−15.7)	100	93	30	35	73	5	85	58	85	57	49	59
No trans equi	59.0 (−17.5)	95	65	44	29	48	49	61	57	71	84	34	71
No rot equi	37.2 (−39.3)	97	41	13	21	51	2	67	48	61	13	17	15

We experiment with 100 demos in each environment. Results averaged over three seeds.

Real-robot experiment

Experimental settings

In this section, we evaluate our method on a real robot system containing a Franka Emika robot arm (Haddadin et al., 2022) equipped with a pair of fin-ray (Crooks et al., 2016) fingers and three Intel Realsense (Keselman et al., 2017) D455 cameras. Demonstrations were gathered by an operator using a 6DoF 3Dconnexion mouse. Observations and demonstration actions were recorded at 5 Hz. As in prior work (Chi et al., 2023), we use DDIM (Song et al., 2020) in this experiment to reduce the number of denoising steps to 16.
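To make the role of DDIM concrete, the sketch below runs deterministic DDIM sampling (eta = 0) over 16 timesteps chosen from a longer training schedule. The 100 training steps, the linear beta schedule, and the oracle noise model are assumptions made for illustration; the text above only specifies DDIM with 16 denoising steps.

```python
import math

def make_alpha_bars(num_train_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_train_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def ddim_timesteps(num_train_steps, num_inference_steps):
    """Evenly spaced subset of training timesteps, ordered from noisiest to cleanest."""
    stride = num_train_steps // num_inference_steps
    return list(range(0, num_train_steps, stride))[:num_inference_steps][::-1]

def ddim_sample(x, eps_model, alpha_bars, timesteps):
    """Deterministic DDIM update applied over the selected timesteps."""
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)
        # Predict the clean action, then move it to the next (lower) noise level.
        x0 = [(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t) for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x

alpha_bars = make_alpha_bars(100)       # 100 training steps is an assumption
timesteps = ddim_timesteps(100, 16)     # 16 denoising steps, as in the experiment
target = [0.3, -0.7]                    # hypothetical clean action

def oracle_eps(x, t):
    """Hypothetical perfect noise predictor for the known target action."""
    a_t = alpha_bars[t]
    return [(xi - math.sqrt(a_t) * ti) / math.sqrt(1 - a_t) for xi, ti in zip(x, target)]

sample = ddim_sample([1.0, -1.0], oracle_eps, alpha_bars, timesteps)
```

With a perfect noise predictor, the 16-step sampler lands on the target action; in practice, the trained denoising network takes the place of `oracle_eps`.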

EquiDiff with voxel input

We first compare our Equivariant Diffusion Policy with voxel input against a baseline Diffusion Policy, which uses the same voxel grid as the vision input and employs a non-equivariant 3D convolutional encoder with approximately the same number of trainable parameters as ours. As we show in the ablation study (Appendix Ablation Study of EquiDiff (Vo)), this baseline works better than the original diffusion policy with image input. Figure 8 shows the six tasks in this experiment. Figure 9 shows the robot system.

Figure 9.

Our real-robot platform contains a Franka Emika robot arm equipped with a pair of fin-ray fingers, and three Intel Realsense D455 cameras.

Results

We evaluate the trained models over 20 test trials for each task. The results are shown in Table 6. Our Equivariant Diffusion Policy solves these tasks with only 20 to 60 demonstrations. Notably, our method achieves an 80% success rate in Bagel baking, where the failures were all due to the joint limits of the robot. In comparison, the baseline reaches 60% on Oven opening and at most 30% on the other five tasks.

Table 6.

Performance of Equivariant Diffusion Policy in real-world robot experiments.

Task	Oven opening	Banana in bowl	Letter alignment	Trash sweeping	Hammer to drawer	Bagel baking
# Demos	20	40	40	40	60	58
EquiDiff (Vo)	95% (19/20)	95% (19/20)	95% (19/20)	90% (18/20)	85% (17/20)	80% (16/20)
DiffPo-C (Vo)	60% (12/20)	30% (6/20)	0% (0/20)	5% (1/20)	5% (1/20)	10% (2/20)

EquiDiff with point cloud input

This experiment evaluates our Equivariant Diffusion Policy with point cloud or voxel input in more advanced tasks. We consider the Bagel Baking and Trash Sweeping tasks in Figure 8, as well as six new tasks in Figure 10. As shown in Table 7, EquiDiff (PC) matches or exceeds the voxel version on all eight tasks, with the largest gains on Tool box (85% vs 55%) and Seat wiping (95% vs 70%), which aligns with our simulation experiments.

Figure 10.

The more advanced real-world environments. (a) Coffee making, (b) Twist pipe, (c) Toast making, (d) Seat wiping, (e) Tool box, and (f) Screwdriver to drawer.

Table 7.

Performance of Equivariant Diffusion Policy in more advanced real-world environments.

Task	Twist pipe	Seat wiping	Screwdriver to drawer	Trash sweeping	Toast making	Bagel baking	Tool box	Coffee making
# Demos	20	20	40	50	50	60	99	159
EquiDiff (PC)	100% (20/20)	95% (19/20)	90% (18/20)	100% (20/20)	70% (14/20)	85% (17/20)	85% (17/20)	70% (14/20)
EquiDiff (Vo)	100% (20/20)	70% (14/20)	85% (17/20)	80% (18/20)	65% (13/20)	85% (17/20)	55% (11/20)	55% (11/20)

Generalization experiment

In this experiment, we evaluate the generalizability of our Equivariant Diffusion Policy to unseen object poses. We conduct this evaluation in the Bagel Baking experiment in the real world, where the oven is initialized in three different poses during training (Figure 11(a)). At test time, we rotate the oven to eight different, unseen poses (Figure 11(b)). We found that the learned policy can zero-shot generalize to these unseen rotations, with the exception of the scenario where the oven is rotated to the bottom-right corner. In this case, the policy is constrained by the robot’s joint limits. Specifically, the policy was able to open the oven and pull out the tray, but when picking up the bagel, although the policy could generate good gripper poses, the actions were infeasible for the robot due to joint limits. This generalization demonstrates the power of the equivariant structure in our policy.

Figure 11.

(a) The initial oven poses in the training set. (b) The oven poses in the generalization experiment. Those poses are unseen during training. In both training and testing, the pose of the bagel is random.

Conclusion

This paper studies leveraging symmetries in visuomotor policy learning. We propose the novel Equivariant Diffusion Policy method and provide a theoretical analysis identifying the conditions under which the diffusion policies are equivariant. We extend our previous work by incorporating both SO(2) rotational and T(3) translational symmetries through our Equivariant Point Transformer architecture, demonstrating a general framework for using these symmetries in 6DoF control for robotic manipulation. Our comprehensive evaluation in both simulation and real-world environments shows that our extended method substantially outperforms both our conference version and the baseline Diffusion Policy, achieving significantly higher success rates with notably fewer demonstrations.

Training stability

We trained our equivariant models with the same optimizer, noise schedule, and U-Net/transformer depth as their non-equivariant counterparts, with no per-task tuning. Across seeds and data regimes, we did not observe gradient vanishing/exploding or mode-collapse behavior. In Appendix Performance under Hyper-Parameter Changes, varying batch size, warm-up, weight decay, and learning rate left Stack D1 at 100% success, indicating our equivariant layers are plug-and-play replacements that preserve Diffusion Policy training dynamics while improving data efficiency and generalization.

Symmetry breaking

One limitation of this work is the partial utilization of equivariance due to symmetry mismatch in the vision system. Even with voxel or point cloud inputs, factors such as occasional arm visibility in the observation and camera noise can introduce imperfect symmetric transformations, leading to an “extrinsic equivariance” (Wang et al., 2023) setting where the symmetry in the architecture transforms the data out of distribution, and the benefit of equivariance degrades. Future work could address this by designing a vision system that avoids such symmetry corruption. Additionally, “incorrect equivariance,” as shown in prior work (Wang et al., 2024c), may harm performance when the model’s symmetry conflicts with the demonstrations. For example, reachability and kinematic constraints of the robot arm are not always symmetric, potentially yielding infeasible symmetric transformations of demonstrated actions. Another example is tasks requiring actions tied to the world frame without visual cues (e.g., “push object to the left”); applying a symmetric transformation could produce the opposite behavior.
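The "push object to the left" case can be written out as a short worked example. The notation here is an assumption made for illustration: g · s and g · a denote the rotated state and action, π is an equivariant policy, π^{*} is the demonstrated (world-frame) behavior, and a_{left} is a nonzero planar push direction.

\pi(g \cdot s) = g \cdot \pi(s) \quad \text{(equivariant model)}, \qquad \pi^{*}(g \cdot s) = \pi^{*}(s) = a_{left} \quad \text{(world-frame task)}.

For g a rotation by 180° about the gravity axis, g \cdot a_{left} = -a_{left}. If the model fits the demonstration at s, that is, \pi(s) = a_{left}, then equivariance forces \pi(g \cdot s) = -a_{left}, whereas the demonstration at g \cdot s still requires a_{left}. Both constraints cannot hold at once when a_{left} \neq 0, so the model's symmetry directly conflicts with the data.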

It is worth noting that although both “extrinsic equivariance” and “incorrect equivariance” can be viewed as forms of symmetry breaking, their effects differ fundamentally. Intuitively, extrinsic equivariance can shift inputs slightly out of distribution, but the similarity between the symmetrically transformed data and the in-distribution data can help the network learn the true decision boundary. In such cases, equivariant models often remain beneficial, and the good performance of our method in the intermediate- and low-equivariance tasks in Figure 7(b) is consistent with this. In contrast, incorrect equivariance places the model in direct conflict with the ground-truth mapping; it is therefore harmful and should be avoided. See Wang et al. (2023, 2024c) for further discussion.

Other limitations

While the theory in Section SO(2) Representation on 6DoF Action is not restricted to diffusion policies and in principle applies to other policy-learning pipelines, we have not demonstrated this empirically. Given the strong performance of BC RNN with relative-pose control in Tables 2 and 3, an equivariant BC RNN is a promising direction. Finally, extending our approach to other robotic settings, such as navigation, locomotion, and mobile manipulation, remains important future work.

Acknowledgments

The authors would like to thank Dr Osman Dogan Yirmibesoglu for the design of the fin-ray gripper fingers, Dr Andy Park for building the teleop system for data collection, Emmanuel Panov for collecting demonstration data in the robot experiment, Dr Thomas Weng for the proofreading of the paper, and Dr Cheng Chi for the helpful discussion.

ORCID iDs

Dian Wang

Haojie Huang

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Part of the work was done when Dian Wang was an intern at the Robotics and AI Institute. This work is supported in part by NSF 1750649, NSF 2107256, NSF 2314182, NSF 2134178, NSF 2409351, 2442658, and NASA 80NSSC19K1474. Dian Wang is supported in part by the JPMorgan Chase PhD fellowship.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix

References

Brehmer

Bose

De Haan

, et al. (2023) EDGI: equivariant diffusion for planning with embodied agents. ArXiv Preprint arXiv:2303.12410.

Cesa

Lang

Weiler

(2021) A program to build E(N)-equivariant steerable CNNs. In: International conference on learning representations, Virtual Conference, 3–7 May 2021.

Chen

, et al. (2023) Equidiff: a conditional equivariant diffusion model for trajectory prediction. In: 2023 IEEE 26th international conference on intelligent transportation systems (ITSC), Bilbao, Spain, 24–28 September 2023, pp. 746–751. IEEE.

Chi

Feng

, et al. (2023) Diffusion policy: visuomotor policy learning via action diffusion. In: Proceedings of robotics: science and systems (RSS), Daegu, Republic of Korea, 10–14 July 2023.

Chi

Pan

, et al. (2024) Universal manipulation interface: in-The-wild robot teaching without in-the-wild robots. In: Proceedings of robotics: science and systems (RSS), Delft, Netherlands, 15–19 July 2024.

Crooks

Vukasin

O’Sullivan

, et al. (2016) Fin ray® effect inspired soft robotic gripper: from the robosoft grand challenge toward optimization. Frontiers in Robotics and AI 3: 70.

Mordatch

(2019) Implicit generation and generalization in energy-based models. In: Advances in neural information processing systems, Vancouver, BC, 8–14 December 2019.

Eisner

Yang

Davchev

, et al. (2024) Deep SE(3)-equivariant geometric reasoning for precise placement tasks. In: The twelfth international conference on learning representations, Vienna, Austria, 7–11 May 2024. URL: https://openreview.net/forum?id=2inBuwTyL2

Florence

Manuelli

Tedrake

(2019) Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters 5(2): 492–499.

10.

Florence

Lynch

Zeng

, et al. (2021) Implicit behavioral cloning. In: Conference on robot learning (CoRL), London, UK, 8–11 November 2021.

11.

Gao

Xue

Deng

, et al. (2024) RiEMann: near real-time SE(3)-equivariant robot manipulation without point cloud segmentation. In: 8th annual conference on robot learning, Munich, Germany, 6–9 November 2024. URL: https://openreview.net/forum?id=eJHy0AF5TO

12.

Grathwohl

Wang

Jacobsen

, et al. (2020) Learning the stein discrepancy for training and evaluating energy-based models without sampling. In: International conference on machine learning, Virtual Meeting, 13–18 July 2020, pp. 3732–3747. PMLR.

13.

Guan

Qian

Peng

, et al. (2023) 3D equivariant diffusion for target-aware molecule generation and affinity prediction. In: The eleventh international conference on learning representations, Kigali, Rwanda, 1–5 May 2023.

14.

Gupta

Kumar

Lynch

, et al. (2020) Relay policy learning: solving long-horizon tasks via imitation and reinforcement learning. In: Conference on robot learning, Virtual Meeting, 16–18 November 2020, pp. 1025–1037. PMLR.

15.

Haddadin

Parusel

Johannsmeier

, et al. (2022) The Franka Emika robot: a reference platform for robotics research and education. IEEE Robotics and Automation Magazine 29(2): 46–64.

16.

Zhang

Ren

, et al. (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, 27–30 June 2016, pp. 770–778.

17.

Jain

Abbeel

(2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33: 6840–6851.

18.

Hoogeboom

Satorras

Vignac

, et al. (2022) Equivariant diffusion for molecule generation in 3D. In: International conference on machine learning, Baltimore, MD, 17–23 July 2022, pp. 8867–8887. PMLR.

19.

Zhu

Wang

, et al. (2024) Orbitgrasp: Se (3)-equivariant grasp learning. In: 8th annual conference on robot learning, Munich, Germany, 6–9 November 2024.

20.

Wang

Klee

, et al. (2025) 3d equivariant visuomotor policy learning via spherical projection. In: The thirty-ninth annual conference on neural information processing systems, San Diego, CA, 2–7 December 2025. URL: https://openreview.net/forum?id=kXJd4JxF34

21.

Huang

Wang

Walters

, et al. (2022) Equivariant transporter network. In: Robotics: science and systems, New York City, NY, 27 June 2022–1 July 2022.

22.

Huang

Wang

Tangri

, et al. (2023a) Leveraging symmetries in pick and place. The International Journal of Robotics Research 43(4): 550–571.

23.

Huang

Wang

Zhu

, et al. (2023b) Edge grasp network: a graph-based SE(3)-invariant approach to grasp detection. In: International conference on robotics and automation (ICRA), London, UK, 29 May 2023–2 June 2023.

24.

Huang

Howell

Wang

, et al. (2024a) Fourier transporter: bi-equivariant robotic manipulation in 3d. In: The twelfth international conference on learning representations, Vienna, Austria, 7–11 May 2024. URL: https://openreview.net/forum?id=UulwvAU1W0

25.

Huang

Schmeckpeper

Wang

, et al. (2024b) Imagination policy: using generative point cloud models for learning manipulation policies. In: 8th annual conference on robot learning, Munich, Germany, 6–9 November 2024. URL: https://openreview.net/forum?id=56IzghzjfZ

26.

Janner

Tenenbaum

, et al. (2022) Planning with diffusion for flexible behavior synthesis. In: International conference on machine learning, , Baltimore, MD, 17–23 July 2022, pp. 9902–9915. PMLR.

27.

Jarrett

Bica

van der Schaar

(2020) Strictly batch imitation learning by energy-based distribution matching. Advances in Neural Information Processing Systems 33: 7354–7365.

28.

Jia

Wang

, et al. (2023) SEIL: simulation-augmented equivariant imitation learning. In: International conference on robotics and automation (ICRA), London, UK, 29 May 2023–2 June 2023.

29.

Keselman

Iselin Woodfill

Grunnet-Jepsen

, et al. (2017) Intel realsense stereoscopic depth cameras. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, Honolulu, HI, 21–26 July 2017, pp. 1–10.

30.

Kim

Lim

Lee

, et al. (2023) Se (2)-equivariant pushing dynamics models for tabletop object manipulations. In: Conference on robot learning, Atlanta, GA, 6–9 November 2023, pp. 427–436. PMLR.

31.

Kohler

Srikanth

Arora

, et al. (2023) Symmetric models for visual force policy learning. ArXiv Preprint arXiv:2308.14670.

32.

Liang

Ding

, et al. (2023) AdaptDiffuser: diffusion models as adaptive self-evolving planners. In: International conference on machine learning, Honolulu, HI, 23–29 July 2023.

33.

Lim

Kim

, et al. (2024) Equigraspflow: SE(3)-equivariant 6-dof grasp pose generative flows. In: 8th annual conference on robot learning, Munich, Germany, 6–9 November 2024. URL: https://openreview.net/forum?id=5lSkn5v4LK

34.

Liu

Huang

, et al. (2023) Continual vision-Based reinforcement learning with group symmetries. In: Conference on robot learning, Atlanta, GA, 6–9 November 2023, pp. 222–240. PMLR.

35.

Loshchilov

Hutter

(2018) Decoupled weight decay regularization. In: International conference on learning representations, Vancouver, BC, 30 April 2018–3 May 2018.

36.

Mandlekar

Wong

, et al. (2022) What matters in learning from offline human demonstrations for robot manipulation. In: Faust

Hsu

Neumann

(eds) Proceedings of the 5th conference on robot learning, proceedings of machine learning research, Auckland, New Zealand, 14–18 December 2022, pp. 1678–1690. PMLR.

37.

Mandlekar

Nasiriany

Wen

, et al. (2023) MimicGen: a data generation system for scalable robot learning using human demonstrations. In: 7th annual conference on robot learning, Atlanta, GA, 6–9 November 2023.

38.

Nguyen

Baisero

Klee

, et al. (2023) Equivariant reinforcement learning under partial observability. In: Conference on robot learning, Atlanta, GA, 6–9 November 2023, pp. 3309–3320. PMLR.

39.

Nguyen

Kozuno

Beltran-Hernandez

, et al. (2024) Symmetry-aware reinforcement learning for robotic assembly under partial observability with a soft wrist. ArXiv Preprint arXiv:2402.18002.

40.

Orsini

Raichuk

Hussenot

, et al. (2021) What matters for adversarial imitation learning? Advances in Neural Information Processing Systems 34: 14656–14668.

41.

Pan

Okorn

Zhang

, et al. (2023) TAX-Pose: task-specific cross-pose estimation for robot manipulation. In: Conference on robot learning, Atlanta, GA, 6–9 November 2023, pp. 1783–1792. PMLR.

42.

Pearce

Rashid

Kanervisto

, et al. (2022) Imitating human behaviour with diffusion models. In: The eleventh international conference on learning representations, Virtual Meeting, 25–29 April 2022.

43.

Puny

Atzmon

Smith

, et al. (2022) Frame averaging for invariant and equivariant network design. In: International conference on learning representations, Virtual Meeting, 25–29 April 2022. URL: https://openreview.net/forum?id=zIUyj55nXR

44.

, et al. (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, 21–26 July 2017, pp. 652–660.

45.

Rahmatizadeh

Abolghasemi

Bölöni

, et al. (2018) Vision-based multi-task manipulation for enexpensive robots using end-to-end learning from demonstration. In: 2018 IEEE international conference on robotics and automation (ICRA), Brisbane, QLD, 21–25 May 2018, pp. 3758–3765. IEEE.

46.

Ryu

Kim

Chang

, et al. (2023b) Diffusion-EDFs: bi-equivariant denoising generative modeling on SE(3) for visual robotic manipulation. ArXiv Preprint arXiv:2309.02685.

47.

Simeonov

Tagliasacchi

, et al. (2022) Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In: 2022 international conference on robotics and automation (ICRA), Philadelphia, PA, 23–27 May 2022, pp. 6394–6400. IEEE.

48.

Ryu

Lee

, et al (2023a) Equivariant descriptor fields: SE(3)-equivariant energy-based models for end-to-end visual robotic manipulation learning. In: The eleventh international conference on learning representations, Kigali, Rwanda, 1–5 May 2023.

49.

Simeonov, Lin, et al. (2023) SE(3)-Equivariant relational rearrangement with neural descriptor fields. In: Conference on robot learning, Atlanta, GA, 6–9 November 2023, pp. 835–846. PMLR.

50.

Sohl-Dickstein, Weiss, Maheswaranathan, et al. (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: Bach, Blei (eds) Proceedings of the 32nd international conference on machine learning, proceedings of machine learning research, Lille, France, 6–11 July 2015, pp. 2256–2265. PMLR.

51.

Song, Ermon (2019) Generative modeling by estimating gradients of the data distribution. In: Advances in neural information processing systems, Vancouver, BC, 8–14 December 2019.

52.

Song, Meng, Ermon (2020) Denoising diffusion implicit models. In: International conference on learning representations, Virtual Meeting, 26–30 April 2020.

53.

Toyer, Shah, Critch, et al. (2020) The MAGICAL benchmark for robust imitation. Advances in Neural Information Processing Systems 33: 18284–18295.

54.

Wang, Walters, Zhu, et al. (2021a) Equivariant Q learning in spatial action spaces. In: 5th annual conference on robot learning, London, UK, 8–11 November 2021.

55.

Wang, Walters (2021b) Incorporating symmetry into deep dynamics models for improved generalization. In: International conference on learning representations (ICLR), Virtual Meeting, 3–7 May 2021.

56.

Wang, Jia, Zhu, et al. (2022a) On-robot learning with equivariant models. In: 6th annual conference on robot learning, Auckland, New Zealand, 14–18 December 2022.

57.

Wang, Walters, Platt (2022b) SO(2)-Equivariant reinforcement learning. In: International conference on learning representations, Virtual Meeting, 25–29 April 2022.

58.

Wang, Hunt, Zhou (2022c) Diffusion policies as an expressive policy class for offline reinforcement learning. In: The eleventh international conference on learning representations, Virtual Meeting, 25–29 April 2022.

59.

Wang, Park, Sortur, et al. (2023) The surprising effectiveness of equivariant models in domains with latent symmetry. In: International conference on learning representations, Kigali, Rwanda, 1–5 May 2023.

60.

Wang, Fang, et al. (2024a) Rise: 3d perception makes real-world robot imitation simple and effective. In: 2024 IEEE/RSJ international conference on intelligent robots and systems (IROS), Abu Dhabi, UAE, 14–18 October 2024, pp. 2870–2877. IEEE.

61.

Wang, Hart, Surovik, et al. (2024b) Equivariant diffusion policy. In: 8th annual conference on robot learning, Munich, Germany, 6–9 November 2024. URL: https://openreview.net/forum?id=wD2kUVLT1g

62.

Wang, Zhu, Park, et al. (2024c) A general theory of correct, incorrect, and extrinsic equivariance. In: Advances in neural information processing systems, Vancouver, BC, 10–15 December 2024.

63.

Xian, Gkanatsios, Gervet, et al. (2023) ChainedDiffuser: unifying trajectory diffusion and keypose prediction for robotic manipulation. In: 7th annual conference on robot learning, Atlanta, GA, 6–9 November 2023.

64.

Song, et al. (2022) GeoDiff: a geometric diffusion model for molecular conformation generation. In: International conference on learning representations, Virtual Meeting, 25–29 April 2022.

65.

Yang, Cao, Deng, et al. (2024a) EquiBot: Sim(3)-equivariant diffusion policy for generalizable and data efficient learning. In: 8th annual conference on robot learning, Munich, Germany, 6–9 November 2024.

66.

Yang, Deng, et al. (2024b) EquivAct: Sim(3)-equivariant visuomotor policies beyond rigid object manipulation. In: 2024 IEEE international conference on robotics and automation (ICRA), Yokohama, Japan, 13–17 May 2024, pp. 9249–9255. IEEE.

67.

Zhang, et al. (2024) 3d diffusion policy: generalizable visuomotor policy learning via simple 3d representations. In: Robotics: science and systems, Delft, Netherlands, 15–19 July 2024.

68.

Zhang, McCarthy, Jow, et al. (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In: 2018 IEEE international conference on robotics and automation (ICRA), Brisbane, QLD, 21–25 May 2018, pp. 5628–5635. IEEE.

69.

Zhao, Jiang, Jia, et al. (2021) Point transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, Montreal, QC, 10–17 October 2021, pp. 16259–16268.

70.

Zhao, Kumar, Levine, et al. (2023) Learning fine-grained bimanual manipulation with low-cost hardware. In: Proceedings of robotics: science and systems (RSS), Daegu, Republic of Korea, 10–14 July 2023.

71.

Zhou, Barnes, et al. (2019) On the continuity of rotation representations in neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, 15–20 June 2019, pp. 5745–5753.

72.

Zhu, Wang, Biza, et al. (2022) Sample efficient grasp learning using equivariant models. In: Robotics: science and systems, New York City, NY, 27 June 2022–1 July 2022.

73.

Zhu, Wang, et al. (2023) On robot grasp learning using equivariant models. Autonomous Robots 47(8): 1175–1193.
