
Abstract

Human motion prediction is a classic problem in computer vision and graphics, and the prediction of human motion diversity has a wide range of practical applications. To tackle this problem, this study proposes predicting the future motion diversity of the human body based on conditional denoising diffusion probabilistic models combined with the kinematics of human joints. First, the observed and predicted sequences were integrated into the same sample space using the mask mechanism, and Gaussian noise was gradually injected into the predicted sequence leveraging the cosine noise scheduler to destroy the sequence structure. Subsequently, the spatial-temporal feature extractor and channel enhancement module were used to form a denoiser to learn the temporal dynamic evolution of the sample and the potential correlation between the nodes in the diffusion process to complete the noise prediction and restore the sample information. The proposed method was verified on the Human3.6M and HumanEva-I datasets, and the experimental results show that the proposed method is competitive with previous methods in diversity prediction.

Keywords

DDPM, self-attention mechanism, transformer, mask matrix, GCN

1. Introduction

Human motion sequence prediction has gained significant attention due to its wide-ranging applications in human–robot interactions (Bütepage et al., 2018), autonomous driving (Mangalam et al., 2020; Paden et al., 2016), and assistive robotics (Cui et al., 2020; Kundu et al., 2019). The core challenge lies in predicting future sequences from observed data, a task complicated by the intrinsic uncertainty and randomness of human behavior (Dang et al., 2022). While prior studies (Ma et al., 2022; Mao et al., 2019) achieved promising accuracy in deterministic predictions, they overlooked the critical need for modeling diverse future motions—an essential requirement for safe decision-making in scenarios where robots must adapt to unpredictable human actions. For example, a home robot caring for an elderly person relies on diverse motion predictions to prepare multiple response strategies for situations like sudden stumbles or changes in posture (Dang et al., 2022).

Early approaches to motion prediction, such as recurrent neural networks (RNNs) and graph convolutional networks (GCNs) (Butepage et al., 2017; Chiu et al., 2019; Xu et al., 2022), focused on capturing spatio-temporal features to predict the most likely future motion. These methods evaluated performance by measuring the distance between predicted skeletons and ground truth, achieving moderate success in short-term prediction (Butepage et al., 2017). However, they inherently lacked the ability to model the stochasticity of human movement, limiting their utility in applications where uncertainty must be explicitly addressed (Dang et al., 2022).

With the development of deep generative modeling, methods like generative adversarial networks (GANs) and variational autoencoders (VAEs) (Hernandez et al., 2019; Kundu et al., 2019; Lin & Amer, 2018; Yan et al., 2018; Yuan & Kitani, 2020) emerged to generate diverse motion sequences. These models assessed performance using probability metrics (e.g., minimum distance to ground truth) and diversity metrics (e.g., average distance between prediction samples) (Lin & Amer, 2018). While they introduced stochasticity, challenges such as mode collapse and unstable training limited their effectiveness in producing realistic and varied motions (Hernandez et al., 2019).

Denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020) have since emerged as a promising framework for motion prediction, leveraging their capacity to transform noise into plausible sequences through iterative denoising. The DDPM has not only become a new state-of-the-art image generation model (Dhariwal & Nichol, 2021; Lugmayr et al., 2022; Saharia et al., 2022), but has also been successfully applied to speech generation (Kim et al., 2022). Lugmayr et al. (2022) applied it to image inpainting and verified that the model could generate high-quality and diverse images for any form of mask. Dhariwal and Nichol (2021) also demonstrated the superiority of the DDPM over generative adversarial networks in image synthesis. DDPMs have also been adapted to human motion prediction with notable advancements (Ahn et al., 2023; Chen et al., 2023; Tashiro et al., 2021; Wei et al., 2023; Wen et al., 2023).

Wei et al. (2023) modeled joint diffusion as thermally agitated particles to derive a parameter-free "whitened" latent space, reducing posterior collapse and improving diversity. This approach addresses the limited diversity that arises when latent variables learned through the joint training of sampling and decoding are ignored by strong decoders. Score-based diffusion models have recently outperformed other diffusion models in many tasks, such as image generation and audio synthesis; therefore, Tashiro et al. (2021) advanced score-based diffusion with conditional models, leveraging observed data to enhance deterministic accuracy and reduce calculation errors. Wen et al. (2023) integrated diffusion uncertainty with spatio-temporal graph networks to better model the inherent variability in skeletal data, validated via experiments. Ahn et al. (2023) proposed combining spatial and temporal transformers in series or in parallel to fully learn the spatio-temporal structural information of noisy samples, which diffusion models otherwise fail to capture fully during training. By evaluating the ability of the DDPM to model the diversity and determinism of future human motion, they demonstrated that the denoiser trained by this method can accurately recover the original samples from the noisy space. Previous methods predicted human motion by encoding historical motion into latent representations and then decoding these representations to obtain the future motion of the human body. However, in practice, this approach remains inadequate owing to issues such as complex constraints and the diversity of predicted future movements. To address these weaknesses, Chen et al. (2023) discarded the encoding-decoding approach and proposed a masking mechanism combined with data-centered techniques to deal with human behavior prediction from a new perspective. Specifically, historical and predicted motions were first merged, then a motion diffusion model was learned, and motions were generated from random noise.

Despite these innovations, DDPM-based motion modeling remains hindered by two key limitations. First, existing methods often treat each joint’s time series in isolation, neglecting the spatial correlations between body parts that are essential for biomechanically plausible predictions (Chen et al., 2023; Wei et al., 2023). Second, the sequential nature of motion data leads to inefficient training and inference, particularly for long-term prediction, as models struggle to encode global cross-frame dependencies without excessive computational cost (Ahn et al., 2023; Tashiro et al., 2021).

In this study, inspired by the three-dimensional (3D) skeleton diffusion model (Ahn et al., 2023), we propose a denoising diffusion network consisting of a spatio-temporal feature extractor (ST-FE) and a channel enhanced module (CEM). Specifically, ST-FE captures potential connections between joint points via a GCN and utilizes a transformer to model cross-frame global correlations. In addition, CEM estimates the useful information in the channel using a similarity function, which can reduce information aggregation between neighboring frames owing to local differences.

Figure 1 illustrates the network structure. The observed and predicted sequences, after masking, are input into the network, and plausible 3D human motion sequences are generated by diffusion modeling. Specifically, the DDPM forward process combines the observed sequence $y_{hsy}$ and predicted sequence $y_{pre}$ in the same sample $G_{t}$ using a masking mechanism $M$ . The observed sequence $y_{hsy}$ is injected into each reverse step as guidance during training, and Gaussian noise is gradually injected into the predicted sequence $y_{pre}$ via a cosine noise scheduler. After $S$ noise-adding steps, the predicted sample $y_{t}$ is close to the Gaussian distribution $N (0, I)$ . The reverse process first embeds the diffusion step coding and position coding in the sample space. Then, the sample is gradually denoised by ST-FE and CEM, where ST-FE consists of a GCN and a transformer connected in series or in parallel, which together learn the spatial structure between joint points and the temporal correlation between all frames in the sample sequence. The series or parallel connection improves the model’s ability to extract spatio-temporal information. CEM assesses the weight of each frame ${\tilde{Y}}_{AGC}$ in the motion segment $Y_{AGC}$ to which it belongs using a similarity function. Next, an adaptive graph convolution is performed to update the information content of all the frames and enhance the temporal motion correlation. When a parallel connection is used, CEM acts on the transformer layer and progressively trains the denoiser to transform the noise into a reasonable prediction. Finally, we evaluated the model on two publicly available datasets, Human3.6M and HumanEva-I. The experimental results show that the proposed method achieves competitive performance on both datasets. Overall, the main contributions of this work can be summarized as follows:

We provide a powerful denoiser that fully utilizes transformer and adaptive GCN to exploit spatio-temporal information of noisy samples.

We present a new, practical, conditional diffusion-based model for 3D human motion prediction. Compared to previous work, the proposed model achieves high accuracy on the Human3.6M and HumanEva-I datasets.

We demonstrate the excellent performance of the proposed method through comprehensive experiments and affinity matrices.

Figure 1.

Network Structure.

The rest of the article is organized as follows. Section 2 introduces related works. Section 3 briefly introduces some related concepts of DDPM, masking mechanism, and position encoding. Section 4 formulates our motion prediction problem. Section 5 details the process of constructing our human motion prediction model and how to train our denoiser. Section 6 shows the experimental study of the proposed method. Finally, Section 7 summarizes this work.

2. Related Work

2.1. Deterministic Prediction

Deterministic prediction can be viewed as a regression task. Given past motion, predictions of future human motion produce the most likely single outcome. Most previous methods relied on RNNs for temporal modeling (Liu et al., 2022b), but their error accumulation and first-frame discontinuities limit long-term prediction. To address this, Bouazizi et al. (2022) proposed a multi-layer perceptron (MLP)-based architecture, employing spatial MLPs to capture joint dependencies and temporal MLPs to model temporal interactions, thereby avoiding RNN-induced error propagation. Building on transformer advancements in natural language processing (Vaswani et al., 2017), Aksan et al. (2021) developed a spatio-temporal transformer using self-attention mechanisms to model both spatial and temporal dimensions simultaneously, improving long-term modeling efficiency. GCNs further enhanced spatial reasoning by learning dynamic joint relationships (Dang et al., 2021; Mao et al., 2019; Zhong et al., 2022), with Zhong et al. (2022) integrating gated-neighborhood GCNs for adaptive spatio-temporal dependency modeling. Additionally, Cui et al. (2020) introduced dynamic graph learning to capture both physical and non-physical joint interactions, complementing traditional GCN approaches.

2.2. Stochastic Prediction

Stochastic prediction produces multiple predictions of future human motion given past motion, and its predictions are usually generated by generative models that produce a series of possible motions (Yan et al., 2018). Diversity is a key aspect in evaluating the quality of a motion generation task. GANs have excelled in image synthesis (Creswell et al., 2018) and motion modeling, enabling diverse, realistic motion sampling while addressing short-term prediction limitations (Hernandez et al., 2019) and absolute positional inaccuracies through adversarial training (Sigal et al., 2010). VAEs complement this via representation learning for probabilistic motion forecasting (Hernandez et al., 2019; Yan et al., 2018), though multi-network architectures are often required to capture subtle motion patterns. Based on previous methodology, DLow (Yuan & Kitani, 2020) pioneered reparameterized latent flows and diversity optimization objectives to generate diverse motion predictions, decoupling stochastic sampling from pose quality constraints via a shared deterministic network. Mao et al. (2021) introduced kinematic constraint propagation with inverse kinematics layers and temporal smoothness losses, addressing the trade-off between diversity and physical plausibility in long-term predictions. Xu et al. (2022) proposed multi-level spatio-temporal anchors and multi-scale temporal modeling, enabling structured reasoning for complex human–object interactions.

2.3. Denoising Diffusion Probabilistic Model

The diffusion model has become the state-of-the-art deep learning generative model and has shown great potential in many fields, such as computer vision and time series modeling. In particular, its performance on text-conditional image synthesis has struck awe in researchers and the public (Saharia et al., 2022). Its forward process destroys the original data by gradually adding noise through a Markov chain, whereas reverse denoising gradually restores the original data. Recently, diffusion models—inspired by their image generation breakthroughs (Ho et al., 2020; Wei et al., 2023)—have gained traction for human motion prediction (Ahn et al., 2023; Tashiro et al., 2021), offering superior probabilistic modeling compared to traditional likelihood-based methods. This study explores DDPM-based diffusion models (Ho et al., 2020) for human behavior prediction, leveraging their hierarchical denoising process to balance diversity and temporal coherence in long-term motion synthesis.

Researchers have attempted to use the powerful synthesis capabilities of diffusion models for behavioral prediction (Ahn et al., 2023; Wei et al., 2023). To take advantage of the correlations in temporal data, the conditional score-based diffusion model for imputation directly models the data distribution using observations as conditions (Tashiro et al., 2021). It uses self-supervised training to optimize the diffusion model and proposes a new time series imputation method using score-based diffusion models. Ahn et al. (2023) explored the competitiveness of using diffusion probability models for 3D motion prediction tasks. Their experimental results show that the diffusion model cannot completely replace existing techniques for both deterministic and stochastic motion prediction tasks. However, they found the diffusion model promising, as it was valid for both types of predictions after a single training process and was able to appropriately trade off between context and the need for diverse motion sampling.

Subsequently, HumanMAC (Chen et al., 2023) combined masked motion completion with transformer architectures to mitigate error accumulation in long-term prediction, establishing a foundation for spatiotemporal consistency. Building on this, BeLFusion (Barquero et al., 2023) and MCLDN (Gao et al., 2024) expanded conditional generation by integrating behavioral semantics and scene-aware priors, respectively, enabling contextually grounded motion synthesis. Tian et al. (2024) further unified transformer-based temporal attention with diffusion steps, improving both fidelity and inference efficiency. To address the diversity-quality trade-off, Yu et al. (2024) introduced contrastive language-image pre-training-guided latent space regularization to avoid mode collapse, while CoMusion (Sun & Chowdhary, 2024) harmonized stochastic diversity with temporal smoothness through trajectory clustering and consistency constraints. Finally, Curreli et al. (2025) redefined the noise modeling paradigm by introducing direction-aware covariance matrices, resolving biomechanical implausibility in 3D joint rotations, a critical limitation in prior isotropic diffusion frameworks.

3. Background Knowledge

3.1. Denoising Diffusion Probabilistic Model

DDPM is an unconditional generative model, and the entire modeling process consists of a forward process and a reverse process based on Markov chains, learning a model distribution $p_{θ} (x_{0})$ that approximates $q (x_{0})$ . Given sample data $x_{0} \sim q (x_{0})$ , the forward process destroys the original structure of the data by gradually injecting Gaussian noise into the sample data and transforms it into a simple prior distribution. The reverse process progressively removes noise and learns to restore the corrupted data structure. The latent variables $x_{s}$ , $s = 1, 2, \dots, S$ , lie in the same sample space $χ$ as $x_{0}$ . Specifically, the $S$ -step forward diffusion process of DDPM is given by the following equation:

\begin{aligned} q (x_{1}, \dots, x_{S} | x_{0}) & = \prod_{s = 1}^{S} q (x_{s} | x_{s - 1}) \end{aligned}

(1)

\begin{aligned} q (x_{s} | x_{s - 1}) & = N (\sqrt{1 - β_{s}} x_{s - 1}, β_{s} I) \end{aligned}

(2)

where $q (x_{s} | x_{s - 1})$ follows a Gaussian distribution and $β_{s}$ is a constant value for the noise level. Because the diffusion process follows a Markov chain and the noise injected at each step is independent, the distribution of $x_{s}$ given $x_{0}$ also follows a Gaussian distribution. $x_{s}$ diffuses from $x_{0}$ as shown in the following equation:

x_{s} = \sqrt{α_{s}} x_{0} + \sqrt{1 - α_{s}} ε, ε \sim N (0, I)

(3)

where ${\hat{α}}_{s} = 1 - β_{s}$ and $α_{s} = \prod_{i = 1}^{s} {\hat{α}}_{i}$ . Reverse process: given the diffused sample $x_{s}$ , the reverse process gradually removes the noise from it through a learnable Markov chain, which can be viewed as the reverse propagation $(x_{s}, x_{s - 1}, \dots, x_{0})$ of the forward process, as shown in the following equations:

\begin{aligned} p_{θ} (x_{0 : s - 1} | x_{s}) & = p (x_{s}) \prod_{s = 1}^{S} p_{θ} (x_{s - 1} | x_{s}) \end{aligned}

(4)

\begin{aligned} p_{θ} (x_{s - 1} | x_{s}) & = N (x_{s - 1}; μ_{θ} (x_{s}, s), σ^{2} (s) I) \end{aligned}

(5)

where $p (x_{s}) \sim N (0, I)$ . To obtain $μ_{θ}$ , Ho et al. (2020) suggested fixing the variance in DDPM to $σ^{2} (s) = \frac{1 - α_{s - 1}}{1 - α_{s}} β_{s}$ , while $μ_{θ}$ is parameterized by $θ$ , so that $x_{s - 1} \sim p_{θ} (x_{s - 1} | x_{s})$ is drawn as given below:

\begin{aligned} μ_{θ} (x_{s}, s) & = \frac{1}{\sqrt{{\hat{α}}_{s}}} (x_{s} - \frac{β_{s}}{\sqrt{1 - α_{s}}} ε_{θ} (x_{s}, s)) \end{aligned}

(6)

\begin{aligned} x_{s - 1} & = μ_{θ} (x_{s}, s) + σ (s) z, z \sim N (0, I) \end{aligned}

(7)

During the training process, $ε_{θ} (x_{s}, s)$ is modeled by a neural network that learns to predict the noise added progressively during the forward process, which is the noise to be removed at each reverse step. The corresponding loss function is shown in the following equation:

\begin{aligned} L (θ) = {‖ ε - ε_{θ} (x_{s}, s) ‖}^{2} & = {‖ ε - ε_{θ} (\sqrt{α_{s}} x_{0} + \sqrt{1 - α_{s}} ε, s) ‖}^{2} \end{aligned}

(8)
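As a minimal sketch of equations (3) and (8), the following NumPy code draws $x_{s}$ directly from $x_{0}$ and evaluates the noise-prediction loss for one sample. The linear schedule, the array shape, and the `zero_model` placeholder are illustrative assumptions that do not come from the paper; in practice `eps_model` would be the trained denoising network.

```python
import numpy as np

def q_sample(x0, s, alpha_bar, eps):
    """Equation (3): sample x_s from x_0 in one shot.

    alpha_bar[s - 1] holds the cumulative product alpha_s (1-based step s).
    """
    a_s = alpha_bar[s - 1]
    return np.sqrt(a_s) * x0 + np.sqrt(1.0 - a_s) * eps

def noise_prediction_loss(eps_model, x0, s, alpha_bar, rng):
    """Equation (8) for one sample: squared error between true and predicted noise."""
    eps = rng.standard_normal(x0.shape)
    x_s = q_sample(x0, s, alpha_bar, eps)
    return float(np.sum((eps - eps_model(x_s, s)) ** 2))

def zero_model(x_s, s):
    """Placeholder for eps_theta (assumption): always predicts zero noise."""
    return np.zeros_like(x_s)

# Illustrative linear schedule (assumption, not the scheduler used in the paper).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)      # alpha_s = prod_i (1 - beta_i)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((100, 17, 3))   # hypothetical (frames, joints, xyz)
loss = noise_prediction_loss(zero_model, x0, 500, alpha_bar, rng)
```

With the zero placeholder the loss is simply the squared norm of the injected noise, which gives a baseline that a trained denoiser should improve on.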

3.2. Diffusion Model Based on Observation Series

Conditional score-based diffusion modeling is devoted to time series imputation with diffusion models. Conditional diffusion models allow us to utilize useful information from the observed series to perform accurate imputation. To this end, the observed sequence $x_{h}$ is added as a condition to equations (4) and (5),

\begin{aligned} p_{θ} (x_{0 : s - 1} | x_{s}) & = p (x_{s}) \prod_{s = 1}^{S} p_{θ} (x_{s - 1} | x_{s}, x_{h}) \end{aligned}

(9)

\begin{aligned} p_{θ} (x_{s - 1} | x_{s}, x_{h}) & = N (x_{s - 1}; μ_{θ} (x_{s}, s | x_{h}), σ^{2} (s) I) \end{aligned}

(10)

where $μ_{θ} (x_{s}, s | x_{h})$ , compared to $μ_{θ} (x_{s}, s)$ , adds observation conditions to guide model learning. The noise predictor $ε_{θ} (x_{s}, s)$ is replaced with $ε_{θ} (x_{s}, s | x_{h})$ , which is still modeled by a neural network, and the loss function in equation (8) still applies.

3.3. Mask Complete

During the training process, to enhance the computational efficiency of the model, the observed sequence $x_{h}$ and predicted sequence $x_{s}$ are integrated into the same sample sequence as in equation (11) through a masking mechanism so that they have the same spatial dimensions.

\begin{aligned} G_{t} & = [x_{h}, x_{s}] \end{aligned}

(11)

\begin{aligned} x_{h}^{mask} & = M ⊙ G_{t} \end{aligned}

(12)

\begin{aligned} x_{s}^{mask} & = (1 - M) ⊙ G_{t} \end{aligned}

(13)

where $M$ is the mask matrix and $⊙$ denotes the Hadamard product. The model loss function after adding the mask is rewritten as equation (14).

\begin{aligned} L (θ) & = {‖ ε - ε_{θ} (x_{s}^{mask}, s | x_{h}^{mask}) ‖}^{2} \\ & = {‖ ε - ε_{θ} (\sqrt{α_{s}} x_{0}^{mask} + \sqrt{1 - α_{s}} ε, s | x_{h}^{mask}) ‖}^{2} \end{aligned}

(14)

Compared to equation (8), this equation makes full use of the favorable information in the observed data to accurately calculate and reconstruct the predicted data.
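The masking in equations (11) to (13) can be illustrated with a short NumPy sketch. The frame counts follow the 25 observed and 100 predicted frames quoted in the experiments, the mask layout (ones on observed frames, zeros on future frames) follows the definition of $M$ in Section 4, and the joint count `N = 17` and the random poses are hypothetical stand-ins.

```python
import numpy as np

T, L, N = 25, 100, 17                  # N is a hypothetical joint count
rng = np.random.default_rng(0)
y_obs = rng.standard_normal((T, N, 3))   # stand-in for the observed poses
y_fut = rng.standard_normal((L, N, 3))   # stand-in for the future poses

# Equation (11): place both parts in one sample along the time axis.
G = np.concatenate([y_obs, y_fut], axis=0)          # (T + L, N, 3)

# Mask with ones on the T observed frames and zeros on the L future frames;
# the trailing singleton axes let it broadcast over joints and coordinates.
M = np.zeros((T + L, 1, 1))
M[:T] = 1.0

x_h_mask = M * G                # equation (12): observed part, future zeroed
x_s_mask = (1.0 - M) * G        # equation (13): future part, observed zeroed
```

Both masked tensors keep the full $(T + L) \times N \times 3$ shape, which is what lets the denoiser consume the condition and the noisy target in a single sample.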

3.4. Positional Coding

Vaswani et al. (2017) proposed positional coding that applies to periodic data in a transformer model, where the key idea is to encode words as vector space representations based on their position and context in the text,

PE (pos, i) = \begin{cases} \sin (pos / 10000^{2 i / d}) \\ \cos (pos / 10000^{2 i / d}) \end{cases}

(15)

where $0 < i < d$ , $d$ denotes the coded feature dimension, $pos$ is the position, and sine and cosine functions are used to encode the even and odd feature dimensions, respectively. As the sine and cosine functions are periodic, the encoding at position $pos + k$ can be represented as a linear transformation of the encoding at position $pos$ , where $k$ denotes the sequence offset.
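A small NumPy sketch of the sinusoidal encoding in the form given by Vaswani et al. (2017) is shown below. It uses 0-based feature indices, so sine fills indices 0, 2, 4, ... and cosine fills indices 1, 3, 5, ...; the function name and the assumption that `d` is even are illustrative choices.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d):
    """Return a (num_positions, d) table of sinusoidal positional codes.

    Assumes d is even. Each sine/cosine pair shares one frequency.
    """
    pos = np.arange(num_positions)[:, None]        # (P, 1)
    two_i = np.arange(0, d, 2)[None, :]            # (1, d/2): 0, 2, 4, ...
    angle = pos / np.power(10000.0, two_i / d)     # pos / 10000^(2i/d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_encoding(125, 8)   # e.g. one code per frame of a 125-frame sample
```

The linear-shift property mentioned above follows from the angle-addition identity: for each frequency, the sine entry at `pos + k` equals the sine entry at `pos` times `cos(w * k)` plus the cosine entry at `pos` times `sin(w * k)`.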

4. Problem Formulation

Given an observed pose sequence of $T$ frames, $Y_{hsy} = \{y_{1}, \dots, y_{T}\}$ , the task is to predict the 3D human motion poses of the future $L$ frames, $Y_{pre} = \{y_{T + 1}, \dots, y_{T + L}\} \in R^{L \times N \times 3}$ , where $y_{t}$ denotes the 3D pose at moment $t$ and $N$ denotes the number of 3D joint points. Following Chen et al. (2023), we construct the sample space with the masking operation $G_{t} = [Y_{hsy}, Y_{pre}]$ . The observed sequence $Y_{hsy}^{mask} = M ⊙ G_{t} \in R^{(T + L) \times N \times 3}$ and predicted sequence $Y_{pre}^{mask} = (1 - M) ⊙ G_{t} \in R^{(T + L) \times N \times 3}$ are obtained by masking, where the mask is $M = [\underbrace{1, \dots, 1}_{T}, \underbrace{0, \dots, 0}_{L}]$ . The forward process of the model starts from $Y_{pre}^{0} \Leftarrow Y_{pre}^{mask}$ and gradually injects Gaussian noise to obtain $Y_{pre}^{s}$ . This process is given by the following equation:

Y_{pre}^{s} = \sqrt{α_{s}} Y_{pre}^{0} + \sqrt{1 - α_{s}} ε, ε \sim N (0, I)

(16)

However, Nichol and Dhariwal (2021) found that the linear noise scheduler is not suitable for low resolution images. Therefore, we use the cosine noise scheduler, as in equation (17), which proves to be useful for low resolution data and is gentler than the linear noise scheduler when adding noise,

α_{s} = \frac{f (s)}{f (0)}, f (s) = \cos {(\frac{s / S + n}{1 + n} \cdot \frac{π}{2})}^{2}

(17)

where $n$ is a small offset that prevents $β_{s}$ from becoming too small near $s = 0$ and is set to $n = 0.008$ . After the forward process has added Gaussian noise $S$ times, the reverse process gradually removes the noise, as in equation (18), to obtain an initial prediction sample.

p_{θ} (Y_{pre}^{0 : s - 1} | Y_{pre}^{s}) = p (Y_{pre}^{s}) \prod_{s = 1}^{S} p_{θ} (Y_{pre}^{s - 1} | Y_{pre}^{s}, Y_{hsy}^{mask})

(18)
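The cosine scheduler in equation (17) can be sketched in NumPy as follows. The number of steps `S = 100` is a hypothetical value, since the step count is not stated here, and clipping the per-step noise level at 0.999 is the safeguard described by Nichol and Dhariwal (2021) rather than something specified in this article.

```python
import numpy as np

def cosine_alpha_bar(S, n=0.008):
    """Cumulative alpha_s for s = 0..S from equation (17)."""
    s = np.arange(S + 1)
    f = np.cos((s / S + n) / (1.0 + n) * np.pi / 2.0) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step beta_s = 1 - alpha_s / alpha_{s-1}, clipped near s = S."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

S = 100                                   # hypothetical number of diffusion steps
alpha_bar = cosine_alpha_bar(S)           # alpha_bar[0] = 1, decays towards 0
betas = betas_from_alpha_bar(alpha_bar)   # beta_1 .. beta_S
```

The resulting `alpha_bar[s]` can be plugged into equation (16) to noise the masked future sequence; compared with a linear schedule, the noise level grows slowly at the start and only approaches its cap in the last few steps.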

The denoising network consists of ST-FE and CEM as shown in Figures 2 and 3, which gradually learns the spatial structure and temporal information of the data from the corrupted samples and predicts the noise of the forward process $ε_{θ} (x_{s}, s | x_{h}) = ε_{θ} (Y_{pre}^{s}, s | Y_{hsy}^{mask})$ . The network loss function is given by the following equation:

L (θ) = {‖ ε - ε_{θ} (Y_{pre}^{s}, s | Y_{hsy}^{mask}) ‖}^{2}

(19)

Figure 2.

Spatio-Temporal Feature Extractor (ST-FE) Structure.

Figure 3.

Channel Enhanced Module (CEM) Structure.

After training, we sample using equation (20), gradually denoising from $Y_{pre}^{s}$ to obtain the sample $Y_{pre}^{0}$ ,

Y_{pre}^{s - 1} = μ_{θ} (Y_{pre}^{s}, s | Y_{hsy}^{mask}) + σ (s) z, z \sim N (0, I)

(20)

where $Y_{pre}^{s}$ is obtained by adding Gaussian noise $s$ times to $Y_{pre}^{mask}$ , $Y_{pre}^{s} \sim N (0, I)$ , and $μ_{θ} (Y_{pre}^{s}, s | Y_{hsy}^{mask})$ is defined by equation (6) using the noise estimate $ε_{θ} (Y_{pre}^{s}, s | Y_{hsy}^{mask})$ .
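The sampling procedure in equation (20) can be sketched as a loop over reverse steps. This is a simplified illustration under stated assumptions: `denoiser` is a hypothetical callable standing in for $ε_{θ}$, only the future frames are updated (the observed frames are copied back at the end), and the variance is the fixed $σ^{2} (s)$ from Section 3.1.

```python
import numpy as np

def reverse_step(y_s, s, eps_pred, betas, rng):
    """One step of equation (20), with the mean taken from equation (6)."""
    alpha_hat = 1.0 - betas                  # per-step \hat{alpha}_s
    alpha_bar = np.cumprod(alpha_hat)        # cumulative alpha_s
    a_s = alpha_bar[s - 1]
    a_prev = alpha_bar[s - 2] if s > 1 else 1.0   # alpha_0 = 1 by convention
    mean = (y_s - betas[s - 1] / np.sqrt(1.0 - a_s) * eps_pred) / np.sqrt(alpha_hat[s - 1])
    sigma = np.sqrt((1.0 - a_prev) / (1.0 - a_s) * betas[s - 1])
    return mean + sigma * rng.standard_normal(y_s.shape)

def sample_future(denoiser, y_obs_mask, M, betas, rng):
    """Denoise from pure noise down to step 0, conditioned on the observation.

    y_obs_mask: (T + L, N, 3) with zeros on future frames; M: 1 on observed frames.
    """
    y = (1.0 - M) * rng.standard_normal(y_obs_mask.shape)   # noise on future frames
    for s in range(len(betas), 0, -1):
        eps_pred = denoiser(y, s, y_obs_mask)                # eps_theta(Y^s, s | Y_hsy)
        y = (1.0 - M) * reverse_step(y, s, eps_pred, betas, rng)
    return y_obs_mask + y                                    # observed + sampled future
```

Calling `sample_future` several times with different random states yields different futures for the same observation, which is how a diffusion sampler produces diverse predictions.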

5. ST-FE and CEM Denoiser

A sequence of human motion can be viewed as the movements of 3D joint points. The spatial characteristics of the motion were obtained by exploiting the correlation between the joint points, and we used a neural network with spatio-temporal information extraction capability to model $ε_{θ} (Y_{pre}^{s}, s | Y_{hsy}^{mask})$ . ST-FE processes a sample sequence injected with noise by combining the transformer and GCN layers in a serial or parallel way. The process of learning the temporal correlation and spatial structure is shown in Figure 2; the CEM evaluates the volume of useful information contained in each frame to enhance the temporal motion correlation, which is structured as shown in Figure 3.

Denoising in series: The sample sequence $Y_{input} = concate [Y_{pre}^{s}, Y_{hsy}^{mask}] \in R^{(T + L) \times N \times 3}$ is input into the model. Gaussian noise is gradually injected into $Y_{pre}^{s}$ through the forward process in equation (16), and $Y_{hsy}^{mask}$ is the observed sequence that guides model training. After $S$ noise-adding steps, $Y_{pre}^{s}$ is close to a Gaussian distribution. In the reverse denoising process, a learnable encoding is computed for the diffusion step $s$ and embedded in the spatial features of $Y_{input}$ via the ST-FE layer. In addition, the relative positional structure between the joints is enhanced by embedding positional coding for each joint. The GCN captures the spatial structure between the joints and refines the joint features, allowing the model to learn more useful information, as shown in the following equation:

Y_{G} = GCN (Y_{input} + PE (pos, i))

(21)

where $Y_{G} = concate [Y_{pre}^{s}, Y_{hsy}^{mask}] \in R^{(T + L) \times N \times d}$ , $0 < i < d$ , and $3 < d$ is the dimension of the joint point after GCN. To learn the temporal features of the samples, the output of the GCN is passed through a linear layer and transposed, $Y_{TF} = Linear (Y_{G})^{T} \in R^{(3 \times N) \times (T + L)}$ . Similarly, embedding the learnable coding for diffusion step $s$ into the sequence features, together with the positional coding of the time series, improves the ability of the model to capture the positional information of each frame. The final output is obtained by cross-frame learning in the transformer layer,

Y_{TF} = F (Y_{TF} + PE (pos, m))

(22)

where $0 < m < (T + L)$ , $Y_{TF} \in R^{H \times N \times 3}$ , $H > T + L$ , and $F ()$ denotes the transformer layer. Evaluating the information contained in each frame by CEM reduces the impact of local difference information between two neighboring frames on global motion changes. $Y_{TF}$ is passed through a linear layer and transposed to $Y_{AGC} = Linear (Y_{TF})^{T} \in R^{3 \times (T + L) \times N}$ and grouped into motion regions, where adjacent $G$ frames are considered as a group. Following Liu et al. (2022a), a similarity function is used to evaluate the amount of information contained in each frame in the group to which it belongs, as shown in the following equation:

T (Y_{AGC}) = Y_{AGC} \cdot (I - ϕ (Y_{AGC}, {\tilde{Y}}_{AGC}))

(23)

where $T (Y_{AGC}) \in R^{3 \times (T + L) \times N}$ and ${\tilde{Y}}_{AGC} = \frac{1}{G} \sum_{i = t}^{t + G} {Y_{AGC}}_{i}$ is the average feature of $Y_{AGC}$ in the sequence range $[t, t + G]$ . The similarity function is calculated by the following equation:

ϕ_{t} = \frac{Y_{AGC}^{t} \cdot {\tilde{Y}}_{AGC}^{t}}{\sqrt{\sum_{i = 1}^{N} {(Y_{AGC}^{t} (i))}^{2}} \times \sqrt{\sum_{j = 1}^{N} {({\tilde{Y}}_{AGC}^{t} (j))}^{2}}}, \quad t \in [1, T]

(24)
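Equations (23) and (24) amount to a cosine similarity between each frame and the average of its group, followed by a reweighting of the frame. The sketch below is one possible reading, not the authors' implementation: it assumes non-overlapping groups of `G` consecutive frames with the frame count divisible by `G`, and it computes the similarity over the joint axis separately for each coordinate channel.

```python
import numpy as np

def cem_frame_weighting(Y, G, eps=1e-8):
    """Reweight frames by (1 - cosine similarity to their group average).

    Y: (C, F, N) features with C channels, F frames, and N joints.
    Returns the reweighted features and the similarity phi of shape (C, F).
    """
    C, F, N = Y.shape
    groups = Y.reshape(C, F // G, G, N)
    mean = groups.mean(axis=2, keepdims=True)                   # group average
    mean = np.broadcast_to(mean, groups.shape).reshape(C, F, N)
    dot = np.sum(Y * mean, axis=-1)
    norm = np.linalg.norm(Y, axis=-1) * np.linalg.norm(mean, axis=-1)
    phi = dot / (norm + eps)                                    # equation (24)
    return Y * (1.0 - phi)[..., None], phi                      # equation (23)

rng = np.random.default_rng(0)
Y_agc = rng.standard_normal((3, 125, 17))   # hypothetical (3, T + L, N) features
weighted, phi = cem_frame_weighting(Y_agc, G=5)
```

Frames that closely match their group average receive a weight near zero, so the output emphasizes frames that deviate from their local neighborhood.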

Subsequently, an adaptive graph convolution is performed for each frame as shown in the following equation:

Y = σ (\sum_{k = 0}^{K_{v}} ({(\tilde{A} + P \tilde{A} + E \tilde{A})}^{k} T (Y_{AGC})) ⊙ ω_{k} (θ))

(25)

where $ω_{k} (θ)$ is parameterized by $θ$ , $\tilde{A} \in R^{N \times N}$ is a normalized adjacency matrix, and $P \tilde{A} \in R^{N \times N}$ is a parametric representation of indirect connections between points, so that non-adjacent vertices in the skeleton graph create dependencies as the network deepens and correlations between joint points are learned via the network. $E \tilde{A} = soft ({(φ_{1} (Y_{AGC} \cdot ω_{φ_{1}}))}_{N \times d (T + L)}^{T} \cdot {(φ_{1} (Y_{AGC} \cdot ω_{φ_{2}}))}_{d (T + L) \times N}) \in R^{N \times N}$ is used to calculate the strength of the correlation between two joint points, $K_{v} = 3$ , and $ω_{φ_{1}}$ and $ω_{φ_{2}}$ are the learnable parameters. The same structure $ε_{θ} \in R^{L \times N \times 3}$ as the predicted sequence is then extracted from $Y$ as a result of neural network modeling, and the prediction noise $ε_{θ} \in R^{L \times N \times 3}$ is used for denoising $Y_{pre}^{s}$ .
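To make the three adjacency terms in equation (25) concrete, the following NumPy sketch builds a fixed normalized skeleton graph, a learned offset matrix, and a data-dependent affinity, then sums them. It is only an illustration of the general adaptive graph convolution idea: the symmetric normalization with self-loops, the reading of "soft" as a row-wise softmax, the five-joint chain, and all weight shapes are assumptions.

```python
import numpy as np

def normalized_adjacency(edges, N):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a skeleton graph."""
    A = np.eye(N)                                   # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d = A.sum(axis=1)
    return A / np.sqrt(np.outer(d, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)         # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_adjacency(A_norm, P, X, W1, W2):
    """Sum of fixed graph, learned offsets P, and data-dependent affinity E.

    X: (N, D) per-joint features; W1, W2: (D, H) hypothetical embedding weights.
    """
    E = softmax((X @ W1) @ (X @ W2).T, axis=-1)     # (N, N) joint-to-joint affinity
    return A_norm + P + E

N, D, H = 5, 8, 4
rng = np.random.default_rng(0)
A_norm = normalized_adjacency([(0, 1), (1, 2), (2, 3), (3, 4)], N)   # toy chain
P = np.zeros((N, N))                       # learned in practice, zero-initialized here
X = rng.standard_normal((N, D))
B = adaptive_adjacency(A_norm, P, X, rng.standard_normal((D, H)), rng.standard_normal((D, H)))
Y_next = np.maximum(0.0, B @ X)            # one propagation step with a ReLU
```

Because `P` and the affinity term are dense, joints that are not physically connected can still exchange information in a single propagation step.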

Denoising in parallel: The difference from series denoising is that the spatio-temporal features of the noise samples are processed in parallel. The noise sample $Y_{input} = concate [Y_{pre}^{s}, Y_{hsy}^{mask}] \in R^{(T + L) \times N \times 3}$ is input into the GCN layer and transformer layer in ST-FE as in equations (26) and (27), respectively,

\begin{aligned} Y_{G} & = GCN (Y_{input}) + PE (pos, i) \end{aligned}

(26)

\begin{aligned} Y_{TF} & = Linear (F ({Y_{input}}^{T} + PE (pos, m))) \end{aligned}

(27)

where $0 < i < d$ , $3 < d$ is the dimension of the joint point after GCN, $Y_{G} \in R^{(T + L) \times N \times d}$ , $0 < m < (T + L)$ , $Y_{TF} \in R^{H \times N \times d}$ , and $F ()$ is the transformer layer. Similarly, $Y_{TF} \in R^{N \times 3 \times H}$ passes through the linear and CEM layers as shown in the following equation:

Y_{AGC} = σ (\sum_{k = 0}^{K_{v}} ((\tilde{A} + P \tilde{A} + E \tilde{A})^{k} T ({(Y_{TF})}^{T})) ⊙ ω_{k} (θ))

(28)

where $Y_{AGC} \in R^{3 \times (T + L) \times N}$ . The temporal features and spatial information of the sample sequence are then merged as in equation (29), and finally the spatio-temporal features of the sample are fused by a 2D convolution, as in equation (30).

\begin{aligned} Y_{out} & = concate [Linear (Y_{G})^{T}, {Y_{AGC}}^{T}] \in R^{2 \times (T + L) \times N \times 3} \end{aligned}

(29)

\begin{aligned} Y & = conv2D (Y_{out}) \in R^{(T + L) \times N \times 3} \end{aligned}

(30)

Similarly, the same structure $ε_{θ} \in R^{L \times N \times 3}$ as the predicted sequence is then extracted from $Y$ as a result of neural network modeling, and the prediction noise $ε_{θ} \in R^{L \times N \times 3}$ is used for denoising $Y_{pre}^{s}$ .

6. Experiment

We validated the model on the prediction task. Covering the process from injecting noise into the original samples to predicting the injected noise, we provide specific details of the experiments and tests performed on each dataset and the results of the model’s prediction of human behavioral diversity.

6.1. Dataset and Metrics

Dataset. Human3.6M is the largest benchmark dataset for human behavior recognition and prediction, containing 3.6 million poses with seven actors performing 15 categories of motion each (Ionescu et al., 2013). The motions executed by five actors were used as the training set, on which the model was trained, and the motions of the other two actors were used as the test and validation sets. The original 3D pose skeleton in the dataset consists of 32 joints, although previous studies have represented the skeleton in different ways. The motion sequences were sampled at 50 fps, and the 3D joint points of the human body were used as experimental objects. The HumanEva-I dataset is smaller; it consists of three subjects captured at 60 Hz. Each subject performs five motions, and the motions are represented by 15 joints (Sigal et al., 2010). Following Yuan and Kitani (2020), the input observed sequence has 25 frames and the output predicted sequence has 100 frames; they also suggested measuring various metrics, such as the probability and diversity of the predicted sequence.

Evaluation Metrics. For a fair comparison, we measure the diversity and accuracy of the predictions using the same evaluation metrics as Barquero et al. (2023), Chen et al. (2023), Mao et al. (2021), Tian et al. (2024), and Yuan and Kitani (2020):

(1) Average pairwise distance (APD): the average $ℓ_{2}$ distance between all pairs of predictions:

APD = \frac{1}{K (K - 1)} \sum_{i = 1}^{K} \sum_{j \neq i}^{K} ‖ {\hat{Y}}_{i} - {\hat{Y}}_{j} ‖_{2}

(31)

This metric measures the diversity of the results.

(2) Average displacement error (ADE) and final displacement error (FDE): ADE calculates the average $ℓ_{2}$ distance over time between the ground truth and the closest prediction:

ADE = \frac{1}{T_{p}} min_{k} ‖ {\hat{Y}}_{k} - Y ‖_{2}

(32)

FDE calculates the $ℓ_{2}$ distance of the last frame between the ground truth and the closest prediction:

FDE = min_{k} ‖ {\hat{Y}}_{k} [T_{p}] - Y [T_{p}] ‖_{2}

(33)

(3) Multi-modal average displacement error (MMADE) and multi-modal final displacement error (MMFDE): The MMADE metric measures the average displacement error between the predictions and the multi-modal ground truth:

MMADE = \frac{1}{N T_{p}} \sum_{n = 1}^{N} min_{k} ‖ {\hat{Y}}_{k} - Y_{n} ‖_{2}

(34)

and the MMFDE metric focuses on the final displacement error between the predictions and the multi-modal ground truth:

MMFDE = \frac{1}{N} \sum_{n = 1}^{N} min_{k} ‖ {\hat{Y}}_{k} [T_{p}] - Y_{n} [T_{p}] ‖_{2}

(35)
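The five metrics can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the evaluation code used in the cited works: the function names and the toy data are hypothetical, each sequence is assumed to be a list of frames with the joint coordinates of a frame flattened into one vector, and ADE is read as the time average of per-frame $ℓ_{2}$ distances, following the verbal description of equation (32).

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length flat vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flatten(seq):
    """Concatenate all frames of a sequence into one flat vector."""
    return [v for frame in seq for v in frame]


def apd(preds):
    """Eq. (31): mean l2 distance over all ordered pairs i != j (needs K >= 2)."""
    k = len(preds)
    flat = [flatten(p) for p in preds]
    total = sum(l2(flat[i], flat[j]) for i in range(k) for j in range(k) if i != j)
    return total / (k * (k - 1))


def ade(preds, gt):
    """Eq. (32): time-averaged per-frame l2 error of the closest prediction."""
    return min(sum(l2(p[t], gt[t]) for t in range(len(gt))) / len(gt) for p in preds)


def fde(preds, gt):
    """Eq. (33): last-frame l2 error of the closest prediction."""
    return min(l2(p[-1], gt[-1]) for p in preds)


def mmade(preds, gts):
    """Eq. (34): ADE averaged over the multi-modal ground truths."""
    return sum(ade(preds, g) for g in gts) / len(gts)


def mmfde(preds, gts):
    """Eq. (35): FDE averaged over the multi-modal ground truths."""
    return sum(fde(preds, g) for g in gts) / len(gts)


# Toy data: K = 2 predictions, T_p = 2 frames, one joint (3 coordinates).
preds = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]],
]
gt = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
gts = [gt, [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]]

print(apd(preds))         # 5.0
print(ade(preds, gt))     # 1.5
print(fde(preds, gt))     # 3.0
print(mmade(preds, gts))  # 0.75
print(mmfde(preds, gts))  # 1.5
```

Note that the closest prediction is selected independently for ADE and FDE, so the two minima may come from different samples.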

6.2. Implementation Details

All experiments were implemented in the PyTorch deep learning framework. The self-attention module of the transformer in ST-FE uses eight attention heads, each with 64 dimensions; the CEM considers three consecutive frames as a group; the value of $K_{v}$ in the adaptive graph convolution is 2; the diffusion step is set as $[0, 20]$; and noise is added using a cosine noise scheduler. To train the model to predict the noise added in the forward process, we set the batch size to 512 and used the Adam optimizer with a learning rate of 0.0001. During the validation phase, the model generated diverse predictions for a total of 5,167 motion instances in the validation set, with 50 predicted motion sequences per instance. Obtaining the resulting 258,350 predicted sequences took 233.61 s in total. On average, the model spent 45.212 ms per motion instance to obtain its 50 possible predicted sequences and calculate the corresponding multimodal metrics.
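For reference, a cosine noise schedule of the kind proposed by Nichol and Dhariwal (2021) can be sketched as follows. This is a hedged illustration, not the training code of this study: the offset `s = 0.008` and the clipping value `0.999` are the defaults from that work, the function names are hypothetical, and reading the diffusion step range $[0, 20]$ as 20 noising steps is an assumption.

```python
import math
import random


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal rate alpha_bar_t for t = 0..num_steps."""
    def f(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(num_steps + 1)]


def cosine_betas(num_steps, s=0.008, max_beta=0.999):
    """Per-step noise variances beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped."""
    ab = cosine_alpha_bar(num_steps, s)
    return [min(1.0 - ab[t] / ab[t - 1], max_beta) for t in range(1, num_steps + 1)]


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    return [a * x + b * e for x, e in zip(x0, eps)], eps


NUM_STEPS = 20  # assumed reading of the diffusion step range [0, 20]
alpha_bar = cosine_alpha_bar(NUM_STEPS)
betas = cosine_betas(NUM_STEPS)

rng = random.Random(0)
x0 = [1.0, 2.0, 3.0]  # toy stand-in for flattened joint coordinates
x_mid, eps_mid = add_noise(x0, NUM_STEPS // 2, alpha_bar, rng)
print(round(alpha_bar[0], 3), round(alpha_bar[NUM_STEPS // 2], 3), betas[-1])
```

The returned `eps` is the target that a noise-prediction denoiser is trained to recover from the noisy sample and the step index.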

6.3. Qualitative Results

Stochastic Prediction. The model was tested on predicting the future behavioral diversity of the human body. The results on the Human3.6M dataset are shown in Table 1, where bold indicates the best result and underlining indicates the second-best result. We evaluated the model using the metrics adopted by Barquero et al. (2023), Chen et al. (2023), Mao et al. (2021), Sun and Chowdhary (2024), Tian et al. (2024), and Yuan and Kitani (2020): average pairwise distance (APD), average displacement error (ADE), final displacement error (FDE), multi-modal average displacement error (MMADE), and multi-modal final displacement error (MMFDE). Our model demonstrates competitive diversity: the series variant attains a higher APD than HumanMAC, BeLFusion, Transfusion, and CoMusion. However, it exhibits notable limitations in accuracy. Specifically, while both variants achieve lower MMADE and MMFDE than Transfusion (Tian et al., 2024), their ADE and, most notably, their FDE are worse than those of these four baselines.

Table 1.
Experimental Results for the Dataset Human3.6M.

Metrics	APD $↑$	ADE $↓$	FDE $↓$	MMADE $↓$	MMFDE $↓$
DLow (Yuan & Kitani, 2020)	11.741	0.425	0.518	0.495	0.531
GSPS (Mao et al., 2021)	14.757	0.389	0.496	0.476	0.525
HumanMAC (Chen et al., 2023)	6.301	0.369	0.480	0.509	0.545
BeLFusion (Barquero et al., 2023)	7.602	0.372	0.474	0.473	0.507
Transfusion (Tian et al., 2024)	5.975	0.358	0.468	0.506	0.539
CoMusion (Sun & Chowdhary, 2024)	7.632	0.350	0.458	0.494	0.506
Our (series)	8.608	0.410	0.766	0.496	0.522
Our (parallel)	6.570	0.382	0.761	0.491	0.529

APD = average pairwise distance; ADE = average displacement error; FDE = final displacement error; MMADE: multi-modal average displacement error; MMFDE: multi-modal final displacement error.
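The best and second-best entries per column of Table 1 can be recomputed from the reported numbers. The short script below is only a reading aid: the values are copied from Table 1, and the names `TABLE_1`, `METRICS`, `HIGHER_IS_BETTER`, and `top_two` are hypothetical.

```python
METRICS = ("APD", "ADE", "FDE", "MMADE", "MMFDE")
HIGHER_IS_BETTER = {"APD"}  # the remaining metrics are errors (lower is better)

# Values copied from Table 1 (Human3.6M), in the column order of METRICS.
TABLE_1 = {
    "DLow": (11.741, 0.425, 0.518, 0.495, 0.531),
    "GSPS": (14.757, 0.389, 0.496, 0.476, 0.525),
    "HumanMAC": (6.301, 0.369, 0.480, 0.509, 0.545),
    "BeLFusion": (7.602, 0.372, 0.474, 0.473, 0.507),
    "Transfusion": (5.975, 0.358, 0.468, 0.506, 0.539),
    "CoMusion": (7.632, 0.350, 0.458, 0.494, 0.506),
    "Our (series)": (8.608, 0.410, 0.766, 0.496, 0.522),
    "Our (parallel)": (6.570, 0.382, 0.761, 0.491, 0.529),
}


def top_two(table, metric):
    """Return the best and second-best method names for one metric."""
    col = METRICS.index(metric)
    ranked = sorted(
        table,
        key=lambda name: table[name][col],
        reverse=metric in HIGHER_IS_BETTER,
    )
    return ranked[0], ranked[1]


for metric in METRICS:
    best, second = top_two(TABLE_1, metric)
    print(f"{metric}: best = {best}, second best = {second}")
```

The same function can be applied to the other result tables by swapping in their rows.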

To verify that the model retains a strong learning ability when the spatio-temporal feature extraction order is changed, we reversed the serial order of ST-FE: the temporal features of the sequence are first learned using a transformer, and the GCN is then used to aggregate the neighboring features of the joints. The final results of the experiments are shown in Table 2.

Table 2.

Experimental Results on Human3.6M After Changing the Spatio-Temporal Order.

Metrics	APD $↑$	ADE $↓$	FDE $↓$	MMADE $↓$	MMFDE $↓$
S-T	8.608	0.410	0.766	0.496	0.522
T-S	8.373	0.405	0.768	0.493	0.510

APD = average pairwise distance; ADE = average displacement error; FDE = final displacement error; MMADE: multi-modal average displacement error; MMFDE: multi-modal final displacement error.

Furthermore, to verify the impact of spatio-temporal features on the diversity and accuracy of the model, we conducted further ablation experiments on the ST-FE. We used the transformer to learn the temporal features of the sequence and the GCN to aggregate the neighboring features of the joints, respectively. The final results of the experiment are shown in Table 3.

Table 3.

Experimental Results on Human3.6M After Selecting Different Features.

PE	S	T	APD $↑$	ADE $↓$	FDE $↓$	MMADE $↓$	MMFDE $↓$
$\times$	✓	✓	6.327	0.407	0.725	0.502	0.511
✓	✓	$\times$	5.327	0.358	0.543	0.468	0.506
✓	$\times$	✓	9.285	0.421	0.823	0.512	0.524

APD = average pairwise distance; ADE = average displacement error; FDE = final displacement error; MMADE: multi-modal average displacement error; MMFDE: multi-modal final displacement error.

The above experimental results show that, after changing the order of spatio-temporal feature extraction, the model’s results on the Human3.6M dataset remain close to each other (Table 2): the model predicts well from the sequence features regardless of whether it first learns the joint features or the temporal correlation of the sequence. However, when temporal or spatial features are used alone (Table 3), the performance of the model is significantly affected. With spatial features alone, the diversity metric (APD) declines markedly to 5.327, and the model struggles to capture the rich variation in human behavior; with temporal features alone, the error metrics (ADE and FDE) rise to 0.421 and 0.823, indicating a larger deviation between the predicted results and the true values. The synergy of spatio-temporal features is therefore crucial for balancing the diversity and accuracy of the model, and neither can be dispensed with.

The channel enhanced module (CEM) increases the importance of key features in the motion sequence and is therefore useful for learning sequence features. To examine its contribution to human behavioral diversity prediction, we conducted an ablation experiment on the Human3.6M dataset using the series-connected model with and without the CEM; the results are shown in Table 4.

Table 4.

Ablation Experiments on the CEM on the Dataset Human3.6M.

Metrics	APD $↑$	ADE $↓$	FDE $↓$	MMADE $↓$	MMFDE $↓$
CEM	8.608	0.410	0.766	0.496	0.522
No-CEM	7.314	0.61	0.751	0.502	0.517

CEM = channel enhanced module; APD = average pairwise distance; ADE = average displacement error; FDE = final displacement error; MMADE: multi-modal average displacement error; MMFDE: multi-modal final displacement error.

On the Human3.6M dataset, all metrics except FDE and MMFDE were better with the CEM than without it, again indicating that the CEM is helpful in predicting human behavioral diversity.

Figure 4 shows two comparative results of the transformer-based motion denoiser. For the motion observation labeled “walking,” the figure presents a visual comparison between our method and the baseline method in terms of the diversity of motion generation and the final posture.

Figure 4.

Visual Result. This is the Result From the Human3.6M Dataset. The Two Visualized Results, Respectively, Show the Visual Comparison Between Our Method and the Baseline Method in Terms of the Logicality and Diversity of the Generated Motions for the Action Category “walking.” The Observed Background is Represented by a Red and Black Skeleton, While the Future Motion is Represented by a Blue and Green Skeleton.

Stochastic Prediction. The results on the HumanEva-I dataset for predicting the future behavioral diversity of the human body are shown in Table 5. On this dataset, the parallel variant is the most competitive in terms of the ADE metric, whereas both variants are inferior to previous methods in terms of the FDE metric and trail GSPS and HumanMAC in terms of the APD metric. We also found that series denoising is superior to parallel denoising in terms of APD, FDE, and MMFDE.

Table 5.

Experimental Results for the Dataset HumanEva-I.

Metrics	APD $↑$	ADE $↓$	FDE $↓$	MMADE $↓$	MMFDE $↓$
DLow (Yuan & Kitani, 2020)	4.855	0.251	0.268	0.362	0.339
GSPS (Mao et al., 2021)	5.825	0.233	0.244	0.343	0.331
HumanMAC (Chen et al., 2023)	6.554	0.209	0.223	0.342	0.335
Transfusion (Tian et al., 2024)	1.031	0.204	0.234	0.408	0.427
Our (series)	5.547	0.416	0.459	0.362	0.340
Our (parallel)	4.952	0.189	0.517	0.358	0.345

APD = average pairwise distance; ADE = average displacement error; FDE = final displacement error; MMADE: multi-modal average displacement error; MMFDE: multi-modal final displacement error.

We changed the serial order of ST-FE to verify that the model retains a strong learning ability: the temporal features of the sequence were first learned using a transformer, and the GCN was then used to aggregate the neighboring features of the joints. The final results of the experiments are shown in Table 6.

Table 6.

Experimental Results on HumanEva-I After Changing the Spatio-Temporal Order.

Metrics	APD $↑$	ADE $↓$	FDE $↓$	MMADE $↓$	MMFDE $↓$
S-T	5.547	0.416	0.459	0.362	0.340
T-S	5.573	0.421	0.466	0.362	0.337

APD = average pairwise distance; ADE = average displacement error; FDE = final displacement error; MMADE: multi-modal average displacement error; MMFDE: multi-modal final displacement error.

The above experimental results show that, after changing the order of spatio-temporal feature extraction, the model’s results on the HumanEva-I dataset remain close to each other: the model predicts well from the sequence features regardless of whether it first learns the joint features or the temporal correlation of the sequence. We again examined the importance of the CEM for behavioral diversity prediction using the HumanEva-I dataset, and the results of the experiments are shown in Table 7.

On the HumanEva-I dataset, the results were better with the CEM than without it on all metrics except MMFDE.

To verify that the model can better learn the correlation between two joints that are physically disconnected but semantically connected in space, we compared the learned matrices on the Human3.6M dataset: the modulation affinity matrices learned by the model exhibited a wider range of connections than the spatially learned affinity matrices, as shown in Figure 5.

Table 7.

Ablation Experiments on the CEM on the Dataset HumanEva-I.

Metrics	APD $↑$	ADE $↓$	FDE $↓$	MMADE $↓$	MMFDE $↓$
CEM	5.547	0.416	0.459	0.362	0.340
No-CEM	5.314	0.470	0.537	0.368	0.328

Figure 5.

Affinity Matrices and Modulation Affinity Matrices.

7. Conclusions and Future Perspectives

In this study, we propose a novel human motion prediction method that combines a conditional DDPM with human joint kinematics. First, the observed and predicted sequences were fused using masks, and a cosine noise scheduler was used to gradually destroy the sequence structure. Then, the sequence noise was removed step by step using the ST-FE module and the CEM, which learn the spatial features of the joint points and the temporal features of the motions. The denoiser is gradually trained to transform noise into reasonable predictions, generating diverse future motions. In addition, we combined the transformer and adaptive graph convolution in ST-FE both in series and in parallel to further improve the learning ability of the denoiser. The experimental results show that our method is competitive with existing methods.

The algorithm in this study has the following shortcomings. First, the position encoding for the joint points still uses the periodic sine-cosine function, which ignores the spatial structure information between the joint points. Second, when graph convolution is used to capture the spatial structure during the denoising process, the fixed adjacency matrix limits the extraction of the overall structural information, and the correlation between joints that are not physically connected cannot be captured.

In the future, we will further investigate the reasonable positional encoding of joint points and utilize the self-attention mechanism to enhance the learning ability of graph convolution. The CEM will be used with the transformer for the positional change to improve the useful information contained in each frame.

Footnotes

Acknowledgements

This publication was made possible by the National Natural Science Foundation of China under Grant No. 62171342 and the Natural Science Foundation of the Anhui Higher Education Institutions of China under Grant No. 2024AH051682. The statements made herein are solely the responsibility of the authors.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iDs

Gansen Deng

Hanqing Tong

Wenwen Ding

References

Ahn, Mascaro, E. V., Lee (2023). Can we use diffusion probabilistic models for 3D motion prediction? arXiv preprint arXiv:2302.14503. https://doi.org/10.48550/arXiv.2302.14503

Aksan, Kaufmann, Cao, Hilliges (2021). A spatio-temporal transformer for 3D human motion prediction. In 2021 International conference on 3D vision (3DV) (pp. 565–574). IEEE. https://doi.org/10.1109/3DV53792.2021.00066

Barquero, Escalera, Palmero (2023). Belfusion: Latent diffusion for behavior-driven human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2317–2327). https://doi.org/10.1109/ICCV51070.2023.00220

Bouazizi, Holzbock, Kressel, Dietmayer, Belagiannis (2022). Motionmixer: Mlp-based 3D human body pose forecasting. arXiv preprint arXiv:2207.00499. https://doi.org/10.48550/arXiv.2207.00499

Butepage, Black, M. J., Kragic, Kjellstrom (2017). Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6158–6166). https://doi.org/10.1109/CVPR.2017.173

Bütepage, Kjellström, Kragic (2018). Anticipating many futures: Online human motion prediction and generation for human-robot interaction. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 4563–4570). IEEE. https://doi.org/10.1109/ICRA.2018.8460651

Chen, L. H., Zhang, Pang, Xia, Liu (2023). Humanmac: Masked motion completion for human motion prediction. arXiv preprint arXiv:2302.03665. https://doi.org/10.1109/ICCV51070.2023.00875

Chiu, H. K., Adeli, Wang, Huang, D. A., Niebles, J. C. (2019). Action-agnostic human pose forecasting. In 2019 IEEE winter conference on applications of computer vision (WACV) (pp. 1423–1432). IEEE. https://doi.org/10.48550/arxiv.1810.09676

Creswell, White, Dumoulin, Arulkumaran, Sengupta, Bharath, A. A. (2018). Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), 53–65. https://doi.org/10.1109/MSP.2017.2765202

Cui, Sun, Yang (2020). Learning dynamic relationships for 3D human motion prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6519–6527). https://doi.org/10.1109/CVPR42600.2020.00655

Curreli, Muhle, Saroha, Marin, Cremers (2025). Nonisotropic Gaussian diffusion for realistic 3D human motion prediction. arXiv preprint arXiv:2501.06035. https://doi.org/10.1109/CVPR52734.2025.00181

Dang, Nie, Long, Zhang (2021). Msr-gcn: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 11467–11476). https://doi.org/10.48550/arXiv.2108.07152

Dang, Nie, Long, Zhang (2022). Diverse human motion prediction via gumbel-softmax sampling from an auxiliary space. In Proceedings of the 30th ACM international conference on multimedia (pp. 5162–5171). https://doi.org/10.1145/3503161.3547956

Dhariwal, Nichol (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794. https://doi.org/10.48550/arXiv.2105.05233

Gao, Yang, G. J. (2024). Multi-condition latent diffusion network for scene-aware neural human motion prediction. IEEE Transactions on Image Processing, 33, 3907–3920. https://doi.org/10.1109/TIP.2024.3414935

Hernandez, Gall, Moreno-Noguer (2019). Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7134–7143). https://doi.org/10.1109/ICCV.2019.00723

Jain, Abbeel (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851. https://doi.org/10.48550/arXiv.2006.11239

Ionescu, Papava, Olaru, Sminchisescu (2013). Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339. https://doi.org/10.1109/TPAMI.2013.248

Kim, Yoon (2022). Guided-tts 2: A diffusion model for high-quality adaptive text-to-speech with untranscribed data. arXiv preprint arXiv:2205.15370. https://doi.org/10.48550/arXiv.2205.15370

Kundu, J. N., Gor, Babu, R. V. (2019). Bihmp-gan: Bidirectional 3D human motion prediction gan. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, pp. 8553–8560). https://doi.org/10.1609/aaai.v33i01.33018553

Lin, Amer, M. R. (2018). Human motion modeling using dvgans. arXiv preprint arXiv:1804.10652. https://doi.org/10.48550/arXiv.1804.10652

Liu, Zhang (2022a). Graph transformer network with temporal kernel attention for skeleton-based action recognition. Knowledge-Based Systems, 240, 108146. https://doi.org/10.1016/j.knosys.2022.108146

Liu, Jin, Liu, Cheng (2022b). Investigating pose representations and motion contexts modeling for 3D motion prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 681–697. https://doi.org/10.1109/TPAMI.2021.3139918

Lugmayr, Danelljan, Romero, Timofte, Van Gool (2022). Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11461–11471). https://doi.org/10.1109/CVPR52688.2022.01117

Nie, Long, Zhang (2022). Progressively generating better initial guesses towards next stages for high-quality human motion prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6437–6446). https://doi.org/10.1109/CVPR52688.2022.00633

Mangalam, Adeli, Lee, K. H., Gaidon, Niebles, J. C. (2020). Disentangling human dynamics for pedestrian locomotion forecasting with noisy supervision. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 2784–2793). https://doi.org/10.1109/WACV45572.2020.9093350

Mao, Liu, Salzmann (2021). Generating smooth pose sequences for diverse human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 13309–13318). https://doi.org/10.1109/ICCV48922.2021.01306

Mao, Liu, Salzmann (2019). Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9489–9497). https://doi.org/10.1109/ICCV.2019.00958

Nichol, A. Q., Dhariwal (2021). Improved denoising diffusion probabilistic models. In International conference on machine learning (pp. 8162–8171). PMLR. https://doi.org/10.48550/arXiv.2102.09672

Paden, Čáp, Yong, S. Z., Yershov, Frazzoli (2016). A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles, 1(1), 33–55. https://doi.org/10.1109/TIV.2016.2578706

Saharia, Chan, Saxena, Whang, Denton, E. L., Ghasemipour, Gontijo Lopes, Karagol Ayan, Salimans (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494. https://doi.org/10.48550/arXiv.2205.11487

Sigal, Balan, A. O., Black, M. J. (2010). Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1-2), 4–27. https://doi.org/10.1007/s11263-009-0273-6

Sun, Chowdhary (2024). Comusion: Towards consistent stochastic human motion prediction via motion diffusion. In European conference on computer vision (pp. 18–36). Springer. https://doi.org/10.1007/978-3-031-73036-8_2

Tashiro, Song, Ermon (2021). Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34, 24804–24816. https://doi.org/10.48550/arXiv.2107.03502

Tian, Zheng, Liang (2024). Transfusion: A practical and effective transformer-based diffusion model for 3D human motion prediction. IEEE Robotics and Automation Letters, 9(7), 6232–6239. https://doi.org/10.1109/LRA.2024.3401116

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, A. N., Kaiser, Ł., Polosukhin (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 6000–6010. https://doi.org/10.5555/3295222.3295349

Wei, Sun (2023). Human joint kinematics diffusion-refinement for stochastic motion prediction. In Proceedings of the AAAI conference on artificial intelligence (Vol. 37, pp. 6110–6118). https://doi.org/10.1609/aaai.v37i5.25754

Wen, Lin, Xia, Wan, Zimmermann, Liang (2023). Diffstg: Probabilistic spatio-temporal graph forecasting with denoising diffusion models. arXiv preprint arXiv:2301.13629. https://doi.org/10.48550/arXiv.2301.13629

Wang, Y. X., Gui, L. Y. (2022). Diverse human motion prediction guided by multi-level spatial-temporal anchors. In European conference on computer vision (pp. 251–269). Springer. https://doi.org/10.1007/978-3-031-20047-2_15

Yan, Rastogi, Villegas, Sunkavalli, Shechtman, Hadap, Yumer, Lee (2018). Mt-vae: Learning motion transformations to generate multimodal human dynamics. In Proceedings of the European conference on computer vision (ECCV) (pp. 276–293). https://doi.org/10.1007/978-3-030-01228-1_17

Hou, Pei, Ong, Y. S., Zhang (2024). Divdiff: A conditional diffusion model for diverse human motion prediction. IEEE Transactions on Multimedia, 27, 1848–1859. https://doi.org/10.1109/TMM.2024.3521821

Yuan, Kitani (2020). Dlow: Diversifying latent flows for diverse human motion prediction. In Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16 (pp. 346–364). Springer. https://doi.org/10.1007/978-3-030-58545-7_20

Zhong, Zhang, Xia (2022). Spatio-temporal gating-adjacency gcn for human motion prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6447–6456). https://doi.org/10.1109/CVPR52688.2022.00634

An Effective Conditional Transformer-Based Diffusion Model for Three-Dimensional Human Motion Prediction
