Abstract
Scholars have proposed a cable-driven parallel robot (CDPR) with aerial and ground actuators, referred to as a cable-towed aerial platform (CTAP), to address the limited ability of existing unmanned aerial vehicle (UAV)-based aerial platforms to withstand external forces. To tackle practical problems (e.g., firefighting), a CTAP may need to perform motion planning in an environment with obstacles and communication failures. To this end, this article proposes an online decentralized planning approach based on multi-agent reinforcement learning (MARL) for a CTAP to achieve real-time motion planning in a communication-denied environment with obstacles. This article then defines the state and action spaces of a MARL-based planner. A reward function is designed for the MARL-based planning approach according to a widely used optimization-based planning formulation. This study has successfully trained a MARL-based planner using this approach, and the important training techniques used in this study are reported. A statistical comparison of the MARL-based decentralized planner and an optimization-based centralized planner is conducted in simulation. The MARL-based and optimization-based planners are deployed to a CTAP prototype to address motion planning problems in the real world. Experimental results show that the MARL-based planner can achieve successful, decentralized, and online motion planning for a CTAP.
Introduction
Efforts have long been underway to achieve efficient aerial manipulation, such as high-rise firefighting and aerial rescue. Aerial manipulation usually requires a large workspace, often in a complex environment with obstacles, which poses challenges for aerial manipulation systems. To address such tasks, researchers have proposed aerial platforms based on unmanned aerial vehicles (UAVs). 1
For instance, UAVs have been used for high-rise building firefighting, as shown in Figure 1.

UAV-based aerial platforms used for high-rise building firefighting.
To address the limited load capacity and the limited ability of UAV-based aerial platforms to withstand external forces, scholars have proposed the concept of a cable-driven parallel robot (CDPR) with aerial and ground actuators, 2 referred to as a cable-towed aerial platform (CTAP), as shown in Figure 2. A CTAP is a CDPR 3–5 driven collaboratively by several aerial and ground actuators via cables. Researchers have addressed several problems of CTAPs, including static workspace, 2 energy efficiency, 6 and real-time reconfiguration planning. 7 To operate a CTAP in an environment with obstacles (e.g., buildings), it is important to plan the proper motion of the aerial and ground actuators. Thus, the motion planning of a CTAP in an environment with obstacles is a critical problem.8,9

Process of the deployment of a cable-towed aerial platform for firefighting.
The motion planning of a CTAP in an environment with obstacles is challenging for several reasons. Firstly, the heterogeneous nature of a CTAP and the combination of aerial actuators, ground actuators, and multiple flexible and rigid bodies lead to a complex and high-dimensional motion-planning problem. Secondly, a target position may be located behind a building, where a CTAP cannot reach it directly, so the motion planning must handle obstacle avoidance for the aerial actuators, the moving platform, and the cables. Thirdly, a complex environment (e.g., a firefighting scene) may suffer communication failures. 10 To address the motion planning of a CTAP, an effective motion planning approach is required.
Existing methods for multi-agent system planning can be broadly classified into centralized and decentralized planning approaches. 11 Researchers have studied centralized planning algorithms for systems similar to a CTAP. Gagliardini et al. 12 and Rasheed et al. 13 formulated the problems of reconfiguration planning and obstacle avoidance for reconfigurable CDPRs and developed optimization-based and search-based planning methods to solve these problems. A novel constrained path planning method has been proposed for a reconfigurable CDPR to achieve obstacle avoidance by simultaneously adjusting the cable anchor positions and cable lengths. 14 An adaptive sampling-based path planning method has been proposed for CDPRs to find collision-free trajectories in a dynamic environment. 15 Liu et al. 16 proposed a reconfiguration planning method based on reinforcement learning (RL), enabling reconfigurable CDPRs to avoid dynamic obstacles. Joyo et al. integrated twin delayed deep deterministic policy gradient (TD3) with a proportional-integral-derivative (PID) loop for reference tracking under tension constraints, demonstrating improved accuracy and stability and the practicality of deep policy gradients in multi-constraint settings. 17 These studies show that RL offers a promising direction for achieving real-time obstacle avoidance. Centralized planning involves a central planner generating the target trajectories, which are then sent to individual agents. However, the communication between the central planner and an agent may suffer from signal delays or communication failures. 10 A communication failure could result in the paralysis of the entire system.18,19
To address issues such as communication failure, scholars have proposed decentralized planning methods, 20 including optimization-based and heuristic-based decentralized planning methods. Alessandro et al. 21 proposed a heuristic algorithm to address collaboration problems in logistics scenarios. Harikumar et al. 11 employed both heuristic algorithms and an optimization-based multi-UAV search and dynamic formation control framework to search for targets in uncertain environments, such as forest firefighting. However, the high time consumption of both optimization-based and heuristic-based decentralized planning methods 22 prevents online planning for a CTAP.
The integration of learning-based approaches in motion planning has become a significant research trend.23,24 For instance, Ding et al. recently proposed a federated reinforcement learning framework for intelligent route planning in aerial–terrestrial networks, 25 which addresses privacy preservation and cross-environment knowledge sharing. While this line of work is not directly applicable to a CTAP (the system involves tightly coupled aerial and ground actuators within a single physical platform rather than independent environments), it highlights the growing interest in distributed RL approaches for aerial–ground systems. To address online decentralized planning, researchers have turned their attention to multi-agent reinforcement learning (MARL). 26 Classic algorithms, such as multi-agent trust region policy optimization (MATRPO), multi-agent deep deterministic policy gradient (MADDPG), and their improved versions, multi-agent proximal policy optimization (MAPPO) and multi-agent twin delayed deep deterministic policy gradient (MATD3), have achieved excellent results in scenarios such as multi-robot collaboration and intelligent transportation. 27 Recently, Gabler and Wollherr proposed a decentralized MARL framework grounded in best-response policies with hierarchical opponent modeling; 28 it surpasses MADDPG in benchmarks and mitigates the combinatorial explosion of agent interactions. Although this framework highlights decentralized decisions and opponent modeling, its evaluation remains in generic domains rather than online planning on a CTAP. Several works have already applied MARL to multi-robot cooperative systems, 29 but these studies are mostly restricted to settings with few degrees of freedom and simplified dynamics. To this end, this article proposes an online decentralized planning approach based on MARL for a CTAP. The main contributions of this work can be summarized as follows.
This article proposes an online decentralized planning approach based on MARL for a CTAP to address path planning and obstacle avoidance in an environment with obstacles. The approach enables the multiple actuators of a CTAP to operate independently and in real time, collectively moving the moving platform to a target position in scenarios where communication is denied. This study applies the proposed approach to a CTAP in simulation and in the real world and compares it with a commonly used centralized optimization-based planning approach to verify its effectiveness. Performance metrics, including time consumption and path length, achieved by the proposed approach and by the centralized optimization-based planning approach are reported.
Compared with prior studies, this study applies MARL to an aerial–ground cooperative system with higher degrees of freedom, achieving near-optimal planning results while maintaining real-time performance. Unlike single-agent RL approaches, the proposed method remains effective under communication interruptions and exhibits stronger robustness to partial agent failures. In contrast to existing MARL applications, the target system of this study involves more degrees of freedom and more complex dynamics, which further demonstrates the applicability and potential of the proposed approach in challenging cooperative control scenarios.
The remaining sections of this article are structured as follows. Section “Preliminaries” presents the preliminaries of this study, including the problem statement and the notation for a CTAP. Section “Problem formulation” formulates the motion planning of a CTAP as an optimization problem. Section “MARL-Based decentralized planning approach” proposes a decentralized motion planning approach based on MARL for a CTAP. Section “Experiments” conducts experiments in simulation and in the real world to demonstrate and verify the proposed MARL-based decentralized planning approach for a CTAP. Finally, Section “Conclusions” summarizes the article and suggests potential directions for future research.
Preliminaries
In this section, the problem statement and the notation for a CTAP in an environment with an obstacle are presented.
Problem statement
This article addresses the problem of online decentralized motion planning of a CTAP with a point-mass moving platform and both aerial and ground actuators in an environment with obstacles, motivated by the deployment of a CTAP for firefighting, as shown in Figure 2. Every aerial actuator connects to the moving platform through a fixed-length cable and every ground actuator connects to the moving platform through a variable-length cable. The positions of ground actuators are fixed. The CTAP is required to move the moving platform from an initial position to a target position. The desired velocity of each aerial actuator is required to be determined by an agent attached to the aerial actuator and the desired cable stretching velocity of ground actuators is required to be determined by an agent on the ground. The moving platform may move around an obstacle (e.g., a building). The position and velocity of the moving platform and aerial actuators, as well as the state of cables and the obstacle, can be observed, for instance, using visual sensors (e.g., cameras), inertial measurement units (IMUs), and encoders.
Notation for a CTAP
The notations for a CTAP with

Notations of a cable-towed aerial platform.
It is assumed that an obstacle is contained within a cuboid, since buildings are frequent obstacles in high-rise building firefighting. The center of the bottom surface of the cuboid is denoted as
Problem formulation
This section formulates the motion planning of a CTAP demonstrated in Section “Problem statement” as an optimization problem, inspired by Gagliardini et al. 12 and Rasheed et al. 13
The formulation provides a basis for the development of an online decentralized planning approach based on MARL and the baseline—a conventional centralized planning approach based on optimization.
The motion planning of a CTAP can be formulated as an optimization problem defined as
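While the full formulation is given by the equations of this section, a schematic sketch of its structure is shown below; the symbols here (platform position p_t, actuator velocities v_t^i, obstacle region O, horizon T) are illustrative placeholders rather than the article's exact notation.

```latex
\begin{aligned}
\min_{\{\mathbf{v}_t^{i}\}} \quad & \sum_{t=0}^{T-1} \left\lVert \mathbf{p}_{t+1} - \mathbf{p}_t \right\rVert
  && \text{(path length of the moving platform)} \\
\text{s.t.} \quad & \mathbf{p}_{t+1} = f\!\left(\mathbf{p}_t, \mathbf{v}_t^{1}, \dots, \mathbf{v}_t^{n}\right)
  && \text{(CTAP kinematics)} \\
& \left\lVert \mathbf{v}_t^{i} \right\rVert \le v_{\max}
  && \text{(actuator velocity limits)} \\
& \mathbf{p}_t \notin \mathcal{O}, \quad \text{cables and actuators clear of } \mathcal{O}
  && \text{(obstacle avoidance)} \\
& \mathbf{p}_T = \mathbf{p}_{\mathrm{target}}
  && \text{(goal constraint)}
\end{aligned}
```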
MARL-Based decentralized planning approach
To address the online decentralized motion planning of a CTAP, this section proposes a decentralized planning approach based on MARL. The MARL-based decentralized planning approach develops a decentralized planner including an agent for every aerial actuator and an agent for all ground actuators. For a CTAP with

Flow diagram of a MARL-based decentralized planner applied to the motion planning of a cable-towed aerial platform.
Regarding MARL algorithm selection, this study compared the performance of MAPPO and MATD3. As an on-policy method, MAPPO requires freshly collected data for each update and thus suffers from low sample efficiency. 30 This led to convergence difficulties during the early training phase, making it hard to obtain stable policies within a limited number of interaction steps. In contrast, MATD3, as an off-policy algorithm, can repeatedly reuse past experiences via a replay buffer, which significantly improves sample efficiency and accelerates convergence. Therefore, this study adopted MATD3 to ensure efficient training in the complex CTAP environment.
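As a minimal sketch of the sample-efficiency argument (not the authors' implementation; the buffer capacity and field layout are assumptions), an off-policy replay buffer lets each stored transition be reused across many gradient updates, whereas on-policy data is discarded after a single update:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of joint transitions for off-policy MARL training."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, actions, rewards, next_obs, done):
        # Store one joint transition covering all agents at a single time step.
        self.buffer.append((obs, actions, rewards, next_obs, done))

    def sample(self, batch_size=256):
        # Uniformly resample past experience; each stored transition can be
        # reused in many updates, unlike on-policy data that is used once.
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))
```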
State space and action space
The state spaces for an aerial-actuator agent and for the ground-actuator agent are the same. The state space comprises the position of the moving platform
The output of the agent of the
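For illustration, the state and action spaces could be declared with a Gym-style interface as sketched below; the observation dimension is a hypothetical placeholder, while the 0.2 m/s action bounds follow the actuator velocity limits stated in section "Setup".

```python
import numpy as np
from gymnasium import spaces  # any Gym-style space API would do

OBS_DIM = 24  # placeholder; the article's equations define the exact composition

# Shared observation layout: moving-platform state, own actuator state,
# obstacle geometry, and target position, concatenated into one vector.
observation_space = spaces.Box(low=-np.inf, high=np.inf,
                               shape=(OBS_DIM,), dtype=np.float32)

# Aerial-actuator agent: desired 3D velocity, bounded by the 0.2 m/s limit.
aerial_action_space = spaces.Box(low=-0.2, high=0.2, shape=(3,), dtype=np.float32)

# Ground-actuator agent: desired stretching velocities of the two
# variable-length cables, with the same 0.2 m/s bound.
ground_action_space = spaces.Box(low=-0.2, high=0.2, shape=(2,), dtype=np.float32)
```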
Reward function for ground actuators
Inspired by Gagliardini et al. 12 and Xiong et al., 7
this study proposes a reward function that accumulates multiple terms for the agent of ground actuators based on a multi-objective optimization problem as
where
Although defined separately,
Reward function for an aerial actuator
This study designs a reward function that accumulates multiple terms for the agent of an aerial actuator based on a multi-objective optimization problem as
The purposes of
In summary, the reward terms can be grouped into three categories. First, the goal-reaching rewards (i.e.,
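A minimal sketch of how the three categories might be accumulated into a single scalar reward is given below; the weights, margins, and helper arguments are hypothetical stand-ins for the terms defined in the equations above and the parameters in Table 1.

```python
import numpy as np

def composite_reward(platform_pos, target_pos, obstacle_dist, cable_tensions,
                     w_goal=1.0, w_coll=1.0, w_coord=0.5, safe_dist=0.2, t_min=0.1):
    """Schematic accumulation of the three reward categories (weights hypothetical)."""
    # Goal-reaching: negative Euclidean distance from the platform to the target.
    r_goal = -np.linalg.norm(np.asarray(target_pos) - np.asarray(platform_pos))
    # Collision avoidance: penalize whenever the platform or a cable comes
    # within a safety margin of the obstacle (obstacle_dist is that clearance).
    r_coll = -max(0.0, safe_dist - obstacle_dist)
    # Coordination: penalize slack cables so the platform remains taut-driven.
    r_coord = -sum(max(0.0, t_min - t) for t in cable_tensions)
    return w_goal * r_goal + w_coll * r_coll + w_coord * r_coord

# Example with made-up values: platform at start, target behind the obstacle.
r = composite_reward([0.0, 0.0, 0.5], [0.9, -1.1, 0.85],
                     obstacle_dist=0.35, cable_tensions=[0.4, 0.05])
```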
Experiments
To evaluate and demonstrate the proposed decentralized planning approach, a CTAP prototype is established, as shown in Figure 5, and a model of the CTAP prototype is built in the MuJoCo simulation environment, 31 as shown in Figure 6. The CTAP prototype involves two aerial actuators and two ground actuators (i.e.,

A cable-towed aerial platform prototype with two aerial actuators and two ground actuators in the real world.

A cable-towed aerial platform with two aerial actuators and two ground actuators in the MuJoCo simulator.
The MATD3 algorithm suppresses the overestimation problem more effectively than MADDPG. 32 Additionally, its deterministic policy enables more precise action control than the MAPPO algorithm. In this study, the MAPPO algorithm was first applied to training; however, the reward achieved with MAPPO was difficult to converge. For this reason, the MATD3 algorithm is used to train a decentralized planner consisting of two agents designed for the two aerial actuators and one agent for the two ground actuators, according to the decentralized planning approach presented in Section “MARL-Based decentralized planning approach”. A centralized optimization-based planner is developed as a baseline. The effectiveness of the decentralized planning approach is verified through the training of a decentralized planner in simulation, statistical comparison in simulation, and a case study in simulation and in the real world.
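The suppression of overestimation in TD3-family methods comes from the clipped double-Q target, which takes the elementwise minimum of two target critics when forming the Bellman target. A minimal PyTorch sketch of that target computation (tensor names are illustrative):

```python
import torch

def clipped_double_q_target(rewards, dones, next_q1, next_q2, gamma=0.99):
    """Bellman target used by TD3/MATD3-style twin critics.

    next_q1, next_q2: the two target critics' values for the next joint
    state and (noise-smoothed) target actions; all tensors share one shape.
    """
    # The elementwise minimum biases the target downward, counteracting the
    # value overestimation observed with a single critic (as in MADDPG).
    next_q = torch.min(next_q1, next_q2)
    return rewards + gamma * (1.0 - dones) * next_q
```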
Setup
In this study, a planner is required to guide a CTAP to move in an environment with an obstacle with a height of up to 1.0 m, a length of up to 0.5 m, and a width of up to 0.5 m. The length of the cable between an aerial actuator and the moving platform is 1 m. The maximum velocity of aerial actuators is 0.2 m/s. The maximum cable stretching velocity of the cables driven by the ground actuators is 0.2 m/s. A planner defines a world frame and the origin of the world frame is on the ground and at the center of the two ground actuators. The
A personal computer with an Intel i9-10900K CPU, an Nvidia 3080Ti GPU, and 64 gigabytes of memory is used for the training of a MARL-based decentralized planner and the analysis of the MARL-based planner and the optimization-based planner in simulation. In simulation, the positional and geometric information of the aerial actuators and the moving platform, as well as the length of the variable-length cables, can be obtained from the MuJoCo simulator.
For experiments in the real world, a laptop with an AMD R5-5600U CPU and 16 gigabytes of memory is used for the optimization-based planner and the agent of ground actuators of the MARL-based decentralized planner. Two Jetson Orin Nano onboard computers with 8 gigabytes of memory are used for the two agents of aerial actuators of the MARL-based decentralized planner, respectively. The positional and geometric information of the moving platform, obstacle, and aerial actuators are measured by a NOKOV motion capture system. The length of variable-length cables is recorded by encoders.
Implementation and training of a MARL-based decentralized planner
The MATD3 algorithm 32 is used in this study due to its robust framework, which is based on decentralized execution and centralized training, making it well-suited for developing decentralized controllers for multiple agents. According to the MATD3 algorithm, the policy and value networks are both designed based on a multilayer perceptron (MLP) architecture, consisting of three fully connected (FC) hidden layers. Each layer includes 256 units and employs the leaky-ReLU activation function. The output layer of the policy network is a tanh layer, which ensures that the output of the policy network ranges from
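A PyTorch sketch of the policy network as described, with three 256-unit leaky-ReLU hidden layers and a tanh output layer; the input and output dimensions are placeholders, since the exact state and action vectors are defined earlier:

```python
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor MLP: three 256-unit FC hidden layers, leaky-ReLU, tanh output."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.LeakyReLU(),
            nn.Linear(256, 256), nn.LeakyReLU(),
            nn.Linear(256, 256), nn.LeakyReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # bounds outputs to [-1, 1]
        )

    def forward(self, obs):
        # The normalized output is rescaled to the physical velocity limits
        # before being sent to an actuator.
        return self.net(obs)
```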
To train a MARL-based decentralized planner, obstacles with a height ranging from 0.2 m to 1.2 m and a length and a width ranging from 0.1 m to 0.6 m are randomly generated. The position of the obstacle is randomly selected up to 0.8 m from the origin of the world frame. A target point around the obstacle is randomly selected. The learning rates of the actor and critic networks are set to 0.001 and 0.0001, respectively. The actor and critic networks are updated every 0.005 s. The batch size is set to 256.
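The obstacle randomization used in training can be sketched as follows; the polar placement of the obstacle center is one possible reading of "up to 0.8 m from the origin", and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng()

def sample_training_obstacle():
    """Randomize obstacle size and placement within the stated training ranges."""
    height = rng.uniform(0.2, 1.2)   # m
    length = rng.uniform(0.1, 0.6)   # m
    width = rng.uniform(0.1, 0.6)    # m
    # Place the obstacle center up to 0.8 m from the world-frame origin.
    angle = rng.uniform(0.0, 2.0 * np.pi)
    radius = rng.uniform(0.0, 0.8)
    center = np.array([radius * np.cos(angle), radius * np.sin(angle), 0.0])
    return {"size": (length, width, height), "center": center}
```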
According to the practice of this study, it is difficult to directly train an effective MARL-based decentralized planner. Thus, this study adopts the following strategies to achieve successful training, inspired by Liu et al. 16 and Xu et al. 33
Phased training
The training of a MARL-based decentralized planner is divided into three phases.
The first phase is to guide the agents to learn to move the moving platform to a target point in an environment without an obstacle. Reward functions revised from (26) and (12) are used. The rewards at time step
The second phase aims to guide the agents to learn to address small obstacles. The rewards at time step
The third phase aims to train the agents to address all possible obstacles. The size of an obstacle is increased with training. Eventually, the maximum length and width of an obstacle are both set to 0.6 m, while the maximum height is set to 1.2 m. The rewards at time step
The weights of the reward functions are tuned through multiple attempts and analysis of the achieved reward in order to balance the different reward terms. The resulting parameters of the reward functions are listed in Table 1.
Parameters of rewards.
Curriculum learning
The curriculum learning technique is used to facilitate the third phase.35,36 Curriculum learning allows one to start learning from simpler tasks (addressing smaller obstacles) and gradually increase the difficulty of tasks. In the third phase, initially, the MARL-based decentralized planner is trained based on an environment with a randomly generated small obstacle, the maximum height, length, and width of which are 0.1 m, 0.2 m, and 0.2 m, respectively. With the increase in the number of training episodes, the size of the randomly generated obstacle increases, according to curriculum learning. Eventually, the maximum height, length, and width of the obstacle are 1.2 m, 0.6 m, and 0.6 m, respectively.
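A minimal sketch of such a curriculum schedule is shown below; the linear growth over episodes is an assumption, as the article only specifies the initial and final obstacle size limits.

```python
def curriculum_obstacle_limits(episode, total_episodes):
    """Grow the maximum obstacle size from the initial curriculum limits
    (height 0.1 m, length/width 0.2 m) to the final ones
    (height 1.2 m, length/width 0.6 m) as training progresses."""
    frac = min(1.0, episode / total_episodes)
    max_height = 0.1 + frac * (1.2 - 0.1)
    max_length = 0.2 + frac * (0.6 - 0.2)
    max_width = 0.2 + frac * (0.6 - 0.2)
    return max_height, max_length, max_width
```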
The accumulated rewards achieved by the two agents of aerial actuators and the agent of ground actuators are presented in Figures 7 to 9. The horizontal axis of the figures denotes the number of training steps, while the vertical axis of the figures indicates the value of accumulated reward. The accumulated rewards have similar trends due to the collaboration of the agents. In the first phase, the rewards increase rapidly in an obstacle-free environment. In the second phase, the agents can achieve the same level of rewards in addressing small obstacles. In the third phase, with the increase of the size of obstacles, the accumulated rewards decrease and then converge, suggesting a successful training.

Accumulated reward achieved by the agent of aerial actuator 1.

Accumulated reward achieved by the agent of aerial actuator 2.

Accumulated reward achieved by the agent of ground actuators.
Implementation of the optimization-based planner
The optimization-based planner used in this study is achieved by applying the CasADi solver to solve the optimization problem of the motion planning of a CTAP defined in Section “Problem formulation.” If a target point is on the front side of the obstacle, the number of time steps is set to
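A minimal CasADi sketch of such a transcription is given below, using the Opti stack with simplified point-mass kinematics and a spherical obstacle proxy; the horizon length, time step, and obstacle clearance are illustrative assumptions, and the article's actual formulation additionally constrains the actuators and cables.

```python
import casadi as ca

T, dt, v_max = 100, 0.2, 0.2      # horizon steps, step length (s), speed limit (m/s)
p0 = ca.DM([0.0, 0.0, 0.5])       # initial platform position
pT = ca.DM([0.9, -1.1, 0.85])     # target position behind the obstacle

opti = ca.Opti()
p = opti.variable(3, T + 1)       # platform positions over the horizon
v = opti.variable(3, T)           # platform velocities (decision variables)

# Penalize squared step lengths as a smooth proxy for path length.
opti.minimize(ca.sumsqr(p[:, 1:] - p[:, :-1]))

for t in range(T):
    opti.subject_to(p[:, t + 1] == p[:, t] + dt * v[:, t])   # Euler integration
    opti.subject_to(ca.sumsqr(v[:, t]) <= v_max ** 2)        # speed limit
    # Keep horizontal clearance from the obstacle center (sphere proxy; the
    # real planner handles a cuboid obstacle and the cables as well).
    opti.subject_to(ca.sumsqr(p[0:2, t] - ca.DM([0.45, -0.5])) >= 0.4 ** 2)

opti.subject_to(p[:, 0] == p0)    # start constraint
opti.subject_to(p[:, T] == pT)    # goal constraint

opti.solver("ipopt")
sol = opti.solve()                # sol.value(p) gives the planned trajectory
```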
Statistical comparison of the MARL-based planner and optimization-based planner
To investigate the effectiveness of the MARL-based decentralized planning approach, this study applies both the MARL-based decentralized planner and the optimization-based centralized planner to 100 tests with randomly generated obstacles and target positions. For fairness, both the MARL-based planner and the optimization-based planner are run on the same personal computer presented in section “Setup”. The test process can be summarized as follows. (a) The moving platform is initialized at [0 m, 0 m, 0.5 m]. (b) An obstacle is randomly generated with a length and a width ranging from 0.1 m to 0.5 m and a height ranging from 0.2 m to 1.0 m. (c) The target position is randomly generated around the obstacle. (d) The MARL-based planner and the optimization-based planner each address the motion planning of the CTAP and move the moving platform to the target position.
The success rate, time consumption, and path length of the planners are investigated. The results of the 100 tests are shown in Table 2. Both the MARL-based planner and the optimization-based planner achieve a 100% success rate, indicating that both planners can address the motion planning of a CTAP. It should be emphasized that the MARL-based planner addresses the motion planning of a CTAP with decentralized agents. The time consumption of the MARL-based planner in determining target velocities is two orders of magnitude lower than that of the optimization-based planner. The cost of the high time efficiency and decentralization achieved by the MARL-based planner is a slightly longer path length than that of the optimization-based planner.
Results of the statistical comparison of the MARL-based planner and optimization-based planner.
To further evaluate robustness, this study has tested the MARL-based planner in the presence of dynamic obstacles. In this scenario, obstacles moved randomly at a velocity of
Regarding communication failures, the proposed MARL design does not rely on explicit inter-agent communication during execution, as each agent makes decisions based on local observations and rewards. Therefore, communication interruptions have no effect on task execution. The optimization baseline is an offline centralized solver, which also does not depend on runtime communication. This demonstrates that, while communication failure is a critical issue in many distributed systems, it is not a limiting factor for the studied approaches.
Sensitivity analysis of reward components
To assess the effectiveness of different reward components, this study conducted an ablation-style sensitivity analysis on a previously trained MARL model. The reward terms were grouped into three categories: (1) goal-reaching rewards, (2) collision-avoidance rewards, and (3) coordination rewards (ensuring cables remain taut). Starting from the same initialization, this study retrained three models, each with one category removed, for
The results highlight the necessity of each component. Without goal-reaching rewards, the agents failed in all trials because the moving platform did not learn where to move, leading to timeouts. Without collision-avoidance rewards, only 7% of the trials succeeded; the majority failed due to collisions between the moving platform or cables and an obstacle. Without coordination rewards, the success rate dropped to 38%; the failures were caused by violent oscillations near the target, where slack cables led to collisions. These observations confirm that all three categories are essential. Goal-reaching rewards provide the primary task objective, collision-avoidance rewards ensure safety, and coordination rewards stabilize the system by maintaining cable tension.
Case study in the real world
A case study of applying the MARL-based planner and the optimization-based planner to the CTAP prototype is performed in the real world to investigate the performance of the MARL-based planner in detail. The MARL-based planner runs on a laptop and two onboard computers introduced in section “Setup” while the optimization-based planner runs on the laptop. An obstacle with a height of 0.8 m and a width and a length of 0.4 m is selected according to the limitations of the experimental site. The obstacle is located at [0.450 m, -0.498 m, 0.00 m]. The CTAP prototype is required to bypass the obstacle and reach a target position behind the obstacle located at [0.900 m, -1.096 m, 0.853 m].
The trajectories of aerial actuator 1, aerial actuator 2, and the moving platform of the CTAP prototype following the MARL-based planner and the optimization-based planner are presented in Figures 10 and 11, respectively. The time consumption and path length of the MARL-based planner and the optimization-based planner are listed in Table 3. The results show that the optimization-based planner takes 20 s to address the motion planning of the CTAP prototype on the laptop, suggesting that the optimization-based planner cannot achieve online planning for a CTAP. The agents of the MARL-based planner can be deployed on different devices and can determine the target velocities for the aerial and ground actuators within 0.01 s, indicating that the MARL-based planner can achieve online decentralized planning for a CTAP (Figures 12 and 13). The path length of the trajectories achieved by the MARL-based planner is longer than that achieved by the optimization-based planner, consistent with the path length results of the statistical analysis in section “Statistical comparison of the MARL-based planner and optimization-based planner.”

Trajectories of the CTAP prototype following the MARL-based planner in the real world.

Trajectories of the CTAP prototype following the optimization-based planner in the real world.

Behaviors of the CTAP prototype following the MARL-based planner.

Behaviors of the CTAP prototype following the optimization-based planner.
Time consumption and path length results in approaching a target position on the back of an obstacle in the real world.
In this experiment, this study quantified path tracking accuracy for the moving platform, aerial actuator 1, and aerial actuator 2 following the proposed planner. The commanded trajectory
Success rates under ablation of different reward categories (100 trials).
Path tracking accuracy.
It should be noted that compared to optimization-based methods, the MARL-based planner provides control actions within milliseconds, thereby offering superior real-time performance at the cost of somewhat longer paths. In high-risk applications such as firefighting, real-time responsiveness is often more critical than achieving the shortest path. Faster decision-making enables the platform to promptly handle sudden hazards, whereas the extra energy cost from longer paths is relatively minor. Therefore, this trade-off is acceptable in practice.
Discussion
It should be noted that the experiments in this study were based on several simplifying assumptions. First, the number of obstacles was fixed to one, and its shape was constrained to a cuboid. This reduced problem complexity and facilitated training, but it does not cover more complex scenarios such as multiple or irregularly shaped obstacles. Second, the positions of the ground actuators were fixed symmetrically, which stabilized training and reduced the dimensionality of the action space. However, in more general settings, non-symmetric actuator placements may significantly increase planning difficulty. Third, the system considered in this work included only two aerial and two ground actuators. While this simplified CTAP provides a useful starting point, scaling the proposed approach to larger systems with more actuators remains a challenge, as the joint action space grows rapidly with the number of agents. To extend the proposed approach to handle more complex scenarios (e.g., multi-robot coordination in large-scale environments), possible ways include improving the scalability of the proposed approach by adopting modular or hierarchical MARL architectures to coordinate larger groups of actuators, or by integrating safety filtering mechanisms such as quadratic programming (QP) projections to ensure feasibility constraints when more agents are involved.
It should also be emphasized that deploying the proposed CTAP system in real-world settings introduces additional challenges beyond the simulation assumptions. Environmental factors such as wind disturbances, extreme weather, complex terrain, and sensor noise may cause deviations from planned trajectories and reduce robustness. Hardware limitations further constrain performance: the endurance of aerial actuators restricts mission duration, actuator dynamics impose limits on responsiveness, and onboard computation is limited compared to simulation resources.
Conclusions
This article has developed a MARL-based online decentralized planning approach for a CTAP to achieve motion planning in a communication-denied environment with a cuboid obstacle. The motion planning problem of a CTAP has been formulated as an optimization problem, and the reward function of the MARL-based planning approach has been defined according to the optimization problem. A MARL-based decentralized planner has been successfully trained according to the MARL-based planning approach, and the important training techniques used in this study have been presented. Statistical analysis of the MARL-based decentralized planner in simulation and deployment of the MARL-based decentralized planner on a CTAP prototype have been performed to validate that the MARL-based planner can achieve successful, decentralized, and online motion planning for a CTAP.
Although the proposed MARL-based decentralized planning approach demonstrates advantages in real-time performance and decentralization, it also has several limitations. First, as the number of agents increases, the dimensionality of the action space grows rapidly, making the training process more challenging. Second, the simulator used does not support realistic cable–obstacle collisions, and the geometric approximation requires additional computation and may introduce errors. Third, the paths generated by the MARL-based planner are generally longer than those produced by optimization-based methods, deviating from the theoretical optimum. Fourth, the proposed MARL-based decentralized planner has been validated in static obstacle-rich environments, but its performance in highly dynamic environments remains to be explored.
Future research could explore methods for improving the system’s adaptability to an increased number of agents and highly dynamic environments. Moreover, developing hybrid approaches combining MARL with model-based planning methods might offer benefits in terms of both efficiency and robustness.
Ethical approval and informed consent statements
This article does not contain any studies with human or animal participants.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the National Natural Science Foundation of China (Grant No. 52405010), the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023A1515011010), and the Shenzhen Science and Technology Innovation Program (Grant No. GXWD20231130150349002).
Conflicting interest
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: I hereby disclose that I am a current member of the editorial board of the International Journal of Advanced Robotic Systems. However, I affirm that this affiliation has not influenced the objectivity, analysis, or conclusions presented in this work. All editorial processes for this submission, including peer review and decision-making, were managed in strict accordance with the journal’s ethical guidelines to ensure impartiality.
Data availability statement
Any data gathered and reported in this study are available from the corresponding authors upon reasonable request.
