Sage Journals: Discover world-class research

Abstract

The use of mobile robots to assist astronauts in extravehicular activities could be an effective way to improve mission productivity and crew safety. It is thus critical that these robots follow the astronaut and maintain a stable distance to provide personalized and timely assistance. However, most extraterrestrial bodies exhibit rugged terrain that can impede a robot's movements. A novel predictive-guide following strategy is therefore proposed to improve the stability of the astronaut–robot distance in obstructed environments. This strategy combines a deep reinforcement learning navigator, which generates optimized motion sequences for safely following the astronaut, with a Kalman filter-based predictor, which provides predictive guidance concerning future astronaut movements. The proposed model achieved a success rate of 95.0% in simulated navigation tasks and adapted well to untrained complex environments and varied robot movement settings. Comparative tests indicated that our strategy stabilized the following distance to within ±1.0 m of the reference value in obstructed environments, significantly outperforming other following strategies. The feasibility and advantages of the proposed approach were validated with a physical robotic follower in a Mars-like environment.

Keywords

Extravehicular activity assistant robot, astronaut following, movement prediction, obstacle avoidance, deep reinforcement learning, mobile robot navigation

Introduction

Future manned extraterrestrial exploration may include sustained human habitat development and in situ resource utilization.^1,2 These surface exploration missions will inevitably require extravehicular activities (EVAs), which can be time-consuming and dangerous. Cooperation between astronauts and mobile assistant robots therefore offers the potential to increase mission productivity and ensure crew safety during EVA tasks, owing to the robots' mobility, load capacity, and environmental tolerance. Specifically, assistant robots can operate alongside astronauts and perform repetitive tasks or assist with photographing, lighting, tool carrying, material mining, collection and transport, repairs, and alerting astronauts to potential hazards. In the past 20 years, several analog field tests involving astronaut–robot cooperation during EVAs have been conducted, such as the astronaut–rover interaction field test,³ the desert research and technology studies,⁴ and MOONWALK.⁵ These projects have investigated issues of EVA astronaut–robot cooperation, including astronaut following, on-site astronaut–robot interaction, and collaborative operation.^6,7

Stably following an astronaut, that is, maintaining a consistent astronaut–robot distance, is a prerequisite for efficient cooperation in several ways: it enables timely and personalized responses to the astronaut's requests, keeps the distance stable for robust visual tracking of the astronaut, and makes following sustainable. However, maintaining a stable astronaut–robot distance can be challenging in complex extraterrestrial terrain, where rocks or sand can impede the robot. In addition, strictly following the astronaut's trajectory is often not ideal for the robot, owing to differences in size, mobility, and gait. Even if the robot can avoid obstacles, obstacle avoidance and following must be coordinated efficiently to keep the following distance stable. Existing astronaut following strategies typically do not consider following stability in the presence of obstacles: either the robot is directly controlled to maintain a fixed following distance based on stereo-vision-based astronaut detection and localization, without considering hazards,³ or obstacle avoidance is implemented independently of astronaut following.⁶

Techniques for stably following a human have attracted increasing interest in social applications such as family services, elderly care, and medical assistance. One intuitive method is to control the robot to approach the target person directly with the help of vision-based perception techniques.^8–10 For example, Pang et al.⁸ employed a proportional-derivative (PD) controller to maintain a consistent distance and orientation relative to an identified human target. In a comparable study, an end-to-end controller was trained using deep reinforcement learning (DRL), making the robot a robust follower using only monocular images.⁹ Deep learning-based methods have also been applied, owing to their high accuracy and speed in human detection, to improve the robustness and efficiency of robot following.^11,12 These techniques feature simple control systems but have been limited to ideal conditions without obstacle avoidance. Recent efforts that consider obstacle avoidance during human following have involved more complete following frameworks. Wu et al. implemented human following and obstacle avoidance modules separately, applying a priority decider to coordinate robot behavior.¹³ Yun et al. combined three artificial force fields to coordinate human following and obstacle avoidance.¹⁴ In these methods, obstacle avoidance and target following are implemented independently, which may sacrifice optimality. To optimize the following path, the problem can also be treated as a path planning problem and addressed by combining global and local path planning algorithms.^15,16 For instance, Pang et al. used A*¹⁷ and the timed elastic band¹⁸ to identify optimal collision-free following paths for the robot.¹⁵ However, path searching in GPS-denied environments requires real-time obstacle map construction based on 3D LiDAR sensors, which entails high power consumption and computational demand.

Some innovative attempts have been made to improve the intelligence and efficiency of following robots using predictions of the target human's movements.^19,20 For example, Bayoumi and Bennewitz predicted human motion using a modified Markov decision process (MDP)²¹ and developed a DRL-based navigator, which resulted in foresighted following behaviors and shorter following trajectories.¹⁹ Recovery mechanisms based on human trajectory prediction have also been implemented for when the robot loses the target in dynamic environments.²⁰ These studies suggest that predicting user movement can significantly improve following efficiency and the foresightedness of the robot, but how such prediction contributes to following stability remains to be addressed.

This study is devoted to the coordinated planning of obstacle avoidance and astronaut following for EVA assistant robots, so as to improve following stability in obstructed environments. We decompose astronaut following into two processes, namely robot navigation and predictive guidance. A novel predictive-guide strategy is proposed, which combines a DRL navigator and a Kalman filter (KF)-based predictor. The predictor acquires predictive-guide information concerning future astronaut movements, and the navigator is provided with fused data, including this predicted information and depth information acquired from the robot's field of view (FoV), to generate optimized motion sequences for avoiding obstacles and following the astronaut. Simulations were conducted to assess the robot's adaptability to unknown complex environments and to varied speed settings. Comparative tests suggested that this strategy outperformed two other strategies in following stability. Finally, the effectiveness of the proposed technique was verified experimentally using a physical robotic test bed developed by our group. The primary contributions of this study are as follows:

A novel predictive-guide strategy is introduced that enables target following in obstructed environments with improved following stability.

The DRL-based navigator, after being trained in a relatively simple simulation environment, generalizes well to unseen conditions, complex obstacles, variable robot motion settings, and a real-world robotic test bed.

A physical robotic follower is designed and constructed to demonstrate the advantages of our predictive-guide strategy in real-world settings.

Scenario formulation for robotic following of astronauts

This study focuses on developing a predictive following strategy that enables assistant robots to follow astronauts while maintaining a stable distance on obstructed extraterrestrial surfaces. Figure 1 shows an example of predictive following: path 1 shows a possible strategy in which the robot tracks the astronaut's trajectory and approaches the current astronaut position, while path 2 shows a more desirable solution in which the robot finds a shorter path that avoids rocks and approaches the predicted astronaut positions.

Figure 1.

An EVA scene with an astronaut followed by an assistant robot. Path 1: an elementary path tracking the astronaut's trajectory. Path 2: a shorter path that avoids rocks and approaches the predicted astronaut position. EVA: extravehicular activity.

For convenience of explanation, the astronaut's position, the robot's position, and the astronaut's velocity are denoted by $ϕ = (x^{a}, y^{a})^{T}$, $r = (x^{r}, y^{r})^{T}$, and $v = (v^{x}, v^{y})^{T}$, respectively. The astronaut's trajectory up to the kth time step is denoted by $ψ_{k} = \{ϕ_{0}, ϕ_{1}, \dots, ϕ_{k}\} = \{(x_{0}^{a}, y_{0}^{a})^{T}, (x_{1}^{a}, y_{1}^{a})^{T}, \dots, (x_{k}^{a}, y_{k}^{a})^{T}\}$, where $k = 1, 2, 3, \dots$. The astronaut–robot distance is $d = \lVert ϕ - r \rVert$. The objective is to keep d stable near a designated reference value $d^{ref}$ throughout the following process.

A robotic follower was designed and constructed to validate the proposed model, as shown in Figure 2. The robot has a mass of approximately 42.0 kg and an outer envelope of 850 × 710 × 580 mm³. It consists of a dexterous four-wheel differential platform, a pair of self-stabilizing suspension assemblies, an electrical cabin, and an orientation-adjustable stereo camera (further details of the robotic follower can be found at www.researchgate.net/publication/358734507_Data_Brief_A_Dexterous_Smart_Robotic_Follower). The robotic follower provided a suitable test bed for validating numerical settings and evaluating physical experimental results.

Figure 2.

The robotic follower acting as a test bed.

The predictive-guide framework

Astronaut following in obstructed environments can be decomposed into two hierarchical levels: robot navigation and guidance. This section describes a novel predictive-guide framework for controlling the robot during astronaut following.

The predictive-guide framework

The predictive-guide strategy consists of a DRL-powered navigator and a KF-based predictor, as shown in Figure 3. The predictor generates predictive guidance information from the astronaut's trajectory, which is then fused with local obstacle measurements and provided to the navigator for robot behavior planning. We suggest that, with this hierarchical framework, astronaut following and obstacle avoidance can be coordinated simultaneously in a predictive manner.

Figure 3.

The predictive-guide framework.

End-to-end navigator

An end-to-end navigator is built on DRL to carry out reactive motion planning based on the target information and the depth measurements perceived in the robot's FoV. This subsection briefly introduces the principles of DRL, the design of the navigator, and the reward function.

Reinforcement learning model

Reinforcement learning constructs an optimal mapping from states to actions through trial-and-error interaction with the environment, guided by reward signals, much as humans learn to respond to external feedback.²² The deep Q-network (DQN),²³ a representative DRL framework that uses a neural network (the Q-network) to approximate state–action values, provides an attractive end-to-end architecture that maps raw high-dimensional sensing data to robotic action decisions. We modeled the robot navigator on DQN because of its generality and its relatively simple architecture compared with other DRL frameworks. The mathematical model of the navigator, based on an MDP,²¹ is presented below.

At the kth time step, the navigator observes a state $s_{k}$ and selects an action $a_{k}$ from the action set $A$. This decision is guided by the policy $π(θ_{k})$, where $θ_{k}$ represents the weights of the Q-network. The agent then reaches the next state $s_{k+1}$ and obtains an instant reward $R_{k}$. A state–action value $q_{π}(s, a)$ can then be introduced as $q_{π}(s_{k}, a_{k}; θ_{k}) = \mathbb{E}_{π}[G_{k} \mid s_{k}, a_{k}]$ and used to evaluate the policy $π(θ_{k})$, where $G_{k} = \sum_{t=0}^{\infty} γ^{t} R_{k+t}$ is the discounted cumulative reward (return) and $γ \in (0, 1)$ is a discount factor. As the agent learns from its interactions with the environment, $q_{π}(s_{k}, a_{k}; θ_{k})$ can be optimized with the following update, derived from the Bellman equation

\begin{aligned} q_{π}(s_{k}, a_{k}; θ_{k+1}) = {} & q_{π}(s_{k}, a_{k}; θ_{k}) \\ & + α \left[ R_{k} + γ \max_{a'} q_{π}(s_{k+1}, a'; θ_{k}) - q_{π}(s_{k}, a_{k}; θ_{k}) \right] \end{aligned}

where α represents the learning rate. The navigator can then be trained by minimizing the following loss function

\mathcal{L}(θ_{k}) = \mathbb{E}\left[ \left( R_{k} + γ \max_{a'} q_{π}(s_{k+1}, a'; θ_{k}^{-}) - q_{π}(s_{k}, a_{k}; θ_{k}) \right)^{2} \right]

where $θ_{k}^{-}$ denotes the weights of a target Q-network, which are updated (copied from the online Q-network) only once every L steps. More details about the DQN algorithm can be found in Mnih et al.^23,24
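To make the loss concrete, the sketch below computes the targets $R_k + γ \max_{a'} q_π(s_{k+1}, a'; θ_k^-)$ and the mean squared TD error for a minibatch, and copies the online weights into the target network every L steps. It is a minimal NumPy illustration under our own assumptions, not the authors' training code: the `dones` mask (no bootstrapping after a terminal transition) is standard DQN practice but is not stated in the text, and the function names and weight dictionaries are hypothetical.

```python
import numpy as np


def td_targets(rewards, next_q_target, dones, gamma):
    """Return R_k + gamma * max_a' q(s_{k+1}, a'; theta^-) for a minibatch.

    rewards:       shape (B,)
    next_q_target: shape (B, |A|), target-network Q-values of s_{k+1}
    dones:         shape (B,), 1.0 where the episode ended at s_{k+1}
    """
    # No bootstrapping from s_{k+1} when the episode ended there (assumption).
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)


def dqn_loss(q_online, actions, targets):
    """Mean squared TD error between q(s_k, a_k; theta_k) and the targets."""
    q_taken = q_online[np.arange(len(actions)), actions]
    return float(np.mean((targets - q_taken) ** 2))


class TargetSync:
    """Copy the online weights into the target network every L training steps."""

    def __init__(self, L):
        self.L = L
        self.step = 0

    def maybe_sync(self, online_params, target_params):
        self.step += 1
        if self.step % self.L == 0:
            for name, value in online_params.items():
                target_params[name] = value.copy()
            return True
        return False
```

Keeping $θ_k^-$ frozen between synchronizations is what stabilizes the regression target; with `L = 1` the target would simply chase the online network.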

Navigator design

An end-to-end navigator based on DQN was developed for assistant robot behavior planning. Its Q-network, composed of four fully connected layers with rectified linear unit (ReLU) activations, takes as input fused local depth sensing data from the robot's FoV and guide information concerning the astronaut's position. Figure 4 illustrates the information flow in the proposed DRL-based navigator.
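A minimal NumPy sketch of such a Q-network follows. The layer widths, the He initialization, and the linear (non-ReLU) output layer are illustrative assumptions; the text specifies only four fully connected layers with ReLU activations. The input size follows the fused-state dimension $(N + 2) \times k^{Memory}$, and the output holds one Q-value per discrete action.

```python
import numpy as np


def init_q_network(input_dim, hidden_dims, num_actions, seed=0):
    """Four fully connected layers: input -> h1 -> h2 -> h3 -> |A| Q-values."""
    assert len(hidden_dims) == 3, "three hidden layers + output = four FC layers"
    rng = np.random.default_rng(seed)
    dims = [input_dim, *hidden_dims, num_actions]
    params = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        # He initialization, a common choice for ReLU layers (assumption).
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((W, b))
    return params


def q_values(params, state):
    """Forward pass: ReLU after each hidden layer, linear output layer."""
    h = np.asarray(state, dtype=float)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h


def greedy_action(params, state):
    """Index of the action with the largest estimated Q-value."""
    return int(np.argmax(q_values(params, state)))
```

For example, with N = 16, $k^{Memory} = 4$, and five actions (values chosen only for illustration), `init_q_network(72, (64, 64, 64), 5)` builds a network whose `greedy_action` returns an index in 0..4. During training, an ε-greedy wrapper around `greedy_action` would typically be used for exploration.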

Figure 4.

Information flow in the predictive-guide strategy.

Depth feature vectors

A depth feature vector $λ^{N}$, where N is the vector length, was extracted from the depth image to represent local obstacle sensing data efficiently. It is acquired in two steps, as shown in Figure 5. First, ground depth information was removed from the original depth image through pixel similarity comparison with depth images obtained in an open, obstacle-free area. The resulting image was then evenly divided into N striped patches, and the maximum pixel value in each patch was sampled as an element of the depth feature vector, representing the distance from the nearest obstacle in that region to the robot camera plane. The depth feature vector can be expressed as $λ^{N} = (p_{1}, p_{2}, \dots, p_{N-1}, p_{N})^{T}$. We suggest that low-dimensional depth feature vectors are a more efficient representation of the local obstacle situation than raw pixels. Moreover, such an abstracted low-dimensional representation is believed to help narrow the gap between simulated and real-world environments when the method is transferred and applied.²⁵
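The two-step extraction can be sketched as below. This is an illustrative reconstruction, not the authors' code, and several details are assumptions: the image is taken to store metric depth in metres, "ground" pixels are those within a relative tolerance `tol` of an obstacle-free reference image, the strips are vertical (one per horizontal sector of the FoV), and the hypothetical `max_range` stands for "no obstacle". Because metric depth is assumed, the nearest obstacle in a strip is its minimum depth; with an image that encodes proximity as intensity, the maximum pixel value described in the text plays this role instead.

```python
import numpy as np


def depth_feature_vector(depth, ground_depth, N, tol=0.05, max_range=10.0):
    """Compress an H x W depth image (metres) into an N-element vector.

    Step 1: pixels whose depth matches the obstacle-free ground reference
            (within relative tolerance `tol`) are treated as ground and set
            to `max_range`, i.e. "no obstacle".
    Step 2: the image is split into N vertical strips; each element is the
            distance to the nearest remaining obstacle in that strip.
    """
    depth = np.asarray(depth, dtype=float)
    ground = np.isclose(depth, ground_depth, rtol=tol)
    obstacles = np.where(ground, max_range, np.minimum(depth, max_range))
    strips = np.array_split(obstacles, N, axis=1)
    # Metric depth assumed: the nearest obstacle is the minimum in each strip.
    return np.array([strip.min() for strip in strips])
```

With a typical depth camera, N of a few tens already reduces hundreds of thousands of pixels to a compact input for the Q-network.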

Figure 5.

Depth feature vector acquisition.

Guide information

A guide point $G_{k}$, identifying the target astronaut position, is provided to the navigator as a critical state component. Note that the robot actually tracks $G_{k}$ as its target position throughout the following process. Possible choices for $G_{k}$ include the measured position of the astronaut at the current time step ($ϕ_{k}$), the position of the astronaut at a past time step ($ϕ_{k - k^{Lag}}$), or a predicted position of the astronaut at a future time step ($ϕ_{k + k^{Pre}}$). Here, $k^{Pre}$ and $k^{Lag}$ represent the number of time steps ahead of and behind the current moment, respectively. Figure 6 illustrates how these guide points affect astronaut following in the three scenarios outlined below.

Figure 6.

The effects of using different guide points.

$G_{k} = ϕ_{k - k^{Lag}}$: The guide point trails behind the astronaut. Before the astronaut passes the obstacle, the navigator generates a direct approach route, represented by trajectory 1 in Figure 6, and the robot tries to maintain a following spacing of $\lVert ϕ_{k} - ϕ_{k - k^{Lag}} \rVert$. Once the astronaut passes the obstacle, the guide point moves closer to the obstacle, a navigation conflict arises, and the robot waits in place.

$G_{k} = ϕ_{k}$ : The guide point keeps pace with the astronaut. In this case, navigation conflicts will occur as the astronaut approaches an obstacle and the robot will stay in place until the astronaut passes the obstacle. Once the astronaut and guide point begin moving away from the obstacle, the navigator plans a new route, as indicated by trajectory 2.

$G_{k} = ϕ_{k + k^{Pre}}$ : The guide point moves ahead of the astronaut. The navigator will likely generate a route similar to trajectory 3, which bypasses the obstacle and trends toward a predicted astronaut position. In this case, the robot avoids the obstacle simultaneously with the astronaut and thus eliminates wait time.

A combined strategy for guide point determination was developed using the mechanisms discussed above and can be expressed as

G_{k} = \begin{cases} ϕ_{k + k^{Pre}}, & \min(λ^{N}) < d^{OA} \\ ϕ_{k - k^{Lag}}, & \text{otherwise} \end{cases}

where $\min(λ^{N}) < d^{OA}$ is a segmentation criterion and $d^{OA}$ is a threshold for judging the presence of obstacles. The condition $\min(λ^{N}) < d^{OA}$ implies that an obstacle has entered the robot's FoV; in this case, $ϕ_{k + k^{Pre}}$ is used as the guide point for quick, predictive obstacle avoidance. Otherwise, $ϕ_{k - k^{Lag}}$ is used to maintain the robot–astronaut spacing.
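The combined guide-point rule amounts to a single branch on the nearest depth reading, sketched below. The function name and argument layout are hypothetical; the predicted position is assumed to come from the KF predictor described later, and clamping the lag index to the first measurement early in an episode is our own assumption, since the text does not say what happens before $k^{Lag}$ measurements exist.

```python
def select_guide_point(trajectory, predicted, depth_features, d_oa, k_lag):
    """Combined guide-point rule G_k.

    trajectory:     measured astronaut positions [phi_0, ..., phi_k]
    predicted:      predicted position phi_{k + k^Pre}
    depth_features: current depth feature vector lambda^N
    """
    if min(depth_features) < d_oa:
        # Obstacle in the FoV: aim at the predicted future astronaut position.
        return predicted
    # No obstacle: trail the astronaut by k^Lag steps to keep the spacing.
    # Clamp to phi_0 while fewer than k^Lag past measurements exist (assumption).
    lag_index = max(len(trajectory) - 1 - k_lag, 0)
    return trajectory[lag_index]
```

For a trajectory of ten samples and $k^{Lag} = 8$, the lagged guide point is the second sample, $ϕ_1$.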

State fusion with multiple moments

Generally, the limited perception of the robot's forward-facing FoV and the high dynamics caused by the robot's own motion can impair obstacle avoidance planning in complex environments. We suggest that fusing states from consecutive time steps can compensate for these issues. Specifically, to overcome incomplete perception, the depth feature vectors and guide information from the last $k^{Memory}$ steps are fused as

s_{k} = \{ G_{k - k^{Memory} + 1}; λ_{(k - k^{Memory} + 1)}^{N}; G_{k - k^{Memory} + 2}; λ_{(k - k^{Memory} + 2)}^{N}; \dots; G_{k}; λ_{(k)}^{N} \}

The dimension of $s_{k}$ is therefore $(N + 2) \times k^{Memory}$, since each step contributes a two-dimensional guide point and an N-element depth feature vector.
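A fixed-length buffer is a natural way to maintain this sliding window. The sketch below uses `collections.deque` with `maxlen` so that the oldest step is discarded automatically; the class name is hypothetical, and padding the window by repeating the oldest frame at the start of an episode is our own assumption, as the text does not specify how the first $k^{Memory} - 1$ steps are handled.

```python
from collections import deque


class StateFuser:
    """Stack (G, lambda^N) pairs from the last k^Memory steps into one flat state."""

    def __init__(self, k_memory, N):
        self.N = N
        self.frames = deque(maxlen=k_memory)

    def push(self, guide_point, depth_features):
        if len(guide_point) != 2 or len(depth_features) != self.N:
            raise ValueError("expected a 2-D guide point and N depth features")
        # The oldest frame is dropped automatically once k^Memory frames are stored.
        self.frames.append(list(guide_point) + list(depth_features))

    def state(self):
        if not self.frames:
            raise ValueError("no observations pushed yet")
        frames = list(self.frames)
        # Early in an episode, pad by repeating the oldest frame (assumption).
        while len(frames) < self.frames.maxlen:
            frames.insert(0, frames[0])
        # Oldest step first, matching the ordering of s_k above.
        return [value for frame in frames for value in frame]
```

The returned list always has length $(N + 2) \times k^{Memory}$, so it can be fed directly to a fixed-size Q-network input.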

Action set

The action set of the robotic follower consists of discrete movement commands ($a_{i} \in A$, $i = 1, 2, \dots$) corresponding to robot motions such as driving ahead and steering at different speeds. The robot approaches the astronaut and avoids obstacles by executing action sequences planned by the navigator.

Integrated reward

A dense integrated reward was designed to optimize the DRL-based strategy for guide point following, obstacle avoidance, and energy saving. Its components are described below.

Reward for guide point following

A large positive reward $R^{arrival}$ is assigned during strategy optimization if the robot arrives at the guide point. Otherwise, the reward is proportional to the decrease in the distance between the robot and the guide point over two consecutive steps. The guide point following term $R_{k}^{GF}$ can therefore be written as

R_{k}^{GF} = \begin{cases} R^{arrival}, & d_{k}^{G} \leq D^{arrival} \\ K^{GF} (d_{k-1}^{G} - d_{k}^{G}), & \text{otherwise} \end{cases}

where $D^{arrival}$ is a distance threshold for determining arrival at the guide point and $d_{k}^{G}$ is the distance from the robot to the guide point $G_{k}$. $R^{arrival}$ is a positive constant and $K^{GF}$ is a scaling factor for the change in $d^{G}$; both are hyperparameters.
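The guide point following term can be sketched as a small function that also returns the current distance, so the caller can pass it back as $d_{k-1}^{G}$ at the next step. The function name and argument order are hypothetical, and the hyperparameter values in the usage below are placeholders rather than the values used in the study.

```python
import math


def guide_following_reward(robot, guide, prev_dist, r_arrival, d_arrival, k_gf):
    """Return (R^GF_k, d^G_k) for 2-D robot and guide-point positions.

    prev_dist is d^G_{k-1}, carried over from the previous control step.
    """
    dist = math.dist(robot, guide)
    if dist <= d_arrival:
        return r_arrival, dist
    # Positive when the robot moved closer to the guide point, negative otherwise.
    return k_gf * (prev_dist - dist), dist
```

For instance, a robot at the origin with the guide point at (3, 4) is 5 m away; if it was 6 m away at the previous step and $K^{GF} = 2$, the shaped reward is 2.0.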

Reward for obstacle avoidance

The robot is considered at risk of collision if an obstacle enters its FoV. A function $F^{safety} = η^{safety} λ^{N}$, the weighted sum of the depth feature vector elements, is constructed to evaluate the robot's instantaneous safety, where $η^{safety} = [η^{1}, η^{2}, \dots, η^{N}]$ contains the weights of the elements. Intuitively, obstacles in the center of the robot's FoV pose a greater threat to its safety than those at the periphery. Therefore, $η^{i}$ is determined by an axisymmetric piecewise-linear curve, as shown in Figure 7, and can be written as

η^{i} = \begin{cases} ρ (i - \frac{N}{2} - 1) + 1, & \frac{N}{2} < i \leq \frac{3N}{4} \\ ρ (\frac{N}{2} - i) + 1, & \frac{N}{4} < i \leq \frac{N}{2} \\ 0, & i \leq \frac{N}{4} \text{ or } i > \frac{3N}{4} \end{cases}

where ρ is the slope of the curve, N is the length of $λ^{N}$, and $i = 1, 2, \dots, N$. The obstacle avoidance reward $R_{k}^{OA}$ is assigned in proportion to the change in $F^{safety}$ between two consecutive steps

R_{k}^{O A} = K^{O A} (F_{k}^{safety} - F_{k - 1}^{safety})

where $K^{OA}$ is an amplification factor. A large negative reward $R^{fail}$ is assigned for any collision. $R_{k}^{OA}$ can therefore be written as

R_{k}^{OA} = \begin{cases} K^{OA} (η^{safety} λ_{k}^{N} - η^{safety} λ_{k-1}^{N}), & d_{k}^{obst} > D^{collision} \\ R^{fail}, & d_{k}^{obst} \leq D^{collision} \end{cases}

where $d_{k}^{obst}$ is the distance from the robot to the nearest obstacle, $D^{collision}$ is a distance threshold for determining a collision, and $K^{OA}$ acts as a scaling factor for changes in $F^{safety}$. $R_{k}^{OA}$ plays an important role in learning steering behaviors, because steering produces only slight position changes and therefore rarely activates $R_{k}^{GF}$.
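The safety weights and the obstacle avoidance term can be sketched as follows. The weight function uses interval boundaries chosen so that the curve is exactly axisymmetric, $η^{i} = η^{N+1-i}$ for $i = 1, \dots, N$, with the outer quarters set to zero; N is assumed divisible by 4. The text does not state the sign of ρ; in this sketch a negative ρ makes the weights peak at the center of the FoV, which matches the stated intent. Function names and the numbers in the usage below are hypothetical.

```python
def safety_weights(N, rho):
    """Axisymmetric piecewise-linear weights eta^i, i = 1..N (N divisible by 4)."""
    weights = []
    for i in range(1, N + 1):
        if N / 4 < i <= N / 2:
            weights.append(rho * (N / 2 - i) + 1)
        elif N / 2 < i <= 3 * N / 4:
            weights.append(rho * (i - N / 2 - 1) + 1)
        else:
            weights.append(0.0)  # outer quarters of the FoV are ignored
    return weights


def obstacle_avoidance_reward(weights, feat, prev_feat, d_obst, k_oa, r_fail, d_collision):
    """R^OA_k from the change in F^safety = eta^safety . lambda^N."""
    if d_obst <= d_collision:
        return r_fail
    f_now = sum(w * p for w, p in zip(weights, feat))
    f_prev = sum(w * p for w, p in zip(weights, prev_feat))
    # Positive when weighted obstacle distances grow, i.e. the robot got safer.
    return k_oa * (f_now - f_prev)
```

For N = 8 and ρ = −0.5, the weights are [0, 0, 0.5, 1, 1, 0.5, 0, 0].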

Figure 7.

The sum weight curve of the safety function. The weights at both ends are set to 0 to suppress disturbances from obstacles at the edges of the robot's FoV. FoV: field of view.

Reward for energy optimization

To shorten the robot's path, a penalty $R^{fail}$ is also assigned if the number of control steps reaches the maximum allowed within a training episode, $k^{episode}$. Meanwhile, a constant cost $R^{step}$ is applied at each robot control step. The energy optimization reward $R_{k}^{EO}$ can therefore be expressed as

R_{k}^{EO} = \begin{cases} R^{step}, & k < k^{episode} \\ R^{fail}, & k = k^{episode} \end{cases}

The final comprehensive reward was calculated by summing the three rewards discussed above

R_{k} = R_{k}^{G F} + R_{k}^{O A} + R_{k}^{E O}
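The energy term and the final sum can be sketched as follows. The sketch treats the last allowed step, $k = k^{episode}$, as receiving $R^{fail}$ instead of $R^{step}$, so the two cases do not overlap; function names and the placeholder values in the usage below are hypothetical.

```python
def energy_reward(k, k_episode, r_step, r_fail):
    """R^EO_k: constant per-step cost, penalty once the step budget is used up."""
    return r_fail if k >= k_episode else r_step


def total_reward(r_gf, r_oa, r_eo):
    """R_k = R^GF_k + R^OA_k + R^EO_k."""
    return r_gf + r_oa + r_eo
```

Because $R^{step}$ is charged at every step, an agent that reaches the guide point in fewer steps accumulates less cost, which is how this term encourages shorter paths.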

Astronaut movement prediction

The predicted astronaut position $ϕ_{k + k^{Pre}}$ was obtained from the currently measured trajectory using a modified KF. Assuming constant-velocity motion, the astronaut's movements can be modeled as follows

\begin{cases} x_{k} = A x_{k-1} + w_{k} \\ z_{k} = H x_{k} + u_{k} \end{cases}

In the above expression, $x_{k} = [ϕ_{k}^{T}, v_{k}^{T}]^{T}$ represents the motion state and $A$ is the state transition matrix given by

A = \begin{bmatrix} 1 & 0 & Δt & 0 \\ 0 & 1 & 0 & Δt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

where $Δt$ is the interval between discrete time steps, $z_{k} = [{ϕ'_{k}}^{T}, {v'_{k}}^{T}]^{T}$ is a measurement of the astronaut's state, $H$ is the measurement matrix, $w_{k}$ denotes the model noise, and $u_{k}$ denotes the measurement noise. Both noise terms are assumed to be zero-mean white Gaussian noise with covariance matrices $Q = \mathbb{E}[w_{k} w_{k}^{T}]$ and $R = \mathbb{E}[u_{k} u_{k}^{T}]$, respectively. The optimal estimate of the astronaut's state, $\hat{x}_{k} = [\hat{ϕ}_{k}^{T}, \hat{v}_{k}^{T}]^{T}$, is determined in the following stages

\begin{cases} \hat{x}_{k}^{-} = A \hat{x}_{k-1} \\ P_{k}^{-} = A P_{k-1} A^{T} + Q \\ K_{k} = P_{k}^{-} H^{T} (H P_{k}^{-} H^{T} + R)^{-1} \\ \hat{x}_{k} = \hat{x}_{k}^{-} + K_{k} (z_{k} - H \hat{x}_{k}^{-}) \\ P_{k} = (I - K_{k} H) P_{k}^{-} \end{cases}

In these expressions, $\hat{x}_{k}^{-}$ is the a priori state estimate, $P_{k}^{-}$ is the a priori estimate error covariance matrix, $K_{k}$ is the Kalman gain matrix, $H$ is an identity matrix, and $P_{k}$ is the a posteriori estimate error covariance matrix at time step k. Future astronaut states can be predicted by $\hat{x}_{k + k^{Pre}} = A^{k^{Pre}} \hat{x}_{k}$. Note that the astronaut's position $ϕ'_{k}$ can be measured using stereo vision and deep learning-based person detection and tracking techniques, as reported by Zhang et al.²⁶ The astronaut's velocity can then be obtained from $v'_{k} = (ϕ'_{k} - \hat{ϕ}_{k-1}) / Δt$.
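The predictor can be sketched as a constant-velocity Kalman filter with $H = I$, where the velocity "measurement" is the finite difference $v'_k = (ϕ'_k - \hat{ϕ}_{k-1})/Δt$ described above. This is a minimal NumPy sketch, not the authors' modified KF: the isotropic noise levels `q` and `r`, the initial covariance, and starting the first estimate at rest are illustrative assumptions.

```python
import numpy as np


class ConstantVelocityKF:
    """Kalman filter on x = [x^a, y^a, v^x, v^y]^T with H = I."""

    def __init__(self, dt, q=1e-3, r=1e-2):
        self.dt = dt
        self.A = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.eye(4)
        self.Q = q * np.eye(4)  # model-noise covariance (isotropic, assumed)
        self.R = r * np.eye(4)  # measurement-noise covariance (isotropic, assumed)
        self.x = None
        self.P = np.eye(4)

    def update(self, phi_meas):
        """Fuse a new position measurement phi'_k; return the posterior state."""
        phi_meas = np.asarray(phi_meas, dtype=float)
        if self.x is None:
            # First measurement: start at rest (assumption).
            self.x = np.concatenate([phi_meas, np.zeros(2)])
            return self.x
        # Velocity pseudo-measurement v'_k = (phi'_k - phi_hat_{k-1}) / dt.
        v_meas = (phi_meas - self.x[:2]) / self.dt
        z = np.concatenate([phi_meas, v_meas])
        x_prior = self.A @ self.x
        P_prior = self.A @ self.P @ self.A.T + self.Q
        K = P_prior @ self.H.T @ np.linalg.inv(self.H @ P_prior @ self.H.T + self.R)
        self.x = x_prior + K @ (z - self.H @ x_prior)
        self.P = (np.eye(4) - K @ self.H) @ P_prior
        return self.x

    def predict_position(self, k_pre):
        """phi_hat_{k + k^Pre}: position part of A^{k^Pre} x_hat_k."""
        return (np.linalg.matrix_power(self.A, k_pre) @ self.x)[:2]
```

For example, feeding positions of an astronaut walking at 0.25 m per step along x and then calling `predict_position(8)` extrapolates about 2 m ahead of the latest estimate once the velocity estimate has converged.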

Numerical analysis

The numerical work covers navigator training, the generalizability of the method, and a comparative analysis of the performance of the proposed predictive-guide following strategy.

Navigator training settings and results

The navigator was trained in a cross-platform simulation environment based mainly on V-REP²⁷ (see Appendix 1 for details of the training settings). An environment containing three clusters of obstacles was constructed to train the navigator, as depicted in Figure 8(a). At the beginning of each training episode, the robot's location was randomly initialized, while the guide point remained at a fixed position throughout all episodes. An episode is considered successful only if the robot arrives at the guide point within the maximum number of steps. Using piecewise hyperparameters, the navigator was trained to a success rate of 95.0%. The success rate and the average Q value over the training episodes are plotted in Figure 9, illustrating the optimization process of the navigator.

Figure 8.

Obstacle environments used for navigator training and testing: (a) environment 1 for training, (b) environment 2 with regularly shaped obstacles, and (c) environment 3 with irregular rock-like obstacles.

Figure 9.

Training curves. The red curve is for success rate over the training episodes and the green for the average Q value.

Generalizability analysis

The generalization ability of a learning-based algorithm reflects its adaptability to, and feasibility in, unfamiliar environments. The following subsections test the adaptability of the trained navigator to several untrained environments and to varied motion parameters in simulation.

Navigator generalizability in unseen environments

The trained navigator was applied to unseen environments to evaluate its generalizability. The three obstacle layouts used are shown in Figure 8. Environment 1 is the training environment, while environments 2 and 3 are unknown to the robot and contain clusters of regularly shaped obstacles and irregular rock-like obstacles, respectively.

We conducted 1000 repeated navigation tests in each environment to evaluate the adaptability of the navigator. Note that the navigator was not trained further in the two novel environments; it relied solely on the model acquired during training. The positions of the robot and the guide point were randomly initialized at the beginning of each test episode. As shown in Table 1, the trained navigator adapted well to environment 2, achieving a success rate comparable to that in the training environment. Moreover, it generalized robustly to environment 3, maintaining a success rate of about 90% among rock-like obstacles with complex shapes and variable sizes.

Table 1.

Navigation success rates in three different environments.

Environment	Characteristics	Navigation success rate
1	The training environment	0.9514
2	An unknown environment with regularly shaped obstacles	0.9486
3	An unknown environment with irregular rock-like obstacles	0.8997

Navigator generalizability for various motion settings

Investigating the robustness of the DRL-based navigator to different robot speeds is important, because accurately controlling the robot on extraterrestrial surfaces is challenging without precise models of friction and ground softness. The robot was trained to drive ahead and steer with maximum wheel tangential speeds of 0.5 m s⁻¹ and 0.7 m s⁻¹, respectively. In the evaluation episodes, a multiplier factor was applied to every motion in the robot's action set to vary the robot's speed. One thousand navigation test episodes were conducted in environment 2, and the resulting success rates are plotted against the speed multiplier factor in Figure 10. Multiplier factors as low as 0.5 and as high as 2.0 produced success rates of no less than 85% in this untrained environment, indicating that the trained navigator is robust to varying robot speeds.

Figure 10.

Success rates for various robot speeds. The multiplier factor equal to 1.00 stands for a driving speed of 0.5 m s⁻¹ and a steering speed of 0.7 m s⁻¹ of the robot.

Astronaut following performance

This subsection presents numerical analyses of the proposed predictive-guide following strategy, including a validation of its effectiveness, a comparative analysis of following stability, and a supplementary analysis of the quantitative influence of the robot's and the astronaut's speeds.

Simulation of astronaut following

The effectiveness of the proposed predictive-guide strategy for astronaut following was first validated in simulation. The astronaut was set to move along a preset trajectory at a constant speed of 0.25 m s⁻¹, while the assistant robot attempted to avoid obstacles and keep up with the astronaut. Two examples, in which the astronaut walks along straight and circular trajectories, are shown in Figure 11. The predictive-guide strategy was configured with $k^{Pre} = 8$ and $d^{OA} = 2.0$ m. Notably, the robot quickly restored the following distance after the astronaut passed near an obstacle or traversed a narrow channel.

Figure 11.

Following trajectories. The astronaut walks along a (a) straight trajectory and (b) circular trajectory.

Comparative analysis

Predictive knowledge of the astronaut’s future movements, one of the primary contributions of this study, was included in the predictive-guide strategy to assist the navigator in planning an ideal following sequence, which was expected to improve following stability. Quantitative analyses were conducted to determine the advantages of the proposed technique. The OA-first and target-guide techniques, based on the trained DRL navigator, were included for comparison purposes, as discussed below.

OA-first: This strategy uses the same trained DRL navigator, but obstacle avoidance and astronaut approaching are implemented independently. During astronaut approaching, we set $G_{k} = ϕ_{k - k^{Lag}}$ and $λ^{N} = \vec{0}$ to maintain a fixed following distance. During obstacle avoidance, $G_{k}$ is set to a point on the edge of the robot's FoV on the side farthest from the nearest detected obstacle, which is considered the safest navigation direction according to current measurements.

Target-guide: This strategy guides the DRL navigator using the astronaut's measured position, $G_{k} = ϕ_{k - k^{Lag}}$, without utilizing predictive knowledge of future astronaut movements; this is its only difference from the predictive-guide strategy.

Each of these strategies was used to control the robot as it followed the astronaut through complex terrain. The astronaut was set to walk along a 50.0 m straight trajectory at 0.25 m s⁻¹. Each test episode began with the same initial robot position and the same preset astronaut trajectory. The parameters were set to $k^{Pre} = k^{Lag} = 8$ and $d^{OA} = 2.0$ m, corresponding to a reference following distance of $d^{ref} = 2.0$ m. Figure 12 illustrates the resulting trajectories of the robot and astronaut. All three strategies safely guided the robot after the astronaut through the obstacles. However, Figure 13 shows that the predictive-guide method outperformed the other two in following-distance stability, keeping the following distance between 2.0 m and 3.0 m throughout the simulation. With the other two strategies, the robot took longer to avoid obstacles, so the astronaut–robot distance exceeded 6.0 m in some cases (e.g. at 200 s and 260 s).
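In addition to the lagged guide point, the predictive-guide strategy draws on predicted astronaut positions over a horizon of $k^{Pre}$ steps. According to the paper, its predictor is Kalman-filter based. The sketch below assumes a planar constant-velocity motion model, hypothetical noise settings, and a 1.0 s step. It shows only how $k^{Pre} = 8$ future positions could be produced. How the predictions are fed to the navigator is not modeled here.

```python
import numpy as np

class ConstantVelocityKF:
    """Planar constant-velocity Kalman filter with state x = [px, py, vx, vy]."""

    def __init__(self, dt: float = 1.0, q: float = 0.01, r: float = 0.05):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # only position is measured
        self.Q = q * np.eye(4)   # process noise (hypothetical value)
        self.R = r * np.eye(2)   # measurement noise (hypothetical value)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def step(self, z) -> None:
        """One predict-update cycle with a measured astronaut position z = [px, py]."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        innovation = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P

    def forecast(self, k_pre: int) -> np.ndarray:
        """Roll the motion model forward k_pre steps without new measurements."""
        x, out = self.x.copy(), []
        for _ in range(k_pre):
            x = self.F @ x
            out.append(x[:2].copy())
        return np.array(out)

kf = ConstantVelocityKF(dt=1.0)
for k in range(60):                  # synthetic noise-free straight walk at 0.25 m/s
    kf.step([0.25 * k, 0.0])
future = kf.forecast(k_pre=8)        # predicted positions for the next 8 steps
print(future[-1])
```

With these synthetic measurements, the final forecast lies close to (16.75, 0). This point is 8 further steps of 0.25 m beyond the last observation at 14.75 m.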

Figure 12.

Motion trajectories generated by the three following strategies. The simulated astronaut walks at 0.25 m s⁻¹ and the reference following distance is set to 2.0 m.

Figure 13.

Astronaut–robot distances for the three following strategies. The astronaut walks at a constant speed of 0.25 m s⁻¹ and the reference following distance is set to 2.0 m.

Influence of speed ratios

Intuitively, a faster robot can more easily keep up with a walking astronaut in an obstructed environment. In this section, the ratio of the robot’s speed to the astronaut’s walking speed is investigated quantitatively as a key factor affecting following-distance stability. Figure 14 depicts astronaut following results for various speed ratios. As the speed ratio decreased, the average astronaut–robot distance increased significantly. Once the ratio dropped to 1.4, the robot struggled to maintain a stable relative distance. In that case, both the target-guide and predictive-guide strategies produced maximum distances exceeding 7.0 m. The OA-first simulation terminated after 200 s because the astronaut–robot distance had diverged beyond the navigator’s trained coverage. In summary, higher robot speeds make astronaut following more stable. Figure 14(b) further shows that faster speeds can reduce the robot’s cumulative mileage during following, because the robot avoids obstacles and follows the astronaut more efficiently. Finally, Figure 15 plots the mean deviation $|E(d) - d^{ref}|$ of the astronaut–robot distance against the speed ratio, showing that stability improves markedly as the speed ratio increases. With a threshold of $C = 2.0$, the speed ratio should be at least 1.6 for stable astronaut following.
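The stability metric and threshold rule above can be expressed in a few lines. The sketch below makes two assumptions. First, "stable" means a mean deviation strictly below $C$. Second, the minimum admissible ratio is the smallest tested ratio from which every higher tested ratio also passes. The distance samples in the example are invented for illustration and are not the paper’s data.

```python
from statistics import mean
from typing import Dict, Optional, Sequence

def mean_deviation(distances: Sequence[float], d_ref: float) -> float:
    """|E(d) - d_ref|: gap between the average following distance and the reference."""
    return abs(mean(distances) - d_ref)

def min_stable_speed_ratio(runs: Dict[float, Sequence[float]], d_ref: float,
                           threshold_c: float) -> Optional[float]:
    """Smallest tested speed ratio such that it and all larger ratios stay below C."""
    best = None
    for ratio in sorted(runs, reverse=True):      # scan from the fastest robot downward
        if mean_deviation(runs[ratio], d_ref) >= threshold_c:
            break                                 # first failure ends the stable range
        best = ratio
    return best

# Invented distance samples (metres) for three speed ratios.
runs = {1.4: [2.0, 6.0, 8.0], 1.6: [2.5, 3.5, 4.0], 2.0: [2.0, 2.5, 3.0]}
print(min_stable_speed_ratio(runs, d_ref=2.0, threshold_c=2.0))  # 1.6
```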

Figure 14.

Influence of the ratio of the robot’s speed to the astronaut’s walking speed. The astronaut walks at a constant speed of 0.25 m s⁻¹ and the reference following distance is set to 2.0 m: (a) astronaut–robot distance and (b) cumulative robot mileage.

Figure 15.

Statistical analysis of astronaut–robot distance over speed ratios. C = 2.0 indicates a classical threshold for a stable following distance.

Experimental results and discussion

The performance of the proposed method was evaluated through real-world comparative experiments. As shown in Figure 16, we conducted field tests with the developed robotic follower and an analog astronaut at a Mars-like site in Huzhou, China, in February 2022. A ZED 2 camera²⁸ mounted on the front of the robot detected the astronaut and collected depth information about obstacles. An efficient DCNN-based human detection algorithm²⁶ was adopted to locate the astronaut and estimate the walking trajectory. The following strategies and the astronaut detection module ran on an onboard NVIDIA Jetson TX2 controller.²⁹ In addition, an ultra-wideband (UWB) location system accurately recorded the trajectories of the astronaut and the robot, as well as the obstacle positions. This system was used solely to obtain quantitative experimental results.

Figure 16.

The experimental scenario at a Mars-like site. The robot running the predictive-guide strategy tries to follow the analog astronaut walking along a winding path with rocks and slopes. UWB tags are installed on the astronaut’s helmet and the robot’s electrical box to record the trajectories. UWB: ultra-wideband.

Three following strategies were compared: the OA-first, target-guide, and predictive-guide strategies. In each test, the analog astronaut was instructed to walk along a winding path with rocks and slopes at an average speed of 0.25 m s⁻¹. The robotic follower was set to a maximum speed of approximately 0.5 m s⁻¹, with a control period of 1.0 s. The trajectories of the astronaut and the robot are plotted in Figure 17, and the following distances under the three strategies are plotted in Figure 18. The following observations can be made:

Rocks and slopes may interfere with the robot’s following motion, temporarily increasing the astronaut–robot distance. This is illustrated in Figure 18 at about 20 s and 70 s.

The OA-first strategy coordinates target following and obstacle avoidance poorly. Its following distance diverged irreversibly, and the robot lost the astronaut after climbing the slope. With the target-guide or predictive-guide strategy, by contrast, the robot responded promptly and continuously followed the astronaut.

When a narrow passage allows only the astronaut to pass, the robot can autonomously go around on the side with smaller obstacles and resume following afterward.

The predictive-guide strategy is the most effective at countering obstacle interference, keeping the variation in following distance within 1.3 m. In other words, predictive guidance enables more stable astronaut following in the presence of obstacles.
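The distance curves in Figure 18 and the variation figure above could be derived from the UWB logs roughly as follows. The sketch assumes two things the paper does not state. The astronaut and robot samples are taken to be time-synchronized planar positions, and "variation" is read as the max-minus-min spread of the following distance over a run. The sample positions are invented.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def following_distances(astronaut_xy: Sequence[Point],
                        robot_xy: Sequence[Point]) -> List[float]:
    """Euclidean astronaut-robot distance at each synchronized UWB sample."""
    if len(astronaut_xy) != len(robot_xy):
        raise ValueError("trajectories must have the same number of samples")
    return [math.dist(a, r) for a, r in zip(astronaut_xy, robot_xy)]

def distance_variation(distances: Sequence[float]) -> float:
    """Spread (max - min) of the following distance over one run."""
    return max(distances) - min(distances)

# Invented samples for illustration only.
astronaut = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
robot = [(-2.0, 0.0), (-1.5, 0.0), (-1.0, 0.0)]
d = following_distances(astronaut, robot)
print(d, distance_variation(d))  # [2.0, 2.5, 3.0] 1.0
```

Real UWB logs would typically need resampling onto a common time base before this pairwise comparison.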

Figure 17.

Trajectories of the analog astronaut and the robot under the three following strategies.

Figure 18.

Following distances under the three following strategies.

Conclusion

In this study, a novel astronaut following strategy was proposed for extraterrestrial assistant robots. It exploits predicted astronaut movements to follow the astronaut while avoiding obstacles, without requiring dynamic modeling or global path planning. The framework takes local depth-sensor data and predictive-guide information about the astronaut as inputs, which improves foresight and adaptability. Although trained in a relatively simple simulated environment, the method generalizes well to unseen scenarios in both simulation and the real world. A comparative numerical analysis showed that the proposed predictive-guide strategy kept the following distance within ±1.0 m of the reference value in obstructed environments, significantly outperforming the other techniques. A physical robotic follower was also designed and built. It verified the effectiveness of the predictive-guide algorithm and the practicality of the integrated system. This work represents a significant step toward improving astronaut–robot cooperation during EVAs, and the approach could also be applied to social robotics.

In future work, recent DRL algorithms for continuous action spaces will be studied to achieve finer control of the robotic follower, which could improve the smoothness of the following process. In addition, the presented experiments were conducted on paved ground. The performance of our following strategy therefore needs further investigation on rugged terrain.

Footnotes

Appendix 1

Acknowledgments

The authors would like to thank Dr Lin Zhang of the University of Cincinnati for general advice on the English writing of this article. The authors would also like to thank the Huzhou Institute of Zhejiang University for funding this research.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Huzhou Institute of Zhejiang University under the Huzhou Distinguished Scholar Program (ZJIHI—KY0016).

ORCID iD

Ruijun Hu

References

1. Smith, Craig, Herrmann, et al. The Artemis program: an overview of NASA’s activities to return humans to the moon. In: IEEE aerospace conference proceedings, Big Sky, MT, USA, 7–14 March 2020, pp. 1–10.

2. Gibney. How to build a Moon base. Nature 2018; 562: 474–478.

3. Trevino, Kosmo, Ross, et al. First astronaut-rover interaction field test. SAE Technical Papers, 2000. Epub ahead of print 2000. DOI: 10.4271/2000-01-2482.

4. Ross, Kosmo, Janoiko. Historical synopses of desert RATS 1997-2010 and a preview of desert RATS 2011. Acta Astronaut 2013; 90: 182–202.

5. Imhof, Hogle, Davenport, et al. Project Moonwalk: lessons learnt from testing human robot collaboration scenarios in a lunar and Martian simulation. In: Proceedings of the international astronautical congress (IAC), Adelaide, Australia, 25–29 September 2017, pp. 5465–5476.

6. Burridge, Graham, Shillcutt, et al. Experiments with an EVA assistant robot. In: Seventh international symposium on artificial intelligence, robotics and automation in space, Nara, Japan, 19–23 May 2003.

7. Schwendner, Höckelmann, Schröer, et al. Surface exploration analogue simulations with a crew support robot. In: IAA space exploration conference, Washington, DC, US, 9–10 January 2014.

8. Pang, Zhang, et al. A human-following approach using binocular camera. In: 2017 IEEE international conference on mechatronics and automation, Takamatsu, Kagawa, Japan, 6–9 August 2017, pp. 1487–1492.

9. Pang, Zhang, Coleman, et al. Efficient hybrid-supervised deep reinforcement learning for person following robot. J Intell Robot Syst Theory Appl 2020; 97: 299–312.

10. Cosenza, Nicolella, Esposito, et al. Mechanical system control by RGB-D device. Machines 2020; 9: 3.

11. Pang, Cao, et al. A robust visual person-following approach for mobile robots in disturbing environments. IEEE Syst J 2020; 14: 2965–2968.

12. Rui, Zhaokui, Yulin. A person-following nanosatellite for in-cabin astronaut assistance: system design and deep-learning-based astronaut visual tracking implementation. Acta Astronaut 2019; 162: 121–134.

13. Jen, Tsou, et al. Accompanist detection and following for wheelchair robots with fuzzy controller. In: 2012 international conference on advanced mechatronic systems, Tokyo, Japan, 18–21 September 2012, pp. 638–643.

14. Yun, Kim, Lee. Person following with obstacle avoidance based on multi-layered mean shift and force field method. In: Conference proceedings of the IEEE international conference on systems, man and cybernetics, Istanbul, Turkey, 10–13 October 2010, pp. 3813–3816.

15. Pang, Cao, et al. A collision-free person-following approach based on path planning. In: 2020 IEEE international conference on mechatronics and automation, Beijing, China, 2–5 August 2020, pp. 327–331.

16. Algabri, Choi M-T. Deep-learning-based indoor human following of mobile robot using color feature. Sensors 2020; 20: 2699.

17. Hart, Nilsson, Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans Syst Sci Cybern 1968; 4: 100–107.

18. Rosmann, Feiten, Wosch, et al. Efficient trajectory optimization using a sparse model. In: 2013 European conference on mobile robots (ECMR 2013), Barcelona, Spain, 25–27 September 2013, pp. 138–143.

19. Bayoumi, Bennewitz. Learning optimal navigation actions for foresighted robot behavior during assistance tasks. In: Proceedings of the IEEE international conference on robotics and automation, Stockholm, Sweden, 16–21 May 2016, pp. 207–212.

20. Lee, Choi, Baek, et al. Robust human following by deep Bayesian trajectory prediction for home service robots. In: Proceedings of the IEEE international conference on robotics and automation, Brisbane, Australia, 21–25 May 2018, pp. 7189–7195.

21. Chang, Marcus. Markov Decision Processes. In: Simulation-Based Algorithms for Markov Decision Processes. London: Springer, 2013, pp. 1–17.

22. François-Lavet, Henderson, Islam, et al. An Introduction to Deep Reinforcement Learning. Foundations and Trends® in Machine Learning 2018; 11(3–4): 219–354.

23. Mnih, Kavukcuoglu, Silver, et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013 (accessed 20 June 2022).

24. Mnih, Kavukcuoglu, Silver, et al. Human-level control through deep reinforcement learning. Nature 2015; 518: 529–533.

25. Tai, Paolo, Liu. Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. In: IEEE international conference on intelligent robots and systems, Vancouver, British Columbia, Canada, 24–28 September 2017, pp. 31–36.

26. Zhang, Wang, Zhang. Astronaut visual tracking of flying assistant robot in space station based on deep learning and probabilistic model. Int J Aerosp Eng 2018; 2018: 1–17.

27. Rohmer, Singh SPN, Freese. V-REP: a versatile and scalable robot simulation framework. In: IEEE international conference on intelligent robots and systems, Tokyo, Japan, 3–7 November 2013, pp. 1321–1326.

28. STEREOLABS. ZED 2 - AI stereo camera, www.stereolabs.com/zed-2 (2021, accessed 18 February 2022).

29. NVIDIA. NVIDIA Jetson TX2: high performance AI at the edge, www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2 (2022, accessed 18 February 2022).

Toward stable astronaut following of extravehicular activity assistant robots using deep reinforcement learning
