Abstract
Reinforcement Learning (RL) has been considered a promising method to enable the automation of contact-rich manipulation tasks, which can increase capabilities for industrial automation. RL facilitates autonomous agents’ learning to solve environments with complex dynamics with little human intervention, making it easier to implement control strategies for contact-rich tasks compared to traditional control approaches. Further, RL-based robotic control has the potential to transfer policies between task variations, significantly improving scalability compared to existing methods. However, RL is currently unviable for wider adoption due to its relatively high implementation costs and safety issues, so current research has focused on addressing these barriers. This paper comprehensively reviews recently developed techniques to improve cost and safety for RL in contact-rich robotic manipulation. Techniques are organized by their approach, and their impact is analysed. It was found that current research efforts have significantly improved the cost and safety of RL-based control for contact-rich tasks, but further improvements can be made by progressing research towards improving knowledge transfer between tasks, improving inter-robot policy transfer and facilitating real-world and continual RL. The identified directions for further research set the stage for future developments in more versatile and cost-effective RL-based control for contact-rich robotic manipulation in future industrial automation applications.
Introduction
As Kroemer et al. 1 state, ‘robot manipulation is central to achieving the promise of robotics’. Endowing robots with the ability to alter their environment, with flexibility and reliability that match humans, would unlock novel and exciting capabilities for industrial automation, leading to significant societal impact. For instance, robots could be deployed to improve the flexibility of manufacturing2,3 and improve the efficiency of the circular economy via remanufacturing. 4 Contemporary robots, however, are rigidly programmed and only able to complete specific manipulation tasks within highly controlled environments, which limits their applicability. Improving their flexibility has been an important research topic.1,5,6
A critical sub-problem within robotic manipulation is performing contact-rich tasks. Contact-rich tasks involve frequent physical contact between the robot and the environment and encompass many of the day-to-day activities that robots are expected to execute, such as assembly. These tasks involve multiple contact interactions with the environment and require suitable control schemes to react to these interactions safely. Due to the complex, non-linear and discontinuous dynamics involved in contact interactions, traditional control approaches7,8 that model the environment to analyse contact states and design manoeuvres to react to these states are time-consuming to produce. They are also unscalable, as control strategies developed with traditional control approaches are sensitive to variations in task configurations, which can lead to failure and thus require re-implementation.
Reinforcement Learning (RL) has recently become a trending Machine Learning (ML) approach due to its ability to learn to solve complex problems from experience without human supervision. RL removes the laborious process of handcrafting controllers for challenging environments, making it promising for contact-rich robotic manipulation. RL initially found success in only simple problems9,10 due to limitations in computing power and the inability of early algorithms to handle continuous state and action spaces. With the increased accessibility of powerful hardware and the incorporation of deep learning, the capability of RL has improved, bringing success across many research domains. Recent works show RL-trained deep Artificial Neural Networks (ANNs) applied to many areas, such as playing games,11–14 autonomous ground15,16 and air navigation,17,18 robotic manipulation19–22 and recommender systems. 23 Despite RL’s potential, it is still far from being an off-the-shelf solution for robotics, 19 so there is an active research effort to develop techniques to improve its practicality.
This paper comprehensively reviews the recent research to improve RL for contact-rich robotic manipulation. It highlights the current techniques developed, their motivation, and the existing gaps to offer directions for future research. To aid in understanding the current research directions, techniques are grouped based on their underlying approach, where an approach is a general outlook of a problem that motivates the development of a technique and a technique is a specific solution implementing the approach. The techniques developed within each approach are discussed in isolation and followed by a holistic analysis of the current state of research.
Works similar to this review exist that analyse the research on RL for robotic manipulation. Kober et al. 19 provided an early review of RL in robotic manipulation. However, their review in 2013 covered the state-of-the-art before a decade of rapid progress and did not focus specifically on contact-rich manipulation. Kroemer et al. 1 provided a more recent review on robotic manipulation but did not focus specifically on contact-rich tasks or RL. Suomalainen et al. 24 surveyed contact-rich robotic manipulation techniques but did not focus on RL-related techniques. The most recent and relevant review was conducted by Elguea-Aguinaco et al., 25 who analysed RL for contact-rich robotic manipulation. Their review focused on applying RL-based robotic control to contact-rich task types that have been studied extensively, presenting techniques as developed to improve performance on those specific tasks. In contrast, this review frames techniques as task-agnostic, meaning they can improve RL-based robotic control across different contact-rich task types, even those not previously studied. This perspective sets the stage for the application of RL-based control to novel contact-rich task types, which is important for improving the wider adoption of RL-based control in future industrial automation. The main contributions of the present review can be summarized as follows:
It identifies the key challenges in RL for contact-rich robotic manipulation that the current research aimed to address.
It highlights the techniques developed from 2015 to 2025 that improve RL for contact-rich robotic manipulation regarding the key challenges identified.
It organizes the techniques into groups, known as approaches, to understand the main themes of the existing body of work.
It highlights trends and gaps in the literature and offers a perspective on future directions for research.
The review is organized as follows. Section 2 introduces core concepts for RL and contact-rich robotic manipulation. Section 3 summarizes the literature and themes. Section 4 organizes the relevant works into approaches and presents the key techniques for each approach. Section 5 contains a holistic analysis of trends and gaps within the literature. Section 6 provides concluding remarks about the contents covered in the previous sections.
Core concepts
Contact-rich robotic manipulation
Contact-rich robotic manipulation refers to manipulation tasks requiring the robot’s End Effector (EEF) or grasped tools and objects to interact with the environment. In contact-rich tasks, contact is required for task completion and cannot be avoided. Therefore, the robot must react to contact accordingly to progress towards task completion while acting safely to prevent damage to the robot or environment. Due to the natural world’s non-linear contact dynamics, precisely anticipating the required control inputs for task completion is challenging and time-consuming, which may cause conventional control methods to be impractical for addressing contact-rich tasks.
The contact-rich tasks studied in the reviewed literature can be grouped into five task types, where tasks belonging to the same task type require semantically similar skills to accomplish. The task types are insertion, non-prehensile manipulation, surface tracking, door opening and object extraction. These task types are depicted in Figure 1, and Figure 4(a) shows the proportion of reviewed works addressing each task type. By grouping tasks into task types, one can leverage techniques applied to one task on new tasks belonging to the same task type, reducing implementation costs.

Below are descriptions of each task type, examples of tasks belonging to it, the expected contact interactions and appropriate robot reactions to them, and the challenges presented in such tasks that motivate the use of RL.
Insertion
Insertion involves inserting an object into a hole with corresponding geometry. Examples of insertion tasks include the classic peg-in-hole task, assembly, key insertion and the initial phase of unscrewing.
In an ideal scenario, the position and orientation of the hole are perfectly known, allowing a robot to execute a simple yet precise motion to perform insertion. In reality, only inaccurate approximations of the hole’s position and orientation are available. A naïve control strategy that does not account for these inaccuracies may miss the hole if the hole’s position estimate is inaccurate, or, if the hole’s orientation estimate is inaccurate, cause jamming or wedging during insertion, which can result in material damage to the object and hole.
An appropriate control strategy can account for discrepancies by performing a multi-stage approach that exploits contact interactions, involving a searching process to find the hole and a dynamic object re-alignment process during insertion. In the initial stage of the search, the object attached to the robot makes impact contact with the hole’s surface, and the robot must limit this contact force. During the search, the robot glides the object across the surface until the object is positioned within the hole. Searching requires the robot to apply enough tangential force to overcome friction and enough normal force to keep the object in contact with the surface without causing material damage. Using force measurements, the robot can determine if the hole has been found and then proceed to the insertion operation. During insertion, misalignment between the hole and object may cause further sliding, jamming or wedging contacts that resist the insertion motion. In these cases, the robot must dynamically re-align the object in response to such contacts to complete insertion whilst minimizing contact forces.
Analytical methods that characterize the contact states during insertion using sensory observations, and that design manoeuvres to react to them, can be time-consuming to develop. This is especially true as the complexity of the hole and object geometry increases. Furthermore, re-implementation may be required as the geometry and material properties of the object and hole change. 29 Analytical methods are thus considered unscalable when considering a wide variety of insertion tasks.
Non-prehensile manipulation
Non-prehensile manipulation tasks are a class of tasks that involve moving an object without grasping it. Such tasks involve pushing, pulling or tilting an object to a desired position or orientation.
Contact interactions in non-prehensile manipulation include normal pushing contact to adjust position and tangential pushing/pulling contact, which utilizes friction, to adjust an object’s pose. Using conventional control methods, one can analytically design a control strategy that pre-determines the sequence of contact interactions, assuming the effect of the contact interactions on the object’s pose is perfectly known. However, due to uncertainty in environment dynamics (e.g. friction, mass), the true trajectory of the object may deviate from the assumed trajectory, causing later actions in an analytical method to fail. For instance, during a normal pushing interaction, unexpected sliding between the robot and an object can occur, causing unintended rotation. Subsequent actions that expect the object to be in a specific orientation may then fail as a result.
A successful non-prehensile manipulation strategy should be robust to uncertainty in the environment by reacting to deviations in the object’s trajectory. Designing robust control strategies analytically can be time-consuming and requires significant re-implementation for changes in the environment, and is thus unscalable.
Surface tracking
Surface tracking tasks require the robot to maintain sustained contact with a surface and move the robot’s EEF tangentially to perform some given task. Tasks belonging to this task type include wiping, polishing and grinding.
In surface tracking, the primary contact interaction is sliding, which involves applying enough normal force to maintain contact with the surface without material damage and enough tangential force to overcome friction.
For flat surfaces, maintaining contact whilst moving tangentially can already be achieved using conventional force controllers. However, curved surfaces may require extra effort to programme a robot to track safely, and the controller will need to be re-implemented for different surface geometries, making conventional methods unscalable.
Furthermore, depending on the objective task, the robot may need to re-visit certain regions of the surface rather than traversing a pre-determined route in a single pass. For instance, in polishing or wiping, the surface condition (surface finish or cleanliness) may not meet the requirements after the first pass. In these cases, a reactive robot controller is required that can monitor the surface condition and decide if certain surface regions need re-visiting.
Door opening
Door opening involves interacting with door-like objects that possess constrained degrees of freedom, typically limited to rotational or translational motion about fixed hinges or sliding tracks. These tasks involve grasping a handle and moving it along the permissible degrees of freedom until the door is opened. Tasks in this task type include opening hinged doors and drawers.
In door opening, the primary contact interaction is a frictional contact between the robot gripper and the handle. The robot must then follow the permissible degree of freedom of the door. Should the robot not follow the correct trajectory, the door may be damaged or the robot may lose its grip on the handle, ultimately failing the task.
With conventional control methods, one can deduce the permissible trajectory of a robot grasping the handle, given a priori a perfect model of the door object and the robot’s pose relative to it. The robot can then, in principle, execute this trajectory to perform the task. However, the true permissible trajectory may differ from the one assumed due to uncertainties in the hinge or track’s pose relative to the handle, and the pose of the robot relative to the door. Therefore, executing the assumed trajectory may result in failure of the task.
It is unfeasible to predict the parameters of a door opening task a priori for all possible instances of this task. Therefore, analytical methods that rely on perfect a priori knowledge are impractical in addressing a variety of door opening tasks. A robust and scalable control method is thus needed to react to unexpected variability in door opening.
Object extraction
Object extraction can be considered the inverse of insertion and usually includes tasks associated with disassembly. It involves grasping an object that is part of an assembly and performing the reverse assembly motion to extract it. Examples include peg-hole extraction and gear disassembly.
Object extraction can be attempted using conventional control methods by assuming perfect knowledge of the extraction trajectory and programming a robot to follow that pre-defined path. However, assumed trajectories can be inaccurate due to variability in the condition or geometry of the assembly. Executing assumed trajectories can result in unintentional sliding, jamming or wedging of the object against its surroundings, which may result in material damage.
To deal with this uncertainty, the robot must identify contact interactions between the object and the assembly and manoeuvre in a way that minimizes contact forces. As with insertion tasks, characterizing contact states with sensory observations and analytically designing reaction manoeuvres is time-consuming and unscalable.
Reinforcement learning
RL is a data-driven framework for sequential decision-making. It involves the learner and decision maker (the agent) acting within an environment with a specified goal. At every timestep, the agent selects actions based on the state of the environment, and the environment changes to a new state in the next time step. The environment administers rewards for certain states and actions, regulated by a reward function, that specifies the desired behaviour. Through continual interaction with the environment, the agent gathers experience about it and learns how rewards are obtained. Then, the agent adjusts its actions to maximize the rewards. 30
RL assumes that the environment can be modelled as a Markov Decision Process (MDP), 31 a general mathematical formalism for sequential decision-making problems. In MDPs, actions influence immediate rewards, subsequent states, and hence future rewards. Therefore, a solution to an MDP must trade-off between immediate and delayed rewards. MDPs also model stochastic state transitions, making it possible to model uncertain environments.
This review uses the notational standard MDPNv1, 32 whereby MDPs are defined by the tuple $(\mathcal{S}, \mathcal{A}, p, R, \gamma)$: $\mathcal{S}$ is the set of possible states, $\mathcal{A}$ is the set of possible actions, $p$ is the transition function specifying the probability of reaching state $s_{t+1}$ after taking action $a_t$ in state $s_t$, $R$ is the reward function and $\gamma \in [0, 1]$ is the discount factor that weights immediate against future rewards. Let $\pi$ denote the agent’s policy, mapping states to (distributions over) actions, and let the return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ denote the discounted sum of future rewards. The optimal policy, $\pi^*$, maximizes the expected return from every state; solving an MDP amounts to finding $\pi^*$.
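To make the interaction loop and value updates concrete, the following minimal, self-contained sketch implements tabular Q-learning on a toy five-state chain MDP. The environment, hyperparameters and reward are illustrative assumptions, not a robotic example.

```python
import random

# A toy five-state chain MDP: the agent moves left or right and receives a
# reward of 1 on reaching the rightmost state. Purely for illustration.
N_STATES, ACTIONS = 5, (-1, +1)
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(200):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection: mostly exploit, sometimes explore
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # temporal-difference update towards reward plus discounted future value
        target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state
```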
Different RL algorithms that solve MDPs in varied manners have been proposed, and each algorithm presents trade-offs concerning one another. This review does not focus on RL algorithms, but instead the techniques applied alongside the algorithms to improve cost and safety. As such, the reader is referred to30,33–35 for a comprehensive understanding of RL algorithm types. For a more focused discussion on RL algorithms specific to the context of robotic control, the reader is referred to.19,22,36
It should be noted that this sub-section presents the most fundamental problem formulation for RL. This formulation is extended for techniques described in the latter sections to facilitate additional functionality unavailable with the basic formulation. However, these extended formulations remain consistent with the basic formulation so that the basic formulation provides a suitable foundation for further extensions. Where appropriate, the notation for the extended formulation will be introduced in later sections when certain techniques are discussed.
RL-based robotic control
A generic RL-based robotic control architecture schematic is presented in Figure 2, which helps one understand the effects of proposed techniques and opportunities for combining them to compound their effects.

Schematic representation of a generic RL-based robotic control architecture. This diagram has been inspired by Sutton and Barto 30 but has been adapted to emphasize the components tailored for robotic control.
RL-based robotic control consists of an RL agent at the core; a control mechanism serving as the interface between the policy and the environment; a perception module, comprising sensors that collect high-dimensional observations and an inference mechanism that reduces them to low-dimensional state information; and a user-specified reward function that specifies the task’s objective. Figure 2 depicts the relationships between the components and the feedback loop enabling the RL agent to learn. The proposed architecture extends the basic notation to accommodate two robot-specific components.
Firstly, the state is no longer assumed to be directly available from the environment. Instead, the perception module collects high-dimensional sensory observations and infers a low-dimensional state representation from them, which is passed to the agent.
Secondly, a control mechanism interfacing the agent’s actions and the environment is added, rather than the agent acting directly on the environment. The control mechanism converts the agent’s actions into low-level actuation commands (e.g. joint torques) that are executed on the robot.
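For concreteness, the following sketch traces one control cycle of this architecture; every component below is a simplified placeholder introduced for illustration, not an implementation from any reviewed work.

```python
import numpy as np

# Sketch of one control cycle in the architecture of Figure 2.
# infer_state, policy and control_mechanism are illustrative stand-ins.

def infer_state(observation):
    # Perception module: reduce a high-dimensional observation to a
    # low-dimensional state (here, a trivial downsampling placeholder).
    return observation[::10]

def policy(state):
    # RL agent: map the state to a high-level action, e.g. an EEF displacement.
    return np.tanh(np.random.randn(3))  # placeholder stochastic policy

def control_mechanism(action, eef_position):
    # Downstream controller: convert the agent's action into a low-level
    # command, e.g. a Cartesian setpoint tracked by an inner control loop.
    return eef_position + 0.01 * action  # small, bounded displacement

observation = np.random.randn(300)  # e.g. flattened sensor readings
eef_position = np.zeros(3)
s_t = infer_state(observation)
a_t = policy(s_t)
setpoint = control_mechanism(a_t, eef_position)
```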
Literature overview
After conducting a thorough literature review related to RL for contact-rich robotic manipulation, several trends were observed. The trends consisted of the challenges the research addressed regarding RL in robotic manipulation, the approaches to addressing such challenges and the types of contact-rich tasks studied.
Key challenges in RL for contact-rich robotic manipulation
Despite RL’s potential to reduce the manual effort in developing robotic control strategies to address diverse and complex robotic manipulation problems more easily, RL-based robotic control has yet to gain mainstream traction due to challenges in cost and safety.22,25,38,39 As such, traditional control approaches remain the dominant approach for industrial robotic control. 40 Addressing these challenges has become the focus of the current literature to improve the practicality of RL-based control and its future adoption by practitioners.
Cost
Cost is the most significant barrier to the mainstream adoption of RL-based robotic control.19,36 RL is a data-driven control approach that requires numerous data samples to learn effective control policies, which can be expensive and time-consuming. In comparison, traditional control approaches do not involve such expensive training, as they rely on direct programming based on expert knowledge, making them currently more feasible than RL-based robotic control.
Researchers have focused on improving sample efficiency and generalizability to alleviate the high costs of RL-based robotic control. These aspects, in the context of RL, and how they affect implementation costs are explained below:
Sample efficiency refers to the amount of interaction data an agent requires to learn an effective policy. Improving sample efficiency shortens training, reducing the time, energy and hardware wear-and-tear costs of implementing RL-based control.
Generalizability refers to a trained policy’s ability to remain effective under variations in the task or environment. Improving generalizability reduces the re-training required to transfer a policy to new task instances, lowering the cost of addressing task variations.
Safety
Ensuring RL-controlled robots behave safely is essential to consider in protecting nearby humans from injuries and mitigating material damage caused by uncontrolled collisions between the robot and its environment.38,43 RL-based robotic control introduces unique safety challenges as practitioners do not directly control the robot’s behaviour, unlike traditional control approaches, as its behaviour emerges from autonomous learning. The main risks in RL-based robotic control arise during training when the policy executes exploratory actions, which can cause unexpected collisions. RL-based control also risks acting hazardously in environment configurations unseen during training. For example, a policy trained in an environment without obstacles may be unsafe when deployed to a similar environment as the policy was not trained to avoid obstacles. Therefore, safety must be ensured during all stages (training and deployment) of the policy’s lifecycle. 39
Summary of literature
From analysing the techniques proposed in the reviewed papers, nine approaches were identified, which are presented in Figure 3. Each approach corresponds to a particular outlook on the challenges of RL-based robotic control, motivating the development of techniques. By identifying these approaches, we can understand the critical areas of research in this field and gain a broad overview of the general progress in addressing the challenges of cost and safety for RL-based robotic control.

A mind map providing an overview of the literature reviewed. Intermediate nodes indicate the main approaches identified to address the key challenges of RL in robotic control. Leaf nodes denote the notable techniques or sub-approaches under each approach. Colour tags accompany intermediate nodes to indicate the challenges addressed by each approach.
Table 1 summarizes the reviewed literature. It shows all the papers considered in the review, the approach they belong to, the task type studied in the paper and the challenge that each paper’s proposed technique aimed to address. Figures 4 and 5 summarize Table 1 to highlight notable trends observed in the reviewed literature.
An overview of the literature reviewed in tabular format.
This table highlights each paper’s publication year, the key challenges that each paper’s main contribution aimed to address, and the contact-rich task type studied in each paper. The abbreviated terms under ‘Challenges Targeted’ correspond to sample efficiency (SE), generalizability (GEN) and safety (SAF).

Pie charts showing (a) the proportion of research effort aimed at studying specific contact-rich task types and (b) the approximate distribution of research efforts in addressing the key challenges identified across the reviewed literature.

A bar chart showing how many studies were reviewed for each of the approaches identified.
Approaches to improve RL-based contact-rich robotic manipulation
Perception system design
Perception plays a crucial role in RL-based control, directly impacting the agent’s ability to make informed decisions and learn tasks efficiently and safely. Thus, researchers have investigated improving perception system design to improve RL for contact-rich robotic manipulation. The perception system encapsulates sensor selection and sensory data processing, and these two components’ design considerations are discussed.
Sensor selection
Most reviewed works use haptic and proprioceptive feedback as the sensor composition for contact-rich tasks.25,52 Proprioception informs the robot of its pose and motion relative to the environment. Haptic feedback, via Force/Torque (F/T) sensors, enables the robot to measure external forces from the environment, allowing it to detect contact and the RL agent to learn to react safely to contact interactions. With this combination, an agent can determine the location of expected contact points in an environment and react accordingly to complete the task. Even without visual feedback, this minimal sensor composition has proven sufficient to learn a range of contact-rich tasks. 25 Further, this composition facilitates the most sample-efficient learning, as the low-dimensional input minimizes the policy network’s size and the training required. However, this minimal sensor composition gives the agent a limited understanding of the environment’s state, which can cause the policy to fail on variants of the original task. 26 Additional sensing modalities can be added to give the policy a richer understanding of the environment’s state and improve generalizability.
Visual feedback (e.g. RGB, RGB-D) can be added as direct feedback to the agent (i.e. visuomotor learning).26,42,44,46,48,57 Visual feedback can give the agent real-time spatial awareness of its environment, enabling it to sense possible contacts without physical interaction, thus improving safety. More concretely, an agent with visual input can learn to associate the intersection of visual features with unsafe contact interactions and can use this insight to prevent certain unsafe interactions from occurring, which may not be possible otherwise. For example, Elguea-Aguinaco et al. 130 train an object extraction policy with vision and include the objective of avoiding human contact in training, resulting in a policy that dynamically avoids human contact during deployment. Further, visual feedback helps improve generalizability, as environment variations can be immediately observed and adapted to. For example, in insertion tasks with no accurate estimate of the hole position or with varying geometries, the agent can learn a single insertion policy that addresses all these uncertainties and variations by using vision to guide decision making.26,46,48,50 In wiping, a policy with visual input can learn to wipe a surface with a variety of surface contaminant positions. 97 However, adding these high-dimensional sensing modalities increases the size of the policy network, which increases the training required before an optimal policy is learned.
Tactile sensors observe the deformation of a skin-like membrane at the robot gripper’s fingertips to monitor normal and shear stress on the membrane. By monitoring forces in this way, tactile sensors can be used in the same capacity as F/T sensors to indirectly measure external contact forces with the environment. For instance, Dong et al. 45 substitute F/T sensors with tactile sensors for insertion tasks. Their work also reports an improvement in the generalizability of the insertion policy using tactile sensors compared to one using F/T sensors. The physical imprint of an object’s geometric features when in contact with a tactile sensor may also be observed. This property can be utilized to determine the pose of objects in contact with the sensor, which has been exploited in learning insertion, door opening and non-prehensile manipulation tasks.47,49,56
Compressing observations into a state representation
Converting raw sensory data, which is typically high-dimensional, into a compact low-dimensional state representation reduces the size of the policy network and the training required. Several families of techniques for acquiring such state representations have been investigated, as illustrated in Figure 6.
State Representation Learning (SRL) techniques learn an abstract state representation from scratch using data, typically by training an encoder with unsupervised objectives such as observation reconstruction, without requiring labels or task-specific engineering (Figure 6(a)).
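A minimal sketch of the SRL idea, assuming a reconstruction (autoencoder) objective and illustrative dimensions, is given below; it is not a specific method from the reviewed works.

```python
import torch
import torch.nn as nn

# Illustrative SRL via a reconstruction objective: an encoder compresses
# observations into a compact latent state usable by a policy. The
# dimensions and architecture are assumptions for illustration.
OBS_DIM, LATENT_DIM = 512, 16

encoder = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, OBS_DIM))
optimizer = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

observations = torch.randn(64, OBS_DIM)   # batch of raw sensory data
latent = encoder(observations)            # compact state representation
reconstruction = decoder(latent)
loss = nn.functional.mse_loss(reconstruction, observations)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```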

Schematic diagram illustrating the difference between (a) state representation learning, (b) feature transplanting and (c) supervised learning techniques for acquiring a state representation.
Jin et al. 46 proposed a simpler and more cost-effective technique to learn an abstract state representation (Figure 6(b)). It involved training a policy with visual inputs on a simple source task and then transplanting the lower layers, corresponding to the visual encoder, to another policy network trained on a complex target task related to the source task. The simple task was designed to be easy to learn to minimize training time. Sample efficiency was enhanced when learning the complex task as relevant state features were transferred from the simple task, leading to shorter retraining. However, manual effort is required to define the source and target tasks, which can limit scalability.
Another set of techniques aims to explicitly infer a concrete state representation (i.e. state parameters that map directly to measurable variables in the environment) from the high-dimensional observations (Figure 6(c)). The inference model is typically obtained by collecting privileged state information and observation data from simulation and training the model to infer the privileged state information correctly. Yang et al. 49 train a Convolutional Neural Network (CNN) with supervised learning to infer the pose of an object from tactile sensor observations for a non-prehensile manipulation policy. Kamijo and Ramirez-Alpizar et al. 53 train an ANN to infer the relative orientation of a grasped peg from on-finger tactile sensory data, enabling sample-efficient learning of an insertion task. Chen et al. 54 train an ANN to infer a categorical contact state between a peg and hole to learn the insertion task in a sample-efficient manner. Chen et al. 54 train a key point detector to detect an object’s key points and directly infer its pose, quickly learning a non-prehensile manipulation skill. To improve robustness to sensor noise and partial observability, probabilistic state inference models have also been investigated. Ferrandis et al. 56 trained a probabilistic ANN to infer the position of an object for a non-prehensile manipulation policy from occluded visual feedback and feedback from a tactile sensor. Chen et al. 60 use a Gaussian Mixture Model (GMM) to probabilistically infer the relative position of a grasped nut from its mating part to quickly learn nut insertion using only F/T observations. Dang et al. 51 used a mechanism based on template matching 77 to extract the relative position of an elastic peg from the hole and its deformation; using this technique, elastic insertion was learned in a sample-efficient manner. Whilst these techniques increase the interpretability of how the RL agent perceives the environment and makes decisions, they require a task-specific implementation and are thus not as scalable as techniques that build abstract representations from scratch using data.
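The following sketch illustrates this supervised inference idea under assumed shapes: a network is trained on simulated observation-state pairs to regress a concrete state, such as an object pose from tactile readings.

```python
import torch
import torch.nn as nn

# Sketch of supervised state inference: regress a 6-DoF pose from a
# flattened tactile reading. Shapes and architecture are assumptions.
TACTILE_DIM, POSE_DIM = 256, 6

model = nn.Sequential(nn.Linear(TACTILE_DIM, 64), nn.ReLU(), nn.Linear(64, POSE_DIM))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# In practice these pairs come from simulation, where the true (privileged)
# pose is available alongside the simulated observation.
tactile_batch = torch.randn(128, TACTILE_DIM)
true_pose_batch = torch.randn(128, POSE_DIM)

loss = nn.functional.mse_loss(model(tactile_batch), true_pose_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```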
Attention mechanisms for perception systems
Attention is a mechanism in deep learning that selectively focuses on the most relevant parts of the input, filtering out less important information. By assigning different levels of importance to various input features, attention enhances the model’s ability to learn complex patterns efficiently. Typically, attention mechanisms are integrated into the learning process by computing weighted combinations of input features, where the weights indicate each feature’s relevance to the task. These weights are dynamically learned during training, enabling the model to adjust its focus based on the task objectives.
Attention is widely used in tasks such as natural language processing (Chaudhari et al. 137) and computer vision (Guo et al. 138), and enables state-of-the-art performance in these domains. Some works have also investigated attention with RL139–141 on RL benchmark tasks. However, the results do not suggest a definitive improvement in performance and sample efficiency from this addition.
In RL for contact-rich robotic manipulation, researchers have used simpler attention mechanisms that do not require the agent to autonomously learn to assign attention. Ferrandis et al. 56 introduce a tactile gating mechanism, which dynamically blocks the flow of tactile sensor information through the policy network when no contact is made on the tactile sensor. Tactile gating was shown to improve the agent’s exploitation of tactile signals in decision making, which reportedly increased sample efficiency in door opening tasks. Lin et al. 59 used a prompt-guided image segmentation technique to filter out task-irrelevant image features, increasing the sample efficiency of learning insertion tasks.
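A minimal sketch of a gating mechanism in this spirit is shown below; the contact threshold and feature dimensions are illustrative assumptions rather than values from the cited work.

```python
import torch

# Tactile features are zeroed when the measured contact force is below a
# threshold, so they only influence the policy while in contact.
CONTACT_THRESHOLD = 0.5  # newtons, assumed

def gate_tactile(tactile_features: torch.Tensor, contact_force: float) -> torch.Tensor:
    gate = 1.0 if contact_force > CONTACT_THRESHOLD else 0.0
    return gate * tactile_features  # block tactile information when not in contact

features = torch.randn(32)
gated = gate_tactile(features, contact_force=0.1)  # returns zeros: no contact
```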
Dense reward functions
During training, rewards regulate the predicted (state or state-action) values and guide exploration towards more promising states and actions to accelerate training. Sparse reward functions are the most straightforward, with a reward issued upon task completion. However, predicted values remain uninformative until the reward is encountered, which delays the exploration of promising actions and prolongs training. This is worsened in large state-action spaces where reaching a goal state under a semi-exploratory policy can take longer. Therefore, dense reward functions are typically implemented in robotic manipulation.
Regardless of task completion, dense reward functions issue continuous rewards to the agent, with magnitude increasing as the agent nears its goal (i.e. inversely related to its distance from the goal). It should be noted that in contact-rich tasks, the measure of proximity is not limited to physical proximity but can include proximity between measured and desired values of other sensory inputs, such as force. With dense rewards, predicted value updates become informative much sooner, resulting in more efficient exploration and improved sample efficiency. 142
Reward shaping
The most common implementation technique for dense reward functions is reward shaping, a process where practitioners manually define equations, 90 a set of rules, 26 or use fuzzy logic61,64,70 to express the dense reward function. Reward shaping enables the relatively simple implementation of dense rewards. It also allows practitioners to explicitly define auxiliary objectives to produce policies that conform to specific user preferences. For instance, terms can be added to the reward function to train policies that apply a target contact force to the environment71,74 or minimize energy consumption. 27 Further, dense rewards can improve safety by penalizing entry into unsafe states, such as states where contact forces exceed a threshold. 84 However, safe behaviour is guaranteed only after training, as the unsafe states must first be encountered during training before the agent learns not to revisit them.
Reward shaping has effectively improved sample efficiency across various contact-rich tasks. However, the cost of reward shaping and the policy’s quality mainly depend on the practitioner’s experience. Further, the reward shaping process must be repeated for new tasks, making this technique unscalable and impractical for settings where robots are required to complete varied tasks.
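For illustration, the following sketch combines common hand-shaped reward terms discussed above (progress towards the goal, tracking of a target contact force and a penalty for exceeding a safety threshold); all weights and limits are assumptions.

```python
import numpy as np

# Illustrative shaped dense reward for a contact-rich task.
W_DIST, W_FORCE, W_SAFE = 1.0, 0.1, 10.0
TARGET_FORCE, FORCE_LIMIT = 5.0, 30.0  # newtons, illustrative values

def shaped_reward(eef_pos, goal_pos, contact_force):
    r_dist = -W_DIST * np.linalg.norm(goal_pos - eef_pos)     # closer is better
    r_force = -W_FORCE * abs(contact_force - TARGET_FORCE)    # track desired force
    r_safe = -W_SAFE if contact_force > FORCE_LIMIT else 0.0  # penalize unsafe force
    return r_dist + r_force + r_safe

r = shaped_reward(np.zeros(3), np.array([0.5, 0.0, 0.2]), contact_force=6.0)
```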
Inverse reinforcement learning
Inverse Reinforcement Learning (IRL) 143 is a technique in which a dense reward function is inferred through expert demonstrations, addressing scalability and policy quality limitations in reward shaping.
Zhang et al. 63 employed Adversarial IRL (AIRL) 144 to infer a reward function and quickly train a policy for peg insertion and cup placement. Zhou et al. 65 also used AIRL to quickly learn the insertion task. AIRL is an adversarial learning-based IRL technique that enables reward functions to be inferred in environments with large and continuous state-action spaces, making it suitable for robotic manipulation. However, adversarial techniques require numerous demonstrations to infer a reward function, making them costly to train, and they can experience training instabilities. 144
An alternative technique was proposed by Wu et al., 62 which required only a single demonstration, involved less training and was not susceptible to training instabilities. Their technique involved learning a latent measure of task progress from a single demonstration to act as the reward function. Their study demonstrated accelerated learning of a peg insertion policy.
Residual RL
Residual RL involves an expert policy, typically a hand-crafted or model-based controller, that provides a baseline action for each state, with a learned residual policy superimposing corrective actions on top of it, such that the executed action is the sum of the expert’s action and the learned residual.
Residual RL enhances sample efficiency as the initial policy directly steers the agent towards goal states, reducing the exploration required to encounter rewards, which enables the exploration of more promising actions sooner. Constrained exploration around the expert’s trajectory also improves safety, as the agent is less likely to execute the erratic exploratory actions typical of learning from scratch.
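The additive composition at the heart of residual RL can be sketched as follows, with both policies as illustrative placeholders.

```python
import numpy as np

# Sketch of residual RL action composition: the executed action superimposes
# a learned correction on a hand-designed expert policy. Both functions are
# placeholders for illustration.

def expert_policy(state):
    # e.g. a straight-line motion towards an estimated goal position
    goal = np.array([0.5, 0.0, 0.2])
    return 0.1 * (goal - state)

def residual_policy(state):
    # learned correction; stands in for the RL policy network's output
    return 0.01 * np.random.randn(3)

state = np.zeros(3)
action = expert_policy(state) + residual_policy(state)  # expert + residual
```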
Residual RL has gained traction within the contact-rich robotic manipulation domain, particularly in the areas of peg insertion,66,71,73,74 assembly69,70,72,75,76,78 and grinding. 79 Since its introduction by Johannink et al., 66 many variations of residual RL have emerged, differing primarily in the formulation of the residual policy and the incorporation of imitation learning.
Residual policy action formulation
The original implementation by Johannink et al., 66 and Residual Policy Learning (RPL), formulate the residual policy’s action as a continuous correction added directly to the expert policy’s action in the same action space.
Oikawa et al. 72 presented a unique residual policy formulation that used a discrete action space. Their residual policy learned to change the force controller’s stiffness matrix discretely by selecting from a pre-defined set of matrices. The stiffness matrices were non-diagonal, facilitating smooth local exploration around the expert trajectory when the robot was in contact with the environment. Due to the discrete action space, sample efficiency was significantly improved. However, the set of matrices must be manually selected, increasing implementation difficulty as choosing a set of non-diagonal matrices may involve trial and error. Spector and Zacksenhouse 68 proposed a similar technique, but allowed the policy to set any real value for select elements of the stiffness matrix, which reduced manual effort at the expense of sample efficiency.
Obtaining the expert trajectory
The original residual RL implementation relies on a trajectory generated by interpolating between a start point and a goal position, so the expert trajectories tend to be simple straight lines. For tasks such as door opening 105 or L-shaped insertion, 71 complex curved trajectories may be needed. Using conventional robot programming methods, these trajectories may be implemented by composing spline and circular motion commands. However, such an approach requires significant time to implement and is thus unscalable.
Rather than manually defining trajectories, a robot can learn to imitate expert demonstrations, known as imitation learning, 145 to form the desired initial trajectory and implement residual RL for complex tasks in a more time- and cost-efficient manner. Variations of this technique differ by the imitation learning framework used to learn the demonstrations. Ma et al. 70 used GMMs, which enabled robust imitation of an insertion trajectory from only four demonstrations. However, their technique only allowed the imitation of EEF translation and not rotation. Davchev et al. 67 used Dynamical Movement Primitives (DMPs) 146 to imitate insertion demonstrations. This technique enabled both EEF translation and rotation to be imitated and required only one demonstration for successful imitation. Wang et al. 71 proposed a novel imitation learning framework to imitate an expert insertion trajectory robustly. Their framework required more computation and demonstrations than those of Davchev et al. 67 and Ma et al. 70 but guaranteed more robust imitation.
Another technique for generating the expert trajectory is trajectory optimization, 147 which can generate arbitrarily complex trajectories. This is achieved by treating trajectory planning as an optimization problem that optimizes a given performance metric subject to a given set of constraints. Sleiman et al. 77 use trajectory optimization to obtain an initial trajectory for door opening and use RL to refine the motion to account for environmental uncertainties. The limitation of this approach is the need for an environment model that the trajectory optimizer can use for planning, which may be difficult to implement for varied environments.
Structured exploration
Structured exploration is an approach that aims to achieve enhanced sample efficiency by strategically restricting the agent’s exploration. Regions of the state-action space deemed unpromising can be avoided, focussing on exploring more promising actions and states. Different techniques have been developed to realize this approach.
Curriculum learning
Curriculum learning is a training strategy used in RL to help agents learn more effectively by gradually increasing the complexity of the tasks they must solve. 148 This learning scheme is inspired by how humans and animals learn, where they gradually progress from simple to more challenging tasks.
The technique developed by Florensa et al. 80 involved training the agent near the goal state in early training episodes and gradually moving the starting position further away in later episodes. Starting close to the goal increases the probability of reaching the goal and receiving a reward, accelerating the acquisition of informative value function updates and improving exploration efficiency, even in sparse reward settings. By gradually moving the start point further from the goal, the agent quickly learned to perform the task from any starting state.
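A minimal sketch of such a reverse curriculum, assuming a linear schedule and illustrative distances, is given below.

```python
import numpy as np

# Reverse-curriculum start-state sampling: early episodes start near the
# goal; the start is moved further away as training progresses. The linear
# schedule and workspace scale are assumptions for illustration.
GOAL = np.array([0.5, 0.0, 0.2])
MAX_OFFSET = 0.3  # metres, assumed workspace scale

def sample_start_state(episode, total_episodes):
    # fraction of the full start-to-goal distance used for this episode
    progress = min(1.0, episode / (0.8 * total_episodes))
    direction = np.random.randn(3)
    direction /= np.linalg.norm(direction)
    return GOAL + progress * MAX_OFFSET * direction

start = sample_start_state(episode=10, total_episodes=1000)  # near the goal
```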
Hybrid RL and model-based control
Another strategy is to use model-based controllers to steer agents towards promising regions of the state space and resume RL from those regions. Hoppe et al. 81 combined a trajectory optimization controller with model-free RL. The RL component explores and collects data as usual. Using the collected data, the trajectory optimization controller places the agent in promising state regions, from which RL resumes, accelerating learning. Florensa et al. 82 proposed Guided Uncertainty Aware Policy Optimization (GUAPO), which involved a perception system that divided the EEF space into free and uncertain spaces. The uncertain space encompassed regions where contact interactions with objects are expected, which is where the RL policy acts. The free space was the remaining region, where point-to-point control was applied to redirect the EEF back to the uncertain space and return control to the RL policy.
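The space-partitioning idea behind GUAPO can be sketched as follows, with assumed geometry standing in for the learned uncertainty estimate.

```python
import numpy as np

# Sketch: point-to-point control in free space; RL policy inside the
# uncertain region around expected contact. Geometry is an assumption.
UNCERTAIN_CENTRE = np.array([0.5, 0.0, 0.1])  # e.g. estimated hole position
UNCERTAIN_RADIUS = 0.05                        # metres, illustrative

def select_action(eef_pos, rl_policy):
    if np.linalg.norm(eef_pos - UNCERTAIN_CENTRE) > UNCERTAIN_RADIUS:
        # free space: simple point-to-point motion back towards the region
        return 0.1 * (UNCERTAIN_CENTRE - eef_pos)
    return rl_policy(eef_pos)  # uncertain space: defer to the learned policy

action = select_action(np.zeros(3), rl_policy=lambda s: 0.01 * np.random.randn(3))
```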
Control mechanism design
The control mechanism is the interface that connects the agent to the robot’s actions within its environment. As mentioned in Section 2.3, design choices concerning the control mechanism have been shown to significantly impact the sample efficiency, generalizability and safety of RL for robotic control.
Fundamental design principles
Current control mechanism designs for contact-rich tasks follow common principles prescribed by prior seminal works.
Peng and van de Panne 149 prescribed RL with policies that learn how to operate a basic feedback controller that converts agent actions to torques (i.e. the agent outputs setpoints, such as target joint positions, which a low-level controller tracks), rather than policies that output actuator torques directly. This structure was shown to improve learning speed and the quality of the resulting motions.
Bellegarda and Byl 150 prescribed using EEF space control over joint space control for the downstream controller. EEF and joint space control differ in how the RL agent expresses actions before they are converted to actuator torques. Joint space control requires the agent to express actions in joint space, which are then mapped to actuator torques, whereas EEF space control requires the agent to express actions in Cartesian space at the EEF, which are mapped to joint torques through the manipulator’s kinematics (e.g. via the Jacobian). Because task objectives and contact interactions are naturally defined at the EEF, expressing actions in EEF space simplifies exploration and improves sample efficiency.
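The two action-space choices can be contrasted with the following sketch; the gains, dimensions and Jacobian are illustrative assumptions, and the Cartesian-to-joint mapping is a crude Jacobian-transpose step rather than a full inverse kinematics solution.

```python
import numpy as np

# Contrasting joint-space and EEF-space action choices (illustrative only).
KP, KD = 50.0, 2.0  # assumed PD gains

def joint_space_torque(q_des, q, q_dot):
    # agent action = desired joint positions; a PD controller produces torques
    return KP * (q_des - q) - KD * q_dot

def eef_space_torque(action_x, q, q_dot, jacobian):
    # agent action = desired Cartesian displacement at the EEF; map to joint
    # space with a crude Jacobian-transpose step (an assumption, not exact IK)
    q_des = q + jacobian.T @ action_x
    return joint_space_torque(q_des, q, q_dot)

q, q_dot = np.zeros(7), np.zeros(7)
J = np.random.randn(3, 7)  # placeholder Jacobian for a 7-DoF arm
tau = eef_space_torque(np.array([0.0, 0.0, -0.01]), q, q_dot, J)
```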
Another fundamental design principle is using compliant control schemes for the downstream controller. Compliant controllers allow the robot to comply with external contact with the environment instead of rigidly resisting it, enabling safer interactions and improved adaptability in uncertain or dynamic environments. In principle, an RL agent using a rigid position controller and access to force information can learn to regulate contact forces by positioning itself in a way that limits them. However, its effectiveness in regulating contact is limited by the RL agent’s sampling frequency, and doing so adds to the learning burden, decreasing sample efficiency. Compliant control schemes also allow the robot to adapt well to environmental uncertainties when in contact, enhancing the policy’s generalizability. The control schemes most relevant to the literature are impedance, 153 admittance154,155 and hybrid motion-force control. 156 The review by Elguea-Aguinaco et al. 25 found that compliant control schemes can be used interchangeably for the same task when using RL-based control. Most works used impedance control due to the relative ease of implementation compared to admittance and hybrid motion-force control, which rely on more complicated control systems with reliable F/T sensor(s). 157 However, impedance control is only suitable for torque-controlled robots. The most widely used robots are position-controlled, so admittance and hybrid motion-force control are used in these circumstances. 84
Variable compliance control
Variable compliance control involves online adaptation of controller compliance, achieved by adding dimensions of variable compliance to the agent’s action space. This concept was introduced by Buchli et al. 158 and was later enhanced by Martín-Martín et al. 27 by combining variable compliance control with the EEF-space impedance control. Beltran-Hernandez et al. 84 proposed a similar control mechanism, but their mechanism was based on admittance control, which enabled variable compliance control for rigid position-controlled robots.
Variable compliance control improves the policy’s safety, execution time and energy consumption, given that the reward function specifies these preferences. Safety is enhanced as the policy can learn to increase compliance in states where contact is expected. Execution time is minimized as the policy can learn to reduce compliance when no contact is expected, which increases movement speed. Finally, energy consumption is minimized as the policy can learn to increase compliance, reducing maximum actuator torque when required. Another benefit is that it circumvents manually tuning the controller gains, which can be time-consuming. As such, it has become an essential aspect of control mechanisms designed for contact-rich robotic manipulation tasks, including peg insertion, assembly, disassembly, door opening and surface tracking.66,86,92 Works employing variable compliance control typically allow the RL agent to modulate only the diagonal terms of the gain matrices. However, some works allow control over all terms in the gain matrices,68,89 facilitating local exploration on surfaces that the robot is in contact with.
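A minimal sketch of a variable compliance action space, assuming diagonal stiffness modulation, illustrative bounds and a critical-damping heuristic, is shown below.

```python
import numpy as np

# The agent's action is augmented with stiffness values that re-parametrize
# an impedance controller at every step. All values are assumptions.
STIFFNESS_MIN, STIFFNESS_MAX = 10.0, 500.0  # N/m, illustrative bounds

def apply_action(action, x, x_dot):
    # action = [3-DoF position offset, 3 normalized diagonal stiffness terms]
    delta_x, k_norm = action[:3], action[3:]
    k = STIFFNESS_MIN + k_norm * (STIFFNESS_MAX - STIFFNESS_MIN)
    d = 2.0 * np.sqrt(k)  # critical-damping heuristic (assumption)
    x_des = x + delta_x
    return k * (x_des - x) - d * x_dot  # diagonal impedance law

action = np.array([0.0, 0.0, -0.005, 0.2, 0.2, 0.8])  # soft in x/y, stiff in z
f = apply_action(action, x=np.zeros(3), x_dot=np.zeros(3))
```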
Ulmer et al. 85 proposed an alternative implementation of variable compliance control. Their control mechanism modulates compliance within the inner control loop instead of it being modulated by the upstream policy (i.e. adaptive control 159 ), reducing the agent’s action space. As a result, Ulmer et al.’s control mechanism improves sample efficiency relative to those of Martín-Martín et al. 27 and Beltran-Hernandez et al., 84 as the exploration required by the agent is reduced.
Riemannian motion policies
Ratliff et al. 160 introduced the Riemannian Motion Policy (RMP) framework, which is based on differential geometry. This framework facilitates the modular combination of motion policies for robotic control. A common use for the RMP framework is to enhance a primary motion policy by combining it modularly with auxiliary policies that satisfy other objectives. Typically, these auxiliary policies include collision avoidance (at the EEF and intermediate joints) and joint-limit avoidance, which steer the robot away from these unsafe states, improving safety. This modularity simplifies the controller mechanism design process as the complexity of tasks and robot embodiments increases.
Shaw et al. 86 investigated using RMPs to enhance the control mechanism for RL in contact-rich robotic manipulation and demonstrated improved sample efficiency and safety for surface tracking and door opening. Sample efficiency was enhanced by the joint-limit and collision avoidance auxiliary policies, which prevented episode-terminating collisions and joint-limit violations, allowing the agent to learn the task in far fewer episodes. The collision avoidance policies improved safety and were most impactful in the earlier episodes, where the agent typically explores more dangerous actions. Whilst promising, the obstacle avoidance policies require real-time knowledge of surrounding obstacles, which requires advanced perception mechanisms that may be difficult to implement for varied environments.
Improving robustness of imperfect controllers
Torque-based control methods, such as impedance control, are highly sensitive to inaccuracies in the controller’s assumed robot dynamics model, which includes inertial and friction parameters. These inaccuracies cause deviations from the motions intended by an RL agent, as real-world conditions differ from the idealised simulation. Position control relies on high gains and integral action to mitigate such uncertainties and ensure a precise trajectory. These techniques are unviable for impedance control, as they would remove its compliant properties.
Tang et al. 88 proposed a policy-level action integrator mechanism that acts similarly to the integral term in classical PID control. This technique rejects unmodelled disturbances, which makes the robot’s behaviour much closer between simulation and reality, eliminating the need for retraining in real environments and thus increasing sample efficiency. The mechanism also limits the accumulated effort from the integrator to retain the compliant characteristics of the impedance controller.
Embedding safety functionality
Modern control mechanisms for contact-rich robotic manipulation have embedded safety functionality to protect the robot and its environment and maintain safe conditions during operation. Safety can be ensured with relatively simple techniques consisting of handcrafted mechanisms. For instance, the control mechanism can limit maximum joint torque 123 or velocity. 83 If unintended contact occurs, these measures limit the damage inflicted on the robot or environment. Another simple technique is triggering a pre-defined recovery procedure if a particular sensor measurement reaches undesirable values. For example, EEF position or F/T measurements outside a desired range may trigger the termination of training.61,104
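These simple measures can be sketched as follows, with illustrative limits standing in for values chosen per robot and task.

```python
import numpy as np

# Sketch of simple embedded safety measures: clamping commanded velocities
# and terminating the episode when force/torque readings leave a safe
# range. Both limits are assumptions for illustration.
VEL_LIMIT = 0.05    # m/s per axis, assumed
FORCE_LIMIT = 25.0  # newtons, assumed

def safe_step(velocity_cmd, ft_reading):
    velocity_cmd = np.clip(velocity_cmd, -VEL_LIMIT, VEL_LIMIT)  # limit speed
    if np.linalg.norm(ft_reading[:3]) > FORCE_LIMIT:
        return None, True   # trigger recovery/termination on excessive force
    return velocity_cmd, False

cmd, terminate = safe_step(np.array([0.2, 0.0, 0.0]), np.zeros(6))
```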
Zhu et al. 87 proposed a more sophisticated technique. It involved a contact detector for intermediate robot joints and a novel controller that performs null space operation, reconfiguring the manipulator without displacing the EEF. This technique enables the robot to comply with disturbances at intermediate joints without disturbing the task execution, which enhances safety.
Task decomposition
A promising approach is to consider RL problems modularly, decomposing tasks into smaller, more manageable subtasks. With this view, the overall policy is structured as a super-policy that selects among sub-policies, where each sub-policy addresses a single subtask.
This modular approach allows agents to think and act at higher temporal resolutions, considering longer action sequences and higher-level plans rather than making decisions at each timestep. This improves sample efficiency, as temporally extended actions result in quicker exploration of states far from the starting point, accelerating the exploration of promising actions and enhancing the agent’s ability to learn long-horizon tasks. 161 Long-horizon tasks require agents to make decisions over an extended period, involving numerous actions before task completion and reward acquisition, which are more challenging to learn with basic RL. 161 Additionally, task decomposition facilitates sub-policy reuse across different tasks, that is, knowledge transfer, further improving sample efficiency when learning new tasks. Generalizability is also enhanced by transferring super-policies between task or environment variations, which reduces the fine-tuning required in policy transfer, as policies are retrained at the sub-policy level, each addressing a smaller portion of the task.
Four sub-approaches have been identified in the literature that adopt the task decomposition approach for contact-rich robotic manipulation: namely, Task Planning with RL, Skill-Based RL, Hierarchical RL and techniques using an expert-guided super-policy. These sub-approaches are illustrated in Figure 7, and the techniques proposed under each are discussed in this section.

Task planning with RL
Task planning162–164 is a robotics technique that uses a semantic environment model and a planner to execute long-horizon tasks. It involves a symbolic planner assessing the state of the environment and producing a sequence of actions that fulfil a user’s desired goal(s). Actions are typically selected from an action library shared between task plans to simplify the planning process. Finally, the action sequence is passed to an execution module to implement the plan. The main strength of task planning is its generalisability potential: in principle, it can immediately facilitate the completion of any task in any given environment, given a suitable semantic environment model and action library. However, library actions are typically not robust to the complex and uncertain contact interactions of contact-rich tasks and can fail. As such, researchers have combined task planning and RL to produce actions that are robust to environmental uncertainties. This is accomplished by using the planner as the super-policy that sequences sub-tasks, whilst RL is used to produce the contact-rich actions.
Mayr et al. 93 proposed using a planner that generated an action sequence from a library of tuneable action primitives, with RL later used to fine-tune the primitives. Cheng and Xu 95 used a planner that identified the relevant sub-goals for a task and learned the actions from scratch with RL. The main bottleneck in these techniques is the reliance on semantic world models, which are challenging to produce due to the limitations of current perception and semantic environment model generation algorithms, limiting scalability. Therefore, the applicability of task planning with RL in uncertain and novel environments is limited.
Skill-based RL
Skill-based RL involves training a model-free RL agent with an augmented action space, where each action represents a sequence of low-level actions, known as a skill.
The skill-based RL techniques applied to contact-rich robotic manipulation differ primarily by the method for defining the skill space. Lew et al. 97 proposed a technique where the RL agent learns to command a downstream trajectory optimization module, which takes a goal pose and executes an optimal trajectory to the goal. This technique was applied to learn a surface tracking task. Nasiriany et al. 91 defined the skill space as a finite set of hand-crafted skills. These skills were also parametrized with skill parameters, which the agent selects alongside the skill itself.
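A parametrized skill action space of this kind can be sketched as follows, with two hypothetical skills standing in for a hand-crafted skill library.

```python
import numpy as np

# Sketch of a parametrized skill action space: the agent selects a skill
# index plus continuous skill parameters, and each skill expands into a
# sequence of low-level actions. Both skills are illustrative placeholders.

def reach(params):   # params: target EEF displacement direction
    return [0.1 * params for _ in range(10)]  # placeholder motion sequence

def push(params):    # params: push direction
    return [0.02 * params for _ in range(5)]

SKILLS = [reach, push]

def execute_skill(agent_action):
    skill_index = int(agent_action[0])            # discrete skill choice
    skill_params = np.asarray(agent_action[1:])   # continuous parameters
    return SKILLS[skill_index](skill_params)      # low-level action sequence

low_level_actions = execute_skill([1, 0.0, 0.1, 0.0])  # push along +y
```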
As the agent is model-free, skill-based RL circumvents the need for prior knowledge of the environment, which increases its applicability compared to task planning with RL. However, a skill-based RL agent’s performance is limited by the skill space’s quality, as a sub-optimal skill space can harm learning efficiency. 167 Skill spaces require human input, so the implementation may be time-consuming depending on the implementer’s skill level, which can limit skill-based RL’s scalability and effectiveness.
Hierarchical RL
Unlike other task decomposition sub-approaches, Hierarchical RL (HRL)161,168 involves learning both the super- and sub-policies from scratch without assistance from injected expert knowledge. HRL techniques involve autonomous segmentation of tasks into sub-tasks to learn the super-policy and autonomous learning of each sub-policy for its respective sub-task. As a result, HRL involves less manual effort than other task decomposition sub-approaches and eliminates sub-optimality introduced by human input, which can be useful if the given task’s structure is challenging to decompose or the required sub-policies are unknown or difficult to produce. However, these benefits trade off against reduced sample efficiency, as more environment interactions are needed to learn the task decomposition and the set of suitable sub-policies from scratch. Hou et al. 90 provide the only reviewed work using HRL for contact-rich robotic manipulation, applying it to learn multi-peg insertion.
Expert-guided super-policy
Rather than learning the super-policy from scratch, expert-guided techniques use injected expert knowledge to define the task decomposition, so that only the sub-policies need to be learned. More sophisticated techniques construct the super-policy from higher-level task specifications provided by the expert.
Meta-learning
Meta-learning 171 is an ML approach that focuses on algorithms and techniques enabling models to learn how to learn. It involves a meta-training process where a meta-model is trained on a distribution of domains (tasks or datasets), after which the meta-model can quickly adapt to new domains drawn from the same distribution.

Illustration of the central concept behind meta-RL.
Meta-learning is similar to transfer learning, 172 as both techniques facilitate knowledge transfer between domains to minimize additional training. However, transfer learning only facilitates transfer between a single source and target domain due to its naïve approach, which involves fine-tuning a pre-trained model on a new domain without knowledge of the variance between domains. Instead, meta-learning models first learn the variance between domains in a training domain distribution, which enables them to adapt to any domain within that distribution, not just one. Meta-adaptation is also much quicker than naïve transfer learning, with some works completing successful meta-adaptation with little or no further training, referred to as few- or zero-shot transfer, respectively. 173
Meta-learning techniques produce generalized models that can quickly adapt to changes in different domains. This is especially beneficial for contact-rich manipulation tasks, where robust and adaptable policies are needed to operate efficiently in uncertain environments with minimal retraining.
Meta-reinforcement learning
Meta-Reinforcement Learning (Meta-RL) 153 applies meta-learning concepts to sequential decision-making. It involves meta-training a meta-policy on a task distribution, yielding a policy that can quickly adapt to any task drawn from that distribution.
Many Meta-RL techniques have been proposed, such as Recurrent Neural Network (RNN) based174,175 and variational inference-based approaches, 176 which formalize the problem as a Partially Observable MDP. 177 Other techniques adopt a gradient-based approach. 178 The variational inference-based technique, Probabilistic Embeddings for Actor-Critic RL (PEARL), 176 is the only Meta-RL technique adopted in contact-rich robotic manipulation as it decouples task inference from task completion, which can enhance sample efficiency by facilitating off-policy learning. 179
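To make the meta-training loop concrete, below is a minimal numerical sketch of the gradient-based family of meta-learning on a toy one-dimensional problem (the quadratic loss and learning rates are purely illustrative and not drawn from the cited works):

```python
import numpy as np

# Each "task" is fitting a scalar target; the meta-parameter theta is trained
# so that a single inner gradient step adapts it well to any sampled task.

def grad(theta, target):
    return theta - target  # gradient of the toy loss 0.5 * (theta - target)^2

rng = np.random.default_rng(0)
theta, inner_lr, outer_lr = 0.0, 0.1, 0.01

for _ in range(1000):
    target = rng.uniform(-1.0, 1.0)                   # sample a task
    adapted = theta - inner_lr * grad(theta, target)  # inner adaptation step
    # outer update on the post-adaptation loss; d(adapted)/d(theta) = 1 - inner_lr
    theta -= outer_lr * grad(adapted, target) * (1.0 - inner_lr)
```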
Schoettler et al. 104 and Hafez et al. 106 used PEARL to train a meta-policy in simulated insertion tasks with domain randomization. The randomized simulation emulated meta-training on a distribution of varied insertion tasks. More specifically, their task distribution randomized hole position estimates, peg-hole clearances, and position controller step size to emulate uncertainties in the real world not captured in the simulation. The experiments found that the meta-policy adapted to real insertion tasks much quicker than a basic policy trained on a single simulated environment.
These works only meta-train on relatively narrow task distributions with minor variations between task instances, limiting the generalizability potential of the final meta-policy. In principle, meta-training can be performed on broader task distributions to attain higher generalizability potential, such as a meta-policy that can insert cables or assemble objects, not just variations of the basic peg-and-hole problem. However, attaining high generalizability potential significantly increases meta-training time and cost, 107 which offsets its advantage over more straightforward transfer learning techniques. Furthermore, successful meta-adaptation requires careful task distribution design to ensure the target task falls within this distribution; implementation difficulty grows with the task distribution's scope, since designing each task requires considerable manual effort.
Meta imitation learning
To address the challenge of Meta-RL's high cost, researchers combined imitation learning 145 with Meta-RL, which we refer to as Meta-Imitation Learning (Meta-IL). Meta-IL leverages demonstrations to learn each task from the training task distribution, reducing the environment interactions required during meta-training. Meta-IL has typically been achieved by training a VAE 136 to reconstruct policies learned from demonstration for tasks drawn from the training task distribution.
Model-based reinforcement learning
Model-Based RL (MBRL) 180 is an approach to RL centred around the use of a virtual environment transition model for training as a supplement or substitute to training in a real environment. With such a model, the data required for training can be generated from the virtual model rather than sampled from the environment. As such, training can be much quicker and cheaper than with the more common Model-Free RL (MFRL), effectively increasing sample efficiency.
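A minimal sketch of this training loop is given below; all components are assumed interfaces standing in for a concrete algorithm, not a specific implementation from the reviewed works:

```python
# One MBRL iteration: a small batch of costly real interactions updates the
# learned transition model, which then generates many cheap synthetic
# transitions to supplement the policy update.

def mbrl_iteration(collect_real_rollout, update_model, rollout_model,
                   update_policy, real_steps=100, synthetic_steps=1000):
    real_batch = collect_real_rollout(steps=real_steps)      # expensive samples
    update_model(real_batch)                                 # fit the dynamics
    synthetic_batch = rollout_model(steps=synthetic_steps)   # cheap samples
    update_policy(real_batch + synthetic_batch)              # mixed update
```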
Transition model design
Despite MBRL's strengths, MFRL has gained more traction in contact-rich manipulation. MFRL typically achieves greater maximum reward than MBRL because contact dynamics are challenging to model accurately, leading to inaccurate transition models. Policies then overfit to the inaccurate model, degrading asymptotic policy performance, a phenomenon known as model bias. 181 To realise MBRL's potential in contact-rich manipulation, researchers have focused on improving transition model design to minimise model bias.
MBRL research for contact-rich robotic manipulation exclusively adopts learned over analytic transition models, as learned models are more straightforward and less time-consuming to implement, offering a more practical solution for MBRL. Learned models only require feeding an appropriate amount of labelled data collected from the real environment to an ML algorithm to model the dynamics. In contrast, analytic models require more effort, involving precise modelling of the environment and physical phenomena to obtain an accurate dynamics model. Further, analytic models present an unfavourable trade-off between accuracy and latency,182,183 as they require expensive calculations to model non-linear contact dynamics that grow more expensive as the model's fidelity increases, resulting in time-intensive model queries. Learned models avoid this trade-off by fitting a non-linear function to the acquired data, which makes them quicker to query, 184 with query latency that does not scale directly with the model's fidelity. Some researchers182,183 have investigated hybrid learned-analytic models and shown promising results. However, the proposed models have not yet been studied in more complex robotic manipulation environments or in conjunction with RL.
Transition models within the reviewed works are also probabilistic, meaning that a state transition distribution is predicted rather than a single transition. 180 This is because state transition prediction is subject to epistemic and aleatoric uncertainty. 185 Epistemic uncertainty arises from insufficient data to make accurate predictions. It can be reduced by providing models with more training data but is seldom eliminated due to the associated costs. Aleatoric uncertainty arises from measurement noise, environment variability, non-linear dynamics and partial observability of the environment state. It is irreducible as it is caused by inherent environment stochasticity. Attempting to predict state transitions deterministically results in inaccurate predictions and degraded policy performance. Using probabilistic transition models, an agent can consider multiple likely outcomes of its actions during training, reducing model bias and significantly improving the maximum reward attainable by an MBRL agent.181,186,187
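As an illustration, a probabilistic ANN transition model can be sketched as follows; the network outputs a Gaussian over the next state rather than a point estimate (the dimensions and architecture are assumptions for the example, shown here in PyTorch):

```python
import torch
import torch.nn as nn

class ProbabilisticTransitionModel(nn.Module):
    """Predicts a Gaussian distribution over the next state given (state, action)."""

    def __init__(self, state_dim=6, action_dim=3, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def forward(self, state, action):
        h = self.backbone(torch.cat([state, action], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)  # bounded variance

def nll_loss(model, s, a, s_next):
    # negative log-likelihood of observed transitions under the predicted Gaussian
    mean, log_std = model(s, a)
    dist = torch.distributions.Normal(mean, log_std.exp())
    return -dist.log_prob(s_next).sum(dim=-1).mean()
```

Training minimizes the negative log-likelihood of transitions collected from the environment; during planning or policy training, sampling from the predicted Gaussian exposes the agent to multiple plausible outcomes of each action.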
Given that learned probabilistic transition models are most suitable, three prominent ML frameworks have been adopted to represent the transition model in the reviewed MBRL works: probabilistic ANNs, 188 Gaussian Processes (GPs) 189 and Gaussian Mixture Models (GMMs), 190 chosen for their ability to model non-linear functions. Using these ML frameworks as transition models with MBRL algorithms (e.g.,177–183) enabled contact-rich tasks such as insertion,110,112 door opening 111 and surface tracking 116 to be learned successfully. In these cases, sample efficiency was improved compared to MFRL, and the maximum reward obtained was comparable to that of an MFRL policy.
Transition model transferability
The drawback of learned transition models is that they are specific to the training environment and require further training for a new environment, reducing the scalability of MBRL. As such, researchers have investigated transition model transferability to accelerate learning in new environments to improve MBRL’s scalability. Only ANN transition models have been used to investigate transition model transferability, as ANNs are widely studied in the context of transfer learning.191,192
Fu et al. 109 fine-tuned an ANN transition model learned from a prior insertion task to quickly learn a variant of that task. The technique proposed by Tanaka et al. 193 involved training an ANN to aggregate a transition model ensemble from previous assembly environments to quickly learn a new assembly environment's dynamics. Nagabandi et al. 194 proposed a meta-learning approach to transition model learning, which was applied by Liu et al. 115 to learn variations of insertion tasks with MBRL in a sample-efficient manner.
These techniques require pre-trained transition models from prior environments similar to the target environment for quick transfer, which is suitable for addressing a set of similar environments. However, when addressing novel environments that differ too much from previously seen environments, transition models must be learned from scratch, as prior transition models cannot utilize current transfer techniques to accelerate training. Therefore, current techniques only partially address the scalability issue of MBRL. To address the scalability issues of MBRL completely, future research should focus on knowledge transfer techniques that enable quick learning of novel environments by reducing the reliance on access to prior transition models from environments similar to a target environment.
Safe exploration with MBRL
Some researchers focused on exploiting the foresight afforded by the transition model to facilitate significantly safer exploration during training.
Mitsioni et al. 114 developed a technique involving pre-training a classifier that segregated the state space into safe and unsafe sets. During policy deployment in the real world, actions likely to reach unsafe states were filtered out of the viable action set, resulting in safer operation. Thananjeyan et al. 113 proposed using a pre-trained safety critic that predicted the probability of visiting unsafe states in the future given the current state and action. The safety critic was trained offline with a dataset of example trajectories in which the agent reached unsafe states. This critic was then used as a cost function for the agent to minimize when learning with MBRL, resulting in safe and sample-efficient learning in an object extraction task.
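A minimal sketch of the action-filtering idea is shown below; `transition_model` and `safety_classifier` stand in for the pre-trained components described above, and the threshold is an illustrative assumption:

```python
def filter_unsafe_actions(state, candidate_actions, transition_model,
                          safety_classifier, threshold=0.1):
    """Keep candidate actions whose predicted next state is likely safe."""
    scored = [(a, safety_classifier(transition_model(state, a)))
              for a in candidate_actions]            # P(unsafe) per action
    safe = [a for a, p_unsafe in scored if p_unsafe < threshold]
    # fall back to the least risky action if nothing clears the threshold
    return safe if safe else [min(scored, key=lambda pair: pair[1])[0]]
```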
Bridging the reality gap in sim-to-real transfer
An essential approach for improving the practicality of RL for robotic control is sim-to-real transfer, referring to pre-training a policy in simulation and then fine-tuning it in the real environment. Simulation training enables training episodes to be executed faster than wall clock time, reducing training time and cost. Safety is also improved as the most hazardous early training phase, where actions are random, is conducted in a simulation with no consequences for unsafe exploration. The physics simulators typically used for RL in contact-rich robotic manipulation are MuJoCo, Gazebo, Bullet and Isaac.
There is a mismatch between simulation and reality, referred to as the reality gap. 195 The reality gap is caused by three factors: the limited fidelity of physics simulations, inaccurate reconstruction of the real environment in simulation, and inaccurate generation of synthetic sensor data. The effects of these factors are discussed below.
The more significant the reality gap, the more fine-tuning is required to complete sim-to-real transfer, reducing the time, cost and safety benefits of the approach. Therefore, techniques have been proposed to close the reality gap and enable efficient transfer of simulation-trained policies to real environments. The techniques discussed in this sub-section are grouped into three data-driven sub-approaches, namely Domain Randomization (DR), System Identification (SI) and Domain Adaptation (DA), which are illustrated in Figure 9. A further sub-approach, which aims to close the reality gap by enhancing simulation fidelity, is also discussed.

Visualisations illustrating the intuition behind the three data-driven sub-approaches in closing the reality gap for sim-to-real transfer. (a) domain randomisation, (b) system identification, (c) domain adaptation.
Domain randomization
DR195,197 involves randomising select simulation parameters during in-simulation training so that the agent is trained on a distribution of environments. By training a policy to be successful across this distribution, the policy can be transferred to the real environment with little fine-tuning, as the real environment can be considered an instance of the distribution already seen in training. DR has been the most commonly applied sim-to-real transfer technique within the contact-rich manipulation literature27,48,68,119,120,125 due to its simplicity and effectiveness. Typically, parameters such as joint damping, geometry, friction, surface appearance and stiffness are randomised to facilitate successful sim-to-real transfer for contact-rich tasks. This technique can be seen as a way to compensate for reality gaps caused by all three factors discussed.
The optimal use of DR relies on carefully designing the parameter sampling distribution. The parameter sampling distribution’s mean should be placed near the real environment’s expected parameter values to minimize the required fine-tuning. Also, the distribution’s variance should be large enough to account for environmental uncertainty and compensate for the physics engine’s low fidelity and inaccurate environment reconstruction. However, the variance should be limited as too much produces an overly conservative policy with sub-optimal performance 198 and unnecessarily worsens sample efficiency. These design considerations can make it challenging to implement DR successfully, so researchers have proposed extensions to DR that streamline implementation to achieve optimal results quickly.
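The core of DR can be sketched in a few lines; the parameter names, nominal values and standard deviations below are illustrative assumptions rather than values from the cited works:

```python
import numpy as np

# Sample one simulated environment variant per episode around the expected
# real-world parameter values.
NOMINAL = {"friction": 0.6, "joint_damping": 0.05, "hole_clearance_mm": 0.5}
STD     = {"friction": 0.1, "joint_damping": 0.02, "hole_clearance_mm": 0.2}

def sample_sim_params(rng):
    # clip at a small positive value so physical parameters stay valid
    return {k: max(1e-6, rng.normal(NOMINAL[k], STD[k])) for k in NOMINAL}

rng = np.random.default_rng(0)
for episode in range(1000):
    params = sample_sim_params(rng)
    # sim.reset(**params) -> run one RL training episode in the randomized sim
```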
Chebotar et al. 117 proposed a technique that iteratively updates the sampling distribution of environment variants towards the real environment's parameters. The optimization process was facilitated by a closed-loop system that updated the simulation using data from the real environment, enabling successful sim-to-real transfer of a door opening policy. Beltran-Hernandez et al. 121 proposed combining curriculum learning with DR for insertion, adapting the parameter sampling distribution's variance with the average reward obtained during training: the larger the average reward, the larger the variance grew. The adaptive variance enabled the agent to learn broad parameter sampling distributions quicker than with a static variance, as it presented the agent with tasks of progressive difficulty, which were easier to learn. Aflakian et al. 127 also employed this technique for a surface tracking task. Another extension of DR is to combine it with Meta-RL (Section 4.7).104,106 In basic DR, the RL algorithm treats the environment variations as a single environment, so the agent learns an average optimal policy across all variations rather than addressing each variant optimally. This results in overly conservative policies for broad task distributions and requires more fine-tuning to transfer well to the real environment. In Meta-RL, the meta-policy instead considers the environment variations as distinct tasks, learning the common features or representations across them to facilitate fast adaptation to other potential variations. This alternative approach reduces the fine-tuning required when transferring to the real environment.
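The reward-driven curriculum variant can be sketched as follows (the scaling rule is an illustrative assumption; the cited works' exact schedules may differ):

```python
def adapt_variance(base_std, avg_reward, reward_max, max_scale=3.0):
    """Grow the DR sampling variance as the running average reward improves."""
    progress = min(max(avg_reward / reward_max, 0.0), 1.0)  # training progress
    scale = 1.0 + (max_scale - 1.0) * progress              # 1x .. max_scale x
    return {k: std * scale for k, std in base_std.items()}
```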
System identification
SI aims to identify the real environment's dynamic parameter values directly and use these values to improve simulation accuracy, partially addressing the reality gap associated with inaccurate reconstructions of the real environment. With a more accurate simulation, less fine-tuning is required when transferring to the real environment, improving sample efficiency. As accurate environment parameter values are difficult to obtain manually, SI techniques infer parameter values in a data-driven manner. The technique developed by Kaspar et al. 118 involved executing a handcrafted controller in the real and simulated environments and iteratively optimizing the simulation parameters until the robot trajectories in the two environments matched.
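A trajectory-matching SI loop of this kind can be sketched as below; the simulator interface and the simple hill-climbing search are assumptions for illustration:

```python
import numpy as np

def identify_parameters(real_traj, simulate, init_params,
                        iters=200, step=0.05, seed=0):
    """Tune simulation parameters until the simulated rollout of the same
    controller matches the trajectory recorded on the real robot."""
    rng = np.random.default_rng(seed)
    params = np.asarray(init_params, dtype=float)
    best = np.mean((simulate(params) - real_traj) ** 2)
    for _ in range(iters):
        candidate = params + step * rng.standard_normal(params.shape)
        err = np.mean((simulate(candidate) - real_traj) ** 2)
        if err < best:                # accept only improvements
            params, best = candidate, err
    return params
```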
Domain adaptation
DA 199 is an ML technique enabling model transfer to a target domain even when the available data from the target domain is too scarce to facilitate transfer via regular fine-tuning. DA is achieved by unifying the source (simulation) and target (real) domain data in the same feature space, allowing knowledge obtained in the source domain to improve the model's performance in the target domain. It is often used to address the reality gap associated with inaccurate synthetic sensor data generation in simulation. DA is promising for sim-to-real transfer as it can reduce the amount of real-environment fine-tuning required, improving sample efficiency.
DA techniques typically centre around the use of Generative Adversarial Networks (GANs). 200 GANs are trained in an unsupervised manner to generate realistic samples that imitate a relatively scarce dataset. In the context of RL for contact-rich robotic manipulation, GANs are trained on a scarce dataset of sensor data from reality. After training, the underlying distribution of real data is captured in a low-dimensional latent space. The latent spaces of data captured in the real and simulated environments can then be aligned, creating a mapping between the two domains. A policy trained in simulation can then be transferred directly to reality without further re-training, using the mapping of sensory observations between the source and target domains.
Shi et al. 123 used CycleGAN 201 to learn a function that transforms image observations from the real environment into equivalent observations in simulation. This enabled the simulation-trained visuomotor policy for insertion to transfer to the real environment with little fine-tuning. Zhao et al. 124 also used CycleGAN to transfer an insertion policy using tactile data from simulation to reality. Church et al. 122 used the pix2pix framework, 201 also based on GANs, to map real tactile sensor data to the sensor data used for learning in simulation, enabling efficient transfer of a surface tracking and non-prehensile manipulation policy from simulation to reality. Wu et al. 128 used a GAN-VAE framework to enable the sim-to-real transfer of a surface tracking policy using tactile sensory observations.
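At deployment, such a learned mapping is applied to every incoming observation before it reaches the simulation-trained policy; a minimal sketch under an assumed gym-style environment interface and a pre-trained `real_to_sim` generator is:

```python
def deploy(env, policy, real_to_sim, episodes=10):
    """Run a sim-trained policy on the real robot via a learned observation map."""
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            sim_like_obs = real_to_sim(obs)   # map real data into the domain
            action = policy(sim_like_obs)     # the policy saw during training
            obs, reward, done, info = env.step(action)
```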
Enhancing simulation fidelity
Some works aim to close the reality gap by enhancing simulations to better simulate the physical phenomena in the real world. These techniques intend to address reality gaps caused by low-fidelity physics simulations and inaccurate synthetic sensor data generation.
Vuong and Pham 126 propose a technique that reduces the number of contact points used to compute contact forces when objects are in contact and bounds the stiffness of each contact point. With these modifications, the speed, accuracy and stability of the contact simulation were shown to improve. The authors demonstrate the improved accuracy of contact simulation in simple studies and show an insertion policy being successfully transferred from their modified simulation to reality. However, more extensive studies are required to further validate the effectiveness of this technique.
Chen et al. 129 use a more expensive Finite Element Method (FEM) simulator to obtain a more accurate simulation of the deformation of an on-finger tactile sensor in contact with an object. They showed successful transfer of an insertion policy with tactile sensory input, trained in the FEM simulation, to a real environment. They also compared the performance of policies trained in FEM and rigid-body simulators, showing the improved transferability of the FEM-trained policy.
Discussion
Assessing the current state of the key challenges
We revisit the key challenges presented in Section 3.1 and analyse their present status given the recently developed techniques. We also identify gaps where further improvement may be necessary.
Cost
Recent techniques have reduced the cost of implementing RL-based robotic control by improving sample efficiency and generalizability. Improving sample efficiency reduces the time required to learn an effective policy from scratch in a new environment. Enhancing generalizability also reduces training time, but only when a policy for an environment similar to the target environment is available for re-use, in other words, when policies are re-used between tasks belonging to the same task type. We discuss the works improving these aspects of RL and their impact on cost below.
Sample efficiency
Various approaches have been utilised to enhance sample efficiency, as highlighted in Section 4. These approaches rely on expert knowledge injection, auxiliary knowledge being learned, or knowledge transfer between task types to accelerate the exploration of promising states and actions, resulting in convergence to an optimal policy within fewer environment interactions. The findings in the literature demonstrate that employing these approaches results in improved sample efficiency compared to basic RL, potentially reducing the associated training times and costs. Table 2 summarises the approaches targeting improved sample efficiency.
Table 2. Analysis of the approaches that aim to improve sample efficiency: whether each approach requires re-implementation across different task types, a broad categorization of how enhanced sample efficiency was achieved, and a more detailed description of how each approach achieved it.
Most approaches enhance sample efficiency using injected expert knowledge or learned auxiliary knowledge specific to the task type. In these cases, the task-specific knowledge is used to directly guide the agent towards the goal, significantly accelerating reward acquisition and convergence to an optimal policy. Despite significantly improving sample efficiency, these approaches must be re-implemented when learning new task types, as the underlying knowledge may not be applicable to accelerating the learning of a novel task type. For example, dense reward functions only apply to the task type they were designed for, so a new dense reward function must be designed for a new environment. Re-implementing these techniques across task types offsets the cost-savings from enhanced sample efficiency when practitioners must consider a diverse set of task types, which is not ideal.
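As an illustration of this task specificity, a dense reward for a peg-in-hole task might look like the hypothetical sketch below (the terms and weights are assumptions for the example); none of it would transfer to, say, a door opening task:

```python
import numpy as np

def insertion_reward(peg_pos, hole_pos, contact_force,
                     w_dist=1.0, w_force=0.01, force_limit=5.0):
    """Reward progress towards the hole and penalise excessive contact force."""
    distance = np.linalg.norm(np.asarray(peg_pos) - np.asarray(hole_pos))
    excess_force = max(0.0, np.linalg.norm(contact_force) - force_limit)
    return -w_dist * distance - w_force * excess_force
```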
A minority of approaches do not require re-implementation across different task types due to the universal nature of the knowledge used to improve sample efficiency. For example, inverse kinematics is used in control mechanism design to improve sample efficiency, and re-implementation is not required for a new environment to achieve the same effect. However, these approaches result in a less significant sample efficiency enhancement as the knowledge used does not directly guide the agent towards the desired goal and only passively improves learning.
As a result, approaches that require re-implementation have been more prominently studied. To minimize their re-implementation costs, techniques have been developed to streamline the implementation of these approaches by reducing manual effort and reliance on specialized knowledge. For instance, self-supervised SRL 26 in perception system design, IRL62,63 for dense reward functions, the integration of imitation learning with residual RL67,70,71 and transition model transferability109,115,193 in MBRL.
A promising solution is directly transferring knowledge between policies from previously learned task types to new task types. In the current literature, the only approach facilitating such knowledge transfer is task decomposition, which enables the reuse of modular sub-policies between task types. Studies90–95 have shown accelerated and successful learning of a set of varied task types using a fixed set of sub-policies. Task decomposition approaches could potentially be scaled up to learn a broader range of novel task types quicker with RL-based robotic control, but this scalability requires further development. 161 The main benefit of knowledge transfer is that different task types can be learned sample-efficiently without re-implementation between task types, significantly reducing the cost of RL-based robotic control.
Generalizability
The current research addresses generalizability at various scopes. In increasing order, these are robustness to environment uncertainty, generalizability between tasks belonging to the same task type, and policy reuse between different robots.
Residual RL and control mechanism design approaches have addressed policy robustness to environment uncertainty. Residual RL uses a residual policy to account for environmental uncertainty not captured by the expert trajectory, enhancing the trajectory's robustness.66,67,70–74,105,145 Control mechanism design principles prescribe compliant control schemes, which enable robots to adapt to environmental uncertainty by adjusting their behaviour in response to unexpected external forces.27,84,85
Approaches such as perception system design, meta-learning and task decomposition improve the generalizability of policies across tasks belonging to the same task type as well as handling environmental uncertainty. Perception system design focuses on capturing state features independent of task variations through richer sensor modalities like visual feedback, creating a policy adaptable to different task variants. Meta-learning explicitly trains a meta-policy on a distribution of similar environments, enabling environments captured by the task distribution to be quickly learned using the meta-policy.104–108 Task decomposition modularizes the policy, facilitating faster transfer to tasks sharing similar structures, as adaptation occurs at the sub-policy level and requires less fine-tuning than a monolithic policy.90,95
Finally, policy re-use between robots has been addressed by control mechanism design, which suggests using EEF space control to make policies independent of robot kinematics. 27 However, successful inter-robot policy transfer still requires extensive fine-tuning due to the unaccounted variability of dynamics between robots, which may be costly. Future work should be directed at developing techniques that account for robot dynamics variability to reduce the cost of inter-robot policy transfer. 204
Safety
Different approaches have been utilised to enhance safety across RL agent lifecycle phases. Combining techniques from these approaches can improve safety across the RL agent’s lifecycle. The different phases of the RL agent’s lifecycle are listed and summarised below:
During early training, sim-to-real transfer and residual RL ensure safety by bypassing the early training phase and beginning training with a mature policy that makes it less prone to behaving dangerously. Sim-to-real transfer conducts the early training phase in simulation and transfers the policy to reality, which starts training in reality with a mature policy. A policy in residual RL only explores locally around an expert trajectory, which mimics a mature policy being fine-tuned. These approaches are the most commonly adopted as they are simple to implement but may be limited by scalability. Sim-to-real transfer requires an accurate reconstruction of the task in simulation, which can be time-consuming and expensive to produce. Residual RL may need different expert trajectories to be implemented across various tasks, which can be time-consuming and costly.
In the late training phase, exploration is less random, but there is still a risk of acting hazardously during fine-tuning. As a fallback, control mechanism features such as compliant control schemes and limiting actuator torque or velocity can minimize damage in the event of contact during the final stages of training.61,83,104,123 Dense reward terms can encourage safer policy behaviour and further enhance the safety of the final policy.71,74,84
MBRL has been used to improve safety during the early and late phases of training. This was achieved by predicting future state transitions, pre-defining unsafe states for the agent to avoid during exploration and preventing the execution of actions that will lead to unsafe states. 114 As MBRL involves learned transition models, managing implementation costs across varied environments is a concern due to sample efficiency and implementation complexity. However, using learned transition models opens the possibility of transferability of the learned knowledge to enhance safety across different environments whilst minimizing re-implementation costs. Therefore, building on MBRL may be a promising direction for future research in improving safety. 39
During deployment, the robot may be vulnerable to unsafe novel scenarios, as it has not learned how to behave safely in such scenarios during training. Since learning is suspended during deployment, non-ML-based techniques, such as RMPs86,160 and null space transformations, 87 have been developed to enhance safety in novel scenarios. However, the scope of scenarios for which these techniques can maintain safe operation is undetermined, and more non-ML-based techniques may be required to cover further unsafe scenarios. Developing more of these non-learning techniques may be a direction for future research.
Promising directions for future research
The following directions for future research are proposed based on the gaps identified in Section 5.1.
Knowledge transfer across diverse tasks
The long-term ambition for robotic control is to develop systems capable of performing a wide range of tasks without task-specific adjustments or fine-tuning. Achieving this would require robots to leverage experiences from diverse tasks to determine appropriate actions for new tasks in a zero-shot manner, similar to how large language models have demonstrated impressive generalisation capabilities in the natural language processing domain.
Current research in robotic control is predominantly focused on single-task settings, with limited emphasis on leveraging knowledge gained from different tasks. Existing approaches often require task-specific implementation, lacking mechanisms for transferring or generalizing knowledge across varying objectives and environments. While meta-learning techniques have been explored, their application has been largely confined to related tasks within narrow domains.
Expanding the scope of research to encompass broader task distributions and more effective knowledge transfer mechanisms could significantly enhance the generalization capabilities of robotic systems. Developing architectures that can consolidate knowledge from diverse experiences and apply it to novel tasks remains a critical challenge. Future efforts should focus on designing frameworks that facilitate scalable knowledge transfer, allowing robots to continually expand their skill-sets through interaction with varied environments without extensive retraining.
Enabling cross-platform generalization through inter-robot policy transfer
One of the most significant barriers to widespread deployment of RL-based control systems is their lack of generalisability across different robotic platforms. The current reliance on training in EEF space addresses kinematic differences but neglects the complexities introduced by diverse dynamics, particularly during contact interactions. 204
Achieving true inter-robot policy transfer requires developing methods that are invariant to both kinematic and dynamic discrepancies between platforms. Such advances would pave the way for universally applicable policies that can be seamlessly deployed across various robots, drastically reducing the cost and time associated with re-training.
Future research could explore embedding dynamic awareness into policies, leveraging techniques like meta-learning. Breakthroughs in this area could enable plug-and-play robotic controllers applicable to a multitude of platforms, thus accelerating the mainstream adoption of RL-based robotic control.
Beyond controlled environments: real-world RL and continual learning
In the current implementation of RL, agents are trained to maximise rewards in controlled training environments. Controlled environments are simulated or real-world environments where the environment’s state can be precisely monitored to administer rewards accurately, and the environment can be easily reset between training episodes. The policy is continuously updated during this training based on the experiences gathered during environmental interactions.
Once a desired agent behaviour has been obtained, training is suspended, and the best version of the policy is deployed in the operating environment, which is not controlled. Training is suspended because precise monitoring of the operating environment's state is not guaranteed, so computed rewards may not be useful for updating the policy. 205 Further, resetting the environment in the real world across many attempts may be impractically time-consuming if the agent requires many training episodes to adapt to the operating environment. During deployment, the policy is no longer updated based on new experiences, and decisions are made based solely on what was considered optimal for the controlled environments seen during training. This may result in the policy failing in an unseen variation of the intended environment, and the agent may even act unsafely.
Current research using meta-learning104–108 aims to address this problem by training the agent in varied environments so that it performs well across those variations, thus being robust to variations in the operating environment. However, this can be very costly due to significantly longer training times, and the agent can still fail if the variations in the operating environment are not represented by the environments presented during training. 173
A potential path forward is to enable continuous policy updates through the agent's interactions within the operating environment by addressing the issues with reward assignment and environment resetting. This is currently being investigated for other applications of RL and is known as real-world RL.205–207 Another required component is continual or lifelong learning,208,209 an ML paradigm that enables an agent to learn incrementally from a stream of incoming data, rather than in one instance on a static set of past experiences, without forgetting past learnings. Incorporating these components into RL for contact-rich robotic manipulation could enable agents to learn to deal with unseen task variations and avoid novel unsafe scenarios without needing to return to a controlled environment for a formal training process.
Conclusion
RL-based robotic control is a promising approach for addressing contact-rich robotic manipulation tasks due to its ability to solve complex environments with minimal human supervision. It was identified that the current research was motivated by the need to improve the cost (through improving sample efficiency and generalizability) and safety of RL-based control, which were considered to be bottlenecks in its wider adoption.
To improve cost and safety, researchers have developed a variety of techniques. The review grouped these techniques into approaches, each corresponding to a specific perspective on the problems of RL-based robotic control that motivated the development of a given technique. The approaches and associated techniques were presented in an organized manner to provide a structured overview of the current state of research.
The approaches were then holistically analysed based on their cost and safety impact. It was found that cost and safety had improved significantly due to the recent developments, making RL-based robotic control more viable for wider adoption for future industrial automation. To further enhance RL-based robotic control, it was suggested that improving knowledge transfer between different task types, improving inter-robot policy transfer and implementing real-world RL and continual learning, were promising research directions.
Footnotes
Correction (November 2025):
In the published version of the article, the textual errors “MDPNv132” and “two-dimensional” on page 5, have been corrected to “MDPNv132” and “low-dimensional”. The citation “Aflakian et al.127” on page 23 has been corrected to “Chen et al.129” and lastly, the sentence “RL Manchin et al.,139 Fernandes et al.,140 Sorokin et al.141 on RL benchmark tasks” on page 12 has been revised to “RL139,140,141 on RL benchmark tasks.” The online version of the article has been updated to reflect the changes.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/W00206X/1.
