Abstract
Untethered small-scale soft robots have promising applications in minimally invasive surgery, targeted drug delivery, and bioengineering, as they can directly and non-invasively access confined and hard-to-reach spaces in the human body. For such potential biomedical applications, the adaptivity of the robot control is essential to ensure the continuity of the operations, as task environment conditions show dynamic variations that can alter the robot’s motion and task performance. The applicability of the conventional modeling and control methods is further limited for soft robots at the small scale owing to their kinematics with virtually infinite degrees of freedom, inherent stochastic variability during fabrication, and changing dynamics during real-world interactions. To address the challenge of adapting the controller to dynamically changing task environments, we propose a probabilistic learning approach for a millimeter-scale magnetic walking soft robot based on Bayesian optimization (BO) and Gaussian processes (GPs). Our approach provides a data-efficient learning scheme by finding the gait controller parameters while optimizing the stride length of the walking soft millirobot using a small number of physical experiments. To demonstrate the controller adaptation, we test the walking gait of the robot in task environments with different surface adhesion, surface roughness, and medium viscosity, which are intended to represent the possible conditions for future robotic tasks inside the human body. We further utilize the transfer of the learned GP parameters among different task spaces and robots and compare their efficacy on the improvement of data-efficient controller learning.
Keywords
1. Introduction
Soft robots are composed of highly deformable soft materials exhibiting programmable shape change, mechanical compliance, and high degrees of freedom, which are hard to achieve using rigid materials (Majidi, 2014). Easier access to novel fabrication methods further allows the engineering of stimuli-responsive soft materials that enable new functionalities for soft robots at multiple length scales (Shen et al., 2020). Biology remains a source of inspiration for the design, control, and behavior of soft robots (Laschi et al., 2016) and provides templates for new application areas in multi-terrain locomotion (Calisti et al., 2017), adaptive manipulation (Hughes et al., 2016), sensing (Iida and Nurzaman, 2016), human-assistive wearable systems (Walsh, 2018), and biomedicine (Cianchetti et al., 2018). Soft robots also enable safe human–robot physical interaction due to their physical compliance and the mechanical dampening of excess forces (Polygerinos et al., 2017), which otherwise require additional computational effort in conventional robotic systems (Haddadin et al., 2017). Small-scale (i.e., ≤1 cm) untethered soft robots have further potential application areas in medicine owing to their ability to access enclosed small spaces non-invasively (Sitti, 2018) and the embodiment of functionalized materials enabling targeted drug delivery, diagnostics, and surgery (Cianchetti et al., 2018).
Despite their exciting potential and new capabilities, soft robots face challenges that arise from the nature of their soft materials, such as having virtually infinite degrees of freedom, being prone to fabrication-dependent performance variabilities that cause more significant effects at the smaller scale, and nonlinear material behavior (e.g., hysteresis, creep). Moreover, physical interactions of soft-bodied robots with their operation environment, such as solid or fluid operation medium, are very hard to model due to complex fluid–structure interactions, soft body dynamics, and contact mechanics. The combination of these aspects renders the application of conventional modeling and control methods challenging for soft robots, especially for untethered systems at the small scales (Rus and Tolley, 2015). One of the most widely used methods is the employment of the constant curvature (CC) models that utilize the well-established beam theories to model the kinematics and dynamics of axisymmetrically bending soft robotic systems (Della Santina et al., 2020; Webster and Jones, 2010). Alternatively, analytical approaches using Cosserat rod models (Renda et al., 2018) and geometrically exact models have been suggested for continuum robots (Grazioso et al., 2019). Simulation techniques build upon these modeling methods as in the finite-element methods (FEMs), which construct continuum robot structures using a chain of rigid elements connected with tunable spring–damper mechanisms (Chenevier et al., 2018; Goury and Duriez, 2018). Numerical approaches using voxel-based representations (Hiller and Lipson, 2014) and discrete differential geometries (DDGs) (Huang et al., 2020) improve the computation time of soft robotic simulations at the expense of nonlinear dynamics precision. These models and simulation tools typically allow the implementation of static and dynamic controllers for continuum robots on a larger scale (Thuruthel et al., 2018). 
However, the physical application of these closed-loop controllers depends on the continuous sensing of body deformations from embedded sensors and highly responsive actuators, and computationally heavy model solutions, which are conditions that may not be met for untethered soft robots at the small scales (Rich et al., 2018). Therefore, the soft robotic platforms that successfully employ the analytical models at the small scales still depend on either open-loop (Gu et al., 2020; Lu et al., 2018; Ren et al., 2019; Wu et al., 2019) or manually applied (Kim et al., 2019) controllers. In particular, for those robots targeting medical applications, their dynamically changing and deformable task environments, fabrication-based variations, and material degradation over prolonged use significantly alter their robotic function performances and pose challenges for the conventional control strategies (Sitti, 2018). The combination of these challenges makes the machine learning-based, adaptive, and data-efficient (i.e., using as few experiments as possible) control methods more desirable for untethered small-scale soft robots.
Data-driven machine learning methods may provide alternative solutions for the design and control of soft robots in the absence of analytical or numerical models that describe their underlying kinematics, dynamics, and functions (Chin et al., 2020). One common approach is to learn these models by gathering data from robot experiments and training a neural network (NN) architecture (Bern et al., 2020; Hyatt et al., 2019; Thuruthel et al., 2019). However, the need for data efficiency, i.e., the ability to learn from only a few experimental trials, presents a core challenge for such methods (Chatzilygeroudis et al., 2019). In contrast, Bayesian optimization (BO) (Ghahramani, 2015; Shahriari et al., 2015) allows for the maximization of a performance function using a small number of physical experiments. BO typically employs Gaussian processes (GPs) (Rasmussen and Williams, 2006) as a probabilistic model of the latent objective function. Although no explicit dynamics model is needed, GPs allow for incorporating information as probabilistic priors, thus reducing the experimental data requirements. There are emerging examples that demonstrate the application of this approach to optimize the locomotion performance of robots on different length scales (Calandra et al., 2016; Liao et al., 2019; Marco et al., 2020; Yang et al., 2018). Despite its potential, there are only a few examples that apply this method to address the controller challenge for untethered soft robots, such as the gait exploration of a tensegrity system (Rieffel and Mouret, 2018) and the optimization of an undulating motion of a microrobot (von Rohr et al., 2018).
For cases where the training and testing domains show differences in terms of features or data distribution, transfer learning (TL) methods may provide further improvements in data-efficient learning and adaptation to new test cases (Pan and Yang, 2009). Within the BO applications that employ GPs, the prior knowledge can be transferred as GP priors (Raina et al., 2006) and hyperparameters (Perrone et al., 2019) from the trained domains to provide predictive information about the unknown features and distributions in the new test domains. In robotics, TL is typically employed as the transfer of the models of kinematics and dynamics between simulated and physical platforms of conventional rigid robotic systems, such as manipulators (Devin et al., 2017; Makondo et al., 2018), humanoids (Delhaisse et al., 2017), and quadrotor platforms (Helwa and Schoellig, 2017). However, the application of TL to soft robotic systems is still in its infancy (Schramm et al., 2020).
In our recent work in Culha et al. (2020), we demonstrated the controller learning of walking soft millirobots using BO and GPs and showed the improvement of learning efficiency by means of transferring prior mean information between robots as a TL application example. We followed the initial example by von Rohr et al. (2018), who designed a learning scheme by comparing different GP priors and BO settings on generating a semi-synthetic dataset that represents the estimated gait controller space and used this estimation to optimize the one-dimensional crawling gait of a light-driven soft microrobot. In our work in Culha et al. (2020), we adopted the magnetic soft millirobots from Hu et al. (2018) that lacked sufficiently predictive kinematic models and were therefore controlled with an open-loop system whose multi-dimensional parameters were tuned manually. We showed that these robots suffered performance inconsistencies due to fabrication reproducibility issues, material degradation over prolonged experiments, and environmental disturbances, which limited the derivation of a deterministic kinematic model and the application of relevant model-based controllers. Therefore, we applied BO and GPs to directly learn the controller parameters while optimizing the stride length performance of these robots and employed TL methods to improve learning efficiency using a small number of physical experiments.
In this study, we extend our previous work in Culha et al. (2020) and provide an in-depth analysis of using BO and GPs (in particular for TL) to directly and efficiently learn the controller parameters of the magnetic soft millirobots’ walking gait on task spaces emulating biomedical application environments (Figure 1). First, we introduce our new automated and closed-loop experimental platform that can run the robot learning experiments repeatedly and reliably to eliminate the influence of any human intervention, which caused further material degradation and consequent performance inconsistencies in Hu et al. (2018) and Culha et al. (2020). We start by using an exhaustive search on the two-dimensional (2D) gait controller parameter space of the millirobot and generating benchmark datasets that show the stride length performances of three different robots on three different walking surfaces. We use this benchmark data to learn the optimum gait controllers using BO and GPs, and then to compare the influence of four different TL methods on the improvement of learning efficiency. We choose the best performing TL method from these experiments and use it with the BO and GPs to learn the walking gait controller parameters on a wide range of task spaces. We test our robots on task spaces with different surface roughness and friction, and liquid medium viscosity to emulate the conditions inside the human body for future target operations. Our results reveal that the direct controller learning with BO and GPs allows for adaptation to different task spaces for small-scale untethered soft robots that are prone to fabrication-, material-, and interaction-dependent performance variabilities. We also show that the effective use of TL improves this adaptation by exploring a larger set of successful walking gait controllers within a limited number of physical experiments despite the significantly changing task space conditions.
The methodology we present in this study can be used for controlling future small-scale soft robot applications for medical operations that require a data-efficient controller learning system and quick adaptation to the changing task environments. In summary, the main contributions of our work are:
demonstration of a data-driven optimization tool (i.e., BO) that can efficiently learn the gait controllers of a small-scale untethered robot whose performance is prone to fabrication-, material-, and physical interaction-based variabilities;
successful testing of the walking gait on three different task spaces that emulate, e.g., dynamic environments inside the human body, and the adaptation of the robot controller parameters to these environments in a small number of experiments;
implementation of an automated experimental platform that runs and evaluates the physical learning experiments repeatably and reliably without human intervention and simulated environments;
comparison and evaluation of four different TL methods within the context of GP hyperparameters on the learning efficiency of BO on the small-scale soft robots;
generation of five benchmark datasets consisting of the exhaustively parsed controller parameter space involving 3,750 different physical experiments for three different robots and three different walking surfaces that would allow further comparison between different optimization methods, which are available at https://github.com/sozgundemir/softrobotwalkingdataset

(a) Fabrication process of the magneto-elastomeric soft millirobot: the robot, which is composed of non-magnetized ferromagnetic microparticles homogeneously distributed inside a silicone elastomer sheet, is rolled around a cylindrical rod and magnetized with
The organization of this paper is as follows. We describe the design of the robotic system, the walking gait, and the properties of the task environments in our experiments in Section 2. Section 3 describes the learning approach with the details on the BO, GP, and TL methods. In Section 4, we present the experiments on generating the benchmark datasets, comparing the TL methods, and learning the walking gait in different task environments. We discuss the experimental results and conclude our work in Section 5.
2. Experimental robot system
2.1. Robot design and fabrication
We follow the methods and materials reported in Hu et al. (2018) and our previous work (Culha et al., 2020) and fabricate three magnetic soft millirobots with a 1:1 body mass ratio of Ecoflex 00-10 (Smooth-On Inc.) and neodymium–iron–boron (NdFeB) ferromagnetic microparticles with around
2.2. Walking gait definition
The walking gait of our robot is composed of four consecutive quasi-static states that are inspired by the planar quadrupedal bounding (Alexander, 1984) and a caterpillar inching motion (Trimmer and Lin, 2014). These states are depicted as (1) relaxed, (2) front-stance, (3) double-stance, and (4) back-stance as shown in Figures 1(c)–(f). We control four parameters to generate the walking gait: the maximum magnetic field magnitude (
2.3. Actuation and feedback setup
We place our magnetized soft robot along the y-axis of the magnetic coil setup consisting of three orthogonal pairs of custom-made electromagnets (see Figure 1(b)) that can generate a 3D uniform magnetic field within a
We track the robot’s gait using two high-speed cameras (Basler aCa2040-90uc, shown in Figure 1(b)). The first camera, running at 120 frames per second (fps), is placed orthogonal to the axis of robot motion (i.e., viewing the y–z plane of the controller). A tracking algorithm, whose pseudo-code is given in Appendix B, uses this camera to detect and evaluate the robot’s motion to identify if the robot is moving according to the walking gait definition given in Section 2.2. The second camera, running at 60 fps, has an isometric view of the test scene and is used to measure the distance traveled by the robot following the perspective correction of the captured image. In every experiment, we calculate the stride length of the robot by tracking the average distance covered by its center of mass in five consecutive steps. At the end of every experiment, the robot is moved back to its original starting position automatically with the tracking and the actuation commands. See Extension 1 for the gait detection, position tracking, and repositioning for robot 3 walking on paper.
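The stride length measurement described above can be illustrated with a minimal sketch. It assumes that the tracker reports the center-of-mass position in millimeters at the start of the gait and after each completed step, and that the stride length is the net displacement divided by the number of steps; the function name and the data layout are hypothetical and are not taken from the tracking algorithm in Appendix B.

```python
from math import hypot


def stride_length(positions, n_steps=5):
    """Average distance covered per step by the center of mass.

    positions: list of (x, y) center-of-mass coordinates in mm, one at the
    start of the gait and one after each completed step (assumed layout),
    so it must hold n_steps + 1 entries.
    """
    if len(positions) != n_steps + 1:
        raise ValueError("expected one position per step boundary")
    (x_start, y_start), (x_end, y_end) = positions[0], positions[-1]
    # Net displacement over the whole experiment, averaged over the steps.
    return hypot(x_end - x_start, y_end - y_start) / n_steps


# Made-up track along the walking axis, for illustration only.
track = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (3.0, 0.0), (4.2, 0.0), (5.0, 0.0)]
print(stride_length(track))  # prints 1.0
```

Averaging over several steps, rather than reporting a single step, reduces the influence of step-to-step variation on the reward that the learning system receives.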
The learning process and image processing run on a master PC, and all the communication tasks between different elements of the robotic system (e.g., image capture and electric current control) are executed on the Robot Operating System (ROS) architecture, which allows our system to be scalable for further extensions. The automated experimental platform implemented in our work allows the physical experiments to be executed with minimum human intervention, thereby reducing the human-based disturbances on the robot and the test surfaces. Without these interactions, which can cause significant alterations in the soft millirobots, the physical learning experiments can be run repeatably and reliably.
2.4. Task environments
In this study, we use a wide range of different task environments to test the efficacy of our adaptive learning strategy in comparison with the limited surface experiments in Culha et al. (2020). Our goal is to emulate the in-air and liquid-immersed surface walking environments that a magnetic soft millirobot might experience during future medical operations inside the human body. To capture some of the characteristic properties of the target tissues and body fluids, we fabricate different task spaces and vary their (1) surface adhesion, (2) surface roughness, and (3) liquid medium viscosity. For each of these properties, we experimentally identify the range of values that allow successful walking gaits and systematically test the robots in these specific ranges.
We fabricate a set of flat substrates with different surface adhesion strengths by using different materials. This set of substrates consists of paper, polystyrene (PS), and modified polydimethylsiloxane (PDMS). PDMS substrates are prepared by mixing Sylgard 184 (Dow Corning) with its curing agent at a 10:1 ratio, degassing, and curing at 90 °C for 1 hour. PDMS is modified by adding ethoxylated polyethylenimine (80% solution, Sigma Aldrich) prior to mixing and curing to increase its adhesive properties (Jeong et al., 2016). A volume of 0 (PDMS-0), 1 (PDMS-1), and

The type and range of task space properties investigated for the robot’s walking gait. (a) Adhesion strength of different surfaces ranges between 1 and 10 kPa. (b) Roughness values of different surfaces that are named after the grit scale of their template sandpapers. The inset figures from the profilometer scans represent the two extremes of the roughness range (i.e.,
Walking surfaces with changing roughness are prepared by replicating the surface texture of different grits of sandpaper. First, we fabricate the negative molds of the original surfaces by pressing a glass plate with a layer of uncured vinylsiloxane polymer (Flexitime medium flow, Heraeus Kulzer GmbH) onto the original surfaces. After curing the molds for 5 minutes at room temperature, we remove them and produce positive replicas of the original surfaces by casting clear epoxy (EpoxAcast™ 690, Smooth-On Inc., 10:3 ratio by weight) onto the mold, with another glass plate pressed on top. After 24 hours of curing, we remove the positive replicas from the molds. According to surface profile examination of the replicated surfaces (Keyence VKX260K), surface roughness values
Last, we submerge the robots in different Newtonian fluids while they walk on a flat surface to investigate the effect of bulk liquid medium viscosity on the walking performance. The Newtonian fluids are prepared by mixing different ratios of water and glycerol, and their viscosities are measured with a TA Instruments Discovery HR-2 rheometer. We analyze task environments with medium viscosity ranging from 1 to 90 cP as shown in Figure 2(c).
3. Learning approach
We adapt the learning approach from our previous work (Culha et al., 2020) that aims to optimize the walking gait controller parameters to maximize the stride length S of the robot. Here we define the reward function as
which maps the parameter set
where Θ denotes the complete search space, θ is the parameter set, and
We define the range of the controller parameters based on the findings in Hu et al. (2018) and the physical limitations of our magnetic actuation setup. Accordingly,
3.1. Gaussian processes
The magnetic soft millirobots in our paper do not have accurate models for kinematics or dynamics (we demonstrated the model inaccuracy of the original work of Hu et al. (2018) in our previous study in Culha et al. (2020)); therefore, it is necessary to approximate the reward function based on the data collected from physical experiments. However, the physical data has inherent uncertainty owing to the noise in the measurements and the variations during the experiments. To include these uncertainties in the model, overcome the sparsity in the data, and make probabilistic predictions at unobserved locations, we represent the reward function
where
where
During one run of BO, the GP model is sequentially updated with
From the experimental data
where
We select the squared exponential as the kernel function in the GPs, which is defined in Duvenaud et al. (2011) for multi-dimensional cases as
where
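As an illustration of the kernel choice in this section, a commonly used form of the squared exponential kernel with one length scale per input dimension can be sketched as follows. The function name, the argument names, and the explicit signal variance factor are assumptions made for this sketch and may not match the exact parameterization used in this work.

```python
from math import exp


def se_kernel(x, x_prime, length_scales, signal_variance=1.0):
    """Squared exponential kernel with a separate length scale per dimension.

    x, x_prime: sequences of equal length (controller parameter sets).
    length_scales: one positive length scale per input dimension.
    signal_variance: overall output scale (an assumed hyperparameter).
    """
    scaled_sq_dist = sum(
        ((a - b) / ell) ** 2 for a, b, ell in zip(x, x_prime, length_scales)
    )
    return signal_variance * exp(-0.5 * scaled_sq_dist)


# The covariance is largest for identical inputs and decays with distance.
print(se_kernel((1.0, 2.0), (1.0, 2.0), (0.5, 0.5)))  # prints 1.0
print(se_kernel((0.0, 0.0), (1.0, 0.0), (1.0, 1.0)) < 1.0)  # prints True
```

A larger length scale in one dimension makes the modeled reward vary more slowly along that controller parameter, which is why the length scales are natural candidates for transfer between related experiments.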
3.2. Bayesian optimization
We use BO to select the parameter set
In this study, we choose the expected improvement (EI) as the acquisition function
where
where Φ and ϕ are the Gaussian cumulative distribution and probability density functions, respectively. The term Z is described as
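For readers who want to experiment with the acquisition function, the textbook closed form of EI for a maximization problem can be sketched as below. The exploration offset `xi` and the handling of a zero predictive standard deviation are assumptions of this sketch, and the expression may differ in detail from the one used in this work.

```python
from math import erf, exp, pi, sqrt


def norm_pdf(z):
    """Standard normal probability density function."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement of a candidate over the best observed reward.

    mu, sigma: GP posterior mean and standard deviation at the candidate.
    best: best reward observed so far.
    xi: optional exploration offset (assumed; 0 gives the plain form).
    """
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * norm_cdf(z) + sigma * norm_pdf(z)


# Same predicted mean, but more uncertainty gives a larger EI.
print(expected_improvement(1.0, 2.0, 1.0) > expected_improvement(1.0, 1.0, 1.0))  # prints True
```

The first term rewards candidates whose predicted mean exceeds the current best, and the second term rewards candidates with high predictive uncertainty, which is how EI balances exploitation and exploration.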
3.3. Transfer learning
In this study, we compare four different methods of TL on our walking gait experiments: (1) transfer of all GP hyperparameters,
3.3.1. Transfer of GP hyperparameters
The choice of the types and values of GP hyperparameters influences the regression of the GP (Chen and Wang, 2018) and their transfer from prior models can change the dynamics of the learning process (Patacchiola et al., 2020; Wang et al., 2020). The hyperparameters we choose to investigate as a part of the GPs in this study can be listed as the noise in the collected data
to simultaneously optimize the GP hyperparameters based on the collected data during the learning runs. We use these estimates of the selected hyperparameters as one of the TL methods in the following experiments in Section 4.3.
3.3.2. Transfer of prior mean information
In addition to the kernel, the prior mean
3.3.3. Hybrid transfer
The previous methods can be combined so that both the optimized estimation of the GP hyperparameters and the prior mean information are transferred between the BO experiments. In this study, we also investigate the combination of the estimated length scales
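The effect of transferring prior mean information can be illustrated with a minimal one-dimensional GP sketch. It uses the standard posterior mean expression with a non-zero prior mean, a unit-variance squared exponential kernel, and a small dense linear solver; the function names, the noise variance value, and the example prior mean are all hypothetical and only serve to show the mechanism.

```python
from math import exp


def se_kernel(a, b, length_scale=1.0):
    """One-dimensional squared exponential kernel with unit signal variance."""
    return exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def posterior_mean(x_star, X, y, prior_mean, noise_var=1e-4):
    """GP posterior mean at x_star given data (X, y) and a prior mean function."""
    K = [[se_kernel(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    # The GP only has to model the residual between data and prior mean.
    residual = [yi - prior_mean(xi) for xi, yi in zip(X, y)]
    alpha = solve(K, residual)
    k_star = [se_kernel(x_star, xi) for xi in X]
    return prior_mean(x_star) + sum(k * a for k, a in zip(k_star, alpha))


def transferred_mean(x):
    """Hypothetical mean taken from an earlier experiment (made-up shape)."""
    return 1.5 - 0.1 * (x - 1.0) ** 2


# Before any new data, the prediction equals the transferred prior mean.
print(posterior_mean(0.3, [], [], transferred_mean) == transferred_mean(0.3))  # prints True
```

With a zero prior mean, the first acquisition steps in a new task space are driven by uncertainty alone, whereas a transferred mean already points the search toward regions that performed well before.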
4. Experimental results
Our study aims to use BO and GPs to demonstrate adaptation to different task spaces while experimentally optimizing the stride length of the soft millirobots whose walking performances are prone to fabrication-, material-, and interaction-based reproducibility issues that cannot be successfully predicted with kinematic models. In that sense, we focus more on exploring a variety of walking patterns under changing task space conditions rather than continuously optimizing a specific walking gait performance. Accordingly, we design the experiments to highlight the influence of BO and GPs and TL methods on increasing the average performance of finding successful walking gaits, i.e., gaits strictly following the consecutive states described in Section 2.2 that also yield sub-optimal stride length performances, during the limited number of learning runs, instead of only finding the optimum controller parameters.
We begin with using an exhaustive search approach to generate benchmark datasets for the walking gaits on five different test scenarios using our millirobots in Section 4.1. Here, we limit the controller parameter space to two dimensions and only explore the
4.1. Generation of the walking gait benchmark datasets
In our previous work in Culha et al. (2020), we observed that the soft millirobots we adopted from Hu et al. (2018) experienced additional material degradation over long repeated experiments that altered their gait performances. While we investigate the influence of BO and TL methods on the improvement of learning efficiency in Section 4.2 and Section 4.3, we want to minimize this material degradation effect on the walking gait. That is why here we use an exhaustive search approach and generate five different benchmark datasets that cover the walking gait function space necessary for the BO and TL methods investigations. To this end, we test three different robots (i.e., robot 1, 2, and 3) on a flat paper surface and a single robot (robot 3) on two additional surfaces: PDMS-0 and P800-grit sandpaper replica.
To explore the walking gait function space on these five test cases, we constrain the controller space into two dimensions by using a constant

Experimental displacement measurements
In these experiments, the robots do not necessarily follow the gait definition in Section 2.2; therefore, we describe these resulting values as “displacement measurements,”
Best performing α controller parameter sets found by the exhaustive search and the corresponding stride length results
4.2. Learning the walking gait with the “standard” BO
We initially test the walking gait learning with a BO approach on the benchmark datasets, where the prior mean information is set to zero (i.e.,
BO selects a new parameter set θ that maximizes the acquisition function based on the GP model;
for the selected controller parameter pair, the corresponding stride length performance is sampled from the normal distribution defined by the mean and standard deviation values found in the relevant benchmark dataset;
the learning system updates the GP model using this sampled data and prepares for the next iteration step of the learning run.
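The three steps above can be sketched as a simulated learning run that draws noisy observations from a benchmark table instead of running physical experiments. The table entries below are made-up placeholder values, not values from the published datasets, and the parameter selection is left as a pluggable callable; the simple cycling selector is only a stand-in for the acquisition function maximization and is not BO.

```python
import random

# Hypothetical benchmark: controller parameter pair -> (mean, std) of the
# stride length in mm. Placeholder numbers for illustration only.
benchmark = {
    (30.0, 10.0): (1.8, 0.2),
    (40.0, 10.0): (2.4, 0.3),
    (50.0, 20.0): (0.6, 0.1),
}


def sample_stride(benchmark, theta, rng):
    """Draw one stride length from the normal distribution stored for theta."""
    mean, std = benchmark[theta]
    return rng.gauss(mean, std)


def simulated_run(benchmark, select_next, n_iters, seed=0):
    """Run one simulated learning run and return the (theta, reward) history."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_iters):
        theta = select_next(history)                   # step 1: choose parameters
        reward = sample_stride(benchmark, theta, rng)  # step 2: sample benchmark
        history.append((theta, reward))                # step 3: data for GP update
    return history


def cycle_selector(history):
    """Stand-in selector that cycles through the table (not an acquisition)."""
    keys = sorted(benchmark)
    return keys[len(history) % len(keys)]


run = simulated_run(benchmark, cycle_selector, n_iters=6, seed=1)
print(len(run))  # prints 6
```

Sampling from the stored mean and standard deviation, rather than replaying a single measured value, keeps the measurement noise of the physical experiments in the simulated learning runs.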
We report the median of the learning results with the upper and lower quartiles in Figure 4. These values represent the normalized gait performance
Comparison of the transfer learning methods.
Relative performance with respect to the optimum exhaustive search results.
Relative performance with respect to the standard BO learning approach.

Performance of the standard BO for (a) robot 1 on paper, (b) robot 2 on paper, (c) robot 3 on paper, (d) robot 3 on PDMS-0, and (e) robot 3 on P800. The stride length performances are normalized with respect to
4.3. Comparison of transfer learning methods on the benchmark datasets
In this study, we extend our previous investigation on the role of TL in learning efficiency (Culha et al., 2020) and compare four different methods while optimizing the gait controllers of our soft millirobots. Similar to Section 4.2, we apply our BO learning to the benchmark datasets generated in Section 4.1, where the controller parameter space is limited to two dimensions with
4.3.1. HP-4 transfer
We initially transfer all of the four GP hyperparameters (i.e., noise in the collected data

Comparisons of the TL methods are shown in separate columns in terms of normalized stride length performance
4.3.2. HP- transfer
Second, we investigate the transfer of only the two length scale hyperparameters
4.3.3. Mean transfer
As the third method, we transfer the posterior mean information,
4.3.4. Hybrid transfer
Finally, we adopt a hybrid approach and transfer the posterior mean information,
The comparative performance results of the standard BO and the four TL methods for each of the five test cases are reported in Figure 6 and Table 2. Owing to the statistical and explorative nature of the BO, the stride length performances do not monotonically increase at every consecutive iteration step (as also visible in Figure 5). That is why, to provide a clear comparison between our methods in Figure 6, we identify the iteration steps that show the “best so far” performance during the learning run of each approach. The first row, Figure 6(a)–(e), compares these methods in terms of the normalized error of the achieved stride length performances during the BO learning runs (

Performance comparison of standard BO and four TL methods on five test cases. (a)–(e) Normalized performance error
The second row, Figure 6(f)–(j), represents the learning efficiency performance comparison in terms of the iteration steps needed to explore the best performance by the standard BO and to achieve standard BO level performance by the four TL methods. We describe this exploration performance with “convergence steps,” which is calculated by finding the performance value that stays within the 5% band of the averaged remaining steps. Normally, a monotonic convergence is not expected from the statistical and explorative BO learning. However, viewing the learning runs with the “best-so-far” evaluation method allows us to represent the required iteration steps to achieve comparable performance, and to capture the relative data efficiency of the TL methods. Accordingly, we can see that the “HP-4” method fails to achieve standard BO level performance for the three cases (Figure 6(g)–(i)) as its convergence steps are equal to the number of iteration steps in the experiments. The “HP-
As the “Mean” TL method outperforms the standard BO and other three TL methods by finding better performing parameter sets in fewer iterations, we select it to use for task space adaptation experiments in Section 4.4.
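The “best so far” view and the convergence steps used in this comparison can be sketched as follows. The running maximum is unambiguous, whereas the convergence criterion below is only one possible reading of the 5% band described above (the first step from which every remaining value stays within 5% of the mean of the remaining values); the function names and the example rewards are hypothetical.

```python
from itertools import accumulate


def best_so_far(rewards):
    """Running maximum of the rewards observed during a learning run."""
    return list(accumulate(rewards, max))


def convergence_step(trace, band=0.05):
    """First 1-based step from which the trace stays inside the band.

    Assumed criterion: all values from that step onward lie within
    band * |mean| of the mean of those remaining values.
    """
    if not trace:
        raise ValueError("trace must not be empty")
    for k in range(len(trace)):
        tail = trace[k:]
        tail_mean = sum(tail) / len(tail)
        if all(abs(v - tail_mean) <= band * abs(tail_mean) for v in tail):
            return k + 1
    return len(trace)  # not reached: the last step alone always qualifies


# Made-up normalized rewards from one learning run.
rewards = [0.2, 0.5, 0.4, 0.9, 0.7, 0.95, 0.9]
trace = best_so_far(rewards)
print(trace)                    # prints [0.2, 0.5, 0.5, 0.9, 0.9, 0.95, 0.95]
print(convergence_step(trace))  # prints 4
```

Under this reading, a run whose best value only appears at the final iteration returns the total number of iteration steps, which matches the way a failure to reach the reference performance is reported above.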
4.4. Adaptation to task spaces
The task environment is more susceptible to dynamic changes than the robot morphology, especially for medical operations inside the human body. Therefore, a quick adaptation of the robot controller is important to maintain successful robot task handling. In the following experiments, we investigate the learning efficiency of our BO approach while focusing on the three physical properties that may dynamically change during the walking task of our soft millirobots in future in vivo operations, which are (1) surface adhesion, (2) surface roughness, and (3) medium viscosity.
Here, we expand the controller parameter space exploration back to four dimensions by including the magnetic field magnitude
Again, we compare the learning efficiency of the standard BO with the prior mean transfer method on all the task spaces defined in Section 2.4. The objective of these learning runs is to adapt to dynamic task spaces and learn the optimized controller parameters in as few experiments as possible, especially for future medical operations. To find the number of learning steps sufficient for BO to find the desired walking gaits, we used the results in Figures 4, 5, and 6, which involve approximately 250,000 data points. These results show that the BO finds the optimized controller parameters that generate desired walking gaits consistently in fewer than 20 steps for different robots and walking surfaces. Therefore, we limit each learning run to 20 experiments (i.e., iteration steps), and perform three independent learning runs with the same initial conditions, yielding 60 experiments in total. Each iteration step of a learning run consists of five actions:
(1) BO selects a new parameter set θ that maximizes the acquisition function based on the GP model;
(2) the microcontroller initiates the physical experiment and regulates the magnetic field based on the selected θ;
(3) the cameras record the robot’s motion and measure the average stride length performance;
(4) the learning system updates the GP model using the newly collected data from the experiment;
(5) the robot returns to its initial position for the next experiment.
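The five-part learning step listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the squared-exponential kernel, the upper-confidence-bound acquisition function, the normalized 4D candidate grid, and the functions `simulated_experiment` and `transferred_mean` are all hypothetical stand-ins. The physical experiment (magnetic actuation, camera tracking, robot reset) is replaced by a noisy synthetic function.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.3, signal_var=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X, y, Xs, prior_mean, noise_var=1e-2, signal_var=1.0):
    """GP posterior mean and standard deviation at Xs for a given prior mean."""
    if len(y) == 0:  # no data yet: the posterior equals the prior
        return prior_mean(Xs), np.full(len(Xs), np.sqrt(signal_var))
    K = rbf_kernel(X, X, signal_var=signal_var) + noise_var * np.eye(len(X))
    Ks = rbf_kernel(X, Xs, signal_var=signal_var)
    mu = prior_mean(Xs) + Ks.T @ np.linalg.solve(K, y - prior_mean(X))
    var = signal_var - (Ks * np.linalg.solve(K, Ks)).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def learning_run(run_experiment, candidates, prior_mean, n_steps=20, kappa=2.0):
    """One learning run: n_steps passes through the five-part step."""
    X = np.empty((0, candidates.shape[1]))
    y = np.empty(0)
    for _ in range(n_steps):
        # (1) Select the candidate that maximizes the acquisition function
        #     (upper confidence bound here, which is an assumption).
        mu, sigma = gp_posterior(X, y, candidates, prior_mean)
        theta = candidates[np.argmax(mu + kappa * sigma)]
        # (2)-(3) Run the experiment and measure the average stride length.
        stride = run_experiment(theta)
        # (4) Store the observation; the GP is recomputed on the next pass.
        X = np.vstack([X, theta])
        y = np.append(y, stride)
        # (5) Returning the robot to its start position is hardware-only.
    return X, y

# Hypothetical stand-in for the physical experiment (normalized parameters).
rng = np.random.default_rng(0)

def simulated_experiment(theta):
    clean = np.exp(-8.0 * np.sum((theta - 0.6) ** 2))
    return float(clean + 0.02 * rng.standard_normal())

# Coarse grid over a normalized four-dimensional controller parameter space.
axes = [np.linspace(0.0, 1.0, 5)] * 4
candidates = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 4)

def zero_mean(X):
    """Uninformed prior mean, standing in for the standard BO."""
    return np.zeros(len(X))

def transferred_mean(X):
    """Hypothetical prior mean carried over from an earlier, similar task."""
    return 0.8 * np.exp(-8.0 * np.sum((X - 0.5) ** 2, axis=1))

X_std, y_std = learning_run(simulated_experiment, candidates, zero_mean)
X_tl, y_tl = learning_run(simulated_experiment, candidates, transferred_mean)
```

With the uninformed prior, the first experiment in this sketch falls on an arbitrary grid point, whereas the transferred prior mean places it where the earlier task performed well. This is the mechanism by which a prior mean can shift the early experiments toward promising regions.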
4.4.1. Surface adhesion
We initially test the robot on five surfaces with different adhesion strengths reported in Figure 2(a). Figure 7(a)–(d) show the walking gait performances during the three independent learning runs on the two ends of the adhesion range: paper (1.34 kPa) and PDMS-2 (11.02 kPa). The left column (Figure 7(a) and (c)) shows the learning runs with the standard BO approach, and the right column (Figure 7(b) and (d)) shows the learning runs with the prior mean transfer method. The difference between these figures shows that the TL method improves the learning runs by finding more of the controller parameters that yield positive walking gait performances. In addition, the BO with TL explores these parameters earlier in the learning runs than the standard BO approach. We represent the general influence of the mean transfer method on all the adhesion surfaces with the standard interquartile range (IQR) method in Figure 7(e). In this figure, the horizontal lines represent the median of the generated
Figure 7. Comparison of the learning performances for changing surface adhesion. The learning of the controller parameters for the changing adhesion properties of the test surfaces (upper row paper, lower row PDMS-2) within 20 physical experiments in 3 independent learning runs (depicted as LR 1–3). Learning runs with the standard BO in (a) and (c) are compared with the learning runs with the mean transfer in (b) and (d). Overall performances of the learning runs with the standard BO (left bars) and the mean transfer method (right bars) reported with box plots for all test cases (e).
4.4.2. Surface roughness
Next, we test the same robot on five surfaces with increasing roughness properties that are reported in Figure 2(b). The walking gait performances achieved during the learning runs for the two extreme surfaces, P800-grit (
Figure 8. Comparison of the learning performances for changing roughness values. The learning of the controller parameters for the changing roughness values of the test surfaces (upper row P800, lower row P60) within 20 physical experiments in 3 independent learning runs (a) and (c) without utilizing the prior information and (b) and (d) with utilizing the prior information. Overall performances of the learning runs with the standard BO (left bars) and the mean transfer method (right bars) reported with box plots for all test cases (e).
4.4.3. Medium viscosity
Finally, we test our robot walking in eight different media with changing viscosity as reported in Figure 2(c), and report the results for the two extreme cases, cP1 and cP90, in Figure 9(a)–(d). For the high-viscosity fluids (cP > 35), both BO approaches become less able to explore a wide range of controller parameter sets that generate walking gaits. However, the mean transfer method still increases the number of successful sets compared with the standard BO for all the media. The overall performances of the standard and mean transfer approaches are reported as box plots in Figure 9(e) and Table 5. These results are consistent with those for the other two task space properties: the TL method allows the BO to explore more of the controller sets that generate walking gaits despite the changing task space properties. See Extension 4 for a comparison between the walking inside cP1 and cP90, and Extension 7 for the details of the independent learning runs for all the medium viscosity experiments.
Figure 9. Comparison of the learning performances for changing medium viscosities. The learning of the controller parameters for the changing test medium viscosity (upper row cP1, lower row cP90) within 20 physical experiments in 3 independent learning runs (a) and (c) without the prior information and (b) and (d) with the prior information. Overall performances of the learning runs with the standard BO (left bars) and the mean transfer approach (right bars) reported with box plots for all test cases (e).
5. Discussion
The displacement measurements from the exhaustive search experiments in Figures 3(a)–(c) show that even though three identical robots are tested with the same controller parameters on the same surface, they generate different walking gait performances. These initial results also confirm the observations related to performance repeatability in our previous work (Culha et al., 2020). Moreover, the influence of the task space on the robot performance can be seen clearly from Figures 3(h)–(j), where the adhesion and roughness differences between different surfaces are reflected. These observations support the necessity of a data-efficient controller learning system that is robust to the robot performance variabilities caused by the material, fabrication, and the task environment of the small-scale, medical-oriented, and untethered soft robots.
Our choice of BO and GPs to directly learn the controller parameters of our soft millirobot rests on three aspects. First, BO offers efficient data-driven optimization of continuous and complex black-box functions that do not have a closed-form definition. This feature addresses two of our challenges: we do not have a deterministic model for the kinematics of our robot, and we need to achieve optimized walking gaits in a small number of experiments. Second, the investigated function can be represented with GPs within BO, which allows capturing the noise and unknown disturbances. As our robot inherits fabrication-, material-, and interaction-based performance disturbances, GPs provide us with a walking gait function representation that includes these variances. Finally, BO is a global optimization tool that avoids getting stuck at local optima, which is important for exploring the parameter space of the investigated function. Combined with the appropriate TL method, this feature allowed us to overcome local optima while adapting the controller parameters to different robots and task spaces. While we do not claim that BO is the best or only suitable optimization technique here, these three aspects successfully address the controller and modeling challenges of our robotic system and make BO our choice for this application.
The experimental results in this study show that our approach of using BO with GPs and TL methods allowed data-efficient (i.e., using as few experiments as possible) controller learning that achieves adaptation to different task spaces within a wide range (i.e., on the scale of an order of magnitude) of surface and medium properties. Our main goal is to allow the learning system to explore the controller parameter space to find more of the parameter sets that generate successful walking gaits in response to changing task environments. For this purpose, we configured our BO to favor exploration over exploitation. For the same reason, we do not focus on finding the optimum walking gait controller parameters for each robot or task space in our experiments. Consequently, our current approach does not establish a straightforward correlation between the controller parameters and the changing robot and task conditions. The comparative results between the standard BO and the TL methods, which are given in Appendix C for the task space adaptation experiments, show that both approaches can find sub-optimum parameter sets owing to the statistical nature of the learning method. However, we propose that TL methods may allow the system to explore a larger portion of the function space in fewer physical experiments, hence achieving data efficiency in learning.
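The balance between exploration and exploitation can be illustrated with a small numerical sketch. The acquisition function below is an upper confidence bound with weight `kappa`, chosen here only for illustration; it is an assumption and not necessarily the acquisition function or the setting used in our experiments. The posterior means and standard deviations are made-up values for three hypothetical candidate parameter sets.

```python
import numpy as np

def ucb(mu, sigma, kappa):
    """Upper confidence bound: predicted performance plus weighted uncertainty."""
    return mu + kappa * sigma

# Hypothetical GP predictions for three candidate controller parameter sets.
mu = np.array([0.8, 0.5, 0.2])        # predicted stride length (arbitrary units)
sigma = np.array([0.05, 0.20, 0.60])  # predictive standard deviation

# A small kappa trusts the current model and exploits the best known region.
exploit_choice = int(np.argmax(ucb(mu, sigma, kappa=0.1)))

# A large kappa rewards uncertainty and explores poorly sampled regions.
explore_choice = int(np.argmax(ucb(mu, sigma, kappa=3.0)))
```

In this toy case, the small weight selects the candidate with the highest predicted stride length, while the large weight selects the candidate with the highest uncertainty. An exploration-favoring configuration therefore samples more of the function space at the cost of refining a single optimum.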
In terms of experimental learning efficiency, the transfer of the prior mean information outperformed the other TL methods in our experiments. The transfer of this information allowed the BO to start the parameter exploration in the regions of the function space where high performance is expected. Therefore, the BO explored the parameter spaces that generate optimum walking gaits much faster (see Extension 8 for a sample comparison of parameter selection with the standard BO and mean transfer method). We see the same effect for the test case of robot 3 walking on P800 in Figure 5(s). Here, this TL method allowed the exploration of the regions with higher expected results and surpassed the exploration boundary of the standard BO. The larger variance in the stride length performances explored by this TL method is caused by this exploration tendency. In comparison, we see that the HP-4 method failed to explore the controller parameters that yield optimum gaits because of the transfer of the signal variance parameter
In the learning experiments that compare the standard BO and four TL approaches, we chose to represent learning performance with the median and IQR instead of the mean and standard deviation (as seen in Figure 4), since the IQR is a robust measure of scale that is less sensitive to outliers in the data. Moreover, unlike the standard deviation, the IQR can represent the skewness in the distribution of the walking performance results, which becomes more apparent as the performance values get closer to the ends of the possible performance ranges. In addition to its advantages in representing the statistical distribution, the IQR does not report any unachievable result according to the gait definition in Section 2.2.
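A short numerical sketch illustrates these points. The stride values below are hypothetical and chosen only to mimic a skewed set of results with many failed gaits; the sketch also assumes, for illustration, that a stride length cannot be negative.

```python
import numpy as np

# Hypothetical stride lengths from one learning run (arbitrary units).
strides = np.array([0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 1.9, 2.4])

# Mean and sample standard deviation: a symmetric band around the mean.
mean = strides.mean()
std = strides.std(ddof=1)
lower_band = mean - std  # can fall below zero for skewed data

# Median and quartiles: always lie within the range of the observed data.
q1, median, q3 = np.percentile(strides, [25, 50, 75])
iqr = q3 - q1

# Unequal distances from the median to the quartiles expose the skewness.
lower_spread = median - q1
upper_spread = q3 - median
```

For these values, the band of one standard deviation below the mean extends to a negative stride length, which no experiment produced, whereas both quartiles stay within the observed range. The upper spread is also larger than the lower spread, which reflects the skew toward a few high-performing parameter sets.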
6. Conclusion and future work
In this study, we have investigated the use of BO with GPs to experimentally learn the controller parameters for the walking gait of a magnetic soft millirobot. We have created benchmark datasets consisting of 750 experimental results using an exhaustive search to find the walking gait function space for five different test cases. We then used these datasets to compare the effectiveness of four different TL methods to complement the standard BO learning. In these experiments that involve 104 learning steps for each test case, we have shown that the transfer of the prior mean information increased the BO learning performance the most in terms of increasing the number of explored sub-optimum controller parameters and decreasing the number of required experiments. Based on these findings, we also applied BO learning together with the prior mean transfer method on different task spaces with changing surface adhesion, surface roughness, and medium viscosity. We have shown that controller learning with a BO that utilizes prior mean transfer demonstrates successful adaptation to task spaces in a data-efficient way by exploring the function space of the robot in fewer experiments to find a larger group of controller parameters that yield successful walking gaits.
Our approach is not limited to walking gait learning; it can also be applied to different locomotion and manipulation controllers for soft robots (Chin et al., 2020). In the future, studies focusing on small-scale fabrication with higher magnetization resolution may address the fabrication reproducibility issues (Alapan et al., 2020; Kim et al., 2018; Xu et al., 2019). However, especially for robots designed for biomedical operations, the interaction with the dynamic task environment may still degrade the robot material and performance. For such scenarios, a data-efficient controller learning system may adapt the optimum controller parameters to these changes in the robot. For example, such an approach may be applied to endoscopic soft robots within or outside the gastrointestinal (GI) tract (Son et al., 2020; Yim et al., 2014) using a small number of trials. Our study can be further extended to involve the design parameters, such as the magnetic particle density in our robots, and guide the task-oriented design strategies for future soft mobile robots. Our approach can be used to reveal design guidelines to improve the kinematic models of small-scale robots while utilizing the CC approximations (Webster and Jones, 2010), analytical models (Renda et al., 2014), and FEMs (Largilliere et al., 2015). However, as the BO we are using is an episodic algorithm, meaning that each suggested parameter set must first be evaluated in an experiment, the adaptation to design optimization will require the experiments to be run either in a simulation environment or in an automated rapid fabrication system that can be integrated within the actuation architecture.
The systematic comparison of our experimental approach to alternative optimization and control methods supported with simulations, such as intelligent trial and error (Cully et al., 2015), evolutionary algorithms (Kriegman et al., 2020), or policy gradients (Sehnke et al., 2010), is beyond the scope of our current study but is an interesting task for future work. We believe that the benchmark datasets available in this study can be used to compare these different methods. Our long-term vision is to build fully autonomous systems that can control, track, evaluate, and optimize soft robots operating in changing complex real-world environments, with minimum human involvement.
Appendix A. Index to multimedia extensions
Archives of IJRR multimedia extensions published prior to 2014 can be found at http://www.ijrr.org; after 2014, all videos are available on the IJRR YouTube channel at http://www.youtube.com/user/ijrrmultimedia
Robot tracking and gait detection output.
Comparison of walking on paper and PDMS-2.
Comparison of walking gaits on P800 and P60.
Comparison of walking gaits inside cP1 and cP90.
Sample learning runs comparing the parameter selection process by a standard BO with prior mean transfer.
Appendix B. Pseudo-code for robot tracking and motion evaluation algorithm
The pseudo-code for robot tracking and motion evaluation is given in Algorithm 1.
Appendix C. Hyperparameter and controller parameter sets
Acknowledgements
S.O.D. thanks the Ministry of National Education of the Republic of Turkey for the Doctoral Scholarship. U.C. and A.P.F. thank the Alexander von Humboldt Foundation for the Humboldt Postdoctoral Research Fellowship and the Federal Ministry for Education and Research.
Data accessibility statement
Funding
This work was funded in part by the Alexander von Humboldt Foundation, the Ministry of National Education of the Republic of Turkey, the Cyber Valley Initiative, the Grassroots Initiative of the Max Planck Institute for Intelligent Systems, the Max Planck Society, and the European Research Council (ERC) Advanced Grant “SoMMoR” Project (grant number 834531).
