Abstract
Manipulators based on soft robotic technologies exhibit compliance and dexterity, which ensure safe human–robot interaction. This article is a novel attempt to exploit these desirable properties to develop a manipulator for an assistive application, in particular, a shower arm that assists the elderly in the bathing task. The overall vision for the soft manipulator is to concatenate three modules serially such that (i) the proximal segment uses cable-based actuation to compensate for gravitational effects and (ii) the central and distal segments use hybrid actuation to autonomously reach delicate body parts and perform the main tasks related to bathing. The latter modules are crucial to the application of the system in the bathing task; however, developing a robust and controllable hybrid actuated system with advanced manipulation capabilities is a nontrivial challenge and is hence the focus of this article. We first introduce our design and experimentally characterize its functionalities, which include elongation, shortening, and omnidirectional bending. Next, we propose a control concept that solves the inverse kinetics problem using multiagent reinforcement learning, exploiting these functionalities despite high dimensionality and redundancy. We demonstrate the effectiveness of the design and control of this module through open-loop task space control, in which the module successfully moves through an asymmetric 3-D trajectory sampled at 12 points with an average reaching accuracy of 0.79 cm ± 0.18 cm. These quantitative experimental results are a promising step toward the development of the soft manipulator and, eventually, toward the advancement of soft robotics.
Introduction
Activities of daily living (ADLs) refer to the daily tasks people should be able to accomplish independently 1 and are grouped into two general categories: (i) basic ADLs (BADLs) and (ii) instrumental ADLs. However, accomplishing these tasks is a significant challenge for elderly people with chronic ailments and loss of abilities. 2,3 Traditionally, children took care of their elders; nowadays, however, the majority of the population depends upon professional healthcare services. This has prompted interest in the development of advanced assistive technical devices (also referred to as assistive robotics) that promote independent living and consequently allow elderly people to improve their quality of life. 2,4,5
This work focuses on the development of assistive devices for BADLs, in particular, the task of bathing, which is considered challenging because the entire body is exposed in this scenario. The literature reveals that this area has been scarcely investigated, with only two commercially available technical solutions. The Oasis Seated Shower system 6 provides an independent, safe, and hygienic shower experience. The system is equipped with a push-button soap application and rinse system. On the downside, it completely lacks active interaction between the user and the tool, including functionalities such as rubbing or wiping the user. Furthermore, it is bulky and expensive. The Seat Lift Device 7 is much simpler, partially automated, and helps position the user for the bathing task. However, it does not provide any assistance in the showering task itself.
A multitude of issues in this domain remain unaddressed; the system should: (i) be client-oriented, i.e., translate customer requests into technical requirements; (ii) be safe, reliable, and adaptable for human–robot interaction; 8,9 (iii) be highly dexterous with a large reachable workspace; and finally (iv) autonomously reach difficult and sensitive regions of the body (refer to Figure 1).

General overview of the bathing task. The soft manipulator approaches the back region of the user, one of the most critical and difficult regions to reach. The lower limbs have also been identified as a difficult region. An additional soft manipulator could be mounted on the wall for the lower limbs.
Soft robotic manipulators
In order to address the above limitations and increase user autonomy at home, a suitable approach is to exploit principles and technologies from the rapidly growing research field of Soft Robotics 10 to develop a manipulator that serves as a shower arm in the bathing task. Soft robotics refers to a new generation of flexible robots that take inspiration from muscular hydrostats, such as octopus tentacles and elephant trunks, in their use of soft materials to achieve advanced manipulation capabilities while remaining safe to interact with. 11 – 13 These natural manipulators have inspired the development of novel soft robotic manipulators that obtain a desired gamut of motion. These manipulators comprise actuator elements that can be arranged within a single body or repeated across identical modules. The modular approach to soft robots, first introduced by Onal and Rus, 14 demonstrates how actuation units can be composed in various, possibly redundant, arrangements to achieve exceedingly complex motions. Building on this achievement, a mechanical modular manipulator can be designed by concatenating modular blocks, which, in turn, are assemblies of soft actuators that can be intrinsic (the actuators are located within the body, thus forming part of the animated mechanism), extrinsic (the actuator is remote and motion is transferred to the mechanism via a mechanical linkage), or hybrid (a combination of intrinsic and extrinsic actuation). 15
We consider the technical requirements for the design of a soft manipulator to be defined by the workspace it should cover, which includes a volumetric space around and close to the user’s body, as shown in Figure 1. This translates into a mechanical modular manipulator built by concatenating three modular blocks such that (i) the proximal module is a structural module based on radially arranged cables that compensate for gravitational effects and (ii) the central and distal modules are identical functional modules based on hybrid actuation. The success of the manipulator depends heavily on the functional capabilities of the hybrid actuated module, whose development is a nontrivial challenge. Consequently, this work focuses on the development of a reliable and controllable distal module.
Challenges in design
The challenge of developing manipulators that combine fluidic actuators and cables, i.e., hybrid actuation, has been largely faced in the surgical robotics field, opening the door to other applications. 16 Good examples of soft manipulators based on tendons combined with pressurized chambers are: the KSI tentacle manipulator, powered by a hybrid system of pneumatic bellows and electric motors; 17 the continuum robot proposed by Pritts, 19 which uses six to eight contracting/extending McKibben actuators to provide two-axis bending; the Air-Octor manipulator, which uses three cables and a single chamber; 20 and the OCTARM continuum robot, 21 which performs adaptive manipulation in challenging environments using air muscle actuators. These examples have been integrated into a more comprehensive literature analysis, reported in Table 1, which lists in chronological order examples of bioinspired manipulators covering all three categories, focusing on the actuation strategy that enables specific functionalities (i.e., elongation/contraction, bending, and variable stiffness). The table also indicates the major achievements in Soft Robotics concerning the design of manipulators based on variable stiffness mechanisms. Moreover, the Application column of Table 1 shows that none of these manipulators is intended for assistive robotics. Our proposal of using Soft Robotics methodologies and mechanisms to build the first assistive soft robot is therefore both challenging and entirely new. To develop a manipulator that serves as a shower arm in the bathing task, we investigate an actuation strategy that enables all of the functionalities mentioned above (i.e., elongation/contraction, bending, and variable stiffness).
A summary of the design and control of some examples of soft robotic manipulators.
*NSS: not specifically stated.
**Simulation.
Challenges in control
Soft robots often lack an analytical model owing to their nonlinear material properties and high-dimensional articulated structure. This makes predictable, controlled movement with positional accuracy a significantly challenging task. Consequently, one of the open research questions is to achieve task space control that allows these robotic systems to follow a trajectory accurately, that is, point-to-point motion control. This is essential for a bathing task, where close contact is required between the manipulator and the human to perform scrubbing and cleaning. The objective of point-to-point motion control is to find the correct combination of input actuation at each discrete sample, that is, to solve the inverse kinematics/kinetics. Immega and Antonelli 17 proposed a cascaded spline arc approach that controls the cables of the KSI tentacle to roughly position it in 3-D space, which is further refined through closed-loop vision feedback. Giorelli et al. 24 used a Jacobian-based approach to reach an average tip accuracy of 6% of the total manipulator length. Marchese et al. 31 applied a closed-loop controller to a 3-D soft arm to position the end effector to reach a ball with a diameter of 0.04 m. The accuracy and computational cost of these traditional model-based methods are limited by modelling assumptions that need to be improved for the practical application of soft manipulators. Learning-based control 23,24,26 is a promising approach to automating low-level sensorimotor skills. The underlying principle is to allow the manipulator to explore its environment autonomously and correlate its sensory and motor spaces. This work investigates the application of these methods to the previously unaddressed challenge of controlling hybrid actuated systems.
The paper is organized as follows: Section ‘
Design
The overall design of the distal segment is driven by its total length, which must allow it to reach the target workspace, including the back and/or lower limbs (refer to Figure 1). On the one hand, McKibben-based actuation maximizes the reachable length owing to the elongation performance of the fluidic actuators, thus reducing the number of modules required to achieve the desired length; however, it is unstable under gravitational effects. On the other hand, cables enable shortening, omnidirectional bending, and compensation for gravity effects; however, they require a higher number of modules to cover the same length. We propose to embed both technologies within the distal module so that each compensates for the limitations of the other while increasing the overall functionality of the module. The distal module is thus hybrid actuated. The innovative element of its mechanical design is the flexible fluidic actuators, which are McKibben based but have a bellow-shaped surface, as discussed subsequently.
Design and manufacturing of the distal module
Three McKibben actuators and three cables are arranged alternately at 60° intervals along a circle with a radius of 3 cm. A central hollow chamber with a diameter of 1.4 cm has been included to allow the flow of water and soap. The total length of the module is 20.5 cm and its total weight is 180 g (refer to Figure 2). The cables and chambers are decoupled, meaning that each has a dedicated activation line for tension and pressure regulation, respectively.
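The alternating layout can be made concrete with a short geometric sketch. It is a minimal illustration, assuming an arbitrary 0° origin and a labelling of A1–A3 and B1–B3 chosen so that each chamber faces the cable it is paired with in the bending tests; the names `ANGLES_DEG`, `position`, and `opposite` are hypothetical. It also shows that, with a 60° spacing on a 3 cm circle, adjacent elements are exactly 3 cm apart (chord = 2r sin 30° = r).

```python
import math

RADIUS_CM = 3.0  # radius of the circle carrying the actuators and cables (from the text)

# Assumed angular placement in degrees: elements alternate every 60 degrees, with an
# arbitrary 0-degree origin and labels chosen to match the pairs A1-B3, A2-B1, A3-B2.
ANGLES_DEG = {"A1": 0, "B2": 60, "A2": 120, "B3": 180, "A3": 240, "B1": 300}


def position(label):
    """(x, y) position in cm of an actuator (A*) or cable (B*) on the circle."""
    angle = math.radians(ANGLES_DEG[label])
    return (RADIUS_CM * math.cos(angle), RADIUS_CM * math.sin(angle))


def opposite(label):
    """Label of the element located 180 degrees away on the circle."""
    target = (ANGLES_DEG[label] + 180) % 360
    return next(name for name, deg in ANGLES_DEG.items() if deg == target)


if __name__ == "__main__":
    for chamber in ("A1", "A2", "A3"):
        x, y = position(chamber)
        print(f"{chamber} at ({x:+.2f}, {y:+.2f}) cm faces {opposite(chamber)}")
```

Run as a script, it reports that A1, A2, and A3 face B3, B1, and B2, respectively, which are the chamber–cable pairs used in the single-cable bending test.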

Flowchart showing the fabrication process for the single McKibben actuator.
Design of the McKibben actuator
The manufacturing process for the McKibben actuator 32–34 comprises the following steps: (i) Forming the external chamber: a braided sheath (Pro Power PETBK3B10, Farnell Components) with a length of 60 cm and an internal nominal diameter of 0.3 cm is expanded and slid onto a metallic cylinder with a diameter of 0.8 cm. A mechanical deformation is applied along the inferior/superior (axial) direction to deform the fibers until a cylindrical bellow structure is created. The final shape is “memorized” by applying uniform heat cyclically along the structure (a temperature of around 350 °C for 3 min is recommended), after which the actuator has an outer diameter of 1 cm and a total length of 19.5 cm (refer to Figure 2). (ii) Combining the internal and external chambers: a latex balloon is inserted into the braided sheath. It is closed at both ends by endcaps, one of which provides the air supply while the other provides an anchoring point to the lateral surface. The bellow-shaped external chamber radially contains the internal elastic chamber, thus improving the elongation performance of the combined structure. (iii) Structural support: the module is completed by adding a custom-made helicoidal structure along its length (Figure 3). It houses the three cables, the three chambers, and the internal channel, thus constraining the structural elements and preventing lateral expansion of the actuators. Finally, lateral interfaces (end plates) confine the module.

CAD design of the single module; on the top right a section view of the module with the three McKibben-based actuators (identified by A1–A2–A3), the three cables (identified by B1–B2–B3), and the internal channel for the provision of water.
Kinetic characterization of the distal module
The kinetic characterization refers to the evaluation of the complete behavior of the system that includes contraction/elongation, bending/stiffness, and the complete reachable workspace, as discussed in the subsequent subsections.
Experimental setup
The experimental setup for the characterization of the distal module consists of: (1) a rigid rectangular frame made of 0.8 cm acrylic sheet that orients the module vertically downward and is custom designed to house the actuation systems; (2) a pneumatic setup: three proportional pressure-controlled electronic valves (K8P Series by EVP Systems, Italy; input: 0–10 V; output: 0–3 bar), one filter (EVP Systems, Italy: MC-104FB0), one manometer (EVP Systems, Italy: M043-p12, 0–12 bar), and one stand-alone air compressor; (3) a cable setup: three Hitec HS-645 servomotors; and (4) a six-DOF electromagnetic sensing probe (Aurora® Tracking by Northern Digital Inc., Canada), capable of sub-millimetre position tracking, fixed at the tip of the module. The tracking system logs the probe movements, which are available for visualization at the end of each experiment (Figure 4).
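The valves map a 0–10 V command to a 0–3 bar output. Assuming the characteristic is linear (not verified here; a real valve should be calibrated against the manometer), the command for a desired pressure follows from a single proportion. `pressure_to_voltage` is a hypothetical helper.

```python
V_MAX = 10.0  # valve command range, 0-10 V (from the setup description)
P_MAX = 3.0   # valve output range, 0-3 bar (from the setup description)


def pressure_to_voltage(pressure_bar: float) -> float:
    """Return the command voltage for a target pressure.

    Assumes a linear valve characteristic between (0 V, 0 bar) and (10 V, 3 bar).
    """
    if not 0.0 <= pressure_bar <= P_MAX:
        raise ValueError(f"pressure must lie within 0-{P_MAX} bar, got {pressure_bar}")
    return pressure_bar / P_MAX * V_MAX


if __name__ == "__main__":
    for p in (0.5, 0.8, 1.0):
        print(f"{p:.1f} bar -> {pressure_to_voltage(p):.2f} V")
```

Under this assumption, the 1 bar upper limit used in the experiments corresponds to roughly 3.33 V, so only about a third of the valve range is exercised.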

Experimental setup for the single module characterization.
Elongation/contraction
The linear motion of the module was captured by moving it from a completely contracted state to a completely extended state. In the contracted state, the cables are pulled by the servomotors and the chambers are not inflated. All servomotors are then simultaneously rotated through 180° in steps of 20°; the position reached at the end of this rotation corresponds to the neutral point of the manipulator. Next, the pneumatic chambers are inflated from 0.5 to 1 bar. As explained in reference 35, we start from 0.5 bar because it is the lower pressure limit of a single actuator, i.e., below this limit no global elongation of the actuator is observed. Figure 5 illustrates the displacement of the module along the z-axis from the contracted state to the extended state, which is found to be 5 cm. With respect to the neutral position, this corresponds to a contraction of 3 cm and an elongation of 2 cm.
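The elongation/contraction test can be written down as an explicit command schedule. The sketch below is illustrative only: it assumes a servo convention in which 180° is the fully pulled (contracted) state and 0° the neutral point, and it borrows the 0.1 bar pressure increment from the bending tests, since no increment is stated for this test; `elongation_protocol` is a hypothetical name.

```python
def elongation_protocol(servo_step_deg=20, p_start=0.5, p_end=1.0, p_step=0.1):
    """Return the ordered (servo_deg, pressure_bar) commands applied to all actuators.

    Assumed servo convention: 180 deg = cables fully pulled, 0 deg = neutral point.
    """
    commands = []
    # Phase 1: from the contracted state, release all cables to the neutral point.
    for angle in range(180, -1, -servo_step_deg):
        commands.append((angle, 0.0))  # chambers not inflated
    # Phase 2: at the neutral point, inflate all chambers from the 0.5 bar lower limit.
    n_levels = round((p_end - p_start) / p_step) + 1
    for k in range(n_levels):
        commands.append((0, round(p_start + k * p_step, 2)))
    return commands
```

The schedule contains 10 servo positions followed by 6 pressure levels, i.e., 16 commands in total; the tip displacement recorded across it corresponds to the 5 cm stroke reported above.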

Elongation–contraction module characterization. The blue marker represents the starting configuration without active elements.
Bending and stiffening
To perform the bending tests, we evaluated two different activation patterns. (i) Bending due to a single cable and a single chamber: a randomly chosen cable is contracted from an untensioned state into a tensioned state by rotating its servomotor from the neutral position through 180° in incremental steps of 20°. When the maximum tensioned state is reached, the diametrically opposite pneumatic chamber is inflated from 0.5 to 1 bar in incremental steps of 0.1 bar. Referring to Figure 3, this activation pattern is obtained when A1 collaborates with B3, A2 with B1, or A3 with B2. Figure 6 illustrates the resulting behavior. The global bending produced by the cable results in a displacement of 6 cm along the z-axis (viewed as the xz-view and yz-view in Figure 6). A stiffness variation can be observed once pneumatic actuation begins, owing to an antagonistic effect. (ii) Bending due to two cables and two chambers: two adjacent cables are contracted from an untensioned state into a tensioned state by rotating the corresponding servomotors simultaneously through 180° in incremental steps of 20°. Referring to Figure 3, this activation pattern is obtained by first activating B1–B2, B2–B3, or B3–B1. When the maximum tensioned state is reached, the diametrically opposite pneumatic chambers corresponding to each servomotor, that is, A2–A3, A3–A1, or A1–A2, respectively, are inflated following the procedure described above. Figure 7 depicts the resulting behavior. The global bending produced by the cables results in a displacement of 5 cm along the z-axis. The reduced value compared with the first activation pattern is attributed to the double-cable activation, which pulls the cables with an increased force.
These figures also show that the chambers help adjust the tip orientation of the module. The hybrid actuation is promising because the stiffness variation allows restrained forces to be applied when interacting with the user.
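Both activation patterns follow the same two-phase recipe (tension the cable or cables, then inflate the opposite chamber or chambers), so one helper can generate either of them. This is a hypothetical sketch: the `OPPOSITE_CHAMBER` table encodes the pairings stated in the text (A1–B3, A2–B1, A3–B2), and the servo convention (0° neutral, 180° fully tensioned) is an assumption.

```python
# Cable -> diametrically opposite chamber, as paired in the text.
OPPOSITE_CHAMBER = {"B3": "A1", "B1": "A2", "B2": "A3"}


def bending_pattern(cables, servo_step_deg=20, p_start=0.5, p_end=1.0, p_step=0.1):
    """Return (chambers, steps) for a single- or double-cable bending test.

    Each step maps actuator labels to a command: servo angle in degrees for
    cables, pressure in bar for chambers.
    """
    chambers = [OPPOSITE_CHAMBER[c] for c in cables]
    steps = []
    # Phase 1: tension the selected cable(s) from neutral (0 deg) to 180 deg.
    for angle in range(0, 181, servo_step_deg):
        steps.append({c: angle for c in cables})
    # Phase 2: hold the cables at 180 deg and inflate the opposite chamber(s).
    n_levels = round((p_end - p_start) / p_step) + 1
    for k in range(n_levels):
        pressure = round(p_start + k * p_step, 2)
        step = {c: 180 for c in cables}
        step.update({a: pressure for a in chambers})
        steps.append(step)
    return chambers, steps
```

For example, `bending_pattern(["B1", "B2"])` selects chambers A2 and A3 and yields 16 steps, the last one holding both cables at 180° with both chambers at 1 bar, matching the double-cable pattern described above.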

Activation pattern with a single cable (B3 in Figure 3) up to the maximum position and activation of the opposite chamber (A1 in Figure 3). (a) Global bending movement of the element; (b) xz view; (c) yz view. Arrows with different colors show the orientation of the module: x, y, z directions in green, blue, and red, respectively.

Activation pattern with two cables (B1–B2 in Figure 3) up to the maximum position and activation of the two opposite chambers (A2–A3 in Figure 3). (a) Global bending movement of the element. (b) xz view. (c) yz view. Arrows with different colors show the orientation of the module: x, y, z directions in green, blue, and red, respectively.
Reachable workspace
The workspace measurement (see Figure 8) aims to capture all reachable positions of the module tip. A total of 8000 points were recorded, resulting from all combinations of the six inputs (5³ servo combinations × 4³ pressure combinations) with the following values: (i) servomotors: 0°, 45°, 90°, 135°, and 180°; (ii) pneumatic pressure: 0, 0.5, 0.8, and 1 bar.
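The count of 8000 follows directly from the input grid: five servo angles for each of three cables and four pressures for each of three chambers give 5³ × 4³ = 125 × 64 = 8000 input vectors. The sketch enumerates that grid with `itertools.product`; the ordering of the six inputs (three servo angles, then three pressures) and the function name are assumptions.

```python
from itertools import product

SERVO_ANGLES_DEG = (0, 45, 90, 135, 180)  # five levels per cable
PRESSURES_BAR = (0.0, 0.5, 0.8, 1.0)      # four levels per chamber


def workspace_inputs():
    """Every combination of three servo angles and three chamber pressures."""
    return [servos + pressures
            for servos in product(SERVO_ANGLES_DEG, repeat=3)
            for pressures in product(PRESSURES_BAR, repeat=3)]


if __name__ == "__main__":
    inputs = workspace_inputs()
    print(len(inputs))  # prints 8000
```

Applying each six-element vector and recording the tracked tip position yields one point of the workspace cloud.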

(a) Workspace evaluation for the single module. The blue point represents the initial position of the module in an unactivated state. (b) Workspace view from the X–Y plane. (c) Workspace view from the X–Z plane. (d) Workspace view from the Y–Z plane.
Figure 8(a) depicts an isometric view of the resulting point cloud, showing that the workspace spans 21.5 cm × 17.2 cm × 10 cm along the x, y, and z axes, respectively. The blue marker corresponds to the starting configuration of the end effector. The workspace is a convex volume whose lower limits are defined by the maximum bending capability of the module (approximately 90°), as expected. The asymmetry of the shape can be attributed to the manufacturing procedure, i.e., irregularities introduced during the assembly phase that are not easily identifiable when the module is at rest. For a detailed analysis, the workspace is also presented in 2-D from all three planes. Figure 8(b) shows the top view of the workspace in the X–Y plane; the shape is circular, as theoretically expected. Figure 8(c) indicates an asymmetric response of the system in the X–Z plane, whereas Figure 8(d) depicts a symmetric response in the Y–Z plane. Data points are concentrated toward the left in the X–Z view as a result of a tighter cable attachment to the respective servo arm.
Control framework
Learning-based controllers 36,37 offer a promising solution to the control challenge. In particular, we take inspiration from previous work 38 in which each actuator is considered an autonomous agent that resides within, and shares, an environment, forming a distributed multi-agent system (MAS). 39 The underlying objective is to enable the agents to coordinate their actions to learn a jointly optimal behaviour. Reinforcement Learning (RL) 40 is a very attractive online optimization technique for automating a solution for an MAS because it enables (i) model-free learning and (ii) online learning. The former is beneficial for learning optimal behaviour in unpredictable environments whose dynamics are not known a priori, whereas the latter is particularly useful for learning behaviour without specifying the underlying actuation technology of the system. This, however, requires the agents to learn independently, 41 similar to the Iterated Prisoner’s Dilemma. 42
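The independent-learner idea can be illustrated with a toy example in which each of the six actuators is an agent that keeps its own action-value estimates and is updated only from a shared scalar reward, without observing what the other agents do. This is a deliberately simplified, stateless sketch, not the controller used in this work: the action set {−1, 0, +1} (decrease, keep, increase), the separable toy reward, and all names are hypothetical.

```python
import random

ACTIONS = (-1, 0, 1)  # decrease / keep / increase an actuator command (hypothetical)


def shared_reward(joint_action, target):
    """Toy team reward: 0 when every agent picks its target action, negative otherwise."""
    return -sum(abs(a - t) for a, t in zip(joint_action, target))


def greedy(values):
    """Index of the highest action value (first index on ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def train_independent_learners(target, steps=3000, epsilon=0.2, seed=0):
    """Each agent keeps sample-average values for its own actions only."""
    rng = random.Random(seed)
    n_agents = len(target)
    q = [[0.0] * len(ACTIONS) for _ in range(n_agents)]
    counts = [[0] * len(ACTIONS) for _ in range(n_agents)]
    for _ in range(steps):
        # Every agent chooses epsilon-greedily from its own table, unaware of the others.
        choice = [rng.randrange(len(ACTIONS)) if rng.random() < epsilon else greedy(q[i])
                  for i in range(n_agents)]
        r = shared_reward([ACTIONS[k] for k in choice], target)
        # All agents are updated from the same scalar reward.
        for i, k in enumerate(choice):
            counts[i][k] += 1
            q[i][k] += (r - q[i][k]) / counts[i][k]
    return tuple(ACTIONS[greedy(q[i])] for i in range(n_agents))
```

Because this toy reward decomposes across agents, each agent's greedy action settles on its own target even though no agent observes the others; coupled, non-separable rewards, like those of a real soft module, are where coordination among independent learners becomes more difficult.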
Reinforcement learning preliminaries
RL is a sequential decision-making tool in which an agent iteratively interacts with an environment modeled as a Markov decision process (MDP). An MDP is a 4-tuple ⟨S, A, T, R⟩, where S is the state space, A is the action space, T is the environment dynamics given by the conditional probability $\Pr\{s' \mid s, a\}$ that taking action a in state s leads to the new state s′, and R is the reward received when action a in state s leads to the new state s′. An agent acts on the environment by selecting actions according to a policy π, that is, a mapping from states to actions. Starting from an initial state, the agent iteratively interacts with the MDP according to the policy to generate a trajectory, that is, a sequence of state–action–reward tuples

$$\langle s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_\tau \rangle,$$

where τ is the terminating condition, $s_t$ is the state of the system at time t, and $a_t$ is the action selected in state $s_t$ at time t under the given policy π. This work considers trajectories that terminate within a finite horizon; hence, τ marks the time at which either the goal $s_g$ is reached or the maximum number of action trials in an episode is exhausted. In optimal control problems, the goal is to find an optimal control policy π* that maximizes the expected cumulative discounted return regardless of the initial state. This work derives π* via policy iteration (PI), 44 one of the two main classes of dynamic programming methods for solving MDPs.
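For a single finite trajectory, the quantity being maximized is the discounted return $G_0 = \sum_{t=0}^{\tau-1} \gamma^{t} r_{t+1}$. The discount factor γ is not stated in this section, so the sketch below leaves it as a parameter; the helper name is hypothetical. Accumulating backward from the last reward avoids computing powers of γ explicitly.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_t gamma**t * r_{t+1} for rewards [r_1, ..., r_tau]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
    return g
```

For instance, three rewards of 1 with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.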
Formally, PI is a subroutine that iteratively runs trajectories for a given duration, alternating between policy evaluation and policy improvement as follows. Starting from any initial position at time step t, an action in the current state is chosen according to the policy, $a_t = \pi(s_t)$.
Consequently, an optimal policy π* can be derived automatically from Q* as

$$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a).$$
For a finite MDP with known dynamics, the Bellman equation can be used to compute the optimal Q-function analytically. However, in most real-world applications, including the problem at hand, the environment dynamics are not known a priori owing to stochasticity. A powerful approach in this scenario is temporal difference (TD)-based PI, which is model free and learns incrementally from data. 40 There are two classical TD learning algorithms for control, Q-learning (off-policy TD learning) and SARSA (on-policy TD learning); this work focuses on the latter.
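In SARSA, an arbitrarily initialized Q-function is incrementally updated with each tuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\delta_t,$$

where

$$\delta_t = r_{t+1} + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

is the TD error, α is the learning rate, and γ ∈ [0, 1] is the discount factor.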
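The SARSA update for a single tuple can be sketched on a dictionary-backed Q-table. This is a minimal tabular illustration in which missing entries default to 0, not the tile-coded implementation used in the experiments; `sarsa_update` is a hypothetical name.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """Apply Q(s,a) <- Q(s,a) + alpha * delta in place and return the TD error delta."""
    # On-policy target: bootstrap from the action actually chosen in s_next.
    target = r if terminal else r + gamma * q.get((s_next, a_next), 0.0)
    delta = target - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * delta
    return delta
```

With α = 0.16, an empty table, and r = −1, a single update moves Q(s, a) from 0 to −0.16.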
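This learning process can be further sped up by incorporating eligibility traces e ∈ ℝ, which update Q not only with the current tuple but with all visited state–action pairs that led to it, since the current outcome is the causal result of the entire trajectory. This variant of SARSA, commonly known as SARSA(λ) with accumulating eligibility traces, 45 associates a memory parameter with each state–action pair; the traces are initialized to 0 at the beginning of each trajectory and updated as

$$e_t(s, a) = \gamma\lambda\,e_{t-1}(s, a) + \mathbb{1}[s = s_t,\ a = a_t],$$

and every Q-value is then updated in proportion to its trace,

$$Q(s, a) \leftarrow Q(s, a) + \alpha\,\delta_t\,e_t(s, a) \quad \forall (s, a),$$

where λ ∈ [0, 1] is the trace-decay parameter and $\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$ is the SARSA TD error.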
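SARSA(λ) with accumulating traces can be sketched end to end on a toy problem. The sketch uses a hypothetical six-state chain (reward −1 per move, 0 on reaching the goal) instead of the soft module, a tabular Q instead of tile coding, and the α = 0.16, ε = 0.05, and λ = 0.9 values reported in the experiments; the discount γ = 0.95 and all names are assumptions.

```python
import random
from collections import defaultdict

N_STATES = 6       # toy chain: states 0..5, goal at state 5 (hypothetical)
ACTIONS = (-1, 1)  # move left / move right


def step(s, a):
    """Deterministic chain dynamics: -1 reward per move, 0 when the goal is reached."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (0.0 if done else -1.0), done


def epsilon_greedy(q, s, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the action with the highest Q."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(s, a)])


def sarsa_lambda(episodes=300, alpha=0.16, gamma=0.95, lam=0.9,
                 epsilon=0.05, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)              # arbitrary (zero) initialisation
    for _ in range(episodes):
        e = defaultdict(float)          # traces reset at the start of each trajectory
        s = 0
        a = epsilon_greedy(q, s, epsilon, rng)
        for _ in range(max_steps):
            s_next, r, done = step(s, a)
            a_next = epsilon_greedy(q, s_next, epsilon, rng)
            target = r if done else r + gamma * q[(s_next, a_next)]
            delta = target - q[(s, a)]  # SARSA TD error
            e[(s, a)] += 1.0            # accumulating trace for the visited pair
            for key in list(e):         # update every pair in proportion to its trace
                q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam   # decay towards the next time step
            if done:
                break
            s, a = s_next, a_next
    return q


def greedy_steps_to_goal(q, limit=20):
    """Follow the greedy policy from state 0 and count the moves to the goal."""
    s, n = 0, 0
    while s != N_STATES - 1 and n < limit:
        s, _, _ = step(s, max(ACTIONS, key=lambda a: q[(s, a)]))
        n += 1
    return n
```

After training, the greedy policy walks straight to the goal in five moves; the traces let each terminal reward propagate back along the whole trajectory within a single episode instead of one state per episode.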
Convergence to Q* depends upon two factors: (i) the Robbins–Monro criteria, 46 under which the learning rate $\alpha_t \in (0, 1]$ must satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$; and (ii) every state–action pair must be visited infinitely often. However, fulfilling criterion (ii) is nontrivial in a continuous-domain environment, which necessitates techniques to approximate the Q-function, as discussed subsequently.
Parametric linear function approximation
For continuous domains, the Q-function can be estimated with either linear or non-linear function approximation. Theoretical stability and convergence properties favor linear approximators over non-linear ones. 48
The parametric linear function approximator chosen here is tile coding, which partitions the state space into multiple layers, called tilings, defined in the spherical coordinate system such that (i) R = [0, max(R)]; (ii) φ = [0°, 360°]; and (iii) θ = [max(θ), 90°], where max(R) is the length of the manipulator in its fully extended state; φ is the azimuth, which for omnidirectional bending always spans 360°; and θ describes the contraction capability of the manipulator, measured from the z-axis. The complete feature space is then represented by rectangular tilings (refer to Figure 9) and is automatically restricted to the reachable workspace of the manipulator. Each tiling is divided into elemental blocks called tiles, which allow local generalization between states that are close in Euclidean space. The resolution differs in each dimension of the state space and is given by

$$\phi_i = \frac{l_i}{N},$$

The 3-D Cartesian space is abstracted into two-layered rectangular tilings in the R, φ, and θ dimensions. There are 7 tiles, 8 tiles, and 1 tile with widths w_R, w_φ, and w_θ in the respective dimensions. The origins of the tilings are offset with respect to each other. The position of the soft manipulator end effector activates one tile per tiling.
where i = 1, …, n and n is the total number of dimensions of the state space; ϕ_i is the resolution of a tile in dimension i; l_i is the width of a tile in that dimension; and N is the total number of tilings. The resolution strongly affects the learning performance of the algorithm: a finer resolution (smaller tiles) implies slower learning but more accurate solutions, whereas a coarser resolution (larger tiles) implies faster learning but less accurate solutions. The resolution is mostly selected through manual trial and error. Tiles act as receptive fields: exactly one tile per tiling is activated, namely the one whose region contains the given state. The origin of each tiling is offset to avoid singularities at the tile boundaries. The Q-function is then approximated by the sum of the weights stored in the activated tiles:

$$\hat{Q}(s, a) = \mathbf{w}^{\top}\boldsymbol{\theta}(s, a) = \sum_{j=1}^{k} w_{j}(s, a),$$

where j = 1, …, k and k is the total number of tilings; θ: S × A → {0, 1}^m is a vector of binary basis functions with one entry per tile (m tiles in total), exactly k of which are equal to 1; w ∈ ℝ^m is the vector of learnt weights; and w_j(s, a) denotes the weight of the tile activated in tiling j.
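The tiling scheme can be sketched as a small grid tile coder over (R, φ, θ). It is a minimal illustration under stated assumptions, not the software used in this work: tiling t is shifted by t/N of a tile width in every dimension, out-of-range states are clamped to the edge tiles (a real implementation might wrap φ around 360°), θ is measured from the downward (−z) axis because the module hangs vertically, and the names `TileCoder`, `to_spherical`, and `q_value` are hypothetical. With the settings reported in the experiments (10 × 12 × 10 tiles, 4 tilings) it yields 4800 binary features, exactly 4 of which are active for any state.

```python
import math


class TileCoder:
    """Uniformly offset grid tilings over a box-shaped state space (hypothetical helper)."""

    def __init__(self, lows, highs, tiles_per_dim, n_tilings):
        self.lows = lows
        self.tiles_per_dim = tiles_per_dim
        self.n_tilings = n_tilings
        self.widths = [(h - l) / n for l, h, n in zip(lows, highs, tiles_per_dim)]
        self.tiles_per_tiling = math.prod(tiles_per_dim)
        self.n_features = self.tiles_per_tiling * n_tilings

    def active_tiles(self, state):
        """Return one feature index per tiling: the tile that contains `state`."""
        active = []
        for t in range(self.n_tilings):
            flat = 0
            for x, low, width, n in zip(state, self.lows, self.widths, self.tiles_per_dim):
                shifted = x - low + t * width / self.n_tilings  # offset of tiling t
                i = min(max(int(shifted // width), 0), n - 1)   # clamp to the grid
                flat = flat * n + i
            active.append(t * self.tiles_per_tiling + flat)
        return active


def to_spherical(x, y, z):
    """Cartesian tip position (cm) -> (R cm, phi deg, theta deg), theta from the -z axis."""
    r = math.sqrt(x * x + y * y + z * z)
    phi = math.degrees(math.atan2(y, x)) % 360.0
    theta = math.degrees(math.acos(max(-1.0, min(1.0, -z / r)))) if r > 0 else 0.0
    return r, phi, theta


def q_value(weights, coder, state, action):
    """Linear approximation: sum of this action's weights over the active tiles."""
    return sum(weights[action][i] for i in coder.active_tiles(state))


if __name__ == "__main__":
    coder = TileCoder((10.0, 0.0, 0.0), (30.0, 360.0, 90.0), (10, 12, 10), 4)
    state = to_spherical(3.0, 4.0, -20.0)
    print(coder.n_features, coder.active_tiles(state))
```

Keeping one weight vector per action makes Q̂(s, a) a sum of only k = 4 weights, so each update touches four entries regardless of the total number of tiles.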
Reward structure and policy learning
To facilitate learning, we guide the policies through reward shaping, such that the MAS is motivated to select actions that progressively lead toward the goal. (Note: in this work, the optimal policy is learnt from a single starting state toward the goal.) This is implemented in an information-theoretic manner by defining the reward as a function of the distance from the goal: a bounded reward function, which increases monotonically as the distance to the target decreases, represents unequally spaced concentric spheres centered on the target, where the number of spheres is chosen arbitrarily but, once set, is applied in the same way to every target. The motivation behind this reward structure is to distribute information throughout the abstracted state space, so that motor commands resulting in positions farther from the goal are updated with larger negative values than those resulting in positions closer to the goal. Even though this distributes information efficiently, learning would still suffer from the curse of dimensionality, 49 because the policy space grows exponentially with the number of agents. We therefore modularize the learning process: the policy learns to reach the goal by moving through the state space in the direction of greater reward. This is achieved by exploiting the reward structure: once the system encounters an action whose scalar reward is higher than any previously encountered, that action becomes the first action taken from the next episode onward.
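Both ingredients, the shell-based shaped reward and the rule that the best-rewarded action opens later episodes, fit in a few lines. This is a hypothetical sketch: the shell radii below are invented for illustration (only the innermost 0.99 cm value comes from the experimental settings), and `shaped_reward` and `FirstActionMemory` are illustrative names.

```python
import bisect
import math

# Hypothetical shell radii (cm), unequally spaced; the innermost shell is the
# 0.99 cm target region used in the experiments. The actual radii are not listed.
SHELL_RADII_CM = (0.99, 2.0, 4.0, 8.0, 16.0)


def shaped_reward(distance_cm):
    """Bounded reward in [-len(SHELL_RADII_CM), 0] that grows as the tip nears the target."""
    shell = bisect.bisect_left(SHELL_RADII_CM, distance_cm)  # 0 inside the target region
    return float(-shell)


class FirstActionMemory:
    """Remember the best-rewarded action so far; it opens every following episode."""

    def __init__(self):
        self.best_reward = -math.inf
        self.best_action = None

    def observe(self, action, reward):
        if reward > self.best_reward:
            self.best_reward, self.best_action = reward, action

    def first_action(self, default):
        return default if self.best_action is None else self.best_action
```

Because the reward only changes when the tip crosses a shell, many nearby positions share a value, while any crossing toward the target is immediately visible as a strictly higher reward that the memory can latch onto.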
Hierarchical controller
A schematic overview of the motion controller is depicted in Figure 10. It has a hierarchical architecture in which, at the higher level, a trajectory is generated within the reachable workspace of the manipulator using spherical coordinates and is sampled at n points.

Schematic view of the hierarchical control architecture for a point-to-point motion controller for soft robotic modules.
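The two-level structure can be sketched as follows: the higher level samples n targets along a trajectory in spherical coordinates, and the lower level is called once per sample, in order. In this illustration the trajectory is a straight-line interpolation in (R, φ, θ), whereas the trajectory used in the experiments is irregular, and `low_level` stands for any hypothetical per-target learner, for example one built on a tile-coded SARSA(λ) agent.

```python
def sample_trajectory(start, end, n):
    """Higher level: n evenly spaced samples between two (R, phi, theta) points."""
    if n < 2:
        raise ValueError("need at least two samples")
    return [tuple(a + (b - a) * k / (n - 1) for a, b in zip(start, end)) for k in range(n)]


def run_hierarchical(trajectory, low_level):
    """Feed each sample to the lower-level learner in order and collect its solutions."""
    solutions = []
    for target in trajectory:
        solutions.append(low_level(target))  # e.g. a learnt motor command per target
    return solutions
```

Separating the levels this way keeps the learner's problem small (one target at a time), while the higher level alone decides the shape and sampling density of the path.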
Experiments
Tile coding was implemented using readily available software. Before use, all parameters required by the algorithm are set as follows (refer to Table 2): (i) Reachable workspace: defined in spherical coordinates consistently with the characterization of the module shown in Figure 8, that is, R = [10 cm, 30 cm], φ = [0°, 360°], θ = [0°, 90°]. The upper bound of the radius was deliberately overextended to accommodate unforeseen scenarios, for example, a torn cable resulting in a larger workspace. (ii) Resolution: 10, 12, and 10 tiles were initialized in the R, φ, and θ dimensions, with a total of 4 tilings offset with respect to one another, so that the state space has a total of 10 × 12 × 10 × 4 = 4800 tiles. (iii) Learning parameters: the step size (α) and exploration rate (ε) are set to 0.16 and 0.05, standard values in the literature, and the eligibility trace decay (λ) is set to 0.9. (iv) Accuracy: the radius of the target region was set to an arbitrarily chosen value of 0.99 cm; this is the largest reaching error accepted as a solution, the ideal being 0. (v) Action set: the module chooses actions from a heuristically initialized discrete action set that can simultaneously increase, decrease, or keep unchanged the lengths of the cables and pneumatic chambers. The length of each cable is controlled by a servomotor rotation range of [0°, 180°], whereas the length of each chamber is controlled by pressure-regulating valves in the range [0, 1] bar. The cardinality of the action set implies that the module has to learn the required motor command from a total of 9⁶ = 531,441 input action combinations. (vi) Algorithmic settings: the algorithm runs online for a total of 100 episodes with 50 trials per episode and is evaluated on two criteria: (a) mean reaching error and (b) trials needed for convergence.
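The settings above can be collected into one configuration object from which the derived quantities follow: 10 × 12 × 10 × 4 = 4800 tiles, 9⁶ = 531,441 joint actions, and a budget of 100 × 50 = 5,000 action trials per target. This is a hypothetical sketch; the field names are illustrative, and the nine choices per actuator are inferred only from the stated total of 9⁶ combinations.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Parameter values from the experimental settings (field names are hypothetical)."""
    r_range_cm: tuple = (10.0, 30.0)
    phi_range_deg: tuple = (0.0, 360.0)
    theta_range_deg: tuple = (0.0, 90.0)
    tiles_per_dim: tuple = (10, 12, 10)  # R, phi, theta
    n_tilings: int = 4
    alpha: float = 0.16                  # step size
    epsilon: float = 0.05                # exploration rate
    lam: float = 0.9                     # eligibility trace decay
    target_radius_cm: float = 0.99       # acceptance radius
    n_actuators: int = 6                 # three cables + three chambers
    levels_per_actuator: int = 9         # inferred from 9**6 = 531,441
    episodes: int = 100
    trials_per_episode: int = 50

    @property
    def total_tiles(self) -> int:
        return math.prod(self.tiles_per_dim) * self.n_tilings

    @property
    def n_joint_actions(self) -> int:
        return self.levels_per_actuator ** self.n_actuators

    @property
    def action_budget(self) -> int:
        return self.episodes * self.trials_per_episode
```

Freezing the dataclass keeps a run's settings immutable, so the values used to train the module cannot drift between samples of the trajectory.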
A summary of applied parameters for the distal module.
Experimental setup
The experimental setup is similar to the one described in section “Kinetic characterization of the distal module”; however, an OptiTrack V120:Trio motion capture system was employed to provide real-time 3-D position feedback. Seven retroreflective markers are attached to the end effector in an arbitrary arrangement, of which four need to be detected to reconstruct the 3-D tip position. The complete setup, with the robotic module in its initial position, is shown in Figure 11. The system represents a mapping from a six-dimensional motor space (ℝ⁶) to a three-dimensional Cartesian space (ℝ³).

Experimental setup.
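Because each command is a point in the six-dimensional motor space and the action set contains 9⁶ = 531,441 joint actions, the joint actions can be indexed with a mixed-radix (base-9) encoding, as in the hypothetical sketch below. The physical meaning of each of the nine per-actuator levels is not specified in this section, so the code only converts between per-actuator levels and a joint-action index and clamps commands to the stated servo (0° to 180°) and pressure (0 to 1 bar) ranges. The names `encode`, `decode`, and `clamp` are illustrative.

```python
# Hypothetical indexing of joint actions in the six-dimensional motor space.
LEVELS_PER_ACTUATOR = 9            # 9 ** 6 == 531_441 joint actions
NUM_ACTUATORS = 6                  # one entry per motor input
NUM_JOINT_ACTIONS = LEVELS_PER_ACTUATOR ** NUM_ACTUATORS

SERVO_RANGE_DEG = (0.0, 180.0)     # cable servomotor rotation range
PRESSURE_RANGE_BAR = (0.0, 1.0)    # pressure-regulating valve range


def encode(levels):
    """Map one discrete level per actuator to a single joint-action index."""
    if len(levels) != NUM_ACTUATORS:
        raise ValueError(f"expected {NUM_ACTUATORS} levels, got {len(levels)}")
    index = 0
    for level in levels:
        if not 0 <= level < LEVELS_PER_ACTUATOR:
            raise ValueError(f"level {level} outside [0, {LEVELS_PER_ACTUATOR})")
        index = index * LEVELS_PER_ACTUATOR + level
    return index


def decode(index):
    """Inverse of encode: recover the per-actuator levels from a joint index."""
    if not 0 <= index < NUM_JOINT_ACTIONS:
        raise ValueError(f"index {index} outside [0, {NUM_JOINT_ACTIONS})")
    levels = []
    for _ in range(NUM_ACTUATORS):
        index, level = divmod(index, LEVELS_PER_ACTUATOR)
        levels.append(level)
    return tuple(reversed(levels))


def clamp(command, bounds):
    """Keep a servo angle or chamber pressure inside its physical range."""
    low, high = bounds
    return min(max(command, low), high)
```

For example, `decode(encode((0, 4, 8, 4, 4, 0)))` returns `(0, 4, 8, 4, 4, 0)`, and `encode((8,) * 6)` gives the largest index, 531,440.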
Results
We generated 12 points along an unevenly shaped curve that starts at its top right end, relative to the module in its resting position, and moves downward toward the resting position. The high-level controller then fed each sample sequentially to the low-level controller.
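A minimal sketch of the high-level loop is shown below. It assumes that the low-level learner is exposed as a callable `reach(target)` that returns the final tip position and the number of actions used; this interface and the names `run_trajectory`, `summarize`, and `fake_reach` are hypothetical. The loop feeds the samples in order and summarizes the reaching error as mean ± standard deviation. Whether the reported ± value is a sample or a population standard deviation is not stated, so using the sample standard deviation (which needs at least two targets) is also an assumption.

```python
import math
import statistics


def run_trajectory(samples, reach):
    """Feed each sampled target, in order, to the low-level controller.

    `reach(target)` is a hypothetical interface to the low-level learner that
    returns (final_tip_position, number_of_actions_used).
    """
    results = []
    for target in samples:
        tip, actions_used = reach(target)
        results.append({
            "target": target,
            "tip": tip,
            "error_cm": math.dist(tip, target),   # Euclidean reaching error
            "actions": actions_used,
        })
    return results


def summarize(results):
    """Mean and sample standard deviation of the reaching error (needs >= 2 results)."""
    errors = [r["error_cm"] for r in results]
    return statistics.mean(errors), statistics.stdev(errors)


# Usage with a stand-in learner whose tip always ends 0.5 cm away along x.
def fake_reach(target):
    x, y, z = target
    return (x + 0.5, y, z), 100


results = run_trajectory([(1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (3.0, 1.0, 1.0)], fake_reach)
print(summarize(results))   # (0.5, 0.0)
```

Keeping the high-level loop separate from the learner means the same loop can later be driven by a learner trained on a forward model instead of the physical robot.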
Reaching error
Figure 12 depicts the kinematic solution learnt by the manipulator for each sample. The algorithm was found to converge for all 12 points with an average reaching accuracy of 0.79 cm (± 0.18 cm), as summarized in Table 3. It is worth mentioning that this accuracy opens the possibility of applying this module to reach sensitive areas when providing future scrubbing and rinsing services. Furthermore, because the pneumatic and cable actuators are controlled simultaneously, the controller can inherently produce stiffness, which would otherwise require tedious kinematic and force mappings. It thus offers a model-free way of handling internal and external loads efficiently. The authors also note that this accuracy can be improved by reducing the size of the target region, provided that the system can physically reach the desired accuracy. The green line depicts the trajectory the module follows from one point to the next. The trajectory from the resting position to the first target point is not shown because it obstructs the view of the other targets.

The distal module moving point-to-point through an asymmetric 3-D trajectory generated offline and sampled at 12 points. Each frame corresponds to the learnt kinematic solution of the manipulator for a given sample. The green line depicts the trajectory followed by the manipulator from sample to sample.
Convergence time
For a given sample, the algorithm allows the module to test up to 5,000 actions (i.e. 100 episodes × 50 trials per episode); however, Table 3 shows that the algorithm generates solutions for all samples within a few hundred actions, except for targets 1 and 2, which are discussed in the next paragraph. This is a direct consequence of the reward structure described in the previous section: in a state where the manipulator has experienced actions with both lower and higher scalar rewards, it prefers the actions with the higher reward. This yields high data efficiency as well as the ability to learn accurate solutions quickly. As a future research goal, however, the authors intend to use the hierarchical control in conjunction with a learnt forward model. A forward mapping is a causal, unique relation from the motor space to the task space and can therefore be learnt accurately. The inverse kinematic solutions can then be learnt without executing real-time trajectories, saving both time and hardware resources.
A summary of the obtained results for point-to-point motion control of the module through a sampled asymmetric trajectory.ᵃ
ᵃNote: The data in this table are color-coordinated with the data provided in Figure 13.
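One common way to obtain this preference for higher-reward actions is ε-greedy selection over learnt action values. The sketch below is a generic SARSA(λ) learner with binary tile features that uses the step size, exploration rate, and trace decay from the parameter list (α = 0.16, ε = 0.05, λ = 0.9). The discount γ = 1, the replacing traces, the separate weight vector per action, and the class name `LinearSarsaLambda` are assumptions made for illustration; the multiagent decomposition used in this work is not reproduced here.

```python
import random


class LinearSarsaLambda:
    """SARSA(lambda) with binary tile features and one weight vector per action.

    alpha, epsilon and lam default to the values in the parameter list;
    gamma = 1.0, replacing traces and the per-action layout are assumptions.
    """

    def __init__(self, num_features, num_actions, alpha=0.16, epsilon=0.05,
                 lam=0.9, gamma=1.0, seed=None):
        self.num_actions = num_actions
        self.alpha, self.epsilon, self.lam, self.gamma = alpha, epsilon, lam, gamma
        self.w = [[0.0] * num_features for _ in range(num_actions)]
        self.z = {}                      # sparse traces: (action, feature) -> trace
        self.rng = random.Random(seed)

    def q(self, tiles, action):
        """Action value: sum of the weights of the active tiles for this action."""
        return sum(self.w[action][i] for i in tiles)

    def act(self, tiles):
        """Explore with probability epsilon, otherwise pick a best-valued action."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_actions)
        values = [self.q(tiles, a) for a in range(self.num_actions)]
        best = max(values)
        return self.rng.choice([a for a, v in enumerate(values) if v == best])

    def update(self, tiles, action, reward, next_tiles=None, next_action=None):
        """One SARSA(lambda) step; leave next_tiles as None for a terminal step."""
        delta = reward - self.q(tiles, action)
        if next_tiles is not None:
            delta += self.gamma * self.q(next_tiles, next_action)
        for i in tiles:
            self.z[(action, i)] = 1.0    # replacing traces for the active features
        for key, trace in list(self.z.items()):
            a, i = key
            self.w[a][i] += self.alpha * delta * trace
            decayed = trace * self.gamma * self.lam
            if decayed < 1e-4:
                del self.z[key]          # drop negligible traces to keep updates sparse
            else:
                self.z[key] = decayed
        if next_tiles is None:
            self.z.clear()               # traces do not carry over between episodes
```

After a single penalized step with action 0 (for example, `update([0, 1], 0, -1.0)` on a fresh learner with ε = 0), the value of action 0 in that state drops to −0.32 and the greedy choice switches to action 1, which is the behavior described above.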
Targets 1 and 2 in Table 3 required a significantly larger number of actions to converge. This is because the actions taken by the module to reach these two targets positioned the retroreflective markers in such a way that the OptiTrack system had difficulty reconstructing the 3-D position. As mentioned previously, when more than three markers are occluded at various places in the state space, no position information is available and (0, 0, 0) is returned to the algorithm. This state is assigned a reward of −1000. The red plots in Figure 13(b) illustrate the reward accumulation for these two targets, which is considerably larger in magnitude than for the other samples. Despite this challenge, the algorithm was able to learn a solution as long as the target point was not occluded.

The reward accumulation for all samples of the asymmetric trajectory.
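The occlusion handling can be summarized as a reward function. In the sketch below, a tip reading is replaced by (0, 0, 0) when fewer than four of the seven markers are visible, that state receives −1000, and a tip within the 0.99 cm target radius receives 0 and counts as converged. The negative-distance reward for all other states is a placeholder assumption, not the shaping defined in the previous section, and the names `observe` and `reward` are illustrative.

```python
import math

TARGET_RADIUS_CM = 0.99          # largest accepted reaching error
OCCLUSION_PENALTY = -1000.0      # reward for the "no data" state
NO_DATA = (0.0, 0.0, 0.0)        # returned when the tip cannot be reconstructed
MIN_VISIBLE_MARKERS = 4          # out of the seven markers on the end effector


def observe(num_visible, tracked_tip):
    """Return the tracked tip, or NO_DATA if too few markers are visible."""
    return tracked_tip if num_visible >= MIN_VISIBLE_MARKERS else NO_DATA


def reward(tip, target):
    """Return (reward, converged) for one step."""
    if tip == NO_DATA:
        return OCCLUSION_PENALTY, False
    error = math.dist(tip, target)
    if error <= TARGET_RADIUS_CM:
        return 0.0, True             # inside the target region: converged
    return -error, False             # placeholder shaping (assumption)
```

With this structure, every occluded step costs as much as a step roughly 10 m from the target under the placeholder shaping, which is why occlusion-prone targets dominate the accumulated reward.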
The quality of the learnt policies
As mentioned in section “Parametric linear function approximation”, in the worst case this algorithm generates a suboptimal policy, and this was observed in the practical setup. The algorithm was tested from various starting positions; however, in the majority of cases, solutions were generated from positions closer to the goal rather than farther from it. One possible reason is that the resolution of the state space is not optimized and should be increased. However, a higher resolution would substantially increase the computational time, whereas the currently learnt policy already meets our requirements well. Consequently, the authors believe that such an approach should not be precluded from practical applications.
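The cost of a finer state space can be estimated from the number of features, which for tile coding equals the number of tilings times the product of the per-dimension resolutions. The short sketch below, with the illustrative function name `num_tiles`, shows that doubling the resolution in each of the three dimensions multiplies the feature count, and therefore the number of weights to learn per action, by 2³ = 8.

```python
import math


def num_tiles(tiles_per_dim, num_tilings=4):
    """Number of binary features: tilings times the tiles in one tiling."""
    return num_tilings * math.prod(tiles_per_dim)


print(num_tiles((10, 12, 10)))   # 4800, the current resolution
print(num_tiles((20, 24, 20)))   # 38400, twice the resolution in each dimension
```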
Figure 13 depicts the reward accumulation for all samples. All experiments exhibit the same trend: the accumulated reward increases until convergence, when the module receives a reward of 0 and the accumulation becomes constant.
Discussion and conclusions
Soft robotics has already demonstrated promising results toward revolutionizing key sectors such as industry and surgery, primarily because this technology guarantees safe human–robot interaction. This article aims to apply the powerful properties of soft robotic systems to design the first system to assist the elderly community in the task of bathing. We proposed an overall modular system; however, to reach this ambitious goal, we first needed to design, fabricate, and test the key module that will perform the main bathing activities, that is, the distal module, which will come into direct contact with the human.
For the design, a hybrid actuated system was proposed. It comprises radially arranged, bellow-shaped pneumatic chambers and cables. The kinetic characterization of the module revealed the following functional capabilities of the robotic platform: the novel bellow shape of the fluidic actuators contributes to elongation and contraction, whereas the fluidic chambers contribute to adapting the tip orientation. The overall advantages provided by the combination of these actuators are (i) reachability of the user workspace: the combination makes it possible to cover larger distances with a reduced number of modules and to orient the module with respect to the target region of the user’s body. This was observed in the measured reachable workspace of 21.5 cm × 17.2 cm × 10 cm along the x–y–z axes, respectively. Consequently, a single module is able to match the dimensions of the human back region. However, further experiments are required to test how well this module can be oriented toward this region when assembled in modular form. We also acknowledge that these results are specific to the experimental setup described in section “Experimental setup”. However, this is not considered a source of error, as the overall behavior (apart from the asymmetries) is consistent with theoretical expectations. In future work, a more optimized setup will be prepared by replacing the servomotors with motors that allow variable adjustment of the module motion, given the extrinsic cable-driven actuation. (ii) Stiffness variation: hybrid actuation provides an antagonistic motion between cables and pneumatic actuators, which results in stiffening; this is particularly important for compensating internal and external loads in the bathing environment.
For the control, a hierarchical point-to-point motion controller based on multiagent RL was used to solve the inverse kinetics of a high-dimensional, nonlinear, hybrid actuated manipulator. We tested the controller by moving the manipulator through an asymmetric trajectory generated offline and sampled at 12 discrete points. The algorithm provided solutions for all points with a positional accuracy of 0.79 cm ± 0.18 cm. This accuracy opens the possibility of applying this module to reach sensitive areas when providing future scrubbing and rinsing services. The accuracy could also be improved by reducing the target region, provided that the system can physically reach that limit. Furthermore, because the positions are reached with all six input activations, the controller is able to exploit the stiffness properties in a model-free manner while reaching, which would otherwise require complex kinematic and force models. The authors also demonstrated that the approach is effective even though it may generate only suboptimal policies. A limitation of the approach is the potentially high number of steps needed to reach a single point. This will be addressed by first computing a forward model, then applying the proposed method to that model to compute the inverse solution, which can then be executed by the robot. Future work will focus on this improvement as well as on the control of the tip orientation, which is not considered in this implementation. Finally, the experiments showed that vision-based systems may introduce additional constraints, such as marker occlusion; hence, alternative sources of feedback are recommended.
Acknowledgements
The authors would like to acknowledge the support by the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme FP7/2007-2013 under REA grant agreement number 608022; the European Commission through the I-SUPPORT project (#643666); and the Italian Ministry of Foreign Affairs, General Directorate for the Promotion of the Country System, Bilateral and Multilateral Scientific and Technological Cooperation Unit, for the support through the Joint Laboratory on Biorobotics Engineering project.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
