Abstract
This paper describes an autonomous robot's method of dressing a subject in clothing. Our target task is to dress a person in the sitting pose. We especially focus on the action whereby a robot automatically pulls a pair of trousers up the subject's legs, an action frequently needed in dressing assistance. To avoid injuring the subject's legs, the robot should be able to recognize the state of the manipulated clothing. Therefore, while handling the clothing, the robot is supplied with both visual and tactile sensory information. A dressing failure is detected by visually sensing the behaviour of optical flows extracted from the clothing's movements. The proposed approach was implemented and validated on a life-sized humanoid robot.
1. Introduction
Everyday clothing exists in various forms, and dressing is an essential part of humans' daily routines. Since clothing is changed to suit appropriate times and circumstances, we can also regard dressing as an important social activity. However, elderly individuals or those with physical disabilities frequently require assistance while dressing. Automated dressing would improve the quality of life for these groups of people [3].
In this study, we focus on a dressing action that is particularly problematic for disabled people, i.e., pulling clothing items such as trousers along the legs. The main contribution of this study is a complete system for achieving this dressing motion with an autonomous robot. The system requires recognition functions and motion-planning functions that cooperate closely with each other. We designed functions suitable for clothing manipulation, and coordinated them to form another important function: failure detection and recovery.
With respect to recognition, the estimation of clothing states is challenging because clothes are soft objects whose shapes change greatly when they are handled. The robot must understand the condition of fabrics in order to avoid injuring the subject's legs. To achieve this recognition, we extracted optical flows from two consecutive images that were captured during dressing, and used them to estimate the clothing's present state. Incorrect situations were detected by supplementary force sensors mounted on the wrists of the robot.
The robot must also create an end-effector trajectory for pulling up a pair of trousers. Because leg length differs among individuals, the trajectory planning should be adjustable on site. To accommodate leg differences, we created a set of trajectory segments corresponding to the standard leg size in advance. These trajectory segments were then fitted to the size and position of the subject's legs using depth information captured by a range camera just before dressing. Modifications were based on a statistical human model.
Collectively, estimation of clothing state and on-site trajectory modification enables failure detection and a recovery function. In other words, if the estimation function detects an unforeseen situation, such as the snagging of the toe on cloth, the original trajectory changes to another trajectory in order to restore the previous trouble-free condition. Once the failure is corrected, the dressing procedure continues. The effectiveness of the function was confirmed in experiments on a real humanoid robot.
This paper is organized as follows: In the next section, related work is introduced. Section 3 describes our approaches. Sections 4 and 5 describe the clothing's state and estimations based on optical flows. Section 6 describes the end-effector trajectory planning. Sections 7 and 8 introduce experimental results, and Section 9 concludes this paper.
2. Related Work
2.1. Dressing Assistance by an Automated Machine
Dressing assistance has received little attention in robotics research. Among the few relevant studies, Matsubara et al. [12] proposed the use of reinforcement learning in putting on a t-shirt. They found a feasible end-effector trajectory after dozens of trials. Several machines that provide daily assistance relevant to clothing have also been developed. In the case of a support machine for the toilet, the subject needs only to stand at the centre of the machine, and their body is manipulated by mechanical arms equipped with custom-designed actuators.
Dressing actions using these systems are based on grasping points and motion trajectories. Manipulation failures during dressing are not considered, and recognition functions that provide information about the dressing state are lacking. However, these functions are crucial in practical applications because of the difficulty inherent in controlling the fluid motions of clothing.
2.2. Clothing-state Estimation
Unlike dressing assistance, recognition methods for clothing manipulation have been widely reported in the literature [6,19]. Because objects made of cloth are so flexible, observing their motions during manipulation is essential. Previous research has represented clothing items via contours and deformable models.
Ono et al. [15] estimated the state of a rectangular piece of cloth from its contour information. Data were provided as groups of planar states, some of which had bent corners. Other researchers used silhouette information to estimate the state of a piece of cloth hung up by a robot [11,16]. Kita et al. [7] converted images captured by a trinocular stereo camera to a cloud of input points, which were fitted to a three-dimensional deformable model.
Three-dimensional databases can also be created by a physical engine. Kita et al. [8] fitted the three-dimensional cloth model to a three-dimensional point cloud. Maitin-Shepard et al. [11] proposed action selection for manipulating deformable planar objects. By implementing a physical model, they achieved the straightening of a square-shaped cloth by a robot.
These previous studies assumed static conditions for the cloth during the recognition process. Almost all of the studies above use prohibitively time-consuming methods (requiring several tens of seconds for a single estimation [13]). For practical application, the efficiency of state estimation must be markedly improved.
3. Issues and Approaches
3.1. Assumed Dressing Procedure
The purpose of this study is to dress a person seated at a bedside. The person is presumed to possess partial control over his limbs. Figure 1 shows the sequence of dressing a mannequin. As shown in the mannequin's poses, we assume that our target person can lift one leg and pull himself into a standing position.

A dressing procedure
We assume that the dressing task is divided into several phases: (i) insert both feet into the trousers' legs (Figure 1, (1)), (ii) pull the item up to cover the knees (Figure 1, (2)), (iii) pull up the dangling hem on one side (Figure 1, (3) and (4)), (iv) repeat (iii) for the other side (Figure 1, (5) and (6)) and (v) pull the item up over the hips (Figure 1, (7) and (8)). In this study, the robot performs motion sequences (i)-(iv). We focus, especially, on sequences (i) and (ii).

Failure cases. Top left: neither leg is successfully inserted. Top right and middle left: only one leg is successfully inserted. Middle right, lower left and lower right: both legs are inserted into one trouser leg.

A framework for the estimation of dressing

An example of an optical flow when only the lower left part was manipulated. Left: original image, centre: detected flows depicted by colour segments, right: the distribution of the flow magnitude.

Relationship between overall flows and local flows

Online classification of a clothing sequence

State transition model

An example of leg detection. Upper left and right images: colour image and depth image captured using an Xtion sensor. Lower left: a region-growing result. Pixels of the same colour represent the fact that they belong to the same cluster. Lower right: a legs-detection result. Two clusters regarded as legs are extracted using the region-growing result. Yellow points show the gravity centre of each cluster.
Another procedure is possible. For instance, in order to reduce the difficulty of leg insertion into the deformable trousers, one method is to roll the trousers' legs before inserting the subject's feet. However, the rolling manipulation is difficult for the robot, so it would have to be done by the person being dressed. In our procedure described above, by contrast, the person only needs to let the robot grasp both ends of the trouser legs, which is more practical.
3.2. Dressing Problems
During the dressing action outlined above, some undesirable situations may occur while inserting the subject's legs into the clothing item and pulling it up the body. For example, as shown in the lower left and right-hand panels of Figure 2, the feet may snag on the cloth, or a leg may not enter the desired opening. To avoid these undesirable situations, the progress of the dressing action should be monitored and managed by external sensors.
Previous researchers have used image sensors to determine the state of clothing. Because images provide a wide variety of information, this approach shows promise for our proposed application. Past research offers various options for building prior knowledge about cloth and for extracting useful information from image data. Researchers have modelled clothing via three-dimensional deformable models [8], three-dimensional mesh models [7] and two-dimensional contours [11].
These models are then matched with processed sensory data, such as three-dimensional point clouds and image contours. Previously proposed matching methods are based on position alignment between the model and sensor data. However, the alignment approach has always been time-consuming when applied to flexible objects such as clothes. Thus, an effective matching process is imperative in order to realize a practical robotic dressing system.
From the discussion above, we identify the following problems relating to dressing:
3.3. Approach
To resolve the issues above, we adopt the following approaches:
To construct prior knowledge, a series of optical flows measured from various dressing patterns are preregistered and labelled according to their dressing phase. In the state-estimation process, the current optical flow is compared with the registered optical flow, and the label of the most similar situation is used to represent a present clothing state. The method presented is the main contribution of this paper; it is described in detail in Sections 4 and 5.
When the robot dresses a different subject in trousers, an end-effector trajectory is created by modifying the basic trajectory. Because the result of item (b) provides an approximated position of a convex body shape, its difference from the original subject is used for the modification. The details are described in Subsection 6.2.
Section 4 is dedicated to determining whether the robot adopts an appropriate dressing behaviour (item (a), above). The input to the proposed method is an image stream of the dressing sequence. Section 5 details the state estimation of the legs and the planning of the end-effector trajectory (items (b) and (c), above). In these steps, the input is a depth image captured immediately before the dressing action occurs.
We will also discuss the correction of dressing failures. Failures are detected by vision functions described in the next section, supplemented by force sensors embedded in the robot's wrists. Because our vision function returns the type of failure, the result is used to decide the next end-effector trajectory that will address the failure.
4. Description of an Optical Flow-based Clothing State
Optical flow is calculated from two consecutive images. Using the flow distribution, the present clothing state is matched against the optical flows extracted from a training dataset. To enable this matching, we describe the clothing's condition in terms of three feature types. The method outputs the status of the present dressing action, i.e., successful or otherwise; if the dressing is unsuccessful, the method also outputs the specific type of failure. To improve the discrimination process, a transition graph is utilized.
4.1. Framework Overview
Figure 3 is a flowchart of the proposed method. The major procedures are outlined below:
4.2. Pre-processing the State Description of Dressing
Pre-processing 1: Domain segmentation
Before describing the state, the image region of the target clothing item is extracted via a dynamic graph cut method [9]. This procedure minimizes the following cost function [1]:
where V is a group of image pixels and
Optimization by graph cut usually requires pre-specified seed points. In our case, the points are automatically detected using three-dimensional information because we use a 3D-range image sensor. After specifying the pose of the subject's legs via the method introduced in Section 6.1, pixels satisfying the following items are used as foreground seed points:
they lie in the lower half of the legs in 3D space,
their colour differs from that of the skin, and
a sufficiently large optical flow is detected at them.
Once the seed points are given, the first graph cut is performed. After that, the dynamic graph cut uses the resulting region in the next frame.
Pre-processing 2: Optical flow calculation
The optical flow is calculated from the consecutive clothes regions in an image stream. Because clothing does not necessarily have a distinctive texture, we require a method that works without one. Therefore, we apply a method proposed by Farneback [4], which compares two local image windows
where
Because this method outputs high-density flows, even from less-textured regions, it captures the detailed shape changes of clothes well.
4.3. Three Descriptive Features of Optical-flow Information
Flow magnitude Fm
When a piece of cloth is manipulated, the extent of its motion can vary depending on the situation, because it is a soft object. The global motion characteristics of a piece of fabric can be expressed by the magnitude of the flows.
Our first descriptive feature is calculated via the following procedure. Flows of excessively large or small magnitude are removed by threshold processing. The remaining flows are normalized by their average magnitude. This normalization allows flows to be matched regardless of the dressing speed.
In the central panel of Figure 4, the flows are colour-coded according to their directions. The right panel distinguishes the flow magnitudes (motion speeds) by their brightness. Fm is generated by calculating the density histogram from the greyscale image.
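The computation of Fm can be sketched in numpy as follows; the thresholds and bin count are illustrative assumptions, not the values used in the experiments.

```python
import numpy as np

def flow_magnitude_feature(flow, lo=0.1, hi=50.0, bins=16):
    """Histogram of normalized flow magnitudes (the feature Fm).

    flow: (H, W, 2) array of per-pixel displacements.
    Thresholds and bin count are illustrative, not the paper's values.
    """
    mag = np.linalg.norm(flow, axis=-1).ravel()
    mag = mag[(mag > lo) & (mag < hi)]   # drop excessively small/large flows
    if mag.size == 0:
        return np.zeros(bins)
    mag = mag / mag.mean()               # normalize for dressing-speed invariance
    hist, _ = np.histogram(mag, bins=bins, range=(0.0, 4.0))
    return hist / hist.sum()
```

Because of the normalization step, uniformly scaling every flow vector leaves Fm unchanged, which is exactly the speed invariance described above.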
Mutual relationship between flow pairs Fr
The second feature describes the positional and directional relationships between a pair of flows. This feature quantifies the complexity of flows, i.e., whether a group of flows is well aligned or disturbed.
Let
The distance between two flows
The angle between one flow direction and the position vector between two flows.
The angle between one flow and another
These calculations are performed for many randomly sampled flow pairs, and the results are voted into the three-dimensional parameter space whose axes correspond to items 1)-3) above. Fr is generated by calculating a frequency histogram from the voted space. In this calculation, the space is divided into small voxels, and the number of points in each voxel is counted. Each number is assigned to one bin of the histogram. In our experiments, the frequency histogram was composed of
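The pairwise voting scheme described above can be sketched as follows; the pair count, bin layout and fixed random seed are illustrative assumptions.

```python
import numpy as np

def flow_pair_feature(positions, vectors, n_pairs=500, bins=(4, 4, 4), seed=0):
    """Frequency histogram over randomly sampled flow pairs (the feature Fr).

    positions: (N, 2) flow start points; vectors: (N, 2) flow displacements.
    Each sampled pair votes (distance, angle between one flow and the
    inter-flow position vector, angle between the two flows) into a 3D space.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(positions), n_pairs)
    j = rng.integers(0, len(positions), n_pairs)
    rel = positions[j] - positions[i]      # position vector between the pair

    def angle(a, b):
        cos = np.einsum('ij,ij->i', a, b)
        cos /= np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9
        return np.arccos(np.clip(cos, -1.0, 1.0))

    dist = np.linalg.norm(rel, axis=1)
    samples = np.stack([dist,
                        angle(vectors[i], rel),         # flow vs. position vector
                        angle(vectors[i], vectors[j])   # flow vs. flow
                        ], axis=1)
    hist, _ = np.histogramdd(samples, bins=bins,
                             range=[(0.0, dist.max() + 1e-9),
                                    (0.0, np.pi), (0.0, np.pi)])
    return hist.ravel() / n_pairs
```

Well-aligned flows concentrate the votes in few voxels, whereas disturbed flows spread them out, which is how Fr captures flow complexity.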
Local flow movement Fl
A flexible material such as cloth develops many wrinkles and stretched areas. Thus, when a part of the target cloth is manipulated, the cloth is frequently reformed around the manipulated part. Such a partial movement is represented by extracting possible flows in local areas.
In preparation for feature description, the central position of the clothing region is calculated from the image-segmentation result. Meanwhile, only regions of dense flows are selected as descriptive features. RANSAC [5] is used for this process. Fl is generated by calculating the relationships between the central position and the local flows, as shown in Figure 5. This feature is specified by two criteria: (i) the direction from the local flows to the central position, and (ii) the area ratio of the local flow and clothing regions.
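A minimal sketch of Fl is given below. For brevity, the dense-flow region is selected by a simple magnitude threshold instead of the RANSAC selection used in our method, and the direction criterion is encoded as a unit vector; both simplifications are assumptions of this sketch.

```python
import numpy as np

def local_flow_feature(cloth_mask, flow, mag_thresh=1.0):
    """Relation between dense local flows and the clothing centroid (feature Fl).

    cloth_mask: (H, W) boolean clothing region from segmentation.
    Returns (cos, sin) of the direction from the local flows to the clothing
    centre, plus the area ratio of the local-flow and clothing regions.
    """
    ys, xs = np.nonzero(cloth_mask)
    centre = np.array([xs.mean(), ys.mean()])
    mag = np.linalg.norm(flow, axis=-1)
    ly, lx = np.nonzero(cloth_mask & (mag > mag_thresh))  # dense-flow region
    if ly.size == 0:
        return np.zeros(3)
    direction = centre - np.array([lx.mean(), ly.mean()])
    direction /= np.linalg.norm(direction) + 1e-9
    return np.array([direction[0], direction[1], ly.size / ys.size])
```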
5. Clothing State Estimation Using the State Transition Model
The clothing state is determined via three estimation targets:
whether the dressing is successful or unsuccessful (Figure 6, (A)),
if successful, the present state, and
if unsuccessful, the type of failure (Figure 6, (B)).
To construct a state classification, we prepare a database of feature vectors calculated from a set of training data. Several image streams of successful and unsuccessful clothes dressing are captured. The optical flow and the three features Fm, Fr and Fl are calculated from each image stream. The database contains a list of the features, appended by information of dressing states.
The database serves the following main purposes:
Phase estimation: One dressing sequence is divided into several phases. The database is accessed in order to identify switching events.
One-on-one matching: A pair of consecutive images produces a set of optical flows. The result is used for frame-to-frame matching by which the present dressing condition is recognized.
Based on this information, we implement two types of feature set, illustrated in Figure 3 (6), (7).
5.1. Evaluation Formula
The activity of dressing is estimated by searching for the feature set calculated from current image pairs. Let
Because we specify three feature descriptions, the similarity calculation should integrate these three features. To this end, we express the similarity as follows:
where
Let k be a serial number of phases, and
where
Meanwhile, the Euclidean distance is used for
where
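The idea behind the similarity calculation can be sketched as follows: each of the three histogram features yields a distance to a database entry, the distances are mapped to similarities and combined with weights, and the best-matching labelled state is returned. The weights and the exponential mapping are illustrative assumptions, not the exact formula given above.

```python
import numpy as np

def state_similarity(query, entry, weights=(1.0, 1.0, 1.0)):
    """Combine per-feature distances into one similarity score in (0, 1]."""
    score = 0.0
    for key, w in zip(('Fm', 'Fr', 'Fl'), weights):
        d = np.linalg.norm(np.asarray(query[key]) - np.asarray(entry[key]))
        score += w * np.exp(-d)     # small distance -> similarity near 1
    return score / sum(weights)

def match_state(query, database):
    """Return the label of the most similar database entry and its score."""
    label, feats = max(database, key=lambda e: state_similarity(query, e[1]))
    return label, state_similarity(query, feats)
```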
5.2. Improving the Effectiveness and Robustness of State Matching
A state that is similar to the current one can be found by matching the current state to all states in the database. However, this approach is time-consuming and prone to producing mismatches. Therefore, we improve the efficiency and robustness of the search by employing a state transition model.
The transition model is based on dressing phases. For instance, if the present phase describes a situation in which both legs are inserted into a pair of trousers pulled up by both hands, the next phase can be limited to a manipulation that inserts one foot into one of the trouser legs. If such a transition is not detected, the present manipulation is regarded as a failure.
A transition from one phase
where
Figure 7 is a conceptual diagram of the state transition model. Normally, a transition occurs from one successful phase to the next, but unsuccessful phases transition to failure states such as
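The transition-constrained search can be sketched as follows. The phase names and the transition table are hypothetical placeholders, not the actual phase set of the system.

```python
import numpy as np

# Hypothetical transition table: from each successful phase, only the same
# phase, the next phase, or its known failure modes are reachable.
TRANSITIONS = {
    'insert_first_foot':  ['insert_first_foot', 'insert_second_foot', 'fail_toe_snag'],
    'insert_second_foot': ['insert_second_foot', 'pull_up_to_knees', 'fail_both_in_one_leg'],
    'pull_up_to_knees':   ['pull_up_to_knees', 'done', 'fail_toe_snag'],
}

def match_with_transitions(query, database, current_phase):
    """Match only against states reachable from the current phase.

    database: list of (label, feature_vector) entries. Restricting the
    candidate set both speeds up the search and suppresses mismatches
    against unreachable states.
    """
    allowed = set(TRANSITIONS.get(current_phase, [l for l, _ in database]))
    candidates = [e for e in database if e[0] in allowed]
    label, _ = min(candidates,
                   key=lambda e: np.linalg.norm(np.asarray(query) - np.asarray(e[1])))
    return label
```

Note how a state that is globally closest can lose to a reachable one, which is exactly the robustness the transition model provides.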
6. End-effector Trajectory Generation
As mentioned in Section 2, our approach assumes a basic end-effector trajectory, and this trajectory is adaptively modified to fit individual subjects. This section explains how the state of the subject's legs is measured, and introduces a strategy for modifying the end-effector trajectory.
6.1. The Estimation of the Joint Positions of Legs
The state of the legs is evaluated from images acquired by a three-dimensional range camera, Xtion PRO LIVE. Immediately before the dressing action, the robot stands before the subject and acquires a depth image of the two legs. The subject's legs, captured by the sensor and the pre-processing result, are shown in Figure 8.
The depth image is shown in the upper-right panel. Applying the region-growing algorithm to this image, we obtained dozens of three-dimensional clusters (lower-left panel).
6.1.1. Leg Extraction by the Region-growing Algorithm
The region-growing algorithm first selects an initial point
In this process, the normal vectors of all points should be calculated in advance. The input to this procedure is a depth image in which each pixel has a depth value d. Let the pixel of interest be p. The normal vector at p is calculated from the three-dimensional positions of p and its neighbours. After calculating the positional average and covariance matrix of these points, the directions and lengths of three orthogonal axes are obtained by eigenvalue decomposition. The normal vector is the shortest of the three axes. Thus, each pixel is assigned four variables, i.e., the depth d and the components of a normal vector
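The eigenvalue-decomposition step can be sketched in numpy; gathering the neighbourhood is omitted, so the function assumes the 3D neighbours of p are already collected.

```python
import numpy as np

def estimate_normal(neighbours):
    """Normal vector from the 3D neighbourhood of a pixel of interest.

    neighbours: (N, 3) array of 3D points around (and including) the pixel.
    The covariance of the points is eigendecomposed; the normal is the axis
    of smallest extent, i.e., the eigenvector of the smallest eigenvalue.
    """
    centred = neighbours - neighbours.mean(axis=0)
    cov = centred.T @ centred / len(neighbours)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, 0]                    # shortest axis = surface normal
```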
In the region-growing algorithm, pixels are connected if they satisfy the following rule:
where
An example of leg detection via this procedure is shown in the lower-left panel of Figure 8. Each coloured region indicates one cluster, and the legs are revealed as two long regions of contiguous clusters.
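A minimal breadth-first sketch of the region-growing step is given below; the thresholds on depth difference and normal agreement are illustrative assumptions.

```python
import numpy as np
from collections import deque

def region_grow(depth, normals, seed, d_thresh=0.02, n_thresh=0.95):
    """Grow a region of smoothly connected pixels from a seed.

    depth: (H, W) depth image; normals: (H, W, 3) per-pixel unit normals.
    Neighbouring pixels join the region when their depths are close and
    their normals agree (dot product above n_thresh).
    """
    h, w = depth.shape
    visited = np.zeros((h, w), bool)
    visited[seed] = True
    queue = deque([seed])
    region = [seed]
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w and not visited[ny, nx]:
                if (abs(depth[ny, nx] - depth[y, x]) < d_thresh
                        and normals[ny, nx] @ normals[y, x] > n_thresh):
                    visited[ny, nx] = True
                    region.append((ny, nx))
                    queue.append((ny, nx))
    return region
```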
6.1.2. Estimating the Joints' Positions
The two leg regions extracted by the region-growing algorithm are obtained as a three-dimensional point cloud. The next step is to estimate two characteristic parts, i.e., the knee joint and the ankle. Both parts provide essential information for modifying the end-effector trajectory. First, a point cloud corresponding to a leg is divided into two sections by the kneecap, which is characterized by a large bend when the subject sits on a chair or at a bedside. The region-growing algorithm is applied to the point cloud with a smaller
Note that further division between the knee and toes is unstable because these parts are joined by a gradual curve. In addition, the angle made by the ankle joint depends on the situation and an individual's posture; thus, these regions will not be well modelled by a region-growing algorithm with a static threshold.
To overcome this problem, we use statistical human-body data [20]. The point cloud of a leg is divided into two parts by a second region growing as described above, with the kneecap regarded as the division point. The lower point cloud comprises the area between the shin and the toes. Here, the length proportion between the two parts, from the kneecap to the ankle and from the ankle to the toes, is assumed to be a human anatomical attribute and is retained as a constant. Similarly, the distance between the ankle joint and heel follows anatomical proportions. Consequently, we represent the leg model by three variables, as shown in Figure 9. Based on the leg dimensions of average men and women, we set

Three parameters to estimate the shape of a leg
6.2. End-effector Trajectory Generation
To dress the bottom-half of a subject, we must avoid failures such as snagging the legs on the cloth of trousers. Our approach defines the basic trajectory in advance, and modifies it to fit the point cloud representing the kneecap-to-toe region of the lower legs. In the modification step, 10 regularly spaced anchoring points are first extracted from the point cloud. The interval between the points is
Let

Waypoint generation. First, the two point lists
To prevent the leg parts from becoming entangled with the clothes, we must consider the positions of the kneecap and heel. Let
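The modification idea can be sketched as a similarity mapping between corresponding anchor points; uniform scaling between two anchors (e.g., kneecap and toe) is a simplifying assumption of this sketch, whereas the actual fitting uses the full anchor-point lists and statistical body data.

```python
import numpy as np

def fit_trajectory(basic_waypoints, basic_anchor0, basic_anchor1,
                   measured_anchor0, measured_anchor1):
    """Scale and translate a basic end-effector trajectory to a measured leg.

    The basic trajectory was recorded for a standard leg between two anchor
    points; the measured anchors come from the depth image.
    """
    wp = np.asarray(basic_waypoints, float)
    b0 = np.asarray(basic_anchor0, float)
    b1 = np.asarray(basic_anchor1, float)
    m0 = np.asarray(measured_anchor0, float)
    m1 = np.asarray(measured_anchor1, float)
    scale = np.linalg.norm(m1 - m0) / (np.linalg.norm(b1 - b0) + 1e-9)
    return m0 + (wp - b0) * scale
```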
6.3. Dressing-failure Detection and Recovery
By combining clothing state estimation with waypoint-based trajectory generation, we can detect failures and recover proper actions. In practical dressing actions, the end-effectors track waypoints
Because a failed action also means that clothing must be disentangled, the failure can be resolved by repeating a past dressing action. This recovery process was validated in experiments using a real robot.
Aside from the vision function, we also use the force sensors mounted on the robot's wrists. If a force sensor detects a force exceeding a predefined threshold during dressing, the action is regarded as a failure. However, force sensing plays only a secondary role, because a force at this level indicates that the subject may already be experiencing pain from the pulled clothing. Ideally, dressing failures should be detected visually before the force reaches this level; therefore, vision-based failure detection plays the primary role.
7. Experimental Assessment of Vision Function
7.1. Settings
A mannequin was seated on a horizontal board 700 mm high. Image streams were captured by a camera placed 750 mm high and 1500 mm from the mannequin. Each image stream was VGA-sized (640 × 480 pixels) and captured at 30 fps. Under this set-up, a person dressed the mannequin in a pair of trousers.
Before the state-matching experiments, we captured six different image streams of successful dressing. To represent state transitions between successful dressing actions, each image stream was divided into four phases named No. 1 to No. 4 (the starting frames of the four phases are shown in Figure 11). Seven image streams of failed dressing states were also captured, and named No. 5 to No. 11. The database was constructed from feature descriptions based on the optical flows calculated from these 11 image streams.

Four phases of dressing the bottom-half of a subject
The purpose of this experiment was to evaluate whether or not the robot completed the dressing procedure. If the dressing succeeds, the state matching will proceed sequentially through phases 1–4. Otherwise, the type of failure is identified from phases 5–11 in the database.
7.2. Success-failure Classification and the Estimation of Failure Type
Figure 12 shows an example of state matching. The current state is consistently matched to the flows in the database (Figure 3 (6)). The main image corresponds to the current optical flow; the inset on the lower right corresponds to the matched features in the database. These images show a failure state in which both legs enter the same hole while the trousers were being pulled up.

Current state estimation
The classification performance was investigated in a second experiment. The results are summarized in Table 1. The test data in this experiment were 19 series of image streams. The second and third columns of Table 1 show the success-failure patterns of the input image streams and the matching results, respectively. The fourth column (headed Avg. score) shows the average similarity value of the matching results. The matching failed in two out of six successful trials, whereas all of the failed cases were correctly estimated as failures. However, the failure type was misestimated in four out of the 13 failed trials. Therefore, the overall success rate was 73 %.
Estimate state and calculate scores
A third experiment investigated the effectiveness of the three feature descriptions. Table 2 summarizes the results of several feature combinations. Each combination was tried four times. The ranges in the cells are the ranges of success rates. The success rate was much lower when combining only two feature types than when combining three features. This indicates that the three feature descriptions proposed in this paper have different expressive powers, and that all are needed for estimating the dressing behaviour.
Compare to trial without one feature
8. Experimental Results of Dressing Assistance
8.1. Settings
As in the vision-function evaluation, the subject (a human or a mannequin) was seated on a horizontal board 700 mm high. Dressing was performed by a life-sized humanoid robot named HRP2-JSK [14], equipped with seven degrees of freedom (DoFs) in each arm, 2 DoFs in the torso and 7 DoFs in each leg. The clothing state was measured by a three-dimensional range camera, Xtion PRO LIVE, mounted on the head of the robot. Colour images and VGA-sized (640 × 480 pixels) depth images were captured at 30 fps. Under this set-up, the robot dressed the subject in a pair of trousers. A separate desktop computer connected to the robot performed all of the recognition and motion-generation processes; it had an 8-core CPU running at 2.4 GHz. Under these conditions, one dressing experiment took about 240 seconds. One state-estimation process, consisting of a dynamic graph cut, optical-flow detection, feature description and similarity calculation, took two to three seconds; the most time-consuming part was the optical-flow detection.
Figure 13 shows the three pairs of trousers used in these experiments. Item (A) is constructed from stretchable fabric, which exerts a high degree of inward friction during dressing. Item (B) is composed of inelastic fabric with a high degree of inward friction, and item (C) is constructed from stretchable fabric with a low degree of inward friction.

Three types of trousers used in dressing experiments
8.2. Results of Dressing the Bottom-half of a Subject Using a Life-sized Robot
Figures 14 and 15 show images from one dressing experiment using the life-sized humanoid robot. As explained in Section 3, the dressing procedure was divided into four phases. Figure 14 shows, in sequence, images captured by the Xtion sensor together with the state-estimation results. To the right of each image, the dressing phases are ordered in a temporal sequence; the green-filled blue rectangles indicate successful present phases, and the remaining instances indicate that the dressing has failed. During the experiment, the dressing procedure was temporarily classified as a failure because a trouser leg became entangled with the left toes (Figure 14, (5)). However, the dressing motion was retried based on the estimation result, and continued until the trousers were pulled up over the knees.

Online experiment

A dressing experiment using a life-sized humanoid robot. The procedure was: (i) insert both feet into the legs of trousers (Figures (1) to (4)), (ii) pull up the trousers until they are over the knees (Figures (4) to (6)), (iii) pull up a hanging-down hem on one side of the trousers (Figures (7) and (8)), (iv) pull up the other side of the hem (Figures (9) and (10)), (v) pull the trousers up over the hips (Figures (10) to (12)).
Dozens of similar experiments were performed with various pairs of trousers. The results are summarized in Table 3. Although the legs frequently became snagged in parts of the cloth, our sensory functions detected failure in almost all cases. After detection, the failure states were resolved by the recovery action described in Subsection 6.3. In these experiments, “success” was achieved when the robot pulled the trousers over the subject's knees. The success rate of 30 trials was 83 %.
The success rate of dressing the bottom-half of a subject
As shown in Table 3, the number of successes during the dressing procedure was rather low for item (A), which was constructed from highly stretchable fabric exerting a high degree of friction.
Because the optical flow was stably extracted from the image streams, recognition performance did not differ much between the trousers. The recovery action, however, was impeded by the combination of elasticity and a high degree of inward friction: despite many recovery motions, the robot could not proceed with the dressing procedure.
9. Conclusions
In this paper, we proposed methods for dressing a person using an autonomous robot. We focused on the actions by which the robot pulls a pair of trousers along the subject's legs; these actions are frequently demanded by people requiring dressing assistance and are potentially automatable. To avoid injuring the subject's legs during dressing, the robot should be equipped with recognition functions enabling it to determine the state of the manipulated clothing. We determined that dressing failures were best detected by vision sensing, and could be predicted from the behaviour of optical flows extracted from image streams. The recognition function was designed to estimate dressing success or failure; if a failure occurred, its type was also specified. In dressing experiments performed by a person, we verified that the success rate of the method was 73 %.
To demonstrate the applicability of the method, we implemented the dressing procedure using a life-sized humanoid robot. Estimating the shape of the legs from images captured by a three-dimensional range camera, we proposed a method of modifying the trajectory from the basic trajectory estimated from statistical human-body data. When programmed with the proposed methods, the robot performed the dressing of human-like subjects with trousers with a success rate of 83 %.
In future work, we will refine our method to improve the success rate. For instance, the feature descriptions using optical flow could be combined with 3D information. The state-estimation model might also be improved from our deterministic model into a probabilistic one, e.g., a hidden Markov model.
More constructive use of force sensors could make failure detection more sensitive. For instance, feature descriptions combining optical flow with force data should be studied. Because significant relationships between force data and optical-flow data are likely, the combination would improve the success rate and accuracy of failure detection and recovery. From another viewpoint, force data could be used more actively. Because the robot manipulates a highly deformable object, it is difficult to predict force data during clothing manipulation. However, the level of force might be estimated from the results of image-based recognition; since we defined the dressing phases using optical-flow information, this might help the estimation of force-sensor data. Reliable estimates of this kind would enable us to generate an end-effector trajectory using a feedback control scheme.
Based on these improvements, the effectiveness of our approach must also be verified further in other dressing tasks.
10. Acknowledgements
This work was partly supported by the JST PRESTO programme and JSPS KAKENHI Grant Number 26700024.
