Abstract
In this article, a novel, efficient grasp synthesis method is introduced that can be used for closed-loop robotic grasping. Using only a single monocular camera, the proposed approach can detect contour information from an image in real time and then determine the precise position of an object to be grasped by matching its contour with a given template. This approach is much lighter than the currently prevailing methods, especially vision-based deep-learning techniques, in that it requires no prior training. Using state-of-the-art techniques for edge detection, superpixel segmentation, and shape matching, our visual servoing method does not rely on accurate camera calibration or position control and is able to adapt to dynamic environments. Experiments show that the approach provides high levels of compliance, performance, and robustness in diverse experimental environments.
Introduction
An aging population and rising labor costs are acute challenges facing society, resulting in a high demand for indoor service robots. Service robots working in indoor environments, such as homes or offices, often need to handle a variety of grasping tasks that require the ability to recognize the target object in a complex or dynamic background environment. 1,2 Due to uncertain factors such as illumination, occlusion, and object posture, as well as the need for a real-time response and the choice of proper gripping positions, it is difficult to design a lightweight recognition algorithm whose target objects can be redefined whenever necessary.
Research on robotic grasping has resulted in many different grasping methods. 3–6 Recently, deep-learning techniques have become the preferred approaches in the field of grasp synthesis. 7,8 These methods use various versions of convolutional neural networks (CNNs) to identify the objects to be grasped, 9,10 which means they demand a large amount of data and time for training and testing, as well as an expensive hardware environment. Moreover, the resulting models often suffer from overfitting and lack reasonable generalization ability and interpretability. Therefore, methods based on deep learning are difficult to apply to indoor robotic grasp tasks with variable target objects, viewing angles, and dynamic environments.
In this article, a novel, fast, and lightweight method is proposed for robotic object recognition and grasping tasks. The method can extract the contour information of objects contained in an image using edge detection and superpixel segmentation techniques and calculate the similarity between the two contours with a shape descriptor technique to complete the object recognition. Then, using the relative distance between the object centroid and the gripper, the algorithm guides the robot to move the gripper to the object and form a proper grabbing posture to complete the grasping task.
When compared with the prevailing deep-learning methods, our approach has the following advantages. First, it can flexibly adapt to the variable positions and postures of the target objects and to changes in the environment, because the recognition method is based on the shape information of the objects, which is a stable, long-lasting, and essential feature. Second, since the object is identified by shape features, this method does not require a large number of training samples, which means that it saves cumbersome manual labeling work and greatly lowers the requirements for computing hardware. Third, this method combines the object recognition module with the robot control module to form a hand-eye coordination mechanism with feedback, that is, a closed-loop control process. This means it is not necessary to calculate exact absolute coordinates, only the relative positional offset between the object and the gripper, which greatly simplifies the conversion between multiple coordinate systems. It also improves the robot’s adaptability to the environment and its response speed. Fourth, this method is highly interpretable. The human visual system recognizes objects mainly based on contour information, 11,12 which means our method is consistent with findings from cognitive research.
Related work
Robotic grasping is a widely studied topic. Generally, these techniques can be grouped into two categories: empirical methods and analytic methods. Analytic methods 3,13 use mathematical and physical models of geometry, kinematics, and dynamics to calculate stable grasping strategies. However, such methods are not easily applied to real-world scenarios, since it is difficult to model the physical interaction between the gripper and the object. Empirical methods 14–16 mimic human grasping strategies and avoid the computation of physical or mathematical models. These techniques associate appropriate grasp points with a database storing object models or shape information, based on object type definitions.
Recently, techniques based on deep learning have become popular. 8,10,17–19 The strategies are similar: a certain number of grasp candidates are extracted from the image or point cloud, the algorithm ranks them with a CNN, and the candidate with the highest score is taken as the grasp to execute. Once the object is identified, the robot performs an open-loop grasp, which requires precise calibration between the camera and the gripper, as well as a completely static grasping environment. These methods require a large number of labeled samples for training and testing, which not only imposes high requirements on the hardware environment but also costs considerable manpower and time, and the resulting models usually lack reasonable generalization ability. Moreover, CNNs often contain millions of parameters and rank grasp candidates with a sliding window at discrete intervals of offset and rotation, which results in processing times of up to tens of seconds. Finally, deep-learning methods often achieve only coarse positioning with bounding boxes, which is not enough for precise grasping tasks.
The approach proposed in this article identifies an object based on the shape information, can quickly complete the pixel-level recognition tasks, and does not require a pretraining process or expensive hardware environment. In addition, this method is highly adaptable to changes in the environment as well as the position and posture of the object because of the shape-based recognition algorithm. Instead of bounding boxes, the recognition result is the pixel-level outline of the object, which is more conducive to the following grasping tasks.
Shape-based object detection with background interference
Shape-based object representation method
In earlier work by one of the authors, multiscale triangular centroid distance (MTCD) descriptors were proposed to represent shapes. 20–25 MTCD descriptors are robust to translation, scaling, rotation, and deformation. In addition, it is convenient and quick to calculate the difference between shapes represented by MTCD descriptors, so we use this method here to calculate the similarity between two contours.
Given a shape S, let the sequence P = {P1, P2, …, PN} denote the N points obtained by equidistant sampling along the contour of S, as illustrated in the figure below.
(a) The original image of an elephant model; (b) the contour shape of the model; and (c) the set of points obtained by equidistant sampling of the contour.
Given a certain point Pi and a scale t (t = 1, 2, …, T), a triangle is formed by Pi and its two neighboring sample points determined by the scale, and the distance from the centroid of this triangle to the centroid of the shape is taken as the feature value di(t) of Pi at scale t.
Acquisition of centroid point
In this way, T triangles can be obtained for each sample point Pi, where T represents the number of scales, which is set to a fixed value for all shapes.
Then, we can obtain a column vector vi = (di(1), di(2), …, di(T))ᵀ that collects the feature values of Pi over all T scales.
Thus, given a certain shape S, we can obtain a T×N matrix M = (v1, v2, …, vN) by arranging the column vectors of all N sample points.
It is easy to prove from its definition that M has intrinsic invariance to translation of the contour of S. In addition, we normalize each row of M by dividing its elements by the maximal absolute value of that row, which removes the influence of scale:

M(t, i) ← M(t, i) / max_j |M(t, j)|, t = 1, …, T; i = 1, …, N.
Next, to obtain invariance to the starting point of our shape descriptor, a Fourier transform is applied to each row of M and the phase information is discarded. For ease of explanation, let rt denote a row of M. Then, the discrete Fourier transform of rt can be calculated as

Rt(k) = Σ_{n=0}^{N−1} rt(n) e^{−j2πkn/N}, k = 0, 1, …, N−1.
It is not difficult to prove that the magnitudes |Rt(k)| are invariant to the choice of the starting point: shifting the starting point corresponds to a circular shift of rt, which changes only the phase of its Fourier coefficients.
Here, we set N and T to fixed values for all shapes, so that the descriptors of different shapes have the same dimensions and can be compared directly.
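To make the construction above concrete, the following sketch computes an MTCD-style descriptor with NumPy. It assumes that the triangle at scale t for point Pi is formed with the sample points t positions away along the contour and that the feature is the distance from the triangle centroid to the shape centroid; the neighbor spacing and the parameter values used in the original MTCD work may differ.

```python
import numpy as np

def mtcd_descriptor(points, T=8):
    """Compute an MTCD-style descriptor for a sampled contour.

    points : (N, 2) array of contour points sampled equidistantly.
    T      : number of scales (one triangle per scale for each point).
    Returns a (T, N) matrix of Fourier magnitudes (phase discarded).
    """
    points = np.asarray(points, dtype=float)
    shape_centroid = points.mean(axis=0)

    N = len(points)
    M = np.zeros((T, N))
    for t in range(1, T + 1):                      # scale index
        left = np.roll(points, t, axis=0)          # P_{i-t}
        right = np.roll(points, -t, axis=0)        # P_{i+t}
        tri_centroid = (left + points + right) / 3.0
        # distance from each triangle centroid to the shape centroid
        M[t - 1] = np.linalg.norm(tri_centroid - shape_centroid, axis=1)

    # normalize each row by its maximal absolute value (scale invariance)
    M /= np.abs(M).max(axis=1, keepdims=True)

    # per-row DFT; discarding the phase gives starting-point invariance
    return np.abs(np.fft.fft(M, axis=1))
```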

(a to c) The characteristics of line 1, line 80, and line 150 of the shape descriptor.

Structured random forest is used for edge detection. The left side shows the original image and the right side shows the edge detection result.
Given two shapes S1 and S2, whose shape descriptors are M1 and M2, respectively, the dissimilarity between them is measured by the distance between the two descriptor matrices. The smaller the dissimilarity, the more similar the two shapes are.
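Given two such descriptors, a dissimilarity score can be computed directly from their difference. The exact distance used in the cited MTCD work is not reproduced here; the sketch below uses the Frobenius norm, which is one common choice.

```python
import numpy as np

def shape_dissimilarity(desc1, desc2):
    """Dissimilarity between two T x N descriptor matrices (smaller = more similar)."""
    return np.linalg.norm(desc1 - desc2)   # Frobenius norm of the difference
```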
Elimination of redundant information in images
To obtain the line information in the image, we referred to Dollár and Zitnick’s work 26 on edge detection and the work of Achanta et al. 27 on superpixel segmentation. In the former study, 26 the structured forest technique achieved good edge detection results.
However, since real images often contain a lot of noise, the results of edge detection involve much redundant information for object recognition tasks. On the other hand, superpixel segmentation, which groups pixels into perceptually meaningful atomic regions, can effectively eliminate the effects of noise in the image.
The superpixel segmentation algorithm we apply is easy to understand and requires only one user-provided parameter, k, the desired number of superpixels.
Given a color image in the Commission Internationale de l'Eclairage Lab (CIELAB) color space, the algorithm first divides the N pixels into a regular grid whose interval is s = √(N/k) and places an initial cluster center at the center of each grid cell.
Next, we traverse all the centers and associate each pixel with the nearest cluster center whose search region overlaps its position. The algorithm searches a limited region, the size of which is set to 2s × 2s around each cluster center, which greatly reduces the number of distance calculations.
Then, we normalize the spatial proximity and the color proximity by their respective maximum distances within a cluster, that is, NS and NC, to combine dc and ds into a single measure of distance between a pixel and a cluster center. Let D denote this measure, whose definition is as follows:

D = √((dc/NC)² + (ds/NS)²)

where dc is the color distance in the CIELAB space, ds is the spatial (Euclidean) distance in the image plane, NS is set to the grid interval s, and NC is a constant that weights color proximity against spatial proximity.
The cluster centers are then adjusted to the mean [l, a, b, x, y] vector of all the pixels assigned to them, and the assignment and update steps are repeated until the centers converge.
Finally, we traverse all the pixels and assign the disjoint pixels to their closest superpixels to enforce connectivity. The segmentation result is shown in Figure 5.

The left side shows the original image and the right side shows the result of superpixel segmentation.
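The clustering procedure described above corresponds to the SLIC algorithm, so one way to reproduce this step is to use the off-the-shelf implementation in scikit-image, as in the minimal sketch below; the file name, the number of superpixels, and the compactness value are illustrative assumptions rather than the settings used in our experiments.

```python
from skimage import io
from skimage.segmentation import slic, find_boundaries

image = io.imread("scene.png")[..., :3]    # hypothetical input image (RGB)
labels = slic(image,
              n_segments=300,              # k: desired number of superpixels
              compactness=10,              # weights spatial vs. color proximity
              start_label=0)               # slic converts RGB to CIELAB internally

# Boundary map of the superpixel blocks (True on pixels separating regions)
superpixel_boundaries = find_boundaries(labels, mode="thick")
```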
However, superpixel segmentation also results in a considerable amount of edge information for the superpixel blocks, which is also redundant information for object recognition. The approach proposed in this article combines the results of edge detection, shown in Figure 4, and superpixel segmentation to extract the true contour information of the objects contained in the image.
Figure 6 shows this algorithm in action. Assume that the input image is I, whose size is W × H pixels. The results of edge detection and superpixel segmentation of I are combined to obtain the contour image IC.

(a) The original image; (b) the result of superpixel segmentation; (c) the result of edge detection; and (d) the result of combination of (b) and (c).
Next, we refine the contour lines in IC, that is, we thin the lines to a width of one pixel and then treat the branch points (pixels with more than two adjacent contour pixels) as line end points. This splits IC into the contour line information C, a set of lines, where each line is a set of pixel coordinates.
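A rough sketch of this refinement step is given below. It assumes that the edge map and the superpixel boundary map are fused by simple element-wise agreement (the exact fusion rule is not spelled out here), thins the result to one-pixel-wide lines by skeletonization, and removes branch points so that the remaining connected components form the line set C.

```python
import numpy as np
from scipy.ndimage import convolve, label
from skimage.morphology import skeletonize

def extract_line_segments(edge_map, superpixel_boundaries):
    """Combine edge and superpixel evidence and split the result into line segments.

    edge_map, superpixel_boundaries : boolean images of the same size.
    Returns a list of lines, each a set of (row, col) pixel coordinates.
    """
    # Keep only contour pixels supported by both cues (illustrative fusion rule).
    contours = np.logical_and(edge_map, superpixel_boundaries)

    # Thin the contours to one-pixel-wide lines.
    skeleton = skeletonize(contours)

    # Count the 8-connected neighbours of every skeleton pixel.
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    neighbours = convolve(skeleton.astype(int), kernel, mode="constant")

    # Branch points (more than two neighbours) act as end points: removing
    # them splits the skeleton into simple line segments.
    branch_points = skeleton & (neighbours > 2)
    segments_img = skeleton & ~branch_points

    labelled, n = label(segments_img, structure=np.ones((3, 3)))
    return [set(zip(*np.where(labelled == i))) for i in range(1, n + 1)]
```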
Extraction of contour template
To obtain the shape contour of a given object, we photograph the object against a relatively monotonous background and remove the redundant information from the captured image using the method above. As shown in Figure 8, images of animal models with a white background captured by a camera are processed by edge detection and superpixel segmentation, respectively. Then, the remaining pixels are filtered and the lines are refined, so that the shape contours of the models are obtained. The shape contour of a given object obtained in this way can be used as the contour template in the following process, as shown in Figure 7.

Line set

Animal models in a relatively monotonous background are photographed, and the images captured are processed with edge detection and superpixel segmentation, and then the remaining pixels are filtered and lines are refined. The left side of each row shows the original images and the right side of each row shows the obtained contour templates.
Line segments combination based on heuristic search
Heuristic search, also known as informed search, reduces the search scope and complexity of the problem to be solved by referring to heuristic information. The objective of heuristic search is to produce a solution in a reasonable time frame that is good enough for solving the problem. Heuristic search avoids combinatorial explosion by guiding the search in the most promising direction using heuristic information; the stronger the heuristic information, the fewer the search branches. The function used to evaluate the importance of search nodes is called the evaluation function, which generally takes the form

f(n) = g(n) + h(n)

where g(n) is the actual cost from the initial node to node n and h(n) is the heuristic estimate of the cost from node n to the goal node.
Given an image, after the preprocessing described above for eliminating redundant information, we can turn the image into a set of line segments that contains the contour information. If we traversed all combinations of the line segments in the set, the combination most similar to the shape contour of a given object could always be found, but this is obviously very time-consuming. By using the shape descriptor and shape dissimilarity introduced above, we can guide the search path to avoid unnecessary search nodes and thus greatly improve search efficiency.
Given the contour template M of the object to be grasped, where M is a binary map in which pixels with a value of 1 belong to the contour, the algorithm first looks for a seed line Cs in C as the starting state of the heuristic search, as shown in Figure 9. Cs should have a certain length, because short lines correspond to very few sampling points and therefore match too many similar parts of the template. Cs should also have a certain degree of curvature, since images of real scenes always contain many line segments that tend to be mismatched with the line segments of the template. With these restrictions, the search domain for Cs is greatly reduced, and Cs, the line most similar to some part of the object to be grasped, is found by exhaustive search over the remaining line set.

The process of matching an object to be grasped is shown in order from left to right and from top to bottom. (a) Search for the seed line. (b) to (e) Process of each iteration of the search. The candidates are marked in white, and the most similar line is marked in yellow. (f) Matching result.
After Cs is found, the following searches consider only the lines that may extend the current combination: in each iteration, every candidate line is tentatively added to the current combination, the dissimilarity between the combined shape and the template is evaluated with the shape descriptor, and the most similar candidate is merged into the result (Figure 9(b) to (e)); the process stops when no candidate further improves the similarity, yielding the matching result (Figure 9(f)).
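The iterative combination step can be sketched as a greedy search guided by the shape dissimilarity, as below. The helper dissimilarity_to_template, the candidate set, and the stopping rule are simplified assumptions used for illustration rather than the exact criteria of our implementation.

```python
def match_contour(lines, seed, dissimilarity_to_template, max_iters=20):
    """Greedy heuristic combination of line segments.

    lines                     : list of candidate line segments (sets of pixels).
    seed                      : the seed line Cs found by exhaustive search.
    dissimilarity_to_template : hypothetical helper that samples a combination of
                                lines, computes its shape descriptor, and returns
                                the dissimilarity to the contour template.
    """
    combination = [seed]
    remaining = [l for l in lines if l is not seed]
    best = dissimilarity_to_template(combination)

    for _ in range(max_iters):
        # Evaluate every remaining candidate added to the current combination.
        scored = [(dissimilarity_to_template(combination + [l]), l) for l in remaining]
        if not scored:
            break
        score, line = min(scored, key=lambda s: s[0])
        if score >= best:            # no candidate improves the match: stop
            break
        best = score
        combination.append(line)     # merge the most similar line (yellow in Fig. 9)
        remaining.remove(line)

    return combination, best
```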

The first row of each column shows the contour template; the second row of each column shows the original image captured by the camera; the third row of each column shows the line segments extracted from the image; the fourth row of each column shows the seed line found by the algorithm; and the fifth row of each column shows the final recognition result.
Finally, the centroid P of Cs is calculated to guide the robot to execute the grasping task. Suppose Cs = {(x1, y1), (x2, y2), …, (xn, yn)}; then P = ((1/n) Σ xi, (1/n) Σ yi).
Determination of the grasp position with contour detection
After the recognition task is completed, the shape contour Cs of the object to be grasped is available and can be used to determine the grasp position.
Since the gripper opens only to a certain extent, an appropriate gripping position is needed to guide the robot to rotate the sixth joint into a proper grip posture before executing the grasp. We took into consideration the general size of the gripper mounted on the robotic arm and the irregularity of the shape of the object to be grasped. The proposed approach uses the relatively narrow, concave portion of the object outline as the grasping position, so that it can handle situations such as a gripper that is too small to span a relatively large object, and so that the grasping state is as firm as possible.
The result of the recognition method discussed in this article is a precise outline of the object contour, so a gripping position that meets the requirements above can be obtained in a simple way. First, draw a straight line l passing through the centroid C of the object outline and measure the width of the contour as the Euclidean distance d between the intersections p1 and p2 of l and the contour Cs. Then rotate l by a fixed angle interval θ at each step, obtaining a set of intersection points and the corresponding distance value for each orientation. After a rotation of 180°, the pair of intersection points with the smallest distance value is taken as the final grasping position.
As shown in Figure 11, the normal line between the two clips of the gripper is initially collinear with the horizontal axis. After the appropriate grasping position is calculated, it is only necessary to control the gripper to rotate θ degrees counterclockwise to form the optimal grasping posture.
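The width scan described above can be sketched as follows. The function intersects the contour with a line through its centroid at successive orientations and returns the orientation with the smallest width; treating the contour pixels closest to the line on each side as the intersection points p1 and p2 is an approximation introduced for this sketch.

```python
import numpy as np

def grasp_orientation(contour, angle_step_deg=5, tol=1.5):
    """Find the narrowest chord of the contour passing through its centroid.

    contour : (n, 2) array of (x, y) contour pixel coordinates.
    Returns (best_angle_deg, p1, p2): the chord orientation with the smallest
    width and the two intersection points, used as the grasping position.
    """
    contour = np.asarray(contour, dtype=float)
    c = contour.mean(axis=0)                       # centroid C of the outline
    rel = contour - c

    best = (None, None, None, np.inf)
    for angle in np.arange(0.0, 180.0, angle_step_deg):
        d = np.array([np.cos(np.radians(angle)), np.sin(np.radians(angle))])
        along = rel @ d                            # signed position along the line l
        across = rel @ np.array([-d[1], d[0]])     # distance from the line l
        on_line = np.abs(across) < tol             # contour pixels crossed by l
        pos, neg = on_line & (along > 0), on_line & (along < 0)
        if not (np.any(pos) and np.any(neg)):
            continue
        # nearest crossing on each side of the centroid approximates p1 and p2
        p1 = contour[pos][np.argmin(along[pos])]
        p2 = contour[neg][np.argmax(along[neg])]
        width = np.linalg.norm(p1 - p2)
        if width < best[3]:
            best = (angle, p1, p2, width)

    return best[0], best[1], best[2]
```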

The straight line l passing through the centroid of the object outline and the intersection points used to determine the grasping position.
Guiding the robot to approach the object to be grasped using the recognition results
Interaction between the recognition module and the robot control module
As shown in Figure 12, the computer is connected to the control cabinet of the robotic arm through a twisted pair. The world coordinates (with the origin at the center of the base of the robotic arm) that the gripper is to be moved to are transferred to the operating system of the robotic arm using the transmission control protocol (TCP). The coordinate data are then transformed into the rotation angles of each of the six joints.
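Communication with the control cabinet can be sketched with a plain TCP socket as below; the controller address and the textual command format are purely hypothetical placeholders, since the actual command protocol of the SD700E controller is not described in this article.

```python
import socket

# Hypothetical controller address and command format: replace with the real
# protocol of the robot controller being used.
CONTROLLER_ADDR = ("192.168.0.10", 9000)

def send_target_position(x, y, z, rz):
    """Send a target gripper pose, expressed in the robot base frame, over TCP."""
    command = f"MOVE {x:.1f} {y:.1f} {z:.1f} {rz:.1f}\n".encode("ascii")
    with socket.create_connection(CONTROLLER_ADDR, timeout=2.0) as conn:
        conn.sendall(command)
        reply = conn.recv(64)          # wait for a (hypothetical) acknowledgement
    return reply
```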

(a) Control cabinet in the blue wireframe and robotic arm in the red wireframe; the computer running the program transmits data through a twisted pair to control cabinet using TCP protocol. (b) The establishment of coordinate system using the base center of the robotic arm as the origin. TCP: transmission control protocol.
As shown in Figure 13, the images captured by the camera are transmitted to the computer, which takes the images and the appropriate template image as parameters and invokes the recognition module to execute the recognition task. Next, the relative position of the object centroid with respect to the center of the camera field of view is calculated. If the centroid is in the central area of the field of view, the control module moves the gripper down to grasp the object. Otherwise, it moves the gripper toward the centroid of the object according to the relative position and captures a new image for the next iteration.
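The resulting hand-eye feedback loop can be summarized by the following sketch. The functions capture_image, recognize, move_gripper_by, and grasp are hypothetical stand-ins for the camera driver, the recognition module, and the robot control module described above, and the pixel-to-millimeter gain and centering tolerance are illustrative values.

```python
def closed_loop_grasp(template, capture_image, recognize, move_gripper_by, grasp,
                      gain_mm_per_px=0.5, center_tol_px=10, max_steps=30):
    """Visual-servoing loop: repeatedly re-recognize the object and reduce the
    offset between the object centroid and the camera center, then grasp."""
    for _ in range(max_steps):
        image = capture_image()
        centroid = recognize(image, template)     # pixel coords of the object centroid
        if centroid is None:
            continue                              # object not found; try a new image
        h, w = image.shape[:2]
        dx, dy = centroid[0] - w / 2, centroid[1] - h / 2
        if abs(dx) < center_tol_px and abs(dy) < center_tol_px:
            return grasp()                        # centroid is centered: move down and grasp
        # Only the relative offset is needed; no absolute calibration is required.
        move_gripper_by(dx * gain_mm_per_px, dy * gain_mm_per_px)
    raise RuntimeError("object could not be centered within the step limit")
```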

Interaction between the recognition module and the robot control module.
Object recognition flowchart
The flowchart in Figure 14 corresponds to Algorithms 1 and 2.
Algorithm 1: Contour segment generation.
Algorithm 2: Search and combination of contour segments.

Sketch map of the object recognition module.
Robot control module flow chart
As shown in Figure 15, the robot control module accesses the memory location storing the object recognition result at a certain frequency and reads it out if it has been updated, where the recognition result refers to the relative position of the object centroid with respect to the center of the camera field of view. If the centroid is in the central area, the gripper is moved down for grasping; otherwise, it is moved closer to the centroid of the object to be grasped.

Sketch map of the robot control module.
Experiments
Our environment
Figure 16 shows our experimental environment, which consists of a computer connected to the robot control cabinet and an SD700E industrial robot arm (yellow) with six degrees of freedom, ±0.03 mm repeatability, and a 700 mm working radius. An EFG20 electric gripper (silver white) was mounted on the end flange of the arm, and a simple camera (Logitech C310) was mounted above the gripper. The total cost of the hardware setup was less than US$15,000.

Our experiment hardware set.
The surface of the workbench was set up to be complicated and messy on purpose. The state of the table was changed during the experiments, such as when we disturbed the relative order of the objects and changed their postures and positions, to highlight the robustness of our approach.
Experiment result
Figures 17 to 19 show the application of our method in a real environment. Each row in Figure 17 shows the relative positions of the gripper and the object to be grasped, that is, a rhino model, before and after the recognition process. To demonstrate the robustness of our method, each time the gripper was moved above the object according to the recognition result, the object was moved to a different position and posture and the surrounding objects were rearranged, which triggered another round of the recognition process. Once the state of the object was left unchanged after the gripper had been moved above it, the gripper was moved down to execute the grasp, as shown in the last row.

The left side of each row shows the initial position of the gripper and the object to be grasped, that is, a rhino model; the right side shows the position the gripper was moved to once the object was recognized. As the first four rows show, the object was moved to a different position and placed in a different posture once the gripper was moved above it, and the state of the surrounding objects was also changed (as shown in the third row). The last row shows that, when we did not change the state of the table after the gripper was moved above the object, the gripper was moved down to execute the grasp task.

The left side of each row shows the captured image; the middle shows the contour information extracted from the image; and the right side shows the recognition result. The two ends of the red line point to the centers of the object to be grasped and the camera mounted on the gripper, respectively.

Once the precise shape contour of the object to be grasped is obtained, the appropriate grasping points are calculated to guide the robotic arm to execute the grasp task. (a) The initial state of the robotic arm; (b) the gripper was moved above the object according to the recognition result; (c) the gripper was moved down to the object; and (d) the gripper was rotated to form an appropriate grasp posture according to the grasp points. It would be difficult to form such a posture without a pixel-level shape contour.
Figure 18 shows the details of the recognition process corresponding to the different positions of the gripper displayed in Figure 17. From left to right, each row displays the image captured by the camera mounted on the gripper, the line segments representing the contour information extracted from the image, and the recognition result, respectively. The red line on the right side of each row, connecting the centers of the camera view and the object to be grasped, shows the relative position between the gripper and the object, which is used to guide the robotic arm to move the gripper above the object.
After the precise contour of the object to be grasped is obtained, the appropriate grasp point can be calculated in a simple way to guide the robot to perform grasp tasks, as is shown in Figure 19.
We are unable to include enough pictures to show the entire recognition and grasp process owing to space constraints. However, a video clip is provided to show the whole experiment process.
Discussion
Although object recognition methods based on deep learning are outstanding for classification tasks, they can only generate bounding boxes containing the object to be grasped when guiding a robot to perform grasp tasks. In addition, deep-learning methods require a lot of training and test data and computing time, as well as an expensive hardware environment. Therefore, these techniques are not well suited to robot grasp tasks, especially those in which the target objects are defined on the fly.
The object recognition module required for robot grasp tasks should be sufficiently lightweight and fast, while still being able to handle a noisy environment, because the environment affects the motion planning of the robot. The recognition method proposed in this article starts from the shape of the object and extracts the contour information from the original image, which is stored in the form of lines. In addition, the heuristic search strategy greatly reduces the search domain, so that the object can be effectively identified in a cluttered environment, which makes our method fast and robust enough to be suitable for various robot applications.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by NSFC under Project 61771146 and Project 61375122.
Supplemental material
Supplemental material for this article is available online.
References
