Abstract
Grasping has always been a great challenge for robots due to its lack of the ability to well understand the perceived sensing data. In this work, we propose an end-to-end deep vision network model to predict possible good grasps from real-world images in real time. In order to accelerate the speed of the grasp detection, reference rectangles are designed to suggest potential grasp locations and then refined to indicate robotic grasps in the image. With the proposed model, the graspable scores for each location in the image and the corresponding predicted grasp rectangles can be obtained in real time at a rate of 80 frames per second on a graphic processing unit. The model is evaluated on a real robot-collected data set and different reference rectangle settings are compared to yield the best detection performance. The experimental results demonstrate that the proposed approach can assist the robot to learn the graspable part of the object from the image in a fast manner.
Introduction
A growing attention has been put on the autonomous robotic grasping because it is the most fundamental action for many manipulation tasks. Humans demonstrate their grasping behavior intuitively. By just taking a glance at the object, they can automatically localize the object and find proper locations on the object to grasp it. However, it remains quite a challenge for robots. One of the core problems is that the robot lacks the ability to well understand the perceived sensing information. Given the same visual information as humans, the robot has no idea of where its gripper should be placed in order to grasp the object. Therefore, it is highly expected that the robot can learn to find a proper grasp configuration to grasp an object with the visual information. A common pipeline that is used for detecting good grasps from an image composes the following two components: (1) extracting features from the image: the features should be representative and informative enough to represent the grasping behavior and (2) training a classifier and identifying the good grasp locations in the image: a crucial problem in this stage is how to find potential graspable locations in a fastest way so that the robot is able to grasp an object in real time. Many approaches suffer from low efficiency of the detection process.
For a long time, the handcraft features are used to reveal the potential grasp relevant information from the image. However, there isn’t a unified framework for the design of these features. Different features are delicately engineered according to different situations. Some common local appearance features such as color, texture, and edges are often used to distinguish a graspable area from an image. 1,2 In the research by Bohg and Kragic, 3 a shape context feature is proposed, which describes the global shape of the object. Considering the situation where a pair of objects may appear in the same image, the distance and depth variation features are useful to identify the right grasp point in the image. 4 Although these handcraft features can reveal some clues from the image to facilitate the robotic grasping, they are highly dependent on expertise knowledge and different grasping hypotheses. Also, because only partial information of the image is used to design the features, some significant information may be neglected, which is very likely to result in the failure grasp detection. An alternative approach to extracting features from the image is by learning a dictionary. Normally, an overcomplete dictionary is learned from the raw input data and the learned atoms in the dictionary are supposed to capture some latent information of the raw image. Thereafter, any observation can be coded as a linear combination of only a few atoms in the dictionary. In the work by Trottier, 5 several dictionary learning and sparse coding approaches have been analyzed in the robotic grasp recognition and detection tasks. Recently, in the computer vision community, the deep learning technique has achieved very promising performances in the object recognition and detection tasks. The input of the deep network can be the entire image, which avoids missing any useful information. Therefore, it is believed that more profound features can be learned from the deep network than handcraft features. Several works 6 –8 have already resorted to the deep learning method to assist the robotic grasping. Lenz et al. 7 employ the deep network and autoencoder to implement grasp recognition and detection on the Cornell grasp data set. 1 On the same data set, a convolutional neural network is used for grasp detection, which provides an end-to-end approach to detect the graspable bounding box from images. 6
In order to find graspable locations in an image, we should check different areas in the image and use a classifier to recognize if it is graspable or not. A lot of works mentioned earlier 1,5,7 employ the sliding window approach to detect possible good grasp locations. The standard sliding window approach will densely sample numerous small patches in the image and is supposed to detect every possible location. However, it is time-consuming to test all the patches with the classifier and hard to satisfy the requirement in real-time grasping. In order to enhance the efficiency of the detecting process, some region proposal techniques 9,10 are widely used in object detection tasks. In the region proposal networks, a limited number of potential bounding boxes are first generated for searching instead of searching every location in the image exhaustively. Therefore, the quality of the proposals will largely affect the object detection performance.
In this work, we look into the problem of detecting good grasps from images in the perspective of robots. Inspired by the novel technology in the object detection and considering the specificity of the robotic grasping task, an end-to-end deep vision network is proposed to detect good grasps from image. The network unifies the feature extraction and grasp localization in a single framework. Given the image as the input, the graspable part of the object and grasping rectangles can be learned from the image in real time. We have evaluated our method on a robot-collected data set in order to find out the specific mechanism for robotic grasping.
The layout of this article is as follows. In section “Related work,” we review the existing robotic grasping data sets and the related work. The proposed deep vision networks model is introduced in section “Methodology.” To evaluate the proposed model, several experiments are conducted in section “Experiment.” At last, we come to the conclusion in section “Conclusion.”
Related work
The object representation and sensing perception are crucial for data-driven grasp synthesis, 11 since collecting real-world data is usually time-consuming and cumbersome. For a long time, some synthetic data sets are employed to develop and evaluate algorithms in simulated environments. The main advantage of the synthetic data sets is that they can be generated very fast and the model of the object is completely known. Once a model is created, numerous samples can be obtained automatically by changing different rendering conditions.
For 3-D models, the Princeton Shape Benchmark 12 is often used for robotic grasping research. It contains a large amount of 3-D models of household objects. In the Columbia Grasp Database, 13 a bunch of stable grasps is generated corresponding to these 3-D models. With the same data set, Boularias et al. 14 manually labeled graspable and ungraspable parts on these 3-D models trying to train a Markov network so as to predict the graspable part on a given 3-D model. Other 3-D data sets adopt some parameterized model such as superquadrics to represent the shape of the object, which utilize geometry parameters to distinguish the graspable part from the object. 15,16 Learning good grasps from the 2-D image is also very appealing, one remarkable work 2 uses 2-D synthetic models to learn good grasp point from the image and project it to the 3-D space. Thanks to the synthetic object data sets and simulation environments, great progress has been made on the data-driven robotic grasp synthesis. However, the situation in the real-world environment is far more complicated than the ideal simulated environment. It is still a challenging problem for the robot to obtain the complete and accurate sensing information of the object. So, higher requirement has been put on how to enable the robot to have the ability to well understand its perceived environment and then implement a good grasp. A popular real-world grasping data set is the Cornell grasp data set that contains 280 household graspable objects and 1035 images. 7 Each object is placed arbitrarily on the tabletop and several graspable and nongraspable rectangles are labeled by humans on each image. The graspable rectangle can be projected to the robotic configuration space so that the real-world image information can be associated with the robotic action.
All these mentioned grasp data sets are labeled by humans. Although human knowledge may enable the robot to have a prior knowledge and help the robot to learn how to grasp a given object, there are some problems with human-labeled data set. For one thing, it is impossible for humans to label all the possible grasp configurations when grasping an object. Furthermore, considering the real situations for both the robot and humans, a grasp that humans believe is good may not suit for a robotic hand to grasp. It is more reasonable and desirable that a robot can learn how to grasp an object considering its specific situations. For example, in the work by Herzog et al., 17 considering the mechanism of different robotic grippers, humans help to demonstrate different grasp configurations toward different objects. Also, in a recent impressive work, 18 a supersizing grasping data set is built. During its data collecting process, no human interaction is involved. It is a data set that is collected completely from the perspective of a robot. In this article, we evaluate the proposed deep vision network on this data set (Figure 1). It is collected during 700 h of robot grasping attempts. A two-finger gripper is used to pick up numerous objects on a tabletop with a top pinch grasp. A grasp rectangle 1 is used to denote each grasp from the image. Details about the data collecting process can be found in the research by Pinto and Gupta. 18

The left image demonstrates some objects used for grasping. They are placed randomly on the tabletop and a Baxter robot with parallel grippers is used to implement the grasping. A rectangle is used to represent the grasp in the image. Images on the right side display a variety of successful cases. The green lines denote the positions of two parallel fingers and the red line denotes the opening width of the gripper.
Methodology
The grasp rectangle 1 is used to represent the grasp in an image. In this way, the robotic grasping planning problem can be regarded as a detection problem that is profoundly investigated in the computer vision community. A commonly used approach to detect the grasp rectangle is to use the sliding window method and identify the graspability of every local patch. Some work has achieved very good detection performance. 5,7 However, the sliding window strategy makes the detecting time-consuming and thus hard to meet the requirement for the real-time robotic grasp detection. In the area of the generic object detection, it is possible to use a finite number of proposals to replace the dense sliding windows, which largely saves the detection time. 10,19 Redmon and Angelova 6 pioneer to employ the object detection technique to the grasp detection task on the Cornell grasp data set. A real-time convolutional neural network is proposed to implement the robotic grasp detection.
In our work, an end-to-end deep vision network model is designed to detect robotic grasp rectangles from the input image. Local features are used for detection, which is believed to reveal more specific grasping relevant information. To promote the efficiency of the detection, the reference rectangles are proposed to indicate potential regions for good grasps. The output of the network is the graspable scores for each reference rectangles in the image and the corresponding predicted grasp rectangles. The graspable part of an object can also be inferred from the output.
Reference rectangle
Inspired by the Faster R-CNN 10 framework that integrates the region proposal network to the object detection network , we propose the reference rectangle to indicate potential graspable candidates in the image. The reference rectangle samples all the possible locations in the convolutional feature map in a sliding window fashion and yields a grasp probability score for each rectangle. In Figure 2, reference rectangles with one scale and three aspect ratios are demonstrated. Each reference rectangle can then be refined to its corresponding grasp rectangle. By sliding the reference rectangle in the feature map, we can achieve a better detection performance efficiently in the experiments. The scale and aspect ratio of the reference rectangles may differ in different situations. This design of the reference rectangles enables the network to detect every possible location in the image and their corresponding regressed grasp rectangles. Since the predicted grasp rectangle is determined by the local features within the reference rectangle, it is believed to better describe a graspable area from the whole image.

The feature map and the reference rectangles. Reference rectangles of three ratios are used, which are denoted by dashed lines. They sample each location across the feature map in a sliding window fashion.
Model architecture
The architecture of our network is shown in Figure 3. It is composed of three parts. The first part is responsible for feature extraction. We employ the Zeiler and Fergus (ZF) model
20
that has five shared convolutional layers to extract features from the input image. The ZF model shows great ability on object recognition task, which is believed to provide robust features from the image and proved to have the ability to generalize to different visual tasks. The last part is the grasp prediction part that contains two sibling convolutional layers. It outputs the graspable score for each reference rectangle and its corresponding regressed grasp rectangle. The higher the graspable score is, the more probable the regressed grasp rectangle can be used to successfully grasp the object. Assuming n is the number of the reference rectangle used in each location, a 2n dimension score output volume and a 4n dimension coordinate output volume can be consequently obtained, respectively. Noting that there are two class labels for each reference rectangle, namely graspable and nongraspable, and

The architecture of the proposed deep vision network. The first five shared convolutional layers from ZF model are used to extract robust features from the image. The last two sibling layers are responsible for predicting the possible grasp rectangle. These two parts are combined by an intermediate layer. ZF: Zeiler and Fergus.
Training details
There are two sibling outputs in the proposed deep vision network. One is the scores for the reference rectangles and the other is the corresponding predicted grasp rectangles. The higher the score is, the more likely that the predicted grasp rectangle represents a good grasp. To fulfill this problem, we define the loss function for the ith reference rectangle as
where gi is the graspability score for the ith reference rectangle and ti is the offset parameters for the predicted grasp rectangle.
where

The process of how the grasp rectangle is regressed from the reference rectangle. The reference rectangle is denoted by red lines and the predicted grasp rectangle is denoted by blue lines.
We resort to the pretrained ZF model to initialize the first five shared convolutional layers during the training of our deep vision network. The other layers are initialized with a zero-mean Gaussian distribution. The stochastic gradient descent is used for training with a starting learning rate of 0.0001 and dropping the learning rate by 10 after every 40K iterations. The training process is implemented with Caffe. 22
Experiment
We have tested the proposed deep vision networks model on a robot-collected CMU grasp data set 18 (Figure 1) in order to investigate the characteristics of robotic grasping. In this data set, each robotic grasp attempt is represented as a rectangle in the image. With the proposed deep vision network model, the graspable part of the object and possible grasp rectangles can be detected from the image in real time.
Data preprocessing
We utilize the successful grasp cases in the CMU grasp data set as positive training samples. Each image is cropped into a size of 400 × 400. To enlarge the data set, we mirror and rotate each image resulting in a data set of 12K samples. Then, the data set is randomly split into the training set and testing set with a ratio of 4:1. To avoid the disturbance of the gripper angles when detecting, we adopt the rotated image in the data set in which the vertical parallel sides denote the two fingers of the gripper and the horizontal parallel sides denote the opening width of the gripper. In each image, only one rectangle is labeled on an object although there are cases that multiple objects are in the image (Figure 1).
Grasp detection results
Given the image as the input, the proposed model can output the graspable score for each reference rectangle and the corresponding predicted grasp rectangle at a rate of 80 frames per second on a graphic processing unit (GPU), which is enough to satisfy the requirement for the real-time robotic grasping. The grasp detection results are demonstrated in Figure 5. The first row demonstrates the input images, in which different numbers of the objects are in the image. The second row and the third row visualize the output of the proposed networks. The graspable score map reveals the grasp saliency for each image. It can be concluded that the robot can learn the location that is suitable for a good grasp. Interestingly, the robot even finds the handle parts for some objects that are depicted in a brighter color in the graspable score map. In the third row, the top 3 scored detected grasp rectangles are demonstrated. The appropriate rectangles are regressed for each object. For the image containing more than one object, multiple grasp rectangles are learned for each object. It indicates the proposed model can also be used for detecting a cluttered environment.

The first row is the input images. The second row is the visualization of the graspable score layer. The brighter the color is, the more probable that this area is a potential good grasp location. The last row is the predicted grasp rectangle for each input image. We demonstrate the top 3 scored rectangles. Note that the scores are arranged in a descending order with the color of red, green, and blue.
Grasp detection accuracy analysis
We use the grasp intersection over union (IoU) metric to evaluate the performance of the grasp detection. Assuming G is the detected grasp rectangle and G* is the ground truth rectangle, the grasp IoU can be defined as
where G ∩ G* is the overlap area between the detected grasp rectangle and the ground truth rectangle and G ∪ G* is the union area of these two rectangles. Normally, the predicted grasp rectangle is considered to be a good one if its grasp IoU is above a certain threshold value (0.25 in our experiments). We also evaluate the grasp detection performance at different threshold values. For the baseline, we extract the histogram of oriented gradients (HOG) feature from the image and use the support vector machine (SVM) as a classifier to detect grasp rectangles from the image.
We first calculate the grasp detection accuracy with the top N predicted grasp rectangles with a grasp IoU threshold of 0.25 (Figure 6(a)). The accuracy increases while the N increases. It performs the worst when the reference rectangles have three scales and three ratios. It is because the grasp rectangle in the image is in a relatively fixed size, so multiple scales reference rectangles will introduce noise to the model and add the difficulty for refining the reference rectangles. We can also observe from Figure 6(a) that the detection accuracy increases tremendously when the number of picked grasp rectangles goes from one to two and then grows flatly. It is because although there may be multiple objects in the image, only one ground truth grasp rectangle is labeled. So, the top 1 grasp rectangle detected may be not for the object that the ground truth label indicates. Another reason that affects the detection performance is that there are different ways to grasp the same object. But the predicted grasp rectangle only compared with the only one ground truth for that image. Since the data set itself is not categorized by objects, we cannot evaluate the model on an object-wise fashion. We also change the IoU threshold value ranging from 0 to 1, maintaining the top 3 predicted grasp rectangles. The experiment results are shown in Figure 6(b). The variation trend for each situation is similar and the situations with one scale perform slightly better than the one with multiple scales when the IoU threshold value is changing. Because of the calibration error in the annotation process and the robustness of the grasp, 0.25 is often chosen as the threshold value. 6 For the detecting time, the proposed method can detect a grasp at the rate of 80 frames per second on a GPU, while it takes the baseline method 4 s to detect one frame. The proposed method is far superior to the baseline. The detailed experimental data is illustrated in Table 1.

The grasp detection accuracy analysis. (a) Top N accuracy curve. (b) IoU threshold accuracy curve. Curves with different indexes correspond to the four situations stated in Table 1. IoU: intersection over union.
Different experimental settings and corresponding results obtained in the experiment.
Italics values are used to emphasize the value of the best performance in that column.
IoU: intersection over union.
The experiments demonstrate that the design of the reference rectangles can affect the performance of the grasp detection task. The one scale reference rectangles perform best in the evaluated data set.
Conclusion
Given the visual information, it is believed that detecting possible grasps from the image can facilitate the robotic grasping task. In this work, an end-to-end deep vision network model is proposed to solve the problem of detecting grasps from the image. With an image acquired by the robot, the model will exam the whole image and output the graspable score for each probable location in the image and its corresponding predicted grasp rectangle in real time. To improve the detection efficiency, reference rectangles are used to indicate possible grasp locations in the image. The setting of the reference rectangle can affect the performance of the robotic grasp detection. The experimental results demonstrate that the reference rectangles with fixed scale work best for the grasp detection in the evaluated CMU grasp data set. 18 Possible grasp rectangles for each object in the image can be predicted and also the part of the object that is suitable for grasping can also be learned. For the future work, we will detect the grasp for more complicated gripper configuration and evaluate the model on a real experimental scenario.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was jointly supported by the National Natural Science Foundation of China (Grants No: 61210013, 61327809, 91420302).
