Abstract
This work presents a novel method for addressing the bin picking problem, with two significant contributions. First, we propose a specialized generator of synthetic labelled data tailored for bin picking applications. This synthetic data facilitates the training of deep learning models for object detection. Second, a parallel workflow to optimize cycle time is presented; this workflow exploits the robot's motion time to perform the cognitive computations in parallel. Experimental results demonstrate that models trained exclusively on virtual data achieve comparable performance when transferred to real-world scenarios. Additionally, the proposed workflow effectively optimizes cycle time by ensuring uninterrupted robot movement throughout the process.
Introduction
In the current industrial context, robotics and artificial intelligence have emerged as key enablers of process automation. The automation of industrial tasks brings several advantages to manufacturers, since physically demanding or dangerous tasks can be performed by robots instead of humans. This allows the reallocation of human resources to more suitable tasks while reducing costs and increasing productivity and competitiveness. In this scenario, industrial bin picking plays a central role in factory automation. 1
Bin picking is the process of unloading a container full of unordered pieces and placing them in a more controlled way. Despite its simple definition, bin picking poses a series of challenges that are non-trivial to tackle, such as piece localization, grasping pose computation, collision-free path planning, and motion control. 2 Bin picking has been a long-standing research topic in the robotics community, where two different research scenarios can be distinguished. On the one hand, academics focus on developing general algorithms for picking unknown, lightweight everyday objects with general end effectors in uncontrolled environments. These developments are normally validated in laboratory prototypes, which implies a technology readiness level (TRL) of 3 or 4. On the other hand, industrial practitioners usually work with known pieces and custom-designed end effectors in more controlled contexts. Systems are validated in factory scenarios where the requirements are far more restrictive in terms of picking rate and cycle time. They target a TRL between 7 and 9.
The application of deep learning (DL) models is common in various steps involved in bin picking. However, the effectiveness of these models relies on large amounts of data, which are not always readily available. Additionally, the overall execution cycle is typically extended due to the high computational demands of the associated processes. The main contributions of this work are twofold. First, a generator of photorealistic synthetic labelled images, with a focus on industrial bin picking applications, is presented. This enables training vision models without tedious manual labelling and transferring them to the real world. Second, a parallel picking workflow focused on cycle time minimization, which is agnostic to hardware and use case, is introduced. We believe this method could be generalized to other bin picking scenarios. To the best of our knowledge, this is one of the few works that has been tested in a complete industrial environment. This work aims to bridge the gap between academia and industry by presenting a method that adapts academic algorithms to an industrial scenario.
Related work
The most distinctive and challenging phase of bin picking solutions, compared with other industrial applications, is the computation of grasping poses from visual data. A successful grasping pose is defined as a reachable and collision-free joint configuration of the robot that allows picking and retrieving an object. Kleeberger et al. 3 and Cordeiro et al. 4 review the existing literature on grasping point calculation and differentiate between model-based and model-free methods. On the one hand, model-based methods require a model of the object to be picked and are normally composed of two steps: object localization and grasp point determination. On the other hand, model-free methods are data-driven and do not require a predefined model. These methods either deduce the grasping point directly from visual data in a single step or may involve additional steps, such as object localization, before determining the grasping point.
In recent years, DL models have been used for object localization, as in Lee et al. 5 and Zhuang et al., 6 and for direct grasp point estimation, as in Zeng et al., 7 Zhou et al., 8 and Mahler et al. 9 Training these models requires a significant amount of labelled data, which can be challenging to obtain. A simple approach is to label the images manually, as in Zeng et al., 10 but this is a tedious process that requires a lot of time and effort. Other approaches generate grasping labels in a self-supervised manner, as in Pinto and Gupta 11 and Levine et al., 12 where the system infers the labels by trial and error; however, these approaches still require a considerable amount of time.
A growing trend in data generation is the utilization of simulation environments to create labelled datasets. The primary advantage of this approach is that it provides a faster, safer, and less tedious method of obtaining labelled data. However, it has its drawbacks, such as the requirement for computer-aided design (CAD) models of the objects and the fact that simulated data is not a perfect replica of real-world data due to the simulation–reality gap.
Li et al. 13 generate a synthetic dataset for industrial bin picking using Blender (https://www.blender.org/) and simulating the physical drop of the pieces with the Bullet engine (https://pybullet.org/wordpress/). They increase realism with two adaptations. Firstly, they match the simulated background to the real background. Secondly, they perform a grid search to choose the material properties, comparing real and virtual images and choosing the set of material properties with the highest similarity score. Selvam et al. 14 present a photorealistic synthetic dataset focused on bin picking built on Stillleben, 15 a rendering shader pipeline based on the PhysX simulator (https://github.com/NVIDIA-Omniverse/PhysX). It uses image-based lighting to add realistic illumination effects. They use objects from the YCB-Video dataset, 16 which are highly textured daily-life objects. Kim et al. 17 proposed a method using a physics simulator to generate synthetic datasets with three-dimensional (3D) CAD models and automatic annotations for segmentation and pose estimation.
Many of the works devoted to bin picking address textured daily-life objects, as they rely on popular benchmark datasets such as LINEMOD 18 and YCB-Video. 16 However, in industrial contexts, pieces are usually metallic, rigid, reflective, and textureless, and they are often found in a more cluttered manner, leading to occlusions. Certain datasets focus on industrial pieces, such as Hodan et al. 19 and Drost et al. 20 These capture real images employing high-cost sensors similar to those of real industrial applications, and the ground truth is obtained by initial manual labelling followed by automatic refinement. Yang et al. 21 create a dataset of very reflective industrial objects with high-cost sensors, where captures are taken before and after applying an anti-reflective spray. Depth maps of sprayed pieces present much less noise and fewer missing values. Labels are annotated manually, and visibility scores are obtained by rendering the segmentation mask of the object pose via perspective projection and comparing it with the visible pixels. Kleeberger et al. 22 expand the Sileane dataset 23 with some industrial parts, composing a bin picking dataset of virtual and real images, two-dimensional (2D) and 3D data, labels, and visibility scores that are obtained automatically by comparing subsequent images. Virtual images are obtained in the Coppelia robotics simulator (https://www.coppeliarobotics.com/) by throwing pieces into the container one by one and taking captures and segmentation masks after each drop. Visibility scores of the objects are calculated by computing the visible pixels in subsequent captures and dividing by the number of visible pixels in the first capture. For real images, the process is reversed: starting with a bin full of pieces, pieces are withdrawn one by one, and the mask is extracted by Iterative Closest Point refinement on the subtracted point cloud. The visibility threshold for real images is computed by rebuilding each real-world scene in simulation.
Compared with existing methods, we propose a generator of synthetic data designed for training vision models to operate on specific industrial components. In this work, we demonstrate that models trained purely on our virtually generated images can be transferred to real scenarios. Our approach is primarily focused on bin picking scenarios, but it has the potential to be extended to other applications.
Beyond grasping point estimation, other steps in the picking process can also benefit from performance improvements. Within the industrial context, cycle time is an important factor. Several studies, such as Iversen and Ellekilde, 24 Ichnowski et al., 25 and Ellekilde and Petersen, 26 aim to minimize cycle time from a path planning perspective, exploring the trade-off between planning time and motion time. However, little work has been done on improving the picking cycle by optimizing the picking logic. Several industrial works, such as Le and Lin, 27 Martinez et al., 28 and Kozák, 29 present a sequential workflow of capturing, processing, planning, and motion phases. In contrast with these works, we propose a picking workflow where some of these phases are executed in parallel, improving cycle time without affecting picking performance.
Problem formulation
The problem addressed in this work is a real industrial application that takes place in a manufacturing company in the automotive sector. The objective is to unload the pieces from the container and place them in smaller boxes. The pieces can be placed randomly in the small box as long as they are distributed evenly throughout it. Once these smaller boxes are filled, they are dispatched through a conveyor belt. The container is deposited in the picking station by an automatic guided vehicle; therefore, there are small displacements between containers. The container enters the station with 1500 pieces. The pieces are metallic, with asymmetrical, non-convex shapes featuring protrusions and flat surfaces; see Figure 1(a) and (b).

Designed tool and piece involved in the bin picking application. (a) Front view of the simulated piece. (b) Lateral view of the simulated piece. (c) Digital model of the tool. (d) Real model of the tool while picking a piece; four springs have been added to increase the tool's compliance and allow applying pressure on the piece.
Application configuration
The selected robot for the task is a KUKA kr10r1100-2, an industrial robot arm with a maximum reach of 1100 mm and a maximum payload of 10 kg. The tool is a pneumatic magnet, which has been purposely designed according to the application requirements. It has a set of springs to add compliance during picking, see Figure 1. The container dimensions are
The vision sensor is an Ensenso X-30 with a 1.6 MP resolution, located at a height of 2.3 m from the floor, facing the container perpendicularly. It is a structured-light system comprising a stereo camera and a texture pattern projector that facilitates point matching and enables 3D reconstruction. The output of this sensor is a 3D map, where each pixel corresponds to a 3D coordinate

Bin picking scenario. (a) Virtual scene used for development. (b) Real scene within an industrial environment.
The robot, along with other hardware components such as conveyors, sensors, and safety devices, is connected to a Beckhoff Programmable Logic Controller (PLC). This PLC is responsible for managing the system’s operations and ensuring compliance with current industry safety standards.
The computations of the proposed work are conducted on an external computer equipped with an i9-13059 processor at 2.2 GHz and 64 GB of RAM, which is connected to the Ensenso and the Beckhoff PLC via Ethernet (Transmission Control Protocol/Internet Protocol). The computed trajectories are sent to the PLC via Beckhoff Automation Device Specification. The PLC is in charge of the robot's motion as well as sending motion completion notifications.
From the software point of view, the application is based on PickingDK, 2 which is a plugin-based bin picking framework. For development purposes, the application was first validated in simulation using an extended version of the simulator described in Ojer et al. 30
Challenges
This case presents several significant challenges. The first challenge involves accurately locating the pieces that can be picked up. The random placement of the pieces within the container results in numerous occlusions, complicating their identification. Additionally, the cycle time of the process should be as short as possible and kept below 12 s. Finally, a particularly complex challenge arises from the non-convex shape of the pieces, coupled with their protrusions (as illustrated in Figure 1(b)). These features can cause the pieces to become tightly entangled, significantly hindering reliable and efficient extraction.
Cognition phase
The cognition phase involves interpreting the scene to gain a comprehensive understanding of the environment and making informed decisions based on that understanding. Its objective is to determine a set of collision-free trajectories that enable piece picking. This phase can be divided into two main processes: perception and path planning. Perception focuses on the understanding aspect, as its purpose is to detect target poses for object grasping using information from the vision sensor. In contrast, path planning addresses the decision-making aspect by generating safe robot trajectories towards the identified grasping positions.
Perception
The proposed perception process involves two steps to determine the grasp poses: (1) instance segmentation for object localization; (2) grasping pose estimation from the segmented instances. This phase is fed with 3D and 2D information, which is provided by the selected sensor.
Many industrial components have textureless, metallic, reflective materials or matte black surfaces, which can lead to noisy outputs in 3D map reconstruction. This is particularly true for the target pieces in this case. Due to these challenges and the current limitations of DL techniques for 3D data, the proposed object localization performs instance segmentation on the 2D image.
Synthetic data generation
In order to train the instance segmentation models, annotated data is needed, which is costly and tedious to obtain. Therefore, the proposed approach generates labelled synthetic data in simulation with minimal effort.
The proposed data generator consists of a simulation of the pieces inside the container together with the sensor element of the scene. The simulated sensor acquires the 2D image along with the instance labels for each pixel in the image, as well as instance information such as instance ID and visibility scores. Our generator is based on Unity (https://unity.com/) and its perception package for automatic labelling. 31 Visibility scores are computed as the ratio between the visible pixel count and the estimated number of pixels that would represent the object if it were fully visible. This estimation is obtained from the object's actual surface area and its distance from the camera.
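To make this computation concrete, the following is a minimal sketch of how such a visibility score could be derived, assuming a pinhole camera model; the function names, the focal-length parameter, and the projection approximation are illustrative rather than the generator's actual implementation.

```python
def estimated_full_pixel_count(surface_area_m2: float,
                               distance_m: float,
                               focal_length_px: float) -> float:
    """Approximate pixel footprint of a fully visible object.

    Under a pinhole model, a patch of area A at distance d projects to
    roughly A * (f / d)^2 pixels; the actual generator may use a more
    detailed projection of the object's surface.
    """
    return surface_area_m2 * (focal_length_px / distance_m) ** 2


def visibility_score(visible_pixels: int,
                     surface_area_m2: float,
                     distance_m: float,
                     focal_length_px: float) -> float:
    """Ratio of visible pixels to the estimated fully visible pixel count."""
    expected = estimated_full_pixel_count(
        surface_area_m2, distance_m, focal_length_px)
    return min(1.0, visible_pixels / expected) if expected > 0 else 0.0
```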
Domain randomization techniques are used to enhance robustness and generalization. Randomization covers lighting properties, such as intensity, temperature, and direction, and material properties, such as smoothness, reflectivity, and, in some cases, colour. To generate images that closely resemble the real setup, a grid search is performed on the lighting parameters to identify the configuration that maximizes the histogram correlation similarity measure with a reference image. The resulting parameter values are then used as the centre of a uniform distribution for randomization. The container model is also randomized from a predefined set of container models of the same size.
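As an illustration of this lighting calibration step, the sketch below performs a grid search over candidate lighting configurations and scores each render against a real reference image using OpenCV's histogram correlation; the `render_fn` hook standing in for the simulator, as well as the example grid values, are assumptions.

```python
import itertools

import cv2
import numpy as np


def histogram_correlation(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Histogram correlation between two greyscale uint8 images (in [-1, 1])."""
    hist_a = cv2.calcHist([img_a], [0], None, [256], [0, 256])
    hist_b = cv2.calcHist([img_b], [0], None, [256], [0, 256])
    cv2.normalize(hist_a, hist_a)
    cv2.normalize(hist_b, hist_b)
    return cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)


def grid_search_lighting(reference: np.ndarray, render_fn, grid: dict) -> dict:
    """Pick the lighting configuration whose render best matches the reference.

    `render_fn(**params)` is a hypothetical hook into the simulator that
    returns a greyscale render for a given lighting configuration.
    """
    best_params, best_score = None, -1.0
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = histogram_correlation(reference, render_fn(**params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params


# Example grid (illustrative values only):
# grid = {"intensity": [0.5, 1.0, 1.5],
#         "temperature": [4500, 6500],
#         "direction_deg": [0, 45, 90]}
```

The winning configuration is then used as the centre of the uniform randomization distributions described above.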
The data generation process is done as follows. Initially,

Synthetically generated images emulating a bin picking process with randomization applied.
Additionally, once a dataset is generated, the proposed generator provides data augmentation techniques such as scaling, rotating, flipping, blurring, colour manipulation, etc., as well as some filtering options for the instance masks based on parameters such as mask area, aspect ratio, and visibility. Among these, the visibility score is particularly important since the less occluded an object is, the easier it should be to grasp.
Setting a higher visibility score threshold increases the likelihood of selecting less occluded objects. However, this can result in a smaller number of pieces being included in the dataset. Therefore, the visibility threshold is a parameter that has to be chosen carefully, as it directly impacts the composition of the generated dataset, as shown in Figure 4.

Synthetically generated image and instance masks with different visibility thresholds: (a) red–green–blue (RGB) generated images; (b) instance masks without a visibility threshold; (c) with visibility threshold 0.7; and (d) with visibility threshold 0.97.
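A minimal sketch of the post-generation filtering described above is shown below; the annotation dictionary schema ('area', 'visibility', 'bbox') is illustrative, not the generator's actual export format.

```python
def filter_annotations(annotations, min_area=0, min_visibility=0.0,
                       max_aspect_ratio=None):
    """Filter instance annotations before exporting a dataset.

    Each annotation is assumed to be a dict with keys 'area',
    'visibility', and 'bbox' (x, y, w, h); these field names are
    placeholders, not the generator's real schema.
    """
    kept = []
    for ann in annotations:
        if ann["area"] < min_area or ann["visibility"] < min_visibility:
            continue
        if max_aspect_ratio is not None:
            _, _, w, h = ann["bbox"]
            if max(w, h) / max(min(w, h), 1) > max_aspect_ratio:
                continue
        kept.append(ann)
    return kept


# Raising min_visibility (e.g. from 0.7 towards 0.97) keeps only the
# least occluded instances, at the cost of fewer labels per image.
```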
Once a labelled dataset is created, it can be used to train DL instance segmentation models. These models generate pixel-wise masks for each detected instance of the pieces, which allows a precise localization of the pieces.
Grasping pose estimation
The previously segmented pixel-wise mask in the 2D image can be projected onto the 3D map, achieving a spatial segmentation of the piece.
Since the piece presents a flat metallic surface and the tool is a magnet, the target grasping point corresponds to the most centred point of the visible planar surface of the piece. This position is robust against system positioning errors, provides greater balance, and minimizes the bending of the piece.
With the previously trained instance segmentation model, the objects are localized in the 2D image. These pixel-wise masks are projected onto the 3D map to obtain spatial masks. The grasping point estimation algorithm segments the planar surface from the spatial masks through plane fitting. Some candidates may be discarded based on the ratio of points that fit the plane. On the planar surface, a distance transform is applied to determine the distance from the edges, and the point with the greatest distance is selected for grasping, see Figure 5. The grasping direction is computed as the normal of the fitted plane. Certain candidates are filtered according to the angle between the normal and the world

Grasping point estimation. The left column shows the two-dimensional (2D) image. The middle column represents the segmentation of the planar section. In this segmentation, green indicates the planar surface, red denotes points above the computed plane, blue represents points below the plane, and pink indicates points not belonging to the piece mask. The right column displays the distance transform applied to the edges of the planar region.
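The following sketch outlines this pipeline for a single candidate, combining a least-squares plane fit with OpenCV's distance transform; the tolerance, inlier-ratio, and tilt values are illustrative placeholders, and a RANSAC-style fit could replace the plain SVD for robustness.

```python
import cv2
import numpy as np


def fit_plane(points: np.ndarray):
    """Least-squares plane through an (N, 3) point set.

    Returns (centroid, unit normal); the normal is the singular vector
    with the smallest singular value.
    """
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return centroid, normal / np.linalg.norm(normal)


def grasp_point(mask: np.ndarray, cloud: np.ndarray,
                inlier_tol=0.002, min_inlier_ratio=0.5, max_tilt_deg=30.0):
    """Candidate grasp from a 2D instance mask and the (H, W, 3) point map."""
    ys, xs = np.nonzero(mask)
    pts = cloud[ys, xs]
    valid = np.isfinite(pts).all(axis=1)
    ys, xs, pts = ys[valid], xs[valid], pts[valid]
    if len(pts) < 3:
        return None
    centroid, normal = fit_plane(pts)
    # Discard candidates whose mask is not sufficiently planar.
    inliers = np.abs((pts - centroid) @ normal) < inlier_tol
    if inliers.mean() < min_inlier_ratio:
        return None
    # Reject grasps whose approach deviates too much from vertical.
    if np.degrees(np.arccos(abs(normal[2]))) > max_tilt_deg:
        return None
    # Distance transform on the planar region: pick the pixel farthest
    # from the edges as the most centred grasping point.
    planar = np.zeros(mask.shape, np.uint8)
    planar[ys[inliers], xs[inliers]] = 1
    dt = cv2.distanceTransform(planar, cv2.DIST_L2, 5)
    v, u = np.unravel_index(int(np.argmax(dt)), dt.shape)
    return cloud[v, u], normal
```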
Path planning
To facilitate and accelerate path planning, it is recommended to establish predefined intermediate waypoints and collision-free trajectories. Therefore, the path planning task is constrained only to the variable poses of the process, such as the grasping pose.
To reduce computational cost, a collision model of the scene was created using simplified shapes, specifically convex shapes for static elements. Since the container’s position may vary slightly, it is detected using a vision sensor. Walls are added to the collision model as cuboids. All models incorporate a safety margin to account for potential robot positioning errors.
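As a sketch of how such a simplified model could be assembled, the snippet below builds inflated wall cuboids around the detected container pose; the dimensions, wall layout, and margin handling are illustrative, not the deployed collision model.

```python
from dataclasses import dataclass


@dataclass
class Cuboid:
    """Axis-aligned box of the simplified collision model (metres)."""
    center: tuple  # (x, y, z)
    size: tuple    # (width, depth, height)

    def inflated(self, margin: float) -> "Cuboid":
        """Grow the box on every side to absorb robot positioning errors."""
        w, d, h = self.size
        return Cuboid(self.center,
                      (w + 2 * margin, d + 2 * margin, h + 2 * margin))


def container_walls(origin, inner_size, wall_thickness, margin):
    """Four wall cuboids around a detected container pose (illustrative)."""
    ox, oy, oz = origin          # detected container centre on the floor
    w, d, h = inner_size
    t = wall_thickness
    walls = [
        Cuboid((ox - (w + t) / 2, oy, oz + h / 2), (t, d + 2 * t, h)),  # left
        Cuboid((ox + (w + t) / 2, oy, oz + h / 2), (t, d + 2 * t, h)),  # right
        Cuboid((ox, oy - (d + t) / 2, oz + h / 2), (w, t, h)),          # front
        Cuboid((ox, oy + (d + t) / 2, oz + h / 2), (w, t, h)),          # back
    ]
    return [wall.inflated(margin) for wall in walls]
```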
From the computed point–normal pairs, a collision-free joint configuration of the robot must be determined. Thus, different rotations around the

Waypoints to plan for.
Workflow optimization
This section presents the parallel picking workflow, which is related to the management of the different phases of the process, such as the cognition phase and motion execution. Its goal is to organize the process effectively and minimize cycle time.
Parallel workflow
This work proposes a parallel workflow to optimize the cycle time. The idea is to leverage the robot’s motion execution to simultaneously perform the perception and planning tasks. This approach works under the assumption that removing a piece from the container just affects the nearby region, but more distant regions remain unaltered. This assumption holds true for the majority of bin picking cases.
The proposed workflow consists of two parallel threads, named the action thread and the cognition thread. The action thread is responsible for managing the hardware: it captures visual information from the sensor and executes robot motions. The cognition thread handles the cognitive tasks: it performs the perception phase to extract several grasping pose candidates, sorts them according to specific criteria, and computes the grasping trajectories. Path planning starts from the higher-priority candidates, and if the trajectories are feasible, they are added to a database. Then, the neighbouring candidates are removed: a candidate is removed if its distance to an already planned candidate is less than a threshold
This process is repeated iteratively until one of two conditions is met: either there are no candidates left or trajectories for
The action thread captures data only when the capture flag is active, not during every iteration. After each robot motion execution, the action thread evaluates the size of the trajectory database. When the database size decreases to one, the action thread schedules the capture for the next cycle, activating the capture flag. Consequently, during the next cycle, the action thread captures data and immediately executes the remaining trajectory in the database. Meanwhile, the cognition thread processes the recently acquired visual information and computes the subsequent trajectories. This workflow is illustrated in Figure 7.

Proposed parallel workflow diagram.
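The following sketch reproduces the described two-thread structure with standard Python primitives; `robot`, `sensor`, `perceive`, `plan`, the candidate attributes, and the parameter values are stand-ins for the actual framework components.

```python
import queue
import threading

trajectory_db = queue.Queue()     # planned, collision-free trajectories
new_capture = queue.Queue(maxsize=1)
capture_flag = threading.Event()


def candidate_distance(a, b):
    """Euclidean distance between two candidate grasp points."""
    return sum((x - y) ** 2 for x, y in zip(a.point, b.point)) ** 0.5


def action_thread(robot, sensor):
    """Manages the hardware: captures when scheduled, then executes motions."""
    while True:
        if capture_flag.is_set():
            capture_flag.clear()
            new_capture.put(sensor.capture())  # hand the image to cognition
        trajectory = trajectory_db.get()       # blocks until a plan is ready
        robot.execute(trajectory)
        # Schedule a capture for the next cycle once a single plan remains.
        if trajectory_db.qsize() <= 1:
            capture_flag.set()


def cognition_thread(perceive, plan, neighbour_threshold, max_picks):
    """Computes grasp candidates and trajectories while the robot moves."""
    while True:
        image = new_capture.get()
        candidates = sorted(perceive(image),
                            key=lambda c: c.priority, reverse=True)
        planned = 0
        while candidates and planned < max_picks:
            candidate = candidates.pop(0)
            trajectory = plan(candidate)       # None if infeasible
            if trajectory is None:
                continue
            trajectory_db.put(trajectory)
            planned += 1
            # Removing a piece only disturbs its neighbourhood, so drop
            # candidates that lie too close to an already planned one.
            candidates = [c for c in candidates
                          if candidate_distance(c, candidate)
                          >= neighbour_threshold]


# Bootstrap: trigger the first capture before starting both threads.
capture_flag.set()
```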
Experiment 1: Evaluation of synthetically trained models
This experiment aims to validate the effectiveness of the proposed data generation scheme by evaluating whether instance segmentation models trained exclusively on synthetic data can achieve satisfactory performance on real-world data.
For this experiment, we use two different types of pieces. Piece A, the one mentioned in the problem formulation section, is piled chaotically inside the container, see Figure 8(a). Piece B is a black, flat metal component with a Pacman-like shape. It is larger than piece A and is arranged in the container in nine stacked columns, see Figure 8(b).

Pieces used in Experiment 1: (a) piece A and (b) piece B.
We fine-tuned two widely used instance segmentation models, YOLOv5s-seg 33 and Mask R-CNN 34 with a ResNet50 backbone, both of which are pretrained on the Common Objects in Context (COCO) dataset. For YOLOv5s-seg, the recommended hyperparameters were used with the stochastic gradient descent (SGD) optimizer, including an initial learning rate of 0.01 with cosine decay. A batch size of 12 and a maximum of 100 epochs were chosen to fit our computational resources. Similarly, for Mask R-CNN, the default recommended parameters were applied: SGD was selected as the optimizer, with an initial learning rate of 0.02, a warm-up, and a multistep decay learning rate scheduler. A batch size of eight and a maximum of 100 epochs were chosen.
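For reference, these settings can be summarized in a configuration dictionary; this is a summary of the values stated above, not a verbatim configuration file from either framework.

```python
# Training settings as reported in the text (summary, not a config file).
TRAIN_CONFIG = {
    "yolov5s-seg": {
        "pretrained_on": "COCO",
        "optimizer": "SGD",
        "initial_lr": 0.01,
        "lr_schedule": "cosine decay",
        "batch_size": 12,
        "max_epochs": 100,
    },
    "mask_rcnn_resnet50": {
        "pretrained_on": "COCO",
        "optimizer": "SGD",
        "initial_lr": 0.02,
        "lr_schedule": "warm-up + multistep decay",
        "batch_size": 8,
        "max_epochs": 100,
    },
}
```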
Ground-truth masks of the real images are manually labelled by an expert. It is important to note that the manual labelling is only used to obtain the ground truth for the evaluation phase; the models are fine-tuned using purely synthetic data from the previously described generator. The intersection over union (IoU) metric is used for model evaluation.
Results on the chaotic scenario
In this case, where pieces are randomly placed, manual labelling can be quite subjective, since the operator cannot visually determine which pieces can actually be retrieved. Therefore, two labelling strategies are followed. The first focuses on pieces with complete visibility that are theoretically easier to pick, while the second adopts more relaxed constraints, including many partially visible pieces. To consider a match between ground-truth and predicted instances correct, an IoU threshold of 0.8 is fixed, since detection accuracy is important for the posterior picking process.
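The evaluation can be sketched as mask-level IoU computation followed by a one-to-one matching of predictions to ground truth; the greedy matching below is an assumption, as the text does not specify the matching procedure.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 0.0


def match_instances(preds, gts, iou_threshold=0.8):
    """Greedy one-to-one matching of predicted masks to ground truth.

    Returns (true_positives, false_positives, false_negatives); the
    0.8 threshold follows the chaotic-scenario evaluation above.
    """
    unmatched = list(range(len(gts)))
    tp = 0
    for pred in preds:
        best_iou, best_j = 0.0, None
        for j in unmatched:
            iou = mask_iou(pred, gts[j])
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_threshold:
            unmatched.remove(best_j)
            tp += 1
    return tp, len(preds) - tp, len(unmatched)
```

Precision and recall then follow directly from these counts.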
The manual labelling process consists of two steps. First, an expert manually selects the bounding boxes. These bounding boxes are then fed into the Segment Anything Model 35 to generate the segmentation masks. Finally, the expert reviews the generated masks, discarding defective ones.
A dataset is generated with

Instance segmentation on the real images with YOLOv5s-Seg model; (a): manually labelled strict masks, (b): model trained with 0.7 visibility threshold, (c): model trained with 0.9 visibility threshold, (d): model trained with 0.97 visibility threshold.
Results on the relaxed dataset.
Results on the strict dataset.
The tables indicate that models with higher visibility thresholds demonstrate higher precision but lower recall. This is somewhat expected, since the number of detections is lower. At the same time, precision over the relaxed dataset is higher, while recall is greater for the strict dataset. A higher recall on the strict dataset indicates that most retrievable pieces are detected, though some may not be truly graspable. In contrast, the higher precision on the relaxed dataset means that many of the detected pieces are indeed pickable, but not all pickable pieces are detected. These results suggest that the generated dataset lies between the strict and relaxed assessments of the expert. The choice of the appropriate visibility threshold will depend on the impact of false positives and false negatives in the application. Overall, YOLOv5s-seg outperforms Mask R-CNN across all evaluated metrics.
Results on the stacked scenario
In the case of piece B, a training dataset is generated with

Left column: generated synthetic images. Middle column: all generated labels. Right column: filtered labels with visibility threshold 0.7.
These pieces are intended for a controlled drop, which requires the generated masks to be highly precise. While the objects are easily identifiable, the main challenge lies in obtaining an accurate mask. As a result, the evaluation metric will focus on detection accuracy with a matching IoU threshold of 0.9. The labelling process for the test set is fully manual, with the operator drawing the ground-truth mask.
The obtained metrics are shown in Table 3. Again, the YOLOv5s-seg model offers better performance, but both models achieve acceptable results. The detection is quite accurate as it can discriminate the top pieces from the stack, see Figure 11 for detection results on real images.

Computed masks of a real image (piece B) with the YOLOv5s-seg model.
Results on piece B.
Overall, both trained models deliver acceptable performance when trained with the synthetic data generated by the proposed generator, with YOLOv5s-seg again outperforming Mask R-CNN in every metric.
Experiment 2: Evaluation on a real bin picking scenario
In this experiment, the main purpose is to evaluate the effectiveness of the proposed picking workflow in a real industrial bin picking application. While the previous experiment focused on independently validating the localization models, the current use case integrates these models into a complete operational pipeline. Thus, a secondary objective is to analyse how the interconnection between the different modules of the bin picking application affects the overall performance.
According to the previous results, the selected instance segmentation model is YOLOv5s-seg. Model inference takes place in the cloud as a separate service called during the perception phase. From the detected grasping poses, our strategy is to prioritize candidates with the greatest relative height in the robot's coordinate system, based on the intuition that these candidates are likely to be on top of the pile, thus offering greater clearance and facilitating grasping, see Figure 12. In this experiment, the maximum number of picked pieces per photo,

Grasping point candidates marked as targets, along with their segmentation mask. Target colour indicates grasping priority, red represents higher priority, and white represents lower priority. The green lines denote the container limits.
Figure 12 displays detected grasp candidates from an image.
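A minimal sketch of this height-based prioritization, assuming each candidate exposes its 3D grasp point in robot coordinates under an illustrative `point` attribute:

```python
def prioritize_candidates(candidates):
    """Order grasp candidates by height in the robot frame, highest first.

    Higher candidates are assumed to sit on top of the pile, offering
    greater clearance for the magnet tool.
    """
    return sorted(candidates, key=lambda c: c.point[2], reverse=True)
```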
The drop pose for the pieces is defined by a joint configuration where the tool is centred directly above the centre of the output box. During each iteration, random displacements are applied in the
Based on the results obtained in the previous section, a decision was made to initiate the picking process using a model trained with a high visibility threshold (0.9). This model focuses on pieces that appear easier to detect and grasp. As overall piece visibility decreases, the model’s detection performance may decrease until very few or no pieces are detected. When the number of detections from a single photo drops below a predefined value, the model is switched to one trained with a lower visibility threshold (0.7). This strategy aims to prioritize easier grasps at the beginning of the process while maintaining a consistent number of detections when pieces are less visible.
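The switching logic can be sketched as follows; the threshold of five detections is an illustrative placeholder, since the text only states that the switch occurs when detections fall below a predefined count.

```python
def select_model(current_model, num_detections,
                 high_vis_model, low_vis_model, switch_below=5):
    """Fall back to the 0.7-threshold model when detections drop.

    `switch_below` is a placeholder value; the deployed system uses a
    predefined count not reported here.
    """
    if current_model is high_vis_model and num_detections < switch_below:
        return low_vis_model
    return current_model
```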
Results on cycle time
During the picking process, the start and end times of each step of the picking workflow were recorded, as well as the total cycle time. In this experiment, two other picking workflows were compared with the proposed one. Workflow 1 is the traditional picking workflow used in Le and Lin, 27 where for each capture, a single piece is retrieved from the bin. In Workflow 2, for each capture, multiple pieces are retrieved, as in Martinez et al. 28 and Kozák. 29 Workflow 3 corresponds to our proposal that enhances Workflow 2 through parallel computation between steps. For each workflow, a total of 100 cycles were performed. Phase and cycle average times are summarized in Table 4, and phase time distributions are shown in Figure 13.

Visualization of the phase cycle times for the different picking workflows: Workflow 1 (a), Workflow 2 (b), and Workflow 3 (c). Vertical purple lines denote the cycle ending time.
Summary of cycle times of the different approaches.
The fastest cycle time is achieved by the proposed approach, with a reduction of up to 42% with respect to Workflow 1 and of about 30% relative to Workflow 2. However, our method relies on the assumption that picking an object only has a local effect on the scene, an assumption shared by Workflow 2 and Workflow 3. Workflow 1 does not have this limitation, as it captures images before every pick. Nevertheless, this assumption applies to the vast majority of real-world cases.
The bottleneck of the proposed method lies in the action thread, see Figure 13(c). Thus, future improvements in cycle time should address the motion phase. In our work, the motion phase was managed by the PLC and was beyond our scope. It is observed that the capture phase takes longer in our proposed method. A reason could be that the use of threads decreases the performance of the libraries used to acquire the data. In any case, the action thread will always be constrained by the mechanical limits of the system.
Results on picking accuracy
In this experiment, we let the system run for a total of 663 cycles using Workflow 3, while manually annotating whether each piece was actually retrieved or not. The obtained results are expressed as a picking success rate within a moving average of size 50, see Figure 14. During the initial steps of the container unloading process, the success rate is around 90%, but it decreases as the container empties. The overall average success rate is around 75%.

Moving average of the picking success rate.
The decrease in the success rate is attributed to the pieces becoming intertwined. When a piece is stuck, even if the system positions the magnet correctly and attaches to the piece, the retraction trajectory may cause another piece to obstruct the movement, leading to the piece falling. At the beginning of the picking process, the top pieces are looser and more accessible. However, the weight of these pieces is supported by the pieces underneath, and this pressure leads them to interlock, making successful picking more difficult.
In this particular case, determining in advance which pieces are blocked by others is challenging because the blocked part may not be visible to the sensor. We observed that some pieces with high visibility, which presumably seem pickable, might actually be trapped, while others with lower visibility that appear more difficult to pick may not be obstructed, see Figure 15 for some examples.

Candidates with slight occlusion, where it is difficult to determine if the picking will be successful or not. The top row corresponds to successful retrievals, whereas the bottom row corresponds to failed retrievals.
The proposed grasping selection strategy does not take into account previously attempted candidates; therefore, when a high-priority candidate is blocked, it will cause failures on subsequent attempts, as the candidate remains in the same position. Thus, multiple picking failures were due to the repetition of blocked candidates. However, there were some cases where, after several trials, the system was able to successfully unblock and pick the target piece. This might suggest that higher magnetic power could improve performance. Additional failures occurred due to collisions of the tool with nearby pieces during the approach movement. These collisions sometimes caused neighbouring pieces to shift, altering the orientation of the target piece and preventing full contact with the magnet. Redesigning the tool so that only the magnet part protrudes could potentially mitigate these collisions.
A video showing the bin picking application can be found at this link (https://vicomtech.box.com/shared/static/ahogug6et30cqnzh2sxxso8jp80gh9ut).
Discussion
The performed experiments validated the two primary contributions of this work. First, a successful transfer learning of instance segmentation models using purely synthetic data was demonstrated and validated in a real-world scenario. The proposed models showed satisfactory performance on two different types of pieces. Second, the proposed parallel workflow significantly reduces cycle times compared to traditional approaches, which provides a crucial advantage in industrial bin picking applications.
However, the overall system’s success rate was limited by the specific complexities of this bin picking application. In our experiments, piece entanglement hindered successful retrieval, even with accurate grasping. This highlights the complex interplay between piece morphology, grasping strategy, and physical system limitations.
Diagnosing errors in a holistic bin picking system remains challenging. Nevertheless, the experiments provided valuable insights and suggested several potential improvements. For example, the grasp selection strategy could be revised to account for previously attempted pieces, either through heuristic methods or learned policies. Additionally, redesigning the tool in order to make the magnet more prominent or increasing its magnetic power could help resolve blockages caused by certain pieces.
As noted, it is very difficult to diagnose a complete bin picking application, since it performs as a holistic system. For example, in the first experiment, the ground-truth masks of the pickable pieces were labelled manually by an expert operator. Expert knowledge is limited: the expert may label a piece that cannot actually be picked, or vice versa. However, when isolating the vision task from the rest of the application, it may be appropriate to identify certain pieces based on their visibility alone. Subsequent phases, such as grasping pose estimation or the grasping selection strategy, should then be in charge of deciding whether to attempt to pick each identified piece.
A valid ground truth for the complete system can be extracted when a piece is correctly retrieved from the container, but when a failure occurs, identifying the specific component responsible for the error is challenging. What remains clear is that refining the bin picking process requires intensive testing on the final setup. In Experiment 2, the localization, grasp pose estimation, and path planning functionalities operated as intended, resulting in the robot successfully picking the piece. However, errors during the retrieval motion suggest a potential limitation in the proposed grasp selection strategy. To address this, intelligent selection strategies that incorporate information from previous grasp attempts could offer an improvement over heuristic approaches. However, validating these strategies could be a tedious process.
As mentioned, obtaining feedback on the operation’s outcome can be highly beneficial. From a software perspective, image comparison between consecutive captures could potentially serve as a feedback mechanism. However, through experimentation, we discovered that this approach is more challenging than initially anticipated. Specifically, when the system successfully grasps the piece but fails during the withdrawal motion, the surrounding area is altered, and the target piece shifts relative to the previous image. In an industrial setting, this feedback system must be highly robust. Therefore, alternative solutions may involve hardware enhancements, such as incorporating presence or force sensors into the tool or adding weight sensors to the output conveyor.
The selection of parameters
Despite these challenges, the proposed methods demonstrate the potential for generalization to other bin picking applications. The proposed data generator can be used to generate synthetic data, facilitating training for object detection and instance segmentation models. Moreover, the parallel workflow is hardware-agnostic and suitable for a wide range of applications.
Conclusion and future work
In this work, we have presented a method to solve a bin picking problem. The development of the industrial application demonstrated that the system design helps bridge the gap between academic proposals and industry needs, particularly in (but not limited to) bin picking scenarios. This method is characterized by two main contributions.
First, a synthetic data generator is proposed. This generator produces photorealistic labelled data, which can later be used to train DL models for the instance segmentation task. In the first experiment, we demonstrated that models trained purely on the generated synthetic data can effectively transfer to real-world data in the bin picking context.
Secondly, a parallel workflow is proposed. This workflow leverages the robot motion execution to perform the perception and planning tasks for the upcoming cycles. The experiment shows that the total cycle time is reduced to the sensor capture and robot movement times. Therefore, our method optimizes the cognitive tasks, and the resulting cycle time is bounded by the hardware behaviour. Moreover, this parallel workflow is agnostic to the hardware and the use case; therefore, it can be generalized to other bin picking applications.
Our future work will focus on enhancing picking effectiveness through self-improvement. By incorporating real-time feedback on system performance, the system could autonomously learn from its errors and update its policies during the production process. This approach holds significant promise for industrial applications, potentially reducing reliance on manual heuristic design and validation periods while maintaining productivity.
Furthermore, implementing a comprehensive data logging system for the bin picking application would be beneficial. This system would record captured images, performed motions, grasping candidates, and grasp outcomes to generate historical data. This data could then be used, for example, to automatically label real-world images for improved model training or to refine the grasping selection strategy based on previous errors.
Another approach to addressing the entanglement problem is to have the robot perform specific motions, such as pushing, pulling, and shaking, to dislodge or separate entangled pieces. However, the correct configuration of these motions presents a significant challenge, as it requires extensive training, potentially with numerous real-world trials or sophisticated simulations, to learn robust and reliable disentanglement strategies across various scenarios.
Funding
This work has been carried out under the PTAS program funded by the Government of Spain.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Supplemental material
Supplemental material for this article is available online.
References
