Advances in sensing and processing methods for three-dimensional robot vision

Abstract

In this article, we survey recent developments in sensing methods for three-dimensional robot vision, focusing on current three-dimensional sensors and the core techniques embedded in robotic systems. Over the last 40 years, more than 8000 publications have reported a wide range of applications of three-dimensional robot vision, such as human–robot interaction, object recognition, three-dimensional modeling, object tracking, searching and surveillance, as well as robot manipulation, localization, navigation, mapping, and path planning. Representative works and future research trends are also discussed.

Keywords

3D robot vision SLAM object representation 3D sensors sensing technique

Introduction

In the 1970s, with the development of modern control theory and sensor technology, robots began to enter real production processes. A robot equipped with a camera based on a CCD chip formed the earliest robot vision system. In the 1980s, vision sensors entered a period of rapid development. Robots equipped with electronic cameras, range finders, and similar sensors gained advanced visual systems that helped them perform independent reasoning and motion planning in unstructured environments. The visual system¹ is recognized as the core gateway to robot intelligence. In the 21st century, robots equipped with advanced image acquisition systems and stereo vision technology have appeared, and the importance of three-dimensional (3D) vision systems for mobile robots has become widely recognized.

The development of 3D robot vision rests on improvements in both vision algorithms and 3D sensors.² Through decades of innovation, vision algorithms in robot vision systems have formed three classes: human–robot interaction,^{3–13} identification and perception,^{14–20} and movement decision.^{21–34} Corresponding to these classes, the main applications include gesture recognition,^{3,5,9,11,12,35} target tracking,^{10,36–39} human eye tracking,^{13,40} map building,^{20,25,41–44} scene understanding,^{33,34} object recognition,^{16–19,45,46} location and pose recognition,^{21–23} autonomous navigation,^{20,24–28} and path planning.^{29–32}

The development of 3D robot vision is inseparable from the evolution of optical sensors. We divide the visual sensors used in robot vision systems into three categories: one-dimensional linear array transducers, represented by single-line laser radar;⁴⁷ two-dimensional sensor arrays, represented by embedded cameras;⁴⁸ and 3D depth sensors, represented by cameras with special light sources.⁴⁹ The 3D depth sensor is the main and most critical sensor for realizing 3D vision in robots. The quality of 3D data acquisition directly affects the results of a mobile robot's back-end algorithms and its decision control.

The current mainstream 3D depth sensing techniques in robot vision were generally developed after 2010. They follow three technology routes: monocular structured light,⁵⁰ binocular structured light,⁵¹ and time of flight (TOF).⁵² Structured light relies on laser diffraction optics: the sensor projects particular patterns to accelerate or assist the acquisition of depth information. Four common pattern types are regular patterns, pseudorandom patterns, random dot speckle, and special graphic patterns. Structured light offers high accuracy and a fast refresh rate, but it is unsuitable for bright outdoor environments. The TOF technique measures the phase delay of a modulated light source reflected from objects at different distances and calculates depth from the speed of light. Its advantage is that measurement accuracy is maintained as distance increases; its disadvantages are low resolution and sensitivity to environmental disturbances.
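To make the continuous-wave TOF principle concrete, the sketch below converts a measured phase shift into depth using the standard relation d = c·Δφ/(4π·f_mod), where f_mod is the modulation frequency and the factor 4π (rather than 2π) accounts for the round trip of the light. The function names and the 20 MHz example frequency are illustrative assumptions, not parameters of any particular sensor.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_depth(phase_shift_rad: float, mod_freq_hz: float) -> float:
    """Depth from the phase shift of continuous-wave modulated light.

    The light travels to the object and back (distance 2d), so
    phase = 2*pi*f*(2d/c)  =>  d = c*phase / (4*pi*f).
    """
    return SPEED_OF_LIGHT * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

def unambiguous_range(mod_freq_hz: float) -> float:
    """Largest depth that can be measured before the phase wraps past 2*pi."""
    return SPEED_OF_LIGHT / (2.0 * mod_freq_hz)

# Hypothetical 20 MHz modulation with a measured phase shift of pi radians
f_mod = 20e6
print(round(tof_depth(math.pi, f_mod), 3))    # 3.747
print(round(unambiguous_range(f_mod), 3))     # 7.495
```

Because the phase wraps every 2π, a higher modulation frequency improves depth resolution but shortens the unambiguous range; this trade-off is one reason TOF cameras often combine several modulation frequencies.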

This article briefly reviews the state of the art in vision algorithms and 3D sensing devices applied in 3D robot vision and presents their applications in different fields. The “Overview of contributions” section gives a brief overview of related contributions in 3D robot vision. Representative vision techniques and applications are surveyed in the “Vision processing in robotic systems” section. The “3D sensing methods” section lists the relevant 3D sensing techniques and the most important recent developments in robotic sensors. The “Future trends” section discusses current and future trends. Conclusions are drawn in the “Conclusion” section.

Overview of contributions

Summary

About 8000 research papers closely related to 3D robot vision were published between 1976 and 2017, as shown in Figure 1. The yearly distribution shows that the topic emerged around 1976 and has developed rapidly over the past 10 years.

Figure 1.

Yearly published records on 3D robot vision in the last 40 years. 3D: three-dimensional.

Representatives

3D robotic vision systems are used in many fields, from industry to robotic services. Their most significant applications are summarized in the following list:

Object detection.^53,54

Security and surveillance.⁵⁵

Process automation.⁵⁶

Autonomous navigation.^57,58

Human–computer interaction.⁵⁹

Simultaneous localization and mapping (SLAM).^{60–62}

Path planning.^29,32

Scene understanding.⁶³

Target tracking.^64,65

Gesture recognition.^5,59

Medical treatment.^38,66,67

Robot vision has developed considerably, while its complexity and cost continue to decrease. With the introduction of 3D technology, 3D robot vision has rapidly penetrated many new application areas, not only across industry but also in medical treatment, entertainment, and services.

Industrial detection

Robotic vision has been successfully applied to industrial inspection, for example, inspection of product packaging and printing quality,⁶⁸ beverage containers,⁶⁹ beverage filling quality and product sealing,⁶⁹ timber,⁷⁰ semiconductor packaging quality,⁷¹ and steel coil quality, as well as fruit grading.⁷² These applications greatly improve product quality and reliability while maintaining production speed. On pharmaceutical production lines, robotic vision can be used to inspect drug packaging and medicine loading.⁷³

Medical science

In medical science, vision processing is used not only in some surgical robots but also to assist doctors in medical image analysis, mainly by applying digital image processing and information fusion to nuclear magnetic resonance (NMR) images, computed tomography images, and other medical imaging data.⁷⁴ Different medical imaging devices capture different characteristics of biological tissue. For example, X-ray images show bone, while NMR images show soft tissue. Because doctors need to consider bone and soft tissue together, the two kinds of images are fused using digital image processing techniques for medical analysis.^75,76

Robot navigation and visual servo

Robot vision is one of the most important components of robotics. Its purpose is to provide the robot motion control system with feedback about the location of a target, or the pose and location of the robot itself, through vision techniques such as 3D localization and scene understanding. Visual servo systems are used for robot manipulators,⁷⁷ localization of humanoid robots for locomotion tasks,⁷⁸ object recognition with pose estimation,^79,80 robot ego-motion estimation,^81,82 and object grasping.^83,84

Aeronautics and astronautics

Satellite remote sensing images usually contain very large amounts of data, along with much interference and erroneous data, which makes them difficult to process and analyze. 3D robot vision technology is used to analyze all kinds of remote sensing images for environmental monitoring, geographic measurement, and 3D reconstruction.⁸⁵ Based on the 3D characteristics of topographic and landform images, the terrain can be automatically recognized,²⁴ interpreted, and classified. Moreover, 3D robotic vision can be applied to the motion and planning of unmanned aerial vehicles (UAVs), such as position and orientation measurement for autonomous aerial refueling⁸⁶ and UAV visual path planning.⁸⁷

Vision processing in robotic systems

Target representation

The goal of robotic vision is to enable robots to identify objects as people do: not only to capture an object but also to understand its representation. Representation is a key step in object recognition. For robots, the choice of object representation (mainly, the choice of mathematical model) directly determines the capability and reliability of object recognition. Observations of an object from different views yield different images, and sometimes these images differ greatly. Ideally, the representation of an object should depend as little as possible on the viewing angle. At present, robots find it difficult to recognize an object observed from different angles, whereas human recognition is relatively insensitive to such changes in viewing angle. Therefore, studying human visual representation is an effective way to address the object representation problem for robots.

Many studies have proposed mathematical descriptions of objects in different forms, such as feature descriptions, invariant descriptions, and elastic models. However, these models represent only particular objects in particular environments and fall short of a general object representation. According to the current literature, there are two main models for object representation: the view-based two-dimensional (2D) image model and the 3D shape model.

Biologically, target recognition might be based on both the 2D image model and the 3D shape model. When people recognize an object, the time needed to recognize it from a frontal image differs only slightly from the time needed to recognize it from a side image, but the difference still exists. This suggests that the object representation in the human vision system is not purely 3D; if it were, recognition from the frontal and side images should take exactly the same time. Some representations have been proposed based on images.^{88–90} In this kind of representation, an object is represented in human vision not by its 3D geometric shape but by a set of images of the object from different views. Under this model, object recognition becomes a process of matching the input image against the stored images of the object. This mathematical model is based on the so-called subspace method: although in theory an object can project into innumerable different images, any of these images can be expressed, within a small error, as a linear combination of a limited number of base images, so the brain only needs to store the base images. However, this model is questionable, because the brain contains many neurons that are sensitive to depth information. If the depth information obtained by these neurons were useless for object recognition, such cells would likely have degenerated during human evolution. Thus, the image-based representation model is not yet mature.
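The subspace idea can be sketched as follows: each stored object is a small set of base views flattened into vectors, a new view is approximated by the least-squares linear combination of those bases, and recognition picks the object with the smallest reconstruction residual. The random "images", the object names, and the helper functions are hypothetical and only illustrate the linear-combination principle; they are not a model of any cited system.

```python
import numpy as np

def residual_to_subspace(bases: np.ndarray, image: np.ndarray) -> float:
    """Distance from `image` to the span of the stored base views.

    bases: (num_pixels, num_bases) matrix whose columns are stored views.
    image: (num_pixels,) flattened input view.
    """
    coeffs, *_ = np.linalg.lstsq(bases, image, rcond=None)
    return float(np.linalg.norm(bases @ coeffs - image))

def recognize(models: dict, image: np.ndarray) -> str:
    """Return the name of the object whose base views best explain `image`."""
    return min(models, key=lambda name: residual_to_subspace(models[name], image))

rng = np.random.default_rng(0)
num_pixels = 64
# Hypothetical objects, each stored as three base views
models = {name: rng.normal(size=(num_pixels, 3)) for name in ("cup", "box", "ball")}

# A new view of the box: a linear combination of its stored views plus small noise
new_view = models["box"] @ np.array([0.5, -1.2, 0.8]) + 0.01 * rng.normal(size=num_pixels)
print(recognize(models, new_view))  # box
```

Matching therefore never reconstructs depth; it only asks which stored view set spans the input, which is exactly the property criticized in the paragraph above.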

The 3D shape model was proposed by David Marr in the 1980s.⁹¹ Its basic idea is that the brain represents an object by its 3D geometric shape. Because the 3D shape of an object does not depend on the viewing angle, the representation of the object in the human vision system is also viewpoint independent. Marr’s 3D representation model is regarded as the origin of computer vision and still has an important influence today. In Marr’s work, the 3D shape model is also known as 3D reconstruction theory. Marr’s theory holds that people first extract simple primitives such as points, lines, and regions from the 2D image, then recover the depth of these primitives using visual modules such as binocular stereo vision and motion vision, and finally form a representation of the whole shape of the object. However, Marr’s 3D model meets many difficulties in practical applications.⁹² The main difficulty is that a computer cannot reliably recover the 3D depth information lost in 2D images. Theories such as hierarchical reconstruction were later proposed to improve the robustness of depth recovery.^93,94

3D representation

According to the coordinate system used to describe the object, 3D object representation methods can be divided into two categories: object centered and viewer centered. Object-centered methods use the object’s own coordinate system and features of the object itself, such as corners, holes, and edges, that do not depend on the viewpoint. In contrast, viewer-centered methods treat the appearance of the object as dependent on one or more viewpoints and use viewpoint-dependent features such as occluding edges, silhouettes, and T-junctions.

According to the size of the geometric features used, representation methods can be divided into six classes: (1) methods based on 3D points, which mainly use the depth data, the normal directions of surface points, and the curvature; (2) methods based on convex points, which mainly use the vertices, corner points, and points of maximum and minimum curvature of the object; (3) methods based on shape, which mainly use the edge information of the object; (4) methods based on surfaces, which mainly use surface information such as planes, spherical surfaces, quadric surfaces, and the connectivity between surfaces; (5) methods based on the body, which mainly use volumetric descriptors such as voxels, ellipsoids, and superellipsoids; and (6) methods based on parts, which mainly use the basic components of the object. Several 3D object representation methods, grouped by the features they use, are described below.

Basic surface feature

Early work used basic features of the object surface; for example, Grimson and Lozano-Perez⁹⁵ used edges, straight-line segments, and normal vectors to represent polyhedra. Because approximating a curved surface with polyhedra requires a lot of storage, quadric equations have been proposed to represent curved surfaces.⁹⁶ Other representation methods are based on the Gaussian curvature and mean curvature of surface points.⁹⁷ Stein and Medioni⁹⁸ proposed a structural representation that determines edges and local surfaces from the distribution of normal vectors, so it can handle surfaces of arbitrary shape. In general, representations based on local boundaries and basic surface features are sensitive to noise in the data, and their accuracy depends on the reliability of feature extraction from the input images.
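For a depth map z = f(x, y), the Gaussian curvature K and mean curvature H mentioned above can be estimated with finite differences. The sketch below uses the standard Monge-patch formulas together with `numpy.gradient`; the grid spacing and the spherical test surface are illustrative assumptions. Because it relies on second derivatives, the estimate amplifies sensor noise, which illustrates why such features are noise sensitive.

```python
import numpy as np

def surface_curvatures(depth: np.ndarray, spacing: float = 1.0):
    """Gaussian (K) and mean (H) curvature of a depth map z = f(x, y).

    Rows index y and columns index x; `spacing` is the grid step on both axes.
    """
    fy, fx = np.gradient(depth, spacing)   # first derivatives along y and x
    fxy, fxx = np.gradient(fx, spacing)    # d(fx)/dy and d(fx)/dx
    fyy, _ = np.gradient(fy, spacing)      # d(fy)/dy
    g = 1.0 + fx**2 + fy**2                # 1 + |grad f|^2
    K = (fxx * fyy - fxy**2) / g**2
    H = ((1 + fx**2) * fyy - 2 * fx * fy * fxy + (1 + fy**2) * fxx) / (2 * g**1.5)
    return K, H

# Hypothetical test surface: cap of a sphere with radius 10 on a 0.1 grid
x = np.linspace(-2.0, 2.0, 41)
X, Y = np.meshgrid(x, x)
Z = np.sqrt(100.0 - X**2 - Y**2)
K, H = surface_curvatures(Z, spacing=0.1)
print(round(K[20, 20], 4), round(H[20, 20], 4))  # 0.01 -0.1
```

At the top of the sphere the estimates match the analytic values K = 1/R² and |H| = 1/R (H is negative here because the surface bends away from the +z viewing direction).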

Discontinuity

Some representation methods use the discontinuities of the object surface. For example, Godin and Levine⁹⁹ used the ridge and crease edges of the object to construct an edge-junction graph and represented the object with this graph. Chen and Stockman¹⁰⁰ combined surface and breakpoint information to label edges, and they also used 2D contour features to describe arbitrarily shaped 3D objects.¹⁰¹ Because discontinuity-based methods use only the edge information of the object, they reduce storage requirements. In most cases, however, the representation is incomplete because surface information is lost.

Surface fitting

Global object representation methods fit the object surface with a parametric equation. Researchers have analyzed the invariance of certain parameters of higher-order algebraic equations and used these parameters to represent the surfaces of 3D objects.¹⁰² Other surface fitting methods, such as B-splines, are also used.¹⁰³ In addition, superquadric surfaces, first introduced by Barr, describe a surface with an implicit equation that can be fitted to the data.¹⁰⁴ For this type of method, occlusion is a difficult problem, because the algebraic polynomial fitted to part of an object’s surface may differ from the polynomial fitted to the whole surface. Moreover, the equation parameters are obtained after region segmentation, so their accuracy depends on the segmentation results.
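A superquadric in its own coordinate frame is commonly written with the inside-outside function evaluated below, where a1, a2, a3 are the axis lengths and e1, e2 are the shape exponents: F < 1 inside, F = 1 on the surface, and F > 1 outside. Fitting then amounts to choosing parameters (plus a pose, omitted here) that bring F close to 1 at the measured points. This sketch only evaluates F; it is not the fitting procedure of the cited work, and the test points are hypothetical.

```python
import numpy as np

def superquadric_F(points, a1, a2, a3, e1, e2):
    """Inside-outside function of a superquadric in its canonical frame.

    points: (N, 3) array. Returns F per point: <1 inside, 1 on surface, >1 outside.
    Absolute values keep the fractional powers real for negative coordinates.
    """
    x, y, z = np.abs(np.asarray(points, dtype=float)).T
    xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(superquadric_F(pts, 1, 1, 1, 1, 1))  # unit sphere: [0. 1. 4.]
```

With e1 = e2 = 1 the shape is an ellipsoid; shrinking both exponents toward 0 makes it increasingly box-like, so a single compact parameter set can cover both rounded and blocky parts.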

Orientation

Each point on the object surface can be mapped to the point on the Gaussian sphere that has the same surface normal direction. If the object surface is strictly convex, this mapping is one-to-one. However, the Gaussian sphere representation is ambiguous because it does not preserve the translation and size information of the object. Some improved methods project the surface normal vectors onto the unit sphere according to the support function; for example, the Generalized Gaussian Image¹⁰⁵ stores the connectivity of adjacent points on the sphere to ensure a unique representation of the object. Orientation-based methods mainly describe the surface information of the object. Although they are global representations, they are cumbersome and cannot handle occluded objects. In addition, it is very difficult to segment single objects from the Gaussian image of a whole scene.
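A common discrete variant of the Gaussian-image idea, the extended Gaussian image, accumulates surface area per normal direction. The sketch below bins triangle normals by their nearest direction in a small, hypothetical set of bin directions (the six axis directions); practical implementations usually tessellate the sphere much more finely. Translating the mesh leaves the histogram unchanged, which is exactly the loss of position information described above.

```python
import numpy as np

# Hypothetical, very coarse tessellation of the unit sphere
BIN_DIRECTIONS = np.array([
    [1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)

def extended_gaussian_image(vertices, faces, bins=BIN_DIRECTIONS):
    """Area-weighted histogram of face normals over a set of unit bin directions."""
    v = np.asarray(vertices, dtype=float)
    tri = v[np.asarray(faces)]                            # (F, 3, 3) triangle corners
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    norm = np.linalg.norm(cross, axis=1)
    areas = 0.5 * norm                                    # triangle areas
    normals = cross / norm[:, None]                       # unit face normals
    nearest = np.argmax(normals @ bins.T, axis=1)         # closest bin direction
    return np.bincount(nearest, weights=areas, minlength=len(bins))

# Two triangles forming a unit square in the z = 0 plane, facing +z
verts = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
faces = [[0, 1, 2], [0, 2, 3]]
egi = extended_gaussian_image(verts, faces)
print(egi)  # [0. 0. 0. 0. 1. 0.]
```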

Grid

Grid representation methods use polygons, most commonly quadrilaterals and triangles, to represent the shape of an object. Because of the large amount of data, storage, transmission, and computation are costly, and such representations are hard to use directly in 3D object recognition; they therefore need to be converted into other representations. Johnson and Hebert¹⁰⁶ transformed the geometric relations between the vertices of the grid into a 2D image, called a spin image, and used spin images to describe the features of 3D points for recognition. Even with simplified algorithms, computing spin images is still expensive.
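Following the description above, a spin image for an oriented point (p, n) maps every other vertex x to two coordinates: its distance α from the line through p along n, and its signed height β along n. These pairs are then accumulated into a 2D histogram. The bin size and image width below are assumptions chosen for illustration, and the sketch omits refinements such as bilinear interpolation into neighboring bins.

```python
import numpy as np

def spin_image(p, n, points, bin_size=0.1, image_width=10):
    """2D histogram of (alpha, beta) coordinates around the oriented point (p, n).

    alpha: radial distance from the line through p along n (>= 0)
    beta:  signed distance along n
    Rows index beta (centered on 0); columns index alpha.
    """
    n = np.asarray(n, dtype=float) / np.linalg.norm(n)
    d = np.asarray(points, dtype=float) - np.asarray(p, dtype=float)
    beta = d @ n
    alpha = np.sqrt(np.maximum(np.sum(d * d, axis=1) - beta**2, 0.0))
    extent = bin_size * image_width
    hist, _, _ = np.histogram2d(
        beta, alpha,
        bins=[image_width, image_width],
        range=[[-extent / 2, extent / 2], [0.0, extent]],
    )
    return hist

# Hypothetical point cloud around the origin, oriented point with normal +z
rng = np.random.default_rng(1)
cloud = rng.uniform(-0.5, 0.5, size=(500, 3))
img = spin_image([0, 0, 0], [0, 0, 1], cloud)
print(img.shape)  # (10, 10)
```

Because α and β do not change when the surface rotates about n, the descriptor is invariant to that rotation, which is what makes it usable for matching without a full reference frame.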

Voxel

Voxel representation methods mainly describe the volumetric (body) features of an object.¹⁰⁷ Analogous to a pixel in an image, a voxel represents a small volume of space, and the object is described as a collection of nonoverlapping cubes distributed in 3D space. Voxel representations are not well suited to object recognition but are more suitable for surface reconstruction and modeling.
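A minimal voxelization of a point cloud into a boolean occupancy grid can be sketched as follows, assuming an axis-aligned grid anchored at the minimum corner of the data and a chosen voxel size (both assumptions for illustration; the points are hypothetical):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Map 3D points to a boolean occupancy grid of non-overlapping cubic voxels.

    Returns the grid and the world coordinate of the grid origin (min corner).
    """
    pts = np.asarray(points, dtype=float)
    origin = pts.min(axis=0)
    idx = np.floor((pts - origin) / voxel_size).astype(int)
    grid = np.zeros(tuple(idx.max(axis=0) + 1), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True   # mark occupied voxels
    return grid, origin

points = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.1], [1.2, 0.0, 0.0], [1.2, 0.7, 0.6]]
grid, origin = voxelize(points, voxel_size=0.5)
print(grid.shape, grid.sum())  # (3, 2, 2) 3
```

The first two points fall into the same voxel, so four points occupy three voxels; this quantization is what makes voxels convenient for reconstruction but lossy for fine recognition features.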

Octree

Octree representation methods describe an object hierarchically. Each node in the tree has eight children, and the root node is a cube that completely covers the object. During the hierarchical description, the space occupied by the object is iteratively divided into eight octants, and the iteration stops when each final cube is homogeneous in some feature. Chien et al.¹⁰⁸ proposed a method to generate the octree model of an object from three pairs of orthogonal depth images. Li and Crebbin¹⁰⁹ proposed another octree generation method that uses depth images taken from arbitrary angles and can describe concave surfaces of the object. Octree representations can describe the global features of an object, but they are approximate, and the quality of the approximation depends on the accuracy of the segmentation.
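The recursive subdivision described above can be sketched on a cubic occupancy grid: a node stops splitting when all its voxels are homogeneous (all occupied or all empty) and is otherwise divided into eight octants. Using occupancy as the homogeneity criterion and a grid whose side is a power of two are assumptions made to keep the example short.

```python
import numpy as np

def build_octree(grid):
    """Recursively encode a cubic boolean grid (side = power of two) as an octree.

    Leaves are True (full) or False (empty); internal nodes are lists of 8 children.
    """
    if grid.all():
        return True
    if not grid.any():
        return False
    h = grid.shape[0] // 2
    return [build_octree(grid[i:i + h, j:j + h, k:k + h])
            for i in (0, h) for j in (0, h) for k in (0, h)]

def count_leaves(node):
    """Number of homogeneous cubes in the final description."""
    return 1 if isinstance(node, bool) else sum(count_leaves(c) for c in node)

# Hypothetical 4x4x4 grid in which exactly one 2x2x2 octant is occupied
grid = np.zeros((4, 4, 4), dtype=bool)
grid[:2, :2, :2] = True
tree = build_octree(grid)
print(count_leaves(tree))  # 8
```

Large homogeneous regions collapse into single leaves, so storage scales with the complexity of the object boundary rather than with the full volume.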

Simultaneous localization and mapping

SLAM is a leading technique for spatial localization in robotic vision.¹¹⁰ It mainly addresses the problem of locating the camera in space while building a map of the environment. The SLAM architecture includes two main components: the front end and the back end. The front end abstracts sensor data into models that are amenable to estimation, while the back end performs inference on the abstracted data produced by the front end.¹¹¹ This architecture is illustrated in Figure 2.

Figure 2.

The standard architecture of SLAM.¹¹¹ SLAM: simultaneous localization and mapping.

A standard formulation of SLAM is as follows. Assume that a robot moves through the environment, as shown in Figure 3, along the trajectory X = {x_k : k = 1, …, m}, and that the environment contains the landmarks Y = {y_i : i = 1, …, n}. When the robot moves from time k − 1 to time k,

x_{k} = g(x_{k-1}, u_{k}, \sigma_{k}) \qquad (1)

where g(⋅) is the motion model, U = {u_k : k = 1, …, m} is the sensor input data, and σ_k is random noise. At time k, the robot observes landmark y_i:

z_{k,i} = h(y_{i}, x_{k}, \tau_{k,i}) \qquad (2)

Figure 3.

Visual SLAM process.¹¹² SLAM: simultaneous localization and mapping.

where h(⋅) is the observation model, z_{k,i} is the observation of landmark y_i made from pose x_k, and τ_{k,i} is the observation noise. Equations (1) and (2) constitute the basic mathematical model of SLAM. Estimating the trajectory and the landmarks from them is a state estimation problem that can be solved by filtering or by nonlinear optimization.
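As a concrete, hypothetical instance of equations (1) and (2), the sketch below uses a planar pose x_k = (x, y, θ), a velocity/turn-rate input u_k = (v, ω), and a range-bearing sensor, with Gaussian noise standing in for σ_k and τ_{k,i}. These model choices are common textbook examples, not models prescribed by the surveyed works.

```python
import numpy as np

rng = np.random.default_rng(42)

def motion_model(x_prev, u, dt=0.1, noise_std=(0.0, 0.0)):
    """g(x_{k-1}, u_k, sigma_k): unicycle model with input u = (v, omega)."""
    px, py, theta = x_prev
    v, omega = np.asarray(u, dtype=float) + rng.normal(0.0, noise_std)  # noisy input
    return np.array([px + v * dt * np.cos(theta),
                     py + v * dt * np.sin(theta),
                     theta + omega * dt])

def observation_model(landmark, x, noise_std=(0.0, 0.0)):
    """h(y_i, x_k, tau_{k,i}): range and bearing of a 2D landmark seen from pose x.

    The bearing is not wrapped to (-pi, pi] here, for brevity.
    """
    dx, dy = np.asarray(landmark, dtype=float) - x[:2]
    z = np.array([np.hypot(dx, dy), np.arctan2(dy, dx) - x[2]])
    return z + rng.normal(0.0, noise_std)

x = np.array([0.0, 0.0, 0.0])
for _ in range(10):                 # drive straight at 1 m/s for 1 s, noise-free
    x = motion_model(x, u=(1.0, 0.0))
print(np.round(x, 3))                                     # [1. 0. 0.]
print(np.round(observation_model([2.0, 1.0], x), 3))      # [1.414 0.785]
```

Passing nonzero `noise_std` values turns both functions into stochastic simulators, which is how synthetic data for testing a SLAM estimator is often generated.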

SLAM is more a concept than a single algorithm. As illustrated in Figure 2, the front end is a visual odometry module that estimates the inter-frame motion of the camera and the positions of landmarks. The back end performs optimization: from the camera poses estimated by visual odometry at different times, it computes the maximum a posteriori estimate. Loop closure detection, which connects the front end and the back end, determines whether the robot has returned to a previously visited position. Finally, a map that meets the task requirements is built from the trajectory and the camera observations.
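Under independent Gaussian noise, the maximum a posteriori estimate computed by the back end reduces to a least-squares problem. The toy 1D pose graph below uses hypothetical numbers: drifting odometry constraints plus one loop-closure constraint stating that the robot has returned to its start. Real systems solve nonlinear versions of this problem over 2D or 3D poses; the helper `build_system` is illustrative only.

```python
import numpy as np

# Hypothetical 1D trajectory: the robot drives out (x0 -> x2) and back (x2 -> x4).
# Each tuple (i, j, d) is a relative measurement x_j - x_i = d.
odometry = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, -0.9), (3, 4, -0.9)]
loop_closure = [(0, 4, 0.0)]  # loop detection: x4 is the same place as x0
num_poses = 5

def build_system(constraints, num_poses):
    """Stack one linear row per constraint, plus a prior x0 = 0 that fixes the gauge."""
    A = [np.eye(num_poses)[0]]
    b = [0.0]
    for i, j, d in constraints:
        row = np.zeros(num_poses)
        row[j], row[i] = 1.0, -1.0
        A.append(row)
        b.append(d)
    return np.array(A), np.array(b)

A, b = build_system(odometry + loop_closure, num_poses)
# With unit-variance Gaussian noise on every constraint, MAP = least squares.
x_map, *_ = np.linalg.lstsq(A, b, rcond=None)

dead_reckoning = np.concatenate([[0.0], np.cumsum([d for _, _, d in odometry])])
print("odometry only:", np.round(dead_reckoning, 2))
print("with loop closure:", np.round(x_map, 2))
```

Dead reckoning ends 0.2 away from the start. With the loop closure, that inconsistency is spread evenly over the five equally weighted constraints (0.04 each), giving approximately [0, 0.96, 1.92, 0.98, 0.04].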

SLAM covers a wide range of applications and can be classified in many ways according to sensors, application fields, and core algorithms. By sensor type, SLAM can be divided into 2D and 3D SLAM based on laser radar, RGB-D SLAM based on depth cameras, visual SLAM based on vision sensors, and visual inertial odometry SLAM based on vision sensors and inertial measurement units.

2D SLAM based on laser radar is relatively mature. Thrun et al.¹¹² established the basic framework of laser radar SLAM, and 2D SLAM has been studied very thoroughly. Grid mapping methods have a history of more than 10 years. Cartographer, Google’s open-source laser radar SLAM system, can integrate information from an inertial measurement unit (IMU) and handles 2D and 3D SLAM in a unified way. At present, 2D SLAM has been successfully applied in floor-cleaning robots.

RGB-D SLAM based on depth cameras has developed rapidly in the past few years. Since the introduction of Kinect, several important algorithms have appeared in quick succession, such as KinectFusion,¹¹³ Kintinuous,¹¹⁴ voxel hashing,¹¹⁵ and DynamicFusion.¹¹⁶

Vision sensors include monocular cameras, binocular cameras, fish-eye cameras, and so on. Because vision sensors are inexpensive and can be used both indoors and outdoors, visual SLAM has become a research hot spot. Early visual SLAM systems such as MonoSLAM are essentially filtering methods carried over from robotics. At present, optimization methods from computer vision, such as the bundle adjustment used in structure from motion, are more frequently used. According to how visual information is extracted, visual SLAM can be divided into direct methods and feature-based methods; representative algorithms include ORB-SLAM (a feature-based monocular SLAM system),¹¹⁷ SVO (semi-direct monocular visual odometry),¹¹⁸ and DSO (direct sparse odometry).¹¹⁹

Vision sensors perform poorly in textureless areas. An IMU, in contrast, measures angular velocity and acceleration with its built-in gyroscope and accelerometer, from which the camera pose can be calculated. An IMU is therefore highly complementary to vision sensors, and visual inertial odometry SLAM, which fuses IMU and visual measurements, has become another research hot spot. Depending on the fusion method, visual inertial odometry SLAM can be divided into two categories: filtering-based methods and optimization-based methods. Representative algorithms include the extended Kalman filter,^120,121 the multi-state constraint Kalman filter,¹²² preintegration,^123,124 and OKVIS (open keyframe-based visual-inertial SLAM).¹²⁵

Overall, compared with SLAM based on laser radar or depth cameras, visual SLAM and visual inertial odometry SLAM are not yet mature: they are difficult to operate, often need to be fused with other sensors, and tend to be used in controlled environments.

Representative applications include virtual and augmented reality, where virtual objects are rendered into the environment according to the map and the current viewpoint provided by SLAM, which greatly enhances their realism. In the field of UAVs,¹²⁶ SLAM can be used for map building,¹²⁷ autonomous obstacle avoidance,¹²⁸ and path planning.^129,130 In autonomous vehicles, SLAM provides visual odometry that is fused with other localization techniques.¹³¹ In robot localization and navigation,^132,133 SLAM can be used to build environment maps,¹³⁴ on which robots can base path planning,¹³⁵ autonomous searching, and navigation tasks.¹³⁶

3D sensing methods

3D sensing

Applications such as object recognition, 3D SLAM, and target tracking require the robot’s vision system to generate 3D image data. Both passive and active techniques can be used for this purpose, and each is described in the following subsections. The 3D sensing techniques applied in robotic vision are summarized in Table 1.

Table 1.

Typical 3D sensing techniques applied in robotic vision.¹³⁷

Technique	Principle	Modality	Data form	Measuring form
Stereo vision	Triangulation	Passive	Range	Direct
SFS	Monocular images	Passive	Surface orientation	Indirect
SFT	Monocular images	Passive	Surface orientation	Indirect
Structured lighting	Triangulation	Active	Range	Direct
TOF	Time delay	Active	Range	Direct

3D: three-dimensional; TOF: time of flight; SFT: shape from texture; SFS: shape from shading.

Stereo imaging

A stereo system uses two cameras to capture images of an object from two different viewpoints. Like human binocular vision, it recovers the 3D position of an object point from the difference between the point’s projected locations in the two images, although the computation is relatively complex. A typical stereo vision system is shown in Figure 4(a). For stereo cameras with parallel optical axes (Figure 4(b)), the 3D position of the point P can be derived from the following equations

z = \frac{f \, b}{x_{l} - x_{r}} = \frac{f \, b}{d} \qquad (3)

x = \frac{x_{l} \, z}{f} = b + \frac{x_{r} \, z}{f} \qquad (4)

y = \frac{y_{l} \, z}{f} = \frac{y_{r} \, z}{f} \qquad (5)

Figure 4.

(a) Typical stereo vision system.¹³⁸ (b) Simple model: optic axes of two parallel cameras.¹³⁸

where f is the focal length, b is the baseline, (x_l, y_l) and (x_r, y_r) are the corresponding image points in the left and right images, and d = x_l − x_r is the disparity of the point.
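The parallel-axis stereo relations translate directly into code. The sketch assumes rectified images with coordinates measured from each principal point; the numbers in the example (f = 500 pixels, b = 0.1 m) are hypothetical.

```python
def triangulate_parallel_stereo(xl, yl, xr, f, b):
    """3D point from a rectified stereo pair with parallel optical axes.

    Image coordinates are measured from each principal point, in the same
    units as f (e.g., pixels); b is the baseline, and the result is in b's units.
    """
    d = xl - xr                     # disparity
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = f * b / d
    x = xl * z / f                  # equivalently b + xr * z / f
    y = yl * z / f
    return x, y, z

print(triangulate_parallel_stereo(xl=50.0, yl=10.0, xr=25.0, f=500.0, b=0.1))  # (0.2, 0.04, 2.0)
```

Since z = f·b/d, a one-pixel disparity error changes z much more when d is small (distant points) than when d is large, which is why stereo depth accuracy degrades with distance.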

This technique has been widely used in robotics, for example, in mobile robot navigation and mapping,^{139–141} robot hand-eye coordination,¹⁴² human–robot interaction,¹⁴³ obstacle avoidance and path planning,^144,145 and mobile robot localization.¹⁴⁶

Shape from shading

Shape from shading (SFS) is another common method.¹⁴⁷ Because the shading of an image carries information about the contour and shape of the surface, SFS uses the variation of image brightness under given lighting conditions to compute the depth of the object surface and then reconstructs the 3D surface based on a reflectance model. Note that pixel brightness depends on the light source, the camera parameters, the surface material of the target, and other factors.
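SFS inverts an image-formation model. The most common choice is the Lambertian model, in which brightness is I = ρ·max(0, n·l) for albedo ρ, unit surface normal n, and unit light direction l. The forward sketch below renders a depth map under that assumption (orthographic viewing and a distant point light, both assumptions), and it makes plain why SFS needs the light source parameters: the same surface produces a different image when l changes.

```python
import numpy as np

def lambertian_shading(depth, light_dir, albedo=1.0, spacing=1.0):
    """Render the brightness of a depth map z = f(x, y) under a distant light.

    Normals of the surface z = f(x, y) are proportional to (-fx, -fy, 1).
    """
    fy, fx = np.gradient(depth, spacing)
    normals = np.stack([-fx, -fy, np.ones_like(depth)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)
    return albedo * np.clip(normals @ l, 0.0, None)   # no negative light

x = np.linspace(-1.0, 1.0, 5)           # grid step 0.5
X, Y = np.meshgrid(x, x)
ramp = X.copy()                         # hypothetical plane z = x, tilted 45 degrees
for light in ([0, 0, 1], [-1, 0, 1], [1, 0, 1]):
    print(light, np.round(lambertian_shading(ramp, light, spacing=0.5).mean(), 3))
```

The same ramp appears partly lit (about 0.707), fully lit (1.0), or dark (0.0) depending only on the light direction, so recovering shape from one image is ill-posed without knowing l.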

SFS has a wide range of applications and can recover the 3D model of many kinds of objects, except those with mirror-like (specular) surfaces. It has been used for 3D reconstruction in human–robot interaction,^{148–150} and an example is shown in Figure 5. Its disadvantages are that the computation is mathematically involved and the reconstruction results are not precise. Moreover, SFS requires accurate light source parameters, including position and direction, which makes it difficult to apply under complex lighting conditions such as outdoor scenes.

Figure 5.

Shape from shading.¹⁴⁷ (a) A real face image. (b) Surface recovered from (a) by the generic SFS algorithm with the perspective model, with the light source located at the optical center. SFS: shape from shading.

Shape from texture

The surfaces of different objects carry different texture information. This information consists of texture elements (texels), from which the surface orientation can be determined and the corresponding 3D surface recovered. This method is called shape from texture (SFT).¹⁵¹ In theory, texture, as a repeated visual element, covers the object surface at various positions and orientations. When a textured surface is projected onto the image plane, its texture elements are distorted. For example, foreshortening compresses texture elements along the direction in which the surface is tilted, and the compression grows as the angle between the surface and the image plane increases; perspective projection also makes texture elements closer to the camera appear larger. By measuring these distortions in the image, the surface orientation and depth can be computed from the distorted texture elements. SFT places strict requirements on the texture of the object surface, and it requires knowledge of how texture elements are distorted by the imaging projection.
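One classical way to turn texel distortion into surface orientation assumes that the texture elements are circles and that the projection is approximately orthographic. A circle on a plane slanted by σ then images as an ellipse whose minor-to-major axis ratio equals cos σ, and the direction of the minor axis gives the tilt. The sketch applies only the slant relation; the circular-texel assumption is exactly the kind of strict texture requirement noted above, and the axis lengths are hypothetical measurements.

```python
import math

def slant_from_texel(major_axis: float, minor_axis: float) -> float:
    """Surface slant (degrees) of a circular texel imaged as an ellipse.

    Assumes orthographic projection, so minor / major = cos(slant).
    """
    if not 0 < minor_axis <= major_axis:
        raise ValueError("expected 0 < minor_axis <= major_axis")
    return math.degrees(math.acos(minor_axis / major_axis))

print(round(slant_from_texel(20.0, 10.0), 1))  # 60.0
print(round(slant_from_texel(20.0, 20.0), 1))  # 0.0
```

A texel that images as a circle indicates a surface facing the camera, while stronger compression indicates a larger slant.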

Structured lighting

The structured lighting method projects specially designed light onto the object surface and obtains depth information from the known geometry of the projected light.¹⁵² The process consists of two steps. First, a laser projector projects an encoded beam onto the target object to generate feature points. Then, based on the projection model and the geometric pattern of the projected light, the distance between each feature point and the camera’s optical center is calculated by triangulation, which gives the depth of the feature points and enables model reconstruction.

The encoded beam is the structured light, which can take various patterns such as points, lines, and areas. Structured lighting handles objects whose surfaces are flat, have uniform texture, or show only slow gray-level changes, which are difficult cases for passive methods. Because it is easy to implement and highly precise, structured lighting has a very wide range of applications. Several commercial devices are based on structured lighting, such as the Prime Sensor from PrimeSense, the Kinect from Microsoft, and the Xtion PRO LIVE from Asus.

As with the other techniques above, the basic application of structured light in 3D robot vision is depth perception.^153,154 For mobile robots, structured lighting has been used in navigation and obstacle detection,^155,156 scene understanding,¹⁵⁷ and 3D reconstruction.^158,159 Other applications include shape acquisition,¹⁶⁰ object modeling,¹⁶¹ 3D hand-eye robot vision systems,¹⁶² and high-speed 3D structured light imaging for intelligent robotics.¹⁶³

Figure 6 shows the basic principle of structured light 3D measurement. The camera is usually described by a perspective projection model,¹⁶⁴ and the relation between the object space and the image plane can be expressed as follows

s \, [u_{c} \;\; v_{c} \;\; 1]^{T} = M_{c} \, [X_{w} \;\; Y_{w} \;\; Z_{w} \;\; 1]^{T} \qquad (6)

Figure 6.

The basic principle of 3D measurement by structured light.¹³⁷ 3D: three-dimensional.

where (u_c, v_c, 1) and (X_w, Y_w, Z_w, 1) are the homogeneous coordinates of a point P in the camera image coordinate system and the world coordinate system, respectively; s is an arbitrary scale factor; and M_c is the 3 × 4 linear transformation (projection) matrix of the camera. The projector can be regarded as an inverse camera, so it can be described by a model similar to equation (6)

s^{*} \, [u_{p} \;\; v_{p} \;\; 1]^{T} = M_{p} \, [X_{w} \;\; Y_{w} \;\; Z_{w} \;\; 1]^{T} \qquad (7)

where (u_p, v_p, 1) are the homogeneous coordinates of the point P in the projector image coordinate system, s* is an arbitrary scale factor, and M_p is the 3 × 4 linear transformation matrix of the projector. Thus, given the image coordinates (u_ci, v_ci) and (u_pi, v_pi) of each measured point i and the known matrices M_c and M_p, the 3D coordinates (X_w, Y_w, Z_w) of the measured points can be obtained from equations (6) and (7), which together provide four linear equations in three unknowns and are typically solved in the least-squares sense.
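Equations (6) and (7) can be solved jointly as a small linear least-squares problem. Eliminating the scale factors gives two linear equations per view, u(m_3 · X) − m_1 · X = 0 and v(m_3 · X) − m_2 · X = 0, where m_k is the k-th row of M_c or M_p and X = [X_w, Y_w, Z_w, 1]^T. The plain-Python sketch below assumes that M_c and M_p are already known from calibration and that the camera–projector correspondence has been established by decoding the projected pattern; the function names and the example matrices (a rectified camera–projector pair with a hypothetical 0.2 m baseline) are illustrative assumptions.

```python
def project(M, point):
    """Project a 3D point with a 3x4 matrix and return pixel coordinates (u, v)."""
    X = list(point) + [1.0]
    x, y, w = (sum(M[r][j] * X[j] for j in range(4)) for r in range(3))
    return x / w, y / w

def _solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def triangulate(Mc, Mp, uc, vc, up, vp=None):
    """Least-squares (Xw, Yw, Zw) from equations (6) and (7); vp may be omitted."""
    rows = []
    for M, coords in ((Mc, (uc, vc)), (Mp, (up, vp))):
        for k, value in enumerate(coords):
            if value is None:
                continue
            # (value * third row - k-th row) dotted with [X, Y, Z, 1] must equal 0.
            rows.append([value * M[2][j] - M[k][j] for j in range(4)])
    A = [row[:3] for row in rows]
    b = [-row[3] for row in rows]
    # Normal equations (A^T A) x = A^T b for the overdetermined system A x = b.
    AtA = [[sum(a[i] * a[j] for a in A) for j in range(3)] for i in range(3)]
    Atb = [sum(a[i] * bi for a, bi in zip(A, b)) for i in range(3)]
    return _solve3(AtA, Atb)

# Hypothetical rectified setup: identical intrinsics, projector shifted 0.2 m along x.
f, cx, cy, baseline = 500.0, 320.0, 240.0, 0.2
Mc = [[f, 0, cx, 0], [0, f, cy, 0], [0, 0, 1, 0]]
Mp = [[f, 0, cx, -f * baseline], [0, f, cy, 0], [0, 0, 1, 0]]
P = (0.1, 0.2, 2.0)
(uc, vc), (up, vp) = project(Mc, P), project(Mp, P)
print([round(v, 6) for v in triangulate(Mc, Mp, uc, vc, up, vp)])  # [0.1, 0.2, 2.0]
```

If the pattern encodes only one projector coordinate (for example, vertical stripes give u_p only), calling `triangulate` without `vp` still leaves three equations in three unknowns. The normal equations keep the example dependency-free; a production implementation would more likely use an SVD-based solver for better numerical conditioning.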

Time of flight

Time of flight (TOF) is a relatively new method for 3D imaging that shares a similar principle with 3D laser sensors. Its major merit is that it obtains depth information for the whole scene simultaneously instead of scanning the scene point by point.¹⁶⁵ This technique illuminates the objects in a scene with a modulated light source and measures the phase shift between the illumination and its reflection, as shown in Figure 7; the distance D can then be calculated as

D = \frac{c}{2f} \cdot \frac{\varphi}{2\pi}

Figure 7.

TOF principle.¹⁶⁶ TOF: time of flight.

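where c is the speed of light, f is the modulation frequency of the signal, and φ is the phase difference between the radiated and reflected IR signals. The factor of 2 in the denominator accounts for the round trip of the light to the object and back.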
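A short numeric sketch of the TOF equation follows. It assumes the phase φ has already been measured in radians. Because φ wraps around every 2π, distances beyond c/(2f) alias back into the interval [0, c/(2f)), so c/(2f) is the unambiguous range for a given modulation frequency. The 20 MHz value and the function names are illustrative assumptions, not properties of any particular sensor.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def unambiguous_range(mod_freq_hz):
    """Largest distance measurable before the phase wraps past 2*pi: c / (2f)."""
    return SPEED_OF_LIGHT / (2.0 * mod_freq_hz)

def tof_distance(phase_rad, mod_freq_hz):
    """D = c / (2f) * phi / (2*pi), with phi wrapped into [0, 2*pi)."""
    phase = phase_rad % (2.0 * math.pi)
    return unambiguous_range(mod_freq_hz) * phase / (2.0 * math.pi)

f = 20e6  # assumed 20 MHz modulation frequency
print(round(unambiguous_range(f), 3))      # 7.495
print(round(tof_distance(math.pi, f), 3))  # 3.747
```

Raising f improves depth resolution for a given phase precision but shrinks the unambiguous range, which is one reason TOF sensors list a maximum operating range.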

Practical applications of this sensing modality in 3D robot vision include robot navigation,^167,168 collision and obstacle detection,¹⁶⁹ mapping,¹⁷⁰ and 3D reconstruction.¹⁷¹ The attractions of the TOF technique include its low cost, direct depth measurement with a single sensor without relying on any particular form of the target, video-rate depth acquisition, and compact size.

3D sensors for robotics

Robots perform different tasks using intelligent vision sensors in different ways. In this section, we give an overview of popular 3D sensors embedded in robots, as shown in Table 2.

Table 2.

The latest 3D sensors for robotics.

Sensor	Technique	Resolution	Frame rate	Range	IMU
DUO3D stereo camera	Stereo vision	640 × 480	30 fps	—	Yes
RealSense R200	Structured light	640 × 480	60 fps	3–4 m	No
RealSense R300	Structured light	640 × 480	60 fps	0.2–1.2 m	No
RealSense ZR300	Structured light	480 × 360	60 fps	0.5–2.8 m	Yes
MESA SR4000	TOF	176 × 144	54 fps	5/10 m	No
MESA SR4500	TOF	176 × 144	30 fps	0.8–9 m	No
Microsoft Kinect1	Structured light	320 × 240	30 fps	1.2–3.5 m	No
Microsoft Kinect2	TOF	512 × 424	30 fps	0.5–4.5 m	No
PrimeSense Carmine 1.09	Structured light	640 × 480	30 fps	0.35–3 m	No
Argos3D P100	TOF	160 × 120	160 fps	3 m	No
Argos3D P330	TOF	352 × 287	40 fps	0.1–10 m	No
Sentis3D M520	TOF	160 × 120	160 fps	0.1–5 m	No
OrSens3D camera	Stereo vision	640 × 640	15 fps	—	No
Nerian stereo vision sensor SP1	Stereo vision	1440 × 1440	40 fps	—	No
Bumblebee2	Stereo vision	648 × 488	48 fps	0.1–20 m	No
Bumblebee XB3	Stereo vision	1280 × 960	16 fps	0.1–20 m	No
SICK Visionary-T	TOF	144 × 176	30 fps	1–7 m	No
Orbbec Astra	Structured light	640 × 480	30 fps	0.4–8 m	No
Basler ToF ES camera	TOF	640 × 480	20 fps	0–13 m	No

3D: three-dimensional; IMU: inertial measurement unit; TOF: time of flight.

Future trends

Although 3D robot vision has been applied in many areas, the techniques are still far from mature. With the progress of software and hardware technology, core techniques such as object recognition, localization, mapping, and planning need to be continually updated, and increasingly complex application demands keep emerging as society develops. Vision companies and researchers are working not only on new vision techniques but also on improving applications, mainly in the following aspects.

Smart sensor for intelligent recognition

Visual recognition is a crucial supporting technology for robotic platforms. From the perspective of the whole robotic vision chain, it can be divided into three steps: imaging, perception, and understanding. Imaging is the capture of images. Perception involves acquiring image content via sensors and extracting perceptual input via algorithms. Understanding builds on visual recognition; in human–robot interaction, for example, robots need intelligent vision techniques to perform face detection, face recognition, and facial attribute analysis to estimate properties such as age and gender.^172,173 Neural networks, now widely used, have naturally been applied to 3D object recognition as well (Figure 8).

Figure 8.

Convolutional neural network for 3D object recognition.¹⁷⁴ (a) The VoxNet architecture. (b) Point clouds from three data sets. 3D: three-dimensional.

Event-sensitive vision sensor

The development of robotic vision builds on existing frameworks, refining the work of predecessors and absorbing the latest achievements from other fields. The emergence of new sensors keeps robotic vision vigorous. If high-quality raw information can be obtained directly, the computational load on subsequent processing will be greatly reduced. For example, the event camera, also called the dynamic vision sensor (DVS; Figure 9), is increasingly used in 3D robotic vision systems for scene reconstruction and tracking¹⁷⁶ because of its low power consumption and very high temporal resolution. If the cost of such sensors comes down, they could bring many changes to robotic vision.
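The sketch below illustrates why an event camera transmits so little redundant data. Under the idealized event-generation model commonly used in the event-camera literature, a pixel reports an event of polarity +1 or −1 only when its log intensity has changed by a contrast threshold C since its last event, so static pixels stay silent. The threshold value, the sample signal, and the function name `pixel_events` are illustrative assumptions; real sensors also exhibit noise, refractory periods, and per-pixel threshold mismatch.

```python
import math

def pixel_events(timestamps, intensities, threshold=0.2):
    """Idealized DVS pixel: emit (t, polarity) whenever log intensity moves by at
    least `threshold` from the reference level set at the previous event.
    Intensities must be positive because the model works in log space.
    """
    events = []
    ref = math.log(intensities[0])
    for t, intensity in zip(timestamps[1:], intensities[1:]):
        log_i = math.log(intensity)
        # A large change can cross the threshold several times, giving several events.
        while abs(log_i - ref) >= threshold:
            polarity = 1 if log_i > ref else -1
            ref += polarity * threshold
            events.append((t, polarity))
    return events

# Constant, then brighter, then darker: only the changes produce output.
t = [0.0, 0.001, 0.002, 0.003]
I = [100.0, 100.0, 150.0, 90.0]
print(pixel_events(t, I))  # [(0.002, 1), (0.002, 1), (0.003, -1), (0.003, -1)]
```

Because each pixel reports changes asynchronously with its own timestamp, the output is a sparse event stream rather than a sequence of frames, which is the kind of input the reconstruction in Figure 9 starts from.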

Figure 9.

Reconstruction based on event camera.¹⁷⁵ (a) Scene and DVS camera, (b) event stream, (c) estimated gradient map, and (d) reconstructed intensity map. DVS: Dynamic Vision Sensor.

Semantic SLAM by machine learning

Since deep learning has proven highly successful in many fields, many researchers have tried to apply its end-to-end approach to reconstruct the SLAM pipeline. Some works have replaced individual components of SLAM with deep learning,^{177–179} but these methods have not shown an overwhelming performance advantage over traditional geometric methods. In the near future, SLAM is likely to absorb the achievements of deep learning to steadily improve its accuracy and robustness; some parts of SLAM may be replaced entirely by deep learning, forming a new framework.

Conventional SLAM focuses on the geometric information of the environment; semantic information should be incorporated in the future. Driven by deep learning, object detection and semantic segmentation techniques are developing rapidly, and the rich semantic information obtained from images can aid geometric inference,¹⁸⁰ as shown in Figure 10.

Figure 10.

Semantic SLAM.¹⁸¹ (a) Overview of the algorithm. (b) Initial map without recognized objects. (c) The tetra pack has been recognized, inserted, and is being tracked. SLAM: simultaneous localization and mapping.

Robot vision for medical applications

Robotic vision has become a critical technology in important medical fields, with applications such as beating heart tracking based on stereo endoscopes,³⁸ 3D shape reconstruction for laparoscopy,⁶⁷ clinician detection using RGB-D data¹⁸² (Figure 11), vision-based navigation for capsule endoscopy,¹⁸³ and deep venous thrombosis detection using an RGB-D sensor.¹⁸⁴ In clinical medicine, a robotic vision system can provide data for distinguishing benign birthmarks from malignant melanomas without surgically removing the birthmark.

Figure 11.

Clinician detection using RGB-D data.¹⁸² (a) The connectivity map when real 3D positions of the nodes are considered. (b) The corresponding depth map used to re-project points into 3D. (c) Pose estimation results, where white and magenta represent left and right arms, respectively. 3D: three-dimensional.

Cloud robotics

Even grasping an object with a simple robot hand requires a huge amount of processing and preprogrammed information, so sophisticated systems such as humanoid robots need very powerful computing capabilities.¹⁸⁵ Cloud robotics has the potential to deliver large gains here. As research in this area progresses worldwide, access to cloud computing infrastructure could give all kinds of robots the processing power and data they need to perform complex, compute-intensive tasks, as shown in Figure 12, letting robots offload work such as image processing and voice recognition. More excitingly, it could make it feasible for robots to download new skills to deal with new tasks.¹⁸⁶

Figure 12.

Robots can benefit from the powerful computation, storage, and communication resources of the modern data center in the cloud.¹⁸⁵

Conclusion

This article summarizes recent developments in 3D robotic vision, including nearly twenty current 3D sensors. Representative works are listed to give readers a general overview of the state of the art. Typical vision techniques addressing industry concerns, such as 3D object representation and SLAM, are discussed. Representative applications are also reported, and a number of vision systems for solving visual acquisition problems are investigated.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Natural Science Foundation of China (grant nos. U1509207, 61325019).

ORCID iD

Yu He

References

1. Horn. Robot vision. Cambridge, MA, USA: MIT press, 1986.

2. Chen. Kalman filter for robot vision: a survey. IEEE Trans Ind Electron 2012; 59(11): 4409–4420.

3. Verma, Kofman. Real-time estimation of 3D human arm motion from markerless images for human-machine interaction. In: Proceedings of the international society for optics and photonics, RI, United States, Vol. 5264, 30 September 2003, pp. 9–19.

4. Atienza, Zelinsky. Intuitive human-robot interaction through active 3D gaze tracking. In: Robotics research. The eleventh international symposium, Berlin, Heidelberg, 2005, pp. 172–181. Berlin, Heidelberg: Springer.

5. Hong, Setiawan, Kim. Gesture recognition based on context awareness for human-robot interaction. In: Advances in Artificial Reality and Tele-Existence, Berlin, Heidelberg, 2006, pp. 1–10. Berlin, Heidelberg: Springer.

6. Burger, Lerasle, Ferrane. Mutual assistance between speech and vision for human-robot interaction. In: IEEE/RSJ International conference on intelligent robots and systems, Nice, France, 22–26 September 2008, pp. 4011–4016.

7. Aranda, Vinagre, Martín. Friendly human-machine interaction in an adapted robotized kitchen. In: IEEE International conference on computers for handicapped persons, Vienna, Austria, 14–16 July 2010, pp. 312–319.

8. Takimoto, Yoshimori, Mitsukura. Classification of hand postures based on 3D vision model for human-robot interaction. In: RO-MAN, Viareggio, 13–15 September 2010, pp. 292–297.

9. Raducanu, Dornaika. Pose-invariant face recognition in videos for human-machine interaction. In: European conference on computer vision, Florence, Italy, 7–13 October 2012, pp. 566–575.

10. van Delden, Chenevert, Burris. Finger tip tracking for manipulator jogging using the Kinect. In: IEEE International conference on technologies for practical robot applications, Woburn, Massachusetts, USA, 11–12 May 2015, pp. 1–5.

11. Lun, Zhao. A survey of applications and human motion recognition with Microsoft Kinect. Int J Pattern Recognit Artif Intell 2015; 29(05): 1555008.

12. Al-Berry, Ebied, Hussein. Human action recognition via multi-scale 3D stationary wavelet analysis. In: IEEE International conference on hybrid intelligent systems, Kuwait, 14–16 December 2014, pp. 254–259.

13. Duchowski. Eye tracking methodology. Theory and practice, Vol. 328. London: Springer, 2007.

14. Liu. Using semantic maps for room recognition to aid visually impaired people. In: IEEE International conference on automation and computing, Colchester, United Kingdom, 7–8 September 2016, pp. 89–94.

15. Wilkowski, Kornuta, Kasprzak. Point-based object recognition in RGB-D images. In: IEEE International conference on intelligent systems IS’14, Warsaw, Poland, 24–26 September 2014, pp. 593–604.

16. Cheng, Jiang. A robust and efficient algorithm for tool recognition and localization for space station robot. Int J Adv Robot Syst 2014; 11(12): 193.

17. Weng, Liang. A vision-based robotic grasping system using deep learning for 3D object recognition and pose estimation. In: IEEE International conference on robotics and biomimetics, Shenzhen, China, 12–14 December 2013, pp. 1175–1180.

18. Sukop. Implementation of two cameras to robotic cell for recognition of components. Appl Mech Mater 2013; 282: 167–174.

19. Tanabe, Cao, Murao. Vision based object recognition of mobile robot with Kinect 3D sensor in indoor environment. In: SICE Annual conference, Akita, Japan, 20–23 August 2012, pp. 2203–2206.

20. Droeschel, Nieuwenhuisen, Beul. Multilayered mapping and navigation for autonomous micro aerial vehicles. J Field Robot 2016; 33(4): 451–475.

21. Yousef KMA, Park, Kak. An approach-path independent framework for place recognition and mobile robot localization in interior hallways. In: IEEE International conference on robotics and automation, Karlsruhe, Germany, 6–10 May 2013, pp. 2669–2676.

22. Gori, Fanello, Odone. A compositional approach for 3D arm-hand action recognition. In: IEEE Workshop on robot vision, Florida, USA, 15–17 January 2013, pp. 126–131.

23. Mohareri, Rad. A vision-based location positioning system via augmented reality: an application in humanoid robot navigation. Int J Humanoid Robot 2013; 10(03): 1350019.

24. Park, Kim. High-precision underwater navigation using model-referenced pose estimation with monocular vision. In: IEEE/OES Autonomous underwater vehicles, Tokyo, Japan, 6–9 November 2016, pp. 138–143.

25. Munguía, Nuno, Aldana. A visual-aided inertial navigation and mapping system. Int J Adv Robot Syst 2016; 13(3): 94.

26. Al-Muteb, Faisal, Emaduddin. An autonomous stereovision-based navigation system for mobile robots. Intell Ser Robot 2016; 9(3): 187–205.

27. Honegger, Pollefeys. Real-time 3D navigation for autonomous vision-guided MAVs. In: IEEE/RSJ International conference on intelligent robots and systems, Hamburg, Germany, 28 September–2 October 2015, pp. 53–59.

28. Omari, Bloesch, Gohl. Dense visual-inertial navigation system for mobile robots. In: IEEE International conference on robotics and automation, Seattle, WA, USA, 26–30 May 2015, pp. 2634–2640.

29. Tsai, Lee, Ann. Machine vision based path planning for a robotic golf club head welding system. Robot Comput Int Manuf 2011; 27(4): 843–849.

30. Aljarboua. Geometric path planning for general robot manipulators. In: Proceedings of the world congress on engineering and computer science, Vol. 2, San Francisco, USA, 20–22 October 2009.

31. Moghadam, Wijesoma, Feng. Improving path planning and mapping based on stereo vision and lidar. In: IEEE International conference on control, automation, robotics and vision, Hanoi, Vietnam, 17–20 December 2008, pp. 384–389.

32. Hrabar. 3D path planning and stereo-based obstacle avoidance for rotorcraft UAVs. In: IEEE/RSJ International conference on intelligent robots and systems, Nice, France, 22–26 September 2008, pp. 807–814.

33. Haltakov, Unger, Ilic. Geodesic pixel neighborhoods for 2D and 3D scene understanding. Comput Vis Image Understand 2016; 148: 164–180.

34. Lee, Ahn, Chung. Visibility-based test scene understanding by real plane search. In: International symposium on visual computing, Las Vegas, Nevada, USA, 1–3 December 2008, pp. 813–822.

35. Jiang, Cheheb. Emotion recognition from scrambled facial images via many graph embedding. Pattern Recognit 2017; 67: 245–251.

36. Tsai, Song, Dutoit. Robust visual tracking control system of a mobile robot based on a dual-Jacobian visual interaction model. Robot Auton Syst 2009; 57(6): 652–664.

37. Shibata, Honma. 3D object tracking on active stereo vision robot. In: IEEE International workshop on advanced motion control, Maribor, Slovenia, 3–5 July 2002, pp. 567–572.

38. Yang, Liu, Zheng. Motion prediction via online instantaneous frequency estimation for vision-based beating heart tracking. Inform Fusion 2017; 35: 58–67.

39. Petrović, Leu, Ristić-Durrant D. Stereo vision-based human tracking for robotic follower. Int J Adv Robot Syst 2013; 10(5): 230.

40. Leroux, Raison, Adadja. Combination of eye-tracking and computer vision for robotics control. In: IEEE International conference on technologies for practical robot applications, Woburn, MA, USA, 11–12 May 2015, pp. 1–6.

41. Faessler, Fontana, Forster. Autonomous, vision-based flight and live dense 3D mapping with a quadrotor micro aerial vehicle. J Field Robot 2016; 33(4): 431–450.

42. Jun, Kang, Yeon. Towards a realistic indoor world reconstruction: preliminary results for an object-oriented 3D RGB-D mapping. Intell Autom Soft Comput 2017; 23(2): 207–218.

43. Sock, Kim, Min. Probabilistic traversability map generation using 3D-lidar and camera. In: IEEE International conference on robotics and automation, Stockholm, Sweden, 16–21 May 2016, pp. 5631–5637.

44. Navarro, Canas. Incremental compact 3D maps of planar patches from RGBD points. In: Robot 2015: second Iberian robotics conference, Lisbon, Portugal, 19–21 November 2015, pp. 659–671.

45. Espinace, Kollar, Roy. Indoor scene recognition by a mobile robot through adaptive object detection. Robot Auton Syst 2013; 61(9): 932–947.

46. Effenberger, Kuhnle, Verl. Fast and flexible 3D object recognition solutions for machine vision applications. In: Proceedings of society of photo-optical instrumentation engineers, image processing: machine vision applications VI, Vol. 8661, Burlingame, California, United States, 5–6 February 2013.

47. Hebert, Krotkov. 3D measurements from imaging laser radars: how good are they? Image Vision Comput 1992; 10(3): 170–178.

48. Honegger, Meier, Tanskanen. An open source and open hardware embedded metric optical flow CMOS camera for indoor and outdoor applications. In: IEEE International conference on robotics and automation, Karlsruhe, Germany, 6–10 May 2013, pp. 1736–1741.

49. Besl PJ. Active optical range imaging sensors. In: Advances in machine vision. New York, USA: Springer, 1989, pp. 1–63.

50. Kim, Ahn, Cho. Bayesian sensor fusion of monocular vision and laser structured light sensor for robust localization of a mobile robot. J Instit Control Robot Syst 2010; 16(4): 381–390.

51. Yang, Chen. Design of a 3-D infrared imaging system using structured light. IEEE Trans Instrumentat Meas 2011; 60(2): 608–617.

52. Cui, Schuon, Chan. 3D shape scanning with a time-of-flight camera. In: IEEE Conference on computer vision and pattern recognition, San Francisco, CA, 13–18 June 2010, pp. 1173–1180.

53. Fehr. Covariance based point cloud descriptors for object detection and classification. PhD Thesis, University of Minnesota, Minnesota, United States, 2013.

54. Biegelbauer, Vincze, Wohlkinger. Model-based 3D object detection. Mach Vis Appl 2010; 21(4): 497–516.

55. Scaramuzza, Achtelik, Doitsidis. Vision-controlled micro flying robots: from system design to autonomous navigation and mapping in GPS-denied environments. IEEE Robot Autom Mag 2014; 21(3): 26–40.

56. Zimmermann, Tiemerding, Fatikow. Automated robotic manipulation of individual colloidal particles using vision-based control. IEEE/ASME Trans Mechatron 2015; 20(5): 2031–2038.

57. Kumar, Kumari, Srivastava. Probabilistic approach for autonomous navigation using stereo vision. In: Software technology and engineering, Chennai, India, 24–26 July 2009, pp. 184–188.

58. Sales, Correa, Osório. 3D vision-based autonomous navigation system using ANN and Kinect sensor. In: International conference on engineering applications of neural networks, London, UK, 20–23 September 2012, pp. 305–314.

59. Ehlers, Brama. A human-robot interaction interface for mobile and stationary robots based on real-time 3D human body and hand-finger pose estimation. In: IEEE International conference on emerging technologies and factory automation, Berlin, Germany, 6–9 September 2016, pp. 1–6.

60. Wang, Yang. SLAM based on information fusion of stereo vision and electronic compass. Int J Robot Autom 2016; 31(3): 206–4748.

61. Hochdorfer, Schlegel. 6 DOF SLAM using a TOF camera: the challenge of a continuously growing number of landmarks. In: IEEE/RSJ International conference on intelligent robots and systems, Taipei, Taiwan, 18–22 October 2010, pp. 3981–3986.

62. Schleicher, Bergasa, Ocaña. Real-time hierarchical stereo visual SLAM in large-scale environments. Robot Auton Syst 2010; 58(8): 991–1002.

63. Geiger, Lauer, Wojek. 3D traffic scene understanding from movable platforms. IEEE Trans Pattern Anal Mach Intell 2014; 36(5): 1012–1025.

64. Ahmad, Ruff, Bulthoff. Dynamic baseline stereo vision-based cooperative target tracking. In: IEEE International conference on information fusion, Heidelberg, Germany, 5–8 July 2016, pp. 1728–1734.

65. Jia, Balasuriya, Challa. Vision based data fusion for autonomous vehicles target tracking using interacting multiple dynamic models. Comput Vis Image Understand 2008; 109(1): 1–21.

66. Rodríguez-Quiñonez, Sergiyenko, Preciado LCB. Optical monitoring of scoliosis by 3D medical laser scanner. Opt Lasers Eng 2014; 54: 175–186.

67. Sugawara, Kiyomitsu, Namae. An optical projection system with mirrors for laparoscopy. Artif Life Robot 2017; 22(1): 51–57.

68. Patel, Kar, Jha. Machine vision system: a tool for quality inspection of food and agricultural products. J Food Sci Technol 2012; 49(2): 123–141.

69. Malamas, Petrakis, Zervakis. A survey on industrial vision systems, applications and tools. Image Vis Comput 2003; 21(2): 171–188.

70. Silvén, Niskanen, Kauppinen. Wood inspection with non-supervised clustering. Mach Vis Appl 2003; 13(5): 275–285.

71. Cubero, Aleixos, Moltó. Advances in machine vision applications for automatic inspection and quality evaluation of fruits and vegetables. Food Bioproc Technol 2011; 4(4): 487–504.

72. Brosnan, Sun. Inspection and grading of agricultural and food products by computer vision systems: a review. Comput Electron Agric 2002; 36(2): 193–213.

73. Ke-ming, Hua-bing. Automated visual inspection and its application on automated inspection. In: Proceedings of the 1st international conference on mechanical engineering and material science, Shanghai, China, 28–30 December 2012.

74. Studholme, Hill, Hawkes. Automated 3-D registration of MR and CT images of the head. Med Image Anal 1996; 1(2): 163–175.

75. Rositi, Frindel, Wiart. Computer vision tools to optimize reconstruction parameters in x-ray in-line phase tomography. Phys Med Biol 2014; 59(24): 7767.

76. Bastan, Byeon, Breuel. Object recognition in multi-view dual energy x-ray images. In: British machine vision conference, Bristol, UK, 9–13 September 2013.

77. Lin, Hsieh, Chang. DSP based uncalibrated visual servoing for a 3-DoF robot manipulator. In: IEEE International conference on industrial technology, Taipei, Taiwan, 14–17 March 2016, pp. 1618–1621.

78. Martínez, Castelán, Arechavaleta. Vision based persistent localization of a humanoid robot for locomotion tasks. Int J Appl Math Comput Sci 2016; 26(3): 669.

79. Garcia-Garcia, Orts-Escolano, Oprea. Multisensor 3D object dataset for object recognition with full pose estimation. Neural Comput Appl 2017; 28(5): 941–952.

80. Zhang, Guo, Chen. Corner-based 3D object pose estimation in robot vision. In: IEEE International conference on intelligent human-machine systems and cybernetics, Hangzhou, China, 27–28 August 2016, pp. 363–368.

81. Wen, Kong. A comparative study of the multi-state constraint and the multi-view geometry constraint Kalman filter for robot ego-motion estimation. In: IEEE International conference on intelligent human-machine systems and cybernetics, Vol. 2, Hangzhou, China, 27–28 August 2016, pp. 466–471.

82. Vicente, Jamone, Bernardino. Robotic hand pose estimation based on stereo vision and GPU-enabled internal graphical simulation. J Intell Robot Syst 2016; 83(3–4): 339–358.

83. Masuta, Motoyoshi, Koyanagi. Direct perception of easily visible information for unknown object grasping. In: International conference on intelligent robotics and applications, Tokyo, Japan, 22–24 August 2016, pp. 78–89.

84. Kent, Behrooz, Chernova. Construction of a 3D object recognition and manipulation database from grasp demonstrations. Auton Robots 2016; 40(1): 175–192.

85. Yilmaz, Karakus. Stereo and Kinect fusion for continuous 3D reconstruction and visual odometry. In: IEEE International conference on electronics, computer and computation, Ankara, Turkey, 7–9 November 2013, pp. 115–118.

86. Wang. Position and orientation measurement for autonomous aerial refueling based on monocular vision. Int J Robot Autom 2017; 32(1): 13–21.

87. Jing, Polden, Lin. Sampling-based view planning for 3D visual coverage task with unmanned aerial vehicle. In: IEEE/RSJ International conference on intelligent robots and systems, Daejeon, Korea (South), 9–14 October 2016, pp. 1808–1815.

88. Riesenhuber, Poggio. Models of object recognition. Nature Neurosci 2000; 3: 1199–1204.

89. Tarr, Williams, Hayward. Three-dimensional object recognition is viewpoint dependent. Nature Neurosci 1998; 1(4): 275–277.

90. Poggio, Bizzi. Generalization in vision and motor control. Nature 2004; 431(7010): 768–774.

91. Marr. Vision: a computational investigation into the human representation and processing of visual information, Vol. 1(2). San Francisco, CA: Freeman and Company, 1982.

92. Tarr, Black. A computational and evolutionary perspective on the role of representation in vision. CVGIP Image Understand 1994; 60(1): 65–73.

93. Faugeras. Three-dimensional computer vision: a geometric viewpoint. London, UK: MIT press, 1993.

94. Hartley, Zisserman. Multiple view geometry in computer vision. Cambridge, UK: Cambridge university press, 2003.

95. Grimson WEL, Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE Trans Pattern Anal Mach Intell 1987; 9(4): 469–482.

96. Fan, Medioni, Nevatia. Recognizing 3-D objects using surface descriptions. IEEE Trans Pattern Anal Mach Intell 1989; 11(11): 1140–1157.

97. Besl. Surfaces in range image understanding. New York, USA: Springer Science & Business Media, 2012.

98. Stein, Medioni. Structural indexing: efficient 3-D object recognition. IEEE Trans Pattern Anal Mach Intell 1992; 14(2): 125–145.

99. Godin, Levine. Structured edge map of curved objects in a range image. In: IEEE Computer society conference on computer vision and pattern recognition, San Diego, California, USA, 4–8 June 1989, pp. 276–281.

100. Chen, Stockman. Object wings: 2 1/2 D primitives for 3D recognition. In: IEEE Computer society conference on computer vision and pattern recognition, San Diego, California, USA, 4–8 June 1989, pp. 535–540.

101. Chen, Stockman. 3D free-form object recognition using indexing by contour features. Comput Vis Image Understand 1998; 71(3): 334–355.

102. Umasuthan, Wallace. Model indexing and object recognition using 3D viewpoint invariance. Pattern Recognit 1997; 30(9): 1415–1434.

103. Liao, Medioni. Representation of range data with B-spline surface patches. In: IAPR International conference on pattern recognition, conference c: image, speech and signal analysis, proceedings, Vol. 3, The Hague, Netherlands, 30 August–1 September 1992, pp. 745–748.

104. Solina, Bajcsy. Recovery of parametric models from range images: the case for superquadrics with global deformations. IEEE Trans Pattern Anal Mach Intell 1990; 12(2): 131–147.

105. Liang, Taubes. Orientation-based differential geometric representations for computer vision applications. IEEE Trans Pattern Anal Mach Intell 1994; 16(3): 249–258.

106. Johnson, Hebert. Surface matching for object recognition in complex three-dimensional scenes. Image Vis Comput 1998; 16(9): 635–651.

107. Best, Mageo. Autonomous construction of three dimensional models from range data. Pattern Recognit 1998; 31(2): 121–136.

108. Chien, Sim, Aggarwal. Generation of volume/surface octree from range data. In: IEEE Computer society conference on computer vision and pattern recognition, Ann Arbor, MI, USA, 5–9 June 1988, pp. 254–260.

109. Crebbin. Octree encoding of objects from range images. Pattern Recognit 1994; 27(5): 727–739.

110. Chhaya, Reddy, Upadhyay. Monocular reconstruction of vehicles: combining SLAM with shape priors. In: IEEE International conference on robotics and automation, Stockholm, Sweden, 16–21 May 2016, pp. 5758–5765.

111. Cadena, Carlone, Carrillo. Past, present, and future of simultaneous localization and mapping: toward the robust perception age. IEEE Trans Robot 2016; 32(6): 1309–1332.

112. Thrun, Burgard, Fox. Probabilistic robotics. Cambridge, MA, USA: MIT press, 2005.

113. Newcombe, Izadi, Hilliges. KinectFusion: real-time dense surface mapping and tracking. In: IEEE International symposium on mixed and augmented reality, Basel, Switzerland, 26–29 October 2011, pp. 127–136.

114. Whelan, Kaess, Fallon. Kintinuous: spatially extended KinectFusion. Robot Auton Syst 2012; 69(C): 3–14.

115. Nießner, Zollhofer, Izadi. Real-time 3D reconstruction at scale using voxel hashing. ACM Trans Graph 2013; 32(6): 169.

116. Newcombe, Fox, Seitz. DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 7–12 June 2015, pp. 343–352.

117. Mur-Artal, Montiel JMM, Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans Robot 2015; 31(5): 1147–1163.

118. Forster, Pizzoli, Scaramuzza. SVO: fast semi-direct monocular visual odometry. In: IEEE International conference on robotics and automation, Hong Kong, China, 31 May–7 June 2014, pp. 15–22.

119. Engel, Koltun, Cremers. Direct sparse odometry. IEEE Trans Pattern Anal Mach Intell 2016, pp(9): 1–1.

120. Choi, Lee, Song. Real-time EKF SLAM system using confidence map of depth information. Int J Appl Eng Res 2016; 11(2): 1077–1081.

121. Zhang, Song. Convergence and consistency analysis for a 3-D invariant-EKF SLAM. IEEE Robot Autom Lett 2017; 2(2): 733–740.

122.

Mourikis

Roumeliotis

. A multi-state constraint Kalman filter for vision-aided inertial navigation. In: IEEE International conference on Robotics and automation, Roma, Italy, 10–14 April 2007, pp. 3565–3572.

123. Forster, Carlone, Dellaert. IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. In: Robotics: science and systems, Ann Arbor, Michigan, USA, 18–19 June 2015.

124. Forster, Carlone, Dellaert. On-manifold preintegration for real-time visual–inertial odometry. IEEE Trans Robot 2017; 33(1): 1–21.

125. Leutenegger, Lynen, Bosse. Keyframe-based visual–inertial odometry using nonlinear optimization. Int J Robot Res 2015; 34(3): 314–334.

126. Aguilar, Rodríguez, Álvarez. Visual SLAM with an RGB-D camera on a quadrotor UAV using on-board processing. In: International work-conference on artificial neural networks, Cádiz, Spain, 14–16 June 2017, pp. 596–606.

127. Newnham, Fisher, Fang. Improving the field of regard of a time-of-flight camera for UAV mapping and collision avoidance in cluttered environments. In: 17th Australian international aerospace congress: AIAC 2017, Melbourne, Australia, 26 February–2 March 2017, p. 297.

128. Olivares-Mendez, Suarez-Fernandez. Monocular visual-inertial SLAM-based collision avoidance strategy for fail-safe UAV using fuzzy logic controllers. J Intell Robot Syst 2014; 73(1–4): 513–533.

129. Liao, Lai. 3D motion planning for UAVs in GPS denied unknown forest environment. In: IEEE Intelligent vehicles symposium, Gothenburg, Sweden, 19–22 June 2016, pp. 246–251.

130. Dong, Kayacan. RRT-based 3D path planning for formation landing of quadrotor UAVs. In: IEEE International conference on control, automation, robotics and vision, Phuket, Thailand, 13–15 November 2016, pp. 1–6.

131. Demim, Nemra, Louadj. Robust SVSF-SLAM for unmanned vehicle in unknown environment. IFAC-PapersOnLine 2016; 49(21): 386–394.

132. Huang, Zhang. The mobile robot SLAM based on depth and visual sensing in structured environment. In: Robot intelligence technology and applications, Bucheon, Korea, 14–16 December 2015, pp. 343–357.

133. Guerra, Bolea, Grau. Human-robot SLAM in industrial environments. In: IEEE International conference on industrial informatics, Cambridge, United Kingdom, 22–24 July 2015, pp. 390–395.

134. Song, Chen, Cui. Autonomous navigation and mapping for mobile robot in unknown environment using line segments. In: ASME 2016 Conference on information storage and processing systems, Santa Clara, California, USA, 20–21 June 2016, p. V001T07A009.

135. Valencia, Morta, Andrade-Cetto. Planning reliable paths with pose SLAM. IEEE Trans Robot 2013; 29(4): 1050–1059.

136. Gulde, Karcher, Curio. Vision-based SLAM navigation for vibro-tactile human-centered indoor guidance. In: European conference on computer vision, Amsterdam, Netherlands, 8–16 October 2016, pp. 343–359.

137. Sansoni, Trebeschi, Docchio. State-of-the-art and applications of 3D imaging sensors in industry, cultural heritage, medicine, and criminal investigation. Sensors 2009; 9(1): 568–601.

138. Bhatti. Current advancements in stereo vision. Vienna, Austria: I-Tech Education and Publishing Kirchengasse, 2012.

139. Murray, Little. Using real-time stereo vision for mobile robot navigation. Auton Robots 2000; 8(2): 161–171.

140. Murray, Jennings. Stereo vision based mapping and navigation for mobile robots. In: IEEE International conference on robotics and automation, Vol. 2, Albuquerque, New Mexico, 20–25 April 1997, pp. 1694–1699.

141. Kriegman, Triendl, Binford. Stereo vision and navigation in buildings for mobile robots. IEEE Trans Robot Autom 1989; 5(6): 792–803.

142. Hager, Chang, Morse. Robot hand-eye coordination based on stereo vision. IEEE Control Syst Mag 1995; 15(1): 30–39.

143. Seemann, Nickel, Stiefelhagen. Head pose estimation using stereo vision for human-robot interaction. In: IEEE International conference on automatic face and gesture recognition, Seoul, South Korea, 17–19 May 2004, pp. 626–631.

144. Sabe, Fukuchi, Gutmann. Obstacle avoidance and path planning for humanoid robots using stereo vision. In: IEEE International conference on robotics and automation, Vol. 1, New Orleans, LA, USA, 26 April–1 May 2004, pp. 592–597.

145. Bertozzi, Broggi. GOLD: a parallel real-time stereo vision system for generic obstacle and lane detection. IEEE Trans Image Process 1998; 7(1): 62–81.

146. Lowe, Little. Vision-based mobile robot localization and mapping using scale-invariant features. In: IEEE International conference on robotics and automation, Vol. 2, Seoul, South Korea, 21–26 May 2001, pp. 2051–2058.

147. Prados, Faugeras. Shape from shading. In: Handbook of mathematical models in computer vision, Springer, Boston, MA, 2006, pp. 375–388.

148. Atick, Griffin, Redlich. Statistical approach to shape from shading: reconstruction of three-dimensional face surfaces from single two-dimensional images. Neural Comput 1996; 8(6): 1321–1340.

149. Lei, Liu. Shape from shading and optical flow used for 3-dimensional reconstruction of endoscope image. Acta Otolaryngol 2016; 136(11): 1190–1192.

150. Peng, Feng. B-spline shape from motion & shading: an automatic free-form surface modeling for face reconstruction. arXiv preprint arXiv:1601.05644, 2016.

151. Konolige. Projected texture stereo. In: IEEE International conference on robotics and automation, Anchorage, AK, USA, 3–8 May 2010, pp. 148–155.

152. Chen, Hung, Chiang. Range data acquisition using color structured lighting and stereo vision. Image Vis Comput 1997; 15(6): 445–456.

153. Boyer, Kak. Color-encoded structured light for rapid active ranging. IEEE Trans Pattern Anal Mach Intell 1987; 9(1): 14–28.

154. Chen, Zhang. Vision processing for realtime 3-D data acquisition based on coded structured light. IEEE Trans Image Process 2008; 17(2): 167–176.

155. Wei, Gao. Indoor mobile robot obstacle detection based on linear structured light vision system. In: IEEE International conference on robotics and biomimetics, Bangkok, Thailand, 22–25 February 2008, pp. 834–839.

156. King, Weiman. HelpMate autonomous mobile robot navigation system. In: International society for optics and photonics, Vol. 1388, San Jose, CA, 1 March 1991, pp. 190–198.

157. Silberman, Fergus. Indoor scene segmentation using a structured light sensor. In: IEEE International conference on computer vision workshops, Barcelona, Spain, 6–13 November 2011, pp. 601–608.

158. Sarafraz, Haus. A structured light method for underwater surface reconstruction. ISPRS J Photogramm Remote Sens 2016; 114: 40–52.

159. Johnson-Roberson, Bryson, Friedman. High resolution underwater robotic vision-based mapping and three-dimensional reconstruction for archaeology. J Field Robot 2017; 34(4): 625–643.

160. Zhang, Curless, Seitz. Rapid shape acquisition using color structured light and multi-pass dynamic programming. In: First international symposium on 3D data processing visualization and transmission, Padova, Italy, 19–21 June 2002, pp. 24–36.

161. Park, DeSouza, Kak. Dual-beam structured-light scanning for 3-D object modeling. In: IEEE International conference on 3-D digital imaging and modeling, Quebec City, Quebec, Canada, 28 May–1 June 2001, pp. 65–72.

162. Park, Kim. 3D hand-eye robot vision system using a cone-shaped structured light for the SICE-ICASE international joint conference 2006 (SICE-ICCAS 2006). In: IEEE International joint conference, Vancouver, BC, Canada, 16–21 July 2006, pp. 2975–2980.

163. Cappelleri. High-accuracy, high-speed 3D structured light imaging techniques and potential applications to intelligent robotics. Int J Intell Robot Appl 2017; 1(1): 86–103.

164. Huang, Wang, Gao. Projector calibration with error surface compensation method in the structured light three-dimensional measurement system. Opt Eng 2013; 52(4): 043602.

165. Jarvis. A laser time-of-flight range scanner for robotic vision. IEEE Trans Pattern Anal Mach Intell 1983; PAMI-5: 505–512.

166. Foix, Alenya, Torras. Lock-in time-of-flight (ToF) cameras: a survey. IEEE Sens J 2011; 11(9): 1917–1926.

167. May, Werner, Surmann. 3D time-of-flight cameras for mobile robotics. In: IEEE/RSJ International conference on intelligent robots and systems, Beijing, China, 9–13 October 2006, pp. 790–795.

168. Prusak, Melnychuk, Roth. Pose estimation and map building with a time-of-flight camera for robot navigation. Int J Intell Syst Technol Appl 2008; 5(3–4): 355–364.

169. Droeschel, Holz, Stückler. Using time-of-flight cameras with active gaze control for 3D collision avoidance. In: IEEE International conference on robotics and automation, Anchorage, AK, USA, 3–8 May 2010, pp. 4035–4040.

170. May, Droeschel, Holz. Three-dimensional mapping with time-of-flight cameras. J Field Robot 2009; 26(11–12): 934–965.

171. Liang, Liu, Yan. Accurate 3D reconstruction using a multi-phase TOF camera. In: International society for optics and photonics, Vol. 9273, San Diego, California, United States, 19–23 August 2014, pp. 92733J–92733J-7.

172. Mahmood, Marhaban, Rokhani. FASTA-ELM: a fast adaptive shrinkage/thresholding algorithm for extreme learning machine and its application to gender recognition. Neurocomputing 2017; 219: 312–322.

173. Lai, Pan. Gender recognition using local block difference pattern. In: Advances in intelligent information hiding and multimedia signal processing: proceedings of the twelfth international conference on intelligent information hiding and multimedia signal processing, Vol. 2, Kaohsiung, Taiwan, 21–23 November 2016, pp. 45–52.

174. Maturana, Scherer. VoxNet: a 3D convolutional neural network for real-time object recognition. In: IEEE/RSJ International conference on intelligent robots and systems, Hamburg, Germany, 28 September–2 October 2015, pp. 922–928.

175. Kim, Leutenegger, Davison. Real-time 3D reconstruction and 6-DOF tracking with an event camera. In: European conference on computer vision, Amsterdam, Netherlands, 8–16 October 2016, pp. 349–364.

176. Kim, Handa, Benosman. Simultaneous mosaicing and tracking with an event camera. Solid State Circ 2008; 43: 566–576.

177. Stückler, Kerl. Multi-view deep learning for consistent semantic mapping with RGB-D cameras. In: IEEE/RSJ International conference on intelligent robots and systems, Vancouver, Canada, 24–28 September 2017.

178. Kendall, Cipolla. Modelling uncertainty in deep learning for camera relocalization. In: IEEE International conference on robotics and automation, Stockholm, Sweden, 16–21 May 2016, pp. 4762–4769.

179. Muller, Savakis. Flowdometry: an optical flow and deep learning based approach to visual odometry. In: IEEE Winter conference on applications of computer vision, Santa Rosa, CA, USA, 27–29 March 2017, pp. 624–631.

180. Yang, Song, Kaess. Pop-up SLAM: semantic monocular plane SLAM for low-texture environments. In: IEEE/RSJ International conference on intelligent robots and systems, Daejeon, South Korea, 9–14 October 2016, pp. 1222–1229.

181. Civera, Gálvez-López, Riazuelo. Towards semantic SLAM using a monocular camera. In: IEEE/RSJ International conference on intelligent robots and systems, San Francisco, CA, USA, 25–30 September 2011, pp. 1277–1284.

182. Kadkhodamohammadi, Gangi, de Mathelin. Articulated clinician detection using 3D pictorial structures on RGB-D data. Med Image Anal 2017; 35: 215–224.

183. Mura, Abu-Kheil, Ciuti. Vision-based haptic feedback for capsule endoscopy navigation: a proof of concept. J Micro Bio Robot 2016; 11(1–4): 35–45.

184. Meng, Zhao, Chen. Robot-assisted mirror ultrasound scanning for deep venous thrombosis detection using RGB-D sensor. Multimed Tools Appl 2016; 75(22): 14247–14261.

185. Kehoe, Patil, Abbeel. A survey of research on cloud robotics and automation. IEEE Trans Autom Sci Eng 2015; 12(2): 398–409.

186. Rahimi, Shao, Veeraraghavan. An industrial robotics application with cloud computing and high-speed networking. In: IEEE International conference on robotic computing, Taichung, Taiwan, 10–12 April 2017, pp. 44–51.