Abstract
Nowadays, multi-sensor fusion is a popular tool for feature recognition and object detection. Integrating various sensors allows us to obtain reliable information about the environment. This article proposes a 3D robot trajectory estimation method based on a multimodal fusion of 2D features extracted from color images and 3D features extracted from 3D point clouds. First, we collected a set of images using a monocular camera and trained a Faster Region Convolutional Neural Network (Faster R-CNN). Using the Faster R-CNN, the robot detects 2D features from the camera input, and 3D features are extracted using the normal distribution of points in the 3D point cloud. Then, by matching the 2D image features to the 3D point cloud, the robot estimates its position. To validate our results, we compared the trained neural network with similar convolutional neural networks and evaluated their response for mobile robot trajectory estimation.
Introduction
Indoor environments may include slopes to transition between multilevel areas. Most structured environments provide an even surface that simplifies robot mapping and exploration, where feature extraction is easier than in unstructured environments. Because modern indoor infrastructure includes slopes, mobile robots must be able to navigate multilevel areas.
RGB cameras and light detection and ranging (LIDAR) sensors allow robots to explore structured, even-surface scenarios with a robust response. 1–3
RGB cameras capture scenes as 2D images that can be segmented into pixels and superpixels. LIDAR is an active sensor that does not depend on lighting conditions and provides accurate distance measurements. However, RGB camera performance depends on illumination, and LIDAR point clouds lack texture and color information. To overcome these limitations, we can use multimodal sensor fusion. 4–7
Multimodal fusion uses region levels, 8 and conditional random fields (CRFs) 9 help to model contextual information, but some LIDAR information is lost, resulting in labeling problems. 6,10 Since 3D LIDAR point clouds are noisy, one solution is to treat the point cloud as a mesh. 11,12 Point cloud labeling or mesh treatment is viable for large outdoor scenarios; for indoor scenarios, however, feature extraction has limitations. Mobile robot localization in unstructured scenarios with uneven or multilevel surfaces remains a challenge.
The proposed method aims to provide an efficient solution for image feature detection and mobile robot localization in indoor environments. Modern indoor environments are provided with ramps and access points, for example for wheelchairs, that connect different levels. Similar feature extraction methods identify and extract features with a good response; however, our method enables robust feature extraction and robot localization in indoor environments that include multilevel surfaces.
Since 3D point cloud treatment is critical for mobile robot exploration, we propose a multimodal sensor fusion for robot localization on multilevel surfaces employing an RGB camera and a 3D LIDAR. Using a convolutional neural network (CNN), we extracted 2D features from RGB images and matched them into a 3D point cloud. To perform the 2D feature detection, we trained a Faster Region Convolutional Neural Network (Faster R-CNN) on the RGB camera input.
In parallel, the robot extracts features from the 3D point cloud generated by the 3D LIDAR. Finally, we match the features from the 3D point cloud and the camera. Figure 1 shows the pipeline of the proposed concept. The contributions of this article are as follows: a trained neural network for 2D feature detection in multilevel scenarios and a 3D LIDAR–2D camera fusion that enables mobile robot trajectory estimation based on rapid feature detection. Figure 2 shows the proposed neural network training strategy.

Proposed features-fusion model pipeline.

Feature extraction and training strategy using Faster R-CNN. Faster R-CNN: Faster Region Convolutional Neural Network.
Related work
To review the background studies, we divided them into two major areas. The first is CNNs for object and feature detection techniques. The second is the 3D LIDAR point cloud and RGB camera fusion.
ConvNet-based approaches for object detection
ConvNet is an image feature extractor. 13,14 The most popular object detectors are sliding-window and region-based. Sliding-window ConvNets: This classic method for object detection employs the sliding-window mechanism suggested by Sermanet et al. 15 Region-based ConvNets: R-CNN 16 and selective search 10 are methods for object proposal generation. Faster R-CNN 17 builds on spatial pyramid pooling networks. 18 In Faster R-CNN, the image passes through convolutional layers, and the network is trained end to end. The fully convolutional network 19 improves object detection and time efficiency. Xiang et al. modify Faster R-CNN using 3D voxel patterns. 20 Single-shot object detectors: You Only Look Once (YOLO) 21,22 and the single-shot detector 23 use a single ConvNet. YOLO divides every input image into a grid, and each grid cell detects objects within bounding boxes. Although using the entire image improves detection accuracy during training, detecting small objects can be challenging. Cai et al. 24 implement detection at multiple intermediate layers to deal with objects of different sizes. Oliveira et al. 25 proposed outdoor localization based on speed-invariant inertial transformation and deep learning, with applications to terrain classification. For 3D object detection, Yang et al. 26 use convolutional features and cascade classifiers to reject negative object proposals. Li et al. 27 use a recurrent neural network to fuse 2D LIDAR and inertial measurement unit (IMU) data.
3D LIDAR and RGB camera fusion
3D LIDAR is critical for 3D scene perception, as it captures data both day and night. Combining 3D LIDAR with 2D and 2.5D images allows better 3D scene perception. Shinzato et al. 28 used a graphical method to recognize obstacles. Approaches such as Xiao et al. 6,29 use CRFs. Among CNN approaches, Eitel et al. 30 propose object detection combining color images and depth maps. Schlosser et al. 31 transformed LIDAR point clouds into a horizontal disparity, height above ground, and angle (HHA) encoding fused with RGB. Asvadi et al. 32 integrated LIDAR and a color camera using deep learning for object detection. Bellone et al. 33 employ a support vector machine (SVM) that identifies roads using 3D LIDAR data. Zhou et al. 34 built an online-learning road detector. Quan et al. 35 project 2D lines into 3D lines; however, the approach depends on geometric computation to initialize bundle adjustment. Ouyang et al. 36 fuse odometry and wheel-encoder data to provide localization, but the approach is highly dependent on the gyroscope for positioning.
Mobile robots carry sensors such as mono and stereo cameras, sonar, and 2D and 3D LIDAR. 37,38 3D LIDAR is an important solution for high-level safety and environment recognition. The wide field of view, distance measurement, and night-vision capability are among the advantages of 3D LIDAR; the cost of the integrated mechanical parts and the high power requirements are its major limitations. Wisth et al. 39 use multi-sensory odometry for mobile robot localization, combining visual references with IMU data. He et al. 40 integrate a global navigation satellite system with simultaneous localization and mapping pose estimation to perform large-scale 3D map building, using global positioning for pose estimation.
Feature detection uses monocular cameras fused with various sensors. For example, to perform road-background detection and classification, multi-sensor methods divide an image into pixels and superpixels. Machine learning techniques have important applications here, such as Gaussian mixtures, 41 SVMs, 4,42,43 boosting, 44 and structured random forests. 6 These methods classify each unit independently, but the prediction can be noisy. Xu et al. 45 proposed a multi-sensory fusion using a factor graph topology for optimal navigation. 3D LIDAR is crucial for autonomous vehicles too. Markov random fields can model LIDAR information to generate a grid map. 46 Yuan et al. 47 proposed location-based landmark recognition and used a novel quadrupole potential field for obstacle avoidance. Shinzato et al. 5 propose a simple camera–LIDAR fusion for road detection, but LIDAR and cameras each have drawbacks. Sensor fusion is an alternative to a single sensing modality. Work on data fusion 26 techniques for multimodal object detection falls into three categories: low level, which combines the raw sensor data; middle level, which integrates the detected features; and high level, which combines classified outputs. 48
For pedestrian detection, Premebida et al. 49 use a Velodyne LIDAR with color data. Combining color images and depth maps improves object detection performance. González et al. 50 use depth maps and color images as inputs. Schlosser et al. 31 use ConvNet-based fusion for pedestrian detection. Deep learning enhances the HHA data channel from 3D LIDAR. 51,52 These approaches take color images and 3D LIDAR point clouds as inputs and extract region-wise features.
Multimodal feature detection
We propose a multimodal feature detection method based on a 3D LIDAR and an RGB camera. The RGB images are collected with a Kinect camera and the 3D point clouds with a Quanergy 3D LIDAR. Since the projection of the 3D LIDAR points into the image is sparse, only the main corners of the 3D point cloud are extracted. The corner extraction process is described in sections “Spatial planar coordinates transformation” and “Division of voting space.” Then, the extracted main corners are fused with the 2D features from the RGB images.
3D point cloud plane extraction
An efficient way to represent a 3D LIDAR point cloud is to segment it into small-scale 3D scenes. A Kd-tree accelerates the point cloud segmentation using the normal vector of each point in the 3D point cloud. To proceed with the normal estimation, we use a K-nearest-neighbor search around each pending point. Then, using the pending point and its neighbors, the normal vectors are estimated with principal component analysis. Here,
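As an illustration of this step, the normal estimation can be sketched as follows. This is a numpy-only sketch: the brute-force neighbor search stands in for the Kd-tree used in the paper, and the neighborhood size `k` is an illustrative assumption.

```python
import numpy as np

def estimate_normals(points, k=10):
    """Estimate a unit normal for every point via k-NN + PCA.

    points: (N, 3) array. Returns an (N, 3) array of normals.
    """
    # Brute-force pairwise distances (a Kd-tree would replace this
    # for large point clouds).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]  # k nearest neighbors per point
    normals = np.empty_like(points, dtype=float)
    for i, nb in enumerate(idx):
        neighborhood = points[nb]
        # PCA: the eigenvector of the neighborhood covariance with the
        # smallest eigenvalue approximates the local surface normal.
        cov = np.cov(neighborhood.T)
        eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
        normals[i] = eigvecs[:, 0]
    return normals
```

The sign of each normal is arbitrary here; a consistent orientation (e.g. toward the sensor origin) would be fixed in a later step.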
Spatial planar coordinates transformation
We follow some procedures from Zhang et al. (1) for planar extraction. First, we transformed the spatial planar coordinates to polar coordinates using equation (1)
where
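Equation (1) and its symbol definitions are not reproduced here. As an assumption, a standard Cartesian-to-polar (spherical) conversion of a unit normal $\mathbf{n} = (n_x, n_y, n_z)$, which plane-extraction methods of this kind commonly use, would read:

```latex
\theta = \arctan\!\left(\frac{n_y}{n_x}\right), \qquad
\varphi = \arccos\!\left(\frac{n_z}{\lVert \mathbf{n} \rVert}\right)
```

Here $\theta$ and $\varphi$ are the azimuth and polar angles of the normal, so points lying on the same plane map to nearby $(\theta, \varphi)$ values.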

Extracted planes from the 3D point cloud: (a) scanned scenario image, (b) original 3D point cloud, (c) normal vector representation with corresponding angles, and (d) extracted 3D planes.
Division of voting space
For plane fitting with the width (
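The voting-space division can be sketched as a Hough-style accumulator over the normal angles. The bin width and vote threshold below are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def dominant_planes(normals, bin_deg=5.0, min_votes=30):
    """Group points whose normals fall in the same angular voting cell.

    normals: (N, 3) unit normals. Returns a dict mapping
    (theta_bin, phi_bin) -> list of point indices voting for that cell.
    """
    theta = np.degrees(np.arctan2(normals[:, 1], normals[:, 0]))
    phi = np.degrees(np.arccos(np.clip(normals[:, 2], -1.0, 1.0)))
    cells = {}
    for i, (t, p) in enumerate(zip(theta, phi)):
        key = (int(t // bin_deg), int(p // bin_deg))  # discretize angles
        cells.setdefault(key, []).append(i)
    # Keep only cells with enough votes to be a plausible plane.
    return {k: v for k, v in cells.items() if len(v) >= min_votes}
```

Each surviving cell corresponds to a candidate plane orientation; a plane would then be fitted to the points voting for it.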
Feature extraction
Once the 3D plane segmentation is complete, the section “3D point cloud feature extraction” describes the procedures for 3D feature extraction from the segmented planes, and the section “2D RGB images feature extraction” describes the 2D feature extraction from RGB images.
3D point cloud feature extraction
Once the 3D point clouds were segmented into planes, we found the intersecting points. The values
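The intersecting points can be computed by solving the linear system formed by three plane equations. This is a minimal sketch under the assumption that each segmented plane is stored as a unit normal n_i and offset d_i with n_i · x = d_i; the storage convention is ours, not the paper's.

```python
import numpy as np

def intersect_three_planes(normals, offsets):
    """Intersection point of three planes n_i . x = d_i, if unique.

    normals: (3, 3) stacked plane normals; offsets: (3,) distances.
    Returns the (3,) corner point, or None for (near-)parallel planes.
    """
    A = np.asarray(normals, dtype=float)
    d = np.asarray(offsets, dtype=float)
    if abs(np.linalg.det(A)) < 1e-9:  # planes do not meet in one point
        return None
    return np.linalg.solve(A, d)
```

For example, the planes x = 1, y = 2, and z = 3 intersect at the corner (1, 2, 3).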

3D point cloud plane intersection: (a) original point cloud, (b) intersected points located, and (c) intersected point located in the point cloud.
2D RGB images feature extraction
A trained Faster R-CNN (2) detects the 2D features of RGB images using a region proposal network. Because our robot employed a Kinect camera to collect RGB images, the experiment environment required standard lighting conditions of 200–300 lux. Since the experiment is oriented to indoor environments, we chose a location with average indoor lighting; this minimum of 200–300 lux allows the mobile robot to extract features.
To proceed with the neural network training and testing, we collected a set of 300 images of the experiment scenario and used transfer learning to train our neural network. All images were resized to 224 × 224, converted to grayscale, and processed with the Canny edge detector. The resulting CNN is composed of 15 convolutional layers and two fully connected layers. We labeled 200 images, identifying the main corners. The Faster R-CNN ran on the entire image during both training and testing. For testing, we used a set of 100 images. Figure 5(a) shows the 2D image captured by the Kinect camera, (b) shows the grayscale version of the captured image, and (c) shows the edges extracted using the Faster R-CNN.
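The image preprocessing can be sketched as follows. This numpy-only version substitutes a simple gradient-magnitude edge map for the Canny detector, and the luminance weights and threshold are illustrative assumptions.

```python
import numpy as np

def preprocess(rgb, threshold=0.2):
    """Grayscale conversion + gradient-magnitude edge map.

    rgb: (H, W, 3) float image in [0, 1], already resized
    (the paper resizes to 224 x 224 before edge detection).
    Returns a binary (H, W) edge mask.
    """
    gray = rgb @ np.array([0.299, 0.587, 0.114])  # luminance weights
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]      # central differences
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold
```

A real pipeline would use a proper Canny implementation (e.g. with hysteresis thresholding); this sketch only illustrates the grayscale-then-edges ordering described above.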

Feature extraction from RGB camera images using a Faster R-CNN: (a) original 2D image obtained by the Kinect camera, (b) main edges in the 2D image, and (c) edges extracted using the Faster R-CNN. Faster R-CNN: Faster Region Convolutional Neural Network.
The 2D features were extracted using equations (10) and (11). Here, i is the sequence number of the prior images, j is the index of features in each image,
2D and 3D feature fusion for robot localization
Adapting the procedure in Rublee et al., 3 we projected the 2D RGB features
Then, a clustering algorithm ran on the 2D feature coordinates. Equation (12) calculates the candidate images
Here, k is the index of the candidates, and the transformation
Here, k is an index of the candidate and
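As a sketch of the clustering step: the text above does not specify the algorithm, so plain k-means over the 2D feature coordinates is assumed here, with illustrative values for k and the iteration count.

```python
import numpy as np

def kmeans_2d(points, k=3, iters=20, seed=0):
    """Plain k-means on 2D feature coordinates.

    points: (N, 2) float array. Returns (centers (k, 2), labels (N,)).
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct input points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels
```

The resulting cluster centers group nearby 2D features into candidate regions before matching them against the 3D corners.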

2D and 3D feature projection and rotation: (a) ray tracing projection on the 3D point cloud, (b) RGB camera and 3D point cloud features on the X–Z plane, and (c) RGB camera and 3D point cloud features on the X–Y plane.
Minimization of the robot 3D localization
Once we obtained the 2D and 3D feature fusion, the robot localization error is minimized for every position and time
Using the values of
where Pi is the pose obtained from the previous feature alignment and fi is the reference robot pose obtained from the IMU. Both fi and Pi are
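A minimal sketch of one plausible form of this minimization: fusing each feature-alignment pose P_i with the IMU reference f_i as a per-step weighted least squares, whose closed-form solution is the weighted mean. The weights and the [x, y, yaw] pose parameterization are assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_poses(P, f, w_feat=0.7, w_imu=0.3):
    """Fuse feature-alignment poses P_i with IMU reference poses f_i.

    Minimizing w_feat * ||x_i - P_i||^2 + w_imu * ||x_i - f_i||^2
    per time step gives the weighted mean in closed form.
    P, f: (T, 3) arrays of [x, y, yaw]. Weights are illustrative.
    """
    P = np.asarray(P, dtype=float)
    f = np.asarray(f, dtype=float)
    return (w_feat * P + w_imu * f) / (w_feat + w_imu)
```

With equal weights, the fused pose is simply the midpoint of the two estimates at each time step.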
Multi-sensor localization.
Results
To test our method, we chose a university location with a multilevel surface. The location is divided into three sections, each connected by a 10° ramp, as shown in Figure 7(c). We used a Kobuki robot, 55 a Kinect camera, 56 a Quanergy M8 3D LIDAR manufactured by Quanergy Systems (Sunnyvale, California, USA), 57 and a laptop computer (8 GB RAM, Intel i7 processor) running MATLAB. Figure 7(a) shows the employed mobile robot and (b) shows the experiment scenario. For the experiment, the robot's motion was assumed to be slip-free. The robot moved with a linear speed of 0.05 m/s, and the sampling period was 1 s. We registered a LIDAR point cloud every 0.5 m. With these sampling parameters, the robot had enough time for multi-sensor acquisition.

Scenario and equipment used for the experiment: (a) mobile robot, (b) experiment scenario, and (c) scenario diagram showing three different levels.
For the neural network training, we collected all the RGB camera images and trained a Faster R-CNN. Our method applies grayscale segmentation before the 2D image feature extraction; this segmentation allows faster extraction of the major obstacles in front of the mobile robot. The robot then localizes itself using the features from the trained neural network and the 3D point cloud.
The proposed method was compared with similar transfer-learning CNNs, VGG16 and AlexNet. Our method using the Faster R-CNN has a lower training loss than the mentioned techniques. All three methods have high training accuracy and a robust response for object recognition. AlexNet and VGG16 have trajectory errors of 0.28 m and 0.24 m, respectively, whereas our method with the Faster R-CNN reduces the trajectory error to 0.16 m. Table 1 shows a quantitative comparison of the proposed method against AlexNet and VGG16. To optimize the neural network weight updates, we evaluated the training loss, which indicates how well the model performs on every set; the lower the training loss, the better the model. Unlike accuracy, training loss is not a percentage but a sum of the errors made on each example in the training or validation set. Loss values show how the model behaves after each optimization iteration, and ideally the loss decreases after each iteration or every few iterations. To obtain the best possible accuracy, we used a mini-batch size of 128 and a maximum of 100 epochs. Figure 8(a) shows a quantitative comparison of the training process of the proposed neural network versus similar networks such as VGG16 and AlexNet.

Mobile robot training performance and trajectory: (a) training root mean square error comparison and (b) mobile robot trajectory compared with the reference values.
Proposed faster neural network training comparison.
Faster R-CNN: Faster Region Convolutional Neural Network.
The ground truth for the robot localization was obtained using the odometer and the IMU integrated into the robot. The information collected from these sensors was fused using an extended Kalman filter within a ROS node. Figure 8(b) compares the obtained trajectory with the ground truth trajectory. The X and Y axes represent the coordinates of each robot pose, measured in meters. As additional validation, we calculated the root mean square error (RMSE) between our trajectory and the ground truth at each level. Table 2 shows the quantitative registration results of the obtained trajectory.
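The per-axis RMSE reported in Table 2 can be computed as follows. This is a minimal sketch; the (T, 2) trajectory array layout is an assumption for illustration.

```python
import numpy as np

def rmse_per_axis(estimated, ground_truth):
    """Per-axis RMSE between an estimated trajectory and ground truth.

    Both arguments: (T, 2) arrays of [x, y] positions in meters.
    Returns (rmse_x, rmse_y).
    """
    err = np.asarray(estimated, dtype=float) - np.asarray(ground_truth, dtype=float)
    # Root of the mean squared error, taken independently per axis.
    return tuple(np.sqrt((err ** 2).mean(axis=0)))
```

Applying this to the obtained trajectory and the EKF-fused ground truth yields the X and Y errors reported for each level.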
RMSE for multi-sensor registration.
RMSE: root mean square error.
Discussion
3D mapping and registration face different challenges, such as overlapping areas or sparse features. Our robotic registration framework detects and computes 2D and 3D features, using two sensors as input for multimodal localization.
The proposed plane extraction is based on the geometric properties of the 3D point cloud. For the image feature extraction, we trained an artificial neural network using transfer learning. Feature matching into the 3D point cloud follows the reference described in the section “2D and 3D feature fusion for robot localization.” Lastly, we proposed Algorithm 1 for multi-sensor fusion and mobile robot localization. In the proposed localization, the mobile robot uses only visual references (color images) and the surrounding environment (3D point clouds). Although we obtained a robust 3D localization during a dynamic scan, we identified two weaknesses. First, if the robot's velocity exceeds the established linear velocity of 0.05 m/s, the robot may not have enough time to process the data from all sensor inputs. Second, the robot cannot cross a ramp steeper than 10° due to its wheel diameter. The proposed localization approach can benefit the service industry by improving the monitoring and control of mobile robots in multilevel areas. As future work, the experiment will be extended to outdoor scenarios, with additional neural network training to detect and extract more scenario features.
Conclusions
We presented a 2D and 3D feature fusion for mobile robot localization in a multilevel area. The 3D point cloud feature extraction based on plane segmentation reduces the point cloud processing time. The Faster R-CNN identifies the main corners in 2D images, and the mobile robot extracts 3D features using 3D point cloud processing. As presented in the Introduction, many methods still rely on positioning sensors such as the IMU or the global positioning system (GPS), which offer a good solution in outdoor scenarios; for indoor scenarios, however, the response of these sensors is limited. The proposed method attains an RMSE of 0.053 m along the X axis and 0.02 m along the Y axis. These values are acceptable for mobile robot indoor exploration, meaning that our method provides reliable localization and an alternative to sensors such as the IMU or GPS. In terms of neural network efficiency, the proposed method reduced the training loss by 55.18% compared to VGG16 and by 69.05% compared to AlexNet, while keeping the lowest robot trajectory error.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by Ministry of Trade, Industry and Energy under Robot Industrial Core Technology Development Project program (20015052 and K_G012000921401) supervised by the KEIT.
