Abstract
This study introduces the YORB-SLAM algorithm, a novel approach that integrates an enhanced ORB-SLAM2 framework with a lightweight YOLOv5 model to improve the robustness and accuracy of visual SLAM systems in indoor dynamic environments. By incorporating a variable threshold FAST corner detection algorithm, we optimize feature point extraction performance under unstable lighting conditions. An improved quadtree algorithm not only accelerates feature extraction but also retains richer image information. Further, we tailor a lightweight YOLOv5 model to our application scenario through self-training and devise a set of dynamic feature point elimination rules, significantly boosting performance in dynamic indoor scenes. Evaluations on six dynamic indoor sequences from the TUM dataset show that YORB-SLAM significantly outperforms the original ORB-SLAM2 in accuracy and exhibits better real-time capabilities than DS-SLAM and DynaSLAM.
Introduction
Simultaneous localization and mapping (SLAM) technology acquires information about the external environment through sensors and, from the matching relationships within that information, determines its own position while building a map of the surroundings. 1 It is a key technology in applications such as augmented reality, 2 mobile robotics, 3 autonomous driving, and drones. 4 SLAM is divided into visual SLAM and laser SLAM according to the type of sensor used. Visual SLAM has developed rapidly in recent years owing to its low cost and the rich information it provides. Unlike laser SLAM, visual SLAM relies on image data to estimate position and build environmental maps, 5 requiring precise feature extraction and matching strategies in the face of lighting changes and dynamic obstacles.
The visual SLAM front-end visual odometry is implemented mainly with direct methods or feature point methods. Direct methods estimate camera motion from the pixel intensities of images and are easily disturbed by external environmental factors. Feature point methods estimate camera motion through reprojection by tracking the position changes of representative image features, offering good resistance to interference. Common feature point extraction algorithms include ORB, 6 SIFT, 7 and SURF. 8 Among these, ORB is widely used in visual SLAM systems because its fast computation meets real-time requirements. However, the feature points extracted by the ORB algorithm tend to cluster and are unevenly distributed, which is not conducive to subsequent camera tracking. 9
Many researchers have improved the ORB detection algorithm. Mur-Artal et al. 10 used a quadtree division algorithm to segment images and then employed the Harris response value for nonmaximum suppression of feature points, effectively improving the uniformity of feature point distribution and enhancing the stability and reliability of feature points. Yao et al. 11 proposed an adaptive threshold ORB feature extraction algorithm based on an improved quadtree algorithm, showing better stability and speed in complex environments. Sun et al. 12 proposed an improved ORB algorithm based on regional segmentation, optimizing the uniformity of feature point distribution. Although issues with feature point mismatches persist, notable progress has been achieved.
In the field of visual SLAM, many excellent SLAM system frameworks have been proposed in recent years, such as DSO, 13 VINS-Fusion, 14 and the ORB-SLAM series. These algorithm frameworks can meet most application needs, especially in improving positioning accuracy and map construction efficiency. However, as the complexity of application scenarios increases, SLAM systems still have shortcomings in specific real-world scenarios, especially in indoor dynamic scenes with pedestrians and other dynamic objects, where existing SLAM systems often struggle to accurately handle these dynamic changes, affecting the system's positioning and mapping performance. 15
To address this challenge, many researchers have proposed excellent algorithms based on the ORB-SLAM framework. Bescos et al. 16 introduced the DynaSLAM method, which precisely eliminates feature points on dynamic objects by combining semantic segmentation with multi-view geometry, thereby improving the performance of SLAM systems in dynamic environments. The DS-SLAM method integrates a semantic segmentation network and greatly improves positioning accuracy in dynamic scenes through motion consistency detection. 17 Dynamic-SLAM 18 uses the SSD network to detect dynamic objects and compensates for detection omissions using the velocity invariance of adjacent frames, improving the accuracy of dynamic object detection, especially in fast-moving scenes. Gong et al.'s method attempts to retain static feature points where dynamic and static detections overlap, improving SLAM performance in dynamic scenes to some extent, but it cannot handle all dynamic feature points correctly. 19 Sun et al.'s VSLAM algorithm uses RGB-D information and dense optical flow tracking to remove dynamic foregrounds. 20 Although this method can handle dynamic objects to some extent, its performance is limited by the accuracy of the optical flow algorithm, particularly in high-speed motion scenes.
Addressing the issues with the ORB feature point algorithm, this article proposes optimizations for the feature point extraction strategy of the front-end visual odometry in the ORB-SLAM2 system. First, we introduce an adaptive threshold calculation method based on the image grayscale mean to mitigate the impact of environmental lighting changes on feature point extraction. By dynamically adjusting the threshold for feature point detection, this method maintains the stability and accuracy of feature point extraction under varying lighting conditions, thereby enhancing the robustness of the SLAM system. Furthermore, we optimize the quadtree splitting rules in ORB-SLAM2, limiting the number of splits and prioritizing feature points with high response values. This strategy not only enhances the matching accuracy of feature points but also improves the system's adaptability to complex scenes, particularly feature-rich or texture-complex indoor environments.
To address the issue of positioning accuracy in indoor dynamic scenes for visual SLAM systems, we introduced the YOLOv5 object detection model. To balance the model size, accuracy, and efficiency, especially considering the deployment needs on embedded devices, we opted to replace YOLOv5's backbone network with MobileNetV3. This adjustment makes the entire SLAM system more suitable for operation on resource-constrained devices while maintaining good object detection and feature point elimination performance. Tailored for the specific needs of indoor dynamic scenes, we designed dynamic feature point elimination rules and pose estimation methods, significantly improving the system's positioning accuracy in dynamic environments.
The algorithm proposed in this article, named YORB-SLAM, builds on the ORB-SLAM2 foundation with a series of optimizations for feature extraction and dynamic feature point handling, significantly enhancing the system's robustness and positioning accuracy. By integrating advanced object detection algorithms, YORB-SLAM effectively addresses the challenges in indoor dynamic environments.
The rest of this article is organized as follows: the second section provides a detailed introduction to the design of the YORB-SLAM framework, optimization strategies for feature extraction, and the object detection network. The third section first analyzes the effectiveness of each optimization through experiments. It then compares the YORB-SLAM algorithm with other exemplary algorithms, showcasing YORB-SLAM's performance in various scenarios and discussing the significance of the research findings. The fourth section concludes the study, summarizing the work and looking forward to future research directions in the field of visual SLAM.
Materials and methods
Overview of the YORB-SLAM system
The YORB-SLAM framework extends the ORB-SLAM2 foundation, retaining its original architecture while incorporating a series of improvements that enhance adaptability to dynamic environments and overall performance. As illustrated in Figure 1, our optimizations are enclosed within the red dashed lines. The framework comprises four parallel threads: an optimized tracking thread, a newly added dynamic feature point removal thread, the original local mapping thread, and the loop closure detection thread. The collaborative operation of these threads enables YORB-SLAM to process image frames in real time within complex environments, effectively tracking the camera's position and orientation while constructing and maintaining a stable map of the environment.

Figure 1. Overall framework of the YORB-SLAM algorithm.
The tracking thread is responsible for real-time tracking of the camera's position and orientation, utilizing ORB feature points for matching to estimate the camera pose. YORB-SLAM optimizes the feature point extraction strategy within the tracking thread to enhance the quality and robustness of feature point extraction.
In the introduced dynamic feature point removal thread, image recognition technology based on the self-trained YOLO-Mo network is employed to effectively identify and eliminate dynamic feature points in image frames, reducing the impact of dynamic environmental factors on system performance and enhancing stability and accuracy.
The local mapping thread uses observed new data to create and update the local map, performs local bundle adjustment to optimize the scene's structure and camera motion, and manages and maintains map points.
The loop closure detection thread is tasked with identifying loop closures in the environment, divided into loop closure detection and correction phases, using a bag-of-words model for detection and global bundle adjustment (BA) for loop closure correction, thereby improving the global consistency of the map.
Optimization of ORB feature extraction algorithm
The ORB algorithm, a widely applied method for feature point extraction in the field of image processing, combines the strengths of FAST feature point detection and BRIEF feature description to offer a fast and stable solution for image feature recognition and description. 21 The quality of ORB feature point extraction is directly linked to the YORB-SLAM system's estimation of camera pose.
Optimization of the FAST corner detection algorithm
The FAST corner detection algorithm judges a target pixel as follows: a circle of radius 3 pixels is drawn around it, and the 16 pixels on the circumference are used for comparison, as shown in Figure 2. Here, p represents the target pixel, and 1∼16 are the comparison points. If, among these 16 comparison points, there are n consecutive pixels whose grayscale value is higher than the target pixel's by at least a threshold T, as in formula (1), or lower by at least T, as in formula (2), then the target pixel p is identified as a FAST corner. The value of n is typically set to 12. This comparison considers not only the grayscale differences between pixels but also the continuity of these differences, which is crucial for corner identification.

Figure 2. FAST feature point.
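Formulas (1) and (2) are not reproduced in this copy; from the symbol definitions that follow, they take the standard FAST brightness-test form:

$I_x \geq I_p + T$ (1)

$I_x \leq I_p - T$ (2)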
Here, Ip represents the grayscale of the target pixel, Ix the grayscale of the 16 comparison pixels, with x ranging from 1 to 16, and T the comparison threshold. To enhance detection speed, a subset of points may be examined initially. For instance, if n is set to 12, the 1st, 5th, 9th, and 13th pixels can be checked first. If at least three of these points satisfy the formulas above, the pixel is considered a candidate corner; otherwise, it is not. This method effectively eliminates most noncorners, reducing unnecessary comparisons and significantly accelerating corner detection speed.
Lighting conditions in real environments are uncertain, and changes in intensity directly alter image grayscale; this is particularly important for algorithms such as FAST corner detection that rely on grayscale comparisons. Consequently, the number of feature points extracted by the traditional FAST corner detection algorithm may vary significantly across different lighting and contrast conditions of the same scene. To obtain the desired number of feature points, thresholds often need to be readjusted, which can destabilize feature point quality and affect subsequent processing steps. Lighting-induced changes in image grayscale can be divided into local and global variations. Local grayscale variations are changes in the grayscale values of specific areas, which reduce the accuracy of fixed-threshold detection; global grayscale variations are shifts in the grayscale values of the entire image, for which a threshold set simply as a percentage of the target pixel's grayscale value is no longer suitable.
To address this, we propose an adaptive threshold determination method based on the average grayscale value of the image. This method first calculates the total grayscale of the 16 pixels on the circumference, excluding extreme values to mitigate the impact of abnormal data under extreme lighting conditions. It then introduces a calculation based on the average grayscale value of the remaining pixels to accommodate situations with multiple extreme values, ultimately determining an adaptive grayscale threshold according to the variation in this average grayscale value, as shown in formula (3). This approach not only considers the overall grayscale level of the image but also improves the flexibility and accuracy of threshold setting by excluding extreme values and calculating averages.
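Formula (3) itself is not reproduced in this copy, so the exact functional form in the following sketch is an assumption; it only illustrates the stated idea of excluding extremes and scaling by the variation of the trimmed mean. The weight w corresponds to the weight parameter of 0.7 used in the experiments later.

```python
import numpy as np

def adaptive_fast_threshold(circle_vals, w=0.7, n_trim=2):
    """Illustrative adaptive threshold: drop the extreme grayscale values
    among the 16 circle pixels, then scale the mean absolute deviation of
    the remaining pixels from their average by a weight w. The exact form
    of formula (3) is an assumption."""
    vals = np.sort(np.asarray(circle_vals, dtype=np.float32))
    trimmed = vals[n_trim:len(vals) - n_trim]     # exclude extreme values
    mean = trimmed.mean()                         # average of remaining pixels
    return w * np.abs(trimmed - mean).mean()      # threshold tracks local contrast
```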
Quadtree algorithm optimization
The quadtree algorithm is a common method for feature point uniformization, suitable for balancing ORB feature points with weak discreteness. Its implementation comprises the following steps (a code sketch follows the list):
1. Initialize the entire image as the root node;
2. Split the node into four equal-sized child nodes;
3. Remove child nodes that do not contain any feature points;
4. Repeat steps 2 and 3 until each child node contains at most one feature point or the extracted feature points reach the desired number; and
5. Retain the feature point with the highest response value in each node and remove the rest.
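A minimal Python sketch of this loop, assuming keypoint objects that expose .pt and .response like OpenCV's cv2.KeyPoint (an illustration, not the ORB-SLAM2 implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    x0: float; y0: float; x1: float; y1: float    # node bounds
    kps: list = field(default_factory=list)       # keypoints inside the node

    def split(self):
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        quads = [Node(self.x0, self.y0, mx, my), Node(mx, self.y0, self.x1, my),
                 Node(self.x0, my, mx, self.y1), Node(mx, my, self.x1, self.y1)]
        for kp in self.kps:
            x, y = kp.pt
            quads[(x >= mx) + 2 * (y >= my)].kps.append(kp)
        return [q for q in quads if q.kps]        # drop empty children (step 3)

def quadtree_distribute(kps, w, h, target):
    nodes = [Node(0, 0, w, h, list(kps))]
    for _ in range(32):                           # depth cap guards degenerate cases
        # steps 2-4: split until every node holds at most one keypoint
        # or enough nodes exist to satisfy the target count
        if all(len(n.kps) <= 1 for n in nodes) or len(nodes) >= target:
            break
        nodes = [c for n in nodes
                 for c in (n.split() if len(n.kps) > 1 else [n])]
    # step 5: keep only the highest-response keypoint per node
    return [max(n.kps, key=lambda kp: kp.response) for n in nodes]
```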
The quadtree algorithm effectively balances the distribution of feature points and enhances their discreteness. However, while making feature point extraction more uniform, it can also introduce two issues. First, when feature points accumulate, the quadtree may undergo many splits, slowing feature point extraction and potentially leading to tracking loss. 12 Second, the strategy considers high-response feature points only in the final retention step, so some high-response feature points are lost, especially in areas dense with them. As indicated by the red boxes in Figure 3, the deeper the color of a point, the higher its response value. During the final deletion step, some high-response points within the same node are removed, meaning significant feature information in the image is lost.

Figure 3. Demonstration of limitations of the quadtree algorithm.
To address these deficiencies in the ORB-SLAM2 quadtree algorithm, this article proposes an improved quadtree algorithm.
The algorithm initially performs conventional quadtree splits, then checks after each round of splitting whether the current number of nodes exceeds 70% of the target number of feature points. If this criterion is met, the splitting stops; otherwise, it continues. After splitting ends, nodes with only one feature point are retained, and all feature points from the remaining nodes are collected into an array, sorted by response value from highest to lowest, and the top n feature points are selected. The value of n is the difference between the target number of feature points and the number of single feature point nodes after splitting, as shown in equation (4).
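Equation (4) is not reproduced in this copy; from the description above it reduces to

$n = N_{target} - N_{single}$ (4)

where N_target is the target number of feature points and N_single is the number of nodes containing exactly one feature point after splitting. Continuing the sketch above, a minimal illustration of the modified stopping rule and top-n selection (the helper structures are the same assumptions as before):

```python
def improved_quadtree_distribute(kps, w, h, target, stop_ratio=0.7):
    """Sketch of the improved splitting: stop once the node count exceeds
    70% of the target, then top up the single-node points with the
    globally strongest responses from the remaining nodes."""
    nodes = [Node(0, 0, w, h, list(kps))]
    for _ in range(32):                           # depth cap guards degenerate cases
        if all(len(n.kps) <= 1 for n in nodes) or len(nodes) > stop_ratio * target:
            break
        nodes = [c for n in nodes
                 for c in (n.split() if len(n.kps) > 1 else [n])]
    singles = [n.kps[0] for n in nodes if len(n.kps) == 1]   # kept directly
    rest = [kp for n in nodes if len(n.kps) > 1 for kp in n.kps]
    rest.sort(key=lambda kp: kp.response, reverse=True)       # high to low
    n_extra = target - len(singles)                           # equation (4)
    return singles + rest[:max(n_extra, 0)]
```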
Object detection
YOLOv5 algorithm light-weighting
In the domain of object detection, the YOLO (You Only Look Once) algorithm has been revolutionary. Since its inception, YOLO has evolved into many versions, with YOLOv5 and YOLOv8 among the most prominent, each with distinct features and advantages. Given the scenario of deploying an object detection network on an indoor mobile robot, YOLOv5 was chosen after weighing computational resources, network model size, and accuracy against our deployment needs for indoor settings.
The design of YOLOv5 centers around four main components: the input stage, backbone network, neck network, and prediction stage, each optimized for enhanced performance and flexibility. 22 In the input stage, YOLOv5 employs various data augmentation techniques such as mosaic augmentation, dynamic anchor calculations, and adaptive image scaling, improving anchor generation and enhancing the diversity and quality of input images. The backbone network extracts features from the input images, combining convolutional layers, C3 layers, and SPPF (Spatial Pyramid Pooling Fast) to efficiently enhance feature extraction. 23 The neck network, utilizing PAN (Path Aggregation Network) and FPN (Feature Pyramid Network) structures, performs feature fusion, enriching the information across different feature levels. In the prediction stage, the algorithm processes loss calculations for bounding boxes and nonmaximum suppression to finalize and refine detection results. 24
To accommodate varying performance and resource requirements, YOLOv5 offers five network models of different sizes: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, each with different width and depth coefficients. This study employs YOLOv5l v6.1, referred to hereafter as YOLOv5l. Although subsequent community research has introduced several newer YOLO models, YOLOv5 remains favored for its light weight, ease of use, and suitability for deployment on mobile devices.
Further, to reduce the network size and complexity of the YOLOv5l model, this study replaces its backbone network with MobileNetV3. Announced by Google in 2019, MobileNetV3, the latest in the MobileNet series, inherits advantageous features from MobileNetV1 and MobileNetV2 and optimizes them further. Built from bneck structures, MobileNetV3 incorporates the depthwise separable convolutions of MobileNetV1 and the inverted residual structures of MobileNetV2, introduces a lightweight SE (Squeeze-and-Excitation) attention mechanism, and replaces the original activation function with h-swish, 25 as shown in Figure 4. The modified YOLOv5 model significantly reduces network size and complexity while meeting computational power requirements, enhancing algorithm speed, and providing a viable solution for efficient object detection on embedded devices.

Figure 4. MobileNetV3 network structure diagram.
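As a concrete illustration of the bneck structure in Figure 4, the following is a minimal PyTorch sketch (our own simplification, not the exact YOLO-Mo module): a 1×1 expansion convolution, a depthwise convolution, an optional SE attention stage, and a 1×1 projection, with an inverted residual connection when input and output shapes match.

```python
import torch.nn as nn
import torch.nn.functional as F

class SqueezeExcite(nn.Module):
    """Lightweight SE attention used inside MobileNetV3 bneck blocks."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)     # squeeze: global average pool
        s = F.relu(self.fc1(s))
        s = F.hardsigmoid(self.fc2(s))           # excite: per-channel gates
        return x * s

class Bneck(nn.Module):
    """Sketch of a bneck block; the seven constructor arguments mirror the
    seven MobileNet_Block parameters described for Table 1."""
    def __init__(self, c_in, c_out, c_exp, kernel, stride, use_se, use_hs):
        super().__init__()
        act = nn.Hardswish() if use_hs else nn.ReLU()
        self.use_res = stride == 1 and c_in == c_out   # inverted residual
        layers = [
            nn.Conv2d(c_in, c_exp, 1, bias=False), nn.BatchNorm2d(c_exp), act,
            nn.Conv2d(c_exp, c_exp, kernel, stride, kernel // 2,
                      groups=c_exp, bias=False),       # depthwise convolution
            nn.BatchNorm2d(c_exp), act,
        ]
        if use_se:
            layers.append(SqueezeExcite(c_exp))
        layers += [nn.Conv2d(c_exp, c_out, 1, bias=False), nn.BatchNorm2d(c_out)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y
```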
The lightweight network proposed in this article is referred to as the YOLO-Mo network. Table 1 presents the backbone network structure of YOLO-Mo.
Table 1. YOLO-Mo backbone network structure.
Table 1 offers an overview of the YOLO-Mo network structure, with “From” indicating input sources, “-1” denoting input from the previous layer's output, “Param” for the number of parameters, “Module” for module names, and “Arguments” for model parameter settings. The “conv_bn_hswish” module includes three adjustable parameters: input channels, output channels, and stride information. The “MobileNet_Block” module features seven adjustable parameters: input channels, output channels, expanded convolution channels, kernel size, stride, SE attention mechanism inclusion, and h-swish activation function usage.
Table 2 compares the network layers and parameter counts of YOLO-Mo, YOLOv5n, and YOLOv5s. Despite having more layers, YOLO-Mo's parameter count is only about half that of YOLOv5s.
Table 2. Comparison of network models.
Discussion on feature point removal rules
The dynamic feature point removal thread optimizes feature point identification and processing through four main tasks, thereby enhancing the overall performance and efficiency of the system. These tasks are: semantic segmentation of input image frames, receiving feature point identification results from the tracking thread, removing dynamic feature points, and sending the remaining feature points back to the tracking thread.
To accurately determine the state of objects in the environment, the system utilizes YOLO-Mo for semantic segmentation to identify object types and locations. However, semantic segmentation alone is insufficient for recognizing the dynamic state of objects, necessitating the introduction of prior judgments to assist in determining whether objects are in motion. Objects are classified into three categories based on their motion characteristics: high dynamic objects, low dynamic objects, and static objects. High dynamic objects are defined as objects capable of autonomous movement, such as people, animals, and certain mechanical products like robot vacuums; low dynamic objects are defined as objects that do not move autonomously but often move with high dynamic objects, such as mobile phones, cups, and chairs; static objects are defined as objects that rarely move, such as tables, air conditioners, and computers.
The criterion for determining dynamic feature points is based on the classification and state of the object they reside in: if a feature point is located within a high dynamic object and not within a static object, it is deemed a dynamic feature point; if a feature point is located within a low dynamic object that overlaps with a high dynamic object, it is also considered a dynamic feature point. All others are considered static feature points.
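These rules can be summarized in a short sketch. The category contents below are illustrative examples taken from the text, not the full label set, and the (label, box) detection format is an assumption about the YOLO-Mo output:

```python
HIGH_DYNAMIC = {"person", "cat", "dog", "robot vacuum"}   # move autonomously
LOW_DYNAMIC = {"cell phone", "cup", "chair"}              # moved by others
STATIC = {"table", "air conditioner", "computer"}         # rarely move

def inside(pt, box):
    """Point containment in an axis-aligned box (x1, y1, x2, y2)."""
    x, y = pt
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]

def overlap(a, b):
    """Axis-aligned box intersection test."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def is_dynamic(pt, detections):
    """Dynamic feature point rule; `detections` is a list of (label, box)
    pairs for the current frame."""
    in_high = any(l in HIGH_DYNAMIC and inside(pt, b) for l, b in detections)
    in_static = any(l in STATIC and inside(pt, b) for l, b in detections)
    if in_high and not in_static:
        return True
    # a point inside a low dynamic object that overlaps a high dynamic object
    return any(l in LOW_DYNAMIC and inside(pt, b)
               and any(l2 in HIGH_DYNAMIC and overlap(b, b2)
                       for l2, b2 in detections)
               for l, b in detections)
```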
To address potential shortages of static feature points following the removal of dynamic feature points, the system adopts an over-extraction strategy for feature points. This means that initially, the system extracts more feature points than anticipated and later decides whether to delete some of the feature points with lower response values based on the actual number of static feature points. This strategy ensures that there are enough static feature points for stable camera pose estimation even after removing some dynamic feature points, while optimizing the efficiency of the system's operation.
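Reusing is_dynamic from the sketch above, the over-extraction strategy might look as follows; the 1.5 over-extraction factor is an assumed value for illustration:

```python
def select_static_points(keypoints, detections, target, over=1.5):
    """Extract more points than needed, drop dynamic ones, then trim the
    surplus by keeping the highest response values."""
    ranked = sorted(keypoints, key=lambda kp: kp.response, reverse=True)
    ranked = ranked[:int(over * target)]                    # over-extract
    static = [kp for kp in ranked if not is_dynamic(kp.pt, detections)]
    return static[:target]                                  # trim by response
```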
Experimental results and discussion
In this section, we validate the efficacy of our work through experiments conducted across multiple public datasets, divided into three parts.
Part 1 compares the original ORB-SLAM2 algorithm with its enhanced version, which solely optimizes the feature point extraction strategy of ORB-SLAM2. Initially, the stability of feature point extraction under various lighting conditions is assessed, followed by a discussion on the dispersion and extraction speed of feature points, and concluding with the verification of feature matching accuracy.
Part 2 involves training the introduced lightweight target detection network on a custom dataset and comparing its performance with other target detection networks to further validate its effectiveness in real indoor scenes.
Part 3 tests the performance of the proposed YORB-SLAM algorithm on six dynamic sequences from the TUM dataset, comparing it with ORB-SLAM2 and other dynamic scene SLAM systems. Subsequently, the error metrics and real-time performance of each algorithm are discussed.
Experimental platform
The experiments in this article were conducted on a computing platform equipped with an AMD Ryzen 5 4600H CPU (base clock 3.00 GHz) with integrated Radeon Graphics and a dedicated NVIDIA GeForce GTX 1650 GPU. The system is complemented with 16 GB of RAM.
In terms of the software environment, the experiments were carried out using the Ubuntu 20.04 operating system. The experiment utilized OpenCV 3.4.15 and OpenCV Contrib 3.4.15 as foundational libraries for image processing and computer vision tasks. PyTorch 1.12.0 served as the deep learning framework, and Python 3.6 was employed as the programming language. Additionally, CUDA 11.3 was used to accelerate deep learning computations.
Stability comparison of feature point extraction quantity under different lighting conditions
To verify the effectiveness of optimizations to the FAST corner detection algorithm, we use the Oxford 5k dataset to compare the original and improved algorithms in terms of light sensitivity and feature point dispersion. This dataset features high-quality images of various Oxford landmarks, divided into predefined categories, making it a standard benchmark for fine-grained recognition and location identification tasks in computer vision research.
The experiments are conducted in three sets under three brightness conditions: 60%, 100%, and 140% of the original image brightness, simulating the variations in lighting intensity that might be encountered in real-world environments. Regarding threshold settings, the thresholds of the original FAST corner extraction algorithm were fixed at 40 and 30; in contrast, the weight parameter of the improved algorithm's adaptive threshold was set to 0.7. To evaluate performance under different brightness levels, the number of feature points extracted from the same image at each brightness level was used as the criterion for judging the algorithm's light sensitivity. Figure 5 compares the improved algorithm with the original ORB-SLAM2 algorithm on the "bikes" sequence of the Oxford 5k dataset.

Figure 5. Comparison of feature point extraction under different lighting conditions.
Table 3 compares the feature point extraction results of the original and improved algorithms on four subsets of the Oxford 5k dataset. The results indicate that the original algorithm experienced significant fluctuations in the number of feature points as brightness changed: a 63% average decrease at 60% of the original image brightness, and a 48% average increase at 140%. This demonstrates the original algorithm's high sensitivity to lighting changes, leading to unstable feature point extraction. In contrast, the improved algorithm exhibited much smaller fluctuations, with a 4% average decrease at 60% brightness and an 11% average increase at 140%. These results suggest that the improved FAST corner detection algorithm is significantly more robust to lighting changes, yielding more stable feature point extraction. This improvement enhances the algorithm's adaptability under different lighting conditions and reduces the risk of tracking failure caused by lighting changes, especially in environments with varying light intensity and in dim scenes.
Table 3. Feature point extraction results under different brightness conditions.
Feature point extraction validation
Under the experimental conditions described in the Experimental Platform section, a comparative validation was conducted between the original ORB-SLAM2 feature extraction algorithm and the improved algorithm using the "freiburg2/xyz" subset of indoor images from the TUM dataset. The TUM dataset, developed by the Computer Vision Group at the Technical University of Munich, is widely utilized for evaluating and benchmarking the performance of visual SLAM systems. It is specifically designed to provide indoor environment data of varying complexity, including scenarios with different lighting conditions, rapid movement, and diverse structures. The experiment compared 10 consecutive pairs of indoor image frames. Figure 6 illustrates a comparative visualization of feature extraction results between the original and the improved ORB-SLAM2 algorithms.

Figure 6. Feature point extraction comparison.
To quantify the dispersion of feature points, the experiment divided each image into nine equal regions and counted the number of feature points within them. The standard deviation of these counts was computed to effectively evaluate the uniformity of feature point distribution across the image frames. The formula for calculating the dispersion of feature points is shown in equation (5):
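$\sigma = \sqrt{\frac{1}{9}\sum_{i=1}^{9}\left(n_i - \bar{n}\right)^2}$ (5)

where n_i is the number of feature points in the ith region and n̄ is their mean; a smaller σ indicates a more uniform distribution. The equation is not reproduced in this copy and is reconstructed here from the preceding description.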
Table 4. Feature point extraction data comparison.
The results indicate that the improved algorithm requires less computation time per frame than the original ORB-SLAM2, a significant efficiency advantage that enhances the real-time capabilities of the visual SLAM system. Although the feature point dispersion of the improved algorithm is slightly inferior to that of the original ORB-SLAM2, this is because the improved algorithm prefers to retain feature points with higher response values and representativeness at the cost of some dispersion. This approach preserves more image features, thereby increasing the accuracy of subsequent feature matching.
Feature point matching experiment
In the feature point matching experiment, the same experimental setup as in the feature point extraction experiment was utilized. Figure 7 compares the matching effectiveness between the original ORB-SLAM2 algorithm and the improved algorithm.

Figure 7. Comparison of feature point matching effectiveness.
The comparison focuses on the correct match count and the matching precision PM of the two feature point extraction algorithms, using indoor images from the TUM freiburg2/xyz sequence. The matching algorithm is a brute-force matcher based on Hamming distance. Matching precision PM is defined by equation (6):
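$P_M = \frac{N_{correct}}{N_{total}} \times 100\%$ (6)

where N_correct is the number of correct matches and N_total the total number of matches. The equation is not reproduced in this copy; the standard precision ratio implied by the surrounding text is shown.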
Table 5. Comparison of feature point matching data.
In the feature point matching experiment, the improved algorithm demonstrated higher numbers of matches and better matching precision. These results further validate the superiority of the improved algorithm in selecting feature points, enabling more effective identification and matching of identical feature points between images, which is crucial for enhancing the tracking and reconstruction accuracy of visual SLAM systems.
In summary, by making a moderate sacrifice in feature point dispersion, the improved ORB feature point extraction algorithm achieves significant advantages in the quality of retained feature points, computational efficiency, and matching performance. These optimizations enhance the algorithm's adaptability to environmental changes as well as its efficiency and accuracy in real-time visual SLAM systems, offering a more reliable and efficient solution for image feature point extraction and matching.
Object detection algorithm
YOLO-Mo network training
Given the indoor work scenario envisaged in this study, a substantial indoor dataset was curated and annotated, supplemented with select indoor data from the COCO dataset, totaling 2000 images for training. To ensure the accuracy and consistency of data annotation, we used LabelImg as the annotation tool.
The custom dataset was randomly divided into training and testing sets at a 4:1 ratio, with the training platform as outlined in the Experimental Platform section, and key training parameters presented in Table 6.
Table 6. Training parameter settings.
Throughout the training, particular attention was paid to two critical metrics: the training loss and the mean average precision (mAP). The loss and precision curves of the training process are depicted in Figure 8.

Figure 8. Training process.
The results demonstrate that our model exhibited robust performance throughout the training cycle, with a gradual, stabilizing decrease in loss and a steady increase in mAP, showing no signs of overfitting or underfitting. This indicates effective model training, achieving the anticipated outcomes with the YOLO-Mo model on the custom dataset.
Comparative training with YOLOv5s and YOLOv5n under identical conditions yielded the performance data in Table 7.
Table 7. Comparison of network detection results.
Table 7 reveals that the YOLO-Mo model's size and framerate fall between those of YOLOv5n and YOLOv5s. Its recall and precision improve significantly over YOLOv5n and are very close to those of YOLOv5s, indicating high accuracy in recognizing objects in indoor scenes.
Due to its lightweight network, YOLO-Mo's detection precision drops slightly compared to YOLOv5s. However, it reduces the model size by 5.8 MB and increases the framerate by 26 fps relative to YOLOv5s. The YOLO-Mo algorithm therefore achieves a balanced trade-off between detection precision and speed, facilitating deployment on mobile devices and aligning with the application scenarios of YORB-SLAM.
Experimental validation
The YOLO-Mo model, trained on the custom dataset, was compared with the YOLOv5s model using default weights for indoor scene detection. The results, illustrated in Figure 9, show that YOLOv5s, trained on the COCO dataset with its 80 label categories, fails to detect certain common indoor objects, such as cabinets and trash bins, and also produces some detection errors. This is attributed to the COCO dataset not being specifically designed for indoor scenes. In contrast, YOLO-Mo, trained on the custom dataset, accurately identifies common objects within indoor scenes, meeting the accuracy requirements for indoor object detection set forth in this study.

Figure 9. Real scene detection results.
YORB-SLAM system
To assess the performance and robustness of YORB-SLAM in indoor dynamic scenarios relative to the original ORB-SLAM2 system, the study selected six dynamic scene sequences from the TUM dataset under the experimental environment described in the Experimental Platform section. The sequences, freiburg3_walking_xyz, freiburg3_walking_halfsphere, freiburg3_walking_rpy, freiburg3_walking_static, freiburg3_sitting_halfsphere, and freiburg3_sitting_static, span scene dynamics from high to low, including people moving about offices, camera motion along various paths and directions, and minor movements of seated individuals. They also cover motion patterns from simple to complex, such as XYZ translation, hemispherical motion, RPY rotations (around the x, y, and z axes), and static camera poses, providing a comprehensive testing environment. Evaluating both high-dynamic (walking) and low-dynamic (sitting) scenarios tests the YORB-SLAM system under different levels of dynamics and motion patterns. For convenience, subsequent discussions abbreviate freiburg3, walking, halfsphere, and sitting as fr3, w, half, and s, respectively, in sequence names.
In terms of performance evaluation, the study used the root mean square error (RMSE), mean error, and standard deviation of the absolute trajectory error (ATE) as key performance indicators. RMSE measures the deviation of estimated poses from the ground-truth poses, as shown in equation (7):
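$\mathrm{RMSE} = \left(\frac{1}{n}\sum_{i=1}^{n}\left\|\mathrm{trans}\left(Q_i^{-1}\,S\,P_i\right)\right\|^{2}\right)^{1/2}$ (7)

where P_i denotes the estimated pose at timestep i, Q_i the corresponding ground-truth pose, S the rigid-body transformation aligning the two trajectories, and trans(·) the translational component. The equation is not reproduced in this copy; the standard TUM ATE formulation consistent with the description above is assumed.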

Figure 10. Track diagrams of different dynamic scenes.
Results in Table 8 show that YORB-SLAM's performance improvement in low-dynamic scenes is limited, with an average reduction of 28.19% in RMSE, 37.40% in mean error, and 18.37% in standard deviation. This is because, in low-dynamic scenes where human movement is minimal, ORB-SLAM2's visual odometry can use the RANSAC algorithm to eliminate some mismatches. In high-dynamic scenes with larger human movement, however, relying solely on ORB-SLAM2's mismatch elimination is insufficient, and YORB-SLAM's performance improves markedly, with an average reduction of 92.08% in RMSE, 97.14% in mean error, and 85.38% in standard deviation. These statistics convincingly demonstrate YORB-SLAM's robustness in dynamic indoor scenes, particularly under high-dynamic conditions, marking a significant improvement.
Table 8. Comparison between ORB-SLAM2 and YORB-SLAM in TUM sequences.
RMSE: root mean square error.
Beyond comparing with ORB-SLAM2, this study also benchmarks the YORB-SLAM approach against DynaSLAM and DS-SLAM, two popular dynamic scene SLAM systems in recent years, as shown in Table 9. The data for DynaSLAM is taken from the paper by Bescos et al., 16 and for DS-SLAM from the paper by Yu et al., 17 where the “-” symbol indicates data not provided in their respective publications.
Table 9. Comparison of absolute trajectory error between YORB-SLAM and other dynamic SLAM methods.
RMSE: root mean square error.
According to Table 9, in low-dynamic scenes, YORB-SLAM's positioning accuracy is comparable to DynaSLAM and DS-SLAM. In high-dynamic scenes, YORB-SLAM outperforms DS-SLAM and is slightly inferior to DynaSLAM.
Real-time performance is another critical metric for evaluating SLAM systems. A SLAM system that sacrifices real-time capabilities for accuracy is challenging to apply in real-world scenarios. To quantify the real-time performance of the YORB-SLAM system, the study compared the processing time per frame in different sequences using the open-source codes of DynaSLAM and DS-SLAM on the same hardware, as presented in Table 10.
Table 10. Comparison of real-time performance between YORB-SLAM and other dynamic SLAM methods.
Data from Table 10 indicate that the YORB-SLAM system exhibits superior real-time performance compared to the other dynamic SLAM systems. Although it takes more time per frame than ORB-SLAM2 because of the added object detection step, its processing time remains low. In contrast, DynaSLAM, despite its excellent accuracy in dynamic environments, shows somewhat insufficient real-time performance: the algorithm spends considerable time on multi-view geometry, suggesting that DynaSLAM may not be the best choice for applications requiring rapid processing and real-time feedback. 16
In summary, YORB-SLAM significantly enhances indoor dynamic positioning capabilities while maintaining efficient operation of the SLAM system. In dynamic indoor environments, YORB-SLAM's comprehensive performance surpasses ORB-SLAM2, DynaSLAM, and DS-SLAM.
Conclusion
This study addresses the reduced positioning accuracy of the ORB-SLAM2 algorithm in indoor dynamic scenes by designing a vision SLAM system based on object detection, aimed at minimizing the impact of dynamic objects in the environment on the SLAM system. First, by proposing a variable threshold FAST corner detection algorithm and an enhanced quadtree strategy, we optimized the ORB feature extraction algorithm, increasing the speed of feature extraction while retaining more image information. Next, we lightened the YOLOv5 model by replacing its backbone with MobileNetV3 and trained the network on a custom indoor dataset tailored to our application scenario. Integrating this lightened YOLOv5 model, we introduced the YORB-SLAM algorithm and devised dynamic feature point elimination rules. Comparative experiments with ORB-SLAM2 on the TUM dataset demonstrated that our system reduces the absolute trajectory error by up to 97%. Compared with dynamic SLAM systems such as DS-SLAM and DynaSLAM, our system also showed improved accuracy without sacrificing speed. We conclude that YORB-SLAM surpasses ORB-SLAM2, DS-SLAM, and DynaSLAM in overall performance.
For future work, we aim to further refine the dynamic information judgment rules of the YORB-SLAM system so that it maintains its performance in more complex environments. First, we will consider combining the object detection algorithm with optical flow methods to overcome the limitation that only pretrained object classes can be detected. Second, acknowledging the vision SLAM system's sensitivity to environmental brightness, we propose enhancing the system's localization and mapping capabilities through multisensor fusion under various lighting conditions.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
