Abstract
Vehicle detection using computer vision plays a crucial role in accurately recognizing and responding to various road conditions, targets, and signals, particularly within autonomous driving technology. However, traditional vehicle detection algorithms suffer from slow detection speed, low accuracy, and poor robustness. To address these challenges, this paper proposes the simple attention mechanism-you only look once (SAM-YOLO) algorithm. SAM-YOLO incorporates the simple attention mechanism into the YOLOv7 network, allowing it to capture more detailed information without introducing additional parameters. In this study, we experimentally redesigned the backbone network of SAM-YOLO by replacing redundant network layers with the C3 module, improving model efficiency while maintaining accuracy. The experimental results show that the SAM-YOLO algorithm performs excellently on several evaluation metrics under conventional conditions, outperforming other algorithms in precision and mean average precision. In tests on the ExDark dataset under extreme lighting conditions, SAM-YOLO likewise demonstrated the best detection capability, showing strong robustness to complex lighting variations. These findings emphasize the potential of the SAM-YOLO algorithm for real-time and accurate target detection tasks, especially in environments with highly variable lighting.
Introduction
In recent years, substantial advancements in autonomous driving technology have occurred, motivated by the pursuit of scientific and technological innovation, as well as the increasing demand for convenient travel (Grigorescu et al., 2020; Kiran et al., 2022). Autonomous driving technology empowers vehicles to perceive and comprehend their surroundings, formulate navigation plans, and regulate their movements without human intervention (Yurtsever et al., 2020). To accomplish this, a car must possess the capability to detect objects in its vicinity, discern road conditions, and make informed decisions regarding its trajectory (Petit & Shladover, 2014). Hence, achieving precise detection and recognition of vehicles and road environments is crucial for fully exploiting the capabilities of autonomous driving technology (Gupta et al., 2021). In this context, the development of machine learning models for vehicle visual detection has emerged as a crucial research area with substantial practical implications (Liu et al., 2021a).
Traditional machine learning algorithms commonly used for object detection rely on manual feature engineering, including predefined feature extraction (Outay et al., 2020; Shi et al., 2019; Wang et al., 2019), sliding windows (Chen & Huang, 2019; Chen et al., 2014; Song et al., 2019), and statistical learning (Alotibi & Abdelhakim, 2021; Cucchiara et al., 2000; Sun et al., 2006; Wang & Lien, 2008). These algorithms extract features from input images and utilize machine learning techniques to ascertain the presence of objects at each location (Liu et al., 2019). The final detection outcome is obtained by aggregating multiple detection results using suppression rules. However, these algorithms face limitations when dealing with complex scenes (Wang et al., 2023), primarily due to the diverse shapes and viewpoints of detected objects, resulting in high computational complexity, low accuracy, and poor robustness (Srivastava et al., 2021). Various factors such as different driving poses, changes in lighting conditions, occlusion by surrounding objects, and interference from cluttered backgrounds pose challenges to traditional machine learning object detection algorithms. The advent of deep learning has attracted significant attention in the field of artificial intelligence, particularly in the development and application of deep learning-based object detection algorithms (Srivastava et al., 2021).
You only look once (YOLO) is an object detection algorithm based on convolution neural networks that was proposed by Redmon et al. (2016). In contrast to two-stage object detection methods (Law & Deng, 2019; Liu et al., 2016; Long et al., 2015; Wang et al., 2021), YOLO can precisely predict the bounding box and object probabilities of the entire image in a single evaluation using a single neural network. This property makes YOLO an efficient approach for object detection since the entire detection process is contained within a single neural network, featuring a single end-to-end architecture that encompasses all processing steps from image input to output. The high effectiveness and efficiency of YOLO have contributed to its popularity as an algorithm in the field of computer vision, where it has found applications in various areas including autonomous driving, surveillance, and robotics (Li et al., 2022).
YOLOv7 is a part of the YOLO family of object detection models (Wang et al., 2022a). It represents an enhancement over YOLOv5 (Jocher et al., 2022). Like the YOLO algorithm, YOLOv7 employs a single neural network to conduct an overall prediction for the entire image within one evaluation. As a conventional neural network model, YOLOv7 comprises four primary components: the input network, backbone network, neck network, and head network. These components collaborate harmoniously to efficiently and precisely identify objects in images, making YOLOv7 a versatile tool applicable to a broad range of computer vision tasks.
Despite exhibiting exceptional performance in object detection tasks, the YOLO algorithm has high rates of missed detections and false alarms for detecting small objects (Hu et al., 2021; Jiang et al., 2022a; Li & Shen, 2023). Researchers have proposed various methods to address this issue (Liu et al., 2021b), such as multi-scale feature representation (Hong et al., 2016; Najibi et al., 2017, 2019; Newell et al., 2016; Wu et al., 2018), additional detection heads (Deng et al., 2022; Zhu et al., 2021), image enhancement (Rabbi et al., 2020), super-resolution techniques, and attention mechanisms.
For instance, Hsu and Lin (2021) proposed a multi-scale feature representation that combines length and width information, alleviating image distortion after resizing and integrating complementary data from multiple sub-images. Carrasco et al. (2023) integrated features extracted from local images at different scales into the YOLOv5 backbone network, effectively reducing the number of trainable parameters and floating-point operations. As a result, both inference speed and accuracy were improved.
Zhu et al. (2020) presented a multi-sensor multi-level improved convolution network model that incorporates an improved reasoning head and feature fusion method, integrating radar data. Additionally, Zhao et al. (2023) introduced a prediction head to YOLOv7 and utilized the simple attention mechanism (SimAM) module to enhance the detection of small objects or individuals.
Enhancing image information is also a prevalent approach in recent studies. Liu et al. (2022) employed the Flip-Mosaic algorithm to enhance the network’s capability in detecting small targets and mitigating the false detection rate of occluded vehicle targets. Likewise, Jiang et al. (2022b) incorporated the attention mechanism and merged the infrared image with the image enhancement algorithm and the global attention mechanism, resulting in enhanced accuracy for small target detection. The method proposed by Shen et al. (2023b) is based on multiple information perception and attention modules, including five processes: information preprocessing, information collection, information interaction, feature fusion, and attention generation.
Thus, this paper proposes an improved YOLOv7 object detection algorithm called simple attention mechanism-YOLO (SAM-YOLO) that improves the accuracy of object positioning and recognition while preserving the original excellent features of the YOLOv7 network. The contributions of this paper to on-road vehicle visual detection can be summarized as follows:
1. The SAM-YOLO algorithm introduces a SimAM that integrates both channel-level and spatial-level information to model multidimensional dependencies, structural information, and global context. This attention mechanism focuses selectively on critical regions within the image, thereby enhancing the precision of small object detection. By effectively capturing essential features within the two-dimensional space of images, it mitigates information loss and improves recognition accuracy.
2. The SAM-YOLO network layers have been reduced following the principle of lightweight model design. This reduction significantly alleviates the computational burden caused by the multi-layer propagation of information during inference, thereby increasing recognition speed and achieving high computational efficiency. Consequently, the algorithm is well suited to fast image processing and real-time applications, and it facilitates deployment on low-power in-vehicle terminals.
3. We propose a new application of the SAM-YOLO algorithm for detecting moving vehicles on the road. Our findings demonstrate that SAM-YOLO offers performance advantages over existing YOLO variants and other algorithms.
The remaining sections of this paper are organized as follows. In Section 2, a restatement of the problem is provided, followed by the details of the SAM-YOLO algorithm in Section 3. The experimental results and effectiveness evaluation of the approach are presented in Section 4. Finally, the findings are summarized, and potential avenues for future research are discussed in Section 5.
Restatement of the Problem
Object detection, or object recognition, is a fundamental problem in computer vision that involves identifying and localizing objects within images or video sequences. The task requires the model to predict both the presence and category of objects, as well as draw bounding boxes around detected objects to indicate their locations. This problem combines elements of classification and localization and poses challenges due to variations in object appearance, scale, occlusion, and environmental conditions.
The YOLOv7 algorithm is widely applied in diverse object detection scenarios, and its network model is predominantly composed of input, backbone, neck, and head components, as shown in Figure 1. More specifically, the input layer consists of preprocessed and normalized image inputs, and the backbone network is responsible for extracting features from input images. Because the head layer in YOLOv7 is a CSPSPP layer, it is merged into the neck layer in the figure.

The architecture of YOLOv7.
The multi-layer efficient layer aggregation networks (ELAN) structure is designed to enhance computational efficiency and strengthen feature fusion capabilities. This structure employs complex layer aggregation strategies to significantly improve feature extraction performance, making it more suitable for object detection tasks. Specifically, the ELAN architecture consists of basic units made up of multiple convolutional layers, activation functions (such as rectified linear unit [ReLU] or Leaky ReLU), and normalization layers (such as batch normalization), which are further combined into higher-level groups. ELAN achieves parallel feature processing by aggregating outputs from multiple convolution paths at specific nodes, thereby enriching feature expression diversity. At the same time, ELAN emphasizes multi-scale feature aggregation, integrating information from different layers through feature fusion, and it alleviates the vanishing gradient problem with skip connections similar to those in ResNet, thus enhancing training stability and efficiency.
However, the original multi-layer ELAN structure in YOLOv7 results in substantial inter-layer information exchange, thereby decelerating the algorithm’s training speed. Additionally, the utilization of fixed anchor sizes in YOLOv7 confines its effectiveness in discerning and detecting objects with various scales, particularly in demanding scenarios. These demanding scenarios can include conditions such as poor lighting, occlusion, or objects in high-speed motion, which complicate the detection process.
In object detection, another major challenge is the limitation of the YOLO algorithm in detecting small targets, especially in complex traffic environments. Due to the small size of these targets and their low pixel resolution, they often contain limited information in the image, making them susceptible to interference or occlusion from the background.
It is widely recognized that increasing the number of parameters and utilizing more intricate networks can partially enhance the accuracy of algorithmic detection. However, the resulting accuracy gains are limited relative to the added training time and model size. Moreover, in engineering applications, the use of complex networks with a high volume of parameters is non-ideal owing to computational constraints at the application level. Therefore, the presented algorithm strives to enhance the efficiency of the network layers instead of further augmenting the complexity of the YOLOv7 base model.
To tackle the challenges of detecting small targets in complex scenes, this paper proposes an improved YOLOv7 object detection algorithm based on the YOLOv7 network. The algorithm enhances the accuracy of target localization and recognition while retaining the fundamental features of the YOLOv7 network. This model differs from existing methods in that it does not require multi-scale feature fusion. Instead, it introduces the SimAM, which enables the network to learn and emphasize important aspects of the targets without introducing additional parameters. Additionally, the algorithm improves efficiency while maintaining detection accuracy by redesigning the original backbone network. By replacing the loss function used in the YOLOv7 algorithm, its recognition capability is enhanced, improving both parallelism and stability. Experimental results demonstrate that the improved YOLOv7 algorithm performs exceptionally well in handling complex scenes and small targets, effectively overcoming the aforementioned challenges.
SAM-YOLO is an improved target detector based on the YOLOv7 architecture. The algorithm focuses on the challenges of small target detection in complex scenarios and improves the accuracy of target localization and identification while retaining the basic features of the YOLOv7 network and minimizing any potential degradation of detection accuracy and recall. In addition, SAM-YOLO effectively reduces the number of parameters required by the model and speeds up model inference. The network structure is shown in Figure 3, and the main improvements are as follows:

The schematic diagram of SIoU loss function. Notes. SIoU = SCYLLA intersection over union.

The architecture of the improved algorithm.
1. Incorporating the SimAM into the network, with its placement determined through designed experiments.
2. Redesigning the backbone network of the model, replacing part of the original structure with the more lightweight C3 module.
3. Redesigning the loss function of the model.
The attention mechanism is a widely used technique in the fields of machine learning and deep learning. It aims to simulate the human attention mechanism, selectively focusing on important parts of the input data. The attention mechanism has been extensively studied and applied in various tasks, including natural language processing, computer vision, and speech recognition. By introducing the attention mechanism, models can pay more attention to the parts that are more important for the current task when processing large amounts of information. The core idea of this mechanism is to determine the importance of each element in the input data through learning weight allocation. In the attention mechanism, each element can be assigned a weight or attention score, which reflects the model’s degree of attention to each element. The model can adaptively adjust these weights based on the characteristics of different tasks and input data, thereby making more accurate predictions and processing.
The application of attention mechanisms in the fields of machine learning and deep learning is very extensive, including several important attention mechanisms such as convolutional block attention module (CBAM), channel attention (CA), squeeze-and-excitation (SE), and SimAM (Cheng et al., 2024; Jia et al., 2023; Mahaadevan et al., 2023; Shen et al., 2023a; Wu & Dong, 2023).
CBAM is an attention mechanism based on convolutional neural networks. It enhances model performance by capturing both CA and spatial attention. CA is used to determine the importance of each channel in the input feature map, thereby weighting the channels. Spatial attention, on the other hand, determines the importance of each spatial position in the feature map, thus weighting the elements at different spatial positions. By combining CA and spatial attention, the CBAM attention mechanism enables the model to more accurately focus on the important parts of the input data.
The CA mechanism focuses on determining the importance of each channel in the feature map. By utilizing global average pooling and fully connected layers, the CA model can compute and allocate weights for each channel to better capture the feature representations of different channels. The CA mechanism performs well in computationally intensive tasks and helps the model differentiate the importance of each channel more effectively.
The SE attention mechanism is a lightweight attention model that enhances the representation capability of the model efficiently. The core idea of the SE attention model is to dynamically adjust the weights of each channel by utilizing global contextual information. By introducing the “squeeze” and “excitation” stages, the SE attention mechanism can adaptively learn the importance of each channel and re-weight the features accordingly. The SE attention mechanism has achieved good results in many image classification and object detection tasks.
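To make the squeeze-and-excitation steps concrete, the sketch below applies SE-style channel reweighting to a single feature map in NumPy. The function name `squeeze_excite` and the weight matrices `w1` and `w2` are illustrative stand-ins for the learned fully connected layers of the excitation stage, not an implementation from this paper:

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """SE-style reweighting of a feature map x with shape (C, H, W).

    w1 (shape C//r x C) and w2 (shape C x C//r) stand in for the two
    learned fully connected layers of the excitation stage.
    """
    z = x.mean(axis=(1, 2))               # squeeze: global average pooling
    s = np.maximum(w1 @ z, 0.0)           # excitation: FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # FC + sigmoid -> channel weights
    return x * s[:, None, None]           # re-weight each channel
```

Each channel is scaled by a weight in (0, 1) derived from the global context vector `z`, which is the essence of the SE mechanism.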
Yang et al. (2021) proposed SimAM, a module that efficiently infers full three-dimensional attention weights. Specifically, it estimates the importance of individual neurons by drawing on the neuroscientific observation that informative neurons usually exhibit spatial suppression of their neighbours; neurons with such spatial suppression effects should be assigned higher weights in visual processing. The importance of a target neuron \(t\) is measured by its linear separability from the other neurons \(x_i\) in the same channel, through the energy function defined in equation (1):

\[ e_t(w_t, b_t, y, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1-(w_t x_i + b_t)\bigr)^2 + \bigl(1-(w_t t + b_t)\bigr)^2 + \lambda w_t^2, \tag{1} \]

where \(M\) is the number of neurons on the channel, \(w_t\) and \(b_t\) are the weight and bias of the linear separation, and \(\lambda\) is a regularization coefficient. Computationally, a closed-form solution exists once the channel mean and variance are known:

\[ \hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i, \qquad \hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}(x_i - \hat{\mu})^2. \tag{2} \]

The minimal neuron energy is then

\[ e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t-\hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}, \tag{3} \]

so the lower the energy \(e_t^{*}\), the more distinct neuron \(t\) is from its neighbours and the higher its importance \(1/e_t^{*}\). The whole feature map \(X\) is finally refined as

\[ \tilde{X} = \operatorname{sigmoid}\!\left(\frac{1}{E}\right) \odot X, \tag{4} \]

where \(E\) groups all \(e_t^{*}\) across the channel and spatial dimensions.
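Because the minimal energy has a closed form, SimAM reduces to a few array operations per channel. The following is a minimal NumPy sketch of this refinement (the function name `simam` is ours; in practice the module operates on batched tensors inside the network):

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM weighting for a feature map x of shape (C, H, W)."""
    n = x.shape[1] * x.shape[2] - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2                          # squared deviation of each neuron
    v = d.sum(axis=(1, 2), keepdims=True) / n  # channel variance
    e_inv = d / (4.0 * (v + lam)) + 0.5        # inverse minimal energy, 1 / e_t*
    return x * (1.0 / (1.0 + np.exp(-e_inv)))  # sigmoid-gated refinement
```

Note that the weighting introduces no learnable parameters: every quantity is computed directly from the feature map itself.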
To determine the impact of the SimAM module on different parts of YOLOv7, we conducted a series of experiments to identify the placement of the SimAM module that has the greatest positive impact on the evaluation metrics. Specifically, we integrated the SimAM module into the input network, backbone network, neck network, and head network of YOLOv7 by replacing certain layers within the original architecture, and compared model performance before and after the integration. The experimental results show that the SimAM module has the greatest impact when placed in the neck network of YOLOv7: after introducing the SimAM module there, the performance of YOLOv7 improved significantly. Tables 1 and 2 show detailed experimental results on our collected datasets.
Test Result on Different Attention Mechanism.
Notes. Params = parameters; mAP = mean average precision; SimAM = simple attention mechanism; CBAM = convolutional block attention module; CA = channel attention; SEAM = self-supervised equivariant attention mechanism.
The Impact of SimAM Module on YOLOv7.
Notes. SimAM = simple attention mechanism; mAP = mean average precision.
In the SAM-YOLO model, the SimAM substitutes a segment of the ELAN structure within the neck layer. More precisely, the SimAM module replaces the initial six convolutional layers, which have stride parameters of (1, 1) and convolution kernel size parameters of (3, 1). During forward propagation, SimAM evaluates the neurons and activates them based on Equation (4).
In machine learning, the loss function serves as a metric for evaluating the discrepancy between the predicted and actual values of a model. By continuously adjusting its parameters to minimize this discrepancy, the model’s performance can be improved, leading to better detection and prediction accuracy.
Meanwhile, during object detection, the algorithm generates multiple bounding boxes with high confidence around the target object. However, only one bounding box can accurately represent the target. To address this redundancy, a non-maximum suppression algorithm is implemented, which ensures that only the most appropriate bounding box is selected. The algorithm starts by sorting all bounding boxes and then calculates the intersection over union (IoU) of the highest-confidence bounding box with the remaining boxes. If the IoU of a bounding box exceeds a predefined threshold, it is discarded.
To evaluate the performance of object detection models, the IoU metric is commonly employed. This metric quantifies the degree of overlap between predicted and ground truth boxes, thereby assessing the accuracy of predictions made by the model.
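The IoU computation and the greedy suppression procedure described above can be sketched in a few lines of NumPy. The function names are illustrative; this is a minimal greedy NMS, not the exact implementation used in YOLOv7:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]   # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # discard remaining boxes that overlap the kept box too strongly
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```

With an IoU threshold of 0.5, a lower-scoring box whose overlap with the selected box exceeds half of their union is discarded.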
The SIoU loss function assigns different weights to object detection at various scales, giving more attention to objects with smaller scales during training. By introducing additional variables, such as shape loss, the SIoU function not only provides a better measure of symmetry between the predicted box and the true box but also addresses the imbalance problem found in other IoU variants. Additionally, it facilitates faster convergence to the optimal solution and reduces training time. Moreover, it possesses greater sensitivity in detecting small target objects, thereby reflecting the effectiveness of the target detection model more accurately.
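For illustration, the sketch below follows the published formulation of the SIoU loss with its angle, distance, and shape costs (Gevorgyan, 2022). It is a simplified single-box version with an assumed shape exponent (`theta = 4`), not the training code used in this paper:

```python
import math

def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    """Simplified SIoU loss between two boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # IoU term
    ix1, iy1 = max(px - pw / 2, gx - gw / 2), max(py - ph / 2, gy - gh / 2)
    ix2, iy2 = min(px + pw / 2, gx + gw / 2), min(py + ph / 2, gy + gh / 2)
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # angle cost: favours centre offsets aligned with an image axis
    s_cw, s_ch = gx - px, gy - py
    sigma = math.hypot(s_cw, s_ch) + eps
    sin_a = min(abs(s_cw), abs(s_ch)) / sigma
    angle_cost = math.cos(2.0 * math.asin(sin_a) - math.pi / 2.0)

    # distance cost, normalised by the smallest enclosing box
    enc_w = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    enc_h = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)
    gamma = 2.0 - angle_cost
    dist_cost = (1 - math.exp(-gamma * (s_cw / enc_w) ** 2)
                 + 1 - math.exp(-gamma * (s_ch / enc_h) ** 2))

    # shape cost: penalises width/height mismatch between the boxes
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    shape_cost = ((1 - math.exp(-omega_w)) ** theta
                  + (1 - math.exp(-omega_h)) ** theta)

    return 1.0 - iou + (dist_cost + shape_cost) / 2.0
```

The loss vanishes for a perfect prediction and grows as the boxes drift apart in position, orientation of the offset, or shape.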
In the SAM-YOLO algorithm, YOLOv7 is adopted as the basic network architecture, and the C3 module is incorporated. The number of layers and parameters in the network is reduced through this module, accelerating the model’s inference and training speed.
Furthermore, the SimAM is introduced into the neck network. This parameter-free attention mechanism defines an analytically derived energy function to determine the importance of each neuron. Inspired by concepts from neuroscience, this approach avoids spending additional parameters and computation on adjusting and enlarging the network structure. Figure 3 illustrates the improved architecture of the algorithm.
Experiment and Analysis
The Dataset
To ensure the applicability of the vehicle recognition model on highways and urban roads, we collected a substantial number of authentic road videos captured by vehicle dashcams or built-in cameras near Xiangyang City, China. The videos were captured from the driver's front-facing perspective and encompass diverse road and driving scenarios, from clear weather on two-way four-lane roads to dimly lit rural roads, capturing traffic signs, vehicles, pedestrians, traffic lights, and road markings while the vehicles are in motion. Figure 4(a) to (c) displays images from the training set, covering different vehicle types, viewing angles, road segments, and lighting conditions to enhance dataset diversity: Figure 4(a) shows images captured in clear weather, Figure 4(b) in cloudy weather, and Figure 4(c) at night or in tunnels.

Images from the training set. (a) Clear weather condition; (b) cloudy weather condition; (c) night or tunnel condition.
Segments with a high number of vehicles and clear video quality were meticulously selected, and one image sample was extracted for every 25 frames. Ultimately, a dataset of 16,008 images with vehicle information was obtained, encompassing various vehicle categories such as cars, trucks, taxis, and tankers. To ensure proper evaluation, the dataset was divided into training, validation, and test sets.
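The frame-sampling step described above (one sample every 25 frames) amounts to keeping every 25th frame index. A minimal sketch, with video decoding left to whatever capture library is used:

```python
def sample_every_nth(frames, step=25):
    """Keep one frame out of every `step` consecutive frames."""
    return [frame for i, frame in enumerate(frames) if i % step == 0]
```

Applied to a 100-frame clip with `step=25`, this keeps frames 0, 25, 50, and 75.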
Videos captured by dashcams or in-vehicle cameras, while partially representative of real road scenarios, are influenced by various factors. Challenges such as image blurring from camera focus issues, underexposed or overexposed objects due to high contrast, and video noise from poor lighting conditions introduce recognition noise into the captured images. These issues hinder the training effectiveness of machine learning models.
To address this, we implement data augmentation during model training to enhance model robustness. Our augmentation techniques include:
- Rotation: images are randomly rotated.
- Shear: random horizontal and vertical shearing.
- Brightness adjustment: random variation of the image illumination.
- Blur: a blur effect of up to 2 pixels approximates out-of-focus images.
- Noise addition: random noise is introduced into a portion of the image pixels.
Additional challenges include motion-induced blur from the movement of objects and the camera's distance, often resulting in out-of-focus images. Gaussian blur, a linear smoothing filter, is utilized to reduce this blur while preserving edge information. The Gaussian kernel is defined as follows:

\[ G(x, y) = \frac{1}{2\pi\sigma^2}\exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right), \]

where \(\sigma\) controls the strength of the smoothing.
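A discrete, normalised version of the Gaussian kernel can be computed as follows (a minimal NumPy sketch; `gaussian_kernel` is an illustrative helper whose output would be applied to an image via 2-D convolution):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """2-D Gaussian kernel of odd `size`, normalised to sum to 1."""
    ax = np.arange(size) - size // 2          # coordinates centred on 0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()                        # normalisation replaces 1/(2*pi*sigma^2)
```

Convolving an image with this kernel smooths high-frequency noise; a larger `sigma` gives stronger blurring.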
Additionally, low-light conditions can introduce random image noise, which mere camera adjustments cannot correct. In this context, Poisson noise—a statistical distribution that models random events such as photon counting—effectively simulates noise under low light. This approach helps to replicate realistic imaging conditions, further improving the model’s robustness.
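Photon-counting noise of this kind can be simulated by drawing each pixel from a Poisson distribution whose mean is proportional to the pixel intensity. A minimal NumPy sketch (the `photons` parameter is an assumed scale; smaller values mean fewer photons per pixel and therefore stronger noise):

```python
import numpy as np

def add_poisson_noise(img, photons=30.0, seed=0):
    """Simulate photon-counting (Poisson) noise on an image in [0, 1]."""
    rng = np.random.default_rng(seed)
    # each pixel value becomes a Poisson draw with mean intensity * photons
    noisy = rng.poisson(img * photons).astype(np.float64) / photons
    return np.clip(noisy, 0.0, 1.0)
```
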
The experimental environment utilized in this study is summarized in Table 3.
Experimental Environment.
The training parameters utilized in this experiment are presented in Table 4.
Training Parameters.
Performance evaluation of the SAM-YOLO algorithm involves multiple metrics to assess the model's quality. For this study, the evaluation metrics employed include the precision rate (P), the recall rate (R), the mean average precision at an IoU recognition threshold of 0.5 (mAP@0.5), and the mean average precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95).
Precision represents the probability of correctly predicting a positive sample, whereas recall denotes the probability of accurately identifying a positive sample among the original samples. Equations (9) and (10) establish the average precision (AP), the area under the precision-recall curve for a single class, and the mean average precision (mAP), the mean of AP over all \(N\) classes:

\[ AP = \int_0^1 P(R)\,dR, \tag{9} \]
\[ mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i. \tag{10} \]

Precision and recall themselves are computed as

\[ P = \frac{TP}{TP+FP}, \tag{11} \]
\[ R = \frac{TP}{TP+FN}. \tag{12} \]

In equations (11) and (12), TP represents the number of true positives, FP the number of false positives, and FN the number of false negatives (Tables 5 and 6).
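Precision and recall reduce to simple ratios of the confusion-matrix counts; a minimal sketch:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall
```

For example, 8 true positives with 2 false positives and 2 false negatives give a precision and a recall of 0.8 each.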
Comparison of Evaluation Indicators Results.
Notes. mAP = mean average precision; GFLOPS = gigaflops; FPS = frames per second; SAM-YOLO = simple attention mechanism-you only look once.
Comparison of Detection Category Results.
Notes. mAP = mean average precision; SAM-YOLO = simple attention mechanism-you only look once.
The effectiveness of the proposed method was evaluated through extensive experiments conducted on a benchmark dataset. The experimental results indicate that the proposed SAM-YOLO algorithm achieves higher precision and recall rates than the original YOLOv7 algorithm, along with a 3% improvement in mAP.
It can be observed from Figure 5 that the SAM-YOLO algorithm maintains low missed detection and false detection rates across all detection categories and demonstrates high accuracy in detecting small targets.

Confusion matrix of the SAM-YOLO Algorithm. Notes. SAM-YOLO = simple attention mechanism-you only look once.
In the ablation experiments, by selectively adding or removing the SimAM, SIoU, and C3 modules, we gained a deeper understanding of the specific impact of these components on the overall performance of the model. The addition or removal of each module provided unique insights, which in turn allowed us to evaluate their respective values and roles. The results of the ablation experiments show that when the SimAM, SIoU, and C3 modules are enabled simultaneously, the model achieves the highest precision (0.961) and recall (0.930), together with the best mAP scores.
Ablation Experiment Results.
Notes. SimAM = simple attention mechanism; SIoU = SCYLLA intersection over union; Params = parameters; FPS = frames per second; mAP = mean average precision.
In this study, we compare and analyze the performance of multiple target detection algorithms in different contexts, including YOLOv7, SAM-YOLO, CBAM-YOLO, CA-YOLO, SEAM-YOLO, SSD, RetinaNet, and Faster-RCNN. The algorithms were comprehensively evaluated in terms of the number of parameters, frames per second (FPS), precision, recall, and different mAP metrics. All algorithms use ELAN or ResNet-50 as the underlying backbone to ensure consistency and fairness in the evaluation.
In terms of overall performance, SAM-YOLO performs best on several metrics, particularly the mAP scores, while maintaining a competitive inference speed. In contrast, the performance of SSD, RetinaNet, and Faster-RCNN is relatively weak, especially under the more stringent mAP thresholds.
Comparison of Detection Results.
Notes. Params = parameters; FPS = frames per second; mAP = mean average precision; SAM = simple attention mechanism; YOLO = you only look once; ELAN = efficient layer aggregation networks; CBAM = convolutional block attention module; CA = channel attention; SEAM = self-supervised equivariant attention mechanism; SSD = single shot multibox detector; RCNN = region-based convolutional neural network.
Comparison of Detection Results on
Notes. mAP = mean average precision; SAM = simple attention mechanism; YOLO = you only look once; CBAM = convolutional block attention module; CA = channel attention; SEAM = self-supervised equivariant attention mechanism; SSD = single shot multibox detector; RCNN = region-based convolutional neural network.
The experimental results show that SAM-YOLO and YOLOv7 are the optimal choices for target detection tasks with high speed and high accuracy. They are not only able to accurately detect various types of targets in complex scenes but also maintain high processing speed to meet the demands of real-time applications. Future research can further explore the optimization and adaptation of these algorithms in specific application scenarios to achieve higher performance and better adaptability (see Figure 6 and Table 10).

(a) Original image; (b) simple attention mechanism-you only look once (SAM-YOLO).
Comparison of Detection Results on
Notes. mAP = mean average precision; SAM = simple attention mechanism; YOLO = you only look once; CBAM = convolutional block attention module; CA = channel attention; SEAM = self-supervised equivariant attention mechanism; SSD = single shot multibox detector; RCNN = region-based convolutional neural network.
This section analyzes the target detection results on the ExDark dataset (Loh & Chan, 2019), which is designed to evaluate the performance of target detection algorithms under extreme lighting conditions. The analysis covers several algorithms, including YOLOv7, SAM-YOLO, CBAM-YOLO, CA-YOLO, SEAM-YOLO, SSD, RetinaNet, and Faster-RCNN. By comparing the performance of each algorithm under the two main mAP metrics, the robustness of each method to extreme lighting can be assessed.
Comparison of Detection Results on ExDark Dataset
Notes. mAP = mean average precision; SAM = simple attention mechanism; YOLO = you only look once; CBAM = convolutional block attention module; CA = channel attention; SEAM = self-supervised equivariant attention mechanism; SSD = single shot multibox detector; RCNN = region-based convolutional neural network.
Under both the standard and the more stringent mAP thresholds, SAM-YOLO maintains the best detection performance on ExDark among the compared algorithms, while the other methods degrade more noticeably as the lighting conditions worsen.
By analyzing the test results on the ExDark dataset, we can conclude that the SAM-YOLO algorithm not only performs well under regular lighting conditions but also maintains a high detection performance under extreme lighting conditions. This ability makes it the preferred algorithm for high-precision target detection in complex lighting environments. However, the performance degradation of other algorithms under extreme lighting conditions suggests that improving the robustness of the algorithm to changes in lighting remains an important direction for future research (Table 12).
Comparison of Detection Results on ExDark Dataset
Notes. mAP = mean average precision; SAM = simple attention mechanism; YOLO = you only look once; CBAM = convolutional block attention module; CA = channel attention; SEAM = self-supervised equivariant attention mechanism; SSD = single shot multibox detector; RCNN = region-based convolutional neural network.
In this paper, an improved YOLOv7 algorithm was proposed that incorporates a SimAM into the neck network, replaces the CIoU loss function of YOLOv7 with the SIoU loss function, and simplifies the model architecture to accelerate training and reduce the number of parameters. This approach enhances the model's generalization ability, improves the learning of spatial features, and increases computational efficiency. The results reveal that the SAM-YOLO algorithm outperforms other algorithms in comprehensive performance, especially on the precision and mAP metrics, both in standard test environments and under extreme lighting conditions. This demonstrates the potential of SAM-YOLO for high-speed, high-accuracy target detection, especially in real-world application scenarios with variable lighting conditions.
Regarding model training, the training set comprises real road videos with a fixed perspective effect, and the proportion of vehicles on the road varies. Consequently, the model training may result in bias. To mitigate this limitation, future studies should enhance the original dataset to alleviate the impact of issues related to data collection on the performance of the model. Furthermore, during the data collection process, non-conventional shooting perspectives such as reverse and left-right angles should be introduced to enhance the capability of detecting vehicles in various angles and environments.
Additionally, our findings point to a general decrease in the performance of target detection algorithms under extreme lighting conditions, highlighting the importance of improving the robustness of algorithms to lighting changes. While SAM-YOLO performed the best in these tests, the performance degradation of the other algorithms hints at the need for future work, especially in optimizing the algorithms to better adapt to extreme lighting conditions.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the Natural Science Foundation of Hubei Province (No. 2024AFB147), Xiangyang Science & Technology Plan (High-tech field, No. 2022ABH006596), Innovation and Entrepreneurship Education Special Project (No. CX2023003).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
