Abstract
To address the challenges of urban traffic complexity, such as occlusion and lighting variations that impact road target detection, this study introduces the KGKPD algorithm, which integrates knowledge graphs with keypoint detection based on the CenterNet concept. It enhances robustness by injecting salt-and-pepper noise during data augmentation and uses the RepVit network as the backbone. The weighted fusion adaptive feature pyramid network module fuses multi-scale features to optimize the extraction of small-target features. The efficient linear deformable convolutional head improves the detection of occluded targets. The Poly-1 loss function addresses class imbalance, thereby improving accuracy. The integration of prior knowledge enhances the model's ability to understand relationships between targets. Compared to CenterNet, the KGKPD algorithm reduces parameters and computational load by 92.32% and 91.85%, respectively, increases the mean average precision by 4.1 percentage points, and achieves a frame rate of 40.5 frames per second, meeting the requirements for real-time detection. The code is available at https://github.com/yjx-cup/kgkpd.
Introduction
With the rapid economic development and urbanization, the surge in urban population and the popularization of new energy vehicles have led to a sharp increase in the number of vehicles, intensifying urban traffic pressure and highlighting issues such as traffic congestion, safety, and environmental pollution. Traditional manual inspection methods are costly, inefficient, and unreliable. Infrared and radar detection technologies, by contrast, can accurately measure vehicle position, speed, and type, operate around the clock, and resist interference well, but they are very expensive.
In recent years, the rapid advancement of deep learning technology has significantly propelled the development of object detection algorithms (Dai et al., 2021; Duan et al., 2019; Girshick, 2015; Han et al., 2020; Wang et al., 2023a). Researchers have optimized existing models to better meet the specific needs of road vehicle detection. These optimized models have achieved a dual improvement in cost-effectiveness and detection accuracy while maintaining a small model size and excellent performance. For instance, to address the issue of occluded targets during detection, He et al. (2024) proposed a method called YOLO-OVD (YOLO for occluded vehicle detection) and a corresponding dataset, effectively handling the problem of vehicle occlusion. Yu et al. (2024) introduced an enhanced YOLOv7 for traffic systems (ETS-YOLOv7), which replaces the traditional efficient layer aggregation network (ELAN) module with a compact layer aggregation network module, reducing redundant computation and improving computational efficiency without sacrificing model accuracy. Zhou et al. (2024) proposed a fully convolutional one-stage object detection (FCOS) framework that combines dynamic convolution and feature enhancement. Through a dynamic convolution module (Dy-Conv), a dual attention module, and a multi-scale feature fusion module, it enhances feature extraction capabilities, thereby improving the efficiency and accuracy of real-time traffic target detection. Vankdoth and Arock (2024) proposed an end-to-end model that can be easily deployed on edge devices and has a high mean average precision (mAP). Liu et al. (2024) introduced a new structure-aware fusion network, which enhances the robustness of traffic object detection by incorporating bio-inspired event cameras and designing a reliable structure generation network as well as an adaptive feature complementation module. In complex road traffic environments, the visibility and features of traffic targets are easily attenuated and lost due to factors such as lighting, weather, time of day, background elements, and traffic density. Tang et al. (2024) proposed a new YOLO network called HRYNet, which significantly improves the detection of traffic targets in complex backgrounds by enhancing the feature extraction and fusion process.
In summary, although existing methods have made some progress in object detection in complex traffic environments, they still face several challenges. First, existing technologies have not fully considered interference factors during image transmission, which can reduce image clarity and thus affect the accuracy and robustness of object detection. Second, existing frameworks often overlook the inherent relationships between traffic targets, which are important for a deeper understanding and prediction of traffic behavior patterns. To address these issues, this study proposes a road object detection algorithm called KGKPD, which combines knowledge graphs and keypoint detection. Our contributions are as follows.

First, to address the issue of pixel loss during image transmission, we additionally employ a salt-and-pepper noise algorithm in the data augmentation stage. Although salt-and-pepper noise itself is not a novel technique, we match its parameters to the specific noise patterns of the road target detection task. Unlike previous approaches that add salt-and-pepper noise with a fixed intensity, we dynamically adjust the degree of noise addition based on the characteristics of road scene images, allowing the model to better adapt to the complex and variable image quality conditions encountered in practical applications.

Second, to tackle the problem of occluded traffic targets, we introduce an efficient linear deformable convolutional head (ELDHead). This novel detection head integrates the advantages of efficient local attention (ELA) and linear deformable convolution (LDConv) and is specifically optimized for occlusion in road target detection. By organically combining these two components, ELDHead achieves adaptive focusing and refined modeling of image features when dealing with occluded targets.

Third, to address the detection of small targets, we propose a weighted fusion adaptive feature pyramid network (WF-AFPN). The WF-AFPN module improves upon the adaptive feature pyramid network (AFPN) to meet the requirements of small-target feature extraction in road target detection. Compared with the original AFPN, we not only introduce the squeeze-and-excitation (SE) attention mechanism and learnable weight parameters but also replace ordinary convolutions with depthwise convolutions. This enables the model to perform weighted fusion of features at different scales more effectively, while significantly reducing the computational cost and model parameters.

Fourth, to incorporate knowledge graphs into object detection, we construct a knowledge graph containing relationships between road target categories using ConceptNet. The random walk with restart (RWR) algorithm is then employed to quantify the graph and obtain a semantic consistency matrix. This allows us to capture the complex relationships between target categories more accurately and integrate them into the detection model, thereby enhancing the model's ability to understand semantic associations between targets and improving detection accuracy.
Related Work
The design philosophy of CenterNet (Duan et al., 2019) is leveraged in KGKPD, where targets are treated as keypoints, specifically the center points of target bounding boxes. Compared to other anchor box-based object detection algorithms, this approach eliminates the inefficient and complex anchor operations, thereby enhancing the detection performance of the algorithm. It also improves the flexibility and accuracy of detection. Additionally, during the inference stage, prior knowledge is used to intervene in the detection results, increasing the confidence in the model’s detected classes while also enhancing the mathematical interpretability of the model. Similar to the TTFNet (Liu et al., 2020) model, an elliptical Gaussian kernel is used to generate an elliptical heatmap at the keypoint, avoiding the overflow of the true bounding box that occurs with CenterNet’s circular heatmap after a
To better address the issue of class imbalance in the dataset, KGKPD employs a variant of focal loss known as Poly Loss-1 (Poly-1; Leng et al., 2022) as the keypoint loss function. This approach augments focal loss by adding only the leading term of its Taylor expansion, which yields a marked improvement. Focal loss addresses the imbalance between easy and hard examples by down-weighting well-classified examples so that training concentrates on hard ones. The Poly-1 loss function extends this idea by incorporating the leading polynomial term, allowing more flexibility and adaptability in handling class imbalance. The modification requires only one extra hyperparameter and a single line of code, making it an efficient and effective solution. The specific formulations of focal loss and Poly-1 are shown in equations (2) and (3), respectively.
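As a concrete illustration, the following is a minimal PyTorch sketch of a generic Poly-1 version of the alpha-balanced focal loss for binary targets; the hyperparameter epsilon and the binary formulation are illustrative assumptions, and CenterNet's penalty-reduced keypoint focal loss would receive the same one-line modification.

```python
import torch
import torch.nn.functional as F

def poly1_focal_loss(logits, targets, alpha=0.25, gamma=2.0, epsilon=1.0):
    """Focal loss plus its leading Taylor term (Poly-1), sketched for binary targets."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = targets * p + (1 - targets) * (1 - p)
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    focal = alpha_t * (1 - p_t) ** gamma * ce
    # the single extra line that turns focal loss into Poly-1
    poly1 = focal + epsilon * alpha_t * (1 - p_t) ** (gamma + 1)
    return poly1.mean()
```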
The KGKPD detection network consists of three modules: the backbone network, the neck network, and the detection head. The overall network structure is shown in Figure 1.

Structure of KGKPD.
To reduce the number of model parameters and the computational load, we chose the lightweight RepVit as the backbone network for feature extraction in the KGKPD task. RepVit, proposed by Wang et al. (2024), combines the advantages of lightweight convolutional neural networks (CNNs) and ViTs to enhance the performance of visual tasks on mobile devices. This is achieved through a step-by-step enhancement of the MobileNetV3 architecture, ultimately forming a new efficient pure CNN architecture. RepVit outperforms existing lightweight ViTs in multiple visual tasks and has an advantage in terms of latency. Despite having fewer parameters and faster training, RepVit still has shortcomings when directly used for feature extraction in KGKPD. This is because the feature map output by the backbone network is downsampled from a
In order to compensate for the spatial information loss in the RepVit network, a series of measures have been implemented. Initially, the outputs from the last three stages of the RepVit network were extracted as effective feature layers. While these feature layers contain rich semantic information, some detailed information may be lost during the downsampling process. Therefore, the WF-AFPN module was designed to enhance the features of each layer. This module introduces learnable weights and the adaptive spatial fusion (ASF) operation to perform weighted fusion of features from different levels, thereby improving the expressiveness of the features. Subsequently, the feature maps processed by the WF-AFPN module were upsampled to a uniform size of
Addressing the issue of feature deformation caused by target occlusion, an advanced detection head, ELDHead, has been designed to enhance the model’s adaptability to geometric transformations of targets. ELDHead employs LDConv to sparsify features, thereby increasing the model’s sensitivity to geometric transformations. LDConv achieves this by adding learnable offsets to the sampling locations of the convolution kernel, enabling the kernel to adjust its sampling positions adaptively and thus better capture geometric transformations. This adaptive adjustment capability allows the model to more flexibly deal with geometric changes such as the scale, pose, and viewpoint of targets, significantly improving the modeling ability for complex geometric transformations. ELDHead also introduces the ELA attention mechanism, which can identify and optimize the degraded feature regions caused by occlusion, thereby significantly enhancing the model’s ability to recognize occluded targets. The ELA attention mechanism acquires feature vectors in the horizontal and vertical directions through strip pooling in the spatial dimension, maintaining a narrow kernel shape to capture long-range dependencies and preventing irrelevant areas from affecting label predictions, thereby generating rich target location features in each direction. Each directional feature vector is processed independently to obtain attention predictions, which are then combined using a product operation to ensure accurate positional information of the region of interest. This lightweight attention mechanism not only accurately locates the objects of interest but also significantly improves the overall performance of CNNs with minimal additional parameters.
To construct the relationship prior knowledge graph (RPKG) module, the initial step involves processing the assertion list data provided by ConceptNet. Specifically, these assertion data are loaded and then transformed into a pruned version that retains only the English subset while filtering out all negative relationships, such as NotDesires, NotHasProperty, NotCapableOf, NotUsedFor, Antonym, DistinctFrom, and ObstructedBy. Additionally, cycles within the graph are removed. After this processing, the resulting assertions are stored in a list, with each element comprising two concepts, their relationship, and the corresponding weight. Following this step, a lookup file and the pruned knowledge graph file are generated and output. The lookup file is used to quickly locate the integer indices corresponding to the concepts, while the pruned knowledge graph file provides the necessary input data for the subsequent RW algorithm.
Subsequently, employing the RW algorithm with the category concepts from the dataset as seed nodes, an RW with restart (RWR) score vector is computed for each seed node. By extracting the scores related to other category concepts, a matrix
Through the semantic consistency matrix provided by this JSON file, the model’s output results are strategically intervened to optimize its ability to recognize relational features between objects. This enables the consideration of semantic consistency between targets during the inference process, thereby enhancing the accuracy of object detection. Moreover, by incorporating strong prior knowledge of the spatial relationships and distributions between targets, the model can reduce false detection rates and improve localization accuracy.
Weighted Fusion Adaptive Feature Pyramid Network (WF-AFPN)
The RepVit backbone network is primarily composed of stacked RepVitBlock modules and RepVitSeBlock modules. The RepVitBlock module consists of two regular convolutions, a
When feature maps enter the network, they first pass through a Stem module for preprocessing. The Stem module consists of two
RepVit employs a cross-block placement strategy for SE layers, using them in the first, third, fifth, etc., blocks of each stage. This alternating placement aims to maximize accuracy improvement while controlling the increase in latency. Since the input image size is
Yang et al. (2023b) proposed the asymptotic feature pyramid network (AFPN), which addresses the issue of information loss during feature transmission by allowing direct interaction between non-adjacent layers and introduces an ASF operation to handle conflicts between features at different levels. To further enhance feature fusion, we propose a weighted fusion strategy and combine it with the SE module to produce multi-scale feature maps with channel-wise weights. Deconvolution is then used to upsample feature maps of different scales to the same size, and the ASF operation incorporates learnable weight parameters, becoming the WASF operation, further improving the adaptive adjustment of the fusion. Depthwise convolution replaces the original regular convolution to reduce the computational load and model parameters while maintaining network performance. The structure of the WF-AFPN network is shown in Figure 2.

Structure of Weighted Fusion Adaptive Feature Pyramid Network (WF-AFPN).
Feature maps of different levels are extracted from the backbone network, denoted as
Let
When the WF-AFPN module fuses layers
In street surveillance scenarios, targets are often partially occluded by various obstacles such as pedestrians, vehicles, and buildings, which poses a significant challenge to object detection and is prone to cause missed detections. Moreover, to accurately distinguish between different categories of targets in complex surveillance environments, such as pedestrians, vehicles, and animals, it is necessary to clearly highlight the differences among various categories in the feature maps, enabling the detection model to accurately perform classification and recognition.
To address these issues, we have carefully designed a detection head named ELDHead, which is an improved and innovative version based on the classic CenterNet detection head. In the ELDHead, we have ingeniously integrated the ELA mechanism, which plays a crucial role in enhancing detection performance.
One of the core advantages of the ELDHead is its ability to flexibly adjust convolutional kernel parameters according to changes in target shapes. In street surveillance scenarios, the shapes and postures of targets are highly variable. For instance, pedestrians may present different contours due to various actions, and vehicles can exhibit different shape features from different angles. Traditional fixed convolutional kernels often struggle to handle such shape variations. However, the ELDHead introduces the concept of deformable convolution, allowing the convolutional kernels to adaptively adjust according to the actual shapes of targets. This flexible adjustment capability greatly enhances the model’s ability to recognize occluded targets. Even when targets are partially occluded, the model can capture the key features of the targets through the adjusted convolutional kernels, thereby effectively reducing the occurrence of missed detections.

Weight Adaptive Spatial Fusion (WASF) Operation. It Demonstrates That WASF Performs Feature Fusion at Three Different Levels, Allowing for the Assignment of Different Spatial Weights to Features at Different Levels. This Enhances the Importance of Key Levels and Mitigates the Impact of Conflicting Information From Different Objects.
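To make the WASF operation concrete, the following is a simplified PyTorch sketch under several assumptions: the three input levels are already resized to a common resolution, the learnable level weights are scalars normalized by a softmax, and the post-fusion convolution is the depthwise-plus-pointwise replacement described above; the SE step and the deconvolution upsampling of WF-AFPN are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WASF(nn.Module):
    """Sketch of weighted adaptive spatial fusion over three same-sized feature maps."""
    def __init__(self, channels):
        super().__init__()
        # one learnable weight per input level, normalized with softmax at run time
        self.level_weights = nn.Parameter(torch.ones(3))
        # depthwise + pointwise convolution in place of an ordinary 3x3 convolution
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, feats):
        # feats: list of three (B, C, H, W) tensors already brought to the same size
        w = torch.softmax(self.level_weights, dim=0)
        fused = w[0] * feats[0] + w[1] * feats[1] + w[2] * feats[2]
        return F.relu(self.bn(self.pw(self.dw(fused))))
```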

The Initial Sampled Coordinates for Arbitrary Convolutional Kernel Sizes. Adapted from Zhang et al. (2024), © Elsevier. Reproduced with permission.
LDConv is an advanced convolutional technique proposed by Zhang et al. (2024). It allows for any number of parameters and any sampling shape for convolutional kernels of arbitrary sizes. Unlike traditional convolution and deformable convolutional networks (Li et al., 2023, 2022; Su et al., 2023; Wang et al., 2023b; Xiong et al., 2024; Yang et al., 2023a) operations, LDConv does not result in a quadratic increase in the number of parameters as the kernel size increases. Instead, it achieves a linear growth in the number of parameters, thereby reducing the computational burden of the model while maintaining performance. This design makes LDConv more flexible and efficient in handling feature maps of different sizes and shapes. Different kernel shapes are shown in Figure 4.
As shown in Figure 5, LDConv can dynamically adjust its sampling points according to the size of the target, allowing it to capture features of objects of different sizes more accurately, whereas traditional convolution lacks this flexibility. This characteristic enables LDConv to identify target objects more precisely when extracting high-level features. Compared to traditional convolution, it handles background noise more efficiently, thereby extracting features that are more conducive to target recognition.

Comparison of Standard Convolution and Linear Deformable Convolution Extraction Features.
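A simplified PyTorch sketch of the linear deformable convolution idea is given below, under the assumption that each spatial location samples num_points positions whose 2-D offsets are predicted from the input and are then aggregated by a pointwise convolution, so that the parameter count grows linearly with the number of sampling points; the actual LDConv initialization of sampling coordinates (Figure 4) is more elaborate than this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleLDConv(nn.Module):
    """Linearly growing deformable sampling: N offset points aggregated by a 1x1 conv."""
    def __init__(self, in_ch, out_ch, num_points=5):
        super().__init__()
        self.num_points = num_points
        # predicts a 2-D offset for each of the N sampling points at every location
        self.offset_conv = nn.Conv2d(in_ch, 2 * num_points, 3, padding=1)
        # aggregation; parameter count is num_points * in_ch * out_ch (linear in N)
        self.aggregate = nn.Conv2d(in_ch * num_points, out_ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        offsets = self.offset_conv(x)                       # (B, 2N, H, W)
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).unsqueeze(0)   # (1, H, W, 2), normalized coords
        sampled = []
        for i in range(self.num_points):
            dx = offsets[:, 2 * i] / max(w - 1, 1)          # pixel offsets -> normalized
            dy = offsets[:, 2 * i + 1] / max(h - 1, 1)
            grid = base + torch.stack((dx, dy), dim=-1)     # (B, H, W, 2)
            sampled.append(F.grid_sample(x, grid, align_corners=True))
        return self.aggregate(torch.cat(sampled, dim=1))
```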
ELA was first proposed by Xu and Wan (2024). It is an efficient local attention method that combines one-dimensional convolution with group-normalization-based feature enhancement. This approach effectively encodes two one-dimensional positional feature maps, allowing precise localization of regions of interest without dimensionality reduction, while remaining lightweight. It uses strip pooling instead of spatial global pooling to capture long-range spatial dependencies. For a convolutional output
The two resulting feature vectors are then subjected to local interaction using one-dimensional convolution. The results are further processed through group normalization and activation functions to generate position attention predictions for the two spatial directions. Finally, the prediction results are multiplied with the feature values of the input feature map to produce the attention-enhanced output.
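The following PyTorch sketch illustrates the ELA computation just described: strip pooling along each spatial dimension, a one-dimensional (here depthwise) convolution for local interaction, group normalization with a sigmoid activation, and a final product with the input; the kernel size and the number of normalization groups are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ELA(nn.Module):
    """Sketch of efficient local attention: strip pooling + 1-D conv + GroupNorm."""
    def __init__(self, channels, kernel_size=7, num_groups=16):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels, bias=False)
        self.gn = nn.GroupNorm(num_groups, channels)   # assumes channels % num_groups == 0

    def forward(self, x):
        b, c, h, w = x.shape
        # strip pooling: average over width for the H direction and over height for W
        x_h = x.mean(dim=3)                            # (B, C, H)
        x_w = x.mean(dim=2)                            # (B, C, W)
        # independent local interaction and normalization per direction
        a_h = torch.sigmoid(self.gn(self.conv(x_h))).view(b, c, h, 1)
        a_w = torch.sigmoid(self.gn(self.conv(x_w))).view(b, c, 1, w)
        # combine directional attention by product and re-weight the input features
        return x * a_h * a_w
```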
In order to effectively filter out irrelevant background noise and enhance the model’s ability to handle occlusions, the processing of the input feature maps begins with the LDConv module. This module employs LDConv operations, which can accurately capture local features of the target, thereby effectively distinguishing the target from the background. The adaptive nature of LDConv allows it to adjust according to changes in the target’s shape, minimizing the interference of background noise and capturing key features of the target even when it is partially occluded. This is crucial for street surveillance scenarios where targets are often occluded by various objects.
Subsequently, the feature maps enter the ELA module, an efficient local attention mechanism that can adaptively adjust the importance of different feature channels. In the presence of occlusions, the ELA module focuses more on features containing target information while suppressing the influence of background noise, thereby highlighting the target in complex scenes. The ELA module achieves this by combining one-dimensional convolution and group normalization. The one-dimensional convolution efficiently processes feature maps along a specific dimension, capturing linear features that indicate the target's position, while group normalization stabilizes the learning process and enhances the network's generalization ability.
After processing by the ELA module, the feature maps undergo normalization to ensure they are on a consistent scale, which is crucial for the stability and performance of subsequent operations. The feature maps are then processed through the SiLU activation function. The SiLU function is chosen because it introduces non-linearity to the model in a controllable manner, which is beneficial for learning complex patterns in the data.
Finally, an output convolutional layer maps the processed features to the heatmap, size, and offset predictions of the detection head.

Structure of Efficient Linear Deformable Convolutional Head (ELDHead).
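Putting the pieces together, the sketch below outlines the ELDHead forward path described above (LDConv, ELA, normalization, SiLU, and a final output convolution), reusing the SimpleLDConv and ELA sketches given earlier; the channel widths, the choice of BatchNorm, and the three output branches (heatmap, size, offset) follow the CenterNet convention and are assumptions rather than the exact configuration.

```python
import torch.nn as nn

class ELDHeadSketch(nn.Module):
    """Illustrative ELDHead branch: LDConv -> ELA -> norm -> SiLU -> output convs."""
    def __init__(self, in_ch, mid_ch, num_classes):
        super().__init__()
        self.body = nn.Sequential(
            SimpleLDConv(in_ch, mid_ch),  # sketched earlier; adaptive sampling for occluded targets
            ELA(mid_ch),                  # sketched earlier; directional attention on the features
            nn.BatchNorm2d(mid_ch),       # keep the features on a consistent scale
            nn.SiLU(inplace=True),
        )
        # one output convolution per branch, as in CenterNet-style heads
        self.heatmap = nn.Conv2d(mid_ch, num_classes, 1)
        self.wh = nn.Conv2d(mid_ch, 2, 1)
        self.offset = nn.Conv2d(mid_ch, 2, 1)

    def forward(self, x):
        x = self.body(x)
        return self.heatmap(x).sigmoid(), self.wh(x), self.offset(x)
```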
A knowledge graph is a method used to represent and store knowledge, organizing and displaying various relationships between entities in the form of a graph. Each entity represents a specific or abstract object, and these entities are connected through specific relationships, thereby forming a complex network. Knowledge graphs not only focus on the entities themselves but also emphasize the semantic relationships between the entities and the attributes of these relationships. This allows the knowledge graph to express deeper meanings and context. In a knowledge graph, entities are the nodes of the graph, while relationships are the edges that connect these nodes. Each edge describes a certain type of connection or interaction between entities.
In addition to entities and relationships, a knowledge graph also includes semantic descriptions, which provide additional information such as the type of relationship, the attributes of entities, or specific conditions. These elements allow the knowledge graph to represent knowledge in the world more accurately and comprehensively. Figure 7 illustrates the structure of a knowledge graph in the form of a directed graph, where each node represents an entity, each edge represents a relationship, and the semantic descriptions are supplemented through the attributes of the edges.

Knowledge Graph. Adapted from Jung et al. (2007), © Springer Nature. Reproduced with permission.
Fang et al. (2017) made a pioneering contribution to the field of object detection by integrating knowledge graph technology. They proposed two innovative methods to improve the detection accuracy of the model. The first method utilizes frequency-based knowledge, which infers the relationships between target classes by analyzing the frequency of their co-occurrence. For example, keyboards and mice often appear together, so detecting one can increase the confidence in detecting the other. The second method uses knowledge graph-based knowledge, capturing relationships between targets that have not co-appeared in actual scenes, thereby compensating for the shortcomings of the frequency-based approach. Ulger et al. (2023) further proposed a feature enhancement model based on relational priors (RP-FEM). This model uses relational priors to enhance target proposal features, running a graph transformer on the scene graph obtained from initial proposals to learn relational context modeling for object detection and instance segmentation simultaneously. Experimental results show that RP-FEM can effectively suppress impossible class predictions in images and prevent the model from generating duplicate predictions, thereby improving on its baseline model.
The RPKG model enhances the classification ability of the network by intervening in the class confidence in the heatmap output from the ELDHead. By integrating prior knowledge, this model is capable of simulating human reasoning, especially the relationships and contextual information between objects, thus effectively identifying specific object categories. Unlike traditional methods, the RPKG model not only relies on visual features from the image but also leverages external semantic information to enhance classification performance.
To achieve this goal, a knowledge graph containing all the categories in the dataset must first be constructed. This step is accomplished by utilizing the ConceptNet common-sense knowledge base, which is a large-scale open-source knowledge base offering rich semantic relationships and common-sense knowledge. ConceptNet helps the model understand potential connections between object categories, such as objects that frequently appear together or objects with similar characteristics.
Next, the RPKG model uses the RWR algorithm to process the constructed knowledge graph. The RWR algorithm performs repeated RWs within the knowledge graph, combined with a restart mechanism, allowing the model to extract more reliable and stable semantic information from the graph. In this way, the algorithm builds a semantic consistency matrix to quantify the relationship strength and semantic similarity between different categories, thereby improving the model’s reasoning ability in complex scenarios.
ConceptNet (Liu et al., 2018) is a large-scale open-source knowledge base dedicated to capturing the common-sense knowledge implied in natural language vocabulary. This knowledge base constructs a relational graph that emphasizes the connectivity of knowledge, using unstructured text and expressions close to natural language. The CSV format data of ConceptNet is shown in Table 1.
Content in ConceptNet.
To adapt the ConceptNet knowledge base for the construction of a semantic consistency matrix, it is necessary to preprocess the data by converting it into English triplets and excluding all entries that represent negative relationships as well as self-loops. The pseudocode for the ConceptNet Transformation Algorithm is presented in Algorithm 1.
In this conversion process, the correspondence between the symbols involved and their specific meanings is detailed and displayed in Table 2.
Symbol Explanation in ConceptNet.
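Since the pseudocode of Algorithm 1 is not reproduced here, the following Python sketch shows one plausible reading of the transformation step: it keeps English assertions, drops the listed negative relations and self-loops, and writes the lookup and pruned-graph files. Full cycle removal and the exact output formats are simplified assumptions.

```python
import csv
import json

NEGATIVE_RELATIONS = {"NotDesires", "NotHasProperty", "NotCapableOf",
                      "NotUsedFor", "Antonym", "DistinctFrom", "ObstructedBy"}

def prune_conceptnet(assertions_path, edges_path="pruned_edges.tsv", lookup_path="lookup.tsv"):
    """Keep English, non-negative, non-self-loop assertions and index the concepts."""
    edges, index = [], {}
    with open(assertions_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            _, rel, start, end, meta = row[:5]       # assertion URI, relation, start, end, JSON
            if not (start.startswith("/c/en/") and end.startswith("/c/en/")):
                continue                              # English subset only
            relation = rel.split("/")[-1]
            if relation in NEGATIVE_RELATIONS:
                continue                              # drop negative relationships
            head, tail = start.split("/")[3], end.split("/")[3]
            if head == tail:
                continue                              # drop self-loops
            weight = float(json.loads(meta).get("weight", 1.0))
            for concept in (head, tail):
                index.setdefault(concept, len(index))
            edges.append((index[head], index[tail], relation, weight))
    with open(lookup_path, "w", encoding="utf-8") as f:
        f.writelines(f"{concept}\t{idx}\n" for concept, idx in index.items())
    with open(edges_path, "w", encoding="utf-8") as f:
        f.writelines(f"{h}\t{t}\t{r}\t{w}\n" for h, t, r, w in edges)
    return index, edges
```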
The RW algorithm (Gionis et al., 2007) is a graph theory algorithm based on Markov chains. It simulates the process of random walking in a graph. Starting from a node, at each step, a neighboring node is randomly selected to move to, until a certain stopping condition is met. RWR is a variant of RW that introduces a restart probability to allow the walk to return to the starting node or a specific node, rather than solely relying on the adjacency relationships in the graph. Specifically, RWR requires solving the equation (10):
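As an illustration, the sketch below implements the standard RWR iteration r ← (1 − c)·P·r + c·e (assumed here to be the form of equation (10)), where P is the column-normalized adjacency matrix, e is the one-hot restart vector of a seed category, and c is the restart probability; the function that stacks the seed scores into a semantic consistency matrix uses a simple normalization that may differ from the paper's exact scaling.

```python
import numpy as np

def rwr_scores(adj, seed_idx, restart=0.15, tol=1e-8, max_iter=1000):
    """Random walk with restart from one seed node on a weighted adjacency matrix."""
    P = adj / np.maximum(adj.sum(axis=0, keepdims=True), 1e-12)   # column-normalize
    e = np.zeros(adj.shape[0])
    e[seed_idx] = 1.0
    r = e.copy()
    for _ in range(max_iter):
        r_next = (1 - restart) * P @ r + restart * e
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r

def semantic_consistency_matrix(adj, category_nodes, restart=0.15):
    """Stack the RWR scores of every category seed, restricted to the category nodes."""
    S = np.stack([rwr_scores(adj, i, restart)[category_nodes] for i in category_nodes])
    return S / (S.max() + 1e-12)   # illustrative normalization
```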
The structure of the RPKG module is shown in Figure 8. The overall idea is to enhance the original features and improve the model’s detection performance by fusing the semantic consistency matrix

Structure of Relationship Prior Knowledge Graph (RPKG).
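The sketch below illustrates one simple way such an intervention could be realized: the semantic consistency matrix propagates confidence between related classes in the heatmap, and the result is blended back with a small fusion weight. The blending rule and the fusion_weight value are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def apply_semantic_prior(heatmap, S, fusion_weight=0.1):
    """Re-weight per-class center-point confidences with a semantic consistency matrix.

    heatmap: (B, C, H, W) class-confidence heatmap from ELDHead.
    S:       (C, C) semantic consistency matrix obtained from the RWR step.
    """
    b, c, h, w = heatmap.shape
    flat = heatmap.view(b, c, -1)                        # (B, C, H*W)
    propagated = torch.einsum("ij,bjk->bik", S, flat)    # spread confidence to related classes
    fused = (1 - fusion_weight) * flat + fusion_weight * propagated
    return fused.view(b, c, h, w).clamp(0, 1)
```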
Data augmentation techniques play a crucial role in the fields of object detection and deep learning, enhancing model performance through various means. These methods not only expand the training dataset and improve the model’s generalization capabilities but also enhance its adaptability to abnormal situations by introducing a certain degree of perturbation. Common data augmentation techniques include image flipping, cropping, and color jittering.
To enhance the robustness of the model under adverse environmental conditions, we employ the injection of salt-and-pepper noise as a data augmentation technique, aimed at simulating noise interference during image transmission. The core idea of this technique is to introduce random noise points into the training data to mimic distortions and interferences occurring during the actual transmission process. Specifically, we present an adaptive salt-and-pepper noise addition algorithm based on local contrast. Local contrast is obtained by calculating the local mean and subtracting it from the original image. Then, the noise probability and intensity are dynamically adjusted based on the local contrast. Higher parameters are used in regions of high contrast, while lower parameters are applied elsewhere. The noise addition formula is shown in equation (14):

Schematic Diagram of the Noise Addition Process.
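The sketch below gives one possible implementation of the contrast-adaptive noise injection described above, assuming 8-bit images; the window size and the probability bounds are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_salt_pepper(img, base_prob=0.01, max_prob=0.05, window=7, seed=None):
    """Salt-and-pepper noise whose per-pixel probability scales with local contrast."""
    rng = np.random.default_rng(seed)
    gray = img.mean(axis=2) if img.ndim == 3 else img.astype(np.float32)
    local_mean = uniform_filter(gray.astype(np.float32), size=window)
    contrast = np.abs(gray - local_mean)
    contrast = contrast / (contrast.max() + 1e-6)         # normalize local contrast to [0, 1]
    prob = base_prob + (max_prob - base_prob) * contrast   # higher probability where contrast is high
    rand = rng.random(gray.shape)
    noisy = img.copy()
    noisy[rand < prob / 2] = 255                           # salt
    noisy[(rand >= prob / 2) & (rand < prob)] = 0          # pepper
    return noisy
```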
By doing so, the model is exposed to more noisy samples during training, allowing it to learn more generalized feature representations and avoid overfitting to clean data.
The introduction of salt-and-pepper noise not only strengthens the model’s ability to adapt to noise but also improves its stability in complex, dynamic environments. For example, in practical applications, images captured by sensors are often subject to various interferences, such as signal attenuation during transmission, changes in environmental lighting, or equipment malfunctions. By incorporating salt-and-pepper noise into the training process, the model learns to differentiate between noisy and clean signals, enabling it to make more accurate predictions in noisy environments.
Furthermore, this data augmentation technique fosters the model’s adaptability to a variety of scenarios, allowing it to maintain high performance even when facing different types of interferences. By repeatedly exposing the model to noisy data, the recognition process becomes more stable, reducing the fluctuations in predictions caused by unstable input data quality. Ultimately, the model’s robustness is significantly improved, enhancing its stability and accuracy in real-world applications.
Data Split
The UA-DETRAC dataset (Lyu et al., 2017; Wen et al., 2020) is a large-scale benchmark dataset specifically designed for object detection tasks from a road surveillance perspective. In this study, we carefully selected and extracted 8,127 images from the UA-DETRAC dataset, which not only have a representative quantity but also encompass a variety of traffic scenes under different lighting conditions, including both daytime and nighttime. This enables us to comprehensively simulate real-world traffic situations, providing a rich and diverse sample base for model training and evaluation.
These images were scientifically and rigorously divided into three subsets to meet the specific needs of different research stages. The training set consists of 6,582 images, offering a large sample size that provides sufficient learning material for the model, enabling it to fully learn and capture the characteristics and patterns of various traffic targets, thus ensuring accurate identification and localization of targets in complex traffic scenarios. The validation set includes 732 images and is primarily used for real-time monitoring and evaluation of model performance during training, allowing for timely detection and adjustment of issues such as overfitting or underfitting, ensuring the model’s stability and generalization ability during the training process. The test set consists of 813 images, used for comprehensive and objective evaluation of the model’s final performance after training, verifying the model’s actual application effectiveness and reliability on unseen data.
Regarding category classification, we conducted an in-depth analysis and detailed reclassification of the “Other” category in the selected 8,127 images from the UA-DETRAC dataset. Upon further investigation, we found that the “Other” category contained targets with certain common characteristics but also unique features. To enhance the granularity and usability of the dataset, making it more applicable to traffic target detection and recognition tasks, we further subdivided the “Other” category into three subcategories: “Pedestrian,” “Bicycle,” and “Motorcycle.” This subdivision not only enriched the category composition of the dataset, but also provided more precise annotation information for model training, facilitating more accurate recognition and differentiation of various target types, thereby improving the model’s application value and accuracy in real traffic scenarios. The specific categories and their respective quantities are detailed in Table 3, which provides a breakdown of the distribution of each category.
Classes and Quantities in a Dataset. It can be Observed That the Dataset has a Significant Imbalance in the Number of Classes.
We use the mAP at an intersection over union (IoU) threshold of 0.50 (mAP50) as the primary accuracy metric.
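For reference, the sketch below computes the IoU between two axis-aligned boxes in (x1, y1, x2, y2) format, which is the quantity the 0.50 threshold of mAP50 is applied to.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)
```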
FPS is a core metric for evaluating real-time performance in networks, reflecting the number of image frames processed per second. A higher FPS value indicates stronger real-time processing capability, which is closely related to the performance of computer hardware, the efficiency of image processing algorithms, and network bandwidth. This metric is commonly used to assess the responsiveness and smoothness of image or video processing tasks.
The number of parameters in a model is a key factor in determining its complexity and capacity. Increasing the number of parameters typically means the model has greater expressive power, enabling it to learn and capture more complex data patterns. However, an excessive number of parameters may lead to overfitting, reducing the model’s generalization ability, and it also increases the computational burden and time cost during training. Therefore, when designing a model, it is essential to balance the model’s expressive capacity with the computational resource requirements.
All experiments presented in this paper were performed on a system equipped with a Windows 11 operating system and an NVIDIA RTX 4060 Ti GPU, utilizing the PyTorch 2.4.1 deep learning framework and the Python programming language. The experimental hyperparameter settings are shown in Table 4.
Experiment Parameters.
Note. RWR = random walk with restart.
RepVit has variants of different sizes, including RepViT_m0.9, RepViT_m1.0, RepViT_m1.1, and RepViT_m1.5, where the suffix “_mx” indicates that the corresponding model has a latency of

Mean Average Precision (mAP) Values of Different RepVit.
In further analyzing the data presented in Table 5, we identified several key performance metrics that are crucial for evaluating the suitability of different backbone networks in the CenterNet model.
Comparison Results of Different Backbones.
Note. Params = parameter count; GFLOPS = giga floating-point operations per second; FPS = frames per second; mAP50 = mean average precision calculated at an intersection over union threshold of 0.50.
Firstly, from the perspective of computational complexity, giga floating-point operations per second (GFLOPS) indicate the computational demands of the model when processing images. GhostNetV3 demonstrates the most efficient performance with a computational load of 25.112 GFLOPS, suggesting that it can achieve effective feature extraction while maintaining a relatively low computational cost. In contrast, EfficientNetV2-s (Tan & Le, 2021) has a higher computational load of 34.230 GFLOPS, which may limit its application in resource-constrained environments.
Secondly, FPS is directly related to the model’s real-time processing capability. ResNet18 performs exceptionally well in this regard, achieving 230.8 FPS, making it ideal for high-frame-rate applications such as video stream processing or real-time surveillance. However, despite ResNet18’s lead in FPS, its mAP50 is the lowest, indicating that a higher frame rate may come at the expense of detection accuracy.
In terms of accuracy, EfficientNetV2-s leads with an mAP50 of 78.68%, but it comes with a significantly higher GFLOPS and parameters. RepVit_m1.1, with nearly half the number of parameters as EfficientNetV2-s, only lags behind by 1.25% in mAP50.
Overall, RepVit_m1.1 strikes a good balance between parameter count, computational complexity, and real-time processing capability. It not only has the smallest parameter count but also maintains a relatively high frame rate while offering competitive accuracy.
In deep learning research, ablation studies are a common and effective method for analyzing and validating the impact of various components on the final performance of a model. By systematically removing or replacing certain parts of the model, researchers can identify the contribution of each module, loss function, detection head, and so on, to the model’s performance. This paper presents a series of ablation experiments to test the performance improvement of different modules on the CenterNet algorithm.
To comprehensively evaluate the effects of these modules, nine sets of experiments were designed. The first experiment uses the original CenterNet algorithm as the baseline, without any modifications, and employs Hourglass104 as the backbone network. The aim of this experiment is to assess the performance of the original algorithm on a specific dataset, providing a basic performance baseline.
In the second experiment, the focal loss function in CenterNet is replaced with the Poly-1 loss function. Focal loss is the original loss function in CenterNet, designed to address the class imbalance problem, whereas Poly-1 loss is an improved loss function that, in theory, may perform better in handling different types of samples. The goal of this experiment is to evaluate the impact of the Poly-1 loss function on model training and detection accuracy.
The third experiment replaces CenterNet’s standard detection head with the ELDHead. ELDHead is a novel detection head designed to enhance the model’s ability to detect complex objects. This experiment aims to evaluate the effect of different detection heads on the model’s performance in object detection tasks.
In the fourth experiment, the RPKG module is introduced. The RPKG module is applied in object detection to enhance the feature representation capability effectively. By incorporating the RPKG module into the network, we aim to improve the model’s robustness in diverse scenarios by enhancing its feature expression capacity.
In the fifth experiment, the backbone network of CenterNet is replaced with RepVit_m1.1, a lightweight backbone that incorporates ViT design principles into an efficient pure CNN architecture and offers stronger feature learning ability. By comparing the performance of RepVit_m1.1 with the original Hourglass104 backbone network, we can evaluate the effect of a more advanced backbone network on the object detection task.
The sixth experiment builds upon the fifth experiment by further introducing the Poly-1 loss function. We aim to assess whether combining the RepVit_m1.1 backbone network with the Poly-1 loss function can further improve model performance.
The seventh experiment, based on the fifth experiment, introduces the ELDHead detection head. Compared to the sixth experiment, this experiment focuses on exploring whether the combination of a more powerful detection head and a superior backbone network leads to a significant improvement in model detection accuracy.
In the eighth experiment, the WF-AFPN module is introduced on top of the fifth experiment. WF-AFPN is an advanced feature pyramid network module that effectively enhances the network’s ability to fuse multi-scale features. By incorporating this module, we aim to further improve the model’s ability to detect targets of different scales.
The ninth experiment, which presents the proposed KGKPD algorithm, integrates all the aforementioned improvements. It comprehensively incorporates the Poly-1 loss function, ELDHead detection head, WF-AFPN module, and RPKG module, and optimizes the model based on the RepVit_m1.1 backbone network. This experiment allows us to evaluate the combined effect of all improvements and verify their impact on the overall model performance.
All the experimental results are summarized in Table 6, where key performance indicators for each group of experiments are presented. The data in the table provides a clear visualization of the specific impact of each improvement measure on performance. To clearly indicate the application of each component, the table uses “✓” to denote the use of the structure, “✗” to indicate that the structure is not used, and “-” to show that the group shares the same result as the previous line. These data allow for a comprehensive analysis of the impact of various modules and improvements on model performance, providing valuable insights for future research and practical applications.
Results of Ablation Study. It Shows That Each Individual Improvement Module can Enhance the Model’s Detection Performance to Varying Degrees.
Note. Poly-1 = Poly Loss-1; ELDHead = efficient linear deformable convolutional head; WF-AFPN = weighted fusion adaptive feature pyramid network; RPKG = relationship prior knowledge graph; mAP50 = mean average precision calculated at an intersection over union threshold of 0.50; Params = parameter count; GFLOPS = giga floating-point operations per second; FPS = frames per second.
The baseline model, without any improvement modules applied, has an mAP of 80.1%, a parameter count (Params) of
The results of the ablation study indicate that the introduction of the RPKG module significantly enhances the model’s key performance metric, mAP50, preliminarily revealing the potential value of the RPKG module in improving the precision of target detection. However, the improvement was relatively limited. To ensure the robustness and reliability of these experimental results and to avoid misleading influences caused by random factors or accidental errors during the experimental process, we further adopted a more rigorous experimental validation strategy.
Specifically, for Experimental Group 1 and Experimental Group 4, we conducted 10 independent replicate experiments, respectively. Each replicate experiment was performed under identical conditions, strictly adhering to the same experimental procedures and parameter settings to maximize the reproducibility and consistency of the results. By systematically collecting and organizing the data from these 10 independent replicate experiments, we obtained more comprehensive and reliable experimental data, which provide a solid foundation for subsequent in-depth analysis. The detailed information of these experimental data is presented in Table 7, available for further statistical analysis and interpretation of the results.
Results of Independent Replicate Experiments.
Note. RPKG = relationship prior knowledge graph.
Prior to conducting the paired-samples
Performing a paired-samples
Results of the Normality Test for Paired Differences.
Note. S–W = Shapiro–Wilk; RPKG = relationship prior knowledge graph.
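As a hedged illustration of the statistical procedure (the per-run mAP50 values below are placeholders, not the measured results), the Shapiro-Wilk normality check on the paired differences and the paired-samples t test can be run with SciPy as follows.

```python
from scipy import stats

# placeholder per-run mAP50 values (%) for the baseline group and the group with RPKG
baseline  = [80.1, 80.0, 80.2, 79.9, 80.1, 80.0, 80.3, 80.1, 80.0, 80.2]
with_rpkg = [80.6, 80.3, 80.7, 80.4, 80.5, 80.6, 80.8, 80.4, 80.5, 80.7]

diffs = [b - a for a, b in zip(baseline, with_rpkg)]
sw_stat, sw_p = stats.shapiro(diffs)                # normality of the paired differences
t_stat, t_p = stats.ttest_rel(with_rpkg, baseline)  # paired-samples t test
print(f"Shapiro-Wilk p = {sw_p:.3f}, paired t test p = {t_p:.3f}")
```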
In this study, we conducted comprehensive comparative experiments between the proposed KGKPD algorithm and several mainstream object detection algorithms on the UA-DETRAC dataset. The experimental results, as shown in Table 9, demonstrate that the KGKPD algorithm outperforms others across multiple key performance indicators. Specifically, the KGKPD algorithm achieved an mAP50 of 84.2%, the highest among all compared algorithms, highlighting its superior accuracy in object detection tasks. Additionally, the FPS of KGKPD is 40.5, which, while not the fastest, remains satisfactory considering its high precision, especially when compared to two-stage detection algorithms such as Faster R-CNN, where KGKPD shows a clear advantage in speed.
Performance Comparison Results of Several Algorithms.
Note. Params = parameter count; GFLOPS = giga floating-point operations per second; mAP50 = mean average precision calculated at an intersection over union threshold of 0.50; FPS = frames per second; AP = average precision; AR = average recall; SSD = single shot multibox detector; IoU = intersection over union: subscripts S, M, L = small, medium, and large.
In terms of model size and computational complexity, the KGKPD algorithm has 14.6M parameters and 44.5 GFLOPS, both lower than most of the compared algorithms. This indicates that KGKPD achieves a balance between high accuracy and efficient parameter usage and computational cost. Furthermore, the performance of KGKPD across different IoU thresholds is also outstanding, with its AP value for small (
Compared to other algorithms, KGKPD clearly excels in detection accuracy over Faster R-CNN while requiring fewer parameters and computational resources. In comparison with one-stage detection algorithms such as the YOLO series, the EfficientDet (Tan et al., 2020) series, and the single shot multibox detector (SSD; Liu et al., 2016), KGKPD not only leads in detection precision but also maintains a relatively low model size and computational cost. When compared with the real-time detection transformer (RT-DETR; Lv et al., 2024; Zhao et al., 2024), KGKPD reduces both parameter count and computational complexity by nearly half, with only a minor decline of 0.4 percentage points in accuracy. Additionally, KGKPD outperforms the original CenterNet by 4.1 percentage points in detection precision, along with an improvement in detection speed. In terms of parameters and computational complexity, KGKPD reduces CenterNet's parameter count by 92.32% and its computational load by 91.85%.
In summary, the KGKPD algorithm excels in object detection tasks, not only leading in accuracy over most comparison algorithms but also offering advantages in speed, parameter efficiency, and computational efficiency. These results indicate that KGKPD is an efficient and accurate object detection algorithm with broad application potential.
In the current era of rapid technological advancement, energy consumption has become one of the key indicators for evaluating the performance and practicality of algorithms. For road target detection algorithms, accurately estimating their energy consumption holds significant importance in multiple aspects. Firstly, energy consumption directly affects the feasibility and sustainability of an algorithm in real-world applications. In many practical deployment scenarios, such as mobile devices, edge computing devices, or resource-constrained traffic monitoring systems, energy supply is often limited. An energy-efficient algorithm can operate for longer periods, reducing the dependence on energy replenishment and thereby enhancing the overall availability and reliability of the system.
To comprehensively evaluate the efficiency and practical applicability of the KGKPD algorithm, we conducted an estimation of its energy consumption. The experiments utilized the Zeus package to calculate the energy consumption during both the training and evaluation processes of the KGKPD algorithm.
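The snippet below sketches how such a measurement could be set up. The ZeusMonitor interface (begin_window/end_window returning total energy in joules) is assumed from the zeus package documentation and should be checked against the installed version, and the measured workload here is only a stand-in for the real training and evaluation loops.

```python
import torch
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0])

monitor.begin_window("workload")
x = torch.randn(16, 3, 512, 512, device="cuda")
weight = torch.randn(64, 3, 3, 3, device="cuda")
for _ in range(100):                      # stand-in for one training or evaluation pass
    _ = torch.nn.functional.conv2d(x, weight, padding=1)
torch.cuda.synchronize()
measurement = monitor.end_window("workload")

print(f"energy: {measurement.total_energy:.1f} J over {measurement.time:.2f} s")
```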

Diagram of Comparative Experimental Results.

Noise Addition Experiment Comparison Figure.

Visualization of Different Noise Levels.
Figure 11 presents a detailed comparison of the performance of various object detection algorithms. Through an in-depth analysis, it is evident that the KGKPD algorithm demonstrates outstanding performance in object detection tasks. Compared to other algorithms, the KGKPD algorithm not only accurately identifies predefined target objects but also exhibits superior suppression capability when handling non-target objects. This significantly reduces the false positive rate, thereby enhancing the reliability of the detection results. In contrast, some other algorithms often suffer from missed detections and false detections during the detection process. Missed detection refers to the failure of the algorithm to recognize an object present in the image, while false detection occurs when the algorithm mistakenly identifies background or non-target objects as targets. These issues directly affect the accuracy and reliability of the detection results, thereby impacting the effectiveness of the algorithm in practical applications. In the image on the far right, it is evident that only the KGKPD model successfully identified the two individuals in the top left corner of the image. Other models failed to accurately detect these targets, resulting in their omission or misclassification. Overall, the results presented in Figure 11 conclusively demonstrate the superiority of the KGKPD algorithm in object detection tasks.
Robustness Study
We carefully designed two sets of comparative experiments. In the experimental group, images in the training dataset were subjected to noise addition with a probability of 0.5, and the addition of salt-and-pepper noise was dynamically adjusted based on the local features of the images. For instance, in high-contrast regions of the images, the intensity and probability of noise could be appropriately increased, as the features in these regions are more pronounced and relatively more robust to noise. Conversely, in low-contrast regions, the intensity and probability of noise were reduced to avoid excessive interference with the target features.
The control group, on the other hand, maintained the original state of the images without any noise processing, providing a baseline reference for the experiment without noise interference. By comparing the network performance between the experimental and control groups, we were able to quantitatively assess the specific impact of salt-and-pepper noise on network performance, thereby providing a solid basis for network optimization and improvement.
Figure 12 illustrates the changes in detection accuracy of the network under different noise conditions, clearly presenting the differences in detection accuracy between the experimental and control groups. The experimental results showed that the experimental group with salt-and-pepper noise interference had a significantly higher robustness compared to the control group without noise, indicating that the network had developed a certain degree of adaptability to salt-and-pepper noise and could maintain a relatively stable performance in noisy environments.
Furthermore, Figure 13 displays the visualization results of KGKPD under different levels of interference. It can be observed that as the noise intensity gradually increased, the detection results of KGKPD changed accordingly. Under low noise levels, KGKPD was able to identify targets accurately, with detection boxes precisely positioned and well matched to the target contours, indicating that the network still performed well under slight noise interference. When the noise intensity increased further, the model could no longer distinguish small targets that were difficult even for the naked eye to discern, but it could still roughly localize larger targets. This further confirms the strong robustness of the experimental group network in the face of salt-and-pepper noise.
Conclusion
We have developed a model for road target recognition based on the CenterNet concept, employing the RepVit network as the feature extraction backbone. To enhance detection accuracy, we designed the WF-AFPN strategy, which dynamically fuses three key feature maps to improve the detection of small targets, and integrated ELDHead to enhance the recognition of occluded targets. The RPKG module was introduced to further optimize the detection results. In terms of data augmentation, salt-and-pepper noise was used to increase the model’s robustness to data loss, and the Poly-1 loss function was utilized to address class imbalance issues. Although the model demonstrated high detection accuracy on the actual highway image dataset and featured low model complexity and computational demand, it has not yet fully met the requirements for lightweight deployment. Moreover, the scale of the dataset used in the experiments was relatively small, and the model’s generalization ability still needs further validation. Therefore, future work will focus on optimizing the model’s lightweight design and expanding the dataset scale to further enhance the model’s performance and applicability.
Acknowledgments
I am deeply grateful to my advisor, Professor Liao, for the invaluable guidance and support throughout this research. Professor Liao played a crucial role in shaping the study’s direction, offering insightful advice and constructive feedback at each stage. His expertise and dedication have been a constant source of inspiration. I appreciate the time and effort invested in refining my research ideas and overcoming challenges. Without his guidance, this work would not have been possible.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
