Sage Journals: Discover world-class research

Abstract

As deep learning continues to advance, object detection technology shows promising prospects for counting cylindrical objects in industries such as timber processing, construction, and pipeline engineering. Traditional manual counting is inefficient, error-prone, and labor-intensive. Object detection can address these issues, improving work efficiency and reducing labor costs. This paper therefore introduces a variant of the YOLOv5s algorithm, called YOLOv5-COC, designed specifically for counting cylindrical objects. The paper makes the following contributions. Firstly, data augmentation techniques are used to expand the dataset, thereby enhancing the generalization ability of the model. Secondly, the K-means $++$ algorithm is employed in place of the conventional K-means algorithm to improve the initialization of anchor boxes. Thirdly, a coordinate attention mechanism is incorporated, the Bidirectional Feature Pyramid Network (BiFPN) is integrated, and the loss function is replaced, in order to further refine the model and improve its recognition precision. Finally, ablation experiments are used to assess the effect of each of these methods. The experimental results show that the proposed YOLOv5-COC model attains an mAP of 98.7%, a Precision of 98.3%, a Recall of 99.1%, and an mAP@0.5:0.95 of 72.4%, while running at 60 FPS. Compared with the original YOLOv5s model, mAP improves by 1.3 percentage points, precision by 1 percentage point, recall by 1 percentage point, and mAP@0.5:0.95 by 3.5 percentage points, while the frame rate rises by 27.7 FPS (from 32.3 to 60).
In summary, the YOLOv5-COC model demonstrates a sufficiently high level of accuracy in object detection tasks, mitigating instances of both false negatives and false positives. It efficiently accomplishes the task of object detection.

Keywords

YOLOv5s, YOLOv5-COC, counting, object detection

1. Introduction

Fueled by the continuous progress of artificial intelligence technology, the gradual replacement of humans by object detection technology in performing various complex tasks has become increasingly evident. For instance, facial recognition technology has been applied in compulsory education [1], aiming to tackle concerns like enhancing campus security, automating attendance tracking, and monitoring student emotions. It has also found use in border control processes [2], reducing waiting times during crossings. Additionally, it has been employed in mask face recognition [3], effectively mitigating the spread of viruses during pandemic periods. However, object detection has seen relatively limited application in the realm of object quantity recognition. Following extensive research conducted by the team, it has been discerned that the application of object detection technology holds substantial potential and vast prospects in the identification of quantities of cylindrical objects. Conventional manual counting methods encounter numerous challenges in the recognition of cylindrical objects, including inefficiency, susceptibility to errors, and substantial human resource consumption. By incorporating object detection technology, these issues can be effectively addressed, leading to improved work efficiency and reduced labor costs. In everyday life and various industrial sectors, numerous application scenarios necessitate the identification of the quantity of cylindrical objects. For instance, in the timber processing industry, the utilization of object detection technology for the automatic recognition and counting of logs ensures the smooth progression of the production process. Similarly, in the construction and pipeline sectors, precise detection of the quantity of steel bars and water pipes ensures the structural integrity and seamless execution of operations. 
Hence, the recognition of the quantity of cylindrical objects in images holds significant research value and promising prospects for the future. To investigate target detection algorithms for recognizing the quantity of cylindrical objects in images, this paper improved upon the YOLOv5s-based algorithm and introduced the YOLOv5-COC model. This paper presents some valuable contributions as shown below:

1. Regarding the dataset, suitable data augmentation methods were chosen to expand it, with the aim of simulating extreme scenarios and enhancing the model’s generalization ability.
2. Concerning anchor box initialization, the k-means $++$ algorithm was employed for cluster analysis in anchor box selection. This choice mitigates the impact of poorly initialized cluster centers, often observed with the k-means algorithm, thereby enhancing the quality and stability of the clustering results.
3. A coordinate attention mechanism was introduced to establish effective correlations between different positions within the feature map, which helps capture contextual information and semantic relationships pertaining to the targets.
4. The Bidirectional Weighted Feature Pyramid Network (BiFPN) was fused with the network structure of YOLOv5s, effectively integrating feature information from different scales and improving object detection performance.
5. The original algorithm’s loss function was replaced with Smooth Intersection over Union (SIoU). In contrast to the original loss function in YOLOv5s, SIoU provides a smoother estimation of the overlap between bounding boxes, resulting in greater training stability, faster convergence, and improved performance.
6. Non-Maximum Suppression (NMS) was replaced with DIoU-NMS. DIoU-NMS allows for a more precise selection of the anchor boxes with the highest target confidence and optimal positioning, while also adapting better to various target shapes.

The specific contributions outlined in this paper are illustrated in Fig. 1.

2. Related works

As deep learning continues to make significant progress, the emergence of Convolutional Neural Networks (CNNs) [4] has gradually supplanted traditional object detection algorithms. Object detection algorithms primarily fall into two categories: one-stage object detection algorithms and two-stage object detection algorithms. One-stage object detection algorithms are exemplified by YOLO [5, 6, 7, 8], SSD [9], DSSD [10], RetinaNet [11], and EfficientDet [12]. The prominent two-stage object detection algorithms include R-CNN [13], SPPnet [14], Fast R-CNN [15], Faster R-CNN [16], Mask R-CNN [17], and FPN [18]. However, it must be emphasized that two-stage object detection algorithms can enhance detection precision, but at the expense of extended computational time when contrasted with one-stage detection algorithms.

Due to the prolonged computational time of current two-stage object detection algorithms, they are unable to meet the exigencies of real-time detection. In 2016, Redmon [5] first introduced the YOLO algorithm for object category recognition in images. In 2017, Redmon [6] introduced the YOLOv2 model, which was capable of detecting over 9000 different categories, leading to an improvement in detection accuracy. In 2018, Redmon [7] once again introduced the YOLOv3 model, which employed the DarkNet-53 network architecture. The model incorporated the ideas of ResNet by stacking more layers for feature extraction and also utilized Spatial Pyramid Pooling Networks to facilitate multi-scale inputs and same-size outputs. In 2020, Alexey [8] introduced the YOLOv4 model, which adopted the CSP DarkNet-53 network architecture, enabling the model to adapt to complex scenarios. In the same year, Ultralytics introduced the YOLOv5 model. While it didn’t exhibit significant improvements in detection speed and accuracy, it was notable for its lighter model weight. Subsequently, Chuyi Li [19], Alexey Bochkovskiy [20], and Ultralytics, building upon the foundation of YOLO, introduced various other versions. However, within the YOLO series of algorithms, YOLOv5 stands out for its smaller model size and faster speed compared to previous versions. This makes it particularly suitable for deployment on energy-efficient devices, including embedded and mobile systems.

One of the applications of object detection is counting, which aims to alleviate issues associated with manual counting, including low efficiency, low accuracy, and limited real-time capabilities. Arinaldi [21] employed the Faster R-CNN algorithm to establish a system for the analysis of traffic road videos. This system, in particular, can accomplish tasks such as vehicle flow statistics, classification of vehicle types, estimation of vehicle speeds, and monitoring of lane utilization. Tu [22] introduced an approach for passion fruit detection and counting using Faster R-CNN, which provided crucial information for yield prediction and machine harvesting. Ahmad [23] introduced a method for top-down view pedestrian detection and counting based on SSD, enabling the statistical analysis of pedestrian traffic. Xu [24] introduced a method for cattle counting using Mask R-CNN in images collected by drones, and applied it in various scenarios such as extensive ranches and feeding facilities.

YOLO, being a popular object detection framework, is extensively used for counting as well. Zhao [25] introduced a modified wheat ear detection method based on YOLOv5 for drone images. This improvement expanded the YOLO algorithm’s usability in challenging field conditions, facilitating precise detection of wheat ears, particularly small-sized ones. In the realm of transportation, Li [26] presented an object detection approach using YOLOv5s to count axles and recognize tire types. This method captures vehicle information and is used to determine whether trucks are overweight. Ling [27] introduced an enhanced YOLOv5s detection approach, integrating it with DeepSORT object tracking for crowd monitoring and counting in tourist attractions and urban environments. In the field of logistics, Xie [28] introduced a lightweight recognition model called Tailored-YOLO, which enables parcel counting in warehouse processes. Hence, this paper introduces an improved YOLOv5s model for the localization and counting of densely arranged cylindrical objects, further enhancing work efficiency.

3. Improved YOLOv5s network model

This section explains the enhancement techniques proposed for the YOLOv5s model, encompassing the attention mechanism, neck architecture, loss function, post-processing method, and anchor box initialization.

3.1 Coordinate attention

The concept of attention mechanism, initially introduced by Bahdanau [29] in 2014, draws its inspiration from human visual research. It autonomously acquires critical information from the input data, leading to an enhancement in the model’s performance. Distinct weights are allocated to various feature layers by the attention mechanism, enabling the prioritization of segments with higher weights, which are more critical. Previously, attention mechanisms in lightweight networks commonly utilized the Squeeze-and-Excitation Network (SE), primarily addressing inter-channel information and disregarding positional information. While subsequent approaches, such as the Convolution Block Attention Module (CBAM), have endeavored to derive positional attention information via convolution by reducing the channel count, convolution can exclusively capture local relationships and is deficient in extracting long-range relationships.

Coordinate attention (CA) [30], as an innovative and effective attention module, involves two principal phases: embedding of coordinate information and the generation of coordinate attention. In the section on embedding of coordinate information, coordinate attention transforms two-dimensional global pooling into feature encodings specific to individual dimensions, facilitating the capture of distant spatial interactions with exact positional details. For the input X, encoding is carried out independently for each channel. Along the horizontal axis, it uses a pooling kernel with a size of (h, 1), while along the vertical axis, it utilizes a kernel with a size of (1, w). As illustrated in Eq. (1) and Eq. (2), these operations merge features both horizontally and vertically, generating a set of feature maps that are sensitive to direction.

Fig. 1 illustrates the coordinate attention module.

Figure 1. Coordinate attention model.

$\displaystyle\mbox{Z}_{c}^{h}(h)=\frac{1}{w}\sum\limits_{0\leqslant i<w}{X_{c}(h,i)}$ (1)

$\displaystyle\mbox{Z}_{c}^{w}(w)=\frac{1}{h}\sum\limits_{0\leqslant j<h}{X_{c}(j,w)}$ (2)

Where, $w$ and $h$ represent the width and height of the feature map, respectively. $\mbox{Z}_{c}^{w}(w)$ and $\mbox{Z}_{c}^{h}(h)$ represent the output results obtained after performing average pooling in the lateral and vertical orientations, respectively.

In the coordinate attention generation stage, as shown in Eq. (3), the direction-aware feature maps are first concatenated and passed through a shared 1 $\times$ 1 convolution that reduces the channel count to $C/r$, where $r$ is the channel reduction ratio. BatchNorm and ReLU are then applied, yielding an intermediate feature map $f$ that encodes positional information along the vertical and horizontal directions.

Subsequently, $f$ is split into two distinct tensors, $f^{h}\in R^{C/r\times h}$ and $f^{w}\in R^{C/r\times w}$. A separate convolution is then applied to each of $f^{h}$ and $f^{w}$, restoring the channel count to that of the input $X$, followed by a Sigmoid normalization-weighting step to obtain $g^{h}$ and $g^{w}$, as shown in Eqs (4) and (5).

$\displaystyle f=\delta(F_{1}([Z_{c}^{h},Z_{c}^{w}]))$ (3)

$\displaystyle g^{h}=\sigma(F_{h}(f^{h}))$ (4)

$\displaystyle g^{w}=\sigma(F_{w}(f^{w}))$ (5)

Where, $F_{1}(\cdot)$ represents the stacking of the two input tensors, followed by a 1 $\times$ 1 convolution and BatchNorm for data normalization; $\delta(\cdot)$ denotes the ReLU activation function; $g^{h}$ signifies the generated vertical direction weights; $g^{w}$ represents the generated horizontal direction weights; $F_{h}(\cdot)$ and $F_{w}(\cdot)$ are each a single convolution operation used to adjust channel counts; $\sigma(\cdot)$ stands for the Sigmoid normalization-weighting process.

Finally, $g^{h}$ and $g^{w}$ are expanded and serve as coordinate attention weights. The ultimate equation is delineated in the subsequent fashion:

$\displaystyle y_{c}(i,j)=x_{c}(i,j)\times g_{c}^{h}(i)\times g_{c}^{w}(j)$ (6)

Where, $x_{c}(i,j)$ represents the original input; $g_{c}^{h}(i)$ denotes the vertical direction weights; $g_{c}^{w}(j)$ signifies the horizontal direction weights; $y_{c}(i,j)$ represents the feature map obtained after weighting.
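The pooling and reweighting steps of Eqs (1), (2), and (6) can be sketched for a single channel in plain Python. This is only an illustration under stated assumptions: the learned transforms $F_{1}$, $F_{h}$, and $F_{w}$ are replaced by the identity, so the weights come directly from a Sigmoid of the pooled values, and the function names are hypothetical rather than taken from the paper's implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def directional_pool(x):
    """Eqs (1)-(2): average one channel (a list of rows) along each axis."""
    h, w = len(x), len(x[0])
    z_h = [sum(row) / w for row in x]  # one value per row (height index)
    z_w = [sum(x[i][j] for i in range(h)) / h for j in range(w)]  # one per column
    return z_h, z_w


def reweight(x, g_h, g_w):
    """Eq. (6): y(i, j) = x(i, j) * g_h(i) * g_w(j)."""
    return [[x[i][j] * g_h[i] * g_w[j] for j in range(len(g_w))]
            for i in range(len(g_h))]


def coordinate_attention_single_channel(x):
    z_h, z_w = directional_pool(x)
    # Assumption: the convolutions F_1, F_h, F_w are treated as identity here,
    # so only pooling, Sigmoid weighting, and reweighting remain.
    g_h = [sigmoid(v) for v in z_h]
    g_w = [sigmoid(v) for v in z_w]
    return reweight(x, g_h, g_w)
```

Because each weight depends on a whole row or a whole column, the value at position $(i,j)$ is scaled by information gathered along both axes, which is how the module retains positional detail that a single global pooling would discard.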

In general, coordinate attention is best inserted at the intermediate layers of the network. In the shallow layers, the extracted spatial feature maps tend to be overly large and the channel count is insufficient to capture specific features effectively. In the later layers, on the other hand, an excessive number of channels can lead to overfitting. Furthermore, the closer the coordinate attention mechanism is placed to the classification layer, the more likely it is to disturb the classification outcomes, which affects the stability of the classification results. Therefore, selecting the intermediate stages of the network as the insertion location for the coordinate attention mechanism can effectively enhance the feature representation capabilities.

To further assess the impact of the insertion point of the coordinate attention mechanism on the algorithm, an ablation experiment was conducted. The candidate insertion positions are illustrated in Fig. 2, denoted as (a), (b), and (c). Specifically, (a) and (b) correspond to insertion in the Backbone, while (c) corresponds to insertion in the Neck. The results, shown in Experiment 1 of Fig. 7, indicate that (c) is the best insertion location.

Figure 2. Insertion positions of coordinate attention.

3.2 BiFPN

The Neck architecture in YOLOv5s is crucial for object detection, refining backbone network features, and enhancing feature fusion. It consists of two key components: the FPN and the PAN, which collaborate effectively, as demonstrated in the structural diagram presented in Fig. 3. The first two columns represent the FPN structure, while the rightmost column represents the bottom-up structure of the PAN architecture. The FPN structure establishes a pathway from top to bottom for feature fusion, but the unidirectional flow of information limits the capacity for feature integration. The PANet network builds upon FPN by introducing a bottom-up pathway for information enhancement, effectively preserving more shallow-level features. However, due to the excessive retention of shallow-level semantic information, it results in a significant loss of deep-level semantic information in the network. Hence, to strike a balance between retaining more shallow-level semantic information and avoiding excessive loss of relatively deep-level semantic information, the BiFPN structure is adopted as a replacement for the original FPN+PAN structure.

Figure 3. FPN + PAN structure.

The BiFPN structure, introduced by Mingxing Tan [31] in 2020, is an improvement over PAN and represents a bidirectional feature pyramid network. The improvements encompass two aspects: the first is the implementation of efficient bidirectional cross-scale connections, and the second is the incorporation of weighted feature map fusion. By employing bidirectional fusion, it reconstructs bidirectional pathways, facilitating the fusion of feature information originating from distinct scales of the backbone network. By incorporating both upsampling and downsampling, this unit not only unifies feature resolution scales but also establishes bidirectional connections between feature maps at the same scale. This integration of features avoids additional cost and partially resolves the problem of losing important feature information. Additionally, BiFPN is treated as a fundamental unit, where a pair of pathways within BiFPN is considered a single feature layer. To achieve further high-level feature fusion, this unit is repeated several times, as depicted in the network structure shown in Fig. 4.

Figure 4. BiFPN structure.

In contrast to conventional feature fusion, BiFPN incorporates a weighted fusion mechanism that learns the importance of each input feature and adjusts the fusion accordingly. To this end, BiFPN attaches a trainable weight to each input. BiFPN uses fast normalized fusion to determine the fusion weights: each weight is divided by the sum of all weights, which keeps every normalized fusion weight within the range [0, 1]. Fast normalized fusion is 30% faster than Softmax-based fusion under similar optimization results. The calculation formula is defined as follows:

$\displaystyle\textit{Out}=\sum\limits_{i}{\frac{w_{i}}{\varepsilon+\sum\limits_{j}{w_{j}}}\cdot\textit{In}_{i}}$ (7)

In Eq. (7), $w_{i}$ represents the learnable weight of the $i$-th input, and the ReLU activation function ensures that $w_{i}\geqslant 0$. $\varepsilon$ is a very small value employed to avoid numerical instability. $\textit{In}_{i}$ represents the input features, while Out represents the result of weighted feature fusion.
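As a minimal sketch of the fusion rule in Eq. (7), the function below fuses feature vectors represented as flat Python lists. The function name and the default value of `eps` are assumptions made for illustration; in a real network the weights would be trainable parameters and the inputs would be resized feature maps.

```python
def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse equally sized feature vectors with ReLU-clipped, sum-normalized weights.

    inputs  : list of feature vectors (lists of floats of equal length)
    weights : one raw (possibly negative) weight per input
    eps     : small constant guarding against division by zero (assumed value)
    """
    relu_w = [max(0.0, w) for w in weights]  # ReLU keeps every weight >= 0
    denom = eps + sum(relu_w)                # shared normalizer of Eq. (7)
    length = len(inputs[0])
    return [
        sum((w / denom) * feat[k] for w, feat in zip(relu_w, inputs))
        for k in range(length)
    ]


# Two inputs with equal weights are (almost exactly) averaged.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

Unlike Softmax, this normalization needs no exponentials, which is the source of the speed advantage mentioned above.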

3.3 Loss function

CIoU [32] is a localization loss function used in YOLOv5s, and its formula is as follows:

$\displaystyle L_{\textit{CIoU}}=1-\textit{IoU}+\frac{\rho^{2}(b,b^{\textit{gt}})}{c^{2}}+\alpha v$ (8)

$\displaystyle v=\frac{4}{\pi^{2}}\left(\arctan\frac{w^{\textit{gt}}}{h^{\textit{gt}}}-\arctan\frac{w}{h}\right)^{2}$ (9)

$\displaystyle\alpha=\frac{v}{(1-\textit{IoU})+v}$ (10)

where, $b$ and $b^{\textit{gt}}$ represent the centers of predicted box B and true box A, respectively. $\rho^{2}(b,b^{\textit{gt}})$ represents the squared Euclidean distance between the two center points. $c$ is the diagonal length of the smallest box that encloses both the predicted and true boxes. $\alpha$ is a parameter used as a trade-off. $v$ is a parameter used to measure aspect ratio consistency, where $w$, $h$ and $w^{\textit{gt}}$, $h^{\textit{gt}}$ are the widths and heights of the predicted and true boxes.

The CIoU takes into account the intersection area, center point distance, and aspect ratio when comparing predicted and target bounding boxes. However, the issue of misaligned angles between the true box and the predicted box is not considered. Hence, this study incorporates the SIoU loss function for bounding box regression.
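To make the three CIoU terms concrete, the following sketch evaluates Eqs (8) to (10) for two axis-aligned boxes given as `(x1, y1, x2, y2)` tuples. The box format, the function names, and the small `eps` guard are assumptions for illustration; the sketch also assumes the second arctangent in Eq. (9) takes the predicted box's width-to-height ratio.

```python
import math


def iou_xyxy(a, b):
    """Plain IoU of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def ciou_loss(pred, gt, eps=1e-9):
    iou = iou_xyxy(pred, gt)
    # rho^2: squared distance between the two box centers
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) / 2) ** 2 \
         + (((pred[1] + pred[3]) - (gt[1] + gt[3])) / 2) ** 2
    # c^2: squared diagonal of the smallest box enclosing both boxes
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    # v: aspect-ratio consistency, Eq. (9)
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = 4 / math.pi ** 2 * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    # alpha: trade-off weight, Eq. (10); eps avoids 0/0 for identical boxes
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

For two identical boxes every term vanishes and the loss is zero; for two disjoint boxes of equal shape only the IoU and center-distance terms contribute.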

Compared to the initial loss function, the SIoU loss function [33] incorporates the angle between the ground truth box and the predicted box as part of the loss function. Specifically, it comprises four constituent elements, with the precise formula detailed as follows:

$\displaystyle\textit{Loss}_{\textit{SIoU}}=1-\textit{IoU}+\frac{\Delta+\Omega}{2}$ (11)

$\displaystyle\Lambda=1-2\times\sin^{2}\left(\arcsin(x)-\frac{\pi}{4}\right)$ (12)

$\displaystyle\Delta=\sum\limits_{t=x,y}{(1-e^{-(2-\Lambda)\rho_{t}})}$ (13)

$\displaystyle\Omega=\sum\limits_{t=w,h}{(1-e^{-w_{t}})^{\theta}}$ (14)

Where, $\Lambda$ represents the Angle cost, $\Delta$ represents the Distance cost considering the Angle cost condition, and $\Omega$ represents the Shape cost.
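Eqs (11) to (14) leave $x$, $\rho_{t}$, $w_{t}$, and $\theta$ undefined, so the sketch below fills them in with the definitions commonly used for SIoU. These are assumptions, not statements taken from this paper: $x$ is the vertical center offset divided by the center distance, $\rho_{t}$ is the squared center offset normalized by the enclosing box size, the shape term $w_{t}$ is the relative width or height difference, and $\theta=4$. Boxes are `(x1, y1, x2, y2)` tuples and the function name is hypothetical.

```python
import math


def siou_loss(pred, gt, theta=4.0):
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    inter = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0])) \
          * max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    iou = inter / (w * h + wg * hg - inter)

    # Center offsets and center distance
    dx = (gt[0] + gt[2]) / 2 - (pred[0] + pred[2]) / 2
    dy = (gt[1] + gt[3]) / 2 - (pred[1] + pred[3]) / 2
    sigma = math.hypot(dx, dy)

    # Angle cost, Eq. (12); x is assumed to be |dy| / sigma (sine of the angle)
    x = min(1.0, abs(dy) / sigma) if sigma > 0 else 0.0
    angle = 1 - 2 * math.sin(math.asin(x) - math.pi / 4) ** 2

    # Distance cost, Eq. (13); rho_t assumed normalized by the enclosing box
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    gamma = 2 - angle
    dist = (1 - math.exp(-gamma * (dx / cw) ** 2)) \
         + (1 - math.exp(-gamma * (dy / ch) ** 2))

    # Shape cost, Eq. (14); w_t assumed to be the relative size difference
    omega_w = abs(w - wg) / max(w, wg)
    omega_h = abs(h - hg) / max(h, hg)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2  # Eq. (11)
```

Under these assumptions the angle cost is zero when the two centers are aligned horizontally or vertically and largest at 45 degrees, so the distance cost is weighted more heavily once the boxes are already aligned along one axis.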

3.4 Improvement in Non-Maximum Suppression

Non-Maximum Suppression (NMS) is an essential algorithm in the post-processing phase of object detection tasks. The traditional NMS algorithm relies solely on the IoU metric to suppress redundant bounding boxes. However, this approach often fails in cases of overlapping or occluded objects, making it challenging to accurately identify very small objects, as they are typically overshadowed by larger ones. At the same time, when using the traditional NMS algorithm, all detection boxes need to be sorted, and pairwise IoU calculations must be performed. If there are a large number of detection boxes, this can lead to significant computational overhead. Therefore, Distance-IoU-NMS (DIoU-NMS) [34] is chosen as a replacement for traditional NMS.

In contrast to NMS, DIoU-NMS takes into account not only the IoU value but also the distance between the center points of the two boxes being compared. The formula is as follows:

$\displaystyle S_{i}=0,f_{\textit{DIoU}}(M,B_{i})\geqslant\textit{thresh}$ (15)

$\displaystyle S_{i}=S_{i},f_{\textit{DIoU}}(M,B_{i})<\textit{thresh}$ (16)

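Where, thresh represents the set threshold; $M$ is the detection box with the highest confidence score and $B_{i}$ is another candidate box; $f_{\textit{DIoU}}(M,B_{i})$ represents the DIoU value between them, that is, the IoU reduced by the normalized distance between the two center points; $S_{i}=0$ indicates that the detection box is a redundant box, and $S_{i}=S_{i}$ indicates that the box corresponds to a different target and keeps its score.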
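The suppression rule of Eqs (15) and (16) can be sketched as a greedy loop. The sketch assumes boxes in `(x1, y1, x2, y2)` format, a DIoU defined as the IoU minus the squared center distance divided by the squared diagonal of the enclosing box, and a default threshold of 0.5; the function names are hypothetical.

```python
def diou(a, b):
    """IoU minus the normalized squared distance between the box centers."""
    inter = max(0.0, min(a[2], b[2]) - max(a[0], b[0])) \
          * max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) / 2) ** 2 \
         + (((a[1] + a[3]) - (b[1] + b[3])) / 2) ** 2
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    return inter / union - rho2 / (cw ** 2 + ch ** 2)


def diou_nms(boxes, scores, thresh=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)  # M: the remaining box with the highest score
        keep.append(m)
        # Eq. (15): drop boxes whose DIoU with M reaches the threshold;
        # Eq. (16): the others keep their score and stay in the queue.
        order = [i for i in order if diou(boxes[m], boxes[i]) < thresh]
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 0.0, 11.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
kept = diou_nms(boxes, [0.9, 0.8, 0.7])
object_count = len(kept)  # for a counting task, the count is the number of kept boxes
```

Because the center-distance penalty lowers the DIoU of two boxes whose centers are far apart, neighbouring objects that overlap heavily are less likely to suppress each other than under plain IoU, which matters for densely packed cylinders.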

3.5 Utilizing the k-means $++$ algorithm for anchor box selection

The YOLOv5s model utilizes the k-means clustering algorithm to ascertain the anchor box sizes. Nevertheless, it’s worth mentioning that the k-means algorithm’s sensitivity to the initialization of cluster centers should not be overlooked. As k-means includes the random selection of initial cluster centers, running the algorithm multiple times may yield slightly varying results. If the initial selection of cluster centers during the running process is not appropriate, it can have a significant impact on the final clustering results. Hence, in this paper, the k-means $++$ algorithm is utilized as an alternative to the k-means algorithm to address this concern.

The k-means $++$ algorithm [35], as an improved version of k-means, addresses the strong influence that the initial selection of cluster centers has on the k-means result. Its main idea is to improve the way initial cluster centers are selected: by increasing the probability of selecting points that are farther from the existing centers as new cluster centers, it enhances the quality of the initial cluster centers. Since the dataset in this paper differs significantly from the COCO dataset used in YOLOv5s, the default anchor boxes may not be suitable for this dataset. Therefore, prior to training, the k-means $++$ algorithm is used to generate anchor boxes that suit the current dataset. In the end, this paper used 9 cluster centers, resulting in the following final clustering results: (19, 30), (24, 10), (26, 47), (31, 13), (30, 13), (38, 18), (42, 76), (57, 32), (84, 49).
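The seeding idea can be sketched as follows for boxes given as `(width, height)` pairs. The sketch assumes `1 - IoU` between width-height pairs as the distance, which is a common choice for anchor clustering but is not stated in this paper, and the function names are hypothetical.

```python
import random


def wh_iou(a, b):
    """IoU of two (width, height) pairs aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_pp_init(boxes, k, rng):
    """k-means++ seeding: far-away boxes are more likely to become centers."""
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        # Squared distance of every box to its nearest existing center
        d2 = [min(1.0 - wh_iou(b, c) for c in centers) ** 2 for b in boxes]
        if sum(d2) == 0.0:  # every box coincides with a center already
            centers.append(rng.choice(boxes))
        else:
            centers.append(rng.choices(boxes, weights=d2, k=1)[0])
    return centers


def kmeans_anchors(boxes, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(boxes, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for b in boxes:  # assign each box to the center with the highest IoU
            idx = max(range(len(centers)), key=lambda i: wh_iou(b, centers[i]))
            clusters[idx].append(b)
        new_centers = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])  # small to large anchors
```

Calling `kmeans_anchors(label_boxes, 9)` on the labelled box sizes (here `label_boxes` is a placeholder for the dataset's width-height list) would yield nine anchors analogous to those listed above.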

4. Experiments and results

4.1 Dataset

Table 1
Data augmentation techniques

Data augmentation techniques	Purpose
Brightness variation	Adjusts the brightness of images to simulate different lighting conditions during photography.
Motion blur	Blurs images in a way that simulates camera or object motion, typically done to create motion blur effects.
Gaussian noise	Adds gaussian noise to the image, simulating noise that can occur during image capture, transmission, or processing.
Sharpening	Enhances high-frequency components in the image to reduce blurriness.

The dataset in this study was acquired by photographing bamboo sticks with a smartphone. The bamboo sticks were handheld during the capture, and the camera had a resolution of 48 megapixels. The capture environment was well-lit. This paper employed the labelImg software for image annotation, designating a single category, “bar”. The dataset consists of 344 captured images, divided into training, validation, and test sets in a ratio of nearly 6:2:2.

To further increase the dataset’s size and enhance the model’s generalization, this paper employed data augmentation techniques to simulate extreme conditions in the captured images, thereby increasing the diversity of training images. The selected data augmentation methods include brightness variation, motion blur, Gaussian noise, and image sharpening. Each data augmentation method generates one new image from a source image. The final dataset has thus been expanded from 344 images to 1376 images, with 793 images in the training set, 306 images in the test set, and 277 images in the validation set. Table 1 describes the purpose of the data augmentation methods, and Fig. 5 illustrates the effect of each technique. In Fig. 5, (a) represents the original image, (b) represents the effect under strong lighting conditions, (c) shows the effect under weak lighting conditions, (d) demonstrates the effect of motion blur, (e) shows the effect of Gaussian noise, and (f) displays the effect of image sharpening.
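The four augmentation methods can be illustrated on a tiny grayscale image stored as a list of rows with values in 0 to 255. This is a sketch under assumptions: the paper does not state which library or parameters were used, so the brightness factor, noise level, blur length, and sharpening kernel below are illustrative choices and the function names are hypothetical.

```python
import random


def clamp(v):
    """Round to an integer pixel value inside [0, 255]."""
    return max(0, min(255, int(round(v))))


def adjust_brightness(img, factor):
    """factor > 1 simulates strong lighting, factor < 1 weak lighting."""
    return [[clamp(p * factor) for p in row] for row in img]


def add_gaussian_noise(img, sigma, rng):
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def motion_blur_horizontal(img, k=3):
    """Average each pixel with up to k-1 pixels to its left."""
    out = []
    for row in img:
        blurred = []
        for j in range(len(row)):
            window = row[max(0, j - k + 1): j + 1]
            blurred.append(clamp(sum(window) / len(window)))
        out.append(blurred)
    return out


def sharpen(img):
    """3x3 kernel [[0,-1,0],[-1,5,-1],[0,-1,0]]; border pixels are copied."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            v = 5 * img[i][j] - img[i - 1][j] - img[i + 1][j] \
                - img[i][j - 1] - img[i][j + 1]
            out[i][j] = clamp(v)
    return out
```

Since none of these operations moves pixels, the bounding box annotations of the source image can be reused unchanged for each augmented copy.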

4.2 Experimental platform

All experiments were conducted using the PyTorch framework. The software and hardware configuration parameters are as shown in Table 2.

Table 2
Software and hardware platform configuration parameters

Configuration	Parameter
CPU	AMD R5-5600H
GPU	NVIDIA RTX 3050
CUDA	12.0
Operating system	Windows 10
Development language	Python 3.9
Deep learning framework	PyTorch 1.10.0

Table 3
Confusion matrix

Predicted/actual	Relevant	Irrelevant
Detected	TP	FP
Undetected	FN	TN

Figure 5. Effects of the data augmentation techniques.

4.3 Evaluation metrics

In this experiment, the network’s performance will be evaluated using the following metrics: precision, recall, mean average precision (mAP), and frames per second (FPS). Before introducing these metrics, let’s first discuss some concepts: TP represents true positive, indicating that the positive class from the ground truth bounding box is correctly detected as positive in the object detection results. TN represents true negative, indicating that the negative class from the ground truth bounding box is correctly detected as negative in the object detection results. FP represents false positive, indicating that the negative class from the ground truth bounding box is incorrectly detected as positive in the object detection results. FN represents false negative, indicating that the positive class from the ground truth bounding box is incorrectly detected as negative in the object detection results. The meanings of TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative) are provided in Table 3.

Precision measures the model’s accuracy in predicting positive instances. The calculation formula for precision is as follows:

$\displaystyle\textit{Precision}=\frac{\textit{TP}}{\textit{TP}+\textit{FP}}$ (17)

Recall assesses the model’s capacity to detect real positive samples. It is calculated as follows:

$\displaystyle\textit{Recall}=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}$ (18)

The AP metric assesses the area under the Precision-Recall (P-R) curve. The calculation formula for Average Precision is as follows:

$\displaystyle\textit{AP}=\sum\limits_{i=1}^{n-1}{P(r_{i+1})(r_{i+1}-r_{i})}$ (19)

Where, AP stands for Average Precision, $r_{1},\ldots,r_{n}$ are the recall levels in increasing order, $P(r_{i+1})$ represents the precision at recall level $r_{i+1}$, and $r_{i+1}-r_{i}$ represents the increment in recall between two consecutive levels.

The mAP is utilized to compute the average precision across all classes within the dataset. The calculation formula for mAP is as follows:

$\displaystyle\textit{mAP}=\frac{1}{N}\sum\limits_{i=1}^{N}{\textit{AP}_{i}}$ (20)

Where, mAP signifies the Mean Average Precision, $N$ denotes the number of categories, and $\textit{AP}_{i}$ represents the AP for category $i$.

The FPS metric represents the number of frames or images processed or displayed per second and is typically used to assess the speed and performance of devices such as computers or cameras. The formula for calculating FPS is as follows:

$\displaystyle\textit{FPS}=\frac{1}{T}$ (21)

Where, $T$ represents the processing time for each frame, measured in seconds. Typically, a higher FPS value indicates that the device can process more image frames within a unit of time, which reflects faster processing speed and better performance.
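The metrics in Eqs (17) to (21) translate directly into a few helper functions. The sketch below is illustrative only: the function names are hypothetical, and the AP function assumes that the recall levels are supplied in increasing order together with the precision at each level.

```python
def precision_recall(tp, fp, fn):
    """Eqs (17)-(18); returns 0.0 when a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Eq. (19): sum of P(r_{i+1}) * (r_{i+1} - r_i) over increasing recalls."""
    ap = 0.0
    for i in range(len(recalls) - 1):
        ap += precisions[i + 1] * (recalls[i + 1] - recalls[i])
    return ap


def mean_average_precision(ap_per_class):
    """Eq. (20): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def frames_per_second(seconds_per_frame):
    """Eq. (21): FPS is the reciprocal of the per-frame processing time."""
    return 1.0 / seconds_per_frame


p, r = precision_recall(tp=90, fp=10, fn=30)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
```

With a single category, as in this dataset, mAP equals the AP of that one class.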

4.4 Experiment results

In this section, we introduce and analyze the ablation experiments on the coordinate attention mechanism, the performance comparison with different algorithms, and the module ablation experiments. Finally, the experimental results of the YOLOv5-COC model are analyzed. In the ablation experiments on the coordinate attention mechanism, we examined the influence of incorporating the attention mechanism at various network positions, comparing four distinct integration methods. The experimental results confirm that adding coordinate attention to the Neck network is the most suitable choice. The specific insertion method is illustrated in Fig. 2(c), and the detailed results of the coordinate attention ablation experiments are presented in Table 4.

In the comparative experiments assessing the performance of various algorithms, we compared YOLOv5-COC with Fast R-CNN, SSD, YOLOv3, YOLOv4, and YOLOv5s with default settings. The YOLO series of algorithms shows a significant improvement in precision, recall, average precision, and FPS compared to Fast R-CNN and SSD. Compared with YOLOv5s, YOLOv5-COC improves precision by 1 percentage point, recall by 1 percentage point, and mAP by 1.3 percentage points, and raises the frame rate by 27.7 FPS, so the improved YOLOv5-COC model exhibits faster inference speed. The specific results of the performance comparison of different algorithms are shown in Table 4. The impact of the various improvement modules on the YOLOv5s algorithm was assessed in module ablation experiments, whose results are likewise shown in Table 4. After adding the coordinate attention mechanism at the Neck layer, incorporating BiFPN in the network, replacing the original loss function with SIoU, and changing the bounding box selection algorithm to DIoU-NMS, the YOLOv5-COC model showed improvements in Precision, Recall, mAP@0.5, mAP@0.5:0.95, and FPS. The values increased from the initial YOLOv5s values of 97.3%, 98.1%, 97.4%, 68.9%, and 32.3, respectively, to 98.3%, 99.1%, 98.7%, 72.4%, and 60.

The final model exhibits a modest improvement in performance compared to the initial model: a 1% increase in precision, a 1% increase in recall, a 1.3% increase in mAP, and a notable 3.5% increase in mAP@0.5:0.95. While the performance gain of the final model may not appear significant in traditional metrics, it is crucial to consider the nature of the task at hand. The task of detecting cylindrical objects is relatively straightforward, and typical object detection models can accurately identify the majority of cylindrical objects in the dataset. Even the initial, unimproved object detection model performs reasonably well.

However, when dealing with densely distributed cylindrical objects, some challenges arise. These challenges primarily include the problems of multiple detections and missed detections. The multiple detections issue occurs when the model incorrectly combines multiple recognized cylindrical objects into a single object, resulting in over-detection. This affects precision since a single actual object is mistakenly labeled as multiple objects. On the other hand, the missed detection issue arises when closely spaced cylindrical objects or adverse environmental conditions cause the model to miss some objects, leading to a decrease in recall. Given that these issues are relatively rare within the dataset, they may not significantly impact traditional performance evaluation metrics. Therefore, the relatively small improvement in the final model's performance is a result of the specific nature of the task and the relatively low occurrence rate.
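A small worked example may help show why rare errors barely move these metrics. The sketch below uses the standard definitions of precision and recall; the image with 60 objects and the error counts are hypothetical and chosen only for illustration.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical image containing 60 cylindrical objects.
clean = precision_recall(tp=60, fp=0, fn=0)          # every object found once
one_duplicate = precision_recall(tp=60, fp=1, fn=0)  # one object reported twice
one_missed = precision_recall(tp=59, fp=0, fn=1)     # one object not detected

print(clean, one_duplicate, one_missed)
```

Under these assumptions, a single duplicate detection lowers precision only to 60/61, and a single missed object lowers recall only to 59/60, so removing such errors can change the headline metrics by little more than one percentage point.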

The final model of this study demonstrates a notable improvement in FPS compared to the initial model, with an increase of 27.7 FPS (from 32.3 to 60). The contribution of each modification to this result is as follows:

1. Incorporation of Coordinated Attention Mechanism: The study introduced a coordinated attention mechanism to the existing YOLOv5s structure. This enhancement introduces an additional computational burden, requiring extra computational resources for tasks such as coordinate information processing and attention generation. As a result, it initially led to a reduction in the algorithm's FPS metric.
2. Integration of BiFPN Network: YOLOv5s was fused with the BiFPN network, introducing bidirectional channels from high to low resolution for improved feature fusion. Compared to the original structure, BiFPN offers a lightweight advantage, contributing to the increase in the FPS metric.
3. Replacement of CIoU Loss Function with SIoU: The study replaced the CIoU loss function in YOLOv5s with SIoU loss. SIoU employs smoothing operations to approximate IoU, avoiding complex mathematical computations while maintaining smoother gradients. For this dataset, SIoU demonstrated faster convergence and computation speed, resulting in an increased FPS metric.
4. Implementation of DIoU-NMS: In place of traditional NMS algorithms, DIoU-NMS was used. Traditional NMS typically employs IoU for measuring bounding box overlap, which can become computationally intensive when dealing with a large number of objects to be detected. In this dataset, the prevalence of objects for detection accentuated the drawbacks of traditional NMS. DIoU-NMS incorporates distance information alongside traditional IoU, which is typically faster to compute than IoU. Moreover, DIoU-NMS typically requires only one iteration to determine the retained bounding boxes, reducing computation time and contributing to the improved FPS metric.
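The suppression rule of DIoU-NMS can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation used in this study: the function names, the (x1, y1, x2, y2) box format, and the threshold values are all hypothetical. A candidate box is suppressed when IoU minus the normalized squared center distance (d^2 / c^2, where c is the diagonal of the smallest enclosing box) exceeds the threshold.

```python
def iou_and_diou(a, b):
    """Boxes are (x1, y1, x2, y2). Returns (IoU, IoU - d^2 / c^2)."""
    # Intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centers.
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
        + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    diou = iou - d2 / c2 if c2 > 0 else iou
    return iou, diou


def diou_nms(boxes, scores, threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Keep only candidates whose DIoU with the selected box is low enough.
        order = [i for i in order
                 if iou_and_diou(boxes[best], boxes[i])[1] <= threshold]
    return keep


# A near-duplicate of the first box is suppressed; a distant box is kept.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 0, 30, 10)]
kept = diou_nms(boxes, scores=[0.9, 0.8, 0.7], threshold=0.5)

# Two neighbouring boxes whose plain IoU is above 0.4 but whose DIoU is below it.
neighbours = [(0, 0, 10, 10), (4, 0, 14, 10)]
kept_neighbours = diou_nms(neighbours, scores=[0.9, 0.8], threshold=0.4)
```

In the second example the center-distance penalty keeps both boxes, whereas a plain IoU test with the same threshold would discard the lower-scoring one. This is the behaviour that matters when cylindrical objects are densely stacked.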

In summary, the FPS metric of YOLOv5-COC was significantly enhanced by these improvements, with the exception of a minor reduction caused by the added coordinated attention mechanism. These enhancements lay a strong foundation for real-time detection in the future.

The experimental results indicate that the proposed YOLOv5-COC model is effective and superior in recognizing cylindrical objects. To validate the robustness of the algorithm, experiments were conducted under weak lighting, strong lighting, and image blur. The detection performance remained good, demonstrating the algorithm's strong robustness. The detection results are shown in Fig. 6. Fig. 6(a) shows the original image, which contains 60 objects to be detected. Fig. 6(b) shows the detection results under low light conditions, with all 60 objects detected. Fig. 6(c) displays the detection results under strong light conditions, still detecting all 60 objects. Finally, Fig. 6(d) shows the detection results under motion blur, again successfully detecting all 60 objects.

Table 4
Experimental results

Experiment	Location	Model	Precision	Recall	mAP@0.5	mAP@0.5:0.95	FPS
Ablation experiment of coordinated attention mechanism	Not added	YOLOv5s	97.5%	98.4%	98.0%	67.6%	-
	Backbone		97.4%	97.7%	98.5%	65.8%	-
	Backbone		97.2%	98.3%	98.6%	69.2%	-
	Neck		98.1%	98.5%	98.7%	70.3%	-
Experimental comparison of algorithmic performance		Fast R-CNN	84.2%	81.6%	82.6%	-	4.1
		SSD	80.1%	78.3%	78.2%	-	8.3
		YOLOv3	93.3%	86.5%	90.3%	-	15.3
		YOLOv4	95.6%	90.3%	92.1%	-	18.6
		YOLOv5s	97.3%	98.1%	97.4%	-	32.3
		YOLOv5-COC	98.3%	99.1%	98.7%	-	60
Module ablation experiments		YOLOv5s	97.3%	98.1%	97.4%	68.9%	32.3
		YOLOv5 + CA	98.1%	98.5%	98.7%	70.3%	27.5
		YOLOv5 + BiFPN	98.3%	98.6%	98.6%	71.1%	56
		YOLOv5 + SIoU	98.2%	98.4%	98.7%	70.0%	59.9
		YOLOv5 + DIoU-NMS	98.3%	98.4%	98.7%	70.0%	67
		YOLOv5-COC	98.3%	99.1%	98.7%	72.4%	60

Figure 6.
Inference results under low light, strong light, and motion blur conditions.

4.5 Training process

4.5.1 Experimental parameter configuration

The training parameters are identical for all models. Training runs for 150 iterations with a batch size of 8 and a learning rate of 0.01. The model weights are saved after each iteration, and the final model weights are taken from these saved weights. Warmup is employed during training to increase the learning rate gradually, which reduces overfitting to small batches of data in the early stages of training, dampens oscillations, and keeps the model stable.
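The idea of learning rate warmup can be illustrated with a short sketch. The linear ramp below is only one common form of warmup; the paper does not state which schedule or warmup length was used, so `warmup_steps`, `start_lr`, and the function name are assumptions. Only the target learning rate of 0.01 comes from the configuration above.

```python
def warmup_lr(step, base_lr=0.01, warmup_steps=100, start_lr=0.0):
    """Linear warmup: ramp from start_lr to base_lr, then hold base_lr.

    warmup_steps and start_lr are hypothetical values for illustration.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return start_lr + (base_lr - start_lr) * step / warmup_steps


# Learning rate at the start, halfway through warmup, and after warmup.
schedule = [warmup_lr(s) for s in (0, 50, 100, 1000)]
print(schedule)
```

Because the first updates use a small learning rate, the early and noisy gradients from small batches change the weights only slightly, which is the stabilizing effect described above.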

4.5.2 Training results

After completing the 150 training iterations (epochs) specified in Section 4.5.1, we evaluate the training results. The performance metrics for both the training and validation sets during training are shown in Fig. 7.

Figure 7.

Experimental results on the training and validation sets.

At around 40 iterations, the validation set's target loss reaches its lowest value. Afterward, as the number of iterations grows, the validation target loss gradually starts to rise, indicating the occurrence of overfitting. The plots of the other evaluation metrics show that their values reach a plateau, with no significant further increase. However, the bounding box loss and target loss on the training set still show a decreasing trend, and at the 148th iteration the model stops training due to minimal training variation. To ensure obtaining the optimal model, this paper selects the best weights after 150 iterations as the weights for cylindrical object detection.
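The two decisions described here, choosing the best weights and stopping once the metrics no longer improve, can be sketched as follows. This is a generic illustration rather than the procedure used in this study: the score list, the `patience` and `min_delta` parameters, and both function names are hypothetical.

```python
def best_epoch(val_scores):
    """Return (index, value) of the highest validation score (first occurrence)."""
    best = max(range(len(val_scores)), key=lambda i: val_scores[i])
    return best, val_scores[best]


def early_stop_epoch(val_scores, patience=3, min_delta=0.0):
    """Return the epoch at which training would stop.

    Training stops once the score has failed to improve by more than
    min_delta for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores) - 1


# Hypothetical validation scores that rise and then plateau.
scores = [0.50, 0.60, 0.65, 0.66, 0.66, 0.65, 0.66, 0.64]
chosen = best_epoch(scores)
stopped_at = early_stop_epoch(scores, patience=3)
```

With these hypothetical scores, the best weights come from epoch 3, while training stops at epoch 6 after three epochs without improvement. The same distinction applies above: the weights kept for detection are the best ones observed, not necessarily those of the last iteration.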

5. Conclusion

This paper introduces an object detection algorithm, built on the YOLOv5s network, that is designed to identify cylindrical objects. Before model training, data augmentation is applied to the dataset, including Brightness Adjustment, Noise Addition, and Motion Blur. The K-means++ algorithm is employed to select the initial anchor boxes, preventing improper cluster center selection from affecting the model. In the Neck network, a coordinated attention mechanism is introduced to enhance the model's precision in locating and identifying objects. To enhance multi-scale feature fusion, the BiFPN structure is introduced. Furthermore, the SIoU loss function is used as the model's loss function, and DIoU-NMS replaces the original non-maximum suppression. Experimental findings demonstrate that the final model proposed in this research achieves a 1% improvement in precision, a 1% increase in recall, a 1.3% improvement in mAP, a 3.5% improvement in mAP@0.5:0.95, and an increase of 27.7 in FPS.

6. Discussion

While the object detection algorithm proposed in this study has shown promising performance in recognizing cylindrical objects, there are still several directions for future research and improvements. Here are some prospects:

Exploring the Detection of Various Categories of Cylindrical Objects: By expanding the dataset to include images of bundled logs, steel pipes, and other cylindrical objects of various shapes and sizes, the algorithm can be trained to detect a wider range of cylindrical objects, enhancing its generalization capabilities.

Addressing Detection in Complex Scenes: The current algorithm has been primarily tested and evaluated in simple scenes. There is a need to extend its applicability to more complex and cluttered scenes for robust object detection. For instance, in situations where cylindrical objects are densely packed, occlusions can become more prevalent, potentially leading to missed detections or the merging of multiple cylindrical objects into a single entity. Moreover, variations in the orientation of cylindrical objects can also impact the detection performance. Addressing these challenges will require the development of more robust algorithms capable of handling various complexities in real-world scenes.

Application in real-world scenarios: Our algorithm holds the potential for various practical applications, such as industrial automation and smart forestry. We aim to further optimize the algorithm, explore transfer learning techniques, and conduct testing and deployment in real-world settings to validate its effectiveness and reliability in practical applications.

Footnotes

Funding

This work was supported in part by the Research Projects for Provincial University (grant no.KY2022 059).
