Abstract
This study introduces an improved lightweight section-steel surface detection (ILSSD) YOLOX-s algorithm model to enhance feature fusion performance in single-stage target detection networks, addressing both the low accuracy of defect detection on section-steel surfaces and the limited computing resources available at steel plants. The ILSSD YOLOX-s model introduces a depth-wise separable convolution (DSC) module to reduce the parameter count, a dual parallel attention module to improve feature extraction efficiency, and a weighted feature fusion path based on the bi-directional feature pyramid network (BiFPN). Additionally, the CIoU loss function is employed for bounding-frame regression to enhance prediction accuracy. On the NEU-DET dataset, the ILSSD YOLOX-s model achieves a mean average precision of 75.9% at an IoU threshold of 0.5 (mAP@0.5), an improvement of 7.1 percentage points over the original YOLOX-s model, with a detection speed of 78.4 frames per second (FPS). Its practicality is further validated by training and validating it on a lightweight section-steel surface defect dataset collected from an industrial steel plant, confirming its viability for industrial defect detection applications.
Keywords
Introduction
Section-steel plays a vital role in the development of industries and economies, finding extensive applications in shipbuilding, machinery manufacturing, construction, and bridges. Recently, there has been a growing demand for section-steel with superior surface quality that not only meets performance standards but also exhibits exceptional durability. However, factors such as environmental conditions, variations in raw material quality, and manufacturing processes can cause surface defects during section-steel production. These defects degrade the steel’s wear resistance, corrosion resistance, and fatigue strength, and they pose risks in practical use.
With the progress of computer technology, machine vision-based defect detection has become widely used in industrial applications.1,2 In recent years, deep learning algorithms for object detection have advanced remarkably. The existing literature classifies deep learning object detection algorithms into two categories, that is, two-stage and single-stage methods. Two-stage methods include the faster region-based convolutional neural network (F-RCNN)3,4 and improved F-RCNN. 5 Single-stage methods primarily consist of the you-only-look-once (YOLO) series6–16 and the single-shot multi-box detector (SSD).17,18 Recently, general object detection algorithms have been hindered by the lack of clear features in images captured in real environments, leading to a demand for high-performance super-resolution models. In Wang et al., 19 a novel fusion structure combining a distilled feature pyramid with serial CNN and transformer models was proposed. Wang et al. 20 also used a parallel fusion structure involving CNN and transformer models.
The YOLO series of object detection algorithms has gained widespread adoption in various fields due to its fast and accurate detection results. In reference 7, an improved YOLOX-s model is proposed for identifying sub-health regions on rape plants during the bolting stage in agriculture. Moreover, multiple enhanced YOLOv5 models have been developed for detecting helmets, 8 lithium battery poles, 9 road damage, 10 and weld defects, 11 and for video-based sentiment analysis. 12 Additionally, improved YOLOX models have been created specifically for face mask detection, 13 traffic sign detection, 14 and human multi-modal neuromorphic monitoring in smart home applications. 15 Furthermore, in Wang et al., 16 the researchers proposed an enhanced feature pyramid model, AF-FPN, which addresses information loss during feature map generation and improves representation capability. Replacing the feature pyramid in YOLOv5 with AF-FPN has been shown to improve detection performance for multi-scale traffic signs while maintaining real-time capability.
The problem of steel surface defect detection has been extensively studied using various YOLO algorithms. In Cheng et al., 21 an improved YOLOv3 model was proposed, while Cao et al., 22 Shi et al., 23 Jiang, 24 and Xu et al. 25 addressed the issue by introducing several YOLOv5-based models with different modified sub-modules. Wang et al. 26 proposed a strip-surface defect detection model that integrates two strategies into the fast-response YOLOv5 model to enhance feature recognition capability and optimize the network architecture. More recent advancements have been made with improved YOLOX algorithms.27,28 However, these YOLO series algorithms rely primarily on the NEU-DET dataset 29 for training and efficiency verification, rather than on datasets of steel defects collected in actual production.
The main research gaps in detecting surface defects in section-steel are as follows. (1) Scale variation, with some imperfections being minute and others significantly larger. (2) Shape variation, as steel surface defects come in diverse shapes, complicating accurate detection and classification. (3) Detection efficiency, which is crucial for precise and real-time defect identification in industrial applications. Motivated by the preceding analysis, we propose an improved lightweight section-steel surface detection (ILSSD) YOLOX-s model, built on the original YOLOX-s model, 30 for detecting surface defects on lightweight section-steel. Our study’s key contributions are summarized as follows.
(1) Utilizing the DSC module 31 for processing to create a more parameter-efficient version of the YOLOX-s model, reducing computational resource requirements and making it practical for steel plants.
(2) Introducing an efficient fused attention (EFA) module that combines channel and spatial attention modules in parallel to enhance network sensitivity towards defect features and minimize loss of defect information.
(3) Integrating a multi-scale weighted feature fusion path from BiFPN, 32 called the BiFPN with efficient fused attention (BiFPN-EFA) network, into the Neck to improve overall feature fusion performance.
(4) Enhancing the boundary frame regression loss function with the CIoU loss function 33 to improve positioning accuracy of various defects within the model.
Furthermore, we validate our proposed ILSSD YOLOX-s model not only using NEU-DET dataset 29 for training and efficiency verification but also employing an actual dataset comprising lightweight section-steel surface defects for training and validation. These experiments demonstrate that our ILSSD YOLOX-s algorithm is correct, effective, and applicable in identifying surface defects in lightweight section-steel manufacturing plants.
The remainder of this study is organized as follows. Section “Architecture of YOLOX-s model” provides a brief introduction to the original YOLOX-s model architecture and describes the functions of its sub-modules. In section “Improved lightweight section-steel surface detected YOLOX-s algorithm model,” we propose the ILSSD YOLOX-s model architecture and provide four strategies for enhancing the original YOLOX-s model, including an explanation of the operation principle of the DSC and EFA modules, details on the BiFPN network structure, and the characteristics of the CIoU loss function. Section “Experimental results and analysis” covers the training and efficiency verification of the ILSSD YOLOX-s model using the NEU-DET dataset 29 and an actual dataset collected from the steel plant. Finally, a summary is made.
Architecture of YOLOX-s model
The YOLOX model, 30 which was introduced in 2021, is an innovative algorithm for single-stage object detection. It incorporates the advanced features of the YOLO series algorithms by integrating a decoupled head and an anchor-free design into its network architecture. Owing to its excellent network performance, the YOLOX model has made significant contributions to target detection tasks such as face mask detection and remote sensing image analysis. This model comprises a Backbone, a Neck for feature extraction and fusion, and a detection Head. The original YOLOX-s algorithm model, depicted in Figure 1, is built upon the architecture of the YOLOX model and adopts the simplified optimal transport assignment (SimOTA) sample matching strategy.34,35 Compared with the original YOLOX model, the YOLOX-s model improves both detection speed and accuracy. The definition and operation of each sub-module can be found in Ge et al. 30

Structure of YOLOX-s algorithm model.
The main function of the Backbone module depicted in Figure 1 is to extract features from the input image and transmit the resulting feature maps to the Neck module for further feature extraction and fusion. The Neck module is a hybrid structure that combines elements of the feature pyramid network (FPN) 36 and the path aggregation network (PAN). 37 First, the FPN structure up-samples the feature maps top-down and merges them with the feature maps of the corresponding scale generated by the Backbone module, enhancing the positional accuracy of local features. Subsequently, through a bottom-up down-sampling process, the PAN structure fuses its feature maps with the corresponding-scale feature maps output by the FPN to capture strong semantic information from high-level features. This improves defect classification accuracy and enhances the precision of the network’s predictions. Finally, for an input image size of 640 × 640 × 3, the Head module predicts on feature maps at three scales: 80 × 80 × 128, 40 × 40 × 256, and 20 × 20 × 512.
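The three feature-map sizes quoted above follow from the input resolution and the down-sampling factor at each scale. The sketch below is only an illustration: the strides of 8, 16, and 32 are an assumption inferred from the 640 → 80, 40, 20 sizes in the text, and the function name `head_input_shapes` is hypothetical.

```python
def head_input_shapes(input_size=640, strides=(8, 16, 32), channels=(128, 256, 512)):
    """Return (height, width, channels) for each scale fed to the Head.

    The strides are assumed (inferred from 640 -> 80/40/20); the channel
    counts are the ones quoted in the text.
    """
    return [(input_size // s, input_size // s, c) for s, c in zip(strides, channels)]


print(head_input_shapes())  # [(80, 80, 128), (40, 40, 256), (20, 20, 512)]
```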
Improved lightweight section-steel surface detected YOLOX-s algorithm model
Incorporating an attention mechanism during the feature extraction and fusion stages of the object detection model can enhance the significance of relevant features, reduce the impact of irrelevant features, and enhance the model’s sensitivity toward effective features. The lightweight section-steel surface often displays numerous minor imperfections, yet the original YOLOX-s model alone proves inadequate in precisely identifying these minute flaws and efficiently capturing and incorporating their characteristics, leading to a diminished level of accuracy when detecting defects on the steel surface.
To prevent information loss during feature extraction, it is essential to integrate the efficient fused attention (EFA) module, which enhances the network’s ability to detect defective features. However, the inclusion of a complex attention mechanism module significantly affects detection speed. In addition, the detection of small defects and fine linear cracks on the steel surface suffers in the YOLOX-s model because the three independent feature fusion paths of the Neck lack interactive fusion of multi-scale features.
In light of the preceding analysis, this study introduces a more advanced and efficient network, the improved lightweight section-steel surface detection (ILSSD) YOLOX-s algorithm model, for detecting and identifying surface defects in lightweight section-steel. It is depicted in Figure 2. This network model builds on the original YOLOX-s architecture and integrates EFA modules before each of the three connecting paths that originally feed into the Neck. This modification aims to improve both the feature extraction efficiency and the feature fusion capability of the network. Furthermore, we have modified the connection mode of the original feature fusion path in the Neck module to facilitate multi-scale feature fusion, leading to improved detection performance.

Improved lightweight steel surface detected (ILSSD) YOLOX-s model.
Design of slim ILSSD YOLOX-s model with depth-wise separable convolution
The network architecture of the ILSSD YOLOX-s algorithm is considerably more complex than that of the original YOLOX-s, and therefore places greater demands on computer hardware resources when employed for steel surface defect detection and identification. The judicious use of slim convolutional modules in the ILSSD YOLOX-s model can effectively reduce the parameter count and computational requirements without compromising the accuracy of object detection, thereby facilitating the deployment of identification models on devices with limited computing resources.
Depth-wise separable convolution (DSC), one kind of slim convolutional module, has been shown to be more efficient than the standard convolution module while retaining a notable level of precision. 38 The DSC comprises two steps, that is, depth-wise convolution and point-wise convolution. Applied in sequence, the two steps play the role of a single standard convolution. Compared with standard convolution, the DSC achieves comparable or even superior feature extraction capability with fewer parameters, thereby enabling the development of slim models without compromising accuracy. The approach in Figure 2 replaces every conventional Conv module embedded in the ILSSD YOLOX-s model with a DSC module, aiming to reduce complexity and obtain a slim, enhanced model with fewer parameters.
The comparative evaluation of the slim model’s efficiency can be conducted by contrasting the count of parameters in the DSC convolutional module with that in a standard one.31,39,40 The convolution operation is applied to the input feature graph, which has dimensions
The dimensions of the feature graph in the convolution operation are determined by the height (
The DSC first performs depth-wise convolution and then applies
The second step involves point-wise convolution, where a
From equations (3) and (4), the total count of parameters of DSC,
The ratio of the count of DSC parameters to the count of parameters for standard convolution is
According to equation (5), the DSC has fewer parameters than the standard convolution, which makes a slim ILSSD YOLOX-s model achievable.
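The parameter comparison can be checked numerically with the standard counting argument for depth-wise separable convolution. The sketch below is illustrative only: the symbols `k` (kernel size), `c_in` (input channels), and `c_out` (output channels) are assumed names, bias terms are ignored, and the example layer size is hypothetical rather than taken from the model.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dsc_params(k, c_in, c_out):
    """Weights in a depth-wise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 128 input channels, 256 output channels.
k, c_in, c_out = 3, 128, 256
ratio = dsc_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)
closed_form = 1 / c_out + 1 / k**2  # the ratio simplifies to 1/c_out + 1/k^2

print(standard_conv_params(k, c_in, c_out), dsc_params(k, c_in, c_out))
print(round(ratio, 4), round(closed_form, 4))
```

For a 3 × 3 kernel the ratio is dominated by the 1/k² term, so under these assumptions the DSC needs roughly one ninth of the weights of the standard convolution.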
Parallel efficient fused attention module
To prevent information loss during feature extraction, it is crucial to integrate efficient fused attention (EFA) modules to enhance the network’s capability to detect defective features. However, the use of complex attention mechanism modules may have a significant impact on detection speed. According to Hu et al., 41 different channel characteristics can have varying impacts on network prediction results. The squeeze-and-excitation network (SE-Net) uses channel weights to represent the importance of individual characteristics, but the sequential dimension reduction and subsequent dimension increase in its squeeze and excitation operations add significantly to the computational load. Additionally, there is a risk of losing important channel characteristic information when reducing channels with significant weight.
The efficient channel attention network (ECA-Net), proposed in a previous study, 42 uses an adaptive convolution kernel to establish inter-channel connections while maintaining the original dimensionality. In contrast to SE-Net, it shows lower computational complexity and improved efficiency. However, the SE-Net and ECA-Net models primarily focus on channel importance in feature extraction, overlooking spatial feature information within the network. In Woo et al., 43 the convolutional block attention module (CBAM) is used to establish a sequential connection between channel and spatial attention, promoting their interaction for improved network feature extraction performance. The dual attention network (DA-Net) proposed in a previous study 44 simultaneously integrates channel and spatial attention mechanisms, but it is characterized by a substantial parameter count and significant computational complexity.
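As an illustration of the adaptive kernel mentioned for ECA-Net, the sketch below computes the 1D convolution kernel size from the channel count using the rule published with ECA-Net, with its default constants γ = 2 and b = 1. It mirrors the commonly used implementation (truncate, then bump even values to the next odd number); whether the attention module in this study uses the same constants is an assumption.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size as a function of the channel count.

    Follows the ECA-Net rule k = |log2(C)/gamma + b/gamma|, forced to be odd.
    gamma=2 and b=1 are the ECA-Net defaults, assumed here for illustration.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


for c in (64, 128, 256, 512):
    print(c, eca_kernel_size(c))
```

Because the kernel only spans a handful of neighbouring channels, the channel dimension is never reduced, which is the property contrasted with SE-Net in the text.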
Recently, in Wang et al., 11 a new fusion strategy called multiscale alignment fusion with parallel feature filtering (MSAPF) was developed to integrate and filter multiscale features effectively. In addition, Yeung and Lam 45 proposed a fused-attention network (FA-Net) to address the issues mentioned above. It includes an adaptively balanced feature fusion method for integrating feature maps at different levels based on their significance and a fused-attention module to enhance feature representations in both channel and spatial dimensions.
The surface of lightweight section-steel exhibits a multitude of minor imperfections. To prevent their loss during feature extraction, which could decrease recognition accuracy, we incorporate an attention mechanism module to enhance the network’s sensitivity towards defect characteristics while maintaining detection speed. Hence, in the ILSSD YOLOX-s model of Figure 2, we have developed a parallel type of EFA within the Neck. To validate the efficacy of the proposed attention module, in section “Experimental results and analysis” we conduct comparative experiments among various attention modules presented in prior literature. The EFA module facilitates efficient interaction between channel information and spatial information. The structure of this module is depicted in Figure 3.

Structure of EFA module.
The channel attention module (CAM) and spatial attention module (SAM) are utilized to derive channel feature weights
The primary function of CAM in Figure 3 is to enhance the interaction between channel information and to determine the importance of different channels accurately by establishing appropriate channel feature weights. The structure of CAM is shown in Figure 4. CAM first applies global average pooling (GAP) independently to each channel of the output feature map to obtain the initial weight

Channel attention module (CAM).
Furthermore, the initial weights undergo dimension compression adjustment (
where
After completing the one-dimensional convolution, the feature weights are restored to their original structural state through transpose and dimensional amplification (
where
The establishment of spatial information interaction and internal spatial relationships among input feature maps is facilitated by SAM, as shown in Figure 5. The SAM operation is conducted as follows. Initially, the input feature map

Spatial attention module (SAM).
In equation (9),
where ⊕ denotes element-wise addition.
Bi-directional FPN with weighted feature fusion path
The original YOLOX-s model in Figure 1 incorporates a serial-parallel fusion network, the Neck, that combines the FPN 36 with the PAN. 37 This fusion integrates feature layers at different scales and generates feature maps at three scales. These maps are then passed to the Head for prediction output. As a result, the accuracy of object detection depends directly on how well the Neck fuses these features.
The original Neck lacks interactive fusion of multi-scale features in its three feature fusion paths. In the case of surface defects on lightweight section-steel, not only do defects of different types vary in shape and size, but there is also significant variation in shape among defects of the same type. The detection performance of the original YOLOX-s model on small defects and thin cracks is found to be unsatisfactory through direct testing using actual input image data for surface defect detection in lightweight section-steel. Consequently, the fusion impact of defect characteristics in the original YOLOX-s model does not fulfill the criteria for detecting surface defects on lightweight section-steel.
To improve the network’s ability to combine features and enhance the accuracy of detecting different defects, this study proposes integrating a BiFPN 32 with weighted feature fusion paths into the Neck. This allows for merging features across different scales, incorporating shallow features necessary for identifying small targets in each prediction feature map, and boosting the network’s performance in detecting small targets. Simultaneously, feature maps can be adaptively fused by introducing weight coefficients on each path of feature fusion. This approach effectively controls the representation of features at various scales within each prediction feature layer.
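The adaptive fusion with weight coefficients can be illustrated with the fast normalized fusion rule from the BiFPN design, in which each input feature map is scaled by a non-negative learnable weight and the result is divided by the sum of the weights. This is a minimal sketch on flat lists, assuming the inputs have already been resized to a common shape; the function name, the epsilon value, and the use of this particular normalization in the BiFPN-EFA network are assumptions.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with normalized non-negative weights.

    features: list of equally long lists of floats (already resized).
    weights:  one scalar per feature; negative values are clamped to zero,
              mimicking a ReLU applied to the learnable weights.
    """
    w = [max(0.0, x) for x in weights]
    total = sum(w) + eps  # eps avoids division by zero when all weights are 0
    length = len(features[0])
    return [sum(wi * f[j] for wi, f in zip(w, features)) / total
            for j in range(length)]


shallow = [1.0, 1.0]   # hypothetical values from a shallow, fine-detail layer
deep = [3.0, 3.0]      # hypothetical values from a deeper, semantic layer
print(fast_normalized_fusion([shallow, deep], weights=[1.0, 1.0]))
print(fast_normalized_fusion([shallow, deep], weights=[1.0, -5.0]))
```

With equal weights the output is close to the plain average, and a weight driven to zero removes that path from the fused map, which is how the weights control the contribution of each scale.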
The EFA module in Figure 3 is designed in this study according to the structural characteristics of the network and the number of prediction feature layers. To enhance feature interaction and fusion, weight coefficients are introduced into the three connecting paths of the original Neck module, along with additional EFA modules that further improve feature extraction efficiency and fusion ability. The structure of the BiFPN-EFA network is illustrated in Figure 6, and its operational procedures are explained in more detail below.

Structure of BiFPN-EFA network.
The input feature maps of
The equation representing the output characteristics of BiFPN-EFA, namely
The terms
Improvement of loss function
The loss function of YOLOX-s comprises three components, that is, the objectness (confidence) loss, Lobj, the classification loss, Lcls, and the bounding frame regression loss, Lreg. Both Lobj and Lcls employ the binary cross entropy (BCE) loss. 5 However, Lreg uses the widely adopted intersection over union (IoU) loss function LIoU for bounding frame prediction 46 in object detection networks, aiming to represent the positioning precision of the predicted frame accurately. The calculation equation is expressed as follows.
where
To mitigate the potential impact of the aforementioned scenario on the robustness of our model and improve the accuracy of predicted bounding frame outputs, this study replaces the previous loss with a more comprehensive loss function for bounding frame regression, referred to as LCIoU.46–48 Figure 7 illustrates how the CIoU loss function is calculated, and LCIoU is defined as follows.

Calculation diagram for LCIoU.
In this context,

Six defect categories of NEU-DET datasets.
The LCIoU loss function, in contrast to LIoU, considers not only the overlap between the predicted and target bounding frames but also the distance between their center points and the difference in their height-width ratios. Continuously reducing the center point distance and the ratio difference during model training makes the predicted frames increasingly similar in position and shape to the target frames. Applied to the ILSSD YOLOX-s algorithm, LCIoU speeds up model convergence and improves the localization accuracy of predicted frames, allowing the model to adapt better to the varied morphology of defects observed on lightweight section-steel surfaces.
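The three terms described above (overlap, center distance, and height-width ratio) can be sketched using the commonly published CIoU formulation, 1 − IoU + ρ²/c² + αv. The code below is an illustrative sketch for a single pair of boxes in `(x1, y1, x2, y2)` format; the function name, the box format, and the small `eps` stabilizer are assumptions, not the authors' implementation.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Overlap term: intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Distance term: squared center distance over squared enclosing diagonal.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Shape term: consistency of the width-height ratios.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # close to 0 for identical boxes
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # above 1 for disjoint boxes
```

Unlike the plain IoU loss, the second call still yields a value that shrinks as the boxes move closer, because the distance term remains informative when the boxes do not overlap.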
Experimental results and analysis
Experimental configuration and dataset
The experimental configuration of this study employed the Ubuntu 20.04 LTS operating system, featuring a memory capacity of 16GB. The hardware setup consisted of an AMD Ryzen 5 5600X CPU and an NVIDIA GeForce RTX3060 GPU with VRAM totaling 12GB. To implement the software, PyTorch version 1.10.0, a deep learning framework, was utilized along with CUDA version 11.1 to accelerate computations. Python version 3.8.5 served as the primary programming language in this research endeavor.
The NEU-DET 29 dataset used in this study is an open-source collection of surface defects on hot-rolled steel strips. It encompasses six defect categories, that is, rolled-in scale (Rs), patches (Pa), crazing (Cr), inclusions (In), pitted surface (Ps), and scratches (Sc). Visual examples of these six defect types can be seen in Figure 8. Each category consists of 300 images measuring 200 × 200 pixels. The dataset is divided into training, validation, and test sets using an 8:1:1 ratio.
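For reference, six categories of 300 images give 1800 images in total, so an 8:1:1 split yields 1440 training, 180 validation, and 180 test images. The sketch below shows one way such a split could be produced; the random shuffle, the seed, and the image identifiers are assumptions, since the text does not state how images were assigned to each subset.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items reproducibly and split them by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Hypothetical identifiers: 6 categories x 300 images each.
categories = ("Rs", "Pa", "Cr", "In", "Ps", "Sc")
image_ids = [f"{c}_{i}" for c in categories for i in range(1, 301)]

train_set, val_set, test_set = split_dataset(image_ids)
print(len(train_set), len(val_set), len(test_set))  # 1440 180 180
```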
Training strategy and evaluation index
The training strategy employed in this study is consistent across all experiments. A batch size of 16 is used, and the input image size remains fixed at 640 × 640 pixels. Training runs for a total of 200 epochs, starting with an initial learning rate of 0.01 and a momentum value of 0.937. Stochastic gradient descent (SGD) is used as the optimizer with a weight decay coefficient of 0.0005. A cosine annealing strategy is implemented to decrease the learning rate gradually. Furthermore, an IoU threshold of 0.5 is applied for evaluation, and data augmentation is disabled during the last 30 epochs to ensure consistency between experimental results and the real dataset distribution while achieving complete model convergence.
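The cosine annealing schedule can be sketched from the values given above (initial learning rate 0.01, 200 epochs). The minimum learning rate of 0 and the absence of a warm-up phase are assumptions made for illustration, since neither is stated in the text.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr_init=0.01, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch.

    lr_min=0.0 and no warm-up are assumptions; lr_init and total_epochs
    are the values quoted in the training strategy.
    """
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term


for epoch in (0, 50, 100, 150, 200):
    print(epoch, round(cosine_lr(epoch), 5))
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end, which fits the final 30 epochs without data augmentation being run at a low learning rate.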
The identification of various types of defects on lightweight section-steel surfaces is a critical task that requires the model to detect them quickly and accurately. Therefore, in the experiment, detection speed and accuracy are essential metrics. Mean average precision (mAP) is commonly used in object detection to reflect the overall accuracy of the model. In this study, a mAP with an IoU threshold of 0.5 (mAP@0.5) is used to evaluate the performance of the relevant models. The average precision (AP) represents the integration of precision (P) at different recall levels (R). The AP for each class is an integration of the model’s detection accuracy for that specific class, while mAP indicates the mean AP across all target classes. Equations (20) through (23) present formulas for calculating these metrics.
TP (True Positive) is the number of positive samples correctly predicted as positive by the model, FP (False Positive) is the number of negative samples incorrectly predicted as positive, and FN (False Negative) is the number of positive samples incorrectly predicted as negative. N is the total number of categories in the dataset.
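The metrics defined above can be sketched as follows. Precision and recall use the standard definitions TP/(TP+FP) and TP/(TP+FN); the AP function uses all-point interpolation of the precision-recall curve, which is one common convention and an assumption here, and all numbers in the example are hypothetical.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    recalls must be sorted in ascending order and aligned with precisions.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


print(precision(tp=8, fp=2), recall(tp=8, fn=8))          # 0.8 0.5
print(average_precision([0.5, 1.0], [1.0, 0.5]))           # 0.75
print(mean_average_precision([0.5, 1.0]))                  # 0.75
```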
The speed of the model’s inference is measured in frames per second (FPS) for image processing,
where

Deep learning process of ILSSD model.
Experimental results and analysis
In this section, a series of comparative experiments are conducted on the NEU-DET dataset 29 to validate the performance of the ILSSD YOLOX-s model.
(1) The original YOLOX-s model is gradually enhanced by incorporating each of the four proposed improvement strategies in this study, resulting in multiple refined YOLOX-s models for comparison.
(2) The effectiveness of the proposed parallel EFA module is assessed by comparing the impact of various attention modules on the ILSSD YOLOX-s model.
(3) The validation process confirms the enhancement effect achieved by replacing IoU loss function with CIoU loss function.
(4) Several single-stage object detection algorithms are compared with the proposed ILSSD YOLOX-s model to evaluate their respective detection efficiency for surface defects on lightweight section-steel.
Ablation experiments for different improved YOLOX-s models
To assess the performance improvement of the YOLOX-s model in detecting surface defects on lightweight section-steel, we developed four modified models and compared them to the original YOLOX-s (Model 1). Model 2 uses LCIoU as the regression loss function for bounding frame prediction in the YOLOX-s model. Building upon Model 2, Model 3 incorporates the parallel EFA modules into six lateral paths within the network. Building upon Model 3, Model 4 introduces the BiFPN-EFA multi-scale weighted feature fusion paths to improve the lateral connection paths in the Neck. Lastly, Model 5, which is the proposed ILSSD YOLOX-s model, builds on Model 4 by replacing all standard convolution modules with DSC.
The results of the ablation experiments for different improved YOLOX-s models are displayed in Table 1. By analyzing the data from Table 1, we can draw the following key points.
(1) Model 2 shows an obvious improvement in detecting Cr and Sc defects compared to the original YOLOX-s model 1. This improvement is attributed to LCIoU, which enhances the accuracy of predicted bounding frames by considering differences in morphological size for Cr defects and length-width ratio for Sc defects.
(2) Model 3 has a slightly higher parameter count but achieves a noteworthy increase of 2.6 percentage points in mAP@0.5 compared to Model 2. This suggests that the proposed parallel EFA module acts as an attention mechanism, allowing the network to focus more on relevant feature information and improve feature extraction efficiency.
(3) Model 4 outperforms Model 3 with a mAP@0.5 improvement of 3.3 percentage points and improved detection accuracy for six defects. The BiFPN-EFA feature fusion path effectively enhances the network’s ability to merge features by integrating shallow feature information into each prediction feature layer, thereby improving recognition rates for small defects. However, this comes at the expense of increased model parameter-count and reduced detection speed.
(4) Compared to Model 4, the proposed ILSSD YOLOX-s model slightly improves overall detection accuracy while significantly reducing the parameter count and increasing detection speed by 4.7 FPS, which shows that DSC can be used effectively without compromising network performance.
Ablation experiments for different improved YOLOX-s models.
The comparative experiments in Table 1 show that gradually incorporating the network enhancement techniques into the original YOLOX-s framework progressively improves the model’s detection accuracy, which confirms the effectiveness of each individual improvement strategy. The ILSSD YOLOX-s model, which incorporates all of the improvement strategies, demonstrates superior detection performance compared to the original YOLOX-s model. Overall detection accuracy rises by 7.1 percentage points, with mAP@0.5 reaching 75.9%. Notably, detection of the Rs and Cr defects, which had low accuracy in the original network, improves by 9.9 and 6.5 percentage points respectively, while the parameter count of the model is also reduced. However, because each DSC replaces a standard convolution module with separate depth-wise and point-wise convolutions, the model becomes deeper, and the Neck integrates multiple feature fusion pathways. As a result, the model’s detection speed decreases by 8.8 FPS.
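The baseline figures implied by the numbers quoted in this discussion can be recovered with simple arithmetic. The values computed below are derived from the differences stated in the text (7.1 percentage points, 8.8 FPS, and 4.7 FPS), assuming the 8.8 FPS decrease is measured against the original YOLOX-s model; they are not read from Table 1, and the variable names are illustrative.

```python
# Figures quoted in the text for the ILSSD YOLOX-s model.
ilssd_map = 75.9            # mAP@0.5 in percent
ilssd_fps = 78.4            # detection speed in FPS

map_gain_vs_original = 7.1  # percentage points gained over original YOLOX-s
fps_drop_vs_original = 8.8  # FPS lost (assumed: relative to original YOLOX-s)
fps_gain_vs_model4 = 4.7    # FPS gained relative to Model 4

# Derived values (implied by the text, not taken from the table).
original_map = round(ilssd_map - map_gain_vs_original, 1)
original_fps = round(ilssd_fps + fps_drop_vs_original, 1)
model4_fps = round(ilssd_fps - fps_gain_vs_model4, 1)

print(original_map, original_fps, model4_fps)  # 68.8 87.2 73.7
```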
Comparative experiments for different attention modules
The performance of ILSSD YOLOX-s model was compared with different attention modules replacing the provided parallel EFA* module in the network, as shown in Table 2. The experimental results lead to the following key points.
(1) Replacing the EFA module with the ECA module resulted in the highest detection speed but caused a decrease of 2.3 percentage points in mAP@0.5.
(2) Replacing the EFA module with CBAM lowered both mAP@0.5 and detection speed.
(3) However, replacing the EFA module with the DA module slightly improved detection accuracy compared to the ILSSD YOLOX-s model while decreasing detection speed by 4.6 FPS.
(4) Among these attention modules, it was found that the parallel EFA module achieved a balanced trade-off between detection accuracy and speed.
Comparative experiments between different attention modules.
Based on the comparative experiments presented in Table 2, it can be concluded that constructing feature weight matrices that consider spatial and channel attention simultaneously enhances the network’s detection accuracy more effectively than using channel attention alone or sequential spatial and channel attention. Furthermore, the parallel-path configuration of feature map weight matrices has a smaller impact on the model’s detection speed than the sequential-path structure.
Effectiveness of LCIoU loss function
Figure 10 shows the loss function curves of Model 2 (in Table 1) and the original YOLOX-s model during training. Based on the convergence of the curves, the key points are summarized as follows.
(1) The loss function curve of Model 2 exhibits a more stable training process.
(2) Model 2 converges faster than the original YOLOX-s model both with and without data augmentation. The final loss function curves converge to values of 1.88 and 2.0 for Model 2 and the original YOLOX-s model, respectively.
(3) Replacing LIoU, the boundary frame regression loss function of the original YOLOX-s model, with LCIoU gives Model 2 more stable training, faster convergence, and a lower final training loss, as the comparison of the two loss function curves shows.
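For reference, the commonly used definition of the CIoU loss is LCIoU = 1 − IoU + ρ²/c² + αv, where ρ is the distance between the box centres, c is the diagonal of the smallest enclosing box, v measures the aspect-ratio mismatch, and α = v / ((1 − IoU) + v). The sketch below implements this standard form in plain Python. The (x1, y1, x2, y2) box format, the function names, and the small eps term are assumptions made for illustration, and the code is not taken from the authors' implementation.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss: 1 - IoU + centre-distance penalty + aspect-ratio penalty."""
    i = iou(pred, gt)
    # Squared distance between the two box centres.
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / ((1 - i) + v + eps)
    return 1 - i + rho2 / c2 + alpha * v


box = (0.0, 0.0, 1.0, 1.0)
near, far = (2.0, 2.0, 3.0, 3.0), (4.0, 4.0, 5.0, 5.0)
print(1 - iou(box, near), 1 - iou(box, far))      # both 1.0: no overlap
print(ciou_loss(box, near), ciou_loss(box, far))  # grows with distance
```

For non-overlapping boxes the quantity 1 − IoU stays at 1.0 however far apart the boxes are, while the CIoU loss still increases with the centre distance. This extra signal is one plausible reason for the smoother and faster convergence seen in Figure 10.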

Curves of loss function for Model 2 in Table 1 and original YOLOX-s model.
Comparative experiments for different single-stage algorithms
The experimental results for different single-stage detection algorithms on the NEU-DET 29 dataset are presented in Table 3. The main findings can be summarized as follows.
(1) The ILSSD YOLOX-s model outperforms the SSD, RetinaNet, 49 YOLOv3, 50 and YOLOv4 networks in terms of model parameter count, detection speed, and accuracy.
(2) Compared to the YOLOv5-s network, the ILSSD YOLOX-s model shows advantages in terms of parameter count and detection speed while improving overall detection accuracy by 7.8 percentage points.
(3) Compared with the lightest tiny model in YOLOv7, 52 the ILSSD YOLOX-s model has a slightly lower parameter count and a slower detection speed, but it still maintains a 6.6 percentage point advantage in detection accuracy owing to its enhancements over the original YOLOX-s model.
(4) The ILSSD YOLOX-s model surpasses both algorithms proposed in the work 21 in detection accuracy and speed, and holds a slight advantage over the algorithm proposed in the work 22 in both respects.
Comparison of different single-stage detection algorithms.
Based on these results, it can be concluded that the ILSSD YOLOX-s model outperforms the other single-stage detection algorithms and offers certain advantages over the other enhanced models. This slim model detects surface defects on lightweight section-steel with higher accuracy while using fewer parameters, and its real-time capability meets the requirements for detecting such defects efficiently.
The comparison results presented in Figure 11 demonstrate the detection performance of YOLOF, 32 the original YOLOX-s, 30 and the proposed ILSSD YOLOX-s models on different types of defects. Based on our analysis of the labeled pictures, we can conclude that both YOLOF and the original YOLOX-s model have limitations when it comes to accurately detecting all instances of defects, which is evident from their lower predicted frame confidence scores. In contrast, the ILSSD YOLOX-s model showcases superior capabilities in identifying defects and effectively adjusting predicted frames with higher levels of confidence. When compared to the original YOLOX-s model, the ILSSD YOLOX-s model variant significantly reduces instances where Cr and Rs defects are missed or falsely detected, while enhancing positioning accuracy for In and Ps defects. Furthermore, it achieves exceptional detection performance across various defect categories.

Detection results of YOLOF, original YOLOX-s and ILSSD YOLOX-s models.
The ILSSD YOLOX-s algorithm model proposed in this study demonstrates a substantial reduction in the count of model parameters compared to other algorithm models, resulting in diminished capacity to effectively fit complex data. Consequently, this may lead to compromised detection quality of the model when confronted with intricate and dynamic on-site environments, as well as limited enhancement in accuracy during pre-training with extensive datasets.
Experiments of section-steel manufacturing plant dataset
To further demonstrate the feasibility of applying the ILSSD YOLOX-s algorithm to detect surface defects in a lightweight section-steel manufacturing plant, an industrial camera was installed in a hot rolled H-shape section-steel workshop to capture images of defects on the surface of H-shape section-steel during cooling. The resulting images were labeled with rectangular boxes using Labelme software and converted into COCO dataset format to verify the effectiveness of the ILSSD YOLOX-s algorithm. The final dataset consisted of 784 images containing four types of defects: 313 Scratches (Sc), 624 Roll scraps (Rs), 108 Pittings (Ps), and 129 Folding scratches (Fs). The dataset was divided into a training set of 560 images and a test set of 224 images.
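The labeling and splitting steps can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes Labelme's usual JSON fields (`imagePath`, `imageWidth`, `imageHeight`, and `shapes` with two corner points per rectangle), it assumes the labels are stored as the abbreviations Sc, Rs, Ps, and Fs, and it assumes a random split, which the text does not state. The function names, category ids, and sample file name are hypothetical.

```python
import random

# Assumed label strings and category ids (not specified in the text).
CATEGORIES = {"Sc": 1, "Rs": 2, "Ps": 3, "Fs": 4}


def labelme_to_coco(labelme_docs):
    """Convert parsed Labelme JSON documents into one COCO-style dictionary."""
    images, annotations = [], []
    ann_id = 1
    for image_id, doc in enumerate(labelme_docs, start=1):
        images.append({"id": image_id, "file_name": doc["imagePath"],
                       "width": doc["imageWidth"], "height": doc["imageHeight"]})
        for shape in doc["shapes"]:
            if shape.get("shape_type") != "rectangle":
                continue  # only rectangular boxes are used here
            (xa, ya), (xb, yb) = shape["points"]  # two opposite corners
            x, y = min(xa, xb), min(ya, yb)
            w, h = abs(xb - xa), abs(yb - ya)
            annotations.append({"id": ann_id, "image_id": image_id,
                                "category_id": CATEGORIES[shape["label"]],
                                "bbox": [x, y, w, h],  # COCO: top-left + size
                                "area": w * h, "iscrowd": 0})
            ann_id += 1
    categories = [{"id": i, "name": n} for n, i in CATEGORIES.items()]
    return {"images": images, "annotations": annotations, "categories": categories}


def split_ids(image_ids, n_train, seed=0):
    """Shuffle the ids reproducibly and cut them into train and test lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Made-up example document with one rectangle drawn from bottom-right to top-left.
sample = {"imagePath": "example.jpg", "imageWidth": 640, "imageHeight": 480,
          "shapes": [{"label": "Rs", "shape_type": "rectangle",
                      "points": [[30, 40], [10, 20]]}]}
coco = labelme_to_coco([sample])
train_ids, test_ids = split_ids(range(1, 785), n_train=560)
print(coco["annotations"][0]["bbox"], len(train_ids), len(test_ids))
```

Taking the minimum of the two corners makes the conversion independent of the direction in which the annotator dragged the box. With 784 images and 560 assigned to training, the remaining 224 form the test set, a ratio of 5:2.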
The experimental setup, including the hardware and software conditions, as well as the training strategy employed, remains consistent with what was described in section “Simulation analysis of thermal-force coupled finite element model.” The results of the detection experiment on the collected dataset are presented in Table 4. Key observations can be summarized as follows:
(1) YOLOv4 shows a significant shortcoming compared to other algorithms. While YOLOF performs better overall than YOLOv4, there is still a noticeable gap compared to the remaining four algorithms.
(2) The ILSSD YOLOX-s algorithm achieves an overall detection accuracy of 73.6%, surpassing both the YOLOv5-s model by 2.6 percentage points and the YOLOX-s model by 1.1 percentage points, while also outperforming the latest YOLOv7-tiny model by 0.7 percentage points.
(3) The data in Table 4 indicate that the models differ in their sensitivity to distinct defect types. Except for the Ps and Fs defects, for which its detection accuracy is slightly lower than that of YOLOv7-tiny, the ILSSD YOLOX-s algorithm consistently achieves better overall detection accuracy than the other members of the YOLO series.
Experimental results of YOLO series algorithms.
Moreover, as illustrated in Figure 12, the ILSSD YOLOX-s algorithm accurately detects the three defects found in the dataset. This further validates the precision and effectiveness of the ILSSD YOLOX-s algorithm for identifying surface defects on lightweight section-steel in a manufacturing plant.

Appearance of ILSSD YOLOX-s model on steel plant datasets.
Conclusions
This study aims to improve the feature fusion performance of the YOLOX-s model for single-stage object detection. On the NEU-DET dataset, the proposed ILSSD YOLOX-s algorithm achieves a mAP@0.5 of 75.9%, which is 7.1 percentage points higher than that of the original YOLOX-s, at a detection speed of 78.4 FPS. It identifies various defects on lightweight section-steel surfaces more accurately and swiftly than most single-stage object detection models while remaining a slim network that is easy to deploy at industrial sites.
The enhancements include replacing the standard convolutional module with a depth-wise separable convolution (DSC) module, introducing a parallel attention module known as EFA, developing a BiFPN-EFA multi-scale weighted feature fusion path, and utilizing CIoU Loss as the loss function for boundary frame regression to enhance prediction frame positioning accuracy. When applied to the H-shaped section-steel defect photo database collected from actual manufacturing plants, the ILSSD YOLOX-s algorithm exhibits high recognition rates in distinguishing different defect categories, further validating its effectiveness and suitability for real-world steel production environments.
Footnotes
Acknowledgements
The authors acknowledge the support from the University of Science and Technology Beijing and the School of Mechanical and Electric Engineering, Sanming University.
Handling Editor: Claudia Barile
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was carried out as part of the Major Science and Technology Projects of Fujian Province (Grants no. 2022HZ026025 and 2023T5001), the Program for Innovative Research Team in Science and Technology in Fujian Province University, the Production and Research Collaboration with Innovative in Key Scientific and Technological Project of Sanming City (Grant no. 2022–G–17), and the Operational Funding of the Advanced Talents for Scientific Research (Grant no. 19YG04) of Sanming University.
