Abstract
Defect detection in mobile phone cameras constitutes a critical aspect of the manufacturing process. Nonetheless, this task remains challenging due to the complexities introduced by intricate backgrounds and low-contrast defects, such as minor scratches and subtle dust particles. To address these issues, a Bilateral Feature Fusion Network (BFFN) is proposed. This network incorporates a bilateral feature fusion module engineered to enrich feature representation by fusing feature maps from multiple scales, allowing both fine- and coarse-grained details in the images to be captured. Additionally, a self-attention mechanism is deployed to gather more comprehensive contextual information, thereby enhancing feature discriminability. The proposed network has been rigorously evaluated on a dataset of 12,018 mobile camera images. It surpasses existing state-of-the-art methods such as U-Net and Deeplab V3+, particularly in mitigating false positive detections caused by complex backgrounds and false negative detections caused by slight defects. It achieves an F1-score of 97.59%, which is 1.16 percentage points better than Deeplab V3+ and 0.99 percentage points better than U-Net, with a precision of 96.93% and a recall of 98.26%. Furthermore, our approach reaches a detection speed of 63.8 frames per second (FPS), notably faster than Deeplab V3+ at 57.1 FPS and U-Net at 50.3 FPS. This computational efficiency makes the network particularly well suited to real-time defect detection in mobile camera manufacturing.
Introduction
The mobile camera has become a key feature affecting the overall performance of modern smartphones. During the production of mobile cameras, defects such as scratches, dirt, and foreign objects are often unavoidable. These defects negatively affect the final photographic quality of the cameras. Among them, certain minor defects have low contrast with the background, making them especially difficult to identify.
Due to the complexity of mobile camera defects, there are two main challenges in defect detection: (1) Low contrast: defects such as slight dust and shallow scratches can be particularly difficult to detect because of their low contrast with the background, which can lead to false negative detections, as shown in Fig. 1(a); (2) Interference from complex features in the background area: the mobile camera image consists of a white transparent glass area at the center and a surrounding black area where ink is smeared on the glass screen. Some normal features in the ink area are very similar to defect features in the glass area, which can easily result in false positive detections. As highlighted in the red box in Fig. 1(b), these normal background features are almost identical to defects in the glass area.

Fig. 1. Challenges in mobile camera defect detection. (a) The red box marks low-contrast defects and the yellow box marks common defects; (b) Confusing normal background features.
Traditional defect detection of mobile cameras relies on manual identification, which is inefficient and prone to inaccuracies due to human error and fatigue, particularly for subtle defects. Alternatively, image processing techniques can be used to increase the contrast between defects and the background in mobile phone camera images, thereby enhancing the visibility of defects [1]. However, this approach still fails to meet industrial demands.
As machine vision technology has progressed, researchers have successfully applied it to defect detection across various fields [2, 3]. For example, machine vision combined with near-infrared spectroscopy has been used to detect pinholes, cracks, and other defects on the surface of wood materials with high accuracy [4]. Similarly, for wine bottle inspection, a defect detection method based on residual analysis and threshold segmentation was adopted to address the low detection accuracy of bottle mouth defects [5].
However, traditional machine vision methods rely on manually designed features, which makes effective feature design and extraction difficult for mobile cameras with their complex backgrounds, image noise, and low-contrast defects. Moreover, such methods may not generalize well to other defect detection scenarios.
Deep learning has shown great success in various visual tasks, including image classification, object detection, and semantic segmentation. Unlike traditional machine vision methods, deep learning can automatically extract useful features without complex manual feature design and can be applied to different scenarios. In defect detection, many studies have applied deep learning techniques. For example, Convolutional Neural Networks (CNNs) have been used to detect defects and classify images [6], while object detection networks have been used to detect and locate defects with greater accuracy [7, 8]. Semantic segmentation methods can determine whether each pixel in an image has defects, providing detailed information such as defect shape and area. The Fully Convolutional Network (FCN) [9] and U-Net [10] have been successfully used for defect segmentation in various domains, including concrete and highway tunnel defects [11, 12].
However, deep learning-based detection methods still face challenges when dealing with low-contrast and complex background images. To address this, attention modules have been designed to emphasize areas with defects, resulting in more effective feature extraction [13–15]. While the attention mechanism serves to highlight defects, it falls short in exploiting the interdependence of features from a global perspective. This contextual information is crucial for reducing false positive detections in complex backgrounds. In general, a larger receptive field brings more comprehensive contextual information. To this end, other mechanisms have been introduced, such as dilated convolutions that expand the receptive field without losing image resolution [16], and an upsampling operator with a larger field of perception that better captures contextual information [17]. In contrast to these mechanisms, the self-attention mechanism is not limited by the receptive field and can access global contextual information [18, 19], making it well suited to defect detection tasks.
Different layers of convolutional networks have different sensitivities to features [20]. Features extracted by the shallow layers have higher resolution and contain more shape and boundary details but less semantic information. In contrast, features extracted by the deep layers carry abstract semantic information commonly used for classification but weaker shape and location information. When only high-level features from the deep layers are adopted, prediction performance is unsatisfactory because too little information is acquired from the shallow layers [21].
Therefore, features from all levels of the network are essential for effective defect detection. Simply merging features from different levels may lead to a loss of information [22]. To solve this problem, ES-Net was proposed to fuse multi-level feature maps and obtain richer information [23]. A feature fusion network that integrates shallow details with deep semantics was designed to inspect mobile phone lenses [24]. Other works use a backbone network to extract multi-level feature maps and apply a feature fusion module to fuse feature maps of different scales [25, 26].
However, most of these methods fuse multi-level feature maps from many layers of the backbone network simultaneously, which provides richer information but adds significant computational cost and slows defect detection. Moreover, no selection based on feature importance is made during fusion. Therefore, a new module is proposed that fuses two levels of feature maps to obtain complementary information, using the sigmoid function and an attention mechanism to fully evaluate the significance of information from different layers.
According to the above analysis and discussion, this paper proposes a novel segmentation network based on the self-attention module (SAM) and the bilateral feature fusion module (BFFM) to construct an end-to-end detection method. The proposed network performs more effectively than other networks on the mobile camera defect dataset. The main contributions of this paper are as follows:
(1) A deep learning-based semantic segmentation network is applied to defect detection on mobile cameras, with enhancements to the network architecture that improve both the accuracy and the speed of detection.
(2) The proposed bilateral feature fusion module fuses two levels of feature maps extracted by the backbone network using the sigmoid function and attention mechanism to evaluate the significance of information from different layers and improve detection accuracy while minimizing computational consumption.
(3) The self-attention module is incorporated to capture more comprehensive contextual information and discriminative features, especially for detecting slight defects.
(4) Experimental results show that the proposed method achieves state-of-the-art performance on the mobile camera defect dataset. Additionally, the proposed network maintains a high detection speed.
The rest of this paper is organized as follows: In Section 2, we provide a detailed description of the proposed segmentation network based on the bilateral feature fusion module and self-attention module. In Section 3, we introduce the dataset used for evaluation and the experiment setup. Section 4 presents the experimental results of detection performance and ablation study. Finally, in Section 5, we summarize the full context of the paper and provide a conclusion.
Proposed method
In this section, we present the proposed segmentation network in detail. We first introduce the overall network architecture, followed by detailed explanations of the essential modules, including the self-attention module, the bilateral feature fusion module, and the loss function, in subsequent subsections.
Network architecture
The proposed segmentation network, called the Bilateral Feature Fusion Network, has an overall architecture shown in Fig. 2. It consists of three key components: the backbone network, the self-attention module, and the bilateral feature fusion module.

Fig. 2. The architecture of the proposed network. The input image is fed into a lightweight CNN to extract feature maps at each layer. These feature maps are then processed by the self-attention modules, which capture richer contextual information and discriminative features to enhance the perception of defects. Finally, the bilateral feature fusion module fuses the two levels of feature maps to obtain complementary information and improve detection accuracy.
In our proposed method, a lightweight backbone network such as ResNet18 [27] is adopted. Such a backbone is computationally cheap and rapidly down-samples features to obtain semantic information with a larger receptive field. First, feature maps are extracted at different scales through the backbone network. The feature maps from the last two layers are then processed by self-attention modules to obtain more contextual information and discriminative features. Meanwhile, a global average pooling layer is incorporated at the tail of the backbone to capture global contextual information. The feature maps from the pooling layer and from the self-attention module of the final layer are summed element-wise. The summed feature maps are then fused with the feature maps from the upper-layer self-attention module using the bilateral feature fusion module. Finally, the fused feature maps are further merged with the feature maps from the shallow layers of the backbone, processed by a convolutional block, and upsampled to obtain the full-sized defect segmentation map.
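To make the data flow concrete, the following is a minimal PyTorch sketch of the wiring described above, not the authors' exact implementation: the class name BFFNSketch, the placeholder nn.Identity self-attention modules, the 1×1 fusion convolution standing in for the BFFM, and all channel choices are illustrative assumptions. Hedged sketches of the self-attention and bilateral feature fusion modules themselves follow in the next subsections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class BFFNSketch(nn.Module):
    """Illustrative wiring only; stage channels follow ResNet18 (64/128/256/512)."""

    def __init__(self, num_classes=2):
        super().__init__()
        net = resnet18(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stage1, self.stage2 = net.layer1, net.layer2  # shallow layers
        self.stage3, self.stage4 = net.layer3, net.layer4  # last two layers
        self.sam3 = nn.Identity()  # placeholder for the self-attention module
        self.sam4 = nn.Identity()  # placeholder for the self-attention module
        self.fuse = nn.Conv2d(512 + 256, 256, 1)  # stand-in for the BFFM
        self.head = nn.Conv2d(256 + 64, num_classes, 3, padding=1)

    def forward(self, x):
        full_size = x.shape[-2:]
        x = self.stem(x)
        f1 = self.stage1(x)               # shallow: shape and boundary detail
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        a3 = self.sam3(f3)                # attention on the second-to-last layer
        a4 = self.sam4(f4)                # attention on the last layer
        g = F.adaptive_avg_pool2d(f4, 1)  # global context from tail pooling
        s = a4 + g                        # element-wise sum with pooled context
        s = F.interpolate(s, size=f3.shape[-2:], mode='bilinear',
                          align_corners=False)
        fused = self.fuse(torch.cat([s, a3], dim=1))   # BFFM would fuse here
        fused = F.interpolate(fused, size=f1.shape[-2:], mode='bilinear',
                              align_corners=False)
        out = self.head(torch.cat([fused, f1], dim=1)) # merge shallow features
        return F.interpolate(out, size=full_size, mode='bilinear',
                             align_corners=False)
```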
Self-attention module
The self-attention module mainly serves to obtain more contextual information and discriminative features, which is essential for defect detection in mobile cameras. In general, a larger receptive field brings more comprehensive contextual information and thus better performance. Contextual information is especially important in defect detection of mobile cameras: a limited receptive field can lead the network to misclassify certain normal background features in the ink area as defects in the glass area, as shown in Fig. 1(b).
Therefore, a large receptive field is required to achieve an overall perception of defects in the image [28]. While CNNs require multiple convolutional layers to obtain global information [29, 30], self-attention is not restricted by the receptive field and can obtain global contextual information and long-range dependencies more efficiently. The self-attention mechanism then achieves refined feature extraction by applying learned weight matrices to the defect feature maps [31], enhancing its capability to highlight slight defects. The proposed self-attention modules are inspired by Bottleneck Transformers [18] and are used to process the feature maps extracted from the backbone network.
The input feature maps are passed sequentially through a convolution layer, a multi-head self-attention layer, and another convolution layer. After each convolution layer, layer normalization and rectified linear unit (ReLU) activation are applied. The final output is obtained via a residual connection.
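Below is a hedged PyTorch sketch of this module: one convolution, multi-head self-attention over the flattened spatial positions, and a second convolution, with channel-wise layer normalization (implemented here as GroupNorm with a single group) and ReLU after each convolution, plus a residual connection at the output. The hyperparameters follow the implementation details reported later (embedding dimension 64, 4 heads, dropout 0.1); the exact layer arrangement is an assumption.

```python
import torch
import torch.nn as nn


class SelfAttentionModule(nn.Module):
    def __init__(self, in_ch, embed_dim=64, num_heads=4, dropout=0.1):
        super().__init__()
        self.conv_in = nn.Conv2d(in_ch, embed_dim, 1)
        self.norm_in = nn.GroupNorm(1, embed_dim)   # layer norm over channels
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.conv_out = nn.Conv2d(embed_dim, in_ch, 1)
        self.norm_out = nn.GroupNorm(1, in_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.relu(self.norm_in(self.conv_in(x)))
        seq = y.flatten(2).transpose(1, 2)          # (B, H*W, embed_dim)
        seq, _ = self.attn(seq, seq, seq)           # global self-attention
        y = seq.transpose(1, 2).reshape(b, -1, h, w)
        y = self.relu(self.norm_out(self.conv_out(y)))
        return x + y                                # residual output
```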
Bilateral feature fusion module
The bilateral feature fusion module aims to fuse the feature maps extracted from different layers to obtain richer and more effective information for semantic segmentation tasks.
Convolutional layers exhibit different sensitivities to features depending on their depth within the network: shallow layers extract low-level feature maps with high resolution, providing intricate details about shapes and boundaries but comparatively little semantic information. Conversely, high-level feature maps extracted by deep layers carry more abstract semantic information while potentially sacrificing shape and boundary details [20]. Given the complementary information from different layers, the bilateral feature fusion module is proposed to fuse the feature maps effectively.
There are various ways to fuse two levels of features, e.g., element-wise summation [32] and concatenation [33]. However, these simple methods tend to overlook the diversity of information, leading to suboptimal performance. The proposed bilateral feature fusion module is inspired by the bilateral guided aggregation layer in [34]. The details of this structure are shown in Fig. 3. The high-level feature maps, which contain more semantic information, guide the representation of the low-level features through a sigmoid function after two convolution layers.

Fig. 3. The architecture of the bilateral feature fusion module, where Up denotes upsampling, and + and × denote element-wise summation and matrix multiplication, respectively.
Similarly, the low-level feature maps guide the high-level feature representation. The bilateral feature fusion module filters the effective information for each level of feature maps, enabling the different levels to guide each other and capture essential information. The guided feature maps are upsampled to the same size and concatenated, and a channel weight vector is computed as in SENet [35] and used to re-weight the feature maps. This interaction of information between different levels of feature maps improves the final detection accuracy.
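The following is a hedged sketch of the module under these assumptions: each branch is gated by a sigmoid map computed from the other branch after two convolutions, the guided maps are brought to a common size, concatenated, and re-weighted with an SE-style channel vector. The channel sizes, normalization, and output projection are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def two_convs(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch))


class BilateralFeatureFusion(nn.Module):
    def __init__(self, high_ch, low_ch, out_ch=256, reduction=16):
        super().__init__()
        self.gate_low = two_convs(high_ch, low_ch)   # high guides low
        self.gate_high = two_convs(low_ch, high_ch)  # low guides high
        fused_ch = high_ch + low_ch
        self.se = nn.Sequential(                     # SE-style re-weighting
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused_ch, fused_ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused_ch // reduction, fused_ch, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(fused_ch, out_ch, 1)

    def forward(self, high, low):
        # high: deep, semantic, small spatial size; low: shallow, detailed.
        high_up = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                                align_corners=False)
        low_dn = F.interpolate(low, size=high.shape[-2:], mode='bilinear',
                               align_corners=False)
        low_guided = low * torch.sigmoid(self.gate_low(high_up))
        high_guided = high * torch.sigmoid(self.gate_high(low_dn))
        high_guided = F.interpolate(high_guided, size=low.shape[-2:],
                                    mode='bilinear', align_corners=False)
        fused = torch.cat([high_guided, low_guided], dim=1)
        return self.proj(fused * self.se(fused))     # channel re-weighting
```

With high_ch=512 and low_ch=256, a module like this would slot into the overall sketch above in place of the 1×1 fusion convolution.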
Loss function
The cross-entropy loss is commonly used for image segmentation tasks. In this method, one principal loss function and two auxiliary loss functions are adopted to supervise the training [36]. The principal loss supervises the output of the whole network, while the auxiliary losses supervise the outputs of the deep layers. Both the principal and auxiliary loss functions are cross-entropy losses.
The overall loss function is defined as:

$$L = L_{p} + \alpha \sum_{i=1}^{2} L_{a}^{(i)}$$

where $L_{p}$ is the principal loss on the final output, $L_{a}^{(i)}$ ($i = 1, 2$) are the auxiliary losses on the deep-layer outputs, and $\alpha$ is a balance weight.

The cross-entropy loss used for both the principal and auxiliary terms is:

$$L_{ce} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log p_{n,c}$$

where $N$ is the number of pixels, $C$ is the number of classes, $y_{n,c}$ is the ground-truth label of pixel $n$ for class $c$, and $p_{n,c}$ is the corresponding predicted probability.
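As a sketch, deep supervision with one principal and two auxiliary cross-entropy losses can be written as follows; the function name, the assumption that the network returns auxiliary logits, and the balance weight alpha are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ce = nn.CrossEntropyLoss()


def total_loss(main_logits, aux_logits, target, alpha=1.0):
    """target: (B, H, W) integer class map; logits: (B, C, h, w)."""
    def resize(logits):
        return F.interpolate(logits, size=target.shape[-2:],
                             mode='bilinear', align_corners=False)

    loss = ce(resize(main_logits), target)        # principal loss
    for aux in aux_logits:                        # two auxiliary losses
        loss = loss + alpha * ce(resize(aux), target)
    return loss
```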
Dataset
To evaluate the defect detection performance of the proposed network, a dataset of actual mobile camera defect images was collected and labeled by professionals. The images contain a white transparent glass area and a surrounding black area where ink is smeared on the glass. Six types of defects need to be detected in the glass area: scratches, foreign objects, white spots, dust, dirt, and collapsing edges, as shown in Fig. 4. Because the surrounding ink area is black and does not affect camera performance, no defects need to be detected there. To save model training time, the images are cropped to a size of 128 × 128, yielding a total of 12,018 cropped images. Evaluation is performed by 4-fold cross-validation: for each fold, 25% of the samples are held out as the test set and the remaining 75% are used as the training set. For hyperparameter tuning, 10% of the training samples are set aside as a validation set.
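This evaluation protocol can be sketched with scikit-learn's KFold, assuming illustrative file paths: 25% of the samples form the test set per fold, and 10% of the training portion is set aside for validation.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Placeholder paths standing in for the 12,018 cropped images.
image_paths = np.array([f"crops/{i:05d}.png" for i in range(12018)])

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(image_paths)):
    # 10% of the training portion is held out for hyperparameter tuning.
    train_idx, val_idx = train_test_split(train_idx, test_size=0.1,
                                          random_state=0)
    print(f"fold {fold}: train={len(train_idx)}, "
          f"val={len(val_idx)}, test={len(test_idx)}")
```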

Fig. 4. Types of defects. The first row shows the images and the second row shows the corresponding labels. (a) Scratches; (b) Foreign objects; (c) White spots; (d) Dust; (e) Dirt; (f) Collapsing edge.
Evaluation metrics
In defect detection with segmentation networks, many works use segmentation metrics such as Intersection over Union (IoU) and the Dice coefficient to evaluate detection quality. These metrics calculate the overlap between predicted and actual defects and reflect how many pixels the network classifies correctly. However, in practical industrial production, while pixel classification accuracy remains a significant factor, the primary concern is accurately detecting the defective areas [37].
To better evaluate the network’s defect detection performance, the problem is translated into a classification problem: the classification errors of the defects are evaluated, and the segmentation results are used only for visualizing the detections. The following metrics are used to assess the network’s performance: F1-score, Recall, Precision, FN (false negatives), and FP (false positives). With TP denoting true positives, the formulas for these metrics are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
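These defect-level metrics follow directly from the detection counts; a straightforward rendering in Python, with illustrative example counts, is:

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall, and F1 from defect-level detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Example usage with illustrative counts (not the paper's actual tallies).
print(detection_metrics(tp=960, fp=30, fn=17))
```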
Implementation details
All experiments were implemented in PyTorch and trained for 30 epochs on an Nvidia P100 GPU with 16 GB of memory. The optimizer was RMSprop with an initial learning rate of 5e-4, and the training batch size was 16. The self-attention module uses an embedding dimension of 64, 4 attention heads, and a dropout rate of 0.1.
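For reference, the reported training configuration can be sketched as follows; the dummy tensors and the single-convolution stand-in model are placeholders, while the optimizer, learning rate, batch size, and epoch count match the values stated above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the 128x128 crops (illustrative only).
images = torch.randn(64, 3, 128, 128)
masks = torch.randint(0, 2, (64, 128, 128))
loader = DataLoader(TensorDataset(images, masks), batch_size=16, shuffle=True)

model = torch.nn.Conv2d(3, 2, 3, padding=1)   # stand-in for the BFFN
optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(30):                       # 30 epochs, as reported
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```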
Experimental results and analysis
In this section, the performance of the bilateral feature fusion network is evaluated by comparison experiments with state-of-the-art methods, and a comparison of detection efficiency was also provided. Furthermore, some ablation studies of the proposed network are presented.
Performance analysis
To further validate the detection performance of the proposed network, we compared it with state-of-the-art segmentation networks including FCN, U-Net, Deeplab V3+, and BiSeNet, all using ResNet18 as their backbone network.
Performance comparison
The experimental results are summarized in Table 1 using the metrics F1-score, Recall, Precision, and FP+FN. The reported values are the means over the 4-fold cross-validation results, and the standard deviation across folds is also calculated.
Table 1. Comparison of the detection performance of different networks.
Due to the presence of slight defects and complex backgrounds in the dataset, FCN fails to demonstrate superior detection performance. Deeplab V3+ tends to miss slight defects, resulting in a lower Recall. As a representative pixel-level segmentation network, U-Net performs well in detecting slight defects. However, the skip connections of U-Net simply pass the low-level features extracted by the encoder to the decoder. These low-level features contain only shape and texture information and, because of their small receptive field, cannot capture contextual information well. This leads to false positive detections in the ink area and hence a low Precision. The performance of BiSeNet in detecting slight defects is also unsatisfactory. In contrast, the proposed network detects mobile camera defects more accurately than the other segmentation networks.
To illustrate the detection performance of the proposed network, segmentation results are shown in Fig. 5. The first and second columns represent the original image and the ground truth, and the third to seventh columns show the segmentation results of the different networks.

Fig. 5. Comparison of the detection performance of each network (the yellow boxes mark false negative detection areas, and the red boxes mark false positive detection areas).
As shown in the yellow boxes of Fig. 5, false negative detections are likely when defects are slight or the contrast between objects and the background is low. In addition, some images have complex backgrounds with geometric features similar to the defects, as shown in the red boxes of Fig. 5. Detecting these types of defects correctly is challenging. FCN does not perform well on such defects; false negative detections of slight defects and false positive detections of normal areas occur from time to time. Deeplab V3+ outperforms FCN but still misses many slight defects. U-Net performs satisfactorily on slight defects but tends to misclassify certain normal background features as defects. BiSeNet is less effective than U-Net at avoiding false negative detections. Our proposed network, with the addition of SAM and BFFM, yields significantly better detection performance than the other state-of-the-art methods.
The efficiency of defect detection is also an essential metric for industrial applications. To compare the detection speed of the methods, the number of images processed per second is measured and presented in Table 2. Our proposed method, with the addition of the SAM and BFFM modules, achieves a detection speed of around 63.8 FPS, significantly faster than most other methods. This efficiency makes our method easier to deploy in production environments, reducing the need for high-end hardware.
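A common way to measure such throughput is to time repeated forward passes after a warm-up, as in the sketch below; the stand-in model and iteration counts are illustrative, and the reported speeds were measured on the P100 setup described earlier.

```python
import time
import torch

model = torch.nn.Conv2d(3, 2, 3, padding=1).eval()  # stand-in network
x = torch.randn(1, 3, 128, 128)

with torch.no_grad():
    for _ in range(10):                 # warm-up iterations
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()        # flush pending GPU work before timing
    start = time.perf_counter()
    n = 100
    for _ in range(n):
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    fps = n / (time.perf_counter() - start)

print(f"{fps:.1f} FPS")
```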
Table 2. Processing times of the networks.
Ablation study
To better understand the design choices of the bilateral feature fusion network, several ablation experiments are conducted.
Effects of different backbones
To clarify the impact of backbone networks of different sizes on our proposed method, experiments were conducted to compare the detection accuracy of ResNet18, ResNet34, and ResNet50 as backbone networks, as summarized in Table 3.
Table 3. Ablation of different backbones.
Although ResNet50 performs better than the other backbone networks, the improvement is not significant, which could be attributed to the limited size of the training dataset. Additionally, since ResNet50 requires more computing time, ResNet18 is preferred as the backbone network.
Effects of different configurations
The proposed model was evaluated with various configurations, as shown in Table 4. The addition of the self-attention module improved both Recall and Precision, indicating that the module helps the network focus on critical defect areas and achieve a better overall perception of defects. The bilateral feature fusion module was found to suppress background noise and reduce false positive detections in non-defective areas, leading to higher Precision. When both modules were used together, the proposed method achieved the best overall performance.
Table 4. Ablation of different configurations.
Conclusion
This paper proposes an efficient and accurate network for detecting defects in mobile cameras in industrial settings. The network autonomously extracts complex defect features from images, enabling precise and efficient defect detection without human intervention. Further advances in this domain have the potential to substantially improve defect detection in mobile cameras, raising performance, customer satisfaction, and overall industry quality.
The network comprises two key components: the self-attention module and the bilateral feature fusion module. The self-attention module provides rich contextual information and helps the network focus on defect areas, reducing false negatives and suppressing background interference. The bilateral feature fusion module allows for interaction between different levels of feature maps, improving defect segmentation accuracy. The proposed network is evaluated on a dataset of mobile camera defects from industrial production, and outperforms other state-of-the-art methods, particularly in detecting defects with complex backgrounds and low contrast.
However, it is worth noting that the network’s ability to generalize has only been tested on datasets specific to mobile cameras. To more thoroughly validate its utility, it would be beneficial to extend the testing to include different types of materials and defect contexts. In addition, exploring the integration of self-supervised learning techniques into defect detection could be a promising avenue for future studies.
Acknowledgement
This work is supported by the Graduate Research & Innovation Projects of Jiangsu Province (Project SJCX21-1526).
