Sage Journals: Discover world-class research

Abstract

Gas masks are essential respiratory protective equipment commonly used by laborers who work in harsh environments. However, respiratory diseases and accidents can occur when gas masks are not worn. To prevent such accidents, this paper develops an object detector based on convolutional neural networks (CNNs) that detects whether workers are wearing gas masks. To this end, a gas mask detection dataset was constructed from real industrial scenarios, and Faster R-CNN was improved for gas mask wearing detection. Firstly, to address the multi-scale problem in real scenes, the Feature Pyramid Network was introduced into Faster R-CNN to fuse features across levels and improve the detection of small objects. Secondly, the Online Hard Example Mining algorithm was used to alleviate the class imbalance problems in the dataset. Finally, Mixup and Mosaic were used during training to augment the data and help the model adapt to different scenes and complex backgrounds. In our experiments, the combination of the three optimization strategies improved the ${mAP}_{0.5 : 0.95}$ by 23.2 percentage points (from 36.7% to 59.9%). This work is an initial attempt at gas mask wearing detection, and there is still much room for improvement in both the model and the dataset.

Keywords

Gas mask wearing detection, convolutional neural networks, Faster R-CNN, Feature Pyramid Network, online hard example mining

1. Introduction

The health and safety of workers in industrial production has been a longstanding concern for both companies and governments. Researchers and technicians have made significant efforts to increase awareness of occupational safety, prevent accidents, establish effective occupational safety systems, and cultivate a safety culture. It is well-known that industrial production unavoidably generates harmful gases, dust, and particles that can cause damage to the respiratory system of workers who are exposed to these environments for extended periods. Therefore, protective measures must be implemented to safeguard the health of workers. In practice, gas masks are the most commonly used respiratory protection device, which can effectively filter out harmful substances from the air and protect the respiratory system of workers. However, accidents such as respiratory injuries and poisoning occasionally occur because gas masks are not worn as required. To address this issue, it is necessary to establish an effective gas mask wearing detection system that reminds workers to wear gas masks and promotes awareness of occupational safety.

In recent years, object detection algorithms based on convolutional neural networks (CNNs) have been widely used in security monitoring and surveillance. Current object detection algorithms can be divided into two types: one-stage and two-stage models. One-stage models, such as the YOLO series [1,9,23–25], SSD [19], and RetinaNet [15], use a single network to directly predict bounding boxes and classes, treating detection as a regression problem. While these models are known for their high speed and simple structures, they often compromise on detection accuracy. In contrast, two-stage models divide the detection process into two stages. In the first stage, the algorithm generates several proposal regions for the objects. In the second stage, it classifies and refines the proposal regions to obtain precise coordinates and category information. Two-stage models generally offer higher detection accuracy but require more computation than one-stage models. Classical two-stage models include Fast R-CNN [6], Faster R-CNN [26], Cascade R-CNN [2], and Dynamic R-CNN [34]. Researchers have proposed various CNN-based object detection models for occupational safety and public health, including methods for safety helmet detection [13,27], safety harness detection [3], and face mask detection [20,28,32]. These models aim to improve the accuracy and real-time performance of object detection systems through transfer learning, novel feature fusion methods, and task-specific strategies. In other industrial scenes, high-precision detection methods have been proposed for coated fuel particles [8] and tiny defects on PCB surfaces [33]. Furthermore, a novel architecture has been developed for detecting leaks in gas pipelines [21].

Object detection models based on convolutional neural networks have shown great potential in enhancing occupational safety and public health. However, the success of these models relies heavily on the availability of large datasets, which is a key characteristic of deep learning approaches. To address this problem, a gas mask detection dataset was built and analyzed to identify the challenges in developing an effective gas mask wearing detection system. On the one hand, the size of the targets in the dataset varies greatly and there are a large number of small targets. In the detection process, small targets have less feature information and are easily confused with the background, resulting in a large number of false detections. On the other hand, the dataset suffers from class imbalance problems, where the number of samples in different foreground classes (e.g., different object categories) is significantly unbalanced. Additionally, the foreground of interest may occupy a relatively small portion of the image. These issues can lead to difficulties in training object detection models and inaccurate identification of negative examples.

Detecting objects of varying sizes is a major challenge in computer vision because multi-layer convolution extracts information in a “shallow to deep” manner. Shallow features have high resolution and rich geometric information but lack semantic information, while deep features have rich semantic information but low resolution. As the network deepens, the semantic information of small targets is gradually diluted until it disappears. For large targets, sufficient semantic information may be extracted in deeper layers, but by that point the semantic information of small targets has been lost. As such, retaining semantic information for both small and large targets is difficult in object detection. To overcome this shortcoming, SSD [19] uses multiscale feature maps to detect targets: large feature maps detect relatively small targets, while small feature maps detect large targets. However, because the semantic information of low-level features is insufficient, it remains difficult to detect small targets accurately. The Feature Pyramid Network (FPN) [14] fuses low-level feature maps with high-level feature maps to obtain new feature maps for more accurate predictions. This integration of semantic and geometric information makes FPN highly effective in improving the detection accuracy of small targets. Furthermore, recent studies have demonstrated the advantages of bidirectional feature fusion, as seen in the simple yet effective Path Aggregation Network (PANet) [18]. Other researchers have explored more complex fusion techniques, such as ASFF [17], NAS-FPN [5], and BiFPN [30]. Compared with PANet, ASFF, and BiFPN, FPN is simple and versatile: its structure is easy to implement and can be applied to various object detection algorithms. At the same time, FPN achieves good results across a variety of object detection tasks.

In the field of object detection, imbalance problems include class imbalance, scale imbalance, spatial imbalance and objective imbalance [22]. This paper focuses on class imbalance, which often manifests as foreground-foreground and foreground-background class imbalance. The former can be mitigated by balancing the number of samples per category through data augmentation, repeated sampling, etc. The latter can be mitigated by controlling the proportion of positive and negative samples during training, which can be accomplished through sampling methods. The simplest method is to set the ratio of positive to negative samples manually, which is time-consuming to tune. For two-stage models, Shrivastava et al. [29] proposed Online Hard Example Mining (OHEM), which addresses the class imbalance problem by selecting difficult samples to train the network. OHEM avoids hyperparameter tuning and focuses on difficult foreground and background objects. However, the multi-task loss function (including classification loss and localization loss) defined in OHEM ignores the influence of different loss types in the training process, which can lead to a lack of attention to localization accuracy in later stages of training. To solve this problem, Li et al. [12] proposed S-OHEM, which samples training examples according to the distribution of the losses. Compared to OHEM, S-OHEM selects difficult samples based on the distribution of the different loss functions, rather than using only high-loss samples to update model parameters. However, S-OHEM introduces additional hyperparameters and does not provide a universal way to select them. In one-stage algorithms, improving the loss function is a common way to address class imbalance, for example with Balanced Cross Entropy and Focal Loss [15]. Focal Loss focuses on difficult samples and mitigates the low classification accuracy of classes with few samples, but it pays too much attention to outliers. GHM [10] uses the gradient norm to distinguish between outliers and normal samples. It reduces the model's attention to outlier samples that are extremely difficult to classify and thereby improves performance. However, GHM requires additional gradient computation, which increases the computational burden and training time. In addition, GHM still requires manual tuning of some hyperparameters, such as the number of groups and the boundary values of the gradient norm.

This work aims to enhance gas mask wearing detection by incorporating FPN into the Faster R-CNN algorithm, which facilitates multi-scale prediction, and thus improves the model’s ability to detect small targets and its overall stability. To mitigate the imbalance problem, we utilize OHEM during training, while also employing Mixup and Mosaic data augmentation techniques to improve the model’s discriminative ability and prevent overfitting. By integrating these three improvements, the final model achieves more stable performance in gas mask wearing detection, demonstrating the effectiveness of our approach. In brief, the contribution of this article is summarized as follows.

We propose a gas mask wearing detection method by integrating the Feature Pyramid Network into the Faster R-CNN, which effectively improves the model’s ability to detect small targets and increases its stability.

To effectively alleviate the imbalance problem and accelerate model convergence, we introduce the Online Hard Example Mining (OHEM) algorithm during the training process.

We propose a new gas mask detection dataset to meet the needs of practical applications; it contains 6134 images and 9724 annotations.

The rest of this article is organized as follows. Section 2 details the dataset and methods used in this work. Section 3 gives the analysis of experimental information and experimental results. Section 4 is the conclusion of this paper.

2. Method

2.1. Gas mask detection dataset

The dataset used in this work consists of data from real industrial scenarios and some scenarios similar to industrial scenarios. The following is a detailed description of the gas mask detection dataset.

2.1.1. A brief introduction to gas masks

Gas masks are generally divided into two types: filtered and isolated gas masks. Filtered gas masks primarily filter harmful gases and dust particles via a filter box or filter cotton. Isolated gas masks separate the wearer’s respiratory system, eyes, and face from the contaminated air and supply oxygen from a gas storage system. They are usually used in narrow, low-oxygen scenarios where long working hours are required. Compared to isolated gas masks, filtered gas masks have a wider range of industrial applications. The gas mask detection dataset focuses specifically on the 3M filtered gas masks shown in Fig. 1, which are commonly used in industrial production.

Fig. 1.

3M gas mask.

Table 1

The label names and descriptions of the dataset

Index	Label name	Description
1	front_head_wear	the front of the head faces the camera with wearing a gas mask
2	side_head_wear	the side of the head faces the camera with wearing a gas mask
3	front_head_no_wear	the front of the head faces the camera without wearing a gas mask
4	side_head_no_wear	the side of the head faces the camera without wearing a gas mask
5	front_head_wear_wrong	the front of the head faces the camera with wearing a wrong mask
6	side_head_wear_wrong	the side of the head faces the camera with wearing a wrong mask
7	back_head	the back of the head faces the camera

2.1.2. Data labeling

In this work, head orientations are divided into three categories based on the direction in which the subject is facing the camera: front-facing, side-facing and back-facing. These three categories are further subdivided into seven labels by taking into account whether or not the subject is wearing a gas mask. Table 1 gives a detailed description of the name and meaning of each label. Figure 2 shows examples of annotations. In real industrial scenarios, workers may enter production workshops wearing masks that do not provide adequate respiratory protection. This situation is just as worrying as not wearing a gas mask, so labels 5 and 6 have been included in the dataset.

Fig. 2.

Visualization of labels.

2.1.3. Dataset statistics

To increase the diversity of the dataset, we collected samples from environments similar to the primary data collection site. We also used a web crawler to obtain negative samples as supplementary data from image websites such as Bing Image Library and Baidu Image Library. Figure 3 shows the supplementary data. The dataset contains 6134 images from 15 different scenes. Among them, 5180 samples from 11 scenes are used as the training dataset, and 954 samples from the remaining scenes are used as the test dataset.

Fig. 3.

The supplementary data from other scenarios and websites.

Fig. 4a shows that the categories “front_head_wear_wrong” and “side_head_wear_wrong” have significantly fewer samples than the other categories. While this distribution may reflect the real-world scenario, it may lead to foreground-foreground class imbalance and, consequently, reduce the classification accuracy of the model. At this stage, due to unforeseen circumstances, we are unable to collect an even number of samples for each category. We plan to address this issue by collecting additional data in subsequent phases of this project to refine the dataset. Figure 4b shows a scatter plot of the length-width distribution of the labelled bounding boxes in the dataset. The majority of the bounding boxes are smaller than $300 \times 300$ in size and exhibit significant variation in scale, indicating a large number of small objects in the dataset. This can cause foreground-background imbalance, since small objects occupy less area in the image. The resulting imbalance may cause the model to classify more regions as background and miss some small targets. Moreover, because small objects offer few distinctive features, the model may mistakenly classify some background regions as small objects, leading to an increased false detection rate.

Fig. 4.

The statistics of the gas mask detection dataset.

Fig. 5.

The visualization effect of mixup and mosaic.

2.2. Data augmentation

Data augmentation is an effective method for addressing imbalance issues and improving the generalization ability of models by utilizing a larger amount of data. Therefore, it is necessary to apply data augmentation techniques to expand the gas mask detection dataset. Given the labour-intensive nature of data collection and annotation, we employed two strategies to augment our existing data. Prior to training, we used data augmentation techniques such as random rotation, shearing and flipping to expand our dataset. During training, we further enhanced our data by using the Mixup [35] and Mosaic [1] techniques to generate additional samples. The effect of these augmentations is shown in Fig. 5.

Mixup is a technique that mixes the inputs and labels of different samples to generate new training samples. It limits overfitting to individual samples and improves the model’s ability to discriminate target features from background features. The samples are interpolated using Eq. (1) and Eq. (2), where $x_{i}$ and $x_{j}$ are images in the training set and $y_{i}$ and $y_{j}$ are their corresponding ground truth bounding boxes.

$$\tilde{x} = \lambda x_{i} + (1 - \lambda) x_{j} \tag{1}$$

$$\tilde{y} = y_{i} \cup y_{j} \tag{2}$$

Here $\tilde{x}$ is the pixel-wise blend of the images $x_{i}$ and $x_{j}$, and $\tilde{y}$ is the list of ground truth bounding boxes merged from both images. The mixing coefficient $\lambda$ follows a $\mathrm{Beta}(\alpha, \beta)$ distribution, and following [36], $\alpha$ and $\beta$ are both set to 1.5. In Eq. (3), $L_{i}$ and $L_{j}$ are the detection losses computed for the objects of $x_{i}$ and $x_{j}$, respectively. The total loss of the mixed sample is their weighted sum.

$$L = \lambda L_{i} + (1 - \lambda) L_{j} \tag{3}$$

Mosaic is a common data augmentation technique that effectively increases the diversity and number of training images. The method stitches four different images together to form a new training image. It increases the variability and richness of the background while maintaining the position and size of the target objects, which helps the model distinguish the features of small targets from the background. At the same time, it increases the number of targets and the total number of samples, which helps mitigate the imbalance problem.
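As an illustration of Eqs. (1) to (3) and of the four-image stitching used by Mosaic, the following NumPy sketch may help. It is not the implementation used in this work: the function names, the `(x1, y1, x2, y2)` box format, the equal image sizes, the fixed 2×2 layout, and the per-box loss weights are all assumptions made for the example.

```python
import numpy as np

def mixup(img_i, boxes_i, img_j, boxes_j, alpha=1.5, beta=1.5, rng=None):
    """Blend two equally sized images (Eq. 1) and merge their boxes (Eq. 2)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, beta)  # lambda ~ Beta(alpha, beta)
    mixed = lam * img_i.astype(np.float32) + (1.0 - lam) * img_j.astype(np.float32)
    boxes = np.concatenate([boxes_i, boxes_j], axis=0)  # union of ground-truth boxes
    # Assumed convention: one weight per box, so that per-object losses can be
    # combined as lambda * L_i + (1 - lambda) * L_j (Eq. 3).
    weights = np.concatenate([
        np.full(len(boxes_i), lam),
        np.full(len(boxes_j), 1.0 - lam),
    ])
    return mixed, boxes, weights

def mosaic(images, boxes_list):
    """Stitch four equally sized images into a 2x2 grid and shift their boxes."""
    h, w = images[0].shape[:2]
    canvas = np.zeros((2 * h, 2 * w) + images[0].shape[2:], dtype=images[0].dtype)
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]  # (dx, dy) of each tile
    shifted = []
    for img, boxes, (dx, dy) in zip(images, boxes_list, offsets):
        canvas[dy:dy + h, dx:dx + w] = img
        shifted.append(boxes + np.array([dx, dy, dx, dy], dtype=boxes.dtype))
    return canvas, np.concatenate(shifted, axis=0)
```

Practical Mosaic implementations usually also rescale and crop the tiles around a random centre; the fixed grid here only shows how the boxes follow their images.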

2.3. Faster R-CNN network

Faster R-CNN consists of three main components: Backbone, Region Proposal Network (RPN) and Region of Interest (ROI) head [26]. Both the RPN and the ROI head share the feature map extracted from the backbone. The RPN generates proposal regions based on the input feature map, and then these regions are processed by the ROI Head, which performs coordinate regression and classification. Prior to RPN, Fast R-CNN used a selective search algorithm [6] to generate candidate boxes, which is computationally expensive. To address this problem, RPN was developed by using convolutional neural networks for feature extraction to generate candidate box locations. This approach reduces the computational overhead associated with selective search algorithms.

Fig. 6.

Region proposal network.

The main idea of the RPN, shown in Fig. 6, is to generate numerous candidate boxes that may contain targets. The RPN takes the feature maps generated by the backbone as input and slides a window over them, predicting k candidate boxes at each window position. Each sliding window produces a 256-dimensional intermediate vector, which is fed into two fully connected layers for category and coordinate prediction. The class scores are the probabilities that a candidate box belongs to the foreground or the background. The k candidate boxes are parameterised relative to k reference boxes called anchors. To make the network applicable to targets of different shapes and sizes, the RPN presets k anchor boxes with different scales and aspect ratios at each position of the feature map. Therefore, each sliding window predicts k boxes simultaneously and outputs $2 k$ class scores and $4 k$ coordinates.

The ROI head detects and identifies whether workers are wearing gas masks. During training, images first enter the backbone network for feature extraction, and the extracted feature maps are sent to the RPN to generate candidate regions. The features inside the candidate regions are then passed through ROI Pooling [26] to obtain fixed-dimension feature vectors. Finally, the feature vectors are sent to fully connected layers for target category prediction and coordinate regression.
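The anchor layout described above can be sketched as follows. This is a minimal illustration rather than the configuration used in this work: the scales, aspect ratios, stride, and the function name `make_anchors` are assumptions (three scales and three ratios give k = 9, as in the original Faster R-CNN design).

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2),
    where k = len(scales) * len(ratios)."""
    base = []
    for s in scales:
        for r in ratios:  # r = height / width, area stays s * s
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)  # (k, 4), centred at the origin

    # Centre of every feature-map cell, expressed in input-image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)  # both (feat_h, feat_w)
    centres = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)

    return (centres + base[None]).reshape(-1, 4)

k = 9                      # anchors per sliding-window position
scores_per_position = 2 * k  # foreground/background score for each anchor
coords_per_position = 4 * k  # box regression offsets for each anchor
```

For a feature map of size 4 × 6, `make_anchors(4, 6, stride=16)` yields 4 × 6 × 9 anchors, and the RPN predicts `scores_per_position` scores and `coords_per_position` coordinates at each of the 24 positions.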

2.4. Feature Pyramid Network

The Feature Pyramid Network (FPN) mainly addresses the multi-scale problem in object detection. It significantly improves small object detection by changing only the network connectivity, with little additional computation compared with the original model. During the forward computation of a convolutional neural network, the lower layers contain less semantic information but provide more accurate target locations. Conversely, the higher layers contain more semantic information but offer coarser target locations. To address this issue, FPN combines semantic features and location information from both higher and lower layers of the network using a bottom-up path, a top-down path, and horizontal (lateral) connections. This approach greatly improves the model’s ability to detect multi-scale objects, especially small objects. The structure of FPN is shown in Fig. 7.

Fig. 7.

Feature Pyramid Network.

In this paper, ResNet [7] is used as the backbone. It has five convolutional stages, and the first stage is not included in the feature pyramid because it is computationally intensive. The outputs of the last four stages are denoted as C2, C3, C4, C5, and the outputs of FPN are denoted as P2, P3, P4, P5. C5 enters the top-down path through a $1 \times 1$ convolution. In the top-down path, each smaller feature map is upsampled to the size of the feature map of the preceding stage, so that both the location information of the lower layer and the semantic information of the upper layer are used. The horizontal connection fuses the upsampled result with the same-size feature map generated by the bottom-up path. For example, since the top-level map after upsampling has the same resolution as C4, it can be added element-wise to C4 (after a $1 \times 1$ convolution that matches the channel dimensions) to obtain P4. Finally, the feature map of each level is output through a $3 \times 3$ convolution.
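The top-down path and horizontal connections can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the helper names are hypothetical, nearest-neighbour upsampling by a factor of 2 is assumed, the $1 \times 1$ convolutions are written as channel-mixing matrix products, and the final $3 \times 3$ output convolutions are omitted.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 along H and W."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(c_feats, lateral_weights):
    """c_feats = [C2, C3, C4, C5], each (C_in, H, W) with H and W halving per level.
    Returns the merged maps for levels 2..5, before the 3x3 output convolutions."""
    # Horizontal connections: bring every level to the same channel count.
    laterals = [conv1x1(c, w) for c, w in zip(c_feats, lateral_weights)]
    merged = [laterals[-1]]                 # the top-down path starts from C5
    for lat in reversed(laterals[:-1]):     # C4, then C3, then C2
        merged.insert(0, lat + upsample2x(merged[0]))
    return merged
```

Applying a $3 \times 3$ convolution to each returned map would give P2 to P5 as described in the text.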

2.5. Online hard example mining

During training, the RPN produces a large number of candidate boxes. Because gas mask targets occupy a small proportion of the image, background candidate boxes greatly outnumber foreground candidate boxes, and this severe imbalance may bias the model towards background prediction. To address the imbalance between positive and negative samples, this article uses Online Hard Example Mining in the training process of Faster R-CNN. As shown in Fig. 8, the ROI head is expanded into two networks that share parameters. One ROI head is read-only: its parameters are fixed, and it computes and sorts the losses of all candidate regions and selects the regions with the largest losses as hard samples. The other ROI head is trainable: its input is the hard samples selected by the read-only ROI head, and its output is the predicted bounding box coordinates and classification results. In short, OHEM adds a second ROI head that selects hard examples, which are then used to train the standard ROI head. OHEM can improve model accuracy, reduce overfitting, and improve training efficiency. In addition, the algorithm does not require a preset ratio of positive to negative samples, which greatly reduces the difficulty of training.
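The selection step performed by the read-only ROI head can be sketched as a top-k choice over per-region losses. The function name and the `num_hard` parameter are hypothetical, and this sketch omits the non-maximum suppression that the original OHEM formulation applies to remove overlapping regions before selection.

```python
import numpy as np

def select_hard_examples(cls_loss, loc_loss, num_hard=128):
    """Return the indices of the num_hard candidate regions with the largest
    total loss (classification + localization), hardest first."""
    total = np.asarray(cls_loss, dtype=float) + np.asarray(loc_loss, dtype=float)
    order = np.argsort(-total, kind="stable")  # sort by descending loss
    return order[:num_hard]

# Usage: only the selected regions are passed to the trainable ROI head.
hard_idx = select_hard_examples([0.1, 2.0, 0.5, 1.5], [0.0, 0.0, 0.0, 0.0], num_hard=2)
```

Because the regions are chosen by loss rather than by label, no fixed ratio of positive to negative samples has to be specified.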

Fig. 8.

The structure of Faster R-CNN with FPN and OHEM.

3. Experiment

3.1. Experimental platform and parameters

Our models were trained and tested on an NVIDIA GeForce RTX 3090 GPU, using randomly sized images of $1920 \times 1080$ , $1330 \times 880$ , and $1024 \times 512$ as input. During the training process, we used the stochastic gradient descent optimizer with a momentum of 0.9 and a weight regularization parameter of 0.0005. The initial learning rate was set to 0.001, and we adopted a polynomial decay strategy for the learning rate schedule with a weight decay of 0.01. Because of GPU performance limitations, we set the batch size to 4. To analyze the effectiveness of our model and the dataset’s existing problems, we conducted a series of tests.
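For reference, a polynomial learning rate decay of the kind mentioned above is commonly written as in the sketch below. Only the initial learning rate of 0.001 comes from the text; the exponent `power`, the total number of steps, and the floor `min_lr` are not reported here and are purely illustrative assumptions.

```python
def poly_lr(step, total_steps, base_lr=0.001, power=0.9, min_lr=0.0):
    """Polynomial decay from base_lr at step 0 to min_lr at total_steps.
    `power`, `total_steps` and `min_lr` are assumed values, not from the paper."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    return (base_lr - min_lr) * (1.0 - frac) ** power + min_lr
```

With `power=1.0` the schedule reduces to a linear decay; larger exponents lower the learning rate faster early in training.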

3.2. Evaluation metrics

In this article, the trained Faster R-CNN is evaluated on the test dataset to verify its recognition accuracy and generalization ability. Unlike in classification, the output of an object detector consists of a confidence score and the coordinates of each detected object. When evaluating the model, a confidence threshold and an intersection over union ($IoU$) threshold are set. Prediction boxes with confidence below the threshold are discarded. The $IoU$ between the prediction boxes and the ground truth is used to calculate $Precision$, $Recall$ and $AP$. $IoU$ is calculated according to Eq. (4), and Precision and Recall are defined in Eq. (5) and Eq. (6), respectively.

$$IoU = \frac{\text{Ground Truth} \cap \text{Prediction}}{\text{Ground Truth} \cup \text{Prediction}} \tag{4}$$

$$Precision = \frac{TP}{TP + FP} \tag{5}$$

$$Recall = \frac{TP}{TP + FN} \tag{6}$$

$TP$ (true positives) is the number of predicted boxes whose $IoU$ with the ground truth is greater than the specified threshold (${IoU}_{threshold}$). Conversely, $FP$ (false positives) is the number of predicted boxes with an $IoU$ less than or equal to the threshold. Finally, $FN$ (false negatives) is the number of ground truth boxes that were not detected. The $AP$ is the area under the precision-recall curve for a specific category, as given in Eq. (7), while the $mAP$ is the mean of the $AP$ over all categories. For multi-category tasks, the $mAP$ is calculated using Eq. (8), where ${AP}_{i}$ is the average precision of category $i$ and $k$ is the number of categories.

$${AP}_{i} = \int_{0}^{1} Precision(Recall) \, d(Recall) \tag{7}$$

$$mAP = \frac{\sum_{i = 1}^{k} {AP}_{i}}{k} \tag{8}$$
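Eqs. (4) to (7) can be illustrated with the short sketch below. The function names and the `(x1, y1, x2, y2)` box format are assumptions, and the area under the precision-recall curve is approximated with the common all-point interpolation, which may differ from the exact procedure of the evaluation tool used in the experiments.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format (Eq. 4)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts (Eqs. 5 and 6)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (Eq. 7), all-point interpolation.
    `recalls` must be sorted in increasing order."""
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    for i in range(len(p) - 2, -1, -1):   # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]    # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

For example, `precision_recall(232, 51, 71)` reproduces the precision and recall reported for the front_head_wear category in Table 3.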

The $F_{1}$ score, a comprehensive index of the model’s classification ability, is defined in Eq. (9). We use it to evaluate how well the model classifies each class in the dataset.

$$F_{1} = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{9}$$

The dataset is in COCO [16] format. Unlike conventional evaluation metrics, COCO computes $AP$ at 10 $IoU$ thresholds from 0.50 to 0.95 in steps of 0.05 (0.50:0.05:0.95). The traditional $mAP$ is calculated at a single $IoU$ threshold of 0.50. Averaging over $IoU$ thresholds gives a better evaluation of both the classification and the localization ability of the detector. In this article, we use ${mAP}_{0.5}$ to denote the $mAP$ under the ${IoU}_{threshold} = 0.5$ , ${mAP}_{0.75}$ to denote the $mAP$ under the ${IoU}_{threshold} = 0.75$ , and ${mAP}_{0.5 : 0.95}$ to denote the average $mAP$ under the 10 $IoU$ thresholds between 0.5 and 0.95.
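The $F_{1}$ score and the averaging over the 10 $IoU$ thresholds can be written compactly as below. The function names are hypothetical, and `coco_style_map` assumes that the $mAP$ at each threshold has already been computed.

```python
import numpy as np

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (Eq. 9)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The 10 IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP_{0.5:0.95}.
iou_thresholds = np.linspace(0.5, 0.95, 10)

def coco_style_map(map_per_threshold):
    """Average the mAP values obtained at each of the 10 IoU thresholds."""
    values = np.asarray(map_per_threshold, dtype=float)
    assert values.shape == iou_thresholds.shape
    return float(values.mean())
```

As a check against Table 3, the counts of the front_head_wear category (TP = 232, FP = 51, FN = 71) give `f1_score(232 / 283, 232 / 303)`, which rounds to the reported 79.2%.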

3.3. Experimental results and analysis

This section compares different methods in Faster R-CNN using a test input size of $1920 \times 1080$ and a confidence threshold of 0.1. The experimental results are presented in Table 2. Firstly, training Faster R-CNN without any additional methods produced poor results, with a ${mAP}_{0.5 : 0.95}$ of only 36.7%. We then evaluated Faster R-CNN augmented with FPN and OHEM. Adding FPN improved the ${mAP}_{0.5 : 0.95}$ by 19.1 percentage points to 55.8%, and using OHEM during training increased it further to 57.9%. Finally, applying the Mixup and Mosaic data augmentation approaches yielded a further improvement of 2.0 percentage points, to 59.9%.

Table 2
Effects comparison of different methods

FPN	OHEM	Mixup	Mosaic	${mAP}_{0.5}$ (%)	${mAP}_{0.75}$ (%)	${mAP}_{0.5 : 0.95}$ (%)
				55.7	40.7	36.7
✓				75.2	67.3	55.8
✓	✓			81.0	70.1	57.9
✓	✓	✓	✓	82.6	72.6	59.9

Table 3

The $F_{1}$ score of each category

Index	Category	Object number	TP	FP	FN	P(%)	R(%)	$F_{1}$ (%)
1	front_head_wear	253	232	51	71	82.0	76.6	79.2
2	side_head_wear	221	194	104	67	65.1	74.3	69.4
3	front_head_no_wear	136	134	63	41	68.0	76.6	72.0
4	side_head_no_wear	128	122	116	30	51.3	80.3	62.6
5	back_head	257	238	24	84	90.8	73.9	81.5
6	front_head_wear_wrong	90	73	18	54	80.2	57.5	67.0
7	side_head_wear_wrong	76	46	8	88	85.2	34.3	48.9

Table 4

Comparison with other models

Input size	Method	Backbone	Neck	${mAP}_{0.5}$ (%)	${mAP}_{0.75}$ (%)	${mAP}_{0.5 : 0.95}$ (%)
$512 \times 512$	SSD	VGG16	–	61.2	51.4	42.8
	RetinaNet	ResNet50	FPN	41.5	34.9	27.5
	YOLOv5	CSPDarknet53	PAFPN	43.3	38.5	30.5
	YOLOX	Darknet53	PAFPN	55.7	52.0	40.7
	YOLOv6	EfficientRep	PAFPN	43.2	35.0	24.8
	YOLOv7	YOLOv7-Backbone	PAFPN	48.2	39.6	31.9
	Cascade R-CNN	ResNet50	FPN	38.5	27.1	23.8
	Dynamic R-CNN	ResNet50	FPN	31.0	24.1	20.1
	Improved Faster R-CNN	ResNet50	FPN	51.1	46.3	36.6
$1024 \times 1024$	RetinaNet	ResNet50	FPN	67.4	61.9	48.2
	YOLOv3	Darknet53	–	56.0	48.4	38.7
	YOLOv5	CSPDarknet53	PAFPN	56.2	52.4	42.0
	YOLOX	Darknet53	PAFPN	71.3	66.3	52.0
	YOLOv6	EfficientRep	PAFPN	74.8	62.4	51.7
	YOLOv7	YOLOv7-Backbone	PAFPN	76.3	63.4	51.0
	Cascade R-CNN	ResNet50	FPN	72.6	65.4	52.7
	Dynamic R-CNN	ResNet50	FPN	71.0	62.8	50.7
	Improved Faster R-CNN	ResNet50	FPN	75.4	64.9	52.8
$1280 \times 1280$	RetinaNet	ResNet50	FPN	75.4	65.7	53.4
	YOLOv3	Darknet53	–	71.4	63.9	50.3
	YOLOv5	CSPDarknet53	PAFPN	74.0	66.7	54.4
	YOLOX	Darknet53	PAFPN	70.7	67.0	51.5
	YOLOv6	EfficientRep	PAFPN	77.1	65.7	54.2
	YOLOv7	YOLOv7-Backbone	PAFPN	74.6	63.9	52.6
	Cascade R-CNN	ResNet50	FPN	77.7	66.3	54.4
	Dynamic R-CNN	ResNet50	FPN	75.7	66.4	54.3
	Improved Faster R-CNN	ResNet50	FPN	82.1	69.0	56.7

The $F_{1}$ score of each category is shown in Table 3 to evaluate the classification ability of the model. In general, the classification accuracy of the model is not very good. The $F_{1}$ scores for categories 2, 3 and 4 are 69.4%, 72.0% and 62.6%, respectively. The detector produced many false positives for these three categories (104, 63 and 116, respectively), which lowers their precision and hence their $F_{1}$ scores. Similarly, category 7 has 88 false negatives, resulting in a low recall (34.3%) and a low $F_{1}$ score (48.9%). Referring to Fig. 4a, categories 6 and 7 (front_head_wear_wrong and side_head_wear_wrong) have relatively few samples in the dataset, so the model cannot fully learn the characteristics of these two categories. Combining the information in Table 2 and Table 3, we find that the model localizes accurately but classifies poorly. Because of the unbalanced sample categories in the dataset, OHEM and the data augmentation methods did not improve the generalisation ability of the model very much. In addition, the small batch size limits the effect of data augmentation: because Mixup and Mosaic randomly select images from a batch, a batch that is not large enough yields highly repetitive augmented samples and cannot fully exploit the features of the dataset. We also analyzed the differences between the training and test datasets and found a gap between the scales and angles of the samples retrieved from the web and those in real-world scenes. This difference may contribute to the inaccurate classification results.

Table 5

Detection effect under different confidence thresholds

Confidence threshold	Model	${mAP}_{0.5}$ (%)	${mAP}_{0.75}$ (%)	${mAP}_{0.5 : 0.95}$ (%)
0.1	YOLOv6	78.3	67.9	55.4
	YOLOv7	75.9	63.2	51.6
	Cascade R-CNN	79.7	65.9	56.1
	Dynamic R-CNN	73.8	64.1	53.5
	Improved Faster R-CNN	82.6	76.5	59.9
0.2	YOLOv6	74.3	65.8	53.3
	YOLOv7	72.8	61.7	49.9
	Cascade R-CNN	76.5	63.7	54.0
	Dynamic R-CNN	70.8	62.5	51.7
	Improved Faster R-CNN	77.0	68.5	56.6
0.3	YOLOv6	70.7	63.0	51.0
	YOLOv7	70.5	59.9	48.5
	Cascade R-CNN	73.9	61.7	52.4
	Dynamic R-CNN	66.8	59.6	49.3
	Improved Faster R-CNN	75.1	67.1	55.4
0.4	YOLOv6	66.4	59.8	48.3
	YOLOv7	68.4	58.5	47.1
	Cascade R-CNN	70.9	59.5	50.4
	Dynamic R-CNN	62.6	56.9	46.8
	Improved Faster R-CNN	73.8	66.2	54.5

This work compared the improved Faster R-CNN with classic models such as SSD, YOLOv3, RetinaNet, YOLOv5, and Cascade R-CNN [2], as well as recent models such as YOLOX [4], YOLOv6 [11], YOLOv7 [31], and Dynamic R-CNN [34]. To demonstrate the model's detection ability at different input scales, we conducted tests using input sizes of $512 \times 512$, $1024 \times 1024$, and $1280 \times 1280$ pixels. The results are presented in Table 4. SSD achieved the highest ${mAP}_{0.5}$ score of 61.2% at the smallest input size of $512 \times 512$ pixels, but could not be tested in the following experiments due to its limited support for larger input sizes. At an input size of $1024 \times 1024$ pixels, YOLOv7 achieved the highest ${mAP}_{0.5}$ score of 76.3%, while YOLOX achieved the highest ${mAP}_{0.75}$ score of 66.3%. Our model achieved the best performance on ${mAP}_{0.5 : 0.95}$, with a score of 52.8%, while the ${mAP}_{0.5}$ and ${mAP}_{0.75}$ scores were similar to those of YOLOX and YOLOv7, respectively. However, when the input size was increased to $1280 \times 1280$ pixels, the improved Faster R-CNN outperformed all other models on every metric, achieving an ${mAP}_{0.5}$ of 82.1%, an ${mAP}_{0.75}$ of 69.0%, and an ${mAP}_{0.5 : 0.95}$ of 56.7%. This suggests that our model is particularly suitable for larger input sizes, as its accuracy improves and becomes more stable as the input size increases, outperforming models such as YOLOX and YOLOv7.
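For readers less familiar with the three metrics, the Python sketch below shows how the IoU between a predicted box and a ground-truth box can be computed, and how ${mAP}_{0.5 : 0.95}$ averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, assuming the COCO-style convention [16]. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for illustration; this is not the evaluation code used in this work.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]


def mean_over_thresholds(ap_per_threshold):
    """Average AP values measured at each threshold in IOU_THRESHOLDS."""
    if len(ap_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)


# Two hypothetical 10 x 10 boxes that overlap on half of their width.
pred, gt = (0, 0, 10, 10), (5, 0, 15, 10)
overlap = iou(pred, gt)
matched_at_05 = overlap >= 0.5  # would this detection count at IoU 0.5?
print(overlap, matched_at_05)
```

In this example the intersection is 50 and the union is 150, so the IoU is one third and the detection would not count as a match even at the loosest threshold of 0.5. The stricter thresholds only accept tighter boxes, which is why ${mAP}_{0.75}$ and ${mAP}_{0.5 : 0.95}$ are lower than ${mAP}_{0.5}$ throughout Table 4.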

In practical applications, a high confidence threshold is typically employed to filter out a large number of false detections. However, as the confidence threshold increases, the false negative rate of the model also increases, leading to less stable performance. To demonstrate the robustness of our model, we evaluated its detection accuracy at different confidence thresholds (0.1, 0.2, 0.3 and 0.4) on the original $1920 \times 1080$ input images and compared it with several recent models: YOLOv6, YOLOv7, Cascade R-CNN and Dynamic R-CNN. The results are presented in Table 5, which reports the $mAP$ metric at IoU thresholds of 0.5, 0.75 and 0.5:0.95. In general, the $mAP$ scores of all models decrease as the confidence threshold increases. However, our model demonstrates greater stability in performance as the confidence threshold increases. Specifically, our model achieves the highest ${mAP}_{0.5}$, ${mAP}_{0.75}$ and ${mAP}_{0.5 : 0.95}$ scores among all models when the confidence threshold is set to 0.1, and significantly outperforms the others, especially at the IoU threshold of 0.5:0.95. For example, at the lowest confidence threshold of 0.1, our model achieves an ${mAP}_{0.5 : 0.95}$ score of 59.9%, while the second-ranking model, Cascade R-CNN, achieves only 56.1%. By contrast, YOLOv7 and Dynamic R-CNN exhibit a significant drop in performance at higher confidence thresholds, suggesting a higher incidence of false negatives. In conclusion, our improved Faster R-CNN demonstrates robustness with high accuracy, maintaining stable performance even at high confidence thresholds.
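The trade-off described above can be illustrated with a small Python sketch that filters detections by confidence score. The detection tuples, labels, scores and the helper name `filter_by_confidence` are hypothetical. The sketch only shows why a higher threshold removes low-scoring false detections while also discarding some correct ones.

```python
# Each hypothetical detection: (label, confidence score, is_correct).
detections = [
    ("gas_mask", 0.92, True),
    ("gas_mask", 0.35, True),
    ("gas_mask", 0.15, False),
    ("no_mask", 0.55, True),
    ("no_mask", 0.25, False),
]


def filter_by_confidence(dets, threshold):
    """Keep only detections whose score is at least the threshold."""
    return [d for d in dets if d[1] >= threshold]


summary = {}
for threshold in (0.1, 0.2, 0.3, 0.4):
    kept = filter_by_confidence(detections, threshold)
    correct = sum(1 for d in kept if d[2])
    wrong = len(kept) - correct
    summary[threshold] = (correct, wrong)
    print(f"threshold={threshold}: correct={correct} false={wrong}")
```

In this toy data, raising the threshold from 0.1 to 0.3 removes both false detections, but raising it further to 0.4 also discards a correct detection, which becomes a false negative.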

Fig. 9.

Visualization results from the test dataset.

To illustrate the results of our experiments, we selected some images from the test dataset. As can be seen in Fig. 9, our model detects the objects of interest with a high degree of accuracy and achieves good recognition results overall. The target localization of the model is largely accurate, with few false positives. However, in some instances the model mistakenly detected background regions as targets, resulting in false positives in the output. Our approach therefore has limitations, and there is still room for improvement. Future research could explore other techniques to improve the detection accuracy of gas masks in challenging scenarios, such as scenes with complex backgrounds that cause false detections.

4. Conclusion

In this article, we presented our work on constructing a gas mask dataset and training an effective gas mask detector based on the classical Faster R-CNN to meet practical needs in industrial production. To address the multi-scale problem, we incorporated the Feature Pyramid Network into Faster R-CNN, resulting in a substantial improvement in the detection performance of the model. To address the issue of class imbalance, we utilized OHEM during the training process. Furthermore, to augment the data during training, we applied the Mixup and Mosaic techniques. Based on the experimental results, we analyzed the problems encountered during dataset construction and model training. However, there is still ample room for improvement in our research. In future work, we plan to extend the number of scenarios, enrich the dataset, and conduct in-depth analyses of the features of each scenario. Moreover, we will explore better models and investigate effective deployment systems to meet real engineering needs.

Footnotes

Acknowledgements

This project was supported by the Provincial Natural Science Foundation of Anhui (No. 2108085QF264, 2108085QF268).

Conflict of interest

None to report.

References

1. Bochkovskiy, Wang and H.M. Liao, YOLOv4: Optimal speed and accuracy of object detection, CoRR, 2020. arXiv:2004.10934.

2. Cai and Vasconcelos, Cascade R-CNN: High quality object detection and instance segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 43(5) (2021), 1483–1498. doi:10.1109/TPAMI.2019.2956516.

3. Fang, Ding, Luo and P.E.D. Love, Falls from heights: A computer vision-based approach for safety harness detection, Automation in Construction 91 (2018), 53–61. https://www.sciencedirect.com/science/article/pii/S0926580517308403 . doi:10.1016/j.autcon.2018.02.018.

4. Ge, Liu, Wang, Li and Sun, YOLOX: Exceeding YOLO series in 2021, CoRR, 2021. arXiv:2107.08430.

5. Ghiasi, Lin and Q.V. Le, NAS-FPN: Learning scalable feature pyramid architecture for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, Computer Vision Foundation / IEEE, 2019, pp. 7036–7045. doi:10.1109/CVPR.2019.00720.

6. R.B. Girshick, Fast R-CNN, in: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7–13, 2015, IEEE Computer Society, 2015, pp. 1440–1448. doi:10.1109/ICCV.2015.169.

7. He, Zhang, Ren and Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, IEEE Computer Society, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.

8. Hu, Liu, Jiang et al., A high-precision detection method for coated fuel particles based on improved faster region-based convolutional neural network, Computers in Industry 143 (2022), 103752. https://www.sciencedirect.com/science/article/pii/S016636152200149X . doi:10.1016/j.compind.2022.103752.

9. Jocher, Chaurasia et al., Ultralytics/Yolov5: V6.2 – YOLOv5 classification models, Apple M1, reproducibility, ClearML and Deci.Ai integrations, Zenodo. doi:10.5281/zenodo.7002879.

10. Li, Liu and Wang, Gradient harmonized single-stage detector, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, the Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, the Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27–February 1, 2019, AAAI Press, 2019, pp. 8577–8584. doi:10.1609/aaai.v33i01.33018577.

11. Li, Jiang et al., YOLOv6: A single-stage object detection framework for industrial applications, CoRR, 2022. arXiv:2209.02976. doi:10.48550/arXiv.2209.02976.

12. Li, Zhang, Yu, Chen and Li, S-OHEM: Stratified online hard example mining for object detection, in: Computer Vision – Second CCF Chinese Conference, CCCV 2017, Tianjin, China, Proceedings, Part III, October 11–14, 2017, Yang, Hu, Cheng, Wang, Liu, Bai and Meng, eds, Communications in Computer and Information Science, Vol. 773, Springer, 2017, pp. 166–177. doi:10.1007/978-981-10-7305-2_15.

13. Li, Xie, Zhang, Lu, Xie, Su, Du and Hou, Toward efficient safety helmet detection based on YoloV5 with hierarchical positive sample selection and box density filtering, IEEE Trans. Instrum. Meas. 71 (2022), 1–14. doi:10.1109/TIM.2022.3169564.

14. Lin, Dollár, R.B. Girshick, He, Hariharan and S.J. Belongie, Feature pyramid networks for object detection, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, IEEE Computer Society, 2017, pp. 936–944. doi:10.1109/CVPR.2017.106.

15. Lin, Goyal, R.B. Girshick, He and Dollár, Focal loss for dense object detection, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017, IEEE Computer Society, 2017, pp. 2999–3007. doi:10.1109/ICCV.2017.324.

16. Lin, Maire, S.J. Belongie, Hays, Perona, Ramanan, Dollár and C.L. Zitnick, Microsoft COCO: Common objects in context, in: Computer Vision – ECCV 2014 – 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, Lecture Notes in Computer Science, Vol. 8693, Springer, 2014, pp. 740–755. doi:10.1007/978-3-319-10602-1_48.

17. Liu, Huang and Wang, Learning spatial fusion for single-shot object detection, CoRR, 2019. arXiv:1911.09516.

18. Liu, Qi, Qin, Shi and Jia, Path aggregation network for instance segmentation, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, Computer Vision Foundation / IEEE Computer Society, 2018, pp. 8759–8768. doi:10.1109/CVPR.2018.00913.

19. Liu, Anguelov, Erhan, Szegedy, S.E. Reed, Fu and A.C. Berg, SSD: Single shot MultiBox detector, in: Computer Vision – ECCV 2016 – 14th European Conference, Amsterdam, the Netherlands, October 11–14, 2016, Proceedings, Part I, Lecture Notes in Computer Science, Vol. 9905, Springer, 2016, pp. 21–37. doi:10.1007/978-3-319-46448-0_2.

20. Mercaldo and Santone, Transfer learning for mobile real-time face mask detection and localization, J. Am. Medical Informatics Assoc. 28(7) (2021), 1548–1554. doi:10.1093/jamia/ocab052.

21. Ning, Cheng, Meng, Duan and Wei, Enhanced spectrum convolutional neural architecture: An intelligent leak detection method for gas pipeline, Process Safety and Environmental Protection 146 (2021), 726–735. https://www.sciencedirect.com/science/article/pii/S0957582020319388 . doi:10.1016/j.psep.2020.12.011.

22. Oksuz, B.C. Cam, Kalkan and Akbas, Imbalance problems in object detection: A review, IEEE Trans. Pattern Anal. Mach. Intell. 43(10) (2021), 3388–3415. doi:10.1109/TPAMI.2020.2981890.

23. Redmon, S.K. Divvala, R.B. Girshick and Farhadi, You only look once: Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, IEEE Computer Society, 2016, pp. 779–788. doi:10.1109/CVPR.2016.91.

24. Redmon and Farhadi, YOLO9000: Better, faster, stronger, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, IEEE Computer Society, 2017, pp. 6517–6525. doi:10.1109/CVPR.2017.690.

25. Redmon and Farhadi, YOLOv3: An incremental improvement, CoRR, 2018. arXiv:1804.02767.

26. Ren, He, R.B. Girshick and Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39(6) (2017), 1137–1149. doi:10.1109/TPAMI.2016.2577031.

27. Sadiq, Masood and Pal, FD-YOLOv5: A fuzzy image enhancement based robust object detection model for safety helmet detection, Int. J. Fuzzy Syst. 24(5) (2022), 2600–2616. doi:10.1007/s40815-022-01267-2.

28. Sethi, Kathuria and Kaushik, Face mask detection using deep learning: An approach to reduce risk of coronavirus spread, Journal of Biomedical Informatics 120 (2021), 103848. https://www.sciencedirect.com/science/article/pii/S1532046421001775 . doi:10.1016/j.jbi.2021.103848.

29. Shrivastava, Gupta and R.B. Girshick, Training region-based object detectors with online hard example mining, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, IEEE Computer Society, 2016, pp. 761–769. doi:10.1109/CVPR.2016.89.

30. Tan, Pang and Q.V. Le, EfficientDet: Scalable and efficient object detection, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020, Computer Vision Foundation / IEEE, 2020, pp. 10778–10787. doi:10.1109/CVPR42600.2020.01079.

31. Wang, Bochkovskiy and H.M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, CoRR, 2022. arXiv:2207.02696. doi:10.48550/arXiv.2207.02696.

32. Yu and Zhang, Face mask wearing detection algorithm based on improved YOLO-v4, Sensors 21(9) (2021), 3263. doi:10.3390/s21093263.

33. Zeng, Wu, Wang et al., A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection, IEEE Transactions on Instrumentation and Measurement 71 (2022), 1–14. doi:10.1109/TIM.2022.3153997.

34. Zhang, Chang, Ma, Wang and Chen, Dynamic R-CNN: Towards high quality object detection via dynamic training, in: Computer Vision – ECCV 2020, Vedaldi, Bischof, Brox and J.-M. Frahm, eds, Springer International Publishing, Cham, 2020, pp. 260–275. ISBN 978-3-030-58555-6. doi:10.1007/978-3-030-58555-6_16.

35. Zhang, Cissé, Y.N. Dauphin and Lopez-Paz, Mixup: Beyond empirical risk minimization, in: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018.

36. Zhang, He, Zhang, Xie and Li, Bag of freebies for training object detection neural networks, CoRR, 2019. arXiv:1902.04103.

Gas mask wearing detection based on Faster R-CNN

