Sage Journals: Discover world-class research

Abstract

The objective of this work is to address the problem of detecting track intruders in railway systems using deep learning-based algorithms. Unauthorized entry onto railway tracks poses a significant risk of collisions between trains and humans. However, intrusion discrimination algorithms often suffer from a lack of learning data and data imbalance issues. To overcome these challenges, this research proposes an algorithm that combines generative models and classification networks. Generative models are utilized to generate synthetic intrusion data by learning the underlying distribution of available data and creating new samples resembling the original data. The augmented intrusion data is then used to train deep neural networks to accurately identify intrusions. The proposed algorithm is evaluated using real data sets, demonstrating its effectiveness in overcoming limited learning data and data imbalance issues. By augmenting intrusion data using generative models, the algorithm achieves improved accuracy compared to traditional approaches. In conclusion, the algorithm presented in this work provides a solution for detecting track intruders in railway systems. By leveraging generative models to augment limited intrusion data and utilizing classification networks for intrusion discrimination, the algorithm demonstrates improved performance in accurately identifying intrusions. This research highlights the potential of deep learning-based approaches in enhancing railway safety and recommends further exploration and application of these methods in real-world settings.

Keywords

Track intrusion detection, data augmentation, generative model, computer vision, Pix2Pix, diffusion

Introduction

Tracks are a core part of the railway infrastructure and the prime location of collisions between trains and humans. Unauthorized entry onto the tracks can lead to a major accident involving a collision with a train, and intruders may also damage railway infrastructure. To prepare for such dangerous situations, urban railway operators use real-time intruder-detection systems that include cameras installed on platforms. Recently, studies have sought to improve the accuracy and processing speed of these systems by applying deep learning-based algorithms.^1–5

Artificial intelligence (AI) refers to systems in which computers infer, learn, and make judgments on their own by learning the rules and patterns in data. Deep learning-based models, a type of AI technology, therefore need a sufficient amount of data to achieve satisfactory performance. For this reason, recent deep learning studies have investigated methods for increasing the amount of training data through data augmentation,^6,7 including methods that use a generative model to increase the amount of data while maintaining its quality. A generative model is a deep learning-based algorithm that learns to generate data similar to the original data, such that the input and output data have the same dimensions. These models have the advantage of generating data that resemble real data. In addition, research is being conducted on designing models whose data generation can be controlled.^8–12 Because such models can increase the quality and diversity of the generated data, they are being studied as a way to augment data and improve the performance of algorithms in various fields.

The data required in this study consisted of images of people not on the track (normal state) and images of people intruding onto the track (intruder state). Normal-state images are easy to acquire; however, collecting intrusion-state images is difficult because it involves train suspension and other safety problems. Therefore, in this study, we propose a deep learning-based generative model that augments the small number of track-intrusion images that can be secured. The proposed generative model can place the generated intruder at various positions, thereby increasing the diversity of the training data. When implementing an AI-based intrusion discrimination algorithm, the performance degradation caused by a lack of training data and by an imbalance between the normal- and intrusion-state data can be mitigated using the proposed method.

The remainder of this article is organized as follows. The “Intrusion detection system” section describes the networks used to determine and track intrusion. The “Generative model-based data augmentation” section provides an overview of the data and generation models used to augment the training data. In the “Experimental results” section, the proposed scheme is applied to actual image data, and its effectiveness is demonstrated by comparing it with various neural networks. Finally, the conclusions are presented in the “Conclusions” section.

Intrusion detection system

Track intrusion detection

With the development of AI-based image recognition technology, the demand for image-based risk detection systems is increasing, and research on detecting intrusion using deep learning is actively conducted. The growth of the railway industry and the significant increase in the number of tracks and users have increased the demand for railway safety. To ensure the safety of railway passengers, cameras are being installed on railway vehicles and platforms. These installations enhance surveillance and monitoring capabilities, providing a means to detect and respond to potential safety hazards more effectively. Accordingly, significant research has been conducted on deep learning algorithms for detecting intrusions into restricted areas of railway facilities.^1–5 These algorithms use computer vision techniques to analyze real-time video footage and identify potential security breaches. By employing deep neural networks, they can recognize and classify various objects and activities, enabling the detection of unauthorized individuals entering restricted zones. Cao et al.¹ used a lightweight neural network, while Guo et al.² employed an SSD network to detect objects trespassing in prohibited areas of railway facilities. Pan et al.³ explored convolutional neural networks and multi-task learning for this purpose. Kapoor et al.⁴ used Faster R-CNN as the detection method. Chen et al.⁵ proposed a two-stage network that first recognizes the railway tracks and then detects objects.

However, most AI-based intrusion detection studies use object detection, which aims to locate an object in an input image and output its coordinates. Therefore, to find a track intruder using object detection, it is additionally necessary to determine whether the location of the detected person belongs to the track or the platform. In this study, we used a classifier instead of object detection so that intruders can be detected without these additional modules. Classification is a task that infers which of the specified classes an input image belongs to; in this study, the classifier is trained to output whether the state is intrusion or normal. The classifier can predict the intrusion state by learning to jointly consider the division of the image into track and platform areas and the location of the person.

Intrusion discrimination network

In this study, ResNet,¹³ EfficientNet,¹⁴ MobileNet,^15–17 Vision Transformer,¹⁸ and multi-layer perceptron (MLP) Mixer¹⁹ were used as classifiers to determine intrusion. Early research on deep learning network structures focused on improving performance by deepening the layers. However, as a network becomes deeper, the gradient propagated backward through it shrinks (the vanishing gradient problem), which eventually reduces overall performance. To overcome this problem, ResNet introduced a residual block that uses a skip connection, as shown in Figure 1.

Figure 1.

ResNet.

The residual block groups several layers into a single unit and adds the input of the unit to its output. Because the skip connection is an identity mapping, its derivative is 1, so during backpropagation the gradient through the block always includes this constant term in addition to the gradient of the grouped layers, which alleviates the vanishing gradient problem.¹³
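The effect of the skip connection on the gradient can be illustrated with a scalar toy example. The sketch below is not the network used in this study; `weak_layers` is a hypothetical stand-in for a stack of layers whose own gradient has nearly vanished, and the derivative is estimated numerically.

```python
def residual_block(x, f):
    """Return x + f(x): the block input is added to the output of the layers f."""
    return x + f(x)


def numerical_derivative(g, x, eps=1e-6):
    """Central-difference estimate of dg/dx at x."""
    return (g(x + eps) - g(x - eps)) / (2 * eps)


def weak_layers(x):
    """Hypothetical layer stack whose gradient is tiny (1e-4)."""
    return 1e-4 * x


# Gradient through the layers alone, and through the same layers with a skip connection.
plain_grad = numerical_derivative(weak_layers, 2.0)
residual_grad = numerical_derivative(lambda x: residual_block(x, weak_layers), 2.0)

print(f"without skip connection: {plain_grad:.6f}")
print(f"with skip connection:    {residual_grad:.6f}")
```

With the skip connection, the derivative is 1 plus the derivative of the layers, so it stays close to 1 even when the layers themselves contribute almost nothing.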

Methods that enlarge the network to increase classifier accuracy have also been studied. Accuracy generally improves as the network becomes wider and deeper and as the resolution of the input image increases; however, these factors also increase the computational cost and must be balanced against one another. EfficientNet¹⁴ confirmed that increasing the width, depth, and input resolution each contributes to performance improvement and proposed compound scaling (Figure 2), which adjusts these three factors simultaneously to find an efficient trade-off between the accuracy and the computation time of the model.
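As a rough illustration of compound scaling, the sketch below ties network depth, network width, and input resolution to a single coefficient `phi`. The base constants 1.2, 1.1, and 1.15 are the values reported in the EfficientNet paper for its baseline network; they are not taken from this study, and the function names are hypothetical.

```python
# Base multipliers for depth, width, and resolution (values from the EfficientNet paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return the depth, width, and resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi):
    """Approximate growth in computation: depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    scale = compound_scale(phi)
    print(phi, {k: round(v, 3) for k, v in scale.items()}, round(flops_factor(phi), 2))
```

Because the product of the base multipliers is chosen to be close to 2, each increment of `phi` roughly doubles the computation, which makes the accuracy and cost trade-off easy to reason about.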

Figure 2.

EfficientNet.

MobileNet^15–17 is a family of networks developed by Google for use on mobile devices; it combines low computational cost with high performance. As shown in Figure 3, MobileNetV1 is built on depthwise separable convolution, which applies a separate convolution to each channel of the input and then combines the channels with a 1 × 1 (pointwise) convolution; this requires 8 to 9 times less computation than a standard 3 × 3 convolution. MobileNetV2 proposed a linear bottleneck that uses projection layers to express high-dimensional information in low dimensions. MobileNetV3 used neural architecture search (NAS) to optimize the structure of the model and improve its accuracy.
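The "8 to 9 times" figure follows from counting multiply-accumulate operations. The sketch below compares a standard convolution with a depthwise separable one; the layer sizes (3 × 3 kernel, 128 input and output channels, 56 × 56 feature map) are hypothetical and chosen only for illustration.

```python
def standard_conv_cost(kernel, c_in, c_out, size):
    """Multiply-accumulates of a standard convolution on a size x size feature map."""
    return kernel * kernel * c_in * c_out * size * size


def depthwise_separable_cost(kernel, c_in, c_out, size):
    """Cost of a per-channel (depthwise) convolution followed by a 1 x 1 (pointwise) one."""
    depthwise = kernel * kernel * c_in * size * size
    pointwise = c_in * c_out * size * size
    return depthwise + pointwise


# Hypothetical layer dimensions, not taken from the article.
kernel, c_in, c_out, size = 3, 128, 128, 56

standard = standard_conv_cost(kernel, c_in, c_out, size)
separable = depthwise_separable_cost(kernel, c_in, c_out, size)
speedup = standard / separable

print(f"standard: {standard}, separable: {separable}, reduction: {speedup:.2f}x")
```

The ratio simplifies to 1 / (1/c_out + 1/kernel²), so for a 3 × 3 kernel it approaches 9 as the number of output channels grows.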

Figure 3.

MobileNetV1.

A vision transformer (ViT)¹⁸ applies the transformer, which has become a standard for achieving good performance in natural language processing, to problems in image processing and image understanding. ViT does not use convolution; it has a small number of parameters but shows results similar to those of convolutional neural networks (CNNs). As shown in Figure 4, ViT divides the input image into patches and flattens each patch into a one-dimensional vector. A class token is then prepended, position embeddings indicating the location of each patch in the input image are added, and the resulting sequence is passed to the transformer encoder. The transformer encoder consists of layer normalization, attention, and an MLP, which together extract the features.
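The patch-splitting step can be sketched in plain Python as follows. The 224 × 224 RGB input size is an assumption made for illustration (the article does not state the input resolution), the patch size of 16 matches the "/16" in the model name, and `patchify` is a hypothetical helper.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch into a single one-dimensional vector.
            vector = [
                channel
                for row in range(top, top + patch)
                for col in range(left, left + patch)
                for channel in image[row][col]
            ]
            patches.append(vector)
    return patches


# Assumed input: a blank 224 x 224 RGB image.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)
num_tokens = len(patches) + 1  # one extra token for the class token

print(len(patches), len(patches[0]), num_tokens)
```

Under these assumptions the encoder receives 196 patch vectors of length 768 plus one class token.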

Figure 4.

Vision transformer.

The MLP-Mixer¹⁹ is a model that uses only simple perceptron layers, without convolution or attention layers. In an MLP-Mixer, an image is divided into patches, which are arranged into a matrix with one row per patch. As shown in Figure 5, this matrix is transposed between MLPs so that the operation is applied along both the row and column directions, and this process is repeated from layer to layer. In this way, an operation similar to convolution is performed without using convolution, and results similar to those of ViT are obtained.
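The transpose-and-mix idea can be sketched with nested lists. This is a simplified illustration, not the model used in the study: layer normalization is omitted, and `mean_fill` is a hypothetical placeholder for the two small MLPs of a real Mixer layer.

```python
def transpose(matrix):
    """Swap rows and columns of a nested-list matrix."""
    return [list(column) for column in zip(*matrix)]


def add(a, b):
    """Element-wise sum of two matrices (used for the skip connections)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mixer_layer(x, token_fn, channel_fn):
    """x has one row per patch and one column per channel."""
    # Token mixing: transpose so each row holds one channel across all patches,
    # apply the token function, then transpose back.
    token_mixed = transpose([token_fn(row) for row in transpose(x)])
    x = add(x, token_mixed)
    # Channel mixing: apply the channel function to each patch independently.
    channel_mixed = [channel_fn(row) for row in x]
    return add(x, channel_mixed)


def mean_fill(row):
    """Placeholder for an MLP: replace every entry with the row mean."""
    mean = sum(row) / len(row)
    return [mean] * len(row)


# Three patches with two channels each (toy values).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = mixer_layer(x, mean_fill, mean_fill)
print(y)
```

Token mixing lets information flow between patches, while channel mixing lets it flow between the features of a single patch; alternating the two is what replaces convolution and attention.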

Figure 5.

Mixer layer of multi-layer perceptron (MLP)-Mixer.

Generative model-based data augmentation

The rapid development of deep learning-based algorithms enabled the solving of problems that were difficult to solve in many industries. As AI systems optimize algorithms based on data, they require a sufficient quantity of high-quality data. However, it is often difficult to obtain sufficient data. Recently, several studies have focused on data augmentation technology using generative models to solve the data-shortage problem. Further, studies have been reported to improve performance by increasing the amount of learning data of AI systems by generating difficult-to-collect data, such as data for liver lesion diagnosis,⁹ three-dimensional face landmark data,¹⁰ CT image data¹¹ of COVID-19 patients, and car driving image data.¹² Therefore, in this study, data were augmented using a generative model to solve the performance degradation caused by the limited collection of railway track intrusion image data.

The data used in this study include images of the track and platform, as shown in Figure 6. A generative model learns the probability distribution of the input data and outputs data that have the same dimensions as the input but different values. The generative adversarial network (GAN), proposed by Goodfellow et al.²⁰ in 2014, is a representative deep learning-based generative model in which two networks are trained simultaneously in competition: a generator that generates images and a discriminator that determines whether a given image is real or generated. Many studies^21–25 on image generation in computer vision have been published, and significant progress has been made based on the GAN.

In this study, we augmented the insufficient data using Pix2Pix,²⁵ which is used in the field of image-to-image translation. Unlike a vanilla GAN, which generates an image from noise, Pix2Pix takes an image as input and translates it into an image in a different domain while maintaining its features and information; it was the first to define image-to-image translation in this way. Pix2Pix uses U-Net²⁶ as its generator. U-Net is an autoencoder-based model that reduces the dimensions of the input data to extract key information and then expands them again, so that the output has the same dimensions as the input but different values; skip connections are added to better preserve information from the original. In this study, Pix2Pix is used to create people at a desired location. As shown in Figure 7, a black rectangle is drawn over the location of a person to form the input image, and the original image is used as the target output image. These input-output pairs are used as training data for the generative model. When a rectangle is placed at a desired position in an input image and the image is fed to the trained model, the generator creates a person that fits the background.
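The construction of a training pair described above can be sketched as follows. This is a minimal illustration on a nested-list image; the function name `make_pix2pix_pair` and the `(top, left, height, width)` box format are hypothetical and are not taken from the authors' implementation.

```python
import copy


def make_pix2pix_pair(image, box):
    """Return (input_image, target_image) for one training example.

    image: H x W x 3 nested list of pixel values.
    box:   (top, left, height, width) of the region containing the person.
    """
    top, left, height, width = box
    masked = copy.deepcopy(image)  # leave the original image untouched
    for row in range(top, top + height):
        for col in range(left, left + width):
            masked[row][col] = [0, 0, 0]  # paint the person region black
    return masked, image


# Toy 4 x 4 white image with a hypothetical person box in the centre.
image = [[[255, 255, 255] for _ in range(4)] for _ in range(4)]
model_input, target = make_pix2pix_pair(image, (1, 1, 2, 2))

print(model_input[1][1], model_input[0][0], target[1][1])
```

At generation time the same masking would be applied to an empty track image at the position where a person should appear, and the trained generator would fill in the masked region.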

Figure 6.

Intrusion and normal state examples.

Figure 7.

Data augmentation using Pix2Pix.

Unlike GAN-based generative models, the diffusion model²⁷ models the probability distribution explicitly and is designed to keep it highly tractable. The studies by Ho et al.²⁸ and Saharia et al.,²⁹ which reported high accuracy, have received considerable attention. The network consists of a diffusion process, which adds a very small amount of noise to the input image step by step until it becomes noise that follows a standard normal distribution, and a reverse process, which defines the inverse of each step with learnable parameters and learns it. Equation (1) is the basic loss function of the diffusion model.

\mathbb{E} \left[ \underbrace{D_{KL}\left(q(X_{T} | X_{0}) \,\|\, p_{\theta}(X_{T})\right)}_{L_{T}} + \sum_{t > 1} \underbrace{D_{KL}\left(q(X_{t-1} | X_{t}, X_{0}) \,\|\, p_{\theta}(X_{t-1} | X_{t})\right)}_{L_{t-1}} \underbrace{- \log p_{\theta}(X_{0} | X_{1})}_{L_{0}} \right]

(1)

In (1), $L_{T}$ represents the difference between the fully noised image produced by the diffusion process and Gaussian noise. $L_{t-1}$ represents the difference between each step of the reverse process learned by the network and the corresponding step of the diffusion process; in practice, it reduces to the difference between the noise predicted by the learned network and the Gaussian noise actually added in the diffusion process. $L_{0}$ represents the difference between the original image and the image obtained by adding noise in the first step. In the diffusion model, a simplified form of (1) is used as the loss function. The simplified loss function uses only the noise term from $L_{t-1}$. Therefore, diffusion learns a network that receives the noised image $X_{t}$ and predicts the noise it contains; this prediction is used to remove the noise step by step in the reverse process.

Existing diffusion-model-based generative models generate an entire image from noise. However, this study requires a model that generates a person in the track portion of the image. Therefore, the diffusion process is performed such that only the human portion of the input image is changed to noise. Subsequently, when a noise square is placed at the desired location in the input image in the reverse process, as shown in Figure 8, the generator creates a person that matches the background.
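The idea of noising only the human region can be sketched with the standard closed-form forward step, in which a noised value is sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard normal distribution. The sketch below is an assumption-laden illustration, not the authors' implementation: the image is a flat list of pixel values, `mask` marks the person region, and the linear noise schedule (1e-4 to 0.02 over 1000 steps) is the one reported by Ho et al. rather than a setting stated in this article.

```python
import math
import random


def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) over the first t steps."""
    product = 1.0
    for beta in betas[:t]:
        product *= 1.0 - beta
    return product


def noise_region(x0, mask, t, betas, rng):
    """Apply t forward-diffusion steps only where mask is 1; keep other pixels."""
    ab = alpha_bar(t, betas)
    noised_image = []
    for value, inside in zip(x0, mask):
        eps = rng.gauss(0.0, 1.0)  # Gaussian noise for this pixel
        noised = math.sqrt(ab) * value + math.sqrt(1.0 - ab) * eps
        noised_image.append(noised if inside else value)
    return noised_image


# Assumed linear schedule with 1000 steps (values from Ho et al., not this study).
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]

x0 = [0.5] * 6               # toy "image" of six pixels
mask = [0, 0, 1, 1, 0, 0]    # hypothetical person region
xt = noise_region(x0, mask, 1000, betas, random.Random(0))

print(xt)
```

The background pixels are returned unchanged, so a network trained on such inputs only has to learn to fill in the masked region consistently with its surroundings.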

Figure 8.

Data augmentation using diffusion model.

Experimental results

To select a discriminant model that learns the characteristics of track-intrusion images well, the five discriminant models described in the “Intrusion detection system” section were trained, and their performance was evaluated using a previously secured dataset. Table 1 presents the performance of the trained models.

Table 1.

Performance comparison by network model.

Network model	Accuracy (%)	No. of parameters
ResNet_50	64.67	$23.5 \times 10^{6}$
EfficientNet_b2	62.67	$7.7 \times 10^{6}$
MobileNetV3_large	70.00	$4.2 \times 10^{6}$
ViT_B/16	56.67	$5.5 \times 10^{6}$
MLP-Mixer_S/16	48.00	$17.8 \times 10^{6}$

The experiment used data from three people to detect intrusions on the tracks. The dataset consisted of 241 images: 191 for training and 50 for testing. In a binary classification problem, the actual values and the values predicted by the model are summarized in a 2 × 2 matrix, as shown in Table 2.

Table 2.

TP, FP, FN, and TN.

	Intruder state	Normal state
Estimated intruder state	True positive (TP)	False positive (FP)
Estimated normal state	False negative (FN)	True negative (TN)

True and false indicate whether the predicted result is correct. Positive and negative indicate that the model classified the state as intruder or normal, respectively. Classification accuracy is the most widely used performance index for neural network classifiers and is defined by

\mathrm{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}

(2)

The experimental results confirmed that MobileNetV3 had the highest accuracy. It outperformed ResNet, which is also a CNN-based model, because its architecture is optimized using NAS. ViT and MLP-Mixer, which are not CNN-based, are known to perform well when trained on large datasets; with the small dataset used in this study, MobileNetV3 is therefore believed to have performed better. MobileNetV3 also had the smallest number of parameters among the compared models. Therefore, in this study, MobileNetV3, with the smallest number of parameters and good recognition performance, was used as the classifier. The specifications of the hardware and the structures of the five deep networks used in the experiment are shown in Table 3 and Figure 9, respectively.
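The accuracy defined in (2) can be computed directly from the four counts in Table 2. In the sketch below, the counts are hypothetical: they are chosen only so that they are consistent with a 50-image test set containing 15 intrusion and 35 normal images, and they are not results reported in this article.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of correct predictions: (TP + TN) / (TP + FP + FN + TN)."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical confusion-matrix counts (15 intrusion images, 35 normal images).
example = accuracy(tp=9, fp=9, fn=6, tn=26)
print(f"accuracy = {example:.2%}")
```

Here 35 of the 50 predictions are correct, which gives an accuracy of 70%.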

Figure 9.

Structures of five networks.

Table 3.

System specification.

Type	Product name	Specification
CPU	i9-11900F	2.5 GHz
GPU	RTX 3090	24 GB
RAM	–	64 GB

The datasets are composed of images acquired from three tracks (A, B, and C), as shown in Figure 10.

Figure 10.

Track images.

The dataset used in this study was collected from the Yongin EverLine, an operational subway system in South Korea. The training dataset consisted of 81, 82, and 28 samples for tracks A, B, and C, respectively, while the test dataset comprised 21, 21, and 8 samples for the same tracks. The training data contained 60, 60, and 20 normal-state images for tracks A, B, and C, respectively, and 21, 22, and 8 images depicting intrusion events. The test data contained 15, 15, and 5 normal-state images for tracks A, B, and C, respectively, along with 6, 6, and 3 intrusion images. Table 4 presents the results of evaluating the intrusion state on each of the three tracks using MobileNetV3.

Table 4.

Intrusion detection performance.

	Accuracy (%)
Track A	64.00
Track B	67.00
Track C	65.33
Total	70.00
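The per-track sample counts given above can be tallied to check the stated totals and to quantify the class imbalance. The sketch below uses only the counts stated in the text; the dictionary layout and function name are illustrative choices.

```python
# Sample counts per track, as stated in the text.
TRAIN = {
    "A": {"normal": 60, "intrusion": 21},
    "B": {"normal": 60, "intrusion": 22},
    "C": {"normal": 20, "intrusion": 8},
}
TEST = {
    "A": {"normal": 15, "intrusion": 6},
    "B": {"normal": 15, "intrusion": 6},
    "C": {"normal": 5, "intrusion": 3},
}


def class_totals(split):
    """Sum the normal and intrusion counts over all tracks."""
    totals = {"normal": 0, "intrusion": 0}
    for counts in split.values():
        for label, count in counts.items():
            totals[label] += count
    return totals


train_totals = class_totals(TRAIN)
test_totals = class_totals(TEST)
imbalance = train_totals["normal"] / train_totals["intrusion"]

print(train_totals, test_totals, round(imbalance, 2))
```

The training set contains 140 normal images and 51 intrusion images, so normal images outnumber intrusion images by a factor of about 2.7; this is the imbalance that the augmentation is intended to reduce.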

Results after data augmentation using Pix2Pix

Two experiments were conducted to augment the data using Pix2Pix: in the first, a black rectangle was placed on the input image at the location where a person was to be created (Pix2PixA); in the second, a rectangle filled with random noise was placed there instead (Pix2PixB). Figure 11 shows images generated using Pix2Pix after training was completed.

Figure 11.

Images generated using Pix2Pix.

Figure 11(a) shows the result for Pix2PixA. Although a person has been created, the boundaries are blurred and fine details are lacking. Figure 11(b) shows the result for Pix2PixB; in this case, training did not proceed well, and the generated region has no recognizable human form. The number of training images was increased by 500 using the two Pix2Pix model variations. Table 5 lists the intrusion detection results obtained by training MobileNetV3 on the augmented dataset.

Table 5.

Intrusion detection performance using two pix2pix model variations.

	MobileNetV3 + Pix2PixA	MobileNetV3 + Pix2PixB	MobileNetV3
Track A	66.00	62.67	64.00
Track B	70.00	65.33	67.00
Track C	68.00	63.33	65.33
Total	72.00	67.00	70.00

Compared with the results before augmentation, the accuracy with Pix2PixA improved, whereas that with Pix2PixB decreased. This indicates that the quality of the generated images affects the performance of the track intrusion classifier. Therefore, in this study, diffusion is used to improve the quality of the generated images. The structure of Pix2Pix used in the experiment is shown in Figure 12.

Figure 12.

Structure of Pix2Pix.

Results after data augmentation using diffusion

Four experiments were conducted to augment the data using diffusion in different ways: creating one person with one pose (DiffusionA), one person with four poses (DiffusionB), three people with one pose (DiffusionC), and three people with three poses (DiffusionD). The four poses are standing, walking, falling, and sitting. Each diffusion model is trained in 2000 steps. A person is created on an empty track image, and the result is used for data augmentation. Figure 13 shows images generated using diffusion after training was completed.

Figure 13.

Images generated using diffusion.

Figure 13(a) to (d) shows the results of DiffusionA, DiffusionB, DiffusionC, and DiffusionD, respectively. As shown in Figure 13, some images are well generated, but some images with unclear human shapes are also generated. The number of training data points is increased by 500 with four diffusions. Table 6 shows the results of intrusion detection using the augmented dataset.

Table 6.

Intrusion detection performance using four diffusion methods.

	MobileNetV3 + DiffusionA	MobileNetV3 + DiffusionB	MobileNetV3 + DiffusionC	MobileNetV3 + DiffusionD
Track A	76.17	67.00	72.00	70.33
Track B	82.17	71.00	76.00	73.00
Track C	82.83	74.33	79.33	66.33
Total	84.00	76.67	81.33	74.00

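All four diffusion variants achieved a higher total accuracy than the 70.00% obtained without augmentation (Table 4). However, accuracy decreased as more poses or people were added: DiffusionB, DiffusionC, and DiffusionD all scored lower than DiffusionA. This is because an increase in diversity hinders the training of the diffusion model, thereby lowering the quality of the generated images. Consequently, among the four experiments, DiffusionA exhibited the best performance. The structure of the diffusion model used in the experiment is shown in Figure 14.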
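The total accuracies reported in Tables 4 to 6 can be compared against the unaugmented baseline with a few lines of arithmetic. The sketch below uses only the "Total" rows of those tables; the function names are illustrative.

```python
BASELINE = 70.00  # MobileNetV3 without augmentation (Table 4, Total)

# Total accuracy (%) with each augmentation method (Tables 5 and 6, Total rows).
TOTAL_ACCURACY = {
    "Pix2PixA": 72.00,
    "Pix2PixB": 67.00,
    "DiffusionA": 84.00,
    "DiffusionB": 76.67,
    "DiffusionC": 81.33,
    "DiffusionD": 74.00,
}


def gains(results, baseline):
    """Accuracy change in percentage points relative to the baseline."""
    return {name: round(acc - baseline, 2) for name, acc in results.items()}


def best_method(results):
    """Name of the method with the highest total accuracy."""
    return max(results, key=results.get)


deltas = gains(TOTAL_ACCURACY, BASELINE)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
print("best:", best_method(TOTAL_ACCURACY))
```

DiffusionA gives the largest gain, 14 percentage points over the baseline, while Pix2PixB is the only variant that falls below it.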

Figure 14.

Structure of diffusion.

Discussion

The distinctive aspects of this study are as follows. First, we used a classifier to detect intruders instead of object detection, which requires additional modules. The objective of the classifier is to determine the class to which the input image belongs; in this case, it is trained to distinguish between the intrusion and normal states. By considering not only the location of the person but also the division of the image into track and platform areas, the classifier can comprehensively predict the intrusion state. Second, the diffusion-based model does not generate the entire image from noise. Instead, it employs a diffusion process in which only the human portion of the input image is changed to noise.

Based on the findings of this study, we make the following recommendations. First, the diffusion process is computationally demanding because it performs numerous computations at each step, so sufficient training time and computing resources are required. It is therefore essential to allocate adequate resources for training and generation and to design experiments accordingly. Second, an evaluation metric is needed that can assess whether the generated images are effective for training the classifier. Since the visual quality of the generated images is not directly related to the performance of a classifier trained with those images added, a metric that takes both factors into account needs to be developed.

Conclusions

This study proposed a method to augment training data using deep learning-based generative models to improve the performance of discriminant models that recognize people intruding on railway tracks; this intrusion is a representative hazard in the railway field. Deep learning-based algorithms perform well only when a sufficient amount of data is secured; however, the task of acquiring images with human intrusion on the track can pose many risks. To address this problem, experiments were conducted, demonstrating that augmenting data with generative models learned to generate people on empty-track images improves performance. This study experimentally confirmed MobileNetV3 as the most suitable deep learning-based discrimination model. In addition, the accuracy of the track intrusion discriminator was significantly improved by augmenting the training data through learning to create people at the desired location using diffusion as a generation model. Therefore, the proposed method not only addresses the lack of data for deep learning-based track-intrusion discrimination systems but also improves the accuracy of the system.

Footnotes

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP; Ministry of Science, ICT & Future Planning) (NRF-2021R1F1A1052074). This work was also supported by the Korea National University of Transportation Industry-Academy Cooperation Foundation in 2023.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Research Foundation of Korea, Korea National University of Transportation Industry-Academy Cooperation Foundation (grant number NRF-2021R1F1A1052074).

ORCID iDs

Beomseong Kim

Heesung Lee

Author biographies

SooHyung Lee received the BS and MS degrees in railroad electrical and electronics engineering from Korea National University of Transportation (KNUT), Uiwang-si, Gyeonggi-do, Korea, in 2021 and 2023, respectively. Since 2023, he has been with ALCHERA Corporation, Seongnam-si, Gyeonggi-do, Korea. His current research interests include deep learning, generative models, and intelligent railroad systems.

Beomseong Kim received the BS and combined MS and PhD degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2009 and 2015, respectively. From 2015 to 2017, he was a senior researcher with LG Electronics Corporation, Seoul, Korea. From 2017 to 2020, he was a manager with SK Telecom Corporation, Seoul, Korea. Since 2020, he has been with the Gyeonggi University of Science and Technology, Siheung-si, Gyeonggi-do, Korea, where he is currently an assistant professor. His current research interests include machine learning algorithms, sensor fusion systems, and autonomous control for robots.

Heesung Lee received the BS, MS, and PhD degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2003, 2005, and 2010, respectively. From 2011 to 2014, he was a managing researcher with Samsung S1 Corporation, Seoul, Korea. Since 2015, he has been with the Department of Railroad Electrical and Electronics Engineering at Korea National University of Transportation (KNUT), Uiwang-si, Gyeonggi-do, Korea, where he is currently an associate professor. His current research interests include computational intelligence, biometrics, and intelligent railroad systems.

References

1. Cao, Qin, Xie, et al. An effective railway intrusion detection method using dynamic intrusion region and lightweight neural network. Measurement (Mahwah NJ) 2022; 191: 110564.

2. Guo, Shi, Zhu, et al. High-speed railway clearance intrusion detection with improved SSD network. Appl Sci 2019; 9: 2981.

3. Pan, Wang, et al. Railway obstacle intrusion detection based on convolution neural network multitask learning. Electronics (Basel) 2022; 11: 2697.

4. Kapoor, Goel, Sharma. An intelligent railway surveillance framework based on recognition of object and railway track using deep learning. Multimed Tools Appl 2022; 81: 21083–21109.

5. Chen, Meng, Jiang. Foreign object detection in railway images based on an efficient two-stage convolutional neural network. Comput Intell Neurosci 2022; 2022. Article ID 3749635.

6. Xie, Dai, Hovy, et al. Unsupervised data augmentation for consistency training. Proc Adv Neural Inf Proces Syst 2020; 33: 6256–6268.

7. Ghiasi, Cui, Srinivas, et al. Simple copy-paste is a strong data augmentation method for instance segmentation. In: Proc. IEEE/CVF conference on computer vision and pattern recognition, 2021, pp.2918–2928. Piscataway, NJ: Virtual, IEEE.

8. Lee. Railway track intrusion classification algorithm based on generative adversarial network and deep neural networks. J Kor Soc Railw 2022; 25: 339–345.

9. Frid-Adar, Diamant, Klang, et al. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 2018; 321: 321–331.

10. Wood, Baltrušaitis, Hewitt, et al. 3D face reconstruction with dense landmarks. In: Proc. European conference on computer vision, Tel Aviv, Israel, 2022, pp.160–177. Dordrecht: Springer Science+Business Media.

11. Das, Tran, Singh, et al. Conditional synthetic data generation for robust machine learning applications with limited pandemic data. In: Proc. the AAAI conference on artificial intelligence, 2022, vol. 36, pp.11792–11800. Palo Alto, CA: Virtual, AAAI Press.

12. Kishore, Choe, Kwon, et al. Synthetic data generation using imitation training. In: Proc. IEEE/CVF international conference on computer vision, 2021, pp.3078–3086. Piscataway, NJ: Virtual, IEEE.

13. Zhang, Ren, et al. Deep residual learning for image recognition. In: Proc. IEEE/CVF conference on computer vision and pattern recognition, Las Vegas, NV, USA, 2016, pp.770–778. Piscataway, NJ: IEEE.

14. Tan. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proc. international conference on machine learning, Long Beach, CA, USA, 2019, pp.6105–6114. New York, NY: ACM digital library.

15. Howard, Zhu, Chen, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.

16. Sandler, Howard, Zhu, et al. MobileNetV2: inverted residuals and linear bottlenecks. In: Proc. IEEE/CVF conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 2018, pp.4510–4520. Piscataway, NJ: IEEE.

17. Howard, Sandler, Chu, et al. Searching for MobileNetV3. In: Proc. IEEE/CVF international conference on computer vision, Long Beach, CA, USA, 2019, pp.1314–1324. Piscataway, NJ: IEEE.

18. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16×16 words: transformers for image recognition at scale. arXiv:2010.11929, 2020.

19. Tolstikhin, Houlsby, Kolesnikov, et al. MLP-Mixer: an all-MLP architecture for vision. In: Proc. advances in neural information processing systems, 2021, vol. 34, pp.24261–24272. Cambridge, MA: Virtual, MIT Press.

20. Goodfellow, Pouget-Abadie, Mirza, et al. Generative adversarial networks. In: Proc. conference on neural information processing systems, Montreal, Canada, 2014, pp.2672–2680. New York, NY: ACM digital library.

21. Radford, Metz, Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2016.

22. Arjovsky, Chintala, Bottou. Wasserstein generative adversarial networks. In: Proc. international conference on machine learning, Sydney, Australia, 2017, pp.214–223. New York, NY: ACM digital library.

23. Karras, Laine, Aila. A style-based generator architecture for generative adversarial networks. In: Proc. IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 2019, pp.4401–4410. Piscataway, NJ: IEEE.

24. Ledig, Theis, Huszar, et al. Photo-realistic single image super-resolution using a generative adversarial network. In: Proc. IEEE/CVF conference on computer vision and pattern recognition, Honolulu, HI, USA, 2017, pp.4681–4690. Piscataway, NJ: IEEE.

25. Isola, Zhu, Zhou, et al. Image-to-image translation with conditional adversarial networks. In: Proc. IEEE/CVF conference on computer vision and pattern recognition, Honolulu, HI, USA, 2017, pp.1125–1134. Piscataway, NJ: IEEE.

26. Ronneberger, Fischer, Brox. U-Net: convolutional networks for biomedical image segmentation. In: Proc. medical image computing and computer-assisted intervention, Munich, Germany, 2015, pp.234–241. New York: Springer Cham.

27. Sohl-Dickstein, Weiss, Maheswaranathan, et al. Deep unsupervised learning using nonequilibrium thermodynamics. In: Proc. international conference on machine learning, Lille, France, 2015, pp.2256–2265. New York, NY: ACM digital library.

28. Ho, Jain, Abbeel. Denoising diffusion probabilistic models. In: Proc. advances in neural information processing systems, 2020, pp.6840–6851.

29. Saharia, Chan, Chang, et al. Palette: image-to-image diffusion models. In: Proc. ACM special interest group on computer graphics and interactive techniques conference, 2022, pp.1–10.

Data augmentation using generative models for track intrusion detection
