Abstract
For autonomous railway vehicle with complex crisscrossed tracks, it is a huge challenge to intelligently detect the trespassers lying in the possible track regions where the train will move along. In order to solve the issue that the existing object detection algorithms detect all obstacles in traffic scene images, a novel strategy YOLOSEG is proposed for intelligent road segmentation and obstacle detection of railway trespasser. Unet is firstly trained to intelligently segment the railway track region where the train is likely to move on, and then the generated region mask is introduced into object detection network for recognizing obstacle within the mask area. The real video of the obstacle emerging in front of the train is difficult to record, therefore the traffic scenes taken from drivers’ perspectives are randomly combined with various obstacles to create the synthetic training dataset which covers various railway traffic scenarios and lighting conditions, and at the same time the label file is automatically generated. Furthermore, a random brightness strategy is proposed for dataset enhancement. By the performance evaluation comparison of FLOPs, Top-1 Accuracy, and mAP@0.5/%, abundant trespasser detection experiments based on synthetic dataset and real images verify the accuracy and effectiveness of the proposed method.
Introduction
Autonomous driving of railway vehicles is a daunting technical challenge. Compared with machine learning algorithms, the core advantage of AI-based smart precision detection techniques lies in its ability to process more complex information and larger amounts of data, so as to achieve more accurate and flexible prediction. Especially for machine vision, AI deep learning can achieve high-precision image classification, detection, segmentation, and other tasks, and has a wide range of application prospects.1–5
Trains often collide with trespassers, such as vehicles, pedestrians, and animals, resulting in serious casualties and economic losses. For intelligent transportation systems, 6 obstacle detection is an important computer vision task. 7
Most object detection algorithms based on deep learning technique detect all obstacles in the images. However, as shown in Figure 1, many of the objects in railway scene do not pose a real danger to trains. Detecting all objects wastes computing resources and increases the risk of misidentification.

Obstacles at different locations put different threats to traffic safety.
Some researches 8 adopted the algorithms that set fixed recognition region, which is useful to detect railway straight lines but proves to be inefficient in dealing with curves. For instance, the trapezoidal region with fixed shape and position shown in Figure 2(a) is suitable for detecting of railway trespassers in the straight-line region as shown in Figure 2(b). However, railway tracks are not always simple straight lines. For the crisscrossed railway tracks region as shown in Figure 2(c) and (d), the method of setting fixed identification region does not work for the complex road conditions.

Fixed identification region: (a) is suitable for the straight-line region (b), but does not work for the complex crisscrossed railway tracks conditions (c) and (d).
Based on machine vision and deep learning techniques, this paper researches two key challenges of autonomous railway with complex crisscrossed tracks. One task is to intelligently segment regions of the railway tracks, and the other is to detect trespassers only lying in the possible regions where the train is likely to move along the crisscrossed railway tracks.
In this paper, an abundant railway trespasser dataset is made for the model training of obstacle detection neural networks. Considering that the real video of obstacles appearing in front of the train is difficult to obtain, the traffic scene images taken from drivers’ perspectives collected on the Internet are randomly combined with various obstacles to create the training dataset, and the label file needed for intelligence network training are automatically generated while the image file is created.
In order to solve the issue that the object detection algorithm detects all obstacles in traffic scene images, a novel strategy YOLOSEG is proposed. The contribution is threefold:
For an input image, the Unet is firstly trained for track region segmentation, and then the generated mask is introduced into YOLO network as the detection region of traffic obstacles, so as to only the obstacles lying in the possible regions where the train is likely to move along will be detected.
A synthetic dataset is generated consisting of abundant railway traffic scenarios such as stations, fields, bridges, tunnels, and covering a variety of lighting conditions. Furthermore, a random brightness training strategy is proposed for data enhancement which ensures the better efficiency-vs-accuracy trade-off.
By the performance evaluation comparison of FLOPs, Top-1 Accuracy, and mAP@0.5/%, the extensive trespasser detection experiments based on synthetic dataset and real images verify the accuracy and effectiveness of the proposed method.
Related works
Object detection algorithms based on deep learning have achieved fruitful results in recent years, which are divided into two main categories: two-stage detection and one-stage detection. State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.
Girshick et al. 9 proposed R-CNN which applied high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects. He et al. 10 equipped the networks with spatial pyramid pooling faster than R-CNN while achieving comparable accuracy. Girshick 11 proposed Fast R-CNN which added the training regressor and the bounding box detector significantly improved the detection accuracy. Ren et al. 12 introduced a Region Proposal Network that shared full-image convolutional features with the detection network thus enabling nearly cost-free region proposals. Liu et al. 13 proposed SSD which eliminated proposal generation and subsequent pixel or feature resampling stages and encapsulated all computation in a single network. Liu et al. 14 proposed Swin Transformer as a general-purpose backbone to adapt Transformer from language to vision. Wang et al. 15 proposed improved Pyramid Vision Transformer and provided improvements on object detection. Yao et al. 16 constructed Wavelet Vision Transformer that formulated the invertible down-sampling with wavelet transforms and self-attention learning in a unified way.
YOLO v1 17 algorithm was proposed in 2016, which divides the image into multiple grids. Each grid prediction contains boundary frame coordinate information, confidence and conditional probability of each category, thus converting the object detection into a regression problem. In 2017, YOLO v2 18 algorithm was proposed, which added the prior anchor frame mechanism and global mean pooling into the YOLO v1 network. Instead of predicting the coordinates of the target, it predicted the deviation between the target and the anchor frame, which reduced the training difficulty and improved the network recall rate. YOLO v3 19 was proposed in 2018, which added residual blocks and feature pyramid modules, thus improving the recognition effect of small targets. In 2020, YOLO v4 20 used CSPDarknet53 as the backbone network and adopted self-antagonistic training to improve the algorithm performance. YOLO v5 was also an improved version based on YOLO v3. The network architecture is composed of three parts: Backbone-CSPDarknet, Neck-PANet, and Head-Yolo Layer. Since classification and regression are trained in different convolution layers, it improves the accuracy and detection speed simultaneously. YOLO v6, 21 an enhanced version of YOLO v5, redesigned the architecture using the Rep-Pan and the EfficientRep. YOLO v7 22 was an improved version of YOLO v4 with higher detection accuracy and detection rate than competitors when it was released. In 2023, YOLO v8 was released achieving the SOTA performance, which only predicts the center of the object, not the offset of the anchor frame.
Proposed method
Dataset generation
Training dataset of YOLO consists of a large number of images and annotation boxes, which are used to label the location and size of the objects. Generating an appropriate dataset can improve the performance of target detection. In this paper, a synthetic dataset for railway trespasser detection is generated exploiting 100 publicly available railway videos from the internet to improve network generalization and scene applicability.
Railway background: When vehicles pass through different railway lines, the environment and the weather are both different. In this paper, traffic videos taken from driver’ perspective are converted into image sequences as the railway background. These videos are shot in day and night, and include possible road and weather conditions to enhance the wide applicability of the railway trespasser detection network.
Some examples of railway background are shown in Figure 3. Our synthetic dataset consists of abundant railway traffic scenarios such as stations, fields, bridges, tunnels, and covers a variety of lighting conditions.

The synthetic dataset covers abundant railway traffic scenarios and various lighting conditions.
Obstacle foreground: In this paper, Unet 23 is applied as the segmentation algorithm to obtain obstacles. The architecture of Unet is shown in Figure 4. It is composed of encoder and decoder, and the information transfer between them is implemented by skip connection. An example of obstacle segmentation is shown in Figure 5.

Architecture of Unet. 23

Left, input image. Right, segmentation result.
Then, cyclic processing of all background images and obstacle images to create the training dataset. The code is written in Python and OpenCV. The trespasser image and the background image are pixel-wise replaced to generate the synthetic railway trespasser dataset which contains 8000 images by randomly altering the location, number, and size of trespassers for each background, as shown in Figure 6. Typical railway trespassers include animals, plants, vehicles, and trains, as shown in the red circles in Figure 7.

Synthetic railway-trespasser images generated by pixel-wise replacement of trespasser images and background images.

Examples of railway trespassers.
Label file creation: As mentioned above, YOLO model training needs label file in which marks the locations of the boundary boxes of the targets in the traffic image, including category information and coordinate information. Generally, it is necessary to use labeling software to annotate images containing the obstacles and then to create label file, which usually takes a lot of time.
Since the parameters of the background image and the obstacle image of our dataset can be adjusted and calculated, python code is written to generate the label files in .txt format required by YOLO model training while the synthetic image is generated. The size and position parameters of trespassers are calculated as follows
where
Track region segmentation
Obstacles may be randomly set in the sky or on the sides of the railway when creating the training dataset. In order to solve the issue that the existing object detection algorithms detect all obstacles in traffic scene images, a novel strategy YOLOSEG is proposed for intelligent road segmentation and obstacle detection of railway trespasser.
Firstly, road segmentation is carried out to annotate the detection region of traffic obstacles. In this paper, the labeling software LabelMe 24 is applied to annotate the track region where the train will move along, as shown in Figure 8.

Annotation of tack regions in LabelMe.
Figure 9 shows the annotations of two types of railways: track without turnout, and track with three-way turnouts. The railway contour information needs to be accurately identified, including the tracks currently in operation as well as the tracks that the train may pass through connected by the turnout.

Railway image annotation in LabelMe. Left, track without turnout. Right, track with three-way turnouts.
After the annotation of an image is completed, LabelMe outputs a JSON file, based on which the black and white binary image is generated as the mask image. Figure 10 shows some examples of the original images and corresponding segmented road mask.

Left, the input images. Right, segmented road masks.
Trespasser detection
In order to detect the trespassers only lying in the possible track regions where the train will move along, the proposed strategy is to cover the region beyond the road with the generated mask image. The generated track region is taken as the identification region of the improved YOLOSEG neural network for obstacle detection. The architecture of YOLOSEG is shown in Figure 11.

The architecture of the improved YOLOSEG.
For an input image, the trained Unet is first applied for road segmentation to obtain the mask image, where the road is white, and other regions are black in which the RGB space values are zeros. Introduce mask image into YOLO; the white road region is selected as the region to be detected, and the RGB space values of other regions are all zero. Therefore, our neural network only detects the road region and does not detect targets outside the road mask region.
Random brightness strategy
A random brightness strategy is introduced during the training phase instead of image brightness adjustment during the test phase, which ensures a better trade-off between recognition efficiency and detection accuracy. Some examples of random brightness background images are shown in Figure 12.

Top: daytime background images. Bottom: random brightness images simulating nighttime background.
In order to extend the applicability performance of the proposed YOLOSEG model, the brightness of trespasser is also adjusted by gamma transform, as shown in Figure 13.

From left to right: original trespasser image, the brightness transformation images with gamma transformed coefficients are 0.5, 1.5, and 2.5 respectively.
Figure 14 shows an example of a railway background is pixel-wise replaced by a trespasser to generate the random brightness output image of synthetic railway trespasser.

From left to right: railway background, generated image by pixel-wise replacement of a trespasser and the background image, and the random brightness output image.
Model training and evaluation
The improved object detection model YOLOSEG is trained on our railway-trespasser dataset, where 6400 images from the rail-trespasser dataset are applied for training as well as the remaining 1600 images for verification and testing. The initial learning rate is 0.001, the attenuation coefficient is 0.0005, each batch contains 32 images, and the training epoch is 500.
TP is defined for the correct type to be recognized as correct, FP is for error type to be recognized as correct, FN is for error type to be recognized as wrong, and TN is for error type to be recognized as correct.
Recall is the proportion of all targets labeled true that can be predicted correctly, defined in (7).
Precision is the proportion of predicting correctly among all predictions, defined in (8).
Precision and recall are two contradictory concepts. Generally, when the precision is high, the recall will be low; when the recall rate is high, the precision will be low. A target detection network with excellent performance can keep the precision at a high level while the recall increases.
The precision-recall (P-R) curve can be drawn with the recall as the abscissa and the precision as the ordinate. For each category to be predicted, there is a P-R curve correspondingly. Comparing the P-R curve, the performance of the trained model can be evaluated. If the P-R curve of one model exceeds that of another model, it indicates that the performance of this model is better than another.
The region of P-R curve is positively correlated with the performance of the trained model. The region of P-R curve is the Average-Precision (AP). The larger the value of AP, the better the network performance.
The mAP means calculating the average value of the region of the P-R curve for each category respectively, calculated in (9).
The training parameters of our obstacle detection network are shown in Figure 15.

Training process parameters.
Evaluation samples using our trained network, and some test results are shown in Figure 16.

Examples of test results by improved YOLOSEG.
Figure 17 illustrates in contrast that the unimproved YOLO detects all objects in the image, while the improved YOLOSEG detects the trespassers only lying in the possible track regions where the train will move along.

Left, test result by unimproved YOLO. Right, test result by improved YOLOSEG.
The trained YOLOSEG model is applied to test the real-world railway trespasser images, and the results were shown in Figure 18. The mAP@0.5 of the proposed method reaches 89.3%.

Test result of the real-world railway trespasser images by YOLOSEG.
Based on the generated railway trespasser dataset, the widely used object detection algorithms are compared with YOLOSEG. As shown in Table 1, the proposed method achieves a better average precision.
Comparison of average precision on our railway trespasser dataset.
Table 2 summarizes the performance comparisons between the state-of-the-art object detect approaches with the proposed method on COCO Val 2017. All models are evaluated on images of resolution 224 × 224. Our FLOPs is 9.2 and Top-1 Accuracy is 84.3, which indicates that the proposed method achieves a better efficiency-vs-accuracy trade-off.
Comparison of FLOPs and Top-1 Accuracy on COCO Val 2017.
Conclusions
For autonomous railway vehicle with complex crisscrossed tracks, it is a huge challenge to intelligently detect the trespassers. Most object detection algorithms based on deep learning technique detect all obstacles in the images, which wastes computing resources and increases the risk of misidentification.
This paper proposed a novel strategy YOLOSEG to solve the issue that the existing object detection algorithms detect all obstacles in traffic scene images. The Unet is firstly trained to segment the railway track region, and then the output mask is introduced into the YOLO network to detect the obstacles within the track mask area. Based on the mask mechanism, only the obstacles lying in the possible regions where the train is likely to move along will be detected. The proposed strategy significantly improves the identification efficiency and reduces the misjudgment rate.
A synthetic dataset is generated covering abundant railway traffic scenarios and various lighting conditions. Besides, a random brightness strategy is introduced during the training phase instead of image brightness adjustment during the test phase, which ensured a better efficiency-vs-accuracy trade-off.
By the performance evaluation comparison of FLOPs, Top-1 Accuracy, and mAP@0.5/%, the abundant trespasser detection experiments based on synthetic dataset and real images verify qualitatively and quantitatively the accuracy and effectiveness of the proposed method in AI-based precision obstacle detection tasks.
In the case of poor image quality such as extremely snow, fog, low illumination, and contrast imbalance, traffic obstacle detection is a very challenging topic. Adaptive improvement of object recognition performance under harsh conditions will be our future research topic.
Footnotes
Handling Editor: Chenhui Liang
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China under Grants 52275546 and 51775449.
