Intelligent road segmentation and obstacle detection for autonomous railway vehicle

Abstract

For autonomous railway vehicle with complex crisscrossed tracks, it is a huge challenge to intelligently detect the trespassers lying in the possible track regions where the train will move along. In order to solve the issue that the existing object detection algorithms detect all obstacles in traffic scene images, a novel strategy YOLOSEG is proposed for intelligent road segmentation and obstacle detection of railway trespasser. Unet is firstly trained to intelligently segment the railway track region where the train is likely to move on, and then the generated region mask is introduced into object detection network for recognizing obstacle within the mask area. The real video of the obstacle emerging in front of the train is difficult to record, therefore the traffic scenes taken from drivers’ perspectives are randomly combined with various obstacles to create the synthetic training dataset which covers various railway traffic scenarios and lighting conditions, and at the same time the label file is automatically generated. Furthermore, a random brightness strategy is proposed for dataset enhancement. By the performance evaluation comparison of FLOPs, Top-1 Accuracy, and mAP@0.5/%, abundant trespasser detection experiments based on synthetic dataset and real images verify the accuracy and effectiveness of the proposed method.

Keywords

Autonomous railway vehicle deep learning road segmentation object detection YOLOSEG

Introduction

Autonomous driving of railway vehicles is a daunting technical challenge. Compared with machine learning algorithms, the core advantage of AI-based smart precision detection techniques lies in its ability to process more complex information and larger amounts of data, so as to achieve more accurate and flexible prediction. Especially for machine vision, AI deep learning can achieve high-precision image classification, detection, segmentation, and other tasks, and has a wide range of application prospects.^1–5

Trains often collide with trespassers, such as vehicles, pedestrians, and animals, resulting in serious casualties and economic losses. For intelligent transportation systems,⁶ obstacle detection is an important computer vision task.⁷

Most object detection algorithms based on deep learning technique detect all obstacles in the images. However, as shown in Figure 1, many of the objects in railway scene do not pose a real danger to trains. Detecting all objects wastes computing resources and increases the risk of misidentification.

Figure 1.

Obstacles at different locations put different threats to traffic safety.

Some researches⁸ adopted the algorithms that set fixed recognition region, which is useful to detect railway straight lines but proves to be inefficient in dealing with curves. For instance, the trapezoidal region with fixed shape and position shown in Figure 2(a) is suitable for detecting of railway trespassers in the straight-line region as shown in Figure 2(b). However, railway tracks are not always simple straight lines. For the crisscrossed railway tracks region as shown in Figure 2(c) and (d), the method of setting fixed identification region does not work for the complex road conditions.

Figure 2.

Fixed identification region: (a) is suitable for the straight-line region (b), but does not work for the complex crisscrossed railway tracks conditions (c) and (d).

Based on machine vision and deep learning techniques, this paper researches two key challenges of autonomous railway with complex crisscrossed tracks. One task is to intelligently segment regions of the railway tracks, and the other is to detect trespassers only lying in the possible regions where the train is likely to move along the crisscrossed railway tracks.

In this paper, an abundant railway trespasser dataset is made for the model training of obstacle detection neural networks. Considering that the real video of obstacles appearing in front of the train is difficult to obtain, the traffic scene images taken from drivers’ perspectives collected on the Internet are randomly combined with various obstacles to create the training dataset, and the label file needed for intelligence network training are automatically generated while the image file is created.

In order to solve the issue that the object detection algorithm detects all obstacles in traffic scene images, a novel strategy YOLOSEG is proposed. The contribution is threefold:

For an input image, the Unet is firstly trained for track region segmentation, and then the generated mask is introduced into YOLO network as the detection region of traffic obstacles, so as to only the obstacles lying in the possible regions where the train is likely to move along will be detected.

A synthetic dataset is generated consisting of abundant railway traffic scenarios such as stations, fields, bridges, tunnels, and covering a variety of lighting conditions. Furthermore, a random brightness training strategy is proposed for data enhancement which ensures the better efficiency-vs-accuracy trade-off.

By the performance evaluation comparison of FLOPs, Top-1 Accuracy, and mAP@0.5/%, the extensive trespasser detection experiments based on synthetic dataset and real images verify the accuracy and effectiveness of the proposed method.

Related works

Object detection algorithms based on deep learning have achieved fruitful results in recent years, which are divided into two main categories: two-stage detection and one-stage detection. State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.

Girshick et al.⁹ proposed R-CNN which applied high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects. He et al.¹⁰ equipped the networks with spatial pyramid pooling faster than R-CNN while achieving comparable accuracy. Girshick¹¹ proposed Fast R-CNN which added the training regressor and the bounding box detector significantly improved the detection accuracy. Ren et al.¹² introduced a Region Proposal Network that shared full-image convolutional features with the detection network thus enabling nearly cost-free region proposals. Liu et al.¹³ proposed SSD which eliminated proposal generation and subsequent pixel or feature resampling stages and encapsulated all computation in a single network. Liu et al.¹⁴ proposed Swin Transformer as a general-purpose backbone to adapt Transformer from language to vision. Wang et al.¹⁵ proposed improved Pyramid Vision Transformer and provided improvements on object detection. Yao et al.¹⁶ constructed Wavelet Vision Transformer that formulated the invertible down-sampling with wavelet transforms and self-attention learning in a unified way.

YOLO v1¹⁷ algorithm was proposed in 2016, which divides the image into multiple grids. Each grid prediction contains boundary frame coordinate information, confidence and conditional probability of each category, thus converting the object detection into a regression problem. In 2017, YOLO v2¹⁸ algorithm was proposed, which added the prior anchor frame mechanism and global mean pooling into the YOLO v1 network. Instead of predicting the coordinates of the target, it predicted the deviation between the target and the anchor frame, which reduced the training difficulty and improved the network recall rate. YOLO v3¹⁹ was proposed in 2018, which added residual blocks and feature pyramid modules, thus improving the recognition effect of small targets. In 2020, YOLO v4²⁰ used CSPDarknet53 as the backbone network and adopted self-antagonistic training to improve the algorithm performance. YOLO v5 was also an improved version based on YOLO v3. The network architecture is composed of three parts: Backbone-CSPDarknet, Neck-PANet, and Head-Yolo Layer. Since classification and regression are trained in different convolution layers, it improves the accuracy and detection speed simultaneously. YOLO v6,²¹ an enhanced version of YOLO v5, redesigned the architecture using the Rep-Pan and the EfficientRep. YOLO v7²² was an improved version of YOLO v4 with higher detection accuracy and detection rate than competitors when it was released. In 2023, YOLO v8 was released achieving the SOTA performance, which only predicts the center of the object, not the offset of the anchor frame.

Proposed method

Dataset generation

Training dataset of YOLO consists of a large number of images and annotation boxes, which are used to label the location and size of the objects. Generating an appropriate dataset can improve the performance of target detection. In this paper, a synthetic dataset for railway trespasser detection is generated exploiting 100 publicly available railway videos from the internet to improve network generalization and scene applicability.

Railway background: When vehicles pass through different railway lines, the environment and the weather are both different. In this paper, traffic videos taken from driver’ perspective are converted into image sequences as the railway background. These videos are shot in day and night, and include possible road and weather conditions to enhance the wide applicability of the railway trespasser detection network.

Some examples of railway background are shown in Figure 3. Our synthetic dataset consists of abundant railway traffic scenarios such as stations, fields, bridges, tunnels, and covers a variety of lighting conditions.

Figure 3.

The synthetic dataset covers abundant railway traffic scenarios and various lighting conditions.

Obstacle foreground: In this paper, Unet²³ is applied as the segmentation algorithm to obtain obstacles. The architecture of Unet is shown in Figure 4. It is composed of encoder and decoder, and the information transfer between them is implemented by skip connection. An example of obstacle segmentation is shown in Figure 5.

Figure 4.

Architecture of Unet.²³

Figure 5.

Left, input image. Right, segmentation result.

Then, cyclic processing of all background images and obstacle images to create the training dataset. The code is written in Python and OpenCV. The trespasser image and the background image are pixel-wise replaced to generate the synthetic railway trespasser dataset which contains 8000 images by randomly altering the location, number, and size of trespassers for each background, as shown in Figure 6. Typical railway trespassers include animals, plants, vehicles, and trains, as shown in the red circles in Figure 7.

Figure 6.

Synthetic railway-trespasser images generated by pixel-wise replacement of trespasser images and background images.

Figure 7.

Examples of railway trespassers.

Label file creation: As mentioned above, YOLO model training needs label file in which marks the locations of the boundary boxes of the targets in the traffic image, including category information and coordinate information. Generally, it is necessary to use labeling software to annotate images containing the obstacles and then to create label file, which usually takes a lot of time.

Since the parameters of the background image and the obstacle image of our dataset can be adjusted and calculated, python code is written to generate the label files in .txt format required by YOLO model training while the synthetic image is generated. The size and position parameters of trespassers are calculated as follows

x_{center} = \frac{xmax + xmin}{2.0}

(1)

y_{center} = \frac{ymax + ymin}{2.0}

(2)

x = \frac{x_{center}}{width}

(3)

y = \frac{y_{center}}{height}

(4)

w = \frac{xmax - xmin}{width}

(5)

h = \frac{ymax - ymin}{height}

(6)

where $xmax$ is the horizontal coordinate of the upper right point of the obstacle located in the background image; $xmin$ is the horizontal coordinate of the upper left point; $ymax$ is the vertical coordinate of the upper right point; $ymin$ is the vertical coordinate of the lower right point; $x_{center}$ is the center horizontal coordinate of the obstacle; $y_{center}$ is the center vertical of the obstacle; $width$ is the width of the background image; $height$ is the height of the background image; $x$ , $y$ , $w$ , and $h$ are the output label values of the corresponding parameters.

Track region segmentation

Obstacles may be randomly set in the sky or on the sides of the railway when creating the training dataset. In order to solve the issue that the existing object detection algorithms detect all obstacles in traffic scene images, a novel strategy YOLOSEG is proposed for intelligent road segmentation and obstacle detection of railway trespasser.

Firstly, road segmentation is carried out to annotate the detection region of traffic obstacles. In this paper, the labeling software LabelMe²⁴ is applied to annotate the track region where the train will move along, as shown in Figure 8.

Figure 8.

Annotation of tack regions in LabelMe.

Figure 9 shows the annotations of two types of railways: track without turnout, and track with three-way turnouts. The railway contour information needs to be accurately identified, including the tracks currently in operation as well as the tracks that the train may pass through connected by the turnout.

Figure 9.

Railway image annotation in LabelMe. Left, track without turnout. Right, track with three-way turnouts.

After the annotation of an image is completed, LabelMe outputs a JSON file, based on which the black and white binary image is generated as the mask image. Figure 10 shows some examples of the original images and corresponding segmented road mask.

Figure 10.

Left, the input images. Right, segmented road masks.

Trespasser detection

In order to detect the trespassers only lying in the possible track regions where the train will move along, the proposed strategy is to cover the region beyond the road with the generated mask image. The generated track region is taken as the identification region of the improved YOLOSEG neural network for obstacle detection. The architecture of YOLOSEG is shown in Figure 11.

Figure 11.

The architecture of the improved YOLOSEG.

For an input image, the trained Unet is first applied for road segmentation to obtain the mask image, where the road is white, and other regions are black in which the RGB space values are zeros. Introduce mask image into YOLO; the white road region is selected as the region to be detected, and the RGB space values of other regions are all zero. Therefore, our neural network only detects the road region and does not detect targets outside the road mask region.

Random brightness strategy

A random brightness strategy is introduced during the training phase instead of image brightness adjustment during the test phase, which ensures a better trade-off between recognition efficiency and detection accuracy. Some examples of random brightness background images are shown in Figure 12.

Figure 12.

Top: daytime background images. Bottom: random brightness images simulating nighttime background.

In order to extend the applicability performance of the proposed YOLOSEG model, the brightness of trespasser is also adjusted by gamma transform, as shown in Figure 13.

Figure 13.

From left to right: original trespasser image, the brightness transformation images with gamma transformed coefficients are 0.5, 1.5, and 2.5 respectively.

Figure 14 shows an example of a railway background is pixel-wise replaced by a trespasser to generate the random brightness output image of synthetic railway trespasser.

Figure 14.

From left to right: railway background, generated image by pixel-wise replacement of a trespasser and the background image, and the random brightness output image.

Model training and evaluation

The improved object detection model YOLOSEG is trained on our railway-trespasser dataset, where 6400 images from the rail-trespasser dataset are applied for training as well as the remaining 1600 images for verification and testing. The initial learning rate is 0.001, the attenuation coefficient is 0.0005, each batch contains 32 images, and the training epoch is 500.

TP is defined for the correct type to be recognized as correct, FP is for error type to be recognized as correct, FN is for error type to be recognized as wrong, and TN is for error type to be recognized as correct.

Recall is the proportion of all targets labeled true that can be predicted correctly, defined in (7).

recall = \frac{TP}{TP + FN}

(7)

Precision is the proportion of predicting correctly among all predictions, defined in (8).

precision = \frac{TP}{TP + FP}

(8)

Precision and recall are two contradictory concepts. Generally, when the precision is high, the recall will be low; when the recall rate is high, the precision will be low. A target detection network with excellent performance can keep the precision at a high level while the recall increases.

The precision-recall (P-R) curve can be drawn with the recall as the abscissa and the precision as the ordinate. For each category to be predicted, there is a P-R curve correspondingly. Comparing the P-R curve, the performance of the trained model can be evaluated. If the P-R curve of one model exceeds that of another model, it indicates that the performance of this model is better than another.

The region of P-R curve is positively correlated with the performance of the trained model. The region of P-R curve is the Average-Precision (AP). The larger the value of AP, the better the network performance.

The mAP means calculating the average value of the region of the P-R curve for each category respectively, calculated in (9).

mAP = \sum \frac{AP}{num_class}

(9)

The training parameters of our obstacle detection network are shown in Figure 15.

Figure 15.

Training process parameters.

Evaluation samples using our trained network, and some test results are shown in Figure 16.

Figure 16.

Examples of test results by improved YOLOSEG.

Figure 17 illustrates in contrast that the unimproved YOLO detects all objects in the image, while the improved YOLOSEG detects the trespassers only lying in the possible track regions where the train will move along.

Figure 17.

Left, test result by unimproved YOLO. Right, test result by improved YOLOSEG.

The trained YOLOSEG model is applied to test the real-world railway trespasser images, and the results were shown in Figure 18. The mAP@0.5 of the proposed method reaches 89.3%.

Figure 18.

Test result of the real-world railway trespasser images by YOLOSEG.

Based on the generated railway trespasser dataset, the widely used object detection algorithms are compared with YOLOSEG. As shown in Table 1, the proposed method achieves a better average precision.

Table 1.

Comparison of average precision on our railway trespasser dataset.

Model	mAP@0.5/%
Faster RCNN	83.9
SSD	83.5
YOLOv3	86.1
mobilenetv2-YOLOV4	87.6
YOLOv5s	88.5
YOLOSEG	89.3

Table 2 summarizes the performance comparisons between the state-of-the-art object detect approaches with the proposed method on COCO Val 2017. All models are evaluated on images of resolution 224 × 224. Our FLOPs is 9.2 and Top-1 Accuracy is 84.3, which indicates that the proposed method achieves a better efficiency-vs-accuracy trade-off.

Table 2.

Comparison of FLOPs and Top-1 Accuracy on COCO Val 2017.

Model	FLOPs (G)	Top-1 Accuracy (%)
ResNet-18²⁵	1.9	69.7
RegNetY-1.6G²⁶	1.7	78.0
PVTv2-b1¹⁵	2.2	78.6
Swin-T¹⁴	4.6	81.2
RegionViT-S²⁷	5.4	82.5
Swin-B¹⁴	15.5	83.4
MaxViT-T²⁸	5.7	83.5
Wave-ViT-S¹⁶	4.8	83.8
CrossFormer-L²⁹	16.2	84.0
CSWin-B³⁰	15.2	84.1
YOLOSEG	9.2	84.3

Conclusions

For autonomous railway vehicle with complex crisscrossed tracks, it is a huge challenge to intelligently detect the trespassers. Most object detection algorithms based on deep learning technique detect all obstacles in the images, which wastes computing resources and increases the risk of misidentification.

This paper proposed a novel strategy YOLOSEG to solve the issue that the existing object detection algorithms detect all obstacles in traffic scene images. The Unet is firstly trained to segment the railway track region, and then the output mask is introduced into the YOLO network to detect the obstacles within the track mask area. Based on the mask mechanism, only the obstacles lying in the possible regions where the train is likely to move along will be detected. The proposed strategy significantly improves the identification efficiency and reduces the misjudgment rate.

A synthetic dataset is generated covering abundant railway traffic scenarios and various lighting conditions. Besides, a random brightness strategy is introduced during the training phase instead of image brightness adjustment during the test phase, which ensured a better efficiency-vs-accuracy trade-off.

By the performance evaluation comparison of FLOPs, Top-1 Accuracy, and mAP@0.5/%, the abundant trespasser detection experiments based on synthetic dataset and real images verify qualitatively and quantitatively the accuracy and effectiveness of the proposed method in AI-based precision obstacle detection tasks.

In the case of poor image quality such as extremely snow, fog, low illumination, and contrast imbalance, traffic obstacle detection is a very challenging topic. Adaptive improvement of object recognition performance under harsh conditions will be our future research topic.

Footnotes

Handling Editor: Chenhui Liang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China under Grants 52275546 and 51775449.

ORCID iDs

Dongtai Li

Jie Zhang

References

Bhatti

Masud

Bazai

, et al. Investigating AI-based smart precision agriculture techniques. Front Plant Sci 2023; 14: 1237783.

Majeed

Khan

Waseem

, et al. Renewable energy as an alternative source for energy management in agriculture. Energy Rep 2023; 10: 344–359.

Guan

Aamir

Rahman

, et al. A dual-tree complex wavelet transform-based model for low-illumination image enhancement. Wuhan Univ J Nat Sci 2021; 26: 405–414.

Fang

Liu

, et al. An image-based visual serving control method for UAVs based on fuzzy logic. Adv Mech Eng 2023; 15: 16878132231167238.

Zhang

Liang

Tang

, et al. Alternative deep learning method for fast spatial-frequency shift imaging microscopy. Opt Express 2023; 31: 3719–3730.

Zhai

Vehicle-track coupled dynamics: theory and applications. Singapore: Springer, 2020.

Zhang

Zhai

Blind attention geometric restraint neural network for single image dynamic/defocus deblurring. IEEE Trans Neural Netw Learn Syst 2023; 34: 8404–8417.

Kano

Andrade

Moutinho

Automatic detection of obstacles in railway tracks using monocular camera. In: Tzovaras

Giakoumis

Vincze

, et al. (eds) 12th international conference on computer vision systems. Vol. 11754. Cham: Springer, 2019, pp.284–294.

Girshick

Donahue

Darrell

, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: 27th IEEE conference on computer vision and pattern recognition, Columbus, OH, USA, 23–28 June 2014.

10.

Zhang

Ren

, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 2015; 37: 1904–1916.

11.

Girshick

Fast R-CNN. In: IEEE international conference on computer vision, Santiago, Chile, 7–13 December 2015, pp.1440–1448. New York: IEEE.

12.

Ren

Girshick

, et al. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 2015; 28: 1–9.

13.

Liu

Anguelov

Erhan

, et al. SSD: single shot multibox detector. In: Leibe

Matas

Sebe

, et al. (eds) European conference on computer vision. Cham: Springer, 2016, pp.21–37.

14.

Liu

Lin

Cao

, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, Montreal, QC, Canada, 10–17 October 2021.

15.

Wang

Xie

, et al. PVT v2: improved baselines with pyramid vision transformer. Comput Vis Media 2022; 8: 415–424.

16.

Yao

Pan

, et al. Wave-vit: unifying wavelet and transformers for visual representation learning. In: Avidan

Brostow

Cissé

, et al. (eds) European conference on computer vision. Cham: Springer, 2022, pp.328–345.

17.

Redmon

Divvala

Girshick

, et al. You only look once: unified, real-time object detection. arXiv Preprint arXiv:1506.02640, 2016.

18.

Redmon

Farhadi

YOLO9000: better, faster, stronger. arXiv Preprint arXiv:1612.08242, 2017.

19.

Redmon

Farhadi

YOLOv3: an incremental improvement. arXiv Preprint arXiv:1804.02767, 2018.

20.

Bochkovskiy

Wang

Liao

YOLOv4: optimal speed and accuracy of object detection. arXiv Preprint arXiv:2004.10934, 2020.

21.

Jiang

, et al. YOLOv6: a single-stage object detection framework for industrial applications. arXiv Preprint arXiv:2209.02976, 2022.

22.

Wang

Bochkovskiy

Liao

YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv Preprint arXiv:2207.02696, 2022.

23.

Ronneberger

Fischer

Brox

U-Net: convolutional networks for biomedical image segmentation. In: Navab

Hornegger

Wells

, et al. (eds) Medical image computing and computer-assisted intervention. Cham: Springer International Publishing, 2015, pp.234–241.

24.

Russell

Torralba

Murphy

, et al. LabelMe: a database and web-based tool for image annotation. Int J Comput Vis 2008; 77: 1–3.

25.

Zhang

Ren

, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016, pp.770–778. New York: IEEE.

26.

Radosavovic

Kosaraju

Girshick

, et al. Designing network design spaces. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, WA, USA, 14–19 June 2020, pp.10428–10436. New York: IEEE.

27.

Chen

Panda

Fan

Regionvit: regional-to-local attention for vision transformers. In: The tenth international conference on learning representations, 2022.

28.

Talebi

Zhang

, et al. Maxvit: multi-axis vision transformer. In: Avidan

Brostow

Cissé

, et al. (eds) European conference on computer vision. Cham: Springer, 2022, pp.459–479.

29.

Wang

Yao

Chen

, et al. Crossformer: a versatile vision transformer hinging on cross-scale attention. In: International conference on learning representations, 2022.

30.

Dong

Bao

Chen

, et al. CSWin transformer: a general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, New Orleans, LA, USA, 18–24 June 2022.