Reliability prediction model of further bus service based on random forest

Abstract

With the rapid development of smart cities, intelligent transportation is attracting growing attention, and bus arrival time prediction has become a research hotspot in recent years. Accurate, real-time prediction of bus state can not only help travelers choose a better travel mode but also give traffic departments a scientific basis for management and reasonable scheduling. Because most studies focus on evaluating current reliability and few address reliability prediction, this paper first uses a reliability evaluation method to obtain the reliability of a bus line. On this basis, it proposes a method that predicts the reliability of future bus service using random forest. Finally, the proposed reliability prediction model is tested with data from bus line 23 in Dalian city, China. The results show that a random forest with reasonable parameters can predict the reliability of bus service accurately. Furthermore, the random forest method performs better than an artificial neural network and a support vector machine. Predicting the reliability of bus service is therefore feasible.

Keywords

Reliability prediction, bus service, random forest

Introduction

With the rapid development of the social economy and accelerating urbanization, vehicle ownership and urban road traffic in China have increased sharply. At the same time, roads and other infrastructure have been built at a rapid pace. However, infrastructure growth has not kept up with traffic growth. In addition, problems in road traffic management widen the gap between the supply of and demand for road capacity. The result is widespread urban traffic congestion, frequent accidents, and environmental pollution. To promote the sustainable development of urban transport, cities aim to develop public transport with large capacity and high load rates.

However, in the large- and medium-sized cities of China, the development of public transport is not satisfactory. Common problems include poor bus punctuality, uneven headways, and the frequent occurrence of bus bunching (“train car”) and large gaps between buses (“large space”). With the accelerated pace of life, travelers want to reach their destination quickly and on time; that is, they expect to arrive within the expected time. Studying the reliability of public transportation therefore helps to schedule buses reasonably and improve transit service.

In the transportation field, reliability theory is applied to urban road traffic networks to evaluate whether a network operates at a specified service level. Over the past three decades, reliability theory has developed rapidly in transportation network research. The main topics include connectivity reliability,¹,² travel time reliability,³⁻⁶ on-time running reliability,⁷ waiting time reliability,⁸ and transfer reliability.⁹ Bates¹⁰ surveyed 146 bus companies and compared punctuality based on each company's own definition. Strathman and Hopper¹¹ quantitatively analyzed the punctuality of the Tri-Met bus company in Portland. Strathman et al.¹² constructed evaluation indexes of bus reliability, such as departure frequency, travel time, coefficient of variation, and average waiting time, and analyzed the relationships among these indicators. Yin et al.¹³,¹⁴ evaluated reliability indexes using Monte Carlo stochastic simulation from the perspectives of the station timetable and the public transportation network. Camus et al.¹⁵ proposed amending the six level-of-service grades to account for the degree of early arrival and delay of public transport.

To guarantee the reliability of bus arrival time, more attention should be paid to bus arrival time and traveler demand estimation. Sun et al.¹⁶ optimized the timetable of a single bus line based on a hybrid vehicle size model. Sun et al.¹⁷ proposed a bus route evaluation model based on a geographic information system (GIS) and super-efficient data envelopment analysis. A multistate-based travel time schedule model has also been presented for fixed transit routes.¹⁸ For traveler demand estimation, Xue et al.¹⁹ put forward a short-term bus passenger demand prediction method based on a time series model and an interactive multiple model approach. These papers provide an important foundation for predicting the reliability of bus arrival time.

As the above shows, existing research mainly focuses on assessing current transport reliability, and few studies address transit reliability prediction. In this paper, we therefore attempt to predict the future reliability of bus service. Reliability prediction for urban transport involves two key aspects: traffic volume prediction and bus arrival time prediction. The prediction methods fall roughly into two classes: traditional mathematical methods, including the historical average model, time series models, and the Kalman filter model; and data-driven approaches, including neural networks, non-parametric regression, and k-nearest neighbor (k-NN). Kim and Hobeika²⁰ applied the autoregressive integrated moving average (ARIMA) model to freeway volume forecasting. D'Angelo et al.²¹ used a nonlinear time series model to predict travel time on freeway corridors. They compared two scenarios: the first uses only speed as the model variable; the second uses speed, lane occupancy, and traffic flow. The results show that the univariate model is superior to the multivariate model. Chien et al.²² proposed two artificial neural network (ANN)-based models, a stop-based ANN model and a link-based ANN model, to predict bus arrival time. Wall and Dailey²³ used a Kalman filter model to track vehicle position and combined automatic vehicle location (AVL) data with historical data to predict the arrival time of transit vehicles in Seattle, Washington. However, they did not use waiting time as an independent variable in the model. Yu et al.²⁴ applied several methods, including support vector machine (SVM), ANN, k-NN, and linear regression (LR), to bus arrival time prediction and found that the SVM model performed best among the four models.

Random forest (RF) is a machine learning algorithm proposed by Leo Breiman in 2001 that combines bagging ensemble learning with the random subspace method. Like SVM, RF is a learning-based prediction algorithm. Coussement and Van den Poel²⁵ compared the predictive capabilities of SVM, the logistic model, and RF; the results showed that SVM outperformed the logistic model only when its parameters were optimally tuned, whereas RF always outperformed SVM. In addition, RF methods have been used successfully in genome-wide association studies (GWAS).

Building on the above, this paper explores the use of RF to predict the future reliability of bus arrival time, an approach that is uncommon in the literature. The remainder of the paper is organized as follows. The “Model foundation” section describes the problem and develops a model to predict transit reliability using RF. The “Random forest” section presents the principle of RF and the steps of the algorithm. The “Case study” section presents a case study of bus line 23 in Dalian city, China. Finally, the “Conclusions” section provides the conclusions.

Model foundation

Bus arrival reliability gives travelers more accurate travel information, so that they can plan travel time reasonably and avoid unexpected delays. Bus companies, for their part, need to improve the management of bus operations so that available resources are used to the fullest. In addition, reliability is a decision factor in transit network planning and optimization, helping public transport deliver better service and increase its mode share in the future.

Analysis on factors of bus service reliability

Many factors affect bus service reliability, because a bus running in urban traffic is subject to much external interference, such as road traffic conditions, waiting at intersections, and uncertain conditions (adverse weather or emergencies). This makes bus service reliability complex.

As shown in Figure 1, three influencing factors can be obtained from online real-time information: road traffic condition (Φ), intersection number (Π), and uncertain condition (Ψ). These factors are the input variables, and the evaluation index of service quality is the output variable. The next section describes how bus service reliability is evaluated.

Figure 1.

The analysis of bus service reliability.

Bus service reliability evaluation

Bus service reliability evaluation should consider three levels: the operating level, the station level, and the bus level. At the operating level, the quantity evaluated is the actual interval between buses in period j. In theory, the interval between buses should equal the departure interval. In practice, however, weather, traffic accidents, and other random factors cause the interval to fluctuate, and greater volatility in the bus interval means lower reliability. Similarly, the station level concerns the actual arrival time of each bus: the smaller the deviation of each bus's arrival time from the timetable at a station, the higher the reliability. The bus level evaluates the degree of crowding on the bus: if the actual degree of crowding is lower than the theoretical value, the bus has high reliability. Bus service reliability can thus be reflected by evaluating these three levels.

Bus interval reliability: differences in running time and in dwell time at stations make bus service irregular and unreliable. In practice, a delay can extend the bus turnaround time, which lowers the service frequency at some stations and lengthens the service interval. Therefore, we define a variable $D_{j}$ to reflect the degree of fluctuation of the bus interval in actual operation during a time period j

D_{j} = 1 - \frac{1}{N - 1} \sum_{i = 1}^{N - 1} \frac{| p_{i + 1} - p_{i} |}{\bar{h}_{i, i + 1}}

(1)

where N is the total number of buses during a certain time period, which is called the rolling horizon in this paper. That is, only the information of the N buses is used to assess the bus interval volatility at the stop, while information beyond the rolling horizon is not considered. $p_{i}$ is the position of bus i, and $\bar{h}_{i, i + 1}$ is the mean bus interval between bus i and bus i + 1.
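As a rough illustration of equation (1), the following Python sketch computes $D_{j}$ for one rolling horizon. The function name `interval_reliability` and the sample numbers are hypothetical, and the sketch assumes that the positions $p_{i}$ and the mean intervals $\bar{h}_{i, i + 1}$ are expressed in commensurate units, which the text does not specify.

```python
def interval_reliability(positions, mean_intervals):
    """Equation (1): D_j = 1 - (1 / (N - 1)) * sum(|p_{i+1} - p_i| / h_bar_{i,i+1}).

    positions      -- p_1..p_N for the N buses in the rolling horizon
    mean_intervals -- h_bar_{i,i+1} for each of the N - 1 consecutive bus pairs
    """
    n = len(positions)
    if n < 2:
        raise ValueError("need at least two buses in the rolling horizon")
    if len(mean_intervals) != n - 1:
        raise ValueError("expected one mean interval per consecutive bus pair")
    total = sum(
        abs(positions[i + 1] - positions[i]) / mean_intervals[i]
        for i in range(n - 1)
    )
    return 1.0 - total / (n - 1)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
positions = [0.0, 1.0, 3.0]
mean_intervals = [4.0, 4.0]
print(interval_reliability(positions, mean_intervals))  # 1 - (0.25 + 0.5) / 2 = 0.625
```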

Punctuality rate reliability: punctuality rate reliability is defined as the ability of a transit vehicle to arrive at or depart from a station as scheduled in a dynamic transportation network during a period. In this paper, it is measured by the arrival time deviation. We define a variable $T_{j}$ to reflect the degree of fluctuation of the punctuality rate in actual operation during a time period j

T_{j} = 1 - \frac{1}{MN - N} \sum_{k = 2}^{M} \sum_{i = 1}^{N} \frac{| A'_{i, k} - A_{i, k} |}{t_{k - 1, k}}

(2)

where M is the number of bus stations, $A'_{i, k}$ and $A_{i, k}$ denote the scheduled time and the actual time at which bus i arrives at station k, respectively, and $t_{k - 1, k}$ denotes the theoretical running time between station k - 1 and station k.
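Equation (2) can be sketched in Python as follows. The function name `punctuality_reliability`, the 0-based list layout, and the sample timetable (in minutes) are assumptions made for illustration; `running_times[k - 1]` stands for the theoretical running time on the link that ends at station k.

```python
def punctuality_reliability(scheduled, actual, running_times):
    """Equation (2): T_j = 1 - sum_k sum_i |A'_{i,k} - A_{i,k}| / t_{k-1,k} / (MN - N).

    scheduled[i][k], actual[i][k] -- arrival times of bus i at station k (0-based)
    running_times[k - 1]          -- theoretical running time on the link ending at station k
    """
    n = len(scheduled)      # number of buses N
    m = len(scheduled[0])   # number of stations M
    if len(running_times) != m - 1:
        raise ValueError("expected one running time per link between stations")
    total = 0.0
    for k in range(1, m):   # stations 2..M in the paper's 1-based numbering
        for i in range(n):
            total += abs(scheduled[i][k] - actual[i][k]) / running_times[k - 1]
    return 1.0 - total / (m * n - n)


# Hypothetical timetable for N = 2 buses and M = 3 stations, in minutes.
scheduled = [[0, 10, 20], [5, 15, 25]]
actual = [[0, 11, 22], [5, 15, 26]]
running_times = [10, 10]
print(round(punctuality_reliability(scheduled, actual, running_times), 6))
```

With these numbers the normalized deviations are 0.1, 0, 0.2, and 0.1, so the sketch returns a value of about 0.9.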

Load factor reliability: the load factor reflects the degree of crowding within the bus and thus measures the comfort of bus service. In this paper, we use the personal space available to each passenger to determine the load factor reliability. We define a variable $S_{j}$ to reflect the degree of fluctuation of the load factor in actual operation during a time period j

S_{j} = \frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 2}^{M} \frac{Q_{i, k} / C_{i}}{\bar{S}_{k}}

(3)

Q_{i, k} = \begin{cases} Q_{i, k - 1} + Q_{i, k - 1}^{L} - Q_{i, k - 1}^{U}, & k \in Z, k \geq 2 \\ Q_{i, k - 1}^{L}, & k = 1 \end{cases}

(4)

where $Q_{i, k}$ denotes the number of passengers on bus i at station k, $C_{i}$ denotes the maximum capacity of bus i, $\bar{S}_{k}$ denotes the theoretical load factor at station k, and $Q_{i, k - 1}^{L}$ and $Q_{i, k - 1}^{U}$ denote the numbers of passengers who board and alight from bus i at station k, respectively.
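The recursion in equation (4) and the index in equation (3) can be sketched as below. The function names, the 0-based station index, and the passenger counts are hypothetical. Following the definition in the text, the boarding and alighting counts that enter the load at station k are stored here at list position k.

```python
def passenger_loads(boardings, alightings):
    """Equation (4): on-board passengers Q_{i,k} for one bus, station by station.

    boardings[k], alightings[k] -- passengers getting on / off at station k (0-based)
    """
    loads = []
    for k, (on, off) in enumerate(zip(boardings, alightings)):
        if k == 0:
            loads.append(on)                    # first station: only boardings
        else:
            loads.append(loads[-1] + on - off)  # previous load + boardings - alightings
    return loads


def load_factor_reliability(loads_per_bus, capacities, theoretical_load_factors):
    """Equation (3): S_j = (1 / N) * sum_i sum_{k>=2} (Q_{i,k} / C_i) / S_bar_k."""
    n = len(loads_per_bus)
    total = 0.0
    for loads, capacity in zip(loads_per_bus, capacities):
        for k in range(1, len(loads)):          # stations 2..M in 1-based numbering
            total += (loads[k] / capacity) / theoretical_load_factors[k]
    return total / n


# Hypothetical single bus with capacity 40 on a three-station line.
loads = passenger_loads(boardings=[10, 5, 0], alightings=[0, 3, 12])
print(loads)  # [10, 12, 0]
print(round(load_factor_reliability([loads], [40], [0.5, 0.5, 0.5]), 6))
```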

As described above, service quality is captured by three indexes: bus interval reliability, punctuality rate reliability, and load factor reliability. From these three evaluation indexes, this paper defines an integrated bus service reliability index $R_{j}$ as follows

R_{j} = \beta_{1} D_{j} + \beta_{2} T_{j} + \beta_{3} S_{j}

(5)

\beta_{1} + \beta_{2} + \beta_{3} = 1

(6)

where $\beta_{1}$, $\beta_{2}$, and $\beta_{3}$ are the weights of the respective reliability indexes.
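Equations (5) and (6) amount to a weighted sum with a constraint on the weights. A minimal sketch follows; the function name and the example weights (0.5, 0.3, 0.2) are hypothetical, since the text does not report the weight values used.

```python
import math


def service_reliability(d_j, t_j, s_j, weights):
    """Equation (5): R_j = beta_1 * D_j + beta_2 * T_j + beta_3 * S_j."""
    beta_1, beta_2, beta_3 = weights
    # Equation (6): the three weights must sum to 1.
    if not math.isclose(beta_1 + beta_2 + beta_3, 1.0):
        raise ValueError("weights must sum to 1")
    return beta_1 * d_j + beta_2 * t_j + beta_3 * s_j


# Hypothetical index values and weights.
print(round(service_reliability(0.9, 0.8, 0.7, weights=(0.5, 0.3, 0.2)), 6))  # 0.83
```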

In this paper, only the information of the N buses within the rolling horizon is used to evaluate bus service reliability; information beyond the rolling horizon is not considered. The rolling horizon of buses is illustrated in Figure 2.

Figure 2.

The illustration of rolling horizon of buses.

Bus service reliability prediction

With GPS technology, information such as the bus position and the bus arrival time at each stop can be obtained easily, and infrared sensors on the bus provide the number of boarding and alighting passengers. The service reliability of the current bus line can thus be assessed with the equations in the “Bus service reliability evaluation” section. However, it is also necessary to predict the future reliability of the bus line from these data, which is of great significance for both bus operation scheduling and passenger travel choice. To predict the reliability of future bus service, the underlying relation between current and future bus service must be inferred. In this paper, the RF method is adopted to model the reliability of future bus service based on the reliabilities of the current and recent bus services

R_{j + 1} = f (R_{j}, R_{j - 1}, R_{j - 2}, \dots, R_{j - w + 1})

(7)

where $R_{j + 1}$ denotes the predicted bus reliability value in period j + 1, $R_{j}, R_{j - 1}, R_{j - 2}, \dots, R_{j - w + 1}$ are the actual reliability values of the previous w periods, and $f (\cdot)$ denotes the reliability prediction function. Figure 3 shows an example in which the future reliability of bus service $R_{j + 1}$ is predicted when the bus completes a timetabled trip in period j.

Figure 3.

The frame of bus service reliability prediction.
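To make equation (7) concrete, the sketch below turns a sequence of period reliabilities into training pairs: each input row holds the w most recent values and the target is the value of the next period. The function name `make_lagged_samples` and the sample series are hypothetical.

```python
def make_lagged_samples(series, w):
    """Build (inputs, targets) pairs for equation (7).

    Each input row is [R_j, R_{j-1}, ..., R_{j-w+1}] (most recent first),
    and the matching target is R_{j+1}.
    """
    if w < 1 or len(series) <= w:
        raise ValueError("series must contain more than w values and w must be >= 1")
    inputs, targets = [], []
    for j in range(w - 1, len(series) - 1):
        window = series[j - w + 1 : j + 1]
        inputs.append(window[::-1])
        targets.append(series[j + 1])
    return inputs, targets


# Hypothetical reliabilities for five consecutive periods, with w = 3.
inputs, targets = make_lagged_samples([0.80, 0.82, 0.79, 0.85, 0.83], w=3)
print(inputs)   # [[0.79, 0.82, 0.8], [0.85, 0.79, 0.82]]
print(targets)  # [0.85, 0.83]
```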

Random forest

RF is a machine learning algorithm proposed by Leo Breiman in 2001 that combines bagging ensemble learning with the random subspace method. In essence, it improves on the decision tree algorithm by combining multiple decision trees. Each tree is built on an independently drawn sample, and all trees in the forest have the same distribution. The classification error depends on the classification ability of each tree and on the correlation between trees. At each node, the split feature is chosen from a random subset of features, and the errors produced by the candidate splits are compared. The classification ability of an individual tree may be weak, but after a large number of decision trees have been generated randomly, a test sample can be classified by collecting the classification of every tree and selecting the most likely class.

The principle of RF

The random forest classification (RFC) model is an ensemble classification model composed of many decision tree classification models $\{h(X, \Theta_{k}), k = 1, 2, \dots\}$, where the parameter sets $\{\Theta_{k}\}$ are independent and identically distributed random vectors. Given the independent variable X, each decision tree classification model casts one vote for the optimal classification result. Specifically, k samples are first drawn from the original training set by bootstrap sampling, each of the same size as the original training set. Second, k decision tree models are built, one per sample. Finally, each record is assigned its final class by a vote over the k classification results. The theoretical framework of the RF method is shown in Figure 4.

Figure 4.

Framework of random forest method.

RF constructs different training sets to increase the diversity among the classification models, thereby improving the extrapolation and prediction ability of the combined classification model. After k rounds of training, we obtain a sequence of classification models $\{h_{1}(X), h_{2}(X), \dots, h_{k}(X)\}$, which together form a multi-classifier system whose final classification result is determined by simple majority voting. The final classification decision is shown in equation (8)

H (x) = \arg \max_{Y} \sum_{i = 1}^{k} I (h_{i} (x) = Y)

(8)

where $H (x)$ denotes the combined classification model, $h_{i}$ denotes a single decision tree classification model, Y is the output variable (or objective variable), and $I (\cdot)$ is the indicator function. Equation (8) shows how the final classification is determined by majority vote.
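The majority vote of equation (8) can be illustrated with a few lines of Python. The three "trees" below are hypothetical one-rule classifiers standing in for trained decision trees, and the class labels are invented for the example.

```python
from collections import Counter


def majority_vote(trees, x):
    """Equation (8): return the class Y that maximizes sum_i I(h_i(x) = Y)."""
    votes = Counter(tree(x) for tree in trees)
    label, _count = votes.most_common(1)[0]
    return label


# Three hypothetical "trees", each reduced to a single threshold rule.
def tree_a(x):
    return "reliable" if x[0] > 0.5 else "unreliable"


def tree_b(x):
    return "reliable" if x[1] > 0.5 else "unreliable"


def tree_c(x):
    return "reliable" if x[0] + x[1] > 1.2 else "unreliable"


trees = [tree_a, tree_b, tree_c]
print(majority_vote(trees, (0.9, 0.2)))  # unreliable (two votes against one)
print(majority_vote(trees, (0.9, 0.8)))  # reliable (three votes)
```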

The process of RF algorithm

RF is an ensemble classifier composed of a number of classification decision trees. Each tree is trained on a bootstrap sample drawn from the initial data set. The class that RF assigns to a sample is determined by the classification results of the individual decision trees. The basic RF algorithm is as follows:

Step 1: the input data set X consists of N samples. Each sample has a class attribute and several predictor attributes. There are M attributes in total, and M is usually larger than N.

Step 2: a new sample set X*, also composed of N samples, is drawn from X by sampling with replacement: one sample is drawn from X at a time and put back, and this is repeated N times. On average, about one-third of the samples (roughly 36.8%) are never drawn; these out-of-bag samples are used as test data.

Step 3: a classification decision tree t is grown from the new data set X*. Growing the tree is a process of recursively dividing the data into subclasses, in which each parent node is split into two child nodes. To classify the data, an appropriate node split criterion is selected so that the sample purity of the child nodes is higher than that of the parent node. When splitting each node, we randomly select $mtry$ candidate attributes from the M attributes and then choose the best splitting attribute among them; the default is $mtry = \sqrt{M}$.

Step 4: repeat steps 2 and 3 to generate a forest. For classification, the class that receives the most votes from all the trees in the forest is the result of the random forest classification.

Step 5: calculate the importance values of the attributes and the error rate by classifying the test samples.
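The two sources of randomness in steps 2 and 3 can be sketched with the Python standard library. The helper names `bootstrap_indices` and `candidate_attributes` are hypothetical, and the rounding of $\sqrt{M}$ down to an integer is an assumption of this sketch.

```python
import math
import random


def bootstrap_indices(n, rng):
    """Step 2: draw n indices with replacement; indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_attributes(m, rng, mtry=None):
    """Step 3: pick mtry distinct candidate attributes out of m for one node split."""
    if mtry is None:
        mtry = max(1, int(math.sqrt(m)))  # default mtry = sqrt(M), rounded down here
    return rng.sample(range(m), mtry)


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(1000, rng)
# The out-of-bag share is typically close to (1 - 1/n)^n, about 0.368 for large n.
print(len(out_of_bag) / 1000)
print(candidate_attributes(9, rng))  # three distinct attribute indices
```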

As the above shows, the predictive ability of each trained decision tree is weak, but each tree acts as an “expert” on part of the data. When all the decision trees predict together, the forest shows good prediction performance.

In this paper, we aim to predict the future reliability of bus service. Given a forecast input of current bus service reliabilities $(x_{1}, x_{2}, \dots, x_{D})$, we pass it to each trained decision tree. Following the split conditions, the input is routed to the leaf node it belongs to, and the tree outputs the average value of the training samples in that node. The final prediction is the mean of the outputs of all the decision trees.
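The averaging described above can be sketched with a deliberately simplified forest. This is not the model used in the paper: each "tree" here is a one-split regression stump fitted on a bootstrap sample with a single randomly chosen attribute, and all function names and data values are hypothetical. It only illustrates how leaf averages from many trees are combined into one forecast.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree on one feature; each leaf stores a mean target."""
    values = sorted(set(row[feature] for row in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for row, y in zip(rows, targets) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, targets) if row[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: fall back to a single leaf
        overall = mean(targets)
        return (feature, float("inf"), overall, overall)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees, rng):
    """Grow n_trees stumps, each on its own bootstrap sample and random attribute."""
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        feature = rng.randrange(d)
        forest.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """The forest forecast is the mean of the individual tree outputs."""
    return mean(predict_stump(stump, row) for stump in forest)


# Hypothetical lagged reliabilities (inputs) and next-period reliabilities (targets).
rows = [[0.80, 0.82], [0.82, 0.79], [0.79, 0.85], [0.85, 0.83], [0.83, 0.81]]
targets = [0.79, 0.85, 0.83, 0.81, 0.84]
forest = fit_forest(rows, targets, n_trees=25, rng=random.Random(1))
print(predict_forest(forest, [0.84, 0.80]))
```

Because every leaf value is an average of training targets, the forecast always stays within the range of the observed reliabilities.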

Case study

Data from bus line 23 in Dalian city, China, are used to verify the accuracy of the above model. Bus line 23 runs from the East Gate of Dalian University of Technology station to the Dalian University of Foreign Languages station, with a total length of 14.6 km and 21 stations. The configuration of bus line 23 is shown in Figure 5. There are 1457 groups of valid data, collected from 6:00 to 20:00 between 1 April 2016 and 10 April 2016. Each group contains the bus position, the arrival time, and the number of boarding and alighting passengers at each station for each bus in a certain period. Before prediction, these data must be converted into bus service reliability values, because the input variable of RF is bus service reliability. The sample data are divided into three subsets: a training set, an inspection (validation) set, and a test set. The test samples make up about 10% (140 groups), the inspection samples about 20% (280 groups), and the rest are training samples.

Figure 5.

Configurations of bus line 23 in Dalian city of China.

In the RF method, parameter selection is an important step, because the parameter settings determine the prediction accuracy. RF has two parameters: the number of decision trees k and the number of candidate attributes $mtry$. In this paper, we set k = 500 and $mtry$ = 2 after testing the prediction accuracy on our data.

Because the RF method selects the candidate attributes randomly, the predicted result may differ from run to run. To verify the stability of the RF method, we train and predict 10 times. We use the data of Dalian bus line 23 from 9:00 a.m. to 9:30 a.m. on 6 April to predict the bus service reliability from 9:30 a.m. to 10:00 a.m. on 6 April. The prediction results are shown in Figure 6.

Figure 6.

The prediction result of bus service reliability.

As shown in Figure 6, the prediction results of bus service reliability are stable: the 10 predicted values lie between 91% and 94%, so we consider the variance acceptable.

To verify the accuracy of RF, we use RF, ANN, and SVM to predict the reliability of bus line 23 on three consecutive days, and we use the mean absolute percent error (MAPE) to evaluate prediction accuracy. The performance comparison of ANN, SVM, and RF is shown in Table 1

\mathrm{MAPE}\ (\%) = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{Y_{i} - Y_{i}^{*}}{Y_{i}} \right| \times 100\%

(9)
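Equation (9) translates directly into code. In the sketch below, the function name `mape` is hypothetical, and reading $Y_{i}$ as the actual value and $Y_{i}^{*}$ as the predicted value is an assumption, since the text does not define the two symbols; the sample values are invented.

```python
def mape(actual, predicted):
    """Equation (9): mean absolute percent error, in percent.

    Assumes actual holds Y_i and predicted holds Y_i^*; every Y_i must be non-zero.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((y - y_star) / y) for y, y_star in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical reliabilities: relative errors of 5% and 10% average to about 7.5%.
print(round(mape([0.8, 0.5], [0.76, 0.55]), 6))
```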

Table 1.

The performance comparison between ANN, SVM, and RF.

Prediction date	Algorithm	MAPE
2016-04-08	ANN	8.52%
2016-04-08	SVM	8.01%
2016-04-08	RF	5.70%
2016-04-09	ANN	10.25%
2016-04-09	SVM	9.12%
2016-04-09	RF	8.34%
2016-04-10	ANN	9.24%
2016-04-10	SVM	8.78%
2016-04-10	RF	6.39%

ANN: artificial neural network; MAPE: mean absolute percent error; RF: random forest; SVM: support vector machine.

As Table 1 shows, ANN performs worst, and the MAPE of RF is smaller than those of both ANN and SVM on all three days. One explanation is that the RF method trains on the data more thoroughly than SVM does. In addition, because new sample data can be added to provide better candidate attributes and to update the model online, the prediction accuracy of RF can be improved further.

Using the RF method to predict the reliability of bus service over a week gives the results shown in Figure 7.

Figure 7.

The prediction reliability of bus line 23 during a week using RF. RF: random forest.

As Figure 7 shows, the reliability of bus service is around 0.8 on workdays. Reliability on Saturday and Sunday is lower than on workdays because passenger flow is relatively higher on weekends, which brings more uncertainty to public transport. Reliability during peak hours follows the same trend over the week as during off-peak hours but is far worse, at around 0.5. There is therefore considerable potential to improve the reliability of bus service, and of public transport more generally.

Conclusions

The need for reliable bus service has grown with the accelerating pace of life. Increased bus service reliability can shorten passengers' travel time, and good bus service can attract more passengers to choose the bus as their travel mode. Understanding bus reliability can therefore support improvements to bus service. Most studies focus on evaluating current reliability, and few address reliability prediction. This paper first uses a reliability evaluation method to obtain the reliability of a bus line. On this basis, it proposes a method that predicts the reliability of future transit service using RF, and introduces the principle of RF in detail. The proposed reliability prediction is then tested with data from bus line 23 in Dalian city, China. The RF model is run 10 times, and the results demonstrate its stability. In addition, the RF algorithm is compared with ANN and SVM, neither of which performs better than RF. Finally, RF is used to predict the reliability of the bus service over a week. In short, RF is a feasible way to predict the reliability of bus service. In future work, we aim to account for the uncertainty of current events in consecutive bus service reliability, and to examine the connections between bus service reliability and different time periods.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: National Natural Science Foundation of China 51578112, 71571026, Liaoning Excellent Talents in University LR2015008, and the Fundamental Research Funds for the Central Universities DUT16YQ104.

References

1. Mine and Kawai. Mathematics for reliability analysis. Tokyo: Asakura-shoten, 1982.

2. Yue and Liu. Adaptive control of an underactuated spherical robot with a dynamic stable equilibrium point using hierarchical sliding mode approach. Int J Adapt Control Signal Process 2014; 28: 523–535.

3. Asakura Y and Kashiwadani M. Road network reliability caused by daily fluctuation of traffic flow. In: PTRC summer annual meeting, University of Sussex, UK, 19 January 1991.

4. Bell MGH, Cassir C, Iida Y, et al. A sensitivity based approach to network reliability assessment. In: 14th international symposium on transportation and traffic theory, Jerusalem, Israel, 20–23 July 1999.

5. Yao, et al. Transit network design based on travel time reliability. Transp Res C 2014; 43: 233–248.

6. Kong, Sun, et al. A bi-level programming for bus lane network design. Transp Res C 2015; 55: 310–327.

7. Bates, Polak, Jones, et al. The valuation of reliability for personal travel. Transp Res E 2001; 37: 191–229.

8. Bowman and Turnquist. Service frequency, schedule reliability and passenger wait times at transit stops. Transp Res A 1981; 15: 465–471.

9. Currie G and Csikos DR. The impacts of transit reliability on wait time: insights from automated fare collection system data. In: Transportation research board 86th annual meeting. Washington DC, United, 21–25 January 2007.

10. Bates JW. Definition of practices for bus transit on-time performance: preliminary study. Transp Res Circ 1986; 300: 5.

11. Strathman and Hopper. Empirical analysis of bus transit on-time performance. Transp Res A 1993; 27: 93–100.

12. Strathman, Dueker, Kimpel, et al. Automated bus dispatching, operations control, and service reliability: baseline analysis. Transp Res Rec 1999; 1666: 28–36.

13. Yin Y, Lam WH and Ieda H. Reliability assessment on transit network services. In: The network reliability of transport: proceedings of the 1st international symposium on transportation network reliability (INSTR), Kyoto, Japan, 31 July–1 August 2001, p.119. Elsevier.

14. Yin, Lam and Miller. A simulation-based reliability assessment approach for congested transit network. J Adv Transp 2004; 38: 27–44.

15. Camus, Longo and Macorini. Estimation of transit reliability level-of-service based on automatic vehicle location data. Transp Res Rec 2005; 1927: 277–286.

16. Sun, Xu and Peng. Timetable optimization for single bus line based on hybrid vehicle size model. J Traffic Transp Eng 2015; 2: 179–186.

17. Sun, Chen, Zhang, et al. A bus route evaluation model based on GIS and super-efficient data envelopment analysis. Transp Plann Technol 2016; 39: 407–423.

18. Chen S and Sun DJ. A multistate-based travel time schedule model for fixed transit route. Transp Lett 2016: 1–10.

19. Xue R, Sun DJ and Chen S. Short-term bus passenger demand prediction based on time series model and interactive multiple model approach. Discrete Dyn Nat Soc. Epub ahead of print 2015. doi: 10.1155/2015/682390.

20. Kim C and Hobeika AG. A short-term demand forecasting model from real-time traffic data. In: Infrastructure planning and management, Denver, Colorado, 21–23 June 1993, pp.540–550. ASCE.

21. D'Angelo, Al-Deek and Wang. Travel-time prediction for freeway corridors. Transp Res Rec 1999; 1676: 184–191.

22. Chien SIJ, Ding and Wei. Dynamic bus arrival time prediction with artificial neural networks. J Transp Eng 2002; 128: 429–438.

23. Wall Z and Dailey DJ. An algorithm for predicting the arrival time of mass transit vehicles using automatic vehicle location data. In: 78th annual meeting of the transportation research board, National Research Council, Washington, DC, 10–14 January 1999.

24. Yu, Lam and Tam. Bus arrival time prediction at bus stop with multiple routes. Transp Res C 2011; 19: 1157–1170.

25. Coussement and Van den Poel. Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers. Expert Syst Appl 2009; 36: 6127–6134.