Abstract
The market for on-demand mobility services is growing worldwide. These services include, for example, ride-hailing, ride-sharing, and car-sharing. Large-scale fleets of such services constantly collect GPS trajectory (probe vehicle) data throughout the network. At a certain penetration rate, this data becomes representative of the entire road network. It can give valuable insights into traffic dynamics and the evolution of congestion. In this paper, we use such GPS trajectory data from Chengdu, China, to investigate the stability and recurrence of macroscopic traffic patterns. Using the two-fluid theory, we find that the two-fluid coefficients are robust to between-day variation, which not only supports the theory itself but also emphasizes that the general evolution of traffic is a robust pattern. We investigate the deviations from the model using time series analysis of the residuals of the two-fluid model. Here, we find evidence for daily and weekly seasonality in the residuals, indicating that congestion patterns recur consistently. These patterns can be used for network-wide traffic state prediction. We conclude that GPS trajectory data from large on-demand mobility fleets is a promising data source for observing traffic patterns in urban road networks once the data becomes representative.
Measuring traffic in a metropolis can be costly for the traffic management center as it needs to install many stationary sensors and accompanying communication infrastructure. Since the advent of GPS probe vehicle data, traffic state information is reported by the moving vehicles, aggregated, and returned to all drivers ( 1 ). While stationary sensors commonly measure traffic flows well, GPS probe vehicle data performs better at recording speeds ( 2 ). Thus, a fusion of both sources can improve the network-wide traffic state estimation ( 3 ). Currently, many probe vehicle data providers do not offer the original vehicle trajectories for privacy reasons, but rather aggregate the data to trip, origin–destination, traffic volume, and speed data on road segments with a typical length of around 100 m. Consequently, no other trip-related information is available that could be informative for traffic state estimation. However, in recent years, on-demand mobility vehicles and taxis have turned into a large fleet of moving sensors that report their trajectories at a large scale and in almost real time, not only for traffic management but also for third-party applications and research. As this data is constantly collected, it offers, for the first time, the opportunity to study the stability and recurrence of congestion patterns at a large scale both temporally and spatially. Predicting patterns in addition to speeds reduces the problem to its most essential dimensions, which simplifies the prediction and makes the results easier to explain.
The interest in network-level traffic dynamics models can be traced back to the late 1960s and can be categorized into three eras: (i) flow–speed relationships until 1979, (ii) two-fluid theory from 1979 to 2007, and (iii) the network macroscopic fundamental diagram (NMFD) from 2007 to the present ( 4 ). When discussing the suitability of a fleet of moving sensors for network traffic state estimation ( 5 )—or arterial traffic state estimation ( 6 )—the use of the two-fluid theory seems intriguing as, compared with the other two approaches, it primarily relies on vehicle trajectories and speed measurements that are provided by such a fleet, and no flow measurements. Although its era has come to an end, it is still being used, sometimes together with NMFD models, for example, based on taxi data ( 7 ) or drone data ( 8 ), or as a means for fusing data sources ( 9 ). As with the NMFD ( 10 , 11 ), the two-fluid parameters also depend on network topology and network features ( 12 – 14 ). The evidence further shows that the two-fluid parameters depend on driving behavior (aggressive/conservative) and crash rates, resulting from drivers’ objective of maximizing the quality of their journeys by traveling fast and maintaining safety ( 15 , 16 ). Once the vehicle fleet is large enough to be representative of the entire network, this fleet data can be used to detect and model the traffic patterns of the monitored road network. Working on patterns instead of using the full traffic data reduces the dimensionality of the problem (e.g., reducing thousands of streets to a few congestion patterns), which makes the complexity of dynamic urban traffic more comprehensible. Thus, these patterns act as a support for selecting adequate measures for traffic management, for example, if a specific pattern of traffic flows is linked to a bottleneck activation pattern. 
This network-level perspective has already been demonstrated, for example, for loop detector data ( 17 ) and automated number-plate recognition system data ( 18 ), where the complexity of urban traffic dynamics has been reduced to a few clusters. It need not be limited to traffic state estimation, but can also be used to inform about other events such as weather ( 19 ), from which further measures for traffic management can be derived. The advancement of deep learning techniques for congestion prediction in the big data age ( 20 ) may also support the development of high-resolution spatio-temporal pattern detection algorithms.
However, none of these analyses combine the questions of stability and recurrence of congestion patterns based on trajectory data over a long time period. Stability focuses on how congestion varies over time (range and severity) and how fast the network can recover from congestion, while recurrence concerns the repeating patterns of congestion. In some cities, transportation network companies (TNCs) already operate large fleets, but the open question is whether such a data source can be used as a sensor for traffic management, in particular for traffic state estimation and prediction. Here, we consequently investigate the fundamental suitability of the data source for such problems.
In this paper, we use an open-access trajectory data set from Chengdu, China that covers 30 days, which reports the waypoints of on average 1,250 vehicles circulating simultaneously in the city. From this data, we estimate the two-fluid relationship ( 5 ) and show that the postulated relationship is indeed robust over several days. Also, we find a strong linear relationship as
This paper is organized as follows. In the next section, we introduce the data used in this analysis. Thereafter, we present our methodology to investigate the robustness and recurrence of congestion patterns in Chengdu. We then proceed by presenting the results of our analysis, before closing the paper by discussing our findings.
Data Set and Study Area
In this study, we utilize the GPS trajectory data set provided by the Didi Chuxing GAIA Open Dataset. Didi Chuxing is one of the largest TNCs worldwide, providing transportation services such as ride-hailing and ride-sharing. The Didi platform serves over 10 billion passenger trips per year ( 21 ). The data used here stems only from ride-hailing services ( 22 ). We use the GPS trajectory data set collected in Chengdu, China in 2016, which has been used extensively by other researchers in previous years. These studies cover topics including data processing and outlier detection ( 23 ), demand prediction ( 24 – 29 ), order dispatching ( 30 , 31 ), ride-splitting ( 32 ), traffic flow prediction ( 33 , 34 ), and travel time prediction ( 35 , 36 ).
The GPS trajectory data was recorded in November 2016 with an average sampling interval of 3.11 s. The data include five variables: driver ID, order ID, timestamp, longitude, and latitude. In the analyzed data set, driver ID identifies the driver, while order ID identifies an individual order. One driver can accept several orders in a single day, that is, the same driver ID is usually linked to several different order IDs within one day. Driver ID and order ID have already been anonymized for privacy. Examples of the GPS trajectory data are given in Table 1.
Format of the Original Data Set
The data used in this study cover the area shown in Figure 1 with OpenStreetMap (OSM) as the background ( 37 ). It corresponds to the northeast corner of the area within the third ring road in Chengdu, which is covered with a high resolution. For example, for November 1, 2016, there are 32,155,517 GPS records in total, belonging to 181,172 orders and 35,449 drivers. Thus, each driver has roughly 5.11 orders per day, and each order contains on average 177.5 GPS records. With an average sampling interval of 3.11 s, the average trip duration is about 9.2 min.
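The per-driver and per-order averages quoted above follow directly from the reported totals. The short Python check below reproduces them; the variable names are illustrative, and the trip duration is approximated as the number of records times the average time between records, as in the text.

```python
# Totals reported in the text for November 1, 2016
gps_records = 32_155_517
orders = 181_172
drivers = 35_449
sampling_interval_s = 3.11  # average time between consecutive records

orders_per_driver = orders / drivers
records_per_order = gps_records / orders
# Approximation used in the text: records per order times sampling interval
trip_duration_min = records_per_order * sampling_interval_s / 60

print(f"{orders_per_driver:.2f} orders per driver")
print(f"{records_per_order:.1f} GPS records per order")
print(f"{trip_duration_min:.1f} min average trip duration")
```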

Study area of the data set.
Figure 2 describes how GPS records are distributed per hour within one day (November 1, 2016). Ride-hailing services are concentrated mainly from 8:00 a.m. to 11:00 p.m. in Chengdu. The original data possess a GPS shift because of the unique Chinese geographical coordinate system ( 38 ), which we have fixed during pre-processing. To investigate the coverage of services, we visualize the GPS records as trajectories in Figure 3 with data aggregated from 9:00 a.m. to 9:05 a.m. and in Figure 4 from 9:00 a.m. to 9:01 a.m., on November 1, 2016. We find that a 1 min trajectory data aggregation might not be representative of the study area, while 5 min data aggregation is able to cover most roads. Therefore, 5 min is selected as the aggregation level. In conclusion, the penetration rate of the ride-hailing fleet by Didi within 5 min is high enough for further aggregation and investigation of the traffic states in the city.

Distribution of global positioning system (GPS) records over hours of one day.

Global positioning system (GPS) trajectories in a 5 min period (9:00–9:05 a.m., November 1, 2016).

Global positioning system (GPS) trajectories in a 1 min period (9:00–9:01 a.m., November 1, 2016).
Methodology
We introduce this study’s methodology step-wise. Following the introduction of the two-fluid theory, we perform a temporal aggregation of the data to extract macroscopic traffic indicators. Then, we use this aggregated data to estimate the two-fluid model parameters. After calculating the residuals between the real data and the estimation from the two-fluid model, we finally analyze the temporal correlations of the residuals by using time series analysis.
Two-Fluid Model
The two-fluid model is a concise model for urban road traffic developed by Herman and Prigogine ( 5 ). According to the model’s assumptions, traffic consists of two fluids: moving and stopped vehicles. A speed threshold is selected to define whether a vehicle stops. This threshold might differ among different data sets: for high-frequency recordings of 1 s, a low threshold can capture the stopped state more accurately; for a lower frequency, a looser threshold can avoid misclassification. All involved variables of the two-fluid model are summarized in Table 2.
Summary of Variables
Note: na = not applicable.
Based on the definition of the variables in Table 2, the following relationships can be set:
Here, trip time per unit distance equals the reciprocal of travel speed. Average trip time per unit distance
The two-fluid model possesses three key assumptions. The first assumes a linear relationship between the fraction of moving vehicles
The average speed
The second assumption is based on the ergodic theory: the average speed of moving vehicles
The third assumption states that the ratio of stopped time
Based on these three assumptions, the two-fluid theory’s main relationship is finalized in Equation 6.
where
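For reference, the relations described in this section can be written in the standard form of the two-fluid model ( 5 ). The block below is a sketch of that textbook formulation, not a reproduction of the numbered equations of this paper, so the equation numbering may differ. The symbols are assumptions chosen for this sketch: $T$, $T_r$, and $T_s$ denote the trip, running, and stopped time per unit distance, $T_m$ the minimum trip time per unit distance, $v$, $v_r$, and $v_m$ the corresponding speeds, $f_r$ and $f_s$ the fractions of moving and stopped vehicles, and $n$ the second two-fluid parameter.

```latex
\begin{align*}
T &= T_r + T_s, \qquad v = \frac{1}{T}, \quad v_r = \frac{1}{T_r}, \quad v_m = \frac{1}{T_m}, \qquad f_r + f_s = 1 \\
v_r &= v_m \, f_r^{\,n} = v_m \, (1 - f_s)^{n} \\
v &= v_r \, f_r = v_m \, (1 - f_s)^{n+1} \\
f_s &= \frac{T_s}{T} \\
T_r &= T_m^{\frac{1}{n+1}} \; T^{\frac{n}{n+1}} \\
\ln T_r &= \frac{1}{n+1} \ln T_m + \frac{n}{n+1} \ln T
\end{align*}
```

The last line is the log-linear form: regressing $\ln T_r$ on $\ln T$ gives a slope of $n/(n+1)$ and an intercept of $\ln T_m/(n+1)$, from which $n$ and $T_m$ can be recovered.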
Data Aggregation
To estimate the two-fluid parameters, we first need to aggregate the raw individual GPS records. Following the findings from Figure 3, we set the aggregation interval to 5 min. Then, we use the distm function from the R package geosphere to calculate the distance between two consecutive records, which is then divided by the time interval to obtain the vehicle’s instantaneous speed.
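The speed calculation from two GPS records can be sketched as follows. This is a minimal Python stand-in, not the R code used in the study: it uses the haversine great-circle distance with an assumed mean Earth radius, and the record layout `(timestamp_s, lon, lat)` and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; an assumption of this sketch


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speed_kmh(rec_a, rec_b):
    """Speed between two records (timestamp_s, lon, lat); None if time does not advance."""
    dt = rec_b[0] - rec_a[0]
    if dt <= 0:
        return None
    dist_m = haversine_m(rec_a[1], rec_a[2], rec_b[1], rec_b[2])
    return dist_m / dt * 3.6  # m/s to km/h


# Hypothetical records 3 s apart, moving 0.0003 degrees north
a = (0, 104.08, 30.66)
b = (3, 104.08, 30.6603)
print(speed_kmh(a, b))
```

Records with a non-positive time difference return `None` so that duplicated or out-of-order timestamps do not produce infinite or negative speeds.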
To calculate the fraction of stopped vehicles

Density plot (logarithm) of speeds lower than 20 km/h in one day.
For each aggregation period, the variables listed in Table 3 are then generated. The speed
Variables of One Aggregated Period
Note: GPS = global positioning system.
Estimating the Two-Fluid Parameters and Residuals
Using the linear model from Equation 7, two-fluid parameters
The residuals
The residuals capture all the variations and trends in the data that are not described by the two-fluid model. The advantage of using the residuals instead of the observed values is that the expected part is removed from the data and only the deviations are used for the time series. In describing time series patterns, three components are usually distinguished ( 39 ). Trend refers to a long-term change in the data. Cycle occurs when there are repeated patterns, such as rises and falls, of a non-fixed frequency, while seasonality, as its name suggests, is always of a fixed frequency. In time series analysis, the term trend is also used to combine both trend and cycle as just defined.
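The estimation of the two-fluid parameters and the residuals can be sketched in a few lines of Python. The sketch assumes the standard log-linear form of the two-fluid model, ln(T_r) = ln(T_m)/(n+1) + n/(n+1) · ln(T), fitted by ordinary least squares, and it computes the residuals in log space; whether the study defines the residuals on this scale is an assumption. The function name and the synthetic input values are hypothetical.

```python
import math


def fit_two_fluid(trip_times, running_times):
    """OLS fit of ln(T_r) on ln(T); returns (n, T_m, residuals in log space)."""
    x = [math.log(t) for t in trip_times]
    y = [math.log(tr) for tr in running_times]
    k = len(x)
    x_mean = sum(x) / k
    y_mean = sum(y) / k
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx                     # equals n / (n + 1)
    intercept = y_mean - slope * x_mean   # equals ln(T_m) / (n + 1)
    n = slope / (1 - slope)
    t_m = math.exp(intercept / (1 - slope))
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return n, t_m, residuals


# Synthetic, noise-free example generated from known parameters (not study data)
true_n, true_t_m = 1.5, 1.8
trip = [2.0, 2.5, 3.0, 4.0, 5.0]
running = [true_t_m ** (1 / (true_n + 1)) * t ** (true_n / (true_n + 1)) for t in trip]
n_hat, t_m_hat, res = fit_two_fluid(trip, running)
print(n_hat, t_m_hat)
```

Because the synthetic data is generated from the model without noise, the fit recovers the chosen parameters and the residuals vanish; with real observations, the residuals carry the deviations analyzed in the time series step.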
Here, we focus on the seasonality in the residuals of the two-fluid model. We assume that seasonality exists in the residuals because of day-of-week (DoW) and hour-of-day factors. Therefore, the residuals are not stationary. To make the time series stationary, we keep differencing the residuals until certain conditions have been met. The number of differencing steps is called degree
where
Results and Analysis
In this section, we present the results following the same sequence as in the Methodology section.
Data Filtering
Before estimating the two-fluid model, the empirical data has to be checked for outliers, as GPS measurements come with errors. For example, the maximum observed speed is 288 km/h, which is implausible for urban traffic. Therefore, we investigated the speed distribution of all vehicles in the downtown area. Figure 6 shows the distribution of speed values. In total, only 0.03% of observations are greater than 80 km/h, so speeds above this value are rare. Inspecting the trajectories with speeds over 120 km/h, we find that most of them result from sudden GPS drifts. In contrast, speeds in the range from 80 to 120 km/h still look reasonable and may indicate speeding during non-peak hours, as supported by the findings from a single day shown in Figure 7. Thus, we decided to remove all trajectory parts that exceed 120 km/h to avoid an impact of clearly erroneous measurements on our results. For the remaining observations, the aggregation of the data to a macroscopic network state may average out small errors or GPS drift, assuming that the error process itself is unbiased (Gaussian process, etc.).

Unfiltered speed distribution of the entire data set.

Frequency distribution of 80 to 120 km/h observations on November 1, 2016.
The next step is to remove all observations with a very small and thus most likely unrepresentative fleet. We remove all observations from 11:59 p.m. to 6:00 a.m. for all days, as traffic is in free flow during these periods and the fleet is smaller than during daytime hours. Further, we define a threshold for the number of vehicles in the fleet. We set this threshold based on the relationship shown in Figure 8 between the number of vehicles in the aggregation period and its (inverted) cumulative share, at the location of the steepest slope. This location is identified using differentiation. The threshold is selected as 112 vehicles, and we filter out the observations that have fewer vehicles in one aggregation period. An aggregation interval of 5 min results in 288 available intervals per day. Therefore, the 30-day data set contains 8,640 intervals. From these 8,640 observations, 93.0% (i.e., 8,031 observations) are kept for further analysis.

Percentage of observations versus available vehicles.
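The interval counts and the fleet-size filter described above can be checked with a short Python sketch. The constants are the values reported in the text; the function name `is_representative` is hypothetical, and the rule "at least 112 vehicles" is this sketch's reading of filtering out observations with fewer vehicles.

```python
INTERVAL_MIN = 5
DAYS = 30
MIN_VEHICLES = 112  # fleet-size threshold reported in the text

intervals_per_day = 24 * 60 // INTERVAL_MIN
total_intervals = intervals_per_day * DAYS

kept_intervals = 8_031  # number of observations kept, as reported in the text
kept_share = kept_intervals / total_intervals


def is_representative(n_vehicles):
    """Keep an aggregation interval only if it has at least MIN_VEHICLES vehicles."""
    return n_vehicles >= MIN_VEHICLES


print(intervals_per_day, total_intervals, f"{kept_share:.1%}")
```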
The Two-Fluid Relationship
Each aggregated period contains trip time

Trip time

Logarithm of running time
Model Estimates Analysis
For linear regression, we do not create one single model for all days but 30 single-day models to find day-specific slopes and intercepts, resulting in series of two-fluid model coefficients
Example of Output From Linear Regression Estimation
Note: Bold font highlights substantial change in both coefficients on Sunday.
All models have
The two coefficients for the two-fluid model vary slightly, which arguably depends on the interactions of local road network structure and traffic demand.
Compared with weekdays, weekends, especially Sundays (bold in Table 4), have a substantial change in both coefficients. The minimal trip time
Figure 11 compares how the model coefficients fluctuate over time: the red line shows
Two-Fluid Theory Parameters

Fluctuation of model coefficients over 30 days.
Residual Analysis
Residuals are calculated from the model estimations and visualized in Figure 12 (absolute values) and Figure 13 (scaled residuals). The red horizontal dashed line indicates zero and “DoW” represents day-of-week. In both figures, significant recurring daily patterns can be observed that we can extrapolate to a similar weekly pattern.

Residual values over one week.

Residual scales (percentage) over one week.
Figures 14 and 15 present the daily and weekly seasonality extracted from a time series decomposition, where the frequency is set to one day and one week, respectively. Daily seasonality shows a “W”-shaped pattern with two dips that might be caused by the two traffic peaks and the corresponding demand increases. Similarly, weekly seasonality can be viewed as a combination of seven consecutive daily patterns.

Seasonality from time series decomposition: daily pattern.

Seasonality from time series decomposition: weekly pattern.
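The idea behind extracting a seasonal component with a fixed period can be illustrated with a simplified Python sketch. It averages the values at each position within the period and centres the result, which corresponds to the seasonal step of a classical additive decomposition without the preceding trend removal. This is an assumption-laden simplification, not the decomposition routine used in the study: it presumes a gap-free series with no trend, and the function name is hypothetical. With 5 min intervals, a daily period would be 288 and a weekly period 2,016 observations.

```python
def seasonal_component(series, period):
    """Mean value at each position within the period, centred to average zero."""
    if len(series) < period:
        raise ValueError("series must cover at least one full period")
    sums = [0.0] * period
    counts = [0] * period
    for i, value in enumerate(series):
        sums[i % period] += value
        counts[i % period] += 1
    means = [s / c for s, c in zip(sums, counts)]
    overall = sum(means) / period
    return [m - overall for m in means]


# Hypothetical toy series with a period of three observations
toy = [1.0, 2.0, 3.0] * 4
pattern = seasonal_component(toy, period=3)
print(pattern)
```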
For modeling the time series data, an autoregressive integrated moving average (ARIMA) model is used. Since seasonality exists in the data, the observed residual time series is not yet stationary. To test this statistically, we utilize the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test ( 40 ), a unit root test, with the null hypothesis being that the data is stationary. After one degree of differencing, the p-value of the test is as low as 0.0022, indicating that the null hypothesis is rejected. Thus, ARIMA can be applied to the differenced residual time series as shown in Figure 16.

Differenced residual time series over one week.
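Differencing itself is a simple operation, sketched below in Python. The helper name `difference` and its `lag` parameter are hypothetical; a lag of 1 gives the ordinary first difference, while a lag equal to the seasonal period gives a seasonal difference. The sketch only illustrates the transformation and does not reproduce the stationarity test.

```python
def difference(series, lag=1):
    """Return series[i] - series[i - lag] for all valid i."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# First-order differencing removes a linear trend (toy values)
print(difference([1, 3, 5, 7]))

# Differencing at the seasonal lag removes a fixed repeating pattern (toy values)
print(difference([1, 2, 3, 1, 2, 3], lag=3))
```

Each differencing step shortens the series by `lag` observations, which is negligible for a series with several thousand 5 min intervals.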
The auto.arima function from the R package forecast was used to search iteratively for the best ARIMA model. The search proposed a model of order (0,0,4). Considering the one degree of differencing already made, this leads to the final ARIMA model with

Checks of the autoregressive integrated moving average (ARIMA) model.
From the stable two-fluid model coefficients and the revealed seasonality, we can conclude that our approach to using TNC vehicles for macroscopic traffic state estimation with the two-fluid theory seems robust. Consequently, we can utilize such data and models to forecast traffic states with recurrent and non-recurrent components.
Conclusion
This research used one month of GPS trajectory data from Didi Chuxing, a TNC providing large amounts of on-demand mobility services in China. The original data set comprised more than 30 million GPS records per day in Chengdu, China. The original data offer good coverage of the road network and recurrent patterns, and the data set is thus capable of representing the general traffic state. We aggregated the data to 5 min intervals and estimated the traffic states that are required for the two-fluid model of urban traffic ( 5 ). Our results showed, first, that traffic in Chengdu indeed convincingly exhibits the relationship postulated by the two-fluid theory with an average
The presented analysis will be developed further in several directions to address its limitations. First, the data is still biased, likely because it stems from a single source, namely ride-hailing vehicles. It is plausible that ride-hailing vehicles tend to travel more within entertainment and restaurant areas, causing a biased estimate of the average speed for the whole network. To address this, we plan in the next steps to introduce more data sources, such as loop detector data, and to use data fusion techniques to understand and then minimize the potential biases. In future research, we will also extend the sample not only to a longer time period but also to more cities. This will enable further study of how representative TNC vehicles are as a sensor for traffic management and how stable and recurrent congestion patterns are across cities.
In closing, our analysis has three important implications. First, the results presented in this paper contribute to recent work on estimating the parameters for the two-fluid theory model. On the one hand, these results support earlier findings that taxi GPS data can indeed be used for two-fluid theory model parameter estimation and subsequent network monitoring ( 7 ), but at a much larger scale. Building on the multi-modal extension of the two-fluid model presented by Paipuri et al. ( 8 ), our findings (robust parameters and predictable seasonality) underline our motivation that such taxi vehicles can be used as moving sensors to inform about the multi-modal traffic state once a multi-modal speed model like the two-fluid model is calibrated, for example, based on drone data. This link will be investigated in future research. Second, having a large fleet of moving sensors is a promising tool for monitoring and predicting the performance of urban road networks. Our results have shown that the data from on-demand mobility services can be used to inform about the network-wide traffic state in its dynamics and stability of patterns over time. Although we lack a ground truth reference to assess the representativeness of the data, we can infer that revealed patterns can be used for predicting comparative traffic patterns. Consequently, traffic management centers should have an interest in obtaining such trajectory data for improving their traffic state estimation, in particular when complemented with a pattern prediction. As many cities already have a fleet of vehicles for many services, they could in principle rely on them as moving sensors if they do not wish to rely on (commercial) on-demand mobility trajectory data. 
Third, as trajectory data is now available at a large scale to estimate the two-fluid models in almost every city, this model should be further exploited to understand which factors drive the network performance, similar to an analysis based on stationary detector data ( 11 ).
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: Y. Zhang, A. Loder, F. Rempe, K. Bogenberger; data collection: Y. Zhang; analysis and interpretation of results: Y. Zhang, A. Loder; draft manuscript preparation: Y. Zhang, A. Loder. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Yunfei Zhang acknowledges the support from the German Federal Ministry for Digital and Transport (BMVI) for the funding of the project TEMPUS (Testbed Munich - Pilot test of urban automated road traffic), grant no. 01MM20008K. Allister Loder acknowledges the support from the German Federal Ministry for Digital and Transport (BMVI) for the funding of the project KIVI (Artificial Intelligence in Ingolstadt’s Transportation System), grant no. 45KI05A011.
