Abstract
For traffic management entities, the ability to forecast traffic patterns is crucial to their suite of advanced decision-making solutions. The inherent unpredictability of network traffic makes it challenging to develop a robust predictive model. For this reason, this paper leverages a spatiotemporal graph transformer equipped with an array of specialized experts to ensure more reliable and agile outcomes. In this method, the Louvain algorithm, combined with a temporal segmentation approach, partitions the overarching spatial graph structure of the traffic network into a series of localized spatio-temporal subgraphs. Multiple expert models are then obtained by pre-training on each subgraph with a spatio-temporal synchronous graph transformer. Finally, the expert models are fused through fine-tuning to obtain the final predicted value, which ensures reliable forecasts while reducing computational time, and the method demonstrates superior predictive capability compared with other state-of-the-art models. Results from simulation experiments on real PeMS datasets validate its enhanced performance metrics.
Keywords
Introduction
Given its fundamental role in people’s daily activities, transportation also exerts a substantial influence on environmental conditions [1]. As the number of cars and drivers has swelled, the problems of traffic congestion and road safety have become increasingly severe. To address this, many countries are committed to developing intelligent transportation systems (ITS) to achieve efficient traffic management [1]. Traffic control and guidance are the keys to ITS, and traffic prediction is the prerequisite for scientific management and control [2]. However, traffic network data exhibit strong temporal and spatial correlation and nonlinearity, which makes it challenging to establish accurate traffic prediction models.
With the deepening of research on traffic prediction algorithms, researchers have proposed plenty of high-performance prediction models based on deep neural networks, which can mine complex nonlinear relationships from large amounts of historical data, thereby achieving higher prediction accuracy and stronger generalization ability [3, 4]. For instance, Yu et al. [5] characterized the traffic flow and speed data of a traffic network as a static image, captured the spatio-temporal correlation through a spatio-temporal recurrent convolutional network, and verified its superior performance on a traffic network in Beijing. Wu et al. [6] introduced an advanced predictive model for traffic flow that integrates various deep learning techniques. The model harnesses Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) to explore the intricate spatial and temporal dimensions of traffic data. It synthesizes historical traffic metrics, including velocities and traffic volumes, with the aid of attention mechanisms, effectively highlighting the DNN-BTF model’s capacity to tackle predictive challenges. Yang et al. [7] put forth an advanced LSTM framework, crafted to elevate the performance of traffic flow forecasting methodologies, which mines extremely long-distance temporal correlations through attention, effectively improving the memory ability of the LSTM model. Yang et al. [8] introduced a ranking system based on ideal-solution similarity to differentiate road segments into distinct categories. Following this, they employed convolutional LSTM networks for spatiotemporal data mining of pivotal road segments, which allows accurate prediction of their diverse states. Zhang et al. [9] crafted a specialized CNN for anticipating short-term traffic patterns by analyzing the data’s spatio-temporal progression.
The system selects pertinent features through CNN-based mining, thereby boosting the predictive power of the forecasting model. Zhao et al. [10] employed hierarchical clustering to segment traffic flow data into distinct groups, followed by an analysis of spatial correlations among road networks and segments within these groups using the conventional Euclidean space framework. By pinpointing the top-k road segments most strongly associated with the segment of interest, the LSTM is fed features that boost its forecasting precision. Zhang and Zhang [11] fused the predictive capabilities of LSTM networks with the robustness of XGBoost’s ensemble learning to estimate forthcoming traffic volumes, thereby circumventing the overfitting tendency inherent in LSTMs and bolstering predictive performance across various scenarios. Cai et al. [12] utilized correlation entropy as a robust loss function for LSTM, aimed at mitigating the impact of non-Gaussian noise on short-term traffic flow predictions and improving the model’s noise immunity. Xia et al. [13] combined distributed modeling frameworks with LSTM networks to address the training and deployment difficulties caused by large traffic data, improving the efficiency and usability of projecting near-future traffic patterns. Zhang and Jiao [14] implemented a gated convolutional module with an array of kernel sizes to unearth the temporal and spatial interdependencies in historical traffic datasets. They also crafted an attention mechanism that incrementally augments the model’s width to assign importance to key hidden features, maintaining high accuracy at a relatively low computational cost. Fang et al. [15] enhanced their LSTM model for predicting short-term traffic flows by embedding an attention mechanism. This addition enables the model to discern and emphasize key informational inputs, leading to more accurate predictive outcomes.
Standard convolutional and recurrent neural network algorithms are designed for data within Euclidean domains and are not suitable for the graph-based data of complex traffic networks, which exist in non-Euclidean spaces. Graph neural networks [16, 17], however, can adeptly process this type of data by leveraging various aggregation methods to discern the relationships between nodes and extract underlying features. Their ability to represent the spatial connections within traffic networks makes them well suited for data mining tasks in non-Euclidean contexts. For example, Yu et al. [18] crafted an STGCN for traffic forecasting, leveraging the model’s ability to capture spatial and temporal dependencies within traffic data. It mined the spatiotemporal correlation of road network information by stacking gated convolutional networks and graph convolutional network structures, and it outperformed an ensemble CNN-RNN model in forecasting accuracy, reflecting its enhanced predictive capabilities. Guo et al. [19] introduced an attention mechanism into ASTGCN for the first time to perform traffic flow predictions. They dissected spatio-temporal correlations through three distinct temporal branches and employed attention to weigh the significance of hidden features across each branch’s layers, which resulted in higher prediction accuracy. Zhao et al. [20] presented a novel neural network for traffic prediction that synergizes GCN with GRU within the T-GCN framework, adeptly seizing the evolving dynamics within traffic datasets and outperforming other advanced models. Bai et al. [21] designed a module that adaptively learns each spatial node and applied it to a graph convolutional recursive network to generate an adaptive-learning graph convolutional framework (AGCRN) for anticipating traffic patterns, allowing the model to automatically capture fine-grained traffic spatio-temporal correlations. Zheng et al.
[22] crafted the GMAN framework, which incorporates an encoder-decoder approach to project the evolution of traffic patterns over differing time spans. The model fuses spatial and temporal attention with a gating technique to enhance the significance of spatiotemporal embeddings, demonstrating effectiveness in long-term predictive tasks through real-data trials. Song et al. [23] developed a groundbreaking framework known as STSGCN, designed to address the complexities of spatial-temporal dynamics in traffic flow prediction through a synchronized graph convolutional approach, thereby markedly enhancing predictive precision over methods that analyze these correlations asynchronously. Wang et al. [24] unveiled an innovative strategy employing a multi-graph adversarial neural network for the autonomous detection of spatial-temporal features in traffic data. This technique allows real-time extraction of these states and the subsequent generation of traffic forecasts constrained by the GAN framework. Yin et al. [25] introduced an innovative traffic forecasting framework known as MASTGN. The model adopts an encoder-decoder structure and mixes three forms of attention (spatial attention, internal attention, and temporal attention) to integrate hidden features from different angles, achieving very high accuracy. Zhang et al. [26] crafted a unique spatiotemporal graph attention network for forecasting traffic flow, capable of unearthing both global and local spatial interactions and incorporating various levels of temporal dynamics. Moreover, by tapping into the traffic data’s semantic nuances, it secures remarkable outcomes in predictive analytics. Li et al. [27] engineered a pioneering model for understanding the spatial-temporal patterns present in traffic data, adeptly visualizing the temporal and spatial features, fully harnessing the natural connections of time and space, and markedly improving the accuracy of traffic flow forecasts.
Na et al. [28] developed an adaptive approach for computing adjacency matrices that, in conjunction with graph convolutional networks, adeptly uncovers the temporal variations in the spatial relationships of road networks. It outperforms conventional fixed-matrix methods for local hidden feature aggregation in both accuracy and adaptability. Ni and Zhang [29] employed a multi-graph framework to depict the transportation network, then used an interpretable spatiotemporal graph convolutional network (STGMN) to mine hidden feature information, and elevated the network’s depth by stacking additional layers within a residual framework; its prediction results have advantages over previously proposed advanced models. Yin et al. [30] combined spatiotemporal graph neural networks and transfer learning to mine the spatiotemporal traffic patterns of specific nodes, and introduced a clustering mechanism to elevate the predictive capability for the intended target. Jin et al. [31] designed a transformative traffic prediction model known as Trafformer, which combines spatial and temporal insights into a single transformer model adept at uncovering complex dependencies across space and time. Yu et al. [32] took into account the diverse spatiotemporal dynamics in traffic forecasting by employing a causally driven spatiotemporal synchronous graph convolutional network to uncover spatial-temporal relationships, which led to superior predictive outcomes. Chen et al. [33] derived adjacency matrices from traffic flow data; leveraging the power of attention mechanisms, they constructed a transformer encoder in tandem with graph convolutional networks to act as a proficient feature extractor for traffic’s spatial-temporal correlations, augmenting the model’s forecasting efficacy.
Liu [34] combined SAE, GCN, and BiLSTM to predict the passenger flow of urban rail transit and evaluated the approach on real data at different granularities, demonstrating high accuracy and good robustness.
Despite the applicability of existing forecasting models to data from complex traffic networks, there remains a need to address the issues of increasing calculation accuracy and decreasing computation time. These issues mainly comprise three parts. 1) Creating a localized spatiotemporal graph allows a more nuanced representation of the intricate spatial and temporal dynamics within traffic data, but the number of nodes in each local spatiotemporal graph is multiplied relative to the original graph, resulting in a significant increase in calculation time. 2) Traffic monitoring sensors can detect and record various indicators of traffic conditions, encompassing flow, occupancy, and speed. How to effectively use the spatiotemporal dependence of this information to enhance the precision of the predictive model is of utmost importance. 3) When leveraging a graph neural network for the concurrent extraction of temporal and spatial correlations, it is essential to account for the ancillary data among nodes across time and space to accurately aggregate their latent representations. To solve these problems, the current research designs a spatiotemporal synchronous graph transformer with a mixture of experts (MOE-STSGFormer) for anticipating traffic flow. The innovative points of this research include:
Firstly, by combining the Louvain algorithm with a local time sliding window, the traffic network dataset is divided into several local time-gap subgraph datasets. Each subset is then pre-trained to obtain several expert models; these expert models are transferred, and the expert gating network is fine-tuned to obtain a prediction model for the entire road network graph, which effectively reduces prediction time while ensuring high prediction accuracy.
Secondly, a graph transformer network is used in each expert model; only the encoder structure is used, and the multi-head self-attention structure in the graph transformer is replaced with one driven by trainable edge information, so that both node information and edge information are considered when synchronously extracting spatiotemporal correlations. The model can thus more fully and accurately express and leverage the traffic network’s dynamic interplay of space and time.
Finally, the current research uses two real datasets from PeMS for simulation experiments, and the experimental outcomes show that our model’s forecasting capabilities surpass those of current state-of-the-art predictive models.
Preliminary
Traffic flow forecasting can be envisioned as the anticipation of future sequences, each influenced by multiple variables, where the data come from multiple traffic nodes on the road network. Under this assumption,
In addition, we have defined some of the concepts used in the method, as shown below. Traffic network data can be represented by an undirected graph

The structure of MOE-STSGFormer.
To ensure high prediction accuracy and address the considerable training difficulty introduced by using local space-time graphs for feature extraction, this paper designs the MOE-STSGFormer method for short-term traffic forecasting tasks. Figure 1 illustrates that the technique is fundamentally made up of several stages: constructing local spatio-temporal subgraphs, pre-training, and fine-tuning. Firstly, the Louvain algorithm and a local time sliding window are combined to reconstruct the historical input features into multiple local time-gap subgraphs. Then, the transformer network is used for pre-training, and each model is saved and defined as an expert model. Finally, the final predicted value is obtained by combining all the fixed-parameter expert models with a fine-tuned gating network trained on the historical input features. The framework of this model is described in detail below.
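The three-stage pipeline above can be sketched end to end. The following is a minimal toy illustration, not the paper's implementation: the Louvain partition is replaced by a fixed node split, and each "expert" is a stub predictor, but the partition / pre-train / freeze-and-gate flow mirrors the description.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((48, 12))                 # toy history: (time steps, nodes)

# Stage 1 (stub): Louvain would partition the 12 nodes by modularity;
# here we simply split them into 3 fixed groups for illustration.
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

# Stage 2 (stub): "pre-train" one expert per subgraph.  A real expert is a
# spatio-temporal graph transformer; this toy expert memorises the node
# means of its own subgraph and predicts zero for all other nodes.
def make_expert(nodes):
    means = X[:, nodes].mean(axis=0)
    def expert(_history):
        out = np.zeros(X.shape[1])
        out[nodes] = means
        return out
    return expert

experts = [make_expert(g) for g in groups]

# Stage 3: freeze the experts and fine-tune only a gate that weights
# and sums their full-graph outputs.
gate = np.array([1.0, 1.0, 1.0])                 # trainable in practice
weights = np.exp(gate) / np.exp(gate).sum()      # softmax over experts
pred = sum(w * e(X) for w, e in zip(weights, experts))
assert pred.shape == (12,)
```

With equal gate weights, each node's prediction is one third of its historical mean, since exactly one expert covers each node; training the gate would learn how strongly to trust each expert per input.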
To segment the optimal set of subgraph structures, this paper first cites a general standard for evaluating the rationality of community segmentation: modularity. The principle is the difference between the module cohesion of a given segmentation result and the cohesion of a random segmentation. The calculation process is as follows:
The Louvain algorithm [35] searches for the optimal community segmentation based on modularity. The algorithm first sets the resolution and selects the sampling interval. Each node in the network is assigned a different number, so that the initial segmentation contains as many subgraphs as there are vertices. When a node is added to a neighboring community, the modularity of that community changes; the modularity gain obtained is:
Each node is added to the neighboring subgraph that yields the largest positive modularity gain. If the modularity gains calculated for all surrounding subgraphs are negative, the current node is not added to any subgraph. The results obtained in the previous step are then reconstructed: the subgraphs are merged, and the original graph is converted into a new hypergraph. Each new subgraph can be regarded as a single large node, and the edge weight between two such large nodes is the cumulative weight of the edges interconnecting all nodes across the two subgraphs. After constructing the new hypergraph, the modularity change is iteratively calculated again. Steps 2–4 are repeated until the overall modularity no longer changes or the predefined iteration count is reached.
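The modularity criterion driving this segmentation can be computed directly. The sketch below implements standard Newman modularity with a resolution parameter (the function name and the toy two-triangle graph are illustrative, not from the paper); a partition that separates the two triangles scores well above the trivial single-community partition, which has modularity zero.

```python
import numpy as np

def modularity(A, labels, gamma=1.0):
    """Newman modularity Q of a partition of an undirected weighted graph.
    A: symmetric adjacency matrix; labels: community id per node;
    gamma: resolution parameter (gamma = 1 gives classic modularity)."""
    k = A.sum(axis=1)                       # (weighted) node degrees
    two_m = A.sum()                         # equals 2m for undirected graphs
    same = labels[:, None] == labels[None, :]
    return ((A - gamma * np.outer(k, k) / two_m) * same).sum() / two_m

# Toy graph: two triangles joined by a single bridge edge (2-3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

good = modularity(A, np.array([0, 0, 0, 1, 1, 1]))   # one triangle each
trivial = modularity(A, np.zeros(6, dtype=int))      # everything together
assert good > trivial                                # good = 5/14, trivial = 0
```

Sweeping `gamma` and keeping the partition with the highest Q is analogous to the resolution search reported later for PeMSD4 and PeMSD8.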
The Louvain algorithm decomposes the spatial graph structure of historical traffic data into multiple subgraph structures. Utilizing a local time sliding window, the subgraph configuration of every historical traffic dataset is reconstructed. Assume that the

The new adjacency matrix.
Considering

The structure of GSA.
The graph transformer network uses stacked graph self-attention (GSA) layers for data mining. Figure 3 displays the structure of a one-layer graph self-attention network, which calculates the spatio-temporal dependence between any two locations through linear transformations in three branches, allowing the model to more effectively seize the comprehensive details of the historical data.
With
If the layer is not the first layer, the input features consist only of node input features. The specific calculation process for the Query, Key, and Value of self-attention is as follows:
The correlation
Then, correlation
Finally, the vector features of all nodes in the next layer are obtained as the product of
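The edge-aware self-attention computation described above can be approximated as follows. This is a single-head NumPy sketch under the assumption that the trainable edge information acts as an additive bias `E` on the attention logits; the model's actual multi-head formulation may differ in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gsa_layer(H, E, Wq, Wk, Wv):
    """One graph self-attention layer over a local spatio-temporal graph.
    H: (N, d) node features for N space-time nodes;
    E: (N, N) learned edge-information scores biasing the attention logits,
    so edge information shapes how node features are aggregated."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv          # three linear branches
    logits = Q @ K.T / np.sqrt(K.shape[1]) + E  # edge-aware attention scores
    return softmax(logits, axis=-1) @ V         # weighted neighbour aggregation

rng = np.random.default_rng(1)
N, d = 8, 4                                   # 8 space-time nodes, 4 channels
H = rng.standard_normal((N, d))
E = rng.standard_normal((N, N))               # trainable in practice
W = [rng.standard_normal((d, d)) for _ in range(3)]
out = gsa_layer(H, E, *W)
assert out.shape == (N, d)
```

Each row of the softmax output sums to one, so every node's next-layer feature is a convex combination of the Value vectors, weighted jointly by node similarity and edge information.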
After the transformer prediction model corresponding to each subgraph is created through the above process, it is trained using MSE as the loss function and Adam as the parameter-update optimization algorithm. The trained parameters are then saved, and each trained model serves as an expert model in the subsequent transfer learning.
Transfer learning feeds the entire historical traffic data as input features into each trained expert model, and then weights the output features of each expert model through a gating network. Training the gating network constitutes the fine-tuning process. Finally, all the weighted output features are summed to arrive at the ultimate forecasted outcome.
The gating network consists of two fully connected layers. The first layer reduces the number of temporal channels in the input features to one through a linear mapping. The second layer, in turn, reduces the node count of the input features to the number of expert models through another linear mapping. The exact calculation process is detailed hereafter:
Where
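Under the description above, the gating computation can be sketched as two linear maps followed by a softmax that turns the per-expert scores into fusion weights. Function and variable names here are assumptions for illustration, not the paper's notation.

```python
import numpy as np

def gate_weights(X, W_time, W_node):
    """Two-layer gating network sketch.
    X: (T, N) historical input features.  The first linear map collapses
    the T temporal channels to one; the second maps the N nodes to one
    score per expert; a softmax yields the fusion weights."""
    h = W_time @ X                    # (1, T) @ (T, N) -> (1, N)
    s = (h @ W_node).ravel()          # (1, N) @ (N, E) -> (E,) expert scores
    e = np.exp(s - s.max())
    return e / e.sum()                # softmax -> weights summing to 1

rng = np.random.default_rng(2)
T, N, n_experts = 12, 10, 3
X = rng.random((T, N))
w = gate_weights(X,
                 rng.standard_normal((1, T)),        # temporal reduction
                 rng.standard_normal((N, n_experts)))  # node -> expert scores
assert w.shape == (n_experts,) and np.isclose(w.sum(), 1.0)

# Fusion: weighted sum of the frozen experts' outputs (each (N,) here).
expert_outs = rng.random((n_experts, N))
final_pred = (w[:, None] * expert_outs).sum(axis=0)
```

Because the weights are non-negative and sum to one, the fused prediction is a convex combination of the expert outputs, which keeps the fine-tuned model's behavior anchored to what the pre-trained experts learned.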
The complete simulation experiment was conducted utilizing a computer equipped with an RTX 2080Ti GPU and the model was crafted using the open-source PyTorch framework.
Data description
For the simulation aspects of this paper, we have employed two datasets that are publicly accessible through PeMS:
The PeMSD4 dataset is derived from 307 traffic sensors along 29 roads in the San Francisco Bay Area, recorded over a 59-day period from January 1, 2018, to February 28, 2018. The training data comprise 52 days, extending to February 21, 2018, and the test data comprise the last seven days of this period, ending on February 28, 2018. The PeMSD8 dataset is derived from 170 traffic sensors along 8 roads in the San Bernardino area, recorded over a 62-day period from July 1, 2016, to August 31, 2016. The training data comprise 54 days, extending to August 25, 2016, and the test data comprise the last seven days of this period, ending on August 31, 2016. This paper mainly uses k-nearest-neighbor interpolation [35] to fill missing data.
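A minimal version of the k-NN interpolation used to fill missing sensor readings might look like the following, here taking the k temporally nearest observed readings of one sensor; the actual preprocessing in [35] may define neighbors differently (e.g., over feature space rather than time).

```python
import numpy as np

def knn_impute_series(x, k=3):
    """Fill NaN gaps in one sensor's series with the mean of the k
    temporally nearest observed readings (a simple stand-in for the
    k-NN interpolation used to clean the PeMS data)."""
    x = x.astype(float).copy()
    obs = np.flatnonzero(~np.isnan(x))          # indices of observed values
    for i in np.flatnonzero(np.isnan(x)):       # each missing time step
        nearest = obs[np.argsort(np.abs(obs - i))[:k]]
        x[i] = x[nearest].mean()                # average the k neighbours
    return x

series = np.array([10.0, np.nan, 14.0, 16.0, np.nan, 20.0])
filled = knn_impute_series(series, k=2)
assert not np.isnan(filled).any()
# filled -> [10., 12., 14., 16., 18., 20.]
```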
Multiple training and verification tests were executed to pinpoint the most efficient parameters for the MOE-STSGFormer model, which are as follows: (1) the duration of the historical time window for input features is one hour, while the prediction horizon varies from 5 to 45 minutes; the time window for feature reconstruction is set at 15 minutes, with each temporal data point spaced 5 minutes apart,
Subgraphs segmentation result

The optimal modularity at different resolutions.
Utilizing the dataset’s original adjacency matrix as a foundation, the Louvain algorithm is used to segment the whole graph structure, and samples are collected within the range of 0
It can be seen that when the resolution is 0.39, the optimal modularity of the PeMSD4 dataset is obtained; in other words, at the 39th sampling, the optimal modularity value of the subgraph segmentation by the Louvain algorithm is largest, at 0.8717. When the resolution is 0.61, the optimal modularity of the PeMSD8 dataset is obtained; that is, at the 61st sampling, the optimal modularity value of the subgraph segmentation is largest, at 0.7473. Through this process, 23 subgraphs are generated from the PeMSD4 data and 12 subgraphs from the PeMSD8 data.
To establish the superiority of our model, we benchmark it against seven advanced baseline models: LSTM, GCN, STGCN, ASTGCN, STSGCN, STGMN, and Trafformer. The LSTM model is designed with a 5-layer setup, and the GCN model shares an equivalent structure with the STGCN model. The other baseline models are configured according to the descriptions provided in their references.
Performance superiority analysis
To begin with, an assessment of the precision of each predictive model is undertaken. Error metrics including Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the Coefficient of Determination (R²) are adopted.
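The three error metrics follow their standard definitions, sketched below on a toy set of true and predicted flows.

```python
import numpy as np

def mae(y, yhat):
    """Mean Absolute Error."""
    return np.abs(y - yhat).mean()

def rmse(y, yhat):
    """Root Mean Square Error."""
    return np.sqrt(((y - yhat) ** 2).mean())

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = ((y - yhat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

y    = np.array([100.0, 120.0, 90.0, 110.0])   # true flows
yhat = np.array([102.0, 118.0, 95.0, 105.0])   # predicted flows
print(mae(y, yhat), rmse(y, yhat), r2(y, yhat))
```

Lower MAE and RMSE indicate smaller errors, while R² closer to 1 indicates that the predictions explain more of the variance in the true flows.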
Three evaluation metrics of different prediction models on two data sets.
Table 1 illustrates the performance of various models as measured by MAE, RMSE, and R².
Calculation times of different prediction models.
The time required for model training and testing is also a significant metric in evaluating the model’s effectiveness. Table 2 shows the calculation time of our designed model and all baseline models, where
Referencing Table 2, it is evident that our model’s calculation time exceeds that of the LSTM, GCN, and STGCN models, because these three models are simple in structure and sacrifice prediction accuracy. When pitted against the STSGCN and Trafformer models, our model boasts a lower processing time, which indicates that our model solves the problem of increased prediction time caused by constructing local spatiotemporal graphs for synchronous spatiotemporal correlation mining.

Visualization of true and predicted traffic flow values in different traffic patterns.
Traffic data contain a plethora of spatial nodes, which may be heterogeneous. To verify that our prediction model achieves higher prediction accuracy on different types of spatial nodes, the predicted values for high, medium, and low traffic flow are compared with the real values. Figure 5 visually represents how the MOE-STSGFormer model adapts to traffic flow datasets with diverse traffic modes, ranging from high to low.
The prediction performance assessments mentioned previously were conducted with a prediction horizon of 1. To verify that MOE-STSGFormer also maintains good prediction accuracy at other prediction horizons, the model was compared with the baseline models on the two datasets in terms of MAE for prediction horizons of 1–9, i.e., 5–45 minutes. Figure 6 illustrates the outcomes of our MOE-STSGFormer model with a prediction horizon extending from 1 to 9 across the two datasets. Compared with the baseline models, our MOE-STSGFormer model shows the lowest MAE, highlighting its ability to sustain optimal prediction accuracy across diverse prediction horizons.
Evaluation metrics of prediction models with different numbers of edge information channels.

The MAE of all prediction models in different prediction horizons.
The variable
As observed in Table 3, when
Five evaluation metrics of different prediction models on two data sets.

Visualization of training time and test time.
To verify that obtaining the final predictive model through pre-training multiple expert models and fine-tuning the gating network solves the problem of difficult model training, this paper compares the predictive performance of a spatiotemporal synchronous graph transformer model (STSGFormer) trained on the entire spatial graph data with the original model (MOE-STSGFormer); the outcomes for both datasets are detailed in Table 4.
The performance of MOE-STSGFormer and STSGFormer in terms of prediction accuracy is comparable on both datasets; however, MOE-STSGFormer is notably faster in computation. To encapsulate, the approach of initially pre-training multiple expert models followed by fine-tuning the gating mechanism ensures high predictive accuracy while simultaneously simplifying the model and reducing its computation time.
In this paper, a traffic flow prediction model, MOE-STSGFormer, is proposed to address the high computing time and hardware requirements that arise when a traffic network contains too many nodes. MOE-STSGFormer uses the Louvain algorithm, based on optimal modularity, to divide the spatial graph structure of the whole traffic network into multiple subgraphs, and then reconstructs the data of each subgraph using a time sliding window. Multiple expert models are obtained through pre-training, and finally these expert models are fused through fine-tuning to obtain the final predicted value. The simulation results show that the proposed method has high prediction accuracy, reducing the error by 15%–20% compared with the best baseline model; its calculation time is much lower than that of other models that synchronously mine spatio-temporal correlations; and it is easier to train and test. Moreover, experiments show that selecting the optimal number of edge information channels improves the prediction performance of the model. In addition, experiments also verify that adding the mixture of expert models keeps prediction accuracy constant while greatly reducing calculation time and cost.
Footnotes
Funding
This research was supported by the 2022 Fujian Province Young and Middle-aged Teacher Education Research Project (Science and Technology Category) (No. JAT220470), the 2022 Xiamen Institute of Technology School-level Research Fund for Young and Middle-aged Projects (No. KYT2022004), and the College of Computer Science and Information Engineering 2021 Academic-level Research Fund Project (No. EEKY2021003).
Conflict of interest
The authors declare no conflicts of interest.
Data availability
The data used to support the findings of this study are included within the article.
