Abstract
Urban rail transit systems play an essential role in improving mobility and efficiency. The Boston metropolitan area, U.S., is served by a complex rail transit network that consumes 422 GWh of electricity annually at a cost of $38 million. With the aim of developing a decision-support tool for reducing energy use and cost, we propose a comprehensive machine learning framework to investigate line-specific contributions to system energy consumption. This effort builds on prior work in estimating a system-wide energy model for the Boston network. By introducing line-specific train movement and operation variables, we obtain a higher-performing model
With the rapid growth in urbanization worldwide, mobility demand is on the rise in many countries (
Related Work
Several research efforts have investigated models to reduce system-wide energy consumption in URTs. Generally, URT system energy consumption is affected by several variables, such as ridership, temperature, and train schedules, among others (
At the URT system level, however, there have been relatively few quantitative energy modeling efforts. Notably, linear regression and random forest (RF) models have been shown to be effective in accurately predicting system-wide URT consumption, with Boston used as a case study (
We organize the rest of the paper as follows. In the next section, we describe the data sources used in this study. In the Methods section, we describe the framework for generating the line-specific variables and explain the flow of aggregating data. Then, we discuss the models we used to extract features and estimate system-wide energy. In the Results section, we interpret the model parameters and analyze the energy contributions from different categories. Finally, we conclude with key findings and future directions for research.
Data
Study Area
The urban rail transit (URT) system of the Massachusetts Bay Transportation Authority (MBTA) served as the case study. Considered the fourth largest transit agency in the U.S. by ridership, MBTA serviced 1.7 billion passenger miles in the Boston area in 2019 (
Variables and Sources from January 2019 to December 2020
Train Location
We obtained timestamped train location data from the MBTA-MIT research database for the years 2019 and 2020 (
Number of Operating Trains on All Lines in the System

Timestamped locations of trains for the entire day on April 18, 2019: (
Energy
Hourly energy consumption data from 2019 to 2020 were captured by MBTA’s energy meters across the network (Figure 2). The system’s hourly energy consumption peaks during the winter months (December to February), averaging 53 MWh, compared with an hourly average of 44 MWh during the non-winter months. This seasonal pattern (greater energy consumption in the winter than in the summer) suggests that a significant amount of energy is consumed by heating, ventilation, and air conditioning (HVAC) systems.

Heatmap of system hourly energy consumption in the Massachusetts Bay Transportation Authority urban rail transit system from 2019 to 2020.
Hourly energy consumption also peaks twice a day (Figure 2), mirroring passenger travel patterns and train schedules. The average hourly energy consumption is 54 MWh during the morning peak (7:00–9:00 a.m.) and 55 MWh during the afternoon peak (4:00–6:00 p.m.). During the overnight hours, when few trains operate, consumption is lowest, averaging only 34 MWh per hour.
Following the COVID-19 lockdown policies implemented in the spring of 2020, MBTA reduced service on the Red, Orange, and Green Lines by 20%, and on the Blue Line by 5% beginning on March 14. These operational changes contributed to a 7.6% reduction in energy consumption in 2020 compared with 2019. This consequently resulted in a decrease of 13.6% in energy costs (
Temperature and Precipitation
We obtained hourly temperature and precipitation in the Boston area from 2019 to 2020 from the Metometics database (

Hourly temperature in Boston area from 2019 to 2020.

Hourly precipitation in Boston area from 2019 to 2020.
Ridership
We obtained system hourly ridership for 2019 and 2020 from farecard tap-ins recorded in the MBTA-MIT research database (Figure 5). Before the COVID-19 pandemic, the average hourly peak ridership was 15,553. The purple dashed vertical line in Figure 5 indicates the start of the government-instituted COVID-19 lockdown policy, which led to reduced demand and, consequently, reduced service. Although ridership rebounded after operations resumed in July 2020, the annual hourly peak ridership observed after the lockdown was 81% lower than in the pre-pandemic period.

Hourly system ridership in the Massachusetts Bay Transportation Authority (MBTA) urban rail system from 2019 to 2020.
Methods
We computed line-specific trajectories of distance, speed, and acceleration from the high-resolution timestamped locations. From these, we computed equal-probability bin-time variables for speed, acceleration, and their interaction at the hourly level to obtain a set of tractive variables. We then aggregated the non-tractive variables at the hourly level and integrated the tractive and non-tractive variables into a single input data set. After using RF models to identify and remove insignificant variables, we estimated a ridge regression model with the final set of inputs to obtain an interpretable system energy model with line-specific components. The methods are detailed in the following subsections.
Trajectory Computation
First, we extracted a subset of the train coordinate data set by unique train identification with identical train numbers, vehicle numbers, line labels, and related indicators. Then we sorted the train location by traveling time and computed the operating intervals between two consecutive time location records. We used the Haversine distance function to determine the distance
where
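As a minimal sketch of the trajectory step described above, the following computes the Haversine distance between consecutive location records of one train and derives interval speed and acceleration from it. The Earth radius value, the record layout `(t_seconds, lat, lon)`, and the finite-difference definition of acceleration are assumptions made for illustration, not details confirmed by the text.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trajectory(records):
    """records: list of (t_seconds, lat, lon) for a single train (hypothetical layout).

    Returns one dict per pair of consecutive records with the operating
    interval (s), distance (m), speed (m/s), and acceleration (m/s^2).
    Acceleration is a simple finite difference of successive interval speeds
    (an assumption) and is None for the first interval.
    """
    records = sorted(records)  # sort by time stamp
    out = []
    prev_speed = None
    for (t0, la0, lo0), (t1, la1, lo1) in zip(records, records[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate time stamps
        dist = haversine_m(la0, lo0, la1, lo1)
        speed = dist / dt
        accel = None if prev_speed is None else (speed - prev_speed) / dt
        out.append({"dt": dt, "dist": dist, "speed": speed, "accel": accel})
        prev_speed = speed
    return out
```

In practice the records would first be grouped by the unique train identification described above, so that `trajectory` is called once per train.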
Speed and Acceleration Binning
The tractive energy consumption of a train is governed by the magnitude of its velocity and acceleration (
More strictly, the
where
inf = the infimum (greatest lower bound) of the set of values
Using line-specific quantiles
where
We tested the performance of the RF model (see description in following section) with different numbers of bins as shown in Table 3. For efficiency, we constrained our search to identical bin numbers
Random Forest Model Performance under Different Numbers of Bins.

Cumulative distribution function of the computed speeds based on successive observations for each train in 2019.

Cumulative distribution function of the computed accelerations based on successive observations for each train in 2019.
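The equal-probability binning described in this subsection can be sketched as follows. The sketch assumes the empirical quantile is taken as the infimum of the values whose cumulative probability reaches the target level, and that each observation carries the length of its operating interval in seconds; the function names and the sample layout are hypothetical.

```python
from bisect import bisect_left


def equal_probability_edges(values, n_bins):
    """Interior bin edges at probabilities j/n_bins, j = 1..n_bins-1.

    Each edge is the empirical quantile inf{x : F(x) >= j/n_bins},
    i.e. the order statistic of rank ceil(n * j / n_bins).
    """
    s = sorted(values)
    n = len(s)
    edges = []
    for j in range(1, n_bins):
        rank = -(-n * j // n_bins)  # integer ceiling of n * j / n_bins
        edges.append(s[rank - 1])
    return edges


def bin_index(x, edges):
    """0-based bin index: bin k holds values with edges[k-1] < x <= edges[k]."""
    return bisect_left(edges, x)


def interaction_bin_time(samples, speed_edges, accel_edges):
    """samples: iterable of (speed, accel, dt_seconds) for one line and one hour.

    Returns {(speed_bin, accel_bin): hours spent in that joint bin}.
    """
    hours = {}
    for speed, accel, dt in samples:
        key = (bin_index(speed, speed_edges), bin_index(accel, accel_edges))
        hours[key] = hours.get(key, 0.0) + dt / 3600.0
    return hours
```

Computing the edges separately for each line gives line-specific quantiles, and six bins per variable yields the 36 speed-acceleration interaction terms per line reported in the Results.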
Train Operation Variables
We computed the hourly number of operating trains on each line. Figure 8 shows the variation of train operation variables on each line. The Green Line has the busiest schedule, operating an average of 66 trains per hour. Figure 8a shows how the number of hourly operating trains differs across lines: on average, twice as many Green Line trains operate per hour as on each of the other three lines. Figure 8, b and

Time series plots of: (
Variable Integration
In the previous steps, we obtained the timestamped vehicle-distances and operating times between consecutive location records. We summed these measurements at the hourly level to obtain hourly vehicle-distances and operating times. The binning process generated bin-time variables, which indicate the operating time spent in each speed, acceleration, and interaction interval. We also computed hourly operation variables. We then integrated all hourly variables into a single input matrix for model estimation (summarized in Table 4). Overall, 218 variables were processed from the raw data, of which the speed bin-time, acceleration bin-time, speed-acceleration interaction-time, and fundamental train movement variables are line-specific.
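The hourly aggregation and integration can be illustrated with a short sketch. It assumes that each interval is assigned to the hour in which it starts and that the non-line-specific variables are already keyed by hour; the column naming scheme (for example `Red_vehicle_km`) and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly(intervals):
    """intervals: iterable of (line, start_time, dist_m, dt_s).

    Sums vehicle-distance and operating time per (hour, line). Each interval
    is attributed to the hour containing its start time (an assumption).
    """
    totals = defaultdict(lambda: {"vehicle_km": 0.0, "operating_h": 0.0})
    for line, start, dist_m, dt_s in intervals:
        hour = start.replace(minute=0, second=0, microsecond=0)
        row = totals[(hour, line)]
        row["vehicle_km"] += dist_m / 1000.0
        row["operating_h"] += dt_s / 3600.0
    return dict(totals)


def to_input_rows(hourly, non_tractive):
    """Merge line-specific hourly totals with non-line-specific hourly
    variables into one wide row per hour."""
    rows = defaultdict(dict)
    for (hour, line), vals in hourly.items():
        for name, value in vals.items():
            rows[hour][f"{line}_{name}"] = value
    for hour, vals in non_tractive.items():
        rows[hour].update(vals)
    return dict(rows)
```

The bin-time variables would be merged into the same wide rows in the same way, one column per line and bin.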
Summary of Variables Aggregated at the Hourly Level
Random Forest (RF) Model
We use the RF model to determine the features most relevant to hourly energy consumption (
Ridge Regression Model
Ridge regression is a linear modeling approach that employs regularization to mitigate overfitting, especially where there is a high number,
where
Model predictions
where the intercept is estimated by the mean response
We selected the optimal regularization parameter
We used 70% of the data to train the RF model, selecting its optimal parameters and the variables most important for energy consumption. The same training set was then used to estimate the ridge regression energy model. Finally, we used the remaining 30% of the data to test model performance.
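To make the estimation procedure concrete, the sketch below splits the data 70/30 and fits a ridge regression in closed form on centred predictors, so that the intercept equals the mean response as stated above. The random shuffling, the absence of predictor scaling, and the regularization value passed as `lam` are assumptions; the text does not specify how the split was drawn.

```python
import random


def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle row indices and split them into train and test sets."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * len(rows)))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_ridge(X, y, lam):
    """Minimise ||y - ybar - Xc b||^2 + lam * ||b||^2 with Xc the centred X.

    Returns (intercept, coefficients, column means); with centred
    predictors the intercept is the mean response.
    """
    n, p = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(p)]
    ybar = sum(y) / n
    Xc = [[row[j] - mu[j] for j in range(p)] for row in X]
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0) for j in range(p)]
         for i in range(p)]
    b = [sum(r[i] * (yi - ybar) for r, yi in zip(Xc, y)) for i in range(p)]
    return ybar, solve(A, b), mu


def predict(model, X):
    """Predict responses for rows of X from a fitted (intercept, coefs, means)."""
    ybar, beta, mu = model
    return [ybar + sum(b * (x - m) for b, x, m in zip(beta, row, mu)) for row in X]
```

Setting `lam` to zero recovers ordinary least squares, while larger values shrink the coefficients toward zero, which is what stabilizes the estimates when many correlated bin-time variables are present.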
Results
Train Movement Interaction Terms
The distribution of the 36 train movement speed-acceleration interaction terms by different lines is shown in Figure 9. We observe that the trains spent more time in the low-speed bins,

Boxplots of log(speed-acceleration interaction time in hours) by speed bin (S1–S6) and acceleration bin (A1–A6).
Feature Extraction
Figure 10 shows the OOB errors for different values of the number of splitting candidates,

Out-of-bag errors of random forests in the training set for different numbers of estimators
We ranked the importance scores of all variables and then computed the cumulative variable importance (as shown in Figure 11). Cumulative importance increases rapidly over roughly the first 35 variables, after which the curve begins to flatten, indicating that a small set of the most significant variables accounts for most of the effect on energy consumption. The RF model filtered out the four insignificant variables (

Variable importance of random forests in the training set.
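The ranking and cumulative importance computation can be sketched as follows, taking the importance scores from a fitted RF model as given. The threshold used here to drop insignificant variables is hypothetical; the criterion actually used is not recoverable from the text.

```python
def cumulative_importance(importances):
    """importances: {variable_name: score}.

    Returns a list of (name, score, cumulative_share) sorted by
    descending score, where cumulative_share runs up to 1.0.
    """
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    out, running = [], 0.0
    for name, score in ranked:
        running += score
        out.append((name, score, running / total))
    return out


def drop_insignificant(importances, min_score=0.0):
    """Keep variables whose importance exceeds min_score (hypothetical rule)."""
    return [name for name, score in importances.items() if score > min_score]
```

Plotting the third element of each tuple against the rank reproduces a cumulative importance curve of the kind shown in Figure 11.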
Line-Specific System Energy Model
We estimated a line-specific ridge regression model based on the extracted 214 variables. We also assessed its performance on the test set. Figure 12 displays the relationship between predictions and observations. Our model explains 91% of the variance in the training set. In measuring the model performance on the test set, we used the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). We obtained an RMSE of 2.46 MWh and MAPE of 3.95% on the test set, which indicates our model reliably tracks the system energy in the MBTA URT network.

Ridge regression model performance.
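For reference, the performance measures used here can be computed as in the following sketch, which uses the standard definitions of RMSE, MAPE, and the coefficient of determination. The function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the response (here MWh)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires nonzero y_true)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

MAPE is well defined here because hourly system energy is always positive.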
Our analysis shows that the hourly energy consumption baseline is 39 MWh, possibly attributable to factors such as signals and station operations. Furthermore, temperature has a negative coefficient, indicating that, on average, energy consumption increases as temperatures drop (Figure 13). This highlights the importance of HVAC systems. The winter months (December through March) contribute significantly to energy consumption, whereas the non-winter months, on average, reduce it (Figure 13). The pattern of the monthly indicators mirrors that of temperature (Figure 3) and precipitation (Figure 4), further highlighting the significant contribution of HVAC to overall system consumption, especially during colder periods.

Coefficients of non-line-specific terms (variables named by month stand for the monthly indicators).
Furthermore, we observed a small, negative coefficient for ridership. This can be attributed to the significant decline in ridership following the COVID-19 outbreak, as depicted in Figure 5. Despite the full restoration of train operations (Figure 8) in July 2020, ridership remained at low levels until December 2020. Overall, these findings suggest that both non-tractive factors and the COVID-19 pandemic had a notable impact on energy consumption in the Boston area.
The coefficients of train movement (tractive energy) variables provide a comprehensive overview of how energy consumption varies across different lines in the system. The hourly operating train numbers and operating distances have minimal effects on energy consumption in our model, as tractive energy is explained by the speed and acceleration bin-time variables. Our analysis shows that the Blue Line variables have negative effects on system energy. This could indicate that more Blue Line trains are equipped with regenerative braking compared with other lines, as illustrated in Figure 14. In addition, the average interval speed reflects the average speed of each line over an hour, with small coefficients below 0.006. Similarly, the coefficients for the average hourly speed, which represents instantaneous speed, are also small, with the Red Line having negative effects on energy consumption. This finding could be because the Red Line has the highest speed among the four lines, as depicted in Figure 6.

Coefficients of line-specific variables.
The interaction terms can better explain the relationship between system energy consumption and line-specific train movement variables. We found few trains operated within
We also observe that interaction bin-time coefficients of the Blue Line at low speeds and large decelerations, such as
Observed Line-Specific Operation Variables and Predicted Energy Intensity
Analysis of Energy Contributors
We computed the hourly energy contributions based on the ridge regression predictions and model coefficients, as shown in Figure 15 and Table 6. The non-line-specific variables (temperature, ridership, monthly indicators, and precipitation) contribute an average hourly energy of −5.4 MWh (−2.7%). We observed that the energy contribution pattern of non-train movement variables reflects strong seasonality. The intercept of 39 MWh represents the baseline energy consumption (lights, heating, and cooling, among others).

Time series plots of: (
Hourly Energy Contributions across Different Variable Categories
We also captured the energy contributions from line-specific variables as shown in Table 6. Of the four lines, the Red Line accounted for the highest contribution (4.53 MWh) to the average hourly energy consumption. Although the Green Line had the busiest schedule, as shown in Figure 8, it had the second highest contribution from the lines. On average, the Blue Line-specific variables contributed 0.65 MWh to the hourly energy—the lowest of the four. This may be because the Blue Line has the largest proportion of trains with regenerative braking.
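Because the energy model is linear, its prediction for any hour can be split into additive contributions by variable category. The sketch below illustrates this decomposition under the assumption that the prediction has the form intercept plus the sum of coefficient times variable; the variable names and category labels in the usage example are hypothetical.

```python
def contributions_by_category(intercept, coefs, row, category_of):
    """Split one hour's linear prediction into per-category contributions.

    coefs, row: {variable: value}; category_of: {variable: category label}.
    The returned parts sum to the model prediction for that hour.
    """
    parts = {"baseline": intercept}
    for name, beta in coefs.items():
        cat = category_of[name]
        parts[cat] = parts.get(cat, 0.0) + beta * row[name]
    return parts


def mean_contributions(hourly_parts):
    """Average each category's contribution over a list of hourly results."""
    cats = {c for parts in hourly_parts for c in parts}
    n = len(hourly_parts)
    return {c: sum(parts.get(c, 0.0) for parts in hourly_parts) / n for c in cats}
```

For example, with made-up coefficients and one hour of inputs:

```python
coefs = {"temperature": -0.2, "Red_S1A6": 3.0, "Blue_S1A1": -1.0}
row = {"temperature": 10.0, "Red_S1A6": 1.5, "Blue_S1A1": 0.5}
category_of = {"temperature": "non-line-specific", "Red_S1A6": "Red", "Blue_S1A1": "Blue"}
parts = contributions_by_category(39.0, coefs, row, category_of)
```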
Conclusion
The main goal of this study is to explore how the factors of different lines influence system-wide energy consumption. We obtained data on hourly temperature, hourly precipitation, tap-in ridership, and train coordinates from various data sources. To explore the endogenous relationship between energy and train movement, we created line-specific bins that represent train movement under various combinations of speed and acceleration. We trained an RF model on 70% of the data to extract the most relevant variables and then estimated a ridge regression model with the selected variables on the same training set. The model had an
According to Figure 14, all lines exhibit the highest unit energy consumption under high accelerations at low speeds, except for the Red Line (and only marginally so, as its trains spent comparatively much less time in this regime). On the other hand, all lines had significant negative effects on system consumption when accelerating at higher speeds. These observations indicate that energy consumption during acceleration can be divided into two periods: the start-up period initially requires more energy, but, as the trains continue to accelerate, energy consumption gradually decreases. System energy consumption is more sensitive to the interaction terms than to system statistics such as the overall number of operating trains and hours of operation.
The strong seasonality of energy consumption captured in Figures 2 and 15 motivated us to explore how energy varies over time. Ongoing research focuses on estimating a time series model that forecasts energy from historical energy observations at different time scales, depending on the lag defined in the model. In future work, we plan to train a deep-learning framework to map the low-level movement variables to strategic planning metrics, which could potentially be used to generate synthetic data.
Our findings have a few potential policy implications. Our model results indicate that non-movement factors are the most significant contributors to system energy consumption. Thus, policymakers might more successfully reduce energy use by focusing on strategies that limit energy use for heating, lighting, signaling, and other non-train operations, instead of reducing transit service. The results of this study specifically reveal that the system consumes more energy in the winter for heating, which could be reduced by installing more insulation. Moreover, the model results indicate that the Blue Line, which has more trains with regenerative braking, made the lowest contribution to system energy consumption. Increasing the number of trains with regenerative braking may reduce energy consumption.
The data-driven modeling framework developed in this study can be applied to other URT systems. To predict system energy consumption of a URT, the model relies solely on train location and system-wide energy data, as well as the non-line-specific variables listed in Table 4. These variables are readily available from industry data sets, making the prediction process straightforward. The model must be robust enough to provide accurate information about the energy contributions from all components in the system. The model is also able to capture variations in energy consumption caused by factors such as changes in train operations or weather conditions. By carefully designing and testing such a model, we can ensure that it provides accurate and reliable information about energy usage in URT systems, thus enabling more efficient operation and reduced energy consumption. To enhance the interpretability of the model, we will investigate the relationship between the physical model and the estimated machine learning model. With this approach, the model will not only accurately predict system energy but also better identify the contributing sources. In future research, we plan to validate the impacts of regenerative braking trains on system energy consumption. To achieve this, we will collaborate with URT operators to obtain further line-specific energy data to validate our model.
Footnotes
Acknowledgements
We would like to thank Sean Donaghy of MBTA for providing the data used in this study.
Author Contributions
The authors confirm their contribution to the paper as follows: study conception and design: J. Oke, E. Christopha, E. Gonzales, Z. Han; data collection: Z. Han; analysis and interpretation of results: Z. Han, J. Oke, E. Christopha, E. Gonzales; draft manuscript preparation: Z. Han, J. Oke. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Data Accessibility Statement
