Abstract
Demand response (DR) is a strategy that encourages customers to adjust their energy usage during periods of peak demand, aiming to enhance the reliability of the power grid and reduce operational costs. An effective DR scheme coordinates both distribution system operators and consumers within the energy network. The integration of renewable energy sources into smart grids poses significant challenges due to their intermittent and unpredictable nature. DR strategies, coupled with reinforcement learning techniques, have emerged as promising approaches to address these challenges and optimize grid operations where traditional methods fail to meet such complex requirements. This article presents a reinforcement learning-based strategy to optimize DR and energy management in smart grids, focusing on battery-photovoltaic integrated systems. The proposed method employs the soft actor-critic with automated adjustment of temperature algorithm to enhance load-shifting flexibility and grid stability. Experimental results, using the CityLearn environment, demonstrate energy cost reductions of 3% and 15% compared to rule-based control and soft actor-critic-based strategies, respectively.
Introduction
Transforming energy grids to become carbon neutral requires radical changes in how energy is consumed, especially to address the fluctuating nature of wind and solar power generation. An approach to mitigating this challenge is demand response (DR), a strategy that encourages users to shift their energy consumption from low-generation periods to times when energy production is abundant (Siano, 2014). With the growing adoption of renewable energy sources (RES), such as solar and wind, both intrinsically intermittent, DR plays a vital role in balancing energy supply and demand.
DR is most often implemented in buildings. To deal with the uncertainties associated with RES and to prevent grid instability, the integration of RES into existing infrastructure must be handled with care. The complexity of building energy management (BEM) increases due to the need for adaptive load shifting in response to grid signals (Wang and Hong, 2020). Participation in DR programs enables building networks to better control energy usage, reduce operational costs, and buffer the variability introduced by renewable generation (Yang et al., 2022).
Various methods have been proposed to optimize energy consumption in grid-responsive buildings. Rule-based control techniques are frequently used in BEM systems because of their simplicity and ease of implementation (Bay et al., 2022; Ferahtia et al., 2022). Rule-based controllers, however, frequently yield suboptimal performance in dynamic environments because they rely on static thresholds and cannot adjust to real-time variations in energy demand.
Model predictive control (MPC) has also made notable contributions to both DR and BEM (Mariano-Hernández et al., 2021). Another approach formulates DR as a scheduling problem using mixed-integer linear programming, which requires detailed knowledge of the system dynamics and appliance characteristics (Henggeler Antunes et al., 2022). However, as the number of buildings increases, creating individualized energy models becomes infeasible. The complexity introduced by time-varying variables and building diversity limits the scalability of traditional model-based methods for large-scale DR applications.
With the rapid advancement of machine learning, reinforcement learning (RL) has emerged as a powerful alternative to traditional model-based approaches for solving DR problems, framing them as sequential decision-making tasks (Vázquez-Canteli and Nagy, 2019). Unlike traditional optimization methods, RL does not require pre-existing knowledge of system behavior and can be employed in a model-free way, simplifying its application in real-world scenarios.
In recent years, various RL methods have been proposed and explored for energy management tasks. These include deep Q-networks (Amer et al., 2023), the proximal policy optimization algorithm (Schulman et al., 2017), the deep deterministic policy gradient (DDPG) off-policy algorithm (Lillicrap et al., 2019), and the twin delayed DDPG off-policy algorithm (Fujimoto et al., 2018), among others. RL has been successfully implemented in developing several DR programs (Ajagekar and You, 2023; Jin et al., 2022; Kong et al., 2020; Lu and Hong, 2019; Mocanu et al., 2019; Yang et al., 2020). According to Brandi et al. (2022), an RL agent trained offline could achieve cost savings comparable to an MPC for an office building, while requiring significantly less computational time for real-time decision-making.
Despite these advances, single-agent deep RL (DRL) techniques often struggle to effectively capture the interactions between multiple buildings in DR scenarios. As the complexity and dimensionality of the environment increase, scalability becomes a major challenge. In response, multi-agent DRL has recently gained significant interest among power system researchers for its applications in distributed control and energy management within hybrid energy systems and microgrids (Karavas et al., 2015; Zeng et al., 2011). However, in DR scenarios, where energy demand and renewable generation fluctuate widely across daily and seasonal cycles, traditional RL algorithms with static parameters often lack the adaptability required to respond to these fluctuations effectively.
To address these challenges, we propose the use of the soft actor-critic with automated adjustment of temperature (SAC-AAT) algorithm (Haarnoja et al., 2019) for optimizing demand response management. SAC-AAT extends the soft actor-critic (SAC) algorithm, a state-of-the-art RL method that performs well in continuous action spaces and environments requiring robust decision-making under uncertainty. The automated adjustment of temperature in SAC-AAT improves adaptability, allowing for more flexible and responsive DR control in environments with fluctuating energy demand and variable renewable energy supply.
Methods
Energy modeling
CityLearn (Vázquez-Canteli et al., 2020) is an OpenAI Gym-based simulation framework for implementing RL algorithms in urban energy management, enabling DR by controlling energy storage across multiple buildings. It uses pre-simulated building data to model hourly energy loads, including cooling, heating, and non-shiftable loads, and provides a standardized platform for benchmarking RL algorithms. The environment can model various types of energy demands, electrical devices, energy storage systems, and electricity sources (Figure 1). The number of buildings can range from a single unit to an extensive district.

Energy models proposed in the CityLearn environment.
In this study, all energy demands, including heating, cooling, and electrical appliance usage, are aggregated into a single non-shiftable load. This approach ensures that these demands are treated as essential and must be satisfied immediately. All buildings are equipped with 4 kWp solar panels as an RES, allowing them to generate their own electricity. The system relies exclusively on batteries with a capacity of 6.4 kWh for energy storage, focusing on optimizing charge and discharge cycles to manage demand peaks, enhance renewable energy self-consumption, and maintain grid stability. The electric energy storage system operates dynamically, with charging and discharging power denoted as $P_t^{\mathrm{ch}}$ and $P_t^{\mathrm{dis}}$, respectively.
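To make the experimental setup concrete, the following minimal sketch shows how a CityLearn environment of this kind can be instantiated. The import path and constructor follow the citylearn package's API as commonly documented, and the schema name is an illustrative placeholder rather than the dataset used here.

```python
# Minimal sketch of instantiating a CityLearn environment (assumed API of the
# `citylearn` package; the schema name below is an illustrative placeholder).
from citylearn.citylearn import CityLearnEnv

env = CityLearnEnv(
    schema='citylearn_challenge_2022_phase_1',  # assumed example schema
    central_agent=False,                        # decentralized: one agent per building
)

print(len(env.buildings))        # number of controllable buildings
print(env.observation_space[0])  # per-building observation space
print(env.action_space[0])       # continuous battery action, typically in [-1, 1]
```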
RL formulation: Markov decision process (MDP)
RL is a subfield of machine learning that centers on training an agent to make sequential decisions through direct interaction with a dynamic environment. Unlike supervised learning, where the agent learns from labeled examples, RL operates in a trial-and-error fashion, allowing the agent to explore different strategies and adapt its behavior based on the outcomes of its actions (Figure 2). The DR problem in grid-responsive buildings is formulated as a multi-agent extension of the MDP and is defined by a five-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, denoting the state space, action space, transition dynamics, reward function, and discount factor, with one agent assigned to each building.

Reinforcement learning control framework for energy system.
In the proposed DR management system, the observable state of each agent combines building-specific variables, such as the building's own electrical load, solar generation, and battery state of charge, with information shared across all buildings.
In terms of shared information, the state comprises weather-related variables and their future predictions for 6, 12, and 24-hour intervals. Other shared variables include dynamic electricity prices, carbon intensity emitted during electricity production, and time-related information such as the current hour, day, and month (Table 1). The action space is continuous: at each time step, every agent selects a normalized charging or discharging rate for its battery.
Observation space for CityLearn environment.
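For illustration, a plausible specification of the shared and building-level observation variables is sketched below. The identifiers follow common CityLearn naming conventions and should be treated as assumptions rather than the exact schema used in this study.

```python
# Hypothetical observation selection; names mirror common CityLearn conventions.
shared_observations = [
    'month', 'hour',                              # time-related information
    'outdoor_dry_bulb_temperature',               # current weather ...
    'outdoor_dry_bulb_temperature_predicted_6h',  # ... and its 6/12/24-h forecasts
    'outdoor_dry_bulb_temperature_predicted_12h',
    'outdoor_dry_bulb_temperature_predicted_24h',
    'carbon_intensity',                           # kgCO2e per kWh of grid electricity
    'electricity_pricing',                        # dynamic price signal
]
building_observations = [
    'non_shiftable_load',                         # aggregated essential demand
    'solar_generation',                           # PV output of the 4 kWp array
    'electrical_storage_soc',                     # battery state of charge
]
```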
The reward function is designed to minimize electricity costs while also penalizing carbon emissions, demand peaks, and excessive battery cycling.
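As an illustration of such a reward, the sketch below combines these terms with hypothetical weights; it is not the exact CityLearn reward, but it captures the structure described above.

```python
def reward(price, carbon_intensity, net_consumption,
           w_cost=1.0, w_co2=0.5, w_peak=0.1):
    """Illustrative reward: penalize cost, emissions, and demand peaks.

    The weights and the quadratic peak penalty are assumptions made for
    exposition, not the exact reward implemented in CityLearn.
    """
    imported = max(0.0, net_consumption)       # energy drawn from the grid (kWh)
    cost = price * imported                    # electricity-bill component
    emissions = carbon_intensity * imported    # CO2 component
    peak_penalty = imported ** 2               # discourages sharp demand spikes
    return -(w_cost * cost + w_co2 * emissions + w_peak * peak_penalty)
```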
Soft actor-critic (SAC) algorithm with automated adjustment of temperature
SAC-AAT is an extension of SAC, which combines value-based methods, such as Q-learning, as critics with policy-based methods as actors and employs separate networks for the policy and value functions. SAC maximizes the reward while encouraging exploration using an entropy term. The objective function is:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big],$$

where $r(s_t, a_t)$ is the reward, $\mathcal{H}$ is the policy entropy, and the temperature $\alpha$ weights exploration against reward maximization.
Training the SAC-AAT algorithm.
To ensure the policy maintains the desired level of stochasticity, the temperature $\alpha$ is treated as a learnable parameter and updated at each gradient step by minimizing

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\big[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \big],$$

where $\bar{\mathcal{H}}$ is the target entropy (Haarnoja et al., 2019).
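A minimal PyTorch-style sketch of this temperature update is given below, following Haarnoja et al. (2019). The optimizer settings and the target-entropy heuristic are standard choices rather than values taken from this study.

```python
import torch

action_dim = 1                                  # one battery action per building
target_entropy = -float(action_dim)             # common -dim(A) heuristic
log_alpha = torch.zeros(1, requires_grad=True)  # learn log(alpha) to keep alpha > 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_prob: torch.Tensor) -> float:
    """One gradient step on J(alpha); log_prob comes from the current policy."""
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()               # current temperature value
```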
Overall, SAC-AAT enhances SAC by enabling the agent to automatically adjust its level of exploration throughout training, which can lead to better performance without the need for manual tuning of the entropy parameter. The dynamic adjustment of exploration is particularly useful in environments where the optimal level of exploration changes as the agent learns.
SAC-AAT model for DR management
Figure 3 illustrates the system workflow of the SAC-AAT controller, aimed at optimizing energy management in smart buildings by adaptively controlling energy storage. The workflow begins with the environment, where agents interact with buildings. At each time step, agent $i$ observes its local state, selects a battery charging or discharging action, and receives the resulting reward from the environment.

Soft actor-critic with automated adjustment of temperature (SAC-AAT) demand response management model.
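A compact sketch of this decentralized interaction loop is shown below. The agent interface (select_action, store, update) is an assumption for illustration, as is the classic Gym 4-tuple step API; newer CityLearn releases may return a 5-tuple instead.

```python
# Sketch of the decentralized control loop of Figure 3. `agents` is a list of
# SAC-AAT agents (one per building) with an assumed interface; `env` follows
# the classic Gym step API (newer versions may differ).
observations = env.reset()
done = False
while not done:
    actions = [agent.select_action(obs)              # battery setpoint per building
               for agent, obs in zip(agents, observations)]
    next_observations, rewards, done, _ = env.step(actions)
    for agent, obs, act, rew, nxt in zip(agents, observations,
                                         actions, rewards, next_observations):
        agent.store(obs, act, rew, nxt, done)        # fill the replay buffer
        agent.update()                               # actor/critic/temperature step
    observations = next_observations
```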
The batteries are dynamically charged by storing excess solar energy when generation exceeds demand and discharged to supply stored energy during peak demand or high-price periods; a minimal sketch of the resulting state-of-charge dynamics is given below. This adaptive battery control plays a crucial role in optimizing energy management. During peak periods, the system discharges batteries to reduce grid stress and flatten the demand curve. By using stored energy during high-price periods, it minimizes reliance on costly grid electricity. Additionally, batteries store solar energy for later use, reducing wastage and maximizing renewable energy utilization, all of which contribute to maintaining grid stability. Agents receive feedback in the form of rewards based on their performance in reducing energy costs, lowering carbon emissions, flattening demand peaks, and preserving battery health. These rewards guide the agents in refining their policies over time.
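The following sketch illustrates the state-of-charge update implied by this behavior. Only the 6.4 kWh capacity is fixed by the experimental setup; the maximum power and round-trip efficiency are illustrative assumptions.

```python
def step_battery(soc, action, capacity=6.4, max_power=5.0, efficiency=0.9):
    """Advance the battery state of charge (kWh) over one hourly step.

    `action` in [-1, 1] scales charging (+) or discharging (-) power; the
    6.4 kWh capacity matches the setup, while max_power and efficiency are
    illustrative assumptions.
    """
    power = action * max_power                   # kW over a 1-h step -> kWh
    if power >= 0:                               # charging: losses on the way in
        soc = min(capacity, soc + power * efficiency)
    else:                                        # discharging: losses on the way out
        soc = max(0.0, soc + power / efficiency)
    return soc
```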
The system evaluates predicted energy flows for each building and executes decisions for battery operations in real-time. This dynamic approach not only reduces energy bills for users by leveraging low-cost energy during off-peak hours but also enhances comfort and reliability by ensuring that essential energy demands are always met. For the grid, the system lowers peak demand, reduces the need for costly infrastructure upgrades, enables smoother integration of RES, and improves overall resilience and stability.
Experiments and results
Dataset overview and simulation hyperparameters
This case study explores the development of a controller to manage battery systems in a randomly selected two-building setup, integrating DR signals into the control strategy to improve operational efficiency and flexibility. The data utilized in this research is derived from 17 zero net energy (ZNE) single-family homes located in the Sierra Crest ZNE community in Fontana, California. These buildings were part of a study on grid integration within zero net energy communities, conducted under the California Solar Initiative program. The dataset includes comprehensive information on energy demand, solar generation, weather, carbon emissions, and pricing data over the span of one year.
Figure 4 illustrates the energy demand and solar generation for one of the buildings across representative weeks in each season. The data show how peak demands and variations in renewable energy generation complicate balancing supply and demand, highlighting the potential benefits of incorporating battery storage to mitigate these fluctuations. Each timestep in the CityLearn environment corresponds to 1 h, a resolution suitable for evaluating high-level BEM and demand response strategies: it aligns with the temporal granularity of energy pricing, solar generation patterns, and demand peaks. Shorter timesteps would introduce more fluctuations and might require different control architectures, but the SAC-AAT controller is designed to operate effectively at the hourly level.

Energy patterns for a representative week in each season.
While the hourly resolution limits the ability to capture sub-hourly fluctuations or real-time battery dynamics, it is sufficient for evaluating high-level energy management strategies. The focus of this work is on optimizing long-term energy cost, emissions, and peak demand at the building and district level, rather than on real-time control or power electronics behavior.
Multiple simulation runs are conducted to fine-tune the critical parameters for the case study, as DRL performance is sensitive to changes in these parameters. Table 2 details the chosen hyperparameters for the SAC-AAT model, aimed at optimizing performance and stability in a continuous control environment.
SAC-AAT-based controller hyperparameters.
SAC-AAT: soft actor-critic with automated adjustment of temperature; NN: neural network.
The discount factor influences the agent's performance by determining how the agent prioritizes future rewards, thereby shaping its learning efficiency. Choosing an appropriate learning rate is also crucial in DRL training. A learning rate that is too small may result in excessively slow training and the risk of the optimization getting stuck, while a learning rate that is too large can lead to suboptimal solutions or unstable training. The temperature parameter alpha ($\alpha$) governs the trade-off between exploration and exploitation; in SAC-AAT it is tuned automatically during training rather than fixed in advance.
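For orientation, a configuration of this kind might look as follows. The values shown are typical SAC defaults used for illustration, not necessarily those listed in Table 2.

```python
# Illustrative SAC-AAT configuration (typical defaults, not Table 2 values).
hyperparameters = {
    'discount_factor': 0.99,        # gamma: weight placed on future rewards
    'learning_rate': 3e-4,          # actor, critic, and temperature optimizers
    'batch_size': 256,              # transitions sampled per gradient step
    'replay_buffer_size': 100_000,  # experience reuse for off-policy learning
    'tau': 0.005,                   # soft target-network update coefficient
    'target_entropy': -1.0,         # -dim(action space) heuristic for alpha tuning
    'hidden_layers': (256, 256),    # NN architecture for actor and critic networks
}
```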
Benchmarking and evaluation
Rule-based controllers (RBC) and SAC-based control strategies were implemented to regulate the charging and discharging of electrical energy storage systems across buildings, aiming to minimize electricity costs and mitigate electrical load fluctuations during the considered DR events. The performance of these models is then compared against SAC-AAT to demonstrate the proposed method's effectiveness. The first RBC (baseline) optimizes the electrical load of the cluster without considering the participation of buildings in DR events throughout the simulation period. It operates without utilizing battery storage, with electricity demand fully met by drawing power from the main grid. This baseline serves as a reference to evaluate the benefits of incorporating energy storage systems and intelligent control strategies. In contrast, the second RBC was developed to meet the requirements of DR events and serves as a reliable benchmark for evaluating the performance of more advanced, RL-based controllers. It is a widely used control strategy in systems such as HVAC and batteries due to its simplicity. It operates based on a set of rules expressed as hour-of-day if-then conditions that charge the batteries during off-peak hours and discharge them during peak hours, as sketched below.
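A plausible form of such a rule set is shown below. The specific hours and charging rates are assumptions for illustration, not the exact thresholds used in the experiments.

```python
def rbc_action(hour: int) -> float:
    """Hour-of-day battery rule, a sketch of the second RBC.

    Charges overnight during off-peak hours and discharges through the
    evening peak; the hours and rates are illustrative assumptions.
    """
    if hour >= 22 or hour <= 6:   # night: charge from cheap grid power / leftover PV
        return 0.05               # small positive charging rate
    elif 15 <= hour <= 21:        # evening peak: discharge to serve the building
        return -0.05
    return 0.0                    # otherwise leave the battery idle
```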
On the other hand, SAC is a model-free, off-policy RL algorithm. Being off-policy, it allows efficient experience reuse and learning from fewer samples, without the need for system modeling. It serves as the backbone for SAC-AAT. While SAC focuses on maximizing a balance between reward and entropy for efficient exploration and stable learning in continuous action spaces, SAC-AAT extends this framework by automatically adjusting the temperature parameter. The SAC-based controller is implemented as a decentralized controller. For the experimental settings, a consistent configuration is applied across all baseline methods as well as the proposed multi-agent actor-critic technique for DR in grid-responsive buildings. The hyperparameters used for the SAC agent are outlined in Table 3.
SAC-based controller hyperparameters.
SAC: soft actor-critic; NN: neural network.
We evaluate the agents’ performance based on their ability to minimize five cost functions that quantify the energy flexibility of the entire district or individual buildings, including factors such as grid stability and the mitigation of fluctuations caused by RES.
Cost is calculated as the total expense for electricity imported by building $i$, i.e., the price-weighted sum of the energy drawn from the grid over the simulation horizon. The remaining cost functions are computed analogously from the hourly net load, carbon intensity, and price series, as sketched below.
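The sketch below shows how district-level metrics of this kind can be computed from the hourly series. The definitions mirror common CityLearn conventions and are stated here as assumptions rather than the exact metrics of this study.

```python
import numpy as np

def district_kpis(net_load, price, carbon):
    """Illustrative cost functions from hourly series over one year (8760 steps).

    Definitions mirror common CityLearn conventions; treat them as assumptions
    rather than the exact metrics used in this study.
    """
    net_load = np.asarray(net_load, dtype=float)
    imported = np.clip(net_load, 0.0, None)             # energy drawn from the grid
    cost = float(np.sum(price * imported))              # total electricity expense
    emissions = float(np.sum(carbon * imported))        # total CO2 emissions
    ramping = float(np.sum(np.abs(np.diff(net_load))))  # hour-to-hour load changes
    daily = net_load.reshape(-1, 24)                    # one row per day
    peaks = np.maximum(daily.max(axis=1), 1e-9)         # guard against zero peaks
    avg_daily_peak = float(np.mean(peaks))              # mean of daily maxima
    load_factor = float(np.mean(daily.mean(axis=1) / peaks))
    return {'cost': cost, 'emissions': emissions, 'ramping': ramping,
            'avg_daily_peak': avg_daily_peak, '1 - load_factor': 1.0 - load_factor}
```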
Results and analysis
The main task of the DRL controllers is to optimize energy management in a smart grid environment by dynamically controlling battery energy storage systems, ensuring the smoother integration of RES while maintaining grid stability. Table 4 illustrates the district-level performance of DR controllers across various cost functions during the simulation period. The cost metric represents the total electricity expenses incurred by operating the energy management system. It is a critical measure of economic efficiency, as it directly impacts the financial burden on households, businesses, and utilities. Lowering electricity costs benefits consumers by reducing utility bills and helps businesses achieve cost efficiency, thereby improving profitability. Simultaneously, minimizing emissions contributes to reducing the environmental footprint of energy systems. The SAC-AAT-based controllers achieve the lowest electricity costs. Furthermore, they exhibit the lowest emissions, average daily peak, and ramping over the simulation period, highlighting their effectiveness in load curve shaping and maintaining grid stability. The RBC model, by contrast, scores noticeably worse on these grid-level metrics, reflecting its limited adaptability.
Performance comparison of control strategies.
RBC: rule-based controllers; SAC: soft actor-critic; SAC-AAT: soft actor-critic with automated adjustment of temperature.
Figure 5 compares the performance of the control methods (baseline, RBC, SAC, and SAC-AAT) in terms of cost and emissions for individual buildings. The proposed SAC-AAT controller achieved the lowest costs and emissions in both buildings; thanks to its automated temperature adjustment mechanism, it dynamically adapts to varying environmental conditions. The RBC, on the other hand, achieves lower costs than SAC, highlighting its effectiveness in simple, predefined scenarios: SAC balances exploration and exploitation but lacks the tailored adjustment needed for these specific conditions.

Cost and emissions for Buildings 3 and 6 under each control strategy.
Reducing fluctuations in energy demand is essential for ensuring grid stability, integrating RES, and lowering energy costs. Stable demand profiles minimize stress on the grid, reduce the risk of outages, and enable more efficient use of renewable energy, which is often intermittent. Figure 6 illustrates a comparison of different control strategies at the district level. Specifically, it highlights the impact of two control strategies—RBC and SAC-AAT—on load profiles, using the baseline as a reference. The aggregated load profile produced by the SAC-AAT controller is significantly more uniform than that of the RBC, indicating its superior ability to smooth demand and enhance grid reliability.

District energy demand under different control methods.
On the other hand, the analysis at the single-building level (Figure 7) shows that SAC performs moderately well, reducing variability compared to RBC in Building 6. This result underscores that RBC relies on static, predefined rules and cannot adapt to dynamic changes in the environment.

Energy demand profiles for Buildings 3 and 6.
Average daily building loads further validate the above statements (Figure 8). With the SAC-AAT controller, the load curves exhibit smaller fluctuations and tend to follow a more stable consumption pattern.

Average daily energy demand in Buildings 3 and 6.
Lastly, Figure 9 illustrates the episodic rewards across 25 episodes, comparing the performance of various control models. Here, each episode corresponds to a full simulation year consisting of 8760 hourly timesteps. The reward serves as a metric for evaluating how well the RL agent is achieving its objective and provides feedback to the agent about the effectiveness of its actions in each state.

The evolution of episodic cumulative rewards for control methods.
The baseline and RBC models display constant reward values across episodes, as they use fixed, non-learning strategies. RBC initially outperforms the baseline, reflecting its rule-based optimization. In contrast, the SAC and SAC-AAT models learn over time. In the early episodes, RBC achieves better rewards than SAC due to SAC's exploration phase. However, SAC-AAT exhibits a faster and more stable learning curve, surpassing the baseline by episode 4 and RBC by episode 7. The faster convergence and higher final reward of SAC-AAT indicate more effective and adaptive energy management, resulting in lower operational costs and improved performance over time. This trend confirms that reinforcement learning models, particularly SAC-AAT, improve with experience and outperform static strategies in long-term deployments.
Discussion and future work
While the proposed SAC-AAT controller demonstrates strong performance in simulated scenarios, several limitations must be acknowledged. First, the model has only been validated within the CityLearn simulation environment, using fixed building data and hourly time intervals. While CityLearn offers a standardized and useful platform for evaluation, it cannot completely mirror the intricate realities of building-level controls, occupant behavior, or limitations imposed by grid infrastructure. Second, although SAC-AAT automatically adjusts the temperature parameter to enhance exploration, the technique still requires significant training and hyperparameter optimization. The resulting computational demands could pose difficulties for resource-constrained systems or applications requiring real-time performance. Finally, the model assumes that input data is accurate and continuously available, which might not always hold in real-world scenarios.
Despite these constraints, the proposed SAC-AAT framework presents encouraging opportunities for practical use in smart grid and BEM systems. By adjusting in real time to fluctuations in energy prices, demand trends, and renewable generation, SAC-AAT can assist utilities and building managers in minimizing operational expenses and enhancing the self-consumption of renewable energy. In the context of smart buildings, the integration of SAC-AAT into energy management systems could improve comfort, ensure demand fulfillment, and reduce emissions. The algorithm's compatibility with continuous control tasks makes it suitable for integration with Internet of Things platforms and building management systems. From an economic perspective, reduced electricity bills and improved renewable energy utilization offer strong incentives for adoption. Future research will investigate how to adapt the SAC-AAT controller to different building types and finer time resolutions. Furthermore, integrating SAC-AAT with other advanced reinforcement learning methods, such as attention-driven coordination in multi-agent frameworks or hybrid model-based approaches, could improve its scalability and resilience.
Conclusion
In this study, we explored the SAC-AAT control strategy for DR management in a district-level energy system, comparing its performance against baseline, rule-based, and SAC controllers. The results demonstrated that RL models, particularly those enhanced with automated temperature adjustment, can significantly improve energy demand management, enabling smoother load profiles, reduced peak demand, and minimized emissions. The proposed model consistently outperformed all baseline models in terms of cumulative reward, energy cost, and emissions reduction. By adapting its control to high-demand and high-price periods, the SAC-AAT controller demonstrated improved responsiveness to dynamic energy conditions, enabling more precise and stable demand management. The baseline and RBC models exhibited limited adaptability, with static or rule-based strategies that failed to respond effectively to fluctuations in energy demand. While the RBC approach outperformed the baseline by employing a set of predefined rules, its inability to adapt dynamically led to frequent demand peaks and suboptimal utilization in complex environments. Future work could explore additional improvements, such as incorporating more complex DR strategies or extending the model to larger, more diverse building clusters. Overall, this study underscores the value of advanced RL techniques in achieving sustainable and resilient energy management solutions.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Technology Development Program (RS-2025-02312851) funded by the Ministry of SMEs and Startups (MSS, Korea).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
