Sage Journals: Discover world-class research

Abstract

The variable nature of wind, including wind speed, direction, barometric pressure, and air temperature, presents significant challenges for accurately predicting wind power output. This paper addresses the issue of contextual prediction accuracy, highlighting the limitations of existing methods in analyzing temporal information in depth. It introduces a Transformer-based Dynamic context-aware power forecasting model combined with long short-term memory (LSTM) to enhance contextual wind power prediction. The model identifies significant factors influencing wind power generation and integrates the various conditions affecting wind power output into a unified embedding. To improve forecast accuracy, the model adopts a two-layer architecture. The first layer uses LSTM units to extract essential temporal features from the data stream. The subsequent layer uses the Dynamic context-aware model's hierarchical multihead self-attention mechanism to discern global information and contextual interrelations. The results reveal that the LSTM-based dynamic context-aware model significantly outperforms other models in forecasting wind power plant output.

Keywords

Wind power, power forecasting, transformer, LSTM, wind speed

Introduction

Wind energy resource assessment is crucial for predicting energy performance accurately. The capacity and efficiency of wind turbines have increased significantly, and the scientific community regards wind energy as a vital asset for global sustainable development, supporting widespread adoption and decreasing the cost of electricity generated by wind turbines.¹,² However, wind-related forecasting is complex because wind turbine technology involves nonlinear and time-varying uncertainties.³ Researchers have made continuous efforts in related fields to explore techniques for predicting short-term wind speed using mathematical models and machine learning algorithms, with the aim of advancing the efficiency and cost-effectiveness of wind energy systems.

Researchers have used deep learning methods such as recurrent neural networks (RNNs) to analyze high-dimensional load data. However, the original RNNs are difficult to train because of vanishing gradients and long-term dependency problems.⁴,⁵ Thus, Oslobo et al. introduced the long short-term memory (LSTM) algorithm by inserting cell states into the RNN.⁶ The LSTM algorithm has a large number of parameters, which can ultimately lead to overfitting.⁷ Kim et al. introduced the gated recurrent unit (GRU), which decreases the number of intrinsic parameters and thus reduces the risk of overfitting by using a simpler model.⁷ Furthermore, Almuzaini et al. proposed the Bi-GRU algorithm to fully exploit current and past data. This model is also combined with an attention mechanism that highlights significant features of the data. The authors demonstrated the capability of the proposed algorithm to improve the accuracy and efficiency of classification.⁸

Deep learning has recently emerged as a method for processing complex temporal data in many domains. It has made notable achievements in natural language processing, and deep learning models and their combinations,⁹ including long short-term memory (LSTM),¹⁰ convolutional neural networks (CNNs), and other architectures, are commonly used for power prediction. Deep learning designed for wind farm clustering has increased prediction accuracy and captured complex dependencies in time series forecasting.¹¹ Deep learning models have a wide range of possible uses in time series forecasting, with a particular focus on solar and wind power potential. Researchers have studied many deep learning frameworks that excel at capturing temporal dependencies, handling data sequences, and increasing prediction accuracy.

Moreover, considerations and limitations specific to each architecture, such as interpretability, training efficiency, and the handling of missing or noisy data, must also be taken into account when handling sequential data.¹² Hybrid deep learning models have considerable scope in time series prediction, where they can improve performance and address these limitations. A hybrid model combines two or more network architectures to gain the benefits of each, offering better interpretability, improved prediction accuracy, and more efficient handling of data limitations.

The Transformer is perhaps one of the most successful sequence modeling architectures, achieving unprecedented performance in numerous applications such as natural language processing (NLP) and speech recognition.¹³,¹⁴ Recently, there has been a great increase in Transformer-based solutions for time series analysis.¹⁵ Notable models that focus on the challenging and less explored problem of long-term time series forecasting (LTSF) include LongTrans, Informer, Autoformer, Pyraformer, and Triformer.¹⁶⁻¹⁹ The main strength of Transformers lies in the multihead self-attention mechanism, which has a remarkable ability to extract semantic associations between elements of a long sequence (2D patches in images or words in texts). However, self-attention is permutation-invariant and to some extent “anti-order.” Although some ordering information can be preserved using various types of positional encoding techniques, it is still inevitable that some temporal information will be lost after applying self-attention. In semantically rich applications such as natural language processing, this is generally not a serious problem. For example, if we rearrange some words, the meaning of a sentence remains largely preserved. In time series data, however, the numerical values themselves typically carry little semantic meaning, and we are primarily concerned with modeling temporal changes across a set of continuous points.

This paper addresses the low forecast accuracy that arises from the inability of existing methods to comprehensively analyze temporal data. It proposes a dual-layer time series Transformer-based Dynamic context-aware power forecasting model combined with long short-term memory (LSTM) to filter out irrelevant noise for wind power forecasting. The model starts with correlation analysis to identify key factors affecting wind power and gathers the various conditions into an integrated contextual embedding. It employs LSTM to distill the essential temporal features, extracting and structuring dependencies from the raw data, followed by a Dynamic context-aware hierarchical multihead self-attention mechanism to recognize global information and interrelationships. This method detects important features of the current interval and tackles challenges such as computational demands and temporal forecasting limitations to enable accurate time-series wind power forecasting. Moreover, a comparative analysis is performed against three available forecasting models: Gated Recurrent Unit (GRU), LSTM, and Transformer. The results reveal that the Dynamic context-aware model significantly outperforms these models in forecasting power output for wind power plants.

Related work

Machine learning methods that transform time series data into a form suitable for supervised learning are extensively employed in time series forecasting.²⁰ The predominant methods include support vector regression (SVR), which fits a regression model to the data; Light Gradient Boosting Machine (LightGBM), a gradient boosting framework that uses decision tree-based learning algorithms; and eXtreme Gradient Boosting (XGBoost), an improved implementation of boosting algorithms.²¹,²² Moreover, deep learning models are widely applied to time series prediction, including RNN, which has memory capabilities through its recursive process; GRU, an alternative to LSTM that improves computational efficiency and simplifies the model structure; LSTM, which handles long-term dependencies and efficiently captures sequence features; 1D convolutional neural networks (1D-CNN), which readily capture short-term spatial information and local dependencies; and the Transformer model, which is well suited to modeling long-term dependencies and interactions in time series but is limited in extracting temporal features.²³ Since its publication in 2017, the Transformer model has made remarkable progress in natural language processing.¹³

Traditional statistical models for time series forecasting are generally based on autoregression (AR), in which the forecast is modeled as a linear function of earlier observations.²⁴ It is appropriate for unweighted data with no seasonality or trends. The autoregressive moving average (ARMA), a combination of AR and moving average (MA), models the next step in the sequence as a linear function of past observations and residuals.²⁵ The autoregressive integrated moving average (ARIMA), which has been used to represent wind power forecasts, is appropriate for univariate time series data with trends but not with seasonality.²⁶ The seasonal autoregressive integrated moving average (SARIMA) represents the next step in the sequence as a linear function of previous observations and errors, including seasonal observations and seasonal errors.²⁷ These methods have been productive in modeling time series dynamics but generally require extensive feature engineering and achieve poor prediction performance.

The Transformer model has been extensively used in different fields, including time series prediction, document retrieval, and image processing.²⁸ The Transformer encoder-decoder models are composed of a point-wise CNN layer, a self-attention layer, and a one-dimensional CNN layer. By combining these constituent layers, the encoder extracts compressed spatiotemporal features from the multivariate time series data, and the decoder has a similar layer structure. Furthermore, Shih et al. proposed a temporal pattern attention method to select relevant time series for multivariate forecasting.²⁹ Samaher et al. proposed an artificial intelligence (AI) algorithm and mixed-integer programming to adjust flight altitude using real-time data, evaluated through comprehensive performance metrics. Techniques such as the teamwork optimizer have also been used to obtain higher accuracy for green energy sources such as fuel cells.³⁰,³¹ Du et al. proposed a combined attention mechanism in an encoder–decoder structure to extract multivariate correlations.³² Zhou et al. applied the attention mechanism twice to represent both dependencies among variables and temporal patterns.¹⁶ Several Transformer variants have been proposed for time series forecasting, such as Longformer, Reformer, Powerformer, and Informer.³³⁻³⁶ More recent techniques have also been adopted in the wind energy field, such as an adaptive decomposition method combined with a quaternion convolutional long short-term memory neural model to predict wind speed in the northern Aegean islands.³⁷ Wind speed prediction using a machine learning approach with Sentinel-family satellite imagery, and wave power prediction using a bidirectional convolutional model based on efficient decomposition with a Nelder-Mead equilibrium optimizer, have also been reported.³⁸

Data processing

Data acquisition

This study uses data from a 50-MW wind power plant, containing local measurement data from 32 wind turbines at the Jhimpir site, Pakistan. The dataset documents electricity production at 10-min intervals, resulting in 52,560 records from February 16, 2022, to February 17, 2023. The dataset includes five attributes: temperature (°C), atmospheric pressure (hPa), wind speed (m/s), wind direction (°), and relative humidity (RH). It also includes corresponding Numerical Weather Prediction (NWP) data covering the 32 wind turbines: temperature, pressure, relative humidity, wind direction, wind speed, and wind power output. A second dataset from a real-time 30-MW wind power plant at a wind farm in Texas is also used, including wind speed, direction, atmospheric pressure, and temperature. The integrity of the dataset is critical to the performance of deep learning models. Fixing missing data and other anomalies in datasets is crucial to improve model accuracy, reduce computational effort, and enable effective training. Missing values can significantly affect the accuracy of forecasting models in time series analyses based on continuous data streams. In this study, linear interpolation was used to impute missing values. In addition, because the parameters have different scales, data normalization is essential to avoid bias toward variables with larger magnitudes and to ensure equal consideration of features. Normalization also speeds up algorithmic calculations and convergence during the training process. A Kalman filter is used for noise reduction, and the min–max scaler is used to normalize the data to the range [0, 1].³⁹ The normalization formula is expressed as follows:

x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}

where $x$ denotes the original measurement, $x_{scaled}$ is the normalized value, and $x_{max}$ and $x_{min}$ are the dataset's maximum and minimum values, respectively.
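The two preprocessing steps described above, linear interpolation of missing values and min–max scaling to [0, 1], can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' pipeline: the function names `interpolate_missing` and `min_max_scale` and the sample wind-speed values are assumptions made for the example, and the Kalman filtering step is not shown.

```python
import numpy as np

def interpolate_missing(x):
    """Fill NaN gaps by linear interpolation between the nearest valid neighbours."""
    x = np.asarray(x, dtype=float).copy()
    missing = np.isnan(x)
    idx = np.arange(x.size)
    # np.interp holds the edge value constant for gaps at either end of the series.
    x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    return x

def min_max_scale(x):
    """Map x to [0, 1] using (x - x_min) / (x_max - x_min)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # A constant series has no range; return zeros instead of dividing by zero.
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)

# Hypothetical wind-speed readings in m/s with two missing records.
wind_speed = np.array([4.0, np.nan, 8.0, 6.0, np.nan, 10.0])
filled = interpolate_missing(wind_speed)   # gaps become 6.0 and 8.0
scaled = min_max_scale(filled)             # smallest value -> 0, largest -> 1
```

In practice the minimum and maximum are usually computed on the training portion only and then reused for validation and test data, so that no information from held-out records enters the scaling.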

Dataset creation

For effective time series prediction using deep learning, the dataset should be structured as a supervised regression task with defined inputs and outputs. Here, $X \in R^{n \times 5}$ is the input data matrix for forecasts, where n = 144 corresponds to the number of time intervals predicted and 5 is the number of selected weather features for each interval. The output matrix $Y \in R^{n \times 1}$ holds the predicted power values for the n intervals. The dataset undergoes a random split, with 80% dedicated to training the deep learning model and 20% dedicated to testing its accuracy and performance.
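One way to arrange the records into input and output matrices of the stated shapes is sketched below. The text does not say how individual samples are cut from the series, so the non-overlapping blocks of n = 144 intervals used here are an assumption, as are the helper names `make_samples` and `random_split` and the synthetic data.

```python
import numpy as np

def make_samples(features, power, n=144):
    """Cut aligned series into non-overlapping samples of n intervals each."""
    features = np.asarray(features, dtype=float)
    power = np.asarray(power, dtype=float)
    num = features.shape[0] // n              # drop an incomplete tail, if any
    X = features[: num * n].reshape(num, n, features.shape[1])
    Y = power[: num * n].reshape(num, n, 1)
    return X, Y

def random_split(X, Y, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    cut = int(train_fraction * X.shape[0])
    train, test = order[:cut], order[cut:]
    return X[train], Y[train], X[test], Y[test]

# Synthetic stand-in for 10 days of 10-minute records with 5 weather features.
rng = np.random.default_rng(42)
features = rng.random((1440, 5))
power = rng.random(1440)

X, Y = make_samples(features, power)          # X: (10, 144, 5), Y: (10, 144, 1)
X_train, Y_train, X_test, Y_test = random_split(X, Y)
```

Each sample keeps its 144 intervals in chronological order; only the order of whole samples is shuffled before the split.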

Methodology

Wind power generation forecasting is a challenging multivariate time series problem, because the output is affected by various climatic features such as wind speed, direction, humidity, and temperature. An LSTM model captures the features of multivariate time series, but it can easily overfit large datasets, requires significant memory, and is limited in interpreting and handling long-range sequences. In contrast, the self-attention mechanism in Transformer-based frameworks attends to all positions in the input sequence simultaneously, regardless of the distance between features. To overcome these challenges, this study presents a novel time series Transformer-based Dynamic context-aware power forecasting model combined with an LSTM model, which processes long- and short-term time-series data while extracting and structuring features from raw sequential data. The model starts with correlation analysis to identify key factors affecting wind power and gathers the various conditions into an integrated contextual embedding. It employs LSTM to distill the essential temporal features, extracting and structuring dependencies from the raw data, followed by a Dynamic context-aware hierarchical multihead self-attention mechanism to recognize global information and interrelationships. This method detects important features of the current interval and tackles challenges such as high computational demands and limited temporal forecasting accuracy to enable accurate time-series wind power forecasting. The contextual dynamic model can identify complex relationships in multivariate time series. The framework of the proposed model is presented in Figure 1.

Figure 1.

Flow chart of wind power forecasting for dynamic context-aware power forecasting model.

LSTM-based embedding

The self-attention mechanism in the encoder-decoder network may struggle to capture sequential data effectively. Even with the integration of scalar, local, and global time stamps (such as minutes, hours, weeks, and months) as position embeddings, the model's capacity to derive features from time series data can be limited. To address this, an LSTM layer is introduced after the embedding to harness its strong capability in learning temporal attributes and sequential information.¹⁶ The standout feature of LSTM is its management of long-term data dependencies. It employs a “memory cell” that maintains and transmits information over prolonged intervals, allowing it to discern temporal dependencies and patterns in sequences with extended time gaps. The output from the LSTM not only reflects insights from various time steps but also captures hidden node data from each of those steps, thus enhancing the model's ability to recognize hidden details at every stage. This process is detailed in Figure 2. After passing through the embedding layer, the time series data is processed by the LSTM layer to extract temporal features. The information extracted from each node's hidden layer is then used as input for both the encoder and decoder.

Figure 2.

The structure of LSTM-Dynamic context-aware power forecasting model.

The formulas are as follows:

e = Embedding(x_{in})

(1)

h = LSTM(e)

(2)

where the output of the t-th node through the LSTM is given by:

i_{t} = Sigmoid(W_{i} \cdot [h_{t-1}, e_{t}] + b_{i})

(3)

f_{t} = Sigmoid(W_{f} \cdot [h_{t-1}, e_{t}] + b_{f})

(4)

\tilde{C}_{t} = \tanh(W_{C} \cdot [h_{t-1}, e_{t}] + b_{C})

(5)

o_{t} = Sigmoid(W_{o} \cdot [h_{t-1}, e_{t}] + b_{o})

(6)

C_{t} = f_{t} \cdot C_{t-1} + i_{t} \cdot \tilde{C}_{t}

(7)

h_{t} = o_{t} \cdot \tanh(C_{t})

(8)

where the hyperbolic tangent (tanh) and Sigmoid are the two activation functions commonly used in LSTM networks; $i_{t}$, $f_{t}$, and $o_{t}$ are the input, forget, and output gates; $\tilde{C}_{t}$ is the candidate cell state; $C_{t}$ is the cell state; and $h_{t}$ is the hidden layer output at the t-th node.
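The gate equations (3) to (8) can be traced step by step in a short NumPy sketch. It is meant only to make the data flow concrete: the weights are random rather than trained, and the dictionary layout of `W` and `b`, the function names, and the hidden size of 8 are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(e_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b are dicts keyed by 'i', 'f', 'c', 'o'."""
    z = np.concatenate([h_prev, e_t])        # concatenation [h_{t-1}, e_t]
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, Eq. (3)
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (4)
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate cell state, Eq. (5)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (6)
    c_t = f_t * c_prev + i_t * c_hat         # cell state update, Eq. (7)
    h_t = o_t * np.tanh(c_t)                 # hidden output, Eq. (8)
    return h_t, c_t

def lstm_forward(E, hidden_size, seed=0):
    """Run the cell over an embedded sequence E of shape (T, d); return all h_t."""
    rng = np.random.default_rng(seed)
    T, d = E.shape
    # Untrained, randomly initialised parameters (illustration only).
    W = {k: rng.normal(0.0, 0.1, (hidden_size, hidden_size + d)) for k in "ifco"}
    b = {k: np.zeros(hidden_size) for k in "ifco"}
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    H = np.zeros((T, hidden_size))
    for t in range(T):
        h, c = lstm_step(E[t], h, c, W, b)
        H[t] = h
    return H

# Hypothetical embedded input: 144 intervals, 5 features per interval.
E = np.random.default_rng(1).random((144, 5))
H = lstm_forward(E, hidden_size=8)           # one hidden vector per time step
```

The matrix `H` collects the hidden output of every time step, which corresponds to the per-node hidden information that the text describes as the input to the encoder and decoder.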

Transformer layer

This research employs a Transformer architecture composed of an Encoder and a Decoder, each built from multiple identical layers. Every layer integrates two key components: a position-wise feed-forward neural network and a multihead attention mechanism, as illustrated in Figure 3. To enhance model performance and data flow, residual connections and layer normalization are incorporated.²⁵ The attention mechanism is formulated as a mapping from a query to a combination of key-value pairs,¹³ utilizing scaled dot-product attention, mathematically represented as:

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d}}) V

(9)

where d denotes the dimension of the hidden layers and Q (query), K (key), and V (value) are given by the hidden representations of the previous layer. Usually, the multihead variant of the self-attention sublayer is used, which permits the model to jointly attend to information from different representation subspaces and is defined as

MultiHead(Q, K, V) = Concat(head_{1}, \dots, head_{H}) W^{O}

(10)

head_{k} = Attention (Q W_{k}^{Q}, K W_{k}^{K}, V W_{k}^{V})

(11)

where $W_{k}^{Q} \in R^{d \times d_{K}}$, $W_{k}^{K} \in R^{d \times d_{K}}$, $W_{k}^{V} \in R^{d \times d_{V}}$, and $W^{O} \in R^{H d_{V} \times d}$ are projection parameter matrices, $H$ is the number of attention heads, and $d_{K}$ and $d_{V}$ are the dimensions of the key and value vectors, respectively. When processing a sequence of inputs $(x_{1}, \dots, x_{n})$, we use

MultiHeadAtt(x_{i}, [x_{1}, x_{2}, \dots, x_{n}])

(12)

Figure 3.

Structure of transformer model.

This allows each input position to attend to all other positions in the sequence, capturing contextual relationships effectively:

MultiHeadAtt(x_{i}, [x_{1}, x_{2}, \dots, x_{n}]) = MultiHead(x_{i}, [x_{1}, \dots, x_{n}], [x_{1}, \dots, x_{n}])

(13)
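Equations (9) to (13) can be illustrated with a compact NumPy sketch of multihead self-attention. The projection matrices are random rather than learned, the head width is assumed to be d divided by the number of heads, and the scores are scaled by the per-head key width, which is the usual convention; all function names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (9); returns the output and the weights."""
    d = Q.shape[-1]                           # assumption: scale by the key width
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

def multi_head_self_attention(X, num_heads, seed=0):
    """Eqs. (10)-(13): queries, keys, and values are all projections of X."""
    n, d = X.shape
    d_k = d // num_heads                      # assumed per-head width
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(num_heads):
        # Random (untrained) projection matrices W_k^Q, W_k^K, W_k^V.
        W_q, W_k, W_v = (rng.normal(0.0, d ** -0.5, (d, d_k)) for _ in range(3))
        head, _ = attention(X @ W_q, X @ W_k, X @ W_v)   # Eq. (11)
        heads.append(head)
    W_o = rng.normal(0.0, d ** -0.5, (num_heads * d_k, d))
    return np.concatenate(heads, axis=-1) @ W_o          # Eq. (10)

# Hypothetical hidden sequence: 6 positions with hidden width 8.
X = np.random.default_rng(1).random((6, 8))
out = multi_head_self_attention(X, num_heads=2)          # shape (6, 8)
_, w = attention(X, X, X)                                # each row of w sums to 1
```

Row i of the weight matrix shows how strongly position i attends to every position in the sequence, which is the sense in which each input attends to all others.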

Context-aware embedding

Context-aware embedding is a good fit for Dynamic context-aware power forecasting models because it captures and uses contextual information from an entire sequence within the data rather than focusing on individual elements in isolation.⁴⁰

Let V = \{v_{1}, \dots, v_{n}\}

(14)

be the target input, where $v_{i}$ is the i-th token in $V$. To form the input, a special token $c$, which stands for the entire target utterance, is placed at the beginning of $V$, resulting in the input sequence

I = \{c\} \oplus V

(15)

Let $L_{i}$ be the i-th context utterance; then

V = L_{1} \oplus \dots \oplus L_{k} = \{v_{1}, \dots, v_{m}\}

(16)

is the combined list of tokens across all $k$ context utterances, where $v_{1}$ is the first token in $L_{1}$ and $v_{m}$ is the last token in $L_{k}$. The input sequence $I = \{c\} \oplus V$ is fed into an embedding that includes information about the interaction between these inputs, producing a sequence of embeddings $\{e^{c}\} \oplus E^{v}$, where $E^{v} = \{e_{1}^{v}, \dots, e_{m}^{v}\}$ is the list of embeddings for $V$ and $e^{c}$ is the embedding for $c$.

Hierarchical multihead self-attention

In this section, an approach to reduce the computational and space requirements associated with using the H-MHSA mechanism is presented. Instead of focusing attention on the entire input, a hierarchical strategy is adopted that allows each stage to process only a limited number of tokens. The first step focuses on computing local attention.⁴¹ Assuming that the input feature map is denoted by $X \in R^{H \times W \times C}$, we partition the feature map into small grids of size $G_{1} \times G_{1}$ and reshape them as follows:

X \in R^{H \times W \times C} \to X_{1} \in R^{(\frac{H}{G_{1}} \times G_{1}) \times (\frac{W}{G_{1}} \times G_{1}) \times C}

(17)

\to X_{1} \in R^{(\frac{H}{G_{1}} \times \frac{W}{G_{1}}) \times (G_{1} \times G_{1}) \times C}

(18)

The query, key, and value are then calculated by

Q_{1} = X_{1} W_{1}^{q}, K_{1} = X_{1} W_{1}^{k}, V_{1} = X_{1} W_{1}^{v}

(19)

where $W_{1}^{q}, W_{1}^{k}, W_{1}^{v} \in R^{C \times C}$ are trainable weight matrices. The attention function is then applied to produce the local attention feature $A_{1}$. To facilitate network optimization, we reshape $A_{1}$ back to the shape of $X$ through

A_{1} \in R^{(\frac{H}{G_{1}} \times \frac{W}{G_{1}}) \times (G_{1} \times G_{1}) \times C}

(20)

\to A_{1} \in R^{(\frac{H}{G_{1}} \times G_{1}) \times (\frac{W}{G_{1}} \times G_{1}) \times C}

(21)

\to A_{1} \in R^{H \times W \times C}

(22)

and integrate a residual connection:

A_{1} = A_{1} + X

(23)

Since the local attention feature $A_{1}$ is calculated within every small $G_{1} \times G_{1}$ grid, a significant reduction in computational and spatial complexity is achieved. The second step computes the global attention. $A_{1}$ is downsampled by a factor of $G_{2}$ when calculating the key and value matrices. This subsampling allows for efficient computation of the global attention, treating each $G_{1} \times G_{1}$ grid as a token. This process can be stated as:

{\hat{A}}_{1} = {AvePool}_{G_{2}} (A_{1})

(24)

where $AvePool_{G_{2}}(\cdot)$ represents downsampling a feature map by $G_{2}$ times using average pooling with both the kernel size and stride set to $G_{2}$. Subsequently, we have ${\hat{A}}_{1} \in R^{\frac{H}{G_{2}} \times \frac{W}{G_{2}} \times C}$. We then reshape $A_{1}$ and ${\hat{A}}_{1}$ as follows:

A_{1} \in R^{H \times W \times C} \to A_{1} \in R^{(H \times W) \times C}

(25)

{\hat{A}}_{1} \in R^{\frac{H}{G_{2}} \times \frac{W}{G_{2}} \times C} \to {\hat{A}}_{1} \in R^{(\frac{H}{G_{2}} \times \frac{W}{G_{2}}) \times C}

(26)

Then we calculate the query, key, and value as follows

Q_{2} = A_{1} W_{2}^{q}, K_{2} = {\hat{A}}_{1} W_{2}^{k}, V_{2} = {\hat{A}}_{1} W_{2}^{v}

(27)

where $W_{2}^{q}, W_{2}^{k}, W_{2}^{v} \in R^{C \times C}$ are trainable weight matrices. It follows that $Q_{2} \in R^{(H \times W) \times C}$, $K_{2} \in R^{(\frac{H}{G_{2}} \times \frac{W}{G_{2}}) \times C}$, and $V_{2} \in R^{(\frac{H}{G_{2}} \times \frac{W}{G_{2}}) \times C}$. The attention function is then applied to obtain the global attention feature $A_{2} \in R^{(H \times W) \times C}$, followed by a reshaping operation:

A_{2} \in R^{(H \times W) \times C} \to A_{2} \in R^{H \times W \times C}

(28)

The final output of H-MHSA is written as

H\text{-}MHSA(X) = (A_{1} + A_{2}) W^{p} + X

(29)

where $W^{p} \in R^{C \times C}$ has the same meaning as in $A' = A W^{p} + X$. In this way, H-MHSA efficiently models both local and global relations.
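The reshaping steps in Equations (17) to (29) are easier to follow when written as array operations. The sketch below is a single-head NumPy version under stated assumptions: the weights in the `params` dictionary are random rather than trained, the attention scores are scaled by the square root of C, and H and W are assumed to be divisible by both G1 and G2. It illustrates the shape flow only and is not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    """Scaled dot-product attention over the last two axes (batched or not)."""
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def h_mhsa(X, G1, G2, params):
    H, W, C = X.shape
    # Eqs. (17)-(18): split into (H/G1 * W/G1) grids of G1*G1 tokens each.
    X1 = X.reshape(H // G1, G1, W // G1, G1, C).transpose(0, 2, 1, 3, 4)
    X1 = X1.reshape(-1, G1 * G1, C)
    # Eq. (19) followed by attention inside every grid (local attention).
    A1 = attend(X1 @ params["q1"], X1 @ params["k1"], X1 @ params["v1"])
    # Eqs. (20)-(23): undo the partition and add the residual connection.
    A1 = A1.reshape(H // G1, W // G1, G1, G1, C).transpose(0, 2, 1, 3, 4)
    A1 = A1.reshape(H, W, C) + X
    # Eq. (24): average pooling with kernel size and stride G2.
    A1_hat = A1.reshape(H // G2, G2, W // G2, G2, C).mean(axis=(1, 3))
    # Eqs. (25)-(27): full-resolution queries, downsampled keys and values.
    Q2 = A1.reshape(H * W, C) @ params["q2"]
    K2 = A1_hat.reshape(-1, C) @ params["k2"]
    V2 = A1_hat.reshape(-1, C) @ params["v2"]
    A2 = attend(Q2, K2, V2).reshape(H, W, C)   # global attention, Eq. (28)
    return (A1 + A2) @ params["p"] + X         # Eq. (29)

# Hypothetical 8 x 8 feature map with 4 channels and random weights.
rng = np.random.default_rng(0)
names = ["q1", "k1", "v1", "q2", "k2", "v2", "p"]
params = {name: rng.normal(0.0, 0.5, (4, 4)) for name in names}
X = rng.random((8, 8, 4))
out = h_mhsa(X, G1=2, G2=4, params=params)     # same shape as X
```

With these sizes the local step attends within 16 grids of 4 tokens each, and the global step compares 64 queries against only 4 pooled keys, which is where the saving over full attention across all 64 tokens comes from.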

Evaluation metrics

Evaluation metrics are used to assess the effectiveness of a predictive model or algorithm. These numerical quantities help us assess how well a model or algorithm solves a problem. Measures can be developed using both quantitative and qualitative methods. They also allow numerous models to be compared by providing objective criteria against which the effectiveness of a model can be evaluated.⁴² In addition, evaluation metrics help in making informed judgments about applying a model to a given task and in determining its overall effectiveness. They therefore also help evaluate whether the accuracy, generalizability, and consistency required for a particular problem are satisfied.

Mean absolute error (MAE)

MAE is the average absolute deviation between the predicted and observed values. In this method, all deviations are weighted equally.

MAE = \frac{1}{N} \sum_{i = 1}^{N} | {\hat{y}}_{i} - y_{i} |

(30)

where ${\hat{y}}_{i}$ represents the predicted value of epoch i, while $y_{i}$ represents the actual value.

Mean square error (MSE)

This metric is calculated by averaging the squared differences between the actual and predicted values, placing greater emphasis on larger deviations by penalizing them more heavily.

MSE = \frac{1}{N} \sum_{i = 1}^{N} ({\hat{y}}_{i} - y_{i})^{2}

(31)

Root mean square error (RMSE)

Root mean square error (RMSE) is a widely used metric for evaluating the precision of predictive models. It is computed by taking the square root of the average of the squared differences between the predicted and actual values. Because of its sensitivity to large errors, RMSE is considered a crucial indicator in evaluating model performance.

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {({\hat{y}}_{i} - y_{i})}^{2}}

(32)
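Equations (30) to (32) translate directly into NumPy. The sketch below uses made-up actual and predicted values, and it computes the plain metrics only; the normalization behind the NMAE, NRMSE, and NMSE values reported later is not defined in the text and is therefore not reproduced here.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (30)."""
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

def mse(y_true, y_pred):
    """Mean square error, Eq. (31)."""
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (32)."""
    return np.sqrt(mse(y_true, y_pred))

# Hypothetical actual and predicted power values.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 7.0, 6.0])

mae_value = mae(y_true, y_pred)     # 1.0
mse_value = mse(y_true, y_pred)     # 1.375
rmse_value = rmse(y_true, y_pred)   # square root of 1.375
```

The single error of 2.0 contributes half of the MAE but almost three quarters of the MSE, which shows how the squared metrics emphasize large deviations.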

Correlation

The correlation coefficient shows the strength of the relationship between two variables, and its absolute value is less than or equal to one. The closer the absolute value is to 1, the stronger the linear correlation between the two features. It is usually expressed as the Pearson correlation coefficient (r), which ranges from −1 to 1 and, for n data points with values x and y, is expressed as:

r = \frac{\sum (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum (x_{i} - \bar{x})^{2} \sum (y_{i} - \bar{y})^{2}}}

(33)

where $x_{i}$ and $y_{i}$ are the individual data points, while $\bar{x}$ and $\bar{y}$ are the mean values.⁴³,⁴⁴ Information gain and correlation can be used to identify the features with the greatest impact on the forecasting results.
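As a small worked illustration of Equation (33), the Pearson coefficient can be computed from the deviations of each variable from its mean. The wind-speed and power values below are invented for the example, and the function name `pearson_r` is an assumption.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, Eq. (33)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                 # deviations from the mean of x
    dy = y - y.mean()                 # deviations from the mean of y
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Hypothetical wind speed (m/s) and power output (MW) readings.
wind_speed = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
power = np.array([0.5, 2.1, 5.8, 12.4, 22.5])

r = pearson_r(wind_speed, power)
r_linear = pearson_r(wind_speed, 2.0 * wind_speed + 1.0)   # exactly linear, so r = 1
```

The result agrees with the off-diagonal entry of `np.corrcoef(wind_speed, power)`, which can serve as a cross-check when ranking features by their correlation with the target power.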

Results and discussion

Data description

The dataset used in this study comprises one year of data from a 50 MW wind power plant situated in Jhimpir city in the Sindh province of Pakistan and data from a 30 MW wind power plant in Texas. It contains seasonal trends and patterns highlighting wind power generation and variations over a period of 2 years. Annual fluctuations of different variables, such as wind speed, temperature, wind direction, and humidity, are highlighted. Historical data play a crucial role in developing accurate predictive models and strategies that examine these input variables and use time series data. Variables such as temperature, wind shear, and humidity are key features to consider in a forecasting model. These variables are integrated through mathematical relationships that correlate wind power with other external environmental factors. Even though this study does not tackle these environmental factors directly, the modelling approach includes them mathematically in order to enhance accuracy and to develop a better understanding of forecasting within the boundaries of the available time series wind power data.

Results and discussions

To forecast wind power plant output, GRU, Bi-LSTM, and Transformer models and an innovative Dynamic context-aware power forecasting time series model combined with an LSTM are employed. This model is executed and evaluated on an Intel i7 processor coupled with an NVIDIA Quadro RTX 6000 with 16GB of RAM. The model's various algorithms were developed and refined using both datasets, divided into 70% for training, 15% for validation, and 15% for testing.⁴⁵ Figure 4 depicts the frequency histogram of the actual data available.

Figure 4.

Frequency histogram for the actual wind power plant of Jhimpir dataset.

The learning rate for the model is set to 0.001, and the Adaptive Moment Estimation (Adam) optimizer is used for training. Training and validation of the models occur on their respective datasets, with the LSTM-Dynamic context-aware model's validation and training losses displayed in Figure 5. The performance metrics for the GRU, Bi-LSTM, Transformer, and LSTM-Dynamic context-aware models are shown in Figures 6 and 7, while their average normalized scores are presented in Tables 1 and 2. Further evaluation of the LSTM-Dynamic context-aware model with different optimizers (Adam, Adamax, Adagrad, and RMSprop) is detailed in Figure 8, with normalized average values in Table 3. These optimizers were used to enhance and test the model's predictions for both the Jhimpir and Texas wind power time series datasets. Among them, the Adam optimizer yielded the most favorable outcomes, with the lowest average error values of 1.262, 1.721, and 2.523, while Adagrad yielded the highest values of 1.747, 2.351, and 4.240 for NMAE, NRMSE, and NMSE, respectively, indicating that the choice of optimizer can significantly impact model performance.

Figure 5.

Training and validation loss plot for dynamic context-aware power forecasting model.

Figure 6.

Average monthly normalized prediction error of different models for Jhimpir site dataset.

Figure 7.

Average monthly normalized prediction error of different models for Texas site dataset.

Figure 8.

Average monthly normalized prediction error of dynamic context-aware model using different optimizers.

Table 1.

Average normalized evaluation results of all models in the Jhimpir site dataset.

Evaluation metric	LSTM-dynamic context-aware	Transformer	Bi-LSTM	GRU
NMAE	1.263	1.511	1.850	1.735
NRMSE	1.721	2.192	2.600	2.610
NMSE	2.842	4.754	5.402	5.277

Table 2.

Average normalized evaluation results of all models in the Texas site dataset.

Evaluation metric	LSTM-dynamic context-aware	Transformer	Bi-LSTM	GRU
NMAE	1.243	1.631	1.622	2.002
NRMSE	1.518	1.800	2.097	2.679
NMSE	2.427	4.351	5.297	6.687

Table 3.

Average normalized prediction performance for LSTM-dynamic context-aware model with respective optimizers.

Evaluation metric	Adam	Adamax	Adagrad	RMSprop
NMAE	1.263	1.490	1.747	1.416
NRMSE	1.721	1.980	2.351	1.946
NMSE	2.523	3.256	4.240	2.756

Samples are randomly selected from the wind power dataset to demonstrate the prediction capabilities. The models forecast power output for the forthcoming 12 h, as illustrated in Figure 9. These trained and validated models are then tested on the test dataset to predict wind power performance based on wind speed, direction, and ambient temperature. The correlations among the features used in the datasets are shown in Figure 10, which indicates a strong positive correlation between wind speed and power. The results indicate that the LSTM-Dynamic context-aware model outperforms the others on the Jhimpir dataset, with average NRMSE, NMAE, and NMSE values of 1.721, 1.263, and 2.842, respectively, while the Transformer model follows with values of 2.192, 1.511, and 4.754. On the Texas dataset, the LSTM-Dynamic context-aware model again leads the other models, with average NRMSE, NMAE, and NMSE values of 1.518, 1.243, and 2.427, respectively, while the Transformer model continues to follow with values of 1.800, 1.631, and 4.351.

Figure 9.

Forecasting results of the different models for the next 12 h on the Jhimpir dataset.

Figure 10.

Correlation among the target power and the features.
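
A correlation analysis of this kind is commonly computed with the Pearson coefficient. The source does not state here which coefficient underlies Figure 10, so the choice below is an assumption, and the sample values are invented for illustration rather than drawn from either dataset.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length and at least 2 samples")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical samples: wind speed (m/s) and power output (MW).
    wind_speed = [3.0, 5.0, 7.0, 9.0, 11.0]
    power = [2.0, 9.0, 21.0, 38.0, 49.0]
    print("corr(wind speed, power) =", round(pearson(wind_speed, power), 3))
```

A value close to +1 corresponds to the strong positive relationship between wind speed and power described in the text; values near 0 would indicate little linear relationship.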

However, the performance of the GRU and Bi-LSTM models lags behind because they overfit during training, which leads to inaccurate power predictions. Forecasting wind power is crucial for wind turbine management and for the preparation of hybrid wind power systems, and it plays a pivotal role in wind turbine generation. Wind interval forecasting, which assesses changes in predicted performance due to uncertain factors, provides vital information for power system planning. Its comprehensive capabilities are crucial for grid design and operation, including wind farms. Furthermore, deep learning models offer insights that can guide decision-makers in wind turbine manufacturing to optimize wind power generation. Balancing supply and demand is essential for a sustainable energy economy and for cost efficiency, whereas congestion could reduce energy efficiency, compromise grid stability and security, and lead to higher long-term costs and conservation issues.

Conclusion

Wind power is a low-carbon, zero-emission energy source that brings major technological advancements. Forecasting the energy generated from wind is crucial for renewable technologies and environmental progress. Deep learning techniques have recently emerged as effective methods for advanced forecasting. This paper presents a novel LSTM-Dynamic context-aware time-series forecasting model, an improved Transformer framework designed to address the limitations of existing time-series forecasting models. To help anticipate difficult operational problems, minimize risks, and improve efficiency, the model is compared with three other models (GRU, Bi-LSTM, and Transformer) on two datasets, from the Jhimpir and Texas power plants. The comparison uses the predictions for the next 12 h after training and testing on each dataset. The accuracy of the models is assessed with the evaluation metrics, on which the dynamic context-aware model achieves the lowest error values. Combining LSTM with the dynamic context-aware model is useful for extracting highly nonlinear and complex patterns from a real-time input dataset, which supports wind power prediction and improves energy storage optimization and grid stability. The model can be extended and improved to provide comprehensive knowledge to decision-makers interested in wind turbines and energy optimization.

Footnotes

Acknowledgments

We are very thankful to Prof. Yong Wang, School of New Energy, North China Electric Power University, who gave valuable suggestions on the revision. We also extend our heartfelt appreciation to Mr Shoaib Ahmed, site data analyst at Jhimpir Power Plant, Pakistan, for providing the required 50-MW wind power plant dataset. Moreover, the authors are grateful for the generous support of the School of New Energy at North China Electric Power University, Beijing, China, for facilitating this work and for the award through the Chinese Scholarship Council 2020.

ORCID iDs

Yasir Jan

Mutale Sydney

Author contributions

Jan Yasir: writing—original draft and writing—review and editing. Ahmed Saeed: data curation and review and editing. Sydney Mutale: review and editing. Yan Jie: supervision and review and editing. Ishwor KC: review and editing.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Declaration of generative AI tools

The authors declare that they have not used any type of generative artificial intelligence for the writing of this manuscript or for the creation of images, graphics, tables, or their corresponding captions.

Data availability statement

The data that support the findings of this study are available from the corresponding author, upon reasonable request.


Enhancing wind power forecasting using novel LSTM-dynamic context-aware power forecasting hybrid model
