Abstract
When power measurement data are transmitted over communication networks from remote terminal units (RTUs) to the state estimator in Supervisory Control and Data Acquisition (SCADA), power cyber-physical systems (PCPSs) are susceptible to cyber-attacks. To mitigate this threat, this paper presents a new machine-learning-based data recovery strategy against false data injection attacks (FDIAs) in PCPSs. First, in view of the limited resources (such as limited energy) of adversaries and the protection of the system, a sparse targeted FDIA is constructed. Then, the FDIA detection problem is transformed into a tripartite separation problem, and the alternating direction method of multipliers on proximal exchange (ADMM-PE) is adopted to detect FDIAs. In addition, using the reliable mask information and real incomplete measurement data provided by FDIA detection, a similar supervised generative adversarial imputation network (GAIN) is proposed to recover the measurement data after FDIAs. Specifically, pseudo-labels generated by data analysis methods such as k-means clustering and support vector machine (SVM) are used to improve the accuracy of measurement data recovery. Finally, experimental results on PCPSs show the effectiveness and superiority of the proposed data recovery strategy against FDIAs.
Introduction
With more and more renewable power generation and smart devices, power systems are increasingly dependent on cyber infrastructure (e.g. interface terminals that promote real-time analysis, and open communication networks).1,2 Power Cyber Physical Systems (PCPSs) are therefore a product of the high informatization, intelligence, and deep networking of power systems. However, the deep integration of advanced information technology brings not only convenience but also many information security issues to PCPSs. While data are transmitted through communication networks, PCPSs are susceptible to cyber-attacks such as false data injection (FDI) attacks and denial of service (DoS) attacks, where FDI attacks (FDIAs) can inject subtle, false biases into the data in several ways, for example through the network layer or the communication channel. In addition, FDIAs can be intelligent and stealthy, collaboratively altering the meter measurements in PCPSs with the objective of minimizing the number of contaminated measurements while maximizing the attack impact.3,4 In particular, unlike DoS attacks, FDIAs launched through intrusion of the communication link, magnetic field injection or global positioning system (GPS) spoofing do not destroy the observability of the system, 1 which makes them relatively difficult to detect and eliminate.5,6
There have been numerous studies on intrusion detection methods against FDIAs for measurement data in PCPSs. According to Musleh et al., 5 intrusion detection methods against FDIAs can be divided into model-based and data-driven methods, and the advantages and disadvantages of these methods are analyzed and compared. Musleh et al. 5 also point out future development trends for FDI attack (FDIA) detection methods, namely methods that adapt to new types of FDIAs and to changes in system topology and parameters, or that are independent of system models and parameters. Therefore, more and more FDIA detection methods based on machine learning (ML), which are data-driven or independent of the system model and parameters, are adopted to detect FDIAs in PCPSs. Examples include a fast go-decomposition (GoDec) approach based on matrix decomposition, 7 non-convex robust principal component analysis (NcRPCA) based on matrix separation, 8 a semi-supervised deep learning approach based on autoencoders and an advanced generative adversarial network (GAN), 9 a detection mechanism based on a temporal and spatial correlation method and a deep convolutional neural network, 10 secure federated deep learning with Transformer, federated learning and the Paillier cryptosystem, 11 a detection method based on the Kalman filter and a recurrent neural network, 12 FDIA detection based on the spectral energy of the Hilbert-Huang transform, 2 and a novel interval state forecasting-based detection scheme based on ensemble learning of a long short-term memory neural network and a parametric Gaussian distribution. 4 It is worth noting that ML-based FDIA detection methods are suitable for detecting all malicious attacks, including FDIAs, and have good universality. However, this type of FDIA detection method requires a large amount of training time and depends heavily on training samples.
Compared with ML-based FDIA detection methods, detection methods based on matrix decomposition or separation do not have the above-mentioned problems, but they are only suitable for detecting FDIAs.
However, intrusion detection alone is not enough; effective recovery methods for measurement data are also needed to provide complete and reliable measurement data for state estimation and control decisions in PCPSs. Designing state estimators that are resilient and robust to FDIAs can indeed reduce the impact of attacks on the system. However, the design of anti-FDIA state estimators relies heavily on system modeling, and the modeling of complex PCPSs is very difficult, so a data-driven measurement data recovery method is a better choice. In addition, the detected false data are often discarded in PCPSs, which leads to incomplete data. When the discarded data reach a certain scale, downstream applications are seriously affected; for example, incomplete power measurement data can affect state estimation and control decisions in supervisory control and data acquisition (SCADA). Therefore, how to recover the incomplete data has become a focus of attention, and it can be cast as a data imputation problem for multivariate time series (DIP-MTS). 13
In general, DIP-MTS methods can be divided into two main categories: non-deep learning (DP)-based and DP-based methods. Non-DP-based methods include imputation with mean values, median values, clustering, etc., 14 while DP-based methods include matrix completion, recurrent neural network (RNN)-based methods (e.g. bidirectional recurrent imputation,15,16 gated recurrent unit (GRU)-D 17 ), and GAN-based methods. However, non-DP-based methods for DIP-MTS have difficulty capturing the complex nonlinear correlations in PCPSs, and their imputation errors are large when the missing rate is relatively high. Most DP-based methods (e.g. RNN, GRU-D, GAN) need complete data for training, which limits their application in PCPSs.13,15 This paper focuses on how intrusion detection can assist in completing the DIP-MTS using incomplete measurement data and a DP-based method.
The intrusion detection problem against FDIAs in PCPSs and the data imputation problem, as two relatively independent problems, have each received much attention. In this paper, the two problems are combined to address intrusion detection, identification, and data recovery against FDIAs in PCPSs. Our main contributions are summarized as follows:
(1) A sparse targeted FDIA against PCPSs is constructed to deceive the traditional detector in the state estimator. A new data recovery strategy against FDIAs is presented to recover the measurement data of PCPSs; it combines FDIA detection and data imputation techniques based on the alternating direction method of multipliers (ADMM) and similar supervised generative adversarial imputation nets.
(2) The results of the proposed FDIA detection provide the mask information and real incomplete measurement data for data imputation. Moreover, k-means clustering and support vector machine (SVM) are introduced to improve the accuracy of measurement data recovery.
(3) Experiments on PCPSs under FDIAs are conducted, and the results are evaluated quantitatively and qualitatively to demonstrate the effectiveness and superiority of our data recovery strategy.
The rest of this paper is organized as follows. In Section 2, the preliminaries, including state estimation and the modeling of false data injection attacks, are introduced. In Section 3, intrusion detection (ID) based on machine learning with a tripartite separation data model is presented. Then, data imputation based on the improved generative adversarial imputation networks is proposed in Section 4. In Section 5, the proposed integrated data recovery strategy against FDIAs is described. Finally, the experimental results and analyses of the data recovery strategy against FDIAs in PCPSs are presented in Section 6, and the conclusion is given in Section 7.
Preliminary
State estimation
A safe and reliable PCPS requires accurate state estimation to make follow-up control decisions. The state estimation problem is the process of deriving state variables from measurement values
where
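As a minimal sketch of the state estimation step, the snippet below assumes the standard linear (DC) measurement model z = Hx + e with a known measurement matrix H and noise covariance R, and solves it by weighted least squares. The symbols, the function names `wls_estimate` and `residual_norm`, and the toy dimensions are illustrative assumptions rather than the notation of this paper.

```python
import numpy as np

def wls_estimate(H, z, R):
    """Weighted least squares estimate: argmin_x (z - Hx)^T R^{-1} (z - Hx)."""
    W = np.linalg.inv(R)                 # weights are the inverse noise covariance
    G = H.T @ W @ H                      # gain matrix
    return np.linalg.solve(G, H.T @ W @ z)

def residual_norm(H, z, x_hat):
    """Euclidean norm of the measurement residual used by residual-based detectors."""
    return float(np.linalg.norm(z - H @ x_hat))

# Toy system (assumed sizes): 8 measurements, 3 state variables.
rng = np.random.default_rng(0)
m, n = 8, 3
H = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
sigma = 0.01
R = sigma ** 2 * np.eye(m)
z = H @ x_true + sigma * rng.normal(size=m)

x_hat = wls_estimate(H, z, R)
r = residual_norm(H, z, x_hat)
```

A residual-based detector would compare `r` (or its weighted version) against a threshold; the threshold choice is not shown here.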
Modeling of false data injection attacks
According to the work on FDIAs first proposed in Liu et al., 18 a successful FDIA that passes the residual-based detector can be constructed:
Considering the specific scenarios and limited conditions in which FDIAs are launched, the sparse FDIA was proposed as early as 2014 in Liu et al., 19 and the sparsity of FDIAs was also defined as
where
where
(4) is clearly a least absolute shrinkage and selection operator (Lasso) problem that can be solved using the ADMM. However, due to the limited resources of adversaries, at most
where
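As background for the ADMM solver mentioned above, the following is a minimal sketch of scaled-form ADMM for a generic Lasso problem, min 0.5‖Ax − b‖² + λ‖x‖₁. The matrix `A`, vector `b`, and the parameters `lam`, `rho`, `n_iter` are placeholders; how they map onto problem (4) is an assumption left to the reader.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_1 (entrywise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_admm(A, b, lam, rho=1.0, n_iter=300):
    """Minimize 0.5 * ||A x - b||_2^2 + lam * ||x||_1 with scaled-form ADMM."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)      # sparse copy of x
    u = np.zeros(n)      # scaled dual variable
    lhs = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(lhs, Atb + rho * (z - u))   # quadratic subproblem
        z = soft_threshold(x + u, lam / rho)            # l1 subproblem
        u = u + x - z                                   # dual update
    return z

# With A = I the Lasso solution is the soft-thresholded b, i.e. about [2, 0, -1].
x_demo = lasso_admm(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
```

Returning `z` rather than `x` gives exact zeros in the estimate, which is convenient when the support of a sparse vector is of interest.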

Figure 1. Two FDIA models against the state estimation in power cyber-physical systems.
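The stealth property of FDIAs against a residual-based detector can be checked numerically. The sketch below assumes the widely used construction in which the attack vector is a = Hc, so that it lies in the column space of the measurement matrix H; the toy matrix, the bias vector `c`, and the naive single-meter attack are hypothetical values chosen only for illustration, and ordinary least squares (equal noise variances) stands in for the state estimator.

```python
import numpy as np

def ls_residual(H, z):
    """Residual of the ordinary least squares state estimate."""
    x_hat, *_ = np.linalg.lstsq(H, z, rcond=None)
    return z - H @ x_hat

rng = np.random.default_rng(1)
m, n = 10, 4
H = rng.normal(size=(m, n))
z = H @ rng.normal(size=n) + 0.01 * rng.normal(size=m)

c = np.zeros(n)
c[2] = 0.5                    # state bias the attacker wants to induce (hypothetical)
a_stealth = H @ c             # attack in the column space of H
a_naive = np.zeros(m)
a_naive[3] = 0.5              # uncoordinated bias on a single meter

r_clean = np.linalg.norm(ls_residual(H, z))
r_stealth = np.linalg.norm(ls_residual(H, z + a_stealth))
r_naive = np.linalg.norm(ls_residual(H, z + a_naive))
```

Here `r_stealth` matches `r_clean` up to rounding, while `r_naive` is larger. Note that a = Hc is generally dense; restricting the number of tampered meters is what the sparse attack model in this section addresses.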
Intrusion detection (ID) on machine learning
Transformation problem of ID
Data imputation can repair abnormal measurement data in PCPSs, but it requires the locations of the tampered data, which are obtained through intrusion detection. Therefore, adopting an appropriate intrusion detection method against FDIAs is essential for obtaining reliable location information for the tampered data.
Firstly, due to system protection and the limited ability of adversaries, adversaries can only tamper with part of the measurement data if the FDIAs are to evade traditional detectors; hence, FDIAs are generally sparse. Secondly, when PCPSs operate stably, the measurements and system states change slowly over a short time, so the matrix composed of measurement data is generally low rank. Therefore, a matrix separation algorithm, as a mathematical method of machine learning that has been used specifically to detect FDIAs,7,19 can be introduced to detect FDIAs; that is, the FDIA detection problem can be transformed into the following convex matrix separation problem:
where
Intrusion detection on tripartite separation data model
Considering the impact of measurement noise on intrusion detection, an ADMM on proximal exchange (ADMM-PE) method for tripartite separation data model is proposed as a solution to the FDIA detection problem:
where
Moreover, a proximal operator is introduced to obtain the optimal solution of (7) in the tripartite separation, which converts the solution of (7) into a distributed and parallel ADMM-based method. 24
Specifically, the proximal operator
where
where
where
In addition,
where
Finally, through continuous iteration, sparse attacks
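The idea of splitting a measurement matrix into a low-rank part, a sparse attack part and a noise part can be illustrated with a much simpler scheme than ADMM-PE. The sketch below is an assumption-laden stand-in: it minimizes ‖Z‖* + λ‖A‖₁ + (μ/2)‖M − Z − A‖²_F by alternating two exact proximal steps (singular value thresholding and soft thresholding) and takes the remainder as noise. The objective, the parameters `lam` and `mu`, and the synthetic data are illustrative choices, not the formulation in (7).

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Proximal operator of tau * ||X||_* (singular value thresholding)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def objective(M, Z, A, lam, mu):
    nuc = np.linalg.svd(Z, compute_uv=False).sum()
    return nuc + lam * np.abs(A).sum() + 0.5 * mu * np.linalg.norm(M - Z - A) ** 2

def tripartite_separate(M, lam, mu, n_iter=100):
    """Alternate exact minimization over Z (low rank) and A (sparse)."""
    Z = np.zeros_like(M)
    A = np.zeros_like(M)
    for _ in range(n_iter):
        Z = svt(M - A, 1.0 / mu)
        A = soft_threshold(M - Z, lam / mu)
    E = M - Z - A                     # what is left is treated as noise
    return Z, A, E

# Synthetic data (assumed): 50 time samples of 20 slowly varying measurements.
rng = np.random.default_rng(0)
T, m = 50, 20
L0 = np.outer(np.ones(T), rng.uniform(1.0, 2.0, size=m))
S0 = np.zeros((T, m))
idx = rng.choice(T * m, size=50, replace=False)       # 5% tampered entries
S0.flat[idx] = 3.0 * rng.choice([-1.0, 1.0], size=50)
M = L0 + S0 + 0.01 * rng.normal(size=(T, m))

lam, mu = 0.15, 0.1
Z, A, E = tripartite_separate(M, lam, mu)
flagged = np.abs(A) > 0               # candidate tampered entries
```

Because each step minimizes the objective exactly over one block, the objective value never increases. The thresholds `1 / mu` and `lam / mu` control how much energy is assigned to the low-rank and sparse parts and would need tuning on real measurement data.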
Data imputation on the improved generative adversarial imputation networks
On the one hand, results of FDIA detection provide a binary mask matrix
where
Basic GAIN
GAIN includes two generators and one discriminator, adding a hint generator on the top of the traditional GAN. In GAIN, incomplete measurement data
where
where the entries of
The biggest feature of GAIN is the introduction of the hint generator,15,27 the output
where
The goals of the discriminator
where
with
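The data flow through GAIN can be sketched without training any network. The snippet below assumes the formulation of the original GAIN method: missing entries are filled with noise before entering the generator, the imputed matrix keeps observed entries and takes generator outputs elsewhere, and the hint reveals part of the mask to the discriminator. The function `column_mean_generator` is a hypothetical stand-in for a trained generator, and `hint_rate` is an assumed parameter; the discriminator and the losses are not shown.

```python
import numpy as np

def column_mean_generator(x_tilde, mask):
    """Hypothetical stand-in for a trained generator: predicts observed column means."""
    counts = np.maximum(mask.sum(axis=0), 1.0)
    col_mean = (x_tilde * mask).sum(axis=0) / counts
    return np.tile(col_mean, (x_tilde.shape[0], 1))

def gain_forward(x, mask, generator, rng, hint_rate=0.9):
    """One forward pass: noise filling, generation, imputation and hint construction."""
    noise = rng.uniform(0.0, 0.01, size=x.shape)
    x_tilde = mask * np.nan_to_num(x) + (1.0 - mask) * noise   # generator input
    x_bar = generator(x_tilde, mask)                           # generator output
    x_hat = mask * x_tilde + (1.0 - mask) * x_bar              # keep observed entries
    b = (rng.uniform(size=x.shape) < hint_rate).astype(float)
    hint = b * mask + 0.5 * (1.0 - b)                          # 0.5 means "not revealed"
    return x_hat, hint

rng = np.random.default_rng(0)
x = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])
mask = (~np.isnan(x)).astype(float)       # 1 = observed, 0 = missing
x_hat, hint = gain_forward(x, mask, column_mean_generator, rng)
```

With the column-mean stand-in, the two missing entries are filled with 5 and 2 while the observed entries are left untouched.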
Similar supervised GAIN
GAIN is an unsupervised machine learning algorithm. In order to improve the accuracy of measurement data recovery, a clustering algorithm is introduced to provide potential category information for GAIN, transforming it from an unsupervised into a similar supervised algorithm. 26
At first, the sampling measurement data is arranged in ascending order according to the level of
In the second step, the k-means clustering algorithm is applied to the data to form the pseudo-labels. 27 Next, a support vector machine (SVM) is trained as a classifier with the pseudo-labels and the imputed data.
Then, with the information from SVM, the generator and discriminator play multiple games again. The goal of the discriminator is the same as that of the discriminator in basic GAIN (as shown in (19)), and the goal of the generator is transformed as follows:
where
The proposed data recovery strategy against FDIAs
For FDIAs in PCPSs, a data recovery strategy that consists of two parts is proposed. The first part is intrusion detection with the proposed ADMM-PE, which separates and detects the sparse FDIAs through the tripartite separation technique and proximal exchange, and provides the low-rank measurement data, the noise and the sparse attack data. Then, the incomplete measurement data and mask information for subsequent measurement data imputation are obtained from the tampered locations indicated by the sparse attack data. The intrusion detection with the proposed ADMM-PE is summarized in Algorithm 1:
The second part is data imputation, in which a similar supervised GAIN algorithm is proposed to restore power measurement data, and provide reliable and complete data for state estimation. The proposed data recovery strategy against FDIAs is shown in Figure 2.

Figure 2. The proposed data recovery strategy against FDIAs.
According to Figure 2, once the FDIAs are launched, the intrusion detection method on ADMM-PE can detect and identify the tampered location, generating mask information. Then, the tampered measurement data will be discarded, resulting in incomplete measurement data
The results of intrusion detection, including the mask information and the incomplete measurement data, provide the input for the subsequent data imputation process. Considering that the FDIAs are sparse attacks, the proposed similar supervised GAIN algorithm is introduced to complete the data recovery task; it is suitable for recovering data with a low missing rate (MR) and also improves the accuracy of data recovery with the help of pseudo-labels. The specific steps of the data imputation are as follows:
(1) Pre-training: select a subset of measurement data samples with low missing rates to pre-train the generator and discriminator, and obtain the pre-imputed (measurement) data through GAIN;
(2) K-means clustering is introduced to generate the pseudo-labels on the pre-imputed data, and then an auxiliary SVM classifier is trained with the pseudo-labels and the pre-imputed data;
(3) Formal training: all measurement data samples are used to train the generator and the discriminator in GAIN, while the classification information obtained by the SVM is used to constrain the generator and force it to learn features from different classes. In the end, the final imputed measurement data are obtained through the games between the generator and the discriminator.
Apparently, the quality of the generated pseudo-labels affects the performance of the generator and discriminator by affecting the performance of the auxiliary classifier.
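The pseudo-label step of the procedure above can be sketched as follows. This is a minimal NumPy version of Lloyd's k-means with a deterministic farthest-point initialization, applied to a hypothetical array `pre_imputed` that stands for the output of the pre-training stage. The initialization scheme, the array name and the toy data are assumptions, and the auxiliary SVM that would be trained on these labels is not implemented here.

```python
import numpy as np

def farthest_point_init(X, k):
    """Pick the first sample, then repeatedly the sample farthest from chosen centers."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)

def kmeans_pseudo_labels(X, k, n_iter=50):
    """Lloyd's algorithm; returns one pseudo-label per sample and the cluster centers."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                 # assign to the nearest center
        for j in range(k):
            if np.any(labels == j):               # keep old center if cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Toy pre-imputed data (assumed): two operating regimes of 20 samples each.
rng = np.random.default_rng(0)
pre_imputed = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                         rng.normal(5.0, 0.1, size=(20, 2))])
labels, centers = kmeans_pseudo_labels(pre_imputed, k=2)
```

The resulting `labels` play the role of pseudo-labels; a classifier trained on `(pre_imputed, labels)` would then supply the class information used to constrain the generator.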
Experiments
We validate the performance of the proposed data recovery strategy (the proposed DRS) on the IEEE 14-bus power system extracted from the MATPOWER toolbox, where the measurement data are generated by direct current power flow operations. Furthermore, we assume that adversaries can replace true measurement data with tampered measurement data through the communication link to deceive the state estimator in SCADA.
Experiments for intrusion detection on traditional chi square
In this section, we first verify the stealth of FDIAs against the traditional residual-based detector. When the attack density of the sparse attacks (i.e. the sparsity of FDIAs) is 6% (all subsequent experiments use the same attack density), the traditional residual-based detector cannot detect the H-FDIA and Blind-FDIA well, as shown in Figure 3, where the true positive rate is defined as follows:

Figure 3. The results of chi square detection against H-FDIA and Blind-FDIA.
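For reference, the true positive rate can be computed from ground-truth attack indicators and detector decisions as sketched below, assuming the standard definition TPR = TP / (TP + FN). The function name and the example vectors are illustrative only.

```python
import numpy as np

def true_positive_rate(attacked, flagged):
    """Fraction of truly attacked samples that the detector flags."""
    attacked = np.asarray(attacked, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    tp = int(np.sum(attacked & flagged))      # attacked and detected
    fn = int(np.sum(attacked & ~flagged))     # attacked but missed
    if tp + fn == 0:
        return float("nan")                   # undefined without any attacks
    return tp / (tp + fn)

# Hypothetical example: three attacked samples, two of which are detected.
tpr = true_positive_rate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```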
Experiments for intrusion detection on tripartite separation model
Firstly, in order to evaluate the detection performance of FDIA detection methods, except
where
Four machine learning methods based on matrix separation are compared in terms of FDIA detection performance against pFDIA in the IEEE 14-bus power system under different signal to noise ratios (SNRs): inexact augmented Lagrange multipliers (IALM), low rank matrix factorization (LMaFit), 19 a fast Go Decomposition (GoDec), 7 and the proposed intrusion detection with ADMM-PE. The sampling time is set:
Table 1 shows the performance of the FDIA detection methods in the IEEE 14-bus power system under different signal to noise ratios (SNRs), where bold font marks the proposed algorithm and the best results. According to Table 1, IALM has the best performance on
Table 1. The performance of FDIA detection methods in IEEE 14-bus power system.
Then, the results of intrusion detection should be processed: the tampered measurement data are discarded by setting the corresponding entries to 0 or null, so that the incomplete measurement data are obtained according to (24), and the corresponding mask information is generated for use in the data recovery process in the next section.
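The post-processing just described can be sketched as follows. The snippet assumes that the detector returns an estimate of the sparse attack matrix, that an entry is treated as tampered when its estimated attack magnitude exceeds a small tolerance, and that the mask uses 1 for trusted and 0 for discarded entries; the function name, the tolerance `tol` and the mask convention are assumptions.

```python
import numpy as np

def build_mask_and_incomplete(measurements, attack_estimate, tol=1e-6):
    """Discard entries flagged as tampered and return (mask, incomplete data)."""
    mask = (np.abs(attack_estimate) <= tol).astype(float)    # 1 = trusted, 0 = tampered
    incomplete = np.where(mask == 1.0, measurements, np.nan)  # null out tampered entries
    return mask, incomplete

# Hypothetical 2 x 2 example with one tampered entry.
measurements = np.array([[1.0, 2.0], [3.0, 4.0]])
attack_estimate = np.array([[0.0, 0.7], [0.0, 0.0]])
mask, incomplete = build_mask_and_incomplete(measurements, attack_estimate)
missing_rate = 1.0 - mask.mean()
```

Using NaN rather than 0 for discarded entries avoids confusing a removed value with a genuine zero measurement.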
Experiments for measurement data recovery
Data imputation technology is adopted to complete the measurement data recovery in the paper. According to Figure 2, the dataset including
For the comparison experiments, parameters are tuned one at a time: during each adjustment, one parameter is set manually while the others are fixed. The experiment for each parameter group is then repeated 20 times, and the average imputation performance is recorded. Finally, the best group is selected as the final parameters. In this way, the parameters of PC-GAIN are set to
First, we compare the performance of these algorithms under different missing rates
where

Figure 4. The RMSE of four data imputation algorithms under different missing rates
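A common way to compute the imputation RMSE, sketched below, is to average the squared error over the missing entries only, since observed entries are copied through unchanged. Whether the experiments here use exactly this normalization is an assumption; the function name and the toy arrays are illustrative.

```python
import numpy as np

def imputation_rmse(x_true, x_imputed, mask):
    """RMSE over missing entries only (mask: 1 = observed, 0 = missing)."""
    x_true = np.asarray(x_true, dtype=float)
    x_imputed = np.asarray(x_imputed, dtype=float)
    missing = 1.0 - np.asarray(mask, dtype=float)
    n_missing = missing.sum()
    if n_missing == 0:
        return 0.0                                  # nothing was imputed
    sq_err = missing * (x_true - x_imputed) ** 2
    return float(np.sqrt(sq_err.sum() / n_missing))

# Hypothetical example: both imputed entries are off by 0.5.
rmse = imputation_rmse([[1.0, 2.0], [3.0, 4.0]],
                       [[1.0, 2.5], [3.0, 3.5]],
                       [[1, 0], [1, 0]])
```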
According to Figure 4, the performances of the four algorithms are stable under different missing rates
Next, we record the running time of the four algorithms under different
Table 2. Comparisons of running time (s) for four data imputation algorithms under different
From Table 2, it can be seen that the running time of both traditional methods is very short. In contrast, the running time of PC-GAIN and GAIN is relatively long, because both are deep learning methods and require more time to train. Moreover, PC-GAIN has more parameters and a more complex calculation process (e.g. the pre-training phase), since it further improves the inference ability of the generator by adding implicit category information on the basis of GAIN. PC-GAIN therefore requires more time than the other three algorithms: its running time is about 1 s longer than that of GAIN and much longer than that of MEAN and KNN.
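To make the two traditional baselines concrete, the sketch below shows one simple way to implement mean and k-nearest-neighbour imputation with NumPy. The exact variants used in the experiments are not specified in the text, so the neighbour search on mean-filled rows, the parameter `k` and the toy matrix are assumptions.

```python
import numpy as np

def mean_impute(X):
    """Replace each missing entry (NaN) with the mean of its column."""
    X = np.asarray(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_mean, X)

def knn_impute(X, k=2):
    """Fill missing entries with the average of the k nearest rows.

    Simplification: distances and neighbour values use the mean-filled matrix.
    """
    X = np.asarray(X, dtype=float)
    filled = mean_impute(X)
    out = X.copy()
    for i in range(len(X)):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        d = np.linalg.norm(filled - filled[i], axis=1)
        d[i] = np.inf                              # exclude the row itself
        nn = np.argsort(d)[:k]
        out[i, miss] = filled[nn][:, miss].mean(axis=0)
    return out

# Toy data (assumed): two groups of rows, each with one missing entry.
X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, np.nan],
              [5.0, 5.0], [5.1, 5.0], [5.0, np.nan]])
X_mean = mean_impute(X)
X_knn = knn_impute(X, k=2)
```

In this toy case mean imputation fills both gaps with the global column mean, whereas the neighbour-based variant fills each gap from rows of the same group. Both need no training, which is consistent with their short running times.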
In addition, we study the impact of the cluster number in k-means clustering on the performance of PC-GAIN algorithm, as shown in Figure 5.

Figure 5. The RMSE of the data imputation on PC-GAIN with different cluster numbers and
According to Figure 5, we can see that when the
In general, the proposed ADMM-PE intrusion detection in the previous section provides reliable information for subsequent data imputation, ensuring the effectiveness of measurement data recovery in PCPSs. In terms of data recovery, compared with unsupervised GAIN, the proposed similar supervised GAIN achieves better imputation accuracy (minimum RMSE) but takes slightly more time (lower computing efficiency). However, with the rapid improvement of hardware computing and processing capabilities, computing efficiency is not expected to be a major issue.
Conclusion
In this paper, we have developed a new machine-learning-based data recovery strategy against FDIAs in PCPSs. A sparse targeted FDIA has been introduced, and machine learning algorithms such as ADMM-PE and similar supervised GAIN have been integrated to provide a data recovery solution after FDIAs. Specifically, the FDIA detection problem is transformed into a tripartite separation problem, whose solution provides reliable inputs, such as mask information and real incomplete measurement data, to the proposed similar supervised GAIN. With the help of the k-means clustering algorithm and SVM, the pseudo-labels generated by data analysis provide information similar to supervised learning for data imputation, improving the accuracy of measurement data recovery. Finally, an example in power cyber physical systems illustrates the effectiveness and superiority of the proposed data recovery strategy against FDIAs. The proposed strategy provides reliable and complete measurement data for the state estimator in SCADA, supporting the subsequent stable and reliable operation of the power system. Overall, since the protection configurations of PCPSs cannot be perfect once cost is considered, the proposed strategy is feasible in theory for recovering data after bad data injection caused by such sparse FDIAs.
However, the proposed data recovery strategy has two limitations. (1) In the intrusion detection phase, the proposed ADMM-PE based on matrix separation may fail when FDIAs are not sparse, according to equations (6) and (7). (2) In the data imputation phase, according to Table 2, PC-GAIN has a longer running time and may not meet the real-time requirements of PCPSs if the hardware computing and processing capabilities are insufficient. In general, with the improvement of hardware computing and processing capabilities, the proposed strategy remains a good data recovery strategy against FDIAs. In future research, feasible and efficient data recovery solutions against mixed cyber-attacks in PCPSs, not only sparse FDIAs, should be studied, since they are more universal; examples include hybrid intrusion detection methods against mixed cyber-attacks, intrusion detection based on ensemble learning, and data recovery based on self-attention-based imputation.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China [grant numbers 62006052, 61973128]; Basic and Applied Basic Research Foundation of Guangdong Province [grant numbers 2023A1515012468, 2022A1515110148, 2021A1515011520].
Data availability statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
