Machine Learning Based Detection of Anomalous User Behavior in University Data Centers

Abstract

Anomalies in the work of data center users can be caused by both Structured Query Language (SQL) injection attacks and user attempts to make unauthorized access to data. The paper explores various machine learning models to detect such anomalies. The peculiarity of the problem being solved is its focus on the university data centers, whose databases have a non-normalized structure. In this case, the problem of reducing the feature space arises. The paper proposes an algorithm for generating a dataset based on typing the data table names. The experimental results obtained on supervised, unsupervised and semi-supervised machine learning models confirmed the high efficiency of the proposed approach. They showed that the support vector machine, random forest, Gaussian Naive Bayes, and neural network models are the most effective in detecting known SQL injections, and the local outlier factor semi-supervised learning model is the most effective in detecting unknown SQL injections and unauthorized access attempts.

Keywords

Structured Query Language, cybersecurity, machine learning, classifiers, attacks

1. Introduction

Nowadays the popularity of using data centers (DCs) in control systems has significantly increased (Alqahtani et al., 2020). DCs provide their users with the possibility of joint sustainable and timely use of information resources in the interests of solving various problems (Mujib & Sari, 2020; Welsh & Benkhelifa, 2021). For this reason, DCs are the primary targets for internal and external security violators to obtain information or disrupt centers (Klymash et al., 2019; Paiusescu et al., 2018).

Different methods of anomaly detection can be used in the creation of DC information security systems. As a rule, anomalies are detected in network traffic with the help of various network security tools (e.g., intrusion detection systems, firewalls, antiviruses). However, traffic anomalies do not fully reflect anomalous DC user behavior. Such behavior manifests itself as incorrect, anomalous database queries that make it possible to maliciously change the content of databases or to obtain unauthorized information from them. Such queries are a special kind of computer attack: Structured Query Language (SQL) injection (Marashdeh et al., 2021). In addition, anomalous queries may look normal and contain no SQL injection, yet access forbidden areas of the database. Protection against such accesses is usually assigned to the database access control system. However, for complex databases with a large number of data tables, creating an access control system that completely bans anomalous queries is a very difficult task.

The proposed approach, based on machine learning, is designed to help accomplish this task. It uses database logs as the initial data from which the datasets for machine learning are formed. These logs record the texts of the queries with which users accessed the databases. If the databases are based on the relational model, the queries are written in SQL. However, the proposed approach is not strictly tied to SQL and can be applied to other types of databases, such as NoSQL. The paper discusses the possibility of using machine learning methods to detect anomalous user behavior in DCs that store information and solve problems related to the university educational process. The choice of this type of DC is explained by two reasons. First, educational process databases contain a very large number of data tables. As a result, if table names are used to form the feature space, the number of features will be very large, which makes the application of machine learning impossible or extremely difficult. Second, there are a large number of insider security threats in the university DC, since students should be considered potential insiders. Therefore, the development of an additional security frontier for the university DC that allows detecting anomalous requests to DC databases is an urgent task. At the same time, it should be noted that few known works are currently devoted to detecting or analyzing the possible malicious behavior of DC users.

The main contributions of the paper are as follows: (1) an original formal statement of the problem of detecting anomalous actions of DC users, oriented to the use of machine learning methods; (2) a new way of forming the feature space, which determines the structure of the training and testing datasets, from three categories of features: SQL keywords, signatures specific to SQL injection, and names of data tables; (3) a heuristic approach to reducing the dimension of the original feature space and an algorithm that implements it; (4) an implementation of the proposed approach using a variety of well-known machine learning models; (5) an experimental evaluation of the proposed approach, which confirmed its effectiveness and high efficiency.

The results of this study were reported at the 15th International Symposium on Intelligent Distributed Computing (IDC 2022) (Kotenko & Saenko, 2023). This article presents an extended description of the results obtained. Compared with Kotenko and Saenko (2023), we have updated the analysis of related works and examined in more detail the approaches to feature space optimization and to ensuring timely detection of anomalous user behavior. In the experiments, we used not only supervised machine learning methods but also unsupervised and semi-supervised learning. The experiments were carried out using 10-fold cross-validation in order to increase the accuracy of the results, and with optimization of parameters, which was especially necessary for the principal component analysis (PCA) method.

The paper is structured as follows. Section 2 provides an overview of related works. Section 3 discusses the theoretical foundations of the proposed approach. Section 4 describes the details of the software implementation of the proposed approach and the generation of a dataset for its experimental evaluation. Experimental results and their discussion are considered in Section 5. Section 6 contains conclusions and directions for further research.

2. Related Work

Works related to the topic of detecting anomalies in the operation of DCs can be divided into two groups: (1) detecting anomalies in the operation of DCs and (2) detecting computer attacks such as SQL injection. Among the works of the first group, the following should be noted: Chen et al. (2020), Decker et al. (2020), Deka et al. (2019), Nanekaran et al. (2020), Salman et al. (2017), and Shahid and Ali Shah (2021). Decker et al. (2020) note that log entries in DCs are stochastic and non-stationary in nature. Therefore, that work proposes an approach in which features are extracted from time windows and used to develop and update an evolving Gaussian Fuzzy Classifier on the fly. Shahid and Ali Shah (2021) propose using the word2vec algorithm to extract features from logs and an LSTM neural network (NN) to detect anomalies. Nanekaran et al. (2020) propose using unsupervised machine learning methods to determine the normal and abnormal behavior of DC cooling systems. The issues of preventing hostile influence on the detection of anomalies in DCs are considered in Deka et al. (2019), which proposes a linear regression-based optimization framework with the ability to poison data in the training phase. Chen et al. (2020) suggest detecting anomalies by judging the deviation between predicted and true DC operating data using various machine learning methods. Salman et al. (2017) use linear regression and random forest (RF) not only to detect but also to classify attacks in DC network traffic.

Despite the good results obtained in these anomaly detection works, they did not consider anomalies in SQL queries or SQL injection detection. The works of the second group are devoted to this, for example (Gowtham & Pramod, 2022; Hlaing & Khaing, 2020; Prarthana & Gangadhar, 2017; Xiao et al., 2017; Xie et al., 2019). Hlaing and Khaing (2020) emphasize that SQL injections became possible due to a lack of validation of input queries. Their paper presents an approach that checks query tokens against a lexicon of reserved words to detect SQL injections. Gowtham and Pramod (2022) propose a robust semantic query ensemble learning model for SQL injection prediction. The proposed learning model uses a set of nine basic classifiers combined in a voting ensemble to maximize prediction quality. Prarthana and Gangadhar (2017) propose using multivariate statistical tests to detect anomalies in the behavior of DC users. Xie et al. (2019) present a method for detecting SQL injections in web applications based on the Elastic-Pooling convolutional NN. Xiao et al. (2017) propose an approach to detecting SQL injections based on the analysis of the response and state of a web application during various attacks.

It should be noted that most of the recent SQL injection detection work focuses on the use of machine learning. Thus, Hasan et al. (2019) consider a machine learning-based approach to preventing SQL attacks, in which over 20 different classifiers are tested and the top five are selected. The potential of using machine learning techniques to detect SQL injections at the application level is explored in Tripathy et al. (2020), where classifiers trained on malicious and safe SQL queries are investigated. Adebiyi et al. (2021) argue that machine learning is the most effective approach to detecting and identifying SQL injections; using the Decision Tree (DT) classifier, they achieved an accuracy of detecting SQL attacks of more than 0.98. Balaji B et al. (2021) studied the possibilities of deep learning with the multilayer perceptron model and obtained the same accuracy of attack detection. Hosam et al. (2021) studied the possibility of using six binary classifiers to detect SQL injections. They proposed a feature space consisting of the 13 most important features and identified logistic regression (LR) as the best classifier. In the study by Roy et al. (2022), the best classifier was the naive Bayes method. Misquitta and Asha (2023) conclude that supervised machine learning methods are quite effective and propose using convolutional NNs to identify SQL attacks in real time.

We used the ideas proposed in related works on machine learning to detect SQL attacks in our work when selecting and evaluating classifiers.

3. Theoretical Background

3.1. Task Statement

Let us first consider the task statement for detecting anomalous actions of DC users, focused on the use of machine learning methods.

By a university DC we mean a DC with several databases, which are used in the interests of organizing and conducting the educational process. We will assume that the behavior of DC users consists in accessing the databases available in the DC using queries written in some language (e.g., SQL). Databases can be stored on nodes of a distributed file system (e.g., HDFS). Then parallel anomaly detection will be possible. In the paper, however, we do not focus on this. Database queries are recorded in the database management system (DBMS) logs (e.g., PostgreSQL, MySQL and others). The log consists of separate records. Each record reflects the fact of some user accessing the database and contains the following fields: date, time, user identifier and SQL query text. Therefore, the task of detecting abnormal behavior of university DC users comes down to detecting anomalous SQL queries to DC databases, which leads to searching for anomalous records in the logs.

If we imagine the DBMS log as a dataset consisting of records, then a possible technique for analyzing such a dataset for anomaly detection involves the following possible steps: (1) the formation of a set of features that characterize SQL queries; (2) transformation of the log dataset into a dataset, the records of which contain the values of the generated features; (3) formation of a training sample on which the machine learning process will be carried out; (4) the use of trained tools to directly identify anomalous requests.

The initial data of the task are:

(1) a set of DC users $U = \{U_{1}, U_{2}, \dots, U_{N}\}$ and logs $L = \{L_{1}, L_{2}, \dots, L_{M}\}$;
(2) each log is represented as a set of records $L_{m} = \{l_{mi}\}$;
(3) each log record is represented as a tuple $l_{mi} = \langle \mathit{Date}_{mi}, \mathit{TimeStamp}_{mi}, \mathit{User}_{mi}, \mathit{Op}_{mi} \rangle$, where $\mathit{Date}_{mi}$ is the date of the i-th query in the m-th log; $\mathit{TimeStamp}_{mi}$ is the request timestamp; $\mathit{User}_{mi} \in U$ is the query user; $\mathit{Op}_{mi}$ is the SQL statement;
(4) each SQL statement can be represented as $\mathit{Op}_{mi} = \langle \mathit{Operator}, \{\mathit{Tables}\}, \{\mathit{Fields}\}, \{\mathit{Values}\} \rangle$, where $\mathit{Operator}$ is the SQL operator; $\{\mathit{Tables}\}$ is a set of table names that are present in the SQL statement; $\{\mathit{Fields}\}$ is a set of field names; $\{\mathit{Values}\}$ is a set of field values;
(5) well-known machine learning models (binary classifiers), which are most often used to detect anomalies in the analysis of various types of data (Branitskiy & Kotenko, 2016; Kotenko & Saenko, 2019);
(6) requirements for detecting attacks such as SQL injection: probability of correct attack detection $P_{det} \geq 0.95$; probability of missing an attack $P_{mis} \leq 0.1$.

To calculate these probabilities, it is proposed to use the following formulas: $P_{det} \approx TP / (TP + FP + FN)$, $P_{mis} \approx FN / (FN + TP)$, where TP is the number of correctly detected anomalies in the dataset (True Positive); FP is the number of False Positives; FN is the number of False Negatives.
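
The two formulas and the thresholds from item (6) can be checked with a few lines of Python. This is a minimal sketch; the function names and the example counts are illustrative assumptions, not part of the original implementation.

```python
from typing import Tuple


def detection_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (P_det, P_mis) using the formulas of the task statement."""
    denom_det = tp + fp + fn
    denom_mis = fn + tp
    p_det = tp / denom_det if denom_det else 0.0
    p_mis = fn / denom_mis if denom_mis else 0.0
    return p_det, p_mis


def meets_requirements(tp: int, fp: int, fn: int,
                       p_det_min: float = 0.95,
                       p_mis_max: float = 0.1) -> bool:
    """Check the requirements P_det >= 0.95 and P_mis <= 0.1."""
    p_det, p_mis = detection_metrics(tp, fp, fn)
    return p_det >= p_det_min and p_mis <= p_mis_max


# Hypothetical counts, chosen only to exercise the formulas.
p_det, p_mis = detection_metrics(tp=48, fp=1, fn=1)
```

Note that $P_{det}$ as defined here penalizes both false positives and false negatives, so it is stricter than recall alone.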

As a result of solving the task, it is required to develop a feature space model containing features used in machine learning, conduct an experimental evaluation of machine learning methods used to detect anomalous user behavior, and develop a methodology for ensuring the timeliness of detecting anomalous user behavior.

The experimental evaluation of machine learning methods is described in detail in Section 5. Below are the results of the development of a feature space model and a methodology for ensuring the timely detection of anomalous user behavior.

3.2. Feature Space Model

The feature space model is the result of one of the initial operations in machine learning technology, which is called feature selection. The essence of this model is a set of features that characterize some object or process and are used to analyze data about this object or process (Brownlee, 2020).

In well-known works on detecting anomalies in log files, discussed in Section 2 (for example, in Gowtham & Pramod, 2022; Shahid & Ali Shah, 2021), it is proposed to use the word2vec method to extract features. This method applies to any text data. However, it does not take data structure into account. Therefore, its application to SQL queries that have a strict structure was recognized by us as insufficiently effective, since this method produced a very large number of features at the output, and most of them were not informative. In addition, the accuracy of anomaly detection in SQL queries when using the word2vec method was lower than required.

Therefore, the proposed feature space model was formed heuristically as follows. It included features of three categories. The features of the first category determine the number of occurrences of a keyword in the SQL query. A total of 30 keywords were selected, such as SELECT, INSERT, CREATE, etc. The features of this category are designed to determine the complexity level of the SQL query and its type.

The second category of features was the number of occurrences of certain signatures specific to SQL injection. The following signatures were selected for this purpose: “Execute,” “or,” “txtUserId,” “getRequestString,” “1=1,” “--,” “CHAR,” “#” and “;”. The presence of such signatures in SQL statements indicates the presence of known SQL injections in queries. Thus, the features of the second category reflect queries with SQL injections.

The third category was formed by the number of occurrences of data table names. Using this group of features, as expected, it is possible to detect anomalous SQL queries in which users attempt unauthorized access. Thus, the features of the third category reflect the preferences of users in accessing the structure of the database.

The university DC database used to experimentally evaluate the proposed approach contained over 4,000 data tables. This is due to the non-normalized nature of its structure. Approximately 2,000 tables contained faculty data, one table per faculty member. Each course, discipline, and study group also had its own data table.

Due to such a large number of tables, we made the following two decisions. First, we decided not to use field names and values when forming the feature space. Second, we decided to reduce the number of table names in the feature space by replacing them with a generic type name. So, all table names for teachers were replaced with the type name “Teacher_Table,” table names for study groups were replaced with “Group_Table,” etc. Thus, it was possible to reduce the number of table names taken into account to 141.

The feature space formed in this way was called the initial space. Its composition is presented in Table 1. It shows that the total number of features is 181. Of these, 30 features belong to the first category, 10 to the second category, and 141 to the third category.

Table 1.
The Composition of the Initial Feature Space.

#	Category	Name	Description
1	1	SELECT_COUNT	Number of occurrences of SELECT
2		INSERT_COUNT	Number of occurrences of INSERT
…		…	…
30		HAVING_COUNT	Number of occurrences of HAVING
31	2	Execute_COUNT	Number of occurrences of “Execute”
32		“ $1 = 1$ ”_COUNT	Number of occurrences of “ $1 = 1$ ”
…		…	…
40		txtUserId_COUNT	Number of occurrences of “txtUserId”
41	3	Table_1_COUNT	Number of occurrences of data table name 1
42		Table_2_COUNT	Number of occurrences of data table name 2
…		…	…
181		Table_141_COUNT	Number of occurrences of data table name 141
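
To make the three feature categories concrete, the following is a minimal sketch of how a single SQL query could be turned into a row of `*_COUNT` features. The keyword and table lists are short illustrative subsets (the paper uses 30 keywords and 141 generic table names), and the matching rules (whole-word, case-insensitive matching for alphanumeric tokens; whitespace-insensitive substring matching for symbolic signatures) are assumptions, not the authors' documented procedure.

```python
import re
from typing import Dict

# Illustrative subsets only (assumptions); the full lists are not given in the text.
SQL_KEYWORDS = ["SELECT", "INSERT", "CREATE", "HAVING"]
INJECTION_SIGNATURES = ["Execute", "or", "txtUserId", "getRequestString",
                        "1=1", "--", "CHAR", "#", ";"]
TABLE_TYPE_NAMES = ["Teacher_Table", "Group_Table", "pg_tables"]


def count_word(text: str, word: str) -> int:
    """Count case-insensitive occurrences of `word` as a whole token."""
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def count_signature(text: str, signature: str) -> int:
    """Count a signature; symbolic ones are matched ignoring whitespace."""
    if signature.isalnum():
        return count_word(text, signature)
    compact_text = re.sub(r"\s+", "", text).lower()
    compact_sig = re.sub(r"\s+", "", signature).lower()
    return compact_text.count(compact_sig)


def extract_features(query: str) -> Dict[str, int]:
    """Build one feature row: keyword, signature and table-name counts."""
    features: Dict[str, int] = {}
    for keyword in SQL_KEYWORDS:
        features[f"{keyword}_COUNT"] = count_word(query, keyword)
    for signature in INJECTION_SIGNATURES:
        features[f"{signature}_COUNT"] = count_signature(query, signature)
    for table in TABLE_TYPE_NAMES:
        features[f"{table}_COUNT"] = count_word(query, table)
    return features


row = extract_features(
    "SELECT UserId, Name, Password FROM Users WHERE UserId = 105 or 1 = 1;"
)
```

Whole-word matching keeps the `or` inside `Password` from being counted, which matters for short signatures.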

3.3. Feature Space Optimization

As noted above, an excessively large number of features can have a negative impact on the accuracy of anomaly detection. Therefore, in this work, we sought the greatest possible reduction of the initial set of features for which the resulting accuracy of anomaly detection does not decrease and, if possible, even increases.

Optimization of the initial set of features was carried out on the basis of an assessment of their information content. For this purpose, the following well-known metrics were used (Brownlee, 2020):

Information Gain (Info.Gain),

Information Gain Ratio (Gain Ratio),

ANalysis Of VAriance (ANOVA).

The Info.Gain metric defines “information gain”. It shows the decrease in entropy caused by dividing the original set of features and looking for the optimal feature that gives the highest value using the following formula:

$$M_{IG}(A, \alpha) = H(A) - H(A \mid \alpha), \tag{1}$$

where $A$ is a feature, $H(A)$ is the entropy of $A$, and $H(A \mid \alpha)$ is the entropy of $A$ at a given $A = \alpha$.

The Gain Ratio metric is the ratio between the value of the Info.Gain metric calculated according to (1) and the value of the Split Information metric defined as follows:

$$M_{SI}(A) = -\sum_{i=1}^{n} p_{i} \cdot \log_{2} p_{i}, \tag{2}$$

where $A$ is an evaluated feature that takes values $\{\alpha_{1}, \alpha_{2}, \dots, \alpha_{n}\}$; $p_{i}$ is the relative number of times feature $A$ takes on the value $\alpha_{i}$. In other words, $p_{i} = N(\alpha_{i}) / N$, where $N(\alpha_{i})$ is the number of values $\alpha_{i}$ of feature $A$, and $N$ is the total number of values of feature $A$ in the sample.
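
The two metrics can be sketched in plain Python. The sketch assumes the common reading of (1) in which the entropy is that of the class label (the Result field) and the conditioning is on the values of feature A; Split Information follows (2) directly. The function names are illustrative, and this is not the code used by the authors.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def info_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    conditional = 0.0
    for value, count in Counter(feature).items():
        subset = [l for f, l in zip(feature, labels) if f == value]
        conditional += (count / n) * entropy(subset)
    return entropy(labels) - conditional


def split_information(feature: Sequence[Hashable]) -> float:
    """Equation (2): entropy of the feature's own value distribution."""
    return entropy(feature)


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Info.Gain divided by Split Information (0 for a constant feature)."""
    si = split_information(feature)
    return info_gain(feature, labels) / si if si > 0 else 0.0
```

A feature that perfectly separates two balanced classes gets an information gain of 1 bit, while a feature independent of the labels gets 0.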

ANOVA is a set of statistical models and associated estimation procedures used to analyze differences between means. In its simplest form, ANOVA provides a statistical test for the equality of two or more population means. The ANOVA metric was calculated using built-in software.
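
The text does not specify how the built-in software computes the ANOVA score. As an illustration only, the sketch below computes the textbook one-way ANOVA F statistic of one feature grouped by class label; larger values indicate that the class means of the feature differ more strongly relative to the within-class spread.

```python
from statistics import mean
from typing import Dict, List, Sequence


def anova_f(feature: Sequence[float], labels: Sequence[int]) -> float:
    """One-way ANOVA F statistic of `feature` across the classes in `labels`."""
    groups: Dict[int, List[float]] = {}
    for x, y in zip(feature, labels):
        groups.setdefault(y, []).append(float(x))

    n, k = len(feature), len(groups)
    if k < 2 or n <= k:
        return 0.0  # not enough classes or samples to compare means

    grand_mean = mean(float(x) for x in feature)
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ssw = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

    if ssw == 0:
        return float("inf") if ssb > 0 else 0.0
    return (ssb / (k - 1)) / (ssw / (n - k))


f_value = anova_f([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```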

To minimize the number of features that make up the optimal feature space, we use the Guttman-Kaiser criterion (Arnaut, 2014). This criterion states that the features included in the optimal feature space must have information content that exceeds the average information content for the entire set of initial features.

Then for the feature $A^{*}$ , which remains in the feature space, the following condition is satisfied:

$$M(A^{*}) > \frac{M(A_{1}) + M(A_{2}) + \dots + M(A_{K})}{K}, \tag{3}$$

where $K$ is the total number of features in the original feature space; $M(A)$ is the metric chosen to assess the information content of feature $A$.
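
Condition (3) amounts to keeping the features whose score is strictly above the mean score. A minimal sketch follows; the feature scores are made-up values used only for illustration.

```python
from typing import Dict, List


def select_above_mean(scores: Dict[str, float]) -> List[str]:
    """Keep the features whose metric exceeds the average over all features."""
    threshold = sum(scores.values()) / len(scores)
    return [name for name, score in scores.items() if score > threshold]


# Hypothetical information-content scores (not taken from the experiments).
example_scores = {
    "SELECT_COUNT": 0.40,
    "1=1_COUNT": 0.90,
    "Table_1_COUNT": 0.05,
    "INSERT_COUNT": 0.01,
}
selected = select_above_mean(example_scores)
```

Because the threshold is the mean of the same scores, the rule needs no tuning parameter, but the selected subset depends on which metric (Info.Gain, Gain Ratio or ANOVA) is used for $M$.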

3.4. Ensuring Timely Detection of Anomalous User Behavior

To ensure timely detection of anomalous user behavior, it is necessary to determine the limit on the size of the log file that allows the requirements for the timeliness of processing queries to databases to be met. To do this, the following optimization task must be solved.

Let us denote by $T_{0}$ the length of the period of time during which the system operates. Let the database receive a stream of requests with intensity $λ$.

Let us introduce the variable $x$, the number of intervals into which the period $T_{0}$ is divided. Log entries falling within an interval of duration $t_{x} = T_{0} / x$ form a dataset file that is submitted for testing. During testing, a list of anomalous records is issued if any are found. Each entry contains a user ID and feature values. Based on this information, the administrator can easily restore the identity of the user who generated the anomalous request and the content of this request. During the time $t_{x}$, the system receives $N_{x} = λ \cdot t_{x}$ requests.

Let us assume that the dataset processing time $T_{p r o c}$ depends on the number of records in this set, and this dependence is linear (in the first approximation). In other words, we can write that:

$$T_{proc} = T_{proc}(N_{x}) = C_{0} + C_{1} \cdot N_{x}, \tag{4}$$

where $C_{0}$ and $C_{1}$ are linear dependence coefficients.

The total processing time of all data sets for the period $T_{0}$ will be equal to:

$$T_{\sum} = T_{proc} \cdot x. \tag{5}$$

Let us introduce the concept of the “relative load” of the system $\rho$, which is calculated as the ratio of the time $T_{\sum}$ to the duration $T_{0}$, i.e., $\rho = T_{\sum} / T_{0}$. It is natural to assume that the relative load should be minimal.

On the other hand, let us require that the duration of processing one request not exceed a given value $T_{query}^{req}$. The processing time of one request is equal to the dataset processing time $T_{proc}$ and thus depends on the variable $x$.

Then the formal statement of the optimization task with decision variable x has the following form:

$$\begin{cases} \rho(x) \to \min, \\ T_{proc}(x) \leq T_{query}^{req}. \end{cases} \tag{6}$$

Let us present a solution to task (6). Based on (4) and (5), to calculate $\rho$, we can write:

$$\rho = \frac{C_{0}}{T_{0}} \cdot x + C_{1} \cdot \lambda. \tag{7}$$

It can be seen from (7) that the minimum for $\rho$ is reached at the minimum possible value of $x$. This value can be found from the second constraint in (6). To find $T_{proc}$, based on (4), we can write:

$$T_{proc} = C_{0} + C_{1} \cdot \lambda \cdot T_{0} / x. \tag{8}$$

After transformations, taking into account (8), the second constraint in (6) takes the following form:

$$x \geq \frac{C_{1} \cdot \lambda \cdot T_{0}}{T_{query}^{req} - C_{0}}. \tag{9}$$

Then the solution of problem (6) is determined by condition (9), and the minimum relative system load $\rho_{min}$ is determined by the following expression:

$$\rho_{min} = C_{1} \cdot \lambda \cdot \left(\frac{C_{0}}{T_{query}^{req} - C_{0}} + 1\right). \tag{10}$$

Expressions (9) and (10) show how the log file should be partitioned and what the relative load of the system will be, based on the intensity of requests $\lambda$ and the required maximum duration of their processing $T_{query}^{req}$. Coefficients $C_{0}$ and $C_{1}$ are calculated experimentally.
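
Expressions (7) to (10) can be evaluated numerically once $C_{0}$ and $C_{1}$ have been measured. The sketch below rounds the bound (9) up to an integer number of intervals; the function names and the numeric values are illustrative assumptions (times are taken to be in seconds), not measured data.

```python
import math


def min_intervals(c0: float, c1: float, lam: float, t0: float, t_req: float) -> int:
    """Smallest integer x satisfying (9); requires t_req > c0."""
    if t_req <= c0:
        raise ValueError("T_query^req must exceed the fixed overhead C0")
    return max(1, math.ceil(c1 * lam * t0 / (t_req - c0)))


def processing_time(x: int, c0: float, c1: float, lam: float, t0: float) -> float:
    """Equation (8): processing time of one dataset when T0 is split into x parts."""
    return c0 + c1 * lam * t0 / x


def relative_load(x: int, c0: float, c1: float, lam: float, t0: float) -> float:
    """Equation (7): share of the period T0 spent on processing."""
    return c0 * x / t0 + c1 * lam


# Hypothetical parameters chosen for illustration only.
C0, C1, LAM, T0, T_REQ = 0.5, 0.125, 2.0, 400.0, 2.5
x_min = min_intervals(C0, C1, LAM, T0, T_REQ)
rho = relative_load(x_min, C0, C1, LAM, T0)
t_proc = processing_time(x_min, C0, C1, LAM, T0)
```

Since $x$ must be an integer, the load obtained after rounding up can be slightly higher than $\rho_{min}$ from (10); the two coincide when the right-hand side of (9) is itself an integer, as in this example.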

4. Implementation

To implement the approach, Python v.3.8.8 was used with the following libraries: scikit-learn (sklearn), NumPy, pandas, Matplotlib, SciPy, re, pylab, and math. The computing environment was a Jupyter notebook.

Supervised machine learning methods using binary classifiers, as well as unsupervised and semi-supervised methods, were studied. Supervised learning methods were used to detect SQL attacks with known signatures. The unsupervised and semi-supervised methods were used to detect unknown SQL attacks and unauthorized access attempts to data tables.

The university DC used DBMS PostgreSQL v.13.4 running under Ubuntu v.13.4 to create the database.

To form a dataset used to train binary classifiers, a fragment of the log of the university DC database was selected, showing the work of users with the database for 15 minutes. In total, this fragment initially contained 82,192 statements. Figure 1 presents the instructions included in this log.

Figure 1.

Fragment of the log of the university data center (DC).

Figure 1 shows that this fragment was recorded on March 18, 2022. It started at 12:44:09 and ended at 13:00:33. Several users worked with the database; their IDs were 1174, 1187, 1211, 1230, 1234, 1276, 1565, and 1587. The queries accessed various data tables. For example, the query with time 12:44:37 accessed the system table “p_uplan_pmk” (it contained planning information on the educational process). The name of this table appears in the statement after the word FROM. Other queries addressed the following tables: “p_group”, “1123–2021”, “pg_tables”, “IvanovDA”, “D-0406”, “D-2006”. The “p_group” and “pg_tables” tables were system tables, created by the system when the database itself was created. The other tables are non-system tables created by users with the CREATE command while working with the database. The table “IvanovDA” contains data about the teacher D.A. Ivanov. Table “1123–2021” contains data on study group No. 1123. Tables “D-0406” and “D-2006” contain data on academic disciplines, which have identifiers D-0406 and D-2006.

Since the database contained a very large number of tables with data on teachers, study groups and academic disciplines, it was decided to type such tables, that is, replace them with names of types. For example, the names of all data tables containing teacher data are replaced with the name “Teacher_Table”. This leads to the fact that in the feature space of the dataset, instead of 2000 names of data tables characterizing teachers, there will be the name of only one table. Similarly, the number of data table names that characterize study groups and academic disciplines is reduced.
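
The typing of table names can be sketched as a lookup plus a few pattern rules. Everything specific below is an assumption made for illustration: the regular expressions are guessed from the example names mentioned in the text, the teacher and system table sets are tiny samples, and the name “Discipline_Table” is hypothetical (the text only names “Teacher_Table” and “Group_Table”).

```python
import re
from typing import Iterable

# Hypothetical rules inferred from the example names in the text.
TYPE_RULES = [
    (re.compile(r"^\d{4}-\d{4}$"), "Group_Table"),    # e.g. 1123-2021
    (re.compile(r"^D-\d{4}$"), "Discipline_Table"),   # e.g. D-0406 (assumed name)
]
TEACHER_TABLES = {"IvanovDA"}                          # sample entry only
SYSTEM_TABLES = {"p_group", "pg_tables", "p_uplan_pmk"}


def type_name(table: str) -> str:
    """Map a concrete table name to its generic type name."""
    if table in SYSTEM_TABLES:
        return table                                   # system tables keep their names
    if table in TEACHER_TABLES:
        return "Teacher_Table"
    for pattern, generic in TYPE_RULES:
        if pattern.match(table):
            return generic
    return table                                       # unknown names are left as-is


def replace_table_names(query: str, tables: Iterable[str]) -> str:
    """Replace every known table name in `query` with its generic type name."""
    for table in sorted(tables, key=len, reverse=True):  # longest first
        generic = type_name(table)
        if generic != table:
            pattern = r"(?<![\w-])" + re.escape(table) + r"(?![\w-])"
            query = re.sub(pattern, generic, query)
    return query


typed = replace_table_names(
    'SELECT * FROM "IvanovDA" JOIN "1123-2021"', ["IvanovDA", "1123-2021"]
)
```

Replacing longer names first avoids partially rewriting a name that contains a shorter one as a substring.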

This procedure has become one of the initial steps of the developed dataset formation algorithm. In total, this algorithm contains the following 5 steps.

Step 1. Extracting all table names from the log fragment and forming a set of table names used in the fragment. In total, 310 table names were extracted.

Step 2. Generation of a set of new, generic table names. For the names of tables with data about teachers, the name “Teacher_Table” was used, for the names of tables with data about groups, “Group_Table”, etc. In total, this set included 141 names, including the names of system tables.

Step 3. Replacing the original table names in the log fragment with generic names. At the same time, the number of statements in the fragment was still 82,192.

Step 4. Formation of the initial dataset in CSV format. For each statement in the fragment, a CSV record was created. The fields of this record were the features included in the feature space model and shown in Table 1. In addition, the initial version of the dataset included the Result field, whose value served as the label of a normal or anomalous record: Result = 0 for a normal record and Result = 1 for an anomalous one.

Step 5. Removing duplicate records from the dataset and introducing anomalous records into it. Since the date and time of the query were excluded from the feature space at this stage of the study (this was done deliberately to test the effectiveness of machine learning on SQL query structures), a large number of duplicate records appeared in the dataset. At this stage such records were deleted. After their removal, only 1,026 records remained in the dataset. In addition, a few randomly selected records were changed to match various possible anomalies (SQL injection and access violation attempts). A total of 50 such records were modified and marked as anomalous in the Result field.
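
Steps 4 and 5 can be illustrated with the standard library alone. This is a sketch under assumptions: the three feature columns are a tiny subset of Table 1, the records are invented, and the helper names are hypothetical. With pandas, which the implementation lists among its libraries, `DataFrame.drop_duplicates` would serve the same purpose.

```python
import csv
import io
from typing import Dict, List

Record = Dict[str, int]


def deduplicate(records: List[Record]) -> List[Record]:
    """Drop records whose feature values and label are all identical."""
    seen = set()
    unique: List[Record] = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def to_csv(records: List[Record], fieldnames: List[str]) -> str:
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented records: Result = 0 is normal, Result = 1 is anomalous.
records = [
    {"SELECT_COUNT": 1, "1=1_COUNT": 0, "Result": 0},
    {"SELECT_COUNT": 1, "1=1_COUNT": 0, "Result": 0},  # duplicate of the first
    {"SELECT_COUNT": 1, "1=1_COUNT": 1, "Result": 1},
]
dataset = deduplicate(records)
csv_text = to_csv(dataset, ["SELECT_COUNT", "1=1_COUNT", "Result"])
```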

Examples of such anomalous requests are:

SELECT * FROM users WHERE username = 'administrator'--' AND password = '';

SELECT name, description FROM products WHERE category = 'Gifts' UNION SELECT username, password FROM users;

SELECT UserId, Name, Password FROM Users WHERE UserId = 105 or 1 = 1;

uName = getRequestString("username");

SELECT * FROM Users WHERE Name = "" or ""="" AND Pass = "" or ""="";

SELECT * FROM Users WHERE UserId = 105 UNION DROP TABLE Suppliers;

UPDATE users SET password = 'newpwd' WHERE userName = 'admin' -- ' AND password = 'oldpwd';

SELECT accounts FROM users WHERE login = "legalUser"; exec(char(0x73687574646f776e)) -- AND pass = "" AND pin =.

The dataset formed using the algorithm described in this section was then evaluated and analyzed in more detail using the selected machine learning models.

5. Experimental Results

5.1. Application of Supervised Machine Learning Methods

Supervised machine learning methods were used to detect known SQL injections. The following most popular models of supervised machine learning were studied (Branitskiy & Kotenko, 2016; Kotenko & Saenko, 2019):

Support Vector Machine (SVM) with linear kernel,

DT,

LR,

RF,

Gaussian Naive Bayes (GNB),

k-Nearest Neighbors (KNN),

Multilayer Neural Network (NN).

Before training the models, the feature space was optimized as discussed in subsection 3.3. For this purpose, we used the capabilities of the open-source machine learning framework Orange 3.34 (https://orangedatamining.com).

Figure 2 shows a fragment of the resulting table that Orange generates to analyze the information content of features for the selected metrics Info.Gain, Gain Ratio and ANOVA. The rows in the table are not sorted. It can be seen from the figure that different features have different qualities for including them in the final data set with respect to different information content metrics.

Figure 2.

Tabular presentation of feature information content metrics.

The results of anomaly detection efficiency evaluation (F-score) for the complete and optimized feature sets according to control testing data for various machine learning models are presented in Table 2. 10-fold cross-validation was used to obtain these results.

Table 2.

F-score of Supervised Machine Learning Models for the Full and Optimized Feature Sets.

	F-score
		Optimized Feature Sets
Model	Full set of features	Info. Gain	Gain Ratio	ANOVA
SVM	0.974	0.984	0.998	0.998
DT	0.935	0.935	0.935	0.935
LR	0.935	0.971	0.992	0.992
RF	0.998	0.998	0.995	0.998
GNB	0.517	0.998	0.998	0.998
KNN	0.935	0.978	0.992	0.992
NN	0.993	0.984	0.998	0.998

SVM = support vector machine; DT = decision tree; LR = logistic regression; RF = random forest; GNB = Gaussian Naive Bayes; KNN = k-nearest neighbors; NN = neural network; ANOVA = analysis of variance.

An analysis of the obtained experimental results shows that the studied machine learning models differ in their accuracy in detecting anomalous queries. The DT and LR models turned out to be insufficiently effective, possibly because the training sample was not large enough. In turn, the SVM, RF, GNB, KNN, and NN models showed fairly high efficiency. The GNB model showed the highest efficiency: it did not make a single error in anomaly detection on the optimal set of features.
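The evaluation protocol of this subsection (F-score under 10-fold cross-validation for each supervised model) can be sketched as follows. This is an illustrative sketch built on scikit-learn, not the setup used in the experiments: the hyperparameters, the stratified shuffling, and the function name `evaluate_models` are assumptions, and the labels are assumed to be binary with 1 denoting an anomalous query.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X, y, n_splits=10, seed=0):
    """Return the mean F-score of each model over stratified k-fold CV."""
    # Hyperparameters are illustrative defaults, not the values used in the paper.
    models = {
        "SVM": SVC(kernel="linear"),
        "DT": DecisionTreeClassifier(random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "GNB": GaussianNB(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "NN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                            random_state=seed),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="f1").mean())
        for name, model in models.items()
    }
```

Calling `evaluate_models` once on the full feature matrix and once on each optimized feature subset yields one row of F-scores per model, in the same layout as Table 2.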

5.2. Application of Unsupervised and Semi-supervised Machine Learning Methods

Unsupervised and semi-supervised machine learning methods were used to detect unknown SQL attacks, as well as unauthorized access attempts to data tables.

The following models were chosen for the study:

k-means combined with principal component analysis (PCA),

Isolation Forest (ISF),

Local Outlier Factor (LOF),

One-Class SVM (OCSVM).

The study of the k-means method on a dataset with the optimal feature space showed that its highest accuracy is achieved with 5 clusters. However, the accuracy achieved was only 0.82, which does not meet the requirements.
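A minimal sketch of the k-means experiment is given below. The text does not state how cluster assignments were converted into an accuracy value, so the sketch assumes that each cluster is labeled with the majority class of its members; the number of principal components and the function name `kmeans_cluster_accuracy` are likewise assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def kmeans_cluster_accuracy(X, y, n_clusters=5, n_components=2, seed=0):
    """Cluster PCA-reduced data and score it by majority-label mapping.

    y must contain non-negative integer labels (0 = normal, 1 = anomalous).
    """
    y = np.asarray(y)
    # Project the records onto the leading principal components.
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(Z)
    predicted = np.empty_like(y)
    for c in np.unique(clusters):
        mask = clusters == c
        # Assumed mapping: every cluster takes the majority label of its members.
        predicted[mask] = np.bincount(y[mask]).argmax()
    return float((predicted == y).mean())
```

Repeating the call for several values of `n_clusters` and keeping the best result corresponds to the search for the number of clusters described above.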

To study the remaining unsupervised machine learning models, the original dataset was enlarged 50-fold to 507,232 records. The optimal feature space was extended with all features of the third category. The number of anomalous records was 20. However, as the experiments showed, in this case all three models produced a large number of false positives (FP) and false negatives (FN).

For this reason, the ISF, LOF, and OCSVM models were studied in the semi-supervised learning mode, in which the model is trained on a dataset containing only normal records. The results obtained on such a dataset with an extended composition of features are presented in Figure 3.

Figure 3.

Results of semi-supervised machine learning.

The accuracy of detecting normal records for the LOF model was 0.9989, and for abnormal records, 1.0. The ISF model, despite the high accuracy of detecting normal records, could not detect any anomalous records. The OCSVM model showed an accuracy of 1.0 for detecting anomalous records, but for normal records, the accuracy was only 0.8567.
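The semi-supervised mode described above, in which each detector is fitted on normal records only and then applied to a mixed test set, can be sketched with scikit-learn as follows. The hyperparameters and the helper names `make_detectors` and `evaluate_detectors` are assumptions made for illustration; the sketch reports the same two quantities as the text, namely the accuracy on normal records and the accuracy on anomalous records.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def make_detectors(seed=0):
    """Build the three detectors; the settings are illustrative assumptions."""
    return {
        "ISF": IsolationForest(random_state=seed),
        # novelty=True lets LOF score records that were not seen during fitting.
        "LOF": LocalOutlierFactor(n_neighbors=20, novelty=True),
        "OCSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
    }


def evaluate_detectors(X_train_normal, X_test, y_test, seed=0):
    """Fit on normal records only; return (normal_acc, anomaly_acc) per model.

    y_test uses 0 for normal and 1 for anomalous records.
    """
    y_test = np.asarray(y_test)
    is_anomaly = y_test == 1
    results = {}
    for name, model in make_detectors(seed).items():
        model.fit(X_train_normal)
        pred = model.predict(X_test)  # +1 = normal, -1 = anomaly
        normal_acc = float((pred[~is_anomaly] == 1).mean())
        anomaly_acc = float((pred[is_anomaly] == -1).mean())
        results[name] = (normal_acc, anomaly_acc)
    return results
```

Reporting the two accuracies separately matters here because anomalous records are extremely rare: a detector that labels every record as normal would still reach a high overall accuracy while detecting no anomalies at all.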

5.3. Choosing the Most Preferred Machine Learning Models

Based on the experiments, Table 3 summarizes the comparative performance of the machine learning models in detecting the various anomalies. The best performing models are marked with “Ok”.

Table 3.
Comparative Evaluation of Machine Learning Models.

		SQL injection		Unauthorized access
Type of machine learning	Machine learning model	Evaluation	Recommendation	Evaluation	Recommendation
Supervised	SVM	$+ +$	Ok	$+ -$	$-$
	DT	$- -$	$-$	$- -$	$-$
	LR	$+ -$	$-$	$- -$	$-$
	RF	$+ +$	Ok	$+ -$	$-$
	GNB	$+ +$	Ok	$- -$	$-$
	KNN	$+ -$	$-$	$- -$	$-$
	NN	$+ +$	Ok	$+ +$	Ok
Unsupervised	K-Means	$- -$	$-$	Not used
	ISF	$- -$	$-$	$- -$	$-$
	LOF	$- -$	$-$	$+ -$	$-$
	OCSVM	$+ -$	$-$	$- -$	$-$
Semi-supervised	ISF	$- -$	$-$	$- -$	$-$
	LOF	$+ +$	Ok	$+ +$	Ok
	OCSVM	$+ -$	$-$	$+ -$	$-$

SVM = support vector machine; DT = decision tree; LR = logistic regression; RF = random forest; GNB = Gaussian Naive Bayes; KNN = k-nearest neighbors; NN = neural network; ISF = Isolation Forest; LOF = local outlier factor; OCSVM = One-Class Support Vector Machine.

Current queries are verified directly by submitting each SQL query to the corresponding trained model, that is, the model is used in test mode. The query is tested in real time, taking into account expressions (9) and (10). If the model detects an anomaly, the DC security administrator is notified of a possible SQL injection or unauthorized access attempt.

To ensure timely detection of abnormal user behavior, we assumed that the parameters in (9) and (10) take the following values: $T_{0} = 3600$ s, $\lambda = 100$ s$^{-1}$, $C_{1} = 10^{-3}$ s/record, $T_{quary}^{req} = 60$ s, and $C_{0} = 10^{-3}$ s. Then from (9) we obtain $x = 6$. In other words, to meet the requirement that the average time for detecting requests should not exceed 1 minute, the log file corresponding to a period of 1 hour must be split into at least 6 fragments, each of which corresponds to 10 minutes. The minimum relative load of the system will be equal to 0.1. This, in turn, means that detecting anomalous user behavior does not result in significant additional computational costs for the DC hardware.

6. Conclusion

This paper presents the problem statement, algorithms, implementation issues, and experimental evaluation results for a new approach to detecting anomalous queries to SQL databases. The approach is based on a heuristic algorithm for reducing the dimension of the feature space and on binary classification methods. The initial data of the task are logs, database users, selected binary classifiers, and requirements for the accuracy of detecting anomalous SQL queries. The result of implementing the approach is a feature space model, presented as a set of normal and abnormal data records containing the values of the generated features, and a technique for searching for abnormal queries.

The proposed approach was evaluated experimentally on real datasets generated while university DC users worked with the database of the educational process. Supervised machine learning methods and an optimal set of features were used to detect anomalies caused by known SQL injections. Unsupervised and semi-supervised machine learning methods, together with an extended set of features, were used to detect unknown SQL injections and unauthorized access attempts to data tables. The evaluation results confirmed the effectiveness of the proposed approach and made it possible to choose the most preferable machine learning models. For detecting known SQL injections, these models are SVM, RF, GNB, and NN. For detecting unknown SQL injections and unauthorized access attempts, the LOF semi-supervised machine learning model is best. To ensure the timely detection of anomalous user behavior, an optimization problem was formulated and solved, which makes it possible to find the maximum size of the fragments into which the log file should be divided.

Further research is aimed at improving the accuracy of detecting anomalous SQL queries by tuning the parameters of the classifiers and combining them.

Footnotes

ORCID iDs

Igor Kotenko

Igor Saenko

Igor Zelichenok

Funding

The authors received the following financial support for the research, authorship and/or publication of this article: The reported study was partially funded by the budget project FFZF-2025-0016.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Adebiyi, M. O., Arowolo, M. O., Archibong, G. I., Mshelia, M. D., & Adebiyi, A. A. (2021). An SQL injection detection model using chi-square with classification techniques. In 2021 International conference on electrical, computer and energy technologies (ICECET) (pp. 1–8).

Alqahtani, Alanazi, & Hamdaoui (2020). Traffic behavior in cloud data centers: A survey. In 2020 International wireless communications and mobile computing (IWCMC) (pp. 2106–2111).

Arnaut, L. R. (2014). Optimizing low-frequency mode stirring performance using principal component analysis. IEEE Transactions on Electromagnetic Compatibility, 56(1), 3–14.

Balaji B, J. K. R. S., Pandey, Beriwal, & Amarajan (2021). An efficient SQL injection detection system using deep learning. In 2021 International conference on computational intelligence and knowledge economy (ICCIKE) (pp. 442–445).

Branitskiy, A. A., & Kotenko, I. V. (2016). Analysis and classification of methods for network attack detection. SPIIRAS Proceedings, 2(45), 207–244.

Brownlee (2020). Data preparation for machine learning: Data cleaning, feature selection, and data transforms in Python (pp. 18–23). Machine Learning Mastery.

Chen & Wang (2020). Machine learning-based anomaly detection of ganglia monitoring data in HEP data center. EPJ Web of Conferences, 245, 07061.

Decker, Leite, Giommi, & Bonacorsi (2020). Real-time anomaly detection in data centers for log-based predictive maintenance using an evolving fuzzy-rule-based approach. In 2020 IEEE international conference on fuzzy systems (FUZZ-IEEE) (pp. 1–8).

Deka, P. K., Bhuyan, M. H., Kadobayashi, & Elmroth (2019). Adversarial impact on anomaly detection in cloud datacenters. In 2019 IEEE 24th pacific rim international symposium on dependable computing (PRDC) (pp. 188–18809).

Gowtham & Pramod, H. B. (2022). Semantic query-featured ensemble learning model for SQL-injection attack detection in IoT-ecosystems. IEEE Transactions on Reliability, 71(2), 1057–1074.

Hasan, Balbahaith, & Tarique (2019). Detection of SQL injection attacks: A machine learning approach. In 2019 International conference on electrical and computing technologies and applications (ICECTA) (pp. 1–6).

Hlaing, Z. C. S. S., & Khaing (2020). A detection and prevention technique on SQL injection attacks. In 2020 IEEE conference on computer applications (pp. 1–6).

Hosam, Hosny, Ashraf, & Kaseb, A. S. (2021). SQL injection detection using machine learning techniques. In 2021 8th International conference on soft computing & machine intelligence (ISCMI) (pp. 15–20).

Klymash, Shpur, Lavriv, & Peleh (2019). Information security in virtualized data center network. In 2019 3rd International conference on advanced information and communications technologies (AICT) (pp. 419–422).

Kotenko & Saenko (2023). Applying machine learning methods to detect abnormal user behavior in a university data center. In Intelligent distributed computing XV. IDC 2022. Studies in computational intelligence (Vol. 1089, pp. 13–22).

Kotenko, Saenko, & Branitskiy (2019). Detection of distributed cyber attacks based on weighed ensemble of classifiers, big data processing architecture. In IEEE INFOCOM 2019 – IEEE conference on computer communications workshops (pp. 1–6).

Marashdeh, Suwais, & Alia (2021). A survey on SQL injection attack: Detection and challenges. In 2021 International conference ICIT (pp. 957–962).

Misquitta & Asha (2023). SQL injection detection using machine learning and convolutional neural networks. In 2023 5th International conference on smart systems and inventive technology (ICSSIT) (pp. 1262–1266).

Mujib & Sari, R. F. (2020). Performance evaluation of data center network with network micro-segmentation. In 2020 12th International conference ICITEE (pp. 27–32).

Nanekaran, N. P., Esmalifalak, & Narimani (2020). Fast anomaly detection in micro data centers using machine learning techniques. In 2020 IEEE 18th international conference on industrial informatics (INDIN) (pp. 86–93).

Paiusescu, Barbulescu, Vraciu, Carabas, & Cuza, A. I. (2018). Efficient datacenters management for network and security operations. In 2018 17th RoEduNet conference: Networking in education and research (RoEduNet) (pp. 1–5).

Prarthana, T. S., & Gangadhar, N. D. (2017). User behaviour anomaly detection in multidimensional data. In 2017 IEEE international conference on cloud computing in emerging markets (CCEM) (pp. 3–10).

Roy, Kumar, & Rani (2022). SQL injection attack detection by machine learning classifier. In 2022 International conference on applied artificial intelligence and computing (ICAAIC) (pp. 394–400).

Salman, Bhamare, Erbad, Jain, & Samaka (2017). Machine learning for anomaly detection and categorization in multi-cloud environments. In 2017 IEEE 4th international conference on cyber security and cloud computing (CSCloud) (pp. 97–103).

Shahid & Ali Shah (2021). Anomaly detection in system logs in the sphere of digital economy. In Competitive advantage in the digital economy (pp. 185–190).

Tripathy, Gohil, & Halabi (2020). Detecting SQL injection attacks in cloud SaaS using machine learning. In 2020 IEEE 6th international conference on big data security on cloud (BigDataSecurity), IEEE Intl conference on high performance and smart computing, (HPSC) and IEEE international conference on intelligent data and security (IDS) (pp. 145–150).

Welsh & Benkhelifa (2021). On resilience in cloud computing: A survey of techniques across the cloud domain. ACM Computing Surveys, 53(3), 59.

Xiao, Zhou, Yang, & Deng (2017). An approach for SQL injection detection based on behavior and response analysis. In 2017 IEEE 9th international conference on communication software and networks (ICCSN) (pp. 1437–1442).

Xie, Ren, & Guo (2019). SQL injection detection for web applications based on elastic-pooling CNN. IEEE Access, 7, 151475–151481.
