Sage Journals: Discover world-class research

Abstract

Intrusion detection systems play a vital role in traffic flow monitoring on Internet of Things networks by providing a secure network traffic environment and blocking unwanted traffic packets. Various intrusion detection systems approaches have been proposed previously based on data mining, fuzzy techniques, genetic, neurogenetic, particle swarm intelligence, rough sets, and conventional machine learning. However, these methods are not energy efficient and do not perform accurately due to the inappropriate feature selection or the use of full features of datasets. In general, datasets contain more than 10 features. Any machine learning–based lightweight intrusion detection systems trained with full features turn into an inefficient and heavyweight intrusion detection systems. This case challenges Internet of Things networks that suffer from power efficiency problems. Therefore, lightweight (energy-efficient), accurate, and high-performance intrusion detection systems are paramount instead of inefficient and heavyweight intrusion detection systems. To address these challenges, a new approach that can help to determine the most effective and optimal feature pairs of datasets which enable the development of lightweight intrusion detection systems was proposed. For this purpose, 10 machine learning algorithms and the recent BoT-IoT (2018) dataset were selected. Twelve best features recommended by the developers of this dataset were used in this study. Sixty-six unique feature pairs were generated from the 12 best features. Next, 10 full-feature-based intrusion detection systems were developed by training the 10 machine learning algorithms with the 12 full features. Similarly, 660 feature-pair-based lightweight intrusion detection systems were developed by training the 10 machine learning algorithms via each feature pair out of the 66 feature pairs. Moreover, the 10 intrusion detection systems trained with 12 best features and the 660 intrusion detection systems trained via 66 feature pairs were compared to each other based on the machine learning algorithmic groups. Then, the feature-pair-based lightweight intrusion detection systems that achieved the accuracy level of the 10 full-feature-based intrusion detection systems were selected. This way, the optimal and efficient feature pairs and the lightweight intrusion detection systems were determined. The most lightweight intrusion detection systems achieved more than 90% detection accuracy.

Keywords

Determining optimal feature pairs feature pair selection lightweight intrusion detection systems machine learning algorithms Internet of Things security

Introduction

In the last decade, the field of Internet of Things (IoT) has expanded greatly, with more than 20 billion interconnected devices, including computers, different kinds of smart tools, conventional or smart sensors, and fast 4G and 5G Internet access devices being used in different types of traditional or smart applications, such as health care, education, energy, and transportation. It is expected that, due to the rapid increase and proliferation of IoT devices, this number will reach 50 billion by 2030.¹ Such large numbers of interconnected devices constantly exchange and transmit a huge quantity of data (Big Data), making IoT systems a target to various types of cyberattacks.

Furthermore, IoT systems do not have a single-standard architecture that is internationally recognized by researchers and engineering developers. The basic architecture of IoT consists of three layers,^2–4 while other researchers suggest four- and five-layer architectures.^5,6 Having no standard architecture, naturally, causes security and privacy issues because smart environments consist of different types of IoT systems including several distinct sensors, different hardware tools, or software applications from various technology companies that do not share a universal standard language.⁷ Besides, IoT fully depends on Internet connection in all its architectures. It uses communication technologies such as Radio Frequency Identification (RFID),⁸ Near Field Communication (NFC),⁹ Bluetooth,¹⁰ Wi-Fi,¹¹ and Long-Term Evolution (LTE),¹² which have become the biggest target of cyber threats such as service attacks, authentication problems, Denial of Service (DOS), and Distributed DOS (DDOS)¹³ on the Internet. Being the number one target for cyber-attackers with such kinds of cyber threats is a daunting challenge for researchers, IoT manufacturers, and even for IoT users. In addition, IoT networks suffer from power efficiency problems.

To overcome these challenges, serious and realistic security and investigation measures such as lightweight (energy-efficient) network intrusion detection, malware detection, and network forensic systems need to be effectively developed and implemented in the place of traditional heavyweight intrusion detection systems (IDS) which are built on full or inappropriate features of datasets. Thus, conventional heavyweight IDS which are based on data mining, fuzzy techniques, classical neural networks, genetic algorithms, neurogenetic algorithms, particle swarm intelligence, rough sets, and statistical learning and classical machine learning (ML) algorithms are not suitable for the IDS on IoT networks.⁷ For these reasons, there is a need for robust, high-performance, and lightweight artificial intelligence (ML and deep learning) algorithms that can detect network attacks efficiently and take necessary countermeasures. To develop such algorithms, well-structured and representative datasets are paramount for training and validating the credibility of systems.¹⁴ Although a wide range of studies has been conducted in this area, IoT has become such a topic that needs to be researched and developed further and deeper, since cyberspace is very broad, and attacks in cyberspace are random and unpredictable. In addition to this, there are enough research gaps in terms of methodologies, implementation, and datasets. To address these deficiencies and fill the gaps, it is very important to conduct extensive and comprehensive research on anomaly detection using ML algorithms, real and realistic datasets, and, especially, the impacts of dataset features on the accuracy of ML models.¹⁵

Based on the nature of algorithms, each machine or deep learning model depends on distinct features in a dataset. For example, features with a linear trend nature have high impacts in linear methods such as linear regression, ridge regression, or linear support vector machines (SVM), whereas nonlinear algorithms will leverage the more complex and complicated links in the data. Thus, different features or feature pairs should be investigated and implemented with various techniques, discovering which features have a more significant, greater impact on the accuracy and performance of these models. Furthermore, IoT devices constantly generate data. This huge amount of data is called big data, and it increases by gigabytes every day. Therefore, the size and dimensionality of big datasets are astronomical. Hence, determining and estimating which features (which columns) of a dataset are crucial allow us to analyze and focus on the parameters that are the most effective, saving valuable time and resources. Moreover, machine or deep learning–based anomaly detection systems or network forensic systems are usually trained with datasets via random features or full features, which leads to low accuracy rates or has negative effects on the performance of a system. In addition, due to the explosive growth in energy consumption and high costs of using these cloud-based ML systems, energy-efficient workflow scheduling under budget constraints becomes the most challenging issue. Very few pieces of research that consider the stated issue have been reported. Most of them mainly focus on minimization of schedule length under user-specified budget constraints or energy consumption constraints.¹⁶ Due to these reasons, robust, lightweight, and energy-efficient IoT network IDS are vital to protect such systems optimally efficiently.

To address these challenges, we proposed a new approach that can help to determine the most effective and optimal feature pairs of datasets which facilitate improvement of the accuracy and performance of intrusion or anomaly detection systems on IoT network systems. For this purpose, we chose the most popular 10 ML algorithms and the most recent and realistic Bot-IoT dataset by the School of Engineering and Information Technology, UNSW Canberra Cyber, University of New South Wales Canberra, Australia. We generated 66 unique feature pairs out of the 12 best features of the dataset. Actually, 10 best features were utilized in the original paper where the Bot-IoT dataset was presented. However, in the article, it was recommended to use 12 best features; thus, we have utilized 12 best features. We have trained ML algorithms with 10 best features, and 12 best features at the same time, but we achieved higher accuracy via 12 best features. Therefore, we have used the 12 best features as was recommended in the original paper. The selected ML algorithms were trained via the generated unique feature pairs. Next, the feature pairs that performed best and with the highest accuracy were determined as the most effective and optimal trainable input data. In addition, we also investigated which ML algorithms were the most appropriate and optimal in terms of accuracy and performance for online and offline IoT network IDS to be developed using the Bot-IoT (2018) dataset. The entire system can be seen in Figure 1. The contribution of the study is as follows:

Real-time energy-efficient and lightweight IDS that can perform an incoming network traffic packet in less than 0.01 ms was developed in this study.

Ten ML and the recent Bot-IoT 2018 dataset were utilized to develop the lightweight IDS. Twelve best features recommended by the dataset developers were chosen. Next, 66 unique feature pairs were generated from these features.

Ten IDS were developed by training the 10 ML algorithms with the full 12 features; 660 feature pair-based lightweight IDS were also developed by training the 10 ML algorithms via each of the generated 66 feature pairs. Consequently, we obtained 10 heavyweight and 660 lightweight IDS which belonged to the 10 algorithmic groups, including K-Nearest Neighbors (KNN), Linear SVM, Radial Basis Function (RBF) SVM, Gaussian Process (GP), Decision Trees, Random Forest, Neural Networks, AdaBoost, Naive Bayes, and Quadratic Discriminant Analysis (QDA).

Next, the 10 heavyweight IDS and 660 lightweight IDS were compared based on the algorithmic groups. For instance, the KNN-based heavyweight IDS trained via the full 12 features was compared to each of the 66 KNN-based lightweight IDS trained via each of the 66 feature pairs. Then, the optimal and efficient feature pairs were determined by the accuracy and packet execution time variables. This way, we have shown that high-accuracy and high-performance IDS could be developed by training ML algorithms with just a feature pair.

Figure 1.

The developed system. Left side represents 660 lightweight IDS trained with every single of 66 feature pairs. Right side is 10 IDS trained via full 12 features of the Bot-IoT dataset (2018).

A brief outline of the remaining parts of the article is as follows. In the “Literature review” section, a summary of traditional IDS and machine and deep learning–based IDS is described. “The proposed approach” section presents the methodology. The “Experimental setup” section explains the experimental setup. The “Results and discussion” section includes the results and discussion, and in the “Conclusion” section, conclusions are provided.

Literature review

The battle between cyber-attackers and security developers or researchers has been going on since the first computer virus which was known as the “Creeper Worm” in 1970s and the boot-sector virus “Elk Cloner” in 1980s were created. For instance, according to Symantec, smartphone malware threats increased by 54% in 2017, whereas attacks on IoT devices increased by 600%, with the Mirai botnet and its distinct versions serving as the vehicle for some of the most potent DoS and DDoS attacks in history.¹⁷ Generally, regular/traditional computer networks share a unique and universal language, whereas IoT network systems do not have such a standard mechanism since IoT systems consist of different types of hardware tools, conventional or smart sensors, and software applications from different companies. Regular computer networks contain seven-layer, well-defined architectures, while IoT network systems do not have a unique, defined, and standard number of layers in their architecture. In addition, regular computer network systems share unique, well-defined communication protocols, but IoT networks do not have standard data communication and transfer protocols. Therefore, traditional computer networks generate well-structured network traffic flow packets in each layer while the IoT network packets can be different depending on the hardware tools, sensors, and software application used in these systems.¹⁸

Naturally, cyber threat developers have been innovating different types of viruses, threats, and attacks every day, while countermeasures against these threats, viruses, and attacks have been developed by cybersecurity researchers and developers. Countermeasures (intrusion detection, malware detection, and network forensics systems) have been developed using distinct methods and techniques such as conventional heavyweight IDS based on data mining, fuzzy techniques, classical neural networks, genetic algorithms, neurogenetic algorithms, particle swarm intelligence, rough sets, statistical learning and classical ML algorithms, artificial intelligence (large, complex machine and deep learning) techniques, and other conventional anomaly detection methods. However, these conventional heavyweight approaches are not energy efficient and do not perform as accurately as expected due to inappropriate feature selection or the use of full features of datasets. Classical IDS methods are usually trained via full features of datasets or they use inappropriate features (columns) of datasets. These conditions increase the time complexity of packet processing and they cause more energy consumption. On the contrary, IoT systems need IDS systems with low power consumption, because IoT systems mostly suffer from power efficiency problems. Thus, lightweight (energy-efficient) IDS are needed to replace heavyweight IDS which are built on full or inappropriate features of datasets. Due to these reasons, the accuracy and performance of these methods drop significantly.¹⁹

As it was reported in most intrusion, anomaly, or malware detection survey studies, conventional IDS approaches face challenges in terms of their high false-positive rates and computational complexity. High false-positive and false-negative rates reduce the Quality of Services (QoS) of such a network system. If any user packet is dropped by mistake, the user will suffer a billing error, and the user packet will be delayed. Anomaly-based IDS also face challenges with regard to illegal analysis methods, such as packet-based methods, that infringe on user privacy.²⁰ However, lightweight ML algorithms have a clear advantage over other traditional techniques, and they are replacing these conventional methods due to their high ability and high performance in detecting cyber threats and attacks by dynamic learning from huge volumes of data.²¹

ML is a subfield of artificial intelligence that can extract necessary, useful knowledge and information from very complex and large volumes of data using its supervised and unsupervised learning techniques. Hence, this area has made incredible strides in the field of computer vision, speech recognition, natural language processing, and cybersecurity over the past decade. In general, ML methods in the fields of intrusion, malware, and anomaly detection have been categorized into three groups such as static, dynamic, and hybrid (static + dynamic) methods.²²

Static methods use manually extracted rules and static features with statistical approaches, while dynamic algorithms utilize dynamic rules and features extracted via flexible and dynamic approaches that do not rely on experts’ prior knowledge of the domain to define discriminative features. Furthermore, static methods are based on static analysis which is impractical and computationally prohibitive, and they detect anomalies without running the content of an incoming packet file in contrast to dynamic methods which recognize malicious programs by running them on virtually dedicated environments.²³ Static approaches are far faster than dynamic models. However, they cannot detect anomalies that have not been defined in their blacklist. Unlike static methods, dynamic models can detect most malicious packets using their characteristics and activities extracted from dynamic features without any prior knowledge about the incoming packets.²² Because of the aforementioned reasons, dynamic ML algorithms have become an essential part of intrusion and anomaly detection systems. As usual, conventional techniques utilize static features, whereas dynamic ML approaches use the dynamic features of an incoming IoT packet. A categorization of static and dynamic features is illustrated in Figure 2.

Figure 2.

Feature taxonomy: static and dynamic features.

Dynamic features are extracted from the execution of a program or an incoming packet at runtime. Analysis of a program or an incoming packet using dynamic methods has been carried out by observing the packet for whether it follows normal programming or network instructions or it misuses the memory and central processing unit of a device. The monitoring process reveals process creation, file, and registry manipulation, and modifications of memory values, registers, and variables.²² For instance, researchers²⁴ proposed a method to distinguish benign programs from malicious programs using the features of memory and registers usage, while another previous study²⁵ developed a method that performed dynamic analysis on virtual machines (VMs) to extract program run-time traces from both benign and malicious executables. Besides, in previous studies,^26–28 network traffic features such as Hyper Text Transfer Protocol (HTTP) and Domain Name System (DNS) requests, host-based events, and metadata such as IP addresses, ports, and packet counts have been actively utilized to detect and classify packets as normal or threats, while other studies^29–33 have presented a dynamic approach to detect malicious programs using Application Programming Interface (API) call traces. Here, all these approaches have been developed using ML algorithms, and these methods have conditionally been categorized as dynamic learning approaches for malware or IDS by previous studies.²² Moreover, it is presented that all of these methods are fully dependent on datasets and their example feature (column(s)) because these techniques generalize features using their linear and nonlinear functions to extract crucial information and knowledge about programs and packets, classifying them as anomaly or normal labels.

IDS are divided into three classes based on signature, anomaly, and specification according to the detection method. A signature-based IDS matches network traffic patterns to the existing attack patterns in a database. If a match is found, an alarm is issued.³⁴ A signature-based IDS have a high accuracy and low false alarm rate, but it cannot detect new attacks. A specification-based network IDS maps parameters to a predefined set of rules and specifications to detect malicious activity.³⁵ These rules are specified manually by the user. Unlike the signature and specification-based IDS, an anomaly-based IDS constantly checks network traffic for any deviation from the normal network profile.³⁶ If a deviation exceeds the threshold, an alarm is issued to indicate intrusion detection. The normal network profile is learned using ML algorithms.³⁷ An anomaly-based IDS is preferred over signature and specification-based IDS because of its ability to detect new attacks. The efficiency of anomaly-based IDS increases considerably with the quality of the network traffic models that are used.³⁸ Once the system is trained, it can effectively detect new attacks. Intrusion detection in IoT networks aims to classify network traffic according to the normal or attack classes with a trained classifier, maximum accuracy, and minimum false alarms (FAR).³⁹ The classifier’s high performance in terms of accuracy and FAR depends solely on the chosen lightweight algorithms and the choice of training data. Researchers or security professionals prefer high-performance lightweight ML algorithms for the task of intrusion detection.⁴⁰ Due to these reasons, the essence of robust datasets and their features cannot be underestimated in creating a successful intrusion detection model for IoT systems using ML algorithms.⁴¹

In the literature, there is a large number of datasets for intrusion detection, but most of them do not fit the general FAIR⁴² concept for dataset requirements. The FAIR concept defines four principles that scholarly data should fulfill, namely, Findability, Accessibility, Interoperability, and Reusability.⁴³ A dataset may be utilized to train an ML algorithm when it fulfills this concept. There are less than 10 anomaly detection datasets in the literature, and among them, the recent Bot-IoT dataset is the only option for IoT botnet detection, because other datasets do not contain any information about the botnet scenarios on IoT. In addition, previous datasets are not well-structured, while the Bot-IoT (2018) dataset is well-structured, and it includes more than 40 features (column(s)).¹⁴ For these reasons, we chose the Bot-IoT (2018) dataset in this study. Furthermore, correctly selected feature (column)s are as essential as correctly labeled datasets, because selecting the right features or feature pairs helps improve the accuracy and performance of IDS. However, in the Bot-IoT (2018) dataset, as with other datasets, there is no clear information on what features or feature pairs of this dataset are more important, and which of these features or feature pairs is more compatible with which ML algorithms. Due to these reasons, we proposed a new system that helps determine the most import features or feature pairs and the most compatible ML algorithm for intrusion and anomaly detection systems in the field of IoT.

The proposed approach

In this study, we propose a new approach that helps to determine the most effective and most optimal features or feature pairs of datasets which can help improve the accuracy and performance of the intrusion, malware, or anomaly detection systems of IoT devices. For this propose, we chose the 10 well-known ML methods and the latest Bot-IoT (2018) dataset. Furthermore, we also aim to discover which ML algorithm(s) is the most appropriate and optimal in terms of accuracy and performance for IDS to be developed using the Bot-IoT dataset.

The ML methods and the dataset

ML algorithms are a repetitive process that mainly consists of four stages such as gathering available data, cleaning and preparing the data, building models, and validation and deployment into production (Figure 3). For data collection, there are two methods including data acquisition from real-life surveys or experiments and generating synthetic data. In the first method, the data are collected directly from the field if it is possible to obtain the data by observing or experimenting from real life. If data are not available, the dataset is generated programmatically. For this study, we opted for the realistic Bot-IoT (2018) dataset which had been developed using simulation tools that imitate a real IoT environment. Figure 3 depicts the test-bed of the Bot-IoT dataset, in which several VMs are connected to local area network (LAN) and wide area network (WAN) interfaces in the cluster and linked to the Internet through the pfSense machine.

Figure 3.

The simulation of IoT services system.¹⁴

Figure 4.

Main workflow diagram of machine learning algorithms.

On Ubuntu VM platforms, the Nod-RED was used for simulating various IoT sensors (weather station, smart fridge, motion activated lights, remotely activated garage door, smart thermostat), which were connected to the public IoT hub, Amazon Web Services (AWS). Java scripts on the Nod-RED were developed for subscribing and publishing IoT services to the IoT gateway of the AWS via the Message Queuing Telemetry Transport (MQTT) protocol. The pcap files were collected using this virtualized setup. Then, the normal and attack traffic flow data were extracted, and the created dataset was saved as a “.csv” file. The dataset contained more than 73 million records and 42 features (28 original and 14 derived feature columns). It was claimed by researchers that they selected 3 million most important packets (rows) and the most important 10 features using the Correlation Coefficient and Entropy⁴⁴ techniques, and they tested these with three ML algorithms such as SVM, recurrent neural networks (RNN), and long short term memory (LSTM).¹⁴ However, we selected 12 features (10 best + sport and dport) and 10 popular, lightly structured ML approaches instead of three large-structured deep learning architectures because of performance concerns. The selected classification algorithms for this study were K-Nearest Neighbors, Linear SVM, RBF SVM, GP, Decision Tree, Random Forest, Neural Net, AdaBoost, Naive Bayes, and QDA. The brief descriptions for these classifiers are given below.

K-Nearest Neighbor is a simple classification algorithm that categorizes input data using k > 0 neighbors and similarity measures like the Euclidean distance (see equation (1)) or other distance measures such as the Minkowski or Manhattan. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point and predict the label from these. The Euclidean distance measure is used in this article. The number of neighbors was taken as k = 3 after some experiments

D = \sum_{i = 0}^{k} \sqrt{{(X_{i}^{input} - X_{i}^{test})}^{2}}

(1)

where D is the distance, k is the number of neighbors, and $X^{input}$ and $X^{test}$ are the feature vectors of the input and test data.

The idea behind the linear and RBF SVM is separation of input data into several classes using linear or nonlinear hyperplanes. The algorithm creates a linear function (equation (2)) for a linear SVM and a nonlinear function for an RBF SVM (equation (3)) which separates the input data into binary classes. The separating line of an SVM function is called a hyperplane

\begin{matrix} \min \frac{1}{2} | | w_{1} {| |}^{2}, \min \frac{1}{2} | | w_{2} {| |}^{2}, y_{i}^{(- 1)} \leq w_{1} x_{i}^{(1)} + w_{2} x_{i}^{(2)} + b, \\ y_{i}^{(+ 1)} \geq w_{1} x_{i}^{(1)} + w_{2} x_{i}^{(2)} + b \end{matrix}

(2)

where $w_{1}$ and $w_{2}$ are the normal or weight vectors which determine the boundary of the decision-making function, $y^{(- 1)}$ and $y^{(+ 1)}$ are the hyperplane functions for the classes (1: normal, –1: attack), $x^{(1)}$ and $x^{(2)}$ are the feature vectors of a given input packet, and b is the bias.

The RBF SVM algorithm is also similar to the linear SVM method, but it uses a radial basis nonlinear function kernel instead of a linear function. These linear and RBF binary SVM algorithms provide an efficient classification environment in dealing with extra-large datasets (for instance, several millions of training data pairs). In our case, there were 3 million records in the dataset, so the methods that were used are some of the ideal classification techniques for the study. The regularization parameter C = 0.001 was used for linear SVM, and C = 1 was used for the RBF-based SVM classifier in this article. Degree = 3 and gamma = 2 were set for both classifiers since there are two normal and attack class labels

K (X^{input}, X^{test}) = e^{- \frac{X^{input} - X^{test}}{2 σ}}

(3)

The GP classifier is a supervised ML algorithm which is utilized in binary logistic regression and binary classification tasks. It predicts the label of input data with probabilistic confidence level interpolating the observations in a training dataset using Laplace approximation. The GP function (equation (4)) is specified by its mean $m (X^{(input)})$ and covariance (kernel) $k (X^{(input)}, X^{(test)})$ functions. The kernel selection process is crucial for the binary classification; thus, kernel = 1.0 × RBF(1.0) was chosen for this study because the input data were linearly distributed. The other parameters were kept as default

\begin{matrix} k (X^{(input)}, X^{(test)}) = (f (X^{(input)}) - m (X^{(input)})) \times \\ \times (f (X^{(test)}) - m (X^{(test)})) \end{matrix}

(4)

The Decision Trees algorithm is established on tree-structured rules. The rules are extracted using an entropy information gain function (equation (5)) or the Gini approach, while the Random Forest algorithm constructs its classification model with several individual decision trees. It pools all prediction values from the individual decision trees to make a final decision. The decision tree does not generalize from the training data but memorizes all samples since a nonparametric supervised learning method is used for classification. It usually results in overfitting; however, it performs better when the dataset is well-structured and accurately annotated. In addition, the Random Forest classifier overcomes the overfitting cases using several weak decision trees

E (S) = \sum_{c \in C} - p (c) lo g_{2} p (c)

(5)

where S is the feature of the current dataset, C is the set of the classes, c is the label, and p(c) is the probability of the classes.

A Neural Network is a system that imitates the human brain. It contains the input–output layers, the hidden layer(s), neurons, activation, and decision functions as in equation (6). In our case, we chose a simple, three-layered neural network. Larger neural networks could be constructed, because more layers and more neurons provide higher accuracy. However, this decreases the performance of a system. The Rectified Linear Unit (ReLu) was used as the activation function. The Adam optimizer approach was used for weight optimization in this study. The learning rate was 0.001, and decay was 0.9

y = f (\sum_{i = 0}^{n} w_{i} X_{i}^{input} + b)

(6)

where y is the predicted label, $w_{i}$ is the weight vector, and $X^{(input)}$ is the input data (feature vector).

The AdaBoost classification algorithm is an ensemble algorithm like the Random Forest algorithm. The logic behind it is combining several weak classifier algorithms under one system. It classifies input data using this function as in equation (7). The main difference between the AdaBoost classifier and the Random Forest approach is that the former makes the final decision combining through a weighted majority vote (or sum) of the weak decision tree classifiers. For this study, 50 weak decision tree classifiers were selected, and the learning rate was chosen as 1.0

H (X^{input}) = sign (\sum_{t = 1}^{T} α_{t} h_{t} (X^{input}))

(7)

where $H (X^{(inputs)})$ is the decision function, T is a set of classifiers, $α_{t}$ is the weight of classifiers, and $h_{t} (X^{input})$ is the output of weak classifiers.

The Naive Bayes classifier performs the task of classification using the Bayes’ probability theorem as in equation (8) which relies on the assumptions between feature vectors. In our case, it categorized an input packet into class labels as normal and attack using the “naive” assumption of conditional independence between every pair of features such as sport–dport and sport–state_number. Thus, the classifier is best fit for determining the optimum feature pairs based on the likelihood of the features (equation (9))

p (C_{k} | X^{input}) = \frac{p (C_{k}) p (X^{input} | C_{k})}{p (X^{input})}

(8)

where $C_{k}$ is the set of the classes and $p (C_{k})$ is the probability function

P (x_{input} | y) = \frac{1}{\sqrt{2 π σ_{y}^{2}}} \exp (- \frac{{(x_{input} - μ_{y})}^{2}}{2 σ_{y}^{2}})

(9)

QDA is a nonlinear example of linear discriminant analysis that uses the measurement assumptions of each class which is normally distributed. It classifies input data according to the distribution of a feature vector. More specifically, for linear and QDA, it is modeled as a multivariate Gaussian distribution (equation (10)) with density

P (x | y = k) = \frac{1}{{(2 π)}^{d / 2} | Σ_{k} |^{1 / 2}} e^{(- \frac{1}{2} {(x - μ_{k})}^{t} Σ_{k}^{- 1} (x - μ_{k}))}

(10)

where d is the number of features, and in our case, it was 2 for feature pair training and 12 for full feature training. According to the model above, the log of the posterior (equation (11)) is

\begin{matrix} \log P (y = k | x) = \log P (x | y = k) + \log P (y = k) + Cst \\ = - \frac{1}{2} \log | Σ_{k} | - \frac{1}{2} {(x - μ_{k})}^{t} Σ_{k}^{- 1} (x - μ_{k}) + \\ + \log P (y = k) + Cst \end{matrix}

(11)

The model of the proposed system

The model included four steps. The first step was cleaning the dataset and normalization. The Min–Max Scaler transformation method was used to scale all selected features into the [0; 10] range (equation (12)). This was because it helps ML methods to converge and generalize the training data focusing on the extremum points

X_{std} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, X_{new} = X_{std} (\max - \min) + \min

(12)

where X is an input feature, $X_{\max}$ and $X_{\min}$ are the global maximum and minimum of the dataset, and min and max are the user-defined feature range. In our case, min = 0 and max = 10. Different ranges could be selected. After the transformation, the scaled dataset was split into the training and test parts, and the dataset was fed to the IDS with the full features and the feature pairs separately. The main purpose of using feature pairs is to develop real-time, lightweight (energy-efficient) IDS without affecting the accuracy of the systems negatively. That is, IDS based on an ML algorithm runs slower when it is trained with all features of a dataset (especially, there are more than 10 features in the dataset). Hence, it cannot process a network traffic packet in real time. Therefore, an ML-based IDS that is developed by training with all attributes of the dataset is considered energy-inefficient. However, when the same ML algorithm is trained with only one feature pair, a higher performance, lightweight (energy-efficient) IDS can be developed without sacrificing accuracy. For these reasons, ML algorithms have been trained with a pair of features, and high performance, high-accuracy IDS have been developed in this study.

The second step was generating the distinct feature pairs. We used the permutation technique without repetition and replacement to find out the distinct pairs among the 12 features. The selected features are illustrated in Table 1. There were 66 generated unique feature pairs. In the third step, we built the 10 ML models and fed the data with all features and each feature pair one by one to the models so that we could calculate the difference between the accuracy values of when an ML algorithm is trained with full features and a certain feature pair. This way, we estimated which features or feature pairs were the most effective and optimal for a certain type of ML-based intrusion detection algorithm. Then, in the last step, we measured the accuracy and performance of the models, validating them on the test data. All accuracy and performance values by the feature pairs and ML algorithms were stored in the two-dimensional (2D) array. We determined the effective and optimal feature pairs and algorithms, minimizing the accuracy (equation (13)) and time complexity (equation (14)) values with full features and feature pairs training. The block diagram of the entire system is illustrated in Figure 5. In Figure 5, $F_{ff}$ is the set of full features, $F_{fp}$ is the set of the generated feature pairs, M is the set of ML algorithms, $A_{ff}$ is the array which holds accuracy values of the botnet IDS that has been trained via full features, $A_{fp}$ is the accuracy of the detection systems trained with the 66 generated feature pairs, $T_{ff}$ is the time complexity array of the detection systems trained with full features for an input packet, and $T_{fp}$ is the time complexity of the detection system trained via each of the 66 feature pairs separately

B_{A} = \min (A_{ff}^{c} - A_{fp}^{c})

(13)

B_{T} = \min (T_{ff}^{c} - T_{fp}^{c})

(14)

where $B_{A}$ and $B_{T}$ are the best accuracy and time complexity, respectively, $A_{ff}^{c}$ and $T_{ff}^{c}$ are the accuracy and time complexity when a certain ML algorithm is trained with full features, and $A_{fp}^{c}$ and $T_{fp}^{c}$ are those with a certain feature pair.

Table 1.

Features and descriptions.

Feature	Description
sport	Source port number
dport	Destination port number
state_number	Numerical representation of feature state
seq	Argus sequence number (this feature obtained as .argus format files from .pcap files using Argus tool)
mean	Average duration of aggregated records
stddev	Standard deviation of aggregated records
min	Minimum duration of aggregated records
max	Maximum duration of aggregated records
srate	Source-to-destination packets per second
drate	Destination-to-source packets per second
N_IN_Conn_P_SrcIP	Number of inbound connections per source IP
N_IN_Conn_P_DstIP	Number of inbound connections per destination IP

Figure 5.

Flowchart of the proposed approach.

Figure 6.

Training and test accuracy results of the machine learning methods trained with the feature pairs. The first raw represents the ML algorithms, the first column represents feature pairs (as $sport - dport, \dots, sport - N_IN_Conn_P_SrcIP$ ), and the distribution of input feature pair data is illustrated in the first column.

Experimental setup

In this study, we used a computer with 4 GB RAM and a conventional CPU, and the Google Colaboratory TPUs and GPUs to train and test our system. We chose 10 well-known, lightweight ML algorithms. Different datasets were examined for this study. However, we opted for the recently released realistic Bot-IoT (2018) dataset because it includes usual IoT and other types of network traffic, as well as distinct kinds of common botnet attacks and threats. Training and testing systems were developed using the Python programming language due to its wide range of flexible ML and other scientific frameworks such as Scikit-learn (for data preprocessing, and ML training and testing),⁴⁵ NumPy (for matrix processing),⁴⁶ Pandas (for reading data from a file, data handling, and writing the processed data),⁴⁷ and Matplotlib (for displaying data and results).⁴⁸

The dataset was divided into two parts, 80% training and 20% test sets, as it was recommended in the original paper of the dataset. Then, 66 unique feature pairs were generated, and IDS were developed by training ML algorithms once with all the features and once with each of the generated 66 unique feature pairs. This way, the most efficient and optimal feature pairs were determined by comparing the time complexity and accuracy based on the algorithmic group. IDS that were developed by training ML algorithms with the most efficient and optimal feature pairs of the dataset were considered as energy-efficient (lightweight) IDS.

For the accuracy calculation of the classification systems, we used a conventional accuracy calculation metric which is known as the confusion matrix (equation (15))

\begin{matrix} Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \\ Recall = \frac{TP}{TP + FN} \\ Precision = \frac{TP}{TP + FP}, \\ F_{score} = \frac{2 Precision Recall}{Precision + Recall} \end{matrix}

(15)

According to these measures, the difference between the accuracy values of two systems trained with full features (Table 2) and a certain feature pair (Table 3) is minimized as in equation (13). The execution time (see Table 5) values of these botnet detection systems that were trained with full features and feature pairs were minimized via equation (14) to determine which algorithm(s) and what features performed the most effectively and optimally.

Table 2.

Overall accuracy and F1-score values of the IDS, and execution time for an incoming packet when the systems trained with full features.

Methods	Nearest Neighbors	Linear SVM	RBF SVM	Gaussian Process	Decision Tree	Random Forest	Neural Net	AdaBoost	Naive Bayes	QDA
Accuracy (%)	99.8	99.29	99.19	100	99.8	99.9	99.6	99.5	92.43	86.88
F1 score (%)	99.8	99.3	99.19	100	99.8	99.9	99.6	99.5	91.77	84.33
Execution time (ms)	0.36	0.06	0.11	0.39	0.1	0.97	0.19	1.91	0.08	0.09

IDS: intrusion detection systems; SVM: support vector machines; RBF: radial basis function; QDA: quadratic discriminant analysis.

Table 3.

Accuracy of the detection systems trained with the 66 unique feature pairs.

Feature pairs (FPs)/methods	Nearest Neighbors	Linear SVM	RBF SVM	Gaussian Process	Decision Tree	Random Forest	Neural Net	AdaBoost	Naive Bayes	QDA	Average Accuracy(s)	Max	Count90
sport–dport	90.62	79.92	86.07	91.42	98.99	99.09	83.65	99.19	79.92	79.92	88.88	99.19	5
sport–seq	92.23	85.07	92.23	91.93	92.53	93.24	91.32	94.45	90.92	91.32	91.52	94.45	9
sport–stddev	84.46	79.92	81.43	85.87	88.29	88.4	79.92	86.98	67.1	67.1	80.95	88.4	0
sport–N_IN_Conn_P_SrcIP	92.73	91.52	92.73	92.33	95.66	93.84	92.33	95.86	90.21	90.21	92.74	95.86	10
sport–min_	85.77	79.92	82.04	83.55	88.4	90.01	79.92	88.09	79.92	79.92	83.75	90.01	1
sport–state_number	93.44	79.92	92.63	92.63	93.14	93.14	92.43	94.25	79.92	79.92	89.14	94.25	7
sport–mean	85.87	79.92	82.34	84.66	85.67	88.4	79.92	87.99	79.92	79.92	83.46	88.4	0
sport–N_IN_Conn_P_DstIP	98.49	97.68	98.49	98.39	98.79	98.89	98.49	98.89	97.48	97.48	98.31	98.89	10
sport–drate	82.64	80.02	83.25	83.65	84.16	84.36	79.82	84.56	80.52	80.52	82.35	84.56	0
sport–srate	81.74	80.12	82.44	83.25	94.25	94.35	80.12	91.62	81.63	81.03	85.06	94.35	3
sport–max_	86.07	79.92	82.44	83.25	84.76	86.18	79.92	87.79	80.52	80.73	83.16	87.79	0
dport–seq	97.17	81.23	91.83	98.69	98.79	98.89	91.52	99.6	91.32	91.12	94.02	99.6	9
dport–stddev	97.28	79.92	83.75	97.68	98.18	98.18	83.05	98.08	67.2	67.31	87.06	98.18	5
dport–N_IN_Conn_P_SrcIP	98.49	91.52	92.73	97.88	99.7	99.6	92.53	99.7	90.01	89.71	95.19	99.7	9
dport–min_	97.88	79.92	85.27	98.39	98.69	98.49	83.45	98.69	79.92	79.92	90.06	98.69	5
dport–state_number	98.49	79.92	96.37	98.49	99.19	99.09	95.06	99.6	79.92	79.92	92.61	99.6	7
dport–mean	98.69	79.92	85.07	99.29	99.6	99.19	83.25	99.5	79.92	79.92	90.44	99.6	5
dport–N_IN_Conn_P_DstIP	98.99	97.48	98.69	99.6	99.5	99.6	98.49	99.5	97.48	97.48	98.68	99.6	10
dport–drate	98.59	80.02	84.26	98.59	98.69	98.59	83.45	98.79	80.63	80.52	90.21	98.79	5
dport–srate	98.79	80.12	83.85	98.49	98.89	99.09	83.75	98.99	81.63	81.03	90.46	99.09	5
dport–max_	98.59	79.92	84.16	98.59	98.99	98.39	83.05	98.89	83.25	79.92	90.38	98.99	5
seq–stddev	95.16	92.94	94.45	94.65	96.06	96.27	94.65	95.86	94.05	94.05	94.81	96.27	10
seq–N_IN_Conn_P_SrcIP	99.09	94.65	98.39	98.99	99.09	99.39	94.85	99.39	95.26	95.26	97.44	99.39	10
seq–min_	92.73	91.73	92.23	93.04	91.62	94.45	91.73	94.05	91.42	91.32	92.43	94.45	10
seq–state_number	97.28	85.07	96.27	96.17	95.66	96.37	93.44	96.47	91.12	91.12	93.9	97.28	9
seq–mean	93.95	91.83	94.75	94.85	95.16	95.66	93.54	94.85	92.33	92.53	93.95	95.66	10
seq–N_IN_Conn_P_DstIP	99.39	98.79	99.5	99.6	100	100	99.39	100	99.5	99.5	99.57	100	10
seq–drate	92.43	80.12	90.72	90.82	91.73	90.51	91.02	91.52	80.93	80.93	88.07	92.43	7
seq–srate	91.62	80.12	90.72	91.02	97.98	97.38	91.02	97.78	81.84	81.84	90.13	97.98	7
seq–max_	94.55	92.23	94.75	94.85	95.46	96.37	93.54	96.27	93.34	93.14	94.45	96.37	10
stddev–N_IN_Conn_P_SrcIP	97.28	92.84	93.24	97.07	97.88	97.98	93.64	98.08	90.31	90.41	94.87	98.08	10
stddev–min_	89.71	79.92	81.33	87.18	88.19	88.5	79.92	88.29	75.58	75.58	83.42	89.71	0
stddev–state_number	98.49	79.92	95.46	98.69	97.98	98.89	91.73	98.69	66.8	66.7	89.34	98.89	7
stddev–mean	89.81	79.92	82.04	83.15	89.3	89.61	79.92	88.19	75.48	75.48	83.29	89.81	0
stddev–N_IN_Conn_P_DstIP	99.19	98.08	98.89	98.89	98.89	99.19	98.28	99.29	97.98	97.98	98.67	99.29	10
stddev–drate	84.46	80.02	80.32	84.86	84.76	84.96	79.82	84.96	80.93	80.93	82.6	84.96	0
stddev–srate	94.65	80.12	79.92	84.36	96.27	96.27	80.12	95.26	81.33	80.63	86.89	96.27	4
stddev–max_	90.01	79.92	81.63	83.15	88.9	89	79.92	88.5	75.48	75.48	83.2	90.01	1
N_IN_Conn_P_SrcIP–min_	97.88	92.53	96.27	96.37	98.18	98.18	92.73	97.68	90.01	90.11	94.99	98.18	10
N_IN_Conn_P_SrcIP–state_number	98.49	91.83	98.28	97.78	98.28	98.49	94.55	98.49	89.71	89.71	95.56	98.49	8
N_IN_Conn_P_SrcIP–mean	97.68	92.33	96.06	96.27	97.28	97.48	93.84	97.78	91.42	91.32	95.15	97.78	10
N_IN_Conn_P_SrcIP–N_IN_Conn_P_DstIP	98.89	97.28	98.99	98.99	98.99	98.79	98.49	99.09	97.28	97.68	98.45	99.09	10
N_IN_Conn_P_SrcIP–drate	96.17	91.62	92.33	96.27	95.96	95.86	92.53	96.37	80.63	80.63	91.84	96.37	8
N_IN_Conn_P_SrcIP–srate	98.18	91.62	92.53	96.57	97.78	97.88	92.53	97.88	81.63	81.43	92.8	98.18	8
N_IN_Conn_P_SrcIP–max_	97.58	92.13	94.95	96.57	97.38	97.48	93.64	97.38	92.03	92.23	95.14	97.58	10
min_–state_number	92.53	79.92	92.73	92.84	92.63	92.63	92.33	92.94	79.92	80.02	88.85	92.94	7
min_–mean	90.21	79.92	81.94	87.08	89.3	89.4	79.92	88.7	74.07	75.38	83.59	90.21	1
min_–N_IN_Conn_P_DstIP	99.09	97.48	98.79	98.89	98.89	99.09	98.49	99.19	97.68	97.68	98.53	99.19	10
min_–drate	88.8	80.02	80.52	86.38	88.09	87.79	80.02	88.7	80.52	80.52	84.14	88.8	0
min_–srate	91.02	80.12	81.23	87.08	93.64	96.47	80.12	92.84	81.43	80.83	86.48	96.47	4
min_–max_	89.91	79.92	81.63	87.18	88.9	89.51	79.92	89.1	75.18	75.28	83.65	89.91	0
state_number–mean	99.29	79.92	99.29	99.6	99.19	99.29	98.79	99.09	79.01	86.98	94.05	99.6	7
state_number–N_IN_Conn_P_DstIP	99.09	97.48	98.69	94.85	98.59	98.79	98.49	99.29	97.48	97.48	98.02	99.29	10
state_number–drate	56	80.02	92.53	92.53	92.33	92.33	91.93	92.43	80.52	80.52	85.11	92.53	6
state_number–srate	99.29	80.12	92.73	93.34	96.17	96.67	91.93	96.47	81.63	81.03	90.94	99.29	7
state_number–max_	98.89	79.92	99.29	99.29	98.79	99.19	98.89	99.29	75.58	76.49	92.56	99.29	7
mean–N_IN_Conn_P_DstIP	98.69	97.68	99.19	99.5	99.29	99.5	98.69	99.29	97.78	97.88	98.75	99.5	10
mean–drate	84.86	80.02	81.13	82.74	84.36	86.07	79.82	87.29	80.63	80.63	82.76	87.29	0
mean–srate	90.01	80.12	81.33	82.24	96.17	96.27	80.12	95.46	81.43	80.83	86.4	96.27	4
mean–max_	89.91	79.92	81.43	87.49	88.4	88.5	79.92	88.09	75.28	75.38	83.43	89.91	0
N_IN_Conn_P_DstIP–drate	98.99	97.48	98.28	98.49	98.69	98.69	98.49	98.99	93.84	93.84	97.58	98.99	10
N_IN_Conn_P_DstIP–srate	99.09	97.68	98.49	98.49	99.29	99.5	98.49	99.29	91.02	90.82	97.22	99.5	10
N_IN_Conn_P_DstIP–max_	99.09	98.08	98.89	98.99	99.29	99.39	98.69	99.5	98.18	98.18	98.83	99.5	10
drate–srate	95.86	80.12	80.32	93.24	93.64	93.64	80.12	93.54	81.33	81.43	87.32	95.86	5
drate–max_	83.15	80.02	80.52	81.13	85.97	84.16	79.82	86.68	80.63	80.63	82.27	86.68	0
srate–max_	89.3	80.12	80.42	80.63	96.17	96.47	80.12	95.66	81.33	80.83	86.11	96.47	3
Average Accuracy by methods	93.56	85.72	89.93	92.86	94.87	95.11	88.87	95.08	84.7	84.75
Max	99.39	98.79	99.5	99.6	100	100	99.39	100	99.5	99.5
Count90	50	25	38	46	51	52	38	51	25	24

FP: feature pairs; SVM: support vector machines; RBF: radial basis function; QDA: quadratic discriminant analysis.

Results and discussion

In this study, we developed lightweight IDS using ML algorithms and by determining the optimal feature pairs of the Bot-IoT (2018) dataset. The results of the study suggested that in determining the most effective and optimal feature pairs, and ML algorithm(s), the distributions of the input feature pairs (input data) are very important and critical. The distribution of an input feature pair parameter is a crucial factor due to the reason that the normally distributed $x_{1}$ and $x_{2}$ input feature data allow ML algorithms to generalize the input data and make decisions fast and easily. For instance, non-normally distributed pairs, including sport–stddev, sport–min, sport–mean, sport–drate, sport–srate, sport–max, stddev–mean, stddev–drate, stddev–srate, stddev–max, min–mean, min–drate, mean–max, drate–max, sport–dport, sport–state_number, dport–stddev, dport–min, dport–state_number, dport–mean, dport–drate, dport–srate, dport–max, and some other feature pairs did not perform well since these non-normally distributed input feature pairs had not been generalized well. However, the normally distributed 37 feature pairs like sport–N_IN_Conn_P_SrcIP, seq–stddev, seq–min, sport–N_IN_Conn_P_DstIP, and others performed very well in terms of accuracy. The ML algorithms learned well from these normally distributed input feature pairs since generalizing normally distributed data is an easy task. They fell short when the input data were not well-distributed. We can conclude that it is important to select well-distributed input feature pairs in developing lightweight IDS based on these 10 ML algorithms. Further details may be seen in Figures 26 and 7, and Tables 3 and 4.

Figure 7.

Training and test accuracy results of the machine learning methods trained with the feature pairs. The first raw represents the ML algorithms, the first column represents feature pairs (as $sport - \min_, \dots, sport - N_IN_Conn_P_DstIP$ ), and the distribution of input feature pair data is illustrated in the first column.

Table 4.

F1 score of the detection systems trained with the 66 unique feature pairs.

FPs/methods	Nearest Neighbors	Linear SVM	RBF SVM	Gaussian Process	Decision Tree	Random Forest	Neural Net	AdaBoost	Naive Bayes	QDA
sport–dport	89.77	71	83.81	91.82	99	99.09	79.12	99.19	71	71
sport–seq	92.33	82.23	92.44	91.96	92.68	93.06	91.48	94.5	91.3	91.69
sport–stddev	84.46	71	77.35	85.71	88.64	87.16	76.87	87.28	70.31	70.31
sport–N_IN_Conn_P_SrcIP	92.69	91.73	92.6	92.21	95.73	94.03	92.5	95.95	90.54	90.54
sport–min_	84.87	71	77.38	81.12	86.43	88.09	71	87.48	71	71
sport–state_number	93.18	71	91.97	91.97	92.61	92.71	91.65	93.88	71	71
sport–mean	85.43	71	78.22	83.35	83.15	85.81	76.45	87.26	71	71
sport–N_IN_Conn_P_DstIP	98.5	97.73	98.51	98.4	98.8	98.6	98.51	98.9	97.53	97.53
sport–drate	81.16	71.24	78.81	79.61	80.28	80.36	71.14	81.28	73.55	73.55
sport–srate	80.09	71.48	78.1	78.71	94.43	94.17	71.48	91.27	75.27	74.88
sport–max_	85.57	71	78.94	82.39	82.86	84.6	76.16	87.08	77.4	77.29
dport–seq	97.14	74.73	91.88	98.68	98.59	98.89	91.23	99.6	91.68	91.48
dport–stddev	97.26	71	79.19	97.68	98.18	98.29	78.24	98.08	70.42	70.52
dport–N_IN_Conn_P_SrcIP	98.5	91.73	92.69	97.92	99.7	99.5	92.5	99.7	90.36	90.06
dport–min_	97.88	71	82.18	98.38	98.69	98.69	78.75	98.68	71	71
dport–state_number	98.5	71	96.23	98.5	99.2	99.2	94.85	99.6	71	71
dport–mean	98.69	71	81.45	99.29	99.6	98.99	78.6	99.5	71	71
dport–N_IN_Conn_P_DstIP	99	97.53	98.7	99.6	99.5	99.6	98.6	99.5	97.53	97.53
dport–drate	98.58	71.24	80.08	98.59	98.68	98.69	78.71	98.79	73.62	73.55
dport–srate	98.79	71.48	79.37	98.5	98.9	98.49	79.12	98.99	75.27	74.88
dport–max_	98.58	71	80	98.59	98.99	98.58	78.24	98.89	78.39	71
seq–stddev	95.17	93.27	94.53	94.71	96.16	96.12	94.92	95.92	94.25	94.25
seq–N_IN_Conn_P_SrcIP	99.1	94.64	98.41	99	99.1	99.3	94.83	99.4	95.22	95.22
seq–min_	92.6	91.79	92.36	93.18	91.25	94.54	91.86	93.91	91.7	91.61
seq–state_number	97.27	82.3	96.27	96.19	95.75	95.99	93.9	96.42	91.49	91.51
seq–mean	93.93	92.15	94.88	94.92	95.33	95.78	93.68	94.92	92.55	92.74
seq–N_IN_Conn_P_DstIP	99.39	98.8	99.5	99.6	100	99.6	99.4	100	99.5	99.5
seq–drate	92.56	71.84	90.82	90.91	91.98	90.79	91.06	91.44	73.8	73.8
seq–srate	91.66	71.48	90.82	91.09	97.99	96.1	91.04	97.79	75.41	75.41
seq–max_	94.53	92.51	94.92	94.95	95.69	96.61	93.67	96.29	93.49	93.3
stddev–N_IN_Conn_P_SrcIP	97.32	92.89	93.08	97.13	97.91	98.21	93.64	98.09	90.57	90.66
stddev–min_	88.89	71	76.66	86.15	87.09	87.66	71	87.73	77.93	77.93
stddev–state_number	98.5	71	95.59	98.69	98	99.09	90.97	98.68	70.03	69.94
stddev–mean	89.04	71	76.08	80.44	88.7	88.15	71	87.51	77.83	77.83
stddev–N_IN_Conn_P_DstIP	99.2	98.11	98.9	98.9	98.89	99.4	98.5	99.3	98	98
stddev–drate	85.63	71.24	71.95	85.97	85.91	86.07	71.14	86.08	73.8	73.8
stddev–srate	94.72	71.48	71	85.48	96.34	96.34	71.48	95.35	75.07	74.61
stddev–max_	89.26	71	74.84	80.37	88.24	88.24	71	87.94	77.83	77.83
N_IN_Conn_P_SrcIP–min_	97.91	92.65	96.32	96.41	98.21	97.91	92.69	97.69	90.27	90.36
N_IN_Conn_P_SrcIP–state_number	98.5	92	98.3	97.79	98.3	98.5	94.49	98.5	90.09	90.09
N_IN_Conn_P_SrcIP–mean	97.71	92.27	96.08	96.32	97.33	98	93.96	97.8	91.63	91.54
N_IN_Conn_P_SrcIP–N_IN_Conn_P_DstIP	98.9	97.34	99	99	99	98.8	98.51	99.1	97.34	97.73
N_IN_Conn_P_SrcIP–drate	96.28	91.82	92.32	96.37	96.03	96.47	92.5	96.47	73.62	73.62
N_IN_Conn_P_SrcIP–srate	98.2	91.82	92.5	96.66	97.8	98.1	92.5	97.91	75.27	75.14
N_IN_Conn_P_SrcIP–max_	97.62	92.08	94.93	96.63	97.43	97.81	94.05	97.41	92.13	92.36
min_–state_number	91.98	71	92.09	92.21	92.08	92.08	91.65	92.33	71	71.61
min_–mean	89.45	71	77.72	85.54	88.86	88.54	71	88.03	76.07	77.73
min_–N_IN_Conn_P_DstIP	99.1	97.53	98.8	98.9	98.9	99.2	98.51	99.2	97.73	97.73
min_–drate	88.06	71.24	72.76	84.47	87.72	87.6	71.24	88.18	73.55	73.55
min_–srate	90.79	71.48	76.14	85.5	93.59	96.23	71.48	92.82	74.86	74.74
min_–max_	89.17	71	77.09	85.73	88.39	87.46	71	88.54	77.55	77.64
state_number–mean	99.29	71	99.29	99.6	99.2	99.4	98.79	99.09	70.55	84.72
state_number–N_IN_Conn_P_DstIP	99.1	97.53	98.7	94.56	98.6	98.7	98.51	99.3	97.53	97.53
state_number–drate	59.15	71.24	91.89	91.89	91.7	91.7	91.16	91.8	73.55	73.55
state_number–srate	99.3	71.48	92.09	92.83	96.07	96.61	91.16	96.39	75.27	74.88
state_number–max_	98.89	71	99.29	99.29	98.8	98.99	98.89	99.29	74.98	76.19
mean–N_IN_Conn_P_DstIP	98.68	97.73	99.2	99.5	99.3	99.5	98.8	99.3	97.82	97.92
mean–drate	85.97	71.24	74.08	78.29	82.58	84.25	71.14	86.39	73.62	73.62
mean–srate	89.78	71.48	74.2	77.74	96.22	96.32	71.48	95.56	74.86	74.74
mean–max_	89.2	71	74.42	85.95	87.38	85.86	71	87.32	77.64	77.74
N_IN_Conn_P_DstIP–drate	99	97.53	98.31	98.51	98.7	98.7	98.51	99	93.58	93.58
N_IN_Conn_P_DstIP–srate	99.1	97.73	98.51	98.51	99.3	99.4	98.51	99.3	90.03	89.84
N_IN_Conn_P_DstIP–max_	99.1	98.11	98.9	99	99.3	99.4	98.7	99.5	98.21	98.21
drate–srate	95.95	71.48	72.3	93.27	93.62	93.62	71.48	93.53	75.21	75.54
drate–max_	84.42	71.24	72.92	74.23	83.47	83.54	71.14	85.19	73.62	73.62
srate–max_	89.04	71.48	72.36	72.98	96.21	96.33	71.48	95.76	74.79	74.74

FPs: feature pairs; SVM: support vector machines; RBF: radial basis function; QDA: quadratic discriminant analysis.

Every single one of the 66 feature pairs can be used for a certain type of ML algorithm. For example, the feature pairs state_number–N_IN_Conn_P_DstIP, or N_IN_Conn_P_DstIP–max, or seq–min, or seq–stddev performed significantly well with the linear SVM due to the reason that these pairs were linearly distributed, while the feature pairs such as dport–seq and dport–state_number achieved better accuracy with the RBF SVM which was because these input data were distributed nonlinearly. However, 37 out of 66 pairs could be used for all 10 methods because when all 10 algorithms were trained with these feature pairs, an intrusion detection accuracy rate of over 90% was achieved, and this is illustrated in Table 3, with the column “count90.” If the number in this column was greater than 7, a feature pair was considered best, because the number 7 means a certain feature pair performed well when it was fed to 7 ML algorithms. In addition to this, four ML algorithms, namely, Nearest Neighbors, Decision Trees, Random Forest, and AdaBoost, were determined as the most effective and optimal ones due to the reason that they produced more than 90% accuracy when they were trained with more than 50 different feature pairs separately (Table 3, row “count90”). However, none of these algorithms was good enough in terms of performance when they were trained with full features and thousands of input data, since they make decisions by comparing an input packet to thousands of trained examples in their memory. Therefore, selecting normally distributed feature pairs is vital to building a real-time botnet anomaly detection system when the Bot-IoT dataset is used. Although we can achieve a desired real-time performance when these methods are trained with a feature pair and less than 50,000 example packets, they will fail in real-time IDS if they are trained for more than 50,000 input data with a feature pair since these methods compare thousands of memorized examples to each incoming packet.

Furthermore, the time complexity appeared very well (under 0.4 ms to process an incoming packet) for all feature pairs and ML algorithms. For this very reason, the models in this study were trained with just 2500 packets (the network packets were randomly selected from 3 million records). However, this would not provide such a good performance when the systems were trained with thousands or millions of network packets (Table 5). Although the overall accuracy rates of some models were above 80%, none of them could be used for a real-time IDS, because these models were very slow. For example, the algorithms such as Nearest Neighbors and GP categorize input packets into the anomaly and normal classes by comparing the input data to all packets stored in their memory. As a result of this long comparison, the detection process takes more time. Thus, they will be unable to handle online traffic flow data; however, they are ideal for offline network traffic flow analysis systems. Other methods including Linear SVM, RBF SVM, and Neural Networks are best at generalizing input data via their activation and decision functions. This is why these three methods were better than the other methods in terms of both accuracy and performance. The methods including Decision Tree and Random Forest are very good in terms of accuracy, but they cannot be used for real-time detection systems due to their slow performance. One of the most slowing factors of these systems is that they use all the attributes of the dataset at the same time. In addition to this, detection systems such as SVM, RNN, and LSTM which were used in a previous study,¹⁴ where the Bot-IoT dataset was created and validated, have sufficient and good accuracy. So, they are ideal for offline detection systems, but they cannot process input packets in real time, since they have been trained with full features. Another reason was that they utilized large-structured deep learning algorithms such as RNN and LSTM. However, if the features of this dataset are reduced, and then, smaller and more suitable ML architectures are used, both the accuracy and performance of the system may increase. In that case, a real-time anomaly detection system can be built with deep learning algorithms with large architectures, and it might produce the desired real-time performance, when they are trained using a certain appropriate feature pair.

Table 5.

Execution time in milliseconds for an input packet when the detection systems are trained with feature pairs.

Methods	Nearest Neighbors	Linear SVM	RBF SVM	Gaussian Process	Decision Tree	Random Forest	Neural Net	AdaBoost	Naive Bayes	QDA	Average
sport–dport	0.3	0.06	0.1	0.36	0.1	1.12	0.18	1.94	0.08	0.08	0.432
sport–seq	0.28	0.08	0.05	0.35	0.1	0.91	0.19	1.9	0.1	0.08	0.404
sport–stddev	0.29	0.05	0.06	0.36	0.1	0.93	0.19	2.02	0.08	0.08	0.416
sport–N_IN_Conn_P_SrcIP	0.3	0.09	0.05	0.36	0.1	0.97	0.19	1.99	0.08	0.08	0.421
sport–min_	0.33	0.05	0.1	0.34	0.1	0.94	0.19	1.92	0.09	0.08	0.414
sport–state_number	0.29	0.05	0.05	0.35	0.1	0.93	0.19	1.96	0.08	0.08	0.408
sport–mean	0.29	0.05	0.06	0.36	0.1	0.94	0.18	1.94	0.08	0.08	0.408
sport–N_IN_Conn_P_DstIP	0.29	0.06	0.05	0.36	0.1	0.93	0.19	1.95	0.08	0.08	0.409
sport–drate	0.3	0.06	0.06	0.35	0.1	0.98	0.19	1.9	0.08	0.08	0.41
sport–srate	0.3	0.05	0.06	0.34	0.1	0.91	0.18	1.91	0.08	0.08	0.401
sport–max_	0.29	0.05	0.06	0.34	0.1	0.92	0.19	1.91	0.08	0.08	0.402
dport–seq	0.29	0.05	0.09	0.37	0.1	0.91	0.19	1.91	0.08	0.08	0.407
dport–stddev	0.29	0.05	0.06	0.35	0.1	0.91	0.19	1.91	0.07	0.07	0.4
dport–N_IN_Conn_P_SrcIP	0.28	0.05	0.05	0.35	0.1	0.92	0.19	1.91	0.08	0.08	0.401
dport–min_	0.29	0.05	0.05	0.35	0.1	0.93	0.19	1.94	0.08	0.08	0.406
dport–state_number	0.31	0.05	0.05	0.35	0.1	0.97	0.19	1.94	0.08	0.08	0.412
dport–mean	0.29	0.06	0.06	0.37	0.1	0.93	0.19	1.93	0.08	0.08	0.409
dport–N_IN_Conn_P_DstIP	0.29	0.05	0.05	0.35	0.1	1	0.19	1.95	0.08	0.08	0.414
dport–drate	0.3	0.06	0.06	0.36	0.1	0.95	0.19	1.94	0.08	0.08	0.412
dport–srate	0.29	0.05	0.06	0.37	0.1	0.98	0.19	1.99	0.08	0.08	0.419
dport–max_	0.31	0.06	0.08	0.38	0.11	0.97	0.19	1.94	0.08	0.08	0.42
seq–stddev	0.29	0.05	0.05	0.35	0.1	0.95	0.19	1.95	0.08	0.08	0.409
seq–N_IN_Conn_P_SrcIP	0.29	0.05	0.05	0.36	0.1	0.96	0.21	1.96	0.08	0.08	0.414
seq–min_	0.3	0.05	0.07	0.35	0.1	0.91	0.19	1.91	0.08	0.08	0.404
seq–state_number	0.3	0.05	0.05	0.36	0.1	0.95	0.19	1.98	0.08	0.08	0.414
seq–mean	0.3	0.06	0.05	0.35	0.1	0.95	0.19	1.98	0.08	0.08	0.414
seq–N_IN_Conn_P_DstIP	0.3	0.05	0.05	0.36	0.1	0.95	0.19	1.96	0.08	0.08	0.412
seq–drate	0.3	0.06	0.06	0.35	0.1	0.94	0.19	1.94	0.08	0.08	0.41
seq–srate	0.29	0.05	0.05	0.35	0.1	0.92	0.18	1.89	0.08	0.08	0.399
seq–max_	0.31	0.07	0.08	0.35	0.1	0.95	0.19	1.95	0.08	0.08	0.416
stddev–N_IN_Conn_P_SrcIP	0.3	0.05	0.05	0.35	0.1	0.93	0.2	1.92	0.08	0.08	0.406
stddev–min_	0.29	0.05	0.06	0.35	0.1	0.91	0.2	1.88	0.08	0.08	0.4
stddev–state_number	0.29	0.05	0.05	0.35	0.1	0.91	0.2	1.95	0.08	0.08	0.406
stddev–mean	0.3	0.06	0.06	0.34	0.1	0.92	0.18	1.91	0.08	0.08	0.403
stddev–N_IN_Conn_P_DstIP	0.29	0.05	0.05	0.36	0.1	0.9	0.19	1.9	0.08	0.08	0.4
stddev–drate	0.29	0.05	0.06	0.35	0.1	0.94	0.2	1.91	0.08	0.08	0.406
stddev–srate	0.29	0.05	0.06	0.34	0.1	0.91	0.18	1.92	0.08	0.08	0.401
stddev–max_	0.29	0.05	0.06	0.36	0.1	0.94	0.19	2.05	0.08	0.08	0.42
N_IN_Conn_P_SrcIP–min_	0.33	0.06	0.05	0.35	0.1	0.92	0.18	1.93	0.08	0.08	0.408
N_IN_Conn_P_SrcIP–state_number	0.29	0.05	0.05	0.36	0.1	0.92	0.19	1.89	0.08	0.08	0.401
N_IN_Conn_P_SrcIP–mean	0.29	0.05	0.05	0.35	0.1	0.91	0.19	1.9	0.08	0.08	0.4
N_IN_Conn_P_SrcIP–N_IN_Conn_P_DstIP	0.29	0.06	0.05	0.35	0.1	0.97	0.19	1.91	0.08	0.08	0.408
N_IN_Conn_P_SrcIP–drate	0.3	0.05	0.05	0.36	0.1	0.9	0.18	1.91	0.08	0.08	0.401
N_IN_Conn_P_SrcIP–srate	0.29	0.06	0.05	0.37	0.1	0.92	0.19	1.91	0.08	0.08	0.405
N_IN_Conn_P_SrcIP–max_	0.29	0.07	0.05	0.35	0.1	0.93	0.2	1.97	0.08	0.08	0.412
min_–state_number	0.3	0.06	0.05	0.37	0.1	0.96	0.2	1.95	0.08	0.08	0.415
min_–mean	0.3	0.06	0.06	0.37	0.1	0.95	0.19	1.92	0.08	0.08	0.411
min_–N_IN_Conn_P_DstIP	0.29	0.05	0.05	0.35	0.1	0.91	0.19	1.98	0.08	0.07	0.407
min_–drate	0.28	0.05	0.06	0.35	0.1	0.9	0.18	1.89	0.08	0.08	0.397
min_–srate	0.29	0.05	0.06	0.37	0.1	0.97	0.19	1.91	0.08	0.08	0.41
min_–max_	0.3	0.05	0.06	0.35	0.1	0.93	0.18	1.91	0.08	0.08	0.404
state_number–mean	0.29	0.05	0.05	0.34	0.1	0.94	0.19	1.9	0.08	0.08	0.402
state_number–N_IN_Conn_P_DstIP	0.3	0.05	0.05	0.35	0.1	0.96	0.19	1.93	0.08	0.08	0.409
state_number–drate	0.3	0.05	0.05	0.34	0.1	0.96	0.19	1.91	0.08	0.08	0.406
state_number–srate	0.29	0.05	0.05	0.36	0.1	0.94	0.19	1.96	0.08	0.08	0.41
state_number–max_	0.3	0.06	0.05	0.36	0.1	0.94	0.19	1.96	0.08	0.08	0.412
mean–N_IN_Conn_P_DstIP	0.3	0.05	0.05	0.36	0.1	0.76	0.19	1.91	0.08	0.08	0.388
mean–drate	0.29	0.05	0.1	0.35	0.1	0.92	0.19	1.9	0.08	0.08	0.406
mean–srate	0.29	0.05	0.06	0.36	0.1	0.94	0.19	1.93	0.08	0.08	0.408
mean–max_	0.29	0.05	0.06	0.37	0.11	0.98	0.2	1.98	0.08	0.08	0.42
N_IN_Conn_P_DstIP–drate	0.3	0.05	0.05	0.36	0.1	1.5	0.19	1.97	0.08	0.08	0.468
N_IN_Conn_P_DstIP–srate	0.29	0.05	0.05	0.34	0.1	0.91	0.19	1.92	0.08	0.08	0.401
N_IN_Conn_P_DstIP–max_	0.29	0.05	0.05	0.36	0.1	0.91	0.19	2.02	0.08	0.08	0.413
drate–srate	0.29	0.05	0.06	0.37	0.1	1	0.19	1.98	0.08	0.08	0.42
drate–max_	0.3	0.05	0.07	0.35	0.1	0.99	0.09	1.96	0.16	0.11	0.418
srate–max_	0.3	0.05	0.13	0.35	0.1	0.99	0.19	2	0.08	0.08	0.427

SVM: support vector machines; RBF: radial basis function; QDA: quadratic discriminant analysis.

Conclusion

In this study, we developed a new approach that assists in determining the most effective and optimal feature pairs and ML algorithms. We evaluated each ML algorithm out of 10 with each feature pair of the 66 feature pairs. For this purpose, we developed a new system that included 10 ML algorithms and a feature pair generation module. The system automatically generated feature pairs from a given number of features, and it trained itself with the generated feature pairs; then it evaluated 10 algorithms in terms of accuracy and performance when the methods were trained with the generated feature pairs. In our case, 12 features were given. The system generated 66 unique feature pairs, and it fed each feature pair to a certain ML algorithm. Then, it evaluated the accuracy and performance of each anomaly detection algorithm. As a result, we determined that 37 feature pairs including sport–N_IN_Conn_P_SrcIP, seq–stddev, seq–min, and sport–N_IN_Conn_P_DstIP performed very well in all algorithms, because they were distributed very well with each other. Four ML algorithms including Nearest Neighbors, Decision Trees, Random Forest, and AdaBoost achieved a high accuracy of over 95%; however, their performance (execution time for an incoming network packet) was not good enough for real-time intrusion detection applications. In particular, if these methods are trained with millions of records, they cannot be used for an online detection system, but they are ideal for offline applications. In addition to this, methods like Linear SVM, RBF SVM, and Neural Networks also performed well in terms of both accuracy and time complexity. They were the most effective and most optimal algorithms for both online and offline anomaly detection systems when they were trained via 37 feature pairs of the Bot-IoT (2018) dataset.

Footnotes

Handling Editor: Peio Lopez Iturri

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Erman Özer

Jahongir Azimjonov

References

Burhan

Rehman

Khan

, et al. IoT elements, layered architectures and security issues: a comprehensive survey. Sensors 2018; 18(9): 1–37.

Mashal

Alsaryrah

Chung

, et al. Choices for interaction with things on internet and underlying issues. Ad Hoc Netw 2015; 28: 68–90.

Yun

Yuxin

. Research on the architecture and key technology of Internet of Things (IoT) applied on smart grid. In: 2010 international conference on advances in energy engineering, Beijing, China, 19–20 June 2010, pp.69–72. New York: IEEE.

Said

Masud

Towards Internet of Things: survey and future vision. Int J Comput Netw 2013; 5: 1–17.

Somayya Madakam

Tripathi

. Internet of Things (IoT): a literature review. J Comput Commun 2015; 3: 164–173.

Khan

Zaheer

, et al. Future internet: the Internet of Things architecture, possible applications and key challenges. In: 2012 10th international conference on Frontiers of Information Technology, Islamabad, Pakistan, 17–19 December 2012, pp.257–260. New York: IEEE.

Elrawy

Awad

Hamed

HFA

. Intrusion detection systems for IoT-based smart environments: a survey. J Cloud Comput 2018; 7: 21.

Want

An introduction to RFID technology. IEEE Pervas Comput 2006; 5(1): 25–33.

Want

Near field communication. IEEE Pervas Comput 2011; 10(3): 4–7.

10.

McDermott-Wells

What is Bluetooth?

IEEE Potentials 2005; 23(5): 33–35.

11.

Ferro

Potorti

Bluetooth and Wi-Fi wireless protocols: a survey and a comparison. IEEE Wirel Commun 2005; 12(1): 12–26.

12.

Crosby

Vafa

Wireless sensor networks and LTE-A network convergence. In: 38th annual IEEE conference on local computer networks, Sydney, NSW, Australia, 21–24 October 2013, pp.731–734. New York: IEEE.

13.

Drira

Renault

Zeghlache

Towards a secure social sensor network. In: 2013 IEEE international conference on bioinformatics and biomedicine, Shanghai, China, 18–21 December 2013, pp.24–29. New York: IEEE.

14.

Koroniotis

Moustafa

Sitnikova

, et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener Comp Sy 2019; 100: 779–796.

15.

Yaqoob

Ahmed

Hashem

IAT

, et al. Internet of Things architecture: recent advances, taxonomy, requirements, and open challenges. IEEE Wirel Commun 2017; 24(3): 10–16.

16.

Ahmad

Alam

Atman

An energy-efficient big data workflow scheduling algorithm under budget constraints for heterogeneous cloud environment. J Supercomput 2021; 77: 11946–11985.

17.

Kolias

Kambourakis

Stavrou

, et al. DDoS in the IoT: Mirai and other botnets. Computer 2017; 50(7): 80–84.

18.

Hajiheidari

Wakil

Badri

, et al. Intrusion detection systems in the Internet of Things: a comprehensive investigation. Comput Netw 2019; 160: 165–191.

19.

Zarpelo

Miani

Kawakani

, et al. A survey of intrusion detection in Internet of Things. J Netw Comput Appl 2017; 84: 25–37.

20.

Mitchell

Chen

I-R.

A survey of intrusion detection in wireless network applications. Comput Commun 2014; 42: 1–23.

21.

Fraley

Cannady

. The promise of machine learning in cybersecurity. In: SoutheastCon 2017, Concord, NC, 30 March–2 April 2017, pp.1–6. New York: IEEE.

22.

Gibert

Mateu

Planes

The rise of machine learning for detection and classification of malware: research developments, trends and challenges. J Netw Comput Appl 2020; 153: 102–526.

23.

Raff

Barker

Sylvester

, et al. Malware detection by eating a whole EXE. In: AAAI workshops, New Orleans, LA, 2–7 February 2018. Menlo Park, CA: Association for the Advancement of Artificial Intelligence.

24.

Ghiasi

Sami

Salehi

Dynamic VSA: a framework for malware detection based on register contents. Eng Appl Artif Intel 2015; 44: 111–122.

25.

Carlin

Cowan

OKane

, et al. The effects of traditional anti-virus labels on malware detection using dynamic runtime opcodes. IEEE Access 2017; 5: 17742–17752.

26.

Bekerman

Shapira

Rokach

, et al. Unknown malware detection using network traffic classification. In: 2015 IEEE conference on communications and network security (CNS), Florence, 28–30 September 2015, pp.134–142. New York: IEEE.

27.

Zhao

, et al. Detecting APT malware infections based on malicious DNS and traffic analysis. IEEE Access 2015; 3: 1132–1142.

28.

Kheir

Behavioral classification and detection of malware through http user agent anomalies. J Inf Secur Appl 2013; 18(1): 2–13 (SETOP’2012 and FPS’2012 Special Issue).

29.

Galal

Mahdy

Atiea

MA.

Behavior-based features model for malware detection. J Comput Virol Hacking Tech 2016; 12: 59–67.

30.

Ding

Yuan

Tang

, et al. A fast malware detection algorithm based on objective-oriented association mining. Comput Secur 2013; 39: 315–324.

31.

Salehi

Sami

Ghiasi

MAAR: robust features to detect malicious activity based on API calls, their arguments and return values. Eng Appl Artif Intel 2017; 59: 93–102.

32.

Rieck

Trinius

Willems

, et al. Automatic analysis of malware behavior using machine learning. J Comput Secur 2011; 19(4): 639–668.

33.

Uppal

Sinha

Mehra

, et al. Malware detection and classification based on extraction of API sequences. In: 2014 international conference on advances in computing, communications and informatics (ICACCI), Delhi, India, 24–27 September 2014, pp.2337–2342. New York: IEEE.

34.

Lawal

Shaikh

Hassan

SR.

Security analysis of network anomalies mitigation schemes in IoT networks. IEEE Access 2020; 8: 43355–43374.

35.

Eskandari

Janjua

Vecchio

, et al. Passban IDS: an intelligent anomaly-based intrusion detection system for IoT edge devices. IEEE Internet Things 2020; 7(8): 6882–6897.

36.

Al-Hamadi

Chen

Wang

, et al. Attack and defense strategies for intrusion detection in autonomous distributed IoT systems. IEEE Access 2020; 8: 168994–169009.

37.

Maseer

Yusof

Bahaman

, et al. Benchmarking of machine learning for anomaly based intrusion detection systems in the CICIDS2017 dataset. IEEE Access 2021; 9: 22351–22370.

38.

Rose

Kifayat

Abbas

, et al. A hybrid anomaly-based intrusion detection system to improve time complexity in the internet of energy environment. J Parallel Distr Com 2020; 145: 124–139.

39.

Bagaa

Taleb

Bernabe

, et al. A machine learning security framework for IoT systems. IEEE Access 2020; 8: 114066–114077.

40.

Dwivedi

Vardhan

Tripathi

, et al. Implementation of adaptive scheme in evolutionary technique for anomaly-based intrusion detection. Evolut Intell 2020; 13: 103–117.

41.

Krishnaveni

Vigneshwar

Kishore

, et al. Anomaly-based intrusion detection system using support vector machine. In: Dash

Lakshmi

Das

, et al (eds) Artificial intelligence and evolutionary computations in engineering systems, vol. 1056. Singapore: Springer, 2020, pp.723–731.

42.

Wilkinson

Dumontier

Aalbersberg

, et al. The fair guiding principles for scientific data management and stewardship. Sci Data 2016; 3: 160018.

43.

Ring

Wunderlich

Scheuring

, et al. A survey of network-based intrusion detection data sets. Comput Secur 2019; 86: 147–167.

44.

Lesne

Shannon entropy: a rigorous notion at the crossroads between probability, information theory, dynamical systems and statistical physics. Math Struct Comp Sci 2014; 24(3): e240311.

45.

Pedregosa

Varoquaux

Gramfort

, et al. Scikit-learn: machine learning in python. J Mach Learn Res 2011; 12(85): 2825–2830.

46.

van der Walt

Colbert

Varoquaux

. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 2011; 13(2): 22–30.

47.

McKinney

pandas: a foundational Python library for data analysis and statistics. Python High Perform Sci Comput 2011; 14: 1–9.

48.

Barrett

Hunter

Miller

, et al. Matplotlib—a portable Python plotting package. In: Shopbell

Britton

Ebert

(eds) Astronomical data analysis software and systems XIV ASP conference series, vol. 347. San Francisco, CA: Astronomical Society of the Pacific, 2005, pp.91–95.