This paper is devoted to the problem of class imbalance in machine learning, focusing on the detection of rare intrusion classes in computer networks. Class imbalance occurs when examples of one class heavily outnumber examples of the other classes. In this paper, we are particularly interested in classifiers, as pattern recognition and anomaly detection can be solved as classification problems. Since the major part of the network traffic of any organization is still benign and malignant traffic is rare, researchers have to deal with a class imbalance problem. Substantial research has been undertaken to identify the methods or data features that allow attacks to be identified accurately. The usual tactic for dealing with the class imbalance problem is to label all malignant traffic as one class and then solve a binary classification problem. In this paper, however, we choose not to group or drop rare classes but instead investigate what can be done to achieve good multi-class classification efficiency. Rare class records were up-sampled to preset target ratios using the SMOTE method (Chawla et al., 2002). Experiments with three network traffic datasets, namely CIC-IDS2017, CSE-CIC-IDS2018 (Sharafaldin et al., 2018) and LITNET-2020 (Damasevicius et al., 2020), were performed with the aim of achieving reliable recognition of the rare malignant classes available in these datasets.
Popular machine learning algorithms were chosen to compare their readiness to support rare class detection. Algorithm hyper-parameters were tuned within a wide range of values, different feature selection methods were used, and tests were executed with and without over-sampling to assess multi-class classification performance on rare classes.
Ranking of the machine learning algorithms based on Precision, Balanced Accuracy Score, geometric mean of recall, and decomposition of the prediction error into Bias and Variance shows that decision tree ensembles (Adaboost, Random Forest Trees and Gradient Boosting Classifier) performed best on the network intrusion datasets used in this research.
Detection of intrusions into networks, information systems or workstations, as well as detection of malware and unauthorized activities of individuals, has emerged as a global challenge. A part of the cyber defence challenge is addressed by optimizing intrusion detection systems (IDS). There are three methods of intrusion detection (Koch, 2011): known pattern recognition (signature-based), anomaly-based detection, and a hybrid of the previous two. Anomaly-based detection is currently implemented mainly as a support for zero-day network perimeter defence of big infrastructures and network operators, while signature-based intrusion prevention remains the main mode of defence for most businesses and households. Pattern recognition and anomaly detection can both be seen as classification problems, that is, problems in which the variable to be predicted is categorical. In network traffic, benign data is most often represented by a large number of examples, while malignant traffic appears extremely rarely or is an absolute rarity. This is known as the class imbalance problem and is a known obstacle to the induction of good classifiers by Machine Learning (ML) algorithms (Batista et al., 2004).
He and Ma (2013) define imbalanced learning as the learning process for data representation and information extraction with severe data distribution skews to develop effective decision boundaries to support the decision-making process. He and Ma (2013) also introduced informal conventions for classifying imbalanced datasets. A dataset where the most common class is less than twice as common as the rarest class would be marginally imbalanced. A dataset with an imbalance ratio of about 10 : 1 would be modestly imbalanced, and a dataset with imbalance ratios above 1000 : 1 would be extremely imbalanced. This sort of imbalance is found in medical record databases regarding rare diseases, or in the production of electronic equipment, where non-faulty examples heavily outnumber faulty examples. Cases where the negative to positive ratio is close to or higher than 1 000 000 : 1 are called absolute rarity imbalance. This sort of imbalance is found in cyber security, where all but a few network traffic flows are benign. However, standard ML algorithms are still capable of inducing good classifiers for extremely imbalanced training sets. This shows that class imbalance is not the only problem responsible for the decrease in performance of learning algorithms. Batista et al. (2004) have demonstrated that poor class separation is often partly caused by an overlap of classes due to a lack of feature separation. Another reason could be a lack of attributes specific to a certain decision boundary. It is known that in cases where the negative class has an internal structure (a multimodal class), an overlap between the negative and positive classes can be observed on a few of the clusters within the negative class.
This study reports the results of empirical research executed with selected supervised machine learning classification algorithms in an attempt to compare their efficiency for intrusion detection and to improve on the results of other published studies. The study consists of the following sections: Section 2 introduces the data sources, Section 3 reviews the machine learning methods and model benchmark metrics used in this study, Section 4 gives an overview of the experiment and pre-processing steps, and Section 5 presents the results and conclusions.
Contribution
The research question raised in this study is which supervised machine learning method consistently provides the best multi-class classification results on large and highly imbalanced network datasets. To answer this question we chose the CIC-IDS2017, CSE-CIC-IDS2018 (Sharafaldin et al., 2018) and LITNET-2020 (Damasevicius et al., 2020) datasets, as they are recent, realistic, software-generated network traffic datasets and meet the required criteria (Gharib et al., 2016) for a good network intrusion dataset. The answer to this question is that, based on rankings of performance metrics and bias-variance decomposition, the tree ensembles Adaboost, Random Forest Trees and Gradient Boosting Classifier performed best on the network intrusion datasets used in this research.
The novelty of this research lies in the proposed methodology (see Section 4) and its application to the recent LITNET-2020 dataset, which has not yet been studied in depth. A review of the compliance of the LITNET-2020 dataset with the criteria raised by Gharib et al. (2016) is first introduced in Section 2.2. A variant of random under-sampling (skewed ratio under-sampling, proposed by the authors and discussed in Section 3.1) is used to reduce the imbalance of classes in a nonlinear fashion. SMOTE up-sampling for numeric data and SMOTE-NC for categorical data (see Section 3.2) are executed to increase the representation of rare classes. A comparison of the multi-class classification performance on the CIC-IDS2017 and CSE-CIC-IDS2018 datasets with that on the LITNET-2020 dataset is discussed in Section 5. Macro-averaged multi-class performance metrics are implemented in this research. Balanced accuracy (Formula (2)) and the geometric mean of recall (Formula (4)) are implemented for the LITNET-2020 dataset for the first time (see results in Tables 16 and 17). Multi-criteria scoring is cross-validated by testing on data previously unseen by the models (see Section 4). For decision tree ensemble methods, instead of using weak CART base classifiers, the tree depth and alpha parameters were grid-searched and validated using the method of maximum cost path analysis (Breiman et al., 1984), see Section 3.8. An additional ML model, the Gradient Boosting Classifier, which utilizes an ensemble of classification and regression trees (CART), was introduced as a benchmark in this research via the XGBoost library (Chen and Guestrin, 2016) with GPU support (see Section 3.5.6). In our methodology, due to the highly imbalanced nature of the data used, cost-sensitive method implementations were chosen. These choices lead to better results (see Table 20) compared to other reviewed studies.
Furthermore, selection of models with better generalization capabilities in this research is achieved through decomposition of classification error into bias and variance (see results in Table 18).
Datasets Used
The following section presents a review of datasets considered for this research together with arguments for the choice made.
Datasets Considered for Analysis
There are many datasets that have been used by the researchers to evaluate the performance of their proposed intrusion detection and intrusion prevention approaches. Far from being complete, the list includes: DARPA 1998 (Lippmann et al., 1999) and 1999 traces by Lincoln Laboratory, USA, KDD’99 (Hettich and Bay, 1999), CAIDA (The Cooperative Association for Internet Data Analysis, 2010) datasets by University of California, USA, the Internet Traffic Archive and LBNL traces by Lawrence Berkeley National Laboratory, USA (Lawrence Berkeley National Laboratory, 2010), DEFCON by The Shmoo Group (2011), ISCX IDS 2012 (Shiravi et al., 2012), CIDDS-001 (Coburg Intrusion Detection Data Set) (Ring et al., 2017) and others. However, it has been widely acknowledged that machine learning research in an intrusion detection area needs to include new attack types and therefore researchers should consider more recent data sources.
In this research, three recent network datasets are explored that comply with the criteria described below (see Section 2.2) and are suggested by their authors for intrusion detection research. The datasets chosen are CIC-IDS2017 and CSE-CIC-IDS2018 (Sharafaldin et al., 2018) by the University of New Brunswick, Canada, and LITNET-2020 (Damasevicius et al., 2020). These datasets are of significant volume, contain anonymized real academic network traffic and are suited for multiple purposes of machine learning. LITNET-2020 is a new dataset that is given particular attention in this research, with a discussion of its compliance with the dataset suitability criteria devised by Gharib et al. (2016).
Requirements for Cybersecurity Datasets
Criteria for building such datasets are discussed by Małowidzki et al. (2015), Buczak and Guven (2016), Maciá-Fernández et al. (2018), Ring et al. (2019), Damasevicius et al. (2020), and others.
Małowidzki et al. (2015) define the following features of a good dataset: it must contain recent data, be realistic, contain all typical attacks met in the wild, be labelled, be correct regarding operating cycles in enterprises (working hours), and be flow-based. Ring et al. (2019) contend that a good dataset should be comparable with real traffic and therefore have more normal than malicious traffic, since most of the traffic within a company is normal and only a small part is malicious. A detailed framework and analysis of criteria for such datasets is proposed by the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick. Gharib et al. (2016) have proposed eleven dataset selection criteria, which are presented in Table 1. Following the publication of these criteria, CIC created a list of new datasets,1 addressing issues of compliance with these criteria. The CSE-CIC-IDS2018 dataset followed with improvements, such as a decreased number of duplicates and uncertainties. Thakkar and Lohiya (2020), in Sections 4.1 and 4.2 and Tables 4 and 5, and Karatas et al. (2020), in Sections III.C (CIC-IDS2017) and III.D (CSE-CIC-IDS2018), provide discussion and support for these claims.
Table 1. Dataset compliance criteria by Gharib et al. (2016).

| No. | Criteria |
|---|---|
| 1. | Complete network configuration |
| 2. | Complete traffic |
| 3. | Labelled dataset |
| 4. | Complete interaction |
| 5. | Complete record |
| 6. | Available protocols |
| 7. | Attack diversity |
| 8. | Anonymity |
| 9. | Heterogeneity |
| 10. | Feature set |
| 11. | Metadata |
LITNET-2020 Compliance
The LITNET-2020 dataset was selected for the current study because it complies with most of the above-mentioned requirements, with some reservations regarding the interaction completeness, heterogeneity and feature set completeness criteria.
These eleven criteria as applied to LITNET-2020 are discussed below.
Complete network configuration: In order to investigate the real course of attacks, it is necessary to test the real network configuration. All of the network flows in this dataset are received or generated at the Network of Lithuanian academic institutions LITNET.
Complete traffic: The dataset accumulates full packet flows from the source to the destination, which can be a workstation computer, router or another specialized service device.
Labelled dataset: The dataset is labelled into a single benign and 12 malignant classes. The benign class is not separately labelled into sub-classes; however, this could be done, because the number of benign records exceeds 36 million and is close to 92% of the whole dataset.
Complete interaction: The correct interpretation of the data requires data from the entire network interoperability process. LITNET-2020 dataset, however, is a pure network traffic dataset with no correlated host memory or host log information.
Record completeness: The LITNET-2020 dataset is compliant with this requirement.
Various protocols: Records of 13 types of protocols for normal and 3 types of protocols for malignant traffic are available in the LITNET-2020 dataset.
Diversity and novelty of attacks: The dataset includes attack flows that were detected between 2019-03-06 (first flow) and 2020-01-31 (last flow).
Anonymity: It is important that the dataset contain only data for which privacy is not a concern. The LITNET-2020 dataset contains no personally identifiable data.
Heterogeneity: Data from different sources, such as network streams, operating system logs, or network equipment logs, memory images, must be available. LITNET-2020 is not compliant with this requirement.
Feature Set/Attribute Linkage: It is important for the research that data from different types of sources for the same event be linked, for example, device memory view, network traffic, and device logs. LITNET-2020 is not compliant with this requirement as it contains no linked host sources.
Metadata and documentation: Information about attributes, how the traffic was generated or collected, network configuration, attackers and victims, machine operating system versions and attack scenarios are required to do the research. LITNET-2020 is documented in Damasevicius et al. (2020).
Cybersecurity Dataset Imbalance Problem
In the datasets selected for this research, the benign class accounts for 80.3% to 92.0% of total records (see Table 2), while some small classes each account for less than 0.001% (see Table 4). Table 2 summarizes the split between benign and malignant records:
Table 2. Dataset content split.

| Record type | CIC-IDS2017 | CSE-CIC-IDS2018 | LITNET-2020 |
|---|---|---|---|
| Benign | 80.3% | 83.1% | 92.0% |
| Malignant | 19.7% | 16.9% | 8.0% |
The following Table 3 presents the split of malignant classes and is a summary of dataset imbalance shares in accordance with the taxonomy described by He and Ma (2013):
Table 3. Dataset imbalance.

| Imbalance category¹ | CIC-IDS2017 | CSE-CIC-IDS2018 | LITNET-2020 |
|---|---|---|---|
| Modest <(10 : 1) | 8.16% | 0.00% | 0.00% |
| High <(1000 : 1) | 11.39% | 16.85% | 7.83% |
| Extreme >(1000 : 1) | 0.15% | 0.08% | 0.20% |
| Total Malignant | 19.7% | 16.9% | 8.0% |

¹ Share of records in imbalance category.
The following Table 4 represents a summary of extremely imbalanced (>1000 : 1) classes in the three selected datasets.
Table 4. Extremely rare classes in the datasets.

| CIC-IDS2017 class | Share | CSE-CIC-IDS2018 class | Share | LITNET-2020 class | Share |
|---|---|---|---|---|---|
| Bot | 0.0695% | DoS-Slowloris | 0.0677% | W32.Blaster | 0.0660% |
| Brute Force-Web | 0.0532% | LOIC-UDP¹ | 0.0107% | ICMP Flood | 0.0638% |
| Brute Force-XSS | 0.0230% | Brute Force-Web | 0.0038% | HTTP Flood | 0.0630% |
| Infiltration | 0.0013% | Brute Force-XSS | 0.0014% | Scan | 0.0170% |
| SQL Injection | 0.0007% | SQL Injection | 0.0005% | Reaper Worm | 0.0032% |
| Heartbleed | 0.0004% | | | Spam | 0.0021% |
| | | | | Fragmentation | 0.0013% |
| Total Extreme >(1 000 : 1) | 0.15% | | 0.08% | | 0.20% |

¹ DDoS attack.
Various imbalance measures are discussed by Ortigosa-Hernández et al. (2017) in a study dedicated to such measures. In Section III.E of Karatas et al. (2020), the authors review the imbalance ratios, which are the most practical of these measures to use, of several IDS datasets, including CIC-IDS2017 and CSE-CIC-IDS2018.
Referring to Ortigosa-Hernández et al. (2017) and Karatas et al. (2020), the following Formula (1) can be used for the calculation of the imbalance ratio:

$$\rho = \frac{\max_i \{C_i\}}{\min_i \{C_i\}}, \tag{1}$$

where $C_i$ is the data size of class $i$.
For example, historical NSL-KDD has an imbalance ratio of 648, CIC-IDS2017 has an imbalance ratio of 112 287 and CSE-CIC-IDS2018 has a slightly better imbalance ratio of 53 887. LITNET-2020 has an imbalance ratio of 70 769.
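As an illustration only, the calculation can be sketched in a few lines of Python, assuming that the imbalance ratio is the size of the largest class divided by the size of the smallest class. The function name `imbalance_ratio` and the record counts below are hypothetical and are not taken from the datasets studied here.

```python
def imbalance_ratio(class_counts):
    """Largest class size divided by smallest class size (assumed definition)."""
    sizes = [n for n in class_counts.values() if n > 0]
    if not sizes:
        raise ValueError("at least one non-empty class is required")
    return max(sizes) / min(sizes)


# Hypothetical record counts, chosen only to illustrate the calculation.
toy_counts = {"benign": 90_000, "dos": 9_000, "scan": 900, "worm": 9}

ratio = imbalance_ratio(toy_counts)
print(f"imbalance ratio: {ratio:,.0f} : 1")
```

Ratios computed this way depend on the exact version of a dataset, for example on whether duplicate records were removed beforehand.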
While imbalance ratios are an important part of the discussion, absolute rarity is another concept introduced by He and Ma (2013) for the case when there are not enough records to learn the class. If there is not enough information within the feature space, the decision boundary cannot be determined. There are no such classes in the LITNET-2020 dataset, and the data was sufficient for learning for all the machine learning algorithms used in our experiment. However, the Infiltration, Heartbleed and Web Attack-SQL Injection classes in the CIC-IDS2017 dataset exhibit the behaviour of such absolute rarity, and learning the decision boundaries for these classes is complicated and unspecific. In the CSE-CIC-IDS2018 dataset, even though Infiltration class records are abundant, a high overlap with the benign class is observed.
CIC-IDS-2017
The CIC-IDS-2017 dataset (Sharafaldin et al., 2018) is made available by the Canadian Institute for Cyber Security Research at the University of New Brunswick2 and introduces labelled data of 14 types of attacks, including DDoS, Brute Force, XSS, SQL Injection, Infiltration, and Botnet. The traffic was emulated in a test environment during the period from July 3 to July 7, 2017. Network traffic features and related aggregates were extracted and generated using the CICFlowMeter tool and made available in the form of 8 CSV files. CICFlowMeter is an open source tool3 provided by CIC at UNB that generates bidirectional flows from pcap files and extracts features from these flows; it was made available to the research community by Draper-Gil et al. (2016) and further described by Lashkari et al. (2017). The dataset contains a total of 2 830 743 labelled records with flow data and synthetic features.
The following Table 5 is a summary of class representation of this dataset.
Table 5. Class representation in CIC-IDS2017 dataset.

| Traffic class | Record count | Share (%) |
|---|---|---|
| BENIGN | 2 273 097 | 80.3004% |
| DoS Hulk | 231 073 | 8.1630% |
| PortScan | 158 930 | 5.6144% |
| DDoS | 128 027 | 4.5227% |
| DoS GoldenEye | 10 293 | 0.3636% |
| FTP-Patator | 7 938 | 0.2804% |
| SSH-Patator | 5 897 | 0.2083% |
| DoS slowloris | 5 796 | 0.2048% |
| DoS Slowhttptest | 5 499 | 0.1943% |
| Bot | 1 966 | 0.0695% |
| Web Attack-Brute Force | 1 507 | 0.0532% |
| Web Attack-XSS | 652 | 0.0230% |
| Infiltration | 36 | 0.0013% |
| Web Attack-SQL Injection | 21 | 0.0007% |
| Heartbleed | 11 | 0.0004% |
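The imbalance categories reported for CIC-IDS2017 in Table 3 can be cross-checked against the record counts of Table 5. The sketch below assumes that each malignant class is categorized by the ratio of benign records to records of that class, with thresholds of 10 : 1 and 1000 : 1; this reading of the taxonomy, and the names `category` and `category_shares`, are assumptions made for illustration.

```python
# Record counts copied from Table 5 (CIC-IDS2017).
COUNTS = {
    "BENIGN": 2_273_097, "DoS Hulk": 231_073, "PortScan": 158_930,
    "DDoS": 128_027, "DoS GoldenEye": 10_293, "FTP-Patator": 7_938,
    "SSH-Patator": 5_897, "DoS slowloris": 5_796, "DoS Slowhttptest": 5_499,
    "Bot": 1_966, "Web Attack-Brute Force": 1_507, "Web Attack-XSS": 652,
    "Infiltration": 36, "Web Attack-SQL Injection": 21, "Heartbleed": 11,
}


def category(majority_count, class_count):
    """Assumed rule: categorize by the benign-to-class record ratio."""
    ratio = majority_count / class_count
    if ratio < 10:
        return "modest"
    if ratio < 1000:
        return "high"
    return "extreme"


def category_shares(counts, majority_label="BENIGN"):
    """Percentage of all records that falls into each imbalance category."""
    total = sum(counts.values())
    majority = counts[majority_label]
    shares = {"modest": 0.0, "high": 0.0, "extreme": 0.0}
    for label, n in counts.items():
        if label != majority_label:
            shares[category(majority, n)] += 100.0 * n / total
    return shares


shares = category_shares(COUNTS)
for name, pct in shares.items():
    print(f"{name:8s}{pct:6.2f}%")
```

Under these assumptions the shares round to 8.16%, 11.39% and 0.15%, in agreement with the CIC-IDS2017 column of Table 3.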
The dataset features used further in this research, all of which are measures of duration or related aggregates, belong to the following categories:
Fiat (Forward Inter Arrival Time: mean, min, max, std): aggregates of the time between two packets sent in the forward direction;
Biat (Backward Inter Arrival Time: mean, min, max, std): aggregates of the time between two packets sent in the backward direction;
Flowiat (Flow Inter Arrival Time: mean, min, max, std): aggregates of the time between two packets sent in either direction;
Active (mean, min, max, std): aggregates of the amount of time a flow was active before going idle;
Idle (mean, min, max, std): aggregates of the amount of time a flow was idle before becoming active;
Flow Bytes/s: flow bytes sent per second;
Flow Packets/s: flow packets sent per second;
Duration: the duration of a flow.
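To make the inter-arrival aggregates concrete, the following sketch computes the mean, min, max and std of the gaps between consecutive arrival timestamps observed in one direction. It is not the CICFlowMeter implementation: the function name `iat_aggregates`, the sample timestamps and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def iat_aggregates(timestamps):
    """Mean, min, max and std of gaps between consecutive timestamps (seconds)."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        # Fewer than two arrivals: no inter-arrival time can be measured.
        return {"mean": 0.0, "min": 0.0, "max": 0.0, "std": 0.0}
    return {
        "mean": mean(gaps),
        "min": min(gaps),
        "max": max(gaps),
        "std": pstdev(gaps),  # population std; an assumption, see lead-in
    }


# Hypothetical forward-direction arrival times, in seconds.
forward_times = [0.0, 1.0, 3.0, 6.0]
stats = iat_aggregates(forward_times)
print(stats)
```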
CSE-CIC-IDS2018
The CSE-CIC-IDS2018 dataset (Sharafaldin et al., 2018) is made available by the Canadian Institute for Cyber Security Research at the University of New Brunswick.4 Data was emulated in the CIC test environment, consisting of 50 attacking machines, 420 victim PCs and 30 victim servers, during the period from February 14 to March 2, 2018. The dataset contains records from 14 distinct attacks, is labelled and is presented together with anonymised PCAP5 files. In total, 80 network traffic features were extracted and calculated using the CICFlowMeter tool. Ten CSV files containing 16 232 943 records are made available for machine learning. The representation of classes in CSE-CIC-IDS2018 ranges from approximately 1 : 20 to 1 : 100 000.
The following Table 6 presents a summary of class representation of this dataset.
Table 6. Class representation of CSE-CIC-IDS2018 dataset.

| Traffic class | Record count | Share (%) |
|---|---|---|
| Benign | 13 484 708 | 83.070% |
| HOIC¹ | 686 012 | 4.226% |
| LOIC-HTTP¹ | 576 191 | 3.550% |
| Hulk¹ | 461 912 | 2.846% |
| Bot | 286 191 | 1.76% |
| FTP-BruteForce | 193 360 | 1.191% |
| SSH-Bruteforce | 187 589 | 1.156% |
| Infilteration | 161 934 | 0.998% |
| SlowHTTPTest¹ | 139 890 | 0.862% |
| GoldenEye¹ | 41 508 | 0.256% |
| Slowloris¹ | 10 990 | 0.068% |
| LOIC-UDP¹ | 1 730 | 0.011% |
| Brute Force-Web | 611 | 0.004% |
| Brute Force-XSS | 230 | 0.001% |
| SQL Injection | 87 | 0.0005% |

¹ Variants of DoS attacks.
The same dataset features as described in Section 2.5 are used further in this research for feature selection.
LITNET-2020
LITNET-2020 is a new annotated network dataset for network intrusion detection, obtained from the traffic of the real-life Lithuanian academic network LITNET by researchers from Kaunas University of Technology (KTU). The data collection environment, a comparison of the dataset with other recently published network intrusion datasets and a description of the attacks represented in the LITNET-2020 dataset are given by Damasevicius et al. (2020). The dataset contains benign traffic of the academic network and 12 attack types generated in the KTU-managed LITNET network from March 6, 2019 to January 31, 2020. Network traffic was captured in the open source nfcapd binary format, anonymised and processed into the CSV format, resulting in 39 603 674 time-stamped records. Nfsen, MySQL, and Python script tools were used for extra feature generation and pre-processing, with data fields in the CSV format named after the fields generated by Nfdump.6 The 49 attributes that are specific to the NetFlow v9 protocol as defined in RFC 3954 (Claise, 2004) form the basis of the dataset, which is further expanded with additional fields of time and TCP flags (in symbolic format) that can be used to identify attacks. An additional 19 attack-specific attributes are added. The representation of classes in LITNET-2020 is imbalanced in a range from approximately 1 : 30 to 1 : 100 000.
The following Table 7 presents a summary of class representation of this dataset.
Table 7. Class representation of LITNET-2020 dataset.

| Traffic class | Record label | Record count¹ | Share, % |
|---|---|---|---|
| Benign | none | 36 423 860 | 91.9709% |
| SYN Flood | tcp_syn_f | 1 580 016 | 3.9896% |
| Code Red | tcp_red_w | 1 255 702 | 3.1707% |
| Smurf | icmp_smf | 118 958 | 0.3004% |
| UDP Flood | udp_f | 93 583 | 0.2363% |
| LAND DoS | tcp_land | 52 417 | 0.1324% |
| W32.Blaster | tcp_w32_w | 24 291 | 0.0613% |
| ICMP Flood | icmp_f | 23 256 | 0.0587% |
| HTTP Flood | http_f | 22 959 | 0.0580% |
| Port Scan | tcp_udp_win_p | 6 232 | 0.0157% |
| Reaper Worm | udp_reaper_w | 1 176 | 0.0030% |
| Spam botnet | smtp_b | 747 | 0.0019% |
| Fragmentation | udp_0 | 477 | 0.0012% |

¹ Record counts before removing timestamp and related record duplicates.
Methods
Several types of methods were used in this research to improve the performance of ML methods. They can be grouped into pre-processing methods (see Sections 3.1–3.3) and machine learning methods (see Section 3.5). Record under-sampling methods are discussed in detail in Section 3.1, record over-sampling in Section 3.2, and feature selection, scaling, frequency transformation and other pre-processing activities in Section 3.3. Machine learning methods (see Section 3.5) capable of cost-sensitive learning were chosen for performance comparison in this paper.
For all models, their hyper-parameters were searched using the GridSearch method, and later multiple performance measures (see Section 3.6) were used to evaluate and compare ML algorithms.
Under-Sampling Methods
The benign class in our datasets constitutes up to 92% of total records. Fixed ratio random under-sampling of benign and over-represented malignant class records, utilizing a uniform distribution for record selection, was implemented on data load for all datasets. Under-sampling refers to the process of reducing the number of samples in a dataset. The fixed ratio random under-sampling method aims to balance the class distribution through the random-uniform elimination of majority class examples. It is worth noting that random under-sampling can discard potentially useful data that could be important for the machine learning process. Under-sampling methods can be categorized into two groups: (i) fixed ratio under-sampling and (ii) cleaning under-sampling (Lemaitre et al., 2016). Fixed ratio under-sampling is based on a statistically random selection that targets either a provided absolute number of records of a given class or a ratio constituting a proportion of the total number of labels. Cleaning under-sampling is based on (i) clustering, (ii) nearest neighbour analysis, or (iii) classification accuracy (based on the instance hardness threshold, Smith et al., 2014).
Cleaning under-sampling approaches do not target a specific ratio, but rather clean the feature space based on some empirical criteria (Lemaitre et al., 2016). According to Lemaitre et al. (2016), these criteria are derived from the nearest neighbour rule, namely: (i) condensed nearest neighbours (Hart, 1968), (ii) edited nearest neighbours (Wilson, 1972), (iii) one-sided selection (Kubat and Matwin, 1997), (iv) neighbourhood cleaning rule (Laurikkala, 2001), and (v) Tomek links (Tomek, 1976).
Cleaning under-sampling methods such as Edited Nearest Neighbours, TomekLinks and Condensed Nearest Neighbours were tested. However, due to the size of the sub-sampled data and the large computational overhead these methods require, they were not explored further. The fixed random under-sampling was implemented in two steps as follows:
Major class records were first randomly under-sampled to a target number of records, such as to provide sufficient learning for all models. Target numbers were obtained after analysis of learning curves. Sufficient learning is defined here as the objective to have learning and testing curves to converge within a margin less than , which for all models in this experiment occurs after approximately 0.6 million records.
Numbers of benign and other highly imbalanced classes were further transformed with a random under-sampling function from Imbalanced-learn library (Lemaitre et al., 2016) using the number of records per class targets, calculated with the following empirically chosen skewed ratio function introduced in this research, where N is a number of initial records within a named class, where s is a share of records in that class. This proposed under-sampling method further on in this paper is referred to as Skewed fixed ratio under-sampling. The effect of this function is such that numbers of over-represented classes are decreased in a non linear manner, penalizing the best represented classes, while leaving the rare classes almost intact, thus simplifying, speeding up and decreasing the imbalance of the related learning of rare classes.
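The second step can be illustrated with a dependency-free sketch of uniform random under-sampling to per-class record targets. The skewed ratio function itself is not reproduced here: the `targets` dictionary below is hypothetical and stands in for whatever targets that function yields, and `undersample_to_targets` is an illustrative helper, not the Imbalanced-learn function used in the experiments. To our understanding, the `RandomUnderSampler` class of Imbalanced-learn accepts such a dictionary through its `sampling_strategy` parameter.

```python
import random
from collections import Counter, defaultdict


def undersample_to_targets(labels, targets, seed=0):
    """Return sorted indices kept after uniform random under-sampling.

    `targets` maps a class label to the maximum number of records to keep;
    classes missing from `targets` are kept in full.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    kept = []
    for label, indices in by_class.items():
        limit = min(targets.get(label, len(indices)), len(indices))
        kept.extend(rng.sample(indices, limit))  # sampling without replacement
    return sorted(kept)


# Hypothetical labels and targets, for illustration only.
labels = ["benign"] * 1000 + ["syn_flood"] * 100 + ["spam"] * 5
targets = {"benign": 200, "syn_flood": 50}

kept = undersample_to_targets(labels, targets)
kept_counts = Counter(labels[i] for i in kept)
print(dict(kept_counts))
```

In this toy example the rare class is left intact while the two larger classes are reduced, which mirrors the intent of penalizing the best represented classes.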
Over-Sampling Methods
In this paper, to balance minority classes, we investigate random and SMOTE (Synthetic Minority Over-sampling Technique) (Chawla et al., 2002) over-sampling methods. Random over-sampling is a base method that aims to balance class distribution through the random replication of minority class examples. Unfortunately, this can increase the likelihood of classifier overfitting (Batista et al., 2004). Therefore, we removed all duplicates in training data.
A more advanced method, capable of increasing minority class size without duplication, is SMOTE. SMOTE forms new minority class examples by linearly interpolating between minority class examples that are close to each other. Thus, the risk of overfitting is mitigated, as the decision boundaries of the classifier for the minority class spread further into the majority class space. SMOTE works in feature space, not in data space; therefore, before over-sampling is executed, the first step is to select the numeric features to over-sample, as it is not necessary to over-sample in all dimensions. SMOTE over-sampling is achieved by following these steps: a) take the k nearest neighbours from the minority class for some minority class vector in the feature space, b) randomly choose a vector from those k neighbours, c) take the difference between the vector and its chosen neighbour, multiply the difference vector by a random number between 0 and 1, and add the result to the original vector to obtain a synthetic point, d) repeat the previous steps until the target number of synthetic points is reached. After this, the new records can be added to the current data (see Chawla et al., 2002, for the complete algorithm). The SMOTE method can be combined with some under-sampling methods to remove examples of all classes that tend to be misclassified. For example, in SMOTE with the Edited Nearest Neighbours (ENN) algorithm (Batista et al., 2004), after SMOTE is used to over-sample the defined minority classes, ENN is used to remove samples from both classes, such that any sample that is misclassified by its given number of nearest neighbours is removed from the training set. Batista et al. (2004) have demonstrated that this combination gives the best results on imbalanced datasets with minority classes containing under 100 records.
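The interpolation at the core of SMOTE can be sketched without any library as follows. This is a simplified illustration for numeric features only, not the Imbalanced-learn implementation used in the experiments; the function name `smote_sketch`, the default `k=5` and the sample points are assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # a) the k nearest minority neighbours of the base vector (itself excluded)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # b) choose one of these neighbours at random
        neighbour = rng.choice(neighbours)
        # c) move a random fraction of the way from the base to the neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-dimensional minority class samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=3)
for point in new_points:
    print([round(value, 3) for value in point])
```

Because every synthetic point lies on a segment between two existing minority samples, none of the new points duplicates an original record unless the random fraction happens to be zero.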
However, due to the complexity of the edited neighbours procedures (Witten et al., 2005) being , where n is a number of samples, d – a number of dimensions (features) and k – a number of nearest neighbours, this solution is resource intensive.
As our datasets have not only continuous but also nominal features, we used a modification of SMOTE – Synthetic Minority Over-sampling Technique-Nominal Continuous (SMOTE-NC), from imbalanced-learn library (Lemaître et al., 2017) in the research. We used a recommended number of neighbours equal to , and separated categorical and numeric features before over-sampling.
Feature Selection Methods
Based on the research ideas and practical implementation recommendations of Sharafaldin et al. (2018) and Shetye (2019), the selection of features was tested with 3 classes of methods: (a) filtering – correlation and related heat map analysis, (b) univariate – recursive feature elimination, and (c) iterative – regularization methods. In this research, features were selected with SelectKBest from the Scikit-learn library (Pedregosa et al., 2011). The SelectKBest method takes a score function as a parameter, such as $\chi^2$, the Anova F-value or the information gain function, and retains the first k features with the highest scores.
If the Anova F-value function is used, a test result is considered statistically significant if it is unlikely to have occurred by chance, assuming the truth of the null hypothesis. If $\chi^2$ is used as a score function, SelectKBest will compute the $\chi^2$ statistic between each feature of X and y (assumed to be class labels). A small value will mean the feature is independent of y. A large value will mean the feature is non-randomly related to y, and so likely to provide important information. Only k features will be retained. Mutual information (information gain) between two random variables is a non-negative value that measures the dependency between the variables. It is equal to zero if and only if the two random variables are independent, whereas higher values mean higher dependency. Mutual information methods can capture any kind of statistical dependency, but, being non-parametric (Ross, 2014), they require more samples for accurate estimation and are computationally more expensive. Therefore, because of its better time performance, the Anova F-value was selected in this research.
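To show what an Anova F-value ranking does, the sketch below computes a one-way Anova F-value per feature and keeps the k highest-scoring features. It is a dependency-free illustration of the idea behind SelectKBest with an F-value score function, not the Scikit-learn code used in the experiments; `anova_f`, `select_k_best` and the sample data are hypothetical.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way Anova F-value of one numeric feature against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2:
        raise ValueError("at least two classes are required")
    grand_mean = mean(values)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf")  # classes are perfectly separated by this feature
    return (between / (k - 1)) / (within / (n - k))


def select_k_best(columns, labels, k):
    """Names of the k features with the highest F-values."""
    scores = {name: anova_f(column, labels) for name, column in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical feature columns for four benign and four attack records.
labels = ["benign"] * 4 + ["attack"] * 4
columns = {
    "duration": [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1],
    "noise": [3.0, 1.0, 4.0, 2.0, 2.0, 4.0, 1.0, 3.0],
    "rate": [10.0, 12.0, 11.0, 13.0, 14.0, 15.0, 13.0, 16.0],
}
selected = select_k_best(columns, labels, k=2)
print(selected)
```

The feature whose class means coincide receives an F-value of zero and is dropped, while the two features whose class means differ are retained.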
Embedded methods penalize a feature based on a coefficient threshold. On each iteration of the model training process those features which contribute the most to the training for a particular iteration are selected.
Further on in this paper, two methods, the filtering and SelectKBest from Scikit-Learn were used to select features.
When performing feature selection, SelectKBest is focusing on the largest classes, therefore a possible improvement would be to do feature selection in a pipeline, by firstly selecting the most important features for the rarest class and then adding features needed for every class.
Generating additional synthetic features was not attempted in this research, as all chosen datasets contain a significant number of such.
Cost-Sensitive Learning Methods
Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model (Brownlee, 2020).
If not configured, machine learning algorithms assume that all misclassification errors made by a model are equal. In case of an intrusion detection problem, missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class.
The simplest and most popular approach to implementing cost-sensitive learning is to penalize the model more for training errors made on examples from the minority class by adjusting class weights. The decision tree algorithm can be modified to weight model error by class weight when selecting splits. The heuristic rule, also confirmed with intuition from decision trees (Brownlee, 2020), is to invert the ratio of the class distribution in the training dataset.
In this research, weights adjustment for decision trees was implemented using Scikit-learn library model parameters class_weight, setting it to ‘balanced’, which does the above mentioned inversion of class weights. Prior statistics were used for Quadratic discriminant analysis model.
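The inversion of the class distribution can be sketched as follows. The helper name `balanced_class_weights` and the toy label vector are hypothetical; the formula n_samples / (n_classes * class_count) is the one documented by Scikit-learn for the 'balanced' setting.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights inversely proportional to class frequencies:
    n_samples / (n_classes * count_of_class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical imbalanced labels: 90 majority, 9 minority, 1 extremely rare record.
y = np.array([0] * 90 + [1] * 9 + [2] * 1)
weights = balanced_class_weights(y)
print(weights)
```

A dictionary of this form can also be passed directly as the class_weight parameter of Scikit-learn tree models when a custom weighting is preferred over 'balanced'.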
Choice of Machine Learning Methods
For a performance comparison of machine learning methods on network intrusion detection data with imbalanced classes, we selected the most popular machine learning algorithms from surveys and review papers, related to intrusion detection (Buczak and Guven, 2016; Sharafaldin et al., 2018; Damasevicius et al., 2020).
Adaptive Boosting (Adaboost)
The AdaBoost ensemble method was proposed by Yoav Freund and Robert Schapire for generating a strong classifier from a set of weak classifiers (Freund and Schapire, 1997). The AdaBoost algorithm works by weighting instances in the dataset by how easy or difficult they are to classify, and correspondingly prioritizes them in the construction of subsequent models. A default base classifier was used with Adaboost by the authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018), obtaining Precision and F1 of 0.77 and Recall of 0.84. Yulianto et al. (2019) used SMOTE, Principal Component Analysis (PCA), and Ensemble Feature Selection (EFS) to improve the performance of AdaBoost on the CIC-IDS-2017 dataset, achieving Accuracy, Precision, Recall, and F1 scores of 0.818, 0.818, 1.000, and 0.900, respectively.
Classification and Regression Tree (CART)
The Classification and Regression Tree method was proposed by Breiman et al. (1984), and used to construct tree structured rules from training data. Tree split points are chosen on a basis of cost function minimization.
The authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018) obtained weighted averages of Precision, Recall and F1 of 0.98 using ID3 (Iterative Dichotomiser 3), introduced by Quinlan (1986).
In this research, CART, as implemented in the Scikit-learn library, was also used to obtain a base classifier and tree parameters for Adaboost, Gradient Boosting Classifier and Random Forest Classifier. Tree depth and alpha were obtained using minimal cost-complexity pruning (Breiman et al., 1984), implemented in the Scikit-learn library function cost_complexity_pruning_path, discussed in Section 3.8.
k-Nearest Neighbours (KNN)
The k-Nearest Neighbours method was proposed by Dudani (1976) as a method which makes use of a neighbour weighting function for the purpose of assigning a class to an unclassified sample. KNN was used by the authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018), who obtained weighted averages of Precision, Recall and F1 of 0.96. The KNN algorithm in Scikit-learn uses the Euclidean distance as the default distance metric. However, this is not appropriate when the domain presents qualitative attributes or categorical features of a different domain. For those domains, the distance for qualitative attributes is usually calculated using the overlap function, which assigns the value 0 if two examples have the same value for a given attribute and the value 1 if these values differ. In this research we have used the Manhattan distance, which had a positive effect in the experiments.
Quadratic Discriminant Analysis (QDA)
Quadratic discriminant analysis descends from discriminant analysis introduced by Fisher (1954). Bayesian estimation for QDA was first proposed by Geisser (1964). Quadratic discriminant analysis (QDA) models the likelihood of each class as a Gaussian distribution, then uses the posterior distributions to estimate the class for a given test point (Friedman, 2001). The method is sensitive to the knowledge of priors. QDA was used by the authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018), who obtained Precision, Recall and F1 of 0.97, 0.88 and 0.92.
Random Forest Trees (RFT)
The Random Forest Trees (RFT) classifier was proposed by Breiman (2001) as a combination of tree predictors minimizing overall generalization error of participating trees as the number of trees in the forest becomes larger. Random forests are an alternative to Adaboost by Freund and Schapire (1997) and are more robust with respect to noise. Random Forests is an extension of bagged decision trees where only a random subset of features are considered for each split.
The algorithm was used by the authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018), and also by Kurniabudi et al. (2020). Sharafaldin et al. (2018) obtained weighted averages of Precision, Recall and F1 of 0.98, 0.97, and 0.97. In the study by Kurniabudi et al. (2020), the Random Forest algorithm achieved Accuracy, Precision and Recall of 0.998 each using the 15–22 selected features. These metrics were estimated for the benign and attack class.
Gradient Boosting Classifier (GBC)
In order to extend the scope of the research, Gradient Boosting Classifier (GBC), as proposed by Friedman (2001) and Friedman (2002), was added as a natural member of classifier ensemble methods. GBC is a stochastic gradient boosting algorithm, where decision trees are fitted on the negative gradient of the chosen loss function. The idea of gradient boosting is to fit the base-learner not to re-weighted observations, as in AdaBoost, but to the negative gradient vector of the loss function evaluated at the previous iteration. The XGBoost library (Chen and Guestrin, 2016), an implementation of GBC with GPU support, was used in this research. To our knowledge, GBC results of other authors on these datasets have not been published.
Multiple Layer Perceptron
Multiple Layer Perceptron (MLP) was proposed by Rosenblatt (1962) as an extension to a linear perceptron model (Rosenblatt, 1957). It is a supervised learning artificial neural network implementation, utilizing back-propagation for training, that can have multiple layers and a chosen, not necessarily linear, activation function.
MLP was used in the study of Sharafaldin et al. (2018), who obtained weighted averages of Precision, Recall and F1 of 0.77, 0.83, and 0.76.
Performance Measures
Standard performance metrics for classifiers are presented in Section 3.6.1, and Bias and Variance decomposition metric (see Section 3.7) was used to evaluate ML algorithm tendencies to overfit or underfit.
Confusion Matrix Based Metrics
Accuracy, Precision in equation (5), Recall in equation (3) and F1 in equation (6) are very sensitive to the representation of classes in the source datasets (Sokolova and Lapalme, 2009). Results change if the proportions of class samples change (Tharwat, 2018). In their study, Garcia et al. (2010) review most of the performance measures used for imbalanced classes and introduce a new measure called Index of Balanced Accuracy (IBA), currently implemented and used in the classification report of the Imbalanced-learn library (Lemaitre et al., 2016) for calculating the Geometric mean of recall (G-mean), equation (4), introduced by Kubat and Matwin (1997). An experimental comparison of performance measures for classification is presented by Ferri et al. (2009). Mosley (2013) reviews multi-class data performance metrics such as Recall, Relative Classifier Information (RCI) (Wei et al., 2010), Matthews Correlation Coefficient (MCC) (Matthews, 1975) and Confusion Entropy (CEN) (Jurman et al., 2012). It is important to note that Chicco and Jurman (2020) demonstrated that MCC and CEN cannot be reliably used in case of an imbalance of data classes, and these will not be discussed in this paper. Mosley (2013) introduces a per-class Balanced Accuracy (also known as Balanced accuracy score (BAS)), see equation (2), which is based on recall and neglects precision. However, Precision is very sensitive to attributions of records from other classes, which was clearly observed during this research. In the case of imbalance, it mainly indicates a false classification of major classes; therefore, it has been chosen to be studied in this research.
Further on in this research, the Balanced accuracy score and G-mean along with Precision were chosen as classification quality quantification metrics for comparison because: (i) these metrics were previously used by other researchers to measure the performance of learning in imbalanced multi-class problems, while the datasets used in this study have extremely imbalanced class distributions, (ii) these measures are available in popular open source software libraries like Scikit-learn and Imbalanced-learn, (iii) the metrics have a simple and clear intuition for use in practical cyber-security applications, (iv) precision also allows for comparison with other research. Macro score averages were calculated in further experiments to give equal weight to each class, avoiding scaling with respect to the number of instances per class.
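Balanced accuracy score (BAS), formula (2), is defined as the average of the recall values over the k classes:

$$\mathrm{BAS} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{Recall}_i, \qquad (2)$$

where:

$$\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}, \qquad (3)$$

where TP stands for True Positive and FN stands for False Negative, i is the index of the class in question and k is the number of classes in the dataset. $TP_i = c_{ii}$ is the number of true positive (correctly classified) instances of class i, and $FN_i = \sum_{j \neq i} c_{ij}$ is the number of all false negative instances of class i. $c_{ij}$ is the element of the confusion matrix in row i and column j.
Geometric mean of sensitivity is defined as follows:

$$G\text{-mean} = \left(\prod_{i=1}^{k} \mathrm{Recall}_i\right)^{1/k}, \qquad (4)$$

where k is the number of classes in the dataset.
Precision for class i is defined as follows:

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad (5)$$

where $FP_i$ is the number of false positive instances of class i.
Whereas F1 for class i is defined as follows:

$$F1_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}. \qquad (6)$$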
In this research, we have used macro-averaged (i.e. unweighted mean over classes) Precision and the other metrics, unless specified otherwise.
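The metrics referred to in formulas (2)–(6) can be computed from a confusion matrix as in the sketch below. It assumes that rows of the matrix correspond to true classes and columns to predicted classes (the Scikit-learn convention) and that every class occurs and is predicted at least once; the helper name and the 3-class matrix are hypothetical.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class recall, precision and F1 from a confusion matrix
    (assumption: rows = true classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp  # records of class i assigned to other classes
    fp = cm.sum(axis=0) - tp  # records of other classes assigned to class i
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Hypothetical 3-class confusion matrix with one dominant class.
cm = np.array([[90, 5, 5],
               [2, 8, 0],
               [1, 0, 4]])
recall, precision, f1 = per_class_metrics(cm)

bas = recall.mean()                           # balanced accuracy score
g_mean = recall.prod() ** (1 / len(recall))   # geometric mean of recall
macro_precision = precision.mean()            # unweighted mean over classes
macro_f1 = f1.mean()
print(round(bas, 4), round(g_mean, 4), round(macro_precision, 4), round(macro_f1, 4))
```

Because the macro averages weight every class equally, a poor score on a class with only a handful of records lowers them as much as a poor score on the majority class.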
Bias and Variance Decomposition
The decomposition of the loss into bias and variance helps to improve understanding of generalization capacities of compared learning algorithms, such as overfitting and underfitting. Various methods of decomposition are reviewed in Domingos (2000). It has been demonstrated that high variance correlates to overfitting, and high bias correlates to underfitting. In practical terms, when comparing the performance of learning algorithms, models with lower bias and variance over the same test data would be preferred. It is worth noting that models with a higher degree of parameter freedom tend to demonstrate lower bias and higher variance, and models with a low degree of freedom demonstrate high bias and lower variance.
The loss function of a learning algorithm can be decomposed into three terms: a variance, a bias, and a noise term, which will be ignored further for simplicity (Raschka, 2018). Loss function depends on the machine learning algorithm. For decision trees (CART), training proceeds through a greedy search, each step based on information gain. For the random forest classifier, loss function is the Gini impurity. Cross-entropy is the default loss function to use for multi-class classification problems with MLP.
The prediction bias is calculated as the difference between the expected prediction accuracy of a model and the true prediction accuracy (equation (7)). In formal notation, the bias of an estimator $\hat{\beta}$ is the difference between its expected value and the true value of a parameter β being estimated (Raschka, 2018):

$$\mathrm{Bias} = E[\hat{\beta}] - \beta. \qquad (7)$$

The variance (equation (8)) is a measure of the variability of the model’s predictions if the learning process is repeated multiple times with random fluctuations in the training set:

$$\mathrm{Var}(\hat{\beta}) = E\left[\left(E[\hat{\beta}] - \hat{\beta}\right)^2\right]. \qquad (8)$$

Variance is obtained by repeating prediction on a model trained on stratified shuffle-split training data. The more sensitive the model-building process is towards fluctuations of the training data, the higher the variance (Raschka, 2018).
Tree Pruning
Finding the values where the training and testing learning curves converge allows for the creation of better generalizing decision trees and a decrease of overfitting and underfitting. The tree depth (implemented in the Scikit-learn library through the parameter max_depth) and α (implemented in the Scikit-learn library through the parameter ccp_alpha) were obtained using minimal cost-complexity pruning (Breiman et al., 1984), implemented in the Scikit-learn library function cost_complexity_pruning_path, and searching for a minimum of Bias and Variance. In this algorithm the cost-complexity measure of a given tree T is defined in formula (9) as follows:

$$R_\alpha(T) = R(T) + \alpha|\tilde{T}|, \qquad (9)$$

where $|\tilde{T}|$ is the number of terminal nodes in T, and $R(T)$ is defined as the total misclassification cost of the terminal nodes; α (⩾0) is the complexity parameter. As α increases, more descendant nodes are pruned.
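A minimal sketch of how the pruning path can be explored with Scikit-learn is given below. The synthetic dataset, the 50/50 split and the use of the number of leaves as the reported quantity are assumptions for illustration; the search for the minimum of bias and variance is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced 3-class problem.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.8, 0.15, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.5,
                                          random_state=0)

# Effective alphas at which subtrees of the fully grown tree are pruned.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative round-off

leaves, test_scores = [], []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    leaves.append(tree.get_n_leaves())
    test_scores.append(tree.score(X_te, y_te))

print(list(zip(ccp_alphas.round(4), leaves))[:5])
```

Plotting the training and testing scores against `ccp_alphas` is one way to locate the region where the two curves converge.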
Variance Inflation Factor
Many variables in the datasets CIC-IDS2017 and CSE-CIC-IDS2018 appear to be correlated with each other, which increases bias when using Quadratic Discriminant Analysis. A statistical measure known as VIF (Variance Inflation Factor), proposed by Lin et al. (2011), supports the elimination of collinear features; in this research, the implementation from the statsmodels library (Seabold and Perktold, 2010) was used.
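For intuition, the VIF of a feature equals 1 / (1 − R²), where R² is obtained by regressing that feature on all the remaining ones. The sketch below computes it with NumPy only; it is not the statsmodels routine used in the experiments (which is `variance_inflation_factor` in `statsmodels.stats.outliers_influence` and may differ in details such as intercept handling), and the toy features are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from an ordinary least squares
    regression (with intercept) of feature j on all other features."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical features: c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs.round(1))
```

With a threshold such as the value of 40 used later in the pre-processing, features a and c in this toy example would be flagged as collinear, while b would be kept.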
Other Methods
The number of estimators was obtained using the Scikit-learn’s GridSearch (LaValle et al., 2004) method. See Sections 4.4–4.5 and Table 15 for implementation details in this research.
Experiment Design
Our experiment contained pre-processing, described further in detail in Section 4.1 for the CIC-IDS2017 dataset, Section 4.2 for the CSE-CIC-IDS2018 dataset and Section 4.3 for the LITNET-2020 dataset. The datasets were cleaned and normalized. Quantile transformation from Scikit-learn library (Pedregosa et al., 2011) with QuantileTransformer using a default of 1 000 quantiles has been implemented for the pre-processing of numeric (continuous time related) features of all datasets in order to transform original values to a more uniform distribution.
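The quantile transformation mentioned above can be sketched as follows. The log-normal toy features stand in for skewed traffic statistics and are an assumption of this example, as is fitting the transformer on a training part only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical heavily skewed numeric features.
rng = np.random.default_rng(0)
X_train = rng.lognormal(mean=0.0, sigma=2.0, size=(5000, 2))
X_test = rng.lognormal(mean=0.0, sigma=2.0, size=(1000, 2))

# 1 000 quantiles (the library default), mapping to a uniform distribution on [0, 1].
qt = QuantileTransformer(n_quantiles=1000, output_distribution="uniform",
                         random_state=0)
X_train_u = qt.fit_transform(X_train)  # quantiles are estimated on the training data
X_test_u = qt.transform(X_test)        # unseen values are mapped with the same quantiles

print(X_train_u.min(), X_train_u.max(), X_train_u.mean().round(3))
```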
Datasets were further under-sampled with random fixed ratio under-sampling and proposed skewed fixed ratio under-sampling so that after splitting into testing and training, sets would contain more than approximately 600 000 records each, which is sufficient for learning of all algorithms. This number has been estimated by performing learning curve analysis.
Later on, the training subsets were over-sampled using SMOTE for CIC-IDS2017 and CIC-IDS2018 datasets and SMOTE-NC for LITNET-2020. Features were selected using KBest (see Section 3.3) and VIF procedures (see Section 3.9). Training and hyper-parameter search was performed using cross validation with on stratified shuffle split samples of training datasets.
The final results of predictions were obtained using testing data, i.e. data not seen by the trained models. In order to obtain a reliable result, predictions were run 30 times with a change of random seed on each run.
Further on in the experiment, the best features were selected using the SelectKBest procedure from Scikit-learn library (Pedregosa et al., 2011) and followed by Variance inflation factor analysis (Lin et al., 2011) with a target threshold value, to eliminate variables with high collinearity.
Parameters for classification models were searched using GridSearch from the Scikit-learn library.
CIC-IDS2017 Pre-Processing Steps
The following procedures were implemented to condition the dataset for better learning of under-represented attack classes: a) removal of unused features and related record duplicates, b) random under-sampling of benign class records, such as to represent no more than a number of records, providing sufficient learning for the worst performing model, obtained after analysis of learning curves and c) over-sampling using SMOTE for the training sub-sample of extremely rare records (see Table 4) up until the minimum number of examples of classes with high imbalance.
Duplicate rows were removed (leaving the first one), see Table 8.
Table 8. Removal of duplicates in IDS2017 dataset.

| Class | Share of removed records (%) | Resulting counts¹ | Resulting share (%) |
|---|---|---|---|
| Benign | 7.770 | 2 096 484 | 83.1159 |
| DoS Hulk | 25.197 | 172 849 | 6.8527 |
| PortScan | 42.856 | 90 819 | 3.6006 |
| DDoS | 0.009 | 128 016 | 5.0752 |
| DoS GoldenEye | 0.068 | 10 286 | 0.4078 |
| FTP-Patator | 25.258 | 5 933 | 0.2352 |
| SSH-Patator | 45.413 | 3 219 | 0.1276 |
| DoS slowloris | 7.091 | 5 385 | 0.2135 |
| DoS Slowhttptest | 4.928 | 5 228 | 0.2073 |
| Bot | 0.661 | 1 953 | 0.0774 |
| Web Attack – Brute Force | 2.455 | 1 470 | 0.0583 |
| Web Attack-XSS | 0.000 | 652 | 0.0258 |
| Infiltration | 0.000 | 36 | 0.0014 |
| Web Attack-Sql Injection | 0.000 | 21 | 0.0008 |
| Heartbleed | 0.000 | 11 | 0.0004 |
| Total | | 2 522 362 | |

¹Record counts after removing duplicate records.
The following 8 features, which contain no information in any of the loaded files, were removed: ‘Bwd PSH Flags’, ‘Bwd URG Flags’, ‘Fwd Avg Bytes/Bulk’, ‘Fwd Avg Packets/Bulk’, ‘Fwd Avg Bulk Rate’, ‘Bwd Avg Bytes/Bulk’, ‘Bwd Avg Packets/Bulk’, ‘Bwd Avg Bulk Rate’. The feature ‘Fwd Header Length.1’, a duplicate of ‘Fwd Header Length’, was removed as well.
After dropping the duplicates, the 2 522 362 remaining records were investigated for missing values and infinities.
As a result, 1 358 records containing missing values were removed together with the duplicates. The remaining 353 rows with missing values were found to be split between the ‘Benign’ (350) and ‘DoS Hulk’ (3) classes, and their missing values were replaced with −1.
Further, 1 211 records with infinities in two features, ‘Flow Bytes/s’ and ‘Flow Packets/s’, were found and replaced by the maximum values per class, see Table 9.
Table 9. Replacing infinities in IDS2017 dataset.

| Class | Record count | Flow Bytes/s | Flow Packets/s |
|---|---|---|---|
| Benign | 1 077 | 2.071e+09 | 4.0e+06 |
| PortScan | 125 | 8.00e+06 | 2.0e+06 |
| Bot | 5 | 1.20e+07 | 2.0e+06 |
| FTP-Patator | 2 | 1.40e+07 | 3.0e+06 |
| DDoS | 2 | 3.47e+08 | 2.0e+06 |
| Total: | 1 211 | | |
This processing step is made under an assumption that such a replacement for lost values would be possible to implement after learning the values during the initial training of a real life intrusion detection system.
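The replacement of infinities by per-class maxima can be sketched with pandas as below. The column name `Label` and the five-row toy frame are assumptions for illustration; the two rate columns carry the feature names used in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature flow table with infinite rates.
df = pd.DataFrame({
    "Label": ["Benign", "Benign", "Benign", "Bot", "Bot"],
    "Flow Bytes/s": [10.0, np.inf, 30.0, 5.0, np.inf],
    "Flow Packets/s": [1.0, 2.0, np.inf, np.inf, 4.0],
})
rate_cols = ["Flow Bytes/s", "Flow Packets/s"]

values = df[rate_cols]
is_inf = np.isinf(values)
# Largest finite value of each column within each class, aligned to the rows.
class_max = values.mask(is_inf).groupby(df["Label"]).transform("max")
# Only infinite entries are overwritten; other values stay untouched.
df[rate_cols] = values.mask(is_inf, class_max)
print(df)
```

In a deployed system the per-class maxima would have to be learned once from training data and stored, which corresponds to the assumption stated in the paragraph above.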
Next, the numbers of records of the Benign class and of the second largest class, DoS Hulk, were reduced using skewed fixed ratio under-sampling. The remaining data was split into test and train sub-samples. The training sub-set was then over-sampled with SMOTE (hence the training record counts of 4 999 and 2 999 in Table 10). This procedure keeps all extremely imbalanced class records (Table 4) intact and adds new records for training, resulting in the record counts for the training and testing samples presented in Table 10.
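The two resampling directions can be illustrated with the simplified NumPy sketch below. It is not the Imbalanced-learn implementation used in the experiments: plain random under-sampling stands in for the skewed fixed ratio variant, the over-sampler only reproduces the basic SMOTE interpolation between a minority record and one of its nearest minority neighbours, and k = 5, the class sizes and the targets are assumptions of this example.

```python
import numpy as np

def random_undersample(X, cap, seed=0):
    """Keep at most `cap` randomly chosen records (stand-in for the
    under-sampling of the largest classes)."""
    rng = np.random.default_rng(seed)
    if len(X) <= cap:
        return X
    return X[rng.choice(len(X), size=cap, replace=False)]

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic records on the segments between minority
    records and their k nearest minority neighbours (needs >= 2 records)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a record is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)          # records to interpolate from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_major = rng.normal(size=(1000, 4))   # hypothetical majority class
X_rare = rng.normal(size=(6, 4))       # hypothetical extremely rare class

X_major_small = random_undersample(X_major, cap=300)
X_new = smote_like_oversample(X_rare, n_new=50 - len(X_rare))
X_rare_train = np.vstack([X_rare, X_new])  # original records are kept intact
print(X_major_small.shape, X_rare_train.shape)
```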
After this, the values of numeric columns were scaled to a range of [0, 1] with the Scikit-learn (Pedregosa et al., 2011) QuantileTransformer. This transformation maps each feature individually onto its quantiles, estimated on the training set, so that the transformed values lie in the given range, by default between zero and one.
Further in this research, the 40 best features were selected using the SelectKBest procedure from the Scikit-learn library (Pedregosa et al., 2011) and followed by Variance inflation factor analysis with a target threshold value equal to 40, to eliminate variables with high collinearity.
Table 10. Resulting IDS2017 dataset training and/or validation sample representation.

| Record label | Training records | Resulting share (%) | Testing records | Resulting share (%) |
|---|---|---|---|---|
| Benign | 442 421 | 64.739 | 442 421 | 67.508 |
| DoS Hulk | 86 425 | 12.646 | 86 424 | 13.187 |
| DDoS | 64 008 | 9.366 | 64 008 | 9.767 |
| PortScan | 45 410 | 6.645 | 45 409 | 6.929 |
| DoS GoldenEye | 5 143 | 0.753 | 5 143 | 0.785 |
| FTP-Patator | 4 999 | 0.731 | 2 967 | 0.453 |
| DoS slowloris | 4 999 | 0.731 | 2 692 | 0.411 |
| DoS Slowhttptest | 4 999 | 0.731 | 2 614 | 0.399 |
| SSH-Patator | 4 999 | 0.731 | 1 610 | 0.246 |
| Bot | 4 999 | 0.731 | 976 | 0.149 |
| Web Attack-Brute Force | 2 999 | 0.439 | 735 | 0.112 |
| Web Attack-XSS | 2 999 | 0.439 | 326 | 0.050 |
| Infiltration | 2 999 | 0.439 | 18 | 0.003 |
| Web Attack-Sql Injection | 2 999 | 0.439 | 11 | 0.002 |
| Heartbleed | 2 999 | 0.439 | 6 | 0.001 |
| Total: | 683 397 | | 655 360 | |
CIC-IDS2018 Pre-Processing Steps
The same pre-processing procedure from Section 4.1 was applied to dataset CIC-IDS2018.
The timestamp column and related record duplicates were removed, as no time series dependent machine learning methods were chosen in this research.
Afterwards, 8 features ‘Bwd URG Flags’, ‘Bwd Pkts/b Avg’, ‘Bwd PSH Flags’, ‘Bwd Blk Rate Avg’, ‘Fwd Byts/b Avg’, ‘Fwd Pkts/b Avg’, ‘Fwd Blk Rate Avg’, ‘Bwd Byts/b Avg’ containing no information (eq. ) were removed.
The following sampling procedures were executed in order to achieve a better balance between major classes and extremely rare classes:
the top two classes (‘Benign’ and ‘DDoS attacks-LOIC-HTTP’) were under-sampled so as to represent no more than a number of records, providing sufficient learning for the worst performing model, obtained after analysis of learning curves.
The remaining data was split into test and train sub-samples.
Training sub-set was then over-sampled with SMOTE (thus, value of 2 999). This procedure keeps all extremely imbalanced class records (Table 4) intact and adds new records for the training, resulting in record counts for the training and testing samples presented in Table 11.
Table 11. Resulting IDS2018 dataset training and validation sample representation.

| Record label | Training records | Resulting share (%) | Testing records | Resulting share (%) |
|---|---|---|---|---|
| Benign | 134 850 | 20.067 | 134 849 | 20.576 |
| DDoS attacks-LOIC-HTTP | 129 558 | 19.280 | 129 558 | 19.769 |
| DDOS attack-HOIC | 99 430 | 14.796 | 99 431 | 15.172 |
| Infilteration | 72 612 | 10.805 | 72 613 | 11.080 |
| DoS attacks-Hulk | 72 599 | 10.804 | 72 600 | 11.078 |
| Bot | 72 268 | 10.754 | 72 267 | 11.027 |
| SSH-Bruteforce | 47 024 | 6.998 | 47 024 | 7.175 |
| DoS attacks-GoldenEye | 20 703 | 3.081 | 20 703 | 3.159 |
| DoS attacks-Slowloris | 4 954 | 0.737 | 4 954 | 0.756 |
| DDOS attack-LOIC-UDP | 2 999 | 0.446 | 865 | 0.132 |
| Brute Force-Web | 2 999 | 0.446 | 285 | 0.043 |
| Brute Force-XSS | 2 999 | 0.446 | 114 | 0.017 |
| SQL Injection | 2 999 | 0.446 | 43 | 0.007 |
| FTP-BruteForce | 2 999 | 0.446 | 27 | 0.004 |
| DoS attacks-SlowHTTPTest | 2 999 | 0.446 | 27 | 0.004 |
| Total: | 671 992 | | 655 360 | |
It should be noted that 7 373 records with infinities in two features ‘Flow Bytes/s’ and ‘Flow Packets/s’ were found and replaced by maximums of values per class, see Table 12.
Table 12. Replacing infinities in IDS2018 dataset.

| Class | Record count | Flow Bytes/s | Flow Packets/s |
|---|---|---|---|
| Benign | 6 243 | 1.47e+09 | 4.0e+06 |
| Infilteration | 1 129 | 2.74e+08 | 3.0e+06 |
| FTP-BruteForce | 1 | 0.0e+00 | 2.0e+06 |
| Total: | 7 373 | | |
Presence of such values could indicate that related flows were not terminated on recording.
After the data cleaning, the dataset was normalized with QuantileTransformer. The 40 best features from SelectKBest were passed through the Variance Inflation Factor procedure with a threshold of 40, selected to eliminate collinearity of features.
LITNET-2020 Dataset Pre-Processing
Due to the choice of supervised machine learning models and problem definition in this study, the LITNET-2020 dataset timestamp feature was not used. Features related to the source and destination address, such as source and destination issuing authorities, are highly supportive in discovering not only the attacker but also the attack class, therefore, in order to support generalization of training, they were eliminated.
After removing timestamp and address related features, related duplicate records were also removed, see Table 13.
Table 13. Removal of timestamp related duplicates in LITNET-2020 dataset.

| Traffic type | Share of removed records (%) | Resulting counts of records¹ | Resulting share (%) |
|---|---|---|---|
| Benign | 33.1 | 24 349 750 | 95.052 |
| SYN Flood | 98.2 | 28 873 | 0.113 |
| Code Red | 13.5 | 1 085 656 | 4.238 |
| Smurf | 87.7 | 14 642 | 0.057 |
| UDP Flood | 1.3 | 92 412 | 0.361 |
| LAND DoS | 75.3 | 12 926 | 0.050 |
| W32.Blaster | 99.2 | 200 | 0.001 |
| ICMP Flood | 92.6 | 1 723 | 0.007 |
| HTTP Flood | 1.7 | 22 578 | 0.088 |
| Scan | 0.0 | 6 232 | 0.024 |
| Reaper Worm | 0.3 | 1 173 | 0.005 |
| Spam | 0.1 | 746 | 0.003 |
| Fragmentation | 15.9 | 401 | 0.002 |

¹Record counts after removing timestamp and related record duplicates.
The resulting dataset is even more imbalanced. The target number of records of the Benign and the Code Red type was set after analysis of learning curves, which indicate the number of records required by the worst performing model for sufficient learning. Sufficient learning is defined here as the objective of getting the learning and testing curves to converge within a margin of less than , which for all models under experiment occurs after approximately 0.5 million records. The dataset was further split by half into testing and validation.
As a final step, the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC), introduced by Chawla et al. (2002) for datasets with categorical features, was applied, see Table 14.
Table 14. LITNET-2020 dataset sample representation.

| Record label | Training records | Resulting share (%) | Testing records | Resulting share (%) |
|---|---|---|---|---|
| Benign | 349 470 | 51.277 | 349 470 | 53.325 |
| Code Red | 215 484 | 31.618 | 215 485 | 32.880 |
| UDP Flood | 45 858 | 6.729 | 45 859 | 6.997 |
| SYN Flood | 14 436 | 2.118 | 14 437 | 2.203 |
| HTTP Flood | 11 289 | 1.656 | 11 289 | 1.723 |
| Smurf | 9 999 | 1.467 | 7 321 | 1.117 |
| Scan | 9 999 | 1.467 | 6 463 | 0.986 |
| LAND DoS | 9 999 | 1.467 | 3 116 | 0.475 |
| Spam | 2 999 | 0.440 | 710 | 0.108 |
| Reaper Worm | 2 999 | 0.440 | 587 | 0.090 |
| ICMP Flood | 2 999 | 0.440 | 373 | 0.057 |
| Fragmentation | 2 999 | 0.440 | 153 | 0.023 |
| W32.Blaster | 2 999 | 0.440 | 100 | 0.015 |
| Total: | 681 529 | | 655 363 | |
After the data cleaning, the dataset was normalized with QuantileTransformer. The 40 best features from SelectKBest were obtained and further checked for feature collinearity. Collinear features were reduced using the Variance Inflation Factor procedure (see Section 3.9) with a threshold value of 40.
Experiment Software Environment
All code for models was implemented in the Python 3.7 environment on Anaconda 3 using the Scikit-learn and Imbalanced-learn libraries, except for the Gradient Boosting Classifier, which was implemented using the XGBoost library (Chen and Guestrin, 2016), utilizing GPU.
Model parameters were searched with the GridSearch method. Tree depth and alpha were further validated using minimal cost-complexity pruning (Breiman et al., 1984), implemented in Scikit-learn by the cost_complexity_pruning_path function (see Section 3.8).
Parameter Values Selection
The following parameter ranges were selected for the grid search:
ADA: n_estimators: range(10, 256, 5), learning_rate: [0.001, 0.005, 0.01, 0.5, 1], and base estimator – CART.
QDA: reg_param: np.geomspace(1e-19, 1e-1, 50, endpoint=True). The value of the tol parameter only impacts the threshold at which warnings of variable collinearity are suppressed.
RFC: n_estimators: range(100, 350, 5), other parameters in the same ranges as CART.
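To give an idea of the size of the search space, the ranges listed above can be written as Scikit-learn parameter grids, as in the following sketch. Only the parameters named in the list are included; the variable names are hypothetical, and the tree parameters shared with CART are omitted because their ranges are not given here.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Grids restricted to the parameters listed in the text.
ada_grid = {
    "n_estimators": list(range(10, 256, 5)),
    "learning_rate": [0.001, 0.005, 0.01, 0.5, 1],
}
qda_grid = {
    "reg_param": np.geomspace(1e-19, 1e-1, 50, endpoint=True).tolist(),
}
rfc_grid = {
    "n_estimators": list(range(100, 350, 5)),
}

# Number of candidate settings evaluated per cross-validation fold.
n_ada = len(ParameterGrid(ada_grid))
n_qda = len(ParameterGrid(qda_grid))
n_rfc = len(ParameterGrid(rfc_grid))
print(n_ada, n_qda, n_rfc)
```

Dictionaries of this form can be passed as the param_grid argument of Scikit-learn's GridSearchCV together with the corresponding estimator.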
The parameters used in this study are presented in the Table 15.
Table 15. Model parameters used.

| Model | Parameters (datasets: CIC-IDS2017, CIC-IDS2018, LITNET-2020) |
|---|---|
| ADA | base_estimator = DecisionTreeClassifier, learning_rate = 1¹, n_estimators = 120, tree parameters as indicated for CART, next row |

¹Default Scikit-Learn values; ²Priors calculated equal to class shares.
Results and Discussion
Results of the Conducted Experiments
Tables 16, 17 and 18 present the results of ML method rankings using a Standard Ranking approach (Adomavicius and Kwon, 2011), where equal items get the same ranking number and a gap is then left in the ranking numbers before the next item; a bigger ranking number means a worse result.
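The ranking scheme can be sketched in a few lines of Python. The helper name is hypothetical; the first input uses the Balanced Accuracy Scores reported for CIC-IDS2017 in Table 16, and the second one is an invented tie example.

```python
def standard_ranking(scores, higher_is_better=True):
    """Standard competition ranking: the rank of an item is one plus the
    number of items that are strictly better, so ties share a rank and a
    gap follows them."""
    ranks = {}
    for name, value in scores.items():
        if higher_is_better:
            better = sum(1 for other in scores.values() if other > value)
        else:
            better = sum(1 for other in scores.values() if other < value)
        ranks[name] = 1 + better
    return ranks

# BAS values for CIC-IDS2017 as reported in Table 16.
bas_2017 = {"ADA": 0.995, "CART": 0.984, "GBC": 0.986, "KNN": 0.989,
            "MLP": 0.937, "QDA": 0.951, "RFC": 0.991}
ranks_2017 = standard_ranking(bas_2017)

# Invented example with a tie: A and B share rank 1, C gets rank 3.
ranks_tie = standard_ranking({"A": 0.9, "B": 0.9, "C": 0.8})
print(ranks_2017, ranks_tie)
```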
In Table 16, the results of scoring by Balanced Accuracy are in favour of trees or their ensembles, Adaboost being the strongest, closely followed by Random Forest Classifier and K-Nearest Neighbours.
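Table 16. Comparison of Model performance on 3 datasets using Balanced Accuracy Score (BAS) and Error Rate (ErR).

| Model | CIC-IDS2017 ErR | CIC-IDS2017 BAS | CIC-IDS2017 Rank | CIC-IDS2018 ErR | CIC-IDS2018 BAS | CIC-IDS2018 Rank | LITNET-2020 ErR | LITNET-2020 BAS | LITNET-2020 Rank | Rank by BAS: Total | Rank by BAS: Best |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ADA¹ | 0.001 | 0.995 | 1 | 0.060 | 0.887 | 4 | 0.003 | 0.996 | 1 | 5 | 1 |
| CART | 0.004 | 0.984 | 5 | 0.064 | 0.897 | 3 | 0.005 | 0.985 | 4 | 12 | 4 |
| GBC | 0.003 | 0.986 | 4 | 0.063 | 0.811 | 4 | 0.011 | 0.756 | 6 | 14 | 5 |
| KNN | 0.006 | 0.989 | 3 | 0.060 | 0.917 | 1 | 0.044 | 0.864 | 5 | 9 | 3 |
| MLP | 0.020 | 0.937 | 7 | 0.072 | 0.860 | 6 | 0.070 | 0.698 | 7 | 20 | 7 |
| QDA | 0.068 | 0.951 | 6 | 0.090 | 0.843 | 7 | 0.022 | 0.992 | 2 | 15 | 6 |
| RFC | 0.002 | 0.991 | 2 | 0.059 | 0.898 | 2 | 0.005 | 0.987 | 3 | 7 | 2 |

¹Adaboost ensemble is made of CART estimators with the grid-searched hyper-parameters described in Table 15.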
The results of this research support the notion that the Balanced Accuracy metric (see Table 16) should be used for measuring accuracy in the case of highly and extremely imbalanced datasets. The Error Rate for all models is below 0.1, while Balanced Accuracy reveals some insufficient learning. The accuracy of the extremely rare (malicious) classes in this research is dominated by the majority (benign) class, representing over 80% of the whole data (see Tables 2 and 3), and therefore the Error Rate is overly optimistic, under-representing the prediction error of the extremely rare classes (see Table 4), which are important to this research.
The ranking results in Table 17 were obtained based on the minimum of the sum of rankings for Precision and G-mean. The results of scoring by Precision and G-mean are in favour of the same tree ensembles.
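Table 17. Model rankings by Precision (Pr) and G-mean.

| Model | CIC-IDS2017 Pr | CIC-IDS2017 G-mean | CIC-IDS2017 Rank | CIC-IDS2018 Pr | CIC-IDS2018 G-mean | CIC-IDS2018 Rank | LITNET-2020 Pr | LITNET-2020 G-mean | LITNET-2020 Rank | Rank: Total | Rank: Best |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ADA | 0.928 | 0.919 | 1 | 0.991 | 0.990 | 1 | 0.970 | 0.994 | 1 | 3 | 1 |
| CART | 0.868 | 0.886 | 5 | 0.971 | 0.977 | 6 | 0.828 | 0.989 | 4 | 15 | 5 |
| GBC | 0.892 | 0.884 | 4 | 0.988 | 0.987 | 2 | 0.963 | 0.987 | 3 | 9 | 3 |
| KNN | 0.906 | 0.912 | 2 | 0.988 | 0.987 | 2 | 0.674 | 0.519 | 7 | 11 | 4 |
| MLP | 0.879 | 0.834 | 6 | 0.979 | 0.977 | 5 | 0.685 | 0.876 | 6 | 17 | 6 |
| QDA | 0.713 | 0.839 | 7 | 0.936 | 0.881 | 7 | 0.915 | 0.978 | 5 | 19 | 7 |
| RFC | 0.913 | 0.907 | 2 | 0.985 | 0.984 | 4 | 0.937 | 0.998 | 2 | 8 | 2 |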
The rankings of bias and variance decomposition in Table 18 are obtained on the basis of the minimum of the sum of bias and variance (equal to the model mean squared error when the noise component is not accounted for). The bias and variance are calculated according to formulas (7) and (8). To calculate bias, we have to estimate β and the expected prediction $E[\hat{\beta}]$. β is equal to the true class label vector of the test dataset. To estimate $E[\hat{\beta}]$, a bootstrap sample with replacement of the training dataset is taken 5 times; each time the model is trained and its prediction for each bootstrap training dataset is stored as a separate vector. Then the squared bias is estimated as the squared length of the difference between the average prediction vector ($E[\hat{\beta}]$) and the test dataset true label vector (β), divided by the number of test records. The variance (Var) is then calculated by formula (8), i.e. it estimates the variance in the predictions $\hat{\beta}$ calculated for each bootstrap sample with replacement from the training dataset.
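The procedure described above can be sketched as follows. This is an interpretation under stated assumptions rather than the exact code of the experiments: predictions are taken on the test set, integer class labels are treated as numeric values in the squared differences, and the synthetic data, the tree depth and the function name are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bias_variance_bootstrap(model_factory, X_tr, y_tr, X_te, y_te,
                            n_rounds=5, seed=0):
    """Estimate squared bias and variance from models trained on
    bootstrap samples (with replacement) of the training data."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_rounds, len(y_te)))
    for r in range(n_rounds):
        idx = rng.integers(0, len(y_tr), size=len(y_tr))  # bootstrap indices
        preds[r] = model_factory().fit(X_tr[idx], y_tr[idx]).predict(X_te)
    mean_pred = preds.mean(axis=0)                         # average prediction vector
    bias_sq = np.sum((mean_pred - y_te) ** 2) / len(y_te)
    variance = np.mean((preds - mean_pred) ** 2)
    mse = np.mean((preds - y_te) ** 2)                     # equals bias_sq + variance
    return bias_sq, variance, mse

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.5,
                                          random_state=0)
bias_sq, variance, mse = bias_variance_bootstrap(
    lambda: DecisionTreeClassifier(max_depth=3, random_state=0),
    X_tr, y_tr, X_te, y_te)
print(round(bias_sq, 3), round(variance, 3), round(mse, 3))
```

With the noise term ignored, the mean squared error over all bootstrap rounds equals the sum of the squared bias and the variance, which is the quantity used for ranking.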
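Table 18. Model rankings using model bias and variance (Var) decomposition.

| Model | CIC-IDS2017 Bias² | CIC-IDS2017 Var | CIC-IDS2017 Rank | CIC-IDS2018 Bias² | CIC-IDS2018 Var | CIC-IDS2018 Rank | LITNET-2020 Bias² | LITNET-2020 Var | LITNET-2020 Rank | Rank¹: Total | Rank¹: Best |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ADA | 0.09 | 0.024 | 1 | 1.36 | 0.324 | 1 | 0.22 | 0.006 | 1 | 3 | 1 |
| CART | 0.15 | 0.109 | 6 | 1.80 | 0.966 | 5 | 0.26 | 0.049 | 4 | 15 | 4 |
| GBC | 0.08 | 0.025 | 1 | 1.96 | 0.201 | 2 | 0.22 | 0.041 | 3 | 6 | 2 |
| KNN | 0.14 | 0.050 | 4 | 2.26 | 0.984 | 6 | 1.08 | 0.335 | 7 | 17 | 5 |
| MLP | 0.16 | 0.051 | 5 | 2.77 | 0.477 | 7 | 0.54 | 0.231 | 5 | 17 | 5 |
| QDA | 0.56 | 0.018 | 7 | 19.23 | 0.985 | 8 | 1.12 | 0.003 | 6 | 21 | 8 |
| RFC | 0.11 | 0.034 | 3 | 1.90 | 0.279 | 3 | 0.25 | 0.006 | 2 | 8 | 3 |

¹Ranking is performed on the sum of model loss variance and bias squared; ²Bias squared value.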
The QDA values in Table 18, which are much higher than the errors of the other algorithms on the same data, are characteristic of models with a low number of hyper-parameters, as noted in Brownlee (2020). The values obtained in this experiment could be local optima, but the authors were not able to find other parameter values that would reduce the difference between datasets for this model. However, the bias and variance of this model were noticed to be sensitive to changes in the list of features selected before the parameter search. The list of features chosen for model training is individual for each dataset.
Discussion and Comparison of the Results
A comparison of the results reported by different implementations for the CIC-IDS2017 and CSE-CIC-IDS2018 datasets is presented in Table 19. The performance metrics are not directly comparable to those of our research (referred to in Table 19 as "this research"), as the validation results in our experiment were obtained using multi-class optimization and 50% of the dataset as hold-out data, versus standard k-fold cross-validation, which is known to be prone to knowledge leakage. In our methodology, cost-sensitive model implementations provided classification for multiple class measures. However, for comparison with the other reviewed studies, traditional measures suitable only for balanced datasets are also presented (see Table 19). It is important to note that optimization in this experiment was done on the Balanced Accuracy Score; therefore, the other measures are sub-optimal.
1See explanatory notes related to cited work in Section 5.2.
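To illustrate why optimizing on the Balanced Accuracy Score differs from optimizing on plain accuracy, the following minimal sketch computes both on a toy imbalanced sample. Formulas (2) and (4) are not reproduced in this section, so the sketch assumes the common definitions: balanced accuracy as the arithmetic mean of per-class recall, and G-mean as the geometric mean of per-class recall. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import prod


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted records / true records of that class."""
    total = Counter(y_true)
    hit = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hit[c] / total[c] for c in total}


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall."""
    recall = per_class_recall(y_true, y_pred)
    return sum(recall.values()) / len(recall)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall."""
    recall = per_class_recall(y_true, y_pred)
    return prod(recall.values()) ** (1 / len(recall))


# Toy sample: 95 benign records, 5 attack records, and a model that
# labels everything as benign.
y_true = ["benign"] * 95 + ["attack"] * 5
y_pred = ["benign"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, balanced_accuracy(y_true, y_pred), g_mean(y_true, y_pred))
```

A model that ignores the rare class still reaches high plain accuracy, while both class-averaged measures expose the failure, which is the motivation for the optimization target used here.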
In Sharafaldin et al. (2018), the authors' objective was to introduce the CIC-IDS2017 dataset, and machine learning results with default model parameters are presented purely as a benchmark for future research. Feature selection was performed using the random forest regression feature selection algorithm. The Precision, Recall and F1 results in their study were obtained as weighted averages of each evaluation metric and are represented in Table 19. Iterative Dichotomiser 3 (ID3), a decision tree learner with early stopping, as implemented in Weka (Witten and Frank, 2002), is used in their research. In our research, the results were obtained using the macro average for the above-mentioned and other metrics. Macro averages of metrics are more sensitive to the imbalance of classes.
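The difference between weighted and macro averaging can be seen in a small worked example. The sketch below uses hypothetical toy labels and helper names; it computes per-class precision and then averages it in both ways.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Precision of each class: correct predictions / all predictions of that class."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    classes = set(y_true) | set(y_pred)
    return {c: (correct[c] / predicted[c] if predicted[c] else 0.0) for c in classes}


def macro_average(per_class):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(per_class.values()) / len(per_class)


def weighted_average(per_class, y_true):
    """Mean weighted by class support: frequent classes dominate."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(per_class[c] * support[c] / n for c in per_class)


# Toy sample: 90 benign and 10 attack records.
# Benign: 88 predicted benign, 2 predicted attack.
# Attack: 2 predicted attack, 8 predicted benign.
y_true = ["benign"] * 90 + ["attack"] * 10
y_pred = ["benign"] * 88 + ["attack"] * 2 + ["attack"] * 2 + ["benign"] * 8

precision = per_class_precision(y_true, y_pred)
macro = macro_average(precision)
weighted = weighted_average(precision, y_true)
print(precision, macro, weighted)
```

In this toy case the poorly predicted rare class pulls the macro average well below the weighted average, which is why macro-averaged results should not be compared directly with weighted ones.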
In Sharafaldin et al. (2019), the authors improve the RFT results by proposing super-feature creation in place of the random forest regression algorithm for feature selection used in their previous research (Sharafaldin et al., 2018). In our research, features were selected through the fast KBest procedure with the ANOVA F-value optimization function; this algorithm was chosen after testing three classes of feature selection methods.
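A k-best selection by ANOVA F-value can be sketched in a few lines. This is a pure-Python illustration of the general technique, not the code used in this research (which presumably relied on a library routine such as scikit-learn's `SelectKBest` with `f_classif`; that is an assumption). The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature grouped by class label."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class sum of squares: spread of class means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-class sum of squares: spread of values around their class mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(X, y, k):
    """Return the indices of the k features with the highest F-value, and all scores."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [(0.1, 5.0), (0.2, 3.0), (0.15, 4.0), (0.9, 4.5), (1.0, 3.5), (0.95, 4.0)]
y = [0, 0, 0, 1, 1, 1]

selected, scores = select_k_best(X, y, k=1)
print(selected, scores)
```

Because each feature is scored independently, the procedure is fast, but it cannot detect features that are informative only in combination.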
In the strategy of Yulianto et al. (2019), SMOTE is applied to CIC-IDS2017. However, only the benign and DDoS class data of the dataset are used, which reduces the task to a binary classification problem and therefore produces results that are not comparable to ours. Features in their research are also selected differently, first using Principal Component Analysis (PCA) and then Ensemble Feature Selection (EFS), with the EFS package in R Studio and the ensemble methods gbm, glm, lasso, ridge and treebag from the fscaret library. AdaBoost classification with default weak decision tree classifiers was used during training, whereas in our research a choice was made to strengthen the base classifier via pruning. The Precision, Recall and F1 results obtained are represented in Table 19.
Kanimozhi and Jacob (2019a, 2019b) classified the CSE-CIC-IDS2018 dataset using ADA, RF, kNN, SVM, NB and ANN (artificial neural network) machine learning methods. For the ANN, the authors used an MLP with two layers, the lbfgs solver, and a grid-searched alpha parameter (for L2 regularization) and hidden layer sizes. The authors used binary (0–1) classification: only “Benign” or “Malicious” labels were used for training, making the results not directly comparable with our multi-class approach. Accuracy, precision, recall, and AUC results were obtained. The Precision, Recall and F1 results are represented in Table 19.
In their study, Karatas et al. (2020) classified the CSE-CIC-IDS2018 dataset using KNN, RFT, GBC, ADA, DT (decision tree), and LDA (linear discriminant analysis with singular value decomposition solver) algorithms. The parameters selected for all the implemented algorithms are described in Karatas et al. (2020), Table 8. The number of classes was set to six (one for the non-attack type and five for attack types), making the results not directly comparable with our multi-class approach. Cross-validation with an 80%/20% split of training and test data was used. Accuracy, precision, recall and F1 results were obtained. The Precision, Recall and F1 results are represented in Table 19.
In their study, Kilincer et al. (2021) classified the CSE-CIC-IDS2018 dataset using KNN, DT, and SVM algorithms. The Matlab options KNN with the Fine KNN algorithm, DT with the Fine Tree and SVM with the Quadratic algorithm gave the best results in that research. Results on a limited number of records (up to 1584 records per class, see Kilincer et al. (2021), Table 3) were used for the CSE-CIC-IDS2018 dataset classes. The authors focus on the UNSW-NB15 dataset, with no discussion of pre-processing for CSE-CIC-IDS2018, parameter search, tree pruning or overfitting. Accuracy, precision, recall, and G-mean results were obtained. The Precision, Recall and F1 results are represented in Table 19.
In Dutta et al. (2020), the authors used SMOTE and ENN to balance the LITNET-2020 dataset. Classes are reduced to two, normal and malignant; therefore, the results are not directly comparable with ours. The approach also differs in that the authors reduce dimensionality with a deep sparse autoencoder (Zhang et al., 2018), selecting 15 features. The authors then stack an LSTM with the Adam optimizer and a DNN with four layers, back-propagation, stochastic gradient descent as the optimizer, and early stopping, implemented with Keras on a TensorFlow back-end and Scikit-learn. 5-fold validation was used in that research. Precision, recall, false positive rate, and MCC results were obtained. The Precision, Recall and F1 results are represented in Table 19.
Known Limitations
Regarding the limitations of the approach taken in this research, it is important to note that in reality new categories of malicious traffic are introduced daily. Therefore, models tuned using this method will not detect zero-day threats.
Another known limitation is that in the case of absolute rarity, or when data have not been collected and labelled sufficiently, models will predict with a high error rate. A known possible solution to this problem is anomaly detection for the unseen data.
Moreover, the CIC-IDS2017 and CSE-CIC-IDS2018 datasets lack some categorical flag data, which can be obtained, as has been demonstrated in the LITNET-2020 case.
LITNET-2020 lacks the temporal features introduced in the CIC-IDS datasets; this, however, can be resolved by running CICFlowMeter on the original PCAP files.
The temporal average approach to flags does not help some classes, such as Infiltration; however, flag features could be added to the CIC-IDS datasets in the future.
While SMOTE was helpful for some rare classes, the method did not help much where sub-classes overlap due to a lack of host data or feature latency.
Some features can be extracted and supplemented, and might be used in future research; however, extraction requires a high degree of prior network traffic logging, and the authors are aware that organizations lack the resources to collect data at such a level of detail.
Observations on Multi-Class Predictions
Details of the comparison of each class and dataset before and after SMOTE up-sampling are not presented here due to the substantial number of tables. However, it is important to note that some rare classes in these datasets are learned very well even with a small number of records, which is confirmed by testing on dedicated unseen data. Some classes are learned significantly better after adding synthetic data, which is further supported by the model performance tests and classification reports in Table 20, executed before (metrics prefixed with n, for no-SMOTE) and after (metrics prefixed with s, as in sPr) enriching the data using the SMOTE procedure.
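MLP model results for Precision and G-mean on the LITNET-2020 dataset before (prefix n) and after (prefix s) SMOTE.

| Class¹ | Pr before SMOTE (nPr) | Pr after SMOTE (sPr) | G-mean before SMOTE | G-mean after SMOTE |
|---|---|---|---|---|
| Reaper Worm | 0 | 0.778 | 0 | 0.972 |
| Spam Botnet | 0.631 | 0.912 | 0.766 | 0.988 |
| W32.Blaster | 0 | 0.285 | 0 | 0.969 |

¹ Selected example rare classes.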
As demonstrated in Table 20, random data under-sampling and SMOTE over-sampling help to ensure that extremely under-represented classes (see Table 4) are learned with non-zero precision and G-mean, or provide better results.
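The core interpolation step of SMOTE (Chawla et al., 2002) can be sketched as follows. This is a simplified illustration for continuous features only; it is not the SMOTE-NC variant or the library implementation used in this research, and the function name `smote_sketch` and the toy points are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # The k nearest minority neighbours of x (excluding x itself),
        # by squared Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(minority[j], x)),
        )[:k]
        nn = minority[rng.choice(neighbours)]
        # A random point on the line segment between x and the chosen neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic


# Toy minority class with four records and two features.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=10, k=2)
print(new_points)
```

Because every synthetic record lies on a segment between two existing minority records, the method cannot separate classes whose records already overlap in feature space, which is consistent with the limitation noted for overlapping sub-classes.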
Conclusions
In this paper, we have studied three highly imbalanced network intrusion datasets and proposed methodology steps (see Section 4) that help to achieve high classification results for rare classes, validated through model error decomposition and a 50% data hold-out strategy. This methodology was checked using a novel, differently structured dataset, LITNET-2020, and by comparing the results to those obtained on the established benchmark datasets CIC-IDS2017 and CSE-CIC-IDS2018.
A review of the compliance of the LITNET-2020 dataset with the criteria raised by Gharib et al. (2016) is first introduced in Section 2.2. A variant of random under-sampling (skewed ratio under-sampling, proposed by the authors and discussed in Section 3.1) is used to reduce the imbalance of classes in a nonlinear fashion, and SMOTE-NC up-sampling (see Section 3.2) is executed to increase the representation of under-represented classes. The multi-class classification performance on the CIC-IDS2017 and CSE-CIC-IDS2018 datasets is then compared with that on the recent LITNET-2020 dataset in Section 5. As LITNET-2020 is constructed differently from the CIC-IDS datasets, a conclusion can be made that the proposed method is resistant to dataset change. The use of performance metrics better suited for multi-class classification, namely balanced accuracy (Formula (2)) and the geometric mean of recall (Formula (4)), for the LITNET-2020 dataset is another introduced novelty (see results in Tables 16 and 17), not discussed by other authors using these datasets. Multi-criteria scoring is cross-validated by testing on data previously unseen by the models (see Section 4). An additional ML model, the Gradient Boosting Classifier, which uses an ensemble of classification and regression trees, was introduced as a benchmark in this research via the XGBoost library (Chen and Guestrin, 2016) implementation with GPU support (see Section 3.5.6). In our methodology, cost-sensitive model implementations have been used and have provided some better results (see Table 19) compared to other reviewed studies. Furthermore, the selection of models with better generalization capabilities has been achieved through the decomposition of the classification error into bias and variance (see results in Table 18).
Instead of using weak CART base classifiers (see Section 3.8), their parameters were grid-searched, and the tree depth and alpha parameters were validated using the method of maximum cost path analysis (Breiman et al., 1984). Other models were tuned using grid search, with the Balanced Accuracy Score as the optimization goal.
Machine learning algorithm rankings based on Precision, Balanced Accuracy Score, G-mean, and the bias-variance decomposition of error show that tree ensembles (AdaBoost, Random Forest Trees and Gradient Boosting Classifier) perform best on the network intrusion datasets compared here, including the recent LITNET-2020.
Footnotes
See https://www.unb.ca/cic/datasets/index.html.
More information at https://www.unb.ca/cic/datasets/ids-2017.html.
More information at https://www.unb.ca/cic/research/applications.html.
More information at https://www.unb.ca/cic/datasets/ids-2018.html.
File format as abbreviated from Packet CAPture, traffic capture file format in use by networking tools.
For a definition of features used in Nfdump 1.6 see https://github.com/phaag/nfdump/blob/master/bin/parse_csv.pl.
https://scikit-learn.org/stable/.
https://imbalanced-learn.org/stable.
References
Adomavicius, G., Kwon, Y. (2011). Improving aggregate recommendation diversity using ranking-based techniques. IEEE Transactions on Knowledge and Data Engineering, 24(5), 896–911.
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter. https://doi.org/10.1145/1007730.1007735.
Breiman, L., Friedman, J., Stone, C., Olshen, R. (1984). Classification and Regression Trees (Wadsworth Statistics/Probability), 0412048418. CRC Press, New York.
Brownlee, J. (2020). Imbalanced Classification with Python – Choose Better Metrics, Balance Skewed Classes, and Apply Cost-Sensitive Learning. Machine Learning Mastery, San Juan, pp. 463.
Buczak, A., Guven, E. (2016). A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18, 1153–1176. https://doi.org/10.1109/COMST.2015.2494502.
Chen, T., Guestrin, C. (2016). XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16. ACM, New York, NY, USA, pp. 785–794. 978-1-4503-4232-2. https://doi.org/10.1145/2939672.2939785.
Chicco, D., Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1). https://doi.org/10.1186/s12864-019-6413-7.
Claise, B. (2004). RFC 3954, Cisco Systems NetFlow Services Export Version 9. Technical report, IETF. https://doi.org/10.17487/rfc3954.
Damasevicius, R., Venckauskas, A., Grigaliunas, S., Toldinas, J., Morkevicius, N., Aleliunas, T., Smuikys, P. (2020). Litnet-2020: An annotated real-world network flow dataset for network intrusion detection. Electronics (Switzerland), 9(5). https://doi.org/10.3390/electronics9050800.
Domingos, P. (2000). A unified bias-variance decomposition and its applications. In: ICML, pp. 231–238. 2065432969.
Draper-Gil, G., Lashkari, A.H., Mamun, M.S.I., Ghorbani, A.A. (2016). Characterization of encrypted and VPN traffic using time-related features. In: Proceedings of the 2nd International Conference on Information Systems Security and Privacy, pp. 407–414. https://doi.org/10.5220/0005740704070414.
Dudani, S.A. (1976). The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, pp. 325–327. https://doi.org/10.1109/TSMC.1976.5408784.
Dutta, V., Choraś, M., Pawlicki, M., Kozik, R. (2020). A deep learning ensemble for network anomaly and cyber-attack detection. Sensors (Switzerland), 20(16), 1–20. https://doi.org/10.3390/s20164583.
Ferri, C., Hernández-Orallo, J., Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27–38. https://doi.org/10.1016/j.patrec.2008.08.010.
Fisher, R. (1954). The analysis of variance with various binomial transformations. Biometrics, 10(1), 130–139.
Freund, Y., Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139. https://doi.org/10.1006/jcss.1997.1504.
Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451.
Garcia, V., Mollineda, R.A., Sanchez, J.S. (2010). Theoretical analysis of a performance measure for imbalanced data. In: 2010 20th International Conference on Pattern Recognition. IEEE, Istanbul, pp. 617–620. 978-1-4244-7542-1. https://doi.org/10.1109/ICPR.2010.156.
Geisser, S. (1964). Posterior odds for multivariate normal classifications. Journal of the Royal Statistical Society: Series B (Methodological), 26(1), 69–76.
Gharib, A., Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A. (2016). An evaluation framework for intrusion detection dataset. In: 2016 International Conference on Information Science and Security (ICISS). IEEE, Pattaya, Thailand, pp. 1–6. 978-1-5090-5493-0. https://doi.org/10.1109/ICISSEC.2016.7885840.
Hart, P.E. (1968). The condensed nearest neighbor rule (Corresp.). IEEE Transactions on Information Theory, 14(3), 515–516. https://doi.org/10.1109/TIT.1968.1054155.
He, H., Ma, Y. (2013). Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley, Piscataway, NJ, pp. 216. 9781118074626. https://doi.org/10.1002/9781118646106.
Hettich, S., Bay, S.D. (1999). The UCI KDD Archive http://kdd.ics.uci.edu. University of California, Department of Information and Computer Science.
Jurman, G., Riccadonna, S., Furlanello, C. (2012). A comparison of MCC and CEN error measures in multi-class prediction. PLoS ONE, 7(8), 41882. https://doi.org/10.1371/journal.pone.0041882.
Kanimozhi, V., Jacob, D.T.P. (2019a). Calibration of various optimized machine learning classifiers in network intrusion detection system on the realistic cyber dataset CSE-CIC-IDS2018 using cloud computing. International Journal of Engineering Applied Sciences and Technology, 04(06), 209–213. https://doi.org/10.33564/IJEAST.2019.v04i06.036.
Kanimozhi, V., Jacob, T.P. (2019b). Artificial intelligence based network intrusion detection with hyper-parameter optimization tuning on the realistic cyber dataset CSE-CIC-IDS2018 using cloud computing. ICT Express, 5(3), 211–214. 9781538675953. https://doi.org/10.1016/j.icte.2019.03.003.
Karatas, G., Demir, O., Sahingoz, O.K. (2020). Increasing the performance of machine learning-based IDSs on an imbalanced and up-to-date dataset. IEEE Access, 8, 32150–32162. https://doi.org/10.1109/ACCESS.2020.2973219.
Kilincer, I.F., Ertam, F., Sengur, A. (2021). Machine learning methods for cyber security intrusion detection: datasets and comparative study. Computer Networks, 188(January), 107840. https://doi.org/10.1016/j.comnet.2021.107840.
Koch, R. (2011). Towards next-generation intrusion detection. In: 2011 3rd International Conference on Cyber Conflict, pp. 151–168.
Kubat, M., Matwin, S. (1997). Addressing the curse of imbalanced data sets: one-sided sampling. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186. https://doi.org/10.1007/3-540-62858-4_79.
Kurniabudi, Stiawan, D., Darmawijoyo, Bin Idris, M.Y.B., Bamhdi, A.M., Budiarto, R. (2020). CICIDS-2017 dataset feature analysis with information gain for anomaly detection. In: IEEE Access, pp. 132911–132921. https://doi.org/10.1109/ACCESS.2020.3009843.
Lashkari, A.H., Gil, G.D., Mamun, M.S.I., Ghorbani, A.A. (2017). Characterization of tor traffic using time based features. In: Proceedings of the 3rd International Conference on Information Systems Security and Privacy, pp. 253–262. 978-989-758-209-7. https://doi.org/10.5220/0006105602530262.
Laurikkala, J. (2001). Improving Identification of Difficult Small Classes by Balancing Class Distribution. Springer. 3540422943. https://doi.org/10.1007/3-540-48229-6_9.
LaValle, S.M., Branicky, M.S., Lindemann, S.R. (2004). On the relationship between classical grid search and probabilistic roadmaps. The International Journal of Robotics Research, 23(7–8), 673–692.
Lawrence Berkeley National Laboratory (2010). The Internet Traffic Archive. http://ita.ee.lbl.gov/index.html.
Lemaitre, G., Nogueira, F., Aridas, C.K. (2016). Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18, 1–5.
Lemaître, G., Nogueira, F., Aridas, C.K. (2017). Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17), 1–5.
Lin, D., Foster, D.P., Ungar, L.H. (2011). VIF regression: a fast regression algorithm for large data. Journal of the American Statistical Association, 106(493), 232–247. https://doi.org/10.1198/jasa.2011.tm10113.
Lippmann, R.P., Fried, D.J., Graf, I., Haines, J.W., Kendall, K.R., McClung, D., Weber, D., Webster, S.E., Wyschogrod, D., Cunningham, R.K., Zissman, M.A. (1999). Evaluating intrusion detection systems without attacking your friends: the 1998 DARPA intrusion detection evaluation. In: Proceedings DARPA Information Survivability Conference and Exposition, 2000. DISCEX’00, pp. 12–26. https://doi.org/10.1109/DISCEX.2000.821506.
Maciá-Fernández, G., Camacho, J., Magán-Carrión, R., García-Teodoro, P., Therón, R. (2018). UGR’16: a new dataset for the evaluation of cyclostationarity-based network IDSs. Computers and Security, 73, 411–424. https://doi.org/10.1016/j.cose.2017.11.004.
Małowidzki, M., Berezinski, P., Mazur, M. (2015). Network intrusion detection: Half a kingdom for a good dataset. In: Proceedings of NATO STO SAS-139 Workshop, Portugal.
Matthews, B.W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) – Protein Structure, 405(2), 442–451. https://doi.org/10.1016/0005-2795(75)90109-9.
Seabold, S., Perktold, J. (2010). Statsmodels: econometric and statistical modeling with python. In: 9th Python in Science Conference.
Sharafaldin, I., Habibi Lashkari, A., Ghorbani, A.A. (2019). A detailed analysis of the CICIDS2017 data set. In: Mori, P., Furnell, S., Camp, O. (Eds.), Information Systems Security and Privacy. Springer International Publishing, Cham, pp. 172–188. 978-3-030-25109-3.
Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A. (2018). Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: Proceedings of the 4th International Conference on Information Systems Security and Privacy, Vol. 1. ICISSP, Funchal, Madeira, Portugal, pp. 108–116. 978-989-758-282-0. https://doi.org/10.5220/0006639801080116.
Shiravi, A., Shiravi, H., Tavallaee, M., Ghorbani, A.A. (2012). Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security, 31(3), 357–374. https://doi.org/10.1016/J.COSE.2011.12.012.
Smith, M.R., Martinez, T., Giraud-Carrier, C. (2014). An instance level analysis of data complexity. Machine Learning, 95(2), 225–256. https://doi.org/10.1007/s10994-013-5422-z.
Sokolova, M., Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45, 427–437. https://doi.org/10.1016/j.ipm.2009.03.002.
Thakkar, A., Lohiya, R. (2020). A review of the advancement in intrusion detection datasets. Procedia Computer Science, 167(2019), 636–645. https://doi.org/10.1016/j.procs.2020.03.330.
The Cooperative Association for Internet Data Analysis (2010). CAIDA – The Cooperative Association for Internet Data Analysis. http://www.caida.org/home/.
Wei, J.M., Yuan, X.J., Hu, Q.H., Wang, S.Q. (2010). A novel measure for evaluating classifiers. Expert Systems with Applications, 37(5), 3799–3809. https://doi.org/10.1016/j.eswa.2009.11.040.
Wilson, D.L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3), 408–421. https://doi.org/10.1109/TSMC.1972.4309137.
Witten, I.H., Frank, E. (2002). Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record, 31(1), 76–77. https://doi.org/10.1145/507338.507355.
Witten, I.H., Frank, E., Hall, M.A., Pal, C.J. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann Publishers, San Francisco, pp. 558. 0-12-088407-0.
Yulianto, A., Sukarno, P., Suwastika, N.A. (2019). Improving AdaBoost-based intrusion detection system (IDS) performance on CIC IDS 2017 dataset. Journal of Physics: Conference Series, 1192(1). https://doi.org/10.1088/1742-6596/1192/1/012018.
Zhang, C., Cheng, X., Liu, J., He, J., Liu, G. (2018). Deep sparse autoencoder for feature extraction and diagnosis of locomotive adhesion status. Journal of Control Science and Engineering, 1–9. https://doi.org/10.1155/2018/8676387.