
Abstract

We present a lightweight and scalable method for classifying network and program traces to detect system intrusion attempts. By employing interelement dependency models to overcome the independence violation problem inherent in Naive Bayes learners, our method yields intrusion detectors with better accuracy than Naive Bayes with n-gram features. For efficient and lightweight counting of n-gram features without losing accuracy, we store the n-gram features in a k-truncated generalized suffix tree (k-TGST). The k-TGST storage mechanism enables us to scale up the classifiers, which cannot be easily achieved by Support Vector Machine (SVM) based methods, whose computing power and resource requirements become impractical. Experimental results on a set of practical benchmark datasets show that our method scales up to 20-grams with consistent accuracy comparable to that of SVMs.

1. Introduction

Data mining algorithms have been widely used for classifying program traces (i.e., sequences of system calls or sequences of network packets) in intrusion detection tasks. For example, in a host-based intrusion detection task, a program trace is defined as a sequence of system calls that the program invokes during its execution, whereas in a network-based intrusion detection task, a program trace can be defined as a sequence of network packets that the program transmits during its execution. As a preprocessing step for data mining algorithms on intrusion detection tasks, the n-gram (n consecutive system calls in a trace) approach [1] has been widely used for featurization of system call sequences [2–5].

However, those n-gram approaches suffer from three critical problems when applied to intrusion detection tasks.

(1) Dimensionality Issues. Since the number of distinct system calls is typically around 200, the number of distinct n-gram features increases drastically as n increases. For example, the number of distinct SunOS system calls is 183, and if a 20-gram approach is used, then the number of possible 20-gram features is $183^{20}$, which is impractical for real-world applications. Data mining algorithms with nonlinear space complexity (such as Support Vector Machines (SVMs) [6, 7]) suffer severely from this problem.

(2) Overlap of Features. When n-gram features are generated from an original trace using a fixed-length sliding window, one system call in the trace can be counted as many as n times, in the worst case, in the resulting n-gram features [8, 9].

(3) Violation of Independence Assumption. If the resulting intrusion detectors rely on the statistical independence assumption among features (e.g., Naive Bayes), the above-mentioned way of generating overlapped features systematically breaks the assumption.
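
The first two problems can be made concrete with a minimal Python sketch. The toy trace and the helper name `ngrams` are illustrative assumptions, not data or code from the experiments; the sketch only shows how a sliding window revisits the same call up to n times and how large the 20-gram feature space is for 183 distinct calls.

```python
from collections import Counter

def ngrams(trace, n):
    """Slide a window of size n over the trace, one call per step."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

# Hypothetical toy trace of system call names (not benchmark data).
trace = ["open", "read", "mmap", "read", "write", "close"]
n = 3
grams = ngrams(trace, n)

# Count how many windows cover each position of the trace.
coverage = Counter()
for start in range(len(grams)):
    for pos in range(start, start + n):
        coverage[pos] += 1

# Number of possible 20-grams over 183 distinct system calls.
feature_space = 183 ** 20

print(grams)
print(dict(coverage))
print(feature_space)
```

Calls in the middle of the trace are covered by n windows, which is the overlap that later conflicts with the Naive Bayes independence assumption.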

Against this background, we applied interelement dependency models [8, 10] of n-gram features to intrusion detection tasks and compared their performance with that of widely used data mining algorithms such as Naive Bayes with n-gram features and SVM with n-gram features. To overcome the curse of dimensionality, we adapted the k-truncated generalized suffix tree storage mechanism [11, 12] to index system call traces and to generate counts for each n-gram efficiently. Since features with more order information (i.e., longer n-gram features) usually contribute more to classification when an appropriate amount of input data is available, it is important for an intrusion detection algorithm to scale with the length of the n-grams.

Experimental results on host-based and network-based intrusion detection benchmark data sets show that the proposed method outperforms the Naive Bayes learner with n-gram features as input (which breaks the independence assumption) on intrusion detection tasks and achieves accuracies and false positive rates comparable to those of SVMs with n-gram features. With the suffix tree storage mechanism, we tested the classifiers up to 20-grams, which indicates the scalability advantage of the proposed combination of the interdependency model of n-gram features and the suffix tree storage mechanism over the other methods. We were able to run the experiments with n-grams much longer than 20-grams, but because of the limited amount of data and overfitting, the results were not very significant.

The rest of the paper is organized as follows: Section 2 describes our method, Section 3 presents the experimental results, and Section 4 summarizes and concludes this paper.

2. Method

We first introduce Naive Bayes with n-gram features (NB n-gram), interdependency models of n-grams (IM(n)), and SVM with n-gram features (SVM n-gram). We then explain the suffix tree mechanism used to store n-gram features.

Before we describe each method, we formally define the intrusion detection problem as follows. Let $Σ = \{s_{1}, s_{2}, s_{3}, \dots, s_{m}\}$ be a set of system calls, where $m = |Σ|$ is the number of system calls. A data set D is defined as a set of labeled sequences (i.e., program traces) $\{\langle Z_{i}, c_{i} \rangle \mid Z_{i} \in Σ^{*}, c_{i} \in \{0,1\}\}$, where $Z_{i} = z_{1}, z_{2}, z_{3}, \dots, z_{l}$ is an input sequence and $c_{i}$ is the corresponding class label, with 0 denoting “normal” and 1 denoting “intrusion”. Given the data set D, the goal of a learning algorithm is to find an intrusion detector $h : Σ^{*} \to \{0,1\}$ that optimizes given criteria such as accuracy, F-1 measure, detection rate, and false positive rate.

If a probabilistic model is applied for the intrusion detector h (e.g., Naive Bayes), then the probabilistic model $P_{h}$ specifies for a sequence Z the probability $P_{h} (Z = z_{1}, z_{2}, z_{3}, \dots, z_{l})$ as follows.

(1) For each class $c_{i}$, estimate the probabilities $P_{h} (c_{i})$ using all the sequences Z coupled with $c_{i}$.

(2) For a new sequence Z, assign the class c such that

$$c_{h} = \underset{c \in \{0,1\}}{\operatorname{argmax}} \; P_{h}(Z = z_{1}, z_{2}, z_{3}, \dots, z_{l} \mid c) \cdot P_{h}(c). \tag{1}$$

2.1. Naive Bayes Classifier

One of the important assumptions in the Naive Bayes classifier as a host-based intrusion detector is that each system call of the sequence is independent of the other system calls given the class label. Therefore, as for Naive Bayes, the classification (shown in (1)) of a new sequence will be formulated as follows:

$$c_{NB} = \underset{c \in \{0,1\}}{\operatorname{argmax}} \; P_{h}(c) \cdot \prod_{i} P_{h}(z_{i} \mid c). \tag{2}$$

When the Naive Bayes classifier is applied to text or protein sequence classification, it treats each document or protein sequence as a bag or set of words, or of letters denoting amino acids [9, 13]. A few studies [15–17] explore intrusion detection tasks with a bag or set of system calls, but most intrusion detection research focuses on the n-gram approach [4, 14, 18, 19].
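
Equation (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the add-one (Laplace) smoothing, the function names `train_nb` and `classify_nb`, and the toy data are all hypothetical, and the sketch assumes both classes occur in the training data and every call in a test sequence belongs to the alphabet.

```python
import math
from collections import Counter

def train_nb(data, alphabet):
    """data: list of (sequence, label) pairs with label in {0, 1}."""
    priors, cond = {}, {}
    for c in (0, 1):
        seqs = [z for z, label in data if label == c]
        priors[c] = len(seqs) / len(data)
        counts = Counter(call for z in seqs for call in z)
        total = sum(counts.values())
        # Add-one smoothing is an assumption of this sketch.
        cond[c] = {s: (counts[s] + 1) / (total + len(alphabet)) for s in alphabet}
    return priors, cond

def classify_nb(z, priors, cond):
    """Equation (2) in log space: argmax_c log P(c) + sum_i log P(z_i | c)."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(cond[c][call]) for call in z)
    return max((0, 1), key=score)

# Hypothetical toy data: two normal traces and one intrusive trace.
alphabet = ["open", "read", "write", "exec"]
data = [
    (["open", "read", "read", "write"], 0),
    (["open", "read", "write"], 0),
    (["exec", "exec", "open", "exec"], 1),
]
priors, cond = train_nb(data, alphabet)
prediction = classify_nb(["exec", "open", "exec"], priors, cond)
print(prediction)
```

Working in log space avoids numerical underflow when the product in (2) runs over long traces.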

2.2. Naive Bayes with n-Gram Features (NB n-Gram)

Since it is difficult to deal with variable length sequences directly, each sequence $Z \in Σ^{*}$ is mapped into a finite n-dimensional feature vector (i.e., n-gram features). In host-based intrusion detection tasks, if we want to monitor a program's behavior, we consider the program's trace as a sequence. Thus, to generate n-gram features from an input trace, a sliding window of size n is applied to the trace, moving from the beginning of the trace to the end of the trace by one system call at each step, to generate a bag of n-gram features from the trace. Then, the probabilistic model for this n-gram representation is straightforward from (2):

$$c_{\text{NB } n\text{-gram}} = \underset{c \in \{0,1\}}{\operatorname{argmax}} \; P_{h}(c) \cdot \prod_{i = 1}^{l - n + 1} P_{h}(z_{i}, \dots, z_{i + n - 1} \mid c), \tag{3}$$

where l is the length of the sequence.

There is one serious problem in the NB n-gram approach. When n-gram features are generated from an original trace using a sliding window, one system call in the trace can be considered and included as many as n times in the resulting n-gram features. This systematically violates the independence assumption of the Naive Bayes learning algorithm.

2.3. Interdependency Models of n-Grams (IM(n))

To overcome the previously mentioned problem, we applied interdependency models of n-grams [8, 10] for scalable n-gram-based intrusion detection. The method explicitly models the dependencies among the elements inside an n-gram feature generated from a sequence.

Figure 1 shows the model that describes dependencies among six consecutive elements in a sequence.

Figure 1

Graphical models that incorporate the dependencies among the six consecutive elements in a sequence.

Following the Junction Tree Theorem [20], the probabilistic model for IM(n) is as follows:

$$c_{IM(n)} = \underset{c \in \{0,1\}}{\operatorname{argmax}} \; \frac{\prod_{i = 1}^{l - n + 1} P_{h}(z_{i}, \dots, z_{i + n - 1} \mid c)}{\prod_{i = 2}^{l - n + 1} P_{h}(z_{i}, \dots, z_{i + n - 2} \mid c)} \, P_{h}(c). \tag{4}$$

From Figure 1 and (4), it can be seen that the probabilistic graphical model of IM(n) is a Markov network whose probability distribution is obtained by dividing the product of the marginals of the maximal cliques (maximal fully connected subgraphs) in the graph by the product of the marginals of the separators (overlaps between adjacent cliques).
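
As a small worked instance of (4), offered only as an illustration and not taken from the experiments, consider a sequence of length $l = 5$ with $n = 3$. There are three overlapping 3-gram cliques and two 2-gram separators, so the class-conditional likelihood factorizes as

$$\begin{aligned} P_{h}(z_{1}, \dots, z_{5} \mid c) &= \frac{P_{h}(z_{1}, z_{2}, z_{3} \mid c) \, P_{h}(z_{2}, z_{3}, z_{4} \mid c) \, P_{h}(z_{3}, z_{4}, z_{5} \mid c)}{P_{h}(z_{2}, z_{3} \mid c) \, P_{h}(z_{3}, z_{4} \mid c)} \\ &= P_{h}(z_{1}, z_{2}, z_{3} \mid c) \, P_{h}(z_{4} \mid z_{2}, z_{3}, c) \, P_{h}(z_{5} \mid z_{3}, z_{4}, c). \end{aligned}$$

Each ratio of a clique marginal to its separator marginal is a conditional probability, so under this reading IM(n) behaves like a Markov chain of order $n - 1$ over the system calls, and each call contributes once rather than up to n times.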

Algorithm 1 shows the pseudocode of the intrusion detection algorithm using the interdependency model.

Algorithm 1: Intrusion detection algorithm with interdependency model.

IntrusionDetector(S):

begin

(1) Input: sequence data set $S = s_{1}, \dots, s_{n}$ and interdependency model α as a probabilistic model

(2) Learning: For each class $c_{j}$, estimate probabilities $P_{α}(S = s_{1}, \dots, s_{n})$ of $α(c_{j})$ based on D that comprises the intrusion detector h

(3) Testing: For a novel sequence $\hat{S} = s_{1}, \dots, s_{n}$, predict the classification $c(\hat{S})$ as follows:

$$c(\hat{S}) = \underset{c_{j} \in C}{\operatorname{argmax}} \; P_{α}(\hat{S} = s_{1}, \dots, s_{n} \mid c_{j}) \, P(c_{j})$$

end.
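
The learning and testing steps of Algorithm 1 could be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the names `train_im`, `log_score`, and `classify_im` are hypothetical, add-one smoothing is assumed, and each clique-to-separator ratio in (4) is estimated as a smoothed conditional frequency using the count of the (n-1)-gram as a prefix of observed n-grams.

```python
import math
from collections import Counter

def train_im(data, n):
    """Per class: prior, n-gram counts, and counts of their (n-1)-gram prefixes."""
    model = {}
    for c in (0, 1):
        seqs = [z for z, label in data if label == c]
        grams = Counter(
            tuple(z[i:i + n]) for z in seqs for i in range(len(z) - n + 1)
        )
        prefixes = Counter()
        for gram, k in grams.items():
            prefixes[gram[:-1]] += k
        model[c] = (len(seqs) / len(data), grams, prefixes)
    return model

def log_score(z, c, model, n, m):
    """Log of (4) for class c; m is the number of distinct system calls."""
    prior, grams, prefixes = model[c]
    windows = [tuple(z[i:i + n]) for i in range(len(z) - n + 1)]
    score = math.log(prior)
    if not windows:
        return score
    # First clique: smoothed joint probability of the first n-gram.
    total = sum(grams.values())
    score += math.log((grams[windows[0]] + 1) / (total + m ** n))
    # Each later clique divided by its separator is a conditional probability.
    for gram in windows[1:]:
        score += math.log((grams[gram] + 1) / (prefixes[gram[:-1]] + m))
    return score

def classify_im(z, model, n, m):
    return max((0, 1), key=lambda c: log_score(z, c, model, n, m))

# Hypothetical toy data over five distinct calls.
data = [
    (["open", "read", "write", "close"], 0),
    (["open", "read", "read", "write", "close"], 0),
    (["open", "exec", "exec", "close"], 1),
]
model = train_im(data, n=2)
print(classify_im(["open", "exec", "close"], model, n=2, m=5))
```

Because only n-gram and prefix counts are needed, the counters here could be replaced by lookups in the suffix tree storage described in Section 2.5.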

2.4. Support Vector Machines with n-Gram Features (SVM n-Gram)

To compare IM(n)'s performance with another data mining algorithm, we consider Support Vector Machines with n-grams as input.

It is of interest to compare IM(n) and NB n-gram with SVM n-gram, because, in contrast to Naive Bayes with n-gram features (NB n-gram), SVMs do not rely on the independence assumption between features.

However, the SVM algorithm suffers from the explosion of the number of features as n increases, because preparing a kernel matrix for an SVM takes at least quadratic time and space. In other words, since the SVM learning algorithm has nonlinear asymptotic complexity, and due to the nature of n-gram attributes, the number of attributes and the size of the working memory increase drastically as n increases (i.e., the curse of dimensionality). In the actual experiments, because of these computational and memory requirements, we were only able to run SVM n-gram for $n = 1, 2,$ and $3$ (and only for $n = 1$ and $2$ on the UNM data sets; see Tables 2–4).

2.5. k-Truncated Suffix Tree

A suffix tree is a data structure to index a string [11]. Figure 2 shows an example suffix tree for a string “banana$”, where “$” denotes the end of the string. The number in each node represents the number of pattern occurrences. For example, in the string “banana$”, “a” occurs three times and “na” occurs twice.

Figure 2

A suffix tree of a string “banana$”. The number in each node represents the number of pattern occurrences.

When the length of a string is l, then it takes $O (l)$ time [21] to build a suffix tree for the string. Once a suffix tree is generated, then it takes $O (m)$ time to find a pattern string with length m. Also, with edge-label compression, it only needs $O (l)$ space for a suffix tree.

In practice, to store multiple strings, a generalized suffix tree is used. A generalized suffix tree is a data structure that contains all suffixes of a set of strings [11]. An example of a generalized suffix tree for the strings “bagle$\$_{1}$” and “beagle$\$_{2}$” is shown in Figure 4.

Even with suffix tree storage, saving entire system call traces still takes a lot of memory. However, in our application, it is not necessary to store a whole trace in a suffix tree. Instead, we store only the n-gram features that are of interest for generating intrusion detectors, by truncating the generalized suffix tree at depth k so that it indexes substrings of length at most k, as shown in Figure 3.
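
The counting behavior of a k-truncated structure can be sketched in Python as below. For brevity this sketch uses an uncompressed trie of truncated suffixes rather than a true suffix tree with edge-label compression and linear-time construction, so it illustrates the counts only, not the stated complexity bounds. `Node`, `build_truncated_trie`, and `count` are hypothetical names.

```python
class Node:
    def __init__(self):
        self.count = 0       # occurrences of the pattern ending at this node
        self.children = {}   # symbol -> Node

def build_truncated_trie(sequences, k):
    """Insert every suffix of every sequence, keeping only its first k symbols."""
    root = Node()
    for seq in sequences:
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:start + k]:
                node = node.children.setdefault(symbol, Node())
                node.count += 1
    return root

def count(root, pattern):
    """Number of occurrences of pattern (length <= k) in the indexed sequences."""
    node = root
    for symbol in pattern:
        node = node.children.get(symbol)
        if node is None:
            return 0
    return node.count

# The "banana$" example from the text, truncated at depth 3.
root = build_truncated_trie(["banana$"], k=3)
print(count(root, "a"), count(root, "na"), count(root, "ana"))
```

Sequences can equally be lists of system call identifiers, and patterns longer than k are simply not found, which is what bounds the memory used per trace.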

Figure 3

A 4-truncated suffix tree of “ktruncatedsuffixtree$”.

Figure 4

A generalized suffix tree of “bagle$\$_{1}$” and “beagle$\$_{2}$”.

3. Experimental Setup and Results

To evaluate the performance of interdependency models of n-grams (IM(n)), we compared its performance with Naive Bayes (NB), Naive Bayes with n-gram features (NB n-gram), and Support Vector Machines with n-gram features (SVM n-gram). For the experiment, we chose publicly available data sets from the University of New Mexico (UNM) [14] and MIT Lincoln Lab 1998 DARPA Intrusion Detection Evaluation Data Sets (MIT LL 1998) [22].

3.1. Data Sets

3.1.1. UNM Data Sets

The University of New Mexico (UNM) provides a number of system call data sets. Each data set corresponds to a specific attack or exploit. The data sets we tested are “live lpr”, “live lpr MIT”, and “denial of service” (DoS).

In UNM system call traces, each trace is an output of one program. Sometimes, one trace has multiple processes. In such cases, we have made as many sequences as the number of processes in the original trace. Thus, multiple sequences of system calls are made from one trace if the input trace has multiple processes in it. However, most traces have only one process and usually one sequence is created for each trace. Table 1 shows the number of original traces and the number of sequences for each data set.
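
Splitting a trace into one sequence per process might look like the following Python sketch. It assumes, purely for illustration, that each line of a trace holds a process ID followed by a system call number separated by whitespace; the function name `split_by_process` and the sample lines are hypothetical.

```python
from collections import defaultdict

def split_by_process(lines):
    """Group system calls by process ID, preserving their order in the trace."""
    sequences = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pid, call = fields
        sequences[pid].append(int(call))
    return dict(sequences)

# Hypothetical trace with two interleaved processes.
trace_lines = ["101 5", "101 3", "102 5", "101 6", "", "102 4"]
sequences = split_by_process(trace_lines)
print(sequences)
```

A trace with a single process yields a single sequence, which matches the common case described above.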

Table 1

The number of instances for each type of attack.

Attack	Positive	Negative	Total
Live lpr (l_lpr)	1001	1231	2232
Live lpr MIT (l_lpr.MIT)	1001	2703	3704
Denial of service (stide)	105	13726	13831

There are two different mapping files in the UNM call traces we used for the experiment. One is Sun (live lpr and live lpr.MIT), and the other is Linux (denial of service). There are old and new Sun mapping files but only one system call is added to the new mapping file, so both can be easily converted. The Sun mapping file has a few duplicate system calls (e.g., “fstat”, “stat”, etc.), but we changed them such that each system call is unique.

3.1.2. MIT LL 1998 Data Sets

In the MIT Lincoln Lab 1998 data sets [22], we used both the seven weeks of training data and the two weeks of testing data. The data comprise a detailed set of data files representing the state of a particular system over eight-hour daytime periods over the course of the nine weeks (seven for training and two for testing). Of interest to the experiments are the omnibus data files containing all system calls made during the collection period and the network traffic analysis files (distilled from raw network data) that identify inbound network connection attempts.

We now explain the issues with cross-indexing the data files. The MIT Lincoln Lab data sets include omnibus files containing all system call traces. For each omnibus file, there is a separate network traffic analysis data file that indicates inbound network connections to the system. Attack attempts are logged with the network data, which implies that labeling the training data requires cross-indexing this file with the system call trace file. The system call trace file identifies the source of each call using the process ID. Therefore, cross-indexing requires tracking the argument to the “exec” system call identifying the binary to be executed. Additionally, the timestamps from the network traffic analyzer do not exactly correspond to the execution timestamps from the operating system kernel. A tolerance of one second was chosen arbitrarily; it permits matching a large majority of connection attempts with their corresponding server processes on the target system.
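
The cross-indexing step can be illustrated with a short Python sketch. The record layouts, the rule of matching a connection's service name against the executed binary, and the function name `match_connections` are all assumptions made for illustration; only the one-second tolerance comes from the description above.

```python
def match_connections(connections, execs, tolerance=1.0):
    """Pair each connection with the closest unmatched exec within the tolerance.

    connections: list of (timestamp, service, is_attack)
    execs:       list of (timestamp, pid, binary)
    Returns {pid: is_attack} for matched processes only.
    """
    labels = {}
    for conn_time, service, is_attack in connections:
        candidates = [
            (abs(exec_time - conn_time), pid)
            for exec_time, pid, binary in execs
            if binary == service
            and abs(exec_time - conn_time) <= tolerance
            and pid not in labels
        ]
        if candidates:
            _, pid = min(candidates)
            labels[pid] = is_attack
    return labels

# Hypothetical records; timestamps are in seconds.
connections = [(100.0, "ftpd", True), (200.0, "telnetd", False), (300.0, "ftpd", False)]
execs = [(100.4, 11, "ftpd"), (200.9, 12, "telnetd"), (302.5, 13, "ftpd"), (150.0, 14, "sendmail")]
labels = match_connections(connections, execs)
print(labels)
```

Processes that receive no label (here PIDs 13 and 14) correspond to the processes that are removed from consideration because they cannot be classified.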

All processes detected that do not correspond to some network connection attempt identified in the trace are removed from consideration (since they cannot be classified), as are all calls attributed to a process ID for which an “exec” system call is not found.

As for the experimental setting, we both followed the training and testing split specified in the data set description and performed 10-fold cross-validation over the whole nine-week data set.
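
A 10-fold cross-validation loop can be sketched in Python as follows. The source does not say whether folds were shuffled or stratified, so the shuffled, unstratified split here is an assumption, as are the names `k_fold_indices` and `cross_validate`; the sketch also assumes at least k examples so that no fold is empty.

```python
import random

def k_fold_indices(num_examples, k=10, seed=0):
    """Shuffle the indices once and deal them into k folds."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, train_fn, predict_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, return per-fold accuracy."""
    accuracies = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        model = train_fn(train)
        correct = sum(predict_fn(model, data[i][0]) == data[i][1] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies

# Hypothetical usage with a trivial classifier that always predicts class 0.
toy = [(["open", "read"], 0)] * 20
accs = cross_validate(toy, train_fn=lambda train: 0, predict_fn=lambda model, z: model)
print(sum(accs) / len(accs))
```

Any of the classifiers discussed in Section 2 can be plugged in through `train_fn` and `predict_fn`.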

3.2. Results

We applied the following performance measures:

$$\begin{aligned} \text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, \\ \text{False Positive Rate} &= \frac{FP}{FP + TN}, \end{aligned} \tag{5}$$

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives.
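
The two measures in (5) can be computed from predicted and true labels with a few lines of Python. The function names and the toy labels are illustrative assumptions; label 1 denotes “intrusion” and 0 denotes “normal”, as in Section 2.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with 1 as the positive (intrusion) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(fp, tn):
    return fp / (fp + tn)

# Hypothetical labels for six sequences.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), false_positive_rate(fp, tn))
```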

Table 2 shows accuracies and false positive rates of the three algorithms (IM(n), NB n-gram, and SVM n-gram) on the “UNM live lpr” data. IM(n) shows the best performance when n is from 6 to 8. Its performance is comparable to that of Support Vector Machines (SVMs). Overall, IM(n) shows better performance than NB n-gram.

Table 2

Comparison of accuracy (A) and false positive rate (FP) of intrusion detectors generated by IM(n), NB n-gram, and SVM n-gram on the UNM live lpr data. The accuracies and false positive rates were estimated using 10-fold cross-validation. We calculated a 99% confidence interval on the accuracies and false positive rates. Note that when $n \geq 3$ , SVM n-gram is infeasible because of computational and memory requirements.

n	IM(n) A	IM(n) FP	NB n-gram A	NB n-gram FP	SVM n-gram A	SVM n-gram FP
1	$84.09 \pm 0.02$	$28.84 \pm 0.02$	$84.09 \pm 0.02$	$28.84 \pm 0.02$	100.00 ± 0.00	0.00 ± 0.00
2	$99.78 \pm 0.00$	$0.41 \pm 0.00$	$98.30 \pm 0.01$	$3.09 \pm 0.01$	$99.96 \pm 0.00$	$0.00 \pm 0.00$
3	$99.96 \pm 0.00$	$0.00 \pm 0.00$	$99.01 \pm 0.01$	$1.79 \pm 0.01$	N/A	N/A
4	$99.96 \pm 0.00$	$0.00 \pm 0.00$	$99.60 \pm 0.00$	$0.73 \pm 0.00$	N/A	N/A
5	$99.96 \pm 0.00$	$0.00 \pm 0.00$	$99.82 \pm 0.00$	$0.32 \pm 0.00$	N/A	N/A
6–8	100.00 ± 0.00	0.00 ± 0.00	99.87 ± 0.00	0.24 ± 0.00	N/A	N/A
9-10	$99.96 \pm 0.00$	$0.08 \pm 0.00$	$99.82 \pm 0.00$	$0.32 \pm 0.00$	N/A	N/A
11–20	$99.96 \pm 0.00$	$0.08 \pm 0.00$	$99.78 \pm 0.00$	$0.41 \pm 0.00$	N/A	N/A
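
The captions do not state how the 99% confidence intervals were computed. One common choice, assumed here purely for illustration, is the normal approximation to a binomial proportion, $z \sqrt{p (1 - p) / N}$. The Python sketch below evaluates it for an accuracy of 0.8409 over the 2232 live lpr instances listed in Table 1; `ci_half_width` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def ci_half_width(p, n, confidence=0.99):
    """Half-width of a normal-approximation interval for a proportion p over n trials."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

half_width = ci_half_width(0.8409, 2232)
print(round(half_width, 4))
```

Under these assumptions the half-width is roughly 0.02 when expressed as a fraction (about 2 percentage points). That agrees with the ± 0.02 in the first row of Table 2 only if the reported half-widths are fractions rather than percentages, which the source does not confirm.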

As for the “UNM live lpr” data, IM(n) and NB n-gram show the best performance when n is from 6 to 8. The best accuracy and false positive rate of IM(n) are 100.00 and 0.00, and the best accuracy and false positive rate of NB n-gram are 99.87 and 0.24. These results correspond to the widely known claim that “six is the magic number” [23]. However, for the other data sets (“UNM live lpr MIT” and “UNM denial of service”) we have tested, the claim does not always hold.

Table 3 shows the results of the three algorithms for UNM live lpr MIT data. When n is from 17 to 20, IM(n) shows the best performance (accuracy is 99.97 and false positive rate is 0.00) over SVM (accuracy is 99.95 and false positive rate is 0.00) and NB n-gram (accuracy is 93.49 and false positive rate is 8.92).

Table 3

Comparison of accuracy (A) and false positive rate (FP) of intrusion detectors generated by IM(n), NB n-gram, and SVM n-gram on UNM live lpr MIT data. The accuracies and false positive rates were estimated using 10-fold cross-validation. We calculated a 99% confidence interval on the accuracies and false positive rates. Note that when $n \geq 3$ , SVM n-gram is infeasible because of computational and memory requirements.

n	IM(n) A	IM(n) FP	NB n-gram A	NB n-gram FP	SVM n-gram A	SVM n-gram FP
1	$54.56 \pm 0.02$	$62.26 \pm 0.02$	$54.56 \pm 0.02$	$62.26 \pm 0.02$	$99.83 \pm 0.00$	$0.14 \pm 0.00$
2	$96.76 \pm 0.01$	$4.44 \pm 0.01$	$78.75 \pm 0.02$	$29.12 \pm 0.02$	99.95 ± 0.00	0.00 ± 0.00
3	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$86.23 \pm 0.01$	$18.87 \pm 0.02$	N/A	N/A
4	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$89.28 \pm 0.01$	$14.69 \pm 0.02$	N/A	N/A
5	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$91.17 \pm 0.01$	$12.10 \pm 0.01$	N/A	N/A
6	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$91.77 \pm 0.01$	$11.28 \pm 0.01$	N/A	N/A
7	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$91.85 \pm 0.01$	$11.17 \pm 0.01$	N/A	N/A
8	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$91.98 \pm 0.01$	$10.99 \pm 0.01$	N/A	N/A
9	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$92.09 \pm 0.01$	$10.84 \pm 0.01$	N/A	N/A
10	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$92.04 \pm 0.01$	$10.91 \pm 0.01$	N/A	N/A
11	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$92.33 \pm 0.01$	$10.51 \pm 0.01$	N/A	N/A
12	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$92.68 \pm 0.01$	$10.03 \pm 0.01$	N/A	N/A
13	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$92.85 \pm 0.01$	$9.80 \pm 0.01$	N/A	N/A
14	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$92.95 \pm 0.01$	$9.66 \pm 0.01$	N/A	N/A
15	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$93.17 \pm 0.01$	$9.36 \pm 0.01$	N/A	N/A
16	$99.95 \pm 0.00$	$0.00 \pm 0.00$	$93.28 \pm 0.01$	$9.21 \pm 0.01$	N/A	N/A
17	99.97 ± 0.00	0.00 ± 0.00	$93.30 \pm 0.01$	$9.17 \pm 0.01$	N/A	N/A
18	99.97 ± 0.00	0.00 ± 0.00	$93.36 \pm 0.01$	$9.10 \pm 0.01$	N/A	N/A
19-20	99.97 ± 0.00	0.00 ± 0.00	93.49 ± 0.01	8.92 ± 0.01	N/A	N/A

In Table 4, which is for the UNM denial of service data, we can see that SVM with 2-grams shows the best performance (accuracy is 99.99 and false positive rate is 0.00), but the difference from the best results of IM(n) is marginal (accuracy is 99.93 and false positive rate is 0.00). Moreover, during the training stage, SVM learns its discriminative model with the Sequential Minimal Optimization algorithm [24], which takes several hours or more. In contrast, IM(n) takes only about one or two minutes, and its performance is comparable to that of SVM.

Table 4

Comparison of accuracy (A) and false positive rate (FP) of intrusion detectors generated by IM(n), NB n-gram, and SVM n-gram on the UNM denial of service data. The accuracies and false positive rates were estimated using 10-fold cross-validation. We calculated a 99% confidence interval on the accuracies and false positive rates. Note that when $n \geq 3$ , SVM n-gram is infeasible because of computational and memory requirements.

n	IM(n) A	IM(n) FP	NB n-gram A	NB n-gram FP	SVM n-gram A	SVM n-gram FP
1	$98.69 \pm 0.00$	$0.92 \pm 0.00$	$98.69 \pm 0.00$	$0.92 \pm 0.00$	$99.98 \pm 0.00$	$0.01 \pm 0.00$
2	$99.27 \pm 0.00$	$0.01 \pm 0.00$	$99.24 \pm 0.00$	$0.04 \pm 0.00$	99.99 ± 0.00	0.00 ± 0.00
3	$99.24 \pm 0.00$	$0.01 \pm 0.00$	$99.06 \pm 0.00$	$0.69 \pm 0.00$	N/A	N/A
4	$99.23 \pm 0.00$	$0.01 \pm 0.00$	$99.18 \pm 0.00$	$0.68 \pm 0.00$	N/A	N/A
5	$99.23 \pm 0.00$	$0.01 \pm 0.00$	$99.24 \pm 0.00$	$0.65 \pm 0.00$	N/A	N/A
6	$99.23 \pm 0.00$	$0.01 \pm 0.00$	$99.32 \pm 0.00$	$0.63 \pm 0.00$	N/A	N/A
7	$99.24 \pm 0.00$	$0.01 \pm 0.00$	$99.40 \pm 0.00$	$0.59 \pm 0.00$	N/A	N/A
8	$99.25 \pm 0.00$	$0.00 \pm 0.00$	$99.52 \pm 0.00$	$0.47 \pm 0.00$	N/A	N/A
9	$99.36 \pm 0.00$	$0.00 \pm 0.00$	$99.66 \pm 0.00$	$0.33 \pm 0.00$	N/A	N/A
10	$99.65 \pm 0.00$	$0.00 \pm 0.00$	$99.66 \pm 0.00$	$0.33 \pm 0.00$	N/A	N/A
11	$99.80 \pm 0.00$	$0.00 \pm 0.00$	$99.66 \pm 0.00$	$0.33 \pm 0.00$	N/A	N/A
12	$99.84 \pm 0.00$	$0.00 \pm 0.00$	$99.66 \pm 0.00$	$0.33 \pm 0.00$	N/A	N/A
13	$99.87 \pm 0.00$	$0.00 \pm 0.00$	$99.66 \pm 0.00$	$0.33 \pm 0.00$	N/A	N/A
14	$99.88 \pm 0.00$	$0.00 \pm 0.00$	$99.67 \pm 0.00$	$0.31 \pm 0.00$	N/A	N/A
15	$99.89 \pm 0.00$	$0.00 \pm 0.00$	$99.67 \pm 0.00$	$0.31 \pm 0.00$	N/A	N/A
16	$99.89 \pm 0.00$	$0.00 \pm 0.00$	$99.68 \pm 0.00$	$0.31 \pm 0.00$	N/A	N/A
17	$99.91 \pm 0.00$	$0.00 \pm 0.00$	$99.69 \pm 0.00$	$0.30 \pm 0.00$	N/A	N/A
18	$99.92 \pm 0.00$	$0.00 \pm 0.00$	$99.69 \pm 0.00$	$0.30 \pm 0.00$	N/A	N/A
19-20	99.93 ± 0.00	0.00 ± 0.00	99.69 ± 0.00	0.30 ± 0.00	N/A	N/A

In Table 5, we show the results of training and testing experiments on the MIT LL 1998 data sets. In these training and testing experiments, we use predefined training data sets of the MIT LL 1998 data sets for learning intrusion detectors and use predefined testing data sets for evaluating the intrusion detectors. Notice that we do not calculate confidence intervals because it is infeasible to obtain distributions from the train/test setting.

Table 5

Comparison of accuracy (A) and false positive rate (FP) of intrusion detectors generated by IM(n), NB n-gram, and SVM n-gram on MIT LL 1998 data. We followed the train/test setting specified in the data set description. Note that when $n \geq 4$ , SVM n-gram is infeasible because of computational and memory requirements.

n	IM(n) A	IM(n) FP	NB n-gram A	NB n-gram FP	SVM n-gram A	SVM n-gram FP
1	36.57	62.49	36.57	62.49	81.55	7.45
2	80.33	0.06	80.36	0.03	80.53	8.28
3	80.35	0.03	80.35	0.05	80.28	9.35
4	80.37	0.00	80.34	0.03	N/A	N/A
5	80.33	0.05	80.35	0.03	N/A	N/A
6	80.33	0.05	80.30	0.08	N/A	N/A
7	80.37	0.04	78.50	2.89	N/A	N/A
8	79.51	1.11	78.45	2.95	N/A	N/A
9	79.44	1.20	75.22	6.93	N/A	N/A
10	75.99	5.50	75.12	7.07	N/A	N/A
11	75.93	5.57	74.99	7.27	N/A	N/A
12	75.93	5.57	75.05	7.29	N/A	N/A
13	75.93	5.58	74.85	7.56	N/A	N/A
14	75.91	5.59	71.11	12.84	N/A	N/A
15	75.79	5.76	65.10	25.30	N/A	N/A
16	76.97	6.26	52.47	42.21	N/A	N/A
17	72.23	12.46	50.39	44.82	N/A	N/A
18	71.98	12.67	50.42	44.77	N/A	N/A
19	71.99	12.66	50.36	44.88	N/A	N/A
20	74.95	9.06	49.88	44.93	N/A	N/A

It can be seen that SVM with 1-grams achieves the best accuracy, 81.55, but its false positive rate is 7.45. The difference in accuracy from IM(4) is marginal (IM(4) has accuracy 80.37 and false positive rate 0.00), and the false positive rate of IM(4) is far better than that of SVM with 1-grams. As discussed, the total learning and testing time of SVM is several hours or more, whereas all the IM(n) models take about one or two minutes, and their performance is in general comparable to that of SVMs.

In Figure 5, we show the Receiver Operating Characteristic (ROC) curves of train/test experiments on the MIT LL 1998 data sets. Area under the curve (AUC) of IM(4) is 0.7290, AUC of NB 2-gram is 0.6654, and AUC of SVM 1-gram is 0.7043. Thus, in terms of ROC, IM(4) outperforms SVM 1-gram.

Figure 5

Receiver Operating Characteristic (ROC) curves of interelement dependency model of 4-gram (IM(4), $AUC = 0.7290$ ), Naive Bayes of 2-gram (NB 2-gram, $AUC = 0.6654$ ), and Support Vector Machines of 1-gram (SVM 1-gram, $AUC = 0.7043$ ) on the MIT LL 1998 data estimated with the train/test setting.
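
An AUC value of the kind reported above can be computed from classifier scores without drawing the curve. The Python sketch below uses the rank-based formulation (the probability that a randomly chosen intrusion scores higher than a randomly chosen normal sequence, counting ties as one half); the source does not state which procedure was used, and the function name `auc` and the toy scores are assumptions.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores (higher means more likely an intrusion) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))
```

For the probabilistic models in Section 2, a natural score would be the difference between the log scores of the two classes, although that choice is also an assumption here.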

In Table 6, we show the results of 10-fold cross-validation on the MIT LL 1998 data sets.

Table 6

Comparison of accuracy (A) and false positive rate (FP) of intrusion detectors generated by IM(n), NB n-gram, and SVM n-gram on MIT LL 1998 data. The accuracies and false positive rates were estimated using 10-fold cross-validation. We calculated a 99% confidence interval on the accuracies and false positive rates. Note that when $n \geq 4$, SVM n-gram is infeasible because of computational and memory requirements.

n	IM(n) A	IM(n) FP	NB n-gram A	NB n-gram FP	SVM n-gram A	SVM n-gram FP
1	$32.55 \pm 0.37$	$88.55 \pm 0.25$	$32.55 \pm 0.37$	$88.55 \pm 0.25$	$91.83 \pm 0.21$	$0.03 \pm 0.01$
2	$74.54 \pm 0.34$	$26.17 \pm 0.34$	$60.61 \pm 0.38$	$41.91 \pm 0.38$	$92.45 \pm 0.21$	$4.46 \pm 0.16$
3	$48.33 \pm 0.39$	$56.18 \pm 0.39$	$46.97 \pm 0.39$	$57.44 \pm 0.39$	93.34 ± 0.19	4.40 ± 0.16
4	$85.39 \pm 0.28$	$13.84 \pm 0.27$	$50.73 \pm 0.39$	$53.29 \pm 0.39$	N/A	N/A
5	$92.17 \pm 0.21$	$6.02 \pm 0.19$	$56.28 \pm 0.39$	$47.08 \pm 0.39$	N/A	N/A
6	$93.12 \pm 0.20$	$4.97 \pm 0.17$	$60.77 \pm 0.38$	$42.02 \pm 0.38$	N/A	N/A
7	$93.17 \pm 0.20$	$3.97 \pm 0.15$	$76.68 \pm 0.33$	$24.54 \pm 0.33$	N/A	N/A
8	$93.23 \pm 0.20$	$3.96 \pm 0.15$	$77.58 \pm 0.33$	$22.43 \pm 0.33$	N/A	N/A
9	$93.22 \pm 0.20$	$3.94 \pm 0.15$	$78.74 \pm 0.32$	$21.18 \pm 0.32$	N/A	N/A
10	93.24 ± 0.20	3.93 ± 0.15	$79.33 \pm 0.32$	$20.532 \pm 0.32$	N/A	N/A
11	$93.24 \pm 0.20$	$3.94 \pm 0.15$	$78.71 \pm 0.32$	$21.92 \pm 0.32$	N/A	N/A
12	$93.37 \pm 0.19$	$4.00 \pm 0.15$	$78.64 \pm 0.32$	$21.54 \pm 0.32$	N/A	N/A
13	$91.65 \pm 0.22$	$1.44 \pm 0.09$	$82.43 \pm 0.30$	$12.55 \pm 0.26$	N/A	N/A
14	$91.39 \pm 0.22$	$1.73 \pm 0.10$	86.21 ± 0.27	8.12 ± 0.21	N/A	N/A
15	$91.35 \pm 0.22$	$1.77 \pm 0.10$	$85.09 \pm 0.28$	$14.54 \pm 0.27$	N/A	N/A
16	$91.27 \pm 0.22$	$2.36 \pm 0.12$	$85.10 \pm 0.28$	$13.92 \pm 0.27$	N/A	N/A
17	$91.50 \pm 0.22$	$5.39 \pm 0.18$	$85.24 \pm 0.28$	$13.20 \pm 0.26$	N/A	N/A
18	$91.42 \pm 0.22$	$4.98 \pm 0.17$	$85.37 \pm 0.28$	$13.03 \pm 0.26$	N/A	N/A
19	$91.47 \pm 0.22$	$5.41 \pm 0.18$	$85.54 \pm 0.27$	$12.83 \pm 0.26$	N/A	N/A
20	$91.46 \pm 0.22$	$5.43 \pm 0.18$	$85.68 \pm 0.27$	$13.87 \pm 0.27$	N/A	N/A

It can be seen that SVM with 3-grams shows the best performance, with accuracy 93.34 and false positive rate 4.40. Again, the difference from the results of IM(10) is marginal (accuracy is 93.24 and false positive rate is 3.93). Moreover, the running time of SVM with 3-grams is more than a week for learning and testing, whereas the running time of IM(n) is less than two minutes, and the performance of IM(n) is comparable to that of SVM.

In Figure 6, we show the Receiver Operating Characteristic (ROC) curves of the 10-fold cross-validation experiments on the MIT LL 1998 data sets. The area under the curve (AUC) of IM(10) is 0.7742, the AUC of NB 14-gram is 0.5271, and the AUC of SVM 3-gram is 0.8729. Thus, in terms of ROC, SVM 3-gram shows the best performance among the three.

Figure 6

Receiver Operating Characteristic (ROC) curves of interelement dependency model of 10-gram (IM(10), $AUC = 0.7742$ ), Naive Bayes of 14-gram (NB 14-gram, $AUC = 0.5271$ ), and Support Vector Machines of 3-gram (SVM 3-gram, $AUC = 0.8729$ ) on MIT LL 1998 data estimated using 10-fold cross-validation.

To show the big picture of comparing IM(n) and NB n-gram, in Figure 7, we show ROC curves of the train/test experiments on the MIT LL 1998 data sets, when $2 \leq n \leq 17$ .

Figure 7

ROC curves of the train/test experiments on the MIT LL 1998 data sets, when $2 \leq n \leq 17$ .

4. Conclusion

4.1. Related Work

Peng and Schuurmans [8] introduced n-gram augmented Naive Bayes and applied their algorithms to text classification. Silvescu et al. [10] proposed interelement dependency models which are similar to n-gram-augmented Naive Bayes. However, to the best of our knowledge, there has been no research on application of interelement dependency models for intrusion detection tasks.

Rieck and Laskov [4] used language models to detect unknown network attacks. They used a trie data structure [25, 26] to compute the similarity between two traces. We consider the generalized suffix tree more advantageous for storing n-gram features, because finding a substring in a trie does not take constant time. We performed our experiments with n ranging from 1 to 20, whereas Rieck and Laskov reported experiments only for n = 1, 3, and 5.

Most intrusion detection research focuses on the n-gram approach [4, 14, 18, 19]. However, there have been a few research efforts [15–17] that explore intrusion detection tasks with a bag or set of system calls. Liao and Vemuri [15] applied the K-Nearest Neighbor text classification method to intrusion detection tasks. Kang et al. [16] proposed a bag of system calls representation for intrusion detection. They performed various experiments for misuse detection and anomaly detection and showed that a bag of system calls representation is effective for misuse detection. Liu et al. [17] compared different system call feature representations and concluded that system calls alone are not sufficient for detecting insider threats. Our work extends the representation to n-grams and shows that the resulting intrusion detectors achieve better accuracy and false positive rates than Naive Bayes with plain n-gram features.

Forrest et al. [18] devised a Sequence Time-Delay Embedding (STIDE) intrusion detector which takes the n-gram approach with a few thresholds. Tan and Maxion [23] tried to define the operational limit of STIDE and concluded that 6-gram is enough for anomaly detection. Our work shows that the interelement dependency model (IM(n)) is more accurate than the n-gram approach and, for some attack types (UNM live lpr MIT and UNM denial of service), six is not always the magic number.

4.2. Summary and Future Work

We applied interelement dependency models directly to n-grams stored in a k-truncated generalized suffix tree (k-TGST) to classify intrusive sequences. We compared the performance of our method with that of Naive Bayes and Support Vector Machines (SVMs) with n-gram features in experiments on intrusion detection benchmark data sets.

Our method addresses the independence violation that arises when n-gram features are applied directly to Naive Bayes (i.e., Naive Bayes with n-gram features). Experimental results on the University of New Mexico (UNM) benchmark data sets and the MIT Lincoln Lab 1998 DARPA intrusion detection evaluation data sets show that it yields intrusion detectors with higher accuracy than Naive Bayes with n-gram features and accuracy comparable to that of SVMs with n-gram features.
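The independence violation can be made concrete with a small sketch. Overlapping n-grams share elements, so multiplying their class-conditional probabilities counts the shared elements repeatedly. One common way to model the dependency, in the general spirit of [8, 10], is to score each element conditioned on the n-1 elements before it. The class below is a minimal sketch under that assumption, with add-one smoothing and without the term for the first n-1 elements; the class and method names are hypothetical, and it is not a reproduction of Algorithm 1.

```python
import math
from collections import Counter

class IMClassifier:
    """Class-conditional model: P(s_i | previous n-1 calls, class), add-one smoothed."""

    def __init__(self, n, alphabet):
        self.n = n
        self.vocab_size = len(set(alphabet))
        self.ngrams = {}      # class label -> Counter of n-grams
        self.contexts = {}    # class label -> Counter of (n-1)-gram prefixes
        self.priors = Counter()

    def fit(self, traces, labels):
        for trace, label in zip(traces, labels):
            self.priors[label] += 1
            grams = self.ngrams.setdefault(label, Counter())
            ctxs = self.contexts.setdefault(label, Counter())
            for i in range(len(trace) - self.n + 1):
                gram = tuple(trace[i:i + self.n])
                grams[gram] += 1
                ctxs[gram[:-1]] += 1
        return self

    def log_score(self, trace, label):
        # log P(class) + sum over windows of log P(last call | preceding n-1 calls, class)
        score = math.log(self.priors[label] / sum(self.priors.values()))
        for i in range(len(trace) - self.n + 1):
            gram = tuple(trace[i:i + self.n])
            num = self.ngrams[label][gram] + 1
            den = self.contexts[label][gram[:-1]] + self.vocab_size
            score += math.log(num / den)
        return score

    def predict(self, trace):
        return max(self.priors, key=lambda c: self.log_score(trace, c))

# Toy training data (illustrative only).
traces = [
    ["open", "read", "read", "close"],
    ["open", "read", "write", "close"],
    ["open", "exec", "exec", "close"],
    ["open", "exec", "write", "exec"],
]
labels = ["normal", "normal", "intrusion", "intrusion"]
alphabet = {"open", "read", "write", "close", "exec"}
model = IMClassifier(n=2, alphabet=alphabet).fit(traces, labels)
```

Dividing each n-gram count by the count of its (n-1)-gram prefix is what removes the repeated contribution of overlapping elements; plain Naive Bayes with n-gram features would instead normalize each n-gram count by the total number of n-grams in the class.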

For scalable and lightweight counting of n-gram features, we use the k-truncated generalized suffix tree mechanism for storing n-gram features. With this mechanism, we tested the classifiers up to 20-grams in our experiments, which illustrates the scalability and accuracy of n-gram-augmented Naive Bayes with the k-truncated generalized suffix tree storage mechanism.
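The storage idea can be sketched with a depth-bounded suffix trie: every suffix of every trace is inserted, but only down to depth k, and each node counts how often its path occurs. This is a simplified, uncompressed stand-in for a k-truncated generalized suffix tree (it does not use the linear-time construction of [12, 21]), and the class names are hypothetical.

```python
class Node:
    """Trie node holding child links and the occurrence count of its path."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

class TruncatedSuffixTrie:
    """Stores counts of all n-grams with n <= k over a set of traces."""

    def __init__(self, k):
        self.k = k
        self.root = Node()

    def add_trace(self, trace):
        # Insert the first k symbols of every suffix of the trace.
        for start in range(len(trace)):
            node = self.root
            for symbol in trace[start:start + self.k]:
                node = node.children.setdefault(symbol, Node())
                node.count += 1

    def count(self, ngram):
        # Walk the path for the n-gram; the cost depends on its length only.
        node = self.root
        for symbol in ngram:
            node = node.children.get(symbol)
            if node is None:
                return 0
        return node.count

# Toy traces (illustrative only).
trie = TruncatedSuffixTrie(k=3)
trie.add_trace(["open", "read", "read", "close"])
trie.add_trace(["open", "read", "write", "close"])
```

A single structure built with k = 20 would then answer count queries for every n from 1 to 20, which is why one pass over the traces suffices for all n-gram sizes. Queries longer than k return 0 in this sketch because the trie is truncated at depth k.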

As for future work, we plan to apply the n-gram representation to system call arguments [27], because system call arguments are important for accurate intrusion detection. Since there has been a lot of research on n-gram approaches, it is of interest to devise an n-gram representation in which each element is the audit record data structure (i.e., a system call and its arguments). However, we are not aware of any research on applying the n-gram representation to the audit record data structure.

References

1. Charniak. Statistical Language Learning. Cambridge, MA, USA: MIT Press, 1994.

2. Lee, Stolfo S. J., Mok K. W. A data mining framework for building intrusion detection models. Proceedings of the IEEE Symposium on Security and Privacy, May 1999, pp. 120–132. Scopus: 2-s2.0-0032676506.

3. Murali, Rao. A survey on intrusion detection approaches. Proceedings of the 1st International Conference on Information and Communication Technology (ICICT '05), August 2005, pp. 233–240. Scopus: 2-s2.0-33847256909. doi:10.1109/ICICT.2005.1598592.

4. Rieck, Laskov. Detecting unknown network attacks using language models. Proceedings of the 3rd International Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA '06), Berlin, Germany, 2006, pp. 74–90.

5. Shafiq M. Z., Khayam S. A., Farooq. Embedded malware detection using Markov n-grams. Proceedings of the 5th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA '08), Paris, France, 2008, pp. 88–107.

6. Boser B. E., Guyon I. M., Vapnik V. N. Training algorithm for optimal margin classifiers. Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, New York, NY, USA: ACM Press, July 1992, pp. 144–152. Scopus: 2-s2.0-0026966646.

7. Vapnik V. N. The Nature of Statistical Learning Theory. New York, NY, USA: Springer, 1995.

8. Peng, Schuurmans, Sebastiani. Combining naive Bayes and n-gram language models for text classification. Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 2633, New York, NY, USA, 2003, pp. 335–350.

9. Andorf, Silvescu, Dobbs, Honavar. Learning classifiers for assigning protein sequences to gene ontology functional families. Proceedings of the 5th International Conference on Knowledge Based Computer Systems (KBCS '04), 2004, pp. 256–265.

10. Silvescu, Andorf, Dobbs, Honavar. Inter-element dependency models for sequence classification. Iowa State University, 2004.

11. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. 1st ed. New York, NY, USA: Cambridge University Press, 1997.

12. J. C., Park. Data compression with truncated suffix trees. Proceedings of the Data Compression Conference (DCC '00), Snowbird, UT, USA, March 2000. Scopus: 2-s2.0-0033876942.

13. Mitchell T. M. Machine Learning. New York, NY, USA: McGraw-Hill, 1997.

14. Warrender, Forrest, Pearlmutter. Detecting intrusions using system calls: alternative data models. Proceedings of the IEEE Symposium on Security and Privacy, May 1999, pp. 133–145. Scopus: 2-s2.0-0032639421.

15. Liao, Vemuri V. R. Using text categorization techniques for intrusion detection. Proceedings of the 11th USENIX Security Symposium, Berkeley, CA, USA: USENIX Association, 2002, pp. 51–59.

16. Kang D.-K., Fuller, Honavar. Learning classifiers for misuse and anomaly detection using a bag of system calls representation. Proceedings of the 6th Annual IEEE System, Man and Cybernetics Information Assurance Workshop (SMC '05), West Point, NY, USA, June 2005, pp. 118–125. Scopus: 2-s2.0-33745463455. doi:10.1109/IAW.2005.1495942.

17. Liu, Martin, Hetherington, Matzner. A comparison of system call feature representations for insider threat detection. Proceedings of the 6th Annual IEEE System, Man and Cybernetics Information Assurance Workshop (SMC '05), West Point, NY, USA, June 2005, pp. 340–347. Scopus: 2-s2.0-33745477389. doi:10.1109/IAW.2005.1495972.

18. Forrest, Allen, Perelson A. S., Cherukuri. Self-nonself discrimination in a computer. Proceedings of the 1994 IEEE Symposium on Research in Security and Privacy, May 1994, pp. 202–212. Scopus: 2-s2.0-0027961889.

19. Lee, Stolfo. Data mining approaches for intrusion detection. Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, USA, 1998, pp. 79–94.

20. Cowell R. G., Lauritzen S. L., Dawid A. P., Spiegelhalter D. J. Probabilistic Networks and Expert Systems. Secaucus, NJ, USA: Springer, 1999.

21. Ukkonen. On-line construction of suffix trees. Algorithmica, 1995, 14(3), pp. 249–260. Scopus: 2-s2.0-0001704377. doi:10.1007/BF01206331.

22. Lippmann, Fried, Graf, Haines, Kendall, McClung, Weber, Webster, Wyschogrod, Cunningham, Zissman. Evaluating intrusion detection systems: the 1998 DARPA off-line intrusion detection evaluation. Proceedings of the DARPA Information Survivability Conference and Exposition, Los Alamitos, CA, USA: IEEE Computer Society Press, 2000.

23. Tan K. M. C., Maxion R. A. ‘Why 6?’ Defining the operational limits of stide, an anomaly-based intrusion detector. Proceedings of the Symposium on Security and Privacy, May 2002, pp. 188–201. Scopus: 2-s2.0-0036085540.

24. Platt J. C. Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods, 1999, pp. 185–208.

25. de la Briandais. File searching using variable length keys. Proceedings of the AFIPS West Joint Computer Conference, 1959, pp. 295–298.

26. Fredkin. Trie memory. Communications of the ACM, 1960, 3(9), pp. 490–499.

27. Mutz, Valeur, Vigna, Kruegel. Anomalous system call detection. ACM Transactions on Information and System Security, 2006, 9(1), pp. 61–93. Scopus: 2-s2.0-33745201000.

Lightweight and Scalable Intrusion Trace Classification Using Interelement Dependency Models Suitable for Wireless Sensor Network Environment
