Sage Journals: Discover world-class research

Abstract

Information amount has been shown to be one of the most efficient methods for measuring uncertainty. However, there has been little research on outlier detection using information amount. To fill this void, this paper provides a new unsupervised outlier detection method based on the amount of information. First, the information amount in a given information system is determined, which offers a thorough estimate of the uncertainty of this information system. Then, the relative information amount and the relative cardinality are proposed. Following that, the degree of outlierness and weight function are shown. Furthermore, the information amount-based outlier factor is constructed, which determines whether an object is an outlier by its rank. Finally, a new unsupervised outlier detection method called the information amount-based outlier factor (IAOF) is developed. To validate the effectiveness and advantages of IAOF, it is compared to five existing outlier identification methods. The experimental results on real-world data sets show that this method is capable of addressing the problem of outlier detection in categorical information systems.

Keywords

Outlier detection, CIS, information amount, IAOF

1 Introduction

1.1 Background and related work

Data mining aims to select the best information from massive amounts of data. Most data conform to certain rules, but a small portion runs counter to those rules for various reasons; such data are called outliers. There is still no uniform and rigorous definition of outliers, and the description most commonly cited by scholars was proposed by Hawkins in the 1980s, namely, “outliers are different data points in a data set”. In other words, outliers differ significantly from the rest of the data set. Outliers are also commonly referred to as blemishes, inconsistencies, deviations, and novelties [13]. Edgeworth was the first to perform outlier detection, analyzing and mining outlier points using mathematical statistics. Since then, many scholars have applied new theories and technologies to outlier detection and have continually optimized and improved them. Outlier detection has thus become an important branch of data mining, on a par with predictive modeling, cluster analysis, and association analysis, and it plays a pivotal role in many fields such as public health and medical treatment [25], loan approval [31], weather forecasting [4], education management [26], and power operation [19, 38]. The name, definition, and criterion of outliers vary with the application field or purpose.

There are various ways to categorize outlier detection methods. According to whether labels are involved in the recognition process, they can be classified into three categories: supervised, semi-supervised, and unsupervised. According to their underlying principles, they can also be divided into four categories: probability distribution-based methods, classification-based methods, clustering-based methods, and proximity-based methods.

Probability distribution-based methods utilize a standard distribution to fit the data set and identify outliers based on the probability distribution. For example, Yamanishi et al. [33] used a Gaussian mixture model to represent normal behavior and scored each case according to the variation of the model; a high score indicates a high probability of being an outlier. Later, this result was used in conjunction with a supervised learning method to obtain another probability distribution-based outlier detection model. Shin [30] employed Markov chains to probabilistically model abnormal events in networked systems. Classification-based methods, by contrast, do not need to know the sample distribution or to label the samples. They have strong learning and inference ability, generalize well, and are theoretically more suitable for outlier detection in high-dimensional data; examples include neural network-based, Bayesian network-based, support vector machine-based, and rule-based methods. Clustering-based methods such as DBSCAN [11] and DPC [27] impose no strict rules on data types and need neither prior knowledge nor class labels. They mainly focus on finding clusters and regard objects far from the cluster centers as outliers. Proximity-based methods have advantages similar to those of clustering-based methods and identify outliers mainly based on the distance or density of objects. The method most commonly used for local proximity is LOF [3], on which a series of methods such as LDOF and LoOP have since been built [6, 18].

With the rapid development and gradual improvement of methods related to rough set theory (RST), these methods are widely used in the fields of knowledge acquisition, machine learning, pattern recognition, attribute approximation, and decision analysis. RST has also been successfully used for outlier detection. Shaari [29] used the new concept of nonapproximation to detect outlier patterns based on RST. Jiang et al. [16] proposed the IEOF method for a categorical information system based on information entropy, one of the metric tools of uncertainty. Chen et al. [5] introduced a neighborhood model to combine the coarse-grained technique with an outlier detection technique. Sangeetha et al. [28] considered the weighted density values of attributes and objects in outlier detection for an intuitionistic fuzzy information system. Domingues et al. [8] gave a comparative evaluation of outlier detection methods. Yuan et al. [34] proposed the NIEOD method by building a neighborhood information system using heterogeneous distances and an adaptive radius and then realizing the information metrics by using the neighborhood information entropy. Degirmenci et al. [10] put forward efficient density- and cluster-based incremental outlier detection in data streams. Yuan et al. [35–37] proposed methods based on fuzzy information entropy, fuzzy roughness granularity, and weighted fuzzy roughness density, respectively. Jin et al. [15] introduced intrusion detection on the internet of vehicles by combining log-ratio oversampling, outlier detection, and metric learning. Kandanaarachchi et al. [17] proposed unsupervised anomaly detection ensembles using item response theory. Meira et al. [22] presented fast anomaly detection with locality-sensitive hashing and hyperparameter autotuning. Wang et al. [32] provided outlier detection based on a weighted neighborhood information network for mixed-valued data sets.

1.2 Motivation and inspiration

The existing detection methods apply to numerical data or mixed data, and only a few methods can deal with data using categorical attributes, but categorical attributes account for the majority of data in real life and applications. In addition, data processing methods for casting numerical data into categorical data tend to be more complex and sometimes lead to suboptimal results, while categorical data can be converted into numerical data more easily. Therefore, outlier detection for categorical data is still a topic that needs to be continually explored and developed.

The information amount is a basic metric for the uncertainty of an information system. However, no study of outlier detection based on information amount has yet been reported.

Based on the research motivations stated above, we propose a new unsupervised method for spotting outliers that takes advantage of the information amount in rough set theory. Our method’s main idea is as follows. If we have an information system and a set of indiscernibility relations on domain U, we conceive of each equivalence class in the partition as a collection of objects on U. This is because every indiscernibility relation makes U partitionable. In other words, the indiscernibility relationship separates U into multiple groups. Then, based on the information amount, we compute the relative information amount and relative cardinality of object x under the indiscernibility relation, which provides us with a measure of x’s uncertainty. Because uncertainty can be considered an abnormal quality, the calculation of the relative information amount helps to identify group objects that are not common on U. The relative information amount, relative cardinality, and weights are then utilized to construct the information amount-based outlier factor, which explains the outlier. Finally, if the outlier factor of x under these relations is always very high, we can treat x as an information amount-based outlier on U with respect to the information system.

The main contributions are summarized as follows:

(1) The information amount and relative information amount for categorical data are proposed, which provide a comprehensive measure for the uncertainty of categorical data. The degree of outlierness and weight function are presented to find outlier factors.

(2) An outlier detection method based on information amount is presented. Instead of labeling data and determining the neighborhood radius to identify outlier samples, this method measures the difference in the information amount after abandoning a certain object. Compared with neighborhood-based and clustering-based methods, it does not need to consider finding out the optimal parameter values, nor does it need to calculate the distance between two objects.

(3) The information amount-based outlier factor method is given. The experimental results show that the proposed method has better validity and adaptability for categorical data.

1.3 Organization

The remainder of this paper is structured as follows. Section 2 briefly reviews information systems and information amount. Section 3 introduces the definitions and concepts needed for the outlier detection method based on information amount, and Section 4 describes the proposed detection algorithm together with its time and space complexity. The experimental findings are presented in Section 5, along with a discussion of the findings following statistical analysis. Section 6 concludes the paper. Fig. 1 depicts the research framework of this paper.

Fig. 1

The research framework of this paper.

2 Preliminaries

This section recalls information systems and the concept of information amount. In this paper, O and A are two non-empty finite sets; 2^O denotes the power set of O, and |X| represents the cardinality of X ∈ 2^O. Concretely, let $O = \{o_{1}, o_{2}, \dots, o_{n}\}$ and $A = \{a_{1}, a_{2}, \dots, a_{m}\} .$

Definition 2.1. [23] Let O be a finite set of objects and A a finite set of attributes. Then (O, A) is called an information system (IS), if each attribute a ∈ A determines an information function a : O → V_a, where V_a = {a (o) : o ∈ O}.

If the attribute in (O, A) is categorical, then (O, A) is known as a categorical information system (CIS).

In a given CIS (O, A), for any subset of attributes P ⊆ A, the indiscernibility relation under attribute subset P is defined as follows: $ind (P) = \{(o, o^{'}) \in O \times O : \forall a \in P, a (o) = a (o^{'})\} .$ (1)

Obviously, ind (P) is an equivalence relation on O. Denote $[o]_{P} = \{o^{'} \in O : (o, o^{'}) \in ind (P)\} .$ (2)

Then [o] _P is known as the equivalence class of o under relation ind (P).

Proposition 2.2. Let (O, A) be a CIS, then for any attribute subsets P₁ ⊆ P₂ ⊆ A and any object o in O, $[o]_{P_{2}} \subseteq [o]_{P_{1}} .$

Proof. If o′ ∈ [o] _{P₂}, then a (o′) = a (o) for every a ∈ P₂ and hence for every a ∈ P₁, so o′ ∈ [o] _{P₁}. □
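As a minimal illustration of Eqs. (1) and (2) and of Proposition 2.2, the following Python sketch groups objects by their attribute values. The toy table, the attribute names `colour` and `size`, and the helper names `partition` and `eq_class` are assumptions made for this illustration and do not come from the paper.

```python
from collections import defaultdict

def partition(table, attrs):
    """Return O/ind(P): objects grouped by their value tuple on attrs."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj)
    return list(groups.values())

def eq_class(table, attrs, obj):
    """Return [obj]_P, the block of the partition that contains obj."""
    return next(block for block in partition(table, attrs) if obj in block)

# Hypothetical categorical table (not taken from the paper).
toy = {
    "o1": {"colour": "red", "size": "small"},
    "o2": {"colour": "red", "size": "big"},
    "o3": {"colour": "blue", "size": "small"},
    "o4": {"colour": "red", "size": "small"},
}

coarse = eq_class(toy, ["colour"], "o1")          # P1 = {colour}
fine = eq_class(toy, ["colour", "size"], "o1")    # P2 = {colour, size}
print(fine <= coarse)  # True: a larger attribute subset can only split blocks
```

Grouping through a dictionary keyed by the value tuple visits each object once, so the partition is obtained without comparing objects pairwise.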

Definition 2.3. [39] For a given CIS (O, A), where O = {o₁, o₂, ⋯ , o_n} is a finite set of objects, and an attribute subset P ⊆ A, let the partition of O induced by relation ind (P) be O/ind (P) = {X₁, X₂, ⋯ , X_m}. The information amount E (P) of (O, P) is defined as $E (P) = \sum_{i = 1}^{m} \frac{| X_{i} |}{n} \frac{| O - X_{i} |}{n} .$ (3)

Proposition 2.4. For a given CIS (O, A), where O = {o₁, o₂, ⋯ , o_n} is a non-empty finite set of objects, and an attribute subset P ⊆ A, the information amount E (P) of (O, P) can be calculated as follows: $E (P) = \frac{1}{n} \sum_{i = 1}^{n} (1 - \frac{| [o_{i}]_{P} |}{n}) .$ (4)

Proof. Denote $O / ind (P) = \{X_{1}, X_{2}, \dots, X_{m}\} .$

Suppose that $X_{i} = \{o_{i 1}, o_{i 2}, \dots, o_{i s_{i}}\}$; then |X_i| = s_i. So $\sum_{i = 1}^{m} s_{i} = n$ and, ∀ i, $X_{i} = [o_{i 1}]_{P} = [o_{i 2}]_{P} = \dots = [o_{i s_{i}}]_{P} .$

This implies that ∀ i, $| X_{i} | = | [o_{i 1}]_{P} | = | [o_{i 2}]_{P} | = \dots = | [o_{i s_{i}}]_{P} | = s_{i} .$

Thus ∀ i,

$\begin{matrix} \frac{| X_{i} |}{n} \frac{| O - X_{i} |}{n} & = \frac{| X_{i} |}{n} (1 - \frac{| X_{i} |}{n}) \\ & = s_{i} \frac{1}{n} (1 - \frac{| [o_{ik}]_{P} |}{n}) \\ & = \sum_{k = 1}^{s_{i}} \frac{1}{n} (1 - \frac{| [o_{ik}]_{P} |}{n}) . \end{matrix}$

Hence $\begin{matrix} E (P) & = \sum_{i = 1}^{m} \frac{| X_{i} |}{n} \frac{| O - X_{i} |}{n} \\ & = \sum_{i = 1}^{m} \sum_{k = 1}^{s_{i}} \frac{1}{n} (1 - \frac{| [o_{ik}]_{P} |}{n}) \\ & = \frac{1}{n} \sum_{i = 1}^{m} \sum_{k = 1}^{s_{i}} (1 - \frac{| [o_{ik}]_{P} |}{n}) \\ & = \frac{1}{n} \sum_{i = 1}^{n} (1 - \frac{| [o_{i}]_{P} |}{n}) . \end{matrix}$ □
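Proposition 2.4 can also be checked numerically. The sketch below evaluates Eq. (3) block by block and Eq. (4) object by object, using exact fractions so that the comparison is not blurred by rounding. The function names and the block sizes (2, 3, 1) are illustrative assumptions.

```python
from fractions import Fraction

def info_amount_by_blocks(sizes):
    """Eq. (3): sum over blocks X_i of |X_i|/n * |O - X_i|/n."""
    n = sum(sizes)
    return sum(Fraction(s, n) * Fraction(n - s, n) for s in sizes)

def info_amount_by_objects(sizes):
    """Eq. (4): average over objects o_i of 1 - |[o_i]_P|/n."""
    n = sum(sizes)
    # Each of the s objects in a block of size s contributes 1 - s/n.
    per_object = [1 - Fraction(s, n) for s in sizes for _ in range(s)]
    return sum(per_object) / n

sizes = [2, 3, 1]  # hypothetical block sizes of O/ind(P), so n = 6
print(info_amount_by_blocks(sizes))   # 11/18
print(info_amount_by_objects(sizes))  # 11/18
```

The object-wise form is convenient in practice because it only needs the size of each object's own equivalence class.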

3 Outlier detection in a CIS based on the information amount

In this section, we present a formally and strictly defined method for detecting outliers in a CIS based on information amount.

Let (O, A) be a CIS. ∀ P ⊆ A, ∀ x ∈ O, remove all objects belonging to the equivalence class [x] _P from O and let $O_{P}^{x} = O - [x]_{P} = \{x_{1}, x_{2}, \dots, x_{s}\};$ (5)

${ind}^{x} (P) = \{(o, o^{'}) \in O_{P}^{x} \times O_{P}^{x} : \forall a \in P, a (o) = a (o^{'})\};$ (6)

$[o]_{P}^{x} = \{o^{'} \in O_{P}^{x} : (o, o^{'}) \in {ind}^{x} (P)\} .$ (7)

Naturally, following Proposition 2.4, the information amount of $O_{P}^{x}$ under ${ind}^{x} (P)$, that is, after [x] _P has been removed from O, is defined as $E_{x} (P) = \frac{1}{s} \sum_{i = 1}^{s} (1 - \frac{| [x_{i}]_{P}^{x} |}{s}) .$ (8) Information amount is used to measure information uncertainty. E (P) denotes the information amount including all objects in the CIS, while E_x (P) represents the information amount after removing the class [x] _P. To detect outliers, a new concept, the relative information amount, is put forward to measure the impact of x.

Definition 3.1. Suppose that (O, A) is a CIS. Then ∀ P ⊆ A, ∀ x ∈ O, the relative information amount RE_P (x) of object x under (O, P) is defined as ${RE}_{P} (x) = \begin{cases} 1 - \frac{E_{x} (P)}{E (P)}, & E_{x} (P) < E (P); \\ 0, & E_{x} (P) \geq E (P) . \end{cases}$ (9)

It is easy to derive 0 ≤ RE_P (x) ≤ 1.

In particular, E_x (P) = 0 and RE_P (x) = 1 when O/ind (P) = {[x] _P, O - [x] _P}.
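The following sketch shows how Eqs. (8) and (9) might be evaluated from the block sizes of O/ind (P) alone. The function names are hypothetical, and treating the information amount of an empty universe as 0 (the case in which the partition has a single block) is an assumption of this sketch, not something the paper states.

```python
def info_amount(sizes):
    """E(P) from block sizes via Eq. (3); 0.0 for an empty universe (assumption)."""
    n = sum(sizes)
    if n == 0:
        return 0.0
    return sum(s * (n - s) for s in sizes) / (n * n)

def relative_info_amount(sizes, idx):
    """RE_P(x) of Eq. (9), where x lies in the block of size sizes[idx]."""
    e_all = info_amount(sizes)
    e_x = info_amount(sizes[:idx] + sizes[idx + 1:])  # Eq. (8): drop [x]_P
    return 1 - e_x / e_all if e_x < e_all else 0.0

# Hypothetical block sizes; x is taken from the first block in each case.
print(round(relative_info_amount([2, 3, 1], 0), 3))  # 0.386
print(relative_info_amount([2, 4], 0))               # 1.0: only two blocks
```

Because the second branch of Eq. (9) applies whenever E_x (P) ≥ E (P), the division is only performed when E (P) > 0.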

Definition 3.2. Suppose that (O, A) is a CIS, ∀ P ⊆ A, and ∀ x ∈ O. Let $O_{P}^{x} / ind (P) = \{X_{1}^{'}, X_{2}^{'}, \dots, X_{m - 1}^{'}\}$ be the partition of $O_{P}^{x}$ induced by relation ind (P). Define ${RC}_{P} (x) = | [x]_{P} | - \frac{\sum_{i = 1}^{m - 1} | X_{i}^{'} |}{m - 1} .$ (10) Then RC_P (x) is known as the relative cardinality of [x] _P.

In particular, RC_P (x) = |O| when m = 1.

From the form of the formula, relative cardinality denotes the difference between the cardinality of [x] _P and the mean cardinality of other equivalence classes. In general, the cardinality of the minority class is smaller than the cardinality of the majority class. It means the cardinality of [x] _P is larger than that of the equivalence classes in $O_{P}^{x} / ind (P)$ when RC_P (x) >0. In other words, x belongs to a majority class of O. Conversely, x belongs to a minority class of O when RC_P (x) ≤0. From the above definition, it is not difficult to prove that ∀ P, ∀ x ∈ O, 2 - |O| ≤ RC_P (x) ≤ |O|.
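A short sketch of Eq. (10), again working from block sizes only. The function name and the example sizes are assumptions for illustration; the single-block case follows the convention RC_P (x) = |O| stated above.

```python
def relative_cardinality(sizes, idx):
    """RC_P(x) of Eq. (10), where x lies in the block of size sizes[idx]."""
    others = sizes[:idx] + sizes[idx + 1:]
    if not others:                      # m = 1: RC_P(x) = |O|
        return float(sum(sizes))
    return sizes[idx] - sum(others) / len(others)

sizes = [2, 3, 1]  # hypothetical block sizes of O/ind(P), |O| = 6
print([relative_cardinality(sizes, i) for i in range(len(sizes))])  # [0.0, 1.5, -1.5]
```

A positive value marks a block larger than the average of the remaining blocks, and a non-positive value marks a minority block.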

Suppose that (O, A) is a CIS with A = {a₁, a₂, ⋯ , a_m}. We rearrange the attributes in A to get a sequence $A^{'} = \langle a_{1}^{'}, a_{2}^{'}, \dots, a_{m}^{'} \rangle$ according to the following condition: $\forall 1 \leq i < m, E (\{a_{i}^{'}\}) \leq E (\{a_{i + 1}^{'}\}) .$ Then a sequence of attribute subsets can be constructed as follows: $A_{1} = \{a_{1}^{'}, a_{2}^{'}, \dots, a_{m}^{'}\},$ $A_{2} = \{a_{2}^{'}, \dots, a_{m}^{'}\}, \dots,$ $A_{m} = \{a_{m}^{'}\} .$
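The construction of A₁, …, A_m can be sketched as a sort followed by taking suffixes. The function name and the numeric values are illustrative assumptions. The text does not say how ties between attributes with equal information amount are broken; this sketch keeps the original attribute order in that case, which is an assumption.

```python
def nested_subsets(e_single):
    """Sort attributes by ascending E({a}) and return A_1, ..., A_m as suffixes."""
    ordered = sorted(e_single, key=e_single.get)  # stable: ties keep input order
    return [ordered[j:] for j in range(len(ordered))]

# Hypothetical single-attribute information amounts.
e_single = {"a1": 0.611, "a2": 0.444, "a3": 0.444}
for j, subset in enumerate(nested_subsets(e_single), start=1):
    print(f"A_{j} = {subset}")
# A_1 = ['a2', 'a3', 'a1']
# A_2 = ['a3', 'a1']
# A_3 = ['a1']
```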

Definition 3.3. Suppose that (O, A) is a CIS. ∀ P ⊆ A, ∀ x ∈ O, define ${DO}_{P} (x) = \begin{cases} {RE}_{P} (x) \frac{n - {RC}_{P} (x)}{2 n}, & {RC}_{P} (x) > 0; \\ {RE}_{P} (x) \sqrt{\frac{n - {RC}_{P} (x)}{2 n}}, & {RC}_{P} (x) \leq 0 . \end{cases}$ (11) Then DO_P (x) is known as the degree of outlierness of x in (O, P).

Denote ${DO}_{a} (x) = {DO}_{\{a\}} (x) .$
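Eq. (11) translates almost directly into code. In the sketch below, the function name and the input values are assumptions chosen for illustration; `re` and `rc` stand for RE_P (x) and RC_P (x), and `n` for |O|.

```python
from math import sqrt

def degree_of_outlierness(re, rc, n):
    """DO_P(x) of Eq. (11), given RE_P(x), RC_P(x) and n = |O|."""
    ratio = (n - rc) / (2 * n)
    # Majority blocks (rc > 0) use the ratio itself, minority blocks its square root.
    return re * ratio if rc > 0 else re * sqrt(ratio)

print(round(degree_of_outlierness(0.386, 0.0, 6), 3))   # 0.273
print(round(degree_of_outlierness(0.273, 1.5, 6), 3))   # 0.102
print(round(degree_of_outlierness(1.0, -2.0, 6), 3))    # 0.816
```

When RC_P (x) ≤ 0, the ratio lies in [1/2, 1), so its square root is larger than the ratio itself; for the same relative information amount, an object in a minority block therefore receives a higher degree of outlierness.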

Definition 3.4. Let (O, A) be a CIS. Then ∀ P ⊆ A, ∀ x ∈ O, the weight function of [x] _P is determined by $\omega_{P} (x) = \sqrt{\frac{| [x]_{P} |}{n}} .$ (12)

Denote $\omega_{a} (x) = \omega_{\{a\}} (x) .$

Definition 3.5. Let (O, A) be a CIS with A = {a₁, a₂, ⋯ , a_m}. ∀ x ∈ O, define

$\begin{matrix} IAOF (x) & = 1 - \frac{\sum_{j = 1}^{m} \omega_{a_{j}} (x) (1 - {DO}_{a_{j}} (x))}{2 m - 1} \\ & \quad - \frac{\sum_{j = 1}^{m - 1} \omega_{A_{j}} (x) (1 - {DO}_{A_{j}} (x))}{2 m - 1} . \end{matrix}$ (13)

Then IAOF (x) is known as the information amount-based outlier factor (IAOF) of x.

Definition 3.6. Let (O, A) be a CIS. Given a threshold μ ∈ [0, 1] and x ∈ O, if IAOF (x) > μ, then x is known as an information amount-based μ-outlier.

In this article, the set of all μ-outliers in a CIS is denoted as Ω_μ. Next, an illustrative example is shown in detail.

Example 3.7. Table 1 represents a CIS where O = {o₁, o₂, o₃, o₄, o₅, o₆} and A = {a₁, a₂, a₃}. Meanwhile, Table 2 lists the equivalence classes for each object.

Table 1

Initial CIS for Example 3.7

O	a₁	a₂	a₃
o₁	normal	small	yes
o₂	normal	small	no
o₃	high	big	no
o₄	high	small	no
o₅	high	small	no
o₆	very high	big	yes

Table 2

The equivalence class of each object

O	[o] _{a₁}	[o] _{a₂}	[o] _{a₃}
o₁	{o₁, o₂}	{o₁, o₂, o₄, o₅}	{o₁, o₆}
o₂	{o₁, o₂}	{o₁, o₂, o₄, o₅}	{o₂, o₃, o₄, o₅}
o₃	{o₃, o₄, o₅}	{o₃, o₆}	{o₂, o₃, o₄, o₅}
o₄	{o₃, o₄, o₅}	{o₁, o₂, o₄, o₅}	{o₂, o₃, o₄, o₅}
o₅	{o₃, o₄, o₅}	{o₁, o₂, o₄, o₅}	{o₂, o₃, o₄, o₅}
o₆	{o₆}	{o₃, o₆}	{o₁, o₆}

For {a₁}, O/ind ({a₁}) = {{o₁, o₂} , {o₃, o₄, o₅} , {o₆}} .

For {a₂}, O/ind ({a₂}) = {{o₁, o₂, o₄, o₅} , {o₃, o₆}} .

For {a₃}, O/ind ({a₃}) = {{o₁, o₆} , {o₂, o₃, o₄, o₅}} .

(1) According to Proposition 2.4 mentioned above, the information amount for each singleton attribute subset of A is

$E ({a_{1}}) = \frac{1}{6} \sum_{i = 1}^{6} (1 - \frac{| [o_{i}]_{{a_{1}}} |}{6}) = \frac{1}{6} \times [(1 - \frac{2}{6}) \times 2 + (1 - \frac{3}{6}) \times 3 + (1 - \frac{1}{6})] = 0.611;$

$E ({a_{2}}) = \frac{1}{6} \sum_{i = 1}^{6} (1 - \frac{| [o_{i}]_{{a_{2}}} |}{6}) = \frac{1}{6} \times [(1 - \frac{4}{6}) \times 4 + (1 - \frac{2}{6}) \times 2] = 0.444;$

$E ({a_{3}}) = \frac{1}{6} \sum_{i = 1}^{6} (1 - \frac{| [o_{i}]_{{a_{3}}} |}{6}) = \frac{1}{6} \times [(1 - \frac{2}{6}) \times 2 + (1 - \frac{4}{6}) \times 4] = 0.444 .$

$E_{o_{1}} ({a_{1}}) = \frac{1}{4} \sum_{i = 1}^{4} (1 - \frac{| [x_{i}]_{{a_{1}}}^{o_{1}} |}{4}) = \frac{1}{4} \times [(1 - \frac{3}{4}) \times 3 + (1 - \frac{1}{4}) \times 1] = 0.375 = E_{o_{2}} ({a_{1}});$

$E_{o_{3}} ({a_{1}}) = \frac{1}{3} \sum_{i = 1}^{3} (1 - \frac{| [x_{i}]_{{a_{1}}}^{o_{3}} |}{3}) = \frac{1}{3} \times [(1 - \frac{2}{3}) \times 2 + (1 - \frac{1}{3}) \times 1] = 0.444 = E_{o_{4}} ({a_{1}}) = E_{o_{5}} ({a_{1}});$

$E_{o_{6}} ({a_{1}}) = \frac{1}{5} \sum_{i = 1}^{5} (1 - \frac{| [x_{i}]_{{a_{1}}}^{o_{6}} |}{5}) = \frac{1}{5} \times [(1 - \frac{2}{5}) \times 2 + (1 - \frac{3}{5}) \times 3] = 0.480;$

$E_{o_{1}} ({a_{2}}) = \frac{1}{2} \sum_{i = 1}^{2} (1 - \frac{| [x_{i}]_{{a_{2}}}^{o_{1}} |}{2}) = \frac{1}{2} \times [(1 - \frac{2}{2}) \times 2] = 0 = E_{o_{2}} ({a_{2}}) = E_{o_{4}} ({a_{2}}) = E_{o_{5}} ({a_{2}}) = E_{o_{6}} ({a_{2}});$

$E_{o_{1}} ({a_{3}}) = \frac{1}{4} \sum_{i = 1}^{4} (1 - \frac{| [x_{i}]_{{a_{3}}}^{o_{1}} |}{4}) = \frac{1}{4} \times [(1 - \frac{4}{4}) \times 4] = 0 = E_{o_{2}} ({a_{3}}) = E_{o_{3}} ({a_{3}}) = E_{o_{4}} ({a_{3}}) = E_{o_{5}} ({a_{3}}) = E_{o_{6}} ({a_{3}}) .$

(2) A descending sequence of attribute subsets can be constructed as A₁ = {a₂, a₃, a₁} , A₂ = {a₃, a₁} , A₃ = {a₁} since E ({a₁}) > E ({a₂}) = E ({a₃}).

For A₁, O/ind (A₁) = {{o₁} , {o₂} , {o₃} , {o₄, o₅} , {o₆}} .

For A₂, O/ind (A₂) = {{o₁} , {o₂} , {o₃, o₄, o₅} , {o₆}} .

For A₃, O/ind (A₃) = {{o₁, o₂} , {o₃, o₄, o₅} , {o₆}} .

Analogously, the information amount for each attribute subset in the sequence becomes

$E (A_{1}) = \frac{1}{6} \sum_{i = 1}^{6} (1 - \frac{| [o_{i}]_{A_{1}} |}{6}) = \frac{1}{6} \times [(1 - \frac{1}{6}) \times 4 + (1 - \frac{2}{6}) \times 2] = 0.777;$

$E (A_{2}) = \frac{1}{6} \sum_{i = 1}^{6} (1 - \frac{| [o_{i}]_{A_{2}} |}{6}) = \frac{1}{6} \times [(1 - \frac{1}{6}) \times 3 + (1 - \frac{3}{6}) \times 3] = 0.667;$

$E (A_{3}) = \frac{1}{6} \sum_{i = 1}^{6} (1 - \frac{| [o_{i}]_{A_{3}} |}{6}) = \frac{1}{6} \times [(1 - \frac{2}{6}) \times 2 + (1 - \frac{3}{6}) \times 3 + (1 - \frac{1}{6}) \times 1] = 0.611 .$

And the information amount for each attribute subset in the sequence after removing an equivalence class is

$E_{o_{1}} (A_{1}) = \frac{1}{5} \sum_{i = 1}^{5} (1 - \frac{| [x_{i}]_{A_{1}}^{o_{1}} |}{5}) = \frac{1}{5} \times [\frac{4}{5} \times 3 + \frac{3}{5} \times 2] = 0.720 = E_{o_{2}} (A_{1}) = E_{o_{3}} (A_{1}) = E_{o_{6}} (A_{1});$

$E_{o_{4}} (A_{1}) = \frac{1}{4} \sum_{i = 1}^{4} (1 - \frac{| [x_{i}]_{A_{1}}^{o_{4}} |}{4}) = \frac{1}{4} \times [(1 - \frac{1}{4}) \times 4] = 0.75 = E_{o_{5}} (A_{1});$

$E_{o_{1}} (A_{2}) = \frac{1}{5} \sum_{i = 1}^{5} (1 - \frac{| [x_{i}]_{A_{2}}^{o_{1}} |}{5}) = \frac{1}{5} \times [(1 - \frac{1}{5}) \times 2 + (1 - \frac{3}{5}) \times 3] = 0.56 = E_{o_{2}} (A_{2}) = E_{o_{6}} (A_{2});$

$E_{o_{3}} (A_{2}) = \frac{1}{3} \sum_{i = 1}^{3} (1 - \frac{| [x_{i}]_{A_{2}}^{o_{3}} |}{3}) = \frac{1}{3} \times [(1 - \frac{1}{3}) \times 3] = 0.667 = E_{o_{4}} (A_{2}) = E_{o_{5}} (A_{2});$

$E_{o_{1}} (A_{3}) = \frac{1}{4} \sum_{i = 1}^{4} (1 - \frac{| [x_{i}]_{A_{3}}^{o_{1}} |}{4}) = \frac{1}{4} \times [(1 - \frac{3}{4}) \times 3 + (1 - \frac{1}{4}) \times 1] = 0.375 = E_{o_{2}} (A_{3}) .$

$E_{o_{3}} (A_{3}) = \frac{1}{3} \sum_{i = 1}^{3} (1 - \frac{| [x_{i}]_{A_{3}}^{o_{3}} |}{3}) = \frac{1}{3} \times [(1 - \frac{2}{3}) \times 2 + (1 - \frac{1}{3}) \times 1] = 0.444 = E_{o_{4}} (A_{3}) = E_{o_{5}} (A_{3}) .$

$E_{o_{6}} (A_{3}) = \frac{1}{5} \sum_{i = 1}^{5} (1 - \frac{| [x_{i}]_{A_{3}}^{o_{6}} |}{5}) = \frac{1}{5} \times [(1 - \frac{2}{5}) \times 2 + (1 - \frac{3}{5}) \times 3] = 0.48 .$

(3) Based on Definition 3.1 and the above results, we next determine the relative information amount of different subsystems.

${RE}_{{a_{1}}} (o_{1}) = 1 - \frac{E_{o_{1}} ({a_{1}})}{E ({a_{1}})} = 1 - \frac{0.375}{0.611} = 0.386 = {RE}_{{a_{1}}} (o_{2});$

${RE}_{{a_{1}}} (o_{3}) = 1 - \frac{E_{o_{3}} ({a_{1}})}{E ({a_{1}})} = 1 - \frac{0.444}{0.611} = 0.273 = {RE}_{{a_{1}}} (o_{4}) = {RE}_{{a_{1}}} (o_{5});$

${RE}_{\{a_{1}\}} (o_{6}) = 1 - \frac{E_{o_{6}} (\{a_{1}\})}{E (\{a_{1}\})} = 1 - \frac{0.48}{0.611} = 0.214;$

${RE}_{\{a_{2}\}} (o_{1}) = 1 - \frac{E_{o_{1}} (\{a_{2}\})}{E (\{a_{2}\})} = 1 - \frac{0}{0.444} = 1 = {RE}_{\{a_{2}\}} (o_{2}) = {RE}_{\{a_{2}\}} (o_{3}) = {RE}_{\{a_{2}\}} (o_{4}) = {RE}_{\{a_{2}\}} (o_{5}) = {RE}_{\{a_{2}\}} (o_{6});$

${RE}_{\{a_{3}\}} (o_{1}) = 1 - \frac{E_{o_{1}} (\{a_{3}\})}{E (\{a_{3}\})} = 1 - \frac{0}{0.444} = 1 = {RE}_{\{a_{3}\}} (o_{2}) = {RE}_{\{a_{3}\}} (o_{3}) = {RE}_{\{a_{3}\}} (o_{4}) = {RE}_{\{a_{3}\}} (o_{5}) = {RE}_{\{a_{3}\}} (o_{6}) .$

Similarly, we can obtain

${RE}_{A_{1}} (o_{1}) = 1 - \frac{E_{o_{1}} (A_{1})}{E (A_{1})} = 1 - \frac{0.72}{0.777} = 0.073 = {RE}_{A_{1}} (o_{2}) = {RE}_{A_{1}} (o_{3}) = {RE}_{A_{1}} (o_{6});$

${RE}_{A_{1}} (o_{4}) = 1 - \frac{E_{o_{4}} (A_{1})}{E (A_{1})} = 1 - \frac{0.75}{0.777} = 0.035 = {RE}_{A_{1}} (o_{5});$

${RE}_{A_{2}} (o_{1}) = 1 - \frac{E_{o_{1}} (A_{2})}{E (A_{2})} = 1 - \frac{0.56}{0.667} = 0.16 = {RE}_{A_{2}} (o_{2}) = {RE}_{A_{2}} (o_{6});$

${RE}_{A_{2}} (o_{3}) = 1 - \frac{E_{o_{3}} (A_{2})}{E (A_{2})} = 1 - \frac{0.667}{0.667} = 0 = {RE}_{A_{2}} (o_{4}) = {RE}_{A_{2}} (o_{5});$

${RE}_{A_{3}} (o_{1}) = 1 - \frac{E_{o_{1}} (A_{3})}{E (A_{3})} = 1 - \frac{0.375}{0.611} = 0.386 = {RE}_{A_{3}} (o_{2});$

${RE}_{A_{3}} (o_{3}) = 1 - \frac{E_{o_{3}} (A_{3})}{E (A_{3})} = 1 - \frac{0.444}{0.611} = 0.273 = {RE}_{A_{3}} (o_{4}) = {RE}_{A_{3}} (o_{5});$

${RE}_{A_{3}} (o_{6}) = 1 - \frac{E_{o_{6}} (A_{3})}{E (A_{3})} = 1 - \frac{0.48}{0.611} = 0.214 .$

(4) From Definition 3.2, the relative cardinality

${RC}_{\{a_{1}\}} (o_{1}) = | [o_{1}]_{\{a_{1}\}} | - \frac{\sum_{i = 1}^{2} | X_{i}^{'} |}{2} = 2 - \frac{3 + 1}{2} = 0 = {RC}_{\{a_{1}\}} (o_{2});$

${RC}_{{a_{1}}} (o_{3}) = 3 - \frac{2 + 1}{2} = 1.5 = {RC}_{{a_{1}}} (o_{4}) = {RC}_{{a_{1}}} (o_{5});$

${RC}_{{a_{1}}} (o_{6}) = 1 - \frac{2 + 3}{2} = - 1.5;$

${RC}_{A_{1}} (o_{1}) = | [o_{1}]_{A_{1}} | - \frac{\sum_{i = 1}^{4} | X_{i}^{'} |}{4} = 1 - \frac{1 + 1 + 2 + 1}{4} = - 0.25 = {RC}_{A_{1}} (o_{2}) = {RC}_{A_{1}} (o_{3}) = {RC}_{A_{1}} (o_{6});$

${RC}_{A_{1}} (o_{4}) = 2 - \frac{1 + 1 + 1 + 1}{4} = 1 = {RC}_{A_{1}} (o_{5});$

${RC}_{A_{2}} (o_{1}) = 1 - \frac{1 + 3 + 1}{3} = - 0.667 = {RC}_{A_{2}} (o_{2}) = {RC}_{A_{2}} (o_{6});$

${RC}_{A_{2}} (o_{3}) = 3 - \frac{1 + 1 + 1}{3} = 2 = {RC}_{A_{2}} (o_{4}) = {RC}_{A_{2}} (o_{5});$

${RC}_{A_{3}} (o_{1}) = 2 - \frac{3 + 1}{2} = 0 = {RC}_{A_{3}} (o_{2});$

${RC}_{A_{3}} (o_{3}) = 3 - \frac{2 + 1}{2} = 1.5 = {RC}_{A_{3}} (o_{4}) = {RC}_{A_{3}} (o_{5});$

${RC}_{A_{3}} (o_{6}) = 1 - \frac{2 + 3}{2} = - 1.5 .$

(5) Correspondingly, from Definition 3.3, the degree of outlierness of o_i can be calculated as follows:

${DO}_{a_{1}} (o_{1}) = {RE}_{{a_{1}}} (o_{1}) \sqrt{\frac{6 - {RC}_{{a_{1}}} (o_{1})}{12}} = 0.386 \times \sqrt{\frac{6 - 0}{12}} = 0.273 = {DO}_{a_{1}} (o_{2});$

${DO}_{a_{1}} (o_{3}) = {RE}_{{a_{1}}} (o_{3}) \frac{6 - {RC}_{{a_{1}}} (o_{3})}{12} = 0.273 \times \frac{6 - 1.5}{12} = 0.102 = {DO}_{a_{1}} (o_{4}) = {DO}_{a_{1}} (o_{5});$

${DO}_{a_{1}} (o_{6}) = {RE}_{\{a_{1}\}} (o_{6}) \sqrt{\frac{6 - {RC}_{\{a_{1}\}} (o_{6})}{12}} = 0.214 \times \sqrt{\frac{6 + 1.5}{12}} = 0.169;$

${DO}_{a_{2}} (o_{1}) = {RE}_{{a_{2}}} (o_{1}) \frac{6 - {RC}_{{a_{2}}} (o_{1})}{12} = 1 \times \frac{6 - 2}{12} = 0.333 = {DO}_{a_{2}} (o_{2}) = {DO}_{a_{2}} (o_{4}) = {DO}_{a_{2}} (o_{5});$

${DO}_{a_{2}} (o_{3}) = {RE}_{{a_{2}}} (o_{3}) \sqrt{\frac{6 - {RC}_{{a_{2}}} (o_{3})}{12}} = 1 \times \sqrt{\frac{6 + 2}{12}} = 0.816 = {DO}_{a_{2}} (o_{6}) = {DO}_{a_{3}} (o_{1}) = {DO}_{a_{3}} (o_{6});$

${DO}_{a_{3}} (o_{2}) = {RE}_{{a_{3}}} (o_{2}) \frac{6 - {RC}_{{a_{3}}} (o_{2})}{12} = 1 \times \frac{6 - 2}{12} = 0.333 = {DO}_{a_{3}} (o_{3}) = {DO}_{a_{3}} (o_{4}) = {DO}_{a_{3}} (o_{5}) .$

Meanwhile, the degree of outlierness of different objects with respect to different subsets of attributes is

${DO}_{A_{1}} (o_{1}) = {RE}_{A_{1}} (o_{1}) \sqrt{\frac{6 - {RC}_{A_{1}} (o_{1})}{12}} = 0.073 \times \sqrt{\frac{6 + 0.25}{12}} = 0.053 = {DO}_{A_{1}} (o_{2}) = {DO}_{A_{1}} (o_{3}) = {DO}_{A_{1}} (o_{6});$

${DO}_{A_{1}} (o_{4}) = {RE}_{A_{1}} (o_{4}) \frac{6 - {RC}_{A_{1}} (o_{4})}{12} = 0.035 \times \frac{6 - 1}{12} \approx 0.014 = {DO}_{A_{1}} (o_{5});$

${DO}_{A_{2}} (o_{1}) = {RE}_{A_{2}} (o_{1}) \sqrt{\frac{6 - {RC}_{A_{2}} (o_{1})}{12}} = 0.16 \times \sqrt{\frac{6 + 0.667}{12}} = 0.119 = {DO}_{A_{2}} (o_{2}) = {DO}_{A_{2}} (o_{6});$

${DO}_{A_{2}} (o_{3}) = 0 = {DO}_{A_{2}} (o_{4}) = {DO}_{A_{2}} (o_{5});$

${DO}_{A_{3}} (o_{1}) = 0.386 \times \sqrt{\frac{6 - 0}{12}} = 0.273 = {DO}_{A_{3}} (o_{2});$

${DO}_{A_{3}} (o_{3}) = 0.273 \times \frac{6 - 1.5}{12} = 0.102 = {DO}_{A_{3}} (o_{4}) = {DO}_{A_{3}} (o_{5});$

${DO}_{A_{3}} (o_{6}) = 0.214 \times \sqrt{\frac{6 + 1.5}{12}} = 0.169 .$

(6) In addition, according to Definition 3.4, the weights of different objects with respect to different subsets of attributes can be calculated:

$\omega_{a_{1}} (o_{1}) = \sqrt{\frac{| [o_{1}]_{\{a_{1}\}} |}{6}} = \sqrt{\frac{2}{6}} = 0.577 = \omega_{a_{1}} (o_{2}) = \omega_{a_{2}} (o_{3}) = \omega_{a_{2}} (o_{6}) = \omega_{a_{3}} (o_{1}) = \omega_{a_{3}} (o_{6});$

$ω_{a_{1}} (o_{3}) = \sqrt{\frac{| [o_{3}]_{{a_{1}}} |}{6}} = \sqrt{\frac{3}{6}} = 0.707 = ω_{a_{1}} (o_{4}) = ω_{a_{1}} (o_{5});$

$ω_{a_{1}} (o_{6}) = \sqrt{\frac{| [o_{6}]_{{a_{1}}} |}{6}} = \sqrt{\frac{1}{6}} = 0.408;$

$ω_{a_{2}} (o_{1}) = \sqrt{\frac{| [o_{1}]_{{a_{2}}} |}{6}} = \sqrt{\frac{4}{6}} = 0.816 = ω_{a_{2}} (o_{2}) = ω_{a_{2}} (o_{4}) = ω_{a_{2}} (o_{5}) = ω_{a_{3}} (o_{2}) = ω_{a_{3}} (o_{3}) = ω_{a_{3}} (o_{4}) = ω_{a_{3}} (o_{5}) .$

Besides,

$ω_{A_{1}} (o_{1}) = \sqrt{\frac{| [o_{1}]_{A_{1}} |}{6}} = \sqrt{\frac{1}{6}} = 0.408 = ω_{A_{1}} (o_{2}) = ω_{A_{1}} (o_{3}) = ω_{A_{1}} (o_{6}) = ω_{A_{2}} (o_{1}) = ω_{A_{2}} (o_{2}) = ω_{A_{2}} (o_{6}) = ω_{A_{3}} (o_{6});$

$ω_{A_{1}} (o_{4}) = \sqrt{\frac{| [o_{4}]_{A_{1}} |}{6}} = \sqrt{\frac{2}{6}} = 0.577 = ω_{A_{1}} (o_{5}) = ω_{A_{3}} (o_{1}) = ω_{A_{3}} (o_{2});$

$ω_{A_{2}} (o_{3}) = \sqrt{\frac{| [o_{3}]_{A_{2}} |}{6}} = \sqrt{\frac{3}{6}} = 0.707 = ω_{A_{2}} (o_{4}) = ω_{A_{2}} (o_{5}) = ω_{A_{3}} (o_{3}) = ω_{A_{3}} (o_{4}) = ω_{A_{3}} (o_{5}) .$

(7) As a result, the information amount-based outlier factor of o₁ in (O, A) is given as follows: $\begin{matrix} IAOF (o_{1}) & = 1 - \frac{\sum_{j = 1}^{3} ω_{a_{j}} (o_{1}) (1 - {DO}_{a_{j}} (o_{1}))}{5} \\ - \frac{\sum_{j = 1}^{2} ω_{A_{j}} (o_{1}) (1 - {DO}_{A_{j}} (o_{1}))}{5} \\ \approx 0.6372 . \end{matrix}$

Similarly, we compute each object’s information amount-based outlier factor:

IAOF (o₂) ≈0.5492, IAOF (o₃) ≈0.5243,

IAOF (o₄) = IAOF (o₅) ≈0.4001, IAOF (o₆) ≈0.7406 .

Finally, the IAOF of each object is compared against the judgment threshold μ = 0.72:

IAOF (o₄) = IAOF (o₅) < IAOF (o₃) < IAOF (o₂) < IAOF (o₁) < μ < IAOF (o₆) .

As a result, o₆ has an IAOF greater than the judgment threshold μ, implying that o₆ is an information amount-based μ-outlier in the sense of Definition 3.6.
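The final aggregation step of Example 3.7 can be replayed from the rounded weights and degrees of outlierness obtained in steps (5) and (6). The sketch below is only a check of Eq. (13) on those rounded values; the function name `iaof_from_parts` and the pair layout `(weight, DO)` are assumptions made for this illustration.

```python
def iaof_from_parts(single, nested):
    """Eq. (13). single: m pairs (omega, DO) for {a_1}..{a_m};
    nested: m - 1 pairs (omega, DO) for A_1..A_{m-1}."""
    m = len(single)
    if len(nested) != m - 1:
        raise ValueError("expected m - 1 nested subsets")
    total = sum(w * (1 - do) for w, do in single + nested)
    return 1 - total / (2 * m - 1)

# Rounded values for o6 and o4, in the order a1, a2, a3 and A1, A2.
o6 = iaof_from_parts(
    single=[(0.408, 0.169), (0.577, 0.816), (0.577, 0.816)],
    nested=[(0.408, 0.053), (0.408, 0.119)],
)
o4 = iaof_from_parts(
    single=[(0.707, 0.102), (0.816, 0.333), (0.816, 0.333)],
    nested=[(0.577, 0.014), (0.707, 0.0)],
)
print(round(o6, 4), round(o4, 4))  # 0.7406 0.4001
print(o6 > 0.72 > o4)              # True: only o6 exceeds the threshold
```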

4 Outlier detection algorithm of the IAOF method

Section 3 describes an outlier detection method based on the information amount, including theoretical concepts, specific calculation procedures, and an illustrative example. This section presents the corresponding algorithm and examines its complexity.

Algorithm 1 depicts the process of calculating the IAOF for a CIS (O, A), which includes four loops. Because computing O/ind ({a_j}) and O/ind (A_j) takes O(n log n) and O(m × n log n) time, respectively, the first three loops cost O(m × n log n), O(m² × n log n), and O(mn). As a result, the IAOF method has a worst-case time complexity of O(m² × n log n). The method also employs a two-dimensional array data structure and has a space complexity of O(mn). Therefore, the algorithm is feasible in terms of both running time and memory.

Algorithm 1 IAOF algorithm

Input: (O, A), |O| = n, |A| = m, μ ∈ [0, 1].

Output: Outlier set Ω_μ.

1: Ω_μ ← ∅

2: for j ← 1 to m do

3: Calculate O/ind ({a_j})

4: Calculate E ({a_j})

5: end for

6: Determine $A^{'} = \langle a_{1}^{'}, a_{2}^{'}, \dots, a_{m}^{'} \rangle$

7: Construct A₁, A₂, ⋯ , A_m

8: for j ← 1 to m do

9: Calculate O/ind (A_j)

10: Calculate E (A_j)

11: end for

12: for i ← 1 to n do

13: for j ← 1 to m do

14: Calculate E_{o_i} ({a_j}) and E_{o_i} (A_j)

15: Calculate RE_{{a_j}} (o_i) and RE_{A_j} (o_i)

16: Calculate RC_{{a_j}} (o_i) and RC_{A_j} (o_i)

17: Calculate DO_{a_j} (o_i) and DO_{A_j} (o_i)

18: Calculate ω_{a_j} (o_i) and ω_{A_j} (o_i)

19: end for

20: Calculate IAOF (o_i)

21: if IAOF (o_i) > μ then

22: Ω_μ ← Ω_μ ∪ {o_i}

23: end if

24: end for

25: return Ω_μ.
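Algorithm 1 can be prototyped in a few dozen lines of Python. The version below is a sketch under stated assumptions rather than the authors' implementation: all function names are hypothetical, ties in the attribute ordering keep the input order, the information amount of an empty universe is taken to be 0, and no attempt is made to match the complexity bounds discussed above. It follows Eq. (13) in using the nested subsets A₁, …, A_{m−1}.

```python
from collections import Counter
from math import sqrt

def info_amount(sizes):
    """E(P) from block sizes via Eq. (3); an empty universe gives 0.0 (assumption)."""
    sizes = list(sizes)
    n = sum(sizes)
    return sum(s * (n - s) for s in sizes) / (n * n) if n else 0.0

def degree_and_weight(rows, attrs):
    """Return one (DO_P(o_i), omega_P(o_i)) pair per object for P = attrs."""
    n = len(rows)
    keys = [tuple(row[a] for a in attrs) for row in rows]
    counts = Counter(keys)                                       # block sizes of O/ind(P)
    e_all = info_amount(counts.values())
    pairs = []
    for key in keys:
        own = counts[key]
        others = [s for k, s in counts.items() if k != key]
        e_x = info_amount(others)                                # Eq. (8)
        re = 1 - e_x / e_all if e_x < e_all else 0.0             # Eq. (9)
        rc = own - sum(others) / len(others) if others else n    # Eq. (10)
        ratio = (n - rc) / (2 * n)
        do = re * ratio if rc > 0 else re * sqrt(ratio)          # Eq. (11)
        pairs.append((do, sqrt(own / n)))                        # Eq. (12)
    return pairs

def iaof(rows, attrs, mu):
    """Return (IAOF scores, indices of objects whose score exceeds mu)."""
    m = len(attrs)
    e_single = {a: info_amount(Counter(r[a] for r in rows).values()) for a in attrs}
    ordered = sorted(attrs, key=e_single.get)        # ascending E, stable on ties
    subsets = [[a] for a in attrs] + [ordered[j:] for j in range(m - 1)]
    tables = [degree_and_weight(rows, p) for p in subsets]
    scores = []
    for i in range(len(rows)):
        total = sum(w * (1 - do) for do, w in (t[i] for t in tables))
        scores.append(1 - total / (2 * m - 1))       # Eq. (13)
    return scores, [i for i, s in enumerate(scores) if s > mu]

# Table 1 of Example 3.7, one dict per object o1..o6.
names = ("a1", "a2", "a3")
values = [
    ("normal", "small", "yes"), ("normal", "small", "no"),
    ("high", "big", "no"), ("high", "small", "no"),
    ("high", "small", "no"), ("very high", "big", "yes"),
]
rows = [dict(zip(names, v)) for v in values]
scores, outliers = iaof(rows, list(names), mu=0.72)
print([round(s, 3) for s in scores])
print(outliers)  # [5], i.e. only o6 is flagged
```

Working from block sizes keeps the sketch free of pairwise distance computations, which mirrors the parameter-free character of the method.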

5 Experiments and comparative analyses

In this section, data sets collected from the real world are chosen for experiments to assess the behavior of IAOF. Due to the limited number of categorical data sets provided by the UCI (University of California Irvine) machine learning repository and the problems of a large number of missing values or a high percentage of outliers in some data sets, we selected several data sets from the KEEL (Knowledge Extraction based on Evolutionary Learning) data set repository [2]. The KEEL is available under the GNU General Public License version 3, which can be used for a variety of data discovery applications. It provides a simple user interface that uses data flow to design experiments with different data sets and computational intelligence methods. Almost all of the imbalanced data sets in the KEEL are from UCI. After preprocessing, the original UCI data sets are divided into five folds employing stratified cross-validation. In this way, a sufficient quantity of minority objects in the test partitions is disposed of, and the test partition objects are more representative of the underlying knowledge. Besides, the imbalance rate of the data set ranges from 1.5% to less than 10%, which is extremely suitable for imbalanced data classification as well as outlier detection.

5.1 Experimental settings

5.1.1 Data descriptions

In the experiment, we selected three datasets from the UCI repository, namely the Hayes-Roth dataset (referred to as Hayes), the Congressional Voting Records dataset (referred to as Voting), and the Mammographic Mass dataset (referred to as Mamm). Additionally, five datasets were chosen from the KEEL repository: the MONK-2 dataset (referred to as Monk), the Australian Credit Approval dataset (referred to as Aust), the Pima Indians Diabetes dataset (referred to as Pima), the Breast Cancer Wisconsin (Diagnostic) dataset (referred to as Cancer), and the Solar Flare-2 dataset (referred to as Flare). These datasets comprise categorical attributes (nominal and ordinal) as well as numerical attributes (interval and ratio). Moreover, the datasets contain a few missing values. For datasets with semantically meaningful classes, classes that are rare and deviate from the typical case are defined as outlier classes. Otherwise, if there is no clear deviation, the class with the fewest objects is taken as the outlier class [7, 20].

Consequently, three main steps are involved in data preprocessing: The first step is imputing the missing values. Because of the relatively low percentage of missing values (the data sets Mamm and Voting suffer from 3.41% and 5.63% missing values, respectively), missing values for numerical attributes are filled with the mean, while missing values for categorical attributes are filled with the mode, which is also known as the maximum probability value method. The second step is to cut the numerical attribute data into three bins using the equal-range discretization method to obtain categorical data. The third step is to generate outliers using the downsampling method proposed in [7]. Outlier detection usually involves the random downsampling of a specific class to generate outliers while retaining all instances of the remaining classes to generate inliers. Random downsampling usually results in a dramatic change in the data set. Therefore, to mitigate the effect of randomization after downsampling, we repeat downsampling 10 times for the data sets with more than 10% outliers, use IAOF to detect outliers for each of these 10 variants, and finally select the variant with average detection performance for the comparison experiment. Table 3 shows the detailed information and data preprocessing of each data set.
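
The three preprocessing steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the code used in the experiments: missing values are marked with `None`, "equal-range discretization" is read as equal-width binning, and the function names, the random seed, and the toy columns are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def impute(column, numeric):
    """Fill None entries with the mean (numeric) or the mode (categorical)."""
    present = [v for v in column if v is not None]
    fill = mean(present) if numeric else Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


def equal_width_bins(column, n_bins=3):
    """Map numeric values to bin labels 0..n_bins-1 over equal-width ranges."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0 for _ in column]
    width = (hi - lo) / n_bins
    # min(...) places the maximum value in the last bin instead of a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


def downsample(rows, labels, outlier_label, n_keep, seed=0):
    """Keep every other class and a random n_keep objects of the outlier class."""
    rng = random.Random(seed)
    outlier_idx = [i for i, y in enumerate(labels) if y == outlier_label]
    kept = set(rng.sample(outlier_idx, n_keep))
    keep = [i for i, y in enumerate(labels) if y != outlier_label or i in kept]
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy columns.
ages = impute([20.0, None, 40.0, 80.0], numeric=True)
bins = equal_width_bins(ages)
votes = impute(["y", None, "n", "y"], numeric=False)
kept_rows, kept_labels = downsample(
    list(range(10)), ["out"] * 6 + ["in"] * 4, outlier_label="out", n_keep=2
)
```

Repeating `downsample` with different seeds gives the 10 variants described above for the data sets with more than 10% outliers.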

Table 3
Detailed information and data preprocessing for the data sets

No.	Data set	Preprocessing	Numerical attributes	Categorical attributes	Instances (outliers) without downsampling	Outlier ratio after downsampling
1	Hayes	Class “3” is considered to be outliers and downsampled to 12 objects.	0	5	160(59)	10.52%
2	Monk	“1” is considered to be outliers and downsampled to 16 objects, regarding the integer values as categorical values.	0	6	432(228)	6.56%
3	Voting	“republican” is considered to be outliers and downsampled to 19 objects.	0	16	435(168)	6.64%
4	Cancer	“M” is considered to be outliers and downsampled to 20 objects.	30	0	569(212)	5.30%
5	Aust	“1” and “4” are considered to be outliers and downsampled to 39 objects.	8	6	690(307)	9.24%
6	Pima	“tested positive” is considered to be outliers and downsampled to 44 objects.	8	0	768(268)	8.09%
7	Mamm	“1” is considered to be outliers and downsampled to 37 objects.	5	0	961(445)	5.67%
8	Flare	“F” is considered to be outliers.	0	11	1066(43)	4.03%

5.1.2 Related methods of the experiment

To evaluate our method’s performance quantitatively and comprehensively, we compare it to five other popular outlier detection methods: the k-nearest neighbor method (KNN) [24], the isolation forest method (IForest) [21], the cluster-based local outlier factor method (CBLOF) [14], the information entropy-based method (IEOF) [16], and the neighborhood-based outlier detection method (NED) [5].

KNN is one of the simplest methods in data mining and can be used to classify rare events. To apply KNN for outlier detection, all features of the training data must be numeric, as the method is based on the distance between two objects. Therefore, when categorical features are present in the training data, they must be encoded with dummy variables to convert them into numerical features. Throughout this experiment, the Euclidean distance is used as the distance metric after data transformation, and cross-validation and grid search are used to select the optimal value for the hyperparameter k. IForest does not need to calculate the distance or density of objects and is widely used in industry for outlier detection of structured data due to its linear time complexity and excellent accuracy. CBLOF takes the data set and the cluster model generated by a clustering algorithm as inputs. It classifies the clusters into small clusters and large clusters using the parameters alpha and beta. An anomaly score is then calculated according to the size and distance of the cluster to which the point belongs. When dealing with different types of data, CBLOF uses a variety of clustering methods. In the present experiment, clustering by CBLOF is performed using the k-means method. The parameters alpha and beta required by the CBLOF algorithm are set to 0.9 and 5, respectively, and the optimal number of clusters to form (equivalently, the number of centroids to generate), n_cluster, is searched in the range [1, 50] with a step size of 1. NED measures the local information in a CIS and performs well in outlier detection for mixed data sets as well as categorical attribute data sets.
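
To make the KNN baseline concrete, the following minimal Python sketch dummy-encodes categorical objects and scores each object by the Euclidean distance to its k-th nearest neighbour. Using the k-th neighbour distance as the outlier score is an assumption made for illustration (it is one common KNN-based score, not necessarily the exact variant of [24]), and the function names and toy data are hypothetical.

```python
import math


def one_hot(rows):
    """Dummy-encode categorical rows into 0/1 vectors, one column per attribute value."""
    n_attrs = len(rows[0])
    columns = [(j, v) for j in range(n_attrs) for v in sorted({r[j] for r in rows})]
    return [[1.0 if r[j] == v else 0.0 for j, v in columns] for r in rows]


def knn_scores(vectors, k):
    """Score each object by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, x in enumerate(vectors):
        dists = sorted(math.dist(x, y) for j, y in enumerate(vectors) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical toy data: three identical objects and one that differs on both attributes.
rows = [("a", "p"), ("a", "p"), ("a", "p"), ("b", "q")]
scores = knn_scores(one_hot(rows), k=2)
```

The pairwise loop is quadratic in the number of objects, which is acceptable for data sets of the size used here but would call for an index structure on larger data.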

KNN, CBLOF, IEOF, and NED are proximity-based methods, while IForest is a classification-based method. All in all, the five outlier detection methods mentioned above apply to categorical data indirectly. For the outlier detection experiments, Table 4 gives the optimum parameter settings for both KNN and CBLOF.

Table 4
Optimal parameter for KNN and CBLOF on different data sets

Parameter	Hayes	Monk	Voting	Cancer	Aust	Pima	Mamm	Flare
k of KNN	15	31	51	19	41	51	25	85
n_cluster of CBLOF	11	13	12	7	21	11	11	19

5.1.3 Evaluation metrics

In binary classification, the predicted outcomes may differ from the actual classes, so evaluation metrics such as precision, recall, the receiver operating characteristic (ROC) curve, the area under the curve (AUC), and the F1-score are often employed to evaluate the classification ability of different methods. Outlier detection is essentially a binary classification problem, so we can use the ROC curve, AUC, and F1-score to measure the efficiency of the outlier detection algorithms. In outlier detection, most detection methods output an outlier factor for each object in U. When an object has an outlier factor greater than or equal to a given threshold, it is regarded as an outlier. Therefore, if the given threshold is too large, some outliers will be regarded as inliers; likewise, if the threshold is too small, some inliers will be regarded as outliers. That is, the output objects are divided into four parts, as summarized in Table 5: outliers predicted as outliers (true outliers, TO), outliers predicted as inliers (false inliers, FI), inliers predicted as outliers (false outliers, FO), and inliers predicted as inliers (true inliers, TI).

Precision is the probability that objects predicted to be outliers are true outliers. Recall is the probability that true outliers are predicted to be outliers. Taking precision or recall as an evaluation metric requires a threshold t for the prediction probability: an object is regarded as an outlier when its prediction probability is greater than t, and as an inlier otherwise. These two indicators are calculated by $P (t) = \frac{TO}{TO + FO}, \quad R (t) = \frac{TO}{TO + FI} .$ (14) Generally, the higher these two indicators are, the more effective a method is at detecting outliers. However, the indicators restrict each other: a higher precision usually leads to a lower recall, and a higher recall usually leads to a lower precision. Therefore, to balance precision and recall, researchers introduced the F1-score as a comprehensive indicator for evaluating a classification method. The F1-score is the harmonic mean of precision and recall, with both indicators weighted equally: $F 1 = 2 \times \frac{P \times R}{P + R} .$ (15) On the other hand, the threshold is an additional hyperparameter that affects the generalization ability of a detection method. The ROC curve, whose vertical coordinate is the true outlier rate (TOR = TO/(TO + FI)) and whose horizontal coordinate is the false outlier rate (FOR = FO/(FO + TI)), is obtained by varying the threshold; it therefore does not depend on any single threshold and is currently considered an excellent evaluation indicator [1]. The area enclosed by a ROC curve and the abscissa axis is defined as the AUC. If a curve dominates in ROC space, then the corresponding method dominates in outlier detection. The ROC curve and AUC evaluate the efficiency of an outlier detection method from an intuitive and a quantitative perspective, respectively.
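
The metrics above are simple to compute from the four counts. The Python sketch below is illustrative only, and its function names are hypothetical. It evaluates Eqs. (14) and (15) and computes the AUC through the standard equivalence between the area under the ROC curve and the probability that a randomly chosen outlier receives a larger outlier factor than a randomly chosen inlier. The worked example uses the Aust counts reported for IAOF in Table 6 (8 true outliers among the top 15 objects, 39 outliers in total).

```python
def precision_recall_f1(to, fo, fi):
    """Eqs. (14)-(15) from counts of true outliers, false outliers, and false inliers."""
    precision = to / (to + fo) if to + fo else 0.0
    recall = to / (to + fi) if to + fi else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def auc(scores, is_outlier):
    """Area under the ROC curve: the fraction of (outlier, inlier) pairs in which
    the outlier has the larger score, counting ties as one half."""
    pos = [s for s, y in zip(scores, is_outlier) if y]
    neg = [s for s, y in zip(scores, is_outlier) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Aust, IAOF, top 15 objects: TO = 8, FO = 15 - 8, FI = 39 - 8.
p, r, f1 = precision_recall_f1(to=8, fo=15 - 8, fi=39 - 8)
```

Rounded to four decimals, `p`, `r`, and `f1` are 0.5333, 0.2051, and 0.2963, which agrees with the first Aust row for IAOF in Table 8.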

The comparative experiments in this section are conducted on a computer with an Intel(R) Core(TM) i5-8250U processor (1.80 GHz) and 8 GB of memory. The operating system is Windows 11. The experiments are performed in Python 3.8.

5.2 Experimental results and analyses

In this subsection, outlier detection is carried out for IAOF and the other five existing methods on the eight data sets mentioned above. Scatter diagrams, broken line graphs, and tables are presented as results to reflect the performance of different methods.

First, we calculate the outlier factor of each object in every data set and the judgment threshold μ by the process of IAOF. Scatter diagrams are then generated by taking an object’s serial number as the horizontal coordinate and its outlier factor as the vertical coordinate. In the diagrams, blue circles represent the true inliers, red-filled circles represent the true outliers, and the red dotted line stands for the judgment threshold μ. According to IAOF, objects above the red dotted line are considered outliers, and objects below the red dotted line are considered inliers. It can be seen from Figs. 2(a)-(h) that IAOF detects well: the method recognizes almost all true outliers in the Hayes, Monk, Voting, and Cancer data sets and about half of the true outliers in the Aust, Pima, and Mamm data sets. Although IAOF only recognizes a small part of the outliers in the Flare data set, it can be seen from Fig. 2(h) that almost all of the outliers in this data set are distributed in the middle and top of the scatter diagram, indicating that the true outliers receive relatively large outlier factors from IAOF, which indirectly demonstrates the effectiveness of the method on this data set.

Fig. 2

Outlier factor scatter diagrams of the data sets.

Furthermore, we perform outlier detection on every data set using the six methods. In real-world scenarios, identifying outliers in a data set is generally difficult due to a lack of prior knowledge about which objects differ greatly from the norm. These methods sort all objects in descending or ascending order of their computed “outlier factors” and then consider the objects at the top of the sequence to be outliers. In detail, as shown in Figs. 3–10, if a detection method deems the objects at the top of the sequence to be outliers and the number of objects in this part is the “number of objects”, then the proportion of these objects in the data set is the “Top ratio”. In fact, only some of these objects are true outliers; their number is written as the “Number of true outliers”, and the proportion of successfully identified outliers among all true outliers in the data set is the “outlier coverage”. In other words, a larger value of “Number of true outliers (outlier coverage)” indicates that the detection method detects outliers more effectively. The maximum number of outliers identified successfully (outlier coverage) on each line is typed in bold characters in Table 6. According to Figs. 3–10, IAOF performs significantly better than KNN, CBLOF, IForest, and NED in terms of overall performance, and IAOF has a similar detection efficiency to IEOF. However, when the number of objects equals 15, 30, 45, 60, and 75 on the Aust data set, the outlier coverage of IEOF is slightly smaller than that of IAOF.
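
The ranking procedure described above can be sketched in a few lines of Python. The sketch assumes that a larger outlier factor means a more outlying object (for methods with the opposite convention, the scores would be negated first); the function name and the toy scores are hypothetical.

```python
def coverage_at_top(scores, is_outlier, n_top):
    """Count the true outliers among the n_top highest-scored objects and
    return (count, count / total number of true outliers)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in order[:n_top] if is_outlier[i])
    return hits, hits / sum(is_outlier)


# Hypothetical outlier factors for five objects, two of which are true outliers.
toy_scores = [0.9, 0.1, 0.8, 0.3, 0.7]
toy_labels = [True, False, False, False, True]
top2 = coverage_at_top(toy_scores, toy_labels, n_top=2)
top3 = coverage_at_top(toy_scores, toy_labels, n_top=3)
```

Evaluating the function for increasing values of `n_top` yields one broken line per method of the kind plotted in Figs. 3–10.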

Fig. 3

The broken line chart of outliers detected on Hayes.

Fig. 4

The broken line chart of outliers detected on Monk.

Fig. 5

The broken line chart of outliers detected on Voting.

Fig. 6

The broken line chart of outliers detected on Cancer.

Fig. 7

The broken line chart of outliers detected on Aust.

Fig. 8

The broken line chart of outliers detected on Pima.

Fig. 9

The broken line chart of outliers detected on Mamm.

Fig. 10

The broken line chart of outliers detected on Flare.

In order to provide a concise overview of the detection effects of all methods, we generated broken line graphs using the experimental data. These graphs are depicted in Figs. 3–10. In most cases, except for Figs. 6 and 7, the broken line representing IAOF consistently surpasses the line representing KNN in terms of detection efficiency. This suggests that IAOF generally offers a superior degree of detection performance compared to KNN. Similarly, it can be observed from Figs. 3–10 that the performance of IAOF in detecting outliers surpasses that of CBLOF, IForest, and NED. The broken lines representing IAOF in Figs. 3, 4, and 6 almost overlap with those representing IEOF, while in Figs. 7, 8, and 9, the broken lines of IAOF are positioned above those representing IEOF. Consequently, it can be inferred that IAOF demonstrates slightly superior performance compared to IEOF.

Table 5

Confusion matrix for outlier detection

	Predicted outlier	Predicted inlier
Actual outlier	TO	FI
Actual inlier	FO	TI

Table 6

Experimental results for different detection methods

Data set	Top ratio (number	Number of true outliers (outlier coverage)
	of objects)	KNN	CBLOF	IForest	IEOF	NED	IAOF
Hayes	3% (3)	3(25%)	3(25%)	3(25%)	3(25%)	3(25%)	3(25%)
	5% (6)	6(50%)	5(42%)	6(50%)	6(50%)	6(50%)	6(50%)
	8% (9)	9(75%)	7(58%)	9(75%)	9(75%)	9(75%)	9(75%)
	11% (12)	10(83%)	8(67%)	11(92%)	9(75%)	12(100%)	11(92%)
	13% (15)	11(92%)	8(67%)	12(100%)	12(100%)	12(100%)	12(100%)
	16% (18)	12(100%)	9(75%)	12(100%)	12(100%)	12(100%)	12(100%)
	18% (21)	12(100%)	9(75%)	12(100%)	12(100%)	12(100%)	12(100%)
	21% (24)	12(100%)	9(75%)	12(100%)	12(100%)	12(100%)	12(100%)
Monk	2% (5)	3(19%)	5(31%)	5(31%)	5(31%)	2(13%)	5(31%)
	4% (10)	8(50%)	8(50%)	10(63%)	10(63%)	2(13%)	10(63%)
	6% (15)	12(75%)	8(50%)	13(81%)	15(94%)	3(19%)	13(81%)
	8% (20)	16(100%)	10(63%)	14(88%)	16(100%)	8(50%)	16(100%)
	10% (25)	16(100%)	10(63%)	16(100%)	16(100%)	13(81%)	16(100%)
	12% (30)	16(100%)	12(75%)	16(100%)	16(100%)	16(100%)	16(100%)
	14% (35)	16(100%)	12(75%)	16(100%)	16(100%)	16(100%)	16(100%)
	16% (40)	16(100%)	12(75%)	16(100%)	16(100%)	16(100%)	16(100%)
Voting	3% (10)	9(47%)	3(16%)	3(16%)	6(32%)	6(32%)	6(32%)
	7% (20)	11(58%)	4(21%)	6(32%)	13(68%)	11(58%)	12(63%)
	10% (30)	12(63%)	4(21%)	10(53%)	15(79%)	14(74%)	15(79%)
	14% (40)	14(74%)	4(21%)	12(63%)	15(79%)	15(79%)	15(79%)
	17% (50)	16(84%)	6(32%)	13(68%)	16(84%)	16(84%)	16(84%)
	21% (60)	17(89%)	7(37%)	13(68%)	17(89%)	18(95%)	16(84%)
	24% (70)	17(89%)	10(53%)	13(68%)	18(95%)	18(95%)	18(95%)
	28% (80)	17(89%)	12(63%)	14(74%)	19(100%)	19(100%)	19(100%)
Cancer	3% (10)	10(50%)	10(50%)	7(35%)	10(50%)	10(50%)	10(50%)
	5% (20)	17(85%)	18(90%)	14(70%)	18(90%)	18(90%)	18(90%)
	8% (30)	19(95%)	19(95%)	16(80%)	19(95%)	19(95%)	19(95%)
	11% (40)	19(95%)	19(95%)	16(80%)	20(100%)	19(95%)	20(100%)
	13% (50)	19(95%)	19(95%)	18(90%)	20(100%)	19(95%)	20(100%)
	16% (60)	20(100%)	19(95%)	19(95%)	20(100%)	20(100%)	20(100%)
	19% (70)	20(100%)	20(100%)	20(100%)	20(100%)	20(100%)	20(100%)
	21% (80)	20(100%)	20(100%)	20(100%)	20(100%)	20(100%)	20(100%)
Aust	4% (15)	5(13%)	5(13%)	8(21%)	7(18%)	1(3%)	8(21%)
	7% (30)	13(33%)	8(21%)	11(28%)	15(38%)	1(3%)	16(41%)
	11% (45)	19(49%)	13(33%)	14(36%)	18(46%)	1(3%)	21(54%)
	14% (60)	21(54%)	13(33%)	18(46%)	23(59%)	7(18%)	24(62%)
	18% (75)	25(64%)	15(38%)	21(54%)	24(62%)	17(44%)	26(67%)
	21% (90)	30(77%)	18(46%)	25(64%)	26(67%)	26(67%)	28(72%)
	25% (105)	32(82%)	21(54%)	30(77%)	29(74%)	28(72%)	31(79%)
	28% (120)	34(87%)	23(59%)	31(79%)	30(77%)	30(77%)	32(82%)
Pima	5% (25)	8(18%)	4(9%)	6(14%)	8(18%)	6(14%)	8(18%)
	9% (50)	12(27%)	6(14%)	9(20%)	15(34%)	11(25%)	18(41%)
	14% (75)	17(39%)	10(23%)	11(25%)	20(45%)	16(36%)	21(48%)
	18% (100)	21(48%)	13(30%)	16(36%)	24(55%)	22(50%)	24(55%)
	23% (125)	24(55%)	16(36%)	21(48%)	27(62%)	25(57%)	26(59%)
	28% (150)	30(68%)	20(45%)	25(57%)	32(73%)	28(64%)	30(68%)
	32% (175)	30(68%)	21(48%)	31(70%)	34(77%)	29(66%)	35(80%)
	37% (200)	31(70%)	22(50%)	31(70%)	36(82%)	30(68%)	37(84%)
Mamm	5% (25)	7(23%)	8(26%)	4(13%)	14(45%)	13(42%)	14(45%)
	9% (50)	12(39%)	11(35%)	8(26%)	17(55%)	18(58%)	19(61%)
	14% (75)	14(45%)	11(35%)	10(32%)	22(71%)	21(68%)	21(68%)
	18% (100)	19(61%)	11(35%)	16(52%)	22(71%)	25(81%)	22(71%)
	23% (125)	24(77%)	12(39%)	19(61%)	22(71%)	26(84%)	26(84%)
	27% (150)	24(77%)	13(42%)	23(74%)	27(87%)	26(84%)	27(87%)
	32% (175)	24(77%)	20(65%)	23(74%)	27(87%)	27(87%)	27(87%)
	37% (200)	25(81%)	21(68%)	27(87%)	27(87%)	27(87%)	27(87%)
Flare	3% (35)	9(21%)	10(23%)	6(14%)	10(23%)	8(19%)	10(23%)
	7% (70)	12(28%)	16(37%)	11(26%)	15(35%)	14(33%)	15(35%)
	10% (105)	18(42%)	19(44%)	14(33%)	21(49%)	18(42%)	20(47%)
	13% (140)	21(49%)	21(49%)	17(40%)	24(56%)	21(49%)	24(56%)
	16% (175)	22(51%)	23(53%)	19(44%)	29(67%)	24(56%)	28(65%)
	20% (210)	24(56%)	24(56%)	21(49%)	30(70%)	24(56%)	30(70%)
	23% (245)	24(56%)	26(60%)	23(53%)	32(74%)	24(56%)	32(74%)
	26% (280)	24(56%)	31(72%)	25(58%)	32(74%)	24(56%)	36(84%)

5.3 Comparison results and analyses

In the preceding subsection, we examined the performance of IAOF and the other five methods in detecting outliers based on experimental results. The goal of this subsection is to compare the efficiency of IAOF and the other five methods using the classification evaluation indices stated above. The following is a summary of the key analysis:

(1) From Figs. 11–18, we can also see the effectiveness of the IAOF algorithm more graphically. Figs. 15, 16, and 18 show the Aust, Pima, and Flare datasets. The IAOF algorithm is closest to the upper left corner of the first quadrant with the largest area under the curve for these datasets. On the other datasets, from Figs. 11–14 and 17, it can be observed that the IAOF is also very close to the upper-left corner of the first quadrant as compared to most of the methods. By calculating the area under each ROC curve, we get Table 7. From Table 7, we can see that the IAOF algorithm is ranked second in terms of area under its ROC curve on the Hayes, Monk, Voting, Cancer, and Mamm datasets.

Fig. 11

The ROC curve on dataset Hayes.

Fig. 12

The ROC curve on dataset Monk.

Fig. 13

The ROC curve on dataset Voting.

Fig. 14

The ROC curve on dataset Cancer.

Fig. 15

The ROC curve on dataset Aust.

Fig. 16

The ROC curve on dataset Pima.

Fig. 17

The ROC curve on dataset Mamm.

Fig. 18

The ROC curve on dataset Flare.

Table 7

AUC results

Data sets	AUC value (rank)
	KNN	CBLOF	IForest	IEOF	NED	IAOF
Aust	0.8421(4)	0.7207(6)	0.8538(3)	0.8686(2)	0.7978(5)	0.8813(1)
Pima	0.7133(4)	0.5962(6)	0.7280(3)	0.8161(2)	0.6878(5)	0.8184(1)
Flare	0.8251(3)	0.8062(4)	0.7761(6)	0.8516(2)	0.7944(5)	0.8580(1)
Mamm	0.7919(4)	0.7305(6)	0.7791(5)	0.8589(3)	0.8662(1)	0.8647(2)
Monk	0.9868(4)	0.9374(6)	0.9951(3)	1.0000(1)	0.9539(5)	0.9973(2)
Hayes	0.9412(6)	0.9432(5)	0.9967(3)	0.9926(4)	1.0000(1)	0.9975(2)
Voting	0.9324(4)	0.7264(6)	0.8514(5)	0.9533(1)	0.9487(3)	0.9495(2)
Cancer	0.9931(4)	0.9924(5)	0.9727(6)	0.9975(1)	0.9940(3)	0.9973(2)
Average rank	4.125	5.5	4.5	2.0	3.25	1.625

(2) Table 8 reports the variation of P (t), R (t), and F1-score with the value of t on different data sets. IAOF attains the maximum average P (t) on the Aust, Flare, Mamm, and Cancer data sets (tied with IEOF on Cancer) and is within 0.0005 of the best value on Pima. On the Cancer, Aust, Pima, and Flare data sets, IAOF attains the maximum average R (t) and average F1-score, and on Mamm it attains the maximum average F1-score; in the remaining cases it ranks second or third. For instance, on the Cancer data set, IAOF shares the same average P (t) with IEOF (66.56%), while the average P (t) of KNN, CBLOF, IForest, and NED is 63.23%, 65.06%, 52.53%, and 65.65%, respectively. The average R (t) of IAOF equals that of IEOF; both are 87.14%, while the average R (t) of the other methods is 83.57%, 84.29%, 72.86%, and 85.00%, respectively. The average F1-score of IAOF also equals that of IEOF; both are 69.08%, while the average F1-scores of the other methods are 65.81%, 67.16%, 56.27%, and 67.81%, respectively.

Table 8

Comparison results with precision, recall, and F1-score

Data sets	KNN			CBLOF			IForest			IEOF			NED			IAOF
	t	P(t)	R(t)	F1	P(t)	R(t)	F1	P(t)	R(t)	F1	P(t)	R(t)	F1	P(t)	R(t)	F1	P(t)	R(t)	F1
Aust	15	0.3333	0.1282	0.1852	0.3333	0.1282	0.1852	0.5333	0.2051	0.2963	0.4667	0.1795	0.2593	0.0667	0.0256	0.0370	0.5333	0.2051	0.2963
	30	0.4333	0.3333	0.3768	0.2667	0.2051	0.2319	0.3667	0.2821	0.3188	0.5000	0.3846	0.4348	0.0333	0.0256	0.0290	0.5333	0.4103	0.4638
	45	0.4222	0.4872	0.4524	0.2889	0.3333	0.3095	0.3111	0.3590	0.3333	0.4000	0.4615	0.4286	0.0222	0.0256	0.0238	0.4667	0.5385	0.5000
	60	0.3500	0.5385	0.4242	0.2167	0.3333	0.2626	0.3000	0.4615	0.3636	0.3833	0.5897	0.4646	0.1167	0.1795	0.1414	0.4000	0.6154	0.4848
	75	0.3333	0.6410	0.4386	0.2000	0.3846	0.2632	0.2800	0.5385	0.3684	0.3200	0.6154	0.4211	0.2267	0.4359	0.2982	0.3467	0.6667	0.4561
	90	0.3333	0.7692	0.4651	0.2000	0.4615	0.2791	0.2778	0.6410	0.3876	0.2889	0.6667	0.4031	0.2889	0.6667	0.4031	0.3111	0.7179	0.4341
	105	0.3048	0.8205	0.4444	0.2000	0.5385	0.2917	0.2857	0.7692	0.4167	0.2762	0.7436	0.4028	0.2667	0.7179	0.3889	0.2952	0.7949	0.4306
	average	0.3586	0.5311	0.3981	0.2437	0.3406	0.2605	0.3364	0.4652	0.355	0.3764	0.5201	0.402	0.1459	0.2967	0.1888	0.4123	0.5641	0.438
Pima	10	0.2000	0.0455	0.0741	0.1000	0.0227	0.0370	0.2000	0.0455	0.0741	0.3000	0.0682	0.1111	0.4000	0.0909	0.1481	0.3000	0.0682	0.1111
	20	0.3000	0.1364	0.1875	0.2000	0.0909	0.1250	0.2000	0.0909	0.1250	0.3000	0.1364	0.1875	0.2500	0.1136	0.1562	0.3000	0.1364	0.1875
	30	0.2667	0.1818	0.2162	0.1667	0.1136	0.1351	0.2000	0.1364	0.1622	0.3667	0.2500	0.2973	0.2333	0.1591	0.1892	0.3000	0.2045	0.2432
	40	0.2750	0.2500	0.2619	0.1500	0.1364	0.1429	0.2000	0.1818	0.1905	0.3500	0.3182	0.3333	0.2500	0.2273	0.2381	0.3250	0.2955	0.3095
	50	0.2400	0.2727	0.2553	0.1200	0.1364	0.1277	0.1800	0.2045	0.1915	0.3000	0.3409	0.3191	0.2200	0.2500	0.2340	0.3600	0.4091	0.3830
	60	0.2333	0.3182	0.2692	0.1167	0.1591	0.1346	0.1667	0.2273	0.1923	0.3000	0.4091	0.3462	0.1833	0.2500	0.2115	0.3000	0.4091	0.3462
	70	0.2000	0.3182	0.2456	0.1286	0.2045	0.1579	0.1571	0.2500	0.1930	0.2714	0.4318	0.3333	0.2143	0.3409	0.2632	0.3000	0.4773	0.3684
	average	0.2450	0.2175	0.2157	0.1403	0.1234	0.1229	0.1863	0.1623	0.1612	0.3126	0.2792	0.2754	0.2501	0.2045	0.2058	0.3121	0.2857	0.2784
Flare	65	0.1846	0.2791	0.2222	0.2308	0.3488	0.2778	0.1692	0.2558	0.2037	0.2308	0.3488	0.2778	0.1846	0.2791	0.2222	0.2308	0.3488	0.2778
	130	0.1615	0.4884	0.2428	0.1615	0.4884	0.2428	0.1308	0.3953	0.1965	0.1846	0.5581	0.2775	0.1538	0.4651	0.2312	0.1846	0.5581	0.2775
	195	0.1179	0.5349	0.1933	0.1231	0.5581	0.2017	0.1077	0.4884	0.1765	0.1487	0.6744	0.2437	0.1231	0.5581	0.2017	0.1538	0.6977	0.2521
	260	0.0923	0.5581	0.1584	0.1115	0.6744	0.1914	0.0885	0.5349	0.1518	0.1231	0.7442	0.2112	0.0923	0.5581	0.1584	0.1231	0.7442	0.2112
	325	0.0923	0.6977	0.1630	0.0985	0.7442	0.1739	0.0800	0.6047	0.1413	0.1108	0.8372	0.1957	0.0862	0.6512	0.1522	0.1138	0.8605	0.2011
	390	0.0949	0.8605	0.1709	0.0821	0.7442	0.1478	0.0769	0.6977	0.1386	0.0949	0.8605	0.1709	0.0897	0.8140	0.1617	0.0974	0.8837	0.1755
	455	0.0857	0.907	0.1566	0.0769	0.814	0.1406	0.0769	0.814	0.1406	0.0857	0.907	0.1566	0.0813	0.8605	0.1486	0.0857	0.907	0.1566
	average	0.1185	0.6180	0.1867	0.1263	0.6246	0.1966	0.1043	0.5415	0.1641	0.1398	0.7043	0.2191	0.1159	0.598	0.1823	0.1413	0.7143	0.2217
Mamm	20	0.2500	0.1613	0.1961	0.3000	0.1935	0.2353	0.2000	0.1290	0.1569	0.6500	0.4194	0.5098	0.5500	0.3548	0.4314	0.7000	0.4516	0.5490
	40	0.2750	0.3548	0.3099	0.2750	0.3548	0.3099	0.1500	0.1935	0.1690	0.4000	0.5161	0.4507	0.4000	0.5161	0.4507	0.4000	0.5161	0.4507
	60	0.2333	0.4516	0.3077	0.1833	0.3548	0.2418	0.1500	0.2903	0.1978	0.3167	0.6129	0.4176	0.3500	0.6774	0.4615	0.3333	0.6452	0.4396
	80	0.1750	0.4516	0.2523	0.1375	0.3548	0.1982	0.1250	0.3226	0.1802	0.2750	0.7097	0.3964	0.2625	0.6774	0.3784	0.2750	0.7097	0.3964
	100	0.1900	0.6129	0.2901	0.1100	0.3548	0.1679	0.1600	0.5161	0.2443	0.2200	0.7097	0.3359	0.2500	0.8065	0.3817	0.2200	0.7097	0.3359
	120	0.1917	0.7419	0.3046	0.1000	0.3871	0.1589	0.1583	0.6129	0.2517	0.1833	0.7097	0.2914	0.2167	0.8387	0.3444	0.1917	0.7419	0.3046
	140	0.1714	0.7742	0.2807	0.0857	0.3871	0.1404	0.1429	0.6452	0.2339	0.1929	0.8710	0.3158	0.1857	0.8387	0.3041	0.1929	0.8710	0.3158
	average	0.2123	0.5069	0.2773	0.1702	0.3410	0.2075	0.1552	0.3871	0.2048	0.3197	0.6498	0.3882	0.3164	0.6728	0.3932	0.3304	0.6636	0.3989
Monk	5	0.6000	0.1875	0.2857	1.0000	0.3125	0.4762	1.0000	0.3125	0.4762	1.0000	0.3125	0.4762	0.4000	0.1250	0.1905	1.0000	0.3125	0.4762
	10	0.8000	0.500	0.6154	0.8000	0.5000	0.6154	1.0000	0.6250	0.7692	1.0000	0.6250	0.7692	0.2000	0.1250	0.1538	1.0000	0.6250	0.7692
	15	0.8000	0.7500	0.7742	0.5333	0.5000	0.5161	0.8667	0.8125	0.8387	1.0000	0.9375	0.9677	0.2000	0.1875	0.1935	0.8667	0.8125	0.8387
	20	0.8000	1.0000	0.8889	0.5000	0.6250	0.5556	0.7000	0.8750	0.7778	0.8000	1.0000	0.8889	0.4000	0.5000	0.4444	0.8000	1.0000	0.8889
	25	0.6400	1.0000	0.7805	0.4000	0.6250	0.4878	0.6400	1.0000	0.7805	0.6400	1.0000	0.7805	0.5200	0.8125	0.6341	0.6400	1.0000	0.7805
	30	0.5333	1.0000	0.6957	0.4000	0.7500	0.5217	0.5333	1.0000	0.6957	0.5333	1.0000	0.6957	0.5333	1.0000	0.6957	0.5333	1.0000	0.6957
	35	0.4571	1.0000	0.6275	0.3429	0.7500	0.4706	0.4571	1.0000	0.6275	0.4571	1.0000	0.6275	0.4571	1.0000	0.6275	0.4571	1.0000	0.6275
	average	0.6615	0.7768	0.6668	0.5680	0.5804	0.5205	0.7424	0.8036	0.7094	0.7758	0.8393	0.7437	0.3872	0.5357	0.4199	0.7567	0.8214	0.7252
Hayes	3	1.0000	0.2500	0.4000	1.0000	0.2500	0.4000	1.0000	0.2500	0.4000	1.0000	0.2500	0.4000	1.000	0.2500	0.4000	1.0000	0.2500	0.4000
	6	1.0000	0.5000	0.6667	0.8333	0.4167	0.5556	1.0000	0.5000	0.6667	1.0000	0.5000	0.6667	1.0000	0.5000	0.6667	1.0000	0.5000	0.6667
	9	1.0000	0.7500	0.8571	0.7778	0.5833	0.6667	1.0000	0.7500	0.8571	1.0000	0.7500	0.8571	1.0000	0.7500	0.8571	1.0000	0.7500	0.8571
	12	0.8333	0.8333	0.8333	0.6667	0.6667	0.6667	0.9167	0.9167	0.9167	0.7500	0.7500	0.7500	1.0000	1.0000	1.0000	0.9167	0.9167	0.9167
	15	0.7333	0.9167	0.8148	0.5333	0.6667	0.5926	0.8000	1.0000	0.8889	0.8000	1.0000	0.8889	0.8000	1.0000	0.8889	0.8000	1.0000	0.8889
	18	0.6667	1.0000	0.8000	0.5000	0.7500	0.6000	0.6667	1.0000	0.8000	0.6667	1.0000	0.8000	0.6667	1.0000	0.8000	0.6667	1.0000	0.8000
	21	0.5714	1.0000	0.7273	0.4286	0.7500	0.5455	0.5714	1.0000	0.7273	0.5714	1.0000	0.7273	0.5714	1.0000	0.7273	0.5714	1.0000	0.7273
	average	0.8292	0.7500	0.7285	0.6771	0.5833	0.5753	0.8507	0.7738	0.7510	0.8269	0.7500	0.7271	0.8626	0.7857	0.7629	0.8507	0.7738	0.7510
Voting	10	0.9000	0.4737	0.6207	0.3000	0.1579	0.2069	0.3000	0.1579	0.2069	0.6000	0.3158	0.4138	0.6000	0.3158	0.4138	0.6000	0.3158	0.4138
	20	0.5500	0.5789	0.5641	0.2000	0.2105	0.2051	0.3000	0.3158	0.3077	0.6500	0.6842	0.6667	0.5500	0.5789	0.5641	0.6000	0.6316	0.6154
	30	0.4000	0.6316	0.4898	0.1333	0.2105	0.1633	0.3333	0.5263	0.4082	0.5000	0.7895	0.6122	0.4667	0.7368	0.5714	0.5000	0.7895	0.6122
	40	0.3500	0.7368	0.4746	0.1000	0.2105	0.1356	0.3000	0.6316	0.4068	0.3750	0.7895	0.5085	0.3750	0.7895	0.5085	0.3750	0.7895	0.5085
	50	0.3200	0.8421	0.4638	0.1200	0.3158	0.1739	0.2600	0.6842	0.3768	0.3200	0.8421	0.4638	0.3200	0.8421	0.4638	0.3200	0.8421	0.4638
	60	0.2833	0.8947	0.4304	0.1167	0.3684	0.1772	0.2167	0.6842	0.3291	0.2833	0.8947	0.4304	0.3000	0.9474	0.4557	0.2667	0.8421	0.4051
	70	0.2429	0.8947	0.3820	0.1429	0.5263	0.2247	0.1857	0.6842	0.2921	0.2571	0.9474	0.4045	0.2571	0.9474	0.4045	0.2571	0.9474	0.4045
	average	0.4352	0.7218	0.4893	0.1590	0.2857	0.1838	0.2708	0.5263	0.3325	0.4265	0.7519	0.5000	0.4098	0.7368	0.4831	0.4170	0.7369	0.4890
Cancer	8	1.0000	0.4000	0.5714	1.0000	0.4000	0.5714	0.6250	0.2500	0.3571	1.0000	0.4000	0.5714	1.0000	0.4000	0.5714	1.0000	0.4000	0.5714
	16	0.8125	0.6500	0.7222	1.0000	0.8000	0.8889	0.8125	0.6500	0.7222	1.0000	0.8000	0.8889	1.0000	0.8000	0.8889	1.0000	0.8000	0.8889
	24	0.7917	0.9500	0.8636	0.7500	0.9000	0.8182	0.6250	0.7500	0.6818	0.7917	0.9500	0.8636	0.7917	0.9500	0.8636	0.7917	0.9500	0.8636
	32	0.5938	0.9500	0.7308	0.5938	0.9500	0.7308	0.5000	0.8000	0.6154	0.5938	0.9500	0.7308	0.5938	0.9500	0.7308	0.5938	0.9500	0.7308
	40	0.4750	0.9500	0.6333	0.4750	0.9500	0.6333	0.4000	0.8000	0.5333	0.5000	1.0000	0.6667	0.4750	0.9500	0.6333	0.5000	1.0000	0.6667
	48	0.3958	0.9500	0.5588	0.3958	0.9500	0.5588	0.3750	0.9000	0.5294	0.4167	1.0000	0.5882	0.3958	0.9500	0.5588	0.4167	1.0000	0.5882
	56	0.3571	1.0000	0.5263	0.3393	0.9500	0.5000	0.3393	0.9500	0.5000	0.3571	1.0000	0.5263	0.3393	0.9500	0.5000	0.3571	1.0000	0.5263
	average	0.6323	0.8357	0.6581	0.6506	0.8429	0.6716	0.5253	0.7286	0.5627	0.6656	0.8714	0.6908	0.6565	0.8500	0.6781	0.6656	0.8714	0.6908

In general, IAOF has superior detection performance compared to other methods when applied to the eight data sets. Consequently, IAOF can be efficiently employed for the purpose of detecting outliers in categorical information systems.

6 Statistical analyses of experimental results

Since the three sets of indicator values (AUC, R (t), and F1-score) obtained in the previous section do not follow a normal distribution, it is necessary to use a nonparametric test for multiple paired samples. This test determines whether there is a statistically significant difference between the values of the same indicator obtained from different methods, that is, whether the detection efficacy differs significantly across methods.

The Friedman test used here relies on a statistic that follows the Fisher distribution; its null hypothesis is that there is no difference in rank across the algorithms. Let k be the number of outlier detection algorithms, n be the number of data sets, and c_i be the average rank of the i-th algorithm on the n data sets. The commonly used Friedman statistic is then defined as $τ_{F} = \frac{(n - 1) τ_{χ^{2}}}{n (k - 1) - τ_{χ^{2}}}$ where $τ_{χ^{2}} = \frac{12 n}{k (k + 1)} (\sum_{i = 1}^{k} c_{i}^{2} - \frac{k (k + 1)^{2}}{4})$ . Here, τ_χ² follows a chi-square distribution with k - 1 degrees of freedom when k and n are large enough, and τ_F follows a Fisher distribution with k - 1 and (k - 1) (n - 1) degrees of freedom.

If the statistic $τ_{F}$ is larger than the critical value F_α(k - 1, (k - 1)(n - 1)), the null hypothesis is rejected under the Friedman test. The Nemenyi test [9] for pairwise comparisons is then used to determine which algorithms differ significantly. The performance of two algorithms is significantly different if the difference between their average ranks exceeds the critical distance CD_α, which is defined as

${CD}_{α} = q_{α} \sqrt{\frac{k (k + 1)}{6 n}},$ (16)

where α is a level of significance chosen appropriately to guarantee the proper credibility level, and q_α is the upper α quantile of the Tukey distribution. The result of the Nemenyi test can be visualized in a critical difference diagram. The scale axis at the top of the diagram indicates the average rank of each method. Methods connected by the same horizontal red line show no significant difference between their ranks, while methods that are not connected by the same horizontal red line differ significantly.
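As a rough numerical illustration of Eq. (16), the sketch below computes the critical distance and lists the pairs of methods whose average ranks differ by more than it. It assumes k = 6 methods (IAOF plus the five compared methods) and n = 8 data sets, as described in the text, and takes q_α for k = 6 from the table of critical values in Demšar [9]. The helper names, the method labels, and the average ranks in the example are hypothetical placeholders, not results from this paper.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

# Critical values q_alpha for the two-tailed Nemenyi test with k = 6 methods,
# as tabulated in Demšar [9]; other values of k need other table entries.
Q_ALPHA_K6 = {0.05: 2.850, 0.10: 2.589}


def critical_distance(q_alpha: float, k: int, n: int) -> float:
    """CD_alpha = q_alpha * sqrt(k (k + 1) / (6 n)), as in Eq. (16)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


def significant_pairs(
    avg_ranks: Dict[str, float], cd: float
) -> List[Tuple[str, str]]:
    """Pairs of methods whose average ranks differ by more than cd."""
    return [
        (a, b)
        for a, b in combinations(avg_ranks, 2)
        if abs(avg_ranks[a] - avg_ranks[b]) > cd
    ]


# Assumed setting: k = 6 methods, n = 8 data sets.
cd_005 = critical_distance(Q_ALPHA_K6[0.05], k=6, n=8)
cd_010 = critical_distance(Q_ALPHA_K6[0.10], k=6, n=8)
print(round(cd_005, 3), round(cd_010, 3))

# Hypothetical average ranks, for illustration only.
example_ranks = {"method_a": 1.5, "method_b": 3.0, "method_c": 4.5}
print(significant_pairs(example_ranks, cd_005))
```

Under these assumptions the critical distance is about 2.67 at α = 0.05 and about 2.42 at α = 0.10, so two methods would need average ranks at least that far apart to be reported as significantly different.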

By conducting tests on the index values of this experiment, one can derive the following conclusions based on Figs. 19–21.

Fig. 19

The Nemenyi test on AUC (α = 0.1).

Fig. 20

The Nemenyi test on R (t) (α = 0.05).

Fig. 21

The Nemenyi test on F1-score (α = 0.05).

(1) IAOF has the highest average rank in terms of AUC, indicating that it has the best outlier identification performance. Furthermore, at the 10% significance level, the detection performance of IAOF differs significantly from that of CBLOF, KNN, and IForest, whereas no other pair of methods differs significantly.

(2) IAOF has the highest average rank in terms of R(t), indicating that the method has the highest recall rate for finding outliers. Furthermore, under the same experimental settings and at the 0.05 significance level, the recall of IAOF is significantly greater than that of CBLOF and IForest, whereas no other pair of methods differs significantly.

(3) IAOF has the highest average rank in terms of the F1-score, indicating that this method is the best at predicting outliers and inliers. At the 0.05 significance level, the prediction effectiveness of IAOF is significantly higher than that of CBLOF and IForest, but not than that of the other methods.

7 Conclusions

This paper develops a novel and effective IAOF method for identifying outliers, which is based on a thorough and rigorous theoretical framework and includes a real-world illustration of the various steps involved in outlier detection. The method finds outliers by using the information amount, one of the uncertainty measures in rough set theory, to calculate the degree of outlierness of an object. To date, there has been little research on outlier detection using the information amount. IAOF can be used to identify outliers in both categorical and discretized numerical information systems. Using a collection of eight UCI standard data sets, we ran tests to evaluate and compare IAOF with five existing outlier detection methods. The experimental results show that the proposed method is adaptable, performs well, and could be useful in practice, because it detects outliers in categorical data sets more accurately than the other methods tested under the same conditions. IAOF expands the use of the information amount as an uncertainty measure and enriches the techniques for identifying outliers. Nevertheless, the method offers no substantial advantage in time or space complexity.

In the future, we will consider adopting other efficient attribute approximation methods to lower the time and space complexity of IAOF in order to deal with outlier detection problems for high-dimensional data sets. In addition, applying the IAOF technique to outlier identification in other types of information systems, such as categorical information systems with missing values, numerical information systems with missing values, and set-valued information systems, is a direction for future research.

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions, which have helped immensely in improving the quality of the paper. This work was supported by the project of Improving the Basic Scientific Research Ability of Young and Middle-aged Teachers in Guangxi Universities (2020KY14013) and the project of Natural Science Foundation of Guangxi (2020GXNSFAA159155, 2020GXNSFAA159061).

References

1. Ashfaq, Sant'Anna, Lingman and Stawomir, Readmission prediction using deep learning on electronic health records, Journal of Biomedical Informatics 97 (2019), 103256.

2. Alcalá-Fdez, Fernandez, Luengo, Derrac, García, Sánchez and Herrera, KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing 17 (2011), 255–287.

3. Breunig M.M., Kriegel H.P., Ng R.T. and Sander, LOF: Identifying density-based local outliers, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (2000), pp. 93–104.

4. Borne K.D. and Vedachalam, Surprise detection in multivariate astronomical data, Statistical Challenges in Modern Astronomy 5 (2012), 275–289.

5. Chen Y.M., Miao D.Q. and Zhang H.Y., Neighborhood outlier detection, Expert Systems with Applications 37(2) (2010), 8745–8749.

6. Constantinou, PyNomaly: Anomaly detection using Local Outlier Probabilities (LoOP), Journal of Open Source Software 3(30) (2018), 845.

7. Campos G.O., Zimek, Sander and Campello, On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study, Data Mining and Knowledge Discovery 30(4) (2016), 891–927.

8. Domingues, Filippone, Michiardi and Zouaoui, A comparative evaluation of outlier detection algorithms: Experiments and analyses, Pattern Recognition 74 (2018), 406–421.

9. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

10. Degirmenci and Karal, Efficient density and cluster based incremental outlier detection in data streams, Information Sciences 607 (2022), 901–920.

11. Ester, Kriegel H.P., Sander and Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, AAAI Press 96(34) (1996), 226–231.

12. Fakharzadeh J.A. and Zarei, A LoOP based outlier detection method for high dimensional fuzzy data set, Journal of Intelligent & Fuzzy Systems 32(1) (2017), 241–248.

13. Gruhl and Sick, Novelty detection with CANDIES: A holistic technique based on probabilistic models, International Journal of Machine Learning and Cybernetics 9(6) (2016), 927–945.

14. He Z.Y., Xu X.F. and Deng S.C., Discovering cluster-based local outliers, Pattern Recognition Letters 24(9-10) (2003), 1641–1650.

15. Jin F.S., Chen M.N., Zhang W.W., Yuan and Wang S.L., Intrusion detection on internet of vehicles via combining log-ratio oversampling, outlier detection and metric learning, Information Sciences 579 (2021), 814–831.

16. Jiang, Sui Y.F. and Cao C.G., An information entropy-based approach to outlier detection in rough sets, Expert Systems with Applications 37(9) (2010), 6338–6344.

17. Kandanaarachchi, Unsupervised anomaly detection ensembles using item response theory, Information Sciences 587 (2022), 142–163.

18. Kriegel H.P., Kröger, Schubert and Zimek, LoOP: Local outlier probabilities, Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, pp. 1649–1652.

19. Power big data anomaly detection method based on an improved PSO-PFCM clustering algorithm, Power System Protection and Control 49(18) (2021), 161–166.

20. Lin and Li Z.W., Outlier detection for set-valued data based on rough set theory and granular computing, International Journal of General Systems (2022), 1–29.

21. Liu F.T., Ting K.M. and Zhou Z.H., Isolation-based anomaly detection, ACM Transactions on Knowledge Discovery from Data 6(1) (2012), 1–39.

22. Meira, Eiras-Franco, Bolón-Canedo, Marreiros and Alonso-Betanzos, Fast anomaly detection with locality-sensitive hashing and hyperparameter autotuning, Information Sciences 607 (2022), 1245–1264.

23. Pawlak, Rough sets, International Journal of Computer and Information Science 11 (1982), 341–356.

24. Ramaswamy, Rastogi and Shim, Efficient algorithms for mining outliers from large data sets, In: Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, USA, 2000, pp. 427–438.

25. Reddy R.V., Subhani and Rao B.S., Machine learning based outlier detection for medical data, Indonesian Journal of Electrical Engineering and Computer Science 24(1) (2021), 564–569.

26. Romero and Ventura, Educational data mining: A review of the state of the art, IEEE Transactions on Systems, Man, and Cybernetics (Part C) 40(6) (2010), 601–618.

27. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344 (2014), 1492–1496.

28. Sangeetha, A fuzzy proximity relation approach for outlier detection in the mixed dataset by using rough entropy-based weighted density method, Soft Computing Letters 3 (2021), 100027.

29. Shaari, Bakar A.A. and Hamdan A.R., Outlier detection based on rough sets theory, Intelligent Data Analysis 13(2) (2009), 191–206.

30. Shin S.J., Lee S.M., Kim H.Y. and Kim S.H., Advanced probabilistic approach for network intrusion forecasting and detection, Expert Systems with Applications 40(1) (2013), 315–322.

31. Sun S.X., Zhao J.L., Nunamaker J.F. and Liu O.R., Formulating the data-flow perspective for business process management, Information Systems Research 17(4) (2006), 374–391.

32. Wang and Li Y.P., Outlier detection based on weighted neighbourhood information network for mixed-valued data sets, Information Sciences 564 (2021), 396–415.

33. Yamanishi and Takeuchi, Discovering outlier filtering rules from unlabeled data: Combining a supervised learner with an unsupervised learner, In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, 2001, pp. 389–394.

34. Yuan, Zhang X.Y. and Feng, Hybrid data-driven outlier detection based on neighborhood information entropy and its developmental measures, Expert Systems with Applications 112 (2018), 243–257.

35. Yuan, Chen H.M., Li T.R., Liu and Wang, Fuzzy information entropy-based adaptive approach for hybrid feature outlier detection, Fuzzy Sets and Systems 421 (2021), 1–28.

36. Yuan, Chen H.M., Li T.R., Sang B.B. and Wang, Outlier detection based on fuzzy rough granules in mixed attribute data, IEEE Transactions on Cybernetics 52(8) (2021), 8399–8412.

37. Yuan, Chen B.Y., Liu, Chen H.M., Peng D.Z. and Li P.L., Anomaly detection based on weighted fuzzy-rough density, Applied Soft Computing 134 (2023), 109995.

38. Zhao, Li Y.N. and Li, Anomaly detection of power consumption data based on fuzzy clustering and isolated forest, Journal of Shanxi University of Technology (Natural Science Edition) 36 (2020), 38–43.

39. Zhang W.X. and Qiu G.F., Uncertain decision making based on rough sets, Tsinghua University Press, Beijing, 2005.
