Sage Journals: Discover world-class research

Abstract

Frequent Itemset Mining (FIM), which generates frequent itemsets from a transaction database, is a widely studied problem in data mining. High Utility Itemset Mining (HUIM) extends FIM by identifying itemsets with high utility in the transaction database. It has gained popularity because it identifies itemsets that have high value but are not identified as frequent by FIM. HUIM is generally referred to as Utility Mining. The utility of an item is measured using parameters such as cost, profit, quantity or any other measure preferred by the user. Compared with high utility itemset (HUI) mining, high average utility itemset (HAUI) mining is more precise because it takes the number of items in an itemset into account. State-of-the-art algorithms that mine HUIs and HAUIs use a single fixed minimum utility threshold to identify them. In this paper, the proposed algorithm mines HAUIs from transaction databases using the Artificial Fish Swarm Algorithm (AFSA) with computed multiple minimum average utility thresholds. By computing a minimum average utility threshold for each item, the AFSA-based algorithm discovers more HAUIs than other state-of-the-art HAUI mining algorithms that use multiple minimum utility thresholds or a single user-defined minimum threshold. The proposed algorithm also performs better than the state-of-the-art algorithms in terms of execution time, number of candidates generated and memory consumption.

Keywords

Artificial fish swarm algorithm, data mining, frequent itemset mining, high average utility itemsets, itemset mining, utility mining

1 Introduction

Data mining applications mainly focus on generating rules [7, 23], discovering patterns [24], or predicting user behavior from a database; the results are used to devise strategies that improve the profit and growth of a business. In particular, finding useful itemsets in a transactional database identifies the transactions with high profit, which helps in planning business strategies [14].

The best-known procedure in data mining for finding useful itemsets is Frequent Itemset Mining [7, 23], which filters out the less frequent itemsets from the transaction database. Frequent itemset mining algorithms discover all itemsets whose frequency is greater than or equal to a given threshold value, referred to as the minimum support. This technique is the most direct approach to identifying strategies from a transactional database.

Less frequent doesn’t mean less profit. Sometimes, the less frequent itemsets may yield higher profit than the frequent ones but that may not be identified by frequent itemset mining algorithms as a profitable itemset. Those less frequent but yet profitable itemsets are discovered by using the utilities of the items considered in the transaction [21]. Here, the utilities of the items are defined in terms of quantity and profit per unit.

The high utility itemset mining (HUIM) algorithm mines all itemsets whose utility is greater than or equal to a given fixed utility value [21], which plays the role that minimum support plays in frequent itemset mining. Many algorithms have been proposed to mine high utility itemsets (HUIs) [5, 31] with a fixed minimum utility. Later research has improved the efficiency of HUI mining algorithms on larger transactional databases, and genetic algorithm approaches have been introduced for the same purpose [13].

The HUI algorithm requires better pruning strategies to reduce the complexity of the candidate search space and a better approach to replace the unfair measure used in the mining process. Using a single fixed utility to evaluate the candidates is unfair because the items in each itemset have different profits, quantities and importance [10]. If the minimum utility is set too high, the algorithm will miss many important HUIs during the mining process. Similarly, if it is set too low, many undeserving itemsets will be reported as HUIs and will pollute the results. To overcome this drawback, high average utility itemset mining (HAUIM) was proposed [10]. In HAUIM, the algorithm mines both longer and shorter patterns with high utility. In HUIM, a shorter pattern with high utility may not be considered an HUI because it contains only a few items, so its total utility may not reach the given fixed threshold value.

In this paper, the proposed algorithm HAUIM-AFSA-MMU mines high average utility itemsets (HAUIs) using the Artificial Fish Swarm Algorithm (AFSA) with multiple minimum utility thresholds. As discussed earlier, shorter patterns with high profits are discovered because the algorithm considers the average of the items' utilities rather than their sum. The algorithm therefore overcomes the problem of identifying shorter itemsets with high profit, and it is not limited to a single fixed utility threshold because it uses multiple minimum average utilities, one for each item in the transaction database. The results show that it discovers more HAUIs than other HAUIM algorithms.

2 Frequent itemset mining vs utility mining

Utility mining is the extension of frequent itemset mining [1], and also it overcomes the problem in frequent itemset mining.

Before getting into the itemset mining, let us get familiar with the definitions of the terms in use.

Transaction - set of items purchased by a customer of an organization

Transaction database - a database that consists of the transactions made by customers

Item - an atomic unit in the transaction

Itemset - an arbitrary collection of items that need not belong to a single group or transaction

2.1 Frequent itemset mining

For instance, consider the following transactions made by a set of customers of an organization on a particular day.

Let I = {a, b, c, d} be the set of items and T = {T1, T2, T3, T4} be the sample transaction database, where each transaction Ti holds a subset of I and the index i identifies the transaction uniquely. Table 1 shows the sample transaction database T used for mining frequent itemsets.

Table 1
A sample transaction database T

Transaction	Items
T1	[c,d]
T2	[a,b,c]
T3	[a,b,c,d]
T4	[a,c,d]

The objective of the frequent itemset mining is to find the itemsets that have appeared in more transactions. Many popular algorithms are proposed to find frequent itemsets. These algorithms take the min_support as a measure to work on the transaction database. The measure min_support refers to a threshold value that determines the minimum number of times the item or itemset should appear in the database to be identified as frequent itemset.

For the above sample transaction database, let the threshold min_support value to find frequent itemset be 3.

The support value for all possible itemset for the sample transaction database T under consideration is listed in Table 2.

Table 2

Support value for all possible itemsets

Item/Itemset	Support value	Item/Itemset	Support value
[a]	3	[b,d]	1
[b]	2	[c,d]	3
[c]	4	[a,b,c]	2
[d]	3	[a,b,d]	1
[a,b]	2	[a,c,d]	2
[a,c]	3	[b,c,d]	1
[a,d]	2	[a,b,c,d]	1
[b,c]	2	–	–

The support value for the itemset say X is calculated by counting the number of occurrences of X in the transaction T under consideration. The min_support value has been fixed as 3 for this example. Itemsets with support value greater than or equal to min_support value are listed as Frequent itemsets (Theorem 1).

$FrequentItemset = \{X \subseteq I \mid Support(X) \geq min\_support\}$ (1)

Considering the above formula, the following itemsets are identified as frequent itemsets from the sample set of transactions T. $FrequentItemset = \{[a], [c], [d], [a, c], [c, d]\}$

The above 5 itemsets have support value greater than or equal to min_support and hence they are grouped as frequent itemsets of the sample transaction T.
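The support counts in Table 2 and the resulting frequent itemsets can be reproduced with a short brute-force sketch. It enumerates every itemset over I, which is only practical for a toy database of this size; the function and variable names are illustrative and not part of any published algorithm.

```python
from itertools import combinations

# Sample transaction database from Table 1.
transactions = {
    "T1": {"c", "d"},
    "T2": {"a", "b", "c"},
    "T3": {"a", "b", "c", "d"},
    "T4": {"a", "c", "d"},
}
items = sorted(set().union(*transactions.values()))
min_support = 3

def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions.values() if set(itemset) <= t)

# Enumerate every non-empty itemset over I (exponential in general).
frequent = [
    list(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= min_support
]
print(frequent)  # [['a'], ['c'], ['d'], ['a', 'c'], ['c', 'd']]
```

Practical FIM algorithms avoid this full enumeration by pruning supersets of infrequent itemsets.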

2.2 Limitation in frequent itemset mining

The itemsets listed as frequent appear in transaction database T at least min_support times. The actual objective, however, is to mine the transaction database for useful itemsets that improve the growth of the organization [17]. Frequent Itemset Mining does not consider the quantity of the items or their value in terms of profit, which is its main limitation.

3 High utility itemset mining

High utility itemset mining identifies all the itemsets with utility greater than or equal to the utility measure given by the user. The high utility itemsets identified are those that provide high profit in a customer transaction database. Many methodologies have been proposed for mining high utility itemsets [25, 34], which are helpful in deciding on marketing strategies to improve the business.

Now, let us consider the items in the same sample transaction T with an additional information ‘profit’ as shown in Table 3.

Table 3
Items in sample transaction T with profit [utility] value per unit

Items	Profit per unit in INR
a	225
b	650
c	350
d	150

The profit values show that item b has a higher profit per unit than the other items in the sample transaction database T.

To get more understanding, let us include the units of purchase of each item in all the transactions in T as given in Table 4 and determine the profit obtained by each transaction.

Table 4

Transactions with items’ unit of purchase

Transactions	Item with unit of purchase
T1	[c-2,d-3]
T2	[a-2,b-2,c-2]
T3	[a-1,b-1,c-2,d-2]
T4	[a-2,c-1,d-1]

The profit of a transaction T_i with n items, called the transaction utility TU(T_i), is calculated by using the following formula (Theorem 2), $TU(T_{i}) = P(T_{i}) = \sum_{k = 1}^{n} U_{k} * P_{k}$ (2) where P(T_i) is the profit of the transaction T_i, n is the number of items in the transaction, U_k is the number of units of item k purchased, and P_k is the profit per unit for item k.
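As a worked instance of Eq. (2), the following sketch computes the transaction utility of each transaction from the unit profits in Table 3 and the purchased units in Table 4. The dictionary layout and the function name are illustrative choices.

```python
# Profit per unit in INR (Table 3).
profit = {"a": 225, "b": 650, "c": 350, "d": 150}

# Units purchased per item in each transaction (Table 4).
quantities = {
    "T1": {"c": 2, "d": 3},
    "T2": {"a": 2, "b": 2, "c": 2},
    "T3": {"a": 1, "b": 1, "c": 2, "d": 2},
    "T4": {"a": 2, "c": 1, "d": 1},
}

def transaction_utility(units):
    """TU(T_i): sum of U_k * P_k over the items of one transaction."""
    return sum(u * profit[item] for item, u in units.items())

tu = {tid: transaction_utility(units) for tid, units in quantities.items()}
print(tu)  # {'T1': 1150, 'T2': 2450, 'T3': 1875, 'T4': 950}
```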

The transaction utility P(T_i) of each transaction is calculated and given in Table 5. Comparing the items in each transaction (Table 5) with the profit per unit of each item (Table 3) leads to the observations listed after Table 5.

Table 5

Actual Profit obtained from each transaction in T

Transactions	Item with unit of purchase	Profit in INR
T1	[c-2,d-3]	1150
T2	[a-2,b-2,c-2]	2450
T3	[a-1,b-1,c-2,d-2]	1875
T4	[a-2,c-1,d-1]	950

In T2 and T3, the itemset [a,b,c] collectively yields more profit than other itemsets

Itemsets containing item b yield more profit

Transactions containing item d yield lower profit

Also, the transaction weighted utility of an itemset X, TWU(X), is calculated by adding the transaction utilities of all transactions containing the itemset X, as given below (Theorem 3), $TWU(X) = \sum_{X \subseteq T_{i} \land T_{i} \in T} TU(T_{i})$ (3)

The transaction weighted utility for all possible itemset of sample transaction database T is calculated and listed in Table 6. By fixing the minimum utility min_utility, the list of itemsets having utility greater than or equal to user-defined min_utility forms the high-utility itemset from transaction database T.

Table 6

Utility value for all possible itemsets

Item/Itemset	Utility value	Item/Itemset	Utility value
[a]	5275	[b,d]	1875
[b]	4325	[c,d]	3975
[c]	6425	[a,b,c]	4325
[d]	3975	[a,b,d]	1875
[a,b]	4325	[a,c,d]	2825
[a,c]	5275	[b,c,d]	1875
[a,d]	2825	[a,b,c,d]	1875
[b,c]	4325	–	–

If min_utility for the transaction database T under consideration is 4200, then the itemsets declared as high utility itemsets are $HighUtilityItemset = \{[a], [b], [c], [a, b], [a, c], [b, c], [a, b, c]\}$
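The values in Table 6 and the list above can be checked with a small sketch that follows this section's use of the transaction weighted utility as the utility value of an itemset. It takes the transaction utilities from Table 5 as given; the names are illustrative.

```python
from itertools import combinations

# Transaction utilities (Table 5) and transaction contents (Table 1).
tu = {"T1": 1150, "T2": 2450, "T3": 1875, "T4": 950}
contents = {
    "T1": {"c", "d"},
    "T2": {"a", "b", "c"},
    "T3": {"a", "b", "c", "d"},
    "T4": {"a", "c", "d"},
}
min_utility = 4200

def twu(itemset):
    """Sum of TU(T_i) over all transactions that contain `itemset`."""
    return sum(tu[tid] for tid, t in contents.items() if set(itemset) <= t)

items = ["a", "b", "c", "d"]
high_utility = [
    list(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if twu(c) >= min_utility
]
print(high_utility)
# [['a'], ['b'], ['c'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]
```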

Comparing the itemsets declared frequent by frequent itemset mining with those declared high utility by high utility itemset mining shows that the frequent itemsets do not include itemsets such as [b] or [a,b,c], which yield high profit and are listed as high utility itemsets. Conversely, the itemset [d] is declared frequent but does not yield comparatively high profit and is not listed as a high utility itemset, because the profit per unit of item d is less than the average profit per unit of all items appearing in the transactions.

Hence by all the observations made on transactions T, the following are the major limitations of frequent itemset mining addressed by high utility itemset mining.

Itemsets declared as frequent by frequent itemset mining may not yield profit. [Example: Item d in example Transactions T]

Itemsets yielding high profit may not be frequent, such significant itemsets are missed by frequent itemset mining. [Example: Itemset [a,b,c] is not frequent but yields high profit]

This limitation of frequent itemset mining motivates high utility itemset mining, an extension of frequent itemset mining. Itemsets are filtered from the transactions based on their utility, so the itemsets that yield high profit are not missed [17].

3.1 Frequent and high utility itemsets

Frequent Itemset Mining (FIM) mines the frequent itemset, whereas High-Utility Itemset Mining (HUIM) generates high utility itemsets from the transaction database [1]. Combining these methods to generate itemsets which are both frequent and have high utility, and applying business strategies based on such itemsets will yield periodical high profit.

Philippe Fournier-Viger et al. [21] designed a methodology called Periodic High-utility itemset Miner (PHM) to mine periodic (frequent) high utility itemset from transaction databases. This methodology mines itemsets that are both frequent and with high utility.

3.2 High Average Utility Itemsets (HAUI)

High Average Utility Itemset (HAUI) mining discovers the itemsets with high average utility, in contrast to High Utility Itemset Mining (HUIM) [26], which mines itemsets based on their utility alone. HUIM decides whether an itemset is an HUI by checking whether its utility reaches the given threshold, so itemsets of short length (length denotes the number of items in an itemset) are often not discovered as high utility itemsets. HAUIM instead uses the fairer measure of average utility, which is compared against the given threshold. Because HAUIM mines HAUIs based on the average utility [11, 38], shorter patterns are not penalized as they are in HUIM.

4 Related work

The various state-of-the-art algorithms for mining HUIs and HAUIs are summarized in Table 7. Based on Table 7, the proposed algorithm is the first to mine HAUIs using an artificial fish swarm algorithm. Many genetic and optimization algorithms have been applied to mine HUIs, and several of them have proven effective [13, 36]. For mining HAUIs, however, optimization techniques have not been applied in the existing works. This work also proposes fixing a minimum average utility for each item [35], instead of a single minimum utility threshold, to mine HAUIs. There are proven works where HUIs, HAUIs and association rules [40] are mined with multiple thresholds.

Table 7
HUI and HAUI mining algorithms

Algorithm	Author(s)	Measures	Outcome	Advantages
Periodic High-utility itemset Miner (PHM)	Philippe Fournier-Viger et al. [21]	1. Average periodicity 2. Minimum periodicity 3. Maximum periodicity	1. Mines periodic high utility itemsets (PHIs) 2. Comparative analysis of PHM with FHM and HUIM algorithms	Avg, min and max periodicity can be calculated with just one scan on ps(X)
Projection-based Utility Mining (ProUM)	Wensheng Gan et al. [39]	Sequence extension utility	Generates high utility itemset from sequential data	Improves the mining efficiency, and effectively reduces the memory consumption
HUIM-GA	Kannimuthu et al. [13]	Genetic operations with/without minimum utility threshold	Generates high utility itemset with negative item values	Genetic algorithm approach
HUIM-SPSO	Wei Song et al. [29]	Minimum utility threshold, velocity of items	Generates high utility itemset positioned in high velocities	Mines HUIs in high diversity
HUIM-ACO	Wei Song et al. [36]	Minimum utility threshold	Generates high utility itemset	Constructive candidate generation
HUIM-AFSA	Wei Song et al. [37]	Minimum utility threshold	Generates high utility itemset	Creative Pruning and candidate generation
HAUI-Miner	Jerry Chun-Wei Lin [12]	Efficient average–utility threshold	Generates high average utility itemset	DFS used instead of candidate generation
HAUI-MMAU	Jerry Chun-Wei Lin [11]	IEUCP and PBCS strategies, multiple thresholds	Generates high average utility itemset without fixed single minimum threshold	Multiple minimum thresholds
HAUI-MEMU	Jerry Chun-Wei Lin [10]	Average utility	HAUIs based on the average-utility list structure	Level-wise mining of HAUIs
HAUIM-GMU	Wei Song et al. [38]	Generalized maximal utility and generalized average-utility upper bound	Generates high average utility itemset	Fewer candidates and new pruning strategy
HUIM-HC-SA	Saqib Nawaz et al. [18]	Used hill climbing and simulated annealing	High utility itemsets	Addresses long runtime - converts database to bitmap
HUIF-PSO	Wei Song et al. [25]	PSO with minimum threshold	High utility itemsets	Selecting from generated HUIs for next population
HUIF-GA	Wei Song et al. [30]	Genetic algorithm with minimum threshold	High utility itemsets	Selecting from generated HUIs for next population
HUIF-ABC	Wei Song et al. [30]	A bat algorithm with minimum threshold	High utility itemsets	Selecting from generated HUIs for next population
HUIM-BPSO	Jerry Chun-Wei Lin [8]	Transaction-Weighted Utility (TWU)	High utility itemsets	1-HTWUIs as the size of the particle
HUIM-BPSO-tree	Jerry Chun-Wei Lin [9]	Transaction-Weighted Utility (TWU)	High utility itemsets	Finds the high-transaction-weighted utilization 1-itemsets (1-HTWUIs) as the size of the particle

5 Preliminaries and definitions

Definition 1: Utility of an item

The utility of an item say i_j in a transaction T_q is denoted as u(i_j, T_q) and it is calculated as: $u (i_{j}, T_{q}) = q (i_{j}, T_{q}) * p (i_{j})$ (4) where q(i_j, T_q) represents the quantity of the item i_j in the given transaction T_q and p(i_j) represents the unit profit of the item i_j for the given transaction database.

Definition 2: Utility of an itemset in a transaction

The utility of an itemset say X in a transaction T_q is denoted as u(X, T_q) and it is calculated as: $u (X, T_{q}) = \sum_{i_{j} \in X \land X \subseteq T_{q}} u (i_{j}, T_{q})$ (5)

Definition 3: Average utility of an itemset in a transaction

The average utility of an itemset X in a transaction T_q is denoted as au(X, T_q) and it is represented as: $au(X, T_{q}) = \frac{\sum_{i_{j} \in X \land X \subseteq T_{q}} u(i_{j}, T_{q})}{|X|}$ (6) where |X| = k represents the number of items in itemset X.

Definition 4: Average utility of an itemset in a transaction database

The average utility of an itemset X in a transaction database D is denoted as au(X) and it is represented as: $au(X) = \sum_{X \subseteq T_{q} \land T_{q} \in D} au(X, T_{q})$ (7)
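Definitions 1 to 4 can be illustrated on the sample database of Tables 3 and 4. The sketch below is a direct transcription of the definitions with illustrative function names; it is not the list-based structure that dedicated HAUI miners use.

```python
# Profit per unit (Table 3) and purchased units (Table 4).
profit = {"a": 225, "b": 650, "c": 350, "d": 150}
quantities = {
    "T1": {"c": 2, "d": 3},
    "T2": {"a": 2, "b": 2, "c": 2},
    "T3": {"a": 1, "b": 1, "c": 2, "d": 2},
    "T4": {"a": 2, "c": 1, "d": 1},
}

def utility(itemset, tid):
    """u(X, T_q): utility of X in transaction tid, or 0 if X is not contained in it."""
    units = quantities[tid]
    if not all(item in units for item in itemset):
        return 0
    return sum(units[item] * profit[item] for item in itemset)

def average_utility(itemset):
    """au(X): sum over all transactions of u(X, T_q) / |X|."""
    return sum(utility(itemset, tid) for tid in quantities) / len(itemset)

print(average_utility(["b"]))                      # 1950.0
print(average_utility(["d"]))                      # 900.0
print(round(average_utility(["a", "b", "c"]), 2))  # 1341.67
```

The single item b keeps its full utility of 1950, whereas the three-item set [a,b,c] has a total utility of 4025 but an average utility of about 1342, which shows how the average removes the advantage of longer itemsets.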

Definition 5: Multiple minimum average utility (MAU)

If there are n items in a transaction database, then there are n computed minimum average utilities, one for each item i_j, represented as $MAUs = \{mau(i_{1}), mau(i_{2}), \dots, mau(i_{n})\}$ (8) where mau(i_j) is calculated as $mau(i_{j}) = \max\{\beta * p(i_{j}), GLMAU\}$ (9) where β is a constant multiplier applied to the unit profit of the item and GLMAU is the global least minimum average utility value specified by the user.
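A minimal sketch of Eq. (9) on the unit profits of Table 3 follows. The values β = 2 and GLMAU = 500 are illustrative assumptions, not settings taken from the experiments. The helper `itemset_threshold` follows the description in Section 6.1, where an itemset is compared against the average of the minimum utilities of its items; its name is hypothetical.

```python
# Profit per unit in INR (Table 3).
profit = {"a": 225, "b": 650, "c": 350, "d": 150}

beta = 2      # illustrative value, not from the paper
glmau = 500   # illustrative value, not from the paper

# Eq. (9): one minimum average utility per item.
mau = {item: max(beta * p, glmau) for item, p in profit.items()}
print(mau)  # {'a': 500, 'b': 1300, 'c': 700, 'd': 500}

def itemset_threshold(itemset):
    """Average of the per-item minimum average utilities of an itemset."""
    return sum(mau[item] for item in itemset) / len(itemset)

print(itemset_threshold(["a", "b"]))  # 900.0
```

With these values GLMAU acts as a floor for the low-profit items a and d, while the thresholds of b and c scale with their unit profit.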

6 Proposed HAUIM-AFSA-MMU algorithm

HAUIM-AFSA-MMU – High Average Utility Itemset Mining using Artificial Fish Swarm Algorithm with Multiple Minimum average Utility thresholds.

Algorithm 1 - HAUIM-AFSA-MMU
Input: Transaction database D, population size N, maximum number of iterations max_iter, maximum number of attempts try_number, β, GLMAU
Output: HAUIs
Initialize N, β, GLMAU, max_iter, try_number;
HAUI = Ø;
Compute the MAU table from β and GLMAU using Eq. 9;
i = 1;
Generate N PVs (itemsets, i.e. fish) using RW(); // pseudocode in Algorithm 2
while i <= max_iter do
    for j = 1 to N do
        is_follow = false;
        is_swarm = false;
        F_j = follow(F_j); // follow(PV), pseudocode in Algorithm 3
        if (!is_follow) then
            swarm(F_j); // pseudocode in Algorithm 4
        end if
        if (!is_follow AND !is_swarm) then
            prey(F_j); // pseudocode in Algorithm 5
        end if
    end for
    i++;
end while

Algorithm 2 - RW()
au_sum = 0;
for j = 1 to N do
    au_sum = au_sum + au(item_j);
end for
for j = 1 to N do
    p[j] = au(item_j) / au_sum;
end for
// Initialize cumulative probability array P[]
P[1] = p[1];
for i = 2 to N do
    P[i] = P[i-1] + p[i];
end for

Algorithm 3 - follow(PV F)
Best_F = F_1;
for i = 1 to N do
    Calculate d = Bitdiff(F, F_i);
    if (d <= VD AND au(F_i) > au(Best_F)) then
        Best_F = F_i;
    end if
end for
d' = Bitdiff(Best_F, F);
if (d' > 0) then
    is_follow = true;
    k = random_int(); // less than d'
    // flip k bits in F
    if (au(F) >= mau(F) AND IS(F) ∉ HAUI) then
        IS(F) -> HAUI;
    end if
end if
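The bit-level operations used by follow() can be sketched as below. The sketch assumes that each fish is a 0/1 vector with one bit per item, that Bitdiff is the Hamming distance, and that "flip k bits in F" means copying k of the differing bits from Best_F; the pseudocode does not spell out the last point, so `move_toward` is a hypothetical interpretation.

```python
import random

def bitdiff(f, g):
    """Number of positions at which two bit vectors differ (Hamming distance)."""
    return sum(x != y for x, y in zip(f, g))

def move_toward(f, target, k, rng):
    """Copy up to k randomly chosen differing bits from target into a copy of f."""
    differing = [i for i, (x, y) in enumerate(zip(f, target)) if x != y]
    moved = list(f)
    for i in rng.sample(differing, min(k, len(differing))):
        moved[i] = target[i]
    return moved

rng = random.Random(0)
fish = [1, 0, 0, 1]   # itemset {a, d} over items a, b, c, d
best = [0, 1, 1, 1]   # itemset {b, c, d}
moved = move_toward(fish, best, k=2, rng=rng)
print(bitdiff(fish, best), bitdiff(moved, best))  # 3 1
```

Each copied bit reduces the distance to the best neighbour by exactly one, so the fish approaches Best_F without necessarily becoming identical to it.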

Algorithm 4 - swarm(PV F)
for j = 1 to NH do
    count_zero[j] = 0;
    count_one[j] = 0;
end for
for i = 1 to N do
    Calculate d = Bitdiff(F, F_i);
    if (d <= VD) then
        for j = 1 to NH do
            count_zero[j]++ if F_i(j) is zero;
            count_one[j]++ if F_i(j) is one;
        end for
    end if
end for
for j = 1 to NH do
    if (count_one[j] >= count_zero[j]) then
        Main_F[j] = 1;
    else
        Main_F[j] = 0;
    end if
end for
if (au(Main_F) >= mau(Main_F) AND IS(Main_F) ∉ HAUI) then
    IS(Main_F) -> HAUI;
end if
if (au(F) < au(Main_F)) then
    is_swarm = true;
    d' = Bitdiff(Main_F, F);
    k = random_int(); // less than d'
    // flip k bits in F
    if (au(F) >= mau(F) AND IS(F) ∉ HAUI) then
        IS(F) -> HAUI;
    end if
end if

Algorithm 5 - prey(PV F)
flag = false;
times = 1;
while times <= try_number do
    k = random_int(); // less than VD
    // flip k bits of F to generate F'
    if (au(F') >= mau(F') AND IS(F') ∉ HAUI) then
        IS(F') -> HAUI;
    end if
    if (au(F') > au(F)) then
        F = F'; flag = true;
        break;
    end if
    times++;
end while
if (!flag) then
    // flip random_int() bits in F
    if (au(F) >= mau(F) AND IS(F) ∉ HAUI) then
        IS(F) -> HAUI;
    end if
end if
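The control flow of prey() can be sketched as follows. The fitness function is passed in as a callable standing for au(), and the bookkeeping that adds qualifying itemsets to the HAUI set is omitted for brevity. The range of the random flip count and the size of the fallback move are assumptions, since the pseudocode only states that k is less than VD.

```python
import random

def flip_random_bits(f, k, rng):
    """Return a copy of bit vector f with k randomly chosen positions inverted."""
    g = list(f)
    for i in rng.sample(range(len(g)), min(k, len(g))):
        g[i] ^= 1
    return g

def prey(f, fitness, visual_distance, try_number, rng):
    """Try up to try_number random moves; fall back to a random move if none improves."""
    for _ in range(try_number):
        k = rng.randint(1, max(1, visual_distance - 1))  # assumed reading of "less than VD"
        candidate = flip_random_bits(f, k, rng)
        if fitness(candidate) > fitness(f):
            return candidate, True
    k = rng.randint(1, len(f))  # hypothetical choice for the fallback move
    return flip_random_bits(f, k, rng), False

rng = random.Random(1)
# Toy fitness: the number of selected items (stands in for au()).
moved, improved = prey([0, 0, 0, 0], fitness=sum, visual_distance=3, try_number=5, rng=rng)
print(moved, improved)
```

In this toy call every flip adds at least one item, so the first attempt already improves the fitness and the function returns with `improved` set to True.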

6.1 Algorithm Description - HAUIM-AFSA-MMU

The Artificial Fish Swarm Algorithm (AFSA) is a metaheuristic optimization algorithm that is inspired by the behavior of fish in a school. The algorithm was first proposed by Li, Shao and Qian in 2002.

In the AFSA, each fish represents a solution to the optimization problem. The fish swim in the search space and adjust their positions according to their own and their neighbors’ behaviors. The algorithm includes four main steps:

Initialization: A set of fish (i.e., solutions) is randomly generated in the search space.

Foraging: Each fish evaluates its fitness (i.e., objective function value) and moves toward better solutions in its local search space.

Prey and predator: Fish randomly switch between being a prey and a predator, where predators move to a random location and prey try to avoid them.

Update: The positions of the fish are updated based on their individual behavior and their interactions with their neighbors.

The AFSA has been successfully applied to a wide range of optimization problems, including function optimization, parameter optimization, and feature selection. Its advantages include easy implementation, robustness, and fast convergence to a global optimum.

In the proposed algorithm, each fish represents an itemset. The input and output parameters are given in Algorithm 1. The MAU table is built by computing a minimum average utility for every item in the transactional database using Eqs. (8) and (9). An empty HAUI set is created, and the initial population of itemsets is generated by roulette wheel selection.

Roulette wheel selection is a commonly used selection method in evolutionary algorithms. It is a stochastic method that selects individuals from a population based on their fitness values, with the probability of selection proportional to the fitness. This reduces the complexity of the candidate generation.
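A minimal sketch of the cumulative-probability wheel built in Algorithm 2 is given below. The weights are the average utilities of the single items of the sample database (Tables 3 and 4, Definition 4). How the selected items are then combined into an initial itemset is not specified here, so the sketch only shows the item draw; the function name `spin` and the fixed seed are illustrative.

```python
import random
from bisect import bisect_right
from itertools import accumulate

# Average utilities of the single items a, b, c, d in the sample database.
items = ["a", "b", "c", "d"]
au = [1125, 1950, 2450, 900]

total = sum(au)
cumulative = list(accumulate(w / total for w in au))  # P[] in Algorithm 2

def spin(rng):
    """Return the index of the item whose cumulative interval contains a random draw."""
    r = rng.random()
    # The clamp guards against rounding leaving the last entry slightly below 1.
    return min(bisect_right(cumulative, r), len(items) - 1)

rng = random.Random(0)  # fixed seed so that runs are repeatable
counts = {item: 0 for item in items}
for _ in range(10_000):
    counts[items[spin(rng)]] += 1
print(counts)  # item c, with the largest average utility, is drawn most often
```

Items with a higher average utility occupy a wider slice of the wheel, which biases the initial population toward promising items while still allowing every item to be selected.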

Once the initial population of size N is generated, the algorithm is repeated for max_iter number of times to identify the HAUIs. During each iteration, each itemset in the population will forage, prey and update its position as explained above based on its utility value and based on the best itemset identified. Each time, the utility of the itemset is evaluated and if it is greater than the average of minimum utilities of all items in the itemset then it is moved to the set HAUIs.

6.2 Significance of AFSA

The effectiveness of any optimization algorithm depends on the specific problem being solved, the characteristics of the objective function, and the available computational resources. Different optimization algorithms may perform better or worse depending on the problem at hand [2-4]. Table 8 compares the efficiency of the Artificial Fish Swarm Algorithm (AFSA) with that of other popular optimization algorithms.

Table 8
Efficiency comparison of AFSA with other algorithms

Optimization Algorithm	Efficiency Advantages of AFSA	Efficiency Advantages of Other Algorithms
Artificial Fish Swarm (AFSA)	Simplicity and ease of implementation, nature-inspired algorithm, robustness to noise, scalability	Strong convergence properties, faster convergence in certain scenarios, specialized adaptations to problem domains, rigorous theoretical foundations
Particle Swarm Optimization (PSO)	Efficient in locating global optima, fast convergence in certain problems, simple implementation	Lack of robustness to noise, can suffer from premature convergence, sensitivity to parameter settings, limited exploration capabilities in complex landscapes
Genetic Algorithms (GA)	Effective in handling large search spaces, diverse solutions, ability to handle multimodal problems	Slower convergence, lack of real-time adaptability, may require higher computational resources, complex parameter tuning
Simulated Annealing (SA)	Ability to escape local optima, global exploration, robustness to noise	Slower convergence, sensitive to temperature scheduling, limited parallelization, potential for getting stuck in suboptimal solutions
Differential Evolution (DE)	Efficient handling of continuous optimization problems, global exploration, good convergence properties	May struggle with discrete or constrained problems, parameter sensitivity, slower convergence in some cases

The following are some insights into the potential advantages of the Artificial Fish Swarm Algorithm (AFSA) over other optimization algorithms.

Exploration and Exploitation Balance: AFSA is designed to balance exploration (searching for new, potentially better solutions) and exploitation (exploiting known solutions to improve them). This balance helps AFSA escape local optima and find near-optimal solutions in complex and multimodal optimization landscapes.

Population-Based Approach: AFSA is a population-based algorithm, meaning it maintains a group of potential solutions (fish) instead of a single solution. This approach allows AFSA to explore multiple regions of the search space simultaneously, increasing the likelihood of finding better solutions.

Robustness to Noise: AFSA has demonstrated robustness in handling noisy fitness landscapes, where the objective function may have random variations or inaccuracies. This robustness can be beneficial in real-world optimization problems, which often involve uncertainties and noisy data.

Scalability: AFSA has shown promising performance in solving both small-scale and large-scale optimization problems. Its ability to handle higher-dimensional search spaces makes it suitable for a wide range of applications.

Fewer Parameters: AFSA typically has fewer adjustable parameters compared to some other optimization algorithms. This characteristic makes it easier to set up and tune for specific problems.

No single optimizer is best for every problem, so it is essential to consider the nature of the problem and to conduct appropriate comparisons with other algorithms before choosing one for a particular task. Given the advantages listed above, AFSA is well suited to finding HAUIs in a given set of transactions.

7 Performance evaluation

In this section, the performance of the proposed algorithm HAUIM-AFSA-MMU is compared against two other HAUI mining algorithms that use multiple minimum utility thresholds: HAUI-MEMU and HAUI-MEMU+. The algorithms were executed on a machine with an i7 3.40 GHz CPU, 16 GB of RAM and a 64-bit operating system.

7.1 Data sets

Table 9 lists the datasets used to evaluate the performance of the proposed algorithm against the other algorithms that mine HAUIs. The datasets are taken from SPMF, an open-source Java library with built-in datasets [32]. Each dataset stores one transaction per line. Each line contains a set of items, represented by their unique IDs, followed by their utilities (the product of quantity and profit). All four datasets in Table 9 share this format.

Table 9
Datasets

Name of the dataset	Average item count per transaction (A)	Total no. of items (I)	No. of transactions	Density (%) (A/I)
Chess	37	76	3,196	49
Mushroom	23	120	8,124	19
T25I200D10K	40	1,000	100,000	4
Retail	10	16,470	88,162	0.06
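
The density column of Table 9 can be recomputed from the raw files. The sketch below is only illustrative: it assumes each line has the layout `items:transaction utility:item utilities` (an assumption about the file layout, not something stated in the text above), and the names `parse_line`, `dataset_stats` and the three sample lines are hypothetical.

```python
from typing import Dict, List, Tuple

Transaction = Dict[int, int]  # item id -> utility of that item in the transaction


def parse_line(line: str) -> Transaction:
    """Parse one transaction, ASSUMING the layout 'i1 i2 ...:TU:u1 u2 ...'."""
    items_part, _total_utility, utils_part = line.strip().split(":")
    items = [int(tok) for tok in items_part.split()]
    utils = [int(tok) for tok in utils_part.split()]
    if len(items) != len(utils):
        raise ValueError("item and utility counts differ")
    return dict(zip(items, utils))


def dataset_stats(db: List[Transaction]) -> Tuple[float, int, int, float]:
    """Return (A, I, number of transactions, density %) as defined in Table 9."""
    n = len(db)
    avg_len = sum(len(t) for t in db) / n          # A: average items per transaction
    distinct = len({i for t in db for i in t})     # I: total number of distinct items
    density = 100.0 * avg_len / distinct           # density (%) = A / I
    return avg_len, distinct, n, density


# Hypothetical toy data, not taken from any of the four datasets.
sample = ["1 2 3:30:5 10 15", "2 4:12:4 8", "1 3 4:21:6 9 6"]
db = [parse_line(line) for line in sample]
stats = dataset_stats(db)
print(stats)
```

Applying the same ratio to the Chess row, 37 / 76 gives roughly 49%, which matches the density reported in Table 9.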

7.2 Execution time

In this section, the efficiency of the proposed algorithm is compared with the other multiple minimum utility algorithms that mine HAUIs. Runtime is compared in two settings: fixed β with varying glmau values, as shown in Fig. 1, and fixed glmau with varying β values, as shown in Fig. 2. The proposed algorithm is more efficient than the other algorithms in the comparison. In particular, it performs better on dense datasets such as Chess and Mushroom than on sparse datasets. The proposed algorithm reduces the runtime by avoiding repeated dataset scans. The runtime decreases as β increases because a larger β makes the minimum average utility of each item comparatively high.

Fig. 1

Runtime comparison for fixed β and various glmau values.

Fig. 2

Runtime comparison for fixed glmau and various β values.
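
The effect of β on the per-item thresholds can be illustrated with a small sketch. It assumes the threshold form mau(i) = max(β · p(i), glmau) that is common in earlier multiple-threshold HAUI work; the computation actually used by the proposed algorithm is defined in the earlier sections and may differ. The function name `item_thresholds` and the glmau value of 300 are hypothetical, and the unit profits are the sample values from Table 3.

```python
from typing import Dict


def item_thresholds(unit_profit: Dict[str, float], beta: float, glmau: float) -> Dict[str, float]:
    """ASSUMED threshold form: mau(i) = max(beta * p(i), glmau)."""
    return {item: max(beta * p, glmau) for item, p in unit_profit.items()}


# Unit profits of the sample items (Table 3); glmau = 300 is an illustrative value.
profits = {"a": 225, "b": 650, "c": 350, "d": 150}
low_beta = item_thresholds(profits, beta=1.0, glmau=300)
high_beta = item_thresholds(profits, beta=2.0, glmau=300)

for item in profits:
    print(item, low_beta[item], high_beta[item])
```

Under this assumed form, no threshold decreases when β grows, so fewer itemsets can qualify and less of the search space needs to be examined, which is consistent with the runtime trend described above.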

7.3 Memory usage

In this section, the memory usage of the three algorithms on the four datasets is compared. The memory usage for fixed β and varying glmau values is shown in Fig. 3, and the memory usage for fixed glmau and varying β values is shown in Fig. 4. Fig. 3 shows that the proposed algorithm performs better on the highly dense Chess dataset, where it consumes less memory than the other two algorithms. The lower memory usage arises because the proposed algorithm discards individuals with low fitness in each iteration, which reduces the total search space.

Fig. 3

Memory comparison for fixed β and various glmau values.

Fig. 4

Memory comparison for fixed glmau and various β values.

7.4 Candidate generation

Figs. 5 and 6 show the candidates generated by each algorithm on the four datasets, for fixed β with varying glmau values and for fixed glmau with varying β values, respectively. The proposed optimization algorithm generates fewer candidates because it reduces the search space. Optimization algorithms are designed to find the optimal solution within a given search space. They do this by iteratively exploring the search space and narrowing down the range of possible solutions until an optimal or near-optimal solution is reached. In this case, a solution is a vector of probabilities representing the items in the transaction database. The proposed HAUIM-AFSA-MMU algorithm reduces the search space in the following ways.

Fig. 5

Candidate comparison for fixed β and various glmau values.

Fig. 6

Candidate comparison for fixed glmau and various β values.

Restricting the domain: The algorithm concentrates only on the vectors with maximum fitness and evolves new vectors from the existing vectors that have the highest fitness at that instant.

Eliminating unpromising regions: In the proposed algorithm, unpromising regions are regions with a low density of food, following the definition used in the Artificial Fish Swarm Algorithm. Vectors are not allowed to move towards regions with a low density of food, and this density is determined from the fitness values. The fitness is calculated using the mau of each item present in the vector, as discussed in the proposed algorithm.

Overall, the proposed algorithm reduces the search space by focusing on promising regions and discarding unpromising ones. This reduces the number of candidates that must be generated to find HAUIs.
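
The idea of keeping only the fittest individuals can be sketched as follows. This is a simplified illustration, not the HAUIM-AFSA-MMU procedure itself: candidates are represented as plain itemsets rather than probability vectors, fitness is taken to be the average utility of the itemset over the database, and the names `average_utility` and `keep_fittest` are hypothetical. The toy database uses the transactions of Table 1 and the unit profits of Table 3, assuming a purchase quantity of 1 for every item.

```python
from typing import Dict, FrozenSet, List

Database = List[Dict[str, int]]  # each transaction maps item -> utility


def average_utility(itemset: FrozenSet[str], db: Database) -> float:
    """Sum the itemset's utility over transactions containing it, divided by its size."""
    total = 0
    for transaction in db:
        if itemset.issubset(transaction.keys()):
            total += sum(transaction[i] for i in itemset)
    return total / len(itemset)  # itemsets are assumed to be non-empty


def keep_fittest(population: List[FrozenSet[str]], db: Database, survivors: int) -> List[FrozenSet[str]]:
    """Discard low-fitness individuals; only the top `survivors` are evolved further."""
    ranked = sorted(population, key=lambda s: average_utility(s, db), reverse=True)
    return ranked[:survivors]


# Table 1 transactions with Table 3 profits, ASSUMING quantity 1 for every item.
db = [
    {"c": 350, "d": 150},
    {"a": 225, "b": 650, "c": 350},
    {"a": 225, "b": 650, "c": 350, "d": 150},
    {"a": 225, "c": 350, "d": 150},
]
population = [frozenset("ab"), frozenset("cd"), frozenset("d"), frozenset("bc")]
best = keep_fittest(population, db, survivors=2)
print([sorted(s) for s in best])
```

Individuals that fall outside the retained set are never expanded, so any candidates that would have been derived from them are not generated.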

7.5 Statistical significance

Additionally, the Friedman test is used to determine the statistical significance of the differences in performance among the three algorithms. The Friedman test [6, 22] is a nonparametric statistical test based on the rank of each algorithm on each dataset. Equations (10) and (11) show the calculation of the Friedman statistics.

$\chi_{F}^{2} = \frac{12 N}{k (k + 1)} \left( \sum_{j = 1}^{k} R_{j}^{2} - \frac{k (k + 1)^{2}}{4} \right)$ (10)

$F_{F} = \frac{(N - 1) \chi_{F}^{2}}{N (k - 1) - \chi_{F}^{2}}$ (11)

where N is the number of datasets, k is the number of algorithms, and $R_{j}$ is the average rank of algorithm j across all 4 datasets. The statistic $F_{F}$ follows a Fisher distribution with k-1 and (k-1)(N-1) degrees of freedom.

If the Friedman test rejects the null hypothesis that all techniques perform equally, individual Friedman tests are run to see whether the performances of the three algorithms on the four datasets differ significantly in terms of the optimal values obtained. The null hypothesis of these tests is that, for a particular glmau, all configurations are similar in terms of the best values for each of the three algorithms. A significance level of 0.05, corresponding to a 95% confidence level, is used to determine the critical value of the Fisher distribution.

For the given setup with three algorithms and four datasets, F_F is distributed with k-1 = 2 and (k-1)(N-1) = 6 degrees of freedom, where k = 3 is the number of algorithms and N = 4 is the number of datasets considered. The optimal values obtained by the three algorithms on the four datasets for a fixed glmau are listed in Table 10. The results in Table 10 show that the proposed algorithm HAUIM-AFSA-MMU ranks first on every dataset and thus performs better than the other two algorithms under consideration. The Friedman test indicates that this difference in performance is statistically significant under the given setup.

Table 10

Optimal value comparison of the three algorithms on the four datasets

Dataset	Glmau (k)	MEMU (Rank)	MEMU+(Rank)	HAUIM-AFSA-MMU (Rank)
Chess	2630	263,567 (3)	268,355 (2)	289,564 (1)
Mushroom	2300	1,735,936 (3)	1,861,022 (2)	1,956,472 (1)
T25I200D10K	60	256,789 (3)	267,892 (2)	286,453 (1)
Retail	1100	322,928 (3)	326,833 (2)	336,365 (1)
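
As a worked illustration, the Friedman statistics can be recomputed from the values in Table 10. The sketch below assumes the standard forms of the two statistics, with F_F = (N-1)·χ² / (N(k-1) - χ²), and it ranks the algorithms per dataset with rank 1 for the highest optimal value. The function names are hypothetical, and ties are not handled because Table 10 contains none.

```python
import math
from typing import List, Tuple


def average_ranks(scores: List[List[float]]) -> List[float]:
    """Rank algorithms on each dataset (1 = highest score) and average the ranks."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    return [t / len(scores) for t in totals]


def friedman_statistics(avg_ranks: List[float], n: int) -> Tuple[float, float]:
    """Return (chi2_F, F_F) for k algorithms ranked on n datasets."""
    k = len(avg_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0
    )
    denom = n * (k - 1) - chi2
    # When every dataset gives the same ranking, chi2 equals n*(k-1) and F_F is unbounded.
    f_f = math.inf if denom <= 0 else (n - 1) * chi2 / denom
    return chi2, f_f


# Optimal values from Table 10; columns: MEMU, MEMU+, HAUIM-AFSA-MMU.
table10 = [
    [263567, 268355, 289564],     # Chess
    [1735936, 1861022, 1956472],  # Mushroom
    [256789, 267892, 286453],     # T25I200D10K
    [322928, 326833, 336365],     # Retail
]
ranks = average_ranks(table10)
chi2, f_f = friedman_statistics(ranks, n=len(table10))
print(ranks, chi2, f_f)
```

Because the three algorithms are ranked identically on all four datasets, the average ranks are 3, 2 and 1, χ² reaches its maximum value N(k-1) = 8, and the denominator of F_F becomes zero. The sketch reports F_F as infinite in that case, which exceeds any finite critical value.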

8 Conclusion

In this paper, an algorithm is proposed to mine HAUIs using the AFSA optimization technique with multiple minimum utilities fixed for each item in the dataset. The experimental results show that the proposed HAUIM-AFSA-MMU algorithm outperforms the other two algorithms considered in most cases, in both execution time and memory usage. Parameter sensitivity is the major limitation of the AFSA algorithm discussed in this paper: although AFSA uses few parameters, as discussed in Section 6.2, proper parameter tuning is still necessary for optimal performance. Our future work focuses on modelling new approaches for fixing the minimum average utility of each item in the dataset. Metrics such as the external utility of each item and the total weighted utility of the transactions in which the item appears may also be considered when fixing the minimum utility of each item in the database. The minimum utility of each item can also be fixed by comparing the utility strength of similar items in the transaction database. Many such strategies can be developed for fixing the multiple utilities, and these metrics will be considered in building the new algorithm in our future work. Future work may also include reducing the limitations of AFSA in order to mine better HAUIs.

References

1. Erwin, Gopalan R.P., Achuthan N.R., Efficient mining of high utility itemsets from large datasets, Pacific-Asia Conference on Knowledge Discovery and Data Mining (2008), 554–561.

2. Abed-alguni B.H. and Paul, Island-based Cuckoo Search with elite opposition-based learning and multiple mutation methods for solving optimization problems, Soft Comput 26 (2022), 3293–3312.

3. Bilal Abed-alguni, Noor Aldeen Alawad, Distributed Grey Wolf Optimizer for scheduling of workflow applications in cloud environments, Applied Soft Computing 102 (2021), 107113, ISSN 1568-4946.

4. Faisal Alkhateeb and Bilal Abed-alguni, A hybrid cuckoo search and simulated annealing algorithm, Journal of Intelligent Systems 28(4) (2019), 683–698.

5. Fournier-Viger, Nawaz M.S., He, Wu, Nouioua, Yun, MaxFEM: Mining Maximal Frequent Episodes in Complex Event Sequences. In: Surinta, O., Kam Fung Yuen, K. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2022. Lecture Notes in Computer Science, vol 13651 (2022). Springer, Cham.

6. Friedman, A comparison of alternative tests of significance for the problem of m ranking, Ann Math Stat 11 (1940), 86–92.

7. Han, Pei, Yin and Mao, Mining frequent patterns without candidate generation: a frequent-pattern tree approach, Data Mining and Knowledge Discovery 8(1) (2004), 53–87.

8. Jerry Chun-Wei Lin, Lu Yang, Philippe Fournier-Viger, Jimmy Ming-Thai Wu, Tzung-Pei Hong, Leon Shyue-Liang Wang, Justin Zhan, Mining high-utility itemsets based on particle swarm optimization, Engineering Applications of Artificial Intelligence 55 (2016), 320–330.

9. Jerry Chun-Wei Lin, Lu Yang, Philippe Fournier-Viger, Tzung-Pei Hong, Miroslav Voznak, A binary PSO approach to mine high-utility itemsets, Soft Computing – A Fusion of Foundations, Methodologies and Applications 21(17) (2017), 5103–5121.

10. Jerry Chun-Wei Lin, Shifeng Ren, Philippe Fournier-Viger, MEMU: More Efficient Algorithm to Mine High Average-Utility Patterns with Multiple Minimum Average-Utility Thresholds, IEEE Access 14(8) (2017).

11. Jerry Chun-Wei Lin, Ting Li, Philippe Fournier-Viger, Tzung-Pei Hong, Ja-Hwung Su, Efficient Mining of High Average-Utility Itemsets with Multiple Minimum Thresholds, Springer International, ICDM 2016, LNAI 9728, pp. 14–28, 2016.

12. Jerry Chun-Wei Lin, Ting Li, Philippe Fournier-Viger, Tzung-Pei Hong, Justin Zhan, Miroslav Voznak, An efficient algorithm to mine high average-utility itemsets, Advanced Engineering Informatics 30(2) (2016), 233–243.

13. Kannimuthu and Premalatha, Discovery of high utility itemsets using genetic algorithm with ranked mutation, Appl. Artif. Intell. 28(4) (2014), 337–359.

14. Ken McGarry, A survey of interestingness measure for knowledge discovery, The Knowledge Engineering Review.

15. Shao and Qian, An optimizing method based on autonomous animals: fish-swarm algorithm, Syst. Eng. Theor. Pract. 22(11) (2002), 32–38. (in Chinese).

16. Lin J.C.-W., et al., Mining high-utility itemsets based on particle swarm optimization, Eng. Appl. Artif. Intel. 55 (2016), 320–330.

17. Liqiang Geng, Howard J. Hamilton, Interestingness measures for Data Mining: A survey, ACM Computing Surveys 38(3) (2006), Article 9.

18. Saqib Nawaz, Philippe Fournier-Viger, Unil Yun, Youxi Wu, Wei Song, Mining High Utility Itemsets with Hill Climbing and Simulated Annealing, ACM Trans. Manage. Inf. Syst. 13(1), Article 4 (2022), 22.

19. Tung N.T., Loan Nguyen T.T., Trinh Nguyen D.D., Philippe Fournier-Viger, Ngoc-Thanh Nguyen, Bay Vo, Efficient mining of cross-level high-utility itemsets in taxonomy quantitative databases, Information Sciences 587 (2022), 41–62.

20. Nawaz M.S., Fournier-Viger, Alhusaini, He, Wu, Bhattacharya, LCIM: Mining Low Cost High Utility Itemsets. In: Surinta, O., Kam Fung Yuen, K. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI (2022).

21. Philippe Fournier-Viger, Jerry Chun-Wei Lin, Quang-Huy Duong, Thu-Lan Dam, PHM: Mining Periodic High-Utility Itemsets, Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2016.

22. projects/hdb/resources.shtml. Accessed 6 April 2023.

23. Agrawal, Srikant, Fast algorithms for mining association rules, International Conference on Very Large Data Bases (1994), 487–499.

24. Agrawal, Srikant, Mining sequential patterns, International Conference on Data Engineering (1995), 3–14.

25. Kannimuthu, Premalatha, Shankar, Investigation of high utility itemset mining in service oriented computing: Deployment of knowledge as a service in E-commerce, 2012 Fourth International Conference on Advanced Computing (ICoAC), Chennai, India, 2012, pp. 1–8.

26. Krishnamoorthy, Pruning strategies for mining high utility itemsets, Expert Systems with Applications 42(5) (2015), 2371–2381.

27. Song, Huang, Discovering high utility itemsets based on the artificial bee colony algorithm. In: Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M., Rashidi, L. (eds.) PAKDD 2018. LNCS (LNAI), vol. 10939, pp. 3–14. Springer, Cham (2018).

28. Song and Huang, Mining high average-utility itemsets based on particle swarm optimization, Data Sci. Pattern Recogn 4(2) (2020), 19–32.

29. Song, Li, Discovering high utility itemsets using set-based particle swarm optimization. In: Yang, X., Wang, C.-D., Islam, M.S., Zhang, Z. (eds.) ADMA 2020. LNCS (LNAI), vol. 12447, pp. 38–53. Springer, Cham (2020).

30. Song Wei, Huang Chaomin, Mining High Utility Itemsets Using Bio-Inspired Algorithms: A Diverse Optimal Value Framework, IEEE Access 6 (2018), 19568–19582. 10.1109/ACCESS.2018.2819162.

31. Song Wei, Ye Wei, Fournier-Viger Philippe, Mining sequential patterns with flexible constraints from MOOC data, Applied Intelligence 52 (2022), 10.1007/s10489-021-03122-7.

32. SPMF: An open-source data mining library, http://www.philippe-fournier-viger.com/spmf/. Accessed 6 April 2023.

33. Subramanian, Kandhasamy and Subramanian, A Novel Approach to Extract High Utility Itemsets from Distributed Databases, Computing and Informatics 31(6+) (2013), 1597–1615.

34. Subramanian Kannimuthu, Kandhasamy Premalatha, UP-GNIV: An expeditious high utility pattern mining algorithm for itemsets with negative utility values, International Journal of Information Technology and Management 14 (2015), 26–42.

35. Gan, Lin J.C.W. and Fournier-Viger, More efficient algorithms for mining high-utility itemsets with multiple minimum utility thresholds, Database and Expert Systems Applications 2016, 202–213.

36. Wei Song, Jiakai Nan, Mining High Utility Itemsets Using Ant Colony Optimization, Springer Nature Switzerland AG 2021, AISC 1348, pp. 98–107, 2021.

37. Wei Song, Junya Li, Chaomin Huang, Artificial Fish Swarm Algorithm for Mining High Utility Itemsets, Springer Nature Switzerland AG 2021, Y. Tan and Y. Shi (Eds.): ICSI 2021, LNCS 12690, pp. 407–419, 2021.

38. Wei Song, Lu Liu, Chaomin Huang, Generalized maximal utility for mining high average-utility itemsets, Knowledge and Information Systems 63 (2021), 2947–2967.

39. Wensheng Gan, Jerry Chun-Wei Lin, Jiexiong Zhang, Han-Chieh Chao, Hamido Fujita, Philip Yu, ProUM: High Utility Sequential Pattern Mining, IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, October 6-9, 2019.

40. Chen, Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism, Decision Support Systems (2006), 1–24.

Mining high average utility itemsets using artificial fish swarm algorithm with computed multiple minimum average utility thresholds
