Abstract
Negative sequential patterns (NSP) refer to sequences with non-occurring and occurring items, and can play an irreplaceable role in understanding and addressing many business applications. However, some problems occur after mining NSP, the most urgent one of which is how to select the actionable positive or negative sequential patterns. This is due to the following factors: 1) positive sequential patterns (PSP) mined before considering NSP may mislead decisions; and 2) it is much more difficult to select actionable patterns after mining NSP, as the number of NSPs is much greater than PSPs. In this paper, an improved method of pruning uninteresting itemsets to fit for a selecting actionable sequential pattern (ASP) is proposed. Then, a novel and efficient method, called SAP, is proposed to select the actionable positive and negative sequential patterns. Experimental results indicate that SAP is very efficient in the selection of ASP. To the best of our knowledge, SAP is the best method for the selection of actionable positive and negative sequential patterns.
Introduction
Negative sequential patterns (NSP) refer to sequences with non-occurring and occurring items, while sequential patterns that contain only occurring items are called positive sequential patterns (PSP) [19, 24, 26, 19, 24, 26]. For instance, assume that
One reason for this is that the PSPs being mined before considering NSP may mislead decisions. For example, a PSP <
Another reason is that it is much more difficult to select actionable patterns after mining NSP, because the number of NSPs is much greater than PSPs. To a
Although there have been some reports of actionable knowledge discovery [8, 16] and selecting actionable patterns/rules or interestingness measures in association rule mining [1, 17], none of the previous research considers how to select actionable positive and negative sequential patterns. Very few papers study NSP mining [12, 27], and most primarily focus on how to design a mining algorithm and how to improve the algorithm’s efficiency. Therefore, this paper investigates association rule mining. In this area, some methods have been proposed to prune uninteresting itemsets, to select actionable patterns, to mine positive and negative association rules, and so on [2, 17]. Among these methods, the one proposed by X.D. Wu, et al. is one of the most suitable methods, referred to as Wu’s method for convenience [23]. Wu’s method can prune the itemsets that are not of interest before mining positive and negative association rules. However, due to the differences between the association rule and sequential pattern, this method cannot be directly used to solve the selection problem and must be improved before use.
No studies shave yet applied Wu’s method to positive or negative sequential patterns, and Wu’s method can only deal with itemsets, not actionable sequential patterns. Therefore, in this paper, we first improve Wu’s method to fit a selecting actionable sequential pattern (ASP), then propose a novel and efficient method, called SAP, to select actionable positive and negative sequential patterns based on the improved method. Experimental results indicate that SAP is very efficient in the selection of ASP. To the best of our knowledge, this is the best method to select actionable positive and negative sequential patterns.
The remainder of the paper is organized as follows. Section 2 discusses the related work. In Section 3 reviews an e-NSP algorithm and Wu’s method, and then proposes the improved method and SAP method. Section 4 presents the experiment results, and conclusions and future work are presented in Section 5.
Related work
Some literature has proposed a few measures to mine interesting association rules, such as correlation coefficients, chi-squared tests, interestingness, Laplace, the Gini-index, Piatetsky-Shapiro method, conviction and so on [2, 17]. M.L. Antonie, O.R. Zaiane, X.J. Dong, Z.D. Niu, X.L. Shi, X.D. Zhang and D.H. Zhu used correlation coefficient to mine positive and negative association rules [11, 25]. B. Liu, W. Hsu and Y.M. Ma used chi-squared test to prune non-actionable rules or rules at lower significance levels [3].
Some literature has discussed the selection of actionable knowledge or actionable patterns. L.B. Cao, C.Q. Zhang, et al. developed a new data mining framework, referred to as Domain-Driven In-Depth Pattern Discovery (DDID-PD) [8]. L.B. Cao, Y.C. Zhao, et al. formally defined the actionable knowledge discovery (AKD) concepts, processes, the action ability of patterns, and operable deliverables. With such components, four types of AKD frameworks are proposed to handle various business problems and applications [9]. L.B. Cao discussed a paradigm shift in knowledge discovery from data to actionable knowledge discovery and delivery. In post-analysis [10], a key component is to extract actions from learned rules [23]. A typical method applied to learning action rules is to split attributes into “hard/soft” [23] to extract actions that may improve the loyalty or profitability of customers. G. Adomavicius and A. Tuzhilin proposed a method of discovery of actionable patterns, which first builds an action tree for the specific application, and then assigns actionable patterns to the corresponding nodes of the tree by data mining queries [5]. K. Kavitha and E. Ramaraj proposed an efficient transaction reduction algorithm, TR-BC, to mine the frequent patterns based on bitmap and class labels [6]. K. Wang, Y. Jiang and A. Tuzhilin introduced the notion of “action” as a domain-independent method to model the domain knowledge. This presents several pruning strategies to reduce the search space and algorithms for mining all actionable patterns, as well as mining the top
Next, the literature on mining NSP is discussed.
S.C. Hsueh, M.Y. Lin and C.L. Chen proposed a PNSP approach for mining positive and negative sequential patterns in the form of < (
This requires
X.J. Dong, Z.G. Zheng, L.B. Cao, Y.C. Zhao, et al. proposed an efficient e-NSP algorithm to mine NSP. In this paper, e_NSP is used to mine PSPs and NSPs. The details of e_NSP will be introduced in Section 3 [24].
SAP method
Basic concept
Let
Sequence
A sequence database D is a set of tuples <
The concepts of length and size for positive sequences are also suitable for negative sequences. The neg-size of sequence ns, denoted as neg-size(ns), is the total number of negative elements in ns; ns is a k-neg-size negative sequence if neg-size(ns) = k. For example,
Review of e_NSP
Three constraints in e_NSP
An e_NSP adds three constraints to negative sequences in order to decrease the number of NSCs and discover only meaningful NSPs.
Constraint 1 is the element negative constraint. That is, the minimum negative unit in am NSC is an element; not only one part of the items in the element (if the element includes more than one item) is negative. For example, <
Constraint 2 is formation constraint. There are no contiguous negative elements in an NSC because right order of two contiguous negative elements cannot be determined if there are no positive elements between them. For example, <
Constraint 3 is frequency constraint. The e_NSP only mined the NSP whose positive partner is a PSP. The positive partner of a negative sequence changes all negative elements in ns to their corresponding elements, denoted as
The main steps of e_NSP
There are four steps in e_NSP. The first step is to mine all PSPs according to the classical PSP mining algorithms. To each PSP p, save its support and all sid of the data sequence that contain the p to a set {p}. Based on these PSPs, generate all NSCs according to the following method. For a k-size PSP, its NSCs are generated by changing any m non-contiguous element to its negative element, Convert the negative containment to a positive containment. For example, if a data sequence ds contains negative sequence <¬ Calculate the support of each NSC and determine whether it is a NSP by comparing its support with the minimum threshold
For example,
The original Wu’s method
Wu’s method defines an interestingness function interest (
where f() is a constraint function concerning the support, confidence, and interestingness of
Using this method, Wu’s method can establish an effective pruning strategy for efficiently identifying all frequent itemsets. However, due to the differences between itemsets and sequential patterns, this method can’t be directly used to solve the selection problem and must be improved before use.
The improvements cover three aspects; the third improvement is the key improvement. Remove the confidence measure from Equations 1–3.
The confidence measure c() and mc are used to discover association rules. If only itemsets/ patterns are discovered, this measure can be removed, as in Wu’s method. Replace the interestingness function with correlation coefficient function.
X.J. Dong explains that interest (
The correlation coefficient between
If If If
The range of
Correspondingly, we replace interest ( Adapt these equations to fit sequential patterns.
So far, the above discussions are based on mining interesting itemsets, not sequences. The differences between them are that, in a sequence, the itemsets are in an order, and an item can occur at multiple times in different elements of a sequence. Therefore, the conditions ∃
A k-size (k > 1) PSP or NSP
According to Definition 1, we can obtain a corollary as follows.
Definition 1 and correlary 1 are used to implement the proposed SAP method
The primary steps of SAP method are as follows.
Algorithm SAP.
Input: D: a sequential database; ms: minimum support;
Output: ASP: set of actionable sequential patterns;
- let ASP =
- use e-NSP to mine all PSPs and NSPs and store them in the set {PNSP};
- for k = 2 to maximum length of PSP in {PNSP} do {
- for each k-size pattern
- test P with definition 1;
- if P is an actionable sequential pattern then
- ASP = ASP∪{P};
- else
- remove P and all the patterns contain P from PNSP;
- end if
- }
- k++;
- }
- return ASP;
An example is illustrated below.
The sample database is shown in Table 1, given ms = 0.4, and
PSP = {<a>, <b>, <c>, <d>, <e>, <f>, <aa>, < (ab)>, <ab>, <ba>, <ac>, <ca>, <ad>, <ea>, <af>, < (bc)>, <bc>, <cb>, <bd>, <db>, <eb>, <bf>, <fb>, <cc>, <dc>, <ec>, <fc>, <ef>, < (ab)c>, < (ab)d>, < (ab)f>, <aba>, <eab>, <a(bc)>, <abc>, <aca>, <eac>, <acb>, <acc>, < (bc)a>, <adc>, <ebc>, <fbc>, <dcb>, <ecb>, <fcb>, <bdc>, <efb>, <efc>, < (ab)dc>, <a(bc)a>, <eacb>, <efcb> }.
NSP = {< ¬
For <ab>, s(<ab>) = 0.8, s(<a>) = 0.8, s(<b>) = 0.8
Sequence <ab>meets asp(a, b) = s(<ab>) ≥ 0.4 ∧ (f (a, b, ms,
For <cb>, s(<cb>) = 0.6, s(<c>) = 0.8, s(<b>) = 0.8
Therefore <cb>is not an asp.
For < ¬
We require test sequence <¬
Sequence <¬
All ASPs are obtained according to the above procedure.
ASP = {<
Experimental results
Experiments were performed on a Pentium 4 Celeron 2.1 G PC with 2 G main memory, running on Microsoft Windows7. All programs were written in MyEclipse 10.
Datasets
To describe and observe the impact of data characteristics on algorithm performance, the following data factors are used:
Four source datasets are also used in these experiments [24]. They include both real data and synthetic datasets generated by an IBM data generator [18].
Dataset 1 (DS1), C8_T4_S6_I6_DB100k_N0.1k.
Dataset 2 (DS2), C10_T8_S20_I10_DB10k_N0.2k.
Dataset 3 (DS3) is obtained from UCI Datasets. There are 989,818 records; average number of elements in a sequence is 4; and each element only has one item.
Dataset 4 (DS4) is real-application data from the financial service industry. It contains 5,269 customers/sequences; the average number of elements in a sequence is 21; the minimum number of elements in a sequence is 1; and the maximum number is 144.
Experimental results
In the experiments, different
Experimental analysis
Figures 1–4 demonstrate that with the increment of
Because our method was conducted after the improvement of Wu’s method and Wu’s method cannot be used for processing sequence, our proposed method cannot be directly compared to Wu’s method. However, our data are all sequence data. Further, There is no method can be used to compare.
These experimental results indicate that our proposed method demonstrates a proven stable running result and running time. The problem of selecting the actionable positive and negative sequential patterns is solved through our method. By setting the parameter close to the real-world parameter, SAP determination can efficiently deal with both synthetic and real-world databases.
Conclusions and future work
It is increasingly recognized that NSP can play an irreplaceable role in understanding and addressing many business applications. However, some urgent problems occur after mining NSP. These urgent problems include: 1) a selection problem; 2) efficiency problem; 3) candidate problem; and other concerns. Among these problems, the selection problem is the most urgent because the PSP being mined before considering NSP may mislead decisions, and it is much more difficult to select actionable patterns after mining NSP due to the huge number of NSPs. This paper proposes an improvement to Wu’s pruning method to fit for selecting ASP. A novel method is proposed to efficiently use SAP to select ASP. Experimental results indicate that SAP is very efficient.
Future work will find solutions to other urgent problems.
Footnotes
Acknowledgments
This work was supported partly by National Natural Science Foundation of China (71271125, 61173061 and 71201120), Natural Science Foundation of Shandong Province, China (ZR2011FM028) and Scientific Research Development Plan Project of Shandong Provincial Education Department (J12LN10).
