A New Parallel Algorithm For Frequent Pattern Mining
Frequent pattern mining is one of the important topics discussed recently in the field of data
mining. Frequent patterns are fundamental in generating association rules, time series, etc.
Most frequent pattern mining algorithms fall into two categories: the generate-and-test approach
(Apriori-like) and the pattern growth approach (FP-tree). The pattern growth approach is based on
the FP-tree and needs only two database scans. The FIUT algorithm can reduce the computational
time of the FP-growth algorithm. For pattern growth methods, the execution time increases rapidly
when the database size increases or when the given support is small. Parallel and distributed
computing is therefore a good strategy for solving this problem. Two parallel mining algorithms
have been proposed in this field, BTP-tree and TPFP-tree. In this paper, we propose the TP-FIUT
algorithm, which parallelizes FIUT. Our method is similar to the parallelized FP-growth method.
We show that, when the database is large, the run time of the proposed algorithm on homogeneous
and heterogeneous computing clusters is lower than that of the other algorithms, and that the
proposed algorithm has better load-balancing capability than TPFP-tree, PFP-tree, and BTP-tree on
a multi-cluster grid.
Keywords: Frequent Pattern Mining, Data Mining, Association Rules, Grid Computing, Cluster
Computing, Frequent Items Ultrametric Trees.
RESEARCH ARTICLE
1. INTRODUCTION
Frequent patterns are itemsets, subsequences, or substructures that appear in a data set with
frequency no less than a user-specified threshold. Frequent pattern mining was first proposed by
Agrawal et al. (1993) for market basket analysis in the form of association rule mining. It
analyses customer buying habits by finding associations between the different items that
customers place in their "shopping baskets." Such information can lead to increased sales by
helping retailers do selective marketing and arrange their shelf space.
Extracting frequent patterns in a transaction-oriented database is vital in the mining of
association rules (Agrawal & Srikant, 1994; Park et al., 1995), time series, classification
(Gorodetsky et al., 2003), etc. The basic problem in frequent pattern mining is finding the
number of times a given pattern appears in a database [1]. Most of the research in this area has
used either the generate-and-test (Apriori-like) approach or the pattern growth approach
(FP-growth) (Coenen et al., 2004; Han et al., 2004).
For the Apriori-like approach (Lazcorreta et al., 2008; Park et al., 1995), the core idea is that
if any pattern of length k is not frequent in the database, then its super-patterns of length
k + 1 cannot be frequent. However, this approach generates a large number of candidate itemsets
and repeatedly scans the database to verify whether each candidate is frequent. For example,
2^50 (about 10^15) candidate itemsets may have to be verified in a database with 50 items.
Han et al. (2004) proposed a novel data structure and method for mining frequent patterns: the
Frequent Pattern (FP) tree, which stores only the compressed information necessary for mining. A
mining algorithm based on the FP-tree, FP-growth, was also developed. Unlike the Apriori
algorithm, the FP-tree approach scans the database only twice, and the mining information is
obtained from the proposed data structure [3].
Yuh-Jiuan Tsay, Tain-Jung Hsu, and Jing-Rung Yu (2009) proposed a mining algorithm (FIUT) based on
the FP-tree. FIUT was compared with FP-growth, a well-known and widely used algorithm, and the
simulation results showed that FIUT performs better than FP-growth; FIUT also needs only two
database scans [4].
However, even though the FP-tree approach performed better, the execution time still increased
significantly when the database was large. A parallel and distributed technique is a good
strategy for overcoming this problem.
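
The Apriori-style pruning rule and the size of the candidate space mentioned above can be illustrated with a minimal Python sketch. The function name `next_candidates` and the toy itemsets are hypothetical and chosen only for illustration; this is not the implementation of any of the cited algorithms.

```python
from itertools import combinations

def next_candidates(frequent_k):
    """Build (k+1)-candidates whose every k-subset is frequent (downward closure)."""
    frequent_k = {frozenset(s) for s in frequent_k}
    if not frequent_k:
        return set()
    k = len(next(iter(frequent_k)))
    items = sorted(set().union(*frequent_k))
    candidates = set()
    for combo in combinations(items, k + 1):
        # Prune: a (k+1)-pattern can be frequent only if all of its k-subsets are.
        if all(frozenset(sub) in frequent_k for sub in combinations(combo, k)):
            candidates.add(frozenset(combo))
    return candidates

# Without any pruning, 50 items admit 2**50 - 1 non-empty itemsets (about 1.1e15).
n_itemsets = 2 ** 50 - 1

# Toy frequent 2-itemsets (hypothetical data).
frequent_2 = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"B", "D"}]
print(next_candidates(frequent_2))  # only {A, B, C} survives the pruning
print(n_itemsets)
```

Even with pruning, every surviving candidate still has to be checked against the database, which is the repeated scanning cost that the pattern growth methods avoid.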
Javed and Khokhar [5] proposed a parallel FP-tree mining algorithm (PFP-tree) to solve the
problem. The results show that parallel computing is a good approach for solving this problem.
However, for PFP-tree, when the given threshold is small or the average transaction length is
long, too much information must be exchanged among processors. The performance deteriorates
notably when the database grows or the given support decreases.
Yu and Zhou proposed the TPFP-tree and BTP-tree algorithms [1]. TPFP-tree solves frequent pattern
mining problems on a PC cluster. A PC cluster is a typical parallel computing environment with
homogeneous hardware and software resources. It consists of many computers interconnected by a
fast network. The goal of the TPFP-tree algorithm was to reduce both communication and tree
insertion cost, thus decreasing execution time.
BTP-tree solves frequent pattern mining problems on a computing Grid. Grid computing and pervasive
computing can solve complex problems with large-scale computation power and data storage
resources (Foster & Kesselman, 1998). A Grid is a loosely coupled computing architecture based on
internet connections. Compared with traditional cluster systems, it shares heterogeneous
computing and storage resources and can easily add computing resources at lower cost. Mining the
frequent patterns of a large database requires large computation resources. However, a computing
Grid consists of heterogeneous resources. Therefore, the mining algorithm should
subset of I is called an itemset. $\mathrm{SUP}_{DB}(x)$ denotes the number of transactions in a
database that contain pattern x, $\mathrm{SUP}_{DB}(x) = |\{t \mid t \in DB \text{ and } x \subseteq t\}|$.
The problem of frequent pattern mining is to find every itemset x with
$\mathrm{SUP}_{DB}(x) \geq \xi$ for a given threshold $\xi$ ($1 \leq \xi \leq |DB|$).
2.1. Frequent Pattern Growth (FP-Growth)
Han et al. proposed the FP-growth method to avoid generating candidate itemsets by building an
FP-tree with only two scans over the database. The FP-growth algorithm can be decomposed into two
phases: the construction of the FP-tree and the mining of frequent patterns from it.
An FP-tree is an extended prefix-tree structure for storing compressed and crucial information
about frequent patterns, while the FP-growth algorithm uses the FP-tree structure to find the
complete set of frequent patterns. An FP-tree consists of one root labeled "null", a set of item
prefix subtrees as the children of the root, and a frequent-item header table. Each node in a
prefix subtree consists of three fields: item-name, count, and node-link. The count of a node
records the number of transactions in the database that share the prefix represented by the node,
and node-link links to the next node in the FP-tree carrying the same item-name. Each entry in the
frequent-item header table consists of two fields: item-name and head of node-link, which points
to the first node in the FP-tree carrying the item-name. Besides, the FP-tree assumes that the
items are sorted in decreasing order of their support
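
The FP-tree structure described in Section 2.1 (nodes with item-name, count, and node-link, plus a header table pointing to the first node of each item) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of Han et al.: the names `FPNode`, `build_fp_tree`, and `item_support` are hypothetical, ties in support are broken alphabetically (an assumption), and the `children` dictionary is an implementation convenience beyond the three fields named in the text.

```python
from collections import Counter

class FPNode:
    """One FP-tree node: item-name, count, node-link (children is a helper field)."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.node_link = None   # next node carrying the same item-name
        self.children = {}      # item -> child FPNode

def build_fp_tree(transactions, min_support):
    # Scan 1: count the support of every item and keep the frequent ones.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}
    root = FPNode(None)         # the root labeled "null"
    header = {}                 # item -> head of node-link
    tails = {}                  # item -> last node linked so far
    # Scan 2: insert each transaction with items in decreasing support order.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item)
                node.children[item] = child
                if item in tails:
                    tails[item].node_link = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1
            node = child
    return root, header, frequent

def item_support(header, item):
    """Sum the counts along the node-link chain of one item."""
    total, node = 0, header.get(item)
    while node is not None:
        total += node.count
        node = node.node_link
    return total
```

Following the node-links of an item from the header table visits every prefix path that contains it, which is the access pattern FP-growth relies on when it mines the tree.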
the mining items, a processor has the necessary transactions corresponding to its assigned mining
items. Therefore, the processor can independently build the FP-tree and mine the frequent
patterns by FP-growth.
2.3. BTP-Tree Algorithm
The BTP-tree algorithm is designed for Grid computing. Since the Grid is a heterogeneous computing
system, the processors differ in capability and memory size. There are six stages in the BTP-tree
algorithm:
(1) create header table and Tidset,
(2) evaluate the performance index of computing nodes,
(3) distribute mining item set,
(4) exchange transactions,
(5) create FP-tree and
(6) FP-growth [1].
2.4. FIUT Algorithm
FIUT uses a special frequent items ultrametric tree (FIU-tree) structure to enhance its efficiency
in obtaining frequent itemsets. Tsay, Hsu, and Yu [4] identify four features of FIUT. First, it
minimizes I/O overhead by scanning the database only twice. Second, the FIU-tree is an improved
way to partition a database, which results from clustering transactions and significantly reduces
the search space. Third, only frequent items in each transaction are inserted as nodes into the
FIU-tree for compressed storage. Finally, all frequent itemsets are generated by checking the
leaves of each FIU-tree, without traversing the tree recursively, which significantly reduces
computing time.
FIUT consists of two main phases within two scans of database DB. Phase 1 starts by computing the
support for all items occurring in the transaction records. A pruning step then removes all
infrequent items, leaving only frequent items to form the k-itemsets, where a k-itemset is a
transaction that contains k frequent items. Meanwhile, all the frequent 1-itemsets are generated.
Phase 2 is the repetitive construction of small ultrametric trees, the actual mining of these
trees, and their release. To mine frequent k-itemsets, the approach first builds an independent
k-FIU-tree for all k-itemsets, for k from M down to 2, where M denotes the maximal value of k
among the transactions in the database. Then, each of the trees is mined separately, without
generating candidate k-itemsets. The k-FIU-tree is discarded immediately after the frequent
k-itemsets are mined.
3. PROPOSED ALGORITHM
In this paper, we propose a parallel FIUT algorithm (TP-FIUT). Our proposed algorithm runs in two
phases: (1) partitioning, (2) exchange.
In the partitioning phase, the database is scanned twice. First, the database is divided among the
processors, so that each processor scans its own part (Fig. 1(a)) and creates the corresponding
Tidset table (Fig. 1(b)). At the same time, the processors create local header tables (FI)
(Fig. 1(c)). The header table of the global database (GFI) is created by sending these local
header tables to the central processor. In the GFI table, infrequent items whose count is below
the threshold are rejected. Afterward, the GFI table is sent to all processors, so each processor
rescans its own database, removes the items that are not in the GFI table (Fig. 1(d)), and creates
its k-itemset tables (Fig. 2).
[Fig. 1(b): Tidset tables Tidset1 to Tidset4. Fig. 1(d): global header table (GFI):]
Item   C   F   M   A   G   H   K
Count  10  9   8   7   5   4   4
In the exchange phase, let Pk denote the number of processors and M the maximum transaction
length in the database; if Pk ≥ M, information is exchanged between processors. To create the
K-FIUT, a processor has to hold the h-itemsets that satisfy K ≤ h ≤ M. Each processor therefore
sends the h-itemsets from its k-itemset tables to processor Pk, which builds the K-FIUT and, by
checking the leaves of the tree, creates Lk (as in FIUT [4]).
Input: a transaction database $DB = \{T_1, T_2, \ldots, T_n\}$, where each transaction
$T_i \subseteq I$, $I = \{i_1, i_2, \ldots, i_m\}$; a given minimum threshold $\xi$; and p, the
number of processors ($p_1$ is the master node (MN), and $p_2, p_3, \ldots, p_p$ are slave nodes
(SNs)).
Output: a complete set of frequent patterns, where $\sup(x_i) \geq \xi$ for all $x_i$;
$\sup(x_i)$ is the number of transactions in DB containing $x_i$.
Method:
(1) MN equally divides the database DB into p disjoint partitions ($DB_1, DB_2, \ldots, DB_p$;
$DB_1 \cup DB_2 \cup \cdots \cup DB_p = DB$) and assigns $DB_i$ to $p_i$.
(2) Each processor $P_i$ receives the database $DB_i$ and scans it in order to create the local
frequent items ($FI_i$).
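
The partitioning phase described above (divide the database, build local header tables, merge them into the GFI table, prune, and form per-processor k-itemset tables) can be sketched as a sequential, single-process simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, no message passing is modeled, Tidset tables are omitted, and itemsets are stored as alphabetically sorted tuples (an assumption).

```python
from collections import Counter

def partition(db, p):
    """Split db into p disjoint, near-equal slices whose union is db."""
    size, extra = divmod(len(db), p)
    parts, start = [], 0
    for i in range(p):
        end = start + size + (1 if i < extra else 0)
        parts.append(db[start:end])
        start = end
    return parts

def local_header(part):
    """First scan on one processor: local item counts (FI)."""
    return Counter(item for t in part for item in set(t))

def global_header(local_tables, threshold):
    """Central processor: sum the local tables and reject infrequent items (GFI)."""
    total = Counter()
    for fi in local_tables:
        total.update(fi)
    return {item: c for item, c in total.items() if c >= threshold}

def k_itemset_tables(part, gfi):
    """Second scan on one processor: drop items not in GFI, group by length k."""
    tables = {}
    for t in part:
        kept = tuple(sorted(i for i in set(t) if i in gfi))
        if kept:
            tables.setdefault(len(kept), Counter())[kept] += 1
    return tables

# Toy database (hypothetical data), two simulated processors, threshold 2.
db = [list("abc"), list("abd"), list("ac"), list("bce"), list("ab"), list("e")]
parts = partition(db, 2)
gfi = global_header([local_header(part) for part in parts], threshold=2)
tables = [k_itemset_tables(part, gfi) for part in parts]
```

In this toy run item `d` occurs only once overall, so it is rejected from the GFI table and the transaction `abd` is stored as the 2-itemset `(a, b)` on the first simulated processor.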
[Fig. 2: k-itemset tables (4-, 3-, and 2-itemsets) of each processor and their destination
processors ("Transactions to P3", "Transactions to P3, P2, P1", "Transactions to P2, P1"), with
M = max {Mi | i = 1, 2, …, p} = 4.]
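
The exchange phase can be sketched in the same spirit: the processor responsible for level k needs every h-itemset with k ≤ h ≤ M, and the frequent k-itemsets follow from counting. The sketch below is a hedged stand-in, not the authors' method. It replaces the k-FIU-tree and its leaf check with direct counting of the k-subsets of each h-itemset, it does not model which processor owns which level, and the function names and toy tables are hypothetical. It assumes every itemset is stored as a sorted tuple so that equal subsets compare equal.

```python
from collections import Counter
from itertools import combinations

def merge_tables(tables):
    """Combine the k-itemset tables of all processors, summing equal itemsets."""
    merged = {}
    for table in tables:
        for h, itemsets in table.items():
            merged.setdefault(h, Counter()).update(itemsets)
    return merged

def mine_level(merged, k, M, threshold):
    """Count every k-subset of every h-itemset with k <= h <= M; keep frequent ones."""
    counts = Counter()
    for h in range(k, M + 1):
        for itemset, n in merged.get(h, {}).items():
            for sub in combinations(itemset, k):
                counts[sub] += n
    return {s: c for s, c in counts.items() if c >= threshold}

# Toy per-processor k-itemset tables (hypothetical data), keyed by length k.
tables = [
    {3: {("a", "b", "c"): 1}, 2: {("a", "b"): 1, ("a", "c"): 1}},
    {3: {("b", "c", "e"): 1}, 2: {("a", "b"): 1}, 1: {("e",): 1}},
]
merged = merge_tables(tables)
M = max(merged)
# Levels are independent, so each k from M down to 2 could run on its own processor.
frequent = {k: mine_level(merged, k, M, threshold=2) for k in range(M, 1, -1)}
```

Because level k only reads itemsets of length k or more, the amount of data a processor must receive shrinks as k grows, which is why the longer itemsets are the ones sent to the most processors.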