Efficient Data Mining for Calling Path Patterns in GSM Networks
Abstract In this paper, we explore a new data mining capability that involves mining calling path patterns in global system for mobile communication (GSM) networks. Our proposed method consists of two phases. First, we devise a data structure to convert the original calling paths in the log file into a frequent calling path graph. Second, we design an algorithm to mine the calling path patterns from the frequent calling path graph obtained. By using the frequent calling path graph to mine the calling path patterns, our proposed algorithm does not generate unnecessary candidate patterns and requires fewer database scans. If the corresponding calling path graph of the GSM network can be fitted in the main memory, our proposed algorithm scans the database only once. Otherwise, the cellular structure of the GSM network is divided into several partitions so that the corresponding calling path sub-graph of each partition can be fitted in the main memory. The number of database scans in this case is equal to the number of partitioned sub-graphs. Therefore, our proposed algorithm is more efficient than the PrefixSpan and a priori-like approaches. The experimental results show that our proposed algorithm outperforms the a priori-like and PrefixSpan approaches by several orders of magnitude. © 2002 Elsevier Ltd. All rights reserved.
Keywords: Data mining, Sequential pattern, Calling path pattern, GSM network
ARTICLE IN PRESS
A.J.T. Lee, Y.-T. Wang / Information Systems 28 (2003) 929–948
Recommended by Dr. Nick Koudas, Area Editor. *Corresponding author. Tel.: +886-2-2363-0231x2978; fax: +886-2-2362-1327. E-mail address: [email protected] (A.J.T. Lee).
0306-4379/03/$ - see front matter © 2002 Elsevier Ltd. All rights reserved. doi:10.1016/S0306-4379(02)00112-6

1. Introduction

With the increasing use of computing for various applications, the importance of mining knowledge from large databases has been growing rapidly. There is a large amount of valuable information embedded in databases or data warehouses which is useful for analyzing customers' buying behavior and thus improving business decisions. Data mining is an application-specific issue, and various mining techniques have been developed to solve different application problems, such as mining association rules [1–12], classification [13–16], clustering [17–22], sequential patterns [23–26], partial periodic patterns [27], and path traversal patterns in the World Wide Web [28]. To the best of our knowledge, there are no data mining techniques specially designed to analyze the sequential patterns of users' calling paths in a
global system for mobile communication (GSM) network, and we believe that it is an interesting issue, especially for providing mobile broadband services.

A GSM network is based on cellular radio technology [29]. A characteristic feature of cellular radio is that each hexagonal (six-sided polygon) cell (the radius of a base station) is surrounded by at most six neighboring cells. Fig. 1 illustrates the cellular structure of a GSM network. Velez and Correia [30] have shown that the scenarios of mobility in GSM networks can be characterized by a triangular distribution with average velocities of 0, 1, 10, 15, and 22.5 m/s for the static, pedestrian, urban, main road, and highway scenarios, respectively. By 2010 AD, the cell radii will be at the limit (75 m for CBD, 600 m for urban cells, and 2 km for sub-urban cells). In fact, the cell radii are already below the limit in densely populated areas such as big cities.

Consider that a mobile phone user may frequently make a phone call on the way home or to work. It is very likely that such phone calls take place in a moving vehicle. If the user makes a 3-min phone call on a main road, he/she drives 15 m/s × 60 s/min × 3 min = 2700 m during the phone call. Since the cell radius in the urban area is 600 m, he/she may pass through 5–6 cells. As the utilization of mobile broadband services increases, a phone call may last longer and thus the calling path may be longer.

A mobile phone user may make a phone call at one cell and then move to other cells during the phone call. The sequence of visited cells during the phone call is termed a calling path. In a calling path database, each transaction is a calling path. We say that a transaction supports a calling path P if P is contained in the transaction. The support of P is the fraction of transactions in the database that support P. A calling path with support no less than the user-specified minimum support threshold is termed a frequent calling path.

[Figure: line segments for ⟨b, c, d, e⟩, ⟨d, e, f⟩, ⟨e, f, g, h⟩, ⟨g, h, i, j⟩, and the PMFCP]
Fig. 3. The calling path patterns represented in line segments.

Let us consider an example. A mobile phone user may make phone calls following the patterns shown in Fig. 2. Suppose the minimum support is 50%. As shown in Fig. 3, the frequent calling paths are ⟨b, c, d, e⟩, ⟨d, e, f⟩, ⟨e, f, g, h⟩, ⟨g, h, i, j⟩, ⟨b, c, d⟩, ⟨c, d, e⟩, ⟨e, f, g⟩, ⟨f, g, h⟩, ⟨g, h, i⟩, ⟨h, i, j⟩, ⟨b, c⟩, ⟨c, d⟩, ⟨d, e⟩, ⟨e, f⟩, ⟨f, g⟩, ⟨g, h⟩, ⟨h, i⟩, and ⟨i, j⟩. However, it is more meaningful to extract the calling path pattern ⟨b, c, d, e, f, g, h, i, j⟩ since the user frequently makes phone calls along this path. The calling
path pattern ⟨b, c, d, e, f, g, h, i, j⟩ is termed a potential maximal frequent calling path (PMFCP).

In this paper, we explore a new data mining capability that involves mining calling path patterns in GSM networks where each cell has a limited number of neighboring cells. Understanding users' calling path patterns in such environments is becoming an interesting issue for providing location-based services on GPRS and future 3G services, such as points of interest, routing, road traffic assistance, and location tracking. It will not only help improve the system design (e.g., efficient data access and the deployment of base stations) but also support marketing decisions (e.g., highly competitive strategies and pricing policies, better customer/user classification and personalization services, and behavior analysis). The technique to discover users' calling path patterns in such environments is referred to as calling path pattern mining in this paper.

Our proposed method consists of two phases. First, we devise a data structure to convert the original calling paths in the log file into a frequent calling path graph (or sub-graphs). Second, we design an algorithm to mine the PMFCPs from the frequent calling path graph (or sub-graphs) obtained in the first phase. By using the frequent calling path graph (or sub-graphs) to mine the calling path patterns, our proposed algorithm does not generate unnecessary candidate patterns and requires fewer database scans. If the corresponding calling path graph of the GSM network can be fitted in the main memory, our proposed algorithm scans the database only once. Otherwise, the cellular structure of the GSM network is divided into several partitions such that the corresponding calling path sub-graph of each partition can be fitted in the main memory. The number of database scans in this case is equal to the number of partitioned sub-graphs. Therefore, our proposed algorithm is more efficient than the a priori-like and PrefixSpan approaches. The experimental results show that our proposed algorithm outperforms the a priori-like and PrefixSpan approaches by several orders of magnitude.

This paper is organized as follows. Section 2 describes the motivation of this study. Section 3 introduces the data structure of the frequent calling path graph and its construction algorithm. A graph-based mining algorithm based on the frequent calling path graph is described in Section 4. In Section 5, the results of some performance evaluations are presented. Finally, conclusions are drawn in Section 6.
2. Motivation

Many studies on data mining problems adopt an a priori-like algorithm, which employs an iterative approach known as a level-wise search. An a priori-like algorithm may suffer from two serious problems. First, it is very costly to generate and process the candidate sets, especially in the earlier iterations. For example, if there are 10,000 frequent 1-itemsets, the algorithm may generate about 50 million candidate 2-itemsets. The algorithm needs to examine all candidate 2-itemsets to check whether they are frequent or not. Moreover, an enormous number of candidate sets also induces a heavy load on managing the memory and storage spaces. Second, it is tedious to scan the database repeatedly, especially for mining long frequent patterns. For the a priori-like algorithms, the longer the frequent patterns are, the more times the database is scanned.

In GSM networks, each cell has at most six neighboring cells. In such an environment, the a priori-like algorithms are inefficient because they generate a vast number of candidate sets, most of which are eliminated in the pruning process. To mine calling path patterns in such an environment, the generation of candidate paths is strongly constrained since each path can only be extended to the neighboring cells of the last cell in the path. For example, assume that a GSM network contains two thousand cells. Since a cell in the network can be connected to at most six neighboring cells, the number of 2-sequences (calling path patterns of length two) is restricted to 12,000 at most, which is a great cost reduction in the phase of generating candidate sets in comparison with about 4 million candidate 2-sequences generated by the a priori-like approaches.

To avoid generating unnecessary candidate sets, Han et al. [5] proposed a highly condensed data structure called the FP-tree. Based on the FP-tree, they developed a mining method called FP-growth to mine the complete frequent itemsets without generating unnecessary candidates. However, they [25] pointed out that FP-growth was not suitable for mining sequential patterns. To mine sequential patterns, they successively proposed two methods called FreeSpan [24] and PrefixSpan [25]. It was shown that PrefixSpan was more efficient and scalable than FreeSpan [25]. In the PrefixSpan method, the database is repeatedly partitioned into small pieces by prefix projections. The longer the prefix is, the smaller the projected databases become. Finally, the projected databases are small enough to be fitted in the main memory to reduce the cost of disk access. The PrefixSpan method is more efficient and scalable than the a priori-like methods since it does not generate unnecessary candidates. However, it induces repeated database scans. As the database size grows, it takes a tremendous amount of time to repeatedly scan the database in the process of data mining.

As far as the size of main memory is concerned, the partition factor by which the database (or projected databases) is partitioned in the PrefixSpan method is determined by the size of the database (or projected databases). However, in our proposed method, the factor by which the calling path graph is partitioned is determined by the size of the cellular structure of the GSM network. Generally speaking, the size of the database (or projected databases) is far larger than the size of the cellular structure of the GSM network, so the number of partitions in the PrefixSpan method is usually larger than that in our proposed method. The more partitions there are, the more database scans are required.
Therefore, our proposed algorithm is more efficient than the PrefixSpan algorithm. The experimental result shown in Fig. 23 illustrates this phenomenon.

Our proposed method consists of two algorithms. One is the construction algorithm that creates the frequent calling path graph (or sub-graphs), and the other is the graph-based mining algorithm that extracts the PMFCPs from the frequent calling path graph (or sub-graphs). These two algorithms will be described in detail in Sections 3 and 4.
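The candidate counts quoted in this section can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names are hypothetical, and it assumes that an a priori-like approach considers every ordered pair of cells as a candidate 2-sequence, whereas the neighbor-constrained approach extends each cell only to its (at most six) neighbors.

```python
from math import comb


def candidate_2_itemsets(num_frequent_items: int) -> int:
    """Unordered pairs formed from the frequent 1-itemsets."""
    return comb(num_frequent_items, 2)


def unconstrained_2_sequences(num_cells: int) -> int:
    """Ordered cell pairs when adjacency is ignored (assumed counting rule)."""
    return num_cells * num_cells


def neighbor_constrained_2_sequences(num_cells: int, max_neighbors: int = 6) -> int:
    """Upper bound when a path may only move to a neighboring cell."""
    return num_cells * max_neighbors


print(candidate_2_itemsets(10_000))             # 49995000, about 50 million
print(unconstrained_2_sequences(2_000))         # 4000000, about 4 million
print(neighbor_constrained_2_sequences(2_000))  # 12000
```

Under these assumptions the neighbor constraint reduces the number of candidate 2-sequences for a 2000-cell network by a factor of more than 300.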
3. Frequent calling path graph

In this section, we present the data structure of the frequent calling path graph (or sub-graphs) and its construction algorithm. The data structure will be used to mine calling path patterns. Before presenting the data structure, we first define some terminology used later.

Definition 1. A calling path $P$, denoted by $\langle v_1, v_2, \ldots, v_n \rangle$, $n \ge 2$, is the sequence of visited cells during a mobile phone call, where $v_1, v_2, \ldots, v_n$ are cell IDs. A calling path $P'$ formed by $\langle v_i, v_{i+1}, \ldots, v_j \rangle$ is termed a sub-path of $P$, where $1 \le i < j \le n$. It is denoted as $P' \subseteq P$. We also say that $P$ contains $P'$.

Definition 2. The support of a calling path $P$ is the ratio of the number of calling paths containing $P$ to the number of all calling paths in the database $D$. The support of $P$ is defined as

\[
\mathrm{sup}(P) = \frac{|P|}{|D|},
\]

where $|P|$ denotes the number of calling paths containing $P$ in $D$, and $|D|$ denotes the number of calling paths in $D$.

Definition 3. A calling path $P$ is said to be frequent if $\mathrm{sup}(P)$ is not less than a user-specified threshold, which is termed the minimum support.

Definition 4. A calling path is maximal in the set $M$ if it is not contained in any other calling path in $M$. A frequent calling path is maximal if it is not contained in any other frequent calling path.

Definition 5. Let $P_1$ and $P_2$ be calling paths, where $P_1 = \langle v_{1,1}, v_{1,2}, \ldots, v_{1,n} \rangle$ and $P_2 = \langle v_{2,1}, v_{2,2}, \ldots, v_{2,m} \rangle$. If $\langle v_{1,i+1}, \ldots, v_{1,i+h} \rangle = \langle v_{2,j+1}, \ldots, v_{2,j+h} \rangle$, where $h \ge 2$, $0 \le i \le n - h$, and $0 \le j \le m - h$, we define the merge operation $\oplus$ as

\[
P_1 \oplus P_2 =
\begin{cases}
\langle v_{1,1}, v_{1,2}, \ldots, v_{1,n}, v_{2,h+1}, \ldots, v_{2,m} \rangle & \text{if } i = n - h \text{ and } j = 0, \\
\langle v_{2,1}, v_{2,2}, \ldots, v_{2,m}, v_{1,h+1}, \ldots, v_{1,n} \rangle & \text{if } i = 0 \text{ and } j = m - h, \\
\{ \langle v_{1,1}, \ldots, v_{1,i+1}, \ldots, v_{1,i+h}, v_{2,j+h+1}, \ldots, v_{2,m} \rangle, & \\
\phantom{\{} \langle v_{2,1}, \ldots, v_{2,j+1}, \ldots, v_{2,j+h}, v_{1,i+h+1}, \ldots, v_{1,n} \rangle \} & \text{otherwise.}
\end{cases}
\]
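The merge operation of Definition 5 can be sketched in a few lines of Python. This is an illustration rather than the authors' implementation: calling paths are represented as tuples of cell IDs, the helper names `longest_common_run` and `merge` are hypothetical, and when two paths share more than one common run the sketch assumes that the longest one (the first found on ties) is used, a choice the definition leaves open.

```python
def longest_common_run(p1, p2):
    """Return (h, i, j) for the longest run with p1[i:i+h] == p2[j:j+h]."""
    best = (0, 0, 0)
    for i in range(len(p1)):
        for j in range(len(p2)):
            h = 0
            while i + h < len(p1) and j + h < len(p2) and p1[i + h] == p2[j + h]:
                h += 1
            if h > best[0]:
                best = (h, i, j)
    return best


def merge(p1, p2):
    """Merge two calling paths; return the set of merged paths.

    The set is empty when the paths share fewer than two consecutive cell IDs.
    """
    h, i, j = longest_common_run(p1, p2)
    if h < 2:
        return set()
    n, m = len(p1), len(p2)
    if i == n - h and j == 0:   # suffix of p1 overlaps prefix of p2
        return {p1 + p2[h:]}
    if i == 0 and j == m - h:   # prefix of p1 overlaps suffix of p2
        return {p2 + p1[h:]}
    # Overlap elsewhere: two merged paths are generated.
    return {p1[:i + h] + p2[j + h:], p2[:j + h] + p1[i + h:]}


# Chain the four maximal frequent calling paths of the Fig. 3 example.
path = ("b", "c", "d", "e")
for nxt in [("d", "e", "f"), ("e", "f", "g", "h"), ("g", "h", "i", "j")]:
    (path,) = merge(path, nxt)
print(path)  # ('b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')
```

The offsets i and j are zero-based, so they line up directly with Python slices: the common run is `p1[i:i+h]` and `p2[j:j+h]`.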
If both calling paths have h ≥ 2 consecutive cell IDs in common, both calling paths are merged to generate two new calling paths. However, if the suffix of one calling path and the prefix of the other have h ≥ 2 consecutive cell IDs in common, only one newly merged calling path is generated.

Definition 6. The PMFCPs in the database D are defined as {P | P ∈ FP+, and P is maximal in FP+}, where FP+ is the closure of FP under the merge operation ⊕ and FP = {P | P is a frequent calling path in D}. FP+ can be computed by the following algorithm:

FP+ = FP;
repeat
    OldFP+ = FP+;
    for each pair of calling paths P1 and P2 in OldFP+ do
        if both calling paths have at least two consecutive cell IDs in common then
            add P1 ⊕ P2 to FP+;
    endfor;
until FP+ == OldFP+;

The PMFCPs in the database D are the maximal calling paths in the closure of frequent calling paths in D under the merge operation ⊕. Let us consider the example shown in Fig. 3. Since every PMFCP is maximal, we only need to merge the maximal frequent calling paths to generate PMFCPs. There are four maximal frequent calling paths: ⟨b, c, d, e⟩, ⟨d, e, f⟩, ⟨e, f, g, h⟩, and ⟨g, h, i, j⟩. First, we apply the merge operation to ⟨b, c, d, e⟩ and ⟨d, e, f⟩ and obtain ⟨b, c, d, e, f⟩. Then, ⟨b, c, d, e, f⟩ and ⟨e, f, g, h⟩ are merged into ⟨b, c, d, e, f, g, h⟩. Finally, ⟨b, c, d, e, f, g, h⟩ and ⟨g, h, i, j⟩ are merged into the PMFCP ⟨b, c, d, e, f, g, h, i, j⟩.

3.1. Frequent calling path graph

An acyclic calling path can be divided into an out-edge as defined in Definition 7, or an out-edge plus several in-out paths as defined in Definition 8. For example, P = ⟨a, b, c, d, e⟩ is a calling path that can be divided into an out-edge ⟨a, b⟩ plus three in-out paths ⟨a, b, c⟩, ⟨b, c, d⟩, and ⟨c, d, e⟩. Also, these out-edge and in-out paths can be merged to generate P.

Definition 7. An in-edge of vertex v in a calling path graph is an edge ending at v. An out-edge of vertex v in a calling path graph is an edge starting at v.

Definition 8. An in-out path of vertex v in a calling path graph is a calling path formed by one in-edge of v and one out-edge of v.

A PMFCP in the database D is maximal in the closure of frequent calling paths in D under the merge operation ⊕. A frequent calling path can be divided into an out-edge plus several in-out paths. These out-edge and in-out paths can also be merged to generate the original frequent calling path. Hence, the problem of finding the PMFCPs in D can be transformed into the problem of finding the maximal calling paths in the closure of frequent out-edges and in-out paths under the merge operation ⊕. That is, to find the PMFCPs in D, we first find the frequent out-edges and in-out paths in D. Second, we find the closure of those frequent out-edges and in-out paths under the merge operation ⊕. The maximal calling paths in the found closure are the PMFCPs in D.

A calling path graph is a directed graph containing the information necessary for mining PMFCPs in a database. A vertex in the calling path graph represents a cell in the GSM network. For an acyclic calling path in the database, we can divide it into an out-edge, or an out-edge plus several in-out paths. For a cyclic calling path, we can divide it into several in-out paths. Then, we can store the information about those out-edges and in-out paths in the calling path graph. To find the frequent out-edges and in-out paths, we can
compute the supports of those out-edges and in-out paths in the calling path graph by scanning the database once. A frequent calling path graph is a calling path graph in which the support of every out-edge and in-out path is not less than the minimum support.

Let us consider the example shown in Fig. 4. Vertex v has six in-edges ⟨u1, v⟩, ⟨u2, v⟩, …, ⟨u6, v⟩ and six out-edges ⟨v, w1⟩, ⟨v, w2⟩, …, ⟨v, w6⟩. There are 36 in-out paths ⟨ui, v, wj⟩, where i = 1, 2, …, 6 and j = 1, 2, …, 6. Since a cell in a GSM network has at most six neighboring cells, a vertex in a calling path graph has at most six in-edges and six out-edges, and hence at most 36 in-out paths and at most six out-edges. Example 1 illustrates how the cellular structure of a GSM network and the calling paths in a database are mapped to a calling path graph.

Fig. 4. The in-edges and out-edges of vertex v.

Example 1. Consider the cellular structure of a GSM network as depicted in Fig. 1, which contains 14 cells labeled from a to n alphabetically, and the calling path database D containing 20 calling paths. Fig. 5 illustrates the calling paths in the database D.

TID     Calling path
T001    <d, g, j, m>
T002    <a, b>
T003    <g, k, n, l>
T004    <a, b, d>
T005    <b, d, g>
T006    <l, n, m>
T007    <d, g, f>
T008    <g, k>
T009    <d, g, k, n>
T010    <a, b, d, h, l>
T011    <b, d>
T012    <g, k, l>
T013    <b, d, g>
T014    <d, g, j, m>
T015    <g, k, n>
T016    <b, d, g, k, l>
T017    <a, c>
T018    <d, c, f>
T019    <m, n, l>
T020    <b, d, g, f, j>
Fig. 5. The calling paths in the database D.

The cellular structure of the GSM network and the calling paths are shown in Fig. 6, where the calling paths ⟨b, d, g⟩ and ⟨d, g, j, m⟩ depicted by bold lines appear twice. The out-edges and in-out paths of each vertex are shown in Fig. 7, where ⟨a, b⟩:3 means that the count of out-edge ⟨a, b⟩ is equal to three.

Vertex  Out-edges               In-out paths
a       <a, b>:3, <a, c>:1      -
b       <b, d>:5                <a, b, d>:2
c       -                       <d, c, f>:1
d       <d, c>:1, <d, g>:4      <b, d, g>:4, <b, d, h>:1
f       -                       <g, f, j>:1
g       <g, k>:4                <d, g, f>:2, <d, g, j>:2, <d, g, k>:2
h       -                       <d, h, l>:1
j       -                       <g, j, m>:2
k       -                       <g, k, l>:2, <g, k, n>:3
l       <l, n>:1                -
m       <m, n>:1                -
n       -                       <k, n, l>:1, <l, n, m>:1, <m, n, l>:1
Fig. 7. Out-edges and in-out paths.

Let the minimum support be 10%. Fig. 8 shows the corresponding frequent calling path graph, where the notation represents the starting point of an
out-edge or in-out path. Note that the out-edges and in-out paths with supports less than the minimum support are deleted in Fig. 8. That is, the out-edges and in-out paths whose counts are equal to one in Fig. 7 are deleted. The vertices without any frequent out-edge or in-out path are also deleted. The frequent calling path graph contains only the out-edges and in-out paths whose supports are no less than the minimum support threshold.

3.2. Construction algorithm

Since the corresponding calling path graph of a GSM network may not be fitted in the main memory, the cellular structure of the GSM network may need to be divided into several partitions so that the corresponding calling path graph of each partition can be fitted in the main memory. The algorithm used to construct the frequent calling path graph (or sub-graphs) is shown in Fig. 9.

Algorithm: Construction.
Input: A data file F describing the cellular structure of a GSM network, a calling path database D, and a minimum support threshold τ.
Output: Frequent calling path graph (or sub-graphs).
1. If the corresponding calling path graph of the GSM network cannot be fitted in the main memory, divide the cellular structure of the GSM network into several partitions, Q1, Q2, ..., Qk, so that the corresponding calling path sub-graph of each partition can be fitted in the main memory.
2. For each partition Qi, 1 ≤ i ≤ k, perform steps 3–5.
3. Create the corresponding calling path sub-graph Gi for partition Qi based on F. Set the counts of out-edges and in-out paths in Gi to zero.
4. Scan the calling paths in D one by one. Increase the counts of out-edges and in-out paths in Gi by one if the calling path contains them.
5. Examine the counts of all out-edges and in-out paths in Gi. Delete those out-edges and in-out paths whose counts are less than τ × |D|. The vertices without any out-edge and in-out path are deleted from Gi, too.
Fig. 9. Construction algorithm.

If the cellular structure of the GSM network is divided into several partitions, the PMFCPs mined in a sub-graph are not the final PMFCPs that we expect. Hence, the PMFCPs mined in a sub-graph, termed local PMFCPs, are further merged into global PMFCPs. Since a vertex in the corresponding calling path graph of the GSM network has at most 36 in-out paths and six out-edges, we require a fixed amount of storage to store the information of those in-out paths and out-edges for each vertex. The construction algorithm checks whether the corresponding calling path graph of the GSM network can be fitted in the main memory. If not, the cellular structure of the GSM network must be
divided into several partitions so that the corresponding calling path sub-graph of each partition can be fitted in the main memory. Let us consider a GSM network with 81 cells as shown in Fig. 10(a). We can use two horizontal and two vertical partition lines to divide the cellular structure of the GSM network into nine partitions. The overlapped cells are darkened as shown in Fig. 10(b). The overlapped cells in each partition are required for merging local PMFCPs into global PMFCPs. In Fig. 10(b), the numbers of overlapped cells in partitions Q1 to Q9 are 7, 10, 6, 9, 14, 11, 7, 10, and 6, respectively. Partition Q5 has the largest number of overlapped cells. If a partition like Q5 consists of q² non-overlapped cells, the number of overlapped cells is about 4q. The ratio of the number of overlapped cells to the number of cells in the partition is about 4q/(q² + 4q) = 4/(q + 4). If q = 3, the ratio is about 4/(3 + 4) = 4/7. If q = 100, the ratio is about 4/(100 + 4) = 4/104. The larger q is, the smaller the ratio is. In practice, the partition method depends on the cellular structure of a GSM network, which in turn depends on the distribution of populations, highways, rivers, mountains, etc. For simplicity, in this paper we use the partition method shown in Fig. 10 to partition the cellular structure of a GSM network.

In step 1 of the construction algorithm, we determine how many partitions should be created. For each partition, we create a calling path sub-graph Gi based on the cellular structure of the GSM network described by the file F in step 3. The counts of out-edges and in-out paths in Gi are initialized to zero. In step 4, we scan the database D once to accumulate the counts of out-edges and in-out paths. In step 5, we eliminate those out-edges and in-out paths whose supports are less than the minimum support threshold τ. Thus the frequent calling path sub-graph Gi is obtained. Steps 3–5 are repeated for each partition. Finally, we obtain all frequent calling path sub-graphs.

Complexity analysis: Consider a GSM network and a calling path database containing N calling paths with average length L. In the process of constructing a sub-graph of the cellular structure of the GSM network, the database needs to be scanned once. The time required to construct the sub-graph and to compute the counts of out-edges and in-out paths is proportional to the number and the length of calling paths in the database, i.e., O(NL). Assume that the cellular structure of the GSM network is partitioned into S partitions. For each corresponding sub-graph, the calling path database needs to be scanned once to collect the information of out-edges and in-out paths in the sub-graph. Thus the time complexity of the construction algorithm is bounded by O(SNL + S × the time to scan the database once).
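The counting and pruning steps of the construction algorithm can be sketched as follows, using the 20 calling paths of Example 1. This is a minimal illustration under stated assumptions, not the authors' code: the names `decompose`, `build_frequent_graph`, and the `cells` parameter are hypothetical, only acyclic paths are handled, an out-edge is counted only for the first edge of a calling path (which is what the counts in Fig. 7 imply), and a partition is assumed to keep exactly those out-edges and in-out paths whose cells all lie inside it.

```python
from collections import Counter

# Calling path database D of Example 1 (T001 to T020), one letter per cell.
DATABASE = ["dgjm", "ab", "gknl", "abd", "bdg", "lnm", "dgf", "gk", "dgkn",
            "abdhl", "bd", "gkl", "bdg", "dgjm", "gkn", "bdgkl", "ac", "dcf",
            "mnl", "bdgfj"]


def decompose(path):
    """Split an acyclic calling path into its out-edge and its in-out paths."""
    out_edge = tuple(path[:2])
    in_out_paths = {tuple(path[k:k + 3]) for k in range(len(path) - 2)}
    return out_edge, in_out_paths


def build_frequent_graph(database, min_support, cells=None):
    """Scan the database once, count, then delete items below min_support.

    cells: optional set of cell IDs forming one partition (assumption);
    None treats the whole network as a single partition.
    """
    def inside(item):
        return cells is None or all(c in cells for c in item)

    out_counts, in_out_counts = Counter(), Counter()
    for path in database:
        out_edge, in_out_paths = decompose(path)
        if inside(out_edge):
            out_counts[out_edge] += 1
        in_out_counts.update(p for p in in_out_paths if inside(p))

    min_count = min_support * len(database)  # threshold times |D|
    frequent_out = {e: c for e, c in out_counts.items() if c >= min_count}
    frequent_in_out = {p: c for p, c in in_out_counts.items() if c >= min_count}
    return frequent_out, frequent_in_out


out_edges, in_out_paths = build_frequent_graph(DATABASE, min_support=0.10)
print(sorted(out_edges.items()))
print(sorted(in_out_paths.items()))
```

With a minimum support of 10% and 20 calling paths, every out-edge and in-out path with a count of one is dropped, which matches the deletion described for Fig. 8. Calling the function once per partition, each time with a different `cells` set, corresponds to one database scan per sub-graph.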
4. Mining algorithm

In this section, we present the algorithm to mine the PMFCPs from the frequent calling path graph (or sub-graphs). The algorithm is based on a depth-first search approach, which is one of the natural ways to visit vertices in a graph systematically.

Definition 9. The function v.next(calling path P) returns a vertex u where u is the next vertex of v in P. Similarly, the function v.previous(calling path P) returns a vertex u where u is the previous vertex of v in P.

Definition 10. The vertices in a calling path are indexed by the order of their appearances in the path. The function v.index(calling path P) returns the index of v in P.

Definition 11. The length of a calling path P is equal to the number of vertices in P.

For example, given a calling path P = ⟨a, b, c, d, e⟩, a.next(P) returns b since b is the next vertex of a in P. a.index(P) returns one since a is the first vertex in P. Similarly, c.index(P) returns three. The length of P is equal to five since P has five vertices in total.

If a PMFCP does not contain a cycle, its starting vertex should link to the next vertex by an out-edge. That is, a PMFCP always starts from an out-edge if the PMFCP is acyclic. If a PMFCP is cyclic, it does not have a starting vertex. To find such a path, we may start from any vertex in the path.

The algorithm to mine the PMFCPs is shown in Figs. 11 and 12. Since the corresponding calling path graph of the GSM network may not be fitted in the main memory, the cellular structure of the GSM network may need to be divided into several partitions. The graph-based mining algorithm is performed on each frequent calling path sub-graph to obtain the local PMFCPs. Thereafter, the local PMFCPs obtained from the frequent calling path sub-graphs are merged into the global PMFCPs, which are the PMFCPs in the whole frequent calling path graph of the GSM network.

The graph-based mining algorithm can be divided into three phases. The first phase, from step 3 to step 10, finds all acyclic local PMFCPs; the second phase, from step 12 to step 19, finds all cyclic local PMFCPs; and the third phase, from step 21 to step 29, merges all local PMFCPs extracted from sub-graphs into global PMFCPs.

In the first phase, for each out-edge e of vertex v in graph Gi where no in-out path of v contains e, the algorithm finds all local PMFCPs starting with e. If an in-out path of v contains e, the calling path starting with e is not maximal. We do not generate the PMFCPs starting with such an out-edge e since a local PMFCP is maximal. In steps 5–8, if the algorithm finds the starting edge of a new local PMFCP, the procedure visit() is called with the arguments ⟨v, u⟩, Gi, and LocalPMFCPi. The procedure visit() finds all local PMFCPs starting with out-edge ⟨v, u⟩ in Gi and collects them in LocalPMFCPi. All out-edges and in-out paths of the vertices in the local PMFCPs found in the first phase are marked as traced in step 11.

In the second phase, if there exist any unvisited in-out paths, there exist cyclic local PMFCPs. An acyclic local PMFCP in the first phase starts with an out-edge, while a cyclic local PMFCP starts with an unvisited in-out path. In steps 14–17, if the algorithm finds an unvisited in-out path which can be a starting in-out path of a cyclic local PMFCP, the procedure visit() is called with the arguments ⟨u, v⟩, Gi, and LocalPMFCPi. The procedure visit() finds the cyclic local PMFCPs starting with edge ⟨u, v⟩ in Gi and collects them in LocalPMFCPi.

In the third phase, all of the local PMFCPs (LocalPMFCPi, i = 1, 2, …, k) extracted from each sub-graph are merged into the global PMFCPs in steps 21–29. First, let GlobalPMFCPs = {p | p is a local PMFCP in G1}, and let VisitedSubgraphs be {G1}. Second, for each pair of P1 and P2, where P1 is a local PMFCP in GlobalPMFCPs and P2 is a local PMFCP in a neighboring sub-graph of VisitedSubgraphs, say H, if P1 and P2 have two or more consecutive vertices in common, we merge both paths together and add the merged paths to GlobalPMFCPs.
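The first phase described above can be illustrated with a simplified depth-first search over the frequent calling path graph of Example 1 (the out-edges and in-out paths of Fig. 7 whose counts are at least two). This sketch is not the procedure of Figs. 11 and 12: it covers only the acyclic phase on a single graph, it uses recursion in place of the explicit UnvisitedVertices and BacktrackIndex stacks, and the function name `mine_acyclic_paths` is hypothetical.

```python
# Frequent calling path graph of Example 1 at 10% minimum support
# (entries of Fig. 7 whose count is at least 2).
FREQUENT_OUT_EDGES = {("a", "b"), ("b", "d"), ("d", "g"), ("g", "k")}
FREQUENT_IN_OUT_PATHS = {
    ("a", "b", "d"), ("b", "d", "g"), ("d", "g", "f"), ("d", "g", "j"),
    ("d", "g", "k"), ("g", "j", "m"), ("g", "k", "l"), ("g", "k", "n"),
}


def mine_acyclic_paths(out_edges, in_out_paths):
    """Extend every maximal starting out-edge through frequent in-out paths."""
    # An in-out path <u, v, w> extends any path whose last edge is <u, v>.
    extensions = {}
    for u, v, w in in_out_paths:
        extensions.setdefault((u, v), []).append(w)
    # Out-edges <v, w> contained in an in-out path of v cannot start a path.
    covered = {(v, w) for _, v, w in in_out_paths}
    found = []

    def visit(path, used):
        u, v = path[-2], path[-1]
        candidates = [w for w in sorted(extensions.get((u, v), []))
                      if (u, v, w) not in used]
        if not candidates:          # no further in-out path: path is complete
            found.append(tuple(path))
            return
        for w in candidates:        # branch at a diverging vertex
            visit(path + [w], used | {(u, v, w)})

    for v, w in sorted(out_edges):
        if (v, w) not in covered:
            visit([v, w], frozenset())
    return found


for p in mine_acyclic_paths(FREQUENT_OUT_EDGES, FREQUENT_IN_OUT_PATHS):
    print(p)
```

On this input only the out-edge ⟨a, b⟩ qualifies as a starting edge, and the search returns four paths: ⟨a, b, d, g, f⟩, ⟨a, b, d, g, j, m⟩, ⟨a, b, d, g, k, l⟩, and ⟨a, b, d, g, k, n⟩. The `used` set plays the role of the visited marks: an in-out path is never traversed twice within one path, so the recursion terminates even if the graph contains cycles.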
Algorithm: Graph-based mining.
Input: Frequent calling path sub-graphs G1, G2, ..., Gk.
Output: PMFCPs.
(1)  for each sub-graph Gi do
(2)      load Gi into the main memory;
(3)      for each vertex v ∈ Gi do
(4)          for each out-edge e of v do
(5)              if no in-out path of v contains e then
(6)                  u = v.next(e);
(7)                  visit(<v, u>, Gi, LocalPMFCPi);
(8)              endif;
(9)          endfor;
(10)     endfor;
(11)     mark traced paths;
(12)     for each vertex v ∈ Gi do
(13)         if there exist any in-out paths of v not visited yet then // cyclic paths do not have starting vertices
(14)             for each unvisited in-out path p of v do
(15)                 u = v.previous(p);
(16)                 visit(<u, v>, Gi, LocalPMFCPi);
(17)             endfor;
(18)         endif;
(19)     endfor;
(20) endfor;
     // merge local PMFCPs to obtain global PMFCPs
(21) Let GlobalPMFCPs be LocalPMFCP1 and VisitedSubgraphs be {G1};
(22) for each neighboring sub-graph Hi of VisitedSubgraphs do
(23)     for each pair of paths p1 ∈ GlobalPMFCPs, p2 ∈ {p | p is a local PMFCP in Hi}, where p1 and p2 have two or more consecutive vertices in common do
(24)         add p1 ⊕ p2 to GlobalPMFCPs;
(25)     endfor;
(26)     GlobalPMFCPs = GlobalPMFCPs ∪ {p | p is a local PMFCP in Hi};
(27)     delete from GlobalPMFCPs the paths which are not maximal;
(28)     add Hi to VisitedSubgraphs;
(29) endfor;
(30) Output every path in GlobalPMFCPs;
Fig. 11. Graph-based mining algorithm.

To check if P1 and P2 have two or more consecutive vertices in common, we first check if both paths contain the overlapped cell IDs (such as the darkened cells in Fig. 10(b)). That is, only the local PMFCPs containing the overlapped cell IDs
are the candidates for the merge operation. By eliminating the local PMFCPs that do not contain the overlapped cell IDs, the time complexity of the merge operation can be reduced. If both paths contain the overlapped cell IDs, we check if both paths have two or more consecutive vertices in common. Third, let GlobalPMFCPs = GlobalPMFCPs ∪ {p | p is a local PMFCP in H}. Fourth, delete from GlobalPMFCPs the paths that are not maximal. Fifth, add H to VisitedSubgraphs. At this moment, GlobalPMFCPs contains all local PMFCPs in VisitedSubgraphs. After that,
we repeat from the second step to the fth step for another neighboring sub-graph of VisitedSubgraphs until all frequent sub-graphs are in VisitedSubgraphs. The procedure visit(/u; vS, G, OutputPath) provides a depth-rst search approach to nd the local PMFCPs starting with /u; vS in the frequent calling path sub-graph G ; and collects those found local PMFCPs in OutputPath. To nd the local PMFCPs starting with /u; vS; we rst start with CurrentPath /u; vS and extend CurrentPath by merging sequentially the inout paths if the inout
ARTICLE IN PRESS
A.J.T. Lee, Y.-T. Wang / Information Systems 28 (2003) 929948 939
procedure visit(<u, v>: edge, G: frequent calling path sub-graph, OutputPath: PMFCPs in G)
(1) begin
(2)   append u to the current path CurrentPath;
(3)   UnvisitedVertices.push(v); // UnvisitedVertices is a stack
(4)   repeat
(5)     w = UnvisitedVertices.pop();
(6)     append w to the current path CurrentPath;
(7)     if w does not have any in-out paths that contain w.previous(CurrentPath) then
(8)       add CurrentPath to OutputPath;
(9)       t = BacktrackIndex.pop(); // BacktrackIndex is a stack
(10)      clear visited-marks of in-out paths for all vertices starting at the (t+1)-th vertex in the current path CurrentPath;
(11)      delete all vertices starting at the (t+1)-th vertex from the current path CurrentPath;
(12)      path_count++;
(13)    else
(14)      for each in-out path p of w that contains w.previous(CurrentPath) do
(15)        if p has not been visited by the current path then
(16)          x = w.next(p); // a separate variable, so that w still denotes the current vertex in steps 21-23
(17)          UnvisitedVertices.push(x);
(18)          mark p visited;
(19)        endif;
(20)      endfor;
(21)      Let r be the number of unvisited in-out paths of w;
(22)      for j = 1 to r-1 do // push the index of w in the current path on the BacktrackIndex stack
(23)        BacktrackIndex.push(w.index(CurrentPath));
(24)      endfor;
(25)    endif;
(26)  until UnvisitedVertices.ISEmpty(); // until UnvisitedVertices stack is empty
(27) end;
Fig. 12. Visit procedure.
paths have two vertices in common with the suffix of CurrentPath. Finally, we obtain the first local PMFCP. Then, we go back to the last diverging vertex of the first local PMFCP and extend another local PMFCP from it. After we find all local PMFCPs extended from the last diverging vertex, we go back to the previous (second-to-last) diverging vertex of the first local PMFCP and extend another local PMFCP from there. We repeat the above steps until we find all local PMFCPs starting with <u, v>.

All unvisited vertices in G are kept on the UnvisitedVertices stack. In the procedure visit(), the argument v is first pushed on the UnvisitedVertices stack in step 3, and the repeat loop from step 4 to step 26 is performed iteratively until the UnvisitedVertices stack becomes empty. If the stack is not empty, a vertex w is popped from the UnvisitedVertices stack for further processing. First, we append w to the current path in step 6. Then, in step 7, we check whether w has in-out paths that contain w.previous(CurrentPath). If it has none, the current path ends at w, and we add the current path CurrentPath to OutputPath in step 8. Second, since the current path is terminated, we need to decide how far to backtrack in order to reuse the prefix of CurrentPath; this is done by t = BacktrackIndex.pop() in step 9. Third, in step 10, we clear the visited-marks of in-out paths for all vertices starting at the (t+1)th vertex in CurrentPath, and in step 11 we delete all vertices starting at the (t+1)th vertex from CurrentPath. Steps
10 and 11 adjust the visited-marks and the current path for the backtracking determined in step 9. Fourth, a new local PMFCP with the same prefix of t vertices is formed, so the number of paths is incremented by one in step 12.

If w does have in-out paths that contain w.previous(CurrentPath) in the check of step 7, the current path will be extended further. First, for each in-out path p of w that contains w.previous(CurrentPath) and has not been visited by the current path, we push w.next(p) on the UnvisitedVertices stack in steps 14–20. Second, we decide how many backtracking indices to record in steps 21–24. Let r be the number of unvisited in-out paths of w. If r ≥ 2, (r − 1) backtracking indices are pushed on the BacktrackIndex stack. When the UnvisitedVertices stack becomes empty, all of the local PMFCPs starting with <u, v> have been found, and the procedure visit() terminates.

Let us consider the calling path database in Example 1 and assume that the corresponding calling path graph of the GSM network fits in the main memory. Since the calling path graph need not be partitioned, only one frequent calling path graph is input to the graph-based mining algorithm, and all PMFCPs found are global PMFCPs. The graph-based mining algorithm is executed as follows. First, vertex a is visited and appended to the current path. Then, the only neighbor b of vertex a is selected and procedure visit() is called. In procedure visit(), vertex b is pushed on the UnvisitedVertices stack and then popped in the repeat loop. The repeat loop is executed iteratively, and vertices are pushed on and popped from the stack as shown in Fig. 13. When vertex g is visited by the current path, i.e., path(1) = <a, b, d, g>, the neighboring vertices f, j, and k of g are pushed on the UnvisitedVertices stack. Since vertex g has three in-out paths, the index of g in path(1), four, is pushed twice on the BacktrackIndex stack. Next, while processing vertex k, the neighboring vertices l and n are pushed on the UnvisitedVertices stack and one index value, five, is pushed on the BacktrackIndex stack, since vertex k has two in-out paths. The contents of the BacktrackIndex stack are shown in Fig. 14. Second, path(1) is terminated when vertex n is popped, since there is no in-out path connected to vertex n. The current path <a, b, d, g, k, n> is added to OutputPath in step 8 of procedure visit().
Fig. 13. Contents of the UnvisitedVertices stack during the mining process.
Fig. 14. Contents of the BacktrackIndex stack during the mining process (entries are labeled as the index for generating path(2), path(3), and path(4)).
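The depth-first enumeration traced in this example can be sketched in a few lines of Python. This is a simplified illustration rather than the procedure of Fig. 12: it assumes that an in-out path of a vertex is a triple (previous, current, next), it keeps a whole path prefix on each stack entry instead of using the BacktrackIndex stack and visited-marks, and the names build_inout_index, visit, and INOUT are hypothetical. The triples listed in INOUT are read off the walkthrough, not taken from the original log file.

```python
from collections import defaultdict


def build_inout_index(inout_paths):
    """Map (previous, current) to the list of next vertices of its in-out paths."""
    index = defaultdict(list)
    for prev, cur, nxt in inout_paths:
        index[(prev, cur)].append(nxt)
    return index


def visit(u, v, inout_index):
    """Enumerate the maximal paths that start with edge <u, v> by depth-first search."""
    output = []
    # Each entry holds a path prefix and the in-out triples it has already used,
    # so a cyclic graph cannot extend the same path forever.
    stack = [((u, v), frozenset())]
    while stack:
        path, used = stack.pop()
        prev, cur = path[-2], path[-1]
        extended = False
        for nxt in inout_index.get((prev, cur), []):
            triple = (prev, cur, nxt)
            if triple in used:
                continue
            stack.append((path + (nxt,), used | {triple}))
            extended = True
        if not extended:
            # No in-out path continues the suffix <prev, cur>: the path ends here.
            output.append(path)
    return output


# Assumed in-out triples of the example graph (a-b-d-g, then branches at g and k).
INOUT = [
    ("a", "b", "d"), ("b", "d", "g"),
    ("d", "g", "f"), ("d", "g", "j"), ("d", "g", "k"),
    ("g", "j", "m"), ("g", "k", "l"), ("g", "k", "n"),
]

paths = visit("a", "b", build_inout_index(INOUT))
for p in paths:
    print("<" + ", ".join(p) + ">")
```

Because the stack is last-in first-out, the sketch emits the paths in the same order as path(1) to path(4) in the walkthrough. Storing a full prefix per stack entry costs more memory than the index-based backtracking of the paper, which is the trade-off accepted here for brevity.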
An index t is popped from the BacktrackIndex stack in step 9 of procedure visit(). Since t is equal to five, the sixth vertex is deleted from the current path. That is, the prefix of a new PMFCP, path(2) = <a, b, d, g, k>, is formed in steps 11 and 12 of procedure visit(). Third, vertex l is popped from the UnvisitedVertices stack and appended to path(2). Since vertex l does not have any in-out paths, path(2) is terminated, and path(2) = <a, b, d, g, k, l> is added to OutputPath in step 8 of procedure visit(). An index t is popped from the BacktrackIndex stack in step 9 of procedure visit(). Since t is equal to four, all vertices starting at the fifth vertex are deleted from the current path. That is, the prefix of a new PMFCP, path(3) = <a, b, d, g>, is formed in steps 11 and 12 of procedure visit(). Fourth, vertex j and vertex m are popped from the UnvisitedVertices stack and appended to path(3), and path(3) is terminated. Path(3) = <a, b, d, g, j, m> is added to OutputPath in step 8 of procedure visit(). An index t is popped from the BacktrackIndex stack in step 9 of procedure visit(). Since t is equal to four, all vertices starting at the fifth vertex are deleted from the current path, and the prefix of a new PMFCP, path(4) = <a, b, d, g>, is formed in steps 11 and 12 of procedure visit(). Finally, vertex f is popped from the UnvisitedVertices stack and appended to path(4), and path(4) is terminated. Path(4) = <a, b, d, g, f> is added to OutputPath in step 8 of procedure visit(). All four PMFCPs are found and shown in Fig. 15.

In addition, let us consider the case in which the corresponding calling path graph of the GSM network cannot fit in the main memory. Assume that the cellular structure of the GSM network is divided into two partitions by a partition line, as shown in Fig. 16(a). The two partitions, named PartitionU and PartitionL, are shown in Figs. 16(b) and (c), respectively. The boundary of each partition is extended by one cell along the partition line, as indicated by the darkened cells in Figs. 16(b) and (c). The overlapped cells in the partitions are required for merging local PMFCPs into the global PMFCPs in the merging phase of the mining process. Fig. 17 illustrates the local PMFCPs extracted from PartitionU and PartitionL, and the global PMFCPs. The local PMFCPs in PartitionU are merged with those in PartitionL to obtain the global PMFCPs. For example, the local PMFCP <a, b, d, g, f> in PartitionU can be merged with the local PMFCP <d, g, k, n> in PartitionL to generate the global PMFCP <a, b, d, g, k, n>. The local PMFCP <a, b, d, g, f> in PartitionU can also be merged with the local PMFCP <d, g, k, l> in PartitionL to generate the global PMFCP <a, b, d, g, k, l>. Similarly, we can obtain the global PMFCP <a, b, d, g, j, m>. The four global PMFCPs in Fig. 17 are the same as those in Fig. 15.
Fig. 16. Graph partition: (a) Original cellular structure; (b) PartitionU; and (c) PartitionL.
Local PMFCPs in PartitionU: <a, b, d, g, f>
Local PMFCPs in PartitionL: <d, g, k, n>, <d, g, k, l>, <d, g, j, m>
Global PMFCPs: <a, b, d, g, k, n>, <a, b, d, g, k, l>, <a, b, d, g, j, m>, <a, b, d, g, f>
Fig. 17. PMFCP patterns.
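The merge of local PMFCPs shown in Fig. 17 can be illustrated with a short Python sketch. It is only one plausible reading of the merge step: it assumes that two paths sharing at least two consecutive vertices are joined by taking the first path up to the shared run and continuing with the second path from the shared run onward, and that a path is non-maximal when it is a contiguous sub-path of another path. The function names (overlap_start, merge, is_subpath, merge_partitions) are hypothetical.

```python
def overlap_start(p1, p2, min_common=2):
    """Return (i, j) with p1[i:i+min_common] == p2[j:j+min_common], or None."""
    for i in range(len(p1) - min_common + 1):
        for j in range(len(p2) - min_common + 1):
            if p1[i:i + min_common] == p2[j:j + min_common]:
                return i, j
    return None


def merge(p1, p2):
    """Join p1 (up to the shared run) with p2 (from the shared run onward)."""
    hit = overlap_start(p1, p2)
    if hit is None:
        return None
    i, j = hit
    return p1[:i] + p2[j:]


def is_subpath(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[k:k + n] == short for k in range(len(long) - n + 1))


def merge_partitions(global_paths, local_paths):
    """Merge two sets of paths and keep only the maximal ones."""
    merged = set(global_paths) | set(local_paths)
    for p1 in global_paths:
        for p2 in local_paths:
            for first, second in ((p1, p2), (p2, p1)):
                joined = merge(first, second)
                if joined is not None:
                    merged.add(joined)
    return {p for p in merged
            if not any(p != q and is_subpath(p, q) for q in merged)}


partition_u = {tuple("abdgf")}
partition_l = {tuple("dgkn"), tuple("dgkl"), tuple("dgjm")}
global_pmfcps = merge_partitions(partition_u, partition_l)
for p in sorted(global_pmfcps):
    print("<" + ", ".join(p) + ">")
```

On the two partitions of the example, the sketch yields the same four global PMFCPs that the text reports for Fig. 17. The quadratic overlap test is the naive choice; pre-filtering paths by the overlapped cell IDs, as the text suggests, would reduce the number of pairs examined.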
Complexity analysis: First, let us consider the time complexity of the procedure visit(). The procedure visit() is used to find all local PMFCPs starting with <u, v>. Let the local PMFCPs found be $P_1, P_2, \ldots, P_m$ and let the length of $P_i$ be $PL_i$, $1 \le i \le m$. The time complexity of steps 5–9 is bounded by O(1). The time complexity of steps 10 and 11 is bounded by O(the length of the current path). The time complexity of step 12 is bounded by O(1). That is, the time complexity of steps 5–12 is bounded by O(the length of the current path). Then, let us consider the time complexity of the else-part from step 13 to step 25. The time complexity of steps 14–20 is bounded by O(the number of in-out paths of w). The time complexity of steps 22–24 is bounded by O(the number of in-out paths of w), too. The time complexity of step 21 is bounded by O(1). So, the time complexity of steps 13–25 is bounded by O(the number of in-out paths of w). Therefore, the time complexity of the procedure visit() is bounded by $O\left(\sum_{i=1}^{m} PL_i\right)$ + O(the summation of the number of in-out paths of w for each w in the UnvisitedVertices stack). Since the extracted local PMFCPs contain all in-out paths of w, O(the summation of the number of in-out paths of w for each w in the UnvisitedVertices stack) is bounded by $O\left(\sum_{i=1}^{m} PL_i\right)$. Thus, the time complexity of the procedure visit() is bounded by $O\left(\sum_{i=1}^{m} PL_i\right)$.

Now, let us consider the time complexity of the graph-based mining algorithm. Assume that for the GSM network there are k frequent calling path sub-graphs $G_i$, $1 \le i \le k$, that sub-graph $G_i$ contains $n_i$ local PMFCPs $R_{i,1}, R_{i,2}, \ldots, R_{i,n_i}$, and that the length of $R_{i,j}$ is $RL_{i,j}$, $1 \le j \le n_i$. First, let us consider the time complexity of steps 3–10. The time complexity of steps 3–10 is bounded by O($|V_i|$ + the total lengths of mined local PMFCPs found in steps 3–10), where $|V_i|$ is the number of vertices in graph $G_i$. Similarly, the time complexity of steps 12–19 is bounded by O($|V_i|$ + the total lengths of mined local PMFCPs found in steps 12–19). The time complexity of step 11 is bounded by O(the total lengths of mined local PMFCPs found in steps 3–10). So, the time complexity of the graph-based mining algorithm for sub-graph $G_i$ is bounded by O($|V_i|$ + the total lengths of mined local PMFCPs in $G_i$), and the time complexity of steps 1–20 is bounded by O($|V|$ + the total lengths of mined local PMFCPs in all sub-graphs), where $|V|$ is the total number of vertices in all sub-graphs $G_i$, $1 \le i \le k$. That is, it is bounded by

$$O\left(|V| + \sum_{i=1}^{k}\sum_{j=1}^{n_i} RL_{i,j}\right).$$

To merge the local PMFCPs in neighboring sub-graphs $G_i$ and $G_j$, we first find the local PMFCPs in $G_i$ and $G_j$ that contain the overlapped cell IDs of both sub-graphs. Assume that the numbers of local PMFCPs containing the overlapped cell IDs in $G_i$ and $G_j$ are $s_{i,j}$ and $s_{j,i}$, respectively. The time complexity of merging two sets of local PMFCPs of neighboring sub-graphs is bounded by

$$O\left(\sum_{y=1}^{s_{j,i}}\sum_{x=1}^{s_{i,j}} RL_{i,x}\, RL_{j,y}\right).$$

Then, the time complexity of merging the local PMFCPs of all neighboring sub-graphs $G_i$ and $G_j$ into global PMFCPs in steps 21–29 is bounded by

$$O\left(\sum_{G_i\ \mathrm{Adjoins}\ G_j}\ \sum_{y=1}^{s_{j,i}}\sum_{x=1}^{s_{i,j}} RL_{i,x}\, RL_{j,y}\right).$$

Therefore, the time complexity of the graph-based mining algorithm is bounded by

$$O\left(|V| + \sum_{i=1}^{k}\sum_{j=1}^{n_i} RL_{i,j} + \sum_{G_i\ \mathrm{Adjoins}\ G_j}\ \sum_{y=1}^{s_{j,i}}\sum_{x=1}^{s_{i,j}} RL_{i,x}\, RL_{j,y}\right).$$
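As a small worked instance of the first two bounds, take the unpartitioned example traced earlier, in which visit() starting with <a, b> finds m = 4 local PMFCPs of lengths 6, 6, 6, and 5. The path lengths come from the walkthrough; treating the whole graph as a single sub-graph (k = 1, so that no merge term arises) is the assumption made here, and |V| is left symbolic because the number of vertices of the frequent calling path graph is not stated in this section.

```latex
\sum_{i=1}^{4} PL_i = 6 + 6 + 6 + 5 = 23,
\qquad
O\Bigl(|V| + \sum_{j=1}^{n_1} RL_{1,j}\Bigr) = O\bigl(|V| + 23\bigr)
\quad \text{with } k = 1,\ n_1 = 4 .
```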
5. Experimental results

To compare the performance of our proposed method with the Apriori-like and PrefixSpan approaches, four algorithms were implemented: two Apriori-like algorithms (the original Apriori algorithm and a revised Apriori algorithm), the PrefixSpan algorithm, and our proposed algorithm. In the original Apriori algorithm, we generated every possible combination of candidate paths for each level. In the revised Apriori algorithm, we generated the candidate paths by extending only the neighboring vertices of the last vertex in a frequent calling path. In the PrefixSpan algorithm, we used the pseudo-projection technique. All these algorithms and our proposed method were implemented in Microsoft Visual C++ 6.0. Our experiments were performed on a PC running Microsoft Windows 2000 Professional with a 1.5 GHz Intel Pentium IV processor, 512 MB of main memory, and a 120 GB hard disk.

5.1. Synthetic datasets

In our experiments, two synthetic datasets were simulated. The first one is a cell map file representing the cellular structure of a GSM network in which each cell is contiguous to six neighbors. For a GSM network with N cells, we arrange the cellular structure of the GSM network in a semi-square shape that contains $\sqrt{N}$ and $\sqrt{N}+1$ cells at consecutive levels. Fig. 18 illustrates the cellular structure of a GSM network consisting of 22 cells; each level contains either four or five cells. The second synthetic dataset is a text file in which the IDs of the cells visited by each mobile phone call are logged to form a calling path. The cells are labeled in sequential order from 0 to N − 1 if the GSM network contains N cells. The starting cell of each mobile phone call is drawn from a uniform distribution U(a, b), where a denotes the smallest cell ID and b denotes the largest cell ID. The next cell of a calling path is selected uniformly from the six neighboring cells. The length of a calling path is drawn from an exponential distribution with mean m.

5.2. Performance evaluations

We conducted four experiments with different parameters to evaluate the performance of our proposed graph-based mining algorithm, the Apriori-like algorithms, and the PrefixSpan algorithm. In addition, we evaluated the effect of memory size on both the graph-based mining and PrefixSpan algorithms.
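The calling-path generator described in Section 5.1 can be sketched as follows. This is an illustrative reconstruction, not the authors' generator: the function name generate_calling_paths, the neighbor-map input format, the rounding of the exponential draw to an integer length of at least one cell, and the fixed seed are all assumptions. The toy cell map is a ring in which each cell has only two neighbors, standing in for the hexagonal map with six neighbors per cell.

```python
import random


def generate_calling_paths(neighbors, num_paths, mean_length, seed=0):
    """Simulate calling paths as random walks over a cell-adjacency map.

    neighbors   -- dict mapping each cell ID to a list of neighboring cell IDs
    num_paths   -- number of calling paths to generate (D)
    mean_length -- mean of the exponential path-length distribution (T)
    """
    rng = random.Random(seed)
    cells = sorted(neighbors)
    paths = []
    for _ in range(num_paths):
        # Exponentially distributed length with the given mean, at least one cell.
        length = max(1, round(rng.expovariate(1.0 / mean_length)))
        cell = rng.choice(cells)  # uniform starting cell
        path = [cell]
        while len(path) < length:
            cell = rng.choice(neighbors[cell])  # uniform choice among neighbors
            path.append(cell)
        paths.append(path)
    return paths


# Toy cell map (hypothetical): a ring of six cells, two neighbors each.
RING = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
sample = generate_calling_paths(RING, num_paths=1000, mean_length=5.0)
print(len(sample), "paths; first path:", sample[0])
```

Passing the adjacency map as an argument keeps the walk logic independent of how the semi-square hexagonal layout is built, so a real cell map file could be loaded into the same dictionary shape.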
There are three components to a dataset: the number of calling paths D, the number of distinct cells C, and the average number of cells per calling path T. The first experiment uses different minimum support thresholds under the condition T10.0C1KD100K, where T10.0 denotes that the mean length of calling paths is 10.0, C1K denotes that the number of cells is 1K, and D100K denotes that the number of calling paths is 100K. The second experiment varies the number of calling paths under the condition T5.0C2KS0.05%, where S0.05% denotes that the minimum support is 0.05%. The third experiment varies the mean length of calling paths under the condition C2KD500KS0.05%. The fourth experiment varies the number of cells under the condition T5.0D10KS0.05%. The results of these four experiments are shown in Figs. 19–22, respectively. These four figures use the same legend as shown in Fig. 19. The final experiment varies the size of main memory to evaluate the performance of our proposed algorithm and the PrefixSpan algorithm; its result is shown in Fig. 23.

Fig. 19 illustrates the execution time vs. the minimum support threshold for the four algorithms. As the minimum support decreases, the execution time of both Apriori-like algorithms increases extremely sharply. This is because a smaller support threshold implies a greater number of candidate patterns and database scans, which degrades the performance of the Apriori-like algorithms. For the PrefixSpan algorithm, the execution time grows less sharply as the minimum support decreases, since the projected databases are smaller than the original database.
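The dataset labels used in this section (for example C2KD500KS0.05%) pack several parameters into one string. The following Python sketch decodes such a label and converts a percentage support into an absolute number of calling paths. The helper names parse_label and min_support_count are hypothetical, and two conventions are assumed rather than stated in the text: K and M are taken as 1,000 and 1,000,000, and the minimum support is taken as a fraction of the number of calling paths D.

```python
import re

# Assumed multipliers for the suffixes that appear in the labels.
_SCALE = {"": 1.0, "K": 1e3, "M": 1e6, "%": 1e-2}


def parse_label(label):
    """Decode a label such as 'T5.0C2KS0.05%' into a dict of numeric parameters.

    T = mean cells per path, C = number of cells,
    D = number of calling paths, S = minimum support (as a fraction).
    """
    params = {}
    for key, number, suffix in re.findall(r"([TCDS])(\d+(?:\.\d+)?)([KM%]?)", label):
        params[key] = float(number) * _SCALE[suffix]
    return params


def min_support_count(label):
    """Absolute support threshold implied by the S and D fields of a label."""
    p = parse_label(label)
    return round(p["S"] * p["D"])


cfg = parse_label("C2KD500KS0.05%")
print(cfg)
print(min_support_count("C2KD500KS0.05%"), "calling paths")
```

Under these assumptions, the third experiment's setting (500K calling paths at 0.05% support) requires a path pattern to occur in about 250 calling paths to count as frequent.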
[Fig. 19: run time (sec.) vs. minimum support for the Apriori, RevisedApriori, PrefixSpan, and Graph-based algorithms, T10.0C1KD100K; plot not reproduced.]
[Fig. 20: run time (sec.) vs. number of calling paths, T5.0C2KS0.05%; plot not reproduced.]
By contrast, because the proposed graph-based algorithm scans the dataset only once and does not generate unnecessary candidates, its execution time increases only slightly. When the minimum support is equal to 0.36%, the revised Apriori algorithm runs 1.28 times faster than the Apriori algorithm, the PrefixSpan algorithm also runs 1.28 times faster than the revised Apriori algorithm, and our proposed algorithm runs 14.87 times faster than the PrefixSpan algorithm. When the minimum support is equal to 0.06%, the revised Apriori algorithm runs 1.02 times faster than the Apriori algorithm, the PrefixSpan algorithm runs 4.17 times faster than the revised Apriori algorithm, and our proposed algorithm runs 72.15 times faster than the PrefixSpan algorithm. For the different minimum support thresholds, our proposed algorithm outperforms the other three algorithms by several orders of magnitude.

Fig. 20 shows how the various algorithms scale with respect to the database size. The number of calling paths is varied from 100K to 600K. As shown, all algorithms scale linearly with the number of calling paths. However, the execution time of the Apriori-like and the PrefixSpan algorithms grows sharply as the number of calling paths increases, whereas the execution time of our proposed algorithm grows comparatively slowly. As the database becomes larger, the cost of scanning the database in the Apriori-like and the PrefixSpan algorithms increases much more than that of the proposed graph-based mining algorithm, since the proposed algorithm scans the database only once. Hence, our proposed algorithm is much more scalable than the other three algorithms.

[Fig. 21: run time (sec.) vs. mean length of calling paths (3 to 7), C2KD500KS0.05%; plot not reproduced.]
[Fig. 22: run time (sec.) vs. number of cells (3000 to 8000), T5.0D10KS0.05%; plot not reproduced.]

Fig. 21 illustrates the effect of the mean length of calling paths for the four algorithms. The execution time of both Apriori-like algorithms rises sharply as the length of calling paths increases, whereas the execution time of the PrefixSpan algorithm and the proposed graph-based mining algorithm rises comparatively slowly. These results reflect the problem discussed in Section 2: both Apriori-like algorithms suffer from repeated database scans, especially when mining long patterns. The longer the frequent patterns are, the more times the database is scanned. For the PrefixSpan algorithm, the size of the projected databases shrinks quickly in later projections so that the projected databases fit in main memory; hence its execution time grows more smoothly than that of both Apriori-like algorithms. Thus, for the various mean lengths of calling paths, our proposed algorithm is better than the Apriori, the revised Apriori, and the PrefixSpan algorithms.

Fig. 22 illustrates the execution time vs. the number of cells for the four algorithms. The execution time of the Apriori algorithm grows very sharply when the number of cells increases. As the number of cells increases, the execution time of our proposed algorithm and the PrefixSpan algorithm increases less than that of the revised Apriori algorithm. This is expected because of the generation of redundant candidate patterns, as mentioned in Section 2. The number of redundant candidate patterns generated by the Apriori algorithm is much larger than that generated by the revised Apriori algorithm. In both PrefixSpan and our proposed algorithm, no redundant candidate patterns are generated. However, the number of projected databases created by the PrefixSpan algorithm becomes larger as the number of cells increases. For the different numbers of cells, our proposed algorithm is better than the PrefixSpan algorithm, and the PrefixSpan algorithm is better than both Apriori algorithms.

[Fig. 23: run time (sec.) vs. main memory size (0.25 to 4.00 MB) for the PrefixSpan and Graph-based algorithms; plot not reproduced.]

Fig. 23 illustrates the experimental results under the condition T5.0C4.5KD200MS0.003% as the main memory size varies from 0.25 to 4 MB. In this experiment, a large dataset with 200M calling paths was used to evaluate the effect of the main memory size on our proposed graph-based algorithm and the PrefixSpan algorithm. First, let us consider the PrefixSpan algorithm. When the memory size is equal to 4 MB, all first-level projected databases fit in the main memory and the subsequent mining process can be executed in the main memory. As the size of the main memory decreases, the first-level projected databases no longer fit in the main memory and more levels of database projections are required. Hence, more database scans (disk accesses) are required. As shown in Fig. 23, the execution time of the PrefixSpan algorithm increases sharply as the size of main memory decreases. Second, let us consider our proposed graph-based algorithm. When the memory size is equal to 4 MB, the whole calling path graph fits in the main memory and the database is scanned only once. As the size of main memory decreases, the cellular structure of the GSM network is partitioned and the number of database scans increases. The number of database scans in our proposed graph-based algorithm is proportional to the number of partitions of the cellular structure of the GSM network. Therefore, our proposed graph-based algorithm outperforms the PrefixSpan algorithm, especially when the database is very large.
As far as the size of main memory is concerned, the partition factors of our proposed algorithm and the PrefixSpan algorithm are quite different. In our proposed algorithm, the factor by which the frequent calling path graph is partitioned is determined by the size of the cellular structure of the GSM network, whereas in the PrefixSpan algorithm the partition factor is determined by the size of the database (or projected databases). Generally speaking, the database (or projected databases) is far larger than the cellular structure of the GSM network, so the number of partitions in the PrefixSpan algorithm is usually larger than that in our proposed algorithm. The more partitions there are, the more database scans are required. That is, our proposed algorithm is more efficient than the PrefixSpan algorithm, especially when mining huge databases. In summary, the experimental results show that our proposed mining algorithm outperforms the Apriori-like and the PrefixSpan algorithms by several orders of magnitude.
6. Conclusions

In this paper, we explore a new data mining capability that involves mining calling path patterns in GSM networks. Since a vertex in the corresponding calling path graph of the GSM network has at most 36 in-out paths and six out-edges, we require a fixed amount of storage to store the information of those in-out paths and out-edges for each vertex. By using the constraint of the limited number of neighboring cells in GSM networks, we devise a data structure, called the frequent calling path graph (or sub-graphs), to store the information necessary for mining PMFCPs. Then, we design the graph-based mining algorithm to mine the PMFCPs from the frequent calling path graph (or sub-graphs) obtained. By using the frequent calling path graph (or sub-graphs) to mine the calling path patterns, our proposed algorithm requires fewer database scans and does not generate unnecessary candidates. Therefore, our proposed algorithm is much more efficient than the Apriori-like and the PrefixSpan algorithms. The experimental results show that our proposed algorithm outperforms the Apriori-like and the PrefixSpan algorithms by several orders of magnitude.

Acknowledgements

The authors thank Steven Shaw of Taiwan Cellular Corporation for fruitful discussions. The authors are grateful to the referees for their valuable comments and suggestions, which significantly improved the presentation of this article.