Association Analysis
Association Analysis: Basic Concepts and Algorithms: Problem Definition, Frequent Item
Set generation, Rule generation, compact representation of frequent item sets, FP-Growth
Algorithm.
Preliminaries:
Binary Representation : Market basket data can be represented in a binary format as shown
in Table, where each row corresponds to a transaction and each column corresponds to an
item. An item can be treated as a binary variable whose value is one if the item is present in
a transaction and zero otherwise. Because the presence of an item in a transaction is often
considered more important than its absence, an item is an asymmetric binary variable.
Let I = {i1,i2,...,id}be the set of all items in a market basket data and T = {t1,t2,...,tN} be the
set of all transactions. Each transaction ti contains a subset of items chosen from I.
Itemset : In association analysis, a collection of zero or more items is termed an itemset.
k-itemset : If an itemset contains k items, it is called a k-itemset.
Example: For instance, {Beer, Diapers, Milk} is an example of a 3-itemset.
NullSet: The null (or empty) set is an itemset that does not contain any items.
A transaction tj is said to contain an itemset X if X is a subset of tj.
For example, the second transaction shown in Table contains the itemset {Bread, Diapers}
but not {Bread, Milk}.
Support count: Refers to the number of transactions that contain a particular itemset.
Mathematically, the support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|, where | · | denotes the number of elements in a set.
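As a small illustration, the support count can be computed directly from this definition. The transaction list below is a hypothetical reconstruction of the market basket table referred to in the text:

```python
# Minimal sketch: computing the support count sigma(X) by testing the
# subset relation X ⊆ t against every transaction t.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions t such that X is a subset of t."""
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"Bread", "Milk"}, transactions))             # 3
print(support_count({"Bread", "Diapers", "Milk"}, transactions))  # 2
```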
Formulation of the Association Rule Mining Problem: The association rule mining problem
can be formally stated as follows:
Definition (Association Rule Discovery): Given a set of transactions T, find all the rules having
support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the
corresponding support and confidence thresholds.
Brute-force approach: A brute-force approach for mining association rules is to compute
the support and confidence for every possible rule. This approach is prohibitively expensive
because there are exponentially many rules that can be extracted from a data set. More
specifically, assuming that neither the left nor the righthand side of the rule is an empty set,
the total number of possible rules, R, extracted from a data set that contains d items is:
R = 3^d − 2^(d+1) + 1.
Even for a small data set with d = 6 items, this approach requires us to compute the support
and confidence for 3^6 − 2^7 + 1 = 602 rules.
More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%,
thus wasting most of the computations. To avoid performing needless computations, it
would be useful to prune the rules early without having to compute their support and
confidence values.
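As a quick check of the rule-count formula (the value 602 corresponds to d = 6 items):

```python
# Sketch: total number of possible rules for a data set with d items.
def total_rules(d):
    return 3**d - 2**(d + 1) + 1

print(total_rules(6))   # 602
```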
An initial step toward improving the performance of association rule mining algorithms is to
decouple the support and confidence requirements. Notice that the support of a rule X −→ Y
is the same as the support of its corresponding itemset, X ∪ Y. For example, the following
rules have identical support because they involve items from the same itemset, {Beer, Diapers,
Milk}:
{Beer, Diapers}−→{Milk},
{Beer, Milk}−→{Diapers},
{Diapers, Milk}−→{Beer},
{Beer}−→{Diapers, Milk},
{Milk}−→{Beer, Diapers},
{Diapers}−→{Beer, Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned immediately without
our having to compute their confidence values.
Therefore, a common strategy adopted by many association rule mining algorithms is to
decompose the problem into two major subtasks:
1. Frequent Itemset Generation: whose objective is to find all the item sets that satisfy
the minsup threshold.
2. Rule Generation: whose objective is to extract all the high confidence rules from the
frequent item sets found in the previous step. These rules are called strong rules.
Frequent Itemset Generation (brute-force approach): A brute-force approach for finding
frequent item sets is to determine the support count for every candidate itemset by
comparing each candidate against every transaction. If the candidate is contained in a
transaction, its support count will be incremented. For example, the support for {Bread,
Milk} is incremented three times because the itemset is contained in transactions 1, 4, and 5.
Such an approach can be very expensive because it requires O(NMw) comparisons, where N
is the number of transactions, M = 2^d − 1 is the number of candidate item sets, and w is the
maximum transaction width. Transaction width is the number of items present in a
transaction.
Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent.
Suppose {c,d,e} is a frequent itemset. Clearly, any transaction that contains {c,d,e} must also
contain its subsets, {c,d}, {c,e}, {d,e}, {c}, {d}, and {e}. As a result, if {c,d,e} is frequent, then all
subsets of {c,d,e} (i.e., the shaded itemsets in the figure) must also be frequent. Conversely,
if an itemset such as {a,b} is infrequent, then all of its supersets must be infrequent too.
Support Based Pruning: The entire subgraph containing the supersets of {a,b} can be
pruned immediately once {a,b} is found to be infrequent. This strategy of trimming the
exponential search space based on the support measure is known as support-based pruning.
Such a pruning strategy is made possible by a key property of the support measure, namely,
that the support for an itemset never exceeds the support for its subsets. This property is
also known as the anti-monotone property of the support measure.
The pros of Apriori are as follows:
This is the simplest and easiest-to-understand algorithm among association rule
learning algorithms.
The resulting rules are intuitive and easy to communicate to an end user.
It doesn't require labelled data as it is fully unsupervised; as a result, you can use it
in many different situations because unlabelled data is often more accessible.
Many extensions were proposed for different use cases based on this
implementation—for example, there are association learning algorithms that take
into account the ordering of items, their number, and associated timestamps.
The algorithm is exhaustive, so it finds all the rules with the specified support and
confidence.
The Apriori Algorithm calculates more sets of frequent items.
The cons of Apriori are as follows:
If the data set is small, the algorithm can find many false associations that
happened simply by chance.
You can address this issue by evaluating the obtained rules on hold-out test data for
support, confidence, lift, and conviction.
The candidate generation could be extremely slow (pairs, triplets, etc.).
The candidate generation could generate duplicates depending on the
implementation.
The counting method iterates through all of the transactions each time.
Constant items (items that occur in nearly every transaction) make the algorithm a lot heavier.
Huge memory consumption.
Example: Applying the Apriori principle to the example data set prunes most of the
candidates, which represents a 68% reduction in the number of candidate item sets even in
this simple example.
The frequent itemset generation part of the Apriori algorithm can be summarized as follows
(a Python sketch of this loop is given after the candidate pruning step below).
Let Ck denote the set of candidate k-item sets and Fk denote the set of frequent k-item
sets:
• The algorithm initially makes a single pass over the data set to determine the support of
each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, will be
known (steps 1 and 2).
• Next, the algorithm will iteratively generate new candidate k-item sets and prune
unnecessary candidates that are guaranteed to be infrequent given the frequent (k−1)-item
sets found in the previous iteration (steps 5 and 6).
• To count the support of the candidates, the algorithm needs to make an additional pass
over the data set (steps 7–12). The subset function is used to determine all the candidate
item sets in Ck that are contained in each transaction t.
• After counting their supports, the algorithm eliminates all candidate item sets whose
support counts are less than N × minsup (step 13).
• The algorithm terminates when there are no new frequent item sets generated, i.e., Fk = ∅
(step 14).
Candidate Generation and Pruning:
1. Candidate Generation: This operation generates new candidate k-item sets based on the
frequent (k−1)-item sets found in the previous iteration.
2. Candidate Pruning: This operation eliminates some of the candidate k-item sets
using support-based pruning, i.e. by removing k-item sets whose subsets are known
to be infrequent in previous iterations. Note that this pruning is done without
computing the actual support of these k-item sets (which could have required
comparing them against each transaction).
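The following is a minimal Python sketch of the level-wise loop described above, assuming that transactions are given as Python sets and that minsup is an absolute count. The candidate generation inside the loop merges pairs of frequent k-item sets that share k−1 items and then applies support-based pruning, which yields the same candidates as the Fk−1 × Fk−1 procedure described next once pruning is applied.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup_count):
    """Sketch of the level-wise frequent itemset generation loop of Apriori."""
    def sigma(x):
        # Support counting: one pass over the transactions per level.
        return sum(1 for t in transactions if x <= t)

    # Single pass to determine the support of each item and form F1.
    items = {i for t in transactions for i in t}
    F = {1: [frozenset([i]) for i in items if sigma(frozenset([i])) >= minsup_count]}
    k = 1
    while F[k]:
        prev = set(F[k])
        # Candidate generation: merge pairs of frequent k-itemsets that share
        # k-1 items, then prune candidates with an infrequent k-subset.
        candidates = set()
        for a, b in combinations(prev, 2):
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in prev for s in combinations(c, k)):
                candidates.add(c)
        # Additional pass over the data to count the candidate supports.
        F[k + 1] = [c for c in candidates if sigma(c) >= minsup_count]
        k += 1
    return [itemset for level in F.values() for itemset in level]
```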
Fk−1×Fk−1 Method: This candidate generation procedure, which is used in the candidate-
gen function of the Apriori algorithm, merges a pair of frequent (k−1)-item sets only if their
first k−2 items, arranged in lexicographic order, are identical. Let A = {a1,a2,...,ak−1} and B =
{b1,b2,...,bk−1} be a pair of frequent (k−1)-item sets, arranged lexicographically. A and B are
merged if they satisfy the condition ai = bi (for i = 1, 2, ..., k−2). Note that in this
case, ak−1 ≠ bk−1 because A and B are two distinct item sets. The candidate k-
itemset generated by merging A and B consists of the first k −2 common items followed by
ak−1 and bk−1 in lexicographic order. This candidate generation procedure is complete,
because for every lexicographically ordered frequent k-itemset, there exist two
lexicographically ordered frequent (k −1)-item sets that have identical items in the first k −2
positions.
In Figure, the frequent item sets {Bread, Diapers} and {Bread, Milk} are merged to form a
candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer,
Diapers} with {Diapers, Milk} because the first item in both item sets is different. Indeed, if
{Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer,
Diapers} with {Beer, Milk} instead.
This example illustrates both the completeness of the candidate generation procedure and
the advantages of using lexicographic ordering to prevent duplicate candidates. Also, if we
order the frequent (k−1) item sets according to their lexicographic rank, item sets with
identical first k − 2 items would take consecutive ranks. As a result, the Fk−1 × Fk−1
candidate generation method would consider merging a frequent itemset only with ones
that occupy the next few ranks in the sorted list, thus saving some computations.
Figure shows that the Fk−1 × Fk−1 candidate generation procedure results in only one
candidate 3-itemset. This is a considerable reduction from the four candidate 3-itemsets
generated by the Fk−1 × F1 method. This is because the Fk−1 × Fk−1 method ensures that
every candidate k-itemset contains at least two frequent (k − 1)-item sets, thus greatly
reducing the number of candidates that are generated in this step.
Note that there can be multiple ways of merging two frequent (k − 1)-item sets in the Fk−1
× Fk−1 procedure, one of which is merging if their first k−2 items are identical.
An alternate approach could be to merge two frequent (k−1)-item sets A and B if the last
k−2 items of A are identical to the first k−2 items of B.
Example: {Bread, Diapers} and {Diapers, Milk} could be merged using this approach to
generate the candidate 3-itemset {Bread, Diapers, Milk}.
Candidate Pruning:
To illustrate the candidate pruning operation for a candidate k-itemset, X = {i1,i2,...,ik},
consider its k proper subsets, X − {ij} (∀ j = 1, 2, ..., k). If any of them are infrequent, then X is
immediately pruned by using the Apriori principle.
Note that we don’t need to explicitly ensure that all subsets of X of size less than k−1 are
frequent. This approach greatly reduces the number of candidate item sets considered
during support counting.
Brute-force: For the brute-force candidate generation method, candidate pruning requires
checking only k subsets of size k −1 for each candidate k-itemset.
Fk−1×F1 : since the Fk−1×F1 candidate generation strategy ensures that at least one of the
(k−1)-size subsets of every candidate k-itemset is frequent, we only need to check for the
remaining k−1 subsets.
Fk−1 ×Fk−1 strategy :The Fk−1 ×Fk−1 strategy requires examining only k−2 subsets of every
candidate k-itemset, since two of its (k −1)-size subsets are already known to be frequent in
the candidate generation step.
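A small sketch of the Fk−1 × Fk−1 merge condition and the pruning check, using an illustrative set of frequent 2-itemsets consistent with the example above (items kept in lexicographic order):

```python
from itertools import combinations

def merge_fk1_fk1(a, b):
    """Merge two sorted (k-1)-itemsets if their first k-2 items are identical."""
    if a[:-1] == b[:-1] and a[-1] != b[-1]:
        return tuple(sorted(set(a) | set(b)))
    return None   # the pair does not satisfy the merge condition

def survives_pruning(candidate, frequent_prev):
    """Support-based pruning: every (k-1)-subset of the candidate must be frequent."""
    k = len(candidate)
    return all(s in frequent_prev for s in combinations(candidate, k - 1))

F2 = {("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")}

print(merge_fk1_fk1(("Bread", "Diapers"), ("Bread", "Milk")))   # ('Bread', 'Diapers', 'Milk')
print(merge_fk1_fk1(("Beer", "Diapers"), ("Diapers", "Milk")))  # None: first items differ
print(survives_pruning(("Bread", "Diapers", "Milk"), F2))       # True: all 2-subsets frequent
```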
Support Counting:
Support counting is the process of determining the frequency of occurrence for every
candidate itemset that survives the candidate pruning step.
Support counting is implemented in steps 6 through 11 of Algorithm.
A brute-force approach for doing this is to compare each transaction against every
candidate itemset and to update the support counts of candidates contained in a
transaction. This approach is computationally expensive, especially when the numbers of
transactions and candidate item sets are large.
An alternative approach is to enumerate the item sets contained in each transaction and use
them to update the support counts of their respective candidate item sets. To illustrate,
consider a transaction t that contains five items, {1,2,3,5,6}. There are C(5,3) = 10 item sets of
size 3 contained in this transaction. Some of the item sets may correspond to the candidate
3-itemsets under investigation, in which case, their support counts are incremented. Other
subsets of t that do not correspond to any candidates can be ignored.
Figure shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that
each itemset keeps its items in increasing lexicographic order, an itemset can be
enumerated by specifying the smallest item first, followed by the larger items.
For instance, given t = {1,2,3,5,6}, all the 3-item sets contained in t must begin with item 1,
2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there
are only two items in t whose labels are greater than or equal to 5. The number of ways to
specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix tree
structure depicted in Figure.
For instance, 1 2356 represents a 3-itemset that begins with item 1, followed by two more
items chosen from the set {2,3,5,6}. After fixing the first item, the prefix tree structure at
Level 2 represents the number of ways to select the second item. For example, 1 2 356
corresponds to item sets that begin with the prefix (1 2) and are followed by the items 3, 5,
or 6.
Finally, the prefix tree structure at Level 3 represents the complete set of 3-itemsets
contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1,2,3}, {1,2,5},
and {1,2,6}, while those that begin with prefix {2 3} are {2,3,5} and {2,3,6}.
The prefix tree structure shown in Figure demonstrates how item sets contained in a
transaction can be systematically enumerated, i.e., by specifying their items one by one,
from the leftmost item to the rightmost item. We still have to determine whether each
enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the
candidates, then the support count of the corresponding candidate is incremented.
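A short sketch of this enumeration using itertools; the candidate 3-itemsets shown are hypothetical:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)   # transaction with items kept in increasing order

# Hypothetical candidate 3-itemsets currently under investigation.
support = {(1, 2, 3): 0, (1, 3, 5): 0, (2, 3, 6): 0, (4, 5, 6): 0}

# Enumerate all 10 itemsets of size 3 contained in t and increment the
# counts of those that match an existing candidate.
for subset in combinations(t, 3):
    if subset in support:
        support[subset] += 1

print(support)   # (1, 2, 3), (1, 3, 5) and (2, 3, 6) are found in t; (4, 5, 6) is not
```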
Apriori Property:
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori
algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Step-1: K=1
(I) Create a table containing support count of each item present in dataset – Called C1(candidate
set)
(II) Compare each candidate's support count with the minimum support count (here min_support = 2);
if the support count of a candidate itemset is less than min_support, remove that itemset. This gives
us itemset L1.
Step-2: K=2
Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 and
Lk-1 is that they should have (K-2) elements in common.
Check whether all subsets of each itemset are frequent; if not, remove that itemset.
(For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2);
if the support count of a candidate itemset is less than min_support, remove that itemset. This gives
us itemset L2.
Step-3:
o Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1 is that
they should have (K-2) elements in common. So here, for L2, the first element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I4, I5}, {I2, I3, I5}.
o Check whether all subsets of these itemsets are frequent and, if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are frequent. For
{I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
o Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2);
if the support count of a candidate itemset is less than min_support, remove that itemset. This gives
us itemset L3.
Step-4:
o Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1 (K=4) is
that they should have (K-2) elements in common. So here, for L3, the first 2 elements
(items) should match.
o Check whether all subsets of these itemsets are frequent (here the itemset formed by joining
L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent). So there is no
itemset in C4.
o We stop here because no further frequent itemsets are found.
The strength of an association rule can be described in terms of:
1. Support
2. Confidence
3. Lift
Out of 4000 transactions, 400 contain Biscuits and 600 contain Chocolates, and 200 of these transactions
contain both Biscuits and Chocolates. Using this data, we will find the support, confidence, and
lift.
Support
Support refers to the default popularity of a product. It is the number of transactions containing that product
divided by the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought chocolates. You obtain it by
dividing the number of transactions that contain both biscuits and chocolates by the number of transactions
that contain biscuits.
Hence,
Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought chocolates also.
Lift
Consider the above example; lift refers to the increase in the likelihood of selling chocolates when you sell
biscuits. The mathematical equation of lift is given below.
Lift = Confidence (Biscuits → Chocolates) / Support (Chocolates)
= 50/15 ≈ 3.33
It means that customers who buy biscuits are about 3.3 times more likely to buy chocolates than customers in
general. If the lift value is below one, the two items are unlikely to be bought together. The larger the value, the
better the combination.
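The same calculations written out as a short sketch (the counts are the ones given in the example):

```python
total = 4000          # total number of transactions
biscuits = 400        # transactions containing biscuits
chocolates = 600      # transactions containing chocolates
both = 200            # transactions containing both biscuits and chocolates

support_biscuits = biscuits / total          # 0.10 -> 10 percent
support_chocolates = chocolates / total      # 0.15 -> 15 percent
confidence = both / biscuits                 # 0.50 -> 50 percent
lift = confidence / support_chocolates       # about 3.33

print(support_biscuits, support_chocolates, confidence, round(lift, 2))
```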
Example of a hash tree structure: Each internal node of the tree uses the hash function
h(p) = (p − 1) mod 3, where mod refers to the modulo (remainder) operator, to
determine which branch of the current node should be followed next. For example, items 1,
4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the
same remainder after dividing (p − 1) by 3. All candidate item sets are stored at the
leaf nodes of the hash tree. The hash tree shown in Figure 5.11 contains 15 candidate 3-
itemsets, distributed across 9 leaf nodes.
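A tiny sketch of the hash function used at every internal node and the branch each item is sent to:

```python
def h(p):
    # Hash function from the example: h(p) = (p - 1) mod 3.
    return (p - 1) % 3

# Branch 0 is the leftmost child, 1 the middle child, 2 the rightmost child.
branches = {0: [], 1: [], 2: []}
for item in range(1, 10):
    branches[h(item)].append(item)

print(branches)   # {0: [1, 4, 7], 1: [2, 5, 8], 2: [3, 6, 9]}
```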
Consider the transaction, t = {1,2,3,5,6}. To update the support counts of the candidate item
sets, the hash tree must be traversed in such a way that all the leaf nodes containing
candidate 3-itemsets belonging to t must be visited at least once. Recall that the 3-itemsets
contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix tree
structure shown in Figure. Therefore, at the root node of the hash tree, the items 1, 2, and 3
of the transaction are hashed separately. Item 1 is hashed to the left child of the root node,
item 2 is hashed to the middle child, and item 3 is hashed to the right child. At the next level
of the tree, the transaction is hashed on the second item listed in the Level 2 tree structure
shown in Figure.
For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction
are hashed. Based on the hash function, items 2 and 5 are hashed to the middle child, while
item 3 is hashed to the right child, as shown in Figure. This process continues until the leaf
nodes of the hash tree are reached. The candidate item sets stored at the visited leaf nodes
are compared against the transaction. If a candidate is a subset of the transaction, its
support count is incremented. Note that not all the leaf nodes are visited while traversing
the hash tree, which helps in reducing the computational cost. In this example, 5 out of the
9 leaf nodes are visited and 9 out of the 15 item sets are compared against the transaction.
Candidate generation : To generate candidate k-item sets, pairs of frequent (k − 1)-item sets
are merged to determine whether they have at least k − 2 items in common. Each merging
operation requires at most k − 2 equality comparisons. Every merging step can produce at
most one viable candidate k-itemset, while in the worst-case, the algorithm must try to
merge every pair of frequent (k − 1)-item sets found in the previous iteration. Therefore, the
overall cost of merging frequent item sets lies between Σk (k − 2)|Ck| and Σk (k − 2)|Fk−1|^2,
where the sums run over k = 2, ..., w and w is the maximum transaction width.
A hash tree is also constructed during candidate generation to store the candidate item
sets. Because the maximum depth of the tree is k, the cost for populating the hash tree with
candidate item sets is O(Σk k|Ck|). During candidate pruning, we need to verify that the
k−2 subsets of every candidate k-itemset are frequent. Since the cost for looking up a
candidate in a hash tree is O(k), the candidate pruning step requires O(Σk k(k − 2)|Ck|) time.
Rule Generation:
Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that
have empty antecedents or consequents (∅−→Y or Y −→∅). An association rule can be
extracted by partitioning the itemset Y into two non-empty subsets, X and Y −X, such that X
−→ Y −X satisfies the confidence threshold. Note that all such rules must have already met
the support threshold because they are generated from a frequent itemset.
Example: Let X = {a,b,c} be a frequent itemset. There are six candidate association rules that
can be generated from X: {a,b}−→{c}, {a,c}−→{b}, {b,c}−→{a}, {a}−→{b,c}, {b}−→{a,c}, and
{c}−→{a,b}. Since the support of each rule is identical to the support of X, all the rules satisfy
the support threshold.
Computing the confidence of an association rule does not require additional scans of the
transaction data set. Consider the rule {1,2} −→ { 3}, which is generated from the frequent
itemset X = {1,2,3}. The confidence for this rule is σ({1,2,3})/σ({1,2}). Because {1,2,3} is
frequent, the anti-monotone property of support ensures that {1,2} must be frequent, too.
Since the support counts for both item sets were already found during frequent itemset
generation, there is no need to read the entire data set again.
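A compact sketch of this idea: the candidate rules of a frequent itemset are enumerated by choosing every non-empty proper subset as the antecedent, and their confidence is computed from support counts assumed to have been recorded during frequent itemset generation (the counts below are illustrative):

```python
from itertools import combinations

# Illustrative support counts recorded during frequent itemset generation.
sigma = {
    frozenset({1, 2, 3}): 2,
    frozenset({1, 2}): 3, frozenset({1, 3}): 2, frozenset({2, 3}): 3,
    frozenset({1}): 4, frozenset({2}): 4, frozenset({3}): 4,
}

def rules_from_itemset(Y, minconf):
    """Yield every rule X -> Y - X whose confidence meets minconf."""
    Y = frozenset(Y)
    for r in range(1, len(Y)):                      # non-empty proper subsets only
        for antecedent in combinations(sorted(Y), r):
            X = frozenset(antecedent)
            conf = sigma[Y] / sigma[X]              # no extra pass over the data
            if conf >= minconf:
                yield set(X), set(Y - X), conf

for X, consequent, conf in rules_from_itemset({1, 2, 3}, minconf=0.6):
    print(X, "->", consequent, round(conf, 2))
```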
Confidence-Based Pruning:
Unlike the support measure, confidence does not have the anti-monotone property.
Nevertheless, if we compare rules generated from the same frequent itemset Y, the
following theorem holds for the confidence measure.
Theorem: If a rule X −→ Y − X does not satisfy the confidence threshold, then any rule
X' −→ Y − X', where X' is a subset of X, must not satisfy the confidence threshold as well.
Rule Generation in the Apriori Algorithm:
The Apriori algorithm uses a level-wise approach for generating association rules, where
each level corresponds to the number of items that belong to the rule consequent. Initially,
all the high confidence rules that have only one item in the rule consequent are extracted.
These rules are then used to generate new candidate rules. For example, if {acd}−→{b}
and {abd}−→{c} are high confidence rules, then the candidate rule {ad}−→{bc} is generated
by merging the consequents of both rules. Figure shows a lattice structure for the
association rules generated from the frequent itemset {a,b,c,d}. If any node in the lattice has
low confidence, then according to the theorem above, the entire subgraph spanned by the node
can be pruned immediately. Suppose the confidence for {bcd}−→{a} is low. All the rules
containing item a in their consequent, including {cd}−→{ab}, {bd}−→{ac}, {bc}−→{ad}, and
{d}−→{abc}, can be discarded.
Procedure ap-genrules(fk, Hm):
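The pseudocode for this procedure is not reproduced here; the following is a hedged Python sketch of the same level-wise idea, in which the m-item consequents of high-confidence rules are merged into candidate (m+1)-item consequents and low-confidence consequents are pruned using the theorem above (the support counts are illustrative):

```python
from itertools import combinations

def ap_genrules(fk, Hm, sigma, minconf, rules):
    """Sketch of level-wise rule generation from the frequent itemset fk.
    Hm holds the m-item consequents of high-confidence rules found so far."""
    if not Hm:
        return
    k, m = len(fk), len(next(iter(Hm)))
    if k <= m + 1:
        return
    # Merge pairs of m-item consequents into candidate (m+1)-item consequents.
    Hm1 = {a | b for a, b in combinations(Hm, 2) if len(a | b) == m + 1}
    for h in list(Hm1):
        conf = sigma[fk] / sigma[fk - h]
        if conf >= minconf:
            rules.append((fk - h, h, conf))
        else:
            Hm1.discard(h)   # by the theorem, supersets of h need not be tried
    ap_genrules(fk, Hm1, sigma, minconf, rules)

# Illustrative support counts for the frequent itemset {a, b, c, d} and its subsets.
sigma = {
    frozenset("abcd"): 2,
    frozenset("abc"): 2, frozenset("abd"): 2, frozenset("acd"): 3, frozenset("bcd"): 2,
    frozenset("ab"): 3, frozenset("ac"): 4, frozenset("ad"): 3,
    frozenset("bc"): 3, frozenset("bd"): 3, frozenset("cd"): 4,
    frozenset("a"): 5, frozenset("b"): 4, frozenset("c"): 5, frozenset("d"): 5,
}
fk, minconf, rules = frozenset("abcd"), 0.6, []
# Start with the high-confidence rules that have a single item in the consequent.
H1 = set()
for i in fk:
    conf = sigma[fk] / sigma[fk - frozenset({i})]
    if conf >= minconf:
        H1.add(frozenset({i}))
        rules.append((fk - frozenset({i}), frozenset({i}), conf))
ap_genrules(fk, H1, sigma, minconf, rules)
for X, Y, conf in rules:
    print(sorted(X), "->", sorted(Y), round(conf, 2))
```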
Closed Itemsets:
Closed itemsets provide a minimal representation of all itemsets without losing their
support information. A formal definition of a closed itemset is presented below.
Definition(Closed Itemset): An itemset X is closed if none of its immediate supersets has
exactly the same support count as X.
Put another way, X is not closed if at least one of its immediate supersets has the same
support count as X. Examples of closed item sets are shown in Figure. To better illustrate the
support count of each itemset, we have associated each node (itemset) in the lattice with a
list of its corresponding transaction IDs.
For example, since the node {b,c} is associated with transaction IDs 1, 2, and 3, its support
count is equal to three. From the transactions given in this diagram, notice that the support
for {b} is identical to that of {b,c}. This is because every transaction that contains b also contains c.
Hence, {b} is not a closed itemset. Similarly, since c occurs in every transaction that contains
both a and d, the itemset {a,d} is not closed as it has the same support as its superset {a,c,d}.
On the other hand, {b,c} is a closed itemset because it does not have the same support
count as any of its supersets. An interesting property of closed item sets is that if we know
their support counts, we can derive the support count of every other itemset in the itemset
lattice without making additional passes over the data set. For example, consider the 2-
itemset {b,e} in Figure. Since {b,e} is not closed, its support must be equal to the support of
one of its immediate supersets, {a,b,e}, {b,c,e}, and {b,d,e}. Further, none of the supersets
of {b,e} can have a support greater than the support of {b,e}, due to the anti-monotone
nature of the support measure. Hence, the support of {b,e} can be computed by examining
the support counts of all of its immediate supersets of size three and taking their maximum
value. If an immediate superset is closed (e.g., {b,c,e}), we would know its support count.
Otherwise, we can recursively compute its support by examining the supports of its
immediate supersets of size four. In general, the support count of any non-closed (k−1)-
itemset can be determined as long as we know the support counts of all k-item sets. Hence,
one can devise an iterative algorithm that computes the support counts of item sets at level
k−1 using the support counts of item sets at level k, starting from the level kmax, where
kmax is the size of the largest closed itemset. Even though closed item sets provide a
compact representation of the support counts of all item sets, they can still be exponentially
large in number. Moreover, for most practical applications, we only need to determine the
support count of all frequent itemsets. In this regard, closed frequent itemsets provide a
compact representation of the support counts of all frequent itemsets, which can be
defined as follows
Definition(Closed Frequent Itemset): An itemset is a closed frequent itemset if it is closed
and its support is greater than or equal to minsup
We can use closed frequent item sets to determine the support counts for all non-closed
frequent item sets.
For example, consider the frequent itemset {a,d} shown in Figure. Because this itemset is
not closed, its support count must be equal to the maximum support count of its immediate
supersets, {a,b,d}, {a,c,d}, and {a,d,e}. Also, since {a,d} is frequent, we only need to consider
the support of its frequent supersets. In general, the support count of every non-closed
frequent k-itemset can be obtained by considering the support of all its frequent supersets
of size k + 1.
For example, since the only frequent superset of {a,d} is {a,c,d}, its support is equal to the
support of {a,c,d}, which is 2. Using this methodology, an algorithm can be developed to
compute the support for every frequent itemset. The pseudocode for this algorithm is
shown in Algorithm. The algorithm proceeds in a specific-to-general fashion, i.e., from the
largest to the smallest frequent item sets. This is because, in order to find the support for a
non-closed frequent itemset, the support for all of its supersets must be known. Note that
the set of all frequent item sets can be easily computed by taking the union of all subsets of
frequent closed item sets.
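A hedged sketch of this specific-to-general computation; the closed frequent itemsets and their support counts below are illustrative, not those of the figure:

```python
from itertools import chain, combinations

# Illustrative closed frequent itemsets with their support counts.
closed = {
    frozenset({"c"}): 3,
    frozenset({"b", "c"}): 3,
    frozenset({"a", "c", "d"}): 2,
}

def non_empty_subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))

# Every frequent itemset is a subset of some closed frequent itemset.
frequent = {frozenset(x) for c in closed for x in non_empty_subsets(c)}

support = dict(closed)
kmax = max(len(c) for c in closed)
# Work from level kmax - 1 down to 1: a non-closed frequent k-itemset takes the
# maximum support of its frequent (k + 1)-supersets.
for k in range(kmax - 1, 0, -1):
    for X in (f for f in frequent if len(f) == k and f not in support):
        support[X] = max(support[Y] for Y in support if len(Y) == k + 1 and X < Y)

for X in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(X), support[X])
```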
Advantage of using closed frequent itemsets:
To illustrate the advantage of using closed frequent itemsets, consider the data set shown in
Table which contains ten transactions and fifteen items. The items can be divided into three
groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items
b1 through b5; and (3) Group C, which contains items c1 through c5. Assuming that the
support threshold is 20%, itemsets involving items from the same group are frequent, but
itemsets involving items from different groups are infrequent. The total number of frequent
itemsets is thus 3 × (2^5 − 1) = 93. However, there are only four closed frequent itemsets in
the data: ({a3,a4}, {a1,a2,a3,a4,a5}, {b1,b2,b3,b4,b5}, and {c1,c2,c3,c4,c5}). It is often
sufficient to present only the closed frequent itemsets to the analysts instead of the entire
set of frequent itemsets.
Relationships among frequent, closed, closed frequent, and maximal frequent itemsets:
All maximal frequent item sets are closed because none of the maximal frequent item sets
can have the same support count as their immediate supersets. The relationships among
frequent, closed, closed frequent, and maximal frequent item sets are shown in Figure.
FP-Growth Algorithm:
This algorithm is an improvement over the Apriori method. Frequent patterns are generated without the need for
candidate generation. The FP-Growth algorithm represents the database in the form of a tree called a frequent
pattern tree or FP-tree.
This tree structure maintains the association between the itemsets. The database is fragmented using one
frequent item, and each fragmented part is called a “pattern fragment”. The itemsets of these pattern fragments
are analyzed. Thus, with this method, the search for frequent itemsets is reduced considerably.
It encodes the data set using a compact data structure called an FP-tree and extracts
frequent item sets directly from this structure.
FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed by reading the
data set one transaction at a time and mapping each transaction onto a path in the FP-tree.
As different transactions can have several items in common, their paths might overlap. The
more the paths overlap with one another, the more compression we can achieve using the
FP-tree structure. If the size of the FP-tree is small enough to fit into main memory, this will
allow us to extract frequent item sets directly from the structure in memory instead of
making repeated passes over the data stored on disk.
Figure shows a data set that contains ten transactions and five items.
The structures of the FP-tree after reading the first three transactions are also depicted in
the diagram. Each node in the tree contains the label of an item along with a counter that
shows the number of transactions mapped onto the given path. Initially, the FP-tree
contains only the root node represented by the null symbol.
The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent
items are discarded, while the frequent items are sorted in decreasing support counts inside
every transaction of the data set. For the data set shown in Figure 5.24, a is the most
frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP tree. After reading
the first transaction, {a, b}, the nodes labelled as a and b are created. A path is then formed
from null → a → b to encode the transaction. Every node along the path has a frequency
count of 1.
3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c,
and d. A path is then formed to represent the transaction by connecting the nodes null → b
→ c → d. Every node along this path also has a frequency count equal to one. Although the
first two transactions have an item in common, which is b, their paths are disjoint because
the transactions do not share a common prefix.
4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first
transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps
with the path for the first transaction, null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two, while the frequency counts for the newly
created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths
given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the
bottom of Figure.
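A compact sketch of the two-pass construction described above (class names are illustrative, and the node-links and header table that FP-growth uses for mining are omitted). The three transactions correspond to the first three transactions discussed in the steps above:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}                 # item label -> child FPNode

def build_fp_tree(transactions, minsup_count):
    # Pass 1: determine the support count of each item; discard infrequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= minsup_count}

    # Pass 2: map each transaction onto a path, items in decreasing support order.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1     # shared prefix: bump the counter
            else:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
    return root

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}]
tree = build_fp_tree(transactions, minsup_count=1)
print({item: child.count for item, child in tree.children.items()})   # {'a': 2, 'b': 1}
```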
The size of an FP-tree is typically smaller than the size of the uncompressed data because
many transactions in market basket data often share a few items in common.
In the best-case scenario, where all the transactions have the same set of items, the FP-tree
contains only a single branch of nodes.
The worst case scenario happens when every transaction has a unique set of items. As none
of the transactions have any items in common, the size of the FP-tree is effectively the same
as the size of the original data. However, the physical storage requirement for the FP-tree is
higher because it requires additional space to store pointers between nodes and counters
for each item. The size of an FP-tree also depends on how the items are ordered. Ordering
items in decreasing order of support counts relies on the assumption that high-support
items occur more frequently across all paths and hence are the most commonly shared
prefixes.
Benefits of the FP-tree Structure:
Completeness
Preserve complete information for frequent pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info: infrequent items are gone
Items in frequency-descending order: the more frequently occurring, the more likely to be shared
Never larger than the original database (not counting node-links and the count fields)
There exist examples of databases where the compression ratio can be over 100.
FP-growth finds all the frequent itemsets ending with a particular suffix by employing a
divide-and-conquer strategy to split the problem into smaller subproblems. For example,
suppose we are interested in finding all frequent itemsets ending in e. To do this, we must
first check whether the itemset {e} itself is frequent. If it is frequent, we consider the
subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn,
each of these subproblems are further decomposed into smaller subproblems. By merging
the solutions obtained from the subproblems, all the frequent itemsets ending in e can be
found. Finally, the set of all frequent itemsets can be generated by merging the solutions to
the subproblems of finding frequent itemsets ending in e, d, c, b, and a. This divide-and-
conquer approach is the key strategy employed by the FP-growth algorithm.
For a more concrete example on how to solve the subproblems, consider the task of finding
frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial paths are called
prefix paths and are shown in Figure(a).
2. From the prefix paths shown in Figure(a), the support count for e is obtained by adding
the support counts associated with node e. Assuming that the minimum support count is 2,
{e} is declared a frequent itemset because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent
item sets ending in de, ce, be, and ae. Before solving these subproblems, it must first
convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree,
except it is used to find frequent item sets ending with a particular suffix.
A conditional FP-tree is obtained in the following way:
(a) First, the support counts along the prefix paths must be updated because some of
the counts include transactions that do not contain item e. For example, the
rightmost path shown in Figure (a), null −→ b:2 −→ c:2 −→ e:1, includes a
transaction {b,c} that does not contain item e. The counts along the prefix path must
therefore be adjusted to 1 to reflect the actual number of transactions containing {b,
c,e}
(b) The prefix paths are truncated by removing the nodes for e. These nodes can be
removed because the support counts along the prefix paths have been updated to
reflect only transactions that contain e and the subproblems of finding frequent item
sets ending in de, ce, be, and ae no longer need information about node e.
(c) After updating the support counts along the prefix paths, some of the items may no
longer be frequent. For example, the node b appears only once and has a support
count equal to 1, which means that there is only one transaction that contains both
b and e. Item b can be safely ignored from subsequent analysis because all item sets
ending in be must be infrequent.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding
frequent item sets ending in de, ce, and ae. To find the frequent item sets ending in
de, the prefix paths for d are gathered from the conditional FP-tree for e. By adding
the frequency counts associated with node d, we obtain the support count for {d, e}.
Since the support count is equal to 2, {d, e} is declared a frequent itemset. Next, the
algorithm constructs the conditional FP-tree for de using the approach described in
step 3. After updating the support counts and removing the infrequent item c, the
conditional FP-tree for de is shown in Figure 5.27(d). Since the conditional FP-tree
contains only one item, a, whose support is equal to minsup, the algorithm extracts
the frequent itemset {a, d, e} and moves on to the next subproblem, which is to
generate frequent item sets ending in ce. After processing the prefix paths for c, {c,
e} is found to be frequent. However, the conditional FP-tree for ce will have no
frequent items and thus will be eliminated. The algorithm proceeds to solve the next
subproblem and finds {a, e} to be the only frequent itemset remaining.
FP Growth Algorithm:
Advantages of FP-Growth:
Divide-and-conquer:
Decompose both the mining task and DB according to the frequent patterns
obtained so far.
leads to focused search of smaller databases.
Other factors:
no candidate generation, no candidate test
compressed database: FP-tree structure
no repeated scan of entire database
basic ops—counting local freq items and building sub FP-tree, no pattern
search and matching
The algorithm only needs to read the data set twice, as opposed to Apriori, which
reads it once for every iteration.
It removes the need to calculate the pairs to be counted, which is very
processing-heavy, because it uses the FP-tree. This makes it roughly O(n), which is
much faster than Apriori.
The FP-Growth algorithm stores in memory a compact version of the
database.
Disadvantages of FP-Growth:
The biggest problem is the interdependency of the data: to parallelize the
algorithm, some data still needs to be shared, which creates a bottleneck in
the shared memory.
Technique: The Apriori algorithm uses the Apriori property and the join and prune steps for
mining frequent patterns. The FP-Growth algorithm constructs a conditional pattern base
and a conditional FP-tree for the items of the database that satisfy the minimum
support.
Search Type: Apriori uses a breadth-first search method, while FP-Growth uses a divide-
and-conquer method.
Memory Utilization: The Apriori algorithm requires large memory space because it deals
with a large number of generated candidate item sets. The FP-Growth algorithm requires
less memory due to its compact structure; it discovers the frequent item sets without
candidate itemset generation.
Number of Scans: The Apriori algorithm performs multiple scans of the database to generate
candidate sets. The FP-Growth algorithm scans the database only twice.