
UNIT-4

Association Analysis
Association Analysis: Basic Concepts and Algorithms: Problem Definition, Frequent Itemset Generation, Rule Generation, Compact Representation of Frequent Itemsets, FP-Growth Algorithm.
Preliminaries:
Binary Representation: Market basket data can be represented in a binary format as shown
in Table, where each row corresponds to a transaction and each column corresponds to an
item. An item can be treated as a binary variable whose value is one if the item is present in
a transaction and zero otherwise. Because the presence of an item in a transaction is often
considered more important than its absence, an item is an asymmetric binary variable.

Let I = {i1,i2,...,id}be the set of all items in a market basket data and T = {t1,t2,...,tN} be the
set of all transactions. Each transaction ti contains a subset of items chosen from I.
Itemset: In association analysis, a collection of zero or more items is termed an itemset.
k-itemset: If an itemset contains k items, it is called a k-itemset.
Example: For instance, {Beer, Diapers, Milk} is an example of a 3-itemset.
Null Set: The null (or empty) set is an itemset that does not contain any items.
A transaction tj is said to contain an itemset X if X is a subset of tj.
For example, the second transaction shown in Table contains the itemset {Bread, Diapers}
but not {Bread, Milk}.
Support count: Refers to the number of transactions that contain a particular itemset.
Mathematically, the support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|,
where the symbol |·| denotes the number of elements in a set.


Example: The support count for {Beer, Diapers, Milk} is equal to two because there are only
two transactions that contain all three items.
Support: Fraction of transactions in which an itemset occurs:
s(X)=σ(X)/N.
Frequent Itemset: An itemset X is called frequent if s(X) is greater than or equal to a user-defined threshold, minsup.
Transaction width: Transaction width is the number of items present in a transaction

Association Rule: An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅.
The strength of an association rule can be measured in terms of its support and confidence.
Support: Support determines how often a rule is applicable to a given data set.
Confidence: confidence determines how frequently items in Y appear in transactions that
contain X.
 Confidence measures the strength of association between two itemsets, often referred to as the
antecedent (X) and consequent (Y) of an association rule.
 Confidence is calculated as the conditional probability of finding the consequent Y in a
transaction given that the antecedent X is present in that transaction.

The formal definitions of these metrics are:
Support, s(X → Y) = σ(X ∪ Y) / N
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)

Example: Consider the rule {Milk, Diapers} −→ {Beer}.


Support: The support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, so the rule's support is 2/5 = 0.4.
Confidence: The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 ≈ 0.67.
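To make the arithmetic concrete, here is a small Python sketch (not part of the original notes) that computes these quantities. The transaction table referenced above is not reproduced here, so the five transactions below are assumed; they are consistent with the support and confidence values quoted (2/5 and 2/3).

    # A minimal sketch of support and confidence for the rule {Milk, Diapers} -> {Beer},
    # assuming the five-transaction market basket data of the running example.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Cola"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Cola"},
    ]

    def support_count(itemset, transactions):
        """sigma(X): number of transactions that contain every item of X."""
        return sum(1 for t in transactions if itemset <= t)

    N = len(transactions)
    X, Y = {"Milk", "Diapers"}, {"Beer"}   # rule X -> Y

    support = support_count(X | Y, transactions) / N                                   # 2/5 = 0.40
    confidence = support_count(X | Y, transactions) / support_count(X, transactions)   # 2/3 = 0.67

    print(f"support = {support:.2f}, confidence = {confidence:.2f}")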
Why Use Support and Confidence?
Support: Support is an important measure because a rule that has very low support might
occur simply by chance. Also, from a business perspective a low support rule is unlikely to be
interesting because it might not be profitable to promote items that customers seldom buy
together. For these reasons, we are interested in finding rules whose support is greater than some user-defined threshold. Support also has a desirable property that can be exploited for the efficient discovery of association rules.
Confidence: Confidence measures the reliability of the inference made by a rule. For a given
rule X −→ Y , the higher the confidence, the more likely it is for Y to be present in
transactions that contain X. Confidence also provides an estimate of the conditional
probability of Y given X.

Formulation of the Association Rule Mining Problem: The association rule mining problem
can be formally stated as follows:
Definition (Association Rule Discovery): Given a set of transactions T, find all the rules having
support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the
corresponding support and confidence thresholds.
Brute-force approach: A brute-force approach for mining association rules is to compute
the support and confidence for every possible rule. This approach is prohibitively expensive
because there are exponentially many rules that can be extracted from a data set. More
specifically, assuming that neither the left- nor the right-hand side of the rule is an empty set, the total number of possible rules, R, extracted from a data set that contains d items is:
R = 3^d − 2^(d+1) + 1.
For the example data set with d = 6 items, this approach requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules.
More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%,
thus wasting most of the computations. To avoid performing needless computations, it
would be useful to prune the rules early without having to compute their support and
confidence values.
An initial step toward improving the performance of association rule mining algorithms is to decouple the support and confidence requirements. From the above equation, notice that the support of a rule X → Y is the same as the support of its corresponding itemset, X ∪ Y.

For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers}−→{Milk},
{Beer, Milk}−→{Diapers},
{Diapers, Milk}−→{Beer},
{Beer}−→{Diapers, Milk},
{Milk}−→{Beer, Diapers},
{Diapers}−→{Beer, Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned immediately without
our having to compute their confidence values.
Therefore, a common strategy adopted by many association rule mining algorithms is to
decompose the problem into two major subtasks:
1. Frequent Itemset Generation: whose objective is to find all the item sets that satisfy
the minsup threshold.

2. Rule Generation: whose objective is to extract all the high confidence rules from the
frequent item sets found in the previous step. These rules are called strong rules.

Frequent Itemset Generation


A lattice structure can be used to enumerate the list of all possible itemsets. Figure shows
an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that need to be explored is exponentially large.
A brute-force approach for finding frequent item sets is to determine the support count for
every candidate itemset in the lattice structure. To do this, we need to compare each
candidate against every transaction, an operation that is shown in Figure.

If the candidate is contained in a transaction, its support count will be incremented. For
example, the support for {Bread, Milk} is incremented three times because the itemset is
contained in transactions 1, 4, and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, M is the number of candidate itemsets, and w is the maximum transaction width (the number of items present in a transaction).

Reducing the computational complexity of frequent itemset generation:


There are three main approaches for reducing the computational complexity of frequent
itemset generation.
1. Reduce the number of candidate item sets (M): The Apriori principle is an effective way
to eliminate some of the candidate item sets without counting their support values.
2. Reduce the number of comparisons: Instead of matching each candidate itemset against
every transaction, we can reduce the number of comparisons by using more advanced data
structures, either to store the candidate item sets or to compress the data set.
3. Reduce the number of transactions (N): As the size of candidate item sets increases,
fewer transactions will be supported by the item sets.
The Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent.


Suppose {c,d,e} is a frequent itemset. Clearly, any transaction that contains {c,d,e} must also
contain its subsets, {c,d}, {c,e}, {d,e}, {c}, {d}, and{e}. As a result, if {c,d,e} is frequent, then all
subsets of {c,d,e} (i.e., the shaded itemsets in this figure) must also be frequent. Conversely,
if an itemset such as{a,b}is infrequent, then all of its supersets must be infrequent too.
Support Based Pruning: The entire subgraph containing the supersets of {a,b} can be
pruned immediately once {a,b} is found to be infrequent. This strategy of trimming the
exponential search space based on the support measure is known as support-based pruning.
Such a pruning strategy is made possible by a key property of the support measure, namely,
that the support for an itemset never exceeds the support for its subsets. This property is
also known as the anti-monotone property of the support measure.

Anti-monotone Property: A measure f possesses the anti-monotone property if for every itemset X that is a proper subset of itemset Y, i.e., X ⊂ Y, we have f(Y) ≤ f(X).
Frequent ItemSet Generation Of The Apriori Algorithm

The Pros and Cons of Apriori Algorithm:


The pros of Apriori are as follows:

 This is the simplest and easiest-to-understand algorithm among association rule learning algorithms.
 The resulting rules are intuitive and easy to communicate to an end user.
 It doesn't require labelled data as it is fully unsupervised; as a result, you can use it
in many different situations because unlabelled data is often more accessible.
 Many extensions were proposed for different use cases based on this
implementation—for example, there are association learning algorithms that take
into account the ordering of items, their number, and associated timestamps.
 The algorithm is exhaustive, so it finds all the rules with the specified support and
confidence.
 The Apriori Algorithm calculates more sets of frequent items.
The cons of Apriori are as follows:
 If the data set is small, the algorithm can find many false associations that
happened simply by chance.
 You can address this issue by evaluating the obtained rules on hold-out test data for support, confidence, lift, and conviction.
 The candidate generation could be extremely slow (pairs, triplets, etc.).
 The candidate generation could generate duplicates depending on the
implementation.
 The counting method iterates through all of the transactions each time.
 Constant items make the algorithm a lot heavier.
 Huge memory consumption.

Example:

Assumption: We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.
Initially, every item is considered as a candidate 1-itemset. After counting their supports, the
candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three
transactions.
In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets
because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be
infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is 4C2 = 6. Two of these six candidates, {Beer, Bread}
and {Beer, Milk}, are subsequently found to be infrequent after computing their support
values. The remaining four candidates are frequent, and thus will be used to generate
candidate 3-itemsets.
Without support-based pruning, there are 6C3 = 20 candidate 3-itemsets that can be
formed using the six items given in this example.
With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are
frequent. The only candidate that has this property is {Bread, Diapers, Milk}.
However, even though the subsets of{Bread, Diapers, Milk} are frequent, the itemset itself is
not. The effectiveness of the Apriori pruning strategy can be shown by counting the number
of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
candidates. With the Apriori principle, this number decreases to 6C1 + 4C2 + 1 = 6 + 6 + 1 = 13 candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.

The pseudocode for the frequent itemset generation part of the Apriori algorithm is outlined below.
Let Ck denote the set of candidate k-item sets and Fk denote the set of frequent k-item
sets:
• The algorithm initially makes a single pass over the data set to determine the support of
each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, will be
known (steps 1 and 2).
• Next, the algorithm will iteratively generate new candidate k-item sets and prune
unnecessary candidates that are guaranteed to be infrequent given the frequent (k−1)-item
sets found in the previous iteration (steps 5 and 6).
• To count the support of the candidates, the algorithm needs to make an additional pass
over the data set (steps 7–12). The subset function is used to determine all the candidate
item sets in Ck that are contained in each transaction t.
• After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than N × minsup (step 13).
• The algorithm terminates when there are no new frequent itemsets generated, i.e., Fk = ∅ (step 14).
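A minimal Python sketch of the level-wise loop described above (an illustrative implementation, not the textbook's pseudocode). It assumes transactions are given as sets of items and uses the Fk−1 × Fk−1 merge discussed later for candidate generation.

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, minsup_count):
        """Level-wise (generate-and-test) frequent itemset generation sketch."""
        # Pass 1: count individual items to obtain F1.
        items = sorted({i for t in transactions for i in t})
        counts = {(i,): sum(1 for t in transactions if i in t) for i in items}
        Fk = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent = dict(Fk)
        k = 2
        while Fk:
            # Candidate generation: Fk-1 x Fk-1 merge (first k-2 items identical).
            prev = sorted(Fk)
            Ck = set()
            for a, b in combinations(prev, 2):
                if a[:-1] == b[:-1]:
                    cand = tuple(sorted(set(a) | set(b)))
                    # Candidate pruning: every (k-1)-subset must be frequent.
                    if all(s in Fk for s in combinations(cand, k - 1)):
                        Ck.add(cand)
            # Support counting: one additional pass over the data set.
            counts = {c: sum(1 for t in transactions if set(c) <= t) for c in Ck}
            Fk = {c: n for c, n in counts.items() if n >= minsup_count}
            frequent.update(Fk)
            k += 1
        return frequent

With the assumed five transactions from the earlier sketch and minsup_count = 3, this reproduces the worked example: Cola and Eggs are dropped, four 2-itemsets survive, and the only candidate 3-itemset, {Bread, Diapers, Milk}, turns out to be infrequent.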

Characteristics of Frequent Itemset Generation in the Apriori Algorithm:


The frequent itemset generation part of the Apriori algorithm has two important
characteristics.
(1) It is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time,
from frequent 1-itemsets to the maximum size of frequent item sets.
(2) It employs a generate-and-test strategy for finding frequent item sets. At each
iteration (level), new candidate item sets are generated from the frequent item sets
found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold. The total number of iterations needed by the algorithm is kmax + 1, where kmax is the maximum size of the frequent itemsets.

Candidate Generation and Pruning


1. Candidate Generation: This operation generates new candidate k-item sets based on
the frequent (k−1)-item sets found in the previous iteration.

2. Candidate Pruning: This operation eliminates some of the candidate k-item sets
using support-based pruning, i.e. by removing k-item sets whose subsets are known
to be infrequent in previous iterations. Note that this pruning is done without
computing the actual support of these k-item sets (which could have required
comparing them against each transaction).

An effective candidate generation procedure must be complete and non-redundant.


 A candidate generation procedure is said to be complete if it does not omit any frequent itemsets. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e., ∀k : Fk ⊆ Ck.
 A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once. For example, the candidate itemset {a,b,c,d} can be generated in many ways: by merging {a,b,c} with {d}, {b,d} with {a,c}, {c} with {a,b,d}, etc.
Generation of duplicate candidates leads to wasted computations and thus should be
avoided for efficiency reasons. Also, an effective candidate generation procedure should
avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at
least one of its subsets is infrequent, and thus, eliminated in the candidate pruning step.

Candidate Generation Procedures:


(1) Brute-Force Method: The brute-force method considers every k-itemset as a
potential candidate and then applies the candidate pruning step to remove any
unnecessary candidates whose subsets are infrequent. The number of candidate itemsets generated at level k is equal to dCk, where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined.
(2) Fk−1 × F1 Method: An alternative method for candidate generation is to extend each
frequent (k − 1)-itemset with frequent items that are not part of the (k−1)-itemset.
Figure illustrates how a frequent 2-itemset such as {Beer, Diapers} can be
augmented with a frequent item such as Bread to produce a candidate 3-itemset
{Beer, Diapers, Bread}. The procedure is complete because every frequent k-itemset
is composed of a frequent (k−1)-itemset and a frequent 1-itemset. Therefore, all
frequent k-item sets are part of the candidate k-item sets generated by this
procedure.
Figure shows that the Fk−1 ×F1 candidate generation method only produces four
candidate 3-itemsets, instead of the 20 item sets produced by the brute-
force method. The Fk−1 ×F1 method generates lower number of candidates because
every candidate is guaranteed to contain at least one frequent (k −1)-itemset. While
this procedure is a substantial improvement over the brute-force method, it can still
produce a large number of unnecessary candidates, as the remaining subsets of a
candidate itemset can still be infrequent.
Disadvantage: The approach discussed above does not prevent the same candidate itemset
from being generated more than once.
Example: For instance, {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers}
with {Milk}, {Bread, Milk} with {Diapers}, or{Diapers, Milk} with {Bread}.
One way to avoid generating duplicate candidates is by ensuring that the items in each
frequent itemset are kept sorted in their lexicographic order.
Example: Itemsets such as {Bread, Diapers}, {Bread, Diapers, Milk}, and{Diapers, Milk}follow
lexicographic order as the items within every itemset are arranged alphabetically.
Each frequent (k−1)-itemset X is then extended with frequent items that are
lexicographically larger than the items in X.
Example: The itemset {Bread, Diapers} can be augmented with {Milk} because Milk is
lexicographically larger than Bread and Diapers. However, we should not augment {Diapers,
Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic
ordering condition.
Every candidate k-itemset is thus generated exactly once, by merging the lexicographically
largest item with the remaining k−1 items in the itemset.
If the Fk−1 × F1 method is used in conjunction with lexicographic ordering, then only two candidate 3-itemsets will be produced in the example illustrated in the figure. {Beer, Bread, Diapers} and {Beer, Bread, Milk} will not be generated because {Beer, Bread} is not a frequent 2-itemset.

Fk−1×Fk−1 Method: This candidate generation procedure, which is used in the candidate-
gen function of the Apriori algorithm, merges a pair of frequent (k−1)-item sets only if their
first k−2 items, arranged in lexicographic order, are identical. Let A = {a1,a2,...,ak−1} and B =
{b1,b2,...,bk−1} be a pair of frequent (k−1)-item sets, arranged lexicographically. A and B are
merged if they satisfy the following condition: ai = bi (for i = 1, 2, ..., k−2). Note that in this case, ak−1 ≠ bk−1 because A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B consists of the first k−2 common items followed by ak−1 and bk−1 in lexicographic order. This candidate generation procedure is complete, because for every lexicographically ordered frequent k-itemset, there exist two lexicographically ordered frequent (k−1)-itemsets that have identical items in the first k−2 positions.

In Figure, the frequent item sets {Bread, Diapers} and {Bread, Milk} are merged to form a
candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge{Beer,
Diapers} with{Diapers, Milk}because the first item in both item sets is different. Indeed, if
{Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer,
Diapers} with {Beer, Milk} instead.
This example illustrates both the completeness of the candidate generation procedure and
the advantages of using lexicographic ordering to prevent duplicate candidates. Also, if we
order the frequent (k−1) item sets according to their lexicographic rank, item sets with
identical first k − 2 items would take consecutive ranks. As a result, the Fk−1 × Fk−1
candidate generation method would consider merging a frequent itemset only with ones
that occupy the next few ranks in the sorted list, thus saving some computations.
Figure shows that the Fk−1 × Fk−1 candidate generation procedure results in only one
candidate 3-itemset. This is a considerable reduction from the four candidate 3-itemsets
generated by the Fk−1 × F1 method. This is because the Fk−1 × Fk−1 method ensures that
every candidate k-itemset contains at least two frequent (k − 1)-item sets, thus greatly
reducing the number of candidates that are generated in this step.
Note that there can be multiple ways of merging two frequent (k − 1)item sets in the Fk−1
×Fk−1 procedure, one of which is merging if their first k−2 items are identical.
An alternate approach could be to merge two frequent (k−1)-itemsets A and B if the last k−2 items of A are identical to the first k−2 items of B.
Example: {Bread, Diapers} and {Diapers, Milk} could be merged using this approach to generate the candidate 3-itemset {Bread, Diapers, Milk}.
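A small sketch of the Fk−1 × Fk−1 merge condition (illustrative; the function name is made up, not from the textbook): two lexicographically sorted frequent (k−1)-itemsets are merged only if they agree on their first k−2 items.

    def merge_fk1_fk1(a, b):
        """Return the candidate k-itemset, or None if a and b cannot be merged."""
        if a[:-1] == b[:-1] and a[-1] != b[-1]:
            return a[:-1] + tuple(sorted((a[-1], b[-1])))
        return None

    print(merge_fk1_fk1(("Bread", "Diapers"), ("Bread", "Milk")))
    # ('Bread', 'Diapers', 'Milk')
    print(merge_fk1_fk1(("Beer", "Diapers"), ("Diapers", "Milk")))
    # None, because the first items differ, so the pair is not merged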
Candidate Pruning:
To illustrate the candidate pruning operation for a candidate k-itemset, X = {i1, i2, ..., ik}, consider its k proper subsets, X − {ij} (∀j = 1, 2, ..., k). If any of them are infrequent, then X is immediately pruned by using the Apriori principle.
Note that we don’t need to explicitly ensure that all subsets of X of size less than k−1 are
frequent. This approach greatly reduces the number of candidate item sets considered
during support counting.
Brute-force: For the brute-force candidate generation method, candidate pruning requires
checking only k subsets of size k −1 for each candidate k-itemset.
Fk−1×F1 : since the Fk−1×F1 candidate generation strategy ensures that at least one of the
(k−1)-size subsets of every candidate k-itemset is frequent, we only need to check for the
remaining k−1 subsets.
Fk−1 ×Fk−1 strategy :The Fk−1 ×Fk−1 strategy requires examining only k−2 subsets of every
candidate k-itemset, since two of its (k −1)-size subsets are already known to be frequent in
the candidate generation step.
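A short sketch of the pruning check (illustrative names): a candidate is kept only if every one of its (k−1)-subsets appears among the frequent (k−1)-itemsets.

    from itertools import combinations

    def prune(candidates, frequent_k_minus_1):
        """Keep candidates whose (k-1)-subsets are all frequent (Apriori principle)."""
        kept = []
        for cand in candidates:
            k = len(cand)
            if all(sub in frequent_k_minus_1 for sub in combinations(cand, k - 1)):
                kept.append(cand)
        return kept

    F2 = {("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")}
    print(prune([("Bread", "Diapers", "Milk")], F2))   # kept: all 2-subsets are frequent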

Support Counting:
Support counting is the process of determining the frequency of occurrence for every
candidate itemset that survives the candidate pruning step.
Support counting is implemented in steps 6 through 11 of Algorithm.
A brute-force approach for doing this is to compare each transaction against every candidate itemset and to update the support counts of candidates contained in a transaction. This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.
An alternative approach is to enumerate the item sets contained in each transaction and use
them to update the support counts of their respective candidate item sets. To illustrate,

consider a transaction t that contains five items, {1,2,3,5,6}. There are 5C3 = 10 itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate
3-itemsets under investigation, in which case, their support counts are incremented. Other
subsets of t that do not correspond to any candidates can be ignored.
Figure shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that
each itemset keeps its items in increasing lexicographic order, an itemset can be
enumerated by specifying the smallest item first, followed by the larger items.
For instance, given t = {1,2,3,5,6}, all the 3-item sets contained in t must begin with item 1,
2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there
are only two items in t whose labels are greater than or equal to 5. The number of ways to
specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix tree
structure depicted in Figure.
For instance, 1 2356 represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2,3,5,6}. After fixing the first item, the prefix tree structure at Level 2 represents the number of ways to select the second item. For example, 1 2 356 corresponds to itemsets that begin with the prefix (1 2) and are followed by the items 3, 5, or 6.
Finally, the prefix tree structure at Level 3 represents the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1,2,3}, {1,2,5}, and {1,2,6}, while those that begin with prefix {2 3} are {2,3,5} and {2,3,6}.
The prefix tree structure shown in Figure demonstrates how item sets contained in a
transaction can be systematically enumerated, i.e., by specifying their items one by one,
from the leftmost item to the rightmost item. We still have to determine whether each
enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the
candidates, then the support count of the corresponding candidate is incremented.
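The enumeration above corresponds directly to generating the 3-subsets of t in lexicographic order and matching them against the candidates. A sketch, with a hypothetical candidate set (the candidate 3-itemsets below are made up for illustration):

    from itertools import combinations
    from collections import Counter

    t = (1, 2, 3, 5, 6)
    # Hypothetical candidate 3-itemsets; only the first two are contained in t.
    candidate_counts = Counter({(1, 2, 3): 0, (2, 3, 5): 0, (3, 4, 5): 0})

    for subset in combinations(t, 3):          # enumerates the 5C3 = 10 subsets
        if subset in candidate_counts:
            candidate_counts[subset] += 1      # support count incremented on a match

    print(list(combinations(t, 3)))            # (1,2,3), (1,2,5), ..., (3,5,6)
    print(candidate_counts)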
Apriori Property –
All non-empty subsets of a frequent itemset must also be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.

Assume the minimum support count is 2 and the minimum confidence is 60%.

Step-1: K=1
(I) Create a table containing support count of each item present in dataset – Called C1(candidate
set)

(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if the support count of a candidate is less than min_support, remove it. This gives us itemset L1.

Step-2: K=2
 Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (k-2) elements in common.
 Check whether all subsets of an itemset are frequent or not and, if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2); if the support count of a candidate is less than min_support, remove it. This gives us itemset L2.

Step-3:
o Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (k-2) elements in common. So here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.
o Check whether all subsets of these itemsets are frequent or not and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
o Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2); if the support count of a candidate is less than min_support, remove it. This gives us itemset L3.

Step-4:
o Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (k = 4) is that they should have (k-2) elements in common. So here, for L3, the first 2 elements (items) should match.
o Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
o We stop here because no further frequent itemsets are found.

Components of Apriori algorithm


The Apriori algorithm involves the following three components.

1. Support
2. Confidence
3. Lift
Out of 4000 transactions, 400 contain biscuits and 600 contain chocolates, and 200 transactions include both biscuits and chocolates. Using this data, we will find the support, confidence, and lift.

Support
Support refers to the default popularity of any product. It is computed by dividing the number of transactions containing that product by the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)
= 400/4000 = 10 percent.

Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought chocolates. So, you need to divide the number of transactions that contain both biscuits and chocolates by the number of transactions involving biscuits to get the confidence.
Hence,
Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought chocolates also.

Lift
Consider the above example; lift refers to the increase in the ratio of the sale of chocolates when you sell biscuits. It is calculated as the confidence of the rule divided by the support of its consequent:
Lift = Confidence(Biscuits → Chocolates) / Support(Chocolates)
= 50% / 15% ≈ 3.33
It means that people are about 3.3 times more likely to buy chocolates when they buy biscuits than they are to buy chocolates in general. If the lift value is below one, the two items are unlikely to be bought together. The larger the value, the better the combination.
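The same calculation as a short sketch; the counts are those of the example, and the lift divides confidence by the support of the consequent (chocolates):

    total = 4000
    biscuits = 400
    chocolates = 600
    both = 200

    support_biscuits = biscuits / total            # 0.10 -> 10 percent
    confidence = both / biscuits                   # 0.50 -> 50 percent
    lift = confidence / (chocolates / total)       # 0.50 / 0.15 = 3.33

    print(support_biscuits, confidence, round(lift, 2))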

Support Counting Using a Hash Tree:


In the Apriori algorithm, candidate item sets are partitioned into different buckets and
stored in a hash tree. During support counting, item sets contained in each transaction are
also hashed into their appropriate buckets. That way, instead of comparing each itemset in
the transaction with every candidate itemset, it is matched only against candidate item sets
that belong to the same bucket, as shown in Figure .

Example of a hash tree structure: Each internal node of the tree uses the following hash
function, h(p) = (p − 1) mod 3, where mod refers to the modulo (remainder) operator, to
determine which branch of the current node should be followed next. For example, items 1,
4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the
same remainder after dividing the number by 3. All candidate item sets are stored at the
leaf nodes of the hash tree. The hash tree shown in Figure 5.11 contains 15 candidate 3-
itemsets, distributed across 9 leaf nodes.
Consider the transaction, t = {1,2,3,5,6}. To update the support counts of the candidate item
sets, the hash tree must be traversed in such a way that all the leaf nodes containing
candidate 3-itemsets belonging to t must be visited at least once. Recall that the 3-itemsets
contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix tree
structure shown in Figure. Therefore, at the root node of the hash tree, the items 1, 2, and 3
of the transaction are hashed separately. Item 1 is hashed to the left child of the root node,
item 2 is hashed to the middle child, and item 3 is hashed to the right child. At the next level
of the tree, the transaction is hashed on the second item listed in the Level 2 tree structure
shown in Figure.
For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction
are hashed. Based on the hash function, items 2 and 5 are hashed to the middle child, while
item 3 is hashed to the right child, as shown in Figure. This process continues until the leaf
nodes of the hash tree are reached. The candidate item sets stored at the visited leaf nodes
are compared against the transaction. If a candidate is a subset of the transaction, its
support count is incremented. Note that not all the leaf nodes are visited while traversing
the hash tree, which helps in reducing the computational cost. In this example, 5 out of the
9 leaf nodes are visited and 9 out of the 15 item sets are compared against the transaction.
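A tiny sketch of the hash function used in this example, showing how items are assigned to the three branches:

    # h(p) = (p - 1) mod 3 decides which of the three branches to follow.
    def h(p):
        return (p - 1) % 3

    for item in [1, 2, 3, 4, 5, 6, 7]:
        print(item, "-> branch", h(item))
    # Items 1, 4 and 7 hash to branch 0 (the leftmost branch),
    # items 2 and 5 to branch 1 (middle), items 3 and 6 to branch 2 (right).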

Computational Complexity


The computational complexity of the Apriori algorithm, which includes both its runtime and
storage, can be affected by the following factors.
Support Threshold: Lowering the support threshold often results in more item sets being
declared as frequent. This has an adverse effect on the computational complexity of the
algorithm because more candidate item sets must be generated and counted at every level,
as shown in Figure. The maximum size of frequent item sets also tends to increase with
lower support thresholds. This increases the total number of iterations to be performed by
the Apriori algorithm, further increasing the computational cost.
Number of Items (Dimensionality): As the number of items increases, more space will be needed to store the support counts of items. If the number of frequent items also grows with the dimensionality of the data, the runtime and storage requirements of the algorithm increase because a larger number of candidate itemsets must be generated and counted.
Number of Transactions: Because the Apriori algorithm makes repeated passes over the
transaction data set, its run time increases with a larger number of transactions.
Average Transaction Width: For dense data sets, the average transaction width can be very
large. This affects the complexity of the Apriori algorithm in two ways.
(1) The maximum size of frequent item sets tends to increase as the average transaction
width increases. As a result, more candidate item sets must be examined during candidate
generation and support counting, as illustrated in Figure.
(2) As the transaction width increases, more itemsets are contained in each transaction. This will increase the number of hash tree traversals performed during support counting.

Candidate generation: To generate candidate k-itemsets, pairs of frequent (k−1)-itemsets are merged to determine whether they have at least k−2 items in common. Each merging operation requires at most k−2 equality comparisons. Every merging step can produce at most one viable candidate k-itemset, while in the worst case, the algorithm must try to merge every pair of frequent (k−1)-itemsets found in the previous iteration. Therefore, the overall cost of merging frequent itemsets lies between
Σ_{k=2..w} (k−2)|Ck|  and  Σ_{k=2..w} (k−2)|Fk−1|^2,
where w is the maximum transaction width.
A hash tree is also constructed during candidate generation to store the candidate itemsets. Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate itemsets is O(Σk k|Ck|). During candidate pruning, we need to verify that the k−2 subsets of every candidate k-itemset are frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning step requires O(Σk k(k−2)|Ck|) time.

Rule Generation:
Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents (∅ → Y or Y → ∅). An association rule can be
extracted by partitioning the itemset Y into two non-empty subsets, X and Y −X, such that X
−→ Y −X satisfies the confidence threshold. Note that all such rules must have already met
the support threshold because they are generated from a frequent itemset.
Example: Let X = {a,b,c} be a frequent itemset. There are six candidate association rules that can be generated from X: {a,b} → {c}, {a,c} → {b}, {b,c} → {a}, {a} → {b,c}, {b} → {a,c}, and {c} → {a,b}. As the support of each rule is identical to the support of X, all the rules satisfy the support threshold.
Computing the confidence of an association rule does not require additional scans of the
transaction data set. Consider the rule {1,2} −→ { 3}, which is generated from the frequent
itemset X = {1,2,3}. The confidence for this rule is σ({1,2,3})/σ({1,2}). Because {1,2,3} is
frequent, the anti-monotone property of support ensures that {1,2} must be frequent, too.
Since the support counts for both item sets were already found during frequent itemset
generation, there is no need to read the entire data set again.

Confidence-Based Pruning:
Confidence does not show the anti-monotone property in the same way as the support measure. For example, the confidence for X → Y can be larger, smaller, or equal to the confidence for another rule X' → Y', where X' ⊆ X and Y' ⊆ Y. Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.
Theorem: If a rule X → Y − X does not satisfy the confidence threshold, then any rule X' → Y − X', where X' is a subset of X, must not satisfy the confidence threshold either.
Rule Generation in the Apriori Algorithm:

The Apriori algorithm uses a level-wise approach for generating association rules, where
each level corresponds to the number of items that belong to the rule consequent. Initially,
all the high confidence rules that have only one item in the rule consequent are extracted.
These rules are then used to generate new candidate rules. For example, if {acd} → {b} and {abd} → {c} are high confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules. Figure shows a lattice structure for the association rules generated from the frequent itemset {a,b,c,d}. If any node in the lattice has low confidence, then according to the theorem above, the entire subgraph spanned by the node can be pruned immediately. Suppose the confidence for {bcd} → {a} is low. All the rules containing item a in their consequent, including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} → {abc}, can be discarded.
Procedure ap-genrules(fk, Hm):
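The body of the ap-genrules procedure is not reproduced in these notes; the following is a hedged Python sketch of the level-wise idea described above (the names and the simple consequent-merge condition are simplifications, not the textbook's pseudocode). It assumes a dictionary `support` that maps every frequent itemset, stored as a frozenset, to the support count collected during frequent itemset generation.

    from itertools import combinations

    def ap_genrules(freq_itemset, support, minconf):
        """Generate high-confidence rules from one frequent itemset, level by level."""
        itemset = frozenset(freq_itemset)
        rules = []
        # Level 1: rules with a single item in the consequent.
        consequents = [frozenset([i]) for i in itemset]
        while consequents and len(consequents[0]) < len(itemset):
            next_level = []
            for Y in consequents:
                X = itemset - Y
                conf = support[itemset] / support[X]   # no extra data scan needed
                if conf >= minconf:
                    rules.append((set(X), set(Y), conf))
                    next_level.append(Y)
            # Only consequents of surviving rules are merged into larger consequents.
            consequents = merge_consequents(next_level)
        return rules

    def merge_consequents(consequents):
        """Merge two consequents that differ in exactly one item."""
        merged = set()
        for a, b in combinations(consequents, 2):
            union = a | b
            if len(union) == len(a) + 1:
                merged.add(frozenset(union))
        return list(merged)

    # Example usage with hypothetical support counts for {a, b, c} and its subsets:
    support = {frozenset(s): c for s, c in [
        (("a",), 3), (("b",), 3), (("c",), 3),
        (("a", "b"), 2), (("a", "c"), 2), (("b", "c"), 3), (("a", "b", "c"), 2)]}
    print(ap_genrules(("a", "b", "c"), support, minconf=0.75))
    # Only the two rules with confidence 1.0 survive; because {b,c} -> {a} has low
    # confidence, no larger consequent containing a is ever generated.

Because only consequents that produced high-confidence rules are merged into the next level, low-confidence subtrees of the rule lattice are never visited, which is exactly the confidence-based pruning stated in the theorem above.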

Compact Representation of Frequent Itemsets:


The number of frequent item sets produced from a transaction data set can be very large. It
is useful to identify a small representative set of frequent item sets from which all other
frequent item sets can be derived. Two such representations are presented in this section in
the form of maximal and closed frequent item sets.
Maximal Frequent Itemsets
The number of frequent itemsets generated by the Apriori algorithm can often be very large, so it is
beneficial to identify a small representative set from which every frequent itemset can be derived.
One such approach is using maximal frequent itemsets.
A maximal frequent itemset is a frequent itemset for which none of its immediate supersets are
frequent. To illustrate this concept, consider the example given below:
The support counts are shown on the top left of each node. Assume the support threshold is 50%, that is, each itemset must occur in 2 or more transactions. Based on that threshold, the frequent itemsets are: a, b, c, d, ab, ac and ad (shaded nodes).
Out of these 7 frequent itemsets, 3 are identified as maximal frequent (having red outline):
 ab: Immediate supersets abc and abd are infrequent.
 ac: Immediate supersets abc and acd are infrequent.
 ad: Immediate supersets abd and acd are infrequent.
The remaining 4 frequent nodes (a, b, c and d) cannot be maximal frequent because they all have
at least 1 immediate superset that is frequent.
Advantage:
Maximal frequent itemsets provide a compact representation of all the frequent itemsets for a
particular dataset. In the above example, all frequent itemsets are subsets of the maximal frequent
itemsets, since we can obtain sets a, b, c, d by enumerating subsets of ab, ac and ad (including the
maximal frequent itemsets themselves).
Disadvantage:
The support count of maximal frequent itemsets does not provide any information about the support
count of their subsets. This means that an additional traversal of data is needed to determine
support count for non-maximal frequent itemsets, which may be undesirable in certain cases.

Definition (Maximal Frequent Itemset): A frequent itemset is maximal if none of its


immediate supersets are frequent.
The item sets in the lattice are divided into two groups: those that are frequent and those
that are infrequent. A frequent itemset border, which is represented by a dashed line, is
also illustrated in the diagram. Every itemset located above the border is frequent, while
those located below the border (the shaded nodes) are infrequent. Among the item sets
residing near the border, {a,d}, {a,c,e}, and {b,c,d,e} are maximal frequent item sets because
all of their immediate supersets are infrequent. For example, the itemset {a,d} is maximal
frequent because all of its immediate supersets, {a,b,d}, {a,c,d}, and {a,d,e}, are infrequent.
In contrast, {a,c} is non-maximal because one of its immediate supersets, {a,c,e}, is frequent.
Maximal frequent itemsets effectively provide a compact representation of frequent
itemsets. In other words, they form the smallest set of itemsets from which all frequent
itemsets can be derived. For example, every frequent itemset in Figure is a subset of one of
the three maximal frequent itemsets, {a,d}, {a,c,e}, and{b,c,d,e}. If an itemset is not a proper
subset of any of the maximal frequent itemsets, then it is either infrequent (e.g., {a,d,e}) or
maximal frequent itself (e.g., {b,c,d,e}). Hence, the maximal frequent itemsets {a,c,e}, {a,d},
and{b,c,d,e} provide a compact representation of the frequent itemsets shown in Figure.
Enumerating all the subsets of maximal frequent itemsets generates the complete list of all
frequent itemsets.
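A small sketch illustrating this derivation, using the maximal frequent itemsets from the figure discussed above (their supports, however, cannot be recovered this way):

    from itertools import combinations

    def all_frequent_from_maximal(maximal_itemsets):
        """Enumerate every non-empty subset of each maximal frequent itemset."""
        frequent = set()
        for m in maximal_itemsets:
            m = tuple(sorted(m))
            for r in range(1, len(m) + 1):
                frequent.update(combinations(m, r))
        return frequent

    maximal = [{"a", "d"}, {"a", "c", "e"}, {"b", "c", "d", "e"}]
    print(sorted(all_frequent_from_maximal(maximal), key=lambda s: (len(s), s)))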

Closed Itemsets:
Closed itemsets provide a minimal representation of all itemsets without losing their
support information. A formal definition of a closed itemset is presented below.
Definition(Closed Itemset): An itemset X is closed if none of its immediate supersets has
exactly the same support count as X.
Put another way, X is not closed if at least one of its immediate supersets has the same
support count as X. Examples of closed item sets are shown in Figure. To better illustrate the
support count of each itemset, we have associated each node (itemset) in the lattice with a
list of its corresponding transaction IDs.
For example, since the node {b,c} is associated with transaction IDs 1, 2, and 3, its support
count is equal to three. From the transactions given in this diagram, notice that the support
for{b}is identical to{b,c}. This is because every transaction that contains b also contains c.
Hence, {b} is not a closed itemset. Similarly, since c occurs in every transaction that contains
both a and d, the itemset{a,d} is not closed as it has the same support as its superset {a,c,d}.
On the other hand, {b,c} is a closed itemset because it does not have the same support
count as any of its supersets. An interesting property of closed item sets is that if we know
their support counts, we can derive the support count of every other itemset in the itemset
lattice without making additional passes over the data set. For example, consider the 2-
itemset {b,e} in Figure. Since {b,e} is not closed, its support must be equal to the support of
one of its immediate supersets, {a,b,e}, {b,c,e}, and {b,d,e}. Further, none of the supersets
of{b,e} can have a support greater than the support of {b,e}, due to the anti-monotone
nature of the support measure. Hence, the support of {b,e} can be computed by examining
the support counts of all of its immediate supersets of size three and taking their maximum
value. If an immediate superset is closed (e.g., {b,c,e}), we would know its support count.
Otherwise, we can recursively compute its support by examining the supports of its
immediate supersets of size four. In general, the support count of any non-closed (k−1)-
itemset can be determined as long as we know the support counts of all k-item sets. Hence,
one can devise an iterative algorithm that computes the support counts of item sets at level
k−1 using the support counts of item sets at level k, starting from the level kmax, where
kmax is the size of the largest closed itemset. Even though closed item sets provide a
compact representation of the support counts of all item sets, they can still be exponentially
large in number. Moreover, for most practical applications, we only need to determine the
support count of all frequent itemsets. In this regard, closed frequent itemsets provide a
compact representation of the support counts of all frequent itemsets, which can be
defined as follows.
Definition (Closed Frequent Itemset): An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
We can use closed frequent item sets to determine the support counts for all non-closed
frequent item sets.
For example, consider the frequent itemset {a,d} shown in Figure. Because this itemset is
not closed, its support count must be equal to the maximum support count of its immediate
supersets, {a,b,d}, {a,c,d}, and{a,d,e}. Also, since{a,d} is frequent, we only need to consider
the support of its frequent supersets. In general, the support count of every non-closed
frequent k-itemset can be obtained by considering the support of all its frequent supersets
of size k + 1.
For example, since the only frequent superset of {a,d} is {a,c,d}, its support is equal to the
support of {a,c,d}, which is 2. Using this methodology, an algorithm can be developed to
compute the support for every frequent itemset. The pseudocode for this algorithm is
shown in Algorithm. The algorithm proceeds in a specific-to-general fashion, i.e., from the
largest to the smallest frequent item sets. This is because, in order to find the support for a
non-closed frequent itemset, the support for all of its supersets must be known. Note that
the set of all frequent item sets can be easily computed by taking the union of all subsets of
frequent closed item sets.
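A sketch of this specific-to-general computation, assuming the set of all frequent itemsets and a dictionary of closed frequent itemset supports are available (illustrative names, not the book's pseudocode). Supports are filled in from the largest itemsets downward, taking the maximum support over the frequent supersets of size k + 1.

    def supports_from_closed(closed, all_frequent):
        """closed: dict {frozenset: support count}; all_frequent: iterable of frozensets."""
        support = dict(closed)
        kmax = max(len(s) for s in all_frequent)
        for k in range(kmax - 1, 0, -1):
            for X in (s for s in all_frequent if len(s) == k and s not in support):
                # A non-closed frequent k-itemset has the same support as the most
                # frequent of its frequent (k+1)-supersets.
                supersets = [s for s in all_frequent if len(s) == k + 1 and X < s]
                support[X] = max(support[s] for s in supersets)
        return support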
Advantage of using closed frequent itemsets :
To illustrate the advantage of using closed frequent itemsets, consider the data set shown in
Table which contains ten transactions and fifteen items. The items can be divided into three
groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items
b1 through b5; and (3) Group C, which contains items c1 through c5. Assuming that the
support threshold is 20%, itemsets involving items from the same group are frequent, but
itemsets involving items from different groups are infrequent. The total number of frequent itemsets is thus 3 × (2^5 − 1) = 93. However, there are only four closed frequent itemsets in the data: {a3,a4}, {a1,a2,a3,a4,a5}, {b1,b2,b3,b4,b5}, and {c1,c2,c3,c4,c5}. It is often
sufficient to present only the closed frequent itemsets to the analysts instead of the entire
set of frequent itemsets.

Relationships among frequent, closed, closed frequent, and maximal frequent itemsets:
All maximal frequent item sets are closed because none of the maximal frequent item sets
can have the same support count as their immediate supersets. The relationships among
frequent, closed, closed frequent, and maximal frequent item sets are shown in Figure.
FP-Growth Algorithm:

This algorithm is an improvement to the Apriori method. A frequent pattern is generated without the need for
candidate generation. FP growth algorithm represents the database in the form of a tree called a frequent pattern
tree or FP tree.
This tree structure will maintain the association between the itemsets. The database is fragmented using one
frequent item. This fragmented part is called “pattern fragment”. The itemsets of these fragmented patterns are
analyzed. Thus with this method, the search for frequent itemsets is reduced comparatively.

It encodes the data set using a compact data structure called an FP-tree and extracts
frequent item sets directly from this structure.

FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed by reading the
data set one transaction at a time and mapping each transaction onto a path in the FP-tree.
As different transactions can have several items in common, their paths might overlap. The
more the paths overlap with one another, the more compression we can achieve using the
FP-tree structure. If the size of the FP-tree is small enough to fit into main memory, this will
allow us to extract frequent item sets directly from the structure in memory instead of
making repeated passes over the data stored on disk.

Figure shows a data set that contains ten transactions and five items.

The structures of the FP-tree after reading the first three transactions are also depicted in
the diagram. Each node in the tree contains the label of an item along with a counter that
shows the number of transactions mapped onto the given path. Initially, the FP-tree
contains only the root node represented by the null symbol.
The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent
items are discarded, while the frequent items are sorted in decreasing support counts inside
every transaction of the data set. For the data set shown in Figure 5.24, a is the most
frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP tree. After reading
the first transaction, {a, b}, the nodes labelled as a and b are created. A path is then formed
from null → a → b to encode the transaction. Every node along the path has a frequency
count of 1.
3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c,
and d. A path is then formed to represent the transaction by connecting the nodes null → b
→ c → d. Every node along this path also has a frequency count equal to one. Although the
first two transactions have an item in common, which is b, their paths are disjoint because
the transactions do not share a common prefix.
4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first
transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps
with the path for the first transaction, null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two, while the frequency counts for the newly
created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths
given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the
bottom of Figure.
The size of an FP-tree is typically smaller than the size of the uncompressed data because
many transactions in market basket data often share a few items in common.
In the best-case scenario, where all the transactions have the same set of items, the FP-tree
contains only a single branch of nodes.
The worst case scenario happens when every transaction has a unique set of items. As none
of the transactions have any items in common, the size of the FP-tree is effectively the same
as the size of the original data. However, the physical storage requirement for the FP-tree is
higher because it requires additional space to store pointers between nodes and counters
for each item. The size of an FP-tree also depends on how the items are ordered. The notion
of ordering items in decreasing order of support counts relies on the assumption that high-support items occur more frequently across all paths and hence serve as the most commonly shared prefixes.
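A minimal FP-tree construction sketch following the two passes described above (an illustrative structure; the node-links and header table that FP-growth needs for mining are omitted to keep it short):

    from collections import Counter

    class FPNode:
        def __init__(self, item=None, parent=None):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}          # item label -> FPNode

    def build_fp_tree(transactions, minsup_count):
        # Pass 1: support counts of individual items; infrequent items are dropped.
        counts = Counter(i for t in transactions for i in t)
        frequent = {i for i, c in counts.items() if c >= minsup_count}
        root = FPNode()
        # Pass 2: insert each transaction as a path, items in decreasing support order.
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                node = node.children.setdefault(item, FPNode(item, node))
                node.count += 1
        return root

For example, build_fp_tree(transactions, 2) builds the tree for the transaction list assumed in the earlier sketches with a minimum support count of 2.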
Benefits of the FP-tree Structure:
 Completeness
 Preserve complete information for frequent pattern mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info: infrequent items are gone
 Items in frequency descending order: the more frequently occurring, the more likely to be shared
 Never larger than the original database (not counting node-links and the count fields)
 There exist examples of databases where the compression ratio is over 100

Frequent Itemset Generation in FP-Growth Algorithm


FP-growth is an algorithm that generates frequent item sets from an FP tree by exploring
the tree in a bottom-up fashion. Given the example tree shown in Figure, the algorithm
looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is equivalent to the suffix-based approach. Since every transaction is mapped onto a path in the FP-tree, we can
derive the frequent item sets ending with a particular item, say, e, by examining only the
paths containing node e. These paths can be accessed rapidly using the pointers associated
with node e. The extracted paths are shown in Figure (a). Similar paths for item sets ending
in d, c, b, and a are shown in Figures (b), (c), (d), and (e), respectively.

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a
divide-and-conquer strategy to split the problem into smaller subproblems. For example,
suppose we are interested in finding all frequent itemsets ending in e. To do this, we must
first check whether the itemset {e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems is further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be found. Finally, the set of all frequent itemsets can be generated by merging the solutions to the subproblems of finding frequent itemsets ending in e, d, c, b, and a. This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.
For a more concrete example on how to solve the subproblems, consider the task of finding
frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial paths are called
prefix paths and are shown in Figure(a).
2. From the prefix paths shown in Figure(a), the support count for e is obtained by adding
the support counts associated with node e. Assuming that the minimum support count is 2,
{e} is declared a frequent itemset because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent
item sets ending in de, ce, be, and ae. Before solving these subproblems, it must first
convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree,
except it is used to find frequent item sets ending with a particular suffix.
A conditional FP-tree is obtained in the following way:
(a) First, the support counts along the prefix paths must be updated because some of
the counts include transactions that do not contain item e. For example, the
rightmost path shown in Figure (a), null −→ b:2 −→ c:2 −→ e:1, includes a
transaction {b,c} that does not contain item e. The counts along the prefix path must
therefore be adjusted to 1 to reflect the actual number of transactions containing {b, c, e}.
(b) The prefix paths are truncated by removing the nodes for e. These nodes can be
removed because the support counts along the prefix paths have been updated to
reflect only transactions that contain e and the subproblems of finding frequent item
sets ending in de, ce, be, and ae no longer need information about node e.
(c) After updating the support counts along the prefix paths, some of the items may no
longer be frequent. For example, the node b appears only once and has a support
count equal to 1, which means that there is only one transaction that contains both
b and e. Item b can be safely ignored from subsequent analysis because all item sets
ending in be must be infrequent.

4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding
frequent item sets ending in de, ce, and ae. To find the frequent item sets ending in
de, the prefix paths for d are gathered from the conditional FP-tree for e. By adding
the frequency counts associated with node d, we obtain the support count for {d, e}.
Since the support count is equal to 2, {d, e} is declared a frequent itemset. Next, the
algorithm constructs the conditional FP-tree for de using the approach described in
step 3. After updating the support counts and removing the infrequent item c, the
conditional FP-tree for de is shown in Figure 5.27(d). Since the conditional FP-tree contains only one item, a, whose support is equal to minsup, the algorithm extracts the frequent itemset {a, d, e} and moves on to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the prefix paths for c, {c, e} is found to be frequent. However, the conditional FP-tree for ce will have no frequent items and thus will be eliminated. The algorithm proceeds to solve the next subproblem and finds {a, e} to be the only frequent itemset remaining.
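A simplified sketch of the divide-and-conquer recursion described above. Instead of an actual conditional FP-tree, each conditional pattern base is represented as a list of (prefix path, count) pairs, so the compression benefits of the tree are lost, but the suffix-based decomposition is the same idea.

    from collections import Counter

    def fpgrowth(patterns, minsup_count, suffix=()):
        # patterns: list of (tuple_of_items, count); the top-level call uses the
        # raw transactions, each with count 1.
        counts = Counter()
        for items, n in patterns:
            for i in set(items):
                counts[i] += n
        # Process items in increasing support order: the least frequent item first,
        # mirroring the bottom-up, suffix-based search described above.
        order = sorted((i for i, c in counts.items() if c >= minsup_count),
                       key=lambda i: (counts[i], i))
        rank = {i: r for r, i in enumerate(order)}
        frequent = {}
        for item in order:
            found = tuple(sorted(suffix + (item,)))
            frequent[found] = counts[item]
            # Conditional pattern base for `item`: for every pattern containing it,
            # keep only the higher-ranked items, carrying over the pattern's count.
            conditional = [(tuple(i for i in items if i in rank and rank[i] > rank[item]), n)
                           for items, n in patterns if item in items]
            frequent.update(fpgrowth(conditional, minsup_count, suffix + (item,)))
        return frequent

    # Example usage with the transaction list assumed in the earlier sketches:
    # fpgrowth([(tuple(t), 1) for t in transactions], minsup_count=2)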

FP Growth Algorithm:
Advantages of FP-Growth:
 Divide-and-conquer:
Decompose both the mining task and DB according to the frequent patterns
obtained so far.
leads to focused search of smaller databases.
 Other factors:
 no candidate generation, no candidate test
 compressed database: FP-tree structure
 no repeated scan of entire database
 basic ops—counting local freq items and building sub FP-tree, no pattern
search and matching
 The algorithm only needs to read the data set twice, as opposed to Apriori, which scans it once for every iteration.
 Because it uses the FP-tree, it removes the need to calculate the candidate pairs to be counted, which is very processing heavy. This makes it roughly O(n), which is much faster than Apriori.
 The FP-Growth algorithm stores in memory a compact version of the
database.
Disadvantages of FP-Growth:
 The biggest problem is the interdependency of the data. For parallelization of the algorithm, some data still needs to be shared between workers, which creates a bottleneck in the shared memory.

COMPARISON BETWEEN APRIORI ALGORITHM AND FP GROWTH ALGORITHM


 Both the Apriori and FP Growth algorithms are used to mine frequent patterns from a database.
 Both algorithms use some technique to discover the frequent patterns.
 The Apriori algorithm scales poorly to large databases because of repeated scans and candidate generation, whereas the FP Growth algorithm handles large databases better.
The comparisons between both the algorithms based on technique, memory
utilization, number of scans and time consumed are given below:

Technique: The Apriori algorithm uses the Apriori property and the join and prune steps for mining frequent patterns. The FP Growth algorithm constructs a conditional FP-tree and a conditional pattern base from the database, which satisfy the minimum support.

Search Type: Apriori uses breadth first search method and FP Growth uses divide
and conquer method.

Memory Utilization: The Apriori algorithm requires a large memory space because it deals with a large number of candidate itemsets. The FP Growth algorithm requires less memory due to its compact structure; it discovers the frequent itemsets without candidate itemset generation.

No.Of.Scans: Apriori algorithm performs multiple scans for generating candidate set.
FP Growth algorithm scans the database only twice.

Time: In the Apriori algorithm, execution time is wasted in producing candidates every time. FP Growth's execution time is less when compared to Apriori.
Algorithm: Apriori
Technique: Generate singletons, pairs, triplets, etc.
Run Time: Candidate generation is extremely slow. Runtime increases exponentially depending on the number of different items.
Memory Usage: Saves singletons, pairs, triplets, etc.
Parallelizability: Candidate generation is very parallelizable.

Algorithm: FP Growth
Technique: Insert items sorted by frequency into a pattern tree.
Run Time: Runtime increases linearly, depending on the number of transactions and items.
Memory Usage: Stores a compact version of the database.
Parallelizability: Data are very interdependent; each node needs the root.
