Data Warehousing and Data Mining: Reference Note
Unit-5
Mining Frequent Patterns
Frequent Patterns
Frequent patterns are patterns (itemsets, subsequences, or substructures) that appear in a data set frequently.
- For example, a set of items, such as milk and bread, that appears frequently together in a transaction data set is a frequent itemset.
- A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a (frequent) sequential pattern if it occurs frequently in a shopping history database.
- A substructure can refer to different structural forms, such as subgraphs or subtrees, which may be combined with itemsets or subsequences. If a substructure occurs frequently in a graph database, it is called a (frequent) structural pattern.
Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. Moreover, it helps in data indexing, classification, clustering, and other data mining tasks as well.
Market Basket Analysis
Market basket analysis analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. Such information can lead to increased sales by helping retailers do selective marketing and design different store layouts.
[Figure: Market basket analysis - a market analyst studying customers' shopping baskets, which contain items such as milk, bread, and sugar.]
Items that are frequently purchased together can be placed in proximity in order to further
encourage the sale of such items together. Market basket analysis can also help retailers plan
which items to put on sale at reduced prices. If customers tend to purchase milk and bread
together, then having a sale on milk may encourage the sale of milk as well as bread.
Frequent Itemsets
- Itemset: An itemset is a set of items. E.g. X = {milk, bread, coke} is an itemset. The examples below refer to this transaction database:

  TID | Items
  1   | Bread, Milk
  2   | Bread, Diaper, Beer, Eggs
  3   | Milk, Diaper, Beer, Coke
  4   | Bread, Milk, Diaper, Beer
  5   | Bread, Milk, Diaper, Coke

- k-itemset: An itemset that contains k items is a k-itemset. E.g. the set {milk, bread} is a 2-itemset.
- Support count (σ): Frequency of occurrence of an itemset. E.g. σ({bread, milk}) = 3 in the database above.
- Frequent itemset: Items that frequently appear together, e.g. {bread, milk}. An itemset whose support is greater than or equal to a minimum support threshold (min_sup) is called a frequent itemset.
Closed Itemsets
An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has
the same support count as X in S. For example,
Itemset | Support        Itemset     | Support
{A}     | 4              {A, B, C}   | 2
{B}     | 5              {A, B, D}   | 3
{C}     | 3              {A, C, D}   | 2
{D}     | 4              {B, C, D}   | 3
{A, B}  | 4              {A, B, C, D}| 2
{A, C}  | 2
{A, D}  | 3
{B, C}  | 3
{B, D}  | 4
{C, D}  | 3
Here, {B} is a closed itemset because all its supersets have a smaller support count than {B} itself. Other itemsets such as {A}, {C}, and {D} are not closed itemsets. The itemset {A, B} is also closed, because its supersets {A, B, C}, {A, B, D}, and {A, B, C, D} all have a smaller support count than {A, B}.
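The definition can be checked mechanically: an itemset is closed exactly when no proper superset has the same support count. The small Python sketch below illustrates this for the support counts in the table above (the helper name closed_itemsets and the hard-coded counts are illustrative assumptions, not part of any standard library):

support = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3, frozenset("D"): 4,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("AD"): 3,
    frozenset("BC"): 3, frozenset("BD"): 4, frozenset("CD"): 3,
    frozenset("ABC"): 2, frozenset("ABD"): 3, frozenset("ACD"): 2,
    frozenset("BCD"): 3, frozenset("ABCD"): 2,
}

def closed_itemsets(support):
    # An itemset is closed if no proper superset has the same support count.
    closed = []
    for itemset, sup in support.items():
        has_equal_superset = any(
            other > itemset and other_sup == sup
            for other, other_sup in support.items()
        )
        if not has_equal_superset:
            closed.append(itemset)
    return closed

for c in closed_itemsets(support):
    print(sorted(c), support[c])   # includes {B}: 5 and {A, B}: 4, among others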
Association Rules
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. A rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an itemset found in the data; a consequent is an itemset that is found in combination with the antecedent.
Let I = {i1, i2, ..., in} be a set of items and D = {T1, T2, ..., Tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form:
X ⇒ Y
where X, Y ⊆ I and X ∩ Y = ∅.
An example rule for a supermarket could be {butter, bread} ⇒ {milk}, meaning that if a customer purchases butter and bread then he/she is also likely to buy milk.
Support and Confidence
To measure the interestingness of association rules, two measures are used:
- Support: The support of an association rule X ⇒ Y is the percentage of transactions in the dataset that contain both X and Y. As a formula,

  Support(X ⇒ Y) = P(X ∪ Y) = σ(X ∪ Y) / (No. of transactions)

  For example, consider the transaction database used above:

  TID | Items
  1   | Bread, Milk
  2   | Bread, Diaper, Beer, Eggs
  3   | Milk, Diaper, Beer, Coke
  4   | Bread, Milk, Diaper, Beer
  5   | Bread, Milk, Diaper, Coke

  In this dataset, Support(bread ⇒ milk) = 3/5 = 60%, since both items occur together in 60% of all transactions.

- Confidence: The confidence of an association rule X ⇒ Y is the percentage of transactions containing X that also contain Y. As a formula,

  Confidence(X ⇒ Y) = P(Y | X) = P(X ∪ Y) / P(X) = σ(X ∪ Y) / σ(X)

  For example, in the above dataset, Confidence(bread ⇒ milk) = 3/4 = 75%, since 75% of all transactions containing bread also contain milk.
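As a quick illustration of these two formulas, the following Python sketch computes them over the five example transactions (the names transactions, support_count, support and confidence are illustrative helpers assumed here, not a standard API):

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def support_count(itemset):
    # sigma(itemset): number of transactions containing every item in itemset
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y):
    # Support(X => Y) = sigma(X u Y) / (number of transactions)
    return support_count(X | Y) / len(transactions)

def confidence(X, Y):
    # Confidence(X => Y) = sigma(X u Y) / sigma(X)
    return support_count(X | Y) / support_count(X)

print(support({"bread"}, {"milk"}))      # 0.6  -> 60%
print(confidence({"bread"}, {"milk"}))   # 0.75 -> 75%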
Association Rule Mining
Association rule mining is a method for discovering frequent patterns in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness.
Association rule mining consists of two sub-processes: finding frequent itemsets and generating association rules from those itemsets.
- Frequent itemset: A frequent itemset is a set of items whose support is greater than or equal to the user-specified minimum support.
- Association rule: An association rule must satisfy a user-set minimum support (min_sup) and minimum confidence (min_conf). The rule X ⇒ Y is called a strong association rule if support ≥ min_sup and confidence ≥ min_conf.
Finding Association Rules
[Figure 1: Generating association rules - Step 1: find all frequent itemsets using minimum support; Step 2: find association rules from the frequent itemsets using minimum confidence.]
Types of Association Rule
1. Single Dimensional Association Rule
It contains a single predicate (e.g. purchase) with its multiple occurrences (i.e. the
predicate occurs more than once within the rule). For example:
purchase(X, "milk") ⇒ purchase(X, "bread")
2. Multidimensional Association Rule
It contains two or more predicates. Each predicate occurs only once. For example:
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
3. Hybrid-dimensional Association Rule
It is a multidimensional association rule with repeated predicates, i.e. it contains multiple occurrences of some predicate. For example:
age(X, "19-25") ∧ buys(X, "laptop") ⇒ buys(X, "printer")
4. Multilevel Association Rule
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Two approaches to multilevel association rules:
- Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each level of abstraction.
- Using reduced minimum support at lower levels (referred to as reduced support): Each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold is.
[Figure: Uniform support vs. reduced support. Level 1: computer [support = 10%]; Level 2: laptop computer [support = 6%] and desktop computer [support = 4%]. Under uniform support, min_sup = 5% at both levels; under reduced support, min_sup = 5% at level 1 and 3% at level 2.]
5. Quantitative Association Rule
Database attributes can be categorical or quantitative. Categorical attributes have a finite
number of possible values, with no ordering among the values (e.g., occupation,
brand, color). Quantitative attributes are numeric and have an implicit ordering among
values (e.g., age, income, price). Association rules mined from quantitative attributes are referred to as quantitative association rules.
Unlike general association rules, where both the left-hand and right-hand sides of the rule are categorical attributes, at least one attribute of a quantitative association rule (left or right) must involve a numerical attribute. For example:
age(X, "20-30") ∧ salary(X, "50-60K") ⇒ buys(X, "laptop")
Apriori Algorithm
The Apriori algorithm is a sequence of steps to be followed to find the most frequent itemsets in a given database.
- Apriori property: All subsets of a frequent itemset must be frequent.
- Apriori pruning principle: If there is any itemset which is infrequent, its supersets should not be generated/tested.
Apriori employs an iterative approach known as a level-wise search, where frequent k-itemsets are used to explore frequent (k+1)-itemsets. First, the set of frequent 1-itemsets that satisfy minimum support is found by scanning the database. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used for pruning.
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_support;
end
return ∪k Lk;
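The pseudo-code above can be turned into a short, self-contained Python sketch of the level-wise idea. This is an illustration only (the function name apriori, the absolute min_support argument and the dictionary return value are assumptions made here, not part of the textbook pseudo-code):

from itertools import combinations

def apriori(transactions, min_support):
    # Returns {frozenset(itemset): support_count} for all frequent itemsets.
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    # L1: frequent 1-itemsets found with one scan of the database
    frequent, current = {}, []
    for i in sorted(items):
        c = frozenset([i])
        sup = sum(1 for t in transactions if c <= t)
        if sup >= min_support:
            frequent[c] = sup
            current.append(c)

    k = 1
    while current:
        k += 1
        # Candidate generation: join L_{k-1} with itself, keep unions of size k,
        # and prune candidates with an infrequent (k-1)-subset (Apriori property).
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in frequent for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One full scan of the database to count the surviving candidates
        next_level = []
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t)
            if sup >= min_support:
                frequent[c] = sup
                next_level.append(c)
        current = next_level
    return frequent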
Example
Min_support = 2
[Figure: Worked Apriori trace over a sample database D with min_support = 2, showing the candidate sets C1, C2, C3 and the frequent sets L1, L2, L3 at each level. Note: {I1, I2, I3}, {I1, I2, I5} and {I1, I3, I5} are pruned and therefore not in C3.]
Generating Association Rules from Frequent Itemsets
For each frequent itemset f, generate all non-empty subsets of f.
For every non-empty subset s of f:
    output the rule s ⇒ (f − s) if support(f) / support(s) ≥ min_confidence
end
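A matching Python sketch of this rule-generation step, building on the hypothetical apriori() helper sketched above (generate_rules is likewise an illustrative name, not a standard function):

from itertools import combinations

def generate_rules(frequent, min_confidence):
    # frequent: {frozenset: support_count} as returned by the apriori() sketch.
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                # Every subset of a frequent itemset is itself frequent (Apriori
                # property), so its support count is available in `frequent`.
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(consequent), conf))
    return rules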
From the above example, the frequent 3-itemset is {I2, I3, I5}.
Now, every non-empty subset of {I2, I3, I5} gives a candidate rule, whose confidence is σ({I2, I3, I5}) divided by the support count of its antecedent (both taken from database D):

Rules          | Confidence
I2 ⇒ I3 ∧ I5   | σ({I2, I3, I5}) / σ({I2})
I3 ⇒ I2 ∧ I5   | σ({I2, I3, I5}) / σ({I3})
I5 ⇒ I2 ∧ I3   | σ({I2, I3, I5}) / σ({I5})
I2 ∧ I3 ⇒ I5   | σ({I2, I3, I5}) / σ({I2, I3})
I2 ∧ I5 ⇒ I3   | σ({I2, I3, I5}) / σ({I2, I5})
I3 ∧ I5 ⇒ I2   | σ({I2, I3, I5}) / σ({I3, I5})

As the given threshold (minimum confidence) is 75%, the rules I2 ∧ I3 ⇒ I5 and I3 ∧ I5 ⇒ I2 can be considered the strong association rules for the given problem.
Examples
Consider the following meal orders. Using the Apriori algorithm, find the frequent itemsets (min_sup = 2) and the strong association rules (min_conf = 0.7).

Order   | Meal Items          Order   | Meal Items
Order 1 | M1, M2, M5          Order 6 | M2, M3
Order 2 | M2, M4              Order 7 | M1, M3
Order 3 | M2, M3              Order 8 | M1, M2, M3, M5
Order 4 | M1, M2, M4          Order 9 | M1, M2, M3
Order 5 | M1, M3
Solution:
Step 1: Calculating C1 and L1

Candidate 1-itemsets (C1):
Itemset | Sup_count
{M1}    | 6
{M2}    | 7
{M3}    | 6
{M4}    | 2
{M5}    | 2

Frequent 1-itemsets (L1):
Itemset | Sup_count
{M1}    | 6
{M2}    | 7
{M3}    | 6
{M4}    | 2
{M5}    | 2
Collegenote Prepared By: Jayanta PoudelData Warehousing and Data Mining Reference Note
Step 2: Calculating C2 and L2

Candidate 2-itemsets (C2):
Itemset  | Sup_count
{M1, M2} | 4
{M1, M3} | 4
{M1, M4} | 1
{M1, M5} | 2
{M2, M3} | 4
{M2, M4} | 2
{M2, M5} | 2
{M3, M4} | 0
{M3, M5} | 1
{M4, M5} | 0

Frequent 2-itemsets (L2):
Itemset  | Sup_count
{M1, M2} | 4
{M1, M3} | 4
{M1, M5} | 2
{M2, M3} | 4
{M2, M4} | 2
{M2, M5} | 2
Step 3: Calculating C3 and L3

Candidate 3-itemsets (C3):
Itemset      | Sup_count
{M1, M2, M3} | 2
{M1, M2, M5} | 2

Frequent 3-itemsets (L3):
Itemset      | Sup_count
{M1, M2, M3} | 2
{M1, M2, M5} | 2
Step 4: Calculating C4 and L4
The candidate 4-itemset is {M1, M2, M3, M5}. This itemset is pruned since its subset {M2, M3, M5} is not frequent. Thus C4 = ∅, and the algorithm terminates.
∴ Frequent itemsets = {M1, M2, M3} and {M1, M2, M5}
Now,
Generating Association Rules from Frequent Itemsets:

Taking {M1, M2, M3}:
Rules         | Confidence
M1 ⇒ M2 ∧ M3  | 2/6 = 0.333
M2 ⇒ M1 ∧ M3  | 2/7 = 0.286
M3 ⇒ M1 ∧ M2  | 2/6 = 0.333
M1 ∧ M2 ⇒ M3  | 2/4 = 0.5
M1 ∧ M3 ⇒ M2  | 2/4 = 0.5
M2 ∧ M3 ⇒ M1  | 2/4 = 0.5

Taking {M1, M2, M5}:
Rules         | Confidence
M1 ⇒ M2 ∧ M5  | 2/6 = 0.333
M2 ⇒ M1 ∧ M5  | 2/7 = 0.286
M5 ⇒ M1 ∧ M2  | 2/2 = 1.0
M1 ∧ M2 ⇒ M5  | 2/4 = 0.5
M1 ∧ M5 ⇒ M2  | 2/2 = 1.0
M2 ∧ M5 ⇒ M1  | 2/2 = 1.0

As the given threshold (minimum confidence) is 0.7, the rules M5 ⇒ M1 ∧ M2, M1 ∧ M5 ⇒ M2 and M2 ∧ M5 ⇒ M1 can be considered the strong association rules for the given problem.
Consider the following transaction database. Find all frequent itemsets and strong association rules using the Apriori algorithm, given minimum support = 60% and minimum confidence = 80%.

TID  | Items_bought
T100 | {M, O, N, K, E, Y}
T200 | {D, O, N, K, E, Y}
T300 | {M, A, K, E}
T400 | {M, U, C, K, Y}
T500 | {C, O, O, K, I, E}

Solution:
Given,
Min_sup = 60% of 5 transactions = 3
Min_conf = 80%
Step 1: Calculating C1 and L1

Candidate 1-itemsets (C1):
Itemset | Sup_count
{M}     | 3
{O}     | 3
{N}     | 2
{K}     | 5
{E}     | 4
{Y}     | 3
{D}     | 1
{A}     | 1
{U}     | 1
{C}     | 2
{I}     | 1

Frequent 1-itemsets (L1):
Itemset | Sup_count
{M}     | 3
{O}     | 3
{K}     | 5
{E}     | 4
{Y}     | 3

Step 2: Calculating C2 and L2
Candidate 2-itemsets (C2):
Itemset | Sup_count
{M, O}  | 1
{M, K}  | 3
{M, E}  | 2
{M, Y}  | 2
{O, K}  | 3
{O, E}  | 3
{O, Y}  | 2
{K, E}  | 4
{K, Y}  | 3
{E, Y}  | 2

Frequent 2-itemsets (L2):
Itemset | Sup_count
{M, K}  | 3
{O, K}  | 3
{O, E}  | 3
{K, E}  | 4
{K, Y}  | 3
Step 3: Calculating C3 and L3

Candidate 3-itemsets (C3):
Itemset   | Sup_count
{O, K, E} | 3

Frequent 3-itemsets (L3):
Itemset   | Sup_count
{O, K, E} | 3

Now, stop since no more combinations can be made from L3.
∴ Frequent itemset = {O, K, E}
Now,
Generating Association Rules from Frequent Itemsets:

Rules      | Confidence
O ⇒ K ∧ E  | 3/3 = 100%
K ⇒ O ∧ E  | 3/5 = 60%
E ⇒ O ∧ K  | 3/4 = 75%
O ∧ K ⇒ E  | 3/3 = 100%
O ∧ E ⇒ K  | 3/3 = 100%
K ∧ E ⇒ O  | 3/4 = 75%

As the given threshold (minimum confidence) is 80%, the rules O ⇒ K ∧ E, O ∧ K ⇒ E and O ∧ E ⇒ K can be considered the strong association rules for the given problem.
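For reference, the two illustrative helpers sketched earlier in this note (apriori and generate_rules, both assumed names) reproduce this result when run on the same data:

transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},   # the duplicate O in T500 counts only once per transaction
]

frequent = apriori(transactions, min_support=3)   # 60% of 5 transactions
print(frequent[frozenset({"O", "K", "E"})])       # 3

for antecedent, consequent, conf in generate_rules(frequent, min_confidence=0.8):
    print(antecedent, "=>", consequent, round(conf, 2))
# Prints the three strong rules from {O, K, E} shown above, plus the strong
# rules coming from the frequent 2-itemsets (e.g. {M} => {K} with confidence 1.0).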
Advantages of Apriori Algorithm:
- It is an easy-to-understand algorithm.
- This algorithm has the least memory consumption.
- Easy implementation.
- It uses the Apriori property for pruning; therefore, the itemsets left for further support checking remain few.

Limitations of Apriori Algorithm:
- Apriori needs to generate candidate itemsets. These itemsets may be very numerous if the itemsets in the database are huge.
- Apriori needs multiple scans of the database to check the support of each candidate itemset generated, and this leads to high costs.

These limitations can be overcome using the FP growth algorithm.
Methods to Improve Apriori’s Efficiency
- Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans and can be removed (see the sketch after this list).
- Partitioning: An itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: Mine on a subset of the given data, using a lower support threshold, together with a method to determine completeness.
- Dynamic itemset counting: Add new candidate itemsets only when all of their subsets are estimated to be frequent.
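As a small illustration of the transaction reduction idea mentioned above, the scan for the (k+1)-th level can simply skip transactions that contain no frequent k-itemset (the helper below is a hypothetical sketch, not part of the standard algorithm description):

def reduce_transactions(transactions, frequent_k_itemsets):
    # Keep only transactions that contain at least one frequent k-itemset;
    # the others cannot contribute to any frequent (k+1)-itemset.
    return [
        t for t in transactions
        if any(itemset <= t for itemset in frequent_k_itemsets)
    ]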
Frequent Pattern (FP) Growth Algorithm
This algorithm is an improvement over the Apriori method. Frequent patterns are generated without the need for candidate generation. The FP growth algorithm represents the database in the form of a tree called a frequent pattern tree, or FP-tree. This tree structure maintains the association between the itemsets.
In FP Growth there are mainly two steps:

Step 1: Construction of the FP-tree
It is built using two passes over the data set (a Python sketch of this construction is given after the step descriptions).
Pass 1:
- Scan the data and find the support for each item.
- Discard infrequent items.
- Sort the frequent items in decreasing order of their support.
Pass 2:
- Create the root of the tree, labeled with null.
- Scan the database again. Process the items in each transaction in descending order of support count and create a branch for each transaction.
- If a transaction shares a common prefix with already processed transactions, increment the counter for each item in the common prefix and create branches only for the items that are not common.
- To facilitate tree traversal, an item header table is also built so that each item points to its occurrences in the tree via a chain of node-links.

Step 2: Extract Frequent Itemsets from the FP-tree
- Construct its conditional pattern base, which is the subset of the database that contains the set of prefix paths in the FP-tree.
- Construct its (conditional) FP-tree. The conditional FP-tree of item I is the FP-tree constructed by considering only the transactions that contain item I and then removing item I from all of them. Then perform mining recursively on such a tree. The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
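The following Python sketch illustrates Step 1 (both passes), assuming min_support is an absolute count; the class FPNode and the function build_fp_tree are illustrative names chosen here, not a standard API:

from collections import defaultdict

class FPNode:
    # One node of the FP-tree: an item, its count, a parent link and children.
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count supports, discard infrequent items, fix a descending order.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    rank = {
        item: r
        for r, item in enumerate(sorted(
            (i for i, c in counts.items() if c >= min_support),
            key=lambda i: (-counts[i], i),
        ))
    }

    # Pass 2: insert each transaction's ordered frequent items into the tree,
    # incrementing counts along shared prefixes and branching where needed.
    root = FPNode(None, None)
    header = defaultdict(list)   # item -> node-link chain (header table)
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=lambda i: rank[i])
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header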
Example
TID | Items
T1  | B, A, T
T2  | A, C
T3  | A, S
T4  | B, A, C
T5  | B, S
T6  | A, S
T7  | B, S
T8  | B, A, S, T
T9  | B, A, S
Solution:
First of all, we create a table of item counts over the whole transactional database and then sort the item list in descending order of support count, as below:

Item | Support Count
B    | 6
A    | 7
T    | 2
C    | 2
S    | 6

Sorting the items in descending order of frequency gives:

Item | Support Count
A    | 7
B    | 6
S    | 6
C    | 2
T    | 2

Now, for each transaction, the respective ordered itemset is built:
TID | Items       | Ordered Itemset
T1  | B, A, T     | A, B, T
T2  | A, C        | A, C
T3  | A, S        | A, S
T4  | B, A, C     | A, B, C
T5  | B, S        | B, S
T6  | A, S        | A, S
T7  | B, S        | B, S
T8  | B, A, S, T  | A, B, S, T
T9  | B, A, S     | A, B, S
Now we build the FP-tree using the ordered itemsets one by one:

[Figure: FP-tree construction step by step - nine snapshots of the tree after reading T1 through T9. Each transaction's ordered items are inserted starting from the null root; counts along shared prefixes are incremented and new branches are created for the items that are not common.]
Now, to facilitate tree traversal, an item header table is built so that each item points to its
occurrences in the tree via a chain of node-links.
[Figure: The final FP-tree together with its item header table (Item | Count | Node-link), where each entry's node-link chains to that item's occurrences in the tree.]
Now we construct a table of the conditional pattern bases and, from them, generate the frequent patterns:

Item | Conditional Pattern Base      | Conditional FP-tree  | Frequent Patterns Generated
T    | {{A,B: 1}, {A,B,S: 1}}        | <A: 2, B: 2>         | {A,T: 2}, {B,T: 2}, {A,B,T: 2}
C    | {{A,B: 1}, {A: 1}}            | <A: 2>               | {A,C: 2}
S    | {{A,B: 2}, {A: 2}, {B: 2}}    | <A: 4, B: 2>, <B: 2> | {A,S: 4}, {B,S: 4}, {A,B,S: 2}
B    | {{A: 4}}                      | <A: 4>               | {A,B: 4}

A is reached directly from the null node, and since there are no in-between nodes on its prefix paths, there is no row for A.

We have generated the 3-item frequent sets {A, B, T: 2} and {A, B, S: 2}. Similarly, the 2-item frequent sets are {A, T: 2}, {B, T: 2}, {A, C: 2}, {A, S: 4}, {B, S: 4} and {A, B: 4}.
Generate Association Rules:
Same as in Apriori algorithm.
Advantages of FP growth algorithm
- It allows frequent itemset discovery without candidate generation.
- It builds a compact data structure called the FP-tree.
- It needs only two passes over the dataset.
- It is faster than the Apriori algorithm.

Disadvantages of FP growth algorithm
- The FP-tree is more difficult to build than Apriori's candidate sets.
- The FP-tree may not fit in memory.
- The FP-tree is expensive to build.
Apriori vs FP Growth

Apriori                                                        | FP Growth
It is slower than the FP growth algorithm.                     | It is faster than the Apriori algorithm.
It is an array-based algorithm.                                | It is a tree-based algorithm.
It uses the Apriori join and prune properties.                 | It constructs a conditional frequent pattern tree and a conditional pattern base from the database, which satisfy minimum support.
It uses a level-wise approach: it generates patterns containing 1 item, then 2 items, then 3 items, and so on. | It uses a pattern-growth approach, meaning that it only considers patterns actually existing in the database.
It requires large memory space due to the large number of candidate itemsets generated. | It requires less memory space due to its compact structure and the absence of candidate generation.
It scans the database multiple times to generate candidate sets. | It scans the database only twice to construct the frequent pattern tree.
Candidate generation is very parallelizable.                   | The data are very interdependent; each node needs the root.
From Association Mining to Correlation Analysis (Lift)
Most association rule mining algorithms employ a support-confidence framework. Often, many interesting rules can be found using low support thresholds. Although minimum support and confidence thresholds help weed out or exclude the exploration of a good number of uninteresting rules, many rules so generated are still not interesting to the users.
The support and confidence measures are insufficient for filtering out uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form:

A ⇒ B [support, confidence, correlation]

That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. There are many different correlation measures from which to choose. Lift is a simple correlation measure, given as follows:

Lift: The lift between the occurrences of A and B can be measured as:

lift(A, B) = P(A ∪ B) / (P(A) P(B))

- If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B.
- If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
- If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
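Reusing the small support_count helper and the transactions list sketched earlier in this note (both illustrative assumptions), lift can be computed directly from the definition:

def lift(A, B):
    # lift(A, B) = P(A u B) / (P(A) * P(B))
    n = len(transactions)
    return (support_count(A | B) / n) / ((support_count(A) / n) * (support_count(B) / n))

print(lift({"bread"}, {"milk"}))   # interpretation: > 1 positive, < 1 negative, = 1 independent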
Please let me know if I missed anything or
anything is incorrect.
[email protected]
Collegenote Prepared By: Jayanta Poudel