DM-u3

Unit III covers mining frequent patterns, focusing on concepts such as the Apriori algorithm for finding frequent itemsets, generating association rules, and mining multilevel associations. It emphasizes the importance of frequent pattern analysis in various applications, including market basket analysis and data classification. The document also discusses the limitations and benefits of market basket analysis, as well as improvements to the Apriori method for efficient candidate generation and support counting.

Unit III

Mining Frequent Patterns


Unit - III
• Basic Concepts

• Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

• Generating Association Rules from Frequent Itemsets

• Mining Multilevel Associations

• Constraint-Based Frequent Pattern Mining


What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Freq. Pattern Mining Important?

• Freq. pattern: An intrinsic and important property of datasets


• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cube and cube-gradient
– Semantic data compression: fascicles
– Broad applications
Basic Concepts: Frequent Patterns

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• itemset: a set of one or more items
• k-itemset: an itemset X = {x1, …, xk} containing k items
• (absolute) support, or support count, of X: the frequency or number of occurrences of itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold
Basic Concepts: Association Rules

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X ⇒ Y with minimum support and confidence
  – support, s: probability that a transaction contains X ∪ Y
  – confidence, c: conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
  – Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
  – Association rules (many more exist):
    • Beer ⇒ Diaper (support 60%, confidence 100%)
    • Diaper ⇒ Beer (support 60%, confidence 75%)
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
• Closed patterns are a lossless compression of frequent patterns
  – Reducing the number of patterns and rules
Closed Patterns and Max-Patterns
• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
  – Min_sup = 1.
• What is the set of closed itemsets?
  – <a1, …, a100>: 1
  – <a1, …, a50>: 2
• What is the set of max-patterns?
  – <a1, …, a100>: 1
• What is the set of all frequent patterns?
  – All 2^100 − 1 nonempty subsets of {a1, …, a100}: far too many to enumerate!
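The exercise can be checked by brute force on a scaled-down analogue; the DB below is a hypothetical 5-item stand-in for the 100-item one, so that enumeration stays feasible:

```python
from itertools import combinations

# Scaled-down analogue of the exercise DB (hypothetical): 5 items, min_sup = 1
db = [frozenset(f"a{i}" for i in range(1, 6)),   # <a1, ..., a5>
      frozenset(f"a{i}" for i in range(1, 4))]   # <a1, ..., a3>
min_sup = 1

items = sorted(set().union(*db))
# support count of every frequent itemset
support = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        cnt = sum(1 for t in db if s <= t)
        if cnt >= min_sup:
            support[s] = cnt

# closed: frequent, and no proper superset has the same support
closed = {x for x in support
          if not any(x < y and support[y] == support[x] for y in support)}
# max: frequent, and no frequent proper superset exists at all
maximal = {x for x in support if not any(x < y for y in support)}
```

With min_sup = 1 every one of the 2^5 − 1 = 31 nonempty subsets is frequent, but only two itemsets are closed and only one is maximal, mirroring the 100-item exercise.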
Frequent Pattern Mining

•Frequent Patterns: patterns that occur frequently in data.

•Three types of frequent patterns:
–Frequent itemset
–Frequent sequential pattern
–Frequent structured pattern
Frequent Pattern Mining (cntd…)

•Frequent itemset: a set of items, such as milk and bread, that appear frequently together in a transaction data set.

•Frequent sequential pattern: a subsequence that occurs frequently, e.g., in a shopping history database.

•Frequent structured pattern: a substructure that occurs frequently.
Frequent Pattern Mining (cntd…)

•Searches for recurring relationships in a given data set.
•Plays an essential role in association mining.
•Helps in data classification, clustering, and other data mining tasks.
Market Basket Analysis

•The earliest form of frequent pattern mining is market basket analysis.
•Consider a shopping cart filled with several items.
•From a marketing perspective, the goal is to determine which items are frequently purchased together within the same transaction.
Market Basket Analysis (cntd…)
•To categorize customer purchase behavior
•To identify actionable information
–purchase profiles
–profitability of each purchase profile
–use for marketing:
•store layouts
•design catalogs
•select products for promotion
•space allocation, product placement
•To plan marketing or advertising strategies.
•To plan which items to put on sale at reduced prices.
Transactions database Example 1

TID  Products
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Attributes converted to binary flags:

TID  A  B  C  D  E
1    1  1  0  0  1
2    0  1  0  1  0
3    0  1  1  0  0
4    1  1  0  1  0
5    1  0  1  0  0
6    0  1  1  0  0
7    1  0  1  0  0
8    1  1  1  0  1
9    1  1  1  0  0
Support and Confidence

Transactions database Example 1 (as above):

TID  Products
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Example rule: A ⇒ C
•Support: 4/9 ≈ 44% (4 of the 9 transactions contain both A and C)
•Confidence: 4/6 ≈ 66% (of the 6 transactions containing A, 4 also contain C)
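These numbers can be reproduced directly from the table; a minimal sketch in Python:

```python
# Reproducing support and confidence of the rule A => C from the
# 9-transaction example database above.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

sup_rule = support({"A", "C"})           # 4/9, about 44%
confidence = sup_rule / support({"A"})   # (4/9) / (6/9) = 4/6, about 66%
```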
Market Basket Analysis (cntd…)
•LIMITATIONS
–takes over 18 months to implement
–market basket analysis only identifies hypotheses, which need to be tested (e.g., by neural network, regression, or decision tree analyses)
–measurement of impact needed
–difficult to identify product groupings
–complexity grows exponentially
Market Basket Analysis (cntd…)
•BENEFITS
–simple computations
–can be undirected (don’t have to have hypotheses before analysis)
–different data forms can be analyzed
Apriori: A Candidate Generation & Test Approach
• Apriori Property: Any subset of a frequent itemset must be
frequent
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length k
frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated
Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation: An Example

Sup_min = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

C1 (after 1st scan):
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1, counted in 2nd scan):
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (generated from L2, counted in 3rd scan):   L3:
Itemset    sup                                 Itemset    sup
{B, C, E}  2                                   {B, C, E}  2
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
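The pseudo-code can be turned into a short runnable sketch in Python (an unoptimized translation, using a set-union self-join as one way of realizing "candidates generated from Lk"):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {frequent itemset: support count}, following the pseudo-code."""
    db = [frozenset(t) for t in db]
    items = sorted(set().union(*db))
    # L1 = {frequent items}
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= min_sup}
    freq = {s: sum(1 for t in db if s <= t) for s in Lk}
    k = 1
    while Lk:
        # Ck+1: self-join Lk, then prune candidates with an infrequent k-subset
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one DB scan counts every candidate contained in each transaction
        counts = {c: sum(1 for t in db if c <= t) for c in cands}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        freq.update({c: counts[c] for c in Lk})
        k += 1
    return freq

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(tdb, min_sup=2)
```

On the TDB example above this reproduces L1 = {A:2, B:3, C:3, E:3}, L2 = {AC:2, BC:2, BE:3, CE:2}, and L3 = {BCE:2}.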
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
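The two steps can be sketched as follows. Note that this simple union-based join also generates a few extra candidates (such as abce) that the prune step then removes; the prefix-based join of the full algorithm avoids generating them in the first place.

```python
from itertools import combinations

# L3 from the example above
L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}

# Step 1: self-join - union pairs of L3 whose combined size is k+1 = 4
joined = {a | b for a in L3 for b in L3 if len(a | b) == 4}

# Step 2: prune - keep only candidates whose every 3-subset is in L3
C4 = {c for c in joined
      if all(frozenset(s) in L3 for s in combinations(sorted(c), 3))}
```

C4 comes out as {abcd}: acde is pruned because ade is not in L3, and abce because abe is not.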
How to Count Supports of Candidates?

• Why is counting supports of candidates a problem?
– The total number of candidates can be very huge
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and
counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a
transaction
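A hash-tree implementation is more than a slide's worth of code, but the subset function's job can be sketched with a plain dict standing in for the tree (the candidate set below is hypothetical; the hash-tree's benefit is avoiding work for subsets that match no candidate):

```python
from itertools import combinations

# Hypothetical candidate 2-itemsets, stored with their running counts
candidate_counts = {frozenset(c): 0 for c in
                    ({"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"})}
k = 2

def count_candidates(transaction):
    """Subset function: bump the count of every candidate contained in t."""
    for sub in combinations(sorted(transaction), k):
        s = frozenset(sub)
        if s in candidate_counts:
            candidate_counts[s] += 1

count_candidates({"A", "B", "C", "E"})   # contains all four candidates
count_candidates({"B", "E"})             # contains only {B, E}
```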
Example
Generating Association Rules from Frequent Itemsets

[Rule list lost in extraction: candidate rules from a frequent itemset, with their confidences]
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong.
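The procedure is: for each frequent itemset l, generate every nonempty proper subset s and output the rule s ⇒ (l − s) whenever confidence = support(l) / support(s) clears the threshold. A sketch on the small TDB from the Apriori example (not the example the sentence above refers to):

```python
from itertools import combinations

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def support_count(itemset):
    return sum(1 for t in db if itemset <= t)

def rules_from(freq_itemset, min_conf):
    """Yield (lhs, rhs, confidence) for every strong rule from one itemset."""
    l = frozenset(freq_itemset)
    for r in range(1, len(l)):
        for lhs in combinations(sorted(l), r):
            lhs = frozenset(lhs)
            conf = support_count(l) / support_count(lhs)
            if conf >= min_conf:
                yield lhs, l - lhs, conf

strong = list(rules_from({"B", "C", "E"}, min_conf=0.7))
```

Only {B, C} ⇒ {E} and {C, E} ⇒ {B} survive (confidence 100% each); the single-item antecedents have confidence 2/3 ≈ 67% and are dropped.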
Further Improvement of the Apriori Method

• Major computational challenges


– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates
•Association rules from frequent itemsets
–multilevel association rules
–multidimensional association rules
