Limited pass algorithm
Limited Pass Algorithms
◆ Algorithms so far: compute exact collection
of frequent itemsets of size k in k passes
◗ A-Priori, PCY, Multistage, Multihash
◆ Many applications where it is not essential to
discover every frequent itemset
◗ Sufficient to discover most of them
◆ Next: algorithms that find all or most
frequent itemsets using at most 2 passes
over data
◗ Sampling
◗ SON
◗ Toivonen’s Algorithm
Random Sampling of Input Data
Random Sampling
◆ Take a random sample of the market baskets that fits in main memory
◗ Leave enough space in memory for counts
◗ For sets of all sizes, not just pairs
◗ Don’t pay for disk I/O each time we increase the size of itemsets
◗ Reduce the support threshold proportionally to match the sample size
[Figure: main memory split between the sample of baskets and space for counts]
How to Pick the Sample
◆ Best way: read the entire data set
◆ For each basket, select that basket for the sample with probability p
◗ For input data with m baskets
◗ At the end, the sample will contain close to pm baskets
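A minimal sketch of this selection rule in Python (the function name and arguments are illustrative, not from the slides):

import random

def sample_baskets(baskets, p):
    # Keep each basket independently with probability p; for m
    # input baskets the sample size is binomial with mean p*m,
    # i.e., close to p*m
    return [basket for basket in baskets if random.random() < p]

The whole file is read once, sequentially; only the selected baskets are kept in main memory.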
Support Threshold for Random Sampling
◆ Adjust the support threshold to a suitable, scaled-back number
◗ To reflect the smaller number of baskets
◆ Example
◗ If the sample size is 1%, i.e., 1/100, of the baskets
◗ Use s/100 as your support threshold
◗ An itemset is frequent in the sample if it appears in at least s/100 of the baskets in the sample
Random Sampling:
Not an exact algorithm
◆ With a single pass, we cannot guarantee:
◗ That the algorithm will produce all itemsets that are frequent in the whole dataset
• False negative: an itemset that is frequent in the whole dataset but not in the sample
◗ That it will produce only itemsets that are frequent in the whole dataset
• False positive: an itemset that is frequent in the sample but not in the whole dataset
SON Algorithm
◆ Avoids false negatives and false positives
◆ Requires two full passes over data
SON Algorithm – (1)
◆ Repeatedly read small subsets of the
baskets into main memory
◆ Run an in-memory algorithm (e.g., A-Priori, random sampling) to find all frequent itemsets
◗ Note: we are not sampling, but processing
the entire file in memory-sized chunks
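A minimal sketch of the two passes, assuming a helper frequent_itemsets(baskets, threshold) that runs any in-memory algorithm (e.g., A-Priori) and returns a set of frozensets; the helper and all names are illustrative:

from collections import Counter

def son(chunks, s, p):
    # chunks: list of memory-sized lists of baskets, each a fraction p
    # of the whole file; s: global support threshold
    candidates = set()
    for chunk in chunks:                     # pass 1: local frequent sets
        candidates |= frequent_itemsets(chunk, p * s)   # assumed helper
    counts = Counter()
    for chunk in chunks:                     # pass 2: exact counts
        for basket in chunk:
            items = set(basket)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
    return {c for c, n in counts.items() if n >= s}

No false negatives: an itemset with total support at least s must have support at least ps in at least one chunk of fraction p, so it survives Pass 1; Pass 2 removes false positives by exact counting.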
SON – Distributed Version
◆ SON lends itself to distributed data
mining
◗ MapReduce
SON: Map/Reduce
◆ Phase 1: Find candidate itemsets
◗ Map?
◗ Reduce?
SON: Map/Reduce
Phase 1: Find candidate itemsets
◆ Map
◗ Input is a chunk/subset of all baskets: fraction p of the total input file
◗ Find the itemsets frequent in that subset (e.g., using the random sampling algorithm)
◗ Use support threshold ps
◗ Output is a set of key-value pairs (F, 1), where F is an itemset frequent in the subset
◆ Reduce
◗ Each reduce task is assigned a set of keys, which are itemsets
◗ Produces the keys that appear one or more times
◗ These were frequent in some subset
◗ They are the candidate itemsets
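As a sketch, Phase 1 might look like this in framework-agnostic Python (frequent_itemsets is the same assumed helper as before):

def phase1_map(chunk, p, s):
    # chunk is fraction p of the input file; emit each locally
    # frequent itemset F with a dummy value
    for F in frequent_itemsets(chunk, p * s):
        yield (F, 1)

def phase1_reduce(itemset, values):
    # Presence alone matters: anything emitted by some Map task
    # was frequent in some chunk, hence a candidate
    yield itemset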
SON: Map/Reduce
Phase 2: Find true frequent itemsets
◆ Map
◗ Each Map task takes the output of the Phase 1 Reduce tasks AND a chunk of the total input data file
◗ All candidate itemsets go to every Map task
◗ Count occurrences of each candidate itemset among the baskets in the input chunk
◗ Output is a set of key-value pairs (C, v), where C is a candidate frequent itemset and v is its support among the baskets in the input chunk
◆ Reduce
◗ Each reduce task is assigned a set of keys (itemsets)
◗ Sums the associated values for each key: total support for the itemset
◗ If the support of an itemset is >= s, emit the itemset and its count
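A matching sketch of Phase 2 under the same assumptions (candidate itemsets are frozensets):

def phase2_map(chunk, candidates):
    # Every Map task receives all candidates from Phase 1
    for C in candidates:
        v = sum(1 for basket in chunk if C <= set(basket))
        yield (C, v)            # support of C within this chunk

def phase2_reduce(itemset, values, s):
    total = sum(values)         # total support across all chunks
    if total >= s:
        yield (itemset, total)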
Toivonen’s Algorithm
Toivonen’s Algorithm
◆ Given sufficient main memory, uses one pass over a small sample and one full pass over the data
◆ Gives no false positives or false negatives
◆ BUT there is a small but finite probability that it will fail to produce an answer
◗ In that case it does not report any frequent itemsets
◆ It must then be repeated with a different sample until it gives an answer
◆ Only a small number of iterations is typically needed
Toivonen’s Algorithm (1)
First find candidate frequent itemsets from the sample
◆ Start as in the random sampling algorithm, but lower the threshold slightly for the sample
◗ Example: if the sample is 1% of the baskets, use s/125 as the support threshold rather than s/100
◗ In general, for fraction p of baskets in the sample, use 0.8ps or 0.9ps as the support threshold
Example: Negative Border
◆ An itemset is in the negative border if it is not frequent in the sample, but all of its immediate subsets (the subsets obtained by removing one item) are frequent
◗ Example: if {A,B,C} is not frequent in the sample, while {A,B}, {A,C}, and {B,C} all are, then {A,B,C} is in the negative border
Picture of Negative Border
[Figure: the lattice of itemsets from the sample (singletons, doubletons, tripletons, ...), with the frequent itemsets at the bottom and the negative border as the layer immediately above them]
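A sketch of how the negative border could be computed from the sample's frequent itemsets; this helper is assumed for illustration, not given in the slides:

def negative_border(frequent, items):
    # frequent: set of frozensets frequent in the sample;
    # items: set of all items. By convention the empty set is
    # frequent, so a non-frequent singleton is in the border.
    frequent = frequent | {frozenset()}
    candidates = {frozenset([i]) for i in items} | {
        F | {i} for F in frequent for i in items if i not in F
    }
    return {
        c for c in candidates
        if c not in frequent
        and all(c - {i} in frequent for i in c)
    }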
Toivonen’s Algorithm (1)
First pass:
(1) Find candidate frequent itemsets from the sample
◗ Sample on the first pass!
◗ Use a lower threshold: for fraction p of baskets in the sample, use 0.8ps or 0.9ps as the support threshold
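Putting the pieces together, a sketch of the full algorithm using the assumed helpers frequent_itemsets and negative_border from above; the 0.9 factor is one of the slide's suggested choices:

import random

def toivonen(baskets, items, s, p):
    while True:
        sample = [b for b in baskets if random.random() < p]
        freq = frequent_itemsets(sample, 0.9 * p * s)  # lowered threshold
        border = negative_border(freq, items)
        counts = {c: 0 for c in freq | border}
        for basket in baskets:               # the one full pass
            bs = set(basket)
            for c in counts:
                if c <= bs:
                    counts[c] += 1
        if any(counts[c] >= s for c in border):
            continue   # failure: a border itemset is truly frequent,
                       # so retry with a fresh sample
        return {c for c in freq if counts[c] >= s}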
Hashing
Recall: in the PCY algorithm, when generating L1 (the set of frequent itemsets of size 1), the algorithm also:
• generates all possible pairs for each basket
• hashes them to buckets
• keeps a count for each hash bucket
• identifies the frequent buckets (count >= s)
[Figure: recall the PCY picture of main memory. Pass 1 holds the item counts and a hash table with counts for pairs; Pass 2 holds the frequent items, the bitmap of frequent buckets, and counts of candidate pairs]
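A sketch of Pass 1 in Python; the hash function and table size are illustrative:

from collections import Counter
from itertools import combinations

def pcy_pass1(baskets, s, n_buckets):
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        item_counts.update(basket)                  # counts for L1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [int(c >= s) for c in bucket_counts]   # frequent buckets
    return frequent_items, bitmap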
Example
Consider the basket database in the first table below.
All itemsets of size 1 were determined to be frequent on the previous pass.
The second table below shows all possible 2-itemsets for each basket.

Basket ID   Items
100         Bread, Cheese, Eggs, Juice
200         Bread, Cheese, Juice
300         Bread, Milk, Yogurt
400         Bread, Juice, Milk
500         Cheese, Juice, Milk

Bucket 5 is frequent (see the bucket table below). Are any of the pairs that hash to that bucket frequent?
Does Pass 1 of PCY know which pairs contributed to the bucket?
Hashing Example
Support threshold = 3
The possible pairs per basket:
100: (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200: (B, C) (B, J) (C, J)
300: (B, M) (B, Y) (M, Y)
400: (B, J) (B, M) (J, M)
500: (C, J) (C, M) (J, M)
Hash: map each item to its number (mapping table below), concatenate the two digits, and take the result mod 8
Examples: (B, C) -> 12, 12 % 8 = 4; (B, E) -> 13, 13 % 8 = 5; (C, J) -> 24, 24 % 8 = 0
Mapping table:
Item   Number
B      1
C      2
E      3
J      4
M      5
Y      6

Bucket number   Count   Pairs that hash to bucket   Bit map for frequent buckets
0               5       (C, J) (B, Y) (M, Y)        1
1               1       (C, M)                      0
2               1       (E, J)                      0
3               0       (none)                      0
4               2       (B, C)                      0
5               3       (B, E) (J, M)               1
6               3       (B, J)                      1
7               3       (C, E) (B, M)               1
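For checking, a short script that reproduces the bucket table with the slide's hash function, (10*i + j) mod 8 on the item numbers:

from collections import Counter
from itertools import combinations

baskets = {
    100: ["B", "C", "E", "J"],
    200: ["B", "C", "J"],
    300: ["B", "M", "Y"],
    400: ["B", "J", "M"],
    500: ["C", "J", "M"],
}
num = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}
s = 3

buckets = Counter()
for items in baskets.values():
    for i, j in combinations(sorted(items, key=num.get), 2):
        buckets[(10 * num[i] + num[j]) % 8] += 1

bitmap = [int(buckets[b] >= s) for b in range(8)]
print(buckets)   # bucket 0 -> 5; buckets 5, 6, 7 -> 3; others below s
print(bitmap)    # [1, 0, 0, 0, 0, 1, 1, 1]

Running it confirms the table, and it answers the earlier questions: bucket 5 reaches the threshold, yet neither (B, E) (support 1) nor (J, M) (support 2) is a frequent pair, and Pass 1 cannot tell, since it keeps only the bucket total.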