Unit 3 Data Mining
1. ASSOCIATION RULE MINING
● The primary task in association rule mining is to find all frequent itemsets and
generate association rules from them. This involves:
● Identifying frequent itemsets using a minimum support threshold.
● Generating association rules from these frequent itemsets with a minimum confidence
threshold.
● Evaluating and interpreting the discovered rules for their significance and usefulness.
2. NAIVE ALGORITHM:
● The naive algorithm for association rule mining involves two main steps:
● Find Frequent Itemsets: Scan the database to count the support of each itemset. Prune
itemsets with support below the minimum threshold.
● Generate Rules: For each frequent itemset, generate association rules and calculate
their confidence. Prune rules with confidence below the minimum threshold.
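A minimal Python sketch of Step 1 (the brute-force itemset counting); the transactions and threshold are illustrative assumptions, and Step 2 (rule generation) is sketched under the Apriori section below:

```python
from itertools import combinations

def naive_frequent_itemsets(transactions, min_support):
    """Step 1 of the naive algorithm: enumerate every possible itemset,
    count its support with a full database scan, and prune infrequent ones.
    This is exponential in the number of distinct items, which is exactly
    why the improvements described next are needed."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:          # prune below the support threshold
                frequent[frozenset(candidate)] = count
    return frequent

# Toy basket data; the support threshold of 2 transactions is arbitrary.
transactions = [{"bread", "milk", "eggs"}, {"bread", "butter"},
                {"milk", "butter"}, {"bread", "milk", "butter"}]
print(naive_frequent_itemsets(transactions, min_support=2))
```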
Improved Naive Algorithm:
● An improved version of the naive algorithm can incorporate optimizations such as:
● Efficient Counting: Utilize techniques like direct hashing or the vertical data format for
efficient counting of itemsets (see the sketch after this list).
● Pruning Strategies: Apply pruning techniques like the Apriori property to reduce the
search space during candidate generation.
● Memory Management: Optimize memory usage with sparse data structures and
compression techniques.
● Parallelization: Implement parallel processing or distributed computing to accelerate
computation on large datasets.
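As a concrete illustration of the efficient-counting idea, the sketch below counts candidate 2-itemsets in a single pass over the database using a hash map keyed by the item pair; the dataset and threshold are assumptions made for illustration:

```python
from collections import defaultdict
from itertools import combinations

def count_pairs_hashed(transactions, min_support):
    """Count all candidate 2-itemsets in one database pass with a hash map,
    instead of re-scanning the database once per candidate."""
    pair_counts = defaultdict(int)
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            pair_counts[pair] += 1
    # Keep only the pairs that meet the minimum support threshold.
    return {pair: c for pair, c in pair_counts.items() if c >= min_support}

transactions = [{"bread", "milk", "eggs"}, {"bread", "butter"},
                {"milk", "butter"}, {"bread", "milk", "butter"}]
print(count_pairs_hashed(transactions, min_support=2))
```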
3. APRIORI ALGORITHM
● The Apriori algorithm is a classical algorithm for frequent itemset mining and
association rule learning over transactional databases.
● It follows the principle of generating candidate itemsets and pruning them based on
the Apriori property.
First Part - Frequent Itemset Generation:
● Generate candidate k-itemsets by joining frequent (k-1)-itemsets, prune candidates that
contain any infrequent subset (the Apriori property), and count the support of the
surviving candidates with a pass over the database.
● Repeat for increasing k until no new frequent itemsets are found.
Second Part - Rule Generation:
● After identifying all frequent itemsets, generate association rules from them.
● For each frequent itemset, generate all possible non-empty subsets (itemset
combinations).
● Calculate the confidence of each association rule by dividing the support of the
frequent itemset by the support of its antecedent (left-hand side of the rule).
● Prune rules that do not meet the minimum confidence threshold.
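A minimal sketch of this rule-generation step, assuming the frequent itemsets and their support counts have already been computed (the example counts at the bottom are hypothetical):

```python
from itertools import combinations

def generate_rules(frequent_itemsets, min_confidence):
    """Generate association rules from frequent itemsets.
    `frequent_itemsets` maps each frequent itemset (frozenset) to its support
    count; by the Apriori property every subset of a frequent itemset is also
    present, so antecedent supports can always be looked up."""
    rules = []
    for itemset, itemset_support in frequent_itemsets.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset can serve as the antecedent (LHS).
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                confidence = itemset_support / frequent_itemsets[lhs]
                if confidence >= min_confidence:      # prune weak rules
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return rules

# Hypothetical support counts, for illustration only.
freq = {frozenset({"bread"}): 3, frozenset({"milk"}): 3,
        frozenset({"bread", "milk"}): 2}
print(generate_rules(freq, min_confidence=0.6))
```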
IMPLEMENTATION ISSUES:
Implementing the Apriori algorithm involves addressing various challenges related to
transaction storage, efficient counting, and rule generation.
Transaction Storage:
● Efficiently storing transactional data is crucial for scalability.
● Techniques like hash tables, bitmap indices, or the vertical data format can be used for
efficient counting (a bitmap-based sketch follows this list).
● These methods optimize memory usage and reduce computation time.
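One possible illustration of bitmap-based storage, using Python integers as arbitrary-length bitmaps; the transactions are the sample baskets used later in this unit:

```python
def build_bitmaps(transactions):
    """Represent each item's occurrences as a bitmap: bit i is set when the
    item appears in transaction i."""
    bitmaps = {}
    for tid, t in enumerate(transactions):
        for item in t:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps

transactions = [{"bread", "milk", "eggs"}, {"bread", "butter"},
                {"milk", "butter"}, {"bread", "milk", "butter"}]
bitmaps = build_bitmaps(transactions)

# Support of an itemset = number of set bits in the AND of its item bitmaps.
support_bread_milk = bin(bitmaps["bread"] & bitmaps["milk"]).count("1")
print(support_bread_milk)  # 2
```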
Finding the Rules:
● Generating association rules involves finding all possible itemset combinations and
calculating their confidence.
● Optimizations include avoiding redundant rule generation and using efficient data
structures for rule storage.
● Pruning techniques help reduce the number of candidate rules to evaluate, improving
performance.
For example, suppose mining a grocery store's transactions yields the following frequent itemsets:
{bread}
{milk}
{butter}
{bread, milk}
These frequent itemsets represent combinations of items that are frequently bought together
by customers. The support count of each frequent itemset indicates how many transactions
contain those items. This information can be used to derive association rules, such as "if
bread is purchased, then milk is also likely to be purchased" (from the frequent itemset
{bread, milk}), which can provide valuable insights for marketing strategies or product
placement in the store.
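A short worked example of support and confidence, using the four sample transactions that appear later in this unit; the specific rule and the numbers are purely illustrative:

```python
# Worked example using the four transactions listed later in these notes.
transactions = [{"bread", "milk", "eggs"}, {"bread", "butter"},
                {"milk", "butter"}, {"bread", "milk", "butter"}]

support_bread      = sum(1 for t in transactions if {"bread"} <= t)           # 3
support_bread_milk = sum(1 for t in transactions if {"bread", "milk"} <= t)   # 2

# Confidence of "bread -> milk" = support({bread, milk}) / support({bread}).
confidence = support_bread_milk / support_bread
print(confidence)  # 0.666... -> about 67% of baskets with bread also contain milk
```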
4. IMPROVING THE EFFICIENCY OF APRIORI ALGORITHM
● TID hashing is a technique used to efficiently store and access transaction information
in association rule mining algorithms.
● Each transaction is assigned a unique ID (TID), and a hash table is constructed where
each entry corresponds to an item and contains a list of TIDs where that item appears.
● This technique allows for fast access to transactions containing specific items,
facilitating efficient candidate generation and pruning.
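A minimal sketch of such a TID index, assuming TIDs are assigned sequentially; the sample transactions are the ones used later in this unit:

```python
from collections import defaultdict

def build_tid_index(transactions):
    """Map each item to the set of TIDs of the transactions containing it."""
    tid_index = defaultdict(set)
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            tid_index[item].add(tid)
    return tid_index

transactions = [{"bread", "milk", "eggs"}, {"bread", "butter"},
                {"milk", "butter"}, {"bread", "milk", "butter"}]
tids = build_tid_index(transactions)

# The support of a candidate itemset is the size of the intersection of the
# TID sets of its items -- no rescan of the raw transactions is needed.
print(len(tids["bread"] & tids["butter"]))  # 2
```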
Let's explore how TID hashing can be applied in a retail dataset to efficiently mine frequent
itemsets:
Consider a retail dataset containing transaction information from a grocery store:
Prune:
Remove candidate 2-itemsets whose support is below the minimum threshold.
Remaining frequent 2-itemsets: {bread, butter}
Generate 3-Itemsets:
● Create candidate 3-itemsets from the frequent 2-itemsets.
Itemset               | Hashed Count
----------------------|-------------
{bread, butter, milk} | 1
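A sketch of the candidate-generation step itself, using a hypothetical set of frequent 2-itemsets (richer than the single pair in the example above, so the join has something to combine); the final line applies the Apriori property to prune candidates:

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets to form candidate k-itemsets, then prune
    any candidate that has an infrequent (k-1)-subset (the Apriori property)."""
    frequent_prev = set(frequent_prev)
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune: every (k-1)-subset of a surviving candidate must be frequent.
    return {c for c in candidates
            if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))}

frequent_2 = {frozenset({"bread", "butter"}), frozenset({"bread", "milk"}),
              frozenset({"milk", "butter"})}
print(generate_candidates(frequent_2, 3))  # the single candidate {bread, butter, milk}
```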
Dynamic Itemset Counting:
● Initially, we start with an empty count table to keep track of the counts of the
itemsets being monitored:
Count Table:
Itemset       | Count
--------------|------
{bread}       |
{milk}        |
{butter}      |
{bread, milk} |
Now, let's process each transaction one by one and dynamically update the count of frequent
itemsets.
Transaction 1: {bread, milk, eggs}
Increment counts for individual items: bread, milk
Increment counts for itemsets: {bread}, {milk}, {bread, milk}
Updated Count Table:
Itemset       | Count
--------------|------
{bread}       | 1
{milk}        | 1
{butter}      |
{bread, milk} | 1
Transaction 2: {bread, butter}
Increment counts for individual items: bread, butter
Increment counts for itemsets: {bread}, {butter}
Updated Count Table:
Itemset       | Count
--------------|------
{bread}       | 2
{milk}        | 1
{butter}      | 1
{bread, milk} | 1
Transaction 3: {milk, butter}
Increment counts for individual items: milk, butter
Increment counts for itemsets: {milk}, {butter}
Updated Count Table:
Itemset       | Count
--------------|------
{bread}       | 2
{milk}        | 2
{butter}      | 2
{bread, milk} | 1
● After processing all transactions, the count table reflects the counts of frequent
itemsets.
● This dynamic approach avoids storing counts for all possible itemsets and instead
focuses on updating counts only for currently frequent itemsets, reducing memory
usage and computational overhead.
● If an itemset becomes infrequent, its count can be removed from memory to conserve
resources.
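A minimal sketch of such dynamic counting; the monitored itemsets, the threshold, and the early-drop rule (an itemset is discarded as soon as it can no longer reach the minimum support) are illustrative assumptions:

```python
def dynamic_counts(transactions, monitored_itemsets, min_support):
    """Process transactions one at a time, keeping counts only for the
    itemsets currently being monitored. An itemset is dropped as soon as it
    cannot reach the minimum support even if it appeared in every remaining
    transaction, freeing memory early."""
    counts = {itemset: 0 for itemset in monitored_itemsets}
    total = len(transactions)
    for i, t in enumerate(transactions, start=1):
        remaining = total - i
        for itemset in list(counts):
            if itemset <= t:
                counts[itemset] += 1
            if counts[itemset] + remaining < min_support:
                del counts[itemset]        # hopeless itemset, stop tracking it
    return {s: c for s, c in counts.items() if c >= min_support}

transactions = [{"bread", "milk", "eggs"}, {"bread", "butter"},
                {"milk", "butter"}, {"bread", "milk", "butter"}]
monitored = [frozenset({"bread"}), frozenset({"milk"}), frozenset({"butter"}),
             frozenset({"bread", "milk"})]
print(dynamic_counts(transactions, monitored, min_support=2))
```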
Mining Frequent Patterns without Candidate Generation (FP-Growth):
Step 1: Construct the FP-tree by inserting the transactions one by one.
After Transactions 1 and 2 ({bread, milk, eggs} and {bread, butter}):
null
|
bread(2)
| \
milk(1) butter(1)
|
eggs(1)
Transaction 3: {milk, butter}
null
|
bread(2)
| \
milk(2) butter(2)
|
eggs(1)
Transaction 4: {bread, milk, butter}
null
|
bread(3)
| \
milk(3) butter(3)
|
eggs(1)
● Each node in the tree represents an item, and the number in parentheses denotes the
frequency count of that item.
● The edges represent the conditional relationship between items within transactions.
● This FP-tree can be used to efficiently mine frequent itemsets without candidate
generation.
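A minimal sketch of building an FP-tree in two database scans. Note that this sketch drops infrequent items (here 'eggs') during the first scan and keeps one branch per distinct prefix, whereas the diagrams above are simplified illustrations and also omit the header-table links a full implementation maintains:

```python
from collections import Counter

class FPNode:
    """One node of an FP-tree: an item, its count, its parent, and its children."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First database scan: count item frequencies and drop infrequent items.
    freq = Counter(item for t in transactions for item in t)
    freq = {item: c for item, c in freq.items() if c >= min_support}
    root = FPNode()  # the "null" root shown in the diagrams above
    # Second scan: insert each transaction with its frequent items ordered
    # by descending overall frequency (ties broken alphabetically).
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root

# The four transactions used throughout this unit.
transactions = [{"bread", "milk", "eggs"}, {"bread", "butter"},
                {"milk", "butter"}, {"bread", "milk", "butter"}]
tree = build_fp_tree(transactions, min_support=2)
print({item: child.count for item, child in tree.children.items()})  # {'bread': 3, 'butter': 1}
```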
Step 2: Mine frequent itemsets from the FP-tree:
● Start with the least frequent item (eggs) and traverse the FP-tree to find frequent
itemsets.
● Begin with the node representing 'eggs' and traverse upwards to find paths.
● Since 'eggs' occurs only once (below the minimum support), there are no frequent itemsets containing 'eggs'.
● Continuing with the remaining items yields the following frequent itemsets and their support counts:
{bread}: 4
{milk}: 4
{butter}: 3
{bread, milk}: 3
{bread, butter}: 2
Time Complexity:
● Time complexity refers to the amount of time taken by the algorithm to complete its
execution.
● The FP-growth algorithm needs only two scans of the transaction database to build the
FP-tree, so construction cost grows roughly linearly with the number of transactions n;
the subsequent mining cost depends on the size and structure of the FP-tree and is
usually much lower than Apriori's repeated candidate generation.
● The algorithm's efficiency is often influenced by factors such as the size of the
dataset, the number of unique items, and the structure of the FP-tree.
● Evaluating the time taken by the algorithm to process varying sizes of datasets can
provide insights into its scalability.
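A rough way to check this empirically is to time the miner on synthetic datasets of increasing size; `fp_growth` below is a placeholder name for whatever implementation is being evaluated, not a specific library function:

```python
import random
import time

def time_mining(mine, sizes, n_items=50, basket_size=8, min_support_ratio=0.02):
    """Rough scalability check: time any frequent-itemset miner that accepts
    (transactions, min_support) on synthetic baskets of growing size."""
    for n in sizes:
        transactions = [set(random.sample(range(n_items), basket_size)) for _ in range(n)]
        start = time.perf_counter()
        mine(transactions, min_support_ratio * n)   # absolute support threshold
        print(f"{n:>7} transactions: {time.perf_counter() - start:.3f} s")

# `fp_growth` stands for the implementation under evaluation, e.g. the
# FP-tree construction sketched earlier plus a mining step:
# time_mining(fp_growth, sizes=[1_000, 10_000, 100_000])
```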
Space Complexity:
● Space complexity refers to the amount of memory or space required by the algorithm
to execute.
● The FP-growth algorithm usually has a space complexity of O(m), where m is the number
of nodes in the FP-tree (shared prefixes typically keep m well below the total number
of item occurrences).
● The space required primarily depends on the size of the FP-tree constructed from the
dataset.
● Assessing the memory usage of the algorithm for different dataset sizes helps
understand its memory efficiency and scalability.
Scalability:
● Scalability refers to how well the algorithm performs as the size of the dataset
increases.
● Evaluating the performance of FP-growth on large datasets helps determine its
scalability.
● It's essential to assess how the algorithm's time and space complexity scale with
increasing dataset sizes to ensure its practical applicability in real-world scenarios.