Unit 3 Data Mining

Association rule mining is a data mining technique that identifies relationships between variables in large datasets, focusing on transactional databases. Key concepts include support, confidence, and frequent itemsets, with algorithms like Apriori and improved naive methods used to discover these patterns efficiently. Techniques such as TID hashing, direct hashing, and dynamic itemset counting enhance the efficiency of mining frequent itemsets without generating candidates explicitly.


1. ASSOCIATION RULE MINING

● Association rule mining is a data mining technique used to discover interesting relationships between variables in large datasets.
● It primarily focuses on finding associations or correlations among items in
transactional databases.
● The task involves identifying patterns such as "if X, then Y," where X and Y are sets
of items.
● These patterns are often expressed in the form of association rules, which describe the
relationships between different items based on their co-occurrence in transactions.
Basics of Association Rule Mining:

● Support: The support of an itemset is the proportion of transactions in the database that contain that itemset. It indicates the frequency of occurrence of the itemset.
● Confidence: The confidence of a rule X ➞ Y is the proportion of transactions
containing X that also contain Y. It measures the reliability of the rule.
● Frequent Itemsets: Itemsets whose support is above a specified threshold are
considered frequent. They form the basis for generating association rules.
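
For example, support and confidence can be computed directly from these definitions with a few lines of Python (the small transaction list and function names below are only illustrative):

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Support of (antecedent union consequent) divided by support of the antecedent.
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"bread", "milk"}, transactions))        # 0.6
print(confidence({"bread"}, {"milk"}, transactions))   # 0.75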
The Task:

● The primary task in association rule mining is to find all frequent itemsets and
generate association rules from them. This involves:
● Identifying frequent itemsets using a minimum support threshold.

● Generating association rules from these frequent itemsets with a minimum confidence
threshold.
● Evaluating and interpreting the discovered rules for their significance and usefulness.

2. NAIVE ALGORITHM:

● The naive algorithm for association rule mining involves two main steps:

● Find Frequent Itemsets: Scan the database to count the support of each itemset. Prune
itemsets with support below the minimum threshold.
● Generate Rules: For each frequent itemset, generate association rules and calculate
their confidence. Prune rules with confidence below the minimum threshold.
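
A minimal Python sketch of these two steps might look like the following (a brute-force illustration only; it enumerates every possible itemset, which is exactly why the naive approach does not scale):

from itertools import combinations

def naive_frequent_itemsets(transactions, min_support):
    # Step 1: count the support of every possible itemset and keep the frequent ones.
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                frequent[frozenset(candidate)] = count / n
    return frequent

def naive_rules(frequent_itemsets, min_confidence):
    # Step 2: for each frequent itemset, test every non-empty proper subset as an antecedent.
    rules = []
    for itemset, supp in frequent_itemsets.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent_itemsets[antecedent]
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules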
Improved Naive Algorithm:

● An improved version of the naive algorithm can incorporate optimizations such as:
● Efficient Counting: Utilize techniques like direct hashing or vertical data format for
efficient counting of itemsets.
● Pruning Strategies: Apply pruning techniques like the Apriori property to reduce the
search space during candidate generation.
● Memory Management: Optimize memory usage with sparse data structures and
compression techniques.
● Parallelization: Implement parallel processing or distributed computing to accelerate
computation on large datasets.

3. APRIORI ALGORITHM

● The Apriori algorithm is a classical algorithm for frequent itemset mining and
association rule learning over transactional databases.
● It follows the principle of generating candidate itemsets and pruning them based on
the Apriori property.
First Part - Itemset Generation:

● Begin by identifying frequent individual items in the dataset (1-itemsets) by scanning the database once.
● Then, iteratively generate candidate itemsets of size k based on frequent (k-1)-
itemsets. This involves joining pairs of (k-1)-itemsets to form new candidate sets.
● After generating candidate itemsets, scan the database again to count the support of
each candidate itemset.
● Prune candidate itemsets that do not meet the minimum support threshold, as per the
Apriori property.
Second Part - Finding the Rules:

● After identifying all frequent itemsets, generate association rules from them.

● For each frequent itemset, generate all possible non-empty subsets (itemset
combinations).
● Calculate the confidence of each association rule by dividing the support of the
frequent itemset by the support of its antecedent (left-hand side of the rule).
● Prune rules that do not meet the minimum confidence threshold.
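
Putting the two parts together, a simplified Apriori sketch in Python could look like this (illustrative only, not an optimized implementation; the helper names are our own):

from itertools import combinations

def apriori(transactions, min_support, min_confidence):
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Part 1: frequent itemset generation, level by level.
    frequent = {}
    for item in set().union(*transactions):
        c = count(frozenset([item]))
        if c / n >= min_support:
            frequent[frozenset([item])] = c
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # Join step: merge (k-1)-itemsets whose union has exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must itself be frequent.
        candidates = {cand for cand in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(cand, k - 1))}
        frequent = {cand: count(cand) for cand in candidates
                    if count(cand) / n >= min_support}
        all_frequent.update(frequent)
        k += 1

    # Part 2: rule generation from the frequent itemsets.
    rules = []
    for itemset, itemset_count in all_frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = itemset_count / all_frequent[lhs]
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return all_frequent, rules

For instance, calling apriori on the five grocery transactions from the worked example below with min_support = 3/5 reproduces the frequent itemsets {bread}, {milk}, {butter} and {bread, milk}.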
IMPLEMENTATION ISSUES:
Implementing the Apriori algorithm involves addressing various challenges related to
transaction storage, efficient counting, and rule generation.
Transaction Storage:
● Efficiently storing transactional data is crucial for scalability.

● Options include using databases, in-memory data structures, or file systems.

● Considerations include memory usage, disk I/O, and query efficiency.


Efficient Counting:
● Counting the support of itemsets efficiently is essential.

● Techniques like hash tables, bitmap indices, or vertical data format can be used for
efficient counting.
● These methods optimize memory usage and reduce computation time.
Finding the Rules:
● Generating association rules involves finding all possible itemset combinations and
calculating their confidence.
● Optimizations include avoiding redundant rule generation and using efficient data
structures for rule storage.
● Pruning techniques help reduce the number of candidate rules to evaluate, improving
performance.

The following are the formulas used in the Apriori algorithm:

Support(X) = (Number of transactions containing X) / (Total number of transactions)
Confidence(X ➞ Y) = Support(X ∪ Y) / Support(X)
Example: Suppose we have a transaction database where each transaction contains items purchased by a customer. Apriori starts by finding frequent individual items, then pairs of items, triples, and so on, until no more frequent itemsets can be found.
Steps:
Consider a transaction database containing information about purchases made by customers
at a grocery store. Here are some sample transactions:
Transaction 1: {bread, milk, eggs}
Transaction 2: {bread, butter}
Transaction 3: {milk, butter}
Transaction 4: {bread, milk, butter}
Transaction 5: {bread, milk}
Let's apply the Apriori algorithm to find frequent itemsets.
Step 1: Find frequent individual items (itemsets of size 1):
Count the occurrences of each item:
bread: 4
milk: 4
eggs: 1
butter: 3
Set a minimum support threshold, let's say 3.
Since {eggs} does not meet the minimum support threshold, it is not considered frequent.
Frequent itemsets of size 1: {bread}, {milk}, {butter}

Step 2: Generate candidate itemsets of size 2:


Based on the frequent itemsets of size 1, generate all possible combinations of size 2.
Candidate itemsets of size 2: {bread, milk}, {bread, butter}, {milk, butter}

Step 3: Count occurrences of candidate itemsets of size 2:


Count occurrences of each candidate itemset in the transactions:
{bread, milk}: 3
{bread, butter}: 2
{milk, butter}: 2
Only {bread, milk} meets the minimum support threshold.

Step 4: Generate candidate itemsets of size 3:

Only one frequent 2-itemset, {bread, milk}, remains, so the only possible candidate of size 3 is {bread, milk, butter}. Note that the Apriori property already rules this candidate out, because its subsets {bread, butter} and {milk, butter} are not frequent.

Step 5: Count occurrences of candidate itemsets of size 3:

Count occurrences of {bread, milk, butter} in the transactions: 1
Since {bread, milk, butter} does not meet the minimum support threshold, there are no frequent itemsets of size 3, and the algorithm stops.
The frequent itemsets discovered using the Apriori algorithm are:

{bread}
{milk}
{butter}
{bread, milk}
These frequent itemsets represent combinations of items that are frequently bought together
by customers. The support count for each frequent itemset indicates how many transactions
contain those items. This information can be used to derive association rules, such as "if bread is purchased, then milk is also likely to be purchased" (confidence 3/4), which can provide valuable insights for marketing strategies or product placement in the store.
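
The hand-computed counts above can be double-checked with a few lines of Python using only the standard library:

from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk", "eggs"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]

# Count the support of every 1-, 2- and 3-itemset that actually occurs in the data.
counts = Counter()
for t in transactions:
    for k in (1, 2, 3):
        counts.update(combinations(sorted(t), k))

min_support = 3
frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
print(frequent)
# frequent itemsets: ('bread',): 4, ('milk',): 4, ('butter',): 3, ('bread', 'milk'): 3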
4. IMPROVING THE EFFICIENCY OF APRIORI ALGORITHM

● Improving the efficiency of the Apriori algorithm involves several straightforward strategies.
● Firstly, pruning is essential, which means removing infrequent itemsets early in the
process based on a minimum support threshold and the Apriori property.
● This reduces the search space significantly. Secondly, optimizing candidate
generation is crucial.
● This includes using dynamic counting, transforming the database into a vertical
format, and employing efficient join and prune techniques to generate candidate
itemsets quickly.
● Thirdly, efficient memory management techniques, such as sparse data structures and
transaction compression, help reduce memory usage.
● Additionally, sampling techniques like random or stratified sampling can reduce
dataset size without sacrificing representativeness.
● Preprocessing steps like data cleaning and dimensionality reduction further enhance
efficiency by streamlining the data.
● Finally, exploring alternative algorithms and implementing incremental mining
techniques can offer additional efficiency gains.
● By implementing these simple strategies, the Apriori algorithm can handle larger
datasets more efficiently in association rule mining tasks.
Apriori - TID

● TID hashing is a technique used to efficiently store and access transaction information
in association rule mining algorithms.
● Each transaction is assigned a unique ID (TID), and a hash table is constructed where
each entry corresponds to an item and contains a list of TIDs where that item appears.
● This technique allows for fast access to transactions containing specific items,
facilitating efficient candidate generation and pruning.
Let's explore how TID hashing can be applied in a retail dataset to efficiently mine frequent
itemsets:
Consider a retail dataset containing transaction information from a grocery store:

Transaction 1 (TID 1): {bread, milk, eggs}


Transaction 2 (TID 2): {bread, butter}
Transaction 3 (TID 3): {milk, butter}
Transaction 4 (TID 4): {bread, milk, butter}
Transaction 5 (TID 5): {bread, milk}

To implement TID hashing, we assign a unique ID (TID) to each transaction:

TID 1: {bread, milk, eggs}


TID 2: {bread, butter}
TID 3: {milk, butter}
TID 4: {bread, milk, butter}
TID 5: {bread, milk}
Now, let's construct a hash table where each entry corresponds to an item and contains a list
of TIDs where that item appears.
Hash Table:
Item | TIDs
-------------------
bread | 1, 2, 4, 5
milk | 1, 3, 4, 5
eggs | 1
butter | 2, 3, 4
Using this hash table, we can efficiently access transactions containing specific items. For
example, if we want to find transactions containing bread and milk together, we can directly
look up their corresponding TID lists in the hash table:
TIDs for bread: 1, 2, 4, 5
TIDs for milk: 1, 3, 4, 5
The intersection of these TID lists gives us the transactions containing both bread and milk:
TID 1, TID 4, and TID 5.

● This technique facilitates efficient candidate generation and pruning in association rule mining algorithms.
● For instance, when generating candidate itemsets of size 2, we can quickly identify
transactions containing the first item of the candidate itemset using TID hashing and
then check if the second item is present in those transactions. This helps reduce the
search space and improve the efficiency of mining frequent itemsets, especially in
large datasets.
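
A small Python sketch of the TID-list idea (this item-to-TID table is also called the vertical data format; the dictionary below mirrors the hash table shown above):

from collections import defaultdict

transactions = {
    1: {"bread", "milk", "eggs"},
    2: {"bread", "butter"},
    3: {"milk", "butter"},
    4: {"bread", "milk", "butter"},
    5: {"bread", "milk"},
}

# One pass over the database builds the item -> TID-set table.
tid_lists = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        tid_lists[item].add(tid)

# The support of an itemset is the size of the intersection of its TID lists.
bread_and_milk = tid_lists["bread"] & tid_lists["milk"]
print(sorted(bread_and_milk))   # [1, 4, 5]
print(len(bread_and_milk))      # support count = 3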

Direct Hashing and Pruning (DHP)


● Direct hashing and pruning are crucial techniques employed in the Apriori algorithm
to efficiently mine frequent itemsets from large transactional databases.
● Direct hashing involves the use of hash tables to directly count the occurrences of
candidate itemsets, bypassing the need for repeated scans of the entire database.
● By hashing itemsets to specific indices in the hash table, their counts can be
efficiently accessed and updated.
● This approach significantly improves the computational efficiency of counting
frequent itemsets, particularly for databases with a large number of transactions and
items.
● This not only improves computational efficiency but also enables the algorithm to
scale effectively to handle large datasets.
● In combination, direct hashing and pruning are fundamental techniques that make the
Apriori algorithm suitable for discovering associations in transactional data
efficiently.
Generate 1-Itemsets:
● Count occurrences of each item and hash the counts.
Item | Hashed Count
-------|------------
bread | 4
milk | 4
eggs | 1
butter | 3
Prune:
Remove items with less support than the threshold (3).
Remaining 1-itemsets: {bread, milk, butter}
Generate 2-Itemsets:

● Create candidate 2-itemsets from the frequent 1-itemsets.


Itemset | Hashed Count
----------------|-------------
{bread, milk} | 3
{bread, butter} | 2
{milk, butter} | 2

Prune:
Remove itemsets with less support than the threshold.
Remaining 2-itemsets: {bread, milk}
Generate 3-Itemsets:
● Create candidate 3-itemsets from the frequent 2-itemsets.

Since only one frequent 2-itemset ({bread, milk}) remains, no candidate 3-itemsets can be generated, and the process stops.


● These examples illustrate how the Apriori algorithm works with direct hashing for
counting support and pruning itemsets based on a minimum support threshold.
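
The bucket-hashing idea behind DHP can be sketched in Python as follows (the number of buckets and the hash function are arbitrary illustrative choices; because different pairs can share a bucket, bucket counts are upper bounds on the true supports, so pruning on them is safe):

from itertools import combinations

transactions = [
    {"bread", "milk", "eggs"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]
min_count = 3
NUM_BUCKETS = 7

def bucket(pair):
    # Map a 2-itemset to one of NUM_BUCKETS buckets; collisions are allowed.
    return hash(tuple(sorted(pair))) % NUM_BUCKETS

# Pass 1: count single items and, in the same scan, hash every 2-itemset into a bucket.
item_counts, bucket_counts = {}, [0] * NUM_BUCKETS
for t in transactions:
    for item in t:
        item_counts[item] = item_counts.get(item, 0) + 1
    for pair in combinations(t, 2):
        bucket_counts[bucket(pair)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= min_count}

# Candidate 2-itemsets: both items are frequent AND the pair's bucket count reaches the threshold.
candidates = [pair for pair in combinations(sorted(frequent_items), 2)
              if bucket_counts[bucket(pair)] >= min_count]
print(candidates)   # a superset of the truly frequent 2-itemsets, to be verified in pass 2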
Dynamic Itemset Counting(DIC)

● Dynamic itemset counting is a strategy used in association rule mining to dynamically update the count of itemsets as transactions are processed.
● Instead of storing the counts of all possible itemsets, only the counts of currently
frequent item sets are maintained and updated as new transactions are encountered.
● This approach helps reduce memory usage and computational overhead, especially in
scenarios with large transaction datasets.
● Suppose we have a transaction database containing information about purchases made
by customers at a grocery store:

Transaction 1: {bread, milk, eggs}


Transaction 2: {bread, butter}
Transaction 3: {milk, butter}
Transaction 4: {bread, milk, butter}
Transaction 5: {bread, milk}

● Initially, we start with an empty count table to keep track of the counts of frequent
itemsets:
Count Table:
Itemset | Count
---------------------
{bread} |
{milk} |
{butter} |
{bread, milk} |
Now, let's process each transaction one by one and dynamically update the count of frequent
itemsets.
Transaction 1: {bread, milk, eggs}
Increment counts for individual items: bread, milk
Increment counts for itemsets: {bread}, {milk}, {bread, milk}
Updated Count Table:
Count Table:
Itemset | Count
---------------------
{bread} |1
{milk} |1
{butter} |
{bread, milk} | 1
Transaction 2: {bread, butter}
Increment counts for individual items: bread, butter
Increment counts for itemsets: {bread}, {butter}
Updated Count Table:
Count Table:
Itemset | Count
---------------------
{bread} |2
{milk} |1
{butter} |1
{bread, milk} | 1
Transaction 3: {milk, butter}
Increment counts for individual items: milk, butter
Increment counts for itemsets: {milk}, {butter}
Count Table:
Itemset | Count
---------------------
{bread} |2
{milk} |2
{butter} |2
{bread, milk} | 1

Transaction 4: {bread, milk, butter}


Increment counts for individual items: bread, milk, butter
Increment counts for the tracked itemset: {bread, milk}
Count Table:
Itemset | Count
---------------------
{bread} |3
{milk} |3
{butter} |3
{bread, milk} | 2

Transaction 5: {bread, milk}


Increment counts for individual items: bread, milk
Increment counts for itemsets: {bread, milk}
Count Table:
Itemset | Count
---------------------
{bread} |4
{milk} |4
{butter} |3
{bread, milk} | 3

● After processing all transactions, the count table reflects the counts of frequent
itemsets.
● This dynamic approach avoids storing counts for all possible itemsets and instead
focuses on updating counts only for currently frequent itemsets, reducing memory
usage and computational overhead.
● If an itemset becomes infrequent, its count can be removed from memory to conserve
resources.
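
A simplified sketch of this counting scheme in Python (real DIC adds and removes itemsets at fixed intervals during the scan; here only the incremental updates for the itemsets tracked in the table above are shown):

transactions = [
    {"bread", "milk", "eggs"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]

# Itemsets currently being tracked, exactly as in the count table above.
tracked = [frozenset(s) for s in ({"bread"}, {"milk"}, {"butter"}, {"bread", "milk"})]
counts = {s: 0 for s in tracked}

# Process the transactions one at a time, updating only the tracked itemsets.
for t in transactions:
    for itemset in tracked:
        if itemset <= t:
            counts[itemset] += 1

for itemset, c in counts.items():
    print(set(itemset), c)
# {'bread'} 4, {'milk'} 4, {'butter'} 3, {'bread', 'milk'} 3 -- matches the final count table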
Mining Frequent Patterns without Candidate Generation:

● This approach aims to directly discover frequent itemsets without generating candidate itemsets explicitly, thereby reducing computational complexity and memory requirements.
● Techniques like FP-growth (Frequent Pattern growth) algorithm are commonly used
for mining frequent patterns without candidate generation.
● FP-growth builds a compact data structure called a frequent pattern tree (FP-tree) to
represent the transaction database efficiently and mines frequent itemsets from this
structure.
Consider the following transactions:
Transaction 1: {bread, milk, eggs}
Transaction 2: {bread, butter}
Transaction 3: {milk, butter}
Transaction 4: {bread, milk, butter}
Transaction 5: {bread, milk}
Step 1: Count the frequency of each item and sort them in descending order:
Item | Frequency
------------------
bread | 4
milk | 4
butter | 3
eggs | 1
Step 2: Construct the FP-tree:
Within each transaction, items are first ordered by descending frequency (bread, milk, butter, eggs) before insertion.
Start with an empty root node:
null
Process each transaction one by one:

Transaction 1: {bread, milk, eggs}
null
└── bread(1)
    └── milk(1)
        └── eggs(1)

Transaction 2: {bread, butter}
null
└── bread(2)
    ├── milk(1)
    │   └── eggs(1)
    └── butter(1)

Transaction 3: {milk, butter} (bread is not present, so a new branch starts from the root)
null
├── bread(2)
│   ├── milk(1)
│   │   └── eggs(1)
│   └── butter(1)
└── milk(1)
    └── butter(1)

Transaction 4: {bread, milk, butter}
null
├── bread(3)
│   ├── milk(2)
│   │   ├── eggs(1)
│   │   └── butter(1)
│   └── butter(1)
└── milk(1)
    └── butter(1)

Transaction 5: {bread, milk}
null
├── bread(4)
│   ├── milk(3)
│   │   ├── eggs(1)
│   │   └── butter(1)
│   └── butter(1)
└── milk(1)
    └── butter(1)

● The FP-tree has been constructed.

● Each node in the tree represents an item, and the number in parentheses denotes the
frequency count of that item.
● The edges represent the conditional relationship between items within transactions.

● This FP-tree can be used to efficiently mine frequent itemsets without candidate
generation.
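
The tree construction shown above can be reproduced with a short Python sketch (a bare-bones FP-tree builder for illustration; the header table and the mining step are omitted):

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}                      # item -> child Node

def build_fp_tree(transactions, min_count=1):
    # First scan: rank items by descending frequency (ties broken alphabetically).
    freq = Counter(item for t in transactions for item in t)
    ranked = [i for i in sorted(freq, key=lambda i: (-freq[i], i)) if freq[i] >= min_count]
    order = {item: r for r, item in enumerate(ranked)}

    # Second scan: insert each transaction along a shared prefix path.
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}({child.count})")
        show(child, depth + 1)

transactions = [
    {"bread", "milk", "eggs"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]
show(build_fp_tree(transactions))   # prints the same tree as constructed by hand above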
Step 3: Mine frequent itemsets from the FP-tree:

● Start with the least frequent item (eggs) and traverse the FP-tree to find frequent
itemsets.
● Begin with the node representing 'eggs' and traverse upwards to find paths.

● Since 'eggs' occurs only once, there are no frequent itemsets containing 'eggs'.

● Next, mine frequent itemsets containing 'butter':

● Start with the node representing 'butter'.

● Traverse upwards to find paths and form itemsets.

● Frequent itemsets containing 'butter':

{butter}: 3
{bread, butter}: 2
{milk, butter}: 2

● Then, mine frequent itemsets containing 'milk':

● Start with the node representing 'milk'.


● Traverse upwards to find paths and form itemsets.

● Frequent itemsets containing 'milk':

{milk}: 4
{bread, milk}: 3

● Lastly, mine frequent itemsets containing 'bread':

● Start with the node representing 'bread'.

● Traverse upwards to find paths and form itemsets.

● Frequent itemsets containing 'bread':

{bread}: 4

● Therefore, the frequent itemsets mined from the FP-tree are:

{bread}: 4
{milk}: 4
{butter}: 3
{bread, milk}: 3
{bread, butter}: 2
{milk, butter}: 2

● This example demonstrates how FP-growth efficiently mines frequent itemsets without generating candidate itemsets explicitly.
● By constructing the FP-tree and traversing it to find frequent patterns, the algorithm
significantly reduces computational complexity and memory requirements compared
to traditional candidate generation approaches like Apriori.
Performance evaluation of an algorithm

● Performance evaluation of an algorithm involves assessing its efficiency and effectiveness in terms of time complexity, space complexity, and accuracy.
● Here's how we can evaluate the performance of the FP-growth algorithm for mining
frequent itemsets:
Time Complexity:

● Time complexity refers to the amount of time taken by the algorithm to complete its
execution.
● The FP-growth algorithm scans the transaction database only twice to build the FP-tree, so tree construction grows roughly linearly with the number of transactions; the cost of the mining phase depends mainly on the number and length of the frequent patterns found.
● The algorithm's efficiency is often influenced by factors such as the size of the
dataset, the number of unique items, and the structure of the FP-tree.
● Evaluating the time taken by the algorithm to process varying sizes of datasets can
provide insights into its scalability.
Space Complexity:

● Space complexity refers to the amount of memory or space required by the algorithm
to execute.
● The FP-growth algorithm's memory usage is dominated by the FP-tree, whose number of nodes is at most the total number of items across all transactions and is usually much smaller because transactions share common prefixes.
● The space required primarily depends on the size of the FP-tree constructed from the
dataset.
● Assessing the memory usage of the algorithm for different dataset sizes helps
understand its memory efficiency and scalability.
Accuracy:

● Accuracy refers to the correctness of the results produced by the algorithm.

● In the context of FP-growth, accuracy is measured by the correctness of the frequent itemsets mined.
● We can evaluate accuracy by comparing the frequent itemsets obtained from FP-
growth with those obtained from other algorithms or with domain knowledge.
● Additionally, comparing the association rules generated from the frequent itemsets
with real-world observations can help assess the algorithm's practical utility.
Scalability:

● Scalability refers to how well the algorithm performs as the size of the dataset
increases.
● Evaluating the performance of FP-growth on large datasets helps determine its
scalability.
● It's essential to assess how the algorithm's time and space complexity scale with
increasing dataset sizes to ensure its practical applicability in real-world scenarios.
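
One simple way to carry out such an evaluation in Python is to time the run and record peak memory with the standard library; the mine function and generate_transactions below are hypothetical placeholders for whatever miner and data generator are being evaluated:

import time
import tracemalloc

def evaluate(mine, transactions, min_support):
    # Measure wall-clock time and peak memory of one mining run.
    tracemalloc.start()
    start = time.perf_counter()
    result = mine(transactions, min_support)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1024   # peak memory in KiB

# Hypothetical usage: rerun the same miner on growing datasets to study scalability.
# for size in (1_000, 10_000, 100_000):
#     data = generate_transactions(size)                  # placeholder data generator
#     _, secs, kib = evaluate(my_fpgrowth, data, 0.02)    # placeholder FP-growth miner
#     print(size, secs, kib)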
