Data Warehouse and Mining 3


1. Can we design a method that mines the complete set of frequent itemsets without candidate generation? Explain with an example.

FP Growth Algorithm in Data Mining

In Data Mining, finding frequent patterns in large databases is very important and has been studied on a
large scale in the past few years. Unfortunately, this task is computationally expensive, especially when
many patterns exist.

The FP-Growth Algorithm was proposed by Han. It is an efficient and scalable method for mining the
complete set of frequent patterns by pattern-fragment growth, using an extended prefix-tree structure
for storing compressed and crucial information about frequent patterns, named the frequent-pattern tree
(FP-tree). In his study, Han showed that his method outperforms other popular methods for mining
frequent patterns, e.g. the Apriori Algorithm and TreeProjection. Later works showed that FP-Growth
also performs better than other methods, including Eclat and Relim. The popularity and efficiency of the
FP-Growth Algorithm have led to many studies that propose variations to improve its performance.

What is FP Growth Algorithm?

The FP-Growth Algorithm is an alternative way to find frequent itemsets without using candidate
generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core
of this method is the use of a special data structure named frequent-pattern tree (FP-tree), which
retains the itemset association information.

This algorithm works as follows:

First, it compresses the input database creating an FP-tree instance to represent frequent items.

After this first step, it divides the compressed database into a set of conditional databases, each
associated with one frequent pattern.

Finally, each such database is mined separately.

Using this strategy, FP-Growth reduces the search cost by recursively looking for short patterns and
then concatenating them into longer frequent patterns.
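For orientation, this whole pipeline can be exercised with an off-the-shelf implementation. The sketch below is a minimal example that assumes the third-party mlxtend library is installed; the transactions are invented purely for illustration.

```python
# Minimal FP-Growth run using the mlxtend library (assumed to be installed).
# The transactions below are illustrative only.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["milk", "butter"],
    ["bread", "butter"],
]

# Encode the transactions as a boolean item matrix (one column per item).
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine every frequent itemset with support >= 50%; no candidate sets are generated.
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent)
```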

For large databases, holding the entire FP-tree in main memory may be impossible. A strategy to cope with this
problem is to partition the database into a set of smaller databases (called projected databases) and
then construct an FP-tree from each of these smaller databases.

FP-Tree

The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information
about frequent patterns in a database. Each transaction is read and then mapped onto a path in the FP-
tree. This is done until all transactions have been read. Different transactions with common subsets
allow the tree to remain compact because their paths overlap.

A frequent-pattern tree is built from the initial itemsets of the database. The purpose of the FP-tree is to
mine the most frequent patterns. Each node of the FP-tree represents an item of an itemset.

The root node represents null, while the lower nodes represent the itemsets. The associations of the
nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while
forming the tree.

Han defines the FP-tree as the tree structure given below:

One root is labelled as “null” with a set of item-prefix subtrees as children and a frequent-item-header
table.

Each node in the item-prefix subtree consists of three fields:

Item-name: registers which item is represented by the node;

Count: the number of transactions represented by the portion of the path reaching the node;

Node-link: links to the next node in the FP-tree carrying the same item name or null if there is none.

Each entry in the frequent-item-header table consists of two fields:

Item-name: the same as in the node;

Head of node-link: a pointer to the first node in the FP-tree carrying the item name.

Additionally, the frequent-item-header table can also hold the support count for each item. The diagram
below illustrates the best-case scenario, which occurs when all transactions contain the same itemset;
in that case the FP-tree is only a single branch of nodes.

The worst-case scenario occurs when every transaction has a unique item set. So the space needed to
store the tree is greater than the space used to store the original data set because the FP-tree requires
additional space to store pointers between nodes and the counters for each item. The diagram below
shows how a worst-case scenario FP-tree might appear. As you can see, the tree’s complexity grows
with each transaction’s uniqueness.
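The node and header-table fields described above map naturally onto a small data structure. The sketch below is one possible layout, with class and field names chosen for illustration; it is not Han's reference implementation.

```python
# Illustrative layout of an FP-tree node and header table, mirroring the fields
# described above (item-name, count, node-link) plus parent/children pointers.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FPNode:
    item: Optional[str]                   # item-name; None for the root ("null") node
    count: int = 0                        # transactions represented by this path prefix
    parent: Optional["FPNode"] = None     # pointer back towards the root
    node_link: Optional["FPNode"] = None  # next node carrying the same item name
    children: Dict[str, "FPNode"] = field(default_factory=dict)

# The header table maps each frequent item to its support count and the head
# of its node-link chain (filled in while the tree is built).
header_table: Dict[str, dict] = {
    # "I2": {"count": 5, "head": <first FPNode carrying I2>},
}
```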
Algorithm by Han

The original algorithm to construct the FP-Tree defined by Han is given below:

Algorithm 1: FP-tree construction

Input: A transaction database DB and a minimum support threshold ξ.

Output: FP-tree, the frequent-pattern tree of DB.

Method: The FP-tree is constructed as follows.

The first step is to scan the database to find the occurrences of the itemsets in the database. This step is
the same as the first step of Apriori. The count of 1-itemsets in the database is called support count or
frequency of 1-itemset.

The second step is to construct the FP tree. For this, create the root of the tree. The root is represented
by null.

The next step is to scan the database again and examine the transactions. Examine the first transaction
and find out the items in it. The item with the maximum count is taken at the top, followed by the next
item with a lower count, and so on. It means that the branch of the tree is constructed with the
transaction's items in descending order of count.

The next transaction in the database is then examined. Its items are ordered in descending order of
count. If any prefix of this transaction is already present in another branch, the transaction's branch
will share that common prefix starting from the root.

This means that the common items are linked to a new node created for the remaining items of this transaction.

Also, the count of each item node is adjusted as it occurs in the transactions: the counts of the common
nodes are incremented by 1, and new nodes are created with a count of 1, as the transactions are linked into the tree.

The next step is to mine the created FP-tree. For this, the lowest node is examined first, along with the
node-links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From it,
traverse the path(s) in the FP-tree; this path or set of paths is called the conditional pattern base.

A conditional pattern base is a sub-database consisting of the prefix paths in the FP-tree that occur with the
lowest node (the suffix).

Construct a conditional FP-tree from the counts of the items in these prefix paths. Only the items meeting the
threshold support are kept in the conditional FP-tree.

Frequent Patterns are generated from the Conditional FP Tree.

Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and sorts
the set of frequent items, and the second constructs the FP-Tree.
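The two scans can be sketched in a few lines of code. The function below is an illustrative implementation of the construction step only (mining is not shown); unlike the worked example that follows, it drops infrequent items such as I5 before insertion, which is the standard formulation.

```python
# Sketch of FP-tree construction in two database scans: scan 1 counts item
# supports, scan 2 inserts each transaction (frequent items only, sorted by
# descending support) as a path that shares common prefixes.
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Scan 1: support count of every item (the 1-itemsets).
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i for i, c in support.items() if c >= min_sup}

    # Scan 2: insert each transaction, keeping only frequent items,
    # ordered by descending support count.
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:                 # start a new branch
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1                  # shared prefix: just increment the count
            node = child
    return root, support

# The transactions used in the example that follows, with min_sup = 3.
db = [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
      ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"]]
tree, counts = build_fp_tree(db, min_sup=3)
```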

Example

Support threshold=50%, Confidence= 60%

Table 1:

Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

Solution: Support threshold=50% => 0.5*6= 3 => min_sup=3

Table 2: Count of each item

Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2

Table 3: Items sorted in descending order of count (I5 is excluded because its count of 2 is below min_sup = 3).

Item   Count
I2     5
I1     4
I3     4
I4     4

Build FP Tree

Let's build the FP tree in the following steps:

Consider the root node null.

The first scan of transaction T1: I1, I2, I3 contributes three nodes {I2:1}, {I1:1}, {I3:1}, where I2 is
linked as a child of the root, I1 is linked to I2, and I3 is linked to I1.

T2: I2, I3, I4. In sorted order this is I2, I3, I4, where I2 is linked to the root, I3 is linked to I2, and
I4 is linked to I3. This branch shares the I2 node with T1, since I2 is already in the tree.

Increment the count of I2 by 1, link I3 as a new child of I2, and link I4 as a child of I3. The counts
are {I2:2}, {I3:1}, {I4:1}.

T3: I4, I5. Similarly, a new branch is created in which I4 is linked to the root and I5 is linked to I4 as its child.

T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node. Hence it will be
incremented by 1. Similarly I1 will be incremented by 1 as it is already linked with I2 in T1, thus {I2:3},
{I1:2}, {I4:1}.

T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.

T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.

Mining of FP-tree is summarized below:

The lowest node item, I5, is not considered as it does not meet the minimum support count. Hence it is deleted.

The next lower node is I4. I4 occurs in four branches of the tree, with prefix paths {I2,I1,I3:1},
{I2,I1:1}, {I2,I3:1} and an empty prefix (from the branch that starts at I4). Considering I4 as the
suffix, these prefix paths form the conditional pattern base.

The conditional pattern base is treated as a transaction database, and a conditional FP-tree is constructed
from it. It contains only {I2:3}; I1 and I3 are not considered as they do not meet the minimum support count.

This path generates the frequent pattern {I2,I4:3}.


For I3, the prefix paths are {I2,I1:3} and {I2:1}. These generate a 2-node conditional FP-tree {I2:4, I1:3},
and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.

For I1, the prefix path would be: {I2:4} this will generate a single node FP-tree: {I2:4} and frequent
patterns are generated: {I2, I1:4}.

Item   Conditional Pattern Base              Conditional FP-tree   Frequent Patterns Generated
I4     {I2,I1,I3:1}, {I2,I1:1}, {I2,I3:1}    {I2:3}                {I2,I4:3}
I3     {I2,I1:3}, {I2:1}                     {I2:4, I1:3}          {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}
I1     {I2:4}                                {I2:4}                {I2,I1:4}
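The mining result can be cross-checked by brute force. The sketch below simply counts the support of every combination of the four frequent items over the Table 1 transactions; it is a verification aid, not part of the FP-Growth algorithm.

```python
# Brute-force cross-check: count the support of every itemset over the four
# frequent items in the Table 1 transactions (min support count = 3).
from itertools import combinations

table1 = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
          {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
items, min_sup = ["I1", "I2", "I3", "I4"], 3

for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        count = sum(1 for t in table1 if set(combo) <= t)
        if count >= min_sup:
            print(set(combo), count)
# Frequent itemsets found: {I1}:4, {I2}:5, {I3}:4, {I4}:4, {I1,I2}:4, {I1,I3}:3,
# {I2,I3}:4, {I2,I4}:3 and {I1,I2,I3}:3 -- in agreement with the table above.
```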

The diagram given below depicts the conditional FP-tree associated with each conditional node. When the
FP-tree contains a single prefix path, the complete set of frequent patterns can be generated in three
parts: the single prefix path P, the multipath Q, and their combinations. The resulting patterns for the
single prefix path are the enumerations of its subpaths that have the minimum support.

After that, the multipath Q is processed and its resulting patterns are generated. Finally, the combined
results are returned as the frequent patterns found.

2). Explain in detail multilevel association rules.

The approaches to mining multilevel association rules are based on the support-confidence framework.
A top-down strategy is employed, where counts are accumulated for the calculation of frequent
itemsets at each concept level, starting at concept level 1 and working down towards the lower, more
specific concept levels until no more frequent itemsets can be found, using the Apriori algorithm at each level.

Data can be generalized by replacing low-level concepts within the data with their higher-level concepts,
or ancestors, from a concept hierarchy. A concept hierarchy is represented as a tree whose root D denotes
the task-relevant data.

The most popular area of application for multilevel association is market basket analysis, which studies
the buying habits of customers by searching for sets of items that are frequently purchased together;
the items themselves are organized in a concept hierarchy.

Each node of the hierarchy indicates an item or itemset that has been examined. There are various approaches
for finding frequent itemsets at any level of abstraction. Some of the methods in use are: using uniform
minimum support for all levels, using reduced minimum support at lower levels, and level-by-level
independent search.

Multi-level databases need a hierarchy-data encoded transaction table rather than the initial transaction
table. This is useful when we are interested in only a portion of the transaction database such as food,
instead of all the items. This way we can first collect the relevant set of data and then work repeatedly
on the task-relevant set. Thus in the transaction table, each item is encoded as a sequence of digits.
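The digit encoding can be illustrated with a tiny sketch. The codes and transactions below are invented; the only point is that a prefix of the code identifies an ancestor at a higher concept level.

```python
# Illustrative hierarchy-data encoding: each item is a string of digits, one
# digit per concept level, e.g. '1' = food (level 1), '11' = food/milk (level 2),
# '112' = food/milk/2% (level 3). All values here are hypothetical.
transactions = [
    {"TID": "T100", "items": ["111", "121", "211", "221"]},
    {"TID": "T200", "items": ["111", "211", "222", "323"]},
]

def generalize(item_code: str, level: int) -> str:
    """Replace an encoded item by its ancestor at the given concept level."""
    return item_code[:level]

# View transaction T100 at concept level 1 (top-level categories only).
print(sorted({generalize(i, 1) for i in transactions[0]["items"]}))  # ['1', '2']
```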

Using uniform minimum support for all levels – When a uniform minimum support threshold is used, the
search procedure is simplified. An optimization technique can be adopted based on the knowledge that
an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item
whose ancestor does not have minimum support.

The main drawback of the uniform support approach is that items at lower levels of abstraction are unlikely
to occur as frequently as those at higher levels of abstraction, so a threshold that is reasonable at the
top level may miss interesting associations at the lower levels.

Using reduced minimum support at lower levels – Each level of abstraction has its own minimum support
threshold. The lower the abstraction level, the smaller the corresponding threshold. The search strategies
for mining multilevel associations with reduced support are as follows –

Level-by-level independent – This is a full-breadth search in which no background knowledge of frequent
itemsets is used for pruning. Each node is examined regardless of whether its parent node is found to be frequent.

Level-cross filtering by a single item – An item at the ith level is examined if and only if its parent node
at the (i-1)th level is frequent.

Level-cross filtering by k-itemset – A k-itemset at the ith level is examined if and only if its corresponding
parent k-itemset at the (i-1)th level is frequent.
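The three strategies differ only in which nodes they bother to examine. Below is a rough sketch of level-cross filtering by a single item with reduced, per-level thresholds; the items, hierarchy and counts are invented for illustration.

```python
# Sketch of reduced per-level minimum support with level-cross filtering by a
# single item: an item at level i is examined only if its parent at level (i-1)
# is frequent. All numbers are hypothetical.
min_sup_per_level = {1: 5, 2: 3}      # the lower, more specific level gets a smaller threshold
support = {
    "computer": 10, "printer": 4,                   # level-1 items
    "laptop computer": 6, "desktop computer": 4,    # level-2 items
    "b/w printer": 1, "color printer": 2,
}
parent = {"laptop computer": "computer", "desktop computer": "computer",
          "b/w printer": "printer", "color printer": "printer"}

frequent_l1 = {i for i in ("computer", "printer") if support[i] >= min_sup_per_level[1]}
frequent_l2 = {i for i, p in parent.items()
               if p in frequent_l1 and support[i] >= min_sup_per_level[2]}
print(frequent_l1, frequent_l2)   # {'computer'} and its two frequent children
```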

3). Explain the Apriori algorithm for finding frequent itemsets with an example.

Apriori Algorithm

Prerequisite – Frequent Item set in Data set (Association Rule Mining)

The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset
for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of
frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets
are used to find (k+1)-itemsets.

To improve the efficiency of level-wise generation of frequent itemsets, an important property is used
called Apriori property which helps by reducing the search space.

Apriori Property –

All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the
anti-monotonicity of the support measure: Apriori assumes that if an itemset is infrequent, then all of its
supersets must also be infrequent.


Before we start working through the algorithm, recall the definitions of support count and confidence for
itemsets and rules.

Consider the following dataset and we will find frequent itemsets and generate association rules for
them.

Minimum support count is 2

Minimum confidence is 60%

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support=2;
if the support count of a candidate item is less than min_support, remove it). This gives us the itemset L1.
Step-2: K=2

Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1
is that the itemsets should have (K-2) elements in common.

Check whether all subsets of each itemset are frequent; if not, remove that itemset. (For example, the subsets
of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)

Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate (C2) support counts with the minimum support count (here min_support=2; if the
support count of a candidate itemset is less than min_support, remove it). This gives us the itemset L2.
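The join condition used in Steps 2 and 3 can be written as a small helper. The function below is an illustrative implementation of the Lk-1 join, not taken from any library.

```python
# Illustrative Apriori join step: two (k-1)-itemsets are joined only if they
# agree on their first (k-2) items; the union is a candidate k-itemset.
from itertools import combinations

def apriori_join(prev_frequent, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    candidates = set()
    prev = [tuple(sorted(s)) for s in prev_frequent]
    for a, b in combinations(prev, 2):
        if a[:k - 2] == b[:k - 2]:                 # first k-2 elements must match
            candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates

L1 = [("I1",), ("I2",), ("I3",), ("I4",), ("I5",)]
print(sorted(apriori_join(L1, 2)))   # C2: every pair of frequent 1-items
```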

Step-3:

Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets
should have (K-2) elements in common, so for L2 the first element should match.

The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}
and {I2, I3, I5}.

Check whether all subsets of these itemsets are frequent; if not, remove that itemset. (Here the subsets of
{I1, I2, I3} are {I1, I2}, {I2, I3} and {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4}
is not frequent, so remove it. Similarly check every itemset.)

Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate (C3) support counts with the minimum support count (here min_support=2; if the
support count of a candidate itemset is less than min_support, remove it). This gives us the itemset L3.


Step-4:

Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that they
should have (K-2) elements in common, so for L3 the first two elements (items) should match.

Check whether all subsets of these itemsets are frequent (here the itemset formed by joining L3 is
{I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent). So no itemset is added to C4.

We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes
into the picture. For that, we need to calculate the confidence of each rule.
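Confidence can be computed directly from support counts: conf(A => B) = support(A ∪ B) / support(A). The sketch below assumes the classic nine-transaction dataset that is consistent with the candidates listed above (the original table is not reproduced in this document), and keeps only rules meeting the 60% confidence threshold.

```python
# Illustrative generation of strong association rules from one frequent itemset.
# The nine transactions below are an assumption consistent with the example
# (min support count = 2, min confidence = 60%).
from itertools import combinations

db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]

def support_count(itemset):
    return sum(1 for t in db if itemset <= t)

def strong_rules(frequent_itemset, min_conf=0.6):
    """Generate all rules A -> B from one frequent itemset, keep the strong ones."""
    items = set(frequent_itemset)
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            a = set(antecedent)
            conf = support_count(items) / support_count(a)
            if conf >= min_conf:
                rules.append((a, items - a, round(conf, 2)))
    return rules

for rule in strong_rules({"I1", "I2", "I5"}):
    print(rule)   # e.g. ({'I5'}, {'I1', 'I2'}, 1.0)
```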

4a). What are the advantages of the FP-Growth algorithm?

As discussed above, FP-Growth mines the complete set of frequent patterns without candidate generation,
requires only two scans of the database, compresses the data into a compact FP-tree, and applies a
divide-and-conquer strategy over small conditional databases, which generally makes it faster and more
scalable than Apriori.


4b). Discuss the applications of association analysis.

Association rule learning is a type of unsupervised learning method that tests for the dependence of one
data element on another data element and maps the data accordingly so that it can be used more effectively.
It tries to discover interesting relations or associations among the variables of the dataset, and it relies
on a set of rules to identify interesting relationships between variables in the database.

Association rule learning is an important technique of machine learning, and it is employed in market basket
analysis, web usage mining, continuous production, etc. In market basket analysis, it is widely used by
several big retailers to find the relations among items.

Association rules were originally derived from point-of-sale data that represent which products are purchased
together. Although its roots are in linking point-of-sale transactions, association rule mining can be used
outside the retail market to find relationships among other types of "baskets."

There are various applications of association rules, as follows –

Items purchased on a credit card, such as rental cars and hotel rooms, give insight into the next products
that the customer is likely to buy.

Optional services purchased by telecommunication users (call waiting, call forwarding, DSL, speed dialling,
etc.) help decide how to bundle these services to maximize revenue.

Banking services used by retail customers (money market accounts, CDs, investment services, car loans, etc.)
help identify customers likely to need other services.

Unusual combinations of insurance claims can be an indication of fraud and can trigger further investigation.

Medical patient histories can give indications of likely complications based on certain combinations of
treatments.

Association rules sometimes fail to live up to expectations. For instance, they are not the best method for
producing cross-selling models in markets such as retail banking, because the rules end up describing
previous marketing promotions. Also, in retail banking, customers frequently start with a checking account
and only later add a savings account; differentiation among products does not occur until customers hold the
higher-end products.

The Apriori algorithm uses frequent itemsets to create association rules. It is designed to work on databases
that contain transactions, and it uses a breadth-first search and a hash tree to compute the itemsets
efficiently. It is generally used for market basket analysis, helping to understand which products are likely
to be purchased together, and it is also used in the healthcare space to discover adverse drug reactions for
patients.

The Eclat algorithm, whose name stands for Equivalence Class Transformation, uses a depth-first search to
discover frequent itemsets in a transaction database. It usually runs faster than the Apriori algorithm.

5. What are the various constraints in constraint-based association rule mining? Explain.

A data mining procedure can uncover thousands of rules from a given set of data, most of which end up being
irrelevant or uninteresting to the users. Users often have a good sense of which "direction" of mining may
lead to interesting patterns and of the "form" of the patterns or rules they would like to discover.

Therefore, a good heuristic is to have the users specify such intuition or expectations as constraints that
restrict the search space. This strategy is called constraint-based mining.

Constraint-based algorithms use constraints to reduce the search space in the frequent itemset generation
step (the association rule generation step is the same as that of exhaustive algorithms).

The most general constraint is the minimum support threshold. If a constraint can be exploited, its inclusion
in the mining phase can significantly reduce the exploration space, because it defines a boundary inside the
search-space lattice beyond which exploration is not needed.

The importance of constraints is clear – they yield only association rules that are interesting to users.
The method is quite straightforward, and the rule space is reduced so that the remaining rules satisfy the
constraints.

Constraint-based clustering, similarly, discovers clusters that satisfy user-specified preferences or
constraints; depending on the characteristics of the constraints, constraint-based clustering can adopt
rather different approaches.

The constraints can include the following –

Knowledge type constraints – These define the type of knowledge to be mined, including
association or correlation.

Data constraints – These define the set of task-relevant data.

Dimension/level constraints – These define the desired dimensions (or attributes) of the data, or levels of
the concept hierarchies, to be used in mining.

Interestingness constraints – These define thresholds on numerical measures of rule interestingness,
including support, confidence, and correlation.

Rule constraints – These define the form of rules to be mined. Such constraints can be expressed as metarules
(rule templates), as the maximum or minimum number of predicates that can appear in the rule antecedent or
consequent, or as relationships between attributes, attribute values, and/or aggregates.
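How such constraints prune the output can be sketched as a simple filter over candidate rules. The rule records, attribute names and thresholds below are invented; real constraint-based miners push these checks into the mining phase itself rather than filtering afterwards.

```python
# Illustrative rule constraints applied as filters over candidate rules: a
# metarule-like template (the antecedent must mention 'age'), a bound on the
# number of predicates, and a minimum interestingness (confidence) threshold.
candidate_rules = [
    {"antecedent": {"age=30-39", "income=42K-48K"}, "consequent": {"buys=computer"},
     "support": 0.02, "confidence": 0.72},
    {"antecedent": {"student=yes"}, "consequent": {"buys=software"},
     "support": 0.01, "confidence": 0.40},
]

def satisfies_constraints(rule, max_predicates=3, min_conf=0.6, required_attr="age"):
    mentions_attr = any(p.startswith(required_attr + "=") for p in rule["antecedent"])
    short_enough = len(rule["antecedent"]) + len(rule["consequent"]) <= max_predicates
    interesting = rule["confidence"] >= min_conf
    return mentions_attr and short_enough and interesting

print([r for r in candidate_rules if satisfies_constraints(r)])   # keeps only the first rule
```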

These constraints can be specified using a high-level declarative data mining query language and user
interface. This form of constraint-based mining allows users to describe the rules that they would like to
uncover, thereby making the data mining process more effective.

Furthermore, a sophisticated mining query optimizer can be used to exploit the constraints specified by the
user, thereby making the mining process more efficient. Constraint-based mining encourages interactive
exploratory mining and analysis.

6. What is attribute-oriented induction? Describe how it is implemented.

AOI stands for Attribute-Oriented Induction. The attribute-oriented induction approach to concept
description was first proposed in 1989, a few years before the introduction of the data cube approach.
The data cube approach is essentially based on materialized views of the data, which typically have been
pre-computed in a data warehouse.

In general, the data cube approach performs off-line aggregation before an OLAP or data mining query is
submitted for processing. In contrast, the attribute-oriented induction approach is generally a
query-oriented, generalization-based, on-line data analysis method.

The general idea of attribute-oriented induction is to first collect the task-relevant data
using a database query and then perform generalization based on the examination of the
number of distinct values of each attribute in the relevant collection of data.

The generalization is implemented by attribute removal or attribute generalization.


Aggregation is implemented by merging identical generalized tuples and accumulating their counts. This
reduces the size of the generalized data set. The resulting generalized relation can be mapped into several
forms for presentation to the user, such as charts or rules.
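The removal, generalization and aggregation steps can be mimicked with a few pandas operations. The table, concept hierarchy and attribute names below are invented for illustration; this is a sketch of the idea, not a full AOI implementation.

```python
# Sketch of attribute-oriented induction: remove an attribute with no useful
# generalization ('name'), generalize 'city' to 'country' via a concept
# hierarchy, then merge identical generalized tuples while accumulating counts.
import pandas as pd

task_relevant = pd.DataFrame({
    "name": ["Ann", "Bob", "Raj", "Mei"],
    "city": ["Vancouver", "Toronto", "Mumbai", "Beijing"],
    "gpa":  ["excellent", "good", "excellent", "good"],
})
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Mumbai": "India", "Beijing": "China"}

generalized = (task_relevant
               .drop(columns=["name"])                        # attribute removal
               .assign(country=lambda d: d["city"].map(city_to_country))
               .drop(columns=["city"])                        # attribute generalization
               .groupby(["country", "gpa"], as_index=False)   # merge identical tuples
               .size()                                        # accumulate their counts
               .rename(columns={"size": "count"}))
print(generalized)
```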

The process of attribute-oriented induction is as follows –

First, data focusing must be performed before attribute-oriented induction. This step corresponds to the
specification of the task-relevant data (i.e., the data for analysis). The data are collected based on the
information provided in the data mining query.

Because a data mining query is usually relevant to only a portion of the database, selecting the relevant set
of data not only makes mining more efficient, but also yields more meaningful results than mining the whole
database.

Specifying the set of relevant attributes (i.e., the attributes for mining, as indicated in DMQL with the
"in relevance to" clause) may be difficult for the user. A user may select only a few attributes that seem
important, while missing others that could also play a role in the description.

For example, suppose that the dimension birth_place is defined by the attributes city, province_or_state,
and country. To allow generalization on the birth_place dimension, the other attributes defining this
dimension should also be included.

In other words, having the system automatically include province_or_state and country as relevant attributes
enables city to be generalized to these higher conceptual levels during the induction process.

At the other extreme, suppose that the user may have introduced too many attributes by
specifying all of the possible attributes with the clause “in relevance to *”. In this case, all of
the attributes in the relation specified by the from clause would be included in the analysis.

Some attributes are unlikely to contribute to an interesting representation. A correlation-based or
entropy-based analysis method can be used to perform attribute relevance analysis and filter out
statistically irrelevant or weakly relevant attributes from the descriptive mining process.

7. Define rule-based classification. Give an example.

IF-THEN Rules

A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following form –

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes

THEN buys_computer = yes

Points to remember –

The IF part of the rule is called rule antecedent or precondition.

The THEN part of the rule is called rule consequent.

The antecedent part (the condition) consists of one or more attribute tests, and these tests are
logically ANDed.

The consequent part consists of class prediction.

Note – We can also write rule R1 as follows –

R1: (age = youth) ^ (student = yes) => (buys_computer = yes)

If the condition holds true for a given tuple, then the antecedent is satisfied.
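Rule R1 can be written down directly as a small predicate over a tuple; the sketch below is purely illustrative, with the tuple fields chosen to match the rule.

```python
# Illustrative encoding of rule R1 and its application to one tuple: the tuple
# satisfies the antecedent, so the rule predicts buys_computer = yes.
def r1(tup):
    # IF age = youth AND student = yes THEN buys_computer = yes
    if tup.get("age") == "youth" and tup.get("student") == "yes":
        return {"buys_computer": "yes"}
    return None   # the rule does not cover (is not triggered by) this tuple

x = {"age": "youth", "student": "yes", "income": "medium"}
print(r1(x))   # {'buys_computer': 'yes'}
```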

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree.
Points to remember –

To extract a rule from a decision tree –

One rule is created for each path from the root to the leaf node.

To form a rule antecedent, each splitting criterion is logically ANDed.

The leaf node holds the class prediction, forming the rule consequent.
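This extraction can be automated for a fitted tree. The sketch below assumes scikit-learn is available and uses a toy, hand-encoded dataset; it walks the tree structure and prints one IF-THEN rule per root-to-leaf path.

```python
# Sketch: extract one IF-THEN rule per root-to-leaf path of a fitted
# scikit-learn decision tree (toy data; the feature encoding is illustrative).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy encoding: age_youth and student are 1 = yes, 0 = no; label: buys_computer.
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array(["yes", "no", "no", "no"])
feature_names = ["age_youth", "student"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
tree = clf.tree_

def extract_rules(node=0, conditions=()):
    if tree.children_left[node] == tree.children_right[node]:      # leaf node
        label = clf.classes_[np.argmax(tree.value[node])]
        print("IF " + " AND ".join(conditions or ("TRUE",)) +
              f" THEN buys_computer = {label}")
        return
    name, thr = feature_names[tree.feature[node]], tree.threshold[node]
    extract_rules(tree.children_left[node],  conditions + (f"{name} <= {thr:.1f}",))
    extract_rules(tree.children_right[node], conditions + (f"{name} > {thr:.1f}",))

extract_rules()   # each splitting criterion along a path is ANDed into one rule
```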

Rule Induction Using Sequential Covering Algorithm

The sequential covering algorithm can be used to extract IF-THEN rules from the training data.
We do not need to generate a decision tree first. In this algorithm, each rule for a given
class covers many of the tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules
are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the
process continues for the remaining tuples. This is in contrast to decision-tree induction, where the path
to each leaf corresponds to a rule, so an entire set of rules is effectively learned simultaneously.
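A highly simplified version of that loop is sketched below: learn_one_rule here just greedily picks the single attribute test with the best accuracy on the remaining tuples, whereas real systems such as AQ, CN2 and RIPPER grow and prune far more elaborate rules. The data and accuracy threshold are invented.

```python
# Simplified sequential covering: learn one rule at a time for the target class,
# remove the tuples it covers, and repeat until no positive tuples remain.
def learn_one_rule(rows, target):
    best = None
    for attr in rows[0]:
        if attr == "class":
            continue
        for val in {r[attr] for r in rows}:
            covered = [r for r in rows if r[attr] == val]
            acc = sum(r["class"] == target for r in covered) / len(covered)
            if best is None or acc > best[2]:
                best = ((attr, val), covered, acc)
    return best

def sequential_covering(rows, target, min_acc=0.8):
    rules, remaining = [], list(rows)
    while any(r["class"] == target for r in remaining):
        rule, covered, acc = learn_one_rule(remaining, target)
        if acc < min_acc:
            break                                               # no acceptable rule left
        rules.append((rule, target, round(acc, 2)))
        remaining = [r for r in remaining if r not in covered]  # drop covered tuples
    return rules

data = [{"age": "youth",  "student": "yes", "class": "yes"},
        {"age": "youth",  "student": "no",  "class": "no"},
        {"age": "senior", "student": "yes", "class": "yes"},
        {"age": "senior", "student": "no",  "class": "no"}]
print(sequential_covering(data, target="yes"))   # [(('student', 'yes'), 'yes', 1.0)]
```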
